Building a tidy data frame

Introduction

As of v0.2, the package includes functionality to convert the collected JSON files into several data frame formats. To use these features, we recommend the following workflow.

First, build your query with the build_query function.

require(academictwitteR)
require(tibble)
#> Loading required package: tibble
my_query <- build_query(c("#ichbinhanna", "#ichwarhanna"), place = "Berlin")
my_query
#> [1] "(#ichbinhanna OR #ichwarhanna) place:Berlin"
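To make the structure of that query string explicit, here is a minimal sketch that reassembles it with base R only. It is purely illustrative: it assumes nothing about how build_query works internally, and the variable names (hashtags, place, or_clause, manual_query) are made up for this example.

```r
# Illustrative only: rebuild the query string shown above by hand.
# This is NOT the package's implementation of build_query.
hashtags <- c("#ichbinhanna", "#ichwarhanna")
place <- "Berlin"

# Alternative search terms are joined with OR and wrapped in parentheses.
or_clause <- paste0("(", paste(hashtags, collapse = " OR "), ")")

# The place filter is appended as a separate operator after a space.
manual_query <- paste(or_clause, paste0("place:", place))
manual_query
#> [1] "(#ichbinhanna OR #ichwarhanna) place:Berlin"
```

In practice, build_query is the better choice; the sketch only shows what the resulting string is made of.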

Then, use the get_all_tweets function to collect data. Make sure to specify data_path and to set bind_tweets to FALSE, so that the tweets are stored as JSON files in data_path.

get_all_tweets(
  query = my_query,
  start_tweets = "2021-06-01T00:00:00Z",
  end_tweets = "2021-06-20T00:00:00Z",
  n = Inf,
  data_path = "tweetdata",
  bind_tweets = FALSE
)

The first format is the so-called “vanilla” format, which is what bind_tweets returns when no output_format is given. This vanilla format is the direct output from jsonlite::read_json. Simple columns such as text are directly available, but some columns such as retweet_count are nested inside other columns (here, inside public_metrics).

To extract user information, it is additionally necessary to set user = TRUE. Please also note that the data frame returned in this format is not a tibble, so we first convert it to one.

bind_tweets(data_path = "tweetdata") %>% as_tibble
#> ================================================================================
#> # A tibble: 25 x 14
#>    public_metrics$retwee… $reply_count $like_count $quote_count conversation_id 
#>                     <int>        <int>       <int>        <int> <chr>           
#>  1                     48            0           0            0 140600740518034…
#>  2                      4            0           0            0 140561738640589…
#>  3                      4            0           0            0 140561604799096…
#>  4                      4            0           0            0 140561505555576…
#>  5                      4            0           0            0 140561306496840…
#>  6                      4            0          35            0 140561072402663…
#>  7                      0            0           1            0 140539303355899…
#>  8                      0            0           1            0 140480875185768…
#>  9                     20            1         113            2 140444092988126…
#> 10                      0            0           0            0 140439345742735…
#> # … with 15 more rows, and 12 more variables: author_id <chr>,
#> #   entities <df[,3]>, text <chr>, lang <chr>, created_at <chr>, id <chr>,
#> #   possibly_sensitive <lgl>, referenced_tweets <list>, source <chr>,
#> #   geo <df[,1]>, context_annotations <list>, in_reply_to_user_id <chr>
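The nesting in the vanilla format can be handled with base R. The sketch below uses a small hand-made data frame (not real API output) that mimics the public_metrics column shown above, and demonstrates two ways of reaching the nested values. The toy values and the object names tweets and flat are assumptions for illustration.

```r
# Toy stand-in for the vanilla format: a data frame whose public_metrics
# column is itself a data frame. The values are invented.
tweets <- data.frame(
  id = c("1", "2"),
  text = c("first tweet", "second tweet"),
  stringsAsFactors = FALSE
)
tweets$public_metrics <- data.frame(
  retweet_count = c(48L, 4L),
  like_count = c(0L, 35L)
)

# Option 1: reach into the nested data frame directly.
tweets$public_metrics$retweet_count
#> [1] 48  4

# Option 2: bind the nested columns next to the top-level ones.
flat <- cbind(tweets[c("id", "text")], tweets$public_metrics)
names(flat)
#> [1] "id"            "text"          "retweet_count" "like_count"
```

If you need many nested columns at once, the “raw” and “tidy” formats described next are usually more convenient than flattening by hand.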

The second format is the “raw” format. It is a list of data frames containing all of the data extracted in the API call. Please note that not all of these data frames are in Codd's third normal form (3NF), i.e. some columns are still list-columns.

bind_tweets(data_path = "tweetdata", output_format = "raw") %>% names
#>  [1] "tweet.public_metrics.retweet_count"  "tweet.public_metrics.reply_count"   
#>  [3] "tweet.public_metrics.like_count"     "tweet.public_metrics.quote_count"   
#>  [5] "tweet.entities.mentions"             "tweet.entities.hashtags"            
#>  [7] "tweet.entities.urls"                 "tweet.geo.place_id"                 
#>  [9] "tweet.referenced_tweets"             "tweet.context_annotations"          
#> [11] "tweet.main"                          "user.public_metrics.followers_count"
#> [13] "user.public_metrics.following_count" "user.public_metrics.tweet_count"    
#> [15] "user.public_metrics.listed_count"    "user.entities.url"                  
#> [17] "user.entities.description"           "user.main"                          
#> [19] "sourcetweet.main"
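Because the raw format is a named list of data frames, the individual tables can be combined with an ordinary join. The following sketch uses invented toy data; the column names tweet_id and tag are assumptions made for this example, so check names() on your own tables before relying on them.

```r
# Toy stand-in for two elements of the raw list. The column names
# tweet_id and tag are ASSUMED here for illustration.
raw <- list(
  tweet.main = data.frame(
    tweet_id = c("1", "2"),
    text = c("hello #a #b", "no hashtags here"),
    stringsAsFactors = FALSE
  ),
  tweet.entities.hashtags = data.frame(
    tweet_id = c("1", "1"),
    tag = c("a", "b"),
    stringsAsFactors = FALSE
  )
)

# One row per hashtag, with the text of the tweet it came from.
hashtags_with_text <- merge(
  raw$tweet.entities.hashtags,
  raw$tweet.main,
  by = "tweet_id"
)
hashtags_with_text$tag
#> [1] "a" "b"
```

Tweets without any hashtag drop out of this inner join; pass all.y = TRUE to merge if you want to keep them.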

The third format is the “tidy” format. It is an “opinionated” format, which we believe to contain all essential columns for social media research. By default, it is a tibble.

bind_tweets(data_path = "tweetdata", output_format = "tidy")
#> # A tibble: 25 x 31
#>    tweet_id  user_username  text      conversation_id author_id lang  created_at
#>    <chr>     <chr>          <chr>     <chr>           <chr>     <chr> <chr>     
#>  1 14060074… Phardiga       "RT @Tob… 14060074051803… 58755490  de    2021-06-1…
#>  2 14056173… dorothee_goet… "RT @Tob… 14056173864058… 97759337… de    2021-06-1…
#>  3 14056160… dejools        "RT @Tob… 14056160479909… 13065071… de    2021-06-1…
#>  4 14056150… LenaOetzel     "RT @Tob… 14056150555557… 97897581… de    2021-06-1…
#>  5 14056130… jenniferhenke… "RT @Tob… 14056130649684… 114774406 de    2021-06-1…
#>  6 14056107… Tobias_Schulze "Ihr sei… 14056107240266… 47919307  de    2021-06-1…
#>  7 14053930… HTMIBerlin     "👩‍💻👩‍💻👩‍💻👩‍… 14053930335589… 94052353… und   2021-06-1…
#>  8 14048087… Tobias_Schulze ".@jsued… 14048087518576… 47919307  de    2021-06-1…
#>  9 14044409… ASattelmacher  "Okay al… 14044409298812… 11508518… de    2021-06-1…
#> 10 14043934… dr_john_aus_b  "#Ichbin… 14043934574273… 30635588… und   2021-06-1…
#> # … with 15 more rows, and 24 more variables: possibly_sensitive <lgl>,
#> #   source <chr>, in_reply_to_user_id <chr>, user_url <chr>,
#> #   user_verified <lgl>, user_name <chr>, user_protected <lgl>,
#> #   user_profile_image_url <chr>, user_description <chr>,
#> #   user_created_at <chr>, user_pinned_tweet_id <chr>, user_location <chr>,
#> #   retweet_count <int>, like_count <int>, quote_count <int>,
#> #   user_tweet_count <int>, user_list_count <int>, user_followers_count <int>,
#> #   user_following_count <int>, sourcetweet_type <chr>, sourcetweet_id <chr>,
#> #   sourcetweet_text <chr>, sourcetweet_lang <chr>, sourcetweet_author_id <chr>

It has the following features / caveats:

  1. It contains data about tweets, their authors, and “source tweets”, a.k.a. referenced tweets. Columns are named according to these three sources. The primary keys of these three sources are named tweet_id, author_id and sourcetweet_id respectively.
  2. By default, the text field of a retweet is truncated. However, the full text of the original tweet is available in sourcetweet_text.
  3. For a reply, the tweet being replied to is not included in sourcetweet_text. If you need that data, please trace it using the conversation_id.
  4. Much of the data that Twitter extracts from the text is not available in the tidy format, e.g. lists of hashtags, cashtags, urls, entities, context annotations etc. If you need those columns, please consider using the “raw” format above.
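
As an illustration of the second caveat, the sketch below builds a column that holds the untruncated text for retweets and the regular text otherwise. It runs on an invented toy data frame that only borrows the column names from the tidy output above. The value "retweeted" for sourcetweet_type is an assumption based on the Twitter API v2 referenced tweet types, and full_text is a made-up column name.

```r
# Toy stand-in for the tidy format; the rows are invented.
tidy <- data.frame(
  tweet_id = c("1", "2"),
  text = c("RT @someone: a long tweet that gets cut...", "An original tweet"),
  sourcetweet_type = c("retweeted", NA),
  sourcetweet_text = c("a long tweet that gets cut off when retweeted", NA),
  stringsAsFactors = FALSE
)

# ASSUMPTION: retweets are marked with sourcetweet_type == "retweeted".
is_retweet <- !is.na(tidy$sourcetweet_type) &
  tidy$sourcetweet_type == "retweeted"

# Use the source tweet's text for retweets, the tweet's own text otherwise.
tidy$full_text <- ifelse(is_retweet, tidy$sourcetweet_text, tidy$text)
tidy$full_text
#> [1] "a long tweet that gets cut off when retweeted"
#> [2] "An original tweet"
```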