Read and Visualize your Twitter Archive

Using R to read and visualize my Twitter archive data. Featuring {ggiraph}, {ggplot2}, {jsonlite}, {dplyr} and more…

R
Twitter
Personal Data
Author

Garrick Aden-Buie

Published

December 7, 2022

Keywords

rstats

Twitter finds itself in an… interesting… transition period. Whether or not you’re considering jumping ship to another service — you can find me lurking on Mastodon — you should download an archive of your Twitter data. Not only does the archive include all of your tweets, it also contains a variety of other interesting data about your account: who you followed and who followed you; the tweets you liked; the ads you were served; and much more.

This post, very much inspired by the awesome Observable notebook, Planning to leave Twitter?, shows you how to use R to read and explore your archive, using my own archive as an example.

Read on to learn how to read your Twitter archive into R and how to tidy your tweets. The second half of the post showcases a collection of plots about monthly tweet volume, popular tweets, the time of day when tweets were sent, and the app used to send the tweet.

I’ve also included a section on using rtweet to collect a full dataset about the tweets you’ve liked and another section about the advertising data in your Twitter archive.

Reading your Twitter archive

Get your Twitter data archive

First things first, you need to have your Twitter data archive. If you don’t have it yet, go to Settings and Privacy and click Download an archive of your data. After you submit the request, it takes about a day or so for an email to show up in your inbox.

@grrrck your Twitter data is ready

Your Twitter archive is ready for you to download and view using your desktop browser. Make sure you download it before Nov 12, 2022, 9:46:31 PM

The archive downloads as a zip file containing a standalone web page — called Your archive.html — for exploring your data. But the real archive lives in the included data/ folder as a bunch of .js files. I’ve copied that data/ directory into my working directory for this post.
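
If you’d rather do that step in R, here’s a minimal sketch using base R (the zip file name below is just a placeholder for whatever Twitter named your download):

# "twitter-archive.zip" is a placeholder name for your actual download
unzip("twitter-archive.zip", exdir = "twitter-archive")

# Copy the data/ directory into the working directory for this post
file.copy("twitter-archive/data", ".", recursive = TRUE)

# Peek at the .js data files we'll be reading
head(list.files("data", pattern = "\\.js$"))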

Setup

On the R side, we’ll need the usual suspects: tidyverse and glue.

library(tidyverse)
#> ── Attaching core tidyverse packages ───────────────────────
#>  dplyr     1.0.10      readr     2.1.3 
#>  forcats   0.5.2       stringr   1.5.0 
#>  ggplot2   3.4.0       tibble    3.1.8 
#>  lubridate 1.9.0       tidyr     1.2.1 
#>  purrr     1.0.1      
#> ── Conflicts ────────────────────── tidyverse_conflicts() ──
#>  dplyr::filter() masks stats::filter()
#>  dplyr::lag()    masks stats::lag()
#>  Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(glue)

(I’m using the dev version of tidyverse (1.3.2.9000), which loads lubridate automatically, and purrr 1.0, whose new list_rbind() makes an appearance below.)

To read in the data files, I’ll use jsonlite to read the archive JSON data, with a small assist from brio for fast file reading. I’m also going to have some fun with ggiraph for turning static ggplot2 plots into interactive plots.

Finally, reading the Twitter archive doesn’t require API access, but the API can be used to augment the data in the archive. The rtweet package is excellent for this, even though it takes a little effort to get set up.
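
If you haven’t set up rtweet before, the one-time authentication is roughly this (rtweet 1.0 or later; it walks you through authorizing in the browser):

library(rtweet)

# Interactive, one-time setup: authorize in the browser, and rtweet
# caches the token so later sessions can reuse it automatically
auth_setup_default()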

Read the manifest

The data/ folder is surprisingly well structured! There are two key files to help you find your way around the archive. First, the README.txt file explains the structure and layout of the files, and includes descriptions of the data contained in all of the files.

Here’s how the README describes the account.js data file:

account.js
- email: Email address currently associated with the account if an email address has been provided.
- createdVia: Client application used when the account was created. For example: “web” if the  account was created from a browser.
- username: The account’s current @username. Note that the @username may change but the account ID will remain the same for the lifetime of the account.
- accountId: Unique identifier for the account.
- createdAt: Date and time when the account was created.
- accountDisplayName: The account’s name as displayed on the profile.

The data/ folder also contains a manifest.js file that can be used to help read the data included in the archive. Let’s start by assuming this file is JSON and reading it in.

jsonlite::fromJSON("data/manifest.js")
#> Error in parse_con(txt, bigint_as_char): lexical error: invalid char in json text.
#>                                        window.__THAR_CONFIG = {   "use
#>                      (right here) ------^

Here we hit our first snag. The archive files are packaged as JSON, but they’re not strictly compliant JSON files; they include some JavaScript to assign JSON objects to the global namespace (called window in the browser). Here’s the data/manifest.js file as an example.

window.__THAR_CONFIG = {
  // ... data ...
}

If we read the file in as lines of text and then remove everything up to the first { (or sometimes [) on the first line, we can turn the data into valid JSON.

lines <- brio::read_lines("data/manifest.js")
lines[1] <- sub("^[^{[]+([{[])", "\\1", lines[1])
manifest <- jsonlite::fromJSON(lines)

This worked, but… jsonlite was designed for statistical work, so it transforms the data structure when reading in the JSON. For example, by default it converts arrays that look like JSON-ified data frames into actual data.frames.

manifest$dataTypes[1:2] |> str()
#> List of 2
#>  $ account          :List of 1
#>   ..$ files:'data.frame':    1 obs. of  3 variables:
#>   .. ..$ fileName  : chr "data/account.js"
#>   .. ..$ globalName: chr "YTD.account.part0"
#>   .. ..$ count     : chr "1"
#>  $ accountCreationIp:List of 1
#>   ..$ files:'data.frame':    1 obs. of  3 variables:
#>   .. ..$ fileName  : chr "data/account-creation-ip.js"
#>   .. ..$ globalName: chr "YTD.account_creation_ip.part0"
#>   .. ..$ count     : chr "1"

That’s often quite helpful! But I find it’s safer, when trying to generalize data reading, to disable the simplification and know for certain that the data structure matches the original JSON. For that reason, I tend to disable the matrix and data.frame simplifications and only allow jsonlite to simplify vectors.

Here’s a quick helper function that includes those setting changes and the first line substitution needed to read the archive JSON files.

read_archive_json <- function(path) {
  lines <- brio::read_lines(path)
  lines[1] <- sub("^[^{[]+([{[])", "\\1", lines[1])

  jsonlite::fromJSON(
    txt = lines,
    simplifyVector = TRUE,
    simplifyDataFrame = FALSE,
    simplifyMatrix = FALSE
  )
}

Now we’re ready to read the manifest again.

manifest <- read_archive_json("data/manifest.js")
names(manifest)
#> [1] "userInfo"    "archiveInfo" "readmeInfo"  "dataTypes"

The manifest file contains some information about the user and the archive,

str(manifest$userInfo)
#> List of 3
#>  $ accountId  : chr "47332433"
#>  $ userName   : chr "grrrck"
#>  $ displayName: chr "garrick aden-buie"

plus details about all of the various data included in the archive, like the data about my account.

str(manifest$dataTypes$account)
#> List of 1
#>  $ files:List of 1
#>   ..$ :List of 3
#>   .. ..$ fileName  : chr "data/account.js"
#>   .. ..$ globalName: chr "YTD.account.part0"
#>   .. ..$ count     : chr "1"

Each dataType in the manifest points us to a file (or files) in the archive and helpfully tells us how many records are included.
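
For example, here’s the tweet count pulled straight out of the manifest (note that the counts are stored as strings):

manifest$dataTypes$tweets$files[[1]]$count
#> [1] "6225"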

Here are the data files with the most records.

Code: Manifest, Top Records
manifest$dataTypes |>
  # All data types we can read have a "files" item
  keep(~ "files" %in% names(.x)) |>
  # We keep the files objects but still as a list of lists within a list
  map("files") |>
  # Turn the files into tibbles (list of tibbles within a list)
  map_depth(2, as_tibble) |>
  # Then combine the files tables for each item keeping track of the file index
  map(list_rbind, names_to = "index") |>
  # And finally combine files for all items
  list_rbind(names_to = "item") |>
  mutate(across(count, as.integer)) |>
  select(-globalName, -index) |>
  slice_max(count, n = 15) |>
  knitr::kable(
    format.args = list(big.mark = ","),
    table.attr = 'class="table"',
    format = "html"
  )
item fileName count
like data/like.js 11,773
follower data/follower.js 9,030
tweetHeaders data/tweet-headers.js 6,225
tweets data/tweets.js 6,225
ipAudit data/ip-audit.js 3,787
following data/following.js 1,519
contact data/contact.js 645
listsMember data/lists-member.js 254
block data/block.js 242
adImpressions data/ad-impressions.js 173
adEngagements data/ad-engagements.js 171
directMessageHeaders data/direct-message-headers.js 97
directMessages data/direct-messages.js 97
userLinkClicks data/user-link-clicks.js 67
connectedApplication data/connected-application.js 63

Reading the account data file

For a first example, let’s read the data/account.js archive file. We start by inspecting the manifest, where manifest$dataTypes$account tells us which files hold the account data and how many records are in each.

manifest$dataTypes$account |> str()
#> List of 1
#>  $ files:List of 1
#>   ..$ :List of 3
#>   .. ..$ fileName  : chr "data/account.js"
#>   .. ..$ globalName: chr "YTD.account.part0"
#>   .. ..$ count     : chr "1"

Here there’s only one file containing a single account record: data/account.js. Inside that file is a small bit of JavaScript. Like the manifest, it’s almost JSON, except that it assigns the JavaScript object to window.YTD.account.part0.

window.YTD.account.part0 = [
  {
    "account" : {
      "email" : "my-email@example.com",
      "createdVia" : "web",
      "username" : "grrrck",
      "accountId" : "47332433",
      "createdAt" : "2009-06-15T13:21:50.000Z",
      "accountDisplayName" : "garrick aden-buie"
    }
  }
]

And again, if we clean up the first line, this is valid JSON that we can read in directly with jsonlite.

account <- read_archive_json("data/account.js")
str(account)
#> List of 1
#>  $ :List of 1
#>   ..$ account:List of 6
#>   .. ..$ email             : chr "my-email@example.com"
#>   .. ..$ createdVia        : chr "web"
#>   .. ..$ username          : chr "grrrck"
#>   .. ..$ accountId         : chr "47332433"
#>   .. ..$ createdAt         : chr "2009-06-15T13:21:50.000Z"
#>   .. ..$ accountDisplayName: chr "garrick aden-buie"

This leads us to our first fun fact: I created my Twitter account on June 15, 2009, which means that I’ve been using Twitter (on and off) for 13.6 years. That’s 4,981 days of twittering!
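
Here’s a quick sketch of that arithmetic, using the account data we just read (the exact numbers drift as time passes):

created <- ymd_hms(account[[1]]$account$createdAt)

# Days since the account was created...
days_on_twitter <- as.numeric(difftime(now(tzone = "UTC"), created, units = "days"))
floor(days_on_twitter)

# ...and the same span in years
round(days_on_twitter / 365.25, 1)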

Read any archive item

Let’s generalize what we learned into a few helper functions we can reuse. I’ve placed everything into a single code block so that you can copy and paste it into your R session or script to use it right away.

#' Read the Twitter Archive JSON
#'
#' @param path Path to a Twitter archve `.js` file
read_archive_json <- function(path) {
  lines <- brio::read_lines(path)
  lines[1] <- sub("^[^{[]+([{[])", "\\1", lines[1])

  jsonlite::fromJSON(
    txt = lines,
    simplifyVector = TRUE,
    simplifyDataFrame = FALSE,
    simplifyMatrix = FALSE
  )
}

#' Read a Twitter archive data item
#'
#' @param manifest The list from `manifest.js`
#' @param item The name of an item in the manifest
read_twitter_data <- function(manifest, item) {
  manifest$dataTypes[[item]]$files |>
    purrr::transpose() |>
    purrr::pmap(\(fileName, ...) read_archive_json(fileName))
}

#' Simplify the data, if possible and easy
#'
#' @param x A list of lists as returned from `read_twitter_data()`
#' @param simplifier A function that's applied to each item in the
#'   list of lists and that can be used to simplify the output data.
simplify_twitter_data <- function(x, simplifier = identity) {
   x <- purrr::flatten(x)
   item_names <- x |> purrr::map(names) |> purrr::reduce(union)
   if (length(item_names) > 1) return(x)

   x |>
    purrr::map(item_names) |>
    purrr::map_dfr(simplifier)
}

Quick recap: to use the functions above, load your archive manifest with read_archive_json() and then pass it to read_twitter_data() along with an item name from the archive. If the data in the archive item is reasonably structured, you can call simplify_twitter_data() to get a tidy tibble¹.

manifest <- read_archive_json("data/manifest.js")
account <- read_twitter_data(manifest, "account")

simplify_twitter_data(account)
#> # A tibble: 1 × 6
#>   email              creat…¹ usern…² accou…³ creat…⁴ accou…⁵
#>   <chr>              <chr>   <chr>   <chr>   <chr>   <chr>  
#> 1 my-email@example.… web     grrrck  473324… 2009-0… garric…
#> # … with abbreviated variable names ¹​createdVia, ²​username,
#> #   ³​accountId, ⁴​createdAt, ⁵​accountDisplayName

Example: my followers

Let’s use this on another archive item to find the earliest Twitter adopters among my followers.

# These tables are wide, you may need to scroll to see the preview
options(width = 120)

followers <-
  read_twitter_data(manifest, "follower") |>
  simplify_twitter_data()

Then we can arrange the rows of followers by accountId as a proxy for date of account creation.

early_followers <-
  followers |>
  arrange(as.numeric(accountId)) |>
  slice_head(n = 11)

# Top 11 earliest followers
early_followers
#> # A tibble: 11 × 2
#>    accountId userLink                                      
#>    <chr>     <chr>                                         
#>  1 1496      https://twitter.com/intent/user?user_id=1496  
#>  2 11309     https://twitter.com/intent/user?user_id=11309 
#>  3 37193     https://twitter.com/intent/user?user_id=37193 
#>  4 716213    https://twitter.com/intent/user?user_id=716213
#>  5 741803    https://twitter.com/intent/user?user_id=741803
#>  6 755726    https://twitter.com/intent/user?user_id=755726
#>  7 774234    https://twitter.com/intent/user?user_id=774234
#>  8 787219    https://twitter.com/intent/user?user_id=787219
#>  9 799574    https://twitter.com/intent/user?user_id=799574
#> 10 860921    https://twitter.com/intent/user?user_id=860921
#> 11 944231    https://twitter.com/intent/user?user_id=944231

As you can see, some parts of the Twitter archive include the barest minimum amount of data. Thankfully, we can still use rtweet to gather additional data about these users. I’m looking at a small subset of my 9,030 followers here, but you might want to do this for all your followers and save the collected user data in your archive.
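
That full version might look something like the sketch below, where retryonratelimit = TRUE tells rtweet to wait out the API’s rate-limit windows and the output path is just my choice:

# Look up every follower (slow!) and save the result for later
all_follower_accounts <- rtweet::lookup_users(
  followers$accountId,
  retryonratelimit = TRUE
)
write_rds(all_follower_accounts, "data/followers.rds")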

early_followers_accounts <-
  early_followers |>
  pull(accountId) |>
  rtweet::lookup_users()

early_followers_accounts |>
  select(id, name, screen_name, created_at, followers_count, description)
#> # A tibble: 11 × 6
#>        id name                                    screen_name    created_at          followers_count description        
#>     <int> <chr>                                   <chr>          <dttm>                        <int> <chr>              
#>  1   1496 Aelfrick                                Aelfrick       2006-07-16 14:44:05              25 ""                 
#>  2  11309 Aaron Khoo                              aklw           2006-11-02 07:14:47             240 "I am a weapon of …
#>  3  37193 Rob                                     coleman        2006-12-02 11:54:15             654 "data science / la…
#>  4 716213 Tim Dennis                              jt14den        2007-01-27 16:54:12             971 "Data librarian/di…
#>  5 741803 @AlgoCompSynth@ravenation.club by znmeb znmeb          2007-02-01 00:03:16            9755 "https://t.co/rZhZ…
#>  6 755726 Travis Dawry                            tdawry         2007-02-06 23:45:01             274 "data, politics, o…
#>  7 774234 Shea's Coach Beard                      mandoescamilla 2007-02-15 13:51:10            1177 "my anger is a gif…
#>  8 787219 Jonathan                                jmcphers       2007-02-21 15:56:20             591 "Software engineer…
#>  9 799574 @dietrich@mastodon.social               dietrich       2007-02-27 18:41:20            6113 "A lifestyle brand…
#> 10 860921 ⌜will⌟                                  wtd            2007-03-09 23:20:43             719 "👋 I'm an optimis…
#> 11 944231 Christopher Peters 🇺🇦                   statwonk       2007-03-11 14:49:39            4476 "Lead Econometrici…

My tweets

Now we get to the main course: the tweets themselves. We can read them the same way we imported the account and follower data with read_twitter_data(), but for now we won’t simplify them.

To see why, let’s take a look at a single tweet. The file of tweets (outer list, [[1]]) contains an array (inner list, e.g. [[105]]) of tweets (named item, $tweet). Here’s that example tweet:

# Tweets are a list of a list of tweets...
tweet <- read_twitter_data(manifest, "tweets")[[1]][[105]]$tweet
str(tweet, max.level = 2)
#> List of 16
#>  $ edit_info         :List of 1
#>   ..$ initial:List of 4
#>  $ retweeted         : logi FALSE
#>  $ source            : chr "<a href=\"https://mobile.twitter.com\" rel=\"nofollow\">Twitter Web App</a>"
#>  $ entities          :List of 5
#>   ..$ user_mentions:List of 1
#>   ..$ urls         :List of 1
#>   ..$ symbols      : list()
#>   ..$ media        :List of 1
#>   ..$ hashtags     :List of 1
#>  $ display_text_range: chr [1:2] "0" "236"
#>  $ favorite_count    : chr "118"
#>  $ id_str            : chr "1276198597596459018"
#>  $ truncated         : logi FALSE
#>  $ retweet_count     : chr "33"
#>  $ id                : chr "1276198597596459018"
#>  $ possibly_sensitive: logi FALSE
#>  $ created_at        : chr "Thu Jun 25 17:00:30 +0000 2020"
#>  $ favorited         : logi FALSE
#>  $ full_text         : chr "Thanks to prodding from @dsquintana, I added `include_tweet()` to {tweetrmd}. Automatically embed the HTML twee"| __truncated__
#>  $ lang              : chr "en"
#>  $ extended_entities :List of 1
#>   ..$ media:List of 1

There’s quite a bit of data in each tweet, so we’ll pause here and figure out how we want to transform the nested list into a flat list that will rectangle nicely.

tidy_tweet_raw <- function(tweet_raw) {
  basic_items <- c(
    "created_at",
    "favorite_count",
    "retweet_count",
    "full_text",
    "id",
    "lang",
    "source"
  )

  # start with a few basic items
  tweet <- tweet_raw[basic_items]

  # and collapse a few nested items into a single string
  tweet$user_mentions <- tweet_raw |>
    purrr::pluck("entities", "user_mentions") |>
    purrr::map_chr("screen_name") |>
    paste(collapse = ",")

  tweet$hashtags <- tweet_raw |>
    purrr::pluck("entities", "hashtags") |>
    purrr::map_chr("text") |>
    paste(collapse = ",")

  tweet
}

When we apply this function to the example tweet, we get a nice, flat list.

tidy_tweet_raw(tweet) |> str()
#> List of 9
#>  $ created_at    : chr "Thu Jun 25 17:00:30 +0000 2020"
#>  $ favorite_count: chr "118"
#>  $ retweet_count : chr "33"
#>  $ full_text     : chr "Thanks to prodding from @dsquintana, I added `include_tweet()` to {tweetrmd}. Automatically embed the HTML twee"| __truncated__
#>  $ id            : chr "1276198597596459018"
#>  $ lang          : chr "en"
#>  $ source        : chr "<a href=\"https://mobile.twitter.com\" rel=\"nofollow\">Twitter Web App</a>"
#>  $ user_mentions : chr "dsquintana"
#>  $ hashtags      : chr "rstats"

This flattened tweet list will become a row in a tidy table of tweets, thanks to simplify_twitter_data(), which flattens the list of all of the tweets into a tibble. Once combined into a single table, we use our good friends dplyr, lubridate and stringr to convert columns to their correct types and to extract a few features.

tidy_tweets <-
  read_twitter_data(manifest, "tweets") |>
  simplify_twitter_data(tidy_tweet_raw) |>
  mutate(
    across(contains("_count"), as.integer),
    retweet = str_detect(full_text, "^RT @"),
    reply = str_detect(full_text, "^@"),
    type = case_when(
      retweet ~ "retweet",
      reply ~ "reply",
      TRUE ~ "tweet"
    ),
    created_at = strptime(created_at, "%a %b %d %T %z %Y"),
    hour = hour(created_at),
    day = wday(created_at, label = TRUE, abbr = TRUE, week_start = 1),
    month = month(created_at, label = TRUE, abbr = FALSE),
    day_of_month = day(created_at),
    year = year(created_at)
  )

The result… a nice tidy table of tweets!

tidy_tweets
#> # A tibble: 6,223 × 17
#>    created_at          favori…¹ retwe…² full_…³ id    lang  source user_…⁴ hasht…⁵ retweet reply type   hour day   month
#>    <dttm>                 <int>   <int> <chr>   <chr> <chr> <chr>  <chr>   <chr>   <lgl>   <lgl> <chr> <int> <ord> <ord>
#>  1 2022-11-05 10:02:17        0       0 "RT @g… 1588… en    "<a h… "georg… ""      TRUE    FALSE retw…    10 Sat   Nove…
#>  2 2022-11-04 19:42:01        4       0 "@JonT… 1588… en    "<a h… "JonTh… ""      FALSE   TRUE  reply    19 Fri   Nove…
#>  3 2022-11-04 15:21:23        1       0 "@tjma… 1588… en    "<a h… "tjmah… ""      FALSE   TRUE  reply    15 Fri   Nove…
#>  4 2022-11-03 12:39:09        1       0 "@trav… 1588… en    "<a h… "trave… ""      FALSE   TRUE  reply    12 Thu   Nove…
#>  5 2022-11-03 06:45:53        5       0 "@mcca… 1588… en    "<a h… "mccar… ""      FALSE   TRUE  reply     6 Thu   Nove…
#>  6 2022-11-03 06:36:56        2       0 "@trav… 1588… en    "<a h… "trave… ""      FALSE   TRUE  reply     6 Thu   Nove…
#>  7 2022-11-02 12:26:46        0       0 "RT @p… 1587… en    "<a h… "posit… ""      TRUE    FALSE retw…    12 Wed   Nove…
#>  8 2022-11-02 12:20:50        4       0 "And I… 1587… en    "<a h… ""      ""      FALSE   FALSE tweet    12 Wed   Nove…
#>  9 2022-10-31 11:47:57        0       0 "RT @D… 1587… en    "<a h… "Dante… ""      TRUE    FALSE retw…    11 Mon   Octo…
#> 10 2022-10-30 19:32:22        8       0 "At fi… 1586… en    "<a h… "pomol… ""      FALSE   FALSE tweet    19 Sun   Octo…
#> # … with 6,213 more rows, 2 more variables: day_of_month <int>, year <dbl>, and abbreviated variable names
#> #   ¹​favorite_count, ²​retweet_count, ³​full_text, ⁴​user_mentions, ⁵​hashtags

If you’ve seen the Observable notebook that inspired this post, you’ll notice that I’ve mostly recreated their data structure, but in R. Next, let’s recreate some of the plots in that notebook, too!

Monthly tweets, replies and retweets

Code: Set Blog Theme

Yeah, so real quick, I’m going to set up a plot theme for the rest of this post. Here it is, if you’re interested in this kind of thing!

blog_theme <-
  theme_minimal(18, base_family = "IBM Plex Mono") +
  theme(
    plot.background = element_rect(fill = "#f9fafa", color = NA),
    plot.title.position = "plot",
    plot.title = element_text(size = 24, margin = margin(b = 1, unit = "line")),
    legend.position = c(0, 1),
    legend.direction = "horizontal",
    legend.justification = c(0, 1),
    legend.title.align = 1,
    axis.title.y = element_text(hjust = 0),
    axis.title.x = element_text(hjust = 0),
    panel.grid.major = element_line(color = "#d3d9db"),
    panel.grid.minor = element_blank()
  )

theme_set(blog_theme)

The first chart shows the number of tweets, replies and retweets sent in each month from 2009 to 2022. From 2009 to 2015, I sent about 25 total tweets per month, with one large spike in January 2014 when a grad school course I was taking decided to do a “Twitter seminar.” My Twitter usage dropped off considerably between 2015 and 2018: first from the grad school grind, and then, when my son was born in 2016, tweeting practically stopped altogether.

My Twitter usage picked up again in 2018, which also coincided with my realization that academia wasn’t my ideal future. In 2018 and 2019 you can see my baseline usage pick up considerably at the start of the year — the effect of a lot of tweeting and networking during rstudio::conf. Since 2019, my usage has been fairly stable; I typically send between 50 and 100 tweets a month. Finally, there’s a noticeable recent drop in activity: since Twitter changed ownership, I still read Twitter but only occasionally tweet.

Hover or tap² on a bar above to see the top 5 tweets in each segment.

Code: Plot Monthly Tweets
type_colors <- c(reply = "#5e5b7f", tweet = "#ef8c02", retweet = "#7ab26f")

top_5_tweets_text <- function(data) {
  slice_max(
    data,
    n = 5,
    order_by = retweet_count * 2 + favorite_count,
    with_ties = FALSE
  ) |>
    pull(full_text) |>
    str_trunc(width = 120)
}

plot_monthly <-
  tidy_tweets |>
  # Group nest by month and tweet type ---
  mutate(dt_month = sprintf("%d-%02d", year, month(created_at))) |>
  group_nest(dt_month, month, year, type) |>
  mutate(
    # Calculate number of tweets per month/type
    n = map_int(data, nrow),
    # and extract the top 5 tweets
    top = map(data, top_5_tweets_text)
  ) |>
  select(-data) |>
  # Then build the tooltip (one row per month/type)
  rowwise() |>
  mutate(
    type_pl = plu::ral(type, n = n),
    tooltip = glue::glue(
      "<p><strong>{month} {year}: ",
      "<span style=\"color:{type_colors[type]}\">{n} {type_pl}</span></strong></p>",
      "<ol>{tweets}</ol>",
      tweets = paste(sprintf("<li>%s</li>", top), collapse = "")
    ),
    tooltip = htmltools::HTML(tooltip)
  ) |>
  ungroup() |>
  # Finally ensure the order of factors (including month!)
  mutate(type = factor(type, rev(c("tweet", "reply", "retweet")))) |>
  arrange(dt_month, type) |>
  mutate(dt_month = fct_inorder(dt_month)) |>
  # Plot time! ----
  ggplot() +
  aes(x = dt_month, y = n, fill = type, color = type, group = type) +
  ggiraph::geom_col_interactive(
    width = 1,
    aes(tooltip = tooltip)
  ) +
  scale_fill_manual(values = type_colors) +
  scale_color_manual(values = type_colors) +
  # The x-axis is factors for each month,
  # we need labels for each year, e.g. 2010-01 => 2010
  scale_x_discrete(
    breaks = paste0(seq(2008, 2022, by = 1), "-01"),
    labels = seq(2008, 2022, by = 1)
  ) +
  scale_y_continuous(expand = expansion(add = c(1, 1))) +
  labs(
    title = "Tweets per month",
    x = "Month Tweeted →",
    y = "Count →",
    fill = NULL,
    color = NULL
  ) +
  theme(
    plot.title = element_text(size = 24, margin = margin(b = 2, unit = "line")),
    legend.position = c(0, 1.14)
  )

ggiraph::girafe(
  ggobj = plot_monthly,
  width_svg = 14,
  height_svg = 6,
  desc = knitr::opts_current$get("fig.alt")
)

Tweets by time of day

The next plot highlights the time of day at which I sent tweets. Each bar shows the total number of tweets I’ve written within a given hour of the day. Morning hours are in the top half of each day’s circular panel and evening hours are in the bottom half. Tuesday at noon seems to be my favorite time to tweet — I sent 120 tweets between 12pm and 1pm on Tuesdays — followed by Friday at 1pm (111 tweets) and 11am (110 tweets).

Hover or tap on a bar to compare a given time across all days.

Code: Plot Tweets by Time of Day
tweet_count_by_hour <-
  tidy_tweets |>
  count(day, hour) |>
  mutate(
    hour_label = case_when(
      hour == 12 ~ "12pm",
      hour == 0 ~ "12am",
      hour > 12 ~ paste0(hour - 12, "pm"),
      hour < 12 ~ paste0(hour, "am")
    ),
    pct = n / sum(n)
  )
tooltip_hour <- function(day, hour_label, ...) {
  this_hour_count <-
    tweet_count_by_hour |>
    filter(hour_label == !!hour_label)

  this_hour_total <- sum(this_hour_count$n)
  this_hour_pct <- scales::percent(this_hour_total / sum(tweet_count_by_hour$n), 0.1)
  this_hour_total <- trimws(format(this_hour_total, big.mark = ","))

  this_hour_days <-
    this_hour_count |>
    mutate(
      across(pct, scales::percent_format(0.1)),
      across(n, format, big.mark = ","),
      across(n, trimws),
      text = glue("{day}: {pct} ({n})"),
      text = if_else(day == !!day, glue("<strong>{text}</strong>"), text)
    ) |>
    glue_data("<li>{text}</li>") |>
    glue_collapse()

  glue::glue(
    "<p><strong>{hour_label}</strong><br><small>{this_hour_pct} of total ({this_hour_total})</small></p>",
    "<ul>{this_hour_days}</ul>"
  )
}

tweet_count_by_hour$tooltip <- pmap_chr(tweet_count_by_hour, tooltip_hour)

plot_time_of_day <-
  ggplot(tweet_count_by_hour) +
  aes(y = n, fill = day, x = hour, data_id = hour, tooltip = tooltip) +
  geom_area(
    data = function(d) {
      # Shade from midnight-6am and 6pm-midnight, kinda like geom_step_area()
      max_count <- max(d$n)
      tibble(
        day = sort(rep(unique(d$day), 6)),
        hour = rep(c(0, 6, 6.01, 18, 18.01, 24), 7),
        n = rep(c(max_count, max_count, 0, 0, max_count, max_count), 7),
        tooltip = ""
      )
    },
    fill = "#aaaaaa30",
  ) +
  ggiraph::geom_col_interactive(show.legend = FALSE, width = 1) +
  facet_wrap(vars(day), nrow = 2) +
  coord_polar(start = pi) +
  scale_x_continuous(
    breaks = seq(0, 23, 3),
    minor_breaks = 0:23,
    labels = c("12am", paste0(seq(3, 9, 3), "am"), "12pm", paste0(seq(3, 9, 3), "pm")),
    limits = c(0, 24),
    expand = expansion()
  ) +
  scale_y_continuous(expand = expansion(), breaks = seq(0, 100, 25)) +
  scale_fill_discrete() +
  labs(
    title = "When do I do my tweeting?",
    x = NULL,
    y = NULL
  ) +
  theme(
    axis.text.y = element_blank(),
    axis.text.x = element_text(size = 10),
    panel.grid.major.y = element_blank()
  )

ggiraph::girafe(
  ggobj = plot_time_of_day,
  width_svg = 12,
  height_svg = 8,
  options = list(
    ggiraph::opts_hover_inv("filter: saturate(30%) brightness(125%)"),
    ggiraph::opts_hover(css = "opacity:1"),
    ggiraph::opts_tooltip(
      placement = "container",
      css = "width: 12rem; font-family: var(--font-monospace, 'IBM Plex Mono');",
      # These don't matter, position is set by CSS rules below
      offx = 600,
      offy = 260,
      use_cursor_pos = FALSE
    )
  ),
  desc = knitr::opts_current$get("fig.alt")
)

Tweet source

The tweet archive includes the application used to send each tweet, stored as the HTML link that Twitter displays alongside the tweet:

<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>

With a little bit of regex, we can extract the tweet source. Apparently, I’ve used 37 different apps to write my tweets, but 17 were used for no more than 5 tweets. Most often — actually, 79% of the time — I wrote tweets from the web app or my phone.
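
As a quick sanity check, here’s that regex applied to the example source string above:

src <- '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'

# Column 2 of the match matrix is the href, column 3 is the app name
str_match(src, '<a href="([^"]+)"[^>]+>([^<]+)</a>')[, 3]
#> [1] "Twitter for iPhone"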

Code: Plot Tweet Source
tweet_source <-
  tidy_tweets |>
  extract(
    source,
    into = c("source_href", "source"),
    regex = '<a href="([^"]+)"[^>]+>([^<]+)</a>'
  )

tweet_source_count <- tweet_source |>
  count(source) |>
  mutate(pct = n / sum(n))
plot_tweet_source <-
  tweet_source |>
  mutate(
    source = fct_lump_n(source, n = 15),
    source = fct_rev(fct_infreq(source))
  ) |>
  count(source, type, sort = TRUE) |>
  pivot_wider(names_from = type, values_from = n, values_fill = 0) |>
  mutate(
    total = reply + retweet + tweet,
    tooltip = pmap_chr(
      list(source, reply, retweet, tweet, total),
      function(source, reply, retweet, tweet, total) {
        x <- glue(
          '<label for="{tolower(label)}">{label}</label>',
          '<progress id="{tolower(label)}" max="{total}" value="{value}">{value}</progress>',
          label = c("Tweets", "Replies", "Retweets"),
          value = c(tweet, reply, retweet)
        )
        x <- glue_collapse(x)
        paste0('<p class="b">', source, "</p>", x)
      }
    )
  ) |>
  ggplot() +
  aes(x = total, y = source, tooltip = tooltip) +
  ggiraph::geom_col_interactive(show.legend = FALSE) +
  scale_x_continuous(expand = expansion(add = c(0, 0.01))) +
  scale_y_discrete(expand = expansion()) +
  labs(
    title = "What app did I use to tweet?",
    x = "Tweets →",
    y = NULL
  ) +
  theme(
    panel.grid.major.y = element_blank()
  )

ggiraph::girafe(
  ggobj = plot_tweet_source,
  width_svg = 10,
  height_svg = 8,
  options = list(
    ggiraph::opts_hover_inv("filter: saturate(30%) brightness(125%)"),
    ggiraph::opts_hover(css = "opacity:1"),
    ggiraph::opts_tooltip(
      placement = "container",
      css = "width: 15rem; font-family: var(--font-monospace, 'IBM Plex Mono');"
    )
  ),
  desc = knitr::opts_current$get("fig.alt")
)

My likes

One huge reason to go through the trouble of requesting and downloading your Twitter archive is to collect a copy of your liked tweets. (Sadly, your bookmarks are not a part of the archive.)

likes <-
  read_twitter_data(manifest, "like") |>
  simplify_twitter_data()

likes |>
  arrange(as.numeric(tweetId))
#> # A tibble: 11,773 × 3
#>    tweetId            fullText                                                                                   expan…¹
#>    <chr>              <chr>                                                                                      <chr>  
#>  1 42240201359233024  We just went live with RStudio, a new IDE for R. Try it out and let us know what you thin… https:…
#>  2 169437879704092672 RT @DKThomp: Adulthood, Delayed: What the Recession Has Done to Millennials http://t.co/U… https:…
#>  3 338425212762738690 Joy! First sighting of NYC's CitiBikes in place! http://t.co/o6VPop4kWZ                    https:…
#>  4 343026917659791360 What a shitty day to announce my new data analytics project, Prism.                        https:…
#>  5 343037575889580033 Very cool: DoS using agent-based modeling to understand conflict dynamics in Niger Delta … https:…
#>  6 347931309496213504 BACK TO BACK CHAMPS!!! Going crazy all by myself in my little hotel room in Prague. Effin… https:…
#>  7 383071681079549953 The real reason lowering health care costs is hard: Every patient is unique http://t.co/j… https:…
#>  8 386857068423938048 Good work on study on admits/length of stay/ 'crowdedness' of ICU &amp; impacts on morbid… https:…
#>  9 386862608231325696 I'm giving my presentation on scheduling medical residents at #informs2013 in the Doing G… https:…
#> 10 386935928042029056 Gustavo just presented a semicont opt that can B perfectly applied in my supply chain pro… https:…
#> # … with 11,763 more rows, and abbreviated variable name ¹​expandedUrl

While the likes archive includes the full text of each tweet, we can use the lookup_tweets() function from the rtweet package to download complete information about each tweet.

likes_full <-
  rtweet::lookup_tweets(likes$tweetId) |>
  write_rds("data/likes.rds")

Getting all 11,773 tweets takes a few minutes, so I highly recommend saving the data to disk as soon as you’ve collected it.
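
If you re-render a post like this one often, a simple cache-if-missing pattern avoids hitting the API every time (a sketch using the same path):

likes_path <- "data/likes.rds"

if (!file.exists(likes_path)) {
  # Only call the API when we don't already have a local copy
  write_rds(rtweet::lookup_tweets(likes$tweetId), likes_path)
}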

likes_full <- read_rds("data/likes.rds")
likes_full
#> # A tibble: 11,385 × 43
#>    created_at               id id_str       full_…¹ trunc…² displ…³ entities     source in_rep…⁴ in_re…⁵ in_re…⁶ in_re…⁷
#>    <dttm>                <dbl> <chr>        <chr>   <lgl>     <dbl> <list>       <chr>     <dbl> <chr>     <dbl> <chr>  
#>  1 2022-11-04 20:35:35 1.59e18 15886914597… "Read … FALSE       278 <named list> "<a h… NA       NA      NA      NA     
#>  2 2022-11-05 10:33:18 1.59e18 15889022786… "heari… FALSE       190 <named list> "<a h… NA       NA      NA      NA     
#>  3 2022-11-05 01:35:38 1.59e18 15887669703… "Defen… FALSE       236 <named list> "<a h… NA       NA      NA      NA     
#>  4 2022-11-04 14:23:18 1.59e18 15885977742… "Pleas… FALSE        60 <named list> "<a h… NA       NA      NA      NA     
#>  5 2022-11-04 11:35:40 1.59e18 15885555857… "Despe… FALSE       114 <named list> "<a h… NA       NA      NA      NA     
#>  6 2022-11-04 16:16:08 1.59e18 15886261683… "Here'… FALSE       269 <named list> "<a h… NA       NA      NA      NA     
#>  7 2022-11-04 11:46:29 1.59e18 15885583081… "@tjma… FALSE        39 <named list> "<a h…  1.59e18 158855…  1.29e9 128991…
#>  8 2022-11-03 10:22:11 1.59e18 15881747079… "https… FALSE         0 <named list> "<a h… NA       NA      NA      NA     
#>  9 2022-11-04 09:48:27 1.59e18 15885286045… "One o… FALSE       188 <named list> "<a h… NA       NA      NA      NA     
#> 10 2022-11-04 13:54:39 1.59e18 15885905619… "We’ve… FALSE       141 <named list> "<a h… NA       NA      NA      NA     
#> # … with 11,375 more rows, 31 more variables: in_reply_to_screen_name <chr>, geo <list>, coordinates <list>,
#> #   place <list>, contributors <lgl>, is_quote_status <lgl>, retweet_count <int>, favorite_count <int>,
#> #   favorited <lgl>, retweeted <lgl>, lang <chr>, possibly_sensitive <lgl>, quoted_status_id <dbl>,
#> #   quoted_status_id_str <chr>, quoted_status_permalink <list>, quoted_status <list>, text <chr>, favorited_by <lgl>,
#> #   scopes <list>, display_text_width <lgl>, retweeted_status <lgl>, quote_count <lgl>, timestamp_ms <lgl>,
#> #   reply_count <lgl>, filter_level <lgl>, metadata <lgl>, query <lgl>, withheld_scope <lgl>, withheld_copyright <lgl>,
#> #   withheld_in_countries <lgl>, possibly_sensitive_appealable <lgl>, and abbreviated variable names ¹​full_text, …

Assuming I liked a tweet in the same year it was written (reasonable but not entirely accurate), plotting the source year of the tweet highlights just how much my Twitter usage picked up in 2018.

Code: Plot Total Likes
plot_liked_tweets <-
  likes_full |>
  count(year = year(created_at)) |>
  mutate(
    noun = map_chr(n, \(n) plu::ral("tweet", n = n)),
    tooltip = paste(format(n, big.mark = ","), "liked", noun, "in", year)
  ) |>
  ggplot() +
  aes(year, n, tooltip = tooltip, group = 1) +
  geom_line(color = "#595959", linewidth = 1.5) +
  ggiraph::geom_point_interactive(color = "#595959", size = 7) +
  scale_x_continuous(breaks = seq(2008, 2022, 2), expand = expansion(add = 0.25)) +
  labs(
    title = "Tweets I've Liked",
    x = "Year →",
    y = "Liked Tweets →"
  )

ggiraph::girafe(
  ggobj = plot_liked_tweets,
  width_svg = 12,
  height_svg = 4,
  options = list(ggiraph::opts_tooltip()),
  desc = knitr::opts_current$get("fig.alt")
)

Advertising info

The last thing I want to dive into is the part of the archive that includes information about Twitter’s perception of you. Or, more importantly, how Twitter sees you in terms of advertising.

Impressions and engagements

There are two key items in the archive: ad impressions and engagements. All ads on Twitter are actually tweets that are promoted into your view because an advertiser has paid for Twitter to show you a tweet you wouldn’t otherwise see.

An impression is a promoted tweet you see in your timeline or in tweet replies, but you don’t interact with the tweet. An engagement is a tweet that you click on or interact with in some way. The definitions (included in the details below) are hazy — I’m fairly certain from looking at my data that some tweets are “engaged with” simply by being visible on my screen for a longer period of time. (In other words, I’m certain I haven’t actively engaged with as many tweets as are highlighted below.)

The ads data items are imported separately and have a pretty wild nested structure. To untangle them I used a lot of tidyr::unnest() and my newest favorite function, tidyr::unnest_wider().
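
If unnest_wider() is new to you, here’s a toy example (not archive data) showing what it does:

toy <- tibble(
  x = list(
    list(a = 1, b = "one"),
    list(a = 2, b = "two")
  )
)

# Each named element of the list-column becomes its own column
unnest_wider(toy, x)
#> # A tibble: 2 × 2
#>       a b    
#>   <dbl> <chr>
#> 1     1 one  
#> 2     2 two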

Code: ad_impressions

ad-impressions.js
- ad: Promoted Tweets the account has viewed and any associated metadata.
- deviceInfo: Information about the device where the impression was viewed such as its ID and operating system.
- displayLocation: Location where the ad was viewed on Twitter.
- promotedTweetInfo: Information about the associated tweet such as unique identifier, text, URLs and media when applicable.
- advertiserInfo: Advertiser name and screen name.
- matchedTargetingCriteria: Targeting criteria that were used to run the campaign.
- impressionTime: Date and time when the ad was viewed.

ad_impressions <-
  read_twitter_data(manifest, "adImpressions") |>
  simplify_twitter_data() |>
  unnest(adsUserData) |>
  unnest(adsUserData) |>
  unnest_wider(adsUserData) |>
  unnest_wider(c(deviceInfo, promotedTweetInfo, advertiserInfo)) |>
  mutate(
    matchedTargetingCriteria = map(matchedTargetingCriteria, map_dfr, identity),
    across(impressionTime, ymd_hms)
  )
Code: ad_engagements

ad-engagements.js
- ad: Promoted Tweets the account has engaged with and any associated metadata.
- engagementAttributes: Type of engagement as well as date and time when it occurred.

ad_engagements <-
  read_twitter_data(manifest, "adEngagements") |>
  simplify_twitter_data() |>
  unnest(adsUserData) |>
  unnest(adsUserData) |>
  unnest_wider(adsUserData) |>
  mutate(across(engagementAttributes, map, map_dfr, identity)) |>
  unnest_wider(impressionAttributes) |>
  # now the same as the impressions
  unnest_wider(c(deviceInfo, promotedTweetInfo, advertiserInfo)) |>
  mutate(
    matchedTargetingCriteria = map(matchedTargetingCriteria, map_dfr, identity),
    across(impressionTime, ymd_hms)
  )

Once you have the impressions and engagements tables, you can combine them with purrr::list_rbind().

ads <-
  list(
    impression = ad_impressions,
    engagement = ad_engagements
  ) |>
  list_rbind(names_to = "type")

ads
#> # A tibble: 8,599 × 16
#>    type       osType devic…¹ devic…² displ…³ tweetId tweet…⁴ urls   media…⁵ adver…⁶ scree…⁷ matche…⁸ impressionTime     
#>    <chr>      <chr>  <chr>   <chr>   <chr>   <chr>   <chr>   <list> <list>  <chr>   <chr>   <list>   <dttm>             
#>  1 impression Ios    2eVW2/… iPhone… Timeli… 154910… "When … <NULL> <NULL>  Chevro… @chevr… <tibble> 2022-08-08 14:03:09
#>  2 impression Ios    2eVW2/… iPhone… Timeli… 155557… "Tune … <chr>  <NULL>  Walmart @Walma… <tibble> 2022-08-08 14:08:07
#>  3 impression Ios    2eVW2/… iPhone… Timeli… 155465… "Meet … <NULL> <NULL>  Anker   @Anker… <tibble> 2022-08-08 14:04:17
#>  4 impression Ios    2eVW2/… iPhone… Timeli… 155598… "This … <NULL> <chr>   KESIMP… @KESIM… <tibble> 2022-08-08 10:54:28
#>  5 impression Ios    2eVW2/… iPhone… TweetC… 155507… "🎁🎁G… <NULL> <NULL>  Webull  @Webul… <tibble> 2022-08-08 10:57:56
#>  6 impression Ios    2eVW2/… iPhone… Timeli… 151472… "#1 is… <NULL> <NULL>  Financ… @finan… <tibble> 2022-08-08 10:56:19
#>  7 impression Ios    2eVW2/… iPhone… TweetC… 155507… "🎁🎁G… <NULL> <NULL>  Webull  @Webul… <tibble> 2022-08-08 10:56:57
#>  8 impression Ios    2eVW2/… iPhone… Timeli… 155481… "Mick.… <NULL> <NULL>  EPIX i… @EPIXHD <tibble> 2022-08-08 03:33:23
#>  9 impression Ios    2eVW2/… iPhone… Timeli… 155407… "Watch… <NULL> <NULL>  Paper … @HowLi… <tibble> 2022-08-08 03:34:51
#> 10 impression Ios    2eVW2/… iPhone… Timeli… 155512… "Wreck… <NULL> <NULL>  Mill G… @Mill_… <tibble> 2022-08-08 03:25:08
#> # … with 8,589 more rows, 3 more variables: publisherInfo <list>, promotedTrendInfo <list>,
#> #   engagementAttributes <list>, and abbreviated variable names ¹​deviceId, ²​deviceType, ³​displayLocation, ⁴​tweetText,
#> #   ⁵​mediaUrls, ⁶​advertiserName, ⁷​screenName, ⁸​matchedTargetingCriteria

The downside of the ads data is that it only includes the last three-ish months. Here are my impressions and engagements for August through early November of 2022.
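
You can confirm that window directly from the combined table:

# The earliest and latest impression times in my archive
# (for me: roughly 2022-08-08 through early November 2022)
range(ads$impressionTime)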

Code: Plot Interactions by Month
plot_ads_interactions <-
  ads |>
  count(type, month = floor_date(impressionTime, "month")) |>
  mutate(
    n_str = format(n, big.mark = ","),
    tooltip = pmap_chr(
      list(type, n, n_str, month),
      \(type, n, n_str, month) {
        glue(
          "{n} {type} in {month}",
          type = plu::ral(type, n = n),
          month = month(month, label = TRUE, abbr = FALSE)
        )
      })
  ) |>
  ggplot() +
  aes(month, n, fill = type, tooltip = tooltip) +
  ggiraph::geom_col_interactive() +
  scale_fill_manual(
    values = c("#97c4ca", "#1c7d8b"),
    labels = c("Engagement", "Impression")
  ) +
  labs(
    title = "Ad Interactions by Month",
    x = NULL,
    y = "Promoted Tweets →",
    fill = NULL
  ) +
  theme(
    panel.grid.major.x = element_blank(),
    legend.direction = "vertical",
    legend.position = c(0.95, 0.9),
    legend.justification = c(1, 1)
  )

ggiraph::girafe(
  ggobj = plot_ads_interactions,
  width_svg = 12,
  height_svg = 6,
  options = list(ggiraph::opts_tooltip()),
  desc = knitr::opts_current$get("fig.alt")
)

Who advertised to me?

Finally, I wanted to know who was advertising to me and which tweets I was seeing. The advertising data includes demographics and keywords used by the advertisers to target you, and I recommend taking a look at that. But I’m running out of steam in this post, so let’s just take a look at the promoted content I saw on Twitter over the last few months.

Code: Plot Ad Interactions by Advertiser
ads_advertiser_counts <-
  ads |>
  count(advertiserName, type, sort = TRUE) |>
  pivot_wider(names_from = type, values_from = n) |>
  slice_max(n = 25, engagement + impression) |>
  pivot_longer(-1, names_to = "type")

ads_tweet_examples <-
  ads |>
  filter(!is.na(tweetText)) |>
  semi_join(ads_advertiser_counts) |>
  group_by(advertiserName, type) |>
  mutate(tweetText = str_trunc(tweetText, width = 80)) |>
  summarize(
    n = n(),
    tweets = glue_collapse(glue(
      "<li>{sample(unique(tweetText), min(5, length(unique(tweetText))))}</li>"
    )),
    .groups = "drop"
  ) |>
  mutate(
    tweets = glue('<ul>{tweets}</ul>'),
    tweets = glue(
      '<p><strong>{n}</strong> promoted tweet ',
      '<strong>{type}s</strong> ',
      'by <strong>{advertiserName}</strong></p>',
      '{tweets}'
    )
  )

plot_advertisers <-
  ads_advertiser_counts |>
  left_join(ads_tweet_examples) |>
  mutate(advertiserName = fct_reorder(advertiserName, value, sum)) |>
  ggplot() +
  aes(value, advertiserName, fill = type, tooltip = tweets) +
  ggiraph::geom_col_interactive() +
  scale_x_continuous(expand = expansion()) +
  scale_fill_manual(
    values = c("#97c4ca", "#1c7d8b"),
    labels = c("Engagement", "Impression")
  ) +
  labs(
    title = "Ad Interactions by Advertiser",
    x = "Interactions with Promoted Tweets →",
    y = NULL,
    fill = NULL
  ) +
  theme(
    panel.grid.major.y = element_blank(),
    legend.direction = "vertical",
    legend.position = c(0.99, 0.1),
    legend.justification = c(1, 0)
  )

ggiraph::girafe(
  ggobj = plot_advertisers,
  width_svg = 12,
  height_svg = 10,
  options = list(ggiraph::opts_tooltip()),
  desc = knitr::opts_current$get("fig.alt")
)

Footnotes

  1. simplify_twitter_data() is an optional and separate function because it’s an 80/20 function: it’s 20% of the code that does the right thing 80% of the time.↩︎

  2. On mobile devices, tapping on a bar kind of works. But to change focus from one plot element to another, you might need to tap outside of the plot area before tapping on the new element. Sorry! The hover interactions work a whole lot better on desktop.↩︎