Twitter’s Feelings About Programming Languages

A deep dive into an informal, free-form survey about experiences with programming languages.

Categories: R, rtweet, Data Analysis, Visualization, Programming

Author: Garrick Aden-Buie

Published: October 8, 2019

Keywords: rstats, rtweet, Tweet analysis, Programming languages, R

An informal poll about experiences with programming languages has been making the rounds on Twitter this week. It all started with this tweet from @cotufa82:

The tweet caught on within a few days and there are now more than 16,840 replies and quote tweets from developers and programmers sharing their own experiences.

My interest in the poll was piqued by another tweet by @edsu sharing a Jupyter notebook analyzing the tweeted responses. I thought it would be interesting to do a similar analysis using R, initially thinking I could compare the R and Python versions.

What I should have done is use both R and Python (because they’re friends and language wars are silly), but instead I ended up going down the endless rabbit hole of regular expressions and free-form informal survey results.

Gather the Tweets

I gathered all tweets containing "first language", "most used", and "most loved" using the excellent rtweet package by Mike Kearney.

tweets <- rtweet::search_tweets(
  '"first language" AND "most used" AND "most loved"',
  n = 18000,
  include_rts = FALSE
)

You can download a CSV with the processed tweets. The .csv doesn’t include the full tweet data, but it does include status_id so that you can recover the tweet data with rtweet::lookup_statuses().

Our Experience with Programming Languages

Let’s dive into the results. If you’re interested in taking a peek behind the regular expressions curtain, I’ve included a code walkthrough below.

The original tweet asked for six categories: First language, Had difficulties, Most used, Totally hate, Most loved, For beginners. Replies to this tweet were… creative. Because the category names and formatting were typed by hand, they vary widely and are prone to spelling errors and permutations.

To get the broadest range of answers possible, I used flexible regular expressions to accept a variety of formatting choices, and I also widened the categories to encompass the same core themes. For example, first love, secret love, and mostly loved all were added to the Most loved category, which I called, simply, love.

I also captured multiple programming languages in each category. Even the original tweet gave multiple answers for first language (Basic/Java) and for a few other categories.

Each of the following plots shows the top 20 responses in each category.

Love It or Hate It

Which programming languages are loved and which languages are not? The world seems to have a love/hate relationship with JavaScript, but Python is much more loved than hated. Likewise, Swift, Ruby, and Go draw far more love than hate. C++ is also a bit love/hate, and PHP certainly isn’t feeling the love.

Most Used or Had Difficulties

Which languages are most used compared with those that have caused difficulties? JavaScript is eating the world, and plenty of people are using workhorse languages like Python, Java and C#/C++. (And quite a few are using PHP, presumably because they have to.) Still, JavaScript’s love/hate relationship continues as many people indicated that it caused them problems. I’m not surprised to see C++, C, and Java on the had difficulties list. Interestingly, Haskell shows up in the loved list but seems to also be tricky to learn.

Feelings about #rstats

How do developers feel about my favorite language? R isn’t a typical first language, but it is among the top 20 recommended to new programmers to learn first. It’s also the 12th most used language.

Category            Rank  Total
most used             12   1456
love                  15   2067
had difficulties      19   2092
hate                  16   2641
beginner              17   2296
first language        28   1508
curious               15    207
currently              2     63
next                   3     50
honerable mention      8     98
chronology            25     29
also used, eager to learn, frenemy, never studied, on my list, to learn, totally meh, willing to learn

Code Walkthrough

At a high level, the process for cleaning and standardizing the tweet responses looks like this. I abstracted some of the larger steps in the pipeline into separate functions.

  1. Pre-clean the tweet text, including remove_unused_text()

  2. Separate tweets so that each line or item of the tweet is in its own row using tidyr::separate_rows()

    • Items are indicated by N., N), N:, or N-, or just appear on a new line without numbering.

  3. Remove whitespace and any numbering from each line

  4. Separate each line into a question category and answer pair by splitting on : using tidyr::separate()

  5. Filter out empty answers and convert everything to lower case

  6. Use a set of regular expressions to process_answer() into individual languages

  7. Use more regular expressions to recode_answer() and recode_category(), fixing spelling mistakes and combining overlapping groups

  8. Count the number of replies mentioning each programming language by category

The whole pipeline is summarized below, including the function to plot response counts by category.
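
Before the full functions, steps 2 through 5 can be tried on a single reply. This is a minimal sketch, not the pipeline itself: the reply text is made up, and it uses base R's strsplit() in place of tidyr::separate_rows() and tidyr::separate() so that it runs without any packages.

```r
# A made-up reply (not from the data set) mixing numbered and plain lines
reply <- "1. First language: Basic\n2) Most used: Python\nMost loved: R"

# Step 2: split on a newline or on numbering such as `N.`, `N)`, `N:`, `N-`
items <- strsplit(reply, "\n|\\d\\s*[.):-]", perl = TRUE)[[1]]

# Step 3: trim whitespace and drop the empty pieces left by the split
items <- trimws(items)
items <- items[nzchar(items)]

# Step 4: split each item into a category/answer pair on the colon
parts <- strsplit(items, "\\s*:\\s*", perl = TRUE)

# Step 5: lower-case everything
pairs <- data.frame(
  category = tolower(vapply(parts, `[`, character(1), 1)),
  answer   = tolower(vapply(parts, `[`, character(1), 2)),
  stringsAsFactors = FALSE
)

print(pairs)
```

Each row of pairs now holds one question and one raw answer, which is the shape that the answer and category recoding steps expect.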

Remove Unused Text

This little function removes usernames (@user), URLs, and parenthetical comments. It also turns #hashtag into hashtag because many people specified their choices using language hashtags, like #rstats instead of r.

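library(tidyverse) # provides %>% and the stringr functions used below

remove_unused_text <- function(text) {
  text %>%
    # strip usernames
    str_remove_all("@\\w+\\s*") %>%
    # strip URLs
    str_remove_all("\\s*http[^ ]+\\s*") %>%
    # remove parentheticals, keeping the space or newline that follows
    # so neighboring words and lines aren't glued together
    str_replace_all("\\s*\\(.+?\\)( |\n|$)", "\\1") %>%
    # replace "#hashtag" with "hashtag"
    str_replace_all("#(\\w)", "\\1")
}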
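
To see what each pattern does, it can help to apply them one at a time. The sketch below is an illustration only: the tweet is made up, and it uses base R's gsub() with the same regular expressions so that it runs without stringr. The parenthetical step puts back the captured space or newline, which is a choice made for this sketch.

```r
# A made-up tweet (not from the data set)
tweet <- "@cotufa82 Most loved: #rstats (obviously) https://t.co/abc123"

# Strip usernames
no_users <- gsub("@\\w+\\s*", "", tweet, perl = TRUE)

# Strip URLs, together with the whitespace around them
no_urls <- gsub("\\s*http[^ ]+\\s*", "", no_users, perl = TRUE)

# Strip parenthetical comments; "\\1" keeps the trailing space or newline, if any
no_parens <- gsub("\\s*\\(.+?\\)( |\n|$)", "\\1", no_urls, perl = TRUE)

# Turn "#hashtag" into "hashtag"
cleaned <- gsub("#(\\w)", "\\1", no_parens, perl = TRUE)

print(cleaned)
# [1] "Most loved: rstats"
```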

Process Answer

The goal in processing the answers is to transform each answer to a single string of comma separated languages. In doing this, common variations of language lists should result in the same final answers. For example, Python and R, Python/R, and Python or R should all be handled similarly. To help with this process I created a list of common languages that frequently appear in the answers.

common_langs <- c(
  # c, c#, c++, and .net are manually included later
  "css", "html", "python", "javascript", "x86", "java", "ruby", "pascal", "php",
  "matlab", "perl", "fortran", "logo", "actionscript", "lua", "assembly",
  "delphi", "js", "scheme", "scratch", "go", "typescript", "clojure", "elixir",
  "kotlin", "ocaml", "rust", "mathematica", "dart", "flutter", "groovy",
  "flash", "bash", "shell", "sql", "haskell", "lisp", "scala", "sas",
  "rstats", "golang"
)

Then, with a bit of regex kung fu, the responses are converted from Python and R to python,r.

process_answer <- function(answer, common_langs) {
  answer %>%
    # Aggressively remove unusual characters
    str_replace_all("[^\\w\\d#+., ]", " ") %>%
    # Remove leading character if it's a `,`
    str_replace_all("^,", " ") %>%
    # Remove `.` at end of string
    str_remove_all("[.]$") %>%
    # Replace connecting words (and, or, also, amp) with a space (prep for next step)
    str_replace_all("\\b(and|or|also|amp)\\b", " ") %>%
    # Remove qualifiers
    str_remove_all("\\b(maybe|now)\\b") %>%
    # Multiple languages may be listed separated by spaces, if so add comma
    str_replace_all(
      pattern = paste0("\\b(", paste(common_langs, collapse = "|"), ")\\b\\s*"),
      replacement = "\\1,"
    ) %>%
    gsub("c\\+\\+\\d+", "c++", .) %>%
    # Comma separate languages that are tough to regex
    gsub("c ", "c,", ., fixed = TRUE) %>%
    gsub(".net ", ".net,", ., fixed = TRUE) %>%
    gsub("c# ", "c#,", ., fixed = TRUE) %>%
    gsub("c++ ", "c++,", ., fixed = TRUE) %>%
    # No trailing punctuation
    str_remove("[.,!?/=<>;:]+$")
}
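
The comma-insertion step is the least obvious part of this function, so here it is in isolation. This is a sketch under assumptions: langs is a shortened, illustrative stand-in for common_langs, add_commas() is a hypothetical helper name, and base R's gsub() stands in for str_replace_all().

```r
# A shortened, illustrative stand-in for common_langs
langs <- c("python", "java", "ruby")

# One alternation of all known languages, bounded by word boundaries,
# followed by any trailing whitespace
pattern <- paste0("\\b(", paste(langs, collapse = "|"), ")\\b\\s*")

# Hypothetical helper: put a comma after every known language
add_commas <- function(x) gsub(pattern, "\\1,", x, perl = TRUE)

add_commas("python ruby r")
# [1] "python,ruby,r"

# The word boundaries keep "java" from matching inside "javascript"
add_commas("javascript java")
# [1] "javascript java,"
```

A trailing comma like the one in the second result is what the final "no trailing punctuation" step in process_answer() cleans up.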

Recode Answer

There are a number of programming languages that have multiple variants or are commonly referred to by shorthand names, such as rstats for R or golang for go. This function recodes the programming language answers that I noticed while working with the data (but it’s admittedly not complete).

recode_answer <- function(answer) {
  # Recode Basic Variants
  answer <- recode(answer, "vb" = "visual basic")
  answer <- if_else(str_detect(answer, "visual.*basic"), "visual basic", answer)
  answer <- if_else(str_detect(answer, "q.*basic"), "qbasic", answer)
  answer <- if_else(str_detect(answer, "gw.*basic"), "gw basic", answer)
  answer <- if_else(str_detect(answer, "(?<!(visual|q|gw)\\s?)basic"), "basic", answer)
  # Recode Pascal variants
  answer <- if_else(str_detect(answer, "pascal"), "pascal", answer)
  # Recode js vs Javascript
  answer <- recode(answer, "js" = "javascript")
  # Recode golang to go
  answer <- recode(answer, "golang" = "go")
  # Recode rstats as r
  recode(answer, "rstats" = "r")
}

Recode Category

As you might imagine with a free-form survey where users manually enter both the question and the answer, there is a large amount of variation in the spelling and categories used.

I broadly grouped many of the variations into common themes, primarily working to fit the original prompt. Respondents invented many interesting categories of their own, like best dead language, didn't spark joy, or latest crush. I also created two additional categories, curious and interesting.

recode_category <- function(category) {
  case_when(
    str_detect(category, "first.+lang(uage)?|firstlanguage") ~ "first language",
    str_detect(category, "^first$") ~ "first language",
    str_detect(category, "b(e|i)ginn?e|new dev|newb|starter|noob|brginners|begginners|begginers") ~ "beginner",
    str_detect(category, "want|would|wish|wanna|curious|desire|(like.+learn)|curios|(like to try)") ~ "curious",
    str_detect(category, "m[ou]st?(ly)? ?used?") ~ "most used",
    str_detect(category, "diff?.+c.+lt|diificulties|difficulies|difficuties|difficulities") ~ "had difficulties",
    str_detect(category, "love") ~ "love",
    str_detect(category, "hate|dislike|avoid|(don.?t.+like)") ~ "hate",
    str_detect(category, "promis|interest|exotic|esoter|(most excited)|(weird)") ~ "interesting",
    str_detect(category, "honou?rable mention") ~ "honerable mention",
    str_detect(category, "next|need to learn") ~ "next",
    str_detect(category, "others used|other lang|dabbl") ~ "others used",
    str_detect(category, "current") ~ "currently",
    TRUE ~ category
  )
}
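
To get a feel for how these patterns absorb misspellings, here is a small sketch using a handful of them. The input categories are made up, recode_sketch() is a hypothetical name, and a base R loop stands in for case_when(). As in case_when(), the first matching pattern wins and anything unmatched is left as typed.

```r
# A few of the patterns from recode_category(), in the same order
patterns <- c(
  "first language"   = "first.+lang(uage)?|firstlanguage",
  "beginner"         = "b(e|i)ginn?e|begginners|begginers",
  "most used"        = "m[ou]st?(ly)? ?used?",
  "had difficulties" = "diff?.+c.+lt",
  "love"             = "love"
)

# Hypothetical helper: the first matching pattern decides the label
recode_sketch <- function(category) {
  out <- category
  done <- rep(FALSE, length(category))
  for (label in names(patterns)) {
    hit <- !done & grepl(patterns[[label]], category, perl = TRUE)
    out[hit] <- label
    done <- done | hit
  }
  out
}

# Made-up category names with typical typos and variations
typed <- c("firstlanguage", "begginers", "mostly used",
           "dificult", "secret love", "totally meh")

print(recode_sketch(typed))
```

The first five inputs collapse into first language, beginner, most used, had difficulties, and love, while totally meh passes through unchanged.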

Poll Processing Pipeline

Finally, here is the full pipeline to go from raw tweets to poll results.

tweets_lang_poll <-
  tweets %>%
  select(status_id, created_at, user_id, screen_name, text) %>%
  # Remove tweets with "English" because that's probably a different thread
  filter(!str_detect(text, "[eE]nglish")) %>%
  mutate(
    # Backup original tweet text
    text_og = text,
    # Remove unused text from tweets
    text = remove_unused_text(text)
  ) %>%
  # Split text into question/answer pairs,
  # splitting on newline or one of: `N.`, `N)`, `N:`, or `N-`
  separate_rows(text, sep = "\n|\\d\\s*[.):-]") %>%
  # Remove whitespace and `N.` numbers from start of text
  mutate(text = str_remove_all(text, "^\\s*(\\d[.):-])?\\s*")) %>%
  # Separate question/answer into category, answer columns, splitting on colon `:`
  separate(
    col = text,
    into = c("category", "answer"),
    sep = "\\s*:\\s*",
    remove = FALSE
  ) %>%
  # Remove missing answers and answers without any letters or digits
  filter(
    !is.na(answer),
    str_detect(answer, "[[:alnum:]]")
  ) %>%
  # Re-encode category, answer as UTF-8 (:shrug:) and lowercase
  mutate_at(vars(category, answer), stringi::stri_enc_toutf8) %>%
  mutate_at(vars(category, answer), tolower) %>%
  # Category: Remove leading non-alpha characters and squish whitespace
  mutate(
    category = str_remove(category, "^[^[:alpha:]]+"),
    category = str_squish(category)
  ) %>%
  # Process answer as well as we can programmatically
  mutate(answer = process_answer(answer, common_langs)) %>%
  # Separate into one language per row
  separate_rows(answer, sep = "\\s*[,/]\\s*") %>%
  # Squish the strings
  mutate_at(vars(answer), str_squish) %>%
  mutate(
    answer = recode_answer(answer),
    category2 = recode_category(category)
  ) %>%
  # Filter out empty category, answer fields
  filter(!str_detect(answer, "^\\s*$")) %>%
  filter(
    nchar(answer) > 0,
    nchar(category) > 4
  )

The last step is to aggregate and count programming language mentions per category.

tweets_lang_counted <-
  tweets_lang_poll %>%
  count(category2, answer, sort = TRUE)

Plot Language Counts by Category

Last, but not least, this function creates the plots for requested categories. One key detail is that bars are ordered within each facet using tidytext’s reorder_within() function. Check out Julia Silge’s excellent blog post on this function: Reordering and facetting for ggplot2.

While the bars are ordered in descending order, I wanted the bar fill color to be consistent across facets to facilitate comparison between the two categories. The color palette is ocean.deep from the pals package, which I found by looking through Emil Hvitfeldt’s Comprehensive list of color palettes in R.

plot_tweets_by_category <- function(
  tweets_lang_counted,
  categories,
  ncol = 2,
  min_count = 10
) {
  tweets_lang_counted %>%
    filter(category2 %in% categories) %>%
    mutate_at(vars(category2), factor, levels = categories) %>%
    group_by(category2) %>%
    arrange(desc(n)) %>%
    filter(n >= min_count) %>%
    top_n(20, n) %>%
    ungroup() %>%
    arrange(category2, answer, desc(n)) %>%
    mutate(
      answer_within = tidytext::reorder_within(answer, n, category2),
      answer = fct_reorder(answer, n, first)
    ) %>%
    ggplot() +
    aes(answer_within, n, fill = answer) +
    geom_col() +
    coord_flip() +
    tidytext::scale_x_reordered(expand = c(0, 0)) +
    discrete_scale("fill", "ocean", function(n) rev(pals::ocean.deep(n + 10))[6:(n+5)]) +
    guides(fill = "none") +
    labs(x = NULL, y = NULL) +
    facet_wrap(~ category2, scales = "free", ncol = ncol) +
    theme_minimal(base_family = "PT Sans", base_size = 18) +
    theme(
      plot.margin = margin(20, 20, 20, 20),
      panel.grid.major.y = element_blank(),
      panel.grid.minor.x = element_blank(),
      axis.ticks.y = element_blank(),
      axis.text.x = element_text(family = "PT Sans Narrow"),
      axis.text.y.left = element_text(margin = margin()),
      panel.spacing.x = unit(3, "line"),
      panel.spacing.y = unit(2, "line")
    )
}

What About You?

If you made it this far, share your programming experiences on Twitter!

Thanks for reading and feel free to share feedback, thoughts, or questions with me on Twitter at @grrrck.