Wordle Guess Helper

By Garrick Aden-Buie in Blog

February 21, 2022

Tags:
R JavaScript dplyr stringr purrr Wordle tidyjs js4shiny
Source:
content/blog/2022/wordle-guessing/index.Rmd

Have you heard of Wordle? Who am I kidding, of course you’ve heard of Wordle! In fact, I’m pretty certain we’re way past peak Wordle at this point.

Here’s a Wordle helper that doesn’t completely take the fun out of the guessing, while also making sure you’ve got a good chance at winning every time.

Type your guesses below and then use the buttons below each letter to report Wordle’s response. Press return to start the next guess. Use delete to remove letters or words you’ve entered.

As soon as you add the results for a new word, the table of next guess candidates will update! Pick wisely.

Intro #

I started this post about a month ago, roughly at the same time that every other person with a blog about doing things with computers also decided to start writing a post about Wordle.

I’ve been tempted to just walk away from this post more than once. After all, since I started writing it, Wordle has been solved more than once, Winston Chang rewrote Wordle in Shiny, and roughly 70 other people wrote Wordle clones or helper apps and packages in R alone. Felienne Hermans wrote a Twitter bot to guess the word from shared game emojis. Someone else wrote a bot to intentionally ruin everyone’s day by spoiling the answer to the next day’s Wordle. (Both bots were eventually suspended by Twitter.) Oh, and Wordle was bought for big money by the New York Times, who fumbled the handoff and lost more than a few players’ word streaks in the transfer.

I should admit up-front that I’ve never really played Wordle. It’s exactly the kind of task that immediately cries out to be automated: I’d apparently much rather spend a month’s worth of after-hours tinkering time to think through and codify a decent strategy than to just think up some words on my own.

And yet I love Wordle. I think it’s awesome. The rules are simple, but deceptively ambiguous. The game play is so concise it can fit in a tweet (even though that’s annoying for accessibility reasons). Still, the UI is simple, intuitive and fun without trying to hack your brain to be addictive. It’s a feel good game.

Another reason to love Wordle: there are so many great programming tasks around Wordle. It’s easy to describe the mechanics, to understand the game play, to look at the app and think: I can do that. Which is why, right now, programmers are hard at work tinkering over word lists or practicing web development in their favorite framework using Wordle.

As an educator, it means you can tailor a Wordle-based programming challenge to be as simple or complicated as needed. Once you start to break down the game, it’s more complicated than it appears at first glance, and there’s so much to choose from. State management, data structures, browser storage, game theory, CSS, user interface design, accessibility. You can go deep on any of these topics.

So if you have a Wordle idea you want to tinker with, I wholeheartedly encourage you to run with it. Let Wordle inspire you to practice using regular expressions with stringr, web scraping with httr, text processing with Python, working with Twitter data with rtweet, or making accessible plots with ggplot2.

What follows here is a bit of a journey. It is not the best strategy for Wordle or even the best way to play. But along the way we’ll learn a few text processing tricks, we’ll write a few functions, and we’ll learn how to move seamlessly from R to the browser in the same document or blog post. (The R code and data I write below create the word data used in the table and app above!)

Let’s look at some words #

Let’s dig in. To get started, I’m using a few of the usual suspects from the tidyverse package. Out of habit, I’ll load the ones I want specifically. (I think I also used tidyr somewhere in here, too.)

library(dplyr)
library(purrr)
library(stringr)

Now we’re ready to load our word list. At first I started with Scrabble’s word list, but it turns out that Wordle included the complete word list in its source code. (You could call it a hack but only in the state of Missouri.)

I used my elite hacker copy-and-paste skills to store Wordle’s word list as a JSON file (165K).

wordle_words <- jsonlite::fromJSON(
  "wordle.json",
  simplifyVector = TRUE
)

It turns out that Wordle maintains two separate lists. One list contains the 2,315 words used as solutions

sample(wordle_words$answers, 5)
## [1] "hippo" "hatch" "baggy" "tatty" "ratio"

and the other contains the 10,657 words that the game considers a valid guess.

sample(wordle_words$words, 5)
## [1] "pagod" "bandy" "teads" "cozed" "rifer"

Do the two word lists overlap?

wordle_words %>%
  reduce(intersect) %>%
  length()
## [1] 0

No, they do not (the intersection of the two word lists is empty). We could make things super easy for ourselves by only considering the words on the solution list, but that would really ruin the fun. So let’s combine the two lists.

words <- unlist(wordle_words)

Now let’s turn those words into data we can work with.

A letter popularity contest #

Popularity by word #

My first thought (and I think it’s many people’s first thought) was to consider the probability that a letter appears in a word. In other words: does R appear in more words than F?

To answer this we can split each word into a vector of letters, take only the unique letters, and then count how many words each letter appears in.

Splitting the word into a vector of letters is something we’ll be doing a lot, and stringr::str_split() or strsplit() can help. The trick is to use an empty string as the split pattern to break apart each string character by character.

str_split(c("unhip", "jeans"), "")
## [[1]]
## [1] "u" "n" "h" "i" "p"
## 
## [[2]]
## [1] "j" "e" "a" "n" "s"

Note that this process takes our vector and gives us a list of vectors, which means we’ll be seeing a lot of purrr’s map() function in this post.

letter_freq <-
  words %>%
  # Split each word into a vector of letters
  str_split("") %>%
  # Keep one of each letter per word
  map(unique) %>%
  # Unlist into a big vector of letters
  unlist() %>%
  # Count the letters (each appearance in a word)
  table() %>%
  # Most popular letters first
  sort(decreasing = TRUE) %>%
  # Turn into frequency table
  `/`(length(words)) %>%
  # Remove attributes from table()
  c()

letter_freq
##           s           e           a           o           r           i 
## 0.457600987 0.439793401 0.410884983 0.301495529 0.301341351 0.276672834 
##           l           t           n           u           d           y 
## 0.240055504 0.233811286 0.214847364 0.187789084 0.177150786 0.156567993 
##           c           p           m           h           g           b 
## 0.148011101 0.145312982 0.144002467 0.131668208 0.118948504 0.117098366 
##           k           w           f           v           z           j 
## 0.111316682 0.079247610 0.076318224 0.051958064 0.030141844 0.022278754 
##           x           q 
## 0.022124576 0.008556892

Note that we only counted each letter once per word, so we now know that R appears in 30% of the words in the word list, while F appears in only 8%. A first guess that includes R would probably be better than one with an F.
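To see why the unique step matters, here is a minimal sketch in base R on a three-word toy list (the toy words and variable names are mine, not part of the real word list). It compares counting every occurrence of a letter with counting each letter once per word.

```r
toy_words <- c("pools", "jeans", "unhip")
chars <- strsplit(toy_words, "")

# Count every occurrence: the double O in "pools" is counted twice
all_counts <- table(unlist(chars))

# Count each letter once per word, as letter_freq does
word_counts <- table(unlist(lapply(chars, unique)))

all_counts[["o"]]
## [1] 2
word_counts[["o"]]
## [1] 1

# Share of toy words that contain an S
word_counts[["s"]] / length(toy_words)
## [1] 0.6666667
```

Only the second count answers the question "how many words contain this letter?", which is the number we want when picking a guess.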

Popularity by position #

Another way to look at letter frequency would be to consider the position of the letter in the word. What if we know that R and F are in the word: which is a more likely choice as the fourth letter?

To do this we…

  • First turn the word list into a tibble with one row per word.
  • Then, using tidyr::separate_rows(), we can add a new column with the letters in each word.
  • Grouping by word and adding a row_number() gives us the position of each letter in the word.
  • Then we can count the number of times each letter occurs in a given position with a new group_by() and summarize() (we could have used count() with another ungroup(), too).
  • Then, if we re-use our letter-word counts from the last step, we can get the number of words that contain the letter in question, so that our frequency effectively answers: given that a word contains R, how often is R the fourth letter?
  • Finally, tidyr::pivot_wider() moves the positions to the columns so the table is easier to read.

letter_freq_pos <-
  tibble(word = words) %>%
  select(word) %>%
  mutate(letter = word) %>%
  tidyr::separate_rows(letter, sep = "") %>%
  filter(letter != "") %>%
  group_by(word) %>%
  mutate(position = row_number()) %>%
  group_by(letter, position) %>%
  summarize(n = n(), .groups = "drop") %>%
  mutate(
    words = letter_freq[letter] * length(!!words),
    freq = n / words
  ) %>%
  select(-n, -words) %>%
  tidyr::pivot_wider(
    names_from = position,
    values_from = freq,
    values_fill = 0,
    names_prefix = "p"
  )

letter_freq_pos
## # A tibble: 26 × 6
##    letter     p1     p2     p3    p4     p5
##    <chr>   <dbl>  <dbl>  <dbl> <dbl>  <dbl>
##  1 a      0.138  0.425  0.232  0.202 0.128 
##  2 b      0.598  0.0533 0.221  0.160 0.0388
##  3 c      0.480  0.0917 0.204  0.214 0.0661
##  4 d      0.298  0.0366 0.170  0.205 0.358 
##  5 e      0.0531 0.285  0.155  0.408 0.267 
##  6 f      0.604  0.0242 0.180  0.235 0.0828
##  7 g      0.413  0.0493 0.236  0.274 0.0927
##  8 h      0.286  0.320  0.0703 0.138 0.217 
##  9 i      0.0460 0.385  0.293  0.245 0.0780
## 10 j      0.699  0.0381 0.159  0.100 0.0104
## # … with 16 more rows

Now we can answer our question about R and F in the fourth position.

letter_freq_pos %>%
  filter(letter %in% c("r", 'f')) %>%
  select(letter, p4)
## # A tibble: 2 × 2
##   letter    p4
##   <chr>  <dbl>
## 1 f      0.235
## 2 r      0.184

So R is the fourth letter in 18% of the words containing R — but F is the fourth letter in 24% of its words.
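The same calculation can be sketched in base R on a toy list, which makes the arithmetic easy to check by hand. The four toy words and all variable names here are my own assumptions for illustration.

```r
toy_words <- c("ferry", "fairy", "offer", "after")
chars <- strsplit(toy_words, "")

# One entry per letter occurrence, tagged with its position in the word
letter <- unlist(chars)
position <- sequence(lengths(chars))
counts <- table(letter, position)

# Number of words containing each letter (counted once per word)
words_with <- table(unlist(lapply(chars, unique)))

# Divide each row of counts by the number of words containing that letter
freq <- counts / as.vector(words_with[rownames(counts)])

# R is the fourth letter in 2 of the 4 toy words that contain an R
freq["r", "4"]
## [1] 0.5
```

Because a letter can occur twice in one word (the two Rs in ferry), a row of this table can add up to more than 1. The same is true of letter_freq_pos above.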

Ideally this information will help us filter guesses when we know that a set of letters are in the solution, but we don’t yet know where.

First Choice #

What word should we guess first? Ideally, we want a word whose answer gives us the most information. Intuitively, if we pick a word that has the most popular letters and each letter is different, we’ll be able to discard or include the most words when Wordle tells us which letters are in or out.

Formally, this calculation is called entropy. It measures how much information is contained in a particular instance of a random process. In this case, words with higher entropy are better guesses because Wordle’s response to them tells us more about the solution.

This is all a little hand-wavy, so I’ll just duck the details and call this number a score. The higher the score, the better the word choice.

To calculate the entropy score, we take a word, split it into its letters, and then look up the probability that each letter appears in a word. Duplicated letters don’t tell us much, so we set second appearances of a letter to the smallest letter probability, which is close to zero. And then we calculate entropy

$$-\sum_{i=1}^{n} p_i \log_2 p_i$$

which in R code is

- sum(p * log(p, base = 2))

where p is a vector of probabilities for a given outcome.
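As a worked example of the formula, here is the arithmetic for the word unhip, one letter at a time. The probabilities are copied by hand from the letter_freq output above, so treat this as a sketch rather than part of the pipeline.

```r
# Letter probabilities for u, n, h, i, p, copied from the letter_freq output
p <- c(
  u = 0.187789084, n = 0.214847364, h = 0.131668208,
  i = 0.276672834, p = 0.145312982
)

# Each letter's term in the sum: -p * log2(p)
contribution <- -p * log2(p)
round(contribution, 3)
##     u     n     h     i     p 
## 0.453 0.477 0.385 0.513 0.404

# The score is the sum of the per-letter terms
unhip_score <- sum(contribution)
round(unhip_score, 2)
## [1] 2.23
```

One thing to notice: the term -p * log2(p) is largest at p = 1/e (about 0.37), so this score does not simply reward the most frequent letter. With the frequencies above, S (0.46) contributes slightly less than O (0.30).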

We can wrap all of that up into a function score_entropy():

score_entropy <- function(word) {
  chars <- str_split(word, "")[[1]]
  p <- letter_freq[chars]
  # we learn something but not much from duplicated letters
  p[duplicated(chars)] <- min(letter_freq)
  - sum(p * log(p, base = 2))
}

Notice that score_entropy() isn’t vectorized, so we’ll have to use a map() function to call it over a vector of words. We can be even more specific and use map_dbl() since we know that score_entropy() returns a number.

c("unhip", "jeans", "pools") %>%
  set_names() %>%
  map_dbl(score_entropy)
##    unhip    jeans    pools 
## 2.232150 2.163479 1.994939

This tells us, broadly, that unhip is a better choice than jeans and pools is worse than either. (Intuitively: you don’t learn much from the second O.)

Let’s use this to create a table of words and their associated entropy scores. Taking a peek at the highest scoring words tells us…

words_first_choice <-
  tibble(word = words) %>%
  mutate(score = map_dbl(word, score_entropy)) %>%
  arrange(desc(score))

words_first_choice
## # A tibble: 12,972 × 2
##    word  score
##    <chr> <dbl>
##  1 arose  2.61
##  2 aeros  2.61
##  3 soare  2.61
##  4 arise  2.60
##  5 raise  2.60
##  6 aesir  2.60
##  7 reais  2.60
##  8 serai  2.60
##  9 osier  2.59
## 10 realo  2.59
## # … with 12,962 more rows

… that according to this measure, the best first-choice words are arose, aeros, and soare. arose uses all five of the letters that most commonly appear in a word, and is also probably (okay, it is) on the answers list, so hello new first word choice!

Second Choice #

After your first choice, you know up to three pieces of additional information. Letters in your guess marked with a

  1. dark square aren’t in the solution
  2. yellow square are in the solution but not where you guessed
  3. green square are in the solution and are where you guessed

None of the letters are in the solution #

What if you guessed arose and got five gray boxes telling you that none of those letters appear in the solution?

a (absent) r (absent) o (absent) s (absent) e (absent)

We need to discard any words with A, R, O, S, or E in them. To do this, we’ll write a small function str_has_none_of() that takes a vector of words and a vector of letters, and checks, for each word, that none of the letters appear in it. Technically, we use our same str_split() trick to split each word into a vector of letters and then check that the intersection of word letters and unwanted letters is empty.

str_has_none_of <- function(words, letters) {
  words <- str_split(words, "")
  map_lgl(words, ~ length(intersect(letters, .x)) == 0)
}

Using this function, we can quickly reduce our word list from 12,972 to 577 words.

words_first_choice %>%
  filter(str_has_none_of(word, c("a", "r", "o", "s", "e")))
## # A tibble: 577 × 2
##    word  score
##    <chr> <dbl>
##  1 unlit  2.43
##  2 until  2.43
##  3 linty  2.39
##  4 clint  2.38
##  5 unlid  2.38
##  6 culti  2.36
##  7 tulip  2.35
##  8 uplit  2.35
##  9 unity  2.35
## 10 lindy  2.34
## # … with 567 more rows

This new word list suggests that unlit or until would be a good next choice, so we’ll go with until. And if none of the letters in arose and until appear in the solution…

a (absent) r (absent) o (absent) s (absent) e (absent)
u (absent) n (absent) t (absent) i (absent) l (absent)

letters_guess <- str_split("arose until", "")[[1]]

words_first_choice %>%
  filter(str_has_none_of(word, letters_guess))
## # A tibble: 3 × 2
##   word  score
##   <chr> <dbl>
## 1 pygmy  1.65
## 2 hyphy  1.33
## 3 gyppy  1.31

then your answer is most definitely one of pygmy, hyphy, or gyppy.
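As an aside, the same "none of these letters" filter can be sketched with a single regular expression character class in base R. This is an alternative to str_has_none_of(), not what the rest of this post uses, and the toy word vector is made up for illustration.

```r
toy_words <- c("pygmy", "until", "arose", "tulip", "jeans")
absent <- c("a", "r", "o", "s", "e")

# Build the character class "[arose]": it matches any one of the absent letters
pattern <- paste0("[", paste(absent, collapse = ""), "]")

# Keep only the words in which the pattern finds nothing
keep <- !grepl(pattern, toy_words)
toy_words[keep]
## [1] "pygmy" "until" "tulip"
```

This works because the letters are plain lowercase characters. If the vector could contain characters that are special inside a character class, such as ^ or -, they would need escaping first.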

Right letter, wrong place #

If you learn something from the guess, though, you can filter the word list based on the information you just learned.

Say we guess arose and Wordle reveals that R and O appear in the solution, just not where we put them.

a (absent) r (in solution, wrong position) o (in solution, wrong position) s (absent) e (absent)

We now know that the solution:

  1. Doesn’t have A, S or E
  2. Does contain R and O
  3. Doesn’t have R as the 2nd letter or O as the 3rd.

We’ve already implemented the first step by discarding words with str_has_none_of(). We also need a similar version called str_has_all_of() to keep only words that have letters we know are in the solution.

str_has_all_of <- function(words, letters) {
  words <- str_split(words, "")
  map_lgl(words, ~ length(setdiff(letters, .x)) == 0)
}

str_has_all_of("rhino", c("r", "o"))
## [1] TRUE

And finally we can use regular expressions to keep track of the third piece of information:

.[^r][^o]..

A . means any letter at that spot in the word (other than the ones we’ve excluded). The [] indicate a set of options that could be present at a location in the string. The opening ^ negates the selection, so [^r] means a character that isn’t r.
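To check the pattern by hand, here is a small sketch that runs it over four hand-picked words with base R’s grepl(). The toy words are mine. Since every word has exactly five letters and the pattern describes exactly five characters, it can only match the whole word, so no anchors are needed.

```r
pattern <- ".[^r][^o].."
toy_words <- c("intro", "rhino", "broth", "chore")

# broth fails because R is the 2nd letter; chore fails because O is the 3rd
matches <- grepl(pattern, toy_words)
setNames(matches, toy_words)
## intro rhino broth chore 
##  TRUE  TRUE FALSE FALSE
```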

words_first_choice %>%
  filter(
    str_has_none_of(word, c("a", "s", "e")),
    str_has_all_of(word, c("r", "o")),
    str_detect(word, ".[^r][^o]..")
  )
## # A tibble: 142 × 2
##    word  score
##    <chr> <dbl>
##  1 lirot  2.54
##  2 intro  2.52
##  3 nitro  2.52
##  4 nidor  2.47
##  5 roily  2.47
##  6 loric  2.46
##  7 toric  2.45
##  8 milor  2.45
##  9 corni  2.44
## 10 porin  2.44
## # … with 132 more rows

lirot is an unusual word, so let’s choose the next word on the list: intro.

a (absent) r (in solution, wrong position) o (in solution, wrong position) s (absent) e (absent)
i (absent) n (absent) t (correct) r (in solution, wrong position) o (in solution, wrong position)

Wordle thinks and tells us that we have T in the right spot! Also, we now know that I and N aren’t in the solution, and we still haven’t got R and O in the right place.

Right letter, right place #

We can repeat the step above, but using a new regular expression:

.[^r]t[^r][^o]

Notice that we know a little more about where R and O can’t be, but importantly the t in the middle position ensures we find words with T in the right place.

This leaves us with a few good choices:

words_first_choice %>%
  filter(
    str_has_none_of(word, c("a", "s", "e", "i", "n")),
    str_has_all_of(word, c("r", "o", "t")),
    str_detect(word, ".[^r]t[^r][^o]")
  )
## # A tibble: 4 × 2
##   word  score
##   <chr> <dbl>
## 1 rotch  2.33
## 2 tutor  2.05
## 3 motor  1.99
## 4 rotor  1.65

rotch seems very unlikely, so we can pick from tutor, motor and rotor. But notice that these share a small set of the same letters. In a sense, we might ask ourselves a new question. Which is the more likely starting combination: tu, mo or ro?

At this point, you could just guess. It is a game after all! But no, let’s power forward and add more complexity to this blog post.

What if we switched our scoring at this point and considered the position of the letters in the candidate words? Doing something medium-naive, let’s frame this as: what’s the probability of T in the first position and U in the second and so on…

score_by_position <- function(word) {
  chars <- str_split(word, "")[[1]]

  res <- c()
  for (i in seq_along(chars)) {
    pos_alpha <- which(letters == chars[i])
    p <- letter_freq_pos[[str_c("p", i)]][pos_alpha]
    res <- c(res, p)
  }

  prod(res)
}

words_score_pos <-
  tibble(word = words) %>%
  mutate(
    score_pos = map_dbl(word, score_by_position),
    score_pos = score_pos / diff(range(score_pos))
  ) %>%
  arrange(desc(score_pos))

words_score_pos
## # A tibble: 12,972 × 2
##    word  score_pos
##    <chr>     <dbl>
##  1 foxes     1.00 
##  2 boxes     0.991
##  3 jones     0.864
##  4 juves     0.808
##  5 coxes     0.795
##  6 faxes     0.792
##  7 poxes     0.754
##  8 fones     0.746
##  9 bones     0.739
## 10 fixes     0.719
## # … with 12,962 more rows

If we join this with our “new information” score, we now have two scores to choose from:

words_first_choice %>%
  filter(
    str_has_none_of(word, c("a", "s", "e", "i", "n")),
    str_has_all_of(word, c("r", "o", "t")),
    str_detect(word, ".[^r]t[^r][^o]")
  ) %>%
  left_join(words_score_pos) %>%
  arrange(desc(score_pos))
## # A tibble: 4 × 3
##   word  score score_pos
##   <chr> <dbl>     <dbl>
## 1 motor  1.99    0.0304
## 2 tutor  2.05    0.0200
## 3 rotch  2.33    0.0199
## 4 rotor  1.65    0.0132

Now we see that motor and tutor are the most likely words based on their position. We guess motor… and we’re right!

a (absent) r (in solution, wrong position) o (in solution, wrong position) s (absent) e (absent)
i (absent) n (absent) t (correct) r (in solution, wrong position) o (in solution, wrong position)
m (correct) o (correct) t (correct) o (correct) r (correct)

It only took three guesses! It’s almost like I planned this example to work out just like I wanted.

Generalizing #

Okay, let’s do this for any number of guesses. First, let’s join our scored words into a single data frame.

words_scored <-
  left_join(
    words_first_choice,
    words_score_pos,
    by = "word"
  )

Then, we need a function that takes our guesses and results and generalizes them into the pieces of information our guesses reveal about the solution. This function is going to take a vector of guesses and a vector of results. The guesses are just the words we guessed, but we’ll need to invent a syntax to concisely report the results. Here’s the syntax I decided to use:

  • . means the letter is absent
  • - means the letter is present (wrong place)
  • + means the letter is correct (right place)
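To make the syntax concrete, here is a rough sketch of a helper that produces the result string for a guess when you already know the solution, which is handy for replaying games. The name wordle_result() is hypothetical. I am assuming the usual rule for repeated letters: exact matches are marked first, and each remaining letter of the solution can only be claimed once.

```r
# Hypothetical helper: report a guess against a known solution
# using "." (absent), "-" (present, wrong place) and "+" (correct)
wordle_result <- function(guess, solution) {
  g <- strsplit(guess, "")[[1]]
  s <- strsplit(solution, "")[[1]]
  stopifnot(length(g) == length(s))

  result <- rep(".", length(g))

  # Pass 1: letters in the right place
  exact <- g == s
  result[exact] <- "+"

  # Solution letters that are still unaccounted for
  remaining <- s[!exact]

  # Pass 2: present but misplaced letters, using up one solution letter each
  for (i in which(!exact)) {
    hit <- match(g[i], remaining)
    if (!is.na(hit)) {
      result[i] <- "-"
      remaining <- remaining[-hit]
    }
  }

  paste(result, collapse = "")
}

wordle_result("arose", "motor")
## [1] ".--.."
wordle_result("intro", "motor")
## [1] "..+--"
```

The second pass is what keeps a guess like rotor from reporting two misplaced Rs when the solution only has one.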

In broad strokes, the function will

  • Pull out the correct letters and their positions into exact so we can pick out words with letters in those spots.
  • Pull out present letters and their positions into exclude so we can compose the regular expression to filter out words that have these letters in those places.
  • Add the present but wrong place letters to bucket_keep, a bucket of letters that we know are in the solution.
  • And add any absent letters to bucket_discard so we can filter out words that have any of these letters.
  • Finally, compose the regular expression pattern from exact and exclude, and then return the regexp and the letters to keep and discard.

#' @param guesses A vector of words that you have guessed
#' @param results A vector of results for each guess using `.` for a miss, `-`
#'   for a letter in the solution that isn't in the right place and `+` for a
#'   letter that's in the right spot.
summarize_guesses <- function(guesses, results) {
  stopifnot(all(str_length(c(guesses, results)) == 5))

  guesses <- str_split(guesses, "")
  results <- str_split(results, "")

  exclude <- character(5)
  exact <- character(5)
  bucket_keep <- c()
  bucket_discard <- c()

  for (i in seq_along(guesses)) {
    g <- guesses[[i]]
    r <- results[[i]]

    if (any(r == "+")) {
      exact[r == "+"] <- g[r == "+"]
      # letters in the right spot are also letters we must keep
      bucket_keep <- c(bucket_keep, g[r == "+"])
    }
    if (any(r == "-")) {
      bucket_keep <- c(bucket_keep, g[r == "-"])
      exclude[r == "-"] <- paste0(exclude[r == "-"], g[r == "-"])
    }
    if (any(r == ".")) {
      bucket_discard <- c(bucket_discard, g[r == "."])
    }
  }

  exclude[exclude != ""] <- paste0("[^", exclude[exclude != ""], "]")
  exact[exact == ""] <- NA_character_
  exclude[exclude == ""] <- NA_character_

  pattern <- coalesce(coalesce(exact, exclude), ".")

  # Say you guess a word with two Ts,
  # but there's only one T in the solution.
  # T will appear on keep and discard bucket,
  # so we need to explicitly keep it.
  # (we could use that info, though, e.g. at most 1 T)
  bucket_discard <- setdiff(bucket_discard, bucket_keep)

  list(
    discard = unique(bucket_discard),
    keep = unique(bucket_keep),
    pattern = str_c(pattern, collapse = "")
  )
}

Remember when we guessed arose and got this result?

a (absent) r (in solution, wrong position) o (in solution, wrong position) s (absent) e (absent)

Our new function summarizes the information we’ve learned from this guess.

summarize_guesses(
  guesses = "arose",
  results = ".--.."
)
## $discard
## [1] "a" "s" "e"
##
## $keep
## [1] "r" "o"
##
## $pattern
## [1] ".[^r][^o].."

Then we guessed intro and got this result.

a (absent) r (in solution, wrong position) o (in solution, wrong position) s (absent) e (absent)
i (absent) n (absent) t (correct) r (in solution, wrong position) o (in solution, wrong position)

And again we have this summary.

guess_results <-
  summarize_guesses(
    guesses = c("arose", "intro"),
    results = c(".--..", "..+--")
  )

guess_results
## $discard
## [1] "a" "s" "e" "i" "n"
##
## $keep
## [1] "r" "o" "t"
## 
## $pattern
## [1] ".[^r]t[^r][^o]"

To get the remaining possible words, we can use this information to filter down to the words that

  1. have none of the $discard letters
  2. have all of the $keep letters
  3. match the regular expression $pattern.

words_scored %>%
  filter(
    str_has_none_of(word, guess_results$discard),
    str_has_all_of(word, guess_results$keep),
    str_detect(word, guess_results$pattern)
  )
## # A tibble: 4 × 3
##   word  score score_pos
##   <chr> <dbl>     <dbl>
## 1 rotch  2.33    0.0199
## 2 tutor  2.05    0.0200
## 3 motor  1.99    0.0304
## 4 rotor  1.65    0.0132

All together now #

Now that we know how to summarize and use the guess results to filter our next word choices, we can do this in one step with another small function, score_next_guess().

score_next_guess <- function(guesses, results) {
  guess_results <- summarize_guesses(guesses, results)

  words_scored %>%
    filter(
      str_has_none_of(word, guess_results$discard),
      str_has_all_of(word, guess_results$keep),
      str_detect(word, guess_results$pattern)
    )
}

Having guessed arose and intro, what would happen if we guessed rotch next?

a (absent) r (in solution, wrong position) o (in solution, wrong position) s (absent) e (absent)
i (absent) n (absent) t (correct) r (in solution, wrong position) o (in solution, wrong position)
r (in solution, wrong position) o (correct) t (correct) c (absent) h (absent)

score_next_guess(
  guesses = c("arose", "intro", "rotch"),
  results = c(".--..", "..+--", "-++..")
)
## # A tibble: 1 × 3
##   word  score score_pos
##   <chr> <dbl>     <dbl>
## 1 motor  1.99    0.0304

From rotch we learn that the first letter isn’t R, but the second letter is O, which leaves us just one choice: motor.

m (correct) o (correct) t (correct) o (correct) r (correct)

Guessing Wordle words in real life #

Beginner’s Luck #

I wrapped up the score_next_guess() function on January 16th, 2022, which happened to be the easiest Wordle day of any day I’ve “played”. But it was a nice motivator to feel like I had spent my Sunday tinkering time well.

Opening with arose led to a pleasant surprise.

a (in solution, wrong position) r (in solution, wrong position) o (in solution, wrong position) s (in solution, wrong position) e (absent)

From 12,972 words down to 37 words with our first guess. Nice!

# 2022-01-16
score_next_guess(
  guesses = c("arose"),
  results = c("----.")
)
## # A tibble: 37 × 3
##    word  score score_pos
##    <chr> <dbl>     <dbl>
##  1 solar  2.58    0.0327
##  2 soral  2.58    0.0327
##  3 ratos  2.58    0.0404
##  4 rotas  2.58    0.0576
##  5 sorta  2.58    0.0401
##  6 taros  2.58    0.102 
##  7 toras  2.58    0.145 
##  8 sonar  2.56    0.0416
##  9 roans  2.56    0.0923
## 10 roads  2.53    0.0669
## # … with 27 more rows

Let’s just pick the first word on the list: solar.

a (in solution, wrong position) r (in solution, wrong position) o (in solution, wrong position) s (in solution, wrong position) e (absent)
s (correct) o (correct) l (correct) a (correct) r (correct)

Very nice!

Problematic words #

In working on this, I ran into more than a few posts that had trouble with a few more obscure words, like igloo and ferry.

igloo

How many guesses would it take for us to get to igloo?

Round 1

a (absent) r (absent) o (in solution, wrong position) s (absent) e (absent)

Opening with arose is helpfulish.

score_next_guess(
  guesses = c("arose"),
  results = c("..-..")
)
## # A tibble: 463 × 3
##    word  score score_pos
##    <chr> <dbl>     <dbl>
##  1 doilt  2.46  0.0680  
##  2 indol  2.45  0.000646
##  3 tondi  2.44  0.0195  
##  4 lotic  2.43  0.00802 
##  5 noily  2.42  0.0711  
##  6 pilot  2.42  0.0501  
##  7 colin  2.41  0.0801  
##  8 nicol  2.41  0.00613 
##  9 tonic  2.41  0.0198  
## 10 ontic  2.41  0.000670
## # … with 453 more rows

Many of the words are obviously not the answer. Pilot is the first reasonable word on the list, and its score is relatively similar to the other top word choices, so I’d go with pilot.

Round 2

a (absent) r (absent) o (in solution, wrong position) s (absent) e (absent)
p (absent) i (in solution, wrong position) l (correct) o (correct) t (absent)

Picking pilot is a good choice!

score_next_guess(
  guesses = c("arose", "pilot"),
  results = c("..-..", ".-++.")
)
## # A tibble: 1 × 3
##   word  score score_pos
##   <chr> <dbl>     <dbl>
## 1 igloo  1.95  0.000268

Round 3

a (absent) r (absent) o (in solution, wrong position) s (absent) e (absent)
p (absent) i (in solution, wrong position) l (correct) o (correct) t (absent)
i (correct) g (correct) l (correct) o (correct) o (correct)

🎉 Great work!

ferry

Apparently there was a general furor about ferry when it was the Wordle solution of the day. Let’s see how long it takes us to get to that word.

Round 1

a (absent) r (in solution, wrong position) o (absent) s (absent) e (in solution, wrong position)

Opening with arose narrows down our word choices to 357 words.

score_next_guess(
  guesses = c("arose"),
  results = c(".-..-")
)
## # A tibble: 357 × 3
##    word  score score_pos
##    <chr> <dbl>     <dbl>
##  1 liter  2.54  0.0250  
##  2 relit  2.54  0.0180  
##  3 tiler  2.54  0.0485  
##  4 liner  2.53  0.0425  
##  5 inert  2.52  0.000951
##  6 inter  2.52  0.00199 
##  7 niter  2.52  0.0157  
##  8 uteri  2.50  0.000332
##  9 idler  2.49  0.000788
## 10 riled  2.49  0.0604  
## # … with 347 more rows

liter is both a word and at the top of our list, so it’s an easy next choice.

Round 2

a (absent) r (in solution, wrong position) o (absent) s (absent) e (in solution, wrong position)
l (absent) i (absent) t (absent) e (in solution, wrong position) r (in solution, wrong position)

The word list is now full of words with similar patterns, so let’s sort by position score to help us choose.

score_next_guess(
  guesses = c("arose", "liter"),
  results = c(".-..-", "...--")
) %>%
  arrange(desc(score_pos))
## # A tibble: 50 × 3
##    word  score score_pos
##    <chr> <dbl>     <dbl>
##  1 jerky  1.94     0.334
##  2 ferny  2.22     0.235
##  3 perky  2.22     0.218
##  4 jerry  1.64     0.177
##  5 query  1.97     0.153
##  6 ferry  1.80     0.153
##  7 berry  1.88     0.151
##  8 pervy  2.09     0.145
##  9 perdy  2.31     0.128
## 10 kerky  1.87     0.125
## # … with 40 more rows

Round 3

a (absent) r (in solution, wrong position) o (absent) s (absent) e (in solution, wrong position)
l (absent) i (absent) t (absent) e (in solution, wrong position) r (in solution, wrong position)
j (absent) e (correct) r (correct) k (absent) y (correct)

Now we’re down to 17 words to choose from. Still complicated. But if we arrange by position score, our top two choices are ferny and ferry.

You can see where this is headed, but let’s pretend we had no idea. Which would you pick?

score_next_guess(
  guesses = c("arose", "liter", "jerky"),
  results = c(".-..-", "...--", ".++.+")
) %>%
  arrange(desc(score_pos))
## # A tibble: 17 × 3
##    word  score score_pos
##    <chr> <dbl>     <dbl>
##  1 ferny  2.22    0.235 
##  2 ferry  1.80    0.153 
##  3 berry  1.88    0.151 
##  4 pervy  2.09    0.145 
##  5 perdy  2.31    0.128 
##  6 germy  2.23    0.122 
##  7 derny  2.38    0.116 
##  8 perry  1.92    0.115 
##  9 mercy  2.27    0.109 
## 10 merry  1.92    0.0937
## 11 verry  1.74    0.0907
## 12 derry  1.96    0.0753
## 13 herry  1.91    0.0723
## 14 derby  2.27    0.0655
## 15 herby  2.21    0.0629
## 16 nervy  2.16    0.0371
## 17 nerdy  2.38    0.0328

Round 4

a (absent) r (in solution, wrong position) o (absent) s (absent) e (in solution, wrong position)
l (absent) i (absent) t (absent) e (in solution, wrong position) r (in solution, wrong position)
j (absent) e (correct) r (correct) k (absent) y (correct)
f (correct) e (correct) r (correct) n (absent) y (correct)

Now we’re down to 1 word to choose from.

score_next_guess(
  guesses = c("arose", "liter", "jerky", "ferny"),
  results = c(".-..-", "...--", ".++.+", "+++.+")
) %>%
  arrange(desc(score_pos))
## # A tibble: 1 × 3
##   word  score score_pos
##   <chr> <dbl>     <dbl>
## 1 ferry  1.80     0.153

Round 5

a (absent) r (in solution, wrong position) o (absent) s (absent) e (in solution, wrong position)
l (absent) i (absent) t (absent) e (in solution, wrong position) r (in solution, wrong position)
j (absent) e (correct) r (correct) k (absent) y (correct)
f (correct) e (correct) r (correct) n (absent) y (correct)
f (correct) e (correct) r (correct) r (correct) y (correct)

🎉 We did it! 5 isn’t bad, especially considering the terrible choices we had in round 3.
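
A quick aside on those result strings: each one has a character per letter, with + for a correct letter, - for a letter that's in the solution but in the wrong position, and . for an absent letter. Here's a minimal sketch, in the JavaScript we're about to switch to, of how those strings could be produced from a guess and a known solution. The wordleResult() helper is hypothetical (it isn't part of this post's code), and it assumes the usual Wordle rule that a repeated letter is only flagged as many times as it appears in the solution.

```javascript
// Sketch: encode Wordle's feedback for a guess as a 5-character string.
// '+' = correct spot, '-' = in the solution but elsewhere, '.' = absent.
// `wordleResult` is a hypothetical helper, not a function from this post.
function wordleResult (guess, solution) {
  const result = Array(guess.length).fill('.')
  const unmatched = {} // counts of solution letters not matched exactly

  // First pass: mark exact matches, count the leftover solution letters
  for (let i = 0; i < guess.length; i++) {
    if (guess[i] === solution[i]) {
      result[i] = '+'
    } else {
      unmatched[solution[i]] = (unmatched[solution[i]] || 0) + 1
    }
  }

  // Second pass: mark misplaced letters while unmatched copies remain
  for (let i = 0; i < guess.length; i++) {
    if (result[i] === '+') continue
    if (unmatched[guess[i]] > 0) {
      result[i] = '-'
      unmatched[guess[i]] -= 1
    }
  }

  return result.join('')
}

// Replay the five guesses from the rounds above against "ferry"
const rounds = ['arose', 'liter', 'jerky', 'ferny', 'ferry']
  .map(guess => wordleResult(guess, 'ferry'))
console.log(rounds)
// [ '.-..-', '...--', '.++.+', '+++.+', '+++++' ]
```

The first four strings match the results passed to score_next_guess() in the rounds above, which is a nice sanity check on the encoding.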

Make it an app #

It’s awesome being able to run R code to test things out, but it’s also a little tedious. Since we’ve done the heavy lifting of prepping and scoring words, it’d be great if we could have a little web app that would

  • Let us input our guesses and results
  • Show us possible words after each round

And since I’m writing this blog post in R Markdown via blogdown, I can do it all right here!

Move the data from R to the web #

The first thing we need to do is save our data in a way that it can be accessed by JavaScript in the browser. To do this, we’ll take our words_scored table and use jsonlite::write_json() to save the data frame as JSON.

words_scored %>%
  mutate(across(starts_with("score"), ~ round(.x, digits = 2))) %>%
  jsonlite::write_json("wordle-scored.json")

Now we have the data in a JSON file (that you can download if you want).

But to make life even easier, I’m going to use a trick I learned from htmlwidgets. What we can do is embed the JSON file, which is only 589K, in a <script type="application/json"> tag with a specific id that makes it easy to find later on.

htmltools::tags$script(
  id = "words-scored",
  type = "application/json",
  readLines("wordle-scored.json")
)

Now that we have the data in a place where we can get it, let’s switch gears and write some JavaScript!

Start working in JavaScript #

Here’s the cool thing: from here on out, the actual computation of the rest of the blog post is done in your browser. To facilitate this, I’ll use an extension I built for knitr for literate JavaScript programming with the js4shiny package.

js4shiny

Setting up literate JavaScript in blogdown is pretty straightforward thanks to a little helper function from js4shiny.

js4shiny::html_setup_blogdown(stylize = "none")

tidyjs

The other cool thing we’ll use is tidyjs. It’s a really neat JavaScript library that makes it easy to work with data frames in the browser. If you squint really hard, it’s remarkably similar to the tidyverse, just with a JavaScript spin.

I wrapped tidyjs in an R package that automatically stays up to date with the latest version of tidyjs. To use tidyjs, we just need to call use_tidyjs().

tidyjs::use_tidyjs()

Now that we’ve included tidyjs in this page, we can finally switch to writing JavaScript instead of R.

First, we need to import a couple of functions from tidyjs that we’re going to want to use. With tidyjs, all transformations are wrapped in a call to tidy(), so we have to import tidy. We also need filter() and sliceMax() for easy filtering.

const { tidy, filter, sliceMax } = Tidy

Load our data

The next step is to find the JSON data that we just serialized and stashed in our page. We can use document.getElementById() to find the element with the id 'words-scored', and then grab the JSON text itself from the .innerText property of that object. Finally, we call JSON.parse() on the JSON text to parse it into a JavaScript object.

wordsScored = JSON.parse(
  document.getElementById('words-scored').innerText
)
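
If you borrow this trick for your own page, it can be worth failing loudly when the tag is missing or the JSON is mangled. Here's a minimal sketch of a more defensive loader. The readEmbeddedJson() helper and the fakeDocument stand-in are hypothetical names I'm using for illustration, and the sketch reads textContent rather than innerText, which should give the same text for a script tag.

```javascript
// Sketch: read JSON embedded in a <script type="application/json"> tag,
// failing loudly if the element is missing or its contents are malformed.
// `readEmbeddedJson` is a hypothetical helper; `doc` is anything with a
// getElementById() method, which in the browser would be `document`.
function readEmbeddedJson (doc, id) {
  const el = doc.getElementById(id)
  if (!el) {
    throw new Error(`No element with id '${id}' found`)
  }
  try {
    // textContent is the raw text of the node
    return JSON.parse(el.textContent)
  } catch (err) {
    throw new Error(`Element '${id}' does not hold valid JSON: ${err.message}`)
  }
}

// Stand-in for `document` so the sketch also runs outside a browser.
// The single row mimics the shape written by jsonlite::write_json().
const fakeDocument = {
  getElementById: id => id === 'words-scored'
    ? { textContent: '[{"word":"ferry","score":1.8,"score_pos":0.15}]' }
    : null
}

const demoWords = readEmbeddedJson(fakeDocument, 'words-scored')
console.log(demoWords[0].word)
```

In the browser the call would be readEmbeddedJson(document, 'words-scored').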

Preview the data

Here’s a quick preview of the data. In tidyjs you wrap a pipeline in tidy() and then each additional argument to tidy() is the next step in the pipe. To make it look a little more familiar to R users, I’ve added the %>% in the comments.

tidy(
  wordsScored, // %>%
  sliceMax(5, 'score')
)
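
If the tidy() pipeline feels a bit magical, here's roughly what that sliceMax() step amounts to in plain JavaScript: sort descending by a column and keep the first few rows. This is only a sketch. topN() is a hypothetical helper, the sample rows are copied from the tables earlier in this post, and I'm not making any promises that it handles ties the same way tidyjs does.

```javascript
// Sketch: a plain-JavaScript stand-in for sliceMax(n, key).
// `topN` is a hypothetical helper, not part of tidyjs.
function topN (rows, n, key) {
  // copy before sorting so the input array is left untouched
  return [...rows].sort((a, b) => b[key] - a[key]).slice(0, n)
}

// A few rows borrowed from the score tables earlier in the post
const sample = [
  { word: 'ferny', score: 2.22 },
  { word: 'ferry', score: 1.80 },
  { word: 'nerdy', score: 2.38 },
  { word: 'berry', score: 1.88 }
]

console.log(topN(sample, 2, 'score').map(d => d.word))
// [ 'nerdy', 'ferny' ]
```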

Same song, different dance #

Summarizing guesses

Next, we translate summarize_guesses() from R to summarizeGuesses() in JavaScript.

function summarizeGuesses ({ guesses, results }) {
  // Check that all guesses and results have 5 characters
  const allComplete = [...guesses, ...results].every(s => s.length == 5)
  if (!allComplete) {
    console.error('All guesses and results must have 5 characters.')
    return
  }

  // R: str_split(x, '')
  guesses = guesses.map(s => s.split(''))
  results = results.map(s => s.split(''))

  let exclude = Array(5).fill('')
  let exact = Array(5).fill('')
  let keep = []
  let discard = []

  for (let i = 0; i < guesses.length; i++) {
    let g = guesses[i] // g: an array of 5 letters of a guess
    let r = results[i] // r: an array of 5 letters of the result

    for (let j = 0; j < r.length; j++) {
      if (r[j] == '+') {
        // this letter is exactly right
        exact[j] = g[j]
        keep.push(g[j])
      } else if (r[j] == '-') {
        // this letter is included, wrong place
        keep.push(g[j])
        // so exclude it from this position
        exclude[j] += g[j]
      } else {
        // this letter isn't in the solution
        discard.push(g[j])
      }
    }
  }

  // build up the regex pattern blending `exact` and `exclude`
  const pattern = Array(5).fill('.')
  for (let i = 0; i < 5; i++) {
    if (exact[i] != '') {
      pattern[i] = exact[i]
    } else if (exclude[i] != '') {
      pattern[i] = `[^${exclude[i]}]`
    }
  }

  discard = discard.filter(x => !keep.includes(x))
  return {discard, keep, pattern: pattern.join('')}
}

Here’s a quick preview of summarizeGuesses().

let summary = summarizeGuesses({
  guesses: ["arose", "indol"],
  results: ["..-..", "+..+-"]
})
console.log(summary)
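
For reference, here's what I get when I trace that call by hand through the function above. Treat it as a sketch to compare against your own console rather than captured output. The candidate words are ones I picked for illustration.

```javascript
// Hand-traced summary for guesses "arose" and "indol" with the
// results "..-.." and "+..+-" (an assumption to verify in your console)
const expectedSummary = {
  discard: ['a', 'r', 's', 'e', 'n', 'd'],
  keep: ['o', 'i', 'o', 'l'],
  pattern: 'i.[^o]o[^l]'
}

// The pattern only encodes positions: 'i' first, 'o' fourth,
// no 'o' in the third spot and no 'l' in the fifth.
const re = new RegExp(expectedSummary.pattern)
const patternChecks = ['igloo', 'idiom', 'idols'].map(w => [w, re.test(w)])
console.log(patternChecks)
// "idiom" passes the pattern even though it has a discarded 'd' and
// lacks the required 'l', so the pattern alone is not enough; the
// discard and keep piles have to be checked separately.
```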

Searching for the next word

And then we need to do the same for score_next_guess(). Of course, at this point I’m older and wiser and choose a better name: searchNextGuess().

function searchNextGuess ({ guesses, results }) {
  const guessResult = summarizeGuesses({guesses, results})

  return tidy(
    wordsScored,
    // discard words that contain a letter in the discard pile
    filter(d => !guessResult.discard.some(l => d.word.includes(l))),
    // keep only words that have all letters in the keep pile
    filter(d => guessResult.keep.every(l => d.word.includes(l))),
    // keep words that are consistent with results to date
    filter(d => RegExp(guessResult.pattern).test(d.word))
  )
}
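
For anyone who'd rather skip the tidyjs dependency, the same three filters can be sketched with nothing but Array.prototype.filter. filterCandidates() is a hypothetical helper, the tiny word list is made up for the example, and the summary object is what I get by hand-tracing summarizeGuesses() for the arose, liter, jerky rounds from the R section.

```javascript
// Sketch: the three filters from searchNextGuess() using only
// Array.prototype.filter. `filterCandidates` is a hypothetical helper.
function filterCandidates (words, { discard, keep, pattern }) {
  const re = new RegExp(pattern) // compile once rather than once per word
  return words
    // drop words containing any discarded letter
    .filter(w => !discard.some(l => w.includes(l)))
    // keep words containing every letter in the keep pile
    .filter(w => keep.every(l => w.includes(l)))
    // keep words whose letter positions fit the pattern
    .filter(w => re.test(w))
}

// Hand-traced summary for "arose", "liter", "jerky" with the results
// ".-..-", "...--", ".++.+" (an assumption, not captured output)
const handTraced = {
  discard: ['a', 'o', 's', 'l', 'i', 't', 'j', 'k'],
  keep: ['r', 'e', 'e', 'r', 'e', 'r', 'y'],
  pattern: '.er[^e]y'
}

const survivors = filterCandidates(
  ['ferry', 'ferny', 'jerky', 'perky', 'berry', 'terry', 'nerdy'],
  handTraced
)
console.log(survivors)
// [ 'ferry', 'ferny', 'berry', 'nerdy' ]
```

The four survivors all appear in the 17-word table from Round 3, while jerky, perky and terry drop out because of the discarded j, k and t.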

Let’s prove to ourselves that these functions work.

let next = searchNextGuess({
  guesses: ["arose", "indol"],
  results: ["..-..", "+..+-"]
})

console.log(`There is ${next.length} word available for our next guess:`)
console.table(next[0])

Let’s try again. What if we chose a different second guess?

let rounds = {
  guesses: ["arose", "intro"],
  results: [".--..", "..+--"]
}
let next = searchNextGuess(rounds)

console.log('Guess summary ----')
console.log(summarizeGuesses(rounds))

console.log('Next word choices ----')
next.forEach(ws => console.log(`${ws.word} (${ws.score})`))

Now build the rest of the owl #

Okay, this is the point where I confess that I went way off-track in building the little app at the top of this post. I fully intended to write about that part too, but honestly I’ve done a good job curing myself of the Wordle bug with this post.

For the curious, all the JavaScript code for the guess helper lives in wordle-component.js. Or, right-click on this page, pick Inspect Element, and find your way to the Sources or Debugger tab for a better look. It’s all vanilla JavaScript.

Also a quick shout-out to gridjs, which turned out to be a very easy way to create the table of sorted words.

<script src="https://unpkg.com/gridjs/dist/gridjs.umd.js"></script>
<link href="https://unpkg.com/gridjs/dist/theme/mermaid.min.css" rel="stylesheet" />

  1. What does rotch mean? Is it even a word? No, it is not. It’s a surname for a few Americans: a meteorologist, an architect, a tennis player, two politicians and a pediatrician. ↩︎