# Most Popular Wordle Openers

## Preface


This is a Quarto version of my original Jupyter notebook. I plan to update this version regularly. See the Quarto .qmd source on GitHub here.

On September 1, 2022, The New York Times published a piece on opener popularity that included the five most popular openers. My analysis matches their actual data for the top 3! Pretty good for looking at tweets!

## Load and prep the data

The function make_first_guest_list creates a useful dataframe for analysis. It and most other helper utilities are in the first_word.py file.

I start with the very useful Kaggle data set wordle-tweets and extract it directly from the zip file, which I download with the Kaggle API. The get_first_words function processes the dataframe:

1. It removes tweeted scores that contain an invalid score line for that answer (likely played on a cached version of the puzzle, not the live one on the NY Times site).
2. It removes any tweets that have more than 6 score patterns.
3. It extracts the first pattern from the tweeted scores.
4. It maps on the answer for a given wordle id.
5. It groups by the answer for the day, and creates a dataframe that has score, target, guess, and some data on how popular that score is.
6. Colored squares are mapped to numbers. So a score of 🟨🟩⬜⬜🟩 becomes 12002.
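
The square-to-digit mapping in step 6 can be sketched as follows. This is a minimal illustration, not the code in first_word.py: the name `encode_pattern` is hypothetical, and treating the dark-mode square ⬛ as grey is an assumption.

```python
# Hypothetical helper illustrating step 6; the real parsing lives in first_word.py.
# Assumption: grey squares may appear as ⬜ (light mode) or ⬛ (dark mode).
SQUARE_TO_DIGIT = {"⬜": "0", "⬛": "0", "🟨": "1", "🟩": "2"}


def encode_pattern(squares: str) -> str:
    """Convert one row of tweeted squares into a digit string, ignoring other characters."""
    return "".join(SQUARE_TO_DIGIT[ch] for ch in squares if ch in SQUARE_TO_DIGIT)


print(encode_pattern("🟨🟩⬜⬜🟩"))  # 12002
```

Keeping the result as a string rather than an integer preserves leading zeros, so an all-grey row stays `00000`.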
Code
from first_word import make_first_guest_list,format_df

import pandas as pd

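import datetime

# Wordle 0 corresponds to June 19, 2021; each later puzzle number is one more day
wordle_start = datetime.datetime(2021, 6, 19)
now = datetime.datetime.now()

# Map each wordle number in the range of interest to its calendar date
mapping_date_dict = {wordle_id: wordle_start+datetime.timedelta(days=wordle_id) for wordle_id in range(210,500)}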

df = make_first_guest_list()
format_df(df.sample(10))
Max wordle num 458
Filtered out 53709 of 983804 rows
| score | target | guess | score_frequency_rank | score_count_fraction | wordle_num | guess_count | commonality | weighted_rank |
|---|---|---|---|---|---|---|---|---|
| 00001 | tiara | upper | 13.0 | 0.022766 | 342 | 387 | 38341510 | 169.00 |
| 01000 | badge | ceros | 24.5 | 0.007752 | 321 | 666 | 0 | 600.25 |
| 00200 | parer | corny | 18.5 | 0.015355 | 454 | 269 | 378659 | 342.25 |
| 00000 | girth | lemed | 3.0 | 0.086652 | 355 | 3454 | 0 | 9.00 |
| 00000 | alpha | texes | 3.0 | 0.124421 | 451 | 4114 | 96227 | 9.00 |
| 00112 | story | patsy | 33.0 | 0.007643 | 317 | 32 | 1064602 | 1089.00 |
| 00000 | doubt | pangs | 1.0 | 0.213710 | 453 | 3684 | 194132 | 1.00 |
| 01021 | egret | defer | 52.5 | 0.002174 | 378 | 28 | 1207925 | 2756.25 |
| 00000 | choke | buppy | 4.0 | 0.097980 | 254 | 2791 | 0 | 16.00 |
| 01000 | retro | bohos | 10.0 | 0.032967 | 373 | 1066 | 0 | 100.00 |

## Simple analysis

The large dataframe has one row for every pattern / guess / wordle answer combination. A simple way to look at common starter words is to group by the opener and look at which openers consistently rank high across several days.

Code
df.groupby('guess')['score_frequency_rank'].mean().sort_values().head(10)
guess
craze    4.191176
crare    4.406780
stare    4.441176
braze    4.544118
crave    4.665966
crane    4.792017
audio    4.834034
brave    4.955882
quaff    5.211538
Name: score_frequency_rank, dtype: float64

This analysis leaves a lot to be desired. While other evidence indicates adieu is a popular opener, it does not seem like craze would be one. Plus, there are other words ending in -aze.

The all-grey pattern 00000 is fairly common, simply because of the sheer volume of words that can create it. Words with uncommon letters therefore show up a lot, since they often produce that pattern. So what happens if we filter out the null score?

Code
df.query("score != '00000'").groupby('guess')['score_frequency_rank'].mean().sort_values().head(10)
guess
craze    4.773632
stare    4.809302
crare    5.035176
audio    5.069507
crane    5.178241
crave    5.314356
braze    5.377604
brave    5.837629
brane    6.055024
Name: score_frequency_rank, dtype: float64

The braze craze continues. Are people guessing braze regularly, or is something else going on? BRAZE’s score line is common, but what other words could make the same pattern?

Code
format_df(df.query('score == "00101" and wordle_num == 230').sort_values('commonality',ascending=False).head(10))
| score | target | guess | score_frequency_rank | score_count_fraction | wordle_num | guess_count | commonality | weighted_rank |
|---|---|---|---|---|---|---|---|---|
| 00101 | pleat | image | 9.0 | 0.029706 | 230 | 186 | 197874283 | 81.0 |
| 00101 | pleat | share | 9.0 | 0.029706 | 230 | 186 | 119294241 | 81.0 |
| 00101 | pleat | until | 9.0 | 0.029706 | 230 | 186 | 113090086 | 81.0 |
| 00101 | pleat | grade | 9.0 | 0.029706 | 230 | 186 | 54275130 | 81.0 |
| 00101 | pleat | frame | 9.0 | 0.029706 | 230 | 186 | 46079991 | 81.0 |
| 00101 | pleat | usage | 9.0 | 0.029706 | 230 | 186 | 25440406 | 81.0 |
| 00101 | pleat | sharp | 9.0 | 0.029706 | 230 | 186 | 24904199 | 81.0 |
| 00101 | pleat | grace | 9.0 | 0.029706 | 230 | 186 | 17642126 | 81.0 |
| 00101 | pleat | villa | 9.0 | 0.029706 | 230 | 186 | 17587586 | 81.0 |
| 00101 | pleat | solve | 9.0 | 0.029706 | 230 | 186 | 13452150 | 81.0 |

There are many other words that make the same pattern. BRAZE could be genuinely popular, or perhaps it is riding the coattails of some other _RA_E word.
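
Here is a minimal sketch of how several guesses collapse onto one score line, assuming standard Wordle scoring with the 0/1/2 encoding used above. The function `score_guess` and the short `candidates` list are hypothetical and are not taken from first_word.py.

```python
from collections import Counter


def score_guess(guess: str, target: str) -> str:
    """Score a guess against a target: 2 = green, 1 = yellow, 0 = grey."""
    digits = ["0"] * len(guess)
    unmatched = Counter()  # target letters not consumed by a green square

    # First pass: mark greens and collect the remaining target letters.
    for i, (g, t) in enumerate(zip(guess, target)):
        if g == t:
            digits[i] = "2"
        else:
            unmatched[t] += 1

    # Second pass: mark yellows, using each leftover target letter at most once.
    for i, g in enumerate(guess):
        if digits[i] == "0" and unmatched[g] > 0:
            digits[i] = "1"
            unmatched[g] -= 1

    return "".join(digits)


# Hypothetical candidate openers, scored against the wordle 230 answer.
candidates = ["braze", "image", "share", "until", "audio", "stare"]
same_pattern = [w for w in candidates if score_guess(w, "pleat") == "00101"]
print(same_pattern)  # ['braze', 'image', 'share', 'until']
```

A tweeted 00101 on that day cannot distinguish between any of the words in `same_pattern`, which is why a simple average rank gives credit to all of them.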

## Linear Regression

A better approach is to control for the presence of other words. If BRAZE only does well when it’s paired with GRACE or GRADE or SHARE, then a linear regression should isolate the guesses that are actually predictive of a popular score line. Since I’m trying to account for collinearity, I will use a Ridge regression.

One difficulty came from getting the dataframe into the right format. I want one row per wordle number / score pattern, with each possible guess as a column of 1 or 0. The dependent variable is the fraction of all tweeted opening score patterns that match that pattern (e.g. for wordle 230, what fraction of all tweeted scores started 🟨🟩⬜⬜🟨). Another feature is the number of words that could produce that pattern.

I discovered pd.crosstab, and there are other methods that were all better than what I had been doing originally (an awful groupby loop that took over a minute).
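
To illustrate the target layout, here is a small sketch of `pd.crosstab` on a made-up frame. The rows in `toy` are invented for illustration, and only the column names follow the real dataframe.

```python
import pandas as pd

# Made-up rows: one row per (wordle number, score pattern, possible guess).
toy = pd.DataFrame({
    "wordle_num": [230, 230, 230, 231],
    "score_pattern": ["00101", "00101", "10000", "00000"],
    "guess": ["braze", "share", "audio", "braze"],
})

# One row per (wordle_num, score_pattern), one 0/1 column per guess.
one_hot = pd.crosstab([toy["wordle_num"], toy["score_pattern"]], toy["guess"])
print(one_hot)
```

The row for wordle 230 and pattern 00101 has a 1 under both braze and share and a 0 under audio, which is the one-hot encoding the regression needs.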

I normalize the data somewhat, and then fit a model across all the word features as well as guess count. The Ridge penalty helps control the size of the coefficients, and is necessary to handle the collinearity of the variables.
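
Why the penalty matters can be seen in a tiny example. This sketch uses the closed-form ridge solution in NumPy rather than scikit-learn, with invented numbers; `ridge_fit` is a hypothetical helper, not part of the analysis code.

```python
import numpy as np

# Two guesses that always appear together: their columns are perfectly collinear.
X = np.array([[1.0, 1.0], [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
y = np.array([0.10, 0.12, 0.01, 0.03])  # invented score_count_fraction values


def ridge_fit(X, y, alpha):
    """Closed-form ridge coefficients on centered data: (X'X + alpha*I)^-1 X'y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    gram = Xc.T @ Xc
    return np.linalg.solve(gram + alpha * np.eye(X.shape[1]), Xc.T @ yc)


Xc = X - X.mean(axis=0)
print(np.linalg.matrix_rank(Xc.T @ Xc))  # 1: ordinary least squares has no unique solution
print(ridge_fit(X, y, alpha=1.0))        # the penalty makes the system solvable
```

With identical columns the penalty splits the effect evenly between the two guesses. In the real data the columns are only partly collinear, so guesses that also appear without their companions can be told apart.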

Code
df.rename(columns={'score': 'score_pattern'}, inplace=True)

one_hot_encoded_data = pd.crosstab(
[df['wordle_num'], df['score_pattern']], df['guess']).join(
df.groupby(['wordle_num',
'score_pattern'])[['score_count_fraction',
'guess_count']].first()).reset_index()

one_hot_encoded_data.dropna(subset=['score_count_fraction'],inplace=True) #don't need patterns no one actually guessed
# actually not sure if fillna 0 would be better?
std = one_hot_encoded_data['score_count_fraction'].std()

one_hot_encoded_data['guess_count_orig'] = one_hot_encoded_data['guess_count']
guess_count_std = one_hot_encoded_data[
'guess_count'].std()
guess_count_mean = one_hot_encoded_data['guess_count'].mean()

one_hot_encoded_data['guess_count'] = (one_hot_encoded_data[
'guess_count'] - guess_count_mean ) / guess_count_std

Fitting the Ridge model to the 90 most recent wordles.

Code
from sklearn import linear_model
from tweet_script import today_wordle_num
lookback_num = today_wordle_num() - 90
data = one_hot_encoded_data.query("wordle_num != 258 and wordle_num > @lookback_num ") # wordle 258 is excluded because it has particularly bad collinearity issues (explained further down)
end_date = mapping_date_dict[data['wordle_num'].max()].strftime("%B %-d, %Y")
begin_date = mapping_date_dict[data['wordle_num'].min()].strftime("%B %-d, %Y")

X= data.drop(
columns=[ 'score_count_fraction', 'wordle_num','score_pattern','guess_count_orig'],errors='ignore') # fit to the one hot encoded guesses and the total guess count
y=data['score_count_fraction'] # our dependent is the fraction of guesses that had the score pattern

r = linear_model.RidgeCV(alphas=[5,10,15])
r.fit(X,y)
r.alpha_
RidgeCV(alphas=[5, 10, 15])
5

### Results!

Now we can look at the variable coefficients to see which words most strongly predict a popular pattern. The top few guesses contain some familiar choices.

Code
from IPython.display import display, HTML

top_openers = pd.DataFrame(list(zip(r.feature_names_in_,r.coef_)),
columns=['variable', 'coef']).sort_values('coef',ascending=False)

top_openers_list = top_openers.query('variable != "guess_count"')['variable'].head(15).tolist()
| variable | coef |
|---|---|
| audio | 0.014438 |
| stare | 0.013428 |
| irate | 0.010029 |
| raise | 0.009542 |
| great | 0.007361 |
| arise | 0.006987 |
| crate | 0.005715 |
| steam | 0.005546 |
| train | 0.005466 |
| crane | 0.005010 |
| arose | 0.004998 |
| amies | 0.004935 |
| ajies | 0.004854 |
| soare | 0.004568 |

## CRANE

On February 6, 2022, 3Blue1Brown released a video positing that CRANE was the best opener. Though this was later recanted, it generated plenty of media coverage. CRANE was high on my list based on recent wordles. How did it look before the video?
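
For reference, the video's release date can be converted to a puzzle number using the same day-zero convention as mapping_date_dict, which is where a cutoff at wordle 232 comes from. This is a small sketch, and the variable names are illustrative.

```python
import datetime

wordle_start = datetime.date(2021, 6, 19)  # day zero, as in the setup code
video_date = datetime.date(2022, 2, 6)     # release date of the 3Blue1Brown video

# Puzzle number = days elapsed since day zero.
video_wordle_num = (video_date - wordle_start).days
print(video_wordle_num)  # 232
```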

Code
data_past = one_hot_encoded_data.query('wordle_num <= 232')
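X= data_past.drop(
columns=[ 'score_count_fraction', 'wordle_num','score_pattern','guess_count_orig'],errors='ignore') # fit to the one hot encoded guesses and the total guess count, dropping the same non-feature columns as the earlier fit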
y=data_past['score_count_fraction'] # our dependent is the fraction of guesses that had the score pattern

r = linear_model.Ridge(alpha=10)
r.fit(X,y)

out = pd.DataFrame(list(zip(r.coef_, r.feature_names_in_)),
columns=['coef', 'variable']).sort_values('coef',ascending=False)
out['guess_rank'] = out['coef'].rank(ascending=False)
format_df(out.query('variable == "crane"'))
Ridge(alpha=10)
| coef | variable | guess_rank |
|---|---|---|
| 0.000294 | crane | 1022.0 |

Prior to Wordle 233, CRANE ranked 1022nd! Quite the turnaround. (The alpha value may not be optimal for this smaller sample.) You can see this even in the cruder analysis, where CRANE's ranks were much worse in the past.

Code
guess = 'crane'
myplot = df.query('guess == @guess').sort_values(
'wordle_num').plot.scatter(
x='wordle_num',
y='score_frequency_rank',
color='guess_count',
title=f'{guess.upper()} popularity',backend='plotly',
color_continuous_scale='bluered',
)
myplot.update_yaxes(autorange="reversed")
