English Letter & Word Frequency Analysis

_____

** Please note: this is an unfinished placeholder page. I will be adding to it over time.


In this analysis, we delve into the patterns and frequencies of letter and word usage across various English-language texts.

By examining millions of words across social media data as well as years of personal keylogged data, this study aims to uncover the underlying trends that characterize contemporary English communication.

Ultimately, this data should help us design better, more effective keyboards and keyboard layouts.

Data And Methods

These analyses were done using a combination of public data and my own personal data.

The public data is mostly Twitter and Reddit datasets available on Kaggle. These were then edited to remove automated bot comments, mod posts, spam tweets, etc.

The personal data was collected via a keylogger on my own computer over the course of about two years. It contains significant coding portions, including Python, C++, JavaScript and other web development languages, as well as regular English prose. I used Benign Keylogger from Ga68 on GitHub. I highly recommend it if you are looking to do your own analysis and need a simple, open-source keylogger.

Letter Frequencies

# Character Frequency Percent
1 e 11044015 11.7%
2 t 9144868 9.7%
3 a 7551625 8.0%
4 o 7533545 8.0%
5 i 7077050 7.5%
6 n 6363567 6.7%
7 s 6166414 6.5%
8 r 5132215 5.4%
9 h 4627447 4.9%
10 l 4012021 4.2%
11 d 3371028 3.6%
12 u 3023660 3.2%
13 c 2613052 2.8%
14 m 2482433 2.6%
15 y 2290010 2.4%
16 g 2160828 2.3%
17 p 1999561 2.1%
18 w 1889726 2.0%
19 f 1809299 1.9%
20 b 1535919 1.6%
21 v 987223 1.0%
22 k 983624 1.0%
23 j 231844 0.2%
24 x 208332 0.2%
25 q 94637 0.1%
26 z 86526 0.1%
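
For anyone wanting to build a similar table from their own text, here is a minimal sketch in Python. It is not the script used for the numbers above. The function name letter_frequencies and the sample sentence are made up for illustration, and it assumes that only the letters a-z are counted, case-insensitively.

```python
from collections import Counter
import string

def letter_frequencies(text):
    """Return (letter, count, percent) rows, most common first."""
    # Assumption: lowercase everything and count only a-z;
    # digits, punctuation and whitespace are ignored.
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    total = sum(counts.values())
    return [(letter, n, 100 * n / total) for letter, n in counts.most_common()]

# Hypothetical sample text; swap in your own corpus.
sample = "The quick brown fox jumps over the lazy dog. The end."
for rank, (letter, n, pct) in enumerate(letter_frequencies(sample), start=1):
    print(f"{rank} {letter} {n} {pct:.1f}%")
```

The percentages are relative to the total number of letters counted, not to all characters typed.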

Character Frequencies

Word Frequencies

# Word Frequency Percent
1 the 817341 3.7%
2 to 577652 2.8%
3 a 533522 2.5%
4 and 450719 2.1%
5 i 406851 1.9%
6 of 379025 1.8%
7 is 324513 1.6%
8 you 311558 1.4%
9 that 310685 1.4%
10 in 292468 1.4%
11 it 285303 1.3%
12 for 207893 1.1%
13 this 168830 0.8%
14 but 154275 0.7%
15 not 147227 0.7%
16 be 145007 0.7%
17 on 143752 0.7%
18 are 142758 0.6%
19 with 141891 0.6%
20 with 137668 0.6%
21 was 124175 0.5%
22 they 122025 0.5%
23 its 117822 0.5%
24 if 115868 0.5%
25 as 110247 0.5%
26 just 103891 0.4%
27 my 103848 0.4%
28 like 102658 0.4%
29 or 94767 0.4%
30 so 94718 0.4%
31 your 91193 0.4%
32 what 87282 0.4%
33 at 86765 0.4%
34 do 81627 0.4%
35 can 81162 0.4%
36 people 76458 0.4%
37 about 75498 0.3%
38 dont 75494 0.3%
39 he 75303 0.3%
40 an 74111 0.3%
41 all 71631 0.3%
42 would 68937 0.3%
43 from 68652 0.3%
44 more 67329 0.3%
45 we 66051 0.3%
46 im 65939 0.3%
47 me 63655 0.3%
48 one 61472 0.3%
49 how 61293 0.3%
50 get 59114 0.3%
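
A word table like this can be approximated with a short Python sketch. The tokenization rules here are my guess rather than a record of the original processing: entries such as "dont" and "im" suggest apostrophes were dropped, so the sketch does the same. The function name word_frequencies is hypothetical.

```python
import re
from collections import Counter

def word_frequencies(text, top=50):
    """Return (word, count, percent) rows for the `top` most common words."""
    # Assumption: apostrophes (straight and curly) are removed, so
    # "don't" becomes "dont", matching how words appear in the table.
    cleaned = text.lower().replace("'", "").replace("\u2019", "")
    # Assumption: a word is any unbroken run of the letters a-z.
    words = re.findall(r"[a-z]+", cleaned)
    total = len(words)
    return [(w, n, 100 * n / total) for w, n in Counter(words).most_common(top)]

# Hypothetical sample text; swap in your own corpus.
sample = "I don't know if it's the best, but I'm sure it is the one I like."
for rank, (word, n, pct) in enumerate(word_frequencies(sample, top=5), start=1):
    print(f"{rank} {word} {n} {pct:.1f}%")
```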

Lexical Bundles

Lexical bundles are common sequences of words, like "that is" or "a lot of". While there is a good amount of research available on lexical bundles, almost all of it is specific to certain communities or styles of writing. For instance, it is very common to find research detailing the frequency of lexical bundles in research papers for a specific discipline, like this one about Chemistry writing.

It is surprisingly difficult to find lexical bundles for just English as a whole. So that is what I have compiled here. These are the top 50 two-word, three-word, and four-word lexical bundles in English.
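
Counting bundles comes down to sliding a window of n words across the text and tallying each sequence. The sketch below shows one way to do that in Python. It is an illustration under simple assumptions, not the method behind these lists: the function name ngram_counts is hypothetical, the input is assumed to be an already-tokenized list of words, and the window is allowed to run across sentence boundaries.

```python
from collections import Counter

def ngram_counts(words, n):
    """Count every run of n consecutive words in a list of words."""
    # zip over n shifted copies of the list yields each window of n words;
    # it stops at the shortest copy, so short inputs give an empty Counter.
    windows = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(window) for window in windows)

# Hypothetical sample text; swap in your own tokenized corpus.
words = "i have a lot of work and a lot of time".split()
for n in (2, 3, 4):
    print(n, ngram_counts(words, n).most_common(3))
```

To keep bundles from spanning two sentences, you could split the text into sentences first and add up the counts from each one.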

Ngrams

Normalized Letter Counts

When making decisions about your keyboard layout, it helps to know how much more or less common one letter is than another. This is a link to a spreadsheet which shows every letter normalized to every other letter. So for instance, you can see that the letter "T" is 1.21 times as common as the letter "A", or that "N" is 1.24 times as common as "R", etc.


Click for full customizable spreadsheet


You can make a copy of this spreadsheet and update it with your own numbers, and the spreadsheet will update its table to normalize them for you. Just be sure to update the letters in the first column and row as well as the numbers, as your list will probably be in a slightly different order.
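
If you would rather compute the ratios in code than in a spreadsheet, the normalization is just one count divided by another. Here is a small Python sketch using a handful of the letter counts from the table above. The function name ratio_table and the row/column orientation are my own choices for illustration and may not match the spreadsheet's layout.

```python
# A few counts copied from the letter frequency table above.
counts = {
    "e": 11044015,
    "t": 9144868,
    "a": 7551625,
    "o": 7533545,
    "n": 6363567,
    "r": 5132215,
}

def ratio_table(counts):
    """table[x][y] is how many times as common letter x is as letter y."""
    return {
        row: {col: counts[row] / counts[col] for col in counts}
        for row in counts
    }

table = ratio_table(counts)
print(f"t vs a: {table['t']['a']:.2f}")  # 1.21
print(f"n vs r: {table['n']['r']:.2f}")  # 1.24
```

Because the table is built from whatever dictionary you pass in, the order of the letters does not matter here.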

Implications

My intention when beginning this research was to extend the concept of the "word keys" that I have on my fulcrum keyboard - keys that output entire words rather than individual letters. I have found that having keys for both "the" and "and" has been very beneficial, and I wanted to see if there were any multi-word phrases that were common enough to warrant getting their own key.

The only phrase long enough and common enough is "a lot of".

I have experimented with an "a lot of" key, and had pretty mixed results. When I first added the keys for "the" and "and", they were immediately easy to use. Totally intuitive. However, it is much harder for your brain to think in multiple-word terms while typing. Often what happens is that you get halfway through "lot" and think "damn, I should have used the 'a lot of' key!"

It is possible that this could be overcome with a little bit of practice. But ultimately I decided that there is other, better low-hanging fruit in terms of improving keyboard efficiency, so I've turned my attention to that. Maybe eventually I will revisit "a lot of".