Frequency analysis
Also known as: letter counting
Frequency analysis is the first great weapon of cryptanalysis. It rests on a simple observation: in any natural language, letters do not appear with equal probability. That seemingly trivial inequality is what lets you break centuries of artisanal ciphers in a few minutes.
A language’s statistical fingerprint
In English, E accounts for about 12.7 % of occurrences, followed by T (9.1 %), A (8.2 %), O (7.5 %), I (7 %), N (6.7 %), S (6.3 %), H (6.1 %), R (6 %)… while Z or Q sit below 0.1 %. The mnemonic “ETAOIN SHRDLU” sums up the twelve most frequent letters in decreasing order. It comes from typesetters: it’s the order of keys on a Linotype machine.
In French, the order shifts: E (~17 %), then A (8 %), S (8 %), I (7 %), T (7 %), N (7 %), R (6.5 %)… The mnemonic phrase “ESARTILUNOC” recaps the eleven most common letters in French.
Other languages have their own fingerprints: in German, E stays first but N climbs to second; in Italian, the vowels A, E, I take the top three slots; in Russian, you work on the Cyrillic alphabet with a different profile altogether.
How it works
In a monoalphabetic cipher (simple substitution, Caesar, Atbash), each plaintext letter is always replaced by the same ciphertext letter. Consequence: the frequency profile is relabeled, not erased. The peaks simply move to other letters.
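To make this concrete, here is a minimal sketch in Python. It assumes a Caesar cipher with shift 3 over the letters A to Z; the helper names `caesar_encrypt` and `count_profile` are illustrative, not taken from any library. It checks that the sorted list of letter counts is identical before and after encryption.

```python
from collections import Counter
import string

def caesar_encrypt(plaintext, shift):
    """Shift each A-Z letter forward by `shift`; leave other characters alone."""
    out = []
    for ch in plaintext.upper():
        if ch in string.ascii_uppercase:
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)

def count_profile(text):
    """Letter counts sorted from highest to lowest, ignoring which letter is which."""
    counts = Counter(ch for ch in text.upper() if ch in string.ascii_uppercase)
    return sorted(counts.values(), reverse=True)

plain = "THE REPORT WILL ARRIVE TOMORROW AT THE LATEST"
cipher = caesar_encrypt(plain, 3)

print(cipher)
# The profile is the same list of counts; only the letter labels have changed.
print(count_profile(plain) == count_profile(cipher))  # True
```

The same check holds for any one-to-one substitution, since relabeling letters cannot change how many times each one occurs.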
Let’s run an example. Take the ciphertext (English):
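WKH UHSRUW ZLOO DUULYH WRPRUURZ DW WKH ODWHVW.
Steps:
- Count: list each ciphertext letter and its occurrences. Here W appears 7 times, U 6 times, H 5 times, R 4 times…
- Compare: W, U and H lead, so they are the candidates for E, T and the other top plaintext letters. On so short a text the exact ranking is noisy, but the three-letter word WKH occurs twice, which looks like THE. Hypothesis: H = E (and W = T).
- Verify: if H = E, then the Caesar shift is H − E = 7 − 4 = 3 (alphabet positions counted from A = 0). Decrypt: THE REPORT WILL ARRIVE TOMORROW AT THE LATEST. Confirmed instantly.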
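The count, compare and verify steps can be sketched in a few lines of Python. This is a minimal illustration under the assumption that the cipher is a Caesar shift over A to Z; `letter_counts` and `caesar_decrypt` are hypothetical helper names.

```python
from collections import Counter

CIPHERTEXT = "WKH UHSRUW ZLOO DUULYH WRPRUURZ DW WKH ODWHVW"

def letter_counts(text):
    """Count only the letters A-Z, ignoring spaces and punctuation."""
    return Counter(ch for ch in text.upper() if "A" <= ch <= "Z")

def caesar_decrypt(text, shift):
    """Shift each A-Z letter back by `shift`; leave other characters alone."""
    out = []
    for ch in text.upper():
        if "A" <= ch <= "Z":
            out.append(chr((ord(ch) - ord("A") - shift) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)

# Step 1: count.
counts = letter_counts(CIPHERTEXT)
print(counts.most_common(4))  # [('W', 7), ('U', 6), ('H', 5), ('R', 4)]

# Steps 2 and 3: hypothesis H = E, so the shift is the distance from E to H.
shift = (ord("H") - ord("E")) % 26
print(shift)                              # 3
print(caesar_decrypt(CIPHERTEXT, shift))  # THE REPORT WILL ARRIVE TOMORROW AT THE LATEST
```

If the first hypothesis produced gibberish, the same function could be rerun with the shift implied by the next candidate letter.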
For a general substitution cipher (each plaintext letter replaced by any of the other 25, no shift constraint), you proceed step by step: spot the E, then look for the digrams TH, HE, IN, ER that almost always cluster around it; deduce the neighbors, and the chain closes quickly.
Digrams and trigrams: the next level
Single-letter frequencies aren’t always enough, especially against well-built ciphers. Then you look at digrams (pairs) and trigrams (triples):
- Most frequent English digrams: TH, HE, IN, ER, AN, RE, ON, AT, EN, ND.
- French digrams: ES, EN, ON, ER, LE, DE, NT, RE, TE, AN.
- English trigrams: THE, AND, ING, ION, ENT, FOR.
- French trigrams: ENT, LES, EDE, DES, QUE, AIT, LLE.
Spotting a common digram or trigram in a ciphertext gives you several letters at once and dramatically accelerates the crack.
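Counting digrams and trigrams is a small extension of counting letters. The sketch below is one possible approach, assuming that n-grams are counted inside words only (they do not span spaces); `ngram_counts` is an illustrative name.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count n-letter sequences inside each word (n-grams do not cross spaces)."""
    counts = Counter()
    for word in text.upper().split():
        letters = "".join(ch for ch in word if "A" <= ch <= "Z")
        for i in range(len(letters) - n + 1):
            counts[letters[i:i + n]] += 1
    return counts

cipher = "WKH UHSRUW ZLOO DUULYH WRPRUURZ DW WKH ODWHVW"

digrams = ngram_counts(cipher, 2)
trigrams = ngram_counts(cipher, 3)

print(digrams["WK"], digrams["KH"])  # 2 2
print(trigrams.most_common(1))       # [('WKH', 2)]
```

On this ciphertext the only repeated trigram is WKH, a natural candidate for THE, which would give three letters at once.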
Historical role: Al-Kindi, 9th century
Frequency analysis was formalized by the Arab polymath Al-Kindi in the 9th century, in his Manuscript on Deciphering Cryptographic Messages, rediscovered in the 20th century in Ottoman archives. It is the earliest documented cryptanalysis method known. It undermined in one stroke most of the ciphers in use at the time, and left Western cryptography outmatched by Arab cryptanalysts for roughly six centuries, until the polyalphabetic cipher (Alberti, 1467) restored some fog.
Beyond Al-Kindi, frequency analysis remains the first-aid tool of the cryptanalyst of every century: Babbage and Kasiski used it (on Vigenère sub-messages once the key length is known), Bletchley Park combined it with cribs (probable words like “WETTER” in a German weather bulletin), and any beginner in a CTF reaches for it the moment they see text that smells like substitution.
Limits of frequency analysis
- Ineffective when applied directly to polyalphabetic ciphers (Vigenère, Beaufort), which flatten the distribution: you must first determine the key length (Kasiski test, Friedman's index of coincidence), then apply frequency analysis to each sub-message.
- Sensitive to length: on text shorter than about 100 characters, statistical noise drowns the peaks. The longer the message, the sharper the frequency signature. Very short ciphertexts (a single word, for instance) often resist.
- Sensitive to language: applying an English profile to a French text gives skewed results (E stays frequent, but the digram and trigram structure differs).
- Useless against modern ciphers (AES, ChaCha20): output is statistically uniform; E is no longer distinguishable from Z.
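The Friedman index mentioned above is usually computed as the index of coincidence: the probability that two letters drawn at random from the text are the same. The sketch below is a minimal version over A to Z; the function name is illustrative. As commonly quoted reference points, English text sits around 0.067 and a flat, uniform distribution around 0.038 (1/26), so a value near the first suggests a monoalphabetic cipher and a value near the second a flattened one. On short texts the estimate is noisy.

```python
from collections import Counter

def index_of_coincidence(text):
    """Probability that two distinct positions in the text hold the same letter."""
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    n = len(letters)
    if n < 2:
        return 0.0  # not enough letters to form a pair
    counts = Counter(letters)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))

sample = "WKH UHSRUW ZLOO DUULYH WRPRUURZ DW WKH ODWHVW"
print(round(index_of_coincidence(sample), 4))

# Extreme cases: a single repeated letter gives 1.0, all-different letters give 0.0.
print(index_of_coincidence("AAAA"), index_of_coincidence("ABCD"))
```

Because the index depends only on the counts and not on which letter carries them, a simple substitution leaves it unchanged, which is what makes it useful for telling the two cipher families apart.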
Key takeaways:
- Any cipher that preserves letter frequency (pure monoalphabetic) falls in minutes to frequency analysis, provided the ciphertext is long enough.
- Useful mnemonics: ETAOIN SHRDLU (EN), ESARTILUNOC (FR).
- Next level: digrams (TH, ES) and trigrams (THE, ENT) to confirm hypotheses.
- Frequency analysis is step one of classical cryptanalysis. Faced with an unknown cipher, you always run it first.