Index of coincidence
Also known as: IC · IoC · Indice de coïncidence
The index of coincidence (often abbreviated IC or IoC) is a statistic that characterizes the “concentration” of the letter distribution in a text. Concretely, you compute the probability that two letters drawn at random from the text are identical. It’s a numerical fingerprint of the language (or absence of language) — and one of the most effective tools of classical cryptanalysis.
The formula
For a text of N characters where each letter i appears n_i times (over the 26-letter alphabet):
IC = [Σ n_i × (n_i − 1)] / [N × (N − 1)]
The idea: n_i × (n_i − 1) counts the ordered pairs (without repetition) you can form from the occurrences of letter i. Summing over all letters gives the total number of identical pairs. Dividing by N × (N − 1) (the total number of possible ordered pairs) gives the probability that a pair drawn at random contains two identical letters.
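As a minimal sketch, the formula translates almost word for word into Python. The function name `index_of_coincidence` and the decision to fold case and ignore anything outside A-Z are assumptions of this example, not part of the definition.

```python
from collections import Counter


def index_of_coincidence(text: str) -> float:
    """Return the IC of `text`, counting only the letters A-Z (case-insensitive)."""
    # Assumption: keep only the 26 unaccented Latin letters, drop everything else.
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    n = len(letters)
    if n < 2:
        raise ValueError("need at least two letters to form a pair")
    counts = Counter(letters)
    # Numerator: ordered pairs of identical letters, summed over the alphabet.
    identical_pairs = sum(c * (c - 1) for c in counts.values())
    # Denominator: all ordered pairs of distinct positions.
    return identical_pairs / (n * (n - 1))
```

Calling `index_of_coincidence("Attack at dawn!")` ignores the spaces and punctuation and works on the 12 remaining letters.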
Reference values
| Type of text | Typical IC |
|---|---|
| Uniformly random text (every letter equally likely) | ≈ 0.038 (= 1/26) |
| Plaintext English | ≈ 0.067 |
| Plaintext French | ≈ 0.074 |
| Plaintext German | ≈ 0.072 |
| Plaintext Italian | ≈ 0.074 |
| Vigenère with a long key | ≈ 0.038 to 0.045 |
| Caesar / Atbash / monoalphabetic substitution | same as plaintext (≈ 0.067 EN, 0.074 FR) |
These values aren’t magic: they follow directly from each language’s letter frequencies. A language where one letter strongly dominates (E in French) concentrates the identical pairs and raises the IC. A flat language (or a polyalphabetic cipher that flattens frequency) drives the IC back down toward the uniform value.
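To see how letter frequencies drive the value: for a long text whose letters occur with probabilities p_i, the IC approaches the sum of the p_i squared. The sketch below illustrates this with the uniform alphabet and with a deliberately artificial skewed alphabet; the `skewed` list is a made-up distribution for illustration, not real language data, and `expected_ic` is a hypothetical helper name.

```python
def expected_ic(probabilities):
    """Probability that two independent draws give the same letter: sum of p_i squared."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * p for p in probabilities)


# Flat alphabet: every letter equally likely.
uniform = [1 / 26] * 26
# Hypothetical skewed alphabet (NOT real language data):
# one letter takes 20% of the mass, the other 25 share the rest equally.
skewed = [0.20] + [0.80 / 25] * 25

print(round(expected_ic(uniform), 4))  # 0.0385
print(round(expected_ic(skewed), 4))   # 0.0656
```

A single dominant letter is enough to lift the value well above 1/26, which is the mechanism the paragraph above describes.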
Use 1: monoalphabetic versus polyalphabetic
First classical use: faced with an unknown ciphertext, you measure the IC.
- IC ≈ 0.067 (English) or 0.074 (French) → probably a monoalphabetic cipher (Caesar, Atbash, substitution). Run frequency analysis.
- IC ≈ 0.04 → probably a polyalphabetic cipher (Vigenère, Beaufort) or a modern ciphertext. Direct frequency analysis won’t work; you must first determine the key length.
- IC ≈ 0.038 with apparently uniform random output → modern cipher (AES, ChaCha20) or OTP. Abandon classical cryptanalysis.
This measurement takes three lines of code and gives an immediate orientation.
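A rough sketch of that first orientation step might look like the following. The helper `first_guess`, its return labels, and especially the cutoff (the midpoint between the language value and 1/26) are arbitrary choices made for this illustration; in practice you would look at the number itself rather than trust a hard threshold.

```python
from collections import Counter

UNIFORM_IC = 1 / 26  # value for uniformly random letters


def index_of_coincidence(text):
    """IC over the letters A-Z only (case-insensitive)."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    n = len(letters)
    if n < 2:
        raise ValueError("need at least two letters")
    counts = Counter(letters)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def first_guess(ciphertext, language_ic=0.067):
    """Label a ciphertext by which reference value its IC sits closer to.

    Assumption: the cutoff is simply the midpoint between the expected
    language IC and the uniform IC; it is illustrative, not calibrated.
    """
    cutoff = (language_ic + UNIFORM_IC) / 2
    if index_of_coincidence(ciphertext) >= cutoff:
        return "monoalphabetic"
    return "polyalphabetic-or-random"
```

For French ciphertexts you would pass `language_ic=0.074`, following the reference table.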
Use 2: finding a Vigenère key length
Second use, more powerful. On Vigenère, the global IC is flat (~0.04). But if you take a sub-message by selecting one letter every k, that sub-message has been encrypted with a single key letter — i.e. a Caesar cipher. Its IC climbs back to the language’s value.
Procedure:
- For k = 1, 2, 3, … 20 (reasonably), take the ciphertext and extract every letter at positions 0, k, 2k, 3k, …. That gives the “first sub-message”.
- Compute the IC of that sub-message.
- Repeat for all tested k.
- The k value that yields IC ≈ 0.067 (EN) or 0.074 (FR) is most likely the key length.
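The procedure above can be sketched as follows. One variation is an assumption of this example: instead of using only the first sub-message, `column_ics` averages the IC over all k sub-messages (offsets 0 to k − 1), which tends to be less noisy. The function names, and the small `vigenere_encrypt` helper included only to build demo ciphertexts, are hypothetical.

```python
from collections import Counter


def index_of_coincidence(letters):
    """IC of a sequence of letters; returns 0.0 when no pair can be formed."""
    n = len(letters)
    if n < 2:
        return 0.0
    counts = Counter(letters)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def column_ics(ciphertext, max_k=20):
    """Map each candidate key length k to the mean IC of its k sub-messages."""
    letters = [c for c in ciphertext.upper() if "A" <= c <= "Z"]
    scores = {}
    for k in range(1, max_k + 1):
        # Sub-message j holds the letters at positions j, j + k, j + 2k, ...
        columns = [letters[j::k] for j in range(k)]
        scores[k] = sum(index_of_coincidence(col) for col in columns) / k
    return scores


def vigenere_encrypt(plaintext, key):
    """Textbook Vigenère on uppercase A-Z, used here only to build a demo."""
    out = []
    for i, c in enumerate(plaintext):
        shift = ord(key[i % len(key)]) - ord("A")
        out.append(chr((ord(c) - ord("A") + shift) % 26 + ord("A")))
    return "".join(out)
```

Typical use is `scores = column_ics(ciphertext)` followed by a look at which k values come close to the language's IC. Multiples of the true key length score high as well, so the smallest such k is usually the one to try first.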
This trick is what allowed William F. Friedman, working at Riverbank Laboratories and then for the U.S. Army, to mechanize the cracking of Vigenère in the 1920s. Coupled with the Kasiski test, it reduces a cipher long deemed undecipherable to a puzzle of a few hours.
Friedman’s and Kullback’s contribution
In 1922, William Friedman published his famous “The Index of Coincidence and Its Applications in Cryptography”. He formalized the measure and showed how it allows you to:
- Determine the key length of a periodic polyalphabetic cipher.
- Identify the language of a recovered plaintext.
- Detect a statistical dependency between two texts (notably for known-plaintext attacks).
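His colleague Solomon Kullback later co-developed the Kullback-Leibler divergence (with Richard Leibler, 1951), a more general measure of how far one probability distribution lies from another, in the same statistical spirit as the IC. Both Friedman and Kullback worked in the U.S. Army's Signal Intelligence Service, a forerunner of the NSA, and both later held senior posts at the NSA.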
A quick worked calculation
Take the text ATTACKATDAWN (12 letters):
- A appears 4 times → 4 × 3 = 12
- T appears 3 times → 3 × 2 = 6
- C appears 1 time → 0
- K appears 1 time → 0
- D appears 1 time → 0
- W appears 1 time → 0
- N appears 1 time → 0
- Sum = 18
- IC = 18 / (12 × 11) = 18 / 132 ≈ 0.136
The IC is very high — but on 12 letters, statistical noise dominates. IC is reliable from ~100 characters onward, ideally 200+. On short texts, comparing several ICs against each other still works, but absolute-value judgments are hazardous.
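The tally above is easy to reproduce by machine. This short sketch recomputes each letter's contribution and the final ratio; using `Fraction` to keep the exact value is a choice made for this example.

```python
from collections import Counter
from fractions import Fraction

text = "ATTACKATDAWN"
counts = Counter(text)
n = len(text)

# Per-letter contribution n_i * (n_i - 1), matching the list above.
contributions = {letter: c * (c - 1) for letter, c in counts.items()}
total = sum(contributions.values())

# Exact ratio first, then the rounded decimal quoted in the text.
ic = Fraction(total, n * (n - 1))

print(contributions["A"], contributions["T"], total)  # 12 6 18
print(ic, round(float(ic), 3))  # 3/22 0.136
```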
Key takeaways:
- IC measures the concentration of the letter distribution: high for a language (E, A dominate), low for random text.
- Reference values: ≈ 0.074 French, 0.067 English, 0.038 random.
- First use: distinguish monoalphabetic (high IC) from polyalphabetic (low IC) on an unknown ciphertext.
- Second use: find a Vigenère key length by testing the IC of sub-messages taken every k characters.
- Unreliable below ~100 characters. Beyond that, it’s a thoroughly robust cryptanalyst’s tool.