Index of coincidence

Also known as: IC · IoC · Indice de coïncidence

The index of coincidence (often abbreviated IC or IoC) is a statistic that characterizes the “concentration” of the letter distribution in a text. Concretely, you compute the probability that two letters drawn at random from the text are identical. It’s a numerical fingerprint of the language (or absence of language) — and one of the most effective tools of classical cryptanalysis.

The formula

For a text of N characters where each letter i appears n_i times (over the 26-letter alphabet):

IC = [Σ n_i × (n_i − 1)] / [N × (N − 1)]

The idea: n_i × (n_i − 1) counts the ordered pairs of distinct positions that both hold letter i. Summing over all letters gives the total number of identical ordered pairs. Dividing by N × (N − 1), the total number of ordered pairs of distinct positions, gives the probability that two letters drawn at random without replacement are identical.
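As a minimal sketch, the formula translates almost directly into Python. The function name index_of_coincidence and the decision to ignore everything outside the 26 unaccented letters are assumptions made for this illustration.

```python
from collections import Counter


def index_of_coincidence(text: str) -> float:
    """Return sum(n_i * (n_i - 1)) / (N * (N - 1)) over the letters A to Z."""
    # Keep only the 26 unaccented letters, case-insensitively (assumption:
    # accented letters, digits, spaces and punctuation are simply dropped).
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    n = len(letters)
    if n < 2:
        return 0.0  # no pair can be drawn from fewer than two letters
    counts = Counter(letters)
    return sum(k * (k - 1) for k in counts.values()) / (n * (n - 1))
```

A text made of a single repeated letter scores 1.0, and a text in which no letter repeats scores 0.0; real texts fall in between.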

Reference values

Type of text	Typical IC
Uniformly random text (every letter equally likely)	≈ 0.038 (= 1/26)
Plaintext English	≈ 0.067
Plaintext French	≈ 0.074
Plaintext German	≈ 0.072
Plaintext Italian	≈ 0.074
Vigenère with a long key	≈ 0.038 to 0.045
Caesar / Atbash / monoalphabetic substitution	same as plaintext (≈ 0.067 EN, 0.074 FR)

These values aren’t magic: they follow directly from each language’s letter frequencies. A language where one letter strongly dominates (E in French) concentrates the identical pairs and raises the IC. A flat language (or a polyalphabetic cipher that flattens frequency) drives the IC back down toward the uniform value.
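To see where such values come from: for a long text, n_i × (n_i − 1) / (N × (N − 1)) tends toward p_i², where p_i is the frequency of letter i, so the expected IC is roughly the sum of the squared frequencies. The sketch below applies that to a table of approximate English letter frequencies. The table holds commonly cited rounded percentages and should be treated as an assumption (figures vary by corpus); the names ENGLISH_PERCENT and expected_ic are hypothetical.

```python
from typing import Mapping

# Approximate English letter frequencies in percent (rounded, commonly cited
# figures; an assumption for illustration, they differ from corpus to corpus).
ENGLISH_PERCENT = {
    "E": 12.70, "T": 9.06, "A": 8.17, "O": 7.51, "I": 6.97, "N": 6.75,
    "S": 6.33, "H": 6.09, "R": 5.99, "D": 4.25, "L": 4.03, "C": 2.78,
    "U": 2.76, "M": 2.41, "W": 2.36, "F": 2.23, "G": 2.02, "Y": 1.97,
    "P": 1.93, "B": 1.29, "V": 0.98, "K": 0.77, "J": 0.15, "X": 0.15,
    "Q": 0.10, "Z": 0.07,
}


def expected_ic(weights: Mapping[str, float]) -> float:
    """Approximate the IC of a long text as the sum of squared probabilities."""
    total = sum(weights.values())  # normalise so the probabilities sum to 1
    return sum((w / total) ** 2 for w in weights.values())


# A perfectly flat 26-letter distribution gives exactly 1/26.
UNIFORM = {chr(ord("A") + i): 1.0 for i in range(26)}

print(expected_ic(ENGLISH_PERCENT))  # lands near the English value in the table
print(expected_ic(UNIFORM))          # 1/26 ≈ 0.038
```

The few large frequencies (E, T, A) contribute most of the sum once squared, which is why a peaked distribution scores well above the flat one.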

Use 1: monoalphabetic versus polyalphabetic

First classical use: faced with an unknown ciphertext, you measure the IC.

IC ≈ 0.067 (English) or 0.074 (French) → probably a monoalphabetic cipher (Caesar, Atbash, substitution). Run frequency analysis.
IC ≈ 0.04 → probably a polyalphabetic cipher (Vigenère, Beaufort) or a modern ciphertext. Direct frequency analysis won’t work; you must first determine the key length.
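IC ≈ 0.038 with apparently uniform random output → modern cipher (AES, ChaCha20) or OTP, assuming the output is written over 26 letters (measured over raw bytes, the uniform value is 1/256 ≈ 0.0039). Abandon classical cryptanalysis.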

This measurement takes three lines of code and gives an immediate orientation.
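Here is one possible sketch of that first orientation step, with a compact IC function and a helper that maps the result to a first guess. The helper rough_diagnosis and its cut-off values (0.060 and 0.042, chosen for English) are illustrative assumptions, not established thresholds; on short or unusual texts they will misfire.

```python
from collections import Counter


def ic(text):
    """Index of coincidence over the letters A to Z (everything else ignored)."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    counts, n = Counter(letters), len(letters)
    return sum(k * (k - 1) for k in counts.values()) / (n * (n - 1)) if n > 1 else 0.0


def rough_diagnosis(value):
    """Map an IC to a first guess. Cut-offs are hypothetical, tuned for English."""
    if value >= 0.060:
        return "monoalphabetic: run frequency analysis"
    if value >= 0.042:
        return "polyalphabetic: look for the key length first"
    return "near uniform: long-key polyalphabetic, modern cipher or OTP"


# Usage: replace the placeholder with the ciphertext under study.
ciphertext = "PLACEHOLDER CIPHERTEXT GOES HERE"
print(rough_diagnosis(ic(ciphertext)))
```

The placeholder string is far too short for the result to mean anything; it only shows how the two functions chain together.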

Use 2: finding a Vigenère key length

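Second use, more powerful. On Vigenère, the global IC is flat (~0.04). But if you take a sub-message by selecting one letter every k, and k is the key length (or a multiple of it), that sub-message has been encrypted with a single key letter, i.e. a Caesar cipher. Its IC climbs back to the language’s value. For any other k, the sub-message still mixes several key letters and its IC stays noticeably lower.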

Procedure:

For each candidate key length k = 1, 2, 3, …, 20 (a reasonable upper bound), take the ciphertext and extract the letters at positions 0, k, 2k, 3k, …. That gives the “first sub-message”.
Compute the IC of that sub-message.
Repeat for all tested k.
The smallest k that yields IC ≈ 0.067 (EN) or 0.074 (FR) is most likely the key length; its multiples score high as well.
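The procedure can be sketched as follows. One deliberate variation: instead of looking only at the first sub-message, this version averages the IC over all k sub-messages (offsets 0 to k − 1), a common way to reduce noise. The function names, the default target of 0.067 and the tolerance of 0.01 are assumptions made for the example.

```python
from collections import Counter


def ic(text):
    """Index of coincidence over the letters A to Z (everything else ignored)."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    counts, n = Counter(letters), len(letters)
    return sum(k * (k - 1) for k in counts.values()) / (n * (n - 1)) if n > 1 else 0.0


def ic_by_key_length(ciphertext, max_k=20):
    """Return {k: mean IC of the k sub-messages made of every k-th letter}."""
    letters = "".join(c for c in ciphertext.upper() if "A" <= c <= "Z")
    result = {}
    for k in range(1, max_k + 1):
        # Sub-message number `offset` holds positions offset, offset + k, ...
        columns = [letters[offset::k] for offset in range(k)]
        result[k] = sum(ic(col) for col in columns) / k
    return result


def guess_key_length(ciphertext, target=0.067, tolerance=0.01, max_k=20):
    """Smallest k whose mean IC reaches target - tolerance, or None."""
    for k, value in ic_by_key_length(ciphertext, max_k).items():
        if value >= target - tolerance:
            return k  # multiples of the key length also qualify, so stop early
    return None
```

Returning the smallest qualifying k matters because every multiple of the true key length also produces sub-messages enciphered with a single key letter.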

This trick is what allowed William F. Friedman, working at Riverbank Laboratories and then for the U.S. Army (the NSA did not exist until 1952), to systematize the cracking of Vigenère in the early 1920s. Coupled with the Kasiski test, it reduces a cipher long deemed undecipherable to a puzzle of a few hours.

Friedman’s and Kullback’s contribution

In 1922 William Friedman published his famous “The Index of Coincidence and Its Applications in Cryptography”. He formalized the measure and showed how it allows you to:

Determine the key length of a periodic polyalphabetic cipher.
Identify the language of a recovered plaintext.
Detect a statistical dependency between two texts (for example, two ciphertexts enciphered with the same key).

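His colleague Solomon Kullback carried this statistical approach further, notably in “Statistical Methods in Cryptanalysis” (1938), and later co-developed Kullback-Leibler divergence (1951), a more general measure of how far one distribution lies from another. Both men went on to hold senior positions at the NSA after its creation in 1952.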

A quick worked calculation

Take the text ATTACKATDAWN (12 letters):

A appears 4 times → 4 × 3 = 12
T appears 3 times → 3 × 2 = 6
C appears 1 time → 0
K appears 1 time → 0
D appears 1 time → 0
W appears 1 time → 0
N appears 1 time → 0
Sum = 18
IC = 18 / (12 × 11) = 18 / 132 ≈ 0.136

The IC is very high, but on 12 letters statistical noise dominates. IC is reliable from ~100 characters onward, ideally 200+. On short texts, comparing several ICs against each other still works, but judgments based on absolute values are risky.
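The sketch below checks the worked calculation and then gives a rough feel for the noise: it measures how widely the IC scatters across uniformly random texts of 12, 100 and 200 letters. The helper ic_spread, the sample count and the fixed seed are assumptions for the illustration, and uniformly random letters are only a stand-in for real text.

```python
import random
import statistics
from collections import Counter


def ic(text):
    """Index of coincidence over the letters A to Z (everything else ignored)."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    counts, n = Counter(letters), len(letters)
    return sum(k * (k - 1) for k in counts.values()) / (n * (n - 1)) if n > 1 else 0.0


print(ic("ATTACKATDAWN"))  # 18 / 132 ≈ 0.136


def ic_spread(length, samples=500, seed=0):
    """Standard deviation of the IC over uniformly random texts of `length` letters."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same figures
    alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    values = [
        ic("".join(rng.choice(alphabet) for _ in range(length)))
        for _ in range(samples)
    ]
    return statistics.stdev(values)


for length in (12, 100, 200):
    print(length, ic_spread(length))
```

The spread at 12 letters is of the same order as the gap between the random and plaintext reference values, which is why a single short measurement says little.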

Key takeaways:

IC measures the concentration of the letter distribution: high for a language (E, A dominate), low for random text.
Reference values: ≈ 0.074 French, 0.067 English, 0.038 random.
First use: distinguish monoalphabetic (high IC) from polyalphabetic (low IC) on an unknown ciphertext.
Second use: find a Vigenère key length by testing the IC of sub-messages taken every k characters.
Unreliable below ~100 characters. Beyond that, it’s a thoroughly robust cryptanalyst’s tool.
