Frequency Analysis: How to Break Substitution Ciphers with Letter Statistics
Learn how frequency analysis breaks substitution ciphers using letter statistics. Complete guide with English frequency tables, step-by-step cipher breaking tutorial, bigram analysis, and Index of Coincidence.
Introduction
Frequency analysis is the oldest and most fundamental technique in cryptanalysis — the science of breaking codes and ciphers without knowing the key. Its underlying principle is deceptively simple: every natural language has a characteristic distribution of letters, and this distribution survives encryption by substitution ciphers. By counting how often each letter appears in a ciphertext and comparing those counts to the expected frequencies of the target language, a cryptanalyst can deduce the substitution key and recover the original message.
This technique was first described around 850 AD by the Arab polymath Al-Kindi in his manuscript On Deciphering Cryptographic Messages. For nearly a thousand years, it remained the primary method for breaking ciphers — and even today, it is the first tool taught in any introductory cryptography course.
This guide covers the full depth of frequency analysis: the mathematical foundations, complete reference tables for English letter frequencies, a step-by-step tutorial for breaking a real cipher, advanced techniques including bigram and trigram analysis, the Index of Coincidence, and the limitations of frequency analysis against modern ciphers.
Try our free Letter Frequency Analysis Tool to analyze any text and compare its letter distribution with standard English frequencies in real time.
What Is Frequency Analysis?
Frequency analysis is the study of how often each letter (or symbol) appears in a body of text. In the context of cryptanalysis, it exploits a fundamental weakness of monoalphabetic substitution ciphers: these ciphers replace each plaintext letter with a single, fixed ciphertext letter, which means the frequency pattern of the original language is perfectly preserved in the ciphertext — just mapped to different letters.
For example, if the letter E appears 12.7% of the time in English text, and a substitution cipher replaces E with the letter Q, then Q will appear approximately 12.7% of the time in the ciphertext. The substitution changes the labels but not the underlying statistical distribution.
This means that to break the cipher, the cryptanalyst simply needs to:
- Count the frequency of each letter in the ciphertext
- Compare those frequencies to the known frequency distribution of the target language
- Match the most common ciphertext letter to the most common plaintext letter (E in English), the second most common to the second (T), and so on
- Refine the initial guesses using bigram analysis, common word patterns, and contextual clues
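The counting-and-matching steps above can be sketched in a few lines of Python. This gives only a first approximation (the refinement in the last step is where the real work happens), and the function name is illustrative:

```python
from collections import Counter

# Standard English letters in descending frequency order.
ENGLISH_ORDER = "ETAOINSHRDLCUMWFGYPBVKJXQZ"

def initial_key_guess(ciphertext: str) -> dict:
    """Map each ciphertext letter to a plaintext guess by frequency rank.

    The most common ciphertext letter is guessed to be E, the next T,
    and so on. The result is a starting hypothesis, not a finished key.
    """
    letters = [c for c in ciphertext.upper() if c.isalpha()]
    ranked = [letter for letter, _ in Counter(letters).most_common()]
    return {cipher: plain for cipher, plain in zip(ranked, ENGLISH_ORDER)}
```

Ties in `most_common` keep first-seen order, so short texts can shuffle the ranking; treat the output as a set of hypotheses to test against word patterns.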
Al-Kindi and the Birth of Cryptanalysis
Abu Yusuf Yaqub ibn Ishaq al-Sabbah Al-Kindi (c. 801-873 AD), known in the West as Alkindus, was an Arab philosopher, mathematician, and polymath working in the House of Wisdom in Baghdad during the Islamic Golden Age. His treatise A Manuscript on Deciphering Cryptographic Messages is the earliest known description of frequency analysis.
Al-Kindi wrote:
"One way to solve an encrypted message, if we know its language, is to find a different plaintext of the same language long enough to fill one sheet or so, and then we count the occurrences of each letter. We call the most frequently occurring letter the 'first,' the next most frequently occurring letter the 'second,' the following most frequently occurring letter the 'third,' and so on, until we account for all the different letters in the plaintext sample."
This passage describes essentially the same technique that modern cryptanalysts use. Al-Kindi's work remained largely unknown in Europe until the Renaissance, but in the Arabic-speaking world, his methods were applied to break ciphers used in diplomatic and military communications for centuries.
English Letter Frequency Distribution
The following table presents the standard frequency distribution for all 26 letters of the English alphabet, based on the analysis of millions of words from diverse English-language texts including newspapers, novels, academic papers, and correspondence.
| Rank | Letter | Frequency (%) | Cumulative (%) | Notes |
|---|---|---|---|---|
| 1 | E | 12.702 | 12.70 | Most common; found in "the," "he," "she," "be" |
| 2 | T | 9.056 | 21.76 | Second most common; "the," "to," "it," "that" |
| 3 | A | 8.167 | 29.93 | Third most common; "and," "a," "are," "as" |
| 4 | O | 7.507 | 37.43 | "of," "or," "on," "to," "so" |
| 5 | I | 6.966 | 44.40 | "in," "is," "it," "I," "if" |
| 6 | N | 6.749 | 51.15 | "and," "not," "no," "in," "on" |
| 7 | S | 6.327 | 57.48 | "so," "she," "is," plurals |
| 8 | H | 6.094 | 63.57 | "the," "he," "has," "had," "his" |
| 9 | R | 5.987 | 69.56 | "are," "or," "her," "for" |
| 10 | D | 4.253 | 73.81 | "and," "did," "do," "had" |
| 11 | L | 4.025 | 77.84 | "all," "like," "last," "will" |
| 12 | C | 2.782 | 80.62 | "can," "come," "could" |
| 13 | U | 2.758 | 83.38 | "up," "us," "but," "use" |
| 14 | M | 2.406 | 85.79 | "me," "my," "may," "more" |
| 15 | W | 2.360 | 88.15 | "was," "we," "with," "will" |
| 16 | F | 2.228 | 90.38 | "for," "from," "first" |
| 17 | G | 2.015 | 92.40 | "go," "get," "good" |
| 18 | Y | 1.974 | 94.37 | "you," "yes," "year" |
| 19 | P | 1.929 | 96.30 | "put," "part," "people" |
| 20 | B | 1.492 | 97.79 | "but," "be," "by," "been" |
| 21 | V | 0.978 | 98.77 | "very," "have," "over" |
| 22 | K | 0.772 | 99.54 | "know," "keep," "king" |
| 23 | J | 0.153 | 99.69 | "just," "job," "join" |
| 24 | X | 0.150 | 99.84 | "next," "six," "box" |
| 25 | Q | 0.095 | 99.94 | "queen," "quite," "question" |
| 26 | Z | 0.074 | 100.00 | "zero," "zone," "size" |
The mnemonic ETAOIN SHRDLU lists twelve of the most common letters in roughly descending frequency order (by the figures above, C narrowly edges out U for twelfth place). The sequence was so well known among Linotype typesetters (whose keyboards were arranged by letter frequency) that the phrase entered popular culture as a synonym for garbled text.
Frequency Variation Across Languages
Different languages have dramatically different frequency profiles. This is useful for identifying the language of an encrypted text:
| Language | Most Common Letters (descending) | IC Value |
|---|---|---|
| English | E, T, A, O, I, N, S, H, R | 0.0667 |
| French | E, A, S, I, N, T, R, L, U | 0.0778 |
| German | E, N, I, S, R, A, T, D, H | 0.0762 |
| Spanish | E, A, O, S, R, N, I, D, L | 0.0775 |
| Italian | E, A, I, O, N, L, R, T, S | 0.0738 |
| Portuguese | A, E, O, S, R, I, N, D, M | 0.0745 |
Notice that E is the most common letter in most European languages, but the rest of the frequency order varies significantly. These differences mean that frequency analysis requires knowledge of (or a hypothesis about) the plaintext language.
Step-by-Step Cipher Breaking Tutorial
Let us work through a complete example of breaking a substitution cipher using frequency analysis. Here is the ciphertext:
UIF RVJDL CSPXO GPY KVNQT PWFS UIF MBAZ EPH.
UIF TFDPOET BGUFS UIF GJSTU BSF BMXBZT IBSEFS.
Step 1: Count Letter Frequencies
First, we count every alphabetic character, ignoring spaces and punctuation:
| Letter | Count | Frequency (%) | Letter | Count | Frequency (%) |
|---|---|---|---|---|---|
| F | 9 | 12.33 | S | 7 | 9.59 |
| U | 6 | 8.22 | B | 6 | 8.22 |
| I | 5 | 6.85 | P | 5 | 6.85 |
| T | 5 | 6.85 | E | 3 | 4.11 |
| G | 3 | 4.11 | D | 2 | 2.74 |
| J | 2 | 2.74 | M | 2 | 2.74 |
| O | 2 | 2.74 | V | 2 | 2.74 |
| X | 2 | 2.74 | Z | 2 | 2.74 |
| A | 1 | 1.37 | C | 1 | 1.37 |
| H | 1 | 1.37 | K | 1 | 1.37 |
| L | 1 | 1.37 | N | 1 | 1.37 |
| Q | 1 | 1.37 | R | 1 | 1.37 |
| W | 1 | 1.37 | Y | 1 | 1.37 |
Total alphabetic characters: 73
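Counts like these are easy to verify with Python's `collections.Counter`:

```python
from collections import Counter

ciphertext = (
    "UIF RVJDL CSPXO GPY KVNQT PWFS UIF MBAZ EPH. "
    "UIF TFDPOET BGUFS UIF GJSTU BSF BMXBZT IBSEFS."
)

# Keep only alphabetic characters, ignoring spaces and punctuation.
letters = [c for c in ciphertext if c.isalpha()]
counts = Counter(letters)

print(len(letters))  # 73 alphabetic characters in total
for letter, n in counts.most_common(3):
    print(letter, n, f"{100 * n / len(letters):.2f}%")
```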
Step 2: Compare with English Frequencies
The most frequent letter in our ciphertext is F (12.33%). In English, the most frequent letter is E (12.7%). Our initial hypothesis: F = E.
The next most frequent letters are S (9.59%), U (8.22%), and B (8.22%). In English, T (9.1%) and A (8.2%) follow E. On a sample this short the ranking is noisy, so we treat U = T and B = A as working hypotheses to be confirmed against word patterns.
Step 3: Look for Common Words
The three-letter word "UIF" appears twice. If U=T and F=E, then UIF = T_E, which strongly suggests UIF = THE, meaning I = H.
Now we have three confirmed substitutions: F=E, U=T, I=H.
Step 4: Test the Caesar Cipher Hypothesis
Looking at our substitutions:
- F (position 6) -> E (position 5): shift of +1
- U (position 21) -> T (position 20): shift of +1
- I (position 9) -> H (position 8): shift of +1
All three substitutions show the same shift of +1. This is a strong indicator that we are dealing with a Caesar cipher with shift 1 — every letter has been shifted forward by one position.
Step 5: Apply the Key and Decode
Shifting every letter back by 1 position, the complete plaintext is:
THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG.
THE SECONDS AFTER THE FIRST ARE ALWAYS HARDER.
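The shift-back step can be done mechanically. A minimal Caesar decoder (the function name is illustrative):

```python
def caesar_decrypt(text: str, shift: int) -> str:
    """Shift every letter back by `shift` positions, preserving case,
    spaces, and punctuation; wraps around from A back to Z."""
    result = []
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            result.append(chr((ord(ch) - base - shift) % 26 + base))
        else:
            result.append(ch)
    return "".join(result)

print(caesar_decrypt("UIF RVJDL CSPXO GPY", 1))  # THE QUICK BROWN FOX
```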
This example is intentionally straightforward to illustrate the method clearly. In practice, substitution ciphers use more complex key mappings, but the same frequency analysis approach applies — it simply requires more iteration and pattern matching.
What If It Is Not a Caesar Cipher?
If the shifts are not uniform, you are dealing with a general monoalphabetic substitution cipher. In that case:
- Map the top 5-6 ciphertext letters to ETAOIN by frequency
- Look for repeated two-letter and three-letter patterns (likely "the," "and," "for," "is," "of")
- Check for single-letter words (likely "a" or "I")
- Use the partial decryption to identify more letter mappings
- Iterate until the plaintext becomes clear
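A practical way to iterate is to apply a partial key and render unresolved letters as placeholders, so word shapes stay visible while you hunt for the next mapping. A small sketch (the helper name is invented for illustration):

```python
def apply_partial_key(ciphertext: str, key: dict) -> str:
    """Render a partial decryption: mapped letters become plaintext,
    unmapped letters become '_' so word lengths and patterns remain visible."""
    out = []
    for ch in ciphertext.upper():
        if ch.isalpha():
            out.append(key.get(ch, "_"))
        else:
            out.append(ch)
    return "".join(out)

# After guessing F=E, U=T, I=H in the worked example:
print(apply_partial_key("UIF RVJDL CSPXO", {"F": "E", "U": "T", "I": "H"}))
# THE _____ _____
```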
For texts longer than 200 characters, this process typically converges to the correct solution within 15-30 minutes of systematic work.
Bigrams, Trigrams, and Pattern Analysis
Single-letter frequency analysis provides the first approximation, but analyzing pairs (bigrams) and triples (trigrams) of consecutive letters dramatically improves accuracy and speed.
Top 20 English Bigrams
| Rank | Bigram | Frequency (%) | Rank | Bigram | Frequency (%) |
|---|---|---|---|---|---|
| 1 | TH | 3.56 | 11 | ES | 1.34 |
| 2 | HE | 3.07 | 12 | ED | 1.17 |
| 3 | IN | 2.43 | 13 | OR | 1.15 |
| 4 | ER | 2.05 | 14 | TI | 1.14 |
| 5 | AN | 1.99 | 15 | HI | 1.09 |
| 6 | RE | 1.85 | 16 | AS | 1.07 |
| 7 | ON | 1.76 | 17 | TO | 1.05 |
| 8 | AT | 1.49 | 18 | HA | 1.02 |
| 9 | EN | 1.45 | 19 | NG | 0.95 |
| 10 | ND | 1.35 | 20 | SE | 0.93 |
Top 15 English Trigrams
| Rank | Trigram | Frequency (%) | Rank | Trigram | Frequency (%) |
|---|---|---|---|---|---|
| 1 | THE | 3.51 | 9 | ION | 0.70 |
| 2 | AND | 1.59 | 10 | TER | 0.68 |
| 3 | ING | 1.47 | 11 | WAS | 0.61 |
| 4 | HER | 0.90 | 12 | THA | 0.58 |
| 5 | THA | 0.83 | 12 | NTH | 0.58 |
| 6 | ERE | 0.78 | 14 | ATE | 0.52 |
| 7 | FOR | 0.76 | 15 | ALL | 0.50 |
| 8 | ENT | 0.73 | | | |
How to Use N-gram Analysis
The power of n-gram analysis lies in its ability to constrain the solution space:
- Find the most common bigram in the ciphertext. It very likely represents TH or HE.
- If you have identified T and H, look for trigrams of the form TH?. Because "the" is the most common English word, THE is also the most common trigram (3.51%), and locating it pins down E.
- Check for reciprocal bigram pairs (AB alongside BA). The most common reciprocal pairs in English include ER/RE, ES/SE, ON/NO, and AN/NA. Finding these patterns confirms letter assignments.
- Analyze word endings. Common English suffixes include -ING, -TION, -ED, -ER, -EST, -LY, and -MENT. In a substitution cipher, these patterns remain as recognizable repeated sequences at word boundaries.
- Use double letters. The most common double letters in English are LL, SS, EE, OO, TT, FF, RR, NN, PP, and CC. If you see doubled letters in the ciphertext, check these candidates first.
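Extracting and ranking n-grams takes only a few lines. Note that this sketch joins letters across word boundaries, which matches how ciphertext without spaces is analyzed:

```python
from collections import Counter

def top_ngrams(text: str, n: int, k: int = 5):
    """Return the k most common n-grams of consecutive letters,
    ignoring spaces and punctuation."""
    letters = "".join(c for c in text.upper() if c.isalpha())
    grams = [letters[i:i + n] for i in range(len(letters) - n + 1)]
    return Counter(grams).most_common(k)

sample = "the theory that the thing thinks"
print(top_ngrams(sample, 2, 3))  # bigrams: TH dominates, as expected
print(top_ngrams(sample, 3, 3))  # trigrams
```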
Index of Coincidence Explained
The Index of Coincidence (IC), introduced by William Friedman in 1922, is a statistical measure that helps determine the type of cipher used to encrypt a message. It calculates the probability that two randomly selected letters from a text are identical.
The Formula
For a text of length N with letter counts n1, n2, ..., n26 (for each letter A through Z), the IC is:
IC = [n1(n1 - 1) + n2(n2 - 1) + ... + n26(n26 - 1)] / [N(N - 1)]
where the sum is taken over all 26 letters.
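In code, the formula is a direct translation; a minimal sketch:

```python
from collections import Counter

def index_of_coincidence(text: str) -> float:
    """IC = sum(n_i * (n_i - 1)) / (N * (N - 1)) over the 26 letters,
    where n_i is the count of each letter and N the total letter count."""
    letters = [c for c in text.upper() if c.isalpha()]
    n = len(letters)
    if n < 2:
        return 0.0  # IC is undefined for fewer than two letters
    counts = Counter(letters)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))
```

Running this on a page of ordinary English text should yield a value near 0.0667; on uniformly random letters, near 0.0385.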
Interpreting IC Values
| IC Value | Interpretation |
|---|---|
| ~0.0667 | Typical English text or monoalphabetic cipher |
| ~0.0500 - 0.0600 | Polyalphabetic cipher with short key (2-5 characters) |
| ~0.0385 | Random text or very long polyalphabetic key |
The IC of standard English text (~0.0667) is significantly higher than that of random text (~0.0385, which equals 1/26) because English has an uneven letter distribution. Some letters (E, T, A) appear far more often than others (Z, Q, X), making it more likely that two randomly chosen letters will match.
Using IC in Cryptanalysis
The IC is most useful for distinguishing between monoalphabetic and polyalphabetic ciphers:
- Calculate the IC of the ciphertext.
- If IC is near 0.0667: The cipher is likely monoalphabetic (Caesar, substitution, Atbash, affine). The frequency distribution has been shuffled but not flattened. Standard frequency analysis will work.
- If IC is between 0.04 and 0.06: The cipher is likely polyalphabetic (Vigenere, Beaufort, Gronsfeld). The frequency distribution has been partially flattened. You need to determine the key length first, then apply frequency analysis to each sub-cipher.
- If IC is near 0.0385: The cipher uses a very long key (approaching a one-time pad) or is a modern cipher. Frequency analysis will not be directly useful.
Determining Key Length with IC
For a suspected Vigenere cipher, you can estimate the key length as follows:
- Try key lengths of 2, 3, 4, 5, etc.
- For each candidate key length k, split the ciphertext into k groups (every k-th letter belongs to the same group).
- Calculate the IC for each group.
- When the IC of each group is close to 0.0667, you have found the correct key length — because each group was encrypted with the same Caesar shift.
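The column-splitting procedure above can be sketched as follows (the IC helper is repeated so the snippet stands alone):

```python
from collections import Counter

def ic(letters: str) -> float:
    """Index of Coincidence of a string of letters."""
    n = len(letters)
    if n < 2:
        return 0.0
    counts = Counter(letters)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))

def average_ic_for_key_length(ciphertext: str, k: int) -> float:
    """Split the text into k columns (every k-th letter shares a column)
    and average their ICs. For the true key length, each column is a
    single Caesar shift, so its IC should be close to ~0.0667."""
    letters = "".join(c for c in ciphertext.upper() if c.isalpha())
    columns = [letters[i::k] for i in range(k)]
    return sum(ic(col) for col in columns) / k

# Typical use: score candidate lengths and inspect which comes closest
# to ~0.0667 (short texts can make large k look spuriously good):
# scores = {k: average_ic_for_key_length(ct, k) for k in range(1, 11)}
```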
This method, combined with the Kasiski examination (looking for repeated sequences in the ciphertext to infer key length), forms the standard approach for breaking Vigenere ciphers.
When Frequency Analysis Fails
Frequency analysis is a powerful tool, but it has clear limitations. Understanding when and why it fails is as important as knowing how to apply it.
Polyalphabetic Ciphers
The Vigenere cipher, invented in the 16th century, was specifically designed to defeat frequency analysis. It uses multiple Caesar cipher alphabets in rotation, controlled by a keyword. Each plaintext letter is shifted by a different amount depending on its position in the keyword cycle.
The effect on frequency analysis is dramatic: instead of each plaintext E mapping to a single ciphertext letter, it maps to several different ciphertext letters (one per keyword character). This distributes the frequency peak of E across multiple ciphertext letters, flattening the overall distribution and making it resemble random text.
However, polyalphabetic ciphers are not immune to statistical attack. Once the key length is determined (using IC or Kasiski examination), the ciphertext can be split into groups, each encrypted with a simple Caesar shift. Frequency analysis then works perfectly on each group individually.
Homophonic Substitution
A homophonic substitution cipher assigns multiple ciphertext symbols to each plaintext letter, with common letters receiving more alternatives. For example:
- E (12.7%) might map to any of the symbols 14, 27, 38, 51, 63, 79, 82, 91, 03, 45, 56, 68, 74
- Z (0.07%) maps only to 99
If the number of alternatives for each letter is proportional to its frequency, the resulting ciphertext has a nearly flat frequency distribution — each of the ~100 symbols appears approximately 1% of the time. Simple frequency counting reveals nothing.
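A toy homophonic encoder illustrates the idea; the symbol pools below are invented for illustration, not taken from any historical cipher:

```python
import random

# Invented symbol pools: common letters get proportionally more homophones.
HOMOPHONES = {
    "E": ["14", "27", "38", "51", "63"],
    "T": ["05", "19", "42", "77"],
    "A": ["08", "23", "66"],
    "Z": ["99"],
}

def encrypt_homophonic(plaintext: str, rng: random.Random) -> list:
    """Replace each letter with a randomly chosen symbol from its pool.
    Spreading E over five symbols flattens single-symbol frequencies."""
    return [rng.choice(HOMOPHONES[c]) for c in plaintext.upper() if c in HOMOPHONES]
```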
Breaking homophonic ciphers requires more sophisticated techniques: bigram frequency analysis (the substitution does not hide bigram patterns as effectively), hill-climbing algorithms, and known-plaintext attacks.
Very Short Texts
Frequency analysis relies on the law of large numbers: with enough text, observed frequencies converge to their expected values. With short texts (under 100 characters), the statistical noise is too large for reliable conclusions.
Consider: a grammatical English sentence like "Fuzzy ducks quack by jinxing vows" has a wildly non-standard letter distribution simply because it is short and packed with uncommon letters. Its frequency profile looks nothing like standard English even though it is ordinary plaintext.
As a rule of thumb:
- < 50 characters: Frequency analysis is essentially useless
- 50-100 characters: Provides only weak hypotheses
- 100-200 characters: Moderately reliable for monoalphabetic ciphers
- 200+ characters: Highly reliable for monoalphabetic ciphers
- 500+ characters: Can distinguish between cipher types using IC
One-Time Pad
The one-time pad (OTP) is the only theoretically unbreakable cipher. It uses a random key that is as long as the message and is never reused. Against a properly implemented OTP, frequency analysis (and every other cryptanalytic technique) is provably useless — the ciphertext contains zero information about the plaintext, because every possible plaintext is equally likely.
Modern Encryption
Modern cryptographic algorithms (AES, ChaCha20, RSA) produce ciphertext that is computationally indistinguishable from random data. Every possible byte value (0-255) appears with equal probability, and no statistical pattern of any kind is detectable. Frequency analysis is entirely inapplicable to modern encryption — it is strictly a tool for classical ciphers.
Modern Applications of Frequency Analysis
While frequency analysis originated as a code-breaking tool, the underlying principle — that text has characteristic statistical patterns — has found applications far beyond cryptography.
Authorship Attribution
Statistical analysis of letter, word, and n-gram frequencies can help identify the author of anonymous or disputed texts. Different authors have measurably different stylistic "fingerprints" in their use of function words (the, a, of, and), sentence lengths, and vocabulary patterns. This technique has been applied to disputed Shakespearean works, anonymous political tracts, and forensic document examination.
Language Identification
Frequency analysis can automatically identify the language of a text by comparing its letter and bigram frequencies against known profiles for different languages. This is the basis for language detection features in search engines, translation tools, and text processing systems.
Spam Detection
Statistical text analysis, including character and word frequency patterns, is one component of modern spam detection systems. Spam emails often have measurably different frequency profiles from legitimate correspondence — for example, higher frequencies of exclamation marks, uppercase letters, and words like "free," "winner," and "urgent."
Data Compression
The fundamental insight behind frequency analysis — that some symbols are more common than others — is the same principle that drives data compression algorithms. Huffman coding assigns shorter binary codes to more frequent symbols, just as Morse code assigns shorter dot-dash patterns to more common letters. Arithmetic coding and entropy coding extend this principle to achieve near-optimal compression ratios.
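Huffman's frequency-to-code-length principle can be shown with a short sketch that computes each symbol's code length (not the full bit codes) from its counts:

```python
import heapq
from collections import Counter

def huffman_code_lengths(text: str) -> dict:
    """Return each symbol's Huffman code length in bits: more frequent
    symbols get shorter codes, mirroring Morse code's design."""
    freq = Counter(text)
    if len(freq) == 1:
        return {s: 1 for s in freq}  # degenerate case: one symbol, one bit
    # Heap entries: (subtree weight, unique tie-breaker, {symbol: depth}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

lengths = huffman_code_lengths("aaaabbc")  # 'a' gets a 1-bit code
```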
Forensic Linguistics
In legal contexts, frequency analysis techniques help determine whether a confession was genuinely written by the accused, whether a threatening letter matches a suspect's writing style, or whether a document has been forged. These analyses go well beyond simple letter counting to include word frequency profiles, syntactic patterns, and statistical measures of text complexity.
Frequently Asked Questions
Can frequency analysis break any cipher?
No. Frequency analysis is effective against monoalphabetic substitution ciphers (Caesar, keyword, Atbash, affine) where each plaintext letter maps to exactly one ciphertext letter. It is less effective against polyalphabetic ciphers (Vigenere, Beaufort), requires modification for homophonic substitution, and is completely useless against modern encryption algorithms (AES, RSA) which produce statistically random output.
How long does it take to break a cipher with frequency analysis?
For a monoalphabetic substitution cipher with 200+ characters of ciphertext, an experienced cryptanalyst can typically recover the plaintext within 15-30 minutes using a combination of single-letter frequency analysis, bigram/trigram patterns, and common word recognition. For shorter texts or more complex ciphers, the process can take significantly longer and may require computerized analysis.
What tools do professional cryptanalysts use for frequency analysis?
Modern cryptanalysts use software tools that automate letter counting, bigram/trigram analysis, IC calculation, and substitution testing. Tools range from simple scripts in Python (using the collections.Counter class) to specialized programs like CrypTool, Cipher Tools, and custom-built cryptanalysis software. Our Letter Frequency Analysis Tool provides interactive charts and statistical comparisons suitable for analyzing classical ciphers.
Is frequency analysis the same as statistical analysis?
Frequency analysis is a specific application of statistical analysis to text and cryptography. Statistical analysis is the broader discipline of collecting, organizing, and interpreting data. Frequency analysis applies statistical counting and comparison techniques specifically to the letters (or other elements) of a text, with the goal of identifying patterns that reveal information about the text's origin, language, or encryption method.
Can frequency analysis determine what language a cipher was written in?
Yes. If you can decrypt enough of the ciphertext to see the plaintext frequency distribution, the pattern will indicate the language. Even without decryption, the IC value provides clues: different languages have different IC values (English ~0.0667, French ~0.0778, German ~0.0762). If you know the cipher type but not the language, trying different language frequency profiles will reveal which one produces coherent plaintext.
What is the difference between frequency analysis and brute force?
Frequency analysis is an intelligent, pattern-based attack that uses statistical properties of language to deduce the key. Brute force tries every possible key until one produces readable plaintext. For a general 26-letter substitution cipher, brute force would need to try up to 26! (approximately 4 x 10^26) possible keys — computationally infeasible. Frequency analysis, by contrast, typically requires examining only a few dozen hypotheses, making it practical even by hand.
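The keyspace figure is easy to verify:

```python
import math

# Number of possible keys for a general 26-letter substitution cipher.
keys = math.factorial(26)
print(keys)           # 403291461126605635584000000
print(f"{keys:.2e}")  # about 4.03e+26
```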
How did frequency analysis change the history of warfare?
Frequency analysis has influenced numerous military conflicts. Arab cryptanalysts used it against European codes during the Crusades. Mary, Queen of Scots was convicted and executed in 1587 partly because her encrypted correspondence with Anthony Babington was broken using frequency analysis. In World War I, Room 40 at British Naval Intelligence used frequency analysis techniques to break German diplomatic codes, including the Zimmermann Telegram that helped bring the United States into the war.
What comes after frequency analysis in a cryptanalysis course?
After mastering frequency analysis, students typically study the Kasiski examination and Index of Coincidence for breaking polyalphabetic ciphers, followed by transposition cipher analysis, known-plaintext attacks, chosen-plaintext attacks, and eventually the mathematical foundations of modern cryptography including modular arithmetic, prime number theory, and computational complexity. Our site offers tools for many of these cipher types at caesarcipher.org/ciphers.
Conclusion
Frequency analysis stands as one of the most elegant intersections of linguistics and mathematics. From Al-Kindi's pioneering manuscript in 9th-century Baghdad to the computerized cryptanalysis tools of today, the core insight has remained unchanged: language is not random, and the statistical regularities in how we use letters provide a powerful key for breaking ciphers that substitute one letter for another.
Understanding frequency analysis is not merely an academic exercise. It provides the conceptual foundation for understanding why modern ciphers work the way they do — specifically, why they must produce output that is statistically indistinguishable from random data. Every advance in cryptography since the Renaissance can be understood as a response to frequency analysis: polyalphabetic ciphers attempted to flatten the frequency distribution, homophonic ciphers tried to equalize symbol usage, and modern algorithms ensure that no statistical pattern of any kind survives encryption.
Ready to try frequency analysis yourself? Use our free Letter Frequency Analysis Tool to paste any text — plaintext or ciphertext — and instantly see its letter distribution compared with standard English frequencies. The interactive charts and detailed statistics make it easy to identify patterns and test hypotheses. For practice material, try encrypting a message with our Caesar Cipher or Keyword Cipher tools, then use frequency analysis to break your own encryption.