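Letter Frequency Analysis Tool

Frequency analysis examines how often each letter appears in a text and compares that distribution with known language patterns. It is one of the oldest and most powerful techniques in cryptanalysis, first described by Al-Kindi in the 9th century, and it remains the primary method for breaking classical substitution ciphers such as Caesar, Atbash, and keyword ciphers.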


Frequently Asked Questions About Frequency Analysis

What is frequency analysis in cryptography?

Frequency analysis is a cryptanalysis technique that studies how often each letter appears in a piece of text. Since every language has a characteristic letter frequency distribution (for example, E is the most common letter in English at about 12.7%), analyzing the frequencies of letters in ciphertext can reveal the substitution pattern used to encrypt it. This method was first described by the Arab polymath Al-Kindi in the 9th century and remains one of the most fundamental tools in classical cryptography.
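As a rough illustration of the counting step, here is a minimal Python sketch. The function name `letter_frequencies` is made up for this example, and it assumes only the unaccented letters A to Z matter; it is not the code behind the tool on this page.

```python
from collections import Counter


def letter_frequencies(text: str) -> dict[str, float]:
    """Return each letter's share of all alphabetic characters, in percent."""
    # Keep only A-Z; digits, spaces, and punctuation are ignored.
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    if not letters:
        return {}
    counts = Counter(letters)
    total = len(letters)
    # most_common() orders the result from most to least frequent.
    return {letter: 100 * n / total for letter, n in counts.most_common()}


if __name__ == "__main__":
    for letter, pct in letter_frequencies("Meet me here before eleven").items():
        print(f"{letter}: {pct:.1f}%")
```

Percentages are taken over letters only, so spacing and punctuation do not dilute the result.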

How does frequency analysis break substitution ciphers?

In a simple substitution cipher, each plaintext letter is consistently replaced by a single ciphertext letter. This means the frequency pattern of the original language is preserved in the ciphertext — just mapped to different letters. By comparing the ciphertext letter frequencies to known language frequencies, a cryptanalyst can match the most common ciphertext letter to E, the second most common to T, and so on. Combined with analysis of common digrams (TH, HE, IN) and trigrams (THE, AND, ING), most substitution ciphers can be broken with moderate amounts of ciphertext.
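The rank-matching idea can be sketched in a few lines of Python. The names `first_guess_key` and `apply_key` are hypothetical, and the letter ordering is simply the commonly cited English ranking; on real ciphertext this first guess is usually only partly right and needs the bigram and trigram checks described above.

```python
from collections import Counter

# English letters from most to least frequent (a commonly cited ordering;
# ranks beyond the first dozen vary from corpus to corpus).
ENGLISH_ORDER = "ETAOINSHRDLCUMWFGYPBVKJXQZ"


def first_guess_key(ciphertext: str) -> dict[str, str]:
    """Pair ciphertext letters, by descending count, with ENGLISH_ORDER."""
    counts = Counter(ch for ch in ciphertext.upper() if "A" <= ch <= "Z")
    ranked = [letter for letter, _ in counts.most_common()]
    return dict(zip(ranked, ENGLISH_ORDER))


def apply_key(ciphertext: str, key: dict[str, str]) -> str:
    """Substitute mapped letters; show unmapped letters as '_'."""
    return "".join(
        key.get(ch, "_") if "A" <= ch <= "Z" else ch
        for ch in ciphertext.upper()
    )


if __name__ == "__main__":
    key = first_guess_key("XXXYYZ")
    print(key, apply_key("ZYX", key))
```

Letters with equal counts are ranked in order of first appearance, so ties are exactly where a human should try the alternatives.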

What are the most common English letter frequencies?

The most common letters in English, in order, are: E (12.7%), T (9.1%), A (8.2%), O (7.5%), I (7.0%), N (6.7%), S (6.3%), H (6.1%), R (6.0%), and D (4.3%). The mnemonic ETAOIN SHRDLU approximates the top 12 letters by frequency; in the reference table on this page, C (2.782%) narrowly edges out U (2.758%) for twelfth place. The least common letters are Z (0.07%), Q (0.10%), X (0.15%), and J (0.15%). These frequencies are averages across large bodies of English text and may vary with specific texts, genres, and writing styles.

What is the chi-squared statistic in frequency analysis?

The chi-squared statistic measures how much an observed frequency distribution differs from an expected one. In frequency analysis, it compares the actual letter counts in your text against the counts you would expect if the text followed standard language frequencies. A low chi-squared value (below about 30 for 25 degrees of freedom) suggests the text matches normal language patterns, while a high value suggests the text is encrypted, written in a different language, or has an unusual letter distribution.
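One way to compute the statistic is sketched below in Python. The expected percentages are the ones from the reference table on this page; the function name `chi_squared` and its signature are assumptions for illustration, and the tool itself may compute the value differently.

```python
from collections import Counter

# Expected English letter frequencies in percent (reference table values).
ENGLISH_FREQ = {
    "A": 8.167, "B": 1.492, "C": 2.782, "D": 4.253, "E": 12.702,
    "F": 2.228, "G": 2.015, "H": 6.094, "I": 6.966, "J": 0.153,
    "K": 0.772, "L": 4.025, "M": 2.406, "N": 6.749, "O": 7.507,
    "P": 1.929, "Q": 0.095, "R": 5.987, "S": 6.327, "T": 9.056,
    "U": 2.758, "V": 0.978, "W": 2.360, "X": 0.150, "Y": 1.974,
    "Z": 0.074,
}


def chi_squared(text: str, expected: dict[str, float] = ENGLISH_FREQ) -> float:
    """Sum of (observed - expected)^2 / expected over every letter."""
    counts = Counter(ch for ch in text.upper() if ch in expected)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("text contains no letters from the expected alphabet")
    stat = 0.0
    for letter, pct in expected.items():
        expected_count = total * pct / 100  # count we would expect to see
        stat += (counts[letter] - expected_count) ** 2 / expected_count
    return stat


if __name__ == "__main__":
    print(chi_squared("The letter E is the most common letter"))
    print(chi_squared("ZZZZQQQQ"))
```

Because the statistic is built from counts rather than percentages, it grows with text length whenever the distribution is off, so any fixed threshold is best treated as a rule of thumb.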

Which ciphers are vulnerable to frequency analysis?

Simple (monoalphabetic) substitution ciphers are the most vulnerable, including Caesar cipher, Atbash cipher, keyword cipher, and affine cipher. These all map each plaintext letter to exactly one ciphertext letter, preserving frequency patterns. Polyalphabetic ciphers like Vigenère make frequency analysis harder because each plaintext letter can encrypt to multiple ciphertext letters, but they can still be broken using the Kasiski examination or index of coincidence to determine the key length, after which each sub-cipher can be attacked individually.

How much ciphertext do you need for frequency analysis to work?

Generally, frequency analysis becomes reliable with at least 100-200 characters of ciphertext for simple substitution ciphers. With shorter texts, the natural variation in letter frequencies makes it harder to draw reliable conclusions. Very short messages (under 50 characters) may not contain enough data for letter frequencies to match the expected language pattern. For polyalphabetic ciphers, even more ciphertext is needed because the analysis must be performed on subsets of the text corresponding to each key position.

What are the most common English letter bigrams?

The most common English bigrams are TH (3.56%), HE (3.07%), IN (2.43%), ER (2.05%), AN (1.99%), RE (1.85%), ON (1.76%), AT (1.49%), EN (1.45%), and ND (1.35%). Analyzing bigram frequency can reveal patterns that single-letter frequency analysis misses.

How do you use frequency analysis to crack a cipher?

Count how often each letter appears in the ciphertext. Compare these frequencies with standard English letter frequencies (E=12.7%, T=9.1%, A=8.2%, O=7.5%, I=7.0%). The most frequent ciphertext letter likely represents E. Use common bigrams (TH, HE, IN) and short words (THE, AND, FOR) to confirm substitutions and gradually decode the message.

What is the Index of Coincidence?

The Index of Coincidence (IC) measures how likely two randomly chosen letters from a text are to be identical. English text has an IC of approximately 0.0667, while random text is about 0.0385. IC helps determine whether a cipher is monoalphabetic (IC near English) or polyalphabetic (IC closer to random), guiding which cryptanalysis approach to use.
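The usual formula is IC = sum of n_i(n_i - 1) over all letters, divided by N(N - 1), where n_i is the count of letter i and N is the total number of letters. A minimal Python sketch follows; the function name is chosen for this example.

```python
from collections import Counter


def index_of_coincidence(text: str) -> float:
    """Probability that two letters drawn without replacement are equal."""
    counts = Counter(ch for ch in text.upper() if "A" <= ch <= "Z")
    n = sum(counts.values())
    if n < 2:
        raise ValueError("need at least two letters")
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


if __name__ == "__main__":
    print(index_of_coincidence("AAAA"))  # every pair matches
    print(index_of_coincidence("ABCD"))  # no pair matches
```

On short texts the value is noisy, so the comparison with 0.0667 and 0.0385 is only meaningful for reasonably long samples.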

When does frequency analysis fail?

Frequency analysis is unreliable on very short texts (under 100 characters), texts in specialized vocabularies, polyalphabetic ciphers like Vigenère (which flatten frequency distributions), and homophonic substitution ciphers that map frequent letters to multiple symbols. For polyalphabetic ciphers, you must first determine the key length using Kasiski examination or IC analysis.

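How to Use the Frequency Analysis Tool

Frequency analysis is the process of counting how many times each letter appears in a text and using those counts to infer where the text came from or how it was encrypted. Here is a step-by-step guide to the tool on this page:

  1. Paste or type your text into the input box. The tool accepts any text: plaintext, ciphertext, or a mix. For best results, use at least 100 characters so that the frequency patterns are statistically meaningful.

  2. Look at the frequency chart. The interactive bar chart shows each letter's frequency as a percentage of all alphabetic characters. Letters are sorted alphabetically by default, but you can sort by frequency to spot the most and least common letters quickly.

  3. Compare with English frequencies. The chart overlays standard English letter frequencies on your text's distribution. Look for the characteristic peaks at E, T, A, O, and I. If the peaks are all shifted by the same amount, you may be looking at a Caesar cipher. If the distribution looks flat, a polyalphabetic cipher such as Vigenère was probably used.

  4. Check the chi-squared statistic. This single number summarizes how closely your text matches expected English frequencies. A value below 30 indicates text close to normal English; a value above 50 strongly suggests that the text is encrypted or written in a language other than English.

  5. Inspect the per-letter deviations. The detailed statistics table shows each letter's actual frequency, expected frequency, and the difference between them. A large positive deviation means the letter appears more often in your text than English would predict; a large negative deviation means it appears less often.

  6. Form hypotheses and test them. If you suspect a monoalphabetic substitution cipher, map the most common ciphertext letter to E, the second most common to T, and so on. Check these guesses against common bigrams and trigrams, and keep adjusting the substitution until coherent plaintext emerges.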
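The Caesar case mentioned in step 3 is the easiest hypothesis to test in code. The sketch below assumes the most frequent ciphertext letter stands for E, which tends to hold for longer English texts but can easily fail on short ones; `shift_text` and `guess_caesar_shift` are names invented for this example.

```python
import string
from collections import Counter

ALPHABET = string.ascii_uppercase


def shift_text(text: str, shift: int) -> str:
    """Shift every letter forward by `shift` positions, wrapping at Z."""
    out = []
    for ch in text.upper():
        if ch in ALPHABET:
            out.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
        else:
            out.append(ch)  # leave spaces and punctuation alone
    return "".join(out)


def guess_caesar_shift(ciphertext: str) -> int:
    """Guess the encryption shift by assuming the top letter stands for E."""
    counts = Counter(ch for ch in ciphertext.upper() if ch in ALPHABET)
    if not counts:
        raise ValueError("ciphertext contains no letters")
    top_letter = counts.most_common(1)[0][0]
    return (ALPHABET.index(top_letter) - ALPHABET.index("E")) % 26


if __name__ == "__main__":
    secret = shift_text("MEET ME HERE BEFORE ELEVEN", 3)
    shift = guess_caesar_shift(secret)
    print(secret)
    print(shift, shift_text(secret, -shift))
```

If the decoded text is gibberish, the next thing to try is assuming the top letter stands for T or A instead, or scoring all 26 shifts with the chi-squared statistic.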

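English Letter Frequency Reference Table

The table below shows the standard frequency distribution of letters in English text, based on analysis of large text corpora. The values are averages and vary with genre, author, and text length.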

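| Letter | Frequency (%) | Example words | Letter | Frequency (%) | Example words |
|---|---|---|---|---|---|
| A | 8.167 | and, are, at | N | 6.749 | not, new, no |
| B | 1.492 | but, be, by | O | 7.507 | of, or, on |
| C | 2.782 | can, come | P | 1.929 | put, part |
| D | 4.253 | do, did, day | Q | 0.095 | queen, quite |
| E | 12.702 | the, he, be | R | 5.987 | are, her, or |
| F | 2.228 | for, from | S | 6.327 | so, she, is |
| G | 2.015 | get, go, got | T | 9.056 | the, to, it |
| H | 6.094 | he, has, had | U | 2.758 | up, us, use |
| I | 6.966 | in, is, it | V | 0.978 | very, have |
| J | 0.153 | just, job | W | 2.360 | was, we, with |
| K | 0.772 | know, keep | X | 0.150 | next, six |
| L | 4.025 | like, last | Y | 1.974 | you, year |
| M | 2.406 | my, me, may | Z | 0.074 | zero, zone |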

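The mnemonic ETAOIN SHRDLU lists roughly the twelve most common letters in descending order: E, T, A, O, I, N, S, H, R, D, L, U. (By the table above, C is marginally more frequent than U.) The sequence was so widely known among typesetters that it became a cultural icon in its own right.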

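Breaking a Cipher: A Worked Example

Consider the following ciphertext, encrypted with a simple substitution cipher:

HSZIV GSV OVGGVI UIVJFVMXB WRHGIRYFGRLM LU GSRH GVCG DRGS HGZMWZIW VMTORHS UIVJFVMXRVH GL XIZXP GSV XRKSVI

Step 1: Count letter frequencies.

The text contains 91 letters. The most frequent are:

| Rank | Letter | Count | Frequency |
|---|---|---|---|
| 1 | V | 13 | 14.3% |
| 2 | G | 12 | 13.2% |
| 3 | I | 8 | 8.8% |
| 4 | R | 8 | 8.8% |
| 5 | S | 7 | 7.7% |

Step 2: Compare with English frequencies.

In standard English, the five most frequent letters are E (12.7%), T (9.1%), A (8.2%), O (7.5%), and I (7.0%). Comparing:

  • V (14.3%) likely corresponds to E (12.7%) or T (9.1%)
  • G (13.2%) likely corresponds to T (9.1%) or E (12.7%)

Step 3: Look for common patterns.

The three-letter word "GSV" appears twice. The most common three-letter word in English is "THE". If GSV = THE, then G=T, S=H, and V=E.

Step 4: Apply the hypothesis and extend it.

Substituting G=T, S=H, and V=E turns "OVGGVI" into "_ETTE_", which strongly suggests "LETTER" (O=L, I=R). Likewise "GSRH" becomes "TH__" and "DRGS" becomes "__TH", which fit "THIS" and "WITH" (R=I, H=S, D=W). Every pair found so far mirrors the alphabet: G (position 7) maps to T (position 20), and 7+20=27; S (19) maps to H (8); V (22) maps to E (5); O (15) maps to L (12). This is the Atbash cipher, in which each letter maps to its counterpart in the reversed alphabet (A<->Z, B<->Y, and so on), so that position + mirrored position = 27.

Step 5: Decode the full message.

Applying the Atbash substitution decodes the whole ciphertext to: "SHARE THE LETTER FREQUENCY DISTRIBUTION OF THIS TEXT WITH STANDARD ENGLISH FREQUENCIES TO CRACK THE CIPHER"

This example shows how frequency analysis, combined with pattern recognition and knowledge of common words, can systematically break a substitution cipher.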
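The Atbash rule from the example (position + mirrored position = 27) is easy to check in Python. This is a small sketch with helper names chosen for illustration; it decodes a few words taken from the ciphertext above.

```python
import string

ALPHABET = string.ascii_uppercase
# Translation table pairing A with Z, B with Y, and so on.
ATBASH = str.maketrans(ALPHABET, ALPHABET[::-1])


def atbash(text: str) -> str:
    """Encode or decode with Atbash (the cipher is its own inverse)."""
    return text.upper().translate(ATBASH)


def position(letter: str) -> int:
    """1-based position of a letter in the alphabet (A=1, Z=26)."""
    return ALPHABET.index(letter.upper()) + 1


if __name__ == "__main__":
    print(atbash("GSV OVGGVI"))
    print(atbash("GSV XRKSVI"))
    print(position("G") + position(atbash("G")))
```

Because Atbash has no key, recognizing the mirrored pairs is enough to decode the whole message.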

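N-gram Analysis: Bigrams and Trigrams

Single-letter frequency analysis is powerful, but analyzing consecutive letter pairs (bigrams) and triples (trigrams) reveals more about a text's structure. N-gram analysis exploits the fact that English, like any natural language, has strong statistical preferences for particular letter combinations.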

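Top 10 Most Common English Bigrams

| Rank | Bigram | Frequency (%) | Notes |
|---|---|---|---|
| 1 | TH | 3.56 | Most common bigram; appears in "the", "that", "this", "them" |
| 2 | HE | 3.07 | Appears in "the", "he", "her", "here", "them" |
| 3 | IN | 2.43 | Common preposition; also part of the ending "-ing" |
| 4 | ER | 2.05 | Common ending ("-er", "-ler", "-ber"); also in "her", "every" |
| 5 | AN | 1.99 | The article "an"; also in "and", "any", "can", "man" |
| 6 | RE | 1.85 | The prefix "re-"; also in "are", "were", "here" |
| 7 | ON | 1.76 | Preposition; also in "one", "only", "upon" |
| 8 | AT | 1.49 | Preposition; also in "that", "what", "cat" |
| 9 | EN | 1.45 | Common ending ("-en", "-ment"); also in "then", "when" |
| 10 | ND | 1.35 | Ends "and", "end", "find", "kind" |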

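Top 10 Most Common English Trigrams

| Rank | Trigram | Frequency (%) | Notes |
|---|---|---|---|
| 1 | THE | 3.51 | The most common English word |
| 2 | AND | 1.59 | The most common conjunction |
| 3 | ING | 1.47 | Present-participle suffix |
| 4 | HER | 0.90 | Pronoun; also in "there", "where", "other" |
| 5 | THA | 0.83 | Begins "that", "than" |
| 6 | ERE | 0.78 | Appears in "there", "where", "here" |
| 7 | FOR | 0.76 | Common preposition |
| 8 | ENT | 0.73 | Appears in "went", "sent", and the suffix "-ment" |
| 9 | ION | 0.70 | Common suffixes "-tion", "-sion" |
| 10 | TER | 0.68 | Appears in "after", "water", "letter" |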
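Counting n-grams in your own text takes only a sliding window. The Python sketch below counts sequences inside words only, which is an assumption of this example; published tables differ on whether they count across word boundaries, so percentages computed this way may not match the figures above exactly.

```python
import re
from collections import Counter


def ngram_counts(text: str, n: int) -> Counter:
    """Count n-letter sequences inside words (never across spaces)."""
    counts = Counter()
    for word in re.findall(r"[A-Z]+", text.upper()):
        # Slide a window of width n across the word.
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts


if __name__ == "__main__":
    sample = "The weather there is better than the weather here"
    print(ngram_counts(sample, 2).most_common(5))
    print(ngram_counts(sample, 3).most_common(5))
```

Running the same function on ciphertext gives the repeated pairs and triples to compare against TH, HE, THE, and the other entries in the tables.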

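Using N-grams in Cryptanalysis

When single-letter frequency analysis yields several plausible mappings, bigram and trigram analysis helps narrow down the correct substitution:

  1. Identify repeated bigrams in the ciphertext. The most common ciphertext bigram likely corresponds to TH.
  2. Look for trigram patterns. If a three-letter sequence appears frequently, it likely represents THE.
  3. Watch the word boundaries. English has very few two-letter words (common ones: OF, TO, IN, IS, IT, AS, AT, WE, HE, BY, OR, ON, DO, IF, ME, MY, UP, AN, GO, NO, US, AM, SO). If word boundaries are visible in the ciphertext, checking candidates against the known two-letter words quickly narrows the solution space.
  4. Combine with letter frequency data. Once N-gram analysis gives you high-confidence mappings, use them as anchors for assigning the remaining letters by single-letter frequency.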

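Limitations of Frequency Analysis

Frequency analysis is not a universal codebreaking tool. Several kinds of encryption resist it or sidestep it entirely:

Polyalphabetic Ciphers

Ciphers such as Vigenère use several substitution alphabets and cycle through them as each letter is encrypted. This spreads every plaintext letter across several different ciphertext letters, flattening the frequency distribution until it looks close to random text. Breaking a polyalphabetic cipher requires first determining the key length (using Kasiski examination or the index of coincidence) and then running frequency analysis on each sub-cipher separately.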
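The index-of-coincidence approach to key length can be sketched as follows: split the ciphertext into columns for each candidate length and average the IC of the columns. At the true key length (and its multiples) each column is a monoalphabetic sub-cipher, so the average should rise toward the English value. The function names here are hypothetical, and the toy input only demonstrates the mechanics.

```python
from collections import Counter


def index_of_coincidence(letters: str) -> float:
    """IC of a string of letters; 0.0 if it is too short to measure."""
    n = len(letters)
    if n < 2:
        return 0.0
    counts = Counter(letters)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def average_column_ic(ciphertext: str, key_length: int) -> float:
    """Average IC over the columns produced by a candidate key length."""
    letters = "".join(ch for ch in ciphertext.upper() if "A" <= ch <= "Z")
    # Column i holds every key_length-th letter starting at offset i.
    columns = [letters[i::key_length] for i in range(key_length)]
    return sum(index_of_coincidence(col) for col in columns) / key_length


def rank_key_lengths(ciphertext: str, max_length: int = 12) -> list[tuple[int, float]]:
    """Candidate key lengths sorted from highest to lowest average IC."""
    scores = [(k, average_column_ic(ciphertext, k)) for k in range(1, max_length + 1)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy input with an obvious period of 2.
    print(rank_key_lengths("ABABABAB", max_length=3))
```

On real Vigenère ciphertext, look for the smallest length whose average IC sits near 0.0667, since multiples of the true length score well too.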

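Homophonic Substitution Ciphers

A homophonic substitution cipher maps each plaintext letter to several possible ciphertext symbols, giving more frequent letters more alternatives. For example, E might map to any of five different symbols while Z has only one. This evens out the ciphertext frequency distribution and defeats attacks based on simple counting. Breaking homophonic ciphers calls for more sophisticated techniques, including bigram frequency analysis and hill-climbing algorithms.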
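The flattening effect can be seen in a toy Python sketch. The symbol table below is entirely made up (E gets five symbols, T three, Z one), and it covers only three letters. For reproducibility the sketch rotates through each letter's symbols in turn, whereas a real homophonic cipher would normally pick among them at random.

```python
from collections import Counter
from itertools import cycle

# Hypothetical symbol table: more frequent letters get more symbols.
HOMOPHONES = {
    "E": ["01", "02", "03", "04", "05"],
    "T": ["06", "07", "08"],
    "Z": ["09"],
}


def encrypt(plaintext: str) -> list[str]:
    """Replace each letter with the next unused symbol from its pool."""
    pickers = {letter: cycle(symbols) for letter, symbols in HOMOPHONES.items()}
    # Letters without an entry in the toy table are skipped.
    return [next(pickers[ch]) for ch in plaintext.upper() if ch in pickers]


def decrypt(symbols: list[str]) -> str:
    """Map every symbol back to the single letter it stands for."""
    reverse = {s: letter for letter, pool in HOMOPHONES.items() for s in pool}
    return "".join(reverse[s] for s in symbols)


if __name__ == "__main__":
    message = "EEEEETTTZ"
    symbols = encrypt(message)
    print(Counter(message).most_common(1))   # E dominates the plaintext
    print(Counter(symbols).most_common(1))   # no symbol dominates the ciphertext
    print(decrypt(symbols))
```

The plaintext is dominated by E, yet every ciphertext symbol appears once, which is why plain single-symbol counting gives the attacker nothing to work with.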

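Short Texts

With fewer than 100 characters, natural statistical fluctuation in letter usage can be larger than the signal you are trying to detect. A short text may happen to contain no E at all, even though E is the most common letter in English. In such cases frequency analysis gives only weak evidence and must be supplemented by other techniques, such as known-plaintext attacks or context-based inference.

Null Ciphers and Steganography

Some methods hide the message inside innocuous-looking text. Frequency analysis of the cover text is useless, because the cover text itself has a normal frequency distribution. Detecting these methods requires entirely different analytical approaches.

Modern Encryption

Modern cryptographic algorithms (AES, RSA, ChaCha20), when used correctly, produce ciphertext that is computationally indistinguishable from random data. Every byte value appears with equal probability, and no amount of frequency analysis reveals anything about the plaintext. Strictly speaking, frequency analysis applies only to classical ciphers.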

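Related Tools

  • Caesar Cipher Decoder: The Caesar cipher is one of the simplest substitution ciphers and is vulnerable to frequency analysis because it merely shifts the whole frequency distribution by a fixed amount.
  • Keyword Cipher: A monoalphabetic substitution cipher that uses a keyword to rearrange the alphabet. Frequency analysis is the primary method for breaking it.
  • Homophonic Cipher: Designed specifically to resist frequency analysis by mapping common letters to several ciphertext symbols, which evens out the output distribution.
  • Cipher Identifier: Use the cipher identifier to determine which type of cipher an encrypted message uses before choosing an analysis method.
  • Vigenère Cipher: A polyalphabetic substitution cipher that resists simple frequency analysis. Breaking it requires first finding the key length with Kasiski examination or the index of coincidence, then running frequency analysis on each sub-cipher.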