Homophonic Substitution Cipher: From Defeating Frequency Analysis to the Zodiac Killer

Introduction

Frequency analysis is the most powerful weapon in a cryptanalyst's arsenal against simple substitution ciphers. By counting how often each symbol appears in a ciphertext and comparing those frequencies to the known letter distribution of the source language, a skilled analyst can crack most monoalphabetic ciphers in minutes. For centuries after the Arab polymath al-Kindi first described the technique around 850 CE, frequency analysis made simple substitution ciphers unreliable for serious secrets.

The homophonic substitution cipher was the cryptographic world's answer to this problem. Instead of replacing each plaintext letter with a single fixed ciphertext symbol, a homophonic cipher assigns multiple different symbols to each letter. The more frequently a letter appears in the language, the more substitute symbols it receives. When done well, this flattens the frequency distribution of the ciphertext, making each symbol appear with roughly equal probability and stripping away the statistical fingerprint that frequency analysis depends on.

This idea, simple in concept but demanding in execution, has shaped some of the most consequential episodes in cryptographic history -- from Renaissance papal correspondence to the most famous unsolved ciphers of the twentieth century.


What Is Homophonic Substitution?

The Core Idea

In a standard simple substitution cipher (like the Caesar cipher or keyword cipher), each letter of the plaintext alphabet maps to exactly one ciphertext symbol, and each ciphertext symbol maps back to exactly one plaintext letter. The mapping is a bijection: one-to-one and onto. This means that if E is the most common letter in English (appearing about 12.7% of the time), then whatever symbol E maps to will also appear about 12.7% of the time in the ciphertext. The frequency fingerprint transfers directly from plaintext to ciphertext.

A homophonic substitution cipher breaks this one-to-one relationship. Each plaintext letter can be represented by several different ciphertext symbols, called homophones. When encrypting a message, each occurrence of a letter is replaced by a randomly chosen homophone from that letter's set. The result is that the ciphertext frequency of each individual symbol is much more uniform than in a standard substitution cipher.

A Simple Example

Suppose we assign the following homophones to just a few letters:

Letter	Homophones
E	14, 27, 43, 56, 72, 81
T	09, 33, 65, 48
A	17, 39, 62, 51
O	05, 28, 74
N	11, 36, 88
S	22, 47, 70
R	03, 58
I	19, 44
...	...

The letter E, being the most common, gets six homophones. When encrypting, each occurrence of E is replaced by one of {14, 27, 43, 56, 72, 81} chosen at random. In the resulting ciphertext, no single symbol dominates -- instead, each of the six E-symbols appears only about 2% of the time (12.7% divided among six symbols), blending in with the frequencies of less common letters.
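The partial table above can be turned into a minimal sketch of encryption and decryption. The names HOMOPHONES, SYMBOL_TO_LETTER, encrypt, and decrypt are illustrative choices, and the sketch assumes the message uses only the eight letters listed.

```python
import random

# Partial table copied from the example above; a real table covers all 26 letters.
HOMOPHONES = {
    "E": ["14", "27", "43", "56", "72", "81"],
    "T": ["09", "33", "65", "48"],
    "A": ["17", "39", "62", "51"],
    "O": ["05", "28", "74"],
    "N": ["11", "36", "88"],
    "S": ["22", "47", "70"],
    "R": ["03", "58"],
    "I": ["19", "44"],
}

# Inverse map for decryption: every symbol points back to exactly one letter.
SYMBOL_TO_LETTER = {sym: letter
                    for letter, syms in HOMOPHONES.items()
                    for sym in syms}


def encrypt(plaintext, rng=random):
    """Replace each letter with a randomly chosen homophone."""
    return [rng.choice(HOMOPHONES[ch]) for ch in plaintext]


def decrypt(symbols):
    """Map each symbol back to its letter; no randomness is involved."""
    return "".join(SYMBOL_TO_LETTER[s] for s in symbols)


ciphertext = encrypt("TENNESSEE")
print(" ".join(ciphertext))   # varies from run to run
print(decrypt(ciphertext))    # TENNESSEE
```

Encrypting the same message twice will usually give different ciphertexts, yet both decrypt to the same plaintext because the inverse map is unambiguous.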

Why It Works

The effectiveness of homophonic substitution comes down to a simple statistical principle. Frequency analysis relies on the assumption that the frequency of ciphertext symbols mirrors the frequency of plaintext letters. By distributing each letter's frequency across multiple symbols, the homophonic cipher violates this assumption. If each symbol appears with roughly equal frequency, the analyst cannot determine which symbols represent common letters and which represent rare ones.

The ideal homophonic cipher assigns homophones in proportion to letter frequencies: if E accounts for 12.7% of text and the cipher uses 100 total symbols, then E should receive about 13 symbols. If Z accounts for 0.07% of text, it still needs one symbol, so that symbol stays rarer than the others. For letters whose share is matched closely, each symbol appears with probability approximately 1/100 = 1%; the closer the proportions, the flatter the ciphertext frequency distribution.
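The arithmetic behind this can be checked with a few lines. This is a small sketch that assumes each homophone of a letter is chosen with equal probability; per_symbol_pct is a hypothetical helper name, and the T figures (9.1% over 9 symbols) come from the frequency table later in this article.

```python
def per_symbol_pct(letter_pct, n_homophones):
    """Share of the ciphertext taken by each homophone of one letter,
    assuming the homophones are chosen uniformly at random."""
    return letter_pct / n_homophones


for letter, pct, n in [("E", 12.7, 13), ("T", 9.1, 9), ("Z", 0.07, 1)]:
    print(f"{letter}: {per_symbol_pct(pct, n):.2f}% per symbol")
# E: 0.98% per symbol
# T: 1.01% per symbol
# Z: 0.07% per symbol
```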

Historical Origins: Papal Cryptographers and Renaissance Courts

The Earliest Known Use (1400s)

The homophonic substitution cipher emerged in the courts of Renaissance Italy, where city-states like Florence, Venice, Milan, and the Papal States were locked in constant diplomatic intrigue. Ambassadors, spymasters, and papal secretaries needed ciphers that could resist the frequency analysis techniques that were already spreading through the scholarly community.

The earliest documented homophonic ciphers appear in the archives of the Duchy of Mantua and the Republic of Florence from the early fifteenth century. These early systems were relatively simple: they assigned two or three symbols to the most common letters (chiefly the vowels) while leaving less common letters with single symbols. Even this modest level of homophonic substitution significantly increased the difficulty of cryptanalysis.

The Papal Cipher Office

The Vatican maintained one of the most sophisticated cryptographic operations in Renaissance Europe. The papal cipher secretary was responsible for encoding and decoding all diplomatic correspondence between Rome and its network of nuncios (papal ambassadors) across Europe.

By the mid-fifteenth century, the papal cipher office had developed elaborate homophonic systems that included:

Multiple symbols for each common letter
Nulls (meaningless symbols inserted to confuse analysts)
Code words for common names, places, and concepts (a hybrid system known as a nomenclator)

The Argentis, a family of papal cipher secretaries spanning several generations in the sixteenth and seventeenth centuries, refined these systems to a high art. Their nomenclators combined hundreds of code groups for words and syllables with homophonic substitution for individual letters, creating ciphers that were extremely difficult to break by the standards of the era.

Nomenclators: The Dominant Cipher System for Centuries

For roughly four and a half centuries, from about 1400 to 1850, the nomenclator was the dominant encryption system in European diplomacy. A nomenclator combined two layers of encryption:

A homophonic substitution table for individual letters
A code table mapping common words, names, and phrases to arbitrary symbols or number groups

The term "nomenclator" itself comes from the Latin for "name caller," reflecting the code table's role in disguising proper names. A typical nomenclator might include code groups for "the King of France," "declare war," "peace treaty," "10,000 troops," and hundreds of other diplomatic terms, alongside a homophonic alphabet for spelling out words not in the code table.

Nomenclators ranged from simple systems with a few dozen code groups to massive tables with thousands of entries. The more entries the code table contained, the harder the cipher was to break -- but also the harder it was to use correctly and the more likely the operator was to make errors.

The Great Cipher of Louis XIV

Antoine and Bonaventure Rossignol

The most celebrated nomenclator in history was the "Grand Chiffre" (Great Cipher) used by the court of Louis XIV of France in the late seventeenth century. It was created by Antoine Rossignol and his son Bonaventure Rossignol, who served as cryptanalysts and cipher makers to the French crown.

The Great Cipher was exceptional because of its design: instead of encoding individual letters, it encoded syllables. The French language was divided into its component syllables, and each syllable was assigned a unique number (from a set of 587 numbers). Some numbers were traps -- they indicated that the previous number should be ignored, throwing off any analyst who was attempting a systematic attack.
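The trap mechanism is easy to picture in code. The following is a minimal sketch with an illustrative code table: the SYLLABLES entries and the trap number 999 are assumptions made for the example and should not be read as the historical key.

```python
# Illustrative code table; treat every number here as hypothetical.
SYLLABLES = {124: "les", 22: "en", 125: "ne", 46: "mi", 345: "s"}
TRAPS = {999}  # hypothetical "ignore the previous number" code


def decode(numbers):
    """Decode a list of code numbers, honoring trap numbers."""
    kept = []
    for n in numbers:
        if n in TRAPS:
            if kept:
                kept.pop()      # the trap cancels the number just before it
        else:
            kept.append(n)
    return "".join(SYLLABLES[n] for n in kept)


print(decode([124, 22, 46, 999, 125, 46, 345]))  # lesennemis
```

An analyst who does not know which numbers are traps sees an extra syllable ("mi") in the middle of the word, which is exactly the kind of noise the design was meant to introduce.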

This syllabic encoding, combined with the trap numbers, made the Great Cipher extraordinarily resistant to analysis. After the Rossignols died, the key to the cipher was lost, and the messages encoded with it remained unread for over two hundred years.

Etienne Bazeries and the Solution

In 1893, the French military cryptanalyst Commandant Etienne Bazeries finally cracked the Great Cipher after three years of painstaking work. Bazeries' key insight was recognizing that the numbers represented syllables rather than individual letters. Once he identified a few syllable values through informed guessing and statistical analysis, he could bootstrap his way through the system, using the partial decryptions to identify additional syllable values.

Among the messages Bazeries decoded was correspondence that appeared to reveal the identity of the Man in the Iron Mask -- a prisoner of state whom Louis XIV had kept confined for decades with his face hidden behind a mask. The decoded letters suggested the prisoner was a disgraced general named Vivien de Bulonde, though this identification remains debated by historians.

The Zodiac Killer's Ciphers

The most famous homophonic substitution ciphers in modern history are the cryptograms created by the Zodiac Killer, an unidentified serial murderer who operated in Northern California in the late 1960s and early 1970s. The Zodiac sent taunting letters and ciphers to San Francisco Bay Area newspapers, claiming to have committed numerous murders and challenging the public to decode his messages.

The Z408 Cipher (1969)

On July 31, 1969, the Zodiac sent three fragments of a cipher to three Bay Area newspapers: the San Francisco Chronicle, the San Francisco Examiner, and the Vallejo Times-Herald. Each newspaper received one third of the complete 408-symbol cryptogram. The Zodiac demanded that the newspapers publish the cipher on their front pages, threatening to go on a killing spree if they did not.

The combined cipher, known as Z408, was a homophonic substitution cipher using a mix of letters, numbers, and invented symbols. The Zodiac used approximately 54 different symbols to represent the 26 letters of the alphabet, assigning multiple symbols to common letters like E and T.

Donald and Bettye Harden: Solving Z408

Within a week of the cipher's publication, a high school teacher named Donald Harden and his wife Bettye, from Salinas, California, cracked the Z408. They were amateur puzzle enthusiasts with no formal training in cryptanalysis.

The Hardens' approach combined several techniques:

Crib guessing. Bettye Harden suggested that the Zodiac, given his apparent narcissism, might have started the message with the word "I" or the phrase "I LIKE KILLING." This gave them a possible crib -- a known or guessed plaintext segment -- to look for patterns in the cipher.

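Pattern matching. The word "KILLING" contains a double L. In a homophonic cipher the two Ls could be written with different homophones, but if the same symbol was used twice, the pair would show up as a doubled symbol in the ciphertext. They searched for doubled symbols that could correspond to the double L and found a match.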

Bootstrapping. Once "I LIKE KILLING" was tentatively placed, symbols for the letters I, L, K, E, N, and G were tentatively known. These known letters were used to guess adjacent words, which revealed more letter-symbol mappings, which in turn allowed further decryption.
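The crib-placement idea can be sketched in code. This is not a reconstruction of the Hardens' pencil-and-paper work, only an illustration of the consistency check behind it: in a homophonic cipher a letter may have many symbols, but a symbol can stand for only one letter. The function names and the toy ciphertext are hypothetical.

```python
def crib_fits(ciphertext, crib, offset):
    """Return the implied symbol->letter mapping, or None on a conflict.

    The only conflict to check is a symbol that would have to stand for
    two different letters; one letter using several symbols is allowed.
    """
    window = ciphertext[offset:offset + len(crib)]
    if len(window) < len(crib):
        return None
    mapping = {}
    for symbol, letter in zip(window, crib):
        if mapping.setdefault(symbol, letter) != letter:
            return None
    return mapping


def candidate_offsets(ciphertext, crib):
    """List every offset where the crib can be placed without conflict."""
    return [i for i in range(len(ciphertext) - len(crib) + 1)
            if crib_fits(ciphertext, crib, i) is not None]


# Toy ciphertext as a list of symbol names (hypothetical, not the real Z408).
toy = ["s1", "s2", "s1", "s3", "s4", "s3", "s5", "s2", "s2", "s6", "s7", "s8"]
print(candidate_offsets(toy, "KILLING"))    # [5]
print(crib_fits(toy, "KILLING", 5))
```

At offset 5 the doubled symbol s2 lines up with the double L, and the letter I is allowed to use two different symbols (s5 and s6). The returned mapping is the starting point for the bootstrapping step.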

The decoded Z408 read:

I LIKE KILLING PEOPLE BECAUSE IT IS SO MUCH FUN IT IS MORE FUN THAN KILLING WILD GAME IN THE FORREST BECAUSE MAN IS THE MOST DANGEROUE ANAMAL OF ALL TO KILL SOMETHING GIVES ME THE MOST THRILLING EXPERENCE IT IS EVEN BETTER THAN GETTING YOUR ROCKS OFF WITH A GIRL THE BEST PART OF IT IS THAE WHEN I DIE I WILL BE REBORN IN PARADICE AND ALL THE I HAVE KILLED WILL BECOME MY SLAVES I WILL NOT GIVE YOU MY NAME BECAUSE YOU WILL TRY TO SLOI DOWN OR ATOP MY COLLECTIOG OF SLAVES FOR MY AFTERLIFE EBEORIETEMETHHPITI

The message contained several misspellings (likely intentional) and ended with 18 apparently meaningless characters ("EBEORIETEMETHHPITI") that have never been satisfactorily explained. Some analysts believe they conceal the Zodiac's name; others think they are random filler to complete the grid.

The Z340 Cipher: 51 Years Unsolved

On November 8, 1969, the Zodiac sent a new cipher to the San Francisco Chronicle. This one was 340 symbols long and used a more complex homophonic substitution scheme. The Z340 would resist all attempts at solution for over half a century.

The Z340 defied analysis for several reasons:

Multiple encryption layers. Unlike the straightforward homophonic substitution of Z408, the Z340 appeared to use additional manipulations on top of the substitution -- but no one could determine exactly what those manipulations were.

Shorter length. At 340 symbols, the cipher provided less statistical data to work with than the 408-symbol Z408.

Possible errors. If the Zodiac made mistakes while encrypting (a real possibility given the cipher's complexity), those errors would introduce noise that could throw off any systematic analysis.

Uncertain plaintext language. While English was assumed, the Zodiac's known tendency toward misspellings and unusual phrasing made it harder to use standard English statistics.

The 2020 Breakthrough: Oranchak, Blake, and Van Eycke

On December 5, 2020 -- 51 years after the cipher was sent -- a team of three amateur codebreakers announced that they had solved the Z340. The team consisted of David Oranchak, a web developer from Virginia who had been working on the cipher for over 14 years; Sam Blake, an applied mathematician from Melbourne, Australia; and Jarl Van Eycke, a Belgian warehouse operator and programmer who had created specialized cipher-solving software called AZdecrypt.

Their breakthrough came from recognizing that the Z340 used not just homophonic substitution but also a transposition step. The cipher grid was treated as separate sections, and within a section the plaintext runs diagonally (one row down, two columns to the right) rather than left to right along each row. This two-layer encryption -- transposition followed by substitution -- was what had defeated cryptanalysts for five decades.

The team's approach relied heavily on computational methods:

Hypothesis generation. Blake wrote software that systematically generated hundreds of thousands of candidate transposition schemes -- different orders in which the symbols of the grid could have been read.
Automated solving. For each candidate transposition, Van Eycke's AZdecrypt software attempted to solve the resulting homophonic substitution using hill-climbing algorithms that tested millions of possible substitution tables.
Scoring. Each candidate solution was scored against English language statistics (letter frequencies, bigram frequencies, trigram frequencies) to identify the most plausible plaintext.
Human verification. Oranchak reviewed the top-scoring candidates and confirmed the solution when coherent English text emerged.
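The generate-and-score loop described in these steps can be sketched as follows. This is a simplified stand-in, not the team's software: it assumes one small family of transpositions (reading every k-th symbol with wraparound), and it takes the scoring function as a parameter where a real pipeline would run a substitution solver on each candidate. The names decimate, scatter, candidate_readings, and best_reading are hypothetical.

```python
from math import gcd


def decimate(text, step):
    """Read every `step`-th symbol, wrapping around the end."""
    n = len(text)
    if gcd(step, n) != 1:
        raise ValueError("step must be coprime to the text length")
    return [text[(i * step) % n] for i in range(n)]


def scatter(plaintext, step):
    """Inverse of decimate: place symbol i at position (i * step) % n."""
    n = len(plaintext)
    if gcd(step, n) != 1:
        raise ValueError("step must be coprime to the text length")
    out = [None] * n
    for i, ch in enumerate(plaintext):
        out[(i * step) % n] = ch
    return out


def candidate_readings(ciphertext):
    """Yield (step, reordered symbols) for every valid decimation step."""
    n = len(ciphertext)
    for step in range(1, n):
        if gcd(step, n) == 1:
            yield step, decimate(ciphertext, step)


def best_reading(ciphertext, score):
    """Return the (step, symbols) pair whose reading scores highest."""
    return max(candidate_readings(ciphertext), key=lambda pair: score(pair[1]))


# Demo without any substitution layer: the toy score counts "THE".
message = list("THEQUICKBROWNFOXJUMPSOVERTHELAZYDOG")
scrambled = scatter(message, 3)
step, reading = best_reading(scrambled, lambda s: "".join(s).count("THE"))
print(step, "".join(reading))
```

In a real attack the score for each candidate would be the best fitness reached by the substitution solver on that reordering, which is far more expensive than the toy score used here.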

The decoded Z340 read:

I HOPE YOU ARE HAVING LOTS OF FUN IN TRYING TO CATCH ME THAT WASNT ME ON THE TV SHOW WHICH BRINGS UP A POINT ABOUT ME I AM NOT AFRAID OF THE GAS CHAMBER BECAUSE IT WILL SEND ME TO PARADICE ALL THE SOONER BECAUSE I NOW HAVE ENOUGH SLAVES TO WORK FOR ME WHERE EVERYONE ELSE HAS NOTHING WHEN THEY REACH PARADICE SO THEY ARE AFRAID OF DEATH I AM NOT AFRAID BECAUSE I KNOW THAT MY NEW LIFE IS LIFE WILL BE AN EASY ONE IN PARADICE DEATH

The FBI confirmed the solution on December 11, 2020. The decoded text was consistent with the Zodiac's known writing style, including his characteristic misspelling of "paradise" as "PARADICE" -- the same misspelling that appeared in the Z408 solution. However, the message contained no identifying information about the killer.

The Z13 and Z32 Ciphers

The Zodiac also sent two shorter ciphers: a 13-symbol cipher (Z13) purportedly containing his name, and a 32-symbol cipher (Z32), sent with a map and claimed to give the location of a bomb. Both remain unsolved. Their extreme brevity (13 and 32 symbols respectively) provides too little statistical data for frequency-based methods, and brute-force approaches produce too many plausible solutions to distinguish the correct one.

How to Construct a Homophonic Substitution Table

Building an effective homophonic substitution requires careful attention to letter frequency proportions.

Step 1: Determine the Total Symbol Count

A common choice is 100 symbols, which allows frequency percentages to map directly to symbol counts. Larger symbol sets (200, 500) provide finer granularity and flatter ciphertext distributions.

Step 2: Assign Symbols Proportionally

Using standard English letter frequencies:

Letter	Frequency (%)	Symbols (out of 100)
E	12.7	13
T	9.1	9
A	8.2	8
O	7.5	8
I	7.0	7
N	6.7	7
S	6.3	6
H	6.1	6
R	6.0	6
D	4.3	4
L	4.0	4
C	2.8	3
U	2.8	3
M	2.4	2
W	2.4	2
F	2.2	2
G	2.0	2
Y	2.0	2
P	1.9	2
B	1.5	2
V	1.0	1
K	0.8	1
J	0.2	1
X	0.2	1
Q	0.1	1
Z	0.1	1

This gives 104 symbols. Adjust by removing symbols from letters where the rounding was generous (O and B were each rounded up by 0.5), never going below one symbol per letter, until the total reaches exactly 100 (or use all 104 symbols).
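The rounding and trimming can be automated. The sketch below uses the frequencies from the table; the function names and the "trim where rounding was most generous" rule are one reasonable reading of the adjustment described above, not the only possible one. It prints the total before and after trimming, and which letters lost a symbol.

```python
FREQ_PCT = {
    "E": 12.7, "T": 9.1, "A": 8.2, "O": 7.5, "I": 7.0, "N": 6.7, "S": 6.3,
    "H": 6.1, "R": 6.0, "D": 4.3, "L": 4.0, "C": 2.8, "U": 2.8, "M": 2.4,
    "W": 2.4, "F": 2.2, "G": 2.0, "Y": 2.0, "P": 1.9, "B": 1.5, "V": 1.0,
    "K": 0.8, "J": 0.2, "X": 0.2, "Q": 0.1, "Z": 0.1,
}


def rounded_counts(freq_pct):
    """Round each percentage to a whole number of symbols, minimum one.

    Python rounds exact halves to the nearest even integer, which for
    7.5 and 1.5 happens to agree with the table (8 and 2).
    """
    return {ch: max(1, round(pct)) for ch, pct in freq_pct.items()}


def trim_to_total(counts, freq_pct, total=100):
    """Remove symbols where rounding was most generous until `total` remain."""
    counts = dict(counts)
    while sum(counts.values()) > total:
        trimmable = [ch for ch in counts if counts[ch] > 1]
        if not trimmable:
            raise ValueError("cannot trim below one symbol per letter")
        ch = max(trimmable, key=lambda c: counts[c] - freq_pct[c])
        counts[ch] -= 1
    return counts


counts = rounded_counts(FREQ_PCT)
trimmed = trim_to_total(counts, FREQ_PCT)
print(sum(counts.values()), sum(trimmed.values()))
print({ch: (counts[ch], trimmed[ch]) for ch in counts if counts[ch] != trimmed[ch]})
```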

Step 3: Assign Specific Symbols

Each symbol should be a unique glyph, number, or character. For a 100-symbol set, you might use the digits 00-99, or a mix of letters, numbers, and special characters. The Zodiac Killer used a creative mix of standard letters, reversed letters, astrological symbols, and invented shapes.
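For the two-digit option, the assignment can be sketched as dealing out shuffled codes. build_table is a hypothetical helper, and the three-letter allocation in the demo is only there to keep the output short. Shuffling matters: handing E the block 00-12 and T the block 13-21 would give an analyst an obvious structure to exploit.

```python
import random


def build_table(counts, rng=None):
    """Deal shuffled two-digit codes to letters according to `counts`."""
    # For real use, random.SystemRandom() would be a safer source of randomness.
    rng = rng or random.Random()
    if sum(counts.values()) > 100:
        raise ValueError("two-digit codes 00-99 allow at most 100 symbols")
    codes = [f"{i:02d}" for i in range(100)]
    rng.shuffle(codes)
    table, pos = {}, 0
    for letter, n in counts.items():
        table[letter] = codes[pos:pos + n]   # next n unused codes
        pos += n
    return table


# Hypothetical three-letter allocation, seeded so the demo is repeatable.
table = build_table({"E": 3, "T": 2, "Z": 1}, random.Random(0))
print(table)
```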

Step 4: Randomize Symbol Selection During Encryption

When encrypting, for each plaintext letter, choose randomly among its available homophones. This randomization is critical -- if you always use the homophones in a fixed order (cycling through them sequentially), an analyst can detect the pattern and reconstruct the mapping.
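The weakness of fixed-order cycling can be demonstrated with a small sketch. The table and function names are hypothetical. The check looks at two symbols and asks whether they strictly take turns wherever either appears, which is what happens when both are homophones of one letter used in rotation.

```python
from itertools import cycle


def encrypt_cycling(plaintext, table):
    """Weak variant: use each letter's homophones in a fixed rotation."""
    wheels = {letter: cycle(syms) for letter, syms in table.items()}
    return [next(wheels[ch]) for ch in plaintext]


def alternates(ciphertext, a, b):
    """True if symbols a and b strictly take turns wherever either appears."""
    seq = [s for s in ciphertext if s in (a, b)]
    return len(seq) >= 4 and all(x != y for x, y in zip(seq, seq[1:]))


# Hypothetical two-letter table for illustration.
table = {"E": ["14", "27"], "T": ["09", "33"]}
ct = encrypt_cycling("ETTEETETTE", table)
print(ct)
print(alternates(ct, "14", "27"))  # True: both are homophones of E
print(alternates(ct, "14", "09"))  # False: an E symbol and a T symbol
```

On a ciphertext this short, unrelated symbol pairs can also alternate by chance, so the test is only a heuristic; on longer texts, strict alternation between unrelated symbols becomes very unlikely, while cycled homophones keep alternating.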

Strengths and Limitations

Strengths

Defeats simple frequency analysis. The primary advantage. With a well-proportioned symbol set, single-symbol frequency analysis yields no useful information. This was a revolutionary improvement over simple substitution ciphers.

Flexible security scaling. More symbols means flatter frequency distributions and harder analysis. A 50-symbol set provides moderate protection; a 500-symbol set approaches a flat distribution.

Historical proven effectiveness. Homophonic ciphers successfully protected diplomatic communications for centuries and defeated professional cryptanalysts on numerous occasions.

Limitations

Vulnerable to bigram and trigram analysis. Even with flat single-symbol frequencies, the patterns of symbol pairs (bigrams) and triples (trigrams) still carry information. In English, the pair TH is far more common than QX. If the analyst can identify which symbol pairs correspond to common bigrams, the cipher can be unraveled. This is the primary attack vector against homophonic substitution.
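Collecting the raw bigram data is straightforward. The sketch below counts adjacent symbol pairs in a ciphertext held as a list of symbols; symbol_bigrams is a hypothetical helper and the fragment is invented for illustration.

```python
from collections import Counter


def symbol_bigrams(ciphertext):
    """Count adjacent symbol pairs in a list of ciphertext symbols."""
    return Counter(zip(ciphertext, ciphertext[1:]))


# Hypothetical ciphertext fragment.
ct = ["09", "41", "14", "33", "41", "27", "09", "41", "14"]
for pair, count in symbol_bigrams(ct).most_common(3):
    print(pair, count)
```

If several different symbols are regularly followed by the same symbol, they are candidates for homophones of one letter (for example, several T symbols before an H symbol), which is the kind of structure a flat single-symbol distribution cannot hide.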

Key management complexity. The substitution table is large and must be kept secret. Sharing a table with 100+ entries is more difficult and error-prone than sharing a simple keyword.

Ciphertext expansion. If the symbols are multi-digit numbers (like two-digit codes), the ciphertext is longer than the plaintext, which is a practical disadvantage for handwritten or telegraphic communication.

Not resistant to known plaintext. If the attacker knows a portion of the plaintext, they can immediately identify which symbols map to which letters in that portion, then extend the mapping to the rest of the ciphertext.

Comparison with Other Substitution Ciphers

vs. Simple Substitution (Caesar, Keyword)

The Caesar cipher and keyword cipher are both monoalphabetic -- each letter has exactly one substitute. Frequency analysis breaks them trivially. Homophonic substitution is a direct upgrade that neutralizes this specific attack. However, the additional complexity of managing multiple symbols per letter makes homophonic ciphers more error-prone in practice.

vs. Polyalphabetic Ciphers (Vigenere)

Polyalphabetic ciphers like the Vigenere cipher also defeat single-letter frequency analysis, but they do so through a different mechanism: using multiple substitution alphabets in rotation. The Vigenere cipher's weakness is its repeating key, which can be detected through the Kasiski examination or the Friedman test. Homophonic ciphers have no repeating key to detect, but they are vulnerable to bigram analysis, which polyalphabetic ciphers partially resist.

In practice, a well-designed homophonic cipher is roughly comparable in security to a Vigenere cipher with a moderate-length key. Both can be broken by a skilled analyst with sufficient ciphertext, but both represent a significant step up from simple substitution.

vs. Polygraphic Ciphers (Playfair, Hill)

Polygraphic ciphers like the Playfair cipher encrypt multiple letters at once, which also disrupts single-letter frequency patterns. The Playfair cipher's digraphic encryption creates a different kind of frequency masking -- it obscures single letters but introduces detectable digraph patterns. Homophonic substitution and polygraphic encryption can be viewed as two different strategies for the same goal: defeating frequency analysis.

Breaking Homophonic Ciphers: Modern Approaches

Hill-Climbing Algorithms

The most successful modern approach to breaking homophonic ciphers uses hill-climbing optimization. The algorithm works as follows:

Start with a random assignment of symbols to letters.
Decrypt the ciphertext using this assignment.
Score the resulting plaintext using a fitness function (typically based on quadgram statistics -- the frequencies of four-letter sequences in English).
Make a small random change to the assignment (swap two symbol mappings).
Re-decrypt and re-score. If the score improves, keep the change; otherwise, revert it.
Repeat thousands of times until the score converges.
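The loop above can be sketched in a few dozen lines. Two parts of this sketch are assumptions rather than anything specified here: the fitness function is a toy bigram scorer built from a short sample string instead of quadgram counts from a large corpus, and the "small random change" reassigns one symbol to a new letter instead of swapping two mappings, which suits homophonic ciphers because many symbols may share a letter. All names are hypothetical.

```python
import math
import random
from collections import Counter

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def make_scorer(sample_text, n=2):
    """Build a log-probability scorer from n-gram counts in sample_text.

    Toy stand-in: a real attack would use quadgram counts from a large corpus.
    """
    grams = Counter(sample_text[i:i + n] for i in range(len(sample_text) - n + 1))
    total = sum(grams.values())
    floor = math.log10(0.01 / total)          # penalty for unseen n-grams

    def score(text):
        result = 0.0
        for i in range(len(text) - n + 1):
            count = grams.get(text[i:i + n], 0)
            result += math.log10(count / total) if count else floor
        return result

    return score


def decrypt(ciphertext, key):
    """Apply a symbol->letter key to a list of symbols."""
    return "".join(key[s] for s in ciphertext)


def hill_climb(ciphertext, score, steps=2000, rng=None):
    """Start from a random key and keep only changes that raise the score."""
    rng = rng or random.Random()
    symbols = sorted(set(ciphertext))
    key = {s: rng.choice(ALPHABET) for s in symbols}
    best = score(decrypt(ciphertext, key))
    for _ in range(steps):
        s = rng.choice(symbols)
        old = key[s]
        key[s] = rng.choice(ALPHABET)         # small random change
        new = score(decrypt(ciphertext, key))
        if new > best:
            best = new                        # keep the improvement
        else:
            key[s] = old                      # revert
    return key, best


scorer = make_scorer("THEQUICKBROWNFOXJUMPSOVERTHELAZYDOG")
ct = ["05", "17", "22", "05", "17", "22", "31"]
key, best = hill_climb(ct, scorer, steps=300, rng=random.Random(1))
print(decrypt(ct, key), round(best, 2))
```

With such a tiny sample and ciphertext the result is not meaningful English; the point is the accept-or-revert structure. Practical solvers also restart from many random keys, since a single climb often stalls.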

This approach was central to the solving of the Zodiac's Z340 cipher. Jarl Van Eycke's AZdecrypt software uses a sophisticated variant of hill-climbing optimized for homophonic ciphers.

Simulated Annealing

Simulated annealing is a refinement of hill-climbing that occasionally accepts changes that make the score worse. This helps the algorithm escape local optima -- suboptimal solutions that hill-climbing gets stuck in because every small change makes the score worse, even though a larger jump could find a much better solution.
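The difference from plain hill-climbing comes down to the acceptance rule. The sketch below shows a common form of it (the Metropolis rule with a geometric cooling schedule); the function names and default temperatures are assumptions for illustration, and delta is taken to be the new score minus the old score, with higher scores being better.

```python
import math
import random


def accept(delta, temperature, rng=random):
    """Always accept improvements; accept setbacks with probability
    exp(delta / temperature), which shrinks as the temperature falls."""
    if delta >= 0:
        return True
    return rng.random() < math.exp(delta / temperature)


def temperatures(start=10.0, end=0.1, steps=1000):
    """Geometric cooling schedule from `start` down to `end`."""
    if steps < 2:
        raise ValueError("need at least two steps")
    ratio = (end / start) ** (1 / (steps - 1))
    t = start
    for _ in range(steps):
        yield t
        t *= ratio


for t in temperatures(10.0, 0.1, 5):
    print(round(t, 3), accept(-2.0, t))   # setbacks pass less often as t drops
```

Replacing the "if the score improves" test in a hill-climbing loop with a call to accept, and drawing the temperature from the schedule, turns that loop into a basic simulated annealing search.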

Machine Learning Approaches

Recent research has explored using neural networks and other machine learning techniques to attack homophonic ciphers. These approaches train on large datasets of known plaintext-ciphertext pairs and learn to recognize the statistical signatures of correctly decoded text. While still experimental, machine learning shows promise for automating the analysis of complex ciphers that resist traditional methods.

Frequently Asked Questions

What does "homophonic" mean?

The word "homophonic" comes from the Greek "homo" (same) and "phone" (sound or voice). In music, "homophonic" refers to a texture where multiple voices move in the same rhythm. In cryptography, "homophonic" means that multiple different symbols can represent the same letter -- they are different "voices" for the same "sound." The term distinguishes homophonic substitution from simple (monophonic) substitution, where each letter has only one substitute.

How many symbols do I need for a secure homophonic cipher?

There is no magic number, but as a general guideline: 50 symbols provides moderate protection against casual analysis, 100 symbols provides good protection against manual analysis, and 200+ symbols approaches the practical limit of what homophonic substitution can achieve. Beyond a certain point, adding more symbols provides diminishing returns because bigram and trigram analysis become the dominant attack vector regardless of single-symbol frequency distribution.

Was the Zodiac Killer ever identified?

As of 2026, the Zodiac Killer has never been officially identified. Various suspects have been proposed over the decades, including Arthur Leigh Allen, who was investigated by police during the original case. In October 2021, a team called "The Case Breakers" named Gary Francis Poste as a suspect, but law enforcement has not confirmed this identification. The case remains open with the FBI and local law enforcement agencies.

Can a homophonic cipher be truly unbreakable?

No. While homophonic substitution significantly increases the difficulty of cryptanalysis, it is not theoretically unbreakable. With sufficient ciphertext length, bigram and trigram analysis will eventually reveal the underlying substitution pattern. The only cipher proven to be theoretically unbreakable is the one-time pad, which requires a truly random key as long as the message that is never reused. However, very short homophonic ciphers (like the Zodiac's Z13) may be practically unbreakable simply because they contain too little data for statistical analysis to work.

How is a homophonic cipher different from a nomenclator?

A nomenclator is a hybrid system that combines homophonic letter substitution with a code table for words and phrases. In a pure homophonic cipher, every character of the plaintext is individually encrypted through the substitution table. In a nomenclator, common words and phrases are replaced with code groups from a separate table, and only the remaining text is encrypted letter by letter. Nomenclators were the dominant diplomatic cipher system from approximately 1400 to 1850 and are historically the most common context in which homophonic substitution appeared.