Instant ciphertext-only cryptanalysis of gsm encrypted communication

So, GSM (Groupe Spécial Mobile) is the most widely-used standard for cellular communication. Wikipedia tells me that 80% of the cellular market uses this standard, representing about 4.3 billion people. And guess what? The encryption algorithm of this standard is completely broken-- according to this paper, anyway. And the way in which the paper goes about breaking the algorithm is itself beautiful, illustrating a number of common crypto flaws simultaneously.

Okay, so let's be clear what we're talking about. The GSM specification actually contains two encryption algorithms, A5/1 and A5/2. These were apparently developed back when there were export restrictions on strong crypto, as A5/1 is the 'strong' algorithm for the US and Europe and A5/2 is the 'weak' algorithm for elsewhere. This paper directly breaks A5/2, but then shows that this is 'good enough' to attack all GSM encryption. But before going in to that, let's first focus on how they broke A5/2. How does A5/2 work? It's a stream cipher, which mean that it:

Uses the secret key to generate a pseudo-random pad of the same length as the plaintext, and then
Pretends that this pad is a one time pad. That is, it creates the ciphertext by doing a bit-by-bit xor of the plaintext and the pad.

So far, so good. But how does A5/2 turn the key into a pseudo-random pad? Like most pseudo-random generators, it has the following structure:

First, it turns the key into some initial internal 'state'.
Then, for every bit required to make the pad, it will:
- Update the current state into some new state, using some algorithm-specific process, and
- Extract a single pad-bit by taking some function of the new state.

To present an example, which is not the A5/2 algorithm, consider the following (dumb) pseudo-random generator:

The initial internal state is the key, interpreted as a positive integer.
To extract a bit for the pad:
- Take the current state (which is a positive integer) and double it. Then take its remainder when you divide it by 17. This remainder is the new internal state.
- If the new state is even, output 0. If the new state is odd, on the other hand, output 1.

How bad is this generator? Well, first notice that no matter what your key is, the pad will repeat every eight bits. Either you fall into the following state-sequence

2, 4, 8, 16, 15, 13, 9, 1, 2...

which yields the following sequence of pad-bits

0, 0, 0, 0, 1, 1, 1, 1, 0...

or you fall into the other possible sequence of states

3, 6, 12, 7, 14, 11, 5, 10, 3...

which yields pad-bits

1, 1, 0, 1, 0, 1, 1, 0, 1...

Now, the dumb thing here is not that the pattern repeats every 8 bits-- that's just an artifact of the fact that 17 is small. No, the dumb thing is that if I ever learn four consecutive bits of the pad, then I learn which of the two sequences above is being used and where we are in it. That is, I learn the internal state, which means I can predict the rest of the pad--- forward and backward--- allowing me to learn the entire plaintext.

The pseudo-random generator of A5/2 is more complicated than this, of course, but it is built on similarly simple steps. In particular, it is something called a linear feedback shift register, which uses mostly xor operations and bit-shifts. The advantage to such creatures is that they use very basic operations which can be executed very quickly. The disadvantage, however, is that the pad-bits generated will not be completely independent. There will be interrelations which can be represented by a number of polynomials. If you were to get enough of the pad-bits, you could use them to (a) solve the polynomials, (b) completely recover the internal state, and thus (c) recover the plaintext.

But for the adversary to launch such an attack, they would need to somehow learn the bits of the pad. Not the ciphertext, the pad. How realistic is that? Well, due to the property of one-time pads (and hence, stream ciphers) the adversary could recover the pad very easily if it had both the ciphertext and the plaintext. So if the adversary had both ciphertext and plaintext, it could easily launch this attack.

"...Hold on," I hear you say. "We're now assuming the adversary has the plaintext?" Well, some of it. "But isn't the plaintext the goal of the adversary in the first place?" Well, yes. "So what is the point of launching this attack if the adversary already has the plaintext?" The point is that this attack doesn't require the adversary to have all of the plaintext, but just enough to solve these polynomials.

And here we get into an interesting argument between amateur cryptographers and real cryptographers. For some reason, amateur cryptographers often believe that the adversary will not know what is being encrypted. There is, I admit, a germ of truth to this. There is no point in using encryption in a setting where the adversary knows exactly what is being encrypted, so you might as well assume that's not the case. But in the minds of many amateur cryptographers, this true statement somehow morphs into its more dangerous cousin: that the adversary doesn't know anything about what is being encrypted. Or put more bluntly, that people only encrypt random data.

This is not true, and the paper in question illustrates that fact very vividly. This was not the first paper to attack A5/2, but previous attacks required some pretty strong assumptions. One prior work, for example, showed how the adversary could break the algorithm if it were to know the plaintext of two GMS 'frames' (packets) exactly 2048 frames apart. Now, this may not be entirely unreasonable-- most frames will be encrypting silence, after all-- but it's not very satisfying. This paper, on the other hand, eliminates such prerequisites entirely by noticing that the plaintext to A5/2 will never be completely random. Even if the input to the GMS stack is random, the GMS stack will apply error-correcting codes before the encryption algorithm. That is, the inputs to A5/2 will always satisfy some particular relationship (i.e., is a valid code-word of the error-correcting code) and this is all they need. Based on this fact, they create a ciphertext-only attack on A5/2 that only needs some 12 miliseconds of GMS traffic to analyze, and only a second or so of computing-time on a PC.

And it just keeps getting better. Remember, this paper attacked not just A5/2 but all GMS communication. How? They don't show how to break A5/1, but instead show how they can prevent A5/1 from being used. When two GSM phones call each other, one of the first things they need to do is to come to agreement on which algorithm to use. Remember, A5/1 was reserved for the U.S. and Europe. So an European phone could use either A5/1 or A5/2, while an Asian phone could only use A5/2. So what happens when a European phone calls an Asian phone? The European phone (able to use A5/1) would need to discover whether the other phone could handle A5/1 as well. If so, they both use A5/1. If not, the European phone downgrades itself and both phones use A5/2.

Why am I mentioning this? The GSM spec contains a protocol for doing this kind of discovery, and it turns out that the protocol just assumes that both sides are telling the truth. This is a dangerous assumption. Suppose that the adversary is sitting on the GSM network between two U.S. phones. Whenever one of the phones says to the other, "I can use A5/1 or A5/2," the adversary changes it to "I can use A5/2 only." The phones will never detect that these messages were changed, and they both downgrade to using A5/2-- which this paper can attack. (And yes, this is akin to an attack on SSL found a few years ago.)

So, this one paper manages to illustrate at least two lessons:

That 'known plaintext' attacks are real, and
That meta-data needs as much protection as data (if not more).

Two more notes:

Returning to the amateur vs. 'real' cryptographer debate, above: as this paper illustrates, it is not safe to assume that the plaintext of a ciphertext will be completely unpredictable to the adversary. But it remains true that encryption is pointless if the adversary knows the entire plaintext, and so you might as well assume that the plaintext is at least a little unpredictable. So where do we draw the line? How do we balance these two extremes: completely random (unrealistic) and completely known (useless)?

The answer is to assume that we are living in the worst possible non- degenerate world: that there is only a single bit of uncertainty about the plaintext. Put another way, we assume that the adversary has through a priori knowledge narrowed the set of possible plaintext down to two messages. And to make things worse, we assume that the adversary was somehow able to choose the two possible messages! This is a somewhat unrealistic scenario, but it errs on the side of caution. If we can prove an encryption scheme remains secure even under this scenario (meaning that the adversary can't tell which message has been encrypted) then the scheme will remain secure in more realistic scenarios. And as the above paper shows, what may seem to be unrealistic scenarios sometimes turn out to be very realistic indeed.¹

Believe it or not, I am actually prepared to be sympathetic to the GMS designers on this one. The GSM spec was written in the 1980's. The 'standard' crypto algorithm was DES, which may have been too slow, big, or power-consuming for these purposes. We cryptographers almost always strongly advise against 'rolling your own' crypto when you don't need to so do, but this may have been one of those cases when rolling their own crypto was the least awful operational decision possible. And I have a hard time criticizing the crypto algorithm they made: linear feedback shift registers were pretty state-of-the-art then (AFAIK), there weren't that many cryptographers around at that time (outside the NSA), and let's remember that A5/2 was meant to be weak.

I lied a little bit, there. This is not the worst scenario we might imagine: we might also consider the possibility that the adversary might be able to fool us into decrypting some (but not all) ciphertexts of its choosing. But this gets into the chosen-ciphertext attack, a topic that should probably be left for later. ↩