Zoom in on others’ mistakes – Vol.2: Cryptography
Continuing our analysis of Zoom vulnerabilities, this time we will take a look at cryptographic weaknesses.
After discussing the injection flaws in Zoom, in this second article we will take a look at a more sophisticated, but also more serious underlying issue: the use of weak cryptography. Cryptography is an essential security feature to protect data – any weakness in its implementation puts users’ information at risk, potentially breaking confidentiality and/or integrity.
The end-to-ends of the earth
End-to-end encryption (E2EE) essentially means that audio and video data is encrypted in a way that prevents anyone other than the participants from accessing it. By ‘anyone’ we usually mean any intermediate nodes in the communication. But even more importantly, in this context it also includes the vendor or service provider of the product. This functionality is especially relevant in areas such as telemedicine that have strong security requirements on data confidentiality – but in general, if E2EE is not supported, the operator has the ability to intercept and read (or even modify!) the audio and video streams of a video conference.
E2EE is not standard among videoconferencing tools, because many of their features (such as transcription or recording) rely on the platform having access to the decrypted audio and video streams, and because it is difficult to implement a mode where the server does not touch audio and video data at all. For example, E2EE can result in significantly worse performance, since the service can no longer optimize streams depending on which user is actively talking or presenting. Still, some providers support E2EE functionality, such as FaceTime or Signal.
While Zoom documentation claimed that the product implemented ‘end-to-end’ encryption, it was actually using the term in a way that departs from its standard meaning in cryptography. As a Zoom spokesperson explained to The Intercept, “When we use the phrase ‘End to End’ in our other literature, it is in reference to the connection being encrypted from Zoom end point to Zoom end point”, which is ambiguous. An ‘end point’ may refer to (e.g.) the key server that sends the stream encryption key to all participants, in which case ‘End to End’ just means the use of transport encryption via TLS between clients and Zoom servers. To be fair, Zoom does support true end-to-end encryption for text chat, but that is not the concern here.
Zoom’s stream encryption model is fairly simple – as described in section 4 of the Citizen Lab report, a Zoom key server generates a secret key for each conference, and every participant uses that key to decrypt the stream sent by the Zoom router server. Since Zoom servers control both the stream and the key, Zoom has the technical ability to decrypt and modify the stream if it wants. Definitely not what someone would expect from an E2EE solution!
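The trust problem in this model can be made concrete with a few lines of Python. This is only a sketch of the architecture as described above: the `KeyServer` class and its method names are hypothetical and do not reflect Zoom's actual code or API.

```python
import secrets


class KeyServer:
    """Hypothetical provider-run key server (illustration only)."""

    def __init__(self):
        self._meeting_keys = {}

    def create_meeting(self, meeting_id):
        # The provider, not the participants, generates the shared key
        # (16 bytes, the size of an AES-128 key).
        self._meeting_keys[meeting_id] = secrets.token_bytes(16)

    def key_for_participant(self, meeting_id, user):
        # Every participant receives the very same secret.
        return self._meeting_keys[meeting_id]

    def provider_copy(self, meeting_id):
        # Nothing stops the operator from reading the key it generated.
        return self._meeting_keys[meeting_id]


server = KeyServer()
server.create_meeting("standup")
alice_key = server.key_for_participant("standup", "alice")
bob_key = server.key_for_participant("standup", "bob")
```

In a true E2EE design the participants would agree on a key that the server never sees; here the server is the source of the key, so encrypting the stream cannot keep the operator out.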
But looking into the encryption’s details, things get even worse.
ECB in 2020?!
Assuming we are using block ciphers (which is generally the case), symmetric encryption algorithms such as AES only encrypt data of a certain length at a time – a number of bytes equal to the block size of the algorithm. However, in practice we want to encrypt a lot more data than that, and that is where the modes of operation come in.
A mode of operation is a scheme that applies the block cipher repeatedly in order to encrypt data of arbitrary length. The simplest one is called ECB, short for Electronic Codebook. It simply slices the data to be encrypted into blocks and encrypts each block separately with the same key.
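To make the slicing concrete, here is a minimal sketch of ECB in Python. To stay within the standard library it does not use AES: `toy_encrypt_block` is a hypothetical four-round Feistel construction built on HMAC-SHA256, used purely as a stand-in for a 16-byte block cipher. It is not secure and should never be used for real data.

```python
import hashlib
import hmac
import os

BLOCK = 16  # bytes; the same block size as AES


def _round(key, rnd, half):
    # Keyed round function: HMAC-SHA256 truncated to half a block.
    return hmac.new(key, bytes([rnd]) + half, hashlib.sha256).digest()[:BLOCK // 2]


def _xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def toy_encrypt_block(key, block):
    # Toy stand-in for AES: a 4-round Feistel network (NOT secure).
    left, right = block[:BLOCK // 2], block[BLOCK // 2:]
    for rnd in range(4):
        left, right = right, _xor(left, _round(key, rnd, right))
    return left + right


def toy_decrypt_block(key, block):
    # Undo the Feistel rounds in reverse order.
    left, right = block[:BLOCK // 2], block[BLOCK // 2:]
    for rnd in reversed(range(4)):
        left, right = _xor(right, _round(key, rnd, left)), left
    return left + right


def pad(data):
    # PKCS#7-style padding up to a whole number of blocks.
    n = BLOCK - len(data) % BLOCK
    return data + bytes([n]) * n


def unpad(data):
    return data[:-data[-1]]


def ecb_encrypt(key, plaintext):
    padded = pad(plaintext)
    # ECB: every block is encrypted on its own, with the same key.
    return b"".join(toy_encrypt_block(key, padded[i:i + BLOCK])
                    for i in range(0, len(padded), BLOCK))


def ecb_decrypt(key, ciphertext):
    padded = b"".join(toy_decrypt_block(key, ciphertext[i:i + BLOCK])
                      for i in range(0, len(ciphertext), BLOCK))
    return unpad(padded)


key = os.urandom(16)
message = b"A" * 48 + b"tail"
ciphertext = ecb_encrypt(key, message)
```

Note that neither `ecb_encrypt` nor `ecb_decrypt` passes anything from one block to the next: the ciphertext of a block depends only on the key and on that block's content.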
There is a clear problem with this: under a given key, the same input for a particular 16-byte block (in the case of AES) will always produce the same output. This allows several different cryptographic attacks, but even without going into detail, just consider what happens if the input contains large repeated blocks of data (using the demonstration images from Wikipedia):
[Image: Problem with Electronic Codebook (ECB) – you still see the “encrypted” penguin]
The outline and rough content of the image are clearly visible, despite the use of a strong encryption algorithm! Of course, in the case of compressed data this is going to be much less obvious – nevertheless, it remains a cryptographic weakness that opens many doors for malicious cryptanalysis. For this reason, the standard advice is to never use ECB for anything.
To overcome the problem of the same input (plaintext block) producing the same output (ciphertext block) with the same key, we have better modes of operation. When encrypting a block, they typically incorporate the result of the previously encrypted block (or a running counter) into the encryption. They also use a random value to initialize the process (this is called the Initialization Vector or IV). Used properly, these secure modes of operation produce output that really does look random.
[Image: When using e.g. Output Feedback (OFB), the encryption is effective]
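The feedback idea behind OFB can be sketched in a few lines. As an assumption made to keep the example within the Python standard library, `keyed_block_function` (HMAC-SHA256 truncated to 16 bytes) stands in for the forward direction of a block cipher such as AES. A real system would call a vetted library instead of code like this.

```python
import hashlib
import hmac
import os

BLOCK = 16  # bytes


def keyed_block_function(key, block):
    # Stand-in for E_K(block); an assumption for illustration, not AES.
    return hmac.new(key, block, hashlib.sha256).digest()[:BLOCK]


def ofb_xcrypt(key, iv, data):
    """Encrypt or decrypt: OFB applies the same XOR in both directions."""
    if len(iv) != BLOCK:
        raise ValueError("IV must be exactly one block long")
    out = bytearray()
    feedback = iv
    for i in range(0, len(data), BLOCK):
        # O_i = E_K(O_{i-1}), starting from the IV.
        feedback = keyed_block_function(key, feedback)
        chunk = data[i:i + BLOCK]
        # C_i = P_i XOR O_i
        out.extend(p ^ k for p, k in zip(chunk, feedback))
    return bytes(out)


key = os.urandom(16)
iv = os.urandom(BLOCK)  # must be fresh for every message under the same key
plaintext = b"\x00" * BLOCK * 8  # eight identical plaintext blocks
ciphertext = ofb_xcrypt(key, iv, plaintext)
blocks = [ciphertext[i:i + BLOCK] for i in range(0, len(ciphertext), BLOCK)]
```

Because each keystream block is derived from the previous one, the eight identical plaintext blocks end up as eight different ciphertext blocks, and a different IV changes the whole ciphertext.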
So how is this relevant for Zoom? Simple: up until the end of May 2020, Zoom was using AES-128 in ECB mode to encrypt the streams in its own custom implementation of encryption over the Real-time Transport Protocol (RTP), as outlined in the report. This is a departure from the already-existing Secure Real-time Transport Protocol (SRTP) standard. The SRTP specification (RFC 3711) defines two AES modes for encryption: Segmented Integer Counter Mode (a subtype of CTR) and f8-mode (a slight variation of OFB). Both modes of operation are resilient to the attacks mentioned above.
What this means is that the audio/video data from a video conference was potentially vulnerable to cryptographic attacks. Due to the compression, such an attack would take significant effort, but with enough time and resources (and observed data), an attacker could eventually decrypt a stream even without having the key. The attacker’s job would also be significantly easier if they could influence the encrypted data somehow, or if they knew part of its contents – for example, if the first few frames of the video consisted of the company’s logo. This is similar to how Enigma was eventually broken in WWII: British cryptanalysts, including Alan Turing, exploited knowing parts of the plaintext messages in advance.
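The known-plaintext scenario can be illustrated with a toy 'codebook' attack. This is a deliberately simplified sketch: `victim_ecb_encrypt` is a hypothetical stand-in for the victim's ECB encryptor (HMAC-SHA256 truncated to one block instead of AES), and the 16-byte 'frames' are made up. Real compressed video would rarely repeat blocks this neatly, so an actual attack would take far more effort.

```python
import hashlib
import hmac

BLOCK = 16  # bytes


def split_blocks(data):
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]


def victim_ecb_encrypt(key, data):
    # Stand-in for ECB: each block is mapped deterministically and
    # independently of the others (an assumption for illustration).
    return b"".join(hmac.new(key, block, hashlib.sha256).digest()[:BLOCK]
                    for block in split_blocks(data))


def build_codebook(known_plaintext, observed_ciphertext):
    # The attacker pairs ciphertext blocks with the plaintext they already know.
    return dict(zip(split_blocks(observed_ciphertext),
                    split_blocks(known_plaintext)))


def recover(ciphertext, codebook):
    # Look up every block; None marks blocks the attacker cannot read yet.
    return [codebook.get(block) for block in split_blocks(ciphertext)]


secret_key = b"only-victim-has!"  # never used by the attacker's functions
intro = b"ACME-LOGO-ROW-01" + b"ACME-LOGO-ROW-02"  # content the attacker knows
codebook = build_codebook(intro, victim_ecb_encrypt(secret_key, intro))

later = b"SECRET-FRAME-XYZ" + b"ACME-LOGO-ROW-02" + b"ACME-LOGO-ROW-01"
recovered = recover(victim_ecb_encrypt(secret_key, later), codebook)
```

The attacker's functions never touch `secret_key`, yet `recovered` identifies every block that was seen before. The more plaintext the attacker knows or can inject, the larger the codebook grows.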
When the waiting room is an all-seeing cryptographic eye
While the aforementioned weaknesses in cryptography are mainly threats that can be realized by high-powered adversaries (also called Advanced Persistent Threats or APTs), there was another problem also discovered by Citizen Lab: a vulnerability in the implementation of Zoom waiting rooms.
Basically, Zoom’s ‘waiting room’ function allows the organizer to be selective when admitting participants to a meeting. Everyone who joins the meeting has to stay in the waiting room until the organizer lets them in. This looks like a good security measure, but there was a problem with the implementation that turned it into a significant threat.
Zoom was sending the encrypted video stream to unapproved participants in the waiting room as well – probably to enhance response time and user experience. This way, after a user is admitted to the meeting, their client can immediately start showing the stream without having to do any buffering. This was a vulnerability by itself considering the existing weaknesses in cryptography – however, the stream decryption key was also shared with everyone in the waiting room. This meant that if an attacker could get into the waiting room for a Zoom meeting, they could decrypt the video stream and observe the meeting from that point on… even if the meeting organizer never let them in!
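At its core, the flaw is a missing authorization check before the key is released. The following sketch shows what such a check could look like on the server side. The `Meeting` class and its methods are hypothetical names chosen for illustration and say nothing about how Zoom actually fixed the issue.

```python
import secrets


class Meeting:
    """Hypothetical server-side meeting state (illustration only)."""

    def __init__(self):
        self._stream_key = secrets.token_bytes(16)
        self.waiting_room = set()
        self.admitted = set()

    def join(self, user):
        # Newcomers always start in the waiting room.
        self.waiting_room.add(user)

    def admit(self, user):
        # Only the organizer would call this; raises KeyError if the
        # user is not actually waiting.
        self.waiting_room.remove(user)
        self.admitted.add(user)

    def stream_key_for(self, user):
        # The key (and ideally the stream itself) is released only
        # after admission, never to users who are still waiting.
        if user not in self.admitted:
            raise PermissionError(f"{user} has not been admitted")
        return self._stream_key


meeting = Meeting()
meeting.join("alice")
meeting.join("mallory")
meeting.admit("alice")
alice_key = meeting.stream_key_for("alice")
```

With this check in place, a call such as `meeting.stream_key_for("mallory")` raises `PermissionError` for as long as the organizer has not admitted that user.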
As so often happens, a usability (UX) feature became a security problem – in this case, it turned the Zoom waiting room from a security measure into an inadvertent spying tool.
The first rule of cryptography for software developers
Implementing any cryptography invented by someone else is hard. Coming up with your own cryptographic algorithms and secure protocols is even harder. For software developers, the first rule of implementing or designing your own crypto is “Don’t do it!”, and with good reason. Designing a protocol for sharing a video stream securely between A and B looks simple enough in theory, but it can be a minefield in practice; a designer can create a protocol that looks bullet-proof, and yet may still be vulnerable. Similarly, without understanding all cryptographic details, a programmer can create a ‘bug-free’ crypto implementation that’s still vulnerable to various cryptanalysis attacks.
Zoom’s transport protocol was not even particularly complex, but it still contained at least two critical design flaws – not to mention possible implementation problems. This is in addition to all the misleading statements about implementing end-to-end encryption, using TLS (only partially true, and only for the browser plugin), or using AES-256 (when it actually was AES-128). Just using an existing implementation of SRTP instead would have saved a lot of trouble.
The ultimate takeaway here is: never roll your own cryptography (unless you’re a trained cryptographer). If you can use a cryptographic library to achieve your goals of data confidentiality and integrity, you absolutely should. But you should still know how to use the existing algorithms from those libraries correctly. It’s like driving: you don’t need to be a mechanic to drive a car – but to get a driving license you still have to learn how to drive it right, don’t you?
Note that Zoom has responded to these issues quickly, and – after fixing the implementation problems – is in the process of redesigning and reimplementing the protocol to support true end-to-end encryption. Zoom is switching from ECB to GCM with Zoom 5.0; see also the recently-updated whitepaper (AES and SRTP sections under Zoom Client).