The Microsoft Threat Intelligence Podcast
Ep 59 | 12.17.25

Whisper Leak: How Threat Actors Can See What You Talk to AI About

Transcript

Sherrod DeGrippo: Welcome to the "Microsoft Threat Intelligence Podcast". I'm Sherrod DeGrippo. Ever wanted to step into the shadowy realm of digital espionage, cybercrime, social engineering, fraud? Well, each week, dive deep with us into the underground. Come hear from Microsoft's elite threat intelligence researchers. Join us as we decode mysteries, expose hidden adversaries, and shape the future of cybersecurity. It might get a little weird. But don't worry, I'm your guide to the back alleys of the threat landscape. [ Music ] Today we're talking about why that little lock icon isn't really the whole story. Microsoft researchers recently uncovered Whisper Leak, shhh!, a side-channel attack on large language models that can infer what you're talking about with an AI, even when the traffic is fully encrypted over TLS. You can think about somebody sitting on the network, not breaking the crypto, but still figuring out whether you're asking about money laundering, protests, medium-sized shepherd dogs, any of those things that I use LLMs for, or other sensitive topics, just from the shape of the traffic. So, if you're a CISO, a threat intelligence nerd, I know there's a lot of you out there, a policy person, or just somebody who's been feeding your deepest, darkest secrets to LLM chatbots, this is going to matter to you. My first guest is Jonathan Bar Or, also known as JBO. Jonathan is a pretty well-known information security expert and what I would definitely call a hacker. Most recently, he was Principal Security Researcher at Microsoft, where he served as the Microsoft Defender Research Architect for Cross Platform. He's worked on binary analysis, vulnerability research, reverse engineering, and cryptography. He's also helped uncover critical vulnerabilities that impacted millions of users across Windows, Mac, iOS, and more. He's presented on stages like BlueHat Israel, and if you've ever shipped code on a major OS, there's a chance that he's poked at it. Also joining us is Geoff McDonald. Geoff is a Microsoft Defender Security Research Lead and Principal Research Manager at Microsoft Defender for Endpoint. He leads teams of data scientists and engineers building large-scale machine learning models and pipelines that protect over a billion devices from malware and other threats, everything from malicious scripts to invasive malware campaigns. Together, Jonathan and Geoff are the minds behind Whisper Leak, the research that basically says, Yes, your AI traffic is encrypted, but no, that doesn't mean an observer learns nothing. Geoff, JBO, welcome to the Microsoft Threat Intelligence Podcast. I am Sherrod DeGrippo, Director of Threat Intelligence Strategy here at Microsoft. Thanks for joining me.

Jonathan Bar Or: Thank you for having us.

Geoff McDonald: Yes, happy to be here.

Sherrod DeGrippo: So it's good to talk to you about this because I remember when we were releasing this blog and JBO was kind of leading how we were getting this out. If any of you are interested in reading about what we're talking about, there is a great blog on Whisper Leak. You can just search that up. So, this is what is interesting to me. We've been telling people for years, Use HTTPS, look for TLS, lock icon, all of that. And now what you have found is that an attacker could potentially watch encrypted AI traffic, and still be able to infer what you're talking about, what the user is talking about from things like packet sizes and timing patterns. So can you help me understand exactly what Whisper Leak is, why it works, what it doesn't do? Instead of just simply saying, Encryption doesn't work.

Jonathan Bar Or: Yeah, so it's not that encryption doesn't work, it works well. I guess the idea of side-channel attacks, and generally the fact that TLS is not the end of the story, as you put it, is not new. If some of the listeners remember, when the Google search engine started, you could type, let's say, a word, and it would give you results back. And based on the number of results, if someone listened on the network, even though it's end-to-end encrypted, they could infer what you were looking for. For instance, if you started looking for words that start with the letter Z, you'd get significantly fewer results than you'd get if you started typing the letter E. So that's just one example, and obviously that would manifest in the number and sizes of the responses returned to the client. And that's like a primitive example of a side-channel attack. And we utilize the same ideas with Whisper Leak. So it's not a brand-new idea. And the fact that data sizes and timing expose critical information is not new, either. But applying that to the world of AI is what we were after, really.

Sherrod DeGrippo: Okay. So let's talk about how you did this exactly. So, I don't fully understand exactly how you can determine this. What stood out to me was that this was tied to streaming LLM responses with the tokens. So, Geoff, I'll start with you to give our listeners just a quick introduction to what tokens are in the land of AI LLMs.

Geoff McDonald: Sure. So, tokens in the world of large language models are basically small groupings of characters. A token can be anywhere from one character at the smallest, like just representing a period, for example, up to most of a word or an entire word. It's usually between, like, one and seven characters. On average, I think it's around four and a half characters for the English language, for example. And yeah, this is fundamentally how large language models break down text, both in understanding the text that you've written as well as sending text back down to you as a reply. So these are individually called tokens, and they have a length between one and seven, and that's the key part. Now, as you've just referred to, these side-channel attacks against large language models work because of that variance in token sizes. It can be between one and seven. So when in streaming mode, you might be chatting with, like, ChatGPT, for example, and you'll ask it a question. And you might notice that when it comes back with its reply to you, it doesn't come back with a full reply all at once, but it comes back revealing part of the reply a few letters at a time, a few letters, a few words, a few letters. So, it streams the response back to you. And over the network, through TLS encryption there, it's actually leaking the size of the tokens with each of those messages if care is not being taken.
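
To make the token-size variation Geoff describes concrete, here is a minimal sketch, assuming the open-source tiktoken library and its cl100k_base encoding (an illustrative choice, not a tokenizer named in the episode):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    text = "How is money laundering typically detected by banks?"

    token_ids = enc.encode(text)
    pieces = [enc.decode_single_token_bytes(t) for t in token_ids]

    # Each piece is roughly one to seven bytes; in a streamed response sent over
    # a stream cipher, those per-token sizes track the ciphertext sizes on the wire.
    for piece in pieces:
        print(len(piece), piece)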

Sherrod DeGrippo: And so can I ask, when we're talking about the response from the LLM chat as it appears, like you were saying, yes, it runs, like, you know, almost like Matrix text from the movie, right? It, like, starts filling up the screen. It doesn't appear all at once, bam, it trickles in. And so is that a user experience? Is that a UX choice to show the text as it appears to give a conversational feeling to it? Is that why it's like that?

Geoff McDonald: Yes, absolutely. And the engineering choice actually varies a little bit by the large language model provider. Like, for example, Google, if you talk with Gemini, you'll notice it actually doesn't stream token by token. It'll stream half sentence by half sentence back to you. So there are differences depending on which provider you're chatting with, whether it sends each partial word or token by token back to you, or whether it does, like, a grouping of tokens together, like half sentence by half sentence back to you. And that fundamentally impacts the risk of the different models to side-channel attacks like these as well.

Jonathan Bar Or: I will also add that eventually, yes, this is a decision. Like, all of us, you know, could just wait. You type in a prompt and you can just wait until you get the entire text, taking Geoff's point to the extreme. But most of us, as humans, want to feel like there is some sort of progress. And because of that, we get streaming answers. And obviously, there is an engineering constraint here, because it takes time to generate a proper LLM response to a prompt. So, during that time, obviously, we humans want to feel like progress is being made. And that's why we have those streaming techniques that Whisper Leak actually takes advantage of.

Sherrod DeGrippo: That's fascinating to me. That's something that I haven't really thought about, especially from a security or threat intelligence perspective, is that design choices within a UX to make the person, the human, feel like they're having a more human experience, that design choice then leads to the enablement of the side-channel attack for Whisper Leak. Yeah. Okay. That's an interesting aspect of this. It makes, I think, everyone listening, you know, I'm thinking, Well, maybe I'd be fine just getting my wall of text, like, bam, and waiting a few extra seconds versus having it give this, you know, artificial attempt at making it feel like I'm speaking with a human. It's almost like, are you familiar with the concept of skeuomorphic icons, where, you know, the icon for a notepad is a little piece of paper, which just kind of makes you feel like, Oh, this is equivalent to paper, or, you know, the icon for the media player is a little TV or a little speaker or something? It almost has that feel of, like, artificially conveying some kind of organic humanity, which is fascinating. So this UX then enables the side-channel traffic analysis. What is the difference here in terms of newness? Because we've had timing attacks, website fingerprinting. What is actually new here that warrants, like, a new name and makes it not just, like, another timing attack?

Geoff McDonald: I think, firstly, it might be really interesting to go over one of the key prior works, which I think is really fascinating in terms of the risks to large language models. There was a paper published about a year ago by Roy Weiss and his co-authors, titled "What Was Your Prompt? A Remote Keylogging Attack on AI Assistants". And I think it was a fairly transformational paper for the risk of side-channel attacks against large language models. They were also targeting the streaming-response language model providers, similarly to what we're targeting. Now, they did direct reconstruction, as each message is sent back down to the user in streaming, of the token length of each message. Then they used a really cool model architecture following the design of translation tasks. So basically they created a new translator which acts as a translator between the lengths of the tokens, like 1, 3, 7, 5, 1, 2, 3, and the English output. So they would prepare a large amount of training data where they interact with these chat models, they know what the true output is, and they translate these token lengths into a guess of what the English-language output is. And I thought it was, you know, really scary, because they can actually take a pretty good guess at the reconstruction of the full output that the large language model is sending back to you. And I think this is pretty transformational for the secrecy of large language model conversations. Now, I think the difference in our paper is that the attack can work even if you can't reconstruct the exact lengths of the packets. So, for example, some large language providers are now starting to hide the length of the tokens by adding obfuscation to them. And also there are some providers that, due to engineering choices, are grouping the tokens together, where they're sending four tokens at a time. So now you can no longer get the exact length of each token sent back down to the client. And I think the key finding of our paper is that, despite these mitigations, these models can still be at risk of inferring the targeted topic. So, some of the mitigations proposed by the original paper, like hiding the length of each token by adding obfuscation, aren't necessarily sufficient to fully protect all of our customers, well, all the customers of these models.

Sherrod DeGrippo: Okay, so this is some pretty in-depth stuff about how these tokens work and the lengths and things. JBO, help me understand too, you're not actually reconstructing content when doing Whisper Leak?

Jonathan Bar Or: We're not. What we've done, basically, is train a model against a specific what we called sensitive prompt or sensitive topic. For instance, in our paper, we used money laundering as an example. And the idea is that some powerful entity that has a lot of computation power, and can, let's say, get data from your ISP or can sit on your router somehow, can basically train a model just like we've done, that sees the streaming that Geoff was talking about, and basically conclude whether it's a sensitive topic or not. And, for instance, in that case they can, without understanding exactly the contents, flag and say, Hey, you know what, you were talking about the sensitive topic and I don't like it. And because of that, I can apprehend you or do whatever. I'm obviously talking about very powerful entities, not people like you and I. And so I think that's one of the interesting implications that we had. And the other thing that I will mention is that we kind of proved that the more data the adversary, yeah, let's call it an adversary, collects, the more accurate those things become. So, the more you speak with the LLM, the more you basically help that adversary. Another thing that I think is worth mentioning is that it could be happening today, even. Someone might be, right now, just siphoning all the data from ISPs, recording it, saving it somewhere, and then training later. So you don't even have to do it in real time. You can do kind of like an offline attack. So the implications are there right now, immediately, and even in the past, if you think about it.

Sherrod DeGrippo: So I think that's important to highlight what you just said. This is, like, industrial-grade surveillance. This is nation-sponsored level. This is not something that academics are doing as a demo for fun. This is, you have to have some real compute power to be able to pull this off. So, talking about that threat model, realistically it's going to be nation-sponsored actors with potentially ISP-level or backbone-level access. Is this something that could happen on, like, shared Wi-Fi at a hotel or a coffee shop?

Geoff McDonald: In principle, yes.

Sherrod DeGrippo: In principle, yes, okay. Fun. This is concerning, right, to everyone who potentially is putting sensitive information into an AI that they don't control, right? Like, putting sensitive personal information or asking about sensitive topics in a hostile area and having that government or that ISP be able to reconstruct, not reconstruct, but to be able to infer to some degree what you are working on within the AI.

Jonathan Bar Or: Yeah, and there is no way for you to know. If you just use the LLM and whatever secrets you might spill out, even if they're encrypted, again, someone just might be siphoning that data and you would never know about this to begin with. That's why, at least my recommendation, I don't know if Geoff agrees here or not, is to be extremely careful about what you write to an LLM. Even, you know, even if there is the little lock icon that you mentioned. It doesn't, like, it means a lot, but it's not bulletproof.

Sherrod DeGrippo: I think it is warranted to change how you think about safe enough. If you're using AI in a high-risk country with sensitive topics, it stops being about, like, Can they decrypt this? And it becomes more like, Can my traffic be classified? If someone is talking about something, can that particular surveillance mechanism, whether it's the country's administration or the ISP or whatever, classify it as something off limits or prohibited?

Geoff McDonald: Yep. Yes, exactly.

Sherrod DeGrippo: So let's talk about the pipeline that you built. I want to understand how you went from, We have encrypted packets, to, We have a classifier that can say, This looks potentially like talking about money laundering. What does that flow look like? Collecting traffic, data sets, like, are you using tcpdump for this? What does this look like?

Jonathan Bar Or: Yeah, I will say that, first of all, we started looking just by opening Wireshark, seeing what the packets look like, and seeing with our own eyes first that there is something to research, because the following parts are very expensive in terms of time and effort, right? So we looked for a correlation between the contents and the packet sizes, which we found. One thing that I do have to mention is that when someone uses TLS, which is everyone, there are two modes for the symmetric cipher underneath: one is a block cipher and the other is a stream cipher. And we noticed that virtually all of those LLM providers use stream ciphers, which makes a lot of sense. And that actually makes the contents more susceptible to that kind of inference, because with stream ciphers, the plaintext size and the ciphertext size are equal or almost equal. So if I see five characters that are encrypted, I can tell, in layman's terms, of course, that it's probably coming from a word like "hello" or "quack" rather than "encyclopedia" or "car", right? Because it's five letters. Once we did that, we understood that there were grounds for research here. We started collecting data, and a lot of it. I think Geoff would be the right person to talk about that, being an amazing data scientist.
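
As a rough illustration of the kind of signal JBO describes pulling out of a capture, here is a minimal sketch, assuming the scapy library and a hypothetical capture file; it is not the team's actual tooling, and real tooling would also reassemble TLS records that span multiple TCP segments:

    from scapy.all import rdpcap, TCP, Raw

    packets = rdpcap("chat.pcap")  # hypothetical capture of one chat session

    sizes, gaps = [], []
    last_ts = None
    for pkt in packets:
        # Keep only encrypted application data coming back from the service (port 443).
        if TCP in pkt and Raw in pkt and pkt[TCP].sport == 443:
            sizes.append(len(pkt[Raw].load))  # with a stream cipher, ciphertext size ~ plaintext size
            ts = float(pkt.time)
            if last_ts is not None:
                gaps.append(ts - last_ts)     # inter-arrival time between records
            last_ts = ts

    print("record sizes:", sizes[:10])
    print("inter-arrival times (s):", [round(g, 4) for g in gaps[:10]])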

Geoff McDonald: Yeah, I think the data collection at scale is one of the more challenging aspects. So we had to spin up, like, you know, quite a large number, well, a moderate number of virtual machines in the cloud in order to do the data collection. So, getting the volume of data needed to carry out these studies across the large variety of models involved was fairly expensive. And also very time consuming; like, collecting a single training data set for a large language model provider would take approximately a week of execution on one of these virtual machines. So, data collection at scale is one of the biggest challenges I'd say we faced in terms of carrying out the research.

Jonathan Bar Or: Yeah, the other thing was not just, like, collecting the data, but also coming up with the right kind of prompts. So, for example, when we asked about money laundering, we actually asked, like, 10,000 different variations or ways to ask about money laundering, because when you train a machine learning model, you want, like, a variety of data. You don't just want to ask, Hey, is money laundering okay?, like, 10,000 times. That doesn't make sense. So we did that. And for benign questions, we also came up with 10,000 different types of benign questions as well.

Geoff McDonald: No, JBO, it was 100 varieties for the fraud.

Jonathan Bar Or: Oh, 100. That's right. That's right, yeah.

Geoff McDonald: For the financial laundering case.

Sherrod DeGrippo: So one, I want to understand the methodology of how you did this. But before you tell me that, I'm dying. Who came up with this idea?

Jonathan Bar Or: That was me.

Sherrod DeGrippo: JBO. So JBO thought, I'm going to look at the encrypted traffic. I want to see if I can figure out what the topics are that someone is talking about. And Geoff, how did you design this experiment to figure this out? What do you do?

Geoff McDonald: Yeah, exactly. So JBO brought me in as a machine learning expert and said, Hey, you know, I've got this really cool problem and it looks promising based on my initial analysis. And we sort of worked on the methodology together, and then I worked on the machine learning side of things. So, key to the methodology is that we did not want to test our ability to overfit to one particular way of asking a question. So, the way we designed the methodology was we generated 100 ways of asking the same question. In this case, it was the financial money laundering topic. So we generated 100 different ways to ask a question on that same topic. And then we take 80 of those as training data, and we hold out 20 of the ways of asking as our test data to validate that, Hey, the model we trained can work against ways of asking the question that it has not seen before. So basically we ask all of the large language model providers out there many different questions. We ask those 80 questions many times, interspersed with about 10,000 random questions from the Quora dataset, which are similar kinds of questions but about different topics; that's the noise. And then we basically train a machine learning model on those, training it to identify, based on the network traffic we've captured, the data sizes and the inter-packet timings, whether we think a given conversation is about that targeted topic. Now, we had to be careful in a number of ways, because we wanted to mitigate the risk of, for example, the large language model providers caching the results. So we have a lot of special mitigations that we had to design, where we add random noise to the question each time we ask it to the provider, to mitigate the risk of the provider caching the result, which would completely invalidate our results, because the model would instead learn to answer the question of, Did the provider use a cached result or not. So we had to take some careful mitigations in order to reduce the risks in those ways as well.
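
A minimal sketch of the evaluation loop Geoff outlines, under several assumptions: the helpers collect_trace, load_topic_questions, and load_noise_questions are hypothetical stand-ins, and the gradient-boosted classifier is just one reasonable model choice for illustration, not necessarily what the team used:

    import random
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import precision_score, recall_score

    def featurize(trace, max_len=200):
        # Flatten one captured trace (record sizes plus inter-arrival gaps) into a
        # fixed-length numeric vector, padded or truncated to max_len each.
        sizes = (trace["sizes"] + [0] * max_len)[:max_len]
        gaps = (trace["gaps"] + [0.0] * max_len)[:max_len]
        return sizes + gaps

    def ask_with_cache_busting(question):
        # collect_trace() is a hypothetical stand-in for sending one prompt to a
        # streaming endpoint and recording the encrypted traffic it produces.
        # The random nonce mitigates server-side caching, which would otherwise let
        # the classifier learn "cached vs. not cached" instead of the topic.
        nonce = f" [session {random.randint(0, 10**9)}]"
        return collect_trace(question + nonce)

    topic_variants = load_topic_questions()    # hypothetical: 100 phrasings of the target topic
    noise_questions = load_noise_questions()   # hypothetical: ~10,000 unrelated questions

    random.shuffle(topic_variants)
    train_topic, test_topic = topic_variants[:80], topic_variants[80:]  # 80/20 split by phrasing

    # The real study repeats the training questions many times; a single pass is shown here.
    X_train = [featurize(ask_with_cache_busting(q)) for q in train_topic + noise_questions[:8000]]
    y_train = [1] * len(train_topic) + [0] * 8000

    X_test = [featurize(ask_with_cache_busting(q)) for q in test_topic + noise_questions[8000:10000]]
    y_test = [1] * len(test_topic) + [0] * 2000

    clf = GradientBoostingClassifier().fit(X_train, y_train)
    preds = clf.predict(X_test)
    print("precision:", precision_score(y_test, preds), "recall:", recall_score(y_test, preds))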

Sherrod DeGrippo: Was there any time where you had, like, two people sitting next to each other and one person asking a question and then you're watching the traffic and they're like, Does it look like money laundering? Does it look like money laundering yet? I imagine you had to do something in real time like that.

Jonathan Bar Or: Yes, pretty much.

Sherrod DeGrippo: What was that like? I mean, I imagine that scene in "Ghostbusters" where Bill Murray's character is, like, showing them the cards and is like, Is that it? Is that it? And, you know, the guy, he's always shocking him and telling them he's wrong. And then the pretty woman, he's like, Yeah, that's always correct. So like, you sit next to each other and you're like, I'm asking you about money laundering. What was that experience like?

Jonathan Bar Or: I mean, we didn't really, we didn't physically sit next to each other, but we did work on it together remotely.

Sherrod DeGrippo: Okay.

Jonathan Bar Or: Geoff is in Canada, I'm in the US. But I think we did have, if I remember correctly, at least one of those similar scenarios. I remember vaguely, at least, that it was really late at night; Geoff had been training the models. And I remember I was running one of the models and testing to see how things behave. It basically gives you, eventually, a probability between 0 and 1 of whether something is money laundering or not. And when it started answering really correctly, I was trying to, like, test it and see, like, how far does it go? Like, I asked about money and I asked about, like, laundry machines, for instance. And it was really interesting to see that it didn't pick those up, but it did pick up, like, the money laundering per se. And then I knew that Geoff had, like, really trained the model well and that we had something good on our hands. It was really interesting because with LLMs, it's extremely fuzzy. Like, there are certain ways that you can ask something and maybe you'd get a probability that is not very close to 1, but still you would be asking about money laundering. So I think I ran several of those, and we had something like that scenario, with Geoff being somewhere in Canada while I was doing it.

Geoff McDonald: One other thing I was fascinated by when I first saw those real-time executions of the attack from JBO, and we didn't include this in the paper or blog, is that when it would identify, like, the money laundering conversations, JBO would run it in real time against each token as it comes back. And you could see early in the conversation, early in the response, it was actually able to identify that the topic was money laundering. So, it didn't actually have to wait for the end of the response to be streamed in order to classify it. It was actually classifying it successfully as money laundering when the response was only partially streamed to the user. Which I thought was, you know, fascinating, because it means that the machine learning model is not just learning how long the overall response is as the information providing the side channel. It is actually learning from the individual messages, so it gets information even from a partial reply as well.

Jonathan Bar Or: Yeah, I think this is something that we didn't mention in our paper, but in the blog post there is a video. And if some of the listeners want to check out the video later, they can actually see the demo. And in the demo, you can see that the probability jumps to almost 1, or very close to 1, even though the streaming hasn't finished, as Geoff was mentioning. So that's the behavior Geoff was describing, and we do see it in the video that we recorded.
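
A small sketch of that partial-stream behavior, reusing the hypothetical classifier and feature function from the sketch above, plus a hypothetical live_records iterator that yields one (size, gap) pair per observed encrypted record:

    def classify_incrementally(clf, live_records, threshold=0.95):
        # live_records is a hypothetical iterator of (size, gap) pairs as each
        # encrypted record is observed; clf and featurize come from the sketch above.
        sizes, gaps = [], []
        for size, gap in live_records:
            sizes.append(size)
            gaps.append(gap)
            prob = clf.predict_proba([featurize({"sizes": sizes, "gaps": gaps})])[0][1]
            print(f"records seen: {len(sizes):4d}  P(target topic) = {prob:.2f}")
            if prob > threshold:
                print("flagged before the response finished streaming")
                break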

Sherrod DeGrippo: I also want to talk about the fact that you tested 28 models. That is a lot. Oh, my gosh. You've got OpenAI, Microsoft, Mistral, xAI, which is our friend Grok, Alibaba, DeepSeek. Any patterns that you saw, anything stick out about any of the models in particular?

Geoff McDonald: Yeah, I think, like, maybe one of the most interesting observations, at least for myself, was for the Alibaba models, for example. We saw that it was batching the tokens, around four tokens per response sent to the device. So you can think of it as, instead of sending part of each word at a time, it was sending maybe around half a sentence to the user at a time. And the attack was extremely effective against the Alibaba models, despite the fact that they were sending large chunks of the response to the user at a time. And that was something that I thought might not actually work. But clearly for some models, despite that batching, the attack is extremely effective against them. That was one of the most interesting observations I found.

Jonathan Bar Or: I was pretty shocked that the combined information from packet sizes and the timing between them is extremely effective, whereas if you take only the data sizes or only the timing, sometimes it wasn't as effective. So I think that was kind of a little surprise for me, because I was thinking that the timing might be extremely difficult to use, because there is a lot of noise when it comes to measuring time, especially if you look at, you know, things like routing or how much stress there is on a certain server. I thought that it might not be so great, but apparently we collected enough data, and there is something in the timing differences. So that was a big surprise to me. And that's true for all the models, not just, like, one of them. One thing that I did want to clarify for the listeners is that an adversary might have to train their own machine learning model for each target model, for each target LLM basically, because those behave very differently.

Sherrod DeGrippo: So everyone's making AI products now. And I guess what that leads me to think of is, you know, the three of us have worked for many years with a lot of product managers in our time. And if somebody is a product manager or is an engineer building one of these products, what would you tell them they need to know after you've learned this and done this research? What does an engineer or a product manager need to know if they're building these kinds of products? You know, maybe they work at one of those 28 providers that you tested.

Jonathan Bar Or: Yeah. First of all, side-channel attacks are tricky because there is no, like, single point of failure. It's unlike memory corruption, where someone definitely made a programming error. You don't have that with these kinds of attacks. So, the first thing is to be mindful of those ideas. A lot of programmers out there just can't even conceive of the idea. So I think that's, like, obviously the first one. The second one is that the OpenAI API, as well as Mistral, I think in response, well, I don't think they actually said that, but as a response to our research, have included new parameters that add more obfuscation, which basically mitigates, you know, Whisper Leak to an extent, because it's probabilistic, of course. But there are accepted, like, probabilities of attack that anyone can work with. And so using those obfuscation ideas is something that I would encourage. It does, you know, cost more money, for instance, because you have to create, like, dummy tokens. But I think it's worth the cost, because it does protect the customer data, of course. And obviously, like, the number one thing that I would recommend is that if you don't have to stream responses, then don't do it. Like, if it's okay in terms of user experience for someone to wait and then get the entire thing in one go, then that would be the preferred way, in my opinion.

Sherrod DeGrippo: That's a hard sell. I feel like if you tell a product manager, Take this very humanizing feature away from the AI, they're going to say, Oh no, I have to have it.

Jonathan Bar Or: Well, it depends. So if you think about a chatbot, then I agree with you. But if you think about something that indexes files and, I don't know, tags different types of objects based on pictures like Apple Intelligence does, for instance. You don't have to do it in real time, and then you definitely do not have to stream the data, in my opinion.

Sherrod DeGrippo: Okay. Geoff, what about you? What would you tell a product manager or an engineer making one of these 28 platforms, or coming up with the 29th one?

Geoff McDonald: Yeah, definitely apply the obfuscation to the response so that the exact size of the token can't be reconstructed by the attacker on the network. I think that's the most basic thing they should do, and it's extremely easy to add. It's extremely cheap. It should have no impact on the user experience. It just makes things significantly more secure. As our research shows, you know, that still isn't a perfect solution. Another thing that's good at mitigating it, while not impacting the user experience, is batching the tokens. So don't send the response one token at a time, send it 10 tokens at a time. Nowadays, these large language models generate really quickly. So even if you're sending a response to the customer, to the user, every 10 tokens, from the user's perspective it'll look just the same. And this is already what a lot of the large language model providers do. And it also helps reduce the risk. So I would recommend any provider combine both of those techniques: batch the tokens and add obfuscation on top of that, adding a random-length string to each reply.
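
A minimal server-side sketch of those two mitigations combined, batching plus a random-length padding field; the chunk format and field names here are illustrative assumptions, not any provider's actual API:

    import json
    import secrets

    def _make_chunk(tokens, max_pad):
        pad = secrets.token_urlsafe(secrets.randbelow(max_pad) + 1)  # random-length filler
        payload = {"text": "".join(tokens), "p": pad}                # "p" is an illustrative field name
        return ("data: " + json.dumps(payload) + "\n\n").encode()    # SSE-style chunk

    def stream_with_mitigations(token_iter, batch_size=10, max_pad=64):
        # Batch tokens so individual token sizes are not exposed, and pad every
        # chunk so the remaining size signal is noisy.
        batch = []
        for token in token_iter:
            batch.append(token)
            if len(batch) >= batch_size:
                yield _make_chunk(batch, max_pad)
                batch = []
        if batch:
            yield _make_chunk(batch, max_pad)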

Sherrod DeGrippo: It sounds like one of the things that comes out of this is that engineers and designers, you know, product designers and architects, need to understand that the way it streams the response, the way it caches things, and the way that tokens are used and released, that is now part of an attack surface, which I don't think most developers, even really deeply security-minded developers, would have considered part of the attack surface previously. But this paper that you released and the blog posts and everything really prove that those things are absolutely risks. I also think about LLMs being put into potentially sensitive environments like healthcare or legal stuff or government stuff, and making sure that there are people who understand what a side channel is and how traffic analysis works, and including them in some of this architecture work. Because those are the people who are going to be able to say to you, Hey, you know, I know it feels very real and conversational if you do it this way, but we need to put in some mitigations to make sure that it's not giving unnecessary indications of what those topics are.

Jonathan Bar Or: That's exactly right. And that's also why it was extremely important for us to release our paper. I will mention that we started this research in January. So, it took a very long time to not just collect data, but also do the actual analysis and, more importantly, make sure that we handled disclosure the proper way. And we did want to bring that idea to everyone's attention, just like today, hopefully people don't do this, but when people roll their own cryptography, they shouldn't check passwords, for instance, by doing a memory compare. And I think you mentioned caching, for instance; caching and all those things are the breeding grounds for side-channel attacks. When checking a password, you do not want to stop immediately once a character is wrong. You want to check the password in constant time. That's why we don't do a memory compare when checking passwords. Actually, it's really interesting: from a developer's perspective, again, there is nothing wrong with doing a memory compare; functionally it works well. But you have to be mindful of those ideas to come to the conclusion, Hey, I can't just check a password character by character and stop when there is a mismatch, because the timing, for instance, would leak the password. It's the same idea with LLMs. And I think it's important to bring that to everyone's attention in this day and age where everyone uses an LLM. Just like people don't spin up their own cryptography libraries today, maybe in the future we won't see cases where people roll out their own LLM serving, but today we do. So that mindfulness is quite important, in my opinion.
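
The password-check example JBO gives maps to a well-known pattern; a minimal sketch in Python, where hmac.compare_digest is the standard constant-time comparison (in practice you would compare password hashes, not raw passwords):

    import hmac

    def check_password_leaky(supplied: str, stored: str) -> bool:
        # Early exit: how long the comparison takes depends on how many leading
        # characters match, so timing leaks information about the stored value.
        if len(supplied) != len(stored):
            return False
        for a, b in zip(supplied, stored):
            if a != b:
                return False
        return True

    def check_password_constant_time(supplied: str, stored: str) -> bool:
        # Timing does not depend on where (or whether) the values differ.
        return hmac.compare_digest(supplied.encode(), stored.encode())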

Geoff McDonald: Yeah. One other sort of interesting risk that all of these side-channel attacks against large language models highlight, which JBO just brought to mind again, is the risk of passwords. So a lot of users may, in their large language model prompts, give it a bunch of code with a hard-coded password in it, for example. Now, this is a bit of a risk. I believe the original paper, which can do output reconstruction, is going to have a really hard time reconstructing the password. So, you'll have a password in there; it's high entropy, and it'll be a small number of characters per token, because it's usually not, like, an English word. So the attack will have a very hard time reconstructing the password accurately, which is good news for everyone. But I think there's some bad news here too, in that I still think the information being leaked about that password field in the token lengths is likely to reduce the complexity of the password enough that an attacker might be able to brute force it. Because instead of it being one of a billion combinations, they might think it's likely one of a hundred combinations of what the password could be. So I think people should be extremely mindful when putting passwords or API keys into chatbots, because it will reduce the complexity of the password to where an attacker on your same local network might have a pretty good guess, or at least a good set of ideas to try, of what your password might be. So that's a very high-risk aspect as well. And adding to that, for example, crypto wallets, you know, it's quite common to have a sequence of words as your password. You might have a, I forget the name of the type of password, but you have five words in a row, which is your crypto wallet password, for example, and it's highly secure. When you get, like, three words in a row, you have an extremely secure password because of the number of possible words there are. But when you go for a word-based password like that, it makes it highly susceptible to a reconstruction attack like this, for example. So it is possible that when you go to that more secure password type of a sequence of words, your password would actually be vulnerable to a side-channel attack like this as well. So, I think there's a lot of reasons to be really careful about passing passwords, for example, in content to large language models. And a lot of people might be doing this inadvertently. They might not even know that their passwords are in the content. They'll be vibe coding; they just hard-coded their password in their Python application because they're not going to commit it to the repo yet. They're just testing. But what they don't realize is that technically all of that code is being sent up to the cloud automatically behind the scenes, and an attacker might be able to steal your password or API key. And so it presents significant risk, especially to passwords. And yeah, not only in reducing the complexity; if it's, like, a word sequence, it might actually be able to be fully reconstructed.
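
A back-of-the-envelope sketch of that passphrase concern, using a made-up wordlist and made-up leaked lengths, just to show how much the candidate space can shrink:

    from math import prod

    wordlist = ["correct", "horse", "battery", "staple", "orange", "window",
                "silver", "mountain", "paper", "rocket", "cloud", "tiger"]

    observed_lengths = [7, 5, 7, 6]  # hypothetical leaked lengths of a 4-word passphrase

    candidates_per_position = [[w for w in wordlist if len(w) == n] for n in observed_lengths]

    full_space = len(wordlist) ** len(observed_lengths)
    reduced_space = prod(len(c) for c in candidates_per_position)
    print("search space without the leak:", full_space)     # 20736 for this toy wordlist
    print("search space with length info:", reduced_space)  # 80 for this toy wordlist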

Sherrod DeGrippo: I worry about the vibe coding aspect a lot. I will make an example now. JBO, one word answer, should you roll your own crypto?

Jonathan Bar Or: No.

Sherrod DeGrippo: Geoff, one word answer, should you roll your own crypto?

Geoff McDonald: No.

Sherrod DeGrippo: However, I have seen in a lot of studies that I've read, where people are vibe coding, that the LLM they're coding with suggests not using established, well-known libraries and established, well-known cryptographic schemes at any point. The LLM has the capability to not need all of those libraries that have been established and maintained and checked. And it says, Hey, just go for it. The vibes are good. It's a good vibe. Let's just go. So, I think something that developers really need to think about and hold onto is that it can give you advice, and a lot of times it can guide you in a direction that is directly contrary to a secure application.

Jonathan Bar Or: Yeah, I would say, I don't know if it's just my approach or not, but my approach to vibe coding and all those things is that it's great, but it's a tool. And you use it when it's appropriate. By the way, I've vibe coded once or twice in my life, and I've seen really weird suggestions. For instance, the LLM put out code where, in order to get the current username, it actually shells out and runs whoami. I mean, that's not something that a normal developer, in my opinion, should do, or, I'll put it slightly differently, a good developer. But those things are there. So in this case, it's not a security risk per se, not necessarily, but it could be a risk of, let's say, performance, right? So that's one point to take into consideration. And the other one is that the LLM that you're using might not have been trained on data that is moving so fast. So, for instance, if there is now a huge vulnerability in a cryptography library and the LLM is not aware of that, then you might inadvertently expose yourself. So that's another thing to take into consideration. Don't just trust whatever output the LLM gives you, but actually use that as boilerplate code, or as whatever you want to use it as. At least when I code, I don't just rely on the vibes.

Geoff McDonald: Yeah. And I think you're touching on, like, a really good point, a really commonly talked-about risk area for large language models as well. Like, I think it's one thing for the large language models to accidentally author vulnerable code, but I think there's also a growing concern over the large language models deliberately authoring vulnerable code. So it could be that a nation-state or, like, a government requires the large language model provider to put in bad training data, which basically results in the ChatGPT model purposely authoring insecure code. So it could be that they want it to use a specific encryption algorithm which is insecure, because that government knows how to break that encryption algorithm. And that is a really big, challenging risk as well. One other really interesting example of the non-deliberate vulnerable code that I've seen is, like, Supabase. Supabase is a provider where you can get, like, a Postgres database, and it's commonly used. It's really growing in popularity for creating websites where you need to have user accounts or track data over time. You can build it for free with Supabase and scale it for free to quite a large scale. So it's becoming very popular to use, and a lot of vibe-coding applications, including web apps, are using Supabase. But I think if you look at when people create web apps using Supabase, most of the time, I don't know if it's most of the time, but tons and tons of the time, it is actually vulnerable, and it's hard-coding the Supabase key into the web app itself, exposed to the user. So any user of the website can dump the whole database and get access to all the customer information, and this happens a lot. There was a Reddit post of a person talking about this issue. And ironically, one of the replies was a person advertising a website they'd created to scan the website you've created for security issues. And this person goes and looks at that website that person had created, and it had the Supabase vulnerability, ironically.

Sherrod DeGrippo: So, it shows that the tools that you use and the libraries that you use are bringing the vulns with them, as always.

Jonathan Bar Or: I will even add one more thing for the paranoid among you: you never know how the model was trained, and you don't know if it's backdoored or not. So you always want to check the output of the LLM anyway, because you don't know when you're going to ask it to create something and it's going to add a backdoor somewhere.

Sherrod DeGrippo: Well, that's a scary note to end on, JBO. I appreciate that. Well, that was Geoff McDonald and Jonathan Bar Or telling us all about Whisper Leak and how side-channel attacks can potentially take a look at your AI LLM conversations. Geoff, JBO, thank you so much for joining me. It was great to have you on the "Microsoft Threat Intelligence Podcast".

Jonathan Bar Or: Thank you so much.

Geoff McDonald: Thank you so much for having us both.

Sherrod DeGrippo: Thanks for listening to the "Microsoft Threat Intelligence Podcast". We'd love to hear from you. Email us with your ideas at tipodcast@microsoft.com. Every episode will decode the threat landscape and arm you with the intelligence you need to take on threat actors. Check us out, msthreatintelpodcast.com for more, and subscribe on your favorite podcast app. [ Music ]