Data Security Decoded
Ep 43 | 1.20.26

How Rubrik Zero Labs Uses LLMs to Analyze Malware at Machine Speed

Transcript

[ Music ]

>> Amit Malik: One of the malware samples we identified in our report is actually running on the machine. It's collecting all the information about the machine and then sending it to the LLM, saying, hey, I'm running in this environment, now give me the code to bypass all these things. That code is coming dynamically from the LLM; it's dynamically generated. It's not part of the actual binary, so the footprint the attacker has to use in this case is very, very small. You can just put a very small binary on the machine that has plain English inside it, and then it will connect to the LLM and do the rest of the job. So it will definitely add some complications, but I do feel that, then, we will have the same technology, so we will be able to operate at the same machine speed and counter these things too.

[ Music ]

>> Caleb Tolin: Hello, and welcome to another episode of Data Security Decoded. I'm your host, Caleb Tolin, and if this is your first time joining us, welcome to the show. Make sure you hit that subscribe button so you're notified when new episodes go live, and if you're already a subscriber, thanks for coming back. We encourage you to, as Dr. Seuss would say, rate the show and leave a comment below. I love that rhyme, but in all seriousness, this really helps us reach more listeners like you who are eager to learn more about reducing risk across their business. Now, today, I'm joined by Amit Malik, a cyber researcher for Rubrik Zero Labs. His team released a report about Chameleon malware that hides in the OS and GhostPenguin, which steals your data via protocols you aren't even watching. Rubrik Zero Labs used LLMs to catch these ghosts, but here's the real question: in a world of AI-driven attacks, is your data resilient, or are you just building a faster getaway car for the bad guys? Let's get you your answers.

[ Music ]

>> Caleb Tolin: Amit, welcome to the podcast. It's so great to have you on Data Security Decoded. I'm really excited for this conversation, but before we dive into the meat of it all, I have to ask you something that I ask every guest, and that is, what is something that is not related to cyber that you're obsessed with lately? For me, it's been Pokemon cards. When I was a kid, I did a little bit of collecting of that, and recently, a friend got me back into the world of trading card games, and specifically, Pokemon. And I've started my own little collection, and it's really cool to see how the art has evolved over the years and see the culture that's built up around this trading card game. And it's been really cool to be a part of, so what is something that is non-cyber related that you're obsessed with?

>> Amit Malik: Thank you, Caleb, for inviting me to the podcast. So, recently, I have gotten interested in psychology and philosophy. I'm trying to go through the people that are prominent and read their books. Right now, I'm reading Osho, the books that he has written and his teachings. There are a bunch of books that I have collected, and I'm going through them one by one. At the moment I am reading a book called Awareness, and it is pretty interesting, the way things are described, the way human minds work, and the ways to work through that.

>> Caleb Tolin: I love it. When I was in school, I studied political science and we did a little bit of philosophy, not quite as much as I would have liked, but it's very interesting to kind of stretch and mold your mind in a different way by reading different philosophies, so very cool, very cool. Well, I'm really excited to talk about this report that you put together for Rubrik Zero Labs. And, you know, to kind of slightly play devil's advocate here, everyone really claims that they use AI-powered analysis, but how did the LLM change the workflow in this case? Did it actually find the malware, or did it just explain what a human found much faster?

>> Amit Malik: So basically, I mean, we all know that AI is changing productivity for people. It is helping a lot in the development world, and it is helping a lot on the customer side and in other areas, right? So our goal is basically how it can help in malware analysis, because that's what we do day in, day out. We worked on that thought and designed a system. Now, I would not say that we are at a very mature level, because there are so many nitty-gritty details in malware analysis. Malware sometimes downloads a payload only once you run it, right? And then there are various obfuscation techniques and so on and so forth. But with the current state of affairs, for the types of malware that we are getting, we do the decomposition that we can, like extracting macros, doing any unpacking that we see, or basic dynamic analysis: we execute the sample, try to extract the memory, and from that extract the code. Then we run that through the LLM, because LLMs are really good at understanding code, and at the end of the day, malware is just code, right? So we have developed specialized prompts, worked on them, and we ask the model what this code looks like.

Surprisingly, we had not initially thought this could give us that level of results, I mean, the quality of results that we are getting on a daily basis. Just to give you a perspective, we are roughly getting around 5,000 to 6,000 samples a day from the hunting and all the other things we are doing. Out of that, we do the clustering and everything, which is also part of the system design, and then we get roughly 500 to 600 samples that we really want to look into from the LLM point of view. We send those to the LLM for analysis, and the LLM flags only 10 to 20 samples that are worth looking into, that are really new, that are using new techniques and so on and so forth. Every day we are finding surprising facts that would not really be possible to find if, as an analyst, I just worked on that myself, because the number of samples is too large. I can't possibly analyze 500 or 600 unique samples of interest on a daily basis. So I would definitely say that the analysis based on the LLM is helping. As an analyst, I can say it has increased my productivity to a great level. I do not really have to go and do the initial analysis of a piece of malware; it's already been done by the LLM. And I would say that if the code is not obfuscated and not that complex, and when I say complex I mean not in terms of functionality but in terms of the layers of obfuscation it is using or the type of payload it will download from the internet, then it is providing really, really good results. That's what I can say from my experience.
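
To make the daily funnel Amit describes concrete, here is a minimal sketch of that kind of triage, assuming a hypothetical local sample store; the clustering key, file glob, and thresholds are illustrative stand-ins, not Rubrik's actual pipeline.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def cluster_key(sample: bytes) -> str:
    """Illustrative clustering key: hash of the sample's printable bytes.
    A real pipeline would use richer features (imports, code similarity, etc.)."""
    printable = bytes(b for b in sample if 32 <= b < 127)
    return hashlib.sha256(printable).hexdigest()[:16]


def ask_llm_is_novel(path: Path) -> bool:
    """Stub for the LLM analysis step (see the prompt sketch further down)."""
    return False  # replace with a real model call


def triage(sample_dir: str) -> list[Path]:
    """Reduce the ~5,000-6,000 daily samples to one representative per cluster,
    then let an LLM flag the handful that look genuinely new."""
    clusters: dict[str, list[Path]] = defaultdict(list)
    for path in Path(sample_dir).glob("*.bin"):
        clusters[cluster_key(path.read_bytes())].append(path)

    # One representative per cluster (roughly 500-600 in Amit's numbers).
    representatives = [paths[0] for paths in clusters.values()]

    # Keep only the samples the model flags as novel (roughly 10-20 a day).
    return [p for p in representatives if ask_llm_is_novel(p)]
```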

>> Caleb Tolin: Right, and I'd love to talk a little bit more about that obfuscation that you mentioned there, and also, what a difficult word to say, you know? But anyhoo, how did the LLM help you really understand the intent of the code rather than just the syntax?

>> Amit Malik: So basically, what we really do is this. Let's say we are talking about a Linux binary, an ELF binary. Most of our focus right now is on cloud-based threats, whether that is Linux cloud workloads, the ransomware that is out there, or document-related threats, like a macro or something like that. So let's say we have a Linux binary of a certain size, using a certain type of code base inside it. Our focus is to identify the actual business logic of the malware, because in reality that is a lot of code, and it will have lots of library code integrated as well, plus the system-level stuff it is using. We don't want to send all of that to the LLM. Otherwise the LLM will get confused, because if you send all of those things, it will not really be able to identify what the core of the logic is. The cost will also be very high, because now you are looking at a lot of tokens you have to consume if you are sending the entire code. So we follow a step-by-step process: we have the code, we remove the library code, and there are techniques for identifying library code and removing it, we do the decompilation of the code, and then we send the code to the LLM. Now, the core of the LLM step, again, is that you have the business logic extracted before sending it, and then there is the prompt, the type of prompt we have to use. It requires a little bit of effort in terms of trying different permutations, because a prompt is just the English text you use to talk to the LLM along with that code. Based on our experience, and after a little bit of iteration, we are able to embed this code into a prompt and ask the LLM, hey, can you do the analysis of this code that we are sharing with you, and what type of functionality or uniqueness do you see that has not been seen previously? Now, the existing models, like ChatGPT or Anthropic's models, have a certain level of guardrails inside them. They are designed with a defensive purpose, because they do not know whether it is really an analyst asking or a malware author asking. So that is a challenge we need to solve, but we have prompts that essentially say, take it that we are the good guys, try to do the best you can, and tell us, in summary, what this code is doing, and if there is anything significant, tell us that you have seen something significant. So that is the way we are using it, yeah.
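
A minimal sketch of that prompt-plus-decompiled-code step, using the OpenAI Python client as a stand-in for whichever hosted model is used; the model name and the prompt wording are illustrative, and the "we are defenders" framing mirrors what Amit describes rather than Rubrik's actual prompts.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ANALYST_PROMPT = """You are assisting a malware analyst on a defensive security team.
Below is decompiled code with library functions already stripped out, so it should
be mostly the sample's own business logic. Summarize what the code does, and call
out any capability or technique that looks new or unusual.

Decompiled code:
{code}
"""


def analyze_decompiled(code: str, model: str = "gpt-4o") -> str:
    """Send the stripped, decompiled business logic to the model and return its summary."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": ANALYST_PROMPT.format(code=code)}],
    )
    return response.choices[0].message.content
```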

>> Caleb Tolin: Right. Absolutely. Now, let's talk a little bit about Chameleon C2, which you mentioned in the report. Why is the Windows Subsystem for Linux such a juicy target? Is it because EDR tools are, essentially, just looking the other way when Windows is talking to its Linux sublayer?

>> Amit Malik: Yeah, exactly. I think it all depends on the malware evolution. Malware authors are using all the unique ways because they have the infrastructure. They have the horsepower in terms of they can use the different security products in their environment. They have to -- you know, they kind of test those things, you know, because for a malware author, it is just one single stuff, right? They just have to bypass the security products, and then, they have to just deploy into their target and then compromise those things. Specifically, when we talk about the APDs, we believe that some of these, you know, kind of malwares that we have mentioned here, they are part of some malware, you know, the APT groups that have done. So this is the interesting thing, it is not like it has been talked about that the WSL can be abused by the malware authors, but it was mostly in terms of as a proof of concept in conferences by the researchers and so on and so forth. We have seen, like, in this, they have beautifully used the WSL to actually use the Linux subsystems to compromise and do all this kind of activity that they are doing, right? And we are tracking this thing, and now we see. Initially we saw one or two samples, and now, we see a large quantity of samples, in terms of that are using this type of technique. That is so -- when these type of things happen, it can be that the malware authors are actually testing something initially, and they want to see whether it is detected, not detected, and what really is going on. And then, later on, they kind of, you know, use that in a more organized way and then use that to compromise more users, so this one was interesting. And so, it is not like we have talked about only three cases that we felt is very, very interesting. On a daily basis, we are getting lots of insight. And again, that credit I will give to the LLM because, practically, it was not possible for us to go through that number of samples and then, you know, extract those insights from these malwares. It is because of the systems and the design of the systems that we have done in, you know, accordance with the LLM, that we are able to kind of support these things as they happen or as the malware sample is submitted or as the malware sample is circulated. We have that way to look into it, yeah.
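
For defenders who want a starting point against this kind of WSL abuse, here is a small sketch that scans exported Sysmon process-creation events (Event ID 1, assumed here to be exported as JSON lines) for WSL binaries launched by unexpected parents; the parent allow-list is illustrative and would need tuning per environment.

```python
import json
from pathlib import Path

# Parents that commonly launch WSL legitimately; anything else is worth a look.
# Illustrative only -- tune for your environment.
EXPECTED_PARENTS = {"explorer.exe", "windowsterminal.exe", "cmd.exe", "powershell.exe"}
WSL_IMAGES = {"wsl.exe", "wslhost.exe", "bash.exe"}


def suspicious_wsl_launches(events_path: str) -> list[dict]:
    """Return process-creation events where a WSL binary has an unexpected parent."""
    hits = []
    for line in Path(events_path).read_text().splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        image = event.get("Image", "").lower().rsplit("\\", 1)[-1]
        parent = event.get("ParentImage", "").lower().rsplit("\\", 1)[-1]
        if image in WSL_IMAGES and parent not in EXPECTED_PARENTS:
            hits.append(event)
    return hits
```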

>> Caleb Tolin: Right, right. The productivity gains that you're mentioning are just incredible, so this is really, really exciting. And your report mentions APT41. Is the Linux RAT you found a greatest-hits version of their old stuff, or are we seeing a total rebuild for modern cloud and hybrid environments?

>> Amit Malik: Based on the code and variables, it feels like it is associated with APT41, though there is no 100% confidence that it really is. But if you look at the maturity of the code and the type of functionality in the RAT, it looks like it is associated with an APT, and it most likely might be APT41. On the Linux side, Linux is primarily used by enterprises, you know, servers and all these things. It is not somewhere people go to steal your credit card or your password. Linux is mostly business oriented. If somebody is deploying a RAT with sophisticated capabilities, then it largely means they have different and bigger motives than what simple commodity malware is doing. So on fair grounds, we do feel that it is associated with an APT. And the system basically identified it; on that day we got around three or four notifications, which we have talked about in the blog. Because the system is relatively new, we are getting very interesting insights right now. There are a bunch of other things we haven't really talked about yet, but we are getting really good insights from the system, and with time we will share those findings with the community. But yeah, it looks like it might be associated with an APT, and it was purely the behavior of the code; the identification was done by the LLM itself. We just looked into it and verified the technical accuracy. Because here's the thing: right now, at this state, you cannot trust the LLM 100%. You can't say that whatever the LLM is saying is 100% correct. So that part you have to do yourself. We got the notification, we saw that there is something new coming, but then it has to be technically validated, so we validated all the functionality manually to make sure that everything there is correct.

>> Caleb Tolin: Absolutely, yeah, hallucinations are still a thing. And so, it's important to have that human layer of trust there for sure. Now, your report also mentioned that GhostPenguin uses UDP for communication. For our non-networking nerds out there and engineers, why does that make a defender's life really, like, a nightmare?

>> Amit Malik: I mean, with TCP you know that there is a handshake, and then there is proper packet rebuilding you can do, because there is a sequence and all those types of things, so you can correlate the timelines and see, hey, what is really being done. But UDP is very, I would say, asynchronous in that way: you have packet delivery right now, you have command and control. Analyzing, let's say, a PCAP file will not be that easy; in TCP you can correlate the sequences and such, but in UDP it's kind of difficult from that point of view. But it was interesting to see that malware authors are actually going in that direction and using UDP for command and control.
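
To illustrate the difference Amit is pointing at, here is a small Scapy sketch: TCP segments carry sequence numbers you can sort a stream by, while UDP datagrams in the same capture can only be grouped by their flow tuple, with no built-in ordering to rebuild a conversation from. The pcap filename is hypothetical.

```python
from collections import defaultdict
from scapy.all import rdpcap, IP, TCP, UDP  # pip install scapy

packets = rdpcap("suspected_c2.pcap")  # hypothetical capture file

tcp_streams = defaultdict(list)
udp_flows = defaultdict(list)

for pkt in packets:
    if IP not in pkt:
        continue
    hosts = (pkt[IP].src, pkt[IP].dst)
    if TCP in pkt:
        # TCP gives us sequence numbers, so the stream can be put back in order.
        flow = hosts + (pkt[TCP].sport, pkt[TCP].dport)
        tcp_streams[flow].append((pkt[TCP].seq, bytes(pkt[TCP].payload)))
    elif UDP in pkt:
        # UDP has no handshake and no sequence numbers: all we can do is group
        # datagrams by flow and guess at ordering from capture timestamps.
        flow = hosts + (pkt[UDP].sport, pkt[UDP].dport)
        udp_flows[flow].append(bytes(pkt[UDP].payload))

for flow, segments in tcp_streams.items():
    ordered = b"".join(payload for _, payload in sorted(segments))
    print(flow, len(ordered), "bytes reassembled by sequence number")
```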

>> Caleb Tolin: So I want to step back a little bit from the report. We dove into multiple elements of this blog and report that you put together, but if I'm a CISO at some midsize company, I'm probably thinking, I don't have a Rubrik Zero Labs. I don't have this team that's, you know, investigating these innovative ways to scan and analyze malware. How do these findings change how an average company should look at their posture?

>> Amit Malik: One thing I would say, and I think pretty much everybody is doing this: if that level of resources or skill set is not there, then, just like we are contributing, the security community is very strong in terms of the information that is coming from all the companies, the other security companies out there. They proactively share any important intelligence on their blogs, so teams should keep an eye on that to see what is really going on, and then they should look at what their environment looks like. If their environment has higher Linux or cloud exposure, which is going to be the case, then they should ingest that information, try to see how they can leverage it in their security posture, and follow the best practices to mitigate the risk.

>> Caleb Tolin: Right. All right. So my last question for you kind of circles back to the beginning. You started us off by talking about how you're really getting into analyzing different philosophies, and this is kind of a philosophical question for you. If hackers are starting to use LLMs to write code as fast as your systems are analyzing it, who wins that race? Is machine-speed defense enough when the attacks start moving at machine speed as well?

>> Amit Malik: Yeah, I mean, definitely. That is very interesting, because there is one report that we have published ourselves where we have shared that there are malware samples actually trying this out. It is not really in production yet, but I feel like the malware authors are trying different ways they can use the LLM to devise different things, and that report is public on the Rubrik Zero Labs page, where we have shared case studies from live malware samples. So one of the things I feel is this: the malware authors are using LLMs, and on the defense side we are also using LLMs to analyze. I think the problem right now for the malware authors is the call to the LLM, because an LLM is not something, at least as of now, that you can easily host on your own. You have to rely on some third party, so you have to make a call. That is reasonably good for defenders, because then they know that some sort of call is being made to a public provider, and they can watch for that and block those things. It also restricts the attackers from using it at large scale, because as they use it at large scale, people will get to know, and then the providers will put guardrails in place.

But at the same time, I do feel that at some point, as the technology matures and the computation required comes down, especially for nation-state threat actors who have the horsepower to host their own LLM models without guardrails and all these things, it will, to some degree, complicate the life of defenders if they start using LLMs in their code. The reason, I feel, is because it's plain English. It's not really code. The guy is just writing a prompt saying, hey, give me this code. Just to give an example, one of the malware samples we identified in our report is actually running on the machine, collecting all the information about the machine and then sending it to the LLM, saying, hey, I'm running in this environment, now give me the code to bypass all these things. That code is being dynamically generated; it's coming from the LLM. It's not part of the actual binary, so the footprint the attacker has to use in this case is very, very small. You just put a very small binary on the machine that has plain English inside it, and then it will connect to the LLM and do the rest of the job. So it will definitely add some complications, but I do feel that then we will have the same technology, so we will be able to operate at the same machine speed and counter these things too.
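
On the defender-advantage point Amit makes, here is a minimal sketch of watching DNS logs for server workloads resolving well-known hosted-LLM API domains; the domain list and the log format are illustrative assumptions, not a complete inventory.

```python
# Known hosted-LLM API endpoints -- an illustrative, non-exhaustive list.
LLM_API_DOMAINS = {
    "api.openai.com",
    "api.anthropic.com",
    "generativelanguage.googleapis.com",
}


def flag_llm_callouts(dns_log_lines, server_hosts):
    """Yield (host, domain) pairs where a server workload that has no business
    calling an LLM API resolved one of the known endpoints.

    Assumes each log line is 'client_host queried_domain' -- adapt to your format.
    """
    for line in dns_log_lines:
        parts = line.split()
        if len(parts) < 2:
            continue
        host, domain = parts[0], parts[1]
        if host in server_hosts and domain.lower() in LLM_API_DOMAINS:
            yield host, domain


# Example usage with hypothetical data:
if __name__ == "__main__":
    logs = ["db-prod-03 api.openai.com", "laptop-17 api.openai.com"]
    for host, domain in flag_llm_callouts(logs, server_hosts={"db-prod-03"}):
        print(f"unexpected LLM API call: {host} -> {domain}")
```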

>> Caleb Tolin: Right. I certainly hope so, because, you know, recently we had a conversation with Hayden Smith, CEO of Hunted Labs, on the podcast, too, where we were talking about how attackers are even more so now using open source libraries to insert malware into different technologies and people's environments, something that a lot of enterprises aren't even really aware of at this point. So it's kind of interesting to build on that a little bit more, but Amit, thank you so much for joining us and telling us about this innovative method that you and the team have uncovered, using LLMs to find and analyze malware within code. For those who are listening, you can learn more about this in the show notes. We'll drop the link to the report in there, also on zerolabs.rubrik.com. I know they have a bunch of other resources that Amit and the team are pulling together all the time, so they're really doing some cool stuff over there. Amit, is there anything else you want to leave our audience with as we wrap up here?

>> Amit Malik: I would say at Zero Labs, we are doing pretty good stuff, and we are making an effort to share our findings with the security community. So I hope those contributions make life easier somewhere.

>> Caleb Tolin: Wonderful. Thank you so much again and until next time.

>> Amit Malik: Thank you, Caleb.

[ Music ]

>> Caleb Tolin: That's a wrap on today's episode of Data Security Decoded. If you like what you heard today, please subscribe wherever you listen and leave us a review on either Apple Podcasts or Spotify. Your feedback really helps me understand what you want to hear more about, and if you want to email me directly about the show, you can send me an email at data-security-decoded@n2k.com. Thank you to Rubrik for sponsoring this podcast. Thanks to the team at N2K, which includes Senior Producer Alice Carruth and Executive Producer Jennifer Eiben, content strategy by Ma'ayan Plaut, sound design by Elliot Peltzman, audio mixing by Elliot Peltzman and Tré Hester, video production support by Brigitte Criqui Wild and Sarelle Joppy. Until next time, stay resilient.

[ Music ]