Robust Signatures for Kernel Data Structures
First, I should warn my non-CS readers: I had originally decided that this paper was too technical to blog about. But then I received a request that I actually discuss this specific paper, and who am I to say 'no' to a beloved reader? So be aware that this one is going to be a bit more esoteric than even usual, but I also will go somewhere with it at the end.
With that out of the way, what is this paper all about? Let me provide a little context. Suppose that some bit of malware has made it on to your system. One of the primary things that malware would like to do is hide itself as best it can. The better it hides itself, the more likely it will escape detection and the longer it can sit there sniffing your passwords and sending out spam for its evil overlords. But what does that mean, for a program to 'hide itself'?
To explain that, I first need to explain that your computer is doing a lot of things at once-- it's running the operating system (which is always running), it's running the web browser or RSS reader you're using to read this right now, it's running the music-player you probably have going, it's running any other programs you happen to have open right now to do your actual job (remember that?), and so on. Roughly speaking, each running program corresponds to a process, which is CS-speak for 'running program.'* Many programs are written to take advantage of multi-processor and multi-core systems, and so a given process can have started multiple threads of computation.** Threads are the things that are actually executed by the chips/cores, but they are really low-level-- much lower than the concept of 'program.' For that, you need to pop back up and talk about 'processes' again.
Okay, so your computer has all of these processes running-- lots and lots, all the time. (Right now, as I type this, I have 145 of them. Most of them are parts of the operating system.) To keep track of them all, the operating system creates an internal data structure for each of them and keeps those structures in memory. (For non-CS people, imagine that by 'data structure' I mean a standard form the operating system fills out for each process: putting in the name, the user who started it, when it was started, etc.) And by 'operating system' here, I really mean a very core piece called the kernel. Thus, these things are 'kernel data structures'.
(Congratulations on making it this far. We're about halfway there.)
To keep track of all of these data structures, the kernel keeps them in a special place in memory. Malware designers know this. So to keep their malware from being detected, they have the malware go into that special location, find the structure that represents the malware, and move it to a random location in memory.*** The malware can't really delete the structure without deleting itself in the process (the structure in question part of the malware's life-support system, in essence) but it *can* make the structure hard to find by throwing it somewhere in a very, very, very large memory-space. (But how does the malware keep running if we hide the structure? Remember that these structures represent processes, and it is threads that actually get executed. The malware won't mess with the threads.)
So what is the end result for us honest users? If we want to check for malware running on our system, we can't rely on any of the standard built-in tools of the operating system. They weren't built for this, and simply look through all the structures in the 'special place.' (That phrase is becoming more and more ridiculous every time I type it.) So to find malware, we need to scan the entire memory space of the computer and look for these kernel data structures.
Can we do this? Sure, but first we need to answer an important question: how will we know it if we find one? Put another way, what could these structures look like? The malware writers know that we will come looking for these structures, and so they take pains to camouflage them best they can. They will change everything about they can change without hurting the ability of the malware to run. So what does that leave? Or put another way, how much can we mess with these structures without crashing either the malware or the whole computer?
At last, we have reached the core question of this paper. The authors devise an experimental apparatus for answering that exact question. Specially, the apparatus runs an instance of a virtual machine (sort of a computer within a computer). It starts a process on the virtual machine, freezes the virtual machine, randomly messes with the relevant data structure, lets the virtual machine resume, and watches to see if either the process or the virtual machine dies. By doing this repeatedly, they can begin to build up experimentally-derived rules (called 'signatures') for what about these structures cannot be changed. These signatures can then be built into memory scanners, which use them to scan the memory-space for such hidden structures and find the malware. And these signatures will be robust in the sense that the malware writers can't evade them without hurting their own malware. Pretty slick.
But also completely ridiculous. Why on Earth is it necessary to do this? Computers are human-crafted machines, and their behavior is completely determined by the code they are running. Can't we just look at the code and see what will and will not crash the system?
Given that this was a highly-acclaimed paper at CCS 2009, the answer is obviously 'no.' Modern computers are just too complex for any one human to really understand, and so no, we can't just read the code. We're not smart enough to read and understand that much code well enough to predict what will and will not make it crash. We need to determine that experimentally.
...Which is the first real evidence I've seen that Computer Science deserves that second word. That is, this is the first paper I've read that made me consider the possibility that Computer Science really is a science. Up until now, I had been conflating two seperate definitions of 'science':
- An effort to measure some 'natural' phenomenon, and
- Determining things experimentally.
In my mind, these two things implied each other. The only way to measure a 'natural' phenomenon is experimentally, right? And if it's not a 'natural' phenomenon, then it's human-made. And if it's human made, wouldn't it be easier to understand it through non-experimental methods?
Obviously not. My mistake was thinking that any human-generated system would be comprehensible to humans. And when I put it that way, I feel stupid for having believed it (even subconsciously). We humans make all kinds of things with simple descriptions but complex and unpredictable behaviors. Societies. Pinball games. And now, operating systems. Given how complex computers really are, the only way to characterize their behavior seems to be through experiment. I guess we are a science.
And if that's the case, we might want to collectively think about taking a class in experimental methods. Right now, unfortunately, we kind of suck. (But that's another post entirely.)
----
* Technically, the word 'process' has a specific technical meaning (which is not worth giving here) and a given program can correspond to more than one process. But every running program must have spawned at least one process.
** As in the previous footnote, a process must have started at least one thread.
*** These are lies, but reasonable ones. For CS people who know the lingo, the kernel does not really keep these structures in a 'special place' but holds them in a dynamic doubly-linked list. So they actually start out scattered all over in memory. The malware simply splices the relevant list-element out of the list, leaving it at its original location but making the kernel 'forget' where it is.

As far as I'm concerned half
As far as I'm concerned half the point of reading this blog is finding out about stuff I don't know, and I found this perfectly easy to follow.
Let me clarify
When I said that this paper was 'too technical,' I did not mean to imply that it would be beyond my audience. What I meant, instead, was that it required more explanation than I usually like to include. Roughly speaking, I divide these posts into three parts*:
Ideally, I'd like my posts to roughly balance these three things. This paper, on the other hand, heavily shifted the 'weight' of the post to the 'background' part. Not a show-stopper, but probably enough for me to choose to blog a different paper instead.
I do take requests, however, and so you got to learn about playing hide-and-seek with malware writers. (And thanks for letting me know this post was clear. Maybe I'll be a little less scared of background-heavy papers in the future.)
* Postae sunt omnis divisa in partes tres?
This is neat stuff, and you
This is neat stuff, and you write it very clearly.
I'll trust you if you say it's too technical, but how does the thread manager work without any kind of master list of all processes?
A totally separate list of threads
Let me prefix my answer with the disclaimer that this is waaaay out of my specialty, and so I am likely to be wrong. But my understanding is that the thread manager (also called the scheduler) maintains its own list of threads. Furthermore, the scheduler's thread-list and the kernel's process-list are totally separate (I'm told). This raises the question: can you find malware by inspecting the scheduler's list of threads? I asked this myself, and the answer seems to be 'no'. The threads are apparently too low-level to reconstruct the process that spawned them. This seems a little strange to me, frankly, but it's the answer I got.
Thanks for your excellent
Thanks for your excellent review of the paper! I'm going to point people here when they ask for a concise (and not overly technical) description of the paper, as it's much more cogent than I can manage :)
To directly answer your your question about the relationship between process and threads (bearing in mind that what I'm saying is specific to Windows, as I know the internals better): it is possible to get back to a process, given one of its threads -- there's a pointer to the thread's process (called, creatively enough, ThreadsProcess) in the thread structure itself. I do not know if this field is essential, however, as I haven't done the same testing on threads as I have with processes.
However, the approach of just traversing the scheduler's list of threads has some pitfalls of its own, even if it is possible to find processes from their threads. In particular, it's been demonstrated that one can, without too much work, change the location of the scheduler's thread list (or specifically, its list head), and update the kernel code that refers to that variable to the new location. Security tools which look in the old location, then, will get an inaccurate view of what's going on (and if you try not to use a hardcoded value for the list head in your security software, you end up playing cat and mouse with an attacker, as you try to analyze the modified kernel code to discover the new location, and the attacker tries to make such analysis difficult--not fun!). This technique is described in a non-academic article, "Bypassing Klister 0.4 With No Hooks, or, Running a Controlled Thread Scheduler" http://ht-group.net/33/
Because of these complications, scanning for data structures is still a fruitful approach. Note, though, that it is possible in theory that someone could attempt to change how a given data structure is handled by the kernel, by updating its code on the fly. This seems intuitively quite difficult to me, though, and goes well beyond changing a pointer to a list head (and I have yet to see someone do it in a real-world setting, though this paper manages to pull off something similar after some fairly heavyweight analysis, in a controlled environment).
Cheers,
Brendan