Document analysis and Malware Sandboxes? – Part 2
Post originally published on my old blog. It was automatically translated and as such may be poorly translated.
Continuing our analysis of malicious documents, we now turn to the step known as static analysis. Before entering this phase, however, we need to understand which format the documents use.
In our case, we are dealing with Microsoft Office documents, which follow the Compound Document Format (mentioned in the previous post) and support the document architecture known as Object Linking and Embedding (OLE).
The main point of this alphabet soup is that Office documents can, through OLE, link or embed other documents or objects within themselves. A spreadsheet inside a Word document is one example. It is in this structure that malicious code is inserted, either as simple embedded executables or as shellcode.
This code abuses known flaws in old versions of Microsoft Office (which only adopted a new file format in 2007) in order to execute. Once running, it uses several other techniques to create or read files, capture user input, modify other documents or even, in rarer cases, reach kernel space (where it can run with higher privileges).
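As a first triage step, the two container formats mentioned above can be told apart by their leading bytes. This is a minimal sketch, assuming the usual signatures (the OLE2 compound file magic and the ZIP local file header used by Office 2007 and later); the function name is hypothetical and a matching signature says nothing yet about whether the file is malicious.

```python
# Signature of the OLE2 / Compound Document container used by legacy .doc/.xls/.ppt.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# ZIP local file header; Office 2007+ files are ZIP archives.
ZIP_MAGIC = b"PK\x03\x04"


def classify_office_container(data):
    """Return a coarse container type based only on the first bytes."""
    if data.startswith(OLE2_MAGIC):
        return "ole2"
    if data.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"
```

A file classified as "ole2" is the kind of document discussed in the rest of this post.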
Now the question that remains is: how do I discover in advance that a document is malicious? Unfortunately the answer is not simple and depends on factors that are not necessarily under our control, as we saw in the first part of this post. Discussing them here would be quite time consuming (and make the reading tedious for some). But there is something we can do once we have a suspicious file: a static analysis of it.
The analysis of this type of document has received less attention than it deserves, perhaps because it is not trivial and does not fit simple hash-based analysis. In addition, PDF documents are distributed more frequently and more easily. In this analysis we will need to perform the following steps:
1. Check the file structure and look for malicious snippets.
2. Identify what type of threat is inserted in the file.
3. Extract the malicious instructions (or part of them).
4. Look for suspicious symbols.
5. Check whether they use exploitation techniques.
6. Rebuild the executable and analyze it.
To accomplish step 1 we have a few options:
a) Use a sandbox (the method used in the first part, which we will ignore here);
b) Analyze the file structure and search its fields for unexpected data;
c) Automatically analyze the file and score it according to the data found in it.
Options b) and c) are actually complementary; for learning purposes, however, they are presented separately.
I will start by presenting OffVis, a tool developed by Microsoft for its developers that will help us with option b). When loading a file, it can separate the fields according to the CDF standard, show the associated data and check for inconsistencies.
It can also search for known exploits, based on CVEs, and identify the malicious bytes inserted into the document. As an example, I will use the file that was uploaded to the cloud services in the first part of the post. Below is the result found by the tool:
We are immediately shown this set of bytes which, according to the tool, is a strong candidate for the exploit reported as CVE-2006-6456.
Once again we have results that give strong indications but no certainty. The tool reported a few more sets of suspicious bytes, but was unsure about those as well.
For this reason, I will now introduce another tool: OfficeMalScanner. Developed and presented by Frank Boldewin at Hack.lu 2009, it has some very interesting features: scanning for malicious code (scan); decoding code through brute force (brute); debugging the code found, with disassembly and detection of strings and embedded PE files (debug); listing OLE structures, offsets and VB macros (info); and decompressing Office 2007 files to identify and categorize threats (inflate).
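To give an idea of what decoding through brute force means, here is a minimal sketch that assumes the simplest possible case: a payload hidden with a single-byte XOR key. The function name and the default marker are hypothetical choices for illustration; the real tool may try other encodings and look for other signatures.

```python
def xor_bruteforce(data, marker=b"This program"):
    """Try every single-byte XOR key and report (key, offset) where marker appears.

    Assumption: the payload is encoded with one repeating byte.
    """
    hits = []
    for key in range(1, 256):  # key 0 would leave the data unchanged
        decoded = bytes(b ^ key for b in data)
        pos = decoded.find(marker)
        if pos != -1:
            hits.append((key, pos))
    return hits
```

Each hit is only a lead: the decoded region still has to be inspected to confirm that it really contains an executable.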
We will focus here on two options: scan and debug. First the scan:
Right from the start, several function prologue signatures are found at specific offsets:
What is this “function prologue”? Roughly speaking, the prologue is a programming convention that determines that certain instructions must be executed in a certain order to prepare the stack for use by the function that is starting. This preparation can (and should) preserve the “context” of the calling function, in case it is necessary to return to it. Later on I will show some examples.
The key question here is: should there be functions in this document at all? The next piece of evidence is very interesting:
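A prologue signature scan can be approximated with a plain byte search. The sketch below assumes the classic 32-bit x86 prologue, push ebp followed by mov ebp, esp, in its two common encodings; the function name is hypothetical and this is not necessarily the signature list the tool uses.

```python
# push ebp ; mov ebp, esp  (two equivalent encodings of the mov)
PROLOGUES = (b"\x55\x8B\xEC", b"\x55\x89\xE5")


def find_prologues(data):
    """Return the sorted offsets where a known prologue byte sequence starts."""
    offsets = []
    for sig in PROLOGUES:
        pos = data.find(sig)
        while pos != -1:
            offsets.append(pos)
            pos = data.find(sig, pos + 1)
    return sorted(offsets)
```

Three bytes can match by coincidence in any large file, so a hit is a hint that deserves a closer look and not proof of code.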
The tool detected several signatures of a technique that we simply call JMP/CALL/POP. This is one of the best known and most used shellcoding techniques for obtaining the effective address at which a piece of code is executing. I encourage you to look for a more extensive definition of this technique; it is worth it.
This time, we have confirmation that a technique for discovering a memory address (and later manipulating it) is being used extensively.
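The pattern just described can be searched for structurally. The sketch below assumes the classic x86 layout: a short JMP (opcode EB) forward to a CALL (opcode E8) that jumps backwards to a POP of a general register (opcodes 58 to 5F), which then holds the address following the CALL. The function name is hypothetical and real shellcode has many variants that this simple check will miss.

```python
import struct


def find_jmp_call_pop(data):
    """Return offsets of short JMPs that lead to a backward CALL landing on a POP reg."""
    hits = []
    for i in range(len(data) - 1):
        if data[i] != 0xEB:  # jmp short rel8
            continue
        rel8 = struct.unpack_from("<b", data, i + 1)[0]
        call_at = i + 2 + rel8
        if call_at < 0 or call_at + 5 > len(data) or data[call_at] != 0xE8:
            continue
        rel32 = struct.unpack_from("<i", data, call_at + 1)[0]
        pop_at = call_at + 5 + rel32
        # The CALL must go backwards and land on pop eax..edi.
        if 0 <= pop_at < call_at and 0x58 <= data[pop_at] <= 0x5F:
            hits.append(i)
    return hits
```

Because the check follows both branch targets, it is stricter than a search for isolated opcodes and produces fewer accidental matches.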
Continuing, the tool reports signatures inside the document that indicate possible MZ/PE executables, the formats of MS-DOS and Windows executables respectively:
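A check for embedded MZ/PE executables can be sketched as follows, assuming the standard layout: an "MZ" header whose 32-bit field at offset 0x3C (e_lfanew) points to the "PE\0\0" signature. The function name is hypothetical, and payloads that are encoded or have a damaged header will not be found this way.

```python
import struct


def find_embedded_pe(data):
    """Return offsets of 'MZ' headers whose e_lfanew field points at a PE signature."""
    hits = []
    pos = data.find(b"MZ")
    while pos != -1:
        if pos + 0x40 <= len(data):
            # e_lfanew: offset of the PE header, relative to the MZ header.
            e_lfanew = struct.unpack_from("<I", data, pos + 0x3C)[0]
            pe_at = pos + e_lfanew
            if pe_at + 4 <= len(data) and data[pe_at:pe_at + 4] == b"PE\x00\x00":
                hits.append(pos)
        pos = data.find(b"MZ", pos + 1)
    return hits
```

Each returned offset is a candidate starting point for carving the executable out of the document into a separate file.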
It is interesting to note that at this point the tool extracts these “executables” from a relative address and writes each one to a separate binary. At the end it adds up all the evidence found, giving a different weight to each kind, and shows us an index:
This index is very high! Frank himself showed files with a score of 36 points in his presentation. This file is, according to the tool, “very malicious”.
Now let's move on to debug mode, which analyzes the binary, presents it in hexadecimal format and tries to extract valuable information from it:
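The idea of a weighted index can be illustrated with a few lines. The weights and category names below are hypothetical, chosen only for illustration; they are not the values the tool actually uses.

```python
# Hypothetical weights: illustration only, not the tool's real values.
WEIGHTS = {
    "function_prologue": 1,
    "jmp_call_pop": 2,
    "embedded_pe": 20,
}


def malicious_index(findings):
    """Sum the findings, multiplying each count by the weight of its category."""
    return sum(WEIGHTS.get(kind, 0) * count for kind, count in findings.items())
```

The design point is that strong evidence, such as an embedded executable, should count for much more than signatures that can appear by coincidence.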
We are immediately presented with a character string that identifies an “old” function which, according to Microsoft, exists only for backwards compatibility with old versions of Windows and whose purpose is to run binaries: WinExec. Very convenient, isn't it?
Next is a call to the CreateFile function, which creates or opens files and input/output devices (I/O, for example hard drives):
The following function (CloseHandle) closes an open object handle:
Another interesting function: WriteFile, which writes to the files or I/O devices opened with CreateFile.
ReadFile: reads from files or I/O devices.
SetFilePointer: moves the file pointer of an open file.
VirtualAlloc: reserves or commits pages in the virtual address space of the calling process.
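The search for these function names can be reproduced with a simple byte scan. This is a minimal sketch, assuming the names are stored as plain ASCII inside the extracted binary; the function name and the list of names (taken from the APIs discussed above) are choices made for illustration.

```python
SUSPICIOUS_APIS = (
    b"WinExec", b"CreateFile", b"CloseHandle", b"WriteFile",
    b"ReadFile", b"SetFilePointer", b"VirtualAlloc",
)


def find_api_strings(data):
    """Map each API name present in data to the offset of its first occurrence."""
    found = {}
    for name in SUSPICIOUS_APIS:
        pos = data.find(name)
        if pos != -1:
            found[name.decode("ascii")] = pos
    return found
```

None of these functions is malicious on its own. What raises suspicion is finding the combination of running binaries, writing files and allocating memory inside something that should be a text document.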
And now, as promised, the binary after being disassembled by the tool, with its mnemonics identified and the prologues marked! The following is the assembly code that characterizes the prologue of a function:
Without going too far into the code above, I would like to draw attention to the first 6 lines, which represent the moment when the base addresses are saved and a new stack frame is created. Exactly “when” the prologue takes place is not relevant at this point in the analysis, and it varies widely with what we know as the calling convention. For the sake of curiosity, I recommend checking how this differs between Windows, Linux and OS X.
And finally the raw data that, after being “mapped” by OfficeMalScanner, shows the string “This program must be run under Win32”, indicating an executable meant to run on a Windows operating system that supports binaries built for the 32-bit architecture (which encompasses a wide range of machines):
I think we have enough evidence that the document is not to be trusted! In this post, we walked through steps 1 to 5. In a way, we also performed step 6, but in an automated way. In the next post we will go deeper into the binary in search of more clues and other techniques used by those who wrote this artifact.
Until then. Unfortunately, the original document was lost before I could finish the complete analysis of this malicious doc.
Bibliographic references: Check the original post linked before