Here's an idea/question/postulation to consider...
Is it possible to reasonably classify malware using the opcodes a binary contains as the input attributes to a Bayesian classifier? Put another way for the less technical readership, is it possible to determine whether a given program (binary) is good or bad based on its internal mechanics, when we compare them to the internal mechanics of other programs known to be good or bad?
The idea is simple enough; take a given binary and assess the distribution of opcodes it contains then classify it using just the distribution of those opcodes.
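To make the idea concrete, here is a minimal sketch of a multinomial naive Bayes classifier over opcode counts. It assumes the opcodes have already been extracted from each binary as a list of mnemonic strings (the disassembly and unpacking steps are out of scope here), and the names `train` and `classify`, the labels and the toy samples are all invented for illustration. It is a sketch under those assumptions, not a tested detection engine.

```python
import math
from collections import Counter

def train(samples):
    """Build per-class opcode counts from (label, opcode_list) pairs."""
    counts = {}        # label -> Counter of opcode occurrences
    docs = Counter()   # label -> number of binaries seen with that label
    vocab = set()      # every opcode seen during training
    for label, opcodes in samples:
        counts.setdefault(label, Counter()).update(opcodes)
        docs[label] += 1
        vocab.update(opcodes)
    return {"counts": counts, "docs": docs, "vocab": vocab}

def classify(model, opcodes):
    """Return (best_label, log_scores) for a list of opcodes."""
    total_docs = sum(model["docs"].values())
    vocab_size = len(model["vocab"])
    scores = {}
    for label, ctr in model["counts"].items():
        total = sum(ctr.values())
        # Log prior: fraction of training binaries carrying this label.
        score = math.log(model["docs"][label] / total_docs)
        for op in opcodes:
            if op not in model["vocab"]:
                continue  # ignore opcodes never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log((ctr[op] + 1) / (total + vocab_size))
        scores[label] = score
    return max(scores, key=scores.get), scores

# Toy, made-up training data: real opcode profiles would be far larger.
toy_samples = [
    ("good", ["mov", "add", "mov", "ret"]),
    ("good", ["mov", "push", "call", "ret"]),
    ("bad", ["xor", "xor", "jmp", "xor"]),
    ("bad", ["xor", "jmp", "int", "xor"]),
]
model = train(toy_samples)
label, _ = classify(model, ["xor", "jmp", "xor"])
print(label)
```

Working in log space keeps the product of many small probabilities from underflowing, which matters once a binary contributes thousands of opcodes rather than three.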
The basis for this kind of thing has been used successfully for email spam/ham classification for quite some time now. Malware detection technologies and Bayesian classifiers happen to be things I know a thing or two about through technology companies I have founded with friends over the years.
A quick bit of Google research shows that there are researchers out there examining statistical approaches to malware detection, but I'm surprised to find there is (currently) little to discover on the topic of Bayesian techniques for malware detection using opcodes as input attributes. This is especially surprising to me given that the individuals who circulate in the anti-spam world are generally very close to those in the anti-malware world.
Either way, if such research/program/technique does exist (because maybe I've just not looked hard enough) then I'd have the following questions and/or reckon the following pitfalls:-
- How robust are the unpackers to be used? Malware scanners these days tend to be more about their ability to unpack binaries in order to get down to the content hidden inside. This task is far more formidable than it sounds because malware writers will try every trick in the book to slip by. I'd worry that without a great unpacker it might be difficult to get an accurate profile of the opcodes inside. That said, it might turn out that the lack of observable opcodes is itself the signal.
- What is the false positive rate? In general, binary classification is much less tolerant of false positives than email classification. While a misclassified spam/ham email can and does cause business grief, a misclassified binary can stop an entire business at the whim of a simple AV database update.
- ... and the real clincher: is the address space of x86 opcodes large enough to create different distributions between goodware and badware for any of this to work? The x86 architecture has an opcode address space of roughly 1,130 different codes, which might sound like a big enough number. Compare it, though, to email spam/ham classification, where the address space is typically based on Unicode characters (UTF-8 encoded) and is ~2 orders of magnitude larger (~107,000 characters), and a problem seems to emerge.
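The false positive question in the list above can be pinned down with a small calculation. The sketch below assumes a labelled test set in which "bad" is the positive (malware) class; the function name `false_positive_rate` and the sample labels are made up for illustration and are not measurements of any real classifier.

```python
def false_positive_rate(labels, predictions, positive="bad"):
    """Fraction of truly good binaries that were flagged as bad."""
    pairs = list(zip(labels, predictions))
    # False positive: actually good, but predicted bad.
    fp = sum(1 for actual, pred in pairs if actual != positive and pred == positive)
    # True negative: actually good, and predicted good.
    tn = sum(1 for actual, pred in pairs if actual != positive and pred != positive)
    if fp + tn == 0:
        return 0.0  # no good binaries in the test set
    return fp / (fp + tn)

# Invented example: four good binaries, one of them wrongly flagged.
actual = ["good", "good", "good", "good", "bad", "bad"]
predicted = ["good", "bad", "good", "good", "bad", "good"]
print(false_positive_rate(actual, predicted))
```

For a deployed scanner the denominator is every legitimate binary on every customer machine, so even a rate that looks tiny on a test set can translate into a lot of broken software.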
All that aside, it's still a curious enough area to warrant further research. There are certainly retorts to each of the points above: not much of the Unicode address space generally gets used in any particular language; the Bayes filter does not need to be the final line of defense; and there are reasonable unpacker tool-sets about (although they are often encumbered with license restrictions).
N
