
Efficient Malware Classification Techniques
Arno Pol
September 18, 2014

Outline
Introduction
Guilt by association
Malware dataset
Decision tree
Mining substrings
About malware
Malware is any software used to disrupt computer operation, gather sensitive information, or gain access to private computer systems.
Persistence is common
Malware can exist in multiple locations, for instance in RAM, on disk, and even in CMOS
Most malware is undetected or easily hidden
Encryption and packers are used to transform malware
About malware detection
Millions of malicious binaries
Thousands of new binaries each year
Massive cost to operations
Corporate espionage, online theft, spam
Contrary to popular belief, bugs aren't rare
Detection has to beat packers and encryption
The usual approach of hashing is not sufficient
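A minimal sketch of why exact hashing alone falls short. The byte strings are made-up stand-ins for binaries, not real samples: changing a single byte, which a packer or encryptor does trivially, yields an unrelated SHA-256 digest, so a lookup in a hash blacklist misses the variant.

```python
import hashlib

# Hypothetical stand-in for a known malicious binary (not a real sample).
original = b"\x55\x8b\xec\x83\xec\x10" + b"payload"
# The same program with one byte changed, as repacking might cause.
variant = original[:-1] + b"X"

digest_original = hashlib.sha256(original).hexdigest()
digest_variant = hashlib.sha256(variant).hexdigest()

# A signature database keyed on exact digests only knows the original.
known_bad = {digest_original}

print("original flagged:", digest_original in known_bad)
print("variant flagged: ", digest_variant in known_bad)
```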
Malware detection techniques
Signature-based techniques
Behavior-based techniques
Dynamic analysis techniques
Data-mining methods can apply to all of the above
A strong preference for a low false-positive rate is necessary
Deletion of important data and system files can be devastating
Signature-based techniques
Binary analysis
Static (source code) analysis
Entry point, Strings, IAT, section table
I/O Analysis (Includes network API)
Behavior-based techniques
System log analysis
Network traffic analysis
Power consumption analysis
System call monitoring
Filesystem I/O monitoring
Dynamic analysis techniques
Taint analysis
Memory analysis
Guilt by association
Mining file-relation graphs
A man is known by the company he keeps
A file is known by the files that appear with it on the machine
Symantec's Polonium
Polonium
Problem In Chair Not In Computer
Users who do not follow good security practice often download lots of malware
Create a bipartite graph between machines and files, with each edge representing a file that exists on a machine
This gives information about a file's badness
AESOP
Unlike Polonium, AESOP captures file-to-file relations
File identifiers are SHA-256 hashes
Polonium's machine identification, based on serial numbers, is imperfect
AESOP uses Norton Community Watch data
Co-occurrence strength
Jaccard similarity
Co-occurrence strength between the sets M_{f_i} and M_{f_j}
Filter out any files with a tiny presence, below a threshold
Expensive to calculate
Locality-sensitive hashing (for instance, Hamming distance)
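A minimal sketch of the co-occurrence measure, assuming M_f denotes the set of machines on which file f appears. The Jaccard similarity of two sets A and B is |A ∩ B| / |A ∪ B|. The helper names, the toy machine ids, and the threshold value are illustrative only. Computing this for every pair of files is quadratic in the number of files, which is why it is expensive.

```python
def jaccard(machines_a: set, machines_b: set) -> float:
    """Jaccard similarity |A & B| / |A | B| of two machine sets."""
    union = machines_a | machines_b
    if not union:
        return 0.0
    return len(machines_a & machines_b) / len(union)


def filter_rare(file_to_machines: dict, min_machines: int) -> dict:
    """Drop files seen on fewer than min_machines machines (hypothetical threshold)."""
    return {f: m for f, m in file_to_machines.items() if len(m) >= min_machines}


# Toy data: file id -> set of machine ids (made up for illustration).
file_to_machines = {
    "f1": {1, 2, 3, 4},
    "f2": {3, 4, 5},
    "f3": {9},
}

kept = filter_rare(file_to_machines, min_machines=2)
strength = jaccard(kept["f1"], kept["f2"])
print(sorted(kept), strength)
```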
Locality-Sensitive Hashing (LSH)
Randomly reorder machines into set M
Generate a set of MinHash values, and separate them into bands
Apply a random permutation function during hashing
Order into buckets depending on similarity
If a file appears in a bucket, there is a high chance it co-occurs with at least one file in that bucket
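The steps above can be sketched as MinHash with banding. This is an illustration under stated assumptions, not the AESOP implementation: machine ids are assumed to be non-negative integers, each machine set is assumed non-empty, and random linear hash functions modulo a prime stand in for true random permutations. All function names are hypothetical.

```python
import random
from collections import defaultdict

PRIME = 2_147_483_647  # 2**31 - 1; assumed larger than any machine id


def make_hash_functions(count, seed=0):
    """Random (a, b) pairs for h(x) = (a*x + b) mod PRIME, approximating permutations."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(count)]


def minhash_signature(machine_ids, hash_functions):
    """One MinHash value per hash function: the smallest hashed machine id."""
    return [min((a * m + b) % PRIME for m in machine_ids) for a, b in hash_functions]


def lsh_buckets(file_to_machines, bands, rows, seed=0):
    """Split each signature into `bands` bands of `rows` values; equal bands share a bucket."""
    hash_functions = make_hash_functions(bands * rows, seed)
    buckets = defaultdict(set)
    for file_id, machines in file_to_machines.items():
        signature = minhash_signature(machines, hash_functions)
        for band in range(bands):
            key = (band, tuple(signature[band * rows:(band + 1) * rows]))
            buckets[key].add(file_id)
    return buckets


# Toy data (made up): "a" and "b" appear on exactly the same machines.
files = {"a": {1, 2, 3}, "b": {1, 2, 3}, "c": {40, 50, 60}}
buckets = lsh_buckets(files, bands=4, rows=2)
shared = [key for key, members in buckets.items() if {"a", "b"} <= members]
print(len(shared), "buckets contain both a and b")
```

Files with identical machine sets always land in the same bucket in every band; files with lower similarity share buckets less often.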
Locality-Sensitive Hashing continued
Effect of the number of bands b, and the number of MinHash values in each band r, on co-occurrence
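The effect of b and r follows the standard banding analysis for MinHash: a single MinHash value agrees for two files with probability equal to their Jaccard similarity s, so the two files share at least one bucket with probability 1 - (1 - s^r)^b. A short sketch (the function name and the sample values of b and r are illustrative, not taken from the slides):

```python
def candidate_probability(similarity: float, bands: int, rows: int) -> float:
    """Probability that two files with the given Jaccard similarity share a bucket."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


# More rows per band sharpens the cut-off; more bands shifts it toward lower similarity.
for s in (0.2, 0.5, 0.8):
    print(s, round(candidate_probability(s, bands=20, rows=5), 4))
```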
Putting it all together
Create a bipartite graph, with edges from files to buckets.
If two files are often connected through the buckets, they are more likely to be strongly co-occurring
Apply belief propagation to the Markov random field we just created.
Buckets are labeled 0.5, bad files 0.01, and good files 0.99
We now have an idea about which files are good and bad
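A minimal sketch of sum-product belief propagation on such a file-bucket graph. The priors (0.5 for buckets and unknown files, 0.01 for bad files, 0.99 for good files) are interpreted here as the prior probability of being good. The edge potential strength epsilon, the iteration count, and all names are assumptions made for illustration. The sketch multiplies raw probabilities, so very high-degree nodes would need log-space arithmetic.

```python
from collections import defaultdict

GOOD, BAD = 0, 1


def belief_propagation(edges, priors, epsilon=0.1, iterations=10):
    """edges: (file, bucket) pairs; priors: node -> P(good). Returns node -> P(good)."""
    neighbors = defaultdict(set)
    for file_id, bucket_id in edges:
        neighbors[file_id].add(bucket_id)
        neighbors[bucket_id].add(file_id)

    def phi(node):
        p_good = priors.get(node, 0.5)  # buckets and unlabeled files start at 0.5
        return p_good, 1.0 - p_good

    # Edge potential: connected nodes tend to share a label (assumed strength).
    psi = ((0.5 + epsilon, 0.5 - epsilon),
           (0.5 - epsilon, 0.5 + epsilon))

    messages = {(i, j): (0.5, 0.5) for i in neighbors for j in neighbors[i]}

    def incoming(node, exclude=None):
        """Prior times all incoming messages, optionally skipping one neighbor."""
        good, bad = phi(node)
        for k in neighbors[node]:
            if k != exclude:
                good *= messages[(k, node)][GOOD]
                bad *= messages[(k, node)][BAD]
        return good, bad

    for _ in range(iterations):
        updated = {}
        for i, j in messages:
            in_good, in_bad = incoming(i, exclude=j)
            out = [in_good * psi[GOOD][x] + in_bad * psi[BAD][x] for x in (GOOD, BAD)]
            total = out[GOOD] + out[BAD]
            updated[(i, j)] = (out[GOOD] / total, out[BAD] / total)
        messages = updated  # synchronous update

    beliefs = {}
    for node in neighbors:
        good, bad = incoming(node)
        beliefs[node] = good / (good + bad)
    return beliefs


# Toy graph (made up): each unknown file shares a bucket with one labeled file.
edges = [("known_good", "b1"), ("unknown", "b1"),
         ("known_bad", "b2"), ("suspect", "b2")]
priors = {"known_good": 0.99, "known_bad": 0.01}
beliefs = belief_propagation(edges, priors)
print({node: round(p, 3) for node, p in beliefs.items()})
```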
Results
Labeling 1.6 million unlabeled files
Results Continued
Labeling unknown malware
Data dump
2552 benign and 1202 malware samples
Heap commit default value 4096
Stack reserve median appears to be set by the MSVC compiler (40000h)
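Features such as heap commit and stack reserve live in the PE optional header. The sketch below shows one way to read them with the standard `struct` module, assuming a 32-bit (PE32) image. The function names are hypothetical, and the synthetic header exists only so the example runs without a real binary. The slides do not say which tool was actually used to extract the features.

```python
import struct


def read_pe32_fields(data: bytes) -> dict:
    """Read a few header fields from a PE32 image (PE32+ is not handled here)."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)      # e_lfanew
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    coff = pe_offset + 4                                      # COFF file header
    _machine, n_sections, timestamp = struct.unpack_from("<HHI", data, coff)
    opt = coff + 20                                           # optional header
    (magic,) = struct.unpack_from("<H", data, opt)
    if magic != 0x10B:
        raise ValueError("only PE32 is handled in this sketch")
    (entry_point,) = struct.unpack_from("<I", data, opt + 16)
    image_base, section_align, file_align = struct.unpack_from("<III", data, opt + 28)
    stack_reserve, _stack_commit, heap_reserve, heap_commit = struct.unpack_from(
        "<IIII", data, opt + 72)
    return {
        "sections": n_sections, "timestamp": timestamp, "entry_point": entry_point,
        "image_base": image_base, "section_alignment": section_align,
        "file_alignment": file_align, "stack_reserve": stack_reserve,
        "heap_reserve": heap_reserve, "heap_commit": heap_commit,
    }


def synthetic_pe32_header() -> bytes:
    """Build a minimal fake header for demonstration (not a loadable executable)."""
    buf = bytearray(0x40 + 4 + 20 + 96)
    buf[0:2] = b"MZ"
    struct.pack_into("<I", buf, 0x3C, 0x40)
    buf[0x40:0x44] = b"PE\0\0"
    struct.pack_into("<HHI", buf, 0x44, 0x14C, 3, 1300000000)
    opt = 0x58
    struct.pack_into("<H", buf, opt, 0x10B)
    struct.pack_into("<III", buf, opt + 28, 0x400000, 0x1000, 0x200)
    struct.pack_into("<IIII", buf, opt + 72, 0x40000, 0x1000, 0x100000, 4096)
    return bytes(buf)


fields = read_pe32_fields(synthetic_pe32_header())
print(hex(fields["stack_reserve"]), fields["heap_commit"])
```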
Data dump
Number of sections can be reduced in malware (packers?)
Malware tends to have more imports
The date stamp corresponds to the dataset's year (2011)
Data dump
Entropy very similar
Entry points tend to be closer (packers/encryption/optimisation?)
Weird section alignment
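The slides do not say how entropy was measured. A common choice, assumed here, is the byte-level Shannon entropy of a file or section, which ranges from 0 bits (a single repeated byte) to 8 bits (all byte values equally likely). A minimal sketch:

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Byte-level Shannon entropy in bits per byte (0.0 for empty input)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


print(shannon_entropy(b"\x00" * 64))        # constant data
print(shannon_entropy(bytes(range(256))))   # every byte value exactly once
```

Packed or encrypted sections usually score close to the upper bound, which is why entropy is a popular feature even though it did not separate the classes in this dataset.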
Data dump
Weird file alignment more common
Lower image versions
Sometimes odd heap reserves and smaller stack usage
Preliminary analysis
The dataset is from 2011, so the timestamp is of course a strong classifier, but a useless one
Modified tree
Much more sensible
Stack reserve and image base are strong classifiers
Modified tree
15% misclassification rate
Code substrings
Malware uses bugs, and sometimes encryption and packing
Identify pieces of code that do things the malware needs
and that pieces of malware have in common
Comparing them pairwise is O(n^2) on large datasets
Subdivide the code sections into segments
Hash those segments, and compare.
Use a Bloom filter to filter matches.
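A minimal sketch of the segment-and-filter idea. The segment length, step, filter size, and all names are assumptions. The Bloom filter acts as a cheap pre-filter that can give false positives but never false negatives, so every hit is confirmed against an exact set. In a real system the exact store would typically live on disk, which is what makes the pre-filter worthwhile.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter over byte strings, using salted SHA-256 as hash functions."""

    def __init__(self, n_bits: int = 1 << 16, n_hashes: int = 4):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item: bytes):
        for i in range(self.n_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def segments(code: bytes, length: int = 8, step: int = 4):
    """Yield fixed-length segments of a code section at regular offsets."""
    for start in range(0, len(code) - length + 1, step):
        yield code[start:start + length]


def shared_segments(code_a: bytes, code_b: bytes, length: int = 8, step: int = 4) -> set:
    """Segments of code_b that also occur in code_a."""
    seen = BloomFilter()
    exact = set()
    for seg in segments(code_a, length, step):
        seen.add(seg)
        exact.add(seg)
    # Bloom filter first (cheap), exact check second (rules out false positives).
    return {seg for seg in segments(code_b, length, step)
            if seg in seen and seg in exact}


# Toy "code sections" (made up): they share the middle eight bytes.
common = shared_segments(b"AAAABBBBCCCCDDDD", b"XXXXBBBBCCCCYYYY")
print(common)
```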
Algorithm outline
Building the classifier
Remove subsequences that are common in non-malware
Count the predictive value of subsequences on the training set
To classify, walk through a binary, comparing each n-length subsequence to the database
Using a max length of 10 and a step of 4 bytes, the running time is O(10N log(N))
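The steps above can be sketched as follows. This is an illustration, not the presented implementation: the "predictive value" is assumed to be the fraction of malware training samples containing a subsequence, any subsequence seen in a benign sample is dropped, and the database is a Python dict (average constant-time lookup) rather than the sorted structure the log(N) factor suggests. The sample byte strings are made up.

```python
from collections import Counter


def subsequences(data: bytes, max_len: int = 10, step: int = 4):
    """Yield subsequences of length 1..max_len at offsets 0, step, 2*step, ..."""
    for length in range(1, max_len + 1):
        for start in range(0, len(data) - length + 1, step):
            yield data[start:start + length]


def build_database(malware, benign, max_len: int = 10, step: int = 4) -> dict:
    """Map each malware-only subsequence to the fraction of malware samples containing it."""
    benign_seen = set()
    for sample in benign:
        benign_seen.update(subsequences(sample, max_len, step))
    counts = Counter()
    for sample in malware:
        # Count each subsequence at most once per sample.
        counts.update(set(subsequences(sample, max_len, step)) - benign_seen)
    return {seq: n / len(malware) for seq, n in counts.items()}


def score(data: bytes, database: dict, max_len: int = 10, step: int = 4) -> float:
    """Sum the weights of all database subsequences found in the binary."""
    return sum(database.get(seq, 0.0) for seq in subsequences(data, max_len, step))


# Toy training set (made up for illustration).
database = build_database(malware=[b"EVILCODE", b"EVILDATA"], benign=[b"GOODCODE"])
print(score(b"EVILXXXX", database), score(b"GOODCODE", database))
```

A threshold on the score then decides between malware and non-malware; moving that threshold trades false positives against detections.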
ROC curve
ROC Explanation
False positives increase past the 70% mark
This is undesirable and may damage the system
There are ways to circumvent this, such as hashing, as applied in modern virus scanners
Strong classifier of unknown malware
Even with a small dataset
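For reference, a minimal sketch of how the points of a ROC curve are obtained from classifier scores: each threshold gives one pair of false-positive rate and true-positive rate. The scores and labels below are toy values, not the results shown in the presentation, and the function assumes both classes are present.

```python
def roc_points(scores, labels):
    """Return (false-positive rate, true-positive rate) per threshold, highest first.

    labels: True for malware, False for benign.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / negatives, tp / positives))
    return points


# Toy scores (made up): two malware samples, two benign samples.
points = roc_points([0.9, 0.8, 0.3, 0.1], [True, True, False, False])
print(points)
```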
Precision-recall graph
Questions
Hopefully you enjoyed the presentation.
If there are any questions about the presentation, now is
the time to ask.
