
Efficient and Interpretable Real-Time Malware Detection Using Random-Forest


Alan Mills∗, Theo Spyridopoulos and Phil Legg
Department of Computer Science and Creative Technologies
Computer Science Research Centre
University of the West of England
Bristol, UK.

Abstract—

Index Terms—component, formatting, style, styling, insert

∗Contact E-mail: Alan2.Mills@live.uwe.ac.uk

I. INTRODUCTION

Currently, common malware analysis and detection uses a static database to compare known malicious signatures to suspected malicious programs. This approach requires that a malicious signature is 'known' to either that specified database or the wider security community in general, making it ineffective against new ('zero-day') threats and reliant on end users keeping their systems up to date. It can also be fooled by obfuscating the code, allowing known malicious code or malware to bypass detection. The current commercial alternatives to database-orientated malware detection are behavioural analysis and heuristic analysis. Behavioural analysis is a type of dynamic analysis which monitors a suspected program at execution. This requires a sandbox environment and necessitates that the suspected malware is allowed to run, possibly to completion, making it unsuitable for computationally limited systems or near real-time situations. In addition, malware vendors have created 'VM-aware' malware, which is able to perform 'floor checks' or 'ceiling checks' to detect whether it is running in a virtual environment; at this stage the malware will not carry out any malicious processes and will act benign, making any attempt at sandbox detection or analysis impossible. Heuristic analysis can be either static or dynamic, mitigating the possible time penalty associated with behavioural analysis, and does not require a sandbox (though it can be run inside one). It analyses the source code of suspected malware, following the code execution either before execution or during run time, and monitors for any malicious calls or access. However, this type of analysis can be defeated through code obfuscation, in much the same way as database-orientated systems can be.

The author of this paper believes that the answer to the current gap in cyber security is the use of machine learning to automate the process of malware analysis and detection. To that end, a lightweight machine-learning-based system was created that would detect malware based on what will be referred to as 'process signatures', which are created by a process during execution. During testing, the system, referred to as NODENS, was able to detect unique malware samples with 99% accuracy. The system was built using a Random-Forest classifier (the most accurate of the tested algorithms, with no possibility of over-fit) and designed under the following parameters:

• A near real-time detection system, providing early warning (not prevention)
• Able to detect previously unseen malware, to counter the threat presented by 'zero-day' malware
• Capable of re-fitting, utilising end-user input to validate or counter automated decisions, to consistently remain at the forefront of malware detection, without reliance on an internet connection or sandbox environment.

The main body of this paper will focus on the findings from NODENS and is organised as follows:

II. RELATED WORKS

There are numerous works on malware detection and analysis [1], [2], [3].

Cardiff Paper [4] Melbourne Paper Denmark / Cuckoo Paper

Whilst the use of machine learning to detect malware has been a growing area within the academic community, it has largely been confined to research and proof-of-concept models, often using the output of other standalone products, such as Cuckoo Sandbox, Anubis or HookMe, as input ([5], [6], [7], [8]). Such models, whilst providing a valid proof of concept, would be unsuitable for use either at home or commercially. Most models which are able to work as an 'end-to-end' system are relatively complex, often comprising one or more Neural Networks ([4], [9], [10], [11]), with a time frame for detection ranging from 4 seconds ([4]) to 5 minutes ([11]). One element that is not evident in any of these models is a human-readable output to explain why a particular piece of software is considered to be malware.

The system we propose is lightweight enough to be run remotely from a Raspberry Pi, is able to initialise and carry out feature extraction from live data as a standalone system, and produces human-readable output, creating a previously unseen level of interpretability for the end user. We believe ours is the first model to offer these services.
Whilst there are a plethora of existing techniques for the detection of malware, the challenge remains as the threat is ever-evolving. New strands of malware are developed at an alarming rate by all manner of actors, including cyber-criminals, hacktivists, script kiddies, and state-sponsored attackers. Therefore, there is a need to re-address the issue and focus on why a particular piece of software is considered to be malware, what the impact of the software is on the system that it resides on, and what action is appropriate for the user to take. These three premises contribute to the notion that in order for a detection system to be meaningful and effective, how the detection process is conducted requires interpretability by the end-user. This is a primary goal that we consider in our work.

III. METHODOLOGY

All malware samples used during the testing of this system were taken from VX Vault, an open source malware repository, and run on a Windows 7 Virtual Machine running PowerShell (Version 4.0), which was used to collect process details. These process details were the building blocks used to train the system and identify malware. The classifier(s) and all scripts were built using Python (scikit-learn library) and run remotely from a Kali Linux OS.

Does this refer to labelling the processes?

The same guidelines as used by [11] were used to identify a process as a malware process, namely that:
• The process shared the same name as the pre-determined malware file
• Any children processes generated from a process identified by step 1
• Any process which is injected by malicious code from steps 1 or 2

The methodology can be broken down into the following sections, Algorithm Selection and Live Testing, which will be explained in more detail below.

A. Algorithm Selection

Initially a comparison script was created that compared the below algorithms, to establish which algorithm would be the most accurate in identifying malware processes.

Algorithms Tested
• Random-Forest
• GNB
• DecisionTree
• KNearestNeighbour
• AdaBoost
• SVC
• GradientBoosting
• LogisticRegression
• OneClassSVM

To establish this, a malware process was allowed to run for n iterations of the Get-Process cmdlet on the Windows 7 VM and the output was exported as a .csv file. This file was then labelled and used as the dataset for supervised training of the above algorithms. This process was repeated 10 times, each time incrementing the number of malware processes by 1. These output files were then run against the comparison script and a 'winning' algorithm established each time based on the following outputs: accuracy score, false positive rate and false negative rate. In addition, each process was given its own individual score.

A k-fold of 7 was used for cross-validation scoring with a testing/validation split of 80/20, whilst feature selection was utilised during each run of the comparison script, with the selected features being output along with the previously listed scores. The end goal was that a smaller number of Get-Process output features could be identified as key, and only these features used in the subsequent training and testing of the classifier. This would also allow researchers to identify potentially key aspects of malware behaviour which separate it from benign processes and aid in further malware detection and analysis research.

By the end of the comparison process each algorithm had seen 55 unique malware samples in total. Random-Forest was the most accurate classifier, having the top accuracy score in 7 out of 10 runs, with an average accuracy of 99.98%. A Random-Forest classifier script was then created and trained using the features identified as having been key in 50% or more of the comparison script outputs. However, this proved to be inaccurate and would misclassify benign processes as malware. As a result, a .csv file containing the combined data from all training malware samples was created and used to re-run the training and feature selection. This was done twice, once with all features included and a second time with the version features (File and Product) removed. The latter feature selection set (referred to as scheme 3 or the final set) was used to train NODENS.

B. Live Testing

Following the successful training of the classifier on static data, it was then tested against live, previously unseen malware samples. This initially led to a delay of approximately 30 seconds between a malicious process being started and it being detected, which was deemed unsuitable for the design parameters the system was being based around.

The data collection script was modified to split output files down to a smaller size, limited to the output of 10 iterations of the Get-Process cmdlet. The classifier was modified to introduce a whitelisting approach for already known processes, such as consistent background processes, and a blacklisting system set up to allow the automatic termination of previously identified malware processes. As a result of these modifications the delay was shortened to between 3 and 5 seconds and deemed to be within acceptable parameters.

The final script was built entirely in Python, using the scikit-learn library, utilising the Gini Index and an SVM model for feature selection. A basic command line interface was implemented which allowed an end user to utilise multiple different 'plug-in' scripts, such as being able to start and stop the data collection and detection process from the
command line, as well as the termination of malware processes and re-fitting of the classifier upon identification of previously unseen malware. The classifier would also seek clarification on previously unseen, but suspected, malware processes; in this way the classifier's 'understanding' could be refined with the assistance of the end-user, limiting the chances of misidentifying processes and then re-fitting these errors into the algorithm.

IV. DATASET

A. Malware Samples

Malware samples were downloaded from VX Vault, with the malware being run on a Windows 7 VM, alongside the PowerShell cmdlet Get-Process, in iterations of 1 through to 10. The output from the Get-Process cmdlet was saved as .csv files, which were manually labelled for supervised training and then used as the input for the comparison script during the initial training phase. These .csv files contained only the normal background processes for a Windows 7 VM, the malware processes and the PowerShell process.

During live testing the malware was downloaded and run on the Windows 7 VM as above, however NODENS was able to start and stop the iteration and collection of the Get-Process cmdlet utilising an unsupervised approach. At this stage multiple benignware processes were also run during the testing of NODENS to refine and test the system's ability to distinguish between malware and benignware.

B. Data Pre-processing

The algorithm used integer and floating point values, which were originally taken and converted from the Get-Process cmdlet. String values were converted to binary data (1 = true, 0 = false), with the exception of Has Exited, which operates as 1 = Exited, -1 = Not Exited, and 0 indicating that the field had no value during pre-processing, to show whether the string had been present. All other data was converted to either integers or floating point, dependent on which was more appropriate, i.e. measurements in seconds and milliseconds. The most successful process parameters which were used by the algorithm were:
• Path – Binary
• Company – Binary
• Description – Binary
• Has Exited – Binary
• Processor Affinity – Binary
• Peak Working Set64 – Integer
• Peak Virtual Memory Size64 – Integer
• Private Memory Size64 – Integer
• Handle Count – Integer
• Virtual Memory Size64 – Integer
• Working Set64 – Integer
• Total Processor Time – Float

V. RESULTS

A. Feature Selection

The features identified as key changed multiple times, each time being refined to increase the accuracy of the classifier, with the largest problem being the misclassification of benignware as malware. The first set of features was identified during the running of the comparison script and algorithm selection, as a result of running feature selection against each incremental output file individually. These features were then compared and the first set created using features that had been key in 50% or more of the comparison script outputs.

First Set
• ProcessorAffinity
• Company
• Product
• ProductVersion
• UserProcessorTime
• Description
• Handle
• Path

Once the winning algorithm (Random-Forest) had been decided, testing was conducted on live data using these features; however these features led to misclassification of benignware processes, such as Mozilla Firefox or VLC player. As a result, feature selection was run against a dataset containing the combined output for all 55 malware samples. This identified the following features as key:

Second Set
• Path
• HasExited
• Description
• PagedSystemMemorySize64
• PrivateMemorySize64
• FileVersion
• TotalProcessorTime
• ProcessorAffinity
• Handle
• PeakPagedMemorySize64

Whilst this led to a different set of features being identified, when used as the basis to re-train the classifier it still resulted in misclassification of benignware processes. Following manual inspection of the script output it was apparent that processes the script considered to be 'legitimate' had an extremely high output score, leading to a very high threshold for any process to be considered benignware and resulting in a high number of false positives.

The features identified as being key in the high output for 'legitimate' processes were the Version features (Product and File). These were removed from the combined dataset and training (with feature selection) was re-run. The following features were then identified as key, this being the final feature selection used for the training of the current NODENS system:

Final Set
• ProcessorAffinity
• VirtualMemorySize64
• HandleCount
• HasExited
• Company
Process | Classification | Total (scheme 1) | Total (scheme 2)
Firefox | Malware | 7,638 | 64,547,168
Python | Malware | 2,182 | 27,109,170
exe1 | Malware | 1,888 | 2,514,799,716
Internet Explorer | Legitimate | 800,760,001,869 | 80,100,000,000,000,000,000,000
TABLE I
Caption

• Description
• PeakVirtualMemorySize64
• TotalProcessorTime
• PeakWorkingSet64
• PrivateMemorySize64
• WorkingSet64
• Path

This resulted in much lower scores for 'legitimate' processes and made the classification of benignware processes significantly more accurate.

Throughout the process of adjusting the features utilised in the training of the final classifier, ProcessorAffinity, Description and Path were all identified as key in each scheme. The author believes that the latter two features were likely identified as key due to the fact that most malware vendors do not bother to add a description to any malware created, whilst the path field would either be from the desktop or become blank following the execution of some malware.

B. Binary Data

Data was organised based on the values of the binary data parameters; in this way an empirical comparison could be more easily drawn between parameters. Legitimate processes were set to either true or false for all binary parameters, with only one third-party (non-Windows native background) process showing false for all binary parameters. By comparison, malware processes showed a significant range, with multiple combinations of binary values being presented by different processes, as illustrated below.

This provides an easy metric to help decide if a process is legitimate or not, based on the binary fields. From 39 malware samples tested, 22 provided no Company data, of which 18 also provided no Description, in comparison to one legitimate process.

C. Variable Data

Variable data consisted of both the integer and float type parameters listed above and the output score that each process was given by the system. A comparison of the variable data showed that legitimate processes had on average a higher score, as illustrated below.

Whilst this makes for a good metric overall, there are instances where the scores for malware processes and legitimate processes were within the same score bracket (between the lowest and highest scores for correctly identified legitimate processes, hereafter referred to as a score bracket), yet the malware was correctly identified as malware. To research this further, data was compared from a process that had been classified as both malware and legitimate, as illustrated below.

During the course of several process cycles, one snapshot of the VLC process was identified as both malware and a legitimate process, despite there being no change to any of the selected features. This can likely be attributed to the system being less than 100% accurate, and as such being unable to correctly identify each individual process snapshot, instead relying on a weight of averages. The second consideration is that this process was correctly identified, with a lower score, which goes against the trend noted for malware identification, with only the handle count and total processor time showing a positive difference.

The next stage was to look at the malware and legitimate processes which were correctly identified but fit within the same score bracket; 9 malware processes were identified whose output scores put them inside the legitimate process output score bracket. These processes were then empirically compared to the legitimate processes that they were scored between or close to.

There was no immediate distinction between them, with neither binary nor variable data showing a clear pattern. The variable data parameters were then individually compared to see if a differential in values would establish a pattern of identification. This showed that the differential between (Peak)WorkingSet64 and PrivateMemorySize64 was negative for 6 of the 9 malware processes and none of the legitimate processes. The three malware processes showing a positive differential were: VLC (identified as both malware and legitimate), python (misclassified) and consent (a genuine malware process). Removing VLC from the equation, we are able to extend the distinction to any differential that was lower than approx. 6,700,000. This showed that malware processes had (on average) a higher amount of private data than shared, something that was not found in legitimate processes.

D. Alternative Classifiers

Two alternative algorithms were tested during the live testing phase in an attempt to shorten the time between execution and detection. Both a Neural Network and an SGD (Stochastic Gradient Descent) classifier were found to be faster and better suited to incremental learning; however, neither was as accurate as the original Random-Forest, with accuracy dropping down to 63% in one instance. The author believes that in the instance of the Neural Network this drop in accuracy is due to the relatively low number of training samples used (the same 55 as used to train the Random-Forest classifier). It is highly likely that with access to a larger, more industrial-scale dataset the accuracy of the Neural Network could be improved [4].
Process | Classification | Total (scheme 1) | Total (scheme 2) | Total (scheme 3)
Firefox | Malware | 7,638 | 64,547,168 | 832,033,057
Python | Malware | 2,182 | 27,109,170 | 245,776,557
qw | Malware | - | - | 291,352,773
Internet Explorer | Legitimate | 800,760,001,869 | 80,100,000,000,000,000,000,000 | 316,039,580
TABLE II
Caption
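The 'score bracket' comparison from Section V-C can be made concrete against these totals. The bounds below are the legitimate lowest/highest values reported in Table IV, and the helper is a hypothetical illustration rather than the NODENS scoring code:

```python
# A process whose total output score falls between the lowest and highest
# scores of correctly identified legitimate processes (Table IV) cannot be
# separated from benignware on score alone.
LEGIT_LOWEST, LEGIT_HIGHEST = 184_082_529, 816_992_765

def in_legitimate_bracket(total_output_score):
    return LEGIT_LOWEST <= total_output_score <= LEGIT_HIGHEST

# Scheme 3 totals from Table II: Python and qw land inside the bracket,
# so further metrics are needed to separate them from benignware.
for name, score in [("Firefox", 832_033_057),
                    ("Python", 245_776_557),
                    ("qw", 291_352_773)]:
    print(name, in_legitimate_bracket(score))
```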

Malware Process Name | Path | Company | Description | Has Exited | Processor Affinity
bot | 1 | 0 | 0 | 0 | 1
p.tmp | 1 | 0 | 1 | -1 | 1
DUCUMENT-3839274322-pdf | 1 | 1 | 0 | 1 | 1
re1608 | 1 | 1 | 1 | -1 | 1
TABLE III
Caption
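The binary values in the table above follow the encoding described in Section IV-B. A minimal sketch of that conversion follows (field names are assumed to mirror the Get-Process output; this is not the authors' pre-processing script):

```python
# Strings become 1 (present) / 0 (absent); Has Exited is ternary:
# 1 = exited, -1 = not exited, 0 = the field had no value.
def encode_string(value):
    return 1 if value else 0

def encode_has_exited(value):
    if value is None or value == "":
        return 0
    return 1 if str(value).lower() == "true" else -1

def binary_signature(row):
    """Binary part of a process signature, in the column order above."""
    return [encode_string(row.get("Path")),
            encode_string(row.get("Company")),
            encode_string(row.get("Description")),
            encode_has_exited(row.get("HasExited")),
            encode_string(row.get("ProcessorAffinity"))]

# A malware-like snapshot: path set, no Company/Description, still running.
print(binary_signature({"Path": r"C:\Users\a\Desktop\x.exe",
                        "HasExited": "False",
                        "ProcessorAffinity": "15"}))   # [1, 0, 0, -1, 1]
```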

Total output score | Lowest | Highest | Average
Malware | 27,321,331 | 554,336,342 | 196,841,342
Legitimate | 184,082,529 | 816,992,765 | 377,574,468
TABLE IV
Caption

VI. LIMITATIONS AND FUTURE WORK

Current size of malware samples used / tested Classification / types of malware largely unknown (VirusTotal / Phil)? All research / testing conducted in a VM environment

VII. CONCLUSION

More testing / samples run needed before conclusion on accuracy etc.

3 metrics for malware distinction were identified, allowing for the creation of a malware process model. Binary values, particularly the Company and Description features, were shown to be a reliable metric as they are rarely populated in malware. However, this could change if these features are populated in more malware samples. The output score and the differential between (Peak)WorkingSet64 and PrivateMemorySize64 were negative for the majority of the malware processes and none of the legitimate processes; as such, these could also be used to identify malware processes. However, these metrics were not 100% accurate in isolation. As such, an approach that utilises multiple methods of distinction is considered best. It is also the author's opinion that further research is required in this area.

REFERENCES

[1] M. Christodorescu, S. Jha, S. A. Seshia, D. Song, and R. E. Bryant, "Semantics-aware malware detection," in 2005 IEEE Symposium on Security and Privacy (S&P'05), May 2005, pp. 32–46.
[2] H. Yin, D. Song, M. Egele, C. Kruegel, and E. Kirda, "Panorama: Capturing system-wide information flow for malware detection and analysis," in Proceedings of the 14th ACM Conference on Computer and Communications Security, ser. CCS '07. New York, NY, USA: ACM, 2007, pp. 116–127. [Online]. Available: http://doi.acm.org/10.1145/1315245.1315261
[3] S. J. Stolfo, K. Wang, and W.-J. Li, "Towards stealthy malware detection," in Malware Detection, M. Christodorescu, S. Jha, D. Maughan, D. Song, and C. Wang, Eds. Boston, MA: Springer US, 2007, pp. 231–249.
[4] M. Rhode, P. Burnap, and K. Jones, "Early-stage malware prediction using recurrent neural networks," Computers & Security, vol. 77, pp. 578–594, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167404818305546
[5] I. Firdausi, C. Lim, A. Erwin, and A. S. Nugroho, "Analysis of machine learning techniques used in behavior-based malware detection," in 2010 Second International Conference on Advances in Computing, Control, and Telecommunication Technologies, 2010, pp. 201–203.
[6] R. Tian, R. Islam, L. Batten, and S. Versteeg, "Differentiating malware from cleanware using behavioural analysis," in 2010 5th International Conference on Malicious and Unwanted Software, 2010, pp. 23–30.
[7] S. S. Hansen, T. M. T. Larsen, M. Stevanovic, and J. M. Pedersen, "An approach for detection and family classification of malware based on behavioral analysis," in 2016 International Conference on Computing, Networking and Communications (ICNC), Workshop on Computing, Networking and Communications (CNC), 2016, pp. 1–5.
[8] S. Tobiyama, Y. Yamaguchi, H. Shimada, T. Ikuse, and T. Yagi, "Malware detection with deep neural network using process behavior," in 2016 IEEE 40th Annual Computer Software and Applications Conference, 2016, pp. 577–582.
[9] J. Saxe and K. Berlin, "Deep neural network based malware detection using two dimensional binary program features," in 2015 10th International Conference on Malicious and Unwanted Software (MALWARE), 2015.
[10] R. Pascanu, J. W. Stokes, H. Sanossian, M. Marinescu, and A. Thomas, "Malware classification with recurrent networks," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 1916–1920.
[11] T. Shibahara, T. Yagi, M. Akiyama, D. Chiba, and T. Yada, "Efficient dynamic malware analysis based on network behavior using deep learning," in 2016 IEEE Global Communications Conference (GLOBECOM), 2016.
Classification | Process | Handle Count | Peak Working Set64 | Peak Virtual Memory Size64 | Private Memory Size64 | Total Processor Time | Virtual Memory Size64 | Working Set64 | Total output score
Malware | vlc | 51 | 5,328,896 | 56,504,320 | 2,191,360 | 0.1602304 | 56,504,320 | 5,328,896 | 125,857,846
Both | vlc | 323 | 19,406,848 | 117,628,928 | 7,737,344 | 0.1602304 | 117,616,640 | 19,394,560 | 281,784,646
Legitimate | vlc | 324 | 19,312,640 | 117,022,720 | 7,684,096 | 0.1802592 | 117,010,432 | 19,308,544 | 280,338,759
Difference | | +1 | -94,208 | -606,208 | -53,248 | +0.0200288 | -606,208 | -86,016 | -1,445,887
TABLE V
Caption
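The memory differential discussed in Section V-C can be checked directly against the vlc snapshots above. The paper writes '(Peak)WorkingSet64', so either the peak or current working set may be intended; the helper below uses WorkingSet64 and the approximate 6,700,000 cut-off, and is illustrative only:

```python
# Malware tended to hold more private than shared memory, giving a low or
# negative WorkingSet64 - PrivateMemorySize64 differential (Sec. V-C).
THRESHOLD = 6_700_000   # approximate cut-off reported in the paper

def differential_suggests_malware(working_set64, private_memory_size64):
    return (working_set64 - private_memory_size64) < THRESHOLD

# vlc rows from Table V
print(differential_suggests_malware(5_328_896, 2_191_360))    # Malware row
print(differential_suggests_malware(19_308_544, 7_684_096))   # Legitimate row
```

On these two rows the heuristic flags the malware-classified snapshot and clears the legitimate one, consistent with the trend reported in the results.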

[Fig. 1: NODENS processing pipeline. Input (.csv) -> Pre-Processing (Feature Extraction, Data Conversion) -> Processing (RF Classifier) -> Prediction (Legitimate / Malware; Known / Unknown) -> Process Killed and Process Signature Saved, or Process Signature Saved -> Refitting.]
Fig. 1. Title
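The flow in Fig. 1 can be sketched as a single detection cycle. Every name below is a hypothetical stand-in; the NODENS plug-in scripts themselves are not reproduced in the paper:

```python
# One pass of the Fig. 1 loop: whitelist skip -> predict -> kill/save,
# with signatures retained for later re-fitting.
def detection_cycle(rows, predict, whitelist, blacklist, kill, save):
    """rows: dicts parsed from one Get-Process .csv snapshot."""
    for row in rows:
        name = row["ProcessName"]
        if name in whitelist:                  # known legitimate: skip
            continue
        is_malware = name in blacklist or predict(row) == 1
        if is_malware:
            kill(name)                         # automatic termination
        save(name, is_malware)                 # signature kept for re-fit

killed, saved = [], []
detection_cycle(
    rows=[{"ProcessName": "explorer"}, {"ProcessName": "bot"}],
    predict=lambda row: 1 if row["ProcessName"] == "bot" else 0,
    whitelist={"explorer"},
    blacklist=set(),
    kill=killed.append,
    save=lambda name, is_malware: saved.append(name),
)
print(killed, saved)   # ['bot'] ['bot']
```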
