
International Journal of Computer Trends and Technology (IJCTT) volume 8 number 4 Feb 2014

ISSN: 2231-2803 www.internationaljournalssrg.org Page 200



A Proposed Methodology for Virus Detection Using Data Mining
and Reverse Engineering Tools with Client-Server Model

Uday Babu P 1, Visakh R 2

1 (Department of Computer Science & Engineering, Rajagiri School Of Engineering & Technology-Kochi, India)
2 (Department of Computer Science & Engineering, Rajagiri School Of Engineering & Technology-Kochi, India)

ABSTRACT : Viruses are a class of malicious
programs that cause unfavourable effects on the
computer system and thereby become an obstacle
to the standard operation of the system. Their
existence and execution within the system should be
detected promptly to prevent them from
causing irrecoverable and devastating problems
such as loss of performance and loss of
confidentiality of sensitive information. To detect
the presence of a virus within a system, the
effects of various viruses on computer systems
are first analysed by executing them one by one in a
virtual environment. These effects are captured
using reverse engineering tools. Data mining is
applied to the data recorded by the reverse
engineering tools to extract the significant patterns
that characterize the respective viruses. These
patterns are converted into a unique binary code
which can be used to detect viruses using a
client-server model.

Keywords - Client-server model, Data mining, FP
growth algorithm, Reverse engineering, Virus
detection.
I. INTRODUCTION
Viruses are malware designed to damage
computer systems and thereby make them
vulnerable to security threats and performance
degradation. The evolution of the internet has resulted
in the spawning of new malware, including
viruses. Viruses have mutated into such sophisticated
forms that detecting them with the major
conventional methods, such as signature-based and
heuristic detection, may become a laborious process.
Cyber security is under threat and cyber wars
are forecast in the near future [1]. Hence a
methodology capable of detecting the
presence of malware in a system should be
formulated so that the malware and its effects
can be removed from the infected system. Data
mining is the process of extracting frequent and
relevant patterns from a very large data set [2].
In the proposed system, reverse engineering
tools and a data mining algorithm are applied one
after the other in each virtual system, before and
after infecting the system with a given virus, to
extract the relevant patterns that characterize the
effects of the given virus. These patterns are
exploited to discover the presence of an
unidentified virus in a system on the basis of a
client-server model. The characteristic pattern of each
known virus, after being transformed into a binary
code, is saved in a database at the server. The binary
code formulated for an unknown virus is compared
with the binary codes of known viruses to identify the
unknown virus.
In [3], Burji, Liszka and Chan stated that
malware can be detected by integrating reverse
engineering tools and data mining. Three virtual
machines are created in each system and each of
them is infected with a given malware. Reverse
engineering tools such as a file monitor, registry
monitor and API call tracer are executed in each of
the virtual machines to record various aspects of
the machine state of each infected virtual
machine.
Data mining is applied to the reverse
engineered data of the virus to retrieve pertinent
and frequent data patterns that characterize the
virus efficiently. The output of the data mining step
is supplied to a rough set theory based tool known as
Blem2. Blem2 generates rules of the required
confidence and strength that can be used to detect
malware. However, in that approach the machine state is
captured only after infection; hence the observations may
contain effects that were not caused by the
malware attack. The rules developed by the
rough set based tool may therefore not be precise enough to
catch the malware, and this may result in
false positives.
Reverse engineering of malware is the
analysis of the malware in order to comprehend and
capture its design, components, behaviour and
effects by executing it in a controlled and
isolated virtual environment.
Reverse engineering tools such as a file system
monitor and a registry monitor are used to trace the
machine state of the system. Each reverse
engineering tool captures a single aspect of the
machine state. A virus changes one or more aspects
of the machine state of a system when it infects
that system.
A few of the reverse engineering tools available to
capture the state of the machine are the following:

1.1 File System Monitor
When a process is executed in a system, it makes
changes in the file system by adding, deleting or
editing files. The file system monitor
captures all the file system activity performed by
all the processes running in the system. Changes
made by the malware in the file system of the
infected system can be gathered using the file system
monitor.

1.2 Registry Monitor
The system registry preserves the configuration
details of the operating system and all the programs
that are installed in it. A registry monitor can be used
to monitor the registry. Malware infects the
system by making modifications in the registry, and
these manipulations can be retrieved using a
registry monitor.

1.3 Process Monitor
A process monitor keeps track of the processes that
are currently executing in a given system. A virus
is associated with one or more processes. A system
is said to be infected with a given virus if
processes associated with that virus are currently
running in the system.

The paper is organized in the following manner.
Section II gives an insight into the proposed
system. Section III elaborates on the techniques
available to evaluate the system. Section IV gives a
conclusion and provides information regarding the
scope for future works.

II. PROPOSED METHODOLOGY
The proposed methodology, shown in Fig.1,
includes the following steps:

2.1 Application Of Reverse Engineering Tools

The effects of viruses on a machine can be captured
dynamically by executing them in controlled
virtual machines created using
Oracle VirtualBox [4]. Such a virtual machine
keeps the host operating system protected from the
ill effects of the malware when the virtual machine
is infected.
Two to three virtual machines can be created in
a single system depending on its configuration.
Each has its own operating system, which
is isolated from the operating systems of the other virtual
machines and from the host operating system.
Multiple virtual machines are created and
reverse engineering tools are applied to retrieve the
current machine state, which represents an infection-free
condition. Each virtual machine is then infected
with the same given virus and the reverse engineering
tools are reapplied to capture the updated machine
states. A symmetric difference is taken between the
original and updated machine states to obtain the
changes that were made by the infection. This
removes observations that are probably not the
result of the malware attack.
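The before/after comparison can be sketched in Python as follows. This is a minimal illustration, assuming each machine state is recorded as a set of (tool, observation) pairs; that representation, the function name state_changes and the sample entries are hypothetical and are not taken from the output of any particular monitoring tool.

```python
# Hypothetical snapshots: each machine state is a set of (tool, observation) pairs.
clean_state = {
    ("file", "C:/Windows/system32/kernel32.dll"),
    ("registry", "HKLM/Software/Vendor/App"),
    ("process", "explorer.exe"),
    ("process", "updater.exe"),
}
infected_state = {
    ("file", "C:/Windows/system32/kernel32.dll"),
    ("file", "C:/Users/Public/payload.dll"),
    ("registry", "HKLM/Software/Vendor/App"),
    ("registry", "HKCU/Software/Run/payload"),
    ("process", "explorer.exe"),
}


def state_changes(before, after):
    """Return the entries present in exactly one of the two snapshots."""
    return before ^ after  # set symmetric difference


changes = state_changes(clean_state, infected_state)
added = changes & infected_state   # entries that appeared after infection
removed = changes & clean_state    # entries that disappeared after infection
```

Entries common to both snapshots drop out of the result, which is how the step discards observations unrelated to the infection.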


Fig.1. proposed methodology

2.2 Data Mining

The FP growth algorithm [2], a data mining algorithm
that has been shown to be efficient in
terms of memory usage and execution time, can be
applied to the data obtained after taking the
symmetric difference. This extracts the most
concise and relevant patterns, which help to
characterize the given virus. The FP growth algorithm
does not require repeated database scans.

The FP growth algorithm uses a divide and conquer
strategy [2]. First, it compresses the database
of frequent items into an FP-tree, which
preserves the itemset association information. The
compressed database thus obtained is partitioned
into a set of conditional databases, each of which is
associated with a unique frequent item, and mining
is done independently on each partition.
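The divide and conquer idea can be illustrated with a simplified pattern-growth sketch. It is not a full FP growth implementation: instead of building a compressed FP-tree, it keeps each conditional database as a plain list of transactions, which shows the partitioning step but not the memory savings. The function name frequent_itemsets and the min_support parameter are assumptions made for this example.

```python
from collections import Counter


def frequent_itemsets(transactions, min_support):
    """Return {frozenset(itemset): support count} by recursive pattern growth."""
    results = {}

    def grow(db, suffix):
        # Count how many transactions of this (conditional) database hold each item.
        counts = Counter(item for t in db for item in t)
        frequent = sorted(
            (item for item, c in counts.items() if c >= min_support),
            key=lambda item: (-counts[item], item),
        )
        rank = {item: i for i, item in enumerate(frequent)}
        # Drop infrequent items and order the rest by descending support.
        ordered = [sorted((i for i in t if i in rank), key=rank.get) for t in db]
        # Work from the least frequent item to the most frequent one.
        for item in reversed(frequent):
            pattern = suffix | {item}
            results[frozenset(pattern)] = counts[item]
            # Conditional database: the items that precede `item` in every
            # transaction containing it.
            cond = [t[: t.index(item)] for t in ordered if item in t]
            cond = [t for t in cond if t]
            if cond:
                grow(cond, pattern)

    grow([set(t) for t in transactions], frozenset())
    return results
```

Each transaction here would be the set of changes observed in one infected virtual machine, so an itemset that is frequent across machines is a candidate pattern for the virus.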

2.3 Converting To Binary Code

The patterns that characterise each virus, obtained
through data mining, can be converted into a binary
code that represents the respective virus.
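The encoding scheme is not specified above, so the following is only one possible sketch. It assumes that client and server share a fixed, ordered list of candidate features and that bit i of the code is 1 when feature i occurs in the mined patterns. The list FEATURES, its entries and the function to_binary_code are hypothetical.

```python
# Hypothetical shared vocabulary: bit i of a code corresponds to FEATURES[i].
FEATURES = [
    "file+:payload.dll",
    "registry+:Run/payload",
    "process+:payload.exe",
    "process-:updater.exe",
]


def to_binary_code(patterns, features=FEATURES):
    """Encode a collection of mined items as a bit string over `features`."""
    present = set(patterns)
    return "".join("1" if f in present else "0" for f in features)
```

A fixed vocabulary keeps every code the same length, which is what makes a bitwise comparison between codes meaningful.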

2.4 Server Creation

A server can be created on a cloud computing
platform to store the binary codes
corresponding to each known virus. The server is
programmed to wait for requests from client
machines that are suspected to be infected by one
or more viruses.

2.5 Client Processing

Each computer system acts as a client machine.
The machine state of each computer system is
captured periodically at regular intervals. The length
of the interval depends on the level of security that
the system needs. The captured machine state of a
system is compared with the machine state
obtained during the previous capture.
If the degree of difference between the two
machine states is above a predetermined threshold,
then data mining is applied to the symmetric
difference between the machine states to
obtain the significant patterns that may
characterize an unknown virus.
The retrieved patterns are converted into a
binary code. The binary code is sent to the remote
server for analysis. The server performs an analysis
to detect the virus based on the stored binary codes.
If the client system is a highly sensitive machine, it is
placed in a dormant state until a confirmation is
received from the server. Actions are taken based
on the server's response.
The client sends the binary code to the server only
when the degree of difference between consecutive
machine states exceeds the threshold, so the server
is not overloaded unnecessarily.
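The client-side check can be sketched as below. The measure used for the degree of difference is not defined in the text; this example assumes the Jaccard distance between two snapshots represented as sets, and the default threshold of 0.2 is an arbitrary placeholder. Both function names are hypothetical.

```python
def degree_of_difference(previous, current):
    """Jaccard distance between two snapshots (0.0 = identical, 1.0 = disjoint)."""
    union = previous | current
    if not union:
        return 0.0  # two empty snapshots are treated as identical
    return len(previous ^ current) / len(union)


def should_contact_server(previous, current, threshold=0.2):
    """True when the change between captures exceeds the assumed threshold."""
    return degree_of_difference(previous, current) > threshold
```

Only when should_contact_server returns True would the client go on to mine the symmetric difference and send a code, which keeps routine, small changes from generating server traffic.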


Fig.2. client processing

2.6 Server Processing

As shown in Fig.3, when the server obtains the
binary code of an unknown virus from the client, it
calculates the Hamming distance of the given code
from all the binary codes of known viruses that are
stored in the database of the server. The binary
code that has the minimum Hamming distance from
the given binary code is selected. The system is
then said to be infected with the virus
corresponding to the best-matching binary code.
The server notifies the client and the client can resume
accordingly.
Viruses corresponding to binary codes
whose Hamming distance from the given binary
code is below a predetermined threshold
can be considered to belong to the
same family of viruses, which may have been formed
through mutation. The Hamming distance is the number
of bit positions at which the two binary codes
under consideration differ.
If the minimum Hamming distance of the given
binary code with respect to all binary codes in the
database is above a predetermined threshold, then it
may be a newly generated, unclassified virus. In that
case its binary code is
inserted into the database at the server to represent the
new virus. Hence known viruses, their mutants and
unknown viruses can be detected.
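The server-side matching can be sketched as follows, assuming codes are equal-length bit strings. The verdict labels, the use of a single threshold for both the family test and the new-virus test, and the choice to treat a distance equal to the threshold as a family member are assumptions of this example; the text leaves those details open.

```python
def hamming_distance(a, b):
    """Number of bit positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def classify(code, known_codes, family_threshold):
    """Return (verdict, closest_name, distance) for a submitted code.

    known_codes maps a virus name to its stored bit string.
    """
    if not known_codes:
        return ("new", None, None)
    # Find the stored code with the minimum Hamming distance.
    name, dist = min(
        ((n, hamming_distance(code, c)) for n, c in known_codes.items()),
        key=lambda pair: pair[1],
    )
    if dist == 0:
        return ("known", name, dist)   # exact match with a stored code
    if dist <= family_threshold:
        return ("mutant", name, dist)  # close enough to count as the same family
    return ("new", name, dist)         # too far from every stored code
```

When the verdict is "new", the server would add the submitted code to its database under a fresh name, as described above.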


Fig.3. server processing

III. EVALUATION
Evaluation of the methodology can be done by
employing parameters such as accuracy, detection
rate, precision and false positive rate. Accuracy is
the ratio of the sum of the numbers of true negatives
and true positives to the sum of the numbers of true
positives, true negatives, false positives and false
negatives. Detection rate is the ratio of the number of
true positives to the sum of the numbers of true
positives and false negatives. Precision is the ratio
of the number of true positives to the sum of the numbers of
true positives and false positives. False positive
rate is the ratio of the number of false positives to the
sum of the numbers of true negatives and false
positives.
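The four measures defined above translate directly into code. The sketch below takes the confusion-matrix counts as arguments; the function name evaluation_metrics and the convention of returning 0.0 when a denominator is zero are assumptions of this example.

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Compute the four measures from true/false positive/negative counts."""

    def ratio(numerator, denominator):
        # Assumed convention: report 0.0 when the denominator is zero.
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "detection_rate": ratio(tp, tp + fn),
        "precision": ratio(tp, tp + fp),
        "false_positive_rate": ratio(fp, fp + tn),
    }
```

Detection rate is the same quantity that is often called recall or true positive rate.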
IV. CONCLUSION
New viruses are generated one after another at an
accelerating pace due to the rapid development of the
internet. Complete dependence on antivirus
software, which takes a significant amount of
memory to run, is not a reliable way to detect recent
sophisticated viruses that have devastating effects.
Hence the proposed system provides a cost-effective
and efficient method to detect unknown
virus infections, which may be better than
antivirus software that needs to be updated
periodically and whose detection power varies with
the manufacturer. As an
implementation, all systems in a company can act as
clients except one system, which acts as the server.
Detection of an infection alone is not
sufficient; hence, as future work, techniques to
remove the detected viruses from the system must
be determined and integrated with
the proposed system to complete the task of
protecting computer systems from virus
attacks.
The proposed methodology may be extremely
useful for systems holding highly sensitive
data, for which security is of prime importance and
short periods of unavailability can be tolerated.

REFERENCES

[1] Bhavani Thuraisingham, Data mining for malicious
code detection and security applications, IEEE/WIC/ACM
International Conference on Web Intelligence and Intelligent
Agent Technology Workshops, 2009.
[2] Jiawei Han and Micheline Kamber, Data Mining:
Concepts and Techniques (Morgan Kaufmann Publishers, USA,
Second Edition).
[3] Supreeth Burji, Kathy J. Liszka, Chan, Malware
analysis using reverse engineering and data mining tools,
International Conference on System Science and Engineering,
2010.
[4] Oracle VirtualBox, virtual machine software tool,
www.virtualbox.org
