Documenti di Didattica
Documenti di Professioni
Documenti di Cultura
Abstract: Intelligent intrusion detection systems can only be built if there is availability of an effective data set. A data
set with a sizable amount of quality data which mimics the real time can only help to train and test an intrusion
detection system. The NSL-KDD data set is a refined version of its predecessor KDD99 data set. In this paper the
NSL-KDD data set is analysed and used to study the effectiveness of the various classification algorithms in detecting
the anomalies in the network traffic patterns. We have also analysed the relationship of the protocols available in the
commonly used network protocol stack with the attacks used by intruders to generate anomalous network traffic. The
analysis is done using classification algorithms available in the data mining tool WEKA. The study has exposed many
facts about the bonding between the protocols and network attacks.
I. INTRODUCTION
Communication system plays an inevitable role in is used. This paper uses the NSL-KDD data set to reveal
common mans daily life. Computer networks are the most vulnerable protocol that is frequently used
effectively used for business data processing, education intruders to launch network based intrusions.
and learning, collaboration, widespread data acquisition The rest of the paper is organized as follows: Section II
and entertainment. The computer network protocol stack presents some related work based on intrusion detection.
that is in use today was developed with a motive to make Section III gives a brief description on the contents of the
it transparent and user friendly. This lead to the NSL-KDD dataset. Section IV summarizes about analysis
development of a robust communication protocol stack. of the dataset with various classification techniques.
The flexibility of the protocol has made it vulnerable to Section V presents the graphical analysis report on various
the attacks launched by the intruders. This makes the intrusions using different classification methods. Section
requirement for the computer networks to be continuously VI, deals with conclusion and future work.
monitored and protected. The monitoring process is
automated by an intrusion detection system (IDS) [1]. The II. RELATED WORK
IDS can be made of combination of hardware and The NSL-KDD data set is the refined version of the KDD
software. cup99 data set [5]. Many types of analysis have been
At any point of time a web server can be visited by many carried out by many researchers on the NSL-KDD dataset
clients and they naturally produce heavy traffic. Each employing different techniques and tools with a universal
network connection can be visualized as a set of attributes. objective to develop an effective intrusion detection
The traffic data can be logged and be used to study and system. A detailed analysis on the NSL-KDD data set
classify in to normal and abnormal traffic. In order to using various machine learning techniques is done in [6]
process the voluminous database, machine learning available in the WEKA tool. K-means clustering algorithm
techniques can be used. uses the NSL-KDD data set [7] to train and test various
existing and new attacks. A comparative study on the
Data mining is the process of extracting interested data NSL-KDD data set with its predecessor KDD99 cup data
from voluminous data sets using machine learning set is made in [8] by employing the Self Organization Map
techniques [2]. (SOM) Artificial Neural Network. An exhaustive analysis
In this paper the analysis of the NSL-KDD data set [3] is on various data sets like KDD99, GureKDD and NSL-
made by using various clustering algorithms available in KDD are made in using various data mining based
the WEKA [4] data mining tool. The NSL-KDD data set is machine learning algorithms like Support Vector Machine
analyzed and categorized into four different clusters (SVM), Decision Tree, K-nearest neighbor, K-Means and
depicting the four common different types of attacks. An Fuzzy C-Mean clustering algorithms.
in depth analytical study is made on the test and training III. DATASET DESCRIPTION
data set. Execution speed of the various clustering The inherent drawbacks in the KDD cup 99 dataset [9] has
algorithms is analysed. Here the 20% train and test data set been revealed by various statistical analyses has affected
the detection accuracy of many IDS modelled by TABLE II: BASIC FEATURES OF EACH NETWORK
researchers. NSL-KDD data set [3] is a refined version of CONNECTION VECTOR
its predecessor. Attribute Attribute Description Sample
It contains essential records of the complete KDD data set. No. Name Data
There are a collection of downloadable files at the disposal Length of
for the researchers. They are listed in the Table 1 time duration
1 Duration 0
of the
TABLE I : LIST OF NSL-KDD DATASET FILES AND THEIR connection
DESCRIPTION Protocol used
2 Protocol_type in the Tcp
S.N connection
Name of the file Description
o.
Destination
The full NSL-KDD train set 3 Service network ftp_data
1 KDDTrain+.ARFF with binary labels in ARFF service used
format Status of the
The full NSL-KDD train set connection
including attack-type labels 4 Flag SF
2 KDDTrain+.TXT Normal or
and difficulty level in CSV Error
format Number of
KDDTrain+_20Perce A 20% subset of the data bytes
3
nt.ARFF KDDTrain+.arff file transferred
KDDTrain+_20Perce A 20% subset of the 5 Src_bytes from source to 491
4
nt.TXT KDDTrain+.txt file destination in
The full NSL-KDD test set single
5 KDDTest+.ARFF with binary labels in ARFF connection
format Number of
The full NSL-KDD test set data bytes
KDDTest+.TXT including attack-type labels transferred
6
and difficulty level in CSV from
format 6 Dst_bytes 0
destination to
A subset of the source in
KDDTest-21.ARFF KDDTest+.arff file which single
7
does not include records with connection
difficulty level of 21 out of 21 if source and
A subset of the KDDTest+.txt destination IP
file which does not include addresses and
8 KDDTest-21.TXT
records with difficulty level of port numbers
21 out of 21 7 Land 0
are equal then,
this variable
1. Redundant records are removed to enable the takes value 1
classifiers to produce an un-biased result. else 0
Total number
2. Sufficient number of records is available in the
of wrong
train and test data sets, which is reasonably Wrong_fragm
8 fragments in 0
rational and enables to execute experiments on ent
this
the complete set. connection
3. The number of selected records from each Number of
urgent packets
difficult level group is inversely proportional to in this
the percentage of records in the original KDD connection.
data set. 9 Urgent Urgent 0
packets are
In each record there are 41 attributes unfolding different packets with
features of the flow and a label assigned to each either as the urgent bit
an attack type or as normal. activated
The details of the attributes namely the attribute name,
their description and sample data are listed in the Tables TABLE III : CONTENT RELATED FEATURES OF EACH
II, III, IV, V. The Table VI contains type information of NETWORK CONNECTION VECTOR
all the 41 attributes available in the NSL-KDD data set. Attribute Attribute Sample
Description
The 42nd attribute contains data about the various 5 classes No. Name Data
of network connection vectors and they are categorized as Number of
hot indicators
one normal class and four attack class. The 4 attack classes
in the content
are further grouped as DoS, Probe, R2L and U2R. The 10 Hot 0
such as:
description of the attack classes. entering a
system
Copyright to IJARCCE DOI 10.17148/IJARCCE.2015.4696 447
ISSN (Online) 2278-1021
ISSN (Print) 2319-5940
directory, seconds
creating Number of
programs and connections to
executing the same
programs service (port
Num_failed Count of failed 24 Srv_count number) as the 2
11 0
_logins login attempts current
Login Status : connection in
1 if the past two
12 Logged_in successfully 0 seconds
logged in; 0 The percentage
otherwise of connections
Number of that have
Num_comp
13 ``compromised' 0 activated the
romised
' conditions flag (4) s0, s1,
25 Serror_rate 0
1 if root shell is s2 or s3,
14 Root_shell obtained; 0 0 among the
otherwise connections
1 if ``su root'' aggregated in
command count (23)
Su_attempt The percentage
15 attempted or 0
ed of connections
used; 0
otherwise that have
Number of activated the
``root'' accesses flag (4) s0, s1,
26 Srv_serror_rate 0
or number of s2 or s3,
16 Num_root operations 0 among the
performed as a connections
root in the aggregated in
connection srv_count (24)
Number of file The percentage
Num_file_c creation of connections
17 0 that have
reations operations in
the connection activated the
Number of 27 Rerror_rate flag (4) REJ, 0
18 Num_shells 0 among the
shell prompts
Number of connections
Num_acces operations on aggregated in
19 0 count (23)
s_files access control
files The percentage
Number of of connections
Num_outbo outbound that have
20 0 activated the
und_cmds commands in
an ftp session 28 Srv_rerror_rate flag (4) REJ, 0
1 if the login among the
belongs to the connections
Is_hot_logi aggregated in
21 ``hot'' list i.e., 0
n srv_count (24)
root or admin;
else 0 The percentage
1 if the login is of connections
Is_guest_lo a ``guest'' that were to
22 0 the same
gin login; 0
otherwise 29 Same_srv_rate service, among 1
the
TABLE IV : TIME RELATED TRAFFIC FEATURES OF EACH connections
NETWORK CONNECTION VECTOR
aggregated in
count (23)
Attribute Attribute Sample
Description The percentage
No. Name Data
of connections
Number of
that were to
connections to
different
the same
30 Diff_srv_rate services, 0
destination
23 Count 2 among the
host as the
connections
current
aggregated in
connection in
count (23)
the past two
Copyright to IJARCCE DOI 10.17148/IJARCCE.2015.4696 448
ISSN (Online) 2278-1021
ISSN (Print) 2319-5940
4. R2L: unauthorized access from a remote machine, the TABLE VIII: DETAILS OF NORMAL AND ATTACK DATA IN
attacker intrudes into a remote machine and gains local DIFFERENT TYPES OF NSL-KDD DATA SET
access of the victim machine. E.g. password guessing
Relevant features: Network level features duration of Total No. of
connection and service requested and host level Data
Do
features - number of failed login attempts Set
Reco
Norm
S Probe U2R R2L
Type al
rds Cla Class Class Class
TABLE VI: ATTRIBUTE VALUE TYPE Class
ss
Type Features 13449 923 2289 11 209
KDD 4
Nominal Protocol_type(2), Service(3), Flag(4) 2519
Train+ 53.39 36. 9.09% 0.04 0.83
Land(7), logged_in(12), 2
20% % 65 % %
Binary root_shell(14), su_attempted(15), %
is_host_login(21),, is_guest_login(22) 67343 459 11656 52 995
Duration(1), src_bytes(5), KDD 1259 27
dst_bytes(6), wrong_fragment(8), Train+ 73 53.46 36. 9.25% 0.04 0.79
urgent(9), hot(10), % 46 % %
num_failed_logins(11), 9711 745 2421 200 2754
num_compromised(13), 8
num_root(16), KDD 2254
43.08 33. 10.74 0.89 12.22
num_file_creations(17), Test+ 4
% 08 % % %
num_shells(18), %
num_access_files(19),
num_outbound_cmds(20), count(23)
srv_count(24), serror_rate(25), Figure 1 clearly exhibits the count of normal and various
srv_serror_rate(26), rerror_rate(27), attack class records in the different train and test NSL-
umeric srv_rerror_rate(28), same_srv_rate(29) KDD data sets.
diff_srv_rate(30),
srv_diff_host_rate(31), 80000
dst_host_count(32),
dst_host_srv_count(33), 70000 67343
dst_host_same_srv_rate(34),
dst_host_diff_srv_rate(35),
60000
dst_host_same_src_port_rate(36),
dst_host_srv_diff_host_rate(37), KDDTr
dst_host_serror_rate(38), 50000 45927 ain+20
Count
dst_host_srv_serror_rate(39),
dst_host_rerror_rate(40), %
40000
dst_host_srv_rerror_rate(41)
KDDTr
30000
The specific types of attacks are classified into four major 11656 ain+
categories. The table VII shows this detail.
13449
20000
9711
9234
10000
TYPE st+
2754
2421
2289
200
Class of 52 995
209
Data
Attack
11
0
Attack Type Normal DoS Probe U2R R2L
Class
Back, Land, Neptune, Pod,
DoS Smurf,Teardrop,Apache2, Udpstorm,
Fig 1. Network vector distribution in various NSL-KDD
Processtable, Worm (10)
Satan, Ipsweep, Nmap, Portsweep, Mscan,
train and test data set
Probe
Saint (6)
Guess_Password, Ftp_write, Imap, Phf, Further analysis of the KDDTrain+ data set has exposed
R2L
Multihop, Warezmaster, Warezclient, Spy, one of the very important facts about the attack class
Xlock, Xsnoop, Snmpguess, Snmpgetattack, network vectors as shown in Table IX.
Httptunnel, Sendmail, Named (16)
Buffer_overflow, Loadmodule, Rootkit, Perl, From the Figure2, it is apparent that most of the attacks
U2R
Sqlattack, Xterm, Ps (7) launched by the attackers use the TCP protocol suite.
The transparency and ease of use of the TCP protocol is
The Table VIII shows the distribution of the normal and exploited by attackers to launch network based attacks on
attack records available in the various NSL-KDD datasets. the victim computers.
TABLE IX: PROTOCOLS USED BY VARIOUS ATTACKS Vector Machines (SVM), Apriori Algorithm, PageRank,
Attack AdaBoost, K-Nearest Neighbor and Nave Bayes in
Class existence[10].
DoS Probe R2L U2R
Protocol V . EXPERIMENTAL RESULT AND ANALYSIS
This section deals about the experiment setup and result
TCP 42188 5857 995 49
analysis
A. Experiment Setup
UDP 892 1664 0 3
Many standard data mining process such as data cleaning
and pre-processing, clustering, classification, regression,
ICMP 2847 4135 0 0 visualization and feature selection are already
implemented in WEKA. The automated data mining tool
WEKA is used to perform the classification experiments
50000
on the 20% NSL-KDD dataset. The data set consists of
40000 various classes of attacks namely DoS, R2L, U2R and
Probe.
30000 DoS
Probe B. Pre-processing, Feature Selection and
Count
REFERENCES
[1] Aleksandar Lazarevic, Levent Ertoz, Vipin Kumar, Aysel Ozgur,
Jaideep Srivastava, A Comparative Study of Anomaly Detection
Schemes in Network Intrusion Detection
[2] http://en.wikipedia.org/wiki/Data_mining
[3] http://nsl.cs.unb.ca/NSL-KDD/
[4] http://www.cs.waikato.ac.nz/ml/weka/
[5] Mahbod Tavallaee, Ebrahim Bagheri, Wei Lu, and Ali A. Ghorbani
A Detailed Analysis of the KDD CUP 99 Data Set, Proceedings
of the 2009 IEEE Symposium on Computational Intelligence in
Security and Defense Applications (CISDA 2009)
[6] S. Revathi, Dr. A. Malathi, A Detailed Analysis on NSL-KDD
Dataset Using Various Machine Learning Techniques for Intrusion
Detection, International Journal of Engineering Research &
Technology (IJERT), ISSN: 2278-0181, Vol. 2 Issue 12, December
- 2013
[7] Vipin Kumar, Himadri Chauhan, Dheeraj Panwar, K-Means
Clustering Approach to Analyze NSL-KDD Intrusion Detection
Dataset, International Journal of Soft Computing and Engineering
(IJSCE) ISSN: 2231-2307, Volume-3, Issue-4, September 2013
[8] Santosh Kumar Sahu Sauravranjan Sarangi Sanjaya Kumar Jena,
A Detail Analysis on Intrusion Detection Datasets, 2014 IEEE
International Advance Computing Conference (IACC)
[9] Sapna S. Kaushik, Dr. Prof.P.R.Deshmukh, Detection of Attacks
in an Intrusion Detection System, International Journal of
Computer Science and Information Technologies, Vol. 2 (3), 2011,
982-986
[10] XindongWu Vipin Kumar J. Ross Quinlan Joydeep Ghosh
Qiang Yang Hiroshi Motoda Geoffrey J. McLachlan Angus Ng
Bing Liu Philip S. Yu Zhi-Hua Zhou Michael Steinbach
David J. Hand Dan Steinberg, Top 10 algorithms in data
mining, Knowledge and Information Systems Journal,Springer-
Verlag London, vol. 14, Issue 1, pp. 1-37, 2007.