
Robust and Efficient Intrusion

Detection Systems
Kapil Kumar Gupta
Submitted in total fulfilment of the requirements of the degree of
Doctor of Philosophy
January 2009
Department of Computer Science and Software Engineering
THE UNIVERSITY OF MELBOURNE
Abstract
Intrusion detection systems are now an essential component in the overall network and
data security arsenal. With the rapid advancement in network technologies, including
higher bandwidths and ease of connectivity of wireless and mobile devices, the focus of intrusion
detection has shifted from simple signature matching approaches to detecting attacks based on
analyzing contextual information which may be specific to individual networks and applications. As
a result, anomaly and hybrid intrusion detection approaches have gained significance. However,
present anomaly and hybrid detection approaches suffer from three major setbacks: limited attack
detection coverage, a large number of false alarms and inefficiency in operation.
In this thesis, we address these three issues by introducing efficient intrusion detection frameworks
and models which are effective in detecting a wide variety of attacks and which result in very
few false alarms. Additionally, using our approach, attacks can not only be accurately detected but
can also be identified, which helps to initiate effective intrusion response mechanisms in real-time.
Experimental results on the benchmark KDD 1999 data set and two additional data
sets collected locally confirm that layered conditional random fields are particularly well suited to
detect attacks at the network level, and that user session modeling using conditional random fields can
effectively detect attacks at the application level.
We first introduce the layered framework with conditional random fields as the core intrusion
detector. Layered conditional random fields can be used to build scalable and efficient network
intrusion detection systems which are highly accurate in attack detection. We show that our systems
can operate either at the network level or at the application level and perform better than
other well known approaches for intrusion detection. Experimental results further demonstrate
that our system is robust to noise in training data and handles noise better than other systems such
as decision trees and naive Bayes. We then introduce our unified logging framework for
audit data collection and perform user session modeling using conditional random fields to build
real-time application intrusion detection systems. We demonstrate that our system can effectively
detect attacks even when they are disguised within normal events in a single user session. Using
our user session modeling approach based on conditional random fields also results in early attack
detection. This is desirable since intrusion response mechanisms can be initiated in real-time
thereby minimizing the impact of an attack.
Declaration
This is to certify that
1. the thesis comprises only my original work towards the PhD,
2. due acknowledgement has been made in the text to all other material used,
3. the thesis is less than 100,000 words in length, exclusive of tables, maps, bibliographies
and appendices.
Kapil Kumar Gupta,
January 2009
List of Publications
Part of the work described in this thesis has been published as journal articles, book
chapters and conference proceedings. The following is the list of papers published or submitted
during the course of the candidature.
1. Robust Application Intrusion Detection using User Session Modeling Kapil Kumar Gupta,
Baikunth Nath, Kotagiri Ramamohanarao. Submitted to the ACM Transactions on Information
and System Security (TISSEC). Under Review.
2. Layered Approach using Conditional Random Fields for Intrusion Detection Kapil Ku-
mar Gupta, Baikunth Nath, Kotagiri Ramamohanarao. IEEE Transactions on Depend-
able and Secure Computing (TDSC). In Press.
3. User Session Modeling for Effective Application Intrusion Detection Kapil Kumar Gupta,
Baikunth Nath, Kotagiri Ramamohanarao. In Proceedings of the 23rd International Information
Security Conference, Lecture Notes in Computer Science, Springer Verlag,
vol (278), pages 269 - 284, 2008.
4. Sequence Labeling for Effective Intrusion Detection Kotagiri Ramamohanarao, Kapil
Kumar Gupta, Baikunth Nath. In Proceedings of the 2nd Annual Computer Security
Conference. In Press.
5. Intrusion Detection in Networks and Applications Kapil Kumar Gupta, Baikunth Nath,
Kotagiri Ramamohanarao. In Handbook of Communication Networks and Distributed
Systems, World Scientic. In Press.
6. The Curse of Ease of Access to the Internet Kotagiri Ramamohanarao, Kapil Kumar
Gupta, Tao Peng, Christopher Leckie. In Proceedings of the 3rd International Conference
on Information Systems Security, Lecture Notes in Computer Science, Springer Verlag,
vol (4812), pages 234 - 249, 2007.
7. Conditional Random Fields for Intrusion Detection Kapil Kumar Gupta, Baikunth Nath,
Kotagiri Ramamohanarao. In Proceedings of the IEEE 21st International Conference
on Advanced Information Networking and Applications Workshops, IEEE Computer
Society, vol (1), pages 203 - 208, 2007.
8. Network Security Framework Kapil Kumar Gupta, Baikunth Nath, Kotagiri Ramamoha-
narao. International Journal of Computer Science and Network Security (IJCSNS),
vol 6(7B), pages 151 - 157, 2006.
9. Attacking Confidentiality: An Agent Based Approach Kapil Kumar Gupta, Baikunth
Nath, Kotagiri Ramamohanarao, Ashraf Kazi. In Proceedings of the IEEE International
Conference on Intelligence and Security Informatics, Lecture Notes in Computer Sci-
ence, Springer Verlag, vol (3975), pages 285 - 296, 2006.
Acknowledgements
It gives me immense pleasure to thank and express my gratitude towards my supervisors, Assoc.
Prof. Baikunth Nath and Prof. Ramamohanarao Kotagiri, for their support throughout the
course of my study. Their constant motivation, support and expert guidance have helped me to
overcome all odds, making this journey a truly rewarding experience in my life. I thank them from
the bottom of my heart.
I would also like to thank my Ph.D. committee member Assoc. Prof. Chris Leckie for his
valuable feedback and critical reviews which have helped to improve the quality of the thesis.
I am grateful for the support received from the University of Melbourne via numerous chan-
nels including the Melbourne International Fee Remission Scholarship (MIFRS), tremendous sup-
port from the School of Graduate Research, supportive staff at the university libraries and various
other university resources. In particular, I thank the staff at the Department of Computer Science
and Software Engineering, Melbourne School of Engineering, who have been extremely helpful on
numerous occasions.
I am extremely grateful to National ICT Australia (NICTA) for the financial support
in the form of the prestigious NICTA Studentship and regular support to present the research at
various international conferences and to visit international laboratories.
I do not have words to express my gratitude towards my parents and my elder brother, whose
support and countless sacrifices have paved the way for me to pursue this study. It would not
have been possible for me to undertake this challenging task without their constant support.
I would like to thank my friends in the research lab, room 3.08, and in the department for
making it a fun place to work and for helping me to collect the data sets used in this
research. Finally, Alauddin Bhuiyan deserves a special mention and I shall cherish the frequent
tea breaks that we had together.
Contents
1 Introduction 1
1.1 Motivation and Problem Description . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Research Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Emerging Attacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Contributions to Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.1 Layered Framework for Intrusion Detection . . . . . . . . . . . . . . . . 5
1.3.2 Layered Conditional Random Fields for Network Intrusion Detection . . 5
1.3.3 Unified Logging Framework for Audit Data Collection . . . . . . . . . . 6
1.3.4 User Session Modeling for Application Intrusion Detection . . . . . . . . 7
1.4 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Background 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Intrusion Detection and Intrusion Detection System . . . . . . . . . . . . . . . . 11
2.2.1 Principles and Assumptions in Intrusion Detection . . . . . . . . . . . . 13
2.2.2 Components of Intrusion Detection Systems . . . . . . . . . . . . . . . . 13
2.2.3 Challenges and Requirements for Intrusion Detection Systems . . . . . . 14
2.3 Classification of Intrusion Detection Systems . . . . . . . . . . . . . . . . . . 15
2.3.1 Classification based upon the Security Policy definition . . . . . . . . . . 17
2.3.2 Classification based upon the Audit Patterns . . . . . . . . . . . . . . . . 19
2.4 Audit Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4.1 Properties of Audit Patterns useful for Intrusion Detection . . . . . . . . 22
2.4.2 Univariate or Multivariate Audit Patterns . . . . . . . . . . . . . . . . . 23
2.4.3 Relational or Sequential Representation . . . . . . . . . . . . . . . . . . 24
2.5 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.6 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.6.1 Frameworks for building Intrusion Detection Systems . . . . . . . . . . 26
2.6.2 Network Intrusion Detection . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6.3 Monitoring Access Logs . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.6.4 Application Intrusion Detection . . . . . . . . . . . . . . . . . . . . . . 33
2.7 Conditional Random Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3 Layered Framework for Building Intrusion Detection Systems 37
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2 Motivating Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3 Description of our Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3.1 Components of Individual Layers . . . . . . . . . . . . . . . . . . . . . 41
3.4 Advantages of Layered Framework . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5 Comparison with other Frameworks . . . . . . . . . . . . . . . . . . . . . . . . 43
3.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4 Layered Conditional Random Fields for Network Intrusion Detection 47
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 Motivating Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3 Data Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.4 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.4.1 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.4.2 Integrating the Layered Framework . . . . . . . . . . . . . . . . . . . . 55
4.5 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.5.1 Building Individual Layers of the System . . . . . . . . . . . . . . . . . 57
4.5.2 Implementing the Integrated System . . . . . . . . . . . . . . . . . . . . 64
4.6 Comparison and Analysis of Results . . . . . . . . . . . . . . . . . . . . . . . . 67
4.6.1 Significance of Layered Framework . . . . . . . . . . . . . . . . . . . . 70
4.6.2 Significance of Feature Selection . . . . . . . . . . . . . . . . . . . . . . 71
4.6.3 Significance of Our Results . . . . . . . . . . . . . . . . . . . . . . . . 72
4.7 Robustness of the System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.7.1 Addition of Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5 Unified Logging Framework and Audit Data Collection 79
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.2 Motivating Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.3 Proposed Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.3.1 Description of our Framework . . . . . . . . . . . . . . . . . . . . . . . 83
5.4 Audit Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.4.1 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.4.2 Normal Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.4.3 Attack Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6 User Session Modeling using Unified Log for Application Intrusion Detection 91
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.2 Motivating Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.3 Data Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.4 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.4.1 Feature Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.4.2 Session Modeling using a Moving Window of Events . . . . . . . . . . . 96
6.5 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.5.1 Experiments with Clean Data (p = 1) . . . . . . . . . . . . . . . . . . . 99
6.5.2 Experiments with Disguised Attack Data (p = 0.60) . . . . . . . . . . . . 102
6.6 Analysis of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.6.1 Effect of S on Attack Detection . . . . . . . . . . . . . . . . . . . . . 114
6.6.2 Effect of p on Attack Detection (0 < p ≤ 1) . . . . . . . . . . . . . . . 116
6.6.3 Significance of Using Unified Log . . . . . . . . . . . . . . . . . . . . . 118
6.6.4 Test Time Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.6.5 Discussion of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.7 Issues in Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.7.1 Availability of Training Data . . . . . . . . . . . . . . . . . . . . . . . . 123
6.7.2 Suitability of Our Approach for a Variety of Applications . . . . . . . . . 123
6.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
7 Conclusions 125
7.1 Directions for Future Research . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Bibliography 131
Appendices 147
A An Introduction to Conditional Random Fields 149
A.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
A.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
A.3 Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
A.3.1 Directed Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . 153
A.3.2 Undirected Graphical Models . . . . . . . . . . . . . . . . . . . . . . . 167
A.4 Conditional Random Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
A.4.1 Representation of Conditional Random Fields . . . . . . . . . . . . . . . 169
A.4.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
A.4.3 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
A.4.4 Tools Available for Conditional Random Fields . . . . . . . . . . . . . . 175
A.5 Comparing the Directed and Undirected Graphical Models . . . . . . . . . . . . 175
A.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
B Feature Selection for Network Intrusion Detection 177
B.1 Feature Selection for Probe Layer . . . . . . . . . . . . . . . . . . . . . . . . . 177
B.2 Feature Selection for DoS Layer . . . . . . . . . . . . . . . . . . . . . . . . . . 178
B.3 Feature Selection for R2L Layer . . . . . . . . . . . . . . . . . . . . . . . . . . 178
B.4 Feature Selection for U2R Layer . . . . . . . . . . . . . . . . . . . . . . . . . . 179
B.5 Template Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
C Feature Selection for Application Intrusion Detection 181
C.1 Template Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
List of Tables
2.1 Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.1 KDD 1999 Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2 Detecting Probe Attacks (with all 41 Features) . . . . . . . . . . . . . . . . . . . 58
4.3 Detecting Probe Attacks (with Feature Selection) . . . . . . . . . . . . . . . . . 59
4.4 Detecting DoS Attacks (with all 41 Features) . . . . . . . . . . . . . . . . . . . 60
4.5 Detecting DoS Attacks (with Feature Selection) . . . . . . . . . . . . . . . . . . 60
4.6 Detecting R2L Attacks (with all 41 Features) . . . . . . . . . . . . . . . . . . . 61
4.7 Detecting R2L Attacks (with Feature Selection) . . . . . . . . . . . . . . . . . . 62
4.8 Detecting U2R Attacks (with all 41 Features) . . . . . . . . . . . . . . . . . . . 63
4.9 Detecting U2R Attacks (with Feature Selection) . . . . . . . . . . . . . . . . . . 63
4.10 Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.11 Attack Detection at Individual Layers (Case:1) . . . . . . . . . . . . . . . . . . 66
4.12 Attack Detection at Individual Layers (Case:2) . . . . . . . . . . . . . . . . . . 67
4.13 Comparison of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.14 Layered Vs. Non Layered Framework . . . . . . . . . . . . . . . . . . . . . . . 70
4.15 Significance of Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . 71
4.16 Ranking Various Methods for Intrusion Detection . . . . . . . . . . . . . . . . . 73
6.1 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.2 Effect of S on Attack Detection for Data Set One, when p = 0.60 . . . . . . . . 114
6.3 Analysis of Performance of Different Methods . . . . . . . . . . . . . . . . . . . 115
6.4 Effect of S on Attack Detection for Data Set Two, when p = 0.60 . . . . . . . 116
6.5 Comparison of Test Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
B.1 Probe Layer Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
B.2 DoS Layer Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
B.3 R2L Layer Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
B.4 U2R Layer Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
List of Figures
1.1 Behaviour of an Intruding Agent . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1 Classification of Intrusion Detection Systems . . . . . . . . . . . . . . . . . 16
2.2 Knowledge Representation for a Resource (R) . . . . . . . . . . . . . . . . . . . 17
2.3 Representation of a Signature Based System . . . . . . . . . . . . . . . . . . . . 18
2.4 Representation of a Behaviour Based System . . . . . . . . . . . . . . . . . . . 18
2.5 Representation of a Hybrid System . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.6 Graphical Representation of a Conditional Random Field . . . . . . . . . . . . . 35
3.1 Layered Framework for Building Intrusion Detection Systems . . . . . . . . . . 40
3.2 Traditional Layered Defence Approach to Provide Enterprise Wide Security . . . 44
4.1 Conditional Random Fields for Network Intrusion Detection . . . . . . . . . . . 51
4.2 Representation of Probe Layer with Feature Selection . . . . . . . . . . . . . . . 53
4.3 Integrating Layered Framework with Conditional Random Fields . . . . . . . . . 55
4.4 Effect of Noise on Probe Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.5 Effect of Noise on DoS Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.6 Effect of Noise on R2L Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.7 Effect of Noise on U2R Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.1 Framework for Building Application Intrusion Detection System . . . . . . . . . 83
5.2 Representation of a Single Event in the Unified log . . . . . . . . . . . . . . 85
5.3 Representation of a Normal Session . . . . . . . . . . . . . . . . . . . . . . . . 88
5.4 Representation of an Anomalous Session . . . . . . . . . . . . . . . . . . . . . . 89
6.1 User Session Modeling using Conditional Random Fields . . . . . . . . . . . . . 95
6.2 Comparison of F-Measure (p = 1) . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.3 Comparison of F-Measure (p = 0.60) . . . . . . . . . . . . . . . . . . . . . . . 103
6.4 Results using Conditional Random Fields at p = 0.60 . . . . . . . . . . . . . . . 105
6.5 Results using Support Vector Machines at p = 0.60 . . . . . . . . . . . . . . . . 107
6.6 Results using Decision Trees at p = 0.60 . . . . . . . . . . . . . . . . . . . . . 109
6.7 Results using Naive Bayes Classier at p = 0.60 . . . . . . . . . . . . . . . . . 111
6.8 Results using Hidden Markov Models at p = 0.60 . . . . . . . . . . . . . . . . . 113
6.9 Effect of p: Results using Conditional Random Fields when 0 < p ≤ 1 . . . . 117
6.10 Significance of Using Unified Log . . . . . . . . . . . . . . . . . . . . . . . 119
A.1 Fully Connected Graphical Model . . . . . . . . . . . . . . . . . . . . . . . . . 153
A.2 Fully Disconnected Graphical Model . . . . . . . . . . . . . . . . . . . . . . . . 154
A.3 Naive Bayes Classier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
A.4 Maxent Classier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
A.5 Hidden Markov Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
A.6 Decoding in a Hidden Markov Model . . . . . . . . . . . . . . . . . . . . . . 162
A.7 Maximum Entropy Markov Model . . . . . . . . . . . . . . . . . . . . . . . . . 165
A.8 Label Bias Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
A.9 Undirected Graphical Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
A.10 Linear Chain Conditional Random Field . . . . . . . . . . . . . . . . . . . . . . 170
A.11 Factorization in Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . 176
Chapter 1
Introduction
In this thesis, we address three significant issues which severely restrict the utility of anomaly
and hybrid intrusion detection systems in present networks and applications. The three issues
are: limited attack detection coverage, a large number of false alarms and inefficiency in operation.
Present anomaly and hybrid intrusion detection systems have limited attack detection capability,
suffer from a large number of false alarms and cannot be deployed in high speed networks and
applications without dropping audit patterns. Hence, most existing intrusion detection systems,
such as USTAT, IDIOT, EMERALD, Snort and others, are developed using knowledge engineering
approaches where domain experts can build focused and optimized pattern matching models [1].
Though such systems result in very few false alarms, they are specific in attack detection and
often tend to be incomplete. As a result, their effectiveness is limited. Further, due to their manual
development process, signature based systems are expensive and slow to build. We thus address
these shortcomings and develop better anomaly and hybrid intrusion detection systems which are
accurate in attack detection, efficient in operation and have wide attack detection coverage.
1.1 Motivation and Problem Description
Intrusion detection, as defined by the SysAdmin, Audit, Network, and Security (SANS)
Institute, is the act of detecting actions that attempt to compromise the confidentiality, integrity
or availability of a resource [2]. Today, intrusion detection is one of the high priority and challenging
tasks for network administrators and security professionals.
The objective of an intrusion detection system is to provide data security and ensure continuity
of the services provided by a network [3]. Present networks provide critical services which are
necessary for businesses to perform optimally and are, thus, a target of attacks which aim to bring down
the services provided by the network. Additionally, with more and more data becoming available
in digital format and more applications being developed to access this data, the data and applications
are also a target of attackers who exploit these applications to gain access to the data. With
the deployment of more sophisticated security tools to protect the data and services, the
attackers often come up with newer and more advanced methods to defeat the installed security
systems [4], [5].
According to the Internet Systems Consortium (ISC) survey, the number of hosts on the Internet
exceeded 550,000,000 in July 2008 [6]. An earlier project, in 2002, estimated the size of
the Internet to be 532,897 TB [7]. The increasing dependence of businesses on services over the
Internet has led to their rapid growth; it has also made the networks and applications a
prime target of attacks. Configuration errors and vulnerabilities in software are exploited by
attackers who launch powerful attacks such as Denial of Service (DoS) [8] and Information
attacks [9]. The rapid increase in the number of vulnerabilities has resulted in an exponential rise in
the number of attacks. According to the Computer Emergency Response Team (CERT), the number
of vulnerabilities in software has been increasing and many of them exist in widely deployed
software [10], [11]. Considering that it is near impossible to build perfect software, it becomes
critical to build effective intrusion detection systems which can detect attacks reliably. For many
attackers, the prospect of obtaining valuable information as a result of a successful attack outweighs
the threat of legal conviction. The inability to prevent attacks furthers the need for intrusion detection. The
problem becomes more profound since authorized users can misuse their privileges and attackers
can masquerade as authentic users by exploiting vulnerable applications.
Given the diverse types of attacks (Denial of Service, Probing, Remote to Local, User to Root
and others), it is a challenge for any intrusion detection system to detect a wide variety of attacks
with very few false alarms in a real-time environment. Ideally, the system must detect all intrusions
with no false alarms. The challenge is, thus, to build a system which has broad attack detection
coverage and which, at the same time, results in very few false alarms. The system must also
be efficient enough to handle large amounts of audit data without affecting performance in the
deployed environment. The simplest way to ensure a high level of security, provided we can
ensure hardware security, is to disable all resource sharing and communication between any two
computers. However, this is in no way a solution for securing today's highly networked computing
environment and, hence, the need to develop better intrusion detection systems.
1.1.1 Research Objectives
In this thesis:
1. We aim to develop systems which have broad attack detection coverage and which are not
specific to detecting only previously known attacks.
2. We aim to reduce the number of false alarms generated by anomaly and hybrid intrusion
detection systems, thereby improving their attack detection accuracy.
3. We aim to develop anomaly intrusion detection systems which can operate efficiently in
high speed networks without dropping audit data.
Issues such as scalability, availability of training data, robustness to noise in the training data
and others are also implicitly addressed.
1.2 Emerging Attacks
For an intrusion detection system, it is important to detect previously known attacks with high
accuracy. However, detecting previously unseen attacks is equally important in order to minimize
the losses as a result of a successful intrusion.
In [5], we describe a scenario in which a software agent can be used to attack a specific
target without affecting any other network, with the purpose of searching for and transmitting confidential
and sensitive information without authorized access. Such an attack can be carried out by experts
with the motive to hide the entire attack and protect their identity from being discovered. Further,
since the attack targets only a single network, it would not be detected by large scale cooperative
intrusion detection systems. The most significant part of the entire attack is that none of the present
systems can detect such attacks, and the agent can destroy itself when the attack is successful
without leaving traces of its activities. Unlike worms, the replication of an intruding agent
is limited and it does not degrade performance at the target, making its detection very difficult.
We represent the behaviour of the intruding agent in Figure 1.1 by a flow diagram.
In addition to detecting the Denial of Service attacks, which target the availability aspect, and
the Information attacks, which target the confidentiality and integrity aspects, intrusion detection
systems must also be able to detect attacks which reflect a change in the motive of the attackers.
Such attacks are network specific and the attacker follows a criminal pursuit driven by the goal of
making money [4].
Figure 1.1: Behaviour of an Intruding Agent. The flow diagram shows the agent setting up a knowledge database, searching for and controlling a zombie, attempting to enter the target network (giving up after n > N attempts or on time out), updating the knowledge database and adjusting its behaviour, searching for the information, transmitting it and awaiting confirmation, replicating if required, and finally destroying itself and its traces.
This has not only increased the severity of attacks, but the attacks have also become
isolated, targeting only a few nodes in a single network. Such attacks are very difficult to detect
using generic systems and, hence, better intrusion detection systems must be developed which are
capable of detecting such specific attacks.
1.3 Contributions to Thesis
In order to launch an attack, an attacker often follows a sequence of events. The events in such a
sequence are highly correlated and long range dependencies exist between them. Further, in order
to prevent detection, the attacker can also hide the individual events within a large number of
normal events. As a result, considering the events in isolation affects classification and results in a
large number of false alarms. Additionally, the individual events themselves are vector quantities
and consist of multiple features which are monitored continuously. These features are also highly
correlated and must not be analyzed in isolation.
In order to operate in high speed networks, present anomaly based systems consider the events
individually, thereby discarding any correlation between sequential events. In cases when the
present systems do consider a sequence of events, they monitor only one feature, ignoring others,
which results in a poor model. Hence, we introduce efficient intrusion detection frameworks
and methods which consider a sequence of events and which analyze multiple features without
assuming any independence among the features.
1.3.1 Layered Framework for Intrusion Detection
In Chapter 3, we introduce our Layered Framework for building intrusion detection systems,
which can be used, for example, as a network intrusion detection system and can detect a wide
variety of attacks reliably and efficiently when compared to traditional network intrusion detection
systems. In our layered framework, we use a number of separately trained and sequentially
arranged sub systems in order to decrease the number of false alarms and increase the attack
detection coverage. In particular, our layered framework has the following advantages (a small
illustrative sketch of such a pipeline follows the list):
- The framework is customizable and domain specific knowledge can be easily incorporated
to build individual layers, which helps to improve accuracy.
- Individual intrusion detection sub systems are light weight and can be trained separately.
- Different anomaly and hybrid intrusion detectors can be incorporated in our framework.
- Our framework not only helps to detect an attack but also helps to identify the type of
attack. As a result, specific intrusion response mechanisms can be initiated automatically,
thereby reducing the impact of an attack.
- Our framework is scalable and the number of layers can be increased (or decreased) in the
overall framework.
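To make the structure concrete, the following is a minimal sketch of how such a pipeline of separately trained, sequentially arranged sub systems might be wired together. It is an illustration only, not the implementation described in Chapter 3; the class names, the per-layer detector interface and the toy rules are assumptions.

    from typing import Callable, List, Tuple

    class Layer:
        """One separately trained sub system, specialised for a single attack class."""
        def __init__(self, attack_class: str, detector: Callable[[dict], bool]):
            self.attack_class = attack_class      # e.g. "Probe", "DoS", "R2L", "U2R"
            self.detector = detector              # returns True if the event looks like this attack class

        def is_attack(self, event: dict) -> bool:
            return self.detector(event)

    class LayeredIDS:
        """Pass each event through the layers in order; stop at the first layer that flags it."""
        def __init__(self, layers: List[Layer]):
            self.layers = layers

        def classify(self, event: dict) -> Tuple[str, str]:
            for layer in self.layers:
                if layer.is_attack(event):
                    # the attack is both detected and identified, so a
                    # class-specific response can be triggered immediately
                    return ("attack", layer.attack_class)
            return ("normal", "")

    # Illustrative toy detectors; a real system would plug in trained models here.
    probe_layer = Layer("Probe", lambda e: e.get("num_failed_connections", 0) > 50)
    dos_layer = Layer("DoS", lambda e: e.get("connection_rate", 0.0) > 1000.0)
    ids = LayeredIDS([probe_layer, dos_layer])

    print(ids.classify({"connection_rate": 5000.0}))   # ('attack', 'DoS')
    print(ids.classify({"connection_rate": 3.0}))      # ('normal', '')

In the full system, each detector is a separately trained model and layers for the Probe, DoS, R2L and U2R attack classes are arranged in sequence.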
1.3.2 Layered Conditional Random Fields for Network Intrusion Detection
Network monitoring is one of the common and widely applied methods for detecting malicious
activities in an entire network. However, real-time monitoring of every single event, even in a
moderate size network, may not be feasible, simply due to the large amount of network traffic. As
a result, it is often only possible to perform pattern matching using attack signatures, which may at best
detect only previously known attacks. Anomaly based systems end up dropping audit data when
they are used to analyze every event. As a result, network monitoring often involves analyzing only
summary statistics from the audit data. The summary statistics may include features of a single
TCP session between two IP addresses or network level features such as the load on the
server, the number of incoming connections per unit time and others. Such statistics are represented
in the KDD 1999 data set [12]. In Chapter 4, we introduce the Layered Conditional Random
Fields, which can be used to build accurate anomaly intrusion detection systems that operate
efficiently in high speed networks. In particular, our system has the following advantages (a small
illustrative sketch of training one layer follows the list):
- The attack detection accuracy improves for individual sub systems when using conditional
random fields.
- The overall system has wide attack detection coverage, where every sub system is trained
to detect attacks belonging to a single attack class.
- Attacks can be detected efficiently in high speed networks.
- Our system is robust to noise and performs better than the other systems we compare against.
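As an illustration of how a single layer could be trained as a sequence labeller, the sketch below uses the third-party sklearn-crfsuite package on a few KDD-style connection features. The chosen features, the toy training data and the hyper-parameters are assumptions for illustration, not the feature sets or settings used in Chapter 4.

    # pip install sklearn-crfsuite  (assumed third-party dependency)
    import sklearn_crfsuite

    def to_features(record):
        """Map one connection record to a feature dict; the features are used jointly,
        without assuming independence between them."""
        return {
            "protocol_type": record["protocol_type"],
            "service": record["service"],
            "flag": record["flag"],
            "count": str(record["count"]),        # crfsuite treats string values as categorical
            "srv_count": str(record["srv_count"]),
        }

    # Each training example is a sequence of connection records with per-record labels
    # ("normal" or the attack class this layer is trained to detect, e.g. "dos").
    train_sequences = [[
        {"protocol_type": "tcp", "service": "http", "flag": "SF", "count": 2, "srv_count": 2},
        {"protocol_type": "tcp", "service": "http", "flag": "S0", "count": 150, "srv_count": 150},
    ]]
    train_labels = [["normal", "dos"]]

    X_train = [[to_features(r) for r in seq] for seq in train_sequences]
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
    crf.fit(X_train, train_labels)

    # Label a new sequence of connections with this layer's model.
    test_seq = [{"protocol_type": "tcp", "service": "http", "flag": "S0", "count": 200, "srv_count": 180}]
    print(crf.predict([[to_features(r) for r in test_seq]]))

A separate model of this kind, with its own selected features, would be trained for each layer of the system.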
1.3.3 Unified Logging Framework for Audit Data Collection
In order to access application data, a user has no option but to access the application which interacts
with the application data. Hence, application accesses and the corresponding data accesses are
highly correlated. In order to detect attacks effectively, we aim to capture this correlation between
the application accesses and the corresponding data accesses. Hence, in Chapter 5, we present our
Unified Logging Framework, which efficiently integrates the application and the data access logs.
We have collected two such data sets which can be downloaded and used freely [13]. In particular,
our unified logging framework has the following advantages (a small illustrative sketch of a unified log event follows the list):
- By using the unified log, the objective is to capture the user-application and the application-data
interaction in order to improve attack detection. Further, this interaction is fixed and
does not vary over time, as opposed to user profiles, which change frequently.
- Our framework is application independent and can be deployed for a variety of applications.
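As a rough illustration of the idea, the sketch below shows one possible way to represent a unified log event that ties an application access to the data accesses it causes. The field names (session_id, app_request, data_accesses and label) are hypothetical and are not the exact layout defined in Chapter 5.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class DataAccess:
        table: str          # e.g. "accounts"
        operation: str      # e.g. "SELECT", "UPDATE"

    @dataclass
    class UnifiedLogEvent:
        """One event in the unified log: an application access together with the
        data accesses it caused, so the application-data interaction is preserved."""
        session_id: str
        app_request: str                      # e.g. the requested URL or form action
        data_accesses: List[DataAccess] = field(default_factory=list)
        label: str = "normal"                 # "normal" or "attack" in the training data

    # A single user session is then an ordered sequence of such events.
    session = [
        UnifiedLogEvent("s42", "/login", [DataAccess("users", "SELECT")]),
        UnifiedLogEvent("s42", "/account/balance", [DataAccess("accounts", "SELECT")]),
    ]
    print(len(session), session[0].app_request)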
1.3.4 User Session Modeling for Application Intrusion Detection
Network monitoring is often restricted to monitoring summary statistics due to the excessive amount
of network traffic, and is further affected by network address translation and encryption, making
it difficult to provide a high level of security. Thus, it becomes necessary to extend network
monitoring and focus on the data and applications, which are often the target of attacks. Further, as we
have already mentioned, many attacks require a number of sequential operations to be performed.
In Chapter 6, we introduce User Session Modeling using Conditional Random Fields, which analyzes
the unified log to detect application level attacks. In particular, our system has the following
advantages (a small illustrative sketch of the moving window of events follows the list):
- Conditional random fields outperform other well known anomaly detection
approaches, including decision trees, naive Bayes classifiers, support vector machines and
hidden Markov models. Our system based on conditional random fields is particularly
effective when attacks span a sequence of events (such as password guessing, followed
by launching an exploit to gain administrative privileges on the target, and finally leading
to unauthorized access of data).
- Our approach is robust in detecting disguised attacks.
- Using our system, attacks can be blocked in real-time.
- By performing session modeling with conditional random fields in our unified logging framework,
attacks can be detected at smaller window widths, resulting in an efficient
system which does not require a large amount of history to be maintained.
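The moving window idea can be pictured with the following sketch: a window of the most recent events in the session is scored, and the session is flagged as soon as any window is labelled an attack, which is what allows early detection and real-time blocking. The window scorer below is a toy placeholder, not the conditional random field model used in Chapter 6.

    from typing import Callable, List, Sequence

    def detect_in_session(events: Sequence[dict],
                          score_window: Callable[[List[dict]], bool],
                          window_size: int = 5) -> int:
        """Slide a window of `window_size` events over the session and return the index
        of the first event at which the window is flagged as an attack, or -1 if none is."""
        for end in range(1, len(events) + 1):
            window = list(events[max(0, end - window_size):end])
            if score_window(window):
                return end - 1        # attack detected without waiting for the session to finish
        return -1

    # Placeholder scorer: flag a window with repeated failed logins followed by a sensitive read.
    def toy_scorer(window: List[dict]) -> bool:
        failed = sum(1 for e in window if e.get("event") == "login_failed")
        reads = any(e.get("event") == "read_sensitive_data" for e in window)
        return failed >= 3 and reads

    session = [{"event": "login_failed"}] * 3 + [{"event": "login_ok"},
                                                 {"event": "read_sensitive_data"}]
    print(detect_in_session(session, toy_scorer))   # 4: flagged at the fifth event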
1.4 Thesis Organization
This thesis is organized as follows. We first present the taxonomy of intrusion detection and give
the related literature review in Chapter 2. We then describe our layered framework, which can be
used to build effective and efficient intrusion detection systems, in Chapter 3. In Chapter 4, we
describe how conditional random fields can be integrated in our layered framework. We present
our experimental results and demonstrate that layered conditional random fields outperform well
known methods for intrusion detection and are a strong candidate for building robust and efficient
network intrusion detection systems. We then describe our unified logging framework in Chapter
5, which integrates the application access logs and the corresponding data access logs to provide
a unified audit log. The unified log captures the necessary user-application and application-data
interaction which is useful to detect application level attacks effectively. In Chapter 6, we then
use conditional random fields and perform user session modeling, using a moving window of
events in our unified logging framework, to build real-time application intrusion detection systems.
Our experimental results suggest that, by performing user session modeling using conditional random
fields, attacks can be detected by analyzing only a small number of events in a user session, which
results in an efficient and accurate system. Finally, in Chapter 7, we conclude and give possible
directions for future research.
Chapter 2
Background
Detecting intrusions in networks and applications has become one of the most critical
tasks to prevent their misuse by attackers. The cost involved in protecting these valuable
resources is often negligible when compared with the actual cost of a successful intrusion, which
strengthens the need to develop more powerful intrusion detection systems. Intrusion detection
started in the 1980s and since then a number of approaches have been introduced to build intrusion
detection systems [1], [14], [15], [16], [17], [18], [19], [20]. However, intrusion detection is still
in its infancy and naive attackers can launch powerful attacks which can bring down an entire
network [5]. To identify the shortcomings of different approaches for intrusion detection, we explore
the related research in intrusion detection. We describe the problem of intrusion detection in detail
and analyze various well known methods for intrusion detection with respect to two critical requirements,
viz. accuracy of attack detection and efficiency of system operation. We observe that
present methods for intrusion detection suffer from a number of drawbacks which significantly affect
their attack detection capability. Hence, we introduce conditional random fields for effective
intrusion detection and motivate our approach for building intrusion detection systems which can
operate efficiently and which can detect a wide variety of attacks with relatively higher accuracy,
both at the network and at the application level.
2.1 Introduction
Present networks are increasingly based on the concept of resource sharing, as it is a necessity
for collaboration and provides an easy means of communication and economic growth.
However, the need to communicate and share resources increases the complexity of the system.
The systems are getting bigger, with more and more add-on features making them complex. This
results in vulnerabilities in software and configuration errors in networks and deployed applications.
Ease of access to resources, in addition to vulnerabilities and poor management of resources,
can be exploited to launch attacks [3]. Further, features intended for some specific usage in many
applications may also be exploited for misuse of systems. A typical example of this is the response
generated by an SQL server, which is often exploited in SQL injection attacks. As a result, the
number of attacks has increased significantly [10]. Additionally, the attacks have become more
complex and difficult to detect using traditional intrusion detection approaches, demanding more
effective solutions [5]. More stringent monitoring has further increased the resources required by
intrusion detection systems. However, the addition of more resources may not always provide the
desired level of security.
The notion of intrusion detection was born in the 1980s with a paper from Anderson [21],
which described that audit trails contain valuable information and could be utilized for the purpose
of misuse detection by identifying anomalous user behaviour. The lead was then taken by
Denning at SRI International, and the first model of intrusion detection, the Intrusion Detection
Expert System (IDES) [22], [23], was born in 1984. Another project at the Lawrence Livermore
Laboratories developed the Haystack intrusion detection system in 1988 [24]. This further led
to the concept of the distributed intrusion detection system, which augmented the existing solution by
tracking client machines as well as the servers. The last system to be released under the same
generation, called Stalker, was released in 1989, which was again a host based, pattern matching
system [25]. Until then, the majority of the systems were host based and analyzed the individual
host level audit records. Todd Heberlein, in 1990, introduced the concept of network intrusion
detection and came up with the system called the Network Security Monitor (NSM) [26], [27].
These developments gradually paved the way for intrusion detection systems to enter the
commercial market, with products such as Net Ranger, Real Secure and Snort acquiring big
market shares [25], [28].
Present intrusion detection systems are very often based on analyzing individual audit patterns
by extracting signatures, or are based on analyzing summary statistics collected at the network or
at the application level [9], [29]. Such systems are unable to detect attacks reliably because they
neglect the sequence structure in the audit patterns and consider every pattern to be independent.
In most situations such independence assumptions do not hold, which severely affects the attack
detection capability of an intrusion detection system.
Another approach for intrusion detection is based on analyzing the sequence structure in the audit
patterns. Methods based on analyzing the sequence of system calls issued by privileged processes
are well known [30], [31]. However, to reduce system complexity, such systems consider only one
feature, namely the sequence of system calls. Other features, such as the arguments of the system
calls, are ignored. In cases when multiple features are considered, the features are assumed independent
and separate models are built using individual features. Results from all the models are
then combined using a voting mechanism. This again may not detect attacks reliably. To improve
attack detection, all of the features must be considered collectively and not independently [32],
[33]. Assuming events to be independent makes the model simple and improves the speed of operation,
but at the cost of reduced attack detection and an increased number of false alarms. Frequent
false alarms, in turn, lead system administrators to ignore the alarms altogether.
Present networks and applications are, thus, far away from a state where they can be considered
secure. Hence, in this chapter we explore the problem of intrusion detection to identify the root
causes of the inability of present intrusion detection systems to detect attacks reliably. We then
motivate the use of conditional random fields [34] for building effective network and application
intrusion detection systems [32], [33], [35], [36], [37].
The rest of the chapter is organized as follows. In Section 2.2, we give the taxonomy of intrusion
detection, which is described in detail in [38]. We then give the classification of intrusion detection systems in Section
2.3, followed by the properties of the audit patterns which can be used to detect attacks in Section
2.4. We present the evaluation metrics for analyzing intrusion detection systems in Section
2.5 and give a detailed literature review for intrusion detection in Section 2.6. We then describe
conditional random fields in Section 2.7. Finally, we conclude this chapter in Section 2.8.
2.2 Intrusion Detection and Intrusion Detection System
Intrusion detection systems are a critical component in the network security arsenal. Security
is often implemented as a multi-layer infrastructure, and the different approaches for providing security
can be categorized into the following six areas [39]:
1. Attack Deterrence: Attack deterrence refers to persuading an attacker not to launch an
attack by increasing the perceived risk of negative consequences for the attacker. Having
a strong legal system may be helpful in attack deterrence. However, it requires strong
evidence against the attacker in case an attack is launched. Research in this area focuses
on methods such as those discussed in [40] which can effectively trace the true source of an
attack, as very often the attacks are launched with a spoofed source IP address. (Spoofing
refers to sending IP packets with a modified source IP address so that the true sender of the
packet cannot be traced.)
2. Attack Prevention: Attack prevention aims to prevent an attack by blocking it before
it can reach the target. However, it is very difficult to prevent all attacks. This
is because, to prevent an attack, the system requires complete knowledge of all possible
attacks as well as complete knowledge of all the allowed normal activities, which is not
always available. An example of an attack prevention system is a firewall [41].
3. Attack Deflection: Attack deflection refers to tricking an attacker by making the attacker
believe that the attack was successful though, in reality, the attacker was trapped by the
system and deliberately made to reveal the attack. Research in this area focuses on attack
deflection systems such as honey pots [42].
4. Attack Avoidance: Attack avoidance aims to make the resource unusable by an attacker
even though the attacker is able to illegitimately access that resource. An example of a
security mechanism for attack avoidance is the use of cryptography [43]. Encrypting data
renders the data useless to the attacker, thus avoiding the possible threat.
5. Attack Detection: Attack detection refers to detecting an attack while the attack is still in
progress or detecting an attack which has already occurred in the past. Detecting an attack
is significant for two reasons: first, the system must recover from the damage caused by
the attack and, second, it allows the system to take measures to prevent similar attacks in
the future. Research in this area focuses on building intrusion detection systems.
6. Attack Reaction and Recovery: Once an attack is detected, the system must react to the
attack and perform the recovery mechanisms as defined in the security policy.
Tools available to perform attack detection followed by reaction and recovery are known as
intrusion detection systems. However, the difference between intrusion prevention and intrusion
detection is slowly diminishing, as present intrusion detection systems increasingly focus on
real-time attack detection and blocking an attack before it reaches the target. Such systems are
better known as Intrusion Prevention Systems.
2.2.1 Principles and Assumptions in Intrusion Detection
Denning [22] defines the principle for characterizing a system under attack. The principle states
that for a system which is not under attack, the following three conditions hold true:
1. Actions of users conform to statistically predictable patterns.
2. Actions of users do not include sequences which violate the security policy.
3. Actions of every process correspond to a set of specifications which describe what the
process is allowed to do.
Systems under attack fail to meet at least one of the three conditions. Further, intrusion detection
is based upon some assumptions which are true regardless of the approach adopted by the
intrusion detection system. These assumptions are:
1. There exists a security policy which defines the normal and (or) the abnormal usage of
every resource.
2. The patterns generated during abnormal system usage are different from the patterns
generated during normal usage of the system; i.e., the abnormal and normal usage of a
system result in different system behaviour. This difference in behaviour can be used to
detect intrusions.
As we shall discuss later, different methods can be used to detect intrusions, each of which makes a
number of assumptions that are specific only to that particular method. Hence, in addition to the
definition of the security policy and the access patterns which are used in the learning phase of
the detector, the attack detection capability of an intrusion detection system also depends upon the
assumptions made by the individual methods for intrusion detection [44].
2.2.2 Components of Intrusion Detection Systems
An intrusion detection system typically consists of three sub systems or components:
1. Data Preprocessor: The data preprocessor is responsible for collecting and providing the audit
data (in a specified form) that will be used by the next component (the analyzer) to make a
decision. The data preprocessor is, thus, concerned with collecting the data from the desired
source and converting it into a format that is comprehensible by the analyzer.
Data used for detecting intrusions range from user access patterns (for example, the sequence
of commands issued at the terminal and the resources requested) to network packet
level features (such as the source and destination IP addresses, the type of packets and the rate of
occurrence of packets) to application and system level behaviour (such as the sequence of
system calls generated by a process). We refer to this data as the audit patterns.
2. Analyzer (Intrusion Detector): The analyzer, or the intrusion detector, is the core component
which analyzes the audit patterns to detect attacks. This is a critical component and
one of the most researched. Various pattern matching, machine learning, data mining and
statistical techniques can be used as intrusion detectors. The capability of the analyzer to
detect an attack often determines the strength of the overall system.
3. Response Engine: The response engine controls the reaction mechanism and determines
how to respond when the analyzer detects an attack. The system may decide either to raise
an alert without taking any action against the source or to block the source for
a predefined period of time. Such an action depends upon the predefined security policy of
the network.
In [45], the authors define the Common Intrusion Detection Framework (CIDF), which recognizes
a common architecture for intrusion detection systems. The CIDF defines four components
that are common to any intrusion detection system. The four components are: Event generators (E-boxes),
event Analyzers (A-boxes), event Databases (D-boxes) and Response units (R-boxes).
The additional component, the D-boxes, is optional and can be used for later analysis. A minimal
structural sketch of the three core components is given below.
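The following sketch shows one way the three components might be composed into a pipeline. The interfaces (preprocess, analyze, respond) and the toy rule are illustrative assumptions, not the CIDF interfaces or any particular system's API.

    from typing import Iterable, Optional

    class DataPreprocessor:
        """Collects raw audit data and converts it into the format the analyzer expects."""
        def preprocess(self, raw_record: str) -> dict:
            fields = raw_record.strip().split(",")
            return {"src_ip": fields[0], "dst_ip": fields[1], "service": fields[2]}

    class Analyzer:
        """Core intrusion detector: decides whether an audit pattern is an attack."""
        def analyze(self, pattern: dict) -> Optional[str]:
            # toy rule standing in for a pattern matching / machine learning model
            return "suspicious-service" if pattern["service"] == "telnet" else None

    class ResponseEngine:
        """Reacts according to the security policy once the analyzer flags an attack."""
        def respond(self, pattern: dict, alert: str) -> None:
            print(f"ALERT {alert}: blocking {pattern['src_ip']} as per policy")

    def run_ids(raw_records: Iterable[str]) -> None:
        pre, analyzer, responder = DataPreprocessor(), Analyzer(), ResponseEngine()
        for raw in raw_records:
            pattern = pre.preprocess(raw)
            alert = analyzer.analyze(pattern)
            if alert is not None:
                responder.respond(pattern, alert)

    run_ids(["10.0.0.5,10.0.0.9,http", "10.0.0.7,10.0.0.9,telnet"])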
2.2.3 Challenges and Requirements for Intrusion Detection Systems
The purpose of an intrusion detection system is to detect attacks. However, it is equally important
to detect attacks at an early stage in order to minimize their impact. The major challenges and
requirements for building intrusion detection systems are:
1. The system must be able to detect attacks reliably without giving false alarms. It is very
important that the false alarm rate is low, as in a live network with a large amount of traffic
the number of false alarms may exceed the total number of attacks detected correctly,
thereby decreasing the confidence in the attack detection capability of the system. Ideally,
the system must detect all intrusions with no false alarms. The challenge is to build a system
which has broad attack detection coverage, i.e. it can detect a wide variety of attacks,
and which, at the same time, results in very few false alarms.
2. The system must be able to handle large amounts of data without affecting performance and
without dropping data, i.e. the rate at which the audit patterns are processed and a decision
is made must be greater than or equal to the rate of arrival of new audit patterns. Hence the
speed of operation is critical for systems deployed in high speed networks. In addition, the
system must be capable of operating in real-time by initiating a response mechanism once
an attack is detected. The challenge is to prevent an attack rather than simply detecting it.
3. A system which can link an alert generated by the intrusion detector to the actual security
incident is desirable. Such a system would help in quick analysis of the attack and may
also provide an effective response to the intrusion, as opposed to a system which offers no
after-attack analysis. Hence, it is not only necessary to detect an attack, but it is also important
to identify the type of attack.
4. It is desirable to develop a system which is resistant to attacks since a system that can be
exploited during an attack may not be able to detect attacks reliably.
5. Every network and application is different. The challenge is to build a system which is
scalable and which can be easily customized as per the specific requirements of the environment
where it is deployed.
2.3 Classification of Intrusion Detection Systems
Classifying intrusion detection systems helps to better understand their capabilities and limitations.
We, therefore, present the classification of intrusion detection systems in Figure 2.1.
From Figure 2.1, we observe that for any intrusion detection system, the security policy and the audit
patterns are the two prime information sources. The audit patterns must be analyzed to detect an
attack, and the security policy defines the acceptable and non-acceptable usage of a resource and
helps to qualify whether an event is normal or an attack. Hence, based on the given classification,
an example of an intrusion detection system can be a centralized system deployed on a network
with sliding window based data collection, which operates in real-time and is based on signature
analysis with active response to intrusion.
Figure 2.1: Classification of Intrusion Detection Systems. The figure shows the two information sources (audit patterns and the security policy) and classifies systems by knowledge of the resources (signature based, behaviour based, hybrid), audit source location (network based, host based, application based), frequency of audit data collection (periodic snapshot based, session based, sliding window based), frequency of analysis (batch mode, near real time, real time), number of audit sources (centralized, distributed, alert correlation) and response on intrusion (passive, active).
2.3.1 Classification based upon the Security Policy definition
Intrusion detection systems are classified in two ways based upon the security policy definition.
1. Security policy defines the normal and abnormal usage of every resource. Consider a set U,
which represents the complete domain (universe) for a resource R. The set U consists of
both the normal and the abnormal usage of R. Hence, U = U_Rnormal ∪ U_Rattack. The problem is
to identify the set U such that it is complete and unambiguous. However, in most practical
situations it is very difficult to identify and define the complete set U, and only a small
portion of this set is available, which is denoted as S. Hence, the security policy is defined
with only the knowledge contained in the subset S, where S = S_Rnormal ∪ S_Rattack. This
is represented in Figure 2.2.
[Figure 2.2: Knowledge Representation for a Resource (R) — panel (a) Total Knowledge shows U_Rnormal and U_Rattack; panel (b) Available Knowledge shows S_Rnormal and S_Rattack.]
where |U_Rnormal| ≥ |S_Rnormal| and |U_Rattack| ≥ |S_Rattack|.
Based upon the elements of the subset S, intrusion detection systems can be classified as:
(a) Signature (Misuse) Based When the set S only contains events which are known to be attacks, the system focuses on detecting known misuses and is known as a signature or misuse based system [42]. Signature based systems are represented in Figure 2.3.
Signature based systems employ pattern matching approaches to detect attacks. They can detect attacks with very few false alarms but have limited attack detection capability since they cannot detect unseen attacks. Their attack detection capability is directly proportional to the available knowledge of attacks in the set S, i.e. the knowledge of S_Rattack. To be effective, such systems require complete knowledge of attacks, i.e. S_Rattack should be equal to U_Rattack, which is not always possible.
[Figure 2.3: Representation of a Signature Based System — regions for correctly detected attacks, correctly detected normals and missed attacks.]
(b) Behaviour (Anomaly) Based When the set S only consists of events which are known to be normal, the goal of the intrusion detection system is to identify significant deviations from the known normal behaviour [42], as shown in Figure 2.4.
[Figure 2.4: Representation of a Behaviour Based System — regions for correctly detected normals, correctly detected attacks, false alarms and missed attacks.]
For behaviour based systems to be effective, complete knowledge of the normal behaviour of a resource is required, i.e. the set S_Rnormal should be equal to the set U_Rnormal. Since the complete knowledge of a resource may not be available, a threshold is used which gives some flexibility to the system. Events which lie beyond the threshold are detected as attacks. Hence, behaviour based systems, in general, suffer from a large false alarm rate. False alarms can be reduced by increasing the threshold; however, this affects the attack detection and the system may not be able to detect a wide variety of attacks. Hence, there is a tradeoff between limiting the number of false alarms and the capability of the system to detect a variety of attacks.
(c) Hybrid In most environments, it may not be possible to completely define either the normal or the abnormal behaviour. As a result, an intrusion detection system may generate a large number of false alarms or may be specific in detecting only a few types of attacks. A hybrid system uses the partial knowledge of both, i.e., S_Rnormal and S_Rattack, to detect attacks, often resulting in fewer false alarms and detecting more attacks. Such systems generally employ machine learning approaches. A hybrid system is represented in Figure 2.5.
[Figure 2.5: Representation of a Hybrid System — regions for correctly detected attacks, correctly detected normals, false alarms and missed attacks.]
2. The security policy also defines how the system must respond when an attack is detected, based upon which intrusion detection systems can be classified as:
(a) Passive Response Systems In a passive response system, the system does not take any measure to respond to an attack once it is detected. It simply generates an alert which can be analyzed by the administrator at some later stage [39], [42].
(b) Active Response Systems In active response systems, the intrusion detection system responds to attacks using various possible approaches, which may include blocking the source of the attack for a predefined time period [39], [42].
2.3.2 Classification based upon the Audit Patterns
1. The source from which the audit patterns are collected affects the attack detection capability of a system. For example, when network statistics are used as the audit patterns, they cannot provide any detail about the user and system interaction. Based on this, intrusion detection systems are classified as:
(a) Network Based In a network based system, the audit patterns collected at the network level are used by the intrusion detector [46], [47]. Though a single system (or a few strategically placed systems) is sufficient for the entire network, the attack detection capability of a network based system is limited. This is because it is hard to infer the contextual information directly from the network audit patterns. Further, the audit patterns may be encrypted, rendering them unusable by the intrusion detector at the network level. In addition, the large amount of audit patterns at the network level may also affect the total attack detection accuracy. This is because of two reasons; first, a significant portion of the total incoming patterns may be allowed to pass into the network without any analysis and second, in high speed networks, it may be practical to analyze only the summary statistics collected at regular time intervals. These statistics may include features such as the total number of connections and the amount of incoming and outgoing traffic. Such features only provide a high level summary which may not be able to detect attacks reliably [42].
(b) Host Based The intrusion detector in a host based system analyzes the audit patterns generated at the kernel level of the system, which include system access logs and error logs [42]. The audit patterns collected at an individual host contain more specific information than the network level audit patterns, which may be used to detect attacks reliably. However, it becomes difficult to manage a large number of host based systems in a big network. Additionally, host based systems can themselves be the victims of an attack.
(c) Application Based Application based systems are concerned only with a single application and detect attacks directed at a particular application or a privileged process [31]. They can analyze either the application access logs or the system calls generated by the processes to detect anomalous activities. Application based systems can be very effective as they can exploit the complete knowledge of the application and can be used even when encryption is used in communication. They can also analyze the user and application interactions, which can significantly improve the attack detection accuracy.
2. In order to detect intrusions, the audit patterns can be collected from a single source or from a number of sources. When the audit patterns are collected from more than one source, the decision can be made by individual nodes or by aggregating the audit patterns at a single point and then analyzing them together. Based upon this property, intrusion detection systems can be classified as:
(a) Centralized System In a centralized system, the audit patterns are collected either from a single source or from multiple sources, but are processed at a single point where they are analyzed together to determine the global state of the network [42]. However, such systems may themselves become a target of attacks.
(b) Distributed System In contrast to centralized systems, distributed systems can make local decisions close to the source of the audit patterns and may report only a small summary of activities to a higher level in the system. The advantage of a distributed system for intrusion detection is that an immediate response mechanism can be activated based upon local decisions. However, distributed systems can be less accurate due to the lack of global knowledge. Agent based systems are examples of distributed intrusion detection systems [42].
(c) Alert Correlation Alert correlation based systems analyze the alerts generated by a number of cooperating intrusion detection systems [39]. The individual systems may themselves be centralized or decentralized. Alert correlation systems can only be effective when multiple networks are attacked with similar attacks, such as in the case of a worm outbreak. In case the attacks are network specific, the alert correlation systems will not be effective even though a few target networks may detect some anomalous activities. In such cases, the local alerts will be discarded as false alarms due to the lack of global consensus.
3. Regardless of the source and the number of audit patterns, intrusion detection systems can be classified depending upon the frequency at which the audit patterns are collected. Based on this, they are classified as:
(a) Session Based Audit patterns can be collected at the end of every session by summarizing different features. Methods can then be used which analyze the summary of every session once the session is terminated.
(b) Sliding Window Based In case of sliding window based collection of audit patterns, events are recorded using a moving window of fixed or variable width. The width of the window defines the number of events recorded together and the step size for sliding the window determines how fast the window is advanced forward (a minimal sketch of such a window is given at the end of this section).
(c) Periodic Snapshot Based Instead of recording every event or summarizing a session at its termination, snapshots of different states of the entire system can be taken at regular intervals, which can then be analyzed to detect intrusions.
4. Depending upon the frequency of analysis of the audit patterns, intrusion detection systems can be classified as:
(a) Batch Mode In batch mode intrusion detection, the audit patterns are aggregated in a central repository. The patterns are then analyzed for intrusions at predefined time intervals. Such systems cannot provide any immediate response to intrusion and can only perform the recovery task once an attack is detected.
(b) Near Real-time An intrusion detection system is said to perform in near real-time when the system cannot detect an intrusion at the moment it commences, but can detect it at some later stage during the attack or immediately at the end of an attack. In such systems, there is some delay before the patterns are made available to the intrusion detector. Patterns collected by taking periodic snapshots or using a moving window with a step size greater than one can be used for near real-time intrusion detection.
(c) Real-time A real-time intrusion detection system must detect an attack as soon as it commences, i.e. the system is said to perform in real-time if and only if, for an event x when the attack commenced, the attacker cannot succeed with the event x+1. Hence, for real-time intrusion detection, the system must detect an attack immediately. However, in practice it is very difficult to build such a system given the constraint that it should have a low false alarm rate and high attack detection accuracy. Real-time intrusion detection systems can be implemented by using a moving window with a step of size one. Network based signature detection systems, which perform pattern matching, can also perform in real-time by checking every event for known attacks. However, they are limited to detecting only those attacks whose signatures are known in advance. A typical example is Snort [48].
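As a minimal illustration of the sliding window based collection described above, the Python sketch below yields fixed-width windows over a toy audit trail; a step size of one corresponds to the per-event analysis needed for real-time detection, while a larger step trades latency for fewer invocations of the detector. The event names are invented for illustration.

from typing import Iterator, Sequence

def sliding_windows(events: Sequence, width: int, step: int = 1) -> Iterator[Sequence]:
    """Yield fixed-width windows of audit events.

    width -- number of events analyzed together
    step  -- how far the window advances each time; step == 1 supports
             real-time, per-event analysis, step > 1 gives near real-time.
    """
    for start in range(0, max(len(events) - width + 1, 0), step):
        yield events[start:start + width]

# Toy usage: a detector would be invoked once per window.
audit_trail = ["login", "read", "read", "write", "logout", "login"]
for window in sliding_windows(audit_trail, width=3, step=1):
    print(list(window))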
2.4 Audit Patterns
The raw patterns must be preprocessed and presented in a format which can be interpreted by the
intrusion detector before they can be analyzed.
2.4.1 Properties of Audit Patterns useful for Intrusion Detection
Different properties in the audit patterns can be analyzed for detecting intrusions. The authors in
[49] describe three properties which can be used to detect intrusions.
1. Frequency of Event(s) Frequency determines how often an event occurs in a predefined time interval. A threshold can be used to define the limit. When the frequency crosses this limit, an alarm can be raised. Properties such as the number of invalid login attempts and the number of rows accessed in a database can be used to measure frequency.
2. Duration of Event(s) Rather than counting the number of occurrences of an event, the duration property determines the acceptable time duration for a particular event. It is based upon selecting a threshold which defines an acceptable range for a particular event. For example, a large number of invalid login attempts for a single user id in a very short time span can be considered as an attempt to guess a password and hence an attack.
Systems analyzing the frequency and/or duration property of events can perform efficiently, but they suffer from a large false alarm rate as it is often difficult to determine the correct threshold for the events (a small sketch of such a threshold check follows at the end of this subsection).
3. Ordering of Events Analyzing the order in which events occur can improve the attack detection accuracy and reduce false alarms. This is because, very often, intrusion is a multi-step process in which a number of events must occur sequentially in order to launch a successful attack. However, to avoid detection by systems which do analyze a sequence of events, the attacks can be spread over a long period of time such that the events cannot be correlated unless a long history is maintained by the intrusion detection system.
A system which can analyze all of the above mentioned properties can detect attacks with high accuracy. However, such a system may be inefficient in operation.
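As a small sketch of the frequency property (and of the threshold problem it raises), the code below counts failed login attempts inside a sliding time window and raises an alert once a hypothetical threshold is exceeded; both the threshold and the window length are made-up values, since realistic settings are site specific.

from collections import deque

MAX_FAILED_LOGINS = 5   # hypothetical frequency threshold
WINDOW_SECONDS = 60.0   # hypothetical time window over which attempts are counted

class FailedLoginMonitor:
    """Alert when too many failed logins occur within the time window."""

    def __init__(self) -> None:
        self._timestamps: deque = deque()

    def record_failure(self, timestamp: float) -> bool:
        """Record one failed login; return True if the threshold is crossed."""
        self._timestamps.append(timestamp)
        # Drop attempts that have fallen outside the sliding time window.
        while self._timestamps and timestamp - self._timestamps[0] > WINDOW_SECONDS:
            self._timestamps.popleft()
        return len(self._timestamps) > MAX_FAILED_LOGINS

monitor = FailedLoginMonitor()
alerts = [monitor.record_failure(t) for t in [0, 5, 10, 12, 13, 14, 15]]
print(alerts)   # only the last two attempts push the count past the threshold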
2.4.2 Univariate or Multivariate Audit Patterns
The audit patterns used to detect attacks may either be univariate or multivariate. As discussed before, the audit patterns may be collected from the routers and switches for network level systems. When only one feature is analyzed, as in the case of univariate audit patterns, the analysis is much simpler in comparison to when many features are analyzed together, as in the case of multivariate analysis. However, a single feature by itself may not be a complete representation and, hence, may be insufficient to detect attacks. For example, when the sequence of system calls generated by a privileged process is analyzed for detecting abnormal behaviour, discarding other features such as the parameters of the system calls can affect the attack detection capability of the system [9].
2.4.3 Relational or Sequential Representation
Very often, the audit patterns collected are sequential, where one or more features are recorded continuously. However, the raw audit patterns may be processed into a relational form and a number of new features can be added. These features often give a high level representation of the audit patterns in a summarized form. Examples of such features include the total amount of data transferred in a session and the duration of a session. Frequency and duration properties of events can be easily represented in relational form. Converting the audit patterns from sequential to relational form has two advantages; first, more features can be added and second, efficient methods can be used for the analysis of audit patterns in relational form. However, this may affect the attack detection capability since in relational form the ordering of events and, hence, the relationship among sequential events is lost. When the audit patterns are represented sequentially, event ordering can be exploited in favour of higher attack detection accuracy. However, in general, sequence analysis is slower when compared to relational analysis.
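The sketch below shows one way this sequential-to-relational conversion might look: a sequence of per-event records is collapsed into a single summary record with features such as session duration and total bytes transferred. The event fields and summary features are illustrative assumptions, not the exact feature set used later in this thesis.

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Event:
    # Hypothetical fields recorded for each event in a sequential audit trail.
    timestamp: float
    bytes_sent: int
    bytes_received: int

def summarize_session(events: List[Event]) -> Dict[str, float]:
    """Collapse a sequence of events into one relational record.

    The ordering of events is lost, but summary features such as the
    session duration and the total traffic become directly available.
    """
    return {
        "duration": events[-1].timestamp - events[0].timestamp,
        "total_bytes_sent": sum(e.bytes_sent for e in events),
        "total_bytes_received": sum(e.bytes_received for e in events),
        "event_count": len(events),
    }

session = [Event(0.0, 100, 40), Event(1.5, 250, 80), Event(4.2, 0, 1200)]
print(summarize_session(session))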
2.5 Evaluation Metrics
Evaluating different methods for detecting intrusions is important. Intrusion detection is an example of a problem with imbalanced classes, i.e. the number of instances in the classes is not equally distributed. The number of attacks is very small when compared with the total number of normal events. (Note that, in the case of Denial of Service attacks, the amount of attack traffic is instead extremely large compared to the normal traffic.) Hence, evaluating intrusion detection systems using the simple accuracy metric can be misleading, since even a trivial system can achieve very high accuracy on such skewed data [50]. Other metrics such as Precision, Recall and F-Measure, which do not depend on the size of the test set, are thus used for evaluating intrusion detectors. These are defined with the help of the confusion matrix as follows:
Table 2.1: Confusion Matrix

              | Predicted Normal | Predicted Attack
True Normal   | True Negative    | False Positive
True Attack   | False Negative   | True Positive
Precision = (number of True Positives) / (number of True Positives + number of False Positives)

Recall = (number of True Positives) / (number of True Positives + number of False Negatives)

F-Measure = ((1 + β²) · Recall · Precision) / (β² · (Recall + Precision))

where β corresponds to the relative importance of Precision vs. Recall and is usually set to 1.
Hence, a system must have high Precision (i.e. it must detect only attacks), high Recall (i.e. it
must detect all attacks) and, thus, a high F-Measure.
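The short functions below compute these metrics directly from confusion matrix counts, implementing the F-Measure expression given above with β = 1 by default; the example counts at the end are invented for illustration only.

def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def f_measure(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    # F-Measure as defined above: (1 + beta^2) * Recall * Precision
    # divided by beta^2 * (Recall + Precision).
    p, r = precision(tp, fp), recall(tp, fn)
    return (1 + beta ** 2) * r * p / (beta ** 2 * (r + p))

# Example: 90 correctly detected attacks, 10 false alarms, 20 missed attacks.
print(precision(90, 10), recall(90, 20), f_measure(90, 10, 20))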
In addition to evaluating the attack detection capability of the detector, the time taken to detect an attack is also significant. The time performance is generally measured as the time taken by the intrusion detector to detect an attack from the moment the audit patterns are fed into the detector. This is sufficient for comparison when different methods use exactly the same data for analysis; however, it does not represent the efficiency of the intrusion detection system, since the time taken in collecting and preprocessing the audit patterns is not considered. Hence, in real environments, the total time must be measured, which is the time from the point when the intrusion actually started to the point in time when the response mechanism is activated.
2.6 Literature Review
The two most significant motives for launching attacks, as described in [3], are either to force a network to stop some service(s) that it is providing or to steal some information stored in a network. An intrusion detection system must be able to detect such anomalous activities. However, what is normal and what is anomalous is not fixed, i.e., an event may be considered normal with respect to some criteria, but the same event may be labeled anomalous when this criterion is changed. Hence, the objective is to find anomalous test patterns which are similar to the anomalous patterns which occurred during training. The underlying assumption is that the evaluating criterion is unchanged and the system is properly trained such that it can reliably separate normal and anomalous events.
2.6.1 Frameworks for building Intrusion Detection Systems
A number of frameworks have been proposed for building intrusion detection systems. The common intrusion detection framework is described in [45]. The authors in [50] and [51] describe a data mining framework for building intrusion detection systems. Using the approach described in [51], the rules can be learned inductively instead of manually coding the intrusion patterns and profiles. However, their approach requires the use of a large amount of noise free audit data to train the models. Agent based intrusion detection frameworks are discussed in [52] and [53]. Frameworks which describe the collaborative use of intrusion detection systems have also been proposed [54], [55]. The system described in [54] is based on the combination of network based and host based systems, while the system in [55] employs both signature based and behaviour based techniques for detecting intrusions. All of these frameworks suffer from one major drawback; a single intrusion detector used within these frameworks is trained to detect a wide variety of attacks. This results in a large number of false alarms. To ameliorate this, we introduce our Layered Framework for building Intrusion Detection Systems in Chapter 3.
2.6.2 Network Intrusion Detection
The prospect of maintaining a single system which can be used to detect network wide attacks makes network monitoring a preferred option as opposed to monitoring individual hosts in a large network. A number of techniques such as association rules, clustering, the naive Bayes classifier, support vector machines, genetic algorithms, artificial neural networks and others have been applied to detect intrusions at the network level. It is important to note that different methods are based on specific assumptions and analyze different properties in the audit patterns, resulting in different attack detection capabilities. These methods can be broadly divided into three major categories:
Pattern Matching
Pattern matching techniques search for a predefined set of patterns (known as signatures) in the audit patterns to detect intrusions. Pattern matching approaches are employed on audit patterns which do not have any state or sequence information. Hence, they assume independence among events. However, this assumption may not always hold, as a single intrusion may span multiple events which are correlated. The prime advantage of pattern matching approaches is that they are very efficient and trigger an alert only when an exact match of an attack signature is found, resulting in very few false alarms. They can, however, detect attacks only if the corresponding pattern (signature) exists in the signature database. Hence, they cannot detect unseen attacks for which there are no signatures [9], [42]. The Snort system [48] is based upon pattern matching.
Statistical Methods
Statistical methods based on modeling the monitored variables as independent Gaussian random variables, and methods such as those based on the Hotelling T² test statistic, can be used to detect attacks by calculating deviations of the present profile from the stored normal profile [9]. They are based upon modeling the underlying process which generates the audit patterns and exploit the frequency and duration property of events. They often analyze properties such as the overall system load and the statistical distribution of events, which represent a summary measure. When the deviations exceed a predefined threshold, the system triggers an alarm. Determining this threshold accurately is a critical issue. When the threshold is low, the system raises a large number of (false) alarms and when the threshold is high, the system may not detect attacks reliably. Though these methods can handle multiple features in the audit patterns, very often, in order to reduce complexity and improve system performance, only a single feature is considered, as in the Intrusion Detection Expert System (IDES) [23], or the features are assumed to be independent, as in the Haystack system [24]. This, however, affects the attack detection accuracy. Statistical methods can operate either in batch mode (Haystack system) or in real-time mode (IDES).
Data Mining and Machine Learning
Data mining and machine learning methods focus on analyzing the properties of the audit patterns rather than identifying the process which generated them [9]. These methods include approaches for mining association rules, classification and cluster analysis. Classification methods are among the most researched and include the decision trees, Bayesian classifiers, artificial neural networks, k-nearest neighbour classification, support vector machines and many others.
Clustering Clustering of data has been applied extensively for intrusion detection using a number of methods such as k-means, fuzzy c-means and others [56], [57]. Clustering methods are based upon calculating the numeric distance of a test point from different cluster centres and then adding the point to the closest cluster. One of the main drawbacks of the clustering technique is that, since a numeric distance measure is used, the observations must be numeric. Observations with symbolic features cannot be readily used for clustering, which results in inaccuracy. In addition, clustering methods consider the features independently and are unable to capture the relationship between different features of a single record, which results in lower accuracy. Another issue when applying any clustering method is selecting the distance measure, as different distance measures result in clusters with different shapes and sizes. Frequently used distance measures are the Euclidean distance and the Mahalanobis distance [9]. Clustering can, however, be performed even when only the normal audit patterns are available. In such cases, density based clustering methods can be used, which are based on the assumption that intrusions are rare and dissimilar to the normal events. This is similar to identifying the outlier points, which can be considered as intrusions.
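The sketch below illustrates the distance-to-centroid idea on synthetic numeric data: clusters are fitted to (mostly) normal records and a test point is flagged when its distance to the nearest cluster centre exceeds a threshold. It assumes scikit-learn is available, and the threshold value is an arbitrary placeholder rather than anything derived from the thesis experiments.

import numpy as np
from sklearn.cluster import KMeans   # assumes scikit-learn is installed

rng = np.random.default_rng(0)
# Synthetic numeric records standing in for summarized normal connections.
normal_records = rng.normal(loc=0.0, scale=1.0, size=(500, 4))

# Fit clusters on the (mostly) normal audit patterns.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(normal_records)

def anomaly_score(x: np.ndarray) -> float:
    """Euclidean distance from x to the nearest cluster centre."""
    return float(np.min(np.linalg.norm(kmeans.cluster_centers_ - x, axis=1)))

test_point = np.array([8.0, 8.0, 8.0, 8.0])    # far from the normal clusters
THRESHOLD = 4.0                                # arbitrary illustrative cut-off
print(anomaly_score(test_point) > THRESHOLD)   # flagged as a possible intrusion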
Data Mining Data mining approaches [50], [51] are based on mining association rules [58] and using frequent episodes [59] to build classifiers by discovering relevant patterns of program and user behaviour. Association rules and frequent episodes are used to learn the record patterns that describe user behaviour. These approaches can deal with symbolic features, and the features can be defined in the form of packet and connection details. Mining association rules for intrusion detection has the advantage that the rules are easy to interpret. However, these approaches are based upon building a database of rules of normal and frequent items during the training phase. During testing, patterns from the test data are extracted and various classification methods can be used to classify the test data. The detection accuracy suffers as the database of rules is not a complete representation of the normal audit patterns.
Bayesian Classifiers Naive Bayes classifiers have also been proposed for intrusion detection [60]. However, they make a strict independence assumption between the features in an observation, resulting in lower attack detection accuracy when the features are correlated, which is often the case. Bayesian networks [61] can also be used for intrusion detection [62], [63]. However, they tend to be attack specific and build a decision network based on the special characteristics of individual attacks. As a result, the size of a Bayesian network increases rapidly as the number of features and the type of attacks modeled by the network increases.
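To make the independence assumption concrete, the toy sketch below estimates per-feature Bernoulli parameters for each class and scores a test record by multiplying them together with the class prior; the binary features and records are made up for illustration only. When the features are in fact correlated, this factorized likelihood is a poor approximation, which is exactly the weakness noted above.

import numpy as np

# Toy binary records: columns might be e.g. [logged_in, root_shell] -- hypothetical
# features -- with label 0 for normal and 1 for attack.
X = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 0]])
y = np.array([0, 0, 0, 1, 1, 1])

def fit_naive_bayes(X, y, alpha=1.0):
    """Estimate class priors and per-feature Bernoulli parameters (with Laplace
    smoothing), assuming the features are independent given the class."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = {
            "prior": len(Xc) / len(X),
            "theta": (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha),
        }
    return params

def predict(params, x):
    """Choose the class maximizing prior * product of per-feature likelihoods."""
    scores = {
        c: p["prior"] * np.prod(np.where(x == 1, p["theta"], 1 - p["theta"]))
        for c, p in params.items()
    }
    return max(scores, key=scores.get)

model = fit_naive_bayes(X, y)
print(predict(model, np.array([1, 0])), predict(model, np.array([0, 1])))  # 0 1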
Decision Trees Decision trees have also been used for intrusion detection [60], [64]. Decision trees select the best features for each decision node during tree construction based on some well defined criteria. One such criterion is the gain ratio, which is used in C4.5. Decision trees generally have a very high speed of operation and high attack detection accuracy and have been successfully used to build effective intrusion detection systems.
Artificial Neural Networks Neural networks have been used extensively to build network intrusion detection systems, as discussed in [65], [66], [67], [68], [69], [70] and [71]. Though neural networks can work effectively with noisy data, like other methods they require a large amount of data for training and it is often hard to select the best possible architecture for the neural network.
Support Vector Machines Support vector machines map a real-valued input feature vector to a higher dimensional feature space through a nonlinear mapping and have been used for detecting intrusions [70], [71], [72]. They can provide real-time attack detection capability, deal with a large dimensionality of data and perform multi-class classification.
For data mining and machine learning based approaches, the accuracy of the trained system also depends upon the amount of audit patterns available during training. Generally, training with more audit patterns results in a better model. The methods discussed above often deal with a summarized representation of the audit patterns and may analyze multiple features which are considered independently. The prime reason for working with summary patterns is that the resulting system tends to be simple and efficient and gives fairly good attack detection accuracy. Similar to the pattern matching and statistical methods, these methods assume independence among consecutive events and hence do not consider the order of occurrence of events for attack detection.
Markov Models Markov chains [73], [74] and hidden Markov models [75] can be used when dealing with the sequential representation of audit patterns. [31], [76], [77] and [78] describe the use of hidden Markov models for intrusion detection. Hidden Markov models have been shown to be effective in modeling sequences of system calls of a privileged process, which can be used to detect anomalous traces. However, modeling system calls alone may not always provide accurate classification as various connection level features are ignored. Further, hidden Markov models cannot model long range dependencies between the observations [34]. Very often the sequence itself is a vector and has many correlated features. However, in order to gain computational efficiency, the multivariate data analysis problem is broken into multiple univariate data analysis problems and the individual results are combined using a voting mechanism [9]. This, however, results in inaccuracy as the correlation among the features is lost. The authors in [49] show that modeling the ordering property of events, in addition to the duration and frequency, results in higher attack detection accuracy. The drawback with modeling the ordering of events is that the complexity of the system increases, which affects the performance of the system. Hence, there is a tradeoff between detection accuracy and the time required for attack detection.
Others Other approaches for detecting intrusions include the use of genetic algorithms and autonomous and probabilistic agents [79], [80]. These methods are generally aimed at developing a distributed intrusion detection system.
A number of intrusion detection systems such as IDES (Intrusion Detection Expert System), the Haystack system, MIDAS (Multics Intrusion Detection System), the W&S (Wisdom and Sense) system, TIM (Time based Inductive Machine), Snort and others have been developed which operate at the network level [1]. However, network intrusion detection systems must perform very efficiently in order to handle large amounts of network data and hence many of the network intrusion detection systems are primarily based on signature matching. When anomaly detection systems are used at the network level, they either consider only one feature [23] or assume the features to be independent [24]. However, we propose to use a hybrid system based on conditional random fields and integrate the layered framework to build a single system which can operate in high speed networks and can detect a wide variety of attacks with very few false alarms. We thus present the Layered Conditional Random Fields for Network Intrusion Detection in Chapter 4.
The most closely related work to ours is that of Lee et al. [51], [81], [82], [83], [84]. They, however, consider a data mining approach for mining association rules and finding frequent episodes in order to calculate the support and confidence of the rules. Instead, in our work we define features from the observations as well as from the observations and the previous labels and perform sequence labeling via the conditional random fields to label every feature in the observation. This setting is sufficient to model the correlation between different features in an observation. We also compare our work with [85], which describes the use of the maximum entropy principle for detecting anomalies in network traffic. The key difference between [85] and our work is that the authors in [85] use only the normal audit patterns during training and build a behaviour based system, while we train our system using both the normal and the anomalous patterns, i.e. we build a hybrid system. Secondly, the system in [85] fails to model long range dependencies in the observations, which can be easily represented in our model. As we shall describe in Chapter 4, we also integrate the layered framework with the conditional random fields to gain the benefits of computational efficiency, wide attack detection coverage and high accuracy of attack detection in a single system.
2.6.3 Monitoring Access Logs
A number of approaches have been described to monitor the data access logs and/or the application access logs in order to detect attacks, particularly at the user access level. We now review some of these well known approaches.
Monitoring Data Access Logs
In [86], the authors focus on detecting malicious database modifications using database logs to mine dependencies among data items by creating dependency rules. For example, any update operation must satisfy certain rules which define what data items must be read before an update and what data items must be written after the update operation. In order to detect malicious queries, the authors in [87] perform clustering of queries that might return one or more features, each of which can further return multiple records. In [88], the authors discuss that time differences between multiple transactions in database systems can be used to detect malicious transactions when an intruder masquerades as a normal user. The authors describe the use of Petri-Nets for finding anomalies at the user task level. In [89], the authors describe that the database logs can be used to build role profiles to model normal behaviour, which can then be used to identify intruders. The authors use a naive Bayes classifier to perform classification using features extracted from the SQL commands, the set of relations accessed and the attributes referenced. In [90] and [91], the authors describe that fingerprinting of SQL queries can be used to detect malicious requests. They also present an algorithm which summarizes the raw transactional SQL queries into compact regular expressions that can be used for matching against known attack signatures. Further, ordering constraints are imposed on the SQL queries in [90], which improves attack detection. In [92], the authors describe the use of database logs to build user profiles based on user query frequent item-sets. They also define support and confidence functions for fingerprints generated for the queries depending upon the user profile. In [93], the authors describe that data objects can be tagged with time semantics that capture expectations about update rates which are unknown to attackers. This is, however, applicable only to data which is refreshed regularly. In [94], the authors describe the use of audit logs for building user profiles. They consider both the integrity constraints encoded in the data dictionary and the user profiles to define a distance measure which estimates the closeness of a set of attributes which are referenced together. The authors in [95] describe a system which determines whether a query should be denied in order to protect the privacy of users by constructing auditors for max, min and sum queries. In [96], [97] and [98], the authors describe Hippocratic Databases and present an auditing framework to detect suspicious queries which do not adhere to data disclosure policies.
Such approaches are, generally, rule based and expensive to build. Additionally, they have limited attack detection capability since they cannot detect attacks whose signatures are not available. Further, such systems are application specific and their deployment in different environments requires recreating the set of rules applicable in the new domain. Another drawback is that a system based on modeling user profiles results in a large false alarm rate. This is because of two reasons; first, the user behaviour is not fixed and changes over time and, second, profile based systems employ thresholds to determine the acceptable deviation in normal activities. The thresholds are often determined empirically and, hence, may be unreliable. Additionally, these methods consider data access patterns in isolation from the events which generate the data requests.
Monitoring Web Access Logs
Contrary to the systems which monitor data queries alone, there exist systems which analyze the web server access logs to detect malicious data and application accesses. Systems such as [99] combine static and dynamic verification of web requests to ensure the absence of particular kinds of erroneous behaviour in web applications. They, however, do not consider the underlying data access and hence cannot detect a wide variety of attacks. The system described in [100] performs application layer protocol analysis to detect intrusions. The authors in [101] describe an anomaly based approach for detecting attacks against web applications by analyzing their internal state and learning the relationships between critical execution points and the internal states. In [102], the authors describe a technique called protomatching which combines protocol analysis, normalization and pattern matching into a single phase and hence can be used to perform signature analysis efficiently. The authors claim that their protomatching approach improves the efficiency of the Snort [48] intrusion detection system by up to 49%. In [103], the authors model network traffic into network sessions and packets to identify instances with high attack probability. The authors in [104] describe a tool for performing intrusion detection at the application level. Their system uses the Apache web server to implement an audit data source which monitors the web server.
In order to improve attack detection at the application level we are, however, interested in analyzing the behaviour of a web application in conjunction with the underlying data accesses rather than analyzing them separately. Hence, we present our Unified Logging Framework in Chapter 5. The advantage of our framework is that it is application independent, since we do not extract application specific signatures, and therefore our framework can be used in a variety of applications. Further, instead of modeling user profiles, our system models the application-data interaction, which does not depend upon a particular user and therefore does not change over time.
2.6.4 Application Intrusion Detection
Network monitoring, though significant, is not sufficient to detect attacks which are directed towards individual applications. In order to detect such malicious application and data accesses, intrusion detection must also be performed at the application level. Further, for an attack to be successful, very often, a sequence of events must be followed. Present application intrusion detection systems consider every event individually rather than considering a sequence of events, resulting in a large number of false alarms and, hence, poor attack detection accuracy. To ameliorate this, we introduce User Session Modeling using the Unified Log for Application Intrusion Detection in Chapter 6. We perform session modeling at the user application level, as opposed to the network packet level, and integrate the unified logging framework to build an application intrusion detection system. We show that, using conditional random fields, session modeling can be performed with our unified logging framework and attacks can be detected by monitoring only a small number of events in a sequence. This results in an efficient and an accurate system.
The most closely related works to ours are [105], [106] and [107]. In [105], the authors describe an anomaly based learning approach for detecting SQL attacks by learning profiles of the normal database accesses for web applications. Our work is different from this because we consider both the normal and the anomalous data patterns during training and build a classification system based on user session modeling, while in [105] the authors use only the normal patterns during training to build an anomaly based system and analyze the events independently. In [106], the authors describe anomaly detection techniques to detect attacks against web servers and web based applications by correlating the server side programs referenced by client queries with the parameters contained in the queries. Their system primarily focuses on the web server logs to produce an anomaly score by creating profiles for every server side program and its features and then establishing their thresholds, while in our system we combine the web server logs with the data access logs to detect malicious data accesses and use a moving window (of size more than one) to analyze a sequence of events. The authors in [107] describe a two layer system in which the first layer generates pre-alarms and the second layer makes the final decision to activate an alarm. Even though the authors use both the web access logs and the data access logs, they build separate profiles using the two logs. We also compare our work with [103]. Their system analyzes network sessions and network packets, while we model the user application sessions to detect malicious data accesses.
2.7 Conditional Random Fields
Conditional models are probabilistic systems which are used to model the conditional distribution over a set of random variables. Such models have been extensively used in natural language processing tasks and computational biology. Conditional models offer a better framework as they do not make any unwarranted assumptions on the observations and can be used to model rich overlapping features among the visible observations. Maxent classifiers [108], [109], [110], maximum entropy Markov models [85], [111] and conditional random fields [34] are such conditional models. The simplest conditional classifier is the Maxent classifier, based upon maximum entropy classification, which estimates the conditional distribution of every class given the observations. The training data is used to constrain this conditional distribution while ensuring maximum entropy and hence maximum uniformity. We now give a brief description of conditional random fields which is motivated from the work in [34]. A comprehensive introduction to conditional random fields is provided in Appendix A.
Let X be the random variable over a data sequence to be labeled and Y be the corresponding label sequence. Also, let G = (V, E) be a graph such that Y = (Y_v)_{v \in V}, so that Y is indexed by the vertices of G. Then (X, Y) is a conditional random field when, conditioned on X, the random variables Y_v obey the Markov property with respect to the graph: p(Y_v | X, Y_w, w \neq v) = p(Y_v | X, Y_w, w \sim v), where w \sim v means that w and v are neighbors in G, i.e. a conditional random field is a random field globally conditioned on X. For simple sequence (or chain) modeling, as in our case, the joint distribution over the label sequence Y given X has the form:

p_\theta(y|x) \propto \exp\Big( \sum_{e \in E, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V, k} \mu_k g_k(v, y|_v, x) \Big)    (2.1)

where x is the data sequence, y is a label sequence, and y|_S is the set of components of y associated with the vertices or edges in the subgraph S. The features f_k and g_k are assumed to be given and fixed. Further, the parameter estimation problem is to find the parameters \theta = (\lambda_1, \lambda_2, \ldots; \mu_1, \mu_2, \ldots) from the training data D = \{(x_i, y_i)\}_{i=1}^{N} with the empirical distribution \tilde{p}(x, y).
The graphical structure of a conditional random field is represented in Figure 2.6, where x_1, x_2, x_3, x_4 represents an observed sequence of length four and every event in the sequence is correspondingly labeled as y_1, y_2, y_3, y_4.

[Figure 2.6: Graphical Representation of a Conditional Random Field — a linear chain in which each label y_1, ..., y_4 is connected to its neighboring labels and to the observation sequence x_1, ..., x_4.]
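For intuition, the following brute-force sketch evaluates Equation 2.1 for a toy linear chain of length four, in the spirit of Figure 2.6. The label set, observation values, feature functions f and g, and the weights λ and μ are all invented for illustration; a real implementation would use many features and would compute the normalization with dynamic programming rather than by enumerating every label sequence.

import itertools
import numpy as np

# Toy labels, observations, features and weights -- all invented for illustration.
LABELS = [0, 1]                      # 0 = normal, 1 = attack
x = np.array([0.2, 0.1, 0.9, 0.8])   # observation sequence x1..x4

def g(v, y_v):
    """Vertex feature g_k: label agrees with whether the observation looks anomalous."""
    return 1.0 if (x[v] > 0.5) == (y_v == 1) else 0.0

def f(y_prev, y_cur):
    """Edge feature f_k: consecutive labels are identical."""
    return 1.0 if y_prev == y_cur else 0.0

lam, mu = 0.8, 1.5                   # toy weights lambda_k and mu_k

def log_score(y):
    """Unnormalized log-score: the weighted sums inside the exponential of Eq. 2.1."""
    vertex_part = sum(mu * g(v, y[v]) for v in range(len(x)))
    edge_part = sum(lam * f(y[e], y[e + 1]) for e in range(len(x) - 1))
    return vertex_part + edge_part

# Normalize by summing over every label sequence (feasible only for a toy chain;
# real implementations use dynamic programming).
all_sequences = list(itertools.product(LABELS, repeat=len(x)))
Z = sum(np.exp(log_score(y)) for y in all_sequences)
best = max(all_sequences, key=log_score)
print(best, float(np.exp(log_score(best)) / Z))   # most probable labeling and p(y|x)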
The prime advantage of conditional random fields is that they are discriminative models which directly model the conditional distribution p(y|x). Further, conditional random fields are undirected models and are free from the label bias and observation bias which are present in other conditional models [112]. Generative models such as Markov chains, hidden Markov models and joint distribution models have two disadvantages. First, the joint distribution is not required since the observations are completely visible and the interest is in finding the correct class, which is given by the conditional distribution p(y|x). Second, inferring the conditional probability p(y|x) from the joint distribution, using the Bayes rule, requires the marginal distribution p(x), which is difficult to estimate as the amount of training data is limited and the observation x contains highly dependent features. As a result, strong independence assumptions are made to reduce complexity. This results in reduced accuracy [113].
Instead, conditional random fields predict the label sequence y given the observation sequence x, allowing them to model arbitrary relationships among different features in the observations without making independence assumptions.
Conditional random fields, thus, offer us the required framework to build effective intrusion detection systems. The task of intrusion detection can be compared to many problems in machine learning, natural language processing and bioinformatics such as gene prediction, determining the secondary structures of protein sequences, part of speech tagging, text segmentation, shallow parsing, named entity recognition, object recognition and many others. Conditional random fields have proven to be very successful in such tasks. Hence, in this thesis, we explore the suitability of conditional random fields for building robust intrusion detection systems.
2.8 Conclusions
In this chapter, we presented the taxonomy of intrusion detection and explored the problem in detail. We first discussed the principles and assumptions involved in building intrusion detection systems and described the components of intrusion detection systems in detail. We then presented various challenges and requirements for effective intrusion detection and presented a classification of intrusion detection systems. We then discussed methods which have been used for detecting intrusions, their underlying assumptions, and their strengths and limitations with regard to their attack detection capability. We presented the literature review, where we explored various frameworks and methods which have been used to build network and application intrusion detection systems. Finally, we drew similarities between intrusion detection and various tasks in computational linguistics and computational biology, and motivated our approach to build intrusion detection systems based on conditional random fields.
In the next chapter, we present our layered framework and describe how it can be used to build accurate and efficient anomaly and hybrid network intrusion detection systems.
Chapter 3
Layered Framework for Building Intrusion
Detection Systems
Present networks and enterprises follow a layered defence approach to ensure security at different access levels by using a variety of tools such as network surveillance, perimeter access control, firewalls, network, host and application intrusion detection systems, data encryption and others. Given this traditional layered defence approach, only a single system is employed at every layer which is expected to detect attacks at that particular location. However, with the rapid increase in the number and type of attacks, a single system is not effective enough given the constraints of achieving high attack detection accuracy and high system throughput. Hence, we propose a layered framework for building intrusion detection systems which can be used, for example, to build a network intrusion detection system which can detect a wide variety of attacks reliably and efficiently when compared to traditional network intrusion detection systems. Another advantage of our Layer based Intrusion Detection System (LIDS) framework is that it is very general and easily customizable depending upon the specific requirements of individual networks.
3.1 Introduction
TWO significant requirements for building intrusion detection systems are broad attack detection coverage and efficiency in operation, i.e., an intrusion detection system must detect different types of attacks effectively and must operate efficiently in high traffic networks. Present networks are prone to a number of attacks, a large number of which are previously known. However, the number of previously unseen attacks is on the rise [10].
Signature based systems using pattern matching approaches can be used effectively and efficiently to detect previously known attacks in high speed networks. However, even a slight variation in attacks may not be detected by a signature based system. As a result, anomaly and hybrid systems are used to detect previously unseen attacks and have been proven to be more reliable in detecting novel attacks when compared with the signature based systems. A common practice to build anomaly and hybrid intrusion detection systems is to train a single system with labeled data to build a classifier which can then be used to detect attacks from a previously unseen test set. At times, when labeled data is not available, clustering based systems can be used to distinguish between legitimate and malicious packets. However, a significant disadvantage of such systems is that they result in a large number of false alarms. The attack detection coverage of the system is further affected when a single system is trained to detect different types of attacks. To maximize attack detection, various systems such as [55] and [114] employ both the signature based and the anomaly based systems together. However, the anomaly based system still remains a bottleneck in the joint system. This is because a single anomaly detector is trained which is expected to accurately detect a variety of attacks and perform efficiently.
Thus, for a network intrusion detection system, monitoring the incoming and outgoing network traffic and ensuring confidentiality, integrity and availability via a single system may not be possible due to several reasons, including the complexity and the diverse types of attacks at the network level. Ensuring a high speed of operation further limits the deployment, particularly, of anomaly and hybrid network intrusion detection systems. Network monitoring using a network intrusion detection system is only a single line of defence in the traditional layered defence approach which aims to provide complete organizational security. Hence, network intrusion detection systems are complemented by a variety of other tools such as network surveillance, perimeter access control, firewalls, host and application intrusion detection systems, file integrity checkers, data encryption and others, and are deployed at different access points in a layered organizational security framework [115]. In this chapter we propose a layered framework for building anomaly and hybrid network intrusion detection systems which can operate efficiently in high speed networks and can accurately detect a variety of attacks. Our proposed framework is very general and can be easily customized by adding domain specific knowledge as per the specific requirements of the network in concern, thereby giving flexibility in implementation.
The rest of the chapter is organized as follows: we give motivating examples to highlight the significance of the layered framework for intrusion detection in Section 3.2. We then describe our layered framework in Section 3.3. We highlight the advantages of our framework in Section 3.4 and compare the layered framework with others in Section 3.5. Finally, we conclude this chapter in Section 3.6.
3.2 Motivating Examples
Anomaly and hybrid intrusion detection systems typically employ various data mining and machine learning based approaches which are inefficient when compared to the signature based systems which employ pattern matching. Hence, it becomes critical to search for methods which can be used to build efficient anomaly and hybrid intrusion detection systems. However, given that present networks are prone to a wide variety of attacks, using a single system would not only degrade performance but would also be less effective in attack detection.
Consider, for example, a single network intrusion detection system which is deployed to detect every network attack in a high speed network. A network is prone to different types of attacks such as Denial of Service (DoS), Probe and others. We note that the DoS and Probe attacks are different and require different features for their effective detection. When the same features are used to detect the two attacks, the accuracy decreases. It also makes the system bulky, which affects its speed of operation. Hence, for effective attack detection, a network intrusion detection system must differentiate between different types of attacks. Thus, using a single system is not a viable option. One possible solution is to have a number of sub systems, each of which is specific to detecting a single category of attack (such as DoS, Probe and others). This is not only more effective in detecting individual classes of attacks, but it also results in an efficient system. The number of sub systems to be used can be determined by analyzing the potential risks and the availability of resources at individual installations.
Hence, we propose a layered framework for building efficient anomaly and hybrid intrusion detection systems where different layers in the system are trained independently to detect different types of attacks with high accuracy. For example, based on our proposed framework a network intrusion detection system may consist of four layers, where the layers correspond to four different attack classes: Denial of Service, Probe, Remote to Local and User to Root.
3.3 Description of our Framework
Figure 3.1 represents our framework for building Layer based Intrusion Detection Systems (LIDS).

[Figure 3.1: Layered Framework for Building Intrusion Detection Systems — audit data with all features enters Layer One; every layer performs feature selection followed by an intrusion detection sub system which either blocks the connection (attack) or passes it as normal to the next layer, up to Layer n, after which the traffic is allowed.]

The figure represents an n layer system where every layer in itself is a small intrusion detection system which is specifically trained to detect only a single type of attack, for example the
DoS attack. A number of such sub systems are then deployed sequentially, one after the other. This serves a dual purpose: first, every layer can be trained with only a small number of features which are significant in detecting a particular class of attack; second, the size of each sub system remains small and hence it performs efficiently. A common disadvantage of using a modular approach, similar to our layered framework, is that it increases the communication overhead among the modules (sub systems). However, this can be easily eliminated in our framework by making every layer completely independent of every other layer. As a result, some features may be present in more than one layer. Depending upon the security policy of the network, every layer can simply block an attack once it is detected without the need for a central decision maker.
A number of such layers essentially act as filters, which block anomalous connections as soon as they are detected in a particular layer, thereby providing a quick response to intrusion and simultaneously reducing the analysis at subsequent layers. It is important to note that a different response may be initiated at different layers depending upon the class of attack the layer is trained to detect. The amount of audit data analyzed by the system is largest at the first layer and decreases at subsequent layers as more and more attacks are detected and blocked. In the worst case, when no attacks are detected until the last layer, all the layers have the same load. However, the
overall load for the average case is expected to be much less since attacks are detected and blocked at every subsequent layer. On the contrary, if the layers are arranged in parallel rather than in a sequence, the load at every sub system is the same and is equal to that of the worst case in the sequential configuration. Additionally, the initial layers in the sequential configuration can be replicated to perform load balancing in order to improve performance.
3.3.1 Components of Individual Layers
Given that a network is prone to a wide variety of attacks, it is often not feasible to add a separate
layer to detect every single attack. However, a number of similar attacks can be grouped together
and represented as a single attack class. Every layer in our framework corresponds to a sub system
which is trained independently to detect attacks belonging to a single attack class. As a result, the
total number of layers in our framework remains small. For example, both Smurf and Neptune
result in Denial of Service and, hence, can be detected at a single layer rather than at two different
layers.
Additionally, the layered framework is very general and the number of layers in the overall
system can be adjusted depending upon the individual requirements of the network in concern.
Consider, for example, a data repository which is a replica of real-time application data and
which does not provide any online services. To ensure the security of this data, the priority is to simply
detect network scans as opposed to detecting malicious data accesses. For such an environment,
only a single layer which can reliably detect the Probe attacks is sufficient. Hence, the number of
layers in our framework can be easily customized depending upon the identified threats and the
availability of resources.
Even though the number of layers and the significance of every layer in our framework depend
upon the target network, every layer has two significant components:
1. Feature Selection Component: In order to detect intrusions, a large number of features
can be monitored. These features include protocol, type of service, number of bytes
from source to destination, number of bytes from destination to source, whether or not a
user is logged in, number of root accesses, number of files accessed and many others.
However, to detect a single attack class, only a small set of these features is required at
every layer. Using more features than required makes the system inefficient. For example,
to detect Probe attacks, features such as the protocol and type of service are significant
while features such as number of root accesses and number of files accessed are not
significant.
2. Intrusion Detection and Response Sub System: The second component in every layer
is the intrusion detection and response unit. To detect intrusions, our framework is not
restricted to a particular anomaly or hybrid detector. A variety of previously well
known intrusion detection methods such as the naive Bayes classifier, decision trees, support
vector machines and others can be used. A prime advantage of our framework is
that newer methods, such as conditional random fields as we will discuss in the following
chapters, which are more effective in detecting attacks, can be easily incorporated in our
framework. Finally, once an attack is detected, the response unit can provide an adequate
intrusion response depending upon the security policy.
In order to take advantage of our proposed framework, every layer must contain both of the
above mentioned components.
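As a minimal illustration of how these two components fit together inside one layer, the Python sketch
below models a layer as a feature-selection step followed by a detector and an optional response hook.
The class, the detector interface and the respond callback are hypothetical stand-ins and do not describe
a particular implementation.

class Layer:
    """One layer of the framework: feature selection + detection + response.

    features - indices of the features this layer is trained on (layer specific).
    detector - any trained classifier exposing predict(values) -> "attack" or "normal".
    respond  - optional callback invoked when this layer blocks a connection.
    """

    def __init__(self, name, features, detector, respond=None):
        self.name = name
        self.features = features
        self.detector = detector
        self.respond = respond

    def select(self, record):
        # Keep only the small set of features significant for this attack class.
        return [record[i] for i in self.features]

    def inspect(self, record):
        # Returns True if the connection should be blocked at this layer.
        if self.detector.predict(self.select(record)) == "attack":
            if self.respond is not None:
                self.respond(self.name, record)   # layer-specific intrusion response
            return True
        return False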
3.4 Advantages of Layered Framework
We now summarize the advantages of using our layered framework.
- Using our layered framework improves attack detection accuracy and the system can detect
a wide variety of attacks by making use of domain-specific knowledge.
- The layered framework does not degrade system performance as individual layers are independent
and are trained with only a small number of features, thereby resulting in an
efficient system. Additionally, the layered framework opens avenues to perform
pipelining, resulting in a very high speed of operation. Implementing pipelining, particularly
on multi-core processors, can significantly improve performance by reducing
multiple I/O operations to a single I/O operation, since all the features can be read in a single
operation and analyzed by different layers in the layered framework.
- Our framework is easily customizable and the number of layers can be adjusted depending
upon the requirements of the target network.
- Our framework is not restricted to a single method to detect attacks. Different methods
can be seamlessly integrated in our framework to build effective intrusion detectors.
- Our proposed layered framework for building effective and efficient network intrusion
detection systems fits well in the traditional layered defence approach for providing network
and enterprise level security.
- Our framework has the advantage that the type of attack can be inferred directly from the
layer at which it is detected. As a result, specific intrusion response mechanisms can be
activated for different attacks.
3.5 Comparison with other Frameworks
Ensuring continuity of services and security of data against unauthorized disclosure and malicious
modification are critical for any organization. However, providing a desired level of security at
the enterprise level can be challenging. No single tool can provide enterprise wide security and,
hence, a number of different security tools are deployed. For this, a layered defence approach is
often employed to provide security at the organizational level. This traditional layered defence
approach incorporates a variety of security tools such as network surveillance, perimeter access
control, firewalls, network, host and application intrusion detection systems, file integrity
checkers, data encryption and others, which are deployed at different access points in a layered
security framework. The traditional layered architecture is perceived as a framework for ensuring
complete organizational security rather than as an approach for building effective and efficient
intrusion detection systems. Figure 3.2 represents the traditional layered defence approach.
However, as discussed earlier, we present a layered framework for building intrusion detection
systems. Our framework fits well in the traditional layered defence approach and can be used to
develop effective and efficient network intrusion detection systems. Further, the four components,
viz. event generators, event analyzers, event databases and the response units, presented in the
Common Intrusion Detection Framework [45], can be defined for every intrusion detection sub
system in our layered framework.
In the data mining framework for intrusion detection [84], the authors describe the use of
data mining algorithms to compute activity patterns from system audit data to extract features
which are then used to generate rules to detect intrusions. The same approach can be applied
for building an intrusion detection system based on our layered framework. Our framework can
not only seamlessly integrate the use of data mining techniques for intrusion detection, but can
also help to improve their performance by selecting only a small number of significant features for
building separate intrusion detection sub systems which can be used to effectively detect different
classes of attacks at different layers.

[Figure 3.2 depicts the traditional layered defence layers: Perimeter Security (Network Access
Control), Host Security (Infrastructure Protection), Surveillance, Network Security, Content
Management, Business Continuity, Application Security and Data Security.]
Figure 3.2: Traditional Layered Defence Approach to Provide Enterprise Wide Security
A number of other frameworks have been proposed which describe the use of classifier combination
[55], [114], [116], [117]. In [55] and [114], the authors apply a combination of anomaly
and misuse detectors for better qualification of analyzed events. The authors in [116] describe the
combination of strong classifiers using stacking, where decision trees, naive Bayes and a number
of other classification methods are used as base classifiers. The authors show that the output from
these classifiers can be combined to generate a better classifier rather than selecting the individual
best classifier. In [117], the authors use a combination of weak classifiers where the individual
classification power of a weak classifier is only slightly better than that of random guessing. The authors
show that a number of such classifiers, when combined by using a simple majority voting mechanism,
provide good classification. Our framework is, however, not based upon classifier combination.
Combination of classifiers is expensive with regard to processing time and decision making.
In addition, centralized decision making systems often tend to be complex and slow in operation.
The only purpose of classifier combination is to improve accuracy. Our system, in contrast, is based
upon serial layering of multiple hybrid detectors which are trained independently and which operate
without the influence of any central controller. In our framework, the results from individual
classifiers at a layer are not combined at any later stage and, hence, an attack is blocked at the
layer where it is detected. There is no communication overhead between the layers and a central
decision maker, which results in an efficient system. In addition, since the layers are independent,
they can be trained separately and deployed independently. As already discussed, using a stacked
system is expensive when compared to the sequential model. From our experimental results in the
following chapters, we will show that an intrusion detection system based on our layered framework
performs better and is more efficient when compared with individual systems as well as with
systems based on classifier combination.
3.6 Conclusions
In this chapter, we presented our layered framework for building effective and efficient intrusion
detection systems. We compared our framework with other well known frameworks and highlighted
its specific advantages. In addition to improving the attack detection accuracy and detecting
a variety of attacks, our framework can be used to build efficient anomaly and hybrid network
intrusion detection systems. In particular, our framework can identify the class of an attack once
it is detected, is scalable and can be easily customized depending upon the specific requirements of a
network.
Given the layered framework, in the next chapter, we first demonstrate the effectiveness of
conditional random fields to build intrusion detection sub systems which are individually trained
to effectively detect a single attack class. We then integrate the trained (sub) systems into our
layered framework to build accurate and efficient network intrusion detection systems which are
not based on attack signatures. Experimental results demonstrate that our system outperforms
other well known approaches for intrusion detection.
Chapter 4
Layered Conditional Random Fields for
Network Intrusion Detection
Ever increasing network bandwidth poses a significant challenge to building efficient network intrusion
detection systems which can detect a wide variety of attacks with acceptable reliability. In order to
operate in high traffic environments, present network intrusion detection systems are often signature
based. However, signature based systems have obvious disadvantages. As a result, anomaly and
hybrid intrusion detection systems must be used to detect novel attacks. However, such systems are
inefficient and suffer from a large false alarm rate. To ameliorate these drawbacks, we first develop
better hybrid intrusion detection methods which are not based on attack signatures and which can
detect a wide variety of attacks with very few false alarms. We then integrate the layered framework,
discussed in the previous chapter, to build a single system which is effective in attack detection and which
can also perform efficiently in high traffic environments.
4.1 Introduction
Increasing network bandwidth has enabled a large number of services to be provided over
a network. The high speed of communication and increasing complexity of systems have, however,
made it difficult to detect intrusive activities in real-time. In order to operate in high speed networks,
intrusion detection systems are either signature based, performing pattern matching, or
operate on summarized audit patterns which are collected regularly at predefined intervals. Pattern
matching systems operate on signatures extracted from previously known attacks and are limited
to detecting only the attacks with known signatures. Anomaly and hybrid intrusion detection systems,
in addition to detecting previously known attacks, can also detect previously unseen attacks;
however, they are expensive in operation. As a result, such systems analyze summarized data
instead of monitoring a sequence of events.
Anomaly and hybrid intrusion detection systems suffer from two major disadvantages: first,
they generate a large number of false alarms and, second, they are expensive in operation. Further,
a single system has limited attack detection coverage and cannot detect a wide variety of attacks
reliably. Hence, in this chapter, we focus on building accurate hybrid intrusion detection systems
which can perform efficiently in high speed network environments.
We first develop hybrid intrusion detection systems based on conditional random fields which
can detect a wide variety of attacks and which result in very few false alarms. To improve the
efficiency of the system, we then integrate the layered framework, as discussed in the previous
chapter, and demonstrate that a single system based on our framework is more effective than previously
well known methods for network intrusion detection. Experimental results on the benchmark
KDD 1999 intrusion data set [12] and comparison with other well known methods such as
decision trees and naive Bayes show that our approach based on layered conditional random
fields outperforms these methods in terms of both accuracy of attack detection and efficiency of
operation. An impressive part of our results is the percentage improvement in attack detection accuracy,
particularly for User to Root (U2R) attacks (34.8% improvement) and Remote to Local (R2L)
attacks (34.5% improvement). Statistical tests also demonstrate higher confidence in detection accuracy
with layered conditional random fields. We also show that our system is robust and can
detect attacks with higher accuracy, when compared with other methods, even when trained with
noisy data.
The rest of the chapter is organized as follows. In Section 4.2 we motivate the use of conditional
random fields for intrusion detection, which can model complex relationships between different
features in the data set. We then describe the data set used in our experiments in Section 4.3.
In Section 4.4, we describe how conditional random fields can be used for effective intrusion
detection, followed by the algorithm to integrate the layered framework with conditional random
fields to build an effective and efficient network intrusion detection system. In Section 4.5
we give details of the experiments performed and describe the implementation of our integrated
system. In Section 4.6, we compare our results with other methods such as decision trees, the naive
Bayes classifier, multi layer perceptron, support vector machines, K-means clustering, principal
component analysis and approaches based on classifier combination, which are known to perform
well for intrusion detection. We analyze the robustness of our system in Section 4.7 by introducing
noise into the training data. Finally, we draw conclusions and highlight the advantages of layered
conditional random fields for network intrusion detection in Section 4.8.
4.2 Motivating Examples
Network intrusion detection systems operate at the periphery of the network and are, thus, overloaded
with large amounts of network traffic, particularly in high speed networks. As a result,
anomaly and hybrid intrusion detection systems generally operate on summarized audit patterns.
However, when audit patterns are summarized, they are represented with multiple features which
are correlated, and complex relationships exist between them. To detect intrusions effectively,
these features must not be considered independently. Methods, such as conditional random fields,
which can capture relationships among multiple features, should therefore perform better when compared
with methods which consider the features to be independent, such as the naive Bayes classifier.
Consider, for example, a network intrusion detection system which uses two features, logged
in and number of file creations, to classify network connections as either normal or attack.
When these features are analyzed in isolation they do not provide significant information which
can help in detecting attacks. However, analyzing these features together can provide meaningful
information for classification. This is because a particular user may or may not have privileges to
create files in the system, or the system may detect anomalous activity by calculating the deviation
between the current profile and the previously saved profile for that particular user.
Consider another network intrusion detection system which analyzes a connection level feature,
such as the service invoked at the destination, in order to detect attacks. When this feature is analyzed
in isolation, it is significant only when an attacker requests a service that is not available at
the destination, and the system may then tag the connection as a Probe attack. However, if this
information is analyzed in combination with other features such as the protocol type and the amount
of data transferred between the source and the destination, the audit data provides significant
details which help in improving classification. In this case, if the features are considered to be
independent, the system is limited to detecting only Probe attacks. However, as we will show
from our experiments, if these features are not considered to be independent, the system may not
only detect Probe attacks, but it can also correctly detect R2L and U2R attacks.
Such relationships between different features in the observed data, if considered by an intrusion
detection system during classification, can significantly decrease classification error, thereby
improving the attack detection accuracy. We thus explore the effectiveness of conditional random
fields, which can effectively model such relationships, and compare their performance with other
well known approaches for intrusion detection.
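For reference, the contrast between the two modelling assumptions can be written down directly. The
expressions below use the standard textbook notation for a naive Bayes classifier and a linear-chain
conditional random field; they are generic formulations rather than a description of our particular
feature functions.

% Naive Bayes: features assumed conditionally independent given the class y
P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)

% Linear-chain conditional random field: labels conditioned jointly on the whole observation x
p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
  \exp\!\Bigg( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Bigg),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\!\Bigg( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Bigg)

Because the feature functions f_k may depend on the entire observation, correlated features such as those
in the examples above can influence the labeling jointly rather than independently.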
4.3 Data Description
We perform our experiments with the benchmark KDD 1999 intrusion data set [12]. The data set
is a version of the 1998 DARPA intrusion detection evaluation program, prepared and managed by
the MIT Lincoln Labs. The data set contains about five million connection records as the training
data and about two million connection records as the test data. In our experiments, we use the
ten percent of the total training data and ten percent of the test data (with corrected labels) which
are provided separately. This leads to 494,020 training and 311,029 test instances. Each record
in the data set represents a connection between two IP addresses, starting and ending at some
well defined times with a well defined protocol. Further, with 41 different features, every record
represents a separate connection and, hence, in our experiments, we consider every record to be
independent of every other record.
Table 4.1 gives the number of instances for every class in the data set. The training data is
labeled either as normal or as one of 24 different kinds of attack. All of the 24 attacks can be
grouped into one of four classes: Probe, Denial of Service (DoS), unauthorized access from a
remote machine or Remote to Local (R2L), and unauthorized access to root or User to Root (U2R).
Similarly, the test data is also labeled either as normal or as one of the attacks belonging to the
four attack classes. It is important to note that the test data includes specific attacks which are not
present in the training data. This makes the intrusion detection task more realistic [12].
Table 4.1: KDD 1999 Data Set

            Training Set    Test Set
  Normal        97,277        60,593
  Probe          4,107         4,166
  DoS          391,458       229,853
  R2L            1,126        16,349
  U2R               52            68
  Total        494,020       311,029
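As a small practical aside, the ten percent training split can be loaded with a few lines of Python. The
file name below is an assumption about the local copy of the publicly distributed data set; the format
(41 comma-separated feature values followed by a label ending in a period) is the standard one for this
data set.

import csv

# Assumed local file name for the "10% of training data" split of KDD 1999.
with open("kddcup.data_10_percent", newline="") as f:
    rows = list(csv.reader(f))

# Each row: 41 feature values followed by a label such as "normal." or "smurf.".
records = [{"features": row[:41], "class": row[41].rstrip(".")} for row in rows]
print(len(records))   # expected to match the 494,020 training instances reported above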
4.4 Methodology
Given the network audit patterns, where every connection between two hosts is presented in a
summarized form with 41 features, our objective is to detect most of the anomalous connections
while generating very few false alarms. In our experiments, we used the KDD 1999 data set
described in Section 4.3. Conventional methods, such as decision trees and naive Bayes, are
known to perform well in such an environment; however, they assume the observation features to
be independent. We propose to use conditional random fields, which can capture the correlations
among different features in the data and hence perform better when compared with other methods.
The KDD 1999 data set represents multiple features, a total of 41, for every session in relational
form with only one label for the entire record. In this case, using a conditional model would
result in a maximum entropy classifier [108], [110]. However, we represent the audit data in the
form of a sequence and assign a label to every feature in the sequence using the first order Markov
assumption instead of assigning a single label to the entire observation. Though this increases
complexity, it also improves the attack detection accuracy. To manage complexity and improve
the system's performance, we integrate the layered framework, described in the previous chapter, with
the conditional random fields to build a single system which is more efficient and more effective.
Figure 4.1 represents how conditional random fields can be used for detecting network intrusions.
[Figure 4.1 shows two labelled observation sequences: (a) an attack event with duration = 0,
protocol = icmp, service = eco_i, flag = SF and src_byte = 8, where every position carries the
label attack; and (b) a normal event with duration = 0, protocol = tcp, service = smtp, flag = SF
and src_byte = 4854, where every position carries the label normal.]
Figure 4.1: Conditional Random Fields for Network Intrusion Detection
In the figure, the observation features duration, protocol, service, flag and source bytes are
used to discriminate between attack and normal events. The features take some possible value for
every connection, which is then used to determine the most likely sequence of labels, < attack,
attack, attack, attack, attack > or < normal, normal, normal, normal, normal >. Custom
feature functions can be defined which describe the relationships among different features in the
observation. During training, feature weights are learnt, and during testing, features are evaluated
for the given observation, which is then labeled accordingly. It is evident from the figure that every
input feature is connected to every label, which indicates that all the features in an observation
determine the final labeling of the entire sequence. Thus, a conditional random field can model
dependencies among different features in an observation. Present intrusion detection systems do
not consider such relationships. They either consider only one feature, as in the case of system call
modeling, or assume independence among different features in an observation, as in the case of a
naive Bayes classifier. Our experimental results, described in Section 4.5, clearly suggest that
conditional random fields can effectively model such relationships among different features of an
observation, resulting in higher attack detection accuracy.
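To make the sequence representation concrete, the fragment below illustrates how a single summarized
connection could be laid out as a labelled sequence in the whitespace-separated, one-token-per-line
format accepted by sequence labelling tools such as CRF++, with consecutive records separated by a
blank line. The feature values are taken from Figure 4.1 and are purely illustrative; the exact column
layout used in our scripts may differ.

duration    0       attack
protocol    icmp    attack
service     eco_i   attack
flag        SF      attack
src_byte    8       attack

duration    0       normal
protocol    tcp     normal
service     smtp    normal
flag        SF      normal
src_byte    4854    normal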
We also note that in the KDD 1999 data set, attacks can be represented in four classes: Probe,
DoS, R2L and U2R. In order to consider this as a two class classification problem, the attacks
belonging to all four attack classes can be re-labeled as attack and mixed with the audit patterns
belonging to the normal class to build a single model which can be trained to detect any kind of
attack. Another approach for considering the same problem as a two class problem is to use
only the attacks belonging to a single attack class, mixed with audit patterns belonging to the
normal class, to train a separate sub system for each of the four attack classes. The problem can also be
considered as a five class classification problem, where a single system is trained with five classes
(normal, Probe, DoS, R2L and U2R) instead of two. Such a system can easily identify an attack
once it is detected but is very slow in operation, making its deployment impractical in high speed
networks.
As we will see from our experimental results, particularly from Table 4.14 in Section 4.6,
considering every attack class separately not only improves the attack detection accuracy but also
helps to improve the overall system performance when integrated with the layered framework.
Furthermore, it also helps to identify the class of an attack once it is detected at a particular layer
in the layered framework. However, a drawback of this implementation is that it requires domain
knowledge to perform feature selection for every layer. Nonetheless, this is a one-time process and,
given the critical nature of the problem of intrusion detection, if domain knowledge can help to
improve the attack detection accuracy it is recommended to do so.
Using conditional random fields improves the attack detection accuracy, particularly for the
U2R attacks. They are also effective in detecting the Probe, R2L and DoS attacks. However,
when we consider all the 41 features in the data set for each of the four attack classes separately,
conditional random fields can be expensive during training and testing. For a simple linear chain
structure, the time complexity for training a conditional random field is O(TL²NI), where T is the
length of the sequence, L is the number of labels, N is the number of training instances and I is
the number of iterations. During inference, the Viterbi algorithm [118], [119] is employed, which
has a complexity of O(TL²). The quadratic complexity is significant when the number of labels is
large, as in language tasks. However, for intrusion detection there are only two labels, normal and
attack, and thus our system is very efficient. We further improve the overall system performance
by implementing the layered framework and performing feature selection, which decreases T, i.e.,
the length of the sequence. We now describe feature selection for all the four attack classes.
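The quadratic term in the inference cost comes from the pairwise label transitions examined at every
position. The minimal Viterbi sketch below makes the O(TL²) behaviour explicit for the two-label case;
the emission and transition score arrays are placeholders standing in for scores produced by a trained
model, not the internals of the toolkit we use.

import numpy as np

def viterbi(emission, transition):
    """Most likely label sequence for a linear-chain model.

    emission   - array of shape (T, L): score of label l at position t.
    transition - array of shape (L, L): score of moving from label i to label j.
    Runtime is O(T * L^2); with L = 2 (normal/attack) this is effectively linear in T.
    """
    T, L = emission.shape
    delta = np.zeros((T, L))            # best score of any path ending in label l at position t
    back = np.zeros((T, L), dtype=int)  # backpointers for path recovery
    delta[0] = emission[0]
    for t in range(1, T):
        for l in range(L):              # L * L work per position -> the quadratic factor
            scores = delta[t - 1] + transition[:, l] + emission[t, l]
            back[t, l] = int(np.argmax(scores))
            delta[t, l] = scores[back[t, l]]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy example: 5 features per connection, labels 0 = normal, 1 = attack.
emission = np.array([[0.2, 1.0], [0.1, 0.9], [0.8, 0.3], [0.4, 0.7], [0.2, 1.1]])
transition = np.array([[0.5, -0.5], [-0.5, 0.5]])
print(viterbi(emission, transition))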
4.4.1 Feature Selection
Attacks belonging to different classes are different and, hence, for better attack detection, it becomes
necessary to consider them separately. As a result, in our layered system, we train every
layer separately to optimally detect a single class of attack. We therefore select different features
for different layers based upon the type of attack the layer is trained to detect. In Figure 4.2, we
present a detailed view of a single layer (the Probe layer) which can be used to detect Probe attacks
in our integrated system.
[Figure 4.2 shows the Probe layer in isolation: audit data (normal and Probe records) with all
features passes through feature selection into the Probe layer; connections detected as attacks are
blocked, while the remaining connections are labeled normal and allowed.]
Figure 4.2: Representation of Probe Layer with Feature Selection
The Probe layer is optimally trained to detect only the Probe attacks. Hence, we use only
the Probe attacks and the normal instances from the audit data to train this layer. Other layers
can be trained similarly. Note that we select different features to train different layers in our
framework. Experimental results clearly suggest that feature selection significantly improves the
attack detection capability of our system. Ideally, we would like to perform feature selection
automatically. However, experimental results in Section 4.6.2 suggest that present methods for
automatic feature selection are not effective. Hence, we use domain knowledge to select features
for all the four attack classes. We now describe our approach for selecting features for every layer
and why some features were chosen over others.
1. Probe Layer: Probe attacks are aimed at acquiring information about the target network
from a source which is often external to the network. Hence, basic connection level features
such as the duration of connection and source bytes are significant, while features
like number of file creations and number of files accessed are not expected to provide
information for detecting Probe attacks.
2. DoS Layer: DoS attacks are meant to prevent the target from providing service(s) to its
users by flooding the network with illegitimate requests. Hence, to detect attacks at the
DoS layer, network traffic features such as the percentage of connections having the same
destination host and same service, and packet level features such as the source bytes and
percentage of packets with errors, are significant. To detect DoS attacks, it may not be
important to know whether a user is logged in or not and, hence, such features are not
considered in the DoS layer.
3. R2L Layer: R2L attacks are among the most difficult attacks to detect as they involve both
the network level and the host level features. Hence, to detect R2L attacks, we selected
both the network level features, such as the duration of connection and service requested,
and the host level features, such as the number of failed login attempts, among others.
4. U2R Layer: U2R attacks involve semantic details which are very difficult to capture
at an early stage at the network level. Such attacks are often content based and target an
application. Hence, for detecting U2R attacks, we selected features such as number of file
creations and number of shell prompts invoked, while we ignored features such as protocol
and source bytes.
From all the 41 features in the KDD 1999 data set, we select only five features for the Probe layer,
nine features for the DoS layer, 14 features for the R2L layer and eight features for the U2R layer. Since every
layer in our framework is independent, the feature sets for the four layers need not be disjoint. We list
the features used for all the four layers in Appendix B.
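One convenient way to wire this layer-specific selection into data-formatting scripts is a simple mapping
from layer name to feature names, as sketched below. The lists are abbreviated illustrations built only
from the features discussed above; the complete sets of 5, 9, 14 and 8 features are those given in
Appendix B, so the exact contents here should be treated as placeholders.

# Illustrative (incomplete) per-layer feature subsets; the full lists appear in Appendix B.
LAYER_FEATURES = {
    "Probe": ["duration", "protocol_type", "service", "flag", "src_bytes"],
    "DoS":   ["src_bytes", "same_srv_rate", "serror_rate", "dst_host_same_srv_rate"],
    "R2L":   ["duration", "service", "num_failed_logins", "logged_in"],
    "U2R":   ["num_file_creations", "num_shells", "num_root", "hot"],
}

def project(record, layer):
    """Keep only the features the given layer is trained on.

    `record` maps KDD 1999 feature names to values.
    """
    return {name: record[name] for name in LAYER_FEATURES[layer]}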
4.4.2 Integrating the Layered Framework
The layered framework, introduced in Chapter 3, is general and can be tailored to build specific intrusion
detection systems. In this section, we describe how we integrate the layered framework
with conditional random fields to build an effective and efficient hybrid network intrusion
detection system.
Given the four different attack classes in the KDD 1999 data, we implement a four layer system
where every layer corresponds to a single attack class. The four layers are arranged in a sequence
as represented in Figure 4.3.
[Figure 4.3 shows the integrated system: the Probe, DoS, R2L and U2R layers are arranged in
sequence, each preceded by its own feature selection over all features; at every layer, connections
detected as attacks are blocked, connections labeled normal are passed to the next layer, and
connections labeled normal at the final (U2R) layer are allowed.]
Figure 4.3: Integrating Layered Framework with Conditional Random Fields
In the system, every layer is trained separately with the normal instances and with the attack
instances belonging to a single attack class. The layers are then arranged one after the other in a
sequence as shown in Figure 4.3. During testing, all the audit patterns (irrespective of
their attack class, which is unknown) are passed into the system starting from the first layer. If
the layer detects the instance as an attack, the system labels the instance as a Probe attack and
initiates the response mechanism; otherwise it passes the instance to the next layer. The same process
is repeated at every layer until either an instance is detected as an attack or it reaches the last layer,
where the instance is labeled as normal if no attack is detected. We now give the algorithm to
integrate the layered framework with conditional random fields.
Algorithm: Integrating Layered Framework & Conditional Random Fields
Algorithm 1 Training
1: Select the number of layers, n, for the complete system.
2: Separately perform feature selection for each layer.
3: Train a separate model with conditional random fields for each layer using the features selected
in Step 2.
4: Plug in the trained models sequentially such that only the connections labeled as normal are
passed to the next layer.
Algorithm 2 Testing
1: For each (next) test instance perform Steps 2 through 5.
2: Test the instance and label it either as attack or normal.
3: If the instance is labeled as attack, block it, identify it as an attack represented by the name of
the layer at which it is detected and go to Step 1. Else pass the sequence to the next layer.
4: If the current layer is not the last layer in the system, test the instance and go to Step 3. Else
go to Step 5.
5: Test the instance and label it either as normal or as an attack. If the instance is labeled as an
attack, block it and identify it as an attack corresponding to the layer name.
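A compact sketch of how the two algorithms fit together is given below. The train_crf call and the
predict method are stand-ins for whatever training and inference interface is used (in our case CRF++
is driven through scripts), so the interface shown here is illustrative rather than a description of the
toolkit.

LAYER_ORDER = ["Probe", "DoS", "R2L", "U2R"]   # one layer per attack class

def train_layers(train_records, train_crf, select):
    """Algorithm 1: one model per layer, trained on normal traffic plus a single attack class.

    train_crf(examples) is a placeholder for training a conditional random field on
    (selected-features, label) pairs, where the label is "normal" or "attack".
    select(record, layer) keeps only the features chosen for that layer.
    """
    models = {}
    for layer in LAYER_ORDER:
        examples = [(select(r, layer), "normal" if r["class"] == "normal" else "attack")
                    for r in train_records if r["class"] in ("normal", layer)]
        models[layer] = train_crf(examples)
    return models

def classify(record, models, select):
    """Algorithm 2: pass the record through the layers in sequence; block at the first hit."""
    for layer in LAYER_ORDER:
        if models[layer].predict(select(record, layer)) == "attack":
            return layer            # blocked; attack type identified by the detecting layer
    return "normal"                 # survived the last layer without being flagged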
4.5 Experiments and Results
For our experiments, we use the conditional random field toolkit CRF++ [120] and the Weka tool
[121]. We develop Python and shell scripts for data formatting and for implementing the layered
framework, and we perform all of our experiments on a desktop with an Intel(R) Core(TM) 2
CPU at 2.4 GHz and 2 GB of RAM under exactly the same conditions.
In our experiments we perform hybrid detection, i.e., we use both normal and anomalous audit
patterns to train the model in a supervised learning environment. We perform our experiments ten
times and report the best, the average and the worst cases. To measure the efficiency of attack
detection, we consider only the test time efficiency, since the real-time performance of an intrusion
detection system depends upon the test time efficiency alone. We observe that our system based on
the layered framework and conditional random fields, which we refer to as Layered Conditional
Random Fields, is very efficient during testing. The time required to test every instance when
we consider all the 41 features for all the four layers is 0.2236 ms. This reduces to 0.0678 ms when
we perform feature selection and implement the layered framework. More details are presented in
the following sections.
4.5.1 Building Individual Layers of the System
To determine the effectiveness of conditional random fields for intrusion detection we perform
two sets of experiments. In the first experiment, we examine the accuracy of conditional random
fields and compare them with other techniques which are known to perform well. In this
experiment we use all the 41 features to make a decision. We observe that the conditional random
fields perform very well, particularly for detecting U2R attacks, while the decision trees achieve
higher attack detection for the Probe and R2L attacks. The difference in attack detection accuracy
for DoS attacks is not significant. The reason for the better accuracy of decision trees is that they perform
feature selection and use only a small set of features in the final model. Hence, we perform
our second experiment, where we select a subset of features for all the four layers separately as
discussed earlier in Section 4.4.1.
For our experiments, we divided the training data into five different classes: normal, Probe,
DoS, R2L and U2R. Similarly, we divided the test data into five classes. As discussed in Section
4.4, we perform experiments separately for all the four attack classes by randomly selecting data
corresponding to that particular attack class and normal data only. For example, to detect Probe
attacks, we train and test the system with Probe attacks and normal audit patterns only. We do not
add other attacks such as DoS, R2L and U2R to the training data when training the sub system to
detect Probe attacks. Not including other attacks allows the system to better learn the features specific
to the Probe attacks and normal events. Hence, for the four attack classes we train four independent
models, separately, with and without feature selection to compare their performance. We perform
similar experiments with decision trees and naive Bayes. We call the models layered conditional
random fields, layered decision trees and layered naive Bayes when we perform feature selection.
For better comparison and readability, we present the results of the two experiments for all the
four layers together.
Detecting Probe Attacks
To detect Probe attacks, we train our system by randomly selecting 10,000 normal records and all
the Probe records from the training data. For testing the model, we select all the normal and
Probe records from the test data. Hence, we have about 15,000 training and 64,759 test instances.
1. Experiments with all 41 Features: In Table 4.2, we give the results for detecting Probe
attacks when we use all the 41 features for training and testing in the first experiment. The
table shows that the system takes a total of 14.53 seconds to label all the 64,759 test
instances. The results suggest that decision trees are more efficient than conditional random
fields and naive Bayes. This is because they have a small tree structure, often with very
few decision nodes, which is very efficient. The attack detection accuracy is also higher
for the decision trees since they select the best possible features during tree construction.
However, when we perform feature selection, the layered conditional random fields achieve
much higher accuracy and there is a significant improvement in train and test time efficiency.
Table 4.2: Detecting Probe Attacks (with all 41 Features)

                               Precision   Recall   F-Measure   Train    Test
                               (%)         (%)      (%)         (sec.)   (sec.)
Conditional       Best         84.60       89.94    86.73
Random Fields     Average      82.53       88.06    85.21       200.6    14.53
                  Worst        80.44       86.13    83.19
Naive Bayes       Best         73.20       97.00    83.30
                  Average      72.26       96.65    82.70       1.08     6.31
                  Worst        71.20       96.30    81.90
Decision Trees    Best         93.20       97.70    95.40
                  Average      87.36       95.73    91.34       2.04     2.40
                  Worst        85.50       90.90    88.80
2. Experiments with Feature Selection: In the second experiment, we use the same data as
in the previous experiment; however, we perform feature selection. We give the results
for detecting Probe attacks with feature selection in Table 4.3. The table suggests that the
layered conditional random fields perform better and faster than in the previous experiment
and are the best choice for detecting Probe attacks. The system takes only 2.04 seconds
to label all the 64,759 test instances. We observe that there is no significant advantage
with respect to time for the layered decision trees. This is because the size of the final
tree with decision trees and with layered decision trees is not significantly different,
resulting in similar efficiency. We also observe that the Recall and, hence, the F-Measure
for layered naive Bayes decrease drastically. This can be explained as follows: the
classification accuracy with naive Bayes generally improves as the number of features
increases. However, if the number of features increases to a very large extent, the estimation
tends to become unreliable. As a result, when we use all the 41 features, naive Bayes
performs well, but when we perform feature selection and use only five features, its
classification accuracy decreases. The results in Table 4.3 clearly suggest that layered
conditional random fields are a better choice for detecting Probe attacks.
Table 4.3: Detecting Probe Attacks (with Feature Selection)

                               Precision   Recall   F-Measure   Train    Test
                               (%)         (%)      (%)         (sec.)   (sec.)
Layered           Best         89.72       98.03    93.68
Conditional       Average      88.19       97.82    92.73       6.91     2.04
Random Fields     Worst        82.92       96.48    89.82
Layered           Best         78.80       21.30    33.60
Naive Bayes       Average      77.23       19.57    31.22       0.45     1.13
                  Worst        74.70       17.00    27.70
Layered           Best         87.50       97.70    92.30
Decision Trees    Average      87.04       97.41    91.93       0.54     1.00
                  Worst        86.60       95.20    90.80
Detecting DoS Attacks
We randomly select 20,000 normal records and 4,000 DoS records from the training data to train
the system to detect DoS attacks. For testing, we select all the normal and DoS records from the
test set. Hence, we have 24,000 training instances and 290,446 test instances.
1. Experiments with all 41 Features: In Table 4.4, we give the results for detecting DoS
attacks when we use all the 41 features. The table shows that the system takes a total
of 64.42 seconds to label all the 290,446 test instances. The results show that all
three methods have similar attack detection accuracy; however, decision trees give a slight
advantage with regard to the test time efficiency.
2. Experiments with Feature Selection: To detect DoS attacks with feature selection we
perform experiments on the same data used in the previous experiment, but we perform
Table 4.4: Detecting DoS Attacks (with all 41 Features)

                               Precision   Recall   F-Measure   Train    Test
                               (%)         (%)      (%)         (sec.)   (sec.)
Conditional       Best         99.82       97.11    98.43
Random Fields     Average      99.78       97.05    98.40       256.11   64.42
                  Worst        99.75       96.99    98.37
Naive Bayes       Best         99.40       97.00    98.20
                  Average      99.32       97.00    98.17       1.79     26.28
                  Worst        99.30       97.00    98.10
Decision Trees    Best         99.90       97.20    98.60
                  Average      99.90       97.00    98.46       6.09     9.04
                  Worst        99.90       96.70    98.30
feature selection in this experiment. Table 4.5 presents the results. With feature selection,
the system takes only 15.17 seconds to label all the 290,446 test instances. The results
follow the same trend as in the previous experiment. Considering the test time efficiency,
layered decision trees are a better choice for detecting DoS attacks. It is important to note
that there is a slight increase in the detection accuracy when feature selection is performed;
however, this increase is not significant. In this experiment, the real advantage of feature
selection is seen in terms of the improvement in test time performance.
Table 4.5: Detecting DoS Attacks (with Feature Selection)

                               Precision   Recall   F-Measure   Train    Test
                               (%)         (%)      (%)         (sec.)   (sec.)
Layered           Best         99.99       97.12    98.53
Conditional       Average      99.98       97.05    98.50       26.59    15.17
Random Fields     Worst        99.97       97.01    98.48
Layered           Best         99.40       97.00    98.20
Naive Bayes       Average      99.39       97.00    98.19       0.68     6.50
                  Worst        99.30       97.00    98.10
Layered           Best         99.90       97.30    98.60
Decision Trees    Average      99.90       97.10    98.50       1.31     3.87
                  Worst        99.90       97.00    98.40
Detecting R2L Attacks
For training our system to detect R2L attacks, we randomly select 1,000 normal records and all the
R2L records from the training data. To test the model, we select all the normal and R2L records
from the test set. Hence, we have about 2,000 training instances and 76,942 test instances.
1. Experiments with all 41 Features: In Table 4.6, we give the results for detecting R2L
attacks when we use all the 41 features. We observe that to test all the 76,942 test instances,
the system takes 17.16 seconds. Table 4.6 suggests that decision trees have a higher
F-Measure, but the conditional random fields have higher Precision when compared with
the other methods, i.e., a system using conditional random fields generates fewer false alarms.
Table 4.6: Detecting R2L Attacks (with all 41 Features)

                               Precision   Recall   F-Measure   Train    Test
                               (%)         (%)      (%)         (sec.)   (sec.)
Conditional       Best         93.67       16.81    28.42
Random Fields     Average      92.35       15.10    25.94       23.40    17.16
                  Worst        90.54       12.42    21.89
Naive Bayes       Best         74.10        7.40    13.40
                  Average      70.03        6.63    12.12       0.38     7.33
                  Worst        61.30        5.40    10.00
Decision Trees    Best         98.30       37.10    53.20
                  Average      84.68       23.29    35.62       0.60     2.75
                  Worst        63.70       10.40    18.30
2. Experiments with Feature Selection: In the second experiment, we use the same data as
in the previous experiment; however, we perform feature selection. From the results in
Table 4.7, we observe that the system takes only 5.96 seconds to test all the 76,942 test
instances and that the layered conditional random fields perform much better than conditional
random fields (an increase in F-Measure of about 60%), layered decision trees (an increase
of about 125%), decision trees (an increase of about 17%), layered naive Bayes (an increase
of about 250%) and naive Bayes (an increase of about 250%), and are the best choice
for detecting R2L attacks. Layered conditional random fields take slightly more time, which
is acceptable as they achieve much higher attack detection accuracy.
Table 4.7: Detecting R2L Attacks (with Feature Selection)

                               Precision   Recall   F-Measure   Train    Test
                               (%)         (%)      (%)         (sec.)   (sec.)
Layered           Best         95.84       31.67    47.52
Conditional       Average      94.70       27.08    42.08       5.30     5.96
Random Fields     Worst        91.37       24.98    39.23
Layered           Best         88.30        7.20    13.30
Naive Bayes       Average      81.81        6.47    11.98       0.31     2.99
                  Worst        78.20        4.10     7.80
Layered           Best         89.70       14.50    24.90
Decision Trees    Average      85.48       10.39    18.43       0.36     1.43
                  Worst        78.80        7.30    13.50
Detecting U2R Attacks
To detect U2R attacks, in the first experiment, we train our system by randomly selecting 1,000
normal records and all the U2R records from the training data. We use all the normal and U2R
records from the test set for testing the system. Hence, we have about 1,000 training instances
and 60,661 test instances.
1. Experiments with all 41 Features: In Table 4.8, we give the results for detecting U2R
attacks when we use all of the 41 features. The system takes 13.45 seconds to label the 60,661
test instances. Table 4.8 clearly shows that conditional random fields are far better for
detecting U2R attacks when compared with the other methods. The F-Measure for conditional
random fields is more than 150% higher with respect to the decision trees and more than 600%
higher with respect to the naive Bayes. The U2R attacks are very difficult to detect and most
present intrusion detection systems fail to detect such attacks with acceptable reliability.
We observe that conditional random fields can be used to reliably detect the U2R attacks
in particular.
2. Experiments with Feature Selection: In the second experiment, we use the same data
as in the previous experiment to detect U2R attacks; however, we perform feature
selection. We give the results for detecting U2R attacks with feature
Table 4.8: Detecting U2R Attacks (with all 41 Features)

                               Precision   Recall   F-Measure   Train    Test
                               (%)         (%)      (%)         (sec.)   (sec.)
Conditional       Best         58.62       60.29    56.74
Random Fields     Average      52.16       55.02    53.44       8.35     13.45
                  Worst        47.30       50.00    49.30
Naive Bayes       Best          5.30       91.20    10.00
                  Average       3.94       85.88     7.54       0.31     5.90
                  Worst         3.20       82.40     6.20
Decision Trees    Best         24.80       63.20    34.90
                  Average      12.93       57.49    20.42       0.37     2.22
                  Worst         6.30       51.50    11.20
selection in Table 4.9. We observe that the system takes only 2.67 seconds to label all the
60,661 test instances. Table 4.9 clearly suggests that layered conditional random fields
are the best choice for detecting U2R attacks and are far better than conditional random
fields (an increase of about 8%), layered decision trees (an increase of about 30%), decision trees
(an increase of about 184%), layered naive Bayes (an increase of about 38%) and naive Bayes
(an increase of about 675%). We also observe that the attack detection capability increases
for the decision trees and the naive Bayes when we perform feature selection.
Table 4.9: Detecting U2R Attacks (with Feature Selection)

                               Precision   Recall   F-Measure   Train    Test
                               (%)         (%)      (%)         (sec.)   (sec.)
Layered           Best         58.57       64.71    61.11
Conditional       Average      55.07       62.35    58.19       0.85     2.67
Random Fields     Worst        34.96       60.29    45.03
Layered           Best         50.00       66.20    51.40
Naive Bayes       Average      35.48       55.12    41.97       0.25     1.83
                  Worst        19.60       52.90    29.80
Layered           Best         51.00       38.20    43.70
Decision Trees    Average      51.00       38.20    43.70       0.29     0.93
                  Worst        51.00       38.20    43.70
It is evident from our results that the attack detection accuracy using layered conditional random
fields is significantly higher for detecting the U2R, R2L and Probe attacks. The difference in
attack detection accuracy is, however, not significant for the DoS attacks. Further, regardless of the
method considered, and particularly for conditional random fields, the time required for training
and testing the system reduces significantly once we perform feature selection.
4.5.2 Implementing the Integrated System
In many situations, there is a tradeoff between the efficiency and the accuracy of the system and there
can be various avenues to improve system performance. Methods such as naive Bayes assume
independence among the observed data. This certainly increases system efficiency, but it severely
affects the accuracy, as we observed from the experimental results. To balance this tradeoff we
use the conditional random fields, which are more accurate, though expensive, and we implement
the layered approach to improve overall system performance. The performance of our integrated
system, Layered Conditional Random Fields, is comparable to that of the decision trees and
the naive Bayes, and our system has higher attack detection accuracy.
The experimental results in Section 4.5.1 suggest that conditional random fields (with feature
selection) can be very effective in detecting different attacks when different attack classes are considered
separately. However, in a real scenario, the category of an attack is unknown. Rather, it
would be beneficial if an intrusion detection system not only detects an attack but also identifies
the type of attack, thereby enabling specific intrusion response mechanisms depending upon the
type of attack. We perform further experiments with the integrated system presented in Section
4.4.2. Results show that integrating the layered framework not only improves the efficiency of the
overall system, but also helps to identify the type of attack once it is detected. This is because
individual layers in the layered framework are trained to detect only a particular class of attack.
As soon as a layer detects an attack, the category of the attack can be inferred from the class of
attack the layer is trained to detect. For example, if an attack is detected at the U2R layer in the
layered framework, it is very likely that the attack is of U2R type and, hence, the system labels the
attack as U2R and initiates specific response mechanisms.
To examine the effectiveness of our integrated system, layered conditional random fields, we
perform experiments in an environment similar to a real life deployment of the system. For this
experiment, we perform feature selection and use exactly the same training instances as used for
training the individual models in the experiments described in Section 4.5.1. However, we re-label
the entire data in the test set as either normal or attack. During testing, all the instances from the
test set are passed through the system starting from the first layer. If layer one detects an attack, it
is blocked and labeled as Probe. Only the connections which are labeled as normal at the first layer
are allowed to pass to the next layer. Since the layer is trained to detect Probe attacks effectively,
most of the Probe attacks are detected. Other attacks such as DoS can be seen either as normal or
as Probe. If other attacks are detected as Probe, this must be considered an advantage, since the
attack is detected at an early stage. Similarly, if some Probe attacks are not detected at the first
layer, they may be detected at subsequent layers. The same process is repeated at the following layers,
where an attack is blocked and labeled as DoS, R2L or U2R at layer two, layer three and layer
four respectively. We perform all experiments 10 times and report their average. Table 4.10 gives
the % detection with respect to each of the five classes in a confusion matrix.
Table 4.10: Confusion Matrix (% Detection)

           Probe     DoS      R2L      U2R     Normal    Total % Blocked
Probe      97.82     0.11     0.69     0.00     1.38         98.62
DoS        25.50    71.90     0.00     0.00     2.60         97.40
R2L         3.00     0.00    26.58     0.04    70.38         29.62
U2R         5.15     0.00    77.65     3.53    13.67         86.33
Normal      0.91     0.07     0.35     0.05    98.62          1.38
From Table 4.10, we observe that an intrusion detection system based on layered conditional
random fields can detect most of the Probe (98.62%), DoS (97.40%) and U2R (86.33%) attacks
while giving very few false alarms at each layer. The system can also detect R2L attacks with
much higher accuracy (29.62%) when compared with previously reported systems. The confusion
matrix shows that only 71.90% of DoS attacks are labeled as DoS. However, it is very important to
note that the accuracy for detecting DoS attacks is not 71.90%; rather, it is 25.50 + 71.90 + 0.00 +
0.00 = 97.40%. This is because 25.50% of the DoS attacks have been detected at the first layer
itself, though the system identifies them as Probe attacks since the first layer is the Probe
layer. This is acceptable because it is critical to detect an attack as early as possible, which helps
to minimize the impact of an attack.
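The per-class totals in the last column of Table 4.10 are simply the row sums over the four attack
layers, i.e. everything except the fraction that passed through all layers and was labeled normal. The
short Python check below reproduces the DoS figure quoted above from the table values.

# Rows of Table 4.10: % of each true class detected at the Probe, DoS, R2L and U2R layers,
# followed by the % that reached the end of the system labeled normal.
confusion = {
    "Probe": [97.82,  0.11,  0.69, 0.00,  1.38],
    "DoS":   [25.50, 71.90,  0.00, 0.00,  2.60],
    "R2L":   [ 3.00,  0.00, 26.58, 0.04, 70.38],
    "U2R":   [ 5.15,  0.00, 77.65, 3.53, 13.67],
}

for cls, row in confusion.items():
    blocked = sum(row[:4])                    # detected at *some* layer, whatever its label
    print(f"{cls}: {blocked:.2f}% blocked")   # e.g. DoS -> 25.50 + 71.90 = 97.40% blocked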
We also note that most of the U2R attacks are detected at the third layer and are hence labeled as
R2L. However, if we remove the third layer, the fourth layer can detect U2R attacks with similar
accuracy. Looking at the R2L and U2R attacks in Table 4.10, it is natural to think that the two
layers could be merged. However, this has two disadvantages. First, merging the two layers results
in an increase in the number of features, which affects efficiency. When the layers are merged, the
merged layer performs poorly with respect to the total test time when compared with the combined
test time of both unmerged layers. Second, the U2R attacks are not detected effectively and
their individual attack detection accuracy decreases. This is because the number of U2R attacks in
the training data is very small and the system simply learns the features which are specific to the
R2L attacks. Hence, we use two separate layers for detecting R2L and U2R attacks. Using the layered
framework, it is hoped that any attack, even if its category is unknown, can be detected
at one of the layers in the system. The number of layers can also be increased or decreased
in the layered framework, making the system scalable and flexible to the specific requirements of the
particular environment where it is deployed.
We evaluate the performance of every layer in our system in Table 4.11. The table clearly
shows that, out of all the 250,436 attack instances in the test set, more than 25% of the attacks are
blocked at layer one and more than 90% of all the attacks are blocked by the end of layer two.
Thus, the layered framework is very effective in reducing the attack traffic at every layer in the
system. This configuration takes only 21 seconds to classify all the 250,436 attacks.
Table 4.11: Attack Detection at Individual Layers (Case 1)

           Accuracy                     Test Time
           Total      Cumulative       Per Instance   Total    Cumulative
           (%)        (%)              (ms)           (sec.)   (sec.)
Probe      25.226     25.226           0.031          8        8
DoS        65.996     91.222           0.053          10       18
R2L         1.770     92.992           0.090          2        20
U2R         0.004     92.996           0.056          1        21
We can further optimize this configuration by putting the DoS layer before the Probe layer.
We can do this because the data is relational and every layer in the system is independent. Putting
the DoS layer before the Probe layer improves overall system performance and helps to detect a
large number of attacks at the first layer itself. Such optimization becomes significant in severe
attack situations when the target is overwhelmed with illegitimate connections. We present the
results in Table 4.12.
Table 4.12: Attack Detection at Individual Layers (Case 2)

           Accuracy                     Test Time
           Total      Cumulative       Per Instance   Total    Cumulative
           (%)        (%)              (ms)           (sec.)   (sec.)
DoS        89.807     89.807           0.051          13       13
Probe       1.415     91.222           0.031          1        14
R2L         1.770     92.992           0.090          2        16
U2R         0.004     92.996           0.056          1        17
Table 4.12 shows that our system can analyze 250,436 test instances in 17 seconds, i.e., it can
handle 1.4731 × 10⁴ instances per second. Now, assuming the average size of an instance to be
1.5 KB, the overall bandwidth which our system can handle is easily in excess of 100 Mbps. It is
important to note that this performance is achieved on a desktop with an Intel(R) Core(TM)
2 CPU at 2.4 GHz and 2 GB of RAM in the Windows environment. Significant performance
improvements can be achieved by building dedicated devices for large scale commercial deployment.
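The bandwidth estimate follows from the measured test time by straightforward arithmetic. The
back-of-the-envelope check below assumes, as stated above, an average instance size of 1.5 KB (taken
here as 1,500 bytes); it is a sanity check on the quoted figures rather than a measurement.

instances = 250_436          # attack instances classified (Table 4.12)
seconds = 17                 # total test time for the DoS-first configuration
avg_instance_bytes = 1_500   # assumed average size of a summarized connection record

rate = instances / seconds                              # roughly 1.47 x 10^4 instances per second
throughput_mbps = rate * avg_instance_bytes * 8 / 1e6   # convert bytes/s to megabits/s
print(f"{rate:,.0f} instances/s -> {throughput_mbps:.0f} Mbps")   # well in excess of 100 Mbps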
4.6 Comparison and Analysis of Results
The experimental results from Section 4.5 clearly suggest that conditional random fields, when integrated
with the layered framework, can be used to build effective and efficient network intrusion detection
systems. In this section, we compare the layered conditional random fields with other well
known methods for intrusion detection based on the anomaly detection principle. Anomaly
based systems primarily detect deviations from the learnt normal data using statistical methods,
machine learning or data mining approaches [9]. Standard techniques such as decision trees and
naive Bayes are known to perform well. However, our experimental results show that layered
conditional random fields perform far better than these techniques. The main reason for better
accuracy of our system is that the conditional random fields do not consider observation features
to be independent. In [122], the authors present a comparative study of various classiers when
applied to the KDD 1999 data set. To improve attack detection, the authors in [123] propose the
use of principal component analysis before applying any machine learning algorithm. The use of
support vector machines for intrusion detection is discussed in [72]. We compare these methods
with our layered conditional random fields for intrusion detection in Table 4.13. The table repre-
sents the Probability of Detection (PD) and the False Alarm Rate (FAR) in % for different methods
including the winners of the KDD 1999 cup.
Comparison from Table 4.13 suggests that layered conditional random fields perform significantly better than previously reported results including the winner of the KDD 1999 cup and various other methods applied to the KDD 1999 data set. The most impressive part of layered conditional random fields is the margin of improvement when compared with other methods. They have very high attack detection of 98.6% for Probe attacks (5.8% improvement) and 97.40% detection for DoS attacks. They outperform by a significant percentage for R2L (34.5% improvement)
and U2R (34.8% improvement) attacks.
Table 4.13: Comparison of Results

Method                                            Probe    DoS      R2L       U2R
Layered Conditional Random Fields      PD         98.60    97.40    29.600    86.3000
                                       FAR         0.91     0.07     0.350     0.0500
KDD 1999 Winner [122]                  PD         83.30    97.10     8.400    13.2000
                                       FAR         0.60     0.30     0.005     0.0030
Multi Classifier [122]                 PD         88.70    97.30     9.600    29.8000
                                       FAR         0.40     0.40     0.100     0.4000
Multi Layer Perceptron [122]           PD         88.70    97.20     5.600    13.2000
                                       FAR         0.40     0.30     0.010     0.0500
Gaussian Classifier [122]              PD         90.20    82.40     9.600    22.8000
                                       FAR        11.30     0.90     0.100     0.5000
K-Means Clustering [122]               PD         87.60    97.30     6.400    29.8000
                                       FAR         2.60     0.40     0.100     0.4000
Nearest Cluster Algorithm [122]        PD         88.80    97.10     3.400     2.2000
                                       FAR         0.50     0.30     0.010     0.0006
Incremental Radial Basis               PD         93.20    73.00     5.900     6.1000
Function [122]                         FAR        18.80     0.20     0.300     0.0400
Leader Algorithm [122]                 PD         83.80    97.20     0.100     6.6000
                                       FAR         0.30     0.30     0.003     0.0300
Hypersphere Algorithm [122]            PD         84.80    97.20     1.000     8.3000
                                       FAR         0.40     0.30     0.005     0.0090
Fuzzy ARTMAP [122]                     PD         77.20    97.00     3.700     6.1000
                                       FAR         0.20     0.30     0.004     0.0010
C4.5 (Decision Trees) [122]            PD         80.80    97.00     4.600     1.8000
                                       FAR         0.70     0.30     0.005     0.0020
Nearest Neighbour with Principal       PD         86.13    97.32     2.510    64.0400
Component Analysis (4 axis) [123]      FAR         0.27     0.23     0.001     0.0001
Decision Trees with Principal          PD         70.40    97.58     0.070     7.0200
Component Analysis (2 axis) [123]      FAR         0.85     0.12     0.030     0.0001
Support Vector Machines [72]           PD         36.65    91.60    22.000    12.0000
                                       FAR            -        -         -          -
4.6.1 Significance of Layered Framework
To evaluate the effectiveness of the layered framework, we perform further experiments where we
do not implement the layered framework, i.e., we train a single system with two classes, normal
and attack, by labeling all the Probe, DoS, R2L and U2R attacks as attack. We perform experi-
ments, both, with and without feature selection. For experiments when we do not implement the
layered framework but we perform feature selection, we select 21 features out of the total of 41
features by applying the union operation on the feature sets of the four individual attack classes.
Table 4.14 presents the results.
Table 4.14: Layered vs. Non-Layered Framework

                                      Attack Detection (%)                 Test Time (sec.)
                                      Probe    DoS      R2L      U2R
Layered        Feature Selection      98.62    97.40    29.62    86.33     17
               All Features           88.06    97.05    15.10    55.03     56
Non Layered    Feature Selection      92.21    96.88    16.01    60.00     29
               All Features           87.94    96.12    17.58    48.24     57
Comparison from Table 4.14 clearly suggests that a system implementing the layered frame-
work with feature selection is more efficient and more accurate in detecting attacks, particularly the U2R, R2L and Probe attacks. The motivation behind the layered framework is to improve performance speed while feature selection helps to improve classification accuracy. Hence, a system which implements feature selection with the layered framework can benefit from both: high performance speed and high classification accuracy. Further, in Table 4.14, we should read the time in relative terms rather than in absolute terms since, for ease of experiments, we use scripts for implementation. In a real environment, high speed can be achieved by implementing the complete system in languages with efficient compilers such as the C language. Further, as discussed earlier, we can implement
pipelining in multi core processors where every core represents a single layer and due to pipelin-
ing, multiple I/O operations can be replaced by a single I/O operation providing very high speed
of operation.
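The pipelining idea can be sketched with one worker process per layer connected by queues, as below; this is only an illustration of the intended design (the detector functions are placeholders and must be picklable, module-level callables), not the implementation used for the timings reported in this chapter.

    import multiprocessing as mp

    def layer_worker(name, detect, q_in, q_out, alarms):
        """One pipeline stage: block attacks, forward normal-looking instances."""
        for instance in iter(q_in.get, None):      # None is the shutdown signal
            if detect(instance):
                alarms.put((name, instance))       # blocked and identified at this layer
            elif q_out is not None:
                q_out.put(instance)                # hand over to the next layer
            # instances passing the last layer are treated as normal
        if q_out is not None:
            q_out.put(None)                        # propagate shutdown downstream

    def build_pipeline(layers):
        """layers: list of (name, detect_fn) pairs in the desired order."""
        alarms = mp.Queue()
        queues = [mp.Queue() for _ in range(len(layers))]
        workers = []
        for i, (name, detect) in enumerate(layers):
            q_out = queues[i + 1] if i + 1 < len(layers) else None
            workers.append(mp.Process(target=layer_worker,
                                      args=(name, detect, queues[i], q_out, alarms)))
        for w in workers:
            w.start()
        return queues[0], alarms, workers          # feed instances into queues[0]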
4.6.2 Significance of Feature Selection
From the experiments in the previous sections, we observe that performing feature selection improves the attack detection accuracy as well as the efficiency of the system. In our experiments, we
performed manual feature selection, using our domain knowledge. However, it would be advan-
tageous if we can select features automatically for different attack classes. For experiments with
automatic feature selection, we use methods such as those discussed in [124] and [125] which
can automatically extract significant features. We compare the results of manual feature selection
with automatic feature selection for all the layers. We observe that the system using automatic
feature selection has similar test time performance when compared to the system with manual
feature selection, but the accuracy of detection is significantly lower when features are induced
automatically than the system based on manual feature selection. We compare the effect of fea-
ture selection on intrusion detection in Table 4.15. For automatic feature selection, we perform
experiments using the Mallet tool [126].
Table 4.15: Significance of Feature Selection

                              F-Measure (%)
                              Probe    DoS      R2L      U2R
Manual          Best          93.68    98.53    47.52    61.11
                Average       92.73    98.50    42.08    58.19
                Worst         89.82    98.48    39.23    45.03
Automatic       Best          87.39    98.38    42.06    53.90
                Average       86.28    98.31    32.14    49.80
                Worst         85.03    98.20    25.15    46.58
No Selection    Best          86.73    98.43    28.42    56.74
                Average       85.21    98.40    25.94    53.44
                Worst         83.19    98.37    21.89    49.30
It is not surprising that manual feature selection performs better than automatic feature selec-
tion. However, we also considered other methods for automatic feature selection. We performed
experiments with a feed-forward neural network to determine the weights for all the 41 features. We
then discarded the features with weights close to zero. This results in only a small set of features
for each layer. However, when we performed similar experiments with the reduced set of features,
there was no significant improvement in the attack detection accuracy. We then used Principal Component Analysis (PCA) for dimensionality reduction [123]. However, the main drawback of using PCA followed by conditional random fields is that PCA transforms a large number of possibly correlated features into a small number of uncorrelated features known as the principal components. Hence, when we applied PCA to the data set and then implemented the system using conditional random fields in the newly transformed feature space, the combined approach did not provide a significant advantage. This is because the strength of conditional random fields is to model correlation between features, but the features in the transformed space are independent. We then used the C4.5 algorithm [127] to perform feature selection. We constructed a decision tree and selected only a small set of features which were selected by the C4.5 algorithm for further experiments. However, there was no significant improvement in the results.
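As an aside, the tree-based selection just described can be sketched with scikit-learn as below; note that scikit-learn trains a CART tree rather than C4.5, so this is only an approximation of the procedure, and the data loading is left abstract.

    from sklearn.tree import DecisionTreeClassifier

    def tree_selected_features(X, y, feature_names):
        """Fit a decision tree and keep only the features it actually uses.

        X: training instances (rows of the 41 features), y: class labels,
        feature_names: the corresponding 41 feature names.
        """
        tree = DecisionTreeClassifier(random_state=0).fit(X, y)
        return [name for name, importance in zip(feature_names, tree.feature_importances_)
                if importance > 0.0]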
Given the critical nature of the task of intrusion detection, it is important to detect most of the
attacks with very few false alarms; hence, we use domain knowledge to improve attack detection
accuracy. Nonetheless, automatic feature selection with layered conditional random fields is still
a feasible scheme for building reliable network intrusion detection systems which can operate
efficiently in high speed networks.
4.6.3 Significance of Our Results
Experimental results show that conditional random fields have high attack detection accuracy.
However, if we use all the 41 features for all the four attack classes, the time required to train
and test the model is very high. To address this, we perform feature selection and implement the
layered framework with the conditional random fields to produce a four layer system. The four
layers correspond to Probe, DoS, R2L and U2R attacks. We observe that the test time performance
of the integrated system is comparable with other methods; however, the time required to train
the model is slightly higher. We also observe that feature selection not only improves the test
time efciency, but it also increases the accuracy of attack detection. This is because using more
features than required can generate superfluous rules often resulting in fitting irregularities in the
data which can misguide classication. With regards to improving the attack detection accuracy,
the main strength of layered conditional random fields lies in detecting R2L and U2R attacks
which are not satisfactorily detected by other methods. Our system also gives slight improvement
for detecting Probe attacks but has similar accuracy for detecting DoS attacks.
The prime reason for better attack detection accuracy for conditional random fields is that they
do not consider observation features to be independent. This results in capturing the correlation
among different features in the observation resulting in higher accuracy. Considering both, the
accuracy and the time required for testing, layered conditional random fields score better.
To determine the statistical significance of our results, we rank all the six systems in order of significance for detecting Probe, DoS, R2L and U2R attacks. We use the Wilcoxon test [128] with a 95% confidence interval to discriminate the performance of these methods. We compare
the ranking for various methods in Table 4.16, where a system with rank 1 represents the best
system.
Table 4.16: Ranking Various Methods for Intrusion Detection

                                        Probe    DoS    R2L    U2R
Layered Conditional Random Fields       1        1      1      1
Conditional Random Fields               4        4      3      2
Layered Decision Trees                  1        1      4      3
Decision Trees                          1        1      1      5
Layered Naive Bayes                     6        5      5      3
Naive Bayes                             5        5      5      6
The results of the test indicate that layered conditional random fields are significantly better (or equal) for detecting attacks when compared with other methods. Thus, layered conditional random fields are a strong candidate for building effective and efficient network intrusion detection
systems.
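For reference, a paired Wilcoxon signed-rank test of the kind used for this ranking can be run with SciPy as sketched below; the per-run F-Measure values are placeholders, not the scores behind Table 4.16.

    from scipy.stats import wilcoxon

    # Hypothetical per-run F-Measures for two systems evaluated on the same splits.
    lcrf_scores = [0.94, 0.93, 0.95, 0.92, 0.94, 0.93, 0.95, 0.94, 0.92, 0.93]
    dt_scores   = [0.90, 0.89, 0.91, 0.88, 0.90, 0.89, 0.90, 0.91, 0.88, 0.89]

    statistic, p_value = wilcoxon(lcrf_scores, dt_scores)
    if p_value < 0.05:                 # 95% confidence level
        print("difference is statistically significant")
    else:
        print("no significant difference at the 95% level")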
4.7 Robustness of the System
In order to test the robustness of our system, it is important to perform similar experiments with
a number of other data sets. However, given the domain of the problem, no other data sets are
freely available which can be used for similar experimentation. To ameliorate this problem to
some extent and to study the robustness of our system, we add a substantial amount of noise to the
training data and perform similar experiments.
4.7.1 Addition of Noise
We control the addition of noise in the data by two parameters, the probability of adding noise to
a feature, p, and the scaling factor, s. We perform four sets of experiments with noisy data, one for each layer. For every set of experiments, we vary the parameter p between 0 and 1 (by keeping it at values 0.10, 0.20, 0.33, 0.50, 0.75, 0.90 and 0.95) and vary the parameter s between -1000 and +1000. In the case when the original feature is 0, we add noise to that feature by using an additive
function (a random value between -1000 and +1000) instead of scaling. We represent the effect of
noise for detecting Probe, DoS, R2L and U2R attacks separately in Figures 4.4, 4.5, 4.6 and 4.7
respectively. The figures clearly suggest that the layered conditional random fields are robust to
noise in the training data and perform better than other methods.
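The noise-injection procedure can be sketched as below; the exact distribution from which the scaling factor is drawn is not fixed by the description above, so the uniform choice here is an assumption.

    import random

    def add_noise(instance, p, s_max=1000):
        """Return a noisy copy of a numeric feature vector.

        p     : probability of perturbing each feature (0.10 ... 0.95 in our runs)
        s_max : bound on the scaling factor / additive value (here 1000)
        """
        noisy = []
        for value in instance:
            if random.random() < p:
                if value == 0:
                    noisy.append(random.uniform(-s_max, s_max))            # additive noise
                else:
                    noisy.append(value * random.uniform(-s_max, s_max))    # scaled value
            else:
                noisy.append(value)
        return noisy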
Figure 4.4: Effect of Noise on Probe Layer (F-Measure vs. noise percentage for LCRF, CRF, DT and NB)
Figure 4.5: Effect of Noise on DoS Layer (F-Measure vs. noise percentage for LCRF, CRF, DT and NB)
Figure 4.6: Effect of Noise on R2L Layer (F-Measure vs. noise percentage for LCRF, CRF, DT and NB)
Figure 4.7: Effect of Noise on U2R Layer (F-Measure vs. noise percentage for LCRF, CRF, DT and NB)
4.8 Conclusions
In this chapter, we addressed the core issues concerning the anomaly and hybrid intrusion detec-
tion systems at the network level, viz., the accuracy of attack detection, the capability of detecting a wide variety of attacks and the efficiency of operation. Our experimental results in Section 4.5.1 show that conditional random fields are very effective in improving the attack detection rate and decreasing the false alarm rate. Having a low false alarm rate is important for any intrusion detection system. Further, experimental results presented in Section 4.5.2 show that feature selection and implementing the layered framework significantly reduce the time required to train and test the model. Experiments also suggest that conditional random fields can be very effective in re-
ducing the false alarms, thereby improving the attack detection accuracy. Further, our system can
be implemented to detect a variety of attacks including the DoS, Probe, R2L and the U2R. Other
types of attacks can also be detected by adding new layers in the system, making our system highly
scalable. We compared our approach with some well known methods for intrusion detection such
as the decision trees and naive Bayes. These methods, however, cannot detect the R2L and the
U2R attacks effectively, while our integrated system can effectively and efficiently detect such
attacks giving an improvement of 34.5% for the R2L attacks and 34.8% for the U2R attacks. Our
system also helps in identifying an attack once it is detected at a particular layer which expedites
the intrusion response mechanism, thus minimizing the impact of an attack. We showed that our
system is robust to noise in the training data and performs better than any other compared system.
Our system has all the advantages of the layered framework discussed in the previous chapter,
and, in particular the number of layers in the system can be easily increased or decreased giving
flexibility to network administrators.
Our system can clearly provide better intrusion detection capabilities at the network level.
However, as discussed earlier, to provide a higher level of security it is important to detect in-
trusions at the application level along with detecting intrusions at the periphery of the network.
Hence, in the following chapters, we focus on developing intrusion detection systems which can
operate at the application level and which can be effective in detecting application level attacks.
Chapter 5
Unified Logging Framework and Audit
Data Collection
In order to detect malicious activities at the application level, present intrusion detection systems
either analyze the application access logs or the data access logs. A stacked system can also be used
which analyzes the two logs separately one after the other. Such systems, however, cannot model the
application-data interaction which is significant to detect low-level application specific attacks. To overcome this deficiency in present application intrusion detection systems, we introduce a unified logging framework which combines the application and the data access logs to produce a unified log which can be used as the audit patterns to detect attacks at the application level. This unified log can easily incorporate features from both the application accesses and the corresponding data accesses. As a result, application-data interaction can be captured which improves attack detection. Finally, our framework does not encode application specific features to extract attack signatures and can be
used for a variety of similar applications.
5.1 Introduction
Using our layered framework, as discussed in previous chapters, can undoubtedly provide
effective network intrusion detection capability. However, to ensure a higher level of secu-
rity, network level systems must be complemented with application level systems. This is because
the attack detection capability of a network based system is different from that of a host based
and application based system. A network based system primarily focuses on monitoring network
packets and, hence, cannot detect data and application level attacks particularly when Network
Address Translation (NAT) and encryption are used in communication. Further, attacks can be
split into more than one packet to avoid their detection. As a result, network intrusion detection
systems cannot reliably detect application attacks such as the SQL injection. Similarly, host and
application based systems cannot protect against network attacks such as the Denial of Service.
Methods which are effective in detecting attacks at the network level, such as those discussed earlier, cannot be directly used to detect low level application attacks. Detecting application level attacks often requires monitoring every single data access in a real-time environment, which may not always be feasible, simply due to the large number of data requests per unit time. Further, attackers may come up with previously unseen attacks, making the situation even more difficult [5].
Present application intrusion detection systems either analyze only the web access logs or only the
data access logs or use two separate systems (based on analyzing the web access logs and the data
access logs) which operate independently and, hence, cannot detect attacks reliably. Such systems
are often signature based and, thus, have limited attack detection. Therefore, it becomes critical
to develop better application intrusion detection systems which can detect attacks reliably and are
not entirely dependent on attack signatures. Detecting malicious data accesses, thus, presents a
major challenge and alternate methods must be considered which are efficient and at the same time
which can detect attacks reliably.
We note that to effectively detect application level attacks the application-data interaction must
be captured. Hence, we introduce a unified logging framework which combines the application
access logs and the corresponding data access logs in real-time to provide a unified log with
features from both the application accesses and the corresponding data accesses. This captures the
correlation between the two logs and also eliminates the need to analyze them separately, thereby
resulting in a system which is accurate and which operates efficiently.
The rest of the chapter is organized as follows: we motivate our unified logging framework
with some examples in Section 5.2. We then describe our proposed framework in Section 5.3 and
the setup for data collection in Section 5.4. Finally, we conclude this chapter in Section 5.5.
5.2 Motivating Example
Data access in a three-tier application architecture is restricted via the application and, hence, ap-
plications are one of the prime targets of attack. However, the ultimate objective of attacking an
application is either to launch a Denial of Service or to access the underlying data. To detect such
malicious data accesses it becomes critical to consider the user behaviour (via the web applica-
tion requests) and the corresponding application behaviour (via the corresponding data accesses)
together, i.e., by analyzing the application's interaction with the underlying data.
Consider, for example, a simple website which links page A to either page B, page C or any other page. This depends on the logic encoded in the application. Transition from page A to page B may be valid only if some conditions are satisfied, such as the user must be logged in to transit from page A to page B. If this condition is not satisfied, the transition is considered as anomalous.
Considering only a single feature, the transition sequence of web pages may not be sufficient to detect attacks. Other features such as the result of the authentication module are significant for decision making. Neglecting such features results in false alarms. This is because the encoded
logic cannot be modeled by analyzing the web accesses alone. However, when the system is
made aware of the data access pattern via features such as the number of requests generated by
a particular page, the corresponding next page and other features, it can effectively model the
user-application interaction, thereby resulting in better attack detection.
Similarly, monitoring the data access queries alone without any knowledge of the web applica-
tion which requests the data is insufficient to detect attacks since they lack the necessary contextual
information. Hence, to detect attacks reliably, we propose monitoring web accesses together with
the corresponding data accesses using our unified logging framework.
5.3 Proposed Framework
In order to detect malicious data accesses, the straightforward approach is to audit every data
access request before it is processed by the application. However, this is not the ideal solution to
detect data breaches due to the following reasons:
1. In most applications, the number of data accesses per unit time is very large as compared to
the number of web accesses and, thus, monitoring every data request in real-time severely
affects system performance.
2. Assuming that we can somehow monitor every data request by using a signature based
system; the system is application specific because the attack signatures are defined by
encoding application specific knowledge.
3. The system must be regularly updated with new signatures to detect attacks. As with any
signature based system, it cannot detect zero day attacks.
Thus, monitoring every data request is not feasible in a high speed application environment. We
also observe that the real world applications follow the three tier architecture [129] which ensures
application and data independence, i.e., data is managed separately and is not encoded into the
application. To access application data, an attacker has no option but to exploit the application. To
detect such attacks, an intrusion detection system can either monitor the application requests or
(and) monitor the data requests. When a system monitors the application accesses alone, it cannot
detect attacks such as the SQL injection since the system lacks useful information about the data
accessed. Similarly, analyzing every data access in isolation limits the attack detection capability
of an intrusion detection system. Further, using two separate systems does not capture application-
data interaction which affects attack detection. As discussed earlier, previous approaches either
consider only the application accesses or the data accesses or consider both in isolation and, hence,
are unable to correlate the events together, resulting in a large number of false alarms. We, thus, propose a unified logging framework which generates a single audit log that can be used by the
application intrusion detection system to detect a variety of attacks including the SQL injection,
cross site scripting and other application level attacks. Before we describe our framework in detail,
we define some key terms which will be helpful in a better understanding of the remainder of the chapter.
1. Application: An application is software by which a user can access data. There exists no
other way in which the data can be made available to a user.
2. User: A user is either an individual or any other application which accesses data.
3. Event: Data transfer between a user and an application is a result of multiple sequential
events. Data transfer can be considered as a request-response system where a request for
data access is followed by a response. An event is such a single request-response pair. We
use the term event interchangeably with the term request. A single event is represented as
an N-feature vector which is denoted as:
e_i = (f_1, f_2, f_3, ..., f_N)
4. User Session: A user session is an ordered set of events or actions performed, i.e., a
session is a sequence of one or more request-response pairs. Every session can be uniquely
identified by a session id. A user session is represented as a sequence of event vectors as:
s_i = (start, e_1, e_2, e_3, ..., end)
5.3.1 Description of our Framework
We present our unified logging framework in Figure 5.1 which can be used for building effective
application intrusion detection systems.
Figure 5.1: Framework for Building Application Intrusion Detection System (components: user/client, session control, web server with deployed application, data, web server log, data access log, unified log and intrusion detection system)
In our framework, we define two modules, the session control module and the logs unification module, in addition to an intrusion detection system which is used to detect malicious data accesses in an application. The logs unification module provides input audit patterns to the intrusion detection
system and the response generated by the intrusion detection system is passed on to the session
control module which can initiate appropriate intrusion response mechanisms. We have already
discussed that the three tier architecture restricts data access only via the application. Hence, user
access is restricted via the application and, thus, the application acts as a bridging element between the user and the data. In our framework, every request first passes through the session control
which is described next.
Session Control Module
The prime objective of an intrusion detection system is to detect attacks reliably. However, it
must also ensure that once an attack is detected, appropriate intrusion response mechanisms are
activated in order to mitigate their impact and prevent similar attacks in the future. The session control module serves a dual purpose in our framework. First, it is responsible for establishing new sessions
and for checking the session id for previously established sessions. For this, it maintains a list of
valid sessions which are allowed to access the application. Every request to access the application
is checked for a valid session id at the session control and anomalous requests can be blocked
depending upon the installed security policy. Second, the session control also accepts input from
the intrusion detection system. As a result, it is capable of acting as an intrusion response system.
If a request is evaluated to be anomalous by the intrusion detection system, the response from the
application can be blocked at the session control before data is made visible to the user, thereby
preventing malicious data accesses in real-time. The session control can either be implemented as
a part of the application or can also be implemented as a separate entity.
Once the session id is evaluated for a request, the request is sent to the application where it is
processed. The web server logs every request. All corresponding data accesses are also logged.
The two logs are then combined by the logs unification module to generate the unified log, which is
described next.
Logs Unification Module
In Section 5.2, we discussed that analyzing the web access logs and the data access logs in isolation is not sufficient to detect application level attacks. Hence, we propose using a unified log which can better detect attacks as compared to independent analysis of the two logs. The logs unification module is used to generate the unified log. The unified log incorporates features from both the web access logs and the corresponding data access logs. Using the unified log, thus, helps to
capture the user-application interaction and the application-data interactions. However, very often,
the number of data accesses is extremely large when compared to the number of web requests.
Hence, we first process the data access logs and represent them using simple statistics such as the
number of queries invoked by a single web request and the time taken to process them rather
than analyzing every data access individually. We then use the session id, present in both, the
application access logs and the associated data access logs, to uniquely map the extracted statistics
(obtained from the data access logs) to the corresponding web requests in order to generate a
unified log. Figure 5.2 represents how the web access logs and the corresponding data access logs can be uniquely mapped to generate a unified log. In the figure, f_1, f_2, f_3, ..., f_N and g'_1, g'_2, g'_3, ..., g'_M represent the features of the web access logs and the features extracted from the reduced data access logs, respectively.
Figure 5.2: Representation of a Single Event in the Unified Log (a web request W_e1 = (f_1, f_2, ..., f_n) and its data accesses d_e11, d_e12, d_e13, each of the form (g_1, g_2, ..., g_m), are reduced to d_e1 = (g'_1, g'_2, ..., g'_m) and combined by log unification into the unified event e_1 = (f_1, f_2, ..., f_n, g'_1, g'_2, ..., g'_m))
From Figure 5.2, we observe that a single web request may result in more than one data access, depending upon the logic encoded into the application. Once the web access logs and the corresponding data access logs are available, the next step involves the reduction of the data access logs by extracting simple statistics as discussed before. The session id can then be used to uniquely combine the two logs to generate the unified log.
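A minimal sketch of this unification step is given below: the data accesses are first reduced to per-request statistics and then joined to the web access records on the session id and a request identifier. The dictionary keys are illustrative assumptions, not the actual log schema.

    from collections import defaultdict

    def reduce_data_accesses(data_log):
        """Collapse individual data accesses into per-request statistics."""
        stats = defaultdict(lambda: {"query_count": 0, "total_time": 0})
        for access in data_log:                    # each access: one logged data query
            key = (access["session_id"], access["request_id"])
            stats[key]["query_count"] += 1
            stats[key]["total_time"] += access["duration"]
        return stats

    def unify(web_log, data_log):
        """Attach the reduced data-access statistics to every web request."""
        stats = reduce_data_accesses(data_log)
        unified = []
        for request in web_log:                    # each request: dict of web-access features
            key = (request["session_id"], request["request_id"])
            event = dict(request)
            event.update(stats.get(key, {"query_count": 0, "total_time": 0}))
            unified.append(event)
        return unified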
5.4 Audit Data Collection
As presented in our framework, the log unification module generates a unified log which can be
used by an application intrusion detection system. However, there is no data set which can be
used for our experiments. Application data sets such as [130] are available, but are restricted to
monitoring the sequence of system calls for privileged processes. Such data sets cannot be used in
our experiments. Further, getting real world application data, for example a bank website data, is
very hard, if not impossible. Hence, we collected data sets locally.
We collected two separate data sets by setting up an environment that mimics a real world
application environment. Both the data sets are made freely available and can be downloaded
from [13]. For the first data set, we used an online shopping application [131] and deployed it
on a web server running Apache, version 2.0.55. At the backend, the application was connected
to MySQL database, version 4.1.22. Both, the web requests and the corresponding data accesses
were logged. The servers and the application were installed on a desktop running with Intel(R)
Core(TM) 2, CPU 2.4 GHz and 2 GB RAM. The operating system installed was Microsoft Win-
dows XP Professional Service Pack 2. To collect the second data set, we used another online
shopping application [132] and deployed it separately on exactly the same configuration. For both
the applications, we consider a web request to be a single request to render a page by the server and
not a single HTTP GET request as it may contain multiple images, frames and dynamic content.
A request can be easily identified from the web server logs. This request further generates one or
more data requests which depend on the logic encoded in the application.
5.4.1 Feature Selection
We used two features from the data access logs and four features from the web access logs to
represent the unified log. Thus, we generate a unified log format where every user session is
represented as a sequence of vectors, each having six features. The six features are:
1. Number of data queries generated in a single web request.
2. Time taken to process the request.
3. Response generated for the request.
4. Amount of data transferred (in bytes).
5. Request made (or the function invoked) by the client.
6. Reference to the previous request in the same session.
Web access logs contain useful information such as the details of every request made by a
client (user), response of the web server, amount of data transferred etc. Similarly, data access
logs contain important details such as the exact data table and columns accessed, in case the
data is stored in a database. Performing intrusion detection at the data access level, in isolation,
requires substantially more resources when compared to our approach. Furthermore, monitoring
the logs together eliminates the need to monitor every data query since we can use simple statistics to represent the features of the data access logs in the unified log. The unified log is then used as input to the intrusion detection system, which is the final module in our framework and is
discussed in the next chapter.
5.4.2 Normal Data Collection
To collect normal data, the postgraduate students in our department were encouraged to access
the application. The application was accessible like any other online shopping website; however,
the access to the application was restricted to only from within the department. For the purpose
of normal data collection, the students were advised not to provide any personal information and
were asked to use dummy information instead of using their actual details. The application was
accessed using different scenarios; some examples of the scenarios are:
1. A user is not interested in shopping but clicks on a few links to explore the website.
2. A user is not a registered user. The user visits the website, looks at some items, adds few
items to cart but does not buy them.
3. A user is not a registered user. The user visits the website, looks at some items, adds few
items to cart, buys them by registering and completes the check out process and finally
logs off.
4. A user is a registered user, visits the website, searches for an item, adds it to cart, starts the
check out process but does not finish buying and logs off.
5. A user is a registered user, visits the website, searches for an item, adds some items to cart,
buys some products and logs off.
In addition to these, other scenarios were also considered. For data collection, the system
was online for five consecutive days, separately, for both the data sets. Further, the students were
asked to use different browsers to access the same application. The students were not restricted
to create a single user account, and many of them created multiple accounts. This is significant
because, in this case, we cannot consider a one to one mapping between a user and an IP address.
Hence, we did not use the IP address to identify a user accessing the application, which is also not
possible in any real world application due to sharing of computers and the use of Network Address
Translation in networks.
For the first data set, we observe that 35 different users accessed the application which results
in 117 unique sessions composed of 2,615 web requests and 232,655 data requests. We then
combine the web server logs with the data server logs to generate the unified log as discussed earlier
in Section 5.3. This results in 117 user sessions with only 2,615 event vectors, each of which
include features from the web requests and the associated data requests. We also observe that a
large number of user sessions are terminated without actual purchase, resulting in abandoning the
shopping cart. This is a realistic scenario and in reality a large number of the shopping carts are
abandoned without purchase.
Similarly, for the second data set, we combine 1,642 web requests with 931,671 data accesses
which results in 60 unique user sessions with 1,642 event vectors. Note that the number of data accesses per web request is large in the second data set when compared to the first. This is because the two applications are different. Also, we did not make any additional change specific to the second application to collect the second data set. This shows that our framework for unified
logging can be employed with minimum effort for a variety of existing applications.
We represent a normal user session from the data set in Figure 5.3. The session depicts a user
browsing the website and looking at different products displayed on the index page of the deployed
web application.
0,0,301,369,GET /catalog HTTP 1.1,,normal
0,0,200,28885,GET /catalog/ HTTP 1.1,,normal
131,1,200,28480,GET /catalog/index.php,http://dummydata.xyz/catalog/,normal
84,0,200,25431,GET /catalog/index.php,http://dummydata.xyz/catalog/index.php,normal
108,1,200,27121,GET /catalog/product_info.php,http://dummydata.xyz/catalog/index.php,normal
105,0,200,25252,GET /catalog/index.php,http://dummydata.xyz/catalog/product_info.php,normal
Figure 5.3: Representation of a Normal Session
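Assuming the comma-separated fields in Figure 5.3 follow the feature order listed in Section 5.4.1, with the label appended as the last field, a single unified-log line can be read as sketched below; this field interpretation is our assumption, not a published format.

    def parse_event(line):
        """Split one unified-log line into the six features plus its label."""
        parts = line.rstrip("\n").split(",")
        return {
            "num_queries": int(parts[0]),   # data queries generated by the request
            "time_taken":  int(parts[1]),   # time taken to process the request
            "response":    parts[2],        # web server response code
            "bytes":       int(parts[3]),   # amount of data transferred
            "request":     parts[4],        # request (function) invoked by the client
            "referrer":    parts[5],        # reference to the previous request
            "label":       parts[6],        # normal or attack
        }

    event = parse_event("131,1,200,28480,GET /catalog/index.php,"
                        "http://dummydata.xyz/catalog/,normal")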
5.4.3 Attack Data Collection
To collect attack data, we disabled access to the system to other users and generated the attack
traffic manually. We launched attacks based upon two criteria:
1. Attacks which do not require any control over the web server or the database such as
password guessing and SQL injection attack.
2. Attacks which require prior control over the web server such as website defacement and
cross site scripting.
To collect the attack data, both, the web requests and the data accesses were logged. The logs
were then combined using our framework. For the first data set, we generate 45 different attack
sessions with 272 web requests resulting in 44,390 data requests. Combining the two together,
the unified log has 45 unique attack sessions with 272 event vectors. For the second data set, we generate 241 web requests and 249,597 corresponding data requests. Combining the logs results in 25 unique sessions with 241 event vectors in the unified log.
A typical anomalous session in the data set is represented in Figure 5.4. The session depicts
a scenario where the deployed application has been modified by taking control of the web server. This is because we observe that a user has bypassed the login module which is necessary to
complete a genuine transaction. In this case, a user successfully completes the transaction and
the login module is never invoked. This is possible only when the deployed application has been
modified and hence, the entire session is labeled as attack.
103,0,200,28623,GET /catalog/index.php HTTP 1.1,,attack
203,0,200,35467,GET /catalog/checkout_shipping.php HTTP 1.1,http://dummydata.xyz/catalog/index.php,attack
208,0,200,40401,GET /catalog/checkout_payment.php HTTP 1.1,http://dummydata.xyz/catalog/checkout_shipping.php,attack
203,0,200,47801,GET /catalog/checkout_payment_address.php HTTP 1.1,http://dummydata.xyz/catalog/checkout_payment.php,attack
203,0,200,25605,GET /catalog/checkout_success.php HTTP 1.1,http://dummydata.xyz/catalog/checkout_payment_address.php,attack
Figure 5.4: Representation of an Anomalous Session
5.5 Conclusions
In this chapter, we introduced our unified logging framework which efficiently combines the application access logs and the corresponding data access logs to generate the unified log. The unified log can be used as the input audit patterns for building application intrusion detection systems. The advantage of using the unified log is that it includes features of both the user behaviour and the application behaviour and can, thus, capture the application-data interaction which helps in improving attack detection at the application level. We showed that our framework is not specific to any particular application since it does not encode application specific signatures and can be
used for a variety of applications. Finally, we described our audit data collection methodology
which was used to collect two different data sets. The two data sets can be used for building and
evaluating application intrusion detection systems and can be downloaded from [13].
In the next chapter, we perform experiments using our collected data sets and analyze the
effectiveness of the unified log in building application intrusion detection systems. We introduce
user session modeling using a moving window of events to model sequence of events in a user
session which can be used to effectively detect application level attacks.
Chapter 6
User Session Modeling using Unified Log
for Application Intrusion Detection
Present application intrusion detection systems suffer from two disadvantages: first, they analyze every single event independently to detect possible attacks and second, they are based on signature matching and, hence, have limited attack detection capabilities. To overcome these deficiencies and to improve attack detection at the application level, we introduce a novel approach of modeling user sessions as a sequence of events instead of analyzing every event in isolation. From our experiments, we show that the attack detection accuracy improves significantly when we perform session modeling. We integrate our unified logging framework, discussed in the previous chapter, to build effective application intrusion detection systems which are not specific to detecting a single type of attack. Our experimental results on the locally collected data sets show that our approach based on conditional random fields is effective and can detect attacks at an early stage by analyzing only a small number of
sequential events. We also show that our system is robust and can reliably detect disguised attacks.
6.1 Introduction
Applications have unrestricted access to the underlying application data and are thus
a prime target of attacks resulting in loss of one or more of the three basic security re-
quirements, viz., confidentiality, integrity and availability of the data. To prevent such malicious
data accesses, it becomes critical to detect any compromise of applications which accesses the
data. Web-based applications, in particular, are easy targets and can be exploited by the attackers.
Hence, we integrate our unified logging framework, discussed in Chapter 5, and introduce user
session modeling to detect application level attacks reliably and efficiently.
Present application intrusion detection systems cannot detect attacks reliably because, to per-
form efficiently, they are often signature based and, thus, unable to detect novel attacks whose
signatures are not available. Similarly, hybrid and anomaly detection systems are inefficient and
unreliable, resulting in a large number of false alarms because they are based on thresholds which
are difficult to estimate accurately. Further, application based systems often consider sequential events independently and hence are unable to capture the sequence behaviour in consecutive events in a single user session. Very often, attacks are a result of more than one event and monitoring the events individually results in reduced attack detection accuracy. Hence, to detect attacks effectively, we introduce user session modeling at the application level by monitoring a sequence of events using a moving window. We also integrate the unified logging framework which generates a single unified log with features from both the application accesses and the corresponding data accesses. We evaluate various methods such as conditional random fields, support vector machines, decision trees, naive Bayes and hidden Markov models and compare their attack detection capability. As we will demonstrate from our experimental results, integrating the unified logging framework and modeling user sessions result in better attack detection accuracy, particularly for the conditional random fields. Session modeling, however, increases the complexity of the system. Nonetheless, our experiments show that using conditional random fields, higher attack detection accuracy can be achieved by analyzing only a few events, which is desirable, as opposed to other methods which must analyze a large number of events to operate with comparable accuracy. Further, our system operates efficiently as it uses simple statistics rather than analyzing all the features in every data
access. Finally, our system performs best and is able to detect disguised attacks reliably when
compared with other methods.
The rest of the chapter is organized as follows: we motivate the use of session modeling for
application intrusion detection in Section 6.2. We then describe the data sets used in our experi-
ments in Section 6.3 and our methodology in Section 6.4. We describe our experimental set up and
present our results in Section 6.5 followed by the analysis of our results in Section 6.6. In Section
6.7, we discuss some implementation issues such as the availability of training data and suitability
of our approach for a variety of applications. Finally, we conclude this chapter in Section 6.8.
6.2 Motivating Example
Recalling from the previous chapter, we defined an event as a single request-response pair which can be represented as an N-feature vector as:
e_i = (f_1, f_2, f_3, ..., f_N)
Similarly, we defined a user session as an ordered set of events or actions performed, i.e., a session is a sequence of one or more request-response pairs and is represented as a sequence of event vectors:
s_i = (start, e_1, e_2, e_3, ..., end)
In many situations, to launch an attack the attacker must follow a sequence of events. For such
cases in particular, the attack will be successful when the entire sequence of events is performed.
Each event individually is not significant; however, the events, if performed in a sequence, can result
in powerful attacks. Further, the situation can be relaxed to give advantage to an attacker such that
the individual anomalous events may not strictly follow each other. As a result, the anomalous
events may be disguised within a number of legitimate events, such that the attack is successful
and hence, the overall session is considered as anomalous. For example, a single session with five sequential events along with their labels may be represented as follows:
< Session Start >
e_1 = < f^1_1, g^1_2, ..., h^1_n, Normal >
e_2 = < f^2_1, g^2_2, ..., h^2_n, Normal >
e_3 = < f^3_1, g^3_2, ..., h^3_n, Attack >
e_4 = < f^4_1, g^4_2, ..., h^4_n, Normal >
e_5 = < f^5_1, g^5_2, ..., h^5_n, Attack >
< Session End >
In the above sequence of events, e_1, ..., e_5, when we consider every event individually, anomalous events may not be detected; however, if the events are analyzed such that their sequence of occurrence is taken into consideration, the attack sessions can be detected effectively. Consider, for
example, a website which collects and stores credit card information and the following sequence
of events occurs in a single session:
1. A user attempts to log in by entering a (stolen) user id and password. The log in is suc-
cessful. (Note that SQL injection can also be used to reveal such login information).
2. The user then visits the home page and modies some information (to create a backdoor
for reentry).
3. The user exploits the application to gain administrator access.
4. The user then visits the home page of the original user (in an attempt to disguise the
previous event within normal events).
5. The user exploits administrator rights to reveal credit card information of other users.
It must be noted that in the above sequence of events, the individual events appear to be normal
events and may not be detected by the intrusion detection system when the system analyzes the
events in isolation. In particular, the third event in the above sequence, when analyzed in isolation,
may be considered as normal since the administrator can access the application using the super
user access. However, the overall sequence of events; transition from a user with limited access to
a user with administrator access and finally revealing the credit card information is made visible
only when the system analyzes all the events together in the session. Using session modeling,
we aim to minimize the number of false alarms and detect such attacks, including the disguised
attacks, which cannot be reliably detected by traditional intrusion detection systems.
6.3 Data Description
To perform experiments using user session modeling at the application level, there does not exist
any freely available data set which can be used. As a result, we collected the data sets locally as
described earlier in Chapter 5. We summarize the two data sets in Table 6.1.
Table 6.1: Data Sets

                            Number of         Number of         Number of
                            Web Requests      Data Accesses     Sessions
Data Set One    Normal      2,615             232,655           117
                Attack      272               44,390            45
Data Set Two    Normal      1,642             931,671           60
                Attack      241               249,597           25
Every session in both the data sets represents a sequence of event vectors, with each event
vector having six features. It is important to note that, though both the applications are examples
of online shopping websites, there are differences between the two applications. One significant difference is the application's interaction with the underlying database which is encoded as the application logic. As a result, the number of data accesses in the second data set (1,181,268) is significantly larger than in the first (277,045). Further, the size of the two data sets is also different; the first
data set consists of 162 sessions, while the second has only 85 sessions. It is also important to
note that the two data sets were collected independently at different times.
6.4 Methodology
In order to gain data access an attacker performs a sequence of malicious events. An experienced
attacker can also disguise attacks within a number of normal events in order to avoid detection.
Hence, to reduce the false alarms and increase attack detection accuracy, intrusion detection sys-
tems must be capable of analyzing the entire sequence of events rather than considering every event
in isolation [49]. We therefore propose user session modeling to detect application level attacks.
To model a sequence of event vectors, we need a method which does not assume independence
among sequential events. Hence, we use conditional random fields as the core intrusion detector in our application intrusion detection system. The advantage of using conditional random fields is that they predict the label sequence y given the observation sequence x, allowing them to model arbitrary relationships between different features in the observations without making independence assumptions. Figure 6.1 shows how conditional random fields can be used to model user sessions.
Figure 6.1: User Session Modeling using Conditional Random Fields (a session of events e_1, e_2, e_3, e_4, each a feature vector f_1, ..., f_6, with corresponding labels y_1, y_2, y_3, y_4)
In the figure, e_1, e_2, e_3, e_4 represent a user session of length four and every event e_i in the session is correspondingly labeled as y_1, y_2, y_3, y_4. Further, every event e_i is a feature vector of length six as described in the unified logging framework. The conditional random fields do not assume
any independence among the sequence of events e_1, e_2, e_3, e_4. We note that a user session can
be of variable length and some sessions may be longer than others. Analyzing every session at
its termination is effective since complete session information is available; however, it has two
disadvantages:
1. The attack detection is not real-time.
2. The size of the session can be very large with more than 50 events. As a result, analyzing
all the events together increases the complexity and the amount of history that must be
maintained for session analysis.
Hence, we perform user session modeling using a moving window of events. We vary the
width of the window from 1 to 20 in all our experiments. Since the complexity of the system
increases as the width of the window increases, a method which can reliably detect attacks with
only a small number of events, i.e., at small values of window width, is considered better. Hence,
we restrict the window width to 20 in our experiments.
6.4.1 Feature Functions
For a conditional random field, it is critical to define the feature functions because the ability of a conditional random field to model correlation between different features depends upon the predefined features used for training the random field.
We use our domain knowledge to identify such dependencies in the features and then define
functions which extract features from the training data. Examples of extracted features include: if feature_1 (request made) = abc and feature_2 (reference to previous request) = xyz, then the label is normal. Similarly, another example can be: if feature (amount of data transferred) = pqr, then the label is attack. Using feature conjunction, as shown in the first example, helps to capture
the correlation between different features. Based on our domain knowledge, other features were
extracted similarly using the CRF++ tool [120]. The feature functions used in our experiments are
presented in Appendix C.
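To illustrate the kind of feature functions and feature conjunctions described above, the routine below turns an event (and its predecessor) into string-valued features of the sort handed to the CRF; the actual templates used in our experiments are those listed in Appendix C, and the dictionary keys here are assumed unified-log field names.

    def extract_features(session, i):
        """Emit illustrative CRF features for the i-th event of a session."""
        event = session[i]
        feats = [
            "request=" + event["request"],
            "referrer=" + event["referrer"],
            "bytes=" + str(event["bytes"]),
            # feature conjunction: the request together with its referrer
            "request|referrer=" + event["request"] + "|" + event["referrer"],
        ]
        if i > 0:
            # context feature drawn from the previous event in the session
            feats.append("prev_request|request=" +
                         session[i - 1]["request"] + "|" + event["request"])
        return feats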
6.4.2 Session Modeling using a Moving Window of Events
We use the logs generated by the unified logging framework presented in Chapter 5 and perform
user session modeling using a moving window of events to build effective application intrusion
detection systems. For example, consider a session of length 10 represented as a sequence of
events:
< start >, e_1, e_2, e_3, e_4, e_5, e_6, e_7, e_8, e_9, e_10, < end >
Using a moving window of width five with a step size of one, the events in this session can be analyzed as shown below (note that - represents the absence of an event):
e_1, -, -, -, -                Label at step 1
e_1, e_2, -, -, -              Label at step 2
e_1, e_2, e_3, -, -            Label at step 3
e_1, e_2, e_3, e_4, -          Label at step 4
e_1, e_2, e_3, e_4, e_5        Label at step 5
e_2, e_3, e_4, e_5, e_6        Label at step 6
e_3, e_4, e_5, e_6, e_7        Label at step 7
e_4, e_5, e_6, e_7, e_8        Label at step 8
e_5, e_6, e_7, e_8, e_9        Label at step 9
e_6, e_7, e_8, e_9, e_10       Label at step 10
It is evident from the above representation that the window of events is advanced forward by
one and hence, such a system can perform in real-time. However, depending upon the requirements
of a particular application, the window can be advanced forward with a step size > 1. In such
cases, the system no longer operates in real-time. [Note that, if the analysis is performed only at
the end of every session, the system operates in batch mode.]
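The moving-window analysis can be sketched as follows; the placeholder padding reproduces the partial windows at the start of a session for step size one, and the routine is illustrative only.

    def sliding_windows(session, width, pad="-"):
        """Yield, for each step, the last `width` events, padded while the session is short."""
        for step in range(1, len(session) + 1):
            window = session[max(0, step - width):step]
            yield window + [pad] * (width - len(window))

    # Example: a session of 10 events analyzed with a window of width 5 and step size 1.
    for window in sliding_windows(["e%d" % i for i in range(1, 11)], width=5):
        pass   # the trained model labels each window here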
6.5 Experiments and Results
We now describe the experimental setup and compare our results using a number of methods
such as the conditional random fields, decision trees, naive Bayes, support vector machines and
hidden Markov models for detecting malicious data accesses at the application level. It is impor-
tant to note that the accuracy of attack detection and efciency of operation are the two critical
factors which determine the suitability of any method for intrusion detection. A method which
can detect most of the attacks but is extremely slow in operation may not be useful. Similarly, a
technique which is efficient but cannot detect attacks with an acceptable level of confidence is not useful. Hence, an intrusion detection technique must balance the two. Decision trees are very fast and generally result in accurate classification. The naive Bayes classifier is simple to implement and very efficient. Support vector machines are also considered to be high quality systems which
can handle data in high dimensional space. Hidden Markov models are well known for modeling
sequences and have been successful in various tasks in language processing. These methods have
been effectively used for building anomaly and hybrid intrusion detection systems. Our experi-
mental results from Chapter 4 suggest that conditional random fields outperform these methods and can be used to build accurate network intrusion detection systems. In this chapter, we analyze the effectiveness of conditional random fields for building application intrusion detection systems
and compare their performance with these methods.
For our experiments, we use the CRF++ toolkit [120], the hidden Markov model toolbox for MATLAB and the Weka tool [121], and perform experiments separately using both the data sets. We perform all experiments ten times by randomly selecting training and test data and report their average. We use exactly the same training and test samples for all the five methods that we compare (conditional random fields, decision trees, hidden Markov models, support vector machines and the naive Bayes classifier). It is important to note that methods such as decision trees, naive Bayes and support vector machines are not designed for labeling sequential data. However, to experiment with these methods, we convert every session into a single record by appending sequential events at the end of the previous event and then label the entire session as either normal or as attack. For example, for a session of length five, where every event is described by six features, we create a single record with 5 × 6 = 30 features. Additionally, for the support vector machines we experiment with three kernels: poly-kernel, rbf-kernel and normalized-poly-kernel, and vary the value of c between 1 and 100 for all of the kernels [121]. As we mentioned before, we use six features to represent every event. Hence, for experiments with the hidden Markov models, we build six different hidden Markov models, one for each feature, and then combine the individual results using a voting mechanism to get the final label for the sequence, i.e., we label the sequence as attack when the number of votes in favour of the attack class is greater than or equal to three.
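The two pre-processing steps described above can be sketched as follows: flattening a session into a single fixed-length record for the non-sequential classifiers, and a simple majority vote over the six per-feature hidden Markov model decisions. The sketch is only an illustration of the experimental setup; the function and variable names are not taken from the toolkits used.

def flatten_session(session, n_features=6):
    """Concatenate the per-event feature vectors of a session into one record.

    A session of 5 events with 6 features each becomes a single record with
    5 * 6 = 30 features, labeled normal or attack as a whole.
    """
    record = []
    for event in session:
        assert len(event) == n_features
        record.extend(event)
    return record

def vote_attack(per_model_votes):
    """Combine per-feature HMM decisions: attack if at least 3 of 6 agree."""
    attack_votes = sum(1 for v in per_model_votes if v == "attack")
    return "attack" if attack_votes >= 3 else "normal"

if __name__ == "__main__":
    session = [[1, 0, 2, 3, 1, 0] for _ in range(5)]
    print(len(flatten_session(session)))                       # 30
    print(vote_attack(["attack", "normal", "attack",
                       "attack", "normal", "normal"]))          # attack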
We perform our experiments using a moving window of events and vary the window width S
from 1 to 20. A window of width S = 1 indicates that we consider only the current event and do not consider the history, while a window of width S = 20 implies that a sequence of 20 events is analyzed to perform the labeling. We limit S to 20 because the complexity of the system increases with S, which affects the system's efficiency. Limiting the window width can, however, be exploited by attackers, since they can hide the attacks within normal events, making attack detection very difficult. Thus, to make the intrusion detection task more realistic, we define the disguised attack parameter, p, as follows:

p = (number of attack events) / (number of normal events + number of attack events)

where the number of attack events > 0 and the number of normal events ≥ 0. The value of p lies in the range (0, 1]. The attacks are not disguised when p = 1, since in this case the number of normal events is 0. As the value of p decreases, i.e., as the number of normal events increases, the attacks are disguised in a larger number of normal events. To create disguised attack data, we add a random number of attack events at random locations in the normal sessions and label all the events in the session as attack. This hides the attacks within normal events such that attack detection becomes difficult. We perform experiments to reflect these scenarios by varying the number of normal events in an attack session, setting p between 0 and 1.
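The following sketch illustrates how the disguised attack parameter p can be computed for a session and how a disguised attack session might be generated by inserting attack events at random positions into a normal session. The event representation and the labels are assumptions made for illustration only.

import random

def disguise_parameter(n_attack, n_normal):
    """p = attack events / (normal events + attack events), with n_attack > 0."""
    return n_attack / float(n_normal + n_attack)

def make_disguised_session(normal_events, attack_events, seed=None):
    """Insert attack events at random positions within a normal session.

    The whole resulting session is labeled as attack, so the attack events
    are hidden among normal ones and p < 1.
    """
    rng = random.Random(seed)
    session = list(normal_events)
    for ev in attack_events:
        session.insert(rng.randint(0, len(session)), ev)
    return session, "attack"

if __name__ == "__main__":
    normal = ["n%d" % i for i in range(8)]
    attacks = ["a1", "a2"]
    session, label = make_disguised_session(normal, attacks, seed=1)
    print(disguise_parameter(len(attacks), len(normal)))   # 0.2
    print(session, label)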
6.5.1 Experiments with Clean Data (p = 1)
We first set p = 1, i.e., the attacks are not disguised. In Figure 6.2, we compare the attack detection accuracy (F-Measure) as we increase the window width S from 1 to 20 for a fixed value of p = 1 for conditional random fields, support vector machines, decision trees, naive Bayes and hidden Markov models for both the data sets.

Results for both the data sets show similar trends. We observe that conditional random fields and support vector machines perform similarly and their attack detection capability (F-Measure) increases, slowly but steadily, as the value of S increases. This shows that modeling a user session results in better attack detection accuracy when compared to analyzing the events individually, i.e., attack detection accuracy improves as S increases.

Conditional random fields do not consider the sequence of events in a session to be independent and, hence, can model the correlation between events. As a result, they can detect attacks reliably. Support vector machines also result in good attack detection accuracy and can easily handle a large number of features, thereby resulting in good classification.
Decision trees and naive Bayes perform poorly and have a low F-Measure regardless of the window width S. Their accuracy improves initially as S increases, but when S becomes large their accuracy tends to decrease. This is because they consider features independently when labeling a particular event in a session and then combine the results for all the features without modeling the correlation between them. When the number of features is small, the error due to this loss of correlation is small, but it grows as the number of features increases. Also, although the number of input features increases with S, the decision tree selects a subset of features whose size remains fairly constant. Hence, S has little effect on the attack detection accuracy of decision trees when compared with the naive Bayes classifier.
Hidden Markov models also perform poorly; however, their accuracy improves slightly as S increases. When compared with the conditional random fields, hidden Markov models have lower attack detection accuracy because they are generative systems which model the joint distribution instead of the conditional distribution and, thus, make independence assumptions. Furthermore, they cannot model long range dependencies in the observations, thereby resulting in poor performance.
[Figure 6.2: Comparison of F-Measure (p = 1). F-Measure versus window width S (1 to 20) for CRF, SVM, naive Bayes, C4.5 and HMM: (a) Data Set One, (b) Data Set Two.]
6.5.2 Experiments with Disguised Attack Data (p = 0.60)
In order to test the robustness of different methods, we perform experiments with disguised attack data. Using such a data set makes attack detection realistic and more difficult, as an attacker may try to hide the attack within normal events. As discussed earlier, we define the disguised attack parameter, p, where p < 1 indicates that the attack is disguised within normal events in a session. In Figure 6.3, we compare the results for all the five methods for both the data sets at p = 0.60.

For both the data sets, we observe that the attack detection capability decreases as the attacks are disguised within normal events. However, the conditional random fields perform best, outperforming all other methods, and are robust in detecting disguised attacks when compared with any other method. Hidden Markov models are least effective for the first data set, while support vector machines and the naive Bayes classifier have similar performance. The decision trees are least effective in detecting disguised attacks for the second data set. Again, the attack detection accuracy increases as S increases.
The reason for the better accuracy of the conditional random fields is that they can model long range dependencies among the events in a sequence, since they do not assume independence within the event vectors, and thus perform effectively even when the attacks are disguised. As we decrease p, the support vector machines do not perform as well. The reason for this is that the support vector machines cannot geometrically differentiate between the normal and attack events because of the overlap between the normal data space and the attack data space.

The variation in performance of the hidden Markov models and the decision trees for the two data sets is attributed to the size of the data sets. The first data set is larger than the second. As a result, the decision trees can better select significant features in the first data set, resulting in higher accuracy. However, for the second data set, due to its small size, the decision trees cannot perform optimally compared to the hidden Markov models. The hidden Markov models perform better because they consider the sequence information, which becomes significant when the size of the data set is small.
[Figure 6.3: Comparison of F-Measure (p = 0.60). F-Measure versus window width S (1 to 20) for CRF, SVM, naive Bayes, C4.5 and HMM: (a) Data Set One, (b) Data Set Two.]
Results using Conditional Random Fields
We study the Precision, Recall and F-Measure for conditional random fields at p = 0.60 and present the results in Figure 6.4.

Results for conditional random fields, from both the data sets, suggest that they have a high F-Measure which increases steadily as the window width S increases. The best value of F-Measure for data set one is 0.87 at S = 15, while it is 0.65 at S = 20 for data set two. This suggests that the system based on conditional random fields generates fewer false alarms and performs reliably even when attacks are disguised.
[Figure 6.4: Results using Conditional Random Fields at p = 0.60. Precision, Recall and F-Measure versus window width S (1 to 20): (a) Data Set One, (b) Data Set Two.]
Results using Support Vector Machines
Figure 6.5 presents the variation in Precision, Recall and F-Measure for support vector machines as we increase S from 1 to 20 at p = 0.60.

As mentioned earlier, for support vector machines we experiment with three kernels: poly-kernel, rbf-kernel and normalized-poly-kernel, and vary the value of c between 1 and 100 for all three kernels. We observe that the poly-kernel with c = 1 performs best and, hence, we report the results using this kernel. Figure 6.5 shows that support vector machines have moderate Precision for both the data sets, but low Recall and hence a low F-Measure. The best value of F-Measure for support vector machines for data set one is 0.82 at S = 17, while it is 0.49 at S = 20 for data set two, in comparison to the conditional random fields which have an F-Measure of 0.87 and 0.65 for data set one and data set two respectively.
[Figure 6.5: Results using Support Vector Machines at p = 0.60. Precision, Recall and F-Measure versus window width S (1 to 20): (a) Data Set One, (b) Data Set Two.]
Results using Decision Trees
We study the variation in Precision, Recall and F-Measure for decision trees in Figure 6.6.
Results from Figure 6.6 show that the decision trees have a very low F-Measure, suggesting that they cannot be used effectively for detecting anomalous data accesses when the attacks are disguised. The detection accuracy for decision trees remains fairly constant as S increases and is maximum at S = 20 and at S = 19 for the two data sets respectively.
[Figure 6.6: Results using Decision Trees at p = 0.60. Precision, Recall and F-Measure versus window width S (1 to 20): (a) Data Set One, (b) Data Set Two.]
Results using Naive Bayes Classifier
Figure 6.7 presents the variation in Precision, Recall and F-Measure for the naive Bayes classifier as we vary S from 1 to 20 at p = 0.60.

Experimental results using both the data sets show a similar trend for the naive Bayes classifier. The results suggest that the system has a low F-Measure and there is little improvement in the attack detection accuracy as S increases. The maximum value of F-Measure is 0.67 at S = 12 for data set one and 0.43 at S = 19 for data set two, suggesting that a system based on the naive Bayes classifier cannot detect attacks reliably.
[Figure 6.7: Results using Naive Bayes Classifier at p = 0.60. Precision, Recall and F-Measure versus window width S (1 to 20): (a) Data Set One, (b) Data Set Two.]
Results using Hidden Markov Models
We present the Precision, Recall and F-Measure for the hidden Markov models for both the data sets at p = 0.60 in Figure 6.8.

From Figure 6.8, we observe that the hidden Markov models have very high Recall but very low Precision and hence a low F-Measure. There is little effect of S on the F-Measure, which does not improve significantly.
[Figure 6.8: Results using Hidden Markov Models at p = 0.60. Precision, Recall and F-Measure versus window width S (1 to 20): (a) Data Set One, (b) Data Set Two.]
6.6 Analysis of Results
Experimental results clearly suggest that the conditional random fields outperform other methods and are the best choice to build application intrusion detection systems.
6.6.1 Effect of S on Attack Detection
In our experiments, we use a moving window to model user sessions, varying S from 1 to 20. We want S to be small, since the complexity and the amount of history that must be maintained increase with S, and with a large window the system cannot respond to attacks in real-time. A window width of 20 and beyond is large, resulting in delayed attack detection and high computation cost. Tables 6.2 and 6.4 describe the effect of S on attack detection for the two data sets.
Table 6.2: Effect of S on Attack Detection for Data Set One, when p = 0.60

                                      F-Measure
Width of    Hidden Markov   Decision   Naive   Support Vector   Conditional
Window S    Models          Trees      Bayes   Machines         Random Fields
    1       0.00            0.47       0.61    0.56             0.62
    2       0.24            0.47       0.58    0.66             0.66
    3       0.27            0.44       0.61    0.69             0.68
    4       0.26            0.47       0.65    0.71             0.79
    5       0.27            0.46       0.64    0.72             0.76
    6       0.30            0.44       0.60    0.69             0.76
    7       0.31            0.33       0.61    0.68             0.81
    8       0.35            0.47       0.65    0.74             0.81
    9       0.36            0.51       0.65    0.70             0.80
   10       0.35            0.48       0.65    0.75             0.83
   11       0.35            0.51       0.66    0.80             0.84
   12       0.39            0.41       0.67    0.75             0.82
   13       0.38            0.44       0.65    0.77             0.84
   14       0.38            0.47       0.63    0.74             0.86
   15       0.39            0.50       0.66    0.80             0.87
   16       0.40            0.50       0.63    0.77             0.86
   17       0.39            0.47       0.65    0.82             0.86
   18       0.41            0.51       0.64    0.78             0.87
   19       0.40            0.53       0.64    0.76             0.86
   20       0.41            0.56       0.66    0.81             0.86
From Table 6.2, we observe that conditional random fields perform best and their attack detection capability increases as the window width increases. Additionally, when we increase S beyond 20 (not shown in the graphs), the attack detection accuracy increases steadily and the system achieves a very high F-Measure when we analyze the events in the entire session together. Results for the first data set show that the hidden Markov model performs best at S = 18, while conditional random fields achieve the same performance at S = 1. Similarly, decision trees analyze 20 events to reach their best performance, while conditional random fields achieve the same performance by analyzing only a single event (i.e., at S = 1). The naive Bayes classifier peaks at S = 12, while conditional random fields achieve the same performance at S = 3. Finally, support vector machines reach their best performance at a window width of 17, while the conditional random fields achieve the same performance at S = 10. We compare the various methods in Table 6.3.
Table 6.3: Analysis of Performance of Different Methods

                     HMM      C4.5     Naive Bayes   SVM      CRF
HMM (0.41)           S = 18   S = 1    S = 1         S = 1    S = 1
C4.5 (0.56)          S > 20   S = 20   S = 1         S = 1    S = 1
Naive Bayes (0.67)   S > 20   S > 20   S = 12        S = 3    S = 3
SVM (0.82)           S > 20   S > 20   S > 20        S = 17   S = 10
CRF (0.87)           S > 20   S > 20   S > 20        S > 20   S = 15
Table 6.3 can be interpreted as follows. Row one shows that the hidden Markov models achieve their best F-Measure of 0.41 at S = 18, while decision trees, the naive Bayes classifier, support vector machines and conditional random fields achieve the same F-Measure at S = 1. Similarly, the last row indicates that the conditional random fields achieve the highest F-Measure of 0.87 at an S value of 15, while all other methods require more than 20 events to achieve the same performance.

Hence, performing session modeling using conditional random fields results in higher accuracy of attack detection at lower values of S, which is desirable since it results in early attack detection and an efficient system.
Table 6.4: Effect of S on Attack Detection for Data Set Two, when p = 0.60

                                      F-Measure
Width of    Hidden Markov   Decision   Naive   Support Vector   Conditional
Window S    Models          Trees      Bayes   Machines         Random Fields
    1       0.00            0.28       0.28    0.21             0.50
    2       0.35            0.01       0.31    0.36             0.48
    3       0.37            0.03       0.35    0.40             0.52
    4       0.42            0.02       0.36    0.39             0.50
    5       0.39            0.04       0.38    0.37             0.53
    6       0.41            0.13       0.37    0.42             0.53
    7       0.37            0.18       0.37    0.35             0.57
    8       0.42            0.06       0.39    0.35             0.58
    9       0.42            0.25       0.42    0.38             0.55
   10       0.45            0.21       0.41    0.40             0.55
   11       0.46            0.35       0.37    0.32             0.52
   12       0.44            0.16       0.36    0.35             0.56
   13       0.44            0.23       0.39    0.25             0.56
   14       0.42            0.26       0.41    0.36             0.58
   15       0.42            0.34       0.38    0.46             0.59
   16       0.43            0.21       0.37    0.43             0.59
   17       0.42            0.31       0.40    0.41             0.60
   18       0.41            0.35       0.41    0.46             0.63
   19       0.41            0.42       0.43    0.41             0.63
   20       0.40            0.30       0.41    0.49             0.65
6.6.2 Effect of p on Attack Detection (0 < p ≤ 1)
To analyze the robustness of conditional random fields, we experiment with the disguised attack data by varying the disguised attack parameter, p, between 0 and 1. Figure 6.9 presents the effect of p on conditional random fields for different values of S for both the data sets. We do not present the results for the other methods since they perform poorly at lower values of p.

From Figure 6.9, we make two observations: first, as the value of p decreases, i.e., as attacks are disguised within normal events, the attack detection accuracy decreases, making it difficult to detect attacks; and second, regardless of the value of p, for any fixed value of p the attack detection accuracy increases as the width of the window S increases.
[Figure 6.9: Effect of p: Results using Conditional Random Fields when 0 < p ≤ 1. F-Measure versus window width S (1 to 20) for p = 1.00, 0.60, 0.45, 0.35, 0.25 and 0.15: (a) Data Set One, (b) Data Set Two.]
6.6.3 Significance of Using Unified Log
We performed our experiments using the unified log (based on the framework described in Chapter 5) to detect application level attacks. By using the unified log, our system can analyze the user behaviour (via the web accesses) and its effect on the application behaviour (via the corresponding data accesses). Features in both the logs are correlated, and analyzing them individually by building separate systems significantly affects attack detection capability. Hence, we perform additional experiments where we build three separate systems and compare them with our approach. The first system analyzes only the application logs, while the second system analyzes only the data access logs. In the third system, we combine the individual responses from both the systems using a voting mechanism to determine the final labeling: if either of the two systems labels an event as attack, we label the event as attack. We call this the voting based system; a sketch of this rule is given below. We use the same instances as in our previous experiments and present the results with conditional random fields at a p value of 0.60, varying S from 1 to 20. We present the comparison in Figure 6.10.

The results clearly suggest that a single system based on our approach of session modeling with the unified log performs best. We also observe that when we use two separate systems and use a voting mechanism to determine the final label, the performance improves for the first data set, but it decreases for the second data set. Hence, we can conclude that using a voting mechanism may not always be useful.
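The voting based baseline can be summarized by the simple OR rule sketched below: an event is flagged as attack if either the web-access-only detector or the data-access-only detector flags it. The two detector outputs are assumed to be precomputed labels.

def or_vote(web_label, data_label):
    """Voting based combination of two separate detectors.

    Label an event as attack if either the system trained on web access
    logs or the system trained on data access logs labels it as attack.
    """
    return "attack" if "attack" in (web_label, data_label) else "normal"

if __name__ == "__main__":
    print(or_vote("normal", "attack"))   # attack
    print(or_vote("normal", "normal"))   # normal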
An advantage of our system is that it can be deployed in a real environment, as it analyzes only the summary statistics extracted from the data access logs rather than analyzing every data access to match previously known attack signatures. From Table 6.1 in Section 6.3, it is evident that using the unified log eliminates the need to consider over one million (1,181,268) data accesses for the second data set. Instead, our approach limits the number of events to the number of web accesses, which is significantly smaller than the number of data accesses. Hence, our approach uses features from both the web access logs and the corresponding data access logs, and at the same time limits the load at the intrusion detection system, which is significant in a high speed application environment.
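As a rough illustration of why the unified log keeps the detector's load proportional to the number of web accesses, the sketch below aggregates the data accesses triggered by each web request into summary statistics attached to that request. The log record fields (request_id, query count, rows returned) are hypothetical; the actual unified logging framework is the one described in Chapter 5.

from collections import defaultdict

def unify(web_accesses, data_accesses):
    """Attach per-request summary statistics of data accesses to web accesses.

    `web_accesses` is a list of dicts with a request_id; `data_accesses` is a
    list of dicts carrying the request_id that triggered them. The detector
    then sees one unified event per web access instead of every data access.
    """
    stats = defaultdict(lambda: {"queries": 0, "rows": 0})
    for d in data_accesses:
        s = stats[d["request_id"]]
        s["queries"] += 1
        s["rows"] += d.get("rows", 0)

    unified = []
    for w in web_accesses:
        s = stats[w["request_id"]]
        unified.append(dict(w, num_queries=s["queries"], num_rows=s["rows"]))
    return unified

if __name__ == "__main__":
    web = [{"request_id": 1, "url": "/view"}]
    data = [{"request_id": 1, "rows": 10}, {"request_id": 1, "rows": 2}]
    print(unify(web, data))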
[Figure 6.10: Significance of Using Unified Log. F-Measure versus window width S (1 to 20) for the unified log, voting based, web access logs alone and data access logs alone systems: (a) Data Set One, (b) Data Set Two.]
6.6.4 Test Time Performance
It is not justified to compare the efficiency of our system with that of a signature based system, because the two systems are significantly different in their attack detection capability. Signature based systems simply perform signature matching for previously known attacks, while the strength of anomaly and hybrid systems, such as the one described in this chapter, lies in their capability of detecting novel attacks in addition to previously seen attacks.

It is important to note that the unification of logs does incur some overhead. However, this overhead can be eliminated by developing better software engineering practices which are aware of the security implications, particularly in web based applications. Security aware software engineering practices can provide a standardized unified log rather than separately logging web accesses and their corresponding data accesses. Nonetheless, the overhead incurred is very small when compared with the time required to individually analyze the web access logs and the data access logs.
We now compare the test time performance of the different methods. We are generally not interested in the training time because training is often a one time process and can be performed offline. Hence, we focus only on the test time complexity. During testing, both conditional random fields and hidden Markov models employ the Viterbi algorithm, which has a complexity of O(TL²), where T is the length of the sequence and L is the number of labels. The quadratic complexity is problematic when the number of labels is large, such as in language tasks, but for intrusion detection we have a limited number of labels (normal and attack) and, hence, the system is efficient. Support vector machines, the naive Bayes classifier and decision trees are very efficient and can handle large dimensionality in data. Table 6.5 compares the average test time for analyzing a session by the different methods at S = 20 and p = 0.60 for both the data sets.
Table 6.5: Comparison of Test Time

                               Test Time ( sec.)
                               Data Set One   Data Set Two
Conditional Random Fields      510            555
Hidden Markov Models           7361           7415
Decision Trees                 3515           3510
Naive Bayes Classifier         4125           4080
Support Vector Machines        9740           9125

The test time performance of the various systems presented in Table 6.5 appears counter-intuitive: we would expect the naive Bayes classifier and the decision trees to be faster than the conditional random fields. On the contrary, we observe that the conditional random fields perform best. This is explained by the fact that when we increase S to 20, the complexity increases for the decision trees, support vector machines and the naive Bayes classifier because the number of features increases from 6 to 120, while the number of features for the conditional random fields remains equal to six. Further, we use a first order Markov assumption for labeling in the conditional random fields and the label set itself is very small (equal to two), which results in high test time efficiency. Additionally, the time complexity for the hidden Markov models is higher because of the additional overhead involved in combining the results from six independent models to get the final label, as we discussed earlier.
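To illustrate why the O(TL²) Viterbi decoding stays cheap when there are only two labels, here is a generic Viterbi sketch; the scoring function is a placeholder and does not correspond to the trained CRF or HMM parameters.

def viterbi(observations, labels, score):
    """Generic Viterbi decoding.

    `score(prev_label, label, obs, t)` returns a score for assigning `label`
    at position t given the previous label (None at t = 0). Complexity is
    O(T * L^2); with only the labels 'normal' and 'attack' (L = 2) the
    quadratic factor is negligible.
    """
    T = len(observations)
    best = [{} for _ in range(T)]
    back = [{} for _ in range(T)]
    for y in labels:
        best[0][y] = score(None, y, observations[0], 0)
    for t in range(1, T):
        for y in labels:
            cands = [(best[t - 1][yp] + score(yp, y, observations[t], t), yp)
                     for yp in labels]
            best[t][y], back[t][y] = max(cands)
    # Backtrack the highest-scoring label sequence.
    y = max(labels, key=lambda l: best[T - 1][l])
    path = [y]
    for t in range(T - 1, 0, -1):
        y = back[t][y]
        path.append(y)
    return list(reversed(path))

if __name__ == "__main__":
    obs = [0.1, 0.9, 0.8]
    labels = ["normal", "attack"]
    # Toy per-position score, ignoring transitions (not a real model).
    toy = lambda yp, y, o, t: (o if y == "attack" else 1 - o)
    print(viterbi(obs, labels, toy))   # ['normal', 'attack', 'attack']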
From our experiments, we observe that the results follow the same trend for both the data sets and, hence, we can conclude that our results are not an artifact of a particular data set and that our framework is application independent and can easily be used for a variety of applications. Therefore, considering both the attack detection accuracy and the test time performance, the conditional random fields score better and are a strong candidate for building robust and efficient application intrusion detection systems.
6.6.5 Discussion of Results
Experimental results from both the data sets clearly suggest that conditional random fields, when compared with other methods, perform best and are able to detect attacks reliably, even when the attacks are disguised in normal events, i.e., at lower values of p. Further, performing session modeling using our unified logging framework, based on unified web access and data access logs, helps to improve attack detection accuracy. This is because, very often, to launch an attack, the attacker performs a number of events in a sequence. As a result, systems based on session modeling can detect attacks better than those which analyze every event in isolation. This is clear from our experiments, where we show that attack detection improves as the value of S increases. We also note that an experienced attacker may disguise attacks within 20 or more normal events. Even then, our system is capable of detecting attacks, as the system does not consider events independently. However, there is a tradeoff between the disguised attack parameter p and the window width S. In general, for better attack detection, S must be increased when p decreases. The advantage of conditional random fields is that higher attack detection occurs at lower values of S, which is desirable for the reasons discussed before.
The reason for better attack detection with conditional random fields is that they do not consider the features to be independent and are able to model the correlation between them. Further, they can model the long range dependencies between sequential events in a session and, hence, they can reliably detect attacks when the value of p decreases. Conditional random fields do not make any unwarranted assumptions about the data and, once trained, they are very efficient and robust. Support vector machines, decision trees and the naive Bayes classifier, on the other hand, consider the events to be independent and ignore the correlation between features, thereby resulting in lower accuracy of attack detection. Similarly, as we discussed earlier, hidden Markov models are generative systems and cannot represent long range dependencies among observations, thereby resulting in lower accuracy of attack detection.

Performing session modeling using a moving window of events in our unified logging framework helps to correlate the user behaviour and the application behaviour, providing rich interacting features which improve attack detection. Our experimental results confirm that when the unified log is analyzed using session modeling, the system can detect attacks with higher accuracy as opposed to the independent analysis of the web access logs and the data access logs.

Finally, it is important to note that simulating a few attacks does not necessarily imply that our system is limited to detecting only these attacks. We have already discussed that our system focuses on modeling the interaction between the user behaviour and the application behaviour. Hence, our system can detect any illegitimate data access, since malicious modifications result in a different application-data interaction when compared to legitimate requests. Our system focuses on detecting such modifications by combining the user behaviour with the application behaviour instead of using specially crafted signatures which are limited to detecting specific attacks. Further, we performed our experiments on two data sets and our results clearly suggest that the conditional random fields perform best for both, establishing that our results are not an artifact of a particular data set.
6.7 Issues in Implementation
Experimental results show that our approach based on conditional random fields can be used to
build effective application intrusion detection systems. However, before deployment, it is critical
to resolve issues such as the availability of the training data and suitability of our approach for a
variety of applications. We now discuss various methods which can be employed to resolve such
issues.
6.7.1 Availability of Training Data
Though our system is application independent and can be used to detect malicious data access in a variety of applications, it must be trained before it can be deployed online to detect attacks. This requires training data which is specific to the application, and obtaining such data may be difficult. However, training data can be made available as early as the application testing phase, when the application is tested to identify errors. Logs generated during the application testing phase can be used for training the intrusion detection system. This requires security aware software engineering practices which ensure that the necessary measures are taken to provide training data during the application development phase, which can then be used to train effective application intrusion detection systems.
6.7.2 Suitability of Our Approach for a Variety of Applications
As we already discussed, our framework is generic and can be deployed for a variety of applications. It is particularly suited to applications which follow the three tier architecture and which have application and data independence. Furthermore, our framework can be easily extended and deployed in the Service Oriented Architecture [133]. This is because, as part of the business solution, the service oriented architecture defines numerous services, each of which provides specific functionality and which have the capability to interact among themselves. Our proposed framework can be considered as a special case of the service oriented architecture which defines only one service. Nonetheless, it can be easily extended to the general service oriented architecture by selecting many services. This would, however, require some domain specific knowledge in order to identify the correlated services (applications). The challenge is to identify such correlations automatically, and this provides an interesting direction for future work.
6.8 Conclusions
In this chapter, we implemented user session modeling using a moving window of events in our unified logging framework to build application intrusion detection systems which can detect application level attacks effectively and efficiently. Experimental results confirm that conditional random fields can be effectively used in our framework and perform better when compared with other methods. In our framework, we considered a sequence of events in a session, rather than analyzing the events individually, which improves the attack detection accuracy. Our system based on conditional random fields can detect attacks at smaller values of S, resulting in early attack detection. We also showed that the unified log not only helps to improve the attack detection accuracy but also improves the system's performance, since we can use summary statistics rather than analyzing every data access. Our experimental results with multiple data sets show similar trends and confirm that our framework is application independent and can be used for a variety of applications. Another advantage of our system is that it models the user-application and application-data interaction, which does not vary over time, as compared to modeling user profiles which change frequently. The application and data interaction varies only in case of an attack, which is detected by our system. We also showed that our system using conditional random fields is robust and is able to detect disguised attacks effectively.

Finally, following better security aware software engineering practices and taking care of the logging mechanism during application development would not only help in application testing and related areas but would also provide the necessary framework for building better and more efficient application intrusion detection systems, such as those discussed in this chapter.
Chapter 7
Conclusions
IN this thesis, we explored the suitability of conditional random fields for building robust and efficient intrusion detection systems which can operate both at the network and at the application level. In particular, we introduced novel frameworks and developed models which address three critical issues that severely affect the large scale deployment of present anomaly and hybrid intrusion detection systems in high speed networks. The three issues are:
1. Limited attack detection coverage
2. Large number of false alarms and
3. Inefficiency in operation
Other issues such as the scalability and ease of system customization, robustness of the system to noise in the training data, availability of training data, and the ability of the system to detect disguised attacks were also addressed. As a result of this research, we conclude that:
1. The layered framework can be used to build efficient intrusion detection systems. In addition, the framework offers ease of scalability for detecting different varieties of attacks as well as ease of customization by incorporating domain specific knowledge. The framework also identifies the type of attack and, hence, a specific intrusion response mechanism can be initiated, which helps to minimize the impact of the attack.
2. Conditional random fields are a strong candidate for building robust and efficient intrusion detection systems. Integrating the layered framework with conditional random fields can be used to build effective and efficient network intrusion detection systems. Using conditional random fields as intrusion detectors results in very few false alarms and, thus, attacks can be detected with very high accuracy.
3. The unified logging framework can capture user-application and application-data interactions which are significant for detecting application level attacks. The framework is application independent and can be used for a variety of applications.
4. User session modeling using the unified log must be performed in order to detect application level attacks with high accuracy. Conditional random fields can be effectively used in this framework to model a sequence of events in a user session. Using conditional random fields, attacks can be detected at smaller window widths, thereby resulting in an efficient system. Additionally, the system is robust and can effectively detect disguised attacks.
We performed a range of experiments which show that, in order to detect intrusions effectively, it is critical to model the correlations between multiple features in an observation. Although assuming various features to be independent makes a model simple and efficient, it affects its attack detection capability. Conditional random fields can easily model such correlations by defining specific feature functions, which makes them a strong candidate for building effective intrusion detectors. Further, we introduced the layered framework, which helps to improve the overall system performance. Our framework is highly scalable, easily customizable and can be used to build efficient network intrusion detection systems which can detect a wide variety of attacks. Experimental results on the benchmark KDD 1999 intrusion data set [12] and comparison with other well known methods for intrusion detection, such as decision trees, naive Bayes, support vector machines and the winners of the KDD 1999 cup, show that our approach, based on layered conditional random fields, outperforms these methods in terms of both accuracy of attack detection and efficiency of system operation. The impressive part of our results is the percentage improvement in attack detection accuracy, particularly for User to Root (U2R) attacks (34.8% improvement) and Remote to Local (R2L) attacks (34.5% improvement). Statistical tests also demonstrate higher confidence in detection accuracy with layered conditional random fields. We also showed that our system is robust and can detect attacks with higher accuracy, when compared with other methods, even when trained with noisy data. Finally, our system is not based on attack signatures and is, hence, capable of detecting novel attacks.
We also performed experiments which show that, in order to effectively detect application level attacks, it is critical to model the sequence of events. This is because, very often, an attacker must perform a number of sequential operations in order to launch a successful attack. Additionally, for most applications, and in particular for web based applications, the application access logs and the corresponding data access logs are highly correlated. To detect attacks at the application level, the application logs or (and) the data access logs can be used. However, present application intrusion detection systems analyze the logs separately, often using two separate systems, resulting in inefficient systems which give a large number of false alarms and, hence, low attack detection accuracy. To address these issues, we introduced our unified logging framework, which integrates the application access logs and the corresponding data access logs to generate the unified log. As a result, the user-application and the application-data interaction can be captured; this can be used to detect attacks with high accuracy. Further, the user-application and the application-data interactions are stable and do not vary over time, as opposed to user profiles which change frequently. Experimental results confirm that our system, based on user session modeling using conditional random fields which analyze the unified log, can detect attacks at an early stage by analyzing only a small number of past events, resulting in an efficient system which can block attacks in real-time. Experimental results also demonstrate that our system is robust and can detect disguised attacks effectively, outperforming other methods such as the hidden Markov models, support vector machines, decision trees and the naive Bayes. In particular, for data set one at p = 0.60, using conditional random fields in our unified logging framework achieves an F-Measure of 0.87, while the same for hidden Markov models, decision trees, naive Bayes and support vector machines is 0.41, 0.56, 0.67 and 0.82 respectively. Similarly, for data set two at p = 0.60, our system achieves an F-Measure of 0.65, while the same for hidden Markov models, decision trees, naive Bayes and support vector machines is 0.46, 0.42, 0.43 and 0.49 respectively. Finally, the two data sets which we collected can be downloaded from [13] and can be used to build and evaluate application intrusion detection systems.
7.1 Directions for Future Research
The critical nature of the task of detecting intrusions in networks and applications leaves no margin for errors. The effective cost of a successful intrusion overshadows the cost of developing intrusion detection systems and, hence, it becomes critical to identify the best possible approach for developing better intrusion detection systems.
Every network and application is custom designed, and it becomes extremely difficult to develop a single solution which can work for every network and application. In this thesis, we proposed novel frameworks and developed methods which perform better than previously known approaches. However, in order to improve the overall performance of our system, we used domain knowledge to select better features for training our models. This is justified because of the critical nature of the task of intrusion detection. Using domain knowledge to develop better systems is not a significant disadvantage; however, developing completely automatic systems presents an interesting direction for future research.
From our experiments, it is evident that our systems performed efficiently. However, developing faster implementations of conditional random fields, particularly for the domain of intrusion detection, requires further investigation.

Another possible direction for future research is to employ our layered framework for building highly efficient systems, since the layers give the opportunity to implement pipelining in multi core processors.

We demonstrated the effectiveness of our application intrusion detection system in the well known three tier application architecture. However, our framework can be extended and deployed in the Service Oriented Architecture [133], which presents another line of interesting research.

There is ample scope and need to build systems which aim at preventing attacks rather than simply detecting them. Integrating intrusion detection systems with the security policy in individual networks would help to minimize the false alarms and qualify the alarms raised by the intrusion detection systems.
Thoughts for Practitioners
We now outline some open issues which are significant but outside the scope of this thesis, and which must be considered in order to develop better intrusion detection systems [3].
1. Many attacks are successful because the attackers enjoy anonymity and can launch attacks from spoofed sources, making it very hard to trace back the true source of the attack. However, if there were a reliable method to trace packets back to their actual source, many of the attacks could be prevented. Solutions such as adjusted probabilistic packet marking are available [40], but they require a global effort which is not easy to ensure. The problem is to identify the true source of an attack without affecting the performance of the overall system.
2. The security policy plays an important role in a network and describes the acceptable and non acceptable usage of resources. There are two major issues in defining the security policy: first, the policy must be complete and, second, the policy must be clear and unambiguous. Hence, the problem is to clearly define the acceptable and the unacceptable usage of every resource.
3. Many systems are based upon authenticating a user. However, authentication mechanisms such as the use of a login and password are weak and can be compromised. Multi factor authentication and biometric methods have been introduced, but they can also be bypassed. The problem is how to link the supplied credentials with the actual human user. Methods based on user profiling can be used, which learn the normal user profile and then detect significant deviations from the learnt profile. However, they are based upon thresholds which are selected by empirical analysis and, hence, may not always be accurate.
The field of intrusion detection has been around since the 1980s and much progress has been made in that time. However, to keep pace with rapidly and ever changing networks and applications, research in intrusion detection must stay synchronized with present networks. Present networks increasingly support wireless technologies and removable and mobile devices. Intrusion detection systems must integrate with such networks and devices and provide support for these advances in a comprehensible manner.
Bibliography
[1] Stefan Axelsson. Research in Intrusion-Detection Systems: A Survey. Technical Report 98-17, Department of Computer Engineering, Chalmers University of Technology, 1998.
[2] SANS Institute - Intrusion Detection FAQ. Last accessed: November 30, 2008. http://www.sans.org/resources/idfaq/.
[3] Kotagiri Ramamohanarao, Kapil Kumar Gupta, Tao Peng, and Christopher Leckie. The Curse of Ease of Access to the Internet. In Proceedings of the 3rd International Conference on Information Systems Security (ICISS), pages 234-249. Lecture Notes in Computer Science, Springer Verlag, Vol (4812), 2007.
[4] Overview of Attack Trends, 2002. Last accessed: November 30, 2008. http://www.cert.org/archive/pdf/attack_trends.pdf.
[5] Kapil Kumar Gupta, Baikunth Nath, Kotagiri Ramamohanarao, and Ashraf Kazi. Attacking Confidentiality: An Agent Based Approach. In Proceedings of the IEEE International Conference on Intelligence and Security Informatics, pages 285-296. Lecture Notes in Computer Science, Springer Verlag, Vol (3975), 2006.
[6] The ISC Domain Survey. Last accessed: November 30, 2008. https://www.isc.org/solutions/survey/.
[7] Peter Lyman, Hal R. Varian, Peter Charles, Nathan Good, Laheem Lamar Jordan, Joyojeet Pal, and Kirsten Swearingen. How much Information. Last accessed: November 30, 2008. http://www2.sims.berkeley.edu/research/projects/how-much-info-2003.
[8] Tao Peng, Christopher Leckie, and Kotagiri Ramamohanarao. Survey of Network-Based Defense Mechanisms Countering the DoS and DDoS Problems. ACM Computing Surveys, 39(1):3, 2007. ACM.
[9] Animesh Patcha and Jung-Min Park. An Overview of Anomaly Detection Techniques: Existing Solutions and Latest Technological Trends. Computer Networks, 51(12):3448-3470, 2007.
[10] CERT/CC Statistics. Last accessed: November 30, 2008. http://www.cert.org/stats/.
[11] Thomas A. Longstaff, James T. Ellis, Shawn V. Hernan, Howard F. Lipson, Robert D. Mcmillan, Linda Hutz Pesante, and Derek Simmel. Security of the Internet. Technical Report, The Froehlich/Kent Encyclopedia of Telecommunications Vol (15), CERT Coordination Center, 1997. Last accessed: November 30, 2008. http://www.cert.org/encyc_article/tocencyc.html.
[12] KDD Cup 1999 Intrusion Detection Data. Last accessed: November 30, 2008. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.
[13] Kapil Kumar Gupta, Baikunth Nath, and Kotagiri Ramamohanarao. Application based Intrusion Detection Dataset. Last accessed: November 30, 2008. http://www.csse.unimelb.edu.au/~kgupta.
[14] Stefan Axelsson. Intrusion Detection Systems: A Taxonomy and Survey. Technical Report 99-15, Department of Computer Engineering, Chalmers University of Technology, 2000.
[15] Anita K. Jones and Robert S. Sielken. Computer System Intrusion Detection: A Survey. Technical report, Department of Computer Science, University of Virginia, 1999. Last accessed: November 30, 2008. http://www.cs.virginia.edu/~jones/IDS-research/Documents/jones-sielken-survey-v11.pdf.
[16] Peyman Kabiri and Ali A. Ghorbani. Research on Intrusion Detection and Response: A Survey. International Journal of Network Security, 1(2):84-102, 2005.
[17] Joseph S. Sherif and Tommy G. Dearmond. Intrusion Detection: Systems and Models. In Proceedings of the Eleventh IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises (WET ICE), pages 115-133. IEEE, 2002.
[18] Mikko T. Siponen and Harri Oinas-Kukkonen. A Review of Information Security Issues and Respective Research Contributions. SIGMIS Database, 38(1):60-80, 2007. ACM.
[19] Teresa F. Lunt. A survey of intrusion detection techniques. Computers and Security, 12(4):405-418, 1993. Elsevier Advanced Technology Publications.
[20] Emilie Lundin and Erland Jonsson. Survey of Intrusion Detection Research. Technical Report 02-04, Department of Computer Engineering, Chalmers University of Technology, 2002.
[21] James P. Anderson. Computer Security Threat Monitoring and Surveillance, 1980. Last accessed: November 30, 2008. http://csrc.nist.gov/publications/history/ande80.pdf.
[22] Dorothy E. Denning. An Intrusion-Detection Model. IEEE Transactions on Software Engineering, 13(2):222-232, 1987. IEEE.
[23] H. S. Javitz and A. Valdes. The SRI IDES Statistical Anomaly Detector. In Proceedings of the IEEE Symposium on Security and Privacy, pages 316-326. IEEE, 1991.
[24] S. E. Smaha. Haystack: An Intrusion Detection System. In Proceedings of the 4th Aerospace Computer Security Applications Conference, pages 37-44. IEEE, 1988.
[25] Paul Innella. The Evolution of Intrusion Detection Systems, 2001. Last accessed: November 30, 2008. http://www.securityfocus.com/infocus/1514.
[26] L. T. Heberlein, G. V. Dias, K. N. Levitt, B. Mukherjee, J. Wood, and D. Wolber. A Network Security Monitor. In Proceedings of the IEEE Symposium on Research in Security and Privacy, pages 296-304. IEEE, 1990.
[27] Biswanath Mukherjee, L. Todd Heberlein, and Karl N. Levitt. Network Intrusion Detection. IEEE Network, 8(3):26-41, 1994. IEEE.
[28] John McHugh. Intrusion and intrusion detection. International Journal of Information Security, 1(1):14-35, 2001. Springer.
[29] Hervé Debar, Marc Dacier, and Andreas Wespi. Towards a taxonomy of intrusion-detection systems. Computer Networks, 31(9):805-822, 1999. Elsevier.
[30] S. Forrest, S. A. Hofmeyr, A. Somayaji, and T. A. Longstaff. A Sense of Self for Unix Processes. In Proceedings of the IEEE Symposium on Research in Security and Privacy, pages 120-128. IEEE, 1996.
[31] Christina Warrender, Stephanie Forrest, and Barak Pearlmutter. Detecting Intrusions Using System Calls: Alternative Data Models. In Proceedings of the IEEE Symposium on Security and Privacy, pages 133-145. IEEE, 1999.
[32] Kapil Kumar Gupta, Baikunth Nath, and Kotagiri Ramamohanarao. Layered Approach using Conditional Random Fields for Intrusion Detection. IEEE Transactions on Dependable and Secure Computing, In Press.
[33] Kapil Kumar Gupta, Baikunth Nath, and Kotagiri Ramamohanarao. Robust Application Intrusion Detection using User Session Modeling. ACM Transactions on Information and Systems Security, Under Review.
[34] John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 282-289. Morgan Kaufmann, 2001.
[35] Kapil Kumar Gupta, Baikunth Nath, and Kotagiri Ramamohanarao. Network Security Framework. International Journal of Computer Science and Network Security, 6(7B):151-157, 2006.
[36] Kapil Kumar Gupta, Baikunth Nath, and Kotagiri Ramamohanarao. Conditional Random Fields for Intrusion Detection. In Proceedings of the 21st International Conference on Advanced Information Networking and Applications Workshops (AINAW), pages 203-208. IEEE, 2007.
[37] Kapil Kumar Gupta, Baikunth Nath, and Kotagiri Ramamohanarao. User Session Modeling for Effective Application Intrusion Detection. In Proceedings of the 23rd International Information Security Conference (SEC 2008), pages 269-283. Lecture Notes in Computer Science, Springer Verlag, Vol (278), 2008.
[38] Kapil Kumar Gupta, Baikunth Nath, and Kotagiri Ramamohanarao. Intrusion Detection in Networks and Applications. In Handbook of Communication Networks and Distributed Systems. World Scientific, To Appear.
[39] Christopher Kruegel, Fredrik Valeur, and Giovanni Vigna. Intrusion Detection and Correlation: Challenges and Solutions. Springer, 2005.
[40] Tao Peng, Christopher Leckie, and Kotagiri Ramamohanarao. Adjusted Probabilistic Packet Marking for IP Traceback. In Proceedings of the Second IFIP Networking Conference, pages 697-708. Springer, 2002.
[41] William R. Cheswick and Steven M. Bellovin. Firewalls and Internet Security. Addison-Wesley, 1994.
[42] Rebecca Bace and Peter Mell. Intrusion Detection Systems. Gaithersburg, MD: Computer Security Division, Information Technology Laboratory, National Institute of Standards and Technology, 2001.
[43] Bruce Schneier. Applied Cryptography. John Wiley & Sons, 1996.
[44] Kymie Tan. Defining the Operational Limits of Sequence-Based Anomaly Detectors. PhD thesis, The University of Melbourne, 2002.
[45] Stuart Staniford-Chen, Brian Tung, Phil Porras, Cliff Kahn, Dan Schnackenberg, Rich Feiertag, and Maureen Stillman. The Common Intrusion Detection Framework - Data Formats, March 1998. Last accessed: November 30, 2008. http://tools.ietf.org/html/draft-staniford-cidf-data-formats-00.
[46] Giovanni Vigna and Richard A. Kemmerer. NetSTAT: A Network-based Intrusion Detection Approach. In Proceedings of the 14th Annual Computer Security Applications Conference, pages 25-34. IEEE, 1998.
[47] Carol Taylor and Jim Alves-Foss. An Empirical Analysis of NATE: Network Analysis of Anomalous Traffic Events. In Proceedings of the 2002 Workshop on New Security Paradigms, pages 18-26. ACM, 2002.
[48] Snort, a Network based Intrusion Detection System. Last accessed: November 30, 2008. http://www.snort.org/.
[49] Nong Ye, Xiangyang Li, Qiang Chen, Syed Masum Emran, and Mingming Xu. Probabilistic Techniques for Intrusion Detection Based on Computer Audit Data. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, 31(4):266-274, 2001.
[50] Paul Dokas, Levent Ertoz, Vipin Kumar, Aleksandar Lazarevic, Jaideep Srivastava, and Pang-Ning Tan. Data Mining for Network Intrusion Detection. In Proceedings of the NSF Workshop on Next Generation Data Mining, pages 21-30, 2002.
[51] Wenke Lee, Salvatore J. Stolfo, and Kui W. Mok. A Data Mining Framework for Building Intrusion Detection Model. In Proceedings of the IEEE Symposium on Security and Privacy, pages 120-132. IEEE, 1999.
[52] Dalila Boughaci, Habiba Drias, Ahmed Bendib, Youcef Bouznit, and Belaid Benhamou. Distributed Intrusion Detection Framework Based on Mobile Agents. In Proceedings of the International Conference on Dependability of Computer Systems, pages 248-255. IEEE, 2006.
[53] Jai Sundar Balasubramaniyan, Jose Omar Garcia-Fernandez, David Isacoff, Eugene H. Spafford, and Diego Zamboni. An Architecture for Intrusion Detection Using Autonomous Agents. In Proceedings of the 14th Annual Computer Security Applications Conference, pages 13-24. IEEE, 1998.
[54] Yu-Sung Wu, Bingrui Foo, Yongguo Mei, and Saurabh Bagchi. Collaborative Intrusion Detection System (CIDS): A Framework for Accurate and Efficient IDS. In Proceedings of the 19th Annual Computer Security Applications Conference, pages 234-244. IEEE, 2003.
[55] Elvis Tombini, Hervé Debar, Ludovic Me, and Mireille Ducasse. A Serial Combination of Anomaly and Misuse IDSes Applied to HTTP Traffic. In Proceedings of the 20th Annual Computer Security Applications Conference, pages 428-437. IEEE, 2004.
[56] L. Portnoy, E. Eskin, and S. Stolfo. Intrusion Detection with Unlabeled Data using Clustering. In Proceedings of the ACM Workshop on Data Mining Applied to Security (DMSA). ACM, 2001.
[57] H. Shah, J. Undercoffer, and A. Joshi. Fuzzy Clustering for Intrusion Detection. In Proceedings of the 12th IEEE International Conference on Fuzzy Systems, pages 1274-1278. IEEE, 2003.
[58] R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules between Sets of Items in Large Databases. In Proceedings of the International Conference on Management of Data (SIGMOD), pages 207-216. ACM, 1993.
BIBLIOGRAPHY 137
[59] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovering Frequent Episodes in Sequences. In Proceedings of the 1st International Conference on Knowledge Discovery and Data Mining, pages 210-215. AAAI, 1995.
[60] Nahla Ben Amor, Salem Benferhat, and Zied Elouedi. Naive Bayes vs Decision Trees in Intrusion Detection Systems. In Proceedings of the ACM Symposium on Applied Computing, pages 420-424. ACM, 2004.
[61] Nir Friedman, Dan Geiger, and Moises Goldszmidt. Bayesian Network Classifiers. Machine Learning, 29(2-3):131-163, 1997. Springer.
[62] Darren Mutz, Fredrik Valeur, Giovanni Vigna, and Christopher Kruegel. Anomalous System Call Detection. ACM Transactions on Information and System Security, 9(1):61-93, 2006. ACM.
[63] Christopher Kruegel, Darren Mutz, William Robertson, and Fredrik Valeur. Bayesian Event Classification for Intrusion Detection. In Proceedings of the 19th Annual Computer Security Applications Conference, pages 14-23. IEEE, 2003.
[64] Gary Stein, Bing Chen, Annie S. Wu, and Kien A. Hua. Decision Tree Classifier for Network Intrusion Detection with GA-Based Feature Selection. In Proceedings of the 43rd Annual SouthEast Regional Conference - Volume 2, pages 136-141. ACM, 2005.
[65] Hervé Debar, Monique Becker, and Didier Siboni. A Neural Network Component for an Intrusion Detection System. In Proceedings of the IEEE Symposium on Research in Security and Privacy, pages 240-250. IEEE, 1992.
[66] Anup K. Ghosh, James Wanken, and Frank Charron. Detecting Anomalous and Unknown Intrusions Against Programs. In Proceedings of the 14th Annual Computer Security Applications Conference, pages 259-267. IEEE, 1998.
[67] Jake Ryan, Meng-Jang Lin, and Risto Miikkulainen. Intrusion Detection with Neural Networks. In Advances in Neural Information Processing Systems, pages 943-949. MIT, 1997.
[68] Zheng Zhang, Jun Li, C.N. Manikopoulos, Jay Jorgenson, and Jose Ucles. HIDE: a Hierarchical Network Intrusion Detection System Using Statistical Preprocessing and Neural Network Classification. In Proceedings of the IEEE Workshop on Information Assurance and Security, United States Military Academy, pages 85-90. IEEE, 2001.
[69] Anup K. Ghosh, Aaron Schwartzbard, and Michael Schatz. Learning Program Behavior Profiles for Intrusion Detection. In Proceedings of the 1st USENIX Workshop on Intrusion Detection and Network Monitoring, pages 51-62. USENIX Association, 1999.
[70] Srinivas Mukkamala, Guadalupe Janoski, and Andrew H. Sung. Intrusion Detection Using Neural Networks and Support Vector Machines. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), pages 1702-1707. IEEE, 2002.
[71] Andrew H. Sung and Srinivas Mukkamala. Identifying Important Features for Intrusion Detection Using Support Vector Machines and Neural Networks. In Proceedings of the Symposium on Applications and the Internet, pages 209-216. IEEE, 2003.
[72] Dong Seong Kim and Jong Sou Park. Network-Based Intrusion Detection with Support Vector Machines. In Proceedings of the Information Networking, Networking Technologies for Enhanced Internet Services International Conference, ICOIN, pages 747-756. Lecture Notes in Computer Science, Springer Verlag, 2003.
[73] S. Jha, K. Tan, and R.A. Maxion. Markov Chains, Classifiers, and Intrusion Detection. In Proceedings of the 14th IEEE Computer Security Foundations Workshop, pages 206-219. IEEE, 2001.
[74] Nong Ye, Yebin Zhang, and Connie M. Borror. Robustness of the Markov-Chain Model for Cyber-Attack Detection. IEEE Transactions on Reliability, 53(1):116-123, 2004.
[75] Lawrence R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2):257-286, 1989.
[76] Svetlana Radosavac. Detection and Classification of Network Intrusions using Hidden Markov Models. Master's thesis, University of Maryland, 2003.
[77] Wei Wang, Xiao-Hong Guan, and Xiang-Liang Zhang. Modeling Program Behaviors by Hidden Markov Models for Intrusion Detection. In Proceedings of the International Conference on Machine Learning and Cybernetics, pages 2830-2835. IEEE, 2004.
[78] Ye Du, Huiqiang Wang, and Yonggang Pang. A Hidden Markov Models-Based Anomaly Intrusion Detection Method. In Proceedings of the Fifth World Congress on Intelligent Control and Automation (WCICA), pages 4348-4351. IEEE, 2004.
[79] Autonomous Agents for Intrusion Detection. Last accessed: November 30, 2008. http://www.cerias.purdue.edu/about/history/coast/projects/aafid.php.
[80] Probabilistic Agent based Intrusion Detection. Last accessed: November 30, 2008. http://www.cse.sc.edu/research/isl/agentIDS.shtml.
[81] Wenke Lee and Salvatore J. Stolfo. Data Mining Approaches for Intrusion Detection. In Proceedings of the 7th USENIX Security Symposium, pages 79-94, 1998.
[82] Wenke Lee, Salvatore J. Stolfo, and Kui W. Mok. Mining Audit Data to build Intrusion Detection Models. In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, pages 66-72. AAAI, 1998.
[83] Wenke Lee, Salvatore J. Stolfo, and Kui W. Mok. Mining in a Data-flow Environment: Experience in Network Intrusion Detection. In Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining, pages 114-124. ACM, 1999.
[84] Wenke Lee and Salvatore J. Stolfo. A Framework for Constructing Features and Models for Intrusion Detection Systems. ACM Transactions on Information and System Security (TISSEC), 3(4):227-261, 2000. ACM.
[85] Yu Gu, Andrew McCallum, and Don Towsley. Detecting Anomalies in Network Traffic Using Maximum Entropy Estimation. In Proceedings of the Internet Measurement Conference, pages 345-350. USENIX Association, 2005.
[86] Yi Hu and Brajendra Panda. A Data Mining Approach for Database Intrusion Detection. In Proceedings of the ACM Symposium on Applied Computing, pages 711-716. ACM, 2004.
[87] Yong Zhong, Zhen Zhu, and Xiao-Lin Qin. A Clustering Method Based on Data Queries and Its Application in Database Intrusion Detection. In Proceedings of the Fourth International Conference on Machine Learning and Cybernetics, Vol (4), pages 2096-2101. IEEE, 2005.
[88] Yi Hu and Brajendra Panda. Identification of Malicious Transactions in Database Systems. In Proceedings of the 7th International Database Engineering and Applications Symposium, pages 329-335. IEEE, 2003.
[89] Elisa Bertino, Ashish Kamra, Evimaria Terzi, and Athena Vakali. Intrusion Detection in RBAC-Administered Databases. In Proceedings of the 21st Annual Computer Security Applications Conference, pages 170-182. IEEE, 2005.
[90] Wai Lup Low, Joseph Lee, and Peter Teoh. DIDAFIT: Detecting Intrusions in Databases Through Fingerprinting Transactions. In Proceedings of the 4th International Conference on Enterprise Information Systems, pages 121-128, 2002.
[91] Sin Yeung Lee, Wai Lup Low, and Pei Yuen Wong. Learning Fingerprints for a Database Intrusion Detection System. In Proceedings of the 7th European Symposium on Research in Computer Security, Vol (2502), pages 264-279. Lecture Notes in Computer Science, Springer Verlag, 2002.
[92] Yong Zhong and Xiao-Lin Qin. Research on Algorithm of User Query Frequent Itemsets Mining. In Proceedings of the Third International Conference on Machine Learning and Cybernetics, Vol (3), pages 1671-1676. IEEE, 2004.
[93] Victor C.S. Lee, John A. Stankovic, and Sang H. Son. Intrusion Detection in Real-time Database Systems Via Time Signatures. In Proceedings of the Sixth IEEE Real Time Technology and Applications Symposium, pages 124-133. IEEE, 2000.
[94] Christina Yip Chung, Michael Gertz, and Karl Levitt. DEMIDS: A Misuse Detection System for Database Systems. In Proceedings of the 3rd International IFIP TC-11 WG11.5 Working Conference on Integrity and Internal Control in Information Systems, pages 159-178. Kluwer, 1999.
[95] Shubha U. Nabar, Bhaskara Marthi, Krishnaram Kenthapadi, Nina Mishra, and Rajeev Motwani. Towards Robustness in Query Auditing. In Proceedings of the 32nd International Conference on Very Large Data Bases, pages 151-162. ACM, 2006.
[96] Rakesh Agarwal, Jerry Kiernan, Ramakrishnan Srikant, and Yirong Xu. Hippocratic Databases. In Proceedings of the 28th International Conference on Very Large Databases, pages 143-154. Morgan Kaufmann, 2002.
[97] Rakesh Agrawal, Roberto J. Bayardo Jr., Christos Faloutsos, Jerry Kiernan, Ralf Rantzau, and Ramakrishnan Srikant. Auditing Compliance with a Hippocratic Database. In Proceedings of the 30th International Conference on Very Large Databases, pages 516-527. Morgan Kaufmann, 2004.
[98] Kristen LeFevre, Rakesh Agrawal, Vuk Ercegovac, Raghu Ramakrishnan, Yirong Xu, and David J. DeWitt. Limiting Disclosure in Hippocratic Databases. In Proceedings of the 30th International Conference on Very Large Databases, pages 108-119. Morgan Kaufmann, 2004.
[99] Lieven Desmet, Frank Piessens, Wouter Joosen, and Pierre Verbaeten. Bridging the Gap Between Web Application Firewalls and Web Applications. In Proceedings of the Fourth ACM Workshop on Formal Methods in Security, FMSE, pages 67-77. ACM, 2006.
[100] Holger Dreger, Anja Feldmann, Michael Mai, Vern Paxson, and Robin Sommer. Dynamic Application-Layer Protocol Analysis for Network Intrusion Detection. In Proceedings of the 15th USENIX Security Symposium, pages 257-272. USENIX Association, 2006.
[101] Marco Cova, Davide Balzarotti, Viktoria Felmetsger, and Giovanni Vigna. Swaddler: An Approach for the Anomaly-Based Detection of State Violations in Web Applications. In Proceedings of the 10th International Symposium on Recent Advances in Intrusion Detection (RAID), pages 63-86. Springer, 2007.
[102] Shai Rubin, Somesh Jha, and Barton P. Miller. Protomatching Network Traffic for High Throughput Network Intrusion Detection. In Proceedings of the 13th ACM Conference on Computer and Communications Security, pages 47-58. ACM, 2006.
[103] Bruce D. Caulkins, Joohan Lee, and Morgan Wang. Packet- vs. Session-Based Modeling for Intrusion Detection Systems. In Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC'05), pages 116-121. IEEE, 2005.
[104] Magnus Almgren and Ulf Lindqvist. Application-Integrated Data Collection for Security Monitoring. In Proceedings of the 4th International Symposium on Recent Advances in Intrusion Detection, pages 22-36. Lecture Notes in Computer Science, Springer Verlag, Vol (2212), 2001.
[105] Fredrik Valeur, Darren Mutz, and Giovanni Vigna. A Learning-Based Approach to the Detection of SQL Attacks. In Proceedings of the Second International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA), pages 123-140. Springer, 2005.
[106] Christopher Kruegel and Giovanni Vigna. Anomaly Detection of Web-Based Attacks. In Proceedings of the 10th ACM Conference on Computer and Communications Security (CCS), pages 251-261. ACM, 2003.
[107] Shu Wenhui and Tan T H Daniel. A Novel Intrusion Detection System Model for Securing Web-based Database Systems. In Proceedings of the 25th Annual International Computer Software and Applications Conference (COMPSAC), pages 249-254. IEEE, 2001.
[108] Adwait Ratnaparkhi. A Maximum Entropy Model for Part-of-Speech Tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 133-142. Association for Computational Linguistics, 1996.
[109] Adwait Ratnaparkhi. Maximum Entropy Models for Natural Language Ambiguity Resolution. PhD thesis, University of Pennsylvania, 1998.
[110] Adam L. Berger, Stephen A. Della Pietra, and Vincent J. Della Pietra. A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics, 22(1):39-71, 1996.
[111] Andrew McCallum, Dayne Freitag, and Fernando Pereira. Maximum Entropy Markov Models for Information Extraction and Segmentation. In Proceedings of the 17th International Conference on Machine Learning, pages 591-598. Morgan Kaufmann, 2000.
[112] Dan Klein and Christopher D. Manning. Conditional Structure versus Conditional Estimation in NLP Models. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Vol (10), pages 9-16. Association for Computational Linguistics, 2002.
[113] Charles Sutton and Andrew McCallum. An Introduction to Conditional Random Fields for Relational Learning. In Introduction to Statistical Relational Learning. MIT, 2006.
[114] L. Ertoz, A. Lazarevic, E. Eilertson, Pang-Ning Tan, Paul Dokas, V. Kumar, and Jaideep Srivastava. Protecting Against Cyber Threats in Networked Information Systems. In Proceedings of SPIE: Battlespace Digitization and Network Centric Systems III, pages 51-56, 2003.
[115] Shon Harris. CISSP All-in-One Exam Guide. McGraw-Hill Osborne Media, 2007.
[116] Saso Dzeroski and Bernard Zenko. Is Combining Classifiers Better than Selecting the Best One. In Proceedings of the Nineteenth International Conference on Machine Learning, pages 123-129. Morgan Kaufmann, 2002.
[117] Chuanyi Ji and Sheng Ma. Combinations of Weak Classifiers. IEEE Transactions on Neural Networks, 8(1):32-42, 1997.
[118] Andrew Viterbi. Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm. IEEE Transactions on Information Theory, 13(2):260-269, 1967.
[119] G. D. Forney. The Viterbi Algorithm. Proceedings of the IEEE, 61(3):268-278, 1973.
[120] Taku Kudo. CRF++: Yet another CRF toolkit. Last accessed: November 30, 2008. http://crfpp.sourceforge.net/.
[121] Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.
[122] Maheshkumar Sabhnani and Gursel Serpen. Application of Machine Learning Algorithms to KDD Intrusion Detection Dataset within Misuse Detection Context. In Proceedings of the International Conference on Machine Learning; Models, Technologies and Applications, MLMTA, pages 209-215. CSREA, 2003.
[123] Yacine Bouzida and Sylvain Gombault. Eigenconnections to Intrusion Detection. In Security and Protection in Information Processing Systems, pages 241-258. Springer, 2004.
[124] Andrew McCallum. Efficiently Inducing Features of Conditional Random Fields. In Proceedings of the 19th Annual Conference on Uncertainty in Artificial Intelligence, pages 403-410. Morgan Kaufmann, 2003.
[125] Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. Inducing Features of Random Fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380-393, 1997.
[126] Andrew Kachites McCallum. MALLET: A Machine Learning for Language Toolkit, 2002. Last accessed: November 30, 2008. http://mallet.cs.umass.edu.
[127] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[128] Frank Wilcoxon. Individual Comparisons by Ranking Methods. Biometrics, 1(6):80-83, 1945.
[129] W.W. Eckerson. Three Tier Client/Server Architecture: Achieving Scalability, Performance, and Efficiency in Client Server Applications. Open Information Systems, 10(1), 1995.
[130] Computer Immune Systems - Data Sets and Software. Last accessed: November 30, 2008. http://www.cs.unm.edu/~immsec/systemcalls.htm.
[131] osCommerce, Open Source Online Shop E-Commerce Solutions. Last accessed: November 30, 2008. http://www.oscommerce.com/.
[132] Zen Cart, the art of e-commerce. Last accessed: November 30, 2008. http://www.zencart.com/.
[133] Eric Newcomer and Greg Lomow. Understanding SOA with Web Services. Addison-Wesley Professional, 2004.
[134] Thomas G. Dietterich. Machine Learning for Sequential Data: A Review. In Proceedings of the Joint IAPR International Workshop on Structural, Syntactic, and Statistical Pattern Recognition, pages 15-30. Lecture Notes in Computer Science, Springer Verlag, No. (2396), 2002.
[135] Hanna Wallach. Conditional Random Fields: An Introduction. Technical Report MS-CIS-04-21, Department of Computer and Information Science, University of Pennsylvania, 2004.
[136] Hanna Wallach. Efficient Training of Conditional Random Fields. Master's thesis, Division of Informatics, University of Edinburgh, 2002.
[137] Charles Sutton, Khashayar Rohanimanesh, and Andrew McCallum. Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data. In Proceedings of the 21st International Conference on Machine Learning, pages 99-106. ACM, 2004.
[138] Yang Wang, Kia-Fock Loe, and Jian-Kang Wu. A Dynamic Conditional Random Field Model for Foreground and Shadow Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(2):279-289, 2006.
[139] Fei Sha and Fernando Pereira. Shallow Parsing with Conditional Random Fields. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 134-141. Association for Computational Linguistics, 2003.
[140] Tak-Lam Wong and Wai Lam. Semi Supervised Learning for Sequence Labeling Using Conditional Random Fields. In Proceedings of the 4th International Conference on Machine Learning and Cybernetics, pages 2832-2837. IEEE, 2005.
[141] Ariadna Quattoni, Michael Collins, and Trevor Darrell. Conditional Random Fields for Object Recognition. In Proceedings of Advances in Neural Information Processing Systems, pages 1097-1104. MIT, 2004.
[142] John Lafferty, Xiaojin Zhu, and Yan Liu. Kernel Conditional Random Fields: Representation and Clique Selection. In Proceedings of the 21st International Conference on Machine Learning, pages 64-71. ACM, 2004.
[143] Aron Culotta, David Kulp, and Andrew McCallum. Gene Prediction with Conditional Random Fields. Technical Report UM-CS-2005-028, University of Massachusetts, Amherst, 2005.
[144] Sunita Sarawagi and William W. Cohen. Semi-Markov Conditional Random Fields for Information Extraction. In Advances in Neural Information Processing Systems, pages 1185-1192, 2004.
[145] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[146] Kevin Murphy. An Introduction to Graphical Models. Technical report, Intel Research, 2001.
[147] Kamal Nigam, John Lafferty, and Andrew McCallum. Using Maximum Entropy for Text Classification. In IJCAI-99 Workshop on Machine Learning for Information Filtering, pages 61-67, 1999.
[148] Edwin Thompson Jaynes. Information Theory and Statistical Mechanics. The Physical Review, 106(4):620-630, 1957.
[149] Leonard E. Baum, Ted Petrie, George Soules, and Norman Weiss. A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains. The Annals of Mathematical Statistics, 41(1):164-171, 1970.
[150] Arthur Pentland Dempster, Nan M. Laird, and Donald B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, B, 39(1):1-38, 1977.
[151] Serafim Batzoglou. CS 262 Computational Genomics, Winter 2005. Last accessed: November 30, 2008. http://robotics.stanford.edu/~serafim/CS262_2005/index.html.
[152] Roman Klinger and Katrin Tomanek. Classical Probabilistic Models and Conditional Random Fields. Technical Report TR07-2-013, Technical University of Dortmund, 2007.
[153] Sunita Sarawagi. CRF Project. Last accessed: November 30, 2008. http://crf.sourceforge.net/.
[154] Kevin P. Murphy. Conditional Random Fields (chains, trees and general graphs; includes BP code). Last accessed: November 30, 2008. http://www.cs.ubc.ca/~murphyk/.
Appendices

Appendix A

An Introduction to Conditional Random Fields
Conditional random fields have been effectively used for a variety of tasks including gene prediction, determining secondary structures of protein sequences, part-of-speech tagging, text segmentation, shallow parsing, named entity recognition, object recognition and intrusion detection. Conditional random fields exploit the sequence structure in the observations without making unwarranted independence assumptions, which results in better classification. We describe the theory behind conditional random fields in detail, give their properties and the assumptions that motivate their use in a particular problem, and discuss their advantages and disadvantages with respect to previously known approaches for similar tasks.
A.1 Introduction

The need to correctly label a sequence of observations is of vital importance in a variety of domains including computational linguistics, computational biology and real-time intrusion detection. Computational linguistics involves tasks such as text segmentation, determining the part-of-speech tags for a sentence, information extraction and named entity recognition. Similarly, computational biology includes tasks such as biological sequence alignment, determining the secondary structure of protein sequences and gene prediction. The need to label sequences of observations also arises in intrusion detection, where the goal is to correctly identify malicious events.
The problem of sequence labeling is defined as follows: given a sequence of observations $x_1, x_2, \ldots, x_n$, label every observation as $y_1, y_2, \ldots, y_n$ from a finite set of labels $Y$ [134], [135]. We shall, thus, focus on a sequence of observations and discuss various methods which have been proposed to label them. In particular, we emphasize conditional random fields [34], [113], [124], [136], highlighting their advantages over other methods, and list a number of applications where they have been successfully applied [32], [33], [36], [37], [137], [138], [139], [140], [141], [142], [143], [144].
The rest of the chapter is organized as follows: in Section A.2, we give a brief background on probability distributions and describe the notation used. In Section A.3, we discuss various graphical models and highlight drawbacks in previously introduced methods such as maximum entropy Markov models, hidden Markov models and naive Bayes classifiers, which motivate the use of conditional random fields. We then describe conditional random fields in detail in Section A.4, highlighting situations where they are expected to perform better than their predecessors. We focus on feature functions, training and testing, and the complexity involved in using conditional random fields, and we also give a brief description of the tools which implement them. In Section A.5, we compare the directed and the undirected graphical models. Finally, we conclude the chapter in Section A.6.
A.2 Background

Many real life problems in language processing, computational biology and real-time intrusion detection involve sequence labeling, time series prediction and sequence classification. In order to perform such tasks, probabilistic approaches have gained wide acceptance; these involve estimating either the joint distribution or the conditional distribution, which are defined as follows.

Joint Probability Distribution - Given $N$ random variables, the joint distribution is the distribution, $D$, of all the variables occurring together. When there are only two random variables, $X$ and $Y$, the joint distribution is represented as $P(X = x, Y = y)$, for all values $x, y$.

Conditional Probability Distribution - Given $N$ random variables, the conditional distribution is a distribution, $D$, of a subset of the variables given the occurrences of the remaining random variables in the set $N$. For two random variables, $X$ and $Y$, the conditional distribution of $Y$ given $X$ is represented as $P(Y = y \mid X = x)$, for all values $x, y$.
The observations in most sequence labeling tasks are known and the objective is to assign the correct labels given the observations. The aim is, thus, to predict the label sequence which maximizes the probability of the class labels given the observations. However, many machine learning approaches first estimate the joint distribution of the observations and the labels and then determine the required conditional distribution using the Bayes rule. Once the complete joint distribution is available, calculating the required marginal and conditional probabilities is an easy task. However, the major issue is estimating the required joint distribution itself. Learning the joint distribution from the training data is difficult for the following reasons:

1. The number of observations required to determine the complete joint distribution is exponential in the number of variables. For $M$ variables each taking $K$ possible values, this number is $O(K^M)$. Assuming complete independence among the variables significantly reduces this to $O(K \times M)$; however, making such strong independence assumptions affects the accuracy of the model.

2. The amount of training data is limited and hence it is difficult to estimate an accurate joint distribution. A joint distribution learnt from a limited data set can result in over-fitting and simply mirrors the training data. As a result, the learnt model does not generalize to new observations.
Estimating the joint distribution without making any independence assumptions is, thus, feasible only when the number of random variables is small and a large number of data samples is available for training. On the contrary, assuming complete independence among the random variables makes the model tractable but severely affects its modeling capability. Hence, the objective is to build models which optimally balance these dual constraints: making the model tractable with the help of independence assumptions without affecting the modeling power of the system, and improving the generalization capability of the model on unseen observations given the limited number of training data samples. Domain knowledge is typically used to determine such dependence and independence relations, as in the case of the Bayesian networks described later.

Estimating the conditional distribution directly from the training observations eliminates the need to estimate the joint distribution and does not necessitate any unwarranted independence assumptions among the random variables.

For estimating either the joint or the conditional distribution, a diagrammatic representation of the random variables and their dependence relations is advantageous, and graphical models have become an important tool for various machine learning tasks as presented in [145] and [146]. As mentioned in [145], various complicated problems can be formulated and solved using purely algebraic manipulation; however, graphical models augment the analysis with diagrammatic representations of probability distributions which not only help to visualize the structure of the probabilistic model but also give insight into its properties, including conditional independence, which significantly improves the probabilistic analysis and helps to reduce the need for larger data sets.
Notation

We use the following notation in the rest of the chapter.

- $\vec{x} = x_1 x_2 \ldots x_t$ is the observed vector. Let there be $m$ possible alphabet symbols for each $x_i$.

- $y$ is the estimated class. We use the term class interchangeably with the term label. Let there be $k$ possible classes. For sequence labeling, $y$ is a vector, $\vec{y} = y_1, y_2, \ldots, y_t$, whose length is equal to that of the observation $\vec{x}$.

Note that graphical methods can be applied to label a single observation with multiple features as well as to label a sequence of observations where each observation is itself represented by multiple features. Methods which generally deal with a single observation are the naive Bayes classifier and the Maxent classifier. Similarly, methods which deal with a sequence of observations are hidden Markov models, maximum entropy Markov models, Markov random fields and conditional random fields. Very often, when a sequence of observations is considered, the observation represents the value of a single feature observed over time, even though more than one feature can be used to represent an observation sequence.
A.3 Graphical Models

Graphical models are often used to model the probability distribution over a set of random variables by factorizing a complex distribution over a large number of random variables into a product of simpler distributions, each over a small set of variables. A graph, $G$, is a set of vertices, $V$, connected by edges, $E$, where a vertex represents a single random variable or a group of random variables and the edges between the vertices represent the relationships between these random variables.

Based upon the type of edges used, graphical models can be broadly classified as Directed or Undirected.
A.3.1 Directed Graphical Models

Def.: A directed graphical model is a graph $G = (V, E)$ where $V = \{V_1, V_2, \ldots, V_N\}$ are the vertices and $E = \{(V_i, V_j), i \neq j\}$ are the directed edges from vertex $V_i$ to vertex $V_j$. A vertex $V_i$ can be represented by the random variable representation $X_i$.

A directed graphical model incorporates the parent-child relationship via the direction of an edge, i.e., an edge pointing from vertex $V_i$ to vertex $V_j$ implicitly describes the parent-child relationship such that $X_i$ is the parent of $X_j$. In directed graphical models the joint distribution over a set of random variables can be factorized into a product of local conditional distributions. Directed graphical models are also known as Bayesian Networks [61].
An important restriction for directed graphs is the absence of closed loops, i.e., there should be no directed path starting from and ending at the same vertex. Such graphs are called Directed Acyclic Graphs (DAGs). Directed graphical models factorize according to the probability distribution given in Equation A.1, where $x_i$ represents a node and $x_{\pi_i}$ represents its parents.

$$p(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{N} p(x_i \mid x_{\pi_i}) \qquad (A.1)$$
Figure A.1 represents a fully connected directed graphical model for three random variables.
[Figure A.1: Fully Connected Graphical Model. A directed graph over the three nodes $x_1$, $x_2$, $x_3$ with an edge between every pair of nodes.]
The graphical model represented in Figure A.1 can be factorized as:

$$p(x_1, x_2, x_3) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \qquad (A.2)$$

Thus, for a fully connected graph with $M$ variables each taking $K$ possible values, the total number of parameters that must be specified for an arbitrary joint distribution is equal to $K^M - 1$, which grows exponentially with $M$. This is not feasible for most real world applications, which often involve a large number of random variables with complex dependencies among them. The complexity can be drastically reduced by assuming the random variables to be completely independent. Figure A.2 represents a graphical model where the random variables are assumed to be completely independent.
[Figure A.2: Fully Disconnected Graphical Model. The three nodes $x_1$, $x_2$, $x_3$ with no edges between them.]
The graphical model represented in Figure A.2 can be factorized as:

$$p(x_1, x_2, x_3) = p(x_1)\, p(x_2)\, p(x_3) \qquad (A.3)$$

Assuming the variables to be completely independent significantly reduces the number of required parameters to $M(K - 1)$, which is manageable.
Conditional independence properties can be used to simplify the structure of the graph. In the case of directed graphs, conditional independence can be tested by applying the d-separation test, which involves testing whether or not the path between two nodes is blocked. More details on d-separation can be found in [145]. We shall now describe some of the well known directed graphical models.

Naive Bayes Classifier

The naive Bayes classifier is a well known directed graphical model which is frequently used to determine the class label for a given observation. The naive Bayes classifier is represented in Figure A.3.
[Figure A.3: Naive Bayes Classifier. A class node $y$ with directed edges to the feature nodes $x_1, x_2, \ldots, x_t$.]
The objective is to find the label, $y$, which maximizes the probability given the observation, i.e., find:

$$\arg\max_y \; p(y \mid \vec{x}) \qquad (A.4)$$

The Bayes rule can be used to find $p(y \mid \vec{x})$:

$$p(y \mid \vec{x}) = \frac{p(y)\, p(\vec{x} \mid y)}{p(\vec{x})}$$

Hence, Equation A.4 can be rewritten as:

$$\arg\max_y \; p(y \mid \vec{x}) = \arg\max_y \left[ \frac{p(y)\, p(\vec{x} \mid y)}{p(\vec{x})} \right] = \arg\max_y \; p(y)\, p(\vec{x} \mid y) = \arg\max_y \; p(y)\, p(x_1, x_2, \ldots, x_t \mid y)$$

Making the naive Bayes assumption, that every feature $x_i$ is conditionally independent of every other feature, the resulting naive Bayes classifier is given by:

$$\arg\max_y \; p(y \mid \vec{x}) = \arg\max_y \; p(y)\, p(x_1 \mid y)\, p(x_2 \mid y) \cdots p(x_t \mid y) = \arg\max_y \left[ p(y) \prod_{i=1}^{t} p(x_i \mid y) \right] \qquad (A.5)$$
The classifier presented in Equation A.5 considers the features in the observation to be independent and discards any correlation which may exist between them. This makes the model simple, but it affects the classification accuracy of the resulting classifier.
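To make the decision rule in Equation A.5 concrete, the following minimal sketch (not taken from the thesis) classifies a discrete feature vector with a naive Bayes model whose priors and class-conditional probabilities are assumed to be already estimated from training counts; the labels, features and probability values are invented for illustration.

```python
import math

# Hypothetical, hand-specified parameters for a two-class problem with
# three discrete features; in practice these come from training counts.
priors = {"normal": 0.7, "attack": 0.3}                 # p(y)
cond = {                                                # p(x_i = value | y)
    "normal": [{"low": 0.8, "high": 0.2},
               {"tcp": 0.9, "udp": 0.1},
               {"ok": 0.95, "err": 0.05}],
    "attack": [{"low": 0.3, "high": 0.7},
               {"tcp": 0.5, "udp": 0.5},
               {"ok": 0.4, "err": 0.6}],
}

def naive_bayes_label(x):
    """Return argmax_y p(y) * prod_i p(x_i | y), computed in log space."""
    best_label, best_score = None, float("-inf")
    for label, prior in priors.items():
        score = math.log(prior)
        for i, value in enumerate(x):
            score += math.log(cond[label][i][value])    # independence assumption
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(naive_bayes_label(["high", "udp", "err"]))        # -> "attack"
```

Working in log space avoids numerical underflow when the product in Equation A.5 involves many features.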
Maxent Classifier

Recall that the naive Bayes classifier is a directed graphical model which is often used to assign a single label to an observation. We also observed that, to keep the model simple, the naive Bayes classifier assumes the different observation features to be completely independent, which affects classification. Similar to the naive Bayes classifier, the Maxent classifier (or logistic regression) can be used to classify an observation which may be represented by multiple features. Contrary to the naive Bayes assumption, the Maxent classifier does not assume independence among the observation features, thereby resulting in better classification accuracy. The Maxent classifier is represented in Figure A.4.
[Figure A.4: Maxent Classifier. The feature nodes $x_1, x_2, \ldots, x_t$ with directed edges into the class node $y$.]
The Maxent classifier is motivated by the assumption that the log probability, $\log p(y \mid x)$, for each class is a linear function of the observation $x$ plus a normalization constant. This results in a conditional distribution which is represented in Equation A.6.

$$p(y \mid x) = \frac{1}{Z(x)} \exp \sum_{k=1}^{K} \lambda_k f_k(y, x) \qquad (A.6)$$

where

$$Z(x) = \sum_{y} \exp \sum_{k=1}^{K} \lambda_k f_k(y, x)$$

is the normalization constant, $\lambda_k$ is the bias weight and $f_k(y, x)$ is a feature function defined on an observation and label pair for every feature $k$. The ability of this model to capture the correlation between observation features depends upon the feature functions, $f_k(y, x)$, and the weights, $\lambda_k$, learnt during training [147].
Such a conditional probability model is based on the Principle of Maximum Entropy [148], which states that when only incomplete information about a probability distribution is available, the only unbiased assumption that can be made is a distribution which is as uniform as possible given the available information. This means that the model should satisfy all the constraints imposed on it (which are defined by the feature functions extracted from the training data), but beyond these constraints the model should be as uniform as possible, i.e., one which does not make any further assumptions. More details on Maximum Entropy models can be obtained from [110].
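As a concrete illustration of Equation A.6 (not taken from the thesis), the sketch below scores each candidate label with a weighted sum of binary feature functions and normalizes with $Z(x)$; the feature functions, weights and field names are invented.

```python
import math

LABELS = ["normal", "attack"]

def features(y, x):
    """Binary feature functions f_k(y, x); purely illustrative."""
    return [
        1.0 if y == "attack" and x["failed_logins"] > 3 else 0.0,
        1.0 if y == "attack" and x["protocol"] == "udp" else 0.0,
        1.0 if y == "normal" and x["failed_logins"] == 0 else 0.0,
    ]

weights = [2.0, 0.5, 1.5]   # lambda_k, normally learnt by maximum likelihood

def maxent_probability(x):
    """p(y | x) = exp(sum_k lambda_k f_k(y, x)) / Z(x)."""
    scores = {y: math.exp(sum(w * f for w, f in zip(weights, features(y, x))))
              for y in LABELS}
    z = sum(scores.values())                      # Z(x): sum over all labels
    return {y: s / z for y, s in scores.items()}

print(maxent_probability({"failed_logins": 5, "protocol": "udp"}))
```

Unlike the naive Bayes sketch above, nothing here assumes the features are independent; correlations are absorbed into the learnt weights.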
Generative and Discriminative Graphical Models

Def.: A graphical model which models the joint probability of the observations and the labels, $p(y, x)$, is known as a generative model. The naive Bayes classifier discussed earlier is an example of a generative graphical model; other well known generative models are hidden Markov models, Bayesian networks and Markov random fields. The prime disadvantage of generative models is that they need to enumerate all possible observation sequences. However, in many real world situations, the amount of data available for training is limited and hence independence assumptions are made, which results in approximate models.

Def.: A graphical model which models the conditional distribution of the labels given the observations, $p(y \mid x)$, is known as a discriminative model. The Maxent classifier (logistic regression) discussed earlier is a typical discriminative model. Other well known methods such as support vector machines, maximum entropy Markov models, conditional random fields, neural networks and nearest neighbor classifiers are examples of discriminative models.
Hidden Markov Model

The naive Bayes classifier is generally used to predict only a single class label. This model can be extended to estimate a sequence of labels, $\vec{y}$, for an observed sequence, $\vec{x}$, of length $t$. As mentioned earlier, very often the observed sequence represents the values of a single feature taken over a period of time. The hidden Markov model is a well known example of a directed and generative graphical model. Hidden Markov models are doubly stochastic: the state sequence is generated by a stochastic process, from which the output sequence is then generated [75]. Hence, given an output sequence (observation), one cannot uniquely determine the labeling (i.e., the sequence of states which generated the observation), since there may exist more than one sequence of states which could have generated the particular observation.

We shall concentrate only on the first order hidden Markov model, which assumes that the state at time $t$ depends only on the state at time $t-1$. Further, the observation at time $t$ depends only on the state at time $t$. Since we consider only a single feature observed over time, this results in a chain-like structure as represented in Figure A.5.
[Figure A.5: Hidden Markov Model. A chain of hidden states $y_1, y_2, \ldots, y_t$, with each state $y_i$ emitting the observation $x_i$.]
The hidden Markov model represented in Figure A.5 can be factorized as:

$$p(\vec{y}, \vec{x}) = \prod_{i=1}^{t} p(y_i \mid y_{i-1})\, p(x_i \mid y_i) \qquad (A.7)$$

The best label sequence, $\vec{y}$, is the one which maximizes this joint distribution, $p(\vec{y}, \vec{x})$. The drawback of the hidden Markov model is that the observation at time $t$, i.e., $x_t$, is assumed to be independent of the observations at any other time, which can affect accuracy.

For a hidden Markov model, we often assume the number of states to be equal to the number of class labels. Hence, the set of states is represented by $Q = (q_1, q_2, \ldots, q_k)$. Further, let the transition probability from state $q_{t-1} = i$ to state $q_t = j$ be represented by $a_{ij}$ such that $a_{i1} + a_{i2} + \ldots + a_{ik} = 1$ for all states $i = 1 \ldots k$. The starting probabilities, $a_{0i}$, are initialized for each state $i$ such that $a_{01} + a_{02} + \ldots + a_{0k} = 1$. Also, each state $i$ has a probability of emitting an observation symbol $b$, $e_i(b) = p(x_t = b \mid q_t = i)$, such that $e_i(b_1) + \ldots + e_i(b_m) = 1$ for all states $i = 1 \ldots k$.
Three main questions are considered when using a hidden Markov model.

1. Evaluation - Given a hidden Markov model $M$ and an observation sequence $\vec{x}$, what is the probability of the observation sequence given the model, i.e., find $p(\vec{x} \mid M)$.

2. Decoding - Given a hidden Markov model $M$ and an observation sequence $\vec{x}$, what is the sequence of states that maximizes the joint probability of the observation sequence and the state sequence, i.e., find $\arg\max_{\vec{q}} p(\vec{x}, \vec{q} \mid M)$.

3. Learning - Given a hidden Markov model $M$ with unspecified transition and emission probabilities, and an observation sequence $\vec{x}$, what are the parameters $\theta$ (transition and emission probabilities) that maximize the probability of the observation sequence, i.e., find $\arg\max_{\theta} p(\vec{x} \mid \theta)$.
Evaluation

The objective is to find $p(\vec{x} \mid M)$, i.e., the probability of the observation sequence given the model. The naive approach is to perform a summation over all possible ways of generating the observation sequence $\vec{x}$, i.e.,

$$p(\vec{x}) = \sum_{\vec{q}} p(\vec{x}, \vec{q}) = \sum_{\vec{q}} p(\vec{x} \mid \vec{q})\, p(\vec{q})$$

Summing over an exponential number of paths is not desirable. Dynamic programming can be used to perform this computation efficiently. First, define the forward probability as follows:

$$
\begin{aligned}
f_k(i) &= p(x_1, \ldots, x_i, q_i = k) \\
       &= \sum_{q_1 \ldots q_{i-1}} p(x_1, \ldots, x_{i-1}, q_1, \ldots, q_{i-1}, q_i = k)\, e_k(x_i) \\
       &= \sum_{q_1 \ldots q_{i-1}} p(x_1, \ldots, x_{i-1}, q_1, \ldots, q_{i-1})\, p(q_i = k \mid q_{i-1})\, e_k(x_i) \\
       &= \sum_{l} \sum_{q_1 \ldots q_{i-2}} p(x_1, \ldots, x_{i-1}, q_1, \ldots, q_{i-2}, q_{i-1} = l)\, a_{lk}\, e_k(x_i) \\
       &= \sum_{l} p(x_1, \ldots, x_{i-1}, q_{i-1} = l)\, a_{lk}\, e_k(x_i) \\
       &= e_k(x_i) \sum_{l} f_l(i-1)\, a_{lk}
\end{aligned}
$$

Using this idea, the forward algorithm [75] performs the evaluation efficiently with a time complexity of $O(K^2 T)$ and a space complexity of $O(KT)$, where $K$ is the number of possible states and $T$ is the length of the observation sequence. The algorithm is described next.

Initialization:
$f_0(0) = 1$
$f_k(0) = 0$, for all $k > 0$

Iteration:
$f_k(i) = e_k(x_i) \sum_l f_l(i-1)\, a_{lk}$

Termination:
$p(\vec{x}) = \sum_k f_k(t)\, a_{k0}$, where $a_{k0}$ is the probability of terminating in state $k$.
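A minimal sketch of the forward recursion above is given below (not from the thesis). The toy transition and emission tables are invented, and for simplicity the sketch terminates by summing $f_k(T)$ over states rather than using explicit end probabilities $a_{k0}$.

```python
# Toy HMM: states and parameters are invented for illustration.
states = ["benign", "malicious"]
start = {"benign": 0.8, "malicious": 0.2}                       # a_{0k}
trans = {"benign":    {"benign": 0.9, "malicious": 0.1},        # a_{kl}
         "malicious": {"benign": 0.3, "malicious": 0.7}}
emit  = {"benign":    {"ok": 0.7, "fail": 0.3},                 # e_k(b)
         "malicious": {"ok": 0.2, "fail": 0.8}}

def forward(obs):
    """Return p(x) via the recursion f_k(i) = e_k(x_i) * sum_l f_l(i-1) * a_{lk}."""
    # Initialization: position 1 uses the start probabilities.
    f = [{k: start[k] * emit[k][obs[0]] for k in states}]
    # Iteration over positions 2..T.
    for x in obs[1:]:
        prev = f[-1]
        f.append({k: emit[k][x] * sum(prev[l] * trans[l][k] for l in states)
                  for k in states})
    # Termination: sum over final states (explicit end probabilities omitted).
    return sum(f[-1].values())

print(forward(["ok", "fail", "fail"]))
```

Each position costs $O(K^2)$ work, matching the stated $O(K^2 T)$ complexity.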
Similar to the forward algorithm described above, the backward algorithm [75] employs dynamic programming and can be used in conjunction with the forward algorithm to determine the most likely state at position $i$ given the observation sequence $\vec{x}$. First, define the backward probability as follows:

$$
\begin{aligned}
b_k(i) &= p(x_{i+1}, \ldots, x_t \mid q_i = k) \\
       &= \sum_{q_{i+1} \ldots q_t} p(x_{i+1}, \ldots, x_t, q_{i+1}, \ldots, q_t \mid q_i = k) \\
       &= \sum_{q_{i+1} \ldots q_t} p(x_{i+1}, \ldots, x_t, q_{i+1} = l, q_{i+2}, \ldots, q_t \mid q_i = k) \\
       &= \sum_{l} e_l(x_{i+1})\, a_{kl} \sum_{q_{i+2} \ldots q_t} p(x_{i+2}, \ldots, x_t, q_{i+2}, \ldots, q_t \mid q_{i+1} = l) \\
       &= \sum_{l} e_l(x_{i+1})\, a_{kl}\, b_l(i+1)
\end{aligned}
$$

Using the above idea, the backward algorithm is described next.

Initialization:
$b_k(t) = a_{k0}$, for all $k$

Iteration:
$b_k(i) = \sum_l e_l(x_{i+1})\, a_{kl}\, b_l(i+1)$

Termination:
$p(\vec{x}) = \sum_l a_{0l}\, e_l(x_1)\, b_l(1)$

The backward algorithm also has a time complexity of $O(K^2 T)$ and a space complexity of $O(KT)$, where $K$ is the number of possible states and $T$ is the length of the observation sequence. The most likely state at position $i$, given the observation sequence $\vec{x}$, can now be calculated using Equation A.8.

$$p(q_i = k \mid \vec{x}) = \frac{f_k(i)\, b_k(i)}{p(\vec{x})} \qquad (A.8)$$

This is also known as posterior decoding. The most likely state can be calculated at each position using Equation A.8. However, this does not represent the most likely sequence of states given the entire observation sequence of length $t$.
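The posterior in Equation A.8 combines the two recursions; the sketch below (again with invented toy parameters and end probabilities taken as 1) returns $p(q_i = k \mid \vec{x})$ for every position.

```python
# Toy HMM parameters (illustrative only; not from the thesis).
states = ["benign", "malicious"]
start = {"benign": 0.8, "malicious": 0.2}
trans = {"benign":    {"benign": 0.9, "malicious": 0.1},
         "malicious": {"benign": 0.3, "malicious": 0.7}}
emit  = {"benign":    {"ok": 0.7, "fail": 0.3},
         "malicious": {"ok": 0.2, "fail": 0.8}}

def posteriors(obs):
    """p(q_i = k | x) = f_k(i) * b_k(i) / p(x), with end probabilities set to 1."""
    T = len(obs)
    # Forward pass.
    f = [{k: start[k] * emit[k][obs[0]] for k in states}]
    for x in obs[1:]:
        f.append({k: emit[k][x] * sum(f[-1][l] * trans[l][k] for l in states)
                  for k in states})
    # Backward pass.
    b = [{k: 1.0 for k in states}]                      # b_k(T) = 1 here
    for x in reversed(obs[1:]):
        b.insert(0, {k: sum(emit[l][x] * trans[k][l] * b[0][l] for l in states)
                     for k in states})
    px = sum(f[-1][k] for k in states)                  # p(x) from the forward pass
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(T)]

for i, dist in enumerate(posteriors(["ok", "fail", "fail"])):
    print(i, dist)
```

At every position the returned distribution sums to one, since $\sum_k f_k(i)\, b_k(i) = p(\vec{x})$.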
Decoding

The objective is to find:

$$\vec{q}^{\,*} = \arg\max_{\vec{q}} p(\vec{x}, \vec{q} \mid M)$$

Consider the given observation sequence $x_1, x_2, \ldots, x_t$ as shown in Figure A.6.

[Figure A.6: Decoding in a Hidden Markov Model. A trellis with one column of the $k$ possible states for each observation $x_1, x_2, \ldots, x_t$.]

To calculate the sequence of states which maximizes the joint probability of the observation sequence and the state sequence, dynamic programming can be used to perform the computation efficiently. Let $V_k(i)$ be the probability of the most likely sequence of states ending in state $q_i = k$:

$$V_k(i) = \max_{q_1 \ldots q_{i-1}} p(x_1, \ldots, x_{i-1}, q_1, \ldots, q_{i-1}, x_i, q_i = k) \qquad (A.9)$$

Given $V_k(i)$ for all states $k$ and a fixed position $i$, calculate $V_l(i+1)$ as:

$$
\begin{aligned}
V_l(i+1) &= \max_{q_1 \ldots q_i} p(x_1, \ldots, x_i, q_1, \ldots, q_i, x_{i+1}, q_{i+1} = l) \\
         &= \max_{q_1 \ldots q_i} p(x_{i+1}, q_{i+1} = l \mid x_1, \ldots, x_i, q_1, \ldots, q_i)\, p(x_1, \ldots, x_i, q_1, \ldots, q_i) \\
         &= \max_{q_1 \ldots q_i} p(x_{i+1}, q_{i+1} = l \mid q_i)\, p(x_1, \ldots, x_{i-1}, q_1, \ldots, q_{i-1}, x_i, q_i) \\
         &= \max_{k} \Big[ p(x_{i+1}, q_{i+1} = l \mid q_i = k) \max_{q_1 \ldots q_{i-1}} p(x_1, \ldots, x_{i-1}, q_1, \ldots, q_{i-1}, x_i, q_i = k) \Big] \\
         &= \max_{k} \Big[ p(x_{i+1} \mid q_{i+1} = l)\, p(q_{i+1} = l \mid q_i = k)\, V_k(i) \Big] \\
         &= e_l(x_{i+1}) \max_{k} \big[ a_{kl}\, V_k(i) \big] \qquad (A.10)
\end{aligned}
$$

The Viterbi algorithm [118], [119] implements this idea with a time complexity of $O(K^2 T)$ and a space complexity of $O(KT)$, where $K$ is the number of possible states and $T$ is the length of the observation sequence. The algorithm is described in the following steps:

Initialization:
$V_0(0) = 1$, where $0$ is the imaginary start position.
$V_k(0) = 0$, for all $k > 0$

Iteration:
$V_j(i) = e_j(x_i) \max_k \big[ a_{kj}\, V_k(i-1) \big]$
$\mathrm{Ptr}_j(i) = \arg\max_k a_{kj}\, V_k(i-1)$

Termination:
$p(\vec{x}, \vec{q}^{\,*}) = \max_k V_k(t)$

Traceback:
$q^*_t = \arg\max_k V_k(t)$
$q^*_{i-1} = \mathrm{Ptr}_{q^*_i}(i)$
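The recursion in Equation A.10 and the traceback can be sketched in a few lines; the toy parameters below are the same invented ones used in the earlier forward-algorithm sketch and are not from the thesis.

```python
# Toy HMM parameters (illustrative only; same shape as in the forward sketch).
states = ["benign", "malicious"]
start = {"benign": 0.8, "malicious": 0.2}
trans = {"benign":    {"benign": 0.9, "malicious": 0.1},
         "malicious": {"benign": 0.3, "malicious": 0.7}}
emit  = {"benign":    {"ok": 0.7, "fail": 0.3},
         "malicious": {"ok": 0.2, "fail": 0.8}}

def viterbi(obs):
    """Most likely state sequence via V_l(i+1) = e_l(x_{i+1}) max_k a_{kl} V_k(i)."""
    V = [{k: start[k] * emit[k][obs[0]] for k in states}]
    ptr = []
    for x in obs[1:]:
        scores, back = {}, {}
        for l in states:
            best_k = max(states, key=lambda k: V[-1][k] * trans[k][l])
            back[l] = best_k
            scores[l] = emit[l][x] * V[-1][best_k] * trans[best_k][l]
        V.append(scores)
        ptr.append(back)
    # Traceback from the best final state.
    path = [max(states, key=lambda k: V[-1][k])]
    for back in reversed(ptr):
        path.append(back[path[-1]])
    return list(reversed(path))

print(viterbi(["ok", "fail", "fail"]))   # -> ['benign', 'benign', 'benign'] here
```

With these toy numbers the strong benign self-transition dominates, so the decoded path stays benign even though two observations are suspicious; changing the transition weights changes the decision, which is the point of joint rather than per-position decoding.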
Learning

In order to estimate the parameters of a hidden Markov model, two learning scenarios exist: when labeled training data is available and when the training data is not labeled. In this chapter, we shall only discuss the first case, when labeled training data is available. When the training data is not labeled, the Baum-Welch algorithm [149] can be used, which is based on the principle of expectation maximization [150]; alternately, Viterbi training can be used. When the training data is labeled, the observation sequence, $\vec{x} = x_1, x_2, \ldots, x_t$, is given and the corresponding state sequence, $\vec{q} = q_1, q_2, \ldots, q_t$, is known. We define,

$A_{kl}$ = number of times a transition occurs from state $k$ to state $l$ in $\vec{q}$.
$E_k(x)$ = number of times state $k$ in $\vec{q}$ emits symbol $x$ in $\vec{x}$.

The maximum likelihood parameters $\theta$, i.e., those which maximize $p(\vec{x} \mid \theta)$, can be shown to be:

$$a_{kl} = \frac{A_{kl}}{\sum_i A_{ki}} \qquad (A.11)$$

$$e_k(b) = \frac{E_k(b)}{\sum_c E_k(c)} \qquad (A.12)$$

Hence, given the labeled training data, the best estimate of the parameters is the relative frequency of the transitions and emissions that occur in the training data. A common drawback is that this can result in over-fitting, which affects the generalization capability of the model.
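Equations A.11 and A.12 reduce to simple normalized counts; the sketch below (labels and symbols invented for illustration) estimates the transition and emission tables from a single labeled sequence.

```python
from collections import Counter, defaultdict

# One hypothetical labeled training sequence: (observation, state) pairs.
labeled = [("ok", "benign"), ("ok", "benign"), ("fail", "malicious"),
           ("fail", "malicious"), ("ok", "benign")]

A = Counter()                       # A_{kl}: transition counts
E = defaultdict(Counter)            # E_k(b): emission counts

for (obs, state), (_, next_state) in zip(labeled, labeled[1:]):
    A[(state, next_state)] += 1
for obs, state in labeled:
    E[state][obs] += 1

# a_{kl} = A_{kl} / sum_i A_{ki}   and   e_k(b) = E_k(b) / sum_c E_k(c)
states = {s for _, s in labeled}
a = {k: {l: A[(k, l)] / max(1, sum(A[(k, i)] for i in states)) for l in states}
     for k in states}
e = {k: {b: c / sum(E[k].values()) for b, c in E[k].items()} for k in states}

print(a)
print(e)
```

In practice, pseudo-counts (smoothing) are usually added to the raw counts precisely to limit the over-fitting mentioned above.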
We, thus, observe that for a hidden Markov model the transition and emission probabilities determine the likelihood of a parse, i.e., given a hidden Markov model, an observation sequence $\vec{x}$ and a parse $\vec{q}$, the likelihood of the parse is:

$$p(\vec{x}, \vec{q}) = p(x_1, x_2, \ldots, x_t, q_1, q_2, \ldots, q_t) = a_{0q_1}\, a_{q_1 q_2} \cdots a_{q_{t-1} q_t}\, e_{q_1}(x_1)\, e_{q_2}(x_2) \cdots e_{q_t}(x_t)$$

A compact approach to represent this product is to consider all the parameters $a_{ij}$ and $e_i(b)$ as features. Let there be $n$ such features (both $a_{ij}$ and $e_i(b)$), collectively denoted $\theta_j$. Counting the number of times every feature $j = 1, \ldots, n$ occurs in $(\vec{x}, \vec{q})$, we represent the count as

$$F(j, \vec{x}, \vec{q}) = \text{number of times parameter } \theta_j \text{ occurs in } (\vec{x}, \vec{q})$$

Thus,

$$p(\vec{x}, \vec{q}) = \prod_{j=1,\ldots,n} \theta_j^{F(j, \vec{x}, \vec{q})}$$

which can be reduced to the form

$$p(\vec{x}, \vec{q}) = \exp \left[ \sum_{j=1,\ldots,n} \log(\theta_j)\, F(j, \vec{x}, \vec{q}) \right] \qquad (A.13)$$

Equation A.13 gives another way of representing a hidden Markov model which presents an intuitive approach for understanding the maximum entropy Markov models and, thus, the conditional random fields, which are discussed next.
Maximum Entropy Markov Model

Similar to how we extended the naive Bayes classifier to perform sequence labeling with the hidden Markov model, the maximum entropy model (Maxent classifier) in Equation A.6 can be extended to perform sequence labeling for an observation sequence, $\vec{x}$. This results in a maximum entropy Markov model, as represented in Figure A.7.

[Figure A.7: Maximum Entropy Markov Model. A chain of labels $y_1, y_2, \ldots, y_t$, where each $y_t$ is conditioned on the previous label $y_{t-1}$ and on the observation $x_t$.]
One approach to perform sequence labeling is to run the Maxent classifier locally for every observation in the sequence, resulting in a label for every observation $x_i$. An obvious drawback of this approach is that the label for each observation $x_i$ is only locally optimal, as opposed to the optimal sequence of labels $y_1, y_2, \ldots, y_t$. To avoid this, Viterbi decoding can be performed similar to the hidden Markov models. The maximum entropy Markov model represented in Figure A.7 can be factorized as:

$$p(y_t \mid y_{t-1}, x_t) = \frac{1}{Z(y_{t-1}, x_t)} \exp \sum_{k} \lambda_k f_k(y_t, y_{t-1}, x_t) \qquad (A.14)$$

where

$$Z(y_{t-1}, x_t) = \sum_{y_t} \exp \sum_{k} \lambda_k f_k(y_t, y_{t-1}, x_t)$$

is the partition function, $\lambda_k$ is the weight and $f_k(y_t, y_{t-1}, x_t)$ is the feature function defined for a feature $k$. Using the Viterbi algorithm, decoding can be performed similar to the hidden Markov model such that the probability of the overall sequence of labels is maximized, as opposed to finding the optimum class label at each observation $x_t$.
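A sketch of the locally normalized distribution in Equation A.14 is shown below (feature functions, weights and field names are invented). Note that the normalizer $Z(y_{t-1}, x_t)$ is computed per position, which is precisely the local normalization that gives rise to the label bias problem discussed next.

```python
import math

LABELS = ["normal", "attack"]

def features(y_t, y_prev, x_t):
    """Illustrative feature functions f_k(y_t, y_{t-1}, x_t)."""
    return [
        1.0 if y_t == y_prev else 0.0,                       # label persistence
        1.0 if y_t == "attack" and x_t["failed_logins"] > 3 else 0.0,
        1.0 if y_t == "normal" and x_t["failed_logins"] == 0 else 0.0,
    ]

weights = [1.0, 2.0, 1.5]   # lambda_k

def memm_step(y_prev, x_t):
    """p(y_t | y_{t-1}, x_t), normalized locally over y_t only."""
    scores = {y: math.exp(sum(w * f for w, f in zip(weights, features(y, y_prev, x_t))))
              for y in LABELS}
    z = sum(scores.values())                  # Z(y_{t-1}, x_t)
    return {y: s / z for y, s in scores.items()}

print(memm_step("normal", {"failed_logins": 6}))
```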
Comparing Equation A.13 and Equation A.14, we note that the hidden Markov model models the joint probability of the observation sequence and the label sequence by assuming that a state at time $t$ depends only on the state at time $t-1$, and that the observation at time $t$ depends only on the state at time $t$. Instead, the maximum entropy Markov model models the conditional distribution of the label sequence by conditioning on the observation at time $t$. Maximum entropy Markov models thus often perform better than hidden Markov models; however, they suffer from the label bias problem [34], [112], which is described next.
Label Bias in Maximum Entropy Markov Models

Label bias is the phenomenon in which the model effectively ignores the observation, thereby producing inaccurate results. It is attributed to the directed graphical structure and, hence, the local conditional modeling in each state [112]. As we discussed earlier, the maximum entropy Markov model is analogous to a sequence of independent Maxent classifiers, and thus the probability at every instant sums to one. As a result, if a certain sequence of states is more frequent during training, the same path is preferred irrespective of the observations seen at later stages during decoding. In other words, the previous state explains the current state so well that the observation at the current state is effectively ignored.
In [34], the authors explain the label bias phenomenon with the following example. Consider the finite state model represented in Figure A.8.

[Figure A.8: Label Bias Problem. A finite state model in which the start state 0 branches on the symbol r to states 1 and 4; these continue through states 2 and 5 (on the symbols i and o respectively) to the final state 3 on the symbol b.]
Suppose that the observation sequence is r i b. Once the model observes r, it assigns equal probability to both state 1 and state 4. Next the model observes i. However, given the model, states 1 and 4 each have only one outgoing transition, and because the incoming probability is equal to the outgoing probability due to local normalization, when the model observes i (or any other symbol) both states have no choice but to ignore the observation and move to the next state with maximum probability. As a result, states 2 and 5 receive equal probability. Further, if one of the observation sequences is more common in training, the transitions prefer the corresponding path irrespective of the observation.

In the above example, it is possible to eliminate label bias by collapsing states 1 and 4; however, this is a special case and not always possible [34]. Another approach is to start with a fully connected structure, but this would preclude the use of prior structural knowledge. Similar to label bias, the authors in [112] describe what they call observation bias, where the observations explain the states so well that the previous states are effectively ignored.

Conditional random fields effectively address these issues by dropping local normalization and instead normalizing globally over the observation sequence. However, before we describe conditional random fields, we present general undirected graphical models, which are necessary for a better understanding of conditional random fields.
A.3.2 Undirected Graphical Models

Def.: An undirected graphical model is a graph $G = (V, E)$ where $V = \{V_1, V_2, \ldots, V_N\}$ are the vertices and $E = \{(V_i, V_j), i \neq j\}$ are the undirected edges between vertex $V_i$ and vertex $V_j$. A vertex $V_i$ can be represented by the random variable representation $X_i$. Undirected graphical models are also known as Markov Random Fields [145].

Similar to the directed graphical models, the undirected graphical models describe the factorization of a set of random variables and their notion of conditional independence. The undirected graphical models factorize according to the probability distribution given in Equation A.15.
$$p(x_1, x_2, \ldots, x_n) = \frac{1}{Z} \prod_{c \in C} \psi_c(x_c) \qquad (A.15)$$

such that $\psi_c(x_c) > 0$ for all $c$ and $x_c$, and

$$Z = \sum_{x_1} \sum_{x_2} \ldots \sum_{x_n} \prod_{c \in C} \psi_c(x_c)$$

where $C$ is the set of cliques in the graph, $Z$ is the normalization factor known as the partition function, and the $\psi_c$ are strictly positive real valued functions known as the potential functions defined over the cliques. Potentials have no specific probabilistic interpretation. To make sure that Equation A.15 represents a probability distribution, it is necessary to calculate the partition function $Z$. Figure A.9 represents an undirected graphical model for three random variables.
[Figure A.9: Undirected Graphical Model. The three nodes $x_1$, $x_2$, $x_3$ connected by undirected edges.]
The undirected graphical model represented in Figure A.9 can be factorized as:

$$p(x_1, x_2, x_3) = \frac{1}{Z}\, \psi_{1,2}(x_1, x_2)\, \psi_{1,3}(x_1, x_3)\, \psi_{2,3}(x_2, x_3)\, \psi_{1,2,3}(x_1, x_2, x_3) \qquad (A.16)$$

where

$$Z = \sum_{x_1} \sum_{x_2} \sum_{x_3} \psi_{1,2}(x_1, x_2)\, \psi_{1,3}(x_1, x_3)\, \psi_{2,3}(x_2, x_3)\, \psi_{1,2,3}(x_1, x_2, x_3)$$

The complexity of an undirected graphical model depends upon the size of the largest clique. The overall complexity can be determined from $\sum_{c \in C} O(k^{m_c})$, where $m_c$ is the size of clique $c$. For the undirected graphical models, conditional independence properties can be determined simply by graph separation. More details can be found in [145].
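To illustrate Equations A.15 and A.16, the sketch below assigns arbitrary pairwise potentials to the three-node model of Figure A.9 and computes the partition function $Z$ by brute-force enumeration; the potential values are invented, and the three-way potential is taken as 1 for simplicity.

```python
from itertools import product

VALUES = [0, 1]                      # each binary variable x_i takes values 0 or 1

def psi_pair(a, b):
    """An arbitrary strictly positive pairwise potential: prefers agreement."""
    return 2.0 if a == b else 0.5

def unnormalized(x1, x2, x3):
    # Product of clique potentials (the three-way potential is taken as 1 here).
    return psi_pair(x1, x2) * psi_pair(x1, x3) * psi_pair(x2, x3)

# Partition function Z: sum over every joint assignment.
Z = sum(unnormalized(*assign) for assign in product(VALUES, repeat=3))

def joint(x1, x2, x3):
    """p(x1, x2, x3) = (1/Z) * prod_c psi_c(x_c), as in Equation A.15."""
    return unnormalized(x1, x2, x3) / Z

print(Z, joint(0, 0, 0), joint(0, 1, 0))
```

The enumeration makes the cost explicit: $Z$ requires summing over $2^3$ assignments here, and over $K^n$ assignments in general, which is why efficient inference relies on the clique structure.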
A.4 Conditional Random Fields

In [34], the authors proposed conditional random fields as a solution to the label bias problem. However, conditional random fields can also be considered as a generalization of the hidden Markov models [113], [151], a view which also helps in understanding them.

A major drawback of a hidden Markov model is that the state $q_i$ can observe only the observation symbol $x_i$; i.e., a strong independence assumption is made that the state at any instant depends only upon the previous state. Further, we observed that, using the dynamic programming approach, all $K^2$ transition features $a_{kl}$ and all $K$ emission features $e_l(x_i)$ are significant at every instant. Working in log space and suppressing the maximization, Equation A.7 and Equation A.10 can be rearranged as:

$$V_l(i) = V_k(i-1) + \big( a(k, l) + e(l, i) \big) = V_k(i-1) + g(k, l, x_i)$$

We note that the restriction in a hidden Markov model arises from the $x_i$ part of the function $g(k, l, x_i)$. Generalizing this function to $g(k, l, \vec{x}, i)$ removes the independence assumptions made in the hidden Markov model, and this forms the basis for conditional random fields. A large number of features can be defined at every position which can capture long range dependencies in the observation sequence $\vec{x}$. The higher the value of the function $g$, the more likely it is that state $l$ will follow state $k$ at position $i$. A conditional random field, thus, includes all the features present in a hidden Markov model and also has the capability to define a large number of additional features, which significantly improves its modeling power compared to that of a hidden Markov model.
A.4.1 Representation of Conditional Random Fields

Using Equation A.15,

$$p(x_1, x_2, \ldots, x_t) = \frac{1}{Z} \prod_{c \in C} \psi_c(x_c), \qquad \text{i.e.,} \qquad p(\vec{x}) = \frac{1}{Z} \prod_{c \in C} \psi_c(x_c)$$

The conditional probability can be written as:

$$
p(\vec{y} \mid \vec{x}) = \frac{p(\vec{y}, \vec{x})}{p(\vec{x})}
= \frac{p(\vec{y}, \vec{x})}{\sum_{\vec{y}'} p(\vec{y}', \vec{x})}
= \frac{\frac{1}{Z} \prod_{c \in C} \psi_c(y_c, x_c)}{\frac{1}{Z} \sum_{\vec{y}'} \prod_{c \in C} \psi_c(y'_c, x_c)}
= \frac{1}{Z(\vec{x})} \prod_{c \in C} \psi_c(y_c, x_c) \qquad (A.17)
$$

where

$$Z(\vec{x}) = \sum_{\vec{y}'} \prod_{c \in C} \psi_c(y'_c, x_c)$$

Equation A.17 presents the general formulation of a conditional random field. However, in this chapter, we shall focus on the linear chain structure for conditional random fields, which is motivated from [34], [113], [151] and [152] and is described next.
Linear Chain Conditional Random Field

Consider an observation sequence $\vec{x}$ of length $t + 1$. A linear chain conditional random field is represented in Figure A.10.

[Figure A.10: Linear Chain Conditional Random Field. An undirected chain over the labels $y_1, y_2, \ldots, y_{t+1}$, with each label $y_j$ also connected to the observation sequence $\vec{x} = x_1, x_2, \ldots, x_{t+1}$.]
Using Equation A.17, a linear chain conditional random field can be formulated as:

$$p(\vec{y} \mid \vec{x}) = \frac{1}{Z(\vec{x})} \prod_{j=1}^{t} \psi_j(\vec{y}, \vec{x}) \qquad (A.18)$$

where

$$Z(\vec{x}) = \sum_{\vec{y}'} \prod_{j=1}^{t} \psi_j(\vec{y}', \vec{x})$$

and

$$\psi_j(\vec{y}, \vec{x}) = \exp \left[ \sum_{i=1}^{m} \lambda_i f_i(y_{j-1}, y_j, \vec{x}, j) \right]$$

Equation A.18 can be rewritten as:

$$p(\vec{y} \mid \vec{x}) = \frac{1}{Z(\vec{x})} \exp \left[ \sum_{j=1}^{t} \sum_{i=1}^{m} \lambda_i f_i(y_{j-1}, y_j, \vec{x}, j) \right] \qquad (A.19)$$

where

$$Z(\vec{x}) = \sum_{\vec{y}'} \exp \left[ \sum_{j=1}^{t} \sum_{i=1}^{m} \lambda_i f_i(y'_{j-1}, y'_j, \vec{x}, j) \right]$$

In Equation A.19, summing over all possible label sequences ensures that it is a probability distribution. Further, for an observation of length $t + 1$ and the linear chain structure represented in Figure A.10, there exist $t$ maximal cliques, which are represented by adjacent nodes in the chain. The index $j$, thus, represents the position in the input sequence and sums over a sequence of length $t$. The index $i$ ranges over the $m$ feature functions defined on the specified set of variables. Further, the feature weights $\lambda_i$ do not depend on the position $j$; rather, they are tied to the individual feature functions.

From Equation A.18, we observe that the potential functions must be strictly positive real valued functions. Using the exponential function to define the potentials implicitly satisfies this positivity constraint. Further, as we shall observe later, since the exponential function is continuous and easily differentiable, it can be effectively used for maximum likelihood parameter estimation (estimating the $\lambda$'s) during training.
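A brute-force sketch of Equation A.19 for a very short chain is given below (not from the thesis). It enumerates every label sequence to compute $Z(\vec{x})$ exactly, which is only feasible for toy examples, whereas the recursions discussed later do this efficiently. The feature functions, weights and data are invented.

```python
import math
from itertools import product

LABELS = ["normal", "attack"]
START = "<s>"                         # imaginary label before position 1

def features(y_prev, y, x, j):
    """Illustrative feature functions f_i(y_{j-1}, y_j, x, j)."""
    return [
        1.0 if y == y_prev else 0.0,                          # transition feature
        1.0 if y == "attack" and x[j]["failed_logins"] > 3 else 0.0,
        1.0 if y == "normal" and x[j]["failed_logins"] == 0 else 0.0,
    ]

weights = [0.8, 2.0, 1.2]             # lambda_i

def score(y_seq, x):
    """sum_j sum_i lambda_i f_i(y_{j-1}, y_j, x, j) over the whole sequence."""
    total, y_prev = 0.0, START
    for j, y in enumerate(y_seq):
        total += sum(w * f for w, f in zip(weights, features(y_prev, y, x, j)))
        y_prev = y
    return total

def crf_probability(y_seq, x):
    """p(y | x) from Equation A.19, with Z(x) computed by full enumeration."""
    z = sum(math.exp(score(cand, x)) for cand in product(LABELS, repeat=len(x)))
    return math.exp(score(y_seq, x)) / z

x = [{"failed_logins": 0}, {"failed_logins": 6}]
print(crf_probability(["normal", "attack"], x))
```

The key contrast with the MEMM sketch earlier is that the normalizer here sums over entire label sequences rather than over the labels at a single position, which is the global normalization that avoids label bias.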
Feature Functions and Feature Selection
In hidden Markov models, every label (or state) y_i can look only at the observation x_i, and hence they cannot model long range dependencies among the observations. As discussed earlier, conditional random fields do not assume such independence among observations. This is accomplished by using the features defined while training the random field. In order to define the features, a clique template is defined which can extract a variety of features from the given training samples. The clique template makes assumptions on the structure of the underlying data by defining the composition of the cliques.

For a linear chain conditional random field, there exists only one clique template, which defines the links between y_j, y_{j-1} and x. Given the clique template, features can then be extracted for different realizations of y_j, y_{j-1} and x from the training data.
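As a rough sketch of this idea (the helper below is illustrative and not the feature extractor used in our experiments), candidate features for the linear chain clique template can be enumerated from labeled training sequences by recording the realizations of (y_{j-1}, y_j) together with the current observation:

def extract_features(training_data):
    # training_data: list of (labels, observations) pairs of equal length;
    # each candidate feature is identified by (previous label, current label, current observation)
    feature_ids = set()
    for labels, observations in training_data:
        for j in range(len(observations)):
            y_prev = labels[j - 1] if j > 0 else "start"
            feature_ids.add((y_prev, labels[j], observations[j]))
    return sorted(feature_ids)

train = [
    (["normal", "attack", "attack"], ["login", "fail", "fail"]),
    (["normal", "normal"], ["login", "read"]),
]
for feature_id in extract_features(train):
    print(feature_id)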
A.4.2 Training
Given the labeled training sequences, the objective of training a conditional random field is to determine the weights λ_i which maximize p(y|x). The maximum likelihood method is applied for parameter estimation. The log likelihood L on the training data D is given by:
\[
L(D) = \sum_{(\vec{y}, \vec{x}) \in D} \log p(\vec{y} \mid \vec{x}) \tag{A.20}
\]
\[
= \sum_{(\vec{y}, \vec{x}) \in D} \left[ \log \left( \frac{\exp\left( \sum_{j=1}^{t} \sum_{i=1}^{m} \lambda_i f_i(y_{j-1}, y_j, \vec{x}) \right)}{\sum_{\vec{y}\,'} \exp\left( \sum_{j=1}^{t} \sum_{i=1}^{m} \lambda_i f_i(y'_{j-1}, y'_j, \vec{x}) \right)} \right) \right]
\]
To avoid over-fitting, the likelihood is often penalized with some form of a prior distribution which has a regularizing influence. A number of priors such as the Gaussian, Laplacian, Hyperbolic and others can be used. Consider a simple prior of the form Σ_{i=1}^{m} λ_i²/(2σ_i²), where σ_i is the standard deviation of the parameter λ_i.
Hence, the likelihood becomes:
\[
L(D) = \sum_{(\vec{y}, \vec{x}) \in D} \left[ \log \left( \frac{\exp\left( \sum_{j=1}^{t} \sum_{i=1}^{m} \lambda_i f_i(y_{j-1}, y_j, \vec{x}) \right)}{\sum_{\vec{y}\,'} \exp\left( \sum_{j=1}^{t} \sum_{i=1}^{m} \lambda_i f_i(y'_{j-1}, y'_j, \vec{x}) \right)} \right) \right] - \sum_{i=1}^{m} \frac{\lambda_i^2}{2\sigma_i^2}
\]
\[
= \sum_{(\vec{y}, \vec{x}) \in D} \log \left( \exp\left( \sum_{j=1}^{t} \sum_{i=1}^{m} \lambda_i f_i(y_{j-1}, y_j, \vec{x}) \right) \right)
- \sum_{(\vec{y}, \vec{x}) \in D} \log \left( \sum_{\vec{y}\,'} \exp\left( \sum_{j=1}^{t} \sum_{i=1}^{m} \lambda_i f_i(y'_{j-1}, y'_j, \vec{x}) \right) \right)
- \sum_{i=1}^{m} \frac{\lambda_i^2}{2\sigma_i^2}
\]
\[
= \underbrace{\sum_{(\vec{y}, \vec{x}) \in D} \sum_{j=1}^{t} \sum_{i=1}^{m} \lambda_i f_i(y_{j-1}, y_j, \vec{x})}_{A}
- \underbrace{\sum_{(\vec{y}, \vec{x}) \in D} \log Z(\vec{x})}_{B}
- \underbrace{\sum_{i=1}^{m} \frac{\lambda_i^2}{2\sigma_i^2}}_{C} \tag{A.21}
\]
Taking partial derivatives of the likelihood with respect to the parameters λ_i, we get:
\[
\frac{\partial}{\partial \lambda_i} A = \sum_{(\vec{y}, \vec{x}) \in D} \sum_{j=1}^{t} f_i(y_{j-1}, y_j, \vec{x}) \tag{A.22}
\]
which is the same as the expected value of the feature under its empirical distribution and is denoted as Ẽ(f_i).
\[
\frac{\partial}{\partial \lambda_i} B = \sum_{(\vec{y}, \vec{x}) \in D} \frac{1}{Z(\vec{x})} \frac{\partial Z(\vec{x})}{\partial \lambda_i}
\]
\[
= \sum_{(\vec{y}, \vec{x}) \in D} \frac{1}{Z(\vec{x})} \sum_{\vec{y}\,'} \exp\left( \sum_{j=1}^{t} \sum_{i=1}^{m} \lambda_i f_i(y'_{j-1}, y'_j, \vec{x}) \right) \sum_{j=1}^{t} f_i(y'_{j-1}, y'_j, \vec{x})
\]
\[
= \sum_{(\vec{y}, \vec{x}) \in D} \sum_{\vec{y}\,'} \frac{1}{Z(\vec{x})} \exp\left( \sum_{j=1}^{t} \sum_{i=1}^{m} \lambda_i f_i(y'_{j-1}, y'_j, \vec{x}) \right) \sum_{j=1}^{t} f_i(y'_{j-1}, y'_j, \vec{x})
\]
\[
= \sum_{(\vec{y}, \vec{x}) \in D} \sum_{\vec{y}\,'} p(\vec{y}\,' \mid \vec{x}) \sum_{j=1}^{t} f_i(y'_{j-1}, y'_j, \vec{x}) \tag{A.23}
\]
which is the expectation under the model distribution and is denoted as E(f_i).
\[
\frac{\partial}{\partial \lambda_i} C = \frac{2\lambda_i}{2\sigma_i^2} = \frac{\lambda_i}{\sigma_i^2} \tag{A.24}
\]
Using A.21, A.22, A.23 and A.24, we get:
\[
\frac{\partial L(D)}{\partial \lambda_i} = \tilde{E}(f_i) - E(f_i) - \frac{\lambda_i}{\sigma_i^2} \tag{A.25}
\]
To find the maximum, we equate the right hand side in Equation A.25 to 0. Hence,
\[
\tilde{E}(f_i) - E(f_i) - \frac{\lambda_i}{\sigma_i^2} = 0 \tag{A.26}
\]
Ẽ(f_i) can be easily computed by counting how often every feature occurs in the training data. To efficiently calculate E(f_i), a modified version of the forward-backward algorithm can be used. Consider states s' and s. As described in [34], the forward (α) and backward (β) scores are defined as follows:
\[
\alpha_j(s \mid \vec{x}) = \sum_{s'} \alpha_{j-1}(s' \mid \vec{x}) \, \Psi_j(\vec{x}, s', s) \tag{A.27}
\]
\[
\beta_j(s \mid \vec{x}) = \sum_{s'} \beta_{j+1}(s' \mid \vec{x}) \, \Psi_{j+1}(\vec{x}, s, s') \tag{A.28}
\]
where
\[
\Psi_j(\vec{x}, s, s') = \exp\left( \sum_{i=1}^{m} \lambda_i f_i(y_{j-1} = s, y_j = s', \vec{x}) \right)
\]
Using the α and β functions, it is possible to compute the expectation under the model distribution efficiently by:
\[
E(f_i) = \sum_{(\vec{y}, \vec{x})} \frac{1}{Z(\vec{x})} \sum_{j=1}^{t} \sum_{s, s'} f_i(s, s', \vec{x}) \, \alpha_{j-1}(s \mid \vec{x}) \, \Psi_j(\vec{x}, s, s') \, \beta_j(s' \mid \vec{x})
\]
The forward-backward algorithm has a complexity of O(K²T), where K is the number of states and T is the length of the sequence. Training a conditional random field involves many iterations of the forward-backward algorithm.
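As an illustration of the forward recursion that makes this computation tractable, the following is a minimal sketch (the dummy start label and the toy potential below are our own assumptions and not the implementation used in this thesis) which computes log Z(x) for a linear chain conditional random field in O(K²T) time:

import math

def forward_log_z(x, labels, log_potential):
    # x: observation sequence of length T; labels: the K possible states;
    # log_potential(j, s_prev, s, x) returns log Psi_j, i.e. sum_i lambda_i * f_i(y_{j-1}=s_prev, y_j=s, x)
    # Initialization: alpha_1(s) = Psi_1(x, start, s); values are kept in log space for stability.
    log_alpha = {s: log_potential(0, "start", s, x) for s in labels}
    # Recursion: alpha_j(s) = sum_{s'} alpha_{j-1}(s') * Psi_j(x, s', s)
    for j in range(1, len(x)):
        log_alpha = {
            s: math.log(sum(math.exp(log_alpha[s_prev] + log_potential(j, s_prev, s, x))
                            for s_prev in labels))
            for s in labels
        }
    # Z(x) = sum_s alpha_T(s)
    return math.log(sum(math.exp(v) for v in log_alpha.values()))

def toy_log_potential(j, s_prev, s, x):
    # invented potential: favour the label "attack" whenever the observation is "fail"
    return 1.5 if (x[j] == "fail" and s == "attack") else 0.0

print(forward_log_z(["login", "fail", "fail"], ["normal", "attack"], toy_log_potential))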
A.4.3 Inference
Given the observed sequence x and the trained conditional random field, the objective is to find the most likely sequence of labels for the given observation. As with hidden Markov models, the Viterbi algorithm can be used to effectively determine the sequence of states. Often the number of states is assumed to be equal to the number of labels and, hence, we use the two interchangeably.
Let δ_j(s|x) represent the highest score of a sequence of states ending in state s at position j, defined as:
\[
\delta_j(s \mid \vec{x}) = \max_{y_1, \ldots, y_{j-1}} p(y_1, \ldots, y_{j-1}, y_j = s \mid \vec{x}) \tag{A.29}
\]
We then calculate
\[
\delta_{j+1}(s \mid \vec{x}) = \max_{s'} \delta_j(s' \mid \vec{x}) \, \Psi_{j+1}(\vec{x}, s', s) \tag{A.30}
\]
The algorithm is described in the following steps:

Initialization: for all s in S:
\[
\delta_1(s) = \Psi_1(\vec{x}, s_0, s), \qquad q_1(s) = s_0
\]
Recursion: for all s in S and 1 < j <= t:
\[
\delta_j(s) = \max_{s'} \delta_{j-1}(s') \, \Psi_j(\vec{x}, s', s), \qquad
q_j(s) = \operatorname*{argmax}_{s'} \delta_{j-1}(s') \, \Psi_j(\vec{x}, s', s)
\]
Termination:
\[
p^{*} = \max_{s'} \delta_t(s'), \qquad l^{*}_t = \operatorname*{argmax}_{s'} \delta_t(s')
\]
Traceback: for j = t - 1, ..., 1:
\[
l^{*}_j = q_{j+1}(l^{*}_{j+1})
\]
The complexity of the algorithm is O(K²T), where K is the number of states and T is the length of the sequence.
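A minimal sketch of this decoding procedure is shown below. The dummy start state, the label set and the toy potential are illustrative assumptions and not the thesis implementation; scores are kept in log space, so products of potentials become sums.

def viterbi(x, labels, log_potential):
    # Return the most likely label sequence for the observation sequence x.
    # log_potential(j, s_prev, s, x) plays the role of log Psi_j in the recursion above.
    T = len(x)
    # Initialization: delta_1(s) = Psi_1(x, start, s)
    delta = [{s: log_potential(0, "start", s, x) for s in labels}]
    backptr = [{s: "start" for s in labels}]
    # Recursion: delta_j(s) = max_{s'} delta_{j-1}(s') * Psi_j(x, s', s)
    for j in range(1, T):
        delta.append({})
        backptr.append({})
        for s in labels:
            best_prev = max(labels,
                            key=lambda sp: delta[j - 1][sp] + log_potential(j, sp, s, x))
            delta[j][s] = delta[j - 1][best_prev] + log_potential(j, best_prev, s, x)
            backptr[j][s] = best_prev
    # Termination and traceback
    last = max(labels, key=lambda s: delta[T - 1][s])
    path = [last]
    for j in range(T - 1, 0, -1):
        path.append(backptr[j][path[-1]])
    return list(reversed(path))

def toy_log_potential(j, s_prev, s, x):
    # invented potential: favour the label "attack" whenever the observation is "fail"
    return 1.5 if (x[j] == "fail" and s == "attack") else 0.0

print(viterbi(["login", "fail", "read"], ["normal", "attack"], toy_log_potential))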
A.4.4 Tools Available for Conditional Random Fields
We now list some of the tools which implement conditional random fields. This, however, is not a complete list, and many other tools exist that can be used. The tools include CRF++ [120], Mallet [126], Sunita Sarawagi's CRF package [153] and Kevin Murphy's MATLAB CRF code [154]. We mainly experimented with CRF++ and found it to be effective and easy to use and customize. [120] gives an in-depth description of the software and describes the commands necessary to run it, using the example of the named entity recognition task from language processing.
A.5 Comparing the Directed and Undirected Graphical Models
Both directed and undirected graphical models allow complex distributions to be factorized into
a product of simpler distributions (functions). However, the two models differ in the way they
determine the conditional independence relations. The directed models determine the conditional
independence properties via the d-separation test while the undirected models determine the same
via graph separation [145].
As a result, the two models also differ in the way the probability distribution is factorized. In directed graphs, the factorization results in a product of conditional probability distributions, while in undirected graphs, the factorization results in arbitrary functions. Factorization into arbitrary functions enables us to define functions which can capture dependencies among variables; however, this comes at the cost of calculating the normalization constant Z. The directed graphical models do not require calculating such a partition function.
Given the two ways (directed and undirected models) to factorize a distribution, consider a
set S which represents the universe of distributions. Then, using the approach of directed graphical models, we can represent only a subset, D, of distributions which follow all the conditional
independence properties. Similarly, using the approach presented by the undirected models, we
can represent a subset, U, of distributions which follow all the conditional independence relations.
This can be represented as shown in Figure A.11.
Figure A.11: Factorization in Graphical Models

Note that Figure A.11 shows that there exists a subset of distributions (which follow all the conditional independence relations) which can be represented both by directed and by undirected models, and there also exist distributions which can be represented by only one of the two. A trivial example where the factorizations are alike is when all the random variables are independent.
A.6 Conclusions
In this chapter we described conditional random fields in detail and discussed their properties, the assumptions which motivate their use in a particular problem, and their advantages and disadvantages with respect to previously known approaches that can be used for similar tasks. The key features of conditional random fields are:

- Conditional random fields can be considered as a generalization of the hidden Markov models.
- Conditional random fields eliminate the label bias problem which is present in other conditional models such as the maximum entropy Markov models.
- Long range dependencies can be modeled among observations using conditional random fields.
- Training a conditional random field involves many iterations of the forward-backward algorithm, which has a complexity of O(K²T), where K is the number of states and T is the length of the sequence.
- Inference or test time complexity for a conditional random field is also O(K²T), where K is the number of states and T is the length of the sequence.
- Conditional random fields have been shown to be successful in many domains including computational linguistics, computational biology and real-time intrusion detection.
Appendix B
Feature Selection for Network Intrusion
Detection
As described in Chapter 4, every record in the KDD 1999 data set presents 41 features which can be used for detecting a variety of attacks such as the Probe, DoS, R2L and U2R. However, using all the 41 features for detecting attacks belonging to all these classes severely affects the performance of the system and also generates superfluous rules, resulting in fitting irregularities in the data which can misguide classification. Hence, we performed feature selection to effectively detect different classes of attacks. We now describe our approach for selecting features for every layer and why some features were chosen over others.
B.1 Feature Selection for Probe Layer
Probe attacks are aimed at acquiring information about the target network from a source that is often external to the network. For detecting Probe attacks, basic connection level features such as the duration of connection and source bytes are significant, while features like the number of file creations and the number of files accessed are not expected to provide significant information. We selected only five features for the Probe layer. The features selected for detecting Probe attacks are presented in Table B.1.
Table B.1: Probe Layer Features
Feature Number Feature Name
1 duration
2 protocol type
3 service
4 flag
5 src bytes
B.2 Feature Selection for DoS Layer
DoS attacks are meant to prevent the target from providing service(s) to its users by flooding the network with illegitimate requests. Hence, to detect attacks at the DoS layer, network traffic features such as the percentage of connections having the same destination host and same service, and packet level features such as the duration of a connection, protocol type, source bytes, percentage of packets with errors and others are significant. To detect DoS attacks, it may not be important to know whether a user is logged in or not, whether or not the root shell is invoked, or the number of files accessed and, hence, such features are not considered in the DoS layer. From all the 41 features, we selected only nine features for the DoS layer. The features selected for detecting DoS attacks are presented in Table B.2.
Table B.2: DoS Layer Features
Feature Number Feature Name
1 duration
2 protocol type
4 flag
5 src bytes
23 count
34 dst host same srv rate
38 dst host serror rate
39 dst host srv serror rate
40 dst host rerror rate
B.3 Feature Selection for R2L Layer
R2L attacks are one of the most difficult attacks to detect and most of the present systems cannot detect them reliably. However, our experimental results presented earlier show that careful feature selection can significantly improve their detection. We observed that effective detection of the R2L attacks involves both the network level and the host level features. Hence, to detect R2L attacks, we selected both the network level features, such as the duration of connection and service requested, and the host level features, such as the number of failed login attempts, among others. Detecting R2L attacks requires a large number of features and we selected 14 features. The features selected for detecting R2L attacks are presented in Table B.3.
Table B.3: R2L Layer Features
Feature Number Feature Name
1 duration
2 protocol type
3 service
4 flag
5 src bytes
10 hot
11 num failed logins
12 logged in
13 num compromised
17 num file creations
18 num shells
19 num access files
21 is host login
22 is guest login
B.4 Feature Selection for U2R Layer
U2R attacks involve semantic details which are very difficult to capture at an early stage at the network level. Such attacks are often content based and target an application. Hence, for detecting U2R attacks, we selected features such as the number of file creations and the number of shell prompts invoked, while we ignored features such as protocol and source bytes. From all the 41 features, we selected only eight features for the U2R layer. Features selected for detecting U2R attacks are presented in Table B.4, and a consolidated sketch of the per-layer feature subsets follows the table.
Table B.4: U2R Layer Features
Feature Number Feature Name
10 hot
13 num compromised
14 root shell
16 num root
17 num file creations
18 num shells
19 num access files
21 is host login
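For convenience, the per-layer feature subsets of Tables B.1 to B.4 can be collected into a single lookup structure. The sketch below is only an illustration of how a layered system might select the relevant columns of a 41-feature KDD record; the dictionary layout and helper function are ours, and only the feature numbers come from the tables above.

# Per-layer feature numbers taken from Tables B.1 to B.4 (KDD 1999 feature numbering).
LAYER_FEATURES = {
    "probe": [1, 2, 3, 4, 5],
    "dos":   [1, 2, 4, 5, 23, 34, 38, 39, 40],
    "r2l":   [1, 2, 3, 4, 5, 10, 11, 12, 13, 17, 18, 19, 21, 22],
    "u2r":   [10, 13, 14, 16, 17, 18, 19, 21],
}

def select_features(record, layer):
    # record: the full 41-feature KDD record as a list (feature numbers are 1-indexed)
    return [record[i - 1] for i in LAYER_FEATURES[layer]]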
B.5 Template Selection
To train a conditional random field, feature functions must be chosen a priori. Hence, we defined a template which can be used to extract all possible feature functions from the given training data to train the conditional random field using the CRF++ tool [120].

The template can be used to define both unigram and bigram feature functions. For unigram feature functions, which begin with U, the template defines a special macro %x[row,col] which is used to specify a token in the input data, where row specifies the relative position from the current focusing token and col specifies the absolute position of the column. The number of feature functions generated by this type of template amounts to (L * N), where L is the number of output classes and N is the number of unique strings expanded from the given template.

For bigram feature functions, which begin with B, a combination of the current output token and previous output token (bigram) is automatically generated. This type of template generates a total of (L * L * N) distinct features, where L is the number of output classes and N is the number of unique features generated by the templates. A sample template used in our experiments is presented next, followed by a short sketch of how its unigram macros expand.
# Unigram
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-1,0]/%x[0,0]
U06:%x[0,0]/%x[1,0]
# Bigram
B
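The following rough sketch (not part of CRF++ itself; the toy input rows and boundary marker are our own assumptions) illustrates how the unigram macros in the template above expand at a given position of the input: %x[row,col] picks the token row lines away from the current line, in column col.

TEMPLATES = ["U00:%x[-2,0]", "U01:%x[-1,0]", "U02:%x[0,0]", "U03:%x[1,0]",
             "U04:%x[2,0]", "U05:%x[-1,0]/%x[0,0]", "U06:%x[0,0]/%x[1,0]"]

def expand(templates, rows, pos):
    # rows: list of token lists (one list per input line); pos: index of the current line
    out = []
    for template in templates:
        name, body = template.split(":", 1)
        # replace every %x[row,col] macro with the corresponding token
        while "%x[" in body:
            start = body.index("%x[")
            end = body.index("]", start)
            r, c = (int(v) for v in body[start + 3:end].split(","))
            line = pos + r
            token = rows[line][c] if 0 <= line < len(rows) else "_B"  # boundary marker
            body = body[:start] + token + body[end + 1:]
        out.append(name + ":" + body)
    return out

rows = [["tcp"], ["http"], ["SF"], ["181"], ["5450"]]
print(expand(TEMPLATES, rows, 2))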
Appendix C
Feature Selection for Application
Intrusion Detection
As described in Chapter 5, we used 6 features to represent a user session; a sketch of how a session is encoded from these features follows the list. The six features are:
1. Number of data queries generated in a single web request.
2. Time taken to process the request.
3. Response generated for the request.
4. Amount of data transferred (in bytes).
5. Request made (or the function invoked) by the client.
6. Reference to the previous request in the same session.
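A hypothetical illustration of how a single user session can be laid out as a labeled sequence from these six features is shown below; the field values, column order and label names are invented, and we assume the usual CRF++ convention of whitespace-separated columns with the label in the final column and a blank line between sequences.

# one tuple per web request: (queries, time_ms, response, bytes, request, referrer, label)
session = [
    (1, 12, 200, 1024, "login", "-", "normal"),
    (3, 45, 200, 2048, "view_account", "login", "normal"),
    (9, 310, 500, 128, "search", "view_account", "attack"),
]

with open("session.train", "w") as out:
    for row in session:
        out.write(" ".join(str(value) for value in row) + "\n")
    out.write("\n")  # blank line separates user sessions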
C.1 Template Selection
To train a conditional random field, feature functions must be chosen a priori. Hence, we defined a template which can be used to extract all possible feature functions from the given training data to train the conditional random field using the CRF++ tool [120].

The template can be used to define both unigram and bigram feature functions. For unigram feature functions, which begin with U, the template defines a special macro %x[row,col] which is used to specify a token in the input data, where row specifies the relative position from the current focusing token and col specifies the absolute position of the column. The number of feature functions generated by this type of template amounts to (L * N), where L is the number of output classes and N is the number of unique strings expanded from the given template.

For bigram feature functions, which begin with B, a combination of the current output token and previous output token (bigram) is automatically generated. This type of template generates a total of (L * L * N) distinct features, where L is the number of output classes and N is the number
of unique features generated by the templates. A sample template used in our experiments is
presented next.
# Unigram
U001:%x[-4,0] , U002:%x[-3,0] , U003:%x[-2,0] ,
U004:%x[-1,0] , U005:%x[0,0] , U006:%x[1,0] ,
U007:%x[2,0] , U008:%x[3,0] , U009:%x[4,0] ,
U101:%x[-4,1] , U102:%x[-3,1] , U103:%x[-2,1] ,
U104:%x[-1,1] , U105:%x[0,1] , U106:%x[1,1] ,
U107:%x[2,1] , U108:%x[3,1] , U109:%x[4,1] ,
U201:%x[-4,2] , U202:%x[-3,2] , U203:%x[-2,2] ,
U204:%x[-1,2] , U205:%x[0,2] , U206:%x[1,2] ,
U207:%x[2,2] , U208:%x[3,2] , U209:%x[4,2] ,
U301:%x[-4,3] , U302:%x[-3,3] , U303:%x[-2,3] ,
U304:%x[-1,3] , U305:%x[0,3] , U306:%x[1,3] ,
U307:%x[2,3] , U308:%x[3,3] , U309:%x[4,3] ,
U401:%x[-4,4] , U402:%x[-3,4] , U403:%x[-2,4] ,
U404:%x[-1,4] , U405:%x[0,4] , U406:%x[1,4] ,
U407:%x[2,4] , U408:%x[3,4] , U409:%x[4,4] ,
U501:%x[-4,5] , U502:%x[-3,5] , U503:%x[-2,5] ,
U504:%x[-1,5] , U505:%x[0,5] , U506:%x[1,5] ,
U507:%x[2,5] , U508:%x[3,5] , U509:%x[4,5]
# Bigram
B