
Query Pattern Access and Fuzzy Clustering Based Database Intrusion Detection System


Indu Singh, Shivam Gupta, Shivam Maini, Shubham, Simran
Department of Computer Science and Engineering
Delhi Technological University, Delhi, India
indu.singh.dtu14@gmail.com, shivam_bt2k16@dtu.ac.in, shivammaini_bt2k16@dtu.ac.in,
shubham_bt2k16@dtu.ac.in, simran_bt2k16@dtu.ac.in

Abstract— Hackers and malicious insiders
perpetually try to steal, manipulate and corrupt sensitive data elements, and an organization's database servers are often the primary targets of these attacks. In the broadest sense, misuse (witting or unwitting) by authorized database users, database administrators, or network/systems managers constitutes the insider threat that our work intends to address. Insider threats are more menacing because, in contrast to outsiders (hackers or unauthorised users), insiders have authorised access to the database and knowledge of its critical nuances. Database security involves using a multitude of information security controls to protect databases against breaches of confidentiality, integrity and availability (CIA), spanning technical, procedural/administrative and physical controls. We therefore propose an Intrusion Detection System (IDS), QPAFCS (Query Pattern Access and Fuzzy Clustering System), that monitors a database management system and prevents inference attacks on sensitive attributes by auditing user access patterns.

Keywords: Database Intrusion Detection, Fuzzy Clustering, User Access Pattern, Insider Attacks, Dubiety Score
I. INTRODUCTION

Data protection from insider threats is essential to most organizations. Attacks from insiders can be more damaging than those from outsiders, since in most cases insiders have full or partial access to the data; traditional mechanisms for data protection, such as authentication and access control, therefore cannot be used on their own to protect against insiders. Since recent work has shown that insider attacks are accompanied by changes in the access patterns of users, user access pattern mining [1] is a suitable approach for detecting these attacks. It builds profiles of the normal access patterns of users from past access logs; new accesses are later checked against these profiles, and mismatches indicate potential attacks.

Access control [2] is a security technique that regulates who can view or use resources in a computing environment. Diverse access control systems perform authorization, identification, authentication and access approval. Intrusion Detection Systems (IDS) [3] scrutinise and unearth surreptitious activities perpetrated by malevolent users. An IDS works either by looking for signatures of known attacks or by looking for deviations from normal activity. Normally, an IDS undergoes a training phase with intrusion-free data, during which it maintains a log of benign transactions; pattern matching [4] is then used to decide whether an action is malign. This is called anomaly-based detection [5]. When attacks are detected using known "signatures" from previous knowledge of the attack, it is called signature-based detection [6]. Once detected, malicious actions are either blocked or probed, depending on the organisation's policy. An IDS, however, needs to be dynamic, robust and fast; different IDS architectures behave differently and have different measures of performance, and every organisation must make sure that the IDS it adopts satisfies its requirements.

Several anomaly detection techniques have been proposed to detect anomalous data accesses. Some rely on analysing the syntax of input queries. Although such approaches are computationally efficient, they cannot detect anomalies in scenarios like the following. Consider a clerk in an organization who issues queries to a relational database that typically select a few rows from specific tables. An access from this clerk that selects all or most of the rows of these tables should be considered anomalous with respect to the clerk's daily access pattern, yet approaches based on syntax alone cannot classify such an access as anomalous. Syntactic approaches therefore have to be extended to take into account semantic features of queries, such as the number of result rows. An important requirement is that queries be inspected before execution, so that malicious queries are prevented from making changes to the database.

From the technical perspective, the main purpose is to ensure the effective enforcement of security regulations. Auditing is an important technique for examining whether user behaviour in a system conforms to security policies. Many methods audit database processing by comparing a user's SQL query expression against predefined patterns in order to find anomalies, but a malicious query may be dressed up to look benign and thus evade such syntactic detection. To overcome this shortcoming, data-centric methods further audit whether the data a query actually accessed involves any banned information. Such an audit, however, concerns a concrete policy rather than an overall view of multiple security policies; it requires explicit audit commands articulated by experienced professionals and a great deal of interactive analysis. Since in practice an anomaly pattern cannot be articulated in advance, it is difficult to detect such fraud with current audit methods.

Anomaly detection technology is used to identify abnormal behaviours that are statistical outliers. Some probabilistic methods learn normal patterns against which anomalies are detected, but these methods assume that very few users deviate from the normal patterns; if many users behave anomalously, the learned normal pattern diverges. These works also do not examine user behaviour from a historical or incremental view, which may overlook some malicious behaviour, and if a group of people collude it is difficult to detect them with current methods.

In QPAFCS we tackle the insider threat problem using several complementary ideas. We take into consideration the fact that certain data elements are more critical to the database than others, and we pay special attention to the security of these critical data elements. We also recognise the presence of data attributes that can be manipulated to indirectly influence the crucial attributes, and we address the threat to our critical data elements through such attributes as well.

We additionally investigate a suspected user from a diachronic view by analysing his or her historical behaviour. We store a measure denoting how suspicious a user has been: the greater this measure, the greater the chance that a new query is malicious. This measure also addresses the problem of gradually escalating malicious behaviour, since the historical statistic accumulates over time.

The main purpose of our work is to recognise user access patterns. Our Intrusion Detection System (IDS) pays special attention to certain semantically critical data elements, along with the elements that can be used to infer them. We present an approach that combines a user's historic and present access patterns to classify an incoming transaction as malicious or non-malicious. Using FCM, we partition the users into fuzzy clusters, each of which carries a set of rules in its cluster profile. In the detection phase, new transactions are checked against the rules in these clusters, and a suitable action is taken depending on the nature of the transaction. The main advantage of our IDS lies in its ability to prevent inference attacks on Critical Data Elements.

The remainder of this work is organized as follows. Section 2 presents prior research related to this work. Section 3 introduces the fuzzy clustering and belief-update framework. Section 4 discusses the approach using examples. Section 5 discusses how to apply our method in a practical system. Experimental evaluation is presented in Section 6.

II. RELATED WORK
Numerous researchers are currently working in the field of Network Intrusion Detection Systems, but only a few have proposed research work on Database IDSs. Several systems for intrusion detection in operating systems and networks have been developed, however they are not adequate for protecting databases from intruders [11]. IDSs for databases work at the query level, the transaction level and the user (role) level. Bertino et al. described the challenges of ensuring data confidentiality, integrity and availability and the need for database security, wherein the need for database IDSs to tackle insider threats was discussed.

Panda et al. [19] propose to employ a data mining approach for determining data dependencies in the database system. Classification rules reflecting data dependencies are deduced directly from the database log. These rules represent what data items probably need to be read before an update operation and what data items are most likely to be written following this update operation. Transactions that do not comply with the generated data dependencies are flagged as anomalous.

Database IDSs include temporal analysis of queries and data dependencies among attributes, queries and transactions. Lee et al. [28] proposed a temporal-analysis-based intrusion detection method which incorporated time signatures and recorded the update gap of temporal attributes; any anomaly in the update pattern of an attribute was reported as an intrusion. The breakthrough introduction of association rule mining by Aggarwal et al. [22] helped in finding data dependencies among data attributes, which was later incorporated in the field of intrusion detection in databases.

During the initial development of data dependency association rule mining, DEMIDS, a misuse detection system for relational database systems, was proposed by Chung et al. [7]. Profiles that specified user access patterns were derived from the audit log, and distance metrics were applied for recognizing data items; together these were used to represent the scope of users. However, once the number of users of a single system becomes substantial, maintaining profiles becomes a burdensome procedure. Another flaw was that the system assumed domain information about a given schema.

Hu et al. [16] presented a data mining based intrusion detection system which used static analysis of the database audit log to mine dependencies among attributes at the transaction level and represented those dependencies as sets of read and write operations on each data item. In another approach proposed by Hu et al., techniques of sequential pattern mining were applied to the training log in order to identify frequent sequences at the transaction level. This approach helped in identifying groups of malicious transactions which individually complied with the user behavior. The approach was further improved by Hu et al. by clustering legitimate user transactions into user tasks for the discovery of inter-transaction data dependencies.

The method proposed extends the approach by assigning weights to all the operations on data attributes; transactions which did not follow the data dependencies were marked as malicious. The major disadvantage of user-assigned weights is that they are static and unrelated to other data attributes. Kamra et al. [27] employed a clustering technique on an RBAC model to form profiles based on attribute access which represented normal user behavior; an alarm is raised when behavior anomalous to the role profile is observed.

Bezdek, Ehrlich and Full (1984) proposed the Fuzzy C-Means algorithm. The basic idea behind this approach is to express the similarity a data point shares with each cluster through a membership function. This measure of similarity lies between zero and one, signifies the extent of similarity between the data point and the cluster, and is termed the membership value. The main aim of the technique is to construct fuzzy partitions of a given data set.

Y. Yu et al. [29] illustrated a fuzzy logic based anomaly intrusion detection system. A Naïve Bayes classifier is used to classify an input event as normal or anomalous; the classifier is built from the independent frequency of each system call issued by a process under normal conditions. The ratio of the probability of a sequence coming from a process to the probability of it not coming from that process serves as the input to a fuzzy system for classification.

A hybrid approach was described by Doroudian et al. [26] to identify intrusions at both the transaction and inter-transaction level. At the transaction level, a set of predefined expected transactions is specified to the system, and a sequential rule mining algorithm is applied at the inter-transaction level to find dependencies between the identified transactions. The drawback of such a system is that sequences with frequencies lower than the threshold value are neglected; infrequent sequences are completely overlooked by the system, irrespective of their importance, and as a result the True Positive Rate of the system falls.

This drawback was overcome by Sohrabi et al. [20], who proposed a novel approach, ODARDM, in which rules are formulated for lower-frequency item sets as well. These rules were extracted using leverage as the rule value measure, which minimized the interesting data dependencies; as a result the True Positive Rate increased while the False Positive Rate decreased. In recent developments, Rao et al. [21] presented a query access detection approach based on Principal Component Analysis and Random Forests to reduce data dimensionality and produce only relevant and uncorrelated data; as the dimensionality is reduced, both the system performance and the True Positive Rate increase.

In 2009, Majumdar et al. [15] proposed a comprehensive database intrusion detection system that integrates different types of evidence using an extended Dempster-Shafer theory. Besides combining evidence, they also incorporate learning into the system through the application of prior knowledge and observed data on suspicious users. In 2016, Bertino et al. [14] tackled the insider threat problem from a data-driven systemic view. User actions are recorded as historical log data in a system, and the evaluation investigates the data that users actually process. From the horizontal view, users are grouped together according to their responsibilities and a normal pattern is learned from the group behaviours; a suspected user is also investigated from the diachronic view by comparing his or her historical behaviour with the historical average of the same group.

Anomaly detection has been an important research problem in security analysis, and the development of methods that can detect malicious insider behavior with high accuracy and a low false alarm rate is vital [10]. In this problem setting, McGough et al. [8] designed a system to identify anomalous user behavior by comparing an individual user's activities against their own routine profile as well as against the organization's rules. They applied two independent approaches, machine learning and a statistical analyzer, to the data; the results from the two parts are combined into a consensus which is then mapped to a risk score. Their system showed high accuracy, low false positives and minimal effect on the existing computing and network resources in terms of memory and CPU usage.

Bhattacharjee et al. proposed a graph-based method that investigates user behavior from two perspectives: (a) anomaly with reference to the normal activities of an individual user observed over a prolonged period of time, and (b) the relationship between a user and colleagues with similar roles/profiles. They used the CMU-CERT dataset in an unsupervised manner. In their model the Boykov-Kolmogorov algorithm was used, and the results were compared with different algorithms including a single-model one-class SVM, individual profile analysis, k-user clustering and Maximum Clique (MC). Their proposed model was evaluated by the Area-Under-Curve (AUC) metric, which showed an impressive improvement compared to the other algorithms [9].

T. Rashid et al. observed that the parameter learning task in HMMs is to find, given an output sequence or a set of such sequences, the best set of state transition and emission probabilities, usually by deriving the maximum likelihood estimate of the HMM parameters given the set of output sequences. No tractable algorithm is known for solving this problem exactly, but a local maximum likelihood can be derived efficiently using the Baum-Welch algorithm or the Baldi-Chauvin algorithm; the Baum-Welch algorithm is a special case of the expectation-maximization algorithm. If HMMs are used for time series prediction, more sophisticated Bayesian inference methods, such as Markov chain Monte Carlo (MCMC) sampling, are proven to be favorable over finding a single maximum likelihood model in terms of both accuracy and stability [12]. Log data are high-dimensional and contain irrelevant and redundant features; feature selection methods can be applied to reduce dimensionality, decrease training time and enhance learning performance.
3. OUR APPROACH

3.1 Basic Notations

Large organisations deal with tremendous amounts of data whose security is of prime interest. The data in databases comprises attributes describing real-life objects called entities. The attributes have varying levels of sensitivity, i.e. not all attributes are equally important to the integrity of the database. As an example, signatures and other biometric data are highly sensitive attributes for a financial organisation such as a bank, in comparison to attributes like name or gender. Unauthorised access to the crucial attributes is therefore of greater concern: only certain employees may have access to such data elements, and access by all others must be blocked instantly to ensure the confidentiality and consistency of the data.

Our proposed model QPAFCS (Query Pattern Access and Fuzzy Clustering System) pays special attention to sensitive data attributes, referred to in the text as CDEs (Critical Data Elements). The attributes that can be used to indirectly infer CDEs are also critical to the functioning of the organisation; for instance, the account number of a user may be used to access the signatures and other crucial details about him. Such attributes are referred to in the text as DAEs (Directly Associated Elements).

We propose a two-phase detection and prevention model that clusters users based on the similarity of their attribute access patterns and the types of queries they perform, i.e. our model tries to track the access pattern of each user and classify it as normal or malicious. The strength of our model lies in its ability to prevent unauthorised retrieval and modification of the most sensitive data elements (CDEs). Our model also makes sure that the query pattern for access to CDEs is specific and fixed for a particular user, i.e. the user is associated with his regular access behaviour. Any deviation from the regular arrangement lowers the user's confidence score and may be indicative of malicious intent. The following terminologies are used:

Definition 1 (Transaction) A set of queries executed by a user. Each transaction is represented by a unique transaction ID and also carries the user's ID; hence <Uid, Tid> acts as a unique identification key for each set of query patterns. Each transaction T is denoted as

<Uid, Tid, <q1, q2, ..., qn>>

where qi denotes the ith query, i ∈ [1 ... n].

For example, suppose a user has id 1001 and executes the following set of SQL queries:

q1: SELECT a,b,c FROM R1,R2 WHERE R1.A>R2.B
q2: SELECT P FROM R5 WHERE R5.P=10

Then this is said to be a transaction of the form t = <1001, 67, <q1, q2>>.

Definition 2 (Query) A query is a standard database management system request for inserting or retrieving data from a database table or a combination of tables. We define a query as a read or write request on attributes of a relation. A query is represented as

<O(D1), O(D2), ..., O(Dn)>

where D1, D2, ..., Dn ∈ Rs, Rs is the relation schema, Di are the attributes and O represents the operation, i.e. read or write, O ∈ {R, W}.

For example, examine the following transaction:

start transaction
select balance from Account where Account_Number='9001';
select balance from Account where Account_Number='9002';
update Account set balance=balance-900 where Account_Number='9001';
update Account set balance=balance+900 where Account_Number='9002';
commit;   //if all SQL queries succeed
rollback; //if any of the SQL queries failed

The query sequence corresponding to this transaction is:

<<R(Account_Number), R(balance)>,
<R(Account_Number), R(balance)>,
<R(Account_Number), R(balance), W(balance)>,
<R(Account_Number), R(balance), W(balance)>>

Definition 3 (Read Sequence) A read sequence is defined as

{R(x1), R(x2), ..., O(xn)}

where O represents the operation, i.e. read or write, O ∈ {R, W}. The read sequence represents that the transaction may need to read all data items x1, x2, ..., xn-1 before it performs operation O on data item xn.

For example, consider the following update statement in a transaction:

Update Table1 set x = a + b + c where d = 90;

In this statement, before updating x, the values of a, b, c and d must be read, and then the new value of x is calculated. So <R(a), R(b), R(c), R(d), W(x)> ∈ RS(x), where RS(x) denotes the read sequence set of x.

Definition 4 (Write Sequence) A write sequence is defined as

{O(x1), W(x2), ..., W(xn)}

where O represents the operation, i.e. read or write, O ∈ {R, W}. It represents that the transaction may need to write data items x2, ..., xn in this order after it operates on data item x1.

For example, consider the following update statements in one transaction:

Update Table1 set x = a + b + c where a=50;
Update Table1 set y = x + u where x=60;
Update Table1 set z = x + w + v where w=80;

Using the above example, it can be noted that <W(x), W(y), W(z)> is one write sequence of data item x, that is <W(x), W(y), W(z)> ∈ WS(x), where WS(x) denotes the write sequence set of x.

Definition 5 (Read Rules (RR)) Read rules are the association rules generated from read sequences whose confidence is greater than the user-defined threshold (Ψconf). A read rule is represented as

{R(x1), R(x2), ...} ⇒ O(x)

For every sequential pattern <R(x1), R(x2), ..., R(xn-1), O(xn)> in the read sequence set, generate the read rule {R(x1), R(x2), ..., R(xn-1)} ⇒ O(xn). If the confidence of the rule is larger than the minimum confidence (Ψconf), it is added to the set of read rules, which implies that before operating on xn we need to read x1, x2, ..., xn-1.

For example, the read rule corresponding to the read sequence <R(a), R(b), R(c), R(d), W(x)> is:

{R(a), R(b), R(c), R(d)} ⇒ W(x)

Definition 6 (Write Rules (WR)) Write rules are the association rules generated from write sequences whose confidence is greater than the user-defined threshold (Ψconf). A write rule is represented as

O(x) ⇒ {W(x1), W(x2), ...}

For every sequential pattern <O(x), W(x1), W(x2), ..., W(xk)> in the write sequence set, generate the write rule O(x) ⇒ {W(x1), W(x2), ..., W(xk)}. If the confidence of the rule is larger than the minimum confidence (Ψconf), it is added to the set of write rules, which states that after updating x, data items x1, x2, ..., xk must be updated by the same transaction.

For example, the write rule corresponding to the write sequence <W(x), W(y), W(z)> is W(x) ⇒ {W(y), W(z)}.

Definition 7 (Critical Data Elements (CDE)) These are semantically defined data elements crucial to the functioning of the system. They are data attributes of prime significance having a direct correlation with the integrity of the system. In a vertically hierarchical organisation, these are the attributes accessed only by the top level of management, and access by lower levels of the hierarchy is strictly protected.
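To make Definitions 3 and 5 concrete, the Python sketch below mines read rules from a toy transaction log; the helper name, the log and the simplified confidence computation are illustrative assumptions, not the exact mining procedure used by QPAFCS.

from collections import defaultdict

# Each transaction is a list of operations such as ("R", "a") or ("W", "x").
training_log = [
    [("R", "a"), ("R", "b"), ("R", "c"), ("R", "d"), ("W", "x")],
    [("R", "a"), ("R", "b"), ("R", "c"), ("R", "d"), ("W", "x")],
    [("R", "a"), ("R", "b"), ("W", "x")],
]

PSI_CONF = 0.6  # minimum confidence threshold (Ψconf)

def read_rules(log, min_conf=PSI_CONF):
    """Generate read rules {R(x1)..R(xn-1)} => O(xn) from read sequences.

    In this simplified sketch, confidence is the fraction of transactions
    containing the consequent operation that also contain exactly this set
    of reads before it.
    """
    rule_hits = defaultdict(int)        # (antecedent, consequent) -> support
    consequent_hits = defaultdict(int)  # consequent -> occurrence count

    for txn in log:
        for i, (op, item) in enumerate(txn):
            reads_before = tuple(sorted({x for o, x in txn[:i] if o == "R"}))
            if not reads_before:
                continue
            consequent_hits[(op, item)] += 1
            rule_hits[(reads_before, (op, item))] += 1

    rules = {}
    for (antecedent, consequent), support in rule_hits.items():
        conf = support / consequent_hits[consequent]
        if conf >= min_conf:
            rules[(antecedent, consequent)] = conf
    return rules

for (antecedent, (op, item)), conf in read_rules(training_log).items():
    lhs = ", ".join(f"R({x})" for x in antecedent)
    print(f"{{{lhs}}} => {op}({item})   confidence={conf:.2f}")

Write rules (Definition 6) could be mined analogously by collecting, for each operation, the writes that follow it within the same transaction.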
Type of Attribute              Sensitivity Level
Critical Data Elements         Highest
Directly Associated Elements   Medium
Normal Attributes              Low

Table 3.1 Types of attributes and their sensitivity levels

CDEs are the tokens of behaviour that our model uses for recognising malicious activity by users of the system.

Definition 8 (Critical Rules (CR)) The set of rules that contain a Critical Data Element in their antecedent or consequent:

CR = {ζ | (ζ ∈ RR ∨ ζ ∈ WR) ∧ (x ∈ CDE) ∧ (ζ is of the form {R(x1), R(x2), ...} ⇒ O(x) or O(x) ⇒ {W(x1), W(x2), ...})}

We propose a method of user access pattern recognition using the Critical Rules. CRs recognise the actions and goals of users from a series of observations of the users' actions and the environmental conditions, i.e. the user query pattern associated with the Critical Data Elements.

Definition 9 (Directly Associated Elements (DAE)) The attributes, other than those present in CDE, which are part of the antecedents or consequents of Critical Rules:

DAE = {μi | μi ∈ CR ∧ μi ∉ CDE}

The query patterns as perceived by our model QPAFCS are explored using DAEs, which represent the first level of access to the CDEs. A user's behaviour is represented by a set of first-order statements (derived from queries) called an attribute hierarchy, encoded in first-order logic, which defines abstraction, decomposition and functional relationships between types of access arrangements. The unit transactions accessing CDEs are decomposed into an attribute hierarchy comprising DAEs, which further represents the user's most sensitive retrieval pattern.

Example:
  R(b) → R(a)
  R(b), R(c) → R(a)
If a is a CDE, then the set {b, c} represents DAEs.

Definition 10 (Dubiety Score (φ)) A measure of the anomaly exhibited by a user in the past, based on his historic transactional data. This score summarizes the user's historic malicious access attempts and quantifies the personnel vulnerability that the organisation faces because of a particular user.

The Dubiety Score is indicative of the amount of deviation between the user's access pattern and his designated role. The Dubiety Score, combined with the deviation of the user's present query from his normal behaviour pattern, yields the output of the proposed IDS. In our work, 0 ≤ φ ≤ 1. The higher the Dubiety Score, the stronger the evidence that the user is not following the assigned role, i.e. the greater the malicious intent or rogue behaviour.

Definition 11 (Dubiety Table) A table maintaining the record of the dubiety score of each user. It contains two attributes: UserID and Dubiety Score. The initial dubiety scores are set to 1.

Uid     φ
1001    1
1002    1
1003    1
1004    1
1005    1

Table 3.2 Initial Dubiety Table

The dubiety table is updated each time a user performs a query. For example, let user 1001's deviation from his normal query pattern (ds) be quantified as 0.81, and let φi denote the initial dubiety score. The updated entry is √(ds ∗ φi) = 0.9, and the updated dubiety table is as shown.

Uid     √(ds ∗ φi)
1001    0.9
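A minimal sketch of the dubiety-table bookkeeping of Definitions 10 and 11 is given below in Python. The update rule √(ds ∗ φi) is taken from the worked examples in this paper; the function and variable names are illustrative only.

import math

# Dubiety table: every user starts with a score of 1 (Table 3.2).
dubiety = {1001: 1.0, 1002: 1.0, 1003: 1.0, 1004: 1.0, 1005: 1.0}

def update_dubiety(table, uid, ds):
    """Fold the deviation ds of the latest query into the user's dubiety score."""
    table[uid] = math.sqrt(ds * table[uid])
    return table[uid]

# User 1001's deviation from the normal query pattern is quantified as 0.81,
# so the updated entry becomes sqrt(0.81 * 1) = 0.9 (Table 3.3).
update_dubiety(dubiety, 1001, 0.81)
print(dubiety[1001])   # 0.9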
1002 1 transaction log and one data file that is exclusive to
the database for which it was created. Our initial
1003 1 input to the learning phase algorithm is the
1004 1 transaction log, with only authorised and consistent
transactions. This data is free of any unauthorised
1005 1 activity and is used to form user profiles, role
profiles etc based on normal user transactions. The
Table 3.3 Updated Dubiety Table
logs are scanned, and the following elements are
The Updated Dubeity table is hence stored in extracted:
memory for further processing.
a. SQL Queries
3.2 Learning Phase
b. The user executing a given query
We start our learning phase by reading the
SQL query parser: This is a tool that takes SQL
training dataset into the memory and extracting
queries as input, parses them and produces
useful patterns out of it. Our system requires non-
sequences (read and write) corresponding to the
malicious training dataset composed of
SQL query as output. The query parser also assigns
transactions executed by trusted users. The model
a unique Transaction ID. The final output consists of
aims at generating user-profiles from the two 3 columns: (TID), UID (User ID) and the read and
transaction-logs and quantifies deviation from
write sequence generated by the parsing algorithm.
normal behaviour i.e. this phase aims to recognise
and characterise the user activity pattern on the As an Example, if the following transaction
basis of their queries arrangement. The following performed by user U1001 is examined:
are various components of architecture of the
start transaction
proposed model:
select balance from Account where
Account_Number='9001';

commit; //if all SQL queries succeed

rollback; //if any of SQL queries failed or error

The parser generates a unique Transaction ID say


T1234 followed by parsing the transaction. The
parser finally yields:

< T1234,U1001,<R(Account_number),R(balance)>>
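The following Python fragment sketches the kind of output the SQL query parser component produces. It handles only the simple SELECT/UPDATE shapes used in the examples; the regex-based parsing, function names and caller-supplied transaction ID are assumptions made for illustration, not the parser actually used by the system.

import re

IDENT = r"[A-Za-z_]\w*"

def parse_statement(sql):
    """Map one simple SELECT/UPDATE statement to (operation, attribute) pairs."""
    sql = sql.strip().rstrip(";")
    sel = re.match(rf"select\s+(.+?)\s+from\s+{IDENT}\s*(?:where\s+({IDENT}))?", sql, re.I)
    if sel:
        ops = [("R", sel.group(2))] if sel.group(2) else []
        return ops + [("R", col.strip()) for col in sel.group(1).split(",")]
    upd = re.match(rf"update\s+{IDENT}\s+set\s+({IDENT})\s*=\s*(.+?)\s+where\s+({IDENT})", sql, re.I)
    if upd:
        target, expr, cond = upd.group(1), upd.group(2), upd.group(3)
        reads = [("R", cond)] + [("R", a) for a in re.findall(IDENT, expr)]
        return reads + [("W", target)]
    return []

def parse_transaction(tid, uid, statements):
    """Mimic the parser output <TID, UID, read/write sequence>."""
    seq = []
    for stmt in statements:
        seq += parse_statement(stmt)
    return (tid, uid, seq)

print(parse_transaction("T1234", "U1001",
    ["select balance from Account where Account_Number='9001'"]))
# ('T1234', 'U1001', [('R', 'Account_Number'), ('R', 'balance')])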

Frequent sequences generator: After the SQL


query parser generates the sequences, the
generated sequences are pre-processed. Then
weights are assigned to data items, for instance the
CDEs are given greater weight as compared to DAEs
Fig 3(a) Learning Phase Architecture
and other normal attributes. Then finally these pre-
COMPONENTS OF ARCHITECTURE: processed sequences are given as inputs to
frequent sequences generator. It uses the prefix
Training data: A transaction log is a sequential span algorithm to generate frequent sequences out
record of all changes made to the database while of input sequences corresponding to each UID.
the actual data is contained in a separate file. The
transaction log contains enough information to Rule generator: The frequent sequences are
undo all changes made to the data file as part of any given as inputs to the rule generator module which
individual transaction. The log records the start of a uses association rule mining to generate read rules
transaction, all the changes considered to be a part and write rules out of the frequent sequences.
of it, and then the final commit or rollback of the As an example, if the input frequent sequences are:
transaction. Each database has at least one physical
1. <R(m),R(n),R(o),W(a)>
Algorithm 1: DAE Generator
2. <R(m),R(n),W(o),W(a)>
Data: CDE, Set DAE = {}, RR = Set of Read
3. <R(m),W(n),W(o),W(a)>
4. <W(a),R(b),W(o)> Rules, WR = Set of Write Rules
5. <R(a),R(b),R(m),W(a)> Result: The set of Directly Associated
6. <R(a),R(b),W(m),W(b)> elements DAE
Function: DAE Generator (CDE, RR, WR)
S.No Frequent Sequences Associated
for Ω є RR ∪ WR do
. Rules
for α є Ω do
1 <R(m),R(n),R(o),W(a)> R(m),R(n),R(o) if α є CDE
→W(a) while β є Ω do
2 <R(m),R(n),W(o),W(a) R(m),R(n),W(o)
DAE {} ⃪ β
> →W(a) end
end
3 <R(m),W(n),W(o),W(a) R(m),W(n),W(o end
> ) →W(a) end
4 <W(a),R(b),W(o)> W(a),R(b)
User vector generator: Using the frequent
→W(o)
sequences for the given audit period, it generates
5 <R(a),R(b),R(m),W(a)> R(a),R(b),R(m) the user vectors. A user vector is of the form
→W(a) BID = < UID, w1, w2, w3, ... wn >
6 <R(a),R(b),W(m),W(b) R(a),R(b),W(m) where wi = |O(ai)|.
> →W(b)
|O(ai)| represents the total number of times user
Table 3.4 Rule Generator for given Example with the given Uid performs operation (O ∈ {R, W})
DAE generator: In our approach, we semantically on the aforesaid attribute ai in the pre-decided
define a class of data items known as Critical data audit period. An audit period τ refers to a period of
elements or CDEs. These CDEs and rules are given time such as one year, a time window τ = [t1, t2] or
as input to our DAE (Directly associated element) the recent 10 months. User vector is representative
generator which specifies all those elements as DAE of user’s activity.
which are present in either the antecedent or the Each of these wi would represent how frequently a
consequent of those rules that involve at least one user performs the operation on the particular data
of the CDEs. item. It also can be used in a normalized form, as is
used in our proposed model QPAFCS.

UVID = <UID, < p(a1), p(a2), p(a3), … p(an)>>

where,
p(ak) = wk / Σi wi

p(ak) is defined as the probability of accessing the


attribute ak.

Value of p(𝑎 ) close to 1 would mean that the user


accesses the given attribute frequently.
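As a small illustration of the user vector and its normalised form (Python; the attribute list and counts are made-up values):

# Raw counts w_i: how often user U1001 touched each attribute in the audit period.
raw_counts = {"Account_Number": 40, "balance": 35, "CVV": 5, "name": 20}

def user_vector(uid, counts):
    """Return <UID, w1..wn> and its normalised form <UID, p(a1)..p(an)>."""
    total = sum(counts.values())
    probs = {a: w / total for a, w in counts.items()}
    return (uid, counts), (uid, probs)

bid, uv = user_vector("U1001", raw_counts)
print(uv)
# ('U1001', {'Account_Number': 0.4, 'balance': 0.35, 'CVV': 0.05, 'name': 0.2})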

Cluster generator: This module takes user vectors and rules as input and generates fuzzy clusters. Users are clustered into different fuzzy clusters based on the similarity of their user vectors. A cluster profile is of the form

Ci = <CID, {R}>

where CID represents the cluster centroid and {R} is the set of rules formed by taking the union of all the rules that the members of the given fuzzy cluster abide by.

We have used Fuzzy c-means[26] clustering to create the clusters. Each user belongs to each cluster to a certain degree wij, where wij represents the membership coefficient of the ith user (ui) with the jth cluster:

wij = 1 / Σk=1..C ( ||ui − αj|| / ||ui − αk|| )^(2/(m−1))

The centre of a cluster (αj) is the mean of all points, weighted by their membership coefficients[28]. Mathematically,

αj = ( Σi wij^m · ui ) / ( Σi wij^m )

The objective function that is minimized to create the clusters is defined as:

arg min Σi=1..n Σj=1..C wij^m ||ui − αj||²

where n is the total number of users, C is the number of clusters, and m is the fuzzifier.

The dissimilarity/distance function used in the formation of the fuzzy clusters is the modified Jensen-Shannon distance, defined as follows. Given two user vectors[13]

UVx = <Ux, <px(a1), px(a2), px(a3), ..., px(an)>> and
UVy = <Uy, <py(a1), py(a2), py(a3), ..., py(an)>>

of equal length n, the modified Jensen-Shannon[27] distance is computed as

D(UVx || UVy) = (1/2) Σi=1..n [ (1 + px(ai)·w(ai)) · log( (1 + px(ai)·w(ai)) / (1 + py(ai)·w(ai)) )
                              + (1 + py(ai)·w(ai)) · log( (1 + py(ai)·w(ai)) / (1 + px(ai)·w(ai)) ) ]

where w(ai) is the semantic weight associated with the attribute ai.

User profile generator: This module takes user vectors and the cluster profiles as input and generates user profiles. A user profile is of the form

Ui = <UID, <p(a1), p(a2), p(a3), ..., p(an)>, <c1, c2, ..., cC>>

where UID is the unique ID given to each user, <p(a1), p(a2), p(a3), ..., p(an)> is a vector containing the probability of the user accessing each attribute, and <c1, c2, ..., cC> is a vector representing the membership coefficients of the given user for the C different clusters.

As an example, consider a system with 4 fuzzy clusters and 4 attributes; the table below illustrates the profile of user U1001.

Inputs:  membership coefficients <0.2, 0.2, 0.2, 0.4>, user vector <U1001, 0.2, 0.109, 0.9, 0.6>
Output:  user profile <U1001, <0.2, 0.1, 0.9, 0.6>, <0.2, 0.2, 0.2, 0.4>>

Table 3.5 User profile for the given example

3.3 Testing Phase

Section 3.2 described the learning phase, in which the system is trained using non-malicious (benign) transactions. The trained model can now be used to detect malicious transactions. In this phase, a test query is obtained as input and compared with the model's perception of the user's access pattern, and the model evaluates whether the test transaction is malicious. It is first checked whether the user is trying to access a CDE; if so, the transaction is allowed only if the given user has accessed that CDE before. Next, it is checked whether any DAE is being accessed; a user can perform a write operation on a DAE only if it has previously been written by the same user, otherwise the transaction is termed malicious. Finally, we check whether the transaction abides by the rules that are generally followed by similar users.
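The distance and membership computations used for clustering can be sketched as follows (Python). The semantic weights, user vectors and centroids are made-up values, and the formulas follow the reconstruction given in Section 3.2; this is an illustration, not the system's implementation.

import math

def modified_js_distance(px, py, weights):
    """Modified Jensen-Shannon distance between two normalised user vectors."""
    total = 0.0
    for p_x, p_y, w in zip(px, py, weights):
        a, b = 1 + p_x * w, 1 + p_y * w
        total += a * math.log(a / b) + b * math.log(b / a)
    return total / 2

def fcm_memberships(user, centroids, dist, m=2.0):
    """Fuzzy c-means membership coefficients of one user w.r.t. all centroids."""
    d = [max(dist(user, c), 1e-12) for c in centroids]
    return [1.0 / sum((d[j] / d[k]) ** (2.0 / (m - 1)) for k in range(len(d)))
            for j in range(len(d))]

weights = [0.9, 0.6, 0.3, 0.1]                    # semantic weights w(a_i)
ux = [0.2, 0.1, 0.9, 0.6]                         # p_x(a_i) for the test user
centroids = [[0.25, 0.15, 0.8, 0.5], [0.7, 0.6, 0.1, 0.05]]
dist = lambda a, b: modified_js_distance(a, b, weights)
print(fcm_memberships(ux, centroids, dist))       # higher membership in the closer cluster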

PHASES OF TESTING PHASE:

Rule generator: This module takes the sequence


as generated by the SQL query parser and gives the
rule that the input transaction follows. This can be
a read rule or a write rule and indicates the
operations done by the user, data attributes
accessed by the user and the order in which they
Algorithm 2: CDE Detector
are accessed. Now this rule can be checked for Data: Set of rules (ϒ) from test transaction, Set
maliciousness. χCDE, UID, User Profile(ϴ)
CDE Detector: The semantically critical elements
Result: Checks whether the test transaction is
referred to in our approach as CDEs are detected in malicious or normal with respect to CDE
this module. The read/ write rule corresponding to for Ѓє ϒ do
the incoming transaction is checked for the for ϱ є Ѓ do
presence of CDEs. If the rule being checked for
if ϱ є χCDE then
maliciousness contains a CDE, then it is dealt with
using the following policy:-
if w(ϱ) є Ѓ & ϴ[UID][w(ϱ)] == 0 then
Raise Alarm;
a. If read operation has been performed on any
end
CDE, i.e. r(CDE) is present in the rule and
UV[i][r(CDE)] = 0 and UV[i][w(CDE)] = 0 for
if r(ϱ) є Ѓ & ϴ[UID][r(ϱ)] == 0 &
the given user, then the transaction is termed ϴ[UID][w(ϱ)] == 0 then
as malicious. Raise Alarm;
b. If write operation has been performed on any end
CDE i.e. w(CDE) is encountered and
end
UV[i][w(CDE)] = 0 for the given user, then the
transaction is termed as malicious.
end
end
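In Python-like terms, the CDE policy of this module (and the DAE write check described next) might look as follows; the per-attribute profile lookup and the data structures are illustrative assumptions rather than the system's exact representation.

def violates_cde_policy(rule_ops, cde_set, profile):
    """Alarm if the transaction touches a CDE the user has never touched before.

    rule_ops: list of (op, attribute) pairs from the test transaction's rule.
    profile:  per-attribute counts of past reads/writes, e.g. profile[('W','CVV')].
    """
    for op, attr in rule_ops:
        if attr not in cde_set:
            continue
        if op == "W" and profile.get(("W", attr), 0) == 0:
            return True
        if op == "R" and profile.get(("R", attr), 0) == 0 \
                     and profile.get(("W", attr), 0) == 0:
            return True
    return False

def violates_dae_policy(rule_ops, dae_set, profile):
    """Alarm on a write to a DAE that the user has never written before."""
    return any(op == "W" and attr in dae_set and profile.get(("W", attr), 0) == 0
               for op, attr in rule_ops)

profile_u1001 = {("R", "CVV"): 3, ("R", "balance"): 120, ("W", "balance"): 40}
test_rule = [("R", "Account_Number"), ("R", "balance"), ("W", "CVV")]
print(violates_cde_policy(test_rule, {"CVV"}, profile_u1001))   # True: never wrote CVV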

DAE Detector: This module addresses the issue of


inference attacks on CDEs. As discussed earlier,
certain data elements can be used to access the
CDEs, i.e. first order inference. This module uses the
rules mined in the learning phase to determine
which elements can be used to directly infer the
DAEs.
Our system seeks to prevent inference attacks by matched with the corresponding rules in
especially monitoring the DAEs. We lay emphasis on the cluster of which a user is as part.
write operations on DAEs. If write operation has
been performed on any DAEs i.e. w(DAE) is Algorithm 4: Modified Jaccard Distance
present in the rule to be checked and Data: Rules R1, R2; 𝛿1, 𝛿2; Set χR1, χR2
UV[i][w(DAE)] = 0 for the given user, then the Result: Distance between the two rules (Ԏ)
transaction is termed as malicious.
Function jcDistance (R1, R2)
for Ω є R1 do
Dubiety Score Calculator and Analyser: If the χR1 ← Ω;
transaction has not been found malicious in the end
previous two modules, we check if the for Ω’ є R2 do
transaction is malicious based on the previous χR2 ← Ω’;
history of the user and the behaviour pattern of
end
all similar users (modified Jenson Shannon
distance). To do so, we maintain a record of Ԏ = 1 − (𝛿1·|χR1 ∩ χR2| + 𝛿2·(|χR1 ∩ χR2| − |χR1 ∪ χR2|)) / |χR1 ∪ χR2|;
action of all users by keeping the measure of
Dubiety Score(φi).
return Ԏ;
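The set-based similarity used here can be written out as below (Python). The exact arrangement of the intersection and union terms is partly lost in the typeset formula, so this sketch assumes the form 1 − (δ1·|R1∩R2| + δ2·(|R1∩R2| − |R1∪R2|)) / |R1∪R2|, which reproduces the worked example in Section 5 (δ1 = 0.70, δ2 = 0.20, |R1∩R2| = 2, |R1∪R2| = 4, giving 0.75).

def modified_jaccard_distance(r1, r2, delta1=0.70, delta2=0.20):
    """Distance between two rules, each given as a set of O(attribute) tokens."""
    inter = len(r1 & r2)
    union = len(r1 | r2)
    if union == 0:
        return 0.0
    return 1 - (delta1 * inter + delta2 * (inter - union)) / union

# R1: R(c), R(b) -> R(a)    R2: R(d), R(b) -> R(a)
r1 = {"R(c)", "R(b)", "R(a)"}
r2 = {"R(d)", "R(b)", "R(a)"}
print(modified_jaccard_distance(r1, r2))   # 0.75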

The deviation of a user’s new transaction with his


normal access pattern is referred to as Dubiety, and  In order to quantitatively measure the
the relative measure of Dubiety is the Dubiety similarity between two rules, we use
Score. Our IDS keeps a log of the DS (Dubiety Score) modified Jaccard distance[30]:
in a separate table. A user who is a potential threat JD = 1 − (𝛿1·|R1 ∩ R2| + 𝛿2·(|R1 ∩ R2| − |R1 ∪ R2|)) / |R1 ∪ R2|
tends to have a high dubiety score. Another
intuition that our system follows is that any
transaction that a user makes matches significantly
either with the transactions the same user or
similar users have made in the past.

We use a measure ds to keep a track of the


maximum similarity of the given rule. We combine
Algorithm 3: DAE Detector
ds with φ i to get the final measure of dubiety score
φ f for the given user. We define 2 thresholds ФLT Data: Set of rules (ϒ) from test transaction, Set
and ФUT. ФUT represents the upper limit for the χDAE, UID, User Profile(ϴ)
dubiety score of a non-malicious user whereas ФLT Result: Checks whether the test transaction is
denotes the lower limit. This means that if φf for a malicious or normal with respect to DAE
user comes out to be greater than ФUT, the user is
for Ѓє ϒ do
malicious. On the other hand, φf value less than ФLT
denotes a benign user. for ϱ є Ѓ do
if ϱ є χDAE then
 If the incoming rule (R1) is a write rule, then
if w(ϱ) є Ѓ & ϴ[UID][w(ϱ)] == 0 then
the consequent of the incoming rule is
matched with the corresponding rules in Raise Alarm;
the cluster of which a user is as part. A user end
is said to be the part of the ith cluster iff: end
μi > 𝛿. end
Where,
end
μi is the fuzzy membership coefficient of the
given user for the ith cluster.  The minimum value of JD is regarded as ds.
𝛿 is a user defined threshold. φi is fetched directly from dubiety table.
 If the incoming rule (R1) is a read rule, then Final dubiety score for the given user is
the antecedent of the incoming rule is calculated as:
φf = √(ds ∗ фi) 1003 0.2
 If φf < ФLT, the transaction is termed as non-
1004 0.6
malicious. In this case, the current dubiety
score in the dubiety table for the given user 1005 0.46
is reduced by a factor known as
“amelioration factor(Å)”. Table 3.8 calculated dubiety scores table
Thus, φi is updated as
φi = Å φi
 If ФUT > φf ≥ ФLT, the transaction is termed
as non-malicious and the dubiety table
entry for the given user is updated with φf.
 If φf ≥ ФUT the transaction is termed as
malicious.
 As an Example, Let the initial dubiety table
be:

Uid φ

1001 0.9 Taking ФLT=0.3 and ФUT=0.6


1002 0.8

1003 0.2 Uid φf Nature of Updated


Transaction
1004 0.6 φf

1005 0.7 1001 0.42 Non- 0.42


malicious
Table 3.6 Initial dubiety table
1002 0.49 Non- 0.49
Let the minimum value of ds corresponding to
malicious
each user be:
1003 0.2 Non- 0.198
malicious
Uid ds
1004 0.6 Malicious 0.6
1001 0.2
1005 0.46 Non- 0.46
1002 0.3 malicious

1003 0.2 Table 3.9 Summary of transactions of various users

1004 0.6 The Malicious Transactions are blocked in a


straightforward fashion and the Non Malicious
1005 0.3 transactions are processed. Updated Dubiety Table
is stored in database.
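Putting the pieces of the dubiety analysis together, a compact sketch of the decision step is shown below (Python). The thresholds follow the example values used above, and the amelioration factor Å = 0.99 is inferred from the 0.2 → 0.198 update in the worked example; all names are illustrative.

import math

PHI_LT, PHI_UT = 0.3, 0.6      # lower and upper dubiety thresholds
AMELIORATION = 0.99            # Å: illustrative value, consistent with 0.2 -> 0.198

def classify_transaction(uid, ds, dubiety_table):
    """Return 'malicious' or 'non-malicious' and update the dubiety table."""
    phi_i = dubiety_table[uid]
    phi_f = math.sqrt(ds * phi_i)
    if phi_f >= PHI_UT:
        verdict = "malicious"               # block the transaction
    else:
        verdict = "non-malicious"
        dubiety_table[uid] = AMELIORATION * phi_i if phi_f < PHI_LT else phi_f
    return verdict, phi_f

table = {1001: 0.9, 1002: 0.8, 1003: 0.2, 1004: 0.6, 1005: 0.7}
ds_values = {1001: 0.2, 1002: 0.3, 1003: 0.2, 1004: 0.6, 1005: 0.3}
for uid, ds in ds_values.items():
    print(uid, classify_transaction(uid, ds, table))
# 1004 comes out malicious (phi_f = 0.6); the others are non-malicious,
# matching the verdicts in Table 3.9.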
Table 3.7 Minimum ds values for various Users

The calculated dubiety score table:


4. Discussion

With regard to a typical credit card company


Uid φf = √(ds ∗ фi)
dataset, some examples of critical data elements
1001 0.42 (CDEs) are: -

1002 0.49 1. CVV (denoted by a)


Card verification value (CVV) is a combination of  R(b) → R(a)
features used in credit, debit and automated teller  R(b), R(c) → Ra)
machine (ATM) cards for the purpose of
establishing the owner's identity and minimizing 5. Example to our Approach
the risk of fraud. The CVV is also known as the card 1. JC Distance
verification code (CVC) or card security code (CSC).
R1: R(c), R(b) → R(a)
When properly used, the CVV is highly effective
against some forms of fraud. For example, if the R2: R(d), R(b) → R(a)
data in the magnetic stripe is changed, a stripe
The modified JC Distance between R1 & R2 where
reader will indicate a "damaged card" error. The
the hyperparameters are 𝛿1 = 0.70 and 𝛿2 = 0.20,
flat-printed CVV is (or should be) routinely
is calculated as
required for telephone or Internet-based
purchases because it implies that the person JC Distance = 1 − (𝛿1·|R1 ∩ R2| + 𝛿2·(|R1 ∩ R2| − |R1 ∪ R2|)) / |R1 ∪ R2|
placing the order has physical possession of the
card. Some merchants check the flat-printed CVV
even when transactions are conducted in person. |R1 ∩ R2| = 2
CVV technology cannot protect against all forms of |R1 ∪ R2| = 4
fraud. If a card is stolen or the legitimate user is
tricked into divulging vital account information to JC Distance = 0.75
a fraudulent merchant, unauthorized charges
2. User Profile Vector
against the account can result. A common method
B1 = <U1, <0.7, 0.1, 0.6, 0.2, 0.4, 0.0, 0.2, 0.0>,
of stealing credit card data is phishing, in which a
<0.2, 0.3, 0.1, 0.2, 0.167, 0.033> >
criminal sends out legitimate-looking email in an
Here the values in the second tuple <0.7, …0.0>
attempt to gather personal and financial
represent the probability of User U1 accessing
information from recipients. Once the criminal has
particular attributes, for instance 0.7 denotes
possession of the CVV in addition to personal data
that there is a 70% probability that U1 accesses
from a victim, widespread fraud against that
the first attribute.
victim, including identity theft, can occur.
The values in the third tuple represent the
The following are directly associated membership of user U1 in the various(k) fuzzy
elements (DAEs) to CVV:- clusters, which is 6 in our case.

a. Credit card number (denoted by b) 3. Dubiety Score


b. Name of card holder (denoted by c) Suppose the Dubiety Score φi for User U1 is 0.8.
c. Card expiry date (denoted by d) The JC Distance of the test transaction with its
Credit Card Number, Name of card holder, Card cluster is 0.6. Then,
expiry date are elements that are read before CVV φf = √(ds ∗ φi)
and hence used to validate the CVV entered by the φf = √(0.6 ∗ 0.8) = 0.69
user. Hence the above-mentioned attributes have
been classified as DAEs, by our system. Setting our hyperparameter ФUT as 0.65. We
observe that φf > ФUT. Hence the test transaction is
Some normal data attributes are: - malicious, and an alarm is raised.
1. Gender of Customer (denoted by e)
2. Credit Limit (denoted by f)
3. Customer’s phone number (denoted by g)

These are the attributes that have been collected


for the fraud detection and are not directly used to
access the CDE but are crucial for the process.

Some examples of transactions for our proposed


approach:
6. Experimentation The details of CDEs, DAEs and Normal data items
has already been given in Section 3 and examples
In this section, we describe the method of
have been discussed in Section 5.
evaluation of the proposed algorithm. Firstly, we
describe our dataset. We then calculate various The access pattern data hereby shows that CDEs are
accuracy measures considering different rarely accessed, that too only by a few user roles
parameters as reference. and hence, protection of CDEs from malicious
access is of a greater significance as compared to
6.1 Description of dataset
DAEs and Normal data elements.
This paper is about anomaly detection of user
behaviours. An ideal dataset should be obtained
from a practical system with concrete job functions. 6.2 Cluster Analysis
But in fact, it is very sensitive for almost every
organization or company. When the number of users/user roles exceeds a
given limit, it becomes exceedingly difficult for the
The performance of the algorithm was IDS to keep track of individual user access patterns
analyzed by carrying out several experiments on a and hence detect anomaly. This is the reason that
credit card company dataset adhering to the TPC-C
clustering is a better and computationally efficient
benchmark[18]. The TPC-C schema is composed of solution for better performance of IDS. We prefer
a mixture of read only and read/write transactions Fuzzy clustering over hard clustering. Fuzzy
that replicate the activities found in complex OLTP clustering (also referred to as soft clustering) is a
application environment. The database schema,
form of clustering in which each data point can
data population, transactions, and implementation belong to more than one cluster. In non-fuzzy
rules were designed to broadly represent modern clustering (also known as hard clustering), data is
OLTP systems. We used two audit logs: one for divided into distinct clusters, where each data point
training the model and the second for testing it. The can only belong to exactly one cluster. In fuzzy
training log comprised of normal user transactions
clustering, data points can potentially belong to
and testing log consisted of a mixture of normal as multiple clusters. Membership grades are assigned
well as malicious user transactions. Although there to each of the data points(tags). These membership
are unusual records in real dataset, we also inject grades indicate the degree to which data points
some anomalies for detection. The injected belong to each cluster. Thus, points on the edge of
anomalies are set differently with the normal a cluster, with lower membership grades, may be in
behaviour pattern from several aspects. In totality, the cluster to a lesser degree than points in the
about 20,000 transactions were used. In total, center of cluster. When we evaluate various
about 99% of data was non-malicious while less performance measures keeping the number of
than 1% of data was malicious. Fig. 6(a) shows the clusters as a reference parameter, it is observed
distribution of malicious and benign data in the that a particular count for clusters is the most
dataset used: efficient in predicting results.

Fig 6(b) Variation of performance with number of


Fig 6(a) Frequency of data items and their access
clusters
frequency
Fig 6(b) depicts variation in precision, recall, TNR, increase in value of 𝛿1, while the value of Recall
accuracy with change in number of clusters. From decreases with increase in value of 𝛿1.
the graph, we can see that :-
Fig 6(e) shows the variation of Precision, recall, TNR,
 TNR does not vary with the number of clusters, accuracy with 𝛿2. It can be observed from the graph
i.e. TNR is invariant. that the value of Precision, TNR and Accuracy starts
 The precision is always greater than 0.94 and is decreasing when the value of 𝛿2 increases beyond
more or less constant. a certain value. Recall, on the other hand, increases
 Recall reaches optimum value when number of for higher values of 𝛿2.
Fuzzy Clusters is greater than 3.
Fig 6(d) shows the variation of Precision, recall,
 Accuracy also reaches the optimum value
TNR, accuracy with фUT. It can be observed from the
when number of clusters is greater than 3.
graph that the value of Precision first decreases and
6.3 Distances and thresholds then exponentially increases with the increase in
value of фUT. An identical trend is followed by
In section 3.2, we have described Modified Jensen- Accuracy. Somewhat similar trend is followed by
Shannon distance as a measure to calculate TNR except that it does not decrease initially. On
distance between two user vectors of same length. the contrary, the value of Recall decreases with the
In probability theory and statistics, the Jensen– increase in value of фUT.
Shannon divergence is a method of measuring the
similarity between two probability distributions. It Fig 6(f) shows the variation of Precision, recall, TNR,
is also known as information radius (IRad) or total accuracy with фLT. It can be observed from the
divergence to the average. It is based on the graph that the values of all the parameters
Kullback–Leibler divergence, with some notable fluctuate a little but remain more or less constant
(and useful) differences, including that it is with the increase in value of фLT.
symmetric and it is always a finite value. The square
With regards to the dataset we have used, following
root of the Jensen–Shannon divergence is a metric
inferences can be done from the graphs:
often referred to as Jensen-Shannon distance. We
preferred to use modified Jenson-Shannon distance 1. Value of 𝛿1 should be close to 0.65 for
to give weights to data attributes and avoid curse of optimum performance.
dimensionality. The variation of modified Jenson- 2. Value of 𝛿2 should be close to 0.55 for
Shannon distance with Euclidean distance is shown optimum performance.
in the fig 6(g). 3. Value of фUT should be close to 0.59 for
optimum performance.
In section 3.3, we have defined modified Jaccard
4. Value of фLT should be close to 0.2 for optimum
distance to quantitatively measure the similarity
performance.
between two rules. The Jaccard index, also known
as Intersection over Union of the Jaccard similarity
coefficient, is a statistical measure used for
comparing the similarity and diversity of sample 6.4 Comparison with related methods
sets. The Jaccard coefficient measures similarity Table 1 shows the performance measures used for
between finite sample sets, and is defined as the comparison of approaches. Using these
size of the intersection divided by the size of the performance measures, we will compare our
union of the sample sets. The variation of modified approaches with other related works. Our various
Jaccard index with Jaccard index is shown in fig 6(h). approaches are:-

The variation of precision, recall, TNR, accuracy Approach 1. Our approach using modified Jenson-
with the various thresholds, namely 𝛿1, 𝛿2, фUT , фLT Shanon distance and modified Jaccard index.
that were defined in section 3 is shown in the
Approach 2. Using unmodified Jaccard index with
following figures:
Jenson-Shanon distance.
Fig 6(c) shows the variation of Precision, recall, TNR,
Approach 3. Using Euclidean distance with
accuracy with 𝛿1. It can be observed from the graph
unmodified Jaccard index.
that Precision, TNR and Accuracy increase with the
S.No.  Performance Measure   Formula
1      TNR                   TN / (TN + FP)
2      Precision             TP / (TP + FP)
3      Accuracy              (TP + TN) / (TN + FP + TP + FN)
4      F1 Score              2 · Precision · Recall / (Precision + Recall)
5      PPV                   TP / (TP + FP)
6      ACC                   (TP + TN) / (TP + TN + FP + FN)
7      NPV                   TN / (TN + FN)
8      FDR                   FP / (FP + TP)
9      FOR                   FN / (TN + FN)
10     BM                    TPR + TNR − 1
11     FPR                   FP / (FP + TN)
12     FNR                   FN / (FN + TP)
13     MK                    PPV + NPV − 1
14     MCC                   (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

Table 1 (Performance Measures)
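For completeness, the measures in Table 1 can be computed from the confusion-matrix counts as in the short Python helper below (a direct transcription of the formulas; the example counts are arbitrary):

import math

def metrics(tp, tn, fp, fn):
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return {
        "TNR": tnr,
        "Precision/PPV": ppv,
        "Accuracy/ACC": (tp + tn) / (tp + tn + fp + fn),
        "F1": 2 * ppv * tpr / (ppv + tpr),
        "NPV": npv,
        "FDR": fp / (fp + tp),
        "FOR": fn / (tn + fn),
        "BM": tpr + tnr - 1,
        "FPR": fp / (fp + tn),
        "FNR": fn / (fn + tp),
        "MK": ppv + npv - 1,
        "MCC": (tp * tn - fp * fn) /
               math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

print(metrics(tp=81, tn=96, fp=4, fn=19))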
In table 2 we have compared the three approaches If we compare Approach 1 with Approach 3, we
with each other. observe that:-
Sensitivity Approach 1 Approach 2 Approach 3  TNR and precision of Approach 1 is a lot better
Measures
than the TNR and precision for Approach 3
PPV 0.96 0.73 0.74
 .It has also got better accuracy as compared to
TPR 0.81 0.95 1.00
Approach 3.
ACC 0.89 0.80 0.83
 Approach 1 also has a much lower FPR and FDR
F1 Score 0.88 0.83 0.85 score as compared to Approach 3.
NPV 0.83 0.93 1.00  Amongst other performance measures, MK
FDR 0.04 0.27 0.26 and MCC values of Approach 1 are also slightly
FOR 0.17 0.07 0.00 better than that of Approach 3.
BM 0.77 0.60 0.65  Approach 3, on the other hand has got better
FPR 0.03 0.34 0.34 TPR, NPV and FOR measures as compared to
TNR 0.96 0.65 0.65 Approach 1. In fact, it has the best values for
FNR 0.19 0.05 0.00 these parameters in the entire table.
MK 0.79 0.66 0.74  Also, both Approach 1 and Approach 3 have got
MCC 0.78 0.63 0.70 somewhat similar F1 score.

Table 2 (Comparison of our approaches) In the measures like TNR and precision,where
Approach 1 has one of the best score in the entire
From the table, following observations can be
table, Approach 3 performs rather poorly. Also,
made:-
Approach 3 lags far behind in measures like FPR and
If we compare Approach 1 with Approach 2, we can FDR score. On the other hand, in the measures in
observe that:- which Approach 3 performs better than Approach
1, Approach 1 is also performing quite nicely. For
 TNR and FPR of Approach 1 is a lot better than example, in case of NPV, both the approaches have
the TNR and precision for Approach 2. good scores, with Approach 3 performing better.
 Approach 1 has also got better accuracy as similar trends are observed in case of all other
compared to Approach 2. measures except FNR, where Approach 3 has is far
 Approach 1 has a much lower FPR and FDR superior. Considering all the above scenario, we can
score as compared to Approach 2. say that the overall even though Approach 3 has the
 Amongst other performance measures, MK best values for some performance measures, its
and MCC values of approach 1 are also better poor performance in other measures are clearly a
than that of Approach 2. disadvantage due to which Approach 1 is better
 Approach 2, on the other hand has got better than Approach 3.
TPR, NPV and FOR measures as compared to
Approach 1.
 Both Approach 1 and Approach 2 have got Table 3 shows a comparison of our approaches with
somewhat similar F1 score. various other related works. If we compare our
approach with other related approaches, we
In the measures like FPR and TNR where Approach
observe that:-
1 has good performance, Approach 2 performs
rather poorly. However, in measures like TPR and  In comparison to HU Panda, our approach
NPV, where Approach 2 performs better, Approach works better with respect to all the
1 also has good performance. For example, both performance measures considered for the
Approach 1 and Approach 2 have similar NPV scores purpose of comparison.
with Approach 2 performing slightly better. As
Approach 1 performs far better than Approach 2 in approach performs better with respect to all
most of the measures, we can conclude that the the performance measures that are considered
overall performance of Approach 1 is better than for comparison.
Approach 2.
Sensitivity  Approach  Approach  Approach  HU Panda  Hashemi  Mostafa  Mina Sohrabi  Majumdar       Elisa Bertino  UP Rao
Measure      1         2         3         et al.    et al.   et al.   et al.        et al. (2006)  et al.         et al. (2016)
PPV          0.96      0.73      0.74      0.88      0.97     0.94     0.93          0.88           0.94           0.61
TPR          0.81      0.95      1.00      0.73      0.71     0.75     0.66          0.70           0.91           0.70
ACC          0.89      0.80      0.83      0.81      0.84     0.85     0.80          0.80           0.93           0.64
F1 Score     0.88      0.83      0.85      0.79      0.82     0.83     0.77          0.78           0.92           0.65
NPV          0.83      0.93      1.00      0.77      0.77     0.79     0.73          0.75           0.91           0.68
FDR          0.04      0.27      0.26      0.12      0.03     0.06     0.07          0.13           0.06           0.39
FOR          0.17      0.07      0.00      0.23      0.23     0.21     0.27          0.25           0.09           0.32
BM           0.77      0.60      0.65      0.63      0.69     0.70     0.60          0.60           0.85           0.35
FPR          0.03      0.34      0.34      0.10      0.02     0.05     0.05          0.10           0.06           0.45
TNR          0.96      0.65      0.65      0.90      0.98     0.95     0.94          0.90           0.94           0.65
FNR          0.19      0.05      0.00      0.28      0.29     0.25     0.35          0.30           0.09           0.30
MK           0.79      0.66      0.74      0.65      0.74     0.73     0.66          0.63           0.85           0.29
MCC          0.78      0.63      0.70      0.63      0.72     0.71     0.63          0.61           0.85           0.29

Table 3 (Comparison of our approaches with related works)

• In comparison to the work of Hashemi et al., even though our approach scores slightly lower in measures like TNR and precision, it scores a lot better with respect to the rest of the performance measures.
• If we consider the work of Mina Sohrabi et al., our approach performs better with respect to all the performance measures that are present in the table.
• In comparison to the work of Majumdar et al., our approach performs better with respect to all the performance measures that we have considered for the purpose of comparison.
• In comparison with the work of UP Rao et al., our approach performs better with respect to all the measures that are considered in the table for comparison.
• In comparison to the work of Elisa Bertino et al., our approach gives better TNR and precision scores, as well as comparatively better FDR and FPR scores. In the other measures, except TPR (recall), both approaches have somewhat similar scores. Since our work is mostly concerned with finding Critical Data Items in a dataset, higher TNR and precision scores are more desirable than the other performance measures, and since our approach also performs quite well with respect to those other measures, the better TNR and precision scores easily make up for the lower recall values.

7. Analysis and Conclusion

In this paper we have tried to detect malicious transactions while keeping in mind that certain data elements hold more critical information. We also take the user's behaviour pattern into consideration: a user who regularly behaves as a normal user gradually improves his suspicion score. We then analyse the approach with respect to different parameters by conducting experiments. Finally, we conclude that the approach works efficiently in determining the nature of a transaction.
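The conclusion above refers to a per-user suspicion score that gradually improves while the user keeps behaving normally and worsens when anomalous transactions are observed. The toy update rule below is only meant to make that idea concrete: the constants, names, and the formula itself are illustrative assumptions, not the actual update rule used by QPAFCS.

```python
# Illustrative only: a simple decaying suspicion score for one user.
DECAY = 0.9      # hypothetical fraction of the score kept after a benign transaction
PENALTY = 1.0    # hypothetical amount added when a transaction is flagged as anomalous
THRESHOLD = 2.5  # hypothetical score above which the user's transactions are probed

def update_suspicion(score, flagged):
    """Return the user's new suspicion score after one transaction."""
    return score * DECAY + (PENALTY if flagged else 0.0)

score = 0.0
for flagged in [False, True, True, False, False, False]:
    score = update_suspicion(score, flagged)
    print(f"flagged={flagged!s:5}  suspicion={score:.2f}  "
          f"{'ALERT' if score > THRESHOLD else 'ok'}")
```

Under such a rule, a long run of benign transactions drives the score back towards zero, which is one way to read the "gradually improving" behaviour described above.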
References
1. I-Yuan Lin, Xin-Mao Huang, Ming-Syan Chen, "Capturing user access patterns in the Web for data mining", in Proceedings of the 11th IEEE International Conference on Tools with Artificial Intelligence, 9-11 Nov. 1999.
2. R. S. Sandhu, P. Samarati, "Access control: principles and practice", IEEE Communications Magazine, Vol. 32, Issue 9, Sept. 1994.
3. Denning, D.E. (1987) An Intrusion Detection Model. IEEE Transactions on Software
Engineering, Vol. SE-13, 222-232.
4. Knuth, Donald E., James H. Morris, Jr, and Vaughan R. Pratt. "Fast pattern
matching in strings." SIAM journal on computing 6.2 (1977): 323-350.
5. Wang, Ke. "Anomalous Payload-Based Network Intrusion Detection" . Recent
Advances in Intrusion Detection. Springer Berlin. doi:10.1007/978-3-540-30143-1_11
6. Douligeris, Christos; Serpanos, Dimitrios N. (2007-02-09). Network Security:
Current Status and Future Directions. John Wiley & Sons. ISBN 9780470099735.
7. Christina Yip Chung, Michael Gertz and Karl Levitt (2000), “DEMIDS: a misuse
detection system for database systems”, Integrity and internal control information
systems: strategic views on the need for control, Kluwer Academic Publishers,
Norwell, MA.
8. A. S. McGough, D. Wall, J. Brennan, G. Theodoropoulos, E. Ruck-Keene, B. Arief, et
al., "Insider Threats: Identifying Anomalous Human Behaviour in Heterogeneous
Systems Using Beneficial Intelligent Software (Ben-ware)," presented at the
Proceedings of the 7th ACM CCS International Workshop on Managing Insider
Security Threats, Denver, Colorado, USA, 2015.
9. S. D. Bhattacharjee, J. Yuan, Z. Jiaqi, and Y.-P. Tan, "Context-aware graph-based
analysis for detecting anomalous activities," presented at the Multimedia and Expo
(ICME), 2017 IEEE International Conference on, 2017.
10. P. A. Legg, O. Buckley, M. Goldsmith, and S. Creese, "Automated insider threat
detection system using user and role-based profile assessment," IEEE Systems
Journal, vol. 11, pp. 503-512, 2015.
11. I. Agrafiotis, A. Erola, J. Happa, M. Goldsmith, and S. Creese, "Validating an
Insider Threat Detection System: A Real Scenario Perspective," presented at the
2016 IEEE Security and Privacy Workshops (SPW), 2016.
12. T. Rashid, I. Agrafiotis, and J. R. C. Nurse, "A New Take on Detecting Insider
Threats: Exploring the Use of Hidden Markov Models," presented at the Proceedings
of the 8th ACM CCS International Workshop on Managing Insider Security Threats,
Vienna, Austria, 2016.
13. Zamanian Z., Feizollah A., Anuar N.B., Kiah L.B.M., Srikanth K., Kumar S. (2019)
User Profiling in Anomaly Detection of Authorization Logs. In: Alfred R., Lim Y.,
Ibrahim A., Anthony P. (eds) Computational Science and Technology. Lecture Notes
in Electrical Engineering, vol 481. Springer, Singapore
14. Yuqing Sun, Haoran Xu, Elisa Bertino, and Chao Sun. 2016. A Data-Driven
Evaluation for Insider Threats. Data Science and Engineering Vol. 1, 2 (2016), 73--85.
[doi>10.1007/s41019-016-0009-x]
15. S. Panigrahi, S. Sural and A. K. Majumdar, "Detection of intrusive activity in
databases by combining multiple evidences and belief update," 2009 IEEE
Symposium on Computational Intelligence in Cyber Security, Nashville, TN, 2009, pp.
83-90. doi: 10.1109/CICYBS.2009.4925094
16. Yi Hu, Brajendra Panda, "A data mining approach for database intrusion detection", SAC '04: Proceedings of the 2004 ACM Symposium on Applied Computing, pp. 711-716. doi:10.1145/967900.968048
17. Abhinav Srivastava, Shamik Sural, A. K. Majumdar, "Weighted intra-transactional rule mining for database intrusion detection", Proceedings of the 10th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, April 09-12, 2006, Singapore. doi:10.1007/11731139_71
18. TPC-C benchmark: http://www.tpc.org/tpcc/default.asp
19. Mina Sohrabi, M. M. Javidi, S. Hashemi, "Detecting intrusion transactions in database systems: a novel approach", Journal of Intelligent Information Systems 42:619-644, Springer, 2014.
20. U. P. Rao et al., "Weighted Role Based Data Dependency Approach for Intrusion Detection in Database", International Journal of Network Security, Vol. 19, No. 3, pp. 358-370, May 2017. doi:10.6633/IJNS.201703.19(3).05
21. R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases", in Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, 1993.
22. Sattar Hashemi, Ying Yang, Davoud Zabihzadeh and Mohammadreza Kangavari, "Detecting intrusion transactions in databases using data item dependencies and anomaly analysis", Expert Systems 25(5):460-473, November 2008. doi:10.1111/j.1468-0394.2008.00467
23. Mostafa Doroudian, Hamid Reza Shahriari, "A Hybrid Approach for Database Intrusion Detection at Transaction and Inter-transaction Levels", 6th Conference on Information and Knowledge Technology (IKT 2014), May 28-30, 2014, Shahrood University of Technology, Tehran, Iran.
24. E. Bertino, A. Kamra, E. Terzi and A. Vakali (2005), "Intrusion detection in RBAC-administered databases", in Proceedings of the Annual Computer Security Applications Conference (ACSAC).
25. Lee, V. C. S., Stankovic, J. A., Son, S. H., "Intrusion Detection in Real-time Database Systems Via Time Signatures", in Proceedings of the Sixth IEEE Real Time Technology and Applications Symposium, 2000.
26. Weina Wang, Yunjie Zhang, Yi Li and Xiaona Zhang (2006), "The Global Fuzzy C-
Means Clustering Algorithm," 2006 6th World Congress on Intelligent Control and
Automation, Dalian, 2006, pp. 3604- 3607.
27. Fuglede, Bent; Topsøe, Flemming (2004). "Jensen-Shannon divergence and
Hilbert space embedding - IEEE Conference Publication". ieeexplore.ieee.org.
28. Dunn, J. C. (1973-01-01). "A Fuzzy Relative of the ISODATA Process and Its Use in
Detecting Compact Well-Separated Clusters". Journal of Cybernetics. 3 (3): 32–57.
doi:10.1080/01969727308546046. ISSN 0022-0280.
29. A. Mangalampalli and V. Pudi (2009), "Fuzzy association rule mining algorithm for
fast and efficient performance on very large datasets," 2009 IEEE International
Conference on Fuzzy Systems, Jeju Island, 2009, pp. 1163-1168
30. Vorontsov, I. E., Kulakovskiy, I. V. & Makeev, V. J., "Jaccard index based similarity measure to compare transcription factor binding site models", Algorithms for Molecular Biology (2013) 8:23. https://doi.org/10.1186/1748-7188-8-23