
Query Pattern Access and Fuzzy Clustering Based Database Intrusion Detection System


Indu Singh, Shivam Gupta, Shivam Maini, Shubham, Simran
Department of Computer Science and Engineering
Delhi Technological University
Delhi, India
indu.singh.dtu14@gmail.com, shivam_bt2k16@dtu.ac.in, shivammaini_bt2k16@dtu.ac.in, shubham_bt2k16@dtu.ac.in, simran_bt2k16@dtu.ac.in

Abstract- Hackers and malicious insiders perpetually try to steal, manipulate and corrupt sensitive data elements, and an organization's database servers are often the primary targets of these attacks. In the broadest sense, misuse (witting or unwitting) by authorized database users, database administrators, or network/systems managers constitutes the potential insider threats that our project intends to address. Insider threats are more menacing because, in contrast to outsiders (hackers or unauthorised users), insiders have authorised access to the database and have knowledge of its critical nuances. Database security involves using a multitude of information security controls to protect databases against breaches of confidentiality, integrity and availability (CIA). It involves a plethora of controls: technical, procedural/administrative and physical. We hence propose an Intrusion Detection System (IDS), QPAFCS (Query Pattern Access and Fuzzy Clustering System), that monitors a database management system and prevents inference attacks on sensitive attributes by auditing user access patterns.

Keywords: Database Intrusion Detection, Fuzzy Clustering, User Access Pattern, Insider Attacks, Dubiety Score

I. INTRODUCTION

Data protection from insider threats is essential to most organizations. Attacks from insiders can be more damaging than those from outsiders, since in most cases insiders have full or partial access to the data; therefore, traditional mechanisms for data protection, such as authentication and access control, cannot by themselves protect against insiders. Since recent work has shown that insider attacks are accompanied by changes in the access patterns of users, user access pattern mining [1] is a suitable approach for the detection of these attacks. It creates profiles of the normal access patterns of users from past logs of user accesses. New accesses are later checked against these profiles, and mismatches indicate potential attacks.

Access control [2] is a security technique that regulates who can view or use resources in a computing environment. Diverse access control systems perform authorization, identification, authentication and access approval.

Intrusion Detection Systems [3] scrutinise and unearth surreptitious activities perpetrated by malevolent users. An IDS works by looking either for signatures of known attacks or for deviations from normal activity. Normally, an IDS undergoes a training phase with intrusion-free data, during which it maintains a log of benign transactions. Pattern matching [4] is then used to detect whether or not an action is malign; this is called anomaly-based detection [5]. When attacks are detected using their known "signatures" from previous knowledge of the attack, it is called signature-based detection [6]. Once detected, malicious actions are either blocked or probed, depending upon the organisation's policy. However, an IDS needs to be dynamic, robust and quick. Different IDS architectures function differently and have different measures of performance, and every organisation needs to make sure that the IDS it uses satisfies its requisites.

Several anomaly detection techniques have been proposed to detect anomalous data accesses. Some rely on the analysis of input queries based on their syntax. Although these approaches are computationally efficient, they are unable to detect anomalies in scenarios like the following one. Consider a clerk in an organization who issues queries to a relational database that typically select a few rows from specific tables. An access from this clerk that selects all or most of the rows of these tables should be considered anomalous with respect to the daily access pattern of the clerk. However, approaches based only on syntax are not able to classify such an access as anomalous. Thus, syntactic approaches have to be extended to take into account semantic features of queries, such as the number of result rows. An important requirement is that queries should be inspected before their execution, in order to prevent malicious queries from making changes to the database.

From the technical perspective, the main purpose is to ensure the effective enforcement of security regulations. Auditing is an important technique for examining whether user behaviours in a system conform to security policies. Many methods audit database processing by comparing a user's SQL query expression against predefined patterns so as to find an anomaly. But a malicious query may be dressed up to look benign so as to evade such syntactic detection. To overcome this shortcoming, data-centric methods further audit whether the data a user query actually accessed involves any banned information. However, such an audit concerns a concrete policy rather than an overall view of multiple security policies. It requires explicit audit commands articulated by experienced professionals and much interactive analysis. Since in practice an anomaly pattern cannot be articulated in advance, it is difficult to detect such fraud with current audit methods.

Anomaly detection technology is used to identify abnormal behaviours that are statistical outliers. Some probabilistic methods learn normal patterns, against which they detect anomalies. But these methods assume that very few users deviate from the normal patterns; when there are many anomalous users, the learned normal pattern diverges. These works do not examine user behaviour from either a historical or an incremental view, which may overlook some malicious behaviour. Furthermore, if a group of people collude, it is difficult to find them with the current methods.

In QPAFCS we tackle the insider threat problem using different approaches. We take into consideration the fact that certain data elements are more critical to the database than others, and we thus pay special attention to the security of such critical data elements. We also recognise the presence of data attributes in a system which can be manipulated to indirectly influence the crucial data attributes, and we address the threat to our critical data elements through such attributes as well.

We also investigate a suspected user from the diachronic view by analysing his/her historical behaviour. We store a measure denoting how suspicious a user has been: the greater this measure, the greater the chance of a query being malicious. This measure also addresses gradually escalating malicious behaviour, since the historical statistic accumulates over time.

The main purpose of our project is to recognise user access patterns. Our Intrusion Detection System (IDS) pays special attention to certain semantically critical data elements, along with those elements which can be used to infer the aforementioned elements. We present an innovative approach that combines a user's historic and present access patterns to classify an incoming transaction as malicious or non-malicious. Using FCM, we partition the users into fuzzy clusters, each of which contains a set of rules in its cluster profile. In the detection phase, new transactions are checked against the rules in these clusters, and a suitable action is taken depending upon the nature of the transaction. The main advantage of our IDS lies in its ability to prevent inference attacks on Critical Data Elements.

The remainder of this work is organized as follows. Section 2 presents prior research related to this work. Section 3 introduces the fuzzy clustering and belief update framework. Section 4 discusses the approach using examples. Section 5 discusses how to apply our method in a practical system. Experimental evaluation is discussed in Section 6.
II. RELATED WORK

Numerous researchers are currently working in the field of Network Intrusion Detection Systems, but only a few have proposed research work on Database IDSs. Several systems for intrusion detection in operating systems and networks have been developed; however, they are not adequate for protecting databases from intruders. Database ID systems work at the query level, the transaction level and the user (role) level. Bertino et al. described the challenges of ensuring data confidentiality, integrity and availability and the need for database security, wherein the need for database IDSs to tackle insider threats was discussed.

Panda et al. [19] propose to employ a data mining approach for determining data dependencies in the database system. Classification rules reflecting data dependencies are deduced directly from the database log. These rules represent what data items probably need to be read before an update operation and what data items are most likely to be written following this update operation. Transactions that do not comply with the generated data dependencies are flagged as anomalous.

Database IDSs include temporal analysis of queries and data dependencies among attributes, queries and transactions. Lee et al. [28] proposed a temporal-analysis-based intrusion detection method which incorporated time signatures and recorded the update gap of temporal attributes. Any anomaly in the update pattern of an attribute was reported as an intrusion in the proposed approach.

The breakthrough introduction of association rule mining by Aggarwal et al. [22] helped in finding data dependencies among data attributes, which was incorporated into the field of intrusion detection in databases.

During the initial development of data dependency association rule mining, DEMIDS, a misuse detection system for relational database systems, was proposed by Chung et al. [7]. Profiles which specified user access patterns were derived from the audit log, and distance metrics were further applied for recognizing data items. These were used together in order to represent the expanse of users. But once the number of users of a single system becomes substantial, maintaining profiles becomes a redundant procedure. Another flaw was that the system assumed domain information about a given schema.

Hu et al. [16] presented a data mining based intrusion detection system, which used static analysis of the database audit log to mine dependencies among attributes at the transaction level and represented those dependencies as sets of read and write operations on each data item. In another approach proposed by Hu et al., techniques of sequential pattern mining were applied on the training log in order to identify frequent sequences at the transaction level. This approach helped in identifying groups of malicious transactions which individually complied with the user behavior. The approach was improved by Hu et al. by clustering legitimate user transactions into user tasks for the discovery of inter-transaction data dependencies.

The proposed method extends the approach by assigning weights to all the operations on data attributes. Transactions which did not follow the data dependencies were marked as malicious. The major disadvantage of user-assigned weights is that they are static and unrelated to other data attributes. Kamra et al. [27] employed a clustering technique on an RBAC model to form profiles, based on attribute access, which represented normal user behavior. An alarm is raised when behavior anomalous to the role profile is observed.

(Bezdek, Ehrlich & Full, 1984) proposed the Fuzzy C-Means algorithm. The basic idea behind this approach is to express the similarity a data point shares with each of the clusters with the help of a function, often referred to as the membership function. This measure of similarity lies between zero and one, signifies the extent of similarity between the data point and the cluster, and is termed the membership value. The main aim of this technique is to construct fuzzy partitions of a particular data set.
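To illustrate the fuzzy c-means idea described above — this is a generic sketch of the Bezdek algorithm, not the exact formulation used later in this paper; the toy data, cluster count and Euclidean distance are illustrative choices:

```python
import math

def memberships(points, centers, m=2.0):
    """Membership w[i][j] of point i in cluster j:
    w_ij = 1 / sum_k (d(x_i, c_j) / d(x_i, c_k))^(2/(m-1))."""
    def dist(a, b):
        # Guard against a zero distance when a point coincides with a center.
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b))) or 1e-12
    w = []
    for x in points:
        d = [dist(x, c) for c in centers]
        w.append([1.0 / sum((d[j] / d[k]) ** (2.0 / (m - 1))
                            for k in range(len(centers)))
                  for j in range(len(centers))])
    return w

def update_centers(points, w, m=2.0):
    """Each center is the mean of all points weighted by w_ij^m."""
    dims = len(points[0])
    centers = []
    for j in range(len(w[0])):
        den = sum(w[i][j] ** m for i in range(len(points)))
        centers.append([sum(w[i][j] ** m * points[i][t]
                            for i in range(len(points))) / den
                        for t in range(dims)])
    return centers

# Alternate membership and center updates on a toy two-cluster data set.
pts = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]]
ctrs = [[0.0, 0.0], [1.0, 1.0]]
for _ in range(10):
    W = memberships(pts, ctrs)
    ctrs = update_centers(pts, W)
```

Each row of W sums to one, so a point's "mass" is shared among all clusters rather than assigned to exactly one, which is the defining difference from hard k-means.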
Y. Yu et al. [29] illustrated a fuzzy-logic-based anomaly intrusion detection system. A Naïve Bayes classifier is used to classify an input event as normal or anomalous. The basis of the classifier is formed by the independent frequency of each system call from a process under normal conditions. The ratio of the probability of a sequence coming from a process to the probability of it not coming from that process serves as the input of a fuzzy system for the classification.

A hybrid approach was described by Doroudian et al. [26] to identify intrusions at both the transaction and the inter-transaction level. At the transaction level, a set of predefined expected transactions is specified to the system, and a sequential rule mining algorithm is applied at the inter-transaction level to find dependencies between the identified transactions. The drawback of such a system is that sequences with frequencies lower than the threshold value are neglected. Therefore, infrequent sequences were completely overlooked by the system, irrespective of their importance. As a result, the True Positive Rate falls for the system.

The above drawback was overcome by Sohrabi et al. [20], who proposed a novel approach, ODARDM, in which rules were formulated for lower-frequency item sets as well. These rules were extracted using leverage as the rule value measure, which minimized the uninteresting data dependencies. As a result, the True Positive Rate increased while the False Positive Rate decreased. In recent developments, Rao et al. [21] presented a query access detection approach using Principal Component Analysis and Random Forest to reduce data dimensionality and produce only relevant and uncorrelated data. As the dimensionality is reduced, both the system performance and the True Positive Rate increase.

In 2009, Majumdar et al. [15] proposed a comprehensive database intrusion detection system that integrates different types of evidence using an extended Dempster-Shafer theory. Besides combining evidence, they also incorporate learning in their system through the application of prior knowledge and observed data on suspicious users. In 2016, Bertino et al. [14] tackled the insider threat problem from a data-driven systemic view. User actions are recorded as historical log data in a system, and the evaluation investigates the data that users actually process. From the horizontal view, users are grouped together according to their responsibilities, and a normal pattern is learned from the group behaviours. They also investigate a suspected user from the diachronic view by comparing his/her historical behaviours with the historical average of the same group.

Anomaly detection has been an important research problem in security analysis; therefore, the development of methods that can detect malicious insider behavior with high accuracy and low false alarm rates is vital [10]. In this problem layout, McGough et al. [8] designed a system to identify anomalous user behavior by comparing an individual user's activities against their own routine profile, as well as against the organization's rules. They applied two independent approaches, machine learning and a statistical analyzer, to the data. The results from these two parts were then combined to form a consensus, which was mapped to a risk score. Their system showed high accuracy, a low false positive rate and minimal effect on the existing computing and network resources in terms of memory and CPU usage.

Bhattacharjee et al. proposed a graph-based method that investigates user behavior from two perspectives: (a) anomaly with reference to the normal activities of the individual user as observed over a prolonged period of time, and (b) the relationship between a user and his colleagues with similar roles/profiles. They utilized the CMU-CERT dataset in an unsupervised manner. In their model, the Boykov-Kolmogorov algorithm was used, and the results were compared with different algorithms including Single Model One-Class SVM, Individual Profile Analysis, k-User Clustering and Maximum Clique (MC). Their proposed model was evaluated with the Area-Under-Curve (AUC) metric, which showed impressive improvement compared to the other algorithms [9]. Log data are considered high-dimensional data which contain irrelevant and redundant features. Feature selection methods can be applied to reduce dimensionality, decrease training time and enhance learning performance [11].

3. OUR APPROACH

3.1 Basic Notations

Large organisations deal with tremendous amounts of data whose security is of prime interest. The data in databases comprise attributes describing real-life objects called entities. The attributes have varying levels of sensitivity, i.e. not all attributes are equally important to the integrity of the database. As an example, signatures and other biometric data are highly sensitive attributes for a financial organisation like a bank, in comparison to others like name, gender etc. So, unauthorised access to the crucial attributes is of greater concern. Only certain employees may have access to such data elements, and access by all others must be blocked instantaneously to ensure confidentiality and consistency of data.

Our proposed model QPAFCS (Query Pattern Access and Fuzzy Clustering System) pays special attention to sensitive data attributes, referred to as CDEs (Critical Data Elements) in the text. The attributes that can be used to indirectly infer CDEs are also critical to the functioning of the organisation. For instance, the account number of a user may be used to access the signatures and other crucial details about him. Such attributes are referred to as DAEs (Directly Associated Elements) in the text.

We propose a two-phase detection and prevention model that clusters users based on the similarity of their attribute access patterns and the types of queries performed by them, i.e. our model tries to track the access pattern of each user and further classify it as normal or malicious. The superiority of our model lies in its ability to prevent unauthorised retrieval and modification of the most sensitive data elements (CDEs). Our model also makes sure that the query pattern for access of CDEs is specific and fixed for a particular user to avoid data breaches, i.e. the user associates himself with his regular access behaviour. Any deviation from the regular arrangement may lead to depreciation of the user's confidence and may act as a representative of the user's malicious intent. The following terminologies are used:

Definition 1 (Transaction) A set of queries executed by a user. Each transaction is represented by a unique transaction ID and also carries the user's ID. Hence <Uid, Tid> acts as a unique identification key for each set of query patterns. Each transaction T is denoted as

<Uid, Tid, <q1, q2, … qn>>

where qi denotes the ith query, i ∈ [1 … n].

For example, suppose a user has id 1001. He/she then executes the following set of SQL queries:

q1: SELECT a,b,c
FROM R1,R2
WHERE R1.A>R2.B

q2: SELECT P
FROM R5
WHERE R5.P==10

Then this is said to be a transaction of the form:

t = <1001, 67, <q1, q2>>

Definition 2 (Query) A query is a standard database management system token/request for inserting and retrieving data or information from a database table or a combination of tables. We define a query as a read or write request on an attribute of the relation. A query is represented as

<O(D1), O(D2), … O(Dn)>

where D1, D2, … Dn ∈ Rs, Rs is the relation schema and the Di are the attributes. O represents the operation, i.e. a read or write operation: O ∈ {R, W}.

For example, examine the following transaction:

start transaction
select balance from Account where Account_Number='9001';

select balance from Account where Account_Number='9002';

update Account set balance=balance-900 where Account_Number='9001';

update Account set balance=balance+900 where Account_Number='9002';

commit; //if all SQL queries succeed
rollback; //if any of the SQL queries failed

The query representation corresponding to this transaction is:

<<R(Account_Number), R(balance)>,
<R(Account_Number), R(balance)>,
<R(Account_Number), R(balance), W(balance)>,
<R(Account_Number), R(balance), W(balance)>>

Definition 3 (Read Sequence) A read sequence is defined as

<R(x1), R(x2), … O(xn)>

where O represents the operation, i.e. a read or write operation: O ∈ {R, W}. The read sequence represents that the transaction may need to read all data items x1, x2, …, xn-1 before it performs operation O on data item xn.

For example, consider the following update statement in a transaction.

Update Table1 set x = a + b + c where d = 90;

In this statement, before updating x, the values of a, b, c and d must be read, and then the new value of x is calculated. So <R(a), R(b), R(c), R(d), W(x)> ∈ RS(x), where RS(x) denotes the read sequence set of x.

Definition 4 (Write Sequence) A write sequence is defined as

<O(x1), W(x2), … W(xn)>

where O represents the operation, O ∈ {R, W}. The write sequence represents that the transaction may need to write the data items x2, …, xn in this order after it operates on data item x1.

For example, consider the following update statements in one transaction.

Update Table1 set x = a + b + c where a=50;

Update Table1 set y = x + u where x=60;

Update Table1 set z = x + w + v where w=80;

From the above example, it can be noted that <W(x), W(y), W(z)> is one write sequence of data item x, that is, <W(x), W(y), W(z)> ∈ WS(x), where WS(x) denotes the write sequence set of x.

Definition 5 (Read Rules (RR)) Read rules are the association rules generated from read sequences whose confidence is greater than a user-defined threshold (Ψconf). A read rule is represented as

{R(x1), R(x2), …} ⇒ O(x)

For every sequential pattern <R(x1), R(x2), …, R(xn-1), O(xn)> in the read sequence set, generate the read rule {R(x1), R(x2), …, R(xn-1)} ⇒ O(xn). If the confidence of the rule is larger than the minimum confidence (Ψconf), it is added to the set of read rules, which implies that before operating on xn, we need to read x1, x2, …, xn-1.

For example, the read rule corresponding to the read sequence <R(a), R(b), R(c), R(d), W(x)> is:

{R(a), R(b), R(c), R(d)} ⇒ W(x)

Definition 6 (Write Rules (WR)) Write rules are the association rules generated from write sequences whose confidence is greater than the user-defined threshold (Ψconf). A write rule is represented as

O(x) ⇒ {W(x1), W(x2), …}

For every sequential pattern <O(x), W(x1), W(x2), …, W(xk)> in the write sequence set, generate the
write rule with the format O(x) ⇒ {W(x1), W(x2), …, W(xk)}. If the confidence of the rule is larger than the minimum confidence (Ψconf), it is added to the set of write rules, which depicts that after updating x, the data items x1, x2, …, xk must be updated by the same transaction.

For example, the write rule corresponding to the write sequence <W(x), W(y), W(z)> is W(x) ⇒ {W(y), W(z)}.

Definition 7 (Critical Data Elements (CDE)) These are semantically defined data elements crucial to the functioning of the system. They are the data attributes of prime significance, having a direct correlation to the integrity of the system. In a vertically hierarchical organisation, these are the attributes accessed only by the top-level management, and access by lower levels of the hierarchy is strictly protected.

Type of Attribute               Sensitivity Level
Critical Data Elements          Highest
Directly Associated Elements    Medium
Normal Attributes               Low

Table 3.1 Types of attributes and their sensitivity levels

CDEs are tokens of behaviour that our model uses for the malicious-activity recognition of the users of the system.

Definition 8 (Critical Rules (CR)) The set of rules that contain a Critical Data Element in their antecedent or consequent.

CR = {ζ | (ζ ∈ RR ∨ ζ ∈ WR) ∧ x ∈ CDE, where ζ is of the form {R(x1), R(x2), …} ⇒ O(x) or O(x) ⇒ {W(x1), W(x2), …}}

We propose a method of user access pattern recognition using the Critical Rules. CRs recognize the actions and goals of users from a series of observations on the users' actions and the environmental conditions, i.e. the user query pattern associated with the Critical Data Elements.

Definition 9 (Directly Associated Elements (DAE)) The attributes, except those present in CDE, which are part of either the antecedents or the consequents of Critical Rules.

DAE = {μi | μi ∈ CR ∧ μi ∉ CDE}

The query patterns as perceived by our model QPAFCS are explored using DAEs, which represent the first level of access to the CDEs. A user's behaviour is represented by a set of first-order statements (derived from queries) called an attribute hierarchy, encoded in first-order logic, which defines abstraction, decomposition and functional relationships between types of access arrangements. The unit transactions accessing CDEs are decomposed into an attribute hierarchy comprising DAEs, which further represents the user's most sensitive retrieval pattern.

Example:
- R(b) → R(a)
- R(b), R(c) → R(a)

If a is a CDE, then the set {b, c} represents the DAEs.

Definition 10 (Dubiety Score (φ)) A measure of the anomaly exhibited by a user in the past, based on his historic transactional data. This score summarizes the user's historic malicious access attempts. The Dubiety Score attempts to quantify the personnel vulnerability that the organisation faces because of a particular user.

The Dubiety Score is indicative of the amount of deviation between the user's access pattern and his designated role. The Dubiety Score, combined with the deviation of the user's present query from his normal behaviour pattern, yields the output of the proposed IDS.

For our paper:

0 ≤ φ ≤ 1

The higher the Dubiety Score, the more the evidence against the user following the assigned role, i.e. the greater the malicious intent (rogue behaviour).

Definition 11 (Dubiety Table) A table maintaining a record of the dubiety score of each user. It contains two attributes: UserID and Dubiety Score.
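To make Definitions 5, 8 and 9 concrete, the following sketch filters critical rules from a rule set and derives the DAE set. The rule contents and the CDE set here are illustrative inventions (the paper mines its actual rules from the transaction log), and the (antecedent, consequent) tuple encoding is our own modelling choice:

```python
# A rule is an (antecedent, consequent) pair; each side is either a single
# (op, attribute) token or a set of them, e.g. ({('R','b'), ('R','c')}, ('R','a'))
# stands for the read rule {R(b), R(c)} => R(a). Contents are illustrative only.
read_rules = [
    ({("R", "b")}, ("R", "a")),
    ({("R", "b"), ("R", "c")}, ("R", "a")),
    ({("R", "m"), ("R", "n")}, ("W", "o")),   # touches no CDE
]
write_rules = [
    (("W", "a"), {("W", "y")}),               # W(a) => {W(y)}
]
CDE = {"a"}  # semantically chosen critical data elements

def attributes(rule):
    """All attribute names appearing in a rule's antecedent or consequent."""
    lhs, rhs = rule
    tokens = (lhs if isinstance(lhs, set) else {lhs}) | \
             (rhs if isinstance(rhs, set) else {rhs})
    return {attr for _, attr in tokens}

# Definition 8: a critical rule mentions at least one CDE.
critical_rules = [r for r in read_rules + write_rules if attributes(r) & CDE]

# Definition 9: DAE = attributes of critical rules that are not CDEs themselves.
DAE = set().union(*(attributes(r) for r in critical_rules)) - CDE
```

With these illustrative rules, the first two read rules and the write rule are critical, and the DAE set comes out as {b, c, y}: exactly the attributes through which a could be inferred or influenced.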
The initial Dubiety Scores are set to 1.

Uid     φ
1001    1
1002    1
1003    1
1004    1
1005    1

Table 3.2 Initial Dubiety Table

The dubiety table is updated each time a user performs a query.

For example, let user 1001's deviation from the normal query be quantified as ds = 0.81. Then the updated Dubiety Table is as shown below, where:

ds = deviation from the normal query
φi = initial dubiety score

Uid     √(ds ∗ φi)
1001    0.9
1002    1
1003    1
1004    1
1005    1

Table 3.3 Updated Dubiety Table

The updated Dubiety Table is then stored in memory for further processing.

3.2 Learning Phase

We start our learning phase by reading the training dataset into memory and extracting useful patterns out of it. Our system requires a non-malicious training dataset composed of transactions executed by trusted users. The model aims at generating user profiles from the transaction logs and quantifies deviation from normal behaviour, i.e. this phase aims to recognise and characterise the user activity pattern on the basis of their query arrangement. The following are the various components of the architecture of the proposed model:

Fig 3(a) Learning Phase Architecture

COMPONENTS OF THE ARCHITECTURE:

Training data: A transaction log is a sequential record of all changes made to the database, while the actual data is contained in a separate file. The transaction log contains enough information to undo all changes made to the data file as part of any individual transaction. The log records the start of a transaction, all the changes considered to be a part of it, and then the final commit or rollback of the transaction. Each database has at least one physical transaction log and one data file that is exclusive to the database for which it was created. Our initial input to the learning phase algorithm is the transaction log, with only authorised and consistent transactions. This data is free of any unauthorised activity and is used to form user profiles, role profiles etc. based on normal user transactions. The logs are scanned, and the following elements are extracted:

a. SQL queries
b. The user executing a given query
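From Tables 3.2 and 3.3 above, the update applied to a user's score appears to be φ' = √(ds ∗ φ), the geometric mean of the old score and the quantified deviation; the paper does not spell the formula out in prose, so this small bookkeeping sketch (table layout and function name are ours) rests on that reading:

```python
import math

def update_dubiety(table, uid, ds):
    """Replace a user's dubiety score phi with sqrt(ds * phi),
    where ds quantifies the deviation of the current query
    from the user's normal pattern (as read off Table 3.3)."""
    table[uid] = math.sqrt(ds * table[uid])
    return table[uid]

# Initial dubiety table (Table 3.2): every user starts at 1.
dubiety = {1001: 1.0, 1002: 1.0, 1003: 1.0, 1004: 1.0, 1005: 1.0}

# User 1001's deviation from the normal query is quantified as 0.81.
update_dubiety(dubiety, 1001, 0.81)
```

Under this reading, repeated deviations compound: the score keeps drifting away from 1 as a user's anomalous history accumulates, which matches the paper's aim of catching gradually escalating behaviour.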
SQL query parser: This is a tool that takes SQL queries as input, parses them and produces the (read and write) sequences corresponding to the SQL queries as output. The query parser also assigns a unique transaction ID. The final output consists of three columns: TID (Transaction ID), UID (User ID) and the read/write sequence generated by the parsing algorithm.

As an example, if the following transaction performed by user U1001 is examined:

start transaction
select balance from Account where Account_Number='9001';
commit; //if all SQL queries succeed
rollback; //if any of the SQL queries failed

the parser generates a unique transaction ID, say T1234, and then parses the transaction. The parser finally yields:

<T1234, U1001, <R(Account_number), R(balance)>>

Frequent sequences generator: After the SQL query parser generates the sequences, the generated sequences are pre-processed. Weights are assigned to the data items; for instance, the CDEs are given greater weight as compared to DAEs and other normal attributes. Finally, these pre-processed sequences are given as input to the frequent sequences generator, which uses the PrefixSpan algorithm to generate frequent sequences out of the input sequences corresponding to each UID.

Rule generator: The frequent sequences are given as input to the rule generator module, which uses association rule mining to generate read rules and write rules out of the frequent sequences.

As an example, if the input frequent sequences are:

1. <R(m),R(n),R(o),W(a)>
2. <R(m),R(n),W(o),W(a)>
3. <R(m),W(n),W(o),W(a)>
4. <W(a),R(b),W(o)>
5. <R(a),R(b),R(m),W(a)>
6. <R(a),R(b),W(m),W(b)>

S.No.   Frequent Sequence          Associated Rule
1       <R(m),R(n),R(o),W(a)>      R(m),R(n),R(o) → W(a)
2       <R(m),R(n),W(o),W(a)>      R(m),R(n),W(o) → W(a)
3       <R(m),W(n),W(o),W(a)>      R(m),W(n),W(o) → W(a)
4       <W(a),R(b),W(o)>           W(a),R(b) → W(o)
5       <R(a),R(b),R(m),W(a)>      R(a),R(b),R(m) → W(a)
6       <R(a),R(b),W(m),W(b)>      R(a),R(b),W(m) → W(b)

Table 3.4 Rules generated for the given example

DAE generator: In our approach, we semantically define a class of data items known as Critical Data Elements or CDEs. These CDEs and the rules are given as input to our DAE (Directly Associated Element) generator, which marks as DAE all those elements which are present in either the antecedent or the consequent of rules involving at least one of the CDEs.

Algorithm 1: DAE Generator
Data: CDE, Set DAE = {}, RR = Set of Read Rules, WR = Set of Write Rules
Result: The set of Directly Associated Elements DAE
Function DAE_Generator(CDE, RR, WR)
    for Ω ∈ RR ∪ WR do
        for α ∈ Ω do
            if α ∈ CDE then
                for β ∈ Ω do
                    if β ∉ CDE then
                        DAE ⟵ DAE ∪ {β}
                    end
                end
            end
        end
    end
end

User vector generator: Using the frequent sequences for the given audit period, it generates the user vectors. A user vector is of the form
BID = < UID, w1, w2, w3, ... wn > The centre of a cluster (α) is the mean of all points,
weighted by their membership coefficients.
where wi = |O(ai)|. Mathematically,

|O(ai)| represents the total number of times user 1


𝑤𝑖𝑗 = 2
with the given Uid performs operation (O ∈ {R, W}) ||𝑢𝑖 −𝛼𝑗 || 𝑚−1
∑𝐶
𝑘=1(||𝑢 −𝛼 ||)
on the aforesaid attribute ai in the pre-decided 𝑖 𝑘

audit period. An audit period τ refers to a period of


time such as one year, a time window τ = [t1, t2] or
the recent 10 months. User vector is ∑𝑢 𝑤(𝑢)𝑚 𝑢
𝛼𝑘 =
representative of user’s activity. ∑𝑢 𝑤(𝑢)𝑚

Each of these wi would represent how frequently a The objective function that is minimized to create
user performs the operation on the particular data clusters is defined as:
item. It also can be used in a normalized form, as is
𝑛 𝐶
used in our proposed model QPAFCS.
𝑎𝑟𝑔 𝑚𝑖𝑛 ∑ ∑ 𝑤𝑖𝑗𝑚 ||𝑢𝑖 − 𝛼𝑗 ||2
𝑖=1 𝑗=1
UVID = <UID, < p(a1), p(a2), p(a3), … p(an)>>

where, where

𝑤𝑘 n is the total number of users,


p(𝑎𝑘 ) =
∑𝑤𝑗 𝜖 𝐵𝑖 𝑤𝑗
C is the number of clusters, and
p(ak) is defined as the probability of accessing the
m is the fuzzifier.
attribute ak.
The dissimilarity/distance function used in the
Value of p(𝑎𝑘 ) close to 1 would mean that the
formation of fuzzy clusters is the modified Jenson
user accesses the given attribute frequently.
Shannon distance which is illustrated as:
Cluster generator: It takes user vectors and rules
Given two user vectors
as input and generates fuzzy clusters. Users are
clustered into different fuzzy clusters based on the UVx = <Ux, < px(a1), px(a2), px(a3), … px(an)>> and
similarity of their user vectors. A cluster profile
would include UVy = <Uy, < py(a1), py(a2), py(a3), … py(an)>>

Ci = <CID, {R}> of equal length n, the modified Jensen Shannon


distance is computed as
where, CID represents the cluster centroid, and
𝐷(𝑈𝑉𝑝 ||𝑈𝑉𝑞 )
{R} is a set of rules which is formed by taking the (1 + 𝑝𝑥 (𝑎𝑖 ) ∗ 𝑤(𝑎𝑖 ))
union of all the rules that the members of the (1 + 𝑝𝑥 (𝑎𝑖 ) ∗ 𝑤(𝑎𝑖 )) log 2
(1 + 𝑝𝑦 (𝑎𝑖 ) ∗ 𝑤(𝑎𝑖 ))
given fuzzy cluster abide by.
+
(1 + 𝑝𝑦 (𝑎𝑖 ) ∗ 𝑤(𝑎𝑖 ))
We have used Fuzzy c-means clustering to create 𝑛 (1 + 𝑝𝑦 (𝑎𝑖 ) ∗ 𝑤(𝑎𝑖 )) log 2
( (1 + 𝑝𝑥 (𝑎𝑖 ) ∗ 𝑤(𝑎𝑖 )))
cluster. Each user belongs to a cluster to a certain =∑
2
degree wij. 𝑖=1

Where: where, w(ai) is the semantic weight associated


with the aith attribute
wij represents the membership coefficient of the
ith user (ui) with the jth cluster User profile generator: This module takes user
vectors and the cluster profiles as input and generates user profiles. A user profile is of the form

Ui = <UID, <p(a1), p(a2), p(a3), ..., p(an)>, <c1, c2, ..., cC>>

where UID is a unique ID given to each user, <p(a1), p(a2), p(a3), ..., p(an)> is a vector containing the probabilities of the user accessing each attribute, and <c1, c2, ..., cC> is a vector representing the membership coefficients of the given user for the C different clusters.

As an example, consider a system with 4 fuzzy clusters and 4 attributes. The table below illustrates the profile of user U1001.

Cluster memberships: C1 = 0.2, C2 = 0.2, C3 = 0.2, C4 = 0.4
User vector: <U1001, 0.2, 0.1, 0.9, 0.6>
User profile: <U1001, <0.2, 0.1, 0.9, 0.6>, <0.2, 0.2, 0.2, 0.4>>

Table 3.5 User profile for the given example

3.3 Testing Phase

In Section 3.2, the learning phase is described, in which the system is trained using non-malicious or benign transactions. The trained model can now be used to detect malicious transactions. In this phase, a test query is obtained as input, it is compared with the model's perception of the user's access pattern, and the model evaluates whether the test transaction is malicious. It is first checked whether the user is trying to access a CDE. If yes, the transaction is allowed only if the given user has accessed that CDE before. Next, it is checked whether any DAE is being accessed. A user can perform a write operation on a DAE iff it has previously been written by the same user; otherwise the transaction is termed as malicious. Finally, we check whether the transaction abides by the rules that are generally followed by similar users.

Phases of the testing phase:

Rule generator: This module takes the sequence generated by the SQL query parser and gives the rule that the input transaction follows. This can be a read rule or a write rule, and it indicates the operations done by the user, the data attributes accessed by the user, and the order in which they are accessed. This rule can now be checked for maliciousness.

CDE Detector: The semantically critical elements, referred to in our approach as CDEs, are detected in this module. The read/write rule corresponding to the incoming transaction is checked for the presence of CDEs. If the rule being checked for maliciousness contains a CDE, it is dealt with using the following policy:

a. If a read operation has been performed on any CDE, i.e. r(CDE) is present in the rule, and UV[i][r(CDE)] = 0 and UV[i][w(CDE)] = 0 for the given user, then the transaction is termed as malicious.
b. If a write operation has been performed on any CDE, i.e. w(CDE) is encountered, and UV[i][w(CDE)] = 0 for the given user, then the transaction is termed as malicious.
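As an illustrative sketch only (not the system's implementation), the CDE policy above can be expressed in Python. The names `rule_ops`, `cde_set` and `uv` (the per-user read/write access counts, i.e. UV[i]) are assumptions made for this example:

```python
def is_malicious_cde(rule_ops, cde_set, uv):
    """Apply the CDE policy to one transaction rule.

    rule_ops: list of (op, attr) pairs, e.g. [("R", "b"), ("W", "a")]
    cde_set:  set of attributes marked as CDEs
    uv:       dict mapping (op, attr) to the user's historical access count
    """
    for op, attr in rule_ops:
        if attr not in cde_set:
            continue
        # Policy (a): read on a CDE the user has never read nor written
        if op == "R" and uv.get(("R", attr), 0) == 0 and uv.get(("W", attr), 0) == 0:
            return True
        # Policy (b): write on a CDE the user has never written
        if op == "W" and uv.get(("W", attr), 0) == 0:
            return True
    return False
```

For instance, a first-ever write to CDE "a" is flagged, while a write by a user with a prior write history on "a" passes this check.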
DAE Detector: This module addresses the issue of inference attacks on CDEs. As discussed earlier, certain data elements can be used to access the CDEs, i.e. first-order inference. This module uses the rules mined in the learning phase to determine which elements can be used to directly infer the CDEs, i.e. the DAEs.
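The derivation of the DAE set from the mined rules (the idea behind Algorithm 1 of the learning phase) can be sketched as follows. Representing each rule as a plain sequence of attribute names is a simplification assumed for this sketch:

```python
def derive_daes(rules, cdes):
    """Collect attributes that co-occur with a CDE in any mined rule.

    rules: iterable of rules, each a sequence of attribute names
    cdes:  set of critical data elements
    Returns the set of directly associated elements (DAEs).
    """
    daes = set()
    for rule in rules:
        attrs = set(rule)
        if attrs & cdes:            # the rule touches at least one CDE
            daes |= attrs - cdes    # its other attributes can infer the CDE
    return daes
```

With rules [["b", "c", "a"], ["e", "f"]] and CDE set {"a"}, this yields {"b", "c"} as DAEs, mirroring the credit-card example in the Discussion.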
Our system seeks to prevent inference attacks by specially monitoring the DAEs. We lay emphasis on write operations on DAEs: if a write operation has been performed on any DAE, i.e. w(DAE) is present in the rule being checked, and UV[i][w(DAE)] = 0 for the given user, then the transaction is termed as malicious.

Algorithm 2: CDE Detector
Data: Set of rules (ϒ) from the test transaction, Set χCDE, UID, User Profile (ϴ)
Result: Checks whether the test transaction is malicious or normal with respect to the CDEs
for Ѓ є ϒ do
    for ϱ є Ѓ do
        if ϱ є χCDE then
            if w(ϱ) є Ѓ and ϴ[UID][w(ϱ)] == 0 then
                Raise Alarm;
            end
            if r(ϱ) є Ѓ and ϴ[UID][r(ϱ)] == 0 and ϴ[UID][w(ϱ)] == 0 then
                Raise Alarm;
            end
        end
    end
end

Dubiety Score Calculator and Analyser: If the transaction has not been found malicious by the previous two modules, we check whether it is malicious based on the previous history of the user and the behaviour pattern of all similar users (using the modified Jensen-Shannon distance). To do so, we maintain a record of the actions of all users by keeping a measure called the Dubiety Score (φi).

The deviation of a user's new transaction from his normal access pattern is referred to as dubiety, and its relative measure is the Dubiety Score. Our IDS keeps a log of the Dubiety Score (DS) in a separate table. A user who is a potential threat tends to have a high Dubiety Score. Another intuition that our system follows is that any transaction a user makes should match significantly either with the transactions the same user or similar users have made in the past.
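A minimal sketch of the modified Jensen-Shannon distance defined in the learning phase, assuming `px` and `py` are the attribute-access probability vectors of the two users and `w` holds the semantic weights w(ai); the function name is an assumption for this sketch:

```python
import math

def modified_js_distance(px, py, w):
    """Modified Jensen-Shannon distance between two user vectors.

    px, py: per-attribute access probabilities of the two users
    w:      per-attribute semantic weights w(ai)
    """
    total = 0.0
    for pxa, pya, wa in zip(px, py, w):
        x = 1.0 + pxa * wa          # shifted, weighted probability for user x
        y = 1.0 + pya * wa          # shifted, weighted probability for user y
        total += (x * math.log2(x / y) + y * math.log2(y / x)) / 2.0
    return total
```

The distance is symmetric and is 0 for identical vectors; attributes with larger semantic weight contribute more to the divergence.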
We use a measure ds to keep track of the maximum similarity of the given rule. We combine ds with φi to get the final dubiety score φf for the given user. We define two thresholds, ФLT and ФUT. ФUT represents the upper limit for the dubiety score of a non-malicious user, whereas ФLT denotes the lower limit. This means that if φf for a user comes out to be greater than ФUT, the user is malicious; a φf value less than ФLT denotes a benign user.

• If the incoming rule (R1) is a write rule, the consequent of the incoming rule is matched with the corresponding rules in the clusters of which the user is a part. A user is said to be a part of the ith cluster, i є [1, k], iff μi > 𝛿, where μi is the fuzzy membership coefficient of the given user for the ith cluster and 𝛿 is a user-defined threshold.
• If the incoming rule (R1) is a read rule, the antecedent of the incoming rule is matched with the corresponding rules in the clusters of which the user is a part.
• To quantitatively measure the similarity between two rules, we use the modified Jaccard distance:

JD = 1 − (𝛿1·|R1 ∩ R2| − 𝛿2·|R1 ∪ R2 − R1 ∩ R2|) / |R1 ∪ R2|

• The minimum value of JD over the matched rules is regarded as ds. φi is fetched directly from the dubiety table. The final dubiety score for the given user is calculated as

φf = √(ds · φi)

• If φf < ФLT, the transaction is termed as non-malicious. In this case, the current dubiety score in the dubiety table for the given user is reduced by a factor known as the amelioration factor (Å); thus, φi is updated as φi = Å·φi.
• If ФLT ≤ φf < ФUT, the transaction is termed as non-malicious and the dubiety table entry for the given user is updated with φf.
• If φf ≥ ФUT, the transaction is termed as malicious.

Algorithm 3: DAE Detector
Data: Set of rules (ϒ) from the test transaction, Set χDAE, UID, User Profile (ϴ)
Result: Checks whether the test transaction is malicious or normal with respect to the DAEs
for Ѓ є ϒ do
    for ϱ є Ѓ do
        if ϱ є χDAE then
            if w(ϱ) є Ѓ and ϴ[UID][w(ϱ)] == 0 then
                Raise Alarm;
            end
        end
    end
end

Algorithm 4: Modified Jaccard Distance
Data: Rules R1, R2; 𝛿1, 𝛿2; Sets χR1, χR2
Result: Distance between the two rules (Ԏ)
Function jcDistance(R1, R2)
for Ω є R1 do
    χR1 ← Ω;
end
for Ω' є R2 do
    χR2 ← Ω';
end
Ԏ = 1 − (𝛿1·|χR1 ∩ χR2| − 𝛿2·|χR1 ∪ χR2 − χR1 ∩ χR2|) / |χR1 ∪ χR2|;
return Ԏ;

As an example, let the initial dubiety table be as follows, taking ФLT = 0.3 and ФUT = 0.6:

Uid    φ
1001   0.9
1002   0.8
1003   0.2
1004   0.6
1005   0.7

Table 3.6 Initial dubiety table

Let the minimum value of ds corresponding to each user be:

Uid    ds
1001   0.2
1002   0.3
1003   0.2
1004   0.6
1005   0.3

Table 3.7 Minimum ds values for the various users

The calculated dubiety scores are then:

Uid    φf = √(ds·φi)
1001   0.42
1002   0.49
1003   0.2
1004   0.6
1005   0.46

Table 3.8 Calculated dubiety scores

Uid    φf     Nature of transaction    Updated φ
1001   0.42   Non-malicious            0.42
1002   0.49   Non-malicious            0.49
1003   0.2    Non-malicious            0.198
1004   0.6    Malicious                0.6
1005   0.46   Non-malicious            0.46

Table 3.9 Summary of the transactions of the various users

The malicious transactions are blocked in a straightforward fashion and the non-malicious transactions are processed. The updated dubiety table is stored in the database.

4. Discussion

With regard to a typical credit card company dataset, some examples of critical data elements (CDEs) are:

1. CVV (denoted by a)

Card verification value (CVV) is a combination of features used in credit, debit and automated teller
machine (ATM) cards for the purpose of
establishing the owner's identity and minimizing
the risk of fraud. The CVV is also known as the card
verification code (CVC) or card security code (CSC).
When properly used, the CVV is highly effective against some forms of fraud. For example, if the data in the magnetic stripe is changed, a stripe reader will indicate a "damaged card" error. The flat-printed CVV is (or should be) routinely required for telephone or Internet-based purchases because it implies that the person placing the order has physical possession of the
card. Some merchants check the flat-printed CVV even when transactions are conducted in person.

CVV technology cannot protect against all forms of fraud. If a card is stolen or the legitimate user is tricked into divulging vital account information to a fraudulent merchant, unauthorized charges against the account can result. A common method of stealing credit card data is phishing, in which a criminal sends out legitimate-looking email in an attempt to gather personal and financial information from recipients. Once the criminal has possession of the CVV in addition to personal data from a victim, widespread fraud against that victim, including identity theft, can occur.

The following are directly associated elements (DAEs) to the CVV:

a. Credit card number (denoted by b)
b. Name of card holder (denoted by c)
c. Card expiry date (denoted by d)

Credit card number, name of card holder and card expiry date are elements that are read before the CVV and hence used to validate the CVV entered by the user. Hence the above-mentioned attributes have been classified as DAEs by our system.

Some normal data attributes are:

1. Gender of customer (denoted by e)
2. Credit limit (denoted by f)
3. Customer's phone number (denoted by g)

These are attributes that have been collected for fraud detection; they are not directly used to access the CDE but are crucial for the process.

Some examples of transactions for our proposed approach:

• R(b) → R(a)
• R(b), R(c) → R(a)

5. Example of our Approach

1. JC Distance

R1: R(c), R(b) → R(a)
R2: R(d), R(b) → R(a)

The modified JC distance between R1 and R2, where the hyperparameters are 𝛿1 = 0.70 and 𝛿2 = 0.20, is calculated as

JD = 1 − (𝛿1·|R1 ∩ R2| − 𝛿2·|R1 ∪ R2 − R1 ∩ R2|) / |R1 ∪ R2|

|R1 ∩ R2| = 2
|R1 ∪ R2| = 4

JD = 1 − (0.70·2 − 0.20·2)/4 = 0.75

2. User Profile Vector

B1 = <U1, <0.7, 0.1, 0.6, 0.2, 0.4, 0.0, 0.2, 0.0>, <0.2, 0.3, 0.1, 0.2, 0.167, 0.033>>

Here the values in the second tuple <0.7, ..., 0.0> represent the probabilities of user U1 accessing the particular attributes; for instance, 0.7 denotes that there is a 70% probability that U1 accesses the first attribute. The values in the third tuple represent the membership of user U1 in the various (k) fuzzy clusters, where k = 6 in our case.

3. Dubiety Score

Suppose the dubiety score φi for user U1 is 0.8 and the JC distance of the test transaction with its cluster is 0.6. Then

φf = √(ds · φi) = √(0.6 · 0.8) ≈ 0.69

Setting our hyperparameter ФUT to 0.65, we observe that φf > ФUT. Hence the test transaction is malicious and an alarm is raised.
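The worked example above can be reproduced with a short script. Representing each rule as a set of operations, and the function and variable names used here, are assumptions made for this sketch:

```python
import math

def modified_jaccard_distance(r1, r2, d1=0.70, d2=0.20):
    """Modified Jaccard distance between two rules (as sets of operations)."""
    inter = r1 & r2
    union = r1 | r2
    return 1.0 - (d1 * len(inter) - d2 * len(union - inter)) / len(union)

# Rules from the example: R1: R(c), R(b) -> R(a); R2: R(d), R(b) -> R(a)
R1 = {"R(c)", "R(b)", "R(a)"}
R2 = {"R(d)", "R(b)", "R(a)"}
jd = modified_jaccard_distance(R1, R2)   # |inter| = 2, |union| = 4, so jd = 0.75

# Dubiety score example: phi_i = 0.8, ds = 0.6
phi_i, ds = 0.8, 0.6
phi_f = math.sqrt(ds * phi_i)            # about 0.693, above the 0.65 threshold
```

Since phi_f exceeds the assumed upper threshold ФUT = 0.65, this transaction would be flagged as malicious, matching the paper's conclusion.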


References
1. I-Yuan Lin, Xin-Mao Huang, Ming-Syan Chen, "Capturing user access patterns in the Web for data mining," in Proceedings of the 11th IEEE International Conference on Tools with Artificial Intelligence, 9-11 Nov. 1999.
2. R.S. Sandhu ; P. Samarati “ Access control: principle and practice” Published in:
IEEE Communications Magazine ( Volume: 32 , Issue: 9 , Sept. 1994 )
3. Denning, D.E. (1987) An Intrusion Detection Model. IEEE Transactions on
Software Engineering, Vol. SE-13, 222-232.
4. Knuth, Donald E., James H. Morris, Jr, and Vaughan R. Pratt. "Fast pattern
matching in strings." SIAM journal on computing 6.2 (1977): 323-350.
5. Wang, Ke. "Anomalous Payload-Based Network Intrusion Detection" . Recent
Advances in Intrusion Detection. Springer Berlin. doi:10.1007/978-3-540-30143-
1_11
6. Douligeris, Christos; Serpanos, Dimitrios N. (2007-02-09). Network Security:
Current Status and Future Directions. John Wiley & Sons. ISBN 9780470099735.
7. Christina Yip Chung, Michael Gertz and Karl Levitt (2000), “DEMIDS: a misuse
detection system for database systems”, Integrity and internal control information
systems: strategic views on the need for control, Kluwer Academic Publishers,
Norwell, MA.
8. A. S. McGough, D. Wall, J. Brennan, G. Theodoropoulos, E. Ruck-Keene, B. Arief,
et al., "Insider Threats: Identifying Anomalous Human Behaviour in Heterogeneous
Systems Using Beneficial Intelligent Software (Ben-ware)," presented at the
Proceedings of the 7th ACM CCS International Workshop on Managing Insider
Security Threats, Denver, Colorado, USA, 2015.
9. S. D. Bhattacharjee, J. Yuan, Z. Jiaqi, and Y.-P. Tan, "Context-aware graph-based
analysis for detecting anomalous activities," presented at the Multimedia and Expo
(ICME), 2017 IEEE International Conference on, 2017.
10. P. A. Legg, O. Buckley, M. Goldsmith, and S. Creese, "Automated insider threat
detection system using user and role-based profile assessment," IEEE Systems
Journal, vol. 11, pp. 503-512, 2015.
11. I. Agrafiotis, A. Erola, J. Happa, M. Goldsmith, and S. Creese, "Validating an
Insider Threat Detection System: A Real Scenario Perspective," presented at the
2016 IEEE Security and Privacy Workshops (SPW), 2016.
12. T. Rashid, I. Agrafiotis, and J. R. C. Nurse, "A New Take on Detecting Insider
Threats: Exploring the Use of Hidden Markov Models," presented at the
Proceedings of the 8th ACM CCS International Workshop on Managing Insider
Security Threats, Vienna, Austria, 2016.
13. Zamanian Z., Feizollah A., Anuar N.B., Kiah L.B.M., Srikanth K., Kumar S. (2019)
User Profiling in Anomaly Detection of Authorization Logs. In: Alfred R., Lim Y.,
Ibrahim A., Anthony P. (eds) Computational Science and Technology. Lecture Notes
in Electrical Engineering, vol 481. Springer, Singapore
14. Yuqing Sun, Haoran Xu, Elisa Bertino, and Chao Sun. 2016. A Data-Driven
Evaluation for Insider Threats. Data Science and Engineering Vol. 1, 2 (2016), 73--
85. [doi>10.1007/s41019-016-0009-x]
15. S. Panigrahi, S. Sural and A. K. Majumdar, "Detection of intrusive activity in
databases by combining multiple evidences and belief update," 2009 IEEE
Symposium on Computational Intelligence in Cyber Security, Nashville, TN, 2009,
pp. 83-90. doi: 10.1109/CICYBS.2009.4925094
16. Yi Hu, Brajendra Panda, "A data mining approach for database intrusion detection," SAC '04: Proceedings of the 2004 ACM Symposium on Applied Computing, pp. 711-716. doi:10.1145/967900.968048
17. Abhinav Srivastava, Shamik Sural, A. K. Majumdar, "Weighted intra-transactional rule mining for database intrusion detection," Proceedings of the 10th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, April 9-12, 2006, Singapore. doi:10.1007/11731139_71
18. TPC-C benchmark: http://www.tpc.org/tpcc/default.asp
19. Mina Sohrabi, M. M. Javidi, S. Hashemi, "Detecting intrusion transactions in database systems: a novel approach," Journal of Intelligent Information Systems, 42:619-644, DOI 10.1007, Springer, 2014.
20. U. P. Rao et al., "Weighted Role Based Data Dependency Approach for Intrusion Detection in Database," International Journal of Network Security, Vol. 19, No. 3, pp. 358-370, May 2017. doi:10.6633/IJNS.201703.19(3).05
21. R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases," in Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, 1993.
22. Sattar Hashemi, Ying Yang, Davoud Zabihzadeh and Mohammadreza Kangavari, "Detecting intrusion transactions in databases using data item dependencies and anomaly analysis," Expert Systems 25(5):460-473, November 2008. doi:10.1111/j.1468-0394.2008.00467.
23. Mostafa Doroudian, Hamid Reza Shahriari, "A Hybrid Approach for Database Intrusion Detection at Transaction and Inter-transaction Levels," 6th Conference on Information and Knowledge Technology (IKT 2014), May 28-30, 2014, Shahrood University of Technology, Tehran, Iran.
24. E. Bertino, A. Kamra, E. Terzi and A. Vakali (2005), "Intrusion detection in RBAC-administered databases," in Proceedings of the Applied Computer Security Applications Conference (ACSAC).
25. Lee, V. C. S., Stankovic, J. A., Son, S. H., "Intrusion Detection in Real-time Database Systems Via Time Signatures," in Proceedings of the Sixth IEEE Real-Time Technology and Applications Symposium, 2000.
26. Weina Wang, Yunjie Zhang, Yi Li and Xiaona Zhang (2006), "The Global Fuzzy C-
Means Clustering Algorithm," 2006 6th World Congress on Intelligent Control and
Automation, Dalian, 2006, pp. 3604- 3607.
27. Fuglede, Bent; Topsøe, Flemming (2004). "Jensen-Shannon divergence and
Hilbert space embedding - IEEE Conference Publication". ieeexplore.ieee.org.
28. Dunn, J. C. (1973-01-01). "A Fuzzy Relative of the ISODATA Process and Its Use
in Detecting Compact Well-Separated Clusters". Journal of Cybernetics. 3 (3): 32–
57. doi:10.1080/01969727308546046. ISSN 0022-0280.
29. A. Mangalampalli and V. Pudi (2009), "Fuzzy association rule mining algorithm
for fast and efficient performance on very large datasets," 2009 IEEE International
Conference on Fuzzy Systems, Jeju Island, 2009, pp. 1163-1168
30. Vorontsov, I.E., Kulakovskiy, I.V. & Makeev, V.J. Algorithms Mol Biol (2013) 8:
23. https://doi.org/10.1186/1748-7188-8-23 “ Jaccard index based similarity
measure to compare transcription factor binding site models”