
Incremental Outlier Detection for Transaction Databases using Frequent Itemsets


Saurabh Singh Mathuriya
Indian Institute of Technology Delhi
Computer Science
Email: mcs132581@cse.iitd.ac.in

Navneet Harinkhede
Indian Institute of Technology Delhi
Computer Science
Email: mcs132567@cse.iitd.ac.in

Abhinav Tiwari
Indian Institute of Technology Delhi
Computer Science
Email: jca132381@maths.iitd.ac.in

Abstract: Outlier detection is an important problem in many domains, and data mining techniques have been proposed in the literature for detecting outliers. Most of these algorithms were designed for large static databases, whereas real-world data accumulates incrementally. We observe a gap in the detection of outliers in growing (incremental) databases and propose an algorithm for outlier detection in incremental transaction databases using frequent itemsets. We find the maximal frequent itemsets using an incremental FP-Growth algorithm. A transaction that contains part of a maximal frequent itemset is expected to contain the remaining items of that itemset as well; a transaction that violates this expectation is more likely to be an outlier. The outlier degree is calculated from these missing (unseen) items. We first discuss unseen items and then define the outlier degree used to detect outlier transactions. Our experiments show promising results on both synthetic and real data.

I. INTRODUCTION
Outlier detection is an important problem with applications in many domains, e.g. commercial (credit card, insurance, money laundering [1,2,3]), medical (drug discovery, diagnosis [4,5]), scientific (new discoveries [6]), and security. An outlier is an element that does not follow a given pattern. Hawkins [7] defined it as follows: "An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism." Detecting such outliers is often useful in data mining applications.
The existing outlier detection algorithms work on static data sets, scanning the entire data set to find the global outliers. Data sets in data stream applications, however, are unbounded. Scanning the entire data set is impractical because requests need a quick response, and different applications would each require multiple scans. These two reasons call for techniques that are better in both quality and efficiency.
There has been some work on outlier detection in data streams [18,19,20]. Novel and compact data structures were proposed to avoid scanning the entire data stream. For example, a sliding window was proposed to define the current data set so that local outliers can be detected inside it [18, 19]. Autoregression-based methods were also proposed for outlier detection in stream data [20].
In this paper, we propose an algorithm that uses frequent itemsets to check the incremental data and find the outliers embedded in it. Outlier transactions are detected according to the mined frequent itemsets. As an instance, let $D$ be a transactional database containing transactions $T = \{t_1, t_2, \ldots, t_n\}$, each consisting of a subset of the itemset $I = \{i_1, i_2, \ldots, i_m\}$. Let $x = \{x_1, x_3, x_4, x_5\}$ be a maximal frequent itemset derived from $D$. A transaction $\langle x_1, x_3 \rangle$ is treated as abnormal if the items $\{x_4, x_5\}$ are expected to appear along with the items $\{x_1, x_3\}$ in $D$ but are not present. That is, a transaction that is expected to contain some items that do not actually appear in it is an outlier.
This paper presents a novel algorithm termed IODFI (Incremental Outlier Detection using Frequent Itemsets). It allows the user to detect outlier transactions in incremental data by finding frequent itemsets incrementally (without scanning the previous data). In the commercial world, data keeps growing incrementally, and as the data grows the frequent itemsets keep changing as well. To find the frequent itemsets of the entire dataset we use an incremental FP-Growth algorithm.
The remaining sections of this paper are organized as follows. Section II presents a literature survey of outlier detection. Section III introduces the foundations used in this paper. Section IV describes the proposed method, experimental results are provided in Section V, and Section VI concludes the paper.
II. OUTLIER DETECTION: LITERATURE SURVEY
A significant amount of work has been done on outlier detection since it finds applications in various domains [7,8,9,10]. This work has resulted in various techniques, each suited to specific domains. A statistical distribution-based outlier detection methodology [8] considers a point $p$ of a set $D$ to be an outlier if it deviates substantially with respect to the variance of the data. Wang et al. [9] proposed another methodology in which the distance of a data point $p$ is used to determine whether it is an outlier: $p$ is marked as an outlier based on the percentage of other points in $D$ that lie farther than a predefined distance from $p$. Another approach compares the density around a point with the density around its local neighbors to identify outliers [10].
Many real-life applications use transactional databases to store data. The aforementioned methods are not suitable for finding outliers in transactional databases. For instance, with a distance-based method it is not always possible to measure the distance between two transactions. Some applications, such as credit card transactions and grocery store transactions, have growing (incremental) transactional data. Thus transaction-based incremental outlier detection is a research issue that deserves further investigation.
Only a limited body of literature has focused on identifying outliers in transactional databases in recent years [11,12,13,14]. They defined the frequent pattern outlier factor (FPOF) to evaluate whether a transaction is an outlier or not. The assumption is that transactions that do not contain frequent itemsets are more likely to be outlier transactions. Narita et al. [11] gave a different definition of an outlier transaction. According to Narita, a transaction is likely to be an outlier if some items are expected to appear in it but are not present. Based on this concept, an outlier degree is defined to evaluate whether a single transaction is an outlier or not.
We observe that there has been little work on stream databases [15,16]. Most commercial databases keep growing over time and exhibit patterns. These patterns are expected to continue in the incremental database as well.
III. FOUNDATION
This section introduces the terms and notation used in this paper. Let $D$ be a transaction database containing transactions $T = \{t_1, t_2, \ldots, t_n\}$, each consisting of a subset of the itemset $I = \{i_1, i_2, \ldots, i_m\}$. $|t|$ denotes the cardinality of transaction $t$. Let $X \subseteq I$ be an itemset; then the support of $X$ on $T$, written $sup(X)$, is defined as

$$sup(X) = \frac{|\{t \mid t \in T \wedge X \subseteq t\}|}{|T|}$$

For a user-specified threshold $min\_sup$, an itemset $X$ such that $sup(X) \geq min\_sup$ is called a frequent itemset [21]. In this paper, we often refer to a frequent itemset as FI for short. The threshold $min\_sup$ is called the minimum support. In the set of all frequent itemsets derived with a minimum support $min\_sup$, an itemset that has no other frequent itemset as a superset is called a maximal frequent itemset [21]. In this paper, we often refer to a maximal frequent itemset as max FI for short.
TABLE I
SAMPLE TRANSACTIONAL DATABASE

TID     Items
t001    a,b,c,d
t002    b,c
t003    a,b,c,d
t004    a,c,d,e,f
t005    a,b,c,d,e,f
t006    a,b,c,d,e
t007    a,c
t008    a,b,c,d,f
t009    a,b,c,d
t010    c,d,f
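
The support definition can be checked against Table I with a few lines of Python. This is a minimal sketch: the names `TABLE_I`, `support` and `is_frequent` are illustrative and do not come from the paper's implementation.

```python
# Table I transcribed as Python sets (TID -> items).
TABLE_I = {
    "t001": set("abcd"), "t002": set("bc"), "t003": set("abcd"),
    "t004": set("acdef"), "t005": set("abcdef"), "t006": set("abcde"),
    "t007": set("ac"), "t008": set("abcdf"), "t009": set("abcd"),
    "t010": set("cdf"),
}


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def is_frequent(itemset, transactions, min_sup):
    """An itemset is frequent when its support reaches the threshold."""
    return support(itemset, transactions) >= min_sup


transactions = list(TABLE_I.values())
print(support("abcd", transactions))           # 0.6
print(is_frequent("ef", transactions, 0.30))   # False (support is 0.2)
```

Six of the ten transactions contain {a,b,c,d}, which gives the 60% support used later for this itemset.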

A. Unseen Items
Let $x$ be a maximal frequent itemset and $t$ be a transaction such that $t \subseteq x$. The items $y \in (x \setminus t)$ are called unseen items: they are expected to appear in transaction $t$ along with the other items of $x$ but are not present.
In Table I, the unseen items for transaction t002 are {a,d}.
$UI_t$ denotes the unseen items of transaction $t$.

TABLE II
MAXIMAL FREQUENT ITEMSETS OF TABLE I WITH min_sup = 30%

ID           Items      Support
max FI001    a,b,c,d    60%
max FI002    a,c,f      30%
max FI003    a,c,d,e    30%
max FI004    a,d,f      30%
max FI005    c,d,f      40%
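
The unseen items of a transaction can be computed from the itemsets listed in Table II. The sketch below assumes the condition on $t$ is read as $t \subseteq x$ and that, when several maximal frequent itemsets contain $t$, the largest difference is kept (as Algorithm 1 does). `MAX_FI` and `unseen_items` are illustrative names.

```python
# Itemsets of Table II, taken as given.
MAX_FI = [frozenset(s) for s in ("abcd", "acf", "acde", "adf", "cdf")]


def unseen_items(t, max_fis):
    """Largest x - t over the itemsets x in `max_fis` that contain t."""
    t = frozenset(t)
    best = frozenset()
    for x in max_fis:
        if t <= x and len(x - t) > len(best):
            best = x - t
    return best


print(sorted(unseen_items("bc", MAX_FI)))   # ['a', 'd']
print(sorted(unseen_items("cdf", MAX_FI)))  # []
```

Transaction t002 = {b,c} is contained only in {a,b,c,d}, so its unseen items are {a,d}. A transaction that equals one of the itemsets, such as t010, has no unseen items.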

B. Outlier Degree
In this subsection we derive the outlier degree of a transaction using maximal frequent itemsets. The outlier degree is used to detect outlier transactions in a transactional database.
Let $t \in T$ be a transaction in a transactional database $D$ and $x$ be a maximal frequent itemset such that $t \subseteq x$. Then the outlier degree of $t$ is given by the formula below.

$$od(t) = \frac{|x \setminus t|}{|x|}$$

The outlier degree ranges between 0 and 1; if $|x| = |t|$, it is equal to 0.
For Table I, the outlier degree of t002 is calculated as follows:
t002 = {b, c}
max FI001 = {a, b, c, d}

$$od(t002) = \frac{|x \setminus t|}{|x|} = \frac{4 - 2}{4} = 0.5$$
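
The formula above and the form used later in Algorithm 1, $|UI_t| / (|t| + |UI_t|)$, give the same value whenever $t \subseteq x$, because $|t| + |x \setminus t| = |x|$. The following sketch illustrates this under that assumption; both function names are hypothetical.

```python
def outlier_degree(t, x):
    """od(t) = |x - t| / |x| for a transaction t contained in itemset x."""
    t, x = set(t), set(x)
    if not t <= x:
        raise ValueError("t must be a subset of x")
    return len(x - t) / len(x)


def outlier_degree_from_unseen(t, unseen):
    """Equivalent form: |UI_t| / (|t| + |UI_t|)."""
    return len(unseen) / (len(t) + len(unseen))


print(outlier_degree("bc", "abcd"))                          # 0.5
print(outlier_degree_from_unseen(set("bc"), set("ad")))      # 0.5
print(outlier_degree("abcd", "abcd"))                        # 0.0
```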
C. Outlier Transaction
When a transaction $t$ satisfies the condition $od(t) \geq min\_od$, it is considered an outlier transaction. Here $min\_od$ is called the minimum outlier degree; it is a threshold specified by the user.
IV. OUTLIER TRANSACTION DETECTION ALGORITHM
Based on the above concept of the outlier degree of a transaction, we present an outlier detection algorithm. The algorithm is divided into two parts: the first part considers a fixed set of transactions, while the second part considers a sequence of transaction databases arriving incrementally and reuses the first part.
A. Algorithm Part 1
Determine outlier transactions from a set of transactions.
Input:
i. The set of transactions $D$
ii. Support $s$
iii. $min\_od$
Output:
Set of outlier transactions.
Step 1: Derive all maximal frequent itemsets for $D$ with the given support value using FP-Growth. Denote them by $\{x_1, x_2, x_3, \ldots, x_m\}$.
Step 2: For each transaction $t_i \in T$, find the unseen items.
Step 3: Find the outlier degree $od(t_i)$ for each transaction.
Step 4: Determine the outlier transactions using the condition $od(t_i) \geq min\_od$.
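
Steps 2 to 4 can be sketched as follows, taking the output of Step 1 as given (here the itemsets of Table II rather than an actual FP-Growth run). The function name `detect_outliers`, the subset condition $t \subseteq x$, and the use of $\geq$ in the threshold test are assumptions of this sketch.

```python
TABLE_I = {
    "t001": "abcd", "t002": "bc", "t003": "abcd", "t004": "acdef",
    "t005": "abcdef", "t006": "abcde", "t007": "ac", "t008": "abcdf",
    "t009": "abcd", "t010": "cdf",
}
TABLE_II = ["abcd", "acf", "acde", "adf", "cdf"]  # Step 1 output, taken as given


def detect_outliers(transactions, max_fis, min_od):
    """Return {TID: outlier degree} for transactions with od >= min_od."""
    max_fis = [frozenset(x) for x in max_fis]
    outliers = {}
    for tid, items in transactions.items():
        t = frozenset(items)
        # Step 2: unseen items = largest x - t among the itemsets containing t.
        unseen = max((x - t for x in max_fis if t <= x),
                     key=len, default=frozenset())
        # Step 3: outlier degree (transactions are assumed non-empty).
        od = len(unseen) / (len(t) + len(unseen))
        # Step 4: threshold test.
        if od >= min_od:
            outliers[tid] = od
    return outliers


print(detect_outliers(TABLE_I, TABLE_II, min_od=0.25))
# {'t002': 0.5, 't007': 0.5}
```

With these inputs only t002 and t007 are reported. Transactions that are not contained in any listed itemset receive an outlier degree of 0 under this reading.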

B. Algorithm Part 2
Let $D'$ be an incremental transaction database containing transactions $T' = \{t'_1, t'_2, t'_3, \ldots, t'_{n'}\}$, each consisting of a subset of the itemset $I = \{i_1, i_2, i_3, \ldots, i_m\}$.
Step 1: Derive all maximal frequent itemsets incrementally for $D'$ with the given support. Denote them by $\{x'_1, x'_2, \ldots, x'_{m'}\}$.
Step 2: Follow Algorithm Part 1 to determine the outlier transactions in the incremental database.
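
The idea of updating counts without rescanning earlier data can be illustrated with a small prefix tree. This is not the incremental FP-Growth algorithm used in the paper: it is a simplified sketch that stores each transaction in a fixed lexicographic item order, so that new batches can be inserted without restructuring the tree. The class and method names are hypothetical.

```python
class Node:
    def __init__(self):
        self.count = 0      # number of transactions passing through this node
        self.children = {}  # item -> Node


class IncrementalPrefixTree:
    def __init__(self):
        self.root = Node()  # root.count is the total number of transactions

    def add_batch(self, transactions):
        """Insert a new batch; previously inserted data is not rescanned."""
        for t in transactions:
            node = self.root
            node.count += 1
            for item in sorted(t):
                node = node.children.setdefault(item, Node())
                node.count += 1

    def support(self, itemset):
        """Support of `itemset` over everything inserted so far."""
        return self._count(self.root, sorted(itemset), 0) / self.root.count

    def _count(self, node, target, i):
        if i == len(target):
            return node.count  # every transaction below contains the target
        total = 0
        for item, child in node.children.items():
            if item == target[i]:
                total += self._count(child, target, i + 1)
            elif item < target[i]:
                total += self._count(child, target, i)
            # item > target[i]: target[i] cannot occur deeper on a sorted path
        return total


tree = IncrementalPrefixTree()
tree.add_batch(["abcd", "bc", "abcd", "acdef", "abcdef"])   # t001..t005
before = tree.support("bc")                                 # 0.8
tree.add_batch(["abcde", "ac", "abcdf", "abcd", "cdf"])     # t006..t010
after = tree.support("bc")                                  # 0.7
print(before, after)
```

The second batch only touches the paths of the new transactions, yet the support of {b,c} reflects all ten transactions of Table I.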
Algorithm 1 Incremental Outlier Detection using Frequent Itemsets
INPUT: Database $D$, incremental dataset $D'$, minimum support $min\_sup$, minimum outlier degree $min\_od$
OUTPUT: Outlier transactions $OT$
Derive all maximal frequent itemsets $maxFI$
for each $t_i \in T$ do
    $UI_{t_i} = \emptyset$
    for each $maxFI_j \in maxFI$ do
        if $t_i \subseteq maxFI_j$ and $|maxFI_j \setminus t_i| > |UI_{t_i}|$ then
            $UI_{t_i} = maxFI_j \setminus t_i$
        end if
    end for
    $od(t_i) = \frac{|UI_{t_i}|}{|t_i| + |UI_{t_i}|}$
    if $od(t_i) \geq min\_od$ then
        $OT = OT \cup \{t_i\}$
    end if
end for
// for the incremental database
Derive all maximal frequent itemsets $maxFI'$ using incremental FP-Growth
for each $t'_i \in T'$ do
    $UI_{t'_i} = \emptyset$
    for each $maxFI_j \in maxFI'$ do
        if $t'_i \subseteq maxFI_j$ and $|maxFI_j \setminus t'_i| > |UI_{t'_i}|$ then
            $UI_{t'_i} = maxFI_j \setminus t'_i$
        end if
    end for
    $od(t'_i) = \frac{|UI_{t'_i}|}{|t'_i| + |UI_{t'_i}|}$
    if $od(t'_i) \geq min\_od$ then
        $OT = OT \cup \{t'_i\}$
    end if
end for
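
The two phases of Algorithm 1 can be sketched end to end on toy data. Two simplifications are assumed here and are not the paper's method: `mine_max_fis` is a brute-force stand-in for FP-Growth (exponential, suitable only for tiny inputs), and the incremental step simply re-mines $D \cup D'$ instead of updating an FP-tree. The data and all names are hypothetical.

```python
from itertools import combinations


def mine_max_fis(transactions, min_sup):
    """Brute-force maximal frequent itemsets (stand-in for FP-Growth)."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    frequent = []
    for k in range(1, len(items) + 1):
        level = [frozenset(c) for c in combinations(items, k)
                 if sum(1 for t in transactions if t.issuperset(c)) / n >= min_sup]
        if not level:  # no frequent k-itemsets, hence no larger ones either
            break
        frequent.extend(level)
    return [x for x in frequent if not any(x < y for y in frequent)]


def outlier_ids(batch, max_fis, min_od):
    """IDs of the (non-empty) transactions in `batch` with od >= min_od."""
    found = []
    for tid, t in batch.items():
        unseen = max((x - t for x in max_fis if t <= x),
                     key=len, default=frozenset())
        if len(unseen) / (len(t) + len(unseen)) >= min_od:
            found.append(tid)
    return found


def iodfi(d, d_new, min_sup, min_od):
    # Phase 1: outliers of the initial database D.
    ot = outlier_ids(d, mine_max_fis(list(d.values()), min_sup), min_od)
    # Phase 2: itemsets of the grown database, applied to the increment D'.
    grown = list(d.values()) + list(d_new.values())
    ot += outlier_ids(d_new, mine_max_fis(grown, min_sup), min_od)
    return ot


D = {"t1": frozenset("abc"), "t2": frozenset("abc"),
     "t3": frozenset("abc"), "t4": frozenset("ab")}
D_NEW = {"t5": frozenset("abc"), "t6": frozenset("c")}
print(iodfi(D, D_NEW, min_sup=0.5, min_od=0.3))  # ['t4', 't6']
```

In this toy run {a,b,c} is the only maximal frequent itemset in both phases, so t4 (missing c) and t6 (missing a and b) are reported.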
For practical implementation we use the FP-Growth algorithm [23]. After the maximal frequent itemsets have been found, the complexity of the algorithm is $O(|T| \cdot |maxFI|)$.
Property 1: Let $t \in T$ be a transaction. For two minimum support values $min\_sup_1$, $min\_sup_2$ such that $min\_sup_1 < min\_sup_2$, let $od_1(t)$, $od_2(t)$ be the outlier degrees of $t$ for $min\_sup_1$ and $min\_sup_2$ respectively. Then $od_1(t) \geq od_2(t)$.
Proof 1: For a set of transactions $T$, let $maxFI_1$, $maxFI_2$ be the sets of maximal frequent itemsets for $min\_sup_1$ and $min\_sup_2$ respectively. When $min\_sup_1 < min\_sup_2$, every itemset in $maxFI_2$ is also frequent at $min\_sup_1$ and is therefore contained in some itemset of $maxFI_1$. For a transaction $t$, let $UI_{t_1}$, $UI_{t_2}$ be the unseen items of $t$ for $min\_sup_1$ and $min\_sup_2$ respectively. Then $|UI_{t_1}| \geq |UI_{t_2}|$, because each itemset of $maxFI_2$ that contains $t$ is covered by an itemset of $maxFI_1$, while $|t|$ is constant. For a constant $c > 0$ and a variable $x$, $\frac{x}{c+x}$ is a monotonically increasing function. Therefore, for a fixed $|t|$, $od_1(t)$ is not less than $od_2(t)$. Hence, when $min\_sup_1 < min\_sup_2$, $od_1(t) \geq od_2(t)$.
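
The monotonicity step used in the proof can be checked directly. The derivation below is a sketch that assumes $c = |t| > 0$ is fixed and that $x = |UI_t|$ takes non-negative integer values; this identification of $c$ and $x$ is one reading of the proof.

```latex
\[
\frac{x+1}{c+x+1} - \frac{x}{c+x}
  = \frac{(x+1)(c+x) - x(c+x+1)}{(c+x)(c+x+1)}
  = \frac{c}{(c+x)(c+x+1)} > 0 .
\]
```

Each additional unseen item therefore strictly increases the outlier degree. For $|t| = 2$, for example, the values for $x = 0, 1, 2$ are $0$, $1/3$ and $1/2$.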
V. EXPERIMENT
Extensive experiments were carried out on a machine with an Intel(R) Core(TM)2 Duo 3.33GHz CPU and 2GB of main memory running Windows 7. The algorithms in this paper were implemented in Java using Eclipse Luna 4.4.2. We first describe the datasets used in the experiments.
A. Data Sets
We used two datasets, Abalone, a real dataset, and Synthetic, a synthetic dataset, collected from the UCI Machine Learning Repository [24]. Because our algorithm deals with transaction data, we converted this numerical data into transaction data. For the conversion we calculated the mean value of each column and marked an entry as 1 if its value is greater than the column mean, and as 0 otherwise. An entry of 1 in a row means that the item is present in that transaction.
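
The conversion rule can be sketched as follows. The column names and values are made up for illustration and are not taken from the datasets used in the experiments; `to_transactions` is a hypothetical helper.

```python
def to_transactions(rows, columns):
    """Turn numeric rows into transactions: a column's item is present
    when the row's value in that column exceeds the column mean."""
    n = len(rows)
    means = [sum(row[j] for row in rows) / n for j in range(len(columns))]
    return [
        frozenset(col for j, col in enumerate(columns) if row[j] > means[j])
        for row in rows
    ]


COLUMNS = ["length", "weight"]                     # hypothetical column names
ROWS = [[1.0, 30.0], [3.0, 10.0], [5.0, 20.0]]     # column means: 3.0 and 20.0
print([sorted(t) for t in to_transactions(ROWS, COLUMNS)])
# [['weight'], [], ['length']]
```

Values equal to the mean are marked 0, following the "more than the mean" rule, so the second row becomes an empty transaction.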
B. Correctness
We determined the outliers from the entire transaction dataset and called them true outliers. We then divided the dataset into multiple parts to find the outliers incrementally. These incrementally determined outliers were then matched against the true outliers. We define correctness by the following formula:

$$correctness = \frac{\#\,\text{matched outliers}}{\#\,\text{all true outliers}}$$
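
The correctness measure can be computed from two sets of transaction IDs. The sketch below uses made-up IDs and a hypothetical function name; it returns a ratio between 0 and 1.

```python
def correctness(incremental_outliers, true_outliers):
    """Share of the true outliers that were also found incrementally."""
    true_outliers = set(true_outliers)
    if not true_outliers:
        raise ValueError("correctness is undefined without true outliers")
    matched = true_outliers & set(incremental_outliers)
    return len(matched) / len(true_outliers)


true_ids = {"t002", "t007", "t010"}          # outliers found on the full data
incremental_ids = {"t002", "t010", "t004"}   # outliers found part by part
print(round(correctness(incremental_ids, true_ids), 4))  # 0.6667
```

Outliers reported incrementally that are not true outliers (t004 here) do not affect this measure; only the matched share of the true outliers counts.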

TABLE III
CORRECTNESS ON SYNTHETIC DATA, min_od = 0.25

min_sup    Correctness (Synthetic, %)
0.1070     57.58
0.0500     69.66
0.0285     82.84
0.0142     92.82
0.0107     93.21

TABLE IV
CORRECTNESS ON ABALONE DATA, min_od = 0.25

min_sup    Correctness (Abalone, %)
0.3591     89.34
0.3710     83.21
0.3830     80.84
0.4069     77.82

VI. CONCLUSIONS
The traditional Apriori-based approaches are not appropriate for finding outliers in incremental data. These approaches rescan the original database to check whether an itemset remains frequent whenever new transactions are added. Incremental outlier detection provides a more efficient way to detect outliers over incremental data because it avoids rescanning data that has already been processed, although mining incremental databases is more complicated than mining static transaction databases.
In this paper we proposed a new algorithm for finding outliers in incremental data based on frequent itemsets. FP-trees are used to monitor frequent itemsets in incremental data. The tree generated in the previous data scan is updated in the next step with the new data increment, which saves the time needed to rescan the large dataset. Our experimental results show that the outliers computed incrementally in this way are qualitatively comparable to the outliers computed by the Apriori-based approach. The results also provide evidence that the proposed algorithm performs well in terms of both accuracy and precision.
REFERENCES
[1] Yu, Wen-Fang, and Na Wang. Research on credit card fraud detection
model based on distance sum. Artificial Intelligence, 2009. JCAI'09.
International Joint Conference on. IEEE, 2009.
[2] Konijn, Rob M., and Wojtek Kowalczyk. Finding fraud in health insurance data with two-layer outlier detection approach. Data Warehousing
and Knowledge Discovery. Springer Berlin Heidelberg, 2011. 394-405.
[3] Lopez-Rojas, Edgar Alonso, and Stefan Axelsson. Money Laundering
Detection using Synthetic Data. The 27th annual workshop of the
Swedish Artificial Intelligence Society (SAIS). 2012.
[4] Lin, Xiwu, et al. Validation of multivariate outlier detection analyses
used to identify potential drug-induced liver injury in clinical trial
populations. Drug safety 35.10 (2012): 865-875.
[5] Pachgade, Ms SD, and Ms SS Dhande. Outlier detection over data set
using cluster-based and distance-based approach. International Journal
of Advanced Research in Computer Science and Software Engineering
2.4 (2012).
[6] Borne, Kirk. Outlier Detection Gets a Makeover-Surprise Discovery in
Scientific Big Data. (2014).
[7] Zaki, Mohammed J., and Wagner Meira Jr. Data Mining and Analysis:
Fundamental Concepts and Algorithms. Cambridge University Press,
2014.
[8] Han, Jiawei, Micheline Kamber, and Jian Pei. Data mining, southeast
asia edition: Concepts and techniques. Morgan kaufmann, 2006.
[9] Wang, Bin, et al. Distance-based outlier detection on uncertain data.
Computer and Information Technology, 2009. CIT'09. Ninth IEEE International Conference on. Vol. 1. IEEE, 2009.
[10] Tao, Yunxin, and Dechang Pi. Unifying density-based clustering and
outlier detection. Knowledge Discovery and Data Mining, 2009. WKDD
2009. Second International Workshop on. IEEE, 2009.
[11] Narita, Kazuyo, and Hiroyuki Kitagawa. Outlier detection for transaction databases using association rules. Web-Age Information Management, 2008. WAIM'08. The Ninth International Conference on. IEEE,
2008.
[12] Bouguessa, Mohamed. Unsupervised Anomaly Detection in Transactional Data. Machine Learning and Applications (ICMLA), 2012 11th
International Conference on. Vol. 1. IEEE, 2012.
[13] Kao, Li-Jen, and Yo-Ping Huang. An efficient strategy to detect outlier
transactions for knowledge mining. Systems, Man, and Cybernetics
(SMC), 2011 IEEE International Conference on. IEEE, 2011.
[14] He, Zengyou, et al. FP-outlier: Frequent pattern based outlier detection. Computer Science and Information Systems 2.1 (2005): 103-118.
[15] Pradeepini, G., and S. Jyothi. Tree-based incremental association rule
mining without candidate itemset generation. Trendz in Information
Sciences & Computing (TISC), 2010. IEEE, 2010.
[16] Siqueira, Adriano A. Veloso Gustavo M., et al. An Efficient Incremental
Association Rule Mining Algorithm.
[17] Aggarwal, Charu C., and Philip S. Yu. Outlier detection for high
dimensional data. ACM Sigmod Record. Vol. 30. No. 2. ACM, 2001.
[18] Angiulli, Fabrizio, and Fabio Fassetti. Detecting distance-based outliers in streams of data. Proceedings of the sixteenth ACM conference
on Conference on information and knowledge management. ACM, 2007.
[19] Basu, Sabyasachi, and Martin Meckesheimer. Automatic outlier detection for time series: an application to sensor data. Knowledge and
Information Systems 11.2 (2007): 137-154.
[20] Curiac, Daniel-Ioan, et al. Malicious Node Detection in Wireless Sensor
Networks Using an Autoregression Technique. ICNS 7 (2007): 83-88.
[21] Agrawal, R., and Ramakrishnan Srikant. Fast Algorithms for Mining
Association Rules in Large Data Bases. 20th International Conference
on Very Large Databases, Santiago. 1994.
[22] Burdick, Douglas, Manuel Calimlim, and Johannes Gehrke. MAFIA:
A maximal frequent itemset algorithm for transactional databases. Data
Engineering, 2001. Proceedings. 17th International Conference on. IEEE,
2001.
[23] Han, Jiawei, Jian Pei, and Yiwen Yin. Mining frequent patterns without
candidate generation. ACM SIGMOD Record. Vol. 29. No. 2. ACM,
2000.
[24] Lichman, M. (2013). UCI Machine Learning Repository
[http://archive.ics.uci.edu/ml]. Irvine, CA: University of California,
School of Information and Computer Science.
