Navneet Harinkhede
I. INTRODUCTION
Outlier detection is an important problem with applications in
many domains, e.g., commercial (credit card, insurance, money
laundering [1,2,3]), medical (drug discovery, diagnosis [4,5]),
scientific (new discovery [6]), and security. An outlier is an
element that does not follow a given pattern. Hawkins [7]
defined it as follows: "An outlier is an observation which deviates so
much from the other observations as to arouse suspicions that
it was generated by a different mechanism." Detecting such
outliers is often useful in data mining applications.
Existing outlier detection algorithms work on static data
sets, scanning the entire data set to find the global outliers.
Data sets used in data stream applications are unbounded. It is
impractical to scan the entire data set, since requests need a
quick response. Also, multiple data scans are
needed for different applications. These two reasons call for
more accurate and more efficient techniques.
There has been some work on outlier detection in data
streams [18,19,20]. Novel and compact data structures were
proposed to avoid scanning the entire data stream. For example, a sliding window was proposed to define the current data
set, so that local outliers can be detected inside it [18,19].
Autoregression-based methods were also proposed for
outlier detection in stream data [20].
Abhinav Tiwari
Indian Institute of Technology Delhi
Computer Science
Email: jca132381@maths.iitd.ac.in

In this paper, an algorithm is proposed that employs frequent
itemsets to check incremental data and find the embedded
outliers. Outlier transactions are detected according to
the mined frequent itemsets. For instance, let D be a transactional database containing transactions T = {t1, t2, ..., tn},
each a subset of the item set I = {i1, i2, ..., im}. Let x =
{x1, x3, x4, x5} be a maximal frequent itemset derived
from the dataset D. A transaction <x1, x3> is treated as
abnormal if the items {x4, x5} are expected to appear along
with the items {x1, x3} in the database D but are not present.
That is, a transaction that is expected to contain some items
that did not actually appear is an outlier.
This paper presents a novel algorithm termed IODFI (Incremental Outlier Detection using Frequent Itemset). It allows
the user to detect outlier transactions in incremental data
by finding frequent itemsets incrementally (without rescanning
the previous data). In the commercial world, data keeps growing
incrementally, and the frequent itemsets keep changing
with it. To find frequent itemsets over the entire dataset, we
use an incremental FP-Growth algorithm.
The remaining sections of this paper are organized as
follows. Section II surveys the literature on outlier
detection. Section III describes the contribution of this paper.
Section IV describes the proposed method, Section V presents
the experimental results, and Section VI concludes the paper.
II. OUTLIER DETECTION: LITERATURE SURVEY
A significant amount of work has been done on outlier detection, since it finds applications in various domains [7,8,9,10].
This has resulted in various techniques, each suited to
specific domains. In statistical distribution-based outlier
detection [8], a point p of a set D is considered an outlier if it deviates substantially
from the distribution assumed for D. Another methodology was proposed by Wang
et al. [9], in which the distance from a data point p to the other points
determines whether p is an outlier: p is marked as an outlier if a
sufficiently large percentage of the other points in D lie farther from p
than a predefined distance. Another approach compares the
density around a point with the density around its local neighbors
to identify outliers [10].
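
The distance-based criterion described above can be illustrated with a minimal sketch. It is not the method of Wang et al. [9], which targets uncertain data; it only shows the basic percentage-over-distance test. The function name and the parameters `dist` and `frac` are hypothetical, and Euclidean distance is an assumption.

```python
import math


def distance_based_outliers(points, dist, frac):
    """Return the points p for which at least `frac` of the other
    points lie farther than `dist` from p (Euclidean distance assumed)."""
    outliers = []
    for i, p in enumerate(points):
        others = [q for j, q in enumerate(points) if j != i]
        if not others:
            continue  # a single point has no neighbours to compare against
        far = sum(1 for q in others if math.dist(p, q) > dist)
        if far / len(others) >= frac:
            outliers.append(p)
    return outliers


# Hypothetical toy data: four points near the origin and one far away.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.1, 0.1), (5.0, 5.0)]
result = distance_based_outliers(points, dist=1.0, frac=0.9)
print(result)
```

The sketch needs a numeric distance between any two records, which is exactly what is not always available for transactions.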
Many real-life applications use transactional databases to
store data. The aforementioned methods are not suitable for finding outliers in a transactional database. For instance, with a distance-based method it is not always possible to measure the distance
between two transactions. There are some applications, like
credit card transactions and grocery store transactions, that have
TABLE I
TID     Items
t001    a,b,c,d
t002    b,c
t003    a,b,c,d
t004    a,c,d,e,f
t005    a,b,c,d,e,f
t006    a,b,c,d,e
t007    a,c
t008    a,b,c,d,f
t009    a,b,c,d
t010    c,d,f
A. Unseen Items
Let $x$ be a maximal frequent itemset and $t$ be a transaction
such that $t \subset x$. The items $y \in (x \setminus t)$ are called unseen items
if they are expected to appear in the transaction $t$ along with
the rest of $x$ but are not present.
In Table I, the unseen items for transaction t002 are {a,d}.
$UI_t$ denotes the unseen items of transaction $t$.
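
The definition above can be checked on the example with a short sketch. The function name `unseen_items` is hypothetical, and returning an empty set when the transaction is not contained in the itemset is an assumption, since the text only defines unseen items for contained transactions.

```python
def unseen_items(transaction, max_itemset):
    """Items of the maximal frequent itemset that are missing from a
    transaction contained in it. Returns an empty set when the
    transaction is not a subset of the itemset (an assumption)."""
    t, x = set(transaction), set(max_itemset)
    if not t <= x:
        return set()
    return x - t


# Transaction t002 = {b, c} against the maximal frequent itemset {a, b, c, d}.
ui_t002 = unseen_items({"b", "c"}, {"a", "b", "c", "d"})
print(sorted(ui_t002))
```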
TABLE II
MAXIMAL FREQUENT ITEMSETS OF TABLE I WITH min_sup = 30%
ID           Items      Support
max_FI001    a,b,c,d    60%
max_FI002    a,c,f      30%
max_FI003    a,c,d,e    30%
max_FI004    a,d,f      30%
max_FI005    c,d,f      40%
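
The support values in Table II can be recomputed from the ten transactions of Table I. The following is a minimal sketch of plain support counting, not the incremental FP-Growth procedure used in the paper; the names `TABLE_I` and `support` are hypothetical.

```python
# The ten transactions of Table I, in order.
TABLE_I = [
    {"a", "b", "c", "d"},
    {"b", "c"},
    {"a", "b", "c", "d"},
    {"a", "c", "d", "e", "f"},
    {"a", "b", "c", "d", "e", "f"},
    {"a", "b", "c", "d", "e"},
    {"a", "c"},
    {"a", "b", "c", "d", "f"},
    {"a", "b", "c", "d"},
    {"c", "d", "f"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


for items in ("abcd", "acf", "acde", "adf", "cdf"):
    print(items, f"{support(items, TABLE_I):.0%}")
```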
B. Outlier Degree
In this subsection we derive the outlier degree of a transaction
using maximal frequent itemsets. The outlier degree is used to
detect outlier transactions in a transactional database.
Let $t \in T$ be a transaction in a transactional database $D$ and $x$
be a maximal frequent itemset such that $t \subset x$. Then the outlier
degree of $t$ is given by the formula below.
$$od(t) = \frac{|x \setminus t|}{|x|}$$
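
As a worked instance, the sketch below evaluates the outlier degree as the share of a maximal frequent itemset that is missing from a transaction contained in it. The function name is hypothetical, and raising an error for transactions not contained in the itemset is an assumption, since the formula is only stated for contained transactions.

```python
def outlier_degree(transaction, max_itemset):
    """Share of the maximal frequent itemset x that is absent from t.
    Only defined here when t is a subset of x (an assumption)."""
    t, x = set(transaction), set(max_itemset)
    if not x:
        raise ValueError("the maximal frequent itemset must not be empty")
    if not t <= x:
        raise ValueError("transaction is not contained in the itemset")
    return len(x - t) / len(x)


x = {"a", "b", "c", "d"}               # max_FI001 from Table II
od_t002 = outlier_degree({"b", "c"}, x)  # two of four items are unseen
od_t001 = outlier_degree(x, x)           # nothing is unseen
print(od_t002, od_t001)
```

For t002 = {b,c} the unseen items are {a,d}, so the degree is 2/4 = 0.5, while a transaction equal to the itemset has degree 0.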
B. Algorithm Part 2
0
$$\frac{\#\,\text{matched outliers}}{\#\,\text{of all true outliers}}$$
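
The ratio above can be computed as in the following sketch. It assumes that "matched outliers" means detected transactions that are also true outliers, and that the result is reported as a percentage; both are assumptions, as are the function name and the transaction IDs in the example.

```python
def correctness(detected, true_outliers):
    """Percentage of true outliers that were also detected.
    Scaling by 100 is an assumption about how the ratio is reported."""
    detected, true_outliers = set(detected), set(true_outliers)
    if not true_outliers:
        raise ValueError("there must be at least one true outlier")
    matched = detected & true_outliers
    return 100.0 * len(matched) / len(true_outliers)


# Hypothetical IDs, for illustration only.
detected = {"t002", "t007", "t010"}
true_outliers = {"t002", "t005", "t007", "t009"}
score = correctness(detected, true_outliers)
print(score)
```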
TABLE III
CORRECTNESS OF SYNTHETIC DATA, min_od = 0.25
min_sup    Correctness (Synthetic)
0.1070     57.58
0.0500     69.66
0.0285     82.84
0.0142     92.82
0.0107     93.21
TABLE IV
CORRECTNESS OF ABALONE DATA, min_od = 0.25
min_sup    Correctness (Abalone)
0.3591     89.34
0.3710     83.21
0.3830     80.84
0.4069     77.82
VI. CONCLUSIONS
Traditional Apriori-based approaches are not well suited
to finding outliers in incremental data. These approaches
rescan the original database to check whether an itemset
remains frequent whenever new transactions are added. Incremental outlier detection provides a more efficient way to
detect outliers over incremental data, as it avoids rescanning
data that has already been processed. However, mining incremental databases is more
complicated than mining static transaction databases.
In this paper we proposed a new algorithm to find
outliers in incremental data based on frequent itemsets. FP-trees are used to monitor frequent itemsets in incremental
data. The tree generated in the previous data scan is updated
in the next step with the new data increment. This saves
the time needed to rescan the large dataset. Our experimental results
show that the incremental outliers calculated this way are
qualitatively comparable to the outliers calculated by the Apriori-based approach. The results also provide evidence
that the proposed algorithm performs well in terms of both accuracy and
precision.
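
The idea of updating a tree with each new increment instead of rescanning old data can be sketched as follows. This is a simplified prefix tree, not the incremental FP-Growth algorithm used in the paper: items are inserted in alphabetical order (an assumption; FP-trees normally order items by frequency), there are no header links, and all class and method names are hypothetical.

```python
class Node:
    def __init__(self):
        self.count = 0       # transactions passing through this node
        self.children = {}   # item -> Node


class PrefixTree:
    def __init__(self):
        self.root = Node()
        self.n_transactions = 0

    def add(self, transaction):
        """Insert one transaction along its alphabetically sorted path."""
        node = self.root
        for item in sorted(set(transaction)):
            node = node.children.setdefault(item, Node())
            node.count += 1
        self.n_transactions += 1

    def update(self, increment):
        """Fold in a new batch without touching earlier transactions."""
        for transaction in increment:
            self.add(transaction)

    def support_count(self, itemset):
        wanted = sorted(set(itemset))
        if not wanted:
            return self.n_transactions
        return self._count(self.root, wanted)

    def _count(self, node, wanted):
        total = 0
        for item, child in node.children.items():
            if item == wanted[0]:
                if len(wanted) == 1:
                    total += child.count
                else:
                    total += self._count(child, wanted[1:])
            elif item < wanted[0]:
                # wanted[0] may still occur deeper on this path
                total += self._count(child, wanted)
            # item > wanted[0]: paths are sorted, so wanted[0] cannot follow
        return total


tree = PrefixTree()
tree.update(["abcd", "bc", "abcd", "acdef", "abcdef"])   # first batch (Table I)
tree.update(["abcde", "ac", "abcdf", "abcd", "cdf"])     # later increment
print(tree.support_count("abcd"), tree.support_count("cdf"))
```

After the second `update` call, the counts cover all ten transactions of Table I even though the first batch was read only once.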
REFERENCES
[1] Yu, Wen-Fang, and Na Wang. Research on credit card fraud detection
model based on distance sum. Artificial Intelligence, 2009. JCAI '09.
International Joint Conference on. IEEE, 2009.
[2] Konijn, Rob M., and Wojtek Kowalczyk. Finding fraud in health insurance data with two-layer outlier detection approach. Data Warehousing
and Knowledge Discovery. Springer Berlin Heidelberg, 2011. 394-405.
[3] Lopez-Rojas, Edgar Alonso, and Stefan Axelsson. Money Laundering
Detection using Synthetic Data. The 27th annual workshop of the
Swedish Artificial Intelligence Society (SAIS). 2012.
[4] Lin, Xiwu, et al. Validation of multivariate outlier detection analyses
used to identify potential drug-induced liver injury in clinical trial
populations. Drug safety 35.10 (2012): 865-875.
[5] Pachgade, Ms SD, and Ms SS Dhande. Outlier detection over data set
using cluster-based and distance-based approach. International Journal
of Advanced Research in Computer Science and Software Engineering
2.4 (2012).
[6] Borne, Kirk. Outlier Detection Gets a Makeover-Surprise Discovery in
Scientific Big Data. (2014).
[7] Zaki, Mohammed J., and Wagner Meira Jr. Data Mining and Analysis:
Fundamental Concepts and Algorithms. Cambridge University Press,
2014.
[8] Han, Jiawei, Micheline Kamber, and Jian Pei. Data Mining, Southeast
Asia Edition: Concepts and Techniques. Morgan Kaufmann, 2006.
[9] Wang, Bin, et al. Distance-based outlier detection on uncertain data.
Computer and Information Technology, 2009. CIT '09. Ninth IEEE International Conference on. Vol. 1. IEEE, 2009.
[10] Tao, Yunxin, and Dechang Pi. Unifying density-based clustering and
outlier detection. Knowledge Discovery and Data Mining, 2009. WKDD
2009. Second International Workshop on. IEEE, 2009.
[11] Narita, Kazuyo, and Hiroyuki Kitagawa. Outlier detection for transaction databases using association rules. Web-Age Information Management, 2008. WAIM '08. The Ninth International Conference on. IEEE,
2008.
[12] Bouguessa, Mohamed. Unsupervised Anomaly Detection in Transactional Data. Machine Learning and Applications (ICMLA), 2012 11th
International Conference on. Vol. 1. IEEE, 2012.
[13] Kao, Li-Jen, and Yo-Ping Huang. An efficient strategy to detect outlier
transactions for knowledge mining. Systems, Man, and Cybernetics
(SMC), 2011 IEEE International Conference on. IEEE, 2011.
[14] He, Zengyou, et al. FP-outlier: Frequent pattern based outlier detection. Computer Science and Information Systems 2.1 (2005): 103-118.
[15] Pradeepini, G., and S. Jyothi. Tree-based incremental association rule
mining without candidate itemset generation. Trendz in Information
Sciences & Computing (TISC), 2010. IEEE, 2010.