Agenda

Abstract
Introduction
Preliminaries
Drifting concept detection
Clustering relationship analysis
Experimental results
Conclusions
Abstract
The problem of allocating unlabeled data
points into the proper clusters remains
a challenging issue in the categorical
domain.

Abstract
In this paper, a mechanism named MAximal
Resemblance Data Labeling (abbreviated as
MARDL) is proposed to allocate each
unlabeled data point into the corresponding
appropriate cluster.
MARDL has two advantages:
1) MARDL exhibits high execution efficiency, and
2) MARDL can achieve high intracluster similarity
and low intercluster similarity
Introduction
As the concepts behind the data evolve with
time, the underlying clusters may also
change considerably with time.
Previous works on clustering categorical
data focus on clustering the entire data
set and do not take drifting concepts
into consideration.
The problem of clustering time-evolving data
in the categorical domain remains a
challenging issue.
Introduction
The framework relies on a practical
categorical clustering representative named
Node Importance Representative (NIR).
NIR represents clusters by measuring
the importance of each attribute value
in the clusters.
Introduction
Based on NIR, we propose Drifting
Concept Detection (DCD).
In DCD, the incoming categorical data
points in the current sliding window are first
allocated to the corresponding proper
clusters of the last clustering result.
If the distribution has changed (exceeding
some criteria), the concepts are said to drift.
Introduction
The framework presented in this paper
not only detects the drifting concepts in
the categorical data but also explains
the drifting concepts by analyzing the
relationship between clustering results
at different times.
The analysis algorithm is named
Cluster Relationship Analysis (CRA).
Preliminaries
The problem of clustering categorical
time-evolving data is formulated as follows:
Given a series of categorical data points D,
each data point $p_j$ is a vector of $q$ attribute values,
one for each categorical attribute $A_1, \dots, A_q$.
Preliminaries
1. Suppose that the window size N is also given. The data set D is separated
into several continuous subsets $S^t$, each containing N data points.
2. The superscript t is the identification number of the sliding window;
t is also called the time stamp in this paper.
For example, the first N data points in D are located in the first subset $S^1$.
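The windowing step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_into_windows`, the tuple representation of data points, and the toy data set are hypothetical and do not come from the paper.

```python
from typing import List, Sequence, Tuple

DataPoint = Tuple[str, ...]  # one categorical value per attribute


def split_into_windows(data: Sequence[DataPoint], n: int) -> List[List[DataPoint]]:
    """Split D into consecutive subsets S^1, S^2, ... of at most n points each."""
    if n <= 0:
        raise ValueError("window size N must be positive")
    return [list(data[i:i + n]) for i in range(0, len(data), n)]


# Hypothetical toy data set: 12 points with q = 3 attributes.
D = [("A", "M", "C")] * 5 + [("X", "M", "P")] * 5 + [("B", "E", "G")] * 2
windows = split_into_windows(D, n=5)  # windows[0] plays the role of S^1
```

The last window may hold fewer than N points when |D| is not a multiple of N; how the original framework treats such a remainder is not stated in the slides.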
Preliminaries
The objective of the framework is to
perform clustering on the data set D,
consider the drifting concepts between $S^t$
and $S^{t+1}$, and analyze the relationship
between different clustering results.
Preliminaries
The basic idea behind NIR is to represent a
cluster as the distribution of the attribute
values, which are called nodes.
The importance of a node is evaluated based
on the following two concepts:
The node is important in the cluster when
the frequency of the node is high in this
cluster.
The node is important in the cluster if the
node appears prevalently in this cluster
rather than in other clusters.
Preliminaries
Definition 1 (node). A node, $I_r$, is defined as
attribute name + attribute value.
If both the age and the weight are in the range 50-59,
the attribute value 50-59 is ambiguous when it is
separated from the attribute name.
Nodes [age = 50-59] and [weight = 50-59] avoid this
ambiguity.
Preliminaries
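Definition 2 (node importance). The importance
value of the node $I_{ir}$ is calculated by the
following equations:

$$w(c_i, I_{ir}) = \frac{|I_{ir}|}{m_i} \cdot f(I_r)$$

$$f(I_r) = 1 - \frac{-1}{\log k} \sum_{y=1}^{k} p(I_{yr}) \log\big(p(I_{yr})\big)$$

where

$$p(I_{yr}) = \frac{|I_{yr}|}{\sum_{z=1}^{k} |I_{zr}|}$$

Here $|I_{ir}|$ is the number of occurrences of node $I_r$ in cluster $c_i$,
$m_i$ is the number of data points in $c_i$, and $k$ is the number of clusters.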
Preliminaries
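Example: node $[A_1 = A]$ with $k = 2$ clusters, $c_1^1$ (3 data points) and $c_2^1$ (2 data points).
The node occurs 3 times in $c_1^1$ and 0 times in $c_2^1$ (taking $0 \log 0 = 0$):

$$f([A_1 = A]) = 1 - \frac{-1}{\log 2} \left( \frac{3}{3} \log \frac{3}{3} + \frac{0}{3} \log \frac{0}{3} \right) = 1$$

$$w(c_1^1, [A_1 = A]) = \frac{3}{3} \cdot 1 = 1$$

$$w(c_2^1, [A_1 = A]) = \frac{0}{2} \cdot 1 = 0$$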
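As a rough sketch, the node importance computation of Definition 2 could be coded as follows. The function name `nir_table`, the `(attribute index, value)` encoding of a node, and the two toy clusters are assumptions made for illustration; the toy data are hypothetical and were chosen only so that node [A_1 = A] occurs 3 times in the first cluster and never in the second.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Node = Tuple[int, str]        # (attribute index, attribute value)
DataPoint = Tuple[str, ...]


def nir_table(clusters: List[List[DataPoint]]) -> List[Dict[Node, float]]:
    """Return one {node: importance} dict per cluster (clusters must be non-empty)."""
    k = len(clusters)
    # |I_ir|: number of occurrences of each node in each cluster.
    counts = [Counter((a, v) for p in c for a, v in enumerate(p)) for c in clusters]
    nodes = set().union(*counts)
    table: List[Dict[Node, float]] = [{} for _ in clusters]
    for node in nodes:
        total = sum(cnt[node] for cnt in counts)
        # Entropy of the node's distribution over clusters, with 0 log 0 = 0.
        entropy = 0.0
        for cnt in counts:
            p = cnt[node] / total
            if p > 0:
                entropy -= p * math.log(p)
        # Assumption: with a single cluster the entropy term is dropped (f = 1).
        f = 1.0 - entropy / math.log(k) if k > 1 else 1.0
        for i, cnt in enumerate(counts):
            table[i][node] = cnt[node] / len(clusters[i]) * f
    return table


# Hypothetical toy clusters (3 and 2 data points, q = 3 attributes).
c1 = [("A", "M", "C"), ("A", "M", "D"), ("A", "M", "C")]
c2 = [("X", "M", "P"), ("Y", "M", "P")]
table = nir_table([c1, c2])
```

With these toy clusters, `table[0][(0, "A")]` is 1 and `table[1][(0, "A")]` is 0, in line with the worked example. A node that appears in every data point, such as `(1, "M")`, gets a weight close to 0 because it does not discriminate between clusters.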
Drifting Concept Detection
The objective of the DCD algorithm is to detect
the difference of cluster distributions between
the current data subset $S^t$ and the last
clustering result $C^{[t_e, t-1]}$, and to decide whether
reclustering is required in $S^t$.
Drifting Concept Detection
The goal of data labeling is to decide the most
appropriate cluster label for each incoming data
point.
Definition 3 (resemblance and maximal
resemblance). Given a data point $p_j$ and an NIR
table of clusters $c_i$, the data point $p_j$ is labeled to
the cluster that obtains the maximal resemblance:

$$R(p_j, c_i) = \sum_{r=1}^{q} w(c_i, I_{ir})$$

where the sum runs over the $q$ nodes of $p_j$.
Drifting Concept Detection
When a data point contains nodes that are more
important in cluster $c_x$ than in cluster $c_y$,
$R(p_j, c_x)$ will be larger than $R(p_j, c_y)$.
If the maximal resemblance (for the most
appropriate cluster) is smaller than the threshold
$\lambda_i$ of that cluster, the data point is seen as an
outlier.

$$\text{Label} = \begin{cases} c_i^*, & \text{if } \max R(p_j, c_i) \ge \lambda_i, \text{ where } 1 \le i \le k, \\ \text{outliers}, & \text{otherwise.} \end{cases}$$
Drifting Concept Detection

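Example with threshold = 0.5:

$p_6 = (B, E, G)$: $R(p_6, c_1^1) = 0$ and $R(p_6, c_2^1) = 0$.
Both values are below the threshold, so the data point is an outlier.

$p_7 = (X, M, P)$: $R(p_7, c_1^1) = 0.029$ and $R(p_7, c_2^1) = 0.5 + 0.029 + 1 = 1.529$.
Since $1.529 > 0.029$ and $1.529$ is above the threshold, the data point is labeled to $c_2^1$.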
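The labeling rule can be illustrated with a short Python sketch. The function names `resemblance` and `label`, the dictionary form of the NIR table, and the use of `None` for outliers are assumptions for illustration. The table entries are limited to the node weights that appear in the example (1, 0.5, 0.029); any other node is treated as weight 0.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Node = Tuple[int, str]        # (attribute index, attribute value)
DataPoint = Tuple[str, ...]


def resemblance(point: DataPoint, nir: Dict[Node, float]) -> float:
    """R(p_j, c_i): sum of the importance of the point's nodes in cluster c_i."""
    return sum(nir.get((a, v), 0.0) for a, v in enumerate(point))


def label(point: DataPoint,
          nir_table: List[Dict[Node, float]],
          thresholds: Sequence[float]) -> Optional[int]:
    """Index of the cluster with maximal resemblance, or None for an outlier."""
    scores = [resemblance(point, nir) for nir in nir_table]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] >= thresholds[best] else None


# Hypothetical, partial NIR tables for the two clusters of the example.
nir_c1 = {(0, "A"): 1.0, (1, "M"): 0.029}
nir_c2 = {(0, "X"): 0.5, (1, "M"): 0.029, (2, "P"): 1.0}
nir = [nir_c1, nir_c2]
thresholds = [0.5, 0.5]

p6 = ("B", "E", "G")
p7 = ("X", "M", "P")
label_p6 = label(p6, nir, thresholds)  # None: no node of p6 occurs in either cluster
label_p7 = label(p7, nir, thresholds)  # 1: the second cluster wins
```

Returning `None` for outliers keeps outliers distinct from valid cluster indices, which is convenient when counting them afterwards.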
Drifting Concept Detection
The clustering results are said to be
different according to the following two
criteria:
The clustering results are different if
a large number of outliers are found by
data labeling.
The clustering results are different if
a large number of clusters vary in
their ratio of data points.
Drifting Concept Detection
There are three outliers in $S^2$, and the ratio of
outliers in $S^2$ is $3/5 = 0.6 > \theta = 0.4$.
Therefore, $S^2$ is considered a concept-drifting
window and will be reclustered.
Drifting Concept Detection
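The ratio of outliers in $S^3$ is $1/5 = 0.2 < \theta = 0.4$.
However, the variation of the ratio of data points between clusters is large:

$$\left| \frac{2}{5} - \frac{4}{5} \right| = 0.4 > \varepsilon = 0.3 \quad \text{between } c_1^{2'} \text{ and } c_1^{3}$$

$$\left| \frac{3}{5} - \frac{0}{5} \right| = 0.6 > \varepsilon = 0.3 \quad \text{between } c_2^{2'} \text{ and } c_2^{3}$$

$$\frac{d(c_1^{2'}, c_1^{3}) + d(c_2^{2'}, c_2^{3})}{2} = \frac{2}{2} = 1 > \eta = 0.5 \quad \text{between } C^{2'} \text{ and } C^{3}$$

$S^3$ is also considered a
concept-drifting window.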
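The two drift criteria can be combined in a small check, sketched below. The function name `is_drifting`, its argument layout, and the use of strict inequalities at the thresholds are assumptions for illustration; the numbers reuse the two examples ($\theta = 0.4$, $\varepsilon = 0.3$, $\eta = 0.5$).

```python
from typing import Optional, Sequence


def is_drifting(labels: Sequence[Optional[int]],
                last_sizes: Sequence[int],
                theta: float, epsilon: float, eta: float) -> bool:
    """Decide whether the current window drifts from the last clustering result.

    labels     -- cluster index per point of the current window, None for outliers
    last_sizes -- number of points per cluster in the last clustering result
    """
    n = len(labels)
    k = len(last_sizes)
    # Criterion 1: too many outliers.
    outliers = sum(1 for lab in labels if lab is None)
    if outliers / n > theta:
        return True
    # Criterion 2: too many clusters whose share of data points changed.
    last_total = sum(last_sizes)
    varied = 0
    for i in range(k):
        current_ratio = sum(1 for lab in labels if lab == i) / n
        last_ratio = last_sizes[i] / last_total
        if abs(last_ratio - current_ratio) > epsilon:
            varied += 1
    return varied / k > eta


# S^2: three outliers out of five points -> drift by the outlier criterion.
drift_s2 = is_drifting([None, None, None, 1, 1], [3, 2], 0.4, 0.3, 0.5)
# S^3: one outlier, four points in the first cluster, last result had sizes (2, 3).
drift_s3 = is_drifting([0, 0, 0, 0, None], [2, 3], 0.4, 0.3, 0.5)
# A hypothetical stable window with the same cluster shares as before.
drift_stable = is_drifting([0, 0, 1, 1, 1], [2, 3], 0.4, 0.3, 0.5)
```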
Drifting Concept Detection
The execution-time bottlenecks in DCD are
the reclustering step when the concept
drifts and the NIR table update step
when the concept does not drift.
If we can obtain prior knowledge from domain
experts, such as the frequency of drifting
concepts in the data, it can help us
set proper parameter values.
Clustering relationship analysis
CRA measures the similarity of
clusters between the clustering results
at different time stamps.

CRA links similar clusters when their
similarity is higher than the threshold.

CRA will provide clues for us to catch
the time-evolving trends in the data set.
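Node Importance Vector and Cluster Distance
Node importance vector of cluster $c_i$:
one component per node in the vector space,
holding the importance of that node in $c_i$
(0 if the node does not occur in $c_i$).

The dimensions of all the vectors are
the same.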
Example
Vector space (14 nodes):
([A_1=A], [A_1=B], [A_1=X], [A_1=Y], [A_1=Z], [A_2=E], [A_2=F], [A_2=M], [A_2=N], [A_3=C], [A_3=D], [A_3=G], [A_3=P], [A_3=T])
Cosine measure
Calculate the cosine of the angle
between two vectors.
Measure of similarity.
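For two node importance vectors $u$ and $v$, the cosine measure is $\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$. A minimal Python sketch follows, together with a linking step in the spirit of CRA. The function names `cosine` and `link_similar`, the toy vectors, and the threshold 0.8 are hypothetical; returning 0 for an all-zero vector is also an assumption.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)


def link_similar(prev: List[Sequence[float]],
                 cur: List[Sequence[float]],
                 threshold: float) -> List[Tuple[int, int]]:
    """Pairs (i, j) of clusters from two time stamps whose similarity exceeds the threshold."""
    return [(i, j)
            for i, u in enumerate(prev)
            for j, v in enumerate(cur)
            if cosine(u, v) > threshold]


# Hypothetical 3-dimensional importance vectors for two clustering results.
previous_result = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.0]]
current_result = [[0.9, 0.0, 0.4], [0.0, 0.0, 1.0]]
links = link_similar(previous_result, current_result, threshold=0.8)
```

Here only the first cluster of each result is linked, because the other pairs share too few important nodes.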

Example
The similarity between vectors $c_1^1$ and $c_1^2$
Visualizing the Evolving Clusters
[Figure: clustering results over time; labels: cluster, time, clustering result]
Experimental Results-Test Environment
Synthetic data sets
Numerical data set Clustering data
Drifting concept is generated by
combining two different clustering
results.



Experimental Results-Test Environment
Real data set: KDD-CUP99 Network Intrusion
Detection
Each record is either a normal connection or an attack.
Drifting concept: the change continues
for at least 300 connections.
493,857 records; each record contains 42
attributes.
33 drifting concepts







Evaluation on Efficiency
The number of drifting concepts directly
impacts the execution time of DCD.







DCD executes faster than EM.

[Figure annotations: "a little influence"; dimensionality = 20, # of clusters = 20, N = 500]
Evaluation on scalability
Data size = 50000, N = 500, # of clusters = 20, dimensionality = 20
Bottleneck: the number of drifting concepts
that require reclustering.
Evaluation on Accuracy
Test the accuracy of drifting concepts that
are detected by DCD.
The CU function attempts to maximize both:
the probability that data points in the same cluster have the same attribute values, and
the probability that data points from different clusters have different attributes.


Evaluation on Accuracy
Confusion matrix accuracy (CMA)

Evaluates the clustering results by
comparing the output clusters $c_i$ with the
original clustering labels $j$.

It is obtained by maximizing the count of data
points over mappings (i, j) in which one output
cluster $c_i$ is mapped to one original clustering label $j$.
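One way to read the CMA description is as a one-to-one assignment problem between output clusters and original labels. The sketch below takes that reading and solves it by brute force. The function name `confusion_matrix_accuracy`, the normalization by the total number of points, and the toy labels are assumptions for illustration, not the paper's exact definition.

```python
from collections import Counter
from itertools import permutations
from typing import Hashable, Sequence


def confusion_matrix_accuracy(output: Sequence[Hashable],
                              truth: Sequence[Hashable]) -> float:
    """Best fraction of points matched under a one-to-one cluster-to-label mapping."""
    if not output or len(output) != len(truth):
        raise ValueError("output and truth must be non-empty and of equal length")
    counts = Counter(zip(output, truth))          # confusion matrix entries
    rows = sorted(set(output), key=repr)          # output clusters
    cols = sorted(set(truth), key=repr)           # original labels
    # Pad with None so every output cluster can stay unmatched if needed
    # (assumes None is not itself used as a label).
    cols = cols + [None] * max(0, len(rows) - len(cols))
    best = 0
    for perm in permutations(cols, len(rows)):
        matched = sum(counts.get((r, c), 0) for r, c in zip(rows, perm))
        best = max(best, matched)
    return best / len(output)


# Hypothetical example: 5 points, 2 output clusters, 2 original labels.
cma = confusion_matrix_accuracy([0, 0, 1, 1, 1], ["b", "b", "a", "a", "b"])
```

The brute-force search grows factorially with the number of clusters, so it only suits small k; an assignment solver such as the Hungarian algorithm would be the usual choice for larger k.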
Accuracy Evaluation on Synthetic Data Set

Each synthetic data set is generated by randomly
combining 50 clustering results.
DCD is effective for detecting drifting concepts.
If the data set varies dramatically: use a smaller N.
If the data set is stable: use a larger N to save execution time.

[Figure annotations: ">0.8"; "The highest"]
Settings: $\theta = 0.1$, $\varepsilon = 0.1$, $\eta = 0.5$, k = maximum # of clusters; averages of 20 experiments
Accuracy Evaluation on Synthetic Data Set

Clustering results: DCD vs. EM in setting $D_1$
N = 2000; drifting concepts occur once per five
sliding windows (50*10000/2000 = 250, 250/50 = 5)
The variation of CU and CMA when performing EM once
is considerably larger than with DCD.
Accuracy Evaluation on Synthetic Data Set

Clustering results: DCD vs. EM in setting $D_2$
The drifting concepts occur irregularly.
DCD performs better than EM when
clustering categorical time-evolving data.
Accuracy Evaluation on Real Data Set

A small sliding window size leads to
high recall but slightly lower precision.

If the data set does not evolve frequently: use a larger N.

Settings: $\theta = 0.1$, $\varepsilon = 0.1$, $\eta = 0.5$, k = 10, N = 3000
Accuracy Evaluation on Real Data Set

The records are the same in sliding windows 51-114, 134-149, and
155-160.
The peak values of CU in DCD occur at the time stamps
where a drifting concept occurs.
DCD is able to quickly reflect the drifting concept
and generate better clustering results.

Conclusions

A framework to perform clustering on
categorical time-evolving data.
Detects the drifting concepts in different sliding
windows by DCD.
Uses CRA to analyze and show the changes between
different clustering results.
Shows the relationship between clustering
results by visualization.
DCD can provide high-quality clustering results
with correctly detected drifting concepts.
