Cluster Analysis
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Outlier Analysis
Summary
Pattern Recognition
Spatial Data Analysis
  create thematic maps in GIS by clustering feature spaces
  detect spatial clusters and explain them in spatial data mining
Image Processing
Economic Science (especially market research)
WWW
  Document classification
  Cluster Weblog data to discover groups of similar access patterns
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
Scalability
High dimensionality
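Data Structures
Data matrix (two modes)
$$\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots & & \vdots & & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots & & \vdots & & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}$$
Dissimilarity matrix (one mode)
$$\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}$$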
Interval-scaled variables:
Binary variables:
Interval-valued variables
Standardize data
$$z_{if} = \frac{x_{if} - m_f}{s_f}$$
where
$$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$$
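The standardization $z_{if} = (x_{if} - m_f)/s_f$ can be sketched in a few lines of Python. This is only an illustration under stated assumptions: $s_f$ is taken to be the mean absolute deviation of variable f (one common choice), and the function name `standardize` and the sample values are hypothetical.

```python
def standardize(values):
    """Return the z-scores of one variable f.

    Assumption: s_f is the mean absolute deviation of the variable.
    """
    n = len(values)
    m_f = sum(values) / n                          # mean of variable f
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation
    if s_f == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - m_f) / s_f for x in values]


# Hypothetical measurements of a single interval-scaled variable.
heights_cm = [150.0, 160.0, 170.0, 180.0]
z_scores = standardize(heights_cm)
print(z_scores)  # [-1.5, -0.5, 0.5, 1.5]
```

After standardization the variable has mean 0, so variables measured in different units contribute on a comparable scale to a distance computation.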
Some popular ones include: Minkowski distance:
$$d(i,j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q}$$
If q = 1, d is Manhattan distance:
$$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$
If q = 2, d is Euclidean distance:
$$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$
Properties
$d(i,j) \ge 0$
$d(i,i) = 0$
$d(i,j) = d(j,i)$
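As a rough illustration of the Minkowski distance and of how it fills a one-mode dissimilarity matrix, the following Python sketch may help. The function names `minkowski` and `dissimilarity_matrix` and the sample objects are assumptions chosen for the example.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two p-dimensional objects."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of variables")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


def dissimilarity_matrix(objects, q=2):
    """Lower-triangular dissimilarity matrix: row i holds d(i,0) .. d(i,i)."""
    return [
        [minkowski(objects[i], objects[j], q) for j in range(i + 1)]
        for i in range(len(objects))
    ]


# Hypothetical two-dimensional objects.
objects = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]

manhattan = minkowski(objects[0], objects[1], q=1)   # |3| + |4|
euclidean = minkowski(objects[0], objects[1], q=2)   # sqrt(9 + 16)
matrix = dissimilarity_matrix(objects, q=2)
```

Only the lower triangle is stored because d(i,j) = d(j,i) and d(i,i) = 0, which mirrors the shape of the dissimilarity matrix.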
Binary Variables
A contingency table for binary data

                  Object j
                  1      0      sum
Object i    1     a      b      a+b
            0     c      d      c+d
            sum   a+c    b+d    p

Dissimilarity when the binary variable is asymmetric:
$$d(i,j) = \frac{b + c}{a + b + c}$$
May 23, 2016
Example

Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N
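The asymmetric binary dissimilarity d(i,j) = (b + c)/(a + b + c) can be applied to the example records of Jack, Mary, and Jim. The sketch below rests on assumptions that are not stated in the table itself: Gender is treated as a symmetric attribute and left out, the values Y and P are coded as 1, and N is coded as 0. The helper names `encode` and `asymmetric_binary_dissimilarity` are hypothetical.

```python
def encode(row):
    """Assumed coding: Y and P become 1, N becomes 0."""
    return [1 if value in ("Y", "P") else 0 for value in row]


def asymmetric_binary_dissimilarity(obj_i, obj_j):
    """d(i,j) = (b + c) / (a + b + c); joint absences (d) are ignored."""
    pairs = list(zip(obj_i, obj_j))
    a = sum(1 for x, y in pairs if x == 1 and y == 1)
    b = sum(1 for x, y in pairs if x == 1 and y == 0)
    c = sum(1 for x, y in pairs if x == 0 and y == 1)
    if a + b + c == 0:
        raise ValueError("no positive values; the dissimilarity is undefined")
    return (b + c) / (a + b + c)


# Fever, Cough, Test-1, Test-2, Test-3, Test-4 (Gender omitted by assumption).
patients = {
    "Jack": encode(["Y", "N", "P", "N", "N", "N"]),
    "Mary": encode(["Y", "N", "P", "N", "P", "N"]),
    "Jim":  encode(["Y", "P", "N", "N", "N", "N"]),
}

d_jack_mary = asymmetric_binary_dissimilarity(patients["Jack"], patients["Mary"])
d_jack_jim = asymmetric_binary_dissimilarity(patients["Jack"], patients["Jim"])
d_jim_mary = asymmetric_binary_dissimilarity(patients["Jim"], patients["Mary"])
```

Under these assumptions the three values come out as 1/3, 2/3, and 3/4, so Jack and Mary are the most similar pair.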
Nominal Variables
Ordinal Variables
Ratio-Scaled Variables
Methods:
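Variables of Mixed Types
$$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
f is binary or nominal:
$d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
f is ordinal: compute the ranks $r_{if}$ and use $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, where $M_f$ is the number of ordered states of f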
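The weighted combination of per-variable dissimilarities can be sketched for the binary or nominal case, where each variable contributes 0 on a match and 1 otherwise. The example makes assumptions beyond the formula: the indicator $\delta_{ij}^{(f)}$ is set to 0 when either value is missing (represented here by `None`) and to 1 otherwise, and the function name and sample records are hypothetical.

```python
def mixed_dissimilarity(obj_i, obj_j):
    """Weighted average of per-variable dissimilarities.

    Assumptions: every variable is binary or nominal, and a variable is
    skipped (delta = 0) when either object has a missing value (None).
    """
    numerator = 0
    denominator = 0
    for x_if, x_jf in zip(obj_i, obj_j):
        if x_if is None or x_jf is None:
            continue                          # delta_ij^(f) = 0
        denominator += 1                      # delta_ij^(f) = 1
        numerator += 0 if x_if == x_jf else 1
    if denominator == 0:
        raise ValueError("no variable is observed for both objects")
    return numerator / denominator


# Hypothetical records: colour, gender, a code with a missing value, a flag.
record_i = ("red", "M", None, "yes")
record_j = ("red", "F", "x", "no")
d_ij = mixed_dissimilarity(record_i, record_j)   # 2 mismatches out of 3 compared
```

Interval-scaled or ordinal variables would need their own per-variable term in place of the 0/1 comparison; that part is left out of this sketch.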
Partitioning Methods
Example
[Figure: scatter plots with both axes running from 0 to 10]
Strength
Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.
Often terminates at a local optimum.
Weakness
Applicable only when the mean is defined; what about categorical data?
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers
Not suitable for discovering clusters with non-convex shapes
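The O(tkn) cost and the dependence on a mean are characteristic of the k-means procedure. Assuming that is the method being assessed here, a minimal sketch looks as follows; the function names, the caller-supplied initial centers, and the sample points are assumptions made for illustration.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(members):
    """Coordinate-wise mean; this is why a mean must be defined."""
    return tuple(sum(coords) / len(members) for coords in zip(*members))


def kmeans(points, initial_centers, max_iter=100):
    """Plain k-means: k must be chosen in advance via initial_centers."""
    centers = [tuple(c) for c in initial_centers]
    clusters = []
    for _ in range(max_iter):                      # t iterations
        clusters = [[] for _ in centers]
        for p in points:                           # n objects
            nearest = min(
                range(len(centers)),               # k clusters
                key=lambda c: squared_distance(p, centers[c]),
            )
            clusters[nearest].append(p)
        # Recompute each center; an empty cluster keeps its old center.
        new_centers = [
            mean_point(members) if members else centers[c]
            for c, members in enumerate(clusters)
        ]
        if new_centers == centers:                 # assignments are stable
            break
        centers = new_centers
    return centers, clusters


# Hypothetical data: two well-separated groups of three points each.
sample_points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
                 (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
final_centers, final_clusters = kmeans(sample_points, [(1.0, 1.0), (9.0, 8.0)])
```

Each pass compares every object with every center, which gives the k times n work per iteration. The result depends on the initial centers, which is one way the procedure ends at a local optimum, and a single far-away outlier would pull a mean toward it.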
[Figure: scatter plots on 0 to 10 axes with points labeled h and i; one panel is annotated Cjih = 0]
Weakness:
Hierarchical Clustering
[Figure: agglomerative (AGNES) merging of objects a, b, ..., e through the clusters ab, de, cde, and abcde over successive steps; divisive (DIANA) splitting is shown in the opposite direction]
Go on in a non-descending fashion
[Figure: scatter plots with both axes running from 0 to 10]
Agglomerative Clustering Algorithm
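One possible reading of agglomerative clustering is sketched below: start with every object in its own cluster and repeatedly merge the two closest clusters. The choice of single-link (minimum) distance between clusters, the function name, and the one-dimensional coordinates for objects a to e are all assumptions; the coordinates were picked so that the merges come out as ab, de, cde, abcde.

```python
def single_link_agglomerative(points):
    """points maps a label to a 1-D coordinate.

    Returns the merges in order as (cluster name, merge distance).
    Assumption: cluster distance is the minimum pairwise distance.
    """
    clusters = [(label,) for label in sorted(points)]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = min(
                    abs(points[p] - points[q])
                    for p in clusters[i]
                    for q in clusters[j]
                )
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        dist, i, j = best
        merged = tuple(sorted(clusters[i] + clusters[j]))
        merges.append(("".join(merged), dist))
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical coordinates for the five objects.
coordinates = {"a": 1.0, "b": 1.5, "c": 5.0, "d": 7.0, "e": 8.0}
merge_history = single_link_agglomerative(coordinates)
merge_names = [name for name, _ in merge_history]   # ['ab', 'de', 'cde', 'abcde']
```

With single-link distance the merge distances never decrease from one step to the next, and the loop ends when all objects belong to one cluster.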
Density-Based Clustering Methods
Clustering based on density (local cluster criterion), such as density-connected points
Major features:
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as termination condition
Several interesting studies:
DBSCAN: Ester, et al. (KDD96)
OPTICS: Ankerst, et al. (SIGMOD99)
DENCLUE: Hinneburg & D. Keim (KDD98)
CLIQUE: Agrawal, et al. (SIGMOD98)
Density-Based Clustering: Background
Two parameters: Eps and MinPts
1) p belongs to $N_{Eps}(q)$
[Figure: points p and q, with MinPts = 5 and Eps = 1 cm]
Density-Based Clustering: Background (II)
Density-reachable:
A point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points $p_1, \ldots, p_n$, $p_1 = q$, $p_n = p$ such that $p_{i+1}$ is directly density-reachable from $p_i$
Density-connected
A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts.
[Figure: points p1, q, and o illustrating the two definitions]
[Figure: core point, with Eps = 1 cm and MinPts = 5]
If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database.
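The border-point rule fits into the overall DBSCAN scan roughly as sketched below. This is a minimal illustration rather than the published implementation: the names `region_query` and `dbscan`, the label -1 for noise, the brute-force neighborhood search, and the sample points are assumptions, and a point is counted as part of its own Eps-neighborhood.

```python
NOISE = -1


def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx], including itself."""
    px, py = points[idx]
    return [
        k for k, (x, y) in enumerate(points)
        if (x - px) ** 2 + (y - py) ** 2 <= eps ** 2
    ]


def dbscan(points, eps, min_pts):
    labels = [None] * len(points)        # None means not yet visited
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            # Not a core point: nothing is density-reachable from it,
            # so mark it as noise for now and visit the next point.
            labels[i] = NOISE
            continue
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id   # earlier "noise" is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)    # j is a core point: keep expanding
        cluster_id += 1
    return labels


# Hypothetical data: two dense squares, one border point, one isolated point.
sample = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.4, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
          (5.0, 5.0)]
cluster_labels = dbscan(sample, eps=1.5, min_pts=3)
```

In this data the point (2.4, 1.0) has too few neighbors to be a core point but lies within Eps of the core point (1.0, 1.0), so it joins the first cluster as a border point, while (5.0, 5.0) stays labeled as noise.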
Index-based:
k = number of dimensions
N = 20
p = 75%
M = N(1-p) = 5
Complexity: $O(kN^2)$
Core Distance
Reachability Distance
[Figure: object o with points p1 and p2; MinPts = 5, ε = 3 cm]
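Core distance and reachability distance can be illustrated with a short sketch. The definitions used here are the usual OPTICS ones and are stated as assumptions: the core distance of o is the distance to its MinPts-th nearest neighbor (counting o itself) when at least MinPts points lie within the radius, and is undefined otherwise; the reachability distance of p with respect to o is the larger of the core distance of o and the distance between o and p. Function names and sample points are hypothetical, and `None` stands for "undefined".

```python
import math


def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


def core_distance(points, o, eps, min_pts):
    """Distance from points[o] to its min_pts-th nearest neighbor.

    Assumption: o counts as its own neighbor. Returns None (undefined)
    when fewer than min_pts points lie within eps of o.
    """
    within = sorted(
        d for d in (distance(points[o], p) for p in points) if d <= eps
    )
    if len(within) < min_pts:
        return None
    return within[min_pts - 1]


def reachability_distance(points, p, o, eps, min_pts):
    """max(core distance of o, distance(o, p)); None if o is not a core object."""
    core = core_distance(points, o, eps, min_pts)
    if core is None:
        return None
    return max(core, distance(points[o], points[p]))


# Hypothetical points; index 0 plays the role of o.
pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (0.0, 4.0)]

core_o = core_distance(pts, 0, eps=3.0, min_pts=3)               # 2.0
reach_near = reachability_distance(pts, 1, 0, eps=3.0, min_pts=3)  # 2.0
reach_far = reachability_distance(pts, 3, 0, eps=3.0, min_pts=3)   # 3.0
undefined = core_distance(pts, 0, eps=3.0, min_pts=5)            # None
```

A point closer to o than the core distance still gets the core distance as its reachability distance, which smooths the values inside a dense region.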
[Figure: reachability plot; vertical axis: reachability-distance (with "undefined" marked), horizontal axis: cluster-order of the objects]
Major features
Outlier Discovery: Statistical Approaches