
Data Mining

Cluster Analysis: Basic Concepts and Algorithms

Instructor: Wei Ding

© Tan, Steinbach, Kumar, Introduction to Data Mining

What is Cluster Analysis?

• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups

[Figure: two groups of points; intra-cluster distances are minimized, inter-cluster distances are maximized]
What is not Cluster Analysis?

• Supervised classification
– Have class label information

• Simple segmentation
– Dividing students into different registration groups alphabetically, by last name

Types of Clusters: Objective Function

• Clusters Defined by an Objective Function
– Finds clusters that minimize or maximize an objective function.
– Enumerate all possible ways of dividing the points into clusters and evaluate the 'goodness' of each potential set of clusters by using the given objective function. (NP-hard)
– Can have global or local objectives.
– A variation of the global objective function approach is to fit the data to a parameterized model.
◦ Parameters for the model are determined from the data.
◦ Mixture models assume that the data is a 'mixture' of a number of statistical distributions (a sketch follows).
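To make the mixture-model view concrete, here is a minimal sketch that fits a two-component Gaussian mixture and reads cluster labels off the fitted model. The use of scikit-learn and the synthetic data are assumptions of this example, not something the slides prescribe:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Illustrative data: two groups drawn from different Gaussians
data = np.vstack([rng.normal(0, 1, (100, 2)),
                  rng.normal(5, 1, (100, 2))])

# Fit a parameterized model (a mixture of 2 Gaussians) to the data
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
labels = gmm.predict(data)   # hard cluster assignments
print(gmm.means_)            # estimated component centers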
Types of Clusters: Objective Function …

• Map the clustering problem to a different domain and solve a related problem in that domain

– Proximity matrix defines a weighted graph, where the nodes are the points being clustered, and the weighted edges represent the proximities between points

– Clustering is equivalent to breaking the graph into connected components, one for each cluster (a sketch follows).

– Want to minimize the edge weight between clusters and maximize the edge weight within clusters
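A hedged sketch of this graph view, assuming SciPy is available: threshold the proximity (here, distance) matrix to form a graph, then read clusters off its connected components. The threshold value and the toy points are illustrative assumptions:

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import pdist, squareform

points = np.array([[0.0, 0.0], [0.5, 0.2], [5.0, 5.0], [5.3, 4.8]])
proximity = squareform(pdist(points))      # pairwise distance matrix

threshold = 1.0                            # keep only edges between close points
graph = csr_matrix(proximity < threshold)  # adjacency of the thresholded graph

n_clusters, labels = connected_components(graph, directed=False)
print(n_clusters, labels)                  # 2 clusters: [0 0 1 1]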

K-means Clustering

• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple (a sketch follows)
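The slides do not reproduce the algorithm itself; the following is a minimal NumPy sketch of the standard iterations (select K initial centroids, assign points, recompute centroids, repeat), without refinements such as careful initialization or empty-cluster handling:

import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Select K points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # Assign each point to the closest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points
        # (note: an empty cluster would need special handling here)
        new_centroids = np.array([points[labels == i].mean(axis=0) for i in range(k)])
        # Repeat until the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels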
K-means Clustering – Details

• Initial centroids are often chosen randomly.
– Clusters produced vary from one run to another (a short demonstration follows this slide).
• The centroid is (typically) the mean of the points in the cluster.
• 'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc.
• K-means will converge for common similarity measures mentioned above.
• Most of the convergence happens in the first few iterations.
– Often the stopping condition is changed to 'Until relatively few points change clusters'
• Complexity is O(n * K * I * d)
– n = number of points, K = number of clusters, I = number of iterations, d = number of attributes
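A short demonstration of the run-to-run variation noted above, using scikit-learn's KMeans as a stand-in (its inertia_ attribute is the SSE defined later in these slides); with purely random initialization, different seeds can end at different solutions:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Illustrative data: three compact groups
data = np.vstack([rng.normal(c, 0.3, (50, 2)) for c in (0, 3, 6)])

for seed in range(5):
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(data)
    print(seed, km.inertia_)   # SSE of the clustering found on this run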

Two different K-means Clusterings


[Figure: the same set of original points (top) clustered two different ways (bottom): an optimal clustering on the left, a sub-optimal clustering on the right]
Importance of Choosing Initial Centroids

[Figure: animated view of K-means converging over six iterations from one set of initial centroids]

Importance of Choosing Initial Centroids


[Figure: snapshots of Iterations 1 through 6, showing the centroids converging to the optimal clustering]
Evaluating K-means Clusters

• Most common measure is Sum of Squared Error (SSE)
– For each point, the error is the distance to the nearest cluster
– To get SSE, we square these errors and sum them (a short computation sketch follows this slide):

$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)$

– x is a data point in cluster C_i and m_i is the representative point for cluster C_i
◦ can show that m_i corresponds to the center (mean) of the cluster
– Given two clusterings, we can choose the one with the smallest error
– One easy way to reduce SSE is to increase K, the number of clusters
◦ A good clustering with smaller K can have a lower SSE than a poor clustering with higher K
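A minimal sketch of the SSE computation above, assuming points, integer cluster labels, and centroids are given as NumPy arrays:

import numpy as np

def sse(points, labels, centroids):
    # For each point x in cluster C_i, accumulate dist^2(m_i, x)
    diffs = points - centroids[labels]   # x - m_i, row by row
    return float(np.sum(diffs ** 2))     # sum of squared Euclidean distances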

Importance of Choosing Initial Centroids …

[Figure: animated view of K-means over five iterations from a poor choice of initial centroids]
Importance of Choosing Initial Centroids …

[Figure: snapshots of Iterations 1 through 5, showing the centroids converging to a sub-optimal clustering]

Problems with Selecting Initial Points

• If there are K 'real' clusters then the chance of selecting one centroid from each cluster is small.
– Chance is relatively small when K is large
– If clusters are the same size, n, then

$P = \frac{\text{number of ways to select one centroid from each cluster}}{\text{number of ways to select } K \text{ centroids}} = \frac{K!\,n^K}{(Kn)^K} = \frac{K!}{K^K}$

– For example, if K = 10, then probability = 10!/10^10 = 0.00036 (a quick check of this arithmetic follows this slide)
– Sometimes the initial centroids will readjust themselves in the 'right' way, and sometimes they don't
– Consider an example of five pairs of clusters
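A quick check of that arithmetic in Python:

import math

K = 10
print(math.factorial(K) / K**K)   # 0.00036288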
10 Clusters Example
[Figure: animated view of K-means over four iterations on five pairs of clusters]
Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example
[Figure: snapshots of Iterations 1 through 4]

Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
[Figure: animated view of K-means over four iterations on five pairs of clusters]
Starting with some pairs of clusters having three initial centroids, while others have only one.

10 Clusters Example
[Figure: snapshots of Iterations 1 through 4]
Starting with some pairs of clusters having three initial centroids, while others have only one.
Limitations of K-means

• K-means has problems when clusters are of differing
– Sizes
– Densities
– Non-globular shapes

• K-means has problems when the data contains outliers.

Limitations of K-means: Differing Sizes

[Figure: Original Points vs. K-means (3 Clusters)]
Limitations of K-means: Differing Density

[Figure: Original Points vs. K-means (3 Clusters)]

Limitations of K-means: Non-globular Shapes

[Figure: Original Points vs. K-means (2 Clusters)]
Overcoming K-means Limitations

[Figure: Original Points vs. K-means Clusters]

One solution is to use many clusters: find parts of clusters, but these parts then need to be put together (sketched below).
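One way to sketch this two-stage idea (an illustration, not the method the slides prescribe): over-segment with a large K, then merge the pieces, here by agglomerative clustering of the K-means centroids:

import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

rng = np.random.default_rng(0)
# Illustrative data: two true clusters of different sizes
data = np.vstack([rng.normal(0, 0.4, (200, 2)), rng.normal(4, 0.4, (40, 2))])

pieces = KMeans(n_clusters=10, n_init=10, random_state=0).fit(data)  # many small parts
merged = AgglomerativeClustering(n_clusters=2).fit(pieces.cluster_centers_)
final_labels = merged.labels_[pieces.labels_]   # map each point through its piece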

Overcoming K-means Limitations

[Figure: Original Points vs. K-means Clusters]
Overcoming K-means Limitations

[Figure: Original Points vs. K-means Clusters]
