
K-means Clustering

Desirable Properties of a Clustering Algorithm

• Scalability (in terms of both time and space)


• Ability to deal with different data types
• Minimal requirements for domain knowledge to
determine input parameters
• Able to deal with noise and outliers
• Insensitive to order of input records
• Incorporation of user-specified constraints
• Interpretability and usability
K-means Overview
• A clustering algorithm
• An approximation to an NP-hard combinatorial
optimization problem
• It is unsupervised
• “K” stands for the number of clusters; it is a user input to
the algorithm
• Given a set of data points or observations (all
numerical), K-means attempts to partition them into K
clusters
• The algorithm is iterative in nature
K-means Details
• X1,…, XN are data points or vectors or observations

• Each observation will be assigned to one and only one cluster

• C(i) denotes cluster number for the ith observation

• Dissimilarity measure: Euclidean distance metric

• K-means minimizes within-cluster point scatter:


W(C) = (1/2) Σ_{k=1}^{K} Σ_{C(i)=k} Σ_{C(j)=k} ‖x_i − x_j‖²  =  Σ_{k=1}^{K} N_k Σ_{C(i)=k} ‖x_i − m_k‖²

where

m_k is the mean vector of the kth cluster

N_k is the number of observations in the kth cluster
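The second equality above (rewriting the pairwise within-cluster scatter as N_k times the scatter around the cluster mean) can be checked numerically; a small sketch with random data:

```python
import numpy as np

# Numerical check of the identity: for each cluster k,
# (1/2) * sum_{i,j in k} ||x_i - x_j||^2  ==  N_k * sum_{i in k} ||x_i - m_k||^2
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))          # 12 random observations in R^3
C = rng.integers(0, 2, size=12)       # an arbitrary assignment into K = 2 clusters

lhs = rhs = 0.0
for k in (0, 1):
    Xk = X[C == k]
    Nk = len(Xk)
    if Nk == 0:
        continue                      # skip empty clusters
    mk = Xk.mean(axis=0)              # cluster mean m_k
    lhs += 0.5 * ((Xk[:, None, :] - Xk[None, :, :]) ** 2).sum()
    rhs += Nk * ((Xk - mk) ** 2).sum()

assert np.isclose(lhs, rhs)           # both forms of W(C) agree
```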


Squared Error

[Figure: points on a 10×10 grid illustrating the within-cluster point scatter W(C), the objective function minimized by K-means.]
K-means Algorithm
• For a given assignment C, compute the cluster means m_k:

    m_k = ( Σ_{i:C(i)=k} x_i ) / N_k ,   k = 1, …, K

• For the current set of cluster means, assign each observation
to its nearest mean:

    C(i) = argmin_{1≤k≤K} ‖x_i − m_k‖² ,   i = 1, …, N

• Iterate the above two steps until convergence


Algorithm k-means
1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if
necessary).
3. Decide the class memberships of the N objects by
assigning them to the nearest cluster center.
4. Re-estimate the k cluster centers, by assuming the
memberships found above are correct.
5. If none of the N objects changed membership in
the last iteration, exit. Otherwise goto 3.
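The five steps above can be sketched in Python with NumPy (a minimal illustration of Lloyd's algorithm, not a production implementation; `kmeans` here is a hypothetical helper name):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """The five steps above: init centers, assign, re-estimate, repeat."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: k is given; initialize centers from k randomly chosen points.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 3: assign each object to the nearest center (squared Euclidean).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 5: stop when no membership changed in the last iteration.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: re-estimate each center as the mean of its members.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels
```

For well-separated data this recovers the obvious grouping in a few iterations.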
Observations
• Theorem: If d(·,·) is Euclidean, then k-means
converges monotonically to a local minimum of the
within-class squared distortion: Σ_x d(c(x), x)²

1. Many variants; complex history since 1956, over
100 papers per year currently
2. Iterative; related to expectation-maximization
(EM)
3. The number of iterations to converge grows slowly
with n, k, and d
4. No accepted method exists to discover k
K-means Clustering: Steps 1–5
Algorithm: k-means, Distance Metric: Euclidean Distance

[Figures: five snapshots of the algorithm on 2-D data (x axis: expression in condition 1; y axis: expression in condition 2). Three centers k1, k2, and k3 are placed, points are assigned to the nearest center, and the centers are re-estimated until convergence.]
Comments on the K-Means Method
• Strength
– Relatively efficient: O(tkn), where n is # objects, k is # clusters,
and t is # iterations. Normally, k, t << n.
– Often terminates at a local optimum. The global optimum may
be found using techniques such as: deterministic annealing
and genetic algorithms
• Weakness
– Applicable only when mean is defined, then what about
categorical data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Not suitable to discover clusters with non-convex shapes
Image Segmentation Results

[Figures: an image (I) and the three-cluster image (J) computed on the gray values of I.]

Matlab code:

I = double(imread('…'));
J = reshape(kmeans(I(:), 3), size(I));

Note that the K-means result is “noisy”.
EM Algorithm
• Initialize K cluster centers
• Iterate between two steps
– Expectation step: assign points to clusters

    P(d_i ∈ c_k) = w_k Pr(d_i | c_k) / Σ_j w_j Pr(d_i | c_j)

    w_k = ( Σ_i P(d_i ∈ c_k) ) / N

– Maximization step: estimate model parameters

    μ_k = Σ_i d_i P(d_i ∈ c_k) / Σ_i P(d_i ∈ c_k)
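The E- and M-steps above can be sketched for a spherical Gaussian mixture (a simplified illustration assuming unit-variance components; `em_gmm` is a hypothetical helper name):

```python
import numpy as np

def em_gmm(X, k, n_iter=50, seed=0):
    """EM for a spherical, unit-variance Gaussian mixture (soft k-means)."""
    rng = np.random.default_rng(seed)
    # Initialize the K cluster centers from k randomly chosen points.
    mu = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    w = np.full(k, 1.0 / k)                 # mixture weights w_k
    for _ in range(n_iter):
        # E-step: responsibilities P(d_i in c_k) ∝ w_k * Pr(d_i | c_k).
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        lik = np.exp(-0.5 * sq)             # unnormalized Gaussian density
        resp = w * lik
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: w_k = (1/N) sum_i P(d_i in c_k); mu_k = weighted mean.
        w = resp.mean(axis=0)
        mu = (resp.T @ X) / resp.sum(axis=0)[:, None]
    return mu, w
```

Unlike k-means, each point contributes fractionally to every cluster mean.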
[Figures: the clustering after EM iterations 1, 2, 5, and 25; at iteration 1 the cluster means are randomly assigned.]
What happens if the data is streaming…

Nearest Neighbor Clustering


Not to be confused with Nearest Neighbor Classification

• Items are iteratively merged into the
existing clusters that are closest.
• Incremental
• Threshold, t, used to determine if items are
added to existing clusters or a new cluster
is created.
[Figure: two clusters, 1 and 2, on a 10×10 grid, with the threshold t drawn around cluster 1.]

New data point arrives… It is within the threshold for cluster 1, so add it to the cluster and update the cluster center.

New data point arrives… It is not within the threshold for cluster 1, so create a new cluster, and so on.

The algorithm is highly order dependent. It is difficult to determine t in advance.
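The incremental scheme above can be sketched as follows (a minimal illustration; `leader_cluster` is a hypothetical helper name):

```python
import numpy as np

def leader_cluster(points, t):
    """Incremental (leader) clustering: assign each arriving point to the
    nearest existing cluster if within threshold t, else start a new cluster.
    Centers are maintained as running means; the result is order dependent."""
    centers, counts, labels = [], [], []
    for p in points:
        p = np.asarray(p, dtype=float)
        if centers:
            d = [np.linalg.norm(p - c) for c in centers]
            j = int(np.argmin(d))
            if d[j] <= t:
                counts[j] += 1
                centers[j] += (p - centers[j]) / counts[j]  # running-mean update
                labels.append(j)
                continue
        centers.append(p.copy())   # outside every threshold: new cluster
        counts.append(1)
        labels.append(len(centers) - 1)
    return centers, labels
```

Feeding the same points in a different order can produce a different clustering, which illustrates the order dependence noted above.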
How can we tell the right number of clusters?
In general, this is an unsolved problem. However, there are many approximate methods. In the
next few slides we will see an example.

For our example, we will use the familiar katydid/grasshopper dataset. However, in this case
we imagine that we do NOT know the class labels; we cluster only on the X and Y axis values.

[Figure: the katydid/grasshopper data on a 10×10 grid, shown without class labels.]
When k = 1, the objective function is 873.0

When k = 2, the objective function is 173.1

When k = 3, the objective function is 133.6
We can plot the objective function values for k equals 1 to 6…

The abrupt change at k = 2 is highly suggestive of two clusters in the data. This
technique for determining the number of clusters is known as “knee finding” or
“elbow finding”.

[Figure: objective function (0 to 1.00E+03) versus k (1 to 6), with a sharp elbow at k = 2.]

Note that the results are not always as clear cut as in this toy example
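Elbow finding can be sketched by running k-means for a range of k and comparing the objective values; the largest drop suggests the number of clusters. A minimal sketch (`kmeans_objective` is a hypothetical helper defined here, and the drop-based rule is one simple heuristic):

```python
import numpy as np

def kmeans_objective(X, k, n_iter=50, seed=0):
    """Run k-means (Lloyd's algorithm) and return the objective:
    total within-cluster squared error."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).sum()
```

Because k-means only finds a local minimum, it is common to take the best of several random restarts for each k before comparing the objectives.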
Summary
• K-means converges, but it finds only a local minimum of
the cost function
• Works only for numerical observations (for categorical
and mixed observations, K-medoids is an alternative
clustering method)
• Fine-tuning is required when applied to image
segmentation, mostly because no spatial coherency is
imposed by the k-means algorithm
• Often works as a starting point for more sophisticated
image segmentation algorithms