
K-means Clustering

Desirable Properties of a Clustering Algorithm

• Scalability (in terms of both time and space)


• Ability to deal with different data types
• Minimal requirements for domain knowledge to
determine input parameters
• Able to deal with noise and outliers
• Insensitive to order of input records
• Incorporation of user-specified constraints
• Interpretability and usability
K-means Overview
• A clustering algorithm
• An approximation to an NP-hard combinatorial
optimization problem
• It is unsupervised
• “K” stands for the number of clusters; it is a user input to
the algorithm
• Given a set of data points or observations (all
numerical), K-means attempts to partition them into K
clusters
• The algorithm is iterative in nature
K-means Details
• X1,…, XN are data points or vectors or observations

• Each observation will be assigned to one and only one cluster

• C(i) denotes cluster number for the ith observation

• Dissimilarity measure: Euclidean distance metric

• K-means minimizes within-cluster point scatter:


W(C) = (1/2) Σ_{k=1}^{K} Σ_{C(i)=k} Σ_{C(j)=k} ‖x_i − x_j‖²  =  Σ_{k=1}^{K} N_k Σ_{C(i)=k} ‖x_i − m_k‖²

where

m_k is the mean vector of the kth cluster

N_k is the number of observations in the kth cluster
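The second equality above (rewriting the pairwise within-cluster scatter as N_k times the scatter around the cluster mean) can be checked numerically; a small sketch with random data:

```python
import numpy as np

# Numerical check of the identity: for each cluster k,
# (1/2) * sum_{i,j in k} ||x_i - x_j||^2  ==  N_k * sum_{i in k} ||x_i - m_k||^2
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))          # 12 random observations in R^3
C = rng.integers(0, 2, size=12)       # an arbitrary assignment into K = 2 clusters

lhs = rhs = 0.0
for k in (0, 1):
    Xk = X[C == k]
    Nk = len(Xk)
    if Nk == 0:
        continue                      # skip empty clusters
    mk = Xk.mean(axis=0)              # cluster mean m_k
    lhs += 0.5 * ((Xk[:, None, :] - Xk[None, :, :]) ** 2).sum()
    rhs += Nk * ((Xk - mk) ** 2).sum()

assert np.isclose(lhs, rhs)           # both forms of W(C) agree
```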


Squared Error

[Figure: points on a 10×10 grid illustrating the within-cluster point scatter W(C), the objective function minimized by K-means.]
K-means Algorithm
• For a given assignment C, compute the cluster means m_k:

    m_k = ( Σ_{i:C(i)=k} x_i ) / N_k ,   k = 1, …, K

• For the current set of cluster means, assign each observation
to its nearest mean:

    C(i) = argmin_{1≤k≤K} ‖x_i − m_k‖² ,   i = 1, …, N

• Iterate the above two steps until convergence


Algorithm k-means
1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if
necessary).
3. Decide the class memberships of the N objects by
assigning them to the nearest cluster center.
4. Re-estimate the k cluster centers, by assuming the
memberships found above are correct.
5. If none of the N objects changed membership in
the last iteration, exit. Otherwise goto 3.
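The five steps above can be sketched in Python with NumPy (a minimal illustration of Lloyd's algorithm, not a production implementation; `kmeans` here is a hypothetical helper name):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """The five steps above: init centers, assign, re-estimate, repeat."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: k is given; initialize centers from k randomly chosen points.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 3: assign each object to the nearest center (squared Euclidean).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 5: stop when no membership changed in the last iteration.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: re-estimate each center as the mean of its members.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels
```

For well-separated data this recovers the obvious grouping in a few iterations.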
Observations
• Theorem: If d(·,·) is Euclidean, then k-means
converges monotonically to a local minimum of the
within-class squared distortion: Σ_x d(c(x), x)²

1. Many variants; complex history since 1956, over
100 papers per year currently
2. Iterative; related to expectation-maximization
(EM)
3. The number of iterations to converge grows slowly
with n, k, and d
4. No accepted method exists to discover k
K-means Clustering: Steps 1–5
Algorithm: k-means, Distance Metric: Euclidean Distance

[Figures: five snapshots of the algorithm on 2-D data (x axis: expression in condition 1; y axis: expression in condition 2). Three centers k1, k2, and k3 are placed, points are assigned to the nearest center, and the centers are re-estimated until convergence.]
Comments on the K-Means Method
• Strength
– Relatively efficient: O(tkn), where n is # objects, k is # clusters,
and t is # iterations. Normally, k, t << n.
– Often terminates at a local optimum. The global optimum may
be found using techniques such as: deterministic annealing
and genetic algorithms
• Weakness
– Applicable only when mean is defined, then what about
categorical data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Not suitable to discover clusters with non-convex shapes
Image Segmentation Results

[Figures: an image (I) and the three-cluster image (J) computed on the gray values of I.]

Matlab code:

I = double(imread('…'));
J = reshape(kmeans(I(:), 3), size(I));

Note that the K-means result is “noisy”.
EM Algorithm
• Initialize K cluster centers
• Iterate between two steps
– Expectation step: assign points to clusters

    P(d_i ∈ c_k) = w_k Pr(d_i | c_k) / Σ_j w_j Pr(d_i | c_j)

    w_k = ( Σ_i P(d_i ∈ c_k) ) / N

– Maximization step: estimate model parameters

    μ_k = Σ_i d_i P(d_i ∈ c_k) / Σ_i P(d_i ∈ c_k)
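The E- and M-steps above can be sketched for a spherical Gaussian mixture (a simplified illustration assuming unit-variance components; `em_gmm` is a hypothetical helper name):

```python
import numpy as np

def em_gmm(X, k, n_iter=50, seed=0):
    """EM for a spherical, unit-variance Gaussian mixture (soft k-means)."""
    rng = np.random.default_rng(seed)
    # Initialize the K cluster centers from k randomly chosen points.
    mu = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    w = np.full(k, 1.0 / k)                 # mixture weights w_k
    for _ in range(n_iter):
        # E-step: responsibilities P(d_i in c_k) ∝ w_k * Pr(d_i | c_k).
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        lik = np.exp(-0.5 * sq)             # unnormalized Gaussian density
        resp = w * lik
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: w_k = (1/N) sum_i P(d_i in c_k); mu_k = weighted mean.
        w = resp.mean(axis=0)
        mu = (resp.T @ X) / resp.sum(axis=0)[:, None]
    return mu, w
```

Unlike k-means, each point contributes fractionally to every cluster mean.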
[Figures: the clustering after EM iterations 1, 2, 5, and 25; at iteration 1 the cluster means are randomly assigned.]
What happens if the data is streaming…

Nearest Neighbor Clustering


Not to be confused with Nearest Neighbor Classification

• Items are iteratively merged into the
existing clusters that are closest.
• Incremental
• Threshold, t, used to determine if items are
added to existing clusters or a new cluster
is created.
[Figure: two clusters, 1 and 2, on a 10×10 grid, with the threshold t drawn around cluster 1.]

New data point arrives… It is within the threshold for cluster 1, so add it to the cluster and update the cluster center.

New data point arrives… It is not within the threshold for cluster 1, so create a new cluster, and so on.

The algorithm is highly order dependent. It is difficult to determine t in advance.
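The incremental scheme above can be sketched as follows (a minimal illustration; `leader_cluster` is a hypothetical helper name):

```python
import numpy as np

def leader_cluster(points, t):
    """Incremental (leader) clustering: assign each arriving point to the
    nearest existing cluster if within threshold t, else start a new cluster.
    Centers are maintained as running means; the result is order dependent."""
    centers, counts, labels = [], [], []
    for p in points:
        p = np.asarray(p, dtype=float)
        if centers:
            d = [np.linalg.norm(p - c) for c in centers]
            j = int(np.argmin(d))
            if d[j] <= t:
                counts[j] += 1
                centers[j] += (p - centers[j]) / counts[j]  # running-mean update
                labels.append(j)
                continue
        centers.append(p.copy())   # outside every threshold: new cluster
        counts.append(1)
        labels.append(len(centers) - 1)
    return centers, labels
```

Feeding the same points in a different order can produce a different clustering, which illustrates the order dependence noted above.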
How can we tell the right number of clusters?
In general, this is an unsolved problem. However, there are many approximate methods. In the
next few slides we will see an example.

For our example, we will use the familiar katydid/grasshopper dataset. However, in this case
we imagine that we do NOT know the class labels; we cluster only on the X and Y axis values.

[Figure: the katydid/grasshopper data on a 10×10 grid, shown without class labels.]
When k = 1, the objective function is 873.0

When k = 2, the objective function is 173.1

When k = 3, the objective function is 133.6
We can plot the objective function values for k equals 1 to 6…

The abrupt change at k = 2 is highly suggestive of two clusters in the data. This
technique for determining the number of clusters is known as “knee finding” or
“elbow finding”.

[Figure: objective function (0 to 1.00E+03) versus k (1 to 6), with a sharp elbow at k = 2.]

Note that the results are not always as clear cut as in this toy example
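Elbow finding can be sketched by running k-means for a range of k and comparing the objective values; the largest drop suggests the number of clusters. A minimal sketch (`kmeans_objective` is a hypothetical helper defined here, and the drop-based rule is one simple heuristic):

```python
import numpy as np

def kmeans_objective(X, k, n_iter=50, seed=0):
    """Run k-means (Lloyd's algorithm) and return the objective:
    total within-cluster squared error."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).sum()
```

Because k-means only finds a local minimum, it is common to take the best of several random restarts for each k before comparing the objectives.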
Summary
• K-means converges, but it finds only a local minimum of
the cost function
• Works only for numerical observations (for categorical
and mixed observations, K-medoids is an alternative
clustering method)
• Fine-tuning is required when applied to image
segmentation, mostly because no spatial coherency is
imposed by the k-means algorithm
• Often works as a starting point for more sophisticated
image segmentation algorithms