
An Incremental K-means Algorithm

Presented by: Navneet Chaudhary (2015EET2880)

Data Clustering

A technique for grouping similar physical or abstract objects.

Unsupervised learning:
>> Data labels are not given.
>> Looks for similarity among the given objects and groups
similar objects together for further processing.

Conventional K-means
Main steps:
Step 1: Choose any k objects arbitrarily as the k cluster
centers.
Step 2: Assign each object in the training set to the
closest cluster and update the centers of all
clusters.
Step 3: If the clustering criterion is satisfied, i.e. there is
no movement in the cluster centers, the algorithm
stops; else go to step 2.
(A minimal code sketch of these steps is given below.)
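A minimal Python/NumPy sketch of steps 1-3 above. The function name `kmeans` and all variable names are illustrative, not taken from the presentation:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-means over the rows of X; illustrative sketch only."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k objects arbitrarily as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every object to the closest cluster center...
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # ...then update each center (keep the old one if a cluster is empty).
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Step 3: stop when no center moves; otherwise repeat step 2.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```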

Conventional K-means
Pros:
>> Simple to implement.
>> Computationally efficient (computation time grows linearly
with the size of the data set).
Cons:
>> May converge at a local minimum of the distortion error.

Cluster Parameters
Cluster information is represented as a triple <w, N, S>.
Here,
w  center of the cluster
N  number of elements in the cluster
S  sum of the squared distances between the objects of the
cluster and the center of the Euclidean space
x0  center of the Euclidean space
Distortion error: I = S - N*[d(w,x0)]^2
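A short sketch of how <w, N, S> and the distortion error I could be computed for one cluster; x0 is the chosen center of the Euclidean space, and all names are illustrative:

```python
import numpy as np

def cluster_params(points, x0):
    """Return the triple <w, N, S> for one cluster."""
    points = np.asarray(points, dtype=float)
    w = points.mean(axis=0)               # w: center of the cluster
    N = len(points)                       # N: number of elements
    S = ((points - x0) ** 2).sum()        # S: sum of squared distances to x0
    return w, N, S

def distortion_error(w, N, S, x0):
    # I = S - N*[d(w, x0)]^2; algebraically this equals the
    # within-cluster sum of squared distances to the center w.
    return S - N * ((np.asarray(w) - x0) ** 2).sum()
```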

Modified K-means
Involves a jumping operation: the center of the cluster with
the least distortion error is moved to the cluster with the
most distortion error.
Main steps:
Step 1: Choose k objects arbitrarily as the k cluster centers.
Step 2: Assign each object to its closest cluster and update
the centers of the clusters.
Step 3: If the clustering criterion is satisfied, go to step 4;
else go to step 2.
Step 4: If there is a cluster center that could be moved to a
better position to reduce the total distortion, move
it and go to step 2; else stop.
(A sketch of the jumping operation in step 4 is given below.)
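One way step 4's jumping operation might be implemented, given the per-cluster distortion errors; `centers` is a (k, d) array, and the function name and offset scheme are assumptions for illustration:

```python
import numpy as np

def jump(centers, distortions, rng, eps=1e-3):
    """Move the least-distorted cluster's center into the most-distorted cluster."""
    src = int(np.argmin(distortions))   # cluster whose center is taken out
    dst = int(np.argmax(distortions))   # cluster that receives the center
    centers = centers.copy()
    # Drop the relocated center near (not exactly on) the busy center so
    # the two centers can separate during the next normal K-means steps.
    centers[src] = centers[dst] + eps * rng.normal(size=centers.shape[1])
    return centers
```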

Modified K-means
Objective:
>> To obtain the least total distortion error over the clusters.
Limitation of this algorithm:
>> It is difficult to calculate the change in the distortion
error caused by shifting a cluster center.
Overcoming the limitation:
>> The change in the distortion error can be calculated for a
special case using the two procedures that follow.

Evaluation of distortion of the clusters

Say cluster center Ci is taken out and, in the worst case, all
objects of the cluster go to the second-nearest cluster Cj.
If <wi, Ni, Si> and <wj, Nj, Sj> characterize Ci and Cj
respectively, then <wk, Nk, Sk> characterizes the merged
cluster Ck, where:
Nk = Ni + Nj
wk = (1/Nk)*(Ni*wi + Nj*wj)
Sk = Si + Sj
The resulting increase in the distortion error is:
ΔI = Ik - Ii - Ij = ((Ni*Nj)/(Ni+Nj))*[d(wi,wj)]^2
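The merged parameters and the increase ΔI translate directly into code (a sketch; names are illustrative):

```python
import numpy as np

def merge_increase(wi, Ni, Si, wj, Nj, Sj):
    """Parameters of the merged cluster Ck and the distortion increase dI."""
    Nk = Ni + Nj
    wk = (Ni * np.asarray(wi) + Nj * np.asarray(wj)) / Nk
    Sk = Si + Sj
    # dI = Ik - Ii - Ij = (Ni*Nj/(Ni+Nj)) * d(wi, wj)^2
    dI = (Ni * Nj / Nk) * ((np.asarray(wi) - np.asarray(wj)) ** 2).sum()
    return (wk, Nk, Sk), dI
```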

Evaluation of distortion of the clusters

Moving a center to a new position causes a decrease in the sum
of the cluster distortion errors. This decrease can only be
calculated if the cluster Cz that receives the new center is
assumed to be a hypercube with side d and uniform object
density p.
The distortion error Iz of Cz, with ND the number of
dimensions, is:
Iz = (Nz*ND*d^2)/12
Because of the uniform object distribution, the two resulting
sub-clusters hold equal numbers of objects, i.e. Nz1 = Nz2 = Nz/2,
and the decrease in the distortion error is:
ΔD = (3*Iz)/(4*ND)
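Under the uniform-hypercube assumption, the decrease is a one-liner (a sketch; the function name is illustrative):

```python
def split_decrease(Iz, ND):
    """Decrease in distortion when a new center splits the hypercube
    cluster Cz into two equal halves: dD = 3*Iz / (4*ND)."""
    return 3.0 * Iz / (4.0 * ND)
```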

Evaluation of distortion of the clusters

Let M = ΔI - ΔD.
If M is negative, the jumping operation reduces the total
distortion and could result in better clustering.
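Putting the two procedures together, the jump test could look like this (the values below are hypothetical, plugged in where `merge_increase` and `split_decrease` from the sketches above would be called):

```python
# dI: increase from merging the least-distorted cluster Ci into its
# neighbour Cj; dD: decrease from splitting the most-distorted
# cluster Cz. Hypothetical values for illustration.
dI = 4.2
dD = 5.0
M = dI - dD
if M < 0:
    print(f"Jump is worthwhile: total distortion drops by {-M:.2f}")
```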

Incremental K-means
Assign K = 1.
Phase 1. Normal training
Step 1. If K = 1, choose an arbitrary point as the cluster
center.
If K > 1, insert the center of the new cluster inside
the cluster with the greatest distortion error.
Step 2. Assign each object in the training set to the closest
cluster and update its center.
Step 3. If no cluster center moves, go to phase 2;
else go to phase 1, step 2.
Phase 2. Increasing the number of clusters
If K is smaller than a specified value, increase K by 1 and
go to phase 1, step 1; else stop.
(A sketch of the full two-phase procedure is given below.)
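A minimal sketch of the two-phase procedure in Python/NumPy, assuming Euclidean data in the rows of X; the choice of seed point inside the worst cluster and all names are illustrative assumptions:

```python
import numpy as np

def incremental_kmeans(X, k_max, max_iter=100):
    centers = X.mean(axis=0, keepdims=True)            # phase 1 with K = 1
    for k in range(2, k_max + 1):                      # phase 2: grow K
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # Distortion error of each existing cluster.
        dist = np.array([((X[labels == j] - centers[j]) ** 2).sum()
                         for j in range(len(centers))])
        # Phase 1, step 1 (K > 1): seed the new center inside the
        # cluster with the greatest distortion error.
        worst = int(np.argmax(dist))
        centers = np.vstack([centers, X[labels == worst][0]])
        # Phase 1, steps 2-3: normal training until no center moves.
        for _ in range(max_iter):
            labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
            new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
            if np.allclose(new, centers):
                break
            centers = new
    return centers
```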

Incremental K-means
Time complexity of incremental K-means = O(K^2*N*num_iter),
where num_iter is the number of iterations needed for final
convergence.
Advantages of incremental K-means:
>> Removes the possibility of converging at a local minimum.
>> No need to shift a cluster center from a low-distortion
cluster to a high-distortion cluster and to calculate the
effect on the overall and individual distortion errors of the
other clusters.
Disadvantages:
>> Higher time complexity than the simple K-means algorithm.

Performance comparison
Data sets used: 6 artificial and 6 real data sets from the UCI
repository.
Stopping criteria: maximum number of iterations (20 in this
case) and percentage reduction in the distortion error in one
iteration (10^-7 in this case).
Number of clusters: K = 1 to 15.
Results:
>> On almost all data sets, K-means with the jumping operation
outperforms the original K-means.
>> For incremental K-means, Ia/Imin is almost equal to 1 in
most cases, showing that the result is independent of K and of
the nature of the data set, and that the algorithm provides
reliable, near-optimal clustering.

Thank you!
