
A Review Paper

On
K-means Data Clustering Approach
Shaikh Faizan Ahmed, Atir Khan
School of Engineering and Technology
faizanshaikh1230@gmail.com, atir.khan11@gmail.com

Abstract- In data mining, clustering is a technique in which a set of objects is assigned to groups called clusters. Clustering is an essential part of data mining. K-means clustering is the basic clustering technique and is the most widely used algorithm. It is also known as nearest neighbor searching. It simply partitions a dataset into a given number of clusters. Numerous efforts have been made to improve the performance of the K-means clustering algorithm. In this paper we briefly review the work carried out by different researchers using K-means clustering. We discuss the limitations and applications of the K-means clustering algorithm as well. This paper presents a current review of the K-means clustering algorithm.

I. INTRODUCTION
Due to the increased availability of computer hardware and software and the rapid computerization of business, large amounts of data have been collected and stored in databases. Researchers have estimated that the amount of information in the world doubles every 20 months. However, raw data cannot be used directly; its real value lies in extracting information useful for decision support. In most areas, data analysis was traditionally a manual process. When the size of data manipulation and exploration grows beyond human capabilities, people look to computing technologies to automate the process.
Data mining is one of the youngest research activities in the field of computing science and is defined as the extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data. Data mining is applied to gain useful information out of bulk data. There are a number of tools and techniques provided by researchers in data mining to obtain patterns from data. Different patterns can be mined by classification, clustering, association rules, regression, outlier analysis, etc.

II. K-MEANS CLUSTERING
K-means clustering is the most widely used clustering algorithm and appears in many areas such as information retrieval, computer vision and pattern recognition. K-means clustering assigns n data points to k clusters so that similar data points are grouped together. It is an iterative method which assigns each point to the cluster whose centroid is nearest, and then recalculates each centroid as the average of the points in its group. Algorithm 1 shows the basic approach of K-means clustering.

1: An initial clustering is created by choosing k random centroids from the dataset.
2: For each data point, calculate the distance from all centroids, and assign its membership to the nearest centroid.
3: Recalculate the new cluster centroids as the average of all data points that are assigned to each cluster.
4: Repeat steps 2 and 3 until convergence.

Algorithm 1: K-Means Clustering
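As a concrete illustration, the four steps of Algorithm 1 can be written as a short Python sketch (plain Python with Euclidean distance; the function and variable names are ours, not taken from any surveyed implementation):

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means (Lloyd's algorithm) following Algorithm 1."""
    rng = random.Random(seed)
    # Step 1: choose k random points from the dataset as initial centroids.
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [math.dist(p, c) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (an empty cluster keeps its old centroid).
        new_centroids = [
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```

Running this on two well-separated groups of 2-D points places one centroid near each group; the empty-cluster guard above is one simple choice among several discussed later under Limitations.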
The working of Algorithm 1 can be explained with the help of an example, shown in Figure 2. Figure 2 gives a graphical representation of the working of the K-means algorithm. In the first step there are two sets of objects. Then the centroids of both sets are determined. According to the centroids, the clusters are formed again, which gives different clusters of the dataset. This process repeats until the best clusters are achieved.
There are abundant tools available for data mining, among them RapidMiner, R, KNIME, own code, Weka, Pentaho, Statistica, SAS or SAS Enterprise Miner, Orange, Tanagra, and MATLAB.

Figure 2: Working of the K-means clustering algorithm

III. EVOLUTION
K-means was introduced by James MacQueen in 1967. It is observed that a lot of work has been done in this field. In the time frame of 1967 to 1998, all the research work was related to the introduction of K-means in the clustering area. After this, modifications and improvements to K-means clustering began. Alsabti et al. have given an efficient way of clustering by organizing the patterns in a k-d tree so that the desired pattern can be found easily. Kamal et al. have introduced an algorithm which uses labeled and unlabeled documents, based on expectation maximization and a classifier. They have concluded that their algorithm improves classification results. Kiri et al. have developed a general method for incorporating background knowledge into the K-means clustering algorithm in the form of constraints. Tapas et al. have presented an implementation of Lloyd's K-means clustering algorithm, which they termed the filtering algorithm. Cheung has given a generalized K-means clustering; with this algorithm, correct clusters can be formed without an initially known number of clusters. A simple study of the influence of semantics in information retrieval using LSI and the K-means clustering technique has been done by Jimenez et al. Modha and Spangler have obtained a framework for integrating multiple, diverse feature spaces into the K-means clustering algorithm. Tian et al. give a study on a parallel K-means clustering algorithm and present a superior initial-centers method to reduce the number of operations required while grouping.
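Many of the surveyed improvements concern how the initial centers are chosen. As an illustrative sketch only, a farthest-first (min-max) heuristic that spreads the seeds across the dataset looks like this; it is in the spirit of the seeding ideas above, not the specific method of Tian et al. or any other surveyed paper:

```python
import math
import random

def spread_seeds(points, k, seed=0):
    """Farthest-first seeding: pick one centroid at random, then
    repeatedly pick the point farthest from all centroids so far."""
    rng = random.Random(seed)
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Each point's distance to its nearest already-chosen centroid;
        # the point maximizing that distance becomes the next seed.
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(far)
    return centroids
```

For comparison, k-means++ (Arthur and Vassilvitskii, listed in the references) instead samples each new seed randomly with probability proportional to its squared distance from the centers chosen so far, which is less sensitive to outliers than this deterministic farthest-point rule.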
Pham et al. have worked on the number k used in K-means clustering. They have concluded that different datasets call for different numbers of clusters. A survey was conducted by Xindong Wu et al.; in it they present the top 10 algorithms in data mining with their limitations, current status and future work. Xiong et al. have provided results on the effect of skewed data distributions on K-means clustering. They have given an organized study of K-means and cluster validation measures from a data distribution perspective. In fact, their focus was on characterizing the relationships between data distribution and K-means clustering, in addition to the entropy measure and F-measure. Xiuyun Li et al. have proposed an improved K-means clustering which uses fuzzy feature selection. They used a feature importance factor to obtain the contribution of each feature to the clustering. Osama Abu Abbas has given a comparison between some of the data clustering algorithms. Yan Zhu et al. have proposed a new method in which clustering initialization is done using clustering exemplars produced by affinity propagation. They have also minimized the total squared error among the clusters. Taoying Li et al. have given a new approach towards fuzzy K-means clustering and report higher efficiency, greater precision and a reduced amount of calculation. Viet-Vu Vu et al. have proposed an efficient algorithm for active seed selection which is based on a min-max approach that favors coverage of the whole dataset. Zhu and Wang have given an improved clustering algorithm with the use of a genetic algorithm. Oyelade et al. have implemented the K-means clustering algorithm to analyze the academic performance of students on the basis of some preset measures. Wu and Yao have applied their improved algorithm to clustering analysis of transit data with the same site name and different locations. Yufen Sun et al. have presented a general K-means clustering to identify natural clusters in datasets. They have also shown high accuracy in their results. Wang and Yin have shown that their algorithm overcomes the deficiencies of the original K-means clustering and has higher accuracy. Li Xinwu has given higher accuracy and better stability by improving the clustering algorithm. Napoleon and Lakshmi have analyzed the execution time taken by the original K-means clustering and their proposed K-means algorithm. Hesam et al. have given improvements for a guided K-means algorithm so that astrophysics databases can be handled. Shi Na et al. present a simple way to assign data points to clusters. Their improved algorithm works in O(nk) time with high accuracy. Shamir and Tishby have concluded that K-means does not break down in the large-sample regime. Mark and Boris address the most controversial issue of clustering, i.e. the selection of the right number of clusters. Honda et al. have proposed a method for PCA-guided K-means clustering of incomplete datasets. They have concluded that PCA-guided K-means clustering is more robust than K-means clustering with PDS. Sathya et al. have given an approach to efficiently retrieve the search of clusters, in which comparison is based on the similarity of documents and co-occurrence terms of the query. Xueyi Wang has proposed a new algorithm, kMkNN, for the nearest neighbor searching problem, considering an implementation of K-means clustering and the triangle inequality. Ren and Fan have introduced a K-means clustering approach based on the coefficient of variation.
They show how their approach can generate better results than the K-means clustering algorithm. Son and Anh have compared two different methods for center initialization: one is the kd-tree and the other is the CF tree. Thomas A. Runkler has focused on partially supervised clustering and introduced a partially supervised k-harmonic means clustering. Fatta et al. have proposed a decentralized algorithm for K-means. They concluded that their proposed method is practical and accurate. Murugesan and Zhang introduced a hybrid algorithm for K-means clustering. They use two approaches, top-down and bottom-up, as bisecting K-means and UPGMA respectively. Sarma et al. have proposed a fast method for K-means clustering which is useful for large datasets. They found that their approach speeds up kernel K-means clustering. Wang and Su have modified the clustering algorithm and showed their test results on the iris, wine and abalone datasets. They have improved the algorithm with respect to the time and accuracy of the results. Tripathy et al. have proposed a method for the traditional kernel-based K-means clustering algorithm which later uses the rough set concept for updating the centroid value. Abhay et al. gave a model for predicting the outcome as yes or no in K-means clustering on weather data. The maximum triangle rule was proposed by Feng et al. to optimize the K-means clustering algorithm. They have overcome a shortcoming of K-means by introducing KMTR for the improvement of clusters. Ekasit et al. have proposed parallel K-means on GPU clusters. They use the task pool model for dynamic load balancing to distribute the workload equally across the GPUs installed in the clusters, so as to improve the performance of parallel K-means at the inter-node level. Jing et al. presented a simple, easily parallelized and efficient K-means clustering algorithm via closure. Zhang et al. present a clustering algorithm based on self-adaptive weights. Their experimental results show that it is more accurate and stable. Mahmud et al. have shown an improvement in the algorithm by taking a harmonic weighted average to overcome the initial seed point limitation. They have also reduced the number of iterations of the clustering procedure. Wang et al. have improved K-means clustering using the density concept. They succeeded in obtaining increased clustering precision and an improved criterion function E. Lee and Lin have designed a selection-and-erasure K-means algorithm. The authors achieved an increase in efficiency with a large number of clusters. Patil and Vaidya carried out a review of different clustering techniques. Cheng et al. proposed a system named CluChunk. This system clusters unlabeled web data incorporating chunklet information. Bikram et al. have made some improvements to the traditional K-means clustering algorithm and used the DI and DBI parameters for clustering validation. Ellis et al. have implemented a quantum-based K-means clustering which shows improvements in accuracy and precision. Nuno et al. have improved the text clustering approach via K-means using overlapping community structures of a network of tags. Anoop and Satyam made a survey of recent clustering techniques in data mining. Vijayalakshmi and Renuka discussed different methodologies and parameters associated with different clustering algorithms. They also discussed issues in different clustering algorithms used on large datasets.
Kurt et al. have presented spherical K-means clustering and suitable extensions. They also introduced the R extension package skmeans. Deepti et al. made a study on different clustering algorithms with their introduction, applications, limitations and requirements. Maryam et al. have made an analysis of clustering algorithms to choose the best algorithm for identifying duplicate entities. Rupali and Suresh proposed an improved K-means clustering algorithm for two-dimensional data. Biggio et al. have presented an approach to evaluate a clustering algorithm's security in adversarial settings. Khadem et al. provided a survey on data mining methods and utilities. Yogish and Raju have presented an approach for clustering web users by using an ART1 neural network based clustering algorithm. The performance of this method is compared with the K-means and SOM clustering algorithms. Silva et al. presented a current survey on data stream clustering. Ichikawa and Morishita have introduced a new and simple method with a heuristic feature that reduces the computational time. Pattabiraman et al. used three different clustering methods to cluster forum threads and discussed the improvement of accuracy. Parimala and Palanisamy have introduced a new term, "MFCC", i.e. Multitype Feature Co-selection for Clustering. To perform clustering of web documents, it exploits different types of feature classes. The authors also addressed some challenges of search engines. Wang et al. have introduced the AFS global K-means algorithm. In this method the distance based on the AFS topology neighborhood is employed to determine the initial cluster centers. The execution time of the K-means clustering algorithm has been reduced by Lee and Lin. Sarma et al. have proposed a prototype-based hybrid approach to accelerate traditional K-means clustering. Lam et al. have proposed a PSO-based K-means clustering for gene expression to enhance cluster matching. Sharma and Fotedar have made a review of different data mining techniques used for software effort estimation. Krey et al. have presented an order-constrained solution in K-means as a more stable method for the clustering of sound features. Huwang and Su have improved the traditional K-means algorithm by making an analysis of the statistical data. Xue and Liu proposed a new approach to solve problems of clustering, which combines membrane computing with the K-means algorithm.

IV. LIMITATIONS
K-means clustering has some limitations which need to be overcome. Several researchers encountered multiple limitations while working with the K-means algorithm. Some of the common limitations are discussed below.

Outliers
It has been observed by several researchers that, when the data contains outliers, there will be variation in the result, meaning no stable result across different executions on the same data. Outliers are objects that are present in the dataset but do not belong in the clusters formed. Outliers can also increase the sum of squared error within clusters. Hence it is very important to remove outliers from the dataset. Outliers can be removed by applying preprocessing techniques to the original dataset.
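A minimal sketch of such a preprocessing step, assuming one-dimensional numeric data and the common (but arbitrary) three-standard-deviations rule:

```python
import math

def remove_outliers(values, z_thresh=3.0):
    """Drop 1-D values whose z-score exceeds the threshold.

    Illustrative only: many other rules (IQR fences, distance to the
    nearest centroid, density-based filters) are used in practice.
    """
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return list(values)  # all values identical; nothing to drop
    return [v for v in values if abs(v - mean) / std <= z_thresh]
```

Filtering the data this way before running K-means removes extreme points that would otherwise pull centroids away from the true clusters and inflate the within-cluster sum of squared error.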
Number of clusters
Determining the number of clusters in advance has always been a challenging task for the K-means clustering approach. It is beneficial to determine the correct number of clusters in the beginning. It has been observed that sometimes the number of clusters is assigned according to the number of classes present in the dataset.

Empty clusters
If no points are allocated to a cluster during the assignment step, then an empty cluster occurs. This was an early problem with the traditional K-means clustering algorithm.

Non-globular shapes and sizes
With the K-means clustering algorithm, if the clusters are of different sizes, different densities or non-globular shapes, then the results are not optimal. There is always an issue with the convex shapes of the clusters formed.

V. APPLICATIONS
There are diverse applications of clustering techniques in the fields of finance, health care, telecommunication, science, the World Wide Web, etc. Some of the applications are discussed below.

Clustering Algorithm in Identifying Cancerous Data
Clustering algorithms can be used to identify cancerous data records within a dataset. Different people have tried this application by assigning labels to known samples of datasets as cancerous and non-cancerous. Then the data samples are randomly mixed together and different clustering algorithms are applied. The result of clustering is analyzed to identify the correctly clustered samples. The accuracy of clustering is calculated easily, as the labels of the samples were known initially.

Clustering Algorithm in Search Engines
Clustering algorithms play an important role in the functioning of search engines; they act as a backbone to search engines. Search engines try to group similar kinds of objects into one cluster and dissimilar objects into others. The performance of search engines depends on the working of the clustering techniques.

Clustering Algorithm in Academics
Students' academic progress monitoring has been a vital issue for the academic community in higher learning. With a clustering technique this issue can be managed easily. Based on the scores obtained by the students, they are grouped into different clusters, where each cluster represents a different level of performance. By counting the number of students in each cluster we can determine the average performance of a class altogether.

Clustering Algorithm in Wireless Sensor Network based Applications
Clustering algorithms can be used efficiently in Wireless Sensor Network based applications, for example in landmine detection. The clustering algorithm plays the role of finding the cluster heads, which collect all the data in their respective clusters.

VI. CONCLUSION
In this paper, we have made a survey of the work carried out by different researchers using the K-means clustering approach. We also discussed the evolution, limitations and applications of the K-means clustering algorithm. It is observed that a lot of improvement has been made to the working of the K-means algorithm over the past years.
Most of the work has been carried out on improving the efficiency and accuracy of the clusters. This field is always open for improvement. Setting an appropriate initial number of clusters is always a challenging task. In the end it is concluded that, although plenty of work has been done on the K-means clustering approach, there is scope for future enhancement.

VII. REFERENCES

1. B. Chanda and D. Dutta Majumdar, Digital Image Processing and Analysis.

2. H. Zha, C. Ding, M. Gu, X. He and H. D. Simon, "Spectral Relaxation for K-means Clustering", Neural Information Processing Systems, vol. 14 (NIPS 2001), pp. 1057-1064, Vancouver, Canada, Dec. 2001.

3. J. A. Hartigan (1975), Clustering Algorithms, Wiley.

4. J. A. Hartigan and M. A. Wong (1979), "A K-Means Clustering Algorithm", Applied Statistics, vol. 28, no. 1, pp. 100-108.

5. D. Arthur and S. Vassilvitskii (2006), "How Slow is the k-means Method?"

6. D. Arthur and S. Vassilvitskii (2007), "k-means++: The Advantages of Careful Seeding", Symposium on Discrete Algorithms (SODA).

7. www.wikipedia.com
