On K-means Data Clustering Approach
Shaikh Faizan Ahmed, Atir Khan
School of Engineering and Technology
faizanshaikh1230@gmail.com, atir.khan11@gmail.com
Working of K-means clustering algorithm

III. EVOLUTION
K-means was introduced by James MacQueen in 1967. It is observed that a lot of work has been done in this field. In the time frame of 1967 to 1998, research largely concerned methods to reduce the number of actions required while grouping. Pham et al. worked on the value of k used in K-means clustering; they concluded that different datasets require different numbers of clusters. A survey was conducted by Xindong Wu et al.; in their survey they present the top 10 algorithms in data mining along with their limitations, current status, and future work.
Xiong et al. provided results on the effect of skewed data distributions on K-means clustering. They give an organized study of K-means and cluster validation measures from a data-distribution perspective; in fact, their focus was on characterizing the relationships between data distribution and K-means clustering, in addition to the entropy measure and F-measure. Xiuyun Li et al. proposed an improved K-means clustering which uses fuzzy feature selection. They used a feature importance factor to obtain the contribution of all the features in clustering. Osama Abu Abbas has given a comparison between some of the data clustering algorithms. Yan Zhu et al. proposed a new method in which clustering initialization is done using clustering exemplars produced by affinity propagation; they also minimized the total squared error among the clusters. Taoying Li et al. gave a new approach to fuzzy K-means clustering and concluded with higher efficiency, greater precision, and a reduced amount of calculation. Viet-Vu Vu et al. proposed an efficient algorithm for active seed selection, based on a min-max approach that favors coverage of the whole dataset. Zhu and Wang gave an improved clustering algorithm that makes use of a genetic algorithm.
Oyelade et al. implemented the K-means clustering algorithm to analyze the academic performance of students on the basis of some preset measures.
Work on the semantics in IR using LSI and the K-means clustering technique has been done by Jimenez et al. Modha and Spangler obtained a structure for integrating multiple, diverse feature spaces in the K-means clustering algorithm. Tian et al. give a study on the parallel K-means clustering algorithm and presented a superior initial-centers method. Yao et al. applied their improved algorithm to clustering analysis of transit data with the same site name and different locations. Yufen Sun et al. presented a general K-means clustering to identify natural clusters in datasets; they also showed high accuracy in their results. Wang and Yin showed that their algorithm overcomes the deficiencies of the original K-means clustering and has higher accuracy. Li Xinwu achieved higher accuracy and better stability by improving the clustering algorithm. Napoleon and Lakshmi analyzed the time taken for execution by the original K-means clustering and by their proposed K-means algorithm.
Hesam et al. gave improvements for the guided K-means algorithm so that astrophysics databases can be handled. Shi Na et al. present a simple way to assign data points to clusters; their improved algorithm works in O(nk) time with high accuracy. Shamir and Tishby concluded that K-means does not break down in the large-sample regime. Mark and Boris address the most controversial issue of clustering, i.e., the selection of the right number of clusters.
Honda et al. proposed a method for PCA-guided K-means clustering of incomplete datasets; they concluded that PCA-guided K-means clustering is more robust than K-means clustering with PDS. Sathya et al. gave an approach to efficiently retrieve clusters, in which the comparison is based on the similarity of documents and the co-occurrence terms of the query. Xueyi Wang proposed a new algorithm, kMkNN, for the nearest-neighbor searching problem; he considered an implementation of K-means clustering with the triangle inequality. Ren and Fan introduced a K-means clustering approach based on
coefficient of variation. They show how their approach can generate better results than the K-means clustering algorithm. Son and Anh compared two different methods for center initialization: one is the kd-tree and the other is the CF tree. Thomas A. Runkler has focused on partially supervised clustering and introduced a partially supervised k-harmonic means clustering. Fatta et al. proposed a decentralized algorithm for K-means; they concluded that their proposed method is practical and accurate. Murugesan and Zhang introduced a hybrid algorithm for K-means clustering; they use two approaches, top-down and bottom-up, as bisecting K-means and UPGMA respectively. Sarma et al. proposed a fast method for K-means clustering which is useful for large datasets; they found that their approach speeds up kernel K-means clustering. Wang and Su modified the clustering algorithm and showed test results on the iris, wine, and abalone datasets; they improved the algorithm with respect to both the time and the accuracy of the results.
Tripathy et al. proposed a method for the traditional kernel-based K-means clustering algorithm which later uses the rough-set concept for updating the centroid value. Abhay et al. gave a model for predicting the outcome as yes or no in K-means clustering on weather data. The idea of the maximum triangle rule was proposed by Feng et al. to optimize the K-means clustering algorithm; they overcame a shortcoming of K-means by introducing KMTR for the improvement of clusters. Ekasit et al. proposed parallel K-means on GPU clusters; they use the task-pool model for dynamic load balancing to distribute the workload equally over the different GPUs installed in the clusters, so as to improve the performance of the parallel K-means at the inter-node level.
Jing et al. presented a simple, easily parallelized, and efficient K-means clustering algorithm via closure. Zhang et al. present a clustering algorithm based on self-adaptive weights; their experimental results show that it is more accurate and stable. Mahmud et al. showed an improvement in the algorithm by taking a weighted average to overcome the initial seed-point limitation; they also reduced the number of iterations of the clustering procedure. Wang et al. improved K-means clustering using the density concept; they succeeded in obtaining increased clustering precision and a better criterion function E. Lee and Lin designed a selection-and-erasure K-means algorithm; the authors achieved an increase in efficiency with a large number of clusters. Patil and Vaidya carried out a review of different clustering techniques. Cheng et al. proposed a system named CluChunk, which clusters unlabeled web data incorporating chunklet information. Bikram et al. made some improvements to the traditional K-means clustering algorithm and used the DI and DBI parameters for clustering validation. Ellis et al. implemented a quantum-based K-means clustering, which shows improvements in accuracy and precision. Nuno et al. improved the text-clustering approach via K-means by using overlapping community structures of a network of tags. Anoop and Satyam made a survey of recent clustering techniques in data mining. Vijayalakshmi and Renuka discussed different methodologies and parameters associated with different clustering algorithms; they also discussed issues in the clustering algorithms used on large datasets. Kurt et al. have
presented spherical K-means clustering and suitable extensions; they also introduced the R extension package skmeans. Deepti et al. made a study of different clustering algorithms covering their introduction, applications, limitations, and requirements. Maryam et al. made an analysis of clustering algorithms to choose the best algorithm for identifying duplicate entities. Rupali and Suresh proposed an improved K-means clustering algorithm for two-dimensional data.
Biggio et al. presented an approach to evaluate a clustering algorithm's security in adversarial settings. Khadem et al. provided a survey on data mining methods and utilities. Yogish and Raju presented an approach for clustering web users by using an ART1 neural-network-based clustering algorithm; the performance of this method is compared with the K-means and SOM clustering algorithms. Silva et al. presented a current survey on data-stream clustering.
Ichikawa and Morishita introduced a new and simple method with a heuristic feature that reduces the computational time. Pattabiraman et al. used three different clustering methods to cluster forum threads and discussed the improvement of accuracy.
Parimala and Palanisamy introduced a new term, "MFCC", i.e., Multitype Feature Co-selection for Clustering. To perform clustering of web documents, it exploits different types of feature classes; the authors also addressed some challenges of search engines. Wang et al. introduced the AFS global K-means algorithm; in this method, a distance based on the AFS topology neighborhood is employed to determine the initial cluster centers. The execution time of the K-means clustering algorithm has been reduced by Lee and Lin. Sarma et al. proposed a prototype-based hybrid approach to accelerate the traditional K-means clustering. Lam et al. proposed a PSO-based K-means clustering for gene expression to enhance cluster matching. Sharma and Fotedar made a review of different data mining techniques used for software effort estimation.
Krey et al. presented an order-constrained solution in K-means as a more stable method for clustering of sound features. Huwang and Su improved the traditional K-means algorithm by making an analysis of the statistical data. Xue and Liu proposed a new approach to solve clustering problems, which combines membrane computing with the K-means algorithm.

IV. LIMITATIONS
K-means clustering has some limitations which need to be overcome. Several researchers have encountered multiple limitations while working with the K-means algorithm. Some of the common limitations are discussed below.

Outliers
It has been observed by several researchers that when the data contains outliers there will be variation in the result, that is, no stable result across different executions on the same data. Outliers are objects that are present in the dataset but do not belong to the clusters formed. Outliers can also increase the sum of squared error within clusters. Hence it is very important to remove outliers from the dataset. Outliers can be removed by applying preprocessing techniques on the original dataset.

Number of clusters
Determining the number of clusters in advance has always been a challenging task for the K-means clustering approach. It is beneficial to determine the correct number of clusters at the beginning. It has been observed that sometimes the number of clusters is assigned according to the number of classes present in the dataset.

Empty clusters
If no points are allocated to a cluster during the assignment step, an empty cluster occurs. This was an early problem with the traditional K-means clustering algorithm.

Non-globular shapes and sizes
With the K-means clustering algorithm, if the clusters are of different sizes, different densities, or non-globular shapes, the results are not optimal. There is always an issue with the convex shapes of the clusters formed.

Clustering accuracy is calculated easily when the labels of the samples are known initially.

Clustering Algorithm in Search Engines
Clustering algorithms play an important role in the functioning of search engines; hence they act as a backbone to search engines. Search engines try to group similar kinds of objects into one cluster and dissimilar objects into others. The performance of the search engines depends on the working of the clustering techniques.

Clustering Algorithm in Academics
Students' academic progress monitoring has been a vital issue for the academic society of higher learning. With clustering techniques this issue can be managed easily. Based on the scores obtained by the students, they are grouped into different clusters, where each cluster shows a different level of performance. By calculating the number of students in each cluster we can determine the average performance of a class altogether.
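The outlier and empty-cluster limitations discussed above can be illustrated with a small sketch. The following is a minimal pure-Python Lloyd's-algorithm implementation on toy one-dimensional data; the `kmeans` and `remove_outliers` helpers, the z-score threshold, and the data are illustrative assumptions, not from any of the surveyed papers. A single extreme point captures one of the two centroids and merges the two real groups, and a simple preprocessing filter restores them.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm on 1-D data; returns sorted centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # random initial centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # empty-cluster limitation: keep the old center if no point was assigned
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

def remove_outliers(points, z=2.0):
    """Toy preprocessing: drop points more than z standard deviations from the mean."""
    m = sum(points) / len(points)
    s = (sum((p - m) ** 2 for p in points) / len(points)) ** 0.5
    return [p for p in points if s == 0 or abs(p - m) / s <= z]

data = [0, 1, 2, 10, 11, 12]       # two natural groups around 1 and 11
dirty = data + [1000]              # one extreme outlier

print(kmeans(data, 2))                     # [1.0, 11.0] - the two real groups
print(kmeans(dirty, 2))                    # [6.0, 1000.0] - outlier grabs a whole cluster
print(kmeans(remove_outliers(dirty), 2))   # [1.0, 11.0] again after preprocessing
```

On this toy data every initialization converges to the same centroids, so the comparison is deterministic; on real data a more robust statistic, such as the median absolute deviation, would normally be preferred for detecting outliers.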