
International Journal of Research in Advanced Technology - IJORAT

Vol. 1, Issue 6, JUNE 2016


Parneet Kaur1, Kamaljit Kaur2
CSE Department, GNDU University, Amritsar, India1
CSE Department, GNDU University, Amritsar, India2
Abstract: Data mining refers to the extraction of hidden, predictive information from large databases. The extracted information is visualized in the form of charts, graphs, tables and other graphical forms. Clustering is an unsupervised approach under data mining which groups data points together on the basis of similarity and separates them from dissimilar objects. Many clustering algorithms, such as the algorithm for mining clusters with arbitrary shapes (CLASP), density peaks (DP) clustering and k-means, have been proposed by different researchers in different areas to enhance the clustering technique. The limitation of one clustering technique may be resolved by another. The main objective of this review paper is a comparative study of clustering algorithms, together with the identification of issues that arise during the clustering process.
Keywords: Data Mining, Database, Clustering, K- Means, Outliers.

I. INTRODUCTION
Data mining is used for analysing huge datasets: it finds relationships among the data and summarizes the results in a form that is useful and understandable to the user. Today, large datasets are present in many areas due to the usage of distributed information systems [14]. The sheer amount of data stored in the world today is commonly known as big data. The process of extracting useful patterns of knowledge from databases is called data mining, and the extracted information is visualized in the form of charts, graphs, tables and other graphical forms. Data mining is also known as KDD (Knowledge Discovery from Databases). The data present in a database is in structured format, whereas a data warehouse may contain unstructured data; static data is comparatively easier to handle than dynamically varying data [16]. Reliability and scalability are two major challenges in data mining: effective, efficient and scalable mining should be achieved by building incremental and efficient mining algorithms for large datasets and streaming data [14]. In this review paper our main objective is to carry out a comparative study of clustering algorithms and to identify the challenges associated with them.
II. CLUSTERING TECHNIQUES
Clustering means putting objects with similar properties into one group and objects with dissimilar properties into another. Given a threshold value, objects whose values lie above and below the threshold are placed into different clusters [14]. A cluster is a group of objects which possess common characteristics, and the main objective of clustering is to find the inherent grouping in a set of unlabeled data [16]. Clustering is referred to as an unsupervised learning technique because of the absence of classifiers and their associated labels; it is a type of learning by observation [26]. A clustering algorithm must satisfy certain requirements: it should be scalable, able to deal with distinct attribute types, capable of discovering arbitrary shaped clusters, and should have minimal requirements for domain knowledge to determine input parameters. In addition it should deal with noise and outliers and be insensitive to the order of input records [27, 30, 9].
A. Partitioning Clustering
In partitioning methods the instances are relocated, moving from one cluster to another starting from an initial partitioning. The number of clusters to be formed is user defined. Examples of partitioning algorithms include Clustering LARge Applications (CLARA) and k-means [19, 1].
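The relocation idea can be made concrete with a minimal k-means sketch (an illustration of the partitioning scheme only, not of CLARA; the toy data and initial centers are made up):

```python
import math

def kmeans(points, centers, iters=20):
    """Plain Lloyd's algorithm: alternate assignment and mean-update steps."""
    k = len(centers)
    for _ in range(iters):
        # Assignment step: relocate each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        centers = [tuple(sum(x) / len(c) for x in zip(*c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Two well-separated groups; the number of clusters (two initial
# centers) is supplied by the user, as in all partitioning methods.
data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, clusters = kmeans(data, [(0, 0), (10, 10)])
```

On this data the centers converge to the two group means after the first relocation pass.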
B. Density Based Clustering
These methods are based upon density: a cluster keeps growing as long as the density in its neighbourhood exceeds some threshold value [40]. The Density Based Spatial Clustering of Applications with Noise (DBSCAN) approach is a density based technique built on the idea that a least number of data points (MinPts) must be present around a point within its neighbourhood of radius Eps [16].
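The MinPts/Eps idea can be sketched as follows (a simplified, unoptimized illustration of DBSCAN's cluster-growing step, with made-up toy data):

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: labels[i] becomes a cluster id, or -1 for noise."""
    labels = [None] * len(points)
    cluster = -1

    def neighbours(i):
        # All points within radius eps of points[i], including itself.
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:      # fewer than MinPts around it:
            labels[i] = -1            # provisionally mark as noise
            continue
        cluster += 1                  # i is a core point: grow a cluster
        labels[i] = cluster
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:       # noise reachable from a core point
                labels[j] = cluster   # becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:  # j is itself a core point: keep growing
                seeds.extend(nbrs)
    return labels

data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
labels = dbscan(data, eps=1.5, min_pts=3)
```

Here the four close points form one cluster, while the isolated point never reaches MinPts neighbours and is labelled noise.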
C. Hierarchical Methods
In such methods the data set is decomposed into a hierarchy. The decomposition can be done in an agglomerative or a divisive manner. The agglomerative approach is a bottom-up technique where initially each data object forms its own group and groups are merged step by step, whereas the divisive approach is top-down: initially all data points are present in one cluster, and with every iteration this cluster is split into smaller clusters until each data point forms a cluster of its own. This kind of decomposition is represented by a tree structure called a dendrogram [18].
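The agglomerative (bottom-up) variant can be sketched directly from this description (a minimal single-linkage illustration that stops at k clusters instead of building the full dendrogram):

```python
import math

def single_linkage(points, k):
    """Agglomerative clustering: start with one cluster per point and
    repeatedly merge the two closest clusters until only k remain.
    Single linkage: cluster distance = distance of the closest pair."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the closest pair of clusters
    return clusters

data = [(0, 0), (0, 1), (5, 5), (5, 6)]
groups = single_linkage(data, 2)
```

Recording the sequence of merges instead of stopping at k would yield exactly the dendrogram described above.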
D. Model Based Clustering
Model based approaches hypothesize a mathematical model for the dataset and optimize the fit between the data and that model. The model is assumed to have generated the data, and the original model is then recovered from the data; the recovered model defines the clusters and assigns documents to them [17].
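The generate-then-recover idea can be illustrated with a tiny expectation-maximization (EM) fit of a mixture model (a toy sketch assuming two equal-weight, unit-variance 1-D Gaussian components, far simpler than the models used in practice):

```python
import math

def em_two_gaussians(xs, mu, iters=50):
    """Tiny EM for a 1-D mixture of two equal-weight, unit-variance
    Gaussians: recover the model assumed to have generated the data."""
    for _ in range(iters):
        # E-step: responsibility of component 0 for each point.
        r = []
        for x in xs:
            p0 = math.exp(-0.5 * (x - mu[0]) ** 2)
            p1 = math.exp(-0.5 * (x - mu[1]) ** 2)
            r.append(p0 / (p0 + p1))
        # M-step: re-estimate each mean from its weighted points.
        mu[0] = sum(ri * x for ri, x in zip(r, xs)) / sum(r)
        mu[1] = sum((1 - ri) * x for ri, x in zip(r, xs)) / sum(1 - ri for ri in r)
    return mu

# Two clearly separated groups; the fitted means define the clusters,
# and the responsibilities assign points to them.
data = [-0.2, 0.0, 0.3, 9.8, 10.0, 10.1]
mu = em_two_gaussians(data, [1.0, 9.0])
```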
III. ISSUES IN CLUSTERING
Different types of mining algorithms have been proposed by different researchers; selecting an appropriate clustering algorithm, however, depends on the application goal and on the algorithm's compatibility with the dataset. This section illustrates issues that may arise during the formation of clusters and the different approaches proposed to tackle them.
A. Identification of the Number of Clusters
Very few techniques can automatically detect the number of clusters to be formed. Some techniques rely on information provided by the user, while others use cluster validity indices, which are very costly in terms of computation time. Statistics such as the Pseudo-F statistic and the Cubic Clustering Criterion (CCC) are used for identifying the cluster number [26]. Hao Huang et al. [27] designed an approach for mining clusters with arbitrary shapes (CLASP) that shrinks the size of the dataset; CLASP is an effective and efficient algorithm which automatically determines the number of clusters and also saves computational cost. Zhensong Chen et al. [21] presented an approach for image segmentation based on density peaks (DP) clustering. This method possesses many advantages over current methods: it can predict the cluster number from the decision graph and identifies the correct cluster centers. Mark Junjie Li et al. [22] presented an agglomerative fuzzy k-means clustering algorithm which clusters numerical data efficiently. It is an extension of the fuzzy k-means algorithm and is used effectively for numerical data; although it can identify the cluster number, overlapping may occur during the cluster formation process.
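Beyond the cited indices, a common inexpensive heuristic for the cluster-number problem is the "elbow" of the within-cluster sum of squared errors (SSE) over increasing k, sketched here (an illustrative heuristic, not one of the cited approaches; data and initial centers are made up):

```python
import math

def kmeans_sse(points, centers, iters=20):
    """Run plain k-means from the given centers, then return the
    within-cluster sum of squared errors (SSE)."""
    k = len(centers)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(p)
        centers = [tuple(sum(x) / len(g) for x in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)

data = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
# SSE drops sharply once k reaches the true number of groups (2),
# then flattens: the "elbow" suggests the cluster count.
sse1 = kmeans_sse(data, [(5, 5)])
sse2 = kmeans_sse(data, [(0, 0), (9, 9)])
```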
B. Clustering Large Datasets
Significant accuracy in clustering can be achieved by using Constrained Spectral Clustering (CSC) algorithms; however, the existing CSC algorithms are inefficient at handling moderate and large datasets. Clustering LARge Applications (CLARA) is the best partitioning technique designed for large datasets, with low computation time [1]. Macario O. Cordel et al. [25] presented a visualization methodology for large datasets: Self Organizing Maps (SOM) are an unattractive tool here due to their time complexity, but this method emulates the SOM methodology without the speed constraints, achieving great speed while clustering large datasets. Ahmad M. Bakr et al. [28] proposed an enhanced version of the DBSCAN algorithm which clusters massive datasets efficiently and gives better clustering results than other incremental algorithms. The technique incrementally builds and updates arbitrary shaped clusters, and by limiting the search space to the relevant partitions rather than the whole dataset it enhances the process of incremental clustering. Xiaoyun Chen et al. [29] give a new approach by improving an existing semi-supervised clustering algorithm (SCMD). It has the advantage of dealing with environments having multiple densities and performs better than SCMD when the constraints in the dataset are insufficient. Chih-Ping Wei et al. [1] give a comparative study of algorithms which cluster complex datasets: as the number of clusters increases, Clustering Large Applications based on RANdomized Search (CLARANS) performs best in execution time and produces good quality clusters; on large datasets CLARA gives better clustering results, whereas Genetic Algorithm based clustering with Random Respectful Recombination (GAC-R) performs efficient clustering only on small datasets.
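CLARA's sampling idea, clustering a small sample and then assigning the full dataset to the learned centers, can be sketched as follows (a simplified illustration that substitutes k-means on the sample for the k-medoids step CLARA actually performs; data and parameters are made up):

```python
import math
import random

def clara_like(points, k, sample_size, seed=0):
    """Cluster a random sample, then assign every point to the nearest
    learned center; the full dataset is touched only once, in the final
    cheap assignment pass."""
    rng = random.Random(seed)
    sample = rng.sample(points, sample_size)
    # Naive seeding: spread the initial centers across the sorted sample.
    srt = sorted(sample)
    centers = [srt[i * (len(srt) - 1) // (k - 1)] for i in range(k)]
    for _ in range(20):                       # k-means on the sample only
        groups = [[] for _ in range(k)]
        for p in sample:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(p)
        centers = [tuple(sum(x) / len(g) for x in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    # Single pass over the full dataset: assign to the nearest center.
    return [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]

data = [(i % 3, i % 3) for i in range(500)] + [(20 + i % 3, 20) for i in range(500)]
labels = clara_like(data, 2, sample_size=50)
```

The expensive clustering step runs on 50 points instead of 1,000, which is the source of CLARA's low computation time on large datasets.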
C. Outlier Detection
Outliers can be studied in domains such as big data, uncertain data, data with multiple dimensions and biological datasets. Because of the complexity involved, outlier detection is a difficult task, and since data streams cannot be scanned multiple times it is a major problem in streaming data. Fuzzy logic is good for handling uncertainties, and many applications require real-time outlier detection; due to their parallel nature, neural networks are good at handling real-time applications, so efficient outlier detection results can be obtained by hybridizing neural networks and fuzzy techniques. There are many methods for detecting outliers, such as Mahalanobis Outlier Analysis (MOA), Rule Based Modeling (RBM) and the Local Outlier Factor Method (LOFM) [30, 23]. Nilam Upasani et al. [30] gave an approach known as the fuzzy min-max neural network which detects outliers efficiently. This algorithm performs well but depends on user-defined parameters; its drawbacks are an increase in recall time and an added complexity of O(k) in the testing phase. A. Christy et al. [23] proposed two algorithms, cluster-based and distance-based outlier detection, which use an outlier score for detecting and then removing outliers. Outliers are removed on the basis of a key attribute subset instead of considering full dimensional subsets of the dataset, by first cleaning the dataset and then clustering it on the basis of
similarity. Better accuracy is provided by the cluster based outlier detection technique as compared to the distance based approach.

D. Large Computational Time
As compared to traditional clustering algorithms like k-means, hierarchical clustering algorithms have many advantages, but they may suffer from high computational cost [14]. Density based outlier detection algorithms likewise suffer from large computation time: although they have a number of advantages, high computation time is a major barrier, and such algorithms have a less obvious parallel structure [15]. To resolve the problem of time and cost, several algorithms have been proposed. William Hendrix et al. [24] presented SHRINK, a shared-memory algorithm for single linkage hierarchical clustering which merges overlapping clusters; it provides a considerable speedup on synthetic and real datasets of up to 250,000 points. Parallel algorithms have also been proposed for clustering large datasets in bioinformatics, where high time consumption often occurs while solving the cluster identification problem (separating dense clusters from a noisy background); using such approaches, the computational time can be reduced to a great extent. Spectral clustering algorithms can easily recognize non-convex distributions and are used in image segmentation and many other fields, but they often incur high computation time on large images; to solve this problem, Kai Li et al. [7] proposed a spectral clustering based algorithm which performs image segmentation in less computational time. Seung Kim et al. [15] used a method for reducing the computational time of the density based algorithm known as the Local Outlier Factor (LOF); it combines approximate k-nearest neighbour search (ANN) with kd-tree indexing and works efficiently in detecting local outliers in less computational time.

E. Efficient Initial Seed Selection
The k-means algorithm is a crucial clustering algorithm for mining data. Its centers are generated randomly or assumed to be already available. In seed based integration, a small set of labeled data (called seeds) is integrated, which improves performance and overcomes the problem of choosing initial seed centers [20]. Viet-Vu Vu et al. [8] performed active seed selection using an efficient min-max based approach which covers the entire dataset; after a few queries every cluster contains at least one seed point, and the number of iterations is also reduced. Kiran Agrawal et al. [10] gave an efficient k-means based algorithm which solves the problem of initial seed selection and also determines the number of clusters to be formed; it gives satisfactory results and works efficiently and accurately. Iurie Chiosa et al. [11] proposed a novel clustering algorithm called Variational Multilevel Mesh Clustering (VMLC) which incorporates the benefits of variational (Lloyd-type) algorithms and hierarchical clustering algorithms. Since the initial seed selection is not predefined, a multilevel clustering is built which resolves the problems present in variational algorithms and performs the initial seed selection; the further problem of non-optimal cluster shapes is addressed by the greedy nature of the hierarchical approach. Tu Linli et al. [12] gave a new k-means clustering technique which considers double attributes of objects: the high density set generates a dissimilarity degree matrix, a Huffman tree is constructed on the basis of this matrix, and the initial cluster seeds are then selected from the Huffman tree, overcoming the problem of initial seed selection. Md Anisur Rahman et al. [13] use the ModEx and Seed-Detective approaches, which support high quality clustering by generating good initial seeds. The former is a modified version of the Ex-Detective technique and also addresses some of its limitations; the latter combines ModEx with simple k-means. Using ModEx, Seed-Detective produces high quality initial seeds which are given as input to k-means, leading to better cluster formation. Jeyhun Karimov et al. [20] proposed a hybrid evolutionary model for k-means clustering (HE k-means) which selects good initial centroids for k-means by using meta-heuristics; clustering quality is improved by 30% compared to random seed selection.

F. Identification of Different Distance and Similarity Measures
For measuring distance, some standard equations are used for numerical attributes, such as the Euclidean, Manhattan and maximum distances; these three are special cases of the Minkowski distance. Euclidean distance (ED) is the measure usually used for evaluating the similarity between two points. It is a very simple and easy metric, but it possesses some disadvantages: it is not suitable for time series application fields and is highly susceptible to outliers and noise [2]. Usue Mori et al. [2] proposed a multi-label classification framework which selects a reliable distance measure for clustering time series databases; the appropriate distance measure is selected automatically. The classifier is based on characteristics describing important features of time series databases and can easily predict and discriminate between different sets of measures. Duc Thang Nguyen [3] discusses two clustering methods, explicit and implicit, for finding the similarity between objects on the basis of viewpoints. The traditional technique uses a single viewpoint, whereas other similarity measure techniques use multiple viewpoints for
achieving more information. D.S. Yeung and X.Z. Wang [4] presented measures between the features of objects and find similarity by using a weighted distance; clustering performance can be improved significantly with a gradient descent technique which makes learning the feature weights easy. Magnus Rattray [5] illustrates that the Riemannian distance is appropriate for clustering multivariate data, but it requires that a Riemannian metric be defined. R. Vidal et al. [6] introduced a framework for defining a Euclidean space for Linear Dynamical Systems (LDSs). This framework is used to compute distances easily in massive datasets and is suitable for many applications, such as the analysis of dynamic visual scenes; a simple LDS averaging algorithm based on this distance is devised and is suitable for clustering time-series data. Luan Luan et al. [9] presented a relational distance for multi-relational clustering: the relational distance among tuples is calculated, the weight used in each table represents the distance measurement, and in addition the distance between two clusters is calculated to find the center point.
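The Minkowski family mentioned above reduces to the Manhattan distance for p = 1, the Euclidean distance for p = 2, and the maximum (Chebyshev) distance in the limit as p grows, which a short sketch makes concrete:

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def chebyshev(a, b):
    """Limit of the Minkowski distance as p grows: the largest
    coordinate-wise difference."""
    return max(abs(x - y) for x, y in zip(a, b))

u, v = (0, 0), (3, 4)
manhattan = minkowski(u, v, 1)   # |3| + |4| = 7.0
euclidean = minkowski(u, v, 2)   # sqrt(9 + 16) = 5.0
maximum = chebyshev(u, v)        # max(|3|, |4|) = 4
```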


IV. COMPARATIVE ANALYSIS
This section summarizes the clustering approaches reviewed in the sections above. It is clear from the table that the limitation addressed by one technique may be resolved by another.

[Summary table: each reviewed approach with its technique, advantages and limitations; legible fragments include Junjie Li's agglomerative fuzzy k-means extension (automatically determines the cluster number, but clusters may overlap in large datasets), Kai Li's spectral segmentation, Macario Cordel's SOM emulation (2015) and A. Christy's outlier detection. The remaining cells of the table are not recoverable from the source.]


V. CONCLUSION
This paper presents a comparative study of clustering techniques such as CLARA, k-means, CLASP and SHRINK, which are used by researchers in different application areas. The comparison of clustering algorithms is examined at different levels of perception, and the paper highlights the issues and challenges present in different clustering algorithms; an issue arising in one approach may be resolved by another. Fuzzy logic is good for handling uncertainties, and due to their parallel nature neural networks are good at handling real-time applications, so efficient outlier detection results can be obtained by hybridizing neural networks and fuzzy techniques. We conclude that algorithms like CLARA cluster large datasets efficiently, whereas some asymmetric clustering algorithms like CLASP cluster simple datasets efficiently but do not give the expected outputs on mixed and tightly coupled datasets, for which they are less accurate and efficient. Therefore, techniques based on neural networks should be proposed to improve clustering efficiency in the asymmetric algorithms.
REFERENCES
[1] Chih-Ping Wei, Yen-Hsien Lee and Che-Ming Hsu, Empirical Comparison of Fast Clustering Algorithms for Large Data Sets, Proceedings of the 33rd Hawaii International Conference on System Sciences, 2000.
[2] Usue Mori, Alexander Mendiburu and Jose A. Lozano, Similarity Measure Selection for Clustering Time Series Databases, IEEE Transactions on Knowledge and Data Engineering.
[3] Duc Thang Nguyen, "Clustering with Multiviewpoint-Based
Similarity Measure", IEEE Transactions on Knowledge & Data
Engineering, vol. 24, no. 6, pp. 988-1001, June 2012.
[4] D. S. Yeung, X. Z. Wang, "Improving Performance of
Similarity-Based Clustering by Feature Weight Learning,"
IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 24, no. 4, pp. 556-561, April, 2002.
[5] Magnus Rattray, "A Model-Based Distance for Clustering," IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN'00), Volume 4, p. 4013, 2000.
[6] R. Vidal, A. Ravichandran, B. Afsari and R. Chaudhry, "Group action induced distances for averaging and clustering Linear Dynamical Systems with applications to the analysis of dynamic scenes," 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2208-2215, 2012.
[7] Kai Li, Xinxin Song, "A Fast Large Size Image Segmentation
Algorithm Based on Spectral Clustering," 2012 Fourth
International Conference on Computational and Information
Sciences, pp. 345-348.
[8] Viet-Vu Vu, Nicolas Labroche, Bernadette Bouchon-Meunier, "Active Learning for Semi-Supervised K-Means Clustering," 2010 IEEE 22nd International Conference on Tools with Artificial Intelligence (ICTAI), pp. 12-15, 2010.

[9] Luan Luan, Yun Li, Jiang Yin and Yan Sheng, Multi-relational Clustering Based on Relational Distance, Physics Procedia, Volume 24, Part C, 2012, Pages 1982-1989, ISSN 1875-3892.
[10] Kiran Agrawal and Ashish Mishra, "Improved K-MEAN Clustering Approach for Web Usage Mining," 2009 International Conference on Emerging Trends in Engineering & Technology (ICETET), pp. 298-300, 2009.
[11] Iurie Chiosa and Andreas Kolb, "Variational Multilevel Mesh Clustering," 2008 IEEE International Conference on Shape Modeling and Applications, pp. 197-204, 2008.
[12] Tu Linli, Deng Yanni and Chu Siyong, A K-Means Clustering Algorithm Based on Double Attributes of Objects, 2015 Seventh International Conference on Measuring Technology and Mechatronics Automation (ICMTMA), 2015.
[13] Md Anisur Rahman, Md Zahidul Islam and Terry Bossomaier, ModEx and Seed-Detective: Two novel techniques for high quality clustering by using good initial seeds in K-Means, Journal of King Saud University - Computer and Information Sciences, Volume 27, Issue 2, April 2015, Pages 113-128, ISSN 1319-1578.
[14] R. Mythily, Aisha Banu and Shriram Raghunathan, Clustering Models for Data Stream Mining, Procedia Computer Science, Volume 46, 2015, Pages 619-626, ISSN 1877-0509.
[15] Seung Kim, Nam Wook Cho, Bokyoung Kang, Suk-Ho Kang,
Fast outlier detection for very large log data, Expert Systems
with Applications, Volume 38, Issue 8, August 2011, Pages
9587-9596, ISSN 0957-4174.
[16] Amineh Amini, Teh Ying Wah and Hadi Saboohi, On Density-Based Data Streams Clustering Algorithms: A Survey, Journal of Computer Science and Technology, Volume 29, Issue 1, January 2014, pp. 116-141.
[17] Zhang Tie-jun, Chen Duo, Sun Jie, Research on Neural
Network Model Based on Subtraction Clustering and Its
Applications, Physics Procedia, Volume 25, 2012, Pages
1642-1647, ISSN 1875-3892.
[18] Pedro Pereira Rodrigues, Joao Gama, Joao Pedro Pedroso,
"Hierarchical Clustering of Time-Series Data Streams," IEEE
Transactions on Knowledge and Data Engineering, vol. 20,
no. 5, pp. 615-627, May, 2008
[19] V. Batagelj, A. Mrvar and M. Zaversnik, Partitioning approaches to clustering in graphs, Proc. Graph Drawing 1999, LNCS, 2000, pp. 90-97.
[20] Jeyhun Karimov, Murat Ozbayoglu, Clustering Quality
Improvement of k-means Using a Hybrid Evolutionary Model,
Procedia Computer Science, Volume 61, 2015, Pages 38-45,
ISSN 1877-0509
[21] Zhensong Chen, Zhiquan Qi, Fan Meng, Limeng Cui and Yong Shi, Image Segmentation via Improving Clustering Algorithms with Density and Distance, Procedia Computer Science, Volume 55, 2015, Pages 1015-1022, ISSN 1877-0509.
[22] Mark Junjie Li, Michael K. Ng, Yiu-ming Cheung, Joshua
Zhexue Huang, "Agglomerative Fuzzy K-Means Clustering
Algorithm with Selection of Number of Clusters," IEEE
Transactions on Knowledge and Data Engineering, vol. 20,
no. 11, pp. 1519-1534, November, 2008.
[23] A. Christy, G. Meera Gandhi and S. Vaithyasubramanian, Cluster Based Outlier Detection Algorithm for Healthcare Data, Procedia Computer Science, Volume 50, 2015, Pages 209-215, ISSN 1877-0509.
[24] William Hendrix, Md. Mostofa Ali Patwary, Ankit Agrawal, Wei-keng Liao and Alok Choudhary, Parallel Hierarchical Clustering on Shared-Memory Platforms, 2012 IEEE.
[25] Macario O. Cordel II and Arnulfo P. Azcarraga, Fast Emulation of Self-organizing Maps for Large Datasets, Procedia Computer Science, Volume 52, 2015, Pages 381-388, ISSN 1877-0509.
[26] Parul Agarwal, M. Afshar Alam and Ranjit Biswas, Issues, Challenges and Tools of Clustering Algorithms, IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 3, No. 1, May 2011.
[27] Hao Huang, Yunjun Gao, Kevin Chiew, Lei Chen and Qinming He, "Towards effective and efficient mining of arbitrary shaped clusters," 2014 IEEE 30th International Conference on Data Engineering (ICDE), pp. 28-39, 2014.
[28] Ahmad M. Bakr, Nagia M. Ghanem, Mohamed A. Ismail,
Efficient incremental density-based algorithm for clustering
large datasets, Alexandria Engineering Journal, Volume 54,
Issue 4, December 2015, Pages 1147-1154, ISSN 1110-0168.
[29] Xiaoyun Chen, Sha Liu, Tao Chen, Zhengquan Zhang,
Hairong Zhang, An Improved Semi-Supervised Clustering
Algorithm for Multi-Density Datasets with Fewer Constraints,
Procedia Engineering, Volume 29, 2012, Pages 4325-4329,
ISSN 1877-7058.
[30] Nilam Upasani and Hari Om, Evolving Fuzzy Min-max Neural Network for Outlier Detection, Procedia Computer Science, Volume 45, 2015, Pages 753-761, ISSN 1877-0509.

All Rights Reserved 2016 IJORAT