I. INTRODUCTION
In today's scenario, large amounts of data such as web commerce records, credit card/debit card transactions, and web server logs are collected and stored in databases and warehouses every second. The applications range from scientific discovery to business, and they employ various techniques to analyze data, finding hidden patterns that are not readily evident and building models for all types of data [2]. There are many techniques to find such patterns: supervised learning includes classification and regression; unsupervised learning consists of clustering; dependency modeling covers associations, summarization, and causality; and there is also outlier and deviation detection [1].
Clustering is defined as partitioning the data into groups or subsets called clusters. Clustering methods include partitioning and hierarchical approaches. Hierarchical clustering is broadly categorized into two groups: the divisive approach and the agglomerative approach. The main drawback of this method is that a division or merge, once performed on the clusters, is permanent, and the time complexity is quite high, i.e. O(M²), where M is the number of data points. Partitioning clustering divides the dataset into a given number of clusters. There are many methods through which partitioning clustering can be done, e.g. graph-theoretic, density-based, model-based, and grid-based. The aim of clustering is to maximize the homogeneity within clusters and the heterogeneity between clusters. Since in unsupervised learning the initial knowledge, i.e. where to start the clustering, is missing, dividing the dataset into clusters becomes more complex; in particular, the selection of initial centers is quite difficult. K-Means and K-Medoids are two well-known partitioning algorithms. In K-Means the initial centers are randomly selected from the dataset, and the algorithm mainly consists of two steps: 1) assigning each data point to its nearest center using a distance measure; 2) updating each center by taking the mean of all data points assigned to that cluster. These two steps are repeated until the centers converge. K-Medoids likewise selects its initial medoids randomly from the dataset. In the first step, each medoid is tentatively swapped with every non-medoid point and the cost of the resulting configuration is calculated. In the second step, if the new cost is lower than the previous one, the swap between the medoid and the non-medoid is kept. These steps are repeated until no data point or medoid changes its cluster.
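The two alternating K-Means steps described above can be made concrete with a short sketch. The following Java code is only an illustration of the standard algorithm, not the paper's implementation; the class and method names are ours, and the deterministic initial picks stand in for the random selection described in the text.

```java
import java.util.Arrays;

public class KMeansSketch {
    // Squared Euclidean distance between two points of equal dimension.
    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return s;
    }

    // Step 1: assign every data point to its nearest center.
    static int[] assign(double[][] X, double[][] centers) {
        int[] label = new int[X.length];
        for (int i = 0; i < X.length; i++) {
            double best = Double.MAX_VALUE;
            for (int j = 0; j < centers.length; j++) {
                double d = dist2(X[i], centers[j]);
                if (d < best) { best = d; label[i] = j; }
            }
        }
        return label;
    }

    // Step 2: move each center to the mean of its assigned points.
    static double[][] update(double[][] X, int[] label, int k) {
        int m = X[0].length;
        double[][] c = new double[k][m];
        int[] count = new int[k];
        for (int i = 0; i < X.length; i++) {
            count[label[i]]++;
            for (int d = 0; d < m; d++) c[label[i]][d] += X[i][d];
        }
        for (int j = 0; j < k; j++)
            if (count[j] > 0)                      // empty clusters keep a zero center
                for (int d = 0; d < m; d++) c[j][d] /= count[j];
        return c;
    }

    public static void main(String[] args) {
        double[][] X = { {1, 1}, {1.2, 0.9}, {5, 5}, {5.1, 4.8} };
        // Deterministic picks for brevity; plain K-Means draws these at random.
        double[][] centers = { X[0].clone(), X[2].clone() };
        int[] label = assign(X, centers);
        for (int it = 0; it < 100; it++) {         // repeat until the centers converge
            double[][] next = update(X, label, 2);
            if (Arrays.deepEquals(next, centers)) break;
            centers = next;
            label = assign(X, centers);
        }
        System.out.println(Arrays.toString(label)); // e.g. [0, 0, 1, 1]
    }
}
```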
The selected initial centers are represented as $(Y_1, Y_2, \ldots, Y_k)$, and for a candidate point $Y_{k+1}$ the distances $dis(Y_{k+1}, Y_1), dis(Y_{k+1}, Y_2), \ldots, dis(Y_{k+1}, Y_k)$ to the selected points are computed. The c-closest neighbors measure of a data point $j$ is accumulated as

$$CCN(j, c) = \sum_{i=1}^{c} CCP(j, i)$$

where $CCP(j, i)$ denotes the distance from data point $j$ to its $i$-th closest point.
Then, we calculate the c-closest data points radius (CCR), which indicates whether a data point lies in a dense area or not. Here we take the mean distance from the point to its c closest data points, where the distance between two points is the Euclidean distance:

$$dis(Y_j, Y_i) = \sqrt{\sum_{d=1}^{m} \left(Y_{jd} - Y_{id}\right)^{2}}$$

where m indicates the number of dimensions of a data point and c indicates the number of classes into which the data points were divided by experts.
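As an illustration, here is a minimal Java sketch of this distance and of a CCR computed, as described above, as the mean distance to a point's c closest neighbors; the class and method names are ours.

```java
import java.util.Arrays;

public class CcrSketch {
    // Euclidean distance dis(Y_j, Y_i) over m dimensions.
    static double dis(double[] yj, double[] yi) {
        double s = 0;
        for (int d = 0; d < yj.length; d++) s += (yj[d] - yi[d]) * (yj[d] - yi[d]);
        return Math.sqrt(s);
    }

    // CCR(j): mean distance from point j to its c closest other points.
    // A small value suggests that point j lies in a dense region.
    static double ccr(double[][] Y, int j, int c) {
        double[] d = new double[Y.length - 1];
        int t = 0;
        for (int i = 0; i < Y.length; i++)
            if (i != j) d[t++] = dis(Y[j], Y[i]);
        Arrays.sort(d);                 // c smallest distances come first
        double sum = 0;
        for (int i = 0; i < c; i++) sum += d[i];
        return sum / c;
    }

    public static void main(String[] args) {
        double[][] Y = { {0, 0}, {0.1, 0}, {0, 0.1}, {5, 5} };
        System.out.println(ccr(Y, 0, 2)); // dense point: small radius
        System.out.println(ccr(Y, 3, 2)); // isolated point: large radius
    }
}
```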
C. Merge Stage: Merge the pair of clusters having the minimum Euclidean distance and update the cluster centers, repeating until the desired number of clusters is reached. The updated center of a merged cluster is taken as the mean of its members,

$$\frac{1}{c}\sum_{i=1}^{c} Y_i$$

where c is the number of data points in the merged cluster.
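A hedged Java sketch of this merge stage follows, under the simplifying assumption that merging two centers replaces them with their pairwise mean; the names and the small demo are ours.

```java
import java.util.ArrayList;
import java.util.List;

public class MergeSketch {
    static double dis(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return Math.sqrt(s);
    }

    // Repeatedly merge the two closest candidate centers until k remain,
    // replacing each merged pair with its mean.
    static List<double[]> merge(List<double[]> centers, int k) {
        while (centers.size() > k) {
            int a = 0, b = 1;
            double best = Double.MAX_VALUE;
            for (int i = 0; i < centers.size(); i++)
                for (int j = i + 1; j < centers.size(); j++) {
                    double d = dis(centers.get(i), centers.get(j));
                    if (d < best) { best = d; a = i; b = j; }
                }
            double[] ca = centers.get(a), cb = centers.get(b);
            double[] mean = new double[ca.length];
            for (int d = 0; d < ca.length; d++) mean[d] = (ca[d] + cb[d]) / 2;
            centers.remove(b);   // remove the later index first
            centers.remove(a);
            centers.add(mean);
        }
        return centers;
    }

    public static void main(String[] args) {
        List<double[]> c = new ArrayList<>();
        c.add(new double[]{0, 0}); c.add(new double[]{0.2, 0});
        c.add(new double[]{5, 5});
        for (double[] p : merge(c, 2)) System.out.println(p[0] + ", " + p[1]);
    }
}
```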
$(Y_1, Y_2, \ldots, Y_k)$.
Step2.3: Find the data point with the minimum CNR, add it to IL (the set of selected initial centers), and delete it and all of its c closest neighbors from RL.
Step2.4: Go to Step 2.1 until RL is empty.
Step3: Merge potential centers.
Step3.1: Place all potential points acquired in Step 1 and Step 2 in IC.
Step3.2: Calculate the distances between all data points in IC, select the pair with the minimum distance, and merge them, updating the center in IC.
Step3.3: Go to Step 3.2 until |IC| = k.
Step4: Feed IC to K-Means as the initial centers and run until the centers do not change.
End
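To make Steps 2.1 to 2.4 concrete, the following is a hedged Java sketch of the selection loop, assuming that the CNR of a point is its mean distance to its c closest neighbors (as with the CCR above). The names RL and IL follow the pseudocode; everything else (class name, helper methods, the small demo) is ours.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SelectionSketch {
    static double dis(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return Math.sqrt(s);
    }

    // Assumed CNR: mean distance from RL.get(j) to its c closest other points.
    static double cnr(List<double[]> RL, int j, int c) {
        List<Double> ds = new ArrayList<>();
        for (int i = 0; i < RL.size(); i++)
            if (i != j) ds.add(dis(RL.get(j), RL.get(i)));
        ds.sort(Double::compare);
        double sum = 0;
        for (int i = 0; i < c; i++) sum += ds.get(i);
        return sum / c;
    }

    // Steps 2.1-2.4: pick min-CNR points, removing each pick and its
    // c closest neighbors from RL until RL is empty.
    static List<double[]> selectCenters(List<double[]> data, int c) {
        List<double[]> RL = new ArrayList<>(data);
        List<double[]> IL = new ArrayList<>();
        while (!RL.isEmpty()) {
            int best = 0;
            double bestR = Double.MAX_VALUE;
            for (int j = 0; j < RL.size(); j++) {
                int cc = Math.min(c, RL.size() - 1);
                double r = (cc == 0) ? 0 : cnr(RL, j, cc);
                if (r < bestR) { bestR = r; best = j; }
            }
            final double[] picked = RL.remove(best);
            IL.add(picked);
            RL.sort(Comparator.comparingDouble((double[] p) -> dis(p, picked)));
            RL.subList(0, Math.min(c, RL.size())).clear(); // drop c nearest neighbors
        }
        return IL; // Step 3 then merges these candidates down to k centers
    }

    public static void main(String[] args) {
        List<double[]> data = new ArrayList<>();
        data.add(new double[]{0, 0}); data.add(new double[]{0.1, 0});
        data.add(new double[]{5, 5}); data.add(new double[]{5.1, 5});
        System.out.println(selectCenters(data, 1).size()); // 2 candidate centers
    }
}
```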
IV. EXPERIMENT AND DISCUSSION
In this study, the proposed algorithm was implemented in Java using the Eclipse IDE and run on a MacBook Pro (OS X Yosemite, version 10.10) with a 2.5 GHz Intel Core i5 processor and 4 GB of 1600 MHz DDR3 memory.
A. Description of Datasets
Our experiments use three real datasets taken from the UCI dataset repository [11]. The Iris dataset contains 150 objects in 5 dimensions; all values are preprocessed and real, with no missing values. The experts' classification divides the Iris data into 3 classes of 50 objects each. The Climate Model Simulation Crashes dataset contains 540 objects in 21 dimensions; all values are preprocessed and real, with no missing values. The Seeds dataset consists of 210 objects in 8 dimensions; all values are preprocessed and real, with no missing values [11]. All the datasets are described in Table 1.
Table 1: Description of Datasets

DATASETS                            INSTANCES   ATTRIBUTES
IRIS PLANT                          150         5
CLIMATE MODEL SIMULATION CRASHES    540         21
SEEDS                               210         8
1. RMSE: The Root Mean Square Error represents the mean of the sum of squared errors. A smaller root mean square error indicates a better cluster structure:

$$RMSE = \sqrt{\frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}\left(x_{ij} - c_{ij}\right)^{2}}, \qquad x_i \in X,\; c_i \in C$$

where X is the set of n data points, C is the set of k cluster centers ($c_i$ denoting the center nearest to $x_i$), k is the number of clusters, and m is the number of dimensions.
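A small Java sketch of this measure, assuming each point's nearest center is already known via an assignment array; the names are ours.

```java
public class RmseSketch {
    // RMSE over n points and m dimensions; centers[assign[i]] is the
    // nearest center of point i (assignments computed elsewhere, e.g. by K-Means).
    static double rmse(double[][] X, double[][] centers, int[] assign) {
        int n = X.length, m = X[0].length;
        double sum = 0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++) {
                double diff = X[i][j] - centers[assign[i]][j];
                sum += diff * diff;
            }
        return Math.sqrt(sum / (n * m));
    }

    public static void main(String[] args) {
        double[][] X = { {1, 1}, {1, 2}, {5, 5} };
        double[][] c = { {1, 1.5}, {5, 5} };
        int[] a = { 0, 0, 1 };
        System.out.println(rmse(X, c, a));
    }
}
```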
2. SSE: The SSE sums the squared distances of all data points to their nearest centers. A smaller SSE value indicates a result closer to the optimal clustering structure:

$$SSE = \sum_{i=1}^{k}\sum_{x_j \in c_i} \left\| x_j - c_i \right\|^{2}, \qquad x_j \in X,\; c_i \in C$$
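Again as an illustrative sketch, with our names and the same assignment-array convention as above:

```java
public class SseSketch {
    // SSE: sum of squared distances of every point to its nearest center.
    static double sse(double[][] X, double[][] centers, int[] assign) {
        double total = 0;
        for (int i = 0; i < X.length; i++) {
            double[] c = centers[assign[i]];
            for (int d = 0; d < c.length; d++) {
                double diff = X[i][d] - c[d];
                total += diff * diff;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] X = { {1, 1}, {1, 2}, {5, 5} };
        double[][] c = { {1, 1.5}, {5, 5} };
        System.out.println(sse(X, c, new int[]{0, 0, 1}));
    }
}
```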
3. SD Index: The SD index combines the average scattering (compactness) of the clusters with the total separation between the cluster centers; a smaller value indicates a better clustering:

$$SDIndex = \frac{1}{k}\sum_{i=1}^{k}\frac{\left\|\sigma(C_i)\right\|}{\left\|\sigma(X)\right\|} + \frac{D_{max}}{D_{min}}\sum_{i=1}^{k}\left(\sum_{j=1,\, j \neq i}^{k}\left\| C_i - C_j \right\|\right)^{-1}$$

where $\sigma(C_i)$ is the variance of cluster i, $\sigma(X)$ is the variance of the whole dataset, and $D_{max}$ and $D_{min}$ are the maximum and minimum distances between cluster centers.
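A hedged Java sketch follows, taking the weighting factor between the two terms as 1 for simplicity and assuming at least two clusters; all names are ours, and clusters are passed as explicit point arrays.

```java
public class SdIndexSketch {
    static double norm(double[] v) {
        double s = 0;
        for (double x : v) s += x * x;
        return Math.sqrt(s);
    }

    // Per-dimension variance vector of the rows in pts (pts must be non-empty).
    static double[] variance(double[][] pts) {
        int m = pts[0].length;
        double[] mean = new double[m], var = new double[m];
        for (double[] p : pts)
            for (int d = 0; d < m; d++) mean[d] += p[d] / pts.length;
        for (double[] p : pts)
            for (int d = 0; d < m; d++) {
                double diff = p[d] - mean[d];
                var[d] += diff * diff / pts.length;
            }
        return var;
    }

    static double dis(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return Math.sqrt(s);
    }

    // SD index = average scattering + separation term (weight taken as 1 here).
    static double sdIndex(double[][] X, double[][][] clusters, double[][] centers) {
        int k = clusters.length;
        double sigmaX = norm(variance(X)), scat = 0;
        for (double[][] c : clusters) scat += norm(variance(c)) / sigmaX;
        scat /= k;

        double dmax = 0, dmin = Double.MAX_VALUE, sep = 0;
        for (int i = 0; i < k; i++) {
            double rowSum = 0;
            for (int j = 0; j < k; j++)
                if (i != j) {
                    double d = dis(centers[i], centers[j]);
                    rowSum += d;
                    dmax = Math.max(dmax, d);
                    dmin = Math.min(dmin, d);
                }
            sep += 1.0 / rowSum;
        }
        return scat + (dmax / dmin) * sep;
    }

    public static void main(String[] args) {
        double[][] a = { {0, 0}, {0, 1} }, b = { {5, 5}, {5, 6} };
        double[][] X = { {0, 0}, {0, 1}, {5, 5}, {5, 6} };
        double[][] centers = { {0, 0.5}, {5, 5.5} };
        System.out.println(sdIndex(X, new double[][][]{a, b}, centers));
    }
}
```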
4. DB Index: The DB index calculates, for each cluster, the similarity to its most similar cluster, balancing compactness against separation, and averages these values. A smaller DB index indicates a preferable cluster configuration.
$$DBIndex = \frac{1}{n_j}\sum_{k=1}^{n_j} R_k, \qquad R_k = \max\left(S_{kl}\right), \; k, l = 1, 2, \ldots, n_j, \; k \neq l$$

where $S_{kl} = \dfrac{disp_k + disp_l}{dis_{kl}}$ is the similarity of clusters k and l, $disp_k = \frac{1}{|c_k|}\sum_{x \in c_k} d(x, v_k)$ is the dispersion of cluster k, and $dis_{kl} = d(v_k, v_l)$ is the dissimilarity of clusters k and l, i.e. the distance between their centers $v_k$ and $v_l$.
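The standard Davies-Bouldin computation can be sketched in Java as below; class and method names are ours, and clusters are again passed as explicit point arrays.

```java
public class DbIndexSketch {
    static double d(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    // Dispersion: mean distance of a cluster's points to its center.
    static double disp(double[][] cluster, double[] center) {
        double s = 0;
        for (double[] x : cluster) s += d(x, center);
        return s / cluster.length;
    }

    // DB index: average, over clusters, of the worst-case similarity
    // S_kl = (disp_k + disp_l) / d(v_k, v_l).
    static double dbIndex(double[][][] clusters, double[][] centers) {
        int k = clusters.length;
        double total = 0;
        for (int i = 0; i < k; i++) {
            double worst = 0;
            for (int j = 0; j < k; j++)
                if (i != j) {
                    double s = (disp(clusters[i], centers[i])
                              + disp(clusters[j], centers[j])) / d(centers[i], centers[j]);
                    worst = Math.max(worst, s);
                }
            total += worst;
        }
        return total / k;
    }

    public static void main(String[] args) {
        double[][] a = { {0, 0}, {0, 1} }, b = { {5, 5}, {5, 6} };
        double[][] centers = { {0, 0.5}, {5, 5.5} };
        System.out.println(dbIndex(new double[][][]{a, b}, centers));
    }
}
```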
5. Iterations: The iteration count is also compared with the existing techniques; the number of iterations needed to acquire the final cluster centers should be as small as possible.
D. Results and Discussion
In this section, we show the results of experiments performed with the proposed method and compare it against three well-known existing methods using three standard datasets and five validity measures. All results are given in Tables 2, 3, and 4 and Figures 1, 2, and 3 below, where we assume that there are three clusters (k = 3) in each dataset. For effective visualization in the figures, all RMSE values are divided by 100 and all DB values are multiplied by 100. In all comparison tables, bold values indicate the better results.
Table 2: Comparison of Results on Iris Plant Dataset

QUALITY MEASURE    K-MEANS    K-MEDOIDS    K-MEANS++    PROPOSED METHOD
RMSE               122.27     113.34       97.34        97.32
SSE                3.191      2.180        1.419        1.414
SD INDEX           16.5       8.1          5.9          3.1
DB INDEX           0.09       0.08         0.01         0.08
ITERATION          29700      -            -            -
Table 3: Comparison of Results on Climate Model Simulation Crashes Dataset

QUALITY MEASURE    K-MEANS    K-MEDOIDS    K-MEANS++    PROPOSED METHOD
RMSE               313.73     381.90       313.21       313.21
SSE                7.54       12.79        8.79         8.97
SD INDEX           2.76       4.11         3.06         2.56
DB INDEX           0.021      0.022        0.012        0.025
ITERATION          33390      -            -            -
Table 4: Comparison of Results on Seeds Dataset

QUALITY MEASURE    K-MEANS    K-MEDOIDS    K-MEANS++    PROPOSED METHOD
RMSE               644.65     744.84       645.07       644.64
SSE                2.65       4.18         2.70         2.33
SD INDEX           4.12       1.28         4.19         4.06
DB INDEX           0.020      0.010        0.020        0.017
ITERATION          21         200340       21           13
[Figure 1: Bar chart comparing K-Means, K-Medoids, K-Means++, and the proposed method on RMSE, SSE, SD Index, DB Index, and iterations for the Iris Plant dataset.]
[Figure 2: Bar chart comparing K-Means, K-Medoids, K-Means++, and the proposed method on RMSE, SSE, SD Index, DB Index, and iterations for the Climate Model Simulation Crashes dataset.]
[Figure 3: Bar chart comparing K-Means, K-Medoids, K-Means++, and the proposed method on RMSE, SSE, SD Index, DB Index, and iterations for the Seeds dataset.]
REFERENCES
[1] A. K. Jain, M. N. Murty, and P. J. Flynn, Data Clustering: A Review, ACM Computing Surveys (CSUR), Volume 31, Issue 3, pp. 264-323, 1999.

[2] K. P. Soman, S. Diwakar, and V. Ajay, Insight into Data Mining: Theory and Practice, pp. 17-18, 2006.

[3] J. Chang, SDCC: A New Stable Double-Centroid Clustering Technique Based on K-Means for Non-spherical Patterns, Advances in Neural Networks, Springer Berlin Heidelberg, pp. 794-801, 2009.

[4] Y. Ye, J. Z. Huang, X. Chen, S. Zhou, G. Williams, and X. Xu, Neighborhood Density Method for Selecting Initial Cluster Centers in K-Means Clustering, Advances in Knowledge Discovery and Data Mining, Springer Berlin Heidelberg, pp. 189-198, 2006.

[5] P. S. Bishnu and V. Bhattacherjee, Software Fault Prediction Using Quad Tree-Based K-Means Clustering Algorithm, IEEE Transactions on Knowledge and Data Engineering, Volume 24, Issue 6, pp. 1146-1150, 2012.

[6] K. Lei, S. Wang, W. Song, and Q. Li, Size-Constrained Clustering Using an Initial Points Selection Method, Knowledge Science, Engineering and Management, Springer Berlin Heidelberg, pp. 195-205, 2013.

[7] S. S. Lee and C. Y. Han, Finding Good Initial Cluster Center by Using Maximum Average Distance, Advances in Natural Language Processing, Springer Berlin Heidelberg, pp. 228-238, 2012.

[8] X. Wang, C. Wang, and J. Shen, Semisupervised K-Means Clustering by Optimizing Initial Cluster Centers, Web Information Systems and Mining, Springer Berlin Heidelberg, pp. 178-187, 2011.

[9] Y. Zhang and E. Cheng, An Optimized Method for Selection of the Initial Centers of K-Means Clustering, Integrated Uncertainty in Knowledge Modeling and Decision Making, Springer Berlin Heidelberg, pp. 149-156, 2013.

[10] F. Kovács, C. Legány, and A. Babos, Cluster Validity Measurement Techniques, 6th International Symposium of Hungarian Researchers on Computational Intelligence, 2005.
[11] UCI Machine Learning Repository, http://archive.ics.uci.edu/ml.