Keywords: Clustering; K-harmonic means; Gravitational search algorithm; IGSAKHM algorithm

Abstract

Clustering is used to group data objects into sets of disjoint classes called clusters so that objects within the same class are highly similar to each other and dissimilar from the objects in other classes. K-harmonic means (KHM) is one of the most popular clustering techniques; it has been applied widely and works well in many fields, but it usually runs into local optima easily. A hybrid data clustering algorithm based on an improved version of the Gravitational Search Algorithm and KHM, called IGSAKHM, is proposed in this research. With the merits of both algorithms, IGSAKHM not only helps the KHM clustering escape from local optima but also overcomes the slow convergence speed of the IGSA. The proposed method is compared with some existing algorithms on seven data sets, and the obtained results indicate that IGSAKHM is superior to KHM and PSOKHM in most cases.

© 2011 Elsevier Ltd. All rights reserved.
doi:10.1016/j.eswa.2011.01.018
9320 M. Yin et al. / Expert Systems with Applications 38 (2011) 9319–9324
The KM clustering is a simple and fast method used commonly due to its straightforward implementation and small number of iterations. The KM algorithm attempts to find the cluster centers, (c_1, ..., c_k), such that the sum of the squared distances of each data point x_i to its nearest cluster center c_j is minimized. The dependency of the KM performance on the initialization of the centers has been a major problem. This is due to its winner-takes-all partitioning strategy, which results in a strong association between the data points and the nearest center and prevents the centers from moving out of a local density of data.

The KHM clustering addresses this intrinsic problem by replacing the minimum distance from a data point to the centers, used in KM, with the harmonic mean of the distances from each data point to all centers. The harmonic mean gives a good (low) score for each data point when that data point is close to any one center. This is a property of the harmonic mean; it is similar to the minimum function used by KM, but it is a smooth differentiable function. The following notations are used to formulate the KHM algorithm (Hammerly & Elkan, 2002; Ünler & Güngör, 2008):

X = (x_1, ..., x_n): the data to be clustered.
C = (c_1, ..., c_k): the set of cluster centers.
m(c_j | x_i): the membership function defining the proportion of a data point that belongs to center c_j.
w(x_i): the weight function defining how much influence data point x_i has in re-computing the center parameters in the next iteration.

The KHM objective function is

    KHM(X, C) = \sum_{i=1}^{n} \frac{k}{\sum_{j=1}^{k} 1 / \lVert x_i - c_j \rVert^{p}},    (1)

where p is an input parameter and typically p >= 2.

3. For each data point x_i, compute its membership m(c_j | x_i) in each center c_j according to

    m(c_j | x_i) = \frac{\lVert x_i - c_j \rVert^{-p-2}}{\sum_{j=1}^{k} \lVert x_i - c_j \rVert^{-p-2}}.    (2)

4. For each data point x_i, compute its weight w(x_i) according to

    w(x_i) = \frac{\sum_{j=1}^{k} \lVert x_i - c_j \rVert^{-p-2}}{\left( \sum_{j=1}^{k} \lVert x_i - c_j \rVert^{-p} \right)^{2}}.    (3)

5. For each center c_j, re-compute its location from all data points x_i according to their memberships and weights:

    c_j = \frac{\sum_{i=1}^{n} m(c_j | x_i)\, w(x_i)\, x_i}{\sum_{i=1}^{n} m(c_j | x_i)\, w(x_i)}.    (4)

6. Repeat steps 2–5 for a predefined number of iterations or until KHM(X, C) does not change significantly.

7. Assign data point x_i to cluster j with the biggest m(c_j | x_i).

It has been demonstrated that KHM is essentially insensitive to the initialization of the centers (Zhang et al., 1999), while it still tends to converge to local optima (Ünler & Güngör, 2008).

The gravitational search algorithm (GSA) is a recently proposed method for optimization problems (Rashedi et al., 2009). It has been compared with some well-known existing heuristic optimization methods, and the obtained results showed the high performance of the method. The GSA is built on the law of Newtonian gravity: "Every particle in the universe attracts every other particle with a force that is directly proportional to the product of their masses and inversely proportional to the square of the distance between them." In the algorithm, all the individuals can be viewed as objects with masses. The objects attract each other by the gravitational force, and the force makes all of them move towards the ones with heavier masses. The objects transfer information through the gravitational force, and the objects with heavier masses become heavier. The GSA can be described as follows.

First, assuming there are N objects and each of them has m dimensions, we define the ith object by

    X_i = (x_i^1, ..., x_i^d, ..., x_i^m),

where x_i^d represents the position of the ith object in the dth dimension. At the tth iteration, the force acting on object i from object j in the dth dimension is defined as

    F_{ij}^d(t) = G(t) \frac{M_i(t) M_j(t)}{R_{ij}(t) + \varepsilon} \left( x_j^d(t) - x_i^d(t) \right),    (5)

where G(t) is the gravitational constant at iteration t, ε is a small constant, and R_{ij}(t) is the Euclidean distance between objects i and j. The total force F_i^d(t) on object i in the dth dimension is the sum of the pairwise forces, each weighted by a random number rand_j in the interval [0, 1]. Then the acceleration of object i in dimension d can be calculated as

    a_i^d(t) = \frac{F_i^d(t)}{M_i(t)},    (8)

where M_i is the mass of object i, and the object moves as follows:

    v_i^d(t + 1) = rand_i \cdot v_i^d(t) + a_i^d(t),    (9)
    x_i^d(t + 1) = x_i^d(t) + v_i^d(t + 1),    (10)

where v_i^d(t) is the velocity of object i in the dth dimension at iteration t, x_i^d(t) is its position in the dth dimension at iteration t, and rand_i is a uniform random variable in the interval [0, 1], which adds a randomized characteristic to the search.

3.1. Improved gravitational search algorithm

When any object jumps out of its range, the original GSA just pulls it back to the fringe; in other words, the object is assigned a boundary value. The disadvantage is that if there are too many objects on the boundary, and especially when there exist some local optima on the boundary, the algorithm will lose its population diversity and stay in the local optima to some extent. Accordingly, we propose a strategy to overcome this problem, described as follows.

When X(d) > up[d], we have

    X(d) = up[d],
    v[d] = -rand() \cdot v[d];    (11)
similarly, when X(d) < down[d], we have

    X(d) = down[d],
    v[d] = -rand() \cdot v[d],    (12)

where up[d] is the maximum of the dth dimension in the data sets, down[d] is the minimum of the dth dimension in the data sets, and rand() is a random number in the interval [0, 1].

On the other hand, in the original GSA algorithm, G(t) is defined as follows:

    G(t) = G(t_0) \left( \frac{t_0}{t} \right)^{\beta}, \quad \beta < 1,    (13)

where G(t_0) is the initial value of the gravitational constant. In order to speed up the convergence of the GSA, we propose another strategy to improve G(t); here we redefine G(t) as

    G(t) = G_0 \exp\left( -\alpha \left( \frac{t}{\max_t} \right)^{2} \right),    (14)

where G_0 plays the same role as G(t_0) in Eq. (13), α is a constant, and max_t is the total number of iterations.

By the two improvements above, we add to the objects' diversity, make them explore more of the solution space, and improve the convergence speed of the GSA. At the same time, we can limit the objects to the data space and get better results. The process of the improved GSA is as follows:

Step 1: Search space identification.
Step 2: Randomized initialization.
Step 3: Fitness evaluation of agents.
Step 4: Update G by Eq. (14), and update the best and worst of the population.
Step 5: Calculation of the total force in different directions.
Step 6: Calculate M and a for every agent.
Step 7: Update agents' velocities and positions by Eqs. (9) and (10).
Step 8: Repeat Steps 3 to 7 until the stop criterion is reached.

The IGSAKHM algorithm proceeds as follows:

Step 1: Set the initial parameters, including the maximum iterative count IterCount, the population size Psize, ε, and G0.
Step 2: Initialize a population of size Psize.
Step 3: Set iterative count Gen1 = 0.
Step 4: Set iterative counts Gen2 = Gen3 = 0.
Step 5: (IGSA method)
    Step 5.1: Apply the IGSA operator to update the Psize objects.
    Step 5.2: Gen2 = Gen2 + 1. If Gen2 < 8, go to Step 5.1.
Step 6: (KHM method) For each object i:
    Step 6.1: Take the position of object i as the initial cluster centres of the KHM algorithm.
    Step 6.2: Recalculate each cluster center using the KHM algorithm.
    Step 6.3: Gen3 = Gen3 + 1. If Gen3 < 4, go to Step 6.2.
Step 7: Gen1 = Gen1 + 1. If Gen1 < IterCount, go to Step 4.
Step 8: Assign data point x_i to cluster j with the biggest m(c_j | x_i).

Fig. 1. The representation of an object in IGSAKHM.

4. Experimental results

We test our proposed algorithm on seven data sets and compare it with other well-known algorithms. These data sets are the well-known ArtSet1, ArtSet2, Wine, Glass, Iris, Breast-Cancer-Wisconsin (denoted as Cancer), and Contraceptive Method Choice (denoted as CMC); they cover examples of data of low, medium and high dimensions, and have also been used in Yang et al. (2009). The details of these data sets can be viewed in Table 1. The experimental results are averages of 10 runs of simulation. The algorithms are implemented using Visual C++ and executed on a Pentium (R) D CPU 2.66 GHz with 1.00 GB RAM.

Table 1
Characteristics of data sets considered.

4.1. Data sets

(1) ArtSet1 (n = 300, d = 2, k = 3): This is an artificial data set. It is a two-featured problem with three unique classes. A total of 300 patterns are drawn from three independent bivariate

Fig. 2. Clusters of ArtSet1 data set.
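To make the IGSA update concrete, the following sketch implements one iteration of the force, acceleration, velocity and position rules together with the boundary bounce-back strategy and the redefined gravitational constant described above. It is an illustrative sketch, not the authors' code: the constants G0 = 100 and alpha = 20 are assumed values, the per-element random weighting is a choice of this example, and the fitness-based mass update is assumed to happen outside this step.

```python
import numpy as np

def igsa_step(X, V, M, t, max_t, down, up, G0=100.0, alpha=20.0, eps=1e-9, rng=None):
    """One improved-GSA iteration over N objects in d dimensions.

    X, V: (N, d) positions and velocities; M: (N,) masses (computed from
    fitness elsewhere); down/up: search-range bounds per dimension.
    """
    if rng is None:
        rng = np.random.default_rng()
    N, d = X.shape

    # Redefined gravitational constant with the faster, squared-ratio decay.
    G = G0 * np.exp(-alpha * (t / max_t) ** 2)

    # Total force on each object: pairwise attractions, randomly weighted.
    F = np.zeros_like(X)
    for i in range(N):
        for j in range(N):
            if i != j:
                R = np.linalg.norm(X[i] - X[j])   # Euclidean distance R_ij
                F[i] += rng.random() * G * M[i] * M[j] / (R + eps) * (X[j] - X[i])

    # Acceleration, then velocity and position updates.
    a = F / (M[:, None] + eps)
    V = rng.random((N, d)) * V + a
    X = X + V

    # Boundary strategy: instead of merely clamping escaped objects to the
    # fringe, reverse and randomly damp their velocity, then pull them back.
    out = (X > up) | (X < down)
    V[out] = -rng.random(int(out.sum())) * V[out]
    X = np.clip(X, down, up)
    return X, V
```

In the full IGSAKHM loop, Steps 5.1–5.2 would call an update of this kind a fixed number of times before handing each object's position to KHM as initial cluster centers.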
Table 2
Results of KHM, PSOKHM, and IGSAKHM clustering on two artificial and five real data sets when p = 2.5. The quality of clustering is evaluated using KHM(X, C) and the F-measure. Runtimes (s) are additionally provided. The table shows means and standard deviations (in brackets) for 10 independent runs.

Data set   Measure      KHM                PSOKHM             IGSAKHM
ArtSet1    KHM(X, C)    703.867(0.000)     703.509(0.050)     703.509(0.000)
           F-measure    1.000(0.000)       1.000(0.000)       1.000(0.000)
           Runtime      0.106(0.006)       1.921(0.007)       1.783(0.001)
ArtSet2    KHM(X, C)    111,852(0)         111,813(2)         111,813(0)
           F-measure    1.000(0.000)       1.000(0.000)       1.000(0.000)
           Runtime      0.223(0.008)       2.859(0.000)       2.461(0.001)
Iris       KHM(X, C)    149.333(0.000)     149.058(0.074)     149.058(0.000)
           F-measure    0.750(0.000)       0.753(0.005)       0.763(0.000)
           Runtime      0.192(0.008)       1.842(0.005)       1.577(0.002)
Glass      KHM(X, C)    1203.554(16.231)   1196.789(0.439)    1180.756(0.134)
           F-measure    0.421(0.011)       0.424(0.003)       0.454(0.000)
           Runtime      4.064(0.010)       17.669(0.018)      15.910(0.010)
Cancer     KHM(X, C)    60,189(0)          59,844(22)         59,844(0)
           F-measure    0.829(0.000)       0.829(0.000)       0.829(0.000)
           Runtime      2.017(0.009)       9.525(0.013)       7.509(0.007)
CMC        KHM(X, C)    96,520(0)          96,193(25)         96,193(52)
           F-measure    0.335(0.000)       0.333(0.002)       0.488(0.000)
           Runtime      8.639(0.009)       39.825(0.072)      31.563(0.012)
Wine       KHM(X, C)    18,386,505(0)      18,386,285(5)      18,386,285(28)
           F-measure    0.516(0.000)       0.516(0.000)       0.519(0.000)
           Runtime      2.059(0.010)       6.539(0.008)       5.628(0.004)

Table 3
Results of KHM, PSOKHM, and IGSAKHM clustering on two artificial and five real data sets when p = 3. The quality of clustering is evaluated using KHM(X, C) and the F-measure. Runtimes (s) are additionally provided. The table shows means and standard deviations (in brackets) for 10 independent runs.

Data set   Measure      KHM                        PSOKHM             IGSAKHM
ArtSet1    KHM(X, C)    742.110(0.000)             741.455(0.002)     741.455(0.000)
           F-measure    1.000(0.000)               1.000(0.000)       1.000(0.000)
           Runtime      0.001(0.006)               1.921(0.007)       1.789(0.002)
ArtSet2    KHM(X, C)    278,758(0)                 278,541(33)        278,541(0)
           F-measure    1.000(0.000)               1.000(0.000)       1.000(0.000)
           Runtime      0.220(0.008)               2.844(0.010)       2.524(0.005)
Iris       KHM(X, C)    126.517(0.000)             125.951(0.052)     125.951(0.000)
           F-measure    0.744(0.000)               0.744(0.000)       0.751(0.000)
           Runtime      0.190(0.007)               1.826(0.009)       1.650(0.004)
Glass      KHM(X, C)    1535.198(0.000)            1442.847(35.871)   1400.950(0.630)
           F-measure    0.422(0.000)               0.427(0.003)       0.442(0.000)
           Runtime      4.042(0.007)               17.609(0.015)      15.958(0.001)
Cancer     KHM(X, C)    119,458(0)                 117,418(237)       117,418(55)
           F-measure    0.834(0.000)               0.834(0.000)       0.847(0.000)
           Runtime      2.027(0.007)               9.594(0.023)       7.91(0.002)
CMC        KHM(X, C)    187,525(0)                 186,722(111)       186,722(94)
           F-measure    0.303(0.000)               0.303(0.000)       0.472(0.000)
           Runtime      8.627(0.009)               39.485(0.056)      32.107(0.034)
Wine       KHM(X, C)    298,230,848(24,270,951)    252,522,504(766)   252,522,000(0)
           F-measure    0.538(0.007)               0.553(0.000)       0.553(0.000)
           Runtime      2.084(0.010)               6.598(0.008)       5.710(0.001)
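The F-measure reported in these tables scores each class by its best-matching cluster, combining precision and recall over the class/cluster contingency counts. A minimal computation sketch with b = 1 might look like this (an illustration, not the authors' implementation):

```python
import numpy as np

def f_measure(labels_true, labels_pred, b=1.0):
    """Clustering F-measure: F = sum_i (n_i / n) * max_j F(i, j)."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    n = len(labels_true)
    total = 0.0
    for ci in np.unique(labels_true):
        n_i = np.sum(labels_true == ci)          # size of class i
        best = 0.0
        for cj in np.unique(labels_pred):
            n_j = np.sum(labels_pred == cj)      # size of cluster j
            n_ij = np.sum((labels_true == ci) & (labels_pred == cj))
            if n_ij == 0:
                continue
            p = n_ij / n_j                       # precision p(i, j)
            r = n_ij / n_i                       # recall r(i, j)
            f = (b**2 + 1) * p * r / (b**2 * p + r)
            best = max(best, f)
        total += (n_i / n) * best
    return total
```

A perfect clustering (each cluster exactly one class) yields F = 1, matching the ArtSet1 and ArtSet2 rows above.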
Table 4
Results of KHM, PSOKHM, and IGSAKHM clustering on two artificial and five real data sets when p = 3.5. The quality of clustering is evaluated using KHM(X, C) and the F-measure. Runtimes (s) are additionally provided. The table shows means and standard deviations (in brackets) for 10 independent runs.

Table 5
Results of KHM, PSOKHM, and IGSAKHM clustering on two artificial and five real data sets when p = 4. The quality of clustering is evaluated using KHM(X, C) and the F-measure. Runtimes (s) are additionally provided. The table shows means and standard deviations (in brackets) for 10 independent runs.
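The KHM(X, C) values in Tables 2–5 follow the objective and update rules of Eqs. (1)–(4). One KHM iteration can be sketched in a few vectorized lines; the epsilon guard against zero distances and the vectorized form are choices of this example, not the authors' implementation:

```python
import numpy as np

def khm_step(X, C, p=3.5):
    """One KHM iteration: Eqs. (1)-(4). X is (n, d) data, C is (k, d) centers."""
    # Pairwise distances ||x_i - c_j||, guarded against exact zeros.
    D = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
    D = np.maximum(D, 1e-12)

    # Eq. (1): KHM(X, C) = sum_i k / sum_j (1 / ||x_i - c_j||^p).
    objective = np.sum(C.shape[0] / np.sum(D ** (-p), axis=1))

    # Eq. (2): membership m(c_j | x_i) proportional to ||x_i - c_j||^(-p-2).
    M = D ** (-p - 2)
    M /= M.sum(axis=1, keepdims=True)

    # Eq. (3): weight w(x_i) = sum_j d^(-p-2) / (sum_j d^(-p))^2.
    W = np.sum(D ** (-p - 2), axis=1) / np.sum(D ** (-p), axis=1) ** 2

    # Eq. (4): each center becomes a membership- and weight-weighted mean.
    MW = M * W[:, None]                              # (n, k)
    C_new = (MW.T @ X) / MW.sum(axis=0)[:, None]
    return C_new, objective
```

Iterating this step until the objective stabilizes reproduces steps 2–6 of the KHM algorithm quoted earlier.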
flavanoids, nonflavanoid phenols, proanthocyanins, colour intensity, hue, OD280/OD315 of diluted wines, and proline, are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. There are three categories in the data: class 1 (59 objects), class 2 (71 objects), and class 3 (48 objects).
4.2. Experimental results

In this section, we evaluate and compare the performances of the following methods: the KHM, PSOKHM and IGSAKHM algorithms, as means of solution for the objective function of the KHM algorithm. The quality of the respective clusterings will also be compared, where the quality is measured by the following two criteria:

(1) The sum over all data points of the harmonic average of the distance from a data point to all the centers, as defined in Eq. (1). Clearly, the smaller the sum is, the higher the quality of clustering is.

(2) The F-measure uses the ideas of precision and recall from information retrieval (Dalli, 2003; Handl, Knowles, & Dorigo, 2003). Each class i (as given by the class labels of the used benchmark data set) is regarded as the set of n_i items desired for a query; each cluster j (generated by the algorithm) is regarded as the set of n_j items retrieved for a query; n_ij gives the number of elements of class i within cluster j. For each class i and cluster j, precision and recall are then defined as p(i, j) = n_ij / n_j and r(i, j) = n_ij / n_i, and the corresponding value under the F-measure is

    F(i, j) = \frac{(b^2 + 1) \cdot p(i, j) \cdot r(i, j)}{b^2 \cdot p(i, j) + r(i, j)},

where we chose b = 1 to obtain equal weighting for p(i, j) and r(i, j). The overall F-measure for the data set of size n is given by

    F = \sum_i \frac{n_i}{n} \max_j \{ F(i, j) \}.

Obviously, the bigger the F-measure is, the higher the quality of clustering is.

It is known that p is a key parameter for obtaining good objective function values. For this reason we conduct our experiments with different p values. Tables 2–5 show the means and standard deviations (over 10 runs) obtained for each of these measures when p is 2.5, 3, 3.5 and 4, respectively. Additionally, the runtimes of the algorithms are also shown in these tables.

We see that for ArtSet1 and ArtSet2 the F-measures of KHM, PSOKHM and IGSAKHM are all equal to 1, and the average KHM(X, C) values of the three algorithms for the first three data sets are almost the same (except the case p = 4); that is to say, when the data set is simple and especially well separated, the performances of the three algorithms are nearly the same. On the other hand, the more complex the data set is, the better IGSAKHM's relative results are. The F-measure of IGSAKHM is superior to the other two algorithms when p is bigger. Especially when p = 4, the IGSAKHM algorithm outperforms KHM and PSOKHM in both KHM(X, C) and F-measure, with the exception of ArtSet1 and ArtSet2. It is shown that our algorithm clearly outperforms the KHM algorithm for both F-measure and KHM(X, C), and in most cases it outperforms PSOKHM for F-measure while being at least comparable to PSOKHM for KHM(X, C), requiring less time.

5. Conclusions

Firstly, this paper proposes the IGSA algorithm by adding two strategies to the original GSA algorithm, and then investigates a new hybrid clustering algorithm based on the KHM algorithm and the IGSA algorithm, called IGSAKHM. The proposed algorithm is tested on seven data sets, and experimental results show that the algorithm is both efficient and effective, especially for complicated problems. Yet it should be noted that one drawback of IGSAKHM is that it requires more runtime compared with KHM, and when the runtime is critical, IGSAKHM is not applicable. In the future, we may integrate other local search algorithms into KHM to obtain a more efficient and effective clustering algorithm.

Acknowledgments

This work is fully supported by the National Natural Science Foundation of China under Grant Nos. 60473042, 60573067 and 60803102.

References

Alpaydin, E. (2004). Introduction to machine learning. Cambridge: The MIT Press. pp. 133–150.
Cui, X., & Potok, T. E. (2005). Document clustering using particle swarm optimization. In: IEEE swarm intelligence symposium. Pasadena, California.
Dalli, A. (2003). Adaptation of the F-measure to cluster-based lexicon quality evaluation. In: EACL 2003, Budapest.
Halberstadt, W., & Douglas, T. S. (2008). Fuzzy clustering to detect tuberculous meningitis-associated hyperdensity in CT images. Computers in Biology and Medicine, 38(2), 165–170.
Hammerly, G., & Elkan, C. (2002). Alternatives to the k-means algorithm that find better clusterings. In: Proceedings of the 11th international conference on information and knowledge management (pp. 600–607).
Handl, J., Knowles, J., & Dorigo, M. (2003). On the performance of ant-based clustering. Design and Application of Hybrid Intelligent Systems. Frontiers in Artificial Intelligence and Applications, 104, 204–213.
He, Y., Pan, W., & Lin, J. (2006). Cluster analysis using multivariate normal mixture models to detect differential gene expression with microarray data. Computational Statistics and Data Analysis, 51(2), 641–658.
Hu, G., Zhou, S., Guan, J., & Hu, X. (2008). Towards effective document clustering: A constrained K-means based approach. Information Processing and Management, 44(4), 1397–1409.
Kao, Y. T., Zahara, E., & Kao, I. W. (2008). A hybridized approach to data clustering. Expert Systems with Applications, 34(3), 1754–1762.
Kerr, G., Ruskin, H. J., Crane, M., & Doolan, P. (2008). Techniques for clustering gene expression data. Computers in Biology and Medicine, 38(3), 283–293.
Krishna, K., & Murty, M. N. (1999). Genetic k-means algorithm. IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics, 29, 433–439.
Li, Y. J., Chung, S. M., & Holt, J. D. (2008). Text document clustering based on frequent word meaning sequences. Data and Knowledge Engineering, 64(1), 381–404.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, 1, 281–297.
Rashedi, E., Nezamabadi-pour, H., & Saryazdi, S. (2009). GSA: A gravitational search algorithm. Information Sciences, 179, 2232–2248.
Sung, C. S., & Jin, H. W. (2000). A tabu-search-based heuristic for clustering. Pattern Recognition, 33, 849–858.
Tan, P. N., Steinbach, M., & Kumar, V. (2005). Introduction to data mining. Boston: Addison-Wesley. pp. 487–559.
Tjhi, W. C., & Chen, L. H. (2008). A heuristic-based fuzzy co-clustering algorithm for categorization of high-dimensional data. Fuzzy Sets and Systems, 159(4), 371–389.
Ünler, A., & Güngör, Z. (2008). Applying K-harmonic means clustering to the part-machine classification problem. Expert Systems with Applications. doi:10.1016/j.eswa.2007.11.048.
Webb, A. (2002). Statistical pattern recognition. New Jersey: John Wiley & Sons. pp. 361–406.
Yang, F. Q., Sun, T. L., & Zhang, C. H. (2009). An efficient hybrid data clustering method based on K-harmonic means and particle swarm optimization. Expert Systems with Applications, 36(6), 9847–9852.
Zhang, B., Hsu, M., & Dayal, U. (1999). K-harmonic means – A data clustering algorithm. Technical Report HPL-1999-124. Hewlett-Packard Laboratories.
Zhang, B., Hsu, M., & Dayal, U. (2000). K-harmonic means. In: International workshop on temporal, spatial and spatio-temporal data mining, TSDM2000. Lyon, France, September 12.
Zhou, H., & Liu, Y. H. (2008). Accurate integration of multi-view range images using k-means clustering. Pattern Recognition, 41(1), 152–175.