
Expert Systems with Applications 38 (2011) 9319–9324


A novel hybrid K-harmonic means and gravitational search algorithm approach for clustering

Minghao Yin, Yanmei Hu *, Fengqin Yang, Xiangtao Li, Wenxiang Gu
College of Computer Science, Northeast Normal University, Changchun 130117, China

Keywords: Clustering; K-harmonic means; Gravitational search algorithm; IGSAKHM algorithm

Abstract: Clustering is used to group data objects into sets of disjoint classes, called clusters, so that objects within the same class are highly similar to one another and dissimilar from objects in other classes. K-harmonic means (KHM) is one of the most popular clustering techniques; it has been applied widely and works well in many fields, but it easily runs into local optima. A hybrid data clustering algorithm based on an improved version of the gravitational search algorithm and KHM, called IGSAKHM, is proposed in this research. Combining the merits of both algorithms, IGSAKHM not only helps KHM clustering escape from local optima but also overcomes the slow convergence speed of the IGSA. The proposed method is compared with existing algorithms on seven data sets, and the obtained results indicate that IGSAKHM is superior to KHM and PSOKHM in most cases.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

Clustering is a search for hidden patterns that may exist in data sets. It is a process of grouping data objects into disjoint clusters so that the objects in each cluster are similar to one another, yet different from the objects in the other clusters. Clustering techniques have been applied in many application areas such as pattern recognition (Halberstadt & Douglas, 2008; Webb, 2002; Zhou & Liu, 2008), machine learning (Alpaydin, 2004), data mining (Tan, Steinbach, & Kumar, 2005; Tjhi & Chen, 2008), information retrieval (Hu et al., 2008; Li, Chung, & Holt, 2008) and bioinformatics (He, Pan, & Lin, 2006; Kerr, Ruskin, Crane, & Doolan, 2008). There are many techniques for data clustering. Among them, the K-means (KM) algorithm (MacQueen, 1967) may be one of the most popular and widely used, owing to its feasibility and efficiency in dealing with large amounts of data. As an improved version of KM, the K-harmonic means (KHM) algorithm was first proposed by Zhang, Hsu, and Dayal (1999, 2000) and modified by Hammerly and Elkan (2002). Similar to KM, KHM partitions data objects into k clusters according to the application purpose, but its clustering objective is to minimize the harmonic average of the distances from all points in the data set to all cluster centers. Therefore, KHM is not sensitive to the selection of the initial cluster centers, which is the main drawback of KM (Cui & Potok, 2005; Kao, Zahara, & Kao, 2008).

However, both KM and KHM easily run into local optima. To address this problem, several heuristic clustering algorithms have been introduced. Krishna and Murty (1999) proposed a method called the genetic K-means algorithm (GKM), which defines a basic mutation operator specific to clustering called distance-based mutation. Using the theory of finite Markov chains, it has been proved that GKM converges to the best-known optimum. A tabu-search-based heuristic for clustering was proposed by Sung and Jin (2000), in which two complementary functional procedures, called the packing and releasing procedures, are combined with the tabu search. In 2009, we proposed a hybrid algorithm for clustering called PSOKHM (Yang, Sun, & Zhang, 2009). This algorithm uses particle swarm optimization to help KHM escape from local optima to a certain extent, and results in better clustering.

GSA is a novel population-based optimization algorithm based on the law of gravity, proposed by Rashedi, Nezamabadi-pour, and Saryazdi (2009). In GSA, the individuals are a collection of masses that interact with each other according to Newtonian gravity and the laws of motion. In Rashedi et al. (2009), it was shown that the global search ability of GSA is superior to that of other algorithms in most cases. We therefore explore the application of GSA to help KHM escape from local optima. Moreover, to make GSA more suitable for the data sets, we add two strategies to the original GSA to obtain an improved GSA algorithm, called IGSA. In this way, a new hybrid data clustering algorithm based on KHM and IGSA, called IGSAKHM, is proposed and shown to be efficient. The experimental results, obtained by testing on a variety of data sets drawn from several artificial and real-life situations, indicate that IGSAKHM is superior to the KHM and PSOKHM algorithms in most cases.
* Corresponding author. Tel.: +86 0431 84536857; fax: +86 0431 84536857.
E-mail address: huym260@nenu.edu.cn (Y. Hu).

0957-4174/$ - see front matter © 2011 Elsevier Ltd. All rights reserved.
doi:10.1016/j.eswa.2011.01.018

2. K-harmonic means clustering

The KM clustering is a simple and fast method that is used widely because of its straightforward implementation and small number of iterations. The KM algorithm attempts to find the cluster centers (c1, ..., ck) such that the sum of the squared distances of each data point xi to its nearest cluster center cj is minimized. The dependency of the KM performance on the initialization of the centers has been a major problem. This is due to its winner-takes-all partitioning strategy, which results in a strong association between the data points and the nearest center and prevents the centers from moving out of a local density of data.

The KHM clustering addresses this intrinsic problem by replacing the minimum distance from a data point to the centers, used in KM, with the harmonic mean of the distances from each data point to all centers. The harmonic mean gives a good (low) score for each data point when that data point is close to any one center. This is a property of the harmonic mean; it is similar to the minimum function used by KM, but it is a smooth differentiable function. The following notations are used to formulate the KHM algorithm (Hammerly & Elkan, 2002; Ünler & Güngör, 2008):

X = (x1, ..., xn): the data to be clustered.
C = (c1, ..., ck): the set of cluster centers.
m(cj|xi): the membership function defining the proportion of data point xi that belongs to center cj.
w(xi): the weight function defining how much influence data point xi has in re-computing the center parameters in the next iteration.

The basic algorithm for KHM clustering is as follows:

1. Initialize the algorithm with guessed centers C, i.e., randomly choose the initial centers.
2. Calculate the objective function value according to

   KHM(X, C) = \sum_{i=1}^{n} \frac{k}{\sum_{j=1}^{k} \frac{1}{\|x_i - c_j\|^{p}}},   (1)

   where p is an input parameter, typically p >= 2.
3. For each data point xi, compute its membership m(cj|xi) in each center cj according to

   m(c_j | x_i) = \frac{\|x_i - c_j\|^{-p-2}}{\sum_{j=1}^{k} \|x_i - c_j\|^{-p-2}}.   (2)

4. For each data point xi, compute its weight w(xi) according to

   w(x_i) = \frac{\sum_{j=1}^{k} \|x_i - c_j\|^{-p-2}}{\left( \sum_{j=1}^{k} \|x_i - c_j\|^{-p} \right)^{2}}.   (3)

5. For each center cj, re-compute its location from all data points xi according to their memberships and weights:

   c_j = \frac{\sum_{i=1}^{n} m(c_j | x_i) \, w(x_i) \, x_i}{\sum_{i=1}^{n} m(c_j | x_i) \, w(x_i)}.   (4)

6. Repeat steps 2-5 for a predefined number of iterations or until KHM(X, C) does not change significantly.
7. Assign each data point xi to the cluster j with the biggest m(cj|xi).

It has been demonstrated that KHM is essentially insensitive to the initialization of the centers (Zhang et al., 1999), although it still tends to converge to local optima (Ünler & Güngör, 2008).

3. The proposed hybrid clustering algorithm

The gravitational search algorithm (GSA) is a recently proposed optimization method (Rashedi et al., 2009). It has been compared with several existing well-known heuristic optimization methods, and the obtained results showed the high performance of the method. GSA is constructed on the law of Newtonian gravity: "Every particle in the universe attracts every other particle with a force that is directly proportional to the product of their masses and inversely proportional to the square of the distance between them." In the algorithm, all individuals can be viewed as objects with masses. The objects attract each other through the gravity force, and this force makes all of them move towards the objects with heavier masses. The objects transfer information through the gravitational force, and the objects with heavier masses correspond to better solutions. The GSA algorithm can be described as follows.

First, assuming there are N objects and each of them has m dimensions, we define the ith object by

   X_i = (x_i^1, ..., x_i^d, ..., x_i^m),

where x_i^d denotes the position of the ith object in the dth dimension. At the tth iteration, the force acting on object i from object j in the dth dimension is defined as

   F_{ij}^d(t) = G(t) \frac{M_i(t) M_j(t)}{R_{ij}(t) + \epsilon} \left( x_j^d(t) - x_i^d(t) \right),   (5)

where G(t) is the gravitational constant at iteration t, \epsilon is a small constant, and R_{ij}(t) is the Euclidean distance between object i and object j:

   R_{ij}(t) = \| X_i(t), X_j(t) \|_2.   (6)

The total force acting on object i in the dth dimension is given by

   F_i^d(t) = \sum_{j=1, j \neq i}^{N} rand_j \, F_{ij}^d(t),   (7)

where rand_j is a random number in the interval [0, 1]. Then the acceleration of object i in dimension d can be calculated as

   a_i^d(t) = \frac{F_i^d(t)}{M_i(t)},   (8)

where M_i is the mass of object i, and the object moves according to

   v_i^d(t+1) = rand_i \cdot v_i^d(t) + a_i^d(t),   (9)
   x_i^d(t+1) = x_i^d(t) + v_i^d(t+1),   (10)

where v_i^d(t) is the velocity of object i in the dth dimension at iteration t, x_i^d(t) is the position of object i in the dth dimension at iteration t, and rand_i is a uniform random variable in the interval [0, 1], which adds a randomized characteristic to the search.

3.1. Improved gravitational search algorithm

When an object jumps out of its range, the original GSA simply pulls it back to the fringe; in other words, the object is assigned a boundary value. The disadvantage is that if too many objects sit on the boundary, and especially when local optima lie on the boundary, the algorithm loses population diversity and, to some extent, stays in those local optima. Accordingly, we propose a strategy to overcome this problem, described as follows. When X(d) > up[d], we set

   X(d) = up[d],
   v[d] = rand() \cdot v[d];   (11)

similarly, when X(d) < down[d], we set

   X(d) = down[d],
   v[d] = rand() \cdot v[d],   (12)

where up[d] is the maximum of the dth dimension in the data set, down[d] is the minimum of the dth dimension in the data set, and rand() is a random number in the interval [0, 1].

On the other hand, in the original GSA algorithm, G(t) is defined as follows:

   G(t) = G(t_0) \left( \frac{t_0}{t} \right)^{\beta}, \quad \beta < 1,   (13)

where G(t_0) is the initial value of the gravitational constant. In order to speed up the convergence of GSA, we propose another strategy and redefine G(t) as

   G(t) = G_0 \exp\left( -\alpha \left( \frac{t}{\max t} \right)^{2} \right),   (14)

where G_0 plays the same role as G(t_0) in Eq. (13) and max t is the maximum number of iterations.

Through the two improvements above, we increase the diversity of the objects, make them explore more of the solution space, and improve the convergence speed of the GSA. At the same time, we keep the objects within the data space and obtain better results. The process of the improved GSA is as follows:

Step 1: Search space identification.
Step 2: Randomized initialization.
Step 3: Fitness evaluation of agents.
Step 4: Update G by Eq. (14), and update the best and worst of the population.
Step 5: Calculation of the total force in different directions.
Step 6: Calculate M and a for every agent.
Step 7: Update agents' velocities and positions by Eqs. (9) and (10).
Step 8: Repeat Steps 3 to 7 until the stop criterion is reached.

3.2. The proposed hybrid clustering algorithm

Because it requires fewer function evaluations, the KHM algorithm tends to converge faster than the GSA algorithm, but it usually gets stuck in local optima. On the other hand, the GSA algorithm converges to better global optima owing to its excellent global search ability. Therefore, in this paper we integrate the improved version of GSA described in the section above into KHM to form a hybrid clustering algorithm, which we call IGSAKHM. The hybrid algorithm maintains the merits of both KHM and IGSA. In the proposed algorithm, an object, as shown in Fig. 1, is a vector of real numbers of dimension k * d, where k is the number of clusters and d is the dimension of the data to be clustered. The fitness function is the objective function of the KHM algorithm.

[Fig. 1. The representation of an object in IGSAKHM: x11 x12 ... x1d ... xk1 xk2 ... xkd.]

The process of the hybrid IGSAKHM algorithm is as follows:

Step 1: Set the initial parameters, including the maximum iteration count IterCount, the population size Psize, e, and G0.
Step 2: Initialize a population of size Psize.
Step 3: Set the iteration count Gen1 = 0.
Step 4: Set the iteration counts Gen2 = Gen3 = 0.
Step 5: (IGSA method)
  Step 5.1: Apply the IGSA operator to update the Psize objects.
  Step 5.2: Gen2 = Gen2 + 1. If Gen2 < 8, go to Step 5.1.
Step 6: (KHM method) For each object i:
  Step 6.1: Take the position of object i as the initial cluster centres of the KHM algorithm.
  Step 6.2: Recalculate each cluster center using the KHM algorithm.
  Step 6.3: Gen3 = Gen3 + 1. If Gen3 < 4, go to Step 6.2.
Step 7: Gen1 = Gen1 + 1. If Gen1 < IterCount, go to Step 4.
Step 8: Assign data point xi to the cluster j with the biggest m(cj|xi).
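The following hedged Python sketch shows how the IGSA operator (Eqs. (5)-(12) and (14)) and the alternating IGSA/KHM phases of Steps 1-8 could fit together. It reuses khm_objective and khm_step from the sketch in Section 2; the values of G0 and alpha and the vectorized details are illustrative assumptions rather than the authors' implementation.

```python
# Hedged sketch of IGSAKHM (Section 3.2) on a simplified IGSA operator.
# Reuses khm_objective and khm_step from the KHM sketch in Section 2.
import numpy as np

def igsa_step(pos, vel, fit, t, max_t, low, high, G0=100.0, alpha=20.0, eps=1e-9):
    """One IGSA update of all objects (rows of pos); fit is minimized."""
    n = pos.shape[0]
    G = G0 * np.exp(-alpha * (t / max_t) ** 2)                 # Eq. (14)
    worst, best = fit.max(), fit.min()
    M = (worst - fit + eps) / (worst - best + eps)             # masses from fitness
    M = M / M.sum()
    acc = np.zeros_like(pos)
    for i in range(n):
        diff = pos - pos[i]                                    # x_j - x_i
        R = np.linalg.norm(diff, axis=1)
        coef = np.random.rand(n) * G * M / (R + eps)           # Eqs. (5), (7), (8)
        coef[i] = 0.0
        acc[i] = coef @ diff
    vel = np.random.rand(*pos.shape) * vel + acc               # Eq. (9)
    pos = pos + vel                                            # Eq. (10)
    out = (pos < low) | (pos > high)                           # boundary strategy,
    pos = np.clip(pos, low, high)                              # Eqs. (11)-(12)
    vel[out] = np.random.rand(out.sum()) * vel[out]
    return pos, vel

def igsakhm(X, k, iter_count=5, psize=20, p=3.5):
    """Steps 1-8: alternate 8 IGSA updates with 4 KHM refinements per object."""
    n, dim = X.shape
    low, high = np.tile(X.min(axis=0), k), np.tile(X.max(axis=0), k)
    pos = np.random.uniform(low, high, (psize, k * dim))       # Step 2
    vel = np.zeros_like(pos)
    for gen1 in range(iter_count):                             # Steps 3-7
        fit = np.array([khm_objective(X, c.reshape(k, dim), p) for c in pos])
        for _ in range(8):                                     # Step 5: IGSA method
            pos, vel = igsa_step(pos, vel, fit, gen1, iter_count, low, high)
            fit = np.array([khm_objective(X, c.reshape(k, dim), p) for c in pos])
        for i in range(psize):                                 # Step 6: KHM method
            C = pos[i].reshape(k, dim)
            for _ in range(4):
                C = khm_step(X, C, p)
            pos[i] = C.ravel()
    fit = np.array([khm_objective(X, c.reshape(k, dim), p) for c in pos])
    return pos[np.argmin(fit)].reshape(k, dim)                 # best centers (Step 8)
```

The object layout mirrors Fig. 1: each row of pos concatenates the k cluster centers of one candidate solution, so reshaping to (k, dim) recovers the centers that the KHM phase then refines.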

4. Experimental results

We test the proposed algorithm on seven data sets and compare it with other well-known algorithms. These data sets, ArtSet1, ArtSet2, Wine, Glass, Iris, Breast-Cancer-Wisconsin (denoted as Cancer), and Contraceptive Method Choice (denoted as CMC), cover examples of low-, medium- and high-dimensional data and have also been used in Yang et al. (2009). The details of these data sets are given in Table 1. The experimental results are averages over 10 runs of simulation. The algorithms are implemented in Visual C++ and executed on a Pentium (R) D CPU 2.66 GHz with 1.00 GB RAM.

Table 1
Characteristics of data sets considered.

Name of data set   No. of classes   No. of features   Size of data set (size of classes in parentheses)
ArtSet1            3                2                 300 (100, 100, 100)
ArtSet2            3                3                 300 (100, 100, 100)
Iris               3                4                 150 (50, 50, 50)
Glass              6                9                 214 (70, 17, 76, 13, 9, 29)
Cancer             2                9                 683 (444, 239)
CMC                3                9                 1473 (629, 334, 510)
Wine               3                13                178 (59, 71, 48)

4.1. Data sets

(1) ArtSet1 (n = 300, d = 2, k = 3): This is an artificial data set. It is a two-featured problem with three unique classes. A total of 300 patterns are drawn from three independent bivariate normal distributions, where the classes are distributed according to

    N_2\left( \mu = \begin{pmatrix} \mu_{i1} \\ \mu_{i2} \end{pmatrix}, \; \Sigma = \begin{pmatrix} 0.4 & 0.04 \\ 0.04 & 0.4 \end{pmatrix} \right), \quad i = 1, 2, 3,

    with \mu_{11} = \mu_{12} = 2, \mu_{21} = \mu_{22} = 2, \mu_{31} = \mu_{32} = 6, where \mu and \Sigma are the mean vector and covariance matrix, respectively. The data set is illustrated in Fig. 2.

    [Fig. 2. Clusters of ArtSet1 data set.]

(2) ArtSet2 (n = 300, d = 3, k = 3): This is an artificial data set. It is a three-featured problem with three classes and 300 patterns, where every feature of the classes is distributed according to Class 1 ~ Uniform(10, 25), Class 2 ~ Uniform(25, 40), Class 3 ~ Uniform(40, 55). The data set is illustrated in Fig. 3. (A generation sketch for the two artificial data sets follows this list.)

    [Fig. 3. Clusters of ArtSet2 data set.]

(3) Fisher's iris data set (n = 150, d = 4, k = 3), which consists of three different species of iris flower: Iris Setosa, Iris Versicolour and Iris Virginica. For each species, 50 samples with four features (sepal length, sepal width, petal length, and petal width) were collected.

(4) Glass (n = 214, d = 9, k = 6), which consists of six different types of glass: building windows float processed (70 objects), building windows non-float processed (76 objects), vehicle windows float processed (17 objects), containers (13 objects), tableware (9 objects), and headlamps (29 objects). Each type has nine features: refractive index, sodium, magnesium, aluminium, silicon, potassium, calcium, barium, and iron.

(5) Wisconsin breast cancer (n = 683, d = 9, k = 2), which consists of 683 objects characterized by nine features: clump thickness, cell size uniformity, cell shape uniformity, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses. There are two categories in the data: malignant (444 objects) and benign (239 objects).

(6) Contraceptive Method Choice (n = 1473, d = 9, k = 3): This data set is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey. The samples are married women who either were not pregnant or did not know if they were at the time of interview. The problem is to predict the choice of current contraceptive method (no use: 629 objects, long-term methods: 334 objects, short-term methods: 510 objects) of a woman based on her demographic and socioeconomic characteristics.

(7) Wine (n = 178, d = 13, k = 3): These data, consisting of 178 objects characterized by 13 features (alcohol, malic acid, ash, alkalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, colour intensity, hue, OD280/OD315 of diluted wines, and proline), are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. There are three categories in the data: class 1 (59 objects), class 2 (71 objects), and class 3 (48 objects).
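The two artificial data sets can be regenerated approximately from the distributions stated in items (1) and (2); the following hedged sketch assumes that interpretation, and the random seed and exact samples naturally differ from the paper's.

```python
# Hedged sketch of generating ArtSet1 and ArtSet2 from the stated distributions;
# seed and samples are illustrative, not the ones used in the paper.
import numpy as np

rng = np.random.default_rng(0)
cov = [[0.4, 0.04], [0.04, 0.4]]
art_set1 = np.vstack([rng.multivariate_normal([m, m], cov, size=100)
                      for m in (2, 2, 6)])          # ArtSet1: means as stated in item (1)
art_set2 = np.vstack([rng.uniform(lo, hi, size=(100, 3))
                      for (lo, hi) in ((10, 25), (25, 40), (40, 55))])  # ArtSet2, item (2)
```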

Table 2
Results of KHM, PSOKHM, and IGSAKHM clustering on two artificial and five real data sets when p = 2.5. The quality of clustering is evaluated using KHM (X, C) and the F-measure. Runtimes (s) are additionally provided. The table shows means and standard deviations (in brackets) for 10 independent runs. Bold face indicates the best and italic face indicates the second best result out of the three algorithms.

            KHM                 PSOKHM              IGSAKHM
ArtSet1
KHM (X, C)  703.867(0.000)      703.509(0.050)      703.509(0.000)
F-measure   1.000(0.000)        1.000(0.000)        1.000(0.000)
Runtime     0.106(0.006)        1.921(0.007)        1.783(0.001)
ArtSet2
KHM (X, C)  111,852(0)          111,813(2)          111,813(0)
F-measure   1.000(0.000)        1.000(0.000)        1.000(0.000)
Runtime     0.223(0.008)        2.859(0.000)        2.461(0.001)
Iris
KHM (X, C)  149.333(0.000)      149.058(0.074)      149.058(0.000)
F-measure   0.750(0.000)        0.753(0.005)        0.763(0.000)
Runtime     0.192(0.008)        1.842(0.005)        1.577(0.002)
Glass
KHM (X, C)  1203.554(16.231)    1196.789(0.439)     1180.756(0.134)
F-measure   0.421(0.011)        0.424(0.003)        0.454(0.000)
Runtime     4.064(0.010)        17.669(0.018)       15.910(0.010)
Cancer
KHM (X, C)  60,189(0)           59,844(22)          59,844(0)
F-measure   0.829(0.000)        0.829(0.000)        0.829(0.000)
Runtime     2.017(0.009)        9.525(0.013)        7.509(0.007)
CMC
KHM (X, C)  96,520(0)           96,193(25)          96,193(52)
F-measure   0.335(0.000)        0.333(0.002)        0.488(0.000)
Runtime     8.639(0.009)        39.825(0.072)       31.563(0.012)
Wine
KHM (X, C)  18,386,505(0)       18,386,285(5)       18,386,285(28)
F-measure   0.516(0.000)        0.516(0.000)        0.519(0.000)
Runtime     2.059(0.010)        6.539(0.008)        5.628(0.004)

Table 3
Results of KHM, PSOKHM, and IGSAKHM clustering on two artificial and five real data sets when p = 3. The quality of clustering is evaluated using KHM (X, C) and the F-measure. Runtimes (s) are additionally provided. The table shows means and standard deviations (in brackets) for 10 independent runs. Bold face indicates the best and italic face indicates the second best result out of the three algorithms.

            KHM                        PSOKHM                IGSAKHM
ArtSet1
KHM (X, C)  742.110(0.000)             741.455(0.002)        741.455(0.000)
F-measure   1.000(0.000)               1.000(0.000)          1.000(0.000)
Runtime     0.001(0.006)               1.921(0.007)          1.789(0.002)
ArtSet2
KHM (X, C)  278,758(0)                 278,541(33)           278,541(0)
F-measure   1.000(0.000)               1.000(0.000)          1.000(0.000)
Runtime     0.220(0.008)               2.844(0.010)          2.524(0.005)
Iris
KHM (X, C)  126.517(0.000)             125.951(0.052)        125.951(0.000)
F-measure   0.744(0.000)               0.744(0.000)          0.751(0.000)
Runtime     0.190(0.007)               1.826(0.009)          1.650(0.004)
Glass
KHM (X, C)  1535.198(0.000)            1442.847(35.871)      1400.950(0.630)
F-measure   0.422(0.000)               0.427(0.003)          0.442(0.000)
Runtime     4.042(0.007)               17.609(0.015)         15.958(0.001)
Cancer
KHM (X, C)  119,458(0)                 117,418(237)          117,418(55)
F-measure   0.834(0.000)               0.834(0.000)          0.847(0.000)
Runtime     2.027(0.007)               9.594(0.023)          7.91(0.002)
CMC
KHM (X, C)  187,525(0)                 186,722(111)          186,722(94)
F-measure   0.303(0.000)               0.303(0.000)          0.472(0.000)
Runtime     8.627(0.009)               39.485(0.056)         32.107(0.034)
Wine
KHM (X, C)  298,230,848(24,270,951)    252,522,504(766)      252,522,000(0)
F-measure   0.538(0.007)               0.553(0.000)          0.553(0.000)
Runtime     2.084(0.010)               6.598(0.008)          5.710(0.001)

Table 4
Results of KHM, PSOKHM, and IGSAKHM clustering on two artificial and five real data sets when p = 3.5. The quality of clustering is evaluated using KHM (X, C) and the F-measure. Runtimes (s) are additionally provided. The table shows means and standard deviations (in brackets) for 10 independent runs. Bold face indicates the best and italic face indicates the second best result out of the three algorithms.

KHM PSOKHM IGSAKHM


ArtSet1
KHM (X, C) 807.536(0.028) 806.617(0.007) 806.617(0.010)
F-measure 1.000(0.000) 1.000(0.000) 1.000(0.000)
Runtime 0.106(0.006) 1.921(0.007) 1.766(0.001)
ArtSet2
KHM (X, C) 697,006(0) 696,049(78) 696,049(1)
F-measure 1.000(0.000) 1.000(0.000) 1.000(0.000)
Runtime 0.220(0.005) 2.842(0.005) 2.471(0.001)
Iris
KHM (X, C) 113.413(0.085) 110.004(0.260) 110.004(0.002)
F-measure 0.770(0.024) 0.762(0.004) 0.766(0.000)
Runtime 0.194(0.008) 1.873(0.005) 1.587(0.004)
Glass
KHM (X, C) 1871.812(0.000) 1857.152(4.937) 1857.152(0.035)
F-measure 0.396(0.000) 0.396(0.000) 0.420(0.000)
Runtime 4.056(0.008) 17.651(0.013) 15.799(0.003)
Cancer
KHM (X, C) 243,440(0) 235,441(696) 236,125(15)
F-measure 0.832(0.000) 0.835(0.003) 0.862(0.000)
Runtime 2.072(0.008) 9.859(0.015) 31.521(0.009)
CMC
KHM (X, C) 381,444(0) 379,678(247) 380,183(16)
F-measure 0.332(0.000) 0.332(0.000) 0.506(0.000)
Runtime 8.528(0.012) 42.701(0.250) 31.521(0.009)
Wine
KHM (X, C) 8,568,319,639(2075) 3,546,930,579(1,214,985) 3,540,920,000(232)
F-measure 0.502(0.000) 0.535(0.004) 0.536(0.000)
Runtime 2.040(0.008) 6.508(0.017) 5.536(0.001)

Table 5
Results of KHM, PSOKHM, and IGSAKHM clustering on two artificial and five real data sets when p = 4. The quality of clustering is evaluated using KHM (X, C) and the F-measure. Runtimes (s) are additionally provided. The table shows means and standard deviations (in brackets) for 10 independent runs. Bold face indicates the best and italic face indicates the second best result out of the three algorithms.

KHM PSOKHM IGSAKHM


ArtSet1
KHM (X, C) 1027.73(0.000) 916.635(41.679) 916.635(1.029)
F-measure 1.000(0.000) 1.000(0.000) 1.000(0.000)
Runtime 0.1210(0.004) 1.9511(0.001) 1.804(0.001)
ArtSet2
KHM (X, C) 1,750,830(0) 1,749,960(34746) 1,749,960(41147)
F-measure 1.000(0.000) 1.000(0.000) 1.000(0.000)
Runtime 0.258(0.001) 2.982(0.003) 2.508(0.001)
Iris
KHM (X, C) 117.563(0.000) 106.062(24.636) 98.133(0.000)
F-measure 0.785(0.000) 0.751(0.001) 0.761(0.000)
Runtime 0.227(0.001) 1.967(0.020) 1.659(0.018)
Glass
KHM (X, C) 2556.4(0.001) 2556.22(0.359) 2550.202(0.025)
F-measure 0.341(0.000) 0.342(0.000) 0.343(0.000)
Runtime 5.295(0.019) 17.7.7(0.003) 15.989(0.023)
Cancer
KHM (X, C) 505,296(0) 504,629(48954) 502,519(61156)
F-measure 0.836(0.000) 0.838(0.000) 0.846(0.000)
Runtime 2.655(0.001) 9.845(0.001) 7.651(0.001)
CMC
KHM (X, C) 805,603(0) 805,531(915) 804,351(77872)
F-measure 0.333(0.000) 0.333(0.000) 0.441(0.002)
Runtime 10.102(0.018) 40.384(0.033) 31.582(0.021)
Wine
KHM (X, C) 115,784,000,000(555,350,000) 89,454,300,000(120,877,000) 72,820,113,001(427,820,000)
F-measure 0.507(0.000) 0.512(0.000) 0.518(0.000)
Runtime 2.728(0.001) 7.434(0.016) 5.506(0.002)


4.2. Experimental results

In this section, we evaluate and compare the performance of the following methods: the KHM, PSOKHM and IGSAKHM algorithms, as means of optimizing the objective function of the KHM algorithm. The quality of the respective clusterings is also compared, where the quality is measured by the following two criteria:

(1) The sum over all data points of the harmonic average of the distances from a data point to all the centers, as defined in Eq. (1). Clearly, the smaller this sum is, the higher the quality of the clustering.

(2) The F-measure, which uses the ideas of precision and recall from information retrieval (Dalli, 2003; Handl, Knowles, & Dorigo, 2003). Each class i (as given by the class labels of the benchmark data set) is regarded as the set of n_i items desired for a query; each cluster j (generated by the algorithm) is regarded as the set of n_j items retrieved for a query; n_{ij} gives the number of elements of class i within cluster j. For each class i and cluster j, precision and recall are then defined as p(i, j) = n_{ij} / n_j and r(i, j) = n_{ij} / n_i, and the corresponding value under the F-measure is

    F(i, j) = \frac{(b^2 + 1) \, p(i, j) \, r(i, j)}{b^2 \, p(i, j) + r(i, j)},

    where we chose b = 1 to obtain equal weighting for p(i, j) and r(i, j). The overall F-measure for a data set of size n is given by

    F = \sum_i \frac{n_i}{n} \max_j \{ F(i, j) \}.

    Obviously, the bigger the F-measure is, the higher the quality of the clustering.
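For concreteness, the F-measure above can be computed as in the following sketch; the function and variable names are our own, and labels and clusters are assumed to be integer arrays of true class labels and assigned cluster indices.

```python
# Minimal sketch of the F-measure used above (b = 1 by default).
import numpy as np

def f_measure(labels, clusters, b=1.0):
    classes, class_sizes = np.unique(labels, return_counts=True)
    total, score = len(labels), 0.0
    for cls, n_i in zip(classes, class_sizes):
        best = 0.0
        for cl in np.unique(clusters):
            n_ij = np.sum((labels == cls) & (clusters == cl))
            n_j = np.sum(clusters == cl)
            if n_ij == 0:
                continue
            p, r = n_ij / n_j, n_ij / n_i                       # precision, recall
            best = max(best, (b**2 + 1) * p * r / (b**2 * p + r))  # F(i, j)
        score += (n_i / total) * best                           # F = sum_i (n_i/n) max_j F(i, j)
    return score
```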
It is known that p is a key parameter for obtaining good objective function values. For this reason we conduct our experiments with different p values. Tables 2-5 show the means and standard deviations (over 10 runs) obtained for each of these measures when p is 2.5, 3, 3.5 and 4, respectively. The runtimes of the algorithms are also shown in these tables.

We see that for ArtSet1 and ArtSet2 the F-measures of KHM, PSOKHM and IGSAKHM are all equal to 1, and the average KHM (X, C) values of the three algorithms for the first three data sets are almost the same (except for the case p = 4); that is to say, when the data set is simple and well separated, the performance of the three algorithms is nearly the same. On the other hand, the more complex the data set is, the larger the advantage of IGSAKHM becomes. The F-measure of IGSAKHM is superior to that of the other two algorithms when p is larger. In particular, when p = 4 the IGSAKHM algorithm outperforms KHM and PSOKHM in both KHM (X, C) and F-measure, with the exception of ArtSet1 and ArtSet2. The results show that our algorithm clearly outperforms the KHM algorithm in both F-measure and KHM (X, C), and in most cases it outperforms PSOKHM in F-measure while being at least comparable to PSOKHM in KHM (X, C) and requiring less runtime.

5. Conclusions

This paper first proposes the IGSA algorithm by adding two strategies to the original GSA algorithm, and then investigates a new hybrid clustering algorithm based on the KHM algorithm and the IGSA algorithm, called IGSAKHM. The proposed algorithm is tested on seven data sets, and the experimental results show that it is both efficient and effective, especially for complicated problems. It should be noted, however, that one drawback of IGSAKHM is that it requires more runtime than KHM, so when runtime is critical, IGSAKHM is not applicable. In the future, we may integrate other local search algorithms into KHM to obtain more efficient and effective clustering algorithms.

Acknowledgments

This work is fully supported by the National Natural Science Foundation of China under Grant Nos. 60473042, 60573067 and 60803102.

References

Alpaydin, E. (2004). Introduction to machine learning. Cambridge: The MIT Press. pp. 133-150.
Cui, X., & Potok, T. E. (2005). Document clustering using particle swarm optimization. In: IEEE swarm intelligence symposium. Pasadena, California.
Dalli, A. (2003). Adaptation of the F-measure to cluster-based lexicon quality evaluation. In: EACL 2003, Budapest.
Halberstadt, W., & Douglas, T. S. (2008). Fuzzy clustering to detect tuberculous meningitis-associated hyperdensity in CT images. Computers in Biology and Medicine, 38(2), 165-170.
Hammerly, G., & Elkan, C. (2002). Alternatives to the k-means algorithm that find better clusterings. In: Proceedings of the 11th international conference on information and knowledge management (pp. 600-607).
Handl, J., Knowles, J., & Dorigo, M. (2003). On the performance of ant-based clustering. Design and Application of Hybrid Intelligent Systems. Frontiers in Artificial Intelligence and Applications, 104, 204-213.
He, Y., Pan, W., & Lin, J. (2006). Cluster analysis using multivariate normal mixture models to detect differential gene expression with microarray data. Computational Statistics and Data Analysis, 51(2), 641-658.
Hu, G., Zhou, S., Guan, J., & Hu, X. (2008). Towards effective document clustering: A constrained K-means based approach. Information Processing and Management, 44(4), 1397-1409.
Kao, Y. T., Zahara, E., & Kao, I. W. (2008). A hybridized approach to data clustering. Expert Systems with Applications, 34(3), 1754-1762.
Kerr, G., Ruskin, H. J., Crane, M., & Doolan, P. (2008). Techniques for clustering gene expression data. Computers in Biology and Medicine, 38(3), 283-293.
Krishna, K., & Murty, M. N. (1999). Genetic k-means algorithm. IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics, 29, 433-439.
Li, Y. J., Chung, S. M., & Holt, J. D. (2008). Text document clustering based on frequent word meaning sequences. Data and Knowledge Engineering, 64(1), 381-404.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, 1, 281-297.
Rashedi, E., Nezamabadi-pour, H., & Saryazdi, S. (2009). GSA: A gravitational search algorithm. Information Sciences, 179, 2232-2248.
Sung, C. S., & Jin, H. W. (2000). A tabu-search-based heuristic for clustering. Pattern Recognition, 33, 849-858.
Tan, P. N., Steinbach, M., & Kumar, V. (2005). Introduction to data mining. Boston: Addison-Wesley. pp. 487-559.
Tjhi, W. C., & Chen, L. H. (2008). A heuristic-based fuzzy co-clustering algorithm for categorization of high-dimensional data. Fuzzy Sets and Systems, 159(4), 371-389.
Ünler, A., & Güngör, Z. (2008). Applying K-harmonic means clustering to the part-machine classification problem. Expert Systems with Applications. doi:10.1016/j.eswa.2007.11.048.
Webb, A. (2002). Statistical pattern recognition. New Jersey: John Wiley & Sons. pp. 361-406.
Yang, F. Q., Sun, T. L., & Zhang, C. H. (2009). An efficient hybrid data clustering method based on K-harmonic means and particle swarm optimization. Expert Systems with Applications, 36(6), 9847-9852.
Zhang, B., Hsu, M., & Dayal, U. (1999). K-harmonic means - a data clustering algorithm. Technical Report HPL-1999-124. Hewlett-Packard Laboratories.
Zhang, B., Hsu, M., & Dayal, U. (2000). K-harmonic means. In: International workshop on temporal, spatial and spatio-temporal data mining, TSDM 2000. Lyon, France, September 12.
Zhou, H., & Liu, Y. H. (2008). Accurate integration of multi-view range images using k-means clustering. Pattern Recognition, 41(1), 152-175.
