$$\sum_{i=1}^{n} \left( x_i - x_{c_k} \right)^2 \tag{1}$$
where $n$ is the number of attributes. In addition, the
Manhattan distance is given below.
$$\sum_{i=1}^{n} \left| x_i - x_{c_k} \right| \tag{2}$$
In this paper, we propose two EvoDM algorithms that combine
the K-means algorithm with two evolutionary algorithms, GA
and MPSO. The two proposed EvoDM algorithms are termed
GA-KM and MPSO-KM, respectively. To calculate the K-means
clustering result, we adopt a weighted sum of Manhattan
distances with optimal weights ($w_i$), which are the design
variables of GA or MPSO. Therefore, Equation (2) can be
rewritten in the following form.
$$\sum_{i=1}^{n} w_i \left| x_i - x_{c_k} \right| \tag{3}$$
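As a minimal sketch of the weighted Manhattan distance in Equation (3), the Python snippet below computes the distance between one record and one cluster center, then picks the nearest center. The function names and the toy values are illustrative assumptions and are not taken from the paper; setting every weight to 1 reduces the computation to Equation (2).

```python
def weighted_manhattan(x, centroid, w):
    """Weighted Manhattan distance between a record and a cluster center (Eq. 3)."""
    return sum(wi * abs(xi - ci) for xi, ci, wi in zip(x, centroid, w))


def nearest_centroid(x, centroids, w):
    """Index of the cluster center closest to x under the weighted distance."""
    distances = [weighted_manhattan(x, c, w) for c in centroids]
    return distances.index(min(distances))


# Toy example with n = 3 attributes and two cluster centers (made-up values).
record = [0.2, 0.9, 0.5]
centers = [[0.0, 1.0, 0.5], [1.0, 0.0, 0.5]]
weights = [1.0, 0.5, 2.0]
print(weighted_manhattan(record, centers[0], weights))
print(nearest_centroid(record, centers, weights))
```

A larger weight makes the corresponding attribute count more heavily in the cluster assignment, which is what the evolutionary search is meant to exploit.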
Figure 2. Flowchart of K-means algorithm
As mentioned above, we combined GA or MPSO with K-means
for computing the weights. The genetic algorithm is a stochastic
search algorithm based on the Darwinian principle of
natural selection and natural genetics. The selection is biased
toward more highly fit individuals, so the average fitness of the
population tends to improve from one generation to the next. In
general, GA searches for an optimal solution by means of
reproduction, crossover, and mutation operators [14, 15]. The
fitness of the best individual is also expected to improve over
time, and the best individual may be selected as the solution after
several generations. In this work, an intelligent GA [16],
which includes tournament selection, intelligent crossover, and
uniform mutation operators, is adopted. In addition, the elitist
strategy is applied to ensure that the best individual is
carried over to the next generation during evolutionary
computation. The pseudo-code of the intelligent GA is as
follows [16].
Begin
  Generate initial chromosomes of a population randomly;
  Do {
    design variables = DECODE(chromosomes);
    fitness = EVAL(objective function);
    parents = SELECT(individuals of population);  /* TOURNAMENT SELECTION */
    children = FFD(parent1, parent2);             /* INTELLIGENT CROSSOVER */
    new children = MUTAT(children);               /* NONUNIFORM MUTATION */
    elitist children = ELITISM(parents);          /* ELITIST STRATEGY */
  } While (stop criterion is not satisfied);
End
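The loop above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the intelligent crossover of [16] is not reproduced here and is replaced by plain uniform crossover, chromosomes are real-valued vectors in [0, 1], and the function names, population size, and operator rates are hypothetical choices rather than the settings used in the paper.

```python
import random


def tournament(population, fitness, rng, size=2):
    """Return the fittest (lowest objective) of `size` randomly drawn individuals."""
    contenders = rng.sample(range(len(population)), size)
    winner = min(contenders, key=lambda idx: fitness[idx])
    return population[winner]


def run_ga(objective, n_vars, pop_size=20, generations=50,
           crossover_rate=0.9, mutation_rate=0.1, seed=0):
    """Minimize `objective` over vectors in [0, 1]^n_vars; return (best, best_value)."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(n_vars)] for _ in range(pop_size)]
    best, best_value = None, float("inf")
    for _ in range(generations):
        fitness = [objective(ind) for ind in population]
        # Elitist strategy: remember the best individual evaluated so far.
        gen_best = min(range(pop_size), key=lambda idx: fitness[idx])
        if fitness[gen_best] < best_value:
            best, best_value = list(population[gen_best]), fitness[gen_best]
        children = [list(best)]  # the elite is copied unchanged
        while len(children) < pop_size:
            p1 = tournament(population, fitness, rng)
            p2 = tournament(population, fitness, rng)
            # Uniform crossover (stand-in for the intelligent crossover of [16]).
            if rng.random() < crossover_rate:
                child = [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]
            else:
                child = list(p1)
            # Uniform mutation: occasionally resample a gene from [0, 1].
            child = [rng.random() if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = children
    return best, best_value


# Usage on a toy objective whose minimum lies at (0.5, 0.5, 0.5).
solution, value = run_ga(lambda ind: sum((g - 0.5) ** 2 for g in ind), n_vars=3)
print(solution, value)
```

Because the elite is always copied into the next generation, the best objective value found never gets worse from one generation to the next.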
For a comparative study, we also combine the MPSO algorithm
with K-means for computing the optimal weights. Liu and Lin
proposed the MPSO in 2007 [17] to improve the
computational efficiency and solution accuracy of Shi and
Eberhart's PSO [18]. The original PSO developed by Kennedy
and Eberhart [19] supposes that the $i$-th particle flies over a
hyperspace, with its position and velocity given by $x_i$ and $v_i$.
The best previous position of the $i$-th particle is denoted by
$Pbest_i$. The term $Gbest_i$ represents the best particle with the
highest function value in the population. Liu and Lin's
MPSO gives the next flying velocity and position of
particle $i$ at iteration $k+1$ by the following heuristic
equations:
$$v_i^{k+1} = \beta \, (\Delta v_i^{k}) + c_1 r_1 (Pbest_i - x_i^{k}) + c_2 r_2 (Gbest_i - x_i^{k}) \tag{4}$$

$$x_i^{k+1} = x_i^{k} + v_i^{k+1}, \quad i = 1, 2, \ldots, N_{particle} \tag{5}$$

where $c_1$ and $c_2$ are the cognitive and social learning rates,
respectively. The random numbers $r_1$ and $r_2$ are uniformly
distributed in the range [0, 1]. The value of $\beta$ is a positive
number ($0 < \beta < 1$) termed the momentum constant, which
controls the rate of change in the velocity vector. In this paper, the
value of $\beta$ was 0.1.
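The update in Equations (4) and (5) can be sketched for a single particle as below. Several points are assumptions rather than details confirmed by this section: the momentum term is taken to act on the change in velocity between the two most recent iterations, one pair of random numbers is drawn per update, and the learning rates default to 2.0, a common PSO choice. Only the momentum constant of 0.1 comes from the text. The function name `mpso_step` is hypothetical.

```python
import random


def mpso_step(x, v, v_prev, pbest, gbest, beta=0.1, c1=2.0, c2=2.0, rng=random):
    """One momentum-type velocity and position update for a single particle.

    x, v     : current position and velocity (lists of floats)
    v_prev   : velocity at the previous iteration (assumed input for the
               momentum term, which here uses v - v_prev)
    pbest    : best previous position of this particle
    gbest    : best position found in the population
    """
    r1, r2 = rng.random(), rng.random()  # uniform in [0, 1)
    new_v = [
        beta * (vi - vpi) + c1 * r1 * (pb - xi) + c2 * r2 * (gb - xi)
        for xi, vi, vpi, pb, gb in zip(x, v, v_prev, pbest, gbest)
    ]
    # Position update of Eq. (5): move by the new velocity.
    new_x = [xi + nvi for xi, nvi in zip(x, new_v)]
    return new_x, new_v


# Usage: a two-dimensional particle pulled toward its personal and global bests.
position, velocity = mpso_step(
    x=[0.2, 0.8], v=[0.05, -0.05], v_prev=[0.0, 0.0],
    pbest=[0.3, 0.7], gbest=[0.5, 0.5],
)
print(position, velocity)
```

A particle that already sits on both its personal best and the global best, with unchanged velocity, receives a zero update and stays where it is.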
III. PROPOSED EVOLUTIONARY DATA MINING ALGORITHMS
This study aims to optimize the clustering of cardiac disease
data by EvoDM algorithms based on the K-means
algorithm. Although K-means is a popular method
to solve this kind of clustering problem, its drawback is
that the accuracy of the clustering results needs further improvement.
Therefore, the K-means clustering algorithm is combined with
GAs as hybrid genetic models [20, 21] to improve the
prediction accuracy. The two proposed EvoDM algorithms are
employed to determine the optimal weights and final cluster
centers for the attributes, and the prediction accuracy for the dataset is
computed based on the optimal weights and final cluster
centers. Furthermore, this work also implements the Naïve Bayes,
C5, and K-means algorithms to compare their classification
results with those of the two presented EvoDM algorithms. The
objective function, $Obj(w)$, defined over the attribute weights,
is as follows.
$$\min \; Obj(w) = \sum_{i=1}^{M} \left| C_{pred}^{(i)} - C_{actual}^{(i)} \right| \tag{6}$$
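A minimal Python sketch of this objective is given below. It assumes that the class labels are coded as 0 and 1, that the clusters have already been mapped to class labels, and that the mismatch is measured by an absolute difference. Under those assumptions the objective equals the number of misclassified records. The function name and the toy labels are illustrative only.

```python
def objective(predicted, actual):
    """Sum of absolute differences between predicted and actual class labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum(abs(p - a) for p, a in zip(predicted, actual))


# Toy example with five records (made-up labels).
predicted = [0, 1, 1, 0, 1]
actual = [0, 1, 0, 0, 0]
errors = objective(predicted, actual)
accuracy = 1 - errors / len(actual)
print(errors, accuracy)
```

Minimizing this value over the weights is therefore equivalent to maximizing the prediction accuracy on the same records.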
This study proposes two kinds of EvoDM algorithms:
GA-based K-means and MPSO-based K-means. The role
of GA or MPSO is to provide the weights ($w$), which are
equivalent to the design variables of GA or MPSO, while
K-means computes the centroids of the clusters ($x_{c_k}$). After
completing the classification via the synergy of K-means and
GA (or MPSO), the fitness function value, obtained by
calculating Eq. (6), is updated and the next generation begins. The
optimal weights are obtained when the stop condition of GA (or
MPSO) is satisfied after the computation has evolved over a
number of generations. The stop condition most often used for GA
(or MPSO) is the maximum number of generations.
Accordingly, the flowcharts of GA-KM and MPSO-KM are
depicted in Figures 3 and 4.
Figure 3. Flowchart of GA-KM algorithm
Figure 4. Flowchart of MPSO-KM algorithm
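The fitness evaluation that couples K-means with GA or MPSO can be sketched as follows: a candidate weight vector drives a weighted K-means run, and the resulting cluster labels are scored against the known classes. This is an illustration under assumptions not stated in the paper: cluster index k is taken to correspond to class label k (for example, by seeding each center from a record of that class), centers are updated as attribute means, and the iteration count is fixed. The names `weighted_kmeans` and `km_fitness` are hypothetical.

```python
def weighted_kmeans(X, w, centroids, iterations=20):
    """K-means iterations using a weighted Manhattan distance for assignment."""
    centroids = [list(c) for c in centroids]
    labels = [0] * len(X)
    for _ in range(iterations):
        # Assignment step: each record joins its nearest center.
        for idx, x in enumerate(X):
            dists = [sum(wi * abs(xi - ci) for xi, ci, wi in zip(x, c, w))
                     for c in centroids]
            labels[idx] = dists.index(min(dists))
        # Update step: each center becomes the mean of its members;
        # an empty cluster keeps its previous center.
        for k in range(len(centroids)):
            members = [x for x, lab in zip(X, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def km_fitness(w, X, y, initial_centroids):
    """Objective value for weights w: mismatches between clusters and classes."""
    labels, _ = weighted_kmeans(X, w, initial_centroids)
    return sum(abs(p - a) for p, a in zip(labels, y))


# Toy data: four normalized records with two attributes and known classes.
X = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.2], [0.8, 0.1]]
y = [0, 0, 1, 1]
seeds = [[0.0, 1.0], [1.0, 0.0]]
print(km_fitness([1.0, 1.0], X, y, seeds))
print(km_fitness([0.0, 0.0], X, y, seeds))
```

An optimizer such as GA or MPSO would call `km_fitness` once per candidate weight vector and keep the weights that give the smallest value.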
IV. RESULTS & DISCUSSION
A. Dataset Sample
This work evaluates the proposed algorithms on the cardiac
disease dataset from the UCI machine learning repository.
The dataset contains 270 instances belonging to two
classes: normal (150) and heart patient (120). Each record in
the cardiac disease dataset is characterized by 14 attributes,
including 13 condition attributes and 1 decision attribute, which
is the presence of cardiac disease. The attribute information
is tabulated in Table I.
TABLE I. ATTRIBUTE INFORMATION

Attribute           Description
1. age              age in years
2. sex              value 1: male; value 0: female
3. cp               chest pain type
                    value 1: typical angina; value 2: atypical angina
                    value 3: non-anginal pain; value 4: asymptomatic
4. trestbps         resting blood pressure in mmHg
5. chol             serum cholesterol in mg/dl
6. fbs              fasting blood sugar > 120 mg/dl
                    value 1: true; value 0: false
7. restecg          resting electrocardiographic results
                    value 0: normal
                    value 1: having ST-T wave abnormality (T wave inversions
                    and/or ST elevation or depression of > 0.05 mV)
                    value 2: showing probable or definite left ventricular
                    hypertrophy by Estes' criteria
8. thalach          maximum heart rate achieved
9. exang            exercise-induced angina
                    value 1: yes; value 0: no
10. oldpeak         ST depression induced by exercise relative to rest
11. slope           the slope of the peak exercise ST segment
                    value 1: upsloping; value 2: flat; value 3: downsloping
12. ca              number of major vessels (0-3) colored by fluoroscopy
13. thal            value 3: normal; value 6: fixed defect
                    value 7: reversible defect
14. num (outcome)   diagnosis of cardiac disease (angiographic disease status)
                    value 0: <50% diameter narrowing
                    value 1: >50% diameter narrowing
[Box labels of the flowcharts in Figures 3 and 4: initial population; decode parameters and calculate the fitness function value, with the K-means result setting the objective function of GA (or MPSO); if the stop condition is satisfied, output the best solution; otherwise apply reproduction, crossover, and mutation (GA-KM) or update the particles by $x_i^{k+1} = x_i^{k} + v_i^{k+1}$ (MPSO-KM).]
B. Normalization
Partial listings of the original and normalized datasets are
given in Tables II and III, respectively. Normalization is
required for the computation using the K-means algorithm to
avoid wide gaps in the scale of the attributes during the
determination of cluster centers. Because the disease dataset has
13 condition attributes, this work specified 13
weights ($w_1, w_2, \ldots, w_{13}$) by applying the GA-KM and
MPSO-KM algorithms.
All values of w