
Development of Evolutionary Data Mining Algorithms and their Applications to Cardiac Disease Diagnosis

Jenn-Long Liu
Dept. of Information Management
I-Shou University
Kaohsiung 84001, Taiwan
e-mail: jlliu@isu.edu.tw

Yu-Tzu Hsu
Dept. of Information Engineering
I-Shou University
Kaohsiung 84001, Taiwan
e-mail: isu9903009d@isu.edu.tw

Chih-Lung Hung
Dept. of Information Technology
E-Da Hospital
Kaohsiung 82445, Taiwan
e-mail:ed100005@edah.org.tw

Abstract: This paper presents two evolutionary data mining
(EvoDM) algorithms, termed GA-KM and MPSO-KM, to cluster a
cardiac disease dataset and predict the accuracy of diagnosis.
The proposed GA-KM is a hybrid method that combines a genetic
algorithm (GA) with the K-means (KM) algorithm, and MPSO-KM is
a hybrid method that combines momentum-type particle swarm
optimization (MPSO) with the K-means algorithm. GA-KM and
MPSO-KM determine the optimal attribute weights and cluster
centers needed to classify the disease dataset. The dataset used
in this study contains 270 instances, each with 13 attributes,
which are data records of cardiac disease. A comparative study is
conducted using C5, Naive Bayes, K-means, GA-KM, and MPSO-KM to
evaluate the accuracy of the presented algorithms. Our
experiments indicate that the clustering accuracy on the cardiac
disease dataset is significantly improved by GA-KM and MPSO-KM
compared with K-means alone.
Keywords: Evolutionary data mining; genetic algorithm;
momentum-type particle swarm optimization; K-means algorithm;
cardiac disease.
I. INTRODUCTION
Cardiac disease, also called heart disease or cardiopathy, is an
umbrella term for a variety of diseases affecting the heart. As
is well known, cardiac disease is a silent killer associated
with poor diet and exercise habits, as many people do not know
that the behaviors they engage in are slowly hurting their
cardiovascular system. According to a report of the World Health
Organization (WHO), more than 17 million deaths worldwide are
due to cardiac disease each year. As of 2007, it is the leading
cause of death in the United States [1], England, and Canada,
accounting for 25.4% of total deaths in the United States.
Moreover, cardiac risk in women has been increasing, and cardiac
disease has killed more women than breast cancer in recent
years.
The heart is the key organ in the circulatory system. It is a
complex, highly specialized, muscular organ in the chest that
maintains the circulation of blood throughout the body. The
circulatory system is composed of the heart and blood vessels,
including arteries, veins, and capillaries. Most cardiac
diseases are caused by high blood pressure, which contributes to
hardening of the arteries. High levels of bad cholesterol (LDL)
build up in the arteries as a result of an uncontrolled diet
with high levels of saturated fats and trans fats. A sticky
substance, commonly called plaque, can develop and build up on
the interior walls of the coronary arteries. All of these
contribute to the formation of atherosclerotic lesions and
eventually to arterial blockage. Anything that damages the inner
lining of blood vessels impedes the transport of oxygen and
nutrients to the heart, so that the heart cannot pump a
sufficient amount of blood, which puts people at higher risk of
heart attack. Plaque in the coronary artery is illustrated in
Fig. 1.









Figure 1. Coronary artery blockage
(Figure labels: plaque build-up in the coronary artery blocking
blood flow and oxygen to the heart; damage and death to heart
tissue, shown in purple.)
To investigate cardiac disease, literature surveys indicate that
there are several risk factors, including age, sex, race,
smoking, hypertension, hypercholesterolemia, diabetes, obesity,
diuretics, inactivity, and stress [2]. In medical diagnosis,
doctors use a variety of medical tests to evaluate heart health,
such as the Electrocardiogram (EKG), Chest X-Ray, Stress Test,
Echocardiogram, Computed Tomography (CT) Heart Scan, and Heart
Magnetic Resonance Imaging (MRI). Since cardiac disease has many
causes and many types of heart problems, and there is still no
single diagnostic test that can detect whether a patient has
cardiac disease, it is very difficult for a doctor to judge
whether a patient has cardiac disease or not.
This work was supported in part by grant NSC 100-2221-E-214-040 from the
National Science Council of Republic of China.
U.S. Government work not protected by U.S. copyright
WCCI 2012 IEEE World Congress on Computational Intelligence
June, 10-15, 2012 - Brisbane, Australia IEEE CEC
Owing to the development of novel medical techniques for
generating and collecting data, the growth rate of medical
databases has become tremendous. Data mining, the extraction
of hidden predictive information from large databases, is a
powerful new technology with great potential to help medical
institutions focus on the most important information in their
disease databases. Data mining is also a crucial step in the
Knowledge Discovery in Databases (KDD) process, which consists
of applying data analysis and knowledge discovery algorithms to
produce useful patterns (or rules) over the datasets. Generally,
data mining technologies include [3, 4]:
1) Association Rules,
2) Classification,
3) Clustering Analysis,
4) Regression Analysis,
5) Time Series Analysis, etc.
Clustering analysis has been widely used in data mining and is
notable for its ability to handle large and complex datasets.
Clustering is a main method of exploratory data mining and a
common technique for statistical data analysis, and it can be
applied to machine learning, image analysis, pattern
recognition, information retrieval, and bioinformatics. The
K-means algorithm is one of the most frequently used clustering
algorithms. When the number of clusters is fixed to k, the
K-means algorithm is formally defined as an optimization
problem: specify k cluster centers and assign each instance to
the cluster with the smallest distance, measured for example by
the Euclidean distance, from the instance to the cluster center
[3].
Many studies show that the use of intelligent systems for
decision support in clinical trials could help minimize costs
[5]. Shantakumar and Kumaraswamy [6] proposed an intelligent
heart attack prediction system using data mining and artificial
neural network techniques. After preprocessing, the cardiac
disease data warehouse is clustered with the aid of the K-means
clustering algorithm. The frequent patterns applicable to
cardiac disease are then mined from the extracted data with the
Maximal Frequent Itemset Algorithm (MAFIA). In addition, the
patterns vital to heart attack prediction are selected on the
basis of the computed significance weightage. The values
corresponding to each attribute in the significant patterns are
as follows: blood pressure greater than 140/90 mmHg, cholesterol
greater than 240 mg/dl, maximum heart rate greater than 100
beats/minute, abnormal ECG, and unstable angina. Subsequently, a
Multi-layer Perceptron Neural Network is trained on the selected
significant patterns using the back-propagation training
algorithm. Their paper illustrates that significant patterns can
be expected to give better prediction of cardiac disease.
Since no doctor is a specialist in every field, and many
hospitals across the United States are facing significant staff
shortages, it is useful to develop an automated medical
diagnosis system. Parthiban and Subramanian [7] proposed a
coactive neuro-fuzzy inference system (CANFIS) model that
combines the adaptive capabilities of neural networks with the
qualitative approach of fuzzy logic and is then integrated with
a genetic algorithm (GA) to diagnose the presence of cardiac
disease. Their results showed that GA is a very useful technique
for auto-tuning the CANFIS parameters and selecting the optimal
feature set. Besides, Rajeswari et al. [8] mentioned that the
aetiology of cardiac disease is multifactorial. Their studies
identified several important risk factors that make the
occurrence of the disease more probable: increasing age, male
sex, smoking, hypertension, serum cholesterol, diabetes,
sedentary lifestyle, blood type A, and a family history of
cardiac disease. They employed GA to determine the highest-impact
association pattern for cardiac disease diagnosis and the
optimal values corresponding to each of the considered
attributes; a risk factor can then be obtained by summing the
risk scores of the various features. Srinivas et al. [9] used
the One Dependency Augmented Naive Bayes classifier (ODANB) and
the Naive Credal Classifier 2 (NCC2) to analyze three cardiac
disease datasets, Heart-c, Heart-h, and Heart-statlog, from the
UCI repository. Their results show that the highest accuracy
rate is only 84.14%, which also indicates that data mining alone
could not enhance the diagnostic precision. In Kumari and
Godara's [10] research paper, data mining classification
techniques, namely the RIPPER classifier, Decision Tree,
Artificial Neural Networks (ANNs), and Support Vector Machine
(SVM), are applied to the Heart-h dataset. Their analysis shows
that SVM predicts with the lowest error rate of 0.1588 and the
highest accuracy of 84.12%.
Moreover, Das et al. [11] introduced a methodology, based on SAS
software 9.1.3, for diagnosing heart disease. They obtained a
classification accuracy of 89.01% in experiments on data taken
from the Cleveland heart disease database (Heart-c). They also
obtained sensitivity and specificity values of 80.95% and
95.91%, respectively, in heart disease diagnosis. Anbarasi et
al. [12] proposed that GA is useful for determining the
significant attributes, which contribute more towards the
diagnosis of heart ailments, and thereby indirectly reduces the
number of tests that a patient needs to take. The classification
accuracy obtained using Decision Tree and Bayesian
Classification is improved by applying GA to reduce the actual
data size and obtain the optimal subset of attributes that is
sufficient for heart disease prediction [13]. According to these
studies, using GA for data mining optimization can auto-tune the
significant patterns and increase the classification accuracy
for cardiac disease.
The remainder of this paper is organized as follows. The
algorithms of K-means clustering, GA, and momentum-type particle
swarm optimization (MPSO) are described in Section II, and the
proposed evolutionary data mining (EvoDM) algorithms are
discussed in Section III. Section IV contains the data
description and result analysis. Finally, we conclude in Section
V.
II. K-MEANS, GA, AND MPSO ALGORITHMS
This study uses the K-means clustering algorithm to partition
the cardiac disease dataset into k clusters. Each cluster is
represented by its centroid ($x_c^k$), which is the mean of the
feature vectors within the cluster. When the number of clusters
is fixed to k, the K-means algorithm is formally defined as an
optimization problem: specify k cluster centers and assign each
instance to the cluster c whose center is at the smallest
distance from the instance. The flowchart of K-means is depicted
in Fig. 2. Generally, the global shape or size of clusters is
determined by an appropriate distance measure such as the
Euclidean or Manhattan distance. The Euclidean distance is
expressed by equation (1) as follows:
$$\sqrt{\sum_{i=1}^{n} \left(x_i - x_c^k\right)^2}$$ (1)
where n represents the number of attributes. In addition, the
Manhattan distance is given below.
$$\sum_{i=1}^{n} \left|x_i - x_c^k\right|$$ (2)
In this paper, we propose two EvoDM algorithms, which combine
the K-means algorithm with two evolutionary algorithms, GA and
MPSO. The two proposed EvoDM algorithms are termed GA-KM and
MPSO-KM, respectively. To calculate the K-means clustering
result, we adopt the weighted sum of Manhattan distances with
optimal weights ($w_i$), which are the design variables of GA or
MPSO. Therefore, Equation (2) can be rewritten in the following
form.
$$\sum_{i=1}^{n} w_i \left|x_i - x_c^k\right|$$ (3)

[Figure 2 flowchart: Start; select the number of clusters and
the initial centers; calculate the center of every cluster;
calculate the distance between every data point and each
centroid; associate each data point with the nearest centroid;
if any data point has moved to another centroid (Yes), repeat
from the center calculation; otherwise (No), End.]
Figure 2. Flowchart of K-means algorithm
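The loop of Fig. 2 with the weighted Manhattan distance of Eq. (3) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the toy data, and the choice to leave an empty cluster's center unchanged are all hypothetical.

```python
def weighted_manhattan(x, center, weights):
    """Eq. (3): sum over attributes of w_i * |x_i - c_i|."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, x, center))


def assign_clusters(data, centers, weights):
    """Assign every instance to the index of its nearest centroid."""
    return [
        min(range(len(centers)),
            key=lambda k: weighted_manhattan(x, centers[k], weights))
        for x in data
    ]


def update_centers(data, labels, centers):
    """Recompute each centroid as the mean of its member instances."""
    new_centers = []
    for k, old in enumerate(centers):
        members = [x for x, lab in zip(data, labels) if lab == k]
        if not members:
            # Assumption: an empty cluster keeps its previous center.
            new_centers.append(list(old))
            continue
        new_centers.append([sum(col) / len(members) for col in zip(*members)])
    return new_centers


def weighted_kmeans(data, centers, weights, max_iter=100):
    """Alternate assignment and update until no instance changes cluster."""
    labels = None
    for _ in range(max_iter):
        new_labels = assign_clusters(data, centers, weights)
        if new_labels == labels:
            break
        labels = new_labels
        centers = update_centers(data, labels, centers)
    return labels, centers


# Toy data with two attributes; weight 0 makes the second attribute irrelevant.
data = [[0.1, 0.9], [0.2, 0.1], [0.8, 0.8], [0.9, 0.2]]
labels, centers = weighted_kmeans(data, [[0.0, 0.5], [1.0, 0.5]], [1.0, 0.0])
print(labels)
```

Setting a weight to zero removes an attribute from the distance entirely, which is why the weights found by GA or MPSO can be read as attribute importances.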
As mentioned above, we combine GA or MPSO with K-means to
compute the weights. The genetic algorithm is a stochastic
search algorithm based on the Darwinian principle of natural
selection and natural genetics. Selection is biased toward more
highly fit individuals, so the average fitness of the population
tends to improve from one generation to the next. In general, GA
generates an optimal solution by means of reproduction,
crossover, and mutation operators [14, 15]. The fitness of the
best individual is also expected to improve over time, and the
best individual may be selected as the solution after several
generations. In this work, an intelligent GA [16], which
includes tournament selection, intelligent crossover, and
uniform mutation operators, is adopted. In addition, an elitist
strategy is applied to ensure that the best individual is
carried over to the next generation during the evolutionary
computation. The pseudo-code of the intelligent GA is as follows
[16].
Begin
  Generate initial chromosomes of a population randomly;
  Do {
    design variables = DECODE(chromosomes);
    fitness = EVAL(objective function);
    parents = SELECT(individuals of population);  /* TOURNAMENT SELECTION */
    children = FFD(parent1, parent2);             /* INTELLIGENT CROSSOVER */
    new children = MUTAT(children);               /* NONUNIFORM MUTATION */
    elitist children = ELITISM(parents);          /* ELITIST STRATEGY */
  } While (stop criterion is not satisfied);
End
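The structure of the pseudo-code above can be illustrated with a small real-coded GA in Python. This sketch makes several assumptions that are not from the paper: a plain uniform crossover stands in for the intelligent (FFD) crossover of [16], mutation redraws a gene uniformly in [0, 1], and the function name, default parameters, and toy objective are hypothetical.

```python
import random


def run_ga(objective, n_vars, pop_size=20, generations=50,
           crossover_rate=0.9, mutation_rate=0.1, seed=0):
    """Minimize `objective` over vectors in [0, 1]^n_vars."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(n_vars)] for _ in range(pop_size)]
    best = min(pop, key=objective)
    history = [objective(best)]

    def tournament():
        # Binary tournament selection: the fitter of two random individuals wins.
        a, b = rng.sample(pop, 2)
        return a if objective(a) <= objective(b) else b

    for _ in range(generations):
        children = []
        while len(children) < pop_size - 1:
            p1, p2 = tournament(), tournament()
            if rng.random() < crossover_rate:
                # Uniform crossover (assumed stand-in for the FFD crossover).
                child = [g1 if rng.random() < 0.5 else g2
                         for g1, g2 in zip(p1, p2)]
            else:
                child = list(p1)
            # Mutation: redraw a gene anywhere in [0, 1] with small probability.
            child = [rng.random() if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        # Elitist strategy: the best individual so far always survives.
        pop = children + [list(best)]
        best = min(pop, key=objective)
        history.append(objective(best))
    return best, history


# Hypothetical toy objective: squared distance to the point (0.5, 0.5, 0.5).
best, history = run_ga(lambda w: sum((g - 0.5) ** 2 for g in w), n_vars=3)
print(history[0], history[-1])
```

Because the elite individual is copied into every new population, the best objective value recorded in `history` can never get worse from one generation to the next.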
For a comparative study, we also combine the MPSO algorithm with
K-means to compute the optimal weights. Liu and Lin proposed
MPSO in 2007 [17] to improve the computational efficiency and
solution accuracy of Shi and Eberhart's PSO [18]. The original
PSO developed by Kennedy and Eberhart [19] supposes that the
i-th particle flies over a hyperspace, with its position and
velocity given by $\vec{x}_i$ and $\vec{v}_i$. The best previous
position of the i-th particle is denoted by $Pbest_i$. The term
$Gbest_i$ represents the best particle with the highest function
value in the population. Liu and Lin's MPSO computes the next
flying velocity and position of particle i at iteration k+1 by
using the following heuristic equations:
$$\vec{v}_i^{\,k+1} = \beta\,(\Delta\vec{v}_i^{\,k}) + c_1 r_1 \left(Pbest_i - \vec{x}_i^{\,k}\right) + c_2 r_2 \left(Gbest_i - \vec{x}_i^{\,k}\right)$$ (4)
$$\vec{x}_i^{\,k+1} = \vec{x}_i^{\,k} + \vec{v}_i^{\,k+1}, \quad i = 1, 2, \ldots, N_{particle}$$ (5)
where $c_1$ and $c_2$ are the cognitive and social learning
rates, respectively. The random functions $r_1$ and $r_2$ are
uniformly distributed in the range [0, 1]. The value of $\beta$
is a positive number ($0 < \beta < 1$) termed the momentum
constant, which controls the rate of change in the velocity
vector, $\Delta\vec{v}_i^{\,k}$. In this paper, the value of
$\beta$ was 0.1.
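A single particle update following Eqs. (4) and (5) might look like the sketch below. It rests on assumptions that should be checked against [17]: the momentum term is read as the momentum constant times the change in velocity between the two previous iterations, the learning rates `c1 = c2 = 2.0` are illustrative values not given in this paper, and the function name `mpso_step` is hypothetical. Only the momentum constant 0.1 comes from the text.

```python
import random


def mpso_step(x, v, v_prev, pbest, gbest, beta=0.1, c1=2.0, c2=2.0, rng=random):
    """One velocity and position update for a single particle.

    ASSUMPTION: the momentum term is beta * (v^k - v^(k-1)), i.e. beta
    multiplies the most recent change in velocity.
    """
    r1, r2 = rng.random(), rng.random()  # uniform in [0, 1)
    v_new = [
        beta * (vi - vpi) + c1 * r1 * (pb - xi) + c2 * r2 * (gb - xi)
        for xi, vi, vpi, pb, gb in zip(x, v, v_prev, pbest, gbest)
    ]
    # Position update: move the particle by its new velocity.
    x_new = [xi + vn for xi, vn in zip(x, v_new)]
    return x_new, v_new


# A particle that already sits on both best positions and has a constant
# velocity receives no pull and no momentum, so it stays where it is.
x_new, v_new = mpso_step([0.5, 0.5], [0.1, 0.1], [0.1, 0.1],
                         [0.5, 0.5], [0.5, 0.5])
print(x_new, v_new)
```

In GA-KM and MPSO-KM the position vector would hold the 13 attribute weights, so in practice each component would also be kept inside [0, 1]; that clipping step is omitted here.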
III. PROPOSED EVOLUTIONARY DATA MINING ALGORITHMS
This study aims to optimize the clustering of cardiac disease
data with EvoDM algorithms based on the K-means algorithm.
Although K-means is a popular method for solving this kind of
clustering problem, its drawback is that the accuracy of the
clustering results needs further improvement. Therefore, the
K-means clustering algorithm is combined with GAs as hybrid
genetic models [20, 21] to improve the prediction accuracy. The
two proposed EvoDM algorithms are employed to determine the
optimal attribute weights and final cluster centers, and the
prediction accuracy for the dataset is computed based on these
optimal weights and final cluster centers. Furthermore, this
work also implements the Naive Bayes, C5, and K-means algorithms
to compare their classification results with those of the two
presented EvoDM algorithms. The objective function,
Obj($\vec{w}$), for GA-KM and MPSO-KM is specified by minimizing
the clustering errors between the predicted classification
results ($C_{pred}$) and the original ones ($C_{actual}$) over M
training data, in order to determine the optimal weights
($\vec{w}$) of the attributes, as follows.
$$Obj(\vec{w}) = \mathrm{Min} \sum_{i=1}^{M} \left[ (C_{pred})_i - (C_{actual})_i \right]$$ (6)
This study proposes two EvoDM algorithms, GA-based K-means and
MPSO-based K-means. The role of GA or MPSO is to provide the
weights ($\vec{w}$), which are the design variables of GA or
MPSO, while K-means computes the cluster centroids ($x_c^k$).
After completing the classification via the synergy of K-means
and GA (or MPSO), the fitness function value, obtained by
calculating Eq. (6), is updated and the next generation begins.
The optimal weights are obtained when the stop condition of GA
(or MPSO) is satisfied after the computation has evolved over a
number of generations. The most often used stop condition for GA
(or MPSO) is the maximum number of generations. Accordingly, the
flowcharts of GA-KM and MPSO-KM are depicted in Figures 3 and 4.
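How a candidate weight vector could be scored with Eq. (6) is sketched below. Several points are assumptions rather than details given in the paper: the cluster index is used directly as the predicted class, the absolute difference is taken so that errors in opposite directions cannot cancel, the cluster centers are passed in as fixed values instead of being refined by K-means, and the name `clustering_error` and the toy data are hypothetical.

```python
def clustering_error(weights, data, actual, centers):
    """Count instances whose nearest-centroid cluster differs from the class.

    ASSUMPTIONS: cluster k is interpreted as class k, classes are coded 0/1,
    and |C_pred - C_actual| is summed so that each error contributes 1.
    """
    errors = 0
    for x, c_actual in zip(data, actual):
        # Weighted Manhattan distance to every cluster center, as in Eq. (3).
        dists = [
            sum(w * abs(a - b) for w, a, b in zip(weights, x, center))
            for center in centers
        ]
        c_pred = dists.index(min(dists))
        errors += abs(c_pred - c_actual)
    return errors


# Toy data: only the first attribute separates the two classes.
data = [[0.1, 0.9], [0.2, 0.1], [0.8, 0.8], [0.9, 0.2]]
actual = [0, 0, 1, 1]
centers = [[0.15, 0.5], [0.85, 0.5]]
print(clustering_error([1.0, 0.0], data, actual, centers))
print(clustering_error([0.0, 1.0], data, actual, centers))
```

GA or MPSO would call such a function once per candidate weight vector and keep the vectors with the smallest error, which is the role of the fitness evaluation box in the flowcharts.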














Figure 3. Flowchart of GA-KM algorithm












Figure 4. Flowchart of MPSO-KM algorithm
IV. RESULTS & DISCUSSION
A. Dataset Sample
This work uses 270 instances of cardiac disease for data mining.
In our experiments, we evaluated the proposed algorithms on the
cardiac disease dataset from the UCI machine learning
repository. It contains 270 instances belonging to two classes:
normal (150) and heart patient (120). Each record in the cardiac
disease dataset is characterized by 14 attributes, including 13
condition attributes and 1 decision attribute, which is the
presence of cardiac disease. The attribute information is
tabulated in Table I.
TABLE I. ATTRIBUTE INFORMATION
Attribute      Description
1. age         age in years
2. sex         value 1: male; value 0: female
3. cp          chest pain type
               value 1: typical angina; value 2: atypical angina
               value 3: non-anginal pain; value 4: asymptomatic
4. trestbps    resting blood pressure in mmHg
5. chol        serum cholesterol in mg/dl
6. fbs         fasting blood sugar > 120 mg/dl
               value 1: true; value 0: false
7. restecg     resting electrocardiographic results
               value 0: normal
               value 1: having ST-T wave abnormality (T wave inversions
               and/or ST elevation or depression of > 0.05 mV)
               value 2: showing probable or definite left ventricular
               hypertrophy by Estes' criteria
8. thalach     maximum heart rate achieved
9. exang       exercise induced angina. value 1: yes; value 0: no
10. oldpeak    ST depression induced by exercise relative to rest
11. slope      the slope of the peak exercise ST segment
               value 1: upsloping; value 2: flat; value 3: downsloping
12. ca         number of major vessels (0-3) colored by fluoroscopy
13. thal       value 3: normal; value 6: fixed defect
               value 7: reversible defect
14. num        diagnosis of cardiac disease (angiographic disease status)
(outcome)      value 0: <50% diameter narrowing
               value 1: >50% diameter narrowing
[Figure 3 flowchart (GA-KM): initial population; decode the
parameters and calculate the fitness function value, with the
K-means result setting the objective function of GA; if the stop
condition is satisfied (Yes), output the best solution;
otherwise (No), apply reproduction, crossover, and mutation and
repeat.]
[Figure 4 flowchart (MPSO-KM): initialization (initial position
and speed of each particle); calculate the fitness function
value, with the K-means result setting the objective function of
MPSO; if the stop condition is satisfied (Yes), output the best
parameter; otherwise (No), update each particle's speed and
position by Eqs. (4) and (5) and repeat.]
B. Normalization
Partial data of the original and normalized datasets are listed
in Tables II and III, respectively. Normalization is required
for computation with the K-means algorithm to avoid wide gaps
between the scales of attributes during the determination of
cluster centers. Because the disease dataset has 13 condition
attributes, this work specified 13 weights
($w_1, w_2, \ldots, w_{13}$) when applying the GA-KM and MPSO-KM
algorithms. All values of $\vec{w}$ are specified in the range
[0, 1].


TABLE II. PARTIAL DATA OF ORIGINAL CARDIAC DISEASE DATASET
(rows: instances; columns: attributes 1-14)
Instance   1     2   3   4     5     6   7   8     9   10    11   12   13   14
1          70    1   4   130   322   0   2   109   0   2.4   2    3    3    1
2          67    0   3   115   564   0   2   160   0   1.6   2    0    7    0
3          57    1   2   124   261   0   0   141   0   0.3   1    0    7    1
4          64    1   4   128   263   0   0   105   1   0.2   2    1    7    0
5          74    0   2   120   269   0   2   121   1   0.2   1    1    3    0

TABLE III. PARTIAL DATA OF NORMALIZED CARDIAC DISEASE DATASET
(rows: instances; columns: attributes 1-14)
Instance   1      2   3      4      5      6   7   8      9   10     11    12     13   14
1          0.7    1   1      0.17   1      0   1   0.18   0   0.48   0.5   1      0    1
2          0.67   0   0.67   0      1      0   1   1      0   0.32   0.5   0      1    0
3          0.57   1   0.33   0.07   0.61   0   0   0.82   0   0.06   0     0      1    1
4          0.64   1   1      0.13   0.63   0   0   0.1    1   0.04   0.5   0.33   1    0
5          0.74   0   0.33   0      0.69   0   1   0.42   1   0.04   0     0.33   0    0


C. Performance Measurement
A confusion matrix, as shown in Table IV, is typically used to
evaluate the performance of a machine learning algorithm. It
contains information about the actual and predicted
classifications made by a classification system. Here tn and tp
denote the numbers of correct classifications, whereas fp and fn
denote the numbers of erroneous ones. Table V lists the cost
matrix that describes the misclassification costs incurred by a
classifier when the predictions differ from the actual classes.
In practical medical diagnosis, the worst case occurs when the
prediction is negative but the actual class is positive. In the
present study, the cost of this case is set to five times that
of the case in which the prediction is positive but the actual
class is negative. Namely, we analyze results via a cost matrix
that assigns a misclassification cost of 5 units to each false
negative error and 1 unit to each false positive error. In
general, fewer false negative errors are desired in practical
medical diagnosis.
TABLE IV. CONFUSION MATRIX
                  Predicted Negative   Predicted Positive
Actual Negative   tn                   fp
Actual Positive   fn                   tp
TABLE V. COST MATRIX
                  Predicted Negative   Predicted Positive
Actual Negative   0                    1
Actual Positive   5                    0

In addition, the performance of a classifier is commonly
evaluated using the data in the confusion matrix. The most
widely used measures are accuracy, sensitivity, specificity,
F-measure, and cost. A brief summary of the measures and their
formulas is given as follows.
1) Accuracy:
$$accuracy = \frac{tp + tn}{tp + fp + fn + tn}$$ (7)
2) Sensitivity (P):
$$sensitivity = \frac{tp}{tp + fn}$$ (8)
3) Specificity (R):
$$specificity = \frac{tn}{fp + tn}$$ (9)
4) F-Measure:
$$F\text{-}Measure = \frac{2 \times P \times R}{P + R}$$ (10)
5) Cost:
$$cost = fp + fn \times 5$$ (11)
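Eqs. (7)-(11) translate directly into code. The sketch below is illustrative: the function name and the dictionary layout are hypothetical, and the example input is the C5 confusion matrix for the whole dataset reported in this paper (tn = 138, fp = 12, fn = 15, tp = 105).

```python
def classification_metrics(tn, fp, fn, tp, fn_cost=5, fp_cost=1):
    """Compute the measures of Eqs. (7)-(11) from a 2x2 confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    sensitivity = tp / (tp + fn)          # P in Eq. (10)
    specificity = tn / (fp + tn)          # R in Eq. (10)
    f_measure = 2 * sensitivity * specificity / (sensitivity + specificity)
    cost = fp * fp_cost + fn * fn_cost    # a false negative costs 5 units
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": f_measure,
        "cost": cost,
    }


# C5 on the whole dataset: tn=138, fp=12, fn=15, tp=105.
c5 = classification_metrics(tn=138, fp=12, fn=15, tp=105)
for name, value in c5.items():
    print(name, round(value, 4))
```

For this input the function gives an accuracy of 90%, a sensitivity of 87.5%, a specificity of 92%, an F-measure of about 89.69%, and a cost of 87.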
D. Prediction Results Obtained Using C5 Classifier
First, the well-known classification software C5, proposed by
Quinlan [22], is employed for cardiac disease diagnosis in this
study. Nowadays, the commercial version of C5 serves as the
primary decision tree program in the artificial intelligence and
machine learning domains. This classifier is able to handle both
continuous and categorical variables. The C5 classifier creates
binary splits on interval inputs and multiple splits on nominal
inputs for a nominal target. The split that maximizes the gain
ratio is chosen. Figure 5 lists the prediction results of
applying the C5 classifier to a training set of 270 cases, that
is, 100% of the records of the Heart-statlog dataset. The
classification accuracy, computed as the number of correct
predictions (138+105) divided by the total number of instances
(138+105+15+12), was 90%.









Figure 5. Classification result by using C5.
Besides, we randomly selected 80% and 20% of the records from
the Heart-statlog dataset as the training set and test set,
respectively, to evaluate the prediction accuracy on the test
set. Thus, the training set and test set contained 216 and 54
records, respectively. Figures 6(a) and (b) list the prediction
results obtained with the C5 classification software. The
classification accuracy on the training set, shown in Fig. 6(a),
was 87%, which is lower than that obtained with the whole
dataset, shown in Fig. 5. Using the decision trees created from
the training set, the prediction accuracy on the test set was
76.9%, as shown in Fig. 6(b). Clearly, the prediction accuracy
on the test set decreases because the training set contains
fewer samples.
(a)





(b)





Figure 6. Classification result for (a) training set and (b) test set by using C5.
Cross-validation is a technique for assessing how the results
of a statistical analysis will generalize to an independent data
set. It is mainly used in settings where the goal is prediction,
to estimate how accurately a predictive model will perform in
practice. To get a more reliable estimate of predictive
accuracy, we use 5-fold and 10-fold cross-validation. As shown
in Figures 7(a) and (b), the prediction accuracies using 5-fold
and 10-fold cross-validation were 74.8% and 73.0%, respectively.
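The k-fold procedure can be sketched as follows. This is a generic illustration, not the procedure built into the C5 software: the function names, the fixed random seed, and the `evaluate` callback (which would train a classifier on the training indices and return its accuracy on the test indices) are hypothetical.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split into k folds whose sizes differ by at most one."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(n, k, evaluate, seed=0):
    """Average evaluate(train_idx, test_idx) over the k folds."""
    folds = k_fold_indices(n, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Every fold except the i-th one forms the training set.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / k


# With 270 instances, 5-fold cross-validation tests on 54 instances per fold.
folds = k_fold_indices(270, 5)
print([len(fold) for fold in folds])
```

Each instance is used for testing exactly once, so the averaged score depends less on one particular random split than a single 80%/20% partition does.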
(a)

(b)


Figure 7. Classification results for cross-validation with (a) 5-fold and (b)
10-fold by using C5.
E. Evaluation of Presented Two EvoDM Algorithms
1) Case 1: Whole Heart-statlog Dataset as Training Set
Table VI lists the accuracy of five different algorithms for
Case 1, in which the training set uses 100% of the records of
the Heart-statlog dataset. The initial cluster centers are
obtained by averaging each attribute over the training set.
Because of random sampling, each experiment was repeated for 200
runs. From the prediction results shown in Table VI, the
presented MPSO-KM, with a prediction accuracy of 90.37%,
outperforms the other four algorithms. The computational results
also indicate that the accuracy of the two presented EvoDM
algorithms was better than that of the K-means algorithm. The
prediction accuracy of GA-KM, 90%, was very close to that of
MPSO-KM. The comparisons of sensitivity, specificity, and cost
are also listed in Table VI. A theoretical, optimal prediction
attains 100% sensitivity, 100% specificity, and zero cost.
Therefore, the higher the values of P and R (max. of 1), the
better the classifier. From Table VI, C5 gave the highest
sensitivity and the lowest cost, indicating that it has the
maximum reliability for predicting patients from the sick group
as sick. Meanwhile, GA-KM and MPSO-KM provided the highest
specificity, indicating that they have the maximum reliability
for predicting people from the healthy group as healthy. Since
it is difficult to maximize both sensitivity (P) and specificity
(R), the F-measure is an alternative evaluation measure that
represents the harmonic mean of sensitivity and specificity. The
values of the F-measure, computed according to Eq. (10), are
shown in Table VI. The values obtained using C5, GA-KM, and
MPSO-KM were all close to 90%, indicating that the three
algorithms were comparable in performance for Case 1. Moreover,
Table VII lists the optimal weights of the 13 attributes
obtained using the GA-KM and MPSO-KM algorithms. The attributes
sex, cp, oldpeak, and ca were relatively more significant than
the other attributes for determining the clusters with the
presented GA-KM and MPSO-KM algorithms. In addition, the
attribute trestbps was also a significant attribute when using
GA-KM.
TABLE VI. COMPARISON OF ALGORITHMIC PERFORMANCE FOR CASE 1
(GA-KM and MPSO-KM are the evolutionary data mining algorithms)
Performance   Naive Bayes   C5       K-means (clustering)   GA-KM    MPSO-KM
Accuracy      86.29%        90.00%   81.11%                 90.00%   90.37%
Sensitivity   83.33%        87.50%   81.67%                 85.00%   85.83%
Specificity   88.67%        92.00%   80.67%                 94.00%   94.00%
F-Measure     85.92%        89.69%   81.17%                 89.27%   89.73%
Cost          117           87       139                    99       94
TABLE VII. OPTIMAL WEIGHTS OF CASE 1 COMPUTED BY PRESENTED EVODM ALGORITHMS
Weights for 13 attributes   GA-KM      MPSO-KM
w1 (age)                    0.01018    0.08961
w2 (sex)                    0.55083*   0.49871*
w3 (cp)                     0.99702*   0.95002*
w4 (trestbps)               0.66539*   0.27084
w5 (chol)                   0.16214    0.10845
w6 (fbs)                    0.00179    0.04599
w7 (restecg)                0.43857    0.45355
w8 (thalach)                0.28774    0.26321
w9 (exang)                  0.06759    0.02332
w10 (oldpeak)               0.64807*   0.74761*
w11 (slope)                 0.45851    0.44980
w12 (ca)                    0.94535*   0.89623*
w13 (thal)                  0.25170    0.26279
Remark: the star symbol (*) stands for significant patterns

Table VIII lists the confusion matrices obtained using the
five algorithms. GA-KM and MPSO-KM gave low false positive
numbers, and C5 and MPSO-KM produced low false negative numbers.
False positives and false negatives are the two kinds of
erroneous classification.
TABLE VIII. CONFUSION MATRICES OBTAINED USING FIVE ALGORITHMS FOR CASE 1
Actual class      Algorithm     Predicted Negative   Predicted Positive
Actual Negative   Naive Bayes   133                  17
                  C5            138                  12
                  K-means       121                  29
                  GA-KM         141                  9
                  MPSO-KM       141                  9
Actual Positive   Naive Bayes   20                   100
                  C5            15                   105
                  K-means       22                   98
                  GA-KM         18                   102
                  MPSO-KM       17                   103

2) Case 2: Training Set: 80% and Test Set: 20%
Case 2 was studied by dividing the disease dataset into two
parts, a training set and a test set. The training set comprised
80% of the whole dataset and the test set 20%. Accordingly, the
training set and test set contain 216 and 54 records,
respectively. The initial cluster centers are again obtained by
averaging each attribute over the training set. Table IX lists
the prediction results of the five different algorithms for Case
2. The computational results show that the accuracy of the two
presented EvoDM algorithms was better than that of the Naive
Bayes, C5, and K-means algorithms. MPSO-KM gave the highest
accuracy, sensitivity, and F-measure, and also the lowest cost.
The comparison of prediction accuracy on the test set is listed
in Table X. MPSO-KM outperforms the other classifiers in
prediction accuracy on the test set. Moreover, Table XI lists
the optimal weights of the 13 attributes obtained using the
GA-KM and MPSO-KM algorithms. The attributes cp, chol, fbs,
oldpeak, and ca were relatively more significant than the other
attributes for determining the clusters. That is, chest pain
type, serum cholesterol, fasting blood sugar, ST depression
induced by exercise relative to rest, and number of major
vessels are important factors for diagnosing cardiac disease. In
addition, the attribute for resting blood pressure, trestbps,
was also a significant attribute when using MPSO-KM.
TABLE IX. COMPARISON OF PREDICTION RESULTS OF CASE 2
(GA-KM and MPSO-KM are the evolutionary data mining algorithms)
Performance   Naive Bayes   C5       K-means (clustering)   GA-KM    MPSO-KM
Accuracy      82.87%        87.00%   80.56%                 89.82%   89.82%
Sensitivity   84.04%        81.00%   80.22%                 81.32%   84.62%
Specificity   81.97%        92.24%   80.80%                 96.00%   93.60%
F-Measure     82.99%        86.26%   80.51%                 88.05%   88.88%
Cost          97            104      114                    90       78
TABLE X. COMPARISON OF PREDICTION ACCURACY FOR TEST SET
(GA-KM and MPSO-KM are the evolutionary data mining algorithms)
Data set   Naive Bayes   C5        K-means (clustering)   GA-KM     MPSO-KM
Test set   85.185%       75.900%   79.630%                87.037%   88.889%
TABLE XI. OPTIMAL WEIGHTS OF CASE 2 COMPUTED BY PRESENTED EVODM ALGORITHMS
Weights for 13 attributes   GA-KM      MPSO-KM
w1 (age)                    0.32545    0.43209
w2 (sex)                    0.31850    0.46946
w3 (cp)                     0.60009*   0.99159*
w4 (trestbps)               0.30384    0.99991*
w5 (chol)                   0.80559*   0.95911*
w6 (fbs)                    0.83290*   0.99997*
w7 (restecg)                0.22876    0.32683
w8 (thalach)                0.32553    0.45493
w9 (exang)                  0.06035    0.00710
w10 (oldpeak)               0.57284*   0.61428*
w11 (slope)                 0.43577    0.42057
w12 (ca)                    0.91037*   0.99968*
w13 (thal)                  0.24591    0.29386
Remark: the star symbol (*) stands for significant patterns
Table XII lists the confusion matrices obtained using the five
algorithms. Clearly, GA-KM and MPSO-KM both gave low false
positive and false negative numbers. From the confusion
matrices, we obtained the values of accuracy, sensitivity,
specificity, F-measure, and cost based on Eqs. (7)-(11) for the
five classifiers, as listed in Table IX. The presented MPSO-KM
gave the minimum false negative number when compared with the
other classifiers, while GA-KM gave the minimum false positive
number.
TABLE XII. CONFUSION MATRICES OBTAINED USING FIVE ALGORITHMS FOR CASE 2
Actual class      Algorithm     Predicted Negative   Predicted Positive
Actual Negative   Naive Bayes   100                  22
                  C5            107                  9
                  K-means       101                  24
                  GA-KM         120                  5
                  MPSO-KM       117                  8
Actual Positive   Naive Bayes   15                   79
                  C5            19                   81
                  K-means       18                   73
                  GA-KM         17                   74
                  MPSO-KM       14                   77

V. CONCLUSION
This study applied two EvoDM algorithms to cardiac disease
prediction. The two EvoDM algorithms, termed GA-KM and MPSO-KM,
are hybrids that combine the K-means algorithm with GA and MPSO,
respectively. Two training sets of the cardiac disease dataset
were studied to check the robustness of the algorithms. From our
computational results for Case 1, in which the training set was
the whole disease dataset, the prediction accuracy was 90.00%
using the GA-KM algorithm and 90.37% using the MPSO-KM
algorithm, whereas the accuracy obtained using the K-means
algorithm was only 81.111%. From the weight distribution of Case
1, the attributes sex, cp, oldpeak, and ca were relatively
important in diagnosing cardiac disease. Furthermore, this work
changed the training set in Case 2 by selecting 80% of the
disease dataset as the training set and 20% as the test set. For
Case 2, the prediction accuracy on the test set was 87.037%
using the GA-KM algorithm and 88.889% using the MPSO-KM
algorithm, while the accuracy obtained using the K-means
algorithm was 79.63%. From the weight distribution of Case 2,
the attributes cp, chol, fbs, oldpeak, and ca were relatively
important in diagnosing cardiac disease. The classification
performance of MPSO-KM was better than that of competing
classifiers such as Naive Bayes, C5, and K-means. From the
results of Case 1 and Case 2, the two presented EvoDM algorithms
significantly enhanced the prediction accuracy over K-means and
decreased the total cost. We conclude that the two proposed
EvoDM algorithms can effectively enhance the clustering accuracy
rate and help doctors make an effective diagnosis of cardiac
disease.
REFERENCES
[1] A. M. Minio, M. P. Heron, S. L. Murphy, K. D. Kochanek, Deaths:
final data for 2004, National Vital Statistics Reports (United States:
Center for Disease Control, vol. 55, no. 19, August 2007.
[2] M. L. Hess, Heart Disease in Primary Care, Williams & Wilkins, 1999.
[3] H. Jiawei and K. Micheline,. Data Mining: Concepts and Techniques,
Morgan Kaufmann Publishers, 2001.
[4] D. Olson and Y. Shi, Introduction to Business Data Mining, McGraw-
Hill Education, 2008.
[5] S. Palaniappan, and R. Awang, Intelligent cardiac disease prediction
system using data mining techniques, International Journal of
Computer Science and Network Security, vol. 8, no. 8, pp. 343-350,
August 2008.
[6] B.P. Shantakumar and Y.S. Kumaraswamy, Intelligent and effective
heart attack prediction system using data mining and artificial neural
network, European Journal of Sci. Research, vol. 31 no. 4, pp. 642-656,
2009.
[7] L. Parthiban, and R. Subramanian, Intelligent cardiac disease prediction
system using CANFIS and genetic algorithm, International Journal of
and Life Sciences, vol. 3, no. 3, pp. 157-160, 2008.
[8] K. Rajeswari, Dr. V. Vaithiyanathan, and Dr. P. Amirtharaj, Prediction
of risk score for cardiac disease in India using machine intelligence, in
2011 International Conference on Information and Network Technology,
IPCSIT, vol. 4, 2011, IACSIT Press, Singapore.
[9] K. Srinivas, B.K. Rani, and Dr. A. Govrdhan, Applications of data
mining techniques in healthcare and prediction of heart attacks,
International Journal on Computer Science and Engineering, vol. 2, no.
2, pp. 250-255, 2010.
[10] M. Kumari, and S. Godara, Comparative study of data mining
classification methods in cardiovascular disease prediction,
International Journal of Computer Science and Technology, vol. 2, issue
2, pp. 304-308, June 2011.
[11] R. Das, I. Turkoglu, and A. Sengur, Effective diagnosis of cardiac
disease through neural networks ensembles, Expert Systems with
Applications, vol. 36, issue 4, pp. 7675-7680, May 2009.
[12] M. Anbarasi, E. Anupriya, and N.CH.S.N. Iyengar, Enhanced
prediction of cardiac disease with feature subset selection using genetic
algorithm, International Journal of Engineering Science and
Technology, vol. 2, no. 10, pp. 5370-5376, 2010.
[13] J. Soni, U. Ansari, D. Sharma, and S. Soni, Predictive data mining for
medical diagnosis: an overview of cardiac disease prediction,
International Journal of Computer Applications, vol. 17, no. 8, pp. 43-
48, March 2011.
[14] D. Goldberg, Genetic Algorithms in Search, Optimization, and Machine
Learning, Addison Wesley, 1989.
[15] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution
Programs. 3rd ed., Springer-Verlag, 1999.
[16] J.L. Liu, Intelligent genetic algorithm and its application to
aerodynamic optimization of airplanes, AIAA Journal, vol. 43, no.3, pp.
530-538, Mar. 2005.
[17] J.L. Liu and J.H. Lin, Evolutionary computation of unconstrained and
constrained problems using a novel momentum-type particle swarm
optimization, Engineering Optimization, vol. 39, issue 3, pp. 287-305,
April 2007.
[18] Y. SHI and R. Eberhart, A modified particle swarm optimization, in
Proc. of IEEE International Conference on Evolutionary Computation,
May 1998, pp. 69-72.
[19] J. Kennedy and R. Eberhart, Particle swarm optimization, in IEEE
International Conference on Neural Networks, IEEE Service Center,
Piscataway, NJ, vol. 4, 1995, pp. 1942-1948.
[20] A. Brabazon and P. Keenan, A hybrid genetic model for the prediction
of corporate failure, Computational Management Science, vol. 1, no 3-
4, pp. 293-310, 2004.
[21] P.C. Lin and J.S. Chen, A genetic-based hybrid approach to corporate
failure prediction, International Journal of Electronic Finance, vol. 2,
no. 2, pp. 241-255, 2008.
[22] J.R. Quinlan, Improved use of continuous attributes in C4.5, Journal
of Artificial Intelligence Approach, vol. 4, pp. 77-90, 1996.
