
Rule Acquisition in Data Mining Using a Self Adaptive Genetic Algorithm

K. Indira #1, S. Kanmani $2, Gaurav Sethia D. $3, Kumaran S. $4, Prabhakar J. $5

# Department of Computer Science
$ Department of Information Technology
Pondicherry Engineering College, Puducherry, India

1 induharini@gmail.com, 2 kanmani.n@gmail.com, 3 gaurav.sethia7@gmail.com,
4 kumarane.90@gmail.com, 5 prabhakar_pec64@yahoo.co.in

Abstract- Rule acquisition is a data mining technique used to deduce inferences from large databases, inferences that cannot easily be noticed without data mining. Genetic algorithms (GAs) are regarded as a global search approach for optimization problems; with a proper evaluation strategy, the best chromosome can be found among the numerous genetic combinations. The main idea of the self-adaptive genetic algorithm is to let the control parameters (crossover rate and mutation rate) adjust adaptively within a proper range, thereby reaching a better solution. Experiments show that the self-adaptive genetic algorithm achieves better convergence and higher precision than the traditional genetic algorithm.
Keywords- Association rule mining, Genetic algorithm, Crossover, Mutation, Fitness, Support, Confidence.
I. INTRODUCTION

Data mining refers to the process of searching through a large volume of data, stored in a database, to discover useful and interesting information that was previously unknown. Association rule mining is one type of data mining: it is the method of finding relations between entities in databases. Association rule mining is used mainly in market analysis, transaction data analysis and the medical field. For example, in a medical database a diagnosis can be suggested given the symptoms, and in a supermarket database the relation between purchases of different commodities can be obtained. Such inferences are drawn using association rule mining and can be used for decision making.
Several well-known techniques exist for association rule mining, among them Apriori, constraint-based mining, the Frequent Pattern Growth approach and the genetic algorithm. There have been several attempts to mine association rules using genetic algorithms.
The main reason for choosing a genetic algorithm for data mining is that a GA performs a global search and copes better with attribute interaction than the traditional greedy, induction-based methods.
The genetic algorithm evolved from Charles Darwin's survival-of-the-fittest theory. It is based on individuals' fitness and the genetic similarity between individuals. Breeding occurs in every generation and eventually leads to better, more nearly optimal groups in later generations. [1] analyses the mining of association rules by applying genetic algorithms.
[2] introduces the CBGA approach, which hybridizes constraint-based reasoning within a genetic algorithm for rule induction. The CBGA approach uses the Apriori algorithm to improve its efficiency.
[3] and [4] discuss variations of the traditional genetic algorithm in the field of data mining: [3] is based on an evolutionary strategy and [4] adopts a self-adaptive approach. The self-adaptive modification of a GA has not previously been applied to association rule mining, but since it is promising for improving efficiency, it has been taken up here.
The main modules in the data mining process are:
i. Data cleaning: also known as data cleansing, the phase in which noisy and irrelevant data are removed from the collection.
ii. Data selection: the step in which the data relevant to the analysis is decided on and retrieved from the large data collection.
iii. Data mining: the crucial step in which clever techniques are applied to extract potentially useful patterns.
A brief introduction to association rule mining and GAs is given in Section 2, followed by the proposed system in Section 3. In Section 4 the parameters used in association rule mining with SAGA are defined. Section 5 presents the experimental results, followed by the conclusion in the last section.

2. Association Rules and Genetic Algorithm

2.1 Association Rules

An important type of knowledge acquired by many data mining systems takes the form of if-then rules [5]. Such rules state that the presence of one or more items implies or predicts the presence of other items. A typical rule has the form

If A, B, ..., Cn then Y

Two parameters are associated with if-then rules, as described below.
The confidence [6] of a given rule is a measure of how often the consequent is true given that the antecedent is true. If the consequent is false while the antecedent is true, the rule is false for that data item. If the antecedent is not matched by a given data item, that item does not contribute to the determination of the confidence of the rule.
The support indicates how often the rule holds in a set of data. It is a relative measure, determined by dividing the number of data items that the rule covers, i.e. that support the rule, by the total number of data items in the set.
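For illustration, the following Java sketch computes the support and confidence of a single rule over a list of transactions; the set-of-items representation and the example items are assumptions made only for this example.

import java.util.*;

public class RuleMeasures {
    // Support = fraction of transactions containing both antecedent and consequent.
    static double support(List<Set<String>> transactions, Set<String> antecedent, Set<String> consequent) {
        Set<String> both = new HashSet<>(antecedent);
        both.addAll(consequent);
        long count = transactions.stream().filter(t -> t.containsAll(both)).count();
        return (double) count / transactions.size();
    }

    // Confidence = support(antecedent and consequent) / support(antecedent).
    static double confidence(List<Set<String>> transactions, Set<String> antecedent, Set<String> consequent) {
        long antCount = transactions.stream().filter(t -> t.containsAll(antecedent)).count();
        if (antCount == 0) return 0.0;
        Set<String> both = new HashSet<>(antecedent);
        both.addAll(consequent);
        long bothCount = transactions.stream().filter(t -> t.containsAll(both)).count();
        return (double) bothCount / antCount;
    }

    public static void main(String[] args) {
        List<Set<String>> transactions = Arrays.asList(
            new HashSet<>(Arrays.asList("milk", "bread", "butter")),
            new HashSet<>(Arrays.asList("milk", "bread")),
            new HashSet<>(Arrays.asList("bread", "butter")));
        Set<String> antecedent = Collections.singleton("milk");
        Set<String> consequent = Collections.singleton("bread");
        System.out.printf("support=%.2f confidence=%.2f%n",
            support(transactions, antecedent, consequent),
            confidence(transactions, antecedent, consequent));
    }
}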
2.2 Genetic Algorithm

The genetic algorithm [7] simulates biological evolution and natural genetics to form an adaptive, probabilistic search algorithm for global optimization. It is suited to problems characterized by large search spaces, multiple peaks, non-linearity and high complexity. The parameters of the problem to be solved are encoded in binary or decimal (or another base) as genes; a number of genes form a chromosome (individual); and a population of chromosomes undergoes operations resembling natural selection, crossover and mutation until, after repeated iterations (that is, inheritance from generation to generation), the final optimized result is obtained. Using a genetic algorithm to solve a problem involves the following key factors: encoding, the fitness function, the selection operator, the crossover operator, the mutation operator and the control parameters.
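As a concrete illustration of such an encoding (the experiments in Section 5 use fixed-length binary chromosomes), the following Java sketch decodes a chromosome into a rule. The two-bits-per-attribute layout (00 = attribute unused, 01 = antecedent, 10 = consequent) is only an assumed scheme, not the paper's actual layout.

public class RuleEncoding {
    // Decode a chromosome of 2 bits per attribute:
    // 00 -> attribute not used, 01 -> antecedent, 10 -> consequent.
    static void decode(int[] chromosome, String[] attributes) {
        StringBuilder antecedent = new StringBuilder();
        StringBuilder consequent = new StringBuilder();
        for (int i = 0; i < attributes.length; i++) {
            int code = chromosome[2 * i] * 2 + chromosome[2 * i + 1];
            if (code == 1) antecedent.append(attributes[i]).append(' ');
            else if (code == 2) consequent.append(attributes[i]).append(' ');
        }
        System.out.println("IF " + antecedent.toString().trim()
                + " THEN " + consequent.toString().trim());
    }

    public static void main(String[] args) {
        String[] attributes = {"A", "B", "C"};
        int[] chromosome = {0, 1, 0, 1, 1, 0};   // A, B in the antecedent; C in the consequent
        decode(chromosome, attributes);          // prints: IF A B THEN C
    }
}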
In the traditional genetic algorithm the crossover rate and mutation rate are fixed values selected from experience. Generally, when the crossover rate is too low the evolutionary process can easily fall into a local optimum, resulting in premature convergence of the population owing to its limited size and lack of diversity. When the crossover rate is too high, the process is driven to the vicinity of the optimal point but individuals have difficulty reaching the optimal point itself, which slows convergence significantly even though diversity is maintained.

Figure 1. Flow chart of the traditional GA


3. Proposed System

To overcome the drawbacks of the traditional genetic algorithm, SAGA is proposed. SAGA changes the crossover and mutation rates adaptively [8]. The main purpose of the mutation operator is to maintain the diversity of the population and avoid stagnation of evolution. In the traditional genetic algorithm the mutation rate is fixed, so after several iterations the quality of the population gradually converges and inbreeding sets in. The self-adaptive genetic algorithm offers higher robustness, global optimality and efficiency.

Procedure SAGA
Begin
  Initialize population p(k);
  Define the initial crossover and mutation rates;
  Do
  {
    Do
    {
      Calculate the support of all k rules;
      Calculate the confidence of all k rules;
      Obtain the fitness;
      Select individuals for crossover / mutation;
      Calculate the average fitness of the nth and (n-1)th generations;
      Calculate the maximum fitness of the nth and (n-1)th generations;
      Based on the fitness of the selected individuals, calculate the new crossover and mutation rates;
      Choose the operation to be performed;
    } k times;
  } until the termination condition is reached;
End

4. Parameters in Genetic Algorithm


4.1 Selection of Individuals
Chromosomes are selected from the population for breeding. According to Darwin's theory of evolution, the best ones should survive and create new offspring. In roulette wheel selection, parents are selected according to their fitness: the better the chromosomes, the greater their chance of being selected.

Figure 2. Roulette Wheel Selection
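A minimal Java sketch of roulette wheel selection, assuming non-negative fitness values, is shown below.

import java.util.Random;

public class RouletteWheel {
    // Returns the index of the selected individual; the selection probability
    // of each individual is proportional to its (non-negative) fitness.
    static int select(double[] fitness, Random rng) {
        double total = 0.0;
        for (double f : fitness) total += f;
        double spin = rng.nextDouble() * total;
        double cumulative = 0.0;
        for (int i = 0; i < fitness.length; i++) {
            cumulative += fitness[i];
            if (spin <= cumulative) return i;
        }
        return fitness.length - 1;   // guard against floating-point rounding
    }

    public static void main(String[] args) {
        double[] fitness = {0.2, 0.5, 0.3};
        System.out.println("selected index: " + select(fitness, new Random()));
    }
}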

4.2 Fitness Function


Given a particular chromosome, the fitness function returns a single numerical "fitness," or "figure of merit". This
value is proportional to the "utility" or "ability" of the individual which that chromosome represents.
This paper adopts minimum support and minimum confidence for filtering rules. Correlation is then confirmed among the rules that satisfy the minimum support and the minimum confidence. With support and confidence taken into account together, the fitness function is defined as the weighted sum

Fitness(x) = Rs × Supp(x) + Rc × Conf(x)

where Rs + Rc = 1 (Rs ≥ 0, Rc ≥ 0), and Suppmin and Confmin denote the minimum support and minimum confidence respectively. Evidently, if Suppmin and Confmin are set to higher values, the value of the fitness function for the surviving rules is also found to be higher.
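Assuming the weighted-sum form given above, a minimal Java sketch of the fitness evaluation follows; treating rules that fail the thresholds as having zero fitness is an added assumption, and the Rs and Rc values in the usage example are illustrative only.

public class Fitness {
    // Weighted-sum fitness for a rule, applied after the minimum support
    // and minimum confidence filter (Rs + Rc = 1).
    static double fitness(double support, double confidence,
                          double rs, double rc,
                          double suppMin, double confMin) {
        if (support < suppMin || confidence < confMin) return 0.0; // filtered out (assumption)
        return rs * support + rc * confidence;
    }

    public static void main(String[] args) {
        // Example: Rs = 0.4, Rc = 0.6, thresholds as in Table 1 (0.2 and 0.8).
        System.out.println(fitness(0.35, 0.9, 0.4, 0.6, 0.2, 0.8));
    }
}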
4.3 Crossover Operator
Crossover selects genes from the parent chromosomes and creates new offspring. The most common form is single-point crossover, in which a crossover point is selected on both parents; child 1 is the head of parent 1 joined with the tail of parent 2, and child 2 is the head of parent 2 joined with the tail of parent 1.
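A Java sketch of single-point crossover on fixed-length binary chromosomes follows.

import java.util.Random;

public class SinglePointCrossover {
    // Produces two children: head of parent1 + tail of parent2, and vice versa.
    // Both parents are assumed to have the same (fixed) length.
    static int[][] crossover(int[] parent1, int[] parent2, Random rng) {
        int point = 1 + rng.nextInt(parent1.length - 1); // cut point in 1..length-1
        int[] child1 = new int[parent1.length];
        int[] child2 = new int[parent2.length];
        for (int i = 0; i < parent1.length; i++) {
            child1[i] = (i < point) ? parent1[i] : parent2[i];
            child2[i] = (i < point) ? parent2[i] : parent1[i];
        }
        return new int[][] {child1, child2};
    }
}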

4.4 Mutation Operator

Mutation randomly changes the new offspring. For binary encoding, a few randomly chosen bits are switched from 1 to 0 or from 0 to 1. Mutation provides a small amount of random search and helps ensure that no point in the search space has a zero probability of being examined.
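A corresponding Java sketch of bit-flip mutation, where pm denotes the mutation rate, is given below.

import java.util.Random;

public class BitFlipMutation {
    // Flips each bit independently with probability pm (the mutation rate).
    static void mutate(int[] chromosome, double pm, Random rng) {
        for (int i = 0; i < chromosome.length; i++) {
            if (rng.nextDouble() < pm) {
                chromosome[i] = 1 - chromosome[i];
            }
        }
    }
}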

4.5 Number of Generations


The generational process of mining association rules with the genetic algorithm is repeated until a termination condition is reached. Common terminating conditions, which may also be combined as in the sketch after this list, are:

- A solution is found that satisfies minimum criteria.
- A fixed number of generations is reached.
- The highest-ranking solution's fitness is reaching, or has reached, a plateau such that successive iterations no longer produce better results.
- Manual inspection.
- Combinations of the above.
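The following Java sketch shows how such a combined check might look; the fitness target, generation budget and plateau window are illustrative parameters, not values taken from the paper.

public class Termination {
    // Stop when the fitness target is met, the generation budget is spent,
    // or the best fitness has not improved for plateauWindow generations.
    static boolean shouldStop(int generation, int maxGenerations,
                              double bestFitness, double targetFitness,
                              int generationsWithoutImprovement, int plateauWindow) {
        return bestFitness >= targetFitness
            || generation >= maxGenerations
            || generationsWithoutImprovement >= plateauWindow;
    }
}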

4.6 Self Adaptive


With a fixed mutation probability Pm, a small value of Pm gives the mutation operator little effect, while a large value can destroy the population's good genes and may even slow the convergence of the algorithm. Here, an adaptive mutation rate is therefore used, updated as follows:

The mutation rate of the (n+1)th generation, pm(n+1), is obtained from the mutation rate of the nth generation, pm(n), through an adjustment factor that depends on the fitness values of the m individuals of the nth generation and on the maximum fitness of the (n+1)th generation.

Here pm(n) is the nth generation mutation rate and pm(n+1) is the (n+1)th generation mutation rate; the first generation mutation rate is pm(0). fi(n) is the fitness of individual i in the nth generation, fmax(n+1) is the highest fitness of the (n+1)th generation, m is the number of individuals in the population, and the update is scaled by the adjustment factor.
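Since the exact update formula of [8] is not reproduced above, the Java sketch below shows only one plausible realization of such an adaptive update, in which the mutation rate shrinks as the population's average fitness approaches the best fitness of the new generation; beta stands for the adjustment factor, and the 0.001 lower bound is an added assumption to preserve diversity.

public class AdaptiveMutation {
    // One possible adaptive update (an assumption, not the exact formula of [8]):
    // the mutation rate is reduced in proportion to how close the population's
    // average fitness has come to the best fitness of the new generation.
    static double updateMutationRate(double pm, double beta,
                                     double[] generationFitness, double fMaxNext) {
        double sum = 0.0;
        for (double f : generationFitness) sum += f;
        double avg = sum / generationFitness.length;      // (1/m) * sum of fi(n)
        double convergence = (fMaxNext > 0) ? avg / fMaxNext : 0.0;
        double next = pm * (1.0 - beta * convergence);    // shrink as the population converges
        return Math.max(next, 0.001);                     // keep a small floor (assumption)
    }
}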

5. Experimental Studies

The objective of this study is to compare the accuracy achieved on different datasets using a traditional GA and SAGA. The chromosomes use fixed-length binary encoding, and the fitness function adopted is as given in Section 4.2.
Three datasets, namely Lenses, Haberman and Car Evaluation, from the UCI Machine Learning Repository have been taken up for experimentation. The Lenses dataset has 4 attributes and 24 instances, the Haberman dataset has 4 attributes and 306 instances, and the Car Evaluation dataset has 6 attributes and 1728 instances. The algorithm is implemented in Java.
The accuracy and the convergence rate are recorded in the tables below. Accuracy is the count of matches between the rules in the resulting population and the instances of the original dataset, divided by the number of instances in the dataset. The convergence rate is the generation at which the fitness value becomes fixed.
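As a small illustration of this accuracy measure, assuming both the dataset instances and the mined rules are held as strings in memory (a simplification, not the paper's actual data structures):

import java.util.List;
import java.util.Set;

public class Accuracy {
    // Accuracy (%) = matched instances / total instances * 100.
    static double accuracy(List<String> instances, Set<String> minedRules) {
        long matches = instances.stream().filter(minedRules::contains).count();
        return 100.0 * matches / instances.size();
    }
}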

TABLE 1
DEFAULT GA PARAMETERS

Parameter                 Value
Population Size           Varies as per the dataset
Initial Crossover rate    0.9
Initial Mutation rate     0.1
Selection Method          Roulette wheel selection
Minimum Support           0.2
Minimum Confidence        0.8

The parameters for the fitness function and the other parameters are chosen for the best performance of a traditional GA: they are set such that convergence is fastest and the number of matches is maximum. SAGA is then run with the same parameters, and the results obtained are tabulated below. Accuracy is given as a percentage with respect to the number of matches.
TABLE 2
ACCURACY COMPARISON BETWEEN GA AND SAGA WHEN PARAMETERS ARE IDEAL FOR TRADITIONAL GA

                 Traditional GA                      Self Adaptive GA
Dataset          Accuracy (%)  No. of Generations    Accuracy (%)  No. of Generations
Lenses           75            38                    87.5          35
Haberman         52            36                    68            28
Car Evaluation   85            29                    96            21

From the above table we can conclude that the self-adaptive GA performs better than the traditional GA in both respects, i.e. convergence rate and accuracy. The accuracy in the case of the Haberman dataset is low because one of the attributes in the data is age; since age can take a wide range of values and only perfect matches are counted, the accuracy comes down.
When the algorithm terminates, the mutation rate and crossover rate have changed because of self-adaptivity. If these final values are used as the initial values for a traditional GA, its performance is as shown below.

TABLE 3
ACCURACY COMPARISON BETWEEN GA AND SAGA WHEN PARAMETERS ARE SET ACCORDING TO TERMINATION OF SAGA

                 Traditional GA                      Self Adaptive GA
Dataset          Accuracy (%)  No. of Generations    Accuracy (%)  No. of Generations
Lenses           50            35                    87.5          35
Haberman         36            38                    68            28
Car Evaluation   74            36                    96            21

The table shows that the accuracy of the traditional GA drops when its parameters are set according to the termination-time mutation rate of SAGA. This is because, when SAGA ends, the mutation rate may have taken a high value, which brings the accuracy down when applied throughout a traditional GA run. The fitness threshold plays a major role in deciding the efficiency of the mined rules and the convergence of the system.

6. Conclusion

Genetic algorithms have been used to solve difficult optimization problems in a number of fields and have proved to produce optimal results in mining association rules. When a genetic algorithm is used for mining association rules, the GA parameters decide the efficiency of the system. Once optimal values are fixed for the individual parameters, making the algorithm self-adaptive increases the efficiency further, because it changes the mutation and crossover rates adaptively and thus makes the algorithm more intelligent. As the rates are varied with respect to the results of the previous generation, the accuracy increases. The efficiency of the methodology could be further explored on more datasets with varying attribute sizes.

References
[1]. Martine Collard, Dominique Francisi, "Evolutionary Data Mining: An Overview of Genetic-Based Algorithms", 8th IEEE International Conference on Emerging Technologies and Factory Automation, Vol. 1, pp. 3-9, 2001.
[2]. Chaochang Chiu, Pei-Lun Hsu, "A Constraint-Based Genetic Algorithm Approach for Mining Classification Rules", IEEE Transactions on Systems, Man and Cybernetics, Vol. 35, pp. 305-320, 2005.
[3]. Manish Saggar, Ashish Kumar Agarwal, Abhimanyu Lad, "Optimization of Association Rule Mining using Improved Genetic Algorithms", IEEE International Conference on Systems, Man and Cybernetics, Vol. 4, pp. 3725-3729, 2004.
[4]. Xiaoyuan Zhu, Yongquan Yu, Xueyan Guo, "Genetic Algorithm Based on Evolution Strategy and the Application in Data Mining", First International Workshop on Education Technology and Computer Science (ETCS '09), Vol. 1, pp. 848-852, 2009.
[5]. Robert Cattral, Franz Oppacher, Dwight Deugo, "Rule Acquisition with a Genetic Algorithm", Congress on Evolutionary Computation (CEC '99), Vol. 1, 1999.
[6]. Shangping Dai, Li Gao, Qiang Zhu, Changwu Zhu, "A Novel Genetic Algorithm Based on Image Databases for Mining Association Rules", IEEE International Conference on Computer and Information Science, pp. 977-980, 2007.
[7]. Yi-Ta Wu, Yoo Juang An, James Geller, Yih-Tyng Wu, "A Data Mining Based Genetic Algorithm", IEEE Workshop on Software Technologies for Future Embedded and Ubiquitous Systems, 2006.
[8]. Jing Li, Han Rui Feng, "A Self-Adaptive Genetic Algorithm Based on Real Code", Capital Normal University (CNU), pp. 1-4, 2010.
