
Rule Acquisition in Data Mining Using Genetic Algorithm


K. Indira and Dr. S. Kanmani
Abstract— Association rule mining is a widely used data mining technique. It derives inferences from potentially large databases that are helpful for decision making and that cannot be noticed without data mining. This paper studies the mining of association rules with a genetic algorithm and shows that, by altering the representation and the operators, the genetic algorithm can be applied to different datasets without compromising efficiency.

Keywords— Association rule mining, genetic algorithm, crossover, mutation.

1 INTRODUCTION
Data mining is concerned with the analysis of data and the use of software techniques for drawing conclusions from large sets of data; this includes finding patterns and regularities in the data. Association rule mining is a type of data mining: it is the method of finding relations between entities in databases. Association rule mining is mainly used in market analysis, transaction data analysis and the medical field. For example, all of the transactions occurring in a supermarket are stored in a large database, and if a customer buys bread there is a chance that he also buys butter. Such inferences are drawn using association rule mining and can be used for making decisions. Many algorithms for generating association rules have been developed over time; some of the well known ones are Apriori, Eclat and FP-Growth.

There have been several attempts at mining association rules using genetic algorithms (GAs). This paper analyses the mining of association rules by applying genetic algorithms. The suitability of genetic algorithms to the field of data mining is studied in [7]. The main reason for choosing a genetic algorithm for data mining is that a GA performs a global search and copes better with attribute interaction than traditional greedy induction methods. The genetic algorithm is derived from Charles Darwin's theory of survival of the fittest. It is based on the fitness of individuals and the genetic similarity between them; breeding occurs in every generation and eventually leads to better and more nearly optimal groups in later generations.

Combining natural immune evolution theory and the relevant bionic mechanism, [1] proposes IOGA (Immune Optimization based Genetic Algorithm), an approach for incremental association rule mining on large and frequently updated datasets. The experiments demonstrate the method's efficiency and its good performance in pruning redundant rules, discovering meaningful rules and perceiving low support rules in the additional data. A fitness function is presented in [2] by proposing an efficient rule generator for denial of service detection in network intrusion detection; more chromosomes with relevant features are used, resulting in the generation of more rules, so the rules generated by this algorithm are suitable for continuously changing misuse detection. [3] presents a genetic algorithm based approach for mining classification rules from large databases. It emphasizes the predictive accuracy, comprehensibility and interestingness of the rules and simplifies the implementation of a GA; the paper discusses in detail the design of the encoding, the genetic operators and the fitness function for this task.

The main functional concepts in the data mining process are:
I. Data cleaning, also known as data cleansing, a phase in which noisy and irrelevant data are removed from the collection.
II. Data selection, the step at which the data relevant for the analysis is decided on and retrieved from the large data collection.
III. Data mining, the crucial step in which clever techniques are applied to extract potentially useful patterns.

A brief introduction to association rule mining and GAs is given in Section 2, followed by the methodology in Section 3, which describes the basic implementation details of association rule mining with a GA. Section 4 presents the parameters that decide the efficiency of the algorithm together with the experimental results, and the conclusion is given in the last section.

2 ASSOCIATION RULES AND GENETIC ALGORITHMS

2.1 Association Rules


Association rule mining finds interesting associations and/or correlation relationships among large sets of data items. Association rules show attribute-value conditions that occur frequently together in a given dataset.


Typically the relationship is expressed as a rule of the form IF {antecedent} THEN {consequent}. Two measures are associated with every rule: the support level, the minimum percentage of instances in the database that contain all items listed in a given association rule, and the confidence level, which for a rule "If A then B" is the conditional probability that B is true when A is known to be true.
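As an illustration, the sketch below computes the support and confidence of the rule IF {bread} THEN {butter} over a small, made-up list of transactions; the item names and data are only illustrative assumptions.

```java
import java.util.List;
import java.util.Set;

// Minimal sketch: support and confidence of the rule "bread -> butter".
public class RuleMeasures {
    public static void main(String[] args) {
        List<Set<String>> transactions = List.of(
                Set.of("bread", "butter", "milk"),
                Set.of("bread", "butter"),
                Set.of("bread", "jam"),
                Set.of("milk", "butter"));

        long antecedent = transactions.stream()
                .filter(t -> t.contains("bread")).count();                     // transactions with bread
        long both = transactions.stream()
                .filter(t -> t.contains("bread") && t.contains("butter")).count();

        double support = (double) both / transactions.size();   // P(bread AND butter)
        double confidence = (double) both / antecedent;          // P(butter | bread)
        System.out.printf("support = %.2f, confidence = %.2f%n", support, confidence);
    }
}
```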


2.2 Genetic Algorithm


The genetic algorithm is based on Charles Darwin's theory of survival of the fittest. The algorithm starts with a set of solutions, represented by chromosomes, called the population. Solutions from one population are taken and used to form a new population, motivated by the hope that the new population will be better than the old one. The solutions selected to form new solutions (offspring) are chosen according to their fitness: the more suitable they are, the more chances they have to reproduce. If the fitness of the new individuals is better than the fitness of the individuals in the previous generation, the old individuals are replaced. This is carried out until the termination condition is reached. A chromosome must in some way contain information about the solution it represents. The most common encoding is a binary string: each bit can represent some characteristic of the solution, or the whole string can represent a number, and many other encodings are possible.
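As a small illustration of binary encoding, the sketch below represents a candidate solution as a fixed-length bit string in which each bit marks whether an item takes part in the rule; the attribute layout is an assumption for illustration only.

```java
// Minimal sketch of a binary-encoded chromosome.
public class Chromosome {
    // Example: bits for the items {bread, butter, milk, jam}; 1 = item present in the rule.
    int[] genes = {1, 1, 0, 0};

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (int g : genes) sb.append(g);
        return sb.toString();                 // "1100"
    }

    public static void main(String[] args) {
        System.out.println(new Chromosome()); // prints the bit string
    }
}
```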
Fig. 1. Flow chart of the genetic algorithm: generate initial population, evaluate fitness, selection, crossover and mutation, then repeat until the termination condition is met.

3 METHODOLOGY

A new population is first initialized. For every individual in the population a fitness function is applied and the fitness is calculated. Then, based on the crossover and mutation rates, the crossover and mutation operators are applied. The new individuals obtained are again evaluated with the fitness function. If the fitness of the new individuals is better than that of the individuals in the previous generation, the old individuals are replaced. This is carried out until the termination condition is reached (Fig. 1). The main steps of the genetic algorithm are described in the following subsections.

3.1 Selection of Individuals

During each successive generation, a proportion of the existing population is selected to breed a new generation. Certain selection methods rate the fitness of each solution and preferentially select the best solutions; other methods rate only a random sample of the population, as rating every solution may be very time consuming. Fig. 2 depicts roulette wheel selection.

Fig. 2. Roulette wheel selection.
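The following sketch shows one possible implementation of the roulette wheel selection depicted in Fig. 2, in which an individual is selected with probability proportional to its fitness; the fitness values in the example are arbitrary.

```java
import java.util.Random;

// Minimal sketch of roulette wheel (fitness-proportional) selection.
public class RouletteSelection {
    static final Random RNG = new Random();

    /** Returns the index of the selected individual. */
    static int select(double[] fitness) {
        double total = 0.0;
        for (double f : fitness) total += f;

        double spin = RNG.nextDouble() * total;    // spin the wheel
        double cumulative = 0.0;
        for (int i = 0; i < fitness.length; i++) {
            cumulative += fitness[i];
            if (spin <= cumulative) return i;
        }
        return fitness.length - 1;                 // guard against rounding error
    }

    public static void main(String[] args) {
        double[] fitness = {0.4, 0.3, 0.2, 0.1};   // example population fitness values
        System.out.println("selected index: " + select(fitness));
    }
}
```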


3.2 Fitness Function

A fitness function must be devised for each problem to be solved. Given a particular chromosome, the fitness function returns a single numerical "fitness", or "figure of merit", which is supposed to be proportional to the "utility" or "ability" of the individual that the chromosome represents. For many problems, particularly function optimisation, the fitness function simply measures the value of the function.

This paper adopts minimum support and minimum confidence for filtering rules. The correlative degree is then confirmed for the rules that satisfy the minimum support and minimum confidence. Taking support and confidence into account together, the fitness function is defined as

    Fitness(x) = Rs * Supp(x) + Rc * Conf(x)                                    (1)

where Rs + Rc = 1 (Rs >= 0, Rc >= 0), and Suppmin and Confmin are the minimum support and minimum confidence values. When Suppmin and Confmin are set to higher values, the rules that pass the filter, and hence the fitness values obtained, are also higher.
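A minimal sketch of how the fitness in equation (1) might be evaluated is given below; the weight values, the thresholds and the choice of assigning zero fitness to rules that fail the minimum thresholds are illustrative assumptions rather than the paper's exact implementation.

```java
// Minimal sketch of the weighted fitness of equation (1).
public class RuleFitness {
    static final double RS = 0.5;        // weight for support (Rs)
    static final double RC = 0.5;        // weight for confidence (Rc); Rs + Rc = 1
    static final double SUPP_MIN = 0.2;  // minimum support threshold (Table 1)
    static final double CONF_MIN = 0.8;  // minimum confidence threshold (Table 1)

    /** Returns the fitness of a rule, or 0 if it fails the minimum thresholds. */
    static double fitness(double support, double confidence) {
        if (support < SUPP_MIN || confidence < CONF_MIN) {
            return 0.0;                          // rule filtered out
        }
        return RS * support + RC * confidence;   // equation (1)
    }

    public static void main(String[] args) {
        System.out.println(fitness(0.25, 0.85)); // example rule
    }
}
```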

3.3 Crossover Operator

Crossover selects genes from the parent chromosomes and creates new offspring. The simplest way to do this is to choose a crossover point at random, copy everything before this point from the first parent and everything after it from the second parent. The most common form is single point crossover, where one position in the chromosomes is chosen at random: child 1 is the head of parent 1 joined with the tail of parent 2, and child 2 is the head of parent 2 joined with the tail of parent 1. There are other ways to perform crossover, for example choosing more than one crossover point. Crossover can be rather complicated and depends on the encoding of the chromosome.
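The sketch below illustrates single point crossover on binary chromosomes as described above; the parent chromosomes used in the example are arbitrary.

```java
import java.util.Arrays;
import java.util.Random;

// Minimal sketch of single point crossover on binary chromosomes.
public class SinglePointCrossover {
    static final Random RNG = new Random();

    static int[][] crossover(int[] parent1, int[] parent2) {
        int point = 1 + RNG.nextInt(parent1.length - 1);      // random crossover point
        int[] child1 = new int[parent1.length];
        int[] child2 = new int[parent2.length];
        for (int i = 0; i < parent1.length; i++) {
            child1[i] = (i < point) ? parent1[i] : parent2[i]; // head of 1 + tail of 2
            child2[i] = (i < point) ? parent2[i] : parent1[i]; // head of 2 + tail of 1
        }
        return new int[][] {child1, child2};
    }

    public static void main(String[] args) {
        int[][] children = crossover(new int[]{1, 1, 1, 1}, new int[]{0, 0, 0, 0});
        System.out.println(Arrays.deepToString(children));
    }
}
```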

3.4 Mutation Operator

Mutation randomly alters the new offspring. For binary encoding, a few randomly chosen bits can be switched from 1 to 0 or from 0 to 1. Mutation provides a small amount of random search and helps ensure that no point in the search space has a zero probability of being examined.
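A small sketch of bit flip mutation is shown below; the mutation rate used is an illustrative value (the experiments in Section 4 actually set the mutation rate to zero to preserve attribute values).

```java
import java.util.Arrays;
import java.util.Random;

// Minimal sketch of bit flip mutation for a binary chromosome.
public class BitFlipMutation {
    static final Random RNG = new Random();
    static final double MUTATION_RATE = 0.01;   // illustrative value only

    static void mutate(int[] chromosome) {
        for (int i = 0; i < chromosome.length; i++) {
            if (RNG.nextDouble() < MUTATION_RATE) {
                chromosome[i] = 1 - chromosome[i];   // flip 0 -> 1 or 1 -> 0
            }
        }
    }

    public static void main(String[] args) {
        int[] c = {1, 0, 1, 1, 0};
        mutate(c);
        System.out.println(Arrays.toString(c));
    }
}
```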

3.5 Fitness Function

A fitness function is a particular type of objective function that prescribes the optimality of a chromosome in a genetic algorithm, so that the chromosome may be ranked against all the other chromosomes [9, 10]. An ideal fitness function correlates closely with the algorithm's goal and can be computed quickly. Speed of execution is very important, as a typical genetic algorithm must be iterated many times in order to produce a usable result for a non-trivial problem.

3.6 Number of Generations

The generational process of mining association rules with the genetic algorithm is repeated until a termination condition is reached. Common terminating conditions are:
- a solution is found that satisfies minimum criteria;
- a fixed number of generations is reached;
- the fitness of the highest ranking solution has reached a plateau, such that successive iterations no longer produce better results;
- manual inspection;
- combinations of the above.

4 EXPERIMENTAL STUDIES

The objective of this study is to compare the accuracy achieved on the datasets when the GA parameters are varied. The encoding of the chromosomes is binary encoding with fixed length. As the crossover is performed at the attribute level, the mutation rate is set to zero so as to retain the original attribute values. The fitness function adopted is as given in equation (1). Three datasets from the UCI Machine Learning Repository [7], namely Lenses, Haberman's Survival and Iris, have been taken up for experimentation. The Lenses dataset has 4 attributes and 24 instances, Haberman's Survival has 3 attributes and 306 instances, and Iris has 5 attributes and 150 instances. The algorithm is implemented in Java. The accuracy and the convergence rate obtained by controlling the GA parameters are recorded in the tables below. Accuracy is the number of instances in the original dataset matched by the resulting population, divided by the number of instances in the dataset. The convergence rate is the generation at which the fitness value becomes fixed.
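As an illustration of the accuracy measure described above, the sketch below counts the instances of the original dataset that are matched by the resulting population and divides by the number of instances; the binary encoding of instances and the exact matching criterion are assumptions made for illustration only.

```java
import java.util.List;
import java.util.Set;

// Minimal sketch of the accuracy measure: matched instances / total instances.
public class AccuracyMeasure {
    static double accuracy(List<int[]> dataset, Set<String> resultingPopulation) {
        long matched = dataset.stream()
                .filter(instance -> resultingPopulation.contains(encode(instance)))
                .count();
        return (double) matched / dataset.size();
    }

    // Encodes an instance as a bit string so it can be compared with chromosomes.
    static String encode(int[] instance) {
        StringBuilder sb = new StringBuilder();
        for (int v : instance) sb.append(v);
        return sb.toString();
    }

    public static void main(String[] args) {
        List<int[]> data = List.of(new int[]{1, 0, 1}, new int[]{0, 1, 1});
        System.out.println(accuracy(data, Set.of("101")));   // prints 0.5
    }
}
```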


TABLE 1
DEFAULT GA PARAMETERS

Parameter             Value
Population size       24
Crossover rate        0.5
Mutation rate         0.0
Selection method      Roulette wheel selection
Minimum support       0.2
Minimum confidence    0.8

TABLE 2
ACCURACY BY VARYING MINIMUM SUPPORT AND CONFIDENCE

                      Sup=.2, Con=.5   Sup=.5, Con=.5   Sup=.75, Con=.75   Sup=.8, Con=.8
Dataset               Acc.   Gen.      Acc.   Gen.      Acc.    Gen.       Acc.   Gen.
Lenses                0.7    25        0.8    38        0.5     31         0.8    39
Haberman's Survival   0.45   68        0.58   83        0.71    90         0.62   75
Iris                  0.40   28        0.59   37        0.78    48         0.87   55

(Acc. = accuracy; Gen. = number of generations to converge)

From Table 2 it is clear that varying the minimum support and confidence brings large changes in accuracy. The optimum values of minimum support and confidence depend on the support and confidence values of the attributes in the dataset.
TABLE 3
ACCURACY BY VARYING CROSSOVER RATE

                      Pc = .25         Pc = .5          Pc = .75
Dataset               Acc.   Gen.      Acc.   Gen.      Acc.   Gen.
Lenses                0.7    8         0.9    30        0.3    39
Haberman's Survival   0.69   77        0.71   83        0.70   80
Iris                  0.84   45        0.86   51        0.87   55


From Table 3 it is evident that the accuracy varies with changes in the crossover rate.
TABLE 4
ACCURACY BY VARYING Rs AND Rc

                      Rs=.8, Rc=.2    Rs=.2, Rc=.8    Rs=.4, Rc=.6    Rs=.5, Rc=.5
Dataset               Acc.   Gen.     Acc.   Gen.     Acc.   Gen.     Acc.   Gen.
Lenses                0.9    30       0.6    38       0.6    34       0.9    37
Haberman's Survival   0.71   83       0.56   80       0.65   70       0.7    75
Iris                  0.87   53       0.76   45       0.7    43       0.86   55

From Table 4 it can be concluded that the larger the difference between Rs and Rc, the higher the accuracy; when Rs and Rc are close, the accuracy is lower. The fitness threshold plays a major role in deciding the efficiency of the mined rules and the convergence of the system. Setting the values of minimum support and confidence depends on the dataset and the relationships between its attributes. The accuracy of the algorithm and the optimum values of the GA parameters cannot be generalized, as the optimum values of these parameters vary from dataset to dataset.

5 CONCLUSION

Genetic algorithms have been used to solve difficult optimization problems in a number of fields and have proved to produce optimal results in mining association rules. When a genetic algorithm is used for mining association rules, the GA parameters decide the efficiency of the system. The values of minimum support, minimum confidence and population size decide the accuracy of the system more than the other GA parameters. The optimum value of the crossover rate leads to earlier convergence while playing a minimal role in achieving better accuracy. The optimum values of the GA parameters vary from dataset to dataset, and the fitness function plays a major role in optimizing the results. The size of the dataset and the relationships between attributes in the data contribute to the setting of the parameters. The efficiency of the methodology could be further explored on more datasets with varying attribute sizes.

REFERENCES

[1] Genxiang Zhang, Haishan Chen, "Immune Optimization Based Genetic Algorithm for Incremental Association Rules Mining," International Conference on Artificial Intelligence and Computational Intelligence, Vol. 4, pp. 341-345, 2009.
[2] Gonzales, E., Mabu, S., Taboada, K., Shimada, K., Hirasawa, K., "Mining Multi-class Datasets using Genetic Relation Algorithm for Rule Reduction," IEEE Congress on Evolutionary Computation (CEC '09), pp. 3249-3255, 2009.
[3] Hong Guo, Ya Zhou, "An Algorithm for Mining Association Rules Based on Improved Genetic Algorithm and its Application," 3rd International Conference on Genetic and Evolutionary Computing (WGEC '09), pp. 117-120, 2009.
[4] Jing Li, Han Rui Feng, "A Self-adaptive Genetic Algorithm Based on Real Code," Capital Normal University, CNU, 2010.
[5] Xiaoyuan Zhu, Yongquan Yu, Xueyan Guo, "Genetic Algorithm Based on Evolution Strategy and the Application in Data Mining," First International Workshop on Education Technology and Computer Science (ETCS '09), Vol. 1, pp. 848-852, 2009.
[6] Xian-Jun Shi, Hong Lei, "A Genetic Algorithm-Based Approach for Classification Rule Discovery," International Conference on Information Management, Innovation Management and Industrial Engineering (ICIII '08), Vol. 1, pp. 175-178, 2008.
[7] Blake, C. L., Merz, C. J., "UCI Repository of Machine Learning Databases," Irvine, CA: University of California, Department of Information and Computer Science, http://www.ics.uci.edu/~mlearn, 1998.


K. Indira received her M.E. degree in 2005 from the Department of Computer Science and Engineering, FEAT, Annamalai University, Chidambaram. She has been working as Head of the Department of Computer Science for the past 12 years, in Theivanai Ammal College for Women, Tamil Nadu, India from 1998 to 2007 and then in E.S. College of Engineering and Technology, affiliated to Anna University, Chennai, India. Currently she is working towards the Ph.D. degree on genetic algorithms applied to data mining. Her areas of interest are data mining, artificial intelligence and evolutionary computing.

Dr. S. Kanmani received her B.E. and M.E. in Computer Science and Engineering from Bharathiyar University and her Ph.D. from Anna University, Chennai. She has been a faculty member of the Department of Computer Science and Engineering, Pondicherry Engineering College, since 1992. Presently she is heading the Department of Information Technology, Pondicherry Engineering College. Her research interests are software engineering, software testing, object oriented systems, and data mining. She is a member of the Computer Society of India, ISTE and the Institute of Engineers, India. She has published about 50 papers in various international conferences and journals.
