1 INTRODUCTION
Data mining is concerned with analysing data and using software techniques to draw conclusions from large data sets, including finding patterns and regularities in them. Association rule mining is a type of data mining that finds relations between entities in databases. It is mainly used in market analysis, transaction data analysis and the medical field. For example, all of the transactions occurring in a supermarket are stored in a large database, and if a customer buys bread, there is a chance that he or she also buys butter. Such inferences, drawn using association rule mining, can be used for making decisions. Many algorithms for generating association rules have been developed over time. Some of the well-known algorithms are Apriori, Eclat and FP-Growth. There have been several attempts at mining association rules using Genetic Algorithms (GA), and this paper analyses the mining of association rules by applying Genetic Algorithms.

The suitability of Genetic Algorithms in the field of data mining is studied in [7]. The main reason for choosing a GA for data mining is that a GA performs a global search and copes better with attribute interaction than the traditional greedy methods based on induction. The Genetic Algorithm is derived from Charles Darwin's theory of survival of the fittest. It is based on the fitness of individuals and the genetic similarity between them. Breeding occurs in every generation and eventually leads to a better, near-optimal group in later generations.

Combining natural immune evolution theory and relevant bionic mechanisms, [1] proposes an IOGA (Immune Optimization based Genetic Algorithm) approach for incremental association rules mining on large and frequently updated data sets.
The experiment demonstrates the method's efficiency and its good performance in pruning redundant rules, discovering meaningful rules and perceiving low-support rules in the additional data set. A fitness function is presented in [2], which proposes an efficient rule generator for denial-of-service network intrusion detection. More chromosomes with relevant features are used, resulting in the generation of more rules. As such, the rules generated by this algorithm are suitable for continuously changing misuse detection. [3] presents a genetic algorithm based approach for mining classification rules from large databases. It emphasizes the predictive accuracy, comprehensibility and interestingness of the rules and simplifies the implementation of a GA. The paper discusses in detail the design of the encoding, genetic operators and fitness function of the genetic algorithm for this task.

The main functional concepts in the data mining process are:
I. Data cleaning: also known as data cleansing, a phase in which noisy and irrelevant data are removed from the collection.
II. Data selection: at this step, the data relevant to the analysis is decided on and retrieved from the large data collection.
III. Data mining: the crucial step in which clever techniques are applied to extract potentially useful patterns.

A brief introduction to Association Rule Mining and GA is given in Section 2, followed by the methodology in Section 3, which describes the basic implementation details of Association Rule Mining with GA. In Section 4 the parameters that decide the efficiency of the algorithm are presented. Section 5 presents the experimental results, followed by the conclusion in the last section.
2 ASSOCIATION RULES AND GENETIC ALGORITHMS
JOURNAL OF COMPUTING, VOLUME 4, ISSUE 5, MAY 2012, ISSN 2151-9617 https://sites.google.com/site/journalofcomputing WWW.JOURNALOFCOMPUTING.ORG
Typically the relationship is expressed in the form of a rule: IF {antecedent} THEN {consequent}. Two measures are associated with an association rule. Support level: the percentage of instances in the database that contain all items listed in a given association rule. Confidence level: for a rule "if A then B", the confidence is the conditional probability that B is true when A is known to be true.
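The two measures can be illustrated with a minimal Python sketch. The function names and the toy bread-and-butter transactions are assumptions made for illustration and are not taken from the paper's Java implementation.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) as supp(A and B) / supp(A)."""
    supp_a = support(transactions, antecedent)
    if supp_a == 0:
        return 0.0  # the rule never fires, so no confidence can be estimated
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / supp_a


# Hypothetical market-basket data for the rule IF {bread} THEN {butter}.
baskets = [
    frozenset({"bread", "butter"}),
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread"}),
    frozenset({"milk"}),
]
rule_support = support(baskets, {"bread", "butter"})          # 2 of 4 baskets
rule_confidence = confidence(baskets, {"bread"}, {"butter"})  # 2 of the 3 bread baskets
```

Here the rule is supported by half of the baskets, and two out of the three baskets containing bread also contain butter.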
[Figure: flowchart of the genetic algorithm. Recoverable labels: Start, Generate Initial Population, Evaluate Fitness, No.]
3 METHODOLOGY
A new population is first initialized. A fitness function is applied to every individual in the population to calculate its fitness. Then, based on the crossover and mutation rates, the crossover and mutation operations are performed. The new individuals obtained are again subjected to the fitness function. If the fitness of the new individuals is better than that of the individuals in the previous generation, the individuals are replaced. This is carried out until the termination condition is reached. The following are the steps of a Genetic Algorithm.
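The loop described in this paragraph can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' Java implementation: the name `run_ga`, the fixed generation count used as the termination condition, the uniformly random choice of mate, and the toy fitness (number of 1-bits) are all hypothetical.

```python
import random


def run_ga(fitness, chrom_len=8, pop_size=10, crossover_rate=0.5,
           mutation_rate=0.0, generations=50, seed=0):
    """Return (best individual, best fitness) after a fixed number of generations."""
    rng = random.Random(seed)
    # Initialize a random population of fixed-length binary chromosomes.
    population = [[rng.randint(0, 1) for _ in range(chrom_len)]
                  for _ in range(pop_size)]
    scores = [fitness(ind) for ind in population]

    for _ in range(generations):  # assumed termination: generation budget
        for i in range(pop_size):
            child = population[i][:]
            # Apply single-point crossover with probability crossover_rate.
            if rng.random() < crossover_rate:
                mate = population[rng.randrange(pop_size)]
                point = rng.randint(1, chrom_len - 1)
                child = child[:point] + mate[point:]
            # Flip each gene with probability mutation_rate.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            # Replace the individual only if the new one is fitter.
            child_score = fitness(child)
            if child_score > scores[i]:
                population[i], scores[i] = child, child_score

    best = max(range(pop_size), key=scores.__getitem__)
    return population[best], scores[best]


# Toy fitness, assumed for illustration: count of 1-bits in the chromosome.
best_individual, best_score = run_ga(sum)
```

Because an individual is replaced only by a fitter offspring, the fitness of each slot in the population never decreases from one generation to the next.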
This paper adopts minimum support and minimum confidence for filtering rules. The correlative degree is then confirmed for the rules that satisfy the minimum support degree and minimum confidence degree. Taking the support degree and confidence degree into account together, the fitness function is defined as follows. (1)
In the above formula, Rs + Rc = 1 (Rs ≥ 0, Rc ≥ 0), and Suppmin and Confmin are the values of minimum support and minimum confidence respectively. Evidently, if Suppmin and Confmin are set to higher values, the value of the fitness function is also found to be high.
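Equation (1) did not survive in this copy of the text, so the following Python sketch only illustrates one plausible weighted form built from the symbols defined above. The ratio form Rs * Supp / Suppmin + Rc * Conf / Confmin, the zero score for filtered rules, the function name and the default weights are all assumptions; the threshold defaults are the values listed in Table 1. Under this assumed form a fixed rule scores lower as the thresholds rise, so the paper's own equation may differ in detail.

```python
def rule_fitness(supp, conf, supp_min=0.2, conf_min=0.8, rs=0.5, rc=0.5):
    """Weighted fitness of a rule (assumed form, not the paper's Equation (1))."""
    if rs < 0 or rc < 0 or abs(rs + rc - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and satisfy Rs + Rc = 1")
    # Rules below either threshold are filtered out (assumption: score 0).
    if supp < supp_min or conf < conf_min:
        return 0.0
    # Support and confidence are each normalized by their minimum threshold.
    return rs * supp / supp_min + rc * conf / conf_min


# A rule with support 0.4 and confidence 0.8, weighted towards support.
example_fitness = rule_fitness(0.4, 0.8, rs=0.8, rc=0.2)
filtered_fitness = rule_fitness(0.1, 0.9)  # support below the 0.2 threshold
```

With Rs = 0.8 and Rc = 0.2 the example rule scores 0.8 * 2 + 0.2 * 1 = 1.8, while the rule below the support threshold scores 0.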
Crossover Operator
Crossover selects genes from the parent chromosomes and creates new offspring. The simplest way to do this is to choose a crossover point at random, copy everything before this point from the first parent, and copy everything after it from the second parent. A common form is single-point crossover, where one position in the chromosomes is chosen at random; child 1 is the head of parent 1's chromosome with the tail of parent 2's, and child 2 is the head of parent 2's with the tail of parent 1's. There are other ways to perform crossover; for example, more than one crossover point can be chosen. Crossover can be rather complicated and depends on the encoding of the chromosome.
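Single-point crossover as described above can be sketched in a few lines of Python. The function name and the list-of-bits representation are assumptions for illustration; the paper's implementation performs crossover at the attribute level, which this sketch does not model.

```python
import random


def single_point_crossover(parent1, parent2, rng=random):
    """Return (child1, child2, point) for two equal-length chromosomes."""
    if len(parent1) != len(parent2) or len(parent1) < 2:
        raise ValueError("parents must have equal length of at least 2")
    # Pick a cut strictly inside the chromosome so both parents contribute.
    point = rng.randint(1, len(parent1) - 1)
    child1 = parent1[:point] + parent2[point:]  # head of 1, tail of 2
    child2 = parent2[:point] + parent1[point:]  # head of 2, tail of 1
    return child1, child2, point


p1 = [0, 0, 0, 0, 0, 0]
p2 = [1, 1, 1, 1, 1, 1]
c1, c2, cut = single_point_crossover(p1, p2, random.Random(1))
```

With all-zero and all-one parents, the position of the cut is directly visible in the children: `c1` starts with `cut` zeros followed by ones, and `c2` is its complement.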
4 EXPERIMENTAL STUDIES
The objective of this study is to compare the accuracy achieved on datasets by varying the GA parameters. The chromosomes use fixed-length binary encoding. As crossover is performed at the attribute level, the mutation rate is set to zero so as to retain the original attribute values. The fitness function adopted is as given in Equation (1). Three datasets from the UCI Machine Learning Repository [7], namely Lenses, Haberman's Survival and Iris, have been taken up for experimentation. The Lenses dataset has 4 attributes and 24 instances, Haberman's Survival dataset has 3 attributes and 306 instances, and the Iris dataset has 5 attributes and 150 instances. The algorithm is implemented in Java. The accuracy and the convergence rate obtained by controlling the GA parameters are recorded in the tables below. Accuracy is the number of instances that match between the original dataset and the resulting population, divided by the number of instances in the dataset. The convergence rate is the generation at which the fitness value becomes fixed.
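The two evaluation measures can be sketched in Python as below. This is an interpretation rather than the authors' code: it assumes that an instance "matches" when an identical encoded record appears in the final population, and that the convergence rate is the first generation (counted from 1) after which the best fitness no longer changes. The function names and toy data are hypothetical.

```python
def accuracy(dataset, population):
    """Share of dataset instances that also appear in the final population."""
    if not dataset:
        return 0.0
    found = {tuple(individual) for individual in population}
    matches = sum(1 for row in dataset if tuple(row) in found)
    return matches / len(dataset)


def convergence_generation(best_fitness_per_generation):
    """First generation (1-based) from which the best fitness stays fixed."""
    history = list(best_fitness_per_generation)
    if not history:
        raise ValueError("fitness history is empty")
    gen = len(history)
    # Walk backwards while the preceding generation already had the final value.
    while gen > 1 and history[gen - 2] == history[-1]:
        gen -= 1
    return gen


# Toy encoded records and a toy fitness trace, assumed for illustration.
toy_dataset = [(1, 0, 1), (0, 1, 1), (1, 1, 0), (0, 0, 1)]
toy_population = [(1, 0, 1), (1, 1, 0), (1, 1, 1)]
toy_accuracy = accuracy(toy_dataset, toy_population)
toy_convergence = convergence_generation([0.2, 0.5, 0.7, 0.7, 0.7])
```

In the toy data two of the four records are present in the population, and the fitness trace stops changing at the third generation.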
TABLE 1 DEFAULT GA PARAMETERS

Parameter            Value
Population Size      24
Crossover rate       0.5
Mutation rate        0.0
Selection Method     Roulette wheel selection
Minimum Support      0.2
Minimum Confidence   0.8
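Table 1 names roulette wheel selection as the selection method. A minimal Python sketch of that technique is given below, assuming non-negative fitness values; the function name, the uniform fallback when all fitness values are zero, and the toy inputs are assumptions and not details taken from the paper.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def roulette_wheel_select(population, fitnesses, rng=random):
    """Pick one individual with probability proportional to its fitness."""
    if len(population) != len(fitnesses) or not population:
        raise ValueError("population and fitnesses must be non-empty and aligned")
    if any(f < 0 for f in fitnesses):
        raise ValueError("fitness values must be non-negative")
    total = sum(fitnesses)
    if total == 0:
        return rng.choice(population)  # assumed fallback: uniform choice
    # Each individual owns a slice of [0, total) as wide as its fitness.
    cumulative = list(accumulate(fitnesses))
    pick = rng.random() * total
    index = bisect_right(cumulative, pick)
    return population[min(index, len(population) - 1)]


rng = random.Random(0)
only_b = [roulette_wheel_select(["a", "b", "c"], [0.0, 1.0, 0.0], rng)
          for _ in range(20)]
draws = [roulette_wheel_select(["a", "b"], [1.0, 9.0], rng)
         for _ in range(1000)]
```

An individual with zero fitness owns an empty slice of the wheel and is never selected, while in the second example "b" is expected to be drawn roughly nine times as often as "a".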
TABLE 2 ACCURACY BY VARYING MINIMUM SUPPORT AND CONFIDENCE
Sup = 0.2 & Con = 0.5 (Accuracy, No. of Generations): 25 68 28
Sup = 0.5 & Con = 0.5 (Accuracy, No. of Generations): 38 83 37
Sup = 0.75 & Con = 0.75 (Accuracy, No. of Generations): 31 90 48
Sup = 0.8 & Con = 0.8 (Accuracy, No. of Generations): 39 75 55
From Table 2 it is clear that varying the minimum support and confidence produces large changes in accuracy. The optimum values of minimum support and confidence depend on the support and confidence values of the attributes in the dataset.
TABLE 3
0.7 0.69   0.9 0.71   0.3 0.70
0.84 45   0.86 51   0.87 55
From the above table it is evident that the accuracy varies with changes in the crossover point.
TABLE 4 ACCURACY BY VARYING Rs & Rc
Rs = 0.8 & Rc = 0.2 (Accuracy, No. of Generations): 30 83
0.9 0.71   0.6 0.56   0.6 0.65   0.9 0.7
0.87 53   0.76 45   0.7 43   0.86 55
From the above table it can be concluded that the greater the difference between Rs and Rc, the higher the accuracy; when Rs and Rc are close, the accuracy is lower. The fitness threshold plays a major role in deciding the efficiency of the rules mined and the convergence of the system. Setting values for minimum support and confidence depends on the dataset and the relationships between its attributes. The accuracy of the algorithm and the optimum values of the GA parameters cannot be generalized, as the optimum values vary from dataset to dataset.
5. CONCLUSION
Genetic Algorithms have been used to solve difficult optimization problems in a number of fields and have proved to produce optimum results in mining association rules. When a Genetic Algorithm is used for mining association rules, the GA parameters decide the efficiency of the system. The values of minimum support, minimum confidence and population size influence the accuracy of the system more than the other GA parameters. The optimum value of the crossover rate leads to earlier convergence while playing a minimal role in achieving better accuracy. The optimum values of the GA parameters vary from dataset to dataset, and the fitness function plays a major role in optimizing the results. The size of the dataset and the relationships between its attributes contribute to the setting of the parameters. The efficiency of the methodology could be further explored on more datasets with varying attribute sizes.

References
[1] Genxiang Zhang, Haishan Chen, Immune Optimization Based Genetic Algorithm for Incremental Association Rules Mining, International Conference on Artificial Intelligence and Computational Intelligence, Volume: 4, pp. 341-345, 2009.
[2] Gonzales, E., Mabu, S., Taboada, K., Shimada, K., Hirasawa, K., Mining Multi-class Datasets using Genetic Relation Algorithm for Rule Reduction, IEEE Congress on Evolutionary Computation, CEC '09, pp. 3249-3255, 2009.
[3] Hong Guo, Ya Zhou, An Algorithm for Mining Association Rules Based on Improved Genetic Algorithm and its Application, 3rd International Conference on Genetic and Evolutionary Computing, WGEC '09, pp. 117-120, 2009.
[4] Jing Li, Han Rui Feng, A self-adaptive genetic algorithm based on real code, Capital Normal University, CNU, 2010.
[5] Xiaoyuan Zhu, Yongquan Yu, Xueyan Guo, Genetic Algorithm Based on Evolution Strategy and the Application in Data Mining, First International Workshop on Education Technology and Computer Science, ETCS '09, Volume: 1, pp. 848-852, 2009.
[6] Xian-Jun Shi, Hong Lei, A Genetic Algorithm-Based Approach for Classification Rule Discovery, International Conference on Information Management, Innovation Management and Industrial Engineering, ICIII '08, Volume: 1, pp. 175-178, 2008.
[7] Blake, C. L., Merz, C. J., UCI Repository of machine learning databases, Irvine, CA: University of California, Department of Information and Computer Science, available at http://www.ics.uci.edu/~mlearn (1998).
K. Indira received her M.E. degree in 2005 from the Department of Computer Science and Engineering, FEAT, Annamalai University, Chidambaram. She has worked as Head of the Department of Computer Science for the past 12 years, at Theivanai Ammal College for Women, Tamil Nadu, India from 1998 to 2007 and at E.S. College of Engineering and Technology, affiliated to Anna University, Chennai, India. Currently she is working towards the Ph.D. degree in Genetic Algorithms applied to data mining. Her areas of interest are Data Mining, Artificial Intelligence and Evolutionary Computing.

Dr. S. Kanmani received her B.E. and M.E. in Computer Science and Engineering from Bharathiyar University and her Ph.D. from Anna University, Chennai. She has been on the faculty of the Department of Computer Science and Engineering, Pondicherry Engineering College from 1992 onwards. Presently she is heading the Department of Information Technology, Pondicherry Engineering College. Her research interests are Software Engineering, Software Testing, Object-Oriented Systems and Data Mining. She is a member of the Computer Society of India, ISTE and the Institute of Engineers, India. She has published about 50 papers in various international conferences and journals.