
An Entropy-based Adaptive Genetic Algorithm for Learning Classification Rules

Linyu Yang
Dept. of Computer Science, Texas A&M University, College Station, TX 77843
lyyang@cs.tamu.edu

Dwi H. Widyantoro
Dept. of Computer Science, Texas A&M University, College Station, TX 77843
dhw7942@cs.tamu.edu

Thomas Ioerger
Dept. of Computer Science, Texas A&M University, College Station, TX 77843
ioerger@cs.tamu.edu

John Yen
Dept. of Computer Science, Texas A&M University, College Station, TX 77843
yen@cs.tamu.edu

Abstract- Genetic algorithms are one of the commonly used approaches in data mining. In this paper we put forward a genetic algorithm approach for classification problems. Binary coding is adopted, in which an individual in the population consists of a fixed number of rules that together stand for a solution candidate. The evaluation function considers four important factors: error rate, entropy measure, rule consistency, and hole ratio. Adaptive asymmetric mutation is applied through self-adaptation of the mutation inversion probability from 1-0 (0-1). The generated rules are not disjoint but may overlap. The final prediction is based on the voting of the rules, and the classifier gives all rules equal weight in their votes. On three databases, we compared our approach with several traditional data mining techniques, including decision trees, neural networks and naive Bayes learning. The results show that our approach outperformed the others on both prediction accuracy and standard deviation.

Keywords: genetic algorithm, adaptive asymmetric mutation, entropy, voting-based classifier

1. Introduction
Genetic algorithms have been successfully applied to a wide range of optimization problems including design, scheduling, routing, and control. Data mining is also one of their important application fields. In data mining, a GA can be used either to optimize parameters for other kinds of data mining algorithms or to discover knowledge by itself. In the latter task the rules found by the GA are usually more general because of its global search nature; in contrast, most other data mining methods are based on the rule induction paradigm, where the algorithm usually performs a kind of local search. The advantage of the GA becomes more obvious when the search space of a task is large.

In this paper we put forward a genetic algorithm approach for classification problems. First, we use binary coding in which an individual solution candidate consists of a fixed number of rules; in each rule, k bits are used for the k possible values of a certain attribute, and continuous attributes are converted to threshold-based boolean attributes before coding. The rule consequent is not explicitly coded in the string; instead, the consequent of a rule is determined by the majority of the training examples it matches. Four important factors are considered in our evaluation function. Error rate is calculated from the prediction results on the training examples. Entropy is used to measure the homogeneity of the examples that a certain rule matches. Rule consistency measures how consistent the classification conclusions for a certain training example are across a set of rules. Finally, the hole ratio evaluates the percentage of training examples that a set of rules does not cover. We try to include related information as completely as possible in the evaluation function so that the overall performance of a rule set can be better. An adaptive asymmetric mutation operator is applied in our reproduction step: when a bit is selected to mutate, the inversion probability from 1-0 (0-1) is not 50% as usual; the value of this probability is asymmetric and self-adapts during the run of the program. This is done to reach the best match of a certain rule to the training examples. For crossover, two-point crossover is adopted in our approach. We used three real databases to test our approach: a credit database, a voting database and a heart database. We compared our performance with four well-known methods from data mining, namely Induction Decision Trees (ID3) (Quinlan, 1986), ID3 with Boosting (Quinlan, 1996), Neural Networks, and Naive Bayes (Mitchell, 1997). Appropriate state-of-the-art techniques are incorporated in these non-GA methods to improve their performance. The results show that our GA approach outperformed the other approaches on both prediction accuracy and standard deviation.

In the rest of the paper, we first give a brief overview of related work. Our GA approach is then discussed in detail, followed by the results of our application on three real databases and the comparison with other data mining methods. Finally we make some concluding remarks.

2. Related works
Many researchers have contributed to the application of GAs to data mining. In this section we give a brief overview of a few representative works.

In the early 1990s, De Jong et al. implemented a GA-based system called GABIL that continually learns and refines concept classification rules (De Jong, 1993). An individual is a variable-length string representing a set of fixed-length rules. Traditional bit inversion is used for mutation. In crossover, corresponding crossover points in the two parents must semantically match. The fitness function contains only the percentage of examples correctly classified by an individual rule set. They compared the performance of GABIL with that of four other traditional concept learners on a variety of target concepts.

In GIL (Janikow, 1993), an individual is also a set of rules, but attribute values are encoded directly rather than as bits. GIL has special genetic operators for handling rule sets, rules, and rule conditions; the operators can perform generalization, specialization or other operations. Besides correctness, the evaluation function of GIL also includes the complexity of a rule set, so it favors correct, simple (short) rules.

Greene and Smith put forward a GA-based inductive system called COGIN (Greene, 1993). In contrast to the above two approaches, the system's current model at any point during the search is represented as a population of fixed-length rules. The population size (i.e., the number of rules in the model) varies from cycle to cycle as a function of how the coverage constraint is applied. The fitness function contains the information gain of a rule R and a penalty for the number of misclassifications made by R. An entropy measure is used to calculate the information gain of rule R based on the numbers of examples R matches and does not match. However, the system cannot evaluate the entropy of the entire partition formed by the classification, due to its encoding method.

In view of the situation that most data mining work emphasizes only predictive accuracy and comprehensibility, Noda et al. (Noda, 1999) put forward a GA approach designed to discover interesting rules. Their fitness function consists of two parts: the first measures the degree of interestingness of the rule, while the second measures its predictive accuracy. The computation of the consequent's degree of interestingness is based on the following idea: the larger the relative frequency (in the training set) of the value being predicted by the consequent, the less interesting it is. In other words, the rarer a value of a goal attribute, the more interesting a rule predicting it is. Since the values of the goal attribute in the databases we tested are not seriously unevenly distributed, we did not specially consider the interestingness of a rule in our current implementation, but mainly focus on including related factors as completely as possible to improve prediction accuracy. We did, however, consider how to treat an uneven distribution of goal attribute values to some extent; we discuss this in detail in the next section.

3. Our GA approach

In this section we present our GA approach for classification problems. The key idea of the algorithm is general and should be applicable to various kinds of classification problems; some parameter values used in the algorithm might be task dependent.

3.1 Individual's encoding

Each individual in the population consists of a fixed number of rules; in other words, the individual itself is a complete solution candidate. In our current implementation we set this fixed number to 10, which well satisfies the requirements of our test databases. The antecedent of a certain rule in the individual is formed by a conjunction of n attributes, where n is the number of attributes being mined. k bits are used to stand for an attribute if this attribute has k possible values. Continuous attributes are partitioned into threshold-based boolean attributes, in which the threshold is a boundary (adjacent examples across this boundary differ in their target classification) that maximizes the information gain; therefore two bits are used for a continuous attribute. The consequent of a rule is not explicitly encoded in the string; instead, it is automatically assigned based on the proportion of positive/negative examples the rule matches in the training set. We illustrate the encoding method with the following example. Suppose our task has three attributes with 4, 2 and 5 possible values, respectively. Then an individual in the population can be represented as follows:

A1   A2  A3       A1   A2  A3            A1   A2  A3
0110 11  10110    1110 01  10011    ...  1100 11  01110
Rule 1            Rule 2             ...  Rule 10

Ten rules are included in this individual, and the architecture of each rule is the same. We use rule 1 to explain the meaning of the encoding. In this example, the antecedent of rule 1 means:

If (A1 = value 2 OR value 3) AND (A2 = value 1 OR value 2) AND (A3 = value 1 OR value 3 OR value 4)

If all the bits belonging to one attribute are 0s, that attribute could not equal any possible value, which is meaningless. To avoid this, we add one step before the evaluation of the population: we check each rule in each individual one by one, and if the above case happens, we randomly select one bit of that attribute and change it to one.
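To make the encoding concrete, the following Python sketch (not the authors' original code) shows how a bit-string antecedent of this form could be decoded, repaired and matched against an example; the attribute value counts, the 0-based value indices and all function names are illustrative assumptions.

import random

# Illustrative sketch of the bit-string rule encoding described above.
# Assumptions: attributes are categorical with values indexed 0..k-1,
# and an example is a tuple of value indices, one per attribute.

ATTRIBUTE_SIZES = [4, 2, 5]          # k possible values for A1, A2, A3 (as in the example)

def split_rule(rule_bits, sizes=ATTRIBUTE_SIZES):
    """Split one rule's bit string into per-attribute bit groups."""
    groups, start = [], 0
    for k in sizes:
        groups.append(rule_bits[start:start + k])
        start += k
    return groups

def repair_rule(groups):
    """If all bits of an attribute are 0 (a meaningless condition), set one random bit to 1."""
    for g in groups:
        if not any(g):
            g[random.randrange(len(g))] = 1
    return groups

def rule_matches(rule_bits, example, sizes=ATTRIBUTE_SIZES):
    """A rule matches an example iff, for every attribute, the bit of the example's value is 1."""
    groups = split_rule(rule_bits, sizes)
    return all(group[value] == 1 for group, value in zip(groups, example))

# Rule 1 of the example individual: 0110 11 10110
rule1 = [0, 1, 1, 0,  1, 1,  1, 0, 1, 1, 0]
print(rule_matches(rule1, (1, 0, 2)))   # A1=value 2, A2=value 1, A3=value 3 -> True
print(rule_matches(rule1, (0, 0, 2)))   # A1=value 1 is not allowed by this rule -> False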

The consequent of a rule is not encoded in the string; it is determined by the class proportions among the training examples that the rule matches. Suppose i is one of the classifications; the consequent of a rule will be i if

    \frac{N_{matched|i}}{N_{matched}} > \frac{N_{training|i}}{N_{training}}        (1)

where N_{matched|i} is the number of examples whose classification is i and that are matched by the rule, N_{matched} is the total number of examples that the rule matches, N_{training|i} is the number of training examples whose classification is i, and N_{training} is the total number of training examples.

For example, if the distribution of positive and negative examples in the training set is 42% and 58%, and among the examples that rule 1 matches the positive/negative split is half and half, then the consequent of rule 1 should be positive because 0.5 > 0.42. Since the test databases we use at this time do not have a very uneven class distribution in the training examples, in our current implementation we did not specially consider the interestingness of rules, but we use this strategy to keep enough rules that match examples of the minority classification. Our encoding method is not limited to two-category classification but is applicable to multiple target values.

3.2 Fitness function

It is very important to define a good fitness function that rewards the right kinds of individuals. We try to consider the affecting factors as completely as possible to improve the results of classification. Our fitness function is defined as follows:

    Fitness = Error rate + Entropy measure + Rule consistency + Hole ratio        (2)

We elaborate each part of the fitness function in the following.

1) Error rate. It is well known that accuracy is the most important and commonly used measure in the fitness function, as the final goal of data mining is to obtain good prediction results. Since our objective function is a minimization, we use the error rate to represent this information. It is calculated as:

    Error rate = percentage of misclassified examples                             (3)

If a rule matches a certain example, the classification it gives is its consequent; if it does not match, no classification is given. An individual consists of a set of rules, and the final classification predicted by this rule set is based on the voting of the rules that match the example. The classifier gives all matching rules equal weight. For instance, if in an individual (which has ten rules here) one rule does not match, six rules give a positive classification and three rules give a negative classification on a training example, then the final conclusion given by this individual on that training example is positive. If a tie happens (i.e., four positive and four negative classifications), the final conclusion is the majority classification in the training examples. If none of the rules in the individual matches the example, the final conclusion is also the majority classification in the training examples. The error rate measure of the individual is the percentage of misclassified examples among all training examples.
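As a rough illustration of how the consequent assignment of Eq. (1), the equal-weight voting and the error rate could be implemented, here is a hedged Python sketch; the data layout, the tie-breaking on the training majority and all names are assumptions consistent with the description above, not the authors' code.

from collections import Counter

def rule_consequent(rule, examples, labels, matches):
    """Assign the consequent per Eq. (1): pick the class whose share among the
    matched examples exceeds its share in the whole training set the most."""
    matched = [y for x, y in zip(examples, labels) if matches(rule, x)]
    if not matched:
        return None
    base = Counter(labels)                      # N_training|i
    got = Counter(matched)                      # N_matched|i
    n, m = len(labels), len(matched)
    return max(got, key=lambda i: got[i] / m - base[i] / n)

def predict(individual, x, consequents, majority_class, matches):
    """Equal-weight voting of all rules that match x; fall back to the
    training-set majority class on a tie or when no rule matches."""
    votes = Counter(c for rule, c in zip(individual, consequents)
                    if c is not None and matches(rule, x))
    if not votes:
        return majority_class
    (top, n1), *rest = votes.most_common()
    if rest and rest[0][1] == n1:               # tie between the leading classes
        return majority_class
    return top

def error_rate(individual, examples, labels, consequents, majority_class, matches):
    """Eq. (3): fraction of training examples misclassified by the rule set."""
    wrong = sum(predict(individual, x, consequents, majority_class, matches) != y
                for x, y in zip(examples, labels))
    return wrong / len(examples)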

2) Entropy measure. Entropy is a commonly used measure in information theory. Originally it is used to characterize the (im)purity of an arbitrary collection of examples. In our implementation, entropy is used to measure the homogeneity of the examples that a rule matches. Given a collection S containing the examples that a certain rule R matches, let p_i be the proportion of examples in S belonging to class i; then the entropy Entropy(R) of this rule is defined as:

    Entropy(R) = -\sum_{i=1}^{n} p_i \log_2 p_i                                   (4)

where n is the number of target classifications. Since an individual consists of a number of rules, the entropy measure of an individual is calculated by averaging the entropy of its rules:

    Entropy(individual) = \frac{1}{NR} \sum_{i=1}^{NR} Entropy(R_i)               (5)

where NR is the number of rules in the individual (in our current implementation it is 10). The rationale for using the entropy measure in the fitness function is to prefer rules that match fewer examples whose target values differ from the rule's consequent. High accuracy does not implicitly guarantee that the entropy measure is good, because the final classification conclusion for a certain training example is based on the combined results of a number of rules: it is quite possible that each rule in the individual has a bad entropy measure while the whole rule set still gives the correct classification. Keeping the entropy value of an individual low helps obtain better prediction results for unseen examples.
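The entropy terms of Eqs. (4) and (5) could be computed along the following lines; this is a minimal Python sketch under the assumption that the class labels of the examples matched by each rule are already available.

import math

def rule_entropy(matched_labels):
    """Eq. (4): entropy of the class distribution among the examples a rule matches."""
    if not matched_labels:
        return 0.0                       # assumption: an empty match contributes no impurity
    total = len(matched_labels)
    ent = 0.0
    for c in set(matched_labels):
        p = matched_labels.count(c) / total
        ent -= p * math.log2(p)
    return ent

def individual_entropy(per_rule_matched_labels):
    """Eq. (5): average of the per-rule entropies over the NR rules of an individual."""
    ents = [rule_entropy(m) for m in per_rule_matched_labels]
    return sum(ents) / len(ents)

# e.g. a rule matching 8 positives and 2 negatives has entropy of about 0.72 bits
print(rule_entropy(['+'] * 8 + ['-'] * 2))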

3) Rule consistency. As stated above, the final predicted classification of a training example is the majority classification made by the rules in an individual. Consider the following classifications made by two individuals on an example:

Individual a: six rules +, four rules -, final classification: +
Individual b: nine rules +, one rule -, final classification: +

We prefer the second individual since it is less ambiguous. To address this rule consistency issue, we add another measure to the fitness function. The calculation is similar to the entropy measure. Let p_correct be the proportion of rules in one individual whose consequent equals the target value of the training example; then

    Rule consistency(individual) = -p_{correct} \log_2 p_{correct} - (1 - p_{correct}) \log_2 (1 - p_{correct})      (6)

Note that this formula gives the same value when p_correct and (1 - p_correct) are switched. Therefore a penalty is applied when p_correct is smaller than 0.5; in this case Rule consistency = 2 - Rule consistency. The above calculation is based on the prediction results for one training example; the complete rule consistency measure of an individual is averaged over the number of training examples.
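A small Python sketch of the rule consistency measure of Eq. (6), including the penalty for p_correct below 0.5; the handling of the degenerate cases p_correct = 0 or 1 is an assumption.

import math

def rule_consistency(p_correct):
    """Eq. (6) with the penalty described above: binary entropy of p_correct,
    reflected to 2 - value when fewer than half of the rules vote correctly."""
    if p_correct in (0.0, 1.0):
        value = 0.0
    else:
        value = (-p_correct * math.log2(p_correct)
                 - (1 - p_correct) * math.log2(1 - p_correct))
    return value if p_correct >= 0.5 else 2.0 - value

# Individual a: 6 of 10 rules correct; individual b: 9 of 10 rules correct
print(rule_consistency(0.6))   # ~0.97 (more ambiguous, worse, since fitness is minimized)
print(rule_consistency(0.9))   # ~0.47 (less ambiguous, better)
# These per-example values are then averaged over all training examples.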

4) Hole ratio. The last element in the fitness function is the hole ratio. It is a measure of a rule set's coverage of the training examples. Coverage is not a problem for traditional inductive learning methods like decision trees, since the process of building the tree guarantees that all training examples are covered; however, this also brings a new problem in that the tree may be sensitive to noise. The GA approach does not guarantee that the generated rules cover all the training examples. This allows flexibility and may be potentially useful for future prediction. In a real implementation we still want the coverage to reach a certain level. For instance, if a rule matches only one training example and its consequent is correct, the accuracy and entropy measures of this rule are both excellent, but we do not prefer this rule because its coverage is too low. In our fitness function the hole ratio equals 1 - coverage, where the coverage is calculated from the union of examples that are matched and also correctly predicted by the rules in an individual. Totally misclassified examples (not classified correctly by any rule in the individual) are not included even though they are matched by some rules. The following is the formula for calculating the hole ratio for a binary classification problem (positive, negative):
    Hole = 1 - \frac{\left|\bigcup_i P_i\right| + \left|\bigcup_i N_i\right|}{S}   (7)

where P_i stands for the examples whose target value is positive and that are classified as positive by rule i, and N_i likewise for negative examples, so the unions collect the examples correctly covered by at least one rule in the individual; S is the total number of training examples.
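Assuming the same rule/consequent representation as in the earlier sketches, the hole ratio of Eq. (7) could be computed as follows; this is an illustration, not the authors' implementation.

def hole_ratio(individual, examples, labels, consequents, matches):
    """Eq. (7) for the binary case: 1 - coverage, where coverage counts the union of
    examples matched AND correctly predicted by at least one rule of the individual."""
    covered = set()
    for idx, (x, y) in enumerate(zip(examples, labels)):
        for rule, c in zip(individual, consequents):
            if c == y and matches(rule, x):      # matched, and the rule's consequent is correct
                covered.add(idx)
                break
    return 1.0 - len(covered) / len(examples)    # (|P| + |N|) / S as in Eq. (7)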

3.3 Adaptive asymmetric mutation

In our reproduction step, traditional bit inversion is used for mutation. However, we found that many examples will not be matched if we keep the numbers of 1's and 0's approximately equal in an individual (i.e., the inversion probabilities from 1-0 and from 0-1 are both 50%). The learning process degenerates into a majority guess if there are too many unmatched examples. Therefore we put forward a strategy of adaptive asymmetric mutation in which the inversion probability from 1-0 (0-1) is self-adaptive during the run. The asymmetric mutation biases the population toward generating rules with more coverage of the training examples, and the self-adaptation of the inversion probability lets the optimal mutation parameter be adjusted automatically. We previously presented an adaptive simplex genetic algorithm (Yang, 2000) in which the percentage of the simplex operator is self-adaptive during the run. A similar idea is adopted here: the average fitness is used as feedback to adjust the inversion probability. The process of self-adaptation is as follows:

1) An initial inversion probability is set (e.g., 0.5 for 1-0). Use this probability for mutation to produce a new generation and calculate the average fitness of this generation.

2) Randomly select the direction of changing this probability (increase or decrease). Modify the probability along that direction by a small amount (0.02 in our current implementation). Use the new probability to produce the next generation and calculate its average fitness.

3) If the fitness is better (the value is smaller), continue in this direction, and the amount of change is

    \Delta p = \max\{0.05, (1 - fitness_{new} / fitness_{old}) \times 0.1\}        (8)

If the fitness is worse (the value is larger), reverse the direction, and the amount of change is

    \Delta p = \max\{0.05, (fitness_{new} / fitness_{old} - 1) \times 0.05\}       (9)

Use the new probability to produce the next generation and calculate the average fitness of the new generation. Repeat step 3 until the program ends.
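The following Python sketch illustrates one plausible reading of the adaptive asymmetric mutation: a per-bit mutation rate selects bits, the direction-dependent inversion probability decides whether a selected 1 (or 0) actually flips, and that probability is adapted from the average-fitness feedback of Eqs. (8) and (9). The per-bit rate, the clamping bounds on the probability and the function names are assumptions, not the authors' code.

import random

def mutate(individual, p_mut, p_one_to_zero):
    """Asymmetric bit mutation: each selected bit is inverted with a probability that
    depends on its current value (p_one_to_zero for 1->0, 1 - p_one_to_zero for 0->1)."""
    for i, bit in enumerate(individual):
        if random.random() < p_mut:
            p_inv = p_one_to_zero if bit == 1 else 1.0 - p_one_to_zero
            if random.random() < p_inv:
                individual[i] = 1 - bit
    return individual

def adapt_probability(p, direction, fitness_new, fitness_old):
    """Self-adaptation of the 1->0 inversion probability (Eqs. 8-9): keep the direction
    if the average fitness improved (got smaller), otherwise reverse it."""
    if fitness_new < fitness_old:                                   # better
        step = max(0.05, (1 - fitness_new / fitness_old) * 0.1)
    else:                                                           # worse
        direction = -direction
        step = max(0.05, (fitness_new / fitness_old - 1) * 0.05)
    p = min(0.95, max(0.05, p + direction * step))                  # assumed clamping bounds
    return p, direction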

4. Results and discussions


We tested our approach on three real databases and compared it with four other traditional data mining techniques. This section presents the test results.

4.1 The information of the databases

1) Credit database. This database concerns credit card applications. All attribute names and values have been changed to meaningless symbols to protect the confidentiality of the data. The database is interesting because there is a good mix of attributes: continuous, nominal with small numbers of values, and nominal with larger numbers of values. There are 15 attributes plus one target attribute, and the total number of instances is 690.

2) Voting database. This database holds the 1984 United States Congressional Voting Records. The data set includes the votes of each member of the U.S. House of Representatives on the 16 key votes identified by the CQA. The CQA lists nine different types of votes: voted for, paired for, and announced for (these three simplified to yea); voted against, paired against, and announced against (these three simplified to nay); and voted present, voted present to avoid conflict of interest, and did not vote or otherwise make a position known (these three simplified to an unknown disposition). There are 16 attributes plus one target attribute, and the total number of instances is 435 (267 democrats, 168 republicans).

3) Heart database. This database concerns heart disease diagnosis. The data were provided by the V.A. Medical Center, Long Beach and the Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D. There are 14 attributes plus one target attribute, and the total number of instances is 303.


4.2 The description of the non-GA approaches

We used four well-known methods from machine learning, namely Induction Decision Trees (ID3) (Quinlan, 1986), Decision Trees with Boosting (Quinlan, 1996), Neural Networks, and Naive Bayes (Mitchell, 1997), to compare with the performance of our improved GA. Appropriate state-of-the-art techniques have been incorporated in most of the non-GA methods to improve their performance. The following is a description of the non-GA approaches we used for the performance comparison studies.

1) Induction Decision Trees. The construction of a decision tree is divided into two stages: first, creating an initial, large decision tree from a training set; second, pruning the initial decision tree, if applicable, using a validation set. Given a noise-free training set, the first stage generates a decision tree that classifies all examples in the set correctly. Unless the training set covers all instances in the domain, the initial decision tree will overfit the training data, which then reduces its performance on the test data. The second stage helps alleviate this problem by reducing the size of the tree; this process has the effect of generalizing the decision tree, which hopefully improves its performance on the test data. During the construction of the initial decision tree, the selection of the best attribute is based on either the information gain (IG) or the gain ratio (GR). A binary split is applied in nodes with continuous-valued attributes; the best cut-off value of a continuous-valued attribute is selected locally within each node based on the remaining training examples. The tree's node expansion stops either when the remaining training set is homogeneous (i.e., all instances have the same target attribute value) or when no attribute remains for selection. The decision on a leaf resulting from the latter case is determined by the majority target attribute value in the remaining training set. Decision tree pruning is the process of replacing sub-trees with leaves to reduce the size of the decision tree while retaining, and hopefully increasing, the accuracy of the tree's classification. To obtain the best result from the induction decision tree method, we varied the pruning algorithm applied to the initial decision tree. We considered the following decision tree pruning algorithms: critical value pruning (Mingers, 1987), minimum error pruning (Niblett & Bratko, 1986), pessimistic pruning, cost-complexity pruning and reduced error pruning (Quinlan, 1987).
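For reference, a minimal Python sketch of the information-gain criterion used to select the best attribute at a node; the gain-ratio alternative, the binary splits on continuous attributes and the pruning algorithms mentioned above are not shown, and the data layout is an assumption.

import math
from collections import Counter

def entropy(labels):
    """Entropy of a class label collection."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attr):
    """IG of splitting on a categorical attribute: parent entropy minus the
    weighted average entropy of the partitions induced by the attribute's values."""
    n = len(labels)
    partitions = {}
    for x, y in zip(examples, labels):
        partitions.setdefault(x[attr], []).append(y)
    remainder = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

def best_attribute(examples, labels, attrs):
    """Pick the attribute with the highest information gain."""
    return max(attrs, key=lambda a: information_gain(examples, labels, a))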

2) Induction Decision Trees with Boosting. A decision tree with boosting is a method that generates a sequence of decision trees from a single training set by re-weighting and re-sampling the samples in the set (Quinlan, 1996; Freund & Schapire, 1996). Initially, all samples in the training set are equally weighted so that their weights sum to one. Once a decision tree has been created, the samples in the training set are re-weighted in such a way that misclassified examples get higher weights than the ones that are easier to classify. The new sample weights are then renormalized, and the next decision tree is created using the higher-weighted samples in the training set. In effect, this process forces the more difficult samples to be learned more frequently by the decision trees. The trees generated are then given weights in accordance with their performance on the training examples (e.g., their accuracy in correctly classifying the training data). Given a new instance, its class is selected from the maximum weighted average of the predicted classes over all decision trees.

3) Neural Networks. Inspired in part by biological learning systems, the neural network approach is built from a densely interconnected set of simple units. Since this technique offers many design choices, we fixed some of them to ones that have been well proven to be good or acceptable: in particular, we use a network architecture with one hidden layer, the back-propagation learning algorithm (Rumelhart, Hinton & Williams, 1986), the delta-bar-delta adaptive learning rates (Jacobs, 1988), and the Nguyen-Widrow weight initialization (Nguyen & Widrow, 1990). A discrete-valued attribute fed into the network's input layer is represented with a 1-of-N encoding using bipolar values to denote the presence (value 1) and absence (value -1) of an attribute value. A continuous-valued attribute is scaled to a real value in the range [-1, 1]. We varied the design choices between batch and incremental learning, and between 1-of-N and single network output encoding.

4) Naive Bayes. The naive Bayes classifier is a variant of Bayesian learning that manipulates probabilities directly from observed data and uses these probabilities to make an optimal decision. This approach assumes that attributes are conditionally independent given a class; based on this simplifying assumption, the probability of observing a conjunction of attributes is the product of the probabilities of the individual attributes. Given an instance with a set of attribute-value pairs, the naive Bayes approach chooses the class that maximizes the conditional probability of the class given the conjunction of attribute values. Although in practice the independence assumption is not entirely correct, it does not necessarily degrade the system performance (Domingos & Pazzani, 1997). We also assume that the values of continuous-valued attributes follow a Gaussian distribution; hence, once the mean and the standard deviation of these attributes are obtained from the training examples, the probability of the corresponding attribute can be calculated from the given attribute value. To avoid the zero frequency count problem, which can dampen the entire probability calculation, we use an m-estimate approach for calculating the probabilities of discrete-valued attributes (Mitchell, 1997).
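A compact Python sketch of the naive Bayes variant described above (Gaussian likelihoods for continuous attributes, m-estimate smoothing for discrete ones); the value of m, the uniform prior used inside the m-estimate and the data structures are assumptions rather than the configuration used in the paper.

import math
from collections import Counter

def gaussian(x, mean, std):
    """Gaussian likelihood for a continuous attribute value."""
    std = max(std, 1e-6)                          # avoid division by zero
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

def m_estimate(count, class_count, prior, m=1.0):
    """P(value | class) smoothed with the m-estimate to avoid zero frequencies."""
    return (count + m * prior) / (class_count + m)

def nb_predict(x, class_priors, discrete_stats, gaussian_stats, m=1.0):
    """Pick the class maximizing prior times the product of per-attribute likelihoods.
    discrete_stats[c][a] is a Counter of values; gaussian_stats[c][a] is (mean, std)."""
    best, best_log = None, float('-inf')
    for c, prior in class_priors.items():
        log_p = math.log(prior)
        for a, stats in discrete_stats[c].items():
            n_c = sum(stats.values())
            p_attr = 1.0 / max(len(stats), 1)     # assumed uniform prior inside the m-estimate
            log_p += math.log(m_estimate(stats[x[a]], n_c, p_attr, m))
        for a, (mean, std) in gaussian_stats[c].items():
            log_p += math.log(max(gaussian(x[a], mean, std), 1e-300))
        if log_p > best_log:
            best, best_log = c, log_p
    return best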

4.3 Comparison results

For each database, the k-fold cross-validation method is used for evaluation. In this method, a data set is divided equally into k disjoint subsets, and k experiments are then performed using k different training-test set pairs. The training-test set pair used in each experiment is generated by using one of the k subsets as the test set and the remaining subsets as the training set. Given k disjoint subsets, for example, the first experiment takes the first subset as the test set and uses the second through the k-th subsets as the training set; the second experiment uses subsets 1 and 3 through k as the training set and takes subset 2 as the test set, and so on. All results from a particular database are averaged, along with their variance, over the k experiment runs.
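A minimal Python sketch of the k-fold protocol described above; the deterministic way of forming the folds is an assumption, since the paper does not state how the subsets are drawn.

def k_fold_splits(n_examples, k):
    """Partition example indices into k (nearly) equal disjoint folds and yield
    (train_indices, test_indices) pairs, one per experiment."""
    indices = list(range(n_examples))
    folds = [indices[i::k] for i in range(k)]          # simple deterministic partition
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# e.g. 10 folds for the credit and voting data, 5 for the heart data
for train_idx, test_idx in k_fold_splits(690, 10):
    pass  # train on train_idx, evaluate on test_idx, then average accuracy over the folds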

Based on their sizes, the credit database and the voting database are partitioned into 10 disjoint subsets each, and the heart database is partitioned into 5 subsets. Tables 1, 2 and 3 show the performance comparison results of the different approaches on these three databases. For decision trees with and without boosting, we present only the best experimental results after varying the data-splitting methods as well as the pruning algorithms described earlier. Similarly, the results for the neural network approach are the ones that provided the best performance after varying the network output encoding and batch versus incremental learning.

Table 1. The comparison results on the prediction accuracy and standard deviation (%) for the credit database.

              Our GA     Decision trees   Decision trees     Neural networks    Naive Bayes
              approach   (IG, Min-Err)    with boosting      (1-of-N, batch
                                          (GR, Cost-Com,      learning)
                                          21 trees)
Run 1         90.77      87.69            89.23              89.23              66.15
Run 2         89.23      84.62            86.15              86.15              78.46
Run 3         89.23      89.23            90.77              90.77              84.62
Run 4         92.31      90.77            90.77              89.23              81.54
Run 5         86.15      81.54            81.54              84.62              75.38
Run 6         89.23      87.69            87.69              87.69              80.00
Run 7         84.62      81.54            84.62              84.62              73.85
Run 8         87.69      86.15            87.69              86.15              83.08
Run 9         90.77      86.15            89.23              87.69              76.92
Run 10        86.76      88.24            91.18              86.76              75.00
Average       88.68      86.36            87.89              87.29              77.50
Std. dev.      2.37       3.06             3.08               2.03               5.36

In the above table, the decision tree is generated using information gain for data splitting and minimum-error pruning. Decision trees with boosting generates 21 different decision trees, each constructed using gain ratio for data splitting and the cost-complexity pruning algorithm. Batch learning and 1-of-N output encoding are used in the neural networks.

Table 2. The comparison results on the prediction accuracy and standard deviation (%) for the voting database.

              Our GA     Decision trees   Decision trees     Neural networks    Naive Bayes
              approach   (IG, Red-Err)    with boosting      (1-of-N, batch
                                          (IG, Red-Err,       learning)
                                          3 trees)
Run 1         95.35      95.35            95.35              93.02              93.02
Run 2         97.67      93.02            97.67              97.67              90.70
Run 3         97.67      97.67            97.67              97.67              93.02
Run 4         95.35      95.35            97.67              97.67              88.37
Run 5         95.35      95.35            95.35              93.02              88.37
Run 6         97.67      97.67            97.67              97.67              88.37
Run 7         95.35      93.02            93.02              95.35              93.02
Run 8         97.67      95.35            95.35              95.35              90.70
Run 9        100.00     100.00            97.67              95.35              90.70
Run 10        95.83      93.75            97.92              95.83              85.42
Average       96.79      95.65            96.54              95.86              90.17
Std. dev.      1.59       2.23             1.67               1.46               2.53

In table 2, the best results from both decision trees with and without boosting are obtained using information gain for data splitting and the reduced-error pruning algorithm. Only three decision trees are needed in the decision trees with boosting.

Ta*le " T%e comparison results on t%e pre$iction accuracy an$ stan$ar$ $eviation 7:8 of %eart $ata*ase. @ur +A Decision trees Decision trees ?eural net&or.s ?aive >ayes approac% 7-+, 0e$)3rr8 &it% *oosting 75)of)?, incr 7-+, 0e$)3rr, learning8 (5 trees8 0un 5 '. " 5."= .5! 5."= '. " 0un ( ".69 77.'7 !.79 !.79 7'.== 0un " 5."= 7=.(7 7=.(7 7!.9 ".69 0un ! .5! 7=.(7 ".69 '. " ".69 0un 9 "."" ==.=7 75.=7 5.=7 6.66 Average 9.5! 79.75 6.77 (.!! ".5( Stan$ar$ $eviation ".=! 9.!= 9.9! 9.== !.6 -n ta*le ", t%e *est results from neural net&or.s are o*taine$ from applying incremental learning an$ 5)of)? net&or. output enco$ing. /rom t%e a*ove results &e can see t%at our +A approac% outperforme$ ot%er approac%es on *ot% t%e average pre$iction accuracy an$ t%e stan$ar$ $eviation. T%e a$vantage of our +A approac% *ecomes more o*vious on %eart $ata*ase, &%ic% is most $ifficult to learn among t%e t%ree. During t%e process of running, &e also foun$ t%at t%e training accuracy an$ testing accuracy of +A approac% are *asically in t%e same level, &%ile t%e training accuracy is often muc% %ig%er t%an testing accuracy for ot%er approac%es. T%is proves t%at +A approac% is less sensitive to noise an$ mig%t *e more effective for future pre$iction. De Bong, L. A., Spears, 4. M., an$ +or$on, D. /. 75''"8 Using genetic algorit%ms for concept learning. Mac%ine Cearning, 5", 5=5)5 . Domingos, M. an$ Ma,,ani, M. 75''78 @n t%e @ptimality of t%e Simple >ayesian Classifier un$er Xero)@ne Coss. Mac%ine Cearning, (', 56")5"6. /reun$, Noav, an$ Sc%apire, 0. 3. 75''=8 3xperiments &it% a ne& *oosting algorit%m. -n Mac%ine Cearning; Mrocee$ings of t%e T%irteen -nternational Conference, pp. 5! )59=. +reene, D., M. an$ Smit%, S. /. 75''"8 Competition)*ase$ in$uction of $ecision mo$els from examples. Mac%ine Cearning, 5", ((')(97. Baco*s, 0.A. 75' 8 -ncrease$ 0ates of Convergence T%roug% Cearning 0ate A$aptation. ?eural ?et&or.s, 57!8; ('9)"67. Bani.o&, C. X. 75''"8 A .no&le$ge)intensive genetic algorit%m for supervise$ learning. Mac%ine Cearning, 5", 5 ')(( . Mingers, B. 75' 78 3xpert Systems U 0ule -n$uction &it% Statistical Data. Bournal of t%e @perational 0esearc% Society, " , "')!7. Mitc%ell, Tom. 75''78 Mac%ine Cearning. ?e& Nor.; Mc+ra&)Dill. ?i*lett, T. 75' =8 Constructing Decision Trees in ?oisy Domains. -n -. >rat.o an$ ?. Cavrac 73$s8. Mrogress in Mac%ine Cearning. 3nglan$; Sigma Mress. ?o$a, 3., /reitas, A. A. an$ Copes, D. S. 75'''8 Discovering interesting pre$iction rules &it% a genetic algorit%m. -n Mrocee$ings of 5''' Congress on 3volutionary Computation 7C3CA ''8, pp. 5"(()5"('. ?guyen, D. an$ 4i$ro&, >. 75''68 -mproving t%e Cearning Spee$ of T&o)Cayer ?et&or.s *y C%oosing -nitial Ralues of t%e A$aptive 4eig%ts. -nternational Boint Conference on ?eural ?et&or.s, San Diego, CA, ---;(5)(=. <uinlan, B.0. 75' =8 -n$uction of Decision Trees. Mac%ine Cearning, 5, 5)56=. <uinlan, B.0. 75' 78 Simplifying Decision Trees. -nternational Bournal of Man)Mac%ine Stu$ies, (7, ((5)("!. <uinlan, B. 0. 75''=8 >agging, >oosting, an$ C!.9. -n Mrocee$ings of t%e T%irteent% ?ational Conference on Artificial -ntelligence, pp. 7(9)7"6.

5. Conclusions and future work


In this paper we put forward a genetic algorithm approach for classification problems. An individual in the population is a complete solution candidate consisting of a fixed number of rules. The rule consequent is not explicitly encoded in the string but is determined by how the rule matches the training examples. To take the factors affecting performance into account as completely as possible, four elements are included in the fitness function: prediction error rate, entropy measure, rule consistency and hole ratio. Adaptive asymmetric mutation and two-point crossover are adopted in the reproduction step; the 1-0 (0-1) inversion probability in mutation is self-adaptive, using the average fitness during the run as feedback. The classifier generated by the evolution is voting-based: rules are not disjoint but are allowed to overlap, and the classifier gives all rules equal weight in their votes. We tested our algorithm on three real databases and compared the results with four other traditional data mining approaches. Our approach outperformed the others on both prediction accuracy and standard deviation. Further testing on more databases is in progress to assess the robustness of our algorithm. Splitting continuous attributes into multiple intervals, rather than just two intervals based on a single threshold, is also being considered to improve the performance.

Bibliography

De Jong, K. A., Spears, W. M., and Gordon, D. F. (1993). Using genetic algorithms for concept learning. Machine Learning, 13, 161-188.

Domingos, P. and Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29, 103-130.

Freund, Y. and Schapire, R. E. (1996). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference, pp. 148-156.

Greene, D. P. and Smith, S. F. (1993). Competition-based induction of decision models from examples. Machine Learning, 13, 229-257.

Jacobs, R. A. (1988). Increased rates of convergence through learning rate adaptation. Neural Networks, 1(4), 295-307.

Janikow, C. Z. (1993). A knowledge-intensive genetic algorithm for supervised learning. Machine Learning, 13, 189-228.

Mingers, J. (1987). Expert systems - rule induction with statistical data. Journal of the Operational Research Society, 38, 39-47.

Mitchell, T. (1997). Machine Learning. New York: McGraw-Hill.

Niblett, T. (1986). Constructing decision trees in noisy domains. In I. Bratko and N. Lavrac (Eds.), Progress in Machine Learning. England: Sigma Press.

Nguyen, D. and Widrow, B. (1990). Improving the learning speed of two-layer networks by choosing initial values of the adaptive weights. International Joint Conference on Neural Networks, San Diego, CA, III:21-26.

Noda, E., Freitas, A. A. and Lopes, H. S. (1999). Discovering interesting prediction rules with a genetic algorithm. In Proceedings of the 1999 Congress on Evolutionary Computation (CEC '99), pp. 1322-1329.

Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81-106.

Quinlan, J. R. (1987). Simplifying decision trees. International Journal of Man-Machine Studies, 27, 221-234.

Quinlan, J. R. (1996). Bagging, boosting, and C4.5. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pp. 725-730.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.

Yang, L. and Yen, J. (2000). An adaptive simplex genetic algorithm. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2000), July 2000, p. 379.
