
JOURNAL OF COMPUTING, VOLUME 2, ISSUE 9, SEPTEMBER 2010, ISSN 2151-9617

HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
WWW.JOURNALOFCOMPUTING.ORG 89

Reproduction Operator Evaluation for CFG Induction Using Genetic Algorithm
Nitin S. Choubey and Madan U. Kharat

Abstract— Grammar Induction (also called Grammar Inference or Language Learning) is the process of learning a grammar
from training data consisting of positive and negative strings of a language. Genetic algorithms are among the techniques
that give successful results for grammar induction. This paper discusses an extended approach that uses a stochastic mutation
scheme, based on an Adaptive Genetic Algorithm, to induce grammars for a set of four different languages, and compares it
with other reproduction operators. The algorithm produces successive generations of individuals, computes their
“fitness value” at each step, and selects the best of them when the termination condition is reached. The paper deals with the
issues in implementing the algorithm: chromosome representation and evaluation, the selection and replacement strategy, and
the genetic operators for crossover and mutation. The model has been implemented, and results are presented for the set of
four languages, comparing three crossover operators and the stochastic mutation scheme together with three other mutation
operators.

Index Terms—Learning, Induction, Evolutionary Computation, Genetic Algorithm.


1 INTRODUCTION

The field of evolutionary computing has been applying problem-solving techniques that are similar in intent to the Machine Learning recombination methods. Most evolutionary computing approaches have in common that they try to find a solution to a particular problem by recombining and mutating individuals in a society of possible solutions [1]. The origin of evolutionary algorithms was an attempt to mimic some of the processes taking place in natural evolution. An Evolutionary Algorithm (EA) is an iterative and stochastic process that operates on a set of individuals (the population). Each individual represents a potential solution to the problem being solved. This solution is obtained by means of an encoding/decoding mechanism. Initially, the population is randomly generated (perhaps with the help of a construction heuristic). Every individual in the population is assigned, by means of a fitness function, a measure of its goodness with respect to the problem under consideration. This value is the quantitative information the algorithm uses to guide the search. Among the evolutionary techniques, genetic algorithms (GAs) are the most widespread group of methods representing the application of evolutionary tools. They rely on the use of selection, crossover and mutation operators. Replacement is usually by generations of new individuals. Intuitively, a GA proceeds by creating successive generations of better and better individuals by applying very simple operations. The search is guided only by the fitness value associated with every individual in the population. This value is used to rank individuals depending on their relative suitability for the problem being solved [2].

This paper presents work conducted by the authors on various reproduction operators, along with a new stochastic mutation operator based on an Adapted Genetic Algorithm, which works with a random mask with a uniform distribution of bits over the chromosome length. The paper includes a performance evaluation of the operators discussed, applied to CFG induction for four different languages.

The rest of the paper proceeds as follows. Section 2 gives a brief overview of the CFG induction process; the chromosome structure adopted is discussed in Section 3, whereas Section 4 covers the crossover and mutation operators. Section 5 and Section 6 give the details of the implementation done by the authors for CFG induction with the Genetic Algorithm, followed by the conclusion of the paper.

Nitin S. Choubey is a Ph.D. student with the Department of Computer Science, Sant Gadge Baba Amravati University, Amravati, Maharashtra, India.
Madan U. Kharat is Principal of Pankaj Laddhad Institute of Technology and Management, Buldana, in Sant Gadge Baba Amravati University, Amravati, Maharashtra, India.

2 CFG INDUCTION PROCESS
Inductive inference is the process of making generalizations from samples. In conventional grammatical induction, a language acceptor is constructed to accept all the positive examples. Learning from positive examples only is called text learning. A more powerful technique uses negative examples as well; this is learning with an informant. In informant learning, the language acceptor is constructed so as to accept all the positive examples and reject all the negative examples. A positive sentence is a sentence that is represented by the grammar of a language and is hence included in the language. A negative sentence is defined accordingly, as one that is not included in the language. The problem discussed here is finding generalizations for Context Free Languages from finite sets of positive and negative examples.

Wyard [3] explored the impact of different grammar representations, and the experimental results show that an evolutionary algorithm using standard context-free grammars (Backus-Naur Form, BNF) outperformed other representations. In formal language theory, a context-free grammar (CFG) [4] is a grammar in which every production rule is of the form
V → w,
where V is a single nonterminal symbol and w is a string of terminals and/or nonterminals (possibly empty).
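To make the definition concrete, the following Python sketch stores a CFG as a mapping from each nonterminal V to its right-hand sides w and tests whether a string is accepted. It is an illustration only, not the authors' implementation (which was written in Java). The grammar S → 1 0 S | epsilon for (10)^n, the function names, and the `slack` bound on the length of sentential forms are all assumptions introduced here.

```python
from collections import deque

# Hypothetical grammar for strings of the form (10)^n: S -> 1 0 S | epsilon.
# Keys are nonterminals; each right-hand side is a tuple of symbols,
# and the empty tuple stands for the epsilon production.
GRAMMAR = {"S": [("1", "0", "S"), ()]}


def is_cfg(grammar):
    """Check that every rule has the form V -> w with one nonterminal V on the left."""
    return all(len(head) == 1 and head.isupper() for head in grammar)


def accepts(grammar, start, target, slack=2):
    """Bounded breadth-first search over leftmost derivations.

    Sentential forms longer than len(target) + slack are discarded, so the
    search always terminates; this bound is a simplifying assumption.
    """
    limit = len(target) + slack
    seen = {(start,)}
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        # Position of the leftmost nonterminal, or None if the form is all terminals.
        idx = next((i for i, sym in enumerate(form) if sym in grammar), None)
        if idx is None:
            if "".join(form) == target:
                return True
            continue
        # The terminals before the leftmost nonterminal must match the target.
        if not target.startswith("".join(form[:idx])):
            continue
        for body in grammar[form[idx]]:
            new_form = form[:idx] + tuple(body) + form[idx + 1:]
            if len(new_form) <= limit and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return False
```

With this toy grammar, `accepts(GRAMMAR, "S", "1010")` succeeds while `accepts(GRAMMAR, "S", "100")` fails, which is the kind of positive/negative check that informant learning relies on.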

3 CHROMOSOME STRUCTURE
A sequentially structured chromosome [5], consisting of a random sequence of 0s and 1s, is used in the implementation. The decoding procedure maps the random chromosome to grammar symbols according to a bit sequence whose width is based on the number of terminals available in the given sample data. In the mapping from bit representation to symbolic representation, “111” and “1111” are taken as the “epsilon” symbol in the 3-bit and 4-bit representations respectively. If the language contains fewer than 4 terminals, the 3-bit representation is used along with 4 variables. For example, let the language L be given by L = {(10)^n | n ≥ 0}; the symbol mapping for decoding the chromosome for L is shown in Fig. 1.

Fig. 1. Symbol mapping for L = {(10)^n | n ≥ 0}.

Note that there are two terminals, ‘a’ and ‘b’, in the language, which are represented by “100” and “101” respectively. The chromosome is processed with a special operator that masks every 5th symbol in the chromosome to ‘V’ by changing the first bit of its binary representation from ‘1’ to ‘0’, without adding any new production to the grammar derived from the chromosome. It is similar to the expansion operator used in [6], [7], which adds two extra productions to the set of rules derived. The resultant productions are then processed for left recursion removal followed by left factoring. The sample data is used to evaluate the fitness of the resultant production set by checking acceptance of the positive samples and rejection of the negative samples. The process of grammar mapping from the binary chromosome is shown in Fig. 2.

Fig. 2. Grammar mapping from the binary chromosome for L = {(10)^n | n ≥ 0}.

4 REPRODUCTION OPERATORS USED
Recombination, or sexual reproduction, is a key operator in natural evolution. It takes two chromosomes and produces two new chromosomes by mixing the genes found in the originals. In biology, the most common form of recombination is crossover: two chromosomes are cut at one point and the halves are spliced to create new chromosomes. The effect of recombination is very important because it allows characteristics from two different parents to be assorted. If the father and the mother possess different good qualities, we would expect that all the good qualities will be passed on to the child. Thus the offspring, just by combining all the good features from its parents, may surpass its ancestors. The mixing of genetic material via sexual reproduction is one of the most powerful features of Genetic Algorithms. The Genetic Algorithm representation does not differentiate between male and female individuals.

Mutation is the other way to get new offspring. Mutation consists of changing the value of genes. In natural evolution, mutation mostly engenders non-viable genomes, and it is not a very frequent operator. Nevertheless, in optimization, a few random changes can be a good way of exploring the search space quickly [2]. This paper focuses on experimentation with three crossover and four mutation operators.

4.1 Crossover Operators Used
The three crossover operators used in the experimentation are the Two Point crossover method, a variant of the cyclic crossover with internal swapping, and the Uniform crossover method.

4.1.1 Two Point Crossover Method (C1)
In the two point crossover method, the parent chromosomes are cut at two random points, and child1 is created by replacing the slice between the two cuts in parent1 with the corresponding slice from parent2. The same process is carried out to generate child2. An example is shown in Fig. 3(a).

4.1.2 A Variant of the Two Point Crossover with Internal Swapping (C2)
In this crossover method, internal swapping is done on the slices, and the slices are then used in a cyclic manner to generate the children. An example is shown in Fig. 3(b).
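As a rough illustration of the Two Point crossover (C1), the Python sketch below cuts two equal-length bit strings at two random points and exchanges the middle slice. Representing chromosomes as strings of '0' and '1', the function name, and the `rng` parameter are assumptions made for this example; the authors' Java code is not reproduced in the paper.

```python
import random


def two_point_crossover(parent1, parent2, rng=random):
    """Swap the slice between two random cut points of two bit strings."""
    if len(parent1) != len(parent2):
        raise ValueError("parents must have the same length")
    n = len(parent1)
    # Two distinct cut points in 0..n, sorted so that a < b.
    a, b = sorted(rng.sample(range(n + 1), 2))
    # child1 keeps parent1 outside the cuts and takes parent2's slice inside.
    child1 = parent1[:a] + parent2[a:b] + parent1[b:]
    # child2 is built the same way with the roles of the parents exchanged.
    child2 = parent2[:a] + parent1[a:b] + parent2[b:]
    return child1, child2
```

Passing a seeded `random.Random` instance as `rng` makes a run repeatable, which helps when comparing operators over several GA runs.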

4.1.3 A Variant of the Uniform Crossover Method (C3)
Each gene in the offspring is created by copying the corresponding gene from one or the other parent, chosen according to a randomly generated binary crossover mask of the same length as the chromosomes. Where there is a 1 in the crossover mask, the gene is copied from the first parent, and where there is a 0 in the mask, the gene is copied from the second parent. A new crossover mask is randomly generated for each pair of parents. Child chromosomes therefore contain a mixture of genes from each parent. The number of effective crossing points is not fixed, but will average L/2 (where L is the chromosome length). An example is shown in Fig. 3(c).

Fig. 3. Crossover Methods Used.

4.2 Mutation Operators Used
The four mutation operators used in the experimentation are the Inverse Mutation method, the block copier with fixed length, the Stochastic Mutation method, and the block copier with random length. Examples are shown in Fig. 4.

Fig. 4. Mutation Methods Used.

4.2.1 Inverse Mutation Method (M1)
Each gene in the offspring is created by inverting the corresponding gene of the parent (1 for 0 and 0 for 1).

4.2.2 Block Copier with Fixed Length (M2)
Each child is created by shifting a fixed-length block of genes from the middle of the chromosome to the place starting at a random location i, as shown below:
1. Copy the first i genes from the parent to the child.
2. Copy a fixed-length block (say of length j) from the middle of the parent chromosome to the child.
3. Copy all the genes from location i+j through the length of the chromosome, say L, to the child.
The example uses the random location i = 4, with j = 2 and L = 10.

4.2.3 Stochastic Mutation Method (M3)
This operator is similar to the Uniform crossover operator. Each gene in the offspring is created by copying either the corresponding gene of the parent or its inverse, chosen according to a randomly generated binary mask of the same length as the chromosome. Where there is a 1 in the mask, the gene is copied unchanged from the parent, and where there is a 0 in the mask, the inverse of the gene (1 for 0 and 0 for 1) is copied to the offspring.

4.2.4 Block Copier with Random Length (M4)
Each child is created by shifting a random-length block of genes (restricted to L/2) from the middle of the chromosome to the place starting at a random location i, as shown below:
1. Copy the first i genes from the parent to the child.
2. Copy a variable-length block (say of length j) from the middle of the parent chromosome to the child.
3. Copy all the genes from location i+j through the length of the chromosome, say L, to the child.
The example uses the random location i = 4, with j = 3 and L = 10. The value of j is chosen randomly in M4, whereas it is fixed in M2.

5 THE GA METHOD ADAPTED
The method adapted is based on the Neo-Darwinism theory given by Charles Darwin in 1859 and on the laws of nature, “The Preservation of favored races in the struggle for life” [8]. According to the theory, in the animal world the fittest members get the opportunity to breed and multiply, as they are able to attract females, whereas the others are unable to combat illness and are prone to mutation. The adapted method maintains two separate populations for the purpose of reproduction, and the generated offspring populations are then overlapped to get the next generation.

The initial random population (with n individuals) is generated and used as the first-generation population. Every current population is then used to get the next generation with the help of the reproduction operators discussed in Section 4. Proportionate amounts of the intermediate population are generated by using the crossover (n*CR1) and mutation (n*CR2) operators separately, and the intermediate population is then merged with the current population to get the subsequent population, so that applying a high amount of mutation does not lead to random search. CR1 and CR2 are the fractions of the intermediate population generated by the crossover and mutation operators respectively. This merging process moves the elite members into the next level in each generational cycle. The Genetic Algorithm process adopted for the experiment is shown in Fig. 5.

Fig. 5. The GA Method Adopted.

6 EXPERIMENTAL SETUP AND RESULT ANALYSIS
The experiment was done with JDK 1.4 on an Intel® Core™2 CPU at 2.66 GHz with 2 GB RAM. The population size is 50, the chromosome size is 240, the corpus size is 50, and the maximum number of generations considered is 400. CR1 = 0.9 and CR2 = 0.8. The languages used for the experiment, corresponding to the various GA runs, are listed in Table 1.

TABLE 1
THE LANGUAGES USED.

The languages chosen for the experiment are a collection of Context Free Languages as well as Regular Languages having varying patterns of 0's and 1's. The positive and negative string sets required for the experiment are generated by using the Minimum Length Description Principle (MLDP) [9]. Strings over the terminals of the given language are generated for the length L, starting with L = 0 and gradually increasing L to get the required number of strings. The validity of each generated string is checked with the best known grammar for the language. The invalid strings generated during this process are considered as negative strings. The best grammar is represented as <V, Σ, P, S>, where V is a finite set of variables, Σ is a finite set of terminals, P is a finite set of production rules, and S is the start variable. The results obtained through the experiment are discussed in terms of the resultant best grammar, the generation chart, and the number of GA runs conducted for the first ten successful results. The best grammars generated for the different languages are given in Table 2.

TABLE 2
THE RESULTANT BEST EQUIVALENT GRAMMAR AND ITS FITNESS VALUE FOR DIFFERENT LANGUAGES.

Due to the stochastic nature of the Genetic Algorithm, the result is obtained as the average of 10 GA runs for the languages considered, over the combinations of crossover and mutation operators [10]. The results shown here are based on the examples selected. The resultant grammars shown are those that succeeded in accepting the positive examples and rejecting the negative examples considered for the experiment. The reproduction operators are ranked based on the number of generations required to get an individual having the Average Threshold Fitness Value for accepting all the positive samples and rejecting all the negative samples. The Rank Matrices for the languages L1, L2, L3 and L4 are shown in Table 3, Table 4, Table 5 and Table 6 respectively. The cumulative Rank Matrix for the reproduction operators, Crossover (C1, C2 and C3) and Mutation (M1, M2, M3 and M4), is shown in Table 7. Table 7 also shows the Cumulative Rank for Crossover Operators (CCR), the Cumulative Rank for Mutation Operators (CMR), the Overall Rank for the Crossover Operator (ORCO) and the Overall Rank for the Mutation Operator (ORMO) obtained for the various operators from the result analysis.

TABLE 3
THE RANK MATRIX FOR THE LANGUAGE L1.

TABLE 4
THE RANK MATRIX FOR THE LANGUAGE L2.

TABLE 5
THE RANK MATRIX FOR THE LANGUAGE L3.

TABLE 6
THE RANK MATRIX FOR THE LANGUAGE L4.

TABLE 7
THE CUMULATIVE RANK MATRIX FOR REPRODUCTION OPERATORS.

The analysis indicates that the Stochastic Mutation method (M3) and the Two Point Crossover Method (C1) are found to be best in induction of the grammar for the languages considered, whereas the operator combination C2-M3 is found to be the best combination for the same. The Average of the Best Fitness Values over ten GA runs and the data analysis for the languages L1, L2, L3 and L4 are shown in Table 8 through Table 11. The comparative charts for the execution time taken and the number of generations required to reach the average threshold value by the various operator combinations are given in Fig. 6 and Fig. 7 respectively. The generation charts for the average of the Best Fitness Value for the combinations of crossover and mutation operators are shown in Fig. 8.

Fig. 6. Comparison of Execution Time per Generation over Various Operator Combinations.

Fig. 7. Average Threshold Chart for the Various Operator Combinations.

7 CONCLUSION
The proposed model has been implemented, and the result analysis for the set of four languages, comparing three crossover operators and the stochastic mutation scheme along with three other mutation operators, has been carried out. The Stochastic Mutation method (M3) and the Two Point Crossover Method (C1) are found to be best in induction of the grammars considered in the experiment, whereas the operator combination C2-M3 is found to be the best combination for the same. MLDP is found to be more effective in the selection of the corpus. The selection of a good quality corpus (positive and negative string inputs) has resulted in induction of good quality grammars for the languages considered. The results have shown a tendency towards local optimum convergence, which requires special attention in future work.
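For reference, the mask-based operators compared in this paper, namely the uniform crossover variant (C3), the inverse mutation (M1) and the stochastic mutation (M3), can be sketched in Python as below. This is a minimal sketch under stated assumptions: chromosomes are strings of '0' and '1', masks are lists of 0/1 integers, all function names are hypothetical, and producing a second, complementary child in the crossover is an assumption, since the text describes only how one child is built.

```python
import random


def random_mask(length, rng=random):
    """Binary mask with each bit drawn uniformly from {0, 1}."""
    return [rng.randint(0, 1) for _ in range(length)]


def flip(gene):
    """Invert a single gene (1 for 0 and 0 for 1)."""
    return "1" if gene == "0" else "0"


def uniform_crossover(parent1, parent2, mask):
    """C3: a mask bit of 1 copies from the first parent, 0 from the second."""
    child1 = "".join(a if m else b for a, b, m in zip(parent1, parent2, mask))
    # Assumed: the second child uses the complementary choice at every position.
    child2 = "".join(b if m else a for a, b, m in zip(parent1, parent2, mask))
    return child1, child2


def inverse_mutation(parent):
    """M1: every gene of the parent is inverted."""
    return "".join(flip(g) for g in parent)


def stochastic_mutation(parent, mask):
    """M3: a mask bit of 1 keeps the gene, a mask bit of 0 inverts it."""
    return "".join(g if m else flip(g) for g, m in zip(parent, mask))
```

Seen this way, M1 is the special case of M3 with an all-zero mask, and an all-one mask leaves the parent unchanged; with a uniformly random mask about half of the genes are inverted on average.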
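The generational cycle of the GA method (n*CR1 offspring from crossover, n*CR2 offspring from mutation, then a merge with the current population) might look like the following Python sketch. Several details are assumptions, because the paper gives them only in Fig. 5: parents are picked uniformly at random, the merged pool is truncated to the n fittest individuals, and the toy fitness and operators at the bottom are stand-ins rather than the grammar-based ones used by the authors.

```python
import random


def next_generation(population, fitness, crossover, mutate,
                    cr1=0.9, cr2=0.8, rng=random):
    """One cycle: breed by crossover and by mutation, then merge with the parents."""
    n = len(population)
    n_cross, n_mut = round(n * cr1), round(n * cr2)

    # Intermediate population, part 1: n*CR1 children produced by crossover.
    children = []
    while len(children) < n_cross:
        p1, p2 = rng.sample(population, 2)  # assumed: random parent choice
        children.extend(crossover(p1, p2))
    children = children[:n_cross]

    # Intermediate population, part 2: n*CR2 children produced by mutation.
    mutants = [mutate(rng.choice(population)) for _ in range(n_mut)]

    # Merge with the current population and keep the n fittest (assumed rule),
    # so the best individual found so far always survives.
    merged = population + children + mutants
    return sorted(merged, key=fitness, reverse=True)[:n]


# Toy stand-ins, hypothetical and for demonstration only.
def toy_fitness(chrom):
    return chrom.count("1")


def toy_crossover(p1, p2):
    cut = len(p1) // 2
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]


def toy_mutate(chrom):
    return "".join("1" if g == "0" else "0" for g in chrom)
```

Because the current population takes part in the merge, the best fitness can never decrease from one generation to the next, which matches the remark that the merging step carries elite members forward.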

TABLE 8
THE STATISTICAL ANALYSIS FOR THE LANGUAGE L1.

TABLE 9
THE STATISTICAL ANALYSIS FOR THE LANGUAGE L2.

TABLE 10
THE STATISTICAL ANALYSIS FOR THE LANGUAGE L3.

TABLE 11
THE STATISTICAL ANALYSIS FOR THE LANGUAGE L4.

Fig. 8. Generation Chart for the Languages L1, L2, L3 and L4.

ACKNOWLEDGMENT
The authors thank Dr. V. M. Thakre, P. G. Department of Computer Science, Sant Gadge Baba Amravati University, Amravati, Maharashtra, for his kind support in providing the laboratory infrastructure facilities required for conducting the experiment.

REFERENCES
[1] Guy De Pauw, “Evolutionary Computing as a Tool for Grammar Development”, CNTS - Language Technology Group, UIA - University of Antwerp, Antwerp, Belgium, GECCO 2003, LNCS 2723, pp. 549-560, Springer-Verlag Berlin Heidelberg, 2003.
[2] Sivanandam, Deepa, “Introduction to Genetic Algorithm”, Springer, 2008.
[3] Wyard, P., “Representational Issues for Context-Free Grammar Induction Using Genetic Algorithm”, Proceedings of the 2nd International Colloquium on Grammatical Inference and Applications, Vol. 862, pp. 222-235, 1994.
[4] John E. Hopcroft, Rajeev Motwani, Jeffrey D. Ullman, “Introduction to Automata Theory, Languages, and Computation”, 3/E, Addison-Wesley, 2007.
[5] Choubey N. S. and Kharat M. U., “Sequential Structuring Element for CFG Induction Using Genetic Algorithm”, International Journal of Futuristic Computer Application, 1(1):12-16, 2010.
[6] Ernesto Rodrigues and Heitor Silvério Lopes, “Genetic Programming with Incremental Learning for Grammatical Inference”, Proceedings of the Sixth International Conference on Hybrid Intelligence Systems (HIS), pp. 47-47, IEEE, 2006.
[7] Ernesto Rodrigues and Heitor Silvério Lopes, “Genetic Programming for Induction of Context-free Grammars”, Seventh International Conference on Intelligent Systems Design and Applications (ISDA), pp. 297-302, IEEE, 2007.
[8] Darwin C., “The Origin of Species”, 1859, Sixth London Edition, 1999.
[9] Bill Keller and Rudi Lutz, “Evolving stochastic context-free grammars from examples using a minimum description length principle”, paper presented at the Workshop on Automata Induction, Grammatical Inference and Language Acquisition, ICML-97.
[10] Marc Lankhorst, “A Genetic Algorithm for the Induction of Nondeterministic Pushdown Automata”, Technical Report CS-R 9502, University of Groningen, The Netherlands, May 1995.

Nitin S. Choubey, BE (1995), MBA (1997), ME (2002), Ph.D. [Management] (2004), was educated at Sant Gadge Baba Amravati (SGBA) University, Maharashtra, India. He is pursuing a Ph.D. program in the faculty of Computer Science & Engineering at SGBA University. Presently he is working at Mukesh Patel School of Technology Management and Engineering at SVKM's NMIMS-deemed-to-be-University, Shirpur Campus, Shirpur, Dhule, Maharashtra, India, as Associate Professor and Head of the Computer Department. He has presented papers at National/International conferences and also published papers in National/International Journals on various issues of Computer Engineering and Management. To his credit, he has published books on various topics in Computer Science and Management subjects. His areas of interest include Algorithms, Theoretical Computer Science, and Computer Networks and Internet.

Madan U. Kharat, BE (1992), MS (1995), Ph.D. (2006), was educated at Amravati University. Presently he is working at the Pankaj Laddhad Institute of Technology & Management, Buldana, SGBA University, Amravati, Maharashtra, India, as Principal. He has presented papers at National and International conferences and also published papers in National and International Journals on various aspects of Computer Engineering and Networks. He has worked in various capacities in academic institutions at the level of Professor and Head of Computer Department. His areas of interest include Digital Signal Processing, Computer Networks, and the Internet.