
A CONNECTIONIST SYMBOL MANIPULATOR THAT INDUCES REWRITE RULES IN CONTEXT-FREE GRAMMARS

Sreerupa Das & Michael C. Mozer*

Introduction.

We describe a connectionist architecture that is able to learn to parse strings in a context-free grammar (CFG) from positive and negative examples. To illustrate, consider the grammar in Figure 1a. The grammar consists of terminal symbols (denoted by lower case letters), nonterminal symbols (denoted by upper case letters), and rules that specify how a symbol string can be reduced to a nonterminal, for example, ab reduced to S. The grammar can be used to parse strings like aabb into a tree structure as shown in Figure 1b. If a left-to-right parsing strategy is taken, then the steps in the reduction process for the string aabb would be as shown in Figure 1c. Our architecture attempts to learn explicit rewrite rules in a grammar of the form in Figure 1a, so as to be able to reduce (or correctly parse) positive examples as shown in Figure 1c. This involves the ability to iteratively substitute a single nonterminal in place of a string of symbols, that is, to reduce more than one symbol to one. Since this architecture takes a left-to-right parsing strategy, it is suitable for LR grammars. Any CFG can be classified as an LR(n) grammar, which means that strings can be parsed from left to right with n symbols of lookahead. In the present work, we examine only LR(0) grammars, although the architecture can be generalized to any n.
Figure 1: (a) The rewrite rules in a grammar for the language a^n b^n. (b) A parse tree for the string aabb. (c) The stages of left-to-right reduction. Rectangles around the symbols indicate the section of the string that has been exposed to the parser so far. In the context of the network, the rectangles denote the symbols that were either presented to the network at an earlier time step or were written on the scratch pad as a result of a prior reduction step.

Giles et al. (1990), Sun et al. (1990), and Das, Giles, and Sun (1992) have previously explored the learning of CFGs using neural network models. Their approach was based on the automaton perspective of a recognizer, where the primary interest was to learn the dynamics of a pushdown automaton.
*Department of Computer Science & Institute of Cognitive Science, University of Colorado, Boulder, CO 80309-0430, USA

© 1993 The Institution of Electrical Engineers. Printed and published by the IEE, Savoy Place, London WC2R 0BL, UK.


Figure 2: The network architecture.

There has also been work in CFG inference using symbolic approaches (for example, Cook & Rosenfeld, 1974; Crespi-Reghizzi, 1971; Fass, 1983; Knobe & Knobe, 1976; Sakakibara, 1988). These approaches require a significant amount of prior information about the grammar and, although theoretically sound, have not proven very useful in practice.

Processing Mechanism. Once learning is complete, we envision a processing mechanism that has the following dynamics. An input string is copied into a linear scratch pad memory. The purpose of the scratch pad is to hold the transitional stages of the string during the reduction process. A set of demons looks through the scratch pad, each looking for a specific pattern of symbols. A read-head determines the location on the scratch pad memory where the demons should focus their attention. When a demon finds its pattern on the scratch pad, it fires, which causes the elements of its pattern to be replaced by a symbol associated with that demon. This action corresponds to the reduction of a string to a nonterminal symbol in accordance with a rule of the grammar. The read-head starts from the left end of the string in the scratch pad and makes a right shift when none of the demons fire. This process continues until the read-head has processed all symbols in the scratch pad and no demon can fire. The sequence of demon firings provides information about the hierarchical structure of the string. If the string has been reduced correctly, the final contents of the scratch pad will simply be the root symbol S, as illustrated in Figure 1c.

Architecture. The architecture consists of a two-layer network and a scratch pad memory (Figure 2). A set of demon units and a set of input units constitute the two layers in the network. Each demon unit is associated with a particular nonterminal. Several demons may be associated with the same nonterminal, leading to rewrite rules of a more general form, for example, X → ab | Yc. The read-head of the scratch pad memory is implemented by the input units. At a particular time step, the input units make two symbols from the scratch pad memory visible to the demon units. If a demon recognizes the ordered pair of symbols, it replaces the two symbols by the nonterminal symbol it represents. Since all CFGs can be formalized by rules that reduce two symbols to a nonterminal, presenting only two symbols to the demon units at a time places no restriction on the class of grammars that can be recognized. In our architecture, the scratch pad is implemented as a combination of a stack and an input queue, details of which will be discussed in a subsequent section. The architecture does require some prior knowledge about the grammars to be processed: the maximum number of rewrite rules and the maximum number of rules that have the same left-hand side need to be specified in advance. This information puts a lower bound on the number of demon units the network may have.
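To make the discrete mechanism concrete, the following is a minimal sketch in Python (our illustration, not the authors' implementation; the rule table and function names are ours) of the demon/read-head loop applied to the grammar of Figure 1a:

```python
# A minimal sketch of the discrete processing mechanism: demons inspect the two
# symbols under the read-head and fire when they match a rewrite rule; when no
# demon fires, the read-head shifts one symbol to the right.

RULES = {("a", "b"): "S",   # S -> ab   (grammar of Figure 1a)
         ("a", "X"): "S",   # S -> aX
         ("S", "b"): "X"}   # X -> Sb

def parse(string):
    """Return True if `string` reduces to the root symbol S."""
    stack = []                    # "seen" part of the scratch pad
    queue = list(string)          # "unseen" part of the scratch pad
    while True:
        pair = tuple(stack[-2:])  # the two symbols visible to the demons
        if pair in RULES:         # a demon fires: reduce the pair
            stack[-2:] = [RULES[pair]]
        elif queue:               # no demon fires: shift right
            stack.append(queue.pop(0))
        else:                     # nothing left to shift or reduce
            return stack == ["S"]

print(parse("aabb"), parse("abb"))   # True False
```

In the trained network, the role of this rule table is played by the learned demon weights, and firing becomes graded, as described next.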

Continuous Dynamics. So far, we have described the model in a discrete way: demon unit firing is all-or-none and mutually exclusive, corresponding to the demon units achieving a unary representation. This may be the desired behavior following learning, but neural net learning algorithms like back-propagation require exploration in continuous state and weight spaces and therefore need to allow partial activity of demon units. The activation dynamics of the demon units have therefore been formulated as follows. Demon unit i computes the distance between the input vector, x, and its weight vector, w_i:

d_i = b_i \sum_j (w_{ij} - x_j)^2,

where b_i is an adjustable bias associated with the unit. The activity of demon unit i, denoted s_i, is computed via a normalized exponential transform (Bridle, 1990; Rumelhart, in press),

s_i = e^{-d_i} / \sum_k e^{-d_k},

which enforces a competition among the units. A special unit, called the default unit, is designed to respond when none of the demon units fires strongly. Its activity, s_def, is computed like that of any other demon unit, with d_def = b_def. The activation of the default unit determines the amount of right shift to be made by the read-head.
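A small numerical sketch of these dynamics (ours; the array shapes and example values are illustrative assumptions):

```python
import numpy as np

def demon_activities(x, W, b, b_def):
    """Continuous demon dynamics: bias-scaled squared distances passed through a
    normalized exponential, with a default unit whose distance is its bias alone."""
    d = b * np.sum((W - x) ** 2, axis=1)   # d_i = b_i * sum_j (w_ij - x_j)^2
    d = np.append(d, b_def)                # default unit: d_def = b_def
    e = np.exp(-d)
    s = e / e.sum()                        # s_i = exp(-d_i) / sum_k exp(-d_k)
    return s[:-1], s[-1]                   # demon activities, default activity

# Example: 3 demon units looking at two composite symbols of 4 components each.
rng = np.random.default_rng(0)
s, s_def = demon_activities(x=rng.random(8), W=rng.random((3, 8)),
                            b=np.ones(3), b_def=1.0)
```

When one demon's weights closely match the input, its activity approaches 1 and s_def approaches 0; when nothing matches, s_def dominates and drives the right shift.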

Continuous Scratch Pad Memory. The two-to-one reduction of the symbols in the scratch pad shrinks the length of the partially reduced string. To incorporate this fact in the architecture, the scratch pad memory has been implemented as a stack and an input queue. The stack holds the "seen" part of the input string that has been (completely or partially) processed so far; this corresponds to the sections of the string bounded by rectangles at various time steps in Figure 1c. The "unseen" part of the string is contained in the input queue. The top two symbols on the stack are made visible to the network at a particular time step. Reduction of a pair of symbols on the scratch pad corresponds to popping the top two symbols from the stack and pushing the nonterminal symbol associated with the demon unit that recognized the inputs. The left-to-right movement of the read-head is achieved by moving the next symbol from the input queue onto the top of the stack; this operation is performed when no demon unit fires. Since the demon units can be partially active, reading symbols and reducing symbols using stack operations need to be performed partially. This can be accomplished with a continuous stack of the sort used in Giles et al. (1990). Unlike a discrete stack, where an item is either present or absent in a cell, items on a continuous stack can be present to varying degrees.


Figure 3: A continuous stack. The symbols indicate the contents; the height of a stack entry indicates its thickness, also given by the number to the right. The top composite symbol on the stack is a combination of the items forming a total thickness of 1.0; the next composite is a combination of the items making up the next 1.0 units of thickness.

With each item on the stack we associate a thickness, a scalar in the interval [0,1] corresponding to what fraction of the element is present. An example of the continuous stack is shown in Figure 3. To understand how the thickness plays a role in processing, we digress briefly and explain the encoding of symbols. Both on the stack and in the network, symbols are represented by numerical vectors that have one component per symbol. The vector representation of some symbol z, denoted r_z, has value 1 for the component corresponding to z and 0 for all other components. If the symbol has thickness t, the vector representation is t r_z. Although items on the stack have different thicknesses, the network is presented with composite symbols having thickness 1.0. Composite symbols are formed by combining stack items; for example, in Figure 3, composite symbol 1 is the thickness-weighted sum of the vectors of the top items, with weights .2, .5, and .3. The two symbols that the input units present to the network at every time step are the top two composite symbols on the stack.

The advantages of a continuous stack are twofold. First, it is required for network learning; if a discrete stack were used, a small change in weights could result in a big (discrete) change in the stack. This was the motivation underlying the continuous stack memory used by Giles et al. Second, a continuous stack is differentiable and hence allows us to back propagate error through the stack during learning. Giles et al. did not consider back propagation through the stack.

At every time step, the network performs two operations on the stack:

Pop. If a demon unit fires, the top two composite symbols should be popped from the stack, corresponding to reduction (to make room for the demon's symbol). If no demon fires, in which case the default unit becomes active, the stack should remain unchanged. These behaviors, as well as interpolated behaviors, are achieved by multiplying by s_def the thickness of all items on the stack contributing to the top two composite symbols. Since s_def is 0 when one or more demon units are strongly active, the top two composite symbols get erased from the stack; when s_def is 1, the stack remains unchanged.
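The following sketch (ours, not the authors' code; the class and method names are our own) shows one way to realize the continuous stack: composite symbols are thickness-weighted sums of the topmost items making up 1.0 units of thickness, and the pop step scales by s_def the thickness of the items contributing to the top two composites.

```python
import numpy as np

class ContinuousStack:
    """A sketch of the continuous stack: items are [vector, thickness] pairs,
    with the top of the stack at the end of the list."""

    def __init__(self, n_symbols):
        self.n = n_symbols
        self.items = []   # the paper resets the stack to a zero-vector "null"
                          # item of infinite thickness before each example

    def composite(self, k):
        """Thickness-weighted sum of the items forming the k-th unit of
        thickness from the top (k = 0 or 1)."""
        vec, skip, need = np.zeros(self.n), float(k), 1.0
        for v, t in reversed(self.items):
            use = max(0.0, min(t - skip, need))   # portion of this item in composite k
            vec += use * v
            skip = max(0.0, skip - t)
            need -= use
            if need <= 0.0:
                break
        return vec

    def pop(self, s_def):
        """Multiply by s_def the thickness of every item contributing to the
        top two composite symbols (the top 2.0 units of thickness)."""
        depth = 0.0
        for item in reversed(self.items):
            if depth >= 2.0:
                break
            depth += item[1]
            item[1] *= s_def                      # erased when s_def is near 0
        self.items = [it for it in self.items if it[1] > 1e-6]
```

The vectors returned by composite(0) and composite(1) are what the input units present to the demon units at each time step.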

Push. The symbol written onto the stack is the composite symbol formed by summing the identity vectors of the demon units, weighted by their activities, \sum_i s_i r_i, where r_i is the vector representing demon i's identity. Included in the summation is the default unit, whose vector r_def is defined to be the composite symbol formed over thickness s_def of the front of the input queue. Once a composite symbol is read from the input queue, it is removed.
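Continuing the sketch above, the push step might be written as follows (again our own illustration; the queue is assumed to hold [vector, thickness] items, and the default unit's contribution is the s_def worth of thickness read, and removed, from the queue's front):

```python
def push(stack, queue, demon_acts, demon_vecs, s_def):
    """Push the activity-weighted blend of the demons' identity vectors plus the
    default unit's contribution, as a single item of thickness 1.0."""
    new_vec = sum(s * r for s, r in zip(demon_acts, demon_vecs))
    taken = 0.0
    while queue and taken < s_def:        # read s_def units of the input queue
        v, t = queue[0]
        use = min(t, s_def - taken)
        new_vec = new_vec + use * v       # default unit's share of the new item
        taken += use
        if use >= t:
            queue.pop(0)                  # symbols read from the queue are removed
        else:
            queue[0] = [v, t - use]
    stack.items.append([new_vec, 1.0])
```

Because the demon activities and s_def sum to 1 under the normalized exponential, the pushed item carries exactly one unit of symbol mass, matching the discrete case where either a nonterminal is pushed or one input symbol is shifted.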

Training Methodology. The system is trained on positive and negative examples from a context-free grammar. Its task is to learn the underlying rewrite rules and classify each input as grammatical or ungrammatical. Once a positive string is correctly parsed, the symbol remaining on the stack should be the root symbol S (as in Figure 1c); for a negative example, the stack should contain any symbol other than S.
These criteria can be translated into an objective function as follows. If we assume a Gaussian noise distribution over outputs, the probability that the stack contains the symbol S following presentation of example i is:

p_i^{sym} = e^{-\beta \| c_i - r_S \|^2},

where c_i is the vector representing the symbol under the read-head; and the probability that the total thickness of the symbols on the stack is 1 (i.e., the stack contains exactly one item) is:

p_i^{thick} = e^{-\beta (T_i - 1)^2},

where T_i is the total thickness of all symbols on the stack and \beta is a constant. For a positive example, the objective function should be greatest when there is a high probability of S being in the stack and a high probability of it being the sole symbol in the stack; for a negative example, the objective function should be greatest when either event has low probability. We thus obtain a likelihood objective function whose logarithm the learning procedure attempts to maximize:

L = \prod_{i \in positive} p_i^{sym} p_i^{thick} \prod_{i \in negative} (1 - p_i^{sym} p_i^{thick}).
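A compact sketch of this criterion for a single example (ours; the parameter beta, the vector r_S for the root symbol, and the epsilon guard are our notational choices):

```python
import numpy as np

def log_likelihood(c, T, r_S, positive, beta=1.0):
    """Log likelihood for one example: c is the composite symbol under the
    read-head, T the total thickness of the stack, r_S the vector for S."""
    p_sym = np.exp(-beta * np.sum((c - r_S) ** 2))   # stack top resembles S
    p_thick = np.exp(-beta * (T - 1.0) ** 2)         # stack holds exactly one item
    p = p_sym * p_thick
    return np.log(p if positive else max(1.0 - p, 1e-12))
```

Gradient ascent on the sum of these terms over the training set is what the back-propagation-through-time procedure described next computes.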

The derivative of the objective function with respect to the weight parameters is computed using a form of back propagation through time (Rumelhart, Hinton, & Williams, 1986). This involves "unfolding" the architecture in time and back propagating error through the stack. Weights are then updated to perform gradient ascent in the log likelihood function.

Simulation Details. Training sets were generated by hand, with a preference for shorter strings. Positive examples were generated from the grammar; negative examples were randomly generated by perturbing a grammatical string. For a given grammar, the number of negative examples was much larger than the number of positive examples; therefore, positive examples were repeated in the training set so as to constitute half of the total training examples. The total number of demon units and the (fixed) identity of each was specified in advance of learning. For example, for the grammar in Figure 1a, we provided at least two S demon units and one X demon unit. Redundant demon units beyond these numbers did not degrade the network's performance.
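As an illustration of the training-set recipe above (ours; the perturbation operators and counts are assumptions, not the authors' exact procedure), a set for a^n b^n might be assembled as follows:

```python
import random

def perturb(s):
    """Corrupt a grammatical string by flipping, dropping, or inserting a symbol."""
    s, i = list(s), random.randrange(len(s))
    op = random.choice(["flip", "drop", "insert"])
    if op == "flip":
        s[i] = "a" if s[i] == "b" else "b"
    elif op == "drop":
        del s[i]
    else:
        s.insert(i, random.choice("ab"))
    return "".join(s)

positives = ["a" * n + "b" * n for n in range(1, 5)]      # preference for short strings
negatives = sorted({perturb(random.choice(positives)) for _ in range(40)} - set(positives))
# Repeat the positives so that they make up roughly half of the training set.
reps = max(1, len(negatives) // len(positives))
training_set = [(s, True) for s in positives] * reps + [(s, False) for s in negatives]
```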


grammar                  rewrite rules
a^n b^n                  S → ab | aX;   X → Sb
parenthesis balancing    S → () | (X | SS;   X → S)
postfix                  S → aX | SX;   X → b+ | S+
pseudo-nlp               S → Nv | nV

Table 1: Some results

Figure 4: Sample weights for the grammar a^n b^n. Weights are organized by demon units, whose identities appear above the rectangles. The left and right halves of the three rectangles represent connections from composite symbols 1 and 2, respectively. The darker the shading of a rectangle, the larger the connection strength from the input unit representing that symbol to the demon unit. The weights clearly indicate the three rules in the grammar.

The initial weights {w_ij} were selected from a uniform distribution over the interval [.45, .55]. The biases, b_i, were initialized to 1.0. Before an example is presented, the stack is reset to contain only a single symbol, the null symbol, with vector representation 0 and infinite thickness. The example string is placed in the input queue. The network is then allowed to run for 2l - 2 time steps, which is exactly the number of steps required to process any grammatical string of length l. For example, the string aabb in Figure 1c takes 6 steps to get reduced to S.
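A sketch of the initialization and run length just described (ours; the shapes are illustrative):

```python
import numpy as np

def initialize(n_demons, n_inputs, rng=np.random.default_rng(0)):
    W = rng.uniform(0.45, 0.55, size=(n_demons, n_inputs))  # initial weights w_ij
    b = np.ones(n_demons)                                    # biases initialized to 1.0
    return W, b

def run_length(string):
    return 2 * len(string) - 2    # "aabb" -> 6 steps, as in Figure 1c
```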

Results. We have successfully trained the architecture on a variety of grammars; some of them are listed in Table 1. In each case, the network was able to discriminate positive and negative examples in the training set. For the first three grammars, additional strings were used to test network generalization performance. The generalization performance was 100% in each case.
Due to the simplicity of the architecture (the fact that there is only one layer of modifiable weights), the learned weights can often be interpreted as symbolic rewrite rules (Figure 4). It is a remarkable achievement that the numerical optimization framework

of neural net learning can be used to discover symbolic rules (see also Mozer & Bachrach, 1991). Although the current version of the model is designed for LR(0) CFGs, it can be extended to LR(n) grammars by including connections from the first n composite symbols in the input queue to the demon units. However, our focus is not necessarily on building a theoretically powerful formal language recognizer and learning system; rather, our interest has been in integrating symbol manipulation capabilities into a neural network architecture. The model has the ability to represent a string of symbols with a single symbol, and to do so iteratively, allowing for the formation of hierarchical and recursive structures. This is the essence of symbolic information processing and, in our view, a key ingredient for structure learning.

Acknowledgement. This research was supported by NSF Presidential Young Investigator award IRI-9058450 and grant 90-21 from the James S. McDonnell Foundation. We thank Paul Smolensky, C. Lee Giles, and Jürgen Schmidhuber for helpful comments regarding this work.

Bibliography
Cook, C.M. & Rosenfeld, A. (1974). Some experiments in grammatical inference. NATO ASI on Computer Oriented Learning Processes, Bonas, France, pp. 157.

Crespi-Reghizzi, C. (1971). An effective model for grammatical inference. Proceedings of IFIP Congress, Ljubljana, pp. 524.

Das, S., Giles, C.L., & Sun, G.Z. (1992). Learning context-free grammars: Capabilities and limitations of a neural network with an external stack memory. In Proceedings of the Fourteenth Annual Conference of the Cognitive Science Society (pp. 791). Hillsdale, NJ: Erlbaum.

Fass, L.F. (1983). Learning context-free languages from their structured sentences. SIGACT News, 15, p. 24.

Giles, C.L., Sun, G.Z., Chen, H.H., Lee, Y.C., & Chen, D. (1990). Higher order recurrent networks and grammatical inference. In D.S. Touretzky (Ed.), Advances in Neural Information Processing Systems 2 (pp. 380). San Mateo, CA: Morgan Kaufmann.

Hinton, G.E. (1988). Representing part-whole hierarchies in connectionist networks. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Erlbaum.

Knobe, B. & Knobe, K. (1976). A method for inferring context-free grammars. Information and Control, 31, pp. 129.

Mozer, M.C. & Bachrach, J. (1991). SLUG: A connectionist architecture for inferring the structure of finite-state environments. Machine Learning, 7, pp. 139.

Mozer, M.C. (1992). Induction of multiscale temporal structure. In J.E. Moody, S.J. Hanson, & R.P. Lippmann (Eds.), Advances in Neural Information Processing Systems 4 (pp. 275). San Mateo, CA: Morgan Kaufmann.

Mozer, M.C. & Das, S. (1993). A connectionist symbol manipulator that discovers the structure of context-free languages. In C.L. Giles, S.J. Hanson, & J.D. Cowan (Eds.), Advances in Neural Information Processing Systems 5. To appear.

Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). Learning internal representations by error propagation. In D.E. Rumelhart & J.L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Volume I: Foundations (pp. 318). Cambridge, MA: MIT Press/Bradford Books.

Sakakibara, Y. (1988). Learning context-free grammars from structural data. In Proceedings of the 1988 Workshop on Computational Learning Theory (pp. 330). Morgan Kaufmann.

Sun, G.Z., Chen, H.H., Giles, C.L., Lee, Y.C., & Chen, D. (1990). Connectionist pushdown automata that learn context-free grammars. In Proceedings of the International Joint Conference on Neural Networks (pp. I-577). Hillsdale, NJ: Erlbaum.
