
Cognitive Models of Language and Beyond

Assignments Week 4: Connectionist Models of Language Processing


Hielke Prins, 6359973

Answers

1. Two properties of Simple Recurrent Networks (SRNs) limit the capacity of these neural models
to account for strong systematicity. First, their time-dependent behavior results from a bounded
number of previous states. Their current output depends on previous input in a way that declines
with temporal distance, much like the way the probability of the next token in a sequence is
estimated from n-grams in n-state Hidden Markov Models (HMMs).¹ Contrary to (unconstrained)
data-oriented parsing (DOP) models, the patterns captured this way are not scale-free and reflect
temporal patterns rather than the (discontiguous) structural constituents behind them.
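
To make the decay of temporal context concrete, the sketch below runs a single Elman-style forward
pass in Python with NumPy; the toy vocabulary, layer sizes and random weights are assumptions for
illustration, not Elman's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["John", "loves", "Mary", "."]
n_in, n_hid = len(vocab), 8

W_xh = rng.normal(scale=0.5, size=(n_hid, n_in))   # input -> hidden
W_hh = rng.normal(scale=0.5, size=(n_hid, n_hid))  # context (previous hidden state) -> hidden
W_hy = rng.normal(scale=0.5, size=(n_in, n_hid))   # hidden -> next-word prediction

def one_hot(word):
    v = np.zeros(n_in)
    v[vocab.index(word)] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h = np.zeros(n_hid)                                # the context layer starts empty
for word in ["John", "loves", "Mary"]:
    # Each new hidden state mixes the current input with the previous state,
    # so the trace of earlier words decays with temporal distance.
    h = np.tanh(W_xh @ one_hot(word) + W_hh @ h)
    p_next = softmax(W_hy @ h)                     # distribution over the next token
```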

Simple Recurrent Networks are essentially local, and the influence of the most recent previous
states may exceed that of more relevant long-distance dependencies. When the history of previous
states taken into account (n) becomes large enough, they eventually approximate the performance
of models that exploit these structural dependencies, but they will do so mainly for previously
encountered, temporally similar examples. Training an SRN will furthermore require more examples,
because it does not capture important syntactic regularities.

A second limitation makes this worse: representations in a Simple Recurrent Network are also
encoded as temporal patterns and thus become dependent on the specific position of the word in
previous examples. Representations of “John” after training on a story about him are likely to
differ from representations of “Mary”, despite their membership of the same syntactic category
(nouns).

The representations encode word order. When “John” has never appeared immediately after
“loves” but often before it, an SRN will not predict it in the object position. This is because
the network has learned to generalize over temporal surface patterns (“Mary” often occurring after
and “John” more often before “loves”) instead of constituent-based syntactic ones (“NP loves
NP”, S → NP VP, VP → loves NP, etc.). In short, the representations are not context-free.
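
A toy bigram count, with made-up training sentences, illustrates the problem: a purely
surface-statistical model assigns zero probability to “John” in object position simply because
that string never occurred after “loves”.

```python
from collections import Counter, defaultdict

# Made-up training data in which "John" occurs before "loves" but never after it.
training = [
    "John loves Mary".split(),
    "John loves cake".split(),
]

bigrams = defaultdict(Counter)
for sent in training:
    for prev, nxt in zip(sent, sent[1:]):
        bigrams[prev][nxt] += 1

def p_next(prev, word):
    total = sum(bigrams[prev].values())
    return bigrams[prev][word] / total if total else 0.0

# Surface statistics give "John" zero probability in object position, even though
# it is a noun just like "Mary"; nothing in the counts encodes "NP loves NP".
print(p_next("loves", "Mary"))   # 0.5
print(p_next("loves", "John"))   # 0.0
```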

Elman (1990) states that a hierarchical cluster analysis of hidden-unit activations supports the
claim that an SRN is capable of learning category membership and type/token distinctions. His
clusters, however, include sentence-final and sentence-initial ones that encode positional rather
than category information. Furthermore, they reflect temporal patterns already present in the
input, as shown by a trigram model (n = 3) of the target vectors in the training set; they are
thus not real categories or type/token distinctions but remain tightly bound to previous tokens.
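
The kind of analysis at issue can be sketched as follows, assuming the per-word hidden-unit
activation vectors have already been collected; random vectors stand in for the real activations
here.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(1)
words = ["John", "Mary", "loves", "sees", "cake"]

# Stand-in for the mean hidden-unit activation vector recorded for each word;
# in Elman (1990) these come from the trained network's hidden layer.
mean_hidden = np.stack([rng.normal(size=8) for _ in words])

Z = linkage(mean_hidden, method="average", metric="euclidean")
tree = dendrogram(Z, labels=words, no_plot=True)
print(tree["ivl"])   # leaf order shows which words end up clustered together
```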

¹ Borensztajn (2011) compares the way an SRN generalizes with that of a Finite State Machine (FSM).
Hidden Markov Models are in fact a special kind of probabilistic FSM and have been quite successful
in speech recognition.

2. Constituent structure refers to the deeper structural (often hierarchical) relationships between
the constituents (parts) of a sentence that are taken to underlie its surface structure (mere word
order). It is relevant because the constituent structure is likely to determine (cognitively)
plausible transitions in a model, by means of operations that efficiently exploit that structure.
Conversely, these operations may shape the solution space of the processing operations involved in
parsing and generation.

In DOP, for example, the constituent structure of a sentence is defined in terms of subtrees of
arbitrary size, combined into a parse tree using substitution operations. The subtrees can be
discontiguous and may contain variables. They can thus be both rule-like, context-free
representations and pattern-like, context-sensitive ones. Patterns may be derived from temporal
n-gram sequences in the training data, but they always include a deeper constituent structure that
allows them to be used in another context based on their syntactic role. Derivations are evaluated
in terms of their efficiency or weighted by the trained, context-dependent probabilities of the
subtree representations that were used.
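
A minimal sketch of tree substitution, assuming fragments represented as nested tuples with bare
labels acting as open substitution sites; it is a toy illustration of the operation, not an actual
DOP implementation.

```python
# Fragments are nested tuples (label, children); a bare label like ("NP",)
# is an open substitution site and a plain string is a lexical leaf.
frag_s = ("S", [("NP",), ("VP", [("V", ["loves"]), ("NP",)])])
frag_john = ("NP", [("N", ["John"])])
frag_mary = ("NP", [("N", ["Mary"])])

def substitute(tree, fragment):
    """Plug the fragment into the leftmost open site carrying its root label."""
    if isinstance(tree, str):
        return tree
    label, *rest = tree
    if not rest:
        return fragment if label == fragment[0] else tree
    new_children, replaced = [], False
    for child in rest[0]:
        if not replaced:
            new_child = substitute(child, fragment)
            replaced = new_child is not child
            new_children.append(new_child)
        else:
            new_children.append(child)
    return (label, new_children) if replaced else tree

parse = substitute(substitute(frag_s, frag_john), frag_mary)
print(parse)
# ('S', [('NP', [('N', ['John'])]), ('VP', [('V', ['loves']), ('NP', [('N', ['Mary'])])])])
```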

In Elman (1990) the constituent structure is essentially temporal and consists of probabilistic
relationships between successive states (see Question 1). Elman (1991) argues that these depict
trajectories through an arbitrarily structured, high-dimensional state space of hidden-unit
activation vectors in the network while it processes the surface structure (words in a particular
order). These trajectories are similar for all sentences with the same grammatical form and are
therefore believed to be systematic (note, however, that sentences with a similar grammatical form
probably share much of their temporal structure).

Dimensions in the high-dimensional state space that explain most of the variance are isolated
using Principal Component Analysis (PCA) and are believed to describe the most discriminative,
and thus most meaningful, features. These components are data-dependent and neither known nor
constrained beforehand, which is what makes the representations distributed. Repetitive structures
in the trajectories through the space spanned by two of these orthogonal components are taken as
evidence that an SRN can account for limited recursion and can systematically encode the depth of
embedding in the current state.
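
A small sketch of that PCA step, assuming a matrix of recorded hidden states (time steps by hidden
units); the random data here merely stands in for the activations of a trained SRN.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in for the recorded hidden-unit activations of a trained SRN while it
# processes a sentence: one row per time step, one column per hidden unit.
hidden_states = rng.normal(size=(40, 70))

centered = hidden_states - hidden_states.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)   # rows of vt are the principal directions
trajectory_2d = centered @ vt[:2].T                       # coordinates along PC1 and PC2

# Plotting trajectory_2d for sentences with and without embedded clauses is what
# reveals the repeated, displaced loops read as limited recursion and depth coding.
print(trajectory_2d.shape)   # (40, 2)
```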

Dependence on temporal context is precisely the mechanism Elman (1991) proposes behind abstraction
and generalization. According to Marcus (1998), however, models need variables to explain
systematic generalization, something a simple SRN does not provide apart from the positional
slots.

3. DOP is usually implemented offline and thus requires a buffer, while SRN prediction happens
online, based on the input so far. The models therefore cannot be compared directly, but it seems
reasonable to conclude that the connectionist approach needs representations that describe
structural relationships between constituents better than an SRN has to offer. The Hierarchical
Prediction Network (HPN) by Borensztajn et al. borrows the notions of substitution and of
hierarchical structural relationships between constituents from rule- and tree-based parsing
approaches such as DOP.

At the same time the HPN acknowledges the important role of temporal sequences. Matching
representations in substitution space to the compressor node happens in a fixed order and can be
seen as a separate retrieval step that executes a production. A production resembles a rewrite
rule or a derivation by node substitution. The matching probability for two nodes is based on
their distance in substitution space rather than on strictly defined category membership or
position within the sentence. Binding to a specific slot is thus locally constrained, as opposed
to DOP, where substitution is globally available to all subtrees with the right root node. The
winning match is the one that receives most of the bottom-up activation and does not need to be
selected afterwards using a cost function such as the shortest derivation.
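
A toy sketch of distance-based slot matching, in which binding probability falls off with distance
in substitution space; the coordinates and the exponential kernel are assumptions for
illustration, not the actual HPN formulation.

```python
import numpy as np

# Nodes are points in a two-dimensional "substitution space"; a slot binds to a
# candidate with probability that decays with distance instead of requiring an
# exact category label.
nodes = {
    "NP_John":  np.array([0.10, 0.90]),
    "NP_Mary":  np.array([0.20, 1.00]),
    "VP_loves": np.array([1.50, 0.20]),
}
slot = np.array([0.15, 0.95])        # a slot expecting something NP-like

def match_probs(slot, nodes, temperature=0.2):
    scores = {name: np.exp(-np.linalg.norm(slot - pos) / temperature)
              for name, pos in nodes.items()}
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}

# The NP-like nodes receive nearly all of the probability mass; the winner is
# simply the best-matching (most activated) node, with no separate cost function.
print(match_probs(slot, nodes))
```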

The nodes in a compressor layer function as variable slots that allow for substitution. An HPN
with a single compressor layer implements a Context-Free Grammar (CFG). Completed production
rules transmit their representations and can themselves be matched with a slot of another
production. In this way the HPN implements reduction without the degradation Elman (1991)
observes in his SRN.

An HPN thus implements many of the features seen in symbolic parsing models like DOP, including
variables and recursion. It allows for hierarchical processing and for learning category labels
in an unsupervised way, as in unsupervised DOP (U-DOP). Moreover, it does both at the same time,
by updating substitution space while using it. Contrary to U-DOP, the whole model is already
physically realizable with a single mechanism that does both learning and parsing. Selection
based on levels of activation, local binding and distributed topological memory allow for a
biologically plausible neural implementation of the mechanism.

Both approaches show, however, that there is a continuum between symbolic (rule-based) and
connectionist (exemplar-based) approaches, and that it can be bridged starting from either side
of the spectrum.

