
Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks

Kai Sheng Tai, Richard Socher*, Christopher D. Manning


Computer Science Department, Stanford University, *MetaMind Inc.
kst@cs.stanford.edu, richard@metamind.io, manning@stanford.edu

arXiv:1503.00075v3 [cs.CL] 30 May 2015

Abstract

Because of their superior ability to preserve sequence information over time, Long Short-Term Memory (LSTM) networks, a type of recurrent neural network with a more complex computational unit, have obtained strong results on a variety of sequence modeling tasks. The only underlying LSTM structure that has been explored so far is a linear chain. However, natural language exhibits syntactic properties that would naturally combine words to phrases. We introduce the Tree-LSTM, a generalization of LSTMs to tree-structured network topologies. Tree-LSTMs outperform all existing systems and strong LSTM baselines on two tasks: predicting the semantic relatedness of two sentences (SemEval 2014, Task 1) and sentiment classification (Stanford Sentiment Treebank).

[Figure 1: Top: A chain-structured LSTM network. Bottom: A tree-structured LSTM network with arbitrary branching factor.]

1 Introduction

Most models for distributed representations of phrases and sentences, that is, models where real-valued vectors are used to represent meaning, fall into one of three classes: bag-of-words models, sequence models, and tree-structured models. In bag-of-words models, phrase and sentence representations are independent of word order; for example, they can be generated by averaging constituent word representations (Landauer and Dumais, 1997; Foltz et al., 1998). In contrast, sequence models construct sentence representations as an order-sensitive function of the sequence of tokens (Elman, 1990; Mikolov, 2012). Lastly, tree-structured models compose each phrase and sentence representation from its constituent subphrases according to a given syntactic structure over the sentence (Goller and Kuchler, 1996; Socher et al., 2011).

Order-insensitive models are insufficient to fully capture the semantics of natural language due to their inability to account for differences in meaning as a result of differences in word order or syntactic structure (e.g., "cats climb trees" vs. "trees climb cats"). We therefore turn to order-sensitive sequential or tree-structured models. In particular, tree-structured models are a linguistically attractive option due to their relation to syntactic interpretations of sentence structure. A natural question, then, is the following: to what extent (if at all) can we do better with tree-structured models as opposed to sequential models for sentence representation? In this paper, we work towards addressing this question by directly comparing a type of sequential model that has recently been used to achieve state-of-the-art results in several NLP tasks against its tree-structured generalization.

Due to their capability for processing arbitrary-length sequences, recurrent neural networks
(RNNs) are a natural choice for sequence modeling tasks. Recently, RNNs with Long Short-Term Memory (LSTM) units (Hochreiter and Schmidhuber, 1997) have re-emerged as a popular architecture due to their representational power and effectiveness at capturing long-term dependencies. LSTM networks, which we review in Sec. 2, have been successfully applied to a variety of sequence modeling and prediction tasks, notably machine translation (Bahdanau et al., 2014; Sutskever et al., 2014), speech recognition (Graves et al., 2013), image caption generation (Vinyals et al., 2014), and program execution (Zaremba and Sutskever, 2014).

In this paper, we introduce a generalization of the standard LSTM architecture to tree-structured network topologies and show its superiority for representing sentence meaning over a sequential LSTM. While the standard LSTM composes its hidden state from the input at the current time step and the hidden state of the LSTM unit in the previous time step, the tree-structured LSTM, or Tree-LSTM, composes its state from an input vector and the hidden states of arbitrarily many child units. The standard LSTM can then be considered a special case of the Tree-LSTM where each internal node has exactly one child.

In our evaluations, we demonstrate the empirical strength of Tree-LSTMs as models for representing sentences. We evaluate the Tree-LSTM architecture on two tasks: semantic relatedness prediction on sentence pairs and sentiment classification of sentences drawn from movie reviews. Our experiments show that Tree-LSTMs outperform existing systems and sequential LSTM baselines on both tasks. Implementations of our models and experiments are available at https://github.com/stanfordnlp/treelstm.

2 Long Short-Term Memory Networks

2.1 Overview

Recurrent neural networks (RNNs) are able to process input sequences of arbitrary length via the recursive application of a transition function on a hidden state vector h_t. At each time step t, the hidden state h_t is a function of the input vector x_t that the network receives at time t and its previous hidden state h_{t-1}. For example, the input vector x_t could be a vector representation of the t-th word in a body of text (Elman, 1990; Mikolov, 2012). The hidden state h_t ∈ R^d can be interpreted as a d-dimensional distributed representation of the sequence of tokens observed up to time t.

Commonly, the RNN transition function is an affine transformation followed by a pointwise nonlinearity such as the hyperbolic tangent function:

    h_t = tanh(W x_t + U h_{t-1} + b).

Unfortunately, a problem with RNNs with transition functions of this form is that during training, components of the gradient vector can grow or decay exponentially over long sequences (Hochreiter, 1998; Bengio et al., 1994). This problem with exploding or vanishing gradients makes it difficult for the RNN model to learn long-distance correlations in a sequence.

The LSTM architecture (Hochreiter and Schmidhuber, 1997) addresses this problem of learning long-term dependencies by introducing a memory cell that is able to preserve state over long periods of time. While numerous LSTM variants have been described, here we describe the version used by Zaremba and Sutskever (2014).

We define the LSTM unit at each time step t to be a collection of vectors in R^d: an input gate i_t, a forget gate f_t, an output gate o_t, a memory cell c_t and a hidden state h_t. The entries of the gating vectors i_t, f_t and o_t are in [0, 1]. We refer to d as the memory dimension of the LSTM.

The LSTM transition equations are the following:

    i_t = σ(W^(i) x_t + U^(i) h_{t-1} + b^(i)),
    f_t = σ(W^(f) x_t + U^(f) h_{t-1} + b^(f)),
    o_t = σ(W^(o) x_t + U^(o) h_{t-1} + b^(o)),
    u_t = tanh(W^(u) x_t + U^(u) h_{t-1} + b^(u)),
    c_t = i_t ⊙ u_t + f_t ⊙ c_{t-1},
    h_t = o_t ⊙ tanh(c_t),                                        (1)

where x_t is the input at the current time step, σ denotes the logistic sigmoid function and ⊙ denotes elementwise multiplication. Intuitively, the forget gate controls the extent to which the previous memory cell is forgotten, the input gate controls how much each unit is updated, and the output gate controls the exposure of the internal memory state. The hidden state vector in an LSTM unit is therefore a gated, partial view of the state of the unit's internal memory cell. Since the values of the gating variables vary for each vector element, the model can learn to represent information over multiple time scales.
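To make these transitions concrete, the following is a minimal NumPy sketch of a single LSTM step. The gate names mirror Eqs. 1; the parameter layout, random initialization, and toy dimensions are illustrative assumptions of this sketch, not the paper's released implementation.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, params):
        """One LSTM transition (Eqs. 1): returns the new hidden state h_t and memory cell c_t."""
        W, U, b = params["W"], params["U"], params["b"]
        i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate
        f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate
        o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate
        u = np.tanh(W["u"] @ x_t + U["u"] @ h_prev + b["u"])   # candidate update
        c = i * u + f * c_prev                                 # memory cell
        h = o * np.tanh(c)                                     # hidden state
        return h, c

    # Illustrative dimensions: 300-dimensional inputs (e.g., word vectors), memory dimension d = 150.
    rng = np.random.default_rng(0)
    d, x_dim = 150, 300
    params = {
        "W": {g: 0.01 * rng.standard_normal((d, x_dim)) for g in "ifou"},
        "U": {g: 0.01 * rng.standard_normal((d, d)) for g in "ifou"},
        "b": {g: np.zeros(d) for g in "ifou"},
    }
    h, c = np.zeros(d), np.zeros(d)
    for x_t in rng.standard_normal((5, x_dim)):   # run over a toy sequence of 5 inputs
        h, c = lstm_step(x_t, h, c, params)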
2.2 Variants

Two commonly-used variants of the basic LSTM architecture are the Bidirectional LSTM and the Multilayer LSTM (also known as the stacked or deep LSTM).

Bidirectional LSTM. A Bidirectional LSTM (Graves et al., 2013) consists of two LSTMs that are run in parallel: one on the input sequence and the other on the reverse of the input sequence. At each time step, the hidden state of the Bidirectional LSTM is the concatenation of the forward and backward hidden states. This setup allows the hidden state to capture both past and future information.

Multilayer LSTM. In Multilayer LSTM architectures, the hidden state of an LSTM unit in layer ℓ is used as input to the LSTM unit in layer ℓ+1 in the same time step (Graves et al., 2013; Sutskever et al., 2014; Zaremba and Sutskever, 2014). Here, the idea is to let the higher layers capture longer-term dependencies of the input sequence.

These two variants can be combined as a Multilayer Bidirectional LSTM (Graves et al., 2013).

3 Tree-Structured LSTMs

A limitation of the LSTM architectures described in the previous section is that they only allow for strictly sequential information propagation. Here, we propose two natural extensions to the basic LSTM architecture: the Child-Sum Tree-LSTM and the N-ary Tree-LSTM. Both variants allow for richer network topologies where each LSTM unit is able to incorporate information from multiple child units.

As in standard LSTM units, each Tree-LSTM unit (indexed by j) contains input and output gates i_j and o_j, a memory cell c_j and hidden state h_j. The difference between the standard LSTM unit and Tree-LSTM units is that gating vectors and memory cell updates are dependent on the states of possibly many child units. Additionally, instead of a single forget gate, the Tree-LSTM unit contains one forget gate f_jk for each child k. This allows the Tree-LSTM unit to selectively incorporate information from each child. For example, a Tree-LSTM model can learn to emphasize semantic heads in a semantic relatedness task, or it can learn to preserve the representation of sentiment-rich children for sentiment classification.

[Figure 2: Composing the memory cell c_1 and hidden state h_1 of a Tree-LSTM unit with two children (subscripts 2 and 3). Labeled edges correspond to gating by the indicated gating vector, with dependencies omitted for compactness.]

As with the standard LSTM, each Tree-LSTM unit takes an input vector x_j. In our applications, each x_j is a vector representation of a word in a sentence. The input word at each node depends on the tree structure used for the network. For instance, in a Tree-LSTM over a dependency tree, each node in the tree takes the vector corresponding to the head word as input, whereas in a Tree-LSTM over a constituency tree, the leaf nodes take the corresponding word vectors as input.

3.1 Child-Sum Tree-LSTMs

Given a tree, let C(j) denote the set of children of node j. The Child-Sum Tree-LSTM transition equations are the following:

    h̃_j = Σ_{k ∈ C(j)} h_k,                                       (2)
    i_j = σ(W^(i) x_j + U^(i) h̃_j + b^(i)),                       (3)
    f_jk = σ(W^(f) x_j + U^(f) h_k + b^(f)),                       (4)
    o_j = σ(W^(o) x_j + U^(o) h̃_j + b^(o)),                       (5)
    u_j = tanh(W^(u) x_j + U^(u) h̃_j + b^(u)),                    (6)
    c_j = i_j ⊙ u_j + Σ_{k ∈ C(j)} f_jk ⊙ c_k,                     (7)
    h_j = o_j ⊙ tanh(c_j),                                         (8)

where in Eq. 4, k ∈ C(j).

Intuitively, we can interpret each parameter matrix in these equations as encoding correlations between the component vectors of the Tree-LSTM
unit, the input x_j, and the hidden states h_k of the unit's children. For example, in a dependency tree application, the model can learn parameters W^(i) such that the components of the input gate i_j have values close to 1 (i.e., "open") when a semantically important content word (such as a verb) is given as input, and values close to 0 (i.e., "closed") when the input is a relatively unimportant word (such as a determiner).

Dependency Tree-LSTMs. Since the Child-Sum Tree-LSTM unit conditions its components on the sum of child hidden states h_k, it is well-suited for trees with high branching factor or whose children are unordered. For example, it is a good choice for dependency trees, where the number of dependents of a head can be highly variable. We refer to a Child-Sum Tree-LSTM applied to a dependency tree as a Dependency Tree-LSTM.
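As an illustration of Eqs. 2-8, the following sketch computes one Child-Sum Tree-LSTM unit and applies it bottom-up over a parse tree. It assumes NumPy, a parameter dictionary laid out like the LSTM sketch in Sec. 2.1, and a hypothetical dictionary-based tree representation; it is not the released Torch implementation.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def child_sum_tree_lstm_node(x_j, children, params):
        """One Child-Sum Tree-LSTM unit (Eqs. 2-8).

        x_j: input vector at node j (may be a zero vector for nodes with no input word).
        children: list of (h_k, c_k) pairs for the children of node j (empty for leaves).
        """
        W, U, b = params["W"], params["U"], params["b"]
        d = b["i"].shape[0]
        h_tilde = sum((h_k for h_k, _ in children), start=np.zeros(d))        # Eq. 2
        i = sigmoid(W["i"] @ x_j + U["i"] @ h_tilde + b["i"])                 # Eq. 3
        o = sigmoid(W["o"] @ x_j + U["o"] @ h_tilde + b["o"])                 # Eq. 5
        u = np.tanh(W["u"] @ x_j + U["u"] @ h_tilde + b["u"])                 # Eq. 6
        # Eq. 4: one forget gate per child, conditioned on that child's own hidden state
        f = [sigmoid(W["f"] @ x_j + U["f"] @ h_k + b["f"]) for h_k, _ in children]
        c = i * u + sum((f_k * c_k for f_k, (_, c_k) in zip(f, children)), start=np.zeros(d))  # Eq. 7
        h = o * np.tanh(c)                                                    # Eq. 8
        return h, c

    # A Dependency Tree-LSTM processes the parse bottom-up; each node combines its head
    # word's vector with the states of its dependents. The node layout (a dict with "word"
    # and "children" keys) and the word_vecs lookup table are hypothetical.
    def encode(node, word_vecs, params):
        children = [encode(child, word_vecs, params) for child in node["children"]]
        return child_sum_tree_lstm_node(word_vecs[node["word"]], children, params)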
3.2 N-ary Tree-LSTMs

The N-ary Tree-LSTM can be used on tree structures where the branching factor is at most N and where children are ordered, i.e., they can be indexed from 1 to N. For any node j, write the hidden state and memory cell of its kth child as h_jk and c_jk respectively. The N-ary Tree-LSTM transition equations are the following:

    i_j = σ(W^(i) x_j + Σ_{ℓ=1}^N U_ℓ^(i) h_jℓ + b^(i)),          (9)
    f_jk = σ(W^(f) x_j + Σ_{ℓ=1}^N U_kℓ^(f) h_jℓ + b^(f)),        (10)
    o_j = σ(W^(o) x_j + Σ_{ℓ=1}^N U_ℓ^(o) h_jℓ + b^(o)),          (11)
    u_j = tanh(W^(u) x_j + Σ_{ℓ=1}^N U_ℓ^(u) h_jℓ + b^(u)),       (12)
    c_j = i_j ⊙ u_j + Σ_{ℓ=1}^N f_jℓ ⊙ c_jℓ,                      (13)
    h_j = o_j ⊙ tanh(c_j),                                        (14)

where in Eq. 10, k = 1, 2, . . . , N. Note that when the tree is simply a chain, both Eqs. 2-8 and Eqs. 9-14 reduce to the standard LSTM transitions, Eqs. 1.

The introduction of separate parameter matrices for each child k allows the N-ary Tree-LSTM model to learn more fine-grained conditioning on the states of a unit's children than the Child-Sum Tree-LSTM. Consider, for example, a constituency tree application where the left child of a node corresponds to a noun phrase, and the right child to a verb phrase. Suppose that in this case it is advantageous to emphasize the verb phrase in the representation. Then the U_kℓ^(f) parameters can be trained such that the components of f_j1 are close to 0 (i.e., "forget"), while the components of f_j2 are close to 1 (i.e., "preserve").

Forget gate parameterization. In Eq. 10, we define a parameterization of the kth child's forget gate f_jk that contains "off-diagonal" parameter matrices U_kℓ^(f), k ≠ ℓ. This parameterization allows for more flexible control of information propagation from child to parent. For example, this allows the left hidden state in a binary tree to have either an excitatory or inhibitory effect on the forget gate of the right child. However, for large values of N, these additional parameters are impractical and may be tied or fixed to zero.

Constituency Tree-LSTMs. We can naturally apply Binary Tree-LSTM units to binarized constituency trees since left and right child nodes are distinguished. We refer to this application of Binary Tree-LSTMs as a Constituency Tree-LSTM. Note that in Constituency Tree-LSTMs, a node j receives an input vector x_j only if it is a leaf node.

In the remainder of this paper, we focus on the special cases of Dependency Tree-LSTMs and Constituency Tree-LSTMs. These architectures are in fact closely related; since we consider only binarized constituency trees, the parameterizations of the two models are very similar. The key difference is in the application of the compositional parameters: dependent vs. head for Dependency Tree-LSTMs, and left child vs. right child for Constituency Tree-LSTMs.
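The binary (N = 2) case used for binarized constituency trees can be sketched as follows; here Uf[k][l] plays the role of the off-diagonal forget-gate matrices U_kℓ^(f) in Eq. 10. The parameter layout is again an assumption made for illustration, with dictionaries analogous to the earlier sketches.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def binary_tree_lstm_node(x_j, left, right, params):
        """One N-ary Tree-LSTM unit with N = 2 (Eqs. 9-14), as used for binarized constituency trees.

        left, right: (h, c) pairs for the two children. In a Constituency Tree-LSTM only
        leaf nodes receive a word vector, so x_j may be a zero vector at internal nodes.
        """
        (h1, c1), (h2, c2) = left, right
        W, U, Uf, b = params["W"], params["U"], params["Uf"], params["b"]
        i = sigmoid(W["i"] @ x_j + U["i"][0] @ h1 + U["i"][1] @ h2 + b["i"])   # Eq. 9
        o = sigmoid(W["o"] @ x_j + U["o"][0] @ h1 + U["o"][1] @ h2 + b["o"])   # Eq. 11
        u = np.tanh(W["u"] @ x_j + U["u"][0] @ h1 + U["u"][1] @ h2 + b["u"])   # Eq. 12
        # Eq. 10: each child k has its own forget gate; the off-diagonal matrices Uf[0][1]
        # and Uf[1][0] let one child's hidden state modulate the other child's forget gate.
        f1 = sigmoid(W["f"] @ x_j + Uf[0][0] @ h1 + Uf[0][1] @ h2 + b["f"])
        f2 = sigmoid(W["f"] @ x_j + Uf[1][0] @ h1 + Uf[1][1] @ h2 + b["f"])
        c = i * u + f1 * c1 + f2 * c2                                          # Eq. 13
        h = o * np.tanh(c)                                                     # Eq. 14
        return h, c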
4 Models

We now describe two specific models that apply the Tree-LSTM architectures described in the previous section.

4.1 Tree-LSTM Classification

In this setting, we wish to predict labels ŷ from a discrete set of classes Y for some subset of nodes in a tree. For example, the label for a node in a
parse tree could correspond to some property of the phrase spanned by that node.

At each node j, we use a softmax classifier to predict the label ŷ_j given the inputs {x}_j observed at nodes in the subtree rooted at j. The classifier takes the hidden state h_j at the node as input:

    p̂_θ(y | {x}_j) = softmax(W^(s) h_j + b^(s)),
    ŷ_j = arg max_y p̂_θ(y | {x}_j).

The cost function is the negative log-likelihood of the true class labels y^(k) at each labeled node:

    J(θ) = -(1/m) Σ_{k=1}^m log p̂_θ(y^(k) | {x}^(k)) + (λ/2) ||θ||²_2,

where m is the number of labeled nodes in the training set, the superscript k indicates the kth labeled node, and λ is an L2 regularization hyperparameter.
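A minimal sketch of this node-level classifier and cost, assuming the hidden states h_j come from one of the Tree-LSTM sketches above; for brevity the L2 penalty shown here covers only the classifier parameters, and all names are illustrative.

    import numpy as np

    def softmax(z):
        z = z - np.max(z)
        e = np.exp(z)
        return e / e.sum()

    def predict_node(h_j, W_s, b_s):
        """Class distribution and predicted label at a node, given its Tree-LSTM hidden state h_j."""
        p = softmax(W_s @ h_j + b_s)
        return p, int(np.argmax(p))

    def nll_loss(hidden_states, labels, W_s, b_s, lam=1e-4):
        """Mean negative log-likelihood over labeled nodes plus (lambda/2) * ||theta||^2."""
        m = len(labels)
        nll = 0.0
        for h_j, y in zip(hidden_states, labels):
            p, _ = predict_node(h_j, W_s, b_s)
            nll -= np.log(p[y])
        reg = 0.5 * lam * (np.sum(W_s ** 2) + np.sum(b_s ** 2))
        return nll / m + reg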

4.2 Semantic Relatedness of Sentence Pairs

Given a sentence pair, we wish to predict a real-valued similarity score in some range [1, K], where K > 1 is an integer. The sequence {1, 2, . . . , K} is some ordinal scale of similarity, where higher scores indicate greater degrees of similarity, and we allow real-valued scores to account for ground-truth ratings that are an average over the evaluations of several human annotators.

We first produce sentence representations h_L and h_R for each sentence in the pair using a Tree-LSTM model over each sentence's parse tree. Given these sentence representations, we predict the similarity score ŷ using a neural network that considers both the distance and angle between the pair (h_L, h_R):

    h_× = h_L ⊙ h_R,
    h_+ = |h_L - h_R|,
    h_s = σ(W^(×) h_× + W^(+) h_+ + b^(h)),                       (15)
    p̂_θ = softmax(W^(p) h_s + b^(p)),
    ŷ = r^T p̂_θ,

where r^T = [1 2 . . . K] and the absolute value function is applied elementwise. The use of both distance measures h_× and h_+ is empirically motivated: we find that the combination outperforms the use of either measure alone. The multiplicative measure h_× can be interpreted as an elementwise comparison of the signs of the input representations.

We want the expected rating under the predicted distribution p̂_θ given model parameters θ to be close to the gold rating y ∈ [1, K]: ŷ = r^T p̂_θ ≈ y. We therefore define a sparse target distribution¹ p that satisfies y = r^T p:

    p_i = y - ⌊y⌋          if i = ⌊y⌋ + 1,
    p_i = ⌊y⌋ - y + 1      if i = ⌊y⌋,
    p_i = 0                otherwise,

for 1 ≤ i ≤ K. The cost function is the regularized KL-divergence between p and p̂_θ:

    J(θ) = (1/m) Σ_{k=1}^m KL(p^(k) ∥ p̂_θ^(k)) + (λ/2) ||θ||²_2,

where m is the number of training pairs and the superscript k indicates the kth sentence pair.

¹ In the subsequent experiments, we found that optimizing this objective yielded better performance than a mean squared error objective.
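The pieces of this similarity model can be sketched as follows: the features of Eqs. 15, the predicted score ŷ = r^T p̂_θ, the sparse target distribution p, and the per-pair KL term. K = 5 matches the SICK rating scale used later; the function and parameter names are assumptions of this sketch, not the released implementation.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def softmax(z):
        z = z - np.max(z)
        e = np.exp(z)
        return e / e.sum()

    def similarity_score(h_L, h_R, params, K=5):
        """Predicted relatedness score yhat = r^T p_hat from two sentence representations (Eqs. 15)."""
        h_mul = h_L * h_R                  # elementwise product: sign agreement of the two encodings
        h_abs = np.abs(h_L - h_R)          # elementwise absolute difference
        h_s = sigmoid(params["W_mul"] @ h_mul + params["W_abs"] @ h_abs + params["b_h"])
        p_hat = softmax(params["W_p"] @ h_s + params["b_p"])
        r = np.arange(1, K + 1)
        return float(r @ p_hat), p_hat

    def sparse_target(y, K=5):
        """Sparse distribution p with r^T p = y: mass only on classes floor(y) and floor(y) + 1."""
        p = np.zeros(K)
        floor_y = int(np.floor(y))
        if floor_y == y:                   # integer ratings put all mass on a single class
            p[floor_y - 1] = 1.0
        else:
            p[floor_y - 1] = floor_y - y + 1
            p[floor_y] = y - floor_y
        return p

    def kl_term(p, p_hat, eps=1e-12):
        """KL(p || p_hat), the per-pair term of the regularized objective."""
        mask = p > 0
        return float(np.sum(p[mask] * (np.log(p[mask]) - np.log(p_hat[mask] + eps))))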
5 Experiments

We evaluate our Tree-LSTM architectures on two tasks: (1) sentiment classification of sentences sampled from movie reviews and (2) predicting the semantic relatedness of sentence pairs.

In comparing our Tree-LSTMs against sequential LSTMs, we control for the number of LSTM parameters by varying the dimensionality of the hidden states². Details for each model variant are summarized in Table 1.

² For our Bidirectional LSTMs, the parameters of the forward and backward transition functions are shared. In our experiments, this achieved superior performance to Bidirectional LSTMs with untied weights and the same number of parameters (and therefore smaller hidden vector dimensionality).

                            Relatedness            Sentiment
    LSTM Variant            d      |θ|             d      |θ|
    Standard                150    203,400         168    315,840
    Bidirectional           150    203,400         168    315,840
    2-layer                 108    203,472         120    318,720
    Bidirectional 2-layer   108    203,472         120    318,720
    Constituency Tree       142    205,190         150    316,800
    Dependency Tree         150    203,400         168    315,840

Table 1: Memory dimensions d and composition function parameter counts |θ| for each LSTM variant that we evaluate.

5.1 Sentiment Classification

In this task, we predict the sentiment of sentences sampled from movie reviews. We use the Stanford Sentiment Treebank (Socher et al., 2013). There are two subtasks: binary classification of sentences, and fine-grained classification over five classes: very negative, negative, neutral, positive, and very positive. We use the standard train/dev/test splits of 6920/872/1821 for the binary classification subtask and 8544/1101/2210 for the fine-grained classification subtask (there are fewer examples for the binary subtask since
neutral sentences are excluded). Standard binarized constituency parse trees are provided for each sentence in the dataset, and each node in these trees is annotated with a sentiment label.

For the sequential LSTM baselines, we predict the sentiment of a phrase using the representation given by the final LSTM hidden state. The sequential LSTM models are trained on the spans corresponding to labeled nodes in the training set.

We use the classification model described in Sec. 4.1 with both Dependency Tree-LSTMs (Sec. 3.1) and Constituency Tree-LSTMs (Sec. 3.2). The Constituency Tree-LSTMs are structured according to the provided parse trees. For the Dependency Tree-LSTMs, we produce dependency parses³ of each sentence; each node in a tree is given a sentiment label if its span matches a labeled span in the training set.

³ Dependency parses produced by the Stanford Neural Network Dependency Parser (Chen and Manning, 2014).

    Method                                     Fine-grained   Binary
    RAE (Socher et al., 2013)                  43.2           82.4
    MV-RNN (Socher et al., 2013)               44.4           82.9
    RNTN (Socher et al., 2013)                 45.7           85.4
    DCNN (Blunsom et al., 2014)                48.5           86.8
    Paragraph-Vec (Le and Mikolov, 2014)       48.7           87.8
    CNN-non-static (Kim, 2014)                 48.0           87.2
    CNN-multichannel (Kim, 2014)               47.4           88.1
    DRNN (Irsoy and Cardie, 2014)              49.8           86.6
    LSTM                                       46.4 (1.1)     84.9 (0.6)
    Bidirectional LSTM                         49.1 (1.0)     87.5 (0.5)
    2-layer LSTM                               46.0 (1.3)     86.3 (0.6)
    2-layer Bidirectional LSTM                 48.5 (1.0)     87.2 (1.0)
    Dependency Tree-LSTM                       48.4 (0.4)     85.7 (0.4)
    Constituency Tree-LSTM
      randomly initialized vectors             43.9 (0.6)     82.0 (0.5)
      Glove vectors, fixed                     49.7 (0.4)     87.5 (0.8)
      Glove vectors, tuned                     51.0 (0.5)     88.0 (0.3)

Table 2: Test set accuracies on the Stanford Sentiment Treebank. For our experiments, we report mean accuracies over 5 runs (standard deviations in parentheses). Fine-grained: 5-class sentiment classification. Binary: positive/negative sentiment classification.

5.2 Semantic Relatedness

For a given pair of sentences, the semantic relatedness task is to predict a human-generated rating of the similarity of the two sentences in meaning.

We use the Sentences Involving Compositional Knowledge (SICK) dataset (Marelli et al., 2014), consisting of 9927 sentence pairs in a 4500/500/4927 train/dev/test split. The sentences are derived from existing image and video description datasets. Each sentence pair is annotated with a relatedness score y ∈ [1, 5], with 1 indicating that the two sentences are completely unrelated, and 5 indicating that the two sentences are very related. Each label is the average of 10 ratings assigned by different human annotators.

Here, we use the similarity model described in Sec. 4.2. For the similarity prediction network (Eqs. 15) we use a hidden layer of size 50. We produce binarized constituency parses⁴ and dependency parses of the sentences in the dataset for our Constituency Tree-LSTM and Dependency Tree-LSTM models.

⁴ Constituency parses produced by the Stanford PCFG Parser (Klein and Manning, 2003).

5.3 Hyperparameters and Training Details

The hyperparameters for our models were tuned on the development set for each task.

We initialized our word representations using publicly available 300-dimensional Glove vectors⁵ (Pennington et al., 2014). For the sentiment classification task, word representations were updated during training with a learning rate of 0.1. For the semantic relatedness task, word representations were held fixed as we did not observe any significant improvement when the representations were tuned.

Our models were trained using AdaGrad (Duchi et al., 2011) with a learning rate of 0.05 and a minibatch size of 25. The model parameters were regularized with a per-minibatch L2 regularization strength of 10⁻⁴. The sentiment classifier was additionally regularized using dropout (Hinton et al., 2012) with a dropout rate of 0.5. We did not observe performance gains using dropout on the semantic relatedness task.

⁵ Trained on 840 billion tokens of Common Crawl data, http://nlp.stanford.edu/projects/glove/.
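For reference, the training choices above can be collected into a single illustrative configuration, together with the AdaGrad update rule they rely on. Only the constants come from the text; the dictionary keys and the parameter/gradient plumbing are assumptions of this sketch.

    import numpy as np

    CONFIG = {
        "word_vectors": "glove.840B.300d",   # 300-dimensional Glove vectors
        "optimizer": "adagrad",
        "learning_rate": 0.05,
        "minibatch_size": 25,
        "l2_strength": 1e-4,                 # applied per minibatch
        "dropout": 0.5,                      # sentiment classifier only
        "embedding_lr": 0.1,                 # sentiment task only; embeddings fixed for relatedness
    }

    def adagrad_update(param, grad, hist, lr=0.05, eps=1e-8):
        """One AdaGrad step: scale the learning rate by the accumulated squared gradients."""
        hist += grad ** 2
        param -= lr * grad / (np.sqrt(hist) + eps)
        return param, hist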
    Method                                         Pearson's r       Spearman's ρ      MSE
    Illinois-LH (Lai and Hockenmaier, 2014)        0.7993            0.7538            0.3692
    UNAL-NLP (Jimenez et al., 2014)                0.8070            0.7489            0.3550
    Meaning Factory (Bjerva et al., 2014)          0.8268            0.7721            0.3224
    ECNU (Zhao et al., 2014)                       0.8414
    Mean vectors                                   0.7577 (0.0013)   0.6738 (0.0027)   0.4557 (0.0090)
    DT-RNN (Socher et al., 2014)                   0.7923 (0.0070)   0.7319 (0.0071)   0.3822 (0.0137)
    SDT-RNN (Socher et al., 2014)                  0.7900 (0.0042)   0.7304 (0.0076)   0.3848 (0.0074)
    LSTM                                           0.8528 (0.0031)   0.7911 (0.0059)   0.2831 (0.0092)
    Bidirectional LSTM                             0.8567 (0.0028)   0.7966 (0.0053)   0.2736 (0.0063)
    2-layer LSTM                                   0.8515 (0.0066)   0.7896 (0.0088)   0.2838 (0.0150)
    2-layer Bidirectional LSTM                     0.8558 (0.0014)   0.7965 (0.0018)   0.2762 (0.0020)
    Constituency Tree-LSTM                         0.8582 (0.0038)   0.7966 (0.0053)   0.2734 (0.0108)
    Dependency Tree-LSTM                           0.8676 (0.0030)   0.8083 (0.0042)   0.2532 (0.0052)

Table 3: Test set results on the SICK semantic relatedness subtask. For our experiments, we report mean scores over 5 runs (standard deviations in parentheses). Results are grouped as follows: (1) SemEval 2014 submissions; (2) our own baselines; (3) sequential LSTMs; (4) tree-structured LSTMs.

6 Results

6.1 Sentiment Classification

Our results are summarized in Table 2. The Constituency Tree-LSTM outperforms existing systems on the fine-grained classification subtask and achieves accuracy comparable to the state-of-the-art on the binary subtask. In particular, we find that it outperforms the Dependency Tree-LSTM. This performance gap is at least partially attributable to the fact that the Dependency Tree-LSTM is trained on less data: about 150K labeled nodes vs. 319K for the Constituency Tree-LSTM. This difference is due to (1) the dependency representations containing fewer nodes than the corresponding constituency representations, and (2) the inability to match about 9% of the dependency nodes to a corresponding span in the training data.

We found that updating the word representations during training ("fine-tuning" the word embedding) yields a significant boost in performance on the fine-grained classification subtask and gives a minor gain on the binary classification subtask (this finding is consistent with previous work on this task by Kim (2014)). These gains are to be expected since the Glove vectors used to initialize our word representations were not originally trained to capture sentiment.

6.2 Semantic Relatedness

Our results are summarized in Table 3. Following Marelli et al. (2014), we use Pearson's r, Spearman's ρ and mean squared error (MSE) as evaluation metrics. The first two metrics are measures of correlation against human evaluations of semantic relatedness.

We compare our models against a number of non-LSTM baselines. The mean vector baseline computes sentence representations as a mean of the representations of the constituent words. The DT-RNN and SDT-RNN models (Socher et al., 2014) both compose vector representations for the nodes in a dependency tree as a sum over affine-transformed child vectors, followed by a nonlinearity. The SDT-RNN is an extension of the DT-RNN that uses a separate transformation for each dependency relation. For each of our baselines, including the LSTM models, we use the similarity model described in Sec. 4.2.

We also compare against four of the top-performing systems⁶ submitted to the SemEval 2014 semantic relatedness shared task: ECNU (Zhao et al., 2014), The Meaning Factory (Bjerva et al., 2014), UNAL-NLP (Jimenez et al., 2014), and Illinois-LH (Lai and Hockenmaier, 2014). These systems are heavily feature engineered, generally using a combination of surface form overlap features and lexical distance features derived from WordNet or the Paraphrase Database (Ganitkevitch et al., 2013).

⁶ We list the strongest results we were able to find for this task; in some cases, these results are stronger than the official performance by the team on the shared task. For example, the listed result by Zhao et al. (2014) is stronger than their submitted system's Pearson correlation score of 0.8280.
[Figure 3: Fine-grained sentiment classification accuracy vs. sentence length. For each ℓ, we plot accuracy for the test set sentences with length in the window [ℓ - 2, ℓ + 2]. Examples in the tail of the length distribution are batched in the final window (ℓ = 45).]

[Figure 4: Pearson correlations r between predicted similarities and gold ratings vs. sentence length. For each ℓ, we plot r for the pairs with mean length in the window [ℓ - 2, ℓ + 2]. Examples in the tail of the length distribution are batched in the final window (ℓ = 18.5).]

Our LSTM models outperform all these systems without any additional feature engineering, with the best results achieved by the Dependency Tree-LSTM. Recall that in this task, both Tree-LSTM models only receive supervision at the root of the tree, in contrast to the sentiment classification task where supervision was also provided at the intermediate nodes. We conjecture that in this setting, the Dependency Tree-LSTM benefits from its more compact structure relative to the Constituency Tree-LSTM, in the sense that paths from input word vectors to the root of the tree are shorter on aggregate for the Dependency Tree-LSTM.

7 Discussion and Qualitative Analysis

7.1 Modeling Semantic Relatedness

In Table 4, we list nearest-neighbor sentences retrieved from a 1000-sentence sample of the SICK test set. We compare the neighbors ranked by the Dependency Tree-LSTM model against a baseline ranking by cosine similarity of the mean word vectors for each sentence.

The Dependency Tree-LSTM model exhibits several desirable properties. Note that in the dependency parse of the second query sentence, the word "ocean" is the second-furthest word from the root ("waving"), with a depth of 4. Regardless, the retrieved sentences are all semantically related to the word "ocean", which indicates that the Tree-LSTM is able to both preserve and emphasize information from relatively distant nodes. Additionally, the Tree-LSTM model shows greater robustness to differences in sentence length. Given the query "two men are playing guitar", the Tree-LSTM associates the phrase "playing guitar" with the longer, related phrase "dancing and singing in front of a crowd" (note as well that there is zero token overlap between the two phrases).

7.2 Effect of Sentence Length

One hypothesis to explain the empirical strength of Tree-LSTMs is that tree structures help mitigate the problem of preserving state over long sequences of words. If this were true, we would expect to see the greatest improvement over sequential LSTMs on longer sentences. In Figs. 3 and 4, we show the relationship between sentence length and performance as measured by the relevant task-specific metric. Each data point is a mean score over 5 runs, and error bars have been omitted for clarity.

We observe that while the Dependency Tree-LSTM does significantly outperform its sequential counterparts on the relatedness task for longer sentences of length 13 to 15 (Fig. 4), it also achieves consistently strong performance on shorter sentences. This suggests that unlike sequential LSTMs, Tree-LSTMs are able to encode semantically useful structural information in the sentence representations that they compose.
Query: a woman is slicing potatoes
  Ranking by mean word vector cosine similarity:
    a woman is cutting potatoes (0.96); a woman is slicing herbs (0.92); a woman is slicing tofu (0.92)
  Ranking by Dependency Tree-LSTM model:
    a woman is cutting potatoes (4.82); potatoes are being sliced by a woman (4.70); tofu is being sliced by a woman (4.39)

Query: a boy is waving at some young runners from the ocean
  Ranking by mean word vector cosine similarity:
    a man and a boy are standing at the bottom of some stairs, which are outdoors (0.92); a group of children in uniforms is standing at a gate and one is kissing the mother (0.90); a group of children in uniforms is standing at a gate and there is no one kissing the mother (0.90)
  Ranking by Dependency Tree-LSTM model:
    a group of men is playing with a ball on the beach (3.79); a young boy wearing a red swimsuit is jumping out of a blue kiddies pool (3.37); the man is tossing a kid into the swimming pool that is near the ocean (3.19)

Query: two men are playing guitar
  Ranking by mean word vector cosine similarity:
    some men are playing rugby (0.88); two men are talking (0.87); two dogs are playing with each other (0.87)
  Ranking by Dependency Tree-LSTM model:
    the man is singing and playing the guitar (4.08); the man is opening the guitar for donations and plays with the case (4.01); two men are dancing and singing in front of a crowd (4.00)

Table 4: Most similar sentences from a 1000-sentence sample drawn from the SICK test set, with cosine similarities and predicted relatedness scores in parentheses. The Tree-LSTM model is able to pick up on more subtle relationships, such as that between "beach" and "ocean" in the second example.

8 Related Work

Distributed representations of words (Rumelhart et al., 1988; Collobert et al., 2011; Turian et al., 2010; Huang et al., 2012; Mikolov et al., 2013; Pennington et al., 2014) have found wide applicability in a variety of NLP tasks. Following this success, there has been substantial interest in the area of learning distributed phrase and sentence representations (Mitchell and Lapata, 2010; Yessenalina and Cardie, 2011; Grefenstette et al., 2013; Mikolov et al., 2013), as well as distributed representations of longer bodies of text such as paragraphs and documents (Srivastava et al., 2013; Le and Mikolov, 2014).

Our approach builds on recursive neural networks (Goller and Kuchler, 1996; Socher et al., 2011), which we abbreviate as Tree-RNNs in order to avoid confusion with recurrent neural networks. Under the Tree-RNN framework, the vector representation associated with each node of a tree is composed as a function of the vectors corresponding to the children of the node. The choice of composition function gives rise to numerous variants of this basic framework. Tree-RNNs have been used to parse images of natural scenes (Socher et al., 2011), compose phrase representations from word vectors (Socher et al., 2012), and classify the sentiment polarity of sentences (Socher et al., 2013).

9 Conclusion

In this paper, we introduced a generalization of LSTMs to tree-structured network topologies. The Tree-LSTM architecture can be applied to trees with arbitrary branching factor. We demonstrated the effectiveness of the Tree-LSTM by applying the architecture in two tasks: semantic relatedness and sentiment classification, outperforming existing systems on both. Controlling for model dimensionality, we demonstrated that Tree-LSTM models are able to outperform their sequential counterparts. Our results suggest further lines of work in characterizing the role of structure in producing distributed representations of sentences.

Acknowledgements

We thank our anonymous reviewers for their valuable feedback. Stanford University gratefully acknowledges the support of a Natural Language Understanding-focused gift from Google Inc. and the Defense Advanced Research Projects Agency (DARPA) Deep Exploration and Filtering of Text (DEFT) Program under Air Force Research Laboratory (AFRL) contract no. FA8750-13-2-0040. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of DARPA, AFRL, or the US government.

References

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Bengio, Yoshua, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5(2):157-166.

Bjerva, Johannes, Johan Bos, Rob van der Goot, and Malvina Nissim. 2014. The Meaning Factory: Formal semantics for recognizing textual entailment and determining semantic similarity. SemEval 2014.
Blunsom, Phil, Edward Grefenstette, Nal Kalchbrenner, et al. 2014. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics.

Chen, Danqi and Christopher D Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pages 740-750.

Collobert, Ronan, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research 12:2493-2537.

Duchi, John, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research 12:2121-2159.

Elman, Jeffrey L. 1990. Finding structure in time. Cognitive Science 14(2):179-211.

Foltz, Peter W, Walter Kintsch, and Thomas K Landauer. 1998. The measurement of textual coherence with latent semantic analysis. Discourse Processes 25(2-3):285-307.

Ganitkevitch, Juri, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The Paraphrase Database. In HLT-NAACL. pages 758-764.

Goller, Christoph and Andreas Kuchler. 1996. Learning task-dependent distributed representations by backpropagation through structure. In IEEE International Conference on Neural Networks. volume 1, pages 347-352.

Graves, Alex, Navdeep Jaitly, and A-R Mohamed. 2013. Hybrid speech recognition with deep bidirectional LSTM. In IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). pages 273-278.

Grefenstette, Edward, Georgiana Dinu, Yao-Zhong Zhang, Mehrnoosh Sadrzadeh, and Marco Baroni. 2013. Multi-step regression learning for compositional distributional semantics. arXiv preprint arXiv:1301.6939.

Hinton, Geoffrey E, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.

Hochreiter, Sepp. 1998. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 6(02):107-116.

Hochreiter, Sepp and Jurgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9(8):1735-1780.

Huang, Eric H., Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In Annual Meeting of the Association for Computational Linguistics (ACL).

Irsoy, Ozan and Claire Cardie. 2014. Deep recursive neural networks for compositionality in language. In Advances in Neural Information Processing Systems. pages 2096-2104.

Jimenez, Sergio, George Duenas, Julia Baquero, Alexander Gelbukh, Av Juan Dios Batiz, and Av Mendizabal. 2014. UNAL-NLP: Combining soft cardinality features for semantic textual similarity, relatedness and entailment. SemEval 2014.

Kim, Yoon. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Klein, Dan and Christopher D Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1. Association for Computational Linguistics, pages 423-430.

Lai, Alice and Julia Hockenmaier. 2014. Illinois-LH: A denotational and distributional approach to semantics. SemEval 2014.

Landauer, Thomas K and Susan T Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review 104(2):211.

Le, Quoc V and Tomas Mikolov. 2014. Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053.

Marelli, Marco, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini, and Roberto Zamparelli. 2014. SemEval-2014 Task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In SemEval 2014.

Mikolov, Tomas. 2012. Statistical Language Models Based on Neural Networks. Ph.D. thesis, Brno University of Technology.

Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. pages 3111-3119.

Mitchell, Jeff and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive Science 34(8):1388-1429.

Pennington, Jeffrey, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014) 12.

Rumelhart, David E, Geoffrey E Hinton, and Ronald J Williams. 1988. Learning representations by back-propagating errors. Cognitive Modeling 5.

Socher, Richard, Brody Huval, Christopher D Manning, and Andrew Y Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, pages 1201-1211.

Socher, Richard, Andrej Karpathy, Quoc V Le, Christopher D Manning, and Andrew Y Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics 2:207-218.

Socher, Richard, Cliff C Lin, Chris Manning, and Andrew Y Ng. 2011. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11). pages 129-136.

Socher, Richard, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Srivastava, Nitish, Ruslan R Salakhutdinov, and Geoffrey E Hinton. 2013. Modeling documents with deep Boltzmann machines. arXiv preprint arXiv:1309.6865.

Sutskever, Ilya, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. pages 3104-3112.

Turian, Joseph, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pages 384-394.

Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2014. Show and tell: A neural image caption generator. arXiv preprint arXiv:1411.4555.

Yessenalina, Ainur and Claire Cardie. 2011. Compositional matrix-space models for sentiment analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 172-182.

Zaremba, Wojciech and Ilya Sutskever. 2014. Learning to execute. arXiv preprint arXiv:1410.4615.

Zhao, Jiang, Tian Tian Zhu, and Man Lan. 2014. ECNU: One stone two birds: Ensemble of heterogeneous measures for semantic relatedness and textual entailment. SemEval 2014.
