
Predicting Target Language CCG Supertags Improves Neural Machine Translation

Maria Nadejde(1) and Siva Reddy(1) and Rico Sennrich(1) and Tomasz Dwojak(1,2)
Marcin Junczys-Dowmunt(2) and Philipp Koehn(3) and Alexandra Birch(1)
(1) School of Informatics, University of Edinburgh
(2) Adam Mickiewicz University
(3) Dep. of Computer Science, Johns Hopkins University
{m.nadejde, siva.reddy, rico.sennrich, a.birch}@ed.ac.uk
{t.dwojak, junczys}@amu.edu.pl, phi@jhu.edu

Abstract

Neural machine translation (NMT) models are able to partially learn syntactic information from sequential lexical information. Still, some complex syntactic phenomena such as prepositional phrase attachment are poorly modeled. This work aims to answer two questions: 1) Does explicitly modeling target language syntax help NMT? 2) Is tight integration of words and syntax better than multitask training? We introduce syntactic information in the form of CCG supertags in the decoder, by interleaving the target supertags with the word sequence. Our results on WMT data show that explicitly modeling target-syntax improves machine translation quality for German→English, a high-resource pair, and for Romanian→English, a low-resource pair, as well as for several syntactic phenomena including prepositional phrase attachment. Furthermore, a tight coupling of words and syntax improves translation quality more than multitask training. By combining target-syntax with source-side dependency labels in the embedding layer, we obtain a total improvement of 0.9 BLEU for German→English and 1.2 BLEU for Romanian→English.

1 Introduction

Sequence-to-sequence neural machine translation (NMT) models (Sutskever et al., 2014; Cho et al., 2014b; Bahdanau et al., 2015) are state-of-the-art on a multitude of language pairs (Sennrich et al., 2016a; Junczys-Dowmunt et al., 2016). Part of the appeal of neural models is that they can learn to implicitly model phenomena which underlie high quality output, and some syntax is indeed captured by these models. In a detailed analysis, Bentivogli et al. (2016) show that NMT significantly improves over phrase-based SMT, in particular with respect to morphology and word order, but that results can still be improved for longer sentences and complex syntactic phenomena such as prepositional phrase (PP) attachment. Another study by Shi et al. (2016) shows that the encoder layer of NMT partially learns syntactic information about the source language; however, complex syntactic phenomena such as coordination or PP attachment are poorly modeled.

Recent work which incorporates additional source-side linguistic information in NMT models (Luong et al., 2016; Sennrich and Haddow, 2016) shows that even though neural models have strong learning capabilities, explicit features can still improve translation quality. In this work, we examine the benefit of incorporating global syntactic information on the target side. We also address the question of how best to incorporate this information. For language pairs where syntactic resources are available on both the source and target side, we show that approaches to incorporate source syntax and target syntax are complementary.

We propose a method for tightly coupling words and syntax by interleaving the target syntactic representation with the word sequence. We compare this to loosely coupling words and syntax using a multitask solution, where the shared parts of the model are trained to produce either a target sequence of words or supertags, in a similar fashion to Luong et al. (2016).

We use CCG syntactic categories (Steedman, 2000), also known as supertags, to represent syntax explicitly. Supertags provide global syntactic information locally at the lexical level. They encode subcategorization information, capturing short and long range dependencies and attachments, and also tense and morphological aspects of the word in a given context.
Consider the sentence in Figure 1. This sentence contains two PP attachments and could lead to several disambiguation possibilities (in can attach to Netanyahu or receives, and of can attach to capital, Netanyahu or receives). These alternatives may lead to different translations in other languages. However, the supertag ((S[dcl]\NP)/PP)/NP of receives indicates that the preposition in attaches to the verb, and the supertag (NP\NP)/NP of of indicates that it attaches to capital, thereby resolving the ambiguity.

Our research contributions are as follows:

- We propose a novel approach to integrating target syntax at word level in the decoder, by interleaving CCG supertags in the target word sequence.

- We show that the target language syntax improves translation quality for German→English and Romanian→English as measured by BLEU. Our results suggest that a tight coupling of target words and syntax (by interleaving) improves translation quality more than the decoupled signal from multitask training.

- We show that incorporating source-side linguistic information is complementary to our method, further improving the translation quality.

- We present a fine-grained analysis of SNMT (syntax-aware NMT) and show consistent gains for different linguistic phenomena and sentence lengths.

2 Related work

Syntax has helped in statistical machine translation (SMT) to capture dependencies between distant words that impact morphological agreement, subcategorisation and word order (Galley et al., 2004; Menezes and Quirk, 2007; Williams and Koehn, 2012; Nadejde et al., 2013; Sennrich, 2015; Nadejde et al., 2016a,b; Chiang, 2007). There has been some work in NMT on modeling source-side syntax implicitly or explicitly. Kalchbrenner and Blunsom (2013) and Cho et al. (2014a) capture the hierarchical aspects of language implicitly by using convolutional neural networks, while Eriguchi et al. (2016) use the parse tree of the source sentence to guide the recurrence and attention model in tree-to-sequence NMT. Luong et al. (2016) co-train a translation model and a source-side syntactic parser which share the encoder. Our multitask models extend their work to attention-based NMT models and to predicting target-side syntax as the secondary task. Sennrich and Haddow (2016) generalize the embedding layer of NMT to include explicit linguistic features such as dependency relations and part-of-speech tags, and we use their framework to show that source and target syntax provide complementary information.

Applying more tightly coupled linguistic factors on the target side of NMT has been previously investigated. Niehues et al. (2016) proposed a factored RNN-based language model for re-scoring an n-best list produced by a phrase-based MT system. In recent work, Martínez et al. (2016) implemented a factored NMT decoder which generated both lemmas and morphological tags. The two factors were then post-processed to generate the word form. Unfortunately, no real gain was reported for these experiments. Concurrently with our work, Aharoni and Goldberg (2017) proposed serializing the target constituency trees, and Eriguchi et al. (2017) model target dependency relations by augmenting the NMT decoder with an RNN grammar (Dyer et al., 2016). In our work, we use CCG supertags, which are a more compact representation of global syntax. Furthermore, we do not focus on model architectures; instead, we explore the more general problem of including target syntax in NMT, comparing tightly and loosely coupled syntactic information and showing that source and target syntax are complementary.

Previous work on integrating CCG supertags in factored phrase-based models (Birch et al., 2007) made strong independence assumptions between the target word sequence and the CCG categories. In this work we take advantage of the expressive power of recurrent neural networks to learn representations that generate both words and CCG supertags, conditioned on the entire lexical and syntactic target history.

3 Modeling Syntax in NMT

CCG is a lexicalised formalism in which words are assigned syntactic categories, i.e., supertags, that indicate context-sensitive morpho-syntactic properties of a word in a sentence. The combinators of CCG allow the supertags to capture global syntactic constraints locally.
Source-side
BPE: Obama receives Net+ an+ yahu in the capital of USA
IOB: O O B I E O O O O O
CCG: NP ((S[dcl]\NP)/PP)/NP NP NP NP PP/NP NP/N N (NP\NP)/NP NP
Target-side
NP Obama ((S[dcl]\NP)/PP)/NP receives NP Net+ an+ yahu PP/NP in NP/N the N capital (NP\NP)/NP of NP USA

Figure 1: Source and target representation of syntactic information in syntax-aware NMT.

Though NMT captures long range dependencies using long-term memory, short-term memory is cheap and reliable. Supertags can help by allowing the model to rely more on local information (short-term) and not having to rely heavily on long-term memory.

Consider a decoder that has to generate the following sentences:

1. What_(S[wq]/(S[q]/NP))/N city is_(S[q]/PP)/NP the Taj Mahal in?

2. Where_S[wq]/(S[q]/NP) is_(S[q]/NP)/NP the Taj Mahal?

If the decoding starts with predicting What, it is ungrammatical to omit the preposition in, and if the decoding starts with predicting Where, it is ungrammatical to predict the preposition. Here the decision to predict in depends on the first word, a long range dependency. However, if we rely on CCG supertags, the supertags of these two sequences look very different. The supertag (S[q]/PP)/NP for the verb is in the first sentence indicates that a preposition is expected in future context. Furthermore, it is likely to see this particular supertag of the verb in the context of (S[wq]/(S[q]/NP))/N, but unlikely in the context of S[wq]/(S[q]/NP). Therefore a succession of local decisions based on CCG supertags will result in the correct prediction of the preposition in the first sentence, and in omitting the preposition in the second sentence. Since the vocabulary of CCG supertags is much smaller than that of possible words, the NMT model will do a better job at generalizing over and predicting the correct CCG supertag sequence.

CCG supertags also help during encoding if they are given in the input, as we saw with the case of PP attachment in Figure 1. Translation of the correct verb form and agreement can be improved with CCG since supertags also encode tense, morphology and agreement. For example, in the sentence It is going to rain, the supertag (S[ng]\NP[expl])/(S[to]\NP) of going indicates that the current word is a verb in continuous form looking for an infinitive construction on the right, and an expletive pronoun on the left.

We explore the effect of target-side syntax by using CCG supertags in the decoder and by combining these with source-side syntax in the encoder, as follows.

Baseline decoder
The baseline decoder architecture is a conditional GRU with attention (cGRU_attn) as implemented in the Nematus toolkit (Sennrich et al., 2017). The decoder is a recursive function computing a hidden state s_j at each time step j ∈ [1, T] of the target recurrence. This function takes as input the previous hidden state s_{j-1}, the embedding of the previous target word y_{j-1}, and the output of the attention model c_j. The attention model computes a weighted sum over the hidden states h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i] of the bi-directional RNN encoder. The function g computes the intermediate representation t_j and passes this to a softmax layer, which first applies a linear transformation (W_o) and then computes the probability distribution over the target vocabulary. The training objective for the entire architecture is minimizing the discrete cross-entropy, therefore the loss l is the negative log-probability of the reference sentence.

s'_j = GRU_1(y_{j-1}, s_{j-1})    (1)
c_j = ATT([h_1; ...; h_{|x|}], s'_j)    (2)
s_j = cGRU_attn(y_{j-1}, s_{j-1}, c_j)    (3)
t_j = g(y_{j-1}, s_j, c_j)    (4)
p_y = \prod_{j=1}^{T} p(y_j | x, y_{1:j-1}) = \prod_{j=1}^{T} softmax(t_j W_o)    (5)
l = -\log(p_y)    (6)
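As an illustration only, the sketch below steps once through equations (1)-(6) in plain numpy. It is a simplified reading of the conditional GRU with attention, not the exact Nematus code: the parameter names (Wz, Ua, Wt, Wo, ...), the additive attention score and the choice of g as a single tanh layer over [y_{j-1}; s_j; c_j] are assumptions made for this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gru_step(x, h_prev, p):
    # Standard GRU update with update gate z, reset gate r and candidate state.
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev)
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev)
    h_cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h_prev))
    return (1.0 - z) * h_prev + z * h_cand

def attention(H, s, p):
    # Eq. (2): score each encoder state (a row of H) against the intermediate
    # decoder state s, normalise, and return the weighted sum (context vector).
    scores = np.tanh(H @ p["Wa"] + s @ p["Ua"]) @ p["va"]
    alpha = softmax(scores)
    return alpha @ H

def decoder_step(y_prev_emb, s_prev, H, p):
    s_tmp = gru_step(y_prev_emb, s_prev, p["gru1"])            # eq. (1)
    c = attention(H, s_tmp, p["att"])                          # eq. (2)
    s = gru_step(c, s_tmp, p["gru2"])                          # eq. (3), realised as a second GRU over c
    t = np.tanh(p["Wt"] @ np.concatenate([y_prev_emb, s, c]))  # eq. (4), one simple choice for g
    prob = softmax(p["Wo"] @ t)                                # eq. (5): distribution over the target vocabulary
    return s, prob
```

At each step, the loss contribution of eq. (6) is then -log(prob[y_j]) for the reference token y_j, summed over the sentence.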
Figure 2: Integrating target syntax in the NMT decoder: a) interleaving and b) multitasking.

Target-side syntax
When modeling the target-side syntactic information, we consider different strategies of coupling the CCG supertags with the translated words in the decoder: interleaving and multitasking with a shared encoder. In Figure 2 we represent graphically the differences between the two strategies, and in the next paragraphs we formalize them.

Interleaving
In this paper we propose a tight integration in the decoder of the syntactic representation and the surface forms. Before each word of the target sequence we include its supertag as an extra token. The new target sequence y' will have length 2T, where T is the number of target words. With this representation, a single decoder learns to predict both the target supertags and the target words conditioned on previous syntactic and lexical context. We do not make changes to the baseline NMT decoder architecture, keeping equations (1)-(6) and the corresponding set of parameters unchanged. Instead, we augment the target vocabulary to include both words and CCG supertags. This results in a shared embedding space and the following probability of the target sequence y', where y'_j can be either a word or a tag:

y' = y^{tag}_1, y^{word}_1, ..., y^{tag}_T, y^{word}_T    (7)
p_{y'} = \prod_{j=1}^{2T} p(y'_j | x, y'_{1:j-1})    (8)

At training time we pre-process the target sequence to add the syntactic annotation and then split only the words into byte-pair encoding (BPE) (Sennrich et al., 2016b) sub-units. At testing time we delete the predicted CCG supertags to obtain the final translation. Figure 1 gives an example of the target-side representation in the case of interleaving. The supertag NP corresponding to the word Netanyahu is included only once, before the three BPE sub-units Net+ an+ yahu.
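The pre- and post-processing around this decoder can be sketched as follows. The helper names are ours; the real pipeline applies BPE with the scripts of Sennrich et al. (2016b), but the logic is the one described above: one supertag token before each word's sub-units at training time, and deletion of all supertag tokens from the system output at test time.

```python
def interleave(bpe_words, supertags):
    """Build the interleaved target sequence of eq. (7): the CCG supertag of each
    word is emitted once, before the word's BPE sub-units (cf. Figure 1)."""
    assert len(bpe_words) == len(supertags)
    sequence = []
    for subunits, tag in zip(bpe_words, supertags):
        sequence.append(tag)
        sequence.extend(subunits)
    return sequence

def strip_supertags(tokens, tag_vocabulary):
    """Test-time post-processing: drop every predicted token that is a known CCG
    supertag, leaving only the translation."""
    return [t for t in tokens if t not in tag_vocabulary]

# Example mirroring Figure 1 (words already split into BPE sub-units):
words = [["Obama"], ["receives"], ["Net+", "an+", "yahu"]]
tags = ["NP", "((S[dcl]\\NP)/PP)/NP", "NP"]
interleave(words, tags)
# -> ['NP', 'Obama', '((S[dcl]\\NP)/PP)/NP', 'receives', 'NP', 'Net+', 'an+', 'yahu']
```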
Multitasking shared encoder
A loose coupling of the syntactic representation and the surface forms can be achieved by co-training a translation model with a secondary prediction task, in our case CCG supertagging. In the multitask framework (Luong et al., 2016) the encoder part is shared, while the decoder is different for each of the prediction tasks: translation and tagging. In contrast to Luong et al., we train a separate attention model for each task and perform multitask learning with target syntax. The two decoders take as input the same source context, represented by the encoder's hidden states h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]. However, each task has its own set of parameters associated with the five components of the decoder: GRU_1, ATT, cGRU_attn, g, softmax. Furthermore, the two decoders may predict a different number of target symbols, resulting in target sequences of different lengths T_1 and T_2. This results in two probability distributions over separate target vocabularies for the words and the tags:

p^{word}_y = \prod_{j=1}^{T_1} p(y^{word}_j | x, y^{word}_{1:j-1})    (9)
p^{tag}_y = \prod_{k=1}^{T_2} p(y^{tag}_k | x, y^{tag}_{1:k-1})    (10)

The final loss is the sum of the losses for the two decoders:

l = -(\log(p^{word}_y) + \log(p^{tag}_y))    (11)

We use EasySRL to label the English side of the parallel corpus with CCG supertags, instead of using a corpus with gold annotations as in Luong et al. (2016). (We use the same data and annotations for the interleaving approach.)
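Schematically, the combined objective of eq. (11) can be written as below; this is a sketch with names of our choosing, and in practice both terms are the toolkit's standard per-token cross-entropy computed over mini-batches.

```python
import numpy as np

def sequence_nll(step_distributions, reference_ids):
    """Negative log-likelihood of a reference sequence, given one probability
    distribution per decoding step (the factorisations in eqs. 9 and 10)."""
    return -sum(np.log(p[y]) for p, y in zip(step_distributions, reference_ids))

def multitask_loss(word_dists, word_refs, tag_dists, tag_refs):
    """Eq. (11): the joint loss is the sum of the word decoder's and the tag
    decoder's losses; the two sequences may differ in length and vocabulary."""
    return sequence_nll(word_dists, word_refs) + sequence_nll(tag_dists, tag_refs)
```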
Source-side syntax shared embedding
While our focus is on target-side syntax, we also experiment with including source-side syntax to show that the two approaches are complementary. Sennrich and Haddow (2016) propose a framework for including source-side syntax as extra features in the NMT encoder. They extend the model of Bahdanau et al. (2015) by learning a separate embedding for several source-side features, such as the word itself or its part-of-speech. All feature embeddings are concatenated into one embedding vector, which is used in all parts of the encoder model instead of the word embedding. When modeling the source-side syntactic information, we include the CCG supertags or dependency labels as extra features. The baseline features are the subword units obtained using BPE, together with the annotation of the subword structure in IOB format, marking whether a symbol in the text forms the beginning (B), inside (I), or end (E) of a word. A separate tag (O) is used if a symbol corresponds to the full word. The word-level supertag is replicated for each BPE unit. Figure 1 gives an example of the source-side feature representation.
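A sketch of this source-side feature construction is given below; the function names are ours, but the behaviour follows the description above and the source side of Figure 1.

```python
import numpy as np

def bpe_feature_rows(bpe_words, word_tags):
    """For each word (a list of BPE sub-units), emit one (sub-unit, IOB, tag) row.
    B/I/E mark the beginning/inside/end of a split word, O marks an unsplit word,
    and the word-level tag (CCG supertag or dependency label) is replicated for
    every sub-unit of the word."""
    rows = []
    for subunits, tag in zip(bpe_words, word_tags):
        if len(subunits) == 1:
            rows.append((subunits[0], "O", tag))
            continue
        for i, subunit in enumerate(subunits):
            iob = "B" if i == 0 else ("E" if i == len(subunits) - 1 else "I")
            rows.append((subunit, iob, tag))
    return rows

def encoder_input_embedding(subunit_vec, iob_vec, tag_vec):
    """Sennrich and Haddow (2016): the encoder input is the concatenation of the
    per-feature embeddings; the total dimensionality stays fixed (Section 4.1)."""
    return np.concatenate([subunit_vec, iob_vec, tag_vec])

# bpe_feature_rows([["Net+", "an+", "yahu"]], ["NP"])
# -> [('Net+', 'B', 'NP'), ('an+', 'I', 'NP'), ('yahu', 'E', 'NP')]
```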
4 Experimental Setup and Evaluation

4.1 Data and methods

We train the neural MT systems on all the parallel data available at WMT16 (Bojar et al., 2016) for the German-English and Romanian-English language pairs. The English side of the training data is annotated with CCG lexical tags using EasySRL (Lewis et al., 2015) and the available pre-trained model (https://github.com/uwnlp/EasySRL). The CCG tags include features such as the verb tense (e.g. [ng] for continuous form) or the sentence type (e.g. [pss] for passive). Some longer sentences cannot be processed by the parser, and we therefore eliminate them from our training and test data. We report the sentence counts for the filtered data sets in Table 1. Dependency labels are annotated with ParZu (Sennrich et al., 2013) for German and SyntaxNet (Andor et al., 2016) for Romanian.

         train      dev    test
DE-EN    4,468,314  2,986  2,994
RO-EN    605,885    1,984  1,984

Table 1: Number of sentences in the training, development and test sets.

All the neural MT systems are attentional encoder-decoder networks (Bahdanau et al., 2015) as implemented in the Nematus toolkit (Sennrich et al., 2017; https://github.com/rsennrich/nematus). We use similar hyper-parameters to those reported by Sennrich et al. (2016a) and Sennrich and Haddow (2016), with minor modifications: we used mini-batches of size 60 and the Adam optimizer (Kingma and Ba, 2014). We select the best single models according to BLEU on the development set and use the four best single models for the ensembles.

To show that we report results over strong baselines, Table 2 compares the scores obtained by our baseline systems to the ones reported in Sennrich et al. (2016a). We normalize diacritics for the English→Romanian test set, since different encodings of the letters with cedilla (ş, ţ) are used interchangeably throughout the corpus (https://en.wikipedia.org/wiki/Romanian_alphabet#ISO_8859). We did not remove or normalize Romanian diacritics for the other experiments reported in this paper. Our baseline systems are generally stronger than those of Sennrich et al. (2016a) due to training with a different optimizer for more iterations.

            This work   Sennrich et al. (2016a)
DE→EN       31.0        28.5
EN→DE       27.8        26.8
RO→EN       28.0        27.8
EN→RO (1)   25.6        23.9

Table 2: Comparison of baseline systems in this work and in Sennrich et al. (2016a). Case-sensitive BLEU scores reported over newstest2016 with mteval-13a.perl. (1) Normalized diacritics.

During training we validate our models with BLEU (Papineni et al., 2002) on development sets: newstest2013 for German→English and newsdev2016 for Romanian→English. We evaluate the systems on newstest2016 test sets for both language pairs and use bootstrap resampling (Riezler and Maxwell, 2005) to test statistical significance.
We compute BLEU with multi-bleu.perl over tokenized sentences, both on the development sets for early stopping and on the test sets for evaluating our systems.
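For reference, the significance test can be sketched as follows; the function names are ours, and any corpus-level scorer (for example a BLEU implementation) can be plugged in as corpus_metric.

```python
import random

def paired_bootstrap(sys_a, sys_b, refs, corpus_metric, n_samples=1000, seed=1):
    """Paired bootstrap resampling in the spirit of Riezler and Maxwell (2005):
    resample the test set with replacement many times and count how often
    system A outscores system B on the resampled sets."""
    rng = random.Random(seed)
    ids = list(range(len(refs)))
    wins = 0
    for _ in range(n_samples):
        sample = [rng.choice(ids) for _ in ids]
        score_a = corpus_metric([sys_a[i] for i in sample], [refs[i] for i in sample])
        score_b = corpus_metric([sys_b[i] for i in sample], [refs[i] for i in sample])
        if score_a > score_b:
            wins += 1
    return 1.0 - wins / n_samples   # estimated p-value for "A is not better than B"
```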
Words are segmented into sub-units that are learned jointly for source and target using BPE (Sennrich et al., 2016b), resulting in a vocabulary size of 85,000. The vocabulary size for CCG supertags was 500.

For the experiments with source-side features we use the BPE sub-units and the IOB tags as baseline features. We keep the total word embedding size fixed to 500 dimensions. We allocate 10 dimensions for dependency labels when using these as source-side features, and 135 dimensions when using source-side CCG supertags.

The interleaving approach to integrating target syntax increases the length of the target sequence. Therefore, at training time, when adding the CCG supertags to the target sequence, we increase the maximum sentence length from 50 to 100. On average, the length of English sentences for newstest2013 in BPE representation is 22.7, while the average length when adding the CCG supertags is 44. Increasing the length of the target recurrence results in larger memory consumption and slower training: roughly 10h30 per 100,000 sentences (20,000 batches) for SNMT, compared to 6h for NMT. At test time, we obtain the final translation by post-processing the predicted target sequence to remove the CCG supertags.

4.2 Results

In this section, we first evaluate the syntax-aware NMT model (SNMT) with target-side CCG supertags as compared to the baseline NMT model described in the previous section (Bahdanau et al., 2015; Sennrich et al., 2016a). We show that our proposed method for tightly coupling target syntax via interleaving improves translation for both German→English and Romanian→English, while the multitasking framework does not. Next, we show that SNMT with target-side CCG supertags can be complemented with source-side dependencies, and that combining both types of syntax brings the most improvement. Finally, our experiments with source-side CCG supertags confirm that global syntax can improve translation either as extra information in the encoder or in the decoder.

Target-side syntax
We first evaluate the impact of target-side CCG supertags on overall translation quality. In Table 3 we report results for German→English, a high-resource language pair, and for Romanian→English, a low-resource language pair. We report BLEU scores for both the best single models and ensemble models. However, we will only refer to the results with ensemble models since these are generally better.

The SNMT system with target-side syntax improves BLEU scores by 0.9 for Romanian→English and by 0.6 for German→English. Although the training data for German→English is large, the CCG supertags still improve translation quality. These results suggest that the baseline NMT decoder benefits from modeling the global syntactic information locally via supertags.

Next, we evaluate whether there is a benefit to tight coupling between the target word sequence and syntax, as opposed to loose coupling. We compare our method of interleaving the CCG supertags with multitasking, which predicts target CCG supertags as a secondary task. The results in Table 3 show that the multitask approach does not improve BLEU scores for German→English, which exhibits long distance word reordering. For Romanian→English, which exhibits more local word reordering, multitasking improves BLEU by 0.6 relative to the baseline. In contrast, the interleaving approach improves translation quality for both language pairs and to a larger extent. Therefore, we conclude that a tight integration of the target syntax and word sequence is important. Conditioning the prediction of words on their corresponding CCG supertags is what sets SNMT apart from the multitasking approach.

Source-side and target-side syntax
We now show that our method for integrating target-side syntax can be combined with the framework of Sennrich and Haddow (2016) for integrating source-side linguistic information, leading to further improvement in translation quality. We evaluate the syntax-aware NMT system with CCG supertags as target-syntax and dependency labels as source-syntax. While the dependency labels do not encode global syntactic information, they disambiguate the grammatical function of words.
                                                 German→English       Romanian→English
model          syntax        strategy            single   ensemble    single   ensemble
NMT            -             -                   31.0     32.1        28.1     28.4
SNMT           target CCG    interleaving        32.0     32.7*       29.2     29.3**
Multitasking   target CCG    shared encoder      31.4     32.0        28.4     29.0*
SNMT           source dep    shared embedding    31.4     32.2        28.2     28.9
               + target CCG  + interleaving      32.1     33.0**      29.1     29.6**

Table 3: Experiments with target-side syntax for German→English and Romanian→English. BLEU scores reported for baseline NMT, syntax-aware NMT (SNMT) and multitasking. The SNMT system is also combined with source dependencies. Statistical significance is indicated with * p < 0.05 and ** p < 0.01, when comparing against the NMT baseline.

Initially, we had intended to use global syntax on the source side as well for German→English; however, the German CCG tree-bank is still under development.

From the results in Table 3 we first observe that for German→English the source-side dependency labels improve BLEU by only 0.1, while Romanian→English sees an improvement of 0.5. Source-syntax may help more for Romanian→English because the training data is smaller and the word order is more similar between the source and target languages than it is for German→English.

For both language pairs, target-syntax improves translation quality more than source-syntax. However, target-syntax is complemented by source-syntax when used together, leading to a final improvement of 0.9 BLEU points for German→English and 1.2 BLEU points for Romanian→English.

Finally, we show that CCG supertags are also an effective representation of global syntax when used in the encoder. In Table 4 we present results for using CCG supertags as source-syntax in the embedding layer. Because we have CCG annotations only for English, we reverse the translation directions and report BLEU scores for English→German and English→Romanian. The BLEU scores reported are for the ensemble models over newstest2016.

model   syntax       EN→DE   EN→RO
NMT     -            28.3    25.6
SNMT    source CCG   29.0*   26.1*

Table 4: Results for English→German and English→Romanian with source-side syntax. The SNMT system uses the CCG supertags of the source words in the embedding layer. *p < 0.05.

For English→German BLEU increases by 0.7 points and for English→Romanian by 0.5 points. In contrast, Sennrich and Haddow (2016) obtain an improvement of only 0.2 for English→German using dependency labels, which encode only the grammatical function of words. These results confirm that representing global syntax in the encoder provides complementary information that the baseline NMT model is not able to learn from the source word sequence alone.

4.3 Analyses by sentence type

In this section, we make a finer-grained analysis of the impact of target-side syntax by looking at a breakdown of BLEU scores with respect to different linguistic constructions and sentence lengths. (Document-level BLEU is computed over each subset of sentences.)

We classify sentences into different linguistic constructions based on the CCG supertags that appear in them, e.g., the presence of the category (NP\NP)/(S/NP) indicates a subordinate construction.
Figure 3 a) shows the difference in BLEU points between the syntax-aware NMT system and the baseline NMT system for the following linguistic constructions: coordination (conj), control and raising (control), prepositional phrase attachment (pp), questions, and subordinate clauses (subordinate). In the figure we use the symbol * to indicate that syntactic information is used on the target (e.g. de-en*), or both on the source and target (e.g. *de-en*). We report the number of sentences for each category in Table 5.

With target-syntax, we see consistent improvements across all linguistic constructions for Romanian→English, and across all but control and raising for German→English. In particular, the increase in BLEU scores for the prepositional phrase and subordinate constructions suggests that target word order is improved.
Figure 3: Difference in BLEU points between SNMT and NMT, relative to baseline NMT scores, with respect to a) linguistic constructs and b) sentence lengths. The numbers attached to the bars represent the BLEU score for the baseline NMT system. The symbol * indicates that syntactic information is used on the target (e.g. de-en*), or both on the source and target (e.g. *de-en*).

        sub.  qu.  pp     contr.  conj
RO→EN   742   90   1,572  415     845
DE→EN   936   114  2,321  546     1,129

Table 5: Sentence counts for different linguistic constructions.

For German→English, there is a small decrease in BLEU for the control and raising constructions when using target-syntax alone. However, source-syntax adds complementary information to target-syntax, resulting in a small improvement for this category as well. Moreover, combining source and target-syntax increases translation quality across all linguistic constructions as compared to NMT and SNMT with target-syntax alone. For Romanian→English, combining source and target-syntax brings an additional improvement of 0.7 for subordinate constructs and 0.4 for prepositional phrase attachment. For German→English, on the same categories, there is an additional improvement of 0.4 and 0.3 respectively. Overall, BLEU scores improve by more than 1 BLEU point for most linguistic constructs and for both language pairs.

Next, we compare the systems with respect to sentence length. Figure 3 b) shows the difference in BLEU points between the syntax-aware NMT system and the baseline NMT system with respect to the length of the source sentence measured in BPE sub-units. We report the number of sentences for each category in Table 6.

        <15   15-25  25-35  >35
RO→EN   491   540    433    520
DE→EN   918   934    582    560

Table 6: Sentence counts for different sentence lengths.

With target-syntax, we see consistent improvements across all sentence lengths for Romanian→English, and across all but short sentences for German→English. For German→English there is a decrease in BLEU for sentences of up to 15 words. Since the German→English training data is large, the baseline NMT system learns a good model for short sentences with local dependencies and without subordinate or coordinate clauses. Including extra CCG supertags increases the target sequence without adding information about complex linguistic phenomena.
DE-EN Question
Source: Oder wollen Sie herausfinden, über was andere reden?
Ref.: Or do you want to find out what others are talking about?
NMT: Or would you like to find out about what others are talking about?
SNMT: Or do you want to find out what_NP/(S[dcl]/NP) others are_(S[dcl]\NP)/(S[ng]\NP) talking_(S[ng]\NP)/PP about_PP/NP ?

DE-EN Subordinate
Source: ...dass die Polizei jetzt sagt, ..., und dass Lamb in seinem Notruf Prentiss zwar als seine Frau bezeichnete...
Ref.: ...that police are now saying ..., and that while Lamb referred to Prentiss as his wife in the 911 call...
NMT: ...police are now saying ..., and that in his emergency call Prentiss he called his wife...
SNMT: ...police are now saying ..., and that lamb, in his emergency call, described_((S[dcl]\NP)/PP)/NP Prentiss as his wife...

Figure 4: Comparison of baseline NMT and SNMT with target syntax for German→English.

However, when using both source and target syntax, the effect on short sentences disappears. For Romanian→English there is also a large improvement on short sentences when combining source and target syntax: 2.9 BLEU points compared to the NMT baseline and 1.2 BLEU points compared to SNMT with target-syntax alone.

With both source and target-syntax, translation quality increases across all sentence lengths as compared to NMT and SNMT with target-syntax alone. For German→English sentences that are more than 35 words long, we see again the effect of increasing the target sequence by adding CCG supertags. Target-syntax helps, however BLEU improves by only 0.4, compared to 0.9 for sentences between 15 and 35 words. With both source and target syntax, BLEU improves by 0.8 for sentences with more than 35 words. For Romanian→English we see a similar result for sentences with more than 35 words: target-syntax improves BLEU by 0.6, while combining source and target syntax improves BLEU by 0.8. These results confirm as well that source-syntax adds complementary information to target-syntax and mitigates the problem of increasing the target sequence.

4.4 Discussion

Our experiments demonstrate that target-syntax improves translation for two translation directions: German→English and Romanian→English. Our proposed method predicts the target words together with their CCG supertags.

Although the focus of this paper is not improving CCG tagging, we can also measure how accurate SNMT is at predicting CCG supertags. We compare the CCG sequence predicted by the SNMT models with that predicted by EasySRL and obtain the following accuracies: 93.2 for Romanian→English, 95.6 for German→English, and 95.8 for German→English with both source and target syntax. (The multitasking model predicts a different number of CCG supertags than the number of target words; for the sentences where these numbers match, the CCG supertagging accuracy is 73.2.)
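The comparison can be sketched as below; how the sequences are aligned is our assumption (sentences whose predicted and reference tag sequences differ in length are skipped, which matters for the multitask model as noted above).

```python
def supertag_accuracy(snmt_tag_seqs, easysrl_tag_seqs):
    """Corpus-level supertag accuracy: per-position exact matches between the tags
    predicted by SNMT and the tags assigned by EasySRL, skipping length mismatches."""
    correct = total = 0
    for pred, ref in zip(snmt_tag_seqs, easysrl_tag_seqs):
        if len(pred) != len(ref):
            continue
        correct += sum(p == r for p, r in zip(pred, ref))
        total += len(ref)
    return 100.0 * correct / total if total else 0.0
```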
We conclude by giving a couple of examples in Figure 4 for which the SNMT system with target syntax produced more grammatical translations than the baseline NMT system.

In the example DE-EN Question, the baseline NMT system translates the preposition über twice as about. The SNMT system with target syntax predicts the correct CCG supertag for what, which expects to be followed by a sentence and not a preposition: NP/(S[dcl]/NP). Therefore SNMT correctly re-orders the preposition about to the end of the question.

In the example DE-EN Subordinate, the baseline NMT system fails to correctly attach Prentiss as an object and his wife as a modifier of the verb called (bezeichnete) in the subordinate clause. In contrast, the SNMT system predicts the correct sub-categorization frame of the verb described and correctly translates the entire predicate-argument structure.

5 Conclusions

This work introduces a method for modeling explicit target-syntax in a neural machine translation system, by interleaving target words with their corresponding CCG supertags. Earlier work on syntax-aware NMT mainly modeled syntax in the encoder, while our experiments suggest that modeling syntax in the decoder is also useful. Our results show that a tight integration of syntax in the decoder improves translation quality for both the German→English and Romanian→English language pairs, more so than a loose coupling of target words and syntax as in multitask learning.
Finally, by combining our method for integrating target-syntax with the framework of Sennrich and Haddow (2016) for source-syntax, we obtain the most improvement over the baseline NMT system: 0.9 BLEU for German→English and 1.2 BLEU for Romanian→English. In particular, we see large improvements for longer sentences involving syntactic phenomena such as subordinate and coordinate clauses and prepositional phrase attachment.

In future work, we plan to evaluate the impact of target-syntax when translating into a morphologically rich language, for example by using the Hindi CCGbank (Ambati et al., 2016).

Acknowledgements

We thank the anonymous reviewers for their comments and suggestions. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements 644402 (HimL), 644333 (SUMMA) and 645452 (QT21).

References

Roee Aharoni and Yoav Goldberg. 2017. Towards string-to-tree neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vancouver, Canada. Association for Computational Linguistics.

Bharat Ram Ambati, Tejaswini Deoskar, and Mark Steedman. 2016. Hindi CCGbank: CCG Treebank from the Hindi Dependency Treebank. In Language Resources and Evaluation.

Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2442–2452, Berlin, Germany. Association for Computational Linguistics.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations (ICLR).

Luisa Bentivogli, Arianna Bisazza, Mauro Cettolo, and Marcello Federico. 2016. Neural versus phrase-based machine translation quality: a case study. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 257–267.

Alexandra Birch, Miles Osborne, and Philipp Koehn. 2007. CCG supertags in factored statistical machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, StatMT '07, pages 9–16, Stroudsburg, PA, USA. Association for Computational Linguistics.

Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurelie Neveol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. 2016. Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation, pages 131–198, Berlin, Germany. Association for Computational Linguistics.

David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228.

Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014a. On the properties of neural machine translation: Encoder-decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar. Association for Computational Linguistics.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014b. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for Computational Linguistics.

Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 199–209, San Diego, California. Association for Computational Linguistics.

Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2016. Tree-to-sequence attentional neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 823–833, Berlin, Germany. Association for Computational Linguistics.

Akiko Eriguchi, Yoshimasa Tsuruoka, and Kyunghyun Cho. 2017. Learning to parse and translate improves neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vancouver, Canada. Association for Computational Linguistics.
Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. 2004. What's in a translation rule? In Proceedings of Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, HLT-NAACL '04.

Marcin Junczys-Dowmunt, Tomasz Dwojak, and Hieu Hoang. 2016. Is neural machine translation ready for deployment? A case study on 30 translation directions. In Proceedings of the IWSLT 2016.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709, Seattle, Washington, USA. Association for Computational Linguistics.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Mike Lewis, Luheng He, and Luke Zettlemoyer. 2015. Joint A* CCG parsing and semantic role labelling. In Empirical Methods in Natural Language Processing.

Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2016. Multi-task sequence to sequence learning. In Proceedings of the International Conference on Learning Representations (ICLR 2016).

Mercedes García Martínez, Loïc Barrault, and Fethi Bougares. 2016. Factored neural machine translation architectures. In International Workshop on Spoken Language Translation (IWSLT'16).

Arul Menezes and Chris Quirk. 2007. Using dependency order templates to improve generality in translation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 1–8.

Maria Nadejde, Alexandra Birch, and Philipp Koehn. 2016a. Modeling selectional preferences of verbs and nouns in string-to-tree machine translation. In Proceedings of the First Conference on Machine Translation, pages 32–42, Berlin, Germany. Association for Computational Linguistics.

Maria Nadejde, Alexandra Birch, and Philipp Koehn. 2016b. A neural verb lexicon model with source-side syntactic context for string-to-tree machine translation. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT).

Maria Nadejde, Philip Williams, and Philipp Koehn. 2013. Edinburgh's syntax-based machine translation systems. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 170–176, Sofia, Bulgaria.

Jan Niehues, Thanh-Le Ha, Eunah Cho, and Alex Waibel. 2016. Using factored word representation in neural network language models. In Proceedings of the First Conference on Machine Translation, Berlin, Germany.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pages 311–318, Stroudsburg, PA, USA. Association for Computational Linguistics.

Stefan Riezler and John T. Maxwell. 2005. On some pitfalls in automatic evaluation and significance testing for MT. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 57–64, Ann Arbor, Michigan. Association for Computational Linguistics.

Rico Sennrich. 2015. Modelling and optimizing on syntactic n-grams for statistical machine translation. Transactions of the Association for Computational Linguistics, 3:169–182.

Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Laubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde. 2017. Nematus: a toolkit for neural machine translation. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 65–68, Valencia, Spain. Association for Computational Linguistics.

Rico Sennrich and Barry Haddow. 2016. Linguistic input features improve neural machine translation. In Proceedings of the First Conference on Machine Translation, pages 83–91, Berlin, Germany.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Edinburgh neural machine translation systems for WMT 16. In Proceedings of the First Conference on Machine Translation, pages 371–376, Berlin, Germany. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany. Association for Computational Linguistics.

Rico Sennrich, Martin Volk, and Gerold Schneider. 2013. Exploiting synergies between open resources for German dependency parsing, POS-tagging, and morphological analysis. In Proceedings of the International Conference Recent Advances in Natural Language Processing 2013, pages 601–609, Hissar, Bulgaria.
Xing Shi, Inkit Padhi, and Kevin Knight. 2016. Does string-based neural MT learn source syntax? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1526–1534, Austin, Texas. Association for Computational Linguistics.

Mark Steedman. 2000. The Syntactic Process, volume 24. MIT Press.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems, NIPS'14, pages 3104–3112.

Philip Williams and Philipp Koehn. 2012. GHKM rule extraction and scope-3 parsing in Moses. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 388–394.
