Documenti di Didattica
Documenti di Professioni
Documenti di Cultura
people
district
Those
grown
many
years
have
lived
and
a particular phrase was mistranslated, the user
.
up
in
a
,
can examine the set of sentence pairs that a par- Ces
ticular translation was learned from, and make gens
ont
corrections.
grandi
These features mean that our system is capable of ,
vécu
improving with use and adapting to be more appro-
et
priate for a new domain. This has two main im- oeuvré
plications: our systems get better as our customers des
use them, and our systems have the potential to be dizaines
trained using example translations from one domain d'
(such as government documents, which have abun- années
dans
dant translations) and gradually adapted to a new do-
le
main. domaine
The remainder of the paper is as follows: Section agricole
2 describes how our phrase-based models of trans- .
lation are learned from archived translations, and
gives example output produced by a system trained Figure 1: A word-level alignment for a sentence pair
on data from the Canadian parliament. Section 3 that occurs in our training data
shows how our system dynamically integrates edited
output by extracting the translations of new phrases, 2.1 Word Alignments
and weighting the corrected translations more heav-
ily than existing translations in the model. Section Brown et al. (1993) define a series of translation
4 described the advanced editing technique that al- models, which are commonly referred to as IBM
lows a user to inspect the sentence pairs which a Models 1 to 5. The IBM Models formulate transla-
faulty translation was learned from, and correct the tion essentially as a word-level operation. The prob-
statistical models by explicitly showing the system ability that a foreign sentence is the translation of an
which phrases ought to be learned from those sen- English sentence is calculating by summing over the
tence pairs instead. probabilities of all possible word-level alignments,
a, between the sentences:
2 Phrase-based Statistical Translation X
p(f |e) = p(f , a|e)
The goal of statistical machine translation is to be a
able to choose that English sentence, e, that is the Thus they decompose the problem of determining
most probable translation of a given sentence, f , in whether a sentence is a good translation of another
a foreign language. Rather than choosing e∗ that di- into the problem of determining whether there is
rectly maximizes the conditional probability p(e|f ), a sensible mapping between the words in the sen-
Bayes’ rule is generally applied: tences. Figure 1 illustrates a probable word-level
e∗ = arg max p(e)p(f |e) alignment between a sentence pair in the Canadian
e Hansard bilingual corpus.
The effect of applying Bayes’ rule is to divide the Brown et al. (1993) formulate alignment proba-
task into estimating two probabilities: a language bility p(f , a|e) in terms of distortion, fertility, and
model probability p(e) which can be estimated us- spurious word probabilities in addition to word-for-
ing a monolingual corpus, and a translation model word translation probabilities. The act of translation
probability p(f |e) which is estimated using a bilin- in the Brown et al. (1993) approach is one of string
gual, sentence-aligned corpus. Here we examine rewriting. In string rewriting each word in a source
different ways of calculating the translation model sentence is replaced by zero or more words in the
probability. target language. Then, a number of “spurious” target
words might be inserted with no direct connection to phrases2 from word alignments described in
the original source words. Finally the words are then Och and Ney (2003). Counts are collected over
reordered in some fashion to form the translation. phrases extracted from word alignments of all sen-
Problems with the Brown et al. (1993) approach tence pairs in the training corpus. These counts are
to translation include: then used to calculate phrasal translation probabili-
ties.
• It doesn’t have a direct way of translating The act of translating with phrase-based transla-
phrases; instead fertility probabilities are used tion model involves breaking an input sentence into
to replicate words and translate them individu- all possible substrings, looking up all translations
ally. that were aligned with each substring in the train-
• Using small units such as words means that a ing corpus, and then searching through all possible
lot of word reordering has to happen. But the translations to find the best translation of source sen-
distortion probability is a poor explanation of tence.
word order.
2.3 Example translation
2.2 Phrase Alignments This section gives an example translation which was
Phrase-based translation, by contrast, uses larger produced by our system when it was trained on a col-
segments of human translated text. Phrase-based lection of example translations from the Canadian
translation does away with fertility and spurious Parliament. For the source passage
word probabilities. While it does have some no-
tion of distortion, this is less pertinent since local re- L’honorable Leonard J. Gustafson: Hon-
orderings such as adjective-noun alternation can be orables sénateurs, tandis que la guerre
easily captured in phrases. The main part of phrase- en Irak entre dans sa troisième semaine,
based translation models is the estimation of phrasal nous ne devons pas oublier qu’il faut pren-
translation probabilities. dre des mesures pour éviter une crise hu-
In general, the probability of an English phrase ē manitaire dans la population civile. À
translating as a French phrase f¯ is calculated as the cet égard, il y a de nombreux domaines
number of times that the English phrase was aligned dans lesquels les Canadiens doivent faire
with the French phrase in the training corpus, di- preuve de leadership.
vided by the total number of times that the French Maintenant que la guerre fait rage en Irak
phrase occurred: et que des pénuries de produits alimen-
taires et de fournitures médicales com-
count(f¯, ē)
p(f¯|ē) = P ¯
mencent à se produire, les pays qui ont
f¯ count(f , ē) accès à des ressources ont la respons-
abilité d’essayer de minimiser les effets
The trick is how to go about extracting the counts
négatifs du conflit sur la population iraki-
for phrase alignments from a training corpus.
enne. Je crois personnellement que le
Many methods for calculating phrase alignments
Canada devrait jouer un plus grand rôle à
use word-level alignments as a starting point.1
cet égard.
There are various heuristics for extracting phrase
alignments from word alignments, some are de- Au Canada, nous disposons en abondance
scribed in Koehn et al. (2003), Tillmann (2003), and de blé et d’autres produits alimentaires.
Vogel et al. (2003). Nous pouvons également fournir des arti-
Figure 2 gives a graphical illustration of cles médicaux aux citoyens de l’Irak. Le
the method of extracting incrementally larger Canada a une fière réputation pour ce qui
1 2
There are other ways of calculating phrasal translation Note that the ‘phrases’ in phrase-based translation are not
probabilities. For instance, Marcu and Wong (2002) estimate the traditional notion of syntactic constituents; instead they
them directly rather than starting from word-level alignments. might be more aptly described as ‘substrings’ or ‘blocks’.
farming
worked
people
district
Those
grown
many
years
have
lived
and
.
up
in
a
,
Ces
gens
Phrases extracted on iteration 1: (Ces : Those), (gens : people),
ont
grandi
(ont : have), (grandi : grown up), (, : ,), (vécu : lived), (et : and),
, (oeuvré : worked), (des dizaines d’ : many), (années : years),
vécu
et (dans : in), (le : a), (domaine : district), (agricole : farming), (. : .)
oeuvré
des
dizaines
d'
Notice that one word can translate as a phrase, such as ‘grandi’
années
dans
→ ‘grown up’. Incrementally larger phrases are created by incor-
le porating adjacent phrases words and phrases.
domaine
agricole
.
farming
worked
people
district
Those
grown
many
years
have
lived
and
.
up
in
a
Ces
gens
ont
(ont grandi : have grown up), (grandi , : grown up ,), (, vécu : ,
grandi lived), (vécu et : lived and), (et oeuvré : and worked), (oeuvré
,
vécu des dizaines d’ : worked many), (des dizaines d’ années, many
et
oeuvré years), (années dans : years in), (dans le : in a), (domaine agricole
des
dizaines
: farming district)
d'
années
dans Notice that ‘a farming’ does not have a translation since the
le
domaine phrase that it spans are not contiguous.
agricole
.
farming
worked
people
district
Those
grown
many
years
have
lived
and
in
a
,
Ces
gens grandi : people have grown up), (ont grandi , : have grown up ,),
ont
grandi (grandi , vécu : grown up , lived), (, vécu et : , lived and), (vécu
,
vécu
et oeuvré : lived and worked), (et oeuvré des dizaines d’ : and
et
oeuvré
worked many), (oeuvré des dizaines d’ années : worked many
des years), (des dizaines d’ années dans : many years in), (années
dizaines
d' dans le : years in a), (le domaine agricole : a farming district),
années
dans (domaine agricole . : farming district .)
le
domaine
agricole
.
Notice that ‘a farming district’ now has a translation since the
phrase ‘farming district’ is adjacent to the phrase ‘a’.
farming
worked
people
district
Those
grown
many
years
have
lived
and
.
up
in
a
Iteration 4: (Ces gens ont grandi : Those people have grown up),
,
Ces
gens
ont
(gens ont grandi , : people have grown up ,), (ont grandi , vécu :
grandi have grown up , lived), (grandi , vécu et : grown up , lived and), (,
,
vécu vécu et oeuvré : , lived and worked), (vécu et oeuvré des dizaines
et
oeuvré d’ : lived and worked many), (et oeuvré des dizaines d’ années
des
dizaines
: and worked many years), (oeuvré des dizaines d’ années dans
d'
années
: worked many years in), (des dizaines d’ années dans le : many
dans years in a), (dans le domaine agricole : in a farming district), (le
le
domaine domaine agricole . : a farming district .)
agricole
.
I personally believe that Canada should play a more significant role in this regard.
À cet égard , il y a de nombreux domaines dans lesquels les Canadiens doivent faire preuve de leadership.
In this regard , there are a number of areas in which the Canadians must concentrate to show leadership.
Figure 3: An example of the phrases that were used to translate two sentences
est d’offrir une aide humanitaire aux gens Hon. Leonard J. Gustafson: Honourable
quand ils en ont besoin. senators, as the war in Iraq enters its third
week, the need to take measures to avoid
Our system produces the translation
a humanitarian crisis among Iraq’s civil-
The Honourable Leonard J. Gustafson: ian population should not be forgotten. In
Honourable senators, while the war in Iraq this regard, there may be avenues where
extending into a third week, the minister Canadians must provide leadership.
should not forget the one we must take As the war wages in Iraq and shortages
some steps to prevent a crisis humanitar- of foodstuffs and medical supplies start to
ian face in the civilian populations. In occur, countries with access to resources
this regard, there are a number of areas in have a responsibility to help minimize
which the Canadians must concentrate to the negative impact of the conflict on the
show leadership. country’s population. In this regard, it
That the war is raging in Iraq and that is my belief that Canada should play a
a shortage of food and medical supplies greater role.
are beginning to take place, those coun- In Canada, we have access to plentiful
tries that have access to the resources are supplies of wheat and other foodstuffs.
the responsibility for doing try to mini- Canada can also provide medical supplies
mize the negative effects on workers on to the citizens of Iraq. Canada has a proud
the people Irakienne. I personally believe history in providing humanitarian assis-
that Canada should play a more significant tance to people in times of need.
role in this regard.
Figure 3 shows which phrases were selected by
In Canada, we have in abundance of our system when deciding how to translate two of
wheat, as have other food products. We the sentences from the passage.
can also provision of medical articles to
the people on the other Iraq. Canada has a 3 Simple Editing
proud record in so far, as would be to give
Because our translations are learned from example
humanitarian assistance to people when
translations we can exploit edited system output to
they need it.
improve the quality of subsequent translations. The
As a reference, here is how a human translator ren- most straightforward way of improving the transla-
dered the passage tion quality would be to add the edited translations
(along with the source sentences that they were pro-
duced from) to the training corpus, and re-train the
accessible
system on the augmented set of data. However,
kingdom
changes
towards
climatic
migrate
regions
caused
animal
During
training the system can take a long time – it took
more
age
the
the
ice
to
.
,
longer than a week to train the system using the par- Durant
allel corpus containing roughly 25 million words of l'
ère
Canadian Parliament text. Since it takes so long to glacière
re-train a system, this method could only be done pe- ,
les
riodically, and translation quality would not imme- changements
diately benefit from a user correcting the system’s climatiques
obligèrent
translations.
le
Instead we have developed what we term an règne
“alignment server”. The alignment server is a vari- animale
à
ant of the software that trains the translation mod- migrer
els. It keeps the parameters of a translation model vers
des
in memory, and is able to create a word alignment contrées
on-the-fly from a new sentence pair. plus
accueillantes
For example, if our system had translated the
.
French sentence
Durant l’ère glacière, les changements Figure 4: Word alignment produced by the align-
climatiques obligèrent le règne animale ment server for an edited translation
à migrer vers des contrées plus accueil-
lantes.
into English as
During the era refrigerator, the climatic
accessible
accessible
kingdom
kingdom
changes
changes
towards
towards
climatic
migrate
regions
climatic
migrate
regions
caused
caused
animal
animal
During
During
more
more
changes oblige the the reign animal to mi-
age
age
the
the
the
the
ice
ice
to
to
.
.
,
,
Durant Durant
accessible
kingdom
kingdom
changes
changes
towards
towards
climatic
migrate
regions
climatic
migrate
regions
caused
caused
animal
animal
During
During
more
more
age
the
the
the
the
ice
ice
to
to
.
.
,
Durant Durant
l' l'
alignment such as the one depicted in Figure 4. ère
glacière
ère
glacière
, ,
From this we extract a set of new phrases and their changements
les
changements
les
Figure 7: Inspecting the phrases that were used to generate a translation to discover misaligned phrases.
5 Discussion
There are a number of limitations that make non-
statistical machine translation unsuitable for the hu-
man translation process. Specifically:
References
Peter Brown, John Cocke, Stephen Della Pietra, Vincent
Della Pietra, Frederick Jelinek, Robert Mercer, and
Paul Poossin. 1988. A statistical approach to language
translation. In 12th International Conference on Com-
putational Linguistics.