Sei sulla pagina 1di 23

An Introduction to Machine Translation

Andy Way, DCU


The Rise & Fall of Different MT Paradigms
Three main approaches to RBMT
language-neutral interlingua

TRANSFER
GENERATION
ANALYSIS

direct translation
source text target text

The Vauquois Pyramid


System Design: Concerns
Multilingual vs. Bilingual
Multilingual:
Extreme: Eurotra, i.e. 72 language pairs Modest: EN DE,FR,ES,
i.e. 3 language pairs
Intermediate: EN,FR,DE,ES,JP, but not all combinations
Bilingual:
Unidirectional vs. Bidirectional
ENFR or FREN
Reversible vs. Non-reversible
ENFR, same EN,FR components for Analysis & Generation, and
reversible transfer module
ENFR & FREN, but different EN, FR components for Analysis &
Generation, and different transfer modules, NB, lack of modularity …
Direct vs. Transfer vs. Interlingua
Batch vs. Interactive
Advantages/Disadvantages of Direct Systems
Advantages
Engine's competence lies in its comparative grammar.
Highly robust. Does not break down or stop when
it encounters unknown words, unknown grammatical
constructs, or ill-formed Input
Designed for unidirectional translation between one pair of
langs. Not conducive to genuine multilingual MT design.
Disadvantages
‘word-for-word' translation + local reordering = poor
translation, using cheap bilingual dictionary & rudimentary
knowledge of target language.
Linguistically, computationally naive. No analysis of internal
structure of Input, especially w.r.t. the grammatical relationships
between the main parts of sentences.
Advantages/Disadvantages of Interlingual Systems

Advantages
Intermediate representation (IR) fully specified, i.e. no need to
‘look back' at Source in order to generate Target.
Easy to extend to other langs.
Built-in back translation: useful for testing.
Disadvantages
How to define an Interlingua for closely related languages?
Truly universal Interlingua possible?
Advantages/Disadvantages of Transfer Systems
Advantages
No language-independent representations: source IR specific to
a particular lang., as is the target lang. IR.
So Complexity of Analysis & Generation components much
reduced …
Also, no necessary equivalence between source and target
IRs for the same language!
Disadvantages
Not so easy to extend to other languages: n analysis modules, n
generation modules, n x n-1 transfer modules, i.e. not much less
than n² …
No guaranteed built-in back translation.
Direct, or Indirect?
Direct:
From manufacturer's viewpoint, better, as it's more robust …
Indirect:
Falls over more easily.
Development phase can be trying.
Commercially, must be supplemented with techniques for dealing with
unseen Input.
What about Translation Quality?
Indirect systems clearly better in principle.
However, constructing MT engine requires considerable effort.
Direct Systems can achieve good performance.
Summary
Research: mostly Transfer-based, with rules automatically acquired
from data
Industrially: we can expect highly-developed Direct Systems to survive
for some years to come …
Other Material
Arnold, D. et al. (1994): Machine Translation - An Introductory
Guide; NCC Blackwell, Oxford
Hutchins, J. & H. Somers (1992): An Introduction to MT; Academic
Press, London
Trujillo, A. (1999): Translation Engines; Springer, London

Newer books include:


Bowker, L. (2002): Computer-Aided Translation Technology, U. of
Ottawa Press.
Somers, H. (2003): Computers and Translation: A translator's guide,
John Benjamins.
Bond, F. (2005): Translating the Untranslatable, CSLI.
Quah, C. (2006): Translation and Technology, Palgrave MacMillan.
Why Corpus-Based MT?
the (relative) failure of rule-based approaches
the increasing availability of machine-readable
text
the increase in capability of hardware (CPU,
memory, disk space) with associated decrease
in cost
Corpus-Based MT is here to stay
These approaches are now mainstream:

Most researchers are developing corpus-based systems;


First company to use SMT now exists:
http://www.languageweaver.com;
CNGL partner Traslán uses EBMT/SMT hybrid;
In recent large-scale evaluations, corpus-based MT systems
come first.

Two caveats:
Most industrial systems are still rule-based (but cf. Google’s
systems now all SMT);
Current mainstream evaluation metrics favour n-gram-based
systems (i.e. bias towards SMT).
Thanks to Kevin Knight …
Centauri/Arcturan Exercise

Slides already on CA446 webpage …


Centauri/Arcturan [Knight, 1997]

Your assignment, put these words in order:

{ jjat, arrat, mat, bat, oloat, at-yurp }

There are 6! different orders possible, so 720 different


translations.

Best order (according to placement in TL side of the


corpus is as given above):
Not just unigrams, but n-grams also …
It’s Really Spanish—English!
Clients do not sell pharmaceuticals in Europe => Clientes no venden medicinas en Europa

1a. Garcia and associates . 7a. the clients and the associates are enemies .
1b. Garcia y asociados . 7b. los clients y los asociados son enemigos .

2a. Carlos Garcia has three associates . 8a. the company has three groups .
2b. Carlos Garcia tiene tres asociados . 8b. la empresa tiene tres grupos .

3a. his associates are not strong . 9a. its groups are in Europe .
3b. sus asociados no son fuertes . 9b. sus grupos estan en Europa .
4a. Garcia has a company also . 10a. the modern groups sell strong pharmaceuticals .
4b. Garcia tambien tiene una empresa . 10b. los grupos modernos venden medicinas fuertes .

5a. its clients are angry . 11a. the groups do not sell zenzanine .
5b. sus clientes estan enfadados . 11b. los grupos no venden zanzanina .
6a. the associates are also angry . 12a. the small groups are not modern .
6b. los asociados tambien estan enfadados . 12b. los grupos pequenos no son modernos .
Some more to try …
iat lat pippat eneat hilat oloat at-yurp.
totat nnat forat arrat mat bat.
wat dat quat cat uskrat at-drubel.
Some more to try …
iat lat pippat eneat hilat oloat at-yurp.
totat nnat forat arrat mat bat.
wat dat quat cat uskrat at-drubel.

… if you have trouble sleeping at nights!


What have we just seen?
what parallel corpora look like;
how relevant parallel corpora are for MT;
how to build bilingual dictionaries from parallel corpora;
how cognate information may be useful in MT;
how to do word alignment.
What else do we need to know?
about word alignment on a larger scale;
about phrasal alignment, the norm in real translation
data;
about unknown words;
the importance of knowing the target language (vs.
source) in making fluent translations;
about locality in word order shifts;
how to guess the meanings/translations of unknown
words;
about how much uncertainty the machine faces in
working with limited data;
about working on different domains;

Do such methods scale to ‘real’ MT?
Availability of monolingual and bilingual corpora?
Possibility of sentence-aligning bilingual corpora?
Can we write an algorithm to extract the translation
dictionary?
Can we write an algorithm to extract the monolingual
word pair counts?
Can we write an algorithm to generate translations using
our translation dictionary and word pair counts?
Do such methods scale to ‘real’ MT?
Availability of monolingual and bilingual corpora?
Possibility of sentence-aligning bilingual corpora?
Can we write an algorithm to extract the translation
dictionary?
Can we write an algorithm to extract the monolingual
word pair counts?
Can we write an algorithm to generate translations using
our translation dictionary and word pair counts?

WILL THE TRANSLATIONS PRODUCED BE ANY


GOOD?
Parallel Corpora
Hugely important … but not available in a wide
range of language pairs:
Chinese—English: Hong Kong data
French—English: Canadian Hansards
Older EU pairs: Europarl [Koehn 04]
Newer EU pairs: JRC-Acquis Communautaire, very
recently distributed updated Europarl
Arabic—English: LDC Data
NIST, IWSLT, TC-STAR Evaluations

Caveat interpres!
Beware of sparse data!
Beware of unrepresentative corpora!
Beware of poor quality language!

If the corpora are small, or of poor quality, or are


unrepresentative, then our statistical language
models will be poor, so any results we achieve
will be poor.

Potrebbero piacerti anche