
Machine Translation
Rule-based MT & MT evaluation

Jörg Tiedemann
jorg.tiedemann@lingfil.uu.se
Department of Linguistics and Philology
Uppsala University

September 2009


Rule-based Machine Translation


What do we need?

direct translation: a huge dictionary
transfer-based translation: grammars & rules
  - rules for source language analysis (syntactic/semantic)
  - rules for source-to-target transfer
  - rules for target language generation
interlingua-based translation: the same but no transfer


What are the problems?

Direct translation:
  - the dictionary has to cover all cross-lingual phenomena
  - contextual information needs to be included in the dictionary (long phrases)
  - problems with non-compositionality and ambiguity
  - inflectional agreement, shifts in word order & structure

→ direct translation systems include simplistic rules
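To make the direct approach concrete, here is a minimal word-by-word lookup sketch in Python. The tiny English–Swedish dictionary and the example sentence are invented for illustration; a real system needs a far larger dictionary plus exactly the kind of simplistic reordering and inflection rules mentioned above.

```python
# Minimal sketch of direct (dictionary-driven) translation: look up each word
# independently and keep unknown words as-is. The toy English->Swedish dictionary
# is a made-up example.
toy_dictionary = {
    "the": "den", "cat": "katt", "sits": "sitter",
    "on": "på", "couch": "soffa",
}

def direct_translate(sentence: str) -> str:
    tokens = sentence.lower().rstrip(".").split()   # crude tokenization
    return " ".join(toy_dictionary.get(tok, tok) for tok in tokens)

print(direct_translate("The cat sits on the couch."))
# -> "den katt sitter på den soffa"
# Word-for-word output exposes the problems listed above: no definiteness
# (Swedish wants the suffixed form "katten", "soffan") and a context-insensitive
# preposition choice ("på" where Swedish prefers "i soffan", cf. the transfer
# example later in these slides).
```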


Direct Translation (Advanced)

simplistic approach: only low-level pre/post-processing (tokenization, etc.)
advanced approach: handle some specific phenomena
  - identification & handling of syntactic ambiguity
  - morphological processing/synthesis
  - word re-ordering rules
  - rules for prepositions
  - handling of compounds and idioms, ...


Direct Translation

Is it feasible?
  - a lot of compositionality in natural language
  - many similarities between languages (especially between related languages)
  - example: Systran (in daily use by the European Commission)
      - > 1.6 million dictionary units
      - dictionaries for different domains
      - more and more transfer-based

→ many data-driven MT systems ∼ direct translation systems


Transfer-based Translation

Motivation:
  - complete analysis of source language sentences
  - transfer step covers divergences between languages
  - handle lexical & structural ambiguity in one formalism

→ What kind of information/tools do we need?


Transfer-based Translation

What kind of information/tools do we need?
  - source language parser (morpho-syntactic analysis)
  - transfer engine (e.g. unification-based grammar)
  - target language generator

→ modular design
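The modular design above can be read as a simple analysis–transfer–generation pipeline. The sketch below wires three placeholder stages together; the function names and the string/dict representations are hypothetical stand-ins for a real parser, transfer engine, and generator.

```python
from typing import Callable, List

# Hypothetical stage signatures: each stage maps one representation to the next.
Analyzer = Callable[[str], List[dict]]          # source sentence -> source analysis
Transfer = Callable[[List[dict]], List[dict]]   # source analysis -> target structure
Generator = Callable[[List[dict]], str]         # target structure -> target sentence

def make_pipeline(analyze: Analyzer, transfer: Transfer, generate: Generator):
    """Compose the three modules; replacing one module leaves the others untouched."""
    def translate(sentence: str) -> str:
        return generate(transfer(analyze(sentence)))
    return translate

# Toy stand-ins just to show the data flow (no real linguistic analysis).
analyze = lambda s: [{"form": w} for w in s.split()]
transfer = lambda parse: list(reversed(parse))          # pretend structural transfer
generate = lambda struct: " ".join(t["form"] for t in struct)

translate = make_pipeline(analyze, transfer, generate)
print(translate("a toy example"))   # -> "example toy a"
```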


Transfer-based MT

Syntactic transfer rules (systematic structural differences):
  - English to Spanish:
      NP → Adjective1 Noun2  ⇒  NP → Noun2 Adjective1
  - Chinese to English:
      VP → PP[+Goal] V  ⇒  VP → V PP[+Goal]
  - English to Japanese:
      VP → V NP  ⇒  VP → NP V
      NP → NP1 RelClause2  ⇒  NP → RelClause2 NP1


Transfer-based MT

Need a preference mechanism for rule selection!

  on → på
  come.vb → kom.vb
  come on → kom igen
  sit.vb on NP → sitta.vb på NP
  sit.vb on the couch → sitta.vb i soffan

→ Common: preference for more specific rules
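A minimal sketch of "prefer the more specific rule": lexical transfer rules are tried longest-match first, so "come on → kom igen" wins over translating "come" and "on" separately. The rule table reuses the English–Swedish pairs from the slide; the greedy longest-match strategy and the fallback behaviour are simplifying assumptions, not the only way to implement rule preference.

```python
# Lexical transfer with specificity preference: longer source patterns match first.
RULES = {
    ("come", "on"): ["kom", "igen"],
    ("come",): ["kom"],
    ("on",): ["på"],
}

def transfer(tokens):
    output, i = [], 0
    max_len = max(len(src) for src in RULES)
    while i < len(tokens):
        for span in range(max_len, 0, -1):       # try the most specific rule first
            src = tuple(tokens[i:i + span])
            if src in RULES:
                output.extend(RULES[src])
                i += span
                break
        else:                                    # no rule matched: copy the token unchanged
            output.append(tokens[i])
            i += 1
    return output

print(transfer(["come", "on"]))     # -> ['kom', 'igen']  (the specific rule wins)
print(transfer(["come", "home"]))   # -> ['kom', 'home']  (falls back to shorter rules)
```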


Transfer-based MT

  - many lexical transfer rules
  - often feature-based representations
  - rules can copy, delete, transfer, assign features (sketched below)
  - fixed rule preference (e.g. specific first)
  - morphological generation


Transfer-based Translation

What are the problems?
  - lots of grammar engineering (writing rules ...)
  - language-pair specific rules
  - exponential ambiguity
  - variation & preference
  - coverage & robustness

→ Good quality can be achieved but low coverage!
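As a rough illustration of feature-based transfer rules that copy, delete, and assign features (referenced on the first slide above), the sketch below rewrites one source-side feature structure into a target-side one. The feature names and the English–Swedish lemma mapping are invented for the example; real systems use proper unification-based formalisms rather than plain dictionaries.

```python
# Hypothetical feature-based transfer: each word is a dict of features.
# The rule assigns a target lemma, copies grammatical features, and drops
# source-only features.
LEMMA_MAP = {"cat": "katt", "house": "hus"}      # invented bilingual lexicon

def transfer_noun(src: dict) -> dict:
    tgt = {
        "lemma": LEMMA_MAP[src["lemma"]],        # assign target lemma
        "num": src["num"],                       # copy number
        "def": src["def"],                       # copy definiteness
    }
    # the source-only feature "case" is deleted simply by not copying it
    return tgt

src_noun = {"lemma": "cat", "num": "sg", "def": "+", "case": "obj"}
print(transfer_noun(src_noun))
# -> {'lemma': 'katt', 'num': 'sg', 'def': '+'}
# A morphological generation step would then realize this structure,
# e.g. as the definite singular form "katten".
```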


Interlingua-based Translation

Advantages:
  - no language-pair specific transfer
  - simple (?) to add new languages (add a new analysis/generation component)

Disadvantages:
  - need to design an interlingua that covers all language phenomena
  - need a semantic representation (and that's hard!)
  - may even fail for simple (direct) examples


Classical Rule-based Translation

→ Too much manual work involved!

Is there no hope for rule-based systems?
  - domain-specific tasks
  - rule induction
  - hybrid systems


Rule-based Translation

Domain-specific MT
  - high quality translation for specific domains
  - controlled languages:
      - complete coverage of the source language (lexicon, grammar) & terminology
      - reduced ambiguity
      - requires language checker tools (for source language documents)

→ high quality & high consistency


Second Part: MT evaluation

  - How can we measure MT quality?
  - How can we compare MT engines?
  - How can we measure progress in MT development?


What do we expect from MT?

  - adequacy & informativeness (preserve meaning)
  - fluency & grammaticality (translation needs to be natural)
  - acceptance (for its task)

Evaluation is difficult!
  - What is the best translation? (language variation!)
  - Subjective aspects (What is "fluent"? Clarity? Style?)
  - What is "grammatical"?
  - What is "adequate"? (Is it possible to be adequate?)


MT evaluation

Manual evaluation
  - ask actual users to rate translations
  - statistics over user responses
  - separate evaluations of adequacy & fluency
  - requires guidelines
  - task-specific evaluation

Automatic evaluation
  - compare to reference translations
  - approximations by measuring overlaps
  - strong bias but useful for rapid development


Manual MT evaluation

Typical setup:

  Adequacy       Fluency
  5 = All        5 = Flawless English
  4 = Most       4 = Good English
  3 = Much       3 = Non-native English
  2 = Little     2 = Disfluent English
  1 = None       1 = Incomprehensible

Strong correlations if evaluated together
→ Separate evaluation on different examples?


Manual MT evaluation

Compare MT engines:
  - rank proposed translations
  - measure relative quality
  - could include manual translation
  - could rank selected segments only

→ simpler task, better agreement, fewer guidelines
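A small sketch of how such judgments are typically aggregated: mean adequacy and fluency per system on the 1–5 scales above, plus a pairwise win rate when annotators only rank outputs. The rating data, rank data, and system names are made up for illustration.

```python
from statistics import mean

# Hypothetical 1-5 judgments: ratings[system] = list of (adequacy, fluency) per segment.
ratings = {
    "system_A": [(4, 3), (5, 4), (3, 3)],
    "system_B": [(3, 4), (4, 4), (2, 3)],
}
for system, judgments in ratings.items():
    adequacy = mean(a for a, _ in judgments)
    fluency = mean(f for _, f in judgments)
    print(f"{system}: adequacy={adequacy:.2f} fluency={fluency:.2f}")

# Ranking-style evaluation: per segment, which system was ranked higher (1 = best)?
# Invented ranks for three segments; ties are simply ignored in this sketch.
ranks = [
    {"system_A": 1, "system_B": 2},
    {"system_A": 2, "system_B": 1},
    {"system_A": 1, "system_B": 2},
]
wins = sum(1 for r in ranks if r["system_A"] < r["system_B"])
print(f"system_A ranked above system_B in {wins}/{len(ranks)} segments")
```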


Task-specific evaluation

Different tasks require different types of quality!

browsing quality: Is the translation understandable in its context?
    (the main content is clear)
post-editing quality: How many edit operations are required to turn it
    into a good translation?
publishing quality: How many human interventions are necessary to make
    the entire document ready for printing?

→ Difficult to have a general framework!

Possibly: decide the quality level depending on evaluation results


Manual MT evaluation

What are the problems?
  - need volunteers (every time we want to evaluate)
    → expensive evaluation!
  - can be hard to set up
  - subjective measures & disagreement between annotators

→ Difficult to find a better solution ...


Automatic Evaluation

  - constant evaluation is necessary for system development
  - ... but manual evaluation is too expensive!

→ Automatic evaluation is required!

Comparison of MT output with reference translations:
BLEU, NIST, METEOR, WER, PER, TER, ROUGE, ...


Automatic Evaluation

Why are there so many automatic evaluation measures?
  - only approximations of adequacy & fluency
  - different types of correlations with human evaluation
  - possible bias towards certain approaches
  - tuning on automatic measures makes them inappropriate


The “BLEU-score Revolution”

  - introduced in 2002 by Papineni et al.
  - desperately needed for rapid MT development
  - quickly adopted by the statistical MT community
  - created a boom in MT research/experiments

→ Many MT papers report only BLEU scores and don’t even look at the translations ...


The “BLEU-score Revolution”

Basic idea:
  - a translation is better if it is closer to given (correct) reference translations
  - “closeness” can be measured in terms of N-gram overlaps
    → a modified form of precision
  - add a “brevity penalty” to account for sentence length

→ High correlation with human judgments (0.99 & 0.96 in the original paper)!


The “BLEU-score Revolution”

candidate translations:
  (1) It is a guide to action which ensures that the military always obeys
      the commands of the party.
  (2) It is to insure the troops forever hearing the activity guidebook that
      party direct.

reference translations:
  1) It is a guide to action that ensures that the military will forever heed
     Party commands.
  2) It is the guiding principle which guarantees the military forces always
     being under the command of the Party.
  3) It is the practical guide for the army always to heed the directions of
     the party.

Candidate (1) shares more words and word N-grams with the reference
translations than candidate (2)!
→ Compute precision scores (proportion of correct N-grams)


The “BLEU-score Revolution”

Modified N-gram precision (for each N-gram):

  count_clip = min(count_candidate, max_count_reference)

→ Avoid counting correct N-grams more often than they appear in any single
reference translation!

Example
  Candidate:   the the the the the the the.
  Reference 1: The cat is on the mat.
  Reference 2: There is a cat on the mat.

  count_clip(the) = 2
  p_unigram = 2/7 (unigram precision)
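A minimal sketch of clipped (modified) n-gram precision, reproducing the "the the the ..." example above. The tokenization (lowercased whitespace split with the final period stripped) is a simplifying assumption; real BLEU implementations apply the same clipping to higher-order n-grams as well.

```python
from collections import Counter

def modified_precision(candidate: str, references: list[str], n: int = 1) -> float:
    """Clipped n-gram precision: each candidate n-gram counts at most as often
    as it occurs in any single reference."""
    def ngrams(text):
        toks = text.lower().replace(".", "").split()   # crude tokenization (assumption)
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    cand_counts = ngrams(candidate)
    max_ref_counts = Counter()
    for ref in references:
        for gram, cnt in ngrams(ref).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], cnt)

    clipped = sum(min(cnt, max_ref_counts[gram]) for gram, cnt in cand_counts.items())
    return clipped / max(sum(cand_counts.values()), 1)

print(modified_precision("the the the the the the the.",
                         ["The cat is on the mat.", "There is a cat on the mat."]))
# -> 0.2857...  i.e. 2/7, matching the count_clip(the) = 2 example above
```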


The “BLEU-score Revolution”

Brevity penalty (BP) for short candidates (candidate length c, reference length r):

  BP = 1              if c > r
  BP = exp(1 − r/c)   if c ≤ r

Putting it all together:

  BLEU = BP · exp( Σ_{n=1..N} w_n · log p_n )

Usually w_n = 1/N and N = 4


The “BLEU-score Revolution”

BLEU scores for 110 statistical machine translation systems (Koehn 2005):

  %    da    de    el    en    es    fr    fi    it    nl    pt    sv
  da    -   18.4  21.1  28.5  26.4  28.7  14.2  22.2  21.4  24.3  28.3
  de  22.3    -   20.7  25.3  25.4  27.7  11.8  21.3  23.4  23.2  20.5
  el  22.7  17.4    -   27.2  31.2  32.1  11.4  26.8  20.0  27.6  21.2
  en  25.2  17.6  23.2    -   30.1  31.1  13.0  25.3  21.0  27.1  24.8
  es  24.1  18.2  28.3  30.5    -   40.2  12.5  32.3  21.4  35.9  23.9
  fr  23.7  18.5  26.1  30.0  38.4    -   12.6  32.4  21.1  35.3  22.6
  fi  20.0  14.5  18.2  21.8  21.1  22.4    -   18.3  17.0  19.1  18.8
  it  21.4  16.9  24.8  27.8  34.0  36.0  11.0    -   20.0  31.2  20.2
  nl  20.5  18.3  17.4  23.0  22.9  24.6  10.3  20.0    -   20.7  19.0
  pt  23.2  18.2  26.4  30.1  37.9  39.0  11.9  32.0  20.2    -   21.9
  sv  30.3  18.9  22.8  30.2  28.6  29.7  15.3  23.9  21.9  25.9    -
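To tie the pieces together, here is a rough sentence-level BLEU sketch that combines clipped n-gram precisions with the brevity penalty from the "Putting it all together" slide above, using uniform weights w_n = 1/N and N = 4. The crude tokenization is the same assumption as before; a real implementation also handles corpus-level aggregation and smoothing, which this sketch omits (it simply returns 0 when any precision is 0).

```python
import math
from collections import Counter

def _ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate: str, references: list[str], max_n: int = 4) -> float:
    """Sentence-level BLEU sketch: brevity penalty times the geometric mean of
    clipped n-gram precisions with uniform weights w_n = 1/N."""
    tok = lambda s: s.lower().replace(".", "").split()        # crude tokenization (assumption)
    cand = tok(candidate)
    refs = [tok(ref) for ref in references]

    c = len(cand)
    r = min((abs(len(ref) - c), len(ref)) for ref in refs)[1]  # closest reference length
    bp = 1.0 if c > r else math.exp(1 - r / c)

    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = _ngram_counts(cand, n)
        max_ref = Counter()
        for ref in refs:
            for gram, cnt in _ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], cnt)
        clipped = sum(min(cnt, max_ref[gram]) for gram, cnt in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:   # log(0): without smoothing the score collapses to 0
            return 0.0
        log_prec_sum += math.log(clipped / total)

    return bp * math.exp(log_prec_sum / max_n)

# Candidate (1) and reference 1 from the slides above:
candidate = ("It is a guide to action which ensures that the military "
             "always obeys the commands of the party.")
reference = ("It is a guide to action that ensures that the military "
             "will forever heed Party commands.")
print(round(bleu(candidate, [reference]), 3))
```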


The “BLEU-score Revolution”

What’s good about BLEU:
  - easy to compute
  - gives scores from 0 to 100%
  - can be used to measure system development
  - can quickly test different system parameters


The “BLEU-score Revolution”

What’s risky with BLEU:
  - systems are tuned for optimizing BLEU scores
    → strong bias, less correlation with human judgments
  - often only one reference translation
  - difficult to compare systems with generally different approaches
  - difficult to compare performance on different language pairs
  - even more difficult to compare results on different domains & text types


Alternative Measures

After BLEU many evaluation measures have been proposed:

Other evaluation metrics
  NIST: BLEU + n-gram weights according to informativeness
      (rare → more informative)
  METEOR: harmonic mean of unigram precision and recall
      + synonym expansion & stemming
  WER, PER, TER: based on edit distance (insertion, deletion, substitution,
      moving); a small WER sketch follows after the summary
  Dependency overlap: overlap in grammatical relations
  Semantic role overlap: lexical overlap between semantic roles

Metrics can be combined to better correlate with human judgments!
→ Automatically train combination weights!


Summary on MT Evaluation

  - automatic evaluation is (very) popular but risky
  - human evaluation is safe but expensive
  - automatic measures are great for system development
  - lots of discussion about MT evaluation

→ Don’t forget to look at actual MT output!
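As promised above, a minimal WER (word error rate) sketch: the word-level Levenshtein distance between hypothesis and reference (insertions, deletions, substitutions; no block moves, which is what TER adds) divided by the reference length. Whitespace tokenization and the example sentences are assumptions for illustration.

```python
def wer(hypothesis: str, reference: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    # dp[i][j] = minimum number of edits to turn hyp[:i] into ref[:j]
    dp = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dp[i][0] = i                                  # delete all hypothesis words
    for j in range(len(ref) + 1):
        dp[0][j] = j                                  # insert all reference words
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            sub = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + sub)    # substitution (or match)
    return dp[len(hyp)][len(ref)] / len(ref)

print(wer("the cat sat on mat", "the cat sat on the mat"))   # 1 insertion / 6 words ≈ 0.17
```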


Next

Lab:
  1. try to manually evaluate on-line translation services
  2. evaluation experiment: play a little game
       - try to guess the type of translations (automatic or manual)
       - test if automatic translations are understandable or not
       - challenge the system and find out MT weaknesses

Next lecture:
  - The amazing utility of parallel corpora (part I)
