
Machine Translation
Rule-based MT & MT evaluation

Jörg Tiedemann
jorg.tiedemann@lingfil.uu.se
Department of Linguistics and Philology
Uppsala University

September 2009


Rule-based Machine Translation


What do we need?

direct translation: a huge dictionary
transfer-based translation: grammars & rules
  - rules for source language analysis (syntactic/semantic)
  - rules for source-to-target transfer
  - rules for target language generation
interlingua-based translation: the same but no transfer


What are the problems?

Direct translation:
  - the dictionary has to cover all cross-lingual phenomena
  - contextual information needs to be included in the dictionary (long phrases)
  - problems with non-compositionality and ambiguity
  - inflectional agreement, shifts in word order & structure

→ direct translation systems include simplistic rules
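To make the direct approach concrete, here is a minimal word-by-word lookup sketch in Python. The tiny English–Swedish dictionary and the example sentence are invented for illustration; a real system needs a far larger dictionary plus exactly the kind of simplistic reordering and inflection rules mentioned above.

```python
# Minimal sketch of direct (dictionary-driven) translation: look up each word
# independently and keep unknown words as-is. The toy English->Swedish dictionary
# is a made-up example.
toy_dictionary = {
    "the": "den", "cat": "katt", "sits": "sitter",
    "on": "på", "couch": "soffa",
}

def direct_translate(sentence: str) -> str:
    tokens = sentence.lower().rstrip(".").split()   # crude tokenization
    return " ".join(toy_dictionary.get(tok, tok) for tok in tokens)

print(direct_translate("The cat sits on the couch."))
# -> "den katt sitter på den soffa"
# Word-for-word output exposes the problems listed above: no definiteness
# (Swedish wants the suffixed form "katten", "soffan") and a context-insensitive
# preposition choice ("på" where Swedish prefers "i soffan", cf. the transfer
# example later in these slides).
```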


Direct Translation (Advanced)

simplistic approach: only low-level pre/post-processing (tokenization, etc.)
advanced approach: handle some specific phenomena
  - identification & handling of syntactic ambiguity
  - morphological processing/synthesis
  - word re-ordering rules
  - rules for prepositions
  - handling of compounds and idioms, ...


Direct Translation

Is it feasible?
  - a lot of compositionality in natural language
  - many similarities between languages (especially between related languages)
  - example: Systran (in daily use by the European Commission)
      - > 1.6 million dictionary units
      - dictionaries for different domains
      - more and more transfer-based

→ many data-driven MT systems ∼ direct translation systems


Transfer-based Translation

Motivation:
  - complete analysis of source language sentences
  - transfer step covers divergences between languages
  - handle lexical & structural ambiguity in one formalism

→ What kind of information/tools do we need?


Transfer-based Translation

What kind of information/tools do we need?
  - source language parser (morpho-syntactic analysis)
  - transfer engine (e.g. unification-based grammar)
  - target language generator

→ modular design
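The modular design above can be read as a simple analysis–transfer–generation pipeline. The sketch below wires three placeholder stages together; the function names and the string/dict representations are hypothetical stand-ins for a real parser, transfer engine, and generator.

```python
from typing import Callable, List

# Hypothetical stage signatures: each stage maps one representation to the next.
Analyzer = Callable[[str], List[dict]]          # source sentence -> source analysis
Transfer = Callable[[List[dict]], List[dict]]   # source analysis -> target structure
Generator = Callable[[List[dict]], str]         # target structure -> target sentence

def make_pipeline(analyze: Analyzer, transfer: Transfer, generate: Generator):
    """Compose the three modules; replacing one module leaves the others untouched."""
    def translate(sentence: str) -> str:
        return generate(transfer(analyze(sentence)))
    return translate

# Toy stand-ins just to show the data flow (no real linguistic analysis).
analyze = lambda s: [{"form": w} for w in s.split()]
transfer = lambda parse: list(reversed(parse))          # pretend structural transfer
generate = lambda struct: " ".join(t["form"] for t in struct)

translate = make_pipeline(analyze, transfer, generate)
print(translate("a toy example"))   # -> "example toy a"
```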


Transfer-based MT

Syntactic transfer rules (systematic structural differences):
  - English to Spanish:
      NP → Adjective1 Noun2  ⇒  NP → Noun2 Adjective1
  - Chinese to English:
      VP → PP[+Goal] V  ⇒  VP → V PP[+Goal]
  - English to Japanese:
      VP → V NP  ⇒  VP → NP V
      NP → NP1 RelClause2  ⇒  NP → RelClause2 NP1


Transfer-based MT

Need a preference mechanism for rule selection!

  on → på
  come.vb → kom.vb
  come on → kom igen
  sit.vb on NP → sitta.vb på NP
  sit.vb on the couch → sitta.vb i soffan

→ Common: preference for more specific rules
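A minimal sketch of "prefer the more specific rule": lexical transfer rules are tried longest-match first, so "come on → kom igen" wins over translating "come" and "on" separately. The rule table reuses the English–Swedish pairs from the slide; the greedy longest-match strategy and the fallback behaviour are simplifying assumptions, not the only way to implement rule preference.

```python
# Lexical transfer with specificity preference: longer source patterns match first.
RULES = {
    ("come", "on"): ["kom", "igen"],
    ("come",): ["kom"],
    ("on",): ["på"],
}

def transfer(tokens):
    output, i = [], 0
    max_len = max(len(src) for src in RULES)
    while i < len(tokens):
        for span in range(max_len, 0, -1):       # try the most specific rule first
            src = tuple(tokens[i:i + span])
            if src in RULES:
                output.extend(RULES[src])
                i += span
                break
        else:                                    # no rule matched: copy the token unchanged
            output.append(tokens[i])
            i += 1
    return output

print(transfer(["come", "on"]))     # -> ['kom', 'igen']  (the specific rule wins)
print(transfer(["come", "home"]))   # -> ['kom', 'home']  (falls back to shorter rules)
```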


Transfer-based MT

  - many lexical transfer rules
  - often feature-based representations
  - rules can copy, delete, transfer, assign features (sketched below)
  - fixed rule preference (e.g. specific first)
  - morphological generation


Transfer-based Translation

What are the problems?
  - lots of grammar engineering (writing rules ...)
  - language-pair specific rules
  - exponential ambiguity
  - variation & preference
  - coverage & robustness

→ Good quality can be achieved but low coverage!
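As a rough illustration of feature-based transfer rules that copy, delete, and assign features (referenced on the first slide above), the sketch below rewrites one source-side feature structure into a target-side one. The feature names and the English–Swedish lemma mapping are invented for the example; real systems use proper unification-based formalisms rather than plain dictionaries.

```python
# Hypothetical feature-based transfer: each word is a dict of features.
# The rule assigns a target lemma, copies grammatical features, and drops
# source-only features.
LEMMA_MAP = {"cat": "katt", "house": "hus"}      # invented bilingual lexicon

def transfer_noun(src: dict) -> dict:
    tgt = {
        "lemma": LEMMA_MAP[src["lemma"]],        # assign target lemma
        "num": src["num"],                       # copy number
        "def": src["def"],                       # copy definiteness
    }
    # the source-only feature "case" is deleted simply by not copying it
    return tgt

src_noun = {"lemma": "cat", "num": "sg", "def": "+", "case": "obj"}
print(transfer_noun(src_noun))
# -> {'lemma': 'katt', 'num': 'sg', 'def': '+'}
# A morphological generation step would then realize this structure,
# e.g. as the definite singular form "katten".
```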


Interlingua-based Translation

Advantages:
  - no language-pair specific transfer
  - simple (?) to add new languages (add a new analysis/generation component)

Disadvantages:
  - need to design an interlingua that covers all language phenomena
  - need a semantic representation (and that's hard!)
  - may even fail for simple (direct) examples


Classical Rule-based Translation

→ Too much manual work involved!

Is there no hope for rule-based systems?
  - domain-specific tasks
  - rule induction
  - hybrid systems


Rule-based Translation

Domain-specific MT
  - high quality translation for specific domains
  - controlled languages:
      - complete coverage of the source language (lexicon, grammar) & terminology
      - reduced ambiguity
      - requires language checker tools (for source language documents)

→ high quality & high consistency


Second Part: MT evaluation

  - How can we measure MT quality?
  - How can we compare MT engines?
  - How can we measure progress in MT development?


What do we expect from MT?

  - adequacy & informativeness (preserve meaning)
  - fluency & grammaticality (translation needs to be natural)
  - acceptance (for its task)

Evaluation is difficult!
  - What is the best translation? (language variation!)
  - Subjective aspects (What is "fluent"? Clarity? Style?)
  - What is "grammatical"?
  - What is "adequate"? (Is it possible to be adequate?)


MT evaluation

Manual evaluation
  - ask actual users to rate translations
  - statistics over user responses
  - separate evaluations of adequacy & fluency
  - requires guidelines
  - task-specific evaluation

Automatic evaluation
  - compare to reference translations
  - approximations by measuring overlaps
  - strong bias but useful for rapid development


Manual MT evaluation

Typical setup:

  Adequacy       Fluency
  5 = All        5 = Flawless English
  4 = Most       4 = Good English
  3 = Much       3 = Non-native English
  2 = Little     2 = Disfluent English
  1 = None       1 = Incomprehensible

Strong correlations if evaluated together
→ Separate evaluation on different examples?


Manual MT evaluation

Compare MT engines:
  - rank proposed translations
  - measure relative quality
  - could include manual translation
  - could rank selected segments only

→ simpler task, better agreement, fewer guidelines
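A small sketch of how such judgments are typically aggregated: mean adequacy and fluency per system on the 1–5 scales above, plus a pairwise win rate when annotators only rank outputs. The rating data, rank data, and system names are made up for illustration.

```python
from statistics import mean

# Hypothetical 1-5 judgments: ratings[system] = list of (adequacy, fluency) per segment.
ratings = {
    "system_A": [(4, 3), (5, 4), (3, 3)],
    "system_B": [(3, 4), (4, 4), (2, 3)],
}
for system, judgments in ratings.items():
    adequacy = mean(a for a, _ in judgments)
    fluency = mean(f for _, f in judgments)
    print(f"{system}: adequacy={adequacy:.2f} fluency={fluency:.2f}")

# Ranking-style evaluation: per segment, which system was ranked higher (1 = best)?
# Invented ranks for three segments; ties are simply ignored in this sketch.
ranks = [
    {"system_A": 1, "system_B": 2},
    {"system_A": 2, "system_B": 1},
    {"system_A": 1, "system_B": 2},
]
wins = sum(1 for r in ranks if r["system_A"] < r["system_B"])
print(f"system_A ranked above system_B in {wins}/{len(ranks)} segments")
```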


Task-specific evaluation

Different tasks require different types of quality!

browsing quality: Is the translation understandable in its context?
    (the main content is clear)
post-editing quality: How many edit operations are required to turn it
    into a good translation?
publishing quality: How many human interventions are necessary to make
    the entire document ready for printing?

→ Difficult to have a general framework!

Possibly: decide the quality level depending on evaluation results


Manual MT evaluation

What are the problems?
  - need volunteers (every time we want to evaluate)
    → expensive evaluation!
  - can be hard to set up
  - subjective measures & disagreement between annotators

→ Difficult to find a better solution ...


Automatic Evaluation

  - constant evaluation is necessary for system development
  - ... but manual evaluation is too expensive!

→ Automatic evaluation is required!

Comparison of MT output with reference translations:
BLEU, NIST, METEOR, WER, PER, TER, ROUGE, ...


Automatic Evaluation

Why are there so many automatic evaluation measures?
  - only approximations of adequacy & fluency
  - different types of correlations with human evaluation
  - possible bias towards certain approaches
  - tuning on automatic measures makes them inappropriate


The “BLEU-score Revolution”

  - introduced in 2002 by Papineni et al.
  - desperately needed for rapid MT development
  - quickly adopted by the statistical MT community
  - created a boom in MT research/experiments

→ Many MT papers report only BLEU scores and don’t even look at the translations ...


The “BLEU-score Revolution”

Basic idea:
  - a translation is better if it is closer to given (correct) reference translations
  - “closeness” can be measured in terms of N-gram overlaps
    → a modified form of precision
  - add a “brevity penalty” to account for sentence length

→ High correlation with human judgments (0.99 & 0.96 in the original paper)!


The “BLEU-score Revolution”

candidate translations:
  (1) It is a guide to action which ensures that the military always obeys
      the commands of the party.
  (2) It is to insure the troops forever hearing the activity guidebook that
      party direct.

reference translations:
  1) It is a guide to action that ensures that the military will forever heed
     Party commands.
  2) It is the guiding principle which guarantees the military forces always
     being under the command of the Party.
  3) It is the practical guide for the army always to heed the directions of
     the party.

Candidate (1) shares more words and word N-grams with the reference
translations than candidate (2)!
→ Compute precision scores (proportion of correct N-grams)


The “BLEU-score Revolution”

Modified N-gram precision (for each N-gram):

  count_clip = min(count_candidate, max_count_reference)

→ Avoid counting correct N-grams more often than they appear in any single
reference translation!

Example
  Candidate:   the the the the the the the.
  Reference 1: The cat is on the mat.
  Reference 2: There is a cat on the mat.

  count_clip(the) = 2
  p_unigram = 2/7 (unigram precision)
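A minimal sketch of clipped (modified) n-gram precision, reproducing the "the the the ..." example above. The tokenization (lowercased whitespace split with the final period stripped) is a simplifying assumption; real BLEU implementations apply the same clipping to higher-order n-grams as well.

```python
from collections import Counter

def modified_precision(candidate: str, references: list[str], n: int = 1) -> float:
    """Clipped n-gram precision: each candidate n-gram counts at most as often
    as it occurs in any single reference."""
    def ngrams(text):
        toks = text.lower().replace(".", "").split()   # crude tokenization (assumption)
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    cand_counts = ngrams(candidate)
    max_ref_counts = Counter()
    for ref in references:
        for gram, cnt in ngrams(ref).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], cnt)

    clipped = sum(min(cnt, max_ref_counts[gram]) for gram, cnt in cand_counts.items())
    return clipped / max(sum(cand_counts.values()), 1)

print(modified_precision("the the the the the the the.",
                         ["The cat is on the mat.", "There is a cat on the mat."]))
# -> 0.2857...  i.e. 2/7, matching the count_clip(the) = 2 example above
```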


The “BLEU-score Revolution”

Brevity penalty (BP) for short candidates (candidate length c, reference length r):

  BP = 1              if c > r
  BP = exp(1 − r/c)   if c ≤ r

Putting it all together:

  BLEU = BP · exp( Σ_{n=1..N} w_n · log p_n )

Usually w_n = 1/N and N = 4


The “BLEU-score Revolution”

BLEU scores for 110 statistical machine translation systems (Koehn 2005):

  %    da    de    el    en    es    fr    fi    it    nl    pt    sv
  da    -   18.4  21.1  28.5  26.4  28.7  14.2  22.2  21.4  24.3  28.3
  de  22.3    -   20.7  25.3  25.4  27.7  11.8  21.3  23.4  23.2  20.5
  el  22.7  17.4    -   27.2  31.2  32.1  11.4  26.8  20.0  27.6  21.2
  en  25.2  17.6  23.2    -   30.1  31.1  13.0  25.3  21.0  27.1  24.8
  es  24.1  18.2  28.3  30.5    -   40.2  12.5  32.3  21.4  35.9  23.9
  fr  23.7  18.5  26.1  30.0  38.4    -   12.6  32.4  21.1  35.3  22.6
  fi  20.0  14.5  18.2  21.8  21.1  22.4    -   18.3  17.0  19.1  18.8
  it  21.4  16.9  24.8  27.8  34.0  36.0  11.0    -   20.0  31.2  20.2
  nl  20.5  18.3  17.4  23.0  22.9  24.6  10.3  20.0    -   20.7  19.0
  pt  23.2  18.2  26.4  30.1  37.9  39.0  11.9  32.0  20.2    -   21.9
  sv  30.3  18.9  22.8  30.2  28.6  29.7  15.3  23.9  21.9  25.9    -
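To tie the pieces together, here is a rough sentence-level BLEU sketch that combines clipped n-gram precisions with the brevity penalty from the "Putting it all together" slide above, using uniform weights w_n = 1/N and N = 4. The crude tokenization is the same assumption as before; a real implementation also handles corpus-level aggregation and smoothing, which this sketch omits (it simply returns 0 when any precision is 0).

```python
import math
from collections import Counter

def _ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate: str, references: list[str], max_n: int = 4) -> float:
    """Sentence-level BLEU sketch: brevity penalty times the geometric mean of
    clipped n-gram precisions with uniform weights w_n = 1/N."""
    tok = lambda s: s.lower().replace(".", "").split()        # crude tokenization (assumption)
    cand = tok(candidate)
    refs = [tok(ref) for ref in references]

    c = len(cand)
    r = min((abs(len(ref) - c), len(ref)) for ref in refs)[1]  # closest reference length
    bp = 1.0 if c > r else math.exp(1 - r / c)

    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = _ngram_counts(cand, n)
        max_ref = Counter()
        for ref in refs:
            for gram, cnt in _ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], cnt)
        clipped = sum(min(cnt, max_ref[gram]) for gram, cnt in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:   # log(0): without smoothing the score collapses to 0
            return 0.0
        log_prec_sum += math.log(clipped / total)

    return bp * math.exp(log_prec_sum / max_n)

# Candidate (1) and reference 1 from the slides above:
candidate = ("It is a guide to action which ensures that the military "
             "always obeys the commands of the party.")
reference = ("It is a guide to action that ensures that the military "
             "will forever heed Party commands.")
print(round(bleu(candidate, [reference]), 3))
```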


The “BLEU-score Revolution”

What’s good about BLEU:
  - easy to compute
  - gives scores from 0 to 100%
  - can be used to measure system development
  - can quickly test different system parameters


The “BLEU-score Revolution”

What’s risky with BLEU:
  - systems are tuned for optimizing BLEU scores
    → strong bias, less correlation with human judgments
  - often only one reference translation
  - difficult to compare systems with generally different approaches
  - difficult to compare performance on different language pairs
  - even more difficult to compare results on different domains & text types


Alternative Measures

After BLEU many evaluation measures have been proposed:

Other evaluation metrics
  NIST: BLEU + n-gram weights according to informativeness
      (rare → more informative)
  METEOR: harmonic mean of unigram precision and recall
      + synonym expansion & stemming
  WER, PER, TER: based on edit distance (insertion, deletion, substitution,
      moving); a small WER sketch follows after the summary
  Dependency overlap: overlap in grammatical relations
  Semantic role overlap: lexical overlap between semantic roles

Metrics can be combined to better correlate with human judgments!
→ Automatically train combination weights!


Summary on MT Evaluation

  - automatic evaluation is (very) popular but risky
  - human evaluation is safe but expensive
  - automatic measures are great for system development
  - lots of discussion about MT evaluation

→ Don’t forget to look at actual MT output!
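As promised above, a minimal WER (word error rate) sketch: the word-level Levenshtein distance between hypothesis and reference (insertions, deletions, substitutions; no block moves, which is what TER adds) divided by the reference length. Whitespace tokenization and the example sentences are assumptions for illustration.

```python
def wer(hypothesis: str, reference: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    # dp[i][j] = minimum number of edits to turn hyp[:i] into ref[:j]
    dp = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dp[i][0] = i                                  # delete all hypothesis words
    for j in range(len(ref) + 1):
        dp[0][j] = j                                  # insert all reference words
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            sub = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + sub)    # substitution (or match)
    return dp[len(hyp)][len(ref)] / len(ref)

print(wer("the cat sat on mat", "the cat sat on the mat"))   # 1 insertion / 6 words ≈ 0.17
```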


Next

Lab:
  1. try to manually evaluate on-line translation services
  2. evaluation experiment: play a little game
       - try to guess the type of translations (automatic or manual)
       - test if automatic translations are understandable or not
       - challenge the system and find out MT weaknesses

Next lecture:
  - The amazing utility of parallel corpora (part I)
