
Arabic Diacritization Using Weighted Finite-State Transducers

Rani Nelken and Stuart M. Shieber


Division of Engineering and Applied Sciences
Harvard University
33 Oxford St.
Cambridge, MA 02138
{nelken,shieber}@deas.harvard.edu

Abstract

Arabic is usually written without short vowels and additional diacritics, which are nevertheless important for several applications. We present a novel algorithm for restoring these symbols, using a cascade of probabilistic finite-state transducers trained on the Arabic treebank, integrating a word-based language model, a letter-based language model, and an extremely simple morphological model. This combination of probabilistic methods and simple linguistic information yields high levels of accuracy.

Introduction

Most semitic languages in both ancient and contemporary times are usually written without short vowels and other diacritic marks, often leading to potential ambiguity. While such ambiguity only rarely impedes proficient speakers, it can certainly be a source of confusion for beginning readers and people with learning disabilities (Abu-Rabia, 1999). The problem becomes even more acute when people are required to actively generate diacritized script, as used for example in poetry or children's books. Diacritization is even more problematic for computational systems, adding another level of ambiguity to both analysis and generation of text. For example, full vocalization is required for text-to-speech applications, and has been shown to improve speech-recognition perplexity and error rate (Kirchoff et al., 2002).

We present a system for Arabic diacritization[1] in Modern Standard Arabic (MSA) using weighted finite-state transducers. The system is constructed using standardly available finite-state tools, and encodes only minimal morphological knowledge, yet achieves very high levels of accuracy. While the methods described in this paper are applicable to additional semitic languages, the choice of MSA was motivated by the availability of the Arabic Treebank, distributed by the Linguistic Data Consortium (LDC), a sizable electronic corpus of diacritized text, which we could use for training and testing. Such resources are rare for other semitic languages.

This paper is structured as follows. In Section 1, we describe the task, including a brief introduction to Arabic diacritization, and the corpus we used. In Section 2, we describe the design of our system, which consists of a trigram word-based language model, augmented by two extensions: an extremely simple morphological analyzer and a letter-based language model, which are used to address the data sparsity problem. In Section 3, we report the system's experimental evaluation. We review related work in Section 4, and close with conclusions and directions for future research in Section 5.

[1] We distinguish between vocalization, the restoration of vowel symbols, and diacritization, which additionally includes the restoration of a richer system of diacritic marks, as explained below.

1 The task
The Arabic vowel system consists of 3 short vowels and 3 long vowels, as summarized in Table 1.[2] Short vowels are written as symbols either above or below the letter in diacritized text, and dropped altogether in undiacritized text. Long vowels are written as a combination of a short vowel symbol, followed by a vowel letter; in undiacritized text, the short vowel symbol is dropped. Arabic also uses vowels at the end of words to mark case distinctions, which include both the short vowels and an additional doubled form (tanween).

[2] Throughout the paper we follow Buckwalter's transliteration of Arabic into 7-bit ASCII characters (2002a).

                         Vocalized (tran.)   Unvocalized (tran.)   Pronounced
Short vowels             u                   -                     /u/
                         a                   -                     /a/
                         i                   -                     /i/
Long vowels              uw                  w                     /u:/
                         aA (1),(2)          A                     /a:/
                         aY                  Y (3)                 /a:/
                         iy                  y                     /i:/
Doubled case endings     N                   -                     /un/
                         F                   -                     /an/
                         AF                  A                     /an/
                         K                   -                     /in/

(1) aA may also appear as A even in vocalized text.
(2) In some lexical items, aA is written as a superscript (dagger) alif, in which case it is dropped in undiacritized text.
(3) Y and y can appear interchangeably in the corpus (Buckwalter, 2004).

Table 1: Arabic vowels (Buckwalter transliterations; Arabic script omitted)

Diacritized text also contains two syllabification marks: the shadda (trans.: ~), denoting doubling of the preceding consonant, and the sukun (trans.: o), denoting the lack of a vowel.

The Arabic glottal stop (hamza) deserves special mention, as it can appear in several different forms in diacritized text, enumerated in Table 2. In undiacritized text, it appears either as A or as one of the forms in the table.

Transliterated forms:  >   |   <   &   {   }

Table 2: Arabic glottal stop (Buckwalter transliterations; Arabic script omitted)

We used the LDC's Arabic Treebank of diacritized news stories (Part 2). The corpus consists of 501 news stories collected from Al-Hayat, for a total of 144,199 words. In addition to diacritization, the corpus contains several types of annotations. After being transliterated, it was morphologically analyzed using the Buckwalter morphological analyzer (2002b). Buckwalter's system provides a list of all possible analyses of each word, including a full diacritization, and a part-of-speech (POS) tag. From this candidate list, annotators chose a single analysis. Afterwards, clitics, which are prevalent in Arabic, were separated and marked, and a full parse tree was manually generated. For our purposes, we have stripped all the POS tagging, to retain two versions of each file, diacritized and undiacritized, which we use for both training and evaluation, as will be explained below.

An example sentence fragment in Arabic is given in Figure 1 in three forms: undiacritized, diacritized without case endings, and with case endings.

[Figure 1: Example sentence fragment (Arabic script, in the three forms listed above)]

The transliteration and translation are given in Figure 2. We follow precisely the form that appears in the Arabic treebank.[3]

[3] The diacritization of this example is (strictly speaking) incomplete with respect to the diacritization of the determiner Al and the letter immediately following it. For instance, in Alduwal, the d should actually have been doubled, yielding Ald~uwal. The treebank consistently does not diacritize Al, and we adhere to its conventions in both training and testing.

rd Al>myn AlEAm ljAmEp Aldwl AlErbyp Emrw mwsY
rada Al>amiyn AlEAm lijAmiEap Alduwal AlEarabiyap Eamorw muwsaY
rada Al>amiynu AlEAmu lijAmiEapi Alduwali AlEarabiyapi Eamorw muwsaY
Arab League Secretary General Amr Mussa replied...

Figure 2: Transliteration and translation of the sentence fragment
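To make the relation between diacritized and undiacritized forms concrete, here is a small Python fragment, added for illustration only (the system itself expresses this as a transducer, described in Section 2), that strips the Buckwalter diacritic symbols from the fully diacritized line of Figure 2 and recovers the undiacritized line:

    # Buckwalter symbols for the short vowels, tanween, shadda, and sukun.
    DIACRITICS = set("aiuo~FNK")

    def strip_diacritics(text: str) -> str:
        """Drop short-vowel and syllabification symbols; keep all other characters."""
        return "".join(ch for ch in text if ch not in DIACRITICS)

    diacritized = "rada Al>amiynu AlEAmu lijAmiEapi Alduwali AlEarabiyapi Eamorw muwsaY"
    print(strip_diacritics(diacritized))
    # rd Al>myn AlEAm ljAmEp Aldwl AlErbyp Emrw mwsY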

2 System design
To restore diacritization, we have created a generative probabilistic model of the process of losing diacritics, expressed as a finite-state transducer. The model transduces fully diacritized Arabic text, weighted according to a language model, into undiacritized text. To restore diacritics, we use Viterbi decoding, a standard algorithm for efficiently computing the best path through an automaton, to reconstruct the maximum likelihood diacritized word sequence that would generate a given undiacritized word sequence.

The model is constructed as a composition of several weighted finite-state transducers (Pereira and Riley, 1997). Transducers are extremely well suited for this task, as their closure under composition allows complex models to be efficiently and elegantly constructed from modular implementations of simpler models.

The system is implemented using the AT&T FSM and GRM libraries (Mohri et al., 2000; Allauzen et al., 2003), which provide a collection of useful tools for constructing weighted finite-state transducers implementing language models.
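The role of composition can be pictured set-theoretically: if each component maps a string to a set of possible output strings, the composed cascade maps a string to everything reachable by chaining the components. The toy helper below is plain, unweighted Python standing in for the FSM library only to show the shape of the idea; the two stage functions are invented examples, not components of the actual system.

    def compose(*stages):
        """Chain relations, each given as a function from a string to a set of strings."""
        def composed(s):
            outputs = {s}
            for stage in stages:
                outputs = {out for x in outputs for out in stage(x)}
            return outputs
        return composed

    # Invented toy stages: a deterministic rewrite and a non-deterministic one.
    lower = lambda s: {s.lower()}
    drop_a = lambda s: {s, s.replace("a", "")}

    cascade = compose(lower, drop_a)
    print(cascade("Katib"))   # {'katib', 'ktib'}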
2.1 Basic model

Our cascade consists of the following transducers, illustrated in Figure 3.

LM -> SP -> DD -> UNK

Figure 3: Basic model

Language model (LM) A standard trigram language model of Arabic diacritized words. For smoothing purposes, we use Katz backoff (Katz, 1987). We learn weights for the model from a training set of diacritized words. This is the only component of the basic model for which we learn such weights. During decoding, these weights are utilized to choose the most probable word sequence that could have generated the undiacritized text. The model also includes special symbols for unknown words, <unk>, and for numbers, <num>, as explained below.

Spelling (SP) A spelling transducer that transduces a word into its component letters. This is a technical necessity, since the language model operates on word tokens and the following components operate on letter tokens. For instance, the single token Al>amiyn is transduced to the sequence of tokens A, l, >, a, m, i, y, n.

Diacritic drop (DD) A transducer for dropping vowels and other diacritics. The transducer simply replaces all short vowel symbols and syllabification marks with the empty string, ε. In addition, this transducer also handles the multiple forms of the glottal stop (see Section 1). Rather than encoding any morphological rules on when the glottal stop receives each form, we merely encode the generic availability of these various alternatives, as transductions. For instance, DD includes the option of transducing { to A, without any information on when such a transduction should take place.
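A plain-Python caricature of DD follows; the actual component is a weighted transducer, and the hamza table below is an invented illustration rather than the exact mapping used. Diacritics are deleted outright, while each hamza-bearing letter is non-deterministically mapped to the shapes it may take in undiacritized text, so one diacritized word can yield several undiacritized surface forms.

    from itertools import product

    DIACRITICS = set("aiuo~FNK")

    # Illustrative undiacritized alternatives for some hamza-bearing letters
    # (Buckwalter symbols). What matters is the generic availability of the
    # alternatives, not any rule for choosing among them.
    HAMZA_ALTERNATIVES = {
        "{": ["A", "{"],
        ">": [">", "A"],
        "<": ["<", "A"],
        "|": ["|", "A"],
    }

    def dd(word):
        """All undiacritized forms that this simplified DD can produce for a word."""
        choices = []
        for ch in word:
            if ch in DIACRITICS:
                continue                       # drop vowels and syllabification marks
            choices.append(HAMZA_ALTERNATIVES.get(ch, [ch]))
        return {"".join(p) for p in product(*choices)}

    print(dd("{isotifozAz"))   # {'AstfzAz', '{stfzAz'}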


Unknowns (UNK) Due to data sparsity, a test input may include words that did not appear in the training data, and will thus be unknown to the language model. To handle such words, we add a transducer, UNK, that transduces <unk> into a stochastic sequence of arbitrary letters. During decoding, the letter sequence is fixed, and since it has no possible diacritization in the model, Viterbi decoding would choose <unk> as its most likely generator.

UNK serves a similar purpose in handling numbers. UNK transduces <num> to a stochastically generated sequence of digits. In the training data, we replace all numbers with <num>. On encountering a number in a test input, the decoding algorithm would replace the number with <num>. As a post-processing step, we replace all occurrences of <unk> and <num> with the original input word/number.
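Decoding through the inverted cascade can be pictured, in a much simplified form, as a search over per-word candidate sets. The sketch below uses plain Python, a bigram score in place of the trigram model, and a dictionary lookup standing in for the inverted SP/DD transducers; the lexicon and probabilities are placeholders. It shows how numbers map to <num> and out-of-vocabulary words to <unk>, and how the originals are restored afterwards.

    import math

    def candidates(word, lexicon):
        """Diacritized forms seen in training whose diacritic-stripped form is `word`."""
        if word.isdigit():
            return ["<num>"]
        return lexicon.get(word, ["<unk>"])

    def viterbi(words, lexicon, logprob):
        """Best diacritized sequence under a bigram model; logprob(prev, cur) stands in
        for the Katz-smoothed language model."""
        beam = {"<s>": (0.0, [])}
        for w in words:
            nxt = {}
            for cand in candidates(w, lexicon):
                score, path = max(
                    ((s + logprob(prev, cand), p) for prev, (s, p) in beam.items()),
                    key=lambda x: x[0])
                nxt[cand] = (score, path + [cand])
            beam = nxt
        return max(beam.values(), key=lambda x: x[0])[1]

    def restore_specials(decoded, original):
        """Post-processing: put the original token back wherever <unk> or <num> was chosen."""
        return [o if d in ("<unk>", "<num>") else d for d, o in zip(decoded, original)]

    # Placeholder lexicon and a uniform "language model", for illustration only.
    lexicon = {"rd": ["rada"], "Al>myn": ["Al>amiyn"], "AlEAm": ["AlEAm"]}
    logprob = lambda prev, cur: math.log(0.5)

    tokens = "rd Al>myn AlEAm Aldwl 2005".split()
    print(restore_specials(viterbi(tokens, lexicon, logprob), tokens))
    # ['rada', 'Al>amiyn', 'AlEAm', 'Aldwl', '2005']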
2.2 Handling clitics

Arabic contains numerous clitics, which are appended to words, either as prefixes or as suffixes, including the determiner, conjunctions, some prepositions, and pronouns. Clitics pose an important challenge for an n-gram model, since the same word with a different clitic combination would appear to the model as a separate token. We thus augment our basic model with a transducer for handling clitics.

Handling clitics using a rule-based approach is a non-trivial undertaking (Buckwalter, 2002b). In addition to cases of potential ambiguity between letters belonging to a clitic and letters belonging to a word, clitics might be iteratively appended, but only in some combinations and some orderings. Buckwalter maintains a dictionary not only of all prefixes, stems, and suffixes, but also keeps a separate dictionary entry for each allowed combination of diacritized clitics. Since, unlike Buckwalter's system, we are interested just in the most probable clitic separation rather than the full set of analyses, we implement only a very simple transducer, and rely on the probabilistic model to handle such ambiguities and complexities.

From a generative perspective, we assume that the hypothetical original text from which the model starts is not only diacritized, but also has clitics separated. We augment the model with a transducer, Clitic Concatenation (CC), which non-deterministically concatenates clitics to words. CC scans the letter stream; on encountering a potential prefix, CC can non-deterministically append it to the following word, merely by transducing the space following it to ε. This is done iteratively for each prefix. After concatenating prefixes, CC can non-deterministically decide that it has reached the main word, which it copies. Finally, it concatenates suffixes symmetrically to prefixes. For instance, on encountering the letter string w Al >myn, CC might drop the spaces to generate wAl>myn (and the secretary).

The transducer implementation of CC consists of three components, depicted schematically in Figure 4. The first component iteratively appends prefixes. For each of a fixed set of prefixes, it has a set of states for identifying the prefix and dropping the trailing space. A non-deterministic jump moves the transducer to the middle component, which implements the identity function on letters, copying the putative main word to the output. Finally, CC can non-deterministically jump to the final component, which appends suffixes by dropping the preceding space.

PRE -> MID -> SUF

Figure 4: Clitic concatenation

By design, CC provides a very simple model of Arabic clitics. It maintains just a list of possible prefixes and suffixes, but encodes no information about stems or possible clitic orderings, potentially allowing many ungrammatical combinations. We rely on the probabilistic language model to assign such combinations very low probabilities.[4] (A sketch of the resulting candidate splits is given after the clitic list below.)

[4] The only special case of multiple prefix combinations that we explicitly encode is the combination of l+Al (to + the), which becomes ll, by dropping the A.

We use the following list of (undiacritized) clitics (cf. Diab et al. (2004), who use the same set with the omission of s and ny):

prefixes: b (by/with), l (to), k (as), w (and), f (and), Al (the), s (future);

suffixes: y (my/mine), ny (me), nA (our/ours), k (your/yours), kmA (your/yours masc. dual), km (your/yours masc. pl.), knA (your/yours fem. dual), kn (your/yours fem. pl.), h (him/his), hA (her/hers), hmA (their/theirs masc. dual), hnA (their/theirs fem. dual), hm (their/theirs masc. pl.), hn (their/theirs fem. pl.).
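Seen from the decoding direction, the effect of CC is that an undiacritized token may be analyzed as any split into (possibly several) prefixes from the list above, a stem, and suffixes from the list; the language model then arbitrates among the candidates. The following rough re-implementation as plain string processing (not the actual transducer, and deliberately ignoring the ll special case of footnote [4]) sketches that candidate generation:

    PREFIXES = ["b", "l", "k", "w", "f", "Al", "s"]
    SUFFIXES = ["y", "ny", "nA", "k", "kmA", "km", "knA", "kn",
                "h", "hA", "hmA", "hnA", "hm", "hn"]

    def segmentations(token):
        """All splits of an undiacritized token into prefixes + stem + suffixes.
        No constraint on clitic ordering or combination is imposed; the language
        model is left to prefer the plausible candidates."""
        out = set()

        def peel_front(rest, pres):
            peel_back(rest, pres, [])
            for p in PREFIXES:
                if rest.startswith(p) and len(rest) > len(p):
                    peel_front(rest[len(p):], pres + [p])

        def peel_back(rest, pres, sufs):
            out.add((tuple(pres), rest, tuple(sufs)))
            for s in SUFFIXES:
                if rest.endswith(s) and len(rest) > len(s):
                    peel_back(rest[:-len(s)], pres, [s] + sufs)

        peel_front(token, [])
        return out

    for seg in sorted(segmentations("wAl>myn")):
        print(seg)
    # ((), 'wAl>myn', ())
    # (('w',), 'Al>myn', ())
    # (('w', 'Al'), '>myn', ())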
We integrate CC into the cascade by composing it after DD, and before UNK. Thus, clitics appear in their undiacritized form. Our model now assumes that the diacritized input text has clitics separated. This requires two changes to our method. First, training must now be performed on text in which clitics are separated. This is straightforward, since clitics are tagged in the corpus. Second, in the undiacritized test data, we keep clitics intact. Running Viterbi decoding on the augmented model would not only diacritize it, but also separate clitics. To generate grammatical Arabic, we reconnect the clitics as a post-processing step. We use a greedy strategy of connecting each prefix to the following word, and each suffix to the preceding word.[5]

[5] The only case that requires special attention is ka, which can be either a prefix (meaning as) or a suffix (meaning your/yours masc.). The greedy strategy always chooses the suffix meaning. We correct it by comparison with the input text.

While our handling of clitics helps overcome data sparsity, there is also a potential cost for decoding. Clitics, which are, intuitively speaking, less informative than regular words, are now treated as lexical items of equal stature. For instance, a bigram model may include the collocation Al>amiyn AlEAm (the secretary general). Once clitics are separated, this becomes Al >amiyn Al EAm. A bigram model would no longer retain the connection between the main words, >amiyn and EAm, but only between them and the determiner Al, which is potentially less informative.

Figure 5 shows an example transduction through the word-based model, where for illustration purposes, we assume that Aldwl is an unknown word.

Diacritized                                 rada Al >amiyn Al EAm li jAmiEap <unk>
LM (Weighted, but otherwise unchanged)
SP (Change in token resolution from words to letters)
DD (Diacritics dropped)                     rd Al >myn Al EAm l jAmEp
CC (Clitics concatenated)                   rd Al>myn AlEAm ljAmEp
UNK (<unk> becomes Aldwl)                   rd Al>myn AlEAm ljAmEp Aldwl

Figure 5: Example transduction

2.3 Letter model for unknown words

To diacritize unknown words, we trained a letter-based 4-gram language model of Arabic words, LLM, on the letter sequences of words in the training set. Composing LLM with the vowel-drop transducer, DD, yields a probabilistic generative model of Arabic letter and diacritization patterns, including for words that were never encountered in training.

In principle, we could use the letter model as an alternative model of the full text, but we found it more effective to use it selectively, only on unknown words. Thus, after running the word-based language model, we extract all the words tagged as <unk> and run the letter-based model on them. Here is an example transduction:

Diacritized                  Alduwal
LLM (Weighted)
DD (Diacritics dropped)      Aldwl

We chose not to apply any special clitic handling for the letter-based model. To see why, consider the alternative model that would include CC. Since LLM is unaware of word tokens, there is no pressure on the decoding algorithm to split the clitics from the word, and clitics may therefore be incorrectly vocalized.
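A rough sketch of the letter-model idea for unknown words, in plain Python rather than as transducers: candidates are generated by optionally inserting a short vowel after each letter, and each candidate is scored by a character 4-gram model estimated from diacritized training words. The tiny training set, the candidate generator, and the crude floor smoothing are all illustrative stand-ins, and the sketch ignores the hamza alternations that DD handles in the real system.

    from collections import Counter
    from itertools import product
    import math

    N = 4
    INSERTIONS = ["", "a", "i", "u", "o"]    # optional diacritic after each letter (simplified)

    def ngrams(word):
        padded = "^" * (N - 1) + word + "$"
        return [padded[i:i + N] for i in range(len(padded) - N + 1)]

    def train_letter_lm(diacritized_words, floor=1e-4):
        counts, contexts = Counter(), Counter()
        for w in diacritized_words:
            for g in ngrams(w):
                counts[g] += 1
                contexts[g[:-1]] += 1
        def logprob(word):
            # relative frequency for seen 4-grams, a small floor otherwise;
            # a crude stand-in for a properly smoothed model
            return sum(math.log(counts[g] / contexts[g[:-1]]) if g in counts else math.log(floor)
                       for g in ngrams(word))
        return logprob

    def diacritize_unknown(word, logprob):
        candidates = ("".join(c + v for c, v in zip(word, ins))
                      for ins in product(INSERTIONS, repeat=len(word)))
        return max(candidates, key=logprob)

    lm = train_letter_lm(["{isotifozAz"])      # toy training data: a single word
    print(diacritize_unknown("{stfzAz", lm))   # {isotifozAz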

3 Experiments

We randomly split the set of news articles in each of the two parts of the Arabic treebank into a training and a held-out testing set of sizes 90% and 10% respectively. We trained both the word-based and the letter-based language models on the diacritized version of the training set. We then ran Viterbi decoding on the undiacritized version of the testing set, which consists of a total of over 14,000 words. As a baseline, we used a unigram word model without clitic handling, constructed using the same transducer technology. We ran two batches of experiments: one in which case endings were stripped throughout the training and testing data, and we did not attempt to restore them, and one in which case markings were included.

Results are reported in Table 3. For each model, we report two measures: the word error rate (WER), and the diacritization error rate (DER), i.e., the proportion of incorrectly restored diacritics.
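Concretely, the two measures can be computed along the following lines, assuming the reference and the system output are aligned word by word (which holds here, since decoding never changes the number of word tokens). The per-letter grouping below is a simplification; the exact counting convention used for DER may differ.

    DIACRITICS = set("aiuo~FNK")

    def slots(word):
        """Group a diacritized word into (letter, attached-diacritics) slots."""
        out, i = [], 0
        while i < len(word):
            j = i + 1
            while j < len(word) and word[j] in DIACRITICS:
                j += 1
            out.append((word[i], word[i + 1:j]))
            i = j
        return out

    def wer_der(reference, hypothesis):
        """Word error rate and diacritization error rate over aligned word lists."""
        word_errors = sum(r != h for r, h in zip(reference, hypothesis))
        dia_errors = dia_total = 0
        for r, h in zip(reference, hypothesis):
            for (_, rd), (_, hd) in zip(slots(r), slots(h)):
                dia_total += 1
                dia_errors += rd != hd
        return word_errors / len(reference), dia_errors / dia_total

    reference  = "rada Al>amiyn AlEAm".split()
    hypothesis = "rada Al>amayn AlEAm".split()   # one invented vowel error
    print(wer_der(reference, hypothesis))        # (0.333..., 0.0769...)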
Model                                  without case          with case
                                       WER       DER         WER       DER
Baseline                               15.48%    17.33%      30.39%    24.03%
3-gram word                            14.64%    16.9%       28.42%    23.34%
3-gram word + CC                       8.49%     9.32%       24.22%    15.36%
3-gram word + CC + 4-gram letter       7.33%     6.35%       23.61%    12.79%

Table 3: Results on the Al-Hayat corpus

Surprisingly, a trigram word-based language model yields only a modest improvement over the baseline unigram model. The addition of a clitic connection model and a letter-based language model leads to a marked improvement in both WER and DER. This trend is repeated for both variants of the task, either with or without case endings. Including case information naturally yields proportionally worse accuracy. Since case markings encode higher-order grammatical information, they would require a more powerful grammatical model than offered by finite-state methods. To illustrate the system's performance, here are some decodings made by the different versions of the model.

Basic model

- An, <ina, and >ana, three versions of the word "that", which may all appear as An in undiacritized text, are often confused. As Buckwalter (2004) notes, the corpus itself is sometimes inconsistent about the use of <ina and >ana.

- Several of the third-person possessive pronoun clitics can appear either with a u or an i; for instance, the third person singular masculine possessive can appear as either hu or hi. The correct form depends on the preceding letter and vowel (including the case vowels). Part of the tradeoff of treating clitics as independent lexical items is that the word-based model is ignorant of the letter preceding a suffix clitic.

Clitic model

- wstkwn was correctly decoded to wa sa takuwnu, which after post-processing becomes wasatakuwnu (and [it] shall be).

Letter model

- AstfzAz was correctly decoded to {isotifozAz (instigation). This example is interesting, since a morphological analysis would deterministically predict this diacritization. The probabilistic letter model was able to correctly decode it even though it has no explicit encoding of such knowledge.

- Non-Arabic names are obviously problematic for the model. For instance, bwrtlAnd was incorrectly decoded to buwrotilAnoda rather than buwrotlAnod (Portland), but note that some of the diacritics were correctly restored. Al-Onaizan and Knight (2002) proposed a transducer for modeling the Arabic spelling of such names for the purpose of translating from Arabic. Such a model could be seamlessly integrated into our architecture, for improved accuracy.

4 Related work

Gal (2002) constructed an HMM-based bigram model for restoring vowels (but not additional diacritics) in Arabic and Hebrew. For Arabic, the model was applied to the Quran, a corpus of about 90,000 words, achieving 14% WER. The word-based language model component of our system is very similar to Gal's HMM. The very flexible framework of transducers allows us to easily enhance the model with our simple but effective morphology handler and letter-based language model.

Several commercial tools are available for Arabic diacritization, which unfortunately we did not have access to. Vergyri and Kirchhoff (2004) evaluated one (unspecified) system on two MSA texts, reporting a 9% DER without case information, and 28% DER with case endings.

Kirchoff et al. (2002) focus on vocalizing transcripts of oral conversational Arabic. Since conversational Arabic is much more free-flowing, and prone to dialect and speaker differences, diacritization of such transcripts proves much more difficult. Kirchoff et al. started from a unigram model, and augmented it with the following heuristic. For each unknown word, they search for the closest known unvocalized word according to Levenshtein distance, and apply whatever transformation that word undergoes, yielding 16.5% WER. Our letter-based model provides an alternative method of generalizing the diacritization from known words to unknown ones.

Vergyri and Kirchhoff (2004) also handled conversational Arabic, and showed that some of the complexity inherent in vocalizing such text can be offset by combining information from the acoustic signal with morphological and contextual information. They treat the latter problem as an unsupervised tagging problem, where each word is assigned a tag representing one of its possible diacritizations according to Buckwalter's morphological analyzer (2002b). They use Expectation Maximization (EM) to train a trigram model of tag sequences. The evaluation shows that the combined model yields a significant improvement over just the acoustic model.
5 Conclusions and future directions

We have presented an effective probabilistic finite-state architecture for Arabic diacritization. The modular design of the system, based on a composition of simple and compact transducers, allows us to achieve high levels of accuracy while encoding extremely limited morphological knowledge. In particular, while our system is aware of the existence of Arabic clitics, it has no explicit knowledge of how they can be combined. Such patterns are automatically learned from the training data. Likewise, while the system is aware of different orthographic variants of the glottal stop, it encodes no explicit rules to predict their distribution.

The main resource that our method relies on is the existence of sufficient quantities of diacritized text. Since semitic languages are typically written without vowels, it is rare to find sizable collections of diacritized text in digital form. The alternative is to diacritize text using a combination of manual annotation and computational tools. This is precisely the process that was followed in the compilation of the Arabic treebank, and similar efforts are now underway for Hebrew (Wintner and Yona, 2003).

In contrast to morphological analyzers, which usually provide only an unranked list of all possible analyses, our method provides the most probable analysis, and with a trivial extension, could provide a ranked n-best list. Reducing and ranking the possible analyses may help simplify the annotators' job. The burden of requiring large quantities of diacritized text could be assuaged by iterative bootstrapping: training the system and manually correcting it on corpora of increasing size.

As another future direction, we note that occasionally one may find a vowel or two, even in otherwise undiacritized text fragments. This is especially true for extremely short text fragments, where ambiguity is undesirable, as in banners or advertisements. This raises an interesting optimization problem: what is the least number of vowel symbols that are required in order to ensure an unambiguous reading, and where should they be placed? Assuming that the errors of the probabilistic model are indicative of the types of errors that a human might make, we can use this model to predict where disambiguating vowels would be most informative. A simple change to the model described in this paper would make vowel drop optional rather than obligatory. Such a model would then be able to generate not only fully unvocalized text, but also partially vocalized variants of it. The optimization problem would then become one of finding the partially diacritized text with the minimal number of vowels that would be least ambiguous.
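The "optional drop" variant mentioned above can be pictured in the same simplified string terms used earlier (the actual change would be made to the DD transducer): each diacritic is either kept or dropped, so the model relates a diacritized word to all of its partially vocalized variants rather than only the fully stripped one.

    from itertools import product

    DIACRITICS = set("aiuo~FNK")

    def partial_vocalizations(word):
        """All variants of a diacritized word with each diacritic independently kept or dropped."""
        choices = [(ch,) if ch not in DIACRITICS else (ch, "") for ch in word]
        return {"".join(p) for p in product(*choices)}

    print(sorted(partial_vocalizations("rada")))
    # ['rad', 'rada', 'rd', 'rda']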
Acknowledgments

We thank Yaakov Gal for his comments on a previous version of this paper. This work was supported in part by grant IIS-0329089 from the National Science Foundation.
References

Salim Abu-Rabia. 1999. The effect of Arabic vowels on the reading comprehension of second- and sixth-grade native Arab children. Journal of Psycholinguistic Research, 28(1):93-101, January.

Yaser Al-Onaizan and Kevin Knight. 2002. Machine transliteration of names in Arabic texts. In Proceedings of the Workshop on Computational Approaches to Semitic Languages, pages 34-46, Philadelphia, July. Association for Computational Linguistics.

Cyril Allauzen, Mehryar Mohri, and Brian Roark. 2003. Generalized algorithms for constructing statistical language models. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL 2003), pages 40-47.

Tim Buckwalter. 2002a. Arabic transliteration table. http://www.qamus.org/transliteration.htm.

Tim Buckwalter. 2002b. Buckwalter Arabic morphological analyzer version 1.0. Linguistic Data Consortium, catalog number LDC2002L49 and ISBN 1-58563-257-0.

Tim Buckwalter. 2004. Issues in Arabic orthography and morphology analysis. In Ali Farghaly and Karine Megerdoomian, editors, COLING 2004 Computational Approaches to Arabic Script-based Languages, pages 31-34, Geneva, Switzerland, August 28th. COLING.

Mona Diab, Kadri Hacioglu, and Daniel Jurafsky. 2004. Automatic tagging of Arabic text: From raw text to base phrase chunks. In Susan Dumais, Daniel Marcu, and Salim Roukos, editors, HLT-NAACL 2004: Short Papers, pages 149-152, Boston, Massachusetts, USA, May 2 - May 7. Association for Computational Linguistics.

Yaakov Gal. 2002. An HMM approach to vowel restoration in Arabic and Hebrew. In Proceedings of the Workshop on Computational Approaches to Semitic Languages, pages 27-33, Philadelphia, July. Association for Computational Linguistics.

Slava M. Katz. 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(3):400-401, March.

Katrin Kirchoff, Jeff Bilmes, Sourin Das, Nicolae Duta, Melissa Egan, Gang Ji, Feng He, John Henderson, Daben Liu, Mohamed Noamany, Pat Schone, Richard Schwartz, and Dimitra Vergyri. 2002. Novel approaches to Arabic speech recognition: report from the 2002 Johns-Hopkins summer workshop. Technical report, Johns Hopkins University.

Mehryar Mohri, Fernando Pereira, and Michael Riley. 2000. The design principles of a weighted finite-state transducer library. Theoretical Computer Science, 231(1):17-32.

Fernando C. N. Pereira and Michael Riley. 1997. Speech recognition by composition of weighted finite automata. In Emmanuel Roche and Yves Schabes, editors, Finite-State Devices for Natural Language Processing. MIT Press, Cambridge, MA.

Dimitra Vergyri and Katrin Kirchhoff. 2004. Automatic diacritization of Arabic for acoustic modeling in speech recognition. In Ali Farghaly and Karine Megerdoomian, editors, COLING 2004 Computational Approaches to Arabic Script-based Languages, pages 66-73, Geneva, Switzerland, August 28th. COLING.

Shuly Wintner and Shlomo Yona. 2003. Resources for processing Hebrew. In Proceedings of the MT Summit IX Workshop on Machine Translation for Semitic Languages, New Orleans, September.
