
The Challenges and Pitfalls of Arabic Romanization and Arabization

Jack Halpern (春遍雀來)


The CJK Dictionary Institute, Inc. (日中韓辭典研究所)
34-14, 2-chome, Tohoku, Niiza-shi, Saitama 352-0001, Japan
jack@cjk.org

Abstract

The high level of ambiguity of the Arabic script poses special challenges to developers of NLP tools in areas such as morphological analysis, named entity extraction and machine translation. These difficulties are exacerbated by the lack of comprehensive lexical resources, such as proper noun databases, and the multiplicity of ambiguous transcription schemes. This paper focuses on some of the linguistic issues encountered in two subdisciplines that play an increasingly important role in Arabic information processing: the romanization of Arabic names and the arabization of non-Arabic names. The basic premise is that linguistic knowledge in the form of linguistic rules is essential for achieving high accuracy.

1 Introduction

The process of automatically transcribing Arabic to a Roman script representation, called romanization, is a tough computational task to which there is no definitive solution. The opposite operation of transcribing a non-Arabic script into Arabic, called arabization, is also difficult but for different reasons.

This paper briefly describes the algorithms and major linguistic issues encountered in the course of developing two automatic transcription systems: (1) Automatic Romanizer of Arabic Names (ARAN), which romanizes unvocalized Arabic names into various romanization systems, and (2) Non-Arabic Name Arabizer (NANA), which arabizes non-Arabic names written in the Roman and CJK scripts.

A novel feature of these systems is that they are fine-tuned to transcribing personal names and placenames to and from Arabic, with special focus on the linguistic knowledge and rules required for transcribing CJK names written in their native script directly into Arabic, something probably never attempted. These systems are part of our ongoing efforts to develop Arabic resources for automatic transcription, machine translation and named entity extraction.

The following typographic conventions are used in this paper:

1. Phonemic transcriptions are indicated by slashes (قابوس > /qaabuus/).
2. Phonetic transcriptions are indicated by square brackets (قابوس > [qɑːbuːs]).
3. Graphemic transliterations are indicated by back slashes (قابوس > \qAbws\).
4. Popular transcriptions are indicated by italics (قابوس > Qaboos).

2 Motivation and Previous Work

Arabic transcription technology is playing an increasingly important role in a variety of practical applications such as named entity recognition, machine translation, cross-language information retrieval and various security applications such as anti-money laundering and terrorist watch lists. Despite the importance of these applications, Arabic transcription has not been the subject of sufficient studies that examine the linguistic issues. This paper attempts to fill that gap.

Several companies and researchers have developed automatic diacriticization software. Vergyri and Kirchhoff (2004) report the high error rate of these products. Gal used an HMM bigram model and achieved a 14% error rate, while AbdulJaleel and Larkey (2003) developed an n-gram based statistical system for arabizing English, with an error rate of 10%-20%. Elshafei et al. (2006) report a 5.5% error rate using an HMM approach, while Arbabi et al. (1994) developed a diacriticizer that combines a knowledge base with neural networks to achieve a low error rate of 3.1% but which rejects 55% of the names as unprocessable.

We have not used sophisticated statistical approaches. Our basic strategy has been to use conventional linguistic knowledge because we believe that ultimately statistical methods by themselves are inadequate. Kay (2004) argues that "statistics are a surrogate for knowledge of the world" and that "this is an alarming trend that computational linguists ... should resist with great determination." This was reinforced by Farghaly (2004) when he wrote "It is becoming increasingly evident that statistical and corpus-based approaches...are not sufficient..."

Our policy is that linguistic rules, based on deep analysis of the source and target scripts, are indispensable. To rephrase, many contemporary statistical methods involve brute-force mathematical techniques that exploit vast amounts of data, whereas a rule-based approach captures aspects of human intelligence because it is based on linguistic knowledge. We have combined linguistic rules with statistically derived mapping tables to build a flexible system that can be extended to other Arabic script based languages.

3 Basic Concepts

Much confusion surrounds the terms transliteration and transcription, with the former often misleadingly used in the sense of the latter even in academic papers (AbdulJaleel and Larkey, 2003). To discuss these concepts in an unambiguous manner it is necessary to understand these and related terms correctly.

Romanization is the representation of a language written in a non-Roman script using the Roman alphabet. This includes both transliteration and transcription, e.g. محمد is transliterated as \mHmd\ and transcribed as Mohammed, Muhammad, or Mohamad, among many others.

Transliteration is a representation of the script of a source language by using the characters of another script. Ideally, it unambiguously represents the graphemes, rather than the phonemes, of the source language. For example, محمد is transliterated as \mHmd\, in which each Arabic letter is unambiguously represented by one Roman letter, enabling round-trip conversion.

Transcription is a representation of the source script of a language in the target script in a manner that reflects the pronunciation of the original, often ignoring graphemic correspondence. This includes the following subcategories:

1. A phonetic transcription represents the actual speech sounds, including allophones. The best known of these is IPA. For example, محمد is transcribed as [muħɛ̈mmɛ̈d].
2. A phonemic transcription represents the phonemes of the source language (ignoring allophones), ideally on a one-to-one basis. For example, محمد is transcribed as /muHammad/, in which a represents the phoneme /a/, rather than the phone [ɛ̈].
3. A popular transcription is a conventionalized orthography that roughly represents pronunciation. For example, محمد is transcribed in some 200 different ways, such as Mohammed, Muhammad, Moohammad, Moohamad, Mohammad, Mohamad, etc.

Diacriticization is the process of adding vowel signs (called vocalization) and other diacritics. For example, محمد \mHmd\ is converted to the vocalized مُحَمَّد \muHam~ad\. Note the four diacritics that were added.

Arabization is the reverse of romanization; that is, the representation of a non-Arabic script, such as the Roman and CJK scripts, using the Arabic alphabet, e.g., Muhammad → محمد, Clinton → كلينتون, 埼玉 Saitama → سايتاما.

4 Why is Arabic ambiguous?

A distinguishing feature of abjads in general, and of Arabic in particular, is that words are written as a string of consonants with little or no indication of vowels, referred to as unvocalized Arabic. Though diacritics can be used to indicate short vowels, they are used sparingly, while the use of consonants to indicate long vowels is ambiguous. On the whole, unvocalized Arabic is highly ambiguous and poses major challenges to Arabic information processing applications.

4.1 Morphological Ambiguity

Arabic is a highly inflected language. Inflection is indicated by changing the vowel patterns as well as by adding various suffixes, prefixes, and clitics. A full paradigm for كَاتِب /kaatib/ 'writer' that we created (for a comprehensive Arabic-English dictionary project) reaches a staggering total of 3487 valid forms, including affixes and
clitics as well as inflectional syncretisms. For example, كاتب can represent any of the following seven wordforms: كَاتِب /kaatib/, كَاتَبَ /kaataba/, كَاتِبٍ /kaatibin/, كَاتِبٌ /kaatibun/, كَاتِبَ /kaatiba/, كَاتِبِ /kaatibi/, كَاتِبُ /kaatibu/.

4.2 Orthographical Ambiguity

On the orthographic level, Arabic is also highly ambiguous. For example, the string مو can theoretically represent 40 consonant-vowel permutations, such as mawa, mawwa, mawi, mawwi, mawu, mawwu, maw, maww, miwa, miwwa, etc., though in practice some may never be used. Humans can normally disambiguate this by context, but for a program the task is formidable.

Conventional wisdom has it that the Arabic script is ambiguous "due to non-representation of short vowels," while other features are often lightly passed over. In fact, a whole gamut of factors contribute to orthographical ambiguity.

The list of factors below is not intended to serve as a detailed treatment of Arabic orthographic ambiguity, but to demonstrate the principal linguistic issues that need to be addressed to achieve accurate transcription.

1. The greatest challenge is the omission of short vowels; e.g., the unvocalized كاتب \kAtb\ can represent seven wordforms such as كَاتِب /kaatib/ and كَاتِبَ /kaatiba/.
2. In contrast, some short vowels actually are represented. For example, taa' marbuuTa often indicates a short /a/, as in جامعة /jaami`a/, while in foreign names short and long vowels are normally written identically by adding ا, ي or و, as in روسيا \rwsyA\ 'Russia'.
3. Long /aa/ can be expressed in multiple ways, e.g., by (1) 'alif Tawiila (ا) as in سوريا, by (2) 'alif mamduuda (آ) as in آسيا, and by (3) 'alif maqSuura (ى) as in آسيا الوسطى.
4. Long vowels are sometimes omitted too, as in هذا /haadha/. In this case, the 'alif qaSiira ("dagger alif") is omitted.
5. Not all bare alifs represent long /aa/. Some are silent (next item), while some are nunated; e.g., را in شكرا represents راً /ran/, not رَا /raa/.
6. 'alif alfaaSila (otiose alif), added to the third person masculine plural forms of the past tense, is a mere orthographic convention and is not pronounced. Though it must be transliterated, it must not be transcribed, e.g., كتبوا is transliterated as \ktbwA\, with 'alif at the end, but transcribed as /katabuu/, omitting the 'alif.
7. The diacritic shadda indicating consonant gemination is normally omitted, e.g., the unvocalized محمد Muhammad (vocalized مُحَمَّد) provides no clues that the [m] should be doubled.
8. Another source of ambiguity is the omission of tanwiin diacritics for case endings, e.g., in شكرا \$ukrAF\ (vocalized شُكْرَاً), the fatHatayn is not written.
9. The rules for determining the hamza seat are of notorious complexity. In transcribing to Arabic, it is difficult to determine the hamza seat as well as the short vowel that follows; e.g., hamzated waaw (ؤ) could represent /'a/, /'u/ or even /'/ (no vowel).
10. In arabization, determining the hamza seat requires the application of complex rules based on the phonological environment, which is further complicated by the frequent omission and inconsistent use of hamza in foreign names (see Section 7).
11. Phonological alternation processes such as assimilation modify the phonetic realization. For example, the unvocalized الرجل الطويل 'the tall man' is realized as /'arrajulu-TTawiilu/ (اَلرَّجُلُ ٱلطَّوِيلُ), in which the ال is assimilated into طّ /TTa/, not as /'alrajulu alTawiilu/.
12. Vowel shortening is sometimes lexically determined and thus cannot be predicted from the orthography; e.g., في القاهرة 'in Cairo' is pronounced /fi-lqaahira/, not /fii-lqaahira/. That is, /fii/ is shortened to /fi/.

5 Automatic Romanizer of Arabic Names

5.1 Overview

The Automatic Romanizer of Arabic Names (ARAN) consists of multiple modules for the transcription and transliteration of Arabic and related tasks such as variant generation and vocalization. The core problem that ARAN addresses is making an intelligent guess at determining the vowels of unvocalized Arabic names and generating romanized candidates based on statistically motivated linguistic rules derived from an in-depth analysis of Arabic orthography. The principal components of ARAN are:

1. ATAN: Automatic Transcriber of Arabic Names
2. AXAN: Automatic Transliterator of Arabic Names
3. APAN: Automatic Phoneticizer of Arabic Names
4. ADAN: Automatic Diacriticizer of Arabic Names
5. AVAN: Automatic Variant Generator for Arabic Names

Table 1 shows examples of how each module processes a string of unvocalized Arabic:

Table 1. Output from Various ARAN modules
Unvocalized (input) | Vocalized (ADAN) | Phonemic (ATAN) | Graphemic (AXAN) | Phonetic (APAN) | Popular (AVAN)*
محمد | مُحَمَّد | muHammad | mHmd | muħɛ̈mmɛ̈d | Muhammad
قابوس | قَابُوس | qaabuus | qAbws | qɑːbuːs | Qaboos
جمال | جَمَال | jamaal | jmAl | dʒɛ̈mɛ̈ːl | Jamal
مكة | مَكَّة | makka | mkp | mɛ̈kkɛ | Mecca
*Only one popular variant is shown, but in reality there could be dozens. For example, for قابوس AVAN generates Qabuus, Qabus, Qabous, Qabooss, … and many more.

5.2 Romanization Algorithm

The romanization algorithm accepts an Arabic string as input and generates a list of romanized candidates by combining lookup in the Database of Arabic Names (DAN), a database of about 180,000 romanized Arabic names and their variants, with a knowledge base of rules. ARAN can generate candidates in pure algorithmic mode, or it can access DAN to find explicit entries before resorting to algorithmic generation. Roughly, the algorithm works as follows:

1. Get an Arabic string from the input file.
2. Transliterate to Buckwalter for internal processing using the AXAN module.
3. Attempt to find an exact match in DAN.
4. If that fails, perform a fuzzy match to retrieve from DAN.
5. If that fails, generate romanization candidates algorithmically.
6. Output a list of romanized candidates.

For example, إبراهيم is first transliterated to \<brAhym\ and looked up in DAN. If the parameters are set to return popular readings and their variants, the output will be Ibrahiim, Ibrahim, Ebraheem, Ebrahiim.... If the parameters are set to return purely generated candidates the result will be ibraahiim, ibaraahiim, ibiraahiim, iburaahiim, one of which is correct. These candidates can be further expanded by AVAN to generate variants such as Ibrahim and Ibraahim.

Fuzzy matching, such as ignoring hamza and collapsing 'alif with 'alif maqSuura, is a bit risky but might improve the match rate because fuzzily matched names could often be correct, whereas generated names could have incorrect short vowels. The user can set parameters to output any desired combination of three modes: exact match, fuzzy match or algorithmic generation.

5.3 Rules Knowledge Base

ARAN uses a knowledge base module for generating romanized strings from the Arabic input string. This is the central component of the algorithm but is independent of it for maximum flexibility. The rules can be modified by the user to further refine the accuracy or to adjust them to other Arabic-script based languages.

The knowledge base was created by in-depth analysis of the Arabic orthography using the results of statistical analysis of a large name corpus based on a bilingually aligned phone directory. A regular-expression-like mini-language for writing vocalization and romanization rules was developed in which LHS (left-hand side) and RHS (right-hand side) style rules are defined as declarative statements on a high level of abstraction independent of specific computer languages. These are then implemented by the appropriate functions in the romanization algorithm module. For example, the rule "I:C1(?=[^Awyp]):&c[aiu]" (colons are field separators) means the following:

    an initial consonant (indicated by "I" in the first field) in the C1 consonant subset not followed by a long vowel 'alif, waaw, yaa' or taa' marbuuTa (expressed as a regex lookahead), is converted to the corresponding consonant in question (defined in a mapping table) followed by one of the romanized short vowels 'a', 'i' or 'u'.

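The lookup cascade of Section 5.2 and the sample rule of Section 5.3 can be illustrated with a minimal Python sketch. It is not ARAN's implementation: the names `to_buckwalter`, `fuzzy_key`, `apply_initial_rule`, `romanize` and `TOY_DAN` are hypothetical, the Buckwalter table is deliberately partial, the contents of the C1 consonant subset are assumed (the paper does not list them), and only the word-initial rule is applied, so generated candidates remain partly in Buckwalter notation.

```python
import re

# Partial Buckwalter table (only the letters needed for the demo names).
BUCKWALTER = {
    "\u0627": "A",  # alif
    "\u0623": ">",  # alif with hamza above
    "\u0625": "<",  # alif with hamza below
    "\u0628": "b",  # baa'
    "\u0631": "r",  # raa'
    "\u0647": "h",  # haa'
    "\u064a": "y",  # yaa'
    "\u0645": "m",  # miim
    "\u062d": "H",  # Haa'
    "\u062f": "d",  # daal
    "\u0649": "Y",  # alif maqSuura
}

# Hypothetical stand-in for DAN: Buckwalter key -> popular romanizations.
TOY_DAN = {"<brAhym": ["Ibrahiim", "Ibrahim", "Ebraheem", "Ebrahiim"]}

# Assumed contents of the C1 subset and its consonant mapping table.
C1 = "bmHdrh"
CONSONANT_MAP = {"b": "b", "m": "m", "H": "H", "d": "d", "r": "r", "h": "h"}

# "I:C1(?=[^Awyp]):&c[aiu]": an initial C1 consonant that is followed by
# something other than A, w, y or p (checked with a lookahead).
INITIAL_RULE = re.compile(rf"^([{C1}])(?=[^Awyp])")


def to_buckwalter(arabic: str) -> str:
    """Step 2: transliterate letter by letter (raises KeyError if unknown)."""
    return "".join(BUCKWALTER[ch] for ch in arabic)


def fuzzy_key(bw: str) -> str:
    """Step 4: ignore hamza and collapse alif with alif maqSuura."""
    bw = re.sub(r"[<>|{]", "A", bw)
    return bw.replace("Y", "A").replace("'", "")


def apply_initial_rule(bw: str) -> list[str]:
    """Step 5 (one rule only): expand the initial consonant with a, i, u."""
    match = INITIAL_RULE.match(bw)
    if match is None:
        return [bw]
    consonant = CONSONANT_MAP[match.group(1)]
    return [consonant + vowel + bw[match.end():] for vowel in "aiu"]


def romanize(arabic: str, dan: dict[str, list[str]] = TOY_DAN):
    """Exact match, then fuzzy match, then algorithmic generation."""
    bw = to_buckwalter(arabic)
    if bw in dan:
        return "exact", dan[bw]
    key = fuzzy_key(bw)
    for entry, names in dan.items():
        if fuzzy_key(entry) == key:
            return "fuzzy", names
    return "generated", apply_initial_rule(bw)
```

With this toy data, إبراهيم is found by exact match, the hamza-less spelling ابراهيم falls through to the fuzzy match, and محمد (absent from the toy database) reaches the generation step, where the initial \m\ is expanded with each of the three short vowels.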
6 Non-Arabic Name Arabizer

6.1 Overview

The Non-Arabic Name Arabizer (NANA) is designed to arabize non-Arabic names. This includes Roman-script names such as Bill Clinton to بيل كلينتون, as well as a technology probably never attempted before: transcribing CJK names directly into Arabic. We have developed language-dependent rules, mapping tables and algorithms for transcribing CJK names written in their native scripts. For example, the Japanese placename 埼玉 /saitama/ is arabized as سايتاما, the Chinese name 杨海洋 /yang haiyang/ as يانغ هاييانغ, and the Korean city 부산 /busan/ as بوسان.

Various papers, such as AbdulJaleel and Larkey (2003), describe systems for transcribing Roman-script names into Arabic. Although NANA also has this capability, it is beyond the scope of this paper. The issues for Chinese and Korean, the subject of a future paper, are similar in nature but require a different set of language-specific rules.

6.2 Arabization Policy

A fundamental problem in arabizing CJK names is that there are significant differences between the Arabic and CJK phonological systems and the lack of detailed transcription standards. Since these languages are not well known in the Arab-speaking world, CJK names are often arabized on the basis of their romanized transcriptions, rather than the native script, and it is sometimes erroneously assumed that the Roman letters are pronounced as in English. This is further complicated by the plethora of CJK romanization standards. We have established an arabization policy for Japanese based on a number of sometimes conflicting criteria:

1. How names are actually spelled on the Arabic web, atlases, maps and books.
2. Ensuring that the same source syllables are spelled consistently, taking into account phonological changes.
3. Treating Japanese names as a sequence of syllables, rather than of morae, since that is how they are commonly transcribed.
4. Using hamza to represent vowel sequences only in those cases where diphthongization is not possible or awkward (see Section 6.3).
5. Generating hamzated variants, such as سائيتاما for the more common سايتاما (埼玉 /saitama/), and other kinds of variants, such as كاناجاوا for the more common كاناغاوا (神奈川 /kanagawa/).

6.3 Vowel Sequence Ambiguity

Vowel sequences are difficult to transcribe because they could represent diphthongs, monophthongs (distinct vowels), or long vowels. Representing Japanese vowels accurately in Arabic is not possible. In cases where vowel sequences represent monophthongs, hamza is sometimes used and sometimes omitted.

Table 2. Diphthong Ambiguity for 福井 /fu-ku-i/
No. | Arabic | Google hits | Buckwalter
1 | فوكوئي | 468 | fwkw}y
2 | فوكوئ | 9 | fwkw}
3 | فوكوي | 1950 | fwkwy
4 | فوكويي | 335 | fwkwyy

Table 2 shows some of the variation to expect in Japanese name arabization. Though phonologically (2) is the most accurate, it is the least used. As expected, the diphthongized (3) is the most common form because of the tendency to avoid hamza in foreign names. Some important vowel sequence issues are:

1. There is a strong tendency not to use non-initial hamza, as in (1) and (2) above, in foreign names. One reason for this is insufficient knowledge of the phonology of the source language, especially of such "exotic" languages as Japanese.
2. Japanese is especially problematic because it is moraic. Some Japanese mora sequences, such as あい /ai/ or うい /ui/, are often diphthongized in Arabic, though ideally the second vowel should be treated as a monophthong represented by hamza. That is, 福井 /fu-ku-i/ should be written as (1) فوكوئي or (2) فوكوئ, rather than the more common (3) فوكوي.
3. In theory, a vowel sequence like /ai/ as in さい /sa-i/ can be written in five ways: ساي, سائي, سايي, سائ, سي. To accurately transcribe a name like Saitama (埼玉) it is necessary to know that it consists of four morae (/sa-i-ta-ma/ さいたま), rather than three syllables (/sai-ta-ma/). Ideally it should be transcribed as سائيتاما, rather than the much more common سايتاما. That is, since /sa-i/ is a bimoraic syllable, the hamza
over yaa' should be used to represent /i/ as a distinct monophthong, as in سائ. In reality, Saitama is normally spelled سايتاما, so that /sa-i/ is diphthongized as ساي /say/.
4. In names like 福岡 /fu-ku-o-ka/ the sequence /ku-o/ represents distinct sounds that cannot be diphthongized. Following hamza rules, this should be written فوكوؤوكا, but in fact it is commonly spelled فوكوأوكا, in which أو, rather than ؤو, represents /u/. Omitting the hamza here would make little sense.

6.4 Long and Short Vowels

The treatment of Japanese vowels is complex and may have hamzated variants.

Table 3. Long and Short Vowels
No. | Kanji | Kana | Phonemic | Arab1 | Arab2 | Arab3
1 | 太田 | おおた | oota | أوتا | |
2 | 風馬 | ふうま | fuuma | فوما | |
3 | 敬子 | けいこ | keiko | كييكو | كيكو |
4 | 空野 | くうの | kuuno | كونو | |
5 | 久野 | くの | kuno | كونو | |
6 | 日枝 | ひえだ | hieda | هييدا | هيئدا | هيئيدا
7 | 芳江 | よしえ | yoshie | يوشيي | يوشيئه | يوشيئي

1. Japanese long vowels are expressed in various ways, such as by repeating the vowel as in (2) ふう /fuu/, or by adding う /u/ after /o/ as in (1). えい /ei/ is special because the ي may be repeated, as in (3).
2. Since short vowels are omitted in Arabic, short vowels in foreign names are normally transcribed as if they were long; that is, by adding ا for /a/, ي for /i/ and و for /u/. Thus both (4) and (5) are written identically as كونو and there is no way to distinguish vowel length.
3. Normally the vowel /e/ is not distinguished from /i/ and both are represented by ي. An extra complication is that at word end /e/ is sometimes expressed by ه, so that in transcribing such names as (6) and (7) it is necessary to consider hamza rules, whether to diphthongize, the position of the syllable in the word, and how these interact.

6.5 Arabization Algorithm

The arabization algorithm accepts a CJK string as input and generates a list of arabized candidates by combining lookup in the Japanese-English Proper Noun Database (JEP), a database of about 600,000 Japanese personal and place names, using a knowledge base of rules and mapping tables fine-tuned to the Japanese and Arabic phonological systems. Roughly, the algorithm works as follows:

1. Get a string from the input file.
2. Determine if the string is Japanese.
3. Convert kanji to hiragana reading by looking up in JEP.
4. Convert hiragana to romanized Japanese by looking up in JEP.
5. If (3) fails, convert to hiragana algorithmically (difficult due to extreme ambiguity).
6. If (3) returns multiple strings, use criteria like frequency and semantic codes to eliminate unlikely candidates.
7. Determine whether to diphthongize or to use hamza by considering both the hiragana and the romanized Japanese.
8. Use the rules knowledge base, which is embedded in multi-option comprehensive hiragana-to-Arabic mapping tables, to convert to Arabic script.
9. The AVAN module generates variants if requested by user parameters.
10. Output arabized name (with or without variants as necessary).

We have not yet performed formal error rate testing, but our preliminary experiments indicate that the above algorithm can arabize a CJK name to its correct or legitimate variant form with a success rate of nearly 100%. This is because the algorithm is based on a thorough understanding of the Arabic and Japanese (as well as Chinese and Korean, though not discussed here) phonological systems, and a comprehensive mapping table designed to cover almost all possible Japanese-to-Arabic mappings, including positional variants and phonological changes resulting from liaison.

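Steps 7 and 8 of the arabization algorithm (choosing between diphthongization and hamza, then mapping to Arabic script) can be illustrated with a small Python sketch. It is only an approximation under stated assumptions: the input is a list of already romanized morae (the JEP lookup of steps 3 to 6 is not modeled), the function name `arabize`, the `hamzate` switch and the mapping tables are hypothetical and cover just a few consonants, hamza is placed on a yaa' seat for /i/ only, and word-initial vowels are not handled.

```python
# Hypothetical, partial mapping tables (romanized mora parts -> Arabic).
CONSONANTS = {
    "s": "\u0633",  # siin
    "t": "\u062a",  # taa'
    "m": "\u0645",  # miim
    "f": "\u0641",  # faa'
    "k": "\u0643",  # kaaf
    "n": "\u0646",  # nuun
}
# Short vowels are written as if long; /e/ shares yaa', /o/ shares waaw.
VOWELS = {
    "a": "\u0627",  # alif
    "i": "\u064a",  # yaa'
    "e": "\u064a",
    "u": "\u0648",  # waaw
    "o": "\u0648",
}
HAMZA_ON_YAA = "\u0626"


def arabize(morae: list[str], hamzate: bool = False) -> str:
    """Map romanized morae such as ["sa", "i", "ta", "ma"] to Arabic.

    hamzate=False diphthongizes a bare /i/ mora (the common spelling);
    hamzate=True writes it as hamza on yaa' followed by yaa'.
    """
    out = []
    prev_vowel = None
    for index, mora in enumerate(morae):
        consonant, vowel = mora[:-1], mora[-1]
        if consonant:
            out.append(CONSONANTS[consonant] + VOWELS[vowel])
        elif index == 0:
            raise ValueError("word-initial vowels are outside this sketch")
        elif vowel == prev_vowel:
            pass  # long vowel: length is not distinguished in the output
        elif hamzate and vowel == "i":
            out.append(HAMZA_ON_YAA + VOWELS["i"])
        else:
            out.append(VOWELS[vowel])
        prev_vowel = vowel
    return "".join(out)
```

Under these assumptions, `arabize(["sa", "i", "ta", "ma"])` yields the common spelling سايتاما and `hamzate=True` yields the hamzated variant سائيتاما; likewise /fu-ku-i/ gives forms (3) and (1) of Table 2, and /ku-u-no/ and /ku-no/ collapse to the same string, as in rows (4) and (5) of Table 3.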
7 Arabic Orthographic Variants

The number of personal names and their variants in the world is in the billions. Identifying names and their variants (named entity recognition) is a hot topic in computational linguistics. To enhance this technology, we added a variant generation module (AVAN) to both the ARAN and NANA systems, which is supported by comprehensive databases of CJK proper nouns.

7.1 Romanization Variants

The many popular transcriptions of Arabic names result in a large number of variants. One reason for this is that several Arabic consonants, such as ع [ʔˁ], ض [dˁ], ط [tˁ] and ظ [ðˁ], do not exist in European languages. These sounds are difficult to pronounce and are rendered in different ways when romanized. Another factor is the bewildering variety of ways in which Arabic vowels are transcribed, partially due to dialectal influences. For example, the أ /'u/ in أسامة is transcribed in various ways as seen in Usama, Ousama, Osama and Oosama, while معمر \mEmr\ is spelled as Moammar, Muammar, Mu'ammar, Mo'ammar, Moamer, Moamar, and others.

7.2 Arabic Variants

Both Arab and foreign names have orthographic variants in Arabic. These are of two kinds:

1. Orthographic variants are non-standard ways to spell a specific variant of a name, like ابو ظبي instead of أبو ظبي for Abu Dhabi, in which the hamza is omitted.
2. Orthographic errors are frequently occurring, systematic spelling mistakes, like yaa' in ابو ظبي (Abu Dhabi) being replaced by 'alif maqSuura in ابو ظبى.

Table 4. Orthographic Variation in Arabic Names
Standard | Buckwalter | English | Variant | Error | Remarks
أبو ظبي | >bw Zby | Abu Dhabi | ابو ظبي | أبو ظبى, ابو ظبى | V: omit hamza; E: 'alif maqSuura replaces yaa'
الإسكندرية | Al<skndryp | Alexandria | الاسكندرية | الإسكندريه | V: omit hamza; E: haa' replaces taa' marbuuTa
بالو ألتو | bAlw >ltw | Palo Alto | بالو التو, بالو آلتو | | V1: omit hamza; V2: madda replaces hamza
طوكيو | Twkyw | Tokyo | | توكيو | E: taa' replaces Taa'

Table 4 shows examples of variants ("V") and errors ("E"). Though the difference between these cannot be rigorously defined, they are both of frequent occurrence based on statistical and linguistic analysis of MSA orthography. It should also be noted that "standard form," though linguistically correct, is not necessarily the most common form (we are gathering statistics for the occurrence of each form).

There are often many more variants than those shown above. For example, Alexandria can be written in about a dozen ways, the most frequent ones according to Google being الاسكندرية with 2,930,000, الإسكندرية with 690,000, and الاسكندريه with 89,200 occurrences respectively.

8 System Modules and Future Work

The principal components of ARAN (some of which are in progress) are briefly described below:

1. The Automatic Transcriber of Arabic Names (ATAN) is ARAN's core module for generating phonemic and popular transcriptions of Arabic personal names. Because of the inconsistent nature of the various popular Arabic romanization systems, there are often many, sometimes dozens or even hundreds, of romanizations for the same name. ATAN supports most of the commonly used systems, and has a flexible architecture that enables the user to configure the system to support user-defined systems. For example, شولوخ, which is first transliterated to \$wlwx\ by the AXAN module, can then be transcribed as /shwlwkh/ in the ALA-LC system, as /šūlūḫ/ in the DIN system, as Shoulokh as a possible English spelling, etc. The AVAN module can then be used to return many popular variants.
2. The Automatic Transliterator of Arabic Names (AXAN) generates transliterations of
Arabic names or any other Arabic text. There are few strict transliteration systems that use unique symbols for each letter and allow for round-trip conversion. The excellent and widely used Buckwalter transliteration system is not only supported by AXAN, but is also used for internal processing in all ARAN databases and algorithms. AXAN can be configured to support other transliteration systems, including Cyrillization, by adding custom mapping tables.
3. The Automatic Phoneticizer of Arabic Names (APAN) generates phonetic transcriptions of Arabic names in IPA. This represents the actual pronunciation in MSA, including distinctions between the major allophones. For example, the name قابوس Qaboos is transcribed as [qɑːbuːs]. Note that the phonemic transcription /qaabuus/ only indicates the vowel length (/aa/), whereas the phonetic transcription also indicates the quality of the vowel ([ɑː]), distinguishing it from its more common realization [ɛ̈ː]. APAN can be configured to transcribe in various MSA flavors. This refers to regional variations in MSA pronunciation, not to Arabic dialects per se. For example, for جمال /jamaal/ APAN generates [dʒɛ̈mɛ̈ːl] for Gulf MSA, [gɛ̈mɛ̈ːl] for Egyptian MSA, and [ʒɛmɛ̈ːl] for Levantine MSA.
4. The Automatic Generator of Variants for Arabic Names (AVAN) supports the ARAN and NANA systems by generating a large number of variants and variant candidates both algorithmically and by retrieving from hardcoded databases, whose occurrences are then validated in Arabic corpora and the web. See Section 7 for details.
5. The Automatic Diacriticizer of Arabic Names (ADAN) automatically diacriticizes, or adds vowels and other diacritics (like fatha and shadda) to unvocalized or semi-vocalized Arabic. For example, محمد \mHmd\ and الرياض \AlryAD\ are converted to the vocalized مُحَمَّد and الرِّيَاض respectively. This is related to, but distinct from, the equally difficult task of phonemic transcription. See Table 1 for examples.
6. There are dozens of non-Arabic languages that are or have been written in the Arabic script, referred to as Arabic Script Based Languages (ASBL). The most important of these are Farsi (official language of Iran), Pashto (western Pakistan and official language of Afghanistan), Dari (Afghan dialect of Farsi, official language of Afghanistan), Urdu (official language of Pakistan) and Kurdish (Turkey, Iraq, Iran, Syria, Armenia, Lebanon). Others include Shahmukhi (Pakistani version of Punjabi), Kashmiri (India and Pakistan), and Uyghur (northwest China). ARAN will eventually be expanded to (1) romanize to/from the major ASBL languages, (2) automatically identify the language, (3) automatically detect legacy encodings and convert to Unicode.

9 Conclusion

As we have seen, the high level of ambiguity in the Arabic script makes it challenging to build automatic transcription systems that produce reliable results. In particular, we have seen the difficulties in arabizing CJK names due to the lack of standards and to the major phonological differences between the languages. We have also seen how important linguistic knowledge is in such areas as Japanese-to-Arabic transcription, resulting in a very high accuracy rate. Since Arabic transcription is playing an increasingly important role in a variety of practical applications, it is necessary to pursue efforts to develop more language-specific transcription systems based on linguistic knowledge.

References

Nasreen AbdulJaleel and Leah S. Larkey. 2003. Statistical transliteration for English-Arabic Cross Language Information Retrieval. CIKM 2003: 139-146.

M. Arbabi, S.M. Fischthal, V.C. Cheng and E. Bart. 1994. Algorithms for Arabic name transliteration. IBM J. Res. Develop., 38(2).

Moustafa Elshafei, Husni Al-Muhtaseb, Mansour Al-Ghamdi. 2006. Machine Generation of Arabic Diacritical Marks. MLMTA 2006: 128-133.

Ali Farghaly. 2004. Computer Processing of Arabic Script-based Languages: Current State and Future Directions. COLING 2004.

Martin Kay. 2004. Arabic Script-based Languages deserve to be studied linguistically. COLING 2004.

D. Vergyri and K. Kirchhoff. 2004. Automatic Diacritization of Arabic for Acoustic Modeling in Speech Recognition. COLING Workshop on Arabic-script Based Languages, Geneva, Switzerland, 2004.
