We have not used sophisticated statistical approaches. Our basic strategy has been to use conventional linguistic knowledge because we believe that ultimately statistical methods by themselves are inadequate. Kay (2004) argues that "statistics are a surrogate for knowledge of the world" and that "this is an alarming trend that computational linguists ... should resist with great determination." This was reinforced by Farghaly (2004) when he wrote "It is becoming increasingly evident that statistical and corpus-based approaches...are not sufficient..."

Our policy is that linguistic rules, based on deep analysis of the source and target scripts, are indispensable. To rephrase, many contemporary statistical methods involve brute-force mathematical techniques that exploit vast amounts of data, whereas a rule-based approach captures aspects of human intelligence because it is based on linguistic knowledge. We have combined linguistic rules with statistically derived mapping tables to build a flexible system that can be extended to other Arabic script based languages.

3 Basic Concepts

Much confusion surrounds the terms transliteration and transcription, with the former often misleadingly used in the sense of the latter even in academic papers (AbdulJaleel and Larkey, 2003). To discuss these concepts in an unambiguous manner it is necessary to understand these and related terms correctly.

Romanization is the representation of a language written in a non-Roman script using the Roman alphabet. This includes both transliteration and transcription, e.g. محمد is transliterated as \mHmd\ and transcribed as Mohammed, Muhammad, or Mohamad, among many others.

Transliteration is a representation of the script of a source language by using the characters of another script. Ideally, it unambiguously represents the graphemes, rather than the phonemes, of the source language. For example, محمد is transliterated as \mHmd\, in which each Arabic letter is unambiguously represented by one Roman letter, enabling round-trip conversion.

Transcription is a representation of the source script of a language in the target script in a manner that reflects the pronunciation of the original, often ignoring graphemic correspondence. This includes the following subcategories:

1. A phonetic transcription represents the actual speech sounds, including allophones. The best known of these is IPA. For example, محمد is transcribed as [muħɛ̈mmɛ̈d].
2. A phonemic transcription represents the phonemes of the source language (ignoring allophones), ideally on a one-to-one basis. For example, محمد is transcribed as /muHammad/, in which a represents the phoneme /a/, rather than the phone [ɛ̈].
3. A popular transcription is a conventionalized orthography that roughly represents pronunciation. For example, محمد is transcribed in some 200 different ways, such as Mohammed, Muhammad, Moohammad, Moohamad, Mohammad, Mohamad, etc.

Diacriticization is the process of adding vowel signs (called vocalization) and other diacritics. For example, محمد \mHmd\ is converted to the vocalized مُحَمَّد \muHam~ad\. Note the four diacritics that were added.

Arabization is the reverse of romanization; that is, the representation of a non-Arabic script, such as the Roman and CJK scripts, using the Arabic alphabet, e.g., Muhammad → محمد, Clinton → كلينتون, 埼玉 Saitama → سايتاما.

4 Why is Arabic ambiguous?

A distinguishing feature of abjads in general, and of Arabic in particular, is that words are written as a string of consonants with little or no indication of vowels, referred to as unvocalized Arabic. Though diacritics can be used to indicate short vowels, they are used sparingly, while the use of consonants to indicate long vowels is ambiguous. On the whole, unvocalized Arabic is highly ambiguous and poses major challenges to Arabic information processing applications.

4.1 Morphological Ambiguity

Arabic is a highly inflected language. Inflection is indicated by changing the vowel patterns as well as by adding various suffixes, prefixes, and clitics. A full paradigm for كَاتِب /kaatib/ 'writer' that we created (for a comprehensive Arabic-English dictionary project) reaches a staggering total of 3487 valid forms, including affixes and
clitics as well as inflectional syncretisms. For example, كاتب can represent any of the following seven wordforms: كَاتِب /kaatib/, كَاتَبَ /kaataba/, كَاتِبٍ /kaatibin/, كَاتِبٌ /kaatibun/, كَاتِبَ /kaatiba/, كَاتِبِ /kaatibi/, كَاتِبُ /kaatibu/.

4.2 Orthographical Ambiguity

On the orthographic level, Arabic is also highly ambiguous. For example, the string مو can theoretically represent 40 consonant-vowel permutations, such as mawa, mawwa, mawi, mawwi, mawu, mawwu, maw, maww, miwa, miwwa, etc., though in practice some may never be used. Humans can normally disambiguate this by context, but for a program the task is formidable.

Conventional wisdom has it that the Arabic script is ambiguous "due to non-representation of short vowels," while other features are often lightly passed over. In fact, a whole gamut of factors contributes to orthographical ambiguity.

The list of factors below is not intended to serve as a detailed treatment of Arabic orthographic ambiguity, but to demonstrate the principal linguistic issues that need to be addressed to achieve accurate transcription.

1. The greatest challenge is the omission of short vowels; e.g., the unvocalized كاتب \kAtb\ can represent seven wordforms such as كَاتِب /kaatib/ and كَاتِبَ /kaatiba/.
2. In contrast, some short vowels actually are represented. For example, taa' marbuuTa often indicates a short /a/, as in جامعة /jaami`a/, while in foreign names short and long vowels are normally written identically by adding ا, ي or و, as in روسيا \rwsyA\ 'Russia'.
3. Long /aa/ can be expressed in multiple ways, e.g., by (1) 'alif Tawiila (ا) as in سوريا, by (2) 'alif maduuda (آ) as in آسيا, and by (3) 'alif maqSuura (ى) as in آسيا الوسطى.
4. Long vowels are sometimes omitted too, as in هذا /haadha/. In this case, the 'alif qaSiira ("dagger alif") is omitted.
5. Not all bare alifs represent long /a/. Some are silent (next item), while some are nunated; e.g., را in شكرا represents رًا /ran/, not رَا /raa/.
6. 'alif alfaaSila (otiose alif), added to the third person masculine plural forms of the past tense, is a mere orthographic convention and is not pronounced. Though it must be transliterated, it must not be transcribed, e.g., كتبوا is transliterated as \ktbwA\, with 'alif at the end, but transcribed as /katabuu/, omitting the 'alif.
7. The diacritic shadda indicating consonant gemination is normally omitted, e.g., the unvocalized محمد Muhammad (vocalized مُحَمَّد) provides no clues that the [m] should be doubled.
8. Another source of ambiguity is the omission of tanwiin diacritics for case endings, e.g., in شكرا \$ukrAF\ (vocalized شُكْرًا), the fatHatayn is not written.
9. The rules for determining the hamza seat are of notorious complexity. In transcribing to Arabic, it is difficult to determine the hamza seat as well as the short vowel that follows; e.g., hamzated waaw (ؤ) could represent /'a/, /'u/ or even /'/ (no vowel).
10. In arabization, determining the hamza seat requires the application of complex rules based on the phonological environment, which is further complicated by the frequent omission and inconsistent use of hamza in foreign names (see Section 7).
11. Phonological alternation processes such as assimilation modify the phonetic realization. For example, the unvocalized الرجل الطويل 'the tall man' is realized as /'arrajulu-TTawiilu/ (اَلرَّجُلُ ٱلطَّوِيلُ), in which the ال is assimilated into طّ /TTa/, not as /'alrajulu alTawiilu/.
12. Vowel shortening is sometimes lexically determined and thus cannot be predicted from the orthography; e.g., في القاهرة 'in Cairo' is pronounced /fi-lqaahira/, not /fii-lqaahira/. That is, /fii/ is shortened to /fi/.

5 Automatic Romanizer of Arabic Names

5.1 Overview

The Automatic Romanizer of Arabic Names (ARAN) consists of multiple modules for the transcription and transliteration of Arabic and related tasks such as variant generation and vocalization. The core problem that ARAN addresses is making an intelligent guess at determining the vowels of unvocalized Arabic names and generating romanized candidates based on statistically motivated linguistic rules derived from an in-depth analysis of Arabic orthography. The principal components of ARAN are:

1. ATAN: Automatic Transcriber of Arabic Names
2. AXAN: Automatic Transliterator of Arabic Names
3. APAN: Automatic Phoneticizer of Arabic Names
4. ADAN: Automatic Diacriticizer of Arabic Names
5. AVAN: Automatic Variant Generator for Arabic Names

Table 1 shows examples of how each module processes a string of unvocalized Arabic:

Table 1. Output from Various ARAN modules
Unvocalized (input) | Vocalized (ADAN) | Phonemic (ATAN) | Graphemic (AXAN) | Phonetic (APAN) | Popular (AVAN)*
محمد | مُحَمَّد | muHammad | mHmd | muħɛ̈mmɛ̈d | Muhammad
قابوس | قَابُوس | qaabuus | qAbws | qɑːbuːs | Qaboos
جمال | جَمَال | jamaal | jmAl | dʒɛ̈mɛ̈ːl | Jamal
مكة | مَكَّة | makka | mkp | mɛ̈kkɛ | Mecca
*Only one popular variant is shown, but in reality there could be dozens. For example, for قابوس AVAN generates Qabuus, Qabus, Qabous, Qabooss, … and many more.

... but might improve the match rate because fuzzily matched names could often be correct, whereas generated names could have incorrect short vowels. The user can set parameters to output any desired combination of three modes: exact match, fuzzy match or algorithmic generation.
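The graphemic (AXAN) column of Table 1 uses Buckwalter transliteration, in which each Arabic character maps to exactly one ASCII symbol. The following is a minimal sketch of such a round-trip mapping, not the AXAN implementation itself: the function names `to_buckwalter` and `from_buckwalter` are hypothetical, and the table covers only the basic letters and the common diacritics.

```python
# Buckwalter symbols for the basic Arabic letters (hamza forms first).
BUCKWALTER = {
    "ء": "'", "آ": "|", "أ": ">", "ؤ": "&", "إ": "<", "ئ": "}",
    "ا": "A", "ب": "b", "ة": "p", "ت": "t", "ث": "v", "ج": "j",
    "ح": "H", "خ": "x", "د": "d", "ذ": "*", "ر": "r", "ز": "z",
    "س": "s", "ش": "$", "ص": "S", "ض": "D", "ط": "T", "ظ": "Z",
    "ع": "E", "غ": "g", "ف": "f", "ق": "q", "ك": "k", "ل": "l",
    "م": "m", "ن": "n", "ه": "h", "و": "w", "ى": "Y", "ي": "y",
    # Diacritics, written as escapes because they are combining marks.
    "\u064b": "F",  # fatHatayn
    "\u064c": "N",  # Dammatayn
    "\u064d": "K",  # kasratayn
    "\u064e": "a",  # fatHa
    "\u064f": "u",  # Damma
    "\u0650": "i",  # kasra
    "\u0651": "~",  # shadda
    "\u0652": "o",  # sukuun
}

# Every symbol is unique, so the table can simply be inverted.
ARABIC = {symbol: letter for letter, symbol in BUCKWALTER.items()}


def to_buckwalter(text):
    """Transliterate Arabic script; unknown characters pass through unchanged."""
    return "".join(BUCKWALTER.get(char, char) for char in text)


def from_buckwalter(text):
    """Convert a Buckwalter string back to Arabic script."""
    return "".join(ARABIC.get(char, char) for char in text)


if __name__ == "__main__":
    for name in ["محمد", "قابوس", "جمال", "مكة"]:
        print(name, to_buckwalter(name))
```

Because the mapping is one-to-one, `from_buckwalter(to_buckwalter(s))` returns `s` for any string made up only of characters in the table. Input that mixes in Roman letters would not survive the round trip, since those letters are themselves Buckwalter symbols.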
6 Non-Arabic Name Arabizer

6.1 Overview

The Non-Arabic Name Arabizer (NANA) is designed to arabize non-Arabic names. This includes Roman-script names such as Bill Clinton to بيل كلينتون, as well as a technology probably never attempted before: transcribing CJK names directly into Arabic. We have developed language-dependent rules, mapping tables and algorithms for transcribing CJK names written in their native scripts. For example, the Japanese placename 埼玉 /saitama/ is arabized as سايتاما, the Chinese name 杨海洋 /yang haiyang/ as يانغ هاييانغ, and the Korean city 부산 /busan/ as بوسان.

Various papers, such as AbdulJaleel and Larkey (2003), describe systems for transcribing Roman-script names into Arabic. Although NANA also has this capability, it is beyond the scope of this paper. The issues for Chinese and Korean, the subject of a future paper, are similar in nature but require a different set of language-specific rules.

6.2 Arabization Policy

A fundamental problem in arabizing CJK names is that there are significant differences between the Arabic and CJK phonological systems and a lack of detailed transcription standards. Since these languages are not well known in the Arab-speaking world, CJK names are often arabized on the basis of their romanized transcriptions, rather than the native script, and it is sometimes erroneously assumed that the Roman letters are pronounced as in English. This is further complicated by the plethora of CJK romanization standards. We have established an arabization policy for Japanese based on a number of sometimes conflicting criteria:

1. How names are actually spelled on the Arabic web, atlases, maps and books.
2. Ensuring that the same source syllables are spelled consistently, taking into account phonological changes.
3. Treating Japanese names as a sequence of syllables, rather than of morae, since that is how they are commonly transcribed.
4. Using hamza to represent vowel sequences only in those cases where diphthongization is not possible or awkward (see Section 6.3).
5. Generating hamzated variants, such as سائيتاما for the more common سايتاما (埼玉 /saitama/), and other kinds of variants, such as كاناجاوا for the more common كاناغاوا (神奈川 /kanagawa/).

6.3 Vowel Sequence Ambiguity

Vowel sequences are difficult to transcribe because they could represent diphthongs, monophthongs (distinct vowels), or long vowels. Representing Japanese vowels accurately in Arabic is not possible. In cases where vowel sequences represent monophthongs, hamza is sometimes used and sometimes omitted.

Table 2. Diphthong Ambiguity for 福井 /fu-ku-i/
No. | Arabic | Google hits | Buckwalter
1 | فوكوئي | 468 | fwkw}y
2 | فوكوئ | 9 | fwkw}
3 | فوكوي | 1950 | fwkwy
4 | فوكويي | 335 | fwkwyy

Table 2 shows some of the variation to expect in Japanese name arabization. Though phonologically (2) is the most accurate, it is the least used. As expected, the diphthongized (3) is the most common form because of the tendency to avoid hamza in foreign names. Some important vowel sequence issues are:

1. There is a strong tendency not to use non-initial hamza, as in (1) and (2) above, in foreign names. One reason for this is insufficient knowledge of the phonology of the source language, especially of such "exotic" languages as Japanese.
2. Japanese is especially problematic because it is moraic. Some Japanese mora sequences, such as あい /ai/ or うい /ui/, are often diphthongized in Arabic, though ideally the second vowel should be treated as a monophthong represented by hamza. That is, 福井 /fu-ku-i/ should be written as (1) فوكوئي or (2) فوكوئ, rather than the more common (3) فوكوي.
3. In theory, a vowel sequence like /ai/ as in さい /sa-i/ can be written in five ways: ساي, سائي, سايي, سائ, سي. To accurately transcribe a name like Saitama (埼玉) it is necessary to know that it consists of four morae (/sa-i-ta-ma/ さいたま), rather than three syllables (/sai-ta-ma/). Ideally it should be transcribed as سائيتاما, rather than the much more common سايتاما. That is, since /sa-i/ is a bimoraic syllable, the hamza
over yaa' should be used to represent /i/ as a distinct monophthong, as in سائ. In reality, Saitama is normally spelled سايتاما, so that /sa-i/ is diphthongized as ساي /say/.
4. In names like 福岡 /fu-ku-o-ka/ the sequence /ku-o/ represents distinct sounds that cannot be diphthongized. Following hamza rules, this should be written فوكوؤوكا, but in fact it is commonly spelled فوكوأوكا, in which أو, rather than ؤو, represents /u/. Omitting the hamza here would make little sense.

6.4 Long and Short Vowels

The treatment of Japanese vowels is complex and may have hamzated variants.

Table 3. Long and Short Vowels
No. | Kanji | Kana | Phonemic | Arab1 | Arab2 | Arab3
1 | 太田 | おおた | oota | أوتا | |
2 | 風馬 | ふうま | fuuma | فوما | |
3 | 敬子 | けいこ | keiko | كييكو | كيكو |
4 | 空野 | くうの | kuuno | كونو | |
5 | 久野 | くの | kuno | كونو | |
6 | 日枝 | ひえだ | hieda | هييدا | هيئدا | هيئيدا
7 | 芳江 | よしえ | yoshie | يوشيئه | يوشيي | يوشيئي

1. Japanese long vowels are expressed in various ways, such as by repeating the vowel as in (2) ふう /fuu/, or by adding う /u/ after /o/ as in (1). えい /ei/ is special because the ي may be repeated, as in (3).
2. Since short vowels are omitted in Arabic, short vowels in foreign names are normally transcribed as if they were long; that is, by adding ا for /a/, ي for /i/ and و for /u/. Thus both (4) and (5) are written identically as كونو and there is no way to distinguish vowel length.
3. Normally the vowel /e/ is not distinguished from /i/ and both are represented by ي. An extra complication is that at word end /e/ is sometimes expressed by ه, so that in transcribing such names as (6) and (7) it is necessary to consider hamza rules, whether to diphthongize, the position of the syllable in the word, and how these interact.

6.5 Arabization Algorithm

The arabization algorithm accepts a CJK string as input and generates a list of romanized candidates by combining lookup in the Japanese-English Proper Noun Database (JEP), a database of about 600,000 Japanese personal and place names, using a knowledge base of rules and mapping tables fine-tuned to the Japanese and Arabic phonological systems. Roughly, the algorithm works as follows:

1. Get a string from the input file.
2. Determine if the string is Japanese.
3. Convert kanji to hiragana reading by looking up in JEP.
4. Convert hiragana to romanized Japanese by looking up in JEP.
5. If (3) fails, convert to hiragana algorithmically (difficult due to extreme ambiguity).
6. If (3) returns multiple strings, use criteria like frequency and semantic codes to eliminate unlikely candidates.
7. Determine whether to diphthongize or to use hamza by considering both the hiragana and the romanized Japanese.
8. Use the rules knowledge base, which is embedded in a multi-option comprehensive hiragana-to-Arabic mapping table, to convert to Arabic script.
9. The AVAN module generates variants if requested by user parameters.
10. Output arabized name (with or without variants as necessary).

We have not yet performed formal error rate testing, but our preliminary experiments indicate that the above algorithm can arabize a CJK name to its correct or legitimate variant form with a success rate of nearly 100%. This is because the algorithm is based on a thorough understanding of the Arabic and Japanese (as well as Chinese and Korean, though not discussed here) phonological systems, and a comprehensive mapping table designed to cover almost all possible Japanese-to-Arabic mappings, including positional variants and phonological changes resulting from liaison.
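Steps 7 to 10 of the algorithm in Section 6.5 can be illustrated with a small sketch. It is not the NANA implementation: the mapping tables below are hypothetical toy tables covering only the sounds needed for the Saitama and Fukui examples of Section 6.3, the function names are invented for illustration, and the input is assumed to be already split into romanized morae (that is, steps 1 to 6 are taken as done).

```python
# Hypothetical toy mapping tables; a real table would cover all kana.
CONSONANTS = {"s": "س", "t": "ت", "m": "م", "f": "ف", "k": "ك"}
# Short vowels are written like long ones: 'alif, yaa', waaw.
VOWELS = {"a": "ا", "i": "ي", "u": "و"}
HAMZA_ON_YAA = "ئ"


def arabize_morae(morae, use_hamza=False):
    """Map romanized morae such as ["sa", "i", "ta", "ma"] to Arabic script."""
    pieces = []
    for position, mora in enumerate(morae):
        if len(mora) == 2 and mora[0] in CONSONANTS and mora[1] in VOWELS:
            # Consonant + vowel mora, e.g. "sa".
            pieces.append(CONSONANTS[mora[0]] + VOWELS[mora[1]])
        elif mora == "i" and position > 0:
            # Bare /i/ after another mora: with hamza it stays a distinct
            # vowel; without hamza it is diphthongized with what precedes.
            if use_hamza:
                pieces.append(HAMZA_ON_YAA + VOWELS["i"])
            else:
                pieces.append(VOWELS["i"])
        else:
            raise ValueError(f"mora not covered by this sketch: {mora!r}")
    return "".join(pieces)


def arabized_candidates(morae):
    """Return the diphthongized form first, then the hamzated variant if any."""
    common = arabize_morae(morae)
    hamzated = arabize_morae(morae, use_hamza=True)
    return [common] if hamzated == common else [common, hamzated]


if __name__ == "__main__":
    print(arabized_candidates(["sa", "i", "ta", "ma"]))  # Saitama
    print(arabized_candidates(["fu", "ku", "i"]))        # Fukui
```

For /sa-i-ta-ma/ the sketch yields the common سايتاما followed by the hamzated سائيتاما, and for /fu-ku-i/ it yields فوكوي followed by فوكوئي, matching forms (3) and (1) of Table 2. Listing the diphthongized form first mirrors the observation that it is the more frequent spelling; a fuller version would also need word-initial vowels, long vowels and word-final /e/.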
7 Arabic Orthographic Variants

The number of personal names and their variants in the world is in the billions. Identifying names and their variants (named entity recognition) is a hot topic in computational linguistics. To enhance this technology, we added a variant generation module (AVAN) to both the ARAN and NANA systems, which is supported by comprehensive databases of CJK proper nouns.

7.1 Romanization Variants

The many popular transcriptions of Arabic names result in a large number of variants. One reason for this is that several Arabic consonants, such as ع [ʔˁ], ض [dˁ], ط [tˁ] and ظ [ðˁ], do not exist in European languages. These sounds are difficult to pronounce and are rendered in different ways when romanized. Another factor is the bewildering variety of ways in which Arabic vowels are transcribed, partially due to dialectal influences. For example, the أ /'u/ in أسامة is transcribed in various ways as seen in Usama, Ousama, Osama and Oosama, while معمر \mEmr\ is spelled as Moammar, Muammar, Mu'ammar, Mo'ammar, Moamer, Moamar, and others.

7.2 Arabic Variants

Both Arab and foreign names have orthographic variants in Arabic. These are of two kinds:

1. Orthographic variants are non-standard ways to spell a specific variant of a name, like ابو ظبي instead of أبو ظبي for Abu Dhabi, in which the hamza is omitted.
2. Orthographic errors are frequently occurring, systematic spelling mistakes, like yaa' in ابو ظبي (Abu Dhabi) being replaced by 'alif maqSuura in ابو ظبى.

Table 4 shows examples of variants ("V") and errors ("E"). Though the difference between these cannot be rigorously defined, they are both of frequent occurrence based on statistical and linguistic analysis of MSA orthography. It should also be noted that "standard form," though linguistically correct, is not necessarily the most common form (we are gathering statistics for the occurrence of each form).

There are often many more variants than those shown above. For example, Alexandria can be written in about a dozen ways, the most frequent ones according to Google being الاسكندرية with 2,930,000, الإسكندرية with 690,000, and الاسكندريه with 89,200 occurrences respectively.

8 System Modules and Future Work

The principal components of ARAN (some of which are in progress) are briefly described below.

1. The Automatic Transcriber of Arabic Names (ATAN) is ARAN's core module for generating phonemic and popular transcriptions of Arabic personal names. Because of the inconsistent nature of the various popular Arabic romanization systems, there are often many, sometimes dozens or even hundreds, of romanizations for the same name. ATAN supports most of the commonly used systems, and has a flexible architecture that enables the user to configure the system to support user-defined systems. For example, شولوخ, which is first transliterated to \$wlwx\ by the AXAN module, can then be transcribed as /shwlwkh/ in the ALA-LC system, as /šūlūḫ/ in the DIN system, as Shoulokh as a possible English spelling, etc. The AVAN module can then be used to return many popular variants.
2. The Automatic Transliterator of Arabic Names (AXAN) generates transliterations of
Arabic names or any other Arabic text. There are few strict transliteration systems that use unique symbols for each letter and allow for round-trip conversion. The excellent and widely used Buckwalter transliteration system is not only supported by AXAN, but is also used for internal processing in all ARAN databases and algorithms. AXAN can be configured to support other transliteration systems, including Cyrillization, by adding custom mapping tables.
3. The Automatic Phoneticizer of Arabic Names (APAN) generates phonetic transcriptions of Arabic names in IPA. This represents the actual pronunciation in MSA, including distinctions between the major allophones. For example, the name قابوس Qaboos is transcribed as [qɑːbuːs]. Note that the phonemic transcription /qaabuus/ only indicates the vowel length (/aa/), whereas the phonetic transcription also indicates the quality of the vowel ([ɑː]), distinguishing it from its more common realization [ɛ̈ː]. APAN can be configured to transcribe in various MSA flavors. This refers to regional variations in MSA pronunciation, not to Arabic dialects per se. For example, for جمال /jamaal/ APAN generates [dʒɛ̈mɛ̈ːl] for Gulf MSA, [gɛ̈mɛ̈ːl] for Egyptian MSA, and [ʒɛmɛ̈ːl] for Levantine MSA.
4. The Automatic Generator of Variants for Arabic Names (AVAN) supports the ARAN and NANA systems by generating a large number of variants and variant candidates both algorithmically and by retrieving from hardcoded databases, whose occurrences are then validated in Arabic corpora and the web. See Section 7 for details.
5. The Automatic Diacriticizer of Arabic Names (ADAN) automatically diacriticizes, or adds vowels and other diacritics (like fatha and shadda) to unvocalized or semi-vocalized Arabic. For example, محمد \mHmd\ and الرياض \AlryAD\ are converted to the vocalized مُحَمَّد and الرِّيَاض respectively. This is related to, but distinct from, the equally difficult task of phonemic transcription. See Table 1 for examples.
6. There are dozens of non-Arabic languages that are or have been written in the Arabic script, referred to as Arabic Script Based Languages (ASBL). The most important of these are Farsi (official language of Iran), Pashto (western Pakistan and official language of Afghanistan), Dari (Afghan dialect of Farsi, official language of Afghanistan), Urdu (official language of Pakistan) and Kurdish (Turkey, Iraq, Iran, Syria, Armenia, Lebanon). Others include Shahmukhi (Pakistani version of Punjabi), Kashmiri (India and Pakistan), and Uyghur (northwest China). ARAN will eventually be expanded to (1) romanize to/from the major ASBL languages, (2) automatically identify the language, (3) automatically detect legacy encodings and convert to Unicode.

9 Conclusion

As we have seen, the high level of ambiguity in the Arabic script makes it challenging to build automatic transcription systems that produce reliable results. In particular, we have seen the difficulties in arabizing CJK names due to the lack of standards and to the major phonological differences between the languages. We have also seen how important linguistic knowledge is in such areas as Japanese-to-Arabic transcription, resulting in a very high accuracy rate. Since Arabic transcription is playing an increasingly important role in a variety of practical applications, it is necessary to pursue efforts to develop more language-specific transcription systems based on linguistic knowledge.

References

Nasreen AbdulJaleel and Leah S. Larkey. 2003. Statistical transliteration for English-Arabic Cross Language Information Retrieval. CIKM 2003: 139-146.

M. Arbabi, S.M. Fischthal, V.C. Cheng and E. Bart. 1994. Algorithms for Arabic name transliteration. IBM J. Res. Develop., 38(2).

Moustafa Elshafei, Husni Al-Muhtaseb, Mansour Al-Ghamdi. 2006. Machine Generation of Arabic Diacritical Marks. MLMTA 2006: 128-133.

Ali Farghaly. 2004. Computer Processing of Arabic Script-based Languages: Current State and Future Directions. COLING 2004.

Martin Kay. 2004. Arabic Script-based Languages deserve to be studied linguistically. COLING 2004.

D. Vergyri and K. Kirchhoff. 2004. Automatic Diacritization of Arabic for Acoustic Modeling in Speech Recognition. COLING Workshop on Arabic-script Based Languages, Geneva, Switzerland, 2004.