
Int J Speech Technol (2006) 9: 133–150
DOI 10.1007/s10772-008-9009-1

Arabic speech recognition using SPHINX engine


Hussein Hyassat · Raed Abu Zitar

Received: 1 October 2008 / Accepted: 9 October 2008 / Published online: 28 October 2008
© Springer Science+Business Media, LLC 2008

Abstract Although the Arab world has an estimated 250 million Arabic speakers, there has been little research on Arabic speech recognition compared to other languages of similar importance (e.g. Mandarin). Due to the lack of diacritized Arabic text and the lack of a Pronunciation Dictionary (PD), most previous work on Arabic Automatic Speech Recognition has concentrated on developing recognizers using Romanized characters, i.e. the system recognizes an Arabic word as if it were an English one and then maps it to the Arabic word through a lookup table pairing each Arabic word with its Romanized pronunciation. In this work we introduce the first SPHINX-IV-based Arabic recognizer and propose an automatic toolkit capable of producing a PD for both the Holy Quraan and the standard Arabic language. Three corpora were completely developed in this work, namely the Holy Quraan Corpus HQC-1 (about 18.5 hours), the command and control corpus CAC-1 (about 1.5 hours) and the Arabic digits corpus ADC (less than one hour of speech). The building process is completely described, and fully diacritized Arabic transcriptions were developed for all three corpora. The SPHINX-IV engine was customized and trained for both the language model and the lexicon modules shown in the framework architecture block diagram. Using the three corpora, the PD produced by our automatic tool and the transcripts, the SPHINX-IV engine was trained and tuned in order to develop three acoustic models, one for each corpus. Training is based on an HMM model built on statistical information and random-variable distributions extracted from the training data itself. A new algorithm is proposed to add unlabeled data to the training corpus in order to increase the corpus size; it is based on a neural-network confidence scorer, which is used to annotate the decoded speech and to decide whether a proposed transcript is accepted and can be added to the seed corpus. The model parameters were fine-tuned using a simulated annealing algorithm; optimum values were tested and reported. Our major contribution is the use of the open source SPHINX-IV model in Arabic speech recognition by building our own language and acoustic models without Romanization of the Arabic speech. The system is fine-tuned and the data are refined for training and validation. Optimum values for the number of Gaussian mixture distributions and the number of states in the HMMs were found according to specified performance measures, and optimum values for the confidence scores were found for the training data. Although much more work needs to be done, we consider the corpora used in our system sufficient to validate our approach. SPHINX has never been used before in this manner for Arabic speech recognition. The work is an invitation for all open source speech recognition developers and groups to take over and capitalize on what we have started.

Keywords SPHINX engine · Pronunciation Dictionary · Diacritic Arabic

H. Hyassat
Arab Academy of Business and Financial Sciences, Amman, Jordan

R. Abu Zitar (✉)
School of Computing and Engineering, New York Institute of Technology, Amman, Jordan
e-mail: rzitar@nyit.edu

1 Introduction

Large Vocabulary Continuous Speech Recognizers (LVCSR) are commercially available from different vendors. Along with this increased availability comes the demand for recognizers in many different languages that have often not been the focus of speech recognition research; so far, Arabic is one of these languages. With the increasing role of computers in our lives, there is a desire to communicate with them naturally, and speech processing by computer provides one vehicle for natural communication between man and machine. Interactive networks provide easy access to a wealth of information and services that will fundamentally affect how people work, play and conduct their daily affairs.

The average citizen needs to communicate with these networks using natural communication skills and everyday devices, such as fixed or mobile telephones and televisions. Without fundamental advances in user-centered interfaces, a large portion of society will be prevented from participating in the information era, resulting in further stratification of society and a tragic loss of human potential. Automatic Speech Recognition (ASR) is one of these interfaces; it has witnessed enormous progress over the last decade, and recognition can now be done reliably on large vocabularies, on continuous speech, and speaker-independently. The word error rate of these recognizers under special conditions is often below 10 percent (Pallet et al. 1999), while for general-purpose Large Vocabulary Continuous Speech Recognition the best word error rates for English were as high as 23.9% (Rosti 2004; Hain et al. 2003).

With an estimated 250 million native speakers, Arabic is the sixth most widely spoken language in the world, but research on ASR for Arabic is very limited compared to other languages of similar importance, such as Mandarin (Kirchhoff et al. 2002).

Most previous work on Arabic ASR aims at developing recognizers for either Modern Standard Arabic (MSA) or Egyptian Colloquial Arabic (ECA). Some Word Error Rate (WER) results obtained for both MSA and ECA are shown in Table 1. From Table 1 we see that the performance of Arabic ASR for ECA is very poor compared to ASR for other languages such as English; this result is another motivation for this research.

Most previous work on Arabic ASR trained the system using one of two formats: the Romanized format, or standard Arabic script without a Romanized transcript. Arabic ASR has concentrated on developing recognizers either for Modern Standard Arabic (MSA), a formal linguistic standard used throughout the Arabic-speaking world and employed in the media (e.g. broadcast news), lectures, courtrooms, etc., or for colloquial Arabic.
Table 1 WER (%) obtained for both MSA and ECA

Arabic language type   Year    Word error rate (WER)   Reference
MSA                    1997    15–20%                  Billa et al. (2002a, 2002b)
ECA                    96/97   61–56%                  Kirchhoff et al. (2002), Zavagliakos et al. (1998)
ECA                    2002    55.1–54.9%              Kirchhoff et al. (2002)
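The WER figures in Table 1, and in the evaluation summaries later in the paper (Figs. 7–10), follow the standard definition: the minimum number of substitutions, insertions and deletions needed to turn the reference transcript into the hypothesis, divided by the reference length. The following is a minimal illustrative implementation, not the NIST/SPHINX scoring tool itself:

```python
# WER = (substitutions + insertions + deletions) / reference length,
# computed from a Levenshtein alignment of the two word sequences.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits turning ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution in a five-word reference gives 20% WER:
print(wer("w1 w2 w3 w4 w5", "w1 wX w3 w4 w5"))  # 0.2
```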

The SPHINX-IV engine is customized in this research. SPHINX-IV is an open source speech recognition engine built for research purposes by the speech research group at Carnegie Mellon University (CMU) (CMU SPHINX Open Source Speech Recognition Engines 2007; Huang et al. 2003). Many theses around the world have tackled SPHINX for speech recognition (Rosti 2004; Nedel 2004; Doh 2000; Ohshima 1993; Raj 2000; Huerta 2000; Rozzi 1991; Liu 1994; Gouva 1996; Seltzer 2000; Siegler 1999), but not for Arabic. Reasons for selecting this engine will be presented later. The SPHINX-IV architecture consists of a series of blocks, each representing a series of processes independent of each other, as will be shown in the next sections.

2 Review of speech recognition engines

In this section some of the well-known speech recognition engines are reviewed: SPHINX, the Hidden Markov Model Toolkit (HTK) and the Center for Spoken Language Understanding Toolkit (CSLU).

2.1 SPHINX engine

SPHINX is a large-vocabulary, speaker-independent, Hidden Markov Model (HMM)-based continuous speech recognition system. SPHINX was developed at CMU in 1988 (Russell et al. 1995; Christensen 1996; Rabiner and Juang 1993) and was one of the first systems to demonstrate the feasibility of accurate, speaker-independent, large-vocabulary continuous speech recognition. SPHINX-II (Russell et al. 1995) was one of the first systems to employ semi-continuous HMMs.

SPHINX is a collection of several ASR systems; it was created in collaboration between the SPHINX group at CMU, Sun Microsystems Laboratories, Mitsubishi Electric Research Labs (MERL), and Hewlett Packard (HP), with contributions from the University of California at Santa Cruz (UCSC) and the Massachusetts Institute of Technology (MIT). The current working engines of SPHINX are SPHINX I, II, III, IV and Pocket SPHINX. In addition to these engines SPHINX has one trainer, which is capable of producing acoustic models usable in all SPHINX versions except SPHINX-I. Every SPHINX engine has its own characteristics and usage.

2.2 Hidden Markov Model Toolkit (HTK)

S. Young presented a framework for the HTK toolkit (Hermansky 1990), stating that the Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models. HTK is primarily used for speech recognition research, although it has been used for numerous other applications. It was originally developed at the Machine Intelligence Laboratory (formerly known as the Speech Vision and Robotics Group) of the Cambridge University Engineering Department (CUED), where it has been used to build large vocabulary speech recognition systems. It consists of a set of library modules and a set of more than 20 tools. An HTK-based recognizer was included in both the ARPA September 1992 Resource Management Evaluation and the November 1993 Wall Street Journal CSR Evaluation, where in both cases performance was comparable with the systems developed by the main ARPA contractors.

In 1999 the current version of HTK was V2.2 and all rights to HTK rested with Entropic. At this time Entropic's major business focus was voice-enabling the Web, and Microsoft purchased Entropic in November 1999. Recently Microsoft decided to make the core HTK toolkit available again and licensed the software back for research and academic usage, so that it could be distributed and developed for these purposes.

2.3 Hybrid systems

A. Ganapathiraju et al. described the use of a powerful machine learning scheme, Support Vector Machines

(SVM) (Lee et al. 1990), within the framework of Hidden Markov Model (HMM) based speech recognition. They developed a hybrid SVM/HMM system based on their public-domain toolkit. The hybrid system was evaluated on the OGI Alpha-digits corpus and performs at 11.6% WER, compared to 12.7% for a triphone mixture-Gaussian HMM system, while using only a fifth of the training data used by the triphone system. Several important issues that arise out of the nature of SVM classifiers were addressed.

3 Arabic language speech recognition research

Katrin Kirchhoff et al. worked on a project at the 2002 Johns Hopkins Summer Workshop (Kirchhoff et al. 2002) which focused on the recognition of dialectal Arabic. Three problems were addressed:

1. The lack of short vowels and other pronunciation information in Arabic texts.
2. The morphological complexity of Arabic.
3. The discrepancies between dialectal and formal Arabic.

They used the only standardized corpus of dialectal Arabic available at the time (2002), the LDC CallHome (CH) corpus of ECA. The corpus is accompanied by transcriptions in two formats: standard Arabic script without diacritics, and a Romanized version which is close to a phonemic transcription. An example of the Romanized form used in their experiments is shown in Table 2. They stated that Romanized Arabic is unnatural and difficult to read for native speakers; moreover, script-based recognizers (where acoustic models are trained on graphemes rather than phonemes) have performed well on Arabic ASR tasks in the past.

Table 2 ECA transliterated and Romanized sentence representations (Kirchhoff et al. 2002)

ECA
Transliterated script    AlHmd llh kwlsB w Antl Azlk
Romanized word forms     llHamdulillA kuwayyisaB wi inti izzayik

3.1 Automatic Romanizing Tool (ART)

When Katrin Kirchhoff et al. evaluated their system, WERs of 59.9% and 55.8% were obtained (evaluated against the script and Romanized transcriptions, respectively). They concluded that it would be advantageous to have a large amount of Romanized training data for the development of future Arabic ASR systems, and focused on building an Automatic Romanizing Tool (ART) rather than trying to explore the reasons behind these results.

In our opinion, the real reason for this result is that it is unfair to compare these two systems, as they are two totally different systems. The first was trained on standard Arabic script without diacritics, while the other was trained using a Romanized transcription that includes vowel information; the former thus fools the system by hiding important information, such as the short vowels, that is present in the latter. For this reason, because Romanized Arabic is unnatural and difficult for native speakers to read, and because of the failure to use out-of-corpus data that has proved successful in other languages (according to Katrin Kirchhoff et al.), we think that research on Arabic ASR should be done on original, fully or partially diacritized Arabic corpora rather than Romanized ones, and that an APDT should be developed rather than an ART. As Sir Thomas Elliot stated: "If physicians be angry, that I have written physic in English, let them remember that the Greeks wrote in Greek, the Romans in Latin, Avicenna and the other in Arabic, which were their own and proper maternal tongues" (CMU SPHINX trainer 2008).

Modular recurrent Elman neural networks (MRENN) have been implemented for Arabic isolated-word speech recognition (Young 1994). The Elman network, a special kind of recurrent network originally developed for speech recognition, is a two-layer network in which the hidden layer is recurrent: the inputs to the hidden layer are the present inputs together with the outputs of the hidden layer saved from the previous time step in buffers called context units. This work duplicates, for Arabic, previous work done for English (Ganapathiraju et al. 2000), which described a novel method of using recurrent neural networks (RNN) for isolated word recognition: each word in the target vocabulary is modeled by a fully connected recurrent network, and to recognize an input utterance the best matching word is determined based on its temporal output response. The system is trained in two stages. First, the RNN speech models are trained independently to capture the essential static and temporal characteristics of individual words. This is performed

by using an iterative re-segmentation training algorithm which automatically gives the optimal phonetic segmentation for each training utterance. The second stage involves mutually discriminative training among the RNN speech models, aiming at minimizing the probability of misclassification. M. M. El Choubassi et al. used a separate Elman network for each word in the vocabulary set; although they obtained promising results (87% accuracy) for their very small isolated-word system, this approach is only suitable for very small isolated vocabularies and not for LVCSR, where memory and performance problems will be faced.
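To make the architecture described above concrete, the following is a minimal sketch of one Elman time step. The dimensions, the tanh nonlinearity and the random dummy inputs are illustrative assumptions, not details taken from the cited systems:

```python
# One Elman-network time step: the hidden layer receives the current
# input together with "context units" holding the hidden activations
# from the previous time step; the outputs are per-word scores.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 13, 32, 10            # e.g. 13 features in, 10 words out
W_xh = rng.normal(scale=0.1, size=(n_hid, n_in))
W_ch = rng.normal(scale=0.1, size=(n_hid, n_hid))   # context -> hidden
W_hy = rng.normal(scale=0.1, size=(n_out, n_hid))

def elman_step(x, context):
    """One frame: new hidden state from current input plus saved context."""
    h = np.tanh(W_xh @ x + W_ch @ context)
    y = W_hy @ h                            # this frame's per-word scores
    return y, h                             # h becomes the next context

context = np.zeros(n_hid)
for x in rng.normal(size=(50, n_in)):       # 50 dummy feature frames
    y, context = elman_step(x, context)
```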
In Baugh and Cable (1978), a method for automatically segmenting Holy Quraanic Arabic is presented; a linguistic segmentation estimate was used when the recognizer failed to provide a strong classification. In El Choubassi et al. (2003), the authors presented work on data recording, transcription and speech recognition for Egyptian colloquial Arabic; their work is based on a Romanized transcription of Arabic. They stated that a Romanized grapheme-to-phoneme tool for Standard Arabic (collected in Tunisia and Palestine) had already been developed at CMU, and they used Egyptian CallHome data and pronunciation dictionaries from the LDC.

4 Pronunciation Dictionaries

A Pronunciation Dictionary (PD) is a human- or machine-generated table of possible words and their permissible phonetic sequences, i.e. their pronunciations. Since there are many possible sequences of phonetic units that do not make up actual words, the lexical model prevents many phonetic sequences from being explored during the recognition process (Lee et al. 1998).

Often the creation of a PD is not a trivial task. It can be created manually by a human expert in the modeled language, but especially with large-vocabulary recognizers, which deal with tens of thousands of words, this approach can be very expensive and time-consuming, and is therefore often not a feasible option. So the process has to be automated, at least in part. There has been a good deal of research on developing PDs for ASR; in this section we review some of it.

4.1 Automatic generation of a Pronunciation Dictionary

In Essa (1998) it is stated that it has been confirmed that an appropriate PD constructed by hand or by a rule-based system improves recognition performance, but such dictionaries require time and expertise to construct. Since the grapheme-to-phoneme relation is not direct, the researchers proposed a method for automatically generating a PD for the Japanese language, based on a pronunciation neural network able to predict plausible pronunciations from the canonical pronunciation. They use a multilayer perceptron to predict alternative pronunciations from the canonical pronunciations based on the maximum output of the network.

In Schultz (2002), the authors use a statistical procedure to determine the phonetic realization of phoneme base forms (expected pronunciations). Taking account of lexical stress and word boundary information, they generated statistics for phonemes in word base forms from a phonetically labeled speech corpus. The estimates derived from this corpus were then used to generate pronunciation networks from base forms in the DARPA Resource Management (RM) task. A significant improvement in recognition accuracy was obtained on the RM task using the pronunciation networks thus derived, relative to the base form pronunciations.

In Fukada et al. (1999), the authors described a procedure for generating the phoneme sequence automatically using their general-purpose phonetic front end. In order to generate a pronunciation string for each word, a neural network first assigns a score for each of 39 phonemes to each 6 ms frame in the word; a Viterbi search then finds the best-scoring sequence. They found that the difference between systems using networks derived from hand labels and those using machine labels is not significant, which is why they recommend the use of automatically generated PDs.

In Fukada et al. (1999), the authors stated that several approaches have been adopted over the years for grapheme-to-phone conversion for European Portuguese: hand-derived rules, neural networks, classification and regression trees, etc. Their first approach to grapheme-to-phone conversion was a rule-based system with about 200 rules; later, this rule-based approach was compared with a neural net approach. Finally they described the development of a grapheme-to-phone conversion module based on Weighted Finite

State Transducers. They investigated both knowledge-based and data-driven approaches.

4.2 Arabic speech sounds and properties

Arabic is a Semitic language and one of the oldest languages in the world today; it is the fifth most widely used language nowadays. Arabic letters are used in several languages, such as Persian and Urdu (Hiyassat et al. 2005). Arabic linguistics came into being in the eighth century with the beginning of the expansion of Islam. This early start can be explained in terms of the tremendous need felt by the members of the new community to know the language of the Holy Quraan, which had become the official language of the young Islamic state (Al-Zabibi 1990). Arabic linguists exerted a huge effort explaining linguistic rules and Arabic grammar; however, this effort did not continue, especially into the information era (Alghamdi et al. 2004).

The relative regularity of the syntax presents some advantages for its formalization. In addition, the Arabic language has the following characteristic: from one root the derivational and inflectional systems are able to produce a large number of words, or lexical forms, each of which has specific patterns and semantics. In a certain sense, the Arabic language seems better suited for computers than English or French (Hadj-Salah 1983).

Contemporary Standard Arabic, a modernized version of classical Arabic, is the language commonly in use in all Arab-speaking lands today. It is the language of science and learning, of literature and the theater, and of the press, radio and television. Notwithstanding the unanimous acceptability of Contemporary Standard Arabic and its general adoption as the common medium of communication throughout the Arab world, it is not the everyday speech of the people (Alghamdi et al. 2004).

4.3 Grapheme based Pronunciation Dictionary for Arabic

Grapheme-to-phoneme conversion is an important prerequisite for many applications involving speech synthesis and recognition (Lee et al. 1998). For ASR this process is important in developing the PD, which, as mentioned earlier, is normally hand-crafted. In this section a thorough description of a grapheme-based PD is presented: first the importance of the PD for ASR is described, then the rules of the Arabic phonological system are presented, followed by a description of orthographic-to-phonetic transcription. The section concludes with a description of the generation of a PD for both MSA and the Holy Quraan.

Large Vocabulary Continuous Speech Recognizers (LVCSR) are commercially available from different vendors, and with this increased availability comes the demand for recognizers in many different languages that have so far received little attention in speech recognition research. It is estimated that today as many as four to six thousand different languages exist (Alghamdi 2001). Therefore, increased thought has recently been given to creating methods for automating the design of speech recognition systems for new languages while making use of the knowledge gathered from already studied languages.

One of the core components of a speech recognition system is the PD. Its main purpose is to map the orthographic representation of a word to its pronunciation; the search space of the recognizer is the PD (Andersen et al. 1996). The performance of a recognition system depends on the choice of subunits and the accuracy of the PD. An accurate mapping of the orthographic representation of a word onto a subunit sequence is important to ensure recognition quality; otherwise the acoustic models are trained with the wrong data, or during decoding the calculation of the scores for a hypothesis is falsified by applying the wrong models (Schultz 2002; Schultz et al. 2004).

The PD lists the most likely pronunciation, or citation form, of all words contained in the speech corpus. Producing the pronunciations for a corpus can range from very simple and achievable with automatic procedures to very complex and time-consuming (Fukada et al. 1999).

The creation of a PD is not a trivial task, as mentioned earlier, and the process has to be at least partly automated. With sufficient knowledge of the target language, one can try to build a set of rules that map the orthography of a word to its pronunciation. For some languages this might work very well; for others it might be almost impossible. Arabic is an example of a language with a very close grapheme-to-phoneme relation (Hadj-Salah 1983). Thus comparatively few rules suffice to build a PD containing the canonical information.
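To illustrate how far a handful of rules goes for fully diacritized Arabic, the following toy converter maps a few letters one-to-one and lets the diacritics supply the short vowels. It is a deliberately simplified sketch with placeholder phone names, not the APDT developed in this work:

```python
# Toy grapheme-to-phoneme conversion for fully diacritized Arabic:
# most letters map one-to-one; fatha/kasra/damma insert short vowels,
# sukun inserts nothing, and shadda doubles the preceding consonant.
LETTERS = {"ب": "B", "ت": "T", "د": "D", "ر": "R",
           "س": "S", "ك": "K", "ل": "L", "م": "M", "ن": "N"}
DIACRITICS = {"\u064E": "AE",   # fatha  -> short /a/
              "\u0650": "IH",   # kasra  -> short /i/
              "\u064F": "UH",   # damma  -> short /u/
              "\u0652": ""}     # sukun  -> no vowel follows

def g2p(word: str) -> list[str]:
    phones, prev = [], None
    for ch in word:
        if ch in LETTERS:
            phones.append(LETTERS[ch])
            prev = LETTERS[ch]
        elif ch == "\u0651" and prev:        # shadda: geminate consonant
            phones.append(prev)
        elif ch in DIACRITICS and DIACRITICS[ch]:
            phones.append(DIACRITICS[ch])
    return phones

# e.g. the CVCVCV word كَتَبَ ("he wrote"):
print(g2p("كَتَبَ"))   # ['K', 'AE', 'T', 'AE', 'B', 'AE']
```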

4.4 Automatic versus Hand Crafted Pronunciation Dictionaries

Recognition quality is maintained by maintaining the quality of the PD, that is, how well it maps the orthography of a word to the way it is pronounced by the speakers. The best dictionaries, such as CMUdict (the pronunciation dictionary created by Carnegie Mellon University), are usually hand-crafted (Fukada et al. 1999; Killer et al. 2003). However, manually created dictionaries require an expert in the target language (Killer et al. 2004), which is a time-consuming and costly approach, especially for large-vocabulary speech recognition. If no language expert knowledge is available or affordable, methods are needed to automate the PD creation process. Several different methods have been introduced over time; most of them are based on converting the orthographic transcription to a phonetic one, using either rule-based (Killer et al. 2003) or statistical approaches (Killer et al. 2004).

In order to reduce both the cost and the time required to develop LVCSR systems, the problem of creating the PD must be solved. In the following sections the development of an automatic PD tool for the Standard Arabic language is described.

4.5 Segmenting Arabic utterance

The first basic rule that operates in the phonological system of Arabic, without exception, is that the number of syllables in an utterance is equal to the number of vowels. The issue, then, is not the number of syllables in an utterance, since this is automatic, but rather the boundaries, which are signaled by zero, one or two consonants (Alghamdi et al. 2004).

The second basic rule of Arabic phonology is that the onset of the syllable equals the beginning of an utterance; thus, both begin with a single consonant. In the example given by Alghamdi et al. (2004) the first phoneme is the consonant /n/ and the second is a short vowel.

The third rule is that the coda of the syllable is identical with the end of an utterance, coinciding with the codas of the six syllable types previously discussed. Accordingly, syllables in Arabic can be either open or closed, i.e. they can end in one or two consonants, respectively.

Clearly, then, one should use the three rules just stated to begin the process of segmentation in Arabic. When properly applied, these rules enable one to segment almost any utterance in Arabic correctly and easily, for they make the division between the coda and the onset of nearly all contiguous syllables clear-cut.
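The three segmentation rules above translate almost directly into code: there is one syllable per vowel, each syllable takes the single consonant immediately before its vowel as onset, and any remaining consonants close the preceding syllable. A sketch over a phonemic string follows; the phone inventory is an illustrative assumption:

```python
# Syllabify a phone sequence using the three rules stated above.
VOWELS = {"AE", "IH", "UH", "AA", "II", "UU"}   # short + long vowels (assumed)

def syllabify(phones):
    # One syllable per vowel (rule 1); the single consonant right before
    # each vowel is its onset (rule 2); everything between that onset and
    # the next onset is the coda, 0-2 consonants (rule 3).
    vowel_idx = [i for i, p in enumerate(phones) if p in VOWELS]
    sylls = []
    for k, v in enumerate(vowel_idx):
        onset = v - 1
        end = vowel_idx[k + 1] - 1 if k + 1 < len(vowel_idx) else len(phones)
        sylls.append(phones[onset:end])
    return sylls

print(syllabify("M AE K T AE B".split()))
# [['M', 'AE', 'K'], ['T', 'AE', 'B']] -- two vowels, two closed syllables
```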
4.6 Orthographic to phonetic transcription

Converting Arabic script into phonetic transcription by rule is one of the major obstacles facing researchers on Arabic text-to-speech systems and speech recognition. Although Arabic is one of the oldest languages whose sounds and phonological rules were extensively studied and documented (more than 12 centuries ago) (Alghamdi et al. 2004), these valuable studies need to be compiled from the scattered literature and formulated in a modern mathematical framework. The aim of this section is to formulate the grapheme-to-phoneme relationship for Arabic.

Arabic is an algorithmic language, at least from the phonology, writing and derivation points of view. For example, no law can explain the pronunciation of "g" in the English words laugh, through, good and geography, while Arabic has a direct grapheme-to-phoneme mapping for most graphemes. In general, Arabic text with diacritics is pronounced as it is written, using certain rules. Contrary to English, Arabic does not have words with different orthographic forms and the same pronunciation.

There are sixteen essential rules in orthographic-to-phonetic transcription (Al-Zabibi 1990; Hadj-Salah 1983).¹ These rules are:

1. The sokon sign ( ) is not the symbol of any phoneme; its meaning is that the consonant is followed by another consonant without an intermediate vowel (i.e. whether or not it is present, it does not affect the pronunciation of the consonant itself), so ( ) is pronounced as written, without introducing any vowels.
2. The alef that follows the group waw, as in ( ), is not pronounced.
3. Pharyngealization (emphasis): there are pharyngealized consonants in standard Arabic, where the consonant is stressed when pronounced. An example is the word ( ); the sign here acts as stress when the ( ) is pronounced ( ).

¹ URL: http://www.phonetik.uni-muenchen.de/Forschung/BITS/TP1/Cookbook/node145.html (2006).
4. The pronunciation of alef ( ) depends entirely on the characters that follow it:
   a. It is not pronounced if followed by two consonants, as in ( ) (in the school), which is pronounced ( ).
   b. It is pronounced if it is part of the laam of the definite article, as in ( ), which is pronounced ( ).
   c. It is pronounced as the short vowel ( ) if it is the first letter of a verb whose third character carries the short vowel ( ), as in ( ); in this case it is pronounced ( ).
   d. If none of the above rules apply, the alef is pronounced as the short vowel ( ), as in ( ), which is pronounced ( ).
5. The alef almaqsorah ( ): its predecessor is always the short vowel ( ); as in ( ), it is pronounced ( ).
6. The feminine Taa ( ), as in ( ), is used in Arabic at the end of a noun to change its gender from masculine to feminine. If the word containing the feminine Taa is the last word in the sentence, the Taa is pronounced as Haa ( ); otherwise it is pronounced ( ).
7. The letter laam ( ) in ( ) is the laam of the definite article, prefixed to nouns and added to the structure of the word. There are two types of laam: the moon laam, as in Alqamar ( ), pronounced /a l q a m a r/, and the sun laam, as in Alshams ( ), pronounced /ashshams/, where the laam is not pronounced. The laam in ( ) is either pronounced or assimilated depending on the successor character, as shown in Table 3.

Table 3 Pronunciation rules for laam ( )

Pronunciation rule                                         Successor letter
Moon laam ( ): pronounced if followed by these letters     ( )
Sun laam ( ): assimilated if followed by these letters     ( )

8. The Hamza (glottal stop) ( ): the Hamza is pronounced when it comes after a pause or at the beginning of an utterance, but it is not pronounced in all other cases, as shown in Table 4.

Table 4 Pronunciation examples

9. When two successive words meet such that the last character of the first word and the first character of the second word are both unvowelized, the general rule is that the short vowel /i/ is introduced after the last character of the first word.
10. The pause: an utterance in Arabic is never terminated by a short vowel; this means that the short vowel of the last word of the sentence is not pronounced.
11. The three Tanween double diacritic signs ( , , ) are pronounced as N ( ); thus ( ) is pronounced ( ) when it is not utterance-final, or pronounced ( ) otherwise.
12. The lengthening alef ( ), as in ( ), is pronounced ( ).
13. If the predecessor of the vowel waw ( ) is the short vowel ( ), as in ( ), then it is pronounced ( ).
14. If the predecessor of the vowel Yaa ( ) is the short vowel ( ), as in ( ), then the Yaa is pronounced ( ).
15. If there are three successive consonants, as in ( ), then a short vowel is introduced, as in ( ).
16. The laam ( ) is always pronounced with tarqeeq (lightening), except in the name of Allah ( ), where it is pronounced with tafkheem (emphasis) if it comes at the beginning of the utterance or if its predecessor is one of the two short vowels ( , ). In the examples ( ) it is pronounced ( ) respectively.

4.7 Generation of Pronunciation Dictionaries

Pronunciation dictionaries can be built by splitting a word into its graphemes, which are used as subunits: the word ( ) would simply be followed by its fragmentation into subunits (graphemes), as in ( ). This is a very easy and fast approach to producing pronunciation dictionaries. The questions that arise are how well graphemes are suited as subunits, to what extent they are inferior to

phonemes or whether they perform comparably well, how to cluster graphemes into polygraphemes, and how to generate the questions to build up the decision tree.

Apart from dialectal Arabic, there are two kinds of pronunciation for the Arabic language: the MSA pronunciation and the Holy Quraan pronunciation (Baugh and Cable 1978). The standard Arabic pronunciation is governed by the rules mentioned earlier in Sect. 4.5, while the Holy Quraan pronunciation is governed by what are called Tajweed rules, which will be described in Sect. 5.3. The proposed PD deals with both of these pronunciations.

5 Experimental environment

In this section the development of the Arabic corpora and the baseline system used in experimenting with the developed PD is presented, namely the Holy Quraan Corpus (HQC-1), the Command And Control Corpus (CAC-1) and the Arabic Digits Corpus (ADC). The focus of our research is on developing these corpora in order to facilitate testing the PD already developed. Selecting the SPHINX-IV engine, which is built on an open architecture, makes the results we present independent of the specific recognition engine used. The particular aspects of the speech databases are presented to give the reader useful context for interpreting our results and to provide other researchers with enough information to repeat and validate our experiments.

5.1 Arabic Corpus and Baseline System

Most of the research done on SPHINX-IV used the Wall Street Journal corpora (s3-94, s0-94) and/or the Resource Management (RM) corpus (Rosti 2004; CMU SPHINX Open Source Speech Recognition Engines 2007; Huang et al. 2003; Nedel 2004; Ohshima 1993; CMU SPHINX trainer 2008; Young 1994; Hiyassat et al. 2005; Al-Zabibi 1990). These corpora have also been used for other Latin-script languages such as French or Italian, due to similarities between these languages and English from a phonemic point of view.

Unfortunately, there are great differences between English and Arabic from a phonemic point of view, due to the existence of special phonemes such as Dhad ( ), Dha ( ), Tah ( ), aeen ( ), ghaeen ( ), haa ( ), ssad ( ), KHaa ( ) and Qaaf ( ). Some researchers have already used English corpora for Arabic speech recognition purposes, but most of these approaches did not offer good performance (according to Kirchhoff et al. (2002), the WER obtained was 59.9% for Romanized Arabic, which is not comparable to English ASR WERs). To this effect, we decided to build a pure formal Arabic corpus to be used in testing our algorithm; this corpus may serve as a benchmark for future research.

In building a corpus for any language, a certain domain should be selected and a domain-dependent transcription obtained. Recording this transcription is done using different speakers in a sound-isolated booth, sampled at different sampling rates (Rosti 2004; Huang et al. 2003; Raj 2000; Alghamdi et al. 2004; Killer et al. 2003). Of course, such tasks are exhaustive in both time and cost and beyond an individual's capabilities; usually they are carried out by bodies such as the Defense Advanced Research Projects Agency (DARPA), Johns Hopkins University (JHU), Carnegie Mellon University (CMU), the Hidden Markov Model Toolkit (HTK) group at Cambridge, or the Network for Euro-Mediterranean Language Resources (NEMLAR) (CMU SPHINX Open Source Speech Recognition Engines 2007; Huang et al. 2003; Fukada et al. 1999; Mimer et al. 2004; Black et al. 1998).

As mentioned earlier, the Arabic alphabet only contains letters for long vowels and consonants. Short vowels and other pronunciation phenomena, such as consonant doubling, can be indicated by diacritics (short strokes placed above or below the preceding consonant). However, Arabic texts are almost never fully diacritized and are thus potentially unsuitable for recognizer training, with the exception of the Holy Quraan and a few school textbooks. The Holy Quraan is considered the most important reference for the Arabic language.

5.2 Corpus design criteria

Developing a speech corpus is not a trivial task and needs resources to be allocated; unfortunately, such resources did not exist for this research, whereas in some projects a budget of hundreds of thousands of dollars is considered limited. Some researchers consider a size of 40 hours of broadcast news enough, given the high cost of the resources (Mimer et al. 2004).

5.3 The Holy Quraan Corpus HQC-1

The development of HQC-1 started by collecting recordings of different reciters. The audio files collected were of different formats, some MP3 and others wav; all of the collected files were converted using the Sound Forge sound-processing software (Black et al. 1998). The audio files have the following characteristics:

Format: mono .wav files
Sampling frequency: 16000 Hz; 16 bit

5.3.1 Filenames conventions

Each file has been given a unique file name. Filenames of the resources bear three ordered types of information:

a. Reciter name.
b. Sora (chapter) name.
c. Serial number identifying the audio file.

An example file name is Huthaifi-Isra-001.wav. The first part of the file name stands for the name of the reciter, in this case al-Huthaifi (one of the famous reciters of the Holy Quraan); the second part is the sora name, in this example the Isra sora (sora number seventeen of the Holy Quraan); and the last part is the serial number of the audio file with respect to both the reciter and the sora. The file extension (.wav) stands for the file format.

5.3.2 Directory structure

The directory of audio files was divided into subdirectories, one per reciter; the second level of subdirectories corresponds to the soras of the Holy Quraan. Example: HQC/Huthaifi/Isra/filename.wav.

5.4 Languages and character sets encodings

The language used in the transcription files is diacritized Arabic; the encoding used is Unicode (UTF-8) with line-feed-only line endings.

5.4.1 Tools/software used for marking silences

After choosing good quality recordings, each is split into sub-recordings of about 10 to 30 seconds each. The Sound Forge software is used to automatically detect silence in the original recording using the Auto Cue feature, which adds a single cue in the center of each detected silent region. The Threshold value sets the volume level for silence; in most cases the value should be −40 dB or higher, otherwise any background hiss, pops or clicks will be treated as non-silence and no silence will be marked at all. If no cue points appear, we try increasing this value to −30 dB or higher.

5.4.2 Silence length

The Silence length value determines how much silence is required before it is marked. Some recordings contain brief silences that one usually does not want marked; this value helps to avoid marking brief pauses within a wave file. Values between 1.0 and 1.5 seconds are used to ignore these brief silences and mark only the longer silences between recordings.

5.4.3 Splitting recordings

Recordings were split into small recordings of 10 to 30 seconds each, according to the cue point locations, and each resulting file is given a unique identification name. A control file containing the list of file names and paths for all recordings is created; the extension of this file is .fileids, for example an4_train.fileids, as shown in Fig. 1. The audio file extension is not given there; it is defined in the configuration file, since SPHINX-IV accepts different file formats such as wave, raw or NIST.

Fig. 1 Example of control file content and format
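The Auto Cue procedure described above can be approximated in a few lines: compute short-time energy, find regions that stay below the threshold for at least the minimum silence length, and place a cue at the centre of each region. The sketch below is an illustrative stand-in for Sound Forge, reusing the −40 dB threshold and 1.0 s minimum quoted above:

```python
# Energy-based silence detection with a cue at each silence centre.
import numpy as np

def find_cues(samples, sr, thresh_db=-40.0, min_sil=1.0, frame=0.02):
    """Return sample offsets at the centre of each detected silence."""
    n = int(sr * frame)
    frames = samples[: len(samples) // n * n].reshape(-1, n).astype(np.float64)
    rms = np.sqrt((frames ** 2).mean(axis=1)) + 1e-12
    level_db = 20 * np.log10(rms / (np.abs(samples).max() + 1e-12))
    silent = level_db < thresh_db
    cues, start = [], None
    for i, s in enumerate(np.append(silent, False)):  # False flushes last run
        if s and start is None:
            start = i
        elif not s and start is not None:
            if (i - start) * frame >= min_sil:         # long enough to be a pause
                cues.append(((start + i) // 2) * n)    # cue at silence centre
            start = None
    return cues

# Sub-recordings are then the pieces between consecutive cue points:
# segments = np.split(samples, find_cues(samples, 16000))
```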

5.4.4 Feature extraction

For every recording in the training corpus, a set of feature files is computed from the audio training data. Each recording can be transformed into a sequence of feature vectors using the front-end executable provided with the SPHINX-IV training package, as explained earlier.

The process starts with pre-emphasis, a high-pass filter applied to the signal first. Then a Hamming window is applied in order to slice the data into a number of overlapping windows, usually referred to as frames in the speech world. After the Hamming window, the FFT is applied to compute the discrete Fourier transform of the input sequence, analyzing the signal into its frequency components. A mel-frequency filter bank (MFFB) is applied to the output of the FFT; the output is an array of filtered values, typically called the mel spectrum, each value corresponding to the result of filtering the input spectrum through an individual filter, so the length of the output array equals the number of filters. To obtain the MFCCs, the DCT is applied; the mean of the MFCCs is then computed and subtracted in cepstral mean normalization (CMN), which reduces the distortion caused by the transmission channel. Finally, the first and second derivatives of the normalized cepstra are computed in order to model the speech signal dynamics (all as explained earlier).
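The front-end chain just described (pre-emphasis, Hamming windowing, FFT, mel filter bank, DCT, CMN and derivatives) condenses into the following numpy sketch. Frame and filter-bank parameters are common defaults assumed for illustration, not values taken from the SPHINX-IV configuration:

```python
import numpy as np

def mfcc(signal, sr=16000, n_filt=40, n_ceps=13, flen=400, fstep=160):
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])  # pre-emphasis
    nf = 1 + (len(signal) - flen) // fstep
    idx = np.arange(flen) + fstep * np.arange(nf)[:, None]
    frames = signal[idx] * np.hamming(flen)                  # windowed frames
    spec = np.abs(np.fft.rfft(frames, n=512)) ** 2           # power spectrum
    # Triangular mel filter bank spanning 0 .. sr/2
    mel = np.linspace(0, 2595 * np.log10(1 + sr / 2 / 700), n_filt + 2)
    hz = 700 * (10 ** (mel / 2595) - 1)
    bins = np.floor((512 + 1) * hz / sr).astype(int)
    fbank = np.zeros((n_filt, 257))
    for m in range(1, n_filt + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(spec @ fbank.T + 1e-10)                  # mel spectrum, log
    # DCT-II decorrelates the log mel spectrum -> MFCCs
    n = np.arange(n_filt)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filt))
    ceps = logmel @ dct.T
    ceps -= ceps.mean(axis=0)                                # CMN
    d1 = np.gradient(ceps, axis=0)                           # delta
    d2 = np.gradient(d1, axis=0)                             # delta-delta
    return np.hstack([ceps, d1, d2])                         # 39-dim features
```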
5.4.5 Transcription file

Each recording is accurately transcribed; any error in the transcription will mislead the training process later. The transcription is done manually, that is, we listen to the recording and match exactly what we hear into text; even silence or noise should be represented in the transcription. The transcription is then followed by the file name without the path, as shown in Fig. 2; this maps the transcription to its corresponding recording.

The HQC-1 corpus consists of about 7742 recordings. These recordings were processed (down-sampled to 16 kHz and divided into small utterances) and then transcribed, resulting in a total of 59,428 words and 25,740 unique words, with about 18.35 hours of recording. It took a total of about 732 working hours to build this corpus.

Fig. 2 Transcription file contents

6 Pronunciation Dictionary creation

In order to create the PD, the APDT described earlier is invoked; the APDT needs a transcription file, on which it bases the PD it produces. Once the APDT is invoked, two files are created: one is the PD, and the other is a file containing the transcription with pronunciation alignment, so that each word in the transcription is mapped to its pronunciation in the PD file. The PD file has all acoustic events and words in the transcripts mapped onto the acoustic units we want to train. Redundancy in the form of extra words is permitted. The dictionary must have all alternate pronunciations marked with parenthesized serial numbers starting from (2) for the second pronunciation; the marker (1) is omitted. Each word in the dictionary is followed by its pronunciation, as shown in Fig. 3.

Fig. 3 Sample of pronunciation dictionary
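The dictionary layout just described, with the "(1)" marker omitted and alternates numbered from "(2)", is easy to emit mechanically. A sketch follows; the example word and its second, pausal-form pronunciation (cf. rule 10) are placeholders, not APDT output:

```python
# Write a SPHINX-style pronunciation dictionary with numbered alternates.
def write_dict(entries: dict[str, list[list[str]]], path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for word, prons in entries.items():
            for i, phones in enumerate(prons):
                marker = "" if i == 0 else f"({i + 1})"   # (1) is omitted
                f.write(f"{word}{marker} {' '.join(phones)}\n")

write_dict({"كَتَبَ": [["K", "AE", "T", "AE", "B", "AE"],
                       ["K", "AE", "T", "AE", "B"]]}, "example.dic")
# example.dic then contains:
#   كَتَبَ K AE T AE B AE
#   كَتَبَ(2) K AE T AE B
```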
6.1 Filler dictionary

The filler dictionary lists the non-speech events as words and maps them to user-defined phones. This dictionary must at least have the entries shown in Fig. 4.


Fig. 4 Filler dictionary minimum content

The entries stand for:

<s>   : beginning-utterance silence
<sil> : within-utterance silence
</s>  : end-utterance silence

Note that the words <s>, </s> and <sil> are treated as special words and are required to be present in the filler dictionary. At least one of these must be mapped onto a phone called SIL. The phone SIL is treated in a special manner and is required to be present: SPHINX-IV expects us to name the acoustic events corresponding to our general background condition SIL. For clean speech these events may actually be silences, but for noisy speech they may be the most general kind of background noise that prevails in the database. Other noises can then be modeled by phones we define ourselves.

6.2 Phone list

The phone list is a list of all acoustic units that we want to train models for. SPHINX-IV does not permit units other than those in our dictionaries, and all units in the two dictionaries must be listed here. In other words, the phone list must have exactly the same units used in the dictionaries, no more and no less. Each phone must be listed on a separate line in the file, beginning from the left, with no extra spaces after the phone; an example is shown in Fig. 5.

Fig. 5 Phones list used in training

With the transcription file, the PD, the filler dictionary, the control file and the features of the audio files created, the system can now be trained in order to test the PD accuracy.
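The "no more and no less" constraint on the phone list can be verified mechanically before training. A sketch, with placeholder file names:

```python
# Check that the phone list matches exactly the units used in the
# pronunciation and filler dictionaries.
def phones_in_dict(path):
    """Collect every acoustic unit used in a dictionary file."""
    units = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if fields:
                units.update(fields[1:])   # everything after the word itself
    return units

dict_units = phones_in_dict("example.dic") | phones_in_dict("example.filler")
with open("example.phone", encoding="utf-8") as f:
    listed = {line.strip() for line in f if line.strip()}
if listed != dict_units:
    raise SystemExit(f"phone list mismatch: {sorted(listed ^ dict_units)}")
```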
6.3 Command and Control Corpus (CAC-1)

The development of CAC-1 started by collecting recordings from different speakers. The recorded audio files have the following format:

Format: mono .wav files
Sampling frequency: 16000 Hz; 16 bit

6.3.1 Filenames conventions

Filenames of the resources contain the following information:

a. Speaker name.
b. Serial number identifying the audio file.

An example file name is Mohamad-001.wav: the first part stands for the speaker name and the second part is the serial number of the audio file. The file extension (.wav) stands for the file format.

6.3.2 Directory structure

The directory of audio files is divided into subdirectories, one per speaker. Example: CAC/Mohamad/filename.wav.

6.3.3 Languages and character sets encodings

The language used in the transcription files is diacritized Arabic; the encoding used is Unicode (UTF-8) with line-feed-only line endings.

6.3.4 Recording and processing

The CAC-1 corpus was recorded entirely for this research. CAC-1 consists of two disjoint sets of utterances: 5628 training utterances collected from 103 male and 74 female speakers, and 372 testing utterances from 15 male and 8 female speakers; details are shown in Table 5. The total length of the training utterances is about 4248 seconds. Different speakers are used for the training and the testing data. Both the training and the testing utterances in the CAC-1 database were recorded using a mono, single unidirectional microphone placed on the desktop nearby. The speech was originally recorded at 16000 Hz and manually transcribed, with the acoustic environment, speaker dialect and other conditions annotated.

The CAC-1 corpus is considered a small-vocabulary set (approximately 30 words in the lexicon); the utterances consist of command and control words, as shown in Table 6. The baseline system is trained using about 2 hours of speech from CAC-1, including all conditions together; about 10 minutes of evaluation data are used for the test in this research.

The pronunciation and filler dictionaries are developed in a similar way as for HQC-1. Since the CAC-1 corpus is a planned one, transcription was done easily: each speaker is asked to say exactly the same words in the same order, so only mapping of the recordings to the control file is needed. Of course, making sure that each recording reflects exactly its transcription is essential; otherwise we fool the system in the training phase.

Table 5 Details of speakers participating in the CAC-1 corpus

Gender   Total
Male     118
Female    82

Table 6 Words used in CAC-1 corpus

7 Arabic Digits Corpus (ADC)

The third corpus, ADC, was also developed entirely in this research. It is built for developing an Arabic digit recognition model for the digits zero, one, ..., nine. The ADC corpus was developed using recordings of 142 speakers, in exactly the same manner and the same environment as the CAC-1 corpus; Table 7 shows the details of those speakers. The ADC consists of two disjoint sets of utterances: 1213 training utterances collected from 73 male and 49 female speakers, and 143 testing utterances from 12 male and 8 female speakers. The total length of the training utterances is about 0.67 hours. The baseline system is trained using about 35 minutes of speech, including all conditions together; about 7 minutes of evaluation data are used for the test in this research.

Table 7 Details of speakers participating in the ADC corpus

Gender   Total
Male      85
Female    57

7.1 Training and evaluation

Once the model definition file is ready, training starts by initializing the model parameters and running the Baum-Welch algorithm described in the next sections.

7.2 Effect of number of Gaussians on the ADC

Tables 8 through 14 show the different performance measures for the ADC. From these tables it is easily noticeable that the best performance is obtained when the distributions are split into 8 Gaussians.

Table 8 Number of Gaussians versus word accuracy

Number of Gaussians   Accuracy
1      80.159
2      84.127
4      88.889
8      89.683
16     88.889
32     77.778
64     69.048
128    65.079
256    65.079

Table 9 Number of Gaussians versus number of errors

Number of Gaussians   Sub   Ins   Del
1      24    0     1
2      18    0     2
4      12    0     2
8       9    0     4
16      9    0     5
32      9    0    19
64      9    0    30
128     9    0    35
256     9    0    35
Table 10 Number of Gaussians versus WER

Number of Gaussians   WER
1      19.841
2      15.873
4      11.111
8      10.317
16     11.111
32     22.222
64     30.952
128    34.921
256    34.921

Table 11 Number of Gaussians versus word matches

Number of Gaussians   Words matched
1      101
2      106
4      112
8      113
16     112
32      98
64      87
128     82
256     82

Table 12 Number of Gaussians versus sentence matches

Number of Gaussians   Sentences matched   Sentence accuracy
1      101   80.159
2      106   84.127
4      112   88.889
8      113   89.683
16     112   88.889
32      98   77.778
64      87   69.048
128     82   65.079
256     82   65.079

Table 13 Number of Gaussians versus speed as a ratio of real-time audio

Number of Gaussians   Speed (× real time)
1      0.06
2      0.06
4      0.05
8      0.07
16     0.08
32     0.09
64     0.11
128    0.13
256    0.15

Table 14 Number of Gaussians versus average memory usage

Number of Gaussians   Average memory used (Mb)
1       7.4
2       7.53
4       8.44
8       9.4
16     11.39
32     16.08
64     25.17
128    43.13
256    79.02
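Mixture sizes such as those in Tables 8–14 are commonly reached by repeatedly splitting every Gaussian in two (1 → 2 → 4 → 8 ...), perturbing the means, halving the weights and then re-estimating. The following sketch shows one such split step; the 0.2-standard-deviation perturbation is a common heuristic assumed here, not a value taken from the SPHINX trainer:

```python
# One binary split step for a diagonal-covariance Gaussian mixture.
import numpy as np

def split_mixture(weights, means, variances, eps=0.2):
    """Double the number of Gaussians by perturbing each mean."""
    offset = eps * np.sqrt(variances)
    new_means = np.concatenate([means + offset, means - offset])
    new_vars = np.concatenate([variances, variances])
    new_weights = np.concatenate([weights, weights]) / 2.0
    return new_weights, new_means, new_vars

w, mu, var = np.ones(1), np.zeros((1, 13)), np.ones((1, 13))
for _ in range(3):                     # 1 -> 2 -> 4 -> 8 Gaussians per state
    w, mu, var = split_mixture(w, mu, var)
print(len(w))                          # 8
```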

7.3 HQC-1 overall likelihood of training

SPHINX-IV provides a tool to calculate the per-frame training likelihood and the overall training likelihood; the overall value is obtained simply by summing the per-frame likelihoods and dividing by the number of frames. It is found that as the number of Gaussian densities increases, the overall likelihood increases too, as shown in Fig. 6.

Fig. 6 Overall likelihood versus number of Gaussian densities, with five states per HMM
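The quantity plotted in Fig. 6, total training log-likelihood divided by the number of frames, is what the forward algorithm yields once per-frame emission scores are available. A log-domain sketch with dummy inputs follows (in the real system the emission scores come from the trained Gaussian mixtures):

```python
# Per-frame average log-likelihood of an HMM via the forward algorithm.
import numpy as np
from scipy.special import logsumexp

def per_frame_loglik(log_pi, log_A, log_B):
    """log_pi: (S,) initial, log_A: (S,S) transitions,
    log_B: (T,S) emission log-probs; returns total loglik / T."""
    T, S = log_B.shape
    alpha = log_pi + log_B[0]
    for t in range(1, T):
        # alpha[j] = logsum_i(alpha[i] + log_A[i, j]) + log_B[t, j]
        alpha = logsumexp(alpha[:, None] + log_A, axis=0) + log_B[t]
    return logsumexp(alpha) / T

S, T = 5, 120                              # five states per HMM, 120 frames
rng = np.random.default_rng(1)
log_pi = np.log(np.full(S, 1.0 / S))       # uniform dummy initial probs
log_A = np.log(np.full((S, S), 1.0 / S))   # uniform dummy transitions
log_B = rng.normal(size=(T, S))            # stand-in emission scores
print(per_frame_loglik(log_pi, log_A, log_B))
```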

8 Summary and conclusions

In this research we have used SPHINX-IV for Arabic speech recognition, built speech recognition resources for Arabic, built new tools suitable for Arabic recognition that did not originally exist in SPHINX-IV, such as the APDT and linguistic questions, and investigated fine-tuning the SPHINX-IV parameters for this purpose. In this section we present a summary of this research and the relevant observations drawn from our investigation in training, fine-tuning and testing the three acoustic models (HQC-1, CAC-1 and ADC) based on our proposed dictionary. Some comments on future research directions and unresolved questions are presented too, and the section closes with a final summary and conclusions.

What is most unique about this research is the APDT algorithm we developed and tested. Three Arabic corpora, namely HQC-1, CAC-1 and ADC, were created to provide an acceptable level of training and testing of our system. The recognition performance obtained using these corpora and the dictionary produced by our APDT for Arabic is very successful. To the best of our knowledge, neither this tool nor the HQC-1 corpus existed prior to this research. The SPHINX-IV parameters were tuned using a global search algorithm, the training data was extended with the help of a neural network, and a features-based (not Romanized) system for Arabic speech recognition built on SPHINX-IV technology was finally obtained. Our system could be the basis for any future open source research on Arabic speech recognition, and we intend to keep it open for the research community.

An automatic toolkit for generating a PD was fully developed and tested in this work. This toolkit is a rule-based pronunciation tool. A PD (HUSDICT60) was produced for both formal Arabic and the Holy Quraan; HUSDICT60 contains 59,424 words, and this dictionary will be made freely available. Three corpora were developed entirely by the authors of this work. For the Holy Quraan Corpus (HQC-1), about 7,742 recordings were processed and then transcribed, which resulted in a total of 59,428 words and 25,740 unique words and about 18.35 hours of recording; this process consumed about 432 working hours. Note that one research effort at Carnegie Mellon University (CMU) used about 1,400 hours of speech for training a single system (CMU SPHINX Open Source Speech Recognition Engines 2007). Results are shown in Figs. 7 and 8.

The CAC-1 corpus consists of two disjoint sets of utterances: 5628 training utterances collected from 103 male and 74 female native Arabic speakers, and 372 testing utterances from 15 male and 8 female speakers. The CAC-1 corpus is considered a small-vocabulary set (approximately 30 words in the lexicon); final results for this corpus are shown in Fig. 9.

The ADC corpus was developed using recordings of 142 native Arabic speakers. This corpus is concerned with developing an Arabic digit recognition model for the digits zero, one, ..., nine. The ADC consists of two disjoint sets of utterances: 1213 training utterances collected from 73 male and 49 female speakers, and 143 testing utterances from 12 male and 8 female speakers. The total length of the training utterances is about 2431 seconds (Fig. 10).

From the results of the research throughout this work, many suggestions for future work are recommended. One major weakness of conventional HMMs is that they do not provide an adequate representation of the temporal structure of speech.

Fig. 7 Test summaries for the HQC-1

Holy Quraan corpus

[java] Accuracy: 70.813%  Errors: 750 (Sub: 467  Ins: 276  Del: 7)
[java] Words: 1624  Matches: 1150  WER: 46.182%
[java] Sentences: 273  Matches: 57  SentenceAcc: 20.879%
[java] This Time Audio: 7.62s  Proc: 5.68s  Speed: 0.75 X real time
[java] Total Time Audio: 2205.66s  Proc: 1638.93s  Speed: 0.74 X real time

Fig. 8 Output of the performance test for HQC-1

Command and Control Corpus

Accuracy: 98.182%  Errors: 1 (Sub: 1  Ins: 0  Del: 0)
Words: 55  Matches: 54  WER: 1.818%
Sentences: 55  Matches: 54  Sentence Acc: 98.182%
Total Time Audio: 53.29s  Proc: 13.24s  Speed: 0.25 X real time
Mem Total: 126.62 Mb  Free: 101.30 Mb
Used: This: 25.33 Mb  Avg: 20.14 Mb  Max: 25.49 Mb

Fig. 9 Test summaries for the CAC-1



Performance test of the digits corpus

Accuracy: 99.213%  Errors: 1 (Sub: 0  Ins: 0  Del: 1)
Words: 127  Matches: 126  WER: 0.787%
Sentences: 127  Matches: 126  SentenceAcc: 99.213%
This Time Audio: 1.39s  Proc: 0.09s  Speed: 0.07 X real time
Total Time Audio: 143.24s  Proc: 9.91s  Speed: 0.07 X real time
Mem Total: 126.62 Mb  Free: 114.17 Mb
Used: This: 12.46 Mb  Avg: 12.76 Mb  Max: 18.44 Mb

Fig. 10 Test summaries for the ADC

pancy decreases exponentially with time. This issue Billa, J., et al. (2002a). Arabic speech and text in Tides On Tap.
is a promising area to investigate and many issues on In Proceedings of HLT, 2002.
Billa, J., et al. (2002b). Audio indexing of broadcast news. In
the HMM modeling and temporal structuring can be Proceedings of ICASSP, 2002.
studied. Another issue is the training of the HMM, al- Black, A., Lenzo, K., & Pagel, V. (1998). Issues in building gen-
though Ant Colony Optimization is a stochastic and eral letter to sound rules. In Proceedings of the ESCA work-
discrete optimization algorithm, we believe that this shop on speech synthesis, Australia (p. 7780) 1998.
Christensen, H. (1996). Speaker adaptation of hidden Markov
could be a promising algorithm if adapted for training models using maximum likelihood linear regression. Ph.D.
speech recognition models or at least can be used in Thesis, Institute of Electronic Systems Department of
optimizing the training process. Final words, it should Communication Technology, Aalborg University.
be noticed that most Arabic texts are almost never CMU SPHINX Open Source Speech Recognition Engines.
URL:http://www.speech.cs.cmu.edu/ (2007).
fully diacritical, and are thus potentially unsuitable for CMU SPHINX trainer Open Source Speech Recognition En-
recognizer training except the Holly Quraan and few gines, URL: http//:www.cmusphinx.org/trainer (2008).
other Text Books and some religion old books. In addi- Doh, S.-J. (2000). Enhancements to transformation-based
tion to that, the existence of electronic versions of such speaker adaptation: principal component and inter-class
maximum likelihood linear regression. Ph.D. Thesis,
text is not always available. There should be an Ara- Department of Electrical and Computer Engineering,
bic effort to create diacritical corpus for both speech Carnegie Mellon University.
recognition and text to speech research. During this re- El Choubassi, M. M., El Khoury, H. E., Jabra Alagha, C. E.,
search, about 200,000 unique diacritic words are col- Skaf, J. A., & Al-Alaoui, M. A. (2003). Arabic speech
recognition using recurrent neural networks. Electrical and
lected and are now available on our free website cor- Computer Engineering Department, Faculty of Engineer-
pus as mentioned earlier. ing and ArchitectureAmerican University of Beirut.
Essa, O. (1998). Using prosody in automatic segmentation of
speech. In Proceedings of the ACM 36th annual south east
conference (pp. 4449). Apr. 1998.
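The collection step itself is straightforward to reproduce. The sketch below scans a diacritized text file and keeps the unique word types that actually carry diacritic marks (U+064B through U+0652); the input file name is a hypothetical placeholder.

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

/** Counts unique word types that carry Arabic diacritics. */
public class DiacritizedWordCollector {
    static boolean hasDiacritics(String w) {
        for (char c : w.toCharArray())
            if (c >= '\u064B' && c <= '\u0652') return true; // tanween..sukun range
        return false;
    }

    public static void main(String[] args) throws Exception {
        Set<String> types = new HashSet<>();
        for (String line : Files.readAllLines(
                Paths.get("quraan_text.txt"), StandardCharsets.UTF_8))
            for (String w : line.trim().split("\\s+"))
                if (!w.isEmpty() && hasDiacritics(w)) types.add(w);
        System.out.println("Unique diacritized word types: " + types.size());
    }
}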
References

Al-Zabibi, M. (1990). An acoustic-phonetic approach in automatic Arabic speech recognition. The British Library in Association with UMI.
Alghamdi, M. (2001). Arabic phonetics. Riyadh: Altawbah Printing.
Alghamdi, M., Al-Muhtaseb, H., & Elshafei, M. (2004). Arabic phonological rules. Journal of King Saud University: Computer Sciences and Information, 16, 1–25 (in Arabic).
Andersen, O., Kuhn, R., et al. (1996). Comparison of two tree-structured approaches for grapheme-to-phoneme conversion. In ICSLP 96 (Vol. 3, pp. 1700–1703), Oct. 1996.
Baugh, A. C., & Cable, T. (1978). A history of the English language. Oxon: Redwood Burn Ltd.
Billa, J., et al. (2002a). Arabic speech and text in Tides On Tap. In Proceedings of HLT, 2002.
Billa, J., et al. (2002b). Audio indexing of broadcast news. In Proceedings of ICASSP, 2002.
Black, A., Lenzo, K., & Pagel, V. (1998). Issues in building general letter to sound rules. In Proceedings of the ESCA workshop on speech synthesis, Australia (pp. 77–80), 1998.
Christensen, H. (1996). Speaker adaptation of hidden Markov models using maximum likelihood linear regression. Ph.D. Thesis, Institute of Electronic Systems, Department of Communication Technology, Aalborg University.
CMU SPHINX Open Source Speech Recognition Engines. URL: http://www.speech.cs.cmu.edu/ (2007).
CMU SPHINX trainer Open Source Speech Recognition Engines. URL: http://www.cmusphinx.org/trainer (2008).
Doh, S.-J. (2000). Enhancements to transformation-based speaker adaptation: principal component and inter-class maximum likelihood linear regression. Ph.D. Thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University.
El Choubassi, M. M., El Khoury, H. E., Jabra Alagha, C. E., Skaf, J. A., & Al-Alaoui, M. A. (2003). Arabic speech recognition using recurrent neural networks. Electrical and Computer Engineering Department, Faculty of Engineering and Architecture, American University of Beirut.
Essa, O. (1998). Using prosody in automatic segmentation of speech. In Proceedings of the ACM 36th annual southeast conference (pp. 44–49), Apr. 1998.
Fukada, T., Yoshimura, T., & Sagisaka, Y. (1999). Automatic generation of multiple pronunciations based on neural networks. Speech Communication, 27, 63–73.
Ganapathiraju, A., Hamaker, J., & Picone, J. (2000). Hybrid SVM/HMM architectures for speech recognition. In Proceedings of the international conference on spoken language processing (Vol. 4, pp. 504–507), November 2000.
Gouvêa, E. B. (1996). Acoustic-feature-based frequency warping for speaker normalization. Ph.D. Thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University.
Hadj-Salah, A. (1983). A description of the characteristics of the Arabic language. In Applied Arabic linguistics, signal & information processing, Rabat, Morocco, 26 September–5 October 1983.
Hain, T., et al. (2003). Automatic transcription of conversational telephone speech: development of the CU-HTK 2002 system (Technical Report CUED/F-INFENG/TR.465). Cambridge University Engineering Department. Available at http://mi.eng.cam.ac.uk/reports/.
Hermansky, H. (1990). Perceptual linear predictive (PLP) analysis of speech. Journal of the Acoustical Society of America, 87, 1738–1752.
Hiyassat, H., Nedhal, Y., & Asem, E. (2005). Automatic speech recognition system requirement using Z notation. In Proceedings of AMSE 05, Roan, France, 2005.
Huang, X., Alleva, F., Wuen, H., Hwang, M.-Y., & Rosenfeld, R. (2003). The SPHINX-II speech recognition system: an overview. School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213.
Huerta, J. M. (2000). Robust speech recognition in GSM codec environments. Ph.D. Thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University.
Killer, M., Stüker, S., & Schultz, T. (2003). Grapheme based speech recognition. In Eurospeech, Geneva, Switzerland, September 2003.
Killer, M., Stüker, S., & Schultz, T. (2004). A grapheme based speech recognition system for Russian. In SPECOM 2004: 9th conference on speech and computer, St. Petersburg, Russia, September 20–22.
Kirchhoff, K., Bilmes, J., Das, S., Duta, N., Egan, M., Ji, G., He, F., Henderson, J., Liu, D., Noamany, M., Schone, P., Schwartz, R., & Vergyri, D. (2002). Novel approaches to Arabic speech recognition. The 2002 Johns-Hopkins summer workshop.
Lee, K., Hon, H., & Reddy, R. (1990). An overview of the SPHINX speech recognition system. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-28(1), 35–45.
Lee, T., Ching, P. C., & Chan, L. W. (1998). Isolated word recognition using modular recurrent neural networks. Pattern Recognition, 31(6), 751–760.
Liu, F.-H. (1994). Environmental adaptation for robust speech recognition. Ph.D. Thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA.
Mimer, B., Stüker, S., & Schultz, T. (2004). Flexible decision trees for grapheme based speech recognition. In Proceedings of the 15th conference Elektronische Sprachsignalverarbeitung (ESSV), Cottbus, Germany, 2004.
Nedel, J. P. (2004). Duration normalization for robust recognition of spontaneous speech via missing feature methods. Ph.D. Thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University.
Ohshima, Y. (1993). Environmental robustness in speech recognition using physiologically-motivated signal processing. Ph.D. Thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University.
Pallet, D. S., et al. (1999). 1998 broadcast news benchmark test results. In Proceedings of the DARPA broadcast news workshop, Herndon, Virginia, February 28–March 3, 1999.
Rabiner, L. R., & Juang, B.-H. (1993). Fundamentals of speech recognition. Englewood Cliffs: Prentice-Hall.
Raj, B. (2000). Reconstruction of incomplete spectrograms for robust speech recognition. Ph.D. Thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University.
Rosti, A.-V. I. (2004). Linear Gaussian models for speech recognition. Ph.D. Thesis, Wolfson College, University of Cambridge.
Rozzi, W. A. (1991). Speaker adaptation in continuous speech recognition via estimation of correlated mean vectors. Ph.D. Thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University.
Russell, S., Binder, J., Koller, D., & Kanazawa, K. (1995). Local learning in probabilistic networks with hidden variables. In IJCAI, 1995.
Schultz, T. (2002). Globalphone: a multilingual speech and text database developed at Karlsruhe University. In Proceedings of ICSLP, Denver, CO, 2002.
Schultz, T., Alexander, D., Black, A., Peterson, K., Suebvisai, S., & Waibel, A. (2004). A Thai speech translation system for medical dialogs. In Proceedings of the human language technologies (HLT), Boston, MA, May 2004.
Seltzer, M. L. (2000). Automatic detection of corrupt spectrographic features for robust speech recognition. Master's Thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University.
Siegler, M. A. (1999). Integration of continuous speech recognition and information retrieval for mutually optimal performance. Ph.D. Thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University.
Young, S. J. (1994). The HTK hidden Markov model toolkit: design and philosophy (CUED/F-INFENG/TR.152). Engineering Department, University of Cambridge.
Zavagliakos, G., et al. (1998). The BBN Byblos 1997 large vocabulary conversational speech recognition system. In Proceedings of ICASSP, 1998.
