
Available online at www.sciencedirect.com

Speech Communication 56 (2014) 181-194

www.elsevier.com/locate/specom

Using different acoustic, lexical and language modeling units for ASR of an under-resourced language: Amharic

Martha Yifiru Tachbelie a,*, Solomon Teferra Abate a, Laurent Besacier b

a School of Information Sciences, Addis Ababa University, Addis Ababa, Ethiopia
b Laboratoire d'Informatique de Grenoble (LIG), Universite Joseph Fourier, Grenoble 1, France

Available online 14 February 2013

* Corresponding author. Tel.: +251 923518241.
E-mail addresses: marthayifiru@gmail.com, marthayifiru@yahoo.com (M.Y. Tachbelie), solomon_teferra_7@yahoo.com (S.T. Abate), laurent.besacier@imag.fr (L. Besacier).
http://dx.doi.org/10.1016/j.specom.2013.01.008

Abstract

State-of-the-art large vocabulary continuous speech recognition systems mostly use phone based acoustic models (AMs) and word based lexical and language models. However, phone based AMs are not efficient at modeling long-term temporal dependencies, and the use of words in lexical and language models leads to the out-of-vocabulary (OOV) problem, which is a serious issue for morphologically rich languages. This paper presents the results of our contributions on the use of different units for acoustic, lexical and language modeling for an under-resourced language (Amharic, spoken in Ethiopia). Triphone, syllable and hybrid (syllable-phone) units have been investigated for acoustic modeling. Words and morphemes have been investigated for lexical and language modeling. We have also investigated the use of longer (syllable) acoustic units together with shorter (morpheme) lexical and language modeling units in a speech recognition system. Although hybrid AMs did not bring much improvement over context dependent syllable based recognizers in word based speech recognition (i.e., with word based lexical and language models), we observed a significant word error rate (WER) reduction compared to triphone-based systems in morpheme-based speech recognition. Syllable AMs also led to a WER reduction over the triphone-based systems in both word based and morpheme based speech recognition. A 3% absolute WER reduction was obtained as a result of using syllable acoustic units in morpheme-based speech recognition. Overall, our results show that syllable and hybrid AMs are best fitted for morpheme-based speech recognition.
© 2013 Elsevier B.V. All rights reserved.

Keywords: Syllable-based acoustic modeling; Hybrid (phone-syllable) acoustic modeling; Morpheme-based speech recognition; Under-resourced languages; Amharic

1. Introduction

Many languages, especially languages of developing countries, lack sufficient resources and tools required for the implementation of human language technologies. These languages are commonly referred to as under-resourced or low density languages (Besacier et al., 2006). The term under-resourced language, introduced by Berment (2004), refers to a language with some of the following aspects: lack of a unique writing system or stable orthography,

limited presence on the web, lack of linguistic expertise, and lack of electronic resources for natural language processing (NLP) such as monolingual corpora, bilingual electronic dictionaries, and transcribed speech data.
Natural language technologies for under-resourced languages are developed using small sets of data collected by researchers and, therefore, the performance of such systems is often inferior compared to systems for technologically favored languages. The problem is further aggravated if the language under study is also morphologically rich, as the number of out-of-vocabulary (OOV) words is usually large. Porting an NLP system (e.g. a speech recognition system) to such languages therefore requires techniques that go far beyond basic re-training of the models. That means methods that work best with the available resources


have to be identified. This paper presents an attempt made in this line. Since we question the usual units for ASR (acoustic modeling using phones and language modeling using words), we have tried to propose more efficient modeling strategies for our target language, Amharic. Moreover, in this paper we also report our experience with the use of crowdsourcing technology to transcribe speech data.
1.1. Modeling units

1.1.1. Acoustic modeling units

Speech recognition requires segmentation of speech data into fundamental acoustic units. The ideal and natural acoustic unit is the word. However, the use of words as acoustic units in large vocabulary continuous speech recognition (LVCSR) systems is impractical because of the need for very large data to train the models adequately. Thus, sub-word acoustic units such as phones or context dependent phones are generally used in LVCSR. Nowadays, the use of context dependent phones in acoustic modeling is common practice. Nevertheless, phone-based units are not efficient at modeling long-term temporal dependencies. As a result, many researchers (Ganapathiraju et al., 2001; Sethy, 2002; Hämäläinen, 2005; Abate and Menzel, 2007a; Thangarajan and Natarajan, 2008; Azim et al., 2008) investigated the use of syllables in acoustic modeling. Others, such as Liu et al. (2011), investigated the use of hybrid units (phone and syllable) in acoustic modeling.
1.1.2. Lexical and language modeling units

Most large vocabulary speech recognition systems operate with a finite vocabulary. All the words which are not in the system's vocabulary are considered OOV words. These words are one of the major sources of error in automatic speech recognition (ASR). When a speech recognition system is confronted with a word which is not in its vocabulary, it may recognize it as a phonetically similar in-vocabulary unit/item; that is, the OOV word is misrecognized. This in turn might cause its neighboring words to be misrecognized. Woodland et al. (1995) indicated that each OOV word in the test data contributes 1.6 errors on average. Therefore, different approaches have been investigated to cope with the OOV problem and consequently to reduce the error rate of automatic speech recognition systems. One of these approaches is vocabulary optimization (Bazzi, 2002), where the vocabulary is selected in a way that reduces the OOV rate. This involves either increasing the vocabulary size or including frequent words in the vocabulary. This approach may work for morphologically simple languages like English, where a 20k vocabulary has a 2% OOV rate and a 65k one has only a 0.6% OOV rate (Gales and Woodland, 2006).
However, for morphologically rich languages, for which OOV is a severe problem, a much larger vocabulary is required to reach a 1% OOV rate. Gales and Woodland (2006) indicated that for Russian and Arabic, 800k and 400k vocabularies, respectively, are required for a 1% OOV rate. Increasing the vocabulary to alleviate the OOV problem is not the best solution, especially for morphologically rich languages, as system complexity increases with the size of the vocabulary. Moreover, the approach is not appropriate for under-resourced languages that do not have the necessary amount of data. Therefore, modeling sub-word units, particularly morphemes, has been used for morphologically rich languages. Many researchers (Geutner, 1995; Carki et al., 2000; Ircing et al., 2001; Whittaker and Woodland, 2000; Whittaker et al., 2001; Siivola et al., 2003; Hirsimäki et al., 2005; Kirchhoff et al., 2002; Pellegrini and Lamel, 2009; Tachbelie et al., 2010) have carried out morpheme-based or sub-word based speech recognition experiments.
1.2. Outline of this paper

In this paper, we investigate speech recognition for Amharic (which is an under-resourced as well as morphologically rich language) using different units for acoustic, lexical and language modeling, with the aim of finding the combination that leads to the best performance with the available resources. Previously, the use of phone, consonant-vowel (CV) syllable, and hybrid (phone-syllable) units for acoustic modeling and of morphemes for lexical and language modeling had been investigated separately. To our knowledge, no attempt has been made to investigate the use of longer acoustic units and shorter lexical units together in a speech recognition system. Thus, this work deeply investigates the use of syllables and morphemes in a speech recognition system. Our work is also the first to investigate the use of context dependent CV syllables for Amharic.

The next sub-section describes the nature of the Amharic language. We also present a review of pertinent previous work on Amharic speech recognition in this section. Section 2 presents the speech and text data used for our experiments as well as some attempts at transcribing speech data using crowdsourcing. Section 3 reveals the best unit for Amharic acoustic modeling, while Section 4 is dedicated to morpheme-based lexical and language modeling. Finally, Section 5 concludes this work and gives some perspectives.
1.3. The Amharic language

Amharic is a member of the Ethio-Semitic languages, which belong to the Semitic branch of the Afroasiatic super family (Voigt, 1987). It is related to Hebrew, Arabic, and Syriac. Amharic, which is spoken mainly in Ethiopia, is the second most spoken Semitic language after Arabic. According to the 1998 census, it is spoken by over 17 million people as a first language and by over 5 million as a second language throughout different regions of Ethiopia. Currently, Amharic is the official working language of


the Federal Democratic Republic of Ethiopia and of several of the states within the federal system. The language is also spoken in other countries such as Egypt and Israel (Ethnologue, 2004).

1.3.1. Amharic phonetics

Amharic has a total of 38 phonemes (31 consonants and 7 vowels) (Leslau, 2000). On the basis of their manner of articulation, the consonants are classified into stops, fricatives, nasals, liquids and semivowels. Table 1 gives the inventory of Amharic consonants.

The phonetic transcription of the consonants b, d, f, g, h, k, l, m, n, p, r, s, t, w, y and z corresponds to that of the English consonants. Moreover, there are sounds which are the same or nearly the same as the English sounds, but are represented by special phonetic symbols: č as ch in church, š as sh in shoe, ǧ as j in joke, ž as s in pleasure and ñ as ni in onion. The pronunciation of the glottal stop ? corresponds to the English uh-uh used as a negation (Leslau, 2000). This consonant may or may not be pronounced. It may be pronounced when it occurs between two vowels (Appleyard, 1995).

The sounds called glottalized or ejectives (t', q, p', s', č') are peculiar to Amharic and have no correspondents in English. Each of these sounds has a non-glottalized counterpart from amongst the consonants (t, k, p, s, č) (Appleyard, 1995; Leslau, 2000).
Long consonants, also called geminated consonants, are clearly pronounced. The length of a consonant may affect the meaning of a word, as in wana 'swimming' and wanna 'main, principal'. Yimam (2007) states that all Amharic consonants can be geminated except ? and h.

Amharic has seven vowels (E, u, i, a, e, I, o). Fig. 1 illustrates the position of these vowels in articulation. Leslau (2000) pointed out that there is no precise correspondence between Amharic and English vowels. The vowel E is pronounced like er in bigger and the vowel u is pronounced like o in who. The vowels i and a are pronounced like ee in feet and a in father, respectively.
Fig. 1. Amharic vowels (adapted from Yimam (2007)).

The other vowels e, I and o are pronounced as a in state, as e in roses and as o in shore, respectively.
1.3.2. Amharic morphology

Like other Semitic languages such as Arabic, Amharic exhibits a root-pattern morphological phenomenon. A root is a set of consonants (called radicals) which has a basic lexical meaning. A pattern consists of a set of vowels which are inserted (intercalated) among the consonants of a root to form a stem. The pattern is combined with a particular prefix or suffix to create a single grammatical form (Bender et al., 1976) or another stem (Yimam, 2007). For example, the Amharic root sbr means 'break'. By intercalating the pattern E_E and attaching the suffix -E we get sEbbErE 'he broke', which is the first form of a verb (3rd person masculine singular in past tense, as in other Semitic languages) (Bender et al., 1976). In addition to this non-concatenative morphological feature, Amharic uses different affixes to create inflectional and derivational word forms.
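As an illustration of this root-and-pattern process, the following minimal Python sketch (our own, purely illustrative; the transliteration and the treatment of gemination as a doubled radical are simplifying assumptions) intercalates a vowel pattern into a consonantal root and attaches a suffix:

def intercalate(radicals, pattern, prefix="", suffix=""):
    """Insert pattern vowels among the radicals; '_' consumes one radical."""
    stack = list(radicals)
    stem = "".join(stack.pop(0) if ch == "_" else ch for ch in pattern)
    return prefix + stem + suffix

# The paper's example: root sbr 'break', pattern E_E (with the middle
# radical geminated, written here as a doubled consonant), suffix -E.
print(intercalate("sbbr", "_E__E_", suffix="E"))  # -> sEbbErE, 'he broke'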
Some adverbs can be derived from adjectives. Nouns are derived from other basic nouns, adjectives, stems, roots, and the infinitive form of a verb by affixation and intercalation. For example, from the noun lIg 'child' another noun lIgnEt 'childhood' can be derived; from the adjective dEg 'generous', the noun dEgnEt 'generosity'; from the stem snf, the noun snfna 'laziness'; from the root qld, the noun qEld 'joke'; and from the infinitive verb mEsbEr 'to break', the noun mEsbEriya 'an instrument used for breaking'.
Table 1
Amharic consonants (adapted from Leslau, 2000).

Manner of articulation         Labial   Dental   Palatal   Velar        Glottal
Stops        Voiceless         p        t        č         k            ?
             Voiced            b        d        ǧ         g
             Glottalized       p'       t'       č'        q
             Rounded                                       kw, gw, qw
Fricatives   Voiceless         f        s        š                      h
             Voiced                     z        ž
             Glottalized                s'
             Rounded                                                    hw
Nasals       Voiced            m        n        ñ
Liquids      Voiced                     l, r
Semivowels   Voiced            w                 y


Case, number, definiteness, and gender marking affixes inflect nouns.

Adjectives are derived from nouns, stems or verbal roots by adding a prefix or a suffix. For example, it is possible to derive dIngayama 'stony' from the noun dIngay 'stone'; zIngu 'forgetful' from the stem zIng; and sEnEf 'lazy' from the root snf by suffixation and intercalation. Adjectives can also be formed through compounding. For instance, hodEsEfi 'tolerant, patient' is derived by compounding the noun hod 'stomach' and the adjective sEfi 'wide'. Like nouns, adjectives are inflected for gender, number, and case (Yimam, 2007).
Unlike the other word categories such as nouns and adjectives, the derivation of verbs from other parts of speech is not common. The conversion of a root to a basic verb stem requires both intercalation and affixation. For instance, from the root gdl 'kill' we obtain the perfective verb stem gEddEl by intercalating the pattern E_E. From this perfective stem, it is possible to derive a passive stem (tEgEddEl-) and a causative stem (asgEddEl-) using the prefixes tE- and as-, respectively. Other verb forms are also derived from roots in a similar fashion. Verbs are inflected for person, gender, number, aspect, tense and mood (Yimam, 2007). Other elements like negative markers also inflect verbs in Amharic. In this work, only the concatenative morphemes are considered.
1.3.3. Amharic writing system

Amharic is written in its own script, known as fidEl. The Amharic script is syllabary, since each symbol represents a consonant combined with a vowel and the vowel has no independent existence (Leslau, 2000). In other words, each symbol in Amharic orthography represents a CV syllable.1 The writing system consists of 276 distinct symbols, 20 numerals and eight punctuation marks. There are 33 core consonants, each of which has seven shapes or orders according to the vowel combined with it, as shown in Table 2. This makes 231 (33 × 7) distinct symbols (CV syllables) out of the 276. The remaining symbols include labiovelars (20), labialized consonants (18) and the consonant v (which appears only in modern loan words like viza meaning 'visa') in its seven orders.
1.3.4. Amharic syllable structure

Most Amharic linguists (Yimam, 2007; Haile, 1995) agree that the syllable structure of Amharic is (C)V(C)(C), where C represents a consonant and V a vowel. That means the syllable types of Amharic are V, CV, CVC, VC, CVCC and VCC. Others (Seyoum, 2001) claim that the only Amharic syllable types are CV and CVC. However, CV syllables cover the large majority of the syllable distribution in Amharic (H/Mariam et al., 2004). Since the Amharic writing system is syllabary (representing approximately CV syllables) and since CV syllables cover the large majority of the syllable distribution, only CV syllables have been considered in the current investigation.
1 Except the sixth order character, which represents a consonant with or without the vowel I.

Table 2
Sample Amharic fidEl. [Script symbols not reproduced.]
1.4. Previous work on Amharic speech recognition

Research in automatic speech recognition for Amharic started in 2001, when Berhanu (2001) developed an isolated consonant-vowel syllable recognition system. Since then, several attempts have been made in academic research. At the beginning, the research was conducted using small data sets developed by the researchers themselves for their own purposes. The development of a medium size read speech corpus (Abate et al., 2005) facilitated research in the area. Although there are several attempts (Tadesse, 2002; Seifu, 2003; Tachbelie, 2003; Girmaw, 2004; Seid and Gambäck, 2005; Abate, 2006; Abate and Menzel, 2007a; Abate and Menzel, 2007b; Pellegrini and Lamel, 2006a, 2009; Tachbelie et al., 2009, 2010, 2011), in this section we review only the pertinent works that investigated units different from phones for acoustic modeling and different from words for lexical and language modeling.

For Amharic, the first experiment on the use of syllables in acoustic modeling is due to Abate and Menzel (2007a). As Amharic orthography has a more or less one to one correspondence with consonant-vowel (CV) syllabic sounds, they experimented with the use of CV syllables in acoustic modeling. Models with different HMM topologies were developed. A model with five states per HMM and with no skips was found to be the best in terms of accuracy. Compared to a triphone-based model, the context independent syllable based model performed slightly worse in terms of accuracy. However, the syllable based recognizers were found to be better in terms of recognition speed and storage requirements. Thus, they concluded that the use of CV syllables is a promising alternative in the development of automatic speech recognition for Amharic.
The application of automatic word decomposition (using the Harris algorithm) to Amharic speech recognition has been investigated by Pellegrini and Lamel (2006a). In their study, the units obtained through decomposition were used in both lexical and language models. They reported recognition results for four different configurations: full words and three decomposed forms (detaching both prefixes and suffixes, prefixes only, and suffixes only). A word error rate (WER) reduction over the baseline word-based system was reported using two hours of training data for all decomposed forms, although the level of improvement varies. The highest improvement (5.2% absolute WER reduction) was obtained with the system in which only the prefixes were detached. When both prefixes and suffixes are detached, the improvement in performance is smaller, namely 2.2%. As the authors note, this might be due to the limited span of the n-gram language models.
Decomposing lexical units with the same algorithm led to worse performance when more training data (35 h) was used (Pellegrini and Lamel, 2007). This can be explained by a higher acoustic confusability. Pellegrini and Lamel (2007, 2009) tried to solve this problem by using modified decomposition algorithms. Their starting algorithm was Morfessor (Creutz and Lagus, 2005), which they modified by adding different information. In Morfessor Baseline, the prior probability of getting N distinct morphs, p(Lexicon), is estimated on the basis of the frequency and length (character sequence probability) of the morphs. The first modification made by Pellegrini and Lamel (2007, 2009) affects the calculation of morph length. In Morfessor Baseline, character probabilities are static, calculated as a simple ratio of the number of occurrences of the character (irrespective of its place in words) to the total number of characters in the corpus. Inspired by the Harris algorithm, Pellegrini and Lamel (2007, 2009) made the calculation context sensitive: the probability that a word beginning (WB) is a morpheme is defined as the ratio of the number of distinct letters L(WB) which can follow WB to the total number of distinct letters L. The second modification is the addition of a phone-based feature in the calculation of p(Lexicon). The third modification is to avoid segmentation if it results in phonetically confusable morphemes: during the decomposition process, morphemes that differ from each other by only one syllable are compared, and if the pair of syllables is among the most frequently confused pairs (found in their previous study (Pellegrini and Lamel, 2006b)), the segmentation is forbidden. They were only able to achieve a word error rate reduction when the phonetic confusion constraint was used to block the decomposition of words that would result in acoustically confusable units.
Tachbelie et al. (2010) showed the effect of OOV words on the performance of an Amharic speech recognition system and investigated the use of sub-word units in lexical and language modeling with the aim of reducing the OOV rate and thereby the performance degradation it causes. Morfessor, a freely available, language independent, unsupervised morphology learning tool that tries to identify all the morphemes found in a given word, was used for morphological segmentation. The acoustic model used in the study was a collection of cross-word triphone HMMs.


Improvement in word recognition accuracy was observed when a relatively small vocabulary (5k) was used. However, as the vocabulary size increased, the performance of the morpheme-based system was no better than that of the word-based system, although the morpheme-based one has a very low OOV rate. As indicated by the authors, the main reason for the poor performance of the morpheme-based system is acoustic confusability.
In contrast to Pellegrini and Lamel (2006a, 2007, 2009) and Tachbelie et al. (2010), Tachbelie et al. (2009, 2011) used morphemes only for the language modeling component. They applied a lattice rescoring framework to avoid the influence of acoustic confusability on the performance of the speech recognizer. Lattices were generated in a single recognition pass using a bigram word-based language model and rescored using sub-word language models. Improvement in speech recognition performance was obtained. However, this method does not solve the OOV problem, since a word-based pronunciation dictionary is used.
The work presented in this paper differs from the previous work (Abate and Menzel, 2007a) in two ways. First, besides context independent syllable acoustic models, we experimented with tied state syllable modeling, which was recommended by Abate and Menzel (2007a) but had not been addressed so far for Amharic speech recognition. Second, a large (65k) dictionary has been used as the system vocabulary, and our test set includes 9.35% OOV words, i.e., words that are not found in the active vocabulary. Abate and Menzel (2007a) experimented with 5k and 20k vocabularies, and their test set did not include OOV words. Moreover, the use of hybrid units for Amharic acoustic modeling has not been addressed so far.

Unlike previous works that utilized morphemes in lexical and language modeling (Pellegrini and Lamel, 2006a, 2007, 2009; Tachbelie et al., 2009, 2010, 2011), the purpose of this work is to see the effect of using both longer acoustic units (syllables) and shorter lexical and language modeling units (morphemes) on the performance of a speech recognition system. In addition to the unsupervised morphological segmentation method used in the previous works, we have used a finite state based supervised segmentation method for the purpose of this study.
2. Data collection and corpora

This section gives a brief description of the speech and text corpora used for our investigation, of our experience with transcribing speech using crowdsourcing, namely Amazon's Mechanical Turk (MTurk), and of the pronunciation dictionaries used in our study and how we prepared them. Although we did not use the resulting crowdsourced transcriptions in the experiments reported in this paper, we summarize here our main results (published in Gelas et al. (2011)) on the use of crowdsourcing for under-resourced languages.


2.1. Speech and text corpus

The speech corpus used for the speech recognition experiments is a read speech corpus (Abate et al., 2005) developed at the University of Hamburg. The audio corpus was collected in the following manner: texts were first extracted from news websites and then segmented by sentence; recordings were then made by native speakers reading sentence by sentence, with the possibility to re-record whenever they considered that they had mispronounced. The corpus contains 20 hours of training speech collected from 100 speakers who read a total of 10,850 sentences (28,666 tokens). Compared to other speech corpora that contain hundreds of hours of training data, this corpus is obviously small in size, and accordingly the models will suffer from a lack of training data.

The corpus also includes four different test sets (5k and 20k, both for development and evaluation). However, for the purpose of the current investigation we have used the 5k development test set, which includes 360 sentences (4106 tokens or 2836 distinct words) read by 20 speakers.

The text corpus used for this study is the ATC_120k (Tachbelie, 2010). It consists of 120,262 sentences (2,348,150 tokens or 211,120 types). The ATC_120k corpus has been used to derive the vocabulary for the pronunciation dictionaries and to train the language models.

2.2. Crowdsourcing for transcription

Speech transcriptions are required for any research in speech recognition. However, the time and cost of manual speech transcription make the collection of transcribed speech data difficult in all languages of the world. Therefore, crowdsourcing is currently being investigated by different researchers. Amazon's Mechanical Turk (MTurk), an online marketplace for work, aims at outsourcing tasks that are difficult or impossible for computers, called Human Intelligence Tasks (HITs), to willing human workers (turkers) around the Web. Making use of this crowd is hoped to bring an important benefit over traditional solutions (employees or contractors).

Recently, MTurk has been investigated for its potential to reduce the cost of manual speech transcription. For example, Gruenstein et al. (2009) and McGraw et al. (2009) reported near-expert accuracy when using MTurk to correct the output of an automatic speech recognizer. Marge et al. (2010) combined multiple MTurk transcriptions to produce merged transcriptions that approached the accuracy of expert transcribers. Studies on English, including (Snow et al., 2008; McGraw et al., 2009), showed that MTurk can be used to cheaply create data for natural language processing applications. Most studies on the use of MTurk for speech transcription take English, one of the well-resourced languages, as their subject. MTurk is not yet widely studied as a means to acquire useful data for under-resourced languages, except for recent research on Korean, Hindi and Tamil (Novotney and Callison-Burch, 2010).
Although we already had transcriptions for the speech data described in Section 2.1, we investigated the usability of MTurk for speech transcription in Amharic. The goal of our attempt was to evaluate the quality of the data produced via crowdsourcing. Thus, in our use of MTurk we actually re-transcribed speech data for which we already had transcriptions.

For our transcription task, we selected 1183 audio files with an average length of 5.9 seconds. These files were published (one HIT per file) on MTurk with a payment rate of USD 0.05 per HIT. To discourage inept turkers, HIT descriptions and instructions were given in Amharic. For the transcriptions to be in Unicode encoding, we provided the address of an online Unicode based Amharic virtual keyboard.
With regard to completion rate, only 54% of the 1183 HITs were approved, and this after 73 days. In other words, very little data (only a few hundred sentences) was transcribed during a rather long period. We first carried out the approval process manually via the MTurk web interface and then experimented with different methods of automatic approval. Table 3 shows the proportion of approved and rejected HITs under both approval methods (manual and automatic).

2.3. Pronunciation dictionaries


The pronunciation dictionaries used in our study consist of the 65k most frequent words taken from the ATC_120k. The pronunciation dictionaries have not been developed by linguistic experts. Rather, they have been encoded by means of a simple procedure that takes advantage of the Amharic writing system, which is fairly close to the pronunciation in many cases. There are, however, notable differences, especially in the areas of gemination, insertion of the epenthetic vowel, and realization of the glottal stop. In the current investigation, gemination has not been dealt with. The epenthetic vowel has been added to all sixth order characters, although this is not always correct. For example, the word meaning 'king' has been transcribed as nIgusI (the first and the last consonants being in sixth order), while its correct transcription is nIgus.

With regard to the glottal stop, we have prepared and used two versions of the pronunciation dictionaries for the triphone-based systems: one with the glottal stop in the pronunciation and the other without it. For the syllable based systems, the pronunciation of each word has been segmented into consonant-vowel (CV) syllables. Thus, in order to have CV syllables consistently, the glottal stop consonant is realized in all cases. A modification of the syllable-based pronunciation dictionary has been used for the hybrid phone-syllable based recognizers.
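As a rough illustration of how such a syllable-level lexicon can be derived from a phone-level pronunciation, the sketch below (our own simplification, assuming single-character phone symbols and using '?' for the glottal stop, as in the text above) groups a phone sequence into CV units, inserts the glottal stop as onset for any vowel lacking one, and attaches the epenthetic vowel I to a trailing consonant:

VOWELS = set("EuiaeIo")  # the seven vowels in the paper's transliteration

def to_cv_syllables(phones):
    """Group a phone sequence into strictly CV units."""
    syllables, onset = [], None
    for p in phones:
        if p in VOWELS:
            # A vowel without an onset consonant gets the glottal stop '?'.
            syllables.append(("?" if onset is None else onset) + p)
            onset = None
        else:
            onset = p  # consonant waits for its vowel
    if onset is not None:
        # Trailing (sixth order) consonant: add the epenthetic vowel.
        syllables.append(onset + "I")
    return syllables

print(to_cv_syllables(list("nIgus")))  # -> ['nI', 'gu', 'sI']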



Table 3
Submitted HITs approval.

            No. of workers    No. of HITs
                              Manual    Automatic
Approved    12                589       584
Rejected    171               492       497
Total       177               1081      1081

Table 4
Content of rejected HITs.

Nature of rejected HITs       % of rejected HITs
Empty                         60.57
Non-sense                     20.33
Copy from instructions         5.70
Trying without knowledge      13.40

In the manual process, we rejected HITs containing empty transcriptions, copies of the instructions and descriptions from our HITs, non-sense text, and transcriptions made by people trying to transcribe without any knowledge of the language. Table 4 details the HITs that were rejected manually. Carrying out the approval process this way is time consuming for a large amount of data. Therefore, we conducted an experiment on automatic approval methods using all the submitted HITs. As can be seen from Table 3, we obtained results equivalent to those of the manual approval by rejecting HITs with: (1) empty or short (fewer than 4 words) transcriptions; (2) transcriptions not using the Amharic writing system, including copies of URLs; (3) transcriptions that contain bigrams of the instructions and descriptions from our HITs; and (4) transcriptions that are outside the distribution space set by Avg + 3*Stdv(log2(ppl1)), where ppl1 is the perplexity of each sentence (the transcription of an utterance) assigned by a language model developed on a different text.
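Step (4) can be sketched as follows, assuming the per-sentence perplexities (ppl1) have already been computed with an external language model toolkit; the function and the example values are ours, only the Avg + 3*Stdv(log2(ppl1)) criterion comes from the text:

import math
from statistics import mean, stdev

def keep_flags(ppl1_scores):
    """Flag each HIT as kept (True) or rejected (False) by LM perplexity.

    A transcription is rejected when log2 of its perplexity exceeds
    Avg + 3 * Stdv computed over all submitted transcriptions.
    """
    logs = [math.log2(p) for p in ppl1_scores]
    threshold = mean(logs) + 3 * stdev(logs)
    return [lp <= threshold for lp in logs]

# Eleven plausible perplexities plus one gibberish transcription (1e9):
scores = [90, 120, 75, 200, 60, 150, 110, 95, 130, 85, 170, 1e9]
print(keep_flags(scores))  # last flag is False; the outlier is rejected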
To evaluate the turkers' transcription (TRK) quality, we computed accuracy against our reference transcription (REF). As Amharic is a morphologically rich language, we found it relevant to calculate error rates at the word level (WER), syllable level (SER) and character level (CER). Indeed, some misspellings and differences of segmentation (which can be quite frequent in morphologically rich languages) will not necessarily impact system performance but will still inflate the WER (Novotney and Callison-Burch, 2010). The CER is less affected and, therefore, reflects transcription quality better than the WER. Our reference transcription consists of the sentences read during corpus recording; it may have some disagreements with the audio files due to reading errors and is thus imperfect.
As expected, the WER is rather high (16.0%) while the CER is low enough (3.3%) to approach the disagreement among expert transcribers; the word level disagreement for a morphologically poor language ranges from 2% to 4% WER.2 The gap between WER and SER (4.8%) is a good indication of the weight of the different segmentation errors due to the rich morphology.

2 http://www.itl.nist.gov/iad/mig/tests/rt.
Besides, the real usefulness of such transcriptions has been evaluated in an ASR system. We used the 65k vocabulary and the 3-gram language model developed by Tachbelie et al. (2010). We used the SphinxTrain toolkit from the Sphinx project to build Hidden Markov Model based acoustic models (AMs), training context independent acoustic models of 40 phones. We computed WER using test sets containing 360 utterances. The results indicate nearly similar performances, with a slightly lower WER for the system based on the TRK transcriptions (-0.5%). This suggests that non-expert transcriptions obtained through crowdsourcing can be accurate enough for ASR. However, it is important to insist on the fact that the Amharic transcription task was still incomplete after 73 days; this may be due to higher task difficulty (use of a virtual keyboard to handle the Amharic script). Such a result questions the use of Amazon's Mechanical Turk for less elementary tasks that require more of a worker's time or expertise.
3. Acoustic modeling: which unit?

In this section we give details of our experiments on the use of different acoustic units for Amharic speech recognition. We have developed a cross-word triphone based recognizer as a baseline. Moreover, acoustic models with other units have been considered. We first describe the experimental setup common to all the speech recognition systems. Then we give details of the speech recognition systems based on three types of acoustic units (triphone, CV syllable and hybrid).
3.1. Experimental setup

The acoustic features used (in all the acoustic models of the speech recognition systems described in Sections 3.2, 3.3 and 3.4) consist of 13-dimensional Mel Frequency Cepstral Coefficients (MFCCs) with their first- and second-order derivatives. A window size of 25 ms with an overlap of 10 ms has been used in the estimation of the MFCCs. The acoustic models have been trained using Sphinx,3 one of the most widely used open source speech recognition toolkits.
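For illustration, an equivalent 39-dimensional front-end can be sketched with librosa (not the toolkit used in the paper; Sphinx computes its features internally, and interpreting the 10 ms figure as the frame shift is our assumption):

import librosa
import numpy as np

def extract_features(wav_path, sr=16000):
    """13 MFCCs plus first- and second-order derivatives (39 dims/frame)."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=13,
        n_fft=int(0.025 * sr),       # 25 ms analysis window
        hop_length=int(0.010 * sr),  # 10 ms frame shift (assumed)
    )
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, delta, delta2]).T  # shape: (frames, 39)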
A word trigram language model has been developed using the ATC_120k corpus and the SRILM toolkit (Stolcke, 2002). The language model is smoothed with the modified Kneser-Ney smoothing technique and made open by including a special unknown word token. Moreover, since the amount of training text is small, all trigrams (regardless of their number of occurrences) have been included in the model. The language model has a perplexity of 38.52 on the 5k development test set sentences.

3 http://cmusphinx.sourceforge.net.


This language model has been used in the triphone, context-independent CV syllable, context-dependent CV syllable and hybrid recognition systems described in Sections 3.2, 3.3 and 3.4.
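For reference, such a model could be built with the SRILM command line tools; the flag set below reflects the description above (modified Kneser-Ney, open vocabulary, no trigram cutoffs) under common SRILM usage, and the file names are placeholders rather than the authors' actual scripts:

import subprocess

# Open-vocabulary trigram LM with modified Kneser-Ney smoothing; the
# -gt{1,2,3}min 1 flags keep all n-grams regardless of frequency.
subprocess.run([
    "ngram-count", "-order", "3",
    "-kndiscount", "-interpolate",   # modified, interpolated Kneser-Ney
    "-unk",                          # include the unknown word token
    "-gt1min", "1", "-gt2min", "1", "-gt3min", "1",
    "-text", "atc_120k.txt", "-lm", "word_3gram.lm",
], check=True)

# Perplexity on the 5k development test sentences.
subprocess.run([
    "ngram", "-order", "3", "-unk",
    "-lm", "word_3gram.lm", "-ppl", "dev_5k.txt",
], check=True)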

Table 6
Performance of CI syllable-based recognizers.

Syllable CI models    WER in %
Syll_CI_12gau         18.9
Syll_CI_16gau         18.3
Syll_CI_24gau         18.0
Syll_CI_32gau         18.3

3.2. Triphone-based system

The pronunciation dictionary consists of the 65k most frequent words taken from the ATC_120k text corpus described in Section 2.1, prepared as detailed in Section 2.3. The HMM (Hidden Markov Model) topology used is a 3-state Bakis topology with an additional non-emitting last state. We have trained several models with varying numbers of tied state triphones (also called senones) and Gaussian mixtures. As Table 5 shows, the best performing model, with the lowest word error rate (WER), consists of 2500 tied state cross-word triphone HMMs with 16 Gaussian mixtures. The model developed without the realization of the glottal stop consonant in the pronunciation (Triphone_2500sen_16gau) performs slightly better than the one in which the pronunciation of the glottal stop is marked in the lexical model (Triphone_2500sen_16gau_w?). However, further parameter tuning and the use of user defined questions for decision tree clustering reduced the WERs of Triphone_2500sen_16gau and Triphone_2500sen_16gau_w? to 17.8% and 17.9%, respectively. The difference in performance between these systems is statistically insignificant. This shows that treating the glottal stop as if it were always pronounced does not have much effect on the performance of a speech recognition system.

In order to make a fair comparison with the context independent syllable models (which are discussed in Section 3.3 and have only 233 acoustic units), we have also developed acoustic models with a smaller number of senones, namely 250, all other topology settings being the same. Unsurprisingly, these models perform worse than the models with 2500 senones.
3.3. Syllable-based recognizer

As indicated in Section 1.3.4, Amharic has several syllable structures. However, since the Amharic writing system is syllabary (representing approximately CV syllables) and since CV syllables cover the large majority of the syllable distribution in Amharic, only CV syllables have been considered in the current investigation.

Table 5
Performance of triphone-based systems.

Triphone models               WER in %
Triphone_2500sen_16gau_w?     18.8
Triphone_2500sen_16gau        18.2
Triphone_250sen_16gau_w?      22.5
Triphone_250sen_16gau         21.2

Table 7
Performance of CD syllable-based recognizers.

Syllable CD models          WER in %
Syll_CD_1000sen_16gau       18.0
Syll_CD_1500sen_24gau       17.3
Syll_CD_2000sen_24gau       17.6
Syll_CD_2500sen_24gau       17.6
Syll_CD_3000sen_24gau       17.9
Syll_CD_3500sen_24gau       17.9

As indicated in Section 1.3.3, Amharic has 276 distinct CV syllabic symbols. However, some of the symbols are duplicates, in the sense that they represent the same syllabic sounds; for example, several distinct Amharic characters are all pronounced sE. Research in speech recognition should consider distinct sounds rather than distinct orthographic symbols. Therefore, like Abate and Menzel (2007a), we eliminated the redundant symbols that represent the same syllabic sounds, and a total of 233 CV syllables have been considered as acoustic units.
HMM based acoustic models have been developed as detailed in Section 3.1. Since the previous study (Abate and Menzel, 2007a) showed better results with five-state HMMs, a 5-state Bakis topology with no skips and an additional non-emitting last state has been used in our experiments. We have trained 233 context independent (CI) syllable based acoustic models with different numbers of Gaussian mixtures. As can be seen from Table 6, the best performing CI CV syllable based system is the one with 24 Gaussian mixtures.
We also experimented with context dependent (CD) CV syllable acoustic models. As we did for the triphone based systems, different models have been developed using different numbers of senones and Gaussian mixtures. We started with 2500 senones, as the best triphone-based model has 2500 senones, and experimented with different numbers of senones by increasing and decreasing the number by 500 each time. Table 7 presents the best model for each number of senones. As can be seen from the table, the model with 1500 senones and 24 Gaussian mixtures is the best one (with the lowest WER, namely 17.3%). However, as the number of senones increases above 1500, performance starts to degrade, approaching the WER reported for the context independent syllable models.



Fig. 2. Syllable frequency distribution.

Usually a much larger reduction in error rate is obtained by making acoustic units context dependent. However, in our experiment only a slight improvement (a WER of 18.0 vs. 17.3) has been obtained as a result of context modeling. This can be explained by the scarcity of training speech data: very large training data should be used to implement context dependent syllable acoustic units, as the number of parameters to be estimated is very large. As we have used only 20 hours of training speech, it is obvious that our models suffer from insufficient training data. The other reason is the nature of the acoustic unit itself, i.e., syllables are less context sensitive than phones.

Further parameter tuning of the best CD CV syllable model (Syll_CD_1500sen_24gau) brought a small improvement in performance (WER of 17.1%). As we did for the best triphone-based system, we used user defined questions for decision tree clustering. However, no WER reduction has been obtained; rather, a slight increase in WER (from 17.1% to 17.3%) has been observed. This can be explained by the simplicity4 of the clusters we defined for the syllable-based recognizers.
3.4. Hybrid recognizers

From our experiments on syllable-based acoustic modeling, we noticed that some of the CV syllables are relatively rare in the training data and are, therefore, not trained very well. Fig. 2 shows the distribution of syllables in our training data. Since it was not feasible to record audio data to enrich the rare syllables within the time and resources available for our experiment, and since our aim is to find the best way of developing a speech recognition system for Amharic (an under-resourced language) using the available data, we decided to decompose the rare CV syllables (selected by their frequency in the training transcription) into constituent phones and train hybrid (CV syllable and phone) acoustic models.

4 The clusters (nasals, fricatives, etc.) are defined based only on the phonetic category of the consonant, irrespective of the vowel of the syllable.

Table 8
WER of hybrid acoustic models using word-based LM.

Hybrid models                    WER in %
Hybrid_204Units_5statesWS        17.8
Hybrid_204Units_5statesWOS       16.9
Hybrid_204Units_4statesWOS       17.1
Hybrid_170Units_5statesWS        18.9
Hybrid_170Units_5statesWOS       17.0
Hybrid_170Units_4statesWOS       17.5

Starting from the pronunciation dictionary used in the CV syllable-based recognizers, we have prepared two versions of hybrid pronunciation dictionaries: HybridDict_FL100 and HybridDict_FL500. In HybridDict_FL100, all the CV syllables with a frequency of less than 100 in the training transcription have been decomposed into their constituent phones; the number of distinct pronunciation units in this dictionary is 204 (31 phones, 172 CV syllables and a silence). CV syllables that appeared fewer than 500 times in the training transcription have been decomposed into phones to form the HybridDict_FL500 dictionary; the total number of distinct pronunciation units in this dictionary is 170 (41 phones, 128 CV syllables and a silence). Hybrid acoustic models have then been developed using these dictionaries (HybridDict_FL100 and HybridDict_FL500) with different HMM topologies.
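The construction of these hybrid dictionaries can be sketched as follows (our own reconstruction from the description above; the function names and the convention that a CV unit splits into its final vowel and the preceding consonant symbol are assumptions):

from collections import Counter

def split_cv(syllable):
    """Split a CV unit into consonant and vowel phones, e.g. 'bE' -> ['b', 'E'].
    Assumes the vowel is the final single character of the unit."""
    return [syllable[:-1], syllable[-1]]

def make_hybrid_dict(syl_dict, train_syllables, freq_threshold):
    """Decompose rare CV syllables into phones; keep frequent ones intact.

    syl_dict:         {word: [CV syllable, ...]} (syllable-based lexicon)
    train_syllables:  all CV units observed in the training transcription
    freq_threshold:   100 for HybridDict_FL100, 500 for HybridDict_FL500
    """
    freq = Counter(train_syllables)
    hybrid = {}
    for word, sylls in syl_dict.items():
        units = []
        for s in sylls:
            units.extend([s] if freq[s] >= freq_threshold else split_cv(s))
        hybrid[word] = units
    return hybrid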
While it might be possible to use different numbers of states for the various units (phones and syllables) in one system, we decided to use a common HMM topology of five states with skips. We assumed that this topology handles the irregularities (in length) of the hybrid acoustic units. However, for comparison purposes, we also developed hybrid acoustic models with four and five states without skips.
The hybrid acoustic models have been evaluated using the 65k pronunciation dictionary used in the syllable-based recognizers. However, the CV syllables that were decomposed into phones in the training dictionaries have also been decomposed in the 65k dictionary. Table 8 presents the performance of the hybrid systems evaluated on the 5k development test set.
As can be seen from the table, decomposing rare syllables into phones did not bring significant performance improvement over the syllable based systems. Rather, in some cases the result is even worse compared to the pure context dependent CV syllable-based speech recognition systems. The use of the five states with skips topology led to the worst performance (17.8% and 18.9% for the acoustic models developed with the HybridDict_FL100 and HybridDict_FL500 dictionaries, respectively). Although this topology enables us to capture irregularities in acoustic unit length, it requires much more training data, as the number of parameters (transition matrices) to be estimated is larger than in acoustic models without skips. This is why the hybrid models with skips did not perform as well as those with five states without skips (see Hybrid_204Units_5statesWOS and Hybrid_170Units_5statesWOS in the table).


3.5. Discussion: comparison of triphone, CV syllable and hybrid-based recognizers

We have compared the performance of the best CV syllable, triphone and hybrid acoustic unit based recognizers in terms of accuracy. For a fair comparison, we consider the triphone models developed with a pronunciation dictionary in which the glottal stop has been realized (i.e., Triphone_2500sen_16gau_w? and Triphone_250sen_16gau_w?). Unlike the results reported by Abate and Menzel (2007a), in our experiments the use of syllables as acoustic units led to improved recognition accuracy. Even the context independent syllable model is better (in recognition accuracy) than the best triphone model with 2500 senones (cf. Tables 5 and 6). Although a direct comparison of our results with those reported for other languages (such as Ganapathiraju et al. (2001)) is not possible, as the task, the amount of training data and the vocabulary used are different, the recognition improvement obtained from syllable units in our experiments is not large. This can also be attributed to the insufficient amount of training data used in our experiments. Furthermore, the large performance improvements in previous works for other languages may be due to the availability of long syllables (consisting of more than two phones). Moreover, the number of mono-syllabic words also contributes to good performance. In our case, we have used only CV syllables, and the proportion of mono-syllabic words consisting of only one CV is very small (only 0.13%). In addition, we have trained the syllable acoustic models with five states per HMM; using more than 5 states (Azim et al., 2008) or determining the number of states for each model on the basis of the duration of the syllable (Ganapathiraju et al., 2001) might improve the performance of the syllable-based recognizers.

A much larger reduction in recognition error rate (from 22.5% to 18.0%) has been observed with the CI syllable based models compared to the triphone model developed with a small number of senones that nearly corresponds to the number of syllable units.

The best performing context dependent CV syllable based recognizer brought a 1.5% absolute WER reduction over the best performing tied state triphone based recognizer. The WER reduction is statistically significant, with a p-value of 0.001. This shows that using syllables as recognition units is a promising direction, provided that sufficient training speech is available for training tied state context dependent syllable acoustic models.

Although the hybrid (phone-syllable) based recognizers did not bring significant performance improvement over the CD syllable based recognizers, they outperformed the triphone based systems in most cases. The best performing hybrid recognizer (Hybrid_204Units_5statesWOS) brought a statistically significant (p-value of 0.001) WER reduction over the triphone based system.

Generally, the hybrid recognizers with an HMM topology of five states without skips (Hybrid_204Units_5statesWOS and Hybrid_170Units_5statesWOS) and four states without skips (Hybrid_204Units_4statesWOS) outperformed all the other systems. However, the difference in performance from the CD syllable based recognizers is not significant. The CI syllable based models are better than the tied state triphone based models in terms of accuracy.
4. Morpheme-based systems

Manual analysis of the recognition results of the CV syllable-based and triphone-based systems showed that most short words (consisting of only two CV syllables) are correctly recognized by the syllable-based recognizers while being deleted by the triphone-based ones. This inspired us to investigate the use of CV syllable and hybrid acoustic models in morpheme-based speech recognition, where morphemes are used as entries in the pronunciation dictionary as well as units in the language model. Since morphemes are usually shorter than words, we expected CV syllable and hybrid acoustic models to lead to better recognition performance in morpheme-based speech recognition.

This section presents the results of the morpheme-based recognition experiments, following a brief description of the morphological segmentation methods used. The results reported are word error rates (WER) computed after reconstructing words from the recognized morpheme sequences. They are, therefore, directly comparable with the results reported in Section 3.
4.1. Morphological segmentation

Two segmentation methods have been applied to obtain morphologically segmented text: unsupervised segmentation and finite state based supervised segmentation. The following is a description of these methods and of the morpheme-based corpora obtained by applying them.
4.1.1. Unsupervised segmentation

For unsupervised word segmentation, Morfessor (Creutz and Lagus, 2005) has been used. Morfessor is a freely available, language independent, unsupervised morphology learning tool that tries to identify all the morphemes found in a given word. It is a data-driven approach that learns a sub-word lexicon from a training corpus of words using a Minimum Description Length (MDL) algorithm (Creutz and Lagus, 2005). It has been used with default options and without any adaptation. The units obtained with Morfessor are referred to here as morphemes, even if they do not always correspond to the linguistic definition of a morpheme (the smallest semantically meaningful unit).
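A minimal sketch of this segmentation step, using the Morfessor 2.0 Python port (the paper used Morfessor with default options; the exact version and the corpus path here are assumptions):

import morfessor  # Morfessor 2.0 Python port (pip install morfessor)

io = morfessor.MorfessorIO()
# Word list extracted from the training text; the path is a placeholder.
train_data = list(io.read_corpus_file("atc_120k_words.txt"))

model = morfessor.BaselineModel()
model.load_data(train_data)
model.train_batch()  # MDL-style batch training with default options

# Segment a word with the learned sub-word lexicon.
segments, cost = model.viterbi_segment("bEzihm")
print(segments)      # e.g. ['bE', 'zih', 'm'] (illustrative output)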
4.1.2. Supervised segmentation

In order to obtain a linguistic morpheme segmentation, we used the segmented text described in Tachbelie et al. (2011a) to train a finite state machine (FSM) based morphological segmenter (a composition of a morph transducer and a 12-gram CV syllable-based LM) using the AT&T FSM Library and GRM Library (Grammar Library) (Mohri et al., 1998). This segmenter detaches only prepositions (mostly prefixes) and conjunctions (mostly suffixes) from words.
4.1.3. Morpheme-based corpora

Morfessor and the FSM-based segmenter have been used to segment the ATC_120k text corpus described in Section 2.1. That means we have two versions of morphologically segmented corpora: one segmented by Morfessor and one by the FSM. The Morfessor segmented corpus consists of 4,035,592 morpheme tokens and 15,933 types. The FSM segmentation, on the other hand, resulted in a morpheme-based text corpus of 3,104,474 morpheme tokens and 142,855 types. The number of types in the FSM segmented corpus is much bigger than in the Morfessor segmented one. This is because the FSM segmentation only detaches prepositions and conjunctions from a word, while Morfessor gives a complete morpheme segmentation. These corpora have been used to prepare pronunciation dictionaries and morpheme-based language models for the morpheme-based speech recognition experiments.
In order to facilitate the conversion of morpheme sequences back to words after recognition, a special morpheme boundary marker (MB) has been attached to the left and right of morphemes obtained in the unsupervised manner (Morfessor). For the morphemes obtained using the FSM-based segmentation, two boundary markers, # and +, have been attached to prefixes and suffixes, respectively. This made the morphemes (in the Morfessor and FSM segmented texts) context sensitive and consequently increased the number of distinct morphemes to 45,338 for the Morfessor and 144,024 for the FSM segmented text. The markers are different for the FSM and Morfessor approaches because the FSM based segmenter detaches only prepositions and conjunctions, while Morfessor gives a complete morpheme segmentation without any context information.

Before scoring an ASR output (made up of morphemes), we rebuild words by reconnecting every unit containing MB (or #/+) to the unit next to or before it. For example, in the morpheme sequence bE# zih +m, bE# is a prefix and is connected to the morpheme next to it (zih), resulting in a new morpheme/word bEzih. Then the suffix m is attached to the new morpheme, giving the word bEzihm. For the Morfessor segmented corpus, MB has been used as the morpheme boundary marker. To reconstruct a word in this case, we reconnect morphemes every time two MBs appear consecutively. For instance, the morpheme sequence bEMB MBzihMB MBm is reconnected to give the word bEzihm.
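The reconstruction rules just described can be sketched as follows (our own implementation; decoder output tokens are assumed to be plain strings):

def rebuild_fsm(tokens):
    """Reconnect FSM-style morphemes: a token ending in '#' is a prefix,
    a token starting with '+' is a suffix."""
    words = []
    for tok in tokens:
        if tok.startswith("+") and words:        # suffix: glue to previous
            words[-1] += tok[1:]
        elif words and words[-1].endswith("#"):  # previous was a prefix
            words[-1] = words[-1][:-1] + tok
        else:
            words.append(tok)
    return [w.rstrip("#") for w in words]

def rebuild_morfessor(tokens, mb="MB"):
    """Reconnect Morfessor morphemes whenever two MB markers meet."""
    words = []
    for tok in tokens:
        if words and words[-1].endswith(mb) and tok.startswith(mb):
            words[-1] = words[-1][:-len(mb)] + tok[len(mb):]
        else:
            words.append(tok)
    return [w.replace(mb, "") for w in words]

print(rebuild_fsm(["bE#", "zih", "+m"]))              # -> ['bEzihm']
print(rebuild_morfessor(["bEMB", "MBzihMB", "MBm"]))  # -> ['bEzihm']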


4.2. Experimental results of morpheme-based systems

Table 9
WER of several AMs with FSM segmented morpheme-based LMs.

Units                  Models                          WER in %
Phone                  Triphone_3states                16.7
                       Triphone_3states_UDQ            15.9
CV syllable            CD_Syllable                     14.3
                       CD_Syllable_UDQ                 14.6
Phone + CV syllable    Hybrid_170Units_5statesWS       16.0
                       Hybrid_170Units_5statesWOS      13.9
                       Hybrid_170Units_4statesWOS      14.3
                       Hybrid_204Units_5statesWS       14.6
                       Hybrid_204Units_5statesWOS      14.2
                       Hybrid_204Units_4statesWOS      14.3

The acoustic models used in morpheme-based recognition are the triphone, context dependent CV syllable and hybrid models described in Section 3. Morpheme-based trigram language models have been developed (in a similar fashion to the word trigram language model described in Section 3.1) from the unsupervised and supervised morphologically segmented corpora. For the latter corpus, a 65k morpheme vocabulary has been prepared by taking the most frequent morphemes. This vocabulary has been used to prepare three types of pronunciation dictionaries according to the units (phone, CV syllable and hybrid) used in the acoustic models. The recognition experiments have then been performed using the 5k development test set. Table 9 gives the WER of the different acoustic models in FSM segmented morpheme-based recognition.
As the table shows, the CV syllable-based acoustic models outperformed the triphone-based ones. A 2.4% absolute WER reduction (cf. Triphone_3states and CD_Syllable in the table) has been obtained as a result of using syllable acoustic units in morpheme-based recognition. This improvement is statistically significant, with a p-value of less than 0.001. The use of user defined questions for decision tree based clustering had a positive influence (a 0.8% absolute WER reduction) on the triphone-based acoustic models. However, user defined questions did not bring a WER reduction for the syllable-based acoustic models, as was also the case in the word-based recognition experiments described in Section 3.3. Nevertheless, the syllable-based acoustic model with user defined questions (CD_Syllable_UDQ) resulted in a significant (at a p-value of 0.005) WER reduction compared to the equivalent triphone-based system (Triphone_3states_UDQ). Generally, all the hybrid acoustic models performed significantly (at a p-value of 0.001) better than the triphone-based systems, the best performing (with a WER of 13.9%) being Hybrid_170Units_5statesWOS.5 This system also has a slightly lower WER than the pure CV syllable-based systems. Although the model topology is crude in representing the acoustic units (not state of the art for phones), this model achieved the lowest WER among all the others.

5 The model with an HMM topology of five states without skips and in which CV syllables with a frequency of less than 500 have been decomposed into phones.


Table 10
WER of several AMs with Morfessor segmented morpheme-based LMs.

Units                  Models                          WER in %
Phone                  Triphone_3states                17.8
                       Triphone_3states_UDQ            15.9
CV syllable            CD_Syllable                     14.8
                       CD_Syllable_UDQ                 13.9
Phone + CV syllable    Hybrid_170Units_5statesWS       15.5
                       Hybrid_170Units_5statesWOS      13.7
                       Hybrid_170Units_4statesWOS      13.5
                       Hybrid_204Units_5statesWS       13.7
                       Hybrid_204Units_5statesWOS      13.7
                       Hybrid_204Units_4statesWOS      13.3

the others. This indicates the potential of the hybrid acoustic model for even higher performance provided that
proper topologies are used for each of the units.
For the unsupervised morpheme-based recognition, all
the distinct morphemes in the segmented text (45k) have
been considered as entries for pronunciation dictionaries.
As we did for the FSM segmented text, we prepared three
versions of pronunciation dictionaries according to the
type of units used in the acoustic models. Performance of
the Morfessor morpheme-based recognition using the different acoustic models on the 5k development test set is
presented in Table 10. As it is true in the FSM segmented
As in the FSM segmented morpheme-based recognition experiment, the use of syllable acoustic models led to greater WER reductions in morpheme-based recognition: absolute WER reductions of 3% and 2% have been obtained compared to the triphone-based models that use automatically generated and user-defined tree questions, respectively. These error rate reductions are statistically significant, with p-values of less than 0.001. The result clearly shows that syllable-based acoustic models are best fitted for morpheme-based recognition. The hybrid acoustic models (except Hybrid_170Units_5statesWS) outperformed the triphone- and CV syllable-based models, although the improvement over the syllable-based model with user-defined questions (CD_Syllable_UDQ) is not statistically significant. In all the other cases the WER reduction is statistically significant, at a p-value of less than 0.001 compared to the triphone-based systems and with a minimum p-value of 0.002 compared to the syllable-based one. This shows that the hybrid models are the best for Amharic morpheme-based recognition systems.
Although the difference is not big, Morfessor-based segmentation led to a lower WER than FSM-based segmentation. This can be explained by the relatively high OOV rate (3.58%) of the FSM based system compared to that of the Morfessor based one, which is almost zero (0.10%). The results presented in this section also show that using morphemes (instead of words) as entries in the pronunciation dictionary and as units in the language model brings improvement in an Amharic speech recognition system. Our results show that the use of long acoustic units (syllables) and short lexical units (morphemes) is the best for Amharic speech recognition.

Moreover, the results show that the use of hybrid units in acoustic modeling and morphemes in lexical and language modeling leads to the best performance.
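For reference, OOV rates such as those quoted above are computed over the running tokens of the segmented test text; a minimal sketch, with hypothetical file names:

```python
def oov_rate(vocab_file, test_file):
    """Percentage of running tokens in the test text that are not
    covered by the recognition vocabulary."""
    vocab = set(open(vocab_file, encoding="utf-8").read().split())
    tokens = open(test_file, encoding="utf-8").read().split()
    oov = sum(1 for tok in tokens if tok not in vocab)
    return 100.0 * oov / len(tokens)

print(f"{oov_rate('morph_vocab_65k.txt', 'test_segmented.txt'):.2f}%")
```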
4.3. Vocabulary gain
One of the benefits of using a morpheme-based recognition system is the recognition of out-of-vocabulary words. We have analyzed the vocabulary gain by checking the recognition output of the best performing morpheme-based systems from Tables 9 and 10 against our 65k word vocabulary. The outputs of Hybrid_170Units_5statesWOS and Hybrid_204Units_4statesWOS have been considered for the FSM and Morfessor segmented systems, respectively. The number of newly reconstructed words that are recognized by the FSM system but not available in the vocabulary is 246. Out of these, 156 (63.41%) are OOV words, i.e. words which are in the test set but not in the 65k vocabulary. In the output of the Morfessor system, there are 333 words which are not found in the vocabulary, 248 (74.47%) being OOV words. On the other hand, both sub-word based recognizers reconstructed illegal words (meaningless words, fragments and concatenations of two words) in the language. Out of the reconstructed words, 10.56% and 12.01% are illegal words generated by the FSM and Morfessor segmented systems, respectively.
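The bookkeeping behind these counts can be sketched as follows (our illustration; identifying which of the remaining forms are illegal words additionally required manual inspection):

```python
def vocabulary_gain(hyp_words, vocab_65k, reference_words):
    """Classify word types reconstructed from the recognizer output:
    types outside the 65k vocabulary are either recovered OOVs (they
    occur in the test references) or candidates that still need manual
    checking, since some are illegal forms (fragments, run-together
    words) rather than real Amharic words."""
    new_types = {w for w in hyp_words if w not in vocab_65k}
    recovered = {w for w in new_types if w in reference_words}
    to_inspect = new_types - recovered
    return new_types, recovered, to_inspect
```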
5. Conclusion and future directions
We have investigated the use of different acoustic units for Amharic speech recognition. As the use of triphone units is the state of the art, we developed a triphone-based system as a baseline. Then, we investigated the use of syllables as acoustic units for Amharic speech recognition. As the Amharic writing system is a syllabary, with each character representing a consonant and a vowel, we used CV syllables in acoustic modeling. Since there were rare syllables in our training speech, we decomposed the rare syllables into their constituent phones, while keeping the syllables that have enough representation in the training data, and trained hybrid phone-syllable acoustic models. Moreover, the triphone, CV syllable and hybrid acoustic models have been investigated in a morpheme-based recognition system, where morphemes are used as lexical and language modeling units.
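As a minimal sketch of this decomposition rule (assuming a transliteration in which the vowel is the final letter of each CV syllable; the real system maps syllables to their actual constituent phones):

```python
def hybrid_units(syllables, syll_freq, min_count=500):
    """Keep CV syllables with enough training occurrences as whole
    units; split rare ones into their constituent consonant and vowel.
    min_count=500 matches one of the hybrid configurations reported
    above; the consonant/vowel split below is a simplification."""
    units = []
    for syl in syllables:
        if syll_freq.get(syl, 0) >= min_count:
            units.append(syl)
        else:
            units.extend([syl[:-1], syl[-1]])
    return units

print(hybrid_units(["bE", "zi"], {"bE": 1200, "zi": 80}))
# ['bE', 'z', 'i']
```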
The CV syllable based recognizers have been compared with tied-state triphone based recognizers. Our results show that the syllable based recognizers outperformed the best tied-state triphone based recognizer in accuracy. Even the CI syllable based model performed better than the best triphone based system. Generally, the results enable us to conclude that the use of syllables is a promising direction for Amharic speech recognition. When resources (particularly training data) are limited, the CI CV syllable units are the best choice for Amharic speech recognition. However, if our concern is the accuracy of a system, the CD syllables are the best alternative acoustic units, provided that enough training data is available.
Hybrid (phone-syllable) based recognizers did not bring a significant performance improvement over the best CD syllable based recognizers when word units are used in the pronunciation dictionary and language model. However, they brought better WER reductions in morpheme-based speech recognition. The syllable-based acoustic models also outperformed the triphone-based models in morpheme-based speech recognition. This enables us to conclude that the use of syllable and hybrid units in acoustic modeling and morphemes in lexical and language modeling is the best for Amharic speech recognition. We recommend investigating the use of syllable acoustic models in morpheme-based speech recognition for other morphologically rich languages.
In the current study, only consonant-vowel (CV) syllables are considered. As Amharic has other syllable structures (V, VC, CVC, CVCC) as well, we will investigate the use of all types of syllables in acoustic modeling. We think that a better performance can be obtained by considering all syllable structures in acoustic modeling, because the use of longer (in time) units will improve recognition. However, considering all syllable structures has its own challenges. Since the number of syllables will be much larger than the 233 considered in this study, a large amount of training speech data is required to train all the syllables adequately. Thus, a way to model all Amharic syllable types with the available training data has to be found. Experiments with a number of states per HMM different from five will also be conducted for the syllable based acoustic modeling.
In hybrid acoustic modeling, we have used an HMM topology of 5 states with skips, assuming that this topology can handle the irregularities (in length) of the acoustic units. However, since the number of parameters estimated in this model topology is very big, a large amount of training data is required to adequately train such models. As we have used only 20 hours of training speech data, we could not see the benefit of using such a topology in hybrid acoustic models; instead, a higher WER has been observed. Thus, in order to get the real benefit of using models with skips, larger training data has to be used. Since acquiring data is not easy (especially for under-resourced languages), using different model topologies for the syllable and phone units in hybrid acoustic modeling will be investigated.
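To make the parameter argument concrete, the sketch below (our illustration, not the HTK definition files actually used) builds left-to-right transition matrices for five emitting states with and without skip arcs and counts the arcs to be estimated:

```python
import numpy as np

def transition_matrix(n_emit=5, allow_skips=False):
    """Left-to-right HMM with non-emitting entry/exit states. Every
    emitting state gets a self-loop and a forward arc; with skips, an
    additional arc jumps over the following state. Probabilities are
    uniform placeholders that training would re-estimate."""
    n = n_emit + 2
    a = np.zeros((n, n))
    a[0, 1] = 1.0  # entry state -> first emitting state
    for i in range(1, n_emit + 1):
        targets = [i, i + 1]
        if allow_skips and i + 2 < n:
            targets.append(i + 2)
        for j in targets:
            a[i, j] = 1.0 / len(targets)
    return a

for skips in (False, True):
    arcs = int((transition_matrix(allow_skips=skips) > 0).sum())
    print(f"skips={skips}: {arcs} transition arcs to estimate")
```

The extra arcs (and, more importantly, the additional emitting states relative to a 3-state phone model) are what drive up the amount of training data needed.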
In our morpheme-based systems we decomposed all the words in our corpus irrespective of their frequency in our data. Decomposing rare words while keeping frequent words as they are is an interesting direction in dealing with morphologically rich languages. The approach followed by El-Desoky et al. (2009), where no segmentation is done for the top N highly ranked decomposable words, is an interesting future endeavor.
Last but not least, specific language issues like gemination, epenthetic vowel insertion and glottal stop realization will also be handled. We will model geminated and non-geminated consonants separately. In addition, the epenthetic vowel and the glottal stop will be realized in their proper places in the pronunciations.
References
Abate, Solomon Teferra, 2006. Automatic Speech Recognition for Amharic. Ph.D. thesis, University of Hamburg, Germany.
Abate, Solomon Teferra, Menzel, Wolfgang, 2007a. Syllable-based speech recognition for Amharic. In: Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources, Prague, Czech Republic, pp. 33–40.
Abate, Solomon Teferra, Menzel, Wolfgang, 2007b. Automatic speech recognition for an under-resourced language – Amharic. In: Proceedings of INTERSPEECH 2007, pp. 1541–1544.
Abate, Solomon Teferra, Menzel, Wolfgang, Tafila, Bairu, 2005. An Amharic speech corpus for large vocabulary continuous speech recognition. In: Proceedings of INTERSPEECH 2005, Lisbon, Portugal, pp. 1601–1604.
Appleyard, David, 1995. Colloquial Amharic: A Complete Course for Beginners. Routledge, London, NY.
Azim, Mohamed Mostafa, Tolba, Hesham, Mahdy, Sherif, Fahsal, Mervat, 2008. Syllable-based automatic Arabic speech recognition in noisy-telephone channel. WSEAS Transactions on Signal Processing 4 (4), 211–220.
Bazzi, Issam, 2002. Modelling Out-of-Vocabulary Words for Robust Speech Recognition. Ph.D. thesis, Massachusetts Institute of Technology.
Bender, M.L., Bowen, J.D., Cooper, R.L., Ferguson, C.A., 1976. Languages in Ethiopia. Oxford University Press, London.
Berhanu, Solomon, 2001. Isolated Amharic Consonant-Vowel Syllable Recognition: An Experiment Using the Hidden Markov Model. M.Sc. thesis, School of Information Studies for Africa, Addis Ababa University, Ethiopia.
Berment, V., 2004. Méthodes pour informatiser les langues et les groupes de langues peu dotées. Ph.D. thesis, Université Joseph Fourier, Grenoble, France.
Besacier, L., Le, V.-B., Boitet, C., Berment, V., 2006. ASR and translation for under-resourced languages. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2006, vol. 5, pp. 1221–1224.
Carki, Kenan, Geutner, Petra, Schultz, Tanja, 2000. Turkish LVCSR: towards better speech recognition for agglutinative languages. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, pp. 1563–1566.
Creutz, Mathias, Lagus, Krista, 2005. Unsupervised Morpheme Segmentation and Morphology Induction from Text Corpora Using Morfessor 1.1. Tech. Rep. A81, Neural Networks Research Center, Helsinki University of Technology.
El-Desoky, Amr, Gollan, Christian, Rybach, David, Schlüter, Ralf, Ney, Hermann, 2009. Investigating the use of morphological decomposition and diacritization for improving Arabic LVCSR. In: Proceedings of INTERSPEECH 2009, pp. 2679–2682.
Ethnologue, 2004. <http://www.ethnologue.com/show_language.asp?code=AMH>.
Gales, Mark, Woodland, Phil, 2006. Recent Progress in Large Vocabulary Continuous Speech Recognition: An HTK Perspective.
Ganapathiraju, Aravind, Hamaker, Jonathan, Picone, Joseph, Ordowski, Mark, Doddington, George R., 2001. Syllable-based large vocabulary continuous speech recognition. IEEE Transactions on Speech and Audio Processing 9 (4), 358–366.
Gelas, Hadrien, Abate, Solomon Teferra, Besacier, Laurent, Pellegrino, François, 2011. Quality assessment of crowdsourcing transcriptions for African languages. In: Proceedings of INTERSPEECH, Florence, Italy.
Geutner, Petra, 1995. Using morphology towards better large vocabulary speech recognition systems. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, vol. I, pp. 445–448.
Girmaw, Molalgne, 2004. An Automatic Speech Recognition System for Amharic. M.Sc. thesis, Dept. of Signals, Sensors and Systems, Royal Institute of Technology, Stockholm, Sweden.
Gruenstein, Alexander, McGraw, Ian, Sutherland, Andrew, 2009. A self-transcribing speech corpus: collecting continuous speech with an online educational game. In: Proceedings of SLaTE, Brighton, UK.
Haile, Alemayehu, 1995. Is syllable weight distinction relevant for Amharic stress assignment? Journal of Ethiopian Studies 28 (2), 15–25.
Hämäläinen, Annika, Boves, Lou, de Veth, Johan, 2005. Syllable length acoustic units in large-vocabulary continuous speech recognition. In: Proceedings of SPECOM 2005, pp. 499–502.
Hirsimäki, Teemu, Creutz, Mathias, Siivola, Vesa, Kurimo, Mikko, 2005. Morphologically motivated language models in speech recognition. In: Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning, pp. 121–126.
Ircing, Pavel, Krbec, Pavel, Hajič, Jan, Psutka, Josef, Khudanpur, Sanjeev, Jelinek, Frederick, Byrne, William, 2001. On large vocabulary continuous speech recognition of highly inflectional language – Czech. In: Proceedings of INTERSPEECH 2001, pp. 487–489.
Kirchhoff, Katrin, Bilmes, Jeff, Henderson, John, Schwartz, Richard, Noamany, Mohamed, Schone, Pat, Ji, Gang, Das, Sourin, Egan, Melissa, He, Feng, Vergyri, Dimitra, Liu, Daben, Duta, Nicolae, 2002. Novel Speech Recognition Models for Arabic. Tech. Rep., Johns Hopkins University Summer Research Workshop.
Leslau, Wolf, 2000. Introductory Grammar of Amharic. Harrassowitz Verlag, Wiesbaden.
Liu, Xunying, Gales, Mark John Francis, Hieronymus, Jim L., Woodland, Philip C., 2011. Investigation of acoustic units for LVCSR systems. In: Proceedings of ICASSP 2011, pp. 4872–4875.
Mariam, Sebsibe H., Kishore, S.P., Black, Alan W., Kumar, Rohit, Sangal, Rajeev, 2004. Unit selection voice for Amharic using Festvox. In: Proceedings of the 5th ISCA Speech Synthesis Workshop, Pittsburgh, PA, pp. 103–107.
Marge, Matthew, Banerjee, Satanjeev, Rudnicky, Alexander I., 2010. Using the Amazon Mechanical Turk to transcribe and annotate meeting speech for extractive summarization. In: Proceedings of NAACL HLT.
McGraw, Ian, Gruenstein, Alexander, Sutherland, Andrew, 2009. A self-labeling speech corpus: collecting spoken words with an online educational game. In: Proceedings of INTERSPEECH.
Mohri, Mehryar, Pereira, Fernando, Riley, Michael, 1998. A rational design for a weighted finite-state transducer library. In: Wood, Derick, Yu, Sheng (Eds.), Automata Implementation, Lecture Notes in Computer Science, vol. 1436. Springer, Berlin/Heidelberg, pp. 144–158.
Novotney, Scott, Callison-Burch, Chris, 2010. Cheap, fast and good enough: automatic speech recognition with non-expert transcription. In: Proceedings of NAACL HLT, pp. 207–215.
Pellegrini, Thomas, Lamel, Lori, 2006. Investigating automatic decomposition for ASR in less represented languages. In: Proceedings of INTERSPEECH 2006.
Pellegrini, Thomas, Lamel, Lori, 2006. Experimental detection of vowel pronunciation variants in Amharic. In: Proceedings of LREC.
Pellegrini, Thomas, Lamel, Lori, 2007. Using phonetic features in unsupervised word decompounding for ASR with application to a less-represented language. In: Proceedings of INTERSPEECH 2007, pp. 1797–1800.
Pellegrini, Thomas, Lamel, Lori, 2009. Automatic word decompounding for ASR in a morphologically rich language: application to Amharic. IEEE Transactions on Audio, Speech, and Language Processing 17 (5), 863–873.
Seid, Hussien, Gambäck, Björn, 2005. A speaker independent continuous speech recognizer for Amharic. In: Proceedings of INTERSPEECH 2005, 9th European Conference on Speech Communication and Technology, Lisbon, Portugal, pp. 3349–3352.
Seifu, Zegaye, 2003. HMM Based Large Vocabulary, Speaker Independent, Continuous Amharic Speech Recognizer. M.Sc. thesis, School of Information Studies for Africa, Addis Ababa University, Ethiopia.
Sethy, Abhinav, Narayanan, Shrikanth, Parthasarthy, S., 2002. A syllable based approach for improved recognition of spoken names. In: Proceedings of the 5th ISCA Pronunciation Modeling Workshop, pp. 30–35.
Seyoum, Mulugeta, 2001. The Syllable Structure and Syllabification in Amharic. Master's thesis, Department of Linguistics, Trondheim, Norway.
Siivola, Vesa, Hirsimäki, Teemu, Creutz, Mathias, Kurimo, Mikko, 2003. Unlimited vocabulary speech recognition based on morphs discovered in an unsupervised manner. In: Proceedings of Eurospeech, pp. 2293–2296.
Snow, Rion, O'Connor, Brendan, Jurafsky, Daniel, Ng, Andrew Y., 2008. Cheap and fast – but is it good? Evaluating non-expert annotations for natural language tasks. In: Proceedings of EMNLP 2008, pp. 254–263.
Stolcke, Andreas, 2002. SRILM – an extensible language modeling toolkit. In: Proceedings of ICSLP 2002, Denver, Colorado, USA, pp. 901–904.
Tachbelie, Martha Yifiru, 2003. Automatic Amharic Speech Recognition System to Command and Control Computers. M.Sc. thesis, School of Information Studies for Africa, Addis Ababa University, Ethiopia.
Tachbelie, Martha Yifiru, 2010. Morphology-Based Language Modeling for Amharic. Ph.D. thesis, University of Hamburg, Germany.
Tachbelie, Martha Yifiru, Abate, Solomon Teferra, Menzel, Wolfgang, 2009. Morpheme-based language modeling for Amharic speech recognition. In: Proceedings of the 4th Language and Technology Conference, LTC-09, pp. 114–118.
Tachbelie, Martha Yifiru, Abate, Solomon Teferra, Menzel, Wolfgang, 2010. Morpheme-based automatic speech recognition for a morphologically rich language – Amharic. In: Proceedings of SLTU'10, Penang, Malaysia, pp. 68–73.
Tachbelie, Martha Yifiru, Abate, Solomon Teferra, Menzel, Wolfgang, 2011. Morpheme-based and factored language modeling for Amharic speech recognition. In: Human Language Technology: Challenges for Computer Science and Linguistics, Lecture Notes in Computer Science, vol. 6562, pp. 82–93.
Tachbelie, Martha Yifiru, Abate, Solomon Teferra, Besacier, Laurent, 2011. Part-of-speech tagging for under-resourced and morphologically rich languages – the case of Amharic. In: Proceedings of the HLTD 2011, pp. 50–55.
Tadesse, Kinfe, 2002. Sub-Word Based Amharic Speech Recognizer: An Experiment Using Hidden Markov Model (HMM). M.Sc. thesis, School of Information Studies for Africa, Addis Ababa University, Ethiopia.
Thangarajan, R., Natarajan, A.M., 2008. Syllable based continuous speech recognition for Tamil. South Asian Language Review 17 (1), 71–85.
Voigt, Reiner M., 1987. The classification of Central Semitic. Journal of Semitic Studies 32 (1), 1–21.
Whittaker, E.W.D., Woodland, P.C., 2000. Particle-based language modeling. In: Proceedings of the International Conference on Spoken Language Processing, pp. 170–173.
Whittaker, E.W.D., Van Thong, J.M., Moreno, P.J., 2001. Vocabulary independent speech recognition using particles. In: IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 315–318.
Woodland, P.C., Leggetter, C.J., Odell, J.J., Valtchev, V., Young, S.J., 1995. The 1994 HTK large vocabulary speech recognition system. In: Proceedings of the 1995 International Conference on Acoustics, Speech and Signal Processing, vol. 1, pp. 73–76.
Yimam, Baye, 2007. yEamarNa sEwasEw, second ed. EMPDE, Addis Ababa.
