Research Article

Text to Speech System for Punjabi Using Festival Framework
Sukhpreet Kaur Gill1

Text-to-speech (TTS) generation is the process of converting raw text into speech output.
This article presents the development of a Punjabi voice using the Festvox tool, with Festival used
as the engine to run the developed voice. The statistical parametric method of waveform generation is
used, and the phoneme is chosen as the basic unit for speech synthesis. The corpus for the Punjabi
language is collected from newspapers and news channels. The recording is done in a noise-free environment.

Keywords: Festvox, Festival, Statistical parametric method

A text-to-speech (TTS) system takes input in the form of raw text and converts it into synthesized voice
output. TTS is part of the speech synthesis process. Different speech synthesis systems are used to
synthesize voice, such as MBROLA, Festival and Flite [4]. Festival is one of the best text-to-speech
synthesis frameworks [1]. Festival acts as the engine that runs the voice, while Festvox acts as the back
end through which the actual development of a new voice is done, as shown in Fig. 1. The Festival framework
has multilingual support. Festival is implemented in two languages, C++ and SIOD. Seven Indic languages
have currently been developed in Festvox [1].

Figure 1. Role of Festvox in Voice Development

All frameworks use the same basic methodology to develop synthesized speech output. The steps involved in
speech generation are text processing, linguistic processing and waveform generation, as outlined in Fig. 2.

Figure 2. Text to Speech System

This article focuses on building a Punjabi voice using Festvox. Indian language scripts were developed
from the Brahmi script [7]. Punjabi is spoken by 104 million people in India. In this article the
Statistical Parametric Synthesis method of Festvox is used to develop the voice. The phoneme is chosen as
the basic unit for generating the synthesized voice. A grapheme-to-phoneme synthesizer is used because of
the lack of resources for Indian languages. This technique provides support for phonetics, but
construction of an explicit lexicon is not required [1].

The synthesis methods offered by Festvox are unit selection, limited domain and diphone. In the unit
selection method, the quality of the synthesized speech depends on the quality of the collected corpus,
and deep knowledge is required for the complete processing. In the statistical parametric synthesis
technique a smaller corpus is needed and little knowledge is required to develop the TTS system. The basic
methodology of the statistical parametric method includes the extraction of the feature parameters of the
frequency contour and the Mel cepstral coefficients. These parameters are extracted from labelled data. In
this method the maximum likelihood criterion is used for decision making [2]. It is applied when deciding
which parametric features to select to generate the cluster tree and the duration model. The duration
models are evaluated using the root mean square error and the correlation coefficient.

Literature Survey

Indic languages are supported by the Festival speech synthesis system, which is open source. It makes use
of three vital modules: text processing, prosodic processing and waveform synthesis [1]. In past decades,
many researchers have compared the quality of speech produced by Festival using different units of speech
(syllable, phoneme, etc.) and different techniques for producing the waveform. Unit selection and diphones
are the techniques mainly used by researchers for producing the waveform. Speech synthesized using
open-domain unit selection requires minimal user experience and knowledge of speech synthesis [10].
Various fundamental units, such as syllables or phonemes, have been tested in unit selection waveform
synthesis. A TTS system was built for the Bengali language using the syllable as the fundamental unit [8],
whereas the phoneme was used for building a Malayalam TTS system [5]. Statistical parametric synthesis is
one of the easiest methods for reconstructing the wave signal from the extracted parameters [2]. In unit
selection synthesis, the size of the database decides the quality of the speech produced. Moreover,
generating speech with this technique requires specialization and is also expensive. New versions of the
Festival system and Festvox have been updated to support CLUSTERGEN voices; these were released in the
1.96 beta version of Festival. In the hybrid technique, unit selection is applied to the parameters
extracted using the CLUSTERGEN method. Results verified that this hybrid technique is better than the
statistical parametric method and Cl units.

In this article, the CLUSTERGEN technique along with the Festival framework is used to produce Punjabi
speech. Analysis of the phonemes of Punjabi shows that they can be categorized into two parts:
non-nasalized valid phonemes (total ...) and nasalized valid phonemes (total 324) [6, 7].

Gill SK et al. J. Adv. Res. Appl. Arti. Intel. Neural Netw. 2017; 4(1&2)

Experimental Work

The corpus for the Punjabi language is collected from different Punjabi newspapers and news channels. The
corpus is selected in such a way that it covers the possible Punjabi ordinals. Due to the lack of
resources, a grapheme-based synthesizer is used. The basic modules in Festvox include text normalization,
prosody/linguistic analysis and waveform re-synthesis, as shown in Fig. 3.
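The idea behind a grapheme-based synthesizer, in which written characters stand in for phonemes so that no pronunciation lexicon has to be built, can be illustrated with a minimal Python sketch. The function name `graphemes_to_units`, the code-point based labels and the `pau` pause label are assumptions made for illustration; they are not the unit names that Festvox itself produces.

```python
def graphemes_to_units(text):
    """Map every character of `text` to a synthesis-unit label.

    Assumption for illustration: each written character is one unit,
    named after its Unicode code point. Whitespace becomes a pause.
    """
    units = []
    for ch in text:
        if ch.isspace():
            units.append("pau")  # hypothetical pause label between words
        else:
            units.append(f"u{ord(ch):04x}")  # e.g. U+0A2A -> "u0a2a"
    return units


if __name__ == "__main__":
    # A Punjabi word written in Gurmukhi; each character becomes one unit.
    print(graphemes_to_units("ਪੰਜਾਬ"))
```

Because the labels are derived directly from the script, the unit inventory is fixed by the writing system rather than by a hand-built lexicon, which is the property the grapheme-based approach relies on.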

Figure 3. Methodology for Speech Generation

Collection and Recording of Speech

For the development of the Punjabi voice, a speech corpus of about 80 minutes was recorded. The recorded
data is used to develop the Punjabi synthesized voice. The corpus was collected in the form of sentences,
and the corresponding speech was recorded in a noiseless studio. To be accepted by Festvox, the recordings
are normalized and converted to 16 kHz format. The data is divided randomly into training data (90% of the
total segments) and test data (10% of the total segments) [5]. The corpus was recorded by seven different
female speakers, and the voice of the best one was selected. For accurate automatic labelling, the starting
and ending silences were pruned. The middle silences in the utterances were also pruned.

Selection of Basic Unit for Synthesis

The development of a voice begins by selecting the basic unit for synthesis. Various units have been used
in research, such as phones, diphones and syllables [2]. In this article the phoneme is chosen as the basic
unit for synthesis. The quality of the resulting speech output is not fully accurate, and it leads to
discontinuities at the joining points. However, the phoneme is the smallest unit and therefore requires the
least amount of speech corpus, whereas the storage space needed for the speech corpus is large when the
syllable is the basic unit. So, the phoneme is chosen as the basic unit. Another reason is that a
grapheme-based synthesizer is used in this article.

Labelling and Model Development

Labelling is an important part of every speech synthesis framework. Festival supports automatic labelling
using the EHMM labeller, but it sometimes leads to wrong results, so manual labelling is also performed.
Labelling is done to mark each phone unit in the corpus and to assign the sound to each phoneme unit.
Manual labelling is a hard task but leads to accurate results. The WaveSurfer tool is used for the manual
labelling of the speech corpus. Fig. 4 shows the manual labelling of the speech corpus using the
WaveSurfer tool.
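Two of the corpus preparation steps described under Collection and Recording of Speech, pruning the starting and ending silences and the random 90%/10% division into training and test data, can be sketched in Python as follows. This is only an illustration under stated assumptions: the amplitude threshold of 500, the fixed random seed and the function names are hypothetical, and pruning of the middle silences is not covered.

```python
import random


def trim_silence(samples, threshold=500):
    """Remove leading and trailing samples quieter than `threshold`.

    `samples` is a list of integer amplitudes; the threshold value is an
    assumption and would have to be tuned to the actual recordings.
    """
    start, end = 0, len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]


def split_corpus(utterance_ids, train_fraction=0.9, seed=0):
    """Shuffle the utterances and split them into training and test sets."""
    ids = list(utterance_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split repeatable
    cut = int(round(len(ids) * train_fraction))
    return ids[:cut], ids[cut:]


if __name__ == "__main__":
    print(trim_silence([0, 10, 900, -1200, 3, 700, 0, 0]))
    train, test = split_corpus(range(100))
    print(len(train), len(test))
```

Quiet samples inside the utterance are kept by `trim_silence`; only the edges are pruned, which is the part that matters most for aligning the first and last phone during automatic labelling.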


Figure 4. The Marking for Phonemes in the Word “ਬ + +ਗ਼”

The next step is the extraction of features from the labelled data. The extracted features are then
selected on the basis of the maximum likelihood criterion of decision theory to develop the cluster trees
and the duration model. CART is used to develop the cluster trees. In the statistical parametric synthesis
method only a single decision tree is developed. The decision tree has a binary query on some feature at
each node. Based on the training data, the leaves of the tree contain the best predicted value.

Waveform Synthesis

When the user inputs text, the waveform is re-constructed. The voice developed with Festvox is selected by
Festival to run according to the input text. The sounds are derived by the parser and the features are
extracted by the linguistic part. In the statistical parametric technique, the features of the developed
model are mapped to the features of the input text, and at the end all of these are joined to develop the
waveform, as shown in Fig. 5. The basic constraints behind the selection of speech units are:

• The best-matching unit for the target unit is selected.
• The selected units must join smoothly when they are concatenated to reconstruct the waveform.
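The two constraints on unit selection, a good match to the target and a smooth join with the neighbouring unit, are commonly expressed as a target cost and a join cost that are minimized together. The following Python sketch shows one way to do this with a dynamic programming search. It is an illustration only: the representation of a unit as a (start pitch, end pitch) pair and both cost functions are assumptions, not the costs used by Festival.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Pick one candidate per target so that the summed cost is minimal.

    best[i][j] holds (cost of the cheapest path ending in candidates[i][j],
    index of the predecessor on that path).
    """
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            costs = [best[i - 1][k][0] + join_cost(p, c)
                     for k, p in enumerate(candidates[i - 1])]
            k_best = min(range(len(costs)), key=costs.__getitem__)
            row.append((costs[k_best] + target_cost(targets[i], c), k_best))
        best.append(row)

    # Trace the cheapest path back from the last position.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path


# Hypothetical costs: a unit is (start_pitch, end_pitch) in Hz.
def pitch_target_cost(target_pitch, unit):
    return abs(target_pitch - (unit[0] + unit[1]) / 2)


def pitch_join_cost(left, right):
    return abs(left[1] - right[0])


if __name__ == "__main__":
    targets = [100, 120]
    candidates = [[(95, 105), (100, 130)], [(130, 110), (106, 134)]]
    print(select_units(targets, candidates, pitch_target_cost, pitch_join_cost))
```

In the example both candidates for the second target match the target pitch equally well, so the join cost decides which one is chosen.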

Figure 5. Waveform Re-synthesis
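The decision tree described under Labelling and Model Development, with a binary query at each node and a predicted value at each leaf, and the evaluation of duration models by root mean square error and correlation coefficient, can be illustrated with a short Python sketch. The tree below is written by hand purely for illustration, whereas the tree in the article is trained from labelled data; the feature names `is_vowel` and `phrase_final` and all duration values are hypothetical.

```python
import math

# Hand-built toy tree: each node asks one yes/no question about a feature,
# each leaf stores a predicted duration in seconds (values are made up).
TREE = {
    "question": ("is_vowel", True),
    "yes": {"question": ("phrase_final", True), "yes": 0.14, "no": 0.09},
    "no": 0.06,
}


def predict(tree, features):
    """Walk from the root to a leaf, answering one binary query per node."""
    node = tree
    while isinstance(node, dict):
        name, value = node["question"]
        node = node["yes"] if features.get(name) == value else node["no"]
    return node


def rmse(predicted, actual):
    """Root mean square error between predicted and reference values."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def correlation(predicted, actual):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(actual)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    return cov / math.sqrt(var_p * var_a)


if __name__ == "__main__":
    phones = [
        {"is_vowel": True, "phrase_final": True},
        {"is_vowel": True, "phrase_final": False},
        {"is_vowel": False},
    ]
    reference = [0.15, 0.08, 0.05]  # hypothetical measured durations
    predicted = [predict(TREE, f) for f in phones]
    print(rmse(predicted, reference), correlation(predicted, reference))
```

A lower root mean square error and a correlation closer to 1 on the held-out test data indicate a better duration model.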


