Abstract
Text-to-speech (TTS) generation is the process of converting raw text into speech output.
This article presents the development of a Punjabi voice using the Festvox tool, with Festival used
as the engine to run the developed voice. The statistical parametric method of waveform generation is
used. The phoneme is chosen as the basic unit for speech synthesis. The corpus for the Punjabi language was
collected from newspapers and news channels. The recording was done in a noise-free environment.
All frameworks use the same basic methodology to develop synthesized speech output. The steps involved in
speech generation are text processing, linguistic processing and waveform generation, as outlined in Fig. 2.
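The three stages can be sketched as a chain of functions. This is only an illustrative toy under stated assumptions, not the Festival implementation: the function names, the one-unit-per-character mapping, the pause symbol "_" and the placeholder tones are all hypothetical, and only the 16 kHz sample rate is taken from the article.

```python
import math
import string
from typing import List

SAMPLE_RATE = 16000  # Hz; the 16 kHz format mentioned in the article

# Punctuation to strip; the danda is added by assumption for Punjabi text.
_PUNCT = str.maketrans({ch: " " for ch in string.punctuation + "\u0964"})


def text_processing(raw: str) -> List[str]:
    """Stage 1: strip punctuation and split the raw text into tokens."""
    return raw.translate(_PUNCT).split()


def linguistic_processing(tokens: List[str]) -> List[str]:
    """Stage 2: map tokens to synthesis units (toy rule: one unit per character)."""
    units: List[str] = []
    for token in tokens:
        units.extend(token)
        units.append("_")  # hypothetical pause unit after each word
    return units


def waveform_generation(units: List[str], unit_ms: int = 80) -> List[float]:
    """Stage 3: placeholder waveform, a short tone per unit and silence per pause."""
    n = SAMPLE_RATE * unit_ms // 1000  # samples per unit
    samples: List[float] = []
    for unit in units:
        if unit == "_":
            samples.extend([0.0] * n)
        else:
            freq = 200 + (ord(unit) % 50) * 10  # arbitrary pitch, not a real model
            samples.extend(
                math.sin(2 * math.pi * freq * i / SAMPLE_RATE) for i in range(n)
            )
    return samples


def synthesize(raw: str) -> List[float]:
    """Run the three stages in order."""
    return waveform_generation(linguistic_processing(text_processing(raw)))
```

In a real system each stage is far richer (number expansion, a pronunciation model, a trained acoustic model); the sketch shows only how the output of one stage feeds the next.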
1 Assistant Professor, GNA University, Phagwara, Punjab, India.
E-mail Id: gillsukhi92@mail.com
Orcid Id: http://orcid.org/0000-0002-6247-7571
How to cite this article: Gill SK. Text to Speech System for Punjabi Using Festival Framework. J Adv Res Appl Arti Intel Neural
Netw 2017; 4(1&2): 36-40.
Collection and Recording of Speech

To develop the Punjabi voice, a speech corpus of about 80 minutes was recorded. The recorded data is used to build the Punjabi synthesized voice. The corpus was collected in the form of sentences, and the corresponding speech was recorded in a noise-free studio. To be accepted by Festvox, the recordings were normalized to a 16 kHz format. The data is divided randomly into training data (90% of total segments) and test data (10% of total segments).⁵ The corpus was recorded by seven different female speakers, and the best of these voices was selected. For accurate automatic labelling, the silences at the start and end of each utterance were pruned, as were the silences in the middle of the utterances.

Selection of Basic Unit for Synthesis

Voice development begins with the selection of the basic unit for synthesis. Various units have been used in earlier research, such as phones, diphones and syllables.² In this article, the phoneme is chosen as the basic unit for synthesis. The resulting speech output is less accurate and, moreover, shows discontinuities at the joining points. However, the phoneme is the smallest unit and therefore requires the smallest speech corpus, whereas the storage space needed for the corpus is large when the syllable is the basic unit. For this reason the phoneme is chosen as the basic unit. Another reason is that a grapheme-based synthesizer is used in this article.

Labelling and Model Development

Labelling is an important part of every speech synthesis framework. Festival supports automatic labelling using the EHMM labeller, but this sometimes produces wrong results, so manual labelling is also performed. Labelling marks each phone unit in the corpus and assigns the corresponding sound to each phoneme unit. Manual labelling is a hard task but leads to accurate results. The WaveSurfer tool is used for manual labelling of the speech corpus; Fig. 4 shows the manual labelling of the speech corpus in WaveSurfer.
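The data preparation described under Collection and Recording of Speech (a random 90%/10% split and the pruning of leading and trailing silence) might look roughly like the following. This is a minimal sketch, not the Festvox scripts: the function names, the fixed random seed and the amplitude threshold are assumptions made for illustration, and the pruning of silences in the middle of an utterance is not covered.

```python
import random
from typing import Hashable, Iterable, List, Sequence, Tuple


def split_train_test(
    utt_ids: Iterable[Hashable], train_fraction: float = 0.9, seed: int = 0
) -> Tuple[List[Hashable], List[Hashable]]:
    """Randomly split utterance ids into training and test sets (90/10 by default)."""
    ids = list(utt_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    cut = round(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


def trim_silence(samples: Sequence[float], threshold: float = 0.01) -> List[float]:
    """Drop leading and trailing samples whose magnitude is below the threshold."""
    start, end = 0, len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return list(samples[start:end])
```

For example, `split_train_test(range(100))` yields 90 training ids and 10 test ids, and `trim_silence([0.0, 0.5, 0.0, -0.3, 0.0])` keeps only `[0.5, 0.0, -0.3]`. A practical silence detector would usually work on short-time energy over frames rather than on single samples.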
The next step is the extraction of features from the labelled data. The extracted features are then selected according to a maximum-likelihood criterion to build the cluster trees and the duration model. CART (classification and regression trees) is used to build the cluster trees. In the statistical parametric synthesis method, only a single decision tree is built. Each node of the tree asks a binary question about one feature, and the leaves of the tree hold the best predicted value based on the training data.

Waveform Synthesis

When the user inputs text, the waveform is reconstructed. Festival selects the voice developed with Festvox and runs it on the input text. The parser derives the sounds, and the linguistic component extracts the features. In the statistical parametric technique, the features of the developed model are matched against the features of the input text, and the results are finally joined to produce the waveform, as shown in Fig. 5. The basic constraints behind the selection of speech units are:

• The best-matching unit for each target unit is selected.
• The selected units must join smoothly when they are concatenated to reconstruct the waveform.
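The two constraints on unit selection are commonly formalised as a target cost (how well a candidate matches the target) and a join cost (how smoothly two adjacent candidates connect), minimised together with a dynamic-programming search. The sketch below illustrates that general idea only; it is not Festival's code. The `Unit` layout (one matching feature plus a value at each edge) and both cost functions are hypothetical simplifications.

```python
from typing import List, Sequence, Tuple

# Hypothetical unit: (feature to match, value at left edge, value at right edge)
Unit = Tuple[float, float, float]


def target_cost(target: float, unit: Unit) -> float:
    """How far the candidate's feature is from the target specification."""
    return abs(target - unit[0])


def join_cost(left: Unit, right: Unit) -> float:
    """Mismatch at the boundary between two consecutive units."""
    return abs(left[2] - right[1])


def select_units(
    targets: Sequence[float], candidates: Sequence[Sequence[Unit]]
) -> List[Unit]:
    """Pick one candidate per target, minimising total target + join cost."""
    if not targets:
        return []
    # best[i][j] = (cheapest cost of a path ending in candidates[i][j], back-pointer)
    best = [[(target_cost(targets[0], u), -1) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            prev_cost, prev_j = min(
                (best[i - 1][k][0] + join_cost(p, u), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((prev_cost + target_cost(targets[i], u), prev_j))
        best.append(row)
    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path: List[Unit] = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path
```

The search can prefer a slightly worse target match when it joins much more smoothly with its neighbour, which is exactly the trade-off the two constraints describe.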