
Festival TTS Training Material

TTS Group
Indian Institute of Technology Madras
Chennai - 600036
India

June 5, 2012

Contents

1 Introduction
    1.1 Nature of scripts of Indian languages
    1.2 Convergence and divergence

2 What is Text to Speech Synthesis?
    2.1 Components of a text-to-speech system
    2.2 Normalization of non-standard words
    2.3 Grapheme-to-phoneme conversion
    2.4 Prosodic analysis
    2.5 Methods of speech generation
        2.5.1 Parametric synthesis
        2.5.2 Concatenative synthesis
    2.6 Primary components of the TTS framework
    2.7 Screen readers for the visually challenged

3 Overall Picture

4 Labeling Tool
    4.1 How to Install LabelingTool
    4.2 Troubleshooting of LabelingTool

5 Labeling Tool User Manual
    5.1 How To Use Labeling Tool
    5.2 How to do label correction using Labeling tool
    5.3 Viewing the labelled file
    5.4 Control file
    5.5 Performance results for 6 Indian Languages
    5.6 Limitations of the tool

6 Unit Selection Synthesis Using Festival
    6.1 Cluster unit selection
    6.2 Choosing the right unit type
    6.3 Collecting databases for unit selection
    6.4 Preliminaries
    6.5 Building utterance structures for unit selection
    6.6 Making cepstrum parameter files
    6.7 Building the clusters

7 Building Festival Voice

8 Customizing festival for Indian Languages
    8.1 Parameters that were customized to deal with Indian languages in the festival framework
    8.2 Modifications in source code

9 Trouble Shooting in festival
    9.1 Troubleshooting (issues related with festival)
    9.2 Troubleshooting (issues that might occur while synthesizing)

10 ORCA Screen Reader

11 NVDA Windows Screen Reader
    11.1 Compiling Festival in Windows

12 SAPI compatibility for festival voice

13 Sphere Converter Tool
    13.1 Extraction of details from header of the input file
        13.1.1 Calculate sample minimum and maximum
        13.1.2 RAW Files
        13.1.3 MULAW Files
        13.1.4 Output in encoded format
    13.2 Config file

14 Sphere Converter User Manual
    14.1 How to Install the Sphere converter tool
    14.2 How to use the tool
    14.3 Fields in Properties
    14.4 Screenshot
    14.5 Example of data in the Config file (default values properties)
    14.6 Limitations of the tool

1 Introduction

This training is conducted for new members who have joined the TTS consortium. The main aim of the
TTS consortium is to develop text-to-speech (TTS) systems in all 22 official languages of India in
order to build screen readers, that is, spoken interfaces for information access that help visually challenged people use a computer with ease, and to make computing ubiquitous and inclusive.

1.1 Nature of scripts of Indian languages

The scripts in Indian languages have originated from the ancient Brahmi script. The basic units of
the writing system are referred to as Aksharas. The properties of Aksharas are as follows:
1. An Akshara is an orthographic representation of a speech sound in an Indian language.
2. Aksharas are syllabic in nature.
3. The typical forms of an Akshara are V, CV, CCV and CCCV, giving a generalized form of
C*V, where C denotes a consonant and V denotes a vowel.
As Indian languages are Akshara based, and an Akshara is a subset of a syllable, a syllable based unit
selection synthesis system has been built for Indian languages. Further, a syllable corresponds to
a basic unit of production, as opposed to the diphone or the phone. Earlier efforts by
consortium members, in particular IIIT Hyderabad and IIT Madras, indicate that
natural sounding synthesisers for Indian languages can be built using the syllable as the basic unit.

1.2 Convergence and divergence

The official languages of India, except English and Urdu, share a common phonetic base, i.e., they
share a common set of speech sounds. This common phonetic base consists of around 50 phones,
including 15 vowels and 35 consonants. While all of these languages share a common phonetic base,
some of them, such as Hindi, Marathi and Nepali, also share a common script known as
Devanagari, whereas languages such as Telugu, Kannada and Tamil have their own scripts.
The property that makes these languages unique can be attributed to the phonotactics of each
language rather than to the scripts and speech sounds. Phonotactics refers to the permissible
combinations of phones that can co-occur in a language. This implies that the distribution of
syllables encountered in each language is different. Another dimension in which the Indian languages
differ significantly is prosody, which includes the duration, intonation and prominence associated with
each syllable in a word or a sentence.

2 What is Text to Speech Synthesis?

A text-to-speech synthesis system converts text input to speech output. The conversion of text into
spoken form is deceptively nontrivial. A naive approach is to store the
basic sounds (also referred to as phones) of a language and concatenate them to produce a speech waveform. But natural
speech exhibits co-articulation, i.e., the effect of coupling two sounds together, and prosody at the syllable,
word, sentence and discourse level, which cannot be synthesised by simple concatenation of phones.
Another method often employed is to store a huge dictionary of the most common words. However,
such a method cannot synthesise the millions of names and acronyms that are not in the dictionary.
It also cannot generate appropriate intonation and duration for words in different
contexts. Thus a text-to-speech approach using phones provides flexibility but cannot produce
intelligible and natural speech, while word-level concatenation produces intelligible and natural
speech but is not flexible. To balance flexibility and intelligibility/naturalness,
sub-word units such as diphones, which capture the essential coarticulation between adjacent phones,
are used as suitable units in a text-to-speech system.

2.1 Components of a text-to-speech system

A typical architecture of a text-to-speech (TTS) system is shown in the figure below. The components of a text-to-speech system can be broadly categorized into text processing and methods of
speech generation.
Text processing: In the real world, the typical input to a text-to-speech system is text as available
in electronic documents, newspapers, blogs, emails etc. The text available in the real world is anything
but a sequence of words found in a standard dictionary. It contains several non-standard
words such as numbers, abbreviations, homographs and symbols built using punctuation characters,
such as the exclamation mark (!), smileys (:-)) etc. The goal of the text processing module is to process the input
text, normalize the non-standard words, predict the prosodic pauses and generate the appropriate
phone sequence for each of the words.

2.2 Normalization of non-standard words

Text in the real world contains words, such as IBM, CMU and MSN, whose pronunciation is typically
not found in dictionaries or lexicons. Such words are referred to as non-standard words
(NSW). The various categories of NSW are:
1. Numbers, whose pronunciation changes depending on whether they refer to currency, time,
telephone numbers, zip codes etc.
2. Abbreviations, contractions and acronyms such as ABC, US, approx., Ctrl-C, lb.
3. Punctuation patterns such as 3-4, +/-, and/or.
4. Dates, times, units and URLs.
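As an illustration of what the normalization step has to do, the following is a minimal, hypothetical Python sketch (it is not the Festival front end): it expands small integers into words and looks up a tiny abbreviation table. The word lists and abbreviation entries are invented for illustration; a real normalizer also needs context to disambiguate currency, dates, telephone numbers and so on.

import re

# Hypothetical, highly simplified normalizer for English text.
ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "zero ten twenty thirty forty fifty sixty seventy eighty ninety".split()
ABBREVIATIONS = {"approx.": "approximately", "lb.": "pounds"}

def number_to_words(n):
    """Spell out integers below 100; read larger numbers digit by digit."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else TENS[tens] + " " + ONES[ones]
    return " ".join(ONES[int(d)] for d in str(n))  # telephone-number style

def normalize(text):
    words = []
    for token in text.split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif re.fullmatch(r"\d+", token):
            words.append(number_to_words(int(token)))
        else:
            words.append(token)
    return " ".join(words)

print(normalize("The meeting lasted approx. 45 minutes"))
# The meeting lasted approximately forty five minutes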

2.3 Grapheme-to-phoneme conversion

Given the sequence of words, the next step is to generate a sequence of phones. For languages
such as Spanish, Telugu, Kannada, where there is a good correspondence between what is written
and what is spoken, a set of simple rules may often suffice. For languages such as English where
the relationship between the orthography and pronunciation is complex, a standard pronunciation
dictionary such as CMU-DICT is used. To handle unseen words, a grapheme-to-phoneme generator
is built using machine learning techniques.
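For a language with a largely phonetic orthography, such rules can be as simple as a character-to-phone lookup with longest-match handling of long vowels or conjuncts. The sketch below is hypothetical: the mapping table and phone names are invented (and shown in Latin transliteration) purely for illustration and are not the phone set used by Festival or the consortium voices.

# Hypothetical grapheme-to-phone rules for a phonetic script (transliterated).
LETTER_TO_PHONE = {
    "k": "k", "a": "a", "aa": "a:", "m": "m", "l": "l", "i": "i",
}

def grapheme_to_phonemes(word):
    """Greedy longest-match mapping from letters to phones."""
    phones, i = [], 0
    while i < len(word):
        # Prefer two-character graphemes (e.g. long vowels) over single ones.
        if word[i:i + 2] in LETTER_TO_PHONE:
            phones.append(LETTER_TO_PHONE[word[i:i + 2]])
            i += 2
        elif word[i] in LETTER_TO_PHONE:
            phones.append(LETTER_TO_PHONE[word[i]])
            i += 1
        else:
            i += 1  # unknown character: skip (a real system backs off to a learned model)
    return phones

print(grapheme_to_phonemes("kamaal"))   # ['k', 'a', 'm', 'a:', 'l']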

2.4 Prosodic analysis

Prosodic analysis deals with modeling and generating appropriate duration and intonation contours for the given text. This is inherently difficult since prosody is absent in text. For example,
the sentences "where are you going?", "where are you GOING?" and "where are YOU going?" have
the same text content but can be uttered with different intonation and duration to convey different
meanings. To predict appropriate duration and intonation, the input text needs to be analyzed.
This can be performed by a variety of algorithms including simple rules, example-based techniques
and machine learning algorithms. The generated duration and intonation contour can be used to
manipulate the context-insensitive diphones in diphone-based synthesis or to select an appropriate
unit in unit selection voices.
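To give a flavour of the "simple rules" option, the following hypothetical sketch assigns each phone a base duration and lengthens phrase-final syllables. The base values and the 1.3 stretch factor are invented for illustration and are not taken from Festival.

# Hypothetical rule-based duration model: base duration per phone (seconds),
# stretched at the end of a phrase. All values are purely illustrative.
BASE_DURATION = {"a": 0.09, "i": 0.08, "k": 0.05, "m": 0.06, "l": 0.06}
PHRASE_FINAL_STRETCH = 1.3

def predict_durations(syllables):
    """syllables is a list of syllables, each a list of phone symbols."""
    durations = []
    for index, syllable in enumerate(syllables):
        stretch = PHRASE_FINAL_STRETCH if index == len(syllables) - 1 else 1.0
        durations.append([round(BASE_DURATION.get(p, 0.07) * stretch, 3) for p in syllable])
    return durations

# Two syllables, the second being phrase-final and therefore lengthened.
print(predict_durations([["k", "a"], ["m", "a"]]))
# [[0.05, 0.09], [0.078, 0.117]]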

2.5 Methods of speech generation

The methods of conversion of phone sequence to speech waveform could be categorized into parametric, concatenative and statistical parametric synthesis.

2.5.1 Parametric synthesis

Parameters such as formants and linear prediction coefficients are extracted from the speech signal of
each phone unit. These parameters are modified at synthesis time to incorporate the co-articulation
and prosody of a natural speech signal. The required modifications are specified in terms of rules
which are derived manually from observations of speech data. These rules cover duration,
intonation, co-articulation and the excitation function. Examples of early parametric synthesis
systems are Klatt's formant synthesizer and MITalk.
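To make the formant-synthesis idea concrete, here is a minimal sketch (not Klatt's actual synthesizer), assuming NumPy and SciPy are available: a pulse train at a fixed pitch is passed through a cascade of second-order resonators placed at assumed formant frequencies. The formant frequencies and bandwidths are rough, textbook-style values for an /a/-like vowel, chosen only for illustration.

import numpy as np
from scipy.signal import lfilter

FS = 16000  # sampling rate in Hz

def resonator(freq_hz, bandwidth_hz):
    """Second-order digital resonator (Klatt-style difference equation)."""
    c = -np.exp(-2 * np.pi * bandwidth_hz / FS)
    b = 2 * np.exp(-np.pi * bandwidth_hz / FS) * np.cos(2 * np.pi * freq_hz / FS)
    a = 1 - b - c
    return [a], [1, -b, -c]   # numerator, denominator coefficients for lfilter

def synthesize_vowel(duration_s=0.5, f0=120, formants=((700, 130), (1200, 70), (2600, 160))):
    n = int(duration_s * FS)
    # Impulse train as a crude glottal source at pitch f0.
    source = np.zeros(n)
    source[::FS // f0] = 1.0
    signal = source
    for freq, bw in formants:          # cascade the formant resonators
        num, den = resonator(freq, bw)
        signal = lfilter(num, den, signal)
    return signal / np.max(np.abs(signal))

samples = synthesize_vowel()
print(len(samples), "samples of a crude /a/-like vowel")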
2.5.2 Concatenative synthesis

Derivation of rules in parametric synthesis is a laborious task. Also, the quality of speech synthesized
using traditional parametric synthesis is found to be robotic. This has led to the development
of concatenative synthesis, where examples of speech units are stored and used during synthesis.
Concatenative synthesis is based on the concatenation (or stringing together) of segments of
recorded speech. Generally, concatenative synthesis produces the most natural-sounding synthesized speech. However, differences between natural variations in speech and the nature of the
automated techniques for segmenting the waveforms sometimes result in audible glitches in the
output. There are three main sub-types of concatenative synthesis.
1. Unit selection synthesis - Unit selection synthesis uses large databases of recorded speech.
During database creation, each recorded utterance is segmented into some or all of the following: individual phones, diphones, half-phones, syllables, morphemes, words, phrases, and
sentences. Typically, the division into segments is done using a specially modified speech
recognizer set to a forced alignment mode with some manual correction afterward, using
visual representations such as the waveform and spectrogram. An index of the units in the
speech database is then created based on the segmentation and acoustic parameters like the
fundamental frequency (pitch), duration, position in the syllable, and neighboring phones. At
run time, the desired target utterance is created by determining the best chain of candidate
units from the database (unit selection). This process is typically achieved using a specially
weighted decision tree.
Unit selection provides the greatest naturalness, because it applies only a small amount of
digital signal processing (DSP) to the recorded speech. DSP often makes recorded speech
sound less natural, although some systems use a small amount of signal processing at the
point of concatenation to smooth the waveform. The output from the best unit-selection
systems is often indistinguishable from real human voices, especially in contexts for which the
TTS system has been tuned. However, maximum naturalness typically requires unit-selection
speech databases to be very large, in some systems ranging into the gigabytes of recorded
data, representing dozens of hours of speech. Also, unit selection algorithms have been known
to select segments from a place that results in less-than-ideal synthesis (e.g. minor words
become unclear) even when a better choice exists in the database. Recently, researchers have
proposed various automated methods to detect unnatural segments in unit-selection speech
synthesis systems. (A minimal sketch of the selection search is given after this list.)
2. Diphone synthesis - Diphone synthesis uses a minimal speech database containing all the
diphones (sound-to-sound transitions) occurring in a language. The number of diphones
depends on the phonotactics of the language: for example, Spanish has about 800 diphones,
and German about 2500. In diphone synthesis, only one example of each diphone is contained
in the speech database. At runtime, the target prosody of a sentence is superimposed on these
minimal units by means of digital signal processing techniques such as linear predictive coding,
PSOLA or MBROLA. Diphone synthesis suffers from the sonic glitches of concatenative

synthesis and the robotic-sounding nature of formant synthesis, and has few of the advantages
of either approach other than small size. As such, its use in commercial applications is
declining, although it continues to be used in research because there are a
number of freely available software implementations.
3. Domain-specific synthesis - Domain-specific synthesis concatenates prerecorded words and
phrases to create complete utterances. It is used in applications where the variety of texts
the system will output is limited to a particular domain, like transit schedule announcements
or weather reports. The technology is very simple to implement, and has been in commercial
use for a long time, in devices like talking clocks and calculators. The level of naturalness
of these systems can be very high because the variety of sentence types is limited, and they
closely match the prosody and intonation of the original recordings. Because these systems are
limited by the words and phrases in their databases, they are not general-purpose and can only
synthesize the combinations of words and phrases with which they have been preprogrammed.
The blending of words within naturally spoken language however can still cause problems
unless the many variations are taken into account.
For example, in non-rhotic dialects of English the "r" in words like "clear" is usually only
pronounced when the following word has a vowel as its first letter (e.g. in "clear out").
Likewise in French, many final consonants are no longer silent if followed by a
word that begins with a vowel, an effect called liaison. This alternation cannot be reproduced
by a simple word-concatenation system, which would require additional complexity to be
context-sensitive.
The speech units used in concatenative synthesis are typically at diphone level so that the
natural co-articulation is retained. Duration and intonation are derived either manually or
automatically from the data and are incorporated during synthesis time. Examples of diphone
synthesizers are Festival diphone synthesis and MBROLA. The possibility of storing more than
one example of a diphone unit, due to the increase in storage and computation capabilities, has led
to the development of unit selection synthesis. Multiple examples of a unit, along with the relevant
linguistic and phonetic context, are stored and used in unit selection synthesis. The quality
of unit selection synthesis is found to be more natural than that of diphone and parametric synthesis.
However, unit selection synthesis lacks consistency, i.e., the quality can vary noticeably across utterances.
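The "best chain of candidate units" search mentioned in the unit selection item above can be illustrated with a minimal dynamic-programming sketch. The target and join costs below are hypothetical stand-ins (a single prosodic mismatch term and a spectral-distance term at the joins); a real system such as Festival's clunits module uses richer features and cluster trees.

# Minimal Viterbi-style unit selection over candidate units.
# Each candidate unit is a dict with hypothetical fields 'f0', 'dur' and 'spec'
# (a spectral vector at its edge); these are placeholders, not Festival's features.

def target_cost(target, unit):
    """Mismatch between the requested prosody and a candidate unit."""
    return abs(target["f0"] - unit["f0"]) / 100.0 + abs(target["dur"] - unit["dur"])

def join_cost(prev_unit, unit):
    """Spectral discontinuity at the concatenation point."""
    return sum((a - b) ** 2 for a, b in zip(prev_unit["spec"], unit["spec"])) ** 0.5

def select_units(targets, candidates_per_target):
    """candidates_per_target[i] is the list of database units usable for target i."""
    best_cost = [target_cost(targets[0], u) for u in candidates_per_target[0]]
    back = [[None] * len(candidates_per_target[0])]
    for i in range(1, len(targets)):
        costs, pointers = [], []
        for u in candidates_per_target[i]:
            tc = target_cost(targets[i], u)
            total, prev = min(
                (best_cost[j] + join_cost(p, u) + tc, j)
                for j, p in enumerate(candidates_per_target[i - 1])
            )
            costs.append(total)
            pointers.append(prev)
        best_cost, back = costs, back + [pointers]
    # Trace back the cheapest chain of units.
    idx = min(range(len(best_cost)), key=best_cost.__getitem__)
    path = [idx]
    for i in range(len(targets) - 1, 0, -1):
        idx = back[i][idx]
        path.append(idx)
    return list(reversed(path))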

2.6 Primary components of the TTS framework

1. Speech Engine - One of the most widely used speech engines is eSpeak. eSpeak uses the formant
synthesis method, which allows many languages to be provided with a small footprint. The
speech synthesized is intelligible and provides quick responses, but lacks naturalness. The
demand is for a high quality, natural sounding TTS system. We have used the Festival speech
synthesis system developed at The Centre for Speech Technology Research, University of
Edinburgh, which provides a framework for building speech synthesis systems and offers
full text-to-speech support through a number of APIs. A large corpus based unit selection
paradigm has been employed. This paradigm is known to produce intelligible, natural sounding
speech output, but has a larger footprint.
2. Screen Readers - The role of a screen reader is to identify and interpret what is being displayed on the screen and transfer it to the speech engine for synthesis. JAWS is the most
popular screen reader used worldwide for Microsoft Windows based systems. But the main
drawback of this software is its high cost, approximately 1300 USD, whereas the average per
capita income in India is 1045 USD. Different open source screen readers are freely available.
We chose ORCA for Linux based systems and NVDA for Windows based systems. ORCA is
a flexible screen reader that provides access to the graphical desktop via user-customizable
combinations of speech, braille and magnification. ORCA supports the Festival GNOME
speech synthesizer and comes bundled with popular Linux distributions like Ubuntu and Fedora. NVDA is a free screen reader which enables vision impaired people to access computers
running Windows. NVDA is popular among the members of the AccessIndia community.
AccessIndia is a mailing list which provides an opportunity for visually impaired computer
users in India to exchange information as well as conduct discussions related to assistive
technology and other accessibility issues. NVDA has already been integrated with the Festival
speech engine by Olga Yakovleva.
3. Typing tool for Indian Languages - The typing tools map the QWERTY keyboard to Indian
language characters. Widely used tools to input data in Indian languages are the Smart Common Input Method (SCIM) and the inbuilt InScript keyboard, for Linux and Windows systems
respectively. The same have been used for our TTS systems as well.

2.7 Screen readers for the visually challenged

India is home to the world's largest visually challenged (VC) population. In today's digital world,
disability is often equated with inability. Little attention is paid to people with disabilities, and their social inclusion
and acceptance is always a challenge. The perceived inability of people with disability, the
perceived cost of special education and attitudes towards inclusive education are major constraints
on the effective delivery of education. Education is THE means of developing the capabilities of people
with disability, to enable them to develop their potential, become self-sufficient, escape poverty and
gain entry to fields previously denied to them. The aim of this project is to make
a difference in the lives of VC persons. VC persons need to depend on others to access common
information that others take for granted, such as newspapers, bank statements and scholastic
transcripts. Assistive technologies (AT) are necessary to enable physically challenged persons to
become part of the mainstream of society. A screen reader is an assistive technology potentially
useful to people who are visually challenged, visually impaired, illiterate or learning disabled, for
using standard computer software such as word processors, spreadsheets, email and the
Internet.
Before the start of this project, the Indian Institute of Technology, Madras (IIT Madras) had
been conducting a training programme for visually challenged people, to enable them to use the
computer with the screen reader JAWS, with English as the language. Although VC persons
have benefited from this programme, most of them felt that:
- The English accent was difficult to understand.
- Most students would have preferred a reader in their native language.
- They would prefer English spoken with an Indian accent.
- The price for the individual purchase of JAWS was very high.
Against this backdrop, it was felt imperative to build assistive technologies in the vernacular.
An initiative was taken by DIT, Ministry of Information Technology, to sponsor:
1. the development of natural sounding text-to-speech synthesis systems in different Indian languages, and
2. their integration with open source screen readers.

3 Overall Picture

1. Data Collection - Text crawled from a news site and a site for stories for children.
2. Cleaning up of Data - From the crawled data, sentences were picked to maximize syllable
coverage (a sketch of such a greedy selection is given after this list).
3. Recording - The sentences that were picked were then recorded in a studio which was a
completely noise-free environment.
4. Labeling - The wavefiles were then manually labeled using the semi-automatic labeling tool
to get accurate syllable boundaries.
5. Training - Using the wavefiles and their transcriptions, the Indian language unit selection
voice was built.

6. Testing - Using the voice built, a MOS test was conducted with visually challenged end users
as the evaluators.
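The sentence-picking step (item 2 above) is essentially a set-cover problem. The sketch below is a hypothetical greedy version: the syllabifier is a placeholder that treats every character as a "syllable" so that the example runs on plain text; a real syllabifier is language specific and was not part of this excerpt.

# Greedy selection of sentences to maximize syllable coverage.
def syllabify(word):
    return list(word)  # placeholder, not a real Indian-language syllabifier

def sentence_syllables(sentence):
    return {syl for word in sentence.split() for syl in syllabify(word)}

def select_sentences(corpus, max_sentences):
    covered, selected = set(), []
    remaining = list(corpus)
    while remaining and len(selected) < max_sentences:
        # Pick the sentence that contributes the most syllables not yet covered.
        best = max(remaining, key=lambda s: len(sentence_syllables(s) - covered))
        gain = sentence_syllables(best) - covered
        if not gain:
            break  # no remaining sentence adds new syllables
        covered |= gain
        selected.append(best)
        remaining.remove(best)
    return selected

print(select_sentences(["ka ma la", "ka ka ka", "pa ta ka"], 2))
# ['ka ma la', 'pa ta ka']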


4 Labeling Tool

It is a widely accepted fact that the accuracy of labeling of speech files has a great bearing
on the quality of unit selection synthesis. The process of manual labeling is a time-consuming and
daunting task. It is also not trivial to label waveforms manually at the syllable level. The DONLabel
labeling tool provides an automatic way of performing labeling, given an input waveform and the
corresponding text in utf8 format. The tool makes use of group delay based segmentation to provide
the segment boundaries. The size of the segment labels generated can vary from monosyllables to
polysyllables as the Window Scale Factor (WSF) parameter is varied from small to large values.
Our labeling process makes use of:
- the Ergodic HMM (EHMM) labeling procedure provided by Festival,
- the group delay based algorithm (GD), and
- the Vowel Onset Point (VOP) detection algorithm.
The labeling tool displays a panel which shows the segment boundaries estimated by the group delay
algorithm, another panel which shows the segment boundaries as estimated by the EHMM
process, and a panel for VOP, which shows how many vowel onset points are present between
each pair of segments provided by the group delay algorithm. This helps greatly in adjusting the labels
provided by the group delay algorithm, if necessary, by comparing the labeling outputs of both
the EHMM process and the VOP algorithm. By using VOP as an additional cue, manual intervention
during the labeling process can be eliminated. It would also improve the accuracy of the labels
generated by the labeling tool.
The tool works for 6 different Indian languages, namely:
Hindi
Tamil
Malayalam
Marathi
Telugu
Bengali
The tool also displays the text (utf8) in segmented format along with the speech file.

4.1 How to Install LabelingTool

1. Copy the html folder to /var/www folder. If www folder is not there in /var, create a
folder named www and extract the html folder into it. So we have the labelingTool code in
/var/www/html/labelingTool/
2. Install the Java compiler using the following command:
sudo apt-get install sun-java6-jdk
The following error may occur:
Reading package lists... Done
Building dependency tree
Reading state information... Done
Package sun-java6-jdk is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or is only available from
another source
E: Package sun-java6-jdk has no installation candidate


sudo apt-get install sun-java6-jre
The following error may occur:
Reading package lists... Done
Building dependency tree
Reading state information... Done
Package sun-java6-jre is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or is only available from
another source
E: Package sun-java6-jre has no installation candidate
One solution is:
sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
sudo add-apt-repository "deb http://ftp.debian.org/debian squeeze main contrib non-free"
sudo add-apt-repository "deb http://ppa.launchpad.net/chromium-daily/ppa/ubuntu/ lucid main"
sudo add-apt-repository "deb http://ppa.launchpad.net/flexiondotorg/java/ubuntu/ lucid main"
sudo apt-get update
The other solution is:
For Ubuntu 10.04 LTS, the sun-java6 packages have been dropped from the Multiverse section
of the Ubuntu archive. It is recommended that you use openjdk-6 instead.
If you cannot switch from the proprietary Sun JDK/JRE to OpenJDK, you can install the
sun-java6 packages from the Canonical Partner Repository. You can configure your system to
use this repository via the command line:
sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
sudo apt-get update
sudo apt-get install sun-java6-jre sun-java6-plugin
sudo apt-get install sun-java6-jdk
sudo update-alternatives --config java

For Ubuntu 10.10, the sun-java6 packages have been dropped from the Multiverse section of
the Ubuntu archive. It is recommended that you use openjdk-6 instead.
If you cannot switch from the proprietary Sun JDK/JRE to OpenJDK, you can install the
sun-java6 packages from the Canonical Partner Repository. You can configure your system to
use this repository via the command line:
sudo add-apt-repository "deb http://archive.canonical.com/ maverick partner"
sudo apt-get update
sudo apt-get install sun-java6-jre sun-java6-plugin
sudo apt-get install sun-java6-jdk
sudo update-alternatives --config java


If the above does not work (for other versions of Ubuntu), you can create a local repository as
follows:
cd /
wget https://github.com/flexiondotorg/oab-java6/raw/0.2.1/oab-java6.sh -O oab-java6.sh
chmod +x oab-java6.sh
sudo ./oab-java6.sh
and then run:
sudo apt-get install sun-java6-jdk
sudo apt-get install sun-java6-jre
Source :
https://github.com/flexiondotorg/oab-java6/blob/a04949f242777eb040150e53f4dbcd4a3ccb7568/
README.rst
3. Install PHP using the following command:
sudo apt-get install php5
4. Install apache2 using the following command:
sudo apt-get install apache2
Update the paths in the following file:
/etc/apache2/sites-available/default
Set all cgi-bin paths to /var/www/html/cgi-bin.
A sample default file is attached.
5. Install the speech tools using the following command:
sudo apt-get install speech-tools
6. Install tcsh using the following command:
sudo apt-get install tcsh
7. Enable JavaScript in the browser settings.
Use Google Chrome or Mozilla Firefox.
8. Install the Java plugin for the browser:
sudo apt-get install sun-java6-plugin
Create a symbolic link to the Java plugin libnpjp2.so using the following commands:
sudo ln -s /usr/lib/jvm/java-6-sun/jre/plugin/i386/libnpjp2.so /etc/alternatives/mozilla-javaplugin.so
sudo ln -s /etc/alternatives/mozilla-javaplugin.so /usr/lib/mozilla/plugins/libnpjp2.so
9. Give full permissions to the html folder:
sudo chmod -R 777 html/
10. Add the following code to /etc/java-6-sun/security/java.policy


grant {
permission java.security.AllPermission;
};
11. In the /var/www/html/labelingTool/jsrc/install file, make sure that the correct path to javac is provided
as per your installation, for example /usr/lib/jvm/java-6-sun-1.6.0.26/bin/javac.
The Java version is 1.6.0.26 here; it might be different in your installation. Check the path and
give the correct value.
12. Install the tool using the following command
Go to /var/www/html/labelingTool/jsrc and run the below command
sudo ./install
It might give the following output which is not an error.
Note: LabelingTool.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
13. Restart apache using the following command
sudo /etc/init.d/apache2 restart
14. Check whether Java applets are enabled in the browser by using the following link:
http://javatester.org/enabled.html
On that webpage, the LIVE box should display
"This web browser can indeed run Java applets"
(wait for some time for the display to appear).
In case it displays "This web browser can NOT run Java applets", there is some issue
with the Java applets. Please look up how to enable Java in your version of the browser and
fix the issue.
15. Replace the Pronunciation Rules.pl in the /var/www/html/labelingTool folder with your language-specific code (the name should remain the same, Pronunciation Rules.pl).
16. Open the browser and go to the following link
http://localhost/main.php
NOTE: The VOP algorithm is not used in the current version of the labelingTool, so please
ignore anything related to VOP in the sections below.

4.2 Troubleshooting of LabelingTool

1. When Labelingtool is working fine the following files will be generated in labelingTool/results
folder
boundary
segments
spec low
vop
wav sig
gd spectrum low
segments indicator
tmp.seg
vopsegments

2. When the boundaries are manually updated (deleted, added or moved) and saved, two more files
get created in the results folder:
ind upd
segments updated
3. If, after manually updating and saving, the vopUpdate button is clicked, another new
file gets created in the results folder:
vop updated
4. If a file named vop is not getting generated in labelingTool/results folder and the labelit.php
page is getting stuck, you need to compile the vop module.
Follow the below steps.
(a) cd /var/www/html/labelingTool/VopLab
(b) make -f MakeEse clean
(c) make -f MakeEse
(d) cd bin
(e) cp Multiple Vopd ../../bin/
5. If the above files are not getting created, you can try running the programs from the command line as follows.
Execute them from the /var/www/html/labelingTool/bin folder.
The command line usage of the WordsWithSilenceRemoval program is as follows:
WordsWithSilenceRemoval ctrlFile waveFile sigFile spectraFile boundaryFile thresSilence(ms) thresVoiced(ms)
Example:
./WordsWithSilenceRemoval fewords2.base /home/text 0001.wav ../results/spec ../results/boun 100 100
Two files named spec and boun have to be generated in the results folder.
If they are not created, try recompiling:
cd /var/www/html/labelingTool/Segmentation
make -f MakeWordsWithSilenceRemoval clean
make -f MakeWordsWithSilenceRemoval
cp bin/WordsWithSilenceRemoval /var/www/html/labelingTool/bin/
The command line usage of the Multiple Vopd program is as follows:
Multiple Vopd ctrlFile waveFile(in sphere format) segmentsFile vopFile
Example:
./Multiple Vopd fe-ctrl.ese ../results/wav ../results/segments ../results/vop
The file wav in the results folder is the sphere-format version of your input wavefile.


On running the Multiple Vopd binary, a file named vop has to be generated in the results folder.
6. If the file wav is not produced in the results folder, the speech tools are not installed.
How to check whether the speech tools are installed:
once the speech tools are installed, run
ch_wave -info <wave file name>
This command should give information about that wave file.
If the speech tools were installed along with festival and there is no link to them in
/usr/bin, please make a link in the /usr/bin folder pointing to the ch_wave binary.
7. How to check whether tcsh is installed:
type the command tcsh and a new prompt will appear.
8. Provide full permissions to the labelingTool folder and its sub-folders so that new files can
be created and updated without any permission issues.
(If required, the following commands can be used in the labelingTool folder:
chmod -R 777 *
chown -R root:root * )
9. The java.policy file should be updated as specified in the installation steps, otherwise it may
result in the error "Error writing Lab File".
10. When the lab file is viewed in the browser, if the utf8 text is not displayed, set the character
encoding of the browser to utf8:
Tools -> Options -> Content -> Fonts and Colors (Advanced menu) -> Default character encoding (utf8)
Restart the browser.


5 Labeling Tool User Manual

5.1 How To Use Labeling Tool

The front page of the tool can be accessed using the URL http://localhost/main.php. A screenshot
of the front page is shown below.

The front page has the following fields:

- The speech file in wav format should be provided. It can be browsed using the browse button.
- The corresponding utf8 text has to be provided in the text file. It can be browsed using the
  browse button. The text file that is uploaded should not have any special characters.
- The ehmm lab file generated by festival while building the voice can be provided as input. This
  is an optional field.
- The gd lab file generated by the labeling tool in a previous attempt to label the same file. This
  is an optional field. If the user had once labelled a file half way and saved the lab file, it can
  be provided as input here so as to label the rest of it or to correct the labels.
- The threshold for voiced segments has to be provided in the text box. It varies for each wav
  file. The value is in milliseconds (e.g. 100, 200, 50).
- The threshold for unvoiced segments has to be provided in the text box. It varies for each wav
  file. The value is in milliseconds (e.g. 100, 200, 50). If the speech file has very long silences,
  a high threshold value can be provided.
- WSF (window scale factor) can be selected from the drop-down list. The default value is 5.
  Depending on the output, the user may be required to change the WSF value to find the most
  appropriate value that provides the best segmentation possible for the speech file.
- The corresponding language can be selected using the radio button.
- Submit the details to the tool using the submit button.

A screenshot of the filled-up front page is given below.

Loading Page
On clicking the submit button on the front page, the following page will be displayed.

Validation for data entered

- If the loading of all files was successful and proper values were given for the thresholds on
  the front page, the message "Click View to see the results..." will be displayed as shown above.
- If the wave file was not provided on the front page, the following error will appear on the
  loading page: "Error uploading wav file. Wav file must be entered".
- If the text file was not provided on the front page, the following error will appear on the
  loading page: "Error uploading text file. Text file must be entered".
- If the threshold for voiced segments was not provided on the front page, the following error
  will appear on the loading page: "Threshold for voiced segments must be entered".
- If the threshold for unvoiced segments was not provided on the front page, the following error
  will appear on the loading page: "Threshold for unvoiced segments must be entered".
- If a numeric value is not entered for the thresholds of voiced or unvoiced segments on the
  front page, the following error will appear on the loading page: "Numeric value must be
  entered for thresholds".
- The wav file loaded will be copied to the /var/www/html/UploadDir folder as text.wav.
- The lab file (ehmm label file) will be copied to the /var/www/html/labelingTool/lab folder as
  temp.lab. If an error occurs while moving it to the lab folder, the following error will be
  displayed: "Error moving lab file."
- The gd lab file (group delay label file) will be copied to the /var/www/html/labelingTool/results
  folder with the name gd lab. If an error occurs while moving the file, the following error will
  be displayed: "Error moving gdlab file."
The Labelit Page
On clicking the view button on the loading page, the labelit page will be loaded. A screenshot of
this page, along with markings for each panel, is given below.

Note: If the error message "Error reading file http://localhost/labelingTool/tmp/temp.wav" appears, it means that some other file (e.g. a text file) was uploaded in place of the wav file.
Panels on the Labelit Page
It has 6 main panels:
- EHMM Panel: displays the lab files generated by festival using the EHMM algorithm while
  building voices.
- Slider Panel: using this panel we can slide, delete or add segments/labels.
- Wave Panel: displays the speech waveform in segmented format. (Note: the speech waveform
  does not appear exactly as in WaveSurfer; this is because of limitations in Java.)
- Text Panel: displays the segmented text (in utf8 format) with the syllable as the basic unit.
- GD Panel: draws the group delay curve, which is the result of the group delay algorithm.
  Wherever a peak appears is considered to be a segment boundary.
- VOP Panel: shows the number of vowel onset points found between each pair of segments
  provided by group delay. Here green corresponds to one vowel onset point, which means the
  segment boundary found by the group delay algorithm is correct. Red corresponds to zero
  vowel onset points, which means the segment boundary found by the group delay algorithm
  is wrong and that boundary needs to be deleted. Yellow corresponds to more than one vowel
  onset point; this means that between two boundaries found by the group delay algorithm
  there will be one or more additional boundaries.
Resegment
The WSF selected for this example is 5. A different WSF will provide a different set of
boundaries. The smaller the WSF, the greater the number of boundaries, and vice versa. To
experiment with different WSF values, select the WSF from the drop-down list and click
RESEGMENT. A screenshot for the same text (as in the above figure) with a greater WSF
selected is shown below.

The above figure shows the segmentation using WSF = 12; it gives fewer boundaries. The
figure below shows the same waveform with a smaller WSF (WSF = 3); it gives more
boundaries.

So the ideal WSF for the waveform has to be found. The easiest way is to check that the text
segments reach approximately the end of the waveform (not missing any text segments, nor
having many segments without text).
Menu Bar
The menu bar is just above the EHMM Panel, with the heading Waveform. The menu bar
contains the following buttons, in this order from left to right:
- Save button: The lab file can be saved using the save button. After making any changes to
  the segments (deletion, addition or dragging), the save button has to be clicked.
- Play the waveform: The entire wave file will be played on pressing this button.
- Play the selection: Select some portion of the waveform (say a segment) and play just
  that part using this button. This button can be used to verify each segment.
- Play from selection: Plays the waveform starting from the current selection to the end.
  Click the mouse on the waveform and a yellow line will appear to show the selection.
  On clicking this button, the file will be played from that selected point to the end.
- Play to selection: Plays the waveform from the beginning to the end of the current
  selection.
- Stop the playback: Stops the playing of the wave file.
- Zoom to fit: Displays the selected portion of the wave zoomed in.
- Zoom 1: Displays the entire wave.
- Zoom in: Zooms in on the wave.
- Zoom out: Zooms out on the wave.
- Update VOP Panel: After changing the segments (dragging, adding or deleting), the
  VOP algorithm is recalculated on the new set of segments on clicking this button. After
  making the changes, the save button must be pressed before updating the VOP panel.

Some screenshots are given below to demonstrate the use of the menu bar.
The figure below shows how to select a portion (drag using the mouse on the wave panel) of the waveform
and play that part. The selected portion appears shaded in yellow as shown.

The figure below shows how to select a point (click using the mouse on the wave panel) and play from
the selection to the end of the file. The selected point appears as a yellow line.

The next figure shows how to select a portion of the wave and zoom to fit.

The next figure shows how the portion of the wave file selected in the above figure is zoomed.

5.2 How to do label correction using Labeling tool

Each segment given by the group delay algorithm can be listened to, and with the help of the VOP
and EHMM panels the user can decide whether the segment is correct and whether it matches the
text segmentation.
Deletion of a Segment
All the segments appear as red lines in the labeling tool output. A segment can be deleted
by right clicking on that particular segment on the slider panel. The figure below shows the
original output of the labeling tool for the Hindi wave file.

The third and fourth segments are very close to each other and one has to be deleted. Ideally
we delete the fourth one. The VOP panel gives a red colour (an indication to delete one) for that
segment. The user can decide whether to delete the boundary to the right or to the left of the red segment after listening.
On deletion (right click on the slider panel on that segment head) of the fourth segment, the text
segments get shifted and fit after the silence segment, as shown in the figure below.


On listening to each segment, it is seen that the segment between the two syllables shown is wrong and
has to be deleted. The VOP panel gives a red colour for that segment, and the corresponding peak in the
group delay panel is below the threshold. Peaks below the threshold in the group delay curve
usually are not segment boundaries, but sometimes the algorithm computes one as a boundary.
The threshold value in the GD panel is the middle line in magenta colour.
There are two more red columns in the VOP panel. The last one is correct and we have to delete a
segment. The second-to-last red column in the VOP panel is incorrect and GD gives the correct segment;
hence it need not be deleted. The VOP is only used as a reference for the GD algorithm and can be
wrong in some cases. A yellow colour in the VOP panel usually indicates that a new segment should
be added, but here the yellow colour appears in the silence region and we ignore it.
The figure below shows the corrected segments (after deletion).


On completion of correcting the labels, the save button has to be pressed. On clicking the Save
button, a dialog box appears with the message "Lab File Saved. Click Next to Continue".
A silence segment gets deleted on clicking the right boundary of the silence segment.
Update VOP Panel
After saving the changes made to the labels, the VOP update button has to be clicked to
recalculate the VOP algorithm on the new segments. The updated output is shown in the figure
below.


Adding A Segment
A segment can be added by right clicking with the mouse on the slider panel at the point where
a segment needs to be added. The figure below shows a case in which a segment needs to be
added.

The VOP panel shows three yellow columns here, of which the second yellow column is true. The
GD plot shows a small peak in that segment, so we can be sure that the segment has to be
added at that peak. In the above figure it can be seen that the mouse is placed on the
slider panel at the location where the new segment is to be added. The figure below shows the
corresponding corrected wave file after the VOP update is done.
Sliding a Segment
A segment can be moved to left or right by clicking on the head of the segment boundary on
the slider panel and dragging left or right. Sliding can be used if required while correcting
the labels.

Modification of labfile: If a half-corrected lab file is already present (a gd lab file), upload it
from the ./labelingTool/labfiles directory using the gd lab file option on the main page.
Irrespective of the WSF value, the earlier lab file will be loaded. But if resegmentation is used,
the labels already present will be lost and regenerated based on the new WSF value. After
modification, when the Save button is pressed, the same labfile is updated, but a backup copy of
the lab file is created before updating.
Note: If the system creates a lab file with the same name as one that already exists in the labfiles
directory, it creates a backup copy of that file. The backup copy is hidden by default; to view it,
just press CTRL + h.


Logfiles: The tool generates a separate log file for each lab file (e.g. text0001.log) in the ./labelingTool/logfiles directory. Please clean this directory at regular intervals.

5.3 Viewing the labelled file

Once the corrections are made and the save button is clicked, the lab file is generated in the /var/www/html/labelingTool/labfiles
directory and can be viewed by clicking on the next link. On clicking next, the following message appears:
"Download the labfile: labfile". Click on the link labfile, and the lab file will appear in the browser window as shown below.

5.4 Control file

A control file is placed at the location /var/www/html/labelingTool/bin/fewords.base. The parameters in the control file are given below. These parameters can be adjusted by the user to get better
segmentation results.
- windowSize: size of the frame for energy computation.
- waveType: type of the waveform: 0 Sphere PCM, 1 Sphere Ulaw, 2 plain sample (one short
  integer per line), 3 RAW (sequence of bytes, each sample 8-bit), 4 RAW16 (two bytes/sample, big
  endian), 5 RAW16 (two bytes/sample, little endian), 6 Microsoft RIFF standard wav format.
- winScaleFactor: should be chosen based on the syllable rate; choose by trial and error.
- gamma: reduces the dynamic range of energy.
- fftOrder and fftSize: MUST be set to ZERO.
- frameAdvanceSamples: frameshift for energy computation.
- medianOrder: order of median smoothing for the group delay function; 1 ==> no smoothing.
- ThresEnergy, thresZero, thresSpectralFlatness: thresholds used for voiced/unvoiced detection.
  When a parameter is set to zero, it is NOT used. Examples were tested with ENERGY only.
- Sampling rate of the signal: required for giving boundary information in seconds.

5.5 Performance results for 6 Indian Languages

Testing was conducted on a set of test sentences for all 6 Indian languages, and the percentage of
correctness was calculated using the formula below. The calculations were done after the
segmentation was performed using the tool with the best WSF and threshold values.

Percentage of correctness = [1 - (number of insertions + number of deletions) / (total number of segments)] × 100

Language     Percentage of Correctness
Hindi        86.83%
Malayalam    78.68%
Telugu       85.40%
Marathi      80.24%
Bengali      77.84%
Tamil        77.38%

5.6 Limitations of the tool

- Zooming is not enabled for the VOP and EHMM panels.
- The waveform is not displayed exactly as in WaveSurfer.

6 Unit Selection Synthesis Using Festival

This chapter discusses some of the options for building waveform synthesizers using unit selection
techniques in Festival.
By unit selection we actually mean the selection of some unit of speech, which may be anything
from a whole phrase down to a diphone (or even smaller). Technically, diphone selection is a simple
case of this. However, typically what we mean is that, unlike diphone selection, in unit selection there is
more than one example of the unit, and some mechanism is used to select between them at run-time.
The theory is obvious, but the design of such systems, finding the appropriate selection criteria and
weighting the costs of the candidates is a non-trivial problem. Techniques like this
often produce very high quality, very natural sounding synthesis. However, they can also produce
some very bad synthesis when the database has unexpected holes and/or the selection costs fail.

6.1 Cluster unit selection

The idea is to take a database of general speech and try to cluster each phone type into groups
of acoustically similar units based on the (non-acoustic) information available at synthesis time,
such as phonetic context, prosodic features (F0 and duration) and higher level features such as
stressing, word position, and accents. The actual features used may easily be changed and experimented with, as can the definition of acoustic distance between the units in a cluster.
The basic processes involved in building a waveform synthesizer for the clustering algorithm are
as follows. A high level walkthrough of the scripts to run is given after these lower level details.
1. Collect the database of general speech.
2. Build the utterance structures.
3. Build coefficients for acoustic distances, typically some form of cepstrum plus F0, or some
pitch synchronous analysis (e.g. LPC).
4. Build distance tables, precalculating the acoustic distance between each unit of the same
phone type (a minimal sketch of such a distance is given after this list).
5. Dump selection features (phone context, prosodic, positional and whatever) for each unit
type.
6. Build cluster trees using wagon with the features and acoustic distances dumped by the
previous two stages.
7. Build the voice description itself.
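Step 4 above precalculates acoustic distances between candidate units of the same type. As a minimal illustration (not the actual implementation behind Festival's wagon/clunits tools), the sketch below compares two units' cepstral frame sequences by linearly aligning them to a common length and averaging the per-frame Euclidean distance; the frame count and the alignment scheme are simplifications chosen for illustration.

import numpy as np

def align_length(frames, n):
    """Linearly resample a (num_frames x num_coeffs) matrix to n frames."""
    src = np.linspace(0, len(frames) - 1, n)
    lower = np.floor(src).astype(int)
    upper = np.minimum(lower + 1, len(frames) - 1)
    frac = (src - lower)[:, None]
    return (1 - frac) * frames[lower] + frac * frames[upper]

def unit_distance(cep_a, cep_b, n_frames=20):
    """Mean Euclidean distance between two units' cepstral tracks after alignment."""
    a = align_length(cep_a, n_frames)
    b = align_length(cep_b, n_frames)
    return float(np.mean(np.linalg.norm(a - b, axis=1)))

# Toy example: two random "units" with different numbers of cepstral frames.
rng = np.random.default_rng(0)
print(unit_distance(rng.normal(size=(14, 12)), rng.normal(size=(18, 12))))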

6.2 Choosing the right unit type

Before you start you must make a decision about what unit type you are going to use. Note there
are two dimensions here. The first is size, such as phone, diphone, demi-syllable. The second is the
type itself, which may be simple phone, phone plus stress, phone plus word etc. The code here and the
related files basically assume the unit size is the phone. However, because you may also include a
percentage of the previous unit in the acoustic distance measure, this unit size is more effectively
phone plus previous phone, thus it is somewhat diphone like. The cluster method has no actual
restrictions on the unit size, it simply clusters the given acoustic units with the given features, but
the basic synthesis code is currently assuming phone sized units.


The second dimension, type, is very open and we expect that controlling this will be a good
method to attain high quality general unit selection synthesis. The parameter clunit_name_feat
may be used to define the unit type. The simplest conceptual example is the one used in the limited
domain synthesis. There we distinguish each phone with the word it comes from, thus a d from
the word limited is distinct from the d in the word domain. Such distinctions can hard partition
the space of phones into types that can be more manageable.
The decision of how to carve up that space depends largely on the intended use of the database.
The more distinctions you make, the less you depend on the clustering acoustic distance, but the more
you depend on your labels (and the speech) being (absolutely) correct. The mechanism to define
the unit type is through a (typically) user-defined feature function. In the given setup scripts this
feature function will be called lisp_INST_LANG_NAME::clunit_name. Thus the voice simply defines the function INST_LANG_NAME::clunit_name to return the unit type for the given segment.
If you wanted to make a diphone unit selection voice this function could simply be
(define (INST_LANG_NAME::clunit_name i)
  (string_append
   (item.name i)
   "_"
   (item.feat i "p.name")))
Thus the unit type would be the phone plus its previous phone. Note that the first part of a
unit name is assumed to be the phone name in various parts of the code, thus although you may
think it would be neater to return previousphone_phone, that would mess up some other parts of
the code.
In the limited domain case the word is attached to the phone. You can also consider some
demisyllable information or more to differentiate between different instances of the same phone.
The important thing to remember is that at synthesis time the same function is called to identify the unit type, which is used to select the appropriate cluster tree to select from. Thus you need
to ensure that if you use, say, diphones, your database really does have all diphones in it.

6.3 Collecting databases for unit selection

Unlike diphone databases, which are carefully constructed to ensure specific coverage, one of the
advantages of unit selection is that a much more general database is desired. However, although
voices may be built from existing data not specifically gathered for synthesis, there are still factors
about the data that will help make better synthesis.
As with diphone databases, the more cleanly and carefully the speech is recorded, the better the
synthesized voice will be. As we are going to be selecting units from different parts of the database,
the more similar the recordings are, the less likely bad joins will occur. However, unlike diphone
databases, prosodic variation is probably a good thing, as it is those variations that can make synthesis from unit selection sound more natural. Good phonetic coverage is also useful, at least phone
coverage if not complete diphone coverage. Also, synthesis using these techniques seems to retain
aspects of the original database. If the database is broadcast news stories, the synthesis from it
will typically sound like read news stories (or, more importantly, will sound best when it is reading
news stories).
Again the notes about recording the database apply, though it will sometimes be the case that
the database is already recorded and beyond your control; in that case you will always have something legitimate to blame for poor quality synthesis.

6.4 Preliminaries

Throughout our discussion we will assume the following database layout. It is highly recommended
that you follow this format, otherwise scripts and examples will fail. There are many ways to
organize databases and many of these choices are arbitrary; here is our arbitrary layout.
The basic database directory should contain the following directories:
bin/ Any database-specific scripts for processing. Typically this first contains a copy of
standard scripts that are then customized when necessary for the particular database.
wav/ The waveform files. These should be headered, one utterance per file, with a standard
name convention. They should have the extension .wav and the fileid consistent with all other files
through the database (labels, utterances, pitch marks etc).
lab/ The segmental labels. These are usually the master label files; they may contain more
information than the labels used by festival, which will be in festival/relations/Segment/.
lar/ The EGG files (laryngograph files), if collected.
pm/ Pitchmark files as generated from the lar files or from the signal directly.
festival/ Festival-specific label files.
festival/relations/ The processed label files for building Festival utterances, held in directories whose names reflect the relation they represent: Segment/, Word/, Syllable/ etc.
festival/utts/ The utterance files as generated from the festival/relations/ label files.
Other directories will be created for various processing reasons.

6.5 Building utterance structures for unit selection

In order to make access well defined you need to construct Festival utterance structures for each
of the utterances in your database. This (in its basic form) requires labels for segments, syllables,
words, phrases, F0 targets, and intonation events. Ideally these should all be carefully hand labeled,
but in most cases that is impractical. There are ways to automatically obtain most of these labels,
but you should be aware of the inherent errors in the labeling system you use (including labeling
systems that involve human labelers). Note that when a unit selection method fundamentally uses
segment boundaries, its quality is ultimately determined by the quality of the segmental labels in
the database.
For the unit selection algorithm described below, the segmental labels should use the same
phoneset as the actual synthesis voice. However, a more detailed phonetic labeling may be more
useful (e.g. marking closures in stops), mapping that information back to the phone labels before
actual use. Autoaligned databases are typically not accurate enough for use in unit selection.
Most autoaligners are built using speech recognition technology, where actual phone boundaries are
not the primary measure of success. General speech recognition systems primarily measure words
correct (or, more usefully, semantically correct) and do not require phone boundaries to be accurate.
If the database is to be used for unit selection it is very important that the phone boundaries
are accurate. Having said this, we have successfully used the aligner described in the diphone
chapter above to label general utterances where we knew which phone string we were looking for.
Using such an aligner may be a useful first pass, but the result should always be checked by hand.
It has been suggested that aligning techniques and unit selection training techniques can be used
to judge the accuracy of the labels and basically exclude any segments that appear to fall outside
the typical range for the segment type. Thus it is believed that unit selection algorithms should
be able to deal with a certain amount of noise in the labeling. That is the hope of researchers in
the field, but we are some way from it, and at present the easiest way to improve the quality of
unit selection is to ensure that the segmental labeling is as accurate as possible. Once we have a
better handle on the selection techniques themselves it will become possible to start experimenting
with noisy labeling.
However, it should be added that this unit selection technique (and many others) supports what
is termed optimal coupling, where the acoustically most appropriate join point is found automatically
at run time when two units are selected for concatenation. This technique is inherently robust to
boundary labeling errors of at least a few tens of milliseconds.
For the cluster method defined here it is best to construct more than simply segments, durations
and an F0 target. A full syllabic structure plus word boundaries, intonation events and phrasing
allow a much richer set of features to be used for clustering. See the section called Utterance building
in the chapter called A Practical Speech Synthesis System for a more general discussion of how to
build utterance structures for a database.
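In the festvox build described in Section 7, these utterance structures are generated from the
festival/relations/ label files by the build_utts step; for reference, the invocation (using the
festvox prompt-file convention) is:

    # build Festival utterance structures for every fileid listed in the prompt file
    festival -b festvox/build_clunits.scm '(build_utts "etc/txt.done.data")'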

6.6 Making cepstrum parameter files

In order to cluster similar units in a database we build an acoustic representation of them. This
is also still a research issue, but in the example here we will use Mel cepstrum. Interestingly, we
do not generate these at fixed intervals but at pitch marks; thus we have a parametric spectral
representation of each pitch period. We have found this a better method, though it does require
that pitchmarks are reasonably well identified.
Here is an example script which will generate these parameters for a database; it is included in
festvox/src/unitsel/make_mcep.
# loop over the .wav files given on the command line and write one .mcep file each
for i in $*
do
   fname=`basename $i .wav`
   echo $fname MCEP
   $SIG2FV $SIG2FVPARAMS -otype est_binary $i -o mcep/$fname.mcep -pm pm/$fname.pm -window_type hamming
done
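Assuming the script has been copied into the database's bin/ directory (as in the layout of
Section 6.4), it is run over all the waveforms; a typical invocation is:

    # compute MCEP coefficients for every utterance, writing mcep/<fileid>.mcep
    ./bin/make_mcep wav/*.wav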
The above builds coefficients at fixed frames. We have also experimented with building parameters
pitch synchronously and have found a slight improvement in the usefulness of the measure based
on this. We do not pretend that this part of the system is particularly neat, but it does work.
When pitch-synchronous parameters are built, the clunits module will automatically put the local
F0 value in coefficient 0 at load time. This happens to be appropriate for LPC coefficients. The
script in festvox/src/general/make_lpc can be used to generate the parameters, assuming you have
already generated pitch marks.
Note that a secondary advantage of using LPC coefficients is that they are required anyway for
LPC resynthesis, so less information about the database is required at run time. We have not yet
tried pitch-synchronous Mel frequency cepstral coefficients, but that should be tried. Also, a more
general duration/number-of-pitch-periods match algorithm is worth defining.

6.7 Building the clusters

Cluster building is mostly automatic. Of course you need the clunits module compiled into your
version of Festival. Version 1.3.1 or later is required; the version of clunits in 1.3.0 is buggy and
incomplete and will not work. To compile in clunits, add
ALSO_INCLUDE += clunits
to the end of your festival/config/config file and recompile. To check whether an installation
already has support for clunits, check the value of the variable *modules*.
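A minimal sketch of enabling and checking the module, assuming the Festival source tree is at
./festival and the rebuilt binary is on your PATH:

    # compile the clunits module into Festival
    echo 'ALSO_INCLUDE += clunits' >> festival/config/config
    (cd festival && make)
    # clunits should now appear in the list of compiled-in modules
    festival -b '(print *modules*)'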
The file festvox/src/unitsel/build_clunits.scm contains the basic parameters to build a cluster
model for a database that has utterance structures and acoustic parameters. The function
build_clunits will build the distance tables, dump the features and build the cluster trees. Many
parameters are set for the particular database (and instance of cluster building) through the Lisp
variable clunits_params. A reasonable set of defaults is given in that file, and reasonable runtime
parameters will be copied into festvox/INST_LANG_VOX_clunits.scm when a new voice is set up.
The function build_clunits runs through all the steps, but in order to better explain what is going
on, we will go through each step and explain which parameters affect it.
The first stage is to load in all the utterances in the database, sort them into segment types and
name each unit individually (as TYPE_NUM). This first stage is required for all other stages, so
even if you are not running build_clunits you still need to run this stage first. This is done by
the calls
(format t "Loading utterances and sorting types\n")
(set! utterances (acost:db_utts_load dt_params))
(set! unittypes (acost:find_same_types utterances))
(acost:name_units unittypes)
though the function build_clunits_init will do the same thing.
This stage uses the following parameters:
name STRING
A name for this database.
db_dir FILENAME
The pathname of the database, typically "." as in the current directory.
utts_dir FILENAME
The directory containing the utterances.
utts_ext FILENAME
The file extension for the utterance files.
files
The list of file ids in the database.
For the KED example these parameters are
(name 'ked_timit)
(db_dir "/usr/awb/data/timit/ked/")
(utts_dir "festival/utts/")
(utts_ext ".utt")
(files ("kdt_001" "kdt_002" "kdt_003" ... ))
In the examples below the list of fileids is extracted from the given prompt file at call time. The
next stage is to load the acoustic parameters and build the distance tables. The acoustic distance
between each segment of the same type is calculated and saved in the distance table. Precalculating
this saves a lot of time, as the clustering will require these distances many times.
This is done by the following two function calls
(format t "Loading coefficients\n")
(acost:utts_load_coeffs utterances)
(format t "Building distance tables\n")
(acost:build_disttabs unittypes clunits_params)
The following parameters influence this behaviour:
coeffs_dir FILENAME
The directory (relative to db_dir) that contains the acoustic coefficients as generated by the script
make_mcep.
coeffs_ext FILENAME
The file extension for the coefficient files.
get_stds_per_unit
Takes the value t or nil. If t, the parameters for each segment type are normalized using the means
and standard deviations for that class; thus a Mahalanobis-like distance is found between units
rather than a simple Euclidean distance. The recommended value is t.
ac_left_context FLOAT
The amount of the previous unit to be included in the distance: 1.0 means all, 0.0 means none.
This parameter may be used to make the acoustic distance sensitive to the previous acoustic
context. The recommended value is 0.8.


dur_pen_weight FLOAT
The penalty factor for duration mismatch between units.
f0_pen_weight FLOAT
The penalty factor for F0 mismatch between units.
ac_weights (FLOAT FLOAT ...)
The weights for each parameter in the coefficient files, used when finding the acoustic distance
between segments. There must be the same number of weights as there are parameters in the
coefficient files. The first parameter is (in normal operation) F0. It is common to give proportionally
more weight to F0 than to each individual other parameter. The remaining parameters are typically
MFCCs (and possibly delta MFCCs). Finding the right parameters and weightings is one of the
key goals in unit selection synthesis, so it is not easy to give concrete recommendations. The
following are not bad, but there may be better ones too, though we suspect that real human
listening tests are probably the best way to find better values.
An example is
(coeffs_dir "mcep/")
(coeffs_ext ".mcep")
(dur_pen_weight 0.1)
(get_stds_per_unit t)
(ac_left_context 0.8)
(ac_weights (0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5))
The next stage is to dump the features that will be used to index the clusters. Remember, the
clusters are defined with respect to the acoustic distance between each unit in the cluster, but they
are indexed by these features. These features are those which will be available at text-to-speech
time, when no acoustic information is available. Thus they include things like phonetic and prosodic
context rather than spectral information. The named features may (and probably should) be
over-general, allowing the decision tree building program wagon to decide which of these features
actually have an acoustic distinction in the units.
The function to dump the features is
(format t "Dumping features for clustering\n")
(acost:dump_features unittypes utterances clunits_params)
The parameters which affect this function are:
feats_dir FILENAME
The directory where the features will be saved (by segment type).
feats LIST
The list of features to be dumped. These are standard Festival feature names with respect to the
Segment relation.
For our KED example these values are
(feats_dir "festival/feats/")
(feats
   (occurid
    p.name p.ph_vc p.ph_ctype
    p.ph_vheight p.ph_vlng
    p.ph_vfront p.ph_vrnd
    p.ph_cplace p.ph_cvox
    n.name n.ph_vc n.ph_ctype
    n.ph_vheight n.ph_vlng
    n.ph_vfront n.ph_vrnd
    n.ph_cplace n.ph_cvox
    segment_duration seg_pitch p.seg_pitch n.seg_pitch
    R:SylStructure.parent.stress
    seg_onsetcoda n.seg_onsetcoda p.seg_onsetcoda
    R:SylStructure.parent.accented
    pos_in_syl
    syl_initial
    syl_final
    R:SylStructure.parent.syl_break
    R:SylStructure.parent.R:Syllable.p.syl_break
    pp.name pp.ph_vc pp.ph_ctype
    pp.ph_vheight pp.ph_vlng
    pp.ph_vfront pp.ph_vrnd
    pp.ph_cplace pp.ph_cvox))
Now that we have the acoustic distances and the feature descriptions of each unit, the next stage
is to find a relationship between those features and the acoustic distances. This we do using the
CART tree builder wagon. It will find questions about which features best minimize the acoustic
distance between the units in each class. wagon has many options, many of which are appropriate
to this task, though it is worth noting that this learning task is closed: we are trying to classify
all the units in the database, so there is no test set as such. However, in synthesis there will be
desired units whose feature vectors did not exist in the training set.
The clusters are built by the following function
(format t "Building cluster trees\n")
(acost:find_clusters (mapcar car unittypes) clunits_params)
The parameters that affect the tree building process are:
trees_dir FILENAME
The directory where the decision tree for each segment type will be saved.
wagon_field_desc LIST
The filename of a wagon field descriptor file. This is a standard field description (field name plus
field type) that is required by wagon. An example is given in festival/clunits/all.desc, which should
be sufficient for the default feature list, though if you change the feature list (or the values those
features can take) you may need to change this file.
wagon_progname FILENAME
The pathname of the wagon CART building program. This is a string and may also include any
extra parameters you wish to give to wagon.
wagon_cluster_size INT
The minimum cluster size (the wagon stop value).
prune_reduce INT
The number of elements to remove from each cluster in pruning. This removes the units in the
cluster that are furthest from the center, and is done within the wagon training.
cluster_prune_limit INT
This is a post-wagon-build operation on the generated trees (and perhaps a more reliable method
of pruning). It defines the maximum number of units that will be in a cluster at a tree leaf;
wagon_cluster_size defines the minimum size. This is useful when there are large numbers of some
particular unit type which cannot be differentiated; silence segments with no distinguishing context
are a typical example. Another use is to cause only the most central example units to be used. We
have used this when building diphone databases from general databases, by making the selection
features include only phonetic context features and then restricting the number of diphones we
take by setting this number to 5 or so.
unittype_prune_threshold INT
When making complex unit types this defines the minimal number of units of that type required
before building a tree. When doing cascaded unit selection synthesizers it is often not worth
excluding large stages if there is, say, only one example of a particular demisyllable.
Note that as the distance tables can be large, there is an alternative function that does both the
distance table and the clustering in one step, deleting each distance table immediately after use,
so you only need enough disk space for the largest number of phones in any type. To do this, call
(acost:disttabs_and_clusters unittypes clunits_params)
and remove the calls to acost:build_disttabs and acost:find_clusters.
In our KED example these parameters have the values
(trees_dir "festival/trees/")
(wagon_field_desc "festival/clunits/all.desc")
(wagon_progname "/usr/awb/projects/speech_tools/bin/wagon")
(wagon_cluster_size 10)
(prune_reduce 0)
The final stage in building a cluster model is to collect the generated trees into a single file and to
dump the unit catalogue, i.e. the list of unit names together with their files and positions in them.
This is done by the Lisp calls
(acost:collect_trees (mapcar car unittypes) clunits_params)
(format t "Saving unit catalogue\n")
(acost:save_catalogue utterances clunits_params)
The only parameter that affects this is
catalogue_dir FILENAME
the directory where the catalogue will be saved (the name parameter is used to name the file). By
default this is
(catalogue_dir "festival/clunits/")
There are also a number of parameters that are specified with a cluster voice. These are related to
the run-time aspects of the cluster model:
join_weights FLOATLIST
A set of weights, in the same format as ac_weights, that are used in optimal coupling to find the
best join point between two candidate units. This is kept separate from ac_weights as it is likely
that different values are desired, particularly an increased weight for the F0 value (column 0).
continuity_weight FLOAT
The factor by which the join cost is multiplied relative to the target cost. This is probably not very
relevant given that the target cost is merely the position from the cluster center.
log_scores 1
If specified, the join scores are converted to logs. For databases that have a tendency to contain
non-optimal joins (probably any non-limited-domain database), this may be useful to stop failed
synthesis of longer sentences. The problem is that the sum of very large numbers can overflow;
this helps reduce that. You could alternatively change continuity_weight to a number less than 1,
which would also partially help. However, such overflows are often a pointer to some other problem
(poor distribution of phones in the database), so this is probably just a hack.
optimal_coupling INT
If 1, optimal coupling is used and the cepstrum vectors around each join point are searched to find
the best possible join point. This is computationally expensive (as well as requiring lots of cepstrum
files to be loaded), but does give better results. If the value is 2, only the coupling distance at the
given boundary is checked (the boundary is not moved); this is often adequate in good databases
(e.g. limited domain) and is certainly faster.
extend_selections INT
If 1, the selected cluster will be extended to include any unit from the cluster of the previous
segment's candidate units that has the correct phone type (and is not already included in the
current cluster). This is experimental but has shown its worth and hence is recommended. It means
that instead of selecting just units, selection is effectively selecting the beginnings of multiple-segment
units. This option encourages far longer units.
pm_coeffs_dir FILENAME
The directory (relative to db_dir) where the pitchmarks are.
pm_coeffs_ext FILENAME
The file extension for the pitchmark files.
sig_dir FILENAME
The directory containing the waveforms of the units (or residuals if residual LPC is being used,
PCM waveforms if PSOLA is being used).
sig_ext FILENAME
The file extension for the waveforms/residuals.
join_method METHOD
Specifies the method used for joining the selected units. Currently it supports simple, a very naive
joining mechanism, and windowed, where the ends of the units are windowed using a Hamming
window and then overlapped (no prosodic modification takes place though). The other two possible
values are none, which does nothing, and modified_lpc, which uses the standard UniSyn module to
modify the selected units to match the targets.
clunits_debug 1/2
With a value of 1, some debugging information is printed during synthesis, in particular how many
candidate phones are available at each stage (and any extended ones), and where each phone is
coming from.
With a value of 2, more debugging information is given, including the above plus the joining costs
(which are very readable by humans).
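These run-time values end up in the voice's festvox/..._clunits.scm file (Section 8.2 lists the
values we change for Indian-language voices). A quick way to inspect what a voice currently uses,
taking the Telugu example voice from Section 7 as the file name:

    # show the run-time cluster settings defined for a voice
    grep -n -e optimal_coupling -e continuity_weight -e extend_selections \
            -e join_method -e log_scores festvox/iiit_tel_syllable_clunits.scm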

7 Building Festival Voice

In the context of Indian languages, syllable units are found to be a much better choice than units
like phones, diphones, and half-phones. Unlike most foreign languages, in which the basic unit of
the writing system is the alphabet, Indian language scripts use the syllable as the basic linguistic
unit. The syllabic writing in Indic scripts is based on the phonetics of linguistic sounds, and the
syllabic model is generic to all Indian languages. A syllable is typically of one of the following forms:
V, CV, VC, CCV, CCCV, and CCVC, where C is a consonant and V is a vowel. A syllable can be
represented as C*VC*, containing at least one vowel and zero, one or more consonants. The
following steps explain how to build a syllable-based synthesis voice using FestVox.
1. Create a directory and enter the directory:
   $ mkdir iiit_tel_syllable
   $ cd iiit_tel_syllable
2. Create the voice setup:
   $ $FESTVOXDIR/src/unitsel/setup_clunits iiit tel syllable
   $ $FESTVOXDIR/src/prosody/
   Before running build_prompts, do the following steps:
   (a) Modify your phoneset to use syllables as phonemes in the phoneset file.
   (b) Modify the phoneme label files to use syllable labels.
   (c) Remove special symbols from the tokenizer.
   (d) Call your pronunciation dictionary module from festvox/iiit_tel_syllable_lexicon.scm.
   (e) The last modification is to change the default phoneset to your language's unique syllables
       in the festival/clunits/all.desc file under the p.name field.
3. Generate prompts:
   festival -b festvox/build_clunits.scm '(build_prompts "etc/txt.done.data")'
4. Record prompts:
   ./bin/prompt_them etc/time.data
5. Label automatically:
   $FESTVOXDIR/src/ehmm/bin/do_ehmm help
   and run the following steps individually: setup, phseq, feats, bw, align
6. Generate pitch marks:
   bin/make_pm_wave wav/*.wav
7. Correct the pitch marks:
   bin/make_pm_fix pm/*.pm
   Tuning pitch marks:
   (a) Convert pitch marks into label format:
       ./bin/make_pmlab_pm pm/*.pm
   (b) After modifying the pitch marks, convert label format back into pitch marks:
       ./bin/make_pm_pmlab pm_lab/*.lab
8. Generate Mel cepstral coefficients:
   bin/make_mcep wav/*.wav
9. Generate the utterance structures:
   festival -b festvox/build_clunits.scm '(build_utts "etc/txt.done.data")'

10. Open festival/clunits/all.desc and add all the syllables to the p.name field.
    Cluster the units:
    festival -b festvox/build_clunits.scm '(build_clunits "etc/txt.done.data")'
11. Open bin/make_dur_model, remove the stepwise option, and run:
    ./bin/do_build do_dur
12. Test the voice:
    festival festvox/iiit_tel_syllable_clunits.scm '(voice_iiit_tel_syllable_clunits)'
    To synthesize a sentence:
    If you are building the voice on a local machine:
    (SayText "your text")
    If you are running the voice on a remote machine:
    (utt.save.wave (utt.synth (Utterance Text "your text")) "test.wav")
    If you want to see the selected units, run the following commands:
    (set! utt (SayText "your text"))
    (clunits::units_selected utt "filename")
    (utt.save.wave utt "filename" 'wav)
    A consolidated sketch of these steps is given below.
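The sketch below strings steps 3 to 10 together once the phoneset, lexicon and label files are in
place; the voice name iiit_tel_syllable is the example used above, and the ehmm labelling
sub-steps are only indicated:

    # condensed build for a syllable-based clunits voice (see the numbered steps above)
    festival -b festvox/build_clunits.scm '(build_prompts "etc/txt.done.data")'
    ./bin/prompt_them etc/time.data        # record prompts, or copy existing wav/ files in
    # ... run the ehmm sub-steps here: setup, phseq, feats, bw, align ...
    ./bin/make_pm_wave wav/*.wav
    ./bin/make_pm_fix pm/*.pm
    ./bin/make_mcep wav/*.wav
    festival -b festvox/build_clunits.scm '(build_utts "etc/txt.done.data")'
    festival -b festvox/build_clunits.scm '(build_clunits "etc/txt.done.data")'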

8 Customizing festival for Indian Languages

8.1 Parameters customized for Indian languages in the Festival framework

Cluster size It is one of the parameters to be adjusted while building a tree. If the number of
nodes for each branch of a tree is very large, it takes more time to synthesize speech, as more time
is required to search for the appropriate unit. We therefore limit the size of each branch of the
tree by specifying the maximum number of nodes, which is denoted by the cluster size. When the
tree is built, the cluster size is limited by putting the clustered set of units through a larger set of
questions to limit the number of units being clustered as one type.
Duration penalty weight (dur_pen_weight) While synthesizing speech, the duration of each unit
being picked is also important, as units of very different durations clustered together make for very
unpleasant listening. The dur_pen_weight parameter specifies how much importance should be
given to the duration of the unit when the synthesizer is trying to pick units for synthesis. A high
value of dur_pen_weight means a unit very similar in duration to the required unit is picked;
otherwise not much importance is given to duration and more weight is given to other features of
the unit.
F0 penalty weight (f0_pen_weight) When listening to synthesized speech, an abrupt change in
pitch between units is not very pleasing to the ear. The f0_pen_weight parameter specifies how
much importance is given to F0 while selecting a unit for synthesis. The F0 is calculated at the
center of the unit, which is approximately where the vowel lies and which plays a major role in the
F0 contour of the unit. We therefore try to select units which have similar values of F0, to avoid
fluctuations in the F0 contour of the synthesized speech.
ac_left_context In speech, the way a particular unit is spoken depends a lot on the preceding and
succeeding units, i.e. the context in which the unit is spoken. Usually a unit is picked based on
what the succeeding unit is. The ac_left_context parameter specifies the importance given to
picking a unit based on what the preceding unit was. These four settings are all made in
festvox/build_clunits.scm; a sketch of adjusting them is given at the end of this section, and the
concrete values we use appear in item 3 of Section 8.2.
Phrase markers It is very hard to make sense of something that is said without any pauses. It is
therefore important to have pauses at the ends of phrases to make what is spoken intelligible.
Hindi has certain units called phrase markers which usually mark the end of a phrase. For the
purpose of inserting silences at the ends of phrases, these phrase markers were identified and a
silence was inserted each time one of them was encountered.
Morpheme tags There are no phrase markers in Tamil, but there are units called morpheme tags,
found at the ends of words, which can be used to predict silences. The voice was built using these
tags to predict phrase-end silences while synthesizing speech.
Handling silences Since there are a large number of silences in the database, a silence of the
wrong duration in the wrong place is a common problem. There is a chance that a long silence, or
an extremely short one, is inserted at the end of a phrase, which sounds very inappropriate. The
silence units were therefore divided into two types: SSIL, the silence at the end of a phrase, and
LSIL, the silence at the end of a sentence. The silence at the end of a phrase is of short duration,
while the silence at the end of a sentence is of long duration.
Inserting commas Just picking phrase markers was not sufficient to make the speech prosodically
rich. Commas were inserted in the text wherever a pause might occur, and the tree was built
using these commas so that their locations could be predicted as pauses while synthesizing speech.
Duration modeling This was done to include the duration of the unit as a feature while building
the tree, and also as a feature to narrow down the number of units selected while picking units for
synthesis.
Prosody modeling This was achieved through phrase markers and by inserting commas in the
text. Prosody modeling was done to make the synthesized speech more expressive so that it will
be more usable for visually challenged persons.
Geminates In Indian languages it is very important to preserve the intra-word pause while
speaking, as a word spoken without the intra-word pause can have a completely different meaning.
These intra-word pauses are called geminates, and care has been taken to preserve them during
synthesis.
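As an illustration, lowering ac_left_context in festvox/build_clunits.scm can be done by hand
or with a one-liner such as the following (which assumes the default value 0.8 is still in the file and
keeps a backup copy):

    # change ac_left_context from the default 0.8 to the value used for Indian languages
    sed -i.bak 's/(ac_left_context 0.8)/(ac_left_context 0.1)/' festvox/build_clunits.scm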

8.2 Modifications in source code

1. Add the three lines below to txt.done.data, and also add the corresponding wav and lab files
in the respective folders:
( text_0998 "LSIL" )
( text_0999 "SSIL" )
( text_0000 "mono" )
2. Inside the bin folder, make the following modification in the make_pm_wave file:
PM_ARGS='-min 0.0057 -max 0.012 -def 0.01 -wave_end -lx_lf 140 -lx_lo 111 -lx_hf 80 -lx_ho 51 -med_o 0'
Comment out the above line and add the following line in the file:
PM_ARGS='-min 0.003 -max 0.7 -def 0.01 -wave_end -lx_lf 340 -lx_lo 91 -lx_hf 140 -lx_ho 51 -med_o 0'
3. Open the festvox/build_clunits.scm file:
   => Go to line 69, i.e. (ac_left_context 0.8), and change the value 0.8 to 0.1.
   => Go to line 87, i.e. (wagon_cluster_size 20), and change the value 20 to 7.
   => Go to line 89, i.e. (cluster_prune_limit 40), and change the value 40 to 10.
4. Open the festvox/<voicefoldername>_clunits.scm file:
   => Go to line 136, (optimal_coupling 1), and change the value 1 to 2.
5. Handling SIL. For a small system this issue need not be handled, but for a system with a large
database, multiple occurrences of SIL create problems. To solve the issue do the following step:
   => Go to line 161, which starts with (define (VOICE_FOLDER_NAME::clunit_name i), and
   replace the entire function with the following code:

(define (VOICE_FOLDER_NAME::clunit_name i)
  "(VOICE_FOLDER_NAME::clunit_name i)
Defines the unit name for unit selection for tam. This can be modified; it changes the basic
classification of units for clustering. By default we just use the phone name, but we may want
to make this the present phone plus the previous phone (or something else)."
  (let ((name (item.name i)))
    (cond
     ((and (not iitm_tam_aarthi::clunits_loaded)
           (or (string-equal "h#" name)
               (string-equal "1" (item.feat i "ignore"))
               (and (string-equal "pau" name)
                    (or (string-equal "pau" (item.feat i "p.name"))
                        (string-equal "h#" (item.feat i "p.name")))
                    (string-equal "pau" (item.feat i "n.name")))))
      "ignore")
     ((string-equal name "SIL")
      ;; (set! pau_count (+ pau_count 1))
      (string-append
       name
       (item.feat i "p.name") (item.feat i "p.p.name")))
     ;; Comment out this if you want a more interesting unit name
     ((null nil)
      name)
     ;; Comment out the above if you want to use these rules
     ;((string-equal "+" (item.feat i "ph_vc"))
     ; (string-append
     ;  name
     ;  (item.feat i "R:SylStructure.parent.stress")
     ;  (iiit_tel_lenina::nextvoicing i)))
     ;((string-equal name "SIL")
     ; (string-append
     ;  name
     ;  (VOICE_FOLDER_NAME::nextvoicing i)))
     ;(t
     ; (string-append
     ;  name
     ;  ;(item.feat i "seg_onsetcoda")
     ;  (iiit_tel_lenina::nextvoicing i)))
     )))
6. Then go to line number 309 and add the following code:

(define (phrase_number word)
  "(phrase_number word)
Phrase number of a given word in a sentence."
  (cond
   ((null word) 0)  ;; beginning of utterance
   ((string-equal ";" (item.feat word "p.R:Token.parent.punc")) 0)  ; end of a sentence
   ((string-equal "," (item.feat word "p.R:Token.parent.punc"))
    (+ 1 (phrase_number (item.prev word))))  ; end of a phrase
   (t
    (+ 0 (phrase_number (item.prev word))))))
7. Go to the festival/clunits/ folder ===> replace the all.desc file and copy the syllables and
phones to both the p.name and n.name fields.
8. Generate the phoneset units along with their features for inclusion in the phoneset.scm file by
running create_phoneset_languageName.pl. The phoneset.scm file contains a list of all units along
with their phonetic features. The create_phoneset.pl script first takes every syllable, breaks it
down into smaller units and dumps their phonetic features into the phoneset.scm file. For every
syllable the script first checks whether the vowel present in the syllable is a short vowel or a long
vowel, and a corresponding value is assigned to that field. After that, the starting and ending
consonants of the syllable are checked, and depending on the place of articulation of the consonants
a particular value is assigned to that field. Depending on the type of vowel and the type of
beginning and end consonants we can then assign a value to the syllable-type field as well. The
fields for manner of articulation are kept as zero.
9. In the VoiceFolderName_phoneset.scm file:
   Uncomment the following line during TRAINING:
   (PhoneSet.silences '(SIL))
   Uncomment the following line during TESTING:
   (PhoneSet.silences '(SSIL))
10. In the VoiceFolderName_phoneset.scm file we also have to change the phoneset definitions.
Replace the feature definitions in the defPhoneSet function with the following code:

(;; vowel or consonant
 (vlng 1 0)
 ;; full vowel
 (fv 1 0)
 ;; syllable type: v, vc/vcc, cv/ccv, cvc/cvcc
 (syll_type 1 2 3 4 0)
 ;; place of articulation of c1
 (poa_c1 1 2 3 4 5 6 7 0)
 ;; manner of articulation of c1
 (moa_c1 + 0)
 ;; place of articulation of c2: labial alveolar palatal labiodental dental velar
 (poa_c2 1 2 3 4 5 6 7 0)
 ;; manner of articulation of c2
 (moa_c2 + 0)
)
11. When running clunits, i.e. the final step, remove ( text_0000 "mono" ) and ( text_0000-2 "phone" )
from txt.done.data (if they exist).
12. Go to the VoiceFolderName_lexicon.scm file (calling the parser in the lexicon file). Go to line
number 137 and add the following code in the hand-written letter-to-sound rules section:

(define (iitm_tam_lts_function word features)
  "(iitm_hin_lts_function WORD FEATURES)
Return pronunciation of word not in lexicon."
  (cond
   ((string-equal "LSIL" word)
    (set! wordstruct '( ((LSIL) 0) ))
    (list word nil wordstruct))
   ((string-equal "SSIL" word)
    (set! wordstruct '( ((SSIL) 0) ))
    (list word nil wordstruct))
   ((string-equal "mono" word)
    (set! myfilepointer (fopen "unit_size.sh" "w"))
    (format myfilepointer "%s" "mono")
    (fclose myfilepointer))
   ((string-equal word "phone")
    (set! myfilepointer (fopen "unit_size.sh" "w"))
    (format myfilepointer "%s" "phone")
    (fclose myfilepointer))
   (t
    (set! myfilepointer (fopen (path-append VoiceFolderName::dir "parser.sh") "w"))
    ;; (format myfilepointer "perl %s %s %s" (path-append VoiceFolderName::dir "bin/il_parser-train.pl") word VoiceFolderName::dir)
    (format myfilepointer "perl %s %s %s" (path-append VoiceFolderName::dir "bin/il_parser-test.pl") word VoiceFolderName::dir)
    (fclose myfilepointer)
    ;; (print "called")
    (system "chmod +x parser.sh")
    (system "./parser.sh")
    ;; (format t "%l\n" word)
    (load (path-append VoiceFolderName::dir "wordpronunciation"))
    (list word 'a wordstruct))))

During the training process uncomment the il_parser-train.pl line; during testing uncomment the
il_parser-test.pl line.
13. Creating the pronunciation dictionary: perl test.pl <inputfile in utf8 format>
File names to be edited in il_parser_pronun_dict.pl:
(a) the file containing unique clusters, e.g.
    my $file = "./unique_clusters_artistName";
(b) the pronunciation dictionary to create:
    my $oF = "pronunciationdict_artistName";
(c) Rename the created pronunciation dictionary to instituteName_language_lex.out.
Add MNCL (without quotes) as the first line of instituteName_language_lex.out and put the file
into the festvox directory.
14. To handle English words, we use the preprocessing3.pl Perl script. When an English word is
encountered and it is not in the pronunciation dictionary, it is sent to the parser, and the parser
sends the word to preprocessing3.pl, which generates the word pronunciation by splitting the word
into individual letters.
E.g. for the input word Ram ===> R A M will be the output in wordpronunciation.
15. To handle numbers, abbreviations, date and time, we have included a separate scm file in the
festvox folder called tokentowords.scm.

9 Troubleshooting in festival

9.1 Troubleshooting (issues related with festival)

Some errors, with solutions, encountered during the installation/building process:
Error: /usr/bin/ld: cannot find -lcurses
Solution: sudo ln -s /lib/libncurses.so.5 /lib/libcurses.so
Error: /usr/bin/ld: cannot find -lncurses
Solution: apt-get install libncurses5-dev
Error: /usr/bin/ld: cannot find -lstdc++
Solution: sudo ln -s /usr/lib/libstdc++.so.6 /lib/libstdc++.so
Error: gcc: error trying to exec 'cc1plus': execvp: No such file or directory
Solution: sudo apt-get install g++
Error: ln -s festival/bin/festival /usr/bin/festival
ln: accessing /usr/bin/festival: Too many levels of symbolic links
Solution: sudo mv /usr/bin/festival /usr/bin/festival.orig
ln -s /home/boss/festival/festival/src/main/festival /usr/bin/festival
ln: creating symbolic link /usr/bin/festival to /home/boss/festival/festival/
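After applying the fixes above, a quick sanity check (paths may differ on your system):

    # confirm the links resolve and Festival starts
    ls -l /lib/libcurses.so /lib/libstdc++.so /usr/bin/festival
    festival --version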

9.2 Troubleshooting (issues that might occur while synthesizing)

Error: Linux: can't open /dev/dsp
Solution: go to your home directory and open .festivalrc (if it is not there, just create it):
$ cd
$ sudo gedit .festivalrc
Add the following lines to this file and save:
(Parameter.set 'Audio_Command "aplay -q -c 1 -t raw -f s16 -r $SR $FILE")
(Parameter.set 'Audio_Method 'Audio_Command)
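To verify the audio setup, start Festival, load your voice and synthesize some text; for example,
with the Hindi voice used in the next section (assuming it is installed under festival/lib/voices/):

    $ festival
    festival> (voice_iitm_hin_anjana_clunits)
    festival> (SayText "your text")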

10 ORCA Screen Reader

Integrating Festival voices with Orca:

1. Place the voices in the festival/lib/voices/ folder, as Orca will be loading all the voices in this
folder.
2. Edit the clunits.scm file for the voice in the festvox folder of the voice. (E.g. for Hindi, take the
iitm_hin_anjana_clunits.scm file in the festival/lib/voices/hindi/iitm_hin_anjana_clunits/festvox/
folder.) Insert the following code just above the last line in the file; the last line would be
(provide 'voice_name)
(e.g. for Hindi the last line in the clunits.scm file will be (provide 'iitm_hin_anjana_clunits)):

(proclaim_voice
 'iitm_hin_anjana_clunits
 '((language hindi)
   (gender female)
   (dialect hindi)
   (description "This voice provides an Indian hindi language")
   (builtwith festvox-2.1)
   (coding UTF-8)))

Give the correct voice name and language in the above code.
3. Start festival in server mode with the following command:
   festival --heap 2100000 --server
4. Start orca with the following command at another command prompt:
   orca -n
5. Click on the Orca Preferences button.
6. Click on the Speech tab.
7. The following fields should have the entries given here:
   Speech system should be GNOME Speech Services
   Speech synthesizer should be Festival GNOME Speech Driver
   Voice settings should be Default
   Select the festival voice for your language from the drop-down list for the Person entry.
8. Click on the Apply button and then the OK button.
9. Now Orca should be able to read the language data.
10. If the festival synthesizer does not load, install gnome-speech-swift (on Ubuntu:
    sudo apt-get install gnome-speech-swift
    ) and start festival and orca again. If the Ubuntu version is greater than 10.04, the festival
    speech dispatcher has to be installed, since the GNOME speech driver is not supported in later
    versions of Ubuntu. Use the following command to install the speech dispatcher:
    sudo apt-get install speech-dispatcher-festival

11. If a timeout occurs for orca, type locate settings.py at the command prompt and open the
files named settings.py in any orca-related folders (usually there is more than one). Search for the
phrase timeoutTime and change its value to 30. Do the same for all files named settings.py, then
start festival and orca again.
12. If an English word is not in the database, it is spelt out letter by letter.
13. The setup can be tested using a gedit file containing text in your language.
14. The cursor should be placed in front of the sentence to be read, using the keyboard arrow keys.
Move the cursor to different lines in the file for it to read line by line.

11 NVDA Windows Screen Reader

11.1 Compiling Festival in Windows

1. Visual Studio 2008 (VC 9.0) Standard Edition must be successfully installed.
2. The service pack for Visual Studio 2008 must be installed.
3. Install cygwin.
4. Rename \speech_tools\config\systems\ix86_CYGWIN1.5.mak to
\speech_tools\config\systems\ix86_unknown.mak
(if a file-not-found error comes up).
Note: copy the new module (il_parser) to your festival/src/modules/ folder before compiling
speech_tools and festival. Copy the Makefile provided to the festival/src/modules/ folder.
Follow the steps mentioned in http://www.eguidedog.net/doc_build_win_festival.php. There are
more changes we made apart from those mentioned in the web page, which are listed below.
IMPORTANT: The following changes need to be made only if errors are thrown for these files
while compiling festival following the steps in the link given above.
5. speech_tools/include/EST.h: #include <iostream> should be added at line 45, before
using namespace std;
6. speech_tools/include/EST_math.h: #include <iostream> should be added at line 54, after
#include <cfloat>
7. speech_tools/include/EST_TKVL.h: #include <iostream> should be added at line 43, before
using namespace std;
8. speech_tools/include/EST_Token.h: #include <iostream> should be added at line 44, before
using namespace std;
9. speech_tools/include/EST_TrackMap.h: #include <iostream> should be added at line 38, before
using namespace std;
10. speech_tools/stats/wagon/wagon_aux.cc: #include "EST_math.h" should be added at line 47,
after #include "EST_Wagon.h"
11. speech_tools/stats/EST_DProbDist.cc: long long l; on line 62 must be changed to long l;
and l = (long long)c; on line 66 must be changed to l = (long)c;
12. speech_tools/utils/EST_cutils.c must have the following change:
if (((tdir=getenv("TMPDIR")) == NULL) ||
    ((tdir=getenv("TEMP")) == NULL) ||
    ((tdir=getenv("TMP")) == NULL))
    tdir = "/tmp";
must be replaced by
if (((tdir=getenv("TMPDIR")) == NULL) &&
    ((tdir=getenv("TEMP")) == NULL) &&
    ((tdir=getenv("TMP")) == NULL))
    tdir = "/tmp";


13. speech_tools/utils/EST_ServiceTable.cc must have the following change: the following code
must be moved from the end of the file to line 52, after #include "EST_ServiceTable.h":
Declare_KVL_T(EST_String, EST_ServiceTable::Entry, EST_String_ST_entry)
#if defined(INSTANTIATE_TEMPLATES)
#include "../base_class/EST_TList.cc"
#include "../base_class/EST_TSortable.cc"
#include "../base_class/EST_TKVL.cc"
Instantiate_KVL_T(EST_String, EST_ServiceTable::Entry, EST_String_ST_entry)
#endif
14. festival/src/main/festival_client.cc: #include <iostream> should be added at line 42, before
using namespace std;
15. festival/src/main/festival_main.cc: #include <iostream> should be added at line 42, before
using namespace std;

16. In festival/src/modules/MultiSyn/EST_FlatTargetCost.cc and
festival/src/modules/MultiSyn/EST_FlatTargetCost.h, all references to the variables WORD and
PWORD must be replaced by WORD1 and PWORD1. WORD and PWORD are keywords in
VC++, so the variable names have to be changed.
Change in EST_FlatTargetCost.h:
enum tcdata_t
{
  VOWEL, SIL, BAD_DUR, NBAD_DUR, BAD_OOL, NBAD_OOL, BAD_F0, SYL, SYL_STRESS,
  N_SIL, N_VOWEL, NSYL, NSYL_STRESS, RC, NNBAD_DUR, NNSYL, LC, PBAD_DUR,
  PSYL, WORD, NWORD, NNWORD, PWORD, SYLPOS, WORDPOS, PBREAK, POS,
  PUNC, NPOS, NPUNC, TCHI_LAST
};
must be changed to
enum tcdata_t
{
  VOWEL, SIL, BAD_DUR, NBAD_DUR, BAD_OOL, NBAD_OOL, BAD_F0, SYL, SYL_STRESS,
  N_SIL, N_VOWEL, NSYL, NSYL_STRESS, RC, NNBAD_DUR, NNSYL, LC, PBAD_DUR,
  PSYL, WORD1, NWORD, NNWORD, PWORD1, SYLPOS, WORDPOS, PBREAK, POS,
  PUNC, NPOS, NPUNC, TCHI_LAST
};
Change in EST_FlatTargetCost.cc:
In the function TCData *EST_FlatTargetCost::flatpack(EST_Item *seg) const,
// seg word feature
if(word=tc_get_word(seg))
  (*f)[WORD]=simple_id(word->S("id"));
else
  (*f)[WORD]=0;
must be replaced by
// seg word feature
if(word=tc_get_word(seg))
  (*f)[WORD1]=simple_id(word->S("id"));
else
  (*f)[WORD1]=0;
In the same function,
// Prev seg word feature
if(seg->prev() && (word=tc_get_word(seg->prev())))
  (*f)[PWORD]=simple_id(word->S("id"));
else
  (*f)[PWORD]=0;
must be replaced by
// Prev seg word feature
if(seg->prev() && (word=tc_get_word(seg->prev())))
  (*f)[PWORD1]=simple_id(word->S("id"));
else
  (*f)[PWORD1]=0;
In the same function,
// segs wordpos
(*f)[WORDPOS]=0; // medial
if( f->a_no_check(WORD) != f->a_no_check(NWORD) )
  (*f)[WORDPOS]=1; // inter
else if( f->a_no_check(WORD) != f->a_no_check(PWORD) )
  (*f)[WORDPOS]=2; // initial
else if( f->a_no_check(NWORD) != f->a_no_check(NNWORD) )
  (*f)[WORDPOS]=3; // final
must be replaced by
// segs wordpos
(*f)[WORDPOS]=0; // medial
if( f->a_no_check(WORD1) != f->a_no_check(NWORD) )
  (*f)[WORDPOS]=1; // inter
else if( f->a_no_check(WORD1) != f->a_no_check(PWORD1) )
  (*f)[WORDPOS]=2; // initial
else if( f->a_no_check(NWORD) != f->a_no_check(NNWORD) )
  (*f)[WORDPOS]=3; // final
In the function float EST_FlatTargetCost::position_in_phrase_cost() const,
if ( !t->a_no_check(WORD) && !c->a_no_check(WORD) )
  return 0;
if ( !t->a_no_check(WORD) || !c->a_no_check(WORD) )
  return 1;
must be replaced by
if ( !t->a_no_check(WORD1) && !c->a_no_check(WORD1) )
  return 0;
if ( !t->a_no_check(WORD1) || !c->a_no_check(WORD1) )
  return 1;
In the function float EST_FlatTargetCost::punctuation_cost() const,
if ( (t->a_no_check(WORD) && !c->a_no_check(WORD))
     || (!t->a_no_check(WORD) && c->a_no_check(WORD)) )
  score += 0.5;
else if (t->a_no_check(WORD) && c->a_no_check(WORD))
must be replaced by
if ( (t->a_no_check(WORD1) && !c->a_no_check(WORD1))
     || (!t->a_no_check(WORD1) && c->a_no_check(WORD1)) )
  score += 0.5;
else if (t->a_no_check(WORD1) && c->a_no_check(WORD1))
In the function float EST_FlatTargetCost::partofspeech_cost() const,
// Compare left phone half of diphone
if(!t->a_no_check(WORD) && !c->a_no_check(WORD))
  return 0;
if(!t->a_no_check(WORD) || !c->a_no_check(WORD))
  return 1;
must be replaced by
// Compare left phone half of diphone
if(!t->a_no_check(WORD1) && !c->a_no_check(WORD1))
  return 0;
if(!t->a_no_check(WORD1) || !c->a_no_check(WORD1))
  return 1;
17. festival/src/modules/MultiSyn/EST_JoinCostCache.h must have the following change: comment
out the following portion of the code:
static const unsigned char minVal = 0x0;
static const unsigned char maxVal = 0xff;
static const unsigned char defVal = 0xff;
18. In festival/src/modules/MultiSyn/EST_JoinCostCache.cc, replace all occurrences of minVal
by 0x0, all maxVal by 0xff and all defVal by 0xff. The changes are as follows:
Comment out #include <iostream>
if( a == b )
  return minVal;
else if( b > a )
  return cache[(b*(b-1)>>1)+a];
else
  return cache[(a*(a-1)>>1)+b];
return defVal;
must be replaced by
if( a == b )
  return 0x0;
else if( b > a )
  return cache[(b*(b-1)>>1)+a];
else
  return cache[(a*(a-1)>>1)+b];
return 0xff;
unsigned int qleveln = maxVal-minVal; must be replaced by unsigned int qleveln = 0xff-0x0;
if( cost >= ulimit )
  qcost = maxVal;
else if( cost <= llimit )
  qcost = minVal;
must be replaced by
if( cost >= ulimit )
  qcost = 0xff;
else if( cost <= llimit )
  qcost = 0x0;
19. festival/src/modules/MultiSyn/EST_TargetCost.cc must have the following changes.
Comment out the following code:
const EST_String &left_phone( cand_left->features().val("name").String() );
const EST_String &right_phone( cand_right->features().val("name").String() );
Then
if( ph_is_vowel( left_phone )
    || ph_is_approximant( left_phone )
    || ph_is_liquid( left_phone )
    || ph_is_nasal( left_phone ) )
must be replaced by
if( ph_is_vowel( cand_left->features().val("name").String() )
    || ph_is_approximant( cand_left->features().val("name").String() )
    || ph_is_liquid( cand_left->features().val("name").String() )
    || ph_is_nasal( cand_left->features().val("name").String() ) )
and
if( ph_is_vowel( right_phone )
    || ph_is_approximant( right_phone )
    || ph_is_liquid( right_phone )
    || ph_is_nasal( right_phone ) )
  fv = fvector( cand_right->f("midcoef") );
must be replaced by
if( ph_is_vowel( cand->next()->features().val("name").String() )
    || ph_is_approximant( cand->next()->features().val("name").String() )
    || ph_is_liquid( cand->next()->features().val("name").String() )
    || ph_is_nasal( cand->next()->features().val("name").String() ) )
  fv = fvector( cand->next()->f("midcoef") );
20. festival/src/modules/Text/token.cc: #include <iostream> should be added at line 48, before
using namespace std;
21. festival/src/modules/UniSyn/us_mapping.cc: declare int i; separately in the following
functions and remove the declarations from the for loops:
(a) void make_linear_mapping(EST_Track &pm, EST_IVector &map)
(b) static void pitchmarksToSpaces( const EST_Track &pm, EST_IVector *spaces, int start_pm,
    int end_pm, int wav_srate )
(c) void make_join_interpolate_mapping( const EST_Track &source_pm, EST_Track &target_pm,
    const EST_Relation &units, EST_IVector &map )
    void make_join_interpolate_mapping2( const EST_Track &source_pm, EST_Track &target_pm,
    const EST_Relation &units, EST_IVector &map )
22. festival/src/modules/UniSyn/us_prosody.cc: in the function
void F0_to_pitchmarks(EST_Track &fz, EST_Track &pm, int num_channels, float default_F0,
float target_end), remove the declaration of i in the for loop
for( int i=0; i<fz_len; i++ )
In the function void stretch_F0_time(EST_Track &F0, float stretch, float s_last_time,
float t_last_time), declare int i; separately and remove the declarations from the for loops.
23. festival/src/modules/UniSyn/us_unit.cc: declare int i; separately in the following functions
and remove the declarations from the for loops:
(a) static EST_Track* us_pitch_period_energy_contour( const EST_WaveVector &pp, const
    EST_Track &pm )
(b) void us_linear_smooth_amplitude( EST_Utterance *utt )

12 SAPI compatibility for festival voice

1. Install Festival and speech tools in Windows.
Please check the chapter on compiling Festival in Windows. The parser module has to be written
in C. The new module (il_parser) has to be plugged into Festival; before compiling Festival, this
module has to be kept in the src/modules folder. Suppose we are installing Festival in, say,
D:\fest_install\festival and speech tools in D:\fest_install\speech_tools; this new module
has to be kept in D:\fest_install\festival\src\modules\il_parser.
2. Install the voice. Suppose the voice is kept in the folder D:\fest_install\festival\lib\voices\hindi.
3. Some Festival files have been changed. These files are to be replaced at the place where Festival
and speech tools are installed:
festival.cc -> D:\fest_install\festival\src\arch\festival
festival_main.cc -> D:\fest_install\festival\src\main
clunits.cc -> D:\fest_install\festival\src\modules\clunits
EST_wave_utils.cc -> D:\fest_install\speech_tools\speech_class
config.ini: copy it to the voice folder (hindi\iitm_hin_anjana_clunits). The config.ini file will be
accessed by the SAPI code; it has the command to set a voice:
(voice_iitm_hin_anjana_clunits)
Now you need to compile Festival as per the steps given in the chapter Compiling Festival in
Windows.
4. Install the Microsoft SDK from http://www.microsoft.com/download/en/details.aspx?id=11310
5. Replace the SampleTTSEngine folder (C:\Program Files\Microsoft SDKs\Windows\v6.1\Samples\winui\)
with the code provided by IITM.
6. In this SampleTTSEngine solution, a file called registervox.cpp has the details of our voice.
The name and the language code should be changed for the respective languages.
7. Two environment variables have to be created:
FESTLIBDIR D:\festival\festival\lib
This should point to where your voice is kept; the lib folder should be there with all the scm files,
and the voice should be kept in this lib folder under the voices\hindi folder.
voice_path D:\festival\festival\lib\voices\hindi\iitm_hin_anjana_clunits\
This points to the voice folder.
8. Check the properties of the SampleTTSEngine solution. The library and include paths of
festival and speech tools must point to the correct locations. These libraries are built when
festival and speech tools are compiled (point 1).
9. Compile the SampleTTSEngine solution in release mode. It will generate SampleTtsEngine.dll.


10. Check that an entry is there in the registry (HKEY_LOCAL_MACHINE -> Software ->
Microsoft -> Speech -> Voices -> Tokens); an entry for our voice should be present.
11. Test with a sample TTS application (Control Panel -> Speech -> Text to Speech) or with
TTSApp.exe that comes with the SDK.
12. If it works in these applications, now try it in NVDA or JAWS.

13 Sphere Converter Tool

The tool was developed to convert speech files in different formats to a standard SPHERE format.
In the SPHERE format there is a header which has all the details of the speech file. The speech
files can be in wav, raw or mu-law format. The sphere files can either be encoded with wavpack or
shorten encoding, or kept in the same format as the input speech file.
The input file (either mu-law, wav or raw) is converted to a sphere file (either encoded with
wavpack or shorten, or with no encoding) with a sphere header. SPHERE files contain a strictly
defined header portion followed by the file body (waveform). The header is an object-oriented,
1024-byte blocked, ASCII structure which is prepended to the waveform data. The header is
composed of a fixed-format portion followed by an object-oriented variable portion. The fixed
portion is as follows:
NIST_1A<newline> 1024<newline>
The remaining object-oriented variable portion is composed of <object> <type> <value> triples.
Below is a sample sphere header that this module generates. The first four fields are user-defined
fields taken from the config file.
NIST_1A
1024
location_id -s13 TTS_IITMadras
database_id -s22 Sujatha_20_RadioJockey
utterance_id -s9 Suj_trial
sample_sig_bits -i 16
channel_count -i 1
sample_n_bytes -i 2
sample_rate -i 16000
sample_count -i 46563
sample_coding -s3 pcm
sample_byte_format -s2 01
sample_min -i 16387
sample_max -i 23904
end_head

13.1 Extraction of details from the header of the input file

The input file can be a wav file, either PCM or mu-law encoded. The header of a wav file is shown
in a table at the end of the document.


The necessary information is extracted from the header of the input file. If the fact chunk is
present in the header, the sample count is obtained from the header; otherwise it is calculated as
follows. The total number of data bytes is obtained from the cksize (second field) of the data
chunk, and the bits per sample from the corresponding field in the format chunk. Then
bytes per sample = (bits per sample) / 8
sample count = (number of data bytes) / (bytes per sample)
In the SPHERE package the byte format of the data is stored in the field sample_byte_format.
If the sample data is in little-endian format this field is given the value 01, if the data is in
big-endian format the value is 10, and if the samples are single-byte the value is 1.
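As a worked example using the sample header shown earlier: with 16 bits per sample, bytes per
sample = 16 / 8 = 2, so a data chunk of 93126 bytes gives a sample count of 93126 / 2 = 46563,
which is the value of the sample_count field in that header (the 93126-byte figure is inferred from
the header purely for illustration).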

13.1.1 Calculating sample minimum and maximum values

The objective of this module is to find the maximum and minimum sample values among the
sample data present in the input file. Each sample is read from the data part of the file, and the
maximum and minimum values are computed.

13.1.2 RAW files

RAW files are headerless audio files. The sample rate, sample size, channel count and data encoding
must be given by the user in the config file for the program to read the file successfully. The sample
count is calculated by counting the number of samples read while computing the sample minimum
and maximum values.

13.1.3 MULAW files

If the input file is a mu-law encoded file, the AudioFormat field in the format chunk of the header
will have the value 7, and the FACT chunk will be present in the header.
13.1.4 Output in encoded format

The data in the output sphere file can be a Shorten-compressed byte stream, a Wavpack-compressed
byte stream, or the data as present in the input file.

13.2 Config file

The user-defined fields to be added to the header can be kept in this file, and it is to be placed at
the location where the executables are placed.
The output sphere files can be played in the utility WaveSurfer. The sphere files have the .sph
extension and the sphere header can be verified by opening the file. The file can be opened in a
hex editor (e.g. ghex2) to verify the header fields and the size of the file.

14 Sphere Converter User Manual

14.1 How to install the Sphere converter tool

1. Untar sphere_2.6a.tar.Z (use tar xvzf, or zcat sphere_2.6a.tar.Z | tar xvf -):
tar xvzf sphere_2.6a.tar.Z
2. A folder named nist will be created.
3. Change the file exit.c (nist/src/lib/sp): replace extern char *sys_errlist[]; by the following:
#ifdef NARCH_linux
#include <errno.h>
#else
extern char *sys_errlist[];
#endif
4. Go to the folder nist (cd nist) and install nist as follows: sh src/scripts/install.sh
(a) Sun OS4.1.[12]
(b) Sun Solaris
(c) Next OS
(d) Dec OSF/1 (with gcc)
(e) Dec OSF/1 (with cc)
(f) SGI IRIX
(g) HP Unix (with gcc)
(h) HP Unix (with cc)
(i) IBM AIX
(j) Custom
Please Choose one:
10
What is/are the Compiler Command ? [cc]
cc
OK, The Compiler Command command is cc. Is this OK? [yes]
yes
What is/are the Compiler Flags ? [-g]
-g -c
OK, The Compiler Flags command is -g -c. Is this OK? [yes]
yes
What is/are the Install Command ? [install -s -m 755]
install -s -m 755
What is/are the Archive Sorting Command ? [ranlib]
What is/are the Archive Update Command ? [ar ru]
What is/are the Architecture ? [SUN]
linux
OK, The Architecture command is linux. Is this OK? [yes]
yes


5. Copy the following files from the c files folder to nist/bin or to any user-defined folder:
decode_sphere.c
encode_sphere.c
configfile
compare.c
convert_to_sphere.c
wavtosphere.sh
6. Compile by combining the C files, using (in the bin folder):
cc convert_to_sphere.c decode_sphere.c encode_sphere.c -I../include/ -lsp -lutil -lm -L../lib/
If the c files are not in nist/bin but in a user-defined folder, give the appropriate paths for nist/lib
and nist/include in the above command.
This will create a file a.out which will be used by the front end of the tool.
7. Install Qt.
Run the Qt binary qt-sdk-linux-x86-opensource-2010.02.bin:
./qt-sdk-linux-x86-opensource-2010.02.bin
A folder named qtsdk-2010.02 will be created. With this, installing the Qt software is done.
8. Now, to compile the sphere converter code using Qt, extract sphereconverter.tar:
cd sphereconverter
Execute the following commands:
/home/......./qtsdk-2010.02/qt/bin/qmake -project   (this creates the .pro file)
/home/......./qtsdk-2010.02/qt/bin/qmake            (this creates the Makefile)
make                                                (this builds the sphere converter executable)
9. Copy the sphereconverter executable to the bin folder of nist (cd nist/bin) or to the user-defined
folder containing the c files.
10. Copy this user manual to both the sphereconverter and nist/bin folders (or to the user-defined
folder containing the c files) so it can be accessed at all times from the help button in the tool.
11. Now, clicking on this executable in /home/.../nist/bin (or in the user-defined folder containing
the c files) will run the tool.

14.2 How to use the tool

Select the type of file to be converted with the radio button: wav (pcm or mu-law) or raw.
Select whether a single file or a bulk of files has to be converted.
If Single file conversion is selected:
1. Load the input file, with extension either wav or raw. It can be browsed using the Open
button. The files with the selected extension will be listed when browsing.
2. Specify the name of the output sphere file and where the output file has to be saved using
the Save As button.
If Bulk file conversion is selected:
1. Load the input folder containing wav files or raw files, depending on the type of file selected.
It can be browsed using the Open button.
2. If the input folder contains files (wav or raw) other than the type of file selected, the files of
the other type will be converted to corresponding sphere files with default properties.
3. Specify the name of the output folder where the sphere files have to be located.
Click on Edit properties to enter the details that would be stored in the sphere header.
If properties are not edited the default properties stored in the configfile(present in nist/bin
folder) will be used.
Select the type of encoding for the output sphere file. It can be wavpack encoded or shorten
encoding or without any encoding.
Click Convert button to convert the a single file or Bulk Convert button to convert a set
of files.
If the file is successfully converted the message File was succesfully converted will be displayed.
If any field entered in the properties were wrong or if the file was not successfully converted
the appropriate message will be displayed.
After successful convertion the sphere header created by the tool will be displayed. If the user
is satisfield with the header Ok button can be clicked, else Cancel button can be clicked
and user can go back and make changes in the properties.
On clicking the help button on the right top corner the user manual will be opened which
can be referred for any issues while using the tool.

14.3 Fields in Properties

location_id : It is a mandatory field. The user can enter any value in string format. The maximum allowed length of the entered text is 100 characters. Preferably, this field holds the name of the location/institution at which the conversion is taking place.
database_id : It is a mandatory field. The user can enter any value in string format. The maximum allowed length of the entered text is 100 characters. Preferably, this field holds the details of the project/database/speaker.
utterance_id : It is a mandatory field. The user can enter any value in string format. The maximum allowed length of the entered text is 100 characters. Preferably, this field holds the name of the speaker. In the sphere header, the value of this field is appended with the name of the file, separated by an underscore.
language : It is a mandatory field. The user can enter any value in string format. The maximum allowed length of the entered text is 50 characters. Preferably, this field holds the language used in the input file.
sample_n_bytes : It is a mandatory field for raw files only. The user can enter the number of bytes per sample in the file. For wav files this value is retrieved from the wav header. If a non-integer value is entered, the tool will prompt the user to enter an integer value. This tool deals only with one byte per sample and two bytes per sample.


sample_sig_bits : It is a mandatory field. The user can enter the number of significant bits in a sample. If a non-integer value is entered, the tool will prompt the user to enter an integer value. If sample_sig_bits > (sample_n_bytes * 8), an error is thrown informing the user that the value entered for this field is wrong.
sample_rate : It is a mandatory field for raw files only. For wav files this value is retrieved from the wav header. The user can enter the sampling rate (blocks per second) used in the file. If a non-integer value is entered, the tool will prompt the user to enter an integer value. If (sample_count / sample_rate) <= 0, meaning the duration of the file is less than or equal to zero, an error is thrown informing the user that the value entered for this field is wrong.
sample_byte_format : It is a mandatory field for raw files only. For wav files this value is retrieved from the wav header. The user can enter the byte ordering used in the file: 01 for little endian, 10 for big endian, or 1 for single byte. If any other value is entered, the tool will prompt the user to enter one of these three values. For raw files, if the sample_n_bytes entered by the user is 1, this field can take only the value 1; otherwise the user is informed that the value entered for this field is wrong.
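The checks described above amount to a few simple consistency rules. The fragment below is only an illustrative sketch of that validation logic under the stated rules; the function and variable names are hypothetical and are not taken from the tool's sources.

/* Hypothetical sketch of the property validation described above. */
#include <stdio.h>
#include <string.h>

/* Returns 1 if the values are consistent, 0 otherwise. */
int validate_properties(int sample_n_bytes, int sample_sig_bits,
                        long sample_count, int sample_rate,
                        const char *sample_byte_format)
{
    if (sample_n_bytes != 1 && sample_n_bytes != 2) {
        fprintf(stderr, "sample_n_bytes must be 1 or 2\n");
        return 0;
    }
    if (sample_sig_bits > sample_n_bytes * 8) {
        fprintf(stderr, "sample_sig_bits exceeds sample_n_bytes * 8\n");
        return 0;
    }
    if (sample_rate <= 0 || sample_count / sample_rate <= 0) {
        fprintf(stderr, "duration of file is less than or equal to zero\n");
        return 0;
    }
    /* byte format: "01" little endian, "10" big endian, "1" single byte */
    if (strcmp(sample_byte_format, "01") != 0 &&
        strcmp(sample_byte_format, "10") != 0 &&
        strcmp(sample_byte_format, "1") != 0) {
        fprintf(stderr, "sample_byte_format must be 01, 10 or 1\n");
        return 0;
    }
    if (sample_n_bytes == 1 && strcmp(sample_byte_format, "1") != 0) {
        fprintf(stderr, "single-byte samples require sample_byte_format = 1\n");
        return 0;
    }
    return 1;
}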
channel_count : This tool deals only with channel_count = 1. It is the number of interleaved channels (Mono = 1, Stereo = 2).
sample_count : The value for this field is calculated by the tool. It is the total number of samples in the file.
sample_coding : The value for this field is calculated by the tool. It is the encoding used in the input and output file, separated by a comma. The input file encoding can be pcm, mulaw or raw. The output file encoding is shorten or wavpack. If no encoding is selected for the output file, the value of this field contains only the encoding of the input file.
sample_max : The value for this field is calculated by the tool. It is the maximum sample value (the amplitude of the sample with the maximum value) present in the file.
sample_min : The value for this field is calculated by the tool. It is the minimum sample value (the amplitude of the sample with the minimum value) present in the file.
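For 16-bit little-endian PCM data, these three computed fields can be obtained in a single pass over the samples. The sketch below is only an illustration of that computation under those assumptions; it is not the tool's own code, and the function name is hypothetical.

/* Hypothetical single-pass computation of sample_count, sample_max and
   sample_min for 16-bit little-endian PCM data held in a byte buffer. */
#include <stdint.h>
#include <stddef.h>

void scan_samples(const unsigned char *data, size_t n_bytes,
                  long *sample_count, int *sample_max, int *sample_min)
{
    *sample_count = (long)(n_bytes / 2);   /* sample_n_bytes = 2 */
    *sample_max = -32768;
    *sample_min = 32767;
    for (size_t i = 0; i + 1 < n_bytes; i += 2) {
        /* assemble one sample, little-endian byte order ("01") */
        int16_t s = (int16_t)(data[i] | (data[i + 1] << 8));
        if (s > *sample_max) *sample_max = s;
        if (s < *sample_min) *sample_min = s;
    }
}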
The user can add more fields using the Add button. The field name, data type and value have to be entered. The data type can be string, integer or real, selected from the drop-down list. Click the Ok button after entering the details of the new field, or click the Cancel button. Field names should not contain spaces.
The maximum size of the sphere header is 1024 bytes. If the user enters more data, the tool will inform the user that the header has exceeded 1024 bytes and ask the user to edit or delete a few properties.
User-entered fields can be deleted: tick the check box on the right of each field entered by the user and press the Delete button.
Once the properties are edited, click the Ok button.
If you click the Cancel button, the tool will ask: "Cancelling will remove the edited properties and use the default wav properties. Do you want to continue?". The user can click Yes or No.
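For orientation, a NIST SPHERE header is a plain-text block that begins with the keyword NIST_1A, followed by the header size in bytes and a list of typed fields ("-i" integer, "-r" real, "-sN" string of N characters), and ends with end_head. The example below is only an illustrative sketch of how the fields described in this section might appear in such a header; the values shown (including the utterance_id with the file name appended) are hypothetical, and the exact field set written by the tool depends on the input file and the edited properties.

NIST_1A
   1024
location_id -s19 IIT Madras, chennai
database_id -s22 Sujatha 20 RadioJockey
utterance_id -s12 Suj_file0001
language -s5 tamil
channel_count -i 1
sample_count -i 160000
sample_rate -i 16000
sample_n_bytes -i 2
sample_byte_format -s2 01
sample_sig_bits -i 16
sample_coding -s3 pcm
sample_max -i 22313
sample_min -i -20146
end_head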

14.4 Screenshot

Here is a screenshot of the sphere converter tool.



14.5 Example of data in the Config file (default properties)

WAV:
location_id STRING IIT Madras, chennai
database_id STRING Sujatha 20 RadioJockey
utterance_id STRING Suj
language STRING tamil
sample_sig_bits INTEGER 16
$$
RAW:
location_id STRING IIT Madras, chennai
database_id STRING Sujatha 20 RadioJockey
utterance_id STRING Suj
language STRING hindi
sample_n_bytes INTEGER 2
sample_sig_bits INTEGER 16
channel_count INTEGER 1
sample_rate INTEGER 16000
sample_byte_format STRING 01
$$
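The configfile format above is line-oriented: a section label (WAV: or RAW:), then one field per line giving the field name, its type and its value separated by whitespace, with $$ terminating each section. The fragment below is only a hypothetical sketch of how such lines could be parsed; it is not the parser used by the tool, and the function name is an assumption.

/* Hypothetical sketch: parse one section of the configfile described above. */
#include <stdio.h>
#include <string.h>

void read_config_section(FILE *fp)
{
    char line[256];
    while (fgets(line, sizeof(line), fp)) {
        line[strcspn(line, "\r\n")] = '\0';        /* strip the newline        */
        if (strcmp(line, "$$") == 0)               /* $$ terminates a section  */
            break;
        char name[64], type[16];
        int consumed = 0;
        if (sscanf(line, "%63s %15s %n", name, type, &consumed) == 2) {
            const char *value = line + consumed;   /* rest of line is the value */
            printf("field=%s type=%s value=%s\n", name, type, value);
        }
    }
}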

14.6 Limitations of the tool

The tool allows only single-channel files.
sample_n_bytes can take only the values 1 and 2.
For raw files, the correct values for sample_n_bytes, sample_rate and sample_byte_format must be given in order to obtain correct values for sample_max and sample_min.
The maximum size of the sphere header is 1024 bytes.
For bulk conversion, if both pcm and mulaw files are present in the input folder, they will use the same properties.
For bulk conversion, if the input folder has sub-folders or different types (pcm, mulaw or raw) of input files, the tool will report "All files were not successfully converted. All files in the folder may not be of the same format." Although conversion happens successfully for the type of file selected in the GUI, this error is thrown. It is advised to keep only files of one type in the input folder, without any sub-folders.


Field | Length (bytes) | Content

RIFF/RIFX chunk
ChunkID | 4 | Contains the letters "RIFF" or "RIFX" in ASCII form. For little endian files it is "RIFF"; for big endian files it is "RIFX".
ChunkSize | 4 | The size of the rest of the chunk following this number, i.e. the size of the entire file in bytes minus 8 bytes for the two fields not included in this count: ChunkID and ChunkSize.
WaveID | 4 | Contains the letters "WAVE".

FORMAT chunk
Subchunk1ID | 4 | Contains the letters "fmt ".
Subchunk1Size | 4 | The size of the rest of the subchunk which follows this number. The value can be 16, 18 or 40 (16 for PCM).
AudioFormat | 2 | PCM = 1 (i.e. linear quantization). Values other than 1 indicate some form of compression; a mulaw file has the value 7.
NumChannels | 2 | Number of interleaved channels. Mono = 1, Stereo = 2.
SampleRate | 4 | Sampling rate (blocks per second): 8000, 16000, 44100, etc.
ByteRate | 4 | Data rate. AvgBytesPerSec = SampleRate * NumChannels * BitsPerSample/8.
BlockAlign | 2 | Data block size (bytes) = NumChannels * BitsPerSample/8; the number of bytes for one sample including all channels.
BitsPerSample | 2 | 8 bits = 8, 16 bits = 16, etc.

Optional portion in FORMAT chunk
cbSize | 2 | Size of the extension (0 or 22). This field is present only if Subchunk1Size is 18 or 40.
ValidBitsPerSample | 2 | Number of valid bits.
ChannelMask | 4 | Speaker position mask.
SubFormat | 16 | GUID, including the data format code.

FACT chunk (all compressed, non-PCM formats must have a fact chunk)
ckID | 4 | Chunk ID: "fact".
cksize | 4 | Chunk size: minimum 4.
SampleLength | 4 | Number of samples (per channel).

Data chunk
ckID | 4 | Contains the letters "data".
cksize | 4 | The number of bytes in the data.
sampled data | n | The actual sound data.
pad byte | 0 or 1 | Padding byte if n is odd.
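As a companion to the table, the canonical 44-byte little-endian PCM WAV header (RIFF form, Subchunk1Size = 16, no optional extension, no fact chunk) can be described with a packed C structure. This is only an illustrative sketch; the sphere converter's own sources may lay the header out differently.

/* Illustrative layout of the canonical 44-byte little-endian PCM WAV header
   (RIFF form, Subchunk1Size = 16, no optional extension, no fact chunk). */
#include <stdint.h>

#pragma pack(push, 1)
typedef struct {
    char     chunk_id[4];        /* "RIFF"                                  */
    uint32_t chunk_size;         /* file size - 8                           */
    char     wave_id[4];         /* "WAVE"                                  */
    char     subchunk1_id[4];    /* "fmt "                                  */
    uint32_t subchunk1_size;     /* 16 for PCM                              */
    uint16_t audio_format;       /* 1 = PCM, 7 = mulaw                      */
    uint16_t num_channels;       /* 1 = mono, 2 = stereo                    */
    uint32_t sample_rate;        /* e.g. 16000                              */
    uint32_t byte_rate;          /* sample_rate * num_channels * bits/8     */
    uint16_t block_align;        /* num_channels * bits/8                   */
    uint16_t bits_per_sample;    /* e.g. 16                                 */
    char     subchunk2_id[4];    /* "data"                                  */
    uint32_t subchunk2_size;     /* number of bytes of sample data          */
} WavHeaderPCM;
#pragma pack(pop)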