TTS Group
Indian Institute of Technology Madras
Chennai - 600036
India
June 5, 2012
Contents

1 Introduction
1.1 Nature of scripts of Indian languages
1.2 Convergence and divergence
2 Text-to-Speech Synthesis
3 Overall Picture
4 Labeling Tool
4.1 How to Install LabelingTool
4.2 Troubleshooting of LabelingTool
5 Labeling Tool User Manual
5.1 How To Use Labeling Tool
5.2 How to do label correction using Labeling tool
5.3 Viewing the labelled file
5.4 Control file
5.5 Performance results for 6 Indian Languages
5.6 Limitations of the tool
1 Introduction
This training is conducted for new members who have joined the TTS consortium. The main aim of the TTS consortium is to develop text-to-speech (TTS) systems in all 22 official languages of India, in order to build screen readers: spoken interfaces for information access that help visually challenged people use a computer with ease, and that make computing ubiquitous and inclusive.
1.1 Nature of scripts of Indian languages
The scripts in Indian languages have originated from the ancient Brahmi script. The basic units of
the writing system are referred to as Aksharas. The properties of Aksharas are as follows:
1. An Akshara is an orthographic representation of a speech sound in an Indian language.
2. Aksharas are syllabic in nature.
3. The typical forms of an Akshara are V, CV, CCV and CCCV, giving the generalized form C*V, where C denotes a consonant and V a vowel.
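As a small illustration of the C*V form, the following shell sketch checks whether a romanized unit fits the pattern "zero or more consonants followed by one vowel". The Latin transliteration (with a, e, i, o, u as vowels) is an invented simplification for illustration only.

```shell
# Check whether romanized units fit the C*V akshara pattern:
# zero or more consonants followed by exactly one final vowel.
# The transliteration scheme here is an invented simplification.
for unit in a ka kra stra kar; do
  if echo "$unit" | grep -Eq '^[^aeiou]*[aeiou]$'; then
    echo "$unit : C*V akshara"
  else
    echo "$unit : not C*V"
  fi
done
```

Note that "kar" fails the check: it ends in a consonant, so it is a syllable but not of the pure C*V form described above.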
As Indian languages are Akshara based, and an akshara is a subset of a syllable, a syllable-based unit selection synthesis system has been built for Indian languages. Further, a syllable corresponds to a basic unit of production, as opposed to the diphone or the phone. Earlier efforts by the consortium members, in particular IIIT Hyderabad and IIT Madras, indicate that natural-sounding synthesisers for Indian languages can be built using the syllable as the basic unit.
1.2 Convergence and divergence
The official languages of India, except English and Urdu, share a common phonetic base, i.e., they
share a common set of speech sounds. This common phonetic base consists of around 50 phones,
including 15 vowels and 35 consonants. While all of these languages share a common phonetic base,
some of the languages such as Hindi, Marathi and Nepali also share a common script known as
Devanagari. But languages such as Telugu, Kannada and Tamil have their own scripts.
The property that makes these languages unique can be attributed to the phonotactics in each
of these languages rather than the scripts and speech sounds. Phonotactics is the permissible
combinations of phones that can co-occur in a language. This implies that the distribution of
syllables encountered in each language is different. Another dimension in which the Indian languages
significantly differ is prosody which includes duration, intonation and prominence associated with
each syllable in a word or a sentence.
2 Text-to-Speech Synthesis
A Text-to-Speech Synthesis system converts text input to speech output. The conversion of text into spoken form is deceptively nontrivial. A naïve approach is to store the basic sounds (also referred to as phones) of a language and concatenate them to produce a speech waveform. But natural speech exhibits co-articulation, i.e., the effect of coupling two sounds together, and prosody at the syllable, word, sentence and discourse level, which cannot be synthesised by simple concatenation of phones.
Another method often employed is to store a huge dictionary of the most common words. However, such a method cannot synthesise the millions of names and acronyms that are not in the dictionary. It also cannot generate appropriate intonation and duration for words in different contexts. Thus a text-to-speech approach using phones provides flexibility but cannot produce intelligible and natural speech, while word-level concatenation produces intelligible and natural speech but is not flexible. To balance flexibility against intelligibility and naturalness, sub-word units such as diphones, which capture the essential coarticulation between adjacent phones, are used as suitable units in a text-to-speech system.
2.1 Architecture of a TTS system
A typical architecture of a Text-to-Speech (TTS) system is shown in the figure below. The components of a text-to-speech system can be broadly categorized into text processing and methods of speech generation.
Text processing. In the real world, the typical input to a text-to-speech system is text as available in electronic documents, newspapers, blogs, emails etc. Real-world text is anything but a sequence of words found in a standard dictionary. It contains several non-standard words such as numbers, abbreviations, homographs and symbols built using punctuation characters, such as the exclamation mark ! or smileys :-). The goal of the text processing module is to process the input text, normalize the non-standard words, predict the prosodic pauses and generate the appropriate phone sequence for each of the words.
2.2 Handling of non-standard words
Real-world text contains words whose pronunciation is typically not found in dictionaries or lexicons, such as IBM, CMU and MSN. Such words are referred to as non-standard words (NSW). The various categories of NSW are:
1. Numbers, whose pronunciation changes depending on whether they refer to currency, time, telephone numbers, zip codes etc.
2. Abbreviations, contractions and acronyms, such as ABC, US, approx., Ctrl-C and lb.
3. Punctuation patterns such as 3-4, +/- and and/or.
4. Dates, times, units and URLs.
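A minimal sketch of rule-based normalization for a few such NSW categories, using sed. The substitution rules below are illustrative only, not the consortium's actual normalization rules.

```shell
# Expand a few non-standard words with simple substitution rules:
# an abbreviation, a contraction, and a currency amount.
normalize() {
  echo "$1" | sed -e 's/Dr\./Doctor/g' \
                  -e 's/approx\./approximately/g' \
                  -e 's/Rs\. \([0-9]*\)/\1 rupees/g'
}
normalize "Dr. Rao paid Rs. 100, approx. half the fee"
```

A real text normalizer needs context (a "Dr." before a name is "Doctor", but in an address it is "Drive"), which is why machine-learnt classifiers are often layered on top of such rules.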
2.3 Grapheme-to-phoneme conversion
Given the sequence of words, the next step is to generate a sequence of phones. For languages
such as Spanish, Telugu, Kannada, where there is a good correspondence between what is written
and what is spoken, a set of simple rules may often suffice. For languages such as English where
the relationship between the orthography and pronunciation is complex, a standard pronunciation
dictionary such as CMU-DICT is used. To handle unseen words, a grapheme-to-phoneme generator
is built using machine learning techniques.
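For the first case, where a set of simple rules suffices, the rules can be as small as a letter-to-phone table. The bash sketch below illustrates the idea; the letter-to-phone mappings are invented for illustration, and real systems use a complete table plus a learnt model for exceptions.

```shell
# Toy rule-based grapheme-to-phoneme conversion for a language with
# near one-to-one orthography: each letter maps to one phone label.
g2p() {
  local word=$1 out="" i
  for (( i=0; i<${#word}; i++ )); do
    case ${word:$i:1} in
      a) out+="/a/ " ;;
      k) out+="/k/ " ;;
      m) out+="/m/ " ;;
      l) out+="/l/ " ;;
      *) out+="/?/ " ;;   # unseen letter: fall back to a learnt model
    esac
  done
  echo "$out"
}
g2p kamal
```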
2.4 Prosodic analysis
Prosodic analysis deals with the modeling and generation of appropriate duration and intonation contours for the given text. This is inherently difficult, since prosody is absent in text. For example, the sentences "where are you going?", "where are you GOING?" and "where are YOU going?" have the same text content but can be uttered with different intonation and duration to convey different meanings. To predict appropriate duration and intonation, the input text needs to be analyzed. This can be performed by a variety of algorithms, including simple rules, example-based techniques and machine learning algorithms. The generated duration and intonation contour can be used to manipulate the context-insensitive diphones in diphone-based synthesis, or to select an appropriate unit in unit selection voices.
2.5 Methods of speech generation
The methods of converting a phone sequence to a speech waveform can be categorized into parametric, concatenative and statistical parametric synthesis.
2.5.1 Parametric synthesis
Parameters such as formants and linear prediction coefficients are extracted from the speech signal of each phone unit. These parameters are modified at synthesis time to incorporate the co-articulation and prosody of a natural speech signal. The required modifications are specified in terms of rules derived manually from observations of speech data. These rules cover duration, intonation, co-articulation and the excitation function. Examples of early parametric synthesis systems are Klatt's formant synthesis and MITalk.
2.5.2 Concatenative synthesis
Deriving rules for parametric synthesis is a laborious task. Also, the quality of speech synthesized using traditional parametric synthesis is found to be robotic. This has led to the development of concatenative synthesis, where examples of speech units are stored and used during synthesis. Concatenative synthesis is based on the concatenation (or stringing together) of segments of recorded speech. Generally, concatenative synthesis produces the most natural-sounding synthesized speech. However, differences between natural variations in speech and the nature of the automated techniques for segmenting the waveforms sometimes result in audible glitches in the output. There are three main sub-types of concatenative synthesis.
1. Unit selection synthesis - Unit selection synthesis uses large databases of recorded speech.
During database creation, each recorded utterance is segmented into some or all of the following: individual phones, diphones, half-phones, syllables, morphemes, words, phrases, and
sentences. Typically, the division into segments is done using a specially modified speech
recognizer set to a forced alignment mode with some manual correction afterward, using
visual representations such as the waveform and spectrogram. An index of the units in the
speech database is then created based on the segmentation and acoustic parameters like the
fundamental frequency (pitch), duration, position in the syllable, and neighboring phones. At
run time, the desired target utterance is created by determining the best chain of candidate
units from the database (unit selection). This process is typically achieved using a specially
weighted decision tree.
Unit selection provides the greatest naturalness, because it applies only a small amount of
digital signal processing (DSP) to the recorded speech. DSP often makes recorded speech
sound less natural, although some systems use a small amount of signal processing at the
point of concatenation to smooth the waveform. The output from the best unit-selection
systems is often indistinguishable from real human voices, especially in contexts for which the
TTS system has been tuned. However, maximum naturalness typically requires unit-selection speech databases to be very large, in some systems ranging into the gigabytes of recorded data, representing dozens of hours of speech. Also, unit selection algorithms have been known
to select segments from a place that results in less than ideal synthesis (e.g. minor words
become unclear) even when a better choice exists in the database. Recently, researchers have
proposed various automated methods to detect unnatural segments in unit-selection speech
synthesis systems.
2. Diphone synthesis - Diphone synthesis uses a minimal speech database containing all the
diphones (sound-to-sound transitions) occurring in a language. The number of diphones
depends on the phonotactics of the language: for example, Spanish has about 800 diphones,
and German about 2500. In diphone synthesis, only one example of each diphone is contained
in the speech database. At runtime, the target prosody of a sentence is superimposed on these
minimal units by means of digital signal processing techniques such as linear predictive coding,
PSOLA or MBROLA. Diphone synthesis suffers from the sonic glitches of concatenative
synthesis and the robotic-sounding nature of formant synthesis, and has few of the advantages
of either approach other than small size. As such, its use in commercial applications is
declining, although it continues to be used in research because there are a
number of freely available software implementations.
3. Domain-specific synthesis - Domain-specific synthesis concatenates prerecorded words and
phrases to create complete utterances. It is used in applications where the variety of texts
the system will output is limited to a particular domain, like transit schedule announcements
or weather reports. The technology is very simple to implement, and has been in commercial
use for a long time, in devices like talking clocks and calculators. The level of naturalness
of these systems can be very high because the variety of sentence types is limited, and they
closely match the prosody and intonation of the original recordings. Because these systems are
limited by the words and phrases in their databases, they are not general-purpose and can only
synthesize the combinations of words and phrases with which they have been preprogrammed.
The blending of words within naturally spoken language however can still cause problems
unless the many variations are taken into account.
For example, in non-rhotic dialects of English the r in words like clear /ˈklɪə/ is usually only pronounced when the following word has a vowel as its first letter (e.g. clear out is realized as /ˌklɪəˈɹʌʊt/). Likewise in French, many final consonants become no longer silent if followed by a
word that begins with a vowel, an effect called liaison. This alternation cannot be reproduced
by a simple word-concatenation system, which would require additional complexity to be
context-sensitive.
The speech units used in concatenative synthesis are typically at diphone level so that the
natural co-articulation is retained. Duration and intonation are derived either manually or
automatically from the data and are incorporated during synthesis time. Examples of diphone
synthesizers are Festival diphone synthesis and MBROLA. The possibility of storing more than
one example of a diphone unit, due to increase in storage and computation capabilities, has led
to development of unit selection synthesis. Multiple examples of a unit along with the relevant
linguistic and phonetic context are stored and used in the unit selection synthesis. The quality
of unit selection synthesis is found to be more natural than diphone and parametric synthesis.
However, unit selection synthesis lacks consistency, i.e., the quality of the output varies.
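As a small illustration of what a diphone inventory must cover, the sketch below lists the diphones (adjacent phone pairs, with "sil" marking utterance-initial and utterance-final silence) in one phone sequence; a language's full inventory is the union of such pairs over all its valid phone sequences. The phone sequence is an invented example.

```shell
# List the diphones in a phone sequence: every adjacent pair of phones,
# including the transitions from and to silence ("sil").
phones=(sil k a m a l sil)
for (( i=0; i<${#phones[@]}-1; i++ )); do
  echo "${phones[$i]}-${phones[$((i+1))]}"
done
```

This is why the inventory size depends on phonotactics rather than just the phone count: only pairs that can actually co-occur in the language need to be recorded.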
2.6 Speech engines, screen readers and typing tools
1. Speech Engine - One of the most widely used speech engines is eSpeak. eSpeak uses the formant synthesis method, which allows many languages to be provided with a small footprint. The synthesized speech is intelligible and the responses are quick, but it lacks naturalness. The demand is for a high-quality, natural-sounding TTS system. We have used the Festival speech synthesis system, developed at The Centre for Speech Technology Research, University of Edinburgh, which provides a framework for building speech synthesis systems and offers full text-to-speech support through a number of APIs. A large-corpus-based unit selection paradigm has been employed. This paradigm is known to produce intelligible, natural-sounding speech output, but has a larger footprint.
2. Screen Readers - The role of a screen reader is to identify and interpret what is being displayed on the screen and transfer it to the speech engine for synthesis. JAWS is the most popular screen reader used worldwide for Microsoft Windows based systems. But the main drawback of this software is its high cost, approximately 1300 USD, whereas the average per capita income in India is 1045 USD. Different open source screen readers are freely available. We chose ORCA for Linux based systems and NVDA for Windows based systems. ORCA is a flexible screen reader that provides access to the graphical desktop via user-customizable combinations of speech, braille and magnification. ORCA supports the Festival GNOME speech synthesizer and comes bundled with popular Linux distributions like Ubuntu and Fedora. NVDA is a free screen reader which enables vision impaired people to access computers running Windows. NVDA is popular among the members of the AccessIndia community. AccessIndia is a mailing list which provides an opportunity for visually impaired computer users in India to exchange information as well as conduct discussions related to assistive technology and other accessibility issues. NVDA has already been integrated with the Festival speech engine by Olga Yakovleva.
3. Typing tool for Indian Languages - Typing tools map the QWERTY keyboard to Indian language characters. Widely used tools for inputting data in Indian languages are the Smart Common Input Method (SCIM) and the inbuilt InScript keyboard, for Linux and Windows systems respectively. The same have been used for our TTS systems as well.
2.7 Motivation
India is home to the world's largest visually challenged (VC) population. In today's digital world, disability is equated with inability. Little attention is paid to people with disabilities, and their social inclusion and acceptance is always a challenge. The perceived inability of people with disability, the perceived cost of special education and attitudes towards inclusive education are major constraints on the effective delivery of education. Education is THE means of developing the capabilities of people with disability: to enable them to develop their potential, become self-sufficient, escape poverty and gain entry to fields previously denied to them. The aim of this project is to make a difference in the lives of VC persons. VC persons need to depend on others to access common information that others take for granted, such as newspapers, bank statements and scholastic transcripts. Assistive technologies (AT) are necessary to enable physically challenged persons to become part of the mainstream of society. A screen reader is an assistive technology potentially useful to people who are visually challenged, visually impaired, illiterate or learning disabled, enabling them to use standard computer software such as word processors, spreadsheets, email and the Internet.
Before the start of this project, Indian Institute of Technology, Madras (IIT Madras) had
been conducting a training programme for visually challenged people, to enable them to use the
computer using the screen reader JAWS, with English as the language. Although the VC persons have benefited from this programme, most of them felt that:
The English accent was difficult to understand.
Most students would have preferred a reader in their native language.
They would prefer English spoken in an Indian accent.
The price for the individual purchase of JAWS was very high.
Against this backdrop, it was felt imperative to build assistive technologies in the vernacular.
An initiative was taken by DIT, Ministry of Information Technology, to sponsor the development of:
1. natural-sounding text-to-speech synthesis systems in different Indian languages, and
2. their integration with open source screen readers.
3 Overall Picture
1. Data Collection - Text crawled from a news site and a site for stories for children.
2. Cleaning up of Data - From the crawled data sentences were picked to maximize syllable
coverage.
3. Recording - The sentences that were picked were then recorded in a studio which was a
completely noise-free environment.
4. Labeling - The wavefiles were then manually labeled using the semi-automatic labeling tool
to get accurate syllable boundaries.
5. Training - Using the wavefiles and their transcriptions, the Indian language unit selection voice was built.
6. Testing - Using the voice built, a MOS test was conducted with visually challenged end users
as the evaluators.
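The six stages above can be strung together as a driver script. The function bodies below are placeholders that only name each stage, since the consortium's actual scripts are not part of this document; a real driver would replace each echo with the corresponding tool invocation.

```shell
# Placeholder pipeline: each stage is a stub naming the work done there.
collect_data()   { echo "1. crawl text from news and children's story sites"; }
clean_data()     { echo "2. pick sentences maximizing syllable coverage"; }
record_prompts() { echo "3. record picked sentences in a noise-free studio"; }
label_waves()    { echo "4. label wavefiles with the semi-automatic tool"; }
train_voice()    { echo "5. build the Indian language unit selection voice"; }
run_mos_test()   { echo "6. MOS test with visually challenged evaluators"; }

collect_data && clean_data && record_prompts && label_waves && train_voice && run_mos_test
```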
4 Labeling Tool
It is a widely accepted fact that the accuracy of labeling of speech files has a great bearing on the quality of unit selection synthesis. The process of manual labeling is a time-consuming and daunting task, and it is not trivial to label waveforms manually at the syllable level. The DONLabel labeling tool provides an automatic way of performing labeling, given an input waveform and the corresponding text in UTF-8 format. The tool makes use of group delay based segmentation to provide the segment boundaries. The size of the segment labels generated can vary from monosyllables to polysyllables as the Window Scale Factor (WSF) parameter is varied from small to large values. Our labeling process makes use of:
the Ergodic HMM (EHMM) labeling procedure provided by Festival,
the group delay based algorithm (GD), and
the Vowel Onset Point (VOP) detection algorithm.
The labeling tool displays a panel showing the segment boundaries estimated by the group delay algorithm, another panel showing the segment boundaries estimated by the EHMM process, and a panel for VOP, which shows how many vowel onset points are present between each pair of segments provided by the group delay algorithm. This helps greatly in adjusting the labels provided by the group delay algorithm, if necessary, by comparing the labeling outputs of the EHMM process and the VOP algorithm. By using VOP as an additional cue, manual intervention during the labeling process can be eliminated, and the accuracy of the labels generated by the labeling tool improved.
The tool works for six Indian languages, namely:
Hindi
Tamil
Malayalam
Marathi
Telugu
Bengali
The tool also displays the text (utf8) in segmented format along with the speech file.
4.1 How to Install LabelingTool
1. Copy the html folder to the /var/www folder. If the www folder is not present in /var, create a folder named www and extract the html folder into it, so that the labelingTool code is in /var/www/html/labelingTool/
2. Install the Java compiler using the following command:
sudo apt-get install sun-java6-jdk
The following error may appear:
Reading package lists... Done
Building dependency tree
Reading state information... Done
Package sun-java6-jdk is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or is only available from another source.
For Ubuntu 10.10, the sun-java6 packages have been dropped from the Multiverse section of
the Ubuntu archive. It is recommended that you use openjdk-6 instead.
If you cannot switch from the proprietary Sun JDK/JRE to OpenJDK, you can install the sun-java6 packages from the Canonical Partner Repository. You can configure your system to use this repository via the command line:
sudo add-apt-repository "deb http://archive.canonical.com/ maverick partner"
sudo apt-get update
sudo apt-get install sun-java6-jre sun-java6-plugin
sudo apt-get install sun-java6-jdk
sudo update-alternatives --config java
If the above does not work (for other versions of Ubuntu), you can create a local repository as follows:
cd /
wget https://github.com/flexiondotorg/oab-java6/raw/0.2.1/oab-java6.sh -O oab-java6.sh
chmod +x oab-java6.sh
sudo ./oab-java6.sh
and then run:
sudo apt-get install sun-java6-jdk
sudo apt-get install sun-java6-jre
Source :
https://github.com/flexiondotorg/oab-java6/blob/a04949f242777eb040150e53f4dbcd4a3ccb7568/
README.rst
3. Install PHP using the following command:
sudo apt-get install php5
4. Install apache2 using the following command:
sudo apt-get install apache2
Update the paths in the following file:
/etc/apache2/sites-available/default
Set all cgi-bin paths to /var/www/html/cgi-bin. A sample default file is attached.
5. Install speech-tools using the following command:
sudo apt-get install speech-tools
6. Install tcsh using the following command:
sudo apt-get install tcsh
7. Enable JavaScript in the browser's settings. Use Google Chrome or Mozilla Firefox.
8. Install the Java plugin for the browser:
sudo apt-get install sun-java6-plugin
Create a symbolic link to the Java plugin libnpjp2.so using the following commands:
sudo ln -s /usr/lib/jvm/java-6-sun/jre/plugin/i386/libnpjp2.so /etc/alternatives/mozilla-javaplugin.so
sudo ln -s /etc/alternatives/mozilla-javaplugin.so /usr/lib/mozilla/plugins/libnpjp2.so
9. Give full permissions to the html folder:
sudo chmod -R 777 html/
10. Add the following code to /etc/java-6-sun/security/java.policy
grant {
permission java.security.AllPermission;
};
11. In the /var/www/html/labelingTool/jsrc/install file, make sure that the correct path of javac is provided as per your installation, for example /usr/lib/jvm/java-6-sun-1.6.0.26/bin/javac. The Java version here is 1.6.0.26; it might be different in your installation. Check the path and give the correct value.
12. Install the tool: go to /var/www/html/labelingTool/jsrc and run
sudo ./install
It might give the following output, which is not an error:
Note: LabelingTool.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
13. Restart apache using the following command
sudo /etc/init.d/apache2 restart
14. Check whether the Java applet is enabled in the browser by visiting the following link:
http://javatester.org/enabled.html
On that webpage, the LIVE box should display "This web browser can indeed run Java applets" (wait for some time for the display to appear). If it displays "This web browser can NOT run Java applets", there is some issue with Java applets; please look up how to enable Java in your version of the browser and fix the issue.
15. Replace Pronunciation Rules.pl in the /var/www/html/labelingTool folder with your language-specific code (the file name must remain Pronunciation Rules.pl).
16. Open the browser and go to the following link
http://localhost/main.php
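After completing the steps above, a quick sanity check that the required tools are on PATH can save debugging time. This is a minimal sketch; the tool names are taken from the installation steps above, and on some systems apache2 lives in /usr/sbin rather than on the default PATH.

```shell
# Verify that the LabelingTool's dependencies are installed and reachable.
for cmd in javac php apache2 tcsh ch_wave; do
  if command -v "$cmd" >/dev/null 2>&1; then
    echo "$cmd: found"
  else
    echo "$cmd: MISSING"
  fi
done
```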
NOTE: The VOP algorithm is not used in the current version of the labelingTool, so please ignore anything related to VOP in the sections below.
4.2 Troubleshooting of LabelingTool
1. When the LabelingTool is working fine, the following files will be generated in the labelingTool/results folder:
boundary
segments
spec low
vop
wav sig
gd spectrum low
segments indicator
tmp.seg
vopsegments
2. When the boundaries are manually updated (deleted, added or moved) and saved, two more files get created in the results folder:
ind upd
segments updated
3. If, after manually updating and saving, the vopUpdate button is clicked, another new file gets created in the results folder:
vop updated
4. If a file named vop is not generated in the labelingTool/results folder and the labelit.php page gets stuck, you need to compile the vop module. Follow the steps below:
(a) cd /var/www/html/labelingTool/VopLab
(b) make -f MakeEse clean
(c) make -f MakeEse
(d) cd bin
(e) cp Multiple Vopd ../../bin/
5. If the above files are not getting created, you can try running the programs from the command line. Execute them from the /var/www/html/labelingTool/bin folder.
The command line usage of the WordsWithSilenceRemoval program is as follows:
WordsWithSilenceRemoval ctrlFile waveFile sigFile spectraFile boundaryFile thresSilence(ms) thresVoiced(ms)
Example:
./WordsWithSilenceRemoval fewords2.base /home/text 0001.wav ../results/spec ../results/boun 100 100
Two files named spec and boun have to be generated in the results folder. If they are not created, try recompiling:
cd /var/www/html/labelingTool/Segmentation
make -f MakeWordsWithSilenceRemoval clean
make -f MakeWordsWithSilenceRemoval
cp bin/WordsWithSilenceRemoval /var/www/html/labelingTool/bin/
The command line usage of the Multiple Vopd program is as follows:
Multiple Vopd ctrlFile waveFile(in sphere format) segmentsFile vopFile
Example:
./Multiple Vopd fe-ctrl.ese ../results/wav ../results/segments ../results/vop
The file wav in the results folder is already the sphere format of your input wavefile.
On running the Multiple Vopd binary, a file named vop has to be generated in the results folder.
6. If the file wav is not produced in the results folder, speech tools are not installed. To check whether speech tools are installed, run:
ch_wave -info <wave file name>
This command should give information about that wave file. If speech tools was installed along with Festival and there is no link to it in /usr/bin, please make a link in /usr/bin pointing to the ch_wave binary.
7. To check whether tcsh is installed, type the command tcsh; a new prompt should appear.
8. Provide full permissions to the labelingTool folder and its sub-folders so that new files can be created and updated without any permission issues. If required, the following commands can be used in the labelingTool folder:
chmod -R 777 *
chown -R root:root *
9. The java.policy file should be updated as specified in the installation steps; otherwise labeling may fail with the error "Error writing Lab File".
10. When the lab file is viewed in the browser, if UTF-8 text is not displayed correctly, set the browser's character encoding to UTF-8:
Tools -> Options -> Content -> Fonts and Colors (Advanced menu) -> Default Character Encoding (UTF-8)
Then restart the browser.
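Several of the checks above can be scripted. The sketch below reports which of the expected output files from step 1 are present in the results folder; the RESULTS variable can be overridden to point elsewhere, and the file list is a subset of the names given in step 1.

```shell
# Report presence of files the LabelingTool is expected to generate.
RESULTS=${RESULTS:-/var/www/html/labelingTool/results}
for f in boundary segments vop wav tmp.seg; do
  if [ -e "$RESULTS/$f" ]; then
    echo "$f: present"
  else
    echo "$f: missing"
  fi
done
```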
5 Labeling Tool User Manual
5.1 How To Use Labeling Tool
The front page of the tool can be opened using the URL http://localhost/main.php. A screenshot of the front page is shown below.
Loading Page
On clicking the submit button on the front page, the following page will be displayed.
If a numeric value is not entered for the unvoiced or voiced segment thresholds on the front page, the loading page will show the error "Numeric value must be entered for thresholds".
The wav file loaded will be copied to the /var/www/html/UploadDir folder as text.wav.
The lab file (EHMM label file) will be copied to the /var/www/html/labelingTool/lab folder as temp.lab. If an error occurs while moving it to the lab folder, the error "Error moving lab file" will be displayed.
The gd lab file (group delay label file) will be copied to the /var/www/html/labelingTool/results folder with the name gd lab. If an error occurs while moving it, the error "Error moving gdlab file" will be displayed.
The Labelit Page
On clicking the view button on the loading page, the labelit page will be loaded. A screenshot of this page, along with markings for each panel, is given below.
Note: If the error message "Error reading file http://localhost/labelingTool/tmp/temp.wav" appears, it means that some other file (e.g. a text file) was uploaded in place of the wav file.
Panels on the Labelit Page
It has 6 main panels:
EHMM Panel displays the lab files generated by Festival using the EHMM algorithm while
building voices.
Slider Panel using this panel we can slide, delete or add segments/labels.
Wave Panel displays the speech waveform in segmented form. (Note: the speech waveform
does not appear as it does in Wavesurfer; this is because of limitations in Java.)
Text Panel displays the segmented text (in UTF-8 format) with the syllable as the basic unit.
GD Panel draws the group delay curve, the result of the group delay algorithm. Wherever a
peak appears is considered to be a segment boundary.
VOP Panel shows the number of vowel onset points found between each pair of segment
boundaries provided by group delay. Green corresponds to one vowel onset point, meaning
the segment boundary found by the group delay algorithm is correct. Red corresponds to
zero vowel onset points, meaning the boundary found by the group delay algorithm is wrong
and needs to be deleted. Yellow corresponds to more than one vowel onset point, meaning
that between the two boundaries found by the group delay algorithm there should be one or
more additional boundaries.
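The colour coding above can be summarized in a short sketch (a hypothetical helper written for illustration; it is not part of the tool's source):

```python
def vop_color(n_onsets):
    """Map the number of vowel onset points (VOPs) found between two
    group-delay boundaries to the VOP panel colour described above."""
    if n_onsets == 1:
        return "green"   # boundary confirmed by exactly one VOP
    if n_onsets == 0:
        return "red"     # spurious boundary: should be deleted
    return "yellow"      # more than one VOP: boundaries are missing
```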
Resegment
The WSF selected for this example is 5. A different wsf will produce a different set of
boundaries: the smaller the wsf, the greater the number of boundaries, and vice versa. To
experiment with different wsf values, select a WSF from the drop-down list and click
RESEGMENT. A screenshot for the same text (as in the above figure) with a greater wsf
selected is shown below.
The above figure shows the segmentation using wsf = 12, which gives fewer boundaries. The
figure below shows the same waveform with a smaller wsf (wsf = 3), which gives more
boundaries.
So the ideal wsf for the waveform has to be found. An easy way is to check that the text
segments reach approximately to the end of the waveform (no text segments are missing, and
there are not many segments without text).
Menu Bar
The menu bar is just above the EHMM Panel, under the heading Waveform. The menu bar
contains the following buttons, in order from left to right:
Save button The lab file can be saved using the save button. After making any changes to
the segments (deletion, addition or dragging), the save button has to be clicked.
Play the waveform The entire wave file will be played on pressing this button
Play the selection Select some portion of the waveform (say a segment) and play just
that part using this button. This button can be used to verify each segment.
Play from selection Play the waveform starting from the current selection to the end.
Click the mouse on the waveform and a yellow line will appear to show the selection.
On clicking this button, the waveform is played from the selected point to the end of the file.
Play to selection Plays the waveform from the beginning to the end of the current
selection
Stop the playback Stops playing the wave file
Zoom to fit Display the selected portion of the wave zoomed in
Zoom 1 Display the entire wave
Zoom in Zoom in on the wave
Zoom out Zoom out on the wave
Update VOP Panel After changing the segments (dragging, adding or deleting), clicking
this button recalculates the VOP algorithm on the new set of segments. After making the
changes, the save button must be pressed before updating the VOP panel.
Some screenshots are given below to demonstrate the use of the menu bar.
The figure below shows how to select a portion of the waveform (drag with the mouse on the
wave panel) and play that part. The selected portion appears shaded in yellow, as shown.
The next figure shows how to select a point (click with the mouse on the wave panel) and
play from the selection to the end of the file. The selected point appears as a yellow line.
The next figure shows how to select a portion of the wave and zoom to fit.
The figure after it shows how the portion of the wave file selected above appears when zoomed.
5.2 How to do label correction using Labeling tool
With the help of the VOP and EHMM panels, each segment given by the group delay can be listened to, to decide whether the segment is correct and whether it matches the text segmentation.
Deletion of a Segment
All the segments appear as red lines in the labeling tool output. A segment can be deleted
by right-clicking on that particular segment in the slider panel. The figure below shows the
original output of the labeling tool for the Hindi wave file.
The third and fourth segments are very close to each other and one has to be deleted; ideally
we delete the fourth one. The VOP has given a red colour (an indication to delete one) for
that segment. The user can decide whether to delete the right or left boundary of the red
segment after listening.
On deletion (right-click on the segment head in the slider panel) of the fourth segment, the
text segments get shifted and fit after the silence segment, as shown in the figure below.
On listening to each segment, it is seen that the segment between and is wrong and has
to be deleted. The VOP gives a red colour for that segment, and the corresponding peak in
the group delay panel is below the threshold. Peaks below the threshold in the group delay
curve usually will not be segment boundaries, but sometimes the algorithm computes one as
a boundary. The threshold value in the GD panel is the middle line, in magenta.
There are 2 more red columns in the VOP. The last one is correct and we have to delete a
segment. The second-last red column in the VOP is incorrect and GD gives the correct
segment, hence it need not be deleted. The VOP is always used as a reference for the GD
algorithm, but it can be wrong in some cases. A yellow colour in the VOP usually says to
add a new segment, but here the yellow colour appears in the silence region, so we ignore it.
The figure below shows the corrected segments (after deletion)
On completion of correcting the labels, the save button has to be pressed. On clicking the
Save button, a dialog box appears with the message Lab File Saved Click Next to Continue.
A silence segment gets deleted on clicking the right boundary of the silence segment.
Update VOP Panel
After saving the changes made to the labels, the VOP update button has to be clicked to
recalculate the VOP algorithm on the new segments. The updated output is shown in the
figure below.
Adding A Segment
A segment can be added by right-clicking with the mouse on the slider panel at the point
where a segment needs to be added. The figure below shows a case in which a segment needs
to be added.
The VOP shows three yellow columns here, of which the second is true. The GD plot shows
a small peak in that segment, so we can be sure that the segment has to be added at the
peak. In the above figure it can be seen that the mouse is placed on the slider panel at the
location where the new segment is to be added. The figure below shows the corresponding
corrected wave file after the VOP update is done.
Sliding a Segment
A segment can be moved to left or right by clicking on the head of the segment boundary on
the slider panel and dragging left or right. Sliding can be used if required while correcting
the labels.
Modification of labfile If a half-corrected lab file is already present (a gd lab file), upload
it from the ./labelingTool/labfiles directory using the gd lab file option on the main page.
Irrespective of the wsf value, the earlier lab file will be loaded. But if resegmentation is used,
the existing labels will be lost and regenerated based on the new wsf value. After modification,
when the Save button is pressed the same labfile is updated, but a backup copy of the lab
file is created before updating.
Note: If the system creates a lab file with the same name as one that already exists in the
labfiles directory, the system creates a backup copy of that file. The backup copy is hidden
by default; to view it, press CTRL + h.
Logfiles The tool generates a separate log file for each lab file (e.g. text0001.log) in the ./labelingTool/logfiles directory. Please clean this directory periodically.
5.3 Viewing the labelled file
Once the corrections are made and the save button is clicked, the lab file is generated in the /var/www/html/labelingT
directory and can be viewed by clicking on the next link. The following message comes on
clicking next: Download the labfile: labfile. Click on the link labfile; the lab file will appear
in the browser window as below.
5.4 Control file
A control file is placed at /var/www/html/labelingTool/bin/fewords.base. The parameters in
the control file are given below; they can be adjusted by the user to get better segmentation
results.
windowSize size of frame for energy computation
waveType type of the waveform: 0 Sphere PCM; 1 Sphere Ulaw; 2 plain sample, one short
integer per line; 3 RAW, sequence of bytes, 8 bits per sample; 4 RAW16, two bytes/sample,
Big Endian; 5 RAW16, two bytes/sample, Little Endian; 6 Microsoft RIFF standard wav format
winScaleFactor should be chosen based on syllable rate choose by trial and error
gamma reduces the dynamic range of energy
fftOrder and fftSize MUST be set to ZERO!!
frameAdvanceSamples frameshift for energy computation
medianOrder order of median smoothing for group delay function 1==> no smoothing
thresEnergy, thresZero, thresSpectralFlatness thresholds used for voiced/unvoiced detection.
When a parameter is set to zero, it is NOT used. The examples were tested with ENERGY only.
The sampling rate of the signal is required for giving boundary information in seconds.
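Putting the parameters together, a fewords.base control file might look like the following. All values shown here are illustrative assumptions for a 16 kHz RIFF wav file, not the shipped defaults:

```
windowSize            10
waveType              6
winScaleFactor        5
gamma                 0.1
fftOrder              0
fftSize               0
frameAdvanceSamples   80
medianOrder           5
thresEnergy           0.5
thresZero             0
thresSpectralFlatness 0
samplingRate          16000
```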
5.5 Performance results for 6 Indian Languages
Testing was conducted on a set of test sentences for all 6 Indian languages, and the percentage
of correctness was calculated using the formula below. The calculations were done after the
segmentation was done using the tool with the best wsf and threshold values.
Percentage of Correctness = [1 - ... ] x 100

Language     Percentage of Correctness
Hindi        86.83%
Malayalam    78.68%
Telugu       85.40%
Marathi      80.24%
Bengali      77.84%
Tamil        77.38%

5.6 Limitations of the tool
This chapter discusses some of the options for building waveform synthesizers using unit selection
techniques in Festival.
By unit selection we actually mean the selection of some unit of speech, which may be anything
from a whole phrase down to a diphone (or even smaller). Technically, diphone selection is a simple
case of this. However, typically what we mean is that unlike diphone selection, in unit selection there
is more than one example of the unit, and some mechanism is used to select between them at run time.
The theory is obvious, but the design of such systems, finding the appropriate selection criteria, and
weighting the costs of relative candidates is a non-trivial problem. Techniques like this often produce
very high quality, very natural sounding synthesis, but they can also produce some very bad synthesis
when the database has unexpected holes and/or the selection costs fail.
6.1 Cluster unit selection
The idea is to take a database of general speech and try to cluster each phone type into groups
of acoustically similar units based on the (non-acoustic) information available at synthesis time,
such as phonetic context, prosodic features (F0 and duration) and higher level features such as
stressing, word position, and accents. The actual features used may easily be changed and experimented with, as can the definition of acoustic distance between the units in a cluster.
The basic processes involved in building a waveform synthesizer for the clustering algorithm are
as follows. A high level walkthrough of the scripts to run is given after these lower level details.
1. Collect the database of general speech.
2. Build the utterance structures.
3. Build coefficients for acoustic distances, typically some form of cepstrum plus F0, or some
pitch synchronous analysis (e.g. LPC).
4. Build distance tables, precalculating the acoustic distance between each unit of the same
phone type.
5. Dump selection features (phone context, prosodic, positional and whatever) for each unit
type.
6. Build cluster trees using wagon with the features and acoustic distances dumped by the
previous two stages.
7. Build the voice description itself.
6.2 Choosing the right unit type
Before you start you must make a decision about what unit type you are going to use. Note there
are two dimensions here. First is size, such as phone, diphone, demi-syllable. The second is the type
itself, which may be simple phone, phone plus stress, phone plus word, etc. The code here and the
related files basically assume the unit size is the phone. However, because you may also include a
percentage of the previous unit in the acoustic distance measure, this unit size is effectively phone
plus previous phone, thus it is somewhat diphone-like. The cluster method has no actual restrictions
on the unit size; it simply clusters the given acoustic units with the given features, but the basic
synthesis code assumes phone-sized units.
(define (clunit_name i)
  (string-append
   (item.name i)
   "_"
   (item.feat i "p.name")))
Thus the unittype would be the phone plus its previous phone. Note that the first part of a
unit name is assumed to be the phone name in various parts of the code; thus although you may
think it would be neater to return previousphone_phone, that would mess up some other parts of
the code.
In the limited domain case the word is attached to the phone. You can also consider some
demisyllable information or more to differentiate between different instances of the same phone.
The important thing to remember is that at synthesis time the same function is called to identify the unittype, which is used to select the appropriate cluster tree to select from. Thus you need
to ensure that if you use, say, diphones, your database really does have all diphones in it.
6.3 Collecting databases for unit selection
Unlike diphone databases, which are carefully constructed to ensure specific coverage, one of the
advantages of unit selection is that a much more general database is desired. However, although
voices may be built from existing data not specifically gathered for synthesis, there are still factors
about the data that will help make better synthesis.
As with diphone databases, the more cleanly and carefully the speech is recorded, the better the
synthesized voice will be. As we are going to be selecting units from different parts of the database,
the more similar the recordings are, the less likely bad joins will occur. However, unlike diphone
databases, prosodic variation is probably a good thing, as it is those variations that can make
synthesis from unit selection sound more natural. Good phonetic coverage is also useful, at least
phone coverage if not complete diphone coverage. Also, synthesis using these techniques seems to
retain aspects of the original database. If the database is broadcast news stories, the synthesis from
it will typically sound like read news stories (or more importantly will sound best when it is reading
news stories).
Again the notes about recording the database apply, though it will sometimes be the case that
the database is already recorded and beyond your control; in that case you will always have something legitimate to blame for poor quality synthesis.
6.4 Preliminaries
Throughout our discussion we will assume the following database layout. It is highly recommended
that you follow this format, otherwise scripts and examples will fail. There are many ways to
organize databases and many of the choices are arbitrary; here is our arbitrary layout.
The basic database directory should contain the following directories
bin/ Any database specific scripts for processing. Typically this first contains a copy of
standard scripts that are then customized when necessary to the particular database.
wav/ The waveform files. These should be headered, one utterance per file, with a standard
naming convention. They should have the extension .wav and a fileid consistent with all other files
throughout the database (labels, utterances, pitch marks etc).
lab/ The segmental labels. These are usually the master label files; they may contain more
information than the labels used by Festival, which will be in festival/relations/Segment/.
lar/ The EGG files (laryngograph files), if collected.
pm/ Pitchmark files, as generated from the lar files or from the signal directly.
festival/ Festival specific label files.
festival/relations/ The processed label files for building Festival utterances, held in directories whose name reflects the relation they represent: Segment/, Word/, Syllable/ etc.
festival/utts/ The utterance files as generated from the festival/relations/ label files.
Other directories will be created for various processing reasons.
6.5 Building utterance structures
In order to make access well defined, you need to construct Festival utterance structures for each
of the utterances in your database. This (in its basic form) requires labels for segments, syllables,
words, phrases, F0 targets, and intonation events. Ideally these should all be carefully hand labeled,
but in most cases that's impractical. There are ways to automatically obtain most of these labels,
but you should be aware of the inherent errors in the labeling system you use (including labeling
systems that involve human labelers). Note that when a unit selection method fundamentally uses
segment boundaries, its quality is going to be ultimately determined by the quality of the segmental
labels in the databases.
For the unit selection algorithm described below, the segmental labels should use the same
phoneset as used in the actual synthesis voice. However, a more detailed phonetic labeling may be
more useful (e.g. marking closures in stops), mapping that information back to the phone labels
before actual use. Autoaligned databases typically aren't accurate enough for use in unit selection.
Most autoaligners are built using speech recognition technology, where actual phone boundaries are
not the primary measure of success. General speech recognition systems primarily measure words
correct (or, more usefully, semantic correctness) and do not require phone boundaries to be accurate.
If the database is to be used for unit selection, it is very important that the phone boundaries
are accurate. Having said this, we have successfully used the aligner described in the diphone
chapter above to label general utterances where we knew which phone string we were looking
for. Using such an aligner may be a useful first pass, but the result should always be checked by hand.
It has been suggested that aligning techniques and unit selection training techniques can be used
to judge the accuracy of the labels and basically exclude any segments that appear to fall outside
the typical range for the segment type. Thus, it is believed that unit selection algorithms should
be able to deal with a certain amount of noise in the labeling. This is the desire of researchers in
the field, but we are some way from that, and at present the easiest way to improve the quality of
unit selection algorithms is to ensure that segmental labeling is as accurate as possible.
Once we have a better handle on selection techniques themselves, it will then be possible to start
experimenting with noisy labeling.
However, it should be added that this unit selection technique (and many others) supports what
is termed optimal coupling, where the acoustically most appropriate join point is found automatically at run time when two units are selected for concatenation. This technique is inherently
robust to boundary labeling errors of at least a few tens of milliseconds.
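As a rough illustration of the idea (a toy sketch, not Festival's implementation), optimal coupling can be pictured as a search over candidate join frames near the nominal boundary:

```python
import math

def frame_dist(f1, f2):
    """Euclidean distance between two cepstral frames."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))

def optimal_coupling(unit_a, unit_b, search=3):
    """Slide the join point over the last `search` frames of unit_a and the
    first `search` frames of unit_b, returning the (i, j) frame pair that is
    acoustically closest.  A mismatch of a few frames is absorbed here,
    which is why the method tolerates small boundary-labeling errors."""
    best = None
    for i in range(max(0, len(unit_a) - search), len(unit_a)):
        for j in range(min(search, len(unit_b))):
            d = frame_dist(unit_a[i], unit_b[j])
            if best is None or d < best[0]:
                best = (d, i, j)
    return best[1], best[2]
```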
For the cluster method defined here it is best to construct more than simply segments, durations
and an F0 target. A whole syllabic structure plus word boundaries, intonation events and phrasing
allow a much richer set of features to be used for clusters. See the Section called Utterance building
in the Chapter called A Practical Speech Synthesis System for a more general discussion of how to
build utterance structures for a database.
6.6 Making cepstrum parameter files
In order to cluster similar units in a database, we build an acoustic representation of them. This
is also still a research issue, but in the example here we will use Mel cepstrum. Interestingly,
we do not generate these at fixed intervals, but at pitch marks. Thus we have a parametric spectral
representation of each pitch period. We have found this a better method, though it does require
that pitchmarks are reasonably well identified.
Here is an example script which will generate these parameters for a database; it is included in
festvox/src/unitsel/make_mcep.
for i in $*
do
   fname=`basename $i .wav`
   echo $fname MCEP
   $SIG2FV $SIG2FVPARAMS -otype est_binary $i -o mcep/$fname.mcep -pm pm/$fname.pm -window_type hamming
done
The above builds coefficients at fixed frames. We have also experimented with building parameters pitch synchronously and have found a slight improvement in the usefulness of the measure
based on this. We do not pretend that this part is particularly neat in the system, but it does work.
When pitch synchronous parameters are built, the clunits module will automatically put the local
F0 value in coefficient 0 at load time. This happens to be appropriate for LPC coefficients. The
script in festvox/src/general/make_lpc can be used to generate the parameters, assuming you have
already generated pitch marks.
Note a secondary advantage of using LPC coefficients: they are required anyway for LPC
resynthesis, so this allows less information about the database to be required at run time.
We have not yet tried pitch synchronous Mel frequency cepstrum coefficients, but that should be
tried. Also, a more general duration/number-of-pitch-periods match algorithm is worth defining.
6.7 Building the clusters
Cluster building is mostly automatic. Of course you need the clunits module compiled into your
version of Festival. Version 1.3.1 or later is required; the version of clunits in 1.3.0 is buggy and
incomplete and will not work. To compile in clunits, add
ALSO_INCLUDE += clunits
to the end of your festival/config/config file, and recompile. To check if an installation already
has support for clunits, check the value of the variable *modules*.
The file festvox/src/unitsel/build_clunits.scm contains the basic parameters to build a cluster model for a database that has utterance structures and acoustic parameters. The function
build_clunits will build the distance tables, dump the features and build the cluster trees. Many
parameters are set for the particular database (and instance of cluster building) through
the Lisp variable clunits_params. A reasonable set of defaults is given in that file, and reasonable
runtime parameters will be copied into festvox/INST_LANG_VOX_clunits.scm when a new voice
is set up.
The function build_clunits runs through all the steps, but in order to better explain what is going
on, we will go through each step and explain which parameters affect it.
The first stage is to load in all the utterances in the database, sort them into segment types and
name them with individual names (as TYPE_NUM). This first stage is required for all other stages,
so even if you are not running build_clunits you still need to run this stage first. This is done by
the calls
(format t "Loading utterances and sorting types\n")
(set! utterances (acost:db_utts_load dt_params))
(set! unittypes (acost:find_same_types utterances))
(acost:name_units unittypes)
though the function build_clunits_init will do the same thing.
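The sorting-and-naming stage can be pictured with a small sketch (illustrative Python, not the actual Scheme code):

```python
def name_units(segment_labels):
    """Group segments by type and give each an individual name of the form
    TYPE_NUM, as the first stage described above does."""
    counts = {}
    named = []
    for seg in segment_labels:
        n = counts.get(seg, 0)
        named.append(f"{seg}_{n}")
        counts[seg] = n + 1
    return named
```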
This uses the following parameters
name STRING
A name for this database.
db_dir FILENAME
The pathname of the database, typically . as in the current directory.
utts_dir FILENAME
The directory containing the utterances.
utts_ext FILENAME
The file extension for the utterance files.
files
The list of file ids in the database.
For example, for the KED example these parameters are
(name ked_timit)
(db_dir /usr/awb/data/timit/ked/)
(utts_dir festival/utts/)
(utts_ext .utt)
(files (kdt_001 kdt_002 kdt_003 ... ))
In the examples below the list of fileids is extracted from the given prompt file at call time. The
next stage is to load the acoustic parameters and build the distance tables. The acoustic distance
between each segment of the same type is calculated and saved in the distance table. Precalculating
this saves a lot of time, as the clustering will require these distances many times.
This is done by the following two function calls
(format t "Loading coefficients\n")
(acost:utts_load_coeffs utterances)
(format t "Building distance tables\n")
(acost:build_disttabs unittypes clunits_params)
The following parameters influence the behaviour.
coeffs_dir FILENAME
The directory (from db_dir) that contains the acoustic coefficients as generated by the script
make_mcep.
coeffs_ext FILENAME
The file extension for the coefficient files.
get_std_per_unit
Takes the value t or nil. If t, the parameters for each segment type are normalized using the
means and standard deviations for that class; thus a mean Mahalanobis (rather than simply
Euclidean) distance is found between units. The recommended value is t.
ac_left_context FLOAT
The amount of the previous unit to be included in the distance; 1.0 means all, 0.0 means
none. This parameter may be used to make the acoustic distance sensitive to the previous acoustic
context.
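The effect of ac_left_context can be sketched as follows (illustrative Python under an assumed frame-list representation; the real computation operates on Festival's coefficient tracks):

```python
def with_left_context(prev_frames, frames, ac_left_context):
    """Prefix a unit's frames with a fraction of the previous unit's frames,
    so the acoustic distance becomes sensitive to left context
    (1.0 includes all of the previous unit, 0.0 none of it)."""
    k = int(len(prev_frames) * ac_left_context)
    return prev_frames[len(prev_frames) - k:] + frames
```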
(occurid
 p.name p.ph_vc p.ph_ctype
 p.ph_vheight p.ph_vlng
 p.ph_vfront p.ph_vrnd
 p.ph_cplace p.ph_cvox
 n.name n.ph_vc n.ph_ctype
 n.ph_vheight n.ph_vlng
 n.ph_vfront n.ph_vrnd
 n.ph_cplace n.ph_cvox
 segment_duration seg_pitch p.seg_pitch n.seg_pitch
 R:SylStructure.parent.stress
 seg_onsetcoda n.seg_onsetcoda p.seg_onsetcoda
 R:SylStructure.parent.accented
 pos_in_syl
 syl_initial
 syl_final
 R:SylStructure.parent.syl_break
 R:SylStructure.parent.R:Syllable.p.syl_break
 pp.name pp.ph_vc pp.ph_ctype
 pp.ph_vheight pp.ph_vlng
 pp.ph_vfront pp.ph_vrnd
 pp.ph_cplace pp.ph_cvox)
Now that we have the acoustic distances and the feature descriptions of each unit, the next
stage is to find a relationship between those features and the acoustic distances. This we do using
the CART tree builder wagon. It will find questions about which features best minimize the
acoustic distance between the units in that class. wagon has many options, many of which are
apposite to this task. It is worth noting that this learning task is closed: we are trying to classify
all the units in the database, and there is no test set as such. However, in synthesis there will be
desired units whose feature vector didn't exist in the training set.
The clusters are built by the following function
(format t "Building cluster trees\n")
(acost:find_clusters (mapcar car unittypes) clunits_params)
The parameters that affect the tree building process are
tree_dir FILENAME
The directory where the decision tree for each segment type will be saved.
wagon_field_desc LIST
A filename of a wagon field descriptor file. This is a standard field description (field name plus
field type) that is required for wagon. An example is given in festival/clunits/all.desc, which should
be sufficient for the default feature list, though if you change the feature list (or the values those
features can take) you may need to change this file.
wagon_progname FILENAME
The pathname for the wagon CART building program. This is a string and may also include any
extra parameters you wish to give to wagon.
the directory where the catalogue will be saved (the name parameter is used to name the file).
By default this is
(catalogue_dir festival/clunits/)
There are a number of parameters that are specified with a cluster voice, related to the
run time aspects of the cluster model. These are
join_weights FLOATLIST
These are a set of weights, in the same format as ac_weights, that are used in optimal coupling to
find the best join point between two candidate units. This is different from ac_weights as different
values are likely desired, particularly increasing the F0 value (column 0).
continuity_weight FLOAT
The factor by which the join cost is multiplied relative to the target cost. This is probably not very
relevant given that the target cost is merely the position from the cluster center.
log_scores 1
If specified, the join scores are converted to logs. For databases that have a tendency to contain
non-optimal joins (probably any non-limited-domain database), this may be useful to stop failed
synthesis of longer sentences. The problem is that the sum of very large numbers can lead to
overflow; this helps reduce that. You could alternatively change the continuity_weight to a number
less than 1, which would also partially help. However, such overflows are often a pointer to some
other problem (poor distribution of phones in the db), so this is probably just a hack.
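The overflow argument can be illustrated with a toy accumulator (assumed behaviour for illustration, not Festival's code): summing logs keeps a long chain of large per-join scores in a safe numeric range.

```python
import math

def accumulate_scores(join_scores, log_scores=False):
    """Accumulate per-join scores over an utterance; with log_scores the
    scores are converted to logs before summing, which grows far more
    slowly for long sentences with large per-join scores."""
    if log_scores:
        return sum(math.log(s) for s in join_scores)
    return sum(join_scores)
```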
optimal_coupling INT
If 1, this uses optimal coupling and searches the cepstrum vectors at each join point to find the
best possible join point. This is computationally expensive (as well as requiring lots of cepstrum
files to be loaded), but does give better results. If the value is 2, only the coupling distance at the
given boundary is checked (the boundary is not moved); this is often adequate in good databases
(e.g. limited domain), and is certainly faster.
extend_selections INT
If 1, the selected cluster will be extended to include any unit from the cluster of the previous
segment's candidate units that has the correct phone type (and isn't already included in the current
cluster). This is experimental but has shown its worth and hence is recommended. This means that
instead of selecting just single units, selection is effectively selecting the beginnings of multiple-segment
units. This option encourages far longer units.
pm_coeffs_dir FILENAME
The directory (from db_dir) where the pitchmarks are.
pm_coeffs_ext FILENAME
The file extension for the pitchmark files.
sig_dir FILENAME
Directory containing waveforms of the units (or residuals if residual LPC is being used, PCM
waveforms if PSOLA is being used).
sig_ext FILENAME
In the context of Indian languages, syllable units are found to be a much better choice than units
like phones, diphones, and half-phones. Unlike most other languages, in which the basic unit of
the writing system is the alphabet, Indian language scripts use the syllable as the basic linguistic
unit. The syllabic writing in Indic scripts is based on the phonetics of linguistic sounds, and the
syllabic model is generic to all Indian languages. A syllable is typically of one of the following forms:
V, CV, VC, CCV, CCCV, and CCVC, where C is a consonant and V is a vowel. A syllable can be
represented as C*VC*, containing at least one vowel and zero, one or more consonants. The following
steps explain how to build a syllable-based synthesis using FestVox.
1. Create a directory and enter it.
$ mkdir iiit_tel_syllable
$ cd iiit_tel_syllable
10. Open festival/clunits/all.desc and add all the syllables in the p.name field.
Cluster the units:
festival -b festvox/build_clunits.scm '(build_clunits "etc/txt.done.data")'
11. Open bin/make_dur_model and remove stepwise.
./bin/do_build do_dur
12. Test the voice.
festival festvox/iiit_tel_syllable_clunits.scm '(voice_iiit_tel_syllable_clunits)'
To synthesize a sentence:
If you are building the voice on the local machine:
(SayText "your text")
If you are running the voice on a remote machine:
(utt.save.wave (utt.synth (Utterance Text "your text")) "test.wav")
If you want to see the selected units, run the following commands:
(set! utt (SayText "your text"))
(clunits::units_selected utt "filename")
(utt.save.wave utt "filename" 'wav)
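The C*VC* syllable structure described at the start of this chapter can be sketched with a toy syllabifier. The vowel set and the greedy splitting rule below are illustrative only; real front ends use language-specific letter-to-sound rules:

```python
VOWELS = set("aeiou")  # toy phone set for illustration only

def syllabify(phones):
    """Greedily split a phone string into C*V groups, attaching any
    trailing consonants to the last syllable, so each piece matches C*VC*."""
    sylls = []
    cur = ""
    for p in phones:
        cur += p
        if p in VOWELS:
            sylls.append(cur)
            cur = ""
    if cur:  # trailing consonants after the last vowel
        if sylls:
            sylls[-1] += cur
        else:
            sylls.append(cur)
    return sylls
```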
8.1
Some of the parameters that were customized to deal with Indian languages in the Festival
framework are:
Cluster Size This is one of the parameters to be adjusted while building a tree. If the number
of nodes for each branch of a tree is very large, it takes more time to synthesize speech, as the
time required to search for the appropriate unit is greater. We therefore limit the size of each
branch of the tree by specifying the maximum number of nodes, denoted by the cluster
size. When the tree is built, the cluster size is limited by putting the clustered set of units
through a larger set of questions to limit the number of units being clustered as one type.
Duration Penalty Weight (duration_pen_weight) While synthesizing speech, the duration
of each unit being picked is also important, as units of very different durations clustered
together make for very unpleasant listening. The duration_pen_weight parameter specifies
how much importance should be given to the duration of the unit when the synthesizer is
trying to pick units for synthesis. A high value of duration_pen_weight means a unit very
similar in duration to the required unit is picked; otherwise, not much importance is given to
duration and more importance is given to other features of the unit.
Fundamental frequency penalty weight (F0_pen_weight) While listening to synthesized
speech, an abrupt change in pitch between units is not very pleasing to the ear. The F0_pen_weight
parameter specifies how much importance is given to F0 while selecting a unit for synthesis.
The F0 is calculated at the center of the unit, which is approximately where the vowel lies
and which plays a major role in the F0 contour of the unit. We therefore try to select units
which have similar values of F0, to avoid fluctuations in the F0 contour of the synthesized
speech.
ac_left_context In speech, the way a particular unit is spoken depends a lot on the preceding
and succeeding units, i.e. the context in which the unit is spoken. Usually a unit is picked
based on what the succeeding unit is. The ac_left_context parameter specifies the importance
given to picking a unit based on what the preceding unit was.
Phrase markers It is very hard to make sense of something that is said without a pause. It
is therefore important to have pauses at the ends of phrases to make what is spoken
intelligible. Hindi has certain units called phrase markers which usually mark the end of a
phrase. For the purpose of inserting silences at the ends of phrases, these phrase markers were
identified and a silence was inserted each time one of them was encountered.
Morpheme tags There are no phrase markers in Tamil, but there are units called morpheme
tags, found at the ends of words, which can be used to predict silences. The voice was built
using these tags to predict phrase-end silences while synthesizing speech.
Handling silences Since there are a large number of silences in the database, a silence of
the wrong duration in the wrong place is a common problem: a very long or an extremely
short silence may be inserted at the end of a phrase, which sounds inappropriate. The silence
units were therefore quantized into two types: SSIL, the silence at the end of a phrase, and
LSIL, the silence at the end of a sentence. The silence at the end of a phrase is of short
duration, while the silence at the end of a sentence is of long duration.
Inserting commas Just picking phrase markers was not sufficient to make the speech prosodically
rich. Commas were inserted in the text wherever a pause might occur, and the tree was built
using these commas so that their locations could be predicted as pauses while synthesizing
speech.
Duration modeling This was done to include the duration of a unit as a feature while building
the tree, and also as a feature to narrow down the number of candidate units while picking
units for synthesis.
Prosody modeling This was achieved through phrase markers and by inserting commas in the
text. Prosody modeling was done to make the synthesized speech more expressive, so that it
is more usable for visually challenged persons.
Geminates In Indian languages it is very important to preserve the intra-word pause while
speaking, as a word spoken without the intra-word pause can have a completely different
meaning. These intra-word pauses are called geminates, and care has been taken to preserve
them during synthesis.
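To illustrate how weights such as duration_pen_weight and F0_pen_weight enter unit selection, here is a toy Python sketch. The function names, weight values and relative-penalty formula are illustrative assumptions for exposition only; Festival's actual clunits cost is computed inside its C++ engine.

```python
# Toy sketch of a unit-selection cost combining a duration penalty and an
# F0 penalty, each scaled by its weight. All names here are illustrative.

def unit_cost(target, candidate, duration_pen_weight=0.5, f0_pen_weight=0.3):
    """Weighted penalty for a candidate unit; lower is better."""
    # Relative difference in duration between candidate and target.
    duration_penalty = abs(candidate["dur"] - target["dur"]) / target["dur"]
    # F0 is compared at the centre of the unit (roughly where the vowel lies).
    f0_penalty = abs(candidate["mid_f0"] - target["mid_f0"]) / target["mid_f0"]
    return duration_pen_weight * duration_penalty + f0_pen_weight * f0_penalty

def pick_unit(target, candidates, **weights):
    """Pick the candidate with the smallest weighted penalty."""
    return min(candidates, key=lambda c: unit_cost(target, c, **weights))

target = {"dur": 0.12, "mid_f0": 180.0}
candidates = [
    {"name": "u1", "dur": 0.30, "mid_f0": 181.0},  # wrong duration, right pitch
    {"name": "u2", "dur": 0.13, "mid_f0": 240.0},  # right duration, wrong pitch
    {"name": "u3", "dur": 0.12, "mid_f0": 185.0},  # close on both
]
best = pick_unit(target, candidates)
```

With both weights active, the candidate close in both duration and F0 wins; dropping duration_pen_weight to zero lets a unit of the wrong duration be chosen purely on pitch similarity.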
8.2
1. Add the following three lines to txt.done.data, and add the corresponding wav and lab files
in the respective folders:
( text_0998 "LSIL" )
( text_0999 "SSIL" )
( text_0000 "mono" )
2. Inside the bin folder, make the following modification in the make_pm_wave file. Comment
out the line
PM_ARGS='-min 0.0057 -max 0.012 -def 0.01 -wave_end -lx_lf 140 -lx_lo 111 -lx_hf 80
-lx_ho 51 -med_o 0'
and add the following line in its place:
PM_ARGS='-min 0.003 -max 0.7 -def 0.01 -wave_end -lx_lf 340 -lx_lo 91 -lx_hf 140
-lx_ho 51 -med_o 0'
3. Open the festvox/build_clunits.scm file:
=> Go to line 69, i.e. (ac_left_context 0.8), and change the value 0.8 to 0.1
=> Go to line 87, i.e. (wagon_cluster_size 20), and change the value 20 to 7
=> Go to line 89, i.e. (cluster_prune_limit 40), and change the value 40 to 10
4. Open the festvox/voicefoldername_clunits.scm file => Go to line 136, (optimal_coupling 1),
and change the value 1 to 2
5. Handling SIL: for a small system this issue need not be handled, but in a system with a large
database multiple occurrences of SIL create problems. To solve the issue do the following:
=> Go to line 161, which starts with (define (VOICE_FOLDER_NAME::clunit_name i)
Replace the entire function with the following code
(poa_c2 1 2 3 4 5 6 7 0)
;; manner of articulation of c2
(moa_c2 + 0)
)
11. When running clunits, i.e. the final step, remove ( text_0000 "mono" ) and ( text_0000-2 "phone" )
from txt.done.data (if they exist)
12. Go to the VoiceFolderName_lexicon.scm file (calling the parser in the lexicon file). Go to line
number 137 and add the following code in the hand-written letter-to-sound rules section:
(define (iitm_tam_lts_function word features)
"(iitm_hin_lts_function WORD FEATURES)
Return pronunciation of word not in lexicon."
(cond
((string-equal "LSIL" word) (set! wordstruct '( ((LSIL) 0) )) (list word nil wordstruct))
((string-equal "SSIL" word) (set! wordstruct '( ((SSIL) 0) )) (list word nil wordstruct))
((string-equal "mono" word) (set! myfilepointer (fopen "unit_size.sh" "w")) (format myfilepointer "%s" "mono") (fclose myfilepointer))
((string-equal word "phone") (set! myfilepointer (fopen "unit_size.sh" "w")) (format myfilepointer "%s" "phone") (fclose myfilepointer))
(t
(set! myfilepointer (fopen (path-append VoiceFolderName::dir "parser.sh") "w"))
;; (format myfilepointer "perl %s %s %s" (path-append VoiceFolderName::dir "bin/il_parser-train.pl") word VoiceFolderName::dir)
(format myfilepointer "perl %s %s %s" (path-append VoiceFolderName::dir
"bin/il_parser-test.pl") word VoiceFolderName::dir)
(fclose myfilepointer)
;; (print "called")
(system "chmod +x parser.sh")
(system "./parser.sh")
;(format t "%l\n" word)
(load (path-append VoiceFolderName::dir "wordpronunciation"))
(list word a wordstruct)))
)
During the training process uncomment the il_parser-train.pl line; during testing uncomment the il_parser-test.pl line.
13. Creating the pronunciation dictionary: perl test.pl <inputfile in utf8 format>
File names to be edited in il_parser_pronun_dict.pl:
(a) file containing unique clusters, e.g.
my $file = "./unique_clusters_artistName";
(b) create pronunciation dictionary:
my $oF = "pronunciationdict_artistName";
9.1
Some errors with solutions during the installation/building process:
Error: /usr/bin/ld: cannot find -lcurses
Solution: sudo ln -s /lib/libncurses.so.5 /lib/libcurses.so
Error: /usr/bin/ld: cannot find -lncurses
Solution: sudo apt-get install libncurses5-dev
Error: /usr/bin/ld: cannot find -lstdc++
Solution: sudo ln -s /usr/lib/libstdc++.so.6 /lib/libstdc++.so
Error: gcc: error trying to exec 'cc1plus': execvp: No such file or directory
Solution: sudo apt-get install g++
Error: ln -s festival/bin/festival /usr/bin/festival
ln: accessing /usr/bin/festival: Too many levels of symbolic links
Solution: sudo mv /usr/bin/festival /usr/bin/festival.orig
ln -s /home/boss/festival/festival/src/main/festival /usr/bin/festival
ln: creating symbolic link /usr/bin/festival to /home/boss/festival/festival/
9.2
10
11. If a timeout occurs for Orca, type locate settings.py at the command prompt and open the
files named settings.py in any Orca-related folders (usually there is more than one). Search for
the phrase timeoutTime and change its value to 30. Do the same for all files named settings.py.
Start festival and Orca again.
12. If an English word is not in the database, the system spells it out.
13. The voice can be tested using a gedit file containing text in your language.
14. The cursor should be placed in front of the sentence to be read, using the keyboard arrow keys.
Move the cursor to different lines in the file for it to read line by line.
11
11.1
1. Visual Studio 2008 (VC 9.0) standard edition must be successfully installed
2. The service pack for Visual Studio 2008 must be installed
3. Install Cygwin
4. Rename \speech_tools\config\systems\ix86_CYGWIN1.5.mak to
\speech_tools\config\systems\ix86_unknown.mak
(if a "file not found" error comes up)
Note: copy the new module (il_parser) to your festival/src/modules/ folder before compiling
speech_tools and festival. Copy the Makefile provided to the festival/src/modules/ folder.
Follow the steps mentioned at http://www.eguidedog.net/doc_build_win_festival.php. There
are more changes we made apart from those mentioned on that web page, which are listed
below. IMPORTANT: the following changes need to be made only if errors are thrown
for these files while compiling festival following the steps in the link given above.
5. speech_tools/include/EST.h must have the following change: #include <iostream> should
be added at line 45, before using namespace std;
6. speech_tools/include/EST_math.h must have the following change: #include <iostream> should
be added at line 54, after #include <cfloat>
7. speech_tools/include/EST_TKVL.h must have the following change: #include <iostream>
should be added at line 43, before using namespace std;
8. speech_tools/include/EST_Token.h must have the following change: #include <iostream>
should be added at line 44, before using namespace std;
9. speech_tools/include/EST_TrackMap.h must have the following change: #include <iostream>
should be added at line 38, before using namespace std;
10. speech_tools/stats/wagon/wagon_aux.cc must have the following change: #include "EST_Math.h"
should be added at line 47, after #include "EST_Wagon.h"
11. speech_tools/stats/EST_DProbDist.cc must have the following changes: long long l; on line 62
must be changed to long l; and l = (long long)c; on line 66 must be changed to
l = (long)c;
12. speech_tools/utils/EST_cutils.c must have the following change:
if (((tdir=getenv("TMPDIR")) == NULL) ||
    ((tdir=getenv("TEMP")) == NULL) ||
    ((tdir=getenv("TMP")) == NULL))
    tdir = "/tmp";
must be replaced by
if (((tdir=getenv("TMPDIR")) == NULL) &&
    ((tdir=getenv("TEMP")) == NULL) &&
    ((tdir=getenv("TMP")) == NULL))
    tdir = "/tmp";
(*f)[WORDPOS] = 2; // initial
else if( f->a_no_check(NWORD) != f->a_no_check(NNWORD) )
(*f)[WORDPOS] = 3; // final
In function
float EST_FlatTargetCost::position_in_phrase_cost() const
if ( !t->a_no_check(WORD) && !c->a_no_check(WORD) )
return 0;
if ( !t->a_no_check(WORD) || !c->a_no_check(WORD) )
return 1;
must be replaced by
if ( !t->a_no_check(WORD1) && !c->a_no_check(WORD1) )
return 0;
if ( !t->a_no_check(WORD1) || !c->a_no_check(WORD1) )
return 1;
In function
float EST_FlatTargetCost::punctuation_cost() const
if ( (t->a_no_check(WORD) && !c->a_no_check(WORD))
|| (!t->a_no_check(WORD) && c->a_no_check(WORD)) )
score += 0.5;
else if (t->a_no_check(WORD) && c->a_no_check(WORD))
must be replaced by
if ( (t->a_no_check(WORD1) && !c->a_no_check(WORD1))
|| (!t->a_no_check(WORD1) && c->a_no_check(WORD1)) )
score += 0.5;
else if (t->a_no_check(WORD1) && c->a_no_check(WORD1))
In function
float EST_FlatTargetCost::partofspeech_cost() const
// Compare left phone half of diphone
if( !t->a_no_check(WORD) && !c->a_no_check(WORD) )
return 0;
if( !t->a_no_check(WORD) || !c->a_no_check(WORD) )
return 1;
must be replaced by
replace by
if( cost >= ulimit )
qcost = 0xff;
else if( cost <= llimit )
qcost = 0x0;
19. festival/src/modules/MultiSyn/EST_TargetCost.cc must have the following changes.
Comment out the following code:
const EST_String &left_phone( cand_left->features().val(name).String() );
const EST_String &right_phone( cand_right->features().val(name).String() );
if( ph_is_vowel( left_phone )
|| ph_is_approximant( left_phone )
|| ph_is_liquid( left_phone )
|| ph_is_nasal( left_phone ) )
Replace by
if( ph_is_vowel( cand_left->features().val(name).String() )
|| ph_is_approximant( cand_left->features().val(name).String() )
|| ph_is_liquid( cand_left->features().val(name).String() )
|| ph_is_nasal( cand_left->features().val(name).String() ) )
if( ph_is_vowel( right_phone )
|| ph_is_approximant( right_phone )
|| ph_is_liquid( right_phone )
|| ph_is_nasal( right_phone ) )
fv =
replace by
if( ph_is_vowel( cand->next()->features().val(name).String() )
|| ph_is_approximant( cand->next()->features().val(name).String() )
|| ph_is_liquid( cand->next()->features().val(name).String() )
|| ph_is_nasal( cand->next()->features().val(name).String() ) )
fv = fvector( cand->next()->f(midcoef) );
20. festival/src/modules/Text/token.cc must have the following change: #include <iostream>
should be added at line 48, before using namespace std;
21. festival/src/modules/UniSyn/us_mapping.cc must have the following changes: declare int i;
separately in the following functions and remove the declarations from the for loops
(a) void make_linear_mapping(EST_Track &pm, EST_IVector &map)
(b) static void pitchmarksToSpaces( const EST_Track &pm, EST_IVector *spaces, int start_pm,
int end_pm, int wav_srate )
(c) void make_join_interpolate_mapping( const EST_Track &source_pm, EST_Track &target_pm, const EST_Relation &units, EST_IVector &map )
(d) void make_join_interpolate_mapping2( const EST_Track &source_pm, EST_Track &target_pm, const EST_Relation &units, EST_IVector &map )
22. festival/src/modules/UniSyn/us_prosody.cc must have the following changes. In the function
void F0_to_pitchmarks(EST_Track &fz, EST_Track &pm, int num_channels, float default_F0,
float target_end)
remove the declaration of i in the for loop
for( int i=0; i<fz_len; i++ )
In the function void stretch_F0_time(EST_Track &F0, float stretch, float s_last_time, float t_last_time),
declare int i; separately and remove the declarations from the for loops
23. festival/src/modules/UniSyn/us_unit.cc must have the following changes:
declare int i; separately in the following functions and remove the declarations from the for loops
(a) static EST_Track* us_pitch_period_energy_contour( const EST_WaveVector &pp, const
EST_Track &pm )
(b) void us_linear_smooth_amplitude( EST_Utterance *utt )
12
2. Install the voice. Suppose the voice is kept in the following folder: D:\fest_install\festival\lib\voices\hi
3. Some festival files have been changed. These files are to be replaced where festival and
speech_tools are installed:
festival.cc -> D:\fest_install\festival\src\arch\festival
festival_main.cc -> D:\fest_install\festival\src\main
clunits.cc -> D:\fest_install\festival\src\modules\clunits
EST_wave_utils.cc -> D:\fest_install\speech_tools\speech_class
Copy config.ini to the voice folder (hindi\iitm_hin_anjana_clunits); the config.ini file will be
accessed by the SAPI code.
(voice_iitm_hin_anjana_clunits)
Now you need to compile festival as per the steps given in the chapter on compiling festival in Windows.
4. Install Microsoft SDK from the link http://www.microsoft.com/download/en/details.aspx?id=11310
10. Check if an entry is there in the registry (HKEY_LOCAL_MACHINE -> Software -> Microsoft
-> Speech -> Voices -> Tokens). An entry for our voice should be there.
11. Test with a sample TTS application (Control Panel -> Speech -> Text to Speech) or with the
TTSAPP.exe that comes with the SDK.
12. If it works in these applications, now try it in NVDA or JAWS.
13
The tool was developed to convert speech files in different formats to the standard SPHERE
format. In the SPHERE format there is a header which holds all the details of the speech
file. The speech files can be in wav, raw or mu-law format. The sphere files can be encoded
with WavPack or Shorten, or kept in the same format as the input speech file.
The input file (either mu-law, wav or raw) is converted to a sphere file (WavPack-encoded,
Shorten-encoded, or unencoded) with a sphere header. SPHERE files contain a strictly defined
header portion followed by the file body (waveform). The header is an object-oriented, 1024-byte
blocked, ASCII structure which is prepended to the waveform data. The header is composed of a
fixed-format portion followed by an object-oriented variable portion. The fixed portion is as follows:
NIST_1A<newline> 1024<newline>
The remaining object-oriented variable portion is composed of <object> <type> <value> triples.
Below is a sample sphere header generated by this module. The first four fields are user-defined
fields taken from the config file.
NIST_1A
   1024
location_id -s13 TTS_IITMadras
database_id -s22 Sujatha_20_RadioJockey
utterance_id -s9 Suj_trial
sample_sig_bits -i 16
channel_count -i 1
sample_n_bytes -i 2
sample_rate -i 16000
sample_count -i 46563
sample_coding -s3 pcm
sample_byte_format -s2 01
sample_min -i 16387
sample_max -i 23904
end_head
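The fixed-plus-variable layout above can be sketched in Python. This is a minimal illustrative header builder, not the NIST sphere library: it emits the fixed portion, one "<object> -<type> <value>" line per field, an end_head marker, and pads the whole header to the mandatory 1024-byte block.

```python
# Minimal sketch of a SPHERE header: the fixed portion ("NIST_1A" and the
# block size 1024), a variable portion of <object> <type> <value> lines,
# then "end_head", all padded to exactly 1024 bytes.

def make_sphere_header(fields):
    lines = ["NIST_1A", "   1024"]
    for name, value in fields.items():
        if isinstance(value, int):
            lines.append(f"{name} -i {value}")          # integer field
        else:
            lines.append(f"{name} -s{len(value)} {value}")  # string field with length
    lines.append("end_head")
    header = "\n".join(lines) + "\n"
    if len(header) > 1024:
        raise ValueError("header exceeds the 1024-byte block")
    # Pad with spaces up to the fixed 1024-byte block size.
    return header.ljust(1024)

header = make_sphere_header({
    "location_id": "TTS_IITMadras",
    "sample_rate": 16000,
    "channel_count": 1,
})
```

The padded header would then be prepended to the waveform data when writing the output file.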
13.1
The input file can be a wav file, either PCM or mu-law encoded. The header of a wav file is
shown in a table at the end of the document.
The necessary information is extracted from the header of the input file. If the fact chunk is
present in the header, the sample count is obtained from it; otherwise it is calculated as follows.
The total number of data bytes is obtained from the cksize (second field) in the data chunk, and
the bits per sample from the corresponding field in the format chunk. Then:
bytes per sample = (bits per sample) / 8
sample count = (number of data bytes) / (bytes per sample)
In the sphere package the byte format of the data is stored in the field SAMPLE_BYTE_FORMAT.
If the sample data is in little-endian format, this field is given the value 01; if the data is in
big-endian format, the value is 10; and if the samples are single-byte, the value is 1.
13.1.1
The objective of this module is to find the maximum and minimum sample values among the
sample data present in the input file. Each sample is read from the data part of the file and
compared against the running maximum and minimum values.
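The sample-count arithmetic and the min/max scan described above can be sketched as follows. This is a toy illustration over 16-bit little-endian PCM data; the function name and arguments are assumptions, not the tool's actual code.

```python
import struct

def analyse_pcm(data_bytes, bits_per_sample=16):
    """Compute sample count, minimum and maximum for 16-bit LE PCM data."""
    bytes_per_sample = bits_per_sample // 8          # bits per sample / 8
    sample_count = len(data_bytes) // bytes_per_sample
    # Unpack every sample and track min/max, as the module does.
    samples = struct.unpack("<%dh" % sample_count, data_bytes)
    return sample_count, min(samples), max(samples)

# Four fabricated 16-bit samples.
raw = struct.pack("<4h", -16387, 0, 23904, 5)
count, smin, smax = analyse_pcm(raw)
```

In the real tool the data bytes come from the data chunk of the wav file (or the whole raw file), and the resulting values fill the sample_count, sample_min and sample_max header fields.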
13.1.2
RAW Files
RAW files are headerless audio files. The sample rate, sample size, channel count and data
encoding must be given by the user in the config file for the program to read the file successfully.
The sample count is calculated by counting the number of samples read while calculating the
sample minimum and maximum values.
13.1.3
MULAW Files
If the input file is a mu-law encoded file, the AudioFormat field in the format chunk of the
header will have the value 7, and the FACT chunk will be present in the header.
13.1.4
The data in the output sphere file can be a Shorten-compressed byte stream, a WavPack-compressed
byte stream, or the data as present in the input file.
13.2
Config file
The user-defined fields to be added to the header can be kept in this file; it is to be placed at
the location where the executables are placed.
The output sphere files can be played in the utility WaveSurfer. The sphere files have a .sph
extension and the sphere header can be verified by opening the file. The file can be opened in a
hex editor (e.g. ghex2) to verify the header fields and the size of the file.
14
14.1
1. Untar sphere_2.6a.tar.Z (use tar xvzf, or zcat sphere_2.6a.tar.Z | tar xvf -):
tar xvzf sphere_2.6a.tar.Z
2. A folder by name nist will be created.
3. Change the file exit.c (nist/src/lib/sp):
replace extern char *sys_errlist[]; by the following
#ifdef NARCH_linux
#include <errno.h>
#else
extern char *sys_errlist[];
#endif
4. Go to the folder nist (cd nist) and install nist as follows: sh src/scripts/install.sh
(a) : Sun OS4.1.[12]
(b) : Sun Solaris
(c) : Next OS
(d) : Dec OSF/1 (with gcc)
(e) : Dec OSF/1 (with cc)
(f) : SGI IRIX
(g) : HP Unix (with gcc)
(h) : HP Unix (with cc)
(i) : IBM AIX
(j) : Custom
Please Choose one:
10
What is/are the Compiler Command ? [cc]
cc
OK, The Compiler Command command is cc. Is this OK? [yes]
yes
What is/are the Compiler Flags ? [-g]
-g -c
OK, The Compiler Flags command is -g -c.
Is this OK? [yes]
yes
What is/are the Install Command ? [install -s -m 755]
install -s -m 755
What is/are the Archive Sorting Command ? [ranlib]
What is/are the Archive Update Command ? [ar ru]
What is/are the Architecture ? [SUN]
linux
OK, The Architecture command is linux. Is this OK? [yes]
yes
5. Copy the following files from the c files folder to nist/bin or to any user-defined folder:
decode_sphere.c
encode_sphere.c
configfile
compare.c
convert_to_sphere.c
wavtosphere.sh
6. Compile using (in the bin folder), combining the sources:
cc convert_to_sphere.c decode_sphere.c encode_sphere.c -I../include/ -lsp -lutil
-lm -L../lib/
If the C files are not in nist/bin but in a user-defined folder, give the appropriate paths for
nist/lib and nist/include in the above command.
This will create a file a.out which will be used by the front end of the tool.
7. Install Qt. Run the Qt bin file qt-sdk-linux-x86-opensource-2010.02.bin:
./qt-sdk-linux-x86-opensource-2010.02.bin
A folder named qtsdk-2010.02 will be created. This completes the Qt installation.
8. Now, to compile the sphere converter code using Qt, extract sphereconverter.tar:
cd sphereconverter
Execute the following commands:
/home/......./qtsdk-2010.02/qt/bin/qmake -project (This creates the .pro file.)
/home/......./qtsdk-2010.02/qt/bin/qmake (This creates the Makefile.)
make (This builds the sphereconverter executable.)
9. Copy the sphereconverter executable to the bin folder of nist (nist/bin) or to the user-defined
folder containing the C files.
10. Copy this user manual to both the sphereconverter and nist/bin folders (or to the user-defined
folder containing the C files) to access it at all times from the help button in the tool.
11. Now, clicking on this executable in /home/.../nist/bin (or in the user-defined folder
containing the C files) will run the tool.
14.2
Select the type of file to be converted using the radio buttons: wav (pcm or mulaw) or raw.
Select whether a single file or a bulk of files has to be converted.
If single file conversion is selected:
1. Load the input file, with extension either wav or raw. It can be browsed using the
Open button. Files with the selected extension will be listed when browsing.
2. Specify the name of the output sphere file and where the output file has to be saved
using the Save As button.
If bulk file conversion is selected:
1. Load the input folder containing wav files or raw files, depending on the type of file
selected. It can be browsed using the Open button.
2. If the input folder contains files (wav or raw) other than the type selected, the
other files will be converted to corresponding sphere files with default properties.
3. Specify the name of the output folder where the sphere files have to be located.
Click on Edit properties to enter the details that will be stored in the sphere header.
If properties are not edited, the default properties stored in the configfile (present in the
nist/bin folder) will be used.
Select the type of encoding for the output sphere file. It can be WavPack encoding, Shorten
encoding, or no encoding.
Click the Convert button to convert a single file, or the Bulk Convert button to convert a set
of files.
If the file is successfully converted, the message "File was successfully converted" will be displayed.
If any field entered in the properties was wrong, or if the file was not successfully converted,
an appropriate message will be displayed.
After a successful conversion, the sphere header created by the tool will be displayed. If the user
is satisfied with the header, the Ok button can be clicked; else the Cancel button can be clicked
and the user can go back and make changes in the properties.
Clicking the help button in the top right corner opens the user manual, which can be referred
to for any issues while using the tool.
14.3
Fields in Properties
location_id : A mandatory field. The user can enter any string value; the maximum
allowed length is 100 characters. Preferably this field holds the name of the
location/institution at which the conversion is taking place.
database_id : A mandatory field. The user can enter any string value; the maximum
allowed length is 100 characters. Preferably this field holds the details of the
project/database/speaker.
utterance_id : A mandatory field. The user can enter any string value; the maximum
allowed length is 100 characters. Preferably this field holds the name of the speaker. In
the sphere header the value of this field will be appended with the name of the file,
separated by an underscore.
language : A mandatory field. The user can enter any string value; the maximum
allowed length is 50 characters. Preferably this field holds the language used in the
input file.
sample_n_bytes : A mandatory field for raw files only. The user can enter the number of
bytes in a sample in the file. For wav files this value is retrieved from the wav header. If
a non-integer value is entered, the tool will prompt the user to enter an integer value.
This tool deals only with 1 and 2 bytes per sample.
sample_sig_bits : A mandatory field. The user can enter the number of significant bits in a
sample. If a non-integer value is entered, the tool will prompt the user to enter an integer
value. If sample_sig_bits > (sample_n_bytes * 8), an error will be thrown informing
the user that the value entered for this field is wrong.
sample_rate : A mandatory field for raw files only. For wav files this value is retrieved
from the wav header. The user can enter the sampling rate (blocks per second) used in the
file. If a non-integer value is entered, the tool will prompt the user to enter an integer value.
If (sample_count / sample_rate) <= 0 (meaning the duration of the file is less than or equal to
zero), an error will be thrown informing the user that the value entered for this field is wrong.
sample_byte_format : A mandatory field for raw files only. For wav files this value is
retrieved from the wav header. The user can enter the byte ordering used in the file: 01
for little endian, 10 for big endian, or 1 for single byte. If any other value is entered,
the tool will prompt the user to enter one of these three values. For raw files, if the
sample_n_bytes entered by the user is 1, this field can take only the value 1; else the
user will be informed that the value entered for this field is wrong.
channel_count : This tool deals only with channel_count = 1. It is the number of interleaved
channels (mono = 1, stereo = 2).
sample_count : The value for this field is calculated by the tool. It is the total number of
samples in the file.
sample_coding : The value for this field is calculated by the tool. It is the encoding used
in the input and output files, separated by a comma. The input file encodings can be pcm,
mulaw or raw; the output file encodings are shorten or wavpack. If no encoding is selected for
the output file, the value of this field will contain only the encoding of the input file.
sample_max : The value for this field is calculated by the tool. It is the maximum sample value
(amplitude of the sample with the maximum value) present in the file.
sample_min : The value for this field is calculated by the tool. It is the minimum sample value
(amplitude of the sample with the minimum value) present in the file.
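The validation rules described above can be sketched as a small checker. The field names follow the sphere header; the function itself is an illustration, not the tool's actual code.

```python
# Illustrative validation of user-entered sphere header properties,
# following the rules described above. Raises ValueError on bad input.

def validate_properties(p):
    if p["sample_sig_bits"] > p["sample_n_bytes"] * 8:
        raise ValueError("sample_sig_bits exceeds sample_n_bytes * 8")
    if p["sample_count"] / p["sample_rate"] <= 0:
        raise ValueError("duration of file is less than or equal to zero")
    if p["sample_byte_format"] not in ("01", "10", "1"):
        raise ValueError("byte format must be 01, 10 or 1")
    if p["sample_n_bytes"] == 1 and p["sample_byte_format"] != "1":
        raise ValueError("single-byte samples must use byte format 1")
    return True

ok = validate_properties({
    "sample_sig_bits": 16, "sample_n_bytes": 2,
    "sample_count": 46563, "sample_rate": 16000,
    "sample_byte_format": "01",
})
```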
The user can add more fields using the Add button. The field name, data type and value have
to be entered. The data type can be string, integer or real, selected from the drop-down
list. Click the Ok button after entering the details of the new field, or click the Cancel
button. Field names should not contain spaces.
The maximum size of the sphere header is 1024 bytes. If the user enters more data, the tool will
inform the user that the header has exceeded 1024 bytes and ask the user to edit/delete a few
properties.
User-entered fields can be deleted: tick the check box on the right of each such field
and press the Delete button.
Once the properties are edited, click the Ok button.
If you click the Cancel button, the tool will ask "Cancelling will remove the edited
properties and use the default wav properties. Do you want to continue?". The user can click
yes or no.
14.4
Screenshot
14.5
WAV:
location_id STRING IIT Madras, Chennai
database_id STRING Sujatha_20_RadioJockey
utterance_id STRING Suj
language STRING tamil
sample_sig_bits INTEGER 16
$$
RAW:
location_id STRING IIT Madras, Chennai
database_id STRING Sujatha_20_RadioJockey
utterance_id STRING Suj
language STRING hindi
sample_n_bytes INTEGER 2
sample_sig_bits INTEGER 16
channel_count INTEGER 1
14.6
Field            Length  Content
RIFF/RIFX chunk
ChunkID          4       Contains the letters RIFF or RIFX in ASCII form. For
                         little-endian files it is RIFF; for big-endian files it is RIFX.
ChunkSize        4       The size of the rest of the chunk following this number,
                         i.e. the size of the entire file in bytes minus 8 bytes for
                         the two fields not included in this count: ChunkID and ChunkSize.
WaveID           4       Contains the letters WAVE
FORMAT chunk
Subchunk1ID      4
Subchunk1Size    4
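The first rows of the table can be checked programmatically. Here is a small sketch that reads the ChunkID, ChunkSize and WaveID fields from the first 12 bytes of a wav file; it is illustrative and uses only the fields listed above.

```python
import struct

def read_riff_chunk(data):
    """Parse ChunkID, ChunkSize and WaveID from the first 12 bytes of a wav file."""
    chunk_id = data[0:4]                 # b"RIFF" (little endian) or b"RIFX" (big endian)
    if chunk_id == b"RIFF":
        chunk_size = struct.unpack("<I", data[4:8])[0]   # little-endian size
    elif chunk_id == b"RIFX":
        chunk_size = struct.unpack(">I", data[4:8])[0]   # big-endian size
    else:
        raise ValueError("not a RIFF/RIFX file")
    wave_id = data[8:12]                 # should be b"WAVE"
    return chunk_id, chunk_size, wave_id

# A fabricated 44-byte file: ChunkSize = total size - 8 = 36.
fake = b"RIFF" + struct.pack("<I", 36) + b"WAVE" + b"\x00" * 32
cid, csize, wid = read_riff_chunk(fake)
```

Note how ChunkSize excludes the 8 bytes of ChunkID and ChunkSize themselves, matching the table's description.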