Prof. AG. Ramakrishnan

Text-to-Speech Synthesis Research @ MILE
A. G. Ramakrishnan, Professor, Medical Intelligence & Language Engineering (MILE) Lab, Department of Electrical Engineering, IISc, Bangalore.
Celebrating the Centenary of the Department !

MILE Lab, IISc, Bangalore 1/29/2012

Philosophy of MILE
Research relevant to people and life around us.
Do not download research topics, data or code!
Commitment to deliver something useful by itself throws up meaningful research issues.
Having chosen to work on an applied area, we deal with everything that is required to reach the goal.
All the data we use have been collected by us: India has a huge population and so, there is no dearth for creation of standard databases.
Research is learning; and learning is fun & its own reward!

What we created: Vision 2010
Indic Language Reading Machines for People with Visual Disability (PWVD).
Automated Book Reader (ABR) for Indic scripts.
Any printed material in Indian languages becomes accessible: document analysis & recognition.
Text-to-speech (TTS) conversion.
Need to deal with bilingual & trilingual text: script recognition at the word level.
Posters, road signs, menu cards, notice boards: camera-based document analysis & recognition.
Coloured text printed on complex background.
Online Handwriting Recognition (OHWR).

Where are we, today?
Using our Tamil OCR, Worth Trust, Chennai has already digitized 200 Tamil books (>30,000 pages) and the Braille books are already being used by around 100 PWVD.
Invention Labs, Chennai will bring out Tamil and Kannada versions of Avaz using our TTS by July 2012. (Used by children with cerebral palsy; award from President.)
TTS with SAPI to be used by National Association for Blind.
Clinical trials planned with St. John's Hospital, Bangalore: OHWR to speech.

Acknowledgment for our TTS
JV Rama, Partha, R Muralishankar, Ranjani HG, Shiva Kumar HR
Lakshmish K, Prathibha
Abhinava S, Arun Sriraman
Vikram LR
Ajit Narayanan, CEO of Invention Labs, Chennai: "Your TTS is the best Indian language TTS I have seen so far."
Web Demo of Tirukkural & Vak (MILE TTS): http:\\mile.ee.iisc.ernet.in\tts

Writing to Speech Device
Laryngectomy
Working with St. John's Hospital & Medical College to test on Persons with Vocal Disability
Patent Pending

Other Research at MILE
Brain Maps at Medicaid Systems
Heart Rate Variability (Biological Cybernetics)
Fetal lung maturity from ultrasound & 3D MRI Comp.
Field Extraction from Document Images
Vitiligo Quantification
CURRENT
Machine Listening
Multilingual Recognition
Assessment of Diabetic Retinopathy
FUTURE
Early detection of retinal diseases
Tamil-Kannada Machine Translation

Applications of TTS
Natural language interface for computers
Digital personal assistant with translation
E-mail reader in local language
Interaction tool for physicians
Automatic telephone-based enquiry system
Virtual teacher
Automatic document reading machines
Internet news channels
Accessibility and reading aids for the blind
Communication aid for children with cerebral palsy
Aid for persons with laryngectomy

Method Employed
Waveform concatenation based.
Good database of over 1100 separately spoken, phonetically rich sentences, segmented as phones.
Phonetic equivalent of text (the way it is pronounced).
Carefully selected segments from recorded speech.
Signal processing for smooth concatenation.
Signal processing for naturalness.
Special provisions.

Issues to be addressed
Phonetically rich text selection
Recording from a good speaker
Segmentation & annotation
Text normalization
G2P conversion and exceptions
Prosody prediction
Unit selection
Pitch modification
Duration modification
Determining the point of concatenation
Spoken language vs. written language
Indian language, along with English words

Rich Text Selection
Optimal coverage: all speech units of the language should be covered in the database.
Huge text corpus; phonetic corpus.
Generate the superset of all phones (phonemes), in all the contexts.
Search the corpus for the minimum number of sentences => greedy algorithm.
Add words or sentences to complete.
Bilingual, if necessary.

Text Selection Algorithm
Greedy algorithm is used.
Sentence covering the highest number of units is first selected and taken out of the corpus.
Count (required minimum number) of covered units is reduced accordingly.
Next best covering sentence is selected and removed from the corpus.
Word beginning/ending & sentence beginning/ending contexts are used.
A linguist created words to cover the rest.
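
The greedy selection described above can be sketched as follows. This is a minimal illustration, not the lab's implementation: it assumes the unit is a pair of adjacent phones, that each sentence is already available as a phone sequence, and it ignores the word and sentence boundary contexts mentioned on the slide. The names `diphones` and `greedy_select` are hypothetical.

```python
from collections import Counter


def diphones(phones):
    """Return the set of adjacent phone pairs (an assumed choice of unit)."""
    return {(a, b) for a, b in zip(phones, phones[1:])}


def greedy_select(corpus, min_count=1):
    """Pick sentences until every unit is covered min_count times, if possible.

    corpus maps a sentence id to its phone sequence (a list of strings).
    Returns the selected sentence ids and the units still under-covered.
    """
    remaining = {sid: diphones(p) for sid, p in corpus.items()}
    # Required remaining count for every unit that occurs in the corpus.
    need = Counter({u: min_count for us in remaining.values() for u in us})

    def gain(sid):
        # Number of still-needed units this sentence would cover.
        return sum(1 for u in remaining[sid] if need[u] > 0)

    selected = []
    while remaining and any(n > 0 for n in need.values()):
        best = max(remaining, key=gain)
        if gain(best) == 0:
            break  # no remaining sentence covers a needed unit
        selected.append(best)
        # Take the sentence out of the corpus and reduce the required counts.
        for u in remaining.pop(best):
            if need[u] > 0:
                need[u] -= 1
    uncovered = sorted(u for u, n in need.items() if n > 0)
    return selected, uncovered


corpus = {
    "s1": ["a", "k", "a"],
    "s2": ["a", "k", "a", "m", "a"],
    "s3": ["m", "a"],
}
print(greedy_select(corpus))
```

Units reported as uncovered correspond to the last step on the slide, where a linguist writes extra words to cover what the corpus cannot.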

Recording a Good Voice
Selection of speaker is half the job!
Normal speed, declarative style.
Familiarity with the text.
Re-recording when necessary.
Fatigue of the speaker to be handled.
Availability of speaker, for future additions (what we added: isolated characters, etc.).
Mispronunciations: text correction.
Missing units.
Spoken language issues.

Speech Database Recording
Jayam Kondan, ex-AIR News Reader: Tamil; also, a complete Tamil drama, with and without emotions.
Later, English and isolated alphabets recorded.
Professional studio recording.
Manual listening and re-recording.

Segmentation & Annotation
Labour intensive, needs trained people.
No completely automated technique yet.
Levels of segmentation: phone, diphone, polyphone, demisyllable, syllable.
Annotation: silence, pause, phrase break, matching text to spoken phones.
Automating it: still a research issue.
Database organisation for unit selection.

Motivation for subspace based segmentation
[Figure: segmentation using the energy-based method. (a) Speech signal /aka/ with segments /a/, /k/, /a/. (b) Speech signal /eyo/ with segments /e/, /y/, /o/ and the actual consonant (/y/) position marked. x-axis: time (sec).]

Text Processing
Tokenization: grouping text into words, sentences, utterances; handle abbreviations, initials, etc.
Text normalization: converting non-standard words (numerals, abbreviations, acronyms, punctuation marks, etc.) into standard words.
Identify proper names and foreign words.
Tags for upper case letters, etc.

Text Normalization
The "text normalization" component converts any non-text input into a series of appropriate spoken words.
Price of land is Rs. 2,66,60,000. For more information, call 26660000.
First one is /irandu kodiye arubattu aaru latchattu arubadaayiram/.
Second one is /irandu aaru aaru aaru suzhi suzhi suzhi suzhi/.

Text Normalization contd..
1) Isolates words in the text.
2) Integers, floating point numbers, range, ratio, alphanumeric strings, times, dates, and other symbolic representations are converted into words. We need to code the rules for the conversion of these symbols into words, since they differ depending upon the language and context.
3) Abbreviations are converted into words. The normalizer uses a database of abbreviations and what they are expanded to.
4) Acronyms that have sufficient vowels are pronounced as words, e.g. /manumozhi/.

TN (contd..)
5) The normalizer will have rules dictating if the punctuation causes a word to be spoken or if it is silent. E.g.: periods at the end of sentences are not normally spoken, but a period in an Internet address is spoken as "dot". In software for the visually challenged, periods are also voiced.
Once the text has been normalized and simplified into a series of words, it is passed on to the next module, namely the grapheme-to-phoneme converter.
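
The price versus phone number example can be illustrated with a small sketch. It rests on assumptions that are not from the slides: English number words stand in for the Tamil ones, and a simple heuristic (comma-grouped digits are a quantity, a bare digit string is read digit by digit) replaces the context rules a real normalizer would need. All function names are hypothetical.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n):
    """Words for 0 <= n < 100."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])


def below_thousand(n):
    """Words for 0 <= n < 1000."""
    hundreds, rest = divmod(n, 100)
    parts = []
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest or not parts:
        parts.append(two_digits(rest))
    return " ".join(parts)


def indian_number_words(n):
    """Cardinal reading with crore / lakh / thousand grouping."""
    if n == 0:
        return "zero"
    parts = []
    crore, n = divmod(n, 10_000_000)
    lakh, n = divmod(n, 100_000)
    thousand, n = divmod(n, 1000)
    if crore:
        parts.append(indian_number_words(crore) + " crore")
    if lakh:
        parts.append(two_digits(lakh) + " lakh")
    if thousand:
        parts.append(two_digits(thousand) + " thousand")
    if n:
        parts.append(below_thousand(n))
    return " ".join(parts)


def digit_by_digit(digits):
    """Read a digit string one digit at a time, as for a phone number."""
    return " ".join(ONES[int(d)] for d in digits)


def normalize_numbers(text):
    """Replace digit runs in text by words (assumed heuristic, see above)."""
    def expand(match):
        token = match.group(0)
        if "," in token:
            return indian_number_words(int(token.replace(",", "")))
        return digit_by_digit(token)
    return re.sub(r"\d[\d,]*\d|\d", expand, text)


print(normalize_numbers("Price of land is Rs. 2,66,60,000. Call 26660000."))
```

The same digits come out as "two crore sixty six lakh sixty thousand" in the first case and as eight separate digits in the second, which mirrors the two Tamil readings given on the slide.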

Grapheme to Phoneme Conversion
Letter-to-sound and contextual rules.
Uses look-up table (lexicon) for foreign words, foreign names, etc.
Intervocalic k, T, t, p become g, D, d and b. k, ch, T, t, p after homorganic nasals become g, j, D, d, b.
E.g.: pattam = pattam, patam = padam; pantam = pandam; manchaL = manjaL. => rule-based G2P.
English: e.g. put, but, use, utter => lexicon for whole vocabulary.
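
The two voicing rules can be sketched on the romanized examples given above. This is only an illustration under stated assumptions: the input is the same romanization as on the slide (capital letters for retroflex sounds, "ch" as one unit), any of n, m, N is treated as a nasal without checking that it is homorganic, and the lexicon for foreign words is left out. The function names are hypothetical.

```python
VOWELS = set("aeiou")
NASALS = {"n", "m", "N"}  # simplification: homorganicity is not checked

# Stop -> voiced counterpart, as listed on the slide.
INTERVOCALIC = {"k": "g", "T": "D", "t": "d", "p": "b"}
AFTER_NASAL = {"k": "g", "ch": "j", "T": "D", "t": "d", "p": "b"}


def tokenize(word):
    """Split a romanized word into letters, keeping 'ch' as one token."""
    tokens, i = [], 0
    while i < len(word):
        if word.startswith("ch", i):
            tokens.append("ch")
            i += 2
        else:
            tokens.append(word[i])
            i += 1
    return tokens


def g2p(word):
    """Apply the intervocalic and post-nasal voicing rules to one word."""
    tokens = tokenize(word)
    out = []
    for i, tok in enumerate(tokens):
        prev = tokens[i - 1] if i > 0 else ""
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if prev in NASALS and tok in AFTER_NASAL:
            out.append(AFTER_NASAL[tok])
        elif prev in VOWELS and nxt in VOWELS and tok in INTERVOCALIC:
            out.append(INTERVOCALIC[tok])
        else:
            out.append(tok)  # geminates and word-initial stops stay unvoiced
    return "".join(out)


for w in ["pattam", "patam", "pantam", "manchaL"]:
    print(w, "->", g2p(w))
```

A geminate such as the "tt" in pattam is left alone because neither "t" sits between two vowels, which is why only context, not the letter itself, decides the output.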

Prosody Prediction
The term prosody refers to certain properties of the speech signal, which are related to audible changes in pitch, loudness, syllable length, intonation.
E.g., different /m/s have very different meanings.
Prosodic features create a segmentation of the speech chain into groups of syllables (one syllable is stressed, etc.).
They give rise to the grouping of words into larger chunks: syntactic and phonological phrases.

Pitch contour of a Yes-No question
/Vidiyai madiyaal vella mudiyumaa/?

Pitch contour of an Affirmative sentence
/Akash nalla paiyan./

Pitch contour of an Exclamatory sentence
/avan ingu vandaanaa/?
/avan nejammaa varaan/

Prosody Modeling
Precise duration of each phoneme/syllable and of silences, as well as the intonation to apply on them, need to be obtained from the model.
The above step requires formalizing a lot of phonetic or phonological knowledge, automatically acquired from data with statistical (machine learning) methods.

Measurable Prosody parameters
Stress (syllable measure): F0 peak; positive and negative F0 slope values; positive and negative RMS energy peak.
Duration: segment durations.
Tone: sentence F0 contour (individual segments).

Prosody Models for Indian languages
Present scenario:
No computational linguistic knowledge or model is available for developing any Indian language TTS.
Good quality TTS are not available for South Indian languages.
Commercial Hindi TTS is also of poor quality.
Little or no research has been conducted on prosody modeling for any Indian language.

Prosody: interrogation
[Figure: plain vs. intonated interrogative sentences; pitch in Hz against time in sec.]
Pitch contour of plain uttered and intonated interrogative sentence (note y-axis start value)

Prosody: exclamation
[Figure: plain vs. intonated sentences; pitch in Hz against time in sec.]
Pitch contours of a plainly uttered and intonated exclamatory sentence

Increase in energy & duration of intonated over plain uttered sentences
Type of sentence | Range of increase in mean energy (%) | Range of increase in basal energy (%) | Average increase in mean energy (%) | Average increase in basal energy (%)
Interrogative | 10 to 28 | 5 to 13 | 17.5 | 13.6
Exclamatory | 5 to 35 | 5 to 35 | 16.3 | 14.6

Type of sentence | Average increase in total duration (%)
Interrogative | 19.7
Exclamatory | 16.4

High speaking rate: reduced vowels.

Unit Selection
Target unit: a unit with desired (predicted) acoustic, spectral and contextual features. Selecting the best unit of the required type from the database.
Target cost: between the target unit and database units.
Join/concatenation cost: between the candidate unit and its predecessor and successor units in concatenation.
Requires efficient organisation of the database and precomputing of features.

Unit Selection algorithm: Total cost
Total cost is calculated for each segment unit in each candidate cluster for the speech to be synthesized.

Unit Selection Technique
Viterbi search path through the best sequence of candidate units (thick line)
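
A Viterbi search over candidate units can be sketched as below. The cost functions are passed in by the caller; the toy costs in the usage example (absolute F0 differences) and the `(id, f0)` representation of a unit are assumptions made for illustration and are not the features used in the MILE system.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Return (best unit sequence, total cost) by dynamic programming.

    targets[i]    : desired description of the i-th unit
    candidates[i] : non-empty list of database units that could realise targets[i]
    """
    # best[i][j] = (cost of the cheapest path ending in candidates[i][j],
    #               index of the predecessor on that path)
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            # Cheapest way to reach c from any candidate of the previous slot.
            prev_cost, prev_j = min(
                (best[i - 1][j][0] + join_cost(p, c), j)
                for j, p in enumerate(candidates[i - 1])
            )
            row.append((prev_cost + target_cost(targets[i], c), prev_j))
        best.append(row)

    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1], total


# Toy example: each unit is (id, f0 in Hz); targets are desired f0 values.
targets = [100, 110, 120]
candidates = [
    [("a1", 100), ("a2", 130)],
    [("b1", 140), ("b2", 112)],
    [("c1", 119), ("c2", 150)],
]
path, total = select_units(
    targets,
    candidates,
    target_cost=lambda t, c: abs(t - c[1]),
    join_cost=lambda p, c: abs(p[1] - c[1]),
)
print([u[0] for u in path], total)
```

Because the join cost only links neighbouring units, the search cost grows linearly with the number of units and quadratically with the number of candidates per slot, which is why precomputed features and a well-organised database matter.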

DCT based pitch synchronous pitch modification in the source domain
Pitch is modified in the source domain.
Linear prediction analysis of the pitch frame.
Inverse filter to get the residual (vocal cord signal, containing the pitch information).
Obtain DCT of the LP residual vector.
Pad zeros / truncate to modify the length of the vector (therefore, the pitch period).
Energy normalization.
IDCT gives pitch modified residual.
Forward filter, using the same LP coefficients.

Pitch Modification in the Source Domain using DCT
[Block diagram: pitch synchronous speech frame -> A(z) -> LP residue -> N1-point DCT -> pad zeros / truncate -> normalization N1/N2 -> N2-point IDCT -> modified residue -> G/A(z) -> pitch modified frame.]
Block Schematic of Pitch Synchronous Pitch Modification System
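
The DCT, pad/truncate and IDCT steps can be sketched on a single residual frame. This is a minimal illustration under assumptions: the LP analysis and the inverse and forward filters are omitted, the input is taken to be an already computed pitch-synchronous LP residual, and the energy normalization simply restores the energy of the input frame, which may differ from the N1/N2 scaling shown in the block diagram. The function names are hypothetical and the plain-Python transforms are O(N^2), chosen for clarity.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a list of floats."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for i in range(n))
        out.append(s * math.sqrt((1 if k == 0 else 2) / n))
    return out


def idct(c):
    """Inverse of dct() for the same length (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = sum(c[k] * math.sqrt((1 if k == 0 else 2) / n)
                * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for k in range(n))
        out.append(s)
    return out


def resize_pitch_period(residual, n2):
    """Change the length of one residual frame from N1 = len(residual) to n2."""
    n1 = len(residual)
    coeffs = dct(residual)
    # Truncate (n2 < n1) or pad with zeros (n2 > n1) in the DCT domain.
    coeffs = coeffs[:n2] + [0.0] * max(0, n2 - n1)
    out = idct(coeffs)
    # Energy normalization (assumed form): match the input frame energy.
    e_in = sum(v * v for v in residual)
    e_out = sum(v * v for v in out)
    gain = math.sqrt(e_in / e_out) if e_out > 0 else 1.0
    return [gain * v for v in out]


frame = [1.0, 0.5, -0.25, 0.0, 0.3, -0.6, 0.2, 0.1]
longer = resize_pitch_period(frame, 10)   # longer period, lower pitch
shorter = resize_pitch_period(frame, 6)   # shorter period, higher pitch
print(len(longer), len(shorter))
```

Lengthening a pitch period lowers the pitch and shortening it raises the pitch; the modified residual would then be passed through the forward filter with the original LP coefficients.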

Pitch Modification Results
[Figure: waveform panels (a) to (f).]

Pitch modification
Original speech
Pitch modified versions of the original speech with different factors: 0.6, 0.8, 1.1, 1.3, 1.5

Pitch Modification Results (sample sounds)
Columns: Pitch Modification Factor; Direct LPC; Modified LPC; Modified WLPC
Factors: 0.5, 0.7, 0.8, 1.2, 1.5, 1.8, 2.0
Original

Simple emotion synthesis
[Block diagram: Emotional Speech Synthesis. Blocks: reference signal, emotional signal, pitch marking, max pitch period N1, instantaneous pitch, correction factor N1/N2, LP analysis, DCT, IDCT, pitch synchronous LP coefficient forward filtering G/A(z), emotional speech.]
Block Schematic of Emotional Speech Synthesis

Demo of Emotional synthesis and Pitch modification
original
synthesized

Time-varying pitch modification
[Figure: x-axis: Time (s).]
Synthesized interrogative sentence using time-varying pitch modification

TD-PSOLA
Analysis: waveform is decomposed into a sequence of overlapping fragments of speech.
Synthesis: fragments of speech are recombined as desired, with pitch and time-scale modification.
Necessitates preliminary pitch marking.
Pitch modification factors required to convert:
male to female (1.4 to 1.6)
male to child (1.7 to 2.0)
male to old man's voice (0.5 to 0.7)
Making the fundamental frequency high by 30% and shifting the formants up by 25% converts male voice to female voice.

Duration Modification
Change of pitch implies change of duration.
Change of speaking rate necessitates it too.
Stressed syllable has increased duration.
Syllable embedded in a long word has less duration.
Adding or removing whole pitch periods.
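
Changing duration by adding or removing whole pitch periods can be sketched as follows. The sketch assumes that pitch marks are already available as sample indices and simply repeats or drops periods with no smoothing at the joins; a real system would restrict this to the vowel part of a unit and smooth the boundaries. The function name is hypothetical.

```python
def modify_duration(samples, pitch_marks, factor):
    """Stretch (factor > 1) or shrink (factor < 1) a voiced segment.

    pitch_marks are increasing sample indices; consecutive marks delimit
    one pitch period. Samples outside the marked region are kept as they are.
    """
    if len(pitch_marks) < 2:
        return list(samples)  # no complete pitch period to work with
    periods = [samples[a:b] for a, b in zip(pitch_marks, pitch_marks[1:])]
    n_out = max(1, round(len(periods) * factor))
    out = list(samples[:pitch_marks[0]])
    for j in range(n_out):
        # Map each output period back to a source period.
        src = min(len(periods) - 1, int(j / factor))
        out.extend(periods[src])
    out.extend(samples[pitch_marks[-1]:])
    return out


samples = list(range(40))
marks = [0, 10, 20, 30, 40]          # four periods of ten samples each
print(len(modify_duration(samples, marks, 1.5)))
print(len(modify_duration(samples, marks, 0.5)))
```

Since whole periods are copied, the pitch of the segment is unchanged while its length scales roughly by the requested factor.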

Importance of the Durational Information
Variation in the duration of a word /SENDRAAN/ in different circumstances

Optimal Point of Concatenation
Concatenate at points of minimal energy.
Matching the spectral or acoustic characteristics.
Interpolation of Linear Prediction coefficients.

Lexicons
Proper names: places, people, roads, etc.
Compound words.
Function words.
Verb roots and common nouns.

Spoken Language Issues
Multiple languages mixed.
Distortions in spoken words necessitate units beyond the phonotactic constraints.
Foreign words require new phones.

Research in Computational Linguistics
Systematically studying spoken Tamil and documenting all the additional phones used.
Identifying and parsing lexical and phonological phrases, for inserting pauses: place & duration.
Predicting the emotional content from analyzing the text, for emotional speech synthesis.
Duration modification of different classes of phones with changes in speaking rate, for the hearing impaired.
Translation of technical terms in different fields to Tamil, in collaboration with field experts.
Studying prosody in Tamil: pitch, duration and amplitude contours in different types of sentences.

Research in Technology
We need precision automated segmentation.
Machine translation using machine learning methods.
Analyzing phonotactic exceptions from real-life spoken Tamil database.

Corpus Building
Text normalization: foreign words, spoken Tamil.
Studying and modeling natural variations in duration of syllables, pauses, intonation, energy contour, etc.
G2P for spoken Tamizh: different dialects.
Parallel corpus for machine translation.
Error-free huge text corpus covering all fields.
Segmented and annotated speech corpus for automated speech recognition and limited domain applications.

Need for Segmentation in Concatenative Speech Synthesis
It is a knowledge-driven speech synthesis system.
Basic units, when concatenated, need to match the predicted duration of the word.
Basic units: V, VC, CV, VCV, VCCV and VCCCV.
Duration modification can only be performed on the vowel parts of the basic units.
We need automated segmentation.
Manual segmentation: inconsistent & tedious.
[Figure: synthesized word /kamala/ and the basic units /ka/, /ama/ and /ala/ it is built from; waveforms against time (sec).]

Subspace based segmentation of consonants & vowels
Plosives, nasals, affricates and fricatives have a common property of low energy compared to vowels, whereas glides have comparable energy.
Hence, energy-based segmentation is ineffective.
Test feature vectors are projected on the vowel and consonant subspaces.
V & C subspaces are represented by generalized eigenvectors obtained from the feature vectors of the training set.

Fisher's Discriminant
Projection Plane 1: proper projection leads to perfect classification.
Projection Plane 2: projection involves overlap and leads to improper classification.

Motivation for subspace based segmentation
[Figure: segmentation using the energy-based method. (a) Speech signal /aka/ with segments /a/, /k/, /a/. (b) Speech signal /eyo/ with segments /e/, /y/, /o/ and the actual consonant (/y/) position marked. x-axis: time (sec).]
Accurate CV segmentation is obtained using the energy-based method for non-coarticulated basic units, but not for coarticulated phones.
In our approach: we collect an ensemble of feature vectors of length N corresponding to different vowels and obtain the vowel covariance matrix Cv.
Similarly for consonants, Cc.
Generalized eigenvectors corresponding to Cv & Cc are used.
Effective for coarticulated basic units.

Feature Transformation
Features like LPC, LPCC and MFCC model statistical properties of vowels and consonants.
From the point of view of basic unit segmentation:
VI: vowel information in feature vectors (signal).
CI: consonant information (noise).
Linear feature transformation: aims at finding a subspace of the feature space with maximum SNR.

Feature Transformation (contd..)
Representing VI and CI by training vectors obtained using manual segmentation.
Direction in the feature space with max SNR obtained using GEV decomposition of Cv and Cc, the covariance matrices of feature vectors of V & C.
Linear transformation matrix W:
\[ \hat{x} = W^T x, \qquad \dim(x) = n, \qquad \dim(\hat{x}) = m, \qquad m < n \]

Feature Transformation (contd..)
Let d_v be training vectors containing VI and d_c be training vectors containing CI:
\[ C_v = E\{(d_v - \bar{d}_v)(d_v - \bar{d}_v)^T\}, \qquad C_c = E\{(d_c - \bar{d}_c)(d_c - \bar{d}_c)^T\} \]
Find W such that the ratio of the variance caused by VI to that caused by CI is maximized after transformation.
Density functions of d_v and d_c are assumed to be normally distributed.
Covariance matrices after transformation:
\[ \hat{C}_v = W^T C_v W, \qquad \hat{C}_c = W^T C_c W \]
Measure of the variance or the scatter is the determinant of the covariance matrix.
Determinant is equal to the product of the eigenvalues and hence the product of the variances in the principal directions.
Criterion function to be maximized:
\[ J(W) = \frac{|\hat{C}_v|}{|\hat{C}_c|} = \frac{|W^T C_v W|}{|W^T C_c W|} \]
Columns of optimal W are obtained as GEVV (generalized eigenvectors for vowels) corresponding to the value-ordered (largest to smallest) eigenvalues in
\[ C_v \phi_i^{(v)} = \lambda_i C_c \phi_i^{(v)} \]
Similarly for consonants, GEVC:
\[ C_c \phi_i^{(c)} = \lambda_i C_v \phi_i^{(c)} \]
Thus, the transformation W diagonalizes both Cv & Cc.
The variance of VI along \phi_i^{(v)} is \lambda_i while CI has unit variance in all directions.
Using the SNR measure introduced in Malayath et al. for GEVV:
\[ \mathrm{SNR} = \frac{\operatorname{trace}(W^T C_v W)}{\operatorname{trace}(W^T C_c W)} = \frac{\sum_{i=1}^{m} \lambda_i}{m} \]
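
A short worked step may help connect the generalized eigenproblem to the SNR expression. It assumes the usual normalization of generalized eigenvectors, in which they are orthonormal with respect to the consonant covariance; that normalization is not stated on the slide, and the symbols \phi_i^{(v)} and \lambda_i denote the vowel generalized eigenvectors and eigenvalues.

```latex
% Assumed normalization: (\phi_i^{(v)})^T C_c \phi_j^{(v)} = \delta_{ij}
\begin{align*}
W &= \left[\phi_1^{(v)}, \dots, \phi_m^{(v)}\right],
  \qquad C_v \phi_i^{(v)} = \lambda_i C_c \phi_i^{(v)} \\
W^T C_c W &= I_m
  \quad \text{(CI has unit variance in every retained direction)} \\
W^T C_v W &= \operatorname{diag}(\lambda_1, \dots, \lambda_m)
  \quad \text{(VI has variance } \lambda_i \text{ along } \phi_i^{(v)}\text{)} \\
J(W) &= \frac{|W^T C_v W|}{|W^T C_c W|} = \prod_{i=1}^{m} \lambda_i \\
\mathrm{SNR} &= \frac{\operatorname{trace}(W^T C_v W)}{\operatorname{trace}(W^T C_c W)}
  = \frac{\sum_{i=1}^{m} \lambda_i}{m}
\end{align*}
```

Keeping the eigenvectors with the largest eigenvalues therefore maximizes both the determinant criterion and the trace-based SNR for a given m.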

Vowel Consonant Segmentation
GEVVs and GEVCs are obtained.
Evaluating norm contours:
\[ NC_v(k) = \sum_{i=1}^{L} \left\| (\phi_i^{(v)})^T x_k \right\|, \qquad NC_c(k) = \sum_{i=1}^{L} \left\| (\phi_i^{(c)})^T x_k \right\| \]
NC_v and NC_c are norm contours from the V and C subspaces.
These norm contours represent VI and CI.

Vowel Consonant Segmentation
Norm contours cross each other: NC_v(k) = NC_c(k) => segmentation points.
We found optimum results for L = 3.
Relative importance of the different frequency bands for vowels and consonants is conveyed by the first three principal filters.
VI: mid frequency region of the speech spectrum.
CI: low & high frequency regions.
Speech & speaker information [Vijayakrishna].
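
Locating segmentation points from two norm contours can be sketched as follows. The sketch assumes the contour is the sum of absolute projections of each feature frame onto the basis vectors of a subspace, and it uses hand-made two-dimensional vectors in place of trained generalized eigenvectors; both are illustrative assumptions, and the function names are hypothetical.

```python
def norm_contour(frames, basis):
    """Sum of absolute projections of each frame onto the basis vectors."""
    return [
        sum(abs(sum(p * x for p, x in zip(phi, frame))) for phi in basis)
        for frame in frames
    ]


def crossing_points(nc_v, nc_c):
    """Frame indices k at which the sign of NC_v(k) - NC_c(k) changes."""
    diff = [v - c for v, c in zip(nc_v, nc_c)]
    return [k for k in range(1, len(diff))
            if (diff[k - 1] > 0) != (diff[k] > 0)]


# Toy data: one basis vector per subspace instead of the L = 3 used in the talk.
vowel_basis = [[1.0, 0.0]]
consonant_basis = [[0.0, 1.0]]
frames = [[1.0, 0.1], [0.9, 0.2], [0.2, 0.8], [0.1, 1.0], [0.8, 0.3]]

nc_v = norm_contour(frames, vowel_basis)
nc_c = norm_contour(frames, consonant_basis)
print(crossing_points(nc_v, nc_c))
```

In this toy vowel-consonant-vowel pattern the contours cross twice, once entering and once leaving the consonant, which gives the two boundaries of the consonant segment.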

Performance of Vowel Consonant Segmentation
GEVVs & GEVCs are obtained from a Tamil speech database spoken by a male volunteer.
For obtaining MFCC, the Mel scale was simulated using a set of 24 triangular filters.
For LPCC, a 12th order LPC analysis was performed after pre-emphasis with a coefficient of 0.95.
Segmentation tests were carried out on the basic units of a Kannada speech database spoken by a female volunteer.
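
The pre-emphasis step before LPC analysis can be sketched in a few lines. The slide gives only the value 0.95; the first-order difference form used here is the common textbook filter and is an assumption about what was applied.

```python
def pre_emphasis(signal, coeff=0.95):
    """Apply y[n] = x[n] - coeff * x[n-1], keeping y[0] = x[0]."""
    if not signal:
        return []
    return [signal[0]] + [
        signal[n] - coeff * signal[n - 1] for n in range(1, len(signal))
    ]


print(pre_emphasis([1.0, 1.0, 1.0, 0.0]))
```

A constant input is almost cancelled after the first sample, which shows how the filter attenuates low frequencies relative to high ones before the LPC coefficients are estimated.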
[Figure: speech signal /eyo/. (a) Waveform. (b) Norm against time (sec). Lower panel: frequency (Hz) against time (sec).]

Performance of Vowel Consonant Segmentation
[Figure: speech signal /iyo/ with segments /i/, /y/, /o/. (a) Waveform. (b) Norm. Lower panel: frequency against time.]
[Figure: speech signal /aul1o/ with segments /au/, /l1/, /o/. (a) Waveform. (b) Norm against time (sec). Lower panel: frequency (Hz) against time (sec).]
[Figure: speech signal /auyi/ with segments /au/, /y/, /i/. (a) Waveform. (b) Norm against time (sec). Lower panel: frequency against time.]

Publications
R. Muralishankar, A. G. Ramakrishnan, "Modification of Pitch using DCT in the Source Domain", Speech Communication, 2005.
R. Muralishankar and A. G. Ramakrishnan, "Discrete Cosine Transformed Cepstrum", International Journal of Speech Technology, 2002.
R. Muralishankar and A. G. Ramakrishnan, "DCT based pseudo complex cepstrum", Proc. ICASSP 2002, Orlando, Florida, May 13-17, 2002.
R. Murali Shankar and AGR, "DCT based Pitch Modification", Proc. SPCOM 01, IISc, Bangalore, July 15-18, 2001.
Jayavardhana Rama and AGR, "Thirukkural: a text-to-speech synthesis system", Tamil Internet 2001, Kuala Lumpur, Aug 26-28, 2001.
R. Muralishankar and AGR, "Naturalising the Tamil synthesizer", Tamil Internet 2001, Kuala Lumpur, August 26-28, 2001.

Publications contd.
K G Aparna, G L Jayavardhana Rama and A. G. Ramakrishnan, "Machine reading of Tamil books: an aid for the blind", Proc. International Conf. on Biomedical Engg., Bangalore, Dec. 21-24, 2001.
K. Suresh and AGR, "A DCT based Estimation of Pitch", Proc. Intern. Conf. Multimedia Proc. Systems, Chennai, Aug. 13-15, 2000.
R. Murali Shankar and AGR, "Robust Pitch detection using DCT based Spectral Autocorrelation", Proc. Intern. Conf. Multimedia Processing and Systems, Chennai, Aug. 13-15, 2000.
R Murali Shankar and AGR, "Synthesis of Speech with Emotions", Proc. Intern. Conf. Commn. Comp. Devices, KGP, Dec. 14-16, 2000.

Acknowledgement
Ministry of Social Justice & Empowerment
Department of Information Technology, Government of India.
Karnataka State Council for Science and Technology.
Tamil Software Development Fund
LDCIL and CIIL
THE DISTINGUISHED AUDIENCE

Block diagram of TTS

Input/Output of Thirukkural/Vaachaka
Input:
Text input through multiple keyboards
Printed text that has undergone optical character recognition
Existing Unicode files or from web sites
Output:
Intelligible, natural Tamil/Kannada speech

Text analysis
Offline: recording basic units; observation of duration in natural speech; CIIL book on phone duration.
Online: parsing; grapheme-to-phoneme conversion; applying duration rules.

Speech Signal Processing
Consonant-vowel segmentation
Pitch detection and marking
Concatenation (pitch, amplitude and duration modification)

Consonant Vowel Segmentation
Energy-based segmentation: fails for coarticulated CV such as /yi/.
LP cepstrum based segmentation.

Pitch Detection and Marking
Pitch detection: DCT based spectral autocorrelation.
Pitch marking: marked at zero crossings.

Features of the software
Acceptable quality male voice.
Text input using any keyboard interface or existing Unicode file.
Displays text in Tamil/Kannada.
Acceptably natural and intelligible.

Scope for further work
Comprehensive: not missing consonant clusters.
Natural prosody.
Simulating different characteristics of the speaker.
Emotions could be added.
Provision for alien words, English.
