INTRODUCTION
A user gives a predefined voice instruction to the system through a microphone; the system understands the command and executes the required function. This allows the user to operate Windows by voice, without using a keyboard or mouse.
KEY TERMS
Speaking Modes
o Isolated Words
o Continuous Speech
SPHINX
Sphinx is a set of Java classes used in the background to recognize speech. It is open source and is implemented in Java.
OVERALL PROCESSING
FEATURE EXTRACTION
[Block diagram: Speech Data → Feature Extraction → Recognition Engine → output, with Text Data → Language Model also feeding the Recognition Engine]
It generates a set of 51-dimension feature vectors that represent important characteristics of the speech signal. Feature extraction converts the speech waveform into some type of parametric representation. A wide range of possibilities exists for parametrically representing the speech signal, such as LPC (Linear Predictive Coding) and MFCC (Mel-Frequency Cepstral Coefficients).
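As an illustration, an MFCC-style front end can be sketched in a few steps: pre-emphasis, framing and windowing, power spectrum, mel filterbank, log, and a DCT. The frame sizes, filter count, and 13 cepstral coefficients below are common textbook defaults, not necessarily what this system uses (real front ends often append delta and energy terms to reach higher-dimensional vectors such as the 51 mentioned above):

```python
import numpy as np

def mfcc_like(signal, sr=16000, frame_len=400, hop=160,
              n_filters=26, n_ceps=13, nfft=512):
    # Pre-emphasis boosts high frequencies.
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Slice into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(sig) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(frame_len)
    # Per-frame power spectrum.
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft
    # Triangular mel filterbank.
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the log filterbank energies into cepstra.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1)
                 / (2 * n_filters))
    return log_energy @ dct.T
```

One second of 16 kHz audio yields 98 frames of 13 coefficients each with these settings.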
[Language-model build flowchart (CMU-Cambridge SLM toolkit): the text corpus goes through text2wfreq and then wfreq2vocab to produce the vocab, and through text2idngram to produce the Id-N-gram; idngram2lm combines the Id-N-gram and vocab into an arpa language model (binlm2arpa converts a binary LM to ARPA format); lm3g2dmp converts the arpa file into the arpa.dmp format used by the decoder]
POCKET SPHINX
Decoding Engine: Pocketsphinx is used as a set of libraries that includes the core speech recognition functions.
The input is an audio file in WAV format, and the final recognition output is displayed as text.
HMM Background
The basic theory was developed and published in the 1960s and 70s.
HMM Overview
Machine learning method
Makes use of state machines
Based on a probabilistic model
Can only observe output from states, not the states themselves
Example: speech recognition
Observe: acoustic signals
Hidden States: phonemes
(distinctive sounds of a language)
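Because only the acoustic outputs are observable, recognition must compute the probability of an observation sequence by summing over all hidden state (phoneme) paths. Here is a minimal sketch of the forward algorithm; the two-state model and all probabilities are invented for illustration:

```python
import numpy as np

# Hypothetical 2-state HMM: hidden "phonemes" emit one of two
# acoustic symbols (0 or 1) at each time step.
A = np.array([[0.7, 0.3],    # A[i, j] = P(next state j | state i)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],    # B[i, k] = P(emit symbol k | state i)
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])    # initial state distribution

def forward(obs):
    """P(observation sequence | model), summed over all hidden
    state paths, via the forward algorithm."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()
```

A sanity check: the probabilities of all possible observation sequences of a fixed length sum to 1.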
HMM Components
A set of states (xs)
A set of possible output symbols (ys)
A state transition matrix (as): the probability of making a transition from one state to the next
o What is the probability the weather for the next 7 days will be:
sun, sun, rain, rain, sun, cloudy, sun
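With a transition matrix in hand, that seven-day probability is just the product of the transition entries along the sequence. The matrix values below are invented for illustration, and the chain is assumed to start in the first observed state:

```python
import numpy as np

states = ["sun", "rain", "cloudy"]
# Hypothetical transition matrix: A[i, j] = P(tomorrow j | today i).
A = np.array([[0.80, 0.05, 0.15],
              [0.20, 0.60, 0.20],
              [0.20, 0.30, 0.50]])

seq = ["sun", "sun", "rain", "rain", "sun", "cloudy", "sun"]
idx = [states.index(s) for s in seq]

# Multiply the transition probability for each consecutive day pair.
p = 1.0
for a, b in zip(idx, idx[1:]):
    p *= A[a, b]
# For these made-up numbers, p = 0.8*0.05*0.6*0.2*0.15*0.2 = 0.000144
```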
Bakis (left-right):
o As time increases, states proceed from left to right
HMM Advantages
Advantages:
o Effective
o Can handle variations in record structure
  Optional fields
  Varying field ordering
HMM Uses
Speech recognition: recognizing spoken words and phrases
Text processing: parsing raw records into structured records
Bioinformatics: protein sequence prediction
Financial:
o Stock market forecasts (price pattern prediction)
o Comparison shopping services
THE LEXICAL ACCESS COMPONENT OF THE CMU CONTINUOUS SPEECH RECOGNITION SYSTEM
The CMU Lexical Access System hypothesizes words from a phonetic dictionary.
Word hypotheses are anchored on syllabic nuclei and are generated independently for different parts of the utterance.
EXAMPLE: the word "cat" [kæt] has the syllabic nucleus [æ].
[System block diagram: Front End → Coarse Labeler → Anchor Generator → Matcher → Lattice Integrator, with the Lexicon, Parser, and Verifier connected to the Matcher]
MATCHING ENGINE
Words are hypothesized by matching an input sequence of labels against the stored representation of the possible pronunciation. It uses the Beam search algorithm which is a modified best first search strategy. The beam search algorithm can simultaneously search paths with different lengths.
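The pruning idea behind beam search can be sketched as follows: score every dictionary pronunciation against the incoming label sequence one label at a time, keeping only the best few partial hypotheses at each step. This toy matcher (exact-match scores, a tiny lexicon, no variable-length paths) is only an illustration of the strategy, not the CMU matcher itself:

```python
def beam_match(labels, lexicon, beam_width=2):
    """Match a label sequence against stored pronunciations with
    beam pruning. Each hypothesis is (word, next phone index, score)."""
    beam = [(w, 0, 0) for w in lexicon]
    for label in labels:
        expanded = []
        for word, i, score in beam:
            phones = lexicon[word]
            if i < len(phones):
                # Toy scoring: +1 when the stored phone matches the label.
                match = 1 if phones[i] == label else 0
                expanded.append((word, i + 1, score + match))
        # Prune: keep only the top `beam_width` partial hypotheses.
        beam = sorted(expanded, key=lambda h: -h[2])[:beam_width]
    return max(beam, key=lambda h: h[2])[0] if beam else None
```

With a narrow beam, poorly matching words are discarded early instead of being scored to the end.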
THE LEXICON
The lexicon (dictionary) is stored in the form of a phonetic network. The sources of pronunciations that have been used:
o On-line phonetic dictionary, such as the Shop Dictionary.
o Letter-to-sound compiler (The Talk System).
The current CMU lexicon is constructed using a base of over 150 rules covering several types of phenomena:
o Co-articulatory phenomena.
o Front-end characteristics.
ANCHOR GENERATION
To eliminate unnecessary matches, the voice recognition system uses syllable anchors to select locations in an utterance where words are to be hypothesized.
The anchor generation algorithm is straightforward and is based on the following reasoning:
o Words are composed of syllables, and every syllable contains a vocalic center.
o Word divisions cannot occur inside a vocalic center.
o The coarse labeler provides information about vocalic, non-vocalic, and silent regions.
The algorithm is implemented in such a way that the best hypothesis will be generated.
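Following that reasoning, anchor generation can be sketched as grouping the coarse label sequence into one anchor per maximal vocalic run. The single-letter labels (V = vocalic, N = non-vocalic, S = silence) are an assumed encoding, not the system's actual format:

```python
def syllabic_anchors(labels):
    """Return (start, end) index pairs, one per maximal run of
    vocalic labels 'V' in the coarse label sequence."""
    anchors, start = [], None
    for i, lab in enumerate(labels):
        if lab == "V" and start is None:
            start = i                      # a vocalic region begins
        elif lab != "V" and start is not None:
            anchors.append((start, i - 1))  # region just ended
            start = None
    if start is not None:                  # region runs to the end
        anchors.append((start, len(labels) - 1))
    return anchors
```

Each returned region marks a potential syllabic nucleus around which word hypotheses would be anchored.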
Single Anchor:
o In single-anchor mode, anchors of different lengths are generated and the matcher is invoked separately for each one. Although this procedure is simple, it is also inefficient.
Multiple Anchor:
o Multiple-anchor mode reduces the computation, and also reduces the number of hypotheses generated.
COARSE LABELER
The coarse labeling algorithm is based on the ZAPDASH (Zerocrossing And Peak to peak amplitude of Differenced And Smoothed data) algorithm. The algorithm is robust and speaker independent, and operates reliably over a large dynamic range.
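In the spirit of ZAPDASH, a toy per-frame classifier can be built from just those two measurements: a frame with small peak-to-peak amplitude is silence, a frame with a high zero-crossing rate is non-vocalic (fricative-like), and the rest is vocalic. The thresholds below are invented for illustration and are not from the paper:

```python
import numpy as np

def coarse_label(frame, zc_thresh=0.25, amp_thresh=0.05):
    """Classify one audio frame as 'S' (silence), 'N' (non-vocalic),
    or 'V' (vocalic) from peak-to-peak amplitude and zero-crossing rate."""
    ptp = frame.max() - frame.min()          # peak-to-peak amplitude
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2  # crossings/sample
    if ptp < amp_thresh:
        return "S"
    return "N" if zcr > zc_thresh else "V"
```

A low-frequency sine (voiced-like) labels as vocalic, a rapidly alternating signal as non-vocalic, and near-zero samples as silence.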
JUNCTION VERIFIER
The verifier basically examines junctures between words and determines whether these words can be connected together in sequence.
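One necessary condition at a word junction is temporal adjacency: two word hypotheses can only be chained if one ends roughly where the next begins. This toy check illustrates only that idea (the real verifier also applies phonetic juncture rules, not modeled here); the field names and gap tolerance are invented:

```python
def can_join(left, right, max_gap=2):
    """Accept the pair if `right` starts within max_gap frames of
    where `left` ends (small gaps or overlaps tolerated)."""
    return abs(right["start"] - (left["end"] + 1)) <= max_gap

def chain_ok(hyps, max_gap=2):
    """A word sequence is acceptable only if every adjacent pair
    passes the junction test."""
    return all(can_join(a, b, max_gap) for a, b in zip(hyps, hyps[1:]))
```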
CONCLUSION
This overview is not nearly detailed enough to actually write a speech recognizer, but it exposes the basic concepts. The basic concepts covered here for implementing speech recognition are:
1. Sphinx
2. Lexical Access System
3. HMM Model
Real-life implementations of these techniques are still in development, while some have been launched successfully, for example Winvoice, which uses Sphinx.
REFERENCES
Alexander I. Rudnicky, Lynn K. Baumeister, Kevin H. DeGraaf, "The Lexical Access Component of the CMU Continuous Speech Recognition System," pp. 376-379, 1987, IEEE.
Yun Wang, Xueying Zhang, "Realization of Mandarin Continuous Digits Speech Recognition," pp. 378-380, 2010, IEEE.
Todd A. Stephenson, "Speech Recognition with Auxiliary Information," pp. 189-203, 2004, IEEE.