SESSION (2014-15)

GUIDED BY:
JITENDRA KASERA

SUBMITTED BY:
Himanshu Choubisa
(EEE, VIII Semester)
TABLE OF CONTENTS

CERTIFICATE
ACKNOWLEDGEMENT
ABSTRACT
1. Introduction
   1.1 Problems
   1.2 Tools
   1.3 Applications
2. The Technology
3. Speech Recognition
4. Speaker Independency
7. Application
   7.1 Health care
   7.2 Military
   7.3 Training air traffic controllers
8. Conclusion
9. Bibliography
CERTIFICATE
I hereby certify that the work which is being presented in the B. Tech Project
Report entitled Topic, in partial fulfillment of the requirements for the award of the
Bachelor of Technology in Electrical and Electronics Engineering and submitted to
the Department of Electrical and Electronics Engineering, Pacific College of
Engineering, Udaipur, is an authentic work carried out during the period from Jan.
2015 to May 2015 under the supervision of Mr. Jitendra Kasera.
The matter presented in this report has not been submitted by me for the award of
any other degree elsewhere.
________________
Signature of Candidate

This is to certify that the above statement made by the candidate is correct to the
best of my knowledge.

________________                              ________________
(Guide Name)                                  (Head of Dept., EEE)
Pacific College of Engineering, Udaipur
ACKNOWLEDGEMENT
This is an opportunity to express my heartfelt gratitude to the people who were part
of this project in numerous ways, people who gave me unending support right from the
beginning of the project.
I express my earnest gratitude to my internal guide Mr. JITENDRA KASERA,
Department of EEE, for his constant support, encouragement and guidance. I am
grateful for his cooperation and his valuable suggestions.
Finally, I express my gratitude to all others who were involved, directly or
indirectly, in the completion of this project.
ABSTRACT
Artificial intelligence (AI) for speech recognition involves two basic ideas. First, it
involves studying the thought processes of human beings. Second, it deals with representing those
processes via machines (like computers, robots, etc.). AI is the behavior of a machine which, if
performed by a human being, would be called intelligent. It makes machines smarter and more
useful, and is less expensive than natural intelligence. Natural language processing (NLP) refers
to artificial intelligence methods of communicating with a computer in a natural language like
English. The main objective of an NLP program is to understand input and initiate action. The
input words are scanned and matched against internally stored known words. Identification of a
keyword causes some action to be taken. In this way, one can communicate with the computer in
one's own language.
Chapter 1
INTRODUCTION
The speech recognition process is performed by a software component known as the
speech recognition engine. The primary function of the speech recognition engine is to process
spoken input and translate it into text that an application understands. The application can then
do one of two things. If the application interprets the result of the recognition as a command,
it is a command and control application. If an application handles the
recognized text simply as text, then it is considered a dictation application. The user speaks to the
computer through a microphone, which in turn identifies the meaning of the words and sends it
to an NLP device for further processing. Once recognized, the words can be used in a variety of
applications like display, robotics, commands to computers, and dictation. No special commands
or computer language are required. There is no need to enter programs in a special language for
creating software. Voice XML takes speech recognition even further: instead of talking to your
computer, you're essentially talking to a web site, and you're doing this over the phone. OK, you
say, well, what exactly is speech recognition? Simply put, it is the process of converting spoken
input to text. Speech recognition is thus sometimes referred to as speech-to-text. Speech
recognition allows you to provide input to an application with your voice. Just as clicking with
your mouse, typing on your keyboard, or pressing a key on the phone keypad provides input to
an application, speech recognition allows you to provide input by talking. In the desktop world,
you need a microphone to be able to do this. In the Voice XML world, all you need is a
telephone.
When you dial the telephone number of a big company, you are likely to hear the
sonorous voice of a cultured lady who responds to your call with great courtesy, saying "Welcome
to company X. Please give me the extension number you want." You pronounce the extension
number, your name, and the name of the person you want to contact. If the called person accepts
the call, the connection is made quickly. This is artificial intelligence at work: an automatic
call-handling system is used without employing any telephone operator.
AI is the study of how to make computers perform tasks which, at present, are better
done by humans. AI is an interdisciplinary field where computer science intersects with
philosophy, psychology, engineering and other fields. Humans make decisions based upon
experience and intention. The essence of AI is the integration of computers to mimic this learning
process, an approach known as Artificial Intelligence Integration.
1.1 Problems
The general problem of simulating (or creating) intelligence has been broken down into a
number of specific sub-problems. These consist of particular traits or capabilities that researchers
would like an intelligent system to display. The traits described below have received the most
attention.
1.1.2 Knowledge representation
Among the things an intelligent system needs to represent are: relations between objects;
situations, events, states and time; causes and effects; knowledge about knowledge (what we
know about what other people know); and many other, less well researched domains. A complete
representation of "what exists" is an ontology (borrowing a word from traditional philosophy),
of which the most general are called upper ontologies.
1.1.3 Planning
Intelligent agents must be able to set goals and achieve them. They need a way to
visualize the future (they must have a representation of the state of the world and be able to make
predictions about how their actions will change it) and be able to make choices that maximize the
utility (or "value") of the available choices.
In classical planning problems, the agent can assume that it is the only thing acting on the
world and it can be certain what the consequences of its actions may be. However, if this is not
true, it must periodically check if the world matches its predictions and it must change its plan as
this becomes necessary, requiring the agent to reason under uncertainty. Multi-agent planning
uses the cooperation and competition of many agents to achieve a given goal. Emergent behavior
such as this is used by evolutionary algorithms and swarm intelligence.
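To make the idea of utility-maximizing choice concrete, here is a minimal Python sketch; the actions, outcomes and probabilities are purely illustrative and not taken from any particular planning system.

# Minimal sketch: an agent picks the action whose expected utility is
# highest. Actions, outcomes and probabilities here are illustrative.

def expected_utility(outcomes):
    # outcomes: list of (probability, utility) pairs for one action
    return sum(p * u for p, u in outcomes)

actions = {
    "move_forward": [(0.8, 10.0), (0.2, -50.0)],  # may hit an obstacle
    "turn_left": [(1.0, 2.0)],                    # safe but slow
    "wait": [(1.0, 0.0)],
}

best = max(actions, key=lambda a: expected_utility(actions[a]))
print(best, expected_utility(actions[best]))      # turn_left 2.0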
1.1.4 Learning
Machine learning has been central to AI research from the beginning. Unsupervised
learning is the ability to find patterns in a stream of input. Supervised learning includes both
classification and numerical regression. Classification is used to determine what category
something belongs in, after seeing a number of examples of things from several categories.
Regression takes a set of numerical input/output examples and attempts to discover a continuous
function that would generate the outputs from the inputs. In reinforcement learning the agent is
rewarded for good responses and punished for bad ones. These can be analyzed in terms of
decision theory, using concepts like utility. The mathematical analysis of machine learning
algorithms and their performance is a branch of theoretical computer science known as
computational learning theory.
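As a small illustration of the supervised-learning idea above, the following Python sketch (using NumPy, with made-up data) recovers a continuous function from noisy numerical input/output examples by least-squares regression.

# Regression: discover a continuous function from numerical
# input/output examples (here the true function is y = 3x + 1).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 40)
y = 3.0 * x + 1.0 + rng.normal(0.0, 0.5, x.size)  # noisy samples

slope, intercept = np.polyfit(x, y, deg=1)        # least-squares fit
print(f"learned f(x) = {slope:.2f}*x + {intercept:.2f}")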
Figure 1.1: ASIMO uses sensors and intelligent algorithms to avoid obstacles and navigate stairs.
1.1.5 Natural language processing
Natural language processing gives machines the ability to read and understand the
languages that humans speak. Many researchers hope that a sufficiently powerful natural
language processing system would be able to acquire knowledge on its own by reading the
existing text available over the Internet. Some straightforward applications of natural language
processing include information retrieval (or text mining) and machine translation.
Figure 1.2: The care-providing robot FRIEND uses sensors such as cameras and intelligent
algorithms to control its manipulator in order to support disabled and elderly people in their
daily life activities.

1.1.6 Robotics
The field of robotics is closely related to AI. Intelligence is required for robots to be
able to handle such tasks as object manipulation and navigation, with sub-problems of
localization (knowing where you are), mapping (learning what is around you) and motion
planning (figuring out how to get there).
1.1.7 Perception
Machine perception is the ability to use input from sensors (such as cameras,
microphones, sonar and others more exotic) to deduce aspects of the world. Computer vision is
the ability to analyze visual input. A few selected sub-problems are speech recognition, facial
recognition and object recognition.
1.1.8 Social intelligence
Emotion and social skills play two roles for an intelligent agent. First, it must be able to
predict the actions of others by understanding their motives and emotional states. (This
involves elements of game theory and decision theory, as well as the ability to model human
emotions and the perceptual skills to detect emotions.) Also, for good human-computer
interaction, an intelligent machine needs to display emotions. At the very least it must
appear polite and sensitive to the humans it interacts with. At best, it should have normal
emotions itself.
1.1.9 Creativity
Figure 1.4: TOPIO, a robot that can play table tennis, developed by TOSY.
1.2 Tools
In the course of 50 years of research, AI has developed a large number of tools to solve
the most difficult problems in computer science. A few of the most general of these methods are
discussed below.
Among them are optimization methods such as swarm intelligence algorithms (for example,
particle swarm optimization) and evolutionary algorithms (such as genetic algorithms and
genetic programming).
1.2.2 Logic
Logic was introduced into AI research by John McCarthy in his 1958 Advice Taker
proposal. Logic is used for knowledge representation and problem solving, but it can be applied
to other problems as well. For example, the SATPLAN algorithm uses logic for planning, and
inductive logic programming is a method for learning.
Several different forms of logic are used in AI research. Propositional or sentential logic
is the logic of statements which can be true or false. First-order logic also allows the use of
quantifiers and predicates, and can express facts about objects, their properties, and their
relations with each other. Fuzzy logic is a version of first-order logic which allows the truth of a
statement to be represented as a value between 0 and 1, rather than simply True (1) or False (0).
Fuzzy systems can be used for uncertain reasoning and have been widely used in modern
industrial and consumer product control systems. Subjective logic models uncertainty in a
different and more explicit manner than fuzzy logic: a given binomial opinion satisfies belief +
disbelief + uncertainty = 1 within a Beta distribution. By this method, ignorance can be
distinguished from probabilistic statements that an agent makes with high confidence. Default
logics, non-monotonic logics and circumscription are forms of logic designed to help with
default reasoning and the qualification problem. Several extensions of logic have been designed
to handle specific domains of knowledge, such as: description logics; situation calculus, event
calculus and fluent calculus (for representing events and time); causal calculus; belief calculus;
and modal logics.
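The subjective-logic constraint mentioned above (belief + disbelief + uncertainty = 1) is easy to state in code. The Python sketch below only illustrates the constraint and the standard projection of an opinion onto a probability; it is not a full subjective-logic implementation.

# A binomial opinion: belief + disbelief + uncertainty must equal 1,
# which lets total ignorance be told apart from confident statements.
from dataclasses import dataclass

@dataclass
class BinomialOpinion:
    belief: float
    disbelief: float
    uncertainty: float
    base_rate: float = 0.5  # prior probability of the proposition

    def __post_init__(self):
        total = self.belief + self.disbelief + self.uncertainty
        assert abs(total - 1.0) < 1e-9, "b + d + u must equal 1"

    def expected_probability(self):
        # standard projection of an opinion onto a single probability
        return self.belief + self.base_rate * self.uncertainty

confident = BinomialOpinion(0.90, 0.05, 0.05)  # strong evidence
ignorant = BinomialOpinion(0.00, 0.00, 1.00)   # total ignorance
print(confident.expected_probability())        # 0.925
print(ignorant.expected_probability())         # 0.5 (pure base rate)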
Bayesian networks are a very general tool that can be used for a large number of problems:
reasoning (using the Bayesian inference algorithm), learning (using the expectation-
maximization algorithm), planning (using decision networks) and perception (using dynamic
Bayesian networks). Probabilistic algorithms can also be used for filtering, prediction, smoothing
and finding explanations for streams of data, helping perception systems to analyze processes
that occur over time (e.g., hidden Markov models or Kalman filters).
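As a concrete example of analyzing a process over time, here is a minimal forward-algorithm sketch for a two-state hidden Markov model in Python; all probabilities are made up for illustration.

# Forward algorithm: probability of an observation sequence under a
# two-state HMM. A = state transitions, B = emission probabilities.
import numpy as np

A = np.array([[0.7, 0.3],
              [0.4, 0.6]])       # P(next state | current state)
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])       # P(observation | state)
pi = np.array([0.5, 0.5])        # initial state distribution

def forward(observations):
    alpha = pi * B[:, observations[0]]
    for obs in observations[1:]:
        alpha = (alpha @ A) * B[:, obs]
    return alpha.sum()           # P(observation sequence)

print(forward([0, 1, 1, 0]))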
A key concept from the science of economics is "utility": a measure of how valuable
something is to an intelligent agent. Precise mathematical tools have been developed that analyze
how an agent can make choices and plan, using decision theory, decision analysis, and information
value theory. These tools include models such as Markov decision processes and dynamic
decision networks.
1.2.7 Languages
AI researchers have developed several specialized languages for AI research, including
Lisp and Prolog.
1.3 Applications
Artificial intelligence has successfully been used in a wide range of fields including
medical diagnosis, stock trading, robot control, law, scientific discovery, video games, toys, and
Web search engines. Frequently, when a technique reaches mainstream use, it is no longer
considered artificial intelligence, a phenomenon sometimes described as the AI effect. It may
also become integrated into artificial life.
1.3.2 Platforms
A platform (or "computing platform")is defined by Wikipedia as "some sort of hardware
architecture or software framework (including application frameworks), that allows software to
run." As Rodney Brooks pointed out many years ago, it is not just the artificial intelligence
software that defines the AI features of the platform, but rather the actual platform itself that
affects the AI that results, i.e, we need to be working out AI problems on real world platforms
rather than in isolation.
A wide variety of platforms has allowed different aspects of AI to develop, ranging from
expert systems, albeit PC-based but still an entire real-world system to various robot platforms
such as the widely available Roomba with open interface.
Chapter 2
THE TECHNOLOGY
A human identity recognition system based on voice analysis could have seamless
applications. ASR (Automatic Speaker Recognition) is one such system: a system that can
recognize a person based on his/her voice. This is achieved by implementing complex signal
processing algorithms that run on a digital computer or a processor. This application is
analogous to fingerprint recognition and other biometric recognition systems that are based on
certain characteristics of a person.
There are several occasions when we want to identify a person from a given group of
people even when the person is not present for physical examination. For example, when a
person converses on a telephone, all we have is the person's voice for analysis. It then makes
sense to develop a recognition system based on voice.
Speaker recognition has typically been classified as either a verification or an identification
task. Speaker verification is usually the simpler of the two, since it involves the comparison of
the input signal with a single given stored reference pattern. The verification task only
requires a system to verify that the speaker is the same person he/she claims to be. Speaker
identification is more complex because the test speaker must be compared against a number of
reference speakers to determine whether a match can be made. Not only must the input signal
be examined to see whether it came from a known speaker, but the individual speaker must
also be identified.
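The difference between the two tasks can be sketched in a few lines of Python. Here each speaker is reduced to a single stored feature vector and Euclidean distance, which is a deliberate simplification of real ASR systems; the names, vectors and threshold are illustrative.

# Verification: one comparison against the claimed speaker's reference.
# Identification: comparison against every stored reference speaker.
import numpy as np

references = {
    "alice": np.array([1.0, 0.2, 0.7]),
    "bob": np.array([0.3, 0.9, 0.1]),
}

def verify(sample, claimed_id, threshold=0.5):
    distance = np.linalg.norm(sample - references[claimed_id])
    return distance < threshold

def identify(sample):
    return min(references,
               key=lambda spk: np.linalg.norm(sample - references[spk]))

sample = np.array([0.9, 0.3, 0.6])
print(verify(sample, "alice"))  # True: close to alice's reference
print(identify(sample))         # "alice": nearest stored reference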
The identification of speakers remains a difficult task for a number of reasons. First, the
acquisition of a unique speech signal can suffer from variation in the voice inputs of a speaker
and from environmental factors. Both the volume and pace of speech can vary from one test
to another. Also, unless initially constrained, an extensive vocabulary or unstructured grammar
can affect results. Background noise must also be kept to a minimum so that a changing
environment does not divert the speaker's attention or alter the final voicing of a word or
sentence. As a result, many restrictions and clarifications have been placed on speaker and
speech recognition systems.
One such restriction involves using a closed set for speaker recognition. A closed set
implies that only speakers within the original stored set will be asked to be identified. An open
set would allow the extra possibility of a test speaker not coming from the initially trained set of
speakers, thereby requiring the system to recognize the speaker as not belonging to the original
set. An open-set system may also have the task of learning a new speaker and placing him or her
within the original set for future reference.
Another common restriction involves using a text-dependent speaker recognition system.
This type of system requires the speaker to utter a unique word or phrase to be compared
against the original set of like phrases. Text-independent recognition, which in most cases is
more complex and difficult to perform, identifies the speaker regardless of the text or phrase
spoken.
Once an utterance, or signal, has been recorded, it is usually necessary to process it to get
the voiced signal in a form that makes classification and recognition possible. Various methods
have included the use of power spectrum values, spectrum coefficients, linear predictive coding,
and a nearest neighbor distance algorithm. Tests have also shown that although spectrum
coefficients and linear predictive coding have given better results for conventional template and
statistical classification methods, power spectrum values have performed better when using
neural networks during the final recognition stages.
Various methods have also been used to perform the classification and recognition of the
processed speech signal. Statistical methods utilizing hidden Markov models, linear vector
quantizers, or classical techniques such as template matching have produced encouraging, yet
limited, success. Recent deployments using neural networks, while producing varied success
rates, have offered more options regarding the types of inputs sent to the networks, as well as
provided the ability to learn speakers in both an offline and online manner. Although
backpropagation networks have traditionally been used, more sophisticated networks, such as
the ART 2 network, have also been implemented.
ASR can be broadly classified into four types:
1. Text-independent identification
2. Text-independent verification
3. Text-dependent identification
4. Text-dependent verification
Chapter 3
SPEECH RECOGNITION
The user speaks to the computer through a microphone, which in turn identifies the
meaning of the words and sends it to an NLP device for further processing. Once recognized,
the words can be used in a variety of applications like display, robotics, commands to
computers, and dictation.
The word recognizer is a speech recognition system that identifies individual words.
Early pioneering systems could recognize only individual letters and numbers. Today, the
majority of word recognition systems are word recognizers and have more than 95% recognition
accuracy. Such systems are capable of recognizing a small vocabulary of single words or simple
phrases. One must speak the input information in clearly definable single words, with a pause
between words, in order to enter data into a computer. Continuous speech recognizers are far
more difficult to build than word recognizers. You speak complete sentences to the computer.
The input will be recognized and then processed by NLP. Such recognizers employ
sophisticated, complex techniques to deal with continuous speech, because when one speaks
continuously, most of the words slur together and it is difficult for the system to know where
one word ends and the next begins. Unlike word recognizers, the information spoken is not
recognized instantly by this system.
Figure 3.1: Block diagram of a speech recognition system. The user's voice sound is captured
by the speech recognition device and passed to NLP for understanding; the recognized output
is used for dictation, commands to the computer, and input to robots and expert systems.
After the training process, the user's spoken words will produce text; the accuracy of this
will improve with further dictation and conscientious use of the correction procedure. With a
well-trained system, around 95% of the words spoken could be correctly interpreted. The system
can be trained to identify certain words and phrases and examine the user's standard documents
in order to develop an accurate voice file for the individual.
However, there are many other factors that need to be considered in order to achieve a
high recognition rate. There is no doubt that the software works and can liberate many learners,
but the process can be far more time consuming than first time users may appreciate and the
results can often be poor. This can be very demotivating, and many users give up at this stage.
Quality support from someone who is able to show the user the most effective ways of using the
software is essential.
When using speech recognition software, the user's expectations and the advertising on
the box may well be far higher than what will realistically be achieved. "You talk and it types"
can be achieved by some people only after a great deal of perseverance and hard work.
3.2.1 Utterances
When the user says something, this is known as an utterance. An utterance is any stream
of speech between two periods of silence. Utterances are sent to the speech engine to be
processed. Silence, in speech recognition, is almost as important as what is spoken, because
silence delineates the start and end of an utterance. Here's how it works. The speech recognition
engine is "listening" for speech input. When the engine detects audio input, in other words a
lack of silence, the beginning of an utterance is signaled.
Similarly, when the engine detects a certain amount of silence following the audio, the
end of the utterance occurs. If the user doesn't say anything, the engine returns what is known
as a silence timeout: an indication that there was no speech detected within the expected
timeframe, and the application takes an appropriate action, such as reprompting the user for
input. An utterance can be a single word, or it can contain multiple words (a phrase or a
sentence).
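Because silence delineates utterances, a recognizer front end needs an endpointing step. The Python sketch below shows the idea with a simple short-time energy threshold; the frame length, threshold and silence count are illustrative values, not those of any real engine.

# Energy-based endpointing: frames above an energy threshold count as
# speech; a long enough run of silent frames ends the utterance.
import numpy as np

def detect_utterance(signal, frame_len=160, energy_thresh=0.01,
                     max_silent_frames=25):
    start, last_voiced = None, None
    for i in range(len(signal) // frame_len):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        voiced = np.mean(frame ** 2) > energy_thresh
        if voiced:
            if start is None:
                start = i                # utterance begins
            last_voiced = i
        elif start is not None and i - last_voiced > max_silent_frames:
            return start, last_voiced    # enough silence: utterance over
    if start is None:
        return None                      # silence timeout: no speech
    return start, last_voiced

sig = np.concatenate([np.zeros(8000),                       # silence
                      0.5 * np.sin(np.arange(8000) * 0.4),  # "speech"
                      np.zeros(8000)])                      # silence
print(detect_utterance(sig))  # frame numbers of utterance start and end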
3.2.2 Pronunciations
The speech recognition engine uses all sorts of data, statistical models, and algorithms to
convert spoken input into text. One piece of information that the speech recognition engine uses
to process a word is its pronunciation, which represents what the speech engine thinks a word
should sound like. Words can have multiple pronunciations associated with them. For example,
the word "the" has at least two pronunciations in U.S. English: "thee" and "thuh".
As a Voice XML application developer, you may want to provide multiple pronunciations for
certain words and phrases to allow for variations in the ways your callers may speak them.
3.2.3 Grammars
As a Voice XML application developer, you must specify the words and phrases that
users can say to your application. These words and phrases are defined to the speech recognition
engine and are used in the recognition process. You can specify the valid words and phrases in a
number of different ways, but in Voice XML, you do this by specifying a grammar. A grammar
uses a particular syntax, or set of rules, to define the words and phrases that can be recognized by
the engine. A grammar can be as simple as a list of words, or it can be flexible enough to allow
such variability in what can be said that it approaches natural language capability.
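In Voice XML itself, grammars are written in a dedicated XML syntax; as a language-neutral illustration of the simplest case described above, a grammar that is just a list of words and phrases, here is a Python stand-in. The phrases are invented for the example.

# A "grammar" as a set of allowed phrases: the engine accepts an
# utterance only if it matches an entry in the active grammar.
GRAMMAR = {"yes", "no", "operator", "check balance", "transfer funds"}

def match_grammar(utterance_text):
    phrase = utterance_text.strip().lower()
    return phrase if phrase in GRAMMAR else None  # None -> no-match event

print(match_grammar("Check Balance"))  # "check balance"
print(match_grammar("hello there"))    # None: not in the grammar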
3.2.4 Accuracy
The performance of a speech recognition system is measurable. Perhaps the most widely
used measurement is accuracy. It is typically a quantitative measurement and can be calculated
in several ways. Arguably the most important measurement of accuracy is whether the desired
end result occurred. This measurement is useful in validating application design.
Another measurement of recognition accuracy is whether the engine recognized the
utterance exactly as spoken. This measure of recognition accuracy is expressed as a percentage
and represents the number of utterances recognized correctly out of the total number of
utterances spoken. It is a useful measurement when validating grammar design.
Recognition accuracy is an important measure for all speech recognition applications. It
is tied to grammar design and to the acoustic environment of the user. You need to measure the
recognition accuracy for your application, and may want to adjust your application and its
grammars based on the results obtained when you test your application with typical users.
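The utterance-level accuracy figure described above is straightforward to compute; this small Python sketch uses invented transcripts.

# Accuracy: percentage of utterances recognized exactly as spoken.
def recognition_accuracy(spoken, recognized):
    correct = sum(s == r for s, r in zip(spoken, recognized))
    return 100.0 * correct / len(spoken)

spoken = ["check balance", "operator", "yes", "transfer funds"]
recognized = ["check balance", "operator", "no", "transfer funds"]
print(recognition_accuracy(spoken, recognized))  # 75.0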
Chapter 4
SPEAKER INDEPENDENCY
The speech quality varies from person to person. It is therefore difficult to build an
electronic system that recognizes everyone's voice. By limiting the system to the voice of a
single person, the system becomes not only simpler but also more reliable. The computer must
be trained to the voice of that particular individual. Such a system is called a speaker-dependent
system.
A speaker-independent system can be used by anybody, and can recognize any voice, even
though the characteristics vary widely from one speaker to another. Most of these systems are
costly and complex, and they have very limited vocabularies. It is important to consider the
environment in which the speech recognition system has to work. The grammar used by the
speaker and accepted by the system, noise level, noise type, position of the microphone, and
speed and manner of the user's speech are some factors that may affect the quality of the
speech recognition.
A speaker-independent system must be able to handle many different callers without having to
understand the individual voice characteristics of each caller.
Figure: Speaker identification. A voice sample is digitized by an ADC with signal conditioning;
an algorithm selects features; the distance to stored reference vectors is measured; and a
decision stage outputs the identity of the speaker.
Figure: Speaker verification. A voice sample from a person claiming an identity is digitized by
an ADC with signal conditioning; an algorithm selects features; the distance to the reference
vector of the claimed speaker is measured; and a threshold comparison decides whether the
person is verified or not.
The analogue signal, representing a spoken word, contains many individual frequencies
of various amplitudes and different phases which, when blended together, take the shape of a
complex waveform. A set of filters is used to break this complex input signal into its component
parts. Band-pass filters (BPF) pass only frequencies within a certain frequency range, rejecting
all other frequencies. Generally, about sixteen filters are used; a simple system may contain a
minimum of three filters. The greater the number of filters used, the higher the probability of
accurate recognition.
Presently, switched-capacitor digital filters are used because these can be custom-built in
integrated circuit form. These are smaller and cheaper than active filters using operational
amplifiers. The filter output is then fed to the ADC to translate the analogue signal into a digital
word. The ADC samples the filter outputs many times a second. Each sample represents a
different amplitude of the signal.
Evenly spaced vertical lines represent the amplitude of the audio filter output at the
instant of sampling. Each value is then converted to a binary number proportional to the
amplitude of the sample. A central processor unit controls the input circuits that are fed by the
ADCs. A large RAM stores all the digital values in a buffer area. This digital information,
representing the spoken word, is now accessed by the CPU to process it further. Normal speech
has a frequency range of 200 Hz to 7 kHz. Recognizing a telephone call is more difficult, as it
has a bandwidth limitation of 300 Hz to 3.3 kHz.
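The filter-bank front end described above can be imitated in software. The Python sketch below uses an FFT to measure the power in 16 bands between 200 Hz and 7 kHz, standing in for the outputs of 16 band-pass filters; the sample rate and band edges are illustrative choices.

# Emulate a 16-filter band-pass bank: power of one frame in each of
# 16 equal-width frequency bands between 200 Hz and 7 kHz.
import numpy as np

def band_energies(frame, fs=16000, n_bands=16, f_lo=200.0, f_hi=7000.0):
    spectrum = np.abs(np.fft.rfft(frame)) ** 2        # power spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    edges = np.linspace(f_lo, f_hi, n_bands + 1)      # band boundaries
    return np.array([spectrum[(freqs >= lo) & (freqs < hi)].sum()
                     for lo, hi in zip(edges[:-1], edges[1:])])

tone = np.sin(2 * np.pi * 1000 * np.arange(400) / 16000)  # 1 kHz tone
print(band_energies(tone).round(1))  # energy concentrated in one band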
As explained earlier, the spoken words are processed by the filters and ADCs. The binary
representation of each of these words becomes a template or standard against which future
words are compared. These templates are stored in the memory. Once the storing process is
completed, the system can go into its active mode and is capable of identifying spoken words.
As each word is spoken, it is converted into its binary equivalent and stored in RAM. The
computer then starts searching and compares the binary input pattern with the templates.
It is to be noted that even if the same speaker speaks the same text, there are always slight
variations in the amplitude or loudness of the signal, pitch, frequency difference, time gap, etc.
For this reason, there is never a perfect match between the template and the binary input word. The
pattern matching process therefore uses statistical techniques and is designed to look for the best
fit.
The values of the binary input words are subtracted from the corresponding values in the
templates. If both values are the same, the difference is zero and there is a perfect match. If not,
the subtraction produces some difference or error. The smaller the error, the better the match.
When the best match occurs, the word is identified and displayed on the screen or used in some
other manner.
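This subtract-and-score matching can be written directly. In the Python sketch below, words are represented by short integer vectors standing in for their binary amplitude patterns; the templates and the input are invented for illustration.

# Best-fit template matching: the template with the smallest summed
# absolute difference from the input word is taken as the match.
import numpy as np

templates = {
    "start": np.array([3, 7, 7, 2, 1]),
    "stop": np.array([6, 1, 2, 8, 4]),
}

def best_match(input_word):
    errors = {word: int(np.abs(input_word - tmpl).sum())
              for word, tmpl in templates.items()}
    return min(errors, key=errors.get), errors

word = np.array([3, 6, 8, 2, 1])  # a slightly noisy "start"
print(best_match(word))           # ('start', {'start': 2, 'stop': 23})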
The search process takes a considerable amount of time, as the CPU has to make many
comparisons before recognition occurs. This necessitates the use of very high-speed processors.
A large RAM is also required because, even though a spoken word may last only a few hundred
milliseconds, it is translated into many thousands of digital words. It is important to note that
the alignment of words and templates must be matched correctly in time before computing the
similarity score. This process, termed dynamic time warping, recognizes that different speakers
pronounce the same words at different speeds and elongate different parts of the same word.
This is important for speaker-independent recognizers.
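Dynamic time warping itself is a short dynamic program. The Python sketch below is the classic textbook formulation for one-dimensional feature sequences; real recognizers warp multi-dimensional feature vectors, so this shows only the core idea.

# Dynamic time warping: align two sequences of different speed before
# scoring their similarity. D[i][j] is the best cost of matching the
# first i samples of a against the first j samples of b.
import numpy as np

def dtw_distance(a, b):
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # best of: diagonal match, insertion, deletion
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

slow = [1, 1, 2, 2, 3, 3, 2, 2]   # same "word", spoken slowly
fast = [1, 2, 3, 2]
print(dtw_distance(slow, fast))   # 0.0: lengths differ, shapes align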
Continuous speech recognizers are far more difficult to build than word recognizers. You can
speak complete sentences to the computer. The input will be recognized and, when processed by
NLP, understood. Such recognizers employ sophisticated, complex techniques to deal with
continuous speech, because when one speaks continuously, most of the words slur together and it
is difficult for the system to know where one word ends and the other begins. Unlike word
recognizers, the information spoken is not recognized instantly by this system.
Chapter 5
Each sample represents a different amplitude of the signal. A CPU controls the input circuits
that are fed by the ADCs. A large RAM stores all the digital values in a buffer area. This digital
information, representing the spoken word, is now accessed by the CPU to process it further.
Figure 5.1: Input circuits. Band-pass filters (BPF) feed the ADCs, and the digitized speech is
stored in RAM.
Normal speech has a frequency range of 200 Hz to 7 kHz. Recognizing a telephone call is
more difficult, as it has a bandwidth limitation of 300 Hz to 3.3 kHz. As explained earlier, the
spoken words are processed by the filters and ADCs. The binary representation of each of these
words becomes a template or standard against which future words are compared. These
templates are stored in the memory. Once the storing process is completed, the system can go
into its active mode and is capable of identifying spoken words. As each word is spoken, it is
converted into its binary equivalent and stored in RAM. The computer then starts searching and
compares the binary input pattern with the templates. It is to be noted that even if the same
speaker speaks the same text, there are always slight variations in the amplitude or loudness of
the signal, pitch, frequency difference, time gap, etc. For this reason, there is never a perfect
match between the template and the binary input word. The pattern matching process therefore
uses statistical techniques and is designed to look for the best fit.
The values of the binary input words are subtracted from the corresponding values in the
templates. If both values are the same, the difference is zero and there is a perfect match. If not,
the subtraction produces some difference or error. The smaller the error, the better the match.
When the best match occurs, the word is identified and displayed on the screen or used in some
other manner.
The search process takes a considerable amount of time, as the CPU has to make many
comparisons before recognition occurs. This necessitates the use of very high-speed processors.
A large RAM is also required because, even though a spoken word may last only a few hundred
milliseconds, it is translated into many thousands of digital words. It is important to note that
the alignment of words and templates must be matched correctly in time before computing the
similarity score. This process, termed dynamic time warping, recognizes that different speakers
pronounce the same words at different speeds and elongate different parts of the same word.
This is important for speaker-independent recognizers.
Now that we've discussed some of the basic terms and concepts involved in speech
recognition, let's put them together and take a look at how the speech recognition process works.
As you can probably imagine, the speech recognition engine has a rather complex task to handle,
that of taking raw audio input and translating it to recognized text that an application
understands. The major components discussed are:
Audio input
Grammar(s)
Acoustic Model
Recognized text
The first thing we want to take a look at is the audio input coming into the recognition
engine. It is important to understand that this audio stream is rarely pristine. It contains not only
the speech data (what was said) but also background noise. This noise can interfere with the
recognition process, and the speech engine must handle (and possibly even adapt to) the
environment within which the audio is spoken. As we've discussed, it is the job of the speech
recognition engine to convert spoken input into text. To do this, it employs all sorts of data,
statistics, and software algorithms. Its first job is to process the incoming audio signal and
convert it into a format best suited for further analysis.
Once the speech data is in the proper format, the engine searches for the best match. It
does this by taking into consideration the words and phrases it knows about (the active
grammars), along with its knowledge of the environment in which it is operating (for Voice
XML, this is the telephony environment). The knowledge of the environment is provided in the
form of an acoustic model. Once it identifies the most likely match for what was said, it returns
what it recognized as a text string. Most speech engines try very hard to find a match, and are
usually very "forgiving." But it is important to note that the engine is always returning its best
guess for what was said.
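Putting the pieces together, the skeleton of the process looks roughly like the Python sketch below. The feature extraction and scoring are toy stand-ins for a real acoustic model; the grammar, templates and audio are invented for illustration.

# End-to-end skeleton: preprocess audio into features, score them
# against the active grammar, and return the engine's best guess.
import numpy as np

GRAMMAR = {                 # phrase -> illustrative feature template
    "yes": np.array([0.9, 0.1]),
    "no": np.array([0.1, 0.9]),
    "operator": np.array([0.5, 0.5]),
}

def preprocess(audio):
    # toy features: mean energy in each half of the utterance
    half = len(audio) // 2
    return np.array([np.mean(audio[:half] ** 2),
                     np.mean(audio[half:] ** 2)])

def recognize(audio):
    features = preprocess(audio)
    scores = {phrase: -np.linalg.norm(features - tmpl)
              for phrase, tmpl in GRAMMAR.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]   # always a best guess, never a certainty

audio = np.concatenate([np.full(100, 0.95), np.full(100, 0.30)])
print(recognize(audio))         # ('yes', ...)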
The engine interprets with more accuracy when it recognizes the context. The delivery can be
varied by using short phrases and single words, following the natural pattern of speech.
Chapter 6
The part of the human brain that transiently holds chunks of information and solves problems
also supports speaking and listening. Therefore, working on tough problems is best done in
quiet environments, without speaking or listening to someone. However, because physical
activity is handled in another part of the brain, problem solving is compatible with routine
physical activities like walking and driving. In short, humans speak and walk easily but find it
more difficult to speak and think at the same time.
Similarly, when operating a computer, most humans type (or move a mouse) and think, but find
it more difficult to speak and think at the same time. Hand-eye coordination is accomplished
in different brain structures, so typing or mouse movement can be performed in parallel with
problem solving. Product evaluators of an IBM dictation software package also noticed this
phenomenon. They wrote that thought, for many people, is very closely linked to language. In
keyboarding, users can continue to hone their words while their fingers output an earlier version.
In dictation, users may experience more interference between outputting their initial thought and
elaborating on it. Developers of commercial speech recognition software packages recognize
this problem and often advise dictation of full paragraphs or documents, followed by a review or
proofreading phase to correct errors.
Since speaking consumes precious cognitive resources, it is difficult to solve problems at
the same time. Proficient keyboard users can have higher levels of parallelism in problem
solving while performing data entry. This may explain why, after 30 years of ambitious attempts
to provide military pilots with speech recognition in cockpits, aircraft designers persist in using
hand-input devices and visual displays. Complex functionality is built into the pilot's joystick,
which has up to 17 functions, including pitch-roll-yaw controls, plus a rich set of buttons and
triggers. Similarly, automobile controls may have turn signals, wiper settings, and washer
buttons all built onto a single stick, and typical video camera controls may have dozens of
settings that are adjustable through knobs and switches. Rich designs for hand input can inform
users and free their minds for status monitoring and problem solving.
The interfering effects of acoustic processing are a limiting factor for designers of speech
recognition, but the role of emotive prosody raises further concerns. The human voice has
evolved remarkably well to support human-human interaction. We admire and are inspired by
passionate speeches. We are moved by grief-choked eulogies and touched by a child's calls as we
leave for work. A military commander may bark commands at troops, but there is as much
motivational force in the tone as there is information in the words. Loudly barking commands at
a computer is not likely to force it to shorten its response time or retract a dialogue box.
Promoters of affective computing, that is, recognizing, responding to, and making emotional
displays, may recommend such strategies, though this approach seems misguided.
Many users might want shorter response times without having to work themselves into a
mood of impatience. Secondly, the logic of computing requires a user response to a dialogue box
independent of the user's mood. And thirdly, the uncertainty of machine recognition could
undermine the positive effects of user control and interface predictability.
Chapter 7
APPLICATION
One of the main benefits of a speech recognition system is that it lets the user do other work
simultaneously. The user can concentrate on observation and manual operations, and still
control the machinery by voice input commands. Consider a material-handling plant where a
number of conveyors are employed to transport various grades of materials to different
destinations. Nowadays, only one operator is employed to run the plant. He has to keep watch
on various meters, gauges, indication lights, analyzers, overload devices, etc., from the central
control panel. If something goes wrong, he has to run to physically push the stop button. How
convenient it would be if a conveyor, or a number of conveyors, could be stopped automatically
by simply saying "stop".
Another major application of speech processing is in military operations. Voice control of
weapons is an example. With reliable speech recognition equipment, pilots can give commands
and information to the computers by simply speaking into their microphones; they don't have to
use their hands for this purpose. Another good example is a radiologist scanning hundreds of
X-rays, ultrasonograms, and CT scans while simultaneously dictating conclusions to a speech
recognition system connected to a word processor. The radiologist can focus his attention on
the images rather than on writing the text. Voice recognition could also be used on computers
for making airline and hotel reservations. A user simply has to state his needs: to make a
reservation, cancel a reservation, or make enquiries about schedules.
7.1 Health care
Many Electronic Medical Records (EMR) applications can be more effective and may be
performed more easily when deployed in conjunction with a speech recognition engine.
Searches, queries, and form filling may all be faster to perform by voice than by using a
keyboard.
7.2 Military
7.2.1 Fighter aircraft
The Eurofighter Typhoon currently in service with the UK RAF employs a speaker-dependent
system, i.e. it requires each pilot to create a template. The system is not used for any
safety-critical or weapon-critical tasks, such as weapon release or lowering of the undercarriage,
but is used for a wide range of other cockpit functions. Voice commands are confirmed by
visual and/or aural feedback. The system is seen as a major design feature in the reduction of
pilot workload, and even allows the pilot to assign targets to himself with two simple voice
commands, or to any of his wingmen with only five commands.
7.2.2 Helicopters
The problems of achieving high recognition accuracy under stress and noise pertain
strongly to the helicopter environment as well as to the fighter environment. The acoustic noise
problem is actually more severe in the helicopter environment, not only because of the high noise
levels but also because the helicopter pilot generally does not wear a facemask, which would
reduce acoustic noise in the microphone. Substantial test and evaluation programs have been
carried out in the past decade on applications of speech recognition systems in helicopters, notably
by the U.S. Army Avionics Research and Development Activity (AVRADA) and by the Royal
Aerospace Establishment (RAE) in the UK. Work in France has included speech recognition in
the Puma helicopter. There has also been much useful work in Canada. Results have been
encouraging, and voice applications have included: control of communication radios; setting of
navigation systems; and control of an automated target handover system.
As in fighter applications, the overriding issue for voice in helicopters is the impact on
pilot effectiveness. Encouraging results are reported for the AVRADA tests, although these
represent only a feasibility demonstration in a test environment. Much remains to be done both in
speech recognition and in overall speech recognition technology, in order to consistently achieve
performance improvements in operational settings.
7.2.3 Battle management
Battle management command centres generally require rapid access to and control of
large, rapidly changing information databases. Commanders and system operators need to query
these databases as conveniently as possible, in an eyes-busy environment where much of the
information is presented in a display format. Human-machine interaction by voice has the
potential to be very useful in these environments. A number of efforts have been undertaken to
interface commercially available isolated-word recognizers into battle management
environments. In one feasibility study, speech recognition equipment was tested in conjunction
with an integrated information display for naval battle management applications. Users were
very optimistic about the potential of the system, although capabilities were limited.
Speech understanding programs sponsored by the Defense Advanced Research Projects
Agency (DARPA) in the U.S. have focused on this problem of a natural speech interface.
Speech recognition efforts have focused on large-vocabulary continuous speech recognition
(CSR), using a database designed to be representative of the naval resource management task.
Significant advances in the state of the art in CSR have been achieved, and current efforts are
focused on integrating speech recognition and natural language processing to allow spoken
language interaction with a naval resource management system.
7.3 Training air traffic controllers
Studies have evaluated speech recognition for air traffic control (ATC) training systems,
directed at issues both in speech recognition and in the application of task-domain grammar
constraints.[4]
The USAF, USMC, US Army, and FAA are currently using ATC simulators with speech
recognition from a number of different vendors, including UFA, Inc, and Adacel Systems Inc
(ASI). This software uses speech recognition and synthetic speech to enable the trainee to control
aircraft and ground vehicles in the simulation without the need for pseudo-pilots.
Another approach to ATC simulation with speech recognition has been created by
Supremis. The Supremis system is not constrained by rigid grammars imposed by the underlying
limitations of other recognition strategies.
8. CONCLUSION
Speech recognition will revolutionize the way people conduct business over the Web and
will, ultimately, differentiate world-class e-businesses. Voice XML ties speech recognition and
telephony together and provides the technology with which businesses can develop and deploy
voice-enabled Web solutions today. These solutions can greatly expand the accessibility of
Web-based self-service transactions to customers who would otherwise not have access and, at
the same time, leverage a business's existing Web investments. Speech recognition and Voice
XML clearly represent the next wave of the Web. It is important to consider the environment in
which the speech system has to work. The grammar used by the speaker and accepted by the
system, noise level, noise type, position of the microphone, and speed and manner of the user's
speech are some factors that may affect the quality of speech recognition. Since most
recognition systems are speaker-dependent, it is necessary to train a system to recognize the
dialect of each user. During training, the computer displays a word and the user reads it aloud.
9. BIBLIOGRAPHY
1. Luger, George; Stubblefield, William (2004). Artificial Intelligence: Structures
and Strategies for Complex Problem Solving (5th ed.). Benjamin/Cummings.