
1. INTRODUCTION

One of the most important inventions of the nineteenth century was the telephone. Then, at the
midpoint of the twentieth century, the invention of the digital computer amplified the power of our minds,
enabled us to think and work more efficiently, and made us more imaginative than we could ever have
imagined. Now several new technologies have empowered us to teach computers to talk to us in our native
languages and to listen to us when we speak (recognition); haltingly, computers have begun to understand
what we say. Having given our computers both oral and aural abilities, we have been able to produce
innumerable computer applications that further enhance our productivity. Such capabilities enable us to
route phone calls automatically and to obtain and update computer-based information by telephone, using a
group of activities collectively referred to as voice processing.

SPEECH TECHNOLOGY

Three primary speech technologies are used in voice processing applications: stored speech, text-to-
speech and speech recognition. Stored speech involves the production of computer speech from an actual
human voice that is stored in a computer’s memory and used in any of several ways.

Speech can also be synthesized from plain text in a process known as text-to-speech, which also
enables voice processing applications to read from a textual database.

Speech recognition is the process of deriving either a textual transcription or some form of meaning
from a spoken input.

Speech analysis can be thought of as that part of voice processing that converts human speech to
digital forms suitable for transmission or storage by computers.

Speech synthesis functions are essentially the inverse of speech analysis – they reconvert speech
data from a digital form to one that’s similar to the original recording and suitable for playback.

Speech analysis processes can also be referred to as digital speech encoding (or simply coding),
and speech synthesis can be referred to as speech decoding.

2. EVOLUTION OF ASR METHODOLOGIES

Speech recognition research has been ongoing for more than 80 years. Over that period there have
been at least 4 generations of approaches, and a 5th generation can be forecast based on current research
themes. The 5 generations, and the technology themes associated with each of them, are as follows [5].

• Generation 1 (1930s to 1950s):
Use of ad hoc methods to recognize sounds, or small vocabularies of isolated words.
• Generation 2 (1950s to 1960s):
Use of acoustic phonetic approaches to recognize phonemes, phones, or digit vocabularies.
• Generation 3 (1960s to 1980s):
Use of pattern recognition approaches to speech recognition of small to medium-sized vocabularies
of isolated and connected word sequences, including use of linear predictive coding (LPC) as the
basic method of spectral analysis; use of LPC distance measures for pattern similarity scores; use of
dynamic programming methods for time aligning patterns; use of pattern recognition methods for
clustering multiple patterns into consistent reference patterns; use of vector quantization (VQ)
codebook methods for data reduction and reduced computation.
• Generation 4 (1980s to 2010s):
Use of Hidden Markov model (HMM) statistical methods for modelling speech dynamics and
statistics in a continuous speech recognition system; use of forward-backward and segmental K-means
training methods; use of Viterbi alignment methods; use of maximum likelihood (ML) and
various other performance criteria and methods for optimizing statistical models; introduction of
neural network (NN) methods for estimating conditional probability densities; use of adaptation
methods that modify the parameters associated with either the speech signal or the statistical model
so as to enhance the compatibility between model and data for increased recognition accuracy.
• Generation 5 (2000s to 2010s):
Use of parallel processing methods to increase recognition decision reliability; combinations of
HMMs and acoustic-phonetic approaches to detect and correct linguistic irregularities; increased
robustness for recognition of speech in noise; machine learning of optimal combinations of models.

3. ISSUES IN SPEECH RECOGNITION

As we examine the progress made in implementing speech recognition and natural language
understanding systems over the years, we will see that there are a number of issues that need to be
addressed in order to define the operating range of each speech recognition system that is built. These
issues include the following [5]:

• Speech unit for recognition: ranging from words down to syllables and finally to phonemes or even
phones. Early systems investigated all these types of units with the goal of understanding their
robustness to context, speakers and speaking environments
• Vocabulary size: ranging from small (on the order of 2–100 words), to medium (on the order of 100–1000 words), to large (anything above 1000 words, up to unlimited vocabularies). Early systems tackled primarily small-vocabulary recognition problems; modern speech recognizers are all large-vocabulary systems
• Task syntax: ranging from simple tasks with almost no syntax (every word in the vocabulary can
follow every other word) to highly complex tasks where the words follow a statistical n-gram
language model
• Task perplexity (the average word branching factor): ranging from low values (for simple tasks) to values on the order of 100 for complex tasks whose perplexity approaches that of natural language tasks (a small numerical sketch follows this list)
• Speaking mode: ranging from isolated words (or short phrases), to connected word systems (e.g.,
sequences of digits that form identification codes or telephone numbers), to continuous speech
(including both read passages and spontaneous conversational speech)
• Speaker mode: ranging from speaker-trained systems to speaker-adaptive systems to speaker
independent systems, which can be used by anyone without any additional training. Most modern
ASR systems are speaker independent and are utilized in a range of telecommunication
applications. However, for dictation purposes, most systems are still largely speaker dependent and
adapt over time to each individual speaker.
• Speaking situation: ranging from human-to-machine dialogues to human-to-human dialogues (e.g.,
as might be needed for language translation systems)
• Speaking environment: ranging from a quiet room, to noisy places (e.g., offices, airline terminals),
and even outdoors (e.g., via the use of cellphones)
• Transducer: ranging from high-quality microphones to telephones (wire line) to cellphones (mobile)
to array microphones (which track the speaker location electronically)
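To make the notion of perplexity concrete, the following minimal sketch computes the perplexity of a short word sequence under a toy bigram language model; the vocabulary and all probabilities below are invented purely for illustration, not taken from any real corpus.

```python
import math

# A toy bigram language model: P(next_word | previous_word).
# All probabilities are illustrative, not estimated from real data.
bigram_prob = {
    ("<s>", "call"): 0.5, ("<s>", "dial"): 0.5,
    ("call", "home"): 0.6, ("call", "office"): 0.4,
    ("dial", "home"): 0.7, ("dial", "office"): 0.3,
}

def perplexity(sentence):
    """Perplexity = inverse probability of the word sequence, normalized by
    its length: the average word branching factor seen by the recognizer."""
    words = ["<s>"] + sentence.split()
    log_prob = 0.0
    for prev, curr in zip(words, words[1:]):
        log_prob += math.log2(bigram_prob[(prev, curr)])
    n = len(words) - 1
    return 2 ** (-log_prob / n)

print(perplexity("call home"))   # low perplexity: few likely continuations per word
```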


4. SPEECH RECOGNITION BASICS

The following definitions are the basics needed for understanding speech recognition technology.

• Utterance
An utterance is the vocalization (speaking) of a word or words that represent a single meaning to
the computer. Utterances can be a single word, a few words, a sentence, or even multiple sentences.
• Speaker Dependency
Speaker dependent systems are designed around a specific speaker. They generally are more
accurate for the correct speaker, but much less accurate for other speakers. They assume the speaker
will speak in a consistent voice and tempo. Speaker independent systems are designed for a variety
of speakers. Adaptive systems usually start as speaker independent systems and utilize training
techniques to adapt to the speaker to increase their recognition accuracy.
• Vocabularies
Vocabularies (or dictionaries) are lists of words or utterances that can be recognized by the SR
system. Generally, smaller vocabularies are easier for a computer to recognize, while larger
vocabularies are more difficult. Unlike normal dictionaries, each entry doesn't have to be a single
word. They can be as long as a sentence or two. Smaller vocabularies can have as few as 1 or 2 recognized utterances (e.g. "Wake Up"), while very large vocabularies can have a hundred thousand or more!
• Accuracy
The ability of a recognizer can be examined by measuring its accuracy - or how well it recognizes utterances. This includes not only correctly identifying an utterance but also identifying if the spoken utterance is not in its vocabulary. Good ASR systems have an accuracy of 98% or more! The acceptable accuracy of a system really depends on the application (a simple word-error-rate sketch follows this list).
• Training
Some speech recognizers have the ability to adapt to a speaker. When the system has this ability, it
may allow training to take place. An ASR system is trained by having the speaker repeat standard
or common phrases and adjusting its comparison algorithms to match that particular speaker.
Training a recognizer usually improves its accuracy. Training can also be used by speakers that
have difficulty speaking, or pronouncing certain words. As long as the speaker can consistently
repeat an utterance, ASR systems with training should be able to adapt.
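As a concrete illustration of how accuracy is commonly measured, here is a minimal sketch that computes word error rate (WER) as the edit distance between a reference transcription and the recognizer's output, divided by the number of reference words; the example sentences are invented, and accuracy is often quoted as 1 - WER.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate via edit distance: (substitutions + insertions +
    deletions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table of edit distances.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

wer = word_error_rate("open the new document", "open a new document")
print(f"WER = {wer:.2f}, accuracy = {1 - wer:.2f}")
```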

5. TYPES OF SPEECH RECOGNITION

Speech recognition systems can be separated into several different classes by describing what types of
utterances they have the ability to recognize. These classes are based on the fact that one of the difficulties
of ASR is the ability to determine when a speaker starts and finishes an utterance. Most packages can fit
into more than one class, depending on which mode they are using.

• Isolated Words:
Isolated word recognizers usually require each utterance to have quiet (lack of an audio signal) on BOTH sides of the sample window. This doesn't mean that the system accepts only single words, but it does require a single utterance at a time. Often, these systems have "Listen/Not-Listen" states, where they require the speaker to wait between utterances (usually doing processing during the pauses). Isolated Utterance might be a better name for this class (a simple endpoint-detection sketch follows this list).
• Connected Words:
Connected word systems (or more correctly 'connected utterances') are similar to isolated word systems, but allow separate utterances to be 'run-together' with a minimal pause between them.
• Continuous Speech
Continuous recognition is the next step. Recognizers with continuous speech capabilities are some
of the most difficult to create because they must utilize special methods to determine utterance
boundaries. Continuous speech recognizers allow users to speak almost naturally, while the
computer determines the content. Basically, it's computer dictation.
• Spontaneous Speech
There appears to be a variety of definitions for what spontaneous speech actually is. At a basic
level, it can be thought of as speech that is natural sounding and not rehearsed. An ASR system
with spontaneous speech ability should be able to handle a variety of natural speech features such as
words being run together, "ums" and "ahs", and even slight stutters.
• Voice Verification/Identification
Some ASR systems have the ability to identify specific users. This document doesn't cover verification or security systems.
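As a rough sketch of how an isolated-word recognizer might locate an utterance bounded by silence, the code below applies a short-term energy threshold to a synthetic signal; the frame length and threshold are arbitrary illustrative values, not taken from any particular system.

```python
import numpy as np

def find_utterance(samples, rate, frame_ms=20, threshold=0.02):
    """Very rough energy-based endpoint detector: an isolated-word recognizer
    needs silence on both sides of the utterance, so we mark the first and
    last frame whose RMS energy exceeds a threshold."""
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.sqrt((frames ** 2).mean(axis=1))
    voiced = np.where(energy > threshold)[0]
    if voiced.size == 0:
        return None                                  # nothing but silence
    start, end = voiced[0], voiced[-1] + 1
    return start * frame_len, end * frame_len        # sample indices

# Synthetic test: 0.3 s of silence, 0.5 s of a tone, 0.3 s of silence.
rate = 8000
t = np.arange(int(0.5 * rate)) / rate
signal = np.concatenate([np.zeros(int(0.3 * rate)),
                         0.5 * np.sin(2 * np.pi * 440 * t),
                         np.zeros(int(0.3 * rate))])
print(find_utterance(signal, rate))   # approximate start and end sample indices
```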

6. SPEECH RECOGNITION

The days when you had to keep staring at the computer screen and frantically hit a key or click
the mouse for the computer to respond to your commands may soon be a thing of the past. Today you can
stretch out, relax and tell your computer to do your bidding. Speech recognition is the process of
deriving either a textual transcription or some form of meaning from a spoken input.

Speech recognition is the inverse process of synthesis: the conversion of speech to text. The speech
recognition task is complex. It involves the computer taking the user's speech and interpreting what has
been said. This allows the user to control the computer (or certain aspects of it) by voice, rather than having
to use the mouse and keyboard, or, alternatively, to simply dictate the contents of a document. This has been
made possible by ASR (Automatic Speech Recognition) technology.

ASR technology would be particularly welcomed by automated telephone exchange operators,
doctors and others who seek freedom from tiresome conventional computer operation using the keyboard
and the mouse. It is suitable for applications in which computers are used to provide routine
information and services. ASR's direct speech-to-text dictation offers a significant advantage over
traditional transcription. With further refinement of the technology, keying in text will become a thing of
the past; ASR offers a solution to this fatigue-causing procedure by converting speech into text.

ASR technology is presently capable of achieving recognition accuracies of 95%–98%, but only
under ideal conditions; the technology is still far from perfect in the uncontrolled real world. The roots of
this technology can be traced to 1968, when the term Information Technology hadn't even been coined and
Americans had only begun to realize the vast potential of computers. The Hollywood blockbuster 2001:
A Space Odyssey featured a talking, listening computer, HAL-9000, which to date is a cult
figure in both science fiction and the world of computing. Even today almost every speech recognition
technologist dreams of designing a HAL-like computer with a clear voice and the ability to understand
normal speech. Though ASR technology is still not as versatile as the imaginary HAL, it can
nevertheless be used to make life easier. New application-specific standard products, interactive error-
recovery techniques, and better voice-activated user interfaces allow the handicapped, the computer-illiterate,
and rotary-dial phone owners to talk to computers. ASR, by offering a natural human interface to
computers, finds applications in telephone call centres, such as airline flight information systems,
learning devices, toys, etc.

6.1. HOW DOES THE ASR TECHNOLOGY WORK?
When a person speaks, compressed air from the lungs is forced through the vocal tract as a sound
wave that varies as per the variations in the lung pressure and the vocal tract. This acoustic wave is
interpreted as speech when it falls upon a person’s ear. In any machine that records or transmits human
voice, the sound wave is converted into an electrical analog signal using a microphone.

Fig.6.1 Flow of speech recognition (the spoken phrase enters the computer as an electrical signal; background noise is removed and the sound amplified; words are broken up into phonemes; matching and choosing the right character combination; language analysis selects the final text)

When we speak into a telephone receiver, for instance, its microphone converts the acoustic wave
into an electrical analog signal that is transmitted through the telephone network. The electrical signal's
strength from the microphone varies in amplitude over time and is referred to as an analog signal or an
analog waveform. If the signal results from speech, it is known as a speech waveform. Speech waveforms
have the characteristic of being continuous in both time and amplitude.

A listener's ears and brain receive and process the analog speech waveforms to figure out the
speech. ASR-enabled computers, too, work on the same principle by picking up acoustic cues for speech
analysis and synthesis. Because it helps to understand the ASR technology better, let us dwell a little more
on the acoustic process of the human articulatory system. In the vocal tract the process begins at the
lungs. The variations in air pressure cause vibrations in the folds of skin that constitute the vocal cords.
The elongated orifice between the vocal cords is called the glottis. As a result of the vibrations, repeated
bursts of compressed air are released into the air as sound waves.

Articulators in the vocal tract are manipulated by the speaker to produce various effects. The vocal
cords can be stiffened or relaxed to modify the rate of vibration, or they can be turned off and the
vibration eliminated while still allowing air to pass. The velum acts as a gate between the oral and the nasal
cavities: it can be closed to isolate or opened to couple the two cavities. The tongue, jaw, teeth, and lips can
be moved to change the shape of the oral cavity.

The nature of the sound pressure wave radiating outward from the lips depends upon these time-varying
articulations and upon the absorptive qualities of the vocal tract's materials. The sound pressure wave exists
as a continually moving disturbance of air. Particles move closer together as the pressure increases or
move further apart as it decreases, each influencing its neighbor in turn as the wave propagates at the speed
of sound. The amplitude of the wave at any position distant from the speaker is measured by the density of
air molecules and grows weaker as the distance increases. When this wave falls upon the ear it is
interpreted as sound with discernible timbre, pitch, and loudness.

Air under pressure from the lungs moves through the vocal tract and comes into contact with various
obstructions, including the palate, tongue, teeth and lips. Some of its energy is absorbed by these
obstructions; most is reflected. Reflections occur in all directions, so that parts of waves bounce around
inside the cavities for some time, blending with other waves, dissipating energy and finally finding their way
out through the nostrils or past the lips.

Some waves resonate inside the tract according to their frequency and the cavity’s shape at that
moment, combining with other reflections, reinforcing the wave energy before exiting. Energy in waves of
other, non-resonant frequencies is attenuated rather than amplified in its passage through the tract.

6.2. THE SPEECH RECOGNITION PROCESS

As described above, when a person speaks, the resulting acoustic wave reaches the listener (or the
machine) as a speech waveform that is continuous in both time and amplitude. The recognition process then
proceeds through the stages described below.


Fig.6.2.Steps in speech recognition

Fig.6.3.Block diagram of steps in speech recognition

Any speech recognition system involves five major steps (a skeletal sketch of this pipeline appears at the end of this subsection):

• Converting sounds into electrical signals: when we speak into a microphone, it converts sound waves into electrical signals. In any machine that records or transmits the human voice, the sound wave is converted into an electrical signal using a microphone. When we speak into a telephone receiver, for instance, its microphone converts the acoustic wave into an electrical analog signal that is transmitted through the telephone network. The electrical signal's strength from the microphone varies in amplitude over time and is referred to as an analog signal or an analog waveform. The analog signal is then converted into a digital signal using a sound card.
• Background noise removal: the ASR program removes the noise and retains the words that you have spoken.
• Breaking up words into phonemes: the words are broken down into individual sounds, known as phonemes, which are the smallest discernible sound units. For each short slice of time, feature values are extracted from the wave; in this way the wave is divided into small parts corresponding to phonemes.
• Matching and choosing character combinations: this is the most complex phase. The program has a big dictionary of popular words that exist in the language. Each phoneme is matched against the stored sounds and converted into an appropriate character group. This is where the problem begins: the program checks and compares words that are similar in sound to what it has heard, and all these similar words are collected.
• Language analysis: here the program checks whether the language allows a particular syllable to appear after another.

After that, there will be a grammar check: the program tries to find out whether or not the combination of words makes any sense.
Finally, the recognized words are output as text. Many speech recognition programs come with their own
word processor, and some can work with other word processing packages like MS Word and WordPerfect.
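The skeleton below restates the five steps as a toy pipeline. Every stage is a deliberately simplified stand-in (the noise removal, phoneme segmentation and word matching use invented placeholder logic), intended only to show how the stages feed one another; a real recognizer implements each stage with the techniques discussed later in this report.

```python
import numpy as np

def capture(samples):
    """Stage 1: the sound card has already digitized the microphone signal."""
    return np.asarray(samples, dtype=float)

def remove_noise(samples):
    """Stage 2: crude noise removal by zeroing very low-level samples."""
    cleaned = samples.copy()
    cleaned[np.abs(cleaned) < 0.01] = 0.0
    return cleaned

def split_into_phonemes(samples, frame_len=80):
    """Stage 3: cut the wave into short frames; a real system extracts a
    feature vector per frame and maps frames to phoneme-like units."""
    n = len(samples) // frame_len
    return samples[:n * frame_len].reshape(n, frame_len)

def match_words(frames, lexicon):
    """Stage 4: score each dictionary word against the frames. Here the
    'score' is a dummy based only on utterance length."""
    return {word: -abs(len(frames) - length) for word, length in lexicon.items()}

def language_analysis(scores):
    """Stage 5: pick the best-scoring candidate; a real system would also
    apply grammar / n-gram constraints before deciding."""
    return max(scores, key=scores.get)

lexicon = {"yes": 30, "no": 20, "maybe": 50}      # expected length in frames (invented)
audio = capture(np.random.randn(2400) * 0.1)
print(language_analysis(match_words(split_into_phonemes(remove_noise(audio)), lexicon)))
```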

6.3. VARIATIONS IN SPEECH

The speech-recognition process is complicated because the production of phonemes and the
transitions between them vary from person to person and even within the same person. Different people speak
differently. Accents, regional dialects, sex, age, speech impediments, emotional state, and other factors
cause people to pronounce the same word in different ways. Phonemes are added, omitted, or substituted.
For example, the word America is pronounced in parts of New England as "Amrica". The rate of speech
also varies from person to person depending upon a person's habits and regional background.

A word or a phrase spoken by the same individual differs from moment to moment: illness,
tiredness, stress or other conditions cause subtle variations in the way a word is spoken at different times.
Also, the voice quality varies in accordance with the position of the person relative to the microphone, the
acoustic nature of the surroundings, or the quality of the recording devices. The resulting changes in the
waveform can drastically affect the performance of the recognizer.

6.4. VOCABULARIES FOR COMPUTERS

Each ASR system has an active vocabulary - a set of words from which the recognition engine
tries to make sense of an utterance - and a total vocabulary size - the total number of words in all possible
sets that can be culled from memory.

The vocabulary size and the system recognition latency - the allowable time to accurately recognize an
utterance - determine the processing horsepower required of the recognition engine.

A typical active vocabulary set comprises approximately fourteen words plus "none of the above",
which the recognizer chooses when none of the fourteen words is a good match. The recognition latency
when using a 4-MIPS processor is about 0.5 seconds for a speaker-independent set. Processing power
requirements increase dramatically for large-vocabulary recognition (LVR) sets with thousands of words;
real-time latencies with a vocabulary of a few thousand words are possible only through the use of
Pentium-class processors. A small active vocabulary limits a system's search range, providing advantages in
latency and search time, while a large total vocabulary enables a more versatile human interface but
increases system memory requirements. A system with a small active vocabulary for each prompt usually
provides faster, more accurate results; similar-sounding words in a vocabulary set cause recognition errors,
whereas a unique sound for each word enhances the recognition engine's accuracy.

6.5. WHICH SYSTEM TO CHOOSE

In choosing a speech recognition system you should consider the degree of speaker independence it
offers. Speaker independent systems can provide high recognition accuracies for a wide range of users
without needing to adapt to each user's voice. Speaker dependent systems require you to train the system to
your voice to attain high accuracy. Speaker adaptive systems, an intermediate category, are essentially
speaker-independent but can adapt their templates for each user to improve accuracy.

ADVANTAGES OF SPEAKER INDEPENDENT SYSTEM

The advantage of a speaker independent system is obvious - anyone can use the system without first
training it. However, its drawbacks are not so obvious. One limitation is the work that goes into creating
the vocabulary templates. To create reliable speaker-independent templates, someone must collect and
process numerous speech samples. This is a time-consuming task, and creating these templates is not a one-
time effort. Speaker-independent templates are language-dependent, sensitive not only to dissimilar
languages but also to differences such as those between British and American English.
Therefore, as part of your design activity, you would need to create a set of templates for each language or
major dialect that your customers use. Speaker independent systems also have a relatively fixed
vocabulary because of the difficulty of creating a new template in the field at the user's site.

ADVANTAGE OF A SPEAKER-DEPENDENT SYSTEM:

A speaker dependent system requires the user to train the ASR system by providing examples of his
own speech. Training can be a tedious process, but the system has the advantage of using templates that refer
only to the specific user and not some vague average voice. The result is language independence. You can
say ja, si, or ya during training, as long as you are consistent. The drawback is that the speaker-dependent
system must do more than simply match incoming speech to the templates. It must also include resources
to create those templates.

WHICH IS BETTER:

For a given amount of processing power, a speaker dependent system tends to provide more
accurate recognition than a speaker-independent system. This does not mean the speaker-independent
technology is inferior; the difference in performance stems from the speaker-independent templates having
to encompass wide speech variations.
6.6. TECHNIQUES IN VOGUE:

The most frequently used speech recognition technique involves template matching, in which
vocabulary words are characterized in memory as templates: time-based sequences of spectral information
taken from waveforms obtained during training.
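A minimal sketch of this idea, using dynamic time warping (DTW) to compare an incoming feature sequence against stored templates, is given below; the two-dimensional "spectral" templates and the test utterance are invented values, chosen only to show how a template library and a distance measure combine into a recognizer.

```python
import numpy as np

def dtw_distance(template, utterance):
    """Dynamic time warping: align two sequences of feature vectors that may
    differ in speaking rate, and return the accumulated distance of the best
    alignment path."""
    n, m = len(template), len(utterance)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(template[i - 1] - utterance[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],       # stretch the template
                                 cost[i, j - 1],       # stretch the utterance
                                 cost[i - 1, j - 1])   # one-to-one match
    return cost[n, m]

def recognize(utterance, templates):
    """Pick the vocabulary word whose stored template is closest."""
    return min(templates, key=lambda w: dtw_distance(templates[w], utterance))

# Toy 2-dimensional 'spectral' templates (purely illustrative values).
templates = {"yes": np.array([[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]),
             "no":  np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])}
spoken = np.array([[0.1, 0.9], [0.9, 1.1], [1.0, 1.0], [0.9, 0.1]])  # a slowly spoken "yes"
print(recognize(spoken, templates))
```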

As an alternative to template matching, feature-based designs have been used in which a time
sequence of the pertinent phonetic features is extracted from the speech waveform. Different modelling
approaches are used, but models involving state diagrams have been found to give encouraging
performance. In particular, HMMs (Hidden Markov Models) are frequently applied. With HMMs any
speech unit can be modelled, and all knowledge sources can be included in a single, integrated model.
Various types of HMMs have been implemented with differing results. Some model each word in the
vocabulary, while others model sub-word speech units.

7. HIDDEN MARKOV MODEL

A hidden Markov model can be used to model an unknown process that produces a sequence of
observable outputs at discrete intervals where the outputs are members of some finite alphabet. It might be
helpful to think of the unknown process as a black box about whose workings nothing is known except
that, at each interval, it issues one member chosen from the alphabet. These models are called "hidden"
Markov models precisely because the state sequence that produced the observable output is not known-it's
"hidden." HMMs have been found to be especially apt for modelling speech processes.

CHOICE OF SPEECH UNITS

The amount of storage required and the amount of processing time for recognition are functions of
the number of units in the inventory, so selection of the unit will have a significant impact. Another
important consideration in selecting a speech unit concerns the ability to model contextual differences.
A further consideration concerns the ease with which adequate training can be provided.

MODELING SPEECH UNITS WITH HIDDEN MARKOV MODELS

Suppose we want to design a word-based, isolated word recognizer using discrete hidden Markov
models. Each word in the vocabulary is represented by an individual HMM, each with the same number of
states. A word can be modelled as a sequence of syllables, phonemes, or other speech sounds that have a
temporal interpretation, and can best be modelled with a left-to-right HMM whose states represent the
speech sounds. Assume the longest word in the vocabulary can be represented by a 10-state HMM. So,
using a 10-state HMM like that of the figure below for each word, let's assume the states in the HMM
represent phonemes. The dotted lines in the figure are null transitions, so any state can be omitted and some
words modelled with fewer states. The duration of a phoneme is accommodated by having a state transition
returning to the same state. Thus, at a clock time, a state may return to itself and may do so at as many
clock times as required to correctly model the duration of that phoneme in the word. Except for the
beginning and end states, which represent transitions into and out of the word, each state in the word model
has a self-transition.

Assume, in our example, that the input speech waveform is coded into a string of spectral
vectors, one occurring every 10 milliseconds, and that vector quantization further transforms each spectral
vector to a single value that indexes a representative vector in the codebook. Each word in the vocabulary
will be trained through a number of repetitions by one or more talkers. As each word is trained, the
transition and output probabilities of its HMM are adjusted to merge the latest word repetition into the
model. During training, the codebook is iterated with the objective of deriving one that's optimum for the
defined vocabulary.

When an unknown spoken word is to be recognized, it's transformed into a string of codebook
indices. That string is then considered an HMM observation sequence by the recognizer, which calculates,
for each word model in the vocabulary, the probability of that HMM having generated the observations.
The word corresponding to the word model with the highest probability is selected as the one recognized
(a small sketch of this scoring step is given below).
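The scoring step can be sketched with the forward algorithm, as below: two toy three-state left-to-right word models over a four-symbol codebook (all probabilities invented) are scored against an observation string of codebook indices, and the word whose model is most likely to have produced the string is returned.

```python
import numpy as np

def forward_log_prob(obs, trans, emit, initial):
    """Forward algorithm: total log-probability that this HMM generated the
    observation sequence (a string of codebook indices)."""
    alpha = initial * emit[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]
    return np.log(alpha.sum() + 1e-300)

def recognize(obs, word_models):
    """Score the observation string against every word model and return the
    word whose HMM is most likely to have produced it."""
    return max(word_models, key=lambda w: forward_log_prob(obs, *word_models[w]))

# Two toy 3-state left-to-right word models over a 4-symbol codebook.
# Each state loops on itself (to absorb phoneme duration) or moves right.
left_right = np.array([[0.6, 0.4, 0.0],
                       [0.0, 0.6, 0.4],
                       [0.0, 0.0, 1.0]])
initial = np.array([1.0, 0.0, 0.0])
word_models = {
    "yes": (left_right, np.array([[0.7, 0.1, 0.1, 0.1],
                                  [0.1, 0.7, 0.1, 0.1],
                                  [0.1, 0.1, 0.7, 0.1]]), initial),
    "no":  (left_right, np.array([[0.1, 0.1, 0.1, 0.7],
                                  [0.1, 0.1, 0.7, 0.1],
                                  [0.7, 0.1, 0.1, 0.1]]), initial),
}
observation = [0, 0, 1, 1, 2, 2]   # codebook indices leaning towards the "yes" model
print(recognize(observation, word_models))
```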

ACOUSTIC/PHONETIC EXAMPLE USING HIDDEN MARKOV MODEL

Every speech recognition system has its own architecture. Even those that are based on HMMs have
their individual designs, but all share some basic concepts and features, many of which are recognizable
even though the names are often different. A representative block diagram is given below. The input to a
recognizer represented by Figure arrives from the left in the form of a speech waveform, and an output
word or sequence of words emanates from the recognizer to the right.

It incorporates:

(A) SPECTRAL CODING: The purpose of spectral coding is to transform the signal into digital
form embodying speech features that facilitate subsequent recognition tasks. In addition to spectral
coding, this function is sometimes called spectrum analysis, acoustic parameterization, etc. Recognizers
can work with time-domain coding, but spectrally coded parameters in the frequency domain have
advantages and are widely used-hence the title "spectral coding."
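As a rough sketch of what a spectral coding front end does, the code below slices a waveform into overlapping windowed frames and reduces each frame's magnitude spectrum to a small vector of log band energies; the frame length, hop and number of bands are illustrative choices, and practical front ends add refinements such as mel-spaced filters and cepstral coefficients.

```python
import numpy as np

def spectral_code(samples, rate, frame_ms=25, hop_ms=10, n_bands=16):
    """Bare-bones frequency-domain coder: window each frame, take its
    magnitude spectrum, and average it into a few log-energy bands, producing
    one spectral vector every hop_ms milliseconds."""
    frame_len = int(rate * frame_ms / 1000)
    hop = int(rate * hop_ms / 1000)
    window = np.hamming(frame_len)
    vectors = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame))
        bands = np.array_split(spectrum, n_bands)
        vectors.append(np.log([b.sum() + 1e-10 for b in bands]))
    return np.array(vectors)

rate = 8000
t = np.arange(rate) / rate                       # one second of a test tone
features = spectral_code(np.sin(2 * np.pi * 300 * t), rate)
print(features.shape)                            # (number of frames, n_bands)
```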


Fig.7.1.A hidden Markov model recognizer

(B) UNIT MATCHING: The objective of unit matching is to transcribe the output data stream from
the spectral coding module into a sequence of speech units. The function of this module is also referred to
as feature analysis, phonetic decoding, phonetic segmentation, phonetic processing, feature extraction, etc.

(C) LEXICAL DECODING: The function of this module is to match strings of speech units in the
unit matching module's output stream with words from the recognizer's lexicon. It outputs candidate words-
usually in the form of a word lattice containing sets of alternative word choices.

(D) SYNTACTIC, SEMANTIC, AND OTHER ANALYSES: Analyses that follow lexical
decoding all have the purpose of pruning the worst candidates passed along from the lexical decoding module
until optimal word selections can be made. Various means - and various sources of intelligence - can be
applied to this end. Acoustic information (stress, intonation, change of amplitude or pitch, relative location
of formants, etc.) obtained from the waveform can be employed, but sources of intelligence from outside
the waveform are also available. These include syntactic, semantic, and pragmatic information.
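As a small illustration of applying such outside knowledge, the sketch below scores the alternative word choices from a toy lattice with an invented bigram language model and keeps the most plausible sequence.

```python
import itertools
import math

# Alternative word choices for each position, as a lexical decoder might
# propose them; the bigram probabilities are invented for illustration.
candidates = [["the", "a"], ["male", "mail", "nail"], ["arrived", "derived"]]
bigram = {("<s>", "the"): 0.3, ("<s>", "a"): 0.2,
          ("the", "mail"): 0.05, ("the", "male"): 0.01, ("the", "nail"): 0.02,
          ("a", "mail"): 0.001, ("a", "nail"): 0.03,
          ("mail", "arrived"): 0.2, ("nail", "arrived"): 0.01}

def sequence_score(words):
    """Sum of log bigram probabilities; unseen word pairs get a heavy penalty."""
    score = 0.0
    for prev, curr in zip(["<s>"] + words, words):
        score += math.log(bigram.get((prev, curr), 1e-6))
    return score

best = max((list(seq) for seq in itertools.product(*candidates)), key=sequence_score)
print(best)   # expected: ['the', 'mail', 'arrived']
```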


8. APPLICATIONS

The specific use of speech recognition technology will depend on the application. Some target
applications that are good candidates for integrating speech recognition include:

• Games and Edutainment
Speech recognition offers game and edutainment developers the potential to bring their applications
to a new level of play. With games, for example, traditional computer-based characters could
evolve into characters that the user can actually talk to. While speech recognition enhances the
realism and fun in many computer games, it also provides a useful alternative to keyboard-based
control, and voice commands provide new freedom for the user in any sort of application, from
entertainment to office productivity.
• Data Entry
Applications that require users to keyboard paper-based data into the computer (such as database
front-ends and spreadsheets) are good candidates for a speech recognition application. Reading data
directly to the computer is much easier for most users and can significantly speed up data entry.
While speech recognition technology cannot effectively be used to enter names, it can enter
numbers or items selected from a small (less than 100 items) list. Some recognizers can even handle
spelling fairly well. If an application has fields with mutually exclusive data types (for example,
one field allows "male" or "female", another is for age, and a third is for city), the speech
recognition engine can process the command and automatically determine which field to fill in.
• Document Editing
This is a scenario in which one or both modes of speech recognition could be used to dramatically
improve productivity. Dictation would allow users to dictate entire documents without typing.
Command and control would allow users to modify formatting or change views without using the
mouse or keyboard. For example, a word processor might provide commands like "bold", "italic",
"change to Times New Roman font", "use bullet list text style," and "use 18 point type." A paint
package might have "select eraser" or "choose a wider brush."
• Command and Control
ASR systems that are designed to perform functions and actions on the system are defined as
Command and Control systems. Utterances like "Open Netscape" and "Start a new xterm" will do
just that.
• Telephony
Some PBX/Voice Mail systems allow callers to speak commands instead of pressing buttons to
send specific tones.

• Wearable devices
Because inputs are limited for wearable devices, speaking is a natural possibility.
• Medical/Disabilities
Many people have difficulty typing due to physical limitations such as repetitive strain injuries
(RSI), muscular dystrophy, and many others. For example, people with difficulty hearing could use
a system connected to their telephone to convert the caller's speech to text.
• Embedded Applications
Some newer cellular phones include C&C (command and control) speech recognition that allows utterances
such as "Call Home". This could be a major factor in the future of ASR. Why can't I talk to my
television yet?

9. LIMITATIONS OF SPEECH RECOGNITION
Each of the speech technologies of recognition and synthesis has its limitations. These limitations
or constraints on speech recognition systems focus on the idea of variability. Overcoming the tendency for
ASR systems to assign completely different labels to speech signals which a human being would judge to
be variants of the same signal has been a major stumbling block in developing the technology. The task has
been viewed as one of de-sensitising recognisers to variability. It is not entirely clear that this idea models
adequately the parallel process in human speech perception.

Human beings are extremely good at spotting similarities between input signals - whether they are speech
signals or some other kind of sensory input, like visual signals. The human being is essentially a pattern-
seeking device, attempting all the while to spot identity rather than difference.

By contrast traditional computer programming techniques make it relatively easy to spot differences, but
surprisingly difficult to spot similarity even when the variability is only slight. Much effort is being
devoted at the moment to developing techniques which can re-orientate this situation and turn the computer
into an efficient pattern spotting device.

10. MERITS

The uses of speech technology are wide ranging. Most effort at the moment centers around trying to
provide voice input and output for information systems - say, over the telephone network.

A relatively new refinement here is the provision of speech systems for accessing distributed information
of the kind presented on the Internet. The idea is to make this information available to people who do not
have, or do not want to have, access to screens and keyboards. Essentially researchers are trying to harness
the more natural use of speech as a means of direct access to systems which are more normally
associated with the technological paraphernalia of computers.

Clearly a major use of the technology is to assist people who are disadvantaged in one way or another with
respect to producing or perceiving normal speech.

The eavesdropping potential of speech technology is not sinister. It simply means the provision of, say, a
speech recognition system for providing an input to a computer when the speaker has their hands engaged
on some other task and cannot manipulate a keyboard - for example, a surgeon giving a running
commentary on what he or she is doing. Another example might be a car mechanic on his or her back
underneath a vehicle interrogating a stores computer as to the availability of a particular spare part.

CONCLUSION

Speech recognition is a truly amazing human capacity, especially when you consider that normal
conversation requires the recognition of 10 to 15 phonemes per second. It should be of little surprise then
that attempts to make machine (computer) recognition systems have proven difficult. Despite these
problems, a variety of systems are becoming available that achieve some success, usually by addressing
one or two particular aspects of speech recognition. A variety of speech synthesis systems, on the other
hand, have been available for some time now. Though limited in capabilities and generally lacking the
“natural” quality of human speech, these systems are now a common component in our lives.

BIBLIOGRAPHY

[1] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Pearson Education (Asia) Pte. Ltd., 2004.

[2] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Pearson Education (Asia) Pte. Ltd., 2004.

[4] S. Young, "HMMs and Related Speech Recognition Technologies", Part E, Ch. 27.

[5] L. Rabiner and B.-H. Juang, "Historical Perspective of the Field of ASR/NLU", Part E, Ch. 26.

[6] http://en.wikipedia.org/wiki/Speech_recognition
