
International Journal of Electrical & Computer Sciences IJECS-IJENS Vol: 10 No: 01


DESIGN AND SOFTWARE IMPLEMENTATION OF EFFICIENT SPEECH RECOGNIZER


Asim Shahzad, Romana Shahzadi, Farhan Aadil, Shahzada Khayyam Nisar, Prof. Dr. Zafrullah

University of Engineering and Technology Taxila Pakistan 47040


[asim.shahzad | romana.shahzadi | farhan.aadil | shahzada.khayyam | dr.zafrullah]@uettaxila.edu.pk

Abstract
Speech processing is one of the emerging fields in signal processing, especially because of the great revolution in cellular technologies, and it plays a great role in the design of interactive systems such as speech-to-text converters. A speech recognizer is therefore a building block in a number of systems, including speech-to-text converters, text-to-speech converters, and person recognition. We have designed and simulated an efficient speech recognizer using linear predictive coding (LPC), implemented the various steps (training phase, efficient speech models, and recognition phase) in Matlab, and obtained all the speech recognition coefficients/parameters.

Keywords: Speech recognizer, linear predictive coding (LPC), statistical pattern recognition, speech models, digital filters.

1. Introduction to Speech Recognition

Speech recognition is the process by which a computer (or other type of machine) identifies spoken words. Basically, it means talking to your computer and having it correctly recognize what you are saying. Speech recognition systems can be separated into several different classes by describing what types of utterances they have the ability to recognize [1]. These classes are based on the fact that one of the difficulties of ASR is the ability to determine when a speaker starts and finishes an utterance. Most packages can fit into more than one class, depending on which mode they are using [1].

1.2.1 Isolated Words

Isolated word recognizers usually require each utterance to have quiet (lack of an audio signal) on both sides of the sample window. This does not mean that the system accepts only single words, but it does require a single utterance at a time. Often, these systems have "Listen/Not-Listen" states, where they require the speaker to wait between utterances (usually doing processing during the pauses). "Isolated utterance" might be a better name for this class.

1.2.2 Connected Words

Connected word systems (or, more correctly, "connected utterances") are similar to isolated-word systems, but allow separate utterances to be run together with a minimal pause between them.

1.2.3 Continuous Speech

Continuous recognition is the next step. Recognizers with continuous speech capabilities are some of the most difficult to create because they must utilize special methods to determine utterance boundaries. Continuous speech recognizers allow users to speak almost naturally, while the computer determines the content. Basically, it is computer dictation [1].

1.2.4 Spontaneous Speech

There appears to be a variety of definitions for what spontaneous speech actually is. At a basic level, it can be thought of as speech that is natural sounding and not rehearsed. An ASR system with spontaneous speech ability should be able to handle a variety of natural speech features such as words being run together, "ums" and "ahs", and even slight stutters [1].

1.2.5 Voice Verification/Identification

Some ASR systems have the ability to identify specific users.

2. Approaches to Speech Recognition


There are three approaches to speech recognition, namely:

1. The acoustic-phonetic approach
2. The pattern-recognition approach
3. The artificial intelligence approach

2.1 The Acoustic-Phonetic Approach to Speech Recognition

Figure 1 shows a block diagram of the acoustic-phonetic approach to speech recognition. The first step in the processing (a step common to all approaches to speech recognition) is the speech analysis system (the so-called feature measurement method), which provides an appropriate (spectral) representation of the characteristics of the time-varying speech signal [2][4]. The most common techniques of spectral analysis are the class of filter-bank methods and the class of linear predictive coding (LPC) methods. The next step in the processing is the feature-detection stage. The idea here is to convert the spectral measurements into a set of features that describe the broad acoustic properties of the different phonetic units. The third step in the procedure is the segmentation and labeling phase, whereby the system tries to find stable regions (where the features change very little over the region) and then to label each segmented region according to how well the features within that region match those of individual phonetic units [2]. The result of the segmentation and labeling step is usually a phoneme lattice, from which a lexical access procedure determines the best matching word or sequence of words. Other types of lattices (e.g., syllable, word) can also be derived by integrating vocabulary and syntax constraints into the control strategy.

2.2 The Statistical Pattern Recognition Approach

A block diagram of a canonic pattern-recognition approach to speech recognition is shown in Figure 2. The pattern-recognition paradigm has four steps, listed below.

Figure 1: Block diagram of the acoustic-phonetic approach to speech recognition [2][3].

Figure 2: Block diagram of a pattern-recognition speech recognizer [3].


1. Feature measurement, in which a sequence of measurements is made on the input signal to define the "test pattern". For speech signals the feature measurements are usually the output of some type of spectral analysis technique, such as a filter-bank analyzer, a linear predictive coding analysis, or a discrete Fourier transform (DFT) analysis [3].

2. Pattern training, in which one or more test patterns corresponding to speech sounds of the same class are used to create a pattern representative of the features of that class. The resulting pattern, generally called a reference pattern, can be an exemplar or template derived from some type of averaging technique, or it can be a model that characterizes the statistics of the features of the reference pattern [3].

3. Pattern classification, in which the unknown test pattern is compared with each (sound) class reference pattern and a measure of similarity (distance) between the test pattern and each reference pattern is computed. To compare speech patterns (which consist of a sequence of spectral vectors), we require both a local distance measure, in which local distance is defined as the spectral "distance" between two well-defined spectral vectors, and a global time-alignment procedure (often called a dynamic time warping algorithm), which compensates for the different rates of speaking (time scales) of the two patterns [3][4]; a sketch of such an alignment appears after this list.

4. Decision logic, in which the reference pattern similarity scores are used to decide which reference pattern (or possibly which sequence of reference patterns) best matches the unknown test pattern [3].
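To make step 3 concrete, the following Python sketch (an added illustration; the paper's own implementation was done in Matlab) computes a Euclidean local distance between spectral vectors and a classic dynamic-time-warping alignment. The feature matrices ref and test, their sizes, and the random data are assumptions made only for the example.

```python
import numpy as np

def dtw_distance(ref, test):
    """Global alignment cost between two feature sequences (rows = frames,
    columns = spectral coefficients) using dynamic time warping with a
    Euclidean local distance between frames."""
    n, m = len(ref), len(test)
    # local (frame-to-frame) spectral distances, shape (n, m)
    local = np.linalg.norm(ref[:, None, :] - test[None, :, :], axis=2)
    # accumulated cost matrix with the usual three-way recursion
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = local[i - 1, j - 1] + min(acc[i - 1, j],      # insertion
                                                  acc[i, j - 1],      # deletion
                                                  acc[i - 1, j - 1])  # match
    return acc[n, m]

# example: compare a test pattern against two hypothetical reference patterns
rng = np.random.default_rng(0)
ref_a = rng.normal(size=(40, 12))            # reference pattern "A"
ref_b = rng.normal(size=(55, 12))            # reference pattern "B"
test = ref_a + 0.1 * rng.normal(size=ref_a.shape)
print(min((dtw_distance(r, test), name) for r, name in [(ref_a, "A"), (ref_b, "B")]))
```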


2.3 Artificial Intelligence (AI) Approaches to Speech Recognition

The basic idea of the AI approach to speech recognition is to compile and incorporate knowledge from a variety of knowledge sources and to bring it to bear on the problem at hand. Thus, for example, the AI approach to segmentation and labeling would be to augment the generally used acoustic knowledge with phonemic knowledge, lexical knowledge, syntactic knowledge, semantic knowledge, and even pragmatic knowledge [3]. To be more specific, we first define these knowledge sources:

1) Acoustic knowledge: evidence of which sounds (predefined phonetic units) are spoken, on the basis of spectral measurements and the presence or absence of features.
2) Lexical knowledge: the combination of acoustic evidence so as to postulate words, as specified by a lexicon that maps sounds into words (or, equivalently, decomposes words into sounds).
3) Syntactic knowledge: the combination of words to form grammatically correct strings (according to a language model), such as sentences or phrases.
4) Semantic knowledge: understanding of the task domain so as to be able to validate sentences (or phrases) that are consistent with the task being performed, or with previously decoded sentences.
5) Pragmatic knowledge: the inference ability necessary to resolve ambiguity of meaning based on the ways in which words are generally used.

To illustrate the correcting and constraining power of these knowledge sources, consider the following sentences [3]:

1. Go to the refrigerator and get me a book.
2. The bears killed the rams.
3. Power plants colorless happily old.
4. Good ideas often run when least expected.

The first sentence is syntactically meaningful but semantically inconsistent. The second sentence can be interpreted in at least two pragmatically different ways, depending on whether the context is an event in a jungle or the description of a football game between two teams called the "Bears" and the "Rams." The third sentence is syntactically unacceptable and semantically meaningless. The fourth sentence is semantically inconsistent and can trivially be corrected by changing the word "run" to "come", a slight phonetic difference [3].

The word-correcting capability of higher-level knowledge sources is illustrated in Figure 3, which shows the word error probability of a recognizer both with and without syntactic constraints, as a function of a "deviation" parameter sigma. As the deviation parameter gets larger, the word error probability increases in both cases; however, without syntax the word error probability rapidly approaches 1.0, whereas with syntax it increases only gradually with the noise parameter.

There are several ways to integrate knowledge sources within a speech recognizer. Perhaps the most standard approach is the "bottom-up" processor (Figure 4), in which the lowest-level processes (e.g., feature detection, phonetic decoding) precede higher-level processes (lexical decoding, language model) in a sequential manner, so as to constrain each stage of the processing as little as possible. An alternative is the so-called "top-down" processor, in which the language model generates word hypotheses that are matched against the speech signal, and syntactically and semantically meaningful sentences are built up on the basis of the word match scores. Figure 5 shows a system that is often implemented in the top-down mode by integrating the unit matching, lexical decoding, and syntactic analysis modules into a consistent framework. (This type of system is discussed extensively in the literature on large-vocabulary continuous-speech recognition [3].) A third alternative is the so-called blackboard approach, illustrated in Figure 6.


In this approach, all knowledge sources (KSs) are considered independent; a hypothesize-and-test paradigm serves as the basic medium of communication among the KSs; each KS is data driven, activated by the occurrence of patterns on the blackboard that match the templates specified by the KS; the system activity operates asynchronously; and assigned costs and utilities, together with an overall ratings policy, are used to combine and propagate ratings across all levels.

Figure 3: Illustration of the word-correcting capability of syntax in speech recognition [3].

Figure 4: A bottom-up approach to knowledge integration for speech recognition [3].

Figure 5: A top-down approach to knowledge integration for speech recognition [3].

Figure 6: A blackboard approach to knowledge integration for speech recognition [3][4].

3. Linear Predictive Coding Model for Speech Recognition


LPC provides a good model of the speech signal. This is especially true for the quasi-steady-state voiced regions of speech, in which the all-pole model of LPC gives a good approximation to the vocal tract spectral envelope. During unvoiced and transient regions of speech, the LPC model is less effective than for voiced regions, but it still provides an acceptably useful model for speech-recognition purposes. Furthermore, the way in which LPC is applied to the analysis of speech signals leads to a reasonable source-vocal tract separation. As a result, a parsimonious representation of the vocal tract characteristics becomes possible.

3.1 The LPC Model

The basic idea behind the LPC model is that a given speech sample at time n, s(n), can be approximated as a linear combination of the past p speech samples, such that

s(n) \approx a_1 s(n-1) + a_2 s(n-2) + \cdots + a_p s(n-p),   (1)

where the coefficients a_1, a_2, ..., a_p are assumed constant over the speech analysis frame. We convert Eq. (1) to an equality by including an excitation term, G u(n), giving

s(n) = \sum_{i=1}^{p} a_i s(n-i) + G u(n),   (2)

where u(n) is a normalized excitation and G is the gain of the excitation. Expressing Eq. (2) in the z-domain gives the relation

S(z) = \sum_{i=1}^{p} a_i z^{-i} S(z) + G U(z),   (3)

leading to the transfer function

H(z) = S(z) / (G U(z)) = 1 / (1 - \sum_{i=1}^{p} a_i z^{-i}) = 1 / A(z).
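As an added illustration of Eqs. (1)-(3) (not the authors' Matlab code), the following Python sketch estimates the predictor coefficients a_i for one analysis frame by the autocorrelation method and the Levinson-Durbin recursion; the order p = 10, the Hamming window, and the synthetic test frame are assumptions.

```python
import numpy as np

def lpc_coefficients(frame, p=10):
    """Autocorrelation-method LPC analysis of a single frame.
    Returns the predictor coefficients a_1..a_p of Eq. (2) and the final
    prediction-error energy (whose square root is often used for the gain G)."""
    x = frame * np.hamming(len(frame))
    # autocorrelation r[0..p]
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(p + 1)])
    a = np.zeros(p)                      # predictor coefficients a_1..a_p
    err = r[0]
    for i in range(p):                   # Levinson-Durbin recursion
        k = (r[i + 1] - np.dot(a[:i], r[i:0:-1])) / err   # reflection coefficient
        a_new = a.copy()
        a_new[i] = k
        a_new[:i] = a[:i] - k * a[:i][::-1]
        a, err = a_new, err * (1.0 - k * k)
    return a, err

# usage on a synthetic 30 ms frame at 16 kHz (480 samples)
fs = 16000
t = np.arange(480) / fs
frame = np.sin(2 * np.pi * 150 * t) + 0.05 * np.random.randn(480)
a, err = lpc_coefficients(frame, p=10)
print("a_i:", np.round(a, 3), " gain G ~", np.sqrt(err))
```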



Figure 7: Linear predictive model of speech[2][3]

Figure 7 interprets the above equation: the normalized excitation source u(n) is scaled by the gain G and acts as input to the all-pole system H(z) = 1/A(z) to produce the speech signal s(n). Based on our knowledge that the actual excitation function for speech is essentially either a quasi-periodic pulse train (for voiced speech sounds) or a random noise source (for unvoiced sounds), the appropriate synthesis model for speech corresponding to the LPC analysis is shown in Figure 8. Here the normalized excitation source is chosen by a switch whose position is controlled by the voiced/unvoiced character of the speech: either a quasi-periodic train of pulses is used as the excitation for voiced sounds, or a random noise sequence for unvoiced sounds. The appropriate gain G of the source is estimated from the speech signal, and the scaled source is used as the input to a digital filter H(z), which is controlled by the vocal-tract parameters characteristic of the speech being produced. Thus the parameters of this model are the voiced/unvoiced classification, the pitch period for voiced sounds, the gain parameter, and the coefficients {a_k} of the digital filter. These parameters all vary slowly with time.
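The synthesis model described above can be sketched in a few lines of Python (again an added illustration, not the paper's implementation); the formant-like pole positions, the gain G, and the 125 Hz pitch are assumed values.

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000
n = 4800                                     # 300 ms of synthetic "speech"

# all-pole vocal-tract filter 1/A(z) built from two hypothetical
# formant-like resonances (poles inside the unit circle, hence stable)
poles = np.concatenate([0.97 * np.exp(2j * np.pi * f / fs * np.array([1, -1]))
                        for f in (500.0, 1500.0)])
A = np.poly(poles).real                      # A(z); the predictor coefficients are a_k = -A[1:]
G = 0.05                                     # gain, normally estimated from the analysis frame

voiced = True                                # voiced/unvoiced switch of the synthesis model
if voiced:
    u = np.zeros(n)
    u[::int(0.008 * fs)] = 1.0               # quasi-periodic pulse train, ~125 Hz pitch
else:
    u = np.random.randn(n)                   # random noise excitation for unvoiced sounds

s = lfilter([1.0], A, G * u)                 # s(n) = sum_k a_k s(n-k) + G u(n)
print(s[:8])
```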

4. Simulation and Results

4.1 Basic Principles of ASR

All ASR systems operate in two phases. First is a training phase, during which the system learns the reference patterns representing the different speech sounds (e.g., phrases, words, phones) that constitute the vocabulary of the application. Each reference is learned from spoken examples and stored either in the form of templates obtained by some averaging method or as models that characterize the statistical properties of the pattern. Second is a recognition phase, during which an unknown input pattern is identified by comparing it with the set of references.

Figure 8: Speech synthesis model based on the LPC model [4].

Figure 9: Basic principle of the speech recognizer.

Most ASR systems consist of three major modules: a signal processing front-end, acoustic modeling, and language modeling. The signal processing front-end transforms the speech signal into a sequence of feature vectors to be used for classification. Generally, this representation has a considerably lower information rate than the original speech waveform [3]. The recognizer matches the sequence of observations with subword models, or with word models built from subword models using a predefined word lexicon. The recognized word or subword units are then used to construct the whole sentence with the help of language models. It is common for a continuous speech recognition system to recognize the whole sentence by first constructing a finite-state network, with or without a grammar-based language model, to represent the possible sentence model.

The following figure shows our input speech signal.

Figure 10: Input Speech Signal.

4.1.1 Edge Detection


4.1.1.1 Using the Energy of the Speech Signal

Short-time energy is a simple short-time speech measurement. It is defined as

E_n = \sum_{m} [x(m) w(n-m)]^2.   (4)

This measurement can, in a way, distinguish between voiced and unvoiced speech segments, since unvoiced speech has significantly smaller short-time energy. A practical choice for the window length is 10-20 ms, i.e., 160-320 samples at a 16 kHz sampling frequency. This way the window includes a suitable number of pitch periods, so that the result is neither too smooth nor too detailed.

Figure 11: Energy of the speech signal.
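A minimal NumPy sketch of Eq. (4), added for illustration; the 20 ms frame length, 10 ms hop, and Hamming weighting are assumed values.

```python
import numpy as np

def short_time_energy(x, frame_len=320, hop=160):
    """E_n of Eq. (4): sum of squared, windowed samples per frame
    (20 ms frames with a 10 ms hop at 16 kHz)."""
    w = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.array([np.sum((x[i * hop:i * hop + frame_len] * w) ** 2)
                     for i in range(n_frames)])

# silence-like noise followed by a louder voiced-like segment
x = np.concatenate([0.01 * np.random.randn(8000),
                    np.sin(2 * np.pi * 200 * np.arange(8000) / 16000)])
e = short_time_energy(x)
print(e[:3], e[-3:])       # energy is much larger in the second half
```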
4.1.1.2 Using the Average Zero-Crossing Rate (AZCR) of the Speech Signal

The average zero-crossing rate is defined as

Z_n = \sum_{m} |sgn[x(m)] - sgn[x(m-1)]| \, w(n-m), \quad w(n) = 1/(2N), \; 0 \le n \le N-1.   (5)

This measure allows discrimination between voiced and unvoiced regions of speech, or between speech and silence, since unvoiced speech in general has a higher zero-crossing rate. The signals in the graphs are normalized.

Figure 12: Average zero-crossing rate of the speech signal.
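Similarly, a rough sketch of the average zero-crossing rate of Eq. (5); the frame parameters are the same assumed values as above.

```python
import numpy as np

def zero_crossing_rate(x, frame_len=320, hop=160):
    """Z_n of Eq. (5): average number of sign changes per sample in each frame
    (the weight w(n) = 1/(2N) turns the sum of |sgn differences| into an average)."""
    s = np.sign(x)
    s[s == 0] = 1                      # treat exact zeros as positive
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.array([np.sum(np.abs(np.diff(s[i * hop:i * hop + frame_len])))
                     / (2.0 * frame_len) for i in range(n_frames)])

# unvoiced-like noise has a much higher rate than a voiced-like tone
noise = np.random.randn(8000)
tone = np.sin(2 * np.pi * 200 * np.arange(8000) / 16000)
print(zero_crossing_rate(noise).mean(), zero_crossing_rate(tone).mean())
```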

Figure 13: Signal after removing noise.

4.1.1.3 Signal Processing Front-End and Parameterization

There are many variations of front-end processing for speech recognition; however, they can be summarized in a block diagram as shown below.



4.1.1.3.a Pre-emphasis

The speech is first pre-emphasized with a pre-emphasis filter, which is motivated by the speech production model. From the production model of voiced speech, there is an overall decay of about -6 dB/octave (-12 dB/octave due to the excitation source and +6 dB/octave due to the lip radiation) in the speech radiated from the lips as frequency increases. The spectrum of the speech is flattened by a pre-emphasis filter of the form 1 - a z^{-1}, as shown below. While this filter does its job for voiced speech, it causes a +12 dB/octave rise for unvoiced speech [2].

Figure 14 (a, b): Signal after pre-emphasis; pre-emphasis filter curve.
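For illustration, the pre-emphasis filter 1 - a z^{-1} is a single difference equation; the coefficient a = 0.95 below is an assumed, typical value.

```python
import numpy as np
from scipy.signal import lfilter

a = 0.95                              # typical pre-emphasis coefficient (assumed)
x = np.random.randn(16000)            # stand-in for one second of speech at 16 kHz
y = lfilter([1.0, -a], [1.0], x)      # y(n) = x(n) - a*x(n-1), i.e. the filter 1 - a z^-1
```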

4.1.1.3.b Blocking and Windowing

To extract the short-time features of a speech signal, we need to block the speech into short segments called frames. The duration of each frame varies from 10 to 30 ms [4]. The speech belonging to each frame is assumed to be stationary. To reduce the edge effect of each segment, we apply a smoothing window (e.g., a Hamming window) to each frame. We allow successive frames to overlap each other so that we can generate a smoother feature set over time.
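A short sketch of this blocking and windowing step (added for illustration; the 25 ms frame length and 10 ms hop are assumed values):

```python
import numpy as np

def frame_signal(x, fs=16000, frame_ms=25, hop_ms=10):
    """Block the speech into overlapping frames and apply a Hamming window."""
    frame_len, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
    n_frames = 1 + max(0, len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)      # shape: (n_frames, frame_len)

frames = frame_signal(np.random.randn(16000))
print(frames.shape)
```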
4.1.1.3.c Temporal Feature Extraction

Extracting features of speech from each frame in the time domain has the advantages of simplicity, quick calculation, and easy physical interpretation. These temporal features include short-time average energy and amplitude, short-time zero-crossing rate, short-time autocorrelation, pitch period, root mean square (RMS), maximum amplitude, voicing quality, the difference between the maximum and minimum values in the positive and negative halves of the signal, and autocorrelation peaks.
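As an added example of one such temporal feature, a rough pitch-period estimate can be read off the short-time autocorrelation of a voiced frame; the search range and the synthetic 125 Hz test frame below are assumptions.

```python
import numpy as np

def pitch_period(frame, fs=16000, fmin=66, fmax=500):
    """Rough pitch-period estimate (in samples) from the highest autocorrelation
    peak of a voiced frame, searched between fmin and fmax."""
    x = frame - frame.mean()
    r = np.correlate(x, x, mode="full")[len(x) - 1:]   # r[k] for lags k >= 0
    lo, hi = int(fs / fmax), int(fs / fmin)
    return lo + int(np.argmax(r[lo:hi]))

fs = 16000
frame = np.sin(2 * np.pi * 125 * np.arange(480) / fs)  # 125 Hz tone -> ~128 samples
print(pitch_period(frame, fs))
```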
4.1.1.3.d Spectral Analysis

Spectral analysis is the most important module in front-end processing, since most of the useful parameters in speech processing are found in the frequency domain. There are several ways to extract spectral information from speech, including FIR filter-banks and linear predictive coding (LPC). Many variations of these techniques have been developed, including the mel-scale FFT filter-bank and the perceptual linear predictive (PLP) front-end. Recently, auditory models have received the attention of many researchers and have proved their robustness in adverse environments [4].
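A minimal sketch of FFT-based spectral analysis on the windowed frames (added for illustration; the 512-point FFT is an assumed value, and a mel-scale filter-bank or LPC analysis would normally be applied on top of this):

```python
import numpy as np

def frame_power_spectrum(frames, n_fft=512):
    """Power spectrum of each windowed frame; rows are frames,
    columns are the n_fft/2 + 1 non-negative frequency bins."""
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2

spec = frame_power_spectrum(np.random.randn(98, 400) * np.hamming(400))
print(spec.shape)        # (98, 257)
```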


4.1.1.3.e Features in Frequency Domain and Feature Enhancement


After the spectral information is obtained, we can process it further by extracting useful features. Features in the frequency domain may include the difference between peak and valley, the energy in a particular frequency region, the spectral gradient, and the spectral variation contour. They are used more frequently in knowledge-based ASR.


In most auditory models, the model of neural transduction serves to enhance speech [4].
4.1.1.3.f Reduction of Information Rate and Feature Dimension


The information rate is how frequently we generate a feature vector. The reduction of the information rate is done either during the blocking and windowing modules, which represent the short-time analysis, or after the necessary features are extracted. In a filter-bank design, the information rate is reduced at the output of the filter-bank or at the output of an auditory model. In general, the output is half-wave rectified, low-passed, and decimated. An information rate of 100 Hz (i.e., 100 feature vectors per second) is generally adequate for speech [4]. The output of most spectral feature extraction techniques has a dimension of 30 to 40. The dimension of these vectors can be further reduced by cepstral analysis or principal component analysis. For LPC parameterization, the number of LPC coefficients generated from each frame varies from 8 to 14, so reduction of the feature dimension is not necessary. However, most systems convert the LPC coefficients to cepstral coefficients because cepstral coefficients are more stable.
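For illustration, one commonly used recursion for converting the predictor coefficients of Eq. (2) into LPC cepstral coefficients is sketched below; the example coefficients and the number of cepstral coefficients are assumptions.

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps=12):
    """Convert predictor coefficients a_1..a_p (Eq. (2) convention) into
    LPC cepstral coefficients c_1..c_n via the standard recursion
    c_n = a_n + sum_{k=1}^{n-1} (k/n) c_k a_{n-k}, with a_k = 0 for k > p."""
    p = len(a)
    a_ext = np.concatenate([a, np.zeros(max(0, n_ceps - p))])
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        acc = a_ext[n - 1] if n <= p else 0.0
        for k in range(1, n):
            acc += (k / n) * c[k - 1] * a_ext[n - k - 1]
        c[n - 1] = acc
    return c

a = np.array([1.2, -0.6, 0.2])       # hypothetical LPC coefficients
print(np.round(lpc_to_cepstrum(a, 8), 3))
```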

4.1.1.4 Speech Recognition Techniques

The pattern recognition approach requires no explicit knowledge of speech. This approach has two steps, namely training of speech patterns based on some generic spectral parameter set, and recognition of patterns via pattern comparison. Popular pattern recognition techniques include template matching and artificial neural networks (ANN).
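As an added sketch of template matching (not the paper's implementation), the fragment below picks the closest stored reference pattern; the crude time-averaged distance is an assumption, and in practice the DTW alignment sketched in Section 2.2 would be used for the comparison.

```python
import numpy as np

def template_distance(template, pattern):
    """Very crude pattern comparison: Euclidean distance between the
    time-averaged feature vectors (a DTW alignment, as sketched earlier,
    would normally replace this)."""
    return np.linalg.norm(template.mean(axis=0) - pattern.mean(axis=0))

def recognize(pattern, templates):
    """Return the label of the closest stored reference template."""
    return min(templates, key=lambda label: template_distance(templates[label], pattern))

rng = np.random.default_rng(1)
templates = {"yes": rng.normal(0.0, 1.0, (30, 12)),   # hypothetical trained templates
             "no":  rng.normal(0.5, 1.0, (25, 12))}
test = templates["no"] + 0.1 * rng.normal(size=(25, 12))
print(recognize(test, templates))
```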
Conclusion and Future Work

In this research paper we have discussed the basic phenomena behind speech recognition, discussed the various approaches used for speech recognition, modeled and simulated the best-suited speech recognition approach, and obtained the complete set of speech coefficients/parameters. In the next phase we will perform pattern recognition based on an ideal speech database.

References:

[1] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition.
[2] L. Rabiner and R. W. Schafer, Digital Processing of Speech Signals.
[3] F. Jelinek, Statistical Methods for Speech Recognition.
[4] L. Rabiner, Speech Recognition Tutorials.
Mr. Asim Shahzad is pursuing his PhD at the University of Engineering & Technology (UET) Taxila, Pakistan. He completed his MS in Computer Engineering at UET Taxila and also holds an MS degree in Telecommunication Engineering from the Institute of Communication Technologies, Islamabad, Pakistan. He is working as Assistant Professor in the Department of Telecommunication Engineering, UET Taxila. His areas of interest are optical communication and data and network security.

Mr. Farhan Aadil Malik is a student of MS Software Engineering at UET Taxila. He earned his Bachelor's degree in Computer Sciences from Allama Iqbal Open University, Islamabad, Pakistan, in 2005 with distinction and is currently working as a Programmer at UET Taxila. His current areas of interest are ad hoc networks and data communication systems and security.

Mr. Shahzada Khayyam Nisar completed his graduation in Computer Sciences from Allama Iqbal Open University, Islamabad, Pakistan, in 2004 with honors and was awarded the Allama Iqbal Award by the university. He is also an HEC scholarship holder. Currently he is a student of MS Software Engineering at UET Taxila. He has more than 5 years of teaching experience and is presently working as a Programmer at UET Taxila. His areas of interest are vehicular ad hoc networks and data communication systems and security.

