

A Software Tool for Introducing Speech Coding Fundamentals in a DSP Course


Andreas Spanias, Senior Member, IEEE, and Edward M. Painter, Student Member, IEEE
Abstract- An educational software tool on speech coding is presented.¹ Portions of this program are used in our senior-level DSP (digital signal processing) class at Arizona State University to expose undergraduate students to speech coding and to present speech analysis/synthesis as an application paradigm for many fundamental DSP concepts. The simulation software provides an interactive environment that allows users to investigate and understand speech coding algorithms for a variety of input speech records. Time- and frequency-domain representations of input and reconstructed speech can be graphically displayed and played back on a PC equipped with a standard 16-bit sound card. The program has been developed for use in the MATLAB environment and includes implementations of the FS-1015 LPC-10e, the FS-1016 CELP, the ETSI GSM, the IS-54 VSELP, the G.721 ADPCM, and the G.728 LD-CELP speech coding algorithms, integrated under a common graphical interface.

I. INTRODUCTION
A. Speech Coding in the DSP Course: Motivation and Rationale

SPEECH coding is an application area in signal processing concerned with obtaining compact representations of speech signals for efficient transmission or storage. This requires analysis and modeling of digital speech signals, which are usually represented by a compact set of quantized filter, excitation, and spectrum parameters. As such, speech coding uses many fundamental signal processing tools and concepts which are taught in undergraduate DSP (digital signal processing) classes, and hence it can be used as an application paradigm for demonstrating the utility of such tools. An exposition to speech coding as an application in an undergraduate DSP course is also motivated by the emergence of new computer and mobile communication applications that require young electrical engineers to have some fundamental speech processing knowledge in the context of DSP. This is because many of the graduating engineers seeking employment in the communication and computer industry are likely to face difficult implementation tasks involving speech processing algorithms. In most electrical engineering programs the fundamentals of speech processing are not covered at the undergraduate level. At Arizona State University, we recently started to introduce speech coding towards the end of the senior-level four-credit DSP course.
¹MATLAB is a trademark of The MathWorks, Inc. Manuscript received June 14, 1995; revised March 6, 1996. The authors are with the Department of Electrical Engineering, Arizona State University, Tempe, AZ 85287-7206 USA. Publisher Item Identifier S 0018-9359(96)04258-6.

In particular, we devoted three lectures, two homework assignments, and one computer project to address this important application area. As part of this effort, we developed an educational simulation program in MATLAB that can be used to provide knowledge of speech coding algorithms and demonstrate the utility of several important DSP concepts. In this paper, we describe this educational software tool and we give sample simulations that can be used to assist undergraduate students in understanding speech coding algorithms. Before we present the software tool, we give a brief description of our DSP course and then describe how the speech coding material and computer simulation program fit in with the rest of the syllabus.

Our DSP course starts with a review of continuous-time linear systems and transforms. We then bridge into discrete time by covering the sampling theorem, and we proceed with lectures on discrete-time system fundamentals (linearity, causality, convolution, difference equations, etc.) followed by the z-transform, and FIR and IIR filter design. We then introduce the discrete-time Fourier series and transform, followed by the FFT and its applications. Finally, we cover some aspects of random signal processing (mean, autocorrelation, PSD) and conclude with descriptions of DSP applications as well as an exposition to advanced topics in signal processing. It is in the latter part of the course that we introduce speech coding as an application that brings together: digital filtering concepts, random signal processing, autocorrelation and PSD estimation, handling of nonstationarities, windowing, quantization of filter coefficients, estimation of periodicity, and exposition to time-varying signal modeling. During this part we also provide a brief tutorial on the various speech coding standards, and we explain the role of the signal processing algorithms in cellular communications and in multimedia applications.
B. The Utility of the Speech Coding Simulation Program

The simulation programs are written for speech coding algorithms that are based on a source-system model (Fig. 1). This model is consistent with the human speech production system in the sense that typical voiced speech (e.g., a steady vowel) can be attributed to periodic glottal pulses (generated by the vibrating vocal cords) exciting the vocal tract, and typical unvoiced sounds (e.g., /s/) are due to random-like turbulence let out through a narrow opening in the tract. The engineering source-system model for speech synthesis produces voiced speech using a periodic signal exciting a properly chosen digital filter, and unvoiced speech by exciting a properly estimated digital filter with random excitation.



Fig. 1. The source-system voice synthesis model used in speech coding algorithms.
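As a concrete illustration, the following minimal MATLAB sketch generates a voiced and an unvoiced segment with this two-state model. All numerical values (pitch period, gain, filter coefficients) are arbitrary illustrative assumptions, not parameters from any of the standards discussed later.

```matlab
% Two-state source-system synthesis (Fig. 1); all values illustrative.
fs    = 8000;                 % sampling rate (Hz)
N     = 1600;                 % 200 ms of synthetic output
pitch = 80;                   % assumed pitch period in samples (100 Hz)
G     = 0.1;                  % excitation gain
a     = [1 -1.3 0.9];         % assumed all-pole synthesis filter, 1 - A(z)

excV = zeros(N,1);            % voiced: periodic impulse train
excV(1:pitch:N) = 1;
voiced = filter(G, a, excV);  % impulse train through the vocal tract filter

excU = randn(N,1);            % unvoiced: random (noise) excitation
unvoiced = filter(G, a, excU);

soundsc([voiced; unvoiced], fs);  % play both synthetic segments
```

Listening to the two segments makes the voiced/unvoiced distinction of the model immediately audible.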

Several signal processing concepts can be reinforced with the voice synthesis process used in speech coding, particularly when presented in an interactive software environment. The software simulation enables students to experiment with speech algorithms using a variety of input signals, examine graphical representations of critical analysis/synthesis parameters on a frame-by-frame basis, play back reconstructed output speech, and compare the quality of output speech associated with the different coding algorithms. The graphical user interface allows time- and frequency-domain examination of input speech, output speech, and model parameters for each frame, including system poles, zeros, and excitation sequences. Graphical outputs may provide information to students about underlying algorithm mechanisms and signal processing functions. Simulations have been coded in an expository style for easy understanding and to provide a template for implementation on other systems. Our program provides working examples and important details often omitted in the general literature.

We have chosen the MATLAB programming environment because it offers several advantages over traditional programming languages. First, it enables students to generate a variety of signal and parameter plots, experiment with the effects of channel noise and network tandeming, and modify algorithm parameters with minimal effort in an environment where files are easily accessible. Second, MATLAB code is quite compact, thereby simplifying algorithm understanding, and it provides a powerful set of functions supporting filtering as well as scalar, vector, and matrix operations. Third, MATLAB is being used in many academic institutions to support linear systems and DSP courses, and students are able to experiment with the algorithms and quickly assess perceptual quality since MATLAB supports several sound boards. Finally, the MATLAB speech coders will run on a variety of host computers, i.e., PCs, Macintoshes, UNIX workstations, and mainframes.

The simulation program is useful not only to undergraduate and graduate students in academic environments but also to engineers in industry concerned with the understanding and implementation of speech coding algorithms. Implementation engineers often rely on brief descriptions of the algorithms as captured in the standards. This presents a problem, since novice readers usually find a certain degree of obscurity in the documents describing the standards. Although surveys of state-of-the-art speech coders and standards have appeared in the speech coding literature [1], detailed algorithmic descriptions and implementation details are often proprietary.

Even public-domain software for standardized algorithms is usually optimized for efficiency or fixed-point realization on specific architectures, and is therefore not suitable for educational purposes. The software described in the next few sections of this paper can be used for evaluating, simulating, and understanding standardized speech coding algorithms. Several coders based on linear prediction have been implemented. These include the 2.4 kbit/s FS-1015 LPC-10e [2], the 4.8 kbit/s FS-1016 CELP [3], the 8 kbit/s IS-54 VSELP [4], the 13 kbit/s ETSI GSM [5], the 16 kbit/s G.728 LD-CELP [6], and the 32 kbit/s G.721 ADPCM [7]. These programs provide a unified exposition to the algorithms by bringing them together into a common simulation framework under the MATLAB computing environment. In addition, a unified user-friendly interface has been developed for all the algorithms to facilitate evaluation, record/playback, experimentation, graphical analysis, and understanding. Moreover, they offer a standard by which to calibrate new implementations. Finally, with their consistent software architectures and user interfaces, these programs provide a unified framework for comparisons.

C. Paper Organization
In the next section of the paper we provide a brief survey of some of the important speech coding algorithms and standards, with major emphasis on those that have been implemented in the simulation tool. The section focuses on linear predictive coding (LPC), starting with ADPCM and continuing with open-loop and closed-loop (analysis-by-synthesis) LPC. Section III describes the MATLAB speech coding simulation software, and Section IV gives sample exercises using the simulation programs. Finally, Section V presents concluding remarks.

II. SPEECH CODING ALGORITHMS


In this section, we briefly describe the underlying speech coding methods associated with the algorithms implemented in the MATLAB simulation software. In our presentation, we avoid detailed mathematical descriptions and choose to provide simple engineering explanations of the parametric analysis-synthesis models. For a more detailed survey of the algorithms we refer the reader to [1]. Most of the well-known speech coding standards are based on linear prediction (LP), a process in which the present sample of speech is represented by a linear combination of past samples. The linear predictor can be viewed as an inverse filter that generates parameters for an all-pole filter that models the vocal tract (Fig. 2).


Fig. 2. Linear prediction analysis filter.

The LP analysis filter has the transfer function

$$P(z) = 1 - A(z) \qquad (1)$$

and the LP synthesis filter is given by

$$H(z) = \frac{1}{1 - A(z)} \qquad (2)$$

where

$$A(z) = \sum_{i=1}^{p} a_i z^{-i} \qquad (3)$$

and p is the predictor order, z is the complex variable of the z-transform, and a_i are the predictor coefficients. The analysis filter P(z) is realized as an FIR filter driven by input speech samples. The prediction coefficients are obtained by minimizing the mean square of its output, i.e., by minimizing the prediction residual. The prediction residual can be expressed in the time domain as
$$e(n) = s(n) - a_1 s(n-1) - a_2 s(n-2) - \cdots - a_p s(n-p) \qquad (4)$$

where s(n) is the input speech and a_i are the predictor coefficients. Thus, the present speech sample is predicted by a linear combination of previous speech samples. The form of the linear predictor shown in Fig. 2 is called a short-term predictor in the speech processing literature because it is associated with short delays. In speech coding applications, with very few exceptions, the short-term predictor order is typically 10-15. Long-term predictors are typically used to predict periodicities in speech and usually involve long-term memory of the order of 20-147 samples. The long-term predictor is of the form

$$H_L(z) = \frac{1}{1 - A_L(z)} \qquad (5)$$

where

$$A_L(z) = \sum_{i=-j}^{j} b_i z^{-(\tau + i)} \qquad (6)$$

and τ is the delay, usually associated with the pitch period.

Fig. 3. ADPCM (a) transmitter and (b) receiver.

LP is employed in a variety of speech coding algorithms, ranging from ADPCM to two-state excited LPC and the more recent and successful analysis-by-synthesis LPC methods. Linear prediction is used in ADPCM to predict the current input sample and produce a prediction error of low variance. Since the prediction residual is quantized, a reduced-variance quantizer

input results in reduced quantization noise. For speech sampled at 8 kHz, the linear predictor is typically updated every 20-30 ms. The ADPCM coder (Fig. 3) does not rely implicitly or explicitly on a speech reproduction model and is essentially a waveform coder, i.e., it can also handle nonspeech signals. ADPCM is used in the 32 kbit/s ITU G.726 standard (formerly CCITT G.721). The latter algorithm is implemented in the simulation software. From the DSP educational point of view, upon examining the ADPCM algorithm, students can see how time-varying filtering can be used to track and compactly represent the short-term statistics of speech, and hence exploit the redundancy in the speech signal for coding purposes. On the other hand, issues associated with quantization noise, both on the residual signal and on the filter parameters, can be highlighted.
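The closed-loop prediction principle at the heart of ADPCM can be sketched in a few lines. The toy routine below uses a fixed first-order predictor and a uniform quantizer, both assumed for illustration only (G.726 adapts the predictor and the quantizer step size); the point is that the prediction is formed from the reconstructed signal, so encoder and decoder stay synchronized.

```matlab
function [sq, r] = toy_dpcm(s)
% TOY_DPCM  Illustrative DPCM loop: fixed first-order predictor
% and uniform quantizer (both assumptions; not the G.726 design).
  a1   = 0.9;                  % assumed predictor coefficient
  step = 0.01;                 % assumed quantizer step size
  N    = numel(s);
  sq   = zeros(N,1);           % reconstructed speech
  r    = zeros(N,1);           % quantized prediction residual
  shat = 0;                    % prediction of the current sample
  for n = 1:N
    e     = s(n) - shat;           % prediction residual
    r(n)  = step*round(e/step);    % uniform quantization
    sq(n) = shat + r(n);           % decoder-side reconstruction
    shat  = a1*sq(n);              % predict from the reconstructed
  end                              % signal, keeping both ends in sync
end
```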


Fig. 4. Classical LPC (a) analysis and (b) synthesis.

The classical two-state excited LPC (Fig. 4) is distinctly different from ADPCM in that it uses an analysis-synthesis scheme that relies explicitly on the engineering speech reproduction model shown in Fig. 1. Speech in classical LPC is typically processed frame-by-frame, and for each frame a 10th-order LP polynomial is estimated by solving a set of linear equations involving autocorrelation estimates (see the sketch below). In each frame, a voicing decision (voiced/unvoiced) is made to determine whether the excitation will be periodic or random. A pitch estimator is used to determine periodicity in voiced speech. Classical LPC uses open-loop analysis to determine the excitation parameters (voicing, pitch), i.e., the excitation parameters are determined without explicitly accounting for their effect on the quality of the synthesis. This class of algorithms is very efficient in terms of bit rates; however, speech reconstructed from coded LP parameters is often perceived as synthetic (mechanical). The 2.4 kbit/s LPC-10 algorithm of the Department of Defense became a federal standard in the early eighties, and a MATLAB implementation of this algorithm is part of the software presented. From the DSP educational point of view, an exposition to this algorithm highlights several practical issues, such as LP coefficient and periodicity (pitch) estimation from short data records, stability issues with the LPC synthesis filter, reflection coefficients, lattice structures, etc.

A class of LPC algorithms that uses closed-loop analysis emerged in the early eighties. These coders (Fig. 5) are called closed-loop or analysis-by-synthesis because they determine or form an excitation sequence for LPC synthesis by explicitly minimizing the reconstruction error, i.e., the difference between the original and reconstructed speech. These algorithms include a long-term (pitch) predictor in addition to the short-term linear predictor. In addition, these coders incorporate mechanisms for optimizing the performance of the coder such that the masking properties of the human ear are exploited. This perceptual enhancement is achieved by implicitly filtering the quantization noise in such a way that it is masked by the high-energy formants. There are three forms of excitation in analysis-by-synthesis LPC, i.e., multipulse excitation, regular-pulse excitation, and vector or code excitation. The associated algorithms are called multipulse LPC (MLPC), regular-pulse excited LPC (RPE-LPC), and code excited linear prediction (CELP), respectively.
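A minimal version of the autocorrelation-based analysis is sketched below: the frame is windowed, autocorrelation estimates are formed, and the resulting normal equations are solved with the Levinson-Durbin recursion. The function name, the Hamming window, and the absence of any quantization are our simplifications.

```matlab
function [a, k] = lp_autocorr(frame, p)
% LP_AUTOCORR  LP analysis of one frame by the autocorrelation
% method, solved via Levinson-Durbin (sketch; no quantization).
  n = numel(frame);
  w = 0.54 - 0.46*cos(2*pi*(0:n-1)'/(n-1));   % Hamming window
  x = frame(:) .* w;
  r = zeros(p+1,1);                           % autocorrelation estimates
  for i = 0:p
    r(i+1) = x(1:n-i)' * x(1+i:n);
  end
  a = zeros(p,1);  k = zeros(p,1);  E = r(1);
  for m = 1:p                                 % Levinson-Durbin recursion
    k(m) = (r(m+1) - a(1:m-1)'*r(m:-1:2))/E;  % reflection coefficient
    a(1:m-1) = a(1:m-1) - k(m)*a(m-1:-1:1);
    a(m) = k(m);
    E = (1 - k(m)^2)*E;                       % prediction error energy
  end
end
```

The recursion also yields the reflection coefficients k mentioned in the LPC-10 discussion above, and the residual follows from the analysis filter as e = filter([1; -a], 1, frame).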

The MLPC uses an excitation that consists of multiple irregular pulses that are chosen one-by-one such that the reconstruction error is minimized. RPE-LPC is a special case of MLPC in which the pulses are chosen to be equidistant (appearing at regular intervals). CELP is the most recent of these techniques and uses vector quantization to encode the excitation. The idea in CELP is to select and transmit the excitation vector from a bank of representative excitation sequences (a codebook) such that the filtered reconstruction error is minimized in the MSE sense. Since these waveforms (vectors) are predetermined, they can be stored both at the transmitter and the receiver, and hence only their address needs to be transmitted.

As far as standards are concerned, several analysis-by-synthesis LPC algorithms became, or are in the process of becoming, part of national and international communications standards. Some of the well-known standards are listed below.

1) A 13 kbit/s coding scheme that uses RPE with long-term prediction (LTP) was adopted for the full-rate GSM Pan-European digital mobile standard.

2) A 4.8 kbit/s CELP algorithm has been adopted by the Department of Defense for possible use in the third-generation secure telephone unit (STU-III). This algorithm is described in the Federal Standard 1016 (FS-1016) and was jointly developed by the DoD and AT&T Bell Labs.

3) An 8 kbit/s vector sum excited linear prediction (VSELP) algorithm was proposed by Gerson and Jasiuk and is part of the North American Digital Cellular System. VSELP algorithms operating at lower rates have also been proposed for the new half-rate standards, and a 6.7 kbit/s VSELP algorithm was adopted for the Japanese digital cellular standard.

4) A low-delay CELP (LD-CELP) coder is part of the CCITT G.728 standard. The low-delay CELP coder achieves low one-way delay by using a backward-adaptive predictor and short excitation vectors (5 samples). The LD-CELP algorithm does not utilize LTP; instead, the order of the short-term predictor is increased to 50.


Fig. 5. Analysis-by-synthesis LPC.

All of these algorithms are part of our MATLAB simulation. There is a plethora of educational experiences to be gained by performing simulations with these algorithms. Students are not only exposed to new cellular and multimedia technologies, but also to issues like long-term prediction, vector quantization, perceptual filtering, etc.

III. MATLAB CODEC SIMULATIONS

A. Overview

The simulation programs will execute on a 386-compatible PC with a math co-processor running MATLAB for Windows, with 4 MB of RAM and sufficient hard-drive (H/D) swap and data space. For recording and listening, a 16-bit sound card, a microphone, and amplified speakers or a headset are needed. We have found that 486DX-66 machines (or better) with at least 8 MB of RAM will execute much of the software reasonably fast. Simulations accept original input samples from .WAV input files, run analysis at the transmitter, transmit parameters through a simulated channel, run synthesis at the receiver, and then generate .WAV output files. Speech files contain 16-bit linear PCM data, sampled at 8 kHz. One can use this interactive simulation to analyze signals on a frame-by-frame basis during speech analysis and synthesis.
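A skeleton of this flow, in present-day MATLAB, might look as follows; audioread/audiowrite stand in for the WAV file facilities of the time, and lpc_analysis/lpc_synthesis are hypothetical placeholders for a coder's analysis and synthesis routines.

```matlab
% Frame-by-frame codec simulation skeleton (placeholder functions).
[s, fs] = audioread('input.wav');    % 16-bit linear PCM, fs = 8000 Hz
L  = 160;                            % 20 ms frame at 8 kHz
nf = floor(numel(s)/L);              % number of whole frames
out = zeros(nf*L, 1);
for f = 1:nf
  idx    = (f-1)*L+1 : f*L;
  params = lpc_analysis(s(idx));     % analysis at the transmitter
  % (a simulated channel could corrupt params here)
  out(idx) = lpc_synthesis(params);  % synthesis at the receiver
end
audiowrite('output.wav', out, fs);
```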
B. Time- and Frequency-Domain Viewing Windows

A time-domain viewing window allows visual comparison of the PCM input speech waveform against the reconstructed output speech waveform [Fig. 6(a)]. Dashed vertical lines indicate subframe boundaries. Comparing these plots, the user can appreciate the differences in waveform matching behavior between a hybrid algorithm such as CELP and a vocoder such as LPC-10e. Comparisons are enhanced by a facility which allows visual examination of the error between input and reconstructed speech. For all algorithms, time-domain windows indicate the current frame number and the total frames to be processed. A frequency-domain viewing window is also available [Fig. 6(b)], allowing comparison of magnitude spectra between input and reconstructed output speech. Magnitude spectral estimates are generated using a 1024-point FFT. The LPC envelope, corresponding to the quantized predictor coefficients received by the decoder, is superimposed on both plots. This window illustrates spectral matching properties. For example, users may observe that a vocoder method like LPC-10e exhibits reasonable spectral matching, despite its poor temporal waveform matching. The frequency-domain window also allows display of spectral error.

In all LPC coding methods, short-term spectral characteristics are captured in an all-pole synthesis filter. The differences in the excitation models for the different algorithms can be observed using a viewing window that allows observation of excitation sequences in time and frequency [Fig. 6(c)]. After observing several voiced LPC-10e excitations, for example, it becomes evident that the basic glottal pulse shape does not change. Observing GSM excitations clarifies the concept of RPE, in which each frame of regularly spaced pulses has distinctly different amplitude patterns than its predecessor. One can observe that RPE excitations achieve performance gains relative to the simplistic two-state model used in LPC-10e. In CELP [Fig. 6(c)], random vectors have been combined with lag search vectors to obtain an optimal excitation.

The all-pole LPC synthesis filter can be represented in several forms. Since potential users, and particularly students with an interest in DSP, are essentially familiar with filter transfer functions, we have elected to present the pole locations of the decoder's LPC synthesis filter through a z-plane view [Fig. 6(d)]. This window also allows pole trajectory tracking and animated playback, and provides information about formant locations. When tracking is enabled, colored line segments trace the movements of selected complex conjugate pole pairs.
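The frequency-domain display can be approximated with a few lines of MATLAB: the 1024-point FFT magnitude of a frame with the LP envelope superimposed. Here frame and the coefficient vector a are assumed to come from an LP analysis such as the sketch in Section II, and the envelope's gain normalization is arbitrary.

```matlab
% Frame spectrum with superimposed LP envelope (sketch).
Nfft = 1024;  fs = 8000;
fax = (0:Nfft/2-1)*fs/Nfft;                     % frequency axis (Hz)
S   = 20*log10(abs(fft(frame(:), Nfft)) + eps); % magnitude spectrum (dB)
A   = fft([1; -a(:)], Nfft);                    % analysis filter 1 - A(z)
env = -20*log10(abs(A) + eps);                  % LP envelope, 1/|1 - A|
plot(fax, S(1:Nfft/2), fax, env(1:Nfft/2), '--');
xlabel('Frequency (Hz)'); ylabel('Magnitude (dB)');
legend('speech frame', 'LPC envelope');
```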
C. Objective Quality Measures and Speech File Viewing Utility

Many objective quality measures have been proposed to quantify coding distortion [8]. Our simulations incorporate both spectral and temporal distortion measures in a quality display window which can be accessed from any of the other viewing windows.


Fig. 6. Viewing windows (CELP): (a) time-domain speech, (b) frequency-domain speech, (c) excitation, and (d) z-plane.

Our tools also include a frame-by-frame speech file viewing utility. This program generates several waveform displays, including time-domain, LPC envelope, and spectral (FFT-based) magnitude and phase. The viewing utility allows random access to any frame in a speech file and accommodates variable frame sizes. This utility also generates three-dimensional spectrograms for a speech file using either FFT magnitude or LP-based spectral analysis methods.
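A minimal FFT-based version of such a spectrogram is sketched below; the frame length, FFT size, and surface rendering are our assumptions, and s is a speech vector as in the skeleton of Section III-A.

```matlab
% FFT-magnitude spectrogram rendered as a 3-D surface (sketch).
L = 160;  Nfft = 512;  fs = 8000;
w  = 0.54 - 0.46*cos(2*pi*(0:L-1)'/(L-1));  % Hamming window
nf = floor(numel(s)/L);
Sg = zeros(Nfft/2, nf);
for k = 1:nf
  x = s((k-1)*L+1 : k*L) .* w;              % windowed frame
  X = abs(fft(x, Nfft));
  Sg(:,k) = 20*log10(X(1:Nfft/2) + eps);    % dB magnitude
end
t  = (0:nf-1)*L/fs;
fr = (0:Nfft/2-1)*fs/Nfft;
surf(t, fr, Sg, 'EdgeColor', 'none');       % three-dimensional view
xlabel('Time (s)'); ylabel('Frequency (Hz)'); zlabel('Magnitude (dB)');
```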
IV. MATLAB SIMULATION EXERCISES

A. CELP Closed-Loop Excitation Optimization (Codebook Search)

Excitation optimization in CELP involves (in most cases) exhaustively searching two vector codebooks. The codebooks are searched sequentially, adaptive first and then stochastic. Excitation vectors (gain-shape VQ) are chosen to minimize a perceptually weighted error measure [9]. By operating in MATLAB, we enable users to modify the CELP source code with relative ease and produce illustrations of codebook search procedures. In Fig. 7(a)-(c), we show three candidate adaptive codebook vectors, corresponding to the minimum, median, and maximum match scores, obtained from a 256-vector search space. Using the excitation sequences shown in Fig. 7(a)-(c), we can synthesize and evaluate speech waveforms as shown in Fig. 7(d)-(f), respectively. Each output record has been plotted with the original input speech to allow comparisons. Signal-to-noise ratios (SNRs) are also provided to give an objective measure of performance. From Fig. 7, we observe that higher match scores correspond to higher quality excitations and higher SNR. Fig. 7(a)-(c), however, represents only the adaptive excitation components. A stochastic vector is also used during CELP synthesis. The stochastic codebook search takes place after the adaptive search; therefore the optimal choice of a stochastic vector is affected by the adaptive vector selection. Fig. 8(a)-(c) shows the optimal stochastic vectors (maximum match score) selected for combination with the minimum, median, and maximum match score adaptive codebook vectors, respectively. Fig. 8(d)-(f) shows the output speech produced when these are combined with their adaptive counterparts of Fig. 7(a)-(c).


Fig. 7. Adaptive excitation vectors associated with (a) minimum, (b) median, and (c) maximum match score, and (d)-(f) corresponding output speech.

The SNRs shown in Fig. 7(d)-(f) and Fig. 8(d)-(f) illustrate the improvement in speech quality attained by adding a stochastic vector. We are also able to observe the sparse (77% zeros), ternary-valued (−1, 0, +1) nature of the stochastic vectors. By developing plots like those presented in Figs. 7 and 8, users are able to observe the correspondence between match scores and excitation quality. Furthermore, they gain knowledge of the nature of the excitation signals derived from the different codebooks.

Fig. 8. Stochastic excitation vectors selected for combination with (a) minimum, (b) median, and (c) maximum match score adaptive excitation sequences, and (d)-(f) corresponding output speech.
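The core of such a sequential search can be sketched compactly. The routine below scores each candidate with the normalized cross-correlation criterion (x'y)^2/(y'y) against a target x after zero-state filtering through the weighted synthesis filter; the names are ours, and a real CELP search would also handle the zero-input response and gain quantization.

```matlab
function [best, gain, score] = cb_search(CB, x, a, gamma)
% CB_SEARCH  Minimal codebook search sketch. Each column of CB is
% filtered (zero state) through 1/(1 - A(z/gamma)) and scored.
  p  = numel(a);
  aw = a(:) .* (gamma.^(1:p)');            % bandwidth-expanded coefficients
  best = 0;  gain = 0;  score = -Inf;
  for kk = 1:size(CB,2)
    y   = filter(1, [1; -aw], CB(:,kk));   % weighted synthesis of candidate
    num = (x(:)'*y)^2;                     % match score numerator
    den = y'*y;                            % energy of filtered candidate
    if num/den > score                     % keep the best match score
      score = num/den;
      gain  = (x(:)'*y)/den;               % optimal gain for this vector
      best  = kk;                          % index sent to the decoder
    end
  end
end
```

Running the search twice, first on the adaptive and then on the stochastic codebook (with the target updated in between), mirrors the sequential procedure described above.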


B. CELP Perceptual Weighting Filter

CELP codebook search procedures minimize a perceptually weighted error signal.


Weighting is achieved through an IIR filter, which modifies the error spectrum to exploit the masking properties of the ear. In particular, CELP algorithms exploit the fact that humans have a limited ability to detect small errors in frequency bands where the speech signal has high energy, such as the formant regions [10]. Therefore, the CELP weighting filter de-emphasizes the formant regions in the error spectrum. The transfer function of the weighting filter is of the form

$$W(z) = \frac{1 - A(z)}{1 - A(z/\gamma)} = \frac{1 - \sum_{i=1}^{p} a_i z^{-i}}{1 - \sum_{i=1}^{p} a_i \gamma^i z^{-i}} \qquad (7)$$

Fig. 9. CELP perceptual weighting filter: (a) poles/zeros and (b) magnitude frequency response with LP synthesis envelope.

The parameter γ expands the formant bandwidths by moving the poles radially inward toward the center of the unit circle. This coefficient usually takes on values between 0.8 and 0.9. Our software enables users to choose a frame of speech and then develop pole/zero and frequency response plots for the PWF (Fig. 9). Users may also process speech records with and without the weighting filter. Comparing the output records provides insight into the net effects of the weighting filter. One can observe that subjective speech quality (informal listening) improves with the filter, despite the drop in SNR. Thus, in addition to weighting filter behavior, this exercise also demonstrates that SNR is a poor measure of subjective quality.
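Given a frame's predictor coefficients a and an error signal e, (7) reduces to a few lines of MATLAB; the value of γ and the freqz display are illustrative.

```matlab
% Perceptual weighting filter of (7), as a sketch.
gamma = 0.85;                                 % typical value, 0.8-0.9
num = [1; -a(:)];                             % numerator, 1 - A(z)
den = [1; -a(:).*(gamma.^(1:numel(a))')];     % denominator, 1 - A(z/gamma)
ew  = filter(num, den, e);                    % perceptually weighted error
freqz(num, den, 512, 8000);                   % response as in Fig. 9(b)
```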
C. CELP Robustness in the Presence of Channel Errors and Tandem Encodings

Codec bit streams in wireless applications are subjected to channel errors, which are characterized in terms of bit error rate (BER). Coding algorithms should tolerate bit errors with minimal perceptual degradation. Our simulations are equipped with BER controls that enable users to compare and contrast error tolerances between the different algorithms. The family of curves in Fig. 10(a) illustrates spectral and excitation parametric error sensitivities measured at BERs of 0.1%, 0.5%, 1%, 5%, and 10%. For each curve, bits for the specified parameter are randomly corrupted, while the remaining parameters are left undisturbed. The results of this test suggest that CELP SSNR is most sensitive to perturbations in the stochastic CB gain, while it is least sensitive to LSP errors. Potential users can also employ our tools to perform subjective evaluations. They will find that LSP errors introduce whistling and squeaking effects, pitch lag errors create noticeable roughness, and gain errors lead to irritating clicks and pops.
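The BER control itself amounts to corrupting a parameter's bits with independent flips, as in the following helper (ours, not the tool's actual interface):

```matlab
function noisy = corrupt_bits(bits, ber)
% CORRUPT_BITS  Flip each bit of a 0/1 vector independently with
% probability ber (illustrative helper).
  flips = rand(size(bits)) < ber;   % Bernoulli error pattern
  noisy = xor(bits, flips);         % flip the selected bits
end
```

Applying such a corruption to one parameter's bits while leaving the others untouched reproduces the per-parameter sensitivity test of Fig. 10(a).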

Fig. 10. (a) SSNR penalties associated with CELP channel errors and (b) SSNR penalties associated with CELP tandem encodings.

In addition to channel errors, a robust coding algorithm must also tolerate tandem encodings without excessive compromises in output quality.


Our coding simulations enable users to examine algorithmic responses to multiple synchronous tandems. The CELP tandeming results given in Fig. 10(b) were obtained using five-stage synchronous tandem configurations (T0-T5) to process phonetically balanced TIMIT speech records, spoken by 15 males and 15 females. Objective figures of merit are reported in terms of SNR, SSNR, and CD (cepstral distance). SSNR is measured after segmenting speech records into 20-ms frames (160 samples) and averaging the individual SNRs (see the sketch below). The cepstral distance was obtained by averaging segmental CDs over the same set of 20-ms frames. The tandeming scores shown in Fig. 10(b) were obtained after processing 5700 frames at each of six tandem nodes (T0-T5). The dynamic range for CELP SNRs is 8 dB, with worst-case performance at T5 reaching a −0.3 dB minimum. The preceding exercises represent the testing capabilities of our software. Other beneficial topics for experiments include comparisons of performance with different input signals/speakers, examination of parametric variations and performance tradeoffs, and evaluations of algorithmic robustness to acoustic background noise.
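The segmental SNR used above can be computed with the following minimal routine; the energy threshold that skips near-silent frames is our assumption.

```matlab
function ssnr = seg_snr(s, sq, L)
% SEG_SNR  Average of per-frame SNRs in dB over L-sample frames
% (default 160, i.e., 20 ms at 8 kHz); s original, sq coded output.
  if nargin < 3, L = 160; end
  nf = floor(numel(s)/L);
  snrs = zeros(nf,1);  m = 0;
  for k = 1:nf
    idx = (k-1)*L+1 : k*L;
    Es  = sum(s(idx).^2);               % frame signal energy
    En  = sum((s(idx) - sq(idx)).^2);   % frame error energy
    if Es > eps && En > eps             % skip near-silent frames
      m = m + 1;
      snrs(m) = 10*log10(Es/En);        % per-frame SNR in dB
    end
  end
  ssnr = mean(snrs(1:m));
end
```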


V. CONCLUSION
We have presented new educational speech coding simulation software developed at Arizona State University to supplement our DSP and speech coding courses with hands-on experiments. We have also described a laboratory hardware/software environment and outlined simulation exercises. We are continuing to enhance this software. Future releases will incorporate additional coding standards, such as the G.722 subband coder [11]. Also in the future, we hope to port the software to real-time hardware.

Andreas Spanias (S'85-M'88-SM'95) received the Ph.D. degree in electrical engineering from West Virginia University in 1988. He joined Arizona State University (ASU) in 1988 as an Assistant Professor in the Department of Electrical Engineering. He is now an Associate Professor in the same department. His research interests are in the areas of adaptive signal processing, speech processing, spectral analysis, and mobile communications. While at ASU, he has developed and taught courses in digital and adaptive signal processing. He has also developed and taught short courses in digital signal processing and speech processing to support continuing education. He has been the Principal Investigator on research contracts with Intel Corporation, Motorola Inc., Sandia National Labs, and Active Noise and Vibration Technologies. He has also consulted with Inter-Tel Communications and the Cyprus Institute of Neurology and Genetics. Dr. Spanias is an Associate Editor of the IEEE TRANSACTIONS ON SIGNAL PROCESSING, a member of the Statistical Signal and Array Processing Technical Committee of the IEEE Signal Processing Society, and a member of the Digital Signal Processing Technical Committee of the IEEE Circuits and Systems Society. He is also co-chair of the 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-99), president of the combined IEEE Communications and Signal Processing Chapter at Phoenix, and a member of Eta Kappa Nu and Sigma Xi.

REFERENCES
[1] A. S. Spanias, "Speech coding: A tutorial review," Proc. IEEE, vol. 82, pp. 1541-1582, Oct. 1994.
[2] T. E. Tremain, "The government standard linear predictive coding algorithm: LPC-10," Speech Technology, pp. 40-49, Apr. 1982.
[3] J. P. Campbell et al., "The proposed federal standard 1016 4800 bps voice coder: CELP," Speech Technology, vol. 5, pp. 58-64, Apr./May 1990.
[4] I. Gerson and M. Jasiuk, "Vector sum excited linear prediction (VSELP) speech coding at 8 kbits/s," in Proc. ICASSP-90, New Mexico, Apr. 1990, pp. 461-464.
[5] P. Vary et al., "Speech codec for the European mobile radio system," in Proc. ICASSP-88, Apr. 1988, p. 227.
[6] J. Chen et al., "A low-delay CELP coder for the CCITT 16 kb/s speech coding standard," IEEE J. Select. Areas Commun. (Special Issue on Speech and Image Coding), N. Hubing, Ed., vol. 10, pp. 830-849, June 1992.
[7] N. Benvenuto, G. Bertocci, and W. R. Daumer, "The 32-kb/s ADPCM coding standard," AT&T Tech. J., vol. 65, pp. 12-22, Sept.-Oct. 1986.
[8] A. H. Gray and J. D. Markel, "Distance measures for speech processing," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-24, no. 5, Oct. 1976.
[9] "Details to assist in implementation of federal standard 1016 CELP," National Communications System, Tech. Inform. Bull. 92-1, Jan. 1992.
[10] P. Kroon and E. F. Deprettere, "A class of analysis-by-synthesis predictive coders for high quality speech coding at rates between 4.8 and 16 kbits/s," IEEE J. Select. Areas Commun., vol. 6, no. 2, pp. 353-363, Feb. 1988.
[11] CCITT Recommendation G.722, "7 kHz audio coding within 64 kbits/s," in Blue Book, vol. III, Fascicle III, Oct. 1988.

Edward M. Painter (S'95) was born in Boston, MA, in 1967. He received the A.B. degree in engineering sciences and computer science from Dartmouth College in 1989 and the M.S. degree in electrical engineering from Arizona State University (ASU) in 1995. He is working toward the Ph.D. degree in electrical engineering at the Telecommunications Research Center at ASU, sponsored by Intel NDTC. He worked as an Embedded Systems Development Engineer from 1989 to 1992 with Applied Systems, and then as an Industrial Fellow with the Flight Controls Group at Honeywell Commercial Flight Systems from 1992 to 1994. He is currently a Research Assistant with the Telecommunications Research Center at ASU. His research interests are in the areas of speech and audio processing, perceptual coding, voice conversion, and image processing. He is a student member of the Audio Engineering Society.
