
SPHSC 503 Speech Signal Processing

UW Summer 2006

Lecture notes Monday 6/19


Fundamentals of Digital Signal Processing
The main components of digital signal processing are signals, sequences and systems. What is a signal? We define a signal as a time-varying quantity that conveys information about its source. Examples of signals are music, speech, daily temperature measurements, etc. What is a sequence? Simply put, a sequence is a set of numbers. It can also mean: a representation of a signal on a computer. In digital signal processing, sequences are often obtained by sampling a real-world signal at a certain rate. A simple example of a sequence is the set of numbers {0,1,2,0,1,2}. What is a system? In essence, a system is a processing element that turns an input signal into an output signal. For example, a microphone is a system: it converts acoustic waves into electrical pulses. A speaker is also a system: it converts electrical pulses into acoustic waves. You can also think of the volume control on an iPod as a system: it converts music into LOUD MUSIC. Or a more liberal example: a pen is a system, because it converts motion into text on paper.

Sequences
As mentioned, a sequence is a set of numbers, such as x={0,1,2,0,1,2}. But sequences are usually indexed, which means that we associate it with another set of equal length that contains sequential numbers, for example n={-1,0,1,2,3,4}.

Sequence notation
We use the following notation for a sequence: x[n]={0,1,2,0,1,2} for n=−1,…,4.

Displaying a sequence
We can graphically display this sequence as follows:
[Figure: stem plot of x[n] = {0,1,2,0,1,2} for n = −1,…,4]
If we had defined the sequence on another index set, for example x={0,1,2,0,1,2} for n=−3,…,2,

then the sequence would be displayed as
[Figure: stem plot of the same values on the index set n = −3,…,2]
Note that a sequence with the same values but with different indexes is a different sequence.

Sequence conventions
1. Specifying a sequence without specifying an index usually means the sequence starts at n=0. For example, x[n]={0,1,2,0,1,2} means x[n]={0,1,2,0,1,2} for n=0,…,5.
2. A sequence is assumed to be 0 outside the specified indices.

Some important sequences
Unit sample sequence, also known as the impulse sequence. Notation:

δ[n] = 1 for n = 0, and δ[n] = 0 for n ≠ 0

Graphical:

[Figure: stem plot of δ[n], a single sample of 1 at n = 0]

Unit step sequence. Notation:

u[n] = 1 for n ≥ 0, and u[n] = 0 for n < 0

Graphical:

[Figure: stem plot of u[n], samples of 1 for all n ≥ 0]

Sinusoids
For example, x[n] = sin(2πn/8):
[Figure: stem plot of x[n] = sin(2πn/8) for n = −8,…,8]

Periodic sequences
Defined as x[n] = x[n+N] for all n and some integer N. For example, x[n] = sin(2πn/8); see the figure above.
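These sequences are easy to generate in Matlab. A minimal sketch (the variable names are illustrative, not from the lecture):

>> n = -8:8;              % index vector
>> imp = (n == 0);        % impulse sequence: 1 at n=0, 0 elsewhere
>> u = (n >= 0);          % unit step sequence
>> x = sin(2*pi*n/8);     % the sinusoid from the example above
>> figure, stem(n,x)      % display a sequence against its index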

Systems
We can represent a system in the following diagram:
x[n] (input) → [T, system] → y[n] (output)

and the following notation: y[n] = T{x[n]}. In this representation, T can be anything. But if we don't put any further restrictions on T, then there's not much more we can say about the system. Therefore, we usually look at special kinds of systems in digital signal processing.
Special system: a linear system
Given a system T, y[n] = T{x[n]}

x[n] → [T] → y[n]

If the input x[n] is multiplied (scaled) by a constant a, then the output of a linear system will be scaled by the same amount, or: T{a·x[n]} = a·T{x[n]} = a·y[n].
ax[n] → [T] → ay[n]

Furthermore, if it is known about the system T that y1[n] = T{x1[n]} and y2[n] = T{x2[n]}

x1[n] → [T] → y1[n]
x2[n] → [T] → y2[n]


If two inputs are added, the output of a linear system will be the two outputs added, or: T{x1[n] + x2[n]} = T{x1[n]} + T{x2[n]} = y1[n] + y2[n]

x1[n]+x2[n] → [T] → y1[n]+y2[n]

These two properties of multiplication and addition of a linear system can be combined into the property of superposition:
T{a·x1[n] + b·x2[n]} = T{a·x1[n]} + T{b·x2[n]} = a·T{x1[n]} + b·T{x2[n]} = a·y1[n] + b·y2[n]

ax1[n]+bx2[n] → [T] → ay1[n]+by2[n]

Special system: a time-invariant (a.k.a. shift-invariant) system
Given a system T, y[n] = T{x[n]}

x[n] → [T] → y[n]

If the input x[n] is delayed (shifted) by N samples, then the output of a time-invariant system will be shifted by the same amount, or: T{x[n−N]} = y[n−N].
x[n−N] → [T] → y[n−N]

Special system: a linear, time-invariant (LTI, or LSI) system
A linear, time-invariant system is a system that combines the properties of linearity and time-invariance. Consider an LTI system T. Suppose we know about T that:

when the input x[n] is the impulse sequence, i.e., x[n] = δ[n], then the output is yδ[n] = {1,1} for n = 0,1

With this information, we are able to determine the output of this LTI system for all other possible input sequences. Let's see how. Given some input sequence x[n], for example x[n] = {0,1,2,0,1,2} for n = 0,…,5, we define the following set of sequences (see the diagram below):

xi[n] = x[n] if n = i, and xi[n] = 0 otherwise


Using these sequences, we can reconstruct the input sequence x[n] by summing all of them, x[n] = Σᵢ xi[n], as illustrated in the diagram below.

[Figure: stem plots of x[n] and the component sequences x0[n], x1[n], …, x5[n]]

Note that the xi[n] sequences are all scaled and shifted impulse sequences:

xi[n] = x[n] if n = i, 0 otherwise
      = δ[n−i] · x[i]

where δ[n−i] provides the shift and x[i] the scaling.

Therefore, we can reconstruct the input sequence by summing these scaled and shifted impulse sequences:

x[n] = Σᵢ xi[n] = Σᵢ δ[n−i]·x[i]

For each of the xi[n] sequences, we can find the output of the LTI system, as follows. It was given about the system that when the input is the impulse sequence, the output of the system is T{δ[n]} = yδ[n]. By using the time-invariance property, we find that the output for a shifted impulse is: T{δ[n−i]} = yδ[n−i]. By using the scaling property of linearity, we find that the output for the scaled (and shifted) impulse is: T{δ[n−i]·x[i]} = yδ[n−i]·x[i]. So for the sum of the xi[n] sequences we can find the output of the LTI system by using the additive property of linearity:

T{x[n]} = T{ Σᵢ δ[n−i]·x[i] }
        = Σᵢ T{δ[n−i]·x[i]}
        = Σᵢ T{δ[n−i]}·x[i]
        = Σᵢ yδ[n−i]·x[i]

The whole process is illustrated by the diagram below:
[Diagram: given δ[n] → yδ[n], each xi[n] (a scaled, shifted impulse) produces a scaled, shifted copy yi[n] = yδ[n−i]·x[i]; summing the yi[n] gives the output y[n]]

Conclusion about LTI systems: impulse response
Given the impulse response of an LTI system:
[Figure: impulse sequence δ[n] and the corresponding impulse response yδ[n]]

Then the output of the system for any input can be found using the LTI properties of scaling, shifting and summing. A linear, time-invariant system is therefore completely described by its impulse response:

x[n] (input) → [LTI system, impulse response] → y[n] (output)


Lecture notes 2 Wednesday 6/21


Summary of last lecture
- DSP is all about signals, sequences and systems
- Sequences:
  o indexed set of numbers, e.g., x[n]={0,1,2,0,1,2} for n=−1,…,4
  o important sequences: the impulse sequence δ[n], the unit step u[n], sinusoids, periodic sequences
- Systems:
  o produce an output sequence given an input sequence
  o without restrictions, a system can be anything, and impossible to work with
  o linear system: scaled input causes scaled output; added inputs cause added outputs; combined: superposition
  o time-invariant system: an input delay causes the same output delay

Special system: a linear, time-invariant (LTI) system
A linear, time-invariant system is a system that combines the properties of linearity and time-invariance. Consider an LTI system T. Suppose we know about T that when the input x[n] is the impulse sequence, i.e., x[n] = δ[n], then the output is yδ[n] = {1,1} for n = 0,1.

With this information, we are able to determine the output of this LTI system for all other possible input sequences. Let's see how. Given some input sequence x[n], for example x[n] = {0,1,2,0,1,2} for n = 0,…,5, we define the following set of sequences (see also the figure below):

xi[n] = x[n] if n = i, and xi[n] = 0 otherwise

Using these sequences, we can reconstruct the input sequence x[n] by summing all of them, x[n] = Σᵢ xi[n], as illustrated in the figure below.

[Figure: stem plots of x[n] and the component sequences x0[n], x1[n], …, x5[n]]

Note that the xi[n] sequences are all scaled and shifted impulse sequences:

xi[n] = x[n] if n = i, 0 otherwise
      = δ[n−i] · x[i]

where δ[n−i] provides the shift and x[i] the scaling.

Therefore, we can reconstruct the input sequence by summing these scaled and shifted impulse sequences:

x[n] = Σᵢ xi[n] = Σᵢ δ[n−i]·x[i]

For each of the xi[n] sequences, we can find the output of the LTI system, as follows. It was given about the system that when the input is the impulse sequence, the output of the system is T{δ[n]} = yδ[n]. By applying the time-invariance property to this expression, we find that the output for a shifted impulse is: T{δ[n−i]} = yδ[n−i]. By applying the scaling property of linearity, we find that the output for the scaled (and shifted) impulse is: T{δ[n−i]·x[i]} = yδ[n−i]·x[i]. So for the sum of the xi[n] sequences we can find the output of the LTI system by using the additive property of linearity:

T{x[n]} = T{ Σᵢ δ[n−i]·x[i] }
        = Σᵢ T{δ[n−i]·x[i]}
        = Σᵢ T{δ[n−i]}·x[i]
        = Σᵢ yδ[n−i]·x[i]

The whole process is illustrated by the diagram below:

[Diagram: given δ[n] → yδ[n], each xi[n] produces a scaled, shifted copy yi[n] = yδ[n−i]·x[i]; summing the yi[n] gives the output y[n]]

Conclusion about LTI systems: impulse response
Given the impulse response of an LTI system:
[Figure: impulse sequence δ[n] and the corresponding impulse response yδ[n]]

Then, as we've seen in the example above, the output of the system for any input can be found using the LTI properties of scaling, shifting and summing. A linear, time-invariant system is therefore completely described by its impulse response:

x[n] (input) → [LTI system, impulse response] → y[n] (output)

If we have two identical systems, then their impulse responses must be the same. And vice versa, if we have two systems with identical impulse responses, then they are the same system.


The convolution sum
As illustrated by the example above, any sequence can be viewed as a sum of scaled, shifted impulse sequences:

x[n] = Σᵢ δ[n−i]·x[i]

As a result, an LTI system's output can be viewed as a sum of scaled, shifted impulse responses:

y[n] = Σᵢ yδ[n−i]·x[i]

This special summation is called the convolution sum. It is usually written using the following conventions:
- Notation for the impulse response: h[n]
- Variable of summation: k
- Range of summation: negative infinity to infinity
- Short-hand notation: ∗

y[n] = Σ_{k=−∞}^{∞} h[n−k]·x[k] = h[n] ∗ x[n]

The convolution sum is one way to implement a system in Matlab. To compute the output of a system for a certain input sequence, you can evaluate the convolution sum for all values of n, using the system's impulse response and the input sequence.
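Matlab's built-in conv function evaluates exactly this sum for finite sequences. A small sketch, reusing the example sequences from above:

>> x = [0 1 2 0 1 2];    % input sequence
>> h = [1 1];            % impulse response
>> y = conv(x, h)        % output of the system for this input

Note that conv does not keep track of index vectors: the output has length(x)+length(h)-1 samples, and its starting index is the sum of the two starting indexes.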

The frequency response


An LTI system is characterized by its impulse response, but that's not always an intuitive way to describe a system. For example, suppose we have a system with the impulse response
h[n] = {−1, 2, −1} for n = −1, 0, 1

[Figure: stem plot of h[n]]


What kind of system is that? We often want to know how a system affects certain frequencies. Can you tell that from the above impulse response? This question is easier to answer by looking at an LTI system in another way, and that is through its frequency response. The frequency response is much like the impulse response, except that now we're not interested in the response of the system to the impulse sequence, but in the response of the system to a (fixed) frequency. Given an LTI system with some impulse response h[n], we can find the output of the system to a certain input x[n] using the convolution sum

y[n] = Σ_{k=−∞}^{∞} h[n−k]·x[k]

Let the input to this system be the complex frequency sequence x[n] = e^{jωn}. In this equation, j stands for the imaginary unit, which you may remember is defined as j = √(−1), and the variable ω is a number between 0 and π that indicates frequency (more on this later). Then, the output of the system will be
y[n] = Σ_{k=−∞}^{∞} h[n−k]·e^{jωk}          (let r = n−k)

     = Σ_{r=−∞}^{∞} h[r]·e^{jω(n−r)}

     = e^{jωn} · Σ_{r=−∞}^{∞} h[r]·e^{−jωr}

     = x[n] · Σ_{r=−∞}^{∞} h[r]·e^{−jωr}

We see that the output of an LTI system to a complex frequency sequence is the same complex frequency sequence multiplied by the complex constant

Σ_{r=−∞}^{∞} h[r]·e^{−jωr}

We call this a constant because it does not depend on n. But it does depend on the frequency variable ω. We can therefore write it as a function of ω:

H(ω) = Σ_{r=−∞}^{∞} h[r]·e^{−jωr}

We can evaluate this function for many values of ω to determine how the system with impulse response h[n] responds to all kinds of frequencies.
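A direct (if not the fastest) way to evaluate this sum in Matlab over a grid of frequencies, sketched here for the impulse response above:

>> h = [-1 2 -1];  nh = -1:1;      % impulse response and its index vector
>> w = linspace(0, pi, 512);       % frequency grid from 0 to pi
>> H = zeros(size(w));
>> for r = 1:length(h)             % accumulate the DTFT sum
>>     H = H + h(r)*exp(-1j*w*nh(r));
>> end
>> figure, plot(w, abs(H))         % magnitude of H(w)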

[Figure: the impulse response h[n] and the frequency response of the system, |H(ω)| plotted against the frequency variable ω]

(Footnote: mathematicians use the symbol i for the imaginary unit, but electrical engineers prefer j.)

From the frequency response we can see that this system:
- suppresses low frequencies: H(ω) is small when ω is close to zero
- passes high frequencies: H(ω) is large when ω is close to π
Thus, this system is a high-pass filter. A few notes regarding the frequency response:


The expression for the frequency response, H(ω) = Σ_{r=−∞}^{∞} h[r]·e^{−jωr}, is also called the discrete-time Fourier transform (DTFT) of the impulse response h[n]. Matlab has a built-in function to quickly evaluate the discrete-time Fourier transform of a sequence. This function is called fft, which stands for Fast Fourier Transform.

The function H(ω) is complex valued, which makes it a little complicated to plot. Most frequently, people plot the magnitude of H(ω); that is what is done in the frequency response plot above. Sometimes the phase of H(ω) is also of interest, in which case it is plotted in a separate plot. Another way to plot H(ω) is to separately plot the real and imaginary parts, but this is less common.

[Figures: magnitude and phase of the frequency response of h[n]; real and imaginary parts of the frequency response of h[n]]


Lecture notes 3 Friday 6/23


Summary of last lecture
- Linear, time-invariant system:
  o invariant to scaling, adding of inputs, and shifting
  o completely characterized by its impulse response, i.e., the output of the system when the input is the impulse sequence
- To find the output of an LTI system for an arbitrary input, we convolve the input sequence with the system's impulse response, by computing the convolution sum:

  y[n] = h[n] ∗ x[n] = Σ_{k=−∞}^{∞} h[n−k]·x[k]

- LTI systems are also completely described by their frequency response, i.e., the output of the system when the input is a complex frequency sequence, for all possible complex frequency sequences
- To find the frequency response of a system, we compute the discrete-time Fourier transform of its impulse response:

  H(ω) = Σ_{n=−∞}^{∞} h[n]·e^{−jωn}

Frequency analysis
In this lecture, we will study frequency analysis of systems, sequences and signals. We will look at some of the frequency analysis functions available in Matlab, and discuss the details of those functions.

Frequency analysis of systems: frequency response


Frequency analysis of systems in real life works as follows. Suppose that we want to know the frequency response of a certain loudspeaker. We could play a reference tone of a certain frequency through the loudspeaker, and measure the loudness of the tone as it comes out of the loudspeaker. We could repeat this experiment for all tones that we're interested in, for example over the audible frequency range (roughly 20 Hz to 20 kHz), and record all our loudness measurements in a graph. That would then be the frequency response of the loudspeaker.

When measuring the frequency response of the loudspeaker in this way, we're assuming that the loudspeaker is a linear, time-invariant system. We're assuming that if we play two different tones through the loudspeaker simultaneously, the output of the loudspeaker is those two tones, each one amplified or attenuated according to the measured frequency response. In real life, however, most systems are non-linear. Non-linear just means that some of the assumptions of linearity are not satisfied. In the loudspeaker example above, the two tones will most likely interact with each other in the loudspeaker in some complex way, the output of the loudspeaker may contain additional tones not predicted by the measured frequency response, and each of the tones will not be amplified or attenuated exactly according to the frequency response.

If we have a system in Matlab, we could follow the same strategy as with the loudspeaker. We could process single reference tones with the system, and record the output. But when the system is an LTI system defined by its impulse response, we can simply take the discrete-time Fourier


transform of the impulse response to find the frequency response exactly. And since the system is LTI, it will behave exactly according to the frequency response, even for sums of tones or other more complicated inputs.

The freqz function
To compute the frequency response of a system, Matlab has the function freqz. We'll discuss this function by looking at an example.
>> h = [-1 2 -1];            % define the impulse response
>> nh = [-1 0 1];            % ... and its index vector
>> N = 1024;                 % an additional parameter for freqz (see footnote below)
>> figure, stem(nh,h), title('h[n]')
>> figure, freqz(h,1,N), title('H(w)')
[Figure: freqz output for h: magnitude (dB) and phase (degrees) vs. normalized frequency (×π rad/sample)]

Here we see the frequency response of the high-pass filter we studied before, as generated by freqz. We will now address some of the features of this frequency response.

Magnitude response and phase response
As you can see, the frequency response consists of two plots: a magnitude plot (top) and a phase plot (bottom). As we saw in the previous lecture, the frequency response of a system is a complex function of frequency. That means that for each frequency, the frequency response is a complex number. It is kind of hard to plot complex numbers directly in Matlab, and therefore we separate the complex numbers into their magnitude and phase, and plot those separately. We refer to those plots as the magnitude response and phase response. Occasionally, we separate complex numbers into their real and imaginary parts and plot those separately, but the magnitude/phase representation is usually easier to interpret.

Magnitude in decibels (dB)
You may notice that the magnitude is plotted on a decibel (dB) scale. The decibel scale is a relative logarithmic scale. For example, −20 on the decibel scale indicates a quantity that is 10 times smaller than the reference, −40 corresponds to 100 times smaller, −60 corresponds to 1000 times smaller, etc. Zero on the decibel scale means that the quantity is the same as the reference
(Footnote: the third parameter of freqz, N, determines how many points of the frequency response H(ω) are evaluated for the plot. A value of N=1024 gives a nice smooth plot of H(ω).)


value, and positive values mean that the quantity is bigger than the reference value (20 dB = 100 times bigger, 40 dB = 1000 times bigger, etc). When there is no explicit reference value, the reference value is usually taken to be 1.

Normalized frequency
When we use freqz to plot the frequency response of an LTI system, it uses a normalized frequency axis. Normalized frequencies are frequencies from 0 to π. In the plot above, the x-axis runs from 0 to 1, but the x-axis label reads (×π rad/sample). This means that when we read an x-value off the graph, we should multiply it by π to get the true x-value. For example, the frequency response in the plot crosses 0 dB around 0.33π radians/sample. But why exactly do we use normalized frequency? Well, when we analyze the frequency response of a system, we don't know anything about a sampling frequency. In the example above, we only know that the impulse response h[n] = {−1, 2, −1} for n = −1, 0, 1. And as we've seen in homework 1, we need a sampling frequency to convert a sequence index to a point in time. Without a sampling frequency, the best we can do is to use a normalized frequency representation with frequencies between 0 and π.

Suppose we want to apply this system to a speech signal that is sampled at 10 kHz. In that case, we can supply the sampling frequency of the signal to freqz. freqz can then use this sampling frequency to convert the normalized frequencies to real frequencies, and we can get an idea of how this system will affect the real frequencies of the speech signal.
>> figure, freqz(h,1,N,10000), title('H(f), fs=10000')

Or, for a signal sampled at 16 kHz, we would use


>> figure, freqz(h,1,N,16000), title('H(f), fs=16000')
[Figures: frequency response of h with fs = 10000 Hz (0 to 5000 Hz) and with fs = 16000 Hz (0 to 8000 Hz): magnitude (dB) and phase (degrees) vs. frequency (Hz)]

In both cases, the normalized frequency is converted to real frequencies. A normalized frequency of 0 radians/sample always corresponds to 0 Hz, and a normalized frequency of π radians/sample corresponds to Fs/2 Hz (either 5000 or 8000 Hz in the examples above). The formula for converting a normalized frequency ω to a real frequency F is

F = (ω / 2π) · Fs
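As a quick sketch of this conversion in Matlab, using the numbers from the example:

>> w = 0.33*pi;          % normalized frequency in radians/sample
>> Fs = 10000;           % sampling frequency in Hz
>> F = w/(2*pi) * Fs     % real frequency: 1650 Hz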


So when this system is applied to signals sampled at 10 kHz, it attenuates frequencies below 0.33π radians/sample = 1650 Hz and amplifies frequencies above that. At 16 kHz, that threshold is at 2640 Hz.
The phase of this system is a line
An important characteristic of the phase response of this system is that it is a line. A linear phase response like this means that the system is a particularly nice system that has no phase distortion. We will learn more about this when we discuss infinite impulse response systems.

The second parameter of freqz is 1
You may have noticed that, so far, we've set the second parameter to 1 when we used freqz. You may have wondered why we do that. The reason for that is the subject of the next section.

Finite impulse response (FIR) systems and infinite impulse response (IIR) systems
Before going into a discussion about finite impulse response (FIR) systems and infinite impulse response (IIR) systems, it may be helpful to take a step back and give an overview of the systems we have discussed.

[Diagram: all systems, with linear systems and time-invariant systems as overlapping subsets; LTI systems are their intersection]

We have seen that an LTI system is completely characterized by its impulse response. We can therefore distinguish different types of LTI systems based on properties of their impulse response. The most prominent feature of the impulse response is its length: it can either be infinite or finite. An example of an infinite impulse response is

h[n] = (0.9)ⁿ · u[n]

which graphically looks like

[Figure: stem plot of h[n] = (0.9)ⁿ u[n] for n = −10,…,20]

It is constructed in the expression above by multiplying the unit step sequence


[Figure: stem plot of u[n] for n = −10,…,20]

with the sequence (0.9)ⁿ:

[Figure: stem plot of (0.9)ⁿ for n = −10,…,20]

A problem with systems that have an infinite impulse response is that we can't work with them on computers, because computers have a finite amount of memory. As a consequence, we can never store an infinite impulse response on a computer. We therefore restrict ourselves to a special kind of IIR systems that can be implemented using two finite sequences, according to the following diagram:

x[n] → [hb[n]] → (+) → y[n], with y[n] fed back through [ha[n]] into the (+) node

We call this kind of system a rational system. The feedback path through ha[n] is called a feedback loop, or simply feedback. Without feedback, a rational system reduces to a regular, finite impulse response (FIR) system, with impulse response hb[n]. An example of a rational system is the following:
>> b = [1];                 % define hb[n]
>> a = [1 0 0.81];          % define ha[n], feedback
>> imp = [1 zeros(1,20)];   % create 21-point impulse sequence
>> h = filter(b,a,imp);     % determine impulse response
>> nh = 0:20;               % index vector for impulse response
>> figure, stem(nh,h)       % plot impulse response
>> figure, freqz(b,a)       % plot frequency response
[Figures: impulse response of the rational system (stem plot) and its frequency response from freqz(b,a): magnitude (dB) and phase (degrees) vs. normalized frequency (×π rad/sample)]


The frequency response of this rational system has the same features as the frequency response of the high-pass filter shown earlier. The only difference is that the phase response of this system is a curve instead of a line. This means that this system has a non-linear phase response, which is a characteristic of all rational systems. As a result, this system is not so nice and has some phase distortion. Despite their phase distortion, rational systems are still very interesting because they are often much more efficient and have much less delay than equivalent FIR systems.

The reason that we used 1 for the second parameter in the example above now becomes clear: it basically makes ha[n] equal to the impulse sequence. And as we have seen in Exercise 2.1a, convolution with the impulse sequence produces an output sequence that is identical to the input sequence. It is in fact the identity system, and using the identity system in the feedback loop ensures that the loop has no effect on the output. Given this classification of LTI systems, we can complete the systems overview diagram:

[Diagram: LTI systems, containing IIR systems, containing rational systems, containing FIR systems]

Frequency analysis of sequences and signals: frequency spectrum


Frequency analysis is not only useful to identify and characterize systems; it can also be used to examine the frequency content or frequency spectrum of sequences and signals. To determine the spectrum of a sequence, we could use Matlab's freqz function again, using the sequence as the first parameter and 1 (the identity system) as the feedback parameter. However, we are usually only interested in the magnitude of a sequence's spectrum and not its phase, so the freqz function is a little overkill. Instead, we can find the spectrum of a sequence by computing its discrete-time Fourier transform. In Matlab, the discrete-time Fourier transform can be found using fft. For example, for a given sequence x (in these examples, x is the speech signal from Exercise 1.3),
>> X = fft(x);      % find frequency spectrum of x[n]
>> plot(X);         % plot the frequency spectrum

Unfortunately, the result of these commands is quite horrific:

[Figure: the result of plot(X) for the complex vector X, a tangle of lines]

The problem here is that X is a complex vector. The figure above shows how Matlab plots complex numbers. We immediately get better results when we take the magnitude of X using the abs function before plotting it:

>> X = fft(x);        % find frequency spectrum of x[n]
>> plot(abs(X));      % plot magnitude of frequency spectrum

[Figure: plot of abs(X); the x-axis is in samples (0 to about 7000) and the spectrum appears mirrored halfway through]
Still, the x-axis is not in units of Hz, nor in units of radians/sample for normalized frequency. And the spectrum appears to be mirrored halfway through. It turns out that a little more work is necessary to create a nice plot of the frequency spectrum of a sequence. In this course, we will use a function called spec to take care of that for us. To plot a frequency spectrum, we can simply type
>> spec(x) % plot frequency spectrum of x[n]


[Figure: spec(x): magnitude vs. normalized frequency (×π rad/sample)]

This gets rid of the mirrored copy in the spectrum, and puts appropriate labels on the axes. By default, spec plots the magnitude of the spectrum on a linear scale, but it can be instructed to plot it on a dB scale by adding an optional flag:
>> spec(x,'db')      % plot frequency spectrum of x[n] in dB

[Figure: spec(x,'db'): magnitude (dB) vs. normalized frequency (×π rad/sample)]

Since we haven't specified a sampling frequency, spec treats the speech signal as a sequence with unknown sampling frequency and plots it on a normalized frequency axis, like freqz. We can specify the sampling frequency as the second parameter:
>> spec(x,fs,'db')   % plot frequency spectrum of the signal x in dB

[Figure: spec(x,fs,'db'): magnitude (dB) vs. frequency (Hz), from 0 to 5000 Hz]
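The actual spec.m is the instructor's M-file, but a minimal sketch of what such a function might do internally could look like this (the details below are assumptions, not the real spec.m):

>> X = fft(x);                       % spectrum, including the mirrored copy
>> N = length(X);
>> X = X(1:floor(N/2)+1);            % keep only the first (non-mirrored) half
>> f = (0:floor(N/2))/N * fs;        % frequency axis in Hz
>> plot(f, 20*log10(abs(X)))         % magnitude on a dB scale
>> xlabel('Frequency (Hz)'), ylabel('Magnitude (dB)')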


Lecture notes 4 Monday 6/26


Summary of last lecture
- Frequency analysis of systems: frequency response
  o Matlab: freqz(h,1) or freqz(b,a)
  o Magnitude response, phase response
  o Normalized frequency, from 0 to π: no sampling frequency is associated with a system, so it cannot be put on a Hz frequency scale. A system is only a prototype until we know the sampling frequency of the signals it will be applied to.
  o Decibel scale
  o Type of impulse response:
    o Infinite (IIR): not realizable in Matlab (finite memory). Therefore we look at a subclass of IIR systems: rational systems. Rational systems have infinite h[n], but are realizable with two finite sequences, b[n] and a[n], using a[n] in a feedback loop.
    o Finite (FIR): strictly speaking a subclass of rational systems, with a[n] = δ[n] such that the feedback loop is ineffective.
- Frequency analysis of sequences: spectrum
  o Phase response of less interest; freqz could be used but is overkill
  o Plot the spectrum using spec(x) (defined in spec.m, an M-file written by the instructor and not part of regular Matlab)
  o Again a normalized frequency axis; no sampling frequency is associated with a sequence
  o Optional 'db' flag for a decibel magnitude scale
- Frequency analysis of signals: spectrum
  o Same as sequences, but with an associated sampling frequency, spec(x,fs)
  o Frequency in Hz, instead of normalized frequency

Short-term frequency analysis


In the previous lecture, we studied frequency analysis of signals. Exercise 3.2 of the lab showed that long-term frequency analysis of speech signals yields good information about the overall frequency spectrum of the signal, but no information about the temporal location of those frequencies. And since speech is a very dynamic signal with a time-varying spectrum, it is often very insightful to look at frequency spectra of short sections of the speech signal. In this lecture, we will study the details of short-term frequency analysis. In the lab, we discuss the Matlab functions that perform this kind of analysis. Long-term frequency analysis revisited We defined the frequency response of a system as the discrete-time Fourier transform (DTFT) of the system's impulse response h[n]:
H ( ) =

n =

h[n]e

j n

Similarly, for a sequence x[n], its long-term frequency spectrum is defined as the DTFT of the sequence:

X(ω) = Σ_{n=−∞}^{∞} x[n]·e^{−jωn}

Theoretically, we must know the sequence x[n] for all values of n (from n = −∞ until n = ∞) in order to compute its frequency spectrum. Fortunately, all terms where x[n] = 0 do not matter in the sum, and therefore an equivalent expression for the sequence's spectrum is

X(ω) = Σ_{n=0}^{N−1} x[n]·e^{−jωn}

Here we've assumed that the sequence starts at 0 and is N samples long. This tells us that we can apply the DTFT to only the non-zero samples of x[n], and still obtain the sequence's true spectrum X(ω). But what is the correct mathematical expression to compute the spectrum over a short section of the sequence, that is, over only part of the non-zero samples of the sequence?

Window sequence
It turns out that the mathematically correct way to do that is to multiply the sequence x[n] by a window sequence w[n] that is non-zero only for n = 0,…,L−1, where L, the length of the window, is smaller than the length N of the sequence x[n]:

xw[n] = x[n]·w[n]

Then we compute the spectrum of the windowed sequence xw[n] as usual:

Xw(ω) = Σ_{n=0}^{N−1} xw[n]·e^{−jωn} = Σ_{n=0}^{N−1} (x[n]·w[n])·e^{−jωn}

The following figure illustrates how a window sequence w[n] is applied to the sequence x[n]:
[Figure: the sequence x[n], the window w[n], and the windowed sequence xw[n] = x[n]·w[n]]

As the figure shows, the windowed sequence is shorter in length than the original sequence. So we can further truncate the DTFT of the windowed sequence:

Xw(ω) = Σ_{n=0}^{L−1} (x[n]·w[n])·e^{−jωn}

Using this windowing technique, we can select any section of arbitrary length of the input sequence x[n] by choosing the length and location of the window accordingly. The only question that remains is: how does the window sequence w[n] affect the short-term frequency spectrum?

Effect of the window
To answer that question, we need to introduce an important property of the Fourier transform. The diagram below illustrates the property graphically:

LTI system I. Implementation of an LTI system in the time domain. y[n]

x[n]

convolution

y[n] = x[n] h[n]

h[n] II. Equivalent implementation of an LTI system in the frequency domain. x[n]
DTFT

LTI system
DTFT

y[n]
multiplication IDTFT

Y ( ) = X ( ) H ( )

The two implementations of an LTI system are equivalent: they will give the same output for the same input. Hence, convolution in the time domain = multiplication in the frequency domain:

y[n] = x[n] ∗ h[n]  ⟷  Y(ω) = X(ω)·H(ω)

And since the time domain and the frequency domain are each other's dual in the Fourier transform, it is also true that multiplication in the time domain = convolution in the frequency domain:

xw[n] = x[n]·w[n]  ⟷  Xw(ω) = X(ω) ∗ W(ω)

This shows that multiplying the sequence x[n] with the window sequence w[n] in the time domain is equivalent to convolving the spectrum of the sequence, X(ω), with the spectrum of the window, W(ω). The result of the convolution of the spectra in the frequency domain is that the spectrum of the sequence is smeared by the spectrum of the window. This is best illustrated by the example in the figure below:

[Figure: x[n] = cos(2π·6n/90) and its spectrum X(ω); the window w[n] and its spectrum W(ω); the windowed sequence xw[n] = x[n]·w[n] and its smeared spectrum Xw(ω) = X(ω) ∗ W(ω)]
Choice of window
Because the window determines the spectrum of the windowed sequence to a great extent, the choice of the window is important. Matlab supports a number of common windows, each with their own strengths and weaknesses. Some common choices of windows are shown below.
[Figures: sequence and spectrum of the rectangular, triangular, Hamming, Hann and Kaiser windows]

All windows share the same characteristics. Their spectrum has a peak, called the main lobe, and ripples to the left and right of the main lobe called the side lobes. The width of the main lobe and the relative height of the side lobes are different for each window. The main lobe width determines how accurately a window is able to resolve different frequencies: wider is less accurate. The side lobe height determines how much spectral leakage the window has. We'll learn more about these terms in the next lecture. An important thing to realize is that we can't have short-term frequency analysis without a window. Even if we don't explicitly use a window, we are implicitly using a rectangular window.


Parameters of the short-term frequency spectrum
Besides the type of window (rectangular, Hamming, etc.) there are two other factors in Matlab that control the short-term frequency spectrum: the window length and the number of frequency sample points.

The window length controls the fundamental trade-off between time resolution and frequency resolution of the short-term spectrum, irrespective of the window's shape. A long window gives poor time resolution, but good frequency resolution. Conversely, a short window gives good time resolution, but poor frequency resolution. For example, a 250 ms long window can, roughly speaking, resolve frequency components when they are 4 Hz or more apart (1/0.250 = 4), but it can't tell where in those 250 ms those frequency components occurred. On the other hand, a 10 ms window can only resolve frequency components when they are 100 Hz or more apart (1/0.010 = 100), but the uncertainty in time about the location of those frequencies is only 10 ms. The result of short-term spectral analysis using a long window is referred to as a narrowband spectrum (because a long window has a narrow main lobe), and the result of short-term spectral analysis using a short window is called a wideband spectrum. In short-term spectral analysis of speech, the window length is often chosen with respect to the fundamental period of the speech signal, i.e., the duration of one period of the fundamental frequency. A common choice for the window length is either less than 1 times the fundamental period, or greater than 2-3 times the fundamental period. Examples of narrowband and wideband short-term spectral analysis of speech are given in the figures below.
[Figures: wideband analysis of speech and narrowband analysis of speech: magnitude (dB) vs. frequency (Hz)]
The other factor controlling the short-term spectrum in Matlab is the number of points at which the frequency spectrum H(ω) is evaluated. The number of points is usually equal to the length of the window. Sometimes a greater number of points is chosen to obtain a smoother looking spectrum. Evaluating H(ω) at fewer points than the window length is possible, but very rare.
Time-frequency domain: spectrogram
An important use of short-term spectral analysis is the short-time Fourier transform or spectrogram of a signal. The spectrogram of a sequence is constructed by computing the short-term spectrum of a windowed version of the sequence, then shifting the window over to a new location and repeating this process until the entire sequence has been analyzed. The whole process is illustrated in the figure below:

[Figure: spectrogram computation in three steps: a window slides along the speech signal; at each position, the windowed sequence and its short-term spectrum are computed]
Together, these short-term spectra (bottom row) make up the spectrogram, and are typically shown in a two-dimensional plot, where the horizontal axis is time, the vertical axis is frequency, and magnitude is the color or intensity of the plot. For example:
[Figure: spectrogram of a speech signal: frequency vs. time, with magnitude as intensity]
The appearance of the spectrogram is controlled by a third parameter: window overlap. Window overlap determines how much the window is shifted between repeated computations of the short-term spectrum. Common choices for window overlap are 50% or 75% of the window length. For example, if the window length is 200 samples and the window overlap is 50%, the window would be shifted over 100 samples between each short-term spectrum. If the overlap was 75%, the window would be shifted over 50 samples.
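In newer versions of Matlab's Signal Processing Toolbox, the spectrogram function takes all three parameters explicitly; a sketch:

>> wlen = 200;                      % window length in samples
>> noverlap = 100;                  % 50% window overlap
>> spectrogram(x, hamming(wlen), noverlap, 1024, fs)   % plot the spectrogram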


The choice of window overlap depends on the application. When a temporally smooth spectrogram is desirable, the window overlap should be 75% or more. When computation should be kept at a minimum, no overlap or 50% overlap are good choices. If computation is not an issue, you could even compute a new short-term spectrum for every sample of the sequence. In that case, window overlap = window length − 1, and the window would only shift 1 sample between the spectra. But doing so is wasteful when analyzing speech signals, because the spectrum of speech does not change at such a high rate. It is more practical to compute a new spectrum every 20-50 ms, since that is the rate at which the speech spectrum changes.
Length of the window and fundamental frequency
In a wideband spectrogram (i.e., using a window shorter than the fundamental period), the fundamental frequency of the speech signal resolves in time. That means that you can't really tell what the fundamental frequency is by looking at the frequency axis, but you can see energy fluctuations at the rate of the fundamental frequency along the time axis. In a narrowband spectrogram (i.e., using a window 2-3 times the fundamental period), the fundamental frequency resolves in frequency, i.e., you can see it as an energy peak along the frequency axis. See for example the figures below:

[Figures: wideband speech spectrogram and narrowband speech spectrogram: frequency vs. time]


Lecture notes 5 Wednesday 6/28


Summary of last lecture
- Short-term frequency analysis and the spectrogram
  o short-term frequency analysis is only possible by applying a window sequence
  o even when only selecting a section of a signal (truncating), this is equivalent to applying a rectangular window
  o the window sequence has a smearing effect on the spectrum
  o windows have two important spectral properties: main lobe width and side lobe height
  o long window sequence → narrow main lobe → narrowband spectral analysis
  o short window sequence → wide main lobe → wideband spectral analysis
  o length and shape of a window determine its time and frequency resolution
- Spectrogram
  o repeated short-term spectral analysis, column-wise graphical display
  o window overlap = window length − shift between successive windows

Spectrogram modification
The spectrogram is not only a great tool to analyze (speech) signals; it is also often used to modify signals in various ways. The basic idea is to multiply each column in the spectrogram with a weighting vector. Each column in the spectrogram is a short-term spectrum of the signal, and by multiplying a column of the spectrogram with a weighting vector we modify the short-term spectrum of the signal. The weighting vector can be the same for each column in the spectrogram, which essentially is the same as applying an LTI system with a frequency response given by the weighting vector. For example, if we want to apply a band-pass filter to the signal, we can multiply each column of the spectrogram with the weighting vector:
[Figure: band-pass weighting vector, weight (0 to 1) vs. frequency (Hz)]
which would modify the following speech spectrogram as shown below:

[Figures: spectrogram before and spectrogram after band-pass weighting (blue = zero)]
A very powerful way to modify the spectrogram, however, is by using a different weighting vector for each column of the spectrogram. This allows us to apply a time-varying filter to the signal, which cannot be done with an LTI system. For example, we can apply different filters to different parts of the signal:
[Figures: spectrogram before; low-pass, band-pass and high-pass weighting vectors applied to different parts of the signal; spectrogram after (blue = zero)]


This technique has many uses: LTI filtering and time-varying filtering, signal separation, noise suppression, etc. It basically gives us the ability to carve out any part of a given spectrogram by attenuating undesired parts of the spectrogram or setting them to zero. But once we have modified the spectrogram, how do we convert it back to a waveform? How can we reconstruct a signal from the modified spectrogram?

Signal reconstruction from modified spectrogram


In order to answer that question, we have to go back to how the spectrogram was computed: each column of the spectrogram is the Fourier transform of a short windowed section of the signal. To recover a signal from a (modified) spectrogram, we need to reverse those steps:
- Take the inverse Fourier transform of each column of the spectrogram. Now each column is a short section of the signal
- Undo the window that was applied to each short section of the signal
- Assemble the short sections into a signal

This process seems easy enough, but there are some catches. Undoing the window is not as easy as it sounds, unless the window was the rectangular window. When we compute a spectrogram, we multiply each short section of the signal with the window. Therefore, to undo the window, we need to divide each short section by the window. But most windows taper to zero at their endpoints, and dividing samples of the signal by values close to zero is an operation that is very sensitive to noise. To avoid this problem, the spectrogram of a signal is computed and later reconstructed using windows that sum to one, like the sine window in this example:
[Figure: overlapping windows and their sum, which is one away from the edges]
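A quick numerical check of this sum-to-one property, sketched here with a squared-sine (Hann-like) window, which sums exactly to one at 50% overlap (an assumption for illustration, not necessarily the exact window from the figure):

>> L = 200;  hop = L/2;                    % 50% overlap
>> w = sin(pi*((0:L-1)+0.5)/L).^2;         % squared-sine window
>> s = zeros(1, 9*hop + L);
>> for k = 0:9                             % overlap-add ten windows
>>     s(k*hop+1 : k*hop+L) = s(k*hop+1 : k*hop+L) + w;
>> end
>> plot(s)                                 % flat at 1, except near the edges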

But the most important catch in reconstructing a signal from a modified spectrogram is the following. The columns in the spectrogram are dependent, because they come from overlapping short sections of the signal. For example, when we use 50% overlap between windows, each sample of the signal is represented in two columns of the spectrogram:


[Figure: spectrogram with 50% window overlap and the corresponding waveform; each sample of the signal falls under two windows]

When we reconstruct the signal from the spectrogram, those two columns must agree on the value of that sample. But arbitrary modifications of the spectrogram may break those agreements or dependencies between the columns of the spectrogram. The only exception that doesn't break the dependencies is when we apply the same weighting vector to all columns. In the case of arbitrary modifications, we can still use the reconstruction procedure described above to reconstruct a time-domain signal. But what will happen then is that the two columns will not agree on the value of the sample, and the sample's value will be a weighted average of the values that each column assigns to it. As a result, the spectrogram of the reconstructed signal is not exactly the modified spectrogram that we created; it is only a close approximation of it.

This inaccuracy in reconstruction is usually taken for granted, because the approximation is often quite good, and this technique is so valuable. In extreme cases such as highly random modifications, however, the approximation may not be as good and could cause audible artifacts. Of course, this problem in reconstruction could be avoided altogether by using non-overlapping windows to compute and reconstruct the spectrogram. But in that case, other problems arise, such as discontinuities at the boundaries of adjacent windows, which are easily audible. Although overlapping windows are not the perfect solution, they guarantee a certain smoothness of the reconstructed signal and are therefore preferable over non-overlapping windows.

Application: noise removal using spectral subtraction


An interesting speech enhancement application based on modification of a signal's spectrogram is noise removal using spectral subtraction. The idea behind this technique is as follows. Suppose we have a signal of interest that is corrupted by noise, and suppose furthermore that we have an estimate of the spectrum of that noise. (For example, we have recorded a speech signal that is corrupted by some fan noise, and we know that the fan noise has most of its energy in a small frequency band between 3400 and 3600 Hz.) We can then compute the spectrogram of the noisy signal, subtract the short-term noise spectrum from each column of the spectrogram, and reconstruct a filtered speech signal from the modified spectrogram. This is a very easy way to remove noise from a speech signal.
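A minimal sketch of the subtraction step (the variable names, and the flooring at zero with max, are illustrative assumptions, not the exact method used in the lab):

>> % S: complex spectrogram matrix; Nmag: noise magnitude estimate (column vector)
>> mag = abs(S);                                   % magnitude of each column
>> ph  = angle(S);                                 % keep the noisy phase
>> mag = max(mag - Nmag*ones(1,size(S,2)), 0);     % subtract noise, floor at zero
>> Smod = mag .* exp(1j*ph);                       % recombine for reconstruction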


In today's lab we will see an even more powerful version of this technique, where we don't need to have an estimate of the noise spectrum in advance. It is able to estimate the noise spectrum from the signal itself, and it can even update its estimate over time to track noise with a time-varying spectrum. A very powerful technique indeed!


Lecture notes 6 Friday 6/30


Summary of last lecture
- Spectrogram modification
  o Applying a weighting vector to the columns of the spectrogram
  o Simple way to implement LTI filters
  o More freedom: applying a different weighting vector to each column of the spectrogram → a time-varying filter, something that can't be done by an LTI system
- Reconstruction from spectrogram
  o Inverse Fourier transform, undoing the window, assembling short sections into a signal
  o Redundancy in the spectrogram: each sample of the input is present in two (at 50% window overlap) or four (at 75% window overlap) columns of the spectrogram
  o Columns of the spectrogram agree on the value of that sample
  o Arbitrary modifications break the agreement between columns
  o Still possible to reconstruct a signal, but reconstruction is not perfect, though acceptable
  o Despite its imperfect reconstruction, spectrogram modification is widely used for signal processing
- Application of spectrogram modification: noise suppression using spectral subtraction
  o Given an estimate of the noise spectrum, subtract the noise spectrum from a noisy signal's spectrogram to reduce the noise
  o Advanced technique: automatically estimate the noise spectrum from the noisy signal

Sampling, restoration, aliasing and quantization


So far in this course we've focused on digital signals and systems. But to apply digital systems to real world signals, we need to add the appropriate conversion steps before and after our digital processing. The following diagram gives an overview:

x(t) → [A/D] → x[n] → [LTI system] → y[n] → [D/A] → y(t)

The analog signal x(t) is converted by an analog-to-digital (A/D) converter to the sequence x[n], which is processed by an LTI system with output sequence y[n]. The output sequence y[n] is then converted by a digital-to-analog (D/A) converter to an analog signal y(t). The task of the A/D converter is two-fold. First, it must sample the analog signal at a regular rate. This step is often treated as an independent conversion step called continuous-time to discrete-time conversion, or C/D conversion. Second, the A/D converter must quantize the samples to a finite number of signal levels that can be represented on the digital platform on which the LTI system is implemented. The task of the D/A converter is to convert the samples back to an analog, continuous-time signal. The A/D converter (including its C/D converter) and the D/A converter must perform their tasks in such a way that they are transparent to the LTI system. That means the entire system (everything between x(t) and y(t) in the diagram above) should have a frequency response that is identical to the frequency response of the LTI system and not be affected by the A/D and D/A conversion.


Sampling
The first conversion performed by an A/D converter is sampling. In this conversion step, a continuous-time signal is converted to a discrete-time signal. A discrete-time signal is a signal that is sampled, but that still can take on any signal value at its sample points. In contrast, a digital signal is a sampled signal that is quantized and can only take on a finite set of discrete signal values at its sample points. The sampling conversion step is also called C/D conversion.
[Figure: a continuous-time signal (amplitude vs. time in seconds) and the corresponding discrete-time signal (amplitude vs. index n)]

Theoretically, the sampling or C/D conversion consists of two steps:
- Keep the signal values at the sampling times, and set the signal to zero everywhere else
- Normalize the time axis from seconds to sample index
The first step can be viewed as multiplying an analog signal x(t) by an analog impulse train s(t), resulting in a sampled analog signal xs(t), as illustrated by the figure below:
[Figure: the analog signal x(t), the analog impulse train s(t), and their product xs(t) = x(t)·s(t)]
The spacing of the impulses of the impulse train is called the sampling period, specified by the symbol T, and is the reciprocal of the sampling frequency Fs, i.e., T = 1/Fs. Multiplying the signal with an impulse train in the time domain is the same as convolving the spectrum of the signal with the spectrum of the impulse train. We saw this "multiplication in time is convolution in frequency" property of the Fourier transform before, when we discussed multiplying a sequence with a window to obtain its short-term spectrum. To understand the effect of the multiplication here, we need to know the Fourier transform of the impulse train. Without proof, we will state that the Fourier transform of an impulse train in time is an impulse train in frequency. The impulse trains in time and frequency are related according to the figure below.
[Figure: an impulse train s(t) with impulses of amplitude 1 spaced T seconds apart, and its Fourier transform S(f): an impulse train with impulses of amplitude Fs spaced Fs Hz apart.]

A spacing of T = 1 / Fs seconds of impulses with an amplitude of 1 in the time domain results in a spacing of Fs Hz of impulses of amplitude Fs in the frequency domain. Therefore, given the spectrum of an analog signal, we can find the spectrum of the sampled analog signal by convolving it with the spectrum of the impulse train:
[Figure: the spectrum X(f) of the analog signal, the spectrum S(f) of the impulse train, and their convolution Xs(f): copies of X(f), weighted by Fs, centered at multiples of Fs.]

The signal's spectrum is shifted and weighted by the impulses in the spectrum of the impulse train signal. As a result, the sampled analog signal's spectrum is periodic, and its period is equal to the sampling frequency. The next step in the C/D converter is normalization of the time axis. Before normalization, the sampled analog signal xs(t) has non-zero values at t = …,-2T,-T,0,T,2T,…. To normalize these sampling times, we divide t by the sampling period T, such that the signal now has non-zero values at t = …,-2,-1,0,1,2,…. The signal values at these time instances are then treated as the values of the sequence x[n] at the corresponding index. As a result, the relationship between the analog signal x(t) and the sequence x[n] is x[n] = x(nT). Normalization of the time axis by T in the time domain causes a normalization of the frequency axis by Fs in the frequency domain. That would leave the frequency axis in units of cycles per sample. This can be converted to radians per sample by realizing that 1 cycle = 2π radians. Therefore, the spectrum of the signal before and after sampling is:

[Figure: the spectrum X(f) of the analog signal, and the spectrum Xs(ω) of the sequence, periodic in ω with period 2π (frequency axis in multiples of π rad/sample).]

This explains why the spectrum of any sequence is periodic with a period of 2π.
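To make the x[n] = x(nT) relationship concrete, here is a minimal sketch in Matlab; the 100 Hz cosine and Fs = 1000 Hz are arbitrary choices for illustration:

  Fs = 1000; T = 1/Fs;           % sampling frequency and sampling period
  n = 0:99;                      % sample indices
  x = cos(2*pi*100*n*T);         % x[n] = x(nT) for a 100 Hz cosine
  X = fft(x);                    % samples of one 2*pi period of the spectrum
  f = (0:99)*Fs/100;             % the same period expressed in Hz (0 to Fs)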

Restoration
To obtain a continuous-time signal from a discrete-time signal, the D/A converter needs to undo the effect of sampling by reversing all the steps done in sampling. This is illustrated graphically in the following figure:

[Figure: restoration. The sequence x[n], with periodic spectrum Xs(ω), is placed on a time axis to give the sampled analog signal xs(t) with spectrum Xs(f); a low-pass filter Hlp(f) with gain T then yields the analog signal x(t) with spectrum X(f).]
First, the sample values of the sequence are placed on a time axis to obtain the sampled analog signal xs(t) with spectrum Xs(f). This scales the frequency back from units of cycles per sample to Hz. Then a low-pass filter, Hlp(f), is applied to the sampled analog signal to remove all periodic copies of the spectrum except for the copy centered at 0 Hz. This low-pass filter must have a gain of T to correct the change in gain incurred during sampling. The low-pass filter is called an interpolating filter or anti-imaging filter, because it interpolates between the samples in the time domain, and removes the periodic images of the spectrum in the frequency domain.
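In the time domain, that ideal interpolating filter corresponds to summing shifted, scaled sinc pulses. Here is a minimal sketch, reusing the 100 Hz cosine from the earlier sketch (the sinc function is in the Signal Processing Toolbox):

  Fs = 1000; T = 1/Fs;
  n = 0:49;
  x = cos(2*pi*100*n*T);            % the samples x[n]
  t = 0:T/10:max(n)*T;              % a dense time grid for the reconstruction
  xr = zeros(size(t));
  for k = 1:length(n)               % sum of shifted, scaled sinc pulses
      xr = xr + x(k) * sinc((t - n(k)*T) / T);
  end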

Aliasing
Because sampling introduces periodicity in the spectrum, we run the risk that the periodic copies of the spectrum will overlap with one another. This happens when the sampling frequency is too small, as illustrated in the figure below.

[Figure: two sampling cases. Left: X(f), S(f) and Xs(f) for a sampling frequency large enough that the periodic copies of the spectrum do not overlap. Right: the same spectra for a sampling frequency that is too small, so the copies overlap.]

On the left, the sampling frequency is chosen large enough for the signal, and no overlap occurs between the periodic copies of the spectrum. On the right, however, the sampling frequency is chosen too small for the signal, and overlap occurs between the copies of the spectrum. This artifact is known as aliasing distortion, or just aliasing. To give a simple numerical example: consider a 9 Hz sinusoid sampled at 10 Hz. The 9 Hz sinusoid has two impulses in the spectrum, at +9 Hz and -9 Hz. When sampling at 10 Hz, the spectrum becomes periodic with a period of 10 Hz. That means that copies of the sinusoid's impulses appear at f = …,-21,-11,-1,9,19,29,… and at f = …,-29,-19,-9,1,11,21,…, and aliasing has occurred. This matters because, even if we don't perform any processing on the sampled sequence, we will apply a low-pass filter to the signal during reconstruction, which will only keep the impulses at f = -1 and f = +1. This shows that the 9 Hz sinusoid becomes a 1 Hz sinusoid when sampled at 10 Hz.

[Figure: a 9 Hz tone sampled 10 times per second (left); those samples also represent a 1 Hz tone (right).]
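A minimal sketch to check this numerically (using cosines; with sines the aliased component also flips sign, since 9 Hz folds to -1 Hz):

  n = 0:10;                     % eleven samples at Fs = 10 Hz
  x9 = cos(2*pi*9*n/10);        % 9 Hz cosine, sampled at 10 Hz
  x1 = cos(2*pi*1*n/10);        % 1 Hz cosine, sampled at 10 Hz
  max(abs(x9 - x1))             % zero (up to round-off): identical samples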

When we choose our sampling frequency too small, aliasing will occur, and the C/D and D/C conversion will no longer be transparent. In fact, when aliasing occurs the overall system becomes non-linear, because aliasing creates frequencies in the output that were not present in the input, which linear systems can't do. Sometimes it is not desirable to choose a sampling frequency high enough to avoid aliasing, for example because it makes subsequent processing of the digital signal slower (there are simply more samples to process). In order to prevent aliasing when the sampling frequency is too low, we must use an anti-aliasing filter. This is a low-pass filter that removes frequencies from the input signal that would otherwise cause aliasing distortion. When we use an anti-aliasing filter, the overall system remains a linear system, and the overall frequency response is the combined response of the anti-aliasing filter and the digital LTI system.
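The same idea applies when reducing the sampling rate of a signal that is already digital. A minimal sketch, with all parameter choices for illustration only:

  Fs = 10000; M = 2;                       % reduce the sampling rate by 2
  t = (0:Fs-1)/Fs;
  x = cos(2*pi*600*t) + cos(2*pi*4000*t);  % 4000 Hz would alias at the new rate Fs/M
  h = fir1(64, 0.9/M);                     % anti-alias low-pass, just below the new Nyquist
  y = filter(h, 1, x);                     % remove the frequencies that would alias
  yd = y(1:M:end);                         % then keep every M-th sample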

Quantization
The second step in the analog-to-digital (A/D) converter is quantization. During the process of quantization, the infinite precision in signal level of an analog signal is converted to a finite precision in signal level for a digital signal. This idea is illustrated in the following figure:
[Figure: a discrete-time signal x[n]; the quantized digital signal xq[n], with levels from -1 to 1 in steps of 0.25; and the quantization error e[n] = xq[n] - x[n].]


The difference between the analog signal and the quantized signal is called quantization error. On today's computers, as well as in Matlab, signal quantization plays a minimal role, because computers can almost always represent signals using 4 billion different signal levels, and often even orders of magnitude more than that. When we can use that many signal levels, the quantization error becomes really small and imperceptible. It may not always be possible to represent signals using that many levels in hearing aids and other small digital devices with limited memory and/or processing power. In that case, quantization introduces quantization noise into the system. Like all types of noise, quantization noise is characterized by its frequency spectrum. The shape of the quantization noise spectrum depends on the type of A/D converter used. We will briefly discuss two A/D converters and their quantization noise spectrum.
1. The sample-and-hold A/D converter. This converter samples analog signals close to the lowest sampling rate that avoids aliasing. At each sampling time, it measures the signal level of the analog signal, and holds it constant for the remainder of the sampling period. This allows the quantizer to match the signal level to an internal table of possible signal levels, and to output the quantized signal level. This type of A/D converter produces quantization noise with a flat spectrum, which means that the quantization noise is equally present in all frequencies.
2. The Sigma-Delta A/D converter. This converter samples analog signals at a very high sampling rate, often 64 times or more than the sampling rate used by the sample-and-hold A/D converter. At each sampling time, it compares the difference between the current signal level and the signal level at the previous sampling time in a somewhat complicated way. If that comparison is positive, the Sigma-Delta converter outputs +1, and if it turns out negative it outputs -1. The reason for this strange design is that the quantization noise of this converter no longer has a flat spectrum. Instead, quantization noise is low for low frequencies and increases gradually with higher frequencies. Combined with the 64-times over-sampling, this noise shaping is a very nice property. It allows subsequent processing with a low-pass filter and downsampling by a factor of 64. The result is a very accurate representation of the input analog signal with very little quantization noise.
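A minimal sketch of uniform quantization to B bits, to make the quantization error visible (the test signal is arbitrary):

  B = 8;                              % bits per sample
  x = 0.9 * sin(2*pi*0.01*(0:999));   % test signal in the range [-1, 1]
  q = 2 / 2^B;                        % step size over the [-1, 1] range
  xq = q * round(x / q);              % round each sample to the nearest level
  e = xq - x;                         % quantization error, |e| <= q/2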


Lecture notes 7 Monday 7/3


Summary of last lecture
Sampling, restoration, aliasing
- To apply digital LTI systems to real-world signals, we need conversions, from analog to digital and from digital to analog
- A/D conversion
  o Sampling
    Converting a continuous-time signal to a discrete-time sequence, or C/D conversion
    Equivalent to multiplying the input signal with an analog impulse train, to select the signal at the sampling times and set it to zero everywhere else, followed by a normalization of the time axis, dividing time by the sampling period T
    Effect on the frequency spectrum: convolving the input signal's spectrum with the spectrum of the analog impulse train, which happens to be an impulse train in frequency. Result: the spectrum becomes periodic, with period Fs
    Normalization in time by T causes normalization in frequency by Fs, plus a switch in units from cycles/sample to radians/sample
  o Quantization: limit signal levels to a finite set of numbers that a computer can represent
- D/A conversion, also called D/C or discrete-time to continuous-time conversion
  o Undoing the normalization converts index to time in seconds, and normalized frequency back into frequency in Hz
  o Interpolating between the samples, which is done by low-pass filtering the signal to keep only one copy of the periodic spectrum
- Aliasing
  o If the sampling frequency is chosen too low, periodic copies of the spectrum will overlap, causing frequencies above Fs/2 to appear below Fs/2
  o An undesired, non-linear distortion of the signal
  o Easily prevented by choosing a high enough sampling rate, or by applying an anti-alias filter
  o An anti-alias filter is a low-pass filter that truncates the input signal's spectrum to the frequencies below Fs/2, so that no frequencies will alias


FIR filter design


Today, we'll discuss finite impulse response (FIR) filter design. FIR filters have a few very nice properties:
- They are always stable, which means that they will never blow up regardless of the input signal
- They almost always have linear phase, which gives them a constant delay
- They are easy to design, and great for filters with piecewise-constant frequency responses
The downside of FIR filters is that they require a long impulse response for certain designs, which makes them computationally intensive and causes a long delay in the system.
Some general notes on FIR filter design
The goal of filter design is to meet the requirements of a desired filter design with a filter that has the lowest possible order. The order of a filter is its length, i.e., its number of samples, minus one. Compare to a first-order polynomial (a line), which is defined by 2 points, a second-order polynomial (a parabola), which is defined by 3 points, etc. The design requirements usually come in the form of an ideal, or brick-wall, filter: a filter whose magnitude response is either 0 or 1 for all frequencies, with instantaneous jumps between the two magnitude levels. See for example the figure below.
[Figure: magnitude responses of an ideal low-pass, band-pass, and high-pass filter (0-5 kHz).]

Such a filter can't be obtained in practice, because it requires an infinitely long impulse response (and one that can't be modeled by a rational system). Therefore, a practical filter will only approximate the ideal filter and have non-ideal properties such as ripple in the pass-band and stop-band, non-zero magnitude in the stop-band, and the inclusion of a transition band. To obtain a minimum-order, or minimum-length, FIR filter, we must be aware of certain design requirements that will increase the length of the filter:
- designing an extreme low-pass, high-pass, or narrow band-pass filter
- steep and narrow transition bands
- high stop-band suppression
- small pass-band ripple
These factors can be summarized in a single line: the more an FIR filter must behave like an ideal brick-wall filter, the longer it needs to be.


FIR filter design methods


There are a number of methods for FIR filter design, which we will cover in more detail here. We will discuss the methods by using low-pass filter design as an example. Once we understand that type of design, we can use the same methods to design band-pass and high-pass filters, and use the low-pass filters as building blocks in more complicated designs.
Windowing (Matlab: fir1)
The windowing technique of FIR filter design works as follows. Suppose we have an expression for the ideal desired frequency response of our filter, and that we can find an analytic expression for the ideal desired impulse response. For example, for an ideal low-pass filter it is known that the ideal impulse response is the sinc function:
Hd(ω) = 1 for |ω| < 0.1π, and Hd(ω) = 0 for 0.1π ≤ |ω| ≤ π

hd[n] = sin(0.1πn) / (πn)

[Figure: the frequency response and the impulse response of an ideal low-pass filter with cut-off frequency 0.1π rad/sample.]

Once we have the analytic expression for the ideal desired impulse response, we can truncate it to the filter length that we desire. Of course, the longer we make the filter, the better we will approximate the ideal frequency response. Truncation of the ideal impulse response has one drawback, which is illustrated in the following figures:
[Figure: magnitude responses of low-pass filters obtained by truncating the ideal impulse response to lengths 7, 21, 51, and 101.]

These figures show the frequency response of a truncated impulse response of increasing length. As the length increases, the approximation of the ideal response becomes better. But the height of the ripples around the band edge does not decrease. This effect is known as the Gibbs phenomenon, and it does not disappear even as the filter length is further increased. To understand this problem and find a solution for it, we have to use the old idea that truncation of a sequence is the same as multiplying the sequence with a rectangular window. And multiplication of two sequences in time is convolution of their spectra in frequency. The figure below visualizes the convolution of the spectrum of a rectangular window with the ideal desired frequency response of our filter.
[Figure: the frequency response of an ideal low-pass filter, the amplitude spectrum of a rectangular window, and the convolution of the two spectra.]


As we can see, the side-lobes of the rectangular window are causing the ripples around the band edge. And as we saw earlier when discussing windows: when we increase the length of the rectangular window, we decrease the width of its main-lobe, but we do not lower its side-lobes. To get lower side-lobes, we need to taper the ideal impulse response with a window instead of truncating it with a rectangular window. As the figure below shows, this significantly reduces the height of the band-edge ripples, at the expense of a wider transition.
[Figure: frequency response of the truncated ideal impulse response vs. the tapered (windowed) ideal impulse response.]
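A minimal sketch of this design method, both by hand and with the roughly equivalent fir1 call (which applies a Hamming window by default); the length 101 is an arbitrary choice:

  L = 101; M = (L-1)/2;           % odd length, symmetric, hence linear phase
  n = -M:M;
  hd = 0.1 * sinc(0.1*n);         % ideal impulse response sin(0.1*pi*n)/(pi*n)
  h = hd .* hamming(L)';          % taper with a Hamming window
  h1 = fir1(L-1, 0.1);            % fir1 performs essentially the same design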

Frequency sampling (Matlab: fir2)
It may not always be easy (or possible) to find an analytic expression for the desired frequency response of our filter. In that case, we can use the frequency sampling technique to design an FIR filter. This technique works as follows. Suppose we have the desired frequency response of our filter. Instead of deriving an analytic expression for the impulse response of that filter, we sample the desired frequency response at N equispaced frequencies. We then take the inverse discrete-time Fourier transform of those samples to determine the length-N FIR filter.

This technique yields an FIR filter that exactly meets our design requirements at the N frequency sample points. But outside of those points, the frequency response has the same ripple problem as we observed with the windowing technique.
[Figure: frequency-sampling design: the ideal frequency response and its samples, the resulting impulse response, and the effective amplitude and magnitude responses.]


We can reduce the height of the ripples by multiplying the impulse response by a window, but then we're no longer guaranteed that the FIR filter meets our design requirements exactly at the N frequency sample points. Another way to get more control over the height of the ripples is by incorporating transition bands in our design, and allowing the frequency samples in the transition bands to take on any value.
[Figure: frequency-sampling design with a transition band: the sampled ideal response with free samples in the transition band, the impulse response, and the effective amplitude and magnitude responses.]

Then, we can freely choose the values of the frequency samples in the transition band to find an impulse response that minimizes the height of the ripple. Solutions to this optimization problem are available in table form for a selected number of designs. Matlab's Signal Processing Toolbox has the function firls, which can solve the optimization problem for any design:

[Figure: firls design: the ideal frequency response with a transition band, the impulse response, and the effective amplitude and magnitude responses.]
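A minimal sketch of both approaches for a response like the one shown above, with a pass-band up to 0.2 and a stop-band from 0.3 (frequencies normalized so that 1 is the Nyquist frequency; the order 40 is arbitrary):

  f = [0 0.2 0.3 1];              % band edge frequencies, 0 to 1 (1 = Nyquist)
  a = [1 1 0 0];                  % desired amplitude at those edges
  h2 = fir2(40, f, a);            % frequency sampling of an interpolated response
  hls = firls(40, f, a);          % least-squares fit, transition band left free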

Equiripple design (Matlab: firpm)
Although the windowing and frequency sampling design methods are easy to understand and implement, they have serious drawbacks:
- The band edges of a design cannot be specified precisely; instead we have to accept whatever band edge locations we obtain after the design
- The amount of pass-band ripple and stop-band ripple can't be controlled independently at the same time
- The ripples are not uniformly distributed over the band intervals; they are higher near band edges and smaller away from band edges

To solve these problems, Parks and McClellan developed what is now called the Parks-McClellan algorithm. Their algorithm allows us to specify the actual values for our pass-band and stop-band edges, which it meets, so there are no surprises there. It also allows independent control over the pass-band and stop-band ripple. And it distributes the ripples uniformly over the band intervals, which reduces the filter order that is required to satisfy the same specifications. The algorithm is too complicated to go into in detail in this course, but we will see how to use it in today's lab.
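A minimal sketch of an equiripple design with firpm, using the same band edges as in the previous sketch:

  f = [0 0.2 0.3 1];              % pass-band 0-0.2, stop-band 0.3-1
  a = [1 1 0 0];
  hpm = firpm(40, f, a);          % equiripple in both bands
  % firpm(40, f, a, [1 10]) would weight the stop-band error 10 times more,
  % trading a larger pass-band ripple for more stop-band suppression.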


Lecture notes 8 Wednesday 7/5


Summary of last lecture
Quantization
- Second step of A/D: representing a continuous signal level on a finite number of levels
- Introduces quantization error or quantization noise
- The spectrum of the quantization noise is important:
  o a sample-and-hold A/D converter has a flat quantization noise spectrum
  o a sigma-delta A/D converter has a shaped quantization noise spectrum, low for low frequencies, high for high frequencies
FIR filter design
- Windowing method (fir1):
  o have an analytic expression for the ideal frequency response
  o find an analytic expression for the ideal impulse response
  o truncate or window the ideal impulse response to the desired length
- Frequency sampling (fir2 and firls):
  o have an analytic expression for the ideal frequency response
  o sample the ideal frequency response at N sample points
  o take the inverse DTFT to find the N-point impulse response
  o optionally, designate the frequency samples in the transition band as free samples, and optimize the impulse response to minimize the difference between the ideal and effective frequency responses
- Equiripple design with the Parks-McClellan algorithm (firpm):
  o puts the ripples uniformly over the pass-band and stop-band for a lower order
  o the Parks-McClellan algorithm computes the optimal impulse response given the frequency response specifications

Practical quantization
Some more research into quantization uncovered some practical information about it:
- CDs use 16-bit quantization (65536 signal levels, good)
- Digital telephones use 8-bit quantization (256 signal levels, ok/poor for speech)
- Signals generated in Matlab use 64 bits (more than 18 billion billion signal levels)
- Wav-files use 16-bit quantization by default, and are capable of 32 bits (more than 4 billion signal levels)
- If a signal uses the maximum range in signal levels (-1 to +1), then its average (RMS) level is roughly 0.7, and 16-bit quantization noise is of the order 20·log10((2/2^16)/0.7) ≈ -87 dB
- For some other numbers of bits used in quantization, quantization noise is of the order: 1-bit = 3 dB, 2-bit = -3 dB, 4-bit = -15 dB, 8-bit = -39 dB, 16-bit = -87 dB, 32-bit = -184 dB, 64-bit = -376 dB
- If a signal doesn't use the maximum signal level range, then quantization noise becomes more audible
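The dB figures above follow directly from the quantization step size relative to the signal level; a minimal sketch:

  B = 16;                         % bits per sample
  q = 2 / 2^B;                    % step size over the [-1, 1] range
  20 * log10(q / 0.7)             % about -87 dB relative to a 0.7 RMS signal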


IIR filter design


Last lecture we saw the design methods for FIR filters. Design of IIR filters works differently, and in general takes the following steps:
- Start from a desired magnitude response
- Select an analog prototype low-pass filter for a Butterworth, Chebyshev, or elliptic filter. Definitions of such prototypes are available in the literature. They are usually defined by their magnitude-squared response instead of their magnitude response. Prototypes are only available for low-pass filters with a fixed cut-off frequency.
- The prototype must go through a frequency band transformation to adjust the cut-off frequency, and to change the type of filter from low-pass to band-pass, high-pass or band-stop, depending on the desired filter
- The filter is then transformed from the analog domain to the digital domain

We will ignore the precise details of these steps, and focus on the characteristics of the various types of IIR filters and the Matlab functions that generate them (according to the steps above).
Butterworth (Matlab: buttord and butter)
A Butterworth filter is characterized by the property that its magnitude response is flat in both the pass-band and the stop-band, and is monotonically decreasing, i.e., it has no ripples. As we can see in the magnitude-squared plot of the Butterworth filter, the magnitude-squared response at F = 0 is always 1 for all N, and at the cut-off frequency it is always 1/2 for all N. As the order of the filter increases, the Butterworth filter approaches an ideal low-pass filter.
Chebyshev Type I (Matlab: cheb1ord and cheby1) and Chebyshev Type II (Matlab: cheb2ord and cheby2)
There are two types of Chebyshev filters. The Chebyshev Type I filters have an equiripple response in the pass-band, while the Chebyshev Type II filters have an equiripple response in the stop-band. Recall our discussion of equiripple FIR filters, where we saw that we can obtain lower-order filters that meet our design requirements when we choose a filter that has an equiripple rather than a monotonic behavior. Likewise, Chebyshev filters provide a lower order than Butterworth filters for the same specifications.
Elliptic (Matlab: ellipord and ellip)
Elliptic filters exhibit equiripple behavior in the pass-band as well as in the stop-band. They are similar in magnitude response characteristics to the FIR equiripple filters. Therefore, elliptic filters are optimum filters in that they achieve the minimum order N for the given specifications, or alternatively, achieve the sharpest transition band for a given order N.
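A minimal sketch designing all four types for the same low-pass specification (a 1000 Hz cut-off at a 10000 Hz sampling rate; the order 6 and the ripple values are arbitrary choices for illustration):

  Wc = 1000 / (10000/2);             % cut-off normalized to the Nyquist frequency
  [bb, ab] = butter(6, Wc);          % Butterworth: monotonic, no ripple
  [b1, a1] = cheby1(6, 0.5, Wc);     % Chebyshev I: 0.5 dB pass-band ripple
  [b2, a2] = cheby2(6, 40, Wc);      % Chebyshev II: 40 dB stop-band ripple
  [be, ae] = ellip(6, 0.5, 40, Wc);  % elliptic: ripple in both bands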

Phase response and group delay


Elliptic filters provide optimal performance in the magnitude-squared response, but have a highly non-linear phase response, which is undesirable in many applications. Although we're mostly concerned with the magnitude response of the filters that we design, phase is still an important issue in the overall system. At the other end of the performance scale are the Butterworth filters, which have a maximally flat magnitude response and require a higher order N to achieve the same stop-band specification. However, they exhibit a fairly linear phase response in their pass-band. The Chebyshev filters have phase characteristics that lie somewhere in between. Therefore, in practical applications you can consider Butterworth as well as Chebyshev filters, in addition to elliptic filters. The choice depends on both the filter order, which influences processing speed and implementation complexity, and the phase characteristics, which control the distortion. When using an IIR filter with a non-linear phase response, it is usually not easy to tell from the phase response how the filter will distort a signal. Another form of filter analysis, called group delay, gives more insight into the distortion of the filter. The fourth plot of the IIR filter plots shows the group delay of each filter. In a group delay plot, the horizontal axis is frequency in Hz or normalized frequency in radians/sample, and the vertical axis is group delay in samples. Group delay is computed as the negative derivative of the phase response with respect to frequency. Group delay indicates how much nearby frequencies in a frequency region will be delayed by the filter. Nearby frequencies in a signal are not perceived individually, but as an average tone with an amplitude envelope. Group delay determines how much the filter will delay the envelope in frequency regions. This is illustrated by the example below:

[Figure: a two-component test signal and its filtered versions.
Component 1: x1(t) = cos(2π·37t)/4 + cos(2π·40t) + cos(2π·43t)/4 (a 40 Hz tone with a 3 Hz envelope).
Component 2: x2(t) = cos(2π·77t)/4 + cos(2π·80t) + cos(2π·83t)/4 (an 80 Hz tone with a 3 Hz envelope).
Middle row: outputs of a linear-phase filter, where each component's phase shift is proportional to its frequency, so both envelopes are delayed by the same amount.
Bottom row: outputs of a non-linear-phase filter, where the phase shifts are not proportional to frequency, so the envelopes are delayed by different amounts.]

Imagine a signal with two components: a 40 Hz sinusoid with a 3 Hz envelope (top left) and an 80 Hz sinusoid with a 3 Hz envelope (top right). When this signal is filtered with an LTI system with a linear phase response (= constant group delay), then the envelope of both components of the signal is delayed by the same amount (second row, left and right), and the signal is not distorted. When the signal is filtered with an LTI system with a non-linear phase response, then the envelope of both components of the signal will be delayed by a different amount, causing distortion in the signal. If a linear phase response is desired, but an IIR filter must be used, there are two ways to improve the phase response of the IIR filter.


Forward-backward filtering
It is possible to remove all of the effects of an IIR filter with a non-linear phase response from a signal by reversing the filtered signal in time, running it through the filter again, and reversing the twice-filtered signal in time again. This technique of forward-backward filtering of a signal is based on a property of the Fourier transform. According to this property, the phase response of the backward filtering operation will precisely cancel the phase response of the forward filtering operation. The magnitude response of both filters is the same, so the twice-filtered signal's magnitude spectrum will be multiplied by the filter's magnitude response twice. The filtfilt function in Matlab's Signal Processing Toolbox implements forward-backward filtering. There are two catches:
- The IIR filter used in forward-backward filtering must be designed with very little pass-band ripple, but may have half the stop-band suppression. Because it will be applied twice, the pass-band ripple will double, as will the stop-band suppression.
- It cannot be applied in real-time, because the entire signal must be known in advance to be able to reverse it in time and run it backward through the filter.
All-pass filters
Another way to improve the phase response of an IIR filter is to cascade it, i.e., follow it, with an all-pass filter. An all-pass filter has a magnitude response that is always 1 for all frequencies. In other words, it does not change the magnitude of any frequencies present in the input. All-pass systems are designed specifically for their phase response. It is usually possible to design an all-pass filter that will compensate for the non-linear phase response of an IIR filter in its pass-band. The overall system may not have exactly linear phase, but it will be closer than without the all-pass filter. This technique can be used in real-time applications, because no signal reversal is required. Cascading filters will, however, increase the total delay in the overall system.
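A minimal sketch of forward-backward filtering with filtfilt, assuming a signal x; the design values are for illustration and follow the first catch above (small pass-band ripple, half the stop-band suppression):

  [b, a] = ellip(4, 0.1, 20, 0.2);   % 0.1 dB ripple, 20 dB stop-band (doubles to 40)
  y1 = filter(b, a, x);              % one pass: non-linear phase remains
  y2 = filtfilt(b, a, x);            % two passes: zero phase, magnitude applied twice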


Butterworth filter plots


Magnitude-squared response, phase response, magnitude (dB) response and group delay of a Butterworth filter of varying order N, with cut-off frequency 1000 Hz at 10000 Hz sampling rate.
[Figure: the four plots for Butterworth filters of order N = 1, 2, and 10.]


Chebyshev type I filter plots


Magnitude-squared response, phase response, magnitude (dB) response and group delay of a Chebyshev type I filter of varying order N, with cut-off frequency 1000 Hz, 0.5 dB pass-band ripple, at 10000 Hz sampling rate.
[Figure: the four plots for Chebyshev type I filters of order N = 1, 2, and 10.]


Chebyshev type II filter plots


Magnitude-squared response, phase response, magnitude (dB) response and group delay of a Chebyshev type II filter of varying order N, with cut-off frequency 1000 Hz, 10 dB stop-band suppression, at 10000 Hz sampling rate.
[Figure: the four plots for Chebyshev type II filters of order N = 1, 2, and 10.]


Elliptic filter plots


Magnitude-squared response, phase response, magnitude (dB) response and group delay of an elliptic filter of varying order N, with cut-off frequency 1000 Hz, 0.5 dB pass-band ripple, 10 dB stop-band suppression, at 10000 Hz sampling rate.
[Figure: the four plots for elliptic filters of order N = 1, 2, and 10.]


Lecture notes 9 Friday 7/7


Summary of last lecture
IIR filter design
- Design process
  o analog low-pass filter prototypes
  o frequency band transformation
  o analog to digital conversion
- Types of IIR filters: Butterworth, Chebyshev I and II, elliptic
- Magnitude response, phase response, group delay
- Non-linear phase response
  o a possible source of distortion
  o forward-backward filtering removes all phase distortion, but can't be applied in real-time
  o could use an all-pass filter to compensate for phase distortion, but not easy to design

Introduction
In the second week of this course we discussed short-term frequency analysis of speech signals in the form of the short-time Fourier transform or the spectrogram. Those analysis tools are generic in the sense that they do not require any prior knowledge about a signal to be successfully applied to that signal. This is a great strength of these techniques and one of the main reasons why they have been so widely applied to many signals. On the other hand, these techniques do not give very specific information about speech signals, precisely because they are so general. If we're interested only in frequency analysis of speech signals, we may want to use different techniques that yield specific information about the speech signal, such as the fundamental frequency and formant frequencies. In this lecture we'll discuss a common model of speech signals that is used for more specific frequency analysis of speech signals, and use that model to perform linear prediction of speech signals.

Acoustical model of speech production¹
¹ Text and figures copied/adapted from Spoken Language Processing, X. Huang et al.


In engineering, speech signals are typically considered to consist of two components: glottal excitation and vocal tract resonances. Separating speech signals into these components allows engineers to define very simple models for both processes. Each of these models can be quantified by a small number of parameters, which compactly represent the salient features of the speech signal, such as the fundamental and formant frequencies.
Glottal excitation
Glottal excitation is the process that describes the events in human speech production that take place between the lungs and the vocal tract. As is probably known, the vocal cords constrict the path from the lungs to the vocal tract. As lung pressure is increased, air flows out of the lungs and through the opening between the vocal cords (the glottis). At one point, the vocal cords are together, thereby blocking the airflow, which builds up pressure behind them. Eventually the pressure reaches a level sufficient to force the vocal cords to open, which allows air to flow through the glottis. Then the pressure in the glottis falls and, if the tension in the vocal cords is properly adjusted, the reduced pressure allows the cords to come together, and the cycle is repeated. This condition of sustained oscillation occurs for voiced sounds, and is illustrated in the figure below.

This glottal excitation can be further separated into an impulse train that drives a glottal pulse FIR filter g[n]:

For unvoiced sounds, the airflow between the lungs and the vocal tract is obstructed only very little, or not at all, by the vocal cords. In that case, the glottal excitation consists mostly of turbulence, which is modeled as random noise:

This model of the glottal excitation is a decent approximation, but fails on voiced fricatives, since those sounds contain both a periodic component and an aspirated component. In this case, a mixed excitation model can be applied, using a sum of both an impulse train and random noise.
Lossless tube concatenation
A widely used model for speech production is based on the assumption that the vocal tract can be represented as a concatenation of lossless tubes, as shown in the figure below:

The constant cross-sectional areas A1, A2, …, A5 of the tubes approximate the continuous area function A(x) of the vocal tract. In this model of the vocal tract a number of things are ignored, such as the vocal tract's three-dimensional bend, its elasticity, viscosity and thermal conduction. By leaving those aspects out of the model, the sound waves in the tubes satisfy a pair of differential equations, which can be solved to find the system that models the vocal tract frequency response. In general, the concatenation of N lossless tubes results in an IIR system with an N-th order feedback sequence and a feed-forward sequence that is only a gain. The N-th order feedback sequence causes at most N/2 resonances or formants in the vocal tract. These resonances occur when a given frequency gets trapped in the vocal tract because it is reflected at the lips and then again at the glottis. The number of tubes required to accurately model the formants in a speech signal generated by a given vocal tract depends on the physical length L of the vocal tract, the sampling frequency Fs of the speech signal, and the speed of sound c, as follows:

N = 2·L·Fs / c

For example, for Fs = 8000 Hz, c = 34000 cm/s, and L = 17 cm, the average length of a male adult vocal tract, we obtain N = 8, or alternatively 4 formants. Experimentally, the vocal tract system has been observed to have approximately 1 formant per kHz. Shorter vocal tract lengths (females or children) have fewer resonances per kHz and vice versa.
Source-filter models of speech production
For a total model of human speech production, we combine the glottal excitation with the lossless tube concatenation model into a mixed excitation model, as shown below:

[Figure: source-filter model of speech production, combining voiced and unvoiced excitation with the vocal tract filter H(ω).]

Mathematically, this mixed excitation model can be expressed as

s[n] = { v[n]·(sf0[n] ∗ g[n]) + u[n]·w[n] } ∗ h[n]


where ∗ denotes convolution, v[n] and u[n] control the amount of mixing between voiced and unvoiced excitation, sf0[n] is the impulse train at the fundamental frequency that drives the glottal pulse filter g[n], w[n] is the random noise modeling the unvoiced excitation, and h[n] is the vocal tract filter. Despite all the approximations made, the combination of glottal excitation and the lossless tube concatenation model represents the human speech production process reasonably well. Inspired by these results, we will discuss linear predictive coding (LPC) and cepstral analysis of speech, which are both based on this model.

Linear predictive coding (LPC)


A very powerful method for speech analysis is based on linear predictive coding (LPC), also known as LPC analysis or auto-regressive (AR) modeling. This method is widely used because it is fast and simple, yet an effective way of estimating the main parameters of speech signals. Linear predictive coding is based on the idea that there is significant redundancy in a speech signal, such that the current sample can be predicted quite accurately from a linear combination of past samples. For example, for linear prediction from the past three samples, we would get
x̂[n] = a1·x[n-1] + a2·x[n-2] + a3·x[n-3]


The prediction error e[n] is defined as the difference between the speech signal and the predicted speech signal:

e[n] = x[n] - x̂[n] = x[n] - (a1·x[n-1] + a2·x[n-2] + a3·x[n-3])

Generally, the number of past samples p that are used to predict the current sample can be chosen freely. The goal of linear predictive coding is to find values for the prediction coefficients a1,…,ap that minimize the sum of squared prediction errors (a common engineering error measure):
E = Σn e²[n] = Σn (x[n] - x̂[n])² = Σn ( x[n] - Σi=1..p ai·x[n-i] )²

As with spectral analysis of speech, linear predictive coding of speech is most useful when applied to short-term sections of speech with relatively constant spectrum. For that reason, speech signals are usually broken up into short sections and multiplied by a (Hamming) window, and a separate LPC analysis is performed on each windowed section of the signal. When LPC analysis is performed on N samples of a speech signal, the task of minimizing the prediction error can be expressed in a system of N + p linear equations in p unknowns, for example in the case of p = 3:

x̂[0] = 0
x̂[1] = a1·x[0]
x̂[2] = a1·x[1] + a2·x[0]
x̂[3] = a1·x[2] + a2·x[1] + a3·x[0]
x̂[4] = a1·x[3] + a2·x[2] + a3·x[1]
…
x̂[N] = a1·x[N-1] + a2·x[N-2] + a3·x[N-3]
x̂[N+1] = a1·x[N] + a2·x[N-1] + a3·x[N-2]
x̂[N+2] = a2·x[N] + a3·x[N-1]
x̂[N+3] = a3·x[N]
x̂[N+4] = 0

There are techniques from linear algebra available to find values for a1,…,ap that minimize the prediction error. Those techniques form the basis of two common methods for linear prediction, called the autocorrelation method and the covariance method. After LPC analysis of a speech signal x[n], we obtain values for the LPC coefficients a1,…,ap and the prediction error e[n]. To reconstruct the original speech signal from the LPC coefficients and the prediction error, we first form an IIR vocal tract filter by using the LPC coefficients as the feedback coefficients of the IIR filter. We then filter the prediction error e[n] with that IIR filter to obtain the original speech signal x[n].
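In Matlab, the autocorrelation method is available as the lpc function (Signal Processing Toolbox). A minimal sketch of analysis and reconstruction, assuming x is a windowed speech segment:

  p = 10;                         % prediction order, e.g. 1 per kHz plus a few
  a = lpc(x, p);                  % a = [1 -a1 -a2 ... -ap]
  e = filter(a, 1, x);            % prediction error (excitation estimate)
  xr = filter(1, a, e);           % IIR vocal tract filter restores x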


LPC analysis of speech separates the speech signal into a vocal tract filter and an excitation source, which is illustrated by the following figure.

[Figure: LPC analysis of a speech segment. Top: the original speech x[n] and its spectrum X(ω). Middle: the excitation signal e[n] and its spectrum E(ω). Bottom: the impulse response h[n] and frequency response H(ω) of the vocal tract filter.]


Lecture notes 10 Monday 7/10


Summary of last lecture
Acoustical model of speech production
- glottal excitation: random noise (voiceless) or impulse train & glottal pulse shape (voiced)
- vocal tract filter: concatenation of lossless tubes, modeled as an all-pole IIR filter
Linear predictive coding (LPC)
- predict the current sample from past samples
- the order of the LPC analysis depends on the signal, usually 1 per kHz plus 2-4 to model glottal effects and such

The cepstrum
The word "cepstrum" is a play on the word "spectrum", created by reversing the first syllable of the word. The term was coined by its inventors Bogert, Healy and Tukey in their paper "The Quefrency alanysis of time series for echoes". Originally designed to analyze echoes, today the cepstrum has taken a dominant place in speech recognition. In today's lecture, we will study the cepstrum and its properties.
Homomorphic processing
The cepstrum is one type of homomorphic transformation. A homomorphic transformation

x̂[n] = D{x[n]}

is a transformation that converts a convolution,

x[n] = e[n] ∗ h[n],

into a sum:

x̂[n] = D{x[n]} = D{e[n] ∗ h[n]} = D{e[n]} + D{h[n]} = ê[n] + ĥ[n]

We will see that the cepstrum, as a homomorphic transformation, allows us to separate the glottal excitation e[n] from the vocal tract filter h[n]. This separation is possible because we can find a value N such that the cepstrum of the filter satisfies ĥ[n] ≈ 0 for n ≥ N and the cepstrum of the excitation satisfies ê[n] ≈ 0 for n < N. With this assumption, we can approximately recover both sequences e[n] and h[n] from the cepstrum x̂[n] by homomorphic filtering. This is illustrated by the figures below.


[Figure: homomorphic filtering of a speech signal. Top row: a speech signal x[n] and its spectrum X(ω). Below: the cepstrum of x[n]; the cepstrum of e[n], obtained by selecting the glottal excitation cepstral coefficients; and the cepstrum of h[n], obtained by selecting the vocal tract cepstral coefficients. Bottom rows: the separated excitation signal e[n] and vocal tract impulse response h[n], and their spectra E(ω) and H(ω).]


Definition of the cepstrum
The cepstrum transforms a convolution to an addition in three steps. Suppose x[n] = e[n] ∗ h[n]:
1. Take the DTFT of (a windowed version of) x[n]. This results in the spectrum X(ω) of the signal x[n], and transforms the convolution (in time) into a multiplication (in frequency):

X(ω) = E(ω)·H(ω)

2. Take the log of X(ω) (usually the natural log, base e); this results in the log-spectrum of x[n], and transforms the multiplication into an addition:

log X(ω) = log(E(ω)·H(ω)) = log E(ω) + log H(ω)

3. Take the inverse DTFT of the result.
Recall that the spectrum X(ω) is complex. There are two versions of the cepstrum defined: the real cepstrum and the complex cepstrum. In the real cepstrum, the real logarithm is taken of the magnitude of the spectrum, |X(ω)|, in step 2. In the complex cepstrum, the complex logarithm is taken of the spectrum in step 2. The details of the complex logarithm are not important to us. But what is important is that if we want to reconstruct a signal from a (modified) cepstrum, we must use the complex cepstrum. In the real cepstrum, the phase information of X(ω) is lost, and reconstruction is only possible under a minimum-phase assumption that usually does not give the desired reconstruction result. In speech recognition, reconstruction is not required and therefore the simpler real cepstrum is used.
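A minimal sketch of the three steps for the real cepstrum, assuming x is a row-vector speech segment (the Signal Processing Toolbox function rceps computes the same thing for an unwindowed input):

  X = fft(x .* hamming(length(x))');  % step 1: spectrum of the windowed segment
  c = real(ifft(log(abs(X))));        % steps 2 and 3: log magnitude, inverse DTFT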
Cepstral features of voiced speech
For the example voiced segment of speech shown in the figure below,

[Figure: a voiced segment of a speech signal x[n], amplitude vs. index n.]

the cepstrum is shown in the following figure:

[Figure: the cepstrum of x[n], amplitude vs. quefrency.]

In the cepstrum of this voiced segment of speech, we can identify the following features:
1. High energy content around low quefrencies
2. Decay towards higher quefrencies
3. A distinct peak around the fundamental quefrency
The first feature is caused by the resonances in the vocal tract. The second feature can be attributed to a property of the cepstrum of an impulse train: it can be shown that the cepstrum of an impulse train is a decaying function. The third feature is caused by the echo detection capabilities of the cepstrum; the repeated glottal periods are detected as echoes.
Separating the glottal excitation from the vocal tract
To separate the glottal excitation from the vocal tract response, we must choose a value N such that the vocal tract response has most of its energy in the quefrencies below N, and the glottal excitation has most of its energy in the quefrencies above N. Speech recognition researchers have shown empirically that a value of N between 12 and 20 is a good choice to separate the glottal excitation from the vocal tract response, depending on the sampling rate and whether frequency warping is done. An example of this separation is shown in the figures earlier in this lecture, where N = 15.
Frequency warping: the Mel-frequency cepstrum
The Mel-frequency cepstrum differs from the real cepstrum in that it uses a non-linear frequency scale, which approximates the behavior of the auditory system.

The Mel-frequency cepstrum works as follows:
1. Given a (windowed) input signal x[n], compute its N-point DTFT X[k], for k = 0,…,N-1
2. Construct M filterbank outputs from the N-point DTFT by multiplying the N-point DTFT with triangular filters, as shown below

3. Compute the log of the energy at the output of each filter of the filterbank, as

S[m] = log( Σk=0..N-1 |X[k]|²·Hm[k] )

4. Take a modified inverse DTFT (a discrete cosine transform) of the result:

x̂[n] = Σm=0..M-1 S[m]·cos( πn(m + 1/2) / M )

The Mel-frequency cepstrum is used extensively as a feature vector for speech recognition systems. For speech recognition, only the first 13 cepstrum coefficients are used, and are referred to as the Mel-frequency cepstral coefficients (MFCC). Note that the Mel-frequency cepstrum is no longer a homomorphic transformation, due to the computation of the weighted log-energy in step 3. In practice, however, the MFCC representation is approximately homomorphic for filters that have a smooth frequency response. The advantage of the MFCC representation is that the filter energies are more robust to noise and spectral estimation errors.


Lecture notes 11 Wednesday 7/12


Summary of last lecture
Cepstrum
- Homomorphic transform: converts a convolution of two sequences into a summation of their cepstra
- Computed by taking the DTFT of a sequence, taking its log (or the log of its magnitude), and then taking the inverse DTFT
- Glottal excitation and vocal tract response are separable in the cepstrum of a speech signal; the glottal excitation is the high-time part of the cepstrum, the vocal tract response is the low-time part
Frequency warping: Mel-frequency cepstrum
- Models the frequency selectivity of the human ear
- Triangular weighting imposed on samples of the DTFT
- Narrow at low frequencies, wider at high frequencies

Guest lecture: Modulation analysis of speech


What is modulation analysis?
Consider a single 600 Hz sinusoid, sampled at 10 kHz:
[Figure: the 600 Hz sinusoid; amplitude vs. time (s), 0 to 0.1]

with spectrogram
[Figure: spectrogram of the sinusoid; frequency (Hz), 0 to 5000, vs. time (s)]


The simplest form of modulation analysis is to take a second Fourier transform along time of the spectrogram (magnitude). In the case of the signal above, the modulation spectrogram is
[Figure: modulation spectrogram of the sinusoid; frequency (Hz), 0 to 5000, vs. modulation frequency (Hz), 0 to 35]

This figure shows modulation frequency on the horizontal axis, and regular acoustic frequency on the vertical axis. It tells us that the signal contains a sinusoid at 600 Hz, and that the envelope of that sinusoid is constant (DC only). Consider instead a 600 Hz sinusoid with an envelope that oscillates between 0.05 and 0.95 at 20 Hz:
[Figure: the modulated 600 Hz sinusoid; amplitude vs. time (s), 0 to 0.1]

with spectrogram
[Figure: spectrogram of the modulated sinusoid; frequency (Hz), 0 to 5000, vs. time (s)]

This signal has the following modulation spectrogram:

[Figure: modulation spectrogram of the modulated sinusoid; frequency (Hz), 0 to 5000, vs. modulation frequency (Hz), 0 to 35]

As you can see, the spectrogram shows that the signal is a modulated sinusoid, but the modulation spectrogram quantifies the modulation as a 20 Hz modulator. A sketch of this two-stage analysis is shown below.
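Here is a minimal sketch of the two-stage computation, applied to the modulated sinusoid above: an ordinary spectrogram, followed by a second Fourier transform along time of its magnitude. The signal duration, window length, and hop size are our own assumptions:

```python
import numpy as np
from scipy.signal import spectrogram

fs = 10000
t = np.arange(0, 1.0, 1.0 / fs)
env = 0.5 + 0.45 * np.cos(2 * np.pi * 20 * t)    # envelope oscillating 0.05..0.95 at 20 Hz
x = env * np.sin(2 * np.pi * 600 * t)            # 600 Hz carrier

# Stage 1: ordinary magnitude spectrogram (the DTFT filterbank)
f, frames, S = spectrogram(x, fs=fs, nperseg=256, noverlap=192, mode='magnitude')

# Stage 2: a second Fourier transform along time, one acoustic-frequency row at a time
Smod = np.abs(np.fft.rfft(S, axis=1))
fmod = np.fft.rfftfreq(S.shape[1], d=frames[1] - frames[0])

# The row near 600 Hz should peak at modulation frequency 0 (DC) and ~20 Hz
row = np.argmin(np.abs(f - 600.0))
print("modulation peak near", fmod[1 + np.argmax(Smod[row, 1:])], "Hz")
```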

General framework of modulation analysis and filtering
The following figure shows the general framework for modulation analysis and filtering.

[Figure: block diagram. Input x(t) -> filter bank -> subband signals; each subband -> envelope detection (envelope a(t)) and carrier detection (carrier c(t)); envelopes -> LTI filter -> modified envelopes â(t); â(t) recombined with the carriers -> modified subbands -> reconstruction -> output x̂(t)]
First, a signal x(t) is passed through a filterbank, characterized by a set of band-pass filters, to obtain the subband signals. In each of the subbands, an envelope and carrier detection operation takes place. It separates the subband signal into a slowly changing envelope and the subband fine structure (the carrier). When multiplied, the envelope and the carrier reconstruct the subband signal.

For modulation analysis, a modulation spectrogram is constructed by computing the spectrum of the envelope signals and stacking all envelope spectra vertically in a single convenient plot. For modulation filtering, the envelopes are filtered by an LTI filter, and the modified envelopes are recombined with the (unmodified, but delayed) carriers to obtain the modified subband signals. Then the modulation-filtered signal x̂(t) is reconstructed from the modified subband signals.

The filterbank in the general framework is often implemented efficiently by using the discrete-time Fourier transform. This restricts the subbands to equal bandwidths, uniformly spaced in frequency. A sketch of this modulation-filtering path follows.
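A minimal sketch of modulation filtering, using the STFT as the DTFT filterbank and the Fourier magnitude/phase for envelope and carrier detection. The cutoff, filter order, and window settings are illustrative, and the zero-phase filtfilt stands in for the LTI filter plus the carrier-delay compensation described above:

```python
import numpy as np
from scipy.signal import stft, istft, butter, filtfilt

def modulation_lowpass(x, fs, fc=8.0, nperseg=256, noverlap=192):
    """Incoherent modulation filtering: STFT filterbank, magnitude/phase as
    envelope/carrier, low-pass LTI filtering of each subband envelope."""
    f, frames, X = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
    env = np.abs(X)                                  # envelope detection
    carrier = np.exp(1j * np.angle(X))               # carrier detection
    frame_rate = fs / (nperseg - noverlap)
    b, a = butter(4, fc / (frame_rate / 2.0))        # LTI low-pass for the envelopes
    env_mod = np.maximum(filtfilt(b, a, env, axis=1), 0.0)  # envelopes stay non-negative
    _, y = istft(env_mod * carrier, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return y                                         # reconstruction of the filtered signal
```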


In our work, we distinguish two types of envelope detection, which we refer to as incoherent envelope detection and coherent envelope detection. With incoherent envelope detection, the envelope and carrier signals are detected independently, for example as the Fourier magnitude and phase, or as the Hilbert envelope and carrier. With coherent envelope detection, a carrier signal is estimated from the subband signal, and that carrier is then used to define the envelope signal. In the signals above, the modulation spectrogram was computed using the DTFT as a filterbank, and the Fourier magnitude and phase were used for envelope and carrier detection. A sketch of Hilbert-style incoherent detection follows.
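A minimal sketch of Hilbert-style incoherent detection for a single subband signal (the function name is ours):

```python
import numpy as np
from scipy.signal import hilbert

def hilbert_envelope_carrier(subband):
    """Incoherent detection: Hilbert envelope and unit-magnitude carrier.
    np.real(envelope * carrier) recovers the subband signal."""
    analytic = hilbert(subband)                       # analytic signal
    envelope = np.abs(analytic)                       # slowly varying envelope
    carrier = analytic / np.maximum(envelope, 1e-12)  # fine structure, |carrier| = 1
    return envelope, carrier
```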
Modulation spectrogram of speech
Instead of the artificial signal above, let's examine a similar modulation spectrogram of the "zero" speech signal:

[Figure: the "zero" speech signal; amplitude vs. time (s), 0 to 0.8]

with (narrow-band) spectrogram


[Figure: narrow-band spectrogram of the "zero" signal; frequency (Hz), 0 to 5000, vs. time (s), 0 to 0.7]

Its modulation spectrogram is


[Figure: narrow-band modulation spectrogram; frequency (Hz), 0 to 5000, vs. modulation frequency (Hz), 0 to 35]

In the case of a narrow-band spectrogram, the maximum modulation frequency we can observe is lower than the fundamental frequency of the speech signal. Hence, the fundamental frequency resolves in acoustic frequency, visible in the modulation spectrogram above as the region of high energy (red) in the bottom left corner. When we compute a wide-band spectrogram of the speech signal, we obtain the following spectrogram and modulation spectrogram
[Figure: wide-band spectrogram; frequency (Hz), 0 to 5000, vs. time (s), 0 to 0.7]

[Figure: wide-band modulation spectrogram; frequency (Hz), 0 to 5000, vs. modulation frequency (Hz), 0 to 150]


In this modulation spectrogram, we can clearly see that the fundamental frequency resolves in modulation frequency (the vertical region of energy around 110 Hz in modulation frequency), and how the formants resolve in acoustic frequency (the peaks of very high energy around 600 Hz and 1250 Hz in acoustic frequency and 0 Hz in modulation frequency).

Real envelope vs. complex envelope
In the examples above, we have seen the incoherent modulation spectrogram, which uses the Fourier magnitude and phase as the envelope and carrier detection operations. There is an ongoing debate among researchers in modulation transforms about whether real-world signals justify the use of the magnitude (or Hilbert) envelope. The general trend is that purely real envelopes, such as the magnitude or Hilbert envelope, are a poor model for the envelopes found in the subbands of real-world signals such as speech and music. We have been arguing for the use of complex-valued envelopes.

In order to obtain a complex-valued subband envelope, we estimate a carrier signal from a given subband signal, and then define the envelope to be the complementary part of the subband. To estimate the carrier, we use a technique called instantaneous frequency (IF) estimation. The method of IF estimation that we currently use is borrowed from another technique called frequency reassignment.
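As an illustration only, here is a simple instantaneous-frequency estimator based on the analytic phase. This is a common textbook stand-in, not the frequency-reassignment method referred to above:

```python
import numpy as np
from scipy.signal import hilbert

def instantaneous_frequency(subband, fs):
    """IF estimate as the derivative of the unwrapped analytic phase (in Hz).
    The result is one sample shorter than the input."""
    phase = np.unwrap(np.angle(hilbert(subband)))
    return np.diff(phase) * fs / (2.0 * np.pi)
```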
