SPEECH ENHANCEMENT

Chunjian Li
Aalborg University, Denmark

3/22/2006 Lecture notes for Speech Communications


Introduction
Applications:
- Improving quality and intelligibility (hearing
aid, cockpit comm., video conferencing ...)
- Source coding (mobile phone, video
conferencing, IP phone ...)
- Pre-processor for other speech processing
applications (speech recognition, speaker
verification ...)

Introduction
Classification 1
- Single channel
- Multi-channel
* with acoustic barrier (Adaptive Noise Cancelling)
* without acoustic barrier (Beamforming, ICA)
Classification 2
- Spectral domain methods (Power Spectral Subtraction,
Amplitude Spectral Subtraction, Autocorrelation Subtraction, Non-
causal IIR Wiener Filtering)
- Time domain methods (Kalman Filter)
- Adaptive noise cancelling
- Adaptive comb filtering

Single Channel Speech
Enhancement
Stochastic Model
- Noise process: broadband, stationary (or
short-time stationary), uncorrelated to
speech, additive.
- Speech process: short-time stationary.
- Need short-time processing

$$y(n; m) = s(n; m) + d(n; m)$$

Single Channel Speech
Enhancement
Important relation in the Power Spectrum
domain:

$$\Gamma_y(\omega; m) = \Gamma_s(\omega; m) + \Gamma_d(\omega; m)$$

This is true only when the noise is uncorrelated with the speech signal.

* To be concise, the index m is dropped in the following discussion.
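
The role of the uncorrelatedness assumption can be seen by expanding the expected power of the noisy spectrum. This is only a sketch: S_y, S_s and S_d are assumed here to denote the short-time Fourier transforms of y, s and d, and normalization constants are ignored.

```latex
\begin{aligned}
E\{|S_y(\omega)|^2\} &= E\{|S_s(\omega) + S_d(\omega)|^2\} \\
&= E\{|S_s(\omega)|^2\} + E\{|S_d(\omega)|^2\}
   + 2\,\mathrm{Re}\,E\{S_s(\omega)\,S_d^{*}(\omega)\}
\end{aligned}
```

The cross term vanishes when speech and noise are uncorrelated (and at least one of them is zero-mean), which leaves the additive relation between the power spectra.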

Power Spectral Subtraction

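$$\hat{\Gamma}_s(\omega) = \Gamma_y(\omega) - \hat{\Gamma}_d(\omega)$$

$$|\hat{S}_s(\omega)|^2 = N\,\hat{\Gamma}_s(\omega)$$

$$\hat{S}_s(\omega) = |\hat{S}_s(\omega)|\, e^{j\varphi_y(\omega)} \qquad (1)$$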

* Power Spectral Subtraction methods use the noisy phase spectrum to synthesize the enhanced signal.
Generalized Spectral
Subtraction and its variants
Generalization
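Eq. (1) can be written as:

$$\hat{S}_s(\omega) = \left[\,|S_y(\omega)|^{\alpha} - |\hat{S}_d(\omega)|^{\alpha}\,\right]^{1/\alpha} e^{j\varphi_y(\omega)} \qquad (2)$$

Power Spectral Subtraction corresponds to $\alpha = 2$. When $\alpha = 1$, Eq. (2) is called Amplitude Spectral Subtraction (Boll, 1979).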
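
A minimal NumPy sketch of the generalized subtraction rule for a single frame is given below. The function name, the `noise_mag` input (an estimate of the noise magnitude spectrum, e.g. averaged over non-speech frames) and the use of half-wave rectification for negative differences are assumptions made for illustration, not part of the original notes.

```python
import numpy as np

def spectral_subtraction(noisy_frame, noise_mag, alpha=2.0):
    """Generalized spectral subtraction of one frame.

    noisy_frame: real-valued time-domain frame
    noise_mag:   estimated noise magnitude spectrum on the rfft grid
                 (hypothetical input, length len(noisy_frame) // 2 + 1)
    alpha:       2.0 -> power subtraction, 1.0 -> amplitude subtraction
    """
    noisy_frame = np.asarray(noisy_frame, dtype=float)
    spectrum = np.fft.rfft(noisy_frame)
    diff = np.abs(spectrum) ** alpha - np.asarray(noise_mag, dtype=float) ** alpha
    diff = np.maximum(diff, 0.0)            # half-wave rectification (assumption)
    magnitude = diff ** (1.0 / alpha)
    # Re-synthesize with the phase of the noisy observation.
    enhanced = magnitude * np.exp(1j * np.angle(spectrum))
    return np.fft.irfft(enhanced, n=len(noisy_frame))

# Illustrative check on a synthetic frame (not real speech).
frame = np.sin(2 * np.pi * 5 * np.arange(64) / 64)
unchanged = spectral_subtraction(frame, np.zeros(33), alpha=1.0)
silenced = spectral_subtraction(frame, np.full(33, 1e3), alpha=2.0)
```

With a zero noise estimate the frame passes through unchanged, and with a noise estimate far above the signal level the rectification removes everything.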

Variant: Correlation subtraction

$$\hat{r}_s(\eta; m) = r_y(\eta; m) - \hat{r}_d(\eta; m)$$

Comments on Spectral
Subtraction methods
Low complexity
Severe musical noise
Usually needs further enhancement
- Smoothing in time and frequency; rectification

[Audio samples: Amplitude Spectral Subtraction; Power Spectral Subtraction; noisy speech sample]

Comments on Spectral
Subtraction methods
Oversuppressing and smoothing can
reduce residual noise but result in
distortion to the speech spectrum.

[Audio samples: oversuppressing ASS; oversuppressing PSS; smoothing in time]

Wiener Filtering
The non-causal infinite impulse response
Wiener filter (hereafter Non-causal IIR
Wiener Filter) is regarded as a spectral
domain method, although the filtering
problem is formulated in the time domain.
The Non-causal IIR Wiener filter with AR
modeling of speech can be employed in an
iterative manner, such that signal
estimation and parameter estimation are
each based on the other.
Noncausal Wiener Filter
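A linear Minimum Mean Squared Error filter:

$$\hat{s}(n) = \sum_{q=-\infty}^{\infty} h(q)\, y(n-q) = h(n) * y(n)$$

Orthogonality principle:

$$E\Big\{\Big[s(n) - \sum_{q} h(q)\, y(n-q)\Big]\, y(n-k)\Big\} = 0 \quad \text{for all } k$$

$$R_{ys}(k) = \sum_{q} h(q)\, R_{yy}(k-q) \qquad \text{(Wiener-Hopf equation)}$$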

Noncausal Wiener Filter
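Orthogonality principle (frequency domain):

$$\Gamma_{ys}(\omega) = H(\omega)\,\Gamma_{yy}(\omega)$$

Transfer function:

$$H(\omega) = \frac{\Gamma_{ys}(\omega)}{\Gamma_{yy}(\omega)}$$

MSE of the Wiener filter:

$$E[(s(n) - \hat{s}(n))^2] = \sigma_s^2 - \sum_{q} h(q)\, R_{ys}(q)$$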
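
For the additive model y = s + d with noise uncorrelated with the speech, the transfer function reduces to a real gain. A short derivation follows, writing $\Gamma_s$ and $\Gamma_d$ for the speech and noise power spectra (the symbol $\Gamma$ is an assumed notation for power spectra here).

```latex
\Gamma_{ys}(\omega) = \Gamma_{s}(\omega), \qquad
\Gamma_{yy}(\omega) = \Gamma_{s}(\omega) + \Gamma_{d}(\omega)
\quad\Longrightarrow\quad
H(\omega) = \frac{\Gamma_{s}(\omega)}{\Gamma_{s}(\omega) + \Gamma_{d}(\omega)}
```

Because this gain is real and lies between 0 and 1, the filter scales each spectral component and leaves its phase untouched.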

Comments on Noncausal WF
Requires estimates of the power
spectra of speech and noise.
Performance depends very much on
the accuracy of these estimates.
WF oversuppresses the speech
spectrum, resulting in a muffling effect.
WF does not process the phase spectrum.
Comments on Noncausal WF
Muffling effect caused by over-suppression

Blue: Original
Black: Wiener filter
Green: Square-root Wiener filter
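
The over-suppression can be illustrated numerically for a single frequency bin. The sketch below assumes the gain form H = xi / (1 + xi), with xi the ratio of speech power to noise power, and compares the expected output power of the Wiener filter and of its square-root variant. Function names and the example numbers are illustrative only.

```python
import math

def wiener_gain(xi):
    """Wiener gain for a speech-to-noise power ratio xi (linear scale)."""
    return xi / (1.0 + xi)

def sqrt_wiener_gain(xi):
    """Square-root Wiener gain."""
    return math.sqrt(wiener_gain(xi))

def expected_output_power(gain, speech_power, noise_power):
    """E|H Y|^2 = H^2 (speech + noise) for uncorrelated speech and noise."""
    return gain ** 2 * (speech_power + noise_power)

speech_power, noise_power = 1.0, 1.0      # a 0 dB bin (illustrative numbers)
xi = speech_power / noise_power
p_wiener = expected_output_power(wiener_gain(xi), speech_power, noise_power)
p_sqrt = expected_output_power(sqrt_wiener_gain(xi), speech_power, noise_power)
print(p_wiener, p_sqrt)
```

In this example the Wiener output power is about half of the true speech power, while the square-root Wiener output power matches it, which is consistent with the muffling attributed to the Wiener filter.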

Comments on Noncausal WF
Roughness caused by phase noise
The phase spectrum is not processed, which results in
a loss of phase coherence in voiced speech. The
effect is called roughness or reverberance.
[Audio samples of muffling and roughness: clean samples; muffling; roughness; muffling and roughness]

Iterative Wiener Filtering
A parametric method using an all-pole
model
A sequential MAP estimator of both
speech waveform and LP coefficients.
[Lim, Oppenheim 1978]

Iterative Wiener Filtering
All-pole modeling of speech
- The speech amplitude spectrum can be well modeled
by an all-pole transfer function (the vocal tract)
excited by white noise or a pulse train (the glottal
pulses). The coefficients of the all-pole model can
be found by Linear Prediction analysis and are thus
called LP coefficients; the excitation is called the
residual.
- The LP model is minimum phase, which is
generally not the true phase of the vocal tract.
Iterative Wiener Filtering
The algorithm:
1. Estimate the LP coef. from the noisy
observation samples. Estimate the noise
spectrum during nonspeech activity.
2. Estimate the signal using the noncausal IIR WF
given the current estimate of LP coef. and current
estimate of the noise spectrum.
3. Estimate the LP coef. again given the current
estimate of the waveform.
4. Iterate until the convergence criterion is satisfied.
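
The steps above can be sketched for a single frame as follows. This is an illustration under stated assumptions, not the reference algorithm: LP analysis uses the autocorrelation method, the Wiener filter is applied through the FFT (so the convolution is circular), the noise power spectrum is taken as known, and a fixed iteration count stands in for the convergence criterion. The helper names and the synthetic AR(2) test signal are hypothetical.

```python
import numpy as np

def lp_analysis(frame, order):
    """Autocorrelation-method LP analysis.

    Returns (a, err): a = [1, a_1, ..., a_p] are the coefficients of A(z),
    err is the prediction-error power (gain of the all-pole model).
    """
    n = len(frame)
    r = np.array([frame[:n - k] @ frame[k:] for k in range(order + 1)]) / n
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R = R + 1e-9 * r[0] * np.eye(order)       # tiny diagonal loading (assumption)
    predictor = np.linalg.solve(R, r[1:])
    a = np.concatenate(([1.0], -predictor))
    err = r[0] - predictor @ r[1:]
    return a, err

def iterative_wiener(noisy, noise_psd, order=10, n_iter=3):
    """Alternate LP estimation and Wiener filtering on one frame."""
    n = len(noisy)
    Y = np.fft.rfft(noisy)
    estimate = np.array(noisy, dtype=float)
    for _ in range(n_iter):                    # fixed count as heuristic stop
        a, err = lp_analysis(estimate, order)                  # steps 1 and 3
        speech_psd = err / np.abs(np.fft.rfft(a, n)) ** 2      # all-pole spectrum
        gain = speech_psd / (speech_psd + noise_psd)           # step 2
        estimate = np.fft.irfft(gain * Y, n)
    return estimate

# Synthetic demonstration: an AR(2) resonance as a stand-in for speech.
rng = np.random.default_rng(0)
n = 512
excitation = rng.standard_normal(n)
clean = np.zeros(n)
for i in range(n):
    clean[i] = excitation[i]
    if i >= 1:
        clean[i] += 1.3435 * clean[i - 1]
    if i >= 2:
        clean[i] -= 0.9025 * clean[i - 2]
noise_var = np.var(clean) / 10 ** (5 / 10)     # 5 dB SNR, white noise
noisy = clean + np.sqrt(noise_var) * rng.standard_normal(n)
enhanced = iterative_wiener(noisy, noise_var, order=10, n_iter=3)
mse_in = np.mean((noisy - clean) ** 2)
mse_out = np.mean((enhanced - clean) ** 2)
```

Here the noise power spectrum is passed as a scalar because the noise is white; for colored noise it would be an array on the same frequency grid as `np.fft.rfft`.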

Iterative Wiener Filtering
Comments:
- Convergence is not guaranteed; a heuristic stopping
criterion is needed
- Results in unrealistically sharp formants and pole
jittering
- Suffers from musical noise
- Needs some kind of smoothing

[Audio samples: 10 dB noisy sample; iterative WF; iterative WF with smoothing]
Further enhancement to IWF
Constrained IWF [Hansen,Clements 1987]
Apply spectral constraint inter-frame and intra-frame
using LSP transformation.
Pole-zero modeling [Flanagan 1972]
Replace WF with Kalman filtering [Gibson
1991]
Vector quantization method [Gibson 1988]
Use HMM [Ephraim 1988]

Phase issues
The majority of noise reduction methods
process only the amplitude spectrum, while the
noisy phase spectrum is left unprocessed.
The reasons are:
- Human ears are less sensitive to phase
than to the amplitude spectrum.
- Masking of amplitude to phase (6dB/0.6rad
threshold).
For low SNR (<6dB) sources, the noisy
phase causes roughness/reverberance.
MMSE approaches to speech
enhancement
Wiener filtering; MMSE amplitude spectrum
estimator; MMSE log-amplitude spectrum
estimator; Non-Gaussian prior MMSE
approaches.
The dominant technique, because of
better performance than the Spectral
Subtraction methods.
Needs a priori information about the speech and noise
spectra.
MMSE amplitude spectrum
estimator (Ephraim-Malah
filter)
Ephraim-Malah, 1984
The basis of the noise reduction
function of the MELPe coding standard
Consists of two parts: the Decision-
Directed method for estimating the a priori
speech spectrum, and the MMSE
Short-Time Spectral Amplitude (STSA)
estimator
MMSE STSA estimator

Assumptions:
- Stationary additive Gaussian noise with known PSD.
- An estimate of the speech spectrum is available.
- Spectral components (DFT coefficients) are statistically independent
and each follows a Gaussian distribution (the DFT amplitude follows a
Rayleigh distribution).
- The DFT phase follows a uniform distribution and is independent of the
amplitude.

The signal model: $y(t) = x(t) + d(t)$

Let $Y_k = R_k \exp(j\vartheta_k)$, $X_k = A_k \exp(j\alpha_k)$, and $D_k$ denote the kth spectral
component of the noisy observation y(t), the signal x(t), and the
noise d(t), respectively.
MMSE STSA estimator
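With the following PDFs:

$$p(Y_k \mid A_k, \alpha_k) = \frac{1}{\pi \lambda_d(k)} \exp\left\{-\frac{1}{\lambda_d(k)} \left|Y_k - A_k e^{j\alpha_k}\right|^2\right\}$$

$$p(A_k, \alpha_k) = \frac{A_k}{\pi \lambda_x(k)} \exp\left\{-\frac{A_k^2}{\lambda_x(k)}\right\}$$

and Bayes' rule, the estimator $\hat{A}_k$ can be shown to be:

$$\hat{A}_k = E[A_k \mid Y_k] = \frac{\sqrt{\pi}}{2}\,\frac{\sqrt{v_k}}{\gamma_k} \exp\left(-\frac{v_k}{2}\right)\left[(1 + v_k)\, I_0\left(\frac{v_k}{2}\right) + v_k\, I_1\left(\frac{v_k}{2}\right)\right] R_k$$

where $I_0(\cdot)$ and $I_1(\cdot)$ denote the modified Bessel functions of zero
and first order, and $v_k$ is defined by:

$$v_k = \frac{\xi_k}{1 + \xi_k}\,\gamma_k$$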

MMSE STSA estimator
where $\xi_k$ and $\gamma_k$ are defined by:

$$\xi_k = \frac{\lambda_x(k)}{\lambda_d(k)}, \qquad \gamma_k = \frac{R_k^2}{\lambda_d(k)}$$

with $\lambda_x(k) = E[|X_k|^2]$ and $\lambda_d(k) = E[|D_k|^2]$.

$\xi_k$ and $\gamma_k$ are interpreted as the a priori and a posteriori signal-to-noise
ratio, respectively.

$\xi_k$ is estimated by the Decision-Directed method.
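
As an illustration, the gain applied to the noisy amplitude (the estimated amplitude divided by the noisy amplitude) can be evaluated as a function of the a priori and a posteriori SNR. The sketch below is pure Python and assumes moderate arguments: the Bessel functions are computed from their power series, which overflows for very large values. In practice exponentially scaled routines such as `scipy.special.i0e` and `scipy.special.i1e` would normally be preferred. Function names are hypothetical.

```python
import math

def bessel_i(order, x, terms=200):
    """Modified Bessel function I_order(x), order 0 or 1, by power series."""
    half = x / 2.0
    term = half ** order / math.factorial(order)
    total = term
    for k in range(1, terms):
        term *= half * half / (k * (k + order))
        total += term
    return total

def mmse_stsa_gain(xi, gamma):
    """MMSE STSA gain for a priori SNR xi and a posteriori SNR gamma (linear)."""
    v = xi / (1.0 + xi) * gamma
    bracket = (1.0 + v) * bessel_i(0, v / 2.0) + v * bessel_i(1, v / 2.0)
    return (math.sqrt(math.pi) / 2.0) * (math.sqrt(v) / gamma) \
        * math.exp(-v / 2.0) * bracket

high_snr_gain = mmse_stsa_gain(100.0, 101.0)   # close to the Wiener gain
low_snr_gain = mmse_stsa_gain(0.1, 1.0)        # above the Wiener gain
low_snr_gain_large_gamma = mmse_stsa_gain(0.1, 10.0)
```

At high a priori SNR the gain approaches xi / (1 + xi), and at low a priori SNR a larger a posteriori SNR gives a smaller gain.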

Decision-Directed method
An estimate of the a priori SNR.
A combination of Power Spectrum
Subtraction, half-wave rectification and
inter-frame smoothing.

$$\hat{\xi}_k(n) = \alpha\,\frac{\hat{A}_k^2(n-1)}{\lambda_d(k, n-1)} + (1 - \alpha)\,\max[\gamma_k(n) - 1,\, 0], \qquad 0 \le \alpha < 1$$

$\alpha$ is usually chosen to be 0.98 in order to
get the best smoothing performance. The
higher $\alpha$ is, the less musical noise, but
the more distortion to the speech.
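
The decision-directed update for one frequency bin can be written as a small function. This is a sketch with hypothetical argument names: `prev_amplitude` is the amplitude estimate from the previous frame, `prev_noise_var` the noise variance of the previous frame, and `gamma` the a posteriori SNR of the current frame.

```python
def decision_directed_xi(prev_amplitude, prev_noise_var, gamma, alpha=0.98):
    """Decision-directed estimate of the a priori SNR for one bin."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must satisfy 0 <= alpha < 1")
    # SNR implied by the previous frame's amplitude estimate.
    previous_snr = prev_amplitude ** 2 / prev_noise_var
    # Power subtraction with half-wave rectification on the current frame.
    instantaneous_snr = max(gamma - 1.0, 0.0)
    return alpha * previous_snr + (1.0 - alpha) * instantaneous_snr

xi_example = decision_directed_xi(prev_amplitude=2.0, prev_noise_var=1.0, gamma=3.0)
xi_unsmoothed = decision_directed_xi(2.0, 1.0, 0.5, alpha=0.0)
```

With `alpha=0.0` the estimate reduces to the rectified power subtraction term alone, i.e. the instantaneous SNR without smoothing.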
Comments on the MMSE
STSA estimator
Comparison of the suppression gains of the Wiener filter and the MMSE STSA estimator

- The instantaneous SNR can be
interpreted as the a priori SNR
estimated without smoothing.
- WF gains do not vary with the
instantaneous SNR; they vary only with
the a priori SNR, whereas the
MMSE STSA gains vary with both the
instantaneous SNR and the a priori SNR.
- When the a priori SNR is high, the
MMSE STSA estimator has gain
curves very close to those of the WF. When
the a priori SNR is low, the MMSE
STSA shows a higher gain, which is
strongly affected by the
instantaneous SNR.
Comments on the MMSE
STSA estimator

A comparison of the suppression gains of the PSS, WF and MMSE STSA estimators

[Figure, left panel (x-axis: estimated a priori SNR): solid line: power subtraction; dashed line: Wiener filter.]
[Figure, right panel (x-axis: estimated a priori SNR): the MMSE STSA. Rpost denotes the a priori SNR estimated without smoothing (the instantaneous SNR).]
Comments on the MMSE
STSA estimator
The gain curve transitions smoothly between the power
subtraction curve and the Wiener curve. This transition is
controlled by the unsmoothed estimate of the a priori SNR (Rpost).
The larger Rpost, the stronger the attenuation.
This counter-intuitive behavior manages to flatten the spurious
spectral peaks caused by the noise in the low SNR part of the
spectrum, whereas the WF tends to sharpen the spurious peaks in
the low SNR part of the spectrum.
The phase of the noisy speech is used as the phase of the
enhanced speech, because of the assumption of uniformly
distributed phase. An independent MMSE estimate of the
phasor has non-unity modulus, and thus cannot be combined with
the MMSE STSA.
Suffers less musical noise than the WF.

MMSE Log-Spectral Amplitude
Estimator
A modification to the MMSE STSA based on the fact that a distortion
measure based on the mean-square error of the log-spectra is more
suitable for speech processing.
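Minimize the distortion measure $E[(\log A_k - \log \hat{A}_k)^2]$.

The MMSE LSA estimator can be shown to be:

$$\hat{A}_k = \exp(E[\ln A_k \mid Y_k]) = \frac{\xi_k}{1 + \xi_k} \exp\left(\frac{1}{2}\int_{v_k}^{\infty} \frac{e^{-t}}{t}\,dt\right) R_k$$

where $v_k = \frac{\xi_k}{1 + \xi_k}\,\gamma_k$, and $\xi_k$ and $\gamma_k$ are the a priori SNR and the a
posteriori SNR as defined before.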
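
The integral in the LSA estimator is the exponential integral E1 evaluated at v. A pure-Python sketch of the resulting gain is shown below. The E1 routine (power series for small arguments, continued fraction otherwise) is an illustrative implementation with hypothetical names, and a library routine such as `scipy.special.exp1` would normally be used instead. The argument v must be positive.

```python
import math

EULER_GAMMA = 0.5772156649015329

def exp_integral_e1(x):
    """E1(x) = integral from x to infinity of exp(-t)/t dt, for x > 0."""
    if x <= 0.0:
        raise ValueError("x must be positive")
    if x <= 1.0:
        # Power series: -gamma - ln(x) - sum_{k>=1} (-x)^k / (k * k!)
        total = -EULER_GAMMA - math.log(x)
        term = 1.0
        for k in range(1, 40):
            term *= -x / k
            total -= term / k
        return total
    # Continued fraction evaluated with the modified Lentz method.
    b = x + 1.0
    c = 1.0 / 1e-300
    d = 1.0 / b
    h = d
    for i in range(1, 200):
        a = -float(i * i)
        b += 2.0
        d = 1.0 / (a * d + b)
        c = b + a / c
        delta = c * d
        h *= delta
        if abs(delta - 1.0) < 1e-14:
            break
    return h * math.exp(-x)

def mmse_lsa_gain(xi, gamma):
    """MMSE LSA gain for a priori SNR xi and a posteriori SNR gamma (linear)."""
    v = xi / (1.0 + xi) * gamma
    return xi / (1.0 + xi) * math.exp(0.5 * exp_integral_e1(v))

lsa_high = mmse_lsa_gain(100.0, 101.0)
lsa_low = mmse_lsa_gain(0.1, 1.0)
```

Since E1 is positive, the LSA gain never falls below the Wiener-type factor xi / (1 + xi), and it approaches that factor as v grows.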

MMSE Log-Spectral Amplitude
Estimator
Comparison of the suppression gains of MMSE STSA and MMSE LSA

- The gain curves of MMSE LSA are
always lower than those of MMSE
STSA, resulting in lower residual
noise.
- When the a priori SNR is high, the
gain curve of MMSE LSA is very flat,
similar to the Wiener filter.
When the a priori SNR is low, the
gain curve of the MMSE LSA varies
w.r.t. the instantaneous SNR as the
MMSE STSA does.

[Audio samples: noisy sample (0 dB); Decision-Directed Wiener filter; MMSE LSA]
MMSE estimator with non-
Gaussian prior
How well does the Gaussian model fit the real probability distribution of DFT
coefficients?

[Figures: histogram of speech DFT amplitude; histogram of noise (recorded in a market place) DFT amplitude.]

* The histograms are taken from one hour of speech.


MMSE estimator with non-
Gaussian prior
The probability density function of the DFT coefficients
of speech can be better modeled by super-Gaussian
functions (e.g. Gamma or Laplace) than by the
Gaussian function [Rainer Martin 2002, 2003].
An even more exact probability density function is
one tailored to fit the shape of the histogram of the DFT
coefficients [Lotter, Vary 2003].
Using these density functions in place of the Gaussian
density function (for speech or noise processes) in the
MMSE estimator can result in better noise reduction.
The non-Gaussian prior MMSE estimator is nonlinear and
non-zero-phase.

MMSE estimator with non-
Gaussian prior
Compared with WF:
- Better output SNR (Gaussian/Gamma)
- Less musical noise (Laplace/Gamma)
- Less distortion to the speech

MMSE joint estimator for
amplitude and phase spectra

[C.-J. Li, S. V. Andersen, Inter-Frequency Dependency in MMSE Speech Enhancement, NORSIG04]



Why MMSE joint estimator?
Phase is found to be important for noise
reduction of low SNR sources, whereas
independent optimum amplitude and
optimum phase estimators do not coexist.
Finite frame length and temporal power localization
introduce correlation between spectral components.
This correlation can be exploited to improve the
estimate of low SNR frequency bins.
Time localization can be modeled with the joint
MMSE estimator, but cannot be modeled by the
frequency domain Wiener filter. Time localization
indicates how much the phase is linearly related.

Formulation
Signal model: $y = FS + v$
where F is the inverse Fourier matrix, S is the vector of Fourier coefficients,
and v is a white Gaussian noise vector.

The MMSE estimator of S can be shown to be:

$$\hat{S} = E(S \mid y) = C_S F^H (F C_S F^H + C_v)^{-1} y$$

$C_S$ and $C_v$ are the covariance matrices of the signal spectrum S
and the noise v, respectively (they need to be estimated).
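
A small NumPy sketch of this estimator is given below, assuming a unitary inverse DFT matrix for F and covariance matrices that are already available. The function names are hypothetical. The example also checks a special case: when the spectral covariance is diagonal and the noise is white, the estimator reduces to a per-bin Wiener gain applied to the spectrum of y.

```python
import numpy as np

def inverse_dft_matrix(n):
    """Unitary inverse DFT matrix F, so that a time-domain frame is s = F @ S."""
    forward = np.fft.fft(np.eye(n)) / np.sqrt(n)    # unitary forward DFT
    return forward.conj().T

def joint_mmse_estimate(y, C_S, C_v):
    """S_hat = C_S F^H (F C_S F^H + C_v)^{-1} y."""
    F = inverse_dft_matrix(len(y))
    C_y = F @ C_S @ F.conj().T + C_v
    return C_S @ F.conj().T @ np.linalg.solve(C_y, y)

# Special case: diagonal spectral covariance and white noise.
n = 8
y = np.arange(n, dtype=float)
bin_power = np.linspace(1.0, 8.0, n)     # illustrative per-bin speech power
noise_var = 2.0
S_hat = joint_mmse_estimate(y, np.diag(bin_power), noise_var * np.eye(n))

F = inverse_dft_matrix(n)
per_bin_wiener = bin_power / (bin_power + noise_var) * (F.conj().T @ y)
```

A full (non-diagonal) `C_S` is passed in exactly the same way; only then does the estimator use the correlation between frequency components.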

Estimating covariance matrix
Let 1/A(z) denote the transfer function of the all-pole model of speech,
r denote the LPC residual, and H denote the Toeplitz analysis matrix
consisting of the coefficients of A(z), such that:

$$r = Hs$$

The covariance matrix of r can be written as a diagonal matrix with
the squared elements of r on its diagonal. The covariance matrices
of s and S can then be written respectively as:

$$C_s = H^{-1} C_r H^{-H}$$

$$C_S = F^{-1} C_s F^{-H}$$
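
The construction of the two covariance matrices can be sketched as follows. Assumptions made for this illustration: the analysis matrix H is lower-triangular Toeplitz with zero initial state, F is the unitary inverse DFT matrix with s = F S (so the spectral covariance is computed as F^H C_s F), and the function names and example values are hypothetical.

```python
import numpy as np

def analysis_matrix(a, n):
    """Lower-triangular Toeplitz matrix H such that H @ s is the LP residual."""
    H = np.zeros((n, n))
    for k, coeff in enumerate(a):
        H += coeff * np.eye(n, k=-k)     # coefficient a_k on the k-th subdiagonal
    return H

def signal_covariances(a, residual):
    """Return (C_s, C_S) built from LP coefficients a and the residual r."""
    residual = np.asarray(residual, dtype=float)
    n = len(residual)
    H = analysis_matrix(a, n)
    C_r = np.diag(residual ** 2)
    H_inv = np.linalg.inv(H)
    C_s = H_inv @ C_r @ H_inv.T          # H is real, so H^{-H} = H^{-T}
    F = np.fft.fft(np.eye(n)).conj().T / np.sqrt(n)   # unitary inverse DFT
    C_S = F.conj().T @ C_s @ F           # equals F^{-1} C_s F^{-H} for unitary F
    return C_s, C_S

a = [1.0, -0.9]                          # illustrative first-order A(z)
residual = np.array([1.0, 0.5, -0.25, 2.0])
C_s, C_S = signal_covariances(a, residual)
H = analysis_matrix(a, len(residual))
```

Because the residual power varies over time, `C_s` is not Toeplitz and `C_S` is in general a full matrix rather than a diagonal one.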

Relation between joint estimator
and other MMSE estimators

In the joint estimator, the spectral
covariance matrix $C_S$ is assumed to be a
full matrix, while the Wiener filter and
MMSE LSA estimator assume it is a
diagonal matrix.
This allows the joint estimator to exploit the
correlation between frequency
components, which is ignored by the
frequency domain MMSE estimators.
Correlation of frequency
components

The covariance matrix of the frequency components

Preliminary results

The TFE-MMSE estimator preserves the signal spectrum better than the Wiener filter.
Results
[Audio samples: TFE-MMSE estimator; TFE-Kalman filtering; compared to WF; noisy (10 dB)]

Problems
Show that the non-causal IIR Wiener filter
gives an estimate of the signal power
spectrum that is under-biased, while the
Square-root Wiener filter gives an unbiased
estimate of the spectrum.
Discussion: Is estimating phase important in
speech enhancement? Can phase affect
magnitude?
