
SEGMENTATION AND ITS REAL-WORLD APPLICATIONS IN SPEECH PROCESSING

Farook Sattar¹, Mikael Nilsson², and Ingvar Claesson²

¹ School of Electrical and Electronic Engineering, Nanyang Technological University, Nanyang Avenue, Singapore 639798.
² Blekinge Institute of Technology, School of Engineering, P.O. Box 520, SE-372 25 Ronneby, Sweden.
ABSTRACT
The speech segmentation problem can be formulated as estimating the locations and durations of the speech and non-speech components of measured speech data. In this paper, a new time-scale transform based segmentation method and one of its important applications in speech processing are presented. The proposed scheme is tested on a number of recorded speech signals, and the preliminary results are quite promising. The method is able to separate the speech components (i.e., active intervals) from the non-speech components (i.e., inactive intervals) effectively. It is therefore successfully applied to insert selective pauses into speech before delivery in a reverberant environment, improving the quality/intelligibility of the delivered speech.
1. INTRODUCTION
Segmentation of speech/non-speech components and one of its useful applications in signal processing are considered in this paper. The segmentation problem is formulated as estimating the locations and durations of the speech and non-speech segments of the input speech signal. All the speech components are grouped together to form a set containing signal-plus-noise components, while the non-speech components, residing in the inactive intervals between the speech components, form another set containing only noise-like components.

A number of segmentation methods have been proposed in the literature, and applications can be found in the areas of voice activity detection (VAD) [1, 2, 3], word boundary detection, speech enhancement, speech recognition, indexing, content analysis, etc.

Segmentation of signal components involves exploiting time-, spectral-, and spatial-domain parameters. The key time-, scale/frequency- and spatial-domain parameters are the time location and duration, the center frequency and bandwidth, and the spatial location of the waveform. In this paper, we restrict ourselves to the time and scale parameters. Adjacent signal components are distinguished from one another in terms of their time- and scale-domain parameters. Using the wavelet transform, the time-domain signal is transformed into the time-scale domain. The segmentation is performed using the time-scale transform of the input signal followed by a decision strategy based on principal component analysis (PCA) and the generalized likelihood ratio test (GLRT).
The paper is organized as follows. In Section 2, background on the Morlet wavelet transform is given. In Section 3, the segmentation method is presented. In Section 4, the performance evaluation of the proposed segmentation method is shown. Section 5 shows the segmentation results with an application to playing back
recorded speech in a reverberant environment. Section 6 presents the objective quality measures of the speech signals related to our real-world application. Section 7 gives the conclusion.
2. BACKGROUND
2.1. Morlet Wavelet Transform
The choice of the wavelet transform is problem dependent; that is, it depends upon the required time-frequency/scale resolution. If the signal components are closely spaced in time but overlap in frequency, then time resolution is important. If the signal components are closely spaced in frequency but overlap in time, then frequency resolution is important. By choosing appropriate wavelet parameters, the wavelet transform can handle a wide class of problems, including those requiring high time resolution and those requiring high frequency/scale resolution. There are many choices within the wavelet family. In our case, the Morlet wavelet transform [4, 5] is employed because of its ability to provide different window lengths for signals composed of different frequencies/scales, and because its time-bandwidth product is minimal, thereby providing good time and frequency resolution. The relative bandwidth, i.e., the ratio of the bandwidth $\Delta f$ to the center frequency $f_0$, given by $\Delta f / f_0$, is constant for all frequencies. The time resolution becomes very good at high frequencies, while the frequency resolution becomes very good at low frequencies. The Morlet wavelet is thus well suited to commonly occurring signals, which contain low-frequency components of long duration and high-frequency components of short duration. The Morlet wavelets form a non-orthogonal basis with considerable spectral overlap among the basis functions. The Morlet wavelet transform [4, 5] is given by
$$\psi(k, m) = \pi^{-1/4}\, e^{j 2\pi f_0 k/m}\, e^{-k^2/(2m^2)} \qquad (1)$$

where $k$ is the time index and $m$ is the scale parameter. The Fourier transform of the Morlet wavelet is given by

$$\Psi(\omega, m) = \pi^{-1/4}\, e^{-(\omega - \omega_0/m)^2/2}, \quad \omega \geq 0 \qquad (2)$$

where $\omega$ is the frequency.
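As a concrete illustration of Eq. (1) and the inner products used later in Eq. (3), the following Python sketch evaluates the Morlet wavelet on a linear scale grid and correlates it with a test signal. The center frequency `f0`, the scale grid, and the truncation of the Gaussian envelope are illustrative assumptions of this sketch; the paper does not specify them.

```python
import numpy as np

def morlet(k, m, f0=0.955):
    """Morlet wavelet of Eq. (1): pi^(-1/4) exp(j 2 pi f0 k/m) exp(-k^2/(2 m^2))."""
    return (np.pi ** -0.25) * np.exp(2j * np.pi * f0 * k / m) \
                            * np.exp(-k ** 2 / (2 * m ** 2))

def morlet_transform(y, scales, support=4.0):
    """Time-scale coefficients y(k, m) for every sample k and scale m."""
    coeffs = np.empty((len(scales), len(y)), dtype=complex)
    for i, m in enumerate(scales):
        half = int(support * m)            # truncate the Gaussian envelope
        k = np.arange(-half, half + 1)
        w = morlet(k, m)
        # correlation <y, w shifted to k>, realised as convolution
        # with the reversed conjugate wavelet
        coeffs[i] = np.convolve(y, np.conj(w[::-1]), mode="same")
    return coeffs

# Toy input: two tones separated in time, 8 kHz sampling as in the paper.
fs = 8000
t = np.arange(0, 0.5, 1 / fs)
y = np.where(t < 0.25, np.sin(2 * np.pi * 300 * t), np.sin(2 * np.pi * 1200 * t))
C = morlet_transform(y, scales=np.arange(2, 40))
print(C.shape)                             # (38, 4000): one row per scale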


3. PROPOSED SEGMENTATION METHOD
The proposed segmentation method is based on following two steps:
time-scale transform followed by a decision rule based on PCA
(Principle Component Analysis) and GLRT (Generalized Likelihood Ratio Test).

3.1. Time-scale Transform

The segmentation method is based on the time-scale characteristics of a linear transform, which may be expressed as a linear combination of basis functions $\{\psi(k, m)\}$, $k = 0, \ldots, N-1$; $m = 0, \ldots, M-1$, where $k$ is the time index and $m$ is the scale/frequency index. Using this linear affine transform, the time-scale transformation of the input signal $y$ takes the form

$$y_k = W_k^H y \qquad (3)$$

where superscript $H$ denotes conjugate transpose, $y_k$ is an $(M \times 1)$ vector, $y$ is an $(N \times 1)$ vector, and $W_k$ is an $(N \times M)$ matrix, given by

$$y_k = [\,y(k, 0)\;\; y(k, 1)\;\; \cdots\;\; y(k, M-1)\,]^T, \qquad
y = [\,y(k)\;\; y(k-1)\;\; \cdots\;\; y(k-N+1)\,]^T \qquad (4)$$

$$W_k = \begin{bmatrix}
\psi(k, 0) & \cdots & \psi(k, M-1) \\
\psi(k-1, 0) & \cdots & \psi(k-1, M-1) \\
\vdots & \ddots & \vdots \\
\psi(k-N+1, 0) & \cdots & \psi(k-N+1, M-1)
\end{bmatrix} \qquad (5)$$

where $(\cdot)^T$ is the transpose operator. Now, considering the measured data $y$ as signal $x$ plus noise $v$, applying the above transformation yields

$$y_k = x_k + v_k \qquad (6)$$

where

$$x_k = W_k^H x, \quad v_k = W_k^H v, \qquad
x = [\,x(k)\;\; x(k-1)\;\; \cdots\;\; x(k-N+1)\,]^T, \quad
v = [\,v(k)\;\; v(k-1)\;\; \cdots\;\; v(k-N+1)\,]^T \qquad (7)$$

3.2. Decision Rule Based on PCA and GLRT

The segmentation of the signal components can be formulated as a binary hypothesis test between two hypotheses, $H_0$ and $H_1$:

$$H_0: y_k = v_k, \qquad H_1: y_k = x_k + v_k \qquad (8)$$

where $x_k$, $v_k$ and $y_k$ are the transform coefficients (wavelet transform coefficients in our case) of the original signals $x(k)$, $v(k)$ and $y(k)$, respectively, and $v_k$ is zero-mean Gaussian noise. At a given time instant $k$, we must decide whether $y_k$ contains a signal component $x_k$ or consists of noise $v_k$ alone.

We employ the maximum-likelihood approach, which provides the Generalized Likelihood Ratio Test (GLRT) for the signal $y_k$ and noise $v_k$, in our case with unknown covariances, given by [6, 7]

$$l(y_k) = \ln p(y_k, \hat{R}_{yy}; H_1) - \ln p(y_k, \hat{R}_{vv}; H_0)
\;\;\begin{cases} > \gamma' & \text{for } H_1 \\ \leq \gamma' & \text{for } H_0 \end{cases} \qquad (9)$$

where $\hat{R}_{yy}$ and $\hat{R}_{vv}$ are the maximum-likelihood estimates of the covariance under hypotheses $H_1$ and $H_0$. This GLRT is implemented by a simple and efficient scheme based on principal component analysis (PCA), i.e., the subspace technique. For this, the covariance matrix of $y_k$ under hypothesis $H_1$ is given by

$$R_{yy} = E[y_k y_k^H] \qquad (10)$$

where $E[\cdot]$ is the expectation operator. Using the singular value decomposition (SVD) of the covariance matrix $R_{yy}$, we get

$$R_{yy} = U \Lambda U^T, \qquad \Lambda = \mathrm{diag}[\,\sigma_1^2 \;\cdots\; \sigma_M^2\,] \qquad (11)$$

where $U$ is the eigenvector matrix of $R_{yy}$ and $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_M$ are the corresponding singular values. As in [7], the log-likelihood function becomes

$$l(y_k) = -\frac{1}{2}\sum_{i=1}^{M} \ln \sigma_i^2 + \frac{M}{2}\ln \sigma_v^2
+ \frac{1}{2\sigma_v^2}\sum_{i=1}^{M} \frac{\sigma_i^2 - \sigma_v^2}{\sigma_i^2}\,|U_i^H y_k|^2 \qquad (12)$$

Combining the data-independent terms into the threshold and scaling, the test statistic $T(y_k)$ for the GLRT becomes

$$T(y_k) = \sum_{i=1}^{M} G_i\, |U_i^H y_k|^2 \qquad (13)$$

where

$$G_i = \frac{\sigma_{x_i}^2}{\sigma_i^2}, \qquad \sigma_{x_i}^2 = \sigma_i^2 - \sigma_v^2 \qquad (14)$$

In Eq. (14), the singular value $\sigma_{x_i}$ characterizes the signal component. The decision rule then becomes

$$T(y_k) \;\;\begin{cases} > \gamma & \text{for } H_1 \\ \leq \gamma & \text{for } H_0 \end{cases} \qquad (15)$$

where

$$\gamma = \sigma_v^2 \left( \gamma' + \sum_{i=1}^{M} \ln \sigma_i^2 - M \ln \sigma_v^2 \right) \qquad (16)$$

Using the subspace concept [8] and the SVD of the correlation matrix $R_{yy}$, the test statistic can be approximated as

$$\tilde{T}(y_k) = \sum_{i=1}^{r} |\tilde{U}_i^H y_k|^2 \qquad (17)$$

where $\tilde{U} = [\,\tilde{U}_1 \;\cdots\; \tilde{U}_r\,]$ contains the $r$ orthogonal eigenvectors having the $r$ largest eigenvalues of $R_{yy}$, whereas the remaining $(M - r)$ eigenvectors correspond to the eigenvalues (power spectral density) of the noise. The decision rule can then be implemented as

$$\tilde{T}(y_k) \;\;\begin{cases} > \tilde{\gamma} & \text{for } H_1 \\ \leq \tilde{\gamma} & \text{for } H_0 \end{cases} \qquad (18)$$

where

$$\tilde{\gamma} = \sigma_v^2 \left( \gamma' + \sum_{i=1}^{r} \ln \tilde{\sigma}_i^2 - (M - r) \ln \sigma_v^2 \right) \qquad (19)$$

and

$$\tilde{\sigma}_i^2 = \begin{cases} \sigma_{x_i}^2 + \sigma_v^2, & i = 1, \ldots, r \\ \sigma_v^2, & i = r+1, \ldots, M \end{cases} \qquad (20)$$
4. PERFORMANCE EVALUATION

Table 1. Relative error at different SNRs for four noise types.

SNR     White    Car      Babble   Street
Clean   0.0746   0.0746   0.0746   0.0746
20 dB   0.0868   0.0973   0.1047   0.0953
15 dB   0.1058   0.1134   0.1304   0.1218
10 dB   0.1378   0.1512   0.1637   0.1523
 5 dB   0.1613   0.1833   0.1995   0.1926
 0 dB   0.2082   0.2322   0.2433   0.2306

To evaluate the performance of the proposed segmentation method, experiments are carried out on speech signals from 20 speakers (10 male and 10 female) from the TIDIGITS database. Ten digit strings of lengths varying from 3 to 7 digits are considered for each speaker. For the reference decisions, the boundaries of each segment are determined manually. White noise, babble noise, car noise, and street noise are then added to obtain corrupted signals at different SNRs. Five SNR levels are considered (20 dB, 15 dB, 10 dB, 5 dB and 0 dB). In total, therefore, 4000 signals sampled at 8 kHz are used in the performance evaluation.
As a performance measure, the relative error RE of a segment is calculated as follows:

$$RE = \frac{|AD - ED|}{AD} \qquad (21)$$

where AD is the duration of the segment determined manually and ED is the duration of the segment estimated by the proposed method. In Table 1, the performance measures for the proposed method are given for different background noises and various SNRs. It can be seen that good segmentation performance is obtained for speech signals in various noisy situations.

When the SNR is 20 dB, the average segmentation error is below about 10% of the segment duration for all noise types. The segment boundaries determined by the proposed algorithm deviate from the manual segmentations by 10%-13% when the SNR is 15 dB. For signals at 10 dB and 5 dB, the relative segmentation errors lie between 13%-16% and 16%-19%, respectively. The highest segmentation errors occur at 0 dB SNR, with values between 20% and 24%. Among the various noise environments, white noise gives the lowest segmentation errors at all SNRs, while babble noise has the highest relative error in all instances. On average, over SNRs ranging from 20 dB down to 0 dB, the relative errors are lowest for white noise, followed by car noise, street noise and babble noise.
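As a worked example of Eq. (21), a segment annotated at 1.20 s but estimated at 1.08 s yields |1.20 - 1.08| / 1.20, i.e. roughly the 10% error reported at 20 dB SNR; the numbers here are hypothetical:

```python
def relative_error(AD, ED):
    """Relative error of Eq. (21): RE = |AD - ED| / AD.

    AD: manually determined segment duration, ED: estimated duration
    (any common unit, e.g. seconds or samples)."""
    return abs(AD - ED) / AD

print(relative_error(1.20, 1.08))   # ~0.10, i.e. a 10% relative error
```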
5. RESULTS FOR REAL-WORLD APPLICATIONS
A number of speech signals spoken by different male and female speakers, sampled at a rate of 8 kHz, are tested. These signals were recorded in the control room of a railway station in Malmo, Sweden; they are railway announcements in Swedish delivered from the control room to the loudspeakers. Usually, the persons announcing the relevant information to the passengers are not well trained and often announce without sufficient pauses between active speech components. As a consequence, the announcements delivered through the mounted loudspeakers become quite unclear and confusing to the passengers. This is caused by the overlapping of speech segments due to reverberation within the station. The problem can be mitigated by inserting sufficient pauses at selected locations in the speech signal. Fig. 1 illustrates the segmentation results for two recorded signals and demonstrates how effectively segmentation can be done by exploiting both time- and scale-domain properties using the Morlet wavelet.
Figs. 2(a) and 3(a) show the original speech signal and the extended speech signal based on the segmentation results illustrated in Figs. 1(a) and (b), respectively.

Fig. 1. The results of segmentation for recorded speech data (the segmentation functions are plotted together with the real input signal): (a) Example 1; (b) Example 2 (x-axis: samples, y-axis: amplitude).

After segmenting
the signal, pauses with durations of more than 1/2 second are selected automatically. The duration of each selected pause is then extended to 2 seconds, so that the problem of overlap between two successive active speech segments, which occurs due to reverberation, can be overcome. For our particular database, the duration values of 1/2 second (i.e., 2000 samples) and 2 seconds (16,000 samples) were chosen experimentally. In future work, we would like to vary these fixed values based on the reverberation time and tail information of the speech signal to improve the performance. The original speech signals and the extended output signals for various male and female speakers can be found at http://www.ntu.edu.sg/home/efsattar/web/audio files.htm.
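A rough sketch of this pause-insertion step is given below, assuming a per-sample speech/non-speech mask from the segmentation stage (e.g. the `active` array from the earlier sketch) and zero-valued padding as the silence fill; both the mask interface and the zero fill are assumptions of this sketch, not details from the paper.

```python
import numpy as np

def extend_pauses(y, active, fs=8000, min_pause=0.5, target=2.0):
    """Stretch every detected pause of at least `min_pause` seconds to
    `target` seconds by inserting silence, as described in Section 5."""
    min_len, target_len = int(min_pause * fs), int(target * fs)
    # Locate run boundaries in the boolean speech mask.
    edges = np.flatnonzero(np.diff(active.astype(int))) + 1
    bounds = np.concatenate(([0], edges, [len(y)]))
    out = []
    for b0, b1 in zip(bounds[:-1], bounds[1:]):
        out.append(y[b0:b1])
        if not active[b0] and (b1 - b0) >= min_len:   # qualifying pause
            out.append(np.zeros(max(target_len - (b1 - b0), 0)))
    return np.concatenate(out)

y_ext = extend_pauses(y, active)    # y and active from the earlier sketches
```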
6. OBJECTIVE QUALITY MEASURE
We measured the spectral distances between the reference and distorted speech utterances. Our objective measure is based on a parameterization of a linear predictive vocal tract model of the reference and the distorted speech (distorted due to reverberation in our case). The parameters used in such measures can be the linear predictor coefficients (LPC) [9]. To compute the measure, the reference and distorted signals are divided into analysis frames of 20 ms duration. A linear predictive analysis is performed for each frame of speech, and the distance measure is calculated from the results of the analysis in the following way [10, 11, 12]:


d(Q, p, m) =

N
1 
|Q(i, m, ) Q(i, m, d)|p
N

1/p
(22)

i=1

where d(Q, p, m) is the distance for the mth analysis frame, p


is the power in the norm, and N is the order of the LPC analysis. Q(i, m, ) and Q(i, m, d) are the ith parameters of the corre-

sponding frames of the original speech and distorted speech, d is


the distortion index, and indicates no distortion.
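The sketch below implements Eq. (22) under the assumption that the parameters $Q(i, m, \cdot)$ are the predictor coefficients themselves, computed with the autocorrelation (Levinson-Durbin) method; [10] admits other parameterizations, so this is one plausible reading rather than the paper's exact choice.

```python
import numpy as np

def lpc(x, order):
    """LPC coefficients by the autocorrelation (Levinson-Durbin) method."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:][:order + 1]
    a = np.zeros(order + 1)
    a[0] = 1.0
    if r[0] == 0:                         # silent frame: nothing to model
        return a
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        prev = a[:i][::-1].copy()         # a_{i-1}, ..., a_0
        a[1:i + 1] += k * prev            # order-update of the predictor
        err *= (1.0 - k * k)
    return a

def frame_distance(ref, dist, order=20, frame=160, p=2):
    """Frame-wise LPC-parameter distance of Eq. (22); order=20 and
    160-sample (20 ms at 8 kHz) frames follow Section 6."""
    n = min(len(ref), len(dist)) // frame
    d = np.empty(n)
    for m in range(n):
        a_ref = lpc(ref[m * frame:(m + 1) * frame], order)[1:]
        a_dst = lpc(dist[m * frame:(m + 1) * frame], order)[1:]
        d[m] = np.mean(np.abs(a_ref - a_dst) ** p) ** (1.0 / p)
    return d
```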
To generate the reverberated sounds for testing purposes, we used the Schroeder filter [13], which seems to be an effective technique for simulating reverberation. Figs. 2(a) and (b) illustrate a recorded signal and the corresponding reverberated signal. Similarly, Figs. 3(a) and (b) show the extended signal (based on the segmentation results) and the corresponding reverberated signal.

To measure the distances between the reference speech and the reverberated speech, the parameters used are: $N$ (order of the filters) = 20, $M$ (frame size) = 160 samples (i.e., 20 ms duration), and $p$ (power in the norm) = 2.

The frame distances $d$ (Eq. (22)) calculated for the signals in Fig. 2 are presented in Fig. 4(a). Similarly, Fig. 4(b) shows the frame distances for the signals shown in Fig. 3. The frame distances are smaller in Fig. 4(b) than in Fig. 4(a), indicating an improvement of speech quality/intelligibility under reverberant conditions as a result of the segmentation and selective pause insertion.
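For completeness, a minimal Schroeder reverberator in the spirit of [13] (parallel feedback combs followed by series allpass sections) can be sketched as follows; the delay times, gains, and dry/wet mix are illustrative values, not those used for the experiments.

```python
import numpy as np

def comb(x, delay, g):
    """Feedback comb filter: y[n] = x[n] + g * y[n - delay]."""
    y = x.astype(float).copy()
    for n in range(delay, len(y)):
        y[n] += g * y[n - delay]
    return y

def allpass(x, delay, g):
    """Schroeder allpass section: y[n] = -g x[n] + x[n-delay] + g y[n-delay]."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        xd = x[n - delay] if n >= delay else 0.0
        yd = y[n - delay] if n >= delay else 0.0
        y[n] = -g * x[n] + xd + g * yd
    return y

def schroeder_reverb(x, fs=8000):
    """Four parallel combs feeding two series allpasses, after [13]."""
    wet = sum(comb(x, int(fs * t), 0.77)
              for t in (0.0297, 0.0371, 0.0411, 0.0437)) / 4.0
    for t, g in ((0.0050, 0.7), (0.0017, 0.7)):
        wet = allpass(wet, int(fs * t), g)
    return 0.5 * x + 0.5 * wet           # dry/wet mix

y_rev = schroeder_reverb(y_ext)          # reverberated version for testing
```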

7. CONCLUSION

Segmentation of recorded speech signals based on time and scale information has been presented. Using this time-scale based segmentation method, speech components (i.e., active intervals) can be extracted effectively from recorded speech with varying pauses (i.e., quiet intervals). The proposed scheme is evaluated on a number of measured signals. The preliminary results are quite promising, and the method proves useful for inserting the desired pauses between active speech segments. The insertion of selective pauses can overcome the overlapping problem between speech segments during delivery of the speech through loudspeakers in a highly reverberant environment (such as a railway station). Objective measures for the quality/intelligibility of the delivered speech and a performance evaluation of the proposed method are also presented. In the full version of the paper, comparison results with other methods will be presented.

Fig. 2. The application of segmentation: (a) original speech signal, (b) corresponding reverberated speech signal (x-axis: samples, y-axis: amplitude).

Fig. 3. The application of segmentation: (a) extended original speech signal, (b) corresponding reverberated speech signal (x-axis: samples, y-axis: amplitude).

Fig. 4. Illustration of the speech quality/intelligibility measure: (a) frame distances measured for the signals in Fig. 2, (b) frame distances measured for the signals in Fig. 3 (x-axis: frame index, y-axis: frame distance).

8. REFERENCES

[1] S. Gazor and W. Zhang, "A soft voice activity detector based on a Laplacian-Gaussian model," IEEE Trans. Speech and Audio Processing, vol. 11, pp. 498-505, 2003.

[2] M. Marzinzik and B. Kollmeier, "Speech pause detection for noise spectrum estimation by power envelope dynamics," IEEE Trans. Speech and Audio Processing, vol. 10, pp. 109-118, 2002.

[3] ETSI, "Voice activity detector (VAD) for adaptive multi-rate (AMR) speech traffic channels," GSM 06.94 (digital cellular telecommunications system), European Telecommunications Standards Institute, 1999.

[4] O. Rioul and M. Vetterli, "Wavelets and signal processing," IEEE Signal Processing Magazine, pp. 14-38, 1991.

[5] C. Torrence and G. P. Compo, "A practical guide to wavelet analysis," Bulletin of the American Meteorological Society, vol. 79, pp. 61-78, Jan. 1998.

[6] S. M. Kay, Fundamentals of Statistical Signal Processing: Detection Theory, Prentice-Hall, 1998.

[7] L. L. Scharf, Statistical Signal Processing: Detection, Estimation, and Time Series Analysis, Addison-Wesley, 1991.

[8] A.-J. van der Veen, E. F. Deprettere, and A. L. Swindlehurst, "Subspace-based signal analysis using singular value decomposition," Proceedings of the IEEE, vol. 81, pp. 1277-1307, 1993.

[9] M. H. Hayes, Statistical Digital Signal Processing and Modeling, John Wiley & Sons, 1996.

[10] S. R. Quackenbush, T. P. Barnwell III, and M. A. Clements, Objective Measures of Speech Quality, Prentice Hall, 1988.

[11] J. R. Deller, J. H. L. Hansen, and J. G. Proakis, Discrete-Time Processing of Speech Signals, IEEE Press, 1996.

[12] J. G. Proakis and D. G. Manolakis, Digital Signal Processing: Principles, Algorithms, and Applications, Prentice Hall, 1996.

[13] D. Arfib, F. Keiler, and U. Zölzer, DAFX: Digital Audio Effects, John Wiley, 2002.