
International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR)
ISSN(P): 2249-6831; ISSN(E): 2249-7943
Vol. 6, Issue 5, Oct 2016, 59-66
TJPRC Pvt. Ltd.

A STATISTICAL ANALYSIS ON THE IMPACT OF SPEECH ENHANCEMENT TECHNIQUES ON THE FEATURE VECTORS OF NOISY SPEECH SIGNALS FOR SPEECH RECOGNITION

SWAPNANIL GOGOI & UTPAL BHATTACHARJEE
Computer Science and Engineering, Rajiv Gandhi University, Arunachal Pradesh, India
ABSTRACT
Noise is one of the major challenges in the development of a robust automatic speech recognition (ASR) system. Several speech enhancement techniques are available to reduce the effect of noise on speech signals. In this paper, a statistical analysis is presented on the impact of speech enhancement techniques on the feature vectors of noisy speech signals, by estimating Bhattacharya distances (BD) from the feature vectors of approximately noise-free training speech signals to the feature vectors of noisy testing speech signals. Here, Sub-band Spectral Subtraction (SSS) and Frame Selection (FS) have been used as speech enhancement techniques at the signal level, and Cepstral Mean Normalization (CMN) has been used as a feature normalization technique at the feature level. In this research work, a combination of Mel-Frequency Cepstral Coefficients (MFCC), log energies, and their first and second time derivatives has been used as the speech feature vector. Speech recognition experiments have also been performed to recognize English vowel phonemes, where the recognizer has been developed using a pattern recognition approach based on Hidden Markov Models (HMM).
KEYWORDS: Bhattacharya Distance, Sub-Band Spectral Subtraction, Frame Selection, Cepstral Mean Normalization, Mel-Frequency Cepstral Coefficient


Received: Sep 04, 2016; Accepted: Sep 29, 2016; Published: Oct 04, 2016; Paper Id.: IJCSEITROCT20167

INTRODUCTION
The performance of ASR systems always degrades in the presence of noise in the speech signals. Several speech enhancement techniques are now available that can be used to reduce the effect of noise on noisy speech signals, so that the robustness of an ASR system can be improved. It has been observed that, at the signal level, the spectral subtraction (SS) approach is a popular and effective method for reducing the noise in a noisy spectrum. Different approaches have been proposed in past research works to implement the spectral subtraction technique [2, 6, 7, 8, 17, 18, 19]. At the feature level, different feature selection [9], feature recombination [10], feature compensation [11] and feature normalization [16, 20] approaches are used to improve noise robustness. CMN is one of the effective feature normalization methods for reducing the effect of noise and distortion on speech signals [20]. Different types of noise adaptive training strategies and model adaptation approaches [12-15] are used at the model level to improve noise robustness. It has been observed that the speech recognition performance of ASR systems in noisy conditions always depends upon the type of noise and the applied speech enhancement techniques. In this research work, SSS and FS are used as speech enhancement techniques at the signal level, and at the feature level CMN is used to normalize the speech feature vectors.


The main objective of this paper is to present a statistical analysis of the impact of the speech enhancement techniques (SSS, FS and CMN) on the estimated speech feature vectors in noisy conditions, where the noisy speech signals are contaminated with different types of noise.
In the final part of this work, ASR experiments have also been performed to measure the improvement in recognition performance after application of the speech enhancement techniques mentioned above.

SPEECH DATABASE PREPARATION

In this paper, two speech databases from [4] have been used, where one consists of speech signals recorded by 45 male speakers and the other consists of speech signals recorded by 48 female speakers. Each speaker recorded speech signals for 12 English vowel sounds /i, ɪ, e, ɛ, æ, ɑ, ɔ, o, ʊ, u, ʌ, ɝ/ in h-V-d syllables ("heed", "hid", "hayed", "head", "had", "hod", "hawed", "hoed", "hood", "who'd", "hud", "heard", "hoyed", "hide", "hewed" and "how'd") in an approximately noise-free environment [4]. In our experiment, two sets of speech signals have initially been prepared from these two speech databases, where the first set is used for the training process and the second set is used for the testing (recognition) process of the ASR system. The first set consists of the first 276 speech signals from the male speech database and the first 288 speech signals from the female speech database. The second set has been constructed from the last 264 speech signals of the male speech database and the last 288 speech signals of the female speech database. From the second set, 7 more sets of speech signals have been prepared by artificially adding 7 different noises (Babble noise, Pink noise, White noise, Volvo noise, Factory noise, Destroyer noise from the engine room (Destroyer engine) and Destroyer noise from the operations room (Destroyer ops)) from the NOISEX-92 [5] database to the speech signals of the second set. So, as testing speech signals, 8 sets of speech signals have been prepared, where one contains noise-free speech signals and the other 7 contain noisy speech signals with the 7 different noises mentioned above.
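The paper does not state the SNR at which the NOISEX-92 noises were mixed into the test signals. The sketch below shows one common way such mixing is done; the target SNR, function name and variable names are illustrative assumptions rather than the authors' exact procedure.

```python
# Minimal sketch of mixing a NOISEX-92 noise recording into a clean test signal.
# The target SNR is an assumption; the paper does not specify the mixing level.
import numpy as np

def add_noise(clean, noise, snr_db=10.0):
    noise = np.resize(noise, clean.shape)              # repeat/trim noise to match signal length
    p_clean = np.mean(clean.astype(float) ** 2)        # clean-signal power
    p_noise = np.mean(noise.astype(float) ** 2)        # noise power
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise                       # noisy test signal at the target SNR
```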

METHODOLOGY
In this research work, noisy speech signals have been enhanced by applying three speech enhancement techniques. At the signal level, SSS has been applied to the noisy speech signals, and the resulting noise-reduced speech data are then further enhanced by applying the FS approach. Finally, after feature extraction, CMN has been applied to normalize the estimated speech feature vectors, so that the effect of stationary noise and distortion on the estimated speech feature vectors can be reduced.
Then, BD values have been estimated between each pair of noise-free training speech signal and noise-free or noisy testing speech signal, where both the training and testing speech signals correspond to the same English vowel phoneme. From the estimated BD values, the mean and standard deviation corresponding to the noise-free version and each of the noisy versions of the speech signals have been estimated. A statistical analysis has been performed with these estimated statistics to show the impact of the speech enhancement techniques on the testing speech feature vectors. Finally, ASR experiments have also been performed on the speech signals before and after application of the speech enhancement techniques.
SSS APPROACH FOR NOISE REDUCTION
SS is a speech enhancement technique in which the noise spectrum is estimated from the noisy speech signal and then subtracted from the spectrum of the noisy speech [1]. In this paper, the SS algorithm based on minimum statistics [2] has been used for noise reduction. The advantage of this approach is that it does not require a speech activity detector, yet the level of computational complexity is not increased [2]. In this work, the SSS approach has been applied as follows: at first, the noisy speech signal is decomposed into two sub-band signals, a low-band signal and a high-band signal, and then SS is applied to each of the two sub-band signals for noise reduction. After SS, the sub-band signals are combined to form the noise-reduced speech signal. A window-based low-pass FIR filter with filter order 24 and normalized cutoff frequency 0.5 has been designed to extract the low-band signal from a noisy speech signal, and a window-based high-pass FIR filter with filter order 24 and normalized cutoff frequency 0.5 has been designed to extract the high-band signal.
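As a rough illustration of this sub-band scheme, the sketch below splits a signal into low and high bands with order-24 FIR filters and applies a plain magnitude spectral subtraction to each band. The noise estimate here is taken from the first few frames rather than from the minimum-statistics tracker of [2], and the STFT length, spectral floor and function names are illustrative assumptions.

```python
# Minimal sketch of sub-band spectral subtraction (SSS): split into two bands,
# subtract an estimated noise magnitude per band, then recombine.
import numpy as np
from scipy.signal import firwin, lfilter, stft, istft

def split_subbands(x, order=24, cutoff=0.5):
    """Split a signal into low and high bands with order-24 FIR filters (normalized cutoff 0.5)."""
    lp = firwin(order + 1, cutoff)                      # low-pass prototype
    hp = firwin(order + 1, cutoff, pass_zero=False)     # complementary high-pass (odd tap count)
    return lfilter(lp, 1.0, x), lfilter(hp, 1.0, x)

def spectral_subtract(x, fs, noise_frames=10, floor=0.02):
    """Plain magnitude spectral subtraction on one sub-band signal (crude noise estimate,
    not the minimum-statistics estimator of [2])."""
    _, _, X = stft(x, fs=fs, nperseg=512)
    mag, phase = np.abs(X), np.angle(X)
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)  # noise estimate from first frames
    clean_mag = np.maximum(mag - noise_mag, floor * mag)           # subtract with a spectral floor
    _, y = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=512)
    return y

def sss_enhance(x, fs):
    low, high = split_subbands(x)
    low_e = spectral_subtract(low, fs)
    high_e = spectral_subtract(high, fs)
    n = min(len(low_e), len(high_e))
    return low_e[:n] + high_e[:n]       # recombine the enhanced sub-bands
```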
FS APPROACH FOR REMOVAL OF NOISY FRAMES
In this research work, the FS approach has been used to remove noisy frames from the noisy speech signals, while trying not to remove any frame that contains a speech part. In this approach, the frame selection criterion and the selection of a proper frame length are the most important parts for a successful implementation as a speech enhancement step. The following steps have been performed to implement FS in this work (a code sketch follows the steps).
Step 1: In the first step, the speech signal has been divided into M frames, where the length of each frame is 10 ms. In this research work, the selection of this frame length has been made based on experimental results where the testing speech signals are contaminated with Babble noise.
Step 2: In the second step, the standard deviation of each frame is first estimated. Then the mean of these standard deviations is estimated as shown in equation (1), where M is the total number of frames and σ_m is the standard deviation observed in the m-th frame of the speech signal.

μ_σ = (1/M) Σ_{m=1}^{M} σ_m    (1)

Step 3: In the third step, a frame is selected as the standard frame (F_s); it is the first frame whose standard deviation is greater than μ_σ.

Step 4: In the fourth step, the Euclidean distances from F_s to the other frames are estimated using equation (2), where x_m[n] is the n-th sample value of the m-th frame and x_s[n] is the corresponding sample value of the standard frame F_s selected in Step 3.

d_m = sqrt( Σ_n (x_m[n] − x_s[n])² )    (2)

Step 5: In the fifth step, a value D is estimated using equation (3), where d_mean is the mean and d_min is the minimum of the Euclidean distances estimated in Step 4.

D = d_mean − d_min    (3)

Step 6: In this step, a frame selection threshold T is estimated using equation (4), where L is an integer value selected based on experimental results; in this research work the best results have been achieved at L = 50.

T = D / L    (4)

Step 7: In the final step, the Euclidean distance from F_s to each of the other frames is compared with T: if the distance is greater than T, the corresponding frame is selected, otherwise it is discarded. F_s itself is also selected, as it is considered the standard frame in this approach.
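The sketch below strings the seven steps together under the equation forms reconstructed above (mean minus minimum distance, threshold D/L). The function and variable names are illustrative, and the exact threshold definition in the original formulation may differ.

```python
# Minimal sketch of the frame-selection (FS) steps, assuming 10 ms frames and the
# reconstructed equations (1)-(4); names are illustrative.
import numpy as np

def frame_selection(x, fs, frame_ms=10, L=50):
    frame_len = int(fs * frame_ms / 1000)
    M = len(x) // frame_len
    frames = x[:M * frame_len].reshape(M, frame_len)

    # Steps 1-2: per-frame standard deviation and its mean (equation (1))
    sigma = frames.std(axis=1)
    mu_sigma = sigma.mean()

    # Step 3: the first frame whose standard deviation exceeds the mean is the standard frame
    s = int(np.argmax(sigma > mu_sigma))
    standard = frames[s]

    # Step 4: Euclidean distances from the standard frame to every frame (equation (2))
    d = np.linalg.norm(frames - standard, axis=1)

    # Steps 5-6: D = mean - minimum distance, threshold T = D / L (equations (3)-(4))
    D = d.mean() - d.min()
    T = D / L

    # Step 7: keep frames farther than T from the standard frame, plus the standard frame itself
    keep = d > T
    keep[s] = True
    return frames[keep].reshape(-1)
```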



CMN FOR NORMALIZATION OF SPEECH FEATURE VECTORS

The presence of stationary noise and distortion may change the original values of the cepstral feature vectors of speech signals [20]. CMN is a simple but effective approach that can be used to reduce the effect of these changes, so that the robustness of ASR systems can be improved. In this paper, CMN has been applied to the estimated speech feature vectors.
For the estimation of the speech feature vectors, at first the log energy of each speech frame has been estimated by equation (5), as it can be useful for speech recognition, where x[n] is the n-th sample of the frame.

E = log( Σ_n x[n]² )    (5)
Now, 12-dimensional MFCCs have been estimated from the speech signal considering the following parameter values.

Pre-emphasis factor = -1.0

Frame size = 25ms

Frame shifting size = 10ms

Number of triangular band-pass Mel-filters = 20

Lifter weighting factor = 22


In this MFCC estimation process, equation (6) [3] has been used to represent the speech signal on the Mel scale, where F_mel is the Mel frequency and F is the linear frequency.

F_mel = 2595 log10(1 + F/700)    (6)

Then, 13-dimensional first time derivatives of the MFCCs and log energy, termed Delta MFCCs (DMFCC), have been estimated from the estimated MFCCs and log energies of the speech signal using equation (7) [21, 22], where Δc[t] is the DMFCC of the t-th frame, c[t] is the static feature vector of the t-th frame, and the value of K is 2.

Δc[t] = ( Σ_{k=1}^{K} k (c[t+k] − c[t−k]) ) / ( 2 Σ_{k=1}^{K} k² )    (7)

Then, 13-dimensional second time derivatives of the MFCCs, termed Delta-delta MFCCs (DDMFCC), have been estimated from the estimated DMFCCs using equation (7).
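A minimal sketch of the delta computation of equation (7) with K = 2, applied to the 13-dimensional static vectors (12 MFCCs plus log energy); the edge padding at the utterance boundaries and the function name are assumptions.

```python
# Minimal sketch of equation (7): regression-based time derivatives with K = 2.
import numpy as np

def deltas(static, K=2):
    """static: array of shape (T, 13); returns delta features of the same shape."""
    T = static.shape[0]
    padded = np.pad(static, ((K, K), (0, 0)), mode="edge")     # repeat edge frames
    denom = 2.0 * sum(k * k for k in range(1, K + 1))          # 2 * sum of k^2
    return np.stack([
        sum(k * (padded[t + K + k] - padded[t + K - k]) for k in range(1, K + 1)) / denom
        for t in range(T)
    ])
```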
Finally 39-dimensional speech vectors of an input speech signal have been estimated by combining
12-dimensional MFCCs, 1-dimensional log energy, 13-dimensional DMFCCs and 13-dimensional DDMFCCs.
Now, CMN has been applied to the 39-dimensional feature vectors of an input speech signal. In this approach, at first the mean μ(i) of each of the 39 feature vector components has been estimated as shown in equation (8), where N is the total number of frames of the speech signal and c_t(i) is the i-th component of the original feature vector of the t-th frame. Using CMN, c_t(i) has been linearly transformed into the normalized feature component ĉ_t(i) as shown in equation (9).

μ(i) = (1/N) Σ_{t=1}^{N} c_t(i)    (8)

ĉ_t(i) = c_t(i) − μ(i)    (9)
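Equations (8) and (9) amount to subtracting the per-component mean over all frames of an utterance; a minimal sketch:

```python
# Minimal sketch of cepstral mean normalization (equations (8)-(9)) on the
# 39-dimensional feature matrix of one utterance.
import numpy as np

def cmn(features):
    """features: array of shape (N, 39). Subtract the per-component mean over the N frames."""
    mu = features.mean(axis=0)     # equation (8): component-wise mean
    return features - mu           # equation (9): mean-subtracted features
```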


ESTIMATION OF BD VALUES
The BD between two normal distributions N(μ_1, σ_1²) and N(μ_2, σ_2²) can be estimated as shown in equation (10), where μ_1 and μ_2 are the means and σ_1² and σ_2² are the variances of the two distributions.

BD = (1/4) ln[ (1/4) (σ_1²/σ_2² + σ_2²/σ_1² + 2) ] + (1/4) (μ_1 − μ_2)² / (σ_1² + σ_2²)    (10)

In this work, BD values have been estimated using equation (10) between each pair of training speech feature vectors and testing speech feature vectors, where both the training and testing feature vectors correspond to the same English vowel phoneme. Thus, eight sets of BD values have been estimated, corresponding to the eight sets of testing speech signals, where each set contains 47 × 46 × 12 = 25944 BD values, because 47 training speech signals and 46 testing speech signals are available in each testing speech database for each of the 12 English vowel phonemes.
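A minimal sketch of equation (10) for two univariate Gaussian distributions characterized by their means and variances; the function name is illustrative.

```python
# Minimal sketch of the Bhattacharyya distance of equation (10) between
# N(mu1, var1) and N(mu2, var2).
import numpy as np

def bhattacharyya_distance(mu1, var1, mu2, var2):
    term_var = 0.25 * np.log(0.25 * (var1 / var2 + var2 / var1 + 2.0))   # variance term
    term_mean = 0.25 * (mu1 - mu2) ** 2 / (var1 + var2)                  # mean-separation term
    return term_var + term_mean
```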
STATISTICAL ANALYSIS BASED ON ESTIMATED BD VALUES
For the statistical analysis, the mean and standard deviation of the BD values corresponding to each of the testing speech databases have been estimated; these statistics are shown in Table 1. From Table 1, it has been observed that in all cases the mean and standard deviation of the BD values decrease after applying the speech enhancement techniques.
Then, the correlations of the noise-free testing signals with each of the noisy versions of the testing speech signals have been estimated based on the estimated means of the BD values corresponding to each of the testing speech databases, and the estimated values are presented in Figure 1. From these values it has been observed that after speech enhancement the correlations corresponding to each of the noisy versions of the testing speech signals increase. It has also been observed that the standard deviation of the estimated correlations decreases after speech enhancement: the standard deviation of the correlations before speech enhancement is 0.217 and after speech enhancement it becomes 0.003.

Figure 1: Curves for Correlations of Mean of BD Values before and after Speech Enhancement

Table 1: Mean and Standard Deviation of BD Values Corresponding to Each Noise Type before and after Speech Enhancement

Noise Type       | Mean BD (Before Enhancement) | Mean BD (After Enhancement) | Avg. Std. Dev. of BD (Before) | Avg. Std. Dev. of BD (After)
Noise free       | 39.07                        | 23.60                       | 14.97                         | 8.42
Babble           | 57.23                        | 23.68                       | 19.42                         | 8.82
Pink             | 67.10                        | 23.85                       | 22.37                         | 8.55
White            | 83.37                        | 23.83                       | 31.11                         | 8.67
Volvo            | 63.46                        | 23.98                       | 22.73                         | 8.56
Factory          | 56.93                        | 22.96                       | 18.30                         | 8.58
Destroyerengine  | 80.26                        | 23.37                       | 29.49                         | 8.67
Destroyerops     | 61.78                        | 23.57                       | 20.62                         | 8.67

Finally, t-tests and f-tests have also been performed in this work on the mean BD values corresponding to the noise-free version and each noisy version of the testing speech signals. In Table 2, the results of the t-test and f-test comparing the noise-free version with each noisy version of the speech signals, before and after speech enhancement, are presented.
Here, the null hypothesis for the t-test is that the mean BD values related to the noise-free and the noisy version of the speech signals are independent random samples from normal distributions with equal means and equal but unknown variances. The alternative hypothesis is that the mean BD values related to the noise-free and the noisy versions of the speech signals come from populations with unequal means. If the result of the t-test is 0, the test accepts the null hypothesis; if it is 1, the test rejects the null hypothesis at the 5% significance level.
The null hypothesis for the f-test is that the mean BD values related to the noise-free and the noisy version of the speech signals come from normal distributions with the same variance. The alternative hypothesis is that they come from normal distributions with different variances. If the result of the f-test is 0, the test accepts the null hypothesis; if the result is 1, the test rejects the null hypothesis at the 5% significance level.
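The paper does not name the tool used for these tests. The sketch below shows how the two tests described above could be run with scipy, returning 1 when the null hypothesis is rejected at the 5% level and 0 otherwise; the two-sided F-test is assembled from the F distribution directly, and the function names are illustrative.

```python
# Minimal sketch of the two-sample t-test (equal unknown variances) and the
# variance-ratio F-test at the 5% significance level.
import numpy as np
from scipy import stats

def t_test(a, b, alpha=0.05):
    _, p = stats.ttest_ind(a, b, equal_var=True)   # null: equal means, equal unknown variances
    return int(p < alpha)                          # 1 = reject null, 0 = accept

def f_test(a, b, alpha=0.05):
    f = np.var(a, ddof=1) / np.var(b, ddof=1)      # ratio of sample variances
    dfn, dfd = len(a) - 1, len(b) - 1
    p = 2.0 * min(stats.f.cdf(f, dfn, dfd), stats.f.sf(f, dfn, dfd))  # two-sided p-value
    return int(p < alpha)                          # 1 = reject null, 0 = accept
```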
Table 2: Results of t-Test and f-Test for Each Pair of Noise-Free and Noisy Versions of Speech Signals before and after Speech Enhancement, using Mean BD Values

Noise Type       | Before Enhancement (t-Test / f-Test) | After Enhancement (t-Test / f-Test)
Babble           | 1 / 1                                | 1 / 0
Pink             | 1 / 0                                | 1 / 0
White            | 1 / 1                                | 1 / 0
Volvo            | 1 / 0                                | 1 / 0
Factory          | 1 / 0                                | 1 / 0
Destroyerengine  | 1 / 1                                | 1 / 0
Destroyerops     | 1 / 0                                | 1 / 0

ASR EXPERIMENTS
For the ASR experiments, a left-to-right HMM with 8 states and 12 Gaussian mixture components per state has been implemented for the training and testing processes of the ASR system, to recognize the 12 English vowel phonemes.
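The paper does not state which toolkit was used. The sketch below shows one way to set up a comparable left-to-right HMM with 8 states and 12 diagonal-covariance Gaussian mixtures per state using the hmmlearn library, with one model per vowel; the library choice, names and training outline are assumptions, not the authors' implementation.

```python
# Minimal sketch: left-to-right GMM-HMM (8 states, 12 mixtures per state) per vowel.
import numpy as np
from hmmlearn.hmm import GMMHMM

def left_to_right_hmm(n_states=8, n_mix=12):
    model = GMMHMM(n_components=n_states, n_mix=n_mix,
                   covariance_type="diag", init_params="mcw", params="stmcw")
    # Left-to-right topology: start in state 0; allow only self and forward transitions.
    model.startprob_ = np.eye(n_states)[0]
    trans = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        trans[i, i] = trans[i, i + 1] = 0.5
    trans[-1, -1] = 1.0
    model.transmat_ = trans
    return model

# Training/recognition outline: fit one model per vowel on its 39-dimensional feature
# sequences, then label a test utterance with the vowel whose model scores highest.
# models = {vowel: left_to_right_hmm().fit(np.vstack(seqs), [len(s) for s in seqs])
#           for vowel, seqs in training_data.items()}
# predicted = max(models, key=lambda v: models[v].score(test_features))
```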
The results of the ASR experiments are shown in Table 3. From these results, it has been observed that, except for the noisy speech signals contaminated with Destroyerops noise, the recognition performance improves for all other types of noisy speech signals after speech enhancement. The highest improvement in robustness of the ASR system has been achieved for speech signals contaminated with Babble noise, where a 15.76% increase in recognition rate is observed. On the other hand, for noise-free speech signals and for noisy speech signals contaminated with Destroyerops noise, the recognition performance degrades after speech enhancement.
Table 3: Speech Recognition Rate (in %) Corresponding to the Noise-Free and 7 Types of Noisy Speech Signals, with SSS, FS and CMN as Speech Enhancement Techniques

Noise Type       | Before Speech Enhancement | After Speech Enhancement
Noise free       | 91.49                     | 90.40
Babble           | 56.88                     | 72.64
Pink             | 85.51                     | 88.22
White            | 80.98                     | 88.41
Volvo            | 87.86                     | 88.95
Factory          | 81.34                     | 89.13
Destroyerengine  | 53.44                     | 68.84
Destroyerops     | 76.63                     | 75.36

CONCLUSIONS
Improving the robustness of ASR systems in noisy conditions is one of the challenging issues in the field of speech recognition research. In this research work, SSS, FS and CMN are used as speech enhancement techniques to reduce the effect of noise on noisy speech signals. The application of speech enhancement techniques to speech signals always modifies their parametric representations. In this research work, a statistical analysis is presented on the impact of speech enhancement techniques on the feature vectors of noise-free and noisy speech signals for speech recognition, using BD values computed from the feature vectors of noise-free training speech signals to the feature vectors of noise-free and noisy testing speech signals. This analysis shows that in all cases the BD values decrease for the feature vectors of enhanced speech signals, and that the correlations after speech enhancement increase for all types of noisy speech signals. In the final part of this research work, ASR experiments are also performed with enhanced speech signals and with speech signals without any enhancement. The results show an improvement in recognition rate using SSS, FS and CMN in all cases except the noise-free speech signals and the noisy speech signals with noise type Destroyerops. From the statistical analysis and the recognition performances, no specific pattern has been observed that relates the impact of the speech enhancement techniques on the feature vectors to the recognition performance of the ASR system.
REFERENCES
1. Apte, S. D. (2012). Speech and Audio Processing. Wiley India.
2. Martin, R. (1994). Spectral subtraction based on minimum statistics. In Proceedings of the European Signal Processing Conference (EUSIPCO).
3. Zheng, F., Zhang, G., & Song, Z. (2001). Comparison of different implementations of MFCC. Journal of Computer Science and Technology, 16(6), 582-589.
4. Hillenbrand, J., Getty, L. A., Clark, M. J., & Wheeler, K. (1995). Acoustic characteristics of American English vowels. The Journal of the Acoustical Society of America, 97(5), 3099-3111. Retrieved from: http://homepages.wmich.edu/~hillenbr/voweldata.html.
5. NOISEX-92 noise database. Retrieved from: http://spib.rice.edu/spib/select noise.html.



6. Junhui, Z., Xiang, X., & Jingming, K. (2002, August). Noise suppression based on auditory-like filters for robust speech recognition. In Signal Processing, 2002 6th International Conference on (Vol. 1, pp. 560-563). IEEE.
7. Karam, M., Khazaal, H. F., Aglan, H., & Cole, C. (2014). Noise removal in speech processing using spectral subtraction. Journal of Signal and Information Processing, 2014.
8. Matsumoto, H., & Naitoh, N. (1996, October). Smoothed spectral subtraction for a frequency-weighted HMM in noisy speech recognition. In Spoken Language, 1996. ICSLP 96. Proceedings, Fourth International Conference on (Vol. 2, pp. 905-908). IEEE.
9. Koniaris, C., Kuropatwinski, M., & Kleijn, W. B. (2010). Auditory-model based robust feature selection for speech recognition. The Journal of the Acoustical Society of America, 127(2), EL73-EL79.
10. Okawa, S., Bocchieri, E., & Potamianos, A. (1998, May). Multi-band speech recognition in noisy environments. In Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on (Vol. 2, pp. 641-644). IEEE.
11. Cui, X., & Alwan, A. (2005). Noise robust speech recognition using feature compensation based on polynomial regression of utterance SNR. IEEE Transactions on Speech and Audio Processing, 13(6), 1161-1172.
12. Jiang, H., Hirose, K., & Huo, Q. (2000). A minimax search algorithm for robust continuous speech recognition. IEEE Transactions on Speech and Audio Processing, 8(6), 688-694.
13. Xu, H., Dalsgaard, P., Tan, Z. H., & Lindberg, B. (2007). Noise condition-dependent training based on noise classification and SNR estimation. IEEE Transactions on Audio, Speech, and Language Processing, 15(8), 2431-2443.
14. Kalinli, O., Seltzer, M. L., Droppo, J., & Acero, A. (2010). Noise adaptive training for robust automatic speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 18(8), 1889-1901.
15. Nasersharif, B., & Akbari, A. (2005, September). Improved HMM entropy for robust sub-band speech recognition. In Signal Processing Conference, 2005 13th European (pp. 1-4). IEEE.
16. Viikki, O., & Laurila, K. (1998). Cepstral domain segmental feature vector normalization for noise robust speech recognition. Speech Communication, 25(1), 133-147.
17. Boll, S. (1979). Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech, and Signal Processing, 27(2), 113-120.
18. Berouti, M., Schwartz, R., & Makhoul, J. (1979, April). Enhancement of speech corrupted by acoustic noise. In Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP '79 (Vol. 4, pp. 208-211). IEEE.
19. Sim, B. L., Tong, Y. C., Chang, J. S., & Tan, C. T. (1998). A parametric formulation of the generalized spectral subtraction method. IEEE Transactions on Speech and Audio Processing, 6(4), 328-337.
20. Strand, O. M., & Egeberg, A. (2004). Cepstral mean and variance normalization in the model domain. In COST278 and ISCA Tutorial and Research Workshop (ITRW) on Robustness Issues in Conversational Interaction.
21. Arora, S. V. (2013). Effect of time derivatives of MFCC features on HMM based speech recognition system. International Journal on Signal and Image Processing, 4(3), 50.
22. Sharma, S., Shukla, A., & Mishra, P. (2014). Speech and Language Recognition using MFCC and DELTA-MFCC. International Journal of Engineering Trends and Technology (IJETT), 12(9), 449-452.

