IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 20, NO. 5, JULY 2012
VTLN performs speaker normalization by reducing the variabilities in the spectra of speech signals that arise due to differences in the vocal tract lengths (VTL) of speakers uttering the same sound [7]. The normalization is achieved by either compressing or expanding the speech spectrum and is usually referred to as scaling. This scaling is usually specified through a mathematical relation of the type $f' = \psi_{\alpha}(f)$, where $f'$ is the warped frequency and $\psi_{\alpha}$ is the frequency-warping function. It is commonly assumed that the spectra of different speakers uttering the same sound are linearly scaled versions of one another [7], [8], i.e., $S_{A}(f) = S_{B}(\alpha f)$. We would like to make it clear to the reader that, though the discussion in this paper assumes linear scaling of the spectra, the methods developed in this paper can be applied to any arbitrary warping function. VTLN requires the estimation of only a single parameter, called the warp-factor $\alpha$, for normalization and hence requires very little acoustic data, unlike adaptation-based methods (e.g., MLLR and CMLLR). However, the practical implementation of conventional VTLN follows a maximum-likelihood (ML) based grid search over a pre-defined range of warp-factors. This requires the features to be generated for all the warp-factors after appropriate modification of the spectra. The ML estimate of the warp-factor is then found by evaluating the likelihood of the warped features $X^{\alpha}$ with respect to the acoustic model $\lambda$ and the transcription $W$, and is given by

$\hat{\alpha} = \arg\max_{\alpha} \Pr(X^{\alpha} \mid \lambda, W)$    (1)

Index Terms—Automatic speech recognition (ASR), linear transformation, Mel frequency cepstral coefficient (MFCC), speaker normalization, vocal tract length normalization (VTLN).
I. INTRODUCTION

INTER-SPEAKER variability is a major source of performance degradation in speaker-independent (SI) automatic speech recognition (ASR) systems. Most state-of-the-art systems now incorporate vocal-tract length normalization (VTLN) as an integral part of the system to reduce inter-speaker variability and hence improve the recognition performance [1]–[6].
Manuscript received December 27, 2010; revised July 08, 2011 and January
15, 2012; accepted January 23, 2012. Date of publication January 31, 2012; date
of current version March 21, 2012. This work was done while the authors were
at the Department of Electrical Engineering, Indian Institute of Technology,
Kanpur. This work was supported in part by the Department of Science and
Technology, Ministry of Science and Technology, India, under SERC project
SR/S3/EECE/058/2008. The associate editor coordinating the review of this
manuscript and approving it for publication was Prof. Steve Renals.
D. R. Sanand is with the Norwegian University of Science and Technology,
NO-7491 Trondheim, Norway (e-mail: drsanand@gmail.com).
S. Umesh is with the Department of Electrical Engineering, Indian Institute
of Technology Madras, Chennai-600036, India (e-mail: umeshs@ee.iitm.ac.in).
Digital Object Identifier 10.1109/TASL.2012.2186289
where $X^{\alpha}$ consists of the static features obtained after frequency-warping the spectra by warp-factor $\alpha$, appended with differential and acceleration coefficients. In some systems, linear discriminant analysis (LDA) is applied over a window of such warped consecutive frames to account for dynamic variations before obtaining the final feature-vector.
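The ML grid-search for the warp-factor described above can be sketched as follows. The score function here is a hypothetical stand-in for the log-likelihood of the warped features $X^{\alpha}$ given the acoustic model and transcription, and the grid limits are illustrative assumptions, not values from the paper.

```python
import numpy as np

def grid_search_warp_factor(score_fn, alphas):
    """Return the warp-factor in `alphas` that maximizes score_fn(alpha)."""
    scores = [score_fn(a) for a in alphas]
    return alphas[int(np.argmax(scores))]

# Toy stand-in for the ML score of warped features against (model, transcript):
# a quadratic peaking at a "true" speaker warp-factor of 0.96.
score = lambda a: -(a - 0.96) ** 2
alphas = np.arange(0.88, 1.1201, 0.02)  # an assumed pre-defined grid
best = grid_search_warp_factor(score, alphas)
```

In a real system each `score_fn(alpha)` call requires a full feature-extraction pass with warp-factor `alpha` followed by a forced alignment, which is exactly the cost the linear-transformation approach later avoids.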
Recently there has been a lot of interest in obtaining a direct linear transformation between the static conventional Mel frequency cepstral coefficient (MFCC) features $c$ and the static VTLN-warped MFCC $c^{\alpha}$, i.e.,

$c^{\alpha} = A_{\alpha}\, c$    (2)

where $A_{\alpha}$ represents a matrix transformation.
One of the early attempts to obtain a linear transformation (LT) on the cepstra for speaker normalization was by Acero et al. [9], [10]. They showed that the warped cepstral coefficients can be obtained at the outputs of the filters at time $n = 0$, by formulating the bilinear transform as a linear filtering operation and having the time-reversed cepstrum sequence as the input. McDonough et al. [11] proposed a linear transformation using generalizations of the bilinear transform known as all-pass transforms.
SANAND AND UMESH: VTLN USING ANALYTICALLY DETERMINED LINEAR-TRANSFORMATION ON CONVENTIONAL MFCC
Fig. 2. Conventional framework for generating warped features in VTLN. The filter-bank is inversely scaled instead of re-sampling the speech signal for each warp-factor, for efficient implementation.
The conventional MFCC features are computed as

$c = D \log(H s)$    (3)

where $s$ is the magnitude spectrum, $H$ is the Mel filter-bank matrix, and $D$ is the DCT matrix. The DCT matrix is given by

$[D]_{m,n} = \beta_{m} \cos\left(\frac{\pi m (n - 0.5)}{M}\right)$    (4)

and the scaling factor $\beta_{m}$ is defined as $\beta_{m} = \sqrt{1/M}$ for $m = 0$ and $\beta_{m} = \sqrt{2/M}$ otherwise. Here, $M$ is the number of filters used in the Mel filter-bank and $N_{c}$ is the number of cepstral coefficients.
As an illustration, let the speech frame consist of 320 samples. A 512-point DFT is applied to obtain the 256-dimensional vector $s$ whose elements are the magnitudes of the DFT coefficients for one half of the spectrum; this suffices because the magnitude spectrum has even symmetry. If a 20-filter Mel filter-bank smoothing is applied, then $H$ is a $20 \times 256$ matrix that operates on $s$ to obtain the Mel-warped smoothed spectrum. $D$ is the $20 \times 20$ DCT matrix applied on the log-compressed Mel-warped smoothed spectrum to obtain the MFCC feature vector $c$. In practice, only the first 16 cepstral coefficients are used, and one may use a $16 \times 20$ DCT transformation.
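The illustration above can be sketched in code. This is a minimal numpy sketch of the $c = D \log(H s)$ pipeline, assuming a 16 kHz sampling rate and the HTK-style Mel scale; the exact placement of filter edges differs across toolkits, so this is illustrative rather than a reference front-end.

```python
import numpy as np

def hz_to_mel(f):  # HTK-style Mel scale (an assumption, not from the paper)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=20, n_fft=512, fs=16000):
    """Triangular Mel filter-bank matrix H of shape (n_filters, n_fft//2)."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft // 2) * mel_to_hz(mel_pts) / (fs / 2)).astype(int)
    H = np.zeros((n_filters, n_fft // 2))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):                 # rising edge of triangle i
            H[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                 # falling edge of triangle i
            H[i, k] = (r - k) / max(r - c, 1)
    return H

def dct_matrix(n_ceps, n_filters):
    """DCT-II matrix D of shape (n_ceps, n_filters), orthonormal scaling."""
    n = np.arange(n_filters)
    D = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filters)
    D *= np.sqrt(2.0 / n_filters)
    D[0] *= np.sqrt(0.5)
    return D

# c = D log(H s): 320-sample frame, 512-point DFT, 20 filters, 16 cepstra
frame = np.random.default_rng(0).standard_normal(320)
s = np.abs(np.fft.rfft(frame, 512))[:256]   # half-spectrum magnitude vector
H = mel_filterbank()                        # 20 x 256
D = dct_matrix(16, 20)                      # 16 x 20
c = D @ np.log(H @ s + 1e-10)               # 16-dimensional MFCC vector
```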
In the original method of Andreou et al. [7], VTLN features are obtained by frequency-warping the magnitude spectra before applying the unwarped Mel filter-bank. This is done by re-sampling the signal; therefore, in this case the signal is warped for each VTLN warp-factor, while the Mel filter-bank is left unchanged. Lee and Rose [8] proposed an efficient alternate implementation, where the Mel filter-bank is inverse-scaled for each $\alpha$, while the signal spectrum is left unchanged, as shown in Fig. 2. This is the most popular method of VTLN-warping. Therefore, in the Lee-Rose method, VTLN-warping is integrated into the Mel filter-bank, and $H_{\alpha}$ denotes the (inverse) VTLN-warped Mel filter-bank. Conventionally, the warp-factor $\alpha$ used for warping the spectra lies in a pre-defined range around unity.
Fig. 4. The piece-wise linear warping function used in conventional VTLN, motivated by physiological arguments, is shown. The slope of the warping function is changed at F to avoid bandwidth mismatch after frequency scaling.
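A piece-wise linear warping function of the kind shown in Fig. 4 can be sketched as follows. The break frequency `f_break` and the convention for the second slope (chosen so that the maximum frequency maps onto itself, preserving the bandwidth) are illustrative assumptions, not the paper's exact definition.

```python
import numpy as np

def piecewise_linear_warp(f, alpha, f_break, f_max):
    """Warp frequency f with slope alpha up to f_break, then with the slope
    that maps f_max onto itself so the bandwidth is preserved."""
    f = np.asarray(f, dtype=float)
    lo = alpha * f
    # second segment: line through (f_break, alpha*f_break) and (f_max, f_max)
    slope2 = (f_max - alpha * f_break) / (f_max - f_break)
    hi = alpha * f_break + slope2 * (f - f_break)
    return np.where(f <= f_break, lo, hi)

f = np.linspace(0.0, 8000.0, 9)
warped = piecewise_linear_warp(f, alpha=0.9, f_break=7000.0, f_max=8000.0)
```

With `alpha < 1` the second segment has slope greater than one, pulling the compressed spectrum back out to the full bandwidth; with `alpha > 1` the roles reverse.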
Although the frequency-warping is motivated by physiological arguments that changes in VTL manifest as spectral scaling, the methods developed in this paper can be applied to any arbitrary warping function.
The warped cepstral features $c^{\alpha}$ are given by

$c^{\alpha} = D \log(H_{\alpha} s)$    (7)

These are obtained by first warping and smoothing the power spectrum, followed by the log and DCT operations. The filter-bank is integrated with both Mel- and VTLN-warping, to perform smoothing as well as scaling of the spectrum. Observing (3) and (7), it is clear that the only difference between the conventional and VTLN-warped MFCC features is the change in the filter-bank structure, while the rest of the operations are the same.
For the case of $\alpha = 1$, $c^{\alpha}$ exactly corresponds to conventional MFCC without VTLN-warping. From (3) and (7), the relation between $c$ and $c^{\alpha}$ is given by

$c^{\alpha} = D \log\left(H_{\alpha} H^{-1} \exp(D^{-1} c)\right)$    (8)

Fig. 5. Modification in the signal processing steps (separating the Mel- and VTLN-warping) for realizing a linear-transformation. The filter-bank performs only Mel-warping of the spectra and the proposed band-limited interpolation matrix performs the VTLN warping.

If instead a transformation $T_{\alpha}$ can be found that is applied on $u$, the log-compressed Mel filter-bank output, to obtain its warped counterpart $u^{\alpha}$, then the filter-bank $H$ performs only Mel warping and the transformation $T_{\alpha}$ performs the VTLN-warping. This means that the VTLN-warping integrated in the filter-bank for efficient implementation in the conventional approach [8] is now performed separately and is not a part of the filter-bank construction. This is illustrated in Fig. 5. If such a relation can be obtained, then from (3), the relation between $c$ and $c^{\alpha}$ is given by

$c^{\alpha} = D\, T_{\alpha}\, D^{-1} c$    (11)
A linear transformation between $c^{\alpha}$ and $c$ (or $s$) can be derived if all the intermediate operations can be represented as linear operations, but from (8), it is evident that log is a nonlinear operation, and in practice $H^{-1}$ does not exist. This is because the power spectrum cannot be completely reconstructed from the filter-bank outputs due to the smoothing operation [16]. We need to obtain $s$, since conventional VTLN warping relations are always specified in the linear-frequency (Hz) domain, usually through a mathematical relation of the type $f' = \psi_{\alpha}(f)$, where $f'$ is the warped frequency and $\psi_{\alpha}$ is the frequency-warping function. Therefore, in this case, it is not possible to completely recover $s$ from the filter-bank output, and hence a linear transformation is not possible.
In the next section, we show that separating the frequency-warping operation from the filter-bank avoids the need to invert the filter-bank operation or the logarithm, and allows us to derive a linear transformation on conventional MFCC.
III. REALIZING A LINEAR-TRANSFORMATION
In this section, we show that separating the VTLN-warping
(speaker scaling) from the Mel filter-bank helps us to derive a
linear-transformation (LT) between warped and unwarped cepstral features within the conventional MFCC framework. Let $u = \log(H s)$ be the log-compressed Mel-warped filter-bank output. From (3), we see that knowledge of $c$ implies knowledge of $u$, as they form a DCT pair, i.e.,

$c = D u \;\Longleftrightarrow\; u = D^{-1} c$    (9)
However, we cannot completely recover $s$ from $u$ because of the filter-bank smoothing operation. Since $s$ cannot be completely recovered, we re-frame the problem as follows: can $u^{\alpha}$ be obtained by applying a linear transformation on $u$ without recovering $s$, i.e.,

$u^{\alpha} = T_{\alpha} u$    (10)
By defining a LT between $u$ and $u^{\alpha}$, we completely avoid the inversion of the filter-bank for obtaining the raw magnitude spectrum $s$, and also bypass the log operation. We would like to remind the reader that the VTLN-warping relation is usually specified in the linear-frequency (Hz) domain, and therefore, at this point it is not clear what the relation between $u$ and $u^{\alpha}$ should be. In the next subsection, we describe a method to obtain a LT using the idea of band-limited interpolation.
A. Band-Limited (Sinc-) Interpolation
For a band-limited continuous-time signal $x(t)$, given uniformly spaced samples of the signal that are appropriately sampled, i.e., at or above the Nyquist rate, we can exactly reconstruct the original continuous-time signal. This implies that we can recover the values of the time signal at time-instants other than those of the uniformly spaced samples. We use this idea to obtain the LT for VTLN-warping, except that we now consider que-frency-limited signals instead of frequency-limited signals.
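The band-limited interpolation idea can be illustrated directly: given uniform samples of a signal sampled above the Nyquist rate, values at off-grid instants are recovered with a (truncated) sinc sum. The signal and sample counts below are arbitrary choices for illustration.

```python
import numpy as np

def sinc_interpolate(samples, T, t_query):
    """Reconstruct a band-limited signal at arbitrary instants t_query from
    uniformly spaced samples x[n] = x(nT) via sinc interpolation:
    x(t) = sum_n x[n] * sinc((t - nT)/T), with np.sinc the normalized sinc."""
    n = np.arange(len(samples))
    return np.array([np.sum(samples * np.sinc((t - n * T) / T))
                     for t in t_query])

# A 5 Hz tone sampled at 100 Hz (well above Nyquist) for 2 seconds
T = 0.01
n = np.arange(200)
x = np.sin(2 * np.pi * 5.0 * n * T)
t_new = np.array([0.503, 1.007])        # off-grid time instants
x_new = sinc_interpolate(x, T, t_new)   # close to sin(2*pi*5*t_new)
```

The sum is truncated to the available samples, so the recovery is only approximate near the edges of the record; the matrix form of exactly this operation is what the interpolation matrix in Section III implements in the que-frency-limited setting.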
The vector $u$ can be obtained either by applying a nonuniform filter-bank (shown in Fig. 3) on the linear-frequency (Hz) magnitude spectrum, or by applying a uniformly spaced filter-bank (shown in Fig. 6) on the Mel-warped magnitude spectrum. Therefore, in the Mel-frequency domain, the continuous Mel-warped log-compressed spectrum, $u(\nu)$, can be interpreted as the output of convolving a triangle function with the Mel-warped magnitude spectrum, followed by a log operation on the amplitudes. We can think of the vector $u$ as being obtained by uniformly sampling $u(\nu)$ at points whose positions exactly correspond to the center frequencies of the filter-bank. Because of the triangle smoothing and the subsequent log operation on the output (which reduces the dynamic range), the que-frency content of this log-compressed smoothed spectrum is confined to the low que-frency region. Fig. 7 compares the cepstral coefficients obtained with and without
filter-bank smoothing. We see that the cepstral coefficients die down faster with filter-bank smoothing, indicating that the que-frency content is limited to the low que-frency region.

Fig. 6. The change in the filter-bank structure with VTLN-warping in the Mel-frequency domain is illustrated. The filters have uniformly spaced center frequencies with uniform bandwidth for $\alpha = 1.00$. However, they are nonuniformly spaced for $\alpha$ different from unity.

Fig. 7. The effect of filter-bank smoothing on the cepstral coefficients is illustrated. Filter-bank smoothing helps limit the que-frency content to the lower region, ensuring que-frency limitedness.

During VTLN-warping, the filter center frequencies are appropriately scaled in the linear-frequency (Hz) domain by inverse-$\alpha$, as described in Lee-Rose [8]. This corresponds to the center frequencies of the filter-bank being nonuniformly spaced in the Mel-frequency domain, as shown in Fig. 6. As we represent the log-compressed Mel-warped smoothed magnitude spectrum by the continuous function $u(\nu)$, the output of the VTLN-warped filter-bank corresponds to sampling $u(\nu)$ nonuniformly. These nonuniformly spaced samples exactly correspond to the elements of the vector $u^{\alpha}$.
From the above discussion, we point out that the elements of the vector $u$ can be interpreted as uniformly spaced samples, and the elements of $u^{\alpha}$ as nonuniformly spaced samples, of the same continuous function $u(\nu)$. The main idea is that, given the samples in $u$, the samples (or elements) in $u^{\alpha}$ can be reconstructed using band-limited interpolation, provided that the cepstrum is que-frency limited.
Let $u(\nu)$ and its cepstrum form a discrete-time Fourier transform (DTFT) pair. Then sampling $u(\nu)$ would result in periodic repetition of the cepstrum. As long as the cepstrum is strictly que-frency limited and the samples are spaced closely enough, the periodic repetitions do not overlap and band-limited interpolation is exact.
Fig. 8. The band-limited interpolation for the linear-scaling relation is illustrated. Warping is defined in the linear-frequency (Hz) domain, or x-axis, i.e., $f' = \alpha f$. Along the y-axis are the center frequencies of the uniformly spaced filter-bank corresponding to $\alpha = 1.00$ in the Mel-domain, together with the center frequencies of the warped filter-bank, which are nonuniformly spaced in the Mel-domain. The band-limited interpolation matrix is defined to obtain samples at the warped positions given samples at the unwarped positions. In the figure, unwarped and warped frequencies are marked in both the linear-frequency (Hz) and Mel-frequency domains.
The unwarped sample positions are obtained from (12), the warped sample positions from (13), and the VTLN-warping relation is specified in the linear-frequency (Hz) domain in (13). Using the even-symmetry property, we obtain the $N \times N$ interpolation matrix $T_{\alpha}$, i.e., $u^{\alpha} = T_{\alpha} u$. The matrix is given by (14), where $\nu_{N}$ denotes the Nyquist frequency in the Mel-frequency domain. Here, we assume that the signal is periodic with a period of twice the Nyquist frequency and even-symmetric. Therefore, theoretically, half-filters are present at indices 0 and 1; the values at these indices are required for performing band-limited interpolation. If we assume that $u$ is que-frency limited, the elements of $T_{\alpha}$ can be determined as in (17) [we use a different summation variable, since the one in (14) is already in use], where $M$ is the number of filters. The matrices entering (14) are given by (15) and (16).
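Since the exact entries of (14)–(17) depend on the paper's boundary conventions, the following is only an illustrative sketch of the construction: a periodic-sinc (Dirichlet-kernel) interpolation matrix that evaluates a que-frency-limited function at warped sample positions from its uniform samples, with linear scaling of the positions standing in for the warping relation. The sizes and the scaled-position warping are assumptions for illustration.

```python
import numpy as np

def periodic_sinc(t, N):
    """Dirichlet (periodic sinc) kernel for even N: exactly interpolates
    N-periodic sequences whose harmonics lie strictly below N/2."""
    t = np.asarray(t, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        val = np.sin(np.pi * t) / (N * np.tan(np.pi * t / N))
    # the limit is 1 wherever t is a multiple of the period N (N even)
    return np.where(np.isclose(np.sin(np.pi * t / N), 0.0), 1.0, val)

def interp_matrix(nu_uniform, nu_warped, N):
    """T[i, k]: weight of uniform sample k when evaluating at warped point i."""
    return periodic_sinc(nu_warped[:, None] - nu_uniform[None, :], N)

N = 40
nu = np.arange(N, dtype=float)           # uniform Mel-domain sample positions

def u_cont(x):
    """A que-frency-limited 'log spectrum': only low-order cosine components."""
    return 1.0 + 0.5 * np.cos(2 * np.pi * x / N) + 0.2 * np.cos(4 * np.pi * x / N)

alpha = 0.94
nu_w = alpha * nu                        # warped (here: linearly scaled) positions
T_alpha = interp_matrix(nu, nu_w, N)     # the interpolation matrix
u = u_cont(nu)                           # uniform samples (filter-bank outputs)
u_w = T_alpha @ u                        # interpolated samples at warped positions
# The cepstral-domain transformation would then follow as
# A_alpha = D @ T_alpha @ inv(D), as in (11), for a square DCT matrix D.
```

Because the test function is strictly que-frency limited, the interpolated values `u_w` agree with direct evaluation of the continuous function at the warped positions, which is the property the linear transformation relies on.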
B. Cosine-Interpolation

Motivated by the work of Umesh et al. [21], Panchapagesan and Alwan [22], [23] proposed a linear-transformation approach that incorporates the interpolation and warping in the inverse discrete cosine transform (IDCT) matrix; we refer to this approach as Cosine-interpolation. Considering the continuous Mel-frequency variable, the signal is assumed to be periodic with a period of twice the Mel Nyquist frequency and symmetric about the end points of the Mel-frequency range. A normalization variable is then defined, as given in (21), where the normalized half-sample-shifted positions of the Mel filter-bank enter the frequency-warping function. The relation between the warped and unwarped cepstral features is given by (23).
Fig. 10. Comparing the VTLN-warped cepstra obtained using the conventional
and the proposed Sinc-interpolation approach for piece-wise linear and bilinear
warping functions. (a) Piece-wise linear warping. (b) Bilinear warping.
The same transformation matrix can be used to obtain the VTLN-warped delta and acceleration coefficients. Therefore, the relation between the unwarped and VTLN-warped features is given by (25), where the supervector is formed by concatenating all the static MFCC cepstra from the adjacent frames within the window length.

TABLE I
DESCRIPTION OF THE CORPUS USED FOR EXPERIMENTS
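The supervector relation described above can be sketched as a block-diagonal application of the static transformation: since every frame in the window is warped by the same matrix, the concatenated static cepstra are transformed by a block-diagonal matrix, and the delta and acceleration coefficients, being linear in the static cepstra, follow. The sizes below are small illustrative choices.

```python
import numpy as np

n_ceps, win = 4, 3                           # small sizes for illustration
rng = np.random.default_rng(1)
A = rng.standard_normal((n_ceps, n_ceps))    # stand-in for the VTLN LT matrix
frames = rng.standard_normal((win, n_ceps))  # static cepstra of 3 frames

# Supervector of concatenated static cepstra, transformed by diag(A, A, A)
supervector = frames.reshape(-1)
A_block = np.kron(np.eye(win), A)            # block-diagonal matrix
warped_super = A_block @ supervector

# Equivalent to warping each frame individually
warped_frames = frames @ A.T
```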
Using the estimated warped features, a new VTLN model is obtained. For performing the warp-factor estimation during testing, we use a Gaussian mixture model (GMM) classifier [38]. Unwarped features corresponding to each warping-factor obtained in training are used to train a GMM with 256 mixtures. The optimal warping-factor in recognition is obtained by calculating the likelihood with respect to each warping-factor GMM and choosing the one that gives the best likelihood. The warping-factors are estimated at the speaker level in training and at the utterance level during recognition. During warped feature extraction, we map the frequency points zero and $\pi$ onto themselves using piece-wise linear warping. We do not account for the Jacobian in VTLN for the experiments presented in this paper.
V. IMPLEMENTATION DETAILS
In this section, we present the implementation details
for Sinc- and Cosine-interpolation approaches. Later, we
present the recognition results comparing the performance of
linear-transformation approaches with conventional VTLN.
A. Cosine-Interpolation

In this section, we discuss the implementation details for Cosine-interpolation and argue that the range of warping-factors has to be properly mapped, either in the Mel-domain or in the linear-frequency (Hz) domain, before comparing the recognition performance with the conventional and Sinc-interpolation approaches. Before proceeding further, we present recognition results for the TIDIGITS task in Table II. The models were trained using male speakers and are used for recognizing child speakers. For this task, Panchapagesan and Alwan observed that the Cosine-interpolation approach performed better than conventional VTLN (see [22] and [23, Sec. 6.1]). As we will show next, the warp-factors for Cosine-interpolation and conventional VTLN need to be mapped before they can be compared. This is due to the difference in the domains where the frequency warping is applied. If a proper mapping is chosen, the difference in performance observed in [22] and [23] no longer exists.
TABLE II
RECOGNITION RESULTS (%WER) COMPARING THE PERFORMANCE OF DIFFERENT APPROACHES TO VTLN FOR THE MALE-TRAIN AND CHILD-TEST CASE OF TIDIGITS. DIFFERENT RANGES OF WARPING-FACTORS HAVE TO BE USED TO GET COMPARABLE PERFORMANCE FOR THE CONVENTIONAL AND COSINE-INTERPOLATION APPROACHES DUE TO THE DIFFERENCE IN THE DOMAIN WHERE FREQUENCY WARPING IS APPLIED
Baseline - No VTLN; Conv. - Conventional; LT - Linear Transformation; M-C - Male Train - Child Test
Equations (27) and (28) specify the warping applied in the Mel domain (e.g., a Mel-domain warp-factor of 0.80), (29) gives the equivalent linear-frequency warp-factor for a given Mel-domain warp-factor, and (30) gives the Mel-domain warp-factor when the linear-frequency warp-factor is fixed.
TABLE III
RECOGNITION RESULTS (%WER) OF CONVENTIONAL AND LINEAR
TRANSFORM APPROACHES TO VTLN USING CONVENTIONAL MFCC.
BOTH SINC- AND COSINE-INTERPOLATION APPROACHES PERFORM
COMPARABLY TO CONVENTIONAL VTLN
Fig. 12. Histogram and contour plots comparing the alpha estimates for the conventional and Sinc-interpolation approaches on the EPPS training data. In the linear-transformation approach, since we use conventional Mel-filters (without half-filters), the corresponding warp-factors are approximately 0.02 less than in conventional VTLN, which is reflected in the histogram. (a) Histogram plot. (b) Contour plot.
We observe that the linear-transformation based approaches perform comparably with the conventional approach, irrespective of noisy or clean speech. More importantly, we use the conventional Mel filter-bank without any modification and still perform VTLN-warping using a linear transformation.
Fig. 12 shows the histogram and contour plots for the warp-factors obtained using the conventional and the proposed Sinc-interpolation approaches for the training data of the EPPS task. The histogram shows the distribution of warp-factors, and the contour plot gives an idea of how the warp-factor estimates differ between the Sinc- and conventional VTLN approaches. From the histogram, we observe that the majority of the warp-factors are shifted by a single warp-factor step: the peaks at 1.02, 0.94, and 0.92 in the conventional approach appear at 1.00, 0.92, and 0.90, respectively. A similar behavior can also be observed from the contour plot. This is because we are using conventional Mel-filters (i.e., no additional half-filters), with the center frequency of the first Mel-filter (at 89.2 Hz) being mapped to zero frequency to enable the linear-transformation approach without any change in the signal processing. This leads to a small, consistent difference between the warp-factors obtained in the linear-transformation and the conventional approach. Our analysis of the warp-factors obtained on the EPPS training data indicates that, for any warp-factor in conventional VTLN, the corresponding warp-factors are the same or 0.02 smaller in 90% of the utterances. The correlation coefficient between the alpha estimates is 0.93, which also indicates that the deviations are only marginal.
Another source of approximation is the use of truncated unwarped MFCC cepstra in (18) to obtain the VTLN-warped cepstra, which will also result in some loss of information. Though
there are small differences in the warp-factor distribution, the
recognition performance of LT-Sinc is comparable to conventional VTLN on a variety of tasks that we have presented in this
paper.
VI. CONCLUSION
In this paper, we have presented an approach to perform VTLN using a linear transformation on conventional MFCC without any modification of the feature-extraction steps. The linear transformation is given by (18), with the interpolation matrix given by (17). Therefore, the linear transformation can be analytically calculated using the above equations for any $\alpha$, as well as for any arbitrary warping function, by putting the appropriate warping relation in (13). This is an important difference when compared to Cosine-interpolation, where the warping is applied in the Mel-domain and is different from conventional VTLN. Further, the corresponding warp-factors between the Cosine-interpolation approach and conventional VTLN cannot be easily compared. The key idea of our approach is to separate the speaker-scaling operation from the filter-bank, which helps us derive a linear transformation for VTLN using the idea of band-limited interpolation. The use of such transformations would enable the warp-factors to be efficiently estimated by accumulating sufficient statistics, enable a regression-tree framework to perform VTLN at the acoustic-class level, and allow VTLN matrices to be used as base matrices for adaptation until sufficient data is available. Such approaches cannot be easily implemented in the conventional VTLN framework. Using four different tasks to illustrate the efficacy of our proposed approach, we have shown that the recognition performance of the proposed linear transformation is always comparable to conventional VTLN on both clean and noisy speech data.
ACKNOWLEDGMENT
D. R. Sanand would like to thank Prof. H. Ney for giving him
an opportunity to work as a research assistant in the Human Lan-