
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 20, NO. 5, JULY 2012

VTLN Using Analytically Determined Linear-Transformation on Conventional MFCC

D. R. Sanand and S. Umesh

Abstract: In this paper, we propose a method to analytically obtain a linear-transformation on conventional Mel frequency cepstral coefficient (MFCC) features that corresponds to conventional vocal tract length normalization (VTLN)-warped MFCC features, thereby simplifying the VTLN processing. There have been many attempts to obtain such a linear-transformation, but in all the previously proposed approaches either the signal processing is modified (and is therefore not conventional MFCC), or the linear-transformation does not correspond to conventional VTLN-warping, or the matrices being estimated are data dependent. In short, the conventional VTLN part of an automatic speech recognition (ASR) system cannot simply be replaced with any of the previously proposed methods. Umesh et al. proposed the idea of using band-limited interpolation for performing VTLN-warping on MFCC using plain cepstra. Motivated by this work, Panchapagesan and Alwan proposed a linear-transformation to perform VTLN-warping on conventional MFCC. However, in their approach, VTLN-warping is specified in the Mel-frequency domain and is not equivalent to conventional VTLN. In this paper, we present an approach which also draws inspiration from the work of Umesh et al., and which we believe for the first time performs conventional VTLN as a linear-transformation on conventional MFCC using the ideas of band-limited interpolation. Deriving such a linear-transformation to perform VTLN allows us to use the VTLN-matrices in a transform-based adaptation framework, with its associated advantages, while still requiring the estimation of only a single parameter. Using four different tasks, we show that our proposed approach has almost identical recognition performance to conventional VTLN on both clean and noisy speech data.

Index Terms: Automatic speech recognition (ASR), linear-transformation, Mel frequency cepstral coefficient (MFCC), speaker normalization, vocal tract length normalization (VTLN).

I. INTRODUCTION
INTER-SPEAKER variability is a major source of performance degradation in speaker-independent (SI) automatic speech recognition (ASR) systems. Most state-of-the-art systems now incorporate vocal-tract length normalization (VTLN) as an integral part of the system to reduce inter-speaker variability and hence improve the recognition performance [1]–[6].

Manuscript received December 27, 2010; revised July 08, 2011 and January
15, 2012; accepted January 23, 2012. Date of publication January 31, 2012; date
of current version March 21, 2012. This work was done while the authors were
at the Department of Electrical Engineering, Indian Institute of Technology,
Kanpur. This work was supported in part by the Department of Science and
Technology, Ministry of Science and Technology, India, under SERC project
SR/S3/EECE/058/2008. The associate editor coordinating the review of this
manuscript and approving it for publication was Prof. Steve Renals.
D. R. Sanand is with the Norwegian University of Science and Technology,
NO-7491 Trondheim, Norway (e-mail: drsanand@gmail.com).
S. Umesh is with the Department of Electrical Engineering, Indian Institute
of Technology Madras, Chennai-600036, India (e-mail: umeshs@ee.iitm.ac.in).
Digital Object Identifier 10.1109/TASL.2012.2186289

VTLN performs speaker normalization by reducing the variabilities in the spectra of speech signals that arise due to differences in the vocal tract lengths (VTL) of speakers uttering the same sound [7]. The normalization is achieved by either compressing or expanding the speech spectrum and is usually referred to as scaling. This scaling is usually specified through a mathematical relation of the type $f' = g_\alpha(f)$, where $f'$ is the warped frequency and $g_\alpha(\cdot)$ is the frequency-warping function. It is commonly assumed that the spectra of different speakers uttering the same sound are linearly scaled versions of one another [7], [8], i.e., $f' = \alpha f$. We would like to make it clear to the reader that, though the discussion in this paper assumes linear scaling of the spectra, the methods developed in this paper can be applied to any arbitrary warping function. VTLN requires the estimation of only a single parameter called the warp-factor $\alpha$ for normalization and hence requires very little acoustic data, unlike adaptation-based methods (e.g., MLLR and CMLLR). However, the practical implementation of conventional VTLN follows a maximum likelihood (ML) based grid search over a pre-defined range of warping-factors. This requires the features to be generated for all the warp-factors after appropriate modification of the spectra. The ML estimate of the warp-factor is then found by evaluating the likelihood of the warped features with respect to the acoustic model, $\lambda$, and the transcription, $W$, and is given by

$$\hat{\alpha} = \operatorname*{argmax}_{\alpha}\;\Pr\!\big(\mathbf{X}^{\alpha} \mid \lambda, W\big) \qquad (1)$$

where $\mathbf{X}^{\alpha}$ consists of static features obtained after frequency-warping the spectra by warp-factor $\alpha$, appended with differential and acceleration coefficients. In some systems, linear discriminant analysis (LDA) is applied over a window of such warped consecutive frames to account for dynamic variations before obtaining the final feature-vector.
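For concreteness, the grid search of (1) can be sketched in a few lines; the scoring callable below is a hypothetical stand-in for the recognizer's likelihood evaluation, and the 0.80–1.20 range with a step of 0.02 follows the conventional setup described later in the paper.

```python
import numpy as np

def estimate_warp_factor(loglik_fn, alphas=np.arange(0.80, 1.2001, 0.02)):
    """ML grid search of (1): evaluate the likelihood of the warped
    features for each candidate warp-factor and keep the argmax.
    loglik_fn(alpha) -> log Pr(X_alpha | model, transcription); this
    callable is a stand-in for the recognizer's scoring pass, which is
    outside the scope of this sketch."""
    scores = [loglik_fn(a) for a in alphas]   # one feature pass per alpha
    return float(alphas[int(np.argmax(scores))])
```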
Recently there has been a lot of interest in obtaining a direct linear-transformation between the static conventional Mel frequency cepstral coefficient (MFCC) features $\mathbf{c}$ and the static VTLN-warped MFCC $\mathbf{c}_\alpha$, i.e.,

$$\mathbf{c}_\alpha = \mathbf{A}_\alpha\,\mathbf{c} \qquad (2)$$

where $\mathbf{A}_\alpha$ represents a matrix transformation.
One of the early attempts to obtain a linear-transformation (LT) on the cepstra for speaker-normalization was by Acero et al. [9], [10]. They showed that the warped cepstral coefficients can be obtained at the outputs of a set of filters at time $n = 0$, by formulating the bilinear transform as a linear filtering operation with the time-reversed cepstrum sequence as the input. McDonough et al. [11] proposed a linear transformation using generalizations of the bilinear transform known as


all-pass transforms. The derivations were based on the argument that the frequency-warping functions, $g_\alpha(\cdot)$, used in most VTLN methods can be approximated to a reasonable degree by the bilinear transform. Pitz et al. [12], [13] argued that a linear-transformation of cepstra can be obtained for any arbitrary invertible warping function. However, their derivations were made using the modified signal processing approach discussed in [14], which does not include filter-bank smoothing during feature extraction. The cepstra are assumed to be inverse discrete-time Fourier transform (IDTFT) coefficients of the log power spectrum (without Mel-warping) to derive the cepstral linear-transformation. Pitz states in his thesis [15] that the inclusion of Mel-warping makes the transformation highly nonlinear, and it could not be solved analytically. There have been other attempts to obtain an approximation to the linear-transformation, including the work of Claes et al. [16], where the linear-transformation was derived using the average third-formant information. Cox [17] presented a model-based approach for VTLN that performs the transformation on MFCC features. Kim et al. [18] estimated the linear-transformation using the ideas of constrained maximum likelihood linear-transformation (CMLLR) from training data. Cui and Alwan [19] derived a mapping matrix using formant-like peaks, which can be seen as a special case of [16]. Sanand et al. [20] derived a linear-transformation using the idea of dynamic frequency warping, where the mapping is learnt from the data. It is important to note that in all these methods, either the signal processing is changed (and therefore not conventional MFCC), or the linear-transformation does not correspond to conventional VTLN-warping, or the matrices are estimated and hence are dependent on the database. Therefore, the conventional VTLN part of an ASR system cannot simply be replaced with any of the methods described above.
Umesh et al. [21] proposed the idea of using band-limited interpolation to derive a linear-transformation for obtaining VTLN-warped MFCC that performs both Mel- and VTLN-warping on plain cepstra. Motivated by this work, Panchapagesan and Alwan [22], [23] proposed an approach to incorporate VTLN-warping into the inverse discrete cosine transform (DCT) to obtain a linear-transformation of the type shown in (2). We refer to this approach as Cosine-interpolation in this paper. It is important to note that the VTLN-warping, $g_\alpha(\cdot)$, in [22], [23] is performed in the Mel-frequency domain and is not exactly equivalent to conventional VTLN frequency-warping. This may be important in cases where the warping function is specified in the frequency (Hz) domain based on physiological arguments.
In this paper, we present an approach which, we believe, for the first time performs conventional frequency-warping, $f' = g_\alpha(f)$, as a linear-transformation on conventional MFCC using the ideas of band-limited interpolation. We refer to this approach as Sinc-interpolation in this paper. The goal is to analytically obtain the linear-transformation $\mathbf{A}_\alpha$ of (2) given $g_\alpha(\cdot)$. The proposed method does not modify any aspect of the conventional MFCC computation, including the use of Mel filter-bank smoothing as well as the discrete cosine transform (DCT)-II. A part of this work has already been presented in [20], [24], and [25].
A major advantage of obtaining a linear-transformation in the framework of (2) is that the VTLN-warped cepstral features need not be computed for each $\alpha$ by first frequency-warping the spectra and then computing the corresponding VTLN-warped cepstra. Instead, the VTLN-warped cepstra $\mathbf{c}_\alpha$ can be directly obtained from the static conventional MFCC features $\mathbf{c}$ through a matrix transformation. It can easily be shown that the dynamic coefficients of the warped features are also related through the same transformation in this case. Another advantage of such an approach is that these matrices can be viewed as feature-transformation matrices similar to CMLLR, but are pre-computed rather than estimated from data, requiring very little adaptation data for the optimal selection of $\alpha$. The use of such matrices also enables the warp-factors to be estimated by accumulating sufficient statistics, thereby simplifying the procedure for optimal warp-factor estimation [26], [27] and reducing the computational complexity by 75%. Further, the VTLN matrices can be used in a regression-tree framework to perform VTLN at the acoustic-class level, allowing the estimation of multiple warp-factors for a single utterance [28], which is very difficult to implement in the conventional VTLN framework. Finally, there is the possibility of using these VTLN matrices as base matrices for adaptation until sufficient data is available to obtain a robust estimate of the adaptation (MLLR/CMLLR) matrix [29]. Recently, there has also been interest in using VTLN in the transform-based approach for statistical speech synthesis [30].

Fig. 1. Steps involved in generating conventional MFCC features.
The paper is organized as follows. In Section II, we present how VTLN is performed in practice and discuss the limitations in formulating the problem as a linear-transformation. In Section III, we present our idea of performing VTLN and show that a matrix transformation can be formulated on conventional MFCC to obtain VTLN-warped MFCC. Section IV presents our setup for performing the speech recognition experiments, along with a description of the databases used in our experiments. In Section V, we discuss the differences between the proposed and the Cosine-interpolation approaches for VTLN. Finally, we present the recognition results to show that the proposed approach has performance comparable to conventional VTLN.
II. IMPLEMENTATION OF CONVENTIONAL VTLN

Conventional MFCC feature extraction, which does not include VTLN-warping, is usually implemented as shown in Fig. 1. Let $\mathbf{S}$ represent the power or magnitude spectrum of a frame of speech. Let $\mathbf{F}$ represent the filter-bank smoothing operation along with Mel-warping, which can be represented through a linear-transformation matrix. Further, let $\mathbf{D}$ represent the DCT transformation, which is also linear. The static MFCC features, $\mathbf{c}$, are obtained by applying the Mel-warped filter-bank to the power spectrum of the speech signal, followed by applying a logarithm to the amplitudes of the filter-bank outputs and finally a DCT transformation. All the operations can be written mathematically as

$$\mathbf{c} = \mathbf{D}\,\log(\mathbf{F}\,\mathbf{S}). \qquad (3)$$

The DCT matrix is given by

$$[\mathbf{D}]_{kn} = \beta_k\,\sqrt{\frac{2}{M}}\,\cos\!\left(\frac{\pi k\,(2n-1)}{2M}\right), \quad k = 0, \ldots, N-1, \;\; n = 1, \ldots, M \qquad (4)$$

and the scaling factor $\beta_k$ is defined as $\beta_k = 1/\sqrt{2}$ for $k = 0$ and $\beta_k = 1$ otherwise. Here, $M$ is the number of filters used in the Mel filter-bank and $N$ is the number of cepstral coefficients.

Fig. 2. Conventional framework for generating warped features in VTLN. The filter-bank is inversely scaled, instead of re-sampling the speech signal for each warp-factor, for efficient implementation.

Fig. 3. The change in the filter-bank structure with VTLN-warping in the linear-frequency (Hz) domain. The filters have nonuniform center frequencies with nonuniform bandwidths.
As an illustration, let $\mathbf{s}$ be a speech frame consisting of 320 samples. A 512-point DFT is applied to obtain the 256-dimensional vector $\mathbf{S}$ whose elements are the magnitudes of the DFT coefficients for one half of the spectrum; this suffices because the magnitude spectrum has even symmetry. If 20-filter Mel filter-bank smoothing is applied, then $\mathbf{F}$ is a 20 × 256 matrix that operates on $\mathbf{S}$ to obtain the Mel-warped smoothed spectrum. $\mathbf{D}$ is the 20 × 20 DCT matrix applied to the log-compressed Mel-warped smoothed spectrum to obtain the MFCC feature vector $\mathbf{c}$. In practice, only the first 16 cepstral coefficients are used, and one may use a 16 × 20 DCT transformation.
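The pipeline of (3) and (4) can be made concrete with a short numerical sketch. This is our own minimal construction: the triangular filter placement and the DCT scaling follow standard textbook conventions, and details such as edge handling and filter normalization may differ from the front-ends used in the experiments.

```python
import numpy as np

def mel(f):            # Hz -> Mel, the standard relation used in (12)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(u):        # Mel -> Hz
    return 700.0 * (10.0 ** (u / 2595.0) - 1.0)

def mel_filterbank(M=20, n_bins=256, fs=16000.0):
    """M x n_bins triangular filter-bank matrix F with Mel-spaced centers."""
    edges = mel_inv(np.linspace(0.0, mel(fs / 2), M + 2))  # M+2 edge freqs
    bins = np.linspace(0.0, fs / 2, n_bins)
    F = np.zeros((M, n_bins))
    for m in range(M):
        lo, c, hi = edges[m], edges[m + 1], edges[m + 2]
        up = (bins - lo) / (c - lo)        # rising slope of the triangle
        dn = (hi - bins) / (hi - c)        # falling slope
        F[m] = np.clip(np.minimum(up, dn), 0.0, None)
    return F

def dct_matrix(N=16, M=20):
    """N x M DCT-II matrix D of (4) with orthonormal scaling."""
    n = np.arange(1, M + 1)
    D = np.array([np.cos(np.pi * k * (2 * n - 1) / (2 * M)) for k in range(N)])
    D *= np.sqrt(2.0 / M)
    D[0] /= np.sqrt(2.0)                   # beta_0 = 1/sqrt(2)
    return D

# c = D log(F S) for one frame; a random frame stands in for real speech.
S = np.abs(np.fft.rfft(np.random.randn(320), 512))[:256]
c = dct_matrix() @ np.log(mel_filterbank() @ S)   # 16 static MFCCs
```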
VTLN features are obtained in the original method of Andreou et al. [7] by frequency-warping the magnitude spectrum to get $\mathbf{S}_\alpha$ before applying the unwarped Mel filter-bank. This is done by re-sampling the signal; therefore, in this case, the signal is warped for each VTLN warp-factor while the Mel filter-bank is left unchanged. Lee and Rose [8] proposed an efficient alternative implementation, where the Mel filter-bank is inverse-scaled for each $\alpha$ while the signal spectrum is left unchanged, as shown in Fig. 2. This is the most popular method of VTLN-warping. Therefore, in the Lee–Rose method, VTLN-warping is integrated into the Mel filter-bank, and $\mathbf{F}_\alpha$ denotes the (inverse) VTLN-warped Mel filter-bank. Conventionally, the warp-factor $\alpha$ used for warping the spectra is in the range of 0.80 to 1.20, based on physiological arguments. For each $\alpha$, the center frequencies and bandwidths of the Mel filter-bank are appropriately scaled to obtain the Mel- and VTLN-warped smoothed spectra [8]. The change in the filter-bank structure for different warp-factors is illustrated in Fig. 3. The slope of the last filter has been modified appropriately using piece-wise linear warping [31], so that the Nyquist frequency maps onto itself after frequency scaling. This avoids the bandwidth mismatch that arises due to frequency warping. The piece-wise linear warping function used in our experiments is given by

$$g_\alpha(f) = \alpha f, \quad 0 \le f \le f_0 \qquad (5)$$

$$g_\alpha(f) = \alpha f_0 + \frac{f_{\max} - \alpha f_0}{f_{\max} - f_0}\,(f - f_0), \quad f_0 < f \le f_{\max} \qquad (6)$$

and is shown in Fig. 4. Here, $f_0$ represents the cutoff frequency where the slope is changed and $f_{\max}$ is the Nyquist frequency.

Fig. 4. Piece-wise linear warping function used in conventional VTLN, motivated by physiological arguments. The slope of the warping function is changed at $f_0$ to avoid bandwidth mismatch after frequency scaling.
Although piece-wise linear warping is the most commonly used frequency-warping, and is motivated by the physiological argument that changes in VTL manifest as spectral scaling, the methods developed in this paper can be applied to any arbitrary warping function.
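In code, (5) and (6) take the following form; the default cutoff $f_0 = 6800$ Hz (0.85 times the Nyquist frequency at 16 kHz sampling) is an assumption for illustration, since this passage fixes only the property that the Nyquist frequency maps onto itself.

```python
def g_alpha(f, alpha, f0=6800.0, f_nyq=8000.0):
    """Piece-wise linear warping of (5)-(6): scale by alpha up to the
    cutoff f0, then follow the line through (f0, alpha*f0) and
    (f_nyq, f_nyq) so that the Nyquist frequency maps onto itself."""
    if f <= f0:
        return alpha * f
    return alpha * f0 + (f_nyq - alpha * f0) * (f - f0) / (f_nyq - f0)
```

By construction, `g_alpha(f_nyq, alpha)` returns `f_nyq` for any `alpha`, which is precisely the bandwidth-preserving property discussed above.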
The warped cepstral features $\mathbf{c}_\alpha$ are given by

$$\mathbf{c}_\alpha = \mathbf{D}\,\log(\mathbf{F}_\alpha\,\mathbf{S}). \qquad (7)$$

These are obtained by first warping and smoothing the power spectrum, followed by the log and DCT operations. The filter-bank $\mathbf{F}_\alpha$ is integrated with both Mel- and VTLN-warping, to perform smoothing as well as scaling of the spectrum. Observing (3) and (7), it is clear that the only difference between the conventional and VTLN-warped MFCC features is the change in the filter-bank structure, while the rest of the operations are the same. For the case of $\alpha = 1.00$, $\mathbf{F}_\alpha$ exactly corresponds to the case of conventional MFCC without VTLN-warping. From (3) and (7), the relation between $\mathbf{c}$ and $\mathbf{c}_\alpha$ is given as

$$\mathbf{c}_\alpha = \mathbf{D}\,\log\!\left(\mathbf{F}_\alpha\,\mathbf{F}^{-1}\exp\!\big(\mathbf{D}^{-1}\,\mathbf{c}\big)\right). \qquad (8)$$

Fig. 5. Modification in the signal processing steps (separating the Mel- and VTLN-warping) for realizing a linear-transformation. The filter-bank performs only Mel-warping of the spectra and the proposed band-limited interpolation matrix performs the VTLN-warping.
A linear-transformation between $\mathbf{c}_\alpha$ and $\mathbf{c}$ (or $\mathbf{S}$) can be derived if all the intermediate operations can be represented as linear operations; but from (8), it is evident that the log is a nonlinear operation and that, in practice, $\mathbf{F}^{-1}$ does not exist. This is because the power spectrum cannot be completely reconstructed from the filter-bank outputs due to the smoothing operation [16]. We need to obtain $\mathbf{S}$, since conventional VTLN-warping relations are always specified in the linear-frequency (Hz) domain, usually through a mathematical relation of the type $f' = g_\alpha(f)$, where $f'$ is the warped frequency and $g_\alpha(\cdot)$ is the frequency-warping function. Therefore, in this case, it is not possible to completely recover $\mathbf{S}$ from the filter-bank output, and hence a linear-transformation is not possible.
In the next section, we show that separating the frequency-warping operation from the filter-bank avoids the need to invert the filter-bank operation or the logarithm and allows us to derive a linear transformation on conventional MFCC.
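The rank argument behind this impossibility is easy to verify numerically with the filter-bank sketch from earlier in this section: $\mathbf{F}$ maps a 256-dimensional spectrum to 20 outputs, so it has no inverse.

```python
import numpy as np

# F is 20 x 256: a many-to-one smoothing, so S cannot be recovered
# from F S, and the filter-bank step in (8) cannot be inverted.
F = mel_filterbank()                        # sketch defined above
print(F.shape, np.linalg.matrix_rank(F))    # (20, 256) 20
```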
III. REALIZING A LINEAR-TRANSFORMATION

In this section, we show that separating the VTLN-warping (speaker scaling) from the Mel filter-bank helps us derive a linear-transformation (LT) between the warped and unwarped cepstral features within the conventional MFCC framework. Let $\mathbf{l} = \log(\mathbf{F}\,\mathbf{S})$ be the log-compressed Mel-warped filter-bank output. From (3), we see that knowledge of $\mathbf{c}$ implies knowledge of $\mathbf{l}$, as they form a DCT pair, i.e.,

$$\mathbf{c} = \mathbf{D}\,\mathbf{l}, \qquad \mathbf{l} = \mathbf{D}^{-1}\,\mathbf{c}. \qquad (9)$$

However, we cannot completely recover $\mathbf{S}$ from $\mathbf{l}$ because of the filter-bank smoothing operation. Since $\mathbf{S}$ cannot be completely recovered, we re-frame the problem as follows: can $\mathbf{l}_\alpha$ be obtained by applying a linear-transformation on $\mathbf{l}$ without recovering $\mathbf{S}$, i.e.,

$$\mathbf{l}_\alpha = \mathbf{T}_\alpha\,\mathbf{l} \qquad (10)$$

where $\mathbf{T}_\alpha$ is the transformation that is applied on $\mathbf{l}$ to obtain $\mathbf{l}_\alpha$. The above equation states that the filter-bank, $\mathbf{F}$, performs only Mel-warping and the transformation $\mathbf{T}_\alpha$ performs VTLN-warping. This means that the VTLN-warping integrated into the filter-bank for efficient implementation in the conventional approach [8] is now performed separately and is not a part of the filter-bank construction. This is illustrated in Fig. 5. If such a relation can be obtained, then from (3) and (7), the relation between $\mathbf{c}$ and $\mathbf{c}_\alpha$ is given by

$$\mathbf{c}_\alpha = \mathbf{D}\,\mathbf{T}_\alpha\,\mathbf{D}^{-1}\,\mathbf{c}. \qquad (11)$$

By defining a LT between $\mathbf{l}$ and $\mathbf{l}_\alpha$, we completely avoid the inversion of the filter-bank for obtaining the raw magnitude spectrum $\mathbf{S}$ and also bypass the log operation. We would like to remind the reader that the VTLN-warping relation is usually specified in the linear-frequency (Hz) domain, and therefore, at this point, it is not clear what the relation between $\mathbf{l}$ and $\mathbf{l}_\alpha$ should be. In the next subsection, we describe a method to obtain the LT using the idea of band-limited interpolation.
A. Band-Limited (Sinc-) Interpolation
For a band-limited continuous-time signal, $x(t)$, given uniformly spaced samples of the signal that are appropriately sampled, i.e., at or above the Nyquist rate, we can exactly reconstruct the original continuous-time signal. This implies that we can recover the values of the time signal at time-instants other than those of the uniformly spaced samples. We use this idea to obtain the LT for VTLN-warping, except that we now consider que-frency limited signals instead of frequency-limited signals.
$\mathbf{l}$ can be obtained either by applying a nonuniform filter-bank (shown in Fig. 3) on the linear-frequency (Hz) magnitude spectrum or by applying a uniformly spaced filter-bank (shown in Fig. 6) on the Mel-warped magnitude spectrum. Therefore, in the Mel-frequency domain, the continuous Mel-warped log-compressed spectrum, $l(u)$, can be interpreted as the output of convolving a triangle function with the Mel-warped magnitude spectrum, followed by a log operation on the amplitudes. We can think of the vector $\mathbf{l}$ as being obtained by uniformly sampling $l(u)$ at $u = u_i$, where $i = 1, \ldots, M$, and the positions of these samples exactly correspond to the center frequencies of the filter-bank. Because of the triangle smoothing and the subsequent log operation on the output (which reduces the dynamic range), the que-frency content of this log-compressed smoothed spectrum is only in the low que-frency region. Fig. 7 compares the cepstral coefficients obtained with and without filter-bank smoothing. We see that the cepstral coefficients die down faster with filter-bank smoothing, indicating that the que-frency content is limited to the low que-frency region. During VTLN-warping, the filter center frequencies are appropriately scaled in the linear-frequency (Hz) domain by the inverse of $\alpha$, as described by Lee–Rose [8]. This corresponds to the center frequencies of the filter-bank being non-uniformly spaced in the Mel-frequency domain, as shown in Fig. 6. As we represent the log-compressed Mel-warped smoothed magnitude spectrum by the continuous function $l(u)$, the output of the VTLN-warped filter-bank corresponds to sampling $l(u)$ nonuniformly, i.e., at $u = u'_i$. These nonuniformly spaced samples exactly correspond to the elements of the vector $\mathbf{l}_\alpha$.

Fig. 6. The change in the filter-bank structure with VTLN-warping in the Mel-frequency domain. The filters have uniformly spaced center frequencies with uniform bandwidths for $\alpha = 1.00$; however, they are nonuniformly spaced for $\alpha$ different from unity.

Fig. 7. The effect of filter-bank smoothing on the cepstral coefficients. Filter-bank smoothing helps limit the que-frency content to the lower region, ensuring que-frency limitedness.
From the above discussion, we point out that the elements of the vector $\mathbf{l}$ (i.e., $l(u_i)$) can be interpreted as uniformly spaced samples, and the elements of $\mathbf{l}_\alpha$ (i.e., $l(u'_i)$) as nonuniformly spaced samples, of the same continuous function $l(u)$. The main idea is that, given the samples in $\mathbf{l}$, the samples (or elements) in $\mathbf{l}_\alpha$ can be reconstructed using band-limited interpolation, provided that the cepstrum is que-frency limited.
Let $l(u)$ and $L(q)$ form a discrete-time Fourier transform (DTFT) pair. Then sampling $l(u)$ would result in periodic repetition of $L(q)$. As long as $L(q)$ is strictly que-frency limited and the sampling rate is sufficiently high, there is no aliasing in the cepstral domain. In such a case, the value of $l(u)$ at any Mel-frequency $u$ can be found from its uniformly spaced samples through band-limited interpolation. This basically exploits the sampling theorem, where a signal (in this case a frequency-domain signal) can be reconstructed from its samples using Sinc-interpolation. $L(q)$ is nowhere used for any calculation purposes and is presented here only for a better understanding of the derivation of the band-limited interpolation matrix.
Note that que-frency limitedness ensures that there is no overlap in the periodic repetition of $L(q)$ (i.e., no aliasing), and hence $l(u)$ can be exactly recovered. The que-frency limitedness property depends both on the amount of smoothing done by the Mel-filters (which controls the number of significant cepstral coefficients) and on the number of Mel-filters, which determines the periodicity. If there is aliasing, there will be differences between the Sinc-interpolated $\mathbf{l}_\alpha$ and the actual values. Since our effort in this paper is to use conventional MFCC processing, both of these parameters are already fixed by the feature extraction stage. However, as we will show later, even using conventional MFCC processing, there is very little difference between the interpolated and true values.
The steps to obtain the transformation matrix are as follows.
1) Let $u_1, u_2, \ldots, u_M$ represent the uniformly spaced Mel-frequencies, with the samples of $l(u)$ at these points being the elements of the vector $\mathbf{l}$. Their corresponding linear-frequencies (Hz) are nonuniformly spaced and are represented by $f_1, f_2, \ldots, f_M$. These are the center frequencies of the Mel-filters in the linear-frequency (Hz) domain and are related through the standard Mel-relation, i.e.,

$$u_i = 2595\,\log_{10}\!\left(1 + \frac{f_i}{700}\right). \qquad (12)$$
2) During VTLN-warping, the warping function $g_\alpha(\cdot)$ is applied to obtain the warped frequencies. Let $f'_i = g_\alpha(f_i)$ represent the warped frequencies in the linear-frequency (Hz) domain. Although our proposed method will work for any warping function $g_\alpha(\cdot)$, for illustration purposes we use the piece-wise linear warping function as defined in (5) and (6). The corresponding VTLN-warped center frequencies of the filters in the Mel-frequency domain, $u'_i$, will not be related through a linear scaling relation, since

$$u'_i = 2595\,\log_{10}\!\left(1 + \frac{g_\alpha(f_i)}{700}\right). \qquad (13)$$

Therefore, while $f'_i = \alpha f_i$ for the linear-scaling relation (i.e., along the $x$-axis), along the $y$-axis $u'_i \neq \alpha u_i$, as seen from (12) and (13) and graphically shown in Fig. 8. The Cosine-interpolation approach proposed in [22], [23] assumes $u'_i = g_\alpha(u_i)$, i.e., warping in the Mel-domain (i.e., the $u$-domain), and therefore does not correspond to conventional VTLN-warping, which is specified in the frequency domain (i.e., the $f$-domain). While we refer only to piece-wise linear warping for illustration purposes, any frequency-warping function can be used in our proposed approach by specifying the appropriate $g_\alpha(\cdot)$ in (13).


Fig. 9. Framework of the proposed linear-transformation approach. Note that only the conventional MFCC features are generated, and the warped features are obtained using the LT matrices $\mathbf{A}_\alpha$.

Fig. 8. The band-limited interpolation for the linear-scaling relation. Warping is defined in the linear-frequency (Hz) domain or $x$-axis, i.e., $f' = \alpha f$. Along the $y$-axis, $u_1, u_2, \ldots, u_M$ are the center-frequencies of the uniformly spaced filter-bank corresponding to $\alpha = 1.00$ in the Mel-domain. Similarly, $u'_1, u'_2, \ldots, u'_M$ are the center-frequencies of the warped filter-bank and are nonuniformly spaced in the Mel-domain. The band-limited interpolation matrix is defined to obtain the samples at $u'_i$ given the samples at $u_i$. In the figure, $u$ represents unwarped frequencies in both the linear-frequency (Hz) and Mel-frequency domains, and $u'$ represents warped frequencies in both domains.

3) The Fourier relation between $l(u)$ and $L(q)$ is now used to perform the interpolation: $u_i$ is obtained from (12), $u'_i$ from (13), and the VTLN-warping relation $g_\alpha(\cdot)$ is specified in the $f$-domain in (13). Using the even-symmetry property, we obtain the $M \times M$ interpolation matrix $\mathbf{T}_\alpha$, i.e., $\mathbf{l}_\alpha = \mathbf{T}_\alpha\,\mathbf{l}$. Since $l(u)$ is que-frency limited, its cepstral coefficients can be computed from the uniformly spaced samples as

$$L(q) = \frac{1}{M+1}\,\sum_{n=0}^{M+1} \epsilon_n\, l(u_n)\,\cos\!\left(\frac{\pi q\, u_n}{u_{\max}}\right), \qquad \epsilon_n = \begin{cases} 1/2, & n \in \{0, M+1\} \\ 1, & \text{otherwise} \end{cases} \qquad (14)$$

where $u_{\max}$ is the Nyquist frequency in the Mel-frequency domain. Here, we assume that the signal is periodic with a period of $2u_{\max}$ and symmetric around $u = 0$; therefore, theoretically, half-filters are present at the indices $0$ and $M+1$, and the values at these indices are required for performing the band-limited interpolation (in practice, as discussed in Section V-B, the first and last filter centers are used as the zero and Nyquist frequencies). If we assume that $l$ is que-frency limited, the elements of $\mathbf{l}_\alpha$ can be determined as [we use the variable $p$ since $q$ is already used in (14)]

$$l(u'_i) = \sum_{p=0}^{M+1} \epsilon_p\, L(p)\,\cos\!\left(\frac{\pi p\, u'_i}{u_{\max}}\right). \qquad (15)$$

Substituting $L(q)$ of (14) in (15), we get the band-limited interpolation matrix between $\mathbf{l}$ and $\mathbf{l}_\alpha$:

$$[\mathbf{T}_\alpha]_{in} = \frac{\epsilon_n}{M+1}\,\sum_{p=0}^{M+1} \epsilon_p\,\cos(\pi p\,\tilde{u}'_i)\,\cos(\pi p\,\tilde{u}_n). \qquad (16)$$

Alternatively, the above matrix $\mathbf{T}_\alpha$ can also be written as a product of matrices $\mathbf{B}$ and $\mathbf{C}$:

$$\mathbf{T}_\alpha = \mathbf{B}\,\mathbf{C} \qquad (17)$$

where $M$ is the number of filters. The matrices $\mathbf{B}$ and $\mathbf{C}$ are given by

$$[\mathbf{B}]_{ip} = \epsilon_p\,\cos(\pi p\,\tilde{u}'_i), \qquad [\mathbf{C}]_{pn} = \frac{\epsilon_n}{M+1}\,\cos(\pi p\,\tilde{u}_n)$$

where $\tilde{u}_n = u_n / u_{\max}$ and $\tilde{u}'_i = u'_i / u_{\max}$ are normalized frequencies with the range $[0, 1]$.

The linear transformation matrix to obtain the VTLN-warped MFCC given the conventional MFCC is given by

$$\mathbf{A}_\alpha = \mathbf{D}\,\mathbf{T}_\alpha\,\mathbf{D}^{-1} \qquad (18)$$

where $\mathbf{D}$ is the $N \times M$ DCT matrix of (4) and $\mathbf{D}^{-1}$ is the corresponding $M \times N$ (truncated) inverse-DCT matrix. Here, $N$ represents the number of static cepstral coefficients in the feature-vector and $M$ is the number of Mel-filters used in the feature extraction. The feature generation process using the proposed linear-transformation (LT) approach is illustrated in Fig. 9. Although we have shown this for the case of piece-wise linear warping, the same procedure can be used for any arbitrary warping function by choosing the appropriate $g_\alpha(\cdot)$ in (13). A minimal numerical sketch of this construction is given below.

Fig. 10 compares the VTLN-warped cepstra obtained using the conventional and the proposed LT approach for the piece-wise linear and bilinear warping functions.

Fig. 10. Comparing the VTLN-warped cepstra obtained using the conventional and the proposed Sinc-interpolation approach for the piece-wise linear and bilinear warping functions. (a) Piece-wise linear warping. (b) Bilinear warping.

The idea of the linear-transformation presented here is a special case of the method proposed by Umesh et al. in [21], where a linear-transformation is derived by separating both Mel- and VTLN-warping from the filter-bank. The main differences between these approaches are as follows.
• The filters are uniformly spaced in the Mel-frequency domain for the approach proposed in this paper, i.e., the $u_i$ are uniformly spaced. In the work of Umesh et al. in [21], the filters are uniformly spaced in the linear-frequency (Hz) domain, i.e., the $f_i$ are uniformly spaced. Therefore, the conventional Mel filter-bank is not used in [21].
• The interpolation matrix proposed in this paper is defined as
$$\mathbf{l}_\alpha = \mathbf{T}_\alpha\,\mathbf{l} \qquad (19)$$
i.e., it performs only VTLN-warping on the Mel-warped spectra. In [21], the interpolation matrix is defined as
$$\mathbf{l}_\alpha = \tilde{\mathbf{T}}_\alpha\,\mathbf{v} \qquad (20)$$
where $\mathbf{v}$ is the smoothed spectrum without Mel-warping and the transformation matrix $\tilde{\mathbf{T}}_\alpha$ performs both Mel- and VTLN-warping to obtain the VTLN-warped MFCC features.

B. Cosine-Interpolation

Motivated by the work of Umesh et al. [21], Panchapagesan and Alwan [22], [23] proposed a linear-transformation approach that incorporates the interpolation and warping in the inverse discrete cosine transform (IDCT) matrix; we refer to this approach as Cosine-interpolation. Considering $u$ to be the continuous Mel-frequency variable, the signal is assumed to be periodic with a period of $2u_{\max}$ and symmetric about the points $u = 0$ and $u = u_{\max}$. A normalization variable is defined as follows:

$$\tilde{u} = \frac{u}{u_{\max}} \qquad (21)$$

where $u_{\max}$ is the Nyquist frequency in the Mel-frequency domain and $\tilde{u}$ has the range $[0, 1]$. The warped IDCT matrix is given by

$$[\hat{\mathbf{D}}^{-1}_\alpha]_{nk} = \beta_k\,\sqrt{\frac{2}{M}}\,\cos\!\big(\pi k\, g_\alpha(\tilde{u}_n)\big) \qquad (22)$$

where $\tilde{u}_n = (n - 0.5)/M$ are the normalized half-sample shifted positions of the Mel filter-bank and $g_\alpha(\cdot)$ is the frequency-warping function. The relation between the warped and unwarped cepstral features is given by

$$\mathbf{c}_\alpha = \mathbf{D}\,\hat{\mathbf{D}}^{-1}_\alpha\,\mathbf{c}. \qquad (23)$$
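For contrast, the Cosine-interpolation mapping of (21)–(23) can be sketched as follows; here the piece-wise linear warping is applied directly to the normalized Mel positions, the 0.85 normalized cutoff is an assumption for illustration, and `dct_matrix` and `vtln_matrix` are reused from the earlier sketches.

```python
import numpy as np

def cosine_interp_matrix(alpha, N=16, M=20, cut=0.85):
    """(21)-(23): warp the normalized half-sample shifted Mel positions
    and build the warped IDCT of (22)."""
    u_t = (np.arange(1, M + 1) - 0.5) / M          # half-sample positions
    g = np.where(u_t <= cut, alpha * u_t,          # piece-wise linear warp
                 alpha * cut + (1 - alpha * cut) * (u_t - cut) / (1 - cut))
    k = np.arange(N)
    beta = np.where(k == 0, 1.0 / np.sqrt(2.0), 1.0)
    D_inv_w = np.sqrt(2.0 / M) * beta[None, :] * np.cos(np.pi * np.outer(g, k))
    return dct_matrix(N, M) @ D_inv_w              # (23): c_alpha = (.) c

# Same numerical alpha, different warping domains: the Sinc (Hz-domain)
# and Cosine (Mel-domain) matrices generally disagree, as argued below.
c = np.random.randn(16)
print(np.linalg.norm(vtln_matrix(0.90) @ c - cosine_interp_matrix(0.90) @ c))
```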

From the above equations, we see that the VTLN-warping, $g_\alpha(\cdot)$, is performed on the half-sample shifted positions of the filter-bank center-frequencies, which are already Mel-warped. In conventional VTLN, frequency-warping is performed in the linear-frequency (Hz) domain through $g_\alpha(f)$. From the above discussion, it is clear that the Cosine-interpolation approach performs VTLN-warping on the Mel-warped frequencies and is not equivalent to conventional VTLN-warping. Panchapagesan and Alwan themselves point to these differences [23, below Eq. 27]. As seen from the warp-factor histograms in [23], the conventional VTLN warp-factors lie in the range (0.88, 1.24) and the Cosine-interpolation based warp-factors in the range (0.91, 1.11) for the same piece-wise linear warping. This indicates that conventional frequency-warping and the warping used in the Mel-domain by Cosine-interpolation cannot be directly compared, since the domains in which the warping is applied are different. In practice, most frequency-warping functions are specified in the linear-frequency (Hz) domain (and not the Mel-domain), often motivated by physiological arguments.
To summarize, the main differences between Cosine-interpolation and the linear-transformation derived in this paper are as follows.
• In Cosine-interpolation, the VTLN-warping $g_\alpha(\cdot)$ is applied in the Mel-domain, and hence the corresponding warp-factors are very different when compared with those from conventional VTLN.
• The interpolation is performed using the inverse-DCT matrix in Cosine-interpolation [see (22)], whereas the approach presented in this paper uses band-limited interpolation, as shown in (17).
Before proceeding further, we present the recognition setup along with the details of the databases used in our experiments in the next section.


TABLE I
DESCRIPTION OF THE CORPUS USED FOR EXPERIMENTS
IV. EXPERIMENTAL SETUP
The recognition experiments include four different sets of speech data: Wall Street Journal (WSJ0) [32], European Parliamentary Plenary Sessions (EPPS) English [33], Texas Instruments connected digits (TIDIGITS) [34], and Aurora 4.0 [35]. WSJ0, TIDIGITS, and EPPS-English are clean speech data, whereas Aurora 4.0 is noisy speech data. The details of the databases are presented in Table I. Aurora 4.0 consists of 14 different test sets: seven of them are recorded with a microphone similar to the one used for recording the training data, while the other seven are recorded with a different microphone.
All the experiments were done using the RWTH Aachen Speech Recognition System [36], except for the TIDIGITS task. While performing feature extraction, we use 20 filters ($M = 20$) and obtain 16 cepstral coefficients ($N = 16$). The features are mean and variance normalized at the segment level, and LDA is applied over a window of nine consecutive frames to derive a 45-dimensional feature vector. The system used classification and regression tree (CART) based state tying. We have 1501 generalized triphones for both WSJ0 and Aurora 4.0, and 4501 generalized triphones for the EPPS task. The HMM model consists of three emitting states with 256 mixtures per state and uses a pooled covariance matrix.
The TIDIGITS speech recognition task is done using HTK [37] and uses word models. It had 11 word models, which include zero to nine and oh. The features are of 39 dimensions, comprising normalized log-energy and the cepstral coefficients $c_1$–$c_{12}$ (excluding $c_0$), together with their first- and second-order derivatives. Cepstral mean subtraction is applied at the segment level. The digits were modeled with simple left-to-right HMMs without skips, with 16 emitting states and five diagonal-covariance Gaussian mixtures per state. Silence is modeled using a three-state HMM having six-mixture Gaussian models per state.
While performing VTLN in training, we follow a maximum-likelihood (ML)-based approach for estimating the optimal warping-factor, i.e.,

$$\hat{\alpha}_i = \operatorname*{argmax}_{\alpha}\;\Pr\!\big(\mathbf{X}_i^{\alpha} \mid \lambda, W_i\big) \qquad (24)$$

where $\lambda$ is the SI model and $W_i$ is the known transcription during training. $\mathbf{X}_i^{\alpha}$ is the VTLN-warped feature vector of utterance $i$, with the static features appended with the delta and acceleration coefficients, or obtained after transformation using LDA. Since the delta and acceleration coefficients are obtained from the static coefficients, the same VTLN transformation matrix can be used to obtain the VTLN-warped delta and acceleration coefficients. Therefore, the relation between the unwarped and VTLN-warped features is given by

$$\begin{bmatrix} \mathbf{c}_\alpha \\ \Delta\mathbf{c}_\alpha \\ \Delta^2\mathbf{c}_\alpha \end{bmatrix} = \begin{bmatrix} \mathbf{A}_\alpha & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{A}_\alpha & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{A}_\alpha \end{bmatrix} \begin{bmatrix} \mathbf{c} \\ \Delta\mathbf{c} \\ \Delta^2\mathbf{c} \end{bmatrix}. \qquad (25)$$

If the features are obtained using an LDA transformation matrix $\Theta$, then the relation between the VTLN-warped and the static unwarped features is given by [26]

$$\mathbf{x}_\alpha = \Theta \begin{bmatrix} \mathbf{A}_\alpha & & \\ & \ddots & \\ & & \mathbf{A}_\alpha \end{bmatrix} \mathbf{s} \qquad (26)$$

where the block-diagonal matrix contains one block per frame of the analysis window, $W$ represents the window length, and $\mathbf{s}$ represents the supervector formed by concatenating all the static MFCC cepstra from the adjacent frames.
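The block structure of (25) and (26) is straightforward to realize; the sketch below assumes numpy/scipy and a given LDA matrix `Theta`.

```python
import numpy as np
from scipy.linalg import block_diag

def warp_static_dynamic(A, feat):
    """(25): apply the same static transform A to the concatenated
    [static, delta, acceleration] feature vector."""
    return block_diag(A, A, A) @ feat

def warp_lda_features(A, Theta, supervec, n_frames=9):
    """(26): VTLN-warped LDA feature from the supervector of n_frames
    concatenated static MFCC vectors (nine frames in our setup)."""
    return Theta @ block_diag(*[A] * n_frames) @ supervec
```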
Using the estimated warped features, a new VTLN model is obtained. For performing the warp-factor estimation during testing, we use a Gaussian mixture model (GMM) classifier [38]. Unwarped features corresponding to each warping-factor obtained in training are used to train the GMMs with 256 mixtures. The optimal warping-factor in recognition is obtained by calculating the likelihood with respect to each warping-factor GMM and choosing the one that gives the best likelihood. The warping-factors are estimated at the speaker level in training and at the utterance level during recognition. During warped feature extraction, we map the frequency points zero and $\pi$ onto themselves using piece-wise linear warping. We do not account for the Jacobian in VTLN for the experiments presented in this paper.
V. IMPLEMENTATION DETAILS
In this section, we present the implementation details
for Sinc- and Cosine-interpolation approaches. Later, we
present the recognition results comparing the performance of
linear-transformation approaches with conventional VTLN.
A. Cosine-Interpolation
In this section, we discuss the implementation details for Cosine-interpolation and argue that the range of warping-factors has to be properly mapped, either in the Mel-domain or in the linear-frequency (Hz) domain, for comparing the recognition performance with the conventional and Sinc-interpolation approaches. Before proceeding further, we present recognition results for the TIDIGITS task in Table II. The models were trained using male speakers and are used for recognizing children speakers. For this task, Panchapagesan and Alwan observed that the Cosine-interpolation approach performed better than conventional VTLN (see [22] and [23, Sec. 6.1]). As we will show next, the warp-factors for Cosine-interpolation and conventional VTLN need to be mapped before they can be compared. This is due to the difference in the domains where frequency-warping is applied. If a proper mapping is chosen, the difference in performance observed in [22] and [23] no longer exists.


TABLE II
RECOGNITION RESULTS (%WER) COMPARING THE PERFORMANCE OF DIFFERENT APPROACHES TO VTLN FOR THE MALE-TRAIN AND CHILD-TEST CASE OF TIDIGITS. DIFFERENT RANGES OF WARPING-FACTORS HAVE TO BE USED TO GET COMPARABLE PERFORMANCE FOR THE CONVENTIONAL AND COSINE-INTERPOLATION APPROACHES DUE TO THE DIFFERENCE IN THE DOMAIN WHERE FREQUENCY WARPING IS APPLIED

Baseline - No VTLN; Conv. - Conventional; LT - Linear Transformation; M-C - Male Train - Child Test

Fig. 11. The different frequency-warping functions in the linear-frequency (Hz) domain. Using a warp-factor of 0.80 in the Mel-domain results in a warping function in the linear-frequency domain (dotted line) that is quite different from using the same value of the warp-factor (i.e., 0.80) directly in the linear-frequency (Hz) domain (solid line). The figure also shows that using 0.9194 as the warp-factor in the Mel-domain generates a frequency-warping function very similar to using $\alpha = 0.80$ in the linear-frequency (Hz) domain.

We briefly discuss the physiological motivation in choosing the range of warp-factors used in conventional VTLN for piece-wise linear warping. The average vocal-tract length for males is about 17 cm, that for females is about 14.5 cm, and that for children is about 12 cm; males can have vocal-tract lengths that are 19 cm or longer. Since the differences in vocal-tract lengths crudely manifest as scaling of the spectra for the same sound, the range of scaling (or warp-) factors is determined by the ratio of vocal-tract lengths. For adult speakers (i.e., only male and female speakers), this ratio varies from about $14.5/17 \approx 0.85$ to about $17/14.5 \approx 1.17$. Usually, the warp-factors for adult data are in the range of 0.80 to 1.20. However, if we train models using male speakers and use children speakers for test, then the range of warp-factors has to be different: in this case, the lower end of the range of warp-factors can be approximately $12/19 \approx 0.63$.
As pointed out in Section III-B, Cosine-interpolation performs VTLN-warping on the Mel-warped frequencies (i.e., $u' = g_\alpha(u)$) as opposed to performing frequency-warping in the linear-frequency (Hz) domain (i.e., $f' = g_\alpha(f)$). Using the same numerical value of the warping-factor in both the Mel- and linear-frequency (Hz) domains will result in different warpings in the frequency (Hz) domain. The differences are illustrated in Fig. 11. Using a warp-factor of 0.80 in the Mel-frequency domain and mapping the warping back to the linear-frequency domain (shown in the figure with a dotted line) produces a very different warping function in the linear-frequency domain when compared to directly using the 0.80 warp-factor in the linear-frequency domain (shown with a solid line in the figure). On the other hand, using a warp-factor of 0.9194 in the Mel-domain and mapping the function back to the linear-frequency domain results in frequency-warping very similar to $\alpha = 0.80$. We suspect that for the TIDIGITS task in [22] and [23], the same range of warp-factors from 0.80 to 1.25 was used both in conventional VTLN and in Cosine-interpolation. Since the test data is from children speakers, the lower limit of 0.80 for conventional VTLN did not provide sufficient search space and resulted in degraded performance. On the other hand, scaling the Mel-domain with 0.80 is approximately equivalent to scaling the linear-frequency domain by a factor of 0.5695. This provided a larger search-space for Cosine-interpolation, probably helping it to get better performance for children's speech when compared to conventional VTLN in [22], [23].
In order to have a fair comparison, the warping-factors in the linear-frequency (Hz) domain (or the Mel-domain) have to be in the same range for all the approaches. This can be done by calculating the equivalent warping-factor in one domain (say the linear-frequency (Hz) domain) by fixing the warping-factor in the other domain (say the Mel-frequency domain). The mapping of warping-factors is done as follows. Let the cutoff where the slope of the warping function changes be $f_0$ in Hz, and similarly let the cutoff where the slope changes in the Mel domain be $u_0 = \mathrm{Mel}(f_0)$. The corresponding warped frequencies are given as

$$f'_0 = \alpha_{\mathrm{Hz}}\, f_0 \qquad (27)$$

$$u'_0 = \alpha_{\mathrm{Mel}}\, u_0 \qquad (28)$$

where $\alpha_{\mathrm{Hz}}$ and $\alpha_{\mathrm{Mel}}$ are different. The idea is that the inverse Mel-warped $u'_0$ should match $f'_0$, or the Mel-warped $f'_0$ should match $u'_0$. This involves finding the value of $\alpha_{\mathrm{Hz}}$ (or $\alpha_{\mathrm{Mel}}$) that matches $\alpha_{\mathrm{Mel}}$ (or $\alpha_{\mathrm{Hz}}$), and it can be found by equating (27) and (28). Therefore, the equivalent $\alpha_{\mathrm{Hz}}$ when $\alpha_{\mathrm{Mel}}$ is fixed is given by

$$\alpha_{\mathrm{Hz}} = \frac{\mathrm{Mel}^{-1}\big(\alpha_{\mathrm{Mel}}\,\mathrm{Mel}(f_0)\big)}{f_0} \qquad (29)$$

and the equivalent $\alpha_{\mathrm{Mel}}$ when $\alpha_{\mathrm{Hz}}$ is fixed is given by

$$\alpha_{\mathrm{Mel}} = \frac{\mathrm{Mel}\big(\alpha_{\mathrm{Hz}}\, f_0\big)}{\mathrm{Mel}(f_0)}. \qquad (30)$$

Here, $\mathrm{Mel}^{-1}(\cdot)$ converts Mel frequencies to Hz and, similarly, $\mathrm{Mel}(\cdot)$ converts Hz frequencies to Mel frequencies.
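The mappings (29) and (30) reduce to two one-line functions (`mel` and `mel_inv` as defined earlier); the cutoff value below is an assumption, and the exact numbers quoted in the text depend on it.

```python
def alpha_mel_from_hz(alpha_hz, f0=6800.0):
    """(30): Mel-domain warp-factor equivalent to a Hz-domain one."""
    return mel(alpha_hz * f0) / mel(f0)

def alpha_hz_from_mel(alpha_mel, f0=6800.0):
    """(29): Hz-domain warp-factor equivalent to a Mel-domain one."""
    return mel_inv(alpha_mel * mel(f0)) / f0

# e.g., alpha_mel_from_hz(0.80) comes out near the 0.9194 quoted above,
# and alpha_hz_from_mel(0.80) near 0.5695; both move with the choice of f0.
```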


We map the warping-factors as discussed above by fixing $\alpha_{\mathrm{Hz}}$ and calculating the corresponding $\alpha_{\mathrm{Mel}}$. The new range of warping-factors for $\alpha_{\mathrm{Mel}}$ will be (0.91, 1.08) for the corresponding $\alpha_{\mathrm{Hz}}$ in the range (0.80, 1.25) for adult speech. These ranges are consistent with the observations made by Panchapagesan and Alwan (see [23, Fig. 4]). The recognition performance on the TIDIGITS task using the proper range of warping-factors is shown in Table II, i.e., we extend the lower range to account for children speakers. Unlike [22] and [23], we now observe that all the approaches to VTLN have similar performance. When the warping-factors are in the improper range of (0.80, 1.20) for child speakers, the conventional VTLN performance is inferior to Cosine-interpolation, as observed in [22].
For all the subsequent experiments in this paper, we use the mapping of warping-factors for Cosine-interpolation obtained by fixing the warp-factors in the linear-frequency (Hz) domain. In the next section, we present the implementation details of Sinc-interpolation when performing the linear-transformation of conventional MFCC.

TABLE III
RECOGNITION RESULTS (%WER) OF CONVENTIONAL AND LINEAR-TRANSFORMATION APPROACHES TO VTLN USING CONVENTIONAL MFCC. BOTH SINC- AND COSINE-INTERPOLATION APPROACHES PERFORM COMPARABLY TO CONVENTIONAL VTLN

B. Linear-Transformation of Conventional MFCC for VTLN Using Sinc-Interpolation

In order to perform band-limited interpolation, full spectral information in the frequency band is necessary. Since we use conventional MFCC, the available spectral information lies between the first and last filters of the filter-bank. Using a conventional filter-bank with 20 filters, the first filter has a center frequency around 135.2 Mel (or 89.2 Hz) and the last filter has a center frequency around 2704.8 Mel (or 7016.2 Hz). It is quite unlikely that a formant for any specific sound exists below or above these frequencies for any particular speaker. We can therefore safely assume that the first and last filter-bank center frequencies act as the zero and Nyquist frequencies, which should in no way affect the VTLN performance. This assumption inherently means that the center frequencies of the first and last filters map onto themselves after frequency warping. Note that by stating that the center frequencies of the first and last filters do not change with frequency-warping, we do not mean that we are ignoring the speech spectrum below 89.2 Hz: the information is still present in the first filter, since the lower end of the filter starts at zero frequency. The point we want to make is that the center frequencies of the first and last filters do not change after frequency warping. The only consequence of the center frequency of the first filter being used as the zero frequency in the linear-transformation case is that there will be a small consistent difference in numerical value between the warp-factor estimate obtained by this method and the conventional method; this is because there will be a small difference in the slope used in (5) and (6). The linear-transformation relation in this case is given by

$$\mathbf{c}_\alpha = \mathbf{A}_\alpha^{\mathrm{fb}}\,\mathbf{c} \qquad (31)$$

where the superscript fb indicates that the conventional Mel filter-bank is used. The linear transformation is derived as shown in (17).
The results comparing the recognition performance using the conventional Mel filter-bank are shown in Table III. We make the following observations.
• We observe that the linear-transformation based approaches perform comparably with the conventional approach, irrespective of noisy or clean speech.
• More importantly, we use the conventional Mel filter-bank without any modification and still perform VTLN-warping using a linear-transformation.

Fig. 12. Histogram and contour plots comparing the warp-factor estimates for the conventional and Sinc-interpolation approaches on the EPPS training data. In the linear-transformation approach, since we use conventional Mel-filters (without half-filters), the corresponding warp-factors are approximately 0.02 less than in conventional VTLN, which is reflected in the histogram. (a) Histogram plot. (b) Contour plot.

Fig. 12 shows the histogram and contour plots for the warp-factors obtained using the conventional and the proposed Sinc-interpolation approaches on the training data of the EPPS task. The histogram plot shows the distribution of warp-factors, and the contour plot gives an idea of how the warp-factor estimates differ between the Sinc and conventional VTLN approaches. From the histogram, we observe that the majority of the warp-factors are shifted by a single warp-factor step: the peaks at $\alpha = 1.02$, 0.94, and 0.92 in the conventional approach appear at $\alpha = 1.00$, 0.92, and 0.90, respectively. A similar behavior can also be observed from the contour plot. This is because we are using conventional Mel-filters (i.e., no additional half-filters) with the center frequency of the first Mel-filter (at 89.2 Hz) being mapped to the zero frequency to enable the linear-transformation approach without any change in the signal processing. This leads to a small consistent difference between the warp-factors obtained in the linear-transformation and the conventional approaches. Our analysis of the warp-factors obtained on the EPPS training data indicates that for any warp-factor in conventional VTLN, the corresponding warp-factors are the same or 0.02 smaller in 90% of the utterances. The correlation coefficient between the alpha estimates is 0.93, which also indicates that the deviations are only marginal.
Another source of approximation is the use of the truncated unwarped MFCC cepstra in (18) to obtain the VTLN-warped cepstra, which also results in some loss of information. Though there are small differences in the warp-factor distribution, the recognition performance of LT-Sinc is comparable to conventional VTLN on the variety of tasks presented in this paper.
VI. CONCLUSION
In this paper, we have presented an approach to perform VTLN using a linear transformation on conventional MFCC without any modification of the feature extraction steps. The linear-transformation is given by (18), with the interpolation matrix $\mathbf{T}_\alpha$ given by (17). Therefore, the linear-transformation can be analytically calculated using the above equations for any $\alpha$, as well as for any arbitrary warping function, by putting the appropriate $g_\alpha(\cdot)$ in (13). This is an important difference when compared to Cosine-interpolation, where $g_\alpha(\cdot)$ is used in the Mel-domain and the warping is different from conventional VTLN. Further, the corresponding warp-factors of the Cosine-interpolation approach and conventional VTLN cannot be easily compared. The key idea of our approach is to separate the speaker-scaling operation from the filter-bank, which helps us derive a linear transformation for VTLN using the idea of band-limited interpolation. The use of such transformations would enable the warp-factors to be efficiently estimated by accumulating sufficient statistics, the use of a regression-tree framework to perform VTLN at the acoustic-class level, or the use of the VTLN matrices as base matrices for adaptation until sufficient data is available. Such approaches cannot be easily implemented in the conventional VTLN framework. Using four different tasks to illustrate the efficacy of our proposed approach, we have shown that the recognition performance of our proposed linear-transformation approach is always comparable to conventional VTLN on both clean and noisy speech data.
ACKNOWLEDGMENT
D. R. Sanand would like to thank Prof. H. Ney for giving him an opportunity to work as a research assistant in the Human Language Technology Group at RWTH Aachen University, Aachen, Germany. The authors would like to thank Prof. H. Ney for providing resources to run the recognition experiments reported in this paper. The authors would also like to thank the anonymous reviewers for their valuable comments and suggestions.
REFERENCES
[1] G. Evermann, H. Y. Chan, M. J. F. Gales, T. Hain, X. Liu, D. Mrva, L. Wang, and P. C. Woodland, "Development of the 2003 CU-HTK conversational telephone speech transcription system," in Proc. ICASSP '04, Montreal, QC, Canada, May 2004, pp. 249–252.
[2] A. Sixtus, S. Molau, S. Kanthak, R. Schlüter, and H. Ney, "Recent improvements of the RWTH large vocabulary speech recognition system on spontaneous speech," in Proc. ICASSP '00, Istanbul, Turkey, Jun. 2000, pp. 1671–1674.
[3] G. Zavaliagkos, J. McDonough, D. Miller, A. El-Jaroudi, J. Billa, F. Richardson, K. Ma, M. Siu, and H. Gish, "The BBN Byblos 1997 large vocabulary conversational speech recognition system," in Proc. ICASSP '98, Seattle, WA, May 1998, pp. 905–908.
[4] J. L. Gauvain, L. Lamel, H. Schwenk, G. Adda, L. Chen, and F. Lefevre, "Conversational telephone speech recognition," in Proc. ICASSP '03, Hong Kong, Apr. 2003, pp. 212–215.
[5] A. Stolcke, H. Bratt, J. Butzberger, H. Franco, V. R. R. Gadde, M. Plauché, C. Richey, E. Shriberg, K. Sönmez, F. Weng, and J. Zheng, "The SRI March 2000 Hub-5 conversational speech transcription system," in Proc. NIST Speech Transcription Workshop, 2000.
[6] Y. Gao, Y. Li, V. Goel, and M. Picheny, "Recent advances in speech recognition system for IBM DARPA communicator," in Proc. Eurospeech '01, Aalborg, Denmark, Sep. 2001.
[7] A. Andreou, T. Kamm, and J. Cohen, "Experiments in vocal tract normalization," in Proc. CAIP Workshop: Frontiers in Speech Recognition II, 1994.
[8] L. Lee and R. Rose, "Frequency warping approach to speaker normalization," IEEE Trans. Speech Audio Process., vol. 6, no. 1, pp. 49–59, Jan. 1998.
[9] A. Acero and R. M. Stern, "Robust speech recognition by normalization of the acoustic space," in Proc. ICASSP '91, Toronto, ON, Canada, May 1991, pp. 893–896.
[10] A. Acero, "Acoustical and environmental robustness in automatic speech recognition," Ph.D. dissertation, Carnegie Mellon Univ., Pittsburgh, PA, 1990.
[11] J. McDonough, W. Byrne, and X. Luo, "Speaker normalization with all-pass transforms," in Proc. ICSLP '98, Sydney, Australia, Nov. 1998.
[12] M. Pitz and H. Ney, "Vocal tract normalization equals linear transformation in cepstral space," IEEE Trans. Speech Audio Process., vol. 13, no. 5, pp. 930–944, Sep. 2005.
[13] M. Pitz, S. Molau, R. Schlüter, and H. Ney, "Vocal tract normalization equals linear transformation in cepstral space," in Proc. Eurospeech '01, Aalborg, Denmark, Sep. 2001.
[14] S. Molau, M. Pitz, R. Schlüter, and H. Ney, "Computing Mel-frequency cepstral coefficients on the power spectrum," in Proc. ICASSP '01, Salt Lake City, UT, May 2001, pp. 73–76.
[15] M. Pitz, "Investigations on linear transformations for speaker adaptation and normalization," Ph.D. dissertation, RWTH Aachen, Aachen, Germany, Mar. 2005.
[16] T. Claes, I. Dologlou, L. ten Bosch, and D. van Compernolle, "A novel feature transformation for vocal tract length normalisation in automatic speech recognition," IEEE Trans. Speech Audio Process., vol. 6, no. 6, pp. 549–557, Nov. 1998.
[17] S. Cox, "Speaker normalization in the MFCC domain," in Proc. ICSLP '00, Beijing, China, Oct. 2000.
[18] D. Y. Kim, S. Umesh, M. J. F. Gales, T. Hain, and P. C. Woodland, "Using VTLN for broadcast news transcription," in Proc. Interspeech '04, Jeju Island, Korea, Sep. 2004.
[19] X. Cui and A. Alwan, "Adaptation of children's speech with limited data based on formant-like peak alignment," Comput. Speech Lang., vol. 20, no. 4, pp. 400–419, Oct. 2006.
[20] D. R. Sanand, D. D. Kumar, and S. Umesh, "Linear transformation approach to VTLN using dynamic frequency warping," in Proc. Interspeech '07, Antwerp, Belgium, Aug. 2007.
[21] S. Umesh, A. Zolnay, and H. Ney, "Implementing frequency warping and VTLN through linear transformation of conventional MFCC," in Proc. Interspeech '05, Lisbon, Portugal, Sep. 2005.
[22] S. Panchapagesan, "Frequency warping by linear transformation of standard MFCC," in Proc. Interspeech '06, Pittsburgh, PA, Sep. 2006.
[23] S. Panchapagesan and A. Alwan, "Frequency warping for VTLN and speaker adaptation by linear transformation of standard MFCC," Comput. Speech Lang., vol. 23, no. 1, pp. 42–64, Jan. 2009.
[24] D. R. Sanand and S. Umesh, "Study of Jacobian compensation using linear transformation of conventional MFCC for VTLN," in Proc. Interspeech '08, Brisbane, Australia, Sep. 2008.
[25] D. R. Sanand, R. Schlüter, and H. Ney, "Revisiting VTLN using linear transformation on conventional MFCC," in Proc. Interspeech '10, Makuhari, Japan, Sep. 2010.
[26] J. Lööf, H. Ney, and S. Umesh, "VTLN warping factor estimation using accumulation of sufficient statistics," in Proc. ICASSP '06, Toulouse, France, May 2006, pp. 1201–1204.
[27] P. T. Akhil, S. P. Rath, S. Umesh, and D. R. Sanand, "A computationally efficient approach to warp factor estimation in VTLN using EM algorithm and sufficient statistics," in Proc. Interspeech '08, Brisbane, Australia, Sep. 2008.
[28] S. P. Rath and S. Umesh, "Acoustic class specific VTLN-warping using regression class trees," in Proc. Interspeech '09, Brighton, U.K., Sep. 2009.
[29] C. Breslin, K. Chin, M. Gales, K. Knill, and H. Xu, "Prior information for rapid speaker adaptation," in Proc. Interspeech '10, Makuhari, Japan, Sep. 2010.
[30] L. Saheer, J. Dines, P. N. Garner, and H. Liang, "Implementation of VTLN for statistical speech synthesis," in Proc. ISCA Speech Synthesis Workshop, Sep. 2010.
[31] S. Wegmann, D. McAllaster, J. Orloff, and B. Peskin, "Speaker normalization on conversational telephone speech," in Proc. ICASSP '96, Atlanta, GA, May 1996, pp. 339–341.
[32] D. B. Paul and J. M. Baker, "The design for the Wall Street Journal-based CSR corpus," in Proc. ICSLP '92, Banff, AB, Canada, Oct. 1992.
[33] J. Lööf, M. Bisani, C. Gollan, G. Heigold, B. Hoffmeister, C. Plahl, R. Schlüter, and H. Ney, "The 2006 RWTH parliamentary speeches transcription system," in Proc. Interspeech '06, Barcelona, Spain, Jun. 2006.
[34] R. Leonard, "A database for speaker-independent digit recognition," in Proc. ICASSP '84, San Diego, CA, Mar. 1984, pp. 328–331.
[35] N. Parihar and J. Picone, "DSR front end LVCSR evaluation AU/384/02," Mississippi State Univ., Mississippi State, MS, Tech. Rep., Dec. 2002.
[36] D. Rybach, C. Gollan, G. Heigold, B. Hoffmeister, J. Lööf, R. Schlüter, and H. Ney, "The RWTH Aachen University open source speech recognition system," in Proc. Interspeech '09, Brighton, U.K., Sep. 2009.
[37] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. A. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book (for HTK Version 3.4). Cambridge, U.K.: Cambridge Univ. Eng. Dept., 2006.
[38] L. Welling, S. Kanthak, and H. Ney, "Improved methods for vocal tract normalization," in Proc. ICASSP '99, Phoenix, AZ, Mar. 1999, pp. 761–764.

D. R. Sanand received the Ph.D. degree in electrical engineering from the Indian Institute of Technology, Kanpur, in 2010.
From 2009 to 2010, he was a Postdoctoral Researcher in the Department of Computer Science, RWTH Aachen University, Aachen, Germany, and later in the Department of Information and Computer Science, Aalto University, Espoo, Finland, until 2011. Currently, he is a Postdoctoral Researcher in the Department of Electronics and Telecommunications, Norwegian University of Science and Technology, Trondheim, Norway. His research interests include speech recognition, synthesis, and biomedical signal processing.

S. Umesh received the Ph.D. degree in electrical engineering from the University of Rhode Island, Kingston, in 1993.
From 1993 to 1996, he was a Postdoctoral Fellow at the City University of New York. From 1996 to 2009, he was with the Indian Institute of Technology, Kanpur, first as an Assistant Professor and then as a Professor of electrical engineering. Since 2009, he has been with the Indian Institute of Technology, Madras, where he is a Professor of electrical engineering. He has also been a Visiting Researcher at AT&T Research Laboratories, the Machine Intelligence Laboratory, Cambridge University, U.K., and the Department of Computer Science (Lehrstuhl für Informatik VI), RWTH Aachen, Germany. His recent research interests have been mainly in the area of speaker normalization and noise robustness and their application in large-vocabulary continuous speech recognition systems. He has also worked in the areas of statistical signal processing and time-varying spectral analysis.
Dr. Umesh was a recipient of the Indian AICTE Career Award for Young Teachers and the Alexander von Humboldt Research Fellowship.
