D. Srikrishna Satish is with the Department of Computer Science and Engineering, Indian Institute of Technology Madras, Chennai-36, India.
C. Chandra Sekhar is with the Department of Computer Science and Engineering, Indian Institute of Technology Madras, Chennai-36, India (email: chandra@cs.iitm.ernet.in).

Abstract: In this paper, we propose an approach to segmentation of continuous speech into syllable-like units where each unit has one or more consonants followed by a vowel. The proposed approach uses the clustering and vector quantization methods to identify the consonant, transition and vowel regions in continuous speech. We consider methods based on clustering and vector quantization in the Mercer kernel feature space for separation of nonlinearly separable clusters of data belonging to the different regions. Results of experimental studies demonstrate the effectiveness of the kernel based methods in improving the performance of the speech segmentation system.

I. INTRODUCTION

Continuous speech recognition involves speech signal-to-symbol transformation and symbol-to-text conversion, where symbols correspond to subword units of speech. Syllable-like units are commonly used as symbols [1]. One approach to developing the speech signal-to-symbol transformation module is based on segmentation and labeling. This approach involves detection of boundaries of subword unit segments in the given continuous speech signal and then using a subword unit classifier to assign a label for each segment. Among different forms of syllable-like units, the C+V units have a high frequency of occurrence. A C+V unit consists of one or more consecutive consonants (C) followed by a vowel (V). The most common forms of C+V units are CV, CCV, and CCCV in the decreasing order of frequency of occurrence. Approaches based on signal processing methods have been developed for detection of syllable boundaries in continuous speech [2]. In this paper, we propose a clustering based approach for detection of boundaries of C+V units in continuous speech.

The clustering based approach to detection of boundaries of C+V units involves identification of the consonant regions and vowel regions in the continuous speech signal. The consonant and vowel regions have different gross characteristics of speech signal. Different categories of consonants such as stop consonants, nasals, semivowels and fricatives also have differences in their gross characteristics of speech signal. Production of speech for a CV unit involves transition from a consonant to a vowel. The characteristics of the signal for the transition region are different from those of the consonant and vowel regions. The characteristics of the transition region are also different for different CV units. Therefore segmentation of continuous speech utterance involves identification of the consonant regions, transition regions and vowel regions in the speech signal.

We propose a clustering based approach to identify the different types of regions in the speech signal. The clustering approach involves forming clusters in the space of speech parametric vectors that represent the spectral characteristics of the speech signal. The clusters are nonlinearly separable for the data of consonants and vowels that have similar spectral characteristics. The commonly used K-means clustering method gives a linear separation of data and is not suitable for separation of nonlinearly separable data. In this paper, we explore the Mercer kernel based methods for clustering of nonlinearly separable data. These methods involve a nonlinear transformation from the input space to a higher dimensional feature space induced by the innerproduct kernels or Mercer kernels. Nonlinear mapping by the kernels is expected to lead to linear separability of data in the kernel feature space. Then a linear separation formed by clustering in the kernel feature space would correspond to a nonlinear separation in the input space. We also address issues in performing vector quantization in the kernel feature space. The methods for kernel based clustering and vector quantization are used for segmentation of continuous speech into C+V units.

The paper is organized as follows: The method for clustering in Mercer kernel feature space is explained in Section II. The method for vector quantization in the kernel feature space is described in Section III. The proposed approach for detection of boundaries of C+V units in continuous speech is presented in Section IV. Experimental studies on detection of boundaries of C+V units are presented in Section V.

II. CLUSTERING IN KERNEL FEATURE SPACE

Consider a set of $N$ data points in the input space, $x_n$, $n = 1, 2, \ldots, N$. Let the number of clusters to be formed be $K$. The criterion used by the K-means clustering method in the input space for grouping the data into $K$ clusters is to minimize the trace of the within-cluster scatter matrix, $S_w$, defined as follows [3]:

$$S_w = \frac{1}{N} \sum_{k=1}^{K} \sum_{n=1}^{N} z_{kn} (x_n - m_k)(x_n - m_k)^T \qquad (1)$$

where $m_k$ is the center of the $k$th cluster, $C_k$, and $z_{kn}$ is the membership of data point $x_n$ to the cluster $C_k$. The membership value $z_{kn} = 1$ if $x_n \in C_k$, and 0 otherwise. The number of points in the $k$th cluster is given as $N_k$, defined by:

$$N_k = \sum_{n=1}^{N} z_{kn} \qquad (2)$$
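The K-means criterion of Eqs. (1) and (2) can be illustrated with a minimal sketch. It relies on the identity that the trace of $(x_n - m_k)(x_n - m_k)^T$ equals the squared Euclidean distance between $x_n$ and $m_k$. The function names, the toy data, and the choice of the cluster mean as the center $m_k$ (the usual K-means convention) are assumptions made for illustration, and every cluster is assumed to be non-empty.

```python
def cluster_sizes(z):
    """N_k = sum over n of z[k][n], as in Eq. (2)."""
    return [sum(row) for row in z]


def cluster_centers(x, z):
    """Center m_k taken as the mean of the points with z[k][n] == 1 (assumption)."""
    dim = len(x[0])
    centers = []
    for row in z:
        n_k = sum(row)  # assumed non-zero: no empty clusters
        centers.append(
            [sum(z_kn * p[d] for z_kn, p in zip(row, x)) / n_k for d in range(dim)]
        )
    return centers


def trace_within_scatter(x, z):
    """tr(S_w) = (1/N) * sum_k sum_n z_kn * ||x_n - m_k||^2, the trace of Eq. (1)."""
    centers = cluster_centers(x, z)
    total = 0.0
    for row, m in zip(z, centers):
        for z_kn, p in zip(row, x):
            total += z_kn * sum((a - b) ** 2 for a, b in zip(p, m))
    return total / len(x)


# Hypothetical toy data: four 2-D points, two candidate groupings into K = 2 clusters.
x = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]]
z_compact = [[1, 1, 0, 0], [0, 0, 1, 1]]  # groups nearby points together
z_spread = [[1, 0, 1, 0], [0, 1, 0, 1]]   # groups distant points together

print(cluster_sizes(z_compact))            # [2, 2]
print(trace_within_scatter(x, z_compact))  # 1.0
print(trace_within_scatter(x, z_spread))   # 4.0
```

The compact grouping gives the smaller trace, which is the grouping that the K-means criterion prefers.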
in the input space. The linear separation of data formed by the K-means clustering method in the input space is shown in Figure 1(b). The separations obtained using the kernel based clustering method in the feature space of the Gaussian kernel for different values of $\sigma^2$ are shown in Figures 1(c) and 1(d). It can be seen from Figure 1(d) that the points of the two nonlinearly separable classes are perfectly separated for $\sigma^2 = 0.18$. This example demonstrates the potential of the kernel based clustering method to separate the nonlinearly separable data in the input space.

is no explicit representation of the mean vector of a cluster. For performing vector quantization in kernel feature space, a vector $x$ is assigned the index of the cluster whose center has the highest similarity to $\Phi(x)$. The similarity measure between the vector $x$ and the center of a cluster, $C_k$, in the feature space can be computed as follows:

$$s_k(x) = \Phi^T(x)\, m_k^{\Phi} = \frac{1}{N_k} \sum_{\Phi(x_i) \in C_k} K(x, x_i) \qquad (18)$$

Computation of $s_k(x)$, $k = 1, 2, \ldots, K$, involves performing the kernel operation between $x$ and each of the $N$ data vectors used in formation of the clusters. Therefore vector quantization in the kernel feature space is computationally intensive. In order to reduce the number of kernel operations
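The similarity measure of Eq. (18) and the resulting vector quantization rule can be sketched as follows. This is only an illustration under stated assumptions: the Gaussian kernel is taken in the form exp(-||x - y||^2 / sigma2), whose exact scaling is not given in this excerpt, and the function names and toy clusters are hypothetical.

```python
import math


def gaussian_kernel(x, y, sigma2):
    """K(x, y) = exp(-||x - y||^2 / sigma2); the scaling by sigma2 is an assumption."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / sigma2)


def similarity(x, cluster, sigma2):
    """s_k(x) = (1 / N_k) * sum of K(x, x_i) over the members x_i of cluster k (Eq. 18)."""
    return sum(gaussian_kernel(x, xi, sigma2) for xi in cluster) / len(cluster)


def quantize(x, clusters, sigma2):
    """Return the index of the cluster whose feature-space center is most similar to x."""
    scores = [similarity(x, c, sigma2) for c in clusters]
    return max(range(len(clusters)), key=scores.__getitem__)


# Hypothetical toy clusters, each stored as the list of its member vectors.
clusters = [
    [[0.0, 0.0], [0.1, 0.0]],
    [[1.0, 1.0], [1.0, 0.9]],
]
sigma2 = 0.18  # kernel width value quoted in the text for the Figure 1(d) example

print(quantize([0.05, 0.1], clusters, sigma2))  # 0
print(quantize([0.9, 1.0], clusters, sigma2))   # 1
```

Each call to `quantize` evaluates the kernel once per stored data vector, which is the cost that the text describes as computationally intensive.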
[Figure: block diagram; recoverable block labels: "Continuous speech signal for a sentence", "Short time spectrum analysis"]

classes with stop consonants, affricates or fricatives as consonants, the characteristics of the signal in different regions of the CV unit are significantly different. However, for the CV classes with nasals or semivowels as consonants, the different regions in the CV unit have similar characteristics leading to
TABLE I
Performance of the system for detection of boundaries of C+V units using different methods of clustering and vector quantization (VQ) and for different values of [symbol missing] given in milliseconds.

Method                                          Matching hypotheses (in %)    Spurious hypotheses (in %)
                                                10 ms    20 ms    30 ms       10 ms    20 ms    30 ms
Clustering and VQ in the input space            47.62    58.41    62.24       40.99    27.68    22.87
Clustering and VQ in the kernel feature space   55.51    67.08    73.34       44.9     33.21    27.02

TABLE II
Number of matching hypotheses in detection of boundaries of C+V units of different categories of consonants.

Category           No. of actual    No. of matching hypotheses
                   boundaries       Input space clustering    Kernel based clustering
Stop consonants    1110             907                       901
Fricatives         316              293                       262
Nasals             336              82                        245
Semivowels         654              170                       352

the system using the second level hypothesization is given in Table III for the kernel based clustering method. For comparison, the performance of the system without using the second level hypothesization is also given in Table III. It is seen that the second level hypothesization is effective in increasing the percentage of matching hypotheses by about 10.5%.
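To make the counts in Table II easier to compare across consonant categories, they can be converted into matching rates. The sketch below does only that arithmetic. The percentages it prints are derived here from the table entries and are not figures reported in the paper, and the function and variable names are hypothetical.

```python
# Counts copied from Table II: (actual boundaries, matches with input space
# clustering, matches with kernel based clustering).
table_ii = {
    "Stop consonants": (1110, 907, 901),
    "Fricatives": (316, 293, 262),
    "Nasals": (336, 82, 245),
    "Semivowels": (654, 170, 352),
}


def match_rate(actual, matched):
    """Percentage of actual boundaries that have a matching hypothesis."""
    return 100.0 * matched / actual


def summarize(table):
    """Per-category and pooled matching rates (input space, kernel based), in percent."""
    rows = {
        category: (match_rate(actual, inp), match_rate(actual, ker))
        for category, (actual, inp, ker) in table.items()
    }
    total_actual = sum(v[0] for v in table.values())
    total_inp = sum(v[1] for v in table.values())
    total_ker = sum(v[2] for v in table.values())
    rows["All categories"] = (
        match_rate(total_actual, total_inp),
        match_rate(total_actual, total_ker),
    )
    return rows


for category, (inp, ker) in summarize(table_ii).items():
    print(f"{category:16s} input space: {inp:5.1f}%   kernel based: {ker:5.1f}%")
```

With these counts, the rates for nasals and semivowels rise from roughly 24% and 26% to roughly 73% and 54%, while the rates for stop consonants and fricatives do not improve.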
Fig. 3. (a) Speech signal waveform for the utterance of the sentence /pra dhA ni jA ti ki a~N ki tam chE sA ru/, (b) Manually marked boundaries of C+V segments, (c) Boundaries hypothesized by clustering and vector quantization in the input space, (d) Boundaries hypothesized by clustering and vector quantization in the kernel feature space, and (e) Boundaries hypothesized using second level hypothesization.

Fig. 4. (a) Speech signal waveform for the utterance of the sentence /a me ri kA adh yak Shu Du/, (b) Manually marked boundaries of C+V segments, (c) Boundaries hypothesized by clustering and vector quantization in the input space, (d) Boundaries hypothesized by clustering and vector quantization in the kernel feature space, and (e) Boundaries hypothesized using the second level hypothesization.

VI. CONCLUSIONS

In this paper, we have proposed a method for detection of boundaries of C+V segments in the given continuous speech using a clustering and vector quantization based method. It has been demonstrated that the method based on the kernel based clustering and vector quantization gives a better performance than the method based on the input space clustering and vector quantization. It has been shown that the use of the kernel based clustering method for separation of the data of different regions in the CV classes has improved the performance of the method for speech segmentation.

REFERENCES

[1] Suryakanth V. Gangashetty, Neural Network Models for Recognition of Consonant-Vowel Units of Speech in Multiple Languages, Ph.D. thesis, Indian Institute of Technology Madras, February 2005.
[2] T. Nagarajan, Implicit Systems for Spoken Language Identification, Ph.D. thesis, Indian Institute of Technology Madras, February 2004.
[3] M. Girolami, "Mercer kernel-based clustering in feature space," IEEE Trans. Neural Networks, vol. 13, no. 3, pp. 780-784, May 2002.
[4] D. Srikrishna Satish, Kernel Based Clustering and Vector Quantization for Pattern Classification, M.S. thesis, Indian Institute of Technology Madras, March 2005.
[5] J. M. Buhmann, Data Clustering and Data Visualisation, Kluwer Academic, 1998.
[6] D. S. Satish and C. C. Sekhar, "Kernel based clustering and vector quantization for speech recognition," in Proceedings of IEEE International Workshop on Machine Learning for Signal Processing, Brazil, pp. 315-324, September 2004.
[7] A. Nayeemulla Khan, S. V. Gangashetty, and S. Rajendran, "Speech database for Indian languages - A preliminary study," in Proc. Int. Conf. Natural Language Processing, December 2002, pp. 295-301.