
HMM topology for boundary refinement in automatic speech segmentation

E. Akdemir and T. Ciloglu


A boundary refinement method using a new hidden Markov model (HMM) topology is proposed for automatic phonetic speech segmentation. The proposed method can work at high frame rates, and its training and boundary refinement stages are easy and fast. The method is data driven and can be adapted to any speech segmentation problem provided that a training set is available. Given an initial segmentation obtained by forced alignment using an HMM based phone recogniser, a 20% decrease in boundary errors is achieved.

Introduction: Boundary refinement aims to improve the precision of phonetic boundary locations in a speech waveform by using the boundary locations estimated by an automatic speech segmentation (AS) system, together with acoustical and statistical knowledge about speech. Hidden Markov model (HMM) based speech recognisers are used for AS. They work at frame rates of 100 frames/s, which is low relative to the rate needed for the required segmentation accuracy (200 to 1000 frames/s). This is also the case with AS systems other than HMM based ones. Therefore, two-stage approaches are widely used in the literature. The boundaries obtained after the first stage have very few gross errors and many fine errors owing to poor time resolution. The refinement process has to decrease the magnitudes of the small errors without giving rise to additional large errors.

Several approaches to boundary refinement exist in the literature. In [1], average deviations from the hand-labelled boundaries are calculated for different boundary classes and the boundaries from the first stage are shifted by the boundary-specific average deviation. A context dependent approach [2] uses boundary models composed of a fixed-length sequence of Gaussian mixture models (GMMs) for every phoneme pair; the boundary is then searched for around the boundary point estimated in the first stage so as to maximise its likelihood given the model. Another method aims to minimise audible signal discontinuities caused by spectral mismatches when concatenating speech units [3]. The weighted spectral slope metric [4] is adapted to find the boundary as the point at which the spectral discontinuity is maximum. The search interval for the maximisation is determined according to the boundary class. A more comprehensive work [5] builds an artificial neural network (ANN) boundary model for the second stage. This model uses statistical information, such as the average durations of the phones in the database and the probability distribution function of the boundary around the boundary found at the first stage, and also acoustic features such as energy, correlation and the log-energy spectrum of the signal.

In this Letter, a boundary refinement method based on a new HMM topology is presented. Preliminary work tested on only two phoneme-to-phoneme boundaries was presented (in Turkish) at a local conference [6]. The work described here involves improved training and test stages, a new boundary-phoneme-class based approach that applies the boundary refinement to all phoneme couples, and results obtained on the MOCHA-TIMIT database [7], with and without inclusion of lip motion information.

HMM based boundary refinement: The goal of speech segmentation is to determine a point (or a frame) on the speech waveform, namely the boundary point. In contrast, conventional HMM based AS systems find the most likely state sequence for the observation with respect to the given models, which does not necessarily maintain accuracy in locating boundaries. The HMM topology proposed in this Letter overcomes this problem by assigning a special state that lasts for a single frame at the boundary (Fig. 1). The topology is left to right and has three states. The first state of the HMM is associated with the left phoneme class (PhCs) at the boundary. The second state is associated with a single boundary frame. The boundary state is active for only one frame and the transition to the third state is immediate. This is achieved by setting the transition probability from the second state to the third state (a23) to 1. The third state is associated with the right PhCs. Boundary models for each phoneme-to-phoneme boundary can be developed only if the training and test data are large enough. Here, instead, phonemes are divided into 10 classes and boundary models are developed for each PhCs-to-PhCs boundary, resulting in 121 boundary models (breath and silence form the eleventh class in both cases).

Fig. 1 Proposed three state HMM topology (states S1, S2, S3; a11 = 1 - (1/N1), a12 = 1/N1, a23 = 1, a33 = 1)
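
The transition structure of the three-state topology in Fig. 1 can be written down directly. The following is a minimal sketch, not the authors' code; the function name `boundary_hmm_transitions` and the argument `n1` (standing for N1, the average number of frames before the boundary) are illustrative choices.

```python
def boundary_hmm_transitions(n1):
    """Transition matrix of the three-state left-to-right boundary HMM.

    n1 stands for N1 in Fig. 1: the average number of frames observed
    before the boundary point (assumed here to be at least 1).
    """
    if n1 < 1:
        raise ValueError("n1 must be at least 1")
    a12 = 1.0 / n1
    return [
        [1.0 - a12, a12, 0.0],  # S1 (left class): stay, or move to the boundary state
        [0.0, 0.0, 1.0],        # S2 (boundary): a23 = 1, so it lasts exactly one frame
        [0.0, 0.0, 1.0],        # S3 (right class): a33 = 1, absorbing
    ]


A = boundary_hmm_transitions(n1=20)
for row in A:
    print(row)
```

Because the second row has no self-loop, any state path through the model contains exactly one frame in S2, which is what allows that frame to be read off as the boundary.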

Training HMM boundary models: A model for each PhCs-to-PhCs boundary is trained. Because each boundary is handled separately and a single state is assigned to the boundary point, the frames belonging to each state of the HMM can be determined directly. That is, given a dataset with hand-labelled boundaries, the frames before the boundary point are assigned to the first state of the HMM, the frames after the boundary are assigned to the third state, and the frame corresponding to the boundary point is assigned to the second state. 121 boundary models, one for each PhCs couple, are trained. The speech segments corresponding to the PhCs couples are extracted using the manually segmented boundary locations. These segments are grouped into 121 clusters. Each training set belonging to a PhCs-to-PhCs boundary contains the corresponding speech segments from the database and the boundary location information of these speech segments. The proposed HMM topology has the advantage of avoiding complex, iterative and time-consuming HMM training processes such as the Baum-Welch algorithm [8] or gradient based techniques [9]. The training is very fast and practical because the feature vectors belonging to the states of the HMMs are explicitly determined by the definition of the topology. For each training set, the initial 30% of the acoustic feature vectors of the left PhCs are omitted in order to discard the portion of the first phoneme that also carries the properties of its predecessor. The remaining feature vectors are used to estimate the probability density function of the first state of the boundary model. The feature vectors corresponding to the boundary location are used to estimate the probability distribution belonging to the second state. The probability distribution of the third state is found using the feature vectors of the right PhCs after omitting the last 30%. The state probability distributions are modelled with GMMs.
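
The direct, non-iterative training described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: each state is given a single Gaussian with per-dimension mean and variance, the 30% trimming is rounded down to whole frames, and the average frame count used for a12 is taken after trimming. The names `trim_segment`, `gaussian_stats` and `train_boundary_model` are hypothetical.

```python
from statistics import fmean, pvariance


def trim_segment(frames, boundary, omit_percent=30):
    """Split one segment at its hand-labelled boundary frame index.

    Drops the first omit_percent of the left frames and the last
    omit_percent of the right frames (rounded down: an assumption).
    """
    left = frames[:boundary]
    right = frames[boundary + 1:]
    left = left[len(left) * omit_percent // 100:]
    right = right[:len(right) - len(right) * omit_percent // 100]
    return left, frames[boundary], right


def gaussian_stats(vectors):
    """Per-dimension mean and variance of a list of feature vectors."""
    dims = list(zip(*vectors))
    means = [fmean(d) for d in dims]
    variances = [pvariance(d, mu) for d, mu in zip(dims, means)]
    return means, variances


def train_boundary_model(segments, omit_percent=30):
    """segments: list of (frames, boundary_index) for one PhCs-to-PhCs class."""
    s1, s2, s3 = [], [], []
    for frames, boundary in segments:
        left, mid, right = trim_segment(frames, boundary, omit_percent)
        s1.extend(left)   # frames before the boundary -> first state
        s2.append(mid)    # the boundary frame itself -> second state
        s3.extend(right)  # frames after the boundary -> third state
    if not s1 or not s3:
        raise ValueError("need frames on both sides of the boundary")
    # Assumption: the average is taken over the trimmed left-hand frames.
    a12 = 1.0 / (len(s1) / len(segments))
    return {
        "states": [gaussian_stats(s) for s in (s1, s2, s3)],
        "a11": 1.0 - a12, "a12": a12, "a23": 1.0, "a33": 1.0,
    }


# Toy one-dimensional example: 10 left frames, 1 boundary frame, 10 right frames.
segment = ([[0.0]] * 10 + [[5.0]] + [[10.0]] * 10, 10)
model = train_boundary_model([segment, segment])
print(model["a12"], model["states"])
```

No expectation-maximisation loop is needed because the state of every training frame is fixed by the hand-labelled boundary.
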
In our case, single mixture distributions are used; the parameters are calculated by finding the means and the variances of the feature vectors belonging to the corresponding state. The state transition probability from the first state to the second state, a12, is the inverse of the average number of feature vectors observed before the boundary point, and a11 = 1 - a12. The transition probabilities from the second state to the third state, a23, and from the third state to itself, a33, are 1.

In this work, the HMM based AS system proposed in a previous study [10] was used as the first stage of segmentation. The first stage system uses the publicly available MOCHA-TIMIT database [7]. The average absolute boundary error (AABE) between the manually segmented boundaries and the boundaries estimated by the AS system was reduced to 8.12 from 9.90 ms in that work. The acoustic feature vector in the first stage contains 13 MFCCs (including the energy coefficient) and their first- and second-order derivatives. The same feature vector composition is used in the proposed refinement system, with feature vectors extracted from 10 ms frames. The frame shift is set to 5 ms for voiced-voiced phoneme couples and to 1 ms for phoneme couples that include an unvoiced phoneme group (voiced-unvoiced, unvoiced-voiced and unvoiced-unvoiced), as the boundary before or after an unvoiced phoneme can be located more precisely. The precision for the voiced-voiced boundaries is lower because, most of the time, the change in the speech waveform is examined in a pitch synchronous manner during both manual and automatic segmentation.

Boundary refinement: The speech waveform is divided into phoneme couples as in the training stage; this time, the boundary locations estimated in the first stage are used instead of the manually found ones.

ELECTRONICS LETTERS 22nd July 2010 Vol. 46 No. 15

The starting 30% of the left phoneme and the last 30% of the right phoneme are omitted for each couple. The Viterbi algorithm is used to determine the refined boundaries. For each speech segment including a boundary, the corresponding PhCs-to-PhCs boundary model is identified and the algorithm determines the most likely sequence of hidden states given the observed frames and the HMM parameters. The location of the frame associated with the second state of the HMM boundary model is declared as the refined boundary.

Results: The proposed HMM boundary refinement method was tested on the MOCHA-TIMIT database using an HMM based AS system [10] as the first stage. At first, the method was tested on only two phoneme-to-phoneme boundaries, the /y/-/uu/ boundary and the /t/-(/uu/ or /o/) boundary; these boundaries have 125 and 50 occurrences in the database, respectively. For both, the experiments were carried out using five-fold validation. Table 1 displays the AABE values after the first stage and after boundary refinement. HMM boundary refinement decreased the AABE by about 44% for the /y/-/uu/ boundary and by about 63% for the /t/-(/uu/ or /o/) boundary.

The proposed method was then tested on the whole database. 420 of the utterances were used as the training set and 40 as the test set, as in the first stage system. The AS results of the first stage were obtained for three types of feature vectors, and all of these results were used to test the proposed method. The results are presented in Table 2. The three types of first stage results are obtained as in [10]: (i) using only the audio data, (ii) using the positions of the upper lip and the lower lip with the audio data, (iii) using different audiovisual feature vectors selectively according to the boundary type. The proposed method decreased the AABE by 18–20% for the different feature vector compositions of the first stage system.
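
The Viterbi refinement step described above can be sketched as follows. This is a minimal illustration, assuming one diagonal-covariance Gaussian per state with strictly positive variances and a path forced to start in the first state and end in the third; the function names and the toy data are hypothetical and are not taken from the Letter.

```python
import math

NEG_INF = float("-inf")


def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian (variances must be > 0)."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def safe_log(p):
    return math.log(p) if p > 0.0 else NEG_INF


def refine_boundary(frames, states, a12):
    """Return the index of the frame that Viterbi assigns to the boundary state.

    states: [(mean, var), (mean, var), (mean, var)] for S1, S2, S3.
    """
    if len(frames) < 3:
        raise ValueError("need at least three frames (one per state)")
    log_a = [[safe_log(1.0 - a12), safe_log(a12), NEG_INF],
             [NEG_INF, NEG_INF, 0.0],   # a23 = 1
             [NEG_INF, NEG_INF, 0.0]]   # a33 = 1
    emit = [[log_gauss(x, m, v) for m, v in states] for x in frames]
    delta = [emit[0][0], NEG_INF, NEG_INF]  # the path must start in S1
    back = []
    for t in range(1, len(frames)):
        new_delta, pointers = [], []
        for j in range(3):
            best = max(range(3), key=lambda i: delta[i] + log_a[i][j])
            new_delta.append(delta[best] + log_a[best][j] + emit[t][j])
            pointers.append(best)
        delta = new_delta
        back.append(pointers)
    state, path = 2, [2]                     # the path must end in S3
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path.index(1)                     # the single frame spent in S2


# Toy one-dimensional segment whose acoustics change around frame 4.
frames = [[0.1], [-0.2], [0.0], [0.1], [5.2], [9.9], [10.1], [10.0]]
states = [([0.0], [1.0]), ([5.0], [1.0]), ([10.0], [1.0])]
boundary = refine_boundary(frames, states, a12=0.25)
print(boundary)  # 4 for this toy input
```

The returned index is relative to the trimmed segment; converting it to a time would require adding the segment start and multiplying by the frame shift (5 ms or 1 ms, depending on the boundary type).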

Conclusion: A new HMM topology for boundary refinement is proposed. The training and test processes of the proposed boundary models are fast and practical. The method is data driven: no language or user specific adaptations are necessary, and it requires only the acoustic data and a reference set of boundary locations. Hence it has the potential to be used in all types of phonetic speech segmentation problems. The reduction in AABE for different first stage systems is about 18–20%, and the proposed method may lead to reductions in AABE of more than 50% for certain phoneme types.

© The Institution of Engineering and Technology 2010
26 May 2010
doi: 10.1049/el.2010.1390

E. Akdemir and T. Ciloglu (Department of Electrical and Electronics Engineering, Middle East Technical University, Cankaya, Ankara 06531, Turkey)
E-mail: eakdemir@metu.edu.tr

References
1 Matousek, J., Tihelka, D., and Psutka, J.: 'Automatic segmentation for Czech concatenative speech synthesis with boundary-specific correction'. Proc. Eurospeech, Geneva, Switzerland, September 2003, pp. 301–304
2 Sethy, A., and Narayanan, S.: 'Refined speech segmentation for concatenative speech synthesis'. Proc. Int. Conf. Spoken Language Processing, Denver, CO, USA, September 2002, pp. 149–152
3 Kim, Y.-J., and Conkie, A.: 'Automatic segmentation combining an HMM-based approach and spectral boundary correction'. Proc. Int. Conf. Spoken Language Processing, Denver, CO, USA, September 2002, pp. 145–148
4 Rabiner, L., and Juang, B.: 'Fundamentals of speech recognition' (Prentice-Hall, Inc., 1993)
5 Toledano, D.T., Gómez, L.A.H., and Grande, L.V.: 'Automatic phonetic segmentation', IEEE Trans. Audio Speech Lang. Process., 2003, 11, (6), pp. 617–625
6 Akdemir, E., and Ciloglu, T.: 'Otomatik Konuşma Bölütleme Sonuçlarının İyileştirilmesi İçin HMM Tabanlı Bölütleme Sistemi'. IEEE 17. Sinyal İşleme ve İletişim Uygulamaları Kurultayı (SIU 2009), Antalya, April 2009, pp. 381–384
7 Wrench, A. 1999, http://www.cstr.ed.ac.uk/research/projects/artic/mocha.html
8 Dempster, A.P., Laird, N.M., and Rubin, D.B.: 'Maximum likelihood from incomplete data via the EM algorithm', J. Roy. Stat. Soc. Series B (Methodological), 1977, 39, (1), pp. 1–38
9 Levinson, S.E., Rabiner, L.R., and Sondhi, M.M.: 'An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition', Bell Syst. Tech. J., 1983, 62, (4), pp. 1035–1074
10 Akdemir, E., and Ciloglu, T.: 'The use of articulator motion information in automatic speech segmentation', Speech Commun., 2008, 50, (7), pp. 594–604

Table 1: AABE for two boundary types before and after boundary refinement using proposed method

Boundary               AABE at first stage (ms)    AABE after HMM boundary refinement (ms)
/y/ - /uu/             15.8                        8.9 (43.6%)
/t/ - (/uu/ or /o/)    8.3                         3.1 (62.6%)
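
The percentages in parentheses are relative reductions in AABE. As a small sketch of how such figures are obtained, assuming boundary times in milliseconds and hypothetical helper names `aabe` and `relative_reduction`:

```python
def aabe(reference_ms, estimated_ms):
    """Average absolute boundary error between two equal-length lists of times."""
    if not reference_ms or len(reference_ms) != len(estimated_ms):
        raise ValueError("need two non-empty lists of equal length")
    total = sum(abs(r - e) for r, e in zip(reference_ms, estimated_ms))
    return total / len(reference_ms)


def relative_reduction(before_ms, after_ms):
    """Percentage decrease from the first stage AABE to the refined AABE."""
    return 100.0 * (before_ms - after_ms) / before_ms


# Made-up boundary times, for illustration only.
print(aabe([100.0, 250.0, 400.0], [104.0, 247.0, 405.0]))  # 4.0
# Table 1 values for the /y/ - /uu/ boundary.
print(relative_reduction(15.8, 8.9))
```

Applied to the rounded table entries, this gives roughly 43.7% and 62.7%; the small differences from the printed 43.6% and 62.6% presumably come from rounding or truncation of the underlying values.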

Table 2: AABE for three AS systems before and after boundary refinement using proposed method

First stage AS system                                    AABE at first stage (ms)    AABE after HMM boundary refinement (ms)
Acoustic feature vector                                  9.94                        7.96 (19.9%)
Audiovisual feature vector including lip positions       8.91                        7.32 (17.75%)
Selectively using audiovisual feature vectors            8.26                        6.77 (18.4%)

