
CS5984 Pattern Recognition and Clustering

Speaker Identification System Using HMM and Mel Frequency Cepstral Coefficient

Project Client
Dr. Yingen Xiong

Project Team
Seonho Kim (shk@vt.edu) Seungwon Yang (seungwon@vt.edu)

May 10, 2006

TABLE OF CONTENTS

1. Problem Definition
2. Memex Project
3. Data Set
4. Our Approaches
   Step 1: Data Collection
   Step 2: Remove Silence
   Step 3: Emphasize Signal
   Step 4: Apply MFCC
   Step 5: Shift Signal
   Step 6: Train HMM (four different versions)
   Step 7: Evaluation
5. Four Different Training Methods
   a. Training 12 Values of MFCC
   b. Training MFCC Mean Values
   c. Training MFCC Selected One Feature
   d. Training 4 Utterances of a Word
6. Evaluations
7. HMM Parameters
8. Future Work and Conclusions
References

TABLE OF FIGURES

Figure 1: Five utterances of Virginia Tech
Figure 2: Silence Removal
Figure 3: Emphasize Signal
Figure 4: MFCC Transformation
Figure 5: Shift Signal
Figure 6: MFCC Result Training Data
Figure 7: MFCC Mean Training Data
Figure 8: Selected Feature Training Data
Figure 9: Four Utterances Training Data

1. Problem Definition

Let's say that we have years of audio data recorded every day using a portable recording device. From this huge amount of data, I want to find all the audio clips of discussions with a specific person. How can I find them? Another example: a group of people are having a discussion in a video conferencing room. Can I make the camera automatically focus on a specific person (for example, the group leader) whenever he or she speaks, even if the other people are also talking? A speaker identification system, which allows us to find a person based on his or her voice, can give us solutions to these questions.

2. Memex Project

This project, the development of a speaker identification system, is related to the Memex project (http://www.memex.cs.vt.edu/index.php?option=com_content&task=view&id=54&Itemid=71). In the Memex project, researchers will collect huge amounts of audio, image, and GPS data in everyday life as digital memories, using a device called SenseCam together with supporting software. Once the data is collected, it becomes important to be able to retrieve what we want to find. We would like to apply the speaker identification system to this purpose so that we can retrieve conversations with a specific person from this large amount of data.

3. Data Set

We recruited five people, two male and three female, and recorded their voices in a quiet place using audio recording/editing software (Cool Edit). We had each of them say "Virginia Tech" five times. Then, using the software, we cut out each utterance of "Virginia Tech" and saved it as a separate audio file. Four of the utterances were used for training and one was used for evaluation.

4. Our Approaches

Step 1: Data Collection

We collected audio data from five people, two male and three female. We had each of them say "Virginia Tech" five times. Four of the utterances were used for HMM training, and one was used for evaluation. Figure 1 shows the audio signals of the recorded voices.

Figure 1: Five utterances of Virginia Tech

Step 2: Remove Silence


Once the audio data was collected, silence in the signal was removed using a software tool. Figure 2 shows the signal before and after silence removal. It is observable that the overall signal length was reduced by about half (see the x-axis values).

Figure 2: Silence removal
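The trimming itself was done interactively in the editing tool. Purely as an illustration, a simple energy-threshold version of the same step in Matlab might look like the following sketch (the file name, frame size, and threshold are assumptions, not values from the project):

```matlab
[s, fs] = wavread('virginia_tech_1.wav');   % hypothetical file name
frameLen = round(0.01 * fs);                % ~10 ms frames (assumed)
nFrames  = floor(length(s) / frameLen);
voiced   = [];
for k = 1:nFrames
    idx   = (k-1)*frameLen + (1:frameLen);
    frame = s(idx);
    if sum(frame.^2) > 1e-3                 % assumed energy threshold
        voiced = [voiced; frame(:)];        % keep frames above the threshold
    end
end
s = voiced;                                 % silence-stripped signal
```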

Step 3: Emphasize Signal


The signal was then pre-emphasized using Matlab's filter function.

Figure 3: Emphasize Signal
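The report does not give the filter coefficients; a common choice (assumed here) is a first-order pre-emphasis filter with coefficient 0.97, applied with Matlab's filter function to the silence-stripped signal s from the previous step:

```matlab
y = filter([1 -0.97], 1, s);   % y(t) = s(t) - 0.97*s(t-1); 0.97 is an assumed value
```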

Step 4: Apply MFCC


Once the signal was emphasized, 12 features were extracted using the Mel Frequency Cepstral Coefficient (MFCC) transformation. The second graph below shows the 12 coefficients in different colors, extracted from the emphasized signal.

Figure 4: MFCC transformation
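A minimal sketch of this step, assuming the melcepst function from the VOICEBOX toolbox (Reference 4) is on the Matlab path; with its default options it returns 12 mel cepstral coefficients per analysis frame:

```matlab
c = melcepst(y, fs);   % one row of 12 coefficients per frame, i.e. c is T x 12
plot(c);               % roughly reproduces the colored-coefficient plot of Figure 4
```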

Step 5: Shift Signal


Since the HMM module we downloaded (the HMM Toolbox for Matlab, <http://www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html>) accepts values from 0 to 100, we shifted the extracted 12 feature values into that range using a simple formula.

Figure 5: Shift signal
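The exact formula is not stated in the report; the sketch below is one plausible linear rescaling of the coefficient matrix c onto the integer symbol range 1..100 used by the discrete HMM (see Section 7):

```matlab
cmin = min(c(:));
cmax = max(c(:));
obs  = round((c - cmin) ./ (cmax - cmin) * 99) + 1;   % T x 12 integers in 1..100
```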

Step 6: Train HMM (four different versions)


The different versions are listed below.

1. Iterative speaking: "Virginia Tech" x 4
2. 12 features: using MFCC
3. 1 selected feature: one selected coefficient
4. MFCC mean: average of the 12 coefficients

We implemented four different versions to see which one performed better. The first version, iterative speaking, accepts the mean value of the 12 MFCC features from four utterances of "Virginia Tech" as training data, while the other three versions use only one utterance for training. The second version, 12 features, accepts a vector of 12 feature values as its input. For the third version, we selected one of the 12 MFCC features as input. Finally, the MFCC mean version accepts the mean value of the 12 MFCC features. A training sketch is given below.
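As a hedged sketch, training one discrete HMM per speaker with the downloaded HMM Toolbox might look like the following, using Q = 15 hidden states and O = 100 observation symbols from Section 7. Here seq is assumed to be a 1 x T row vector of integer symbols prepared in the earlier steps (for example, the quantized per-frame MFCC mean):

```matlab
% seq: assumed 1 x T vector of integer observation symbols in 1..100
Q = 15;    % hidden states (Section 7)
O = 100;   % observation symbols after rescaling
prior0    = normalise(rand(Q, 1));
transmat0 = mk_stochastic(rand(Q, Q));
obsmat0   = mk_stochastic(rand(Q, O));
[LL, prior1, transmat1, obsmat1] = ...
    dhmm_em(seq, prior0, transmat0, obsmat0, 'max_iter', 20);
```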

Step 7: Evaluation

Using the test data, the one utterance of "Virginia Tech" per speaker that we set aside in the data collection step, we computed the log-likelihood under each speaker's HMM. The HMM with the largest log-likelihood identifies the speaker.
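A sketch of this decision rule with the HMM Toolbox; models is an assumed struct array holding the trained prior, transmat, and obsmat of each of the five speakers, and testObs is the quantized test utterance:

```matlab
% models(i).prior / .transmat / .obsmat: assumed trained parameters per speaker
ll = zeros(1, numel(models));
for i = 1:numel(models)
    ll(i) = dhmm_logprob(testObs, models(i).prior, ...
                         models(i).transmat, models(i).obsmat);
end
[bestLL, speaker] = max(ll);   % index of the identified speaker
```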

5. Four Different Training Methods

We implemented four different versions of the training method. This is because the HMM we used accepts only one-dimensional data as training input, while our data has 12 dimensions.

a. Training 12 Values of MFCC


The first training model trains on the whole length of one utterance of "Virginia Tech" by a person. Each training input string consists of the 12 integer values at a given time t; that is, the size of the training data is 12 x (length of the utterance). Unlike the other training methods, this method assumes that transitions occur between the 12 features rather than between times t and t+1.

Figure 6: MFCC Results Training Data

Figure 6 shows the training data for this method. Each point in time consists of 12 values plotted in different colors. The HMM trains until it reaches the end of the data. A sketch of this data layout follows.
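In terms of the HMM Toolbox input format, this layout can be sketched as follows: obs is the T x 12 matrix of quantized coefficients from Step 5, and each row (one frame's 12 values) is passed as a separate length-12 observation sequence, so transitions are learned across the 12 features. The initial prior0, transmat0, and obsmat0 are assumed to be set up as in the Step 6 sketch:

```matlab
data_a = obs;   % T sequences, each of length 12 (one per frame)
[LL, prior1, transmat1, obsmat1] = ...
    dhmm_em(data_a, prior0, transmat0, obsmat0, 'max_iter', 20);
```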

b. Training MFCC Mean Values


The second training model also trains on the whole length of one utterance of "Virginia Tech" by a person. In this method we used the mean value of the 12 MFCC features instead of all 12 values. We converted this one-dimensional data into HMM input of size n x (length of utterance / n), where n is an arbitrary number, so that each training input string consists of n integer values observed from time t to t+n.

Figure 7: MFCC Mean Training Data

Figure 7 shows the one-dimensional MFCC mean training data and the converted n-dimensional input data actually given to the HMM. A sketch of the conversion follows.
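A sketch of this conversion, with n = 30 as an assumed example value (the report leaves n arbitrary): the per-frame mean of the quantized coefficients is folded into rows of n symbols, each row serving as one training string:

```matlab
m = round(mean(obs, 2));             % T x 1 quantized MFCC mean sequence
n = 30;                              % assumed segment length
L = floor(numel(m) / n) * n;         % drop the incomplete tail segment
data_b = reshape(m(1:L), n, []).';   % (L/n) x n, one training sequence per row
```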

c. Training MFCC Selected One Feature


The third training model also trains on the whole length of one utterance of "Virginia Tech" by a person. In this method we used one selected feature value instead of all 12 MFCC values. We converted this one-dimensional data into HMM input of size n x (length of utterance / n), where n is an arbitrary number, so that each training input string consists of n integer values observed from time t to t+n.

Figure 8: Selected Feature Training Data

Figure 8 shows the one-dimensional selected-feature training data and the converted n-dimensional input data actually given to the HMM.
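The conversion mirrors method (b), starting from a single chosen coefficient instead of the mean (the coefficient index k and segment length n below are assumed example values):

```matlab
k = 1;                               % assumed index of the selected coefficient
n = 30;                              % assumed segment length, as in method (b)
f = obs(:, k);                       % T x 1 sequence of one quantized feature
L = floor(numel(f) / n) * n;
data_c = reshape(f(1:L), n, []).';   % one n-symbol training sequence per row
```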

d. Training 4 utterances of a word


The fourth training model trains on four utterances of "Virginia Tech" by a person. In this case the HMM trains on only the first 30 sample points of each utterance. Each utterance was transformed into 12 features using the MFCC filter, and the MFCC mean was then computed. Thus, the size of the training data is 4 x 30, and each training input string consists of 30 integer values observed from time t to t+30.

Figure 9: Four Utterances Training Data

Figure 9 shows the four one-dimensional MFCC mean training sequences. Each line in a different color represents a different utterance.
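A sketch of this layout: meanSeq is an assumed cell array holding the quantized MFCC-mean sequence of each of the four training utterances, of which only the first 30 frames are kept:

```matlab
% meanSeq{u}: assumed quantized MFCC-mean sequence of training utterance u
data_d = zeros(4, 30);
for u = 1:4
    data_d(u, :) = meanSeq{u}(1:30);   % first 30 symbols of utterance u
end
% data_d is then passed to dhmm_em as four length-30 training sequences
```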

6. Evaluations

For the evaluation, we used the log-likelihood to estimate the probability that the test voice data was generated by each person's HMM. We selected the HMM with the greatest log-likelihood and identified the speaker to whom that HMM belonged. The evaluation task was to pick one person out of five, so random selection would be correct 20% of the time. Our four training methods showed accuracy rates of about 60% to 70%. However, because of limits on time, data, and resources, we could not conduct a thorough system evaluation; we leave that for further study.

7. HMM Parameters

The number of visible states (observation symbols) used in this project was determined from the training data, the utterance of each person. The utterance waveform fluctuates between -1 and 1. Because the HMM implementation we used accepts only positive integers, we discretized the values by shifting and rescaling them to integers between 1 and 100, so 100 is used as the number of visible states. In our tests, the number of hidden states did not matter much as long as it was neither very small nor very large; however, it does affect the speed of training, so we fixed the number of hidden states at 15.

8. Future Work and Conclusions

In this project we implemented a speaker identification system based on HMMs and Mel Frequency Cepstral Coefficients (MFCC). In the training module, four different versions were built to try different methods, and the four training methods showed similar accuracy. For further study, we will compare and evaluate the four training methods more precisely with larger data sets and more participants. We also plan to extend this system to text-independent speaker identification so that it can be used for multimedia data retrieval in the Memex project; this will likely require features that are less dependent on the spoken text.


References
1. L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.
2. R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, Wiley-Interscience, New York, NY, 2001.
3. HMM Toolbox, available at http://www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html, 2006.
4. Mel Frequency Cepstral Coefficient module in VOICEBOX, available at http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html, 2006.
5. Signal Processing Tools, available at http://web.mit.edu/~sharat/www/resources.html, 2006.

