Speaker Identification System Using HMM and Mel Frequency Cepstral Coefficient
Project Client
Dr. Yingen Xiong
Project Team
Seonho Kim (shk@vt.edu) Seungwon Yang (seungwon@vt.edu)
TABLE OF CONTENTS
1. Problem Definition
2. Memex Project
3. Data Set
4. Our Approaches
   Step 1: Data Collection
   Step 2: Remove Silence
   Step 3: Emphasize Signal
   Step 4: Apply MFCC
   Step 5: Shift Signal
   Step 6: Train HMM (four different versions)
   Step 7: Evaluation
5. Four Different Training Methods
   a. Training 12 Values of MFCC
   b. Training MFCC Mean Values
   c. Training MFCC Selected One Feature
   d. Training 4 Utterances of a Word
6. Evaluations
7. HMM Parameters
8. Future Work and Conclusions
References
TABLE OF FIGURES
Figure 1: Five Utterances of Virginia Tech
Figure 2: Silence Removal
Figure 3: Emphasize Signal
Figure 4: MFCC Transformation
Figure 5: Shift Signal
Figure 6: MFCC Result Training Data
Figure 7: MFCC Mean Training Data
Figure 8: Selected Feature Training Data
Figure 9: Four Utterances Training Data
1. Problem Definition
Let's say that we have years of audio data recorded every day using a portable recording device. From this huge amount of data, we want to find all the audio clips of discussions with a specific person. How can we find them? As another example, suppose a group of people are having a discussion in a video conferencing room. Can we make the camera automatically focus on a specific person (for example, the group leader) whenever he or she speaks, even while the other people are also talking? A speaker identification system, which allows us to find a person based on his or her voice, can provide solutions to these questions.
2. Memex Project
This project, the development of a speaker identification system, is related to the Memex project (http://www.memex.cs.vt.edu/index.php?option=com_content&task=view&id=54&Itemid=71). In the Memex project, researchers will collect huge amounts of audio, image, and GPS data in everyday life as digital memories, using a device called SenseCam together with supporting software. Once the data is collected, it becomes important to be able to retrieve what we want to find. We would like to apply the speaker identification system to this purpose, so that we can retrieve conversations with a specific person from this huge amount of data.
3. Data Set
We recruited five people, two male and three female, and recorded their voices in a quiet place using audio recording/editing software, Cool Edit. We asked each of them to say Virginia Tech five times. Then, using the software, we cut out each utterance of Virginia Tech and saved it as a separate audio file. Four of the files were used for training and one for evaluation.
4. Our Approaches

Step 1: Data Collection
We collected audio data from five people, two male and three female, asking each for a total of five utterances of Virginia Tech. Four of the utterances were used for HMM training, and one was used for evaluation. Figure 1 shows the audio signals of the recorded voices.
Step 6: Train HMM (four different versions)

We implemented four different versions to see which one performed better. The first version, iterative speaking, accepts the mean values of the 12 MFCC features from four utterances of Virginia Tech as training data, while the other three versions use only one utterance for training. The second version accepts a vector of 12 MFCC feature values as its input. For the third version, we selected one of the 12 MFCC features as input. Finally, the MFCC mean version accepts the mean value of the 12 MFCC features.
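The three single-utterance feature representations can be sketched as follows. This is an illustrative Python/numpy sketch, not the project's original MATLAB code; the random matrix stands in for the 12-coefficient MFCC output, and the choice of the first coefficient for the "selected feature" version is an assumption.

```python
import numpy as np

# Hypothetical MFCC matrix for one utterance of "Virginia Tech":
# one row per analysis frame, 12 cepstral coefficients per frame.
rng = np.random.default_rng(0)
mfcc = rng.standard_normal((80, 12))  # 80 frames x 12 coefficients, stand-in data

# "12 features" version: the full 12-value vector for each frame.
features_12 = mfcc                 # shape (80, 12)

# "MFCC mean" version: collapse the 12 coefficients to one value per frame.
features_mean = mfcc.mean(axis=1)  # shape (80,)

# "Selected feature" version: keep a single coefficient per frame
# (which coefficient was chosen is not stated; the first is assumed here).
features_one = mfcc[:, 0]          # shape (80,)

print(features_12.shape, features_mean.shape, features_one.shape)
```

The mean and selected-feature versions both yield a one-dimensional sequence, which matters for training, since the HMM implementation used in the project accepts only one-dimensional input.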
Step 7: Evaluation
Using the test data, the one utterance of Virginia Tech set aside in the data collection step, we computed the log-likelihood under each speaker's HMM. The HMM with the largest log-likelihood identifies the speaker.
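The decision rule above amounts to an argmax over per-speaker log-likelihoods. A minimal sketch, with hypothetical speaker names and illustrative log-likelihood values standing in for the scores each trained HMM would return:

```python
# Hypothetical per-speaker log-likelihoods of the held-out test utterance,
# as would be returned by each speaker's trained HMM.
log_likelihoods = {
    "speaker_1": -1523.4,
    "speaker_2": -1498.7,
    "speaker_3": -1611.0,
    "speaker_4": -1550.2,
    "speaker_5": -1575.9,
}

# The identified speaker is the one whose HMM assigns the largest
# log-likelihood to the test utterance.
identified = max(log_likelihoods, key=log_likelihoods.get)
print(identified)  # -> speaker_2
```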
5. Four Different Training Methods

We implemented four different versions of the training methods because the HMM implementation we used accepts only one-dimensional data as input for training, while our MFCC data is 12-dimensional.
a. Training 12 Values of MFCC

Figure 6 shows the training data for this method. Each point consists of 12 values, drawn in different colors. The HMM trains until it reaches the end of the data.
b. Training MFCC Mean Values

Figure 7 shows the one-dimensional MFCC mean training data and the converted actual input data fed to the HMM.
c. Training MFCC Selected One Feature

Figure 8 shows the one-dimensional selected-feature training data and the converted actual input data fed to the HMM.
d. Training 4 Utterances of a Word

Figure 9 shows the four one-dimensional MFCC mean training sequences. Each line, in a different color, represents a different utterance.
6. Evaluations
For the evaluations, we used the log-likelihood to estimate the probability that the test voice data was generated by each person's HMM. We selected the HMM with the greatest log-likelihood and identified the speaker to whom that HMM belonged. Each evaluation task picked one person out of five, so the precision of random selection would be 20%. Our four training methods showed accuracy rates of about 60% to 70%. However, because of limited time, data, and resources, we could not conduct a more thorough system evaluation; we leave that for further study.
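The accuracy figure is simply the fraction of test utterances whose predicted speaker matches the true speaker. A small sketch with hypothetical speaker labels and illustrative predictions (the actual per-trial results are not reported in this document):

```python
# Hypothetical evaluation: one held-out utterance per speaker, five speakers.
# A random guess picks 1 of 5 speakers, i.e. 20% expected accuracy.
true_speakers = ["s1", "s2", "s3", "s4", "s5"]
predicted     = ["s1", "s2", "s3", "s2", "s1"]  # illustrative predictions only

accuracy = sum(t == p for t, p in zip(true_speakers, predicted)) / len(true_speakers)
print(accuracy)  # -> 0.6, in the range the project reports
```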
7. HMM Parameters
The number of visible states used for this project was determined from the training data, the utterance of each person. The utterance waveform fluctuated between -1 and 1. Because the HMM implementation we used accepts only positive integers, we discretized the values by shifting and rescaling them to integers between 1 and 100, so 100 was used as the number of visible states. In our tests, the number of hidden states did not matter much as long as it was neither too small nor too large, but we found that it affects the speed of training, so we fixed the number of hidden states at 15.
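The shift-and-rescale discretization described above can be sketched as follows. This is an illustrative Python version, not the project's original code; the exact rounding scheme used in the project is not stated, so flooring is an assumption.

```python
import numpy as np

def discretize(signal, n_symbols=100):
    """Shift and rescale a waveform in [-1, 1] to integer symbols 1..n_symbols."""
    shifted = (np.asarray(signal) + 1.0) / 2.0                     # now in [0, 1]
    symbols = np.floor(shifted * (n_symbols - 1)).astype(int) + 1  # 1..n_symbols
    return symbols

x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
print(discretize(x))  # -> [  1  25  50  75 100]
```

The resulting integer sequence is what serves as the observation sequence for the discrete HMM, with 100 as the size of the visible-symbol alphabet.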
8. Future Work and Conclusions
In this project we implemented a speaker identification system based on an HMM and Mel Frequency Cepstral Coefficients (MFCC). In the training module, four different versions were built to try different methods, and the four training methods showed similar accuracy. For further study, we will compare and evaluate the four training methods more precisely with larger data sets and more participants. We also plan to extend this system to text-independent speaker identification so that it can be used for multimedia data retrieval in the Memex project; we may need more text-independent features to accomplish this.
References
1. L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989.
2. R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, Wiley-Interscience, New York, NY, 2001.
3. HMM Toolbox, available at http://www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html, 2006.
4. Mel Frequency Cepstral Coefficient module in VOICEBOX, available at http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html, 2006.
5. Signal Processing Tools, available at http://web.mit.edu/~sharat/www/resources.html, 2006.