
Information Explication from Computer-Transcribed Conversations: The Weighted Markov Classifier

Nathan Eagle
MIT Media Lab, 20 Ames St., Cambridge, MA 02139
nathan@media.mit.edu

Abstract

In this research, I attempt to extract information from a computer-transcribed conversation. Leveraging the SDK of a sophisticated commercial speech recognition system, I have created a classifier that weights words based on the confidence factor output by the recognition engine, and maximizes the probability of the word frequency using both 0th and first order Markov models. I demonstrate the success of this algorithm when applied to location classification, even when used on different speakers. While location certainty cannot be achieved from this data alone, given enough labeled training sets, a type of 'situational awareness' can be developed. Applications for such awareness could potentially include topic spotting, conversation mediation, and virtual memory augmentation. Finally, I discuss other potential impacts of this technology within the realm of ubiquitous computing.

Background

The accuracy of commercial voice recognition systems has reached a point where almost half of all words spoken by a single individual can be correctly recognized during most conversations. Extracting information from gigabytes of transcribed conversation data is a job for an appropriate machine learning algorithm. However, data analysis on a sequence becomes more difficult when the sequence can be drawn from an unbounded set. In Problem Set 4 we looked at character frequency, taking data from 27 possible values. Words, however, are in principle unbounded because there is always a non-zero probability of finding a word never previously encountered.1 Despite this fact, this project is able to take advantage of limitations in the vocabulary both of the speaker and of the speech recognition engine to condense a working vocabulary into a set of approximately 5000 words.

Data Gathering

I collected data for one month, between November and December 2001. Both at home and at the lab, I installed a wireless network to support streaming audio directly to my computer. I wore a wireless microphone during most of the day and, after informing people of the purpose of the apparatus, began to record many of my daily conversations. I captured data in two ways. For the first fifteen days, the audio data was sent directly into a recognition engine and transcribed in real time. During the latter portion of the data gathering process, I captured my conversations as audio files for later processing. Saving the raw audio data enabled me to gather more information about the data, such as word confidence factors and time stamps. This richer set of data was later used as my test set. After each conversation I saved the audio file with a "label name" including place, people involved, and subject matter (e.g., lab.vish.vikram.mla.wav or home.liz.dinner.wav). During that month I recorded over 30 hours of my daily conversations and transcribed approximately 30,000 words.

Enabling Software, Hardware, and Additional Data

TCL Interface to the ViaVoice Recognition Engine

Leveraging the sophisticated algorithms behind a commercial speech recognition system, a TCL script was written2 to interface with the ViaVoice SDK. Recorded conversations in a 22 kHz, 16-bit .wav file format were placed in a select directory where they were input into the script. The script's output corresponds to the transcribed words from the ViaVoice recognition engine. Along with each word is a time stamp marking the beginning of the utterance and a recognition confidence factor ranging from -100 to 100. The three data types in each output data file (transcribed word, time stamp, and confidence factor) are then input into a Matlab program that parses the data into a 3-by-n cell array (a sketch of this parsing step appears at the end of this section).

Wireless Microphone Transmitters and Receivers

AudioTechnica wireless microphones and receivers were used for all of the data collection. The settings on each microphone were adjusted manually, and the AF gain was tuned for optimal sound quality. The two receivers were set up in my office and at home. They input raw sound into either the recognition engine or a sound editor that converted the file to the correct data type (22 kHz, 16-bit, mono) for later processing.

Third Person Audio Data

For additional training data, I used 10 hours of labeled audio data gathered in similar environments with the help of Brian Clarkson.
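The following Matlab sketch illustrates the parsing step described above. The file layout (one "word timestamp confidence" triple per line) and all names here are assumptions for illustration, not the actual project code.

    function data = parse_transcript(fname)
        % Hypothetical sketch: read the TCL script's output into the
        % 3-by-n cell array described above. File format is assumed.
        fid = fopen(fname, 'r');
        raw = textscan(fid, '%s %f %f');   % words, time stamps, confidences
        fclose(fid);
        n = numel(raw{1});
        data = cell(3, n);
        data(1, :) = raw{1}';              % row 1: transcribed words
        data(2, :) = num2cell(raw{2}');    % row 2: utterance time stamps
        data(3, :) = num2cell(raw{3}');    % row 3: confidence factors
    end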

1. Pereira, F.; Singer, Y.; Tishby, N. Beyond Word N-Grams.

2. TCL interface enabled by Geva Patz.

Discussion of Possible Classification Models

Prediction Suffix Trees (PSTs)

Modeling longer-range regularities is quite computationally intensive due to the size explosion caused by model order. Indeed, each bi-gram model used in this paper, built from a simple bounded 5000-word vocabulary, takes well over 150 MB of memory; an uncompressed tri-gram model would need almost 2 GB of RAM. These computational constraints prevent traditional classification methods from capturing even relatively local dependencies that exceed the model order. However, a prediction suffix tree (PST) has nodes that store suffixes of previous inputs and can provide a probability distribution over possible subsequent words in a computationally efficient manner. This ability makes PSTs a good model choice for many text classification problems. Unfortunately, even with advanced features such as 'wildcards' that allow a particular word position to be ignored in a prediction, PSTs are not the best candidate to model data output from today's voice recognition engines. The engine's output words have an accuracy of approximately 50%, and when the engine is incorrect, it often models one word as several, or vice versa. Longer streams of data are therefore unlikely to be repeated consistently by the audio engine. In addition, the grammar model discovered by the PST would likely be more highly correlated with the grammar of the speech recognition engine than with the grammar actually spoken. Hence the PST model is not appropriate for this application.

Linear Regression

Linear regression could be used on this type of data: using each word in the vocabulary as a "feature" and weighting each feature according to how instrumental it is in the classification would probably provide a satisfactory answer given an unlimited number of training sets. However, 5000 discrete features would have to be trained, which means an enormous number of training sets would be necessary to use this method for text classification.

0th and First Order Markov Models

Using word frequency to explicate location is an obvious choice for this classification problem. Intuitively, it is clear that there should be many words in our vocabularies that are extremely 'location-dependent'. These words are expected to have a frequency that correlates with social setting, and indirectly with location. The vocabulary that people use in different locations can be clustered into sets that, although they may have significant overlap, have features that remain relatively distinct. The zeroth order Markov model simply creates a normalized 5000-by-1 histogram of word frequencies for each of the training classes (home vs. lab), and finds the log-likelihoods of a test stream of data. The classification thus disregards all contextual information. The first order Markov model, in contrast, considers pairs of neighboring words and places them into a similar 5000-by-5000 histogram, as sketched below.
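As a concrete illustration, the following Matlab sketch builds the two normalized count histograms for one class from a training stream of word indices. The variable names and pseudocount value are assumptions, not the original code.

    % Minimal sketch, assuming `stream` is a vector of word indices in 1..V.
    V = 5000;                        % working vocabulary size
    pseudo = 1;                      % pseudocount, prevents log(0) later
    count1 = pseudo * ones(V, 1);    % 0th order: 5000-by-1 word histogram
    count2 = pseudo * ones(V, V);    % 1st order: 5000-by-5000 pair histogram
    for j = 1:length(stream) - 1
        count1(stream(j)) = count1(stream(j)) + 1;
        count2(stream(j), stream(j+1)) = count2(stream(j), stream(j+1)) + 1;
    end
    count1(stream(end)) = count1(stream(end)) + 1;  % count the last word too
    count1 = count1 / sum(count1);                  % normalize to probabilities
    count2 = count2 / sum(count2(:));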
The Classifiers

Four classifiers were chosen for this data: a 0th and a 1st order Markov model, and modified versions of the two models that use the word confidence factor outputs. Both of the unmodified Markov models are straightforward classifiers whose outputs depend on a log probability to determine the appropriate classification. Both methods also add pseudocounts to the count histograms in order to prevent log(0) problems. Finally, the counts are normalized such that each value corresponds to an actual probability of the word or sequence. Each word in a test set corresponds to one of the 5000 words in the model's working vocabulary. The test set of n words is then turned into a stream of n numbers using a 'key' that maps every individual word to a number from 1 to 5000.

Eq 1.) 0th Order Markov:

    LL = \sum_{j=1}^{n} \log( count1(stream(j)) )

Eq 2.) 1st Order Markov:

    LL = \sum_{j=1}^{n-1} \log( count2(stream(j), stream(j+1)) )
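A minimal Matlab sketch of the two unweighted classifiers, assuming per-class histograms built as above (count1_home, count1_lab, and the other names here are assumptions):

    % Sketch: unweighted 0th and 1st order log-likelihoods (Eqs. 1 and 2).
    ll0 = @(stream, count1) sum(log(count1(stream)));         % Eq. 1
    ll1 = @(stream, count2) sum(log(count2(sub2ind(size(count2), ...
              stream(1:end-1), stream(2:end)))));             % Eq. 2

    % Classify by comparing the per-class log-likelihoods:
    if ll0(stream, count1_home) > ll0(stream, count1_lab)
        label = 'home';
    else
        label = 'lab';
    end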
The other two classifiers incorporate additional information from the TCL script: the confidence factor of each word generated by the speech recognition engine. The confidence factors, rescaled to values between zero and one, are distributed in a roughly Gaussian manner. The pseudocounts are again added to the counts and the totals are normalized. However, the confidence factors are now placed within the probability summation. By multiplying the confidence factor with the initial count probabilities, the classifier becomes biased towards the words that have a greater probability of being correct. Thus, the classifier attempts to overcome the fact that every other word in the data set is incorrect by shifting the weights of the features. Changing the exponent q on the confidence factor conf(j) determines how heavily to weight the confidence data.

Eq 3.) 0th Order Weighted Markov:

    LL = \sum_{j=1}^{n} \log( count1(stream(j)) * conf(j)^q )

Eq 4.) 1st Order Weighted Markov:

    LL = \sum_{j=1}^{n-1} \log( count2(stream(j), stream(j+1)) * conf(j)^q * conf(j+1)^q )
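A sketch of the weighted variants under the same assumed names, where `conf` is assumed to hold the per-word confidence values rescaled to (0, 1], in the same orientation as `stream`:

    % Sketch: confidence-weighted log-likelihoods (Eqs. 3 and 4).
    q = 1;                                        % confidence weight exponent
    wll0 = sum(log(count1(stream) .* conf.^q));   % Eq. 3
    pairs = sub2ind(size(count2), stream(1:end-1), stream(2:end));
    wll1 = sum(log(count2(pairs) .* conf(1:end-1).^q .* conf(2:end).^q));  % Eq. 4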
Evaluation of Results

Metrics for Comparison

After training on approximately 20 hours of data, the classifiers were given 10 hours of test data. Two methods were used to assess the success of each classifier: certainty and speed. Certainty was defined in terms of the two log-likelihoods associated with the 'home' and 'lab' classes, where LL1 > LL2:

Eq 5.) cert1 = 1 - LL1 / LL2

Eq 6.) cert2 = LL1 - LL2

Speed was defined as the number of words required before the classifier converges on what will be its final answer.
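In Matlab, with assumed per-class log-likelihood values from the classifiers above, these metrics reduce to:

    % Sketch: certainty metrics of Eqs. 5 and 6 (names are assumptions).
    lls = sort([ll_home, ll_lab], 'descend');  % lls(1) = LL1 > lls(2) = LL2
    cert1 = 1 - lls(1) / lls(2);               % Eq. 5
    cert2 = lls(1) - lls(2);                   % Eq. 6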
Initial Over-fitting of the Confidence Factor Weights

The weight of the confidence factor (signified by q in Equations 3 and 4) was initially set high, on the assumption that the poor word accuracy of the recognition engine would dramatically degrade the underlying classification performance. However, it soon became clear that setting the value of q above one had the opposite effect on classification certainty. Although the optimal value varied between test sets, q was finally set to one for the following tests.

Certainty of the Four Classifiers

When the classifiers were applied to the transcribed conversations, it was interesting to note how insignificant the confidence weights were in the overall log probability. As the example in Figures 3 and 4 shows, there is almost no difference in the final log probabilities. When the 0th order approach was compared with the 1st order Markov model, it was surprising to see that the 0th order model output a higher certainty in most of the data sets.

Convergence Comparison

Given how little the confidence factors influenced the overall log-probability, it was expected that they would not influence the speed of convergence either. However, it is important to note that, again, the 0th order model converges faster than the first order Markov model in the majority of the test sets (see Figures 1 and 2). The convergence point can be measured as sketched below.

Speaker Independence

Initial data supports the theory that a common location elicits a common vocabulary from multiple people. Over 90% of the data from an individual uninvolved in generating the training data was classified correctly.
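One way the convergence point could be computed for the 0th order model, using running log-likelihoods (all names assumed for illustration):

    % Sketch: measure how many words the 0th order classifier needs
    % before its decision stops changing.
    run_home = cumsum(log(count1_home(stream)));
    run_lab  = cumsum(log(count1_lab(stream)));
    decision = run_home > run_lab;                 % decision after each word
    last_flip = find(decision ~= decision(end), 1, 'last');
    if isempty(last_flip)
        words_to_converge = 1;                     % never changed its mind
    else
        words_to_converge = last_flip + 1;         % settled after this word
    end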
Conclusions and Further Work

It has been demonstrated that higher order Markov models generally tend to perform worse than 0th order Markov models when trained with a limited set of computer-transcribed data. There are two possible explanations for this result, related to training data size and to speech engine characteristics. For a first order Markov model, even a vocabulary limited to the 500 most common words would still require over 25,000 training words before any 'learning' could begin.3 The current training data comprised a total of 30,000 words, which may not have been enough to model the 5000-by-5000 matrix adequately. The other explanation relates to subtleties of the speech recognition output. When ViaVoice misrecognizes the same word twice, it seldom substitutes the same 'wrong' word both times. Since there is so much discrepancy (half of all words are incorrect), any contextual assumption that relates one word to the next may be invalid.

Although much more data collection needs to be done, this initial analysis supports the theory that vocabulary is indeed 'location-centric' and that some locations elicit similar vocabulary across multiple individuals. However, to substantiate this theory further it will be necessary to increase the number of classes (more locations) and gather conversation data from many individuals. While meaning cannot be generated from this data alone, given enough labeled training sets, a type of 'situational awareness' can be developed.4 Applications for such awareness could potentially include topic spotting, conversation mediation, and virtual memory augmentation. With our future soon to be inundated with similar data sets, applications that employ the classification techniques discussed in this paper will play a large role in the lives of many.

3. Discussion with Tommi Jaakkola (12/5/01).
4. Jebara, T.; Ivanov, Y.; Rahimi, A.; Pentland, A. 1999. Tracking Conversational Context.

Appendix I. - Data Gathered

Training Data:

Home:
  liz.dinner.8pm
  liz.afterdinner.930pm
  adlar.bora.dinner
  liz.gamenight
  family.phone
  home.alldaysat
  liz.dinner.5pm
Total Training Home Word Count: 13,000 words

Lab:
  geva.school
  marco.computers
  mla.meeting
  pentlandians
  rao.vik.vish.mla
  vikram.phc
  steve.markov
Total Training Lab Word Count: 22,000 words

Test Data:

Home:
  liz.dinner - CORRECT
  liz.cleaning - CORRECT
  liz.phonecalls - INCORRECT (both 0th and 1st order)
  homeallday - CORRECT

Lab:
  dipesh.phone - CORRECT
  dipesh2.phone - CORRECT
  dipesh.marco.india - CORRECT
  niloy.rec7 - CORRECT
  niloy2.rec7 - CORRECT
  tanzeem.grad - CORRECT
  vik.vish.mla - CORRECT
  tar.steve.modems - CORRECT 0th order, INCORRECT 1st order

Appendix II. - Figures

Fig. 1. - Typical data set (niloy.rec7.dat) classified with a 0th order Markov model

Fig. 2. - Typical data set (niloy.rec7.dat) classified with a 1st order Markov model

Fig. 3. - Certainty of the 0th and 1st order Markov Models

Fig. 4. - Certainty of the two Classifiers modified with Confidence Values
