Abstract
This paper reports recent work on Automatic Speech Recognition (ASR) systems in the fields of information retrieval and audio
segmentation in multimedia files. Two approaches are presented: one relies on keyword selection and topic detection,
and the other on audio segmentation.
A natural technique for determining long-term characteristics of speech, such as language, topic or dialect, is to first convert the
input speech into a sequence of tokens (e.g. words, phones, etc.). From these tokens, we can then look for distinctive phrases
or keywords that characterize the speech. In many applications, such as supervised topic detection, a set of distinctive keywords
may not be known a priori. In this case, a method of keyword selection is desirable. In this paper, TF-IDF is used to determine
which words in a corpus of documents are most useful for discriminating among different topics.
Experiments have been carried out using the MAVIR database. Regarding audio segmentation, although the proposed method can easily
be adapted to other speech/non-speech discrimination applications, the present work focuses only on speech vs. music+speech
segmentation. Different experiments, including different feature extraction methods and different HMM topologies, illustrate the
robustness of the approach, always resulting in a correct segmentation performance higher than 90%.
Index Terms
Speech segmentation, keyword selection, topic detection.
I. INTRODUCTION
The work presented in this paper has been carried out in the context of the author's thesis development. As part of the
work related to speech recognition, there are some parallel tasks that may improve word recognition accuracy
and also provide contextual information about the topic or the context in which the speech takes place.
The rapid growth of multimedia content on the web and in digital resources has raised interest in developing techniques to
automatically classify this content. In particular, audio files and the audio tracks of videos contain relevant information about
the nature and content of the multimedia file. In this context, audio classification plays an important role as a preprocessing step
in a variety of more complex systems related to information retrieval in music or to multimedia content extraction, annotation
and indexing. Standard speech recognizers attempting to perform recognition on all input frames will naturally produce high
error rates with such a mixed input signal. Therefore, one major challenge for Automatic Speech Recognition (ASR) systems
dealing with multimedia content lies in how to separate speech signals from other audio signals (e.g., background noise
or music) in order to avoid recognizing mixed input signals.
More generally, audio segmentation could allow the use of ASR acoustic models trained on particular acoustic conditions,
such as wide bandwidth (high quality microphone input) versus telephone narrow bandwidth, male speaker versus female
speaker, etc., thus improving overall performance of the resulting system. Finally, this segmentation could also be designed to
provide additional interesting information, such as the division into speaker turns and the speaker identities (allowing, e.g.,
automatic indexing and retrieval of all occurrences of the same speaker), as well as syntactical information (such as ends of
sentences, punctuation marks, etc.).
The other task explored in this paper is keyword selection for supervised topic detection on automatic transcriptions produced
by an ASR system. Having representative keywords for each labeled topic can help cluster recognized speech more accurately.
In fact, the topic detection task can be used for indexing audio content from broadcast news, conferences, lectures, etc.
This paper is arranged as follows: in section III we introduce the keyword selection and topic detection tasks and their computation
for the present setup. In section IV we present the audio segmentation task. Finally, in section VIII some conclusions are
discussed.
[Fig. 1: System architectures. Proposed: full audio → audio segmentation → segmented audio → automatic speech recognition → keyword selection and topic detection. Current architecture: full audio → automatic speech recognition and audio segmentation → topic detection.]
A. Keyword selection
We treat keyword selection as a feature selection problem. In this work the TF-IDF weight is used for
selecting words as keywords from the training database. Though TF-IDF is a relatively old weighting scheme, it is simple and
effective and can be a starting point for future algorithms [3], [4]. Figure 2 shows the scheme used for keyword selection.
First, all the documents labeled as topic n are gathered together. Then TF-IDF is applied to each of the resulting sets. Different
selection criteria were used to select the final set of keywords.
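As an illustration of this per-topic scheme, the selection could be sketched as follows; pooling each topic's documents into one set and keeping the top-k weighted words are assumptions for the example, since the final selection criteria are not fixed above.

```python
import math
from collections import Counter

def tfidf_keywords(docs_by_topic, top_k=2):
    """Select top_k keywords per topic by TF-IDF.
    docs_by_topic maps topic -> list of token lists; all documents of a
    topic are pooled together, as in the scheme of figure 2."""
    pooled = {t: [w for doc in docs for w in doc]
              for t, docs in docs_by_topic.items()}
    n_topics = len(pooled)
    # Document frequency: number of topic sets in which each word appears
    df = Counter()
    for words in pooled.values():
        df.update(set(words))
    keywords = {}
    for topic, words in pooled.items():
        tf = Counter(words)
        total = len(words)
        scores = {w: (tf[w] / total) * math.log(n_topics / df[w]) for w in tf}
        ranked = sorted(scores, key=scores.get, reverse=True)
        keywords[topic] = ranked[:top_k]
    return keywords
```

Words occurring in every topic get an IDF of log(1) = 0 and therefore never rank as keywords, which is the discriminative behavior the selection relies on.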
[Fig. 2: Keyword selection scheme: the text of topic n is fed to the keyword selection module, which applies the selection criteria and outputs the selected keywords.]
C. Normalization
Once the keywords have been selected, the next step is to calculate the histogram of keywords per topic. This is done by
calculating the relative frequency of each keyword over the whole text labeled with each topic.
Since each topic is represented by its own unique vector, the same range of values cannot be expected to be optimal
across all topics unless these values are normalized.
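A minimal sketch of this step (the matrix layout and function name are our own): the raw counts are turned into per-topic relative frequencies, and the per-keyword deviation across topics is kept for the normalized distance used later.

```python
import numpy as np

def topic_histograms(keyword_counts):
    """Build the keyword-by-topic histogram matrix K.
    keyword_counts: 2-D array, rows = keywords, columns = topics,
    entry = raw count of keyword i in the text labeled with topic j."""
    counts = np.asarray(keyword_counts, dtype=float)
    # Relative frequency of each keyword within its topic's text
    K = counts / counts.sum(axis=0, keepdims=True)
    # Per-keyword spread across topics, used to normalize the
    # contribution of each keyword when comparing histograms
    sigma = K.std(axis=1)
    return K, sigma
```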
D. Topic Detection
As shown in figure 3 the topic detection problem can be solved as a classification problem [5].
[Fig. 3: Topic detection scheme: automatic speech recognition → keyword histogram extraction (keyword vector k_i, using the selected keywords) → classifier (against the topic histogram matrix K_ij, i keywords × j topics) → detected topic.]
After the speech is recognized, the keyword extraction module calculates the number of times each keyword appears in the recognized
text, resulting in the vector k = [k_1, k_2, ..., k_n], where n is the number of keywords. Then, the classifier calculates the distance
between the histogram vector k and every column vector in the histogram matrix K, as follows:
d_j = Σ_{i=1}^{n} (K_ij − k_i)² / σ_i²        (4)
where d_j is the distance from the histogram vector k to the j-th column vector of matrix K, with j ranging over the topics.
The detected topic is the one with the minimum distance value.
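The classifier of eq. (4) amounts to a nearest-column search under a variance-normalized squared distance; a sketch (function name ours):

```python
import numpy as np

def detect_topic(k, K, sigma):
    """Eq. (4): variance-normalized squared distance from the recognized
    keyword histogram k to each topic column of K; the detected topic is
    the column with minimum distance."""
    k = np.asarray(k, dtype=float)
    K = np.asarray(K, dtype=float)
    var = np.asarray(sigma, dtype=float) ** 2
    # Broadcast k against every column j and sum over keywords i
    d = (((K - k[:, None]) ** 2) / var[:, None]).sum(axis=0)
    return int(np.argmin(d))
```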
[Fig. 4: Audio segmentation splits the input audio into speech (passed on to the ASR) and non-speech (music, noise, others).]
One of the issues in the design of a signal classifier is the selection of an appropriate feature set that captures the temporal
and spectral structure of the signals. Many such features for speech/music discrimination have been suggested in the literature.
Previous work on audio segmentation has focused either on feature extraction analysis or on system architecture. Typically,
authors have combined MFCCs with spectral features such as modulation energy and the percentage of low-energy frames as
in [6], [11], [12], or have used histogram equalization-based features as in [7].
A. Baseline
Previous work at UPM-GTH 1 has focused on segmenting audio documents into a few acoustic classes (ACs), such
as Clean Speech, Music, Speech with noise in the background and Speech with music in the background [8]. In this paper, only two
classes have been considered, due to the audio content included in the MAVIR database; there are only two classes of audio events:
• Music [mu]. Music is understood in a general sense.
• Speech with background music [sm]. Overlapping of the speech and music classes.
For feature extraction, we have considered long-term statistics of MFCCs (Mel Frequency Cepstral Coefficients), spectral
entropy [9] and CHROMA coefficients. CHROMA features are a powerful representation for music audio in which the
entire spectrum is projected onto 12 bins representing the 12 distinct semitones (or chroma) of the musical octave [10].
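A minimal sketch of that 12-bin projection, assuming an FFT magnitude spectrum and a 440 Hz reference (real CHROMA extractors such as openSMILE [10] include further refinements like tuning estimation and bin weighting):

```python
import numpy as np

def chroma_from_spectrum(magnitudes, sample_rate, n_fft):
    """Fold an FFT magnitude spectrum onto the 12 chroma bins.
    Each FFT bin's frequency is mapped to its semitone class
    (modulo 12) relative to the reference A = 440 Hz."""
    # Frequencies of bins 1..N-1 (the DC bin has no pitch class)
    freqs = np.arange(1, len(magnitudes)) * sample_rate / n_fft
    semitones = np.round(12 * np.log2(freqs / 440.0)).astype(int) % 12
    chroma = np.zeros(12)
    np.add.at(chroma, semitones, magnitudes[1:])
    # Normalize so the 12 bins form a distribution
    return chroma / max(chroma.sum(), 1e-12)
```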
The baseline is a one-step system based on HMMs, as presented in figure 5. In particular, a 3-state HMM has been
considered for each acoustic class, initially with 16 Gaussians per state [14]. The number of states has been adjusted
following the topology proposed in [8].
[Fig. 5: One-step HMM audio segmentation into music and speech classes using MFCC and CHROMA features.]
B. Evaluation metrics
The evaluation metric used for audio segmentation is the same as the one used for speaker diarization experiments and
described and used by NIST in the RT Evaluations. This Diarization Error Rate (DER) is defined as the sum of the per-speaker
false alarm time (falsely identifying speech), missed speech time (failing to identify speech), and speaker error time (incorrectly
identifying speech over music), divided by the total amount of speech time in a test audio file. That is,
1 Grupo de Tecnología del Habla - Universidad Politécnica de Madrid
DER = (T_FA + T_MISS + T_SPKR) / T_SPEECH        (5)
In our case, the total amount of speech time is the same as the total amount of scored time of the audio file. To measure it, the
script MD-eval-v12.pl developed by NIST was used 2.
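As a rough frame-level sketch of eq. (5) for this two-class case (labels and function name are ours; the NIST MD-eval script additionally handles segment timing and scoring collars):

```python
def frame_der(ref, hyp):
    """Frame-level approximation of the DER of eq. (5) for a two-class
    speech ('sp') vs. music ('mu') task: missed-speech plus false-alarm
    frames over the total number of scored frames."""
    t_miss = sum(r == "sp" and h != "sp" for r, h in zip(ref, hyp))
    t_fa = sum(r != "sp" and h == "sp" for r, h in zip(ref, hyp))
    return (t_miss + t_fa) / len(ref)
```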
C. Feature analysis
During the experiments we evaluated a large set of features used in speech and speaker recognition. The best
features for this task were:
• MFCC15_E_D_A (mean+var): 15 MFCCs and local energy computed in 25 ms windows (with an overlap of 15 ms),
plus their deltas and double deltas. The statistics are mean and variance computed over 1-second windows with 0.5 s overlap.
• MFCC15_E_D_A (mean+std): similar to the previous one, but with mean and standard deviation as statistics.
• MFCC15_E_D_A (mean+std+skew): similar to the previous one, but with mean, standard deviation
and skewness as statistics.
• MFCC15_E_D_A (mean+std+skew+kurt): adding kurtosis as a new statistic.
• MFCC15_E_D_A (mean+std+kurt): same as the previous one, removing skewness.
• MFCC15CHR_E_D_A (mean+std): 15 MFCCs, local energy computed in 25 ms windows (with an overlap of 15 ms),
their deltas and double deltas, and 12 CHROMA coefficients computed every 50 ms. Statistics are mean and standard
deviation over 1-second windows with 0.5 s overlap.
• MFCC15CHR+SpectralFeatures_E_D_A (mean+std): same as the previous one, adding the statistics (mean and standard
deviation) of several spectral features computed on 50 ms frames (flux, centroid, entropy and band energies).
• MFCC15CHR+Entropy_E_D_A (mean+std): same as the previous one, adding only the mean and standard deviation of
the spectral entropy.
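The long-term statistics above can be sketched as a sliding-window computation over the frame-level features; the window sizes below are assumptions for illustration (10 ms frame rate, so win=100 frames gives 1-second windows and hop=50 gives 0.5 s overlap).

```python
import numpy as np

def long_term_stats(frames, win=100, hop=50, stats=("mean", "std")):
    """Long-term statistics of frame-level features over sliding windows.
    frames: 2-D array (n_frames, n_features); returns one concatenated
    statistics vector per window."""
    fns = {
        "mean": lambda x: x.mean(axis=0),
        "std": lambda x: x.std(axis=0),
        # Standardized 3rd/4th moments; the floor guards constant segments
        "skew": lambda x: ((x - x.mean(0)) ** 3).mean(0)
                          / np.maximum(x.std(0) ** 3, 1e-12),
        "kurt": lambda x: ((x - x.mean(0)) ** 4).mean(0)
                          / np.maximum(x.std(0) ** 4, 1e-12) - 3,
    }
    out = [np.concatenate([fns[s](frames[i:i + win]) for s in stats])
           for i in range(0, len(frames) - win + 1, hop)]
    return np.array(out)
```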
This initial analysis was performed with 16 Gaussians per state. After this analysis we decided to increase the number
of Gaussians per state; for this reason, the experiment was repeated with 32, 64 and 128 Gaussians.
V. DATABASE DESCRIPTION
Experiments have been carried out using the MAVIR database.
Tests on audio segmentation were performed on a tourism video corpus provided by the MAVIR project 3. This corpus consists of a
collection of tourism video clips from the TURESPAÑA Corporate Website (Spanish Tourism Institute). It includes 39 videos
in Spanish and 23 videos in English, and was annotated by the Linguistics Department of the Universidad Autónoma de Madrid
under the MAVIR project. The Spanish partition of the database includes around 2 hours of video. The audio signals are provided
in PCM format (mono, 16 bits, 8 kHz sampling frequency).
TABLE II
RESULTS FOR DIFFERENT FEATURES USING 64 GAUSSIANS PER STATE
2 Available at http://www.itl.nist.gov/
3 MAVIR is a research network co-funded by the Regional Government of Madrid and the European Social Fund under MA2VICMR (2010-2013). More
info at http://www.mavir.net
TABLE III
RESULTS FOR DIFFERENT FEATURES USING 128 GAUSSIANS PER STATE
TABLE IV
RESULTS FOR DIFFERENT EXPERIMENTS ON THE MAVIR DATABASE
VIII. CONCLUSIONS
For the topic detection task, we can conclude that:
• Normalization over each topic is needed due to the different ranges of values TF-IDF can yield for the same keyword
in different topics.
• TF-IDF is a simple and efficient algorithm for selecting representative words in order to identify and detect different
topics within a whole collection of documents.
• Despite its strengths, TF-IDF has some limitations. The algorithm does not take into account relationships between
words (e.g. synonyms or plural forms). In this experiment TF-IDF could not recognize words such as “parque” and
“parques” as the same word, counting each separately instead of treating them as one. This can decrease the
weight given to that word in the keyword set. For large document collections, this could become an escalating problem.
And for the audio segmentation task we also conclude that:
• Including CHROMA coefficients significantly reduces the error for all ACs, from 12.51% to 7.96%.
• In all cases, increasing the number of Gaussians improves the results. The best results are obtained when using
MFCC+CHROMA+Entropy features.
• In summary, for the best configuration of the one-step system, we obtained a 7.96% error (using the NIST tool).
REFERENCES
[1] Fiscus, J. and Doddington, G. “Topic Detection and Tracking Evaluation Overview”. Chapter in Topic Detection and Tracking Event-based Information
Organization. National Institute of Standards and Technology, USA, 2002.
[2] Schultz, J.M. and Liberman, M. “Topic Detection and Tracking using idf-Weighted Cosine Coefficient”. Proc. of The DARPA Broadcast News Workshop
1999. pp. 189-192. USA, 1999.
[3] Wintrode, J. and Kulp, S. “Confidence-based techniques for rapid and robust topic identification of conversational telephone speech”. Proc. of Interspeech
2009. England, 2009.
[4] Ramos, J. “Using TF-IDF to determine word relevance in document queries”. Department of Computer Science. Rutgers University, USA, 2003.
[5] Mahajan, M. and Beeferman, D. and Huang, X.D. “Improved topic-dependent language modeling using information retrieval techniques”. Proc. IEEE
ICASSP 1999, vol. 1, pp. 541-544, 1999.
[6] Izumitani, T. and Mukai, R. and Kashino, K. “A background music detection method based on robust feature extraction”. Proc. IEEE ICASSP 2008, pp.
13-16, 2008.
[7] Gallardo, A. and Montero, J.M. “Histogram Equalization-Based Features for Speech, Music, and Song Discrimination”. IEEE Signal Process. Lett., vol.
17, no. 7, July, 2010.
[8] Gallardo, A. and San-Segundo, R. “UPM-UC3M system for music and speech segmentation”. Proc. of the Jornadas de Tecnología del Habla FALA
2010. Spain, November, 2010.
[9] Misra, H. and Ikbal, S. and Bourlard, H. and Hermansky, H. “Spectral entropy based feature for robust ASR”. Proc. IEEE ICASSP 2004, pp. 193-198,
2004.
[10] Eyben, F. and Wöllmer, M. and Schuller, B. “openSMILE - The Munich Versatile and Fast Open-Source Audio Feature Extractor”. Proc. ACM Multimedia
(MM), ACM, Firenze, Italy, 2010.
[11] Huijbregts, M. and de-Jong, F. “Robust Speech/Non-Speech Classification in Heterogeneous Multimedia Content”. Speech Communication, vol. 53, no.
2, pp. 143-153, 2011.
[12] Tsai, W.H. and Lin, H.P. “Background Music Removal Based on Cepstrum Transformation for Popular Singer Identification”. IEEE Trans. Audio, Speech
and Language Processing, vol. 19, no. 5, pp. 1196-1205, 2011.
[13] Moreno, J. et al. “Some experiments in evaluating ASR systems applied to multimedia retrieval”. Proc. of 7th Intl. Conf. on Adaptive Multimedia
Retrieval, pp. 12-23, Spain, 2009.
[14] Ajmera, J. and McCowan, I. and Bourlard, H. “Speech/music segmentation using entropy and dynamism features in a HMM classification framework”.
Speech Communication, vol. 40, pp. 351-363, 2003.
[15] Lane, I. and Kawahara, T. and Matsui, T. “Language model switching based on topic detection for dialog speech recognition”. Proc. IEEE ICASSP 2003,
pp. 616-619, 2003.