
Audio-speech segmentation and topic detection for a speech-based information retrieval system
Julian David Echeverry Correa
Speech Technology Group
Universidad Politécnica de Madrid
jdec@die.upm.es

Abstract
This paper reports recent work on Automatic Speech Recognition (ASR) systems in the fields of information retrieval and audio segmentation of multimedia files. Two approaches are presented: one relies on keyword selection and topic detection, and the other on audio segmentation.
A natural technique for determining long-term characteristics of speech, such as language, topic or dialect, is to first convert the input speech into a sequence of tokens (e.g. words, phones, etc.). From these tokens, we can then look for distinctive phrases or keywords that characterize the speech. In many applications, such as supervised topic detection, a set of distinctive keywords may not be known a priori. In this case, a method of keyword selection is desirable. In this paper, TF-IDF is used to determine which words in a corpus of documents are best suited to discriminate text among different topics.
Experiments have been carried out using the MAVIR database. Regarding audio segmentation, although the proposed method can easily be adapted to other speech/non-speech discrimination applications, the present work focuses only on speech/music+speech segmentation. Different experiments, including different feature extraction methods and different HMM topologies, illustrate the robustness of the approach, always resulting in a correct segmentation performance higher than 90%.

Index Terms
Speech segmentation, keyword selection, topic detection.

I. INTRODUCTION
The work presented in this paper has been carried out in the context of the author's thesis development. As part of the work related to speech recognition, there are some parallel tasks that may improve word recognition accuracy and also provide contextual information about the topic or the context in which the speech is involved.
The rapid growth of multimedia content on the web and in digital resources has raised interest in developing techniques to automatically classify these contents. In particular, audio files and audio tracks in videos contain relevant information about the nature and content of the multimedia file. In this context, audio classification plays an important role as a preprocessing step in a variety of more complex systems related to information retrieval in music or to multimedia content extraction, annotation and indexing. Standard speech recognizers attempting to perform recognition on all input frames will naturally produce high error rates with such a mixed input signal. Therefore, one major challenge for ASR systems dealing with multimedia content lies in how to separate speech signals from other audio signals (e.g., background noise or music) in order to avoid recognizing mixed input signals.
More generally, audio segmentation could allow the use of ASR acoustic models trained on particular acoustic conditions, such as wide bandwidth (high-quality microphone input) versus narrow telephone bandwidth, or male versus female speakers, thus improving the overall performance of the resulting system. Finally, this segmentation could also be designed to provide additional useful information, such as the division into speaker turns and the speaker identities (allowing, e.g., automatic indexing and retrieval of all occurrences of the same speaker), as well as syntactical information (such as sentence endings, punctuation marks, etc.).
The other task we explore in this paper is keyword selection for supervised topic detection in automatic transcriptions produced by an ASR. Having representative keywords for each labeled topic can help to cluster recognized speech more accurately. In fact, the topic detection task can be used for indexing audio content from broadcast news, conferences, lectures, etc.
This paper is arranged as follows: section II gives a general overview of the system; section III introduces the keyword selection and topic detection tasks and their computation for the present setup; section IV presents the audio segmentation task; sections V to VII describe the database and the experimental results; finally, section VIII discusses some conclusions.

II. GENERAL OVERVIEW


An overview of the proposed system architecture is shown in figure 1. In the current system, speech is segmented automatically off-line using the time marks provided in the database annotations. In a parallel task, the system receives the full audio content and segments speech frames from music, music+speech or background noise.


In the system under development, the audio is expected to be segmented as a step prior to the speech recognition process, so that only speech frames are recognized. This stage is very important for the rest of the process, since we are not interested in wasting time trying to recognize audio segments that do not contain “useful” speech.
Off-line processing, prior to detection itself, is required for selecting the keywords needed in the topic detection task.

Fig. 1. General overview of the proposed system architecture. Current architecture: segmented audio → automatic speech recognition → topic detection, with a keyword selection module feeding the topic detector. System under development: full audio → audio segmentation → automatic speech recognition → topic detection.

III. TOPIC DETECTION


One of the goals in topic detection is to develop automatic methods for identifying topically related stories within a stream of audio (e.g. news media, conferences, lectures, etc.) or recognized text. According to [1], a topic is a “seminal event or activity, along with directly related events and activities”. A story is considered “on topic” when it discusses events and activities that are directly connected to that topic’s seminal event. For topic identification, keywords are utilized. This is done not only on the basis of each word’s relevance in the current document, but also extended to the rest of the documents in the corpus. To measure word relevance, Term Frequency - Inverse Document Frequency (TF-IDF) is used [2].

A. Keyword selection
We approach keyword selection as a feature selection problem. In this work, the TF-IDF weight is used for selecting words as keywords from the training database. Though TF-IDF is a relatively old weighting scheme, it is simple and effective, and it can be a starting point for future algorithms [3], [4]. Figure 2 shows the scheme used for keyword selection. First, all the documents labeled as topic n are gathered together. Then TF-IDF is applied to each of the resulting sets. Different selection criteria were used to select the final set of keywords.

Fig. 2. Keyword selection procedure: texts in the training corpus (topic 1 to topic n) are grouped by topic, TF-IDF (Term Frequency - Inverse Document Frequency) is applied to each group, and selection criteria produce the final set of selected keywords.



B. Mathematical framework for TF-IDF


Essentially, TF-IDF works by weighting the frequency of a word in a specific document against the inverse of the proportion of documents in the corpus that contain that word. This calculation determines how relevant a given word is to a particular document. Words that are frequent in a single document, or in a small group of documents, tend to have higher TF-IDF values than words that are common everywhere, such as articles and prepositions.
The implementation of TF-IDF works as follows. Given a corpus or document collection D, a word w_i and an individual document d_j ∈ D, we calculate the term frequency tf_{i,j} as

\[ \mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} \tag{1} \]

where n_{i,j} is the number of times the word w_i appears in d_j and the denominator is the sum of the occurrences of all terms in document d_j, that is, the size of the document |d_j|.
The inverse document frequency is a measure of the general importance of the term. It is calculated as follows:

\[ \mathrm{idf}_i = \log \frac{|D|}{1 + |\{j : w_i \in d_j\}|} \tag{2} \]

where |D| is the total number of documents in the corpus and |\{j : w_i \in d_j\}| is the number of documents in which the term w_i appears (the 1 added in the denominator prevents a division by zero for terms that appear in no document).
Finally, the TF-IDF weight is obtained as

\[ (\mathrm{tf\text{-}idf})_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i \tag{3} \]


A high tf-idf value indicates a high term frequency in the given document together with a low document frequency of the term in the whole corpus. The words with the highest values are the most relevant ones for a given document, whereas extremely common words, such as articles or prepositions, carry no relevant meaning for a topic. This discriminatory power can therefore be used for selecting the keywords that represent each topic.
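To make the procedure concrete, the following is a minimal Python sketch of equations (1)-(3) combined with the selection step of figure 2. It assumes that all documents labeled with a topic have already been gathered into a single token list (section III-A); the top-N ranking criterion and all names are illustrative assumptions, not necessarily the exact selection criteria used in the experiments.

```python
import math
from collections import Counter

def select_keywords(topic_texts, n_keywords=15):
    """topic_texts: {topic: list of word tokens}, one gathered
    "document" per topic. Returns {topic: selected keywords}."""
    topics = list(topic_texts)
    n_docs = len(topics)                      # |D| in eq. (2)
    # document frequency: number of topics whose text contains the word
    df = Counter()
    for tokens in topic_texts.values():
        df.update(set(tokens))
    keywords = {}
    for topic in topics:
        tokens = topic_texts[topic]
        counts = Counter(tokens)
        size = len(tokens)                    # |d_j| in eq. (1)
        weight = {w: (c / size) * math.log(n_docs / (1 + df[w]))  # eq. (3)
                  for w, c in counts.items()}
        # illustrative criterion: keep the n_keywords highest-weighted words
        keywords[topic] = sorted(weight, key=weight.get, reverse=True)[:n_keywords]
    return keywords
```

In this sketch, sweeping n_keywords over the range 10-22 would correspond to the keyword counts evaluated in table IV.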

C. Normalization
Once the keywords have been selected, the next step is to calculate the histogram of keywords per topic. This is done by calculating the relative frequency of each keyword over the whole text labeled with each topic.
Since each topic is represented by its own unique vector, the same range of values cannot be expected to be optimal across all topics unless these values are normalized; a sketch of this step is given below.
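Continuing the sketch above, the histograms and the per-keyword statistics used for normalization could be computed as follows; here `keywords` is the pooled list of all selected keywords, and using the variance of each keyword across topics as its normalizing factor is an assumption consistent with the σ_i² term of eq. (4) below.

```python
from collections import Counter

def topic_histograms(topic_texts, keywords):
    """Builds the topic histogram matrix of section III-D:
    K[i][j] is the relative frequency of keyword i over the whole
    text labeled with topic j. Also returns sigma2[i], the variance
    of keyword i across topics, used for normalization in eq. (4)."""
    topics = list(topic_texts)
    K = [[0.0] * len(topics) for _ in keywords]
    for j, topic in enumerate(topics):
        counts = Counter(topic_texts[topic])
        total = len(topic_texts[topic])
        for i, kw in enumerate(keywords):
            K[i][j] = counts[kw] / total
    # per-keyword variance across topics
    sigma2 = []
    for row in K:
        mean = sum(row) / len(row)
        sigma2.append(sum((x - mean) ** 2 for x in row) / len(row))
    return topics, K, sigma2
```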

D. Topic Detection
As shown in figure 3, the topic detection problem can be treated as a classification problem [5].

Fig. 3. Topic detection scheme: the automatic speech recognition output goes through keyword extraction to form the keyword histogram vector k_i (i keywords); the classifier compares it against the topic histogram matrix K_ij (i keywords, j topics) and outputs the detected topic.

After the speech is recognized, the keyword extraction module counts the number of times each keyword appears in the recognized text, resulting in the vector k = [k_1, k_2, ..., k_n], where n is the number of keywords. Then, the classifier calculates the distance between the histogram vector k and every column vector of the histogram matrix K, as follows:

\[ d_j = \sum_{i=1}^{n} \frac{(K_{ij} - k_i)^2}{\sigma_i^2} \tag{4} \]

where d_j is the distance from the histogram vector k to the j-th column vector of matrix K, with j indexing the topics. The detected topic is the one with the minimum distance value.
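A minimal sketch of this classifier, consuming the outputs of the previous snippets; normalizing k to relative frequencies (to match the columns of K) and the eps guard against zero-variance keywords are assumptions of the sketch.

```python
from collections import Counter

def detect_topic(recognized_words, keywords, topics, K, sigma2, eps=1e-9):
    """Returns the topic whose histogram column minimizes eq. (4)."""
    counts = Counter(recognized_words)
    total = max(len(recognized_words), 1)
    k = [counts[kw] / total for kw in keywords]   # keyword histogram vector
    distances = {}
    for j, topic in enumerate(topics):
        distances[topic] = sum((K[i][j] - k[i]) ** 2 / (sigma2[i] + eps)
                               for i in range(len(keywords)))
    return min(distances, key=distances.get)      # minimum-distance topic
```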

IV. AUDIO SEGMENTATION


The problem of distinguishing speech signals from other audio signals (e.g. music) has become increasingly important as ASR systems are applied to more real-world multimedia domains. Therefore, a pre-processing stage that segments the signal into periods of speech and non-speech is invaluable for improving recognition accuracy.
A general overview of the system is depicted in figure 4.

Fig. 4. Audio segmentation global scheme: the segmentation module separates speech from non-speech events (music, noise, others), and only the speech segments are passed on to the ASR.

One of the issues in the design of a signal classifier is the selection of an appropriate feature set that captures the temporal
and spectral structure of the signals. Many such features for speech/music discrimination have been suggested in the literature.
Previous works on audio segmentation have focused on feature extraction analysis or on system architecture. Typically, authors have combined MFCCs with spectral features such as modulation energy and the percentage of low-energy frames, as in [6], [11], [12], or have used histogram equalization-based features, as in [7].

A. Baseline
Previous works at UPM-GTH (Grupo de Tecnología del Habla, Universidad Politécnica de Madrid) have focused on segmenting audio documents into a few acoustic classes (ACs), such as clean speech, music, speech with noise in background and speech with music in background [8]. In this paper, only two classes have been considered, given the audio content included in the MAVIR database, where there are only two classes of audio events:
• Music [mu]. Music is understood in a general sense.
• Speech with background music [sm]. Overlapping of the speech and music classes.
For feature extraction, we have considered long-term statistics of MFCCs (Mel Frequency Cepstral Coefficients), spectral entropy [9] and CHROMA coefficients. CHROMA features are a powerful representation of music audio in which the entire spectrum is projected onto 12 bins representing the 12 distinct semitones (or chroma) of the musical octave [10].
The baseline is a one-step system based on HMMs, as shown in figure 5. In particular, a 3-state HMM has been considered for each acoustic class, initially with 16 Gaussians per state [14]. The number of states has been adjusted following the topology proposed in [8].

Fig. 5. Baseline scheme: MFCC and chroma feature extraction followed by HMM-based frame classification into the music and speech classes.
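As an illustration of this baseline, the sketch below trains one GMM-HMM per acoustic class and labels a segment by log-likelihood. It uses the third-party hmmlearn package as a stand-in, since the paper does not name its toolkit; the 3-state, 16-Gaussian topology follows the text, while the training interface is an assumption of the sketch.

```python
from hmmlearn.hmm import GMMHMM  # pip install hmmlearn

def train_models(features_by_class, n_states=3, n_mix=16):
    """features_by_class: {label: (n_frames, n_dims) feature array}.
    Trains a 3-state HMM with n_mix Gaussians per state per class."""
    models = {}
    for label, X in features_by_class.items():
        model = GMMHMM(n_components=n_states, n_mix=n_mix,
                       covariance_type="diag", n_iter=20)
        model.fit(X)                 # Baum-Welch training on class data
        models[label] = model
    return models

def classify_segment(models, X):
    """Assigns segment X to the class whose model scores it highest."""
    return max(models, key=lambda label: models[label].score(X))
```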

B. Evaluation metrics
The evaluation metric used for audio segmentation is the same as the one used for speaker diarization experiments, described and used by NIST in the RT evaluations. This Diarization Error Rate (DER) is defined as the sum of the per-speaker false alarm (falsely identifying speech), missed speech (failing to identify speech) and speaker error (incorrectly identifying the speech over the music) times, divided by the total amount of speech time in the test audio file. That is,

\[ \mathrm{DER} = \frac{T_{FA} + T_{MISS} + T_{SPKR}}{T_{SPEECH}} \tag{5} \]

In our case, the total amount of speech time is the same as the total amount of scored time of the audio file. To measure it, the MD-eval-v12.pl script developed by NIST was used (available at http://www.itl.nist.gov/).
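In code, eq. (5) amounts to the following computation over the accumulated error times (a sketch; in practice the NIST script also handles the reference alignment and scoring collars):

```python
def diarization_error_rate(t_fa, t_miss, t_spkr, t_speech):
    """DER of eq. (5); all arguments are times in seconds."""
    return (t_fa + t_miss + t_spkr) / t_speech

# e.g. 12 s false alarms, 30 s missed speech and 6 s class confusion
# over 600 s of scored time give DER = 0.08, i.e. 8%
print(diarization_error_rate(12.0, 30.0, 6.0, 600.0))
```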

C. Feature analysis
During the experiments we have evaluated a large number of features used in speech and speaker recognition. The best features for this task have been the following (a feature extraction sketch is given after this subsection):
• MFCC15_E_D_A (mean+var): 15 MFCCs and local energy computed on 25 ms windows (with a 15 ms overlap), plus their delta and double-delta coefficients. The statistics are mean and variance, computed over 1-second windows with 0.5 s overlap.
• MFCC15_E_D_A (mean+std): similar to the previous one, but with mean and standard deviation as statistics.
• MFCC15_E_D_A (mean+std+skew): similar to the previous one, but with mean, standard deviation and skewness as statistics.
• MFCC15_E_D_A (mean+std+skew+kurt): adds kurtosis as a new statistic.
• MFCC15_E_D_A (mean+std+kurt): same as the previous one, removing skewness.
• MFCC15CHR_E_D_A (mean+std): 15 MFCCs and local energy computed on 25 ms windows (with a 15 ms overlap), their delta and double-delta coefficients, and 12 CHROMA coefficients computed every 50 ms. Statistics are mean and standard deviation over 1-second windows with 0.5 s overlap.
• MFCC15CHR+SpectralFeatures_E_D_A (mean+std): same as the previous one, adding the statistics (mean and standard deviation) of several spectral features computed on 50 ms frames (flux, centroid, entropy and band energies).
• MFCC15CHR+Entropy_E_D_A (mean+std): same as the previous one, adding only the mean and standard deviation of the spectral entropy.
This initial analysis was performed considering 16 Gaussians per state. After this analysis, we decided to increase the number of Gaussians per state, so the experiment was repeated with 32, 64 and 128 Gaussians.
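To illustrate the kind of front-end listed above, the following sketch computes MFCC+energy with deltas, chroma and spectral entropy, and pools mean and standard deviation over 1-second windows with 0.5 s overlap. The use of the librosa library and the exact framing details are assumptions of this sketch, not the tooling used in the experiments.

```python
import numpy as np
import librosa

def long_term_features(path, sr=8000):
    """Frame-level MFCC/chroma/entropy features pooled into
    long-term mean+std statistics (1 s windows, 0.5 s hop)."""
    y, sr = librosa.load(path, sr=sr)
    n_fft, hop = int(0.025 * sr), int(0.010 * sr)       # 25 ms windows, 10 ms hop
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=16,  # 15 MFCCs + c0 as energy
                                n_fft=n_fft, hop_length=hop)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
    # spectral entropy of the normalized power spectrum of each frame
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)) ** 2
    p = S / (S.sum(axis=0, keepdims=True) + 1e-12)
    entropy = -(p * np.log(p + 1e-12)).sum(axis=0, keepdims=True)
    X = np.vstack([mfcc,
                   librosa.feature.delta(mfcc),
                   librosa.feature.delta(mfcc, order=2),
                   chroma, entropy]).T                  # (n_frames, n_dims)
    # long-term statistics: mean and std over 1 s windows with 0.5 s hop
    win = sr // hop                                     # frames per second
    out = [np.concatenate([X[s:s + win].mean(axis=0), X[s:s + win].std(axis=0)])
           for s in range(0, max(len(X) - win, 1), win // 2)]
    return np.array(out)
```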

V. DATABASE DESCRIPTION
Experiments have been carried out using the MAVIR database.
Tests on audio segmentation have been performed on the tourism video corpus provided by the MAVIR project, a research network co-funded by the Regional Government of Madrid and the European Social Fund under MA2VICMR (2010-2013); more information at http://www.mavir.net. This corpus consists of a collection of tourism video clips from the TURESPAÑA corporate website (Spanish Tourism Institute). It includes 39 videos in Spanish and 23 videos in English. It was annotated by the Linguistics Department of the Universidad Autónoma de Madrid under the MAVIR project. The Spanish partition of the database includes around 2 hours of video. The audio signals are provided in PCM format, mono, 16 bits, with a sampling frequency of 8 kHz.

VI. EXPERIMENTAL RESULTS FOR AUDIO SEGMENTATION

For the features described in section IV-C, tables I, II and III present the segmentation results obtained with 32, 64 and 128 Gaussians per state, respectively.
TABLE I
RESULTS FOR DIFFERENT FEATURES USING 32 GAUSSIANS PER STATE

Feature                                        DER (%)
MFCC15_E_D_A (mean+var)                          20.09
MFCC15_E_D_A (mean+std)                          19.78
MFCC15_E_D_A (mean+std+skew)                     19.72
MFCC15_E_D_A (mean+std+skew+kurt)                18.01
MFCC15_E_D_A (mean+std+kurt)                     18.11
MFCC15CHR_E_D_A (mean+std)                       15.21
MFCC15CHR+SpectralFeatures_E_D_A (mean+std)      14.95
MFCC15CHR+Entropy_E_D_A (mean+std)               14.12

TABLE II
RESULTS FOR DIFFERENT FEATURES USING 64 GAUSSIANS PER STATE

Feature                                        DER (%)
MFCC15_E_D_A (mean+var)                          18.34
MFCC15_E_D_A (mean+std)                          17.09
MFCC15_E_D_A (mean+std+skew)                     17.22
MFCC15_E_D_A (mean+std+skew+kurt)                18.01
MFCC15_E_D_A (mean+std+kurt)                     15.37
MFCC15CHR_E_D_A (mean+std)                       13.98
MFCC15CHR+SpectralFeatures_E_D_A (mean+std)      11.54
MFCC15CHR+Entropy_E_D_A (mean+std)                9.56


TABLE III
RESULTS FOR DIFFERENT FEATURES USING 128 GAUSSIANS PER STATE

Feature                                        DER (%)
MFCC15_E_D_A (mean+var)                          16.32
MFCC15_E_D_A (mean+std)                          15.23
MFCC15_E_D_A (mean+std+skew)                     14.34
MFCC15_E_D_A (mean+std+skew+kurt)                15.32
MFCC15_E_D_A (mean+std+kurt)                     12.51
MFCC15CHR_E_D_A (mean+std)                       10.15
MFCC15CHR+SpectralFeatures_E_D_A (mean+std)       9.63
MFCC15CHR+Entropy_E_D_A (mean+std)                7.96

VII. EXPERIMENTAL RESULTS FOR TOPIC DETECTION

Different experiments have been carried out using the MAVIR database. For experiment 1, all annotation texts were used for keyword selection. For experiment 2, the annotation texts were divided into training and test sets (90%-10%). The word recognition error of the ASR is about 45%. Different numbers of keywords per topic have been evaluated. Table IV presents the topic detection accuracy for both experiments.

TABLE IV
RESULTS FOR DIFFERENT EXPERIMENTS ON THE MAVIR DATABASE

No. kws / topic   Experiment 1 (%)   Experiment 2 (%)
10                     76.92              57.69
11                     80.77              57.69
12                     84.62              61.54
13                     88.46              61.54
14                     88.46              65.38
15                     96.15              61.54
16                     96.15              65.38
17                     96.15              69.23
18                     92.31              76.92
19                    100.00              80.77
20                    100.00              82.46
21                    100.00              80.77
22                    100.00              80.77

VIII. CONCLUSIONS
For the topic detection task, we can conclude that:
• Normalization over each topic is needed because TF-IDF can yield different ranges of values for the same keyword in different topics.
• TF-IDF is a simple and efficient algorithm for selecting representative words in order to identify and detect different topics within a whole collection of documents.
• Despite its strength, TF-IDF has some limitations. The algorithm does not take into account the relationship between words (e.g. synonyms or plural forms). In this experiment, TF-IDF could not recognize words such as “parque” and “parques” as the same word, treating each as a separate term, which can decrease the weight given to that word in the keyword set. For large document collections, this could become an escalating problem.
And for the audio segmentation task we also conclude that:
• Including CHROMA coefficients significantly reduces the error for all ACs, from 12.51% to 7.96%.
• In all cases, increasing the number of Gaussians improves the results. The best results are obtained when using the MFCC+CHROMA+Entropy features.
• In summary, for the best configuration of the one-step system, we obtained a 7.96% error (using the NIST tool).

ACKNOWLEDGMENT
Thanks to everyone in the GTH (http://lorien.die.upm.es/).

REFERENCES
[1] Fiscus, J. and Doddington, G. “Topic Detection and Tracking Evaluation Overview”. Chapter in Topic Detection and Tracking: Event-based Information Organization. National Institute of Standards and Technology, USA, 2002.
[2] Schultz, J.M. and Liberman, M. “Topic Detection and Tracking using idf-Weighted Cosine Coefficient”. Proc. of The DARPA Broadcast News Workshop
1999. pp. 189-192. USA, 1999.
[3] Wintrode, J. and Kulp, S. “Confidence-based techniques for rapid and robust topic identification of conversational telephone speech”. Proc. of Interspeech
2009. England, 2009.
[4] Ramos, J. “Using TF-IDF to determine word relevance in document queries”. Department of Computer Science. Rutgers University, USA, 2003.
[5] Mahajan, M. and Beeferman, D. and Huang, X.D. “Improved topic-dependent language modeling using information retrieval techniques”. Proc. IEEE
ICASSP 1999, vol. 1, pp. 541-544, 1999.
[6] Izumitani, T. and Mukai, R. and Kashino, K. “A background music detection method based on robust feature extraction”. Proc. IEEE ICASSP 2008, pp.
13-16, 2008.
[7] Gallardo, A. and Montero, J.M. “Histogram Equalization-Based Features for Speech, Music, and Song Discrimination”. IEEE Signal Process. Lett., vol.
17, no. 7, July, 2010.
[8] Gallardo, A. and San-Segundo, R. “UPM-UC3M system for music and speech segmentation”. Proc. of the Jornadas de Tecnología del Habla FALA 2010. Spain, November, 2010.

[9] Misra, H. and Ikbal, S. and Bourlard, H. and Hermansky, H. “Spectral entropy based feature for robust ASR”. Proc. IEEE ICASSP 2004, pp. 193-198,
2004.
[10] Eyben, F. and Wöllmer, M. and Schuller, B. “openSMILE - The Munich Versatile and Fast Open-Source Audio Feature Extractor”. Proc. ACM Multimedia
(MM), ACM, Firenze, Italy, 2010.
[11] Huijbregts, M. and de-Jong, F. “Robust Speech/Non-Speech Classification in Heterogeneous Multimedia Content”. Speech Communication, vol. 53, no.
2, pp. 143-153, 2011.
[12] Tsai, W.H. and Lin, H.P. “Background Music Removal Based on Cepstrum Transformation for Popular Singer Identification”. IEEE Trans. Audio, Speech
and Language Processing, vol. 19, no. 5, pp. 1196-1205, 2011.
[13] Moreno, J. et al. “Some experiments in evaluating ASR systems applied to multimedia retrieval”. Proc. of 7th Intl. Conf. on Adaptive Multimedia
Retrieval, pp. 12-23, Spain, 2009.
[14] Ajmera, J. and McCowan, I. and Bourlard, H. “Speech/music segmentation using entropy and dynamism features in a HMM classification framework”.
Speech Communication, vol. 40, pp. 351-363, 2003.
[15] Lane, I. and Kawahara, T. and Matsui, T. “Language model switching based on topic detection for dialog speech recognition”. Proc. IEEE ICASSP 2003, pp. 616-619, 2003.
