Abstract—This report presents a simple method for silence removal and segmentation of speech signals, implemented in Matlab. The method is based on two simple audio features, namely the signal energy and the spectral centroid. Once the feature sequences have been extracted, a simple thresholding criterion is applied in order to remove the silence areas of the audio signal.

I. INTRODUCTION

Speech signals usually contain many areas of silence or noise. Therefore, in speech analysis it is necessary to first apply a silence removal method, in order to detect "clean" speech segments. Some examples where such an application may be useful are: speech-based human-computer interaction, audio-based surveillance systems and, in general, all automatic speech recognition systems. The method implemented here is a very simple example of how the detection of speech segments can be achieved.

II. ALGORITHM DESCRIPTION

A. General

In general, the following steps are executed:
1) Two feature sequences are extracted from the whole audio signal.
2) Two thresholds are dynamically estimated, one for each sequence.
3) A simple thresholding criterion is applied to the sequences.
4) Speech segments are detected based on the above criterion and, finally, a simple post-processing stage is applied.

B. Feature Extraction

In order to extract the feature sequences, the signal is first broken into non-overlapping short-term windows (frames) of 50 ms length. Then, for each frame, the two features described below are calculated, leading to two feature sequences for the whole audio signal. The adopted audio features are:
1) Signal Energy: Let $x_i(n)$, $n = 1, \dots, N$ be the audio samples of the $i$-th frame, of length $N$. Then, for each frame $i$, the energy is calculated according to the equation $E(i) = \frac{1}{N}\sum_{n=1}^{N} |x_i(n)|^2$. This simple feature can be used for detecting silent periods in audio signals, but also for discriminating between audio classes.
2) Spectral Centroid: The spectral centroid, $C_i$, of the $i$-th frame is defined as the center of "gravity" of its spectrum, i.e., $C_i = \frac{\sum_{k=1}^{N} (k+1)\, X_i(k)}{\sum_{k=1}^{N} X_i(k)}$, where $X_i(k)$, $k = 1, \dots, N$, are the Discrete Fourier Transform (DFT) coefficients of the $i$-th short-term frame and $N$ is the frame length. This feature is a measure of the spectral position, with high values corresponding to "brighter" sounds. Experiments have indicated that the sequence of spectral centroid values exhibits high variation for speech segments ([1]).

The reasons these particular features were selected (apart from their simplicity of implementation) are:
1) For simple cases (where the level of background noise is not very high), the energy of the voiced segments is larger than the energy of the silent segments.
2) If unvoiced segments simply contain environmental sounds, then the spectral centroid of the voiced segments is again larger, since these noisy sounds tend to have lower frequencies and therefore lower spectral centroid values.

For more details on these audio features and their application to audio analysis, one can refer to [2], [1], [3] and [4].
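The per-frame computation described above can be sketched in Python (the report's implementation is in Matlab); the function name `frame_features` and the direct DFT are illustrative choices, not part of the original code:

```python
import cmath

def frame_features(x, fs, win_sec=0.050):
    """Split x into non-overlapping frames of win_sec seconds and compute,
    for each frame, the energy E(i) and the spectral centroid C_i."""
    N = int(win_sec * fs)                      # frame length in samples
    energies, centroids = [], []
    for start in range(0, len(x) - N + 1, N):
        frame = x[start:start + N]
        # E(i) = (1/N) * sum_{n=1}^{N} |x_i(n)|^2
        energies.append(sum(s * s for s in frame) / N)
        # magnitude spectrum |X_i(k)| via a direct (slow) DFT
        X = [abs(sum(frame[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                     for n in range(N)))
             for k in range(N)]
        # C_i = sum (k+1)|X_i(k)| / sum |X_i(k)|, following the text's weighting
        den = sum(X)
        centroids.append(sum((k + 1) * X[k] for k in range(N)) / den
                         if den > 0 else 0.0)
    return energies, centroids
```

A silent frame yields zero energy, while a frame whose spectrum is concentrated in low DFT bins yields a small centroid, as expected from the definitions.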
C. Speech Segments Detection

Once the two feature sequences have been computed, two thresholds (one for each sequence) are estimated. Towards this end, the following process is carried out for each feature sequence: the histogram of the sequence's values is computed and a smoothing filter is applied on it; then, the positions $M_1$ and $M_2$ of the first and second local maxima of the smoothed histogram are detected. The threshold is computed as $T = \frac{W \cdot M_1 + M_2}{W + 1}$. $W$ is a user-defined parameter. Large values of $W$ obviously lead to threshold values closer to $M_1$.
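The threshold rule $T = (W \cdot M_1 + M_2)/(W+1)$ can be sketched as follows; the bin count, the 3-point moving-average smoother and the fallback to the mean are assumptions not specified in the text:

```python
def dynamic_threshold(seq, W=5, nbins=20):
    """Histogram-based threshold T = (W*M1 + M2) / (W + 1), where M1 and M2
    are the positions of the first two local maxima of the smoothed
    histogram of the feature-sequence values."""
    lo, hi = min(seq), max(seq)
    width = (hi - lo) / nbins or 1.0
    hist = [0] * nbins
    for v in seq:
        hist[min(int((v - lo) / width), nbins - 1)] += 1
    # simple 3-point moving-average smoothing (an assumed choice)
    sm = [(hist[max(i - 1, 0)] + hist[i] + hist[min(i + 1, nbins - 1)]) / 3.0
          for i in range(nbins)]
    def is_peak(i):
        left = sm[i - 1] if i > 0 else -1.0
        right = sm[i + 1] if i < nbins - 1 else -1.0
        return sm[i] > left and sm[i] >= right
    # positions (bin centers) of local maxima, left to right
    maxima = [lo + (i + 0.5) * width for i in range(nbins) if is_peak(i)]
    if len(maxima) < 2:                  # unimodal histogram: fall back
        return sum(seq) / len(seq)       # to the mean of the sequence
    M1, M2 = maxima[0], maxima[1]
    return (W * M1 + M2) / (W + 1)
```

For a bimodal sequence (silence near one value, speech near another), increasing $W$ pulls the threshold toward the first mode $M_1$, consistent with the remark above.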
The above process is executed for both feature sequences, leading to two thresholds, T1 and T2, based on the energy sequence and the spectral centroid sequence, respectively. Once the two thresholds have been estimated, the two feature sequences are thresholded, and the segments are formed by successive frames for which the respective feature values (for both feature sequences) are larger than the computed thresholds.

Figure 1. Results for an audio example. The first subfigure shows the sequence of the signal's energy. In the second subfigure the spectral centroid sequence is presented. In both cases, the respective thresholds are also shown. The third subfigure presents the whole audio signal. Red color represents the detected voiced segments.

D. Post Processing

As a post-processing step, the detected speech segments are lengthened by 5 short-term windows (i.e., 250 ms) on both sides. Finally, successive segments are merged.

III. EXECUTION EXAMPLES

The main implemented Matlab function is called detectVoiced(). For example, in order to listen to the first of the detected audio segments, one has to type the command:

sound(segments{1}, fs);

For more details and possible questions, please refer to the Mathworks File Exchange web site.
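The detection and post-processing logic of Sections II-C and II-D can be sketched as follows; this Python helper works on frame indices (with 50 ms frames, the default extension of 5 frames corresponds to 250 ms per side) and is an illustration, not the detectVoiced() code:

```python
def detect_segments(energy, centroid, T1, T2, extend=5):
    """Mark frames where BOTH feature values exceed their thresholds,
    extend each run of voiced frames by `extend` frames on each side,
    and merge runs that touch or overlap after extension."""
    n = len(energy)
    voiced = [e > T1 and c > T2 for e, c in zip(energy, centroid)]
    # collect runs of successive voiced frames as (start, end) indices
    segs, i = [], 0
    while i < n:
        if voiced[i]:
            j = i
            while j + 1 < n and voiced[j + 1]:
                j += 1
            segs.append((max(0, i - extend), min(n - 1, j + extend)))
            i = j + 1
        else:
            i += 1
    # merge successive segments that touch or overlap
    merged = []
    for s, e in segs:
        if merged and s <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged
```

With a small extension the two bursts of high feature values stay separate; with a larger extension their extended ranges overlap and are merged into one segment, which is the effect the post-processing stage aims for.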