
Audio Signal Recognition
for Speech, Music, and Environmental Sounds

1 Pattern Recognition for Sounds
2 Speech Recognition
3 Other Audio Applications
4 Observations and Conclusions

Dan Ellis <dpwe@ee.columbia.edu>


Laboratory for Recognition and Organization of Speech and Audio
(LabROSA)
Columbia University, New York
http://labrosa.ee.columbia.edu/

Pattern Recognition for Sounds

Pattern recognition is abstraction
- continuous signal → discrete labels
- an essential part of understanding? information extraction

Sound is a challenging domain


- sounds can be highly variable
- human listeners are extremely adept


Pattern classification

Classes are defined as distinct regions in some feature space
- e.g. formant frequencies to define vowels
[Figure: formant frequencies (F1, F2 / Hz) vs. time for the vowels "ay" and "ao"]
Issues
- finding segments to classify
- transforming to an appropriate feature space
- defining the class boundaries

[Figure: Pols vowel formants, "u" (x), "o" (o), "a" (+), plotted in F1/F2 space (Hz), with a new observation x to be classified]
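
To make the "class regions in feature space" idea concrete, here is a minimal sketch (not from the slides) that assigns a new (F1, F2) observation to the nearest class mean; the formant values are made up for illustration:

```python
# Minimal sketch (hypothetical data): nearest-class-mean classification of a
# new (F1, F2) observation against made-up vowel-formant training points.
import numpy as np

# Hypothetical formant measurements in Hz: {class label: array of [F1, F2] rows}
train = {
    "u": np.array([[300, 700], [320, 800], [280, 750]]),
    "o": np.array([[500, 900], [480, 1000], [520, 950]]),
    "a": np.array([[750, 1300], [800, 1250], [780, 1350]]),
}

def classify(x):
    """Assign x = [F1, F2] to the class whose mean is closest (Euclidean distance)."""
    means = {c: pts.mean(axis=0) for c, pts in train.items()}
    return min(means, key=lambda c: np.linalg.norm(x - means[c]))

print(classify(np.array([760, 1280])))  # -> "a" for this made-up data
```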

Classification system parts

Sensor → signal
→ Pre-processing / segmentation (e.g. STFT, locate vowels) → segment
→ Feature extraction (e.g. formant extraction) → feature vector
→ Classification → class
→ Post-processing (context constraints, costs/risk)

Feature extraction

Feature choice is critical to performance
- make important aspects explicit, remove irrelevant details
- equivalent representations can perform very differently in practice
- major opening for domain knowledge (cleverness)

Mel-Frequency Cepstral Coefficients (MFCCs): ubiquitous speech features
- DCT of log spectrum on auditory scale
- approximately decorrelated ...
[Figure: Mel spectrogram (frequency channel vs. time / sec) and the corresponding MFCCs (cepstral coefficient vs. time)]
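
As an illustration of how such features might be computed in practice (not part of the original slides), here is a sketch using the librosa library, assuming it is installed and a speech file is available:

```python
# Minimal sketch, assuming librosa is available and "speech.wav" exists.
import librosa

y, sr = librosa.load("speech.wav", sr=None)          # waveform and sample rate
mel = librosa.feature.melspectrogram(y=y, sr=sr)     # mel-scale spectrogram
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # DCT of log-mel -> MFCCs
print(mfcc.shape)  # (13, n_frames): ~13 roughly decorrelated coefficients per frame
```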

Statistical Interpretation

Observations are random variables whose distribution depends on the class:
- class i: hidden, discrete
- observation x: continuous
- generation: p(x|i); inference: Pr(i|x)

Source distributions p(x|i)
- reflect variability in feature
- reflect noise in observation
- generally have to be estimated from data (rather than known in advance)

[Figure: example class-conditional distributions p(x|i) for classes i = 1...4 along a scalar feature x]

Priors and posteriors

Bayesian inference can be interpreted as updating prior beliefs with new information, x:
$$\Pr(i \mid x) = \frac{p(x \mid i)\,\Pr(i)}{\sum_j p(x \mid j)\,\Pr(j)}$$

(likelihood p(x|i), prior probability Pr(i), evidence p(x) = Σ_j p(x|j) Pr(j), posterior probability Pr(i|x))

Posterior is prior scaled by likelihood & normalized by evidence (so Σ_i Pr(i|x) = 1)

Minimize the probability of error by choosing the maximum a posteriori (MAP) class:

$$\hat{i} = \arg\max_i \Pr(i \mid x)$$
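
A tiny numeric check of the prior-to-posterior update, with made-up priors and likelihoods for a hypothetical two-class problem:

```python
# Minimal sketch with made-up numbers: two classes, likelihoods evaluated at the observed x.
priors = {"speech": 0.7, "music": 0.3}          # Pr(i)
likelihoods = {"speech": 0.02, "music": 0.08}   # p(x|i) at the observation

evidence = sum(priors[c] * likelihoods[c] for c in priors)              # p(x)
posteriors = {c: priors[c] * likelihoods[c] / evidence for c in priors}

print(posteriors)                           # {'speech': ~0.368, 'music': ~0.632}
print(max(posteriors, key=posteriors.get))  # MAP class: 'music'
```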


Practical implementation

The optimal classifier is $\hat{i} = \arg\max_i \Pr(i \mid x)$, but we don't know Pr(i|x)

So, model the conditional distributions p(x|i), then use Bayes' rule to find the MAP class
[Diagram: labeled training examples {xn} → sort according to class → estimate conditional pdf p(x|1) for class 1, etc.]

Gaussian models

Model data distributions via parametric model


- assume known form, estimate a few parameters

E.g. Gaussian in 1 dimension:

$$p(x \mid i) = \underbrace{\frac{1}{\sqrt{2\pi}\,\sigma_i}}_{\text{normalization}} \exp\left(-\frac{1}{2}\left(\frac{x - \mu_i}{\sigma_i}\right)^2\right)$$

For higher dimensions, need a mean vector μ_i and a d × d covariance matrix Σ_i


[Figure: 2-D Gaussian density surface]

Fit more complex distributions with mixtures...
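
A minimal sketch of the single-Gaussian-per-class recipe, with made-up 1-D training data (scipy assumed available):

```python
# Minimal sketch (made-up data): fit a 1-D Gaussian per class, then pick the MAP class.
import numpy as np
from scipy.stats import norm

# Hypothetical labeled training data: {class: 1-D feature samples}
data = {
    1: np.array([1.0, 1.2, 0.8, 1.1]),
    2: np.array([2.9, 3.1, 3.0, 3.3]),
}
priors = {1: 0.5, 2: 0.5}

# Estimate mean and std per class (the "few parameters" of the assumed form)
params = {c: (x.mean(), x.std(ddof=1)) for c, x in data.items()}

def map_class(x):
    post = {c: priors[c] * norm.pdf(x, loc=mu, scale=sigma)
            for c, (mu, sigma) in params.items()}
    return max(post, key=post.get)

print(map_class(1.4))  # -> 1
print(map_class(2.5))  # -> 2
```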



Gaussian models for formant data

Single Gaussians are a reasonable fit for this data

Extrapolation of the decision boundaries can be surprising


Outline
1 Pattern Recognition for Sounds
2 Speech Recognition
- How it's done
- What works, and what doesn't
3 Other Audio Applications
4 Observations and Conclusions


How to recognize speech?

Cross correlate templates?


- waveform?
- spectrogram?
- time-warp problems

Classify short segments as phones (or ...), handle time-warp later
- model with slices of ~10 ms
- pseudo-piecewise-stationary model of words:
[Figure: spectrogram (0-4000 Hz, ~0.45 s) segmented into piecewise-stationary phone slices: sil g w eh n sil]
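
For reference, the classic template-matching answer to the time-warp problem is dynamic time warping (DTW); the slides' HMM approach handles time-warp differently, but a minimal DTW sketch shows the idea:

```python
# Minimal DTW sketch (not the slides' HMM approach): align two feature sequences
# of different lengths by dynamic programming over a local-distance matrix.
import numpy as np

def dtw_cost(a, b):
    """a, b: (T, d) feature arrays (e.g. MFCC frames). Returns the total alignment cost."""
    T, U = len(a), len(b)
    D = np.full((T + 1, U + 1), np.inf)
    D[0, 0] = 0.0
    for t in range(1, T + 1):
        for u in range(1, U + 1):
            local = np.linalg.norm(a[t - 1] - b[u - 1])
            D[t, u] = local + min(D[t - 1, u], D[t, u - 1], D[t - 1, u - 1])
    return D[T, U]

# Toy 1-D "templates": same shape, different speaking rate
slow = np.array([[0.], [1.], [2.], [2.], [1.], [0.]])
fast = np.array([[0.], [1.], [2.], [1.], [0.]])
print(dtw_cost(slow, fast))  # small cost despite the length mismatch
```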

Speech Recognizer Architecture

Almost all current systems are the same:

sound → Feature calculation → feature vectors
feature vectors → Acoustic classifier (acoustic model parameters) → phone probabilities
phone probabilities → HMM decoder (word models, e.g. "s" "ah" ...; language model, e.g. p("sat"|"the","cat"), p("saw"|"the","cat")) → phone / word sequence
phone / word sequence → Understanding / application...
(acoustic model, word model, and language model parameters are all trained from DATA)

Biggest source of improvement is the increase in training data
- ... along with algorithms to take advantage of it
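
To make the HMM-decoder box concrete, here is a bare-bones Viterbi sketch over made-up per-frame phone probabilities (a real decoder adds word models, a language model, and pruning):

```python
# Bare-bones Viterbi sketch: made-up per-frame phone probabilities and a simple
# left-to-right transition model (stay in the current phone or move to the next).
import numpy as np

phones = ["sil", "g", "w", "eh", "n"]
n = len(phones)

# Hypothetical acoustic-classifier outputs: p(phone | frame), shape (n_frames, n_phones)
logprob = np.log(np.array([
    [0.80, 0.05, 0.05, 0.05, 0.05],
    [0.10, 0.60, 0.10, 0.10, 0.10],
    [0.10, 0.10, 0.60, 0.10, 0.10],
    [0.05, 0.05, 0.10, 0.70, 0.10],
    [0.10, 0.05, 0.05, 0.10, 0.70],
]))

logtrans = np.full((n, n), -np.inf)           # disallow all moves by default
for i in range(n):
    logtrans[i, i] = np.log(0.5)              # stay
    if i + 1 < n:
        logtrans[i, i + 1] = np.log(0.5)      # advance to the next phone
logtrans[n - 1, n - 1] = 0.0                  # final phone: stay with probability 1

def viterbi(logprob, logtrans):
    T, n = logprob.shape
    delta = np.zeros((T, n))
    back = np.zeros((T, n), dtype=int)
    delta[0] = logprob[0]                     # flat initial distribution (constant dropped)
    for t in range(1, T):
        for j in range(n):
            scores = delta[t - 1] + logtrans[:, j]
            back[t, j] = int(np.argmax(scores))
            delta[t, j] = scores[back[t, j]] + logprob[t, j]
    state = int(np.argmax(delta[-1]))
    path = [state]
    for t in range(T - 1, 0, -1):
        state = int(back[t, state])
        path.append(state)
    return [phones[i] for i in reversed(path)]

print(viterbi(logprob, logtrans))             # ['sil', 'g', 'w', 'eh', 'n'] for these numbers
```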

Speech: Progress

Annual NIST evaluations

[Figure: word error rate on annual NIST evaluations, 1990-2005 (log scale, 3%-30% range shown)]

- steady progress (?), but still order(s) of magnitude worse than human listeners

Speech: Problems

Natural, spontaneous speech is weird!

- coarticulation
- deletions
- disfluencies
- is word transcription even a sensible approach?

Other major problems


- speaking style, rate, accent
- environment / background...


Speech: What works, what doesn't

What works:
Techniques:
- MFCC features + GMM/HMM systems trained with Baum-Welch (EM)
- Using lots of training data
Domains:
- Controlled, low noise environments
- Constrained, predictable contexts
- Motivated, co-operative users

What doesn't work:
Techniques:
- rules based on insight
- perceptual representations (except when they do...)
Domains:
- spontaneous, informal speech
- unusual accents, voice quality, speaking style
- variable, high-noise background / environment


Outline
1 Pattern Recognition for Sounds
2 Speech Recognition
3 Other Audio Applications
- Meeting recordings
- Alarm sounds
- Music signal processing
4 Observations and Conclusions


Other Audio Applications: ICSI Meeting Recordings corpus

Real meetings, 16 channel recordings, 80 hrs

- released through NIST/LDC

Classification e.g.: detecting emphasized utterances based on f0 contour (Kennedy & Ellis 03)
- per-speaker normalized f0 as a unidimensional feature → simple threshold classification (see the sketch below)
[Figure: f0 histograms (f0 / Hz, log axis) for Speaker 1 and Speaker 2]
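
A rough sketch of this kind of per-speaker f0 normalization and thresholding, with made-up f0 values and a hypothetical threshold (not the published Kennedy & Ellis system):

```python
# Minimal sketch with made-up f0 tracks: z-score f0 per speaker, flag "emphasized"
# utterances whose mean normalized f0 exceeds a (hypothetical) threshold.
import numpy as np

def emphasized(utterance_f0, speaker_f0, threshold=1.0):
    """utterance_f0: f0 values (Hz) of one utterance; speaker_f0: all f0 for that speaker."""
    mu, sigma = np.mean(speaker_f0), np.std(speaker_f0)
    z = (np.mean(utterance_f0) - mu) / sigma
    return z > threshold

speaker_f0 = np.random.default_rng(0).normal(120, 20, 500)   # this speaker's usual range
print(emphasized(np.array([170, 180, 175]), speaker_f0))     # True: well above typical f0
print(emphasized(np.array([118, 122, 119]), speaker_f0))     # False: near the speaker's mean
```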

Personal Audio

LifeLog / MyLifeBits / Remembrance Agent:
- easy to record everything you hear

Then what?
- prohibitive to review
- applications if access easier?

Automatic content analysis / indexing...

[Figure: long-duration spectrogram of personal audio (freq / Bark) over several hours (time / min; clock time 14:30-18:30)]

- find features to classify into e.g. locations



Alarm sound detection


Alarm sounds have particular structure
- clear even at low SNRs
- potential applications...
[Figure: spectrogram (freq / kHz), "Restaurant + alarms (snr 0 ns 6 al 8)"]

Contrast two systems: (Ellis 01)


- standard, global features, P(X|M)
- sinusoidal model, fragments, P(M,S|Y)
[Figure: MLP classifier output vs. sound object classifier output (freq / kHz vs. time / sec, 20-50 s)]

- error rates high, but interesting comparisons...



Music signal modeling

Use machine listener to navigate large music collections
- e.g. unsigned bands on MP3.com

Classification to label:
- notes, chords, singing, instruments
- .. information to help cluster music

Artist models based on feature distributions

[Figure: per-artist distributions of cepstral coefficients (third vs. fifth cepstral coefficient) for madonna and bowie, with genre regions such as Electronica and Country]

Measure similarity between users' collections and new music? (Berenzweig & Ellis 03)
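
One way to sketch the "artist model" idea is a Gaussian mixture per artist over cepstral-coefficient frames, scoring new audio against each model; scikit-learn is assumed, and the data here are random stand-ins, not the published system:

```python
# Minimal sketch: per-artist GMMs over MFCC-like frames, then pick the best-scoring artist.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical training frames per artist: (n_frames, n_cepstral_coefs)
frames = {
    "madonna": rng.normal(0.0, 1.0, size=(500, 13)),
    "bowie":   rng.normal(0.5, 1.2, size=(500, 13)),
}

models = {a: GaussianMixture(n_components=8, covariance_type="diag").fit(X)
          for a, X in frames.items()}

new_track = rng.normal(0.5, 1.2, size=(200, 13))             # frames from an unlabeled track
scores = {a: m.score(new_track) for a, m in models.items()}  # mean log-likelihood per frame
print(max(scores, key=scores.get))                           # -> "bowie" for this toy data
```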

Outline
1 Pattern Recognition for Sounds
2 Speech Recognition
3 Other Audio Applications
4 Observations and Conclusions
- Model complexity
- Sound mixtures


Observations and Conclusions: Training and test data

Balance model/data size to avoid overfitting:

[Figure: error rate vs. amount of training (or number of parameters): training-data error keeps falling while test-data error rises again once overfitting sets in]

Diminishing returns from more data:

[Figure: WER for PLP12N-8k nets vs. net size (hidden layer: 500-4000 units) and training data (9.25-74 hours); WER roughly 32-44%, with directions for "more training data" and "more parameters", a constant-training-time contour, and the optimal parameter/data ratio]

Beyond classification

No free lunch:
Classifier can only do so much
- always need to consider other parts of system

Features
- impose ceiling on system performance
- improved features allow simpler classifiers

Segmentation / mixtures
- e.g. speech-in-noise: only a subset of feature dimensions is available, so use missing-data approaches... (see the sketch below)

[Diagram: sources S1(t) and S2(t) mixing into the observation Y(t)]
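
A hedged sketch of the missing-data idea for a diagonal-covariance Gaussian class model: score only the feature dimensions judged reliable, so marginalizing a masked dimension just drops its factor (illustrative only, with made-up parameters and mask):

```python
# Minimal sketch: class likelihood from reliable feature dimensions only
# (diagonal Gaussian, so marginalizing a dimension = dropping its factor).
import numpy as np
from scipy.stats import norm

mu = np.array([1.0, 5.0, 2.0, 0.0])      # class mean (hypothetical)
sigma = np.array([0.5, 1.0, 0.8, 0.3])   # per-dimension std (diagonal covariance)

x = np.array([1.1, 9.0, 2.2, 0.1])       # observation; dimension 1 is noise-corrupted
reliable = np.array([True, False, True, True])  # mask from a (hypothetical) noise estimate

full_ll = norm.logpdf(x, mu, sigma).sum()
missing_data_ll = norm.logpdf(x[reliable], mu[reliable], sigma[reliable]).sum()
print(full_ll, missing_data_ll)  # the corrupted dimension no longer drags the score down
```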

Summary

Statistical Pattern Recognition
- exploit training data for probabilistically-correct classifications

Speech recognition
- successful application of statistical PR
- .. but many remaining frontiers

Other audio applications


- meetings, alarms, music
- classification is information extraction

Current challenges
- variability in speech
- acoustic mixtures


Extra slides


Neural network classifiers


Instead of estimating p(x|i) and using Bayes' rule, can also try to estimate the posteriors Pr(i|x) directly (the decision boundaries)

Sums over nonlinear functions of sums give a large range of decision surfaces...

e.g. Multi-layer perceptron (MLP):

$$y_k = F\Big[\sum_j w_{jk}\, F\Big[\sum_i w_{ij}\, x_i\Big]\Big]$$
[Diagram: MLP with inputs x1, x2, x3 (input layer), hidden units h1, h2 via weights wij and nonlinearity F[·] (hidden layer), and outputs y1, y2 via weights wjk (output layer)]
Problem is finding the weights wij ... (training)
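
A bare-bones forward pass matching the formula above, with random weights and a sigmoid as an assumed choice of F (training to find the weights is not shown):

```python
# Minimal sketch of the MLP forward pass y_k = F[ sum_j w_jk F[ sum_i w_ij x_i ] ].
import numpy as np

def F(a):
    return 1.0 / (1.0 + np.exp(-a))   # sigmoid nonlinearity (one possible choice of F)

rng = np.random.default_rng(0)
W_ij = rng.normal(size=(3, 2))        # input (i = 3) -> hidden (j = 2) weights
W_jk = rng.normal(size=(2, 2))        # hidden (j = 2) -> output (k = 2) weights

x = np.array([0.2, -0.5, 1.0])        # feature vector
h = F(x @ W_ij)                       # hidden activations h_j
y = F(h @ W_jk)                       # outputs y_k (posterior estimates after training)
print(y)
```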

Neural net classifier

Models boundaries, not density p(x|i)

Discriminant training
- concentrate on boundary regions
- needs to see all classes at once


Why is Speech Recognition hard?

Why not match against a set of waveforms?


- waveforms are never (nearly!) the same twice
- speakers minimize information/effort in speech

Speech variability comes from many sources:


- speaker-dependent (SD) recognizers must
handle within-speaker variability
- speaker-independent (SI) recognizers must also
deal with variation between speakers
- all recognizers are afflicted by background noise,
variable channels

Need recognition models that:


- generalize i.e. accept variations in a range, and
- adapt i.e. tune in to a particular variant


Within-speaker variability

Timing variation:
- word duration varies enormously

[Figure: spectrogram (0-4000 Hz, ~3 s) with phone- and word-level alignment of a fast, casual utterance; words include SO, I, THOUGHT, ABOUT, THAT, AND, I, THINK, IT'S, STILL, POSSIBLE]

- fast speech reduces vowels

Speaking style variation:


- careful/casual articulation
- soft/loud speech

Contextual effects:
- speech sounds vary with context, role:
How do you do?


Between-speaker variability

Accent variation
- regional / mother tongue

Voice quality variation


- gender, age, huskiness, nasality

Individual characteristics
- mannerisms, speed, prosody

[Figure: spectrograms (0-8000 Hz, ~3 s) for two speakers, mbma0 and fjdm2]

Environment variability

Background noise
- fans, cars, doors, papers

Reverberation
- boxiness in recordings

Microphone channel
- huge effect on relative spectral gain

[Figure: spectrograms (0-4000 Hz vs. time / s) of speech from a close mic and a tabletop mic]
