
Audio Signal Recognition
for Speech, Music, and Environmental Sounds

1 Pattern Recognition for Sounds
2 Speech Recognition
3 Other Audio Applications
4 Observations and Conclusions

Dan Ellis <dpwe@ee.columbia.edu>


Laboratory for Recognition and Organization of Speech and Audio
(LabROSA)
Columbia University, New York
http://labrosa.ee.columbia.edu/

Pattern Recognition for Sounds

Pattern recognition is abstraction
- continuous signal → discrete labels
- an essential part of understanding? information extraction

Sound is a challenging domain


- sounds can be highly variable
- human listeners are extremely adept


Pattern classification

Classes are defined as distinct regions in some feature space
- e.g. formant frequencies to define vowels
[Figure: formant frequencies (F1, F2 / Hz) vs. time for the vowels "ay" and "ao"]
Issues
- finding segments to classify
- transforming to an appropriate feature space
- defining the class boundaries

[Figure: Pols vowel formants, "u" (x), "o" (o), "a" (+), plotted in F1/F2 space (Hz), with a new observation x to be classified]
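
To make the "class regions in feature space" idea concrete, here is a minimal sketch (not from the slides) that assigns a new (F1, F2) observation to the nearest class mean; the formant values are made up for illustration:

```python
# Minimal sketch (hypothetical data): nearest-class-mean classification of a
# new (F1, F2) observation against made-up vowel-formant training points.
import numpy as np

# Hypothetical formant measurements in Hz: {class label: array of [F1, F2] rows}
train = {
    "u": np.array([[300, 700], [320, 800], [280, 750]]),
    "o": np.array([[500, 900], [480, 1000], [520, 950]]),
    "a": np.array([[750, 1300], [800, 1250], [780, 1350]]),
}

def classify(x):
    """Assign x = [F1, F2] to the class whose mean is closest (Euclidean distance)."""
    means = {c: pts.mean(axis=0) for c, pts in train.items()}
    return min(means, key=lambda c: np.linalg.norm(x - means[c]))

print(classify(np.array([760, 1280])))  # -> "a" for this made-up data
```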

Classification system parts

Sensor → signal
→ Pre-processing / segmentation (e.g. STFT, locate vowels) → segment
→ Feature extraction (e.g. formant extraction) → feature vector
→ Classification → class
→ Post-processing (context constraints, costs/risk)

Feature extraction

Feature choice is critical to performance
- make important aspects explicit, remove irrelevant details
- equivalent representations can perform very differently in practice
- major opening for domain knowledge (cleverness)

Mel-Frequency Cepstral Coefficients (MFCCs): ubiquitous speech features
- DCT of log spectrum on auditory scale
- approximately decorrelated ...
[Figure: Mel spectrogram (frequency channel vs. time / sec) and the corresponding MFCCs (cepstral coefficient vs. time)]
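
As an illustration of how such features might be computed in practice (not part of the original slides), here is a sketch using the librosa library, assuming it is installed and a speech file is available:

```python
# Minimal sketch, assuming librosa is available and "speech.wav" exists.
import librosa

y, sr = librosa.load("speech.wav", sr=None)          # waveform and sample rate
mel = librosa.feature.melspectrogram(y=y, sr=sr)     # mel-scale spectrogram
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # DCT of log-mel -> MFCCs
print(mfcc.shape)  # (13, n_frames): ~13 roughly decorrelated coefficients per frame
```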

Statistical Interpretation

Observations are random variables whose distribution depends on the class:
- class i: hidden, discrete
- observation x: continuous
- generation: p(x|i); inference: Pr(i|x)

Source distributions p(x|i)
- reflect variability in feature
- reflect noise in observation
- generally have to be estimated from data (rather than known in advance)

[Figure: example class-conditional distributions p(x|i) for classes i = 1...4 along a scalar feature x]

Priors and posteriors

Bayesian inference can be interpreted as updating prior beliefs with new information, x:
$$\Pr(i \mid x) = \frac{p(x \mid i)\,\Pr(i)}{\sum_j p(x \mid j)\,\Pr(j)}$$

(likelihood p(x|i), prior probability Pr(i), evidence p(x) = Σ_j p(x|j) Pr(j), posterior probability Pr(i|x))

Posterior is prior scaled by likelihood & normalized by evidence (so Σ_i Pr(i|x) = 1)

Minimize the probability of error by choosing the maximum a posteriori (MAP) class:

$$\hat{i} = \arg\max_i \Pr(i \mid x)$$
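
A tiny numeric check of the prior-to-posterior update, with made-up priors and likelihoods for a hypothetical two-class problem:

```python
# Minimal sketch with made-up numbers: two classes, likelihoods evaluated at the observed x.
priors = {"speech": 0.7, "music": 0.3}          # Pr(i)
likelihoods = {"speech": 0.02, "music": 0.08}   # p(x|i) at the observation

evidence = sum(priors[c] * likelihoods[c] for c in priors)              # p(x)
posteriors = {c: priors[c] * likelihoods[c] / evidence for c in priors}

print(posteriors)                           # {'speech': ~0.368, 'music': ~0.632}
print(max(posteriors, key=posteriors.get))  # MAP class: 'music'
```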


Practical implementation

The optimal classifier is $\hat{i} = \arg\max_i \Pr(i \mid x)$, but we don't know Pr(i|x)

So, model the conditional distributions p(x|i), then use Bayes' rule to find the MAP class
[Diagram: labeled training examples {xn} → sort according to class → estimate conditional pdf p(x|1) for class 1, etc.]

Gaussian models

Model data distributions via parametric model


- assume known form, estimate a few parameters

E.g. Gaussian in 1 dimension:

$$p(x \mid i) = \underbrace{\frac{1}{\sqrt{2\pi}\,\sigma_i}}_{\text{normalization}} \exp\left(-\frac{1}{2}\left(\frac{x - \mu_i}{\sigma_i}\right)^2\right)$$

For higher dimensions, need a mean vector μ_i and a d × d covariance matrix Σ_i


[Figure: 2-D Gaussian density surface]

Fit more complex distributions with mixtures...
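
A minimal sketch of the single-Gaussian-per-class recipe, with made-up 1-D training data (scipy assumed available):

```python
# Minimal sketch (made-up data): fit a 1-D Gaussian per class, then pick the MAP class.
import numpy as np
from scipy.stats import norm

# Hypothetical labeled training data: {class: 1-D feature samples}
data = {
    1: np.array([1.0, 1.2, 0.8, 1.1]),
    2: np.array([2.9, 3.1, 3.0, 3.3]),
}
priors = {1: 0.5, 2: 0.5}

# Estimate mean and std per class (the "few parameters" of the assumed form)
params = {c: (x.mean(), x.std(ddof=1)) for c, x in data.items()}

def map_class(x):
    post = {c: priors[c] * norm.pdf(x, loc=mu, scale=sigma)
            for c, (mu, sigma) in params.items()}
    return max(post, key=post.get)

print(map_class(1.4))  # -> 1
print(map_class(2.5))  # -> 2
```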



Gaussian models for formant data

Single Gaussians are a reasonable fit for this data

Extrapolation of the decision boundaries can be surprising


Outline
1 Pattern Recognition for Sounds
2 Speech Recognition
- How it's done
- What works, and what doesn't
3 Other Audio Applications
4 Observations and Conclusions


How to recognize speech?

Cross correlate templates?


- waveform?
- spectrogram?
- time-warp problems

Classify short segments as phones (or ...), handle time-warp later
- model with slices of ~10 ms
- pseudo-piecewise-stationary model of words:
[Figure: spectrogram (0-4000 Hz, ~0.45 s) segmented into piecewise-stationary phone slices: sil g w eh n sil]
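
For reference, the classic template-matching answer to the time-warp problem is dynamic time warping (DTW); the slides' HMM approach handles time-warp differently, but a minimal DTW sketch shows the idea:

```python
# Minimal DTW sketch (not the slides' HMM approach): align two feature sequences
# of different lengths by dynamic programming over a local-distance matrix.
import numpy as np

def dtw_cost(a, b):
    """a, b: (T, d) feature arrays (e.g. MFCC frames). Returns the total alignment cost."""
    T, U = len(a), len(b)
    D = np.full((T + 1, U + 1), np.inf)
    D[0, 0] = 0.0
    for t in range(1, T + 1):
        for u in range(1, U + 1):
            local = np.linalg.norm(a[t - 1] - b[u - 1])
            D[t, u] = local + min(D[t - 1, u], D[t, u - 1], D[t - 1, u - 1])
    return D[T, U]

# Toy 1-D "templates": same shape, different speaking rate
slow = np.array([[0.], [1.], [2.], [2.], [1.], [0.]])
fast = np.array([[0.], [1.], [2.], [1.], [0.]])
print(dtw_cost(slow, fast))  # small cost despite the length mismatch
```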

Speech Recognizer Architecture

Almost all current systems are the same:

sound → Feature calculation → feature vectors
feature vectors → Acoustic classifier (acoustic model parameters) → phone probabilities
phone probabilities → HMM decoder (word models, e.g. "s" "ah" ...; language model, e.g. p("sat"|"the","cat"), p("saw"|"the","cat")) → phone / word sequence
phone / word sequence → Understanding / application...
(acoustic model, word model, and language model parameters are all trained from DATA)

Biggest source of improvement is the increase in training data
- ... along with algorithms to take advantage of it
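
To make the HMM-decoder box concrete, here is a bare-bones Viterbi sketch over made-up per-frame phone probabilities (a real decoder adds word models, a language model, and pruning):

```python
# Bare-bones Viterbi sketch: made-up per-frame phone probabilities and a simple
# left-to-right transition model (stay in the current phone or move to the next).
import numpy as np

phones = ["sil", "g", "w", "eh", "n"]
n = len(phones)

# Hypothetical acoustic-classifier outputs: p(phone | frame), shape (n_frames, n_phones)
logprob = np.log(np.array([
    [0.80, 0.05, 0.05, 0.05, 0.05],
    [0.10, 0.60, 0.10, 0.10, 0.10],
    [0.10, 0.10, 0.60, 0.10, 0.10],
    [0.05, 0.05, 0.10, 0.70, 0.10],
    [0.10, 0.05, 0.05, 0.10, 0.70],
]))

logtrans = np.full((n, n), -np.inf)           # disallow all moves by default
for i in range(n):
    logtrans[i, i] = np.log(0.5)              # stay
    if i + 1 < n:
        logtrans[i, i + 1] = np.log(0.5)      # advance to the next phone
logtrans[n - 1, n - 1] = 0.0                  # final phone: stay with probability 1

def viterbi(logprob, logtrans):
    T, n = logprob.shape
    delta = np.zeros((T, n))
    back = np.zeros((T, n), dtype=int)
    delta[0] = logprob[0]                     # flat initial distribution (constant dropped)
    for t in range(1, T):
        for j in range(n):
            scores = delta[t - 1] + logtrans[:, j]
            back[t, j] = int(np.argmax(scores))
            delta[t, j] = scores[back[t, j]] + logprob[t, j]
    state = int(np.argmax(delta[-1]))
    path = [state]
    for t in range(T - 1, 0, -1):
        state = int(back[t, state])
        path.append(state)
    return [phones[i] for i in reversed(path)]

print(viterbi(logprob, logtrans))             # ['sil', 'g', 'w', 'eh', 'n'] for these numbers
```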

Speech: Progress

Annual NIST evaluations

[Figure: word error rate on annual NIST evaluations, 1990-2005 (log scale, 3%-30% range shown)]

- steady progress (?), but still order(s) of magnitude worse than human listeners

Speech: Problems

Natural, spontaneous speech is weird!

- coarticulation
- deletions
- disfluencies
- is word transcription even a sensible approach?

Other major problems


- speaking style, rate, accent
- environment / background...


Speech: What works, what doesn't

What works:
Techniques:
- MFCC features + GMM/HMM systems trained with Baum-Welch (EM)
- Using lots of training data
Domains:
- Controlled, low noise environments
- Constrained, predictable contexts
- Motivated, co-operative users

What doesn't work:
Techniques:
- rules based on insight
- perceptual representations (except when they do...)
Domains:
- spontaneous, informal speech
- unusual accents, voice quality, speaking style
- variable, high-noise background / environment


Outline
1 Pattern Recognition for Sounds
2 Speech Recognition
3 Other Audio Applications
- Meeting recordings
- Alarm sounds
- Music signal processing
4 Observations and Conclusions


Other Audio Applications: ICSI Meeting Recordings corpus

Real meetings, 16 channel recordings, 80 hrs

- released through NIST/LDC

Classification e.g.: detecting emphasized utterances based on f0 contour (Kennedy & Ellis 03)
- per-speaker normalized f0 as a unidimensional feature → simple threshold classification (see the sketch below)
[Figure: f0 histograms (f0 / Hz, log axis) for Speaker 1 and Speaker 2]
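
A rough sketch of this kind of per-speaker f0 normalization and thresholding, with made-up f0 values and a hypothetical threshold (not the published Kennedy & Ellis system):

```python
# Minimal sketch with made-up f0 tracks: z-score f0 per speaker, flag "emphasized"
# utterances whose mean normalized f0 exceeds a (hypothetical) threshold.
import numpy as np

def emphasized(utterance_f0, speaker_f0, threshold=1.0):
    """utterance_f0: f0 values (Hz) of one utterance; speaker_f0: all f0 for that speaker."""
    mu, sigma = np.mean(speaker_f0), np.std(speaker_f0)
    z = (np.mean(utterance_f0) - mu) / sigma
    return z > threshold

speaker_f0 = np.random.default_rng(0).normal(120, 20, 500)   # this speaker's usual range
print(emphasized(np.array([170, 180, 175]), speaker_f0))     # True: well above typical f0
print(emphasized(np.array([118, 122, 119]), speaker_f0))     # False: near the speaker's mean
```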

Personal Audio

LifeLog / MyLifeBits / Remembrance Agent:
- easy to record everything you hear

Then what?
- prohibitive to review
- applications if access easier?

Automatic content analysis / indexing...

[Figure: long-duration spectrogram of personal audio (freq / Bark) over several hours (time / min; clock time 14:30-18:30)]

- find features to classify into e.g. locations



Alarm sound detection


Alarm sounds have particular structure
- clear even at low SNRs
- potential applications...
[Figure: spectrogram (freq / kHz), "Restaurant + alarms (snr 0 ns 6 al 8)"]

Contrast two systems: (Ellis 01)


- standard, global features, P(X|M)
- sinusoidal model, fragments, P(M,S|Y)
[Figure: MLP classifier output vs. sound object classifier output (freq / kHz vs. time / sec, 20-50 s)]

- error rates high, but interesting comparisons...



Music signal modeling

Use machine listener to navigate large music collections
- e.g. unsigned bands on MP3.com

Classification to label:
- notes, chords, singing, instruments
- .. information to help cluster music

Artist models based on feature distributions

[Figure: per-artist distributions of cepstral coefficients (third vs. fifth cepstral coefficient) for madonna and bowie, with genre regions such as Electronica and Country]

Measure similarity between users' collections and new music? (Berenzweig & Ellis 03)
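
One way to sketch the "artist model" idea is a Gaussian mixture per artist over cepstral-coefficient frames, scoring new audio against each model; scikit-learn is assumed, and the data here are random stand-ins, not the published system:

```python
# Minimal sketch: per-artist GMMs over MFCC-like frames, then pick the best-scoring artist.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical training frames per artist: (n_frames, n_cepstral_coefs)
frames = {
    "madonna": rng.normal(0.0, 1.0, size=(500, 13)),
    "bowie":   rng.normal(0.5, 1.2, size=(500, 13)),
}

models = {a: GaussianMixture(n_components=8, covariance_type="diag").fit(X)
          for a, X in frames.items()}

new_track = rng.normal(0.5, 1.2, size=(200, 13))             # frames from an unlabeled track
scores = {a: m.score(new_track) for a, m in models.items()}  # mean log-likelihood per frame
print(max(scores, key=scores.get))                           # -> "bowie" for this toy data
```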

Outline
1 Pattern Recognition for Sounds
2 Speech Recognition
3 Other Audio Applications
4 Observations and Conclusions
- Model complexity
- Sound mixtures


Observations and Conclusions: Training and test data

Balance model/data size to avoid overfitting:

[Figure: error rate vs. amount of training (or number of parameters): training-data error keeps falling while test-data error rises again once overfitting sets in]

Diminishing returns from more data:

[Figure: WER for PLP12N-8k nets vs. net size (hidden layer: 500-4000 units) and training data (9.25-74 hours); WER roughly 32-44%, with directions for "more training data" and "more parameters", a constant-training-time contour, and the optimal parameter/data ratio]

Beyond classification

No free lunch:
Classifier can only do so much
- always need to consider other parts of system

Features
- impose ceiling on system performance
- improved features allow simpler classifiers

Segmentation / mixtures
- e.g. speech-in-noise: only a subset of feature dimensions is available, so use missing-data approaches... (see the sketch below)

[Diagram: sources S1(t) and S2(t) mixing into the observation Y(t)]
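
A hedged sketch of the missing-data idea for a diagonal-covariance Gaussian class model: score only the feature dimensions judged reliable, so marginalizing a masked dimension just drops its factor (illustrative only, with made-up parameters and mask):

```python
# Minimal sketch: class likelihood from reliable feature dimensions only
# (diagonal Gaussian, so marginalizing a dimension = dropping its factor).
import numpy as np
from scipy.stats import norm

mu = np.array([1.0, 5.0, 2.0, 0.0])      # class mean (hypothetical)
sigma = np.array([0.5, 1.0, 0.8, 0.3])   # per-dimension std (diagonal covariance)

x = np.array([1.1, 9.0, 2.2, 0.1])       # observation; dimension 1 is noise-corrupted
reliable = np.array([True, False, True, True])  # mask from a (hypothetical) noise estimate

full_ll = norm.logpdf(x, mu, sigma).sum()
missing_data_ll = norm.logpdf(x[reliable], mu[reliable], sigma[reliable]).sum()
print(full_ll, missing_data_ll)  # the corrupted dimension no longer drags the score down
```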

Summary

Statistical Pattern Recognition
- exploit training data for probabilistically-correct classifications

Speech recognition
- successful application of statistical PR
- .. but many remaining frontiers

Other audio applications


- meetings, alarms, music
- classification is information extraction

Current challenges
- variability in speech
- acoustic mixtures


Extra slides


Neural network classifiers


Instead of estimating p(x|i) and using Bayes' rule, can also try to estimate the posteriors Pr(i|x) directly (the decision boundaries)

Sums over nonlinear functions of sums give a large range of decision surfaces...

e.g. Multi-layer perceptron (MLP):

$$y_k = F\Big[\sum_j w_{jk}\, F\Big[\sum_i w_{ij}\, x_i\Big]\Big]$$
[Diagram: MLP with inputs x1, x2, x3 (input layer), hidden units h1, h2 via weights wij and nonlinearity F[·] (hidden layer), and outputs y1, y2 via weights wjk (output layer)]
Problem is finding the weights wij ... (training)
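
A bare-bones forward pass matching the formula above, with random weights and a sigmoid as an assumed choice of F (training to find the weights is not shown):

```python
# Minimal sketch of the MLP forward pass y_k = F[ sum_j w_jk F[ sum_i w_ij x_i ] ].
import numpy as np

def F(a):
    return 1.0 / (1.0 + np.exp(-a))   # sigmoid nonlinearity (one possible choice of F)

rng = np.random.default_rng(0)
W_ij = rng.normal(size=(3, 2))        # input (i = 3) -> hidden (j = 2) weights
W_jk = rng.normal(size=(2, 2))        # hidden (j = 2) -> output (k = 2) weights

x = np.array([0.2, -0.5, 1.0])        # feature vector
h = F(x @ W_ij)                       # hidden activations h_j
y = F(h @ W_jk)                       # outputs y_k (posterior estimates after training)
print(y)
```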

Neural net classifier

Models boundaries, not density p(x|i)

Discriminant training
- concentrate on boundary regions
- needs to see all classes at once


Why is Speech Recognition hard?

Why not match against a set of waveforms?


- waveforms are never (nearly!) the same twice
- speakers minimize information/effort in speech

Speech variability comes from many sources:


- speaker-dependent (SD) recognizers must
handle within-speaker variability
- speaker-independent (SI) recognizers must also
deal with variation between speakers
- all recognizers are afflicted by background noise,
variable channels

Need recognition models that:


- generalize i.e. accept variations in a range, and
- adapt i.e. tune in to a particular variant


Within-speaker variability

Timing variation:
- word duration varies enormously

[Figure: spectrogram (0-4000 Hz, ~3 s) with phone- and word-level alignment of a fast, casual utterance; words include SO, I, THOUGHT, ABOUT, THAT, AND, I, THINK, IT'S, STILL, POSSIBLE]

- fast speech reduces vowels

Speaking style variation:


- careful/casual articulation
- soft/loud speech

Contextual effects:
- speech sounds vary with context, role:
How do you do?


Between-speaker variability

Accent variation
- regional / mother tongue

Voice quality variation


- gender, age, huskiness, nasality

Individual characteristics
- mannerisms, speed, prosody

[Figure: spectrograms (0-8000 Hz, ~3 s) for two speakers, mbma0 and fjdm2]

Environment variability

Background noise
- fans, cars, doors, papers

Reverberation
- boxiness in recordings

Microphone channel
- huge effect on relative spectral gain

[Figure: spectrograms (0-4000 Hz vs. time / s) of speech from a close mic and a tabletop mic]
