2003-11-13 - 1 / 25
Dan Ellis
Pattern classification
[Figure: spectrogram, frequency (0 to 4000) vs. time/s, with segments labeled "ay" and "ao"]
Issues
- finding segments to classify
- transforming to an appropriate feature space
- defining the class boundaries
[Figure: vowel tokens (marked +, o, x) plotted in formant space, axes F1 / Hz and F2 / Hz, with a new observation x to classify]
Sensor
-> signal
Pre-processing/segmentation (STFT, locate vowels)
-> segment
Feature extraction (formant extraction)
-> feature vector
Classification
-> class
Post-processing (context constraints, costs/risk)
Feature extraction
[Figure: Mel spectrogram, freq. channel vs. time / sec]
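The mel-spectrogram figure spaces its frequency channels on a mel scale. As a rough illustration only, the sketch below places channel center frequencies uniformly in mel. The conversion formula used (2595 log10(1 + f/700)) is one common variant and is an assumption here, as are the channel count, the frequency range, and the helper names `hz_to_mel`, `mel_to_hz`, and `mel_centers`; the slides do not say which variant or settings were used.

```python
import math

def hz_to_mel(f_hz):
    # One common mel formula (assumed; the slides do not specify a variant).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_centers(n_channels, f_min_hz, f_max_hz):
    """Center frequencies (Hz) of n_channels bands spaced uniformly in mel."""
    lo, hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (hi - lo) / (n_channels + 1)
    return [mel_to_hz(lo + step * (k + 1)) for k in range(n_channels)]

# Hypothetical settings: 30 channels between 0 and 4000 Hz.
centers = mel_centers(30, 0.0, 4000.0)
```

The centers come out closely spaced at low frequencies and increasingly far apart at high frequencies, which is the point of the warping.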
Statistical Interpretation
- class i: (hidden), discrete
- observation x: continuous
- class-conditional densities p(x|i); posteriors Pr(i|x)
[Figure: densities p(x|i) along x, with labels 1 to 4]
$$\Pr(i \mid x) = \frac{p(x \mid i)\,\Pr(i)}{\sum_j p(x \mid j)\,\Pr(j)}$$
- \Pr(i): prior probability
- \sum_j p(x \mid j)\,\Pr(j): evidence = p(x)
- \Pr(i \mid x): posterior probability
$$\hat{\imath} = \arg\max_i \Pr(i \mid x)$$
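A minimal sketch of the posterior computation and the argmax decision, assuming the class-conditional likelihoods p(x|i) have already been evaluated at the observed x. The function names and the numbers in the example are hypothetical, chosen only for illustration.

```python
def posteriors(likelihoods, priors):
    """Return Pr(i|x) for each class i, given p(x|i) and Pr(i)."""
    joint = [l * p for l, p in zip(likelihoods, priors)]  # p(x|i) Pr(i)
    evidence = sum(joint)                                 # p(x) = sum_j p(x|j) Pr(j)
    return [j / evidence for j in joint]

def classify(likelihoods, priors):
    """Index of the class with the largest posterior."""
    post = posteriors(likelihoods, priors)
    return max(range(len(post)), key=post.__getitem__)

# Hypothetical two-class example: class 0 is rarer but fits x much better.
post = posteriors(likelihoods=[0.5, 0.1], priors=[0.2, 0.8])
best = classify(likelihoods=[0.5, 0.1], priors=[0.2, 0.8])
```

Because the evidence p(x) is the same for every class, the argmax could equally be taken over the unnormalized products p(x|i) Pr(i).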
Practical implementation
Optimal classifier is \hat{\imath} = \arg\max_i \Pr(i \mid x), but we don't know \Pr(i \mid x)
Labeled training examples {x_n, class of x_n}
-> sort according to class
-> estimate conditional pdf for class 1
Gaussian models
$$p(x \mid i) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left[-\frac{1}{2}\left(\frac{x - \mu_i}{\sigma_i}\right)^2\right]$$
- the leading factor 1/(\sqrt{2\pi}\,\sigma_i) is the normalization
[Figure: Gaussian density plot]
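The training recipe (sort labeled examples by class, estimate a conditional pdf per class) combined with the Gaussian model can be sketched as follows for one-dimensional features. This is an illustration under assumptions: maximum-likelihood estimates of mean and variance, priors taken from class frequencies, and made-up toy data; the function names are hypothetical.

```python
import math
from collections import defaultdict

def fit_gaussians(samples, labels):
    """Estimate (mean, std, prior) per class from labeled 1-D samples."""
    by_class = defaultdict(list)
    for x, c in zip(samples, labels):
        by_class[c].append(x)                      # sort according to class
    models = {}
    for c, xs in by_class.items():
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs)  # ML variance estimate
        # Note: a class with a single sample gives std 0, which this sketch does not handle.
        models[c] = (mu, math.sqrt(var), len(xs) / len(samples))
    return models

def gaussian_pdf(x, mu, sigma):
    """1-D Gaussian density with mean mu and standard deviation sigma."""
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)     # normalization
    return norm * math.exp(-0.5 * ((x - mu) / sigma) ** 2)

def classify(x, models):
    """Class maximizing p(x|i) Pr(i), which is proportional to Pr(i|x)."""
    return max(models, key=lambda c: gaussian_pdf(x, *models[c][:2]) * models[c][2])

# Made-up toy data: one formant-like value (Hz) per example, two classes.
samples = [300.0, 320.0, 340.0, 700.0, 720.0, 740.0]
labels = ["a", "a", "a", "b", "b", "b"]
models = fit_gaussians(samples, labels)
```

With these toy values, `classify(330.0, models)` picks class "a" and `classify(690.0, models)` picks class "b".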
Outline
1 Pattern Recognition for Sounds
2 Speech Recognition
- How it's done
- What works, and what doesn't
3 Other Audio Applications
4 Observations and Conclusions
[Figure: spectrogram, freq / Hz vs. time / s]
sound
-> Feature calculation
-> feature vectors
-> Acoustic classifier
-> phone probabilities
-> HMM decoder
-> phone / word sequence
-> Understanding/application...
DATA supplies:
- Acoustic model parameters (for the acoustic classifier)
- Word models, e.g. phones "s", "ah" (for the HMM decoder)
- Language model, e.g. p("sat"|"the","cat"), p("saw"|"the","cat") (for the HMM decoder)
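The HMM decoder stage takes per-frame scores from the acoustic classifier and finds the best state sequence. A minimal Viterbi sketch is given below, assuming log-domain scores and a small fully specified transition matrix. It is a generic textbook formulation, not the decoder used in the talk, and the two-state example values are invented.

```python
import math

def viterbi(obs_loglik, log_trans, log_init):
    """Most likely state sequence for an HMM.

    obs_loglik[t][s]: log score of frame t under state s (from the classifier)
    log_trans[r][s]:  log Pr(next state s | current state r)
    log_init[s]:      log Pr(first state is s)
    """
    n_states = len(log_init)
    score = [log_init[s] + obs_loglik[0][s] for s in range(n_states)]
    backptrs = []
    for frame in obs_loglik[1:]:
        new_score, ptr = [], []
        for s in range(n_states):
            # Best predecessor for state s at this frame.
            best_prev = max(range(n_states), key=lambda r: score[r] + log_trans[r][s])
            ptr.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s] + frame[s])
        score = new_score
        backptrs.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

# Invented two-state example (e.g. two successive phones of a word model).
log = math.log
path = viterbi(
    obs_loglik=[[log(0.9), log(0.1)], [log(0.8), log(0.2)],
                [log(0.2), log(0.8)], [log(0.1), log(0.9)]],
    log_trans=[[log(0.7), log(0.3)], [log(0.1), log(0.9)]],
    log_init=[log(0.9), log(0.1)],
)
```

In a full recognizer the transition structure would come from the word models and the language model; here it is just a fixed matrix.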
Speech: Progress
[Figure: chart with values between 3% and 30% over the years 1990 to 2005]
Speech: Problems
- coarticulation
- deletions
- disfluencies
- is word transcription even a sensible approach?
Outline
1 Pattern Recognition for Sounds
2 Speech Recognition
3 Other Audio Applications
- Meeting recordings
- Alarm sounds
- Music signal processing
4 Observations and Conclusions
- per-speaker normalized f0 as unidimensional feature
- simple threshold classification
[Figure: f0 / Hz for Speaker 1 (axis 55 to 440) and Speaker 2 (axis 110 to 1760)]
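One way to read "per-speaker normalized f0" with "simple threshold classification" is sketched below. The normalization (z-score of log2 f0 using each speaker's own statistics) and the threshold value are assumptions made for illustration; the slide does not state which normalization or threshold was used, nor what event is being detected.

```python
import math

def speaker_stats(f0_values_hz):
    """Mean and standard deviation of log2(f0) for one speaker."""
    logs = [math.log2(f) for f in f0_values_hz]
    mean = sum(logs) / len(logs)
    var = sum((v - mean) ** 2 for v in logs) / len(logs)
    return mean, math.sqrt(var)

def normalized_f0(f0_hz, stats):
    """f0 expressed in standard deviations from the speaker's own mean (log scale)."""
    mean, std = stats
    return (math.log2(f0_hz) - mean) / std

def above_threshold(f0_hz, stats, threshold=1.0):
    # threshold=1.0 is a hypothetical value, not taken from the slides.
    return normalized_f0(f0_hz, stats) > threshold

# Invented pitch samples for a lower-pitched and a higher-pitched speaker.
stats_1 = speaker_stats([100.0, 110.0, 120.0, 130.0])
stats_2 = speaker_stats([200.0, 220.0, 240.0, 260.0])
```

The same 220 Hz frame is unusually high for the first speaker but ordinary for the second, so `above_threshold(220.0, stats_1)` is true while `above_threshold(220.0, stats_2)` is false.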
Personal Audio
LifeLog / MyLifeBits /
Remembrance Agent:
- easy to record everything you hear
Then what?
- prohibitive to review
- applications if access easier?
[Figure: freq / Bark vs. clock time, 14:30 to 18:30]
[Figure: spectrogram, freq / kHz vs. time/sec]
Classification to label:
- notes, chords, singing, instruments
- .. information to help cluster music
Artist models based on feature distributions
[Figure: artists such as madonna and bowie placed on axes including Electronica, Country, and third cepstral coef]
Outline
1 Pattern Recognition for Sounds
2 Speech Recognition
3 Other Audio Applications
4 Observations and Conclusions
- Model complexity
- Sound mixtures
Overfitting
[Figure: error rate vs. training or parameters, illustrating overfitting]
[Figure: WER for PLP12N-8k nets vs. net size & training data. Axes: Training set / hours (9.25, 18.5, 37, 74), Hidden layer / units (500, 1000, 2000, 4000), WER% (32 to 44). Annotations: more parameters, more training data, optimal parameter/data ratio, constant training time]
Beyond classification
No free lunch:
Classifier can only do so much
- always need to consider other parts of system
Features
- impose ceiling on system performance
- improved features allow simpler classifiers
Segmentation / mixtures
- e.g. speech-in-noise:
only subset of feature dimensions available
missing-data approaches...
[Figure: signals S1(t), S2(t), and Y(t)]
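One simple missing-data approach is to score each class using only the feature dimensions judged reliable. For a diagonal-covariance Gaussian this amounts to dropping the unreliable dimensions from the sum, as sketched below. The diagonal-Gaussian class models, the reliability mask, and all numbers are assumptions for illustration, not the method used in the talk.

```python
import math

def log_gauss_diag(x, mean, var, present):
    """Log-likelihood of x under a diagonal Gaussian, over present dimensions only."""
    total = 0.0
    for xd, md, vd, ok in zip(x, mean, var, present):
        if not ok:
            # Integrating a missing dimension out of a diagonal Gaussian
            # contributes a factor of 1, so it is simply skipped.
            continue
        total += -0.5 * (math.log(2.0 * math.pi * vd) + (xd - md) ** 2 / vd)
    return total

# Invented class models and an observation whose middle dimension is noise-dominated.
mean_a, mean_b = [0.0, 0.0, 0.0], [2.0, 2.0, 2.0]
var = [1.0, 1.0, 1.0]
x = [0.1, 5.0, -0.2]
all_dims = [True, True, True]
mask = [True, False, True]   # hypothetical reliability mask

full_prefers_b = log_gauss_diag(x, mean_b, var, all_dims) > log_gauss_diag(x, mean_a, var, all_dims)
masked_prefers_a = log_gauss_diag(x, mean_a, var, mask) > log_gauss_diag(x, mean_b, var, mask)
```

In this toy case the corrupted dimension pulls the full-vector decision toward class B, while the masked score recovers class A. Deciding which dimensions are reliable is a separate problem that this sketch does not address.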
Summary
Speech recognition
- successful application of statistical PR
- .. but many remaining frontiers
Current challenges
- variability in speech
- acoustic mixtures
Extra slides
$$y_k = F\Big[\sum_j w_{jk}\, F\Big[\sum_i w_{ij}\, x_i\Big]\Big]$$
[Figure: network with input layer (x1, x2, x3), hidden layer (h1, h2), and output layer (y1, y2); weights w_ij and w_jk, nonlinearity F[]]
Problem is finding the weights w_ij ... (training)
Audio Signal Recognition
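The forward computation y_k = F[sum_j w_jk F[sum_i w_ij x_i]] can be sketched directly. The slide does not say which nonlinearity F is used, so a logistic sigmoid is assumed here; as in the equation, there are no bias terms. The function names and the weight values are hypothetical.

```python
import math

def sigmoid(a):
    # Assumed choice for the nonlinearity F; the slides do not specify it.
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, w_ij, w_jk, F=sigmoid):
    """Two-layer network output; w_ij[i][j] is input-to-hidden, w_jk[j][k] is hidden-to-output."""
    n_hidden = len(w_ij[0])
    n_out = len(w_jk[0])
    h = [F(sum(w_ij[i][j] * x[i] for i in range(len(x)))) for j in range(n_hidden)]
    return [F(sum(w_jk[j][k] * h[j] for j in range(n_hidden))) for k in range(n_out)]

# Three inputs, two hidden units, two outputs, matching the diagram's layout.
x = [1.0, 2.0, 3.0]
w_ij = [[0.1, -0.2], [0.0, 0.3], [-0.1, 0.1]]   # made-up weights
w_jk = [[0.5, -0.5], [0.25, 0.75]]              # made-up weights
y = forward(x, w_ij, w_jk)
```

Training, that is, finding weights that make the outputs match target classes, is not shown here.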
Discriminant training
- concentrate on boundary regions
- needs to see all classes at once
Within-speaker variability
Timing variation:
- word duration varies enormously
[Figure: spectrogram (Frequency, 0 to 4000, vs. time, 0 to 3.0) with phone labels and word labels including SO, I, THOUGHT, ABOUT, THAT, AND, THINK, IT'S STILL, POSSIBLE]
Contextual effects:
- speech sounds vary with context, role: "How do you do?"
Between-speaker variability
Accent variation
- regional / mother tongue
Individual characteristics
- mannerisms, speed, prosody
[Figure: spectrograms (freq / Hz, 0 to 8000, vs. time / s) for speakers mbma0 and fjdm2]
Environment variability
Background noise
- fans, cars, doors, papers
Reverberation
- boxiness in recordings
Microphone channel
- huge effect on relative spectral gain
[Figure: spectrograms (freq / Hz, 0 to 4000, vs. time / s) for Close mic and Tabletop mic]