
Speech Technology: A Practical Introduction
Topic: Spectrogram, Cepstrum and Mel-Frequency Analysis
Kishore Prahallad
Email: skishore@cs.cmu.edu
Carnegie Mellon University & International Institute of Information Technology Hyderabad
Speech Technology - Kishore Prahallad (skishore@cs.cmu.edu)
Topics
• Spectrogram
• Cepstrum
• Mel-Frequency Analysis
• Mel-Frequency Cepstral Coefficients

Spectrogram

Speech signal represented as a sequence of spectral vectors

[Figure: the speech waveform is cut into short frames; an FFT of each frame gives one spectrum (amplitude vs. frequency in Hz)]

• Rotate each spectrum by 90 degrees, so that frequency (Hz) runs along the vertical axis.
• Map spectral amplitude to a grey level (0-255) value. 0 represents black and 255 represents white.
• The higher the amplitude, the darker the corresponding region.
• Place the resulting columns side by side in time order: time runs along the horizontal axis, frequency (Hz) along the vertical axis.
• This time vs. frequency representation of a speech signal is referred to as a spectrogram.
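The construction above (frame, FFT, rotate, map to grey level) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the frame length, hop size, Hann window, dB scaling and 80 dB display range are choices made here, not values given on the slides, and `spectrogram_grey` is a hypothetical helper name.

```python
import numpy as np

def spectrogram_grey(signal, frame_len=256, hop=128):
    """Return a (freq_bins, n_frames) uint8 image: 0 = black, 255 = white."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)  # assumed taper; the slides do not name one
    columns = []
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len] * window
        columns.append(np.abs(np.fft.rfft(frame)))  # one spectral vector per frame
    amp = np.array(columns).T  # "rotate": frequency on rows, time on columns
    level = 20.0 * np.log10(amp + 1e-10)  # dB scale (assumption); floor avoids log(0)
    level = np.clip(level, level.max() - 80.0, None)  # keep the top 80 dB (assumption)
    norm = (level - level.min()) / (level.max() - level.min())
    return np.round(255.0 * (1.0 - norm)).astype(np.uint8)  # high amplitude -> dark

# Hypothetical test signal: a 1 kHz tone sampled at 8 kHz for one second.
fs = 8000
t = np.arange(fs) / fs
image = spectrogram_grey(np.sin(2 * np.pi * 1000 * t))
```

With these settings the tone falls on frequency bin 32 (1000 Hz × 256 / 8000), so that row of `image` is the darkest one in every column.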
Some Real Spectrograms

Dark regions indicate peaks (formants) in the spectrum.

Why we care about spectrograms
• Phones and their properties are better observed in a spectrogram.
• Sounds can be identified much better by the formants and by their transitions.
• Hidden Markov Models implicitly model these spectrograms to perform speech recognition.

Usefulness of Spectrogram
• Time-frequency representation of the speech signal
• A spectrogram is a tool to study speech sounds (phones)
• Phones and their properties are visually studied by phoneticians
• Hidden Markov Models implicitly model spectrograms for speech-to-text systems
• Useful for evaluation of text-to-speech systems
– A high-quality text-to-speech system should produce synthesized speech whose spectrograms nearly match those of natural sentences.

Cepstral Analysis

A Sample Speech Spectrum

[Figure: speech spectrum, dB vs. Frequency (Hz)]

• Peaks denote dominant frequency components in the speech signal
• Peaks are referred to as formants
• Formants carry the identity of the sound

What do we want to extract? The Spectral Envelope
• Formants and a smooth curve connecting them
• This smooth curve is referred to as the spectral envelope

[Figure: spectrum with its envelope, dB vs. Frequency (Hz)]

Spectral Envelope
[Figure: three panels. Spectrum: log X[k]. Spectral envelope: log H[k]. Spectral details: log E[k].]

log X[k] = log H[k] + log E[k]

1. Our goal: we want to separate the spectral envelope and the spectral details from the spectrum.
2. That is, given log X[k], obtain log H[k] and log E[k] such that log X[k] = log H[k] + log E[k].

How to achieve this separation?

Play a Mathematical Trick
• Trick: take the FFT of the spectrum!
• An FFT applied to a spectrum is referred to as an inverse FFT (IFFT).
• Note: we are dealing with the spectrum in the log domain (part of the trick).
• The IFFT of the log spectrum represents the signal on a pseudo-frequency axis.

[Figure: the spectrum, the spectral envelope and the spectral details, each shown next to its IFFT on a pseudo-frequency axis that is split into a low-frequency region and a high-frequency region]

• Spectral envelope: treat this curve as a sine wave with 4 cycles per sec. Its IFFT gives a peak at 4 Hz on the pseudo-frequency axis, i.e. in the low-frequency region.
• Spectral details: treat this curve as a sine wave with 100 cycles per sec. Its IFFT gives a peak at 100 Hz on the pseudo-frequency axis, i.e. in the high-frequency region.
• Spectrum: it is the sum of the two curves in the log domain, so its IFFT shows both peaks, one in each region.

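The sine-wave picture can be checked numerically. The sketch below builds a made-up "log spectrum" as a slow cosine plus a fast cosine and takes its IFFT. The array length and amplitudes are assumptions chosen for illustration, and "4" and "100" are counted here as cycles across the whole frequency axis (that is, as bin indices), which stands in for the slide's "cycles per sec".

```python
import numpy as np

n = 1024                                           # points along the frequency axis (assumption)
k = np.arange(n)
envelope = 1.0 * np.cos(2 * np.pi * 4 * k / n)     # slow curve: 4 cycles across the axis
details = 0.3 * np.cos(2 * np.pi * 100 * k / n)    # fast curve: 100 cycles across the axis
log_spectrum = envelope + details                  # log X[k] = log H[k] + log E[k]

pseudo = np.abs(np.fft.ifft(log_spectrum))         # magnitude on the pseudo-frequency axis
half = pseudo[: n // 2]                            # the second half mirrors the first
peaks = np.sort(np.argsort(half)[-2:])             # positions of the two largest values
```

Here `peaks` holds the positions 4 and 100: the envelope lands in the low region of the pseudo-frequency axis and the details in the high region, so the two can be separated simply by position.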
log X[k] = log H[k] + log E[k]
Taking the IFFT of both sides:
x[k] = h[k] + e[k]

• In practice, all you have access to is log X[k], and hence you can obtain x[k].
• If you know x[k], filter the low-frequency region to get h[k].
• x[k] is referred to as the cepstrum.
• h[k] is obtained by considering the low-frequency region of x[k].
• h[k] represents the spectral envelope and is widely used as a feature for speech recognition.

Cepstral Analysis
X[k] = H[k] E[k]
|| X[k] || = || H[k] || || E[k] ||
where || . || denotes magnitude.
Taking the log of both sides:
log || X[k] || = log || H[k] || + log || E[k] ||
Taking the inverse FFT of both sides:
x[k] = h[k] + e[k]

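A minimal NumPy sketch of these steps follows. The helper names `real_cepstrum` and `envelope_from_cepstrum`, the cutoff `n_keep=30` and the synthetic signals are all assumptions made for illustration. The test frame is a decaying exponential (playing the role of H) circularly convolved with an impulse plus a delayed echo (playing the role of E), so that X[k] = H[k] E[k] holds exactly.

```python
import numpy as np

def real_cepstrum(frame):
    """x[k] = IFFT(log ||X[k]||) for one frame."""
    log_mag = np.log(np.abs(np.fft.fft(frame)) + 1e-10)  # small floor avoids log(0)
    return np.fft.ifft(log_mag).real

def envelope_from_cepstrum(cepstrum, n_keep=30):
    """Keep the low region h[k] of the cepstrum and return log H[k]."""
    lifter = np.zeros(len(cepstrum))
    lifter[:n_keep] = 1.0
    lifter[len(cepstrum) - n_keep + 1:] = 1.0  # mirrored half keeps the result real
    return np.fft.fft(cepstrum * lifter).real

# Hypothetical frame built so that X[k] = H[k] E[k].
n = 512
h_sig = 0.9 ** np.arange(n)          # slowly varying spectrum (envelope)
e_sig = np.zeros(n)
e_sig[0] = 1.0
e_sig[80] = 0.5                      # echo at lag 80 adds rapid spectral ripple
x_sig = np.fft.ifft(np.fft.fft(h_sig) * np.fft.fft(e_sig)).real

c_x = real_cepstrum(x_sig)
c_sum = real_cepstrum(h_sig) + real_cepstrum(e_sig)   # h[k] + e[k]
log_env = envelope_from_cepstrum(c_x)                 # estimate of log ||H[k]||
```

In this example `c_x` matches `c_sum` to within numerical error, the echo shows up as a peak at position 80 of the cepstrum, and keeping only the first 30 positions recovers a close approximation of log || H[k] ||.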
Mel-Frequency Analysis

Review: What we did
• We captured the spectral envelope (the curve connecting all formants)
• BUT: perceptual experiments say the human ear concentrates on certain regions rather than using the whole of the spectral envelope.

[Figure: spectral envelope, dB vs. Frequency (Hz)]

Mel-Frequency Analysis
• Mel-frequency analysis of speech is based on human perception experiments
• It is observed that the human ear acts as a filter
– It concentrates on only certain frequency components
• These filters are non-uniformly spaced on the frequency axis
– More filters in the low-frequency regions
– Fewer filters in the high-frequency regions

Mel-Frequency Filters
[Figure: mel filter bank along the frequency axis]
More filters in the low-frequency region, fewer filters in the high-frequency region.
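The non-uniform spacing can be reproduced with a short sketch. The slides do not give a formula for the mel scale or a filter shape, so the mapping mel = 2595 log10(1 + f/700) and the triangular shape used below are assumptions (both are common choices), as are the function names, the filter count and the sampling rate.

```python
import numpy as np

def hz_to_mel(f_hz):
    # One widely used mel formula (assumption; not stated on the slides).
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters equally spaced in mel; shape (n_filters, n_fft // 2 + 1)."""
    edges_mel = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    edges_hz = mel_to_hz(edges_mel)
    freqs = np.arange(n_fft // 2 + 1) * fs / n_fft  # FFT bin frequencies in Hz
    bank = np.zeros((n_filters, len(freqs)))
    for i in range(n_filters):
        lo, mid, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank, edges_hz[1:-1]  # filters and their centre frequencies

bank, centres = mel_filterbank(n_filters=20, n_fft=512, fs=8000)
low_count = int(np.sum(centres < 1000.0))    # filters centred below 1 kHz
high_count = int(np.sum(centres >= 3000.0))  # filters centred in the top 1 kHz
```

With these settings, 9 of the 20 filters are centred below 1 kHz and only 2 in the 3-4 kHz band, which is the crowding toward low frequencies that the slide describes.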

Mel-Frequency Cepstral Coefficients (MFCC)
• Spectrum → Mel-Filters → Mel-Spectrum
• Say log X[k] = log (Mel-Spectrum)
• NOW perform cepstral analysis on log X[k]
– log X[k] = log H[k] + log E[k]
– Taking the IFFT
– x[k] = h[k] + e[k]
• The cepstral coefficients h[k] obtained for the Mel-spectrum are referred to as Mel-Frequency Cepstral Coefficients, often denoted by MFCC
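The chain Spectrum → Mel-Filters → log → cepstral analysis can be sketched for a single frame as follows. Several things here are assumptions rather than the slides' method: the mel formula and triangular filters, the Hann window, the use of the power spectrum, the numbers of filters and coefficients, and above all the final transform. The slide takes an IFFT; this sketch swaps in a type-II cosine transform (DCT), a common practical substitute, written out explicitly so that only NumPy is needed.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters equally spaced on the (assumed) mel scale."""
    to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = to_hz(np.linspace(to_mel(0.0), to_mel(fs / 2.0), n_filters + 2))
    freqs = np.arange(n_fft // 2 + 1) * fs / n_fft
    bank = np.zeros((n_filters, len(freqs)))
    for i in range(n_filters):
        rising = (freqs - edges[i]) / (edges[i + 1] - edges[i])
        falling = (edges[i + 2] - freqs) / (edges[i + 2] - edges[i + 1])
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def mfcc_frame(frame, fs, n_filters=20, n_coeffs=13):
    """Spectrum -> Mel-Filters -> log -> cosine transform, for one frame."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame * np.hanning(n_fft))) ** 2
    mel_spectrum = mel_filterbank(n_filters, n_fft, fs) @ power
    log_mel = np.log(mel_spectrum + 1e-10)  # small floor avoids log(0)
    # DCT-II basis written out explicitly; it stands in for the IFFT on the slide.
    m = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), m + 0.5) / n_filters)
    return basis @ log_mel  # low-order coefficients, playing the role of h[k]

# Hypothetical frame: seeded random noise standing in for 32 ms of speech at 8 kHz.
rng = np.random.default_rng(0)
frame = rng.standard_normal(256)
coeffs = mfcc_frame(frame, fs=8000)
louder = mfcc_frame(2.0 * frame, fs=8000)
```

One property worth noting in this sketch: scaling the frame changes only coefficient 0, because a gain adds the same constant to every log mel value and every other cosine basis row sums to zero.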

Speech signal represented as a sequence of cepstral vectors

[Figure: each frame goes through FFT → Spectrum → Mel-Filters → Cepstral Analysis, giving one cepstral vector per frame]

Why we are going to use MFCC
• Speech synthesis
– Used for joining two speech segments S1 and S2
– Represent S1 as a sequence of MFCCs
– Represent S2 as a sequence of MFCCs
– Join at the point where the MFCCs of S1 and S2 have minimal Euclidean distance
• Speech recognition
– MFCCs are the most widely used features in state-of-the-art speech recognition systems
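The join rule for synthesis can be illustrated with a small sketch. It assumes every frame of S1 may be paired with every frame of S2, which the slides do not specify (a real system would likely restrict the search to frames near the segment boundaries). The function name `best_join` and the toy 2-dimensional vectors are hypothetical.

```python
import numpy as np

def best_join(mfcc_s1, mfcc_s2):
    """Return (i, j, distance) for the closest pair: frame i of S1, frame j of S2."""
    diff = mfcc_s1[:, None, :] - mfcc_s2[None, :, :]   # all frame pairs
    dist = np.sqrt(np.sum(diff ** 2, axis=2))          # Euclidean distances
    i, j = np.unravel_index(np.argmin(dist), dist.shape)
    return int(i), int(j), float(dist[i, j])

# Hypothetical 2-dimensional "MFCC" sequences, small enough to check by hand.
s1 = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
s2 = np.array([[5.0, 5.0], [2.0, 2.5], [9.0, 9.0]])
i, j, d = best_join(s1, s2)
```

In this toy case the closest pair is frame 2 of S1 and frame 1 of S2, at distance 0.5, so the join would be made there.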

Summary: Process of Feature Extraction
• Speech is analyzed over short analysis windows
• For each short analysis window, a spectrum is obtained using the FFT
• The spectrum is passed through Mel-filters to obtain the Mel-spectrum
• Cepstral analysis is performed on the Mel-spectrum to obtain Mel-Frequency Cepstral Coefficients
• Thus speech is represented as a sequence of cepstral vectors
• It is these cepstral vectors that are given to pattern classifiers for speech recognition

Additional Reading
• Chapter 6
– Pg: 273 – 281
– Pg: 304 – 311
– Pg: 314 - 316

