
Production and Classification of Speech Signals



CONTENTS
Speech Production Process
Anatomy and Physiology of Speech production
Spectrographic Analysis of Speech
Categorization of Speech Sounds
Discrete-time Model of Speech Signals

Textbook: T. F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice. PHI, 2002 (Chapter 3).

1. Speech production process

The sound sources are idealized as periodic, impulsive, or white noise and
can occur in the larynx or vocal tract.

Fig. 3.1 Simplified speech production model.



Speech Organs

The lungs act as a power supply and provide airflow to the larynx.
The larynx modulates the airflow and provides either a periodic puff-like or
a noisy airflow source to the vocal tract.

The vocal tract consists of oral, nasal, and pharynx cavities, giving the
modulated airflow its color by spectrally shaping the source.

The variation of air pressure at the lips results in a traveling sound wave
that the listener perceives as speech.

In general, there are three categories of source for speech sounds:
periodic, noisy, and impulsive, though combinations are often present.


Combining these sources with different vocal tract configurations gives
more refined speech sound classes, referred to as phonemes. The study
of phonemes is called phonemics.

The study of sound variations of phonemes that lead to the same meaning
is called phonetics.

Phonemes are the basic building blocks of a language. They are
concatenated according to certain phonemic and grammatical rules, the
study of which is called linguistics.


2. Anatomy and Physiology of Speech Production

Fig. 3.2 Cross-sectional view of the anatomy of speech production.

During speaking, we take in short spurts of air and release them steadily.
We override our rhythmic breathing by making the duration of exhaling
roughly equal to the length of a sentence or phrase, where the lung air
pressure is maintained at approximately a constant level.

2.1 The Larynx

The larynx is a complicated system of cartilages, muscles, and ligaments,
whose primary purpose in speech production is to control the vocal cords
or vocal folds.

The vocal folds are two masses of flesh, ligament, and muscle, which
stretch between the front and back of the larynx.

Fig. 3.3 Sketches of downward-looking view of human larynx (a) voicing; (b)
breathing.


The Glottis

The glottis is the slit-like orifice between the two folds.

The folds are fixed at the front of the larynx. They are free to move at the
back and sides of the larynx.

The size of the glottis is controlled in part by the arytenoid cartilages and in
part by muscles within the folds.

The tension of the folds is controlled by muscle within the folds, as well as
the cartilage around the folds.

The vocal folds, as well as the epiglottis, close during eating, and open
during breathing.


2.1.1 Voicing - How is a vowel produced?

During a vowel (e.g. a, e, i, ...), the arytenoid cartilages move toward
each other (Fig. 3.3a). The vocal folds tense up and are brought close
together. This partial closing of the glottis and increased fold tension
cause self-sustained oscillations of the folds.

Fig. 3.4 Bernoulli's principle in the glottis.



Suppose the vocal folds begin in a loose and open state.
The contraction of the lungs results in air flowing through the glottis.
The increase in tension of the folds, together with the decrease in
pressure at the glottis due to Bernoulli's principle, causes the vocal folds
to close abruptly.
Air pressure then builds behind the vocal folds as the lungs continue to
contract, forcing the folds to open.
The entire process then repeats and the result is periodic puffs of air
that enter the vocal tract.
Both horizontal and vertical movement of the folds may occur
simultaneously.


The Glottal Airflow (Glottal flow)

If we measure the airflow velocity at the glottis as a function of time, we
would obtain the following waveform.

Fig. 3.6 Illustration of periodic glottal airflow velocity.

The time interval during which the vocal folds are closed, and no flow
occurs, is referred to as the glottal closed phase.

The time interval over which there is nonzero flow and up to the maximum
of the airflow velocity is referred to as the glottal open phase, and the time
interval from the airflow maximum to the time of glottal closure is referred
to as the return phase.

The time duration of one glottal cycle is referred to as the pitch period and
the reciprocal of the pitch period is the corresponding pitch, also referred
to as the fundamental frequency.

In conversational speech, during vowel sounds, we might typically see one
to four pitch periods over the duration of the sound, although the number
of pitch periods changes with numerous factors such as stress and
speaking rate.

The pitch range is about 60 Hz to 400 Hz. Typically, males have lower pitch
than females because their vocal folds are longer and more massive.
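
The reciprocal relation between pitch period and pitch stated above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the sampling rate of 10 kHz and the helper name `pitch_period_samples` are assumptions made for the example, not values taken from the slides.

```python
def pitch_period_samples(f0_hz: float, fs_hz: float) -> float:
    """Pitch period in samples for a fundamental frequency f0_hz at sampling rate fs_hz."""
    return fs_hz / f0_hz


fs = 10000.0  # assumed sampling rate in Hz (hypothetical)

# Sweep the 60 Hz to 400 Hz pitch range quoted above.
for f0 in (60.0, 100.0, 200.0, 400.0):
    period_ms = 1000.0 / f0                  # pitch period = 1 / pitch
    period_n = pitch_period_samples(f0, fs)  # same period expressed in samples
    print(f"f0 = {f0:5.0f} Hz   period = {period_ms:6.2f} ms   ({period_n:6.1f} samples)")
```

A lower pitch means a longer period, which is why the discrete-time spacing P in the impulse-train model grows as the voice gets deeper.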


Simple mathematical model of the glottal flow


A simple model of the glottal flow above is given by the convolution of a
periodic impulse train with the glottal flow over one cycle:

$$u[n] = g[n] * p[n], \tag{2-1}$$

where $g[n]$ is the glottal flow waveform over a single cycle and
$p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$ is an impulse train with spacing $P$.

Fig. 3.7 Illustration of periodic glottal flow: (a) typical glottal flow; (b) same as
(a) with lower pitch; (c) same as (a) with softer glottal flow.
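
The convolution model u[n] = g[n] * p[n] can be sketched numerically as follows. This is a minimal illustration, not the textbook's pulse shape: the raised-cosine pulse in `glottal_pulse`, the pulse lengths, and the period of 80 samples are all hypothetical choices made so that a closed phase remains in each cycle.

```python
import numpy as np


def glottal_pulse(open_len: int, return_len: int) -> np.ndarray:
    """Hypothetical one-cycle pulse: slow raised-cosine rise, faster cosine fall."""
    rise = 0.5 * (1.0 - np.cos(np.pi * np.arange(open_len) / open_len))
    fall = np.cos(0.5 * np.pi * np.arange(return_len) / return_len)
    return np.concatenate([rise, fall])


def impulse_train(n_samples: int, period: int) -> np.ndarray:
    """p[n]: unit impulses every `period` samples, starting at n = 0."""
    p = np.zeros(n_samples)
    p[::period] = 1.0
    return p


def glottal_flow(n_samples: int, period: int, g: np.ndarray) -> np.ndarray:
    """u[n] = g[n] * p[n], truncated to n_samples."""
    return np.convolve(impulse_train(n_samples, period), g)[:n_samples]


g = glottal_pulse(open_len=40, return_len=16)  # 56 samples, shorter than the period
u = glottal_flow(n_samples=400, period=80, g=g)

# Each cycle repeats the same pulse; samples 56..79 of every cycle are the closed phase.
print(u[:8])
print("closed-phase samples are zero:", bool(np.all(u[56:80] == 0.0)))
```

Increasing `period` while keeping `g` fixed lowers the pitch, and rounding the corners of `g` gives a softer glottal flow.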

To model the finite-length speech waveform, we multiply by an analysis window
$w[n, \tau]$, centered at time $\tau$, to give

$$u[n, \tau] = w[n, \tau]\,(g[n] * p[n]). \tag{2-2}$$

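Using the multiplication and convolution theorems, the discrete-time Fourier
transform (or frequency domain description) of (2-2) is

$$U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \Big[ \sum_{k=-\infty}^{\infty} G(\omega)\, \delta(\omega - \omega_k) \Big] = \frac{1}{P} \sum_{k=-\infty}^{\infty} G(\omega_k)\, W(\omega - \omega_k, \tau), \tag{2-3}$$

where $\circledast$ denotes convolution over frequency, $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, $G(\omega)$ is the Fourier
transform of $g[n]$, $\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency or pitch.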

As the pitch period decreases, the spacing between the frequencies
$\omega_k = \frac{2\pi}{P} k$ increases (cf. Fig. 3.7b).
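
The harmonic spacing of 2π/P can be observed directly by taking the FFT of a windowed pulse train. The sketch below is a rough numerical check under stated assumptions: the Hann-shaped pulse standing in for g[n], the 10-period Hamming window, and the FFT size of 8000 are all hypothetical choices, and `harmonic_peak_bins` is a helper invented for this example.

```python
import numpy as np


def harmonic_peak_bins(period, n_periods=10, nfft=8000, n_harm=3):
    """Locate the first n_harm spectral peaks of a windowed periodic pulse train."""
    n_samples = n_periods * period
    p = np.zeros(n_samples)
    p[::period] = 1.0
    g = np.hanning(21)                     # hypothetical short pulse standing in for g[n]
    u = np.convolve(p, g)[:n_samples]      # u[n] = g[n] * p[n]
    spectrum = np.abs(np.fft.rfft(np.hamming(n_samples) * u, nfft))
    spacing = nfft / period                # FFT bins between harmonics (omega_k = 2*pi*k/P)
    peaks = []
    for k in range(1, n_harm + 1):
        center = int(round(k * spacing))
        half = int(spacing // 2)
        lo = center - half
        # Search for the strongest bin in a neighborhood of the expected harmonic.
        peaks.append(lo + int(np.argmax(spectrum[lo:center + half])))
    return peaks


print("P = 80:", harmonic_peak_bins(80))            # expected near multiples of 8000/80 = 100
print("P = 40:", harmonic_peak_bins(40, n_harm=2))  # expected near multiples of 8000/40 = 200
```

Halving the pitch period doubles the distance between the peaks, which is the behavior described above. The peaks can sit a bin or two away from the exact multiples because the sloping envelope G(ω) pulls each main lobe slightly.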



The Fourier transform of the periodic glottal waveform is characterized by
harmonics.

Typically, the spectral envelope of the harmonics, governed by $G(\omega)$, has
on average a -12 dB/octave rolloff.

In more relaxed voicing (Fig. 3.7c), the vocal folds do not close as abruptly,
and the glottal waveform has more rounded corners, with an average -15 dB/octave rolloff.

Pitch jitter (variation of the pitch period) and amplitude shimmer (variation of
the glottal flow between cycles) can occur. These phenomena help give the
vowel its naturalness, in contrast to a machine-like sound.

The extent and form of jitter and shimmer can contribute to voice
character.

A high degree of jitter results in a voice with a hoarse quality, which can be
characteristic of a particular speaker or can be created under specific
speaking conditions such as stress and fear.

2.1.2 Unvoicing?

In the unvoiced state, the folds are closer together and more tense than in
the breathing state, thus allowing for turbulence to be generated at the
folds.

Turbulence at the vocal folds is called aspiration (e.g. the h in "he"). These
sounds are sometimes called whispered sounds.

In certain voice types, aspiration normally occurs simultaneously with
voicing, resulting in a breathy voice.

Fig. 3.8 Sketches of various vocal fold configurations.



There are other forms of vocal fold movement that do not fall clearly into
any of the three states of breathing, voicing and unvoicing.
In vocal fry (Fig. 3.9.a), the folds are massy and relaxed with an abnormally
low and irregular pitch, which is characterized by secondary glottal pulses.
In diplophonia (Fig. 3.9.b), secondary glottal pulses occur between primary
pulses but within the closed phase.

Fig. 3.9 Illustration of secondary-pulse glottal flow.



2.2 The Vocal Tract

The vocal tract is comprised of the oral cavity from the larynx to the lips
and the nasal passage that is coupled to the oral tract by way of the
velum.

The vocal tract spectrally colors the source, which is important for
making perceptually distinct speech sounds.

Fig. 3.10. Illustration of changing vocal tract shapes.



2.2.1 Spectral shaping

Under certain conditions, the relationship between a glottal airflow
velocity input and vocal tract airflow velocity output can be approximated
by a linear filter with resonances, much like resonances of organ pipes
and wind instruments.

The resonance frequencies of the vocal tract are called formant
frequencies or simply formants.

Formants change with different vocal tract configurations.


When the vocal tract is modeled as a time-invariant all-pole linear system,
a pole at $z_0 = r_0 e^{j\omega_0}$ corresponds approximately to a vocal tract
formant at frequency $\omega = \omega_0$. Because the vocal tract is assumed stable,
the vocal tract transfer function can be written as:

$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})} = \sum_{k=1}^{N_i} \frac{\tilde{A}_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})} \tag{2-4}$$
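
The pole-to-formant correspondence can be illustrated by building the denominator of an all-pole H(z) from a set of formants and checking where its magnitude response peaks. This is a sketch under assumptions: the formant frequencies, the bandwidths, the 10 kHz sampling rate, and the bandwidth-to-radius rule r = exp(-π·B/fs) are illustrative choices that do not come from the slides, and both helper functions are hypothetical.

```python
import numpy as np


def allpole_denominator(formants_hz, bandwidths_hz, fs_hz):
    """Coefficients of prod_k (1 - c_k z^-1)(1 - c_k* z^-1), in powers of z^-1."""
    a = np.array([1.0])
    for f, bw in zip(formants_hz, bandwidths_hz):
        r = np.exp(-np.pi * bw / fs_hz)   # pole radius from bandwidth (common approximation)
        w0 = 2.0 * np.pi * f / fs_hz      # pole angle = formant frequency in rad/sample
        # (1 - c z^-1)(1 - c* z^-1) = 1 - 2 r cos(w0) z^-1 + r^2 z^-2
        a = np.convolve(a, [1.0, -2.0 * r * np.cos(w0), r * r])
    return a


def magnitude_response(a, n_freqs=4096, gain=1.0):
    """|H(e^{jw})| = gain / |A(e^{jw})| on n_freqs points in [0, pi)."""
    w = np.pi * np.arange(n_freqs) / n_freqs
    A = np.polyval(a[::-1], np.exp(-1j * w))  # sum_k a[k] e^{-jwk}
    return w, gain / np.abs(A)


fs = 10000.0  # assumed sampling rate in Hz
a = allpole_denominator([500.0, 1500.0, 2500.0], [60.0, 90.0, 120.0], fs)
w, mag = magnitude_response(a)
freqs_hz = w * fs / (2.0 * np.pi)

# Stability: every pole of H(z) must lie inside the unit circle.
print("max pole radius:", np.max(np.abs(np.roots(a))))

# The strongest response below 1 kHz should sit close to the first formant.
low = freqs_hz < 1000.0
print("first peak near (Hz):", freqs_hz[low][np.argmax(mag[low])])
```

Moving a pole angle moves the corresponding peak, and moving the pole radius toward 1 narrows it.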


2.2.2 Fourier Transform of speech signals after going through the vocal tract.
Assuming a periodic glottal flow source of the form

$$u[n] = g[n] * p[n], \tag{2-2}$$

the vocal tract output after passing $u[n]$ through an LTI vocal tract with
impulse response $h[n]$ is

$$x[n] = h[n] * (g[n] * p[n]). \tag{2-3}$$

The windowed version of $x[n]$ is

$$x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\}. \tag{2-4}$$

Using the convolution and multiplication theorems, one gets

$$x[n, \tau] \;\overset{\text{DTFT}}{\longleftrightarrow}\; X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \Big[ H(\omega)\, G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \Big]. \tag{2-5}$$
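
The chain from periodic source to windowed output can be sketched end to end. The example below is a toy synthesis under assumptions: the Hann-shaped pulse, the three formants with their bandwidths, the 10 kHz sampling rate, and the helper names are all hypothetical, and the all-pole filter is written as a plain recursion so that only NumPy is needed.

```python
import numpy as np


def resonator_denominator(formants_hz, bandwidths_hz, fs_hz):
    """A(z) coefficients for an all-pole H(z) = 1 / A(z), one pole pair per formant."""
    a = np.array([1.0])
    for f, bw in zip(formants_hz, bandwidths_hz):
        r = np.exp(-np.pi * bw / fs_hz)
        a = np.convolve(a, [1.0, -2.0 * r * np.cos(2.0 * np.pi * f / fs_hz), r * r])
    return a


def allpole_filter(a, x):
    """y[n] = x[n] - sum_{k>=1} a[k] y[n-k], i.e. filtering by 1 / A(z)."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = x[n]
        for k in range(1, min(len(a), n + 1)):
            acc -= a[k] * y[n - k]
        y[n] = acc
    return y


fs = 10000.0      # assumed sampling rate in Hz
period = 100      # pitch period P in samples (100 Hz pitch at this rate)
n_samples = 2000

p = np.zeros(n_samples)
p[::period] = 1.0
g = np.hanning(61)                     # hypothetical glottal pulse g[n]
u = np.convolve(p, g)[:n_samples]      # u[n] = g[n] * p[n]

a = resonator_denominator([500.0, 1500.0, 2500.0], [60.0, 90.0, 120.0], fs)
x = allpole_filter(a, u)               # x[n] = h[n] * (g[n] * p[n])

# Windowed segment x[n, tau] centered at tau = 1600, and its magnitude spectrum.
center, win_len = 1600, 400
x_tau = np.hamming(win_len) * x[center - win_len // 2 : center + win_len // 2]
X_mag = np.abs(np.fft.rfft(x_tau, 4096))
print("segment length:", len(x_tau), " spectrum bins:", len(X_mag))
```

Once the start-up transient of the filter has died away, x[n] repeats every P samples, so the spectrum of the windowed segment consists of harmonics whose heights follow the product of the source and vocal tract responses.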


Fig. 3.11 Illustration of relation of glottal source harmonics $\omega_1, \omega_2, \ldots, \omega_N$, vocal
tract formants $F_1, F_2, \ldots, F_M$, and the spectral envelope $|H(\omega)G(\omega)|$.

Generally, the frequencies of the formants decrease as the vocal tract
length increases.
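
One way to see why longer vocal tracts give lower formants is the idealized uniform-tube approximation, which is not part of the slides and is used here only as an assumption: a lossless tube of length L, closed at the glottis and open at the lips, resonates at odd multiples of c/(4L). The sound speed of 340 m/s and the tube lengths below are illustrative values.

```python
def uniform_tube_formants(length_cm, n_formants=3, c_cm_per_s=34000.0):
    """Resonances (Hz) of an idealized uniform tube closed at one end, open at the other.

    Assumption (not from the slides): F_k = (2k - 1) * c / (4 * L).
    """
    return [(2 * k - 1) * c_cm_per_s / (4.0 * length_cm) for k in range(1, n_formants + 1)]


for length in (14.0, 17.0, 20.0):  # hypothetical vocal tract lengths in cm
    formants = uniform_tube_formants(length)
    print(f"L = {length:4.1f} cm ->", [round(f) for f in formants])
```

Every resonance scales as 1/L in this approximation, so lengthening the tube shifts all formants down together. Real vocal tracts are not uniform, so this gives only the trend, not actual vowel formants.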

The spectral envelope is given by $|H(\omega)G(\omega)|$, consisting of a glottal and
vocal tract contribution.


The peaks in the spectral envelope correspond to vocal-tract formant
frequencies, $F_1, F_2, \ldots, F_M$.

A formant corresponds to the vocal tract poles, while the harmonics arise
from the periodicity of the glottal source.

Fig. 3.12 Illustration of formant movement to enhance the singing voice of a
soprano: (a) first harmonic higher than first formant frequency; (b) first
harmonic matched to first formant frequency.

2.2.3 Categorization of sound by source

Different vocal tract configurations/shapes, such as those in Fig. 3.10, can
also generate different sound sources. E.g. a complete closure of the
tract by pressing the tongue against the palate generates an impulsive
sound.

Speech sounds generated with a periodic glottal source are termed
voiced. Sounds not so generated are called unvoiced.

There are a variety of unvoiced sounds, including those created with a
noise source at an oral tract constriction. Because they come from the
friction of the moving air against the constriction, these sounds are called
fricatives, e.g. th in the word "thin".

A second unvoiced sound class is plosives, created with an impulsive
source within the oral tract (Fig. 3.10b), e.g. t in "top".


Vocal fold vibration can occur simultaneously with impulsive or noisy
sources, e.g. z in "zebra". Such sounds are called voiced fricatives.

There also exist voiced plosives. E.g. b in the word boat.

Fig. 3.13 Examples of voiced, fricative, and plosive sounds in the sentence,
"Which tea party did Baker go to?": (a) speech waveform; (b)-(d) magnified
voiced, fricative, and plosive sounds from (a).


3 Spectrographic analysis of speech

The Fourier transform of the windowed speech waveform, i.e. the short-time Fourier transform (STFT), is given by

$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau] \exp(-j\omega n), \tag{3-1}$$

where $x[n, \tau] = w[n, \tau]\, x[n]$ is the windowed speech segment as a
function of the window center at time $\tau$.

The spectrogram is a graphical display of the magnitude of the time-varying spectral characteristics and is given by

$$S(\omega, \tau) = |X(\omega, \tau)|^2, \tag{3-2}$$

which is a measure of the energy of the frequency component at
frequency $\omega$ in the neighborhood of $\tau$.
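
The spectrogram definition S(ω, τ) = |X(ω, τ)|² translates almost directly into code. The following is a minimal NumPy sketch, not a reference implementation: the Hamming window, the hop size, the FFT length, and the test tone are assumptions chosen for the example, and `spectrogram` is a helper written here rather than a library call.

```python
import numpy as np


def spectrogram(x, win_len, hop, nfft):
    """S[frame, bin] = |STFT|^2 using a Hamming window slid along x in steps of `hop`."""
    w = np.hamming(win_len)
    starts = range(0, len(x) - win_len + 1, hop)
    frames = np.stack([x[s:s + win_len] * w for s in starts])  # one row per window position
    X = np.fft.rfft(frames, n=nfft, axis=1)                    # sampled X(omega, tau)
    return np.abs(X) ** 2                                      # energy, phase is discarded


fs = 8000.0                                # assumed sampling rate in Hz
n = np.arange(8000)
x = np.cos(2.0 * np.pi * 1000.0 * n / fs)  # 1 kHz test tone standing in for speech

S = spectrogram(x, win_len=160, hop=80, nfft=512)
peak_bins = np.argmax(S, axis=1)
print("frames x bins:", S.shape)
print("peak frequency of first frame (Hz):", peak_bins[0] * fs / 512)
```

Because only the magnitude is kept, measuring time from the start of each frame instead of from the window center changes nothing in S. For real speech, each row of S is one vertical slice of the displayed spectrogram.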

Fig. 3.14 Formation of (a) the narrowband and (b) the wideband spectrograms.

The figure shows two types of spectrograms: narrowband (good spectral
resolution with a long window, e.g. 20 ms) and wideband (good time
resolution with a short window, e.g. a 4 ms Hamming window).
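
The trade-off between the two window lengths can be made concrete by measuring how strongly a harmonic stands out from the gap between harmonics. The sketch below rests on assumptions: a bare impulse train with a 200 Hz pitch at a 10 kHz sampling rate stands in for voiced speech, and `harmonic_contrast` is a hypothetical measure invented for this example.

```python
import numpy as np


def harmonic_contrast(x, center, win_ms, f0_hz, fs_hz):
    """|X| at the harmonic f0 divided by |X| midway between the harmonics f0 and 2*f0."""
    n_win = int(round(win_ms * 1e-3 * fs_hz))
    start = center - n_win // 2
    seg = x[start:start + n_win] * np.hamming(n_win)
    n = np.arange(n_win)

    def mag(f_hz):
        # Direct evaluation of the DTFT magnitude at a single frequency.
        return abs(np.sum(seg * np.exp(-2j * np.pi * f_hz * n / fs_hz)))

    return mag(f0_hz) / mag(1.5 * f0_hz)


fs = 10000.0   # assumed sampling rate in Hz
f0 = 200.0     # assumed pitch, i.e. a period of 50 samples
x = np.zeros(2000)
x[::50] = 1.0  # idealized voiced source: one impulse per pitch period

narrow = harmonic_contrast(x, center=1000, win_ms=20.0, f0_hz=f0, fs_hz=fs)
wide = harmonic_contrast(x, center=1000, win_ms=4.0, f0_hz=f0, fs_hz=fs)
print(f"20 ms window: harmonic-to-gap ratio = {narrow:.1f}")
print(f" 4 ms window: harmonic-to-gap ratio = {wide:.3f}")
```

The 20 ms window spans several pitch periods, so individual harmonics stand well above the gaps between them. The 4 ms window is shorter than one pitch period, so it sees a single pulse and the spectrum has no harmonic structure, while each pulse is resolved separately in time.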

With voiced sources, the narrowband spectrogram shows horizontal
striations of the harmonic components. The wideband spectrogram reveals
the variations of the pitch, but shows vertical striations because its
frequency resolution is too poor to resolve the harmonics.

With regard to fricatives, the STFT magnitude of noise sounds is often
called the periodogram, which is characterized by random wiggles around
the spectral envelope.

With plosive sounds, the wideband spectrogram is preferred because it
gives better temporal resolution of the sound's components.


Fig. 3.15 Comparison of measured spectrograms for the utterance, "Which tea
party did Baker go to?": (a) speech waveform; (b) wideband spectrogram; (c)
narrowband spectrogram. Figure annotations: insufficient frequency resolution;
better time resolution; fundamental and harmonics.

4 Categorization of Speech Sounds

Speech sounds can be categorized by the sound source, which can be
created with either the vocal folds or with a constriction in the vocal tract.

On the other hand, the time-varying spectral characteristics of speech can
be studied using the spectrogram.

Here we give more examples and classifications of sounds according to the
following perspectives:
1. The nature of the source: periodic, noisy, or impulsive, and combinations of the
three.
2. The shape of the vocal tract, e.g. the place of the tongue hump and the degree of
the constriction of the hump (place and manner of articulation, respectively).
3. The time-domain waveform, which gives the pressure change with time at the
lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.

4.1 Elements of a Language (terminology)

A phoneme is a fundamental distinctive unit of a language in that it is a
speech sound class that differentiates words of a language. E.g. the
phonemes c, b, and h give distinctive sounds and hence meanings to the
words "cat", "bat", and "hat".
A particular instantiation of a phoneme is called a phone, and the study of
these sound variations is called phonetics.
Different languages contain different phoneme sets.
Syllables contain one or more phonemes, while words are formed with one
or more syllables, concatenated to form phrases and sentences.
Linguistics is the study of the arrangement of speech sounds, i.e.
phonemes, according to the rules of a language.


The use of features 1 and 2 in the study of phonemes is called articulatory
phonetics because phonemes arise from a combination of vocal fold and
vocal tract articulatory features.
Using the last two (features 3 and 4) is called acoustic phonetics because it is
concerned with the generation of speech sounds from an acoustic point of view.
The variants of sounds, or phones, that convey the same phoneme are
called the allophones of the phoneme, e.g. the t in "butter", "but", and
"to".
The articulatory properties are influenced by adjacent phonemes and other
factors.


In English, the combinations of features give 40 phonemes as follows:

Fig. 3.17 Phonemes in American English.



Vowels

Fig. 3.18 Vocal tract profiles for vowels in American English.



Fig. 3.19 Waveform, wideband spectrogram, and spectral slice of narrowband
spectrogram for two vowels: (a) /i/ as in "eve"; (b) /a/ as in "father".
The second large phoneme grouping is that of consonants. The consonants
contain a number of subgroups: nasals, fricatives, plosives, whispers and
affricates.

Nasals

Fig. 3.20. Vocal tract configurations for nasal consonants.

Fig. 3.21 Wideband spectrograms of nasal consonants: (a) /n/ in "no" and (b)
/m/ in "mo". Figure annotations: broad bandwidth; low frequency.

The source is quasi-periodic airflow puffs from the vibrating vocal folds.

The velum is lowered and the air flows mainly through the nasal cavity, the
oral tract being constricted; thus sound is radiated at the nostrils. E.g. /m/
in "mo" (oral tract constriction at the lips) and /n/ in "no" (constriction is with the
tongue to the gum ridge).

The spectrum of a nasal is dominated by the low resonance of the large
volume of the nasal cavity, which also has a large bandwidth because of
the viscous losses of airflow over the complexly configured surface.

The closed oral cavity has its own resonances and absorbs acoustic
energy. These anti-resonances can be modeled as zeros of the vocal tract
transfer function.

In nasalization of vowels, the velum is partially open. The speech sound is
primarily due to the sound at the lips and not the sound at the nose output.
Vowels adjacent to nasal consonants tend to be nasalized.

Fricatives

Fig. 3.22 Vocal tract configurations for pairs of voiced and unvoiced
fricatives.

There are two classes of fricative consonants: voiced and unvoiced
fricatives.

The location of the constriction by the tongue at the back, center, or front
of the oral tract, as well as at the teeth or lips, influences which fricative
sound is produced. The transfer function consists primarily of high-frequency
resonances, which change with the location of the constriction.

Unvoiced fricatives are characterized by a noisy spectrum, while voiced
fricatives often show both noise and harmonics. E.g. with /S/, the frication
occurs at the palate, and with /f/ at the lips.


Fig. 3.23 Waveform, wideband spectrogram, and narrowband spectral slice of
voiced and unvoiced fricative pair: (a) /v/ as in "vote"; (b) /f/ as in "for".


A simple model for a voiced fricative is

$$x[n] = x_g[n] + x_q[n], \tag{4-1}$$

where

$x_g[n] = h[n] * (g[n] * p[n])$ is the voiced component,

$x_q[n] = h_f[n] * (q[n]\,u[n])$ is the unvoiced component,

$q[n]$ is a (white) noise component, and $h_f[n]$ is the impulse response of the
front oral cavity. Since frication is approximately synchronized with the
airflow velocity, the noise source is assumed to be modulated by the
glottal waveform $u[n]$.
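
The two-component voiced fricative model can be sketched with plain convolutions. Everything numerical here is an assumption made for illustration: the Hann-shaped glottal pulse, the decaying-cosine impulse responses standing in for h[n] and h_f[n], the period of 80 samples, and the fixed random seed.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n_samples, period = 1600, 80

# Periodic glottal flow u[n] = g[n] * p[n] with a hypothetical pulse shape.
p = np.zeros(n_samples)
p[::period] = 1.0
g = np.hanning(49)
u = np.convolve(p, g)[:n_samples]


def decaying_resonance(omega, r, length=200):
    """Hypothetical impulse response: a single damped resonance r^n cos(omega n)."""
    n = np.arange(length)
    return (r ** n) * np.cos(omega * n)


h = decaying_resonance(0.1 * np.pi, 0.97)    # stand-in for the full vocal tract h[n]
h_f = decaying_resonance(0.7 * np.pi, 0.90)  # stand-in for the front oral cavity h_f[n]

q = rng.standard_normal(n_samples)           # white noise q[n]
modulated_noise = q * u                      # noise is gated by the glottal flow

x_g = np.convolve(h, u)[:n_samples]                  # voiced component
x_q = np.convolve(h_f, modulated_noise)[:n_samples]  # unvoiced (frication) component
x = x_g + x_q

print("noise vanishes in the closed phase:", bool(np.allclose(modulated_noise[49:80], 0.0)))
print("output length:", len(x))
```

Multiplying q[n] by u[n] is what ties the frication noise to the pitch cycle: when the glottis is closed and the flow is zero, no noise is generated at the constriction.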


Plosives

Fig. 3.24 Vocal tract configurations for unvoiced and voiced plosive pairs.
Plosives can be either voiced or unvoiced.
The voice onset time is the difference between the time of the burst and
the onset of voicing in the following vowel. The length of the voice onset
time and the place of constriction vary with the plosive consonant.

Fig. 3.25 A schematic representation of (a) unvoiced and (b) voiced plosives.
The voice onset time is denoted by VOT.


Fig. 3.26 Waveform, wideband spectrogram, and narrowband spectral slice of
voiced and unvoiced plosive pair: (a) /g/ as in "go"; (b) /k/ as in "key".


In voiced plosives, although the oral tract is closed, we hear a low-frequency
vibration, called the voice bar, due to the propagation of the
vibration at the vocal folds through the walls of the throat. Unlike unvoiced
plosives, there is little aspiration.
A simple model for a voiced plosive is

$$x[n] = h[n, m] * u[n] + h_f[n, m] * \delta[n] = \sum_{m=-\infty}^{\infty} h[n, m]\, u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\, \delta[n - m]. \tag{4-1}$$

Due to the changing vocal tract shape during the transition from the burst
to a following steady vowel, $h$ and $h_f$ are assumed to be linear but time-varying.
The burst is modeled as an impulse, which is assumed to occur at time $n = 0$.


5 Discrete-time Model of Speech Signals

It is possible to study the airflow velocity, pressure, etc. by solving a set of
partial differential equations that describe the underlying acoustic phenomena. A
commonly used model is the concatenated tube model:

Fig. 4.14 Concatenated tube model. The k-th tube has cross-sectional area Ak
and length lk.
Due to time limitation, we shall not go into these acoustic models.


If the airflows are modeled as signal flows, then the above tube model can
be approximated by the following signal flow graph:

Fig. 4.16 Signal flow graphs of (a) two concatenated tubes; (b) lip boundary
condition; (c) glottal boundary condition.

Fig. 4.18 Signal flow graph conversion to discrete time of (a) lossless two-tube
model; (b) discrete-time version of (a); (c) conversion of (b) with single-sample delays.

The discrete-time speech production model for periodic, noise, and
impulsive sound sources is shown as follows:

Fig. 4.20 Overview of the complete discrete-time speech production model.


The voiced (periodic) speech is modeled by sending a periodic impulse train
having a period equal to the pitch period through G(z), V(z) and R(z).


G(z) is the z-transform of the glottal pulse, V(z) is the vocal tract
transfer function, and R(z) is the radiation loss at the lips (R(z) in
dotted line models the radiation loss at the glottis). An approximation of a
typical glottal flow waveform over one cycle is of the form

$$g[n] = (\beta^{-n} u[-n]) * (\beta^{-n} u[-n]), \tag{4-1}$$

where $u[n]$ here denotes the unit step. The z-transform is

$$G(z) = \frac{1}{(1 - \beta z)^2}, \quad \beta < 1. \tag{4-1}$$

Note that the poles are outside the unit circle.

R(z) is usually modeled as a zero, $R(z) = 1 - \alpha z^{-1}$, with $\alpha < 1$.
V(z) is a stable all-pole filter.
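
The glottal pulse approximation and its z-transform can be checked numerically. The sketch below assumes β = 0.9 and α = 0.95 purely for illustration, truncates the left-sided sequence to 400 samples, and evaluates the transform at one arbitrary point on the unit circle.

```python
import numpy as np

beta = 0.9  # assumed value, must satisfy beta < 1
M = 400     # truncation length for the left-sided sequence

# beta^{-n} u[-n] for n = -(M-1), ..., 0: it grows toward n = 0 and is zero for n > 0.
n = np.arange(-(M - 1), 1)
one_sided = beta ** (-n)

# g[n] is this sequence convolved with itself; it lives on n = -2(M-1), ..., 0.
g = np.convolve(one_sided, one_sided)
n_g = np.arange(-2 * (M - 1), 1)

# Closed form of the self-convolution: g[n] = (1 - n) * beta^{-n} for n <= 0.
g_closed = (1 - n_g) * beta ** (-n_g)
print("matches closed form:", bool(np.allclose(g, g_closed)))

# Compare the z-transform sum with 1 / (1 - beta z)^2 at a point on the unit circle.
z0 = np.exp(1j * 0.3)
G_sum = np.sum(g * z0 ** (-n_g))
G_formula = 1.0 / (1.0 - beta * z0) ** 2
print("|G_sum - G_formula| =", abs(G_sum - G_formula))

# The double pole sits at z = 1 / beta, outside the unit circle.
print("pole location:", 1.0 / beta)

# Radiation zero R(z) = 1 - alpha z^{-1}: small gain at DC, large gain at omega = pi.
alpha = 0.95  # assumed value, must satisfy alpha < 1
print("|R| at omega = 0:", abs(1.0 - alpha), "  |R| at omega = pi:", abs(1.0 + alpha))
```

The time-reversed shape is the point of the construction: the pulse builds up gradually and then ends at n = 0, which mimics a glottal flow that opens slowly and closes quickly. The radiation zero acts as a high-pass term that partly offsets the glottal rolloff.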


To model fricative consonants, white noise is employed instead of the periodic glottal
pulses. It is still colored by the vocal tract, and there is also
similar radiation loss at the lips.
Similarly, to model plosives, an impulse at appropriate onset time is used
to excite the vocal tract.
To model voiced plosives for example, these three sources can be
combined either linearly or nonlinearly as the excitation to the vocal tract.
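
The three excitation types and a linear combination of them can be sketched as follows. This is a simplified illustration under assumptions: the glottal shaping G(z) is omitted, V(z) is reduced to a single hypothetical resonance, α = 0.95, and the mixing weights, burst time, and voice onset delay are invented values, not parameters from the slides.

```python
import numpy as np


def periodic_source(n_samples, period, start=0):
    """Impulse train beginning at `start` (voiced excitation)."""
    e = np.zeros(n_samples)
    e[start::period] = 1.0
    return e


def noise_source(n_samples, rng):
    """White noise (fricative excitation)."""
    return rng.standard_normal(n_samples)


def impulse_source(n_samples, onset):
    """Single impulse at `onset` (plosive burst)."""
    e = np.zeros(n_samples)
    e[onset] = 1.0
    return e


def vocal_tract_and_radiation(e, a, alpha=0.95):
    """Apply a stable all-pole V(z) = 1 / A(z), then the zero R(z) = 1 - alpha z^-1."""
    v = np.zeros(len(e))
    for n in range(len(e)):
        acc = e[n]
        for k in range(1, min(len(a), n + 1)):
            acc -= a[k] * v[n - k]
        v[n] = acc
    return np.convolve(v, [1.0, -alpha])[:len(e)]


rng = np.random.default_rng(1)
n_samples = 1200
r, w0 = 0.97, 0.15 * np.pi  # hypothetical single resonance for V(z)
a = np.array([1.0, -2.0 * r * np.cos(w0), r * r])

# Linear combination for a voiced-plosive-like excitation: a burst at n = 100,
# a little noise right after it, then voicing that starts 150 samples later.
burst = impulse_source(n_samples, onset=100)
gate = np.zeros(n_samples)
gate[100:250] = 1.0
voicing = periodic_source(n_samples, period=80, start=250)
excitation = burst + 0.05 * gate * noise_source(n_samples, rng) + 0.5 * voicing

x = vocal_tract_and_radiation(excitation, a)
print("output length:", len(x), " peak magnitude:", float(np.max(np.abs(x))))
```

Switching between sources only changes the excitation, while the vocal tract filter and the radiation term stay in place, which is the structure the complete model describes.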
