Supported by EPSRC
Content
1- Introduction to Phonetics and Acoustics of Accents
5- Accent Morphing
6- Audio Demo
1. Introduction to Phonetics and Acoustics of Accents
Multimedia Communication Signal Processing Group
1.1 Background
• Accents are dynamic: they evolve over time, influenced by large-scale
immigration, socio-economic changes and cultural trends.
• Generally the structural differences between accents can be divided into two
broad parts:
• The importance of an accent feature depends on its distance from that of the
‘standard’ or ‘received’ pronunciation and the frequency with which that
feature occurs in the acoustics of speech.
1.3 Phonetics of Accents
• Perceived acoustic differences between accents are due to differences, during
the production of sound, in the configuration, positioning, tension and movement
of the laryngeal and supra-laryngeal articulators, namely the vocal folds,
vocal tract, tongue and lips.
(i) Formant trajectories Fkj(t), where k is the formant index and j is the phoneme index.
(ii) Timing and magnitude of the formant target point(s) in formant space for
each phonetic unit.
(b) Pitch prosody correlates of accents include:
(i) Pitch trajectory in various linguistic contexts and positions, e.g. a pitch rise
at the beginning of a voiced group or phrase, or a pitch fall at the end of a phrase.
(ii) Pitch nucleus i.e. the timing and magnitude of the prominent pitch event in
a voiced group.
(ii) Relative duration and timings of the two constituent vowels of diphthongs.
(d) Laryngeal (glottal) correlates of accents, i.e. the voice quality of speech
segments in certain contexts as a function of accent.
2. Research Issues in Modelling Acoustics of Accents of English
• Modelling the duration of vowels and diphthongs and the relative duration of the
two halves of diphthongs.
Phonetic Parameters
Glottal pulse (Voice Quality)   Durations and shapes of opening and closing of glottal folds        **
Prosody Correlates
F0 mean                         Average of pitch                                                    *
F0 range                        Range of pitch                                                      *
Pitch Nucleus                   Prominent point (stressed) within an intonation group (Tone Unit)   ***
Initial Pitch Rise              First pitch slope of a narrative utterance                          ***
Final Pitch Lowering            Final fall pitch slope of a narrative utterance                     ***
Final Pitch Rise                Final rise pitch slope of a narrative utterance                     ***
Phoneme Duration                Vowel duration elongation and complete pronunciation all affect     ***
Excessive Co-articulation       Clipped or short duration sounds                                    ****
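As a rough illustration of the simpler prosody correlates in the table (F0 mean, F0 range and a final pitch slope), the sketch below computes them from a frame-level pitch contour. The function name, the convention that unvoiced frames are stored as 0, and the choice of a least-squares line over the last few voiced frames are assumptions made for this example, not part of the original slides.

```python
import numpy as np

def f0_statistics(f0, frame_rate, tail_frames=10):
    """Summarise a pitch contour (hypothetical helper).

    f0: pitch per frame in Hz, with 0 marking unvoiced frames (assumed convention).
    frame_rate: frames per second.
    tail_frames: number of final voiced frames used for the final slope.
    """
    f0 = np.asarray(f0, dtype=float)
    voiced_idx = np.flatnonzero(f0 > 0)      # indices of voiced frames
    voiced = f0[voiced_idx]

    # Least-squares line through the last voiced frames: slope in Hz per second.
    t_tail = voiced_idx[-tail_frames:] / frame_rate
    final_slope = np.polyfit(t_tail, voiced[-tail_frames:], 1)[0]

    return {
        "mean": voiced.mean(),
        "range": voiced.max() - voiced.min(),
        "final_slope": final_slope,
    }

# Toy contour, 100 frames per second; the values are made up for illustration.
stats = f0_statistics([0, 0, 100, 110, 120, 0, 130, 120, 110, 100, 0],
                      frame_rate=100, tail_frames=4)
print(stats)
```

A negative final slope corresponds to final pitch lowering and a positive one to a final pitch rise; the sketch needs at least two voiced frames in the tail.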
Speech Accent Feature Analysis Method
[Block diagram. Recoverable labels: input speech; HMM training, labelling and segmentation; pitch marker and pitch tracker; features: speaking rate and durations, formants and trajectories, F0 range/mean, pitch accents, pitch contour, tone nucleus; output: accent profile.]
[Plot: duration (sec), from 0.02 to 0.2, for the vowels aa ae ah ao aw ay eh er ey ih iy ow oy uh uw; series: Australian, British, American.]
Figure: Comparison of phoneme durations of British, Australian and American accents.
Speaking rate (number/sec)   Phone   Word
British                      12.1    3.64
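A minimal sketch of how a phone rate and per-phoneme mean durations could be derived from a labelled segmentation such as the output of the HMM labelling stage. The `(label, start, end)` tuple format, the silence labels and the decision to exclude silence from the speaking time are assumptions made for this example; the slides do not say how the rates in the table were computed.

```python
from collections import defaultdict

def speaking_rate_and_durations(segments, silence_labels=("sil", "sp")):
    """segments: list of (label, start_sec, end_sec) tuples (assumed format).

    Returns (phones per second of speech, {label: mean duration in seconds}).
    Silence segments are excluded from both counts (an assumption).
    """
    speech = [(lab, s, e) for lab, s, e in segments if lab not in silence_labels]
    total_time = sum(e - s for _, s, e in speech)
    rate = len(speech) / total_time

    durations = defaultdict(list)
    for lab, s, e in speech:
        durations[lab].append(e - s)
    mean_durations = {lab: sum(d) / len(d) for lab, d in durations.items()}
    return rate, mean_durations

# Made-up segmentation used only to exercise the function.
segments = [("sil", 0.0, 0.5), ("ae", 0.5, 0.6), ("t", 0.6, 0.65), ("ae", 0.65, 0.85)]
rate, mean_durations = speaking_rate_and_durations(segments)
print(rate, mean_durations)
```

The same loop run over word-level labels would give a word rate.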
Table: Word error rate (%) of speech recognition across British, American and Australian accents.
• There is an apparent correlation between automatic speech recognition accuracy and speaking rate.
• Australian, with the slowest speaking rate, obtains the best recognition results, followed by
American and British.
Formant Estimation with 2D-HMM
Formant feature extraction consists of three main functions:
(1) an LP model,
(2) a polynomial root finder, and
(3) a contour trend estimator.
Consider the z-transfer function of an LP model with K real poles, I complex
pole pairs and a gain factor G:
$$H(z) = G \prod_{k=1}^{K} \frac{1}{1 - A_k z^{-1}} \prod_{i=1}^{I} \frac{1}{\left(1 - A_i e^{j 2\pi (F_i/F_s)} z^{-1}\right)\left(1 - A_i e^{-j 2\pi (F_i/F_s)} z^{-1}\right)}$$
where A_k (and likewise A_i) is the pole radius, F_i the pole frequency and F_s the sampling frequency.
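The step from LP polynomial roots to formant candidates can be sketched as follows. Each complex pole with radius A and angle 2π(F/Fs) gives a candidate frequency F. The bandwidth formula -ln(A)·Fs/π is the usual textbook relation and is an assumption here, since the slides do not state it. The helper names and the example pole values are hypothetical.

```python
import numpy as np

def lp_from_pole_pairs(radii, freqs_hz, fs):
    """Build LP denominator coefficients [1, a1, ..., aP] from complex pole pairs."""
    a = np.array([1.0])
    for radius, freq in zip(radii, freqs_hz):
        theta = 2.0 * np.pi * freq / fs
        # (1 - A e^{j theta} z^-1)(1 - A e^{-j theta} z^-1)
        a = np.convolve(a, [1.0, -2.0 * radius * np.cos(theta), radius ** 2])
    return a

def poles_to_formants(a, fs):
    """Return sorted (frequencies, bandwidths) in Hz of the complex poles of 1/A(z)."""
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 1e-9]         # one pole per conjugate pair, drop real poles
    freqs = np.angle(roots) * fs / (2.0 * np.pi)
    bandwidths = -np.log(np.abs(roots)) * fs / np.pi
    order = np.argsort(freqs)
    return freqs[order], bandwidths[order]

# Made-up pole radii and frequencies, used only to check the round trip.
fs = 16000
a = lp_from_pole_pairs([0.97, 0.95, 0.93], [700.0, 1200.0, 2600.0], fs)
freqs, bandwidths = poles_to_formants(a, fs)
print(np.round(freqs, 1), np.round(bandwidths, 1))
```

In practice the coefficients would come from LP analysis of a windowed speech frame, and the candidates would then be passed to the contour trend estimator or HMM stage, which this sketch does not cover.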
[Block diagram. Recoverable labels: speech; segmentation and window; LPC model; polynomial roots; frequency, bandwidth and intensity calculation; feature vector; Δ-estimator; formant candidate.]
[Figure: LPC spectrum with poles P1 to P6; axes: time (s) and frequency (Hz).]
Illustration of the LP spectrum and the modelling of 6 complex pole pairs of a speech segment with an HMM
composed of 4 formant states.
Comparison of histograms (thin solid line) and Gaussian HMMs of formants of Australian English (bold dashed line). X axis:
frequency (Hz); Y axis: probability.
The figures show that HMMs model the distribution of the formants well.
Comparison of Formant Spaces of American, Australian and British Accents
Figure: Comparison of trajectories and target times of formants of British, Australian and American accents
[Four-panel plot: formant frequency (Hz) for the vowels AA AE AH AO EH ER IH IY OH UW UH; series: Australian, British, American. The y-axis ranges are about 200 to 1000, 600 to 2800, 2550 to 3225 and 3500 to 4600, presumably the first to fourth formants.]
[Block diagram. Recoverable labels: accent model; HMM training/adaptation; formant estimation; pitch tracker; accent-synthesised speech.]
• Prosody modification: based on the time-domain pitch-synchronous overlap-and-add
(TD-PSOLA) method.
[Plot: LP magnitude response, magnitude (dB) from about -75 to -35 against frequency (Hz), with markers F01, F12, F23, F34, F45, BW1 to BW4 and I12, I23, I34.]
Figure: Illustration of non-uniform frequency warping using the LP model frequency response. The spectrum is divided into a
number of bands centred on the formants and a different set of warping parameters is applied to each band.
Formant Transformation Ratios
$$\alpha_{i(i+1)} = \frac{f_{i+1}^{T} - f_{i}^{T}}{f_{i+1}^{S} - f_{i}^{S}}$$
where f_i^S and f_i^T are the ith formants of the source and target accents, respectively.
The frequency mapping can be expressed as
Figure: Illustration of warped (solid line) and original (dash-dot line) formant trajectories of
/aa/ in accent conversion from Australian to British.
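One way to read the formant transformation ratios is as the slopes of a piecewise-linear frequency map that sends each source formant to the corresponding target formant. The sketch below takes that reading. It is an assumption, because the mapping equation itself is not recoverable from the slides. Anchoring the map at 0 Hz and at the Nyquist frequency, the function names and the formant values are also choices made only for this example.

```python
import numpy as np

def transformation_ratios(f_src, f_tgt):
    """alpha_{i(i+1)} = (f_{i+1}^T - f_i^T) / (f_{i+1}^S - f_i^S) for adjacent formants."""
    return np.diff(np.asarray(f_tgt, dtype=float)) / np.diff(np.asarray(f_src, dtype=float))

def warp_frequency(f, f_src, f_tgt, nyquist):
    """Piecewise-linear map sending each source formant to its target formant.

    Between adjacent formants the slope equals the ratio alpha_{i(i+1)};
    0 Hz and the Nyquist frequency are kept fixed (an assumption of this sketch).
    """
    anchors_src = np.concatenate(([0.0], f_src, [nyquist]))
    anchors_tgt = np.concatenate(([0.0], f_tgt, [nyquist]))
    return np.interp(f, anchors_src, anchors_tgt)

# Hypothetical source and target formants in Hz.
f_src = [700.0, 1200.0, 2600.0]
f_tgt = [650.0, 1300.0, 2500.0]
print(transformation_ratios(f_src, f_tgt))
print(warp_frequency(np.array([700.0, 950.0, 1200.0]), f_src, f_tgt, nyquist=8000.0))
```

Applying such a map frame by frame to the formant trajectory would give a warped trajectory of the kind shown in the figure.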
Pitch Modification Using Time Domain PSOLA (TD-PSOLA)
[Figure: source speech with its pitch marks and target speech with its pitch marks; panels (a) to (d).]
(c) ‘asked’ in Australian, (d) Australian-accent ‘article’ transformed to British accent
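The pitch modification step of TD-PSOLA can be sketched as follows: two-period, Hann-windowed grains centred on the source pitch marks are re-spaced at the target pitch period and overlap-added. This is a minimal sketch and not the authors' implementation. It assumes the pitch marks are already available, handles voiced speech only, keeps the duration unchanged and applies no gain normalisation. The function name and the synthetic test signal are hypothetical.

```python
import numpy as np

def td_psola_pitch(x, marks, pitch_scale):
    """Scale pitch by `pitch_scale` (>1 raises it) while keeping duration.

    x: 1-D signal; marks: increasing sample indices of the pitch marks
    (at least three are needed, and they are assumed to be given).
    """
    x = np.asarray(x, dtype=float)
    marks = np.asarray(marks, dtype=int)
    periods = np.diff(marks)                 # local pitch periods in samples
    pad = int(periods.max())                 # zero padding so grains stay in range
    xp = np.pad(x, pad)
    yp = np.zeros_like(xp)

    t = float(marks[1])                      # first synthesis mark
    while t <= marks[-2]:
        # The nearest analysis mark supplies the grain.
        k = int(np.argmin(np.abs(marks - t)))
        k = min(max(k, 1), len(marks) - 2)
        p = int(periods[k])
        centre = int(marks[k]) + pad
        grain = xp[centre - p:centre + p + 1] * np.hanning(2 * p + 1)
        c = int(round(t)) + pad
        yp[c - p:c + p + 1] += grain         # overlap-add at the synthesis mark
        t += p / pitch_scale                 # new spacing between synthesis marks
    return yp[pad:pad + len(x)]

# Synthetic periodic signal with a 100-sample period and marks at each period.
n = np.arange(4000)
x = np.sin(2 * np.pi * n / 100) + 0.5 * np.sin(4 * np.pi * n / 100)
marks = np.arange(0, 4000, 100)
y = td_psola_pitch(x, marks, pitch_scale=1.25)   # pitch period becomes 80 samples
```

With `pitch_scale=1.0` the overlapping Hann windows sum to one, so the interior of the signal is reproduced unchanged, which is a convenient sanity check.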
[Block diagram. Recoverable labels: source speech; target speech; LP model; source speaker formant trajectory HMM model; target speaker formant trajectory HMM model; warping factors; pole rotation; spectrum warping; LPC spectrum; mapped speech reconstruction. Processing chain: speech, LP model, formant estimation, formant tracking, mapping, speech reconstruction.]
An example of voice conversion
[Audio demo: American male; American female; Transformed (AM m->f)]
Accent Conversion Demonstration
Spoken word   Australian   British   Transformed
‘Article’
‘Claim’
‘Beige’
‘Boston’
‘Opposition’
‘The occupied’
The End