Accent Seminar

Multimedia Communication Signal Processing Group
Analysis, Modelling and Synthesis of British, Australian and American Accents
Supported by EPSRC
Qin Yan, Saeed Vaseghi
Multimedia Communication Signal Processing Lab
Department of Electronic and Computer Engineering
Brunel University
Content
1- Introduction to Phonetics and Acoustics of Accents
2- Research Issues in Modelling Acoustics of Accents of English
3- Current Research Problems
4- Accent Analysis and Models
5- Accent Morphing
6- Audio Demo
1. Introduction to Phonetics and Acoustics of Accents
1.1 Background
• Accents are acoustic manifestations of differences in pronunciation and intonation
by a community of people from a national, regional or socio-economic grouping.
• Accents are dynamic processes in that they evolve over time, influenced by
large-scale immigration, socio-economic changes and cultural trends.
• Applications of accent models include:

- speech recognition,
- text to speech synthesis,
- voice editing,
- accent morphing in broadcasting and films,
- toys and computer games,
- accent coaching, education.
1.2 Basic Structure of Accents
• Generally the structural differences between accents can be divided into two
broad parts:
(a) Differences in phonetic transcriptions.
(b) Differences in acoustic correlates and intonations of accents.
• The importance of an accent feature depends on its distance from that of the
‘standard’ or ‘received’ pronunciation and the frequency with which that
feature occurs in the acoustics of speech.
1.3 Phonetics of Accents
• A dominant aspect of accents is the difference in pronunciation as transcribed
by a phonetic dictionary.
• The differences in phonetic transcription can be categorized into two classes:
a) Differences in the number and identity of the phonemes.
For example, British English as transcribed by Cambridge University’s BEEP
dictionary has five extra vowels, /ax (ə), ea (ɛə), ia (iə), ua (uə), ah (ɒ)/, compared to
American English as transcribed by the Carnegie Mellon University (CMU) dictionary.
In the American transcription /iə ɛə uə/ are allophones of /i ɛ u/, and /ɒ/ is merged
with /a/. The American transcription also has three different levels of stress for
vowels and diphthongs. Australian English has distinctive vowels such as /æi/
instead of /ei/ and /æɔ/ for /au/.
b) Differences in phonetic realizations: phoneme substitution, deletion, insertion.
For example, ‘JOHN’ is pronounced as /ʤʌn/ in American but as /ʤɔn/ in British
and Australian English. The word ‘SAY’ is pronounced as /sei/ in British and
American but as /sæi/ in Australian.
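The three realization differences (substitution, deletion, insertion) can be found automatically by aligning two phoneme sequences. The following is a minimal sketch using Python's standard `difflib`; the function name `phoneme_differences` and the list representation of a transcription are illustrative assumptions, not part of the seminar, and the example uses the ‘SAY’ transcriptions quoted in this section.

```python
from difflib import SequenceMatcher

# Map difflib edit tags to the terminology used in the slides.
LABELS = {"replace": "substitution", "delete": "deletion", "insert": "insertion"}

def phoneme_differences(reference, other):
    """Return (difference type, reference phonemes, other phonemes) tuples."""
    matcher = SequenceMatcher(None, reference, other, autojunk=False)
    return [
        (LABELS[tag], reference[i1:i2], other[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

# 'SAY' as transcribed in this section: /sei/ (British) vs /sæi/ (Australian).
british_say = ["s", "ei"]
australian_say = ["s", "æi"]
diffs = phoneme_differences(british_say, australian_say)
print(diffs)
```

Applied to every word shared by two pronunciation dictionaries, counts of such differences would give a rough measure of how often each phonetic accent feature occurs.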
1.4 Acoustics of Accents
• Perceived acoustic differences between accents are due to differences, during the
production of sound, in the configuration, positioning, tension and movement
of the laryngeal and supra-laryngeal articulators, namely the vocal folds,
vocal tract, tongue and lips.
• Four aspects of acoustic correlates of accents are considered essential for
accent models and accent synthesis. These are:
(a) Formant (i.e. frequency of vocal tract resonance) correlates of accents, including:
(i) Formant trajectories F_kj(t), where k is the formant index and j is the phoneme index.
(ii) Timing and magnitude of the formant target point(s) in formant space for
each phonetic unit.
(b) Pitch prosody correlates of accents, including:
(i) Pitch trajectory at various linguistic contexts and positions, e.g. pitch rise at
the beginning of a voiced group or phrase, pitch fall at the end of a phrase.
(ii) Pitch nucleus, i.e. the timing and magnitude of the prominent pitch event in
a voiced group.
(c) Duration and timing correlates of accents, including:
(i) Duration of vowels and diphthongs.
(ii) Relative duration and timing of the two constituent vowels of diphthongs.
(d) Laryngeal (glottal) correlates of accents, i.e. the voice quality of speech
segments in certain contexts as a function of accent.
2. Research Issues in Modelling Acoustics of Accents of English
• Definition of an accent ‘feature set’ composed of formant trajectories,
formant target points, pitch trajectory, power trajectory and duration.
• Separation, normalisation, or averaging out of speaker characteristics from
accent characteristics. This is required for modelling the parameters of an accent.
• Modelling formants of vowels and diphthongs. A diphthong is composed of two
connected elementary sounds.
• Modelling the duration of vowels and diphthongs and the relative duration of the
two halves of diphthongs.
• Modelling pitch trajectory in different phonetic/linguistic positions and contexts.
• Modelling voice quality correlates of an accent in different phonetic/linguistic
positions and contexts.
• Integration of all accent features within a coherent generative model.

Accent Profile (AP)
Parameters | Comments | Rank
Phonetic Parameters
Substitution, insertion, deletion | Pronunciation differences obtained from phonetic transcription dictionaries | *****
Supra-laryngeal and Laryngeal Correlates
Formants & their trajectories | 2nd formant with largest variance is most sensitive to accent | ****
Glottal pulse (Voice Quality) | Durations and shapes of opening and closing of glottal folds | **
Prosody Correlates
F0 mean | Average of pitch | *
F0 range | Range of pitch | *
Pitch Nucleus | Prominent point (stressed) within an intonation group (Tone Unit) | ***
Initial Pitch Rise | First pitch slope of a narrative utterance | ***
Final Pitch Lowering | Final fall pitch slope of a narrative utterance | ***
Final Pitch Rise | Final rise pitch slope of a narrative utterance | ***
Timing and Delivery Correlates
Speaking Rate | Phonemes or words per second | *
Phoneme Duration | Vowel duration elongation and complete pronunciation all affect | ***
Excessive Co-articulation | Clipped or short duration sounds | ****
Speech Accent Feature Analysis Method
[Block diagram. Blocks: input speech; HMM training; labelling & segmentation; speaking rate & durations; formants & trajectories; pitch marker; pitch tracker; F0 range/mean; pitch accents; pitch contour; tone nucleus features; accent profile.]
Figure: Block diagram illustration of the processes involved in accent analysis.
The basic processes involved in accent analysis include:
• Speech phonetic labelling and boundary segmentation using HMMs
• Pitch trajectory and pitch nucleus estimation
• Formant models and formant track estimation
• Duration and power trajectory analysis
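The slides do not say which pitch estimator is used. As an illustration only, the sketch below estimates F0 for a single voiced frame with a plain autocorrelation peak search, a common textbook technique swapped in here. The function name `estimate_f0` and the 60 to 400 Hz search range are assumptions.

```python
import numpy as np

def estimate_f0(frame, fs, f0_min=60.0, f0_max=400.0):
    """Estimate F0 (Hz) of one voiced frame from its autocorrelation peak."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()
    # Autocorrelation for non-negative lags only.
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(fs / f0_max)   # shortest pitch period searched (assumed range)
    lag_max = int(fs / f0_min)   # longest pitch period searched (assumed range)
    lag = lag_min + int(np.argmax(acf[lag_min:lag_max + 1]))
    return fs / lag

# Synthetic test frame: a 200 Hz sinusoid, 40 ms at 16 kHz.
fs = 16000
t = np.arange(640) / fs
frame = np.sin(2 * np.pi * 200.0 * t)
f0 = estimate_f0(frame, fs)
print(f0)
```

Running such an estimator frame by frame over the voiced segments found by the HMM segmentation would give a pitch trajectory from which slope and nucleus features could be measured.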
Analysis of Duration Correlate of AU, US and UK Accent Speech
[Bar chart. Y axis: duration (sec), 0.02 to 0.2; X axis: phonemes aa ae ah ao aw ay eh er ey ih iy ow oy uh uw; series: Australian, British, American.]
Figure: Comparison of phoneme durations of British, Australian and American.
Speaking rate (number/sec) | Phone | Word
British | 12.1 | 3.64
American | 11.6 | 3.1
Australian | 10.8 | 2.8
Table: Comparison of speaking rates of British, American and Australian accents.
• Australian speaking (word) rate is 23% slower than British

• American speaking (word) rate is 15% slower than British
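The two percentages follow directly from the word rates in the speaking-rate table. A short worked check, with the helper name `percent_slower` chosen for illustration:

```python
# Word rates (words per second) taken from the speaking-rate table.
word_rate = {"British": 3.64, "American": 3.1, "Australian": 2.8}

def percent_slower(rate, reference):
    """Relative reduction of `rate` with respect to `reference`, in percent."""
    return 100.0 * (reference - rate) / reference

for accent in ("Australian", "American"):
    slower = percent_slower(word_rate[accent], word_rate["British"])
    print(f"{accent}: {slower:.1f}% slower than British")
```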
Input \ Model | British model | American model | Australian model
British | 12.8 | 29.3 | 34.9
American | 30.6 | 8.8 | 29.94
Australian | 33.1 | 27.3 | 7.28
Table: Word error rate (%) of speech recognition across British, American and Australian accents.
• There is an apparent correlation between automatic speech recognition accuracy and speaking rate.
• Australian, with the slowest speaking rate, obtains the best matched-accent recognition
result, followed by American and British.
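The apparent correlation can be checked numerically by pairing each accent's word rate with its matched-condition word error rate (the diagonal of the recognition table). The sketch below computes a Pearson coefficient by hand. With only three data points it is indicative at best, and the helper name `pearson` is an illustrative choice.

```python
import math

# Order: British, American, Australian.
word_rate = [3.64, 3.1, 2.8]        # words per second
matched_wer = [12.8, 8.8, 7.28]     # % word error, input accent equals model accent

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

r = pearson(word_rate, matched_wer)
print(f"correlation between word rate and matched WER: {r:.3f}")
```

A positive coefficient means that faster speech goes with higher error, which matches the ordering stated in the bullets.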
Formant Estimation with 2D-HMM
Formant feature extraction consists of three main functions:
(1) an LP model,
(2) a polynomial root finder, and
(3) a contour trend estimator.
Consider the z-transfer function of an LP model with K real poles and I complex
pole pairs and a gain factor G as

$$H(z) = G \prod_{k=1}^{K} \frac{1}{1 - A_k z^{-1}} \prod_{i=1}^{I} \frac{1}{\left(1 - A_i e^{j 2\pi (F_i/F_s)} z^{-1}\right)\left(1 - A_i e^{-j 2\pi (F_i/F_s)} z^{-1}\right)}$$

where A_k and A_i are the pole radii, F_i is the pole frequency and F_s is the sampling frequency.
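The root-finding step can be illustrated by building an LP denominator polynomial from known complex pole pairs and then recovering the pole frequencies from its roots. This is a minimal sketch: the function names and example values are hypothetical, and the bandwidth formula BW = -(Fs/π) ln(A) is the usual textbook conversion from pole radius, assumed here because the slides do not state it.

```python
import numpy as np

def lp_polynomial(freqs_hz, radii, fs):
    """Denominator coefficients [1, a1, ..., ap] of H(z) for complex pole pairs."""
    a = np.array([1.0])
    for f, r in zip(freqs_hz, radii):
        theta = 2.0 * np.pi * f / fs
        # (1 - r e^{j theta} z^-1)(1 - r e^{-j theta} z^-1)
        #   = 1 - 2 r cos(theta) z^-1 + r^2 z^-2
        a = np.convolve(a, [1.0, -2.0 * r * np.cos(theta), r * r])
    return a

def formant_candidates(a, fs):
    """Pole frequencies and bandwidths (Hz) from LP coefficients, sorted by frequency."""
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]            # one pole from each conjugate pair
    freqs = np.angle(roots) * fs / (2.0 * np.pi)
    bandwidths = -np.log(np.abs(roots)) * fs / np.pi   # assumed standard conversion
    order = np.argsort(freqs)
    return freqs[order], bandwidths[order]

# Hypothetical pole pairs, for illustration only.
fs = 8000
true_freqs = [500.0, 1500.0, 2500.0]
radii = [0.97, 0.95, 0.93]
a = lp_polynomial(true_freqs, radii, fs)
freqs, bandwidths = formant_candidates(a, fs)
print(freqs, bandwidths)
```

In practice the coefficients would come from LP analysis of a windowed speech segment, and the resulting candidates would be passed to the formant classifier.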
[Block diagram. Blocks: speech; segmentation & window; LPC model; polynomial roots; frequency, bandwidth and intensity calculation; Δ-estimator; formant candidate feature vector.]
Figure: LP-based formant-candidate feature extraction method.

[Figure. LPC spectrum with poles P1 to P6 marked; axes: time (s) and frequency (Hz).]
Illustration of the LP spectrum and the modelling of 6 complex pole pairs of a speech segment with an HMM
composed of 4 formant-states.
• 2D HMMs span the time and frequency dimensions.
• Left-right HMM states across frequency model the formants, such that the first state
models the first formant, the second state the second formant and so on.
• The distribution of formants in each state is modelled by a Gaussian mixture
density.
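The frequency dimension of this idea can be sketched as a left-right Viterbi alignment that assigns sorted pole-frequency candidates to formant states. This is a simplified illustration under stated assumptions: one Gaussian per state instead of a mixture, no transition probabilities, no time dimension, and made-up means, standard deviations and candidate frequencies.

```python
import numpy as np

def log_gauss(x, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * np.log(2.0 * np.pi * std ** 2) - 0.5 * ((x - mean) / std) ** 2

def assign_formant_states(candidates_hz, means, stds):
    """Left-right Viterbi across frequency: states never decrease or skip."""
    x = np.sort(np.asarray(candidates_hz, dtype=float))
    n_obs, n_states = len(x), len(means)
    emit = np.array([[log_gauss(xi, m, s) for m, s in zip(means, stds)] for xi in x])
    score = np.full((n_obs, n_states), -np.inf)
    back = np.zeros((n_obs, n_states), dtype=int)
    score[0, 0] = emit[0, 0]                      # lowest candidate starts in state 0
    for n in range(1, n_obs):
        for s in range(n_states):
            stay = score[n - 1, s]
            move = score[n - 1, s - 1] if s > 0 else -np.inf
            if stay >= move:
                score[n, s], back[n, s] = stay + emit[n, s], s
            else:
                score[n, s], back[n, s] = move + emit[n, s], s - 1
    path = [n_states - 1]                         # highest candidate ends in last state
    for n in range(n_obs - 1, 0, -1):
        path.append(int(back[n, path[-1]]))
    return x, path[::-1]

# Hypothetical state models and pole candidates, for illustration only.
means = [500.0, 1500.0, 2500.0, 3500.0]
stds = [150.0, 300.0, 300.0, 300.0]
candidates = [480.0, 620.0, 1450.0, 2400.0, 2700.0, 3600.0]
sorted_candidates, states = assign_formant_states(candidates, means, stds)
print(list(zip(sorted_candidates, states)))
```

Here six candidates are mapped onto four formant states, mirroring the 6-pole, 4-state illustration; the function assumes at least as many candidates as states.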
Three spectrogram examples of formant tracks superimposed on LPC spectrum of speech

Comparison of histograms (thin solid line) and Gaussian HMMs (bold dashed line) of formants of Australian English. X axis:
frequency (Hz); Y axis: probability.
The figures show that the HMMs closely follow the distributions of the formants.
Comparison of Formants Spaces of American, Australian and British Accents
F1 vs F2 space of British, Australian and American English.
Note the following features:

• Raising of the vowels /ae/ and /eh/ in Australian.
• Fronting of the open vowel /aa/ and the high vowel /uw/ in Australian.
• Fronting and raising of the vowel /er/ in Australian.
• The vowels /iy/, /eh/ and /ae/ are closer together in Australian.
Figure : Comparison of trajectories and target time of formant of British, Australian and American accents
[Four panels. X axis: vowels AA AE AH AO EH ER IH IY OH UW UH; Y axes span roughly 200 to 1000, 600 to 2800, 2550 to 3225 and 3500 to 4600; series: Australian, British, American.]
Figure : Comparison of formants of Australian, British and American (female)

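Formant Ranking using a normalised distance

$$\mathrm{Rank}_i = \sum_{v=1}^{V} \left( \frac{F_{vi}^{A} - F_{vi}^{B}}{0.5\,\left(F_{vi}^{A} + F_{vi}^{B}\right)} \right)^{2}$$

where, as the notation suggests, v indexes the V vowels, i is the formant index, and A and B denote the two accents being compared.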
Formant Ranking Order
Accent Pairs | 1 | 2 | 3 | 4
British & Australian | 1st | 2nd | 4th | 3rd
British & American | 2nd | 1st | 3rd | 4th
Australian & American | 2nd | 1st | 3rd | 4th
• The 2nd formant has the widest frequency range and is the most sensitive to accent.
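A normalised-distance ranking of this kind can be sketched as follows. The code sums, over vowels, the squared difference between two accents' formant values divided by their mean; the placement of the square is an assumption about the formula, and the formant values are invented for illustration and are not measurements from the seminar.

```python
import numpy as np

def formant_rank_scores(formants_a, formants_b):
    """Per-formant normalised distance between two accents.

    Inputs have shape (V vowels, I formants), in Hz.
    """
    fa = np.asarray(formants_a, dtype=float)
    fb = np.asarray(formants_b, dtype=float)
    normalised = (fa - fb) / (0.5 * (fa + fb))
    return np.sum(normalised ** 2, axis=0)       # sum over vowels (assumed squared form)

# Hypothetical formants (F1..F4) of two vowels in two accents.
accent_a = [[700, 1200, 2500, 3500],
            [300, 2300, 3000, 3700]]
accent_b = [[650, 1400, 2550, 3520],
            [320, 2000, 2950, 3690]]

scores = formant_rank_scores(accent_a, accent_b)
ranking = [int(i) + 1 for i in np.argsort(-scores)]   # formant numbers, most different first
print(scores, ranking)
```

Dividing by the mean of the two values keeps high formants, which have larger absolute frequencies, from dominating the ranking.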
Accent Morphing Method
[Block diagram. Blocks: source speech; speech labelling & segmentation; HMM training/adaptation; accent model; pitch tracker; formant estimation; formant mapping; prosody modification; accent synthesised speech.]
Figure : Diagram of a voice morphing system used for accent conversion
• Formant Mapping: transformation of the formants of the source towards those of the
target accent is based on non-uniform frequency warping of the linear prediction model.
• Prosody Modification: based on the time domain pitch synchronous overlap and add
(TD-PSOLA) method.
• Prosody Modification includes pitch slope, duration and power trajectory.
• Applications: text to speech synthesis, broadcasting systems (e.g. accent
modification in films), education software such as language teaching, speech interfaces
in mobile devices, call centres and other electronic products.
Formant Transformation via Non-Uniform LP Frequency Warping
[Plot of the LP model frequency response. Y axis: magnitude (dB); X axis: frequency (Hz); marked on the plot: band edges F01, F12, F23, F34, F45, bandwidths BW1 to BW4 and labels I12, I23, I34.]
Figure: Illustration of non-uniform frequency warping using the LP model frequency response. The spectrum is divided into a
number of bands centred on the formants and a different set of warping parameters is applied to each band.
[Block diagram. Blocks: speech; linear prediction model; polynomial roots and pole estimation; formant HMMs and formant estimation; formant transformation ratios; LP spectrum mapping; accent modified spectrum.]
Figure: Illustration of the modification of the spectrum towards the formants of the target accent.

The frequency bands of the source speaker [F01 F12 F23 F34 F45] are mapped to the target
accent using a set of warping ratios derived from differences in the formants of
phonetic segments of speech across accents as

$$\alpha_{i(i+1)} = \frac{f_{i+1}^{T} - f_{i}^{T}}{f_{i+1}^{S} - f_{i}^{S}}$$

where f_i^S and f_i^T are the ith formants of the source and target accents respectively.
The frequency mapping can be expressed as

$$f_{i(i+1)}^{T} = \alpha_{i(i+1)}\, f_{i(i+1)}^{S}$$

where f_{i(i+1)} denotes the band between the ith and (i+1)th formants.
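One plausible reading of this band-wise mapping is a piecewise-linear frequency warp whose slope in each band equals the warping ratio of that band. The sketch below implements that reading with `numpy.interp`. The function names, the use of 0 Hz and the Nyquist frequency as outer band edges, and the formant values are assumptions for illustration.

```python
import numpy as np

def warping_ratios(src_formants, tgt_formants):
    """Ratio of target to source band width between consecutive formants."""
    src = np.asarray(src_formants, dtype=float)
    tgt = np.asarray(tgt_formants, dtype=float)
    return np.diff(tgt) / np.diff(src)

def warp_frequency(f, src_formants, tgt_formants, fs):
    """Map source frequencies to the target scale, band by band."""
    nyquist = fs / 2.0
    # Assumed outer band edges: 0 Hz and the Nyquist frequency map to themselves.
    src_knots = np.concatenate(([0.0], src_formants, [nyquist]))
    tgt_knots = np.concatenate(([0.0], tgt_formants, [nyquist]))
    return np.interp(f, src_knots, tgt_knots)

# Hypothetical source and target formants (Hz), for illustration only.
fs = 16000
src = [700.0, 1200.0, 2500.0]
tgt = [650.0, 1400.0, 2600.0]

alpha = warping_ratios(src, tgt)
warped = warp_frequency(np.array([700.0, 950.0, 1200.0]), src, tgt, fs)
print(alpha, warped)
```

Within the band between the first and second formants, a source frequency f maps to 650 + 1.5 (f - 700), so the source formants land exactly on the target formants and frequencies in between are stretched by the band's ratio.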
Figure: Illustration of warped (solid line) and original (dash-dot line) formant trajectories of
/aa/ in accent conversion from Australian to British.
Pitch Modification Using Time Domain PSOLA (TD-PSOLA)
[Figure. Panels: source speech with source pitch marks; target speech with target pitch marks.]
Illustration of the mapping of the pitch periods of a source speech to a target.
• TD-PSOLA is applied to each corresponding voiced speech segment to
modify the pitch slope and duration of the segment.
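The core of TD-PSOLA can be sketched in a few lines: two-period, Hann-windowed segments centred on the source pitch marks are overlap-added at new target pitch marks. This is a minimal sketch rather than the system described in the seminar. It assumes the pitch marks are already known, uses constant scale factors instead of a pitch-slope contour, omits gain normalisation, and the function name `td_psola` is illustrative.

```python
import numpy as np

def td_psola(x, src_marks, pitch_scale=1.0, time_scale=1.0):
    """Overlap-add pitch-synchronous segments of x at rescaled pitch marks."""
    x = np.asarray(x, dtype=float)
    src_marks = np.asarray(src_marks, dtype=int)
    periods = np.diff(src_marks)
    out_len = int(round(len(x) * time_scale))
    y = np.zeros(out_len)
    t = float(src_marks[0]) * time_scale          # first target pitch mark
    while t < out_len:
        # Source mark closest to the target mark mapped back to source time.
        k = int(np.argmin(np.abs(src_marks - t / time_scale)))
        period = int(periods[min(k, len(periods) - 1)])
        lo, hi = src_marks[k] - period, src_marks[k] + period
        start = int(round(t)) - period
        if lo >= 0 and hi <= len(x) and start >= 0 and start + 2 * period <= out_len:
            y[start:start + 2 * period] += x[lo:hi] * np.hanning(2 * period)
        t += period / pitch_scale                 # new spacing sets the new pitch
    return y

# Synthetic test: an impulse train with a 100-sample pitch period.
marks = np.arange(100, 1000, 100)
x = np.zeros(1000)
x[marks] = 1.0
y = td_psola(x, marks, pitch_scale=1.25)
print(np.diff(np.flatnonzero(y)))
```

With `pitch_scale=1.25` the pulses in the output are 80 samples apart instead of 100, i.e. the pitch is raised by 25% while the overall duration is unchanged. Setting `time_scale` changes the duration while keeping the pitch period.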
Examples of changes in accent/duration modulation of pitch
(a) ‘article’ in Australian, (b) Australian-accent ‘article’ transformed to British accent
(c) ‘asked’ in Australian, (d) Australian-accent ‘asked’ transformed to British accent
[Block diagram. Blocks: source speech; target speech; LP model; LPC spectrum; source speaker formant HMM; target speaker formant HMM; formant trajectory model; warping factors; pole rotation; spectrum warping; mapped speech; speech reconstruction. Stages: model estimation; formant tracking; formant mapping; speech reconstruction.]
An Outline of Voice-Morph: A system for Voice and Accent Conversion
An example of voice conversion: American male, American female, Transformed (AM m->f)
Accent Conversion Demonstration
Source Accent | Target Accent | Result | Spoken words
Australian | British | Transformed | ‘Article’, ‘Claim’, ‘Beige’
British | American | Transformed | ‘Cooperation’, ‘Boston’, ‘Opposition’, ‘The occupied’
The End
