
Speech Coding Techniques

潘奕誠
4/7/2003
Introduction
 Efficient speech-coding techniques
 Advantages for VoIP
 Digital streams of ones and zeros
 The lower the bandwidth, the lower the
quality
 RTP payload types
 Processing power
 The better quality (for a given bandwidth)
uses a more complex algorithm
 A balance between quality and cost
Voice Quality
 Bandwidth is easily quantified
 Voice quality is subjective
 MOS, Mean Opinion Score
 ITU-T Recommendation P.800
 Excellent – 5
 Good – 4
 Fair – 3
 Poor – 2
 Bad – 1
 A minimum of 30 people
 Listeners rate voice samples or take part in conversations
 P.800 recommendations
 The selection of participants
 The test environment
 Explanations to listeners
 Analysis of results
 Toll quality
 A MOS of 4.0 or higher
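As a rough illustration of the arithmetic behind a MOS, the sketch below averages a panel's ratings on the five-point scale and compares the result with the toll-quality threshold. The panel data is hypothetical, and the function name `mean_opinion_score` is only an illustrative choice; P.800 itself covers much more (listener selection, environment, analysis) than this average.

```python
from statistics import mean

# Opinion scale from the slide (ITU-T P.800 labels).
SCALE = {5: "Excellent", 4: "Good", 3: "Fair", 2: "Poor", 1: "Bad"}
TOLL_QUALITY = 4.0   # threshold quoted on the slide
MIN_LISTENERS = 30   # minimum panel size quoted on the slide


def mean_opinion_score(ratings):
    """Average a list of integer ratings (1..5) into a MOS."""
    if len(ratings) < MIN_LISTENERS:
        raise ValueError("need at least 30 listeners")
    if any(r not in SCALE for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    return mean(ratings)


# Hypothetical panel: 12 'Excellent', 15 'Good', 3 'Fair'.
ratings = [5] * 12 + [4] * 15 + [3] * 3
mos = mean_opinion_score(ratings)
print(round(mos, 2), mos >= TOLL_QUALITY)  # 4.3 True
```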
About Speech
 Speech
 Air pushed from the lungs past the vocal
cords and along the vocal tract
 The basic vibrations – vocal cords
 The sound is altered by the disposition of the vocal tract (tongue and mouth)
 Model the vocal tract as a filter
 The shape changes relatively slowly
 The vibrations at the vocal cords
 The excitation signal
Speech sounds
 Voiced sound
 The vocal cords vibrate open and close
 Quasi-periodic pulses of air
 The rate of the opening and closing – the pitch
 Unvoiced sounds
 Forcing air at high velocities through a constriction
 Noise-like turbulence
 Show little long-term periodicity
 Short-term correlations still present
 Plosive sounds
 A complete closure in the vocal tract
 Air pressure is built up and released suddenly
Voice Sampling
 Discrete Time LTI Systems: The Convolution Sum

$$x[n] = \sum_{k=-\infty}^{\infty} x[k]\,\delta[n-k]$$

$$y[n] = \sum_{k=-\infty}^{\infty} x[k]\,h[n-k]$$

[Figure: h[n] = 1 for n = 0, 1, 2; x[n] = 0.5 at n = 0 and 2 at n = 1; y[n] = 0.5, 2.5, 2.5, 2 for n = 0, 1, 2, 3]
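The convolution sum can be checked numerically on the figure's example. This is a minimal sketch for finite-length sequences that start at n = 0; the function name `convolve` is an illustrative choice.

```python
def convolve(x, h):
    """Convolution sum y[n] = sum_k x[k] * h[n - k] for finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for k, xk in enumerate(x):
        for m, hm in enumerate(h):
            # Sample x[k] contributes x[k] * h[m] to output index n = k + m.
            y[k + m] += xk * hm
    return y


x = [0.5, 2.0]        # x[0] = 0.5, x[1] = 2
h = [1.0, 1.0, 1.0]   # h[n] = 1 for n = 0, 1, 2
y = convolve(x, h)
print(y)  # [0.5, 2.5, 2.5, 2.0]
```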
 Nyquist sampling theorem
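$$s(t) = \sum_{n=-\infty}^{\infty} \delta(t - nT)$$

$$x_s(t) = x_c(t)\,s(t) = x_c(t) \sum_{n=-\infty}^{\infty} \delta(t - nT)$$

$$S(j\Omega) = \frac{2\pi}{T} \sum_{k=-\infty}^{\infty} \delta(\Omega - k\Omega_s)$$

[Figure: spectrum X_c(jΩ) band-limited to −Ω_N ≤ Ω ≤ Ω_N; S(jΩ) has impulses at multiples of Ω_s; after sampling, copies of X_c(jΩ) sit at multiples of Ω_s, with the point (Ω_s − Ω_N) marked]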
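The copies of the spectrum do not overlap as long as Ω_s − Ω_N > Ω_N, that is, Ω_s > 2Ω_N. The sketch below shows what happens to a single real tone when that condition fails. The 8 kHz rate matches the sampling rate used later in these slides; the tone frequencies and the 3400 Hz band edge are assumptions chosen for illustration.

```python
def satisfies_nyquist(f_max_hz, fs_hz):
    """True when the sampling rate exceeds twice the highest frequency."""
    return fs_hz > 2 * f_max_hz


def alias_frequency(f_hz, fs_hz):
    """Apparent frequency of a real tone of f_hz after sampling at fs_hz."""
    f = f_hz % fs_hz            # fold into [0, fs)
    return min(f, fs_hz - f)    # reflect into [0, fs/2]


FS = 8000
print(satisfies_nyquist(3400, FS))   # True
print(alias_frequency(3000, FS))     # 3000: below fs/2, unchanged
print(alias_frequency(5000, FS))     # 3000: above fs/2, folds back
```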
Quantization (Scalar
Quantization)
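[Figure: the axis from m_0 = -A to m_L = A is split at boundaries m_1, m_2, ..., m_k, m_{k+1}, ..., m_{L-1}; the intervals carry representative values v_1, v_2, ..., v_{k+1}, ..., v_L, with interval J_{k+1} marked]

 Assume $|x[n]| \le A$
 Divide the range $[-A, A]$ into $L$ quantization levels $\{J_1, J_2, \ldots, J_k, \ldots, J_L\}$, with $J_k : [m_{k-1}, m_k]$
 $L = 2^R$
 Each quantization level $J_k$ is represented by a value $v_k$
 $S = \bigcup_k J_k$, $V = \{v_1, v_2, \ldots, v_k, \ldots, v_L\}$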
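A minimal sketch of a uniform scalar quantizer built from these definitions follows. The slide does not say how v_k is chosen; taking the midpoint of each interval is an assumption made here, and the names `uniform_quantizer` and `quantize` are illustrative.

```python
def uniform_quantizer(A, R):
    """Build a quantizer for [-A, A] with L = 2**R equal-width intervals."""
    L = 2 ** R
    step = 2 * A / L

    def quantize(x):
        x = max(-A, min(A, x))                 # enforce |x| <= A
        k = min(int((x + A) / step), L - 1)    # 0-based interval index
        v = -A + (k + 0.5) * step              # midpoint (assumed choice of v)
        return k, v

    return quantize


q = uniform_quantizer(A=1.0, R=3)   # 8 levels, step 0.25
print(q(0.1))    # (4, 0.125)
print(q(1.0))    # (7, 0.875)
print(q(-1.0))   # (0, -0.875)
```

With midpoints as representatives, the error for any in-range sample is at most half a step.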
Non-Uniform Quantization
[Figure: axis from m_0 = -A through m_1, m_2, ..., 0, ... to m_L = A]

 Concept: small quantization levels for small x, large quantization levels for large x
 Goal: constant $\mathrm{SNR}_Q$ for all x
 Companding

x[n] -> Compressor F(x) -> Uniform Quantization -> ...1101... -> Uniform Decoder -> Expandor F^{-1}(x) -> x̂[n]

 Compressor + Expandor -> Compandor
 F(x) specifies the non-uniform quantization characteristics
Non-Uniform Quantization
 μ-law

$$F(x) = \frac{\log(1 + \mu x)}{\log(1 + \mu)}, \quad 0 \le x \le 1$$

 A-law

$$F(x) = \begin{cases} \dfrac{Ax}{1 + \ln A}, & 0 \le x \le \dfrac{1}{A} \\[1ex] \dfrac{1 + \ln(Ax)}{1 + \ln A}, & \dfrac{1}{A} \le x \le 1 \end{cases}$$

 Typical values in practice: μ = 255, A = 87.6
Types of Speech Codecs
 Waveform codecs, source codecs (also known as vocoders), and hybrid codecs
Speech Source Model and
Source Coding
[Block diagram: a random sequence generator (unvoiced) and a periodic pulse train generator (voiced, period N) feed a v/u switch; the selected signal is scaled by the gain G to give the excitation u[n], which drives the vocal tract model G(z) to produce x[n]]

 Vocal tract model: G(z), G(ω), g[n]

$$G(z) = \frac{1}{1 - \sum_{k=1}^{P} a_k z^{-k}}$$

 Excitation parameters (excitation signal u[n])
 v/u: voiced/unvoiced
 N: pitch for voiced
 G: signal gain
 Vocal tract parameters
 {a_k}: LPC coefficients (formant structure of speech signals)
 A good approximation, though not precise enough
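The source model can be sketched in a few lines: pick an excitation (pulse train or noise), scale it by the gain, and run it through the all-pole filter 1 / (1 − Σ a_k z^{-k}). The coefficient and pitch values below are arbitrary illustrations, not values from any real speech frame, and the function names are hypothetical.

```python
import random


def excitation(num_samples, voiced, pitch_period=80, gain=1.0, seed=0):
    """u[n]: periodic pulse train if voiced, Gaussian noise if unvoiced."""
    if voiced:
        return [gain if n % pitch_period == 0 else 0.0
                for n in range(num_samples)]
    rng = random.Random(seed)
    return [gain * rng.gauss(0.0, 1.0) for _ in range(num_samples)]


def synthesize(u, a):
    """All-pole filter: x[n] = u[n] + sum_{k=1..P} a[k-1] * x[n-k]."""
    x = []
    for n, un in enumerate(u):
        acc = un
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * x[n - k]
        x.append(acc)
    return x


# One pulse through a first-order filter gives a decaying response.
u = excitation(4, voiced=True, pitch_period=4)
print(synthesize(u, a=[0.5]))  # [1.0, 0.5, 0.25, 0.125]
```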
LPC Vocoder (Voice Coder)
 Transmitter: x[n] -> LPC Analysis -> {a_k}, N, G, v/u -> Encoder -> ...11011...
 N by pitch detection
 v/u by voicing detection
 Receiver: ...11011... -> Decoder -> {a_k}, N, G, v/u -> excitation (Ex) -> G(z), g[n] -> x[n]
 {a_k} can be non-uniform or vector quantized to reduce bit rate further
G.711
 The most commonplace codec
 Used in circuit-switched telephone network

 PCM, Pulse-Code Modulation

 If uniform quantization
 12 bits * 8 k samples/sec = 96 kbps

 Non-uniform quantization
 8 bits * 8 k samples/sec = 64 kbps, the DS0 rate

 μ-law
 North America
 A-law
 Other countries; a little friendlier to lower signal levels


 An MOS of about 4.3
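The PCM bit rates are simply bits per sample times samples per second. The short sketch below repeats that arithmetic; the helper name `bit_rate_kbps` is an illustrative choice.

```python
def bit_rate_kbps(bits_per_sample, sample_rate_hz=8000):
    """PCM bit rate in kbps for a given word size and sampling rate."""
    return bits_per_sample * sample_rate_hz / 1000


uniform_pcm = bit_rate_kbps(12)   # linear quantization
g711 = bit_rate_kbps(8)           # companded, 8 bits per sample
print(uniform_pcm, g711)          # 96.0 64.0
```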
ADPCM (Adaptive Differential PCM)
 DPCM and ADPCM
 ADPCM: adaptive prediction in DPCM, adaptive quantization
 Adaptive quantization
 Quantization level Δ varies with local signal level
 Δ[n] = a σ_x[n]
 σ_x[n]: locally estimated standard deviation of x[n]
 G.721: ADPCM-coded speech at 32 kbps
 G.726 (A-law or μ-law)
 16, 24, 32, 40 kbps
 MOS 4.0 at 32 kbps
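One way to picture the adaptive step size is to track the local signal power and scale the quantization level by its square root. The sketch below does this with an exponentially weighted average of past samples. The constants `a`, `alpha`, and `floor` are hypothetical, and this is not the adaptation rule of G.721 or G.726, only an illustration of a step size that follows the local standard deviation.

```python
import math


def adaptive_step_sizes(x, a=0.5, alpha=0.9, floor=1e-4):
    """Return one step size per sample, proportional to a running
    estimate of the standard deviation of the preceding samples."""
    power = 0.0
    steps = []
    for sample in x:
        sigma = math.sqrt(power)              # estimate before this sample
        steps.append(max(a * sigma, floor))   # never let the step reach zero
        power = alpha * power + (1 - alpha) * sample * sample
    return steps


quiet_then_loud = [0.01] * 50 + [1.0] * 50
steps = adaptive_step_sizes(quiet_then_loud)
print(steps[49] < steps[99])  # True: the step grows with the signal level
```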
Analysis-by-Synthesis (AbS)
Codecs
 Hybrid codec
 Fill the gap between waveform and source codecs

 The most successful and commonly used

 Time-domain AbS codecs

 Not a simple two-state, voiced/unvoiced

 Different excitation signals are attempted

 Closest to the original waveform is selected

 MPE, Multi-Pulse Excited

 RPE, Regular-Pulse Excited

 CELP, Code-Excited Linear Predictive
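The analysis-by-synthesis idea (try several excitations, keep the one whose synthesized output is closest to the original) can be sketched as an exhaustive search. Real codecs use a perceptually weighted error and far larger codebooks; the plain squared error, the tiny codebook, and the function names below are assumptions for illustration.

```python
def synthesize(u, a):
    """All-pole synthesis filter: x[n] = u[n] + sum_k a[k-1] * x[n-k]."""
    x = []
    for n, un in enumerate(u):
        acc = un
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * x[n - k]
        x.append(acc)
    return x


def abs_search(target, codebook, a):
    """Return (index, gain, error) of the best candidate excitation."""
    best = None
    for index, vector in enumerate(codebook):
        synth = synthesize(vector, a)
        energy = sum(s * s for s in synth)
        if energy == 0.0:
            continue
        # Least-squares gain for this candidate, then its residual error.
        gain = sum(t * s for t, s in zip(target, synth)) / energy
        error = sum((t - gain * s) ** 2 for t, s in zip(target, synth))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best


codebook = [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [1, -1, 0, 0, 0]]
a = [0.5]
target = [2 * s for s in synthesize(codebook[2], a)]
index, gain, error = abs_search(target, codebook, a)
print(index, round(gain, 6))  # 2 2.0
```

Only the winning index and gain would need to be sent, since the decoder holds the same codebook and filter.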


G.728 LD-CELP
 CELP codecs
 A filter; its characteristics change over time
 A codebook of acoustic vectors
 A vector = a set of elements representing various characteristics of the excitation
 Transmit
 Filter coefficients, gain, a pointer to the vector chosen

 Low Delay CELP


 Backward-adaptive coder
 Use previous samples to determine filter coefficients

 Operates on five samples at a time

 Delay < 1 ms

 Only the pointer is transmitted


 1024 vectors in the code book
 10-bit pointer (index)
 16 kbps
 LD-CELP encoder
 Minimize a frequency-weighted mean-square error
 LD-CELP decoder

 An MOS of about 3.9


 One-quarter of G.711 bandwidth
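The LD-CELP figures follow from the codebook size and the vector length. The sketch below repeats that arithmetic, assuming the 8 kHz sampling rate used elsewhere in these slides.

```python
SAMPLE_RATE_HZ = 8000       # assumed 8 kHz telephone-band sampling
SAMPLES_PER_VECTOR = 5      # "five samples at a time"
CODEBOOK_SIZE = 1024        # vectors in the codebook

index_bits = (CODEBOOK_SIZE - 1).bit_length()
vectors_per_second = SAMPLE_RATE_HZ // SAMPLES_PER_VECTOR
bit_rate_kbps = index_bits * vectors_per_second / 1000
buffering_delay_ms = 1000 * SAMPLES_PER_VECTOR / SAMPLE_RATE_HZ

print(index_bits)           # 10
print(bit_rate_kbps)        # 16.0
print(buffering_delay_ms)   # 0.625
print(bit_rate_kbps / 64)   # 0.25 of the G.711 rate
```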
G.723.1 ACELP
 6.3 or 5.3 kbps
 Both mandatory
 Can change from one to another during a conversation
 The coder
 A band-limited input speech signal
 Sampled at 8 kHz, 16-bit uniform PCM quantization
 Operate on blocks of 240 samples at a time
 A look-ahead of 7.5 ms
 A total algorithmic delay of 37.5 ms + other delays
 A high-pass filter to remove any DC component
 G.723.1 Annex A
 Silence Insertion Descriptor (SID) frames of size four octets
 The two least significant bits of the first octet give the frame type
 00: 6.3 kbps, 24 octets/frame
 01: 5.3 kbps, 20 octets/frame
 10: SID frame, 4 octets/frame
 An MOS of about 3.8
 At least 37.5 ms delay
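Because the frame type sits in the two least significant bits of the first octet, a receiver can walk a buffer of back-to-back frames without any extra length field. The sketch below shows that, plus the delay arithmetic from the slide. The function names are illustrative, and the code 11 is rejected here only because the slide does not list it.

```python
FRAME_TYPES = {
    0b00: ("6.3 kbps speech", 24),
    0b01: ("5.3 kbps speech", 20),
    0b10: ("SID", 4),
}


def classify_frame(first_octet):
    """Return (name, size in octets) from the two lsbs of the first octet."""
    code = first_octet & 0b11
    if code not in FRAME_TYPES:
        raise ValueError("frame type code 11 is not listed on the slide")
    return FRAME_TYPES[code]


def split_frames(payload):
    """Split a buffer of concatenated frames into (name, bytes) pairs."""
    frames, offset = [], 0
    while offset < len(payload):
        name, size = classify_frame(payload[offset])
        if offset + size > len(payload):
            raise ValueError("truncated frame")
        frames.append((name, payload[offset:offset + size]))
        offset += size
    return frames


frame_ms = 1000 * 240 / 8000          # 240 samples at 8 kHz
algorithmic_delay_ms = frame_ms + 7.5  # frame plus look-ahead
print(frame_ms, algorithmic_delay_ms)  # 30.0 37.5

payload = (bytes([0b00]) + bytes(23) + bytes([0b10]) + bytes(3)
           + bytes([0b01]) + bytes(19))
print([(name, len(data)) for name, data in split_frames(payload)])
```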
G.729
 8 kbps
 Input frames of 10 ms, 80 samples at an 8 kHz sampling rate
 5 ms look-ahead
 Algorithmic delay of 15 ms
 An 80-bit frame for 10 ms of speech
 A complex codec
 G.729A (Annex A), a number of simplifications
 Same frame structure
 Encoder/decoder interoperable between G.729 and G.729A
 Slightly lower quality
 G.729B (Annex B)
 VAD, Voice Activity Detection
 Based on analysis of several parameters of the input
 The current frame plus two preceding frames
 DTX, Discontinuous Transmission
 Send nothing or send an SID frame
 SID frame contains information to generate comfort
noise
 CNG, Comfort Noise Generation
 G.729, an MOS of about 4.0
 G.729A, an MOS of about 3.7
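The interplay of VAD and DTX can be pictured as a per-frame decision: send a speech frame, send an SID frame, or send nothing. The policy below (one SID frame at the start of each silent stretch) is a deliberately simplified assumption. A real Annex B sender has its own rules for when to refresh the SID frame, and those are not reproduced here.

```python
def dtx_decisions(vad_flags):
    """Map per-frame VAD flags (True = speech) to what is transmitted."""
    out = []
    previous_active = True
    for active in vad_flags:
        if active:
            out.append("speech")
        elif previous_active:
            out.append("sid")    # first silent frame: describe the noise
        else:
            out.append("none")   # receiver keeps generating comfort noise
        previous_active = active
    return out


flags = [True, True, False, False, False, True]
print(dtx_decisions(flags))
# ['speech', 'speech', 'sid', 'none', 'none', 'speech']
```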
Other Codecs
 CDMA QCELP defined in IS-733
 Variable-rate coder
 Two most common rates
 The high rate, 13.3 kbps
 A lower rate, 6.2 kbps
 Silence suppression
 For use with RTP, RFC 2658
 GSM Enhanced Full-Rate (EFR)
 GSM 06.60
 An enhanced version of GSM Full-Rate
 ACELP-based codec
 The same bit rate and the same overall
packing structure
 12.2 kbps
 Support discontinuous transmission
 For use with RTP, RFC 1890
 GSM Adaptive Multi-Rate (AMR) codec
 GSM 06.90
 Eight different modes
 4.75 kbps to 12.2 kbps
 12.2 kbps, GSM EFR
 7.4 kbps, IS-641 (TDMA cellular systems)
 Change the mode at any time
 Offer discontinuous transmission
 The coding choice of many 3G wireless
networks
 The MOS values are for laboratory
conditions
 G.711 does not deal with lost packets
 G.729 can accommodate a lost frame by
interpolating from previous frames
 But cause errors in subsequent speech frames
 Processing Power
 G.728 or G.729, 40 MIPS
 G.726 10 MIPS
 Cascaded Codecs
 E.g., G.711 stream -> G.729
encoder/decoder
 The result might not even come close to G.729 quality
 Each coder only generates an approximation of the incoming signal
Tones, Signal, and DTMF
Digits
 The hybrid codecs are optimized for human
speech
 Other data may need to be transmitted
 Tones: fax tones, dial tone, busy tone
 DTMF digits for two-stage dialing or voice-mail
 G.711 is OK
 G.723.1 and G.729 can be unintelligible
 The ingress gateway needs to intercept
 The tones and DTMF digits
 Use an external signaling system
 Easy at the start of a call
 Difficult in the middle of a call
 Encode the tones differently from the speech
 Send them along the same media path
 An RTP packet provides the name of the tone and the
duration
 Or, a dynamic RTP profile; an RTP packet containing the
frequency, volume and the duration
 RFC 2198
 An RTP payload format for redundant audio data

 Sending both types of RTP payload


 RTP Payload Format for DTMF Digits
 An Internet Draft
 Both methods described before
 A large number of tones and events
 DTMF digits, a busy tone, a congestion tone, a
ringing tone, etc.
 The named events
 E: the end of the tone, R: reserved
 Payload format
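A named-event payload can be packed and parsed with a few bit operations. The sketch below assumes the four-octet layout that was later published as RFC 2833 (event: 8 bits, E: 1 bit, R: 1 bit, volume: 6 bits, duration: 16 bits); the slide itself names only the E and R bits, so the remaining field widths are taken from that RFC, not from the slide. Function names are illustrative.

```python
import struct


def pack_event(event, end, volume, duration):
    """Build the 4-octet named-event payload (R bit left at zero)."""
    if not (0 <= event <= 255 and 0 <= volume <= 63
            and 0 <= duration <= 0xFFFF):
        raise ValueError("field out of range")
    flags = (0x80 if end else 0) | volume
    return struct.pack("!BBH", event, flags, duration)


def unpack_event(payload):
    """Parse the first 4 octets of a named-event payload."""
    event, flags, duration = struct.unpack("!BBH", payload[:4])
    return {
        "event": event,
        "end": bool(flags & 0x80),       # E: the end of the tone
        "reserved": bool(flags & 0x40),  # R: reserved
        "volume": flags & 0x3F,
        "duration": duration,            # in RTP timestamp units
    }


packet = pack_event(event=5, end=True, volume=10, duration=800)
print(packet.hex())          # 058a0320
print(unpack_event(packet))
```

With an 8 kHz RTP clock, a duration of 800 timestamp units corresponds to 100 ms.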
Finis
Speech Source Model and
Source Coding
 Vocal Tract Model
$$u[n] + \sum_{k=1}^{P} a_k\,x[n-k] = x[n]$$

$$G(z) = \frac{X(z)}{U(z)} = \frac{1}{1 - \sum_{k=1}^{P} a_k z^{-k}}$$
