
Table of Contents

Table of Contents ………………………………………………………...………….……….. i


List of Figures …………………………………………………………………..…………… iii
List of Tables……………………………………………………………………..…………… iv
List of Abbreviations…………………………………………………………….…………… v
Acknowledgement……………………………………………………………………………..vi
Abstract……………….………………………………………………………….…………...viii

1 INTRODUCTION ……………………………………………………………..…. 1
1.1 Motivation ……………………………………………………………….. 1
1.2 Overview of Speech Coding ……………………………………………... 2
1.3 Applications of Speech Coders ………………………..………………… 2
1.4 Objective of Present Work ………………………………………………. 3
1.5 Report Organization ………………………………………………….….. 3

2 LITERATURE REVIEW ……………………………..………………………….. 4


2.1 Introduction ……………………………………..……………………….. 4
2.2 Basic Issues in Speech Coding …………………...……………………… 5
2.3 Speech Coding Techniques and Functionalities ………………..……….. 6
2.4 Speech Coding Standards ……………………………………..………… 6

3 PRESENT WORK.……………………………………………………………….. 9
3.1 Structure of Speech Coders ……………………………………………… 9
3.2 Classification of Speech Coders …………………………………...…… 14
3.2.1 Classification by Bit-Rate ……………………………….…… 14
3.2.2 Classification by Coding Techniques ………………….…….. 15
3.3 About Algorithms …………………………………….………………… 16
3.4 Pulse Code Modulation ………………………………………………… 17
3.4.1 Modulation …………………………………………………… 18
3.4.2 Demodulation ………………………………………………… 19
3.4.3 Digitization …………………………………………………… 19
3.5 Differential Pulse Code Modulation …………………………………… 20
3.6 Other Popular Algorithms ……………………………………………… 21
4 RESULTS AND DISCUSSIONS ………………………………………………. 23
4.1 Implementation Details ………………………………………………… 23
4.2 Results ………………………………………………………………….. 23

5 CONCLUSION …………………………………………………………………. 26

6 FUTURE SCOPE ……………………………………………………………….. 27

REFERENCES
List of Figures

Figure 3.1 Block diagram of a speech coding system


Figure 3.2 Block diagram of a speech coder.
Figure 3.3 System for delay measurement.
Figure 3.4 Illustration of the components of coding delay.
Figure 3.5 Sampling and quantization of a signal (red) for 4-bit PCM
Figure 4.1 MATLAB Simulation of PCM.
Figure 4.2 MATLAB Simulation of DPCM.
List of Tables

Table 2.1 Summary of Major Speech Coding Standards


Table 3.1 Classification of speech coders according to bit-rate
Table 4.1 Results for Quantization Bits = 16 and Sampling Frequency = 8 kHz
Table 4.2 Results for Quantization Bits = 8 and Sampling Frequency = 8 kHz
Table 4.3 Results for Quantization Bits = 16 and Sampling Frequency = 16 kHz
List of Abbreviations

3G Third Generation
AbS Analysis-by-Synthesis
ACELP Algebraic Code-Excited Linear Prediction
ACR Absolute Category Rating
ADPCM Adaptive Differential Pulse Code Modulation
CDMA Code Division Multiple Access
CELP Code-Excited Linear Prediction
DMOS Degradation Mean Opinion Score
DoD U.S. Department of Defense
DPCM Differential Pulse Code Modulation
DSVD Digital Simultaneous Voice and Data
DTAD Digital Telephone Answering Device
GSM Groupe Speciale Mobile
ICASSP International Conference on Acoustics, Speech, and Signal Processing
IDFT Inverse discrete Fourier transform
IEC International Electrotechnical Commission
IEEE Institute of Electrical and Electronics Engineers
IP Internet Protocol
ITU International Telecommunications Union
ITU–R ITU–Radiocommunication Sector
ITU–T ITU–Telecommunications Standardization Sector
MOS Mean Opinion Score
NCS National Communications System
PC Personal Computer
PCM Pulse Code Modulation
POTS Plain Old Telephone Service
PSTN Public Switched Telephone Network
RAM Random Access Memory
RC Reflection Coefficient
RCR Research and Development Center for Radio Systems of Japan
RMS Root Mean Square
ROM Read Only Memory
SNR Signal to Noise Ratio
TDMA Time Division Multiple Access
TI Texas Instruments
TIA Telecommunications Industry Association
TTS Text to Speech
UMTS Universal Mobile Telecommunications System
VBR Variable Bit-rate
VoIP Voice over Internet Protocol
VSELP Vector Sum Excited Linear Prediction
ACKNOWLEDGEMENT

We take this opportunity to acknowledge the cooperation, goodwill and support,
both moral and technical, extended by the many individuals involved in this
report. We shall always cherish our associations with them.

We wish to accord our sincere thanks to the Head of the Electrical Engineering
Department, Prof. Ram Naresh Sharma, for providing us the opportunity to execute
this project. We extend our kind thanks to our guide, Mr. Himesh Handa, for his
cooperation until the completion of this project. We would also like to
acknowledge the support and infrastructure provided by the college to complete
this work.
Abstract
With the advancement in technology, the application of low bit-rate speech coders to
civilian and military communications as well as computer-related voice applications
is substantially progressing. Today, speech coders have become essential components
in telecommunications and in the multimedia infrastructure. Central to this progress
has been the development of new speech coders capable of producing high-quality
speech at low data rates. Most of these coders incorporate mechanisms to: represent
the spectral properties of speech, provide for speech waveform matching, and
“optimize” the coder’s performance for the human ear. A number of these coders
have already been adopted in national and international cellular telephony
standards.
Commercial systems that rely on efficient speech coding include cellular
communication, voice over internet protocol (VoIP), videoconferencing, electronic
toys, archiving, and digital simultaneous voice and data (DSVD), as well as
numerous PC-based games and multimedia applications.
In mobile communication systems, service providers continuously face
the challenge of accommodating more users within a limited allocated
bandwidth. For this reason, manufacturers and service providers are continuously
in search of low bit-rate speech coders that deliver toll-quality speech.
The objective of this project is to study commonly used speech coding
algorithms. The project report starts with the description of these speech coders.
Then we present our implementation results and finally give concluding remarks
followed by comments on future research in this area.
CHAPTER 1

INTRODUCTION

Speech coding is the process of obtaining a compact representation of voice signals
for efficient transmission over band-limited wired and wireless channels and/or
storage. In general, speech coding is a procedure to represent a digitized speech signal
using as few bits as possible while maintaining a reasonable level of
speech quality [1]. A less common name with the same meaning is speech
compression. Speech coding has matured to the point where it now constitutes an
important application area of signal processing.

1.1 MOTIVATION
In the era of third-generation (3G) wireless personal communications standards,
despite the emergence of broad-band access network standard proposals, the most
important mobile radio services are still based on voice communications. Even when
the predicted surge of wireless data and Internet services becomes a reality, voice will
remain the most natural means of human communication, although it may be
delivered via the Internet, predominantly after compression.
Due to the increasing demand for speech communication, speech coding
technology has received augmenting levels of interest from the research,
standardization, and business communities. Advances in microelectronics and the vast
availability of low-cost programmable processors and dedicated chips have enabled
rapid technology transfer from research to product development; this encourages the
research community to investigate alternative schemes for speech coding, with the
objectives of overcoming deficiencies and limitations [2]. The standardization
community pursues the establishment of standard speech coding methods for various
applications that will be widely accepted and implemented by the industry. The
business communities capitalize on the ever-increasing demand and opportunities in
the consumer, corporate, and network environments for speech processing products.
1.2 OVERVIEW OF SPEECH CODING
This section describes the structure, properties, and applications of speech coding
technology.
Speech coding is the art of creating a minimally redundant representation of the
speech signal that can be efficiently transmitted or stored in digital media, and
decoding the signal with the best possible perceptual quality. Like any other
continuous-time signal, speech may be represented digitally through the processes of
sampling and quantization; speech is typically quantized using either 16-bit uniform
or 8-bit companded quantization [2]. Like many other signals, however, a sampled
speech signal contains a great deal of information that is either redundant (nonzero
mutual information between successive samples in the signal) or perceptually
irrelevant (information that is not perceived by human listeners). Most
telecommunications coders are lossy, meaning that the synthesized speech is
perceptually similar to the original but may be physically dissimilar.
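The 8-bit companded quantization mentioned above can be made concrete with μ-law companding, the scheme standardized for North American telephony in ITU-T G.711. The sketch below (in Python rather than the MATLAB used later in this report; μ = 255) compands a signal, applies 8-bit uniform quantization in the companded domain, and expands back, so that low-amplitude samples get finer effective steps:

```python
import numpy as np

MU = 255.0  # mu-law parameter used in North American G.711

def mu_law_compress(x):
    """Compand a signal in [-1, 1]: expands small amplitudes so an
    8-bit uniform quantizer spends its levels where speech lives."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mu_law_expand(y):
    """Inverse companding: map the companded value back to linear."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

def quantize(y, bits=8):
    """Uniform quantization to 2**bits levels over [-1, 1]."""
    levels = 2.0 ** (bits - 1)
    return np.round(y * levels) / levels

x = np.linspace(-1, 1, 101)
x_hat = mu_law_expand(quantize(mu_law_compress(x)))
print(np.max(np.abs(x - x_hat)) < 0.05)  # small reconstruction error
```

Note that actual G.711 uses a segmented (piecewise-linear) approximation of this curve rather than the continuous formula shown here.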
Speech coding is performed using numerous steps or operations specified as
an algorithm. An algorithm is any well-defined computational procedure that takes
some value, or set of values, as input and produces some value, or set of values, as
output. An algorithm is thus a sequence of computational steps that transform the
input into the output. Many signal processing problems—including speech coding—
can be formulated as a well-specified computational problem; hence, a particular
coding scheme can be defined as an algorithm. In general, an algorithm is specified
with a set of instructions, providing the computational steps needed to perform a task.
With these instructions, a computer or processor can execute them so as to complete
the coding task. The instructions can also be translated to the structure of a digital
circuit, carrying out the computation directly at the hardware level [2].

1.3 APPLICATIONS OF SPEECH CODERS


Speech coding has an important role in modern voice-enabled technology, particularly
for digital speech communication, where quality and complexity have a direct impact
on the marketability and cost of the underlying products or services [3]. There are
many speech coding standards, each designed to suit the needs of a given application.
More recently, with the explosive growth of the internet, the potential market
of voice over internet protocol (voice over IP, or VoIP) has lured many companies to
develop products and services around the concept. Speech coding will play a central
role in this revolution.
Another smaller-scale area of application includes voice storage or digital
recording, with some outstanding representatives being the digital telephone
answering device (DTAD) and solid-state recorders. For these products to be
competitive in the marketplace, their costs must be driven to a minimum. By
compressing the digital speech signal before storage, longer-duration voice messages
can be recorded for a given amount of memory chips, leading to improved cost
effectiveness [3].
Techniques developed for speech coding have also been applied to other
application areas such as speech synthesis, audio coding, speech recognition, and
speaker recognition. Due to the central position that speech coding occupies in
modern technology, it will remain a focus of attention for years to come.

1.4 OBJECTIVE OF PRESENT WORK


The main objectives of the project can be divided into three goals:
 To study the basics of speech coding.
 To design PCM and DPCM speech coders in MATLAB, capable of coding and
decoding an input speech signal.
 To compare the PCM and DPCM coders in terms of speech quality, coding delay,
error, etc.

1.5 REPORT ORGANIZATION


The report is divided into 6 chapters. Chapter 1 provides an introduction and an
overview of the subjects covered, including the motivation for, overview of, and
applications of speech coding. Chapter 2 reviews speech coding issues, coding
techniques, and standards. Chapter 3 deals with the present work:
speech coding is explained in detail and the PCM and DPCM techniques are described.
Chapter 4 contains the implementation results, in which the PCM and DPCM coders are
compared. Chapter 5 concludes the project work. Finally, Chapter 6 discusses
the future scope of speech coding.
CHAPTER 2

LITERATURE REVIEW

The history of audio and music compression began in the 1930s with research
into pulse-code modulation (PCM). Compression of digital audio
began in the 1960s with telephone companies, which were concerned with the cost of
transmission bandwidth. The 1990s saw improvements in these earlier
algorithms and an increase in compression ratios at given audio quality levels. Speech
compression is often referred to as speech coding, which is defined as a method for
reducing the amount of information needed to represent a speech signal. Most forms
of speech coding are based on a lossy algorithm. Lossy algorithms are
considered acceptable when encoding speech because the loss of quality is often
undetectable to the human ear.

2.1 INTRODUCTION
Speech coding is fundamental to the operation of the public switched telephone
network (PSTN), videoconferencing systems, digital cellular communications, and
emerging voice over Internet protocol (VoIP) applications. The goal of speech coding
is to represent speech in digital form with as few bits as possible while maintaining
the intelligibility and quality required for the particular application [4]. Interest in
speech coding is motivated by the evolution to digital communications and the
requirement to minimize bit rate, and hence, conserve bandwidth. There is always a
tradeoff between lowering the bit rate and maintaining the delivered voice quality and
intelligibility; however, depending on the application, many other constraints also
must be considered, such as complexity, delay, and performance with bit errors or
packet losses.
Based on these developments, it is possible today, and it is likely in the near
future, that our day-to-day voice communications will involve multiple hops
including heterogeneous networks. This is a considerable departure from the plain old
telephone service (POTS) on the PSTN, and indeed, these future voice connections
will differ greatly even from the digital cellular calls connected through the PSTN
today. As the networks supporting our voice calls become less homogeneous and
include more wireless links, many new challenges and opportunities emerge. There
was almost exponential growth of speech coding standards in the 1990s for a wide
range of networks and applications, including the PSTN, digital cellular, and
multimedia streaming.
In order to compare the various speech coding methods and standards, it is
necessary to have methods for establishing the quality and intelligibility produced by
a speech coder. It is a difficult task to find objective measures of speech quality, and
often, the only acceptable approach is to perform subjective listening tests [5].
However, there have been some recent successes in developing objective quantities,
experimental procedures, and mathematical expressions that have a good correlation
with speech quality and intelligibility.

2.2 BASIC ISSUES IN SPEECH CODING


Speech and audio coding can be classified according to the bandwidth occupied by
the input and the reproduced source. Narrowband or telephone bandwidth speech
occupies the band from 200 to 3400 Hz, while wideband speech is contained in the
range of 50 Hz to 7 kHz. High quality audio is generally taken to cover the range of
20 Hz to 20 kHz.
Given a particular source, the classic tradeoff in lossy source compression is
rate versus distortion: the higher the rate, the smaller the average distortion in the
reproduced signal. Of course, since a higher bit rate implies a greater bandwidth
requirement, the goal is always to minimize the rate required to satisfy the distortion
constraint. For speech coding, we are interested in achieving a quality as close to the
original speech as possible. Encompassed in the term quality are intelligibility,
speaker identification, and naturalness. Absolute category rating (ACR) tests are
subjective tests of speech quality in which listeners assign a category and rating to
each speech utterance according to classifications such as Excellent (5), Good (4),
Fair (3), Poor (2), and Bad (1). The average for each utterance over all listeners is the
Mean Opinion Score (MOS) [5].
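With hypothetical ratings from an ACR test, the MOS computation is simply an average over listeners (the scores below are invented for illustration):

```python
# Hypothetical ACR scores for one utterance from eight listeners
# (5 = Excellent, 4 = Good, 3 = Fair, 2 = Poor, 1 = Bad)
ratings = [4, 5, 3, 4, 4, 3, 5, 4]

# The Mean Opinion Score is the average rating over all listeners
mos = sum(ratings) / len(ratings)
print(f"MOS = {mos:.2f}")  # MOS = 4.00
```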
Although important, the MOS values obtained by listening to isolated
utterances do not capture the dynamics of conversational voice communications in the
various network environments. It is intuitive that speech codecs should be tested,
within the environment and while executing the tasks, for which they are designed.
Thus, since we are interested in conversational (two-way) voice communications, a
more realistic test would be conducted in this scenario. Recently, the perceptual
evaluation of speech quality (PESQ) method was developed to provide an assessment
of speech codec performance in conversational voice communications. The PESQ has
been standardized by the ITU-T as P.862 and can be used to generate MOS values for
both narrowband and wideband speech.

2.3 SPEECH CODING TECHNIQUES AND FUNCTIONALITIES


The most common approaches to narrowband speech coding today center around two
paradigms, namely, waveform-following coders and analysis-by-synthesis methods.
Waveform-following coders attempt to reproduce the time domain speech waveform
as accurately as possible, while analysis-by-synthesis methods utilize the linear
prediction model and a perceptual distortion measure to reproduce only those
characteristics of the input speech that are determined to be most important [5].
Another approach to speech coding breaks the speech into separate frequency bands,
called subbands, and then codes these subbands separately, perhaps using a waveform
coder or analysis-by-synthesis coding, for reconstruction and recombination at the
receiver. Extending the resolution of the frequency domain decomposition leads to
transform coding, wherein a transform is performed on a frame of input speech and
the resulting transform coefficients are quantized and transmitted to reconstruct the
speech from the inverse transform. The class of speech coders called vocoders, or
purely parametric coders, is not discussed here, as they are outside the scope of
this project and have a more limited range of applications today.

2.4 SPEECH CODING STANDARDS


Standards exist because there are strong needs to have common means for
communication: it is to everyone’s best interest to be able to develop and utilize
products and services based on the same reference [2]. The standard bodies are
organizations responsible for overseeing the development of standards for a particular
application. Brief descriptions of some well-known standard bodies are given here.
 International Telecommunications Union (ITU): The Telecommunications
Standardization Sector of the ITU (ITU-T) is responsible for creating speech
coding standards for network telephony. This includes both wired and wireless
networks.
 Telecommunications Industry Association (TIA): The TIA is in charge of
promulgating speech coding standards for specific applications. It is part of the
American National Standards Institute (ANSI). The TIA has successfully
developed standards for North American digital cellular telephony, including
time division multiple access (TDMA) and code division multiple access
(CDMA) systems.
 European Telecommunications Standards Institute (ETSI): The ETSI has
memberships from European countries and companies and is mainly an
organization of equipment manufacturers. ETSI is organized by application;
the most influential group in speech coding is the Groupe Speciale Mobile
(GSM), which has several prominent standards under its belt.
 United States Department of Defense (DoD): The DoD is involved with the
creation of speech coding standards, known as U.S. Federal standards, mainly
for military applications.
 Research and Development Center for Radio Systems of Japan (RCR): Japan’s
digital cellular standards are created by the RCR.
Table 2.1 Summary of Major Speech Coding Standards
CHAPTER 3

PRESENT WORK

The goal of all speech coding systems is to transmit speech with the highest possible
quality using the least possible channel capacity. In general, there is a positive
correlation between coder bit-rate efficiency and the algorithmic complexity required
to achieve it. The more complex an algorithm, the greater its processing delay and
implementation cost. A speech coder converts a digitized speech signal into a
coded representation, which is usually transmitted in frames. A speech decoder
receives coded frames and synthesizes reconstructed speech. Standards typically
dictate the input–output relationships of both coder and decoder. The input–output
relationship is specified using a reference implementation, but novel implementations
are allowed, provided that input–output equivalence is maintained. Speech coders
differ primarily in bit rate (measured in bits per sample or bits per second),
complexity (measured in operations per second), delay (measured in milliseconds
between recording and playback), and perceptual quality of the synthesized speech.

3.1 STRUCTURE OF SPEECH CODERS


Figure 3.1 shows the block diagram of a speech coding system. The continuous-time
analog speech signal from a given source is digitized by a standard connection of
filter (eliminates aliasing), sampler (discrete-time conversion), and analog-to-digital
converter (uniform quantization is assumed). The output is a discrete-time speech
signal whose sample values are also discretized. This signal is referred to as the
digital speech [2].

Figure 3.1 Block diagram of a speech coding system
Traditionally, most speech coding systems were designed to support
telecommunication applications, with the frequency contents limited between 300 and
3400 Hz. According to the Nyquist theorem, the sampling frequency must be at least
twice the bandwidth of the continuous-time signal in order to avoid aliasing. A value
of 8 kHz is commonly selected as the standard sampling frequency for speech signals.
To convert the analog samples to a digital format using uniform quantization while
maintaining toll quality [Jayant and Noll, 1984] (the digital speech will be roughly
indistinguishable from the bandlimited input), more than 8 bits/sample are necessary.
The use of 16 bits/sample provides a quality that is considered high. Throughout this
report, the following parameters are assumed for the digital speech signal:
Sampling frequency = 8 kHz
Number of bits per sample = 16
This gives rise to
Bit-rate = 8 kHz * 16 bits = 128 kbps

The above bit-rate, also known as input bit-rate, is what the source encoder
attempts to reduce (Figure 3.1). The output of the source encoder represents the
encoded digital speech and in general has substantially lower bit-rate than the input.
The linear prediction coding algorithm, for instance, has an output rate of 2.4 kbps, a
reduction of more than 53 times with respect to the input. The encoded digital speech
data is further processed by the channel encoder, providing error protection to the bit-
stream before transmission to the communication channel, where various noise and
interference can degrade the reliability of the transmitted data. Even though in Figure
3.1 the source encoder and channel encoder are separate, it is also possible to
implement them jointly so that source and channel encoding are done in a single step. The
channel decoder processes the error-protected data to recover the encoded data, which
is then passed to the source decoder to generate the output digital speech signal,
having the original rate. This output digital speech signal is converted to continuous-
time analog form through standard procedures: digital to analog conversion followed
by anti-aliasing filtering [2].
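The rate arithmetic in this section can be summarized in a few lines (a Python sketch for illustration; the 2.4 kbps LPC figure is the one quoted above):

```python
def pcm_bit_rate(fs_hz, bits_per_sample):
    """Input (uncompressed) bit-rate of digital speech, in bits per second."""
    return fs_hz * bits_per_sample

input_rate = pcm_bit_rate(8000, 16)     # the 128 kbps reference rate
lpc_rate = 2400                         # LPC output rate quoted in the text
print(input_rate)                       # 128000
print(round(input_rate / lpc_rate, 1))  # 53.3, i.e. "more than 53 times"
```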
The input speech (a discrete-time signal having a bit-rate of 128 kbps) enters
the encoder to produce the encoded bit-stream, or compressed speech data. Bit-rate of
the bit-stream is normally much lower than that of the input speech.

Figure 3.2 Block diagram of a speech coder.

The decoder takes the encoded bit-stream as its input to produce the output
speech signal, which is a discrete-time signal having the same rate as the input speech.
As described later in this report, many diverse approaches can be used to design the
encoder/decoder pair. Different methods provide differing speech quality and bit-rate,
as well as implementation complexity. The encoder/decoder structure represented in
Figure 3.2 is known as a speech coder, where the input speech is encoded to produce a
low-rate bit-stream. This bit-stream is input to the decoder, which constructs an
approximation of the original signal.

Desirable Properties of a Speech Coder


The main goal of speech coding is either to maximize the perceived quality at a
particular bit-rate, or to minimize the bit-rate for a particular perceptual quality. The
appropriate bit-rate at which speech should be transmitted or stored depends on the
cost of transmission or storage, the cost of coding (compressing) the digital speech
signal, and the speech quality requirements [2]. In almost all speech coders, the
reconstructed signal differs from the original one. The bit-rate is reduced by
representing the speech signal (or parameters of a speech production model) with
reduced precision and by removing inherent redundancy from the signal, resulting
therefore in a lossy coding scheme. Desirable properties of a speech coder include:
 Low Bit-Rate: The lower the bit-rate of the encoded bit-stream, the less
bandwidth is required for transmission, leading to a more efficient system.
This requirement is in constant conflict with other good properties of the
system, such as speech quality. In practice, a trade-off is found to satisfy the
necessity of a given application.
 High Speech Quality: The decoded speech should have a quality acceptable
for the target application. There are many dimensions in quality perception,
including intelligibility, naturalness, pleasantness, and speaker recognizability.
 Robustness across Different Speakers / Languages: The underlying technique
of the speech coder should be general enough to model different speakers
(adult male, adult female, and children) and different languages adequately.
Note that this is not a trivial task, since each voice signal has its unique
characteristics.
 Robustness in the Presence of Channel Errors: This is crucial for digital
communication systems where channel errors will have a negative impact on
speech quality.
 Good Performance on Non-speech Signals (e.g., telephone signaling): In a
typical telecommunication system, other signals might be present besides
speech. Signaling tones such as dual-tone multi-frequency (DTMF) in keypad
dialing, and music, are often encountered. Even though low bit-rate speech
coders might not be able to reproduce all such signals faithfully, they should not
generate annoying artifacts when facing these alternate signals.
 Low Memory Size and Low Computational Complexity: In order for the speech
coder to be practicable, costs associated with its implementation must be low;
these include the amount of memory needed to support its operation, as well as
computational demand. Speech coding researchers spend a great deal of effort
to find out the most efficient realizations.
 Low Coding Delay: In the process of speech encoding and decoding, delay is
inevitably introduced: the time shift between the input speech at the encoder
and the output speech at the decoder. An excessive delay creates problems with
real-time two-way conversations, where the parties tend to "talk over" each
other.

Coding Delay
Consider the delay measured using the topology shown in Figure 3.3. The delay
obtained in this way is known as coding delay, or one-way coding delay [Chen,
1995], which is given by the elapsed time from the instant a speech sample arrives at
the encoder input to the instant when the same speech sample appears at the decoder
output [2]. The definition does not consider exterior factors, such as communication
distance or equipment, which are not controllable by the algorithm designer.

Figure 3.3 System for delay measurement.


Figure 3.4 Illustration of the components of coding delay.

Based on the definition, the coding delay can be decomposed into four major
components (see Figure 3.4):
1. Encoder Buffering Delay: Many speech encoders require the collection of a
certain number of samples before processing. For instance, typical linear
prediction (LP)-based coders need to gather one frame of samples ranging
from 160 to 240 samples, or 20 to 30 ms, before proceeding with the actual
encoding process.
2. Encoder Processing Delay: The encoder consumes a certain amount of time to
process the buffered data and construct the bit-stream. This delay can be
shortened by increasing the computational power of the underlying platform
and by utilizing efficient algorithms. The processing delay must be shorter
than the buffering delay; otherwise the encoder will not be able to handle data
from the next frame.
3. Transmission Delay: Once the encoder finishes processing one frame of input
samples, the resultant bits representing the compressed bit-stream are
transmitted to the decoder. Many transmission modes are possible and the
choice depends on the particular system requirements.
4. Decoder Processing Delay: This is the time required to decode in order to
produce one frame of synthetic speech. As for the case of the encoder
processing delay, its upper limit is given by the encoder buffering delay, since
a whole frame of synthetic speech data must be completed within this time
frame in order to be ready for the next frame.
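The four components above add up to the one-way coding delay. The sketch below totals them for a hypothetical coder: the 20 ms buffering delay follows the 160-sample LP frame mentioned above, while the processing and transmission figures are assumed purely for illustration:

```python
fs = 8000            # sampling frequency, Hz
frame_samples = 160  # one LP frame, as in the text

# 1. Encoder buffering delay: one full frame must be collected first
encoder_buffering_ms = 1000.0 * frame_samples / fs   # 20.0 ms

# 2-4. Assumed figures; both processing delays must stay below the
# buffering delay so each frame is finished before the next arrives
encoder_processing_ms = 10.0
transmission_ms = 20.0
decoder_processing_ms = 10.0

one_way_delay_ms = (encoder_buffering_ms + encoder_processing_ms
                    + transmission_ms + decoder_processing_ms)
print(one_way_delay_ms)  # 60.0
```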

3.2 CLASSIFICATION OF SPEECH CODERS


There is no clear separation between various approaches of classification of speech
coders. This section presents some existent classification criteria. The speech coders
can be classified according to their bit-rate or by their coding technique. There are
other approaches also like single mode or multimode coders. The most popular and
widely used approaches of classification of speech coders are:
 Bit-rate
 Coding technique

3.2.1 Classification by Bit-Rate


All speech coders are designed to reduce the reference bit-rate of 128 kbps toward
lower values. Depending on the bit-rate of the encoded bit-stream, it is common to
classify the speech coders according to Table 3.1.

Table 3.1 Classification of speech coders according to bit-rate

A given method works well within a certain bit-rate range, but the quality of the
decoded speech drops drastically if the bit-rate is decreased below a certain threshold.
The minimum bit-rate that speech coders can achieve is limited by the information
content of the speech signal. Judging from the recoverable message rate from a
linguistic perspective for typical speech signals, it is reasonable to say that the
minimum lies somewhere around 100 bps. Current coders can produce good quality at
2 kbps and above, suggesting that there is plenty of room for future improvement.

3.2.2 Classification by Coding Techniques


Waveform Coders: An attempt is made to preserve the original shape of the signal
waveform, and hence the resultant coders can generally be applied to any signal
source. These coders are better suited for high bit-rate coding, since performance
drops sharply with decreasing bit-rate. In practice, these coders work best at a bit-rate
of 32 kbps and higher. Signal-to-noise ratio (SNR) can be utilized to measure the
quality of waveform coders. Some examples of this class include various kinds of
pulse code modulation (PCM) and adaptive differential PCM (ADPCM).

Parametric Coders: Within the framework of parametric coders, the speech signal is
assumed to be generated from a model, which is controlled by some parameters.
During encoding, parameters of the model are estimated from the input speech signal,
with the parameters transmitted as the encoded bit-stream. This type of coder makes
no attempt to preserve the original shape of the waveform, and hence SNR is a useless
quality measure. Perceptual quality of the decoded speech is directly related to the
accuracy and sophistication of the underlying model. Due to this limitation, the coder
is signal specific, having poor performance for non-speech signals.
There are several proposed models in the literature. The most successful,
however, is based on linear prediction. In this approach, the human speech production
mechanism is summarized using a time-varying filter, with the coefficients of the
filter found using the linear prediction analysis procedure.

Hybrid Coders: As its name implies, a hybrid coder combines the strengths of a
waveform coder with those of a parametric coder. Like a parametric coder, it relies on
a speech production model; during encoding, parameters of the model are estimated.
Additional parameters of the model are optimized in such a way that the decoded
speech is as close as possible to the original waveform, with the closeness often
measured by a perceptually weighted error signal. As in waveform coders, an attempt
is made to match the original signal with the decoded signal in the time domain.
This class dominates the medium bit-rate coders, with the code-excited linear
prediction algorithm and its variants the most outstanding representatives. From a
technical perspective, the difference between a hybrid coder and a parametric coder is
that the former attempts to quantize or represent the excitation signal to the speech
production model, which is transmitted as part of the encoded bit-stream. The latter,
however, achieves low bit-rate by discarding all detail information of the excitation
signal; only coarse parameters are extracted. A hybrid coder tends to behave like a
waveform coder for high bit-rate, and like a parametric coder at low bit-rate, with fair
to good quality for medium bit-rate.

3.3 ABOUT ALGORITHMS

A speech coder is generally specified as an algorithm, which is defined as a
computational procedure that takes some input values to produce some output values.
An algorithm can be implemented as software (i.e., a program to command a
processor) or as hardware (direct execution through digital circuitry) [6]. With the
widespread availability of low-cost high-performance digital signal processors (DSPs)
and general-purpose microprocessors, many signal processing tasks that were once done
using analog circuitry are now predominantly executed in the digital domain.
The advantages of going digital are many: programmability, reliability, and the
ability to handle very complex procedures, such as the operations involved in a
speech coder, that would be impractical to realize with analog hardware. In this
section the various aspects of algorithmic implementation are explained.

The Reference Code


It is the trend for most standard bodies to come up with a reference source code for
their standards, where code refers to the algorithm or program written in text form.
The source code is elaborated with some high-level programming language, with the
C language being the most commonly used [Harbison and Steele, 1995]. In this
reference code, the different components of the speech coding algorithm are
implemented. Normally, there are two main functions, encode and decode, which handle
the operations of the encoder and the decoder, respectively. The reference source code
is very general and might not be optimized for speed or storage; therefore, it is an
engineering task to adjust the code so as to suit a given platform. Since different
processors have different strengths and weaknesses, the adjustment must be custom
made; in many instances, this translates into assembly language programming. The
task normally consists of changing certain parts of the algorithm so as to speed up the
computational process or to reduce memory requirements. Depending on the platform,
the adjustment of the source code can be relatively easy or extremely hard; or it may
even be unrealizable, if the available resources are not enough to cover the demand of
the algorithm. A supercomputer, for instance, is a platform where there are abundant
memory and computational power; minimum change is required to make an algorithm
run under this environment [7]. The personal computer (PC), on the other hand, has a
moderate amount of memory and computational power; so adjustment is desirable to
speed up the algorithm, but memory might not be such a big concern. A cellular
handset is an example where memory and computational power are limited; the code
must be adjusted carefully so that the algorithm runs within the restricted
confinements. To verify that a given implementation is accurate, standard bodies
often provide a set of test vectors. That is, a given input test vector must produce a
corresponding output vector. Any deviation will be considered a failure to conform to
the specification.
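The test-vector conformance check described above can be sketched as follows (in Python for illustration; the vectors and names here are hypothetical, not drawn from any actual standard's test set):

```python
def conforms(output_samples, reference_samples):
    """Bit-exact conformance check: any deviation from the reference
    test vector counts as a failure to meet the specification."""
    return len(output_samples) == len(reference_samples) and all(
        a == b for a, b in zip(output_samples, reference_samples)
    )

# Hypothetical vectors: the implementation's output must match exactly.
reference = [0, 12, -7, 3]
print(conforms([0, 12, -7, 3], reference))  # True
print(conforms([0, 12, -6, 3], reference))  # False: one sample deviates
```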

3.4 PULSE CODE MODULATION

Pulse-code modulation (PCM) is a digital representation of an analog signal in which
the magnitude of the signal is sampled regularly at uniform intervals, then quantized
to a series of symbols in a numeric (usually binary) code. PCM has been used in
digital telephone systems and 1980s-era electronic musical keyboards. Uncompressed
PCM is not typically used in standard-definition consumer video applications such
as DVD or DVR, because the required bit rate is far too high. However, the Blu-ray
format, whose capacity far exceeds that of previous media, sometimes
allows producers to include a full PCM soundtrack. The word pulse in
the term Pulse-Code Modulation refers to the "pulses" to be found in the transmission
line. This perhaps is a natural consequence of this technique having evolved alongside
two analog methods, pulse width modulation and pulse position modulation, in which
the information to be encoded is in fact represented by discrete signal pulses of
varying width or position, respectively [8]. In this respect, PCM bears little
resemblance to these other forms of signal encoding, except that all can be used in
time division multiplexing, and the binary numbers of the PCM codes are represented
as electrical pulses. The device that performs the coding and decoding function in a
telephone circuit is called a codec.

3.4.1 Modulation
A sine wave (red curve) is sampled and quantized for PCM. The sine wave is sampled
at regular intervals, shown as ticks on the x-axis. For each sample, one of the
available values is chosen by some algorithm (in this case, the floor function is used).
This produces a fully discrete representation of the input signal (shaded area) that can
be easily encoded as digital data for storage or manipulation [8]. For the sine wave
example of Fig. 3.5, we can verify that the quantized values at the sampling moments
are 7, 9, 11, 12, 13, 14, 14, 15, 15, 15, 14, etc. Encoding these values as binary numbers
would result in the following set of nibbles: 0111, 1001, 1011, 1100, 1101, 1110,
1110, 1111, 1111, 1111, 1110, etc. These digital values could then be further
processed or analyzed by a purpose-specific digital signal processor or general
purpose CPU. Several Pulse Code Modulation streams could also be multiplexed into
a larger aggregate data stream, generally for transmission of multiple streams over a
single physical link. This technique is called time-division multiplexing, or TDM, and
is widely used, notably in the modern public telephone system. Figure 3.5 illustrates
the sampling and quantization of a sine wave:

Fig. 3.5 Sampling and quantization of a signal (red) for 4-bit PCM

There are many ways to implement a real device that performs this task. In real
systems, such a device is commonly implemented on a single integrated circuit that
lacks only the clock necessary for sampling, and is generally referred to as an ADC
(Analog-to-Digital converter). These devices will produce on their output a binary
representation of the input whenever they are triggered by a clock signal, which
would then be read by a processor of some sort.
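The floor-based 4-bit quantization described above can be sketched in Python (the report's own implementation is in MATLAB). The 16-point sampling grid here is an illustrative assumption, so the exact codes differ from the sequence listed earlier, which depends on the sampling instants and amplitude mapping chosen:

```python
import math

def pcm_quantize(x, bits=4):
    """Map x in [-1, 1] to an integer code in [0, 2**bits - 1]
    using the floor function, as in the sine-wave example."""
    levels = 2 ** bits
    code = math.floor((x + 1.0) / 2.0 * levels)
    return min(code, levels - 1)  # clamp the x = +1.0 endpoint

# Sample one period of a sine wave at 16 points and quantize each sample.
codes = [pcm_quantize(math.sin(2 * math.pi * n / 16)) for n in range(16)]
# Encode the 4-bit codes as binary nibbles, ready for a digital bit-stream.
nibbles = [format(c, "04b") for c in codes]
```

Each sample becomes one 4-bit nibble; several such streams could then be interleaved by a TDM multiplexer as the text describes.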

3.4.2 Demodulation
To produce output from the sampled data, the procedure of modulation is applied in
reverse [8]. After each sampling period has passed, the next value is read and the
output of the system is shifted instantaneously (in an idealized system) to the new
value. As a result of these instantaneous transitions, the discrete signal will have a
significant amount of inherent high frequency energy, mostly harmonics of the
sampling frequency. To smooth out the signal and remove these undesirable
harmonics, the signal would be passed through analog filters that suppress artifacts
outside the expected frequency range (i.e. greater than ½ fs, the maximum resolvable
frequency). Some systems instead use digital filtering to remove some of these
harmonics. In some systems, no explicit filtering is done at all; as it is impossible for
any system to reproduce a signal with infinite bandwidth, inherent losses in the
system compensate for the artifacts — or the system simply does not require much
precision. The sampling theorem suggests that practical PCM devices, provided a
sampling frequency that is sufficiently greater than that of the input signal, can
operate without introducing significant distortions within their designed frequency
bands. The electronics involved in producing an accurate analog signal from the
discrete data are similar to those used for generating the digital signal. These devices
are DACs (digital-to-analog converters), and operate similarly to ADCs. They
produce on their output a voltage or current (depending on type) that represents the
value presented on their inputs. This output would then generally be filtered and
amplified for use.
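As an idealized sketch of this reconstruction process (in Python for illustration): a zero-order hold produces the stepped waveform with the instantaneous transitions described above, and a crude moving average stands in for the analog smoothing filter. A real DAC's reconstruction filter would of course be a proper analog low-pass design, not this simplification:

```python
def zero_order_hold(samples, factor):
    """Hold each sample value for `factor` output points, producing
    the stepped waveform (with its high-frequency step harmonics)."""
    return [s for s in samples for _ in range(factor)]

def moving_average(signal, width):
    """Crude low-pass filter standing in for the analog reconstruction
    filter that suppresses artifacts above half the sampling rate."""
    half = width // 2
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out

stepped = zero_order_hold([0.0, 1.0, 0.0, -1.0], 4)  # 16-point staircase
smooth = moving_average(stepped, 4)                   # softened transitions
```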

3.4.3 Digitization

In conventional PCM, the analog signal may be processed (e.g. by amplitude
compression) before being digitized. Once the signal is digitized, the PCM signal is
usually subjected to further processing (e.g. digital data compression) [8]. Some forms
of PCM combine signal processing with coding. Older versions of these systems
applied the processing in the analog domain as part of the A/D process; newer
implementations do so in the digital domain. These simple techniques have been
largely rendered obsolete by modern transform-based audio compression techniques.

 DPCM encodes the PCM values as differences between the current and the
predicted value. An algorithm predicts the next sample based on the previous
samples, and the encoder stores only the difference between this prediction
and the actual value. If the prediction is reasonable, fewer bits can be used to
represent the same information. For audio, this type of encoding reduces the
number of bits required per sample by about 25% compared to PCM.
 Adaptive DPCM (ADPCM) is a variant of DPCM that varies the size of the
quantization step, to allow further reduction of the required bandwidth for a
given signal-to-noise ratio.
 Delta modulation, another variant, uses one bit per sample.

3.5 DIFFERENTIAL PULSE CODE MODULATION

Differential pulse code modulation (DPCM) is a procedure for converting an analog
signal into a digital one in which the analog signal is sampled and then the
difference between the actual sample value and its predicted value (based on one or
more previous samples) is quantized and encoded, forming a digital value [9].
DPCM code words represent differences between samples, unlike PCM, where code
words represent sample values. The basic concept of DPCM, coding a difference, rests
on the fact that most source signals show significant correlation between successive
samples; the encoding exploits this redundancy in sample values, which implies a
lower bit rate. The realization of this concept is a technique in which the current
sample value is predicted from the previous sample (or samples), and the difference
between the actual and predicted values (the prediction error) is encoded. Because it
is necessary to predict the sample value, DPCM is a form of predictive coding. DPCM
compression depends on the prediction technique: well-designed prediction leads to
good compression rates, while in other cases DPCM can mean expansion compared to
regular PCM encoding. The various steps, such as sampling, quantization, and
digitization, are similar to those of the PCM technique.
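A minimal first-order DPCM encoder/decoder pair can be sketched in Python (the report's coders are in MATLAB). The predictor coefficient 0.45 echoes the choice in Section 4.1; the quantizer is deliberately omitted so that prediction alone is illustrated, which is why the round trip below is exact:

```python
def dpcm_encode(samples, alpha=0.45):
    """Encode each sample as the prediction error against
    alpha * (previous reconstructed sample): a first-order predictor."""
    errors, prev = [], 0.0
    for x in samples:
        pred = alpha * prev
        e = x - pred          # prediction error is what gets transmitted
        errors.append(e)
        prev = pred + e       # decoder-side reconstruction (no quantizer here)
    return errors

def dpcm_decode(errors, alpha=0.45):
    """Rebuild the signal by adding each error to the running prediction."""
    out, prev = [], 0.0
    for e in errors:
        x = alpha * prev + e
        out.append(x)
        prev = x
    return out

signal = [0.0, 0.5, 0.9, 1.0, 0.8]
errors = dpcm_encode(signal)
restored = dpcm_decode(errors)  # matches `signal` exactly without a quantizer
```

With correlated input the errors are smaller than the samples themselves, which is what allows fewer bits per sample once a quantizer is added.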

 For DPCM sample conversion, the following selection criteria are available:

 Quality - This is related to the pitch setting in the DPCM instrument editor;
use the same value to play the sample at its original pitch. A quality of 15 gives
the best result, but samples can then be only a little less than one second long.
 Volume - Sets the conversion volume level; higher levels help remove
noise.
 Click elimination - The volume of the triangle and noise channels is decreased
while a DPCM sample is playing and can be restored with a note-off in the DPCM
channel, but this normally results in an audible click. Two options help
restore the volume after the sample finishes without causing a click:
 Restore delta counter - Restores the channel's delta counter by adding zeroes
after the sample. This may cause a small echo-like sound.
 Clip sample - Cuts the available volume range of the sample and leaves the
delta counter near zero at the end. This decreases the volume and heavily distorts
the sample. Which option is best depends on the sample; most likely, neither will
be needed. The maximum size of a DPCM sample is 3.9 KB; at quality 15 (33 kHz) this
is a little less than one second, and at the lowest quality (4 kHz) about eight
seconds.

3.6 OTHER POPULAR ALGORITHMS

There are other types of waveform algorithms which are used in speech coding like
A-law PCM, µ-law PCM, ADPCM etc. An ADPCM algorithm is used to map a series
of 8-bit µ-law or A-law PCM samples into a series of 4-bit ADPCM samples. In this
way, the capacity of the line is doubled. The technique is detailed in the G.726
standard. Some ADPCM techniques are used in Voice over IP communications.
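As an illustration of the companding behind µ-law PCM mentioned above, the standard µ-law curve (µ = 255, as used with G.711) can be sketched in Python. Note that G.711 itself specifies a piecewise-linear approximation of this curve rather than the exact formula:

```python
import math

MU = 255.0  # mu-law parameter used in North American / Japanese telephony

def mu_law_compress(x):
    """Map x in [-1, 1] to [-1, 1] with logarithmic companding, giving
    small amplitudes finer resolution before uniform 8-bit coding."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_expand(y):
    """Inverse mapping applied at the decoder."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

x = 0.1
y = mu_law_compress(x)  # ~0.59: a small amplitude is boosted before coding
```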
Similarly, other types of coders are also available, such as CELP and VSELP. Code
excited linear prediction (CELP) is a speech coding algorithm originally proposed by
M.R. Schroeder and B.S. Atal in 1985. At the time, it provided significantly better
quality than existing low bit-rate algorithms, such as RELP and LPC vocoders (e.g.
FS-1015). Along with its variants, such as ACELP, RCELP, LD-CELP and VSELP, it
is currently the most widely used speech coding algorithm. CELP is now used as a
generic term for a class of algorithms and not for a particular codec. Vector sum
excited linear prediction (VSELP) is a speech coding method used in several cellular
standards. Variations of this codec have been used in several 2G cellular telephony
standards, including IS-54, IS-136 (D-AMPS) and GSM (Half Rate speech). It was
also used in the first version of RealAudio for audio over the Internet. The IS-54
VSELP standard was published by the Telecommunications Industry Association in
1989.
CHAPTER 4

RESULTS AND DISCUSSIONS

We studied the basics of speech coding systems and various coding techniques, along
with their design procedures and application scopes. We then implemented PCM and
DPCM coders in MATLAB 7 and compared them on criteria such as speech quality, error,
and execution time by varying the bit-rate and sampling frequency.

4.1 IMPLEMENTATION DETAILS

For MATLAB implementations:

 Input speech has been sampled at 8 kHz (for comparison with standard
coders).
 For PCM, we have used a uniform quantizer with 2^16 = 65536 levels. The
bit-rate is 8k*16 = 128 kbps.
 For DPCM, we have used an adaptive first-order linear predictor with
coefficient α = 0.45. Again, we have a uniform quantizer with 2^16 = 65536
levels. Here also, the bit-rate is 128 kbps.
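The SNR figures reported in Tables 4.1-4.3 agree with the textbook formula for an N-bit uniform quantizer driven by a full-scale sinusoid, SNR ≈ 6.02N + 1.76 dB, which can be checked directly (sketched in Python for illustration; the report's own computations are in MATLAB):

```python
import math

def uniform_quantizer_snr_db(bits):
    """Theoretical SNR (dB) of an N-bit uniform quantizer with a
    full-scale sinusoidal input:
    20*log10(2**N) + 10*log10(3/2) ~= 6.02*N + 1.76 dB."""
    return 20 * math.log10(2 ** bits) + 10 * math.log10(1.5)

print(round(uniform_quantizer_snr_db(16), 2))  # 98.09, cf. Table 4.1
print(round(uniform_quantizer_snr_db(8), 2))   # 49.93, cf. Table 4.2
```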

4.2 RESULTS

Table 4.1 Results for Quantization Bits = 16 and Sampling Frequency = 8 kHz

Criteria                 PCM         DPCM
Quantization Noise       0.0833      0.0978
Max. Value of Error      6.2498e-5   6.2498e-5
SNR (decibels)           98.0905     98.0905
Execution Time (sec.)    1.288820    1.322032

Fig. 4.1 MATLAB Simulation of PCM.

Fig. 4.2 MATLAB Simulation of DPCM.


Table 4.2 Results for Quantization Bits = 8 and Sampling Frequency = 8 kHz

Criteria                 PCM         DPCM
Quantization Noise       5.4610e+3   6.4123e+3
Max. Value of Error      2.0398      2.0398
SNR (decibels)           49.9257     49.9257
Execution Time (sec.)    1.285243    1.341308

Table 4.3 Results for Quantization Bits = 16 and Sampling Frequency = 16 kHz

Criteria                 PCM         DPCM
Quantization Noise       0.0833      0.0978
Max. Value of Error      3.1250e-5   3.1250e-5
SNR (decibels)           98.0905     98.0905
Execution Time (sec.)    3.786807    4.073008

Note: The execution time also depends upon the computer system on which the codes
are tested. All simulations were run on a Windows Vista-based system with a
2.8 GHz processor and 2 GB of RAM.
CHAPTER 5

CONCLUSIONS

Speech coding, or speech compression, is the application of data compression to
digital audio signals containing speech. Speech coding uses speech-specific parameter
estimation, based on audio signal processing techniques, to model the speech signal,
combined with generic data compression algorithms to represent the resulting model
parameters in a compact bit stream. The objective of speech coding is to represent
the speech signal with the minimum number of bits while maintaining perceptual
quality.

Coded speech signals offer several advantages: lower sensitivity to channel noise;
easier error protection, encryption, multiplexing, and packetization; and efficient
transmission over bandwidth-constrained channels owing to the lower bit rate. PCM
coders produce better quality speech than most parametric coders, but parametric
coders achieve relatively higher compression ratios. Further, the error introduced by
the DPCM algorithm is greater than that introduced by the PCM algorithm, but the
compression ratio of the latter is correspondingly lower. Thus the choice of coder is
based upon the application requirements. Hybrid coders, which combine the
characteristics of waveform coders and parametric coders, are being designed to
provide better compression ratios while maintaining reasonable speech quality.

For this project, the implementation of speech coding based on PCM and DPCM was
successfully carried out in MATLAB. The output produced from the input speech is of
good, acceptable quality, and the targeted objective has been achieved.
CHAPTER 6

FUTURE SCOPE

In recent years, there has been significant progress in the fundamental building blocks
of source coding: flexible methods of time-frequency analysis, adaptive vector
quantization, and noiseless coding. Compelling applications of these techniques to
speech coding are relatively less mature. The present research is focused on meeting
the critical need for high quality speech transmission over digital cellular channels at
4 kbps. Research on properly coordinated source and channel coding is needed to
realize a good solution to this problem. Although high-quality low-delay coding at 16
kbps has been achieved, low-delay coding at lower rates is still a challenging
problem. Improving the performance of low-rate coders operating over noisy channels
is also an open problem. Additionally, there is a demand for robust low-rate coders
that can accommodate signals other than speech, such as music. Current research is
also focused on the area of VoIP.
REFERENCES
[1] T. P. Barnwell III, K. Nayebi and C. H. Richardson, Speech Coding: A Computer
Laboratory Textbook, John Wiley & Sons, Inc., 1996.
[2] W. C. Chu, Speech Coding Algorithms: Foundation and Evolution of Standardized
Coders, Wiley-Interscience.
[3] P. C. Loizou, Speech Enhancement: Theory and Practice, CRC Press, 2007.
[4] M. Hasegawa-Johnson and A. Alwan, Speech Coding: Fundamentals and Applications.
[5] L. R. Rabiner and R. W. Schafer, Introduction to Digital Speech Processing,
Vol. 1, Nos. 1-2, pp. 1-194, 2007.
[6] DDVPC, CELP Speech Coding Standard, Technical Report FS-1016, U.S. Dept. of
Defense Voice Processing Consortium, 1989.
[7] A. Das and A. Gersho, "Low-rate multimode multiband spectral coding of speech,"
Int. J. Speech Tech., 2(4): 317-327, 1999.
[8] J. H. Chung and R. W. Schafer, "Performance evaluation of analysis-by-synthesis
homomorphic vocoders," Proceedings of IEEE ICASSP, vol. 2, pp. 117-120, March 1992.
[9] R. Goldberg and L. Riek, A Practical Handbook of Speech Coders, CRC Press,
Boca Raton, FL, 2000.
[10] http://www.mathworks.com
