
Achieving the Highest Voice Quality for VoIP Solutions

Jan Linden
Global IP Sound, Inc.
900 Kearny Street, Ste 500, San Francisco, CA 94133, USA
+1 (415) 397-2585
jan.linden@globalipsound.com
ABSTRACT
As Voice over IP (VoIP) becomes more pervasive, ensuring
that services and equipment deliver the best possible voice
quality becomes an even more critical component of the
solution. Business and consumer users expect voice quality
that is, at a minimum, on par with their existing mobile or
PSTN phones, which means that developers face
significant design challenges. Equipment manufacturers and
applications developers must consider the entire audio
experience of the end-user. This paper will discuss issues
developers must address to ensure the highest possible voice
quality in VoIP-enabled devices.

INTRODUCTION

Speech processing for telecommunications is an area that has attracted much interest for several decades, and very high quality solutions have been developed over the years. As a result, end-users expect a certain level of quality. Partly because of quality issues, the introduction of VoIP has not been entirely smooth, despite all the benefits it offers in terms of cost savings and improved services. There
are several fundamental differences between the traditional
telephony systems and the emerging VoIP systems that can
severely impact the voice quality if not handled properly.
This article will discuss the major challenges specific to
VoIP and show that with proper design the quality of a VoIP
solution can in fact far exceed that of the PSTN. The focus is on giving a broad overview of all the issues that affect end-to-end quality, covering the main points of each rather than treating a specific topic in depth.

SPEECH CODEC

The basic algorithmic building block in a VoIP system is the
speech codec. A speech codec has several important
features, including speech quality, bit-rate, delay, sampling
rate, packet loss robustness, complexity, and sensitivity to
type of input signal. The quality of speech produced by the
speech codec defines the upper limit for achievable end-to-end quality. This determines the sound quality under perfect
network conditions, in which there would be no packet loss,
delay, jitter, echo, or other quality-degrading factors.

Other speech-codec-related factors affecting the overall
sound quality include handling of different voices as well as
the quality of non-speech signals such as background noise,
tones, and music.

2.1 Choice of codec

The bit-rate delivered by the speech encoder determines the
bandwidth load on the network. The packet headers (IP,
UDP, RTP) also add a significant portion to the bandwidth.
In fact, often the overhead due to the transport protocols
exceeds the bit-rate of the actual payload. As a result, a very
low bit-rate codec might not offer much lower bandwidth
utilization than a medium bit-rate codec and consequently
the trade-off between bit-rate and quality cannot be based on
the speech codec bit-rate alone.
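
As a rough illustration of this trade-off, the following sketch (Python, assuming the usual 40-byte IPv4/UDP/RTP header and illustrative codec rates and packet sizes rather than the parameters of any particular standard codec) estimates the resulting on-the-wire bit-rate:

    # Rough on-the-wire bit-rate estimate: codec payload plus IP/UDP/RTP headers.
    # Header size assumes IPv4 (20 B) + UDP (8 B) + RTP (12 B) = 40 B per packet,
    # ignoring link-layer framing and header compression.
    HEADER_BYTES = 40

    def wire_bitrate_kbps(codec_kbps, packet_ms):
        payload_bytes = codec_kbps * 1000 / 8 * packet_ms / 1000
        packets_per_s = 1000 / packet_ms
        total_bytes_per_s = packets_per_s * (payload_bytes + HEADER_BYTES)
        return total_bytes_per_s * 8 / 1000

    # Illustrative (not codec-specific) numbers: a 64 kbps codec and an
    # 8 kbps codec, both sent in 20 ms packets.
    for codec_kbps in (64, 8):
        print(codec_kbps, "kbps codec ->", wire_bitrate_kbps(codec_kbps, 20), "kbps on the wire")

With these assumptions, the 64 kbps stream costs about 80 kbps on the wire, while the 8 kbps stream still costs about 24 kbps; an eight-fold reduction in codec bit-rate buys only roughly a three-fold reduction in network load.
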
A speech codec used in a VoIP environment must be able to
deal with lost packets. This robustness determines the sound
quality in a loaded network and also in congested situations
where packet loss is likely. Packet loss issues are discussed
in Section 3.2.
The delay introduced by the speech coder can be divided
into algorithmic and processing delay. The algorithmic delay
occurs because of framing for block processing, since the
encoder produces a set of bits representing a block of speech
samples. Furthermore, many coders using block processing
also have a look-ahead function that requires buffering of
future speech samples before a block is encoded which adds
to the algorithmic delay. Processing delay is the time it takes
to encode and decode a block of speech samples.
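
As a simple illustration, the one-way delay contributed by the codec itself can be estimated as the sum of these components; the numbers below are assumed example values, not those of any specific codec:

    # Codec delay = frame length (block of samples collected before encoding)
    # + look-ahead (future samples buffered by the encoder) + processing time.
    # The values used here are placeholders for illustration only.
    def codec_delay_ms(frame_ms, lookahead_ms, processing_ms):
        return frame_ms + lookahead_ms + processing_ms

    print(codec_delay_ms(frame_ms=20, lookahead_ms=5, processing_ms=2))  # 27 ms
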
The complexity of a speech-coding algorithm dictates the
computational effort and the memory requirements.
Complexity is an important cost factor for implementing a
codec and generally increases with decreasing bit rate.
Similarly, memory requirements also affect the cost of
implementation.
Increasing the sampling frequency from the 8 kHz used for
narrowband products to the 16 kHz used for wideband
speech coding produces much more natural, comfortable,
and intelligible speech. Thus far, wideband speech coding has found only very limited use, in applications such as videoconferencing, because speech coders mostly interact with the public switched telephone network (PSTN), which is inherently narrowband. There is no such limitation in VoIP when the call is initiated and terminated within the IP network. Therefore, because of the dramatic quality improvement attainable, the next generation of speech codecs for VoIP will be wideband.

Basically, all desired parameters, such as low bit-rate, low delay, low complexity, and low memory usage, conflict with the overall goal of achieving high basic quality and high packet loss robustness. Therefore, there is a need to support a multitude of codecs suitable for different scenarios.

2.2 Implementation Issues

Because speech coding standards are defined through bit-exact standards specifications, it is easy to believe that all implementations are identical. This is, however, not the case. In the interest of saving complexity and memory, it is very common that trade-offs are made that affect bit-exactness and quality. If there is a need to depart from the standard, extreme care must be taken and the resulting code has to be tested thoroughly, including special cases such as high or low input levels and tones, in order to verify that there is no loss of quality. Another reason for deviating from the specifications is that some standards have fairly well-known bugs that have not been corrected in the standards specification. In such cases, an implementation that is not bit-exact with the standard may very well provide better quality than one that is.

What seemingly looks like a codec implementation issue can sometimes be related to other signal processing, such as filtering, or to scheduling and buffering issues.

A very important issue when implementing any signal processing algorithm is how to best utilize the available memory. Very often, the number of channels that can be supported is limited by memory constraints rather than by complexity. There are no general guidelines that can be applied, since each situation differs from the next. For example, how much of the dynamic memory allocation should be in a scratch area and how much should be on the stack depends heavily on the configuration. For best memory utilization it is also important that each component is implemented in a similar fashion, so that common memory areas can be utilized efficiently.

COPING WITH NETWORK DEGRADATION

Three major factors associated with packet networks have a significant impact on perceived speech quality: delay, jitter, and packet loss. All three factors stem from the nature of a packet network, in which there is no guarantee that a packet of speech data will arrive at the receiving end in time, or even that it will arrive at all. This is in contrast to traditional telephony networks, where packets are rarely, if ever, lost and the transmission delay is typically a fixed parameter that does not vary over time. These network effects are the most important factors distinguishing speech processing for VoIP from traditional solutions. If the VoIP device cannot cope with network degradation in a satisfactory manner, the quality can never be acceptable. Therefore, it is of utmost importance that the characteristics of the IP network are taken into account in the design and implementation of VoIP products, as well as in the choice of components such as the speech codec. In the following sub-sections, delay, jitter, and packet loss will be discussed, and methods to cope with these challenges will be covered.

3.1 Delay

Many factors affect the perceived quality in two-way communication. An extremely important parameter is the transmission delay between the two end-points. If the latency is high, it can severely impact the quality and ease of conversation. The two main effects caused by high latency are annoying echo and talker overlap, both of which can cause a significant reduction of the perceived conversation quality. In traditional telephony, long delays are basically only experienced for long-distance calls and calls to mobile phones. This is not necessarily true for VoIP. The effects of excessive delay have often been overlooked in VoIP design, resulting in significant quality degradation even in short-distance calls. Wireless VoIP, typically over a wireless LAN (WLAN), is becoming increasingly popular, but it elevates the challenges of delay management even further.

The impact of latency on communication quality is not easily measured and varies significantly with the usage scenario. For example, long delays are not perceived as being as annoying in a cell phone environment as for a regular wired phone, because of the added value of mobility. The presence of echo also has a significant impact on our sensitivity to delay: the higher the latency, the lower the perceived quality. Hence, it is not possible to come up with a single number for how much latency is acceptable, only some guidelines. If the overall delay is more than about 40 ms, an echo is audible. For lower delays, the echo is only perceived as an expected side-tone. As long as the latency is not too high, echo cancellation algorithms remove most of the effects. For very long delays (greater than 200 ms), even if echo cancellation is used, it is hard to maintain a two-way conversation without talker overlap. This effect is often accentuated by shortcomings of the echo canceller design. However, in a pure VoIP call (where neither end-point is connected to the PSTN), it is also possible to have a situation where no echo is generated, which allows for a slightly higher acceptable delay.

Figure 1: Effect of delay on conversational quality, from ITU-T G.114 (perceived quality as a function of one-way transmission time, 0 to 750 ms).

The ITU-T (International Telecommunication Union Telecommunication Standardization Sector) recommends in G.114 that the one-way delay be kept below 150 ms for acceptable conversation quality (Figure 1 is from G.114 and shows the perceived effect on quality as a function of delay). Delays from 150 to 400 ms are acceptable provided that administrators are aware of the impact on the quality of user applications, and latency higher than 400 ms is unacceptable.
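
The following sketch illustrates how such a one-way delay budget might be checked against the G.114 guidelines quoted above; the individual contributions are assumed example values:

    # End-to-end one-way delay budget checked against the G.114 guidelines
    # (< 150 ms preferred, 150-400 ms acceptable with caution, > 400 ms
    # unacceptable). The individual contributions are assumed example values.
    budget = {
        "codec (frame + look-ahead + processing)": 27,
        "packetization and OS/sound-card buffering": 30,
        "network propagation and queuing": 60,
        "jitter buffer": 40,
    }
    total = sum(budget.values())
    if total < 150:
        verdict = "within the preferred G.114 range"
    elif total <= 400:
        verdict = "acceptable, but a quality impact should be expected"
    else:
        verdict = "unacceptable per G.114"
    print(f"one-way delay {total} ms: {verdict}")
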

3.2 Packet Loss

Most packet losses occur in the routers, either due to high
router load or to high link load. In both cases, packets in the
queues might be dropped. Packet loss also occurs when there
is a breakdown in a transmission link. The result is a data link layer error, and the incomplete packet is dropped. Configuration errors and collisions may also result in packet loss. In non-real-time applications, packet loss is handled at the transmission control protocol (TCP) layer by retransmission.
For telephony, this is not a viable solution since
retransmitted packets would arrive too late and be of no use.
When a packet loss occurs some mechanism for filling in the
missing speech must be incorporated. Such solutions are
usually referred to as packet loss concealment (PLC)
algorithms. For best performance, these algorithms have to
accurately predict the speech signal and make a smooth
transition between the previously decoded speech and the inserted segment.
Since packet losses occur mainly when the network is
heavily loaded, it is not uncommon for packet losses to
appear in bursts. A burst may consist of a series of
consecutive lost packets or a period of high packet loss rate.
Obviously, when several consecutive packets are lost, even
the best PLC algorithm will have problems producing
acceptable speech quality.
In order to save bandwidth, multiple speech frames are
sometimes carried in a single packet, so a single lost packet
may result in multiple lost frames. Even if the packet losses
occur randomly, the listening experience is then similar to
that of having the packet losses occur in bursts.
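
The following toy example illustrates this effect, assuming three frames per packet and an arbitrary loss pattern:

    # With several speech frames bundled per packet, one lost packet becomes a
    # burst of lost frames. 'packet_received' is an example loss pattern.
    FRAMES_PER_PACKET = 3
    packet_received = [True, True, False, True]  # third packet lost

    frame_received = []
    for ok in packet_received:
        frame_received.extend([ok] * FRAMES_PER_PACKET)

    print(frame_received)
    # A single packet loss shows up as three consecutive lost frames.
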

3.2.1 Packet loss concealment

Until recently, two simple approaches to dealing with lost
packets have prevailed. The first method, referred to as zero
stuffing (ZS), involves simply replacing a lost packet with a
period of silence of the same duration as the lost packet.

Naturally, this method does not provide a high quality output, and already at a packet loss rate as low as 1 %, very annoying artifacts become apparent.
The second method, referred to as packet repetition (PR),
assumes that the difference between two consecutive speech
frames is quite small and replaces the lost packet by simply
repeating the previous packet. In practice, however, even a
minor change in pitch frequency, for example, is easily
detected by the human ear. In addition, it is virtually
impossible to achieve smooth transitions between the
packets with this approach. However, this approach
performs fairly well for very small probabilities (less than
3 %) of packet loss.
Simple methods, like repeating the previous packet, do not
provide sufficient quality for wireless applications. A
sophisticated algorithm, on the other hand, can handle 10 %
of packet loss without noticeable degradation.
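
A minimal sketch of the two simple concealment methods described above is shown below, operating on one frame of PCM samples; the small attenuation applied in the repetition case is a common refinement rather than part of the basic method, and real PLC algorithms are considerably more elaborate:

    import numpy as np

    def conceal_zero_stuffing(prev_frame):
        """Replace a lost frame with silence of the same length."""
        return np.zeros_like(prev_frame)

    def conceal_packet_repetition(prev_frame, attenuation=0.9):
        """Repeat the previous frame, slightly attenuated so that a run of
        consecutive losses fades out instead of buzzing."""
        return (prev_frame * attenuation).astype(prev_frame.dtype)

    # Usage: when the decoder has no data for the current frame, it outputs
    # one of these concealment frames instead of real decoded speech.
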
Another approach to handle packet loss is to deploy a speech
coding technique that has been specifically designed to
handle packet loss. None of the current speech coding
standards (e.g. ITU codecs) have been designed in such a
way and hence are all sensitive to packet loss. However, new
robust codecs are being adopted outside of the traditional
standards bodies for speech coding. For example, the
Internet Engineering Task Force (IETF) is currently
standardizing the iLBC speech codec [1].
Figure 2: Subjective test results (mean opinion score, MOS, versus packet loss rate from 0 to 20 %) for different approaches to handling packet loss concealment: NetEQ, ITU PLC (G.711 Appendix I), packet repetition (PR), zero stuffing (ZS), and Enhanced G.711 (EG.711). Source: Lockheed Martin Global Telecommunication (COMSAT).

In Figure 2, subjective listening test results for a number of approaches to handling packet loss for the G.711 codec are depicted. Clearly, ZS and PR do not provide quality that can be classified as acceptable. The packet loss concealment
method described in G.711 Appendix I (marked ITU PLC in
the figure) offers reasonable quality but is clearly
outperformed by the method described in [2] (NetEQ) and
the codec specifically designed for packet networks called
Enhanced G.711.

3.3 Network Jitter

In contrast to the constant algorithmic and processing delay,
transmission delay varies over time. The reason is that the
transit time of a packet through an IP network will vary due
to queuing effects. The transmission delay is split into two
parts, one being the constant or slowly varying network
delay and the other being the rapid variations on top of the
basic network delay, usually referred to as jitter. The jitter is
defined as a smoothed function of the delay differences
between consecutive packets over time.
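
One widely used smoothed estimate of this kind is the interarrival jitter computed for RTP reception statistics (RFC 3550); the sketch below follows that style, with send and receive times assumed to be in milliseconds:

    # Smoothed interarrival jitter in the style of RTP (RFC 3550): D is the
    # difference in relative transit time between consecutive packets, and the
    # estimate is a running average with gain 1/16.
    def update_jitter(jitter, send_prev, recv_prev, send_cur, recv_cur):
        d = (recv_cur - recv_prev) - (send_cur - send_prev)
        return jitter + (abs(d) - jitter) / 16.0

    jitter = 0.0
    packets = [(0, 50), (20, 72), (40, 95), (60, 112)]  # (send_ms, receive_ms)
    for prev, cur in zip(packets, packets[1:]):
        jitter = update_jitter(jitter, prev[0], prev[1], cur[0], cur[1])
    print(round(jitter, 2))
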
The jitter present in packet networks complicates the
decoding process in the receiver device because the decoder
needs to have packets of data readily available at the right
time instants. If the data is not available, the decoder will not
be able to produce smooth, continuous speech. A jitter buffer
is normally used to make sure that packets are available
when needed.

3.3.1 Jitter buffer design

A jitter buffer is required to make sure that packets are
available when needed for play-out. It removes the jitter in
the arrival times of the packets at the cost of an increase in
the overall delay. The objective of a jitter buffer algorithm is
to keep the buffering delay as short as possible while
minimizing the number of packets that arrive too late to be
used. A large jitter buffer causes an increase in the delay and
decreases the packet loss. A small jitter buffer decreases the
delay but increases the resulting packet loss.
The traditional approach is to store the incoming packets in a
buffer (packet buffer) before sending them to the decoder.
Because packets can arrive out of order, the jitter buffer is
not a strict first-in-first-out (FIFO) buffer, but also reorders
packets if necessary. The most straightforward approach is
to have a buffer with a fixed number of packets. This results
in a constant system delay, requires no computations, and provides minimum complexity. The drawback of this
approach is that the length of the buffer has to be made
sufficiently large that even the worst case can be
accommodated.
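
A minimal sketch of such a traditional fixed-depth packet buffer, reordering packets by sequence number, is given below; sequence-number wrap-around and late-packet handling are ignored for brevity:

    import heapq

    class FixedJitterBuffer:
        """Toy fixed-depth jitter buffer: packets are reordered by sequence
        number, and play-out proceeds only while 'depth' packets are held."""

        def __init__(self, depth=4):
            self.depth = depth
            self.heap = []  # (sequence_number, payload)

        def push(self, seq, payload):
            heapq.heappush(self.heap, (seq, payload))

        def pop(self):
            # Return the oldest buffered packet, or None if the buffer has
            # drained below its target depth (the decoder would then run PLC).
            if len(self.heap) < self.depth:
                return None
            return heapq.heappop(self.heap)[1]

A real implementation would also discard packets that arrive after their scheduled play-out time and hand control to the packet loss concealment when the buffer underruns.
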
In order to keep the delay as short as possible, it is important
that the jitter buffer algorithm adapt rapidly to changing
network conditions. Therefore, jitter buffers with dynamic
size allocation, so-called adaptive jitter buffers, are now
most common. The adaptation is achieved by inserting
packets in the buffer when the delay needs to be increased,
and removing packets when the delay can be decreased.
Packet insertion is usually done by repeating the previous
packet. Unfortunately, this will almost always result in
audible distortion, so most adaptive jitter buffer algorithms
are very conservative when it comes to reducing the delay to
avoid quality degradation.
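
One simple way to drive such adaptation, shown below purely as an illustrative heuristic and not as the algorithm of [2], is to track a high percentile of recently observed network delays and steer the buffer toward that target:

    # Pick a target buffer delay from recent one-way delay observations (ms).
    # Tracking a high percentile keeps late losses rare while letting the
    # buffer shrink when the network calms down.
    def target_delay_ms(recent_delays, percentile=0.95, margin_ms=5):
        ordered = sorted(recent_delays)
        idx = min(int(len(ordered) * percentile), len(ordered) - 1)
        return ordered[idx] + margin_ms

    print(target_delay_ms([40, 42, 45, 41, 80, 43, 44, 46, 42, 41]))
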
This traditional packet buffer approach is limited in its
adaptation granularity by the packet size, since it can only change the buffer length by adding or discarding one or several packets.
Another major limitation of traditional jitter buffers is that,
in order to limit the audible distortion of removing packets,
they typically only function during periods of silence.
Hence, delay builds up during a talk spurt, and it can take
several seconds before a reduction in the delay can be
achieved.
A new invention which combines an advanced adaptive
jitter-buffer control with error concealment has recently been
presented [2]. Combining adaptive jitter control and packet
loss concealment into one unit makes this unique algorithm
capable of adapting the buffer size on a millisecond basis.
The approach allows it to quickly adapt to changing network
conditions, and to ensure high speech quality with minimal
buffer latency. This can be achieved because the algorithm is
working together with the decoder and not in the packet
buffer. In addition to minimizing jitter buffer delay, the
packet loss concealment part of the algorithm is based on a
novel approach, and is capable of producing higher quality
than any of the standard PLC methods. Experiments show
that with this type of approach one-way delay savings of 30 to 80 ms are achievable in a typical VoIP environment [2].

ECHO CANCELLATION

One of the most important aspects in terms of effect on the
end-to-end quality is the amount of echo present during a
conversation. This is an effect that does not show up until a
call is established between two endpoints. To avoid disturbing echo, an echo cancellation algorithm often has to be inserted at an appropriate point in the signal path. The
requirements on an echo canceller to achieve good voice
quality are very tough and in fact all present algorithms are
imperfect in some sense. The result of a poor design can
show up in several ways, the most common being: (i)
audible echo, (ii) clipping of the speech, and (iii) poor
doubletalk performance.
Echo is a severe distraction if the round-trip delay is longer
than 40 ms. Since the delays in IP telephony systems are
significantly higher, the echo is clearly audible to the
speaker. Canceling echo is, therefore, essential to
maintaining high quality. Two types of echo can deteriorate
speech quality: network echo and acoustic echo.
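
At the core of most echo cancellers is an adaptive filter that models the echo path and subtracts its estimate from the returned signal. The sketch below uses a plain NLMS (normalized least mean squares) update with a fixed step size and no doubletalk control; it illustrates the principle and is not a production canceller:

    import numpy as np

    def nlms_echo_canceller(far_end, near_end, taps=128, mu=0.5, eps=1e-6):
        """Cancel the component of 'near_end' that is a filtered copy of
        'far_end'. Returns the error signal (echo estimate removed)."""
        w = np.zeros(taps)          # adaptive estimate of the echo path
        x_buf = np.zeros(taps)      # most recent far-end samples, newest first
        out = np.zeros(len(near_end))
        for n in range(len(near_end)):
            x_buf[1:] = x_buf[:-1]
            x_buf[0] = far_end[n]
            echo_est = w @ x_buf
            e = near_end[n] - echo_est
            w += mu * e * x_buf / (x_buf @ x_buf + eps)   # NLMS update
            out[n] = e
        return out

In practice the adaptation would be frozen during doubletalk and complemented by residual echo suppression.
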

4.1 Network Echo Cancellation

Network echo is the dominant source of echo in telephone
networks. It results from the impedance mismatch at the
hybrid circuits of a PSTN exchange, at which the two-wire
subscriber loop lines are connected to the four-wire lines of
the long-haul trunk (Figure 3). The echo path is stationary,
except when the call is transferred to another handset or
when a third party is connected to the phone call. This
results in an abrupt change in the echo path.

Figure 3: Diagram of network hybrid causing echo in PSTN.

As previously noted, the common solutions to echo
cancellation and other impairments in packet-switched
networks are basically adaptations of techniques used for the
circuit-switched network. To achieve the best possible
quality a systematic approach is necessary to address the
quality-of-sound issues that are specific to packet networks.
There are significant differences between the repackaged
circuit-switched cancellers and packet network-optimized
cancellers. For example, the PSTN doesn't inherently have the ability to process packets, so energy calculations that determine the updating of the adaptive filters only look at individual voice samples. In contrast, a canceller that is designed to handle packet networks can perform look-ahead packet processing, looking not only at the current packet but also at, for example, 80 or 160 additional samples. Greater precision
in both filter update and double talk detection can be
achieved by basing calculations on entire packets.
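
As an illustration of a packet-based decision, the sketch below applies a Geigel-style doubletalk test over an entire packet of samples; the 0.5 threshold and the choice of comparison window are assumed example values:

    import numpy as np

    def doubletalk_in_packet(near_packet, recent_far_end, threshold=0.5):
        """Geigel-style test over a whole packet: declare doubletalk if the
        near-end peak exceeds 'threshold' times the recent far-end peak."""
        return np.max(np.abs(near_packet)) > threshold * np.max(np.abs(recent_far_end))

    # If this returns True, the canceller would typically freeze its filter
    # update for that packet to avoid diverging on near-end speech.
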

4.2 Acoustic Echo Cancellation

Acoustic echo occurs when there is a feedback path between a telephone's microphone and speaker (a problem primarily associated with wireless and hands-free devices), or between the microphone and speakers of a PC-based system. In addition to the direct coupling path from microphone to speaker, acoustic echo can be caused by multiple reflections of the speaker's sound waves back to the microphone from walls, floor, ceiling, windows, furniture, a car's dashboard, and other objects. Hence, the acoustic echo path is non-stationary.
There are very few differences between designing AEC for
VoIP and for traditional telephony applications. However,
due to the higher delays typically experienced in VoIP, the requirements on the AEC are often even more demanding.
Also, wideband speech adds some new challenges in terms
of quality and complexity.

AUXILIARY VOICE PROCESSING

In addition to the most visible and well-known voice
processing components in a VoIP device there are many
other components that are included either to enhance the
user experience or for other reasons such as reducing
bandwidth requirements. Since no chain is stronger than its
weakest link it is imperative that even these seemingly less
important components are designed in an appropriate
fashion.

Examples of such components include automatic gain
control (AGC), voice activity detection (VAD), comfort
noise generation (CNG), noise suppression, and signal
mixing for multiparty calling features. Typically, there is not
much difference in the design or requirements between
traditional telecommunications solutions and VoIP solutions
for this type of component. However, VAD and CNG, for example, are typically deployed more frequently in VoIP
systems. The main reasons for this are that due to the
protocol overhead in VoIP the net saving in bandwidth is
very significant and that an IP network is well suited to
utilize the resulting variable bit rate to transport other data
while no voice packets are sent. Due to misclassifications in
the VAD algorithm, clipping of the speech signal and noise bursts can sometimes occur. Also, since only comfort noise is
played out during silence periods the background signal may
sound artificial. The most common problem with CNG,
though, is that the level is too low which results in the
feeling that the other person has dropped out of the
conversation. These performance issues mandate that VAD
should be used with caution to avoid unnecessary quality
degradation.
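
A toy energy-based VAD with a hangover period, which mitigates the end-of-word clipping mentioned above, is sketched below; frames are assumed to hold float samples in [-1, 1], and practical VADs use considerably more sophisticated features:

    import numpy as np

    class EnergyVAD:
        """Toy energy-threshold VAD with hangover: speech stays active for
        'hangover' frames after the energy drops, which reduces end-of-word
        clipping at the cost of a little extra bandwidth."""

        def __init__(self, threshold=1e-4, hangover=5):
            self.threshold = threshold
            self.hangover = hangover
            self.hold = 0

        def is_speech(self, frame):
            energy = np.mean(np.square(frame.astype(np.float64)))
            if energy > self.threshold:
                self.hold = self.hangover
                return True
            if self.hold > 0:
                self.hold -= 1
                return True
            return False
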
Implementing multiparty calling also faces some challenges
due to the characteristics of the IP networks. For example,
the requirements on delay, clock drift, and echo cancellation
performance are significantly tougher due to the fact that
several signals are mixed together and that there are several
listeners present. A jitter buffer with low delay and the
capability of efficiently handling clock drift offers a very
significant improvement in such a scenario. Serious
complexity issues arise since different codecs can be used
for each of the parties in a call. Those codecs might even use
different sampling frequencies. Intelligent schemes to
manage complexity are thus important. One way to reduce
the complexity is to use VAD to determine which participants are active at each point in time and mix only those into the output signals.
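
The selective-mixing idea can be sketched as follows, assuming aligned float frames and a VAD decision per participant; in a real bridge each listener would also be excluded from their own mix, and the per-party codecs and sampling rates would have to be reconciled before mixing:

    import numpy as np

    def mix_active(frames, vad_flags):
        """Mix only the participants flagged as active; returns silence if
        nobody is talking. 'frames' are equal-length float arrays."""
        active = [f for f, talking in zip(frames, vad_flags) if talking]
        if not active:
            return np.zeros_like(frames[0])
        mixed = np.sum(active, axis=0)
        return np.clip(mixed, -1.0, 1.0)   # simple saturation control
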

HARDWARE ISSUES

Another major challenge is to achieve proper handling of the
audio in the hardware interfaces of a VoIP device. These
devices are often small, have to be inexpensive, and some
are originally designed for other applications, e.g., PCs and
PDAs, which imposes challenges in achieving good voice
quality and low delay.
Many factors in the design of VoIP devices affect speech
quality. Obvious examples are microphones, speakers, and
analog-to-digital converters. These issues are all very similar
to challenges well known from designing devices for regular
telephony and as such are well understood. However, that
does not mean that designing a VoIP device is an easy task
and it is important to utilize the expertise available in this
area to achieve best possible quality.
One device-specific issue that can potentially have a serious
effect on delay and voice quality is the handling of clock drift. The traditional approach is to deploy a clock
synchronization mechanism at the receiver to correct for
clock drift by comparing the time stamps of the received
RTP packets with the local clock. Such solutions suffer from
problems that are similar to the traditional packet buffer:
they are based on making large adjustments infrequently,
which makes each such adjustment very noticeable.
Furthermore, it is hard to perform reliable clock drift
estimation in VoIP because of the low packet rate and the
presence of jitter. The technique that combines jitter buffer
control and error concealment offers a solution since it
automatically takes care of clock drift without introducing
any audible artifacts or extra delay.

6.1 VoIP on PCs and PDAs

Voice-enabling a PC, a PDA, or another device not initially
designed for carrying real-time voice imposes a new set of
challenges, including sharing resources with other
applications in a non-real-time operating system. The
unpredictable response times and thread handling in these
operating systems can, if not handled properly, increase the
latency significantly and cause gaps in the audio output.
Some operating systems, especially for PDA-type devices, do not even have full support for multithreading, making a
low delay implementation very challenging in an
environment where it is desirable to run several applications
simultaneously.
These devices are often equipped with low quality sound
cards in the interest of saving cost. The consequence can, for
example, be that the quality of the digitized signal is poor
and that digital resampling filters have to be implemented in
order to avoid sampling distortion. In fact, it is always a
good idea to utilize the native sampling frequency of the
sound card (typically 48 kHz) for recording and playback
and deploy well-designed digital filters for up- and down-sampling to avoid aliasing problems. Another challenge with
the typical sound card and the associated drivers is that the
buffering done results in significant delay and jitter.
Realizing that this is an effect very similar to the network
jitter it is clear that by integrating an efficient jitter buffer
with the handling of the soundcard significant delay savings
are possible.
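
For example, a polyphase resampler such as scipy.signal.resample_poly (one convenient option, not the only one) can convert between a 48 kHz sound card rate and a 16 kHz wideband codec rate while applying the necessary anti-aliasing filtering:

    from scipy.signal import resample_poly

    def to_wideband(audio_48k):
        """Downsample 48 kHz capture to 16 kHz for a wideband codec.
        resample_poly applies an anti-aliasing low-pass filter internally."""
        return resample_poly(audio_48k, up=1, down=3)

    def to_soundcard(audio_16k):
        """Upsample 16 kHz decoder output back to the 48 kHz playback rate."""
        return resample_poly(audio_16k, up=3, down=1)
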
The effect of clock drift, as discussed previously, can be
very severe due to the low quality sound cards often
deployed. It is not uncommon to see delays of up to a second
due to clock drift if no action is taken to mitigate the effect.
Again, the approach of combining jitter buffer and packet
loss concealment into a very rapidly adapting jitter buffer
solution will automatically take care of this issue without the
need to explicitly estimate the clock drift.
Very high noise levels are very common, especially if a
built-in microphone that picks up the noise from the
computer's fan is used. It is therefore a good idea to include
a noise suppression algorithm.

If the PC is to be used in a speakerphone type of scenario,
acoustic echo cancellation is necessary to avoid disturbing
echo on the other side of the conversation. This setup poses
a whole new set of challenges on AEC design due to the
imperfections of soundcards and the fact that the acoustic
environment is completely unknown at the time of the
design. A common setup is a laptop with built-in
microphone and speakers where both often are of poor
quality, introducing noise and nonlinearities in the echo
path. Often there is also significant audio leakage through
the laptop chassis and through electric coupling. The other
typical setup is when external microphone and speakers are
used. In this case there is typically no leakage, except for
what could occur in the soundcard, to compensate for.
However, there are other challenges, for example the relative
placement of the microphone and the speaker will affect the
AEC significantly and the two can be connected through
different audio interfaces (microphones are often connected
using the USB interface) potentially resulting in clock drift.
An AEC design that can cope with such varying conditions
has to be very flexible and capable of adapting to different
audio environments.

CONCLUSIONS

A designer of VoIP equipment will face many challenges,
some similar to what has been experienced in traditional
telecommunications design, and some very specific to VoIP.
The most prominent of these challenges have been discussed
in this paper and solutions to overcome them were
presented. Some of the most important aspects specific to
VoIP are related to the characteristics of the transport medium, IP networks. The design has to be able to cope with packet
loss and transmission time jitter with a minimum of latency
and high voice quality. This paper has shown that by proper
design it is possible to achieve equal to, or better than, PSTN
quality in VoIP. Of particular importance are speech codecs
that are robust against packet loss and very rapidly adapting
jitter buffer solutions delivering high quality at a minimal
delay, which can be achieved by a novel technique based on
combining jitter buffer adaptation and packet loss
concealment in the same algorithm.

REFERENCES

[1] Andersen et al., "Internet Low Bit Rate Codec (iLBC)," IETF Internet-Draft, draft-ietf-avt-ilbc-codec-04.txt, November 2003.
[2] "GIPS NetEQ - A Combined Jitter Buffer Control/Error Concealment Algorithm for VoP Gateways," whitepaper, available from Global IP Sound.
