
Voice Encoding

Speech codecs, also referred to as voice coders or vocoders (when source codecs are used), can be divided into three basic classes:

Waveform
Source
Hybrid

Waveform codecs are older codecs that use high bit rates and provide very good quality speech reproduction. Source codecs operate at very low bit rates and tend to produce speech that sounds artificial. Hybrid codecs use techniques from both source and waveform coding, operate at intermediate bit rates, and provide good quality speech. Codecs operate by modeling a segment of the speech waveform on the order of 20 ms. Speech model parameters are then estimated, quantized, coded, and transmitted over a communications channel. The receiver then decodes the transmitted values and reconstructs synthesized speech.

Waveform Encoding

Waveform algorithms produce high-quality speech at high bit rates. Waveform coders sample analog signals 8000 times per second, and each speech sample is quantized using either linear or non-linear quantization. If linear quantization is used, approximately 12 bits per sample are required, resulting in a bit rate of around 96 Kbps. If non-linear quantization is used, 8 bits per sample are required, resulting in a bit rate of 64 Kbps. Quantization is the process of converting each analog sample value into a discrete value that can be assigned a unique digital code word. Since most voice signals are low-level signals, a quantizer that provides good voice quality only at higher signal levels is a very inefficient way of digitizing voice. To improve voice quality at lower signal levels, uniform quantization is replaced by a non-uniform quantization process called companding. Companding refers to the process of first compressing an analog signal at the source, and then expanding this signal back to its original size when it reaches its destination; the term companding is created by combining the two words, compressing and expanding, into one.
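The effect of companding can be sketched in a few lines of Python. This is a minimal u-law-style compress/expand pair (using the continuous u-law curve with mu = 255, not the exact segment-based G.711 encoder) compared against a plain 8-bit uniform quantizer; the point is that for a quiet sample, the companded path has a much smaller quantization error:

```python
import math

MU = 255      # u-law compression parameter
LEVELS = 256  # an 8-bit quantizer has 256 output levels

def compress(x):
    """Map a linear sample in [-1, 1] onto the companded scale."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def expand(y):
    """Invert the compression at the destination."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def quantize(x):
    """Round a value in [-1, 1] to the nearest of 256 uniform levels."""
    return round((x + 1) / 2 * (LEVELS - 1)) / (LEVELS - 1) * 2 - 1

quiet = 0.01  # a low-level voice sample, the common case in speech
uniform_err = abs(quantize(quiet) - quiet)
companded_err = abs(expand(quantize(compress(quiet))) - quiet)

# Companding spends the quantizer's resolution on the quiet range,
# so the error on small signals drops sharply.
assert companded_err < uniform_err
```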

PCM, by far the most common waveform encoder, uses non-linear quantization. PCM is defined in the ITU-T G.711 specification and has been standardized as G.711 u-law in the United States and G.711 a-law in Europe and other parts of the world. PCM codecs are simple, introduce very little delay, and produce very high quality speech. When using PCM, the first step in converting the signal from analog to digital is to filter out the higher-frequency components of the signal. Next, to convert the analog voice signal to a digital voice signal, the filtered signal is sampled at a constant sampling frequency using a process called pulse amplitude modulation (PAM). This step uses the original analog signal to modulate the amplitude of a pulse train that has a constant amplitude and frequency. 8 bits per sample are used, and sampling occurs at 8000 samples per second (8 KHz). Therefore, 8 bits * 8000 samples per second results in 64,000 bits per second, or 64 Kbps. This concept is illustrated in the following diagram:

The third step is to digitize these samples in preparation for transmission over a telephony network. The process of digitizing analog voice signals is called PCM. The only difference between PAM and PCM is that PCM takes the process one step further: PCM encodes each analog sample using binary code words. PCM has an analog-to-digital converter on the source side and a digital-to-analog converter on the destination side. PCM uses the technique called quantization to encode these samples. These steps are illustrated in the following diagram:

Other waveform approaches to speech encoding attempt to predict the value of the next sample by looking at the immediate past, i.e., previous samples. This approach works because of the repeated vocal patterns present in each speech sample. If predictions are correct, the differences between input sample signals are minimal. Differential PCM (DPCM) is designed to calculate this difference and then transmit the small difference signal instead of the entire input sample signal. Since the difference between input samples is smaller than an entire input sample, fewer bits are required for transmission. This allows for a reduction in the throughput required to transmit voice signals. Using DPCM can reduce the bit rate of voice transmission down to 48 Kbps.

DPCM is a good way to reduce the bit rate for voice transmission; however, it introduces voice quality problems of its own. DPCM quantizes and encodes the difference between a previous input sample and the current input sample using a fixed set of quantization levels, which is inefficient because most of the signals generated by the human voice are small, and voice quality depends most on representing those small signals accurately. To solve this problem, Adaptive DPCM (ADPCM) was developed. ADPCM is a waveform coding method defined in the ITU-T G.726 specification. ADPCM adapts the quantization step size to the difference signal generated by the DPCM process: when the difference signal is small, the step size is decreased so that quiet passages are represented more precisely, and when the difference signal is large, the step size is increased to cover the wider range. Using ADPCM reduces the bit rate of voice transmission to 32 Kbps, which is half the bit rate of a-law or u-law PCM, while still producing speech quality very near that of the 64 Kbps PCM codecs. ADPCM also offers bit rates of 16 Kbps and 24 Kbps. The following table lists the most common waveform coders and their characteristics:
Codec    Compression Technique                          Bit Rate Calculation               Bit Rate
G.711    Pulse Code Modulation                          8000 samples/sec * 8 bits/sample   64 Kbps
G.726    Adaptive Differential Pulse Code Modulation    8000 samples/sec * 4 bits/sample   32 Kbps
G.726    Adaptive Differential Pulse Code Modulation    8000 samples/sec * 3 bits/sample   24 Kbps
G.726    Adaptive Differential Pulse Code Modulation    8000 samples/sec * 2 bits/sample   16 Kbps
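The difference coding behind DPCM, plus the step-size adaptation that ADPCM adds, can be sketched with a toy integer quantizer. This is only an illustration of the idea, not the standardized G.726 algorithm (a real codec transmits only the quantized code, and the decoder replicates the step adaptation; here the step is carried alongside the code for clarity):

```python
def adpcm_encode(samples, step=4):
    """Quantize each sample-to-sample difference with an adaptive step size."""
    prev = 0
    out = []
    for s in samples:
        diff = s - prev
        code = round(diff / step)   # small quantized difference, not the full sample
        prev += code * step         # track what the decoder will reconstruct
        out.append((code, step))
        # Adapt: large differences grow the step, small ones shrink it.
        step = min(step * 2, 512) if abs(diff) > step else max(step // 2, 1)
    return out

def adpcm_decode(codes):
    """Rebuild the waveform by accumulating the dequantized differences."""
    prev = 0
    out = []
    for code, step in codes:
        prev += code * step
        out.append(prev)
    return out

samples = [0, 8, 18, 20, 19, 18]     # slowly varying voice-like samples
decoded = adpcm_decode(adpcm_encode(samples))
# Reconstruction is approximate but stays close, using only small codes.
assert all(abs(a - b) <= 4 for a, b in zip(samples, decoded))
```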

Source Encoding

Source codecs, sometimes referred to as vocoders, operate using a model of source signal generation and extract the model parameters from the signal being coded. These model parameters are transmitted to the decoder on the receiving end. Vocoders use compression techniques that are optimized for coding human speech: they represent the vocal tract as a time-varying filter that is excited with a white noise source for unvoiced speech segments, or with a series of pulses separated by the pitch period for voiced speech. The information sent to the decoder includes the filter specification, the necessary variance of the excitation signal, and the pitch period for voiced speech. This information is continually updated every 10 ms to 20 ms, which follows the changing nature of speech.

Vocoders can be either analog vocoders or linear prediction-based vocoders. Analog vocoder systems use a number of frequency channels, all tuned to different frequencies using band-pass filters; analog vocoders are beyond the scope of the CVOICE certification. Linear vocoders use the source-filter model and employ codebooks. Codebooks store specific predictive waveshapes of human speech; the encoder matches the speech against these waveshapes and encodes the phrases, and the receiving end decodes the waveshapes by looking up the coded phrase and matching it to the stored waveshape in the receiver's codebook. By using this method, data rates below 13 Kbps can be achieved.

Code Excited Linear Prediction (CELP), which is one of the most widely used speech coding algorithms, is a linear prediction-based vocoder. CELP converts received input from an 8-bit to a 16-bit linear PCM sample. It uses a codebook to learn and predict the voice waveform, and the coder is excited by a white noise generator; the term 'excited' simply refers to the point at which the coder begins the lookup process. The mathematical result is then sent to the receiving decoder, and the voice waveform is generated. Although many variants of CELP exist, the only two variants relevant to the CVOICE certification requirements are Low-Delay CELP (LD-CELP) and Conjugate Structure Algebraic CELP (CS-ACELP).

LD-CELP has been standardized by the ITU as G.728. This coding mechanism codes speech at 16 Kbps and has a delay of between 2 ms and 5 ms. CS-ACELP minimizes bandwidth (8 Kbps) at the expense of increased delay, which is 10 ms. G.729, G.729 Annex A (G.729A), G.729 Annex B (G.729B), and G.729A Annex B (G.729AB) are all variations of CS-ACELP. G.729 Annex B is a high complexity algorithm, and G.729A Annex B is a medium complexity variant of G.729 Annex B with slightly lower voice quality. The difference between the G.729 and G.729 Annex B codecs is that the G.729B codec provides built-in IETF Voice Activity Detection (VAD) and Comfort Noise Generation (CNG). Codec complexity will be described later in this section. VAD is a voice encoding feature that detects silence during voice conversations and suppresses the transmission of voice packets that contain no actual speech. Comfort noise is artificial background noise that is used to fill the silence in a transmission resulting from VAD.

Another source encoding technique that must be taken into consideration is the G.723.1 standard. G.723.1 is the result of a competition that the ITU announced with the aim of designing a codec that would allow calls over 28.8 and 33.6 Kbps modem links. There were two very good solutions, and the ITU decided to use them both. Because of this, there are two variants of G.723.1: ACELP and Multipulse LPC with Maximum Likelihood Quantization (MP-MLQ). Both variants operate on audio frames of 30 milliseconds, but the algorithms differ. G.723.1 ACELP has a bit rate of 5.3 Kbps, while MP-MLQ has a bit rate of 6.3 Kbps. The encoded frames for the two variants are 20 bytes and 24 bytes long, respectively. G.723.1 is a licensed codec, and the last patent covering it was expected to expire in 2014; largely because of this licensing, the codec is not commonly implemented. The following table lists common source coders and their characteristics:
Codec     Compression Technique                                 Coding Delay    Bit Rate
G.723.1   Algebraic CELP                                        30 ms           5.3 Kbps
G.723.1   Multipulse LPC with Maximum Likelihood Quantization   30 ms           6.3 Kbps
G.729a    Conjugate Structure Algebraic CELP                    10 ms           8 Kbps
G.729     Conjugate Structure Algebraic CELP                    10 ms           8 Kbps
G.728     Low-Delay CELP                                        3 ms - 5 ms     16 Kbps
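The VAD and comfort noise generation concepts described above can be sketched with a simple fixed energy threshold. This is only an illustration — real implementations such as G.729B use adaptive, multi-feature decision logic, and the threshold value here is an arbitrary assumption:

```python
import math
import random

THRESHOLD = 0.02  # assumed energy threshold; real VADs adapt this

def frame_energy(frame):
    """Average signal energy of one 20 ms frame."""
    return sum(s * s for s in frame) / len(frame)

def vad(frames):
    """Yield (is_speech, frame); silent frames would not be packetized."""
    for frame in frames:
        yield frame_energy(frame) > THRESHOLD, frame

def comfort_noise(n, level=0.005):
    """Receiver-side filler for suppressed frames (illustrative only)."""
    rng = random.Random(0)
    return [rng.uniform(-level, level) for _ in range(n)]

# One 160-sample (20 ms at 8 kHz) speech frame and one near-silent frame:
speech = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(160)]
silence = [0.001] * 160

flags = [is_speech for is_speech, _ in vad([speech, silence, speech])]
assert flags == [True, False, True]  # only the speech frames get transmitted
```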

Cisco voice gateways support numerous codecs, which can be configured via the codec voice-network dial peer configuration command, as illustrated in the following output:

R1(config)#dial-peer voice 1 voip
R1(config-dial-peer)#codec ?
  clear-channel   Clear Channel 64000 bps (No voice capabilities: data transport only)
  g711alaw        G.711 A Law 64000 bps
  g711ulaw        G.711 u Law 64000 bps
  g723ar53        G.723.1 ANNEX-A 5300 bps (contains built-in vad that cannot be disabled)
  g723ar63        G.723.1 ANNEX-A 6300 bps (contains built-in vad that cannot be disabled)
  g723r53         G.723.1 5300 bps
  g723r63         G.723.1 6300 bps
  g726r16         G.726 16000 bps
  g726r24         G.726 24000 bps
  g726r32         G.726 32000 bps
  g728            G.728 16000 bps
  g729br8         G.729 ANNEX-B 8000 bps (contains built-in vad that cannot be disabled)
  g729r8          G.729 8000 bps
  transparent     transparent; uses the endpoint codec

By default, Cisco voice gateways use g729r8 with a 30-byte payload for VoFR and VoATM, and g729r8 with a 20-byte payload for VoIP. POTS dial peers use a default codec of G.711. The default codec values for dial peers can be viewed by issuing the show dial-peer voice [tag] command, as illustrated in the following output:
r1#show dial-peer voice 2010
VoiceOverIpPeer2010
        peer type = voice, information type = voice,
        description = `Dial Peer to Cisco Unified Communications Manager',
        tag = 2010, destination-pattern = `555....',
        [Truncated Output]
        codec = g729r8, payload size = 20 bytes,
        Media Setting = flow-through (global)
        Expect factor = 10, Icpif = 20,
        [Truncated Output]

Mean Opinion Score

Although numerous codecs are supported and can be used, it is important to understand that while each may offer certain advantages (such as bandwidth savings), each also has a different Mean Opinion Score (MOS), which must be taken into consideration when implementing VoIP solutions. MOS was developed to help quantify the quality of a given coding technique. To compile MOS numbers, test listeners listen to various speech patterns sent through each compression technique and rate the quality of the sound on a scale of 1 to 5, with 5 being the highest (best) and 1 being the lowest (worst) quality. The results are then averaged to produce a MOS. For example, PCM uses the most bandwidth and yet receives the best MOS rating. CS-ACELP, on the other hand, provides significant bandwidth savings at the expense of a lower MOS rating. The following table shows the bit rates of the codecs we have learned about in this chapter, as well as their MOS:
Codec     Compression Technique                          Bit Rate    MOS
G.711     Pulse Code Modulation                          64 Kbps     4.1
G.726     Adaptive Differential Pulse Code Modulation    32 Kbps     3.85
G.723.1   Algebraic CELP                                 5.3 Kbps    3.65
G.729a    Conjugate Structure Algebraic CELP             8 Kbps      3.7
G.729     Conjugate Structure Algebraic CELP             8 Kbps      3.92
G.728     Low-Delay CELP                                 16 Kbps     3.61
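The averaging step of the MOS methodology is simple enough to sketch directly; the listener panel scores below are hypothetical:

```python
def mean_opinion_score(ratings):
    """Average listener ratings on the 1 (worst) to 5 (best) scale."""
    if not all(1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must be between 1 and 5")
    return sum(ratings) / len(ratings)

# A hypothetical panel of eight listeners scoring one codec's samples:
print(mean_opinion_score([4, 4, 5, 4, 3, 4, 5, 4]))  # 4.125
```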

Codec Complexity The codecs that we have learned about in this chapter can be classified as low, medium, or high complexity codecs. Low complexity codecs, such as PCM, use the least amount of processing power, while high complexity codecs, such as CS-ACELP, have the highest processing requirements. Medium complexity codecs fall in between high and low complexity codecs. However, it is not uncommon for low complexity codecs to be classified as medium complexity codecs, resulting in only two complexity classes: medium and high.

The difference between medium and high complexity codecs is the amount of CPU utilization necessary to process the codec algorithm, and therefore, the number of voice channels that can be supported by a single DSP. For this reason, all the medium complexity codecs can also be run in high complexity mode, but fewer (usually half) of the channels are available per DSP. In Cisco voice gateways, medium complexity allows the C549 DSPs to process up to four voice or fax-relay calls per DSP and the C5510 DSPs to process up to eight voice or fax-relay calls per DSP. High complexity allows the C549 DSPs to process up to two voice or fax-relay calls per DSP and the C5510 DSPs to process up to six voice or fax-relay calls per DSP. Fax-relay, which will be described in detail later in this guide, supports bit rates of 2.4 Kbps, 4.8 Kbps, 7.2 Kbps, 9.6 Kbps, 12 Kbps, and 14.4 Kbps. In addition, fax-relay can use medium or high complexity codecs. The following table shows each codec complexity level and the applicable codecs that are a part of that level:
Codec Complexity   Codecs
Low                G.711, Fax-relay / Fax-passthrough, Modem-relay / Modem-passthrough, Clear channel
Medium             G.729A, G.729AB, Fax-relay / Fax-passthrough, G.726
High               G.729, G.729B, Fax-relay / Fax-passthrough, G.728, G.723.1
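The per-DSP call capacities quoted above can be turned into a small capacity-planning sketch (the figures come from the text; the function name and structure are just illustrative):

```python
# Calls per DSP by DSP family and codec complexity, per the figures above.
CALLS_PER_DSP = {
    ("C549", "medium"): 4,
    ("C549", "high"): 2,
    ("C5510", "medium"): 8,
    ("C5510", "high"): 6,
}

def max_calls(dsp_type, complexity, num_dsps):
    """Upper bound on simultaneous voice/fax-relay calls for a voice card."""
    return CALLS_PER_DSP[(dsp_type, complexity)] * num_dsps

# A card with three C5510 DSPs running high-complexity codecs:
print(max_calls("C5510", "high", 3))  # 18
```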

Cisco IOS software allows administrators to select the desired codec complexity via the codec complexity [high | medium | flex] voice-card configuration command. Platforms that support the C549 DSP technology, such as Cisco 2600 and 3600 series routers, support only the [high] and [medium] keywords. Platforms that support C5510 DSP technology, such as the Cisco 3800 ISR, also support the [flex] option. When flex complexity is configured, up to sixteen calls can be completed per DSP; the number of supported calls varies from six to sixteen and is based on the codec used for a call. These available options are illustrated in the following output for a typical Cisco 3800 ISR router:
R1(config)#voice-card 0
R1(config-voicecard)#codec complexity ?
  flex    Set codec complexity Flex.  Flex complexity, higher call density.
  high    Set codec complexity high.  High complexity, lower call density.
  medium  Set codec complexity medium.  Mid range complexity and call density.
  <cr>

The selected codec complexity can be viewed in the output of the show voice dsp command:

R1#show voice dsp

----------------------------FLEX VOICE CARD 0 -----------------------------

*DSP SIGNALING CHANNELS*
DSP   DSP           DSPWARE  CURR  BOOT              PAK      TX/RX
TYPE  NUM CH CODEC  VERSION  STATE STATE  RST AI VOICEPORT TS ABRT PACK COUNT
===== === == ====== ======== ===== ====== === == ========= == ==== ==========
C5510 001 01 {flex} 4.4.29   alloc idle     0  0 0/0/0     02    0 0/0
C5510 001 02 {flex} 4.4.29   alloc idle     0  0 0/0/1     06    0 0/0
C5510 001 03 {flex} 4.4.29   alloc idle     0  0 0/0/2     10    0 0/0
C5510 001 04 {flex} 4.4.29   alloc idle     0  0 0/0/3     14    0 0/0
C5510 001 05 {flex} 4.4.29   alloc idle     0  0 0/1/0     02    0 0/0
C5510 001 06 {flex} 4.4.29   alloc idle     0  0 0/1/1     06    0 0/0
C5510 001 07 {flex} 4.4.29   alloc idle     0  0 0/1/2     10    0 0/0
C5510 001 08 {flex} 4.4.29   alloc idle     0  0 0/1/3     14    0 0/0
C5510 001 09 {flex} 4.4.29   alloc idle     0  0 1/0:1     01    0 0/0
C5510 001 10 {flex} 4.4.29   alloc idle     0  0 1/0:1     02    0 0/0
C5510 001 11 {flex} 4.4.29   alloc idle     0  0 1/0:1     03    0 0/0
[Truncated Output]

--------------------------END OF FLEX VOICE CARD 0 --------------------------

Calculating Codec Bandwidth

One of the most important factors to consider when building packet voice networks is proper capacity planning. Within capacity planning, bandwidth calculation is an important factor to consider when designing and troubleshooting packet voice networks for good voice quality. In this last section, we are going to learn about calculating codec bandwidth and features to modify or conserve bandwidth when Voice over IP (VoIP) is used. Before a VoIP call can be heard at the other end, one of the users must pick up the telephone and dial the digits of the party they are trying to reach. The voice gateway connected to the telephone interprets the dialed digits and uses signaling to set up the VoIP call. The called party hears the ringing, the caller hears the ring back tone, and when the called party picks up, the VoIP call is considered connected. The actual voice call itself, i.e. after the called party has picked up and the two users are connected, uses Real-Time Transport Protocol (RTP). RTP defines a standardized packet format for delivering audio and video over IP. The following diagram illustrates the format of an IP packet using Real-Time Transport Protocol:

In addition to RTP, the Layer 2 encapsulation used for VoIP transport must also be taken into consideration when calculating the overall bandwidth consumed by a voice call. Layer 2 technologies used for VoIP include Multilink Point-to-Point Protocol (MP), Frame Relay Forum 12 (FRF.12), and Ethernet. MP and FRF.12 Layer 2 headers each add 6 bytes of overhead, which includes 1 byte of overhead for the end-of-frame flag on MP and Frame Relay frames. Ethernet adds 18 bytes of overhead, including 4 bytes of Frame Check Sequence (FCS), or Cyclic Redundancy Check (CRC). Additionally, it is important to know that if RTP is compressed, referred to as compressed RTP, or cRTP, the IP/UDP/RTP headers are reduced to 2 or 4 bytes. Keep in mind, however, that cRTP is not available over Ethernet, as it is a WAN-only technology. The following table contains calculations for the default voice payload sizes in Cisco Unified Communications Manager or Cisco voice gateways:
Codec &          Sample   Sample      MOS    Payload   Payload   PPS   Bandwidth      Bandwidth w/     Bandwidth
Bit Rate         Size     Interval           Size      Size            MP or FRF.12   cRTP MP or       Ethernet
(Kbps)           (Bytes)  (ms)               (Bytes)   (ms)            (Kbps)         FRF.12 (Kbps)    (Kbps)
G.711 (64K)      80       10          4.1    160       20        50    82.8           67.6             87.2
G.729 (8K)       10       10          3.92   20        20        50    26.8           11.6             31.2
G.723.1 (6.3K)   24       30          3.9    24        30        34    18.9           8.8              21.9
G.723.1 (5.3K)   20       30          3.8    20        30        34    17.9           7.7              20.8
G.726 (32K)      20       5           3.85   80        20        50    50.8           35.6             55.2
G.726 (24K)      15       5           3.7    60        20        50    42.8           27.6             47.2
G.728 (16K)      10       5           3.61   60        30        34    28.5           18.4             31.5

The terms used in the table above are listed and described in the following table:

Codec Bit Rate (Kbps)
Based on the codec, this is the number of bits per second that need to be transmitted to deliver a voice call (codec bit rate = codec sample size / codec sample interval).

Codec Sample Size (Bytes)
Based on the codec, this is the number of bytes captured by the Digital Signal Processor (DSP) at each codec sample interval. For example, the G.729 coder operates on sample intervals of 10 ms, corresponding to 10 bytes (80 bits) per sample at a bit rate of 8 Kbps.

Codec Sample Interval (ms)
This is the sample interval at which the codec operates. For example, the G.729 coder operates on sample intervals of 10 ms, corresponding to 10 bytes (80 bits) per sample at a bit rate of 8 Kbps.

Voice Payload Size (Bytes)
The voice payload size represents the number of bytes (or bits) that are filled into a packet. The voice payload size must be a multiple of the codec sample size. For example, G.729 packets can use 10, 20, 30, 40, 50, or 60 bytes of voice payload size.

Voice Payload Size (ms)
The voice payload size can also be represented in terms of the codec samples. For example, a G.729 voice payload size of 20 ms (two 10 ms codec samples) represents a voice payload of 20 bytes [ (20 bytes * 8) / (20 ms) = 8 Kbps ].

PPS
PPS represents the number of packets that need to be transmitted every second in order to deliver the codec bit rate. For example, for a G.729 call with a voice payload size per packet of 20 bytes (160 bits), 50 packets need to be transmitted every second [ 50 pps = (8 Kbps) / (160 bits per packet) ].

The following formulas are used for bandwidth calculations:

1. Total packet size = (L2 header) + (IP/UDP/RTP header) + (voice payload size in bytes)
2. PPS = (codec bit rate in bits) / (voice payload size in bits)
3. Bandwidth = total packet size (in bits) * PPS
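The three-step formula can be wrapped in a small Python sketch. The overhead values come from the text above (MP/FRF.12 = 6 bytes, Ethernet = 18 bytes, IP/UDP/RTP = 40 bytes); the 2-byte cRTP header is an assumption, since cRTP can use 2 or 4 bytes:

```python
# Layer 2 overhead in bytes (MP and FRF.12 include the end-of-frame flag).
L2_OVERHEAD = {"mp": 6, "frf12": 6, "ethernet": 18}
IP_UDP_RTP = 40   # uncompressed IP/UDP/RTP header stack
CRTP = 2          # compressed RTP header (2-4 bytes; 2 assumed here)

def voip_bandwidth_bps(codec_bps, payload_bytes, l2, crtp=False):
    """Per-call bandwidth via: packet size -> PPS -> bandwidth."""
    header = CRTP if crtp else IP_UDP_RTP
    total_packet_bits = (L2_OVERHEAD[l2] + header + payload_bytes) * 8
    pps = codec_bps / (payload_bytes * 8)
    return total_packet_bits * pps

# G.729 over MP with cRTP, default 20-byte payload: 11.2 Kbps
assert voip_bandwidth_bps(8000, 20, "mp", crtp=True) == 11200
# G.711 over Ethernet, default 160-byte payload: 87.2 Kbps
assert voip_bandwidth_bps(64000, 160, "ethernet") == 87200
```

Note that cRTP is WAN-only, so the function would never sensibly be called with `l2="ethernet"` and `crtp=True`.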

For example, the required bandwidth for a G.729 call, using an 8 Kbps codec bit rate with cRTP, MP and the default 20 bytes of voice payload is calculated as follows:

Total packet size = (L2 header) + (IP/UDP/RTP header) + (voice payload size in bytes)
Total packet size = (6) + (2) + (20) = 28 bytes
PPS = (codec bit rate) / (voice payload size in bits)
PPS = (8000) / (160) = 50 pps
Bandwidth = total packet size (in bits) * PPS
Bandwidth = 224 * 50 = 11200 bps, or 11.2 Kbps

Therefore, to support 2 such calls, 22.4 Kbps of bandwidth would be required, and to support 10 such calls, 112 Kbps of bandwidth would be required. As a second example, the required bandwidth for a G.729 call, using an 8 Kbps codec bit rate, Frame Relay and the default 20 bytes of voice payload is calculated as follows:

Total packet size = (L2 header) + (IP/UDP/RTP header) + (voice payload size in bytes)
Total packet size = (6) + (40) + (20) = 66 bytes
PPS = (codec bit rate) / (voice payload size in bits)
PPS = (8000) / (160) = 50 pps
Bandwidth = total packet size (in bits) * PPS
Bandwidth = 528 * 50 = 26400 bps, or 26.4 Kbps

To support 3 such calls, 79.2 Kbps of bandwidth would be required, and to support 10 such calls, 264 Kbps of bandwidth would be required.

As yet another example, the required bandwidth for a G.711 call, using a 64 Kbps codec bit rate, Ethernet and the default 160 bytes of voice payload is calculated as follows:

Total packet size = (L2 header) + (IP/UDP/RTP header) + (voice payload size in bytes)
Total packet size = (18) + (40) + (160) = 218 bytes
PPS = (codec bit rate) / (voice payload size in bits)
PPS = (64000) / (1280) = 50 pps
Bandwidth = total packet size (in bits) * PPS
Bandwidth = 1744 * 50 = 87200 bps, or 87.2 Kbps

To support 4 such calls, 348.8 Kbps of bandwidth would be required, and to support 10 such calls, 872 Kbps of bandwidth would be required. As a final example, the required bandwidth for a G.729 call, using an 8 Kbps codec bit rate, Ethernet and the default 20 bytes of voice payload is calculated as follows:

Total packet size = (L2 header) + (IP/UDP/RTP header) + (voice payload size in bytes)
Total packet size = (18) + (40) + (20) = 78 bytes
PPS = (codec bit rate) / (voice payload size in bits)
PPS = (8000) / (160) = 50 pps
Bandwidth = total packet size (in bits) * PPS
Bandwidth = 624 * 50 = 31200 bps, or 31.2 Kbps

To support 5 such calls, 156 Kbps of bandwidth would be required, and to support 10 such calls, 312 Kbps of bandwidth would be required.
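The multi-call scaling used in the examples above is simple multiplication, sketched here for completeness:

```python
def trunk_bandwidth_kbps(per_call_kbps, calls):
    """Aggregate bandwidth to provision for simultaneous identical calls."""
    return per_call_kbps * calls

# The scaling from the examples above (rounded to avoid float artifacts):
assert round(trunk_bandwidth_kbps(11.2, 10), 1) == 112.0   # G.729 cRTP over MP
assert round(trunk_bandwidth_kbps(87.2, 4), 1) == 348.8    # G.711 over Ethernet
```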

In conclusion, while it is important to understand the advantages of using one codec over another, it is even more important to understand the bandwidth requirements of the different codecs in VoIP implementations, especially in real-world networks. One of the most common, and most discrediting, mistakes that you can make as a voice engineer is provisioning inadequate bandwidth for the required number of voice calls. Make sure that you are completely comfortable with calculating the bandwidth requirements for different codecs before moving on to the next chapter.
