
Stereo Coding for Audio Compression

Rui Wang, Harold Nyikal, James Yu

March 7, 2005

Abstract

A perceptual audio coder with stereo coding is implemented and reviewed. Mid/Side (M/S)
stereo coding is incorporated into the baseline perceptual coder. The coder was tested at
128 kbps on various audio files ranging from music to speech. Stereo coding is shown to
remove much of the redundancy in stereo signals.

1 Introduction

Audio compression has become an increasingly important technique for audio transmission and
storage. Perceptually lossless compression is achieved by exploiting psychoacoustic models
that discard information the human auditory system cannot perceive. This yields much
higher compression ratios than entropy coding alone. The processing is usually performed in
a transform domain, where information is discarded by quantizing transform bins more
coarsely.

Joint stereo coding is an extension to the nominal block floating point quantization scheme.
The main assumption in stereo coding is that the left and right channels of audio are highly
correlated. This is usually the case for speech and music, since the two microphones are
spatially close and used at the same time to record the same sounds. This strong correlation
suggests that there is redundancy in the stereo signal.

First, we review the perceptual model that is at the heart of the compression scheme. Then,
the stereo coder is detailed and analyzed. Finally, compression quality results are provided.

2 Overview of Perceptual Model

The perceptual model is based on the characteristics of the human auditory system. The
model ultimately dictates the number of bits assigned to each line in the frequency domain.
By assigning different numbers of bits, the coder quantizes each frequency line with a
different level of coarseness according to its relative importance.

The steps of the perceptual analysis are as follows:

1. The signal is divided into blocks of size N.

2. The FFT is applied to each block.

3. The masking model is used to determine the signal-to-mask ratio (SMR) for each Bark subband.

4. A waterfilling algorithm assigns the number of bits to be consumed for each subband.

5. The block is quantized using block floating point quantization.

The quantization actually occurs in the MDCT domain. Note that the signal is analyzed in
the FFT domain, and processed in the MDCT domain. The MDCT is usually the preferred
transform since it has some nice characteristics in terms of implementation, windowing, and
audio quality.
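To make the analysis/synthesis structure concrete, the sketch below implements a direct-form MDCT and inverse MDCT with 50% overlap and a sine window (which satisfies the perfect-reconstruction window condition). The block size and window choice are illustrative assumptions rather than parameters taken from this coder; quantization would be applied to the MDCT coefficients between the two transforms.

```python
import numpy as np

def mdct(block):
    """Direct-form MDCT: 2N windowed time samples -> N frequency lines."""
    two_n = len(block)
    n_half = two_n // 2
    n = np.arange(two_n)
    k = np.arange(n_half)
    basis = np.cos(np.pi / n_half * np.outer(n + 0.5 + n_half / 2, k + 0.5))
    return block @ basis

def imdct(coeffs):
    """Inverse MDCT: N frequency lines -> 2N time samples (to be overlap-added)."""
    n_half = len(coeffs)
    n = np.arange(2 * n_half)
    k = np.arange(n_half)
    basis = np.cos(np.pi / n_half * np.outer(n + 0.5 + n_half / 2, k + 0.5))
    return (2.0 / n_half) * (basis @ coeffs)

def analysis_synthesis(x, n_half=512):
    """Window -> MDCT -> IMDCT -> window -> overlap-add (no quantization here)."""
    win = np.sin(np.pi * (np.arange(2 * n_half) + 0.5) / (2 * n_half))  # sine window
    padded = np.concatenate([np.zeros(n_half), x, np.zeros(2 * n_half)])
    out = np.zeros(len(padded))
    for start in range(0, len(padded) - 2 * n_half + 1, n_half):
        frame = padded[start:start + 2 * n_half] * win
        coeffs = mdct(frame)                 # quantization would happen here
        out[start:start + 2 * n_half] += imdct(coeffs) * win
    return out[n_half:n_half + len(x)]

if __name__ == "__main__":
    x = np.random.randn(4096)
    print("max reconstruction error:", np.max(np.abs(x - analysis_synthesis(x))))
```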

2.1 Masking

The main mechanism behind the audio perceptual coding model is masking. Masking is the
act of one particular signal frequency component inhibiting the perceived strength of another
frequency component.

For each frequency line in the signal, a masking curve is derived in Bark space using

$$F(dz, L_M) = \left(-27 + 0.37\,\max(L_M - 40,\, 0)\,u(dz)\right)\lvert dz \rvert \qquad (1)$$

where L_M is the masker's sound pressure level (SPL) in dB, u(.) is the unit step function,
and dz is the signed Bark-scale distance to the masker. The resulting shape is a triangle with a
constant slope below the masker and a slope above the masker that becomes shallower as the
masker's SPL increases. This function is evaluated for every frequency line, and the final
masking curve is the point-by-point maximum of all the individual curves and the
threshold-in-quiet curve.
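As an illustration of how the masking curve is assembled, the sketch below evaluates Eq. (1) for a set of maskers and takes the point-by-point maximum over the individual curves and the threshold in quiet. The Bark mapping, the threshold-in-quiet approximation, and the fixed drop of each masking curve below the masker's SPL are standard textbook choices (see, e.g., [2]) assumed here for illustration; they are not parameters quoted in this report.

```python
import numpy as np

def hz_to_bark(f_hz):
    """Standard Bark-scale approximation (assumed; not specified in the report)."""
    return 13.0 * np.arctan(0.00076 * f_hz) + 3.5 * np.arctan((f_hz / 7500.0) ** 2)

def threshold_in_quiet(f_hz):
    """Commonly used threshold-in-quiet approximation, in dB SPL (assumed)."""
    f_khz = np.maximum(f_hz, 20.0) / 1000.0
    return (3.64 * f_khz ** -0.8
            - 6.5 * np.exp(-0.6 * (f_khz - 3.3) ** 2)
            + 1e-3 * f_khz ** 4)

def spreading(dz, masker_spl):
    """Eq. (1): triangular spreading function in dB at signed Bark distance dz."""
    upper = (dz > 0).astype(float)   # unit step u(dz): shallower slope above the masker
    return (-27.0 + 0.37 * max(masker_spl - 40.0, 0.0) * upper) * np.abs(dz)

def masking_curve(freqs_hz, masker_freqs_hz, masker_spls, drop_db=16.0):
    """Pointwise max over per-masker curves and the threshold in quiet (dB SPL).

    drop_db is an assumed fixed offset of the masking curve below the masker's SPL.
    """
    z = hz_to_bark(freqs_hz)
    curve = threshold_in_quiet(freqs_hz)
    for f_m, spl_m in zip(masker_freqs_hz, masker_spls):
        dz = z - hz_to_bark(f_m)
        curve = np.maximum(curve, spl_m - drop_db + spreading(dz, spl_m))
    return curve

if __name__ == "__main__":
    freqs = np.linspace(20, 20000, 1024)
    print(masking_curve(freqs, [1000.0, 4000.0], [80.0, 60.0])[:5])
```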

The relative importance of each frequency line is determined by its signal-to-mask ratio
(SMR), measured in dB. The actual bit allocation is performed on each subband, where the
subbands are determined by the standard Bark scale. Therefore, a subband with a higher
maximum SMR is allocated more bits than one with a lower maximum SMR.

Basically, signal components that fall below the masking threshold are masked (hidden) by
other, usually stronger, components. We therefore consider them less important when
reconstructing the signal.

2.2 Bit Allocation

The bit allocation scheme used is based on block floating point. Each subband is assigned
a scale factor, which applies to all the lines in that subband, and each line is then
represented by a mantissa. The perceptual coder determines how many mantissa bits,
Rb , each subband receives. A bit pool P is determined for the block from the desired
bitrate and the sampling frequency.

A waterfilling algorithm is used to assign bits to each subband based on the SMR. The
algorithm is as follows

1. Determine the number of bits in the bit pool P , based on the desired bit rate, block
size, and sampling frequency.

2. Sort the subbands by SMR

3. Add one bit to the subband with the highest SMR, or two bits if that subband is receiving
bits for the first time.

4. Decrement that subband's SMR by 6 dB for each bit allocated.

5. Decrement the bit pool by the number of bits consumed.

6. Go to step 2, and repeat until the bit pool is empty (this loop is sketched in code after the list).
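The listing below sketches this waterfilling loop, under the assumption (typical of block floating point coders) that giving a subband one more mantissa bit costs one bit for every spectral line in that subband; the 16-bit mantissa cap and the example subband widths are illustrative.

```python
import numpy as np

def waterfill_bits(smr_db, lines_per_band, bit_pool, max_bits=16):
    """Greedy waterfilling: repeatedly give a bit to the band with the largest SMR.

    smr_db         -- maximum SMR per subband, in dB
    lines_per_band -- number of spectral lines in each subband
    bit_pool       -- total mantissa bits available for the block
    """
    smr = np.asarray(smr_db, dtype=float).copy()
    lines = np.asarray(lines_per_band)
    bits = np.zeros(len(smr), dtype=int)
    while True:
        for b in np.argsort(-smr):             # step 2: sort bands by SMR
            step = 2 if bits[b] == 0 else 1    # step 3: two bits on a band's first pass
            cost = step * lines[b]             # assumed cost: bits apply to every line
            if bits[b] + step <= max_bits and cost <= bit_pool:
                bits[b] += step
                smr[b] -= 6.0 * step           # step 4: ~6 dB of noise drop per bit
                bit_pool -= cost               # step 5: charge the pool
                break                          # step 6: re-sort and repeat
        else:                                  # no band can accept more bits
            return bits

if __name__ == "__main__":
    smr = [22.0, 15.0, 9.5, 3.0, -4.0]
    lines = [4, 4, 8, 16, 32]
    print(waterfill_bits(smr, lines, bit_pool=256))
```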

Once all the bits have been used, the block is quantized using the block floating point
quantization scheme and efficiently packed into the compressed file.
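One simple way to realize the block floating point step is sketched below: each subband carries a shared scale factor (here a 4-bit exponent chosen from the band's peak magnitude) and each line a signed mantissa of the allocated width. The exact scale-factor and mantissa formats of the report's coder are not specified here, so these choices are assumptions for illustration.

```python
import numpy as np

def bfp_quantize(band, mantissa_bits, scale_bits=4):
    """Block floating point sketch: one exponent per band, one mantissa per line."""
    peak = np.max(np.abs(band)) + 1e-12
    # exponent = number of doublings that can be applied before the peak would clip at 1.0
    exponent = int(np.clip(np.floor(-np.log2(peak)), 0, 2 ** scale_bits - 1))
    if mantissa_bits == 0:
        return exponent, np.zeros(len(band), dtype=int)
    levels = 2 ** (mantissa_bits - 1) - 1            # symmetric midtread quantizer
    mantissas = np.round(band * 2.0 ** exponent * levels).astype(int)
    return exponent, np.clip(mantissas, -levels, levels)

def bfp_dequantize(exponent, mantissas, mantissa_bits):
    """Undo the scaling applied in bfp_quantize."""
    if mantissa_bits == 0:
        return np.zeros(len(mantissas))
    levels = 2 ** (mantissa_bits - 1) - 1
    return mantissas / levels / 2.0 ** exponent

if __name__ == "__main__":
    band = np.array([0.01, -0.02, 0.015, 0.005])
    exp_, mant = bfp_quantize(band, mantissa_bits=6)
    print(bfp_dequantize(exp_, mant, 6))
```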

2.3 Decoding

The decoding algorithm is much less complex than the encoding scheme. This is desirable
since in many cases the decoder has less processing power than the encoder.

No psychoacoustic analysis needs to be performed in the decoder. Only the Rb values are
needed for the decoder to correctly dequantize the frequency lines. After that, the inverse
MDCT is applied block by block and the resulting blocks are overlap-added to produce the
reconstructed signal.

3 Stereo Coding

Joint stereo coding is an extension of the psychoacoustic model that takes advantage of the
typical high correlation that exists between the signal power spectra of the left and right

channels in stereo audio to improve coding gains. There are various ways of achieving this
in practice.

We chose to implement mid/side (M/S) stereo coding as outlined in [1]. In this method,
instead of transmitting the left and right channels, the normalized sum (mid) and difference
(side) signals are transmitted. Also, the left and right channels share a common bit pool.
Depending on the signal, this can reduce the data rate by up to 50%.

For example, consider a stereo signal with identical left and right channels. The difference
of the two channels will be zero and thus the side information (all zeros) can be transmitted
with a single bit to say that it is all zeros. This frees up bits, allowing the mid information
to be transmitted with twice as many bits. In any case, the left and right signals can be
completely reconstructed at the decoder. Hence, the coding gain is roughly 50%.

Though stereo signals are seldom like the example above, the side information is usually
smaller in value than either the left or right channels, suggesting a reduction in the number
of bits. However, cross-channel psychoacoustics play a big role in perception of stereo sound,
and because the mid and the side both contain information on the left and right channels,
we must perform bit-allocation for the mid and side information based on a cross-channel
psychoacoustic model.

Here, we detail the different components in stereo coding. The flow chart of the encoding
algorithm can be seen in Figure 1. Also, the decoding algorithm is shown in Figure 2.

3.1 M/S Decision

The first step in stereo coding is to decide whether to transmit data as left/right or mid/side.
There are cases where there are no significant gains in transmitting mid/side information over
left/right in certain subbands. In cases like these, the left/right information is transmitted.
Our decision for M/S is applied for each subband of the signal. The decision thresholds are
$$\sum_{k=f_{\mathrm{lower}}}^{f_{\mathrm{higher}}} \left(l_k^2 - r_k^2\right) < 0.8 \sum_{k=f_{\mathrm{lower}}}^{f_{\mathrm{higher}}} \left(l_k^2 + r_k^2\right) \qquad (2)$$

$$\sum_{k=f_{\mathrm{lower}}}^{f_{\mathrm{higher}}} \left(l_k^2 - r_k^2\right) > 0.8 \sum_{k=f_{\mathrm{lower}}}^{f_{\mathrm{higher}}} \left(l_k^2 + r_k^2\right) \qquad (3)$$

where l_k and r_k correspond to the FFT spectral line amplitudes computed in the psychoacoustic
model, and f_lower and f_higher correspond to the lower and upper spectral lines within the subband.
If either of these conditions is met, then M/S is transmitted; if not, L/R is transmitted.
This condition allows M/S transmission in cases where the mid and the side differ in energy
by a certain threshold (in this case, 80%).

Figure 1: Flow chart of the stereo encoding algorithm.

Figure 2: Flow chart of the stereo decoding algorithm.

The values of M/S are calculated as follows


$$M = \frac{L + R}{2} \qquad (4)$$

$$S = \frac{L - R}{2} \qquad (5)$$
where L and R are the filter bank spectral line amplitudes. We can see that no actual
information is lost in the transformation to M/S. Both the L and R channels can be easily
recovered from the M and S channels.
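In code, Eqs. (4) and (5) and their inverse reduce to the small helpers below, applied to the spectral lines of a subband; the round trip confirms that the transformation is lossless.

```python
import numpy as np

def lr_to_ms(left, right):
    """Eqs. (4)-(5): mid/side from left/right spectral lines."""
    mid = (left + right) / 2.0
    side = (left - right) / 2.0
    return mid, side

def ms_to_lr(mid, side):
    """Inverse transform used by the decoder: L = M + S, R = M - S."""
    return mid + side, mid - side

if __name__ == "__main__":
    left = np.array([1.0, 0.5, -0.25])
    right = np.array([0.9, 0.6, -0.20])
    mid, side = lr_to_ms(left, right)
    l2, r2 = ms_to_lr(mid, side)
    print(np.allclose(left, l2), np.allclose(right, r2))   # lossless round trip
```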

3.2 Masking in Stereo

Next, the masking thresholds for M and S need to be calculated. This is a step-wise process.
First, Eq. (1) is applied to each M and S frequency line, exactly as in Section 2.1, to
calculate the basic masking thresholds, denoted BTHR_m and BTHR_s [1].

To calculate the stereo masking contributions of the M and S channels, an additional factor,
the masking level difference (MLD) factor, is calculated at each frequency line and multiplied
by each of the M and S basic masking thresholds to obtain the masking level differences,
denoted MLD_m and MLD_s.

[Plot: the MLD factor (y-axis, 0 to 1) as a function of frequency (x-axis, 0 to 25 Barks).]

Figure 3: The MLD factor that is applied to the masking model.

The MLD provides a second level of detectability of noise in the M and S channels, based on
the masking level differences between the channels [1].

Essentially, the MLD is a measure of how detectable a masked signal in the M channel is in
the S channel and vice versa. The equation used to calculate the MLD factor is as follows
[1]:
$$\mathrm{MLD} = 10^{\,1.25\left(1 - \cos\left(\pi\,\frac{\min(z,\,15.5)}{15.5}\right)\right) - 2.5} \qquad (6)$$
where z is the frequency in Barks. Figure 3 shows the shape of the MLD curve. Now, the
masking level differences can be calculated as:

$$\mathrm{MLD}_m = \mathrm{MLD} \times \mathrm{BTHR}_m \qquad (7)$$

$$\mathrm{MLD}_s = \mathrm{MLD} \times \mathrm{BTHR}_s \qquad (8)$$

The actual thresholds for M and S are calculated as follows:

$$\mathrm{THR}_m = \max(\mathrm{BTHR}_m, \min(\mathrm{BTHR}_s, \mathrm{MLD}_s)) \qquad (9)$$

$$\mathrm{THR}_s = \max(\mathrm{BTHR}_s, \min(\mathrm{BTHR}_m, \mathrm{MLD}_m)) \qquad (10)$$
The MLD signal essentially substitutes for the BTHR signal in cases where there is a chance
of stereo unmasking [1].
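Eqs. (6) through (10) translate directly into the short sketch below; basic_thr_m and basic_thr_s stand for the per-line basic thresholds BTHR_m and BTHR_s obtained from Eq. (1), and the example threshold values are arbitrary.

```python
import numpy as np

def mld_factor(z_bark):
    """Eq. (6): masking level difference factor as a function of Bark frequency."""
    z = np.minimum(z_bark, 15.5)
    return 10.0 ** (1.25 * (1.0 - np.cos(np.pi * z / 15.5)) - 2.5)

def stereo_thresholds(z_bark, basic_thr_m, basic_thr_s):
    """Eqs. (7)-(10): final M and S thresholds guarding against stereo unmasking."""
    mld = mld_factor(z_bark)
    mld_m = mld * basic_thr_m                                          # Eq. (7)
    mld_s = mld * basic_thr_s                                          # Eq. (8)
    thr_m = np.maximum(basic_thr_m, np.minimum(basic_thr_s, mld_s))    # Eq. (9)
    thr_s = np.maximum(basic_thr_s, np.minimum(basic_thr_m, mld_m))    # Eq. (10)
    return thr_m, thr_s

if __name__ == "__main__":
    z = np.linspace(0.0, 25.0, 6)
    bthr_m = np.full_like(z, 1e-4)    # arbitrary example thresholds
    bthr_s = np.full_like(z, 1e-6)
    print(stereo_thresholds(z, bthr_m, bthr_s))
```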

3.3 Bit Allocation

The bit allocation scheme is exactly the same as in the baseline coder. The main difference
is that both channels now share a common bit pool P , and the SMRs are obtained from the
masking curves calculated with the additional MLD factor.

The waterfilling algorithm is now applied to all the frequency lines of both channels. This
allows the algorithm to assign bits to lines regardless of which channel they are in. Essentially,
this is where the coding gains are achieved: if one channel has a much higher SMR than the
other (as is common in the M/S representation), then more bits are allocated to that channel.
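A sketch of the shared-pool allocation: the subband SMRs of both channels are concatenated into a single list before waterfilling, so bits migrate to whichever channel needs them. The allocator is the same greedy loop sketched in Section 2.2, repeated here (with assumed parameters) so the example runs on its own.

```python
import numpy as np

def waterfill_bits(smr_db, lines_per_band, bit_pool, max_bits=16):
    """Same greedy allocator as in the Section 2.2 sketch (repeated so this runs)."""
    smr = np.asarray(smr_db, dtype=float).copy()
    lines = np.asarray(lines_per_band)
    bits = np.zeros(len(smr), dtype=int)
    while True:
        for b in np.argsort(-smr):
            step = 2 if bits[b] == 0 else 1
            cost = step * lines[b]
            if bits[b] + step <= max_bits and cost <= bit_pool:
                bits[b] += step
                smr[b] -= 6.0 * step
                bit_pool -= cost
                break
        else:
            return bits

def allocate_stereo(smr_a, smr_b, lines_per_band, bit_pool):
    """Shared bit pool: both channels' subbands compete in one waterfilling pass."""
    n = len(smr_a)
    joint = waterfill_bits(np.concatenate([smr_a, smr_b]),
                           np.concatenate([lines_per_band, lines_per_band]),
                           bit_pool)
    return joint[:n], joint[n:]          # split back into per-channel allocations

if __name__ == "__main__":
    lines = [4, 8, 16]
    bits_m, bits_s = allocate_stereo([20.0, 12.0, 6.0], [-5.0, -8.0, 2.0],
                                     lines, bit_pool=200)
    print("mid:", bits_m, "side:", bits_s)
```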

3.4 Packing and Decoding

The only extra information needed for the joint stereo decoder is an additional bit that
tells whether each subband is in an L/R or M/S representation. This information is passed
alongside the usual bit allocation bits. The decoder parses this information and will convert
any M/S representation into L/R for audio playback.

4 Results

We applied the joint stereo coding algorithm to various audio signals. One of the more
important tests was to make sure that the stereo image quality is not affected by coding
the signal as M/S. This was rigorously tested by modulating a signal’s channels by sinusoids
with different phase offsets.

Specifically, we modulated the left channel using a slowly varying cosine and the right channel
with a slowly varying sine. This results in the stereo image weaving from left to right. During
these tests we did not notice any degradation in the stereo and audio quality.
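A test signal of this kind can be generated as below; the carrier frequency, panning rate, and duration are arbitrary choices for the illustration.

```python
import numpy as np

def panning_test_signal(duration_s=10.0, fs=44100, carrier_hz=440.0, pan_hz=0.25):
    """Mono carrier panned left<->right by quadrature low-frequency envelopes."""
    t = np.arange(int(duration_s * fs)) / fs
    mono = np.sin(2 * np.pi * carrier_hz * t)
    left = mono * np.cos(2 * np.pi * pan_hz * t)    # slowly varying cosine envelope
    right = mono * np.sin(2 * np.pi * pan_hz * t)   # slowly varying sine envelope
    return np.stack([left, right], axis=1)          # (samples, 2) stereo array

if __name__ == "__main__":
    stereo = panning_test_signal()
    print(stereo.shape)
```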

For the rest of the quality tests, we used a constant bitrate of 128 kbps on various audio
signals. The results were on par with the baseline coder (i.e., the quality was very good).
Table 1 shows the listening test results on the subjective difference grade (SDG) scale.

5 Conclusion

Stereo coding has been shown to be very useful in reducing the redundancy in stereo audio
signals. The significant coding gains it provides can be spent in two ways.
Audio Name SDG
Castanets -0.5
Rock Music 0
Pop Music 0
Harpsichord -0.25
Glockenspiel -0.1
Bass Singer 0

Table 1: Listening test results for the audio signals compressed with the stereo coder. All
results use the SDG scale.

They can be used either to boost the quality of the reconstructed signal or to lower the
bitrate while keeping the signal quality constant with respect to the original coder. This is
possible because the M/S representation of the signal is essentially lossless.

Moreover, stereo coding does not hurt the stereo image when the stereo masking model and
the shared bit pool are used correctly. Our tests show that the overall quality remains the
same or improves. Stereo coding is also widely used in standard audio compression formats,
including MP3.

6 Future Work

There are many possible extensions to stereo coding. One of these is intensity coding. The
main idea is that, since much of the signal can be redundant between the two channels, we can
code some parts of it as a mono signal and reconstruct a stereo image from it at the decoder.
This is usually done at higher frequencies, where the ear is less sensitive to the stereo image.
Using this idea yields a gain of nearly 50% within those particular frequency bands.
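A minimal sketch of the intensity-coding idea, assuming it is applied per high-frequency subband: only the summed spectrum is transmitted, together with per-band scale factors that let the decoder re-spread the shared spectrum so that each channel's band energy is preserved. The particular scale-factor definition below is one common choice and is not taken from this report.

```python
import numpy as np

def intensity_encode(left_band, right_band):
    """Code a band as one summed spectrum plus per-channel energy scale factors."""
    common = left_band + right_band
    e_common = np.sum(common ** 2) + 1e-12
    scale_l = np.sqrt(np.sum(left_band ** 2) / e_common)
    scale_r = np.sqrt(np.sum(right_band ** 2) / e_common)
    return common, scale_l, scale_r

def intensity_decode(common, scale_l, scale_r):
    """Re-spread the shared spectrum; band energies match, fine structure is shared."""
    return common * scale_l, common * scale_r

if __name__ == "__main__":
    l = np.array([0.4, 0.3, 0.1])
    r = np.array([0.2, 0.15, 0.05])
    print(intensity_decode(*intensity_encode(l, r)))
```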

Another possible extension is to use a variable bit pool that saves bits for later use. This
would be most effective in cases where the M/S representation is severely skewed to one side.
Theoretically, we would only need half the number of bits to represent such a signal while
keeping the quality constant. The extra bits could then be used for bit-starved blocks later
in the signal; for example, they may alleviate pre-echo effects caused by powerful transients.

References
[1] J. D. Johnston and A. J. Ferreira, "Sum-Difference Stereo Transform Coding," Proc. ICASSP, pp. 569-571, May 1992.

[2] M. Bosi and R. E. Goldberg, Introduction to Digital Audio Coding and Standards. Kluwer Academic Publishers, 2003.

