Anite Network Testing - Testing of Voice Quality Using PESQ and POLQA Algorithms - Whitepaper

WWW.ANITE.COM
Testing of voice quality using PESQ and POLQA algorithms

JARI KUOKKANEN, PROJECT MANAGER, ANITE NETWORK TESTING
© Anite 2014
WHITEPAPER
CONTENTS
1 THEORETICAL BACKGROUND
1.1 Overview
1.2 ITU-T recommendations
1.2.1 ITU-T P.800
1.2.2 Perceptual modeling
1.2.3 ITU-T P.862.1 PESQ
1.2.4 ITU-T P.862.2 PESQ with wideband extensions
1.2.5 ITU-T P.863 POLQA
1.3 Performance evaluation
1.3.1 Moving from PESQ to POLQA [8]
1.3.2 PESQ and POLQA reliability
1.3.3 PESQ and EVRC
1.3.4 Test sample comparison
1.3.5 Some guidelines for evaluating the MOS
1.3.6 PESQ and wideband AMR codecs
2 MEETING THE REQUIREMENTS WITH NEMO VOICE QUALITY
2.1 Overview
2.2 Standard compliance
2.3 Signal processing
2.4 Nemo Media Router (NMR)
2.5 Voice quality testing procedure
3 SUMMARY
4 REFERENCES
1 THEORETICAL BACKGROUND
1.1 Overview
The speech transmission path across mobile and fixed networks consists of many different elements.
Along the path there can be multiple speech codecs, analog-to-digital and digital-to-analog
conversions, echo cancellers, noise suppressors, adaptive level controllers, voice activity detectors,
comfort noise generators, signal enhancers, and so on. In modern packet-switched networks,
variable delays and packet losses introduce further problems. Moreover, especially in mobile
networks, additional quality degradation may, and usually will, be caused by bit errors on the air
interface and by silent gaps resulting from, for example, handovers.
Systems this complex can degrade speech signals in a large variety of ways.
These degradations include loudness loss, talker and listener echo, temporal gaps in the speech signal,
filtering, amplitude clipping, variable delays, distortions, channel errors, and effects/artifacts from
noise reduction algorithms and from the operation of echo cancellers.
1.2 ITU-T recommendations

The following subsections are based on references [1] and [12], with permission from Opticom GmbH.
1.2.1 ITU-T P.800

Because speech quality assessment grew out of the assessment of telephone connections, useful
methods for testing telephone-band speech signals were first standardized within ITU-T.
Recommendation P.800 [2] defines the absolute category rating (ACR) test method, which has been
used for the assessment of speech codecs since 1993. Within the ACR test method, the ITU
five-grade impairment scale is applied (see Table 1).
It should be noted that because of the telecommunication environment, testing is done without a
comparison to an undistorted reference. This corresponds to the typical situation of a phone call, where
the listener has no access to a reference, for example the original voice of the
other party. However, the listening test according to P.800 could be
regarded as a comparison between a test signal and a reference in the mind of the listener,
because the listener is very familiar with the natural sound of a human voice.
For comparison reasons, and in order to be able to merge the results of different individuals, it is
necessary to adjust the listeners’ opinions to an absolute scale. For this purpose, predefined
examples with well-defined noise insertions of fixed modulated noise reference units (MNRU) [3]
are presented at the beginning of a test. Each sample represents an example distortion
corresponding to the ITU-T version of the five-grade impairment scale.
Impairment                      Grade
Imperceptible                   5
Perceptible, but not annoying   4
Slightly annoying               3
Annoying                        2
Very annoying                   1
Table 1. The ITU-T five-grade impairment scale.
Based on these test conditions, a population of typically 20 to 50 test subjects is presented with
an identical series of speech fragments. Every test subject is asked to score each sample on the
impairment scale. After statistical processing of the individual results, a mean opinion score (MOS)
can be calculated. With thorough setups, such test results can be reproduced quite well, even at
different locations. However, the effort needed in terms of subjects and time is tremendous.
Furthermore, such test methods cannot be applied in practical, day-to-day field measurements.
1.2.2 Perceptual modeling

Throughout the development of compression schemes, assessing the resulting quality remained an
open issue. Consequently, the idea of substituting the subjective tests with objective,
computer-based methods has been an ongoing focus of research and development. Early work motivated by
developments in speech coding was reported in [7]. Since then, several methods have been
introduced.
International standardization of perceptual audio measurement techniques was mainly driven by
two expert groups within the International Telecommunication Union (ITU).
The underlying concepts of the proposed algorithms for perceptual techniques are all quite similar.
The common structure of these algorithms is depicted in Figure 1.
Figure 1. The underlying concept for perceptual measurement.

The process of human perception is modelled by employing a difference measurement technique
which compares a reference signal (i.e. the input signal to a codec) with a test signal (i.e. the
output signal of the codec).
First, the algorithms apply an ear model to the reference and the test signal in order to calculate
an estimate of the audible signal components. The result can be imagined as the internal
representation inside the human auditory system. The comparison of the internal representations of
the reference and the test signal leads to an estimate of the audible difference.
1.2.3 ITU-T P.862.1 PESQ

Driven by the demand for a verified test procedure for VoIP, an expert group within ITU-T SG12
worked on an improved speech quality model. After a competitive phase, the new model
PESQ was devised. PESQ stands for Perceptual Evaluation of Speech Quality. PESQ
combines a further refinement of PSQM and PAMS. Extensive tests showed PESQ's superior
performance, especially for VoIP applications. In February 2001 PESQ was accepted as ITU-T
Rec. P.862 [6].
When PSQM was standardized as P.861 [5], the scope of the standard was the state-of-the-art
codecs of that time, which were mainly used for mobile transmission, like GSM. VoIP was not yet a topic at
that time. The requirements for measurement equipment have changed dramatically since then. As
a consequence, the ITU set up a working group to revise the P.861 standard in order to cope with
the new demands arising from modern networks like VoIP. With these networks the measurement
algorithm has to deal with much higher distortions than with GSM codecs, but perhaps the most
important factor is that the delay between the reference and the test signal is no longer constant.
A first approach to overcome these problems was the development of PSQM+ (which, however, is not
included in the standard). It handled the larger distortions caused by, for example,
burst errors well, but still had significant problems compensating for the varying delay.
With the new ITU standard P.862 (PESQ) [6] this problem is eliminated. PESQ combines
the excellent psychoacoustic and cognitive model of PSQM+ with a time alignment algorithm that
perfectly handles varying delays. The only drawback of PESQ is that it is not designed for
streaming applications. This in turn is why it cannot fully replace PSQM+. With PSQM and PESQ
there are now two standards that cover the entire problem of measuring speech quality. Figure 2
gives an overview of the structure of the PESQ algorithm and shows the new blocks which have
been added to the PSQM algorithm.
Figure 2. The structure of the PESQ algorithm.

One of the major advantages of PESQ compared to PSQM(+) is that it contains a very good time
alignment algorithm, which is capable of handling varying delays. With PSQM, such time alignment
was missing from the standard, and it was up to the implementers to take care of this issue. As
experience showed, only very few PSQM implementations came with a time alignment algorithm
that was well suited for static delays on real networks, and even fewer measurement systems were
capable of handling varying delays, as they appear on, for example, packet-based networks.
As a consequence of wrong time alignment, parts of the reference and the test signal that did not
match were compared and therefore sounded different. This difference led
to a PSQM score which was too pessimistic and simply wrong. With PESQ, this shortcoming is
eliminated and the user gets realistic results for the device under test. There is no longer a
danger that the tested system is downgraded just because of a deficiency of the measurement
algorithm.
The PESQ Score expresses the voice quality on a MOS-like scale. The PESQ Score as defined by
ITU recommendation P.862 ranges from –0.5 (worst) up to 4.5 (best). This may be surprising at first
glance since the ITU scale ranges up to 5.0, but the explanation is simple: PESQ simulates a
listening test and is optimized to reproduce the average result of all listeners (remember, MOS
stands for Mean Opinion Score). Statistics show, however, that the best average result one can
generally expect from a listening test is not 5.0 but approximately 4.5. It appears that subjects are
always cautious about giving a score of 5, meaning excellent, even if there is no degradation at all.
The PESQ score is frequently also referred to as the PESQ MOS, which indicates the high correlation
of the two values PESQ Score and MOS. This is however scientifically not correct. The PESQ Score
can be mapped to the ITU P.800 scale by applying a simple mapping function. One such function
has been standardized by the ITU in P.862.1 [9] (MOS-LQO). Another mapping function is the so
called PESQ-LQ. Due to the new recommendation P.862.1, PESQ-LQ is obsolete now.
The ITU has standardized a universal PESQ to MOS mapping. This was created from a shared pool
of subjective test results covering wireless, VoIP, fixed and codec-only conditions, including
Japanese, British English, American English, French, German, Italian, Swedish, Dutch and Finnish.
This mapping is continuous from PESQ –0.5 to 4.5 and MOS 1 to 4.55. It takes the form of a logistic
function with four parameters. The mapping from a raw PESQ score x to the P.862.1 MOS value y can be
computed as follows:
y = 0.999 + 4.0 / (1 + exp(-1.4945 x + 4.6607))
For more information on this mapping, please see ITU-T recommendation P.862.1.
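As a rough sketch of how this mapping might be applied when post-processing raw scores, the following Python function implements the logistic form with the coefficients published in ITU-T P.862.1 (0.999, 4.0, -1.4945 and 4.6607). The function name is illustrative only and is not part of any Nemo or Opticom interface; the coefficients should be checked against the Recommendation before they are relied upon.

```python
import math


def pesq_to_mos_lqo(raw_pesq: float) -> float:
    """Map a raw P.862 PESQ score (about -0.5 to 4.5) to P.862.1 MOS-LQO.

    The coefficients are those of the P.862.1 logistic mapping; verify
    them against the Recommendation before use.
    """
    return 0.999 + 4.0 / (1.0 + math.exp(-1.4945 * raw_pesq + 4.6607))


if __name__ == "__main__":
    # Print the mapping at the ends of the raw scale and at two mid points.
    for raw in (-0.5, 2.0, 3.0, 4.5):
        print(f"raw PESQ {raw:+.1f} -> MOS-LQO {pesq_to_mos_lqo(raw):.2f}")
```

At the ends of the raw scale the function returns values close to 1 and 4.55, which agrees with the range quoted above.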
1.2.4 ITU-T P.862.2 PESQ with wideband extensions

Adaptive Multi-Rate Wideband (AMR-WB) is a patented speech coding standard developed based on
Adaptive Multi-Rate encoding, using similar methodology as Algebraic Code Excited Linear
Prediction (ACELP). AMR-WB provides an excellent speech quality due to a wider speech bandwidth
of 50–7000 Hz. Narrowband speech coders in general are optimized for POTS wire line quality of
300–3400 Hz.
AMR-WB is codified as G.722.2, an ITU-T standard speech codec, formally known as Wideband
coding of speech at around 16 kbit/s using Adaptive Multi-Rate Wideband (AMR-WB). G.722.2 AMR-
WB is the same codec as the 3GPP AMR-WB. The corresponding 3GPP specifications are TS 26.190
for the speech codec and TS 26.194 for the Voice Activity Detector.
ITU-T Recommendation P.862.2 specifies a wideband extension to P.862, and includes an integral
mapping of the raw PESQ score onto a MOS scale. The mapping from a raw wideband PESQ score x to
the P.862.2 MOS value y can be computed as follows:
y = 0.999 + 4.0 / (1 + exp(-1.3669 x + 3.8224))
For more information on this mapping, please see ITU-T recommendation P.862.2 [11].
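The wideband mapping can be sketched in the same way. The Python function below assumes the logistic coefficients published in ITU-T P.862.2 (0.999, 4.0, -1.3669 and 3.8224); the function name is hypothetical and the coefficients should be verified against the Recommendation.

```python
import math


def wb_pesq_to_mos_lqo(raw_wb_pesq: float) -> float:
    """Map a raw wideband PESQ score to P.862.2 MOS-LQO.

    Same logistic form as the narrowband P.862.1 mapping, but with the
    wideband coefficients; verify them against P.862.2 before use.
    """
    return 0.999 + 4.0 / (1.0 + math.exp(-1.3669 * raw_wb_pesq + 3.8224))


if __name__ == "__main__":
    for raw in (1.0, 2.0, 3.0, 4.5):
        print(f"raw WB PESQ {raw:.1f} -> MOS-LQO {wb_pesq_to_mos_lqo(raw):.2f}")
```

Because the coefficients differ from the narrowband ones, the same raw score maps to a different MOS value in the two modes, which is one reason narrowband and wideband results should not be mixed.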
1.2.5 ITU-T P.863 POLQA

POLQA is the next-generation voice quality testing technology for fixed, mobile and IP based
networks. POLQA has been selected to form the new ITU-T voice quality testing standard, P.863,
and will be used with HD Voice, 3G and 4G/LTE. POLQA, which stands for Perceptual Objective
Listening Quality Analysis, will offer a new level of benchmarking capability to determine the
voice quality of mobile network services.
POLQA's radically revised psycho-acoustic and cognitive model allows, for the first time in the
history of the ITU-T P.86X series, a true quality prediction for: EVRC type codecs, Noise Reduction
and Voice Quality Enhancement, Time-warping, UCC and VoIP, Non-optimal presentation levels,
Filtering and spectral shaping and Recordings made at an ear simulator.
POLQA fits well with new transmission technologies in service now or to be launched in the near
future, and provides stable and accurate results along with an improvement in performance for
existing technologies. POLQA is designed to measure disruptive effects caused by multi-component
channels. POLQA will handle new voice services like stretching and compression of speech signals in
time domain. POLQA improves quality prediction for new codec schemes like AMR and EVRC.
POLQA combines an excellent psychoacoustic and cognitive model with a new time alignment
algorithm that perfectly handles varying delays.
For most applications the change from PESQ to POLQA is optional, though it will lead to improved
accuracy in most cases. For the following applications however, POLQA is clearly preferable:
• Acoustic interfaces on the receiving side
• Direct comparison of EVRC with AMR type codecs
• Wideband and super-wideband measurements
• Very high background noise levels
• Severe linear filtering
• Terminal assessment
For more information about POLQA, please see ITU-T recommendation P.863.
1.3 Performance evaluation
1.3.1 Moving from PESQ to POLQA [8]

P.863 narrowband mode was developed to provide P.862 backwards compatibility. This backwards
compatibility means that P.863 can be used wherever you are currently using P.862. It also means
that the predictions of P.863 can be directly compared to both new and old narrowband subjective
tests.
Although P.863 was built to be backwards compatible, one will see different scores when testing a
connection with P.863 instead of P.862, because there are fundamental differences between
how P.862 and P.863 process the signals. Also, a significantly larger database of subjective test
results has been used in the development and calibration of P.863 compared to P.862. This
increase in data means that we have a more complete view of how people judge the full range of
speech quality delivered by modern telephone networks. For this reason, and because P.863
achieves better performance across all databases compared to P.862, predictions from P.863 should
be viewed as more accurate than those from P.862.
It is recommended that P.862 and P.863 are run in parallel until one gets a feeling for P.863 scores.
Degradation: Bandwidth limitations
  P.863 POLQA: P.863 in super-wideband mode takes into account bandwidth limitations by detecting
  the absence of any speech energy above 3.8 kHz to indicate a narrowband degraded file and the
  absence of any speech energy above 7 kHz to indicate a wideband degraded file. With a narrowband
  signal the maximum achievable score is 3.8. Change in bandwidth between speakers (different
  gender or talker) in a single test may lead to a lower than expected score.
  P.862/P.862.1 PESQ: PESQ applies a linear frequency equalization stage before presenting the
  signals to the psycho-acoustic model. This effectively removes frequency response influences from
  being detected in the model. This is useful for small degrees of frequency shaping but PESQ
  underestimates severe linear frequency response distortions.

Degradation: Short interrupts (e.g. packet loss)
  P.863 POLQA: The predicted quality score tends to a MOS value of 1.0 as frame loss increases up to
  30%. Even small loss rates cause a drop in measured speech quality. Results show that the
  super-wideband mode is slightly more sensitive to short interrupts than the narrowband mode.
  P.862/P.862.1 PESQ: PESQ behaves in a similar way to P.863 with scores tending to a MOS of 1.0 as
  loss rate increases to 30%.

Degradation: Long interrupts (e.g. VAD clipping, inter-system handovers)
  P.863 POLQA: Long interrupts describe muting of speech for 200 ms or more at the front, in the
  middle or the end of a speech sentence. Loss in the middle of speech leads to the largest drop in
  quality, followed by front-end loss with losses at the end of a sentence having least impact
  on score.
  P.862/P.862.1 PESQ: It has been claimed that PESQ reacts unexpectedly to lost speech. For small
  interrupts P.863 and PESQ produce consistent scores, but with longer interrupts PESQ predictions
  are significantly more optimistic than expected.

Table 2. Short summary of what can be expected from P.863 for certain types of
degradation compared to the PESQ processing model.
P.863 will return scores very similar to PESQ P.862.1 in the narrowband mode with simple codecs
such as G.711. Tests with more sophisticated codecs and transmission techniques may yield
different scores as P.863 addresses the objective assessment limitations of PESQ.
It is more difficult to compare P.863 super-wideband mode with PESQ P.862.2 because most
wideband experiments were performed with 16k sample rate material. P.863 super-wideband mode
requires a 48k sample rate reference file. The P.863 results with a 16k sample rate reference file or
an up-sampled 16k reference file will be wrong as there will not be any speech energy above 8 kHz.
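As an illustration of the sample rate requirement, the sketch below uses Python's standard wave module to check that a reference file declares a 48 kHz sample rate before it is used in super-wideband mode. The function name is hypothetical. Note that a header check of this kind cannot detect a file that was up-sampled from 16 kHz material, because such a file also declares 48 kHz; detecting that case would require inspecting the spectral content.

```python
import wave

# Sample rate required for a P.863 super-wideband reference file,
# as stated in the text above.
SWB_REFERENCE_RATE_HZ = 48_000


def has_swb_reference_rate(path: str) -> bool:
    """Return True if the WAV header declares a 48 kHz sample rate.

    This checks only the declared rate. It does not detect material that
    was up-sampled from a lower rate and so lacks energy above 8 kHz.
    """
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate() == SWB_REFERENCE_RATE_HZ
```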
1.3.2 PESQ and POLQA reliability

Generally, the voice quality measure produced by Nemo products (that is, a PESQ or POLQA score
mapped onto the P.800 scale) is an exact and reliable indicator of a network’s speech quality. However,
one very common mistake when interpreting PESQ results is to overestimate the accuracy of the
results. When presenting MOS-LQO values with a resolution of three decimals, one should always
be aware that the subjects in a listening test have zero decimals available for their votes. Without
any further uncertainties and other subjective factors, this would result in a theoretical maximum
resolution of about 0.03 points on the MOS scale if 30 subjects were participating (one vote step
divided by 30 subjects). In reality, the accuracy of a subjective test will be much worse. POLQA,
like PESQ, is trained on such data, and the resulting errors propagate into its predictions.
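The resolution figure follows from simple arithmetic: when every subject votes in whole grades, the mean can only change in steps of one grade divided by the number of subjects. A minimal Python sketch of this reasoning, with illustrative names, is:

```python
from statistics import mean


def mos_resolution(num_subjects: int, vote_step: float = 1.0) -> float:
    """Smallest possible change of the mean when one subject changes a
    vote by one step on the voting scale."""
    if num_subjects <= 0:
        raise ValueError("num_subjects must be positive")
    return vote_step / num_subjects


if __name__ == "__main__":
    votes = [4] * 30            # all 30 subjects vote 4
    changed = [5] + [4] * 29    # one subject votes 5 instead
    print(f"resolution with 30 subjects: {mos_resolution(30):.4f}")
    print(f"observed change in the mean: {mean(changed) - mean(votes):.4f}")
```

This is only the quantization limit of the averaging. It does not account for differences between listeners, which in practice dominate the uncertainty of a subjective test.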
Because GSM speech codecs (HR, FR, EFR) are based on vocoding, the theoretical
maximum quality on an EFR channel is 3.85 MOS. This result is based on a simple measurement: all
one needs is a software codec (such as GSM EFR) that encodes a PCM speech file and writes the
GSM-encoded bit-stream to a file, which is then decoded back to PCM. When the original and the
encoded-decoded files are given to the PESQ library, the PESQ score should be close to the
above-mentioned value. In real life, the sample also goes through DA and AD conversions, which
affect the score minimally.
1.3.3 PESQ and EVRC

Field tests and a study [10] made by Qualcomm indicate that PESQ is slightly biased against
EVRC-based codecs. The reason is believed to be limitations in PESQ’s time alignment and
psychoacoustic model, which cannot correctly treat the minimal changes caused by EVRC signal
processing technologies. Therefore, it is recommended to add an offset of 0.318 MOS to the average
score in both directions whenever one or both measurement ends use an EVRC-based codec.
EVRC family codecs are used widely in CDMA networks (not in WCDMA). When a CDMA terminal is
used for voice quality measurements, EVRC MOS compensation is used by default with Nemo
Outdoor and Nemo Invex if the used test algorithm is PESQ. It can be disabled from the advanced
properties of the test device.
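The compensation rule can be summarised in a few lines of Python. This is only a sketch of the rule as described here and in section 1.3.4 (the offset applies to PESQ only, and never to POLQA). The function and parameter names are hypothetical and do not correspond to actual Nemo settings, and the sketch assumes that the compensated value is not clamped to the top of the scale.

```python
# Offset recommended in the text for EVRC-based codecs measured with PESQ.
EVRC_PESQ_OFFSET_MOS = 0.318


def compensate_evrc(mos: float, algorithm: str, uses_evrc: bool,
                    compensation_enabled: bool = True) -> float:
    """Return the MOS with the EVRC offset added where the rule applies.

    The offset is added only for PESQ results where at least one
    measurement end uses an EVRC-based codec and compensation is enabled.
    POLQA results are returned unchanged.
    """
    if compensation_enabled and uses_evrc and algorithm.upper() == "PESQ":
        return mos + EVRC_PESQ_OFFSET_MOS
    return mos


if __name__ == "__main__":
    print(compensate_evrc(3.50, "PESQ", uses_evrc=True))
    print(compensate_evrc(3.50, "POLQA", uses_evrc=True))
```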
1.3.4 Test sample comparison

Table 3 presents PESQ results from a sample comparison measurement made in near-static
conditions in a live 3G network. The results indicate that the average scores of different Nemo
Outdoor test samples vary within a 0.26 MOS range.
An offset is calculated between each sample and sample 4s_m.wav, which gave the best average
score. There is also an offset for measurements made in a network that uses an EVRC codec. These
offsets are presented here only as examples, and the same kind of measurement has to be repeated
in each measured network to obtain correct results.
Note that the EVRC offset should not be used with POLQA, and therefore it is not available in Nemo
Outdoor when the selected measurement algorithm is POLQA.
Test sample  Count  Average MOS  Std deviation  Offset  Offset (EVRC)
4s_m.wav     5501   4.142        0.080          0.000   0.318
8s.wav       2855   4.087        0.079          0.055   0.373
10s.wav      738    4.008        0.086          0.134   0.452
VB5.wav      4466   4.136        0.096          0.007   0.325
Table 3. Test sample average scores with standard deviations as measured in near static
conditions. Note that calculated offsets are only informative and are not statistically
strong enough for score calibration in all networks.
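The offsets in Table 3 can be reproduced approximately from the average scores: each offset is the difference between the best average and the sample's own average, and the EVRC column adds the 0.318 MOS compensation on top. The Python sketch below, with illustrative names, recomputes them from the rounded averages in the table. For VB5.wav this gives 0.006 rather than the 0.007 shown, presumably because the table was calculated from unrounded averages.

```python
EVRC_OFFSET_MOS = 0.318  # EVRC compensation quoted in section 1.3.3

# Average MOS per test sample, copied from Table 3.
AVERAGE_MOS = {
    "4s_m.wav": 4.142,
    "8s.wav": 4.087,
    "10s.wav": 4.008,
    "VB5.wav": 4.136,
}


def sample_offsets(averages: dict) -> dict:
    """Return {sample: (offset, offset_with_evrc)} relative to the
    sample with the best average score."""
    best = max(averages.values())
    return {
        name: (round(best - avg, 3), round(best - avg + EVRC_OFFSET_MOS, 3))
        for name, avg in averages.items()
    }


if __name__ == "__main__":
    for name, (offset, evrc) in sample_offsets(AVERAGE_MOS).items():
        print(f"{name:10s} offset {offset:.3f}  offset (EVRC) {evrc:.3f}")
```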
Table 4 presents PESQ and POLQA maximums for AMR-12.2 codec along with sampling rate, length,
SNR and speech activities.
Sample filename                  Rate (Hz)  Length (s)  SNR (dB)  Speech act.  PESQ-NB  POLQA-NB  POLQA-SWB
4s_m.wav                         8000       4.000       36.027    76.7 %       4.30     4.50      -
8s.wav                           8000       8.000       35.650    75.1 %       4.30     4.36      -
10s.wav                          8000       9.990       39.219    64.0 %       4.34     4.42      -
AmEnglish_NB_m1s1_f2s2_6s.wav    8000       6.000       40.333    72.1 %       4.24     4.29      3.40
AmEnglish_SWB_m1s1_f2s2_6s.wav   48000      6.000       40.317    72.6 %       -        -         3.42
BrEnglish_NB_f1s4_m1s3_6s.wav    8000       6.000       37.945    67.5 %       4.15     4.32      3.41
BrEnglish_SWB_f1s4_m1s3_6s.wav   48000      6.000       37.948    67.6 %       -        -         3.39
German_NB_m2s1_f1s1_6s.wav       8000       6.000       41.057    63.6 %       4.03     4.33      3.21
German_SWB_m2s1_f1s1_6s.wav      48000      6.000       41.085    64.1 %       -        -         3.23
Italian_NB_f1s2_m1s2_6s.wav      8000       6.000       41.766    65.4 %       4.08     4.40      3.28
Italian_SWB_f1s2_m1s2_6s.wav     48000      6.000       41.764    65.8 %       -        -         3.34
Japanese_NB_f1s1_m1s1_6s.wav     8000       6.000       40.405    67.5 %       4.09     4.30      3.05
Japanese_SWB_f1s1_m1s1_6s.wav    48000      6.000       40.397    68.0 %       -        -         3.03
Russian_NB_f2s6_m1s3_6s.wav      8000       6.000       39.678    66.8 %       4.05     4.32      3.27
Russian_SWB_f2s6_m1s3_6s.wav     48000      6.000       39.695    67.3 %       -        -         3.24
Table 4. Test sample maximum scores for AMR-12.2 codec.
1.3.5 Some guidelines for evaluating the MOS

It does not matter whether some other tools claim to produce more precise or higher MOS scores
than the PESQ/POLQA implementation. Other tools may use a different MOS scale rather than the
official P.800 scale, in which case their results are not comparable with P.800 MOS-LQO mapped
results and can be compared only with other results produced with the same tool. In the end,
MOS simply means Mean Opinion Score and the term can be used freely with whatever test material,
so different vendors can use it in different ways. But when it comes to precision, compared with actual
tests made with human test subjects, there is no official algorithm that outperforms POLQA.
When interpreting PESQ/POLQA results it is important to know which version of the algorithm was
used. Although the narrowband and the wideband version of PESQ/POLQA will generate MOS scores
on the same five point scale, it is strictly forbidden to mix wideband and narrowband results. If
wideband networks have to be compared to narrowband networks, then the wideband version of
PESQ/POLQA has to be used in both cases. Note also that there is no sense in comparing PESQ
results to POLQA as the used mappings are different.
Generally, one can achieve PESQ scores of 3.8 to 4.2 MOS only in good laboratory conditions in a
narrowband network. In casual test drives, a good score would be between 2.5 and 4.0, and
normally listeners cannot tell the difference between 3.0 and 4.0 MOS. When the quality drops below
2.0, degradations become easily noticeable. The AMR codec may give up to 0.5 MOS lower scores
than GSM-EFR because it lowers the bitrate in good conditions, hence the codec’s name: Adaptive
Multi-Rate. However, if AMR has a fixed high bitrate, it can give scores greater than 4.0 MOS, even
4.3 MOS occasionally.
With POLQA, the scores achieved in good conditions are generally higher than PESQ scores for the
same samples. Additionally, the POLQA-WB maximum score has been increased to 4.75 MOS (it is
still 4.5 MOS for NB). As a result, POLQA can occasionally score 4.5 MOS in good conditions, which
never happened with PESQ.
As POLQA is the latest objective algorithm and also the latest ITU-T recommended method, it is the
best possible method to be used in objective end-to-end measurements. Being seamlessly
integrated into Nemo products like Nemo Outdoor, Nemo Invex and Nemo Server, POLQA-based
voice quality makes the voice quality testing easy and enables network-wide quality performance
evaluation as seen by the end-user.
1.3.6 PESQ and wideband AMR codecs

AMR-WB operates like AMR with nine different bit rates. The lowest bit rate providing excellent
speech quality in a clean environment is 12.65 kbit/s. Higher bit rates are useful in background
noise conditions and for music. Also lower bit rates of 6.60 and 8.85 kbit/s provide reasonable
quality especially when compared to narrow-band codecs. All modes are sampled at 16 kHz (using
14 bit resolution) and processed at 12.8 kHz.
When used in mobile phone networks, there are three different configurations (combinations of
bitrates) that may be used for voice channels:
• Configuration A: 6.6, 8.85, and 12.65 kbit/s (mandatory multi-rate configuration)
• Configuration B: 6.6, 8.85, 12.65, and 15.85 kbit/s
• Configuration C: 6.6, 8.85, 12.65, and 23.85 kbit/s
This limitation was designed to simplify the bit rate negotiation between a handset and a base
station, thus vastly simplifying the implementation and testing. All other bitrates can still be used
for other purposes in mobile networks, including multimedia messaging, streaming audio, etc.
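The three voice channel configurations can be represented as simple lookup data. The Python sketch below lists the nine AMR-WB bit rates as defined in the AMR-WB codec specification (the text above names only some of them) together with the configurations A, B and C from the list above. The names are illustrative only.

```python
# The nine AMR-WB codec modes in kbit/s (per the AMR-WB specification;
# only some of these are named in the text above).
AMR_WB_BITRATES_KBPS = (6.60, 8.85, 12.65, 14.25, 15.85,
                        18.25, 19.85, 23.05, 23.85)

# Voice channel configurations as listed in the text.
VOICE_CONFIGURATIONS = {
    "A": (6.60, 8.85, 12.65),          # mandatory multi-rate configuration
    "B": (6.60, 8.85, 12.65, 15.85),
    "C": (6.60, 8.85, 12.65, 23.85),
}


def allowed_on_voice_channel(configuration: str, bitrate_kbps: float) -> bool:
    """Return True if the bit rate belongs to the given configuration."""
    return bitrate_kbps in VOICE_CONFIGURATIONS[configuration.upper()]


if __name__ == "__main__":
    print(allowed_on_voice_channel("A", 12.65))   # part of configuration A
    print(allowed_on_voice_channel("A", 23.85))   # only in configuration C
```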
The basic P.862 model provides raw scores in the range of –0.5 to 4.5. The wideband extension to
P.862 includes a mapping function that allows linear comparisons with MOS values produced from
subjective experiments that include wideband speech conditions with an audio bandwidth of 50-
7000 Hz. This means that direct comparisons between scores produced by the wideband extension
ITU-T P.862.2 and scores produced by baseline P.862 or P.862.1 are not possible, due to the
different experimental context.
In any given experiment, the way that subjects interpret the 1 – 5 MOS voting scale will be
determined by the overall content of the experiment. For example, in a narrowband experiment a
high-quality speech signal with a bandwidth of 300 – 3400Hz is likely to produce a MOS value in the
region of 4.5, but in an experiment that also includes high-quality speech material with a bandwidth
of 50 – 7000Hz, the same file may produce a MOS value as low as 3. This effect is called
experimental context.
2 MEETING THE REQUIREMENTS WITH NEMO VOICE QUALITY
2.1 Overview
Nemo Voice Quality (VQ) is an option for several Nemo products, namely Nemo Outdoor, Nemo
Invex, Nemo Handy, Nemo Walker Air, Nemo Autonomous, and Nemo Server. Nemo VQ uses PESQ
and POLQA OEM libraries that are licensed from Opticom GmbH. The PESQ/POLQA scores are
calculated by comparing the original and the degraded sample files – this is the so called intrusive
testing method.
In mobile-to-mobile testing each score is a combination of both test terminals’ uplink and downlink
quality. In contrast to this, when testing against a fixed end, such as Nemo Server with PSTN or
ISDN, the server-side results present purely mobile uplink quality and the mobile-side results present
mobile downlink quality. Because PSTN and ISDN can be considered static, the quality by direction
can be isolated only in mobile-to-fixed testing.
2.2 Standard compliance

Nemo Voice Quality uses the algorithms specified in ITU-T recommendations P.862 [6] and P.863
[12], which are also known as PESQ and POLQA respectively. The PESQ/POLQA score is mapped onto
a MOS scale as specified in ITU-T recommendation P.800 [2], by using the mapping function
described in ITU-T recommendation P.862.1 [9]. PESQ replaces the older P.861 PSQM [5] algorithm,
which is now obsolete, and POLQA in turn replaces PESQ.
PESQ and POLQA are compatible with all known mobile technologies and codecs such as HR, FR,
EFR, and AMR (for PESQ, see the EVRC note in section 1.3.3).
Nemo Outdoor and Nemo Handy PESQ measurement systems have been measured against Malden
Electronics’ DSLA (Digital Speech Level Analyzer) in a static network, and the average scores and
standard deviations agreed to within 0.1 MOS.
Nemo Outdoor and Nemo Invex comply with the ITU-T Wideband extension Recommendation
P.862.2 for the assessment of wideband telephone networks and speech codecs. Additionally they
comply with P.863 in narrowband and wideband modes.
Measurement data includes information about the audio quality type within AQUL/AQDL events:
• 2 = PESQ NB (same as P.862.1 standard)
• 6 = PESQ WB (same as P.862.2 standard)
• 7 = POLQA NB (same as ITU-T P.863 standard)
• 8 = POLQA SWB (same as ITU-T P.863 standard)
Nemo products use WB assessment mode whenever a wideband audio sample file is used.
Detection is based on audio sample sampling rate: when sampling rate is >12kHz, WB assessment
is used.
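For post-processing of measurement data, the audio quality type codes and the sampling rate rule above could be handled roughly as in the following Python sketch. The function names are hypothetical and the code is not taken from any Nemo product. It only restates the codes listed for AQUL/AQDL events and the 12 kHz threshold given in the text.

```python
# Audio quality type codes as listed for AQUL/AQDL events.
AQ_TYPE_NAMES = {
    2: "PESQ NB",
    6: "PESQ WB",
    7: "POLQA NB",
    8: "POLQA SWB",
}

# Sampling rates strictly above this threshold select WB assessment.
WB_THRESHOLD_HZ = 12_000


def describe_aq_type(code: int) -> str:
    """Return a readable name for an audio quality type code."""
    return AQ_TYPE_NAMES.get(code, f"unknown ({code})")


def uses_wideband_assessment(sample_rate_hz: int) -> bool:
    """Apply the rule from the text: WB assessment when rate > 12 kHz."""
    return sample_rate_hz > WB_THRESHOLD_HZ


if __name__ == "__main__":
    for rate in (8_000, 16_000, 48_000):
        print(rate, "Hz -> wideband assessment:", uses_wideband_assessment(rate))
    print(describe_aq_type(7))
```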
There are a couple of exceptions: Nemo Server supports only narrowband PESQ and POLQA, as
there is no such thing as a wideband PSTN/ISDN line. Nemo Handy-S and Nemo Autonomous (S60
versions) support only PESQ in narrowband mode.
However, Nemo Server can relay VoIP calls that are made in POLQA wideband mode. It does
not measure these calls itself; it only transfers packets back and forth between the two VoIP
callers.
Please note also that it is the user's responsibility to check that the system being tested really is a
WB system, for example, by observing the uplink and downlink codecs in use.
2.3 Signal processing

Nemo Outdoor Voice Quality is based on realtime handling of audio signals on a test laptop by using
an ASIO-compatible advanced audio card (e.g., Terratec DMX 6Fire USB) as an audio I/O interface
or digitally via Nemo Media Router, when available.
Nemo Invex handles the audio traffic within its dedicated handset isolation module, which is
specifically designed for the task and calibrated for each mobile type to deliver maximum audio
performance.
Nemo products that utilize the Nemo Media Router (e.g., Nemo Handy-A and Nemo Outdoor)
handle the audio signals digitally, which eliminates the impairments caused by AD-DA conversions.
Nemo Outdoor handles resampling and rescaling, controls the transmission of test samples,
and implements mutual synchronization between the measurement ends.
The sound card performs rough line-level adjustments and AD-DA conversions. The sound card uses
at minimum a 48 kHz sampling rate and 16-bit precision. Some sound cards have 24-bit or higher
internal precision. Test samples provided with Nemo Outdoor have an 8 kHz sampling rate and 16-bit
precision for narrowband. For wideband, a 16 kHz sampling rate and 16-bit precision are required.
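To make the resampling and rescaling step concrete, the following is a deliberately simplified Python sketch of bringing an 8 kHz, 16-bit test sample to a 48 kHz interface (an integer factor of 6). It is an assumption-laden illustration, not the method Nemo Outdoor uses: linear interpolation stands in for the band-limited resampling filter a production system would normally apply, and the function names and the gain value are hypothetical.

```python
def upsample_linear(samples, factor):
    """Upsample by an integer factor using linear interpolation.

    Simplification: a real resampler would use a band-limited
    (low-pass filtered) interpolator instead of straight lines.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    if not samples:
        return []
    out = []
    for i in range(len(samples) - 1):
        a, b = samples[i], samples[i + 1]
        for k in range(factor):
            out.append(a + (b - a) * k / factor)
    # Hold the final input sample so the output has len(samples) * factor values.
    out.extend([samples[-1]] * factor)
    return out


def rescale_to_int16(samples, gain):
    """Apply a linear gain and clip the result to the signed 16-bit range."""
    out = []
    for s in samples:
        v = round(s * gain)
        out.append(max(-32768, min(32767, v)))
    return out


if __name__ == "__main__":
    narrowband = [0, 6000, 12000, 6000, 0]             # a few 8 kHz samples
    factor = 48000 // 8000                             # 8 kHz -> 48 kHz
    wide = upsample_linear(narrowband, factor)
    print(len(narrowband), "->", len(wide), "samples")
    print(rescale_to_int16(wide[:8], gain=0.5))        # hypothetical level adjustment
```

Clipping in rescale_to_int16 keeps the output inside 16-bit precision, matching the sample format described above.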
Nemo products can store received samples so that they can be played back later during measurement
file playback or analyzed further by an expert listener. Each sample file is named automatically
with the date, time, and MOS, which makes it easy to pick out the most interesting samples from
the recorded material later. A threshold setting also makes it possible to save only the samples
that score below a certain MOS.
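The naming and threshold logic can be pictured with a short Python sketch. The file name pattern below is purely hypothetical, since the whitepaper only states that the name contains the date, time, and MOS; the function names are likewise assumptions made for illustration.

```python
from datetime import datetime


def sample_filename(recorded_at: datetime, mos: float) -> str:
    """Build a file name containing date, time, and MOS.

    The exact pattern is an assumption; the actual Nemo naming
    scheme is not specified in the text.
    """
    return f"{recorded_at:%Y%m%d_%H%M%S}_MOS{mos:.2f}.wav"


def should_save(mos: float, threshold: float) -> bool:
    """Keep only samples that score below the configured MOS threshold."""
    return mos < threshold


if __name__ == "__main__":
    threshold = 3.0  # hypothetical threshold setting
    results = [
        (datetime(2014, 3, 5, 14, 7, 9), 3.46),
        (datetime(2014, 3, 5, 14, 7, 18), 2.71),
    ]
    for recorded_at, mos in results:
        if should_save(mos, threshold):
            print("save", sample_filename(recorded_at, mos))
        else:
            print("skip", sample_filename(recorded_at, mos))
```

Putting the MOS in the file name means a plain directory listing is enough to find the worst-scoring samples without opening the measurement file.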
2.4 Nemo Media Router (NMR)

Nemo Media Router is a proprietary communication interface and application from Anite Finland Ltd,
developed for Android-based smartphones. Through the NMR interface, Android-based smartphones
communicate with PC-based applications, such as Nemo Outdoor.
With NMR, PESQ/POLQA voice quality measurements can be done with Android-based smartphones
without any additional hardware, such as sound cards or Nemo Invex handset isolation modules.
This is highly beneficial because most new smartphones introduce interfering extra signals into the
audio path when the phone is, for example, charged or powered via a USB data cable. With NMR,
interfering noise issues can be completely avoided in voice quality measurements.
The NMR interface was initially designed for voice quality measurements, but the same interface can
later be used for other purposes as well, such as data testing.
In voice quality measurements, a smartphone records the received audio sample files and transfers
them via the NMR interface back to the host computer in real time. The host computer then
calculates the PESQ/POLQA results in the same way as with the sound card interface and writes the
values to the Nemo Outdoor log files. This requires no changes in PESQ and POLQA licensing, unlike
the sound-card-based solution.
NMR supports multiple phones connected simultaneously to one host. The target is to support six
or more phones per host.
The main advantages of NMR are:
• eliminates extra sound card HW and audio cabling
• decreases the complexity of the measurement system
• reduces weight and power consumption
• improves user experience and audio quality by eliminating the phone-generated noise
2.5 Voice quality testing procedure
In mobile-to-mobile testing the test terminals used in voice quality measurements can be connected
to separate laptops, attached to Nemo Multi or Nemo Invex, or they can be Handy-A units
measuring independently. One or both ends can also be Nemo Handy-A units or Nemo Server with
PSTN or ISDN test lines. All combinations work together.
The audio can be transferred via a sound card as pictured above, via handset isolation modules, or
via Nemo Media Router installed on the connected smartphone.
Test Procedure
1. Both measurement ends are configured to use the same reference sample.
2. First, the reference sample is resampled and rescaled to the nearest format supported by
the audio interface used.
3. Nemo Outdoor/Invex/Handy initiates the test mobile to make a test call to the other mobile.
4. After a voice connection is established, the mobile initiating the call (MO from now on)
starts sending the synchronization signal.
5. Call receiver (MT from now on) has already accepted the incoming call and starts receiving
audio as soon as voice is connected.
6. MO starts sending the reference sample after the synchronization signal has been sent
(the total synchronization length is 480 ms).
7. MT detects the synchronization signal and adjusts its timing so that it can receive the test
sample from the correct point of time.
8. MT receives the first test sample and calculates the PESQ/POLQA score.
9. MT starts sending the synchronization signal over to MO.
10. MO receives synchronization, adjusts its timing and level, and then receives its first test
sample.
11. After that, the test cycle continues as at the beginning, alternating between sending and
receiving until the call is finished.
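The alternating cycle in the steps above can be laid out as a simple timeline. The Python sketch below is an illustration only: the 480 ms synchronization length comes from step 6, while the 4000 ms sample length, the absence of any gap between transfers, and the function name voice_test_cycle are assumptions. Time spent on score calculation is ignored.

```python
from itertools import cycle, islice

SYNC_MS = 480  # total synchronization length, from step 6


def voice_test_cycle(sample_ms, transfers):
    """Yield (start_ms, sender, receiver, phase, duration_ms) events.

    The direction alternates MO -> MT, MT -> MO, ... and every transfer
    consists of a synchronization signal followed by the reference sample.
    Assumption: no idle gap between consecutive transfers.
    """
    t = 0
    directions = cycle([("MO", "MT"), ("MT", "MO")])
    for sender, receiver in islice(directions, transfers):
        yield (t, sender, receiver, "sync", SYNC_MS)
        t += SYNC_MS
        yield (t, sender, receiver, "sample", sample_ms)
        t += sample_ms


if __name__ == "__main__":
    # Hypothetical 4-second reference sample, four transfers in total.
    for start, sender, receiver, phase, duration in voice_test_cycle(4000, 4):
        print(f"{start:>6} ms  {sender} -> {receiver}  {phase:<6} {duration} ms")
```

Under these assumptions each direction is scored once per 8960 ms, because one full MO-to-MT and MT-to-MO round takes two synchronization signals plus two samples.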
3 SUMMARY
In short, Nemo Voice Quality provides certified measurement results and a consistent experience for
a wide variety of test devices, within and between all Nemo product platforms. Additionally, with
the advances in intrusive speech quality measurement brought by the POLQA algorithm, a system- and
device-neutral assessment of a wide variety of voice call applications is now possible, covering not
only cellular technologies but also data- and stream-based technologies. Finally, all of this,
combined with the superior NMR technology that eliminates the analogue path altogether,
makes Nemo Voice Quality the industry-leading speech quality measurement system.
4 REFERENCES
[1] Opticom GmbH, PesqOemLibManualV1.7.pdf
[2] ITU-T Recommendation P.800, Methods for subjective determination of transmission quality,
1996
[3] ITU-T Recommendation P.810, Modulated Noise Reference Unit (MNRU), 1996
[4] ITU-T Recommendation P.830, Subjective Performance Assessment of Telephone-Band and
Wideband Digital Codecs, 1996
[5] ITU-T Recommendation P.861, Objective Quality measurement of telephone-band (300-3400
Hz) speech codecs, 1996
[6] ITU-T Recommendation P.862, PESQ: an objective method for end-to-end speech quality
assessment of narrowband telephone networks and speech codecs, February 2001
[7] KARJALAINEN M., A New Auditory Model for the Evaluation of Sound Quality of Audio Systems,
Proc. of the ICASSP 1985, pp. 608-611
[8] Revised draft of Application Guide for P.863, ITU-T SG12, TD 851rev1 (GEN/12), June 2012
[9] ITU-T Recommendation P.862.1, Mapping function for transforming P.862 raw result scores to
MOS-LQO
[10] PESQ limitations for EVRC-based speech codec, Qualcomm, August 2007
[11] ITU-T Recommendation P.862.2
[12] Opticom GmbH, PolqaOemLibManualV1.4.pdf