
Voice over IP (VoIP) Speech Quality Measurement with Open-Source Software Components

Sebastian Schumann 1, Juraj Londák 1, Bastian Huntgeburth 2

1 Department of Telecommunications, FEI STU in Bratislava, Ilkovičova 3, Bratislava, Slovakia
2 Deutsche Telekom AG, Hochschule für Telekommunikation Leipzig, Gustav-Freytag-Straße 45, Leipzig, Germany
schumann@ktl.elf.stuba.sk

Abstract: This paper proposes an alternative to expensive means of VoIP speech quality measurement. While current applications and measurement devices on the market are very expensive, the authors propose a solution based on open-source components that allows the determination of the Mean Opinion Score (MOS) value according to the Perceptual Evaluation of Speech Quality (PESQ) test methodology.

Keywords: Voice over IP (VoIP), Quality of Service (QoS), Quality of Experience (QoE), Mean Opinion Score (MOS), Perceptual Evaluation of Speech Quality (PESQ), open-source, NGNlab, ELMAR, Croatia

I. INTRODUCTION

The NGNlab [1] network infrastructure of the Slovak University of Technology (STU) hosts several Voice over IP (VoIP) applications in virtual and stand-alone environments. The tasks of the laboratory infrastructure are to simulate a carrier-grade telecommunication environment (VoIP infrastructure, IP Multimedia Subsystem (IMS) infrastructure) and to allow tests and actual communication to be performed on this platform. To handle IP-based communication traffic, the infrastructure has to process signaling traffic with low resource demands (based on the Session Initiation Protocol (SIP)), but in many cases also resource-intensive media traffic, namely if no direct connection between the participants is possible (e.g., due to Network Address Translation (NAT)) or for media processing applications. While even thousands of call establishments per second can be managed with a scalable SIP proxy (e.g., OpenSIPS [2]), Real-time Transport Protocol (RTP) processing applications (e.g., Asterisk [3]) are very limited regarding the number of media streams and other tasks they can perform (e.g., conferencing, transcoding). To find these limits and a potential influence of the network and application architecture, measures have to be taken to determine impacts on the Quality of Service (QoS) the infrastructure can provide. This paper discusses how to measure the actual speech quality of those applications in an effective way. Due to budgetary limitations, commercial solutions such as [4] are not taken into account. Although the authors had the chance to test the mentioned equipment and obtained satisfactory results, the motivation remained to find a flexible method that can be used in education, without the need to buy or rent expensive equipment, and that still delivers satisfactory results.

II. THEORY

The QoS measurements referred to within this paper are based on the PESQ standard method from [5]. The recommendation describes an objective method for predicting the subjective quality of 3.1 kHz (narrow-band) handset telephony and narrow-band speech codecs. PESQ compares an input signal with a degraded signal. The difference between these signals is caused by the passage through a communications system. The following parameters, for example, can reduce the quality of the degraded signal:
- Network architecture
- System architecture
- Transcoding

The output of a PESQ measurement is a prediction of the perceived quality that would be given by persons in a subjective listening test. This perceived quality is ranked on the MOS scale, ranging from 1 (bad output) to 5 (excellent output). [6] describes details of the numeric representation of the perceived output quality and general methods for the subjective determination of the transmitted voice quality. Since a subjective test environment is not a good option for quickly testing communication environments, the PESQ method with its objective MOS value is a good alternative. The following sections do not describe how the MOS value is calculated in detail, but how to create an environment that can measure the speech quality achieved by any VoIP infrastructure. The goal is a fully automated environment that can easily be set up by anybody who wants to test a VoIP infrastructure, and that can be used for long-term testing, for creating statistics, and for determining potential weaknesses in the system setup.

III. PRACTICAL MEASUREMENT

This section discusses the components that are required for measuring the speech quality. Moreover, the tools and steps for retrieving the measurement results are explained.

A. Preparation

Setting up the environment requires some input and reference files. In addition, the signaling flow used during the measurement has to be known. First of all, the reference speech signal has to be selected and converted into a format that can be played back with exactly the same quality during all further measurements. The input files are reference speech examples from [7]. They follow the recommendations for measuring speech quality (e.g., short words and speaking pauses). To provide repeatable playback at identical quality, the Waveform Audio File Format (WAV) files from the recommendation have to be converted into packet capture (PCAP) format, which is the input format for the proposed call generator. For converting the files, each WAV file was played by an Asterisk extension that answers a phone call with the Playback() application. Meanwhile, the RTP stream was captured with Wireshark and saved in the PCAP format. This has to be done for each speech codec that should be played or transcoded during the measurement. The PCAP file is then played with SIPp (see [8]), the well-known open-source call generator.

The next step is to determine which logical end devices are involved, in order to create a proper flow of signaling messages. This mainly depends on the entire environment and on the test object. It has to be clear which SIP requests and responses have to be sent. This includes the registration and/or the authentication of every SIP INVITE message needed to establish a call successfully. This information is used to create a proper Extensible Markup Language (XML) scenario file for the SIPp user agent (UA). With both the input files and the SIP UA, the sender side is complete.

To evaluate the speech quality after the signal has passed the VoIP infrastructure, the PESQ implementation (distributed with [5]) is used. For the educational test environment, the source code has been compiled to an executable on the target system. The program compares the reference WAV file and the measured WAV file. All components are running on Debian Lenny.

B. Set up

The tools and applications used for the realization are all open-source and available for educational use. This section defines the system and the tools for building the measurement endpoints. The implementation of the call establishment is shown in Figure 1. The whole environment consists of one reference point that generates calls (sender), one end point that records the delivered speech and evaluates it with PESQ (receiver), and of course the system under test (SUT). The test can be applied to any VoIP infrastructure, e.g., a simple Private Branch Exchange (PBX), a media gateway, or a complete infrastructure with Session Border Controller (SBC), proxies and SIP application server.

Figure 1. Test setup
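
The media part of the SIPp XML scenario mentioned in the preparation step could look roughly like the fragment generated below. This is only a sketch under assumptions: the file names, the extension, the host address and the timing values are hypothetical, and the SIP messages themselves (REGISTER, INVITE, ACK, BYE) are omitted because they depend on the SUT. The `pause` element, the `play_pcap_audio` action and the `-sf`, `-m` and `-s` options are taken from SIPp; everything else is illustrative.

```python
import xml.etree.ElementTree as ET


def build_media_steps(pcap_path, beep_delay_ms, play_ms):
    """Build the media-related part of a SIPp scenario as an XML element.

    The SIP message exchange is intentionally left out of this sketch.
    """
    scenario = ET.Element("scenario", {"name": "PESQ reference call (sketch)"})
    # Wait until the receiver has played its beep before sending speech.
    ET.SubElement(scenario, "pause", {"milliseconds": str(beep_delay_ms)})
    # Replay the captured RTP stream of the reference speech file.
    nop = ET.SubElement(scenario, "nop")
    action = ET.SubElement(nop, "action")
    ET.SubElement(action, "exec", {"play_pcap_audio": pcap_path})
    # Keep the call open while the stream is being played.
    ET.SubElement(scenario, "pause", {"milliseconds": str(play_ms)})
    return scenario


def sipp_command(scenario_file, extension, sut_host):
    """Command line for placing exactly one measurement call (-m 1)."""
    return ["sipp", "-sf", scenario_file, "-m", "1", "-s", extension, sut_host]


# Hypothetical values, chosen only for illustration.
scenario = build_media_steps("g711a_A_eng_f1.pcap", 1000, 8000)
xml_text = ET.tostring(scenario, encoding="unicode")
command = sipp_command("uac_pesq.xml", "600", "192.0.2.10")
```

Generating the fragment with an XML library rather than by string concatenation keeps the scenario well-formed when file names change between codecs.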

The generator part is fulfilled by an instance of SIPp. It establishes calls according to the logic defined in the XML file and plays the prepared PCAP files. The call is sent through the SUT towards the PESQ evaluator. For simple measurements, SIPp is not used for stress testing, but only to place one quality measurement call at a time. An Asterisk running on Debian Lenny fulfills the part of the evaluator. It provides the needed functionality by default and only has to connect to the SUT, acting as the SIP UA that is called by the generator. This Asterisk is only used for the measurement task; it has no influence on the results.

C. Evaluation

For the evaluation of the speech quality, the PESQ algorithm compares the local reference WAV file with the WAV file recorded by the Asterisk. The difference between those files (i.e., the loss caused by transmission, transcoding, etc.) is calculated. The evaluation is triggered by a call from the generator's SIP user extension to a previously defined other endpoint. In the given example, an Asterisk installation in the NGNlab infrastructure is the SUT. The Asterisk on the receiving side does the following to evaluate the call quality:
- Answer the call and execute the recording function.
- Play a beep, which has to be taken into account on the generator's side as a delay before playing the reference speech file.
- Stop recording after a maximum duration.
- Execute a shell script that trims the recorded WAV file to a comparable length. SoX is used for this purpose.
- Run the PESQ algorithm to compare the local reference with the measured WAV file.
- Write all results into a text file for later automated processing.
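
The last three steps of the receiver-side procedure can be sketched as follows. The paper uses a shell script for this; the Python version below is only an illustration of the same idea. It assumes that the compiled PESQ executable is called `pesq` and accepts the sample rate as `+8000`, and that it prints a result line containing the raw score and the MOS-LQO score. The exact wording of that line depends on the version of the reference code, so the pattern, the file names and the semicolon-separated result format are assumptions.

```python
import re

# Assumed wording of the PESQ result line; adjust it to the build in use.
RESULT_PATTERN = re.compile(r"Prediction.*=\s*(-?\d+\.\d+)\s+(-?\d+\.\d+)")


def sox_trim_command(recorded, trimmed, start_s, length_s):
    """SoX call that cuts the recording to a length comparable to the reference."""
    return ["sox", recorded, trimmed, "trim", str(start_s), str(length_s)]


def pesq_command(reference, degraded, sample_rate=8000):
    """Call of the compiled PESQ program (binary name is an assumption)."""
    return ["pesq", f"+{sample_rate}", reference, degraded]


def parse_pesq_output(text):
    """Return (PESQMOS, MOS-LQO) from the program output, or None if not found."""
    match = RESULT_PATTERN.search(text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def result_line(reference, scores):
    """One line of the result file for later automated processing."""
    raw, lqo = scores
    return f"{reference};{raw:.3f};{lqo:.3f}"
```

The command lists can be handed to `subprocess.run` on the evaluator; keeping command construction and output parsing in separate functions allows both to be checked without a running call.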

The results from the PESQ evaluator include the PESQMOS and the MOS Listening Quality Objective (MOS-LQO) values (see footnote 2). Figure 2 shows the mapping function described in [10]. It rescales the PESQMOS value to a range from 1.02 to 4.56 in order to achieve a better correlation with MOS. The mapping function is implemented in the current C source code, which is provided with [5] and is also used as the basis for the calculations within this paper.
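
For reference, the mapping from the raw score to MOS-LQO can be written in a few lines. The sketch below uses the logistic function and the coefficients as published in ITU-T P.862.1 [10]; the function name is hypothetical, and the coefficients should be checked against the recommendation before the code is relied upon.

```python
import math


def pesq_raw_to_mos_lqo(raw_score):
    """Map a raw P.862 score (PESQMOS) to MOS-LQO (logistic mapping of P.862.1)."""
    return 0.999 + 4.0 / (1.0 + math.exp(-1.4945 * raw_score + 4.6607))


# Raw scores of the first scenario in Table I, mapped to MOS-LQO.
first_scenario_raw = [4.345, 4.365, 4.373, 4.273]
first_scenario_lqo = [round(pesq_raw_to_mos_lqo(x), 3) for x in first_scenario_raw]
```

Applied to the PESQMOS values of Table I, the function reproduces the listed MOS-LQO values up to rounding, which is a quick plausibility check for an evaluator installation.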

TABLE I. MEASUREMENT RESULTS

Scenario   Reference file   PESQMOS   MOS-LQO
1st        A_eng_f1         4.345     4.447
1st        A_eng_f2         4.365     4.461
1st        A_eng_m1         4.373     4.467
1st        A_eng_m2         4.273     4.394
2nd        A_eng_f1         3.495     3.547
2nd        A_eng_f2         3.700     3.817
2nd        A_eng_m1         3.714     3.835
2nd        A_eng_m2         3.491     3.541

Figure 2. Mapping function for PESQMOS to MOS-LQO (acc. [10])

D. Further Practice

With its psychoacoustic and cognitive model and its ability to handle varying delays, the measurement method based on the PESQ algorithm is capable of performing carrier-grade measurements in next generation networks. With the described open-source implementations and their set-up according to Section III-B, several interesting test scenarios are possible. The call generator can be placed in a customer's network or behind the access network of another carrier. Together with others, it could perform speech quality measurements over a longer period to identify problems with the provisioning of the service.

IV. EXAMPLE MEASUREMENT

The following example measurement has been performed within the NGNlab infrastructure. An environment has been set up as shown in Figure 1. The SUT was an Asterisk installation. The generator (SIPp) and the evaluator (Asterisk) both register as UAs on the SUT. In a first measurement, a normal call with the G.711a speech codec was performed. In a second measurement, the SUT was forced to transcode from G.711a to G.729 during the quality measurement. For the first measurement, the expected MOS-LQO value for G.711a was 4.4; for the second measurement, a MOS-LQO value of 3.8 was expected for G.729. Both scenarios have been carried out with two male and two female reference speech examples (American English from [7]) as recommended in [5]. The results are shown in Table I.

The results of both measurements are repeatable and match the expectations. A second series of measurements was intended to investigate the degradation of the voice quality in relation to the workload of the Asterisk PBX. While the first series was performed with no significant utilization of the SUT, the second series requires an almost completely utilized Asterisk. To put the SUT under high load, two additional SIPp UAs were added. They created continuous media stress on the Asterisk: SIPp played a PCAP RTP stream, which was transcoded to G.729 by the Asterisk in order to easily create a high CPU load. The measurement began with an idle system. Then, the number of simultaneous calls was raised in steps of 10 calls up to 100 simultaneous calls. At each load level, the quality of the delivered voice was tested with the two female and two male speech files mentioned before. Each file was played ten times, and the average of the resulting 40 MOS-LQO values was calculated. The whole procedure was carried out once without transcoding and once with transcoding (from G.711a to G.729). The transcoding was done by the SUT. Figure 3 shows the development of the MOS-LQO values in relation to the number of simultaneous calls.

Figure 3. MOS-LQO under varying load
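
The averaging used in the load series (four reference files, ten repetitions each, load raised in steps of 10 calls) can be sketched as follows. The `measure` callable stands for one complete PESQ measurement call and is hypothetical, as are the function names and the drop threshold. The stand-in at the end returns synthetic placeholder values and is not measured data.

```python
from statistics import mean

LOAD_STEPS = list(range(0, 101, 10))  # idle system, then 10 ... 100 calls
REFERENCE_FILES = ["A_eng_f1", "A_eng_f2", "A_eng_m1", "A_eng_m2"]
REPETITIONS = 10


def average_mos_lqo(measure, load):
    """Average of the 40 MOS-LQO values taken at one load level."""
    scores = [
        measure(load, reference)
        for reference in REFERENCE_FILES
        for _ in range(REPETITIONS)
    ]
    return mean(scores)


def first_drop(averages, threshold):
    """First load level whose average lies more than `threshold` below the idle value."""
    baseline = averages[min(averages)]
    for load in sorted(averages):
        if baseline - averages[load] > threshold:
            return load
    return None


def placeholder_measure(load, reference):
    """Synthetic stand-in for a real measurement call (not measured data)."""
    return 4.4 if load <= 50 else 3.7


averages = {load: average_mos_lqo(placeholder_measure, load) for load in LOAD_STEPS}
```

With a real `measure` function, the `averages` dictionary holds the data points for a graph such as Figure 3, and `first_drop` marks the load level at which the quality starts to degrade.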

Footnote 2: The PESQMOS value is the raw score in a range from -0.5 to 4.5. It is desirable, however, to have a MOS-LQO score (acc. [9]) to allow a linear comparison with MOS.

The graph shows that the voice quality remained constant up to approximately 50 concurrent calls. Beyond that, the quality of the plain G.711a call drops by about 0.7 points and then stays nearly constant at 3.6 MOS-LQO. The graph of the call transcoded from G.711a to G.729 looks similar: it also drops at approximately 50 calls, by approximately 0.6 MOS-LQO points. In the end, the Asterisk was so busy that no more calls could be established. The measurement shows that good speech quality could be guaranteed only for up to 50 simultaneous calls with this SUT setup.

V. CONCLUSION

Some example measurements have been made and documented. The results conformed to the expected values. An environment was tested under different workloads. The proposed measurement utilities as well as the complete system are very flexible and can easily be adjusted to various needs. The proposed method works with any audio codec that is supported by the Asterisk recording function on the receiver side. With SIPp, it is possible not only to generate calls for the QoS testing, but also to put any SUT under higher load. As a further improvement of the measurement system, a future extension towards bidirectional measurements is possible. Moreover, a graphical user interface (GUI) for user-friendly handling can be added. It could also be useful to connect one end to a Public Switched Telephone Network (PSTN) to enable the measurement of media gateways. Using scripts, the proposed step-by-step measurements can run automatically and create valuable output over a longer period of time.

ACKNOWLEDGMENT

The second author is also an affiliated member of the Institut für Telekommunikationsinformatik at the Hochschule für Telekommunikation (HfT) in Leipzig, Germany. STU and HfT have been working together on common topics in the area of Next Generation Networks (NGN) since 2006. This paper also presents some of the results and acquired experience from various research projects such as the NGNlab project [1], the European Celtic-EURECA project Netlab [11], the AV project "Converged technologies for next generation networks (NGN)" No. AV/4/0019/07, and the Slovak National basic research projects VEGA No. 1/0720/09 and VEGA 1/4084/07.

REFERENCES

[1] NGNlab - NGN laboratory at Slovak Technical University in Bratislava, http://www.ngnlab.eu
[2] OpenSIPS (Open SIP Server), a mature Open Source implementation of a SIP server, http://www.opensips.org
[3] Asterisk, Open Source telephony switching and private branch exchange (PBX) daemon, http://www.asterisk.org
[4] Radcom Network Testing Solutions, http://www.radcom.com
[5] ITU-T P.862, Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs, 2001
[6] ITU-T P.800, Methods for subjective determination of transmission quality, 1996
[7] ITU-T P.50, Artificial voices, 1999
[8] SIPp, a free Open Source test tool / traffic generator for the SIP protocol, http://sipp.sourceforge.net
[9] ITU-T P.800.1, Mean Opinion Score (MOS) terminology, 2006
[10] ITU-T P.862.1, Mapping function for transforming P.862 raw result scores to MOS-LQO, 2003
[11] NetLab - Use Cases for Interconnected Testbeds and Living Labs, http://www.celticinitiative.org/Projects/NETLAB/
