
Next-Generation Audio Networking Engineering

for Professional Applications


Abderrahmane SMIMITE 1 2 Ken CHEN 2 Azeddine BEGHDADI 2

1: Digital Media Solutions, 45, Grande Allée du 12 Février 1934, 77186, Noisiel, France
2: Université Paris 13, Sorbonne Paris Cité, Laboratoire de Traitement et Transport de l'Information (L2TI, EA 3043), F-93430, Villetaneuse, France

Keywords:
Audio Transport, Audio networking architecture, Interoperability, Multichannel audio, Spatial sound, Synchronisation

Abstract:
This paper presents an overview of the present and future of audio networking and its needs, and focuses on a lightweight generic architecture that we have tested on multichannel spatial sound streamed to multiple recipients, in the context of a 3D multimedia environment, and which should fit all the current and future requirements of professional audio applications.

I. Introduction

Audio transport is one of the most fundamental functional components when dealing with multimedia content. Since the first technologies handling audio and video, their transport has been a major and sensitive concern, and its importance keeps rising with recent applications that bring newer and higher requirements, such as 3D immersive multimedia environments. Considering that almost all audio content is migrating to digital today, with a particularly increasing need for transmission over long distances, sound transport has leaned toward networked digital solutions.

The major advantage of digital audio networks, in contrast to legacy analog point-to-point connections, is flexibility: analog signal distribution requires a separate physical channel for every signal, whereas, in addition to all the other benefits of digitisation, an audio signal transported over a network is available almost everywhere in the network. No re-routing or re-plugging of cables is needed. A further big advantage of networks is their scalability: if more bandwidth is needed, the network can easily be extended by adding more switches and connections.

This paper brings to light the current state and the future of audio networking by presenting the evolution of its needs, with a focus on the engineering features that a transport technology has to provide in order to match all the requirements. For that matter, a generic architecture is suggested which should guarantee a high level of interoperability.

The layout of this paper is as follows: Section 2 gives background information on the current and upcoming requirements for sound transmission, particularly for multichannel audio streams. An overview of the existing technologies is given in Section 3 through a comparison of a Layer 2 and a Layer 3 solution. In Section 4, a generic architecture for multichannel audio transport is suggested through an application to a 3D multimedia environment, which should suit most current and foreseeable audio applications. At last, Section 5 will present potential and future developments in audio networking.

II. Requirements

Media network requirements are generally application specific. This section will not describe the requirements down to the last detail but gives a general scope, with attention to the common basic components, while focusing on the new requirements of future applications.

A professional audio network should basically support synchronous transport of a high number of channels, with all the timing constraints involved, as well as remote control and monitoring of the devices handling those streams. We list here the main features that, in our opinion, a network engineer should take into consideration when developing an audio network:

A. Channels number

The most obvious example of an application requiring a high number of audio channels is public address: installations of up to 10,000 nodes are found in this area, and the same goes for the mixing console and conferencing markets.

Nowadays, new technologies are offering a new audio experience where listeners can enjoy a fully immersive 3D audio scene. The most promising techniques are Higher-Order Ambisonics (HOA) and Wave Field Synthesis (WFS), the most advanced audio spatialization techniques, which however require a high number of audio channels that sometimes reaches a few hundred (TU Berlin is equipped with an 832-channel WFS system) [1][4].

B. Timing & Synchronisation

Audio streams, like any other digital signals, are by nature time dependent, as sampling is the first step of digitisation. The timing aspect is even more critical for multichannel audio streams, where multiple channels are time-correlated in order to form auditory images (3D audio landscapes).

In fact, timing constraints are less obvious but more fundamental in spatial sound reproduction. In order to preserve the auditory image quality (i.e. how well the impression of spatialization is perceived), and particularly when channels are separated and streamed to different and spatially distant clients such as loudspeakers, the transport mechanism must maintain the phase relationship between the channels. The synchronisation drift cannot exceed a few microseconds in
order to not be perceived [6]. More precisely, AES11 states that the outputs of all equipment in an isochronous system must lie within ±5% of the sample period relative to the reference phase; for 48 kHz this represents about ±1 µs.

Latency is also a well-recognised issue when dealing with a network. Optimally, delay should be as low as possible, and preferably deterministic, to ensure network transparency. For broadcast and live performance, the maximum tolerable latency is around 2 ms, while general consumer applications can accept up to 50 ms.

C. Audio Quality & bandwidth

An audio network interface should be able to handle all the sampling rates used on the market: 44.1 and 48 kHz for general purposes, 88.2 and 96 kHz for studios, and even 192 kHz for probable future use. The same goes for bit depth: it should manage audio signals coded on 16 bits as well as 24 or even 32 bits.

Consequently, the bandwidth required for a single channel of uncompressed audio data will vary between 705.6 kbps and 6.144 Mbps, so it is only wise to consider the upper limit when building an audio transport technology meant to match all needs.

D. Reliability

Occasional loss or corruption of media packets, if not significant, can be tolerated in the general consumer market, whereas media networks involved in professional applications have high reliability requirements. Applications like conferencing and communication systems depend on the correct functioning of the networks carrying the audio-visual data. From a listener's point of view, subjective tests have revealed that audio packet loss begins to be bothersome when it exceeds 5% [7].

As mentioned in [1], the use of professional media networks might occasionally be extended to life-safety applications (distress signals, for instance). Therefore, higher constraints are imposed, which might even involve an additional security layer. The same goes for business conferencing, to ensure that such critical applications are safe from tampering.

Fault-tolerance is another aspect that a media network has to guarantee. The basic solution is usually redundancy, even if it is not the most cost-effective one: it doubles the physical resources and requires an additional mechanism to handle both streams, which may introduce additional delay. Depending on the application, redundancy can be avoided by using an audio-tailored Forward Error Correction (FEC) scheme, as suggested in [5], and so maximize network performance.

E. Scalability & Manageability

Professional media networks evolve continuously and generally involve devices from different manufacturers. It should be possible to add new devices without compatibility issues, and the devices must at least be able to handle audio streams and exchange basic information. The main interest of a networked solution, as stated above, is the ability to route any signal from any device to any other. In addition, the network should allow device monitoring, advanced control capabilities and Quality of Service management.

Supporting various proprietary technologies is and will remain a substantial issue. A comprehensive media network should be able to interconnect and support devices with vendor-specific features, but this will not always be feasible, for the simple reason that manufacturers will naturally try to keep their secrets from being revealed.

III. Existing Technologies

When we speak of a layer-specific technology, we refer to the seven-layer OSI model. Besides Layer 1 (L1) proprietary audio transport solutions, many Layer 2 (L2) and Layer 3 (L3) solutions have been introduced to the market. We will focus on the latter two, with a particular interest in AVB (L2) and RAVENNA (L3), as they represent the most advanced and open technologies of this class.

a. AVB

AVB stands for Audio Video Bridging, designed for real-time-sensitive multimedia content. It is a set of approved IEEE standards developed by the 802.1 standards committee. The protocol stack is shown in Figure 1.

Figure 1: Protocol Stack of an AVB endpoint

The AVB standard consists of the following components:
- IEEE 1722, a transport protocol that enables interoperable streaming by defining: (1) media formats and encapsulations, (2) media synchronization mechanisms, and (3) multicast address assignment [8].
- IEEE 802.1AS, which ensures timing and synchronization through a profile of IEEE 1588 (gPTP). An overview of PTP is given later in this paper.
- IEEE 802.1Qat, which defines the Stream Reservation Protocol (SRP), an enhancement to Ethernet implementing an admission protocol and a mechanism for end-to-end management of the streams to guarantee Quality of Service.
- IEEE 802.1Qav for traffic shaping.
- IEEE 802.1BA for system specification.

b. RAVENNA

RAVENNA is an open solution based on the Internet Protocol standard and thus can run on most existing managed networks. As an IP-based solution, it relies on protocols at or above Layer 3. RTP (Real-time Transport Protocol), widely used in numerous time-sensitive applications, is the protocol used for streaming the media content. It is used jointly with RTCP (Real-time Transport Control Protocol), which provides statistics and control information for RTP flows. RAVENNA also includes the RTSP/SDP protocols for communication control and
session management, and supports both DNS-SD and the ZeroConf mechanism for device configuration. It relies on PTPv2 (IEEE 1588-2008) to achieve node synchronization, and uses DiffServ as a QoS mechanism.

c. Comparison

The main difference between RAVENNA and AVB is that they are respectively Layer 3 and Layer 2 solutions: RAVENNA describes an Audio-over-IP (AoIP) technology, while AVB defines an Audio-over-Ethernet (AoE), IP-independent standard. The following table states the main differences; more details can be found in [3].

- Clock synchronisation: AVB uses IEEE 802.1AS gPTP (a subset of IEEE 1588-2008); RAVENNA uses IEEE 1588-2008 (PTP v2).
- Latency: AVB guarantees it (Class A: 2 ms, Class B: 50 ms, 7 hops maximum allowed); in RAVENNA it is configurable and network dependent (minimum possible ~1 ms).
- Media streaming: AVB uses IEEE 1722; RAVENNA uses RTP.
- Fault tolerance: not covered by AVB; RAVENNA supports full redundancy over 2 network interfaces (not mandatory).
- Configuration and control: AVB uses IEEE 1722.1 (Plug & Play support); RAVENNA is variable (static, ZeroConf, RTSP, ...).

The main issue with AVB is that it requires particular AVB-compliant switches, which makes it unusable on existing networks. It is also important to state that AVB is not yet a ready-to-use solution but rather a set of standards (a first and promising implementation of AVB has been introduced by XMOS recently, but it is still in a beta phase).

RAVENNA, on the other hand, relies on DiffServ for QoS, which is not a bandwidth allocation scheme, so no guarantee can be given that streams will always have the bandwidth they need for uninterrupted streaming. Therefore, an engineered network, thoroughly designed and maintained, is vital for a RAVENNA system. This is not the case for unmanaged consumer plug-and-play networks, which might be bothersome for some professional applications (a capability handled by AVB) [3].

IV. Application to 3D Immersive Environment

According to their specific application, users may choose between wired and wireless solutions. To date, wireless solutions do not match all the technical specifications for professional purposes, particularly bandwidth and synchronisation. A first ample wireless solution based on IEEE 802.11 is presented in [5], but the bandwidth required for transmitting a high number of high-quality audio channels is still an open issue. That is why we chose to work on a wired solution for our application, in addition to some electromagnetic issues that we may encounter in some environments.

Another aspect that we kept in mind is respecting a minimal-changes philosophy, by using minimalistic, light protocols that are compliant with the existing standards, so that timing constraints are easier to respect and interoperability with other systems is more feasible.

Our application consists of one multichannel audio source and multiple sinks, as shown in Figure 2.

Figure 2: Concept of our Application

Only the main aspects of the developed solution are presented in this section; more details will be available in an additional paper.

a. Using Ethernet

We can simplify the task of building an audio network by designing it around one of the many existing communication standards used by IT networks. Ethernet pops up as a natural choice, since it provides the best balance between high bandwidth (in its Gigabit version) and cost-effectiveness, compared to other technologies such as MADI.

To use Ethernet for transporting real-time audio information, it is required to either eliminate the causes of unpredictable behaviour or mitigate them with buffering and retransmission strategies on a well-known and mastered time base.

One major characteristic to deal with when transporting audio over Ethernet is packetization. Any scheme for moving digital audio over a packet-based network must pack audio data into a frame, transmit it, and then unpack it into its original form. However, a packetization strategy involves a number of trade-offs: to optimize bandwidth use, we need to maximize the ratio of payload data to header data by using the largest possible payload of 1500 bytes. But a single audio channel coded on 32 bits packed into such a frame would contain 8 milliseconds of material. Given the inevitability of buffering, this introduces a granularity that, possibly at multiple points in the transmission chain, would impose a significant delay on the audio path (many tens of milliseconds), which would not be acceptable for live broadcast, for example [2]. That is why a specific packetization scheme has been proposed.

b. Streaming

Since our system is built on UDP/IP for compatibility purposes, we chose to handle streaming with an RTP-like header, keeping mainly its critical elements: the timestamp, to guarantee time alignment at the receiver, and the sequence ID, to play packets in the correct order and detect missing ones. RTCP is used on a secondary port to gather network statistics and manage control information.

c. Clocks synchronisation

The major issue when using standard Ethernet networks for the distribution of digital audio signals is the distribution of the corresponding media clock across the network, considering their asynchronous nature.
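To see why dedicated clock distribution is unavoidable, consider how quickly two free-running device clocks leave the roughly ±1 µs phase budget discussed in Section II. The following back-of-the-envelope sketch is illustrative only and not from the paper; the ±50 ppm crystal tolerance is an assumed typical figure.

```python
# Illustrative sketch: time until two unsynchronised clocks drift apart
# by more than a given phase budget. 1 ppm of relative frequency error
# accumulates 1 microsecond of drift per second.

def time_to_exceed_budget(ppm_a: float, ppm_b: float, budget_us: float) -> float:
    """Seconds until two clocks with the given frequency errors
    (in parts per million) drift apart by more than budget_us."""
    relative_drift_us_per_s = abs(ppm_a - ppm_b)
    return budget_us / relative_drift_us_per_s

# Two nodes at opposite ends of an assumed +/-50 ppm tolerance band:
t = time_to_exceed_budget(+50.0, -50.0, budget_us=1.0)
print(f"budget exceeded after {t * 1000:.0f} ms")  # 10 ms
```

In other words, without a synchronisation protocol the phase budget is consumed within tens of milliseconds, which is why a mechanism such as the one described next is required.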
With the Precision Time Protocol (PTP) described in IEEE 1588, it is possible to synchronise distributed clocks with an accuracy better than 1 µs, which is sufficient for our application.

Figure 3: Principle of PTP

The synchronisation process, as illustrated in Figure 3, is achieved through the exchange of specific messages based on a Master-Slave principle: it includes the Master Sync message, the Master Delay Response message, and the Slave Delay Request message. In addition, the Best Master Clock (BMC) algorithm allows multiple Masters to negotiate the best clock for the network. Depending on the implementation, synchronisation accuracy can range from 10-100 µs down to 10-100 ns. During our tests, a software implementation (PTPd, an open-source daemon) was used and gave good results. More details can be found in the IEEE 1588-2008 standard.

d. Network monitoring and management

Device management has been done using a message-exchange technique over the RTCP channel. We are leaning toward an XFN-based device description to ensure more interoperability (XFN is an IP-based peer-to-peer protocol for control, configuration, monitoring and connection management of networked devices) [9].

e. Ensuring Reliability

Today's networks are highly reliable but cannot yet guarantee totally fail-free performance. We have been working on an audio-specific FEC mechanism to increase the network performance. In case this system is used in a latency-sensitive application, we suggest the use of redundant streams with the same timestamping, so that the receiver can handle the samples properly.

f. Deployment

Network deployment in a 3D multimedia environment, as in many other professional applications, can be a tricky business. The ring and daisy-chain topologies, even if they use the least cable, are not always the safest choice. The tree topology then comes as a natural choice, for obvious reasons. We therefore worked on a specific algorithm for network component placement that minimizes cable length and simplifies the installation. Using an L3 solution also allows the use of existing networks and coexistence with other IT traffic.

g. Emulation

We have developed a full software emulator, using exclusively standardized protocols for interoperability, with the following architecture:
- A Layer 3 solution based on UDP/IP,
- An additional header added to the audio packets (inspired by RTP, compatible but lighter) to ensure the correct packet order and detect packet loss,
- A PTP daemon to handle streamer/receiver synchronization,
- An application layer that comprises:
  o A multichannel audio core handler,
  o An audio-specific FEC mechanism,
  o Monitoring, configuration and control messages exchanged via RTCP.

Tests have been conducted using several configurations of multichannel streams, and the results have been validated using a spatial extension of the PEAQ measurement method.

V. Conclusion & Perspectives

We are already witnessing a preview of the next-generation audio applications through the emerging 3D multimedia technologies, with newer and higher requirements: more channels with higher audio quality to provide a better and newer multimedia experience. On the networking side, this inevitably involves the usage of more bandwidth and new transmission protocols.

The main downside of the existing technologies is their lack of compatibility with each other: different architectures (L1, L2 or L3) with more or less proprietary protocols. Some solutions even require specific switches to function properly.

The simplest way to ensure interoperability is for all manufacturers to agree on a common basis, while keeping the customized additional information as optional extensions that can be ignored or overridden, in order to guarantee a minimal level of compatibility for safe-mode functioning. For instance, a first step is to perform network management and device control using XFN.

Network performance will continue to increase, and soon even highly reliable wireless multichannel audio transport will be possible (using 802.11ac or SuperWifi). Coexistence with IT traffic, plug-and-play capability at a higher level, and Internet bridging for transmedia applications are potential enhancement tracks that have yet to be investigated thoroughly.

VI. References

[1] Jeff Berryman, "Technical Criteria For Professional Media Networks", 44th AES Conference, San Diego, 2011.
[2] Patrick Warrington, "Digital Audio Networking", Broadcast Engineering Magazine, 2003.
[3] Axel Holzinger and Andreas Hildebrand, "Real-time Linear Audio Distribution Over Networks: A Comparison Of Layer 2 And 3 Solutions Using The Example Of Ethernet AVB And RAVENNA", 44th AES Conference, San Diego, 2011.
[4] Frank Melchior and Sascha Spors, "Spatial Audio Reproduction: From Theory to Production", 128th Convention of the AES, London, 2010.
[5] Seppo Nikkilä, "Introducing Wireless Organic Digital Audio: A Multichannel Streaming Audio Network Based On IEEE 802.11 Standards", 44th AES Conference, San Diego, 2011.
[6] M. Rautiainen, H. Aska, T. Ojala, M. Hosio, A. Mäkivirta and N. Haatainen, "Swarm Synchronization For Multi-Recipient Multimedia Streaming", IEEE ICME, 2009.
[7] Gillian M. Wilson and M. Angela Sasse, "Investigating the Impact of Audio Degradations on Users: Subjective vs. Objective Assessment Methods", Proc. OZCHI 2000, Sydney.
[8] Robert Boatright, "Understanding IEEE 1722: AVB Transport Protocol (AVBTP)", IEEE 802.1 Plenary, March 2009.
[9] Universal Media Access Networks GmbH (UMAN), "XFN Specification Version 1.0", August 2009.
