
Georg-August-Universität Göttingen
Zentrum für Informatik

ISSN 1612-6793
Nummer ZFI-BM-2007-41

Bachelor's thesis in the degree programme "Angewandte Informatik" (Applied Computer Science)

Performance Evaluation of a
Novel Overlay Multicast Protocol

David Weiss

Forschungsgruppe für Computernetzwerke (Computer Networks Group)

Bachelor's and Master's theses of the Zentrum für Informatik
at the Georg-August-Universität Göttingen
15 November 2007

Georg-August-Universität Göttingen
Zentrum für Informatik
Lotzestraße 16-18
37083 Göttingen
Germany
Tel.:  +49 (5 51) 39-1 44 14
Fax:   +49 (5 51) 39-1 46 39
Email: office@informatik.uni-goettingen.de
WWW:   www.informatik.uni-goettingen.de

I hereby declare that I have written this thesis independently and that I have used no sources or aids other than those indicated.
Göttingen, 15 November 2007

Bachelor thesis

Performance Evaluation of a
Novel Overlay Multicast Protocol
David Weiss
2007/11/15

Supervised by Prof. Dr. Xiaoming Fu


Computer Networks Group
Georg-August-Universität Göttingen

Abstract

The demand for high-bandwidth media streaming over the Internet is growing. For large
groups of receivers, media streaming places a heavy burden on the network. IP Multicast
can alleviate this problem, but it is not widely deployed. In recent years, application layer
multicast and overlay multicast have been proposed as alternatives. However, there are still
concerns about the efficiency, scalability and deployment of these architectures.
In this thesis, a novel application layer multicast approach, called the Dynamic Mesh-based Overlay Multicast Protocol (DMMP), is evaluated. DMMP establishes an overlay network core consisting of super nodes, which are end-hosts with particularly high capacities.
Each super node manages a cluster of non-super nodes. We use network simulations to analyze the performance of DMMP. For that purpose, we have implemented a DMMP module
in OverSim, an overlay network simulation framework based on OMNeT++.
We compare DMMP with NICE, a well-known application layer multicast protocol that is
claimed to achieve low link stress and low control overhead. We experiment with groups of
up to 2048 members.
Our results indicate that DMMP can achieve comparable service quality with less control
overhead, and that DMMP has the potential to scale to a high number of receivers.
Keywords: multicast, application layer multicast, overlay, media streaming, network simulation

Contents

1 Introduction
  1.1 Requirements for multicast architectures
  1.2 Thesis contribution
  1.3 Organization of this thesis

2 Related work
  2.1 Network layer multicast
  2.2 Application layer multicast
      2.2.1 Advantages and concerns
      2.2.2 Basic concepts
      2.2.3 Categories
      2.2.4 Host Multicast Tree Protocol (HMTP)
      2.2.5 Narada
      2.2.6 NICE
      2.2.7 Two-tier Overlay Multicast Architecture (TOMA)
      2.2.8 Similar protocols
  2.3 Dynamic Mesh-based Overlay Multicast Protocol (DMMP)
      2.3.1 Overview
      2.3.2 Details

3 Implementation
  3.1 Network simulation basics
      3.1.1 Common network simulators
      3.1.2 nsnam
      3.1.3 OMNeT++
      3.1.4 Overlay network simulators
  3.2 DMMP simulation
      3.2.1 Extending OverSim
      3.2.2 Components
      3.2.3 Overlay construction
      3.2.4 Join algorithm
      3.2.5 Data delivery
      3.2.6 Handling inactive non-super nodes
      3.2.7 Handling inactive super nodes
      3.2.8 Self-improvement
      3.2.9 Summary

4 Performance evaluation
  4.1 Evaluation methodology
      4.1.1 Performance metrics
      4.1.2 Simulation scenarios
      4.1.3 Expected results
  4.2 Results
      4.2.1 Typical scenarios
      4.2.2 Comparison with NICE

5 Conclusion
  5.1 Lessons learnt
  5.2 Conclusions
  5.3 Future work

Bibliography

1 Introduction
Multicast is the delivery of a message to several destinations. On the Internet, multicast
can be achieved by sending a distinct message to each individual destination. However, if
there are many destinations, then this approach imposes a high load on the network and the
source. Due to a growing number of applications for multicast, network layer multicast was
introduced in the late 1980s [3]. In this architecture, the source sends a single packet
to a multicast address. The routers replicate that packet and deliver it to each destination.
That way, redundant transmissions can be avoided.
Deployment of network layer multicast requires changes to every router. In addition,
there are concerns about the scalability of network layer multicast, and a number of technical issues. Today, network layer multicast has still not been widely deployed. Nevertheless,
there is a growing demand for efficient multicast service. In particular, a scalable architecture for multimedia streaming applications like IP television is needed. Consequently,
application layer multicast has been proposed as an alternative to network layer multicast
in recent years. In this architecture, the participating end-hosts are organized into a (virtual)
overlay network. Typically, the topology of the overlay network is a mesh. Messages are
distributed via a subtree of the mesh. That means, some of the destinations forward the
received messages to other destinations. Application layer multicast is transparent to the
routers; that is, changes to the routers are not necessary. Therefore, application layer multicast can be deployed much more easily than network layer multicast. However, it is questionable
whether application layer multicast can achieve comparable efficiency, especially for large multicast groups. One important problem with application layer multicast is the maintenance of
the overlay network. As end-hosts can leave the multicast group at any time, overlay links
have to be added and dropped constantly. A variety of concepts has been proposed to tackle
this problem, and to improve the overall performance of application layer multicast. Some
approaches establish an overlay core consisting of statically placed network infrastructure.
This is often referred to as "overlay multicast". Simulation experiments in [22] indicate that
overlay multicast can achieve a performance that is comparable to network layer multicast.
On the other hand, it is costly to deploy the overlay core, and it also lacks flexibility.
This thesis studies a Dynamic Mesh-based Overlay Multicast Protocol (DMMP). DMMP
has been specified in an Internet draft [25], which is currently under revision. In DMMP,
the overlay core consists of end-hosts with especially high capacities, called super nodes.
Ideally, super nodes should have more bandwidth than other end-hosts, and they should
be more stable. While non-super nodes are organized in clusters with a tree topology, the


overlay core is a mesh. That way, DMMP creates a stable, efficient overlay core without
using static infrastructure. We compare DMMP with NICE [7], another application layer
multicast protocol. NICE organizes the end-hosts into a hierarchy. Simulation experiments
indicate that NICE scales to large groups and imposes a relatively low load on the network.
Initial mathematical analysis [24] suggests that DMMP can achieve slightly better performance than NICE. As DMMP and NICE are complex protocols, a theoretical analysis has to
be based on a number of assumptions which may or may not hold in practice. Then again,
it is also difficult to analyze the behavior of DMMP for thousands of network nodes using
testbeds.
We use network simulations to evaluate the performance of DMMP and compare it to
the performance of NICE. For that purpose, we incorporate a DMMP implementation into
the OverSim [9] network simulation framework, which is based on the OMNeT++ Discrete Event Simulation System [33]. As we do not have access to a simulation implementation of NICE, we have to rely on reproducing the simulation setup reported in [7].

1.1 Requirements for multicast architectures


For the evaluation of DMMP, suitable performance metrics have to be chosen. Therefore, it is
important to be clear about the requirements of the applications that DMMP targets. DMMP
is designed for large-scale, high-bandwidth multimedia streaming applications, such as IP
television. Below, we describe the properties of this type of application, roughly following
[26]:

- There is a single source that multicasts data to several destinations. The data is sent simultaneously; that is, we consider "live" streaming rather than video on demand.
- There could be thousands, tens of thousands, or even hundreds of thousands of destinations, distributed over the Internet. That means the multicast group is large, but sparse.
- The source sends data at a medium (e.g. 128 kilobits per second) or a high (at least one megabit per second) constant bitrate.
- There are hard real-time constraints. That means a packet that arrives after its scheduled playback time is useless. Nevertheless, latency and jitter can be tolerated to a degree because the data are buffered at the destinations.
- Packet losses can be tolerated, but they affect the service quality. The term "service quality" does not refer to resource reservations ("Quality of Service"), but to, for example, video quality.
We can deduce the following requirements for multicast architectures:

- It goes without saying that the multicast architecture needs to be deployable. Dedicated infrastructure or changes to routers can be problematic.
- The resource consumption needs to be low. As the source sends at a high bitrate, redundant transmissions should be avoided.
- Few packets should be lost or delayed, so that a high service quality can be achieved.
Based on [6] and the above considerations, we briefly introduce the performance metrics for
DMMP:

- The control overhead is the amount of traffic that is caused by establishing and maintaining the multicast tree.
- The stress of a link is the number of identical data messages sent over that link. If the stress of a link is greater than one, there are redundant transmissions.1
- The stretch of a member refers to the length of the data path from the source to the member, compared with the length of the direct unicast path.
- The loss rate is the ratio of the number of lost packets to the number of packets that should have been received.
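As a toy illustration of stress and stretch (hypothetical topology and node names, not the simulation setup used later in this thesis), overlay edges can be mapped onto the physical paths they traverse:

```python
from collections import Counter

# Physical unicast routes between hosts, via routers r1 and r2.
# Made-up toy topology for illustration only.
unicast_path = {
    ("S", "A"): ["r1"],        # S - r1 - A
    ("S", "B"): ["r1", "r2"],  # S - r1 - r2 - B
    ("A", "B"): ["r1", "r2"],  # A - r1 - r2 - B
}

def physical_links(src, dst):
    """Expand a host-to-host route into its physical links."""
    hops = [src] + unicast_path[(src, dst)] + [dst]
    return list(zip(hops, hops[1:]))

def link_stress(overlay_edges):
    """Count identical data copies crossing each physical link."""
    stress = Counter()
    for u, v in overlay_edges:
        stress.update(physical_links(u, v))
    return stress

# Naive approach: the source S unicasts to A and B separately.
# Link (S, r1) then carries two identical copies: stress 2.
naive = link_stress([("S", "A"), ("S", "B")])

# Overlay tree: S sends to A, A forwards to B. Stress drops to 1
# everywhere, but B's data path becomes longer than the direct
# unicast path, i.e. B's stretch exceeds 1.
tree = link_stress([("S", "A"), ("A", "B")])
path_len_B = len(physical_links("S", "A")) + len(physical_links("A", "B"))
stretch_B = path_len_B / len(physical_links("S", "B"))
```

The example shows the trade-off the metrics capture: the tree removes the redundant transmission but pays for it with a stretch of 5/3 at member B.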

1.2 Thesis contribution


Our contribution is as follows:

- We introduce some of the previously proposed multicast architectures and point out their shortcomings.
- We implement a DMMP module for OverSim/OMNeT++ that can be used for further analysis of DMMP in the future. For lack of time, we do not implement all features of DMMP.
- We analyze the scalability of DMMP using network simulations. We measure the control overhead, stress, data path length and packet loss for groups of up to 2048 members.
- Finally, we compare DMMP with NICE.

1.3 Organization of this thesis


In Chapter 2, related work is investigated. This should mostly explain where DMMP comes
from. Some of the presented concepts are also reused by DMMP. We discuss network layer

1 They are redundant because network layer multicast could avoid them.


multicast briefly, and concentrate on application layer multicast. Finally, DMMP is introduced. In Chapter 3, we point out the benefits of network simulations, both in general and specifically for the analysis of DMMP. After that, the implementation of the
DMMP module is documented. We focus on describing the exact algorithms, deviations
from [25], and application-specific parameters. OverSim and OMNeT++ are described as
well. The evaluation methodology of the simulation experiments is described in Chapter 4.
We present our results, describe their implications, and try to explain them. In Chapter 5,
we summarize our experiences with the network simulator, draw our conclusions, and list
possible future work.

2 Related work
This chapter gives an overview of previously proposed multicast architectures. We start
with network layer multicast (Section 2.1). Then application layer multicast approaches, including DMMP, are described in Sections 2.2 and 2.3.

2.1 Network layer multicast


For lack of space, we do not describe the network layer multicast routing protocols in detail. Instead,
we only briefly discuss the benefits and shortcomings of this approach.
The network layer can provide multicast service with the least possible number of data
packet transmissions. At the same time, minimal latency can be achieved. However, note
that DVMRP, as well as other routing protocols based on the reverse path multicasting algorithm, generates some redundant transmissions. Furthermore, CBT uses a shared tree,
meaning that it does not achieve minimal latency. Nevertheless, the efficiency of network
layer multicast is alluring.
[3] argues that, by Internet standards, multicast is an old concept. Compared with, for instance, the World Wide Web, the deployment of network layer multicast has been very slow
so far. This indicates that the deployment is problematic in principle. There are a number
of possible technical reasons for this. It is difficult to implement higher layer functionality
such as congestion control, reliability, security or flow control on top of network layer multicast. Furthermore, network layer multicast requires routers to maintain state information
about other routers. This increases the memory consumption at each router and the per hop
computation time. In general, it is difficult to aggregate multicast addresses in the routing
tables, which leads to large routing tables. The usage of class D addresses introduces some
additional problems, such as address collisions and access control. As for the scalability of
network layer multicast: in DVMRP and MOSPF, each router stores state about O(R · S)
other routers, where R is the number of routers with attached group members, and S the
number of routers with an attached multicast source. Due to this problem and other issues,
such as the dependence on specific unicast routing protocols, DVMRP and MOSPF are not
used for network layer multicast routing anymore. CBT reduces the state kept by routers by
using a shared multicast tree. However, routers still maintain per group state. In addition,
the traffic concentration around the core is problematic. Some of these arguments also apply to PIM, which is the prevalent network layer multicast routing protocol today. There are
also concerns about the scalability of inter-AS routing protocols [22].


2.2 Application layer multicast


To address the shortcomings of network layer multicast, application layer multicast (ALM)
has been introduced in recent years. In the following sections, ALM approaches will be further discussed. Publications on this subject usually distinguish overlay multicast from ALM
[22][24]. In both cases, the construction of an overlay network plays an important role, but
for overlay multicast there is an overlay network core consisting of statically placed infrastructure nodes, called proxies. However, participating end-hosts still use an ALM protocol.
Therefore, the term "ALM" is used in a wider sense, including proxy-based architectures, in
this thesis. We use the term "peer-to-peer" when explicitly talking about ALM without such
proxies.
First, we will introduce some general arguments for and against ALM in comparison with
network layer multicast in Section 2.2.1. Then, important concepts that all or most ALM
architectures have in common will be discussed in Sections 2.2.2 and 2.2.3. Some specific
solutions (HMTP, Narada, NICE and TOMA) will be highlighted in Sections 2.2.4 to 2.2.7
to illustrate these concepts. Additional proposals will be briefly described in Section 2.2.8.
Finally DMMP will be described in Section 2.3.

2.2.1 Advantages and concerns


This section tries to explain why ALM is believed to be a promising alternative to network
layer multicast and what the major concerns are.
In the layered network design, a fundamental question is, "Which functions should each
layer provide?". For the Internet, where only the lower three layers are implemented in
routers, this corresponds to the question, "Which functions should be provided by each
router, and which ones only by end-hosts (end-to-end)?". [31] argues that if the lower layers
do not have enough information to provide some additional function f (e.g. multicast), f
will have to be implemented on the higher layers redundantly. Even in this case, implementing f on the lower layers may improve the performance. However, f should be implemented on a lower layer only if the gain in performance is significant. [31] also suggests considering higher layer implementations that do not require f. They may incur a performance penalty because the lower layer offers unnecessary functionality.
It is questionable whether network layer multicast can provide, at a reasonable cost, all the multicast functionality that applications desire. Consider a multimedia streaming application that lets the user pause and resume the playback. This is much easier to implement on the application layer (by having each end-host cache received data for a while; OverCast [18] allows this, for example). In general, ALM can be flexibly customized to meet the requirements of a specific application [13]. For ALM, the performance compared with network
layer multicast is a major concern. High performance means efficient data delivery (few redundant transmissions and low control overhead) and high service quality (low loss rate and low latency). In ALM, some redundant transmissions are inevitable. Furthermore, end-hosts have little knowledge about the underlying network topology. Without such knowledge, data delivery paths may be longer compared with network layer multicast. Nevertheless, it is possible to obtain some information about the underlying topology. For example, the end-to-end delay between two hosts can be determined by measuring packet round trip times. However, such a method is not reliable and generates high control overhead as well. Additionally, end-hosts are not as stable as routers. End-hosts leaving the multicast group may partition the data delivery tree. Then other end-hosts cannot receive data until the tree is repaired. Detecting and repairing these partitions increases the control overhead.

[Figure 2.1: Example of an overlay topology]
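The round-trip-time probing mentioned above is usually smoothed over several samples to cope with its unreliability. A minimal sketch, using a TCP-style exponentially weighted moving average over made-up probe values (none of the protocols discussed here prescribes this exact estimator):

```python
def ewma_rtt(samples, alpha=0.125):
    """Smooth noisy RTT probes with an exponentially weighted
    moving average; the one-way delay between two hosts is then
    approximated as half the smoothed round trip time."""
    srtt = samples[0]
    for s in samples[1:]:
        srtt = (1 - alpha) * srtt + alpha * s
    return srtt

# Made-up probe results in milliseconds; one congestion outlier.
probes_ms = [48.0, 52.0, 200.0, 50.0]
srtt = ewma_rtt(probes_ms)
one_way_ms = srtt / 2
```

The outlier still pulls the estimate up noticeably, which illustrates why repeated probing is needed and why it adds control overhead.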
Apparently, the performance-related aspects require careful consideration. For now, we
can state that, so far, one design principle of the Internet has been to implement relatively
little intelligence in the network, so that most functions (for example, congestion control)
are provided only by end-hosts. This end-to-end principle has worked well in the past.

2.2.2 Basic concepts


ALM does not require changes to routers. Instead, multicast functionality is performed in
the application layer on top of any transport protocol, depending on the application's requirements. The group members are end-hosts and act like routers; that is, they replicate
received data and forward it to other end-hosts via unicast [22]. They form a (virtual) overlay network as illustrated in Figure 2.1. The underlying network is composed of routers
(squares) and end-hosts (A, B, . . ., H), connected by physical links (full lines). The overlay
network consists of end-hosts and overlay links (dotted lines). Assume that all end-hosts
are members of the same (application layer) multicast group. Overlay links indicate which


members directly interact with each other, and each overlay link corresponds to a path in the
underlying topology. The construction of the overlay topology is an important task of ALM
protocols. [6] distinguishes the data topology from the control topology. The data topology
determines who receives data from whom. In Figure 2.1, the link from A to E means that E
receives data from A. The data topology is a tree.1 Additionally, members exchange control
messages regularly, mostly for maintaining the data topology. Unlike routers, end-hosts may leave the group at any time. An end-host that leaves after a short time is called a
transient node. Ideally, a leaving member should notify some remaining members about its
departure; we refer to this as graceful leaving. However, members may also leave without
notice, for example due to a software crash; we call this ungraceful leaving. In general, we refer to membership changes (joins and leaves) as churn. Either way, the data topology is likely
to become partitioned when members leave. Control messages are needed to establish and
improve the data topology as members join, and to detect and repair partitions. Links in the
control topology indicate exchange of control messages. Usually the control topology is a
mesh and the data topology is a subtree of the control topology.
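To make this relation concrete: running a shortest-path computation over the control mesh yields parent pointers that form a source-specific data tree, a subtree of the mesh. A minimal sketch with made-up members and delays (mesh-first protocols such as Narada derive this tree distributedly with a routing protocol, not with a centralized computation):

```python
import heapq

# Control mesh: member -> {neighbor: measured delay in ms}.
# Hypothetical members and delay values for illustration.
mesh = {
    "S": {"A": 10, "B": 40},
    "A": {"S": 10, "B": 15, "C": 20},
    "B": {"S": 40, "A": 15, "C": 30},
    "C": {"A": 20, "B": 30},
}

def shortest_path_tree(source):
    """Dijkstra over the control mesh; the resulting parent
    pointers form the source-specific data delivery tree,
    which is by construction a subtree of the mesh."""
    dist = {source: 0}
    parent = {}
    pq = [(0, source)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in mesh[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                parent[v] = u
                heapq.heappush(pq, (d + w, v))
    return parent

data_tree = shortest_path_tree("S")
# B is reached via A (10 + 15 = 25 ms) rather than directly (40 ms).
```

Every tree edge here is also a mesh edge, illustrating why the quality of the data topology is limited by the richness of the control topology.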
Once end-hosts have joined the control topology, they can exchange control messages via
the mesh. When end-hosts initially join, they have to find a suitable position in the control
topology, but they are not part of the control topology yet. Therefore ALM protocols need a
way to bootstrap. There is virtually always a rendezvous point (RP) that is assumed to be globally known, and
that can provide newly joining hosts with (for example) the root of the data delivery tree or
some random members. Note that the RP in CBT performs similar tasks, see [5]. In a way,
the RP introduces a single point of failure and a potential performance bottleneck. However, as in CBT, the RP could be replicated at the cost of additional complexity. Similarly,
several groups could share an RP. Furthermore, an RP failure does not disrupt the data dissemination (in contrast to CBT); ALM can tolerate a temporarily unavailable RP [37]. As the
bootstrapping procedure is similar in all ALM architectures, we do not pay much attention
to it.
A good overlay network should have the following properties in order to meet the requirements listed in Section 1.1:
1. The overlay should resemble the underlying topology. This will lead to low latency
and fewer redundant transmissions on physical links.
2. There should be little control traffic. It is important that the control traffic does not
limit the scalability.
3. The loss rate should be low; that is, members should miss as few data packets as
possible.
1 Each member M may receive the same data from several other members, but we consider only one of those members the parent of M. That is, we disregard the redundant transmissions. Consequently, the data topology is a tree. This is consistent with [14] and [32], who use the terms "shortest reverse-path tree" and "tree built by reverse path forwarding" in the context of DVMRP.


4. The overlay should be resilient to churn, failures, and changes of the underlying topology.

5. The available resources, especially upstream bandwidth, may be insufficient. End-hosts have heterogeneous capacities; for example, a member with low upstream bandwidth can only support a small number of children in the data delivery tree. Such
degree constraints have to be considered. Moreover, the burden of data and control
traffic should be fairly shared among the members.
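Property 5 can be made concrete with a sketch of degree-constrained parent selection (hypothetical candidates and values; real protocols embed such a check in their own join procedures):

```python
def choose_parent(candidates, children_count, max_degree, delay):
    """Pick the lowest-delay candidate parent that still has a
    free child slot, i.e. whose out-degree limit (derived from
    its upstream bandwidth) is not yet exhausted."""
    eligible = [c for c in candidates
                if children_count[c] < max_degree[c]]
    if not eligible:
        return None  # the join must be retried elsewhere in the tree
    return min(eligible, key=lambda c: delay[c])

# Hypothetical state: A is the closest candidate but is saturated.
candidates = ["A", "B", "C"]
children_count = {"A": 4, "B": 1, "C": 0}
max_degree = {"A": 4, "B": 2, "C": 2}   # children each node can feed
delay = {"A": 5, "B": 20, "C": 35}      # measured delay in ms

parent = choose_parent(candidates, children_count, max_degree, delay)
# A is full, so B, the next-lowest-delay candidate, is chosen.
```

The sketch shows the tension in property 5: respecting degree constraints can force a joining node onto a parent that is farther away, trading stretch for feasibility.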

2.2.3 Categories
In this section, we categorize ALM approaches based on the following questions: (1) Are
there statically placed proxies? (2) Is the data topology derived from the control topology,
or is it the other way around? (3) Is the overlay constructed centrally or in a distributed
fashion? (4) How many members are allowed to send to the group?

1. Overlay multicast was introduced briefly in Section 2.2. In this type of architecture, proxies, which can be routers or end-hosts, take over some group management
functions and form an overlay core. In Figure 2.1, nodes A, B and C are proxies and
constitute the overlay core. In approaches without proxies, the participating nodes act more like peers. Overlay multicast can construct a highly efficient
and stable overlay core [22]; proxies are placed statically with all information about
the underlying topology available, and can be assumed to be more stable than normal
end-hosts. They may also have more resources such as memory, computation power,
and most importantly upstream bandwidth. Therefore, they can be used to alleviate
the control traffic of less capable members. Unfortunately, the deployment of overlay
multicast is difficult: somebody, a multicast service provider, has to place and maintain
the proxies. Also, the bandwidth used, which is expected to be high compared to that of a normal end-host, can be costly. Dimensioning the overlay core is problematic, too. This
includes choosing the number of proxies and their locations. Usually bandwidth has
to be purchased in advance. Those decisions require the service provider to estimate
the number of users, their geographical distribution and churn. Over-dimensioning
the core wastes money, but if capacities are insufficient during the runtime, users will
have to be turned down. In any case, it is difficult or impossible to adapt the overlay
core to dynamic changes; overlay multicast lacks flexibility. Finally, it may be difficult
to bill the users. An example for an overlay multicast architecture is TOMA, which
will be discussed in Section 2.2.7.
2. The tree-first approach: this approach is very intuitive. The participating nodes are
first organized into a shared tree for data delivery. The control topology is then
constructed by adding some links to the tree.


Figure 2.2: Relations in a tree

A side note about the terminology: consider node G in Figure 2.2. Its parent is
C and its children are L and M. We call C's parent A the grandparent of G; G is a
grandchild of A. All nodes on the same tree level as G are G's siblings. All nodes
on the same level as G's parent, including C, are G's parent level nodes. All nodes
on the same level as G's children, including L and M, are G's child level nodes.
One advantage of this approach is that it is usually sufficient for each member to
maintain state about some "relatives" in the tree. As long as the topology does
not change, the control overhead is rather low. However, the tree-first approach
lacks resilience. Tree partitions usually take a long time to repair, and joining procedures tend to be costly. The tree-first approach is also vulnerable to loops in the
data topology. More importantly, it is hardly suitable for latency-sensitive applications with many sources (for example, games) because it builds a shared data
delivery tree. An example of such an approach is HMTP, which will be further
examined in Section 2.2.4.
The mesh-first approach: with this approach, the control topology (a mesh) is
constructed first. Then the data topology is determined by running a routing
algorithm on top of the mesh. Typically, the data topology is a source-specific
tree. For example Narada, which will be described in detail in Section 2.2.5, uses
DVMRP to (implicitly) construct the data delivery trees. The mesh-first approach
tends to offer higher resilience than the tree-first approach. While source-specific
trees should reduce latencies compared to the tree-first approach, they are not
optimal either. Their quality is limited by the richness of the control topology [6].
Depending on the routing algorithm, some redundant end-to-end transmissions
are possible. This makes the mesh-first approach less suitable for high-bandwidth
applications.


3. ALM approaches can be further categorized by considering the way that the overlay is
built. We distinguish centralized from distributed overlay construction. This concept is
similar to the distinction of centralized and decentralized network layer routing. For
example, an end-host that joins the NICE hierarchy finds an appropriate position by
itself. The RP, as a central entity, only provides the address of the root. This is distributed overlay construction. In contrast, TOMA clusters are constructed centrally
by the proxies. Then the cluster nodes are informed of their neighbors and start exchanging control messages with them. The centralized approach imposes a high load
on the node that constructs the overlay; this central entity may become a performance
bottleneck. Then again, a centrally constructed topology may have better properties,
as more knowledge of the underlying topology is available.
4. A fourth classification criterion for ALM architectures is the number of sources. A
single-source approach allows only one group member to send to the group (one-to-many). It is easier to optimize the overlay for one source. This approach also simplifies
concepts and implementation. Many-to-many applications like chat or games require
every member to be a source though. However, in [18] it is pointed out that "many
applications which appear to need multi-source multicast, such as a distributed lecture
allowing questions from the class, do not". Additional senders, for example students
that ask questions, can send their data to the source; then the source broadcasts it to
the group.

2.2.4 Host Multicast Tree Protocol (HMTP)


HMTP is designed for any type of multicast application; in particular, each member can be a
source. However, simulation results have been reported only for medium-sized groups with
up to 500 members [37]. The protocol takes the tree-first approach for distributed overlay
construction. HMTP is not intended as a peer-to-peer protocol. Instead, it is assumed that
some of the participating nodes are routers (referred to as designated members). Each router
represents an island where network layer multicast is enabled and is responsible for data dissemination to end-hosts on its island. End-hosts that are not on such an island may join the HMTP
group as peers of the designated members. That means the overlay consists of end-hosts,
as well as routers. This allows HMTP to coexist with network layer multicast. In fact it is
designed as a temporary solution with reasonable efficiency and scalability to allow quick
deployment. Nevertheless, it would also be possible to use HMTP for mere end system multicast (as defined in [13]). It is discussed here mainly as a simple example of the tree-first
approach, so that details, such as the election of the designated members, will be omitted.
Data delivery: With the tree-first approach, this is fairly simple. When a member receives
data or wishes to send data to the group, it (1) sends the data to its parent and (2) to
all its children. It does not send the data to the member that it has received the data
from. That way, there are no redundant end-to-end transmissions.²

²More precisely, for a group of size n, there are n − 1 end-to-end transmissions.
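This forwarding rule can be sketched as follows (a minimal illustration with a hypothetical node structure, not part of the HMTP specification):

```python
# Sketch of tree-first data forwarding as in HMTP: forward to the parent and
# all children, except the neighbor the packet arrived from.

def forward(node, packet, received_from=None):
    """node has .parent (None at the root) and .children; returns the
    neighbors the packet should be forwarded to."""
    neighbors = ([node.parent] if node.parent else []) + list(node.children)
    return [n for n in neighbors if n is not received_from]
```

Applied to every member, this yields exactly n − 1 end-to-end transmissions per packet in a group of size n.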


Join algorithm: A recursive algorithm is used. A node J that wishes to join the multicast
group first contacts the root R of the data delivery tree. R responds with a list of its
children. When J receives a list l of nodes from their common parent P, it determines
its end-to-end delay to all of them. A short delay to a node X means that X is probably
located nearby in the underlying topology and would be a suitable parent in the data
delivery tree. If the delay to all l-nodes is higher than the delay to P, then J sends a
join request to P, asking it to be added as a child. Otherwise, J queries the "nearest"
l-node for a list of its children. In a nutshell: joining nodes are relayed down the data
delivery tree. This algorithm assures that members with short end-to-end delay are
"clustered together" [37], so that the overlay resembles the underlying topology. Note
that join requests may be denied, for example because the desired parent P does not
have enough upstream bandwidth to support another child. In this case, the joining
member sends a join request to one of P's children. If there are no children, the joining
node may be forced to "go up the tree" again to explore a different branch. Joining
a large group may take a long time. Therefore the authors of [37] suggest a "foster
child" mechanism: the joining node is attached to a temporary parent so that it can
start receiving data quickly.
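The recursive join can be sketched as follows (hypothetical names; join-request denial and the foster child mechanism are omitted for brevity):

```python
# Sketch of HMTP's recursive join: the joining node walks down the tree,
# always following the nearest child, until no child is closer than the
# current candidate parent.

def hmtp_join(root, delay_to):
    """root: tree node with .children; delay_to: function measuring the
    end-to-end delay from the joining node to a given node."""
    candidate = root
    while True:
        children = candidate.children
        if not children:
            return candidate              # leaf reached: attach here
        nearest = min(children, key=delay_to)
        if delay_to(nearest) >= delay_to(candidate):
            return candidate              # no child is nearer: attach to candidate
        candidate = nearest               # descend toward the nearest child
```

Members with short mutual delay thus end up clustered in the same branch, as described above.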
Partition handling: A gracefully leaving node informs its parent and children. Each node
exchanges refresh messages with its parent and children periodically to detect ungraceful leaves. The control topology is very limited; if a member X detects that its parent P
is inactive, then X cannot notify its siblings or its grandparent because it has no knowledge about them. Any inactive non-leaf node partitions the data delivery tree: there is
one partitioned branch rooted at each child of the inactive node. Those branches stay
intact while their roots rejoin. Rejoining is done in "reverse order", and is usually not
as time-consuming as joining initially: each member keeps track of the path from itself
to the root (root path). A rejoining node first contacts its former grandparent and then
works its way up the tree. That way, nodes that are known to be near are contacted
first. The root paths are also used to improve the tree over time, and help to avoid and
detect loops. They are updated as follows: when a node P receives a refresh message
from one of its children C, then P responds with its own root path. C adds P to the
path in order to update its root path.
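The rejoin procedure along the root path can be sketched as follows (hypothetical helper functions; the fallback to a fresh join at the root is our simplification):

```python
# Sketch of HMTP rejoin after an ungraceful parent failure: the orphaned
# node walks its stored root path bottom-up (former grandparent first),
# so that nodes known to be near are contacted first.

def rejoin(root_path, is_alive, try_attach):
    """root_path: node ids from the root down to the failed parent.
    is_alive/try_attach: callbacks probing and contacting a node."""
    for node in reversed(root_path[:-1]):   # skip the failed parent itself
        if is_alive(node) and try_attach(node):
            return node
    return None                             # fall back to a fresh join at the root
```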
Self-improvement: As nodes leave the group, their parents can accept new children. For
the remaining nodes, this means more suitable positions in the tree may become available over time. Each node X looks for nearer parents periodically, starting with a
random node in the root path. The exact algorithm is similar to the join algorithm
described above. If a potential parent P with significantly shorter end-to-end delay
is found, X changes its parent. That is, the subtree rooted at X is attached to P. As
delay measurements are influenced by cross traffic, there is a threshold for changing
the parent. This avoids changing the parent back and forth ("oscillation").
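The threshold rule can be illustrated as follows (the concrete threshold value is an assumption for illustration; HMTP's actual parameter choice is given in [37]):

```python
# Hedged sketch of the parent-switch rule with hysteresis: switch only if
# the candidate improves the delay by a significant margin, so that noisy
# measurements caused by cross traffic do not trigger oscillation.

SWITCH_THRESHOLD = 0.8   # hypothetical: new delay must be < 80% of current

def should_switch(current_delay_ms, candidate_delay_ms,
                  threshold=SWITCH_THRESHOLD):
    return candidate_delay_ms < threshold * current_delay_ms
```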


On the one hand, HMTP is an intuitive approach that avoids redundant end-to-end transmissions. With a single source as the root, low latency can be achieved. On the other hand, all disadvantages of the tree-first approach, as described in Section 2.2.3, apply. In particular, HMTP is vulnerable to membership changes as its control topology is sparse. Note that HMTP assumes that some of the members are (stable) routers. With many sources, the shared
tree approach is suboptimal. The join algorithm is another issue as it induces high control
overhead.

2.2.5 Narada
The Narada protocol was proposed and evaluated in [13] to show that ALM can achieve
acceptable performance (compared with network layer multicast). As for the categories
that have been introduced in Section 2.2.3: Narada is an example of a (1) peer-to-peer, (2)
mesh-first approach that is (3) fully distributed and (4) allows any number of sources. It is
intended for rather small groups; reasonable performance has been shown by simulation
experiments with up to 128 members. Experiments in [7] indicate that the scalability of
Narada is limited mostly by the produced control overhead. Although the control topology
is not a complete graph, all members have knowledge about each other. Below, we will
describe Narada in greater detail because some aspects of DMMP are very similar.
Data delivery: The data topology is built by running a distance vector protocol similar to
DVMRP on top of the mesh, which produces source-specific spanning trees. The distance metric is application-specific. The count-to-infinity problem is solved by exchanging minimum distance paths among the members. Note that such a routing
protocol leads to some redundant transmissions: there is up to one transmission per
overlay link.3
Join algorithm: A joining end-host sends a join request to some random group members
and attaches itself to the first responding member. That member should be near in
the underlying topology, as its response arrived quickly. However, only a potentially
small number of randomly chosen members is contacted, and therefore the initial
mesh position of a joining end-host may be unfavorable. As for degree constraints:
a member that has not enough available capacity can choose not to respond to join
requests.
Partition handling: Members that wish to leave the group are supposed to advertise routing updates indicating a large distance to all destinations for some time. This allows
the remaining members to adapt their routes. That way, packet loss can be avoided.
Gracefully leaving members should also notify their neighbors. As in most other ALM
protocols, members exchange refresh messages periodically to detect ungraceful
leaves. Routing updates are combined with the refresh messages. If a member X
leaves ungracefully its neighbors do not receive any more refresh messages. Narada
³Different from HMTP, the control topology is not a tree.

17

2 Related work

Figure 2.3: Overlay mesh

does not assume reliable message transport; refresh messages can be lost occasionally.
For that reason, each neighbor independently sends X a probe message to verify that
X is inactive. As X has left, it does not respond to the probe messages. Its neighbors
assume that X is inactive. Inactive end-hosts are removed from routing tables; that
means, the data delivery tree can adapt quickly. Leaving members can also partition
the mesh though.
It is relatively difficult to detect and repair mesh partitions. In Figure 2.3, consider that
member A leaves the group, which is noticed by its neighbors B and C. Members C,
D and E cannot receive data until the partition is detected and repaired. Partitions are
detected as follows. Each member stores in a table all other members' IP addresses
and a timestamp indicating the last time it heard of them. Those tables are exchanged
among neighbors as part of the periodic refresh messages. (Additionally, a sequence
number is stored in the table. That way, timestamps do not have to be included in
refresh messages.) Upon receiving refresh messages, members update their tables. If
the table entry for a member M is not updated for some time, a partition is likely. The
partition can be repaired quickly by adding a link to M. However, there is a problem
with this approach. If there is in fact a partition, several members may detect it at the
same time and more links than necessary will be added. Narada uses a randomized
algorithm to decide if a link should be added. If there is no update for at least Tmin
seconds, a link may be added. A link is guaranteed to be added after time Tmax. The
algorithms for partition detection and repair are described in more detail in Section
3.2.7.
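The randomized decision can be sketched as follows (the linear probability ramp between Tmin and Tmax is an assumption for illustration; Narada's exact schedule is given in [13]):

```python
import random

# Hedged sketch of Narada-style randomized partition repair: once a
# member's table entry has been stale for at least Tmin seconds, a repair
# link may be added with a probability that grows with staleness; at Tmax
# the link is guaranteed to be added. This avoids several members adding
# redundant repair links simultaneously.

def should_add_repair_link(stale_for, t_min, t_max, rng=random.random):
    if stale_for < t_min:
        return False                  # too early: avoid spurious links
    if stale_for >= t_max:
        return True                   # guarantee repair by t_max
    p = (stale_for - t_min) / (t_max - t_min)
    return rng() < p
```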
Self-improvement: The mesh repair algorithm and the join algorithm do not pick mesh
links carefully. This leads to long data delivery paths and high latency. Additionally,
unnecessary mesh links are maintained. The mesh can improve its quality over time
by distributedly adding and dropping links.
Adding links: Periodically each member X chooses another member Y at random and


requests its routing table. Based on that, X computes how a link from X to Y
would shorten the distance from X to all other members. The resulting number
is called the utility of that link. If the utility is greater than a threshold, the link is
added.
Dropping links: Each member periodically determines its least useful incident link.
This is done by computing the consensus cost for each link. This value depends
on the number of destinations for which the link is used as the outgoing interface. If the consensus cost of the least useful link is below a threshold, the link
is dropped. The exact algorithm is provided in [13]. It guarantees that the mesh
does not become partitioned, when a link is dropped.
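The utility computation for adding a link can be sketched as follows (simplified from [13]; function names are hypothetical):

```python
# Hedged sketch of Narada's link utility: the utility of a candidate link
# X-Y sums, over all members m, the fractional latency improvement the
# link would bring to X's routes.

def link_utility(dist_x, dist_y, latency_xy):
    """dist_x: X's current distances to all members; dist_y: Y's distances
    (taken from Y's routing table); latency_xy: latency of the new link."""
    utility = 0.0
    for m, current in dist_x.items():
        via_y = latency_xy + dist_y.get(m, float("inf"))
        if via_y < current:
            utility += (current - via_y) / current   # fractional improvement
    return utility
```

If the resulting utility exceeds a threshold, the link is added; the consensus cost used for dropping links is computed analogously per incident link.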
For small groups Narada achieves low latency given some time to stabilize. Simulation
experiments in [13] indicate that latency is almost comparable to network layer multicast.
Narada accomplishes this for any number of sources and adapts well to dynamic changes.
It is not designed for large groups, though, and therefore it does not scale well: [7] states
that "Narada has O(n2 ) aggregate control overhead".

2.2.6 NICE
The NICE⁴ protocol has been proposed in [7]. We study NICE because we see it as a main competitor to DMMP. NICE is motivated by the shortcomings of Narada. As mentioned
in Section 2.2.5, Narada does not support large groups. Its high control overhead is particularly unreasonable for low bandwidth applications. Control traffic may consume more
bandwidth than the application data in that case. NICE, in contrast, is designed with this kind of application in mind. Examples are Internet radio and stock market tickers. The application
characteristics can be summarized as: (1) large groups (thousands of members); (2) multiple
sources; (3) low bandwidth traffic; (4) only soft real-time requirements, timely delivery is
desirable but slightly delayed data is still usable; (5) loss tolerance.
NICE is a fully distributed peer-to-peer approach. The concept is to establish and maintain a hierarchical control topology. The data delivery algorithm results from this topology;
the data topology is implicitly defined by the properties of the control topology. Therefore
NICE is classified as an implicit approach (as opposed to the tree-first or mesh-first approach)
in [7]. The NICE hierarchy consists of several levels, and on each level there are several clusters. The group members are organized in such clusters. Each member belongs to one cluster
on the lowest level (level 0) and may belong to other clusters on higher levels. Each cluster
has a size between k and 3k − 1, where k is typically small, for example three. This invariant is enforced by merging small clusters and splitting big ones. Each cluster has a leader, which should ideally be the topological center of the cluster. Clusters on level n (n ≥ 1) are comprised of the cluster leaders of level n − 1. Consequently, there is only one cluster on
the highest level. The leader of this cluster (the root of the hierarchy) is either the RP or the
⁴NICE is a recursive acronym and stands for "NICE is the Internet Cooperative Environment".


source (for a single-source application). The root is a member of O(log(n)) clusters and has
the highest control traffic. The control topology inside the clusters is a complete graph.
Data delivery: The source provides all nodes in all clusters it belongs to with data. When a
member receives data, it does the same, except that it does not forward to the member
it received the data from. That means, for each cluster, there is a core-based data
delivery tree and the maximum number of forwarded packets is in O(log(n)) for each
member.
Join algorithm: Nodes join by contacting the root, which responds with a member list of
the top level cluster. The joining node determines the distance to each of these members (e.g. by measuring the round trip time). Then the nearest member is asked for its
member list. That way, the joining node is redirected until it receives a member list of a
level 0 cluster, which it joins; thus, the nearest cluster is joined. Joining takes a long time
and requires O(k log(n)) queries, as every joining node is "passed" down the hierarchy
to level 0. In contrast, joining HMTP nodes can obtain a higher position, if they find
a suitable parent. Therefore [7] suggests to "peer" joining nodes temporarily to allow
them to receive data. This is identical to the foster child mechanism of HMTP.
Newly joined nodes belong to a level 0 cluster. They can become members of higher
level clusters later because cluster leaders change over time. When a cluster is joined
by a new member, the center of the cluster may change. In this case, the new center
becomes the cluster leader. The NICE protocol requires the center of a cluster to be the
cluster leader; this is an important invariant. Note that cluster leaders are not chosen
based on their available resources such as upstream bandwidth. In particular, degree
constraints are not considered. NICE is designed for low bandwidth applications, so
most members can be assumed to support a high number of children in the data delivery tree. However, for high bandwidth applications, this is problematic. As members
join, clusters grow beyond the maximum cluster size. In this case, the cluster leader
splits the cluster, dividing the other members into two new clusters. It also chooses
leaders for both clusters.
Partition handling: NICE detects ungraceful leaves in the same manner as HMTP. NICE's
refresh messages are called "heartbeat" messages. When a cluster leader leaves, the
remaining members negotiate a new leader. If the cluster size falls below the minimum
cluster size, its leader contacts the leader of the nearest cluster on the same level of the
hierarchy, and the clusters are merged.
Self-improvement: Each member X regularly measures the distance to leaders of foreign
clusters. If the distance to the leader of a foreign cluster C is smaller than the distance
to the current cluster leader, X moves to the cluster C.
Simulation and testbed experiments in [7] with up to 2048 members suggest that NICE scales
well and thus supports groups with thousands of members; its average aggregated control


overhead is in O(1). Furthermore, NICE builds on a consistent concept.⁵ However, NICE


may be less suitable for high bandwidth applications, as end-host heterogeneities are not
considered. There are also no performance measurements for very large groups (with more
than 2048 members) so far.

2.2.7 Two-tier Overlay Multicast Architecture (TOMA)


[22] and [21] advocate a (proxy-based) overlay multicast architecture. TOMA allows any
number of sources and supports all kinds of multicast applications. It is a two-tier approach
in the sense that the overlay network consists of statically placed proxies ("service nodes"),
that form the overlay core, and end-hosts. The overlay core is called the multicast service
overlay network (MSON). The MSON is shared by a potentially high number of multicast
groups. End-hosts are grouped in clusters, and each cluster is attached to exactly one proxy.
Usually each proxy will serve several clusters, each belonging to a different group. Inside
the clusters a tree-first protocol is used. The data topology is a core-based tree rooted at the
proxy. The proxy also constructs the tree centrally. TOMA is claimed to be highly efficient
and scalable, to generate low control overhead, and to achieve a better overall performance
than NICE. This is supported by simulation experiments in [22]: for a group of 1000 members, NICE generates three times more control messages in total. For TOMA, the observed
latency is comparable to network layer multicast.
OLAMP: The MSON data topology is built using the overlay aggregated multicast protocol
(OLAMP). The main idea is to let groups with a similar distribution of members
among the service nodes share a data delivery tree. That means, while not all groups
share the same MSON tree, there is not necessarily an individual tree for each group
either. Instead, similar groups are aggregated and share an aggregated tree, which helps
reducing the state information kept by the service nodes. For each group there is a host
proxy that handles tasks concerning the whole group. Most importantly, it assigns the
group to an aggregated tree. Joining end-hosts contact all service nodes to measure the
round trip time. The service nodes respond with information regarding their current
workload. This could be the available bandwidth or other capacities. Based on this
information, the joining end-host chooses a member proxy and sends it an "O-JOIN"
request. If the joining end-host is the first of its group G joining that proxy, a new
cluster is formed and the member proxy contacts the responsible host proxy with an
"O-GRAFT" message. The host proxy then adds the member proxy to the aggregated
tree that G is assigned to. Similarly, when the last member of a cluster leaves, the
member proxy sends an "O-LEAVE" message to the host proxy.
Whether groups share an aggregated tree depends on the amount of wasted bandwidth. In Figure 2.4, the ellipses represent clusters; P1, P2, P3 and P4 are proxies. The
⁵Although it is worth mentioning that the exact algorithms concerning cluster leadership changes were not described in detail above, and are in fact rather complex.


Figure 2.4: TOMA and an aggregated tree

white clusters belong to group F. It shares an aggregated tree with group G, which
consists of the gray clusters. The tree is not optimized for G; data targeted at G is
sent to P2 although P2 has no members of G. This may waste bandwidth. OLAMP
defines rules for deciding whether G should have its own tree instead. Such a tree would be
computed using a multicast routing algorithm.
Cluster management: Each cluster is attached to one proxy. The proxy stores information
about all cluster members and arranges them into a core-based tree rooted at the proxy.
The cluster members periodically measure the round trip time to their parent and
children, and send the results to the proxy. In turn, the proxy informs the members
of topology changes. Members also exchange probe messages periodically to detect
ungraceful leaves. When a tree-partition is discovered, the proxy is asked to repair it.
The data delivery works as in HMTP or any other tree-first approach. The difference is
that the proxy forwards received data to the rest of the MSON, along the aggregated
tree. Proxies that receive data from the MSON forward it to the local cluster.
Deployment: As mentioned in Section 2.2.3, the deployment of overlay multicast tends to
be difficult. [22] suggests the following model: The multicast service provider (which
may additionally offer other Internet services) sets up the MSON and estimates its
usage. Then it purchases the necessary bandwidth from a network provider. The
initiators of multicast groups pay for the multicast service and bill the participating
end-users.
Overlay core dimensioning: Given the geographical distribution of the end-users it is not
difficult to find approximately optimal locations for the proxies; this optimization
problem is known as the "warehouse location problem", and a solution is presented
for example in [20]. Estimating the user distribution is more problematic. The necessary bandwidth is even harder to predict. [21] suggests over-dimensioning the MSON


slightly. When the purchased bandwidth turns out to be insufficient, it may be possible to lease additional bandwidth at runtime. However, this would probably be expensive.
In summary, TOMA achieves a stable overlay core that is efficient regarding redundant
transmissions and latency, even for large groups. TOMA relies on statically placed infrastructure though. Some disadvantages of this approach have been noted in Section 2.2.3. It
seems questionable whether TOMA can be deployed quickly. This aspect is critical because, after
all, this is mainly what is inhibiting network layer multicast.

2.2.8 Similar protocols


In the previous sections, some examples of ALM protocols have been given. A few more
ALM proposals are introduced in this section; the rest are omitted for brevity.
OverCast [18] is an overlay multicast architecture. Different from TOMA, only a single
source is supported. OverCast is intended to be used by companies to deliver high
bandwidth content to employees. As it operates on top of TCP, "content types that
require bit-for-bit integrity" [18] are supported.
The "Application Level Multicast Infrastructure" (ALMI) [29] is a peer-to-peer, treefirst approach optimized for small (tens of members), short-lived groups with any
number of sources 6 . The ALMI overlay resembles a TOMA cluster: the data topology
is centrally constructed and maintained by the RP.
HostCast [27] is another peer-to-peer, tree-first approach. It works similarly to HMTP,
but there is only one source, which is the root of the data topology. It also features a
richer control topology: each member has a mesh link to its grandparent, to all its child
level nodes and parent level nodes, and to all its grandchildren. As in HMTP, nodes
keep track of their root path (called the "primary root path"). In addition, HostCast
nodes maintain a "secondary root path" for every mesh neighbor, which is the shortest
overlay path to the root via that neighbor. The secondary root paths are used for self-improvement, among other things. If a node's secondary root path promises lower latency than the primary root path, the node will choose the corresponding mesh neighbor as its new parent. Simulation experiments in [27] indicate that HostCast is resilient
regarding membership changes. The control overhead has not been evaluated though.
OMNI [8] is an overlay multicast architecture aimed at media streaming applications.
In contrast to TOMA, any peer-to-peer ALM protocol can be used within the clusters.
The overlay core data topology is optimized for latency and prioritizes proxies with a
high number of attached end-hosts. There is only a single source, which is attached to
the root of the data topology.

⁶Examples of such applications would be video conferences and games.


So far we have referred to ALM without proxies as a peer-to-peer architecture. The authors
of [35] argue that having virtual connections to many other nodes is "server-like" behavior,
which does not correspond to the peer-to-peer paradigm. Therefore they do not consider
ALM a peer-to-peer approach. ALM is certainly not a typical peer-to-peer architecture like
a file sharing system. In both cases, an overlay topology is constructed among the participating end-hosts; for example, Chord arranges end-hosts in a ring [32, pages 380–384]. In a
typical peer-to-peer system, however, files (or whatever kind of records the peers exchange)
are completely transferred before they are consumed. That means, there are no real-time
constraints as in ALM. In-order delivery is not necessary either; instead reliable service is
usually important. Typical peer-to-peer systems also require some sort of lookup mechanism, for example a distributed hash table, as each peer offers a large number of records.
Because of these differences, typical peer-to-peer systems will not be further discussed.

2.3 Dynamic Mesh-based Overlay Multicast Protocol (DMMP)


In the previous sections, various ALM approaches have been introduced, and some tradeoffs have been noted: overlay multicast can achieve a highly efficient overlay core, but static
proxy placement introduces a number of problems regarding capacity dimensioning and
deployment. With the tree-first approach group members keep state information only about
a small number of other end-hosts, but trees tend to lack resilience. Additionally, the join
procedure is costly, which also applies to the hierarchical approach of NICE. The mesh-first
approach achieves higher resilience, but the maintenance of mesh links may be costly. For
example with Narada, each member maintains state about each other member. Furthermore, none of these approaches exploit the heterogeneous properties of end-hosts. At best,
degree constraints are considered. With these shortcomings in mind, DMMP, which has
been described in [25] and in [24], takes a hybrid approach: there is an overlay core comprised of end-hosts with high stability and bandwidth. These end-hosts are called super
nodes. This concept is similar to overlay multicast, but the super nodes are not placed statically. The core is organized using a mesh-first protocol similar to Narada. The overlay core
is relatively small, so Narada's lack of scalability is not a problem. The non-super nodes
are placed in clusters; each cluster is led by one super node. Inside the clusters, a tree-first
protocol is used.

2.3.1 Overview
DMMP is intended for multimedia streaming applications as described in Section 1.1. That
is, there is only one source. In the future, many-to-many communication may be supported
as well. DMMP makes no assumptions about the transport layer, but [25] advocates UDP
because control message exchange follows the request-response-pattern and lost multimedia data can be tolerated.
The DMMP architecture has two tiers:


Figure 2.5: DMMP: Data delivery algorithm

1. The upper tier consists of super nodes, which form the overlay core. Super nodes are
selected from among members based on their capacity. The capacity of a member is
a function of the available bandwidth and the uptime. Uptime refers to the elapsed
time since the member joined the group. Members with high uptime are assumed
to be relatively stable. Other values can be taken into account to profile DMMP. As
DMMP is directed at large groups, the chosen super nodes are expected to be well
dispersed over the underlying topology. Super nodes are assumed to be ordinary end-hosts, but in principle they can also be provided (statically or on-demand) by a group
coordinator. The super nodes and the source self-organize into a mesh. This works the
same way as in Narada.
2. The lower tier consists of all members that are not super nodes. Each super node is
responsible for one cluster of non-super nodes. The clusters are organized as trees.
Joining non-super nodes use distance measurements (for example round trip time) to
choose a cluster and find a suitable position in the tree. Therefore, the clusters resemble
core-based trees.
Figure 2.5 shows the data delivery algorithm. S is the source, the white circles are super
nodes, and the black circles are non-super-nodes. The ellipses are clusters; only one cluster
is shown in detail. The source sends its data directly to some of the super nodes. Like Narada,
DMMP runs a distance vector protocol similar to DVMRP on top of the mesh to create a
source-specific spanning tree for data delivery. Consequently, there can be redundant transmissions. In Figure 2.5, the spanning tree is represented by arrows. The dotted arrows show
redundant transmissions; they are not part of the tree. Unlike Narada nodes, super nodes do
not only forward data to their mesh neighbors, but also to the children in their local cluster.


In turn, cluster nodes forward the data to their children. That way, data is simply delivered
from top to bottom inside the clusters.
Inactive super nodes are detected using refresh messages. Possibly, they can be replaced
by one of their children. However, this may not always be feasible. The missing super node's cluster could be empty, for example, or all its children could have insufficient capacity. That means the mesh can become partitioned. DMMP treats this situation using
the same algorithms as Narada. In the clusters, refresh messages are used, too. Each cluster node periodically exchanges refresh messages with its (1) parent level nodes, (2) child
level nodes, and (3) siblings. In other words, there are control topology links between those
nodes.⁷ When an inactive node is detected, it may be possible to replace it with one of its
former children. Otherwise, the data delivery tree may become partitioned. In that case,
the former children of the inactive node look for new parents starting with their remaining cluster neighbors. Because the join algorithm is based on distance measurements, those
nodes are assumed to be near. Therefore, they are suitable new parents. That said, DMMP
clusters are not solely optimized for distance. Ideally, the clusters should be sorted by capacity, meaning that no node has a higher capacity than its parent. This has a number of
advantages:
1. The probability that a node with a high position in the data delivery tree leaves is
reduced, and partitions are more likely to be local problems affecting only a small
number of nodes. Note that partitioned members are cut off from the data stream
until the partition is repaired.
2. Nodes with high bandwidth can accept more children. Optimizing the clusters for capacity makes them broader and shorter, which should eventually improve the average
latency as well. Shorter clusters also expedite join attempts.
3. It is likelier that inactive super nodes can be replaced; no cluster node has higher
capacity than the super node or its children.
Another consequence is that nodes close to the overlay core contribute more bandwidth.
However, they are also likely to experience higher service quality, with lower loss rate and
latency. That means, there is an incentive to contribute resources to the group. Two mechanisms are used to optimize the clusters for capacity:
1. When a cluster node receives several join requests within a short period of time and
cannot satisfy all of them due to a lack of bandwidth, the nodes with the highest capacity are chosen as children.
2. Nodes with high capacity can step up to higher tree levels over time. If a node's capacity is significantly higher than its parent's capacity, the two nodes may switch
their positions. We refer to this as a promotion.
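The promotion rule can be sketched as follows (the threshold factor is a hypothetical value for illustration):

```python
# Hedged sketch of DMMP's promotion rule: a node swaps positions with its
# parent when its capacity exceeds the parent's by a significant margin,
# gradually making the cluster heap-like (no node has a higher capacity
# than its parent).

PROMOTION_FACTOR = 1.5   # hypothetical "significantly higher" threshold

def maybe_promote(node_capacity, parent_capacity,
                  factor=PROMOTION_FACTOR):
    """Return True if the child should switch positions with its parent."""
    return node_capacity > factor * parent_capacity
```

The margin plays the same anti-oscillation role as the delay threshold in HMTP's self-improvement.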
⁷HostCast builds a similar control topology.


In summary, clusters resemble heaps. HMTP and HostCast (see Sections 2.2.4 and 2.2.8)
have no such mechanism. Instead, the data delivery tree is optimized exclusively for latency. Note that the overlay core improves itself over time as well. Mesh links may be
added and dropped; the algorithm is the same one that Narada uses.
In the next section, more details about the construction of the initial control topology, the
join algorithm, partition handling in the clusters, and the promotion mechanism are given.
We do not further discuss the data delivery algorithm, as it is very similar to DVMRP. We
also omit the adding and dropping of mesh links. Details about this can be found in Section
3.2 and in [13].

2.3.2 Details
In this section, we will highlight some DMMP protocol details. Most of them will be revisited in Section 3.2.
Overlay construction: End-hosts that wish to join the group obtain the IP address of the
RP via the Domain Name System (DNS). Assume that there are some initial group
members which have subscribed to the RP. The source is part of this initial group. For
each initial member, the available upstream bandwidth is determined. It would also
be possible to let the users manually specify the available bandwidth. This approach is
taken in [12], for example. However, users may misstate such information. Instead, the
bandwidth is measured for each participating end-host. This can be done by sending a
series of test packets and measuring the inter-arrival times of the responses. That way,
the bandwidth of the bottleneck link, which is the link with the lowest bandwidth,
can be determined. More sophisticated techniques are required to handle competing
traffic [11]. These measurements may have to be repeated regularly, as the available
bandwidth can change over time. After obtaining the upstream bandwidth b(i) for
each initial member i, the RP calculates the maximum outdegree d(i) as

    d(i) = ⌊ b(i) / r ⌋,
where r is the constant bitrate the source sends at. Note that members with a maximum outdegree of zero have to be leaf nodes in the data topology.
Next, an application-specific number of super nodes is chosen in order of maximum
outdegree. Other metrics can be considered, again depending on the application. For
example, it is desirable to ensure that super nodes do not behave in a malicious way.8
The number of super nodes should in general be less than 100, because the super nodes basically use Narada, which is known to have significant control overhead
already for 128 end-hosts. Clusters may consist of hundreds of members, so DMMP
8 Security is an open issue of DMMP.

groups can have tens of thousands of members. The super nodes and the source self-organize into a mesh as described in [13]. Then the clusters are formed using the join
algorithm, which is described below. In the overlay construction, end-to-end connectivity is assumed; that is, problems concerning NAT and firewalls have not been
considered yet. When all members have contacted the RP to confirm their positions,
the initial overlay is constructed.
Join algorithm: A newly joining end-host queries the RP for a list of super nodes. The
length of the list is application-specific. Then it measures its distance to each super
node on the list and chooses the nearest one. The joining host sends a join request
to the chosen super node. The request includes the available bandwidth and the current uptime. Whenever a member M receives a join request, it waits for concurrent
join requests. After some amount of time, M accepts as many joining end-hosts as its
maximum outdegree allows. The child selection is based on the capacity of the joining end-hosts; the capacity is a function of the available bandwidth and the uptime
(an example is given below, see Equation 2.1). The accepted hosts receive an acknowledgment. M sends a list of its children to the rejected hosts. Based on that list,
the rejected hosts continue to look for a suitable parent. This is described in more detail
in Section 3.2.4. All in all, this algorithm is similar to the HMTP join algorithm. The
main difference is that in HMTP, a node X only sends a join request to another node Y
if the distance from X to Y is shorter than the distance from X to any of Y's children.
A DMMP host would try to obtain a position as a child of Y before considering the
children of Y. DMMP generally focuses less on optimizing latencies.
Partition handling in the clusters: A leaving end-host should at least notify its parent or
one of its children of its departure. Ungraceful leaves are detected using refresh messages. Each cluster member periodically exchanges refresh messages with its neighbors in the control topology. Members can also request refresh messages from their
neighbors. If a member does not respond to a refresh request, or does not send a refresh message on its own for some period of time, it is suspected to be inactive. It is
sent a probe message to confirm this. If a member M does not respond to a probe message, the sender of that message assumes that M is inactive. When a member learns
that one of its neighbors has become inactive, it notifies its remaining neighbors.
Consider Figure 2.2 again. Assume that it shows the data topology of a DMMP cluster,
and that D has left the group. That is, the tree is partitioned: I and J cannot receive
data. In DMMP, there are two ways to fix this:
Either I or J can replace D. Here is how it works: I and J both send a replacement
request to B. Note that in the DMMP control topology, nodes have no link to their
grandparent. Therefore I and J may have to query the responsible super node
A to obtain B's address. After receiving the first replacement request, B waits
for additional replacement requests. After some time, it chooses a replacement


Figure 2.6: Rationale for Equation 2.1 (the capacities of two nodes X and Y plotted over time)

based on capacity. Assume that I has a higher capacity than J. B will send I an
acknowledgment. J receives the address of I and tries to join as a child of I. This
approach has been proposed in [23].
[25] suggests that rejoining nodes use their remaining neighbors as a starting
point for the join algorithm that has been described above. The neighbors of a
node are its siblings, its child-level nodes, and its parent-level nodes. In Figure 2.2,
J's neighbors are E, F, G, H, I, K, L and M, for example. The nodes on the lowest
level of a cluster tend to have the lowest available bandwidth and the highest
number of neighbors. This is worrying because maintaining those links may induce high control overhead. However, it might be sufficient to exchange refresh
messages less frequently between "remote" relatives.
Promotions: If the capacity of a member X exceeds the capacity of its parent Y by a threshold, X requests Y to switch positions with it. How does X know that it has higher
capacity than Y? All members announce their capacity via refresh messages: each
refresh message contains the sender's uptime and bandwidth. [24] proposes the following capacity function:
    c(j) = b(j) + ( b(j) / Σ_{i=1}^{n} b(i) ) · t(j),    1 ≤ j ≤ n,    (2.1)

where n is the group size, b(x) is the bandwidth of member x, and t(x) is that member's uptime. In Figure 2.6, node X has joined the group at time t = 0, whereas Y has
joined at some t > 0. Until one of the nodes leaves, X has a higher uptime than Y. However,
as Y has more available bandwidth than X, the capacity of Y grows faster over time
and exceeds the capacity of X at some point. That means, the capacity function can
prevent transient nodes from obtaining high positions in the clusters, and will eventually help high-bandwidth nodes to climb up.


[25] does not contain all the details about the promotion mechanism; further details
may be added in the next revision.
Message types: [25] proposes seven control message types; for each type, there is a request-response message pair. A 24-byte DMMP header is attached to all DMMP messages.
A description of the header fields can be found in [25].
1. Subscription: Used to obtain the address of the RP via DNS.
2. Ping-RP: Newly joining end-hosts send a ping-RP request to the RP. The RP
replies with a list of randomly chosen super nodes.
3. Join: Newly joining and rejoining end-hosts send join requests to cluster members. Join responses indicate if the joining host has been accepted as a child or
not. When a joining host is rejected, the join response contains the addresses of
the senders children. This has been described above in more detail.
4. Refresh: Members that are adjacent in the control topology exchange refresh messages regularly. Refresh messages contain some information about the sender's
capacity, and may also contain routing updates.
5. Probe: When a member receives no refresh message from one of its neighbors for
some period of time, it suspects the neighbor to be inactive, and sends a probe
request to it. If the neighbor is not inactive, it sends a probe response back immediately.
6. Inactive report: When a member learns about an inactive member, it can notify
other members using an inactive report. The report indicates which member has
become inactive.
7. Status report: This message type is used for miscellaneous tasks. For example,
promotions could be arranged using status reports.


3 Implementation
A network simulator is a program that simulates a computer network by calculating the interactions between the nodes. In order to evaluate the performance of DMMP, we incorporate
DMMP into the OverSim simulation framework. First, we provide some background on
network simulation in Section 3.1. Then, the DMMP implementation is described in detail
(Section 3.2).

3.1 Network simulation basics


Mathematical methods alone are often not sufficient to analyze the behavior of complex
network protocols. For complex, heterogeneous environments like the Internet, they can
only provide a rough estimation. With testbeds, all the details of a real network can be captured [4], but testbeds are limited in scale. They also lack flexibility: changes to the network
topology or to the protocol software are costly. Network simulations, in contrast, allow an
inexpensive, flexible analysis of network protocols. They are inexpensive because typically
only a single computer is required. (Some network simulators do allow distributed
simulation, in which case several machines can be used to analyze a complex network.) Network simulations are flexible because the simulation scenario can be easily changed. Protocols can be
analyzed for various network topologies and traffic patterns that way. Network simulations
are also suitable for observing the large-scale behavior of a protocol. However, a scalable
network simulation has to abstract from some of the details. That means, not all real world
issues can be considered. For that reason, simulations cannot replace testbeds or actual deployment.
Network simulations can be used for several purposes:
Demonstration: Simulations assist protocol designers and students in getting a better understanding of a protocol [13].
Verification and validation: Simulations can help to assure that the protocol is correct and
meets the needs of its application.
Performance evaluation: Simulations are fully reproducible. Therefore, they are well suited
for evaluating and comparing the performance of different protocols.
In this thesis, we use simulation experiments to evaluate the performance of a rather complex network protocol. We are interested in the large-scale network behavior with at least
thousands of nodes, so testbeds are hardly feasible. The performance metrics are also hard to analyze mathematically.
Following [4], we distinguish between two types of network simulators:
1. Customized simulators are designed for analyzing one specific protocol in a certain scenario. They are optimized for this protocol and offer exactly the desired degree of
detail. Comparing network protocols is difficult with customized simulators. Ideally,
all relevant protocols should be implemented by the same people. For example in [7],
a customized simulator has been used to compare NICE with Narada. Both protocols
have been implemented, as well as all the underlying protocols. We cannot use their
implementations for comparing NICE and DMMP. In addition, our results are not fully
comparable with their results because we do not use the same network simulator.
2. Common simulators provide a generic simulation framework that is independent of the
simulated protocols. With such a modular framework, protocols need to be implemented only once. Then everybody can use them for their own simulations. Typically,
the people who have devised a protocol will provide an implementation; that means,
efficient implementations can be expected.
In Section 3.1.1, common simulators will be described in more detail; two typical examples
(nsnam, OMNeT++) will be introduced in Sections 3.1.2 and 3.1.3. With those simulators a
wide range of protocols can be simulated. We will address simulation frameworks that are
designed for overlay and peer-to-peer protocols in Section 3.1.4. In that context, OverSim,
which is used for our simulation experiments, will be introduced. OverSim is based on
OMNeT++.

3.1.1 Common network simulators


In this section, a closer look at common simulators is taken. Mostly following [4] and [10]
we describe the properties of common simulators:
Principle of the simulation engine: The prevalent principle is discrete event processing. The
network is modeled as a discrete event system. Such a system changes its state at discrete points in time. State changes are called "events". Nothing happens between two
events. Discrete event processing works as follows [34, pages 37–38]: initially, some
events are scheduled. That means, each event receives a timestamp indicating when it
is supposed to take place. All scheduled events are kept in a global data structure (e.g.
the event queue), and processed one by one in the order of the timestamps. An event
may spawn new events when it is processed. There are other engines, e.g. simulation
based on Markov chains [19], which are not described here because they are hardly
used in network simulation.
Focus: Some common simulators target a specific type of protocols. For example, OverSim,
which will be described in the next section, focuses on overlay network simulation. nsnam is intended for simulating Internet protocols. In contrast, other simulators provide a generic "simulation language" [10] coupled with protocol libraries. These simulators are not limited to network simulation, but support a wide range of systems.
Degree of abstraction: Network protocols and their environment can be very complex. There
is clearly a trade-off between the degree of realism and the consumed resources. Without abstracting from some of the details, simulations of large networks may take a long time
to execute and demand a large amount of memory. It is desirable that the degree of
abstraction can be adjusted. That way, it is possible to analyze the protocol details, as
well as the large-scale behavior.
Efficiency: An efficient simulator consumes few resources, and can handle complex networks. Hence, the programming language that the simulator is implemented in plays
a role.
Extensibility: An important idea of common simulators is that different researchers can implement a large number of protocols on top of the simulation framework. Therefore,
common simulators need to be highly extensible. Clean interfaces and a comprehensive documentation can make extensions easier.
Availability of protocol implementations: Common simulators allow the combination of
existing protocol implementations in order to create more complex simulation models,
instead of building them from scratch. For this reason, simulators with a large library
of protocols are more attractive to researchers. For simulations of Internet protocols, it
is important that basic Internet protocols like IP or TCP are provided.
Visualization: Most common simulators can create an animation of the simulation model,
allowing the user to observe the protocol behavior. This is important for debugging.
Verification and debugging of protocol implementations is in general difficult, so additional (non-visual) debugging and tracing features can be useful. Visualization can
also help users to acquire an intuitive understanding of the simulated protocols [4].
Scenario generation: The term scenario refers to the entire simulated environment, including the network topology, dynamic topology changes and traffic patterns. Typically,
network protocols are mostly independent from the scenario. The protocol is defined
once, and is then analyzed in a number of different scenarios. A simulator should provide support for (1) implementing the protocols efficiently, (2) defining and changing
scenarios quickly, and (3) adapting protocol parameters to fit the scenario. Scenarios
often cannot be specified manually because they are too complex. Therefore network
simulators should also support randomized scenario generation (e.g. topology generators) and scenario libraries (including e.g. real world topologies).
Statistics support: In order to evaluate the performance of a protocol, statistical data needs
to be gathered. Some simulators explicitly support this by offering an easy interface for recording statistics. Additionally, tools for visualizing the collected data may be
provided.
Interaction with real networks: This concept has two main applications: (1) confronting a
real network with simulated traffic, and (2) confronting simulated nodes with real traffic.
In the latter case, one or several real nodes are emulated.
We try to illustrate these properties in the next sections by introducing two popular, object-oriented common simulators, nsnam and OMNeT++. We focus on OMNeT++, which is used
for our simulation experiments. Both simulators are free/open-source software. As they are
both written in C++ and use a discrete event processing engine, their efficiency is roughly
comparable. OPNET Modeler is another popular common simulator with a comprehensive
protocol library. We leave OPNET Modeler out because it is not fundamentally different
from nsnam and OMNeT++.

3.1.2 nsnam
nsnam, which is also referred to as the VINT [17] simulator, consists of two tightly coupled
parts: ns-2 [4] provides the simulation engine; "ns" stands for "network simulator", "2" is a version number. ns-3 is currently still work in progress. nam [15] (for "network animator")
provides visualization. ns-2 generates a trace file, which nam interprets. nsnam focuses on
Internet protocols, and it is highly popular in that domain. For example, it has been used a
lot in research on TCP. Consequently, many Internet protocols have been implemented for
nsnam. A "split-programming model" [10] is used. That means, simulations are specified
using two different programming languages. OTcl, an object-oriented scripting language,
handles the parts of the simulation that change frequently, that is, mostly the scenario generation. The protocol details are implemented in C++. When OTcl scripts are executed,
the instantiated OTcl objects are mirrored to C++ objects, so that they can interact with the
native C++ code. The split-programming model expedites scenario generation, and at the
same time allows performance-critical algorithms to be implemented efficiently. C++ simulation objects can be combined to create more complex "macro-objects" [10] using OTcl.
However, macro-objects cannot be further combined. This limits the extensibility of nsnam.
Another problem is that the simulation engine and the implementations of the basic Internet
protocols are not cleanly separated [33].
The degree of abstraction can be adjusted. Three network layer models are provided: in
the first model, hop-by-hop forwarding and dynamic routing updates are simulated. In the
second model, routing is static and centralized. In the third model, routers are not simulated
at all. nsnam uses traffic model and topology libraries, and supports topology generators
(e.g. the popular GT-ITM package [16]). A framework for systematic testing of protocol
implementations, called "STRESS", is provided as debugging and validation support, in addition to nam. Interaction with real networks is possible as well. A part of the nsnam code
has been transformed into the Telecommunications Description Language, which allows distributed simulation [30]; nsnam itself does not have this feature.

3.1.3 OMNeT++
OMNeT++ [33][34] is free for academic non-profit use. There is also a commercial version
called OMNEST [1]. OMNeT++ has a much broader focus than nsnam. Any system that
can be modeled as a discrete event system can be simulated. OMNeT++ is mostly used
for computer network simulation, but it could also be used for e.g. analysis of hardware
architectures. It consists of a simulation kernel, a simulation library, component libraries and
user interfaces [34, pages 211–212].
The simulation kernel mainly handles the discrete event processing. It supports distributed simulation.
The simulation library offers support for common simulation tasks. It includes, for
example, random number generators and containers, as well as classes for gathering
statistics. We will elaborate a bit on the statistics support: Output vectors are collections
of (time, value) pairs, which are recorded over the course of a simulation run. For
example, assume that packet round trip times are measured regularly in a simulation.
Then all the individual measurements could be stored in an output vector. The output
vector writes the data to a file. The data can later be plotted using plove, a tool that
comes with OMNeT++. Figure 4.7 has been generated by plove. The file format is very
simple, which makes post-processing using external tools rather easy. An output scalar
stores a single scalar value and a description string. Scalars are typically recorded at
the end of a simulation run. Example: one could count the lost packets over the course
of a run, and record the total number as a scalar at the end. The tool scalars can be used
for post-processing.
There are two alternative user interfaces; the text-based, non-interactive Cmdenv for
batch execution, and the richer graphical user interface Tkenv1. Tkenv not only
provides animation, but also additional debugging and tracing support. Most notably,
it is possible to inspect all simulation objects, such as messages, modules, parameters
(see below) or output vectors, at runtime. Tkenv is shown in Figures 3.2 and 3.4.
The component libraries contain mostly the protocol implementations. OMNeT++ is
completely independent of these libraries; in fact, it does not come with any component libraries. For example, the INET framework provides the essential Internet protocols. In nsnam, in contrast, the basic Internet protocols are an integral part of the simulator itself. Simulation objects are wrapped in modules. These modules can be arbitrarily combined to build more sophisticated modules. Component libraries consist of
a number of related modules.
OMNeT++ simulation models are implemented as follows: simple modules are implemented
as C++ classes; they are not composed of other modules. Compound modules contain other
1 Tkenv is based on the graphical user interface toolkit tk [2], hence the name.


Figure 3.1: Nested modules in OMNeT++ (the system module contains a standard host compound module, which in turn contains simple modules such as ppp and eth and a network layer compound module containing ip, icmp, etc.)

modules, which may be simple or compound. They are described using the NED language,
which is a simple compiled programming language with a syntax similar to C. OMNeT++
provides a compiler that translates NED code to C++ code. That means, there is also C++
code for compound modules, but it is usually not written manually. All modules of a simulation are included in the system model. That means, there is a module hierarchy rooted at
the system model. For network simulation, the system model usually represents a network.
In Figure 3.1, a simple module hierarchy is shown. The example is taken from a modified
version of the INET framework, which will be further described in Section 3.1.4. The gray
boxes are simple modules, and the white boxes are compound modules. The system module is a network that consists only of a single end-host. The end-host is represented by a
compound module, and composed of various protocols and data structures. For example,
the Point-to-Point Protocol (PPP) is implemented as a simple module. The network layer
protocols are wrapped in another compound module. Remember that this kind of nested
model is not supported by nsnam. Normally, some modules are connected to each other.
This is not shown in Figure 3.1. Modules can be connected to modules on the same level
of the hierarchy, and to their parent and child modules. Connected modules communicate
by exchanging messages. For network simulation, messages are typically packets or timers.
Messages are very similar to events (see discrete event processing in Section 3.1.1): the arrival of a packet can be seen as an event, and, inversely, timers can be seen as a message
that a module sends to itself. Modules can also be parametrized. Parameters are set in
compound module definitions using NED or in a separate configuration file. That way, protocol implementations can be quickly adapted to the simulated scenario. For example, the
IP module in Figure 3.1 has a parameter "time to live" that determines the initial value of
the time to live field in the IP header. Clearly, this value has to be adjusted to the simulated
network topology.
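Such a parameter could, for example, be set in the configuration file. The following fragment is purely illustrative; the actual module path, parameter name and section layout depend on the OMNeT++ version and the INET framework:

```ini
# Illustrative only: assigns the time-to-live parameter of an IP module
# (the parameter path is an assumption, not taken from INET)
[Parameters]
**.networkLayer.ip.timeToLive = 32
```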


In summary, there is a clean separation of the simulation kernel, the simulation library,
user interfaces and component libraries, which makes it easy to extend OMNeT++. In particular, scenario generation and protocol implementation are separated. OMNeT++ uses two
different programming languages: protocols are implemented as simple modules in C++;
network topologies are described in NED. Any other simulation properties
that have to be changed frequently can likewise be specified in NED via parameters. This is similar to the
split-programming model of nsnam. However, nsnam macro-objects do not have parameters. There are arguably more protocol implementations for nsnam and OPNET Modeler
than for OMNeT++ though.

3.1.4 Overlay network simulators


In the previous sections, common network simulators with a rather wide focus have been
described. In this section, we introduce common simulators that are designed for simulating
peer-to-peer and overlay networks, and thus have a smaller focus. These networks typically
consist of a very high number of nodes. Relatively small networks contain thousands of
nodes. Consequently, most overlay simulators provide abstract but scalable network models. Some overlay simulators have additional features that facilitate the implementation of
typical peer-to-peer protocols. For example, a generic lookup function could be provided,
as many peer-to-peer protocols require a lookup function. However, overlay simulators are
not intended for application layer multicast protocols, and thus many of the features are not
useful for our purposes.
We have selected OverSim for our simulation experiments. To explain this choice, we specify our requirements first. Our results have to be comparable with the results reported in
[7]. Therefore, we need to reproduce the simulation scenario used in [7]. This is described
in more detail in Section 4.1.2. For now, it is sufficient to consider the following aspects:
the underlying network is modeled in some detail. That is, routers are modeled. 10,000
routers are placed using GT-ITM, and up to 2048 end-hosts are simulated. A custom
churn model is used. Stress, packet loss, control overhead, data path length and stretch are
measured. From this we can deduce our requirements:
There should be a rather detailed model of the underlying network that is comparable
with the model used in [7]. The model should also support end-hosts with heterogeneous bandwidths.
The underlying network model and the simulation engine need to be efficient. They
should scale to a high number of network nodes. While we do not intend to simulate
hundreds of thousands of nodes, it should be possible to simulate up to about 20,000
nodes.
Support for GT-ITM is desirable, otherwise it is difficult to reproduce the network
topology used in [7]. Traffic generators are less important; a simple constant bitrate
model will suffice.


There should be support for implementing arbitrary churn models.


In order to evaluate the performance of DMMP, statistics need to be gathered. The
simulator should provide support for this.
The following properties are desirable for all simulation tasks: the simulator should
provide a clean, well-documented API, an easy-to-learn scripting language for scenario
specification, and some debugging aids [28].
According to [28], most overlay simulators do not meet these requirements at all. The lack
of scalability, documentation, extensibility, and statistics support is criticized, as well as the inflexible models of the underlying network. OverSim is a very recent approach that tries
to address these problems [9]. There are already a number of protocol implementations for
OverSim.
OverSim is based on OMNeT++. Hence, it can exploit all the advantages of OMNeT++
that we listed in Section 3.1.3, such as free usage, high extensibility, statistics support, visualization, and efficiency. OverSim provides three underlying network models:
In the simple model, the underlying network is mostly abstracted out. Messages exchanged between the overlay nodes are delayed by a constant period of time. Alternatively, the overlay nodes can be placed in a two-dimensional Euclidean space. In this
case, the delay is computed as a function of the Euclidean distance and the last-hop
bandwidth (see Section 3.2.3).
In the single host model, a single end-host is emulated. This model is intended for
interaction with real networks.
The INET model is based on a revised version of the INET framework. The INET framework is an OMNeT++ component library that provides some essential Internet protocols. It features simulation models of all OSI layers "from the MAC layer onwards" [9].
More precisely, the following protocols/technologies are implemented:

Layer 2: PPP, Ethernet, 802.11
Layer 3: various routing protocols, IPv4, IPv6
Layer 4: TCP, UDP, RTP

In addition, there are several simple application models. In summary, the INET model
is very detailed. The developers of OverSim have profiled the INET modules to increase their efficiency and scalability. For example, a static routing protocol has been
added.
The underlying network model and the overlay protocols are cleanly separated. For example, it is possible to exchange the underlying network model without changing the overlay
protocols. OverSim also offers some support specifically for peer-to-peer protocols. This includes a generic lookup function, bootstrapping support, visualization of the overlay topology, and collection of some statistical data, such as "the number of sent, received, forwarded and dropped packets per node" [9]. Most of these features are too specific to be used for the
DMMP simulations, but they can be turned off easily, and they seem to come at little cost.
The authors of [9] claim and demonstrate that runs with 100,000 end-hosts are feasible.
What OverSim offers us, beyond the functionality of a common simulator, is mostly
the INET model, which is very suitable for our purposes. Many other overlay simulators
lack a detailed underlying network model. Unfortunately, OverSim also has some shortcomings. Firstly, its scalability is not fully satisfactory. In particular, OverSim consumes
large amounts of memory. According to [9], each node requires about 70 kilobytes of memory. In our experience, however, routers require far more: each
router creates one entry in its routing table for every other router, so the memory
consumption increases quadratically with the number of routers. Several gigabytes of memory are necessary to simulate 10,000 routers. Secondly, OverSim does not support topology
generators; in [9], this is listed as future work. OverSim does not provide a suitable traffic
model either; however, this can be added easily. Thirdly, the documentation is rather sparse in
parts; for example, it does not describe how to add new overlay protocols.

3.2 DMMP simulation


We have implemented DMMP on top of OverSim version 2007-07-24. In Section 3.2.1, we
will describe the OverSim API, and how OverSim has been extended. The following sections
describe in detail how DMMP has been implemented. Our implementation does not fully
comply with [25] for two reasons:
1. Not all the details are specified in [25]; many parameters are left open on purpose
because they are application-specific. A few aspects have not been completely thought
out yet, for example the promotion algorithm.
2. For lack of time, some simplifications were necessary. We left out the adding and dropping of mesh links in the overlay core. That means, the initial mesh does not improve over time. Therefore we use a centralized algorithm to construct a reasonable initial mesh. This is also easier to implement than the distributed algorithm that [13] proposes. In DMMP, the super nodes regularly exchange routing updates. This is mainly necessary in order to decide whether links should be added or removed from the mesh. As this feature is not implemented, we also slightly simplify the routing algorithm. In general, we abstract from the bootstrapping and initial overlay construction, and concentrate on analyzing the behavior of the protocol after it has stabilized.
The deviations from [25] are described in more detail in the following sections. In Section
3.2.9 we summarize our implementation design.


3.2.1 Extending OverSim


OverSim has been introduced in Section 3.1.4. We concluded that OverSim is suitable for
our simulation experiments, but we also noted some shortcomings. Consequently, there are
some problematic aspects we have to consider for our implementation:
Efficiency: OverSim consumes a high amount of memory for simulating the routers. The
remaining memory has to be used economically. Execution speed is an issue, too. In
general, the execution speed of OMNeT++ grows linearly with the number of messages, and the number of messages is proportional to the number of nodes. However, things get worse when message handlers take a long time to execute, or when unnecessary messages are scheduled. Therefore, time complexity has to be considered.
Debugging: DMMP is a rather complex protocol, so debugging is a concern. In addition, OMNeT++ is prone to memory allocation problems. The programmer is responsible for dynamically allocating and deleting messages. The object inspector keeps track of created messages, which can help in discovering memory leaks. Nevertheless, [34, pages 175-176] acknowledges that this may not be sufficient, and suggests using valgrind or a similar external debugger.
In OverSim, end-hosts are compound modules. We use the INET underlying network
model. Figure 3.2 shows the modules that an end-host ("OverlayTerminal") is composed
of in this model. The PPP module is connected to the OverlayTerminal, which is in turn
connected to an access router. On top of PPP, there is a protocol stack consisting of IP and
closely related network layer protocols, UDP, an overlay protocol, and some higher layer
applications. That means, the overlay module receives data from the "Tier 1" module, and
passes data to the UDP module. The overlay module can be any module that implements an
overlay network protocol. The only condition is that the overlay module has to be derived
from the class BaseOverlay. This is illustrated in Figure 3.3. BaseOverlay is in turn derived
from cModule, which is the OMNeT++ base class for simple modules. That means, the overlay module has to be a simple module. In DMMP, the source behaves differently from the
other end-hosts. Therefore, it makes sense to have two different modules, DMMPMember
and DMMPSource, and thus two different types of end-hosts. OverSim supports this. Common functionality of all DMMP-aware end-hosts is implemented in a DMMP base class. The
BaseOverlay member functions shown in Figure 3.3 should be implemented by every overlay module. There are a number of other important member functions, which have been left out in Figure 3.3. Note that the interface of the overlay module strongly resembles the interface of a class that implements a network protocol in a real-world application. The internal structures are similar as well: they mostly consist of handler functions for various message types and timeouts. However, there are also some differences from a programmer's point of view. For the simulation module, common real-world issues, such as parsing incoming packets, can be disregarded. Simulations also leave more room for simplifications. For example, in DMMP it is possible that a (possibly malfunctioning) member sends out inactive
reports about a member that is actually not inactive. Although this situation should occur


Figure 3.2: End-host in the INET underlying network model (screenshot from Tkenv)


Figure 3.3: The OverSim API: simplified class diagram (inheritance hierarchy cModule → BaseOverlay → DMMP → DMMPMember/DMMPSource; BaseOverlay declares initializeOverlay(), finishOverlay(), handleUDPMessage(msg), and handleAppMessage(msg))

rarely, a real-world implementation has to be able to handle it. In our simulations, we can disregard spurious inactive reports entirely. However, many simplifications are ruled out because they would affect the performance measurements significantly.

3.2.2 Components
Figure 3.4 shows the initial components of our simulation model. The underlying network
consists of a number of interconnected routers (illustrated as towers), and does not change
over the course of a simulation run. The underlying network is further described in Section
4.1.2; most of the details are not important here because in OverSim, the overlay modules
are independent of the underlying network model. The notebook symbols represent end-hosts. They are the initial members of a DMMP multicast group. One of the initial group members is the source; the others do not send data to the group. Over time, more end-hosts will be added to the model, and others will be removed (the source is not removed, though). All end-hosts are placed using dynamic module creation [34, pages 76-78], meaning that they are placed via manually written C++ code.2 Each end-host is attached to exactly one router. Links have a bandwidth and a propagation delay [34, pages 40-41]. Consider Figure 3.5. Node A sends a message to node B over a network link. The first bit is sent at time t0, the last bit is sent at t1. The difference t1 − t0 is a function of the link bandwidth and the message size. With a bandwidth of b bits per second and a message size of s bits, it is t1 − t0 = s/b. The propagation delay is simply t2 − t0, where t2 is the arrival time of the first bit. As bandwidth and
2 Simulation objects can also be created via NED, which is called static creation because, that way, the objects are placed immediately at the beginning of the simulation run, and destroyed at the end. This method is used for placing the routers.


Figure 3.4: The initial components of the DMMP simulation model (screenshot from Tkenv)

Figure 3.5: Message transmissions in OMNeT++ (node A sends a message to node B; the diagram marks when the first and last bit are sent and when they arrive)


propagation delay do not change over time, it also holds that t1 − t0 = t3 − t2 and t2 − t0 = t3 − t1, where t3 is the arrival time of the last bit. Links are symmetric in the sense that the bandwidth and propagation delay are the same in both directions. Only one message can traverse a link at a time; that means, upstream and downstream traffic are not differentiated. This is somewhat problematic because, e.g., ADSL links cannot be modeled.
The RP (illustrated as a lightning bolt in Figure 3.4) is not modeled as a network node. Instead, it is accessed via direct method calls. Typically, OMNeT++ modules communicate by exchanging messages, but for convenience, public member functions of a module can also be called directly. This requires some care: the simulation kernel requires each simulation object to be "owned" by a module, and direct method calls can break ownership mappings.
Therefore the simulation library provides functions that notify the kernel of direct method
calls. It also helps the programmer to obtain references to foreign modules. A more obvious problem with direct method calls is that transmission times are not modeled. The calls
happen immediately, which is not realistic in the case of the RP. However, the RP is hardly
involved in the communication once the initial control topology is established, so the transmission delays should not matter for our performance measurements. Having said that, the
RP is highly involved in the bootstrapping and initialization. According to [25] it has to
provide every newly joining host with a partial member list. For large groups, the RP may
in fact not be able to handle all the queries. That means, in a realistic model several RPs
can be necessary. We avoid these problems by modeling the RP as an entity outside of the network, and focus our performance measurements on everything that happens after the
network, and focus our performance measurements on everything that happens after the
overlay initialization. Evaluation of the performance during the initialization is possible future work. The RP also aggregates some of the collected statistics. In that case, direct method
calls are clearly appropriate because the statistics collection should not affect the measurements. OverSim provides a similar module, the GlobalObserver (also shown in Figure 3.4).
It is intended for bootstrapping and statistics collection, but the offered functionality is not
compatible with DMMP. We use our own RP module instead.
The IPv4UnderlayConfigurator and the ChurnGenerator are mainly responsible for filling
the routing tables and placing the end-hosts. These aspects are discussed in Section 4.1.2.

3.2.3 Overlay construction


We try to simplify the initial overlay construction because it is not essential for our performance measurements. We divide this process into three steps: the bootstrapping, the
construction of the overlay core, and the construction of the clusters.
1. Bootstrapping: End-hosts are identified by their IP addresses; DNS is not modeled.
Newly joining end-hosts obtain the IP addresses of some existing group members by
contacting the RP. Each initial member contacts the RP and announces its last-hop bandwidth. This is the bandwidth of the link that connects the member to a router. That
link is also the last hop of the route from an arbitrary node to the member. In reality,


the bandwidth has to be measured and may change over time. Effective bandwidth
measurements are rather difficult to implement, so we leave them out for simplicity.
Instead, we "cheat" by obtaining the bandwidth directly from the link. OMNeT++ implements links as cConnection objects, so the bandwidth can be accessed easily. Unfortunately, this approach leads to slightly unrealistic results because real measurements
would be less accurate and would induce some control overhead.
2. Construction of the overlay core: The RP stores the IP address and bandwidth of each
initial member. It chooses a number of initial members as super nodes in order of
bandwidth. The number of super nodes, which also determines the average cluster
size, is an application-specific parameter. The effect of this parameter is investigated
in Section 4.2.1. DMMP adopts the overlay core maintenance from Narada. In Narada,
the initial mesh can have poor quality, but it improves itself over time. As we do not
implement the self-improvement mechanism, we use a relatively simple centralized
algorithm to construct a reasonable initial mesh. First, the maximum degree of each
super node i is determined as
degree(i) = b(i) / (r + c(i)),
where b(i) is the last-hop bandwidth of i in bits per second, r is the constant bitrate that the source sends at, and c(i) is the expected control overhead in bits per second. The control overhead will be analyzed in Section 4.2.1. As the underlying network topology
is modeled as an undirected graph, we cannot differentiate between outdegree and
indegree. For each packet that the source sends to the group, there is up to one end-to-end transmission per mesh link. This is because a reverse path routing algorithm is
used. Therefore, the maximum number of mesh neighbors is degree(i ) for each super
node i; each incident mesh link "consumes" one "degree unit". It seems reasonable
to let each super node "reserve" some degree for children in its local cluster, so super
nodes with high maximum degree should have more neighbors than super nodes with
relatively low maximum outdegree. A modified version of Prim's algorithm is used to construct suitable spanning trees. We do not go into more detail here because the centralized mesh construction is only a temporary solution, and should be replaced by an efficient, distributed algorithm. Using the algorithm described in [36], k spanning trees are constructed. That means, the density of the mesh is adjusted via the parameter k. As the number of super nodes should be less than 100, and a very dense mesh requires a high amount of reserved degree, a reasonable value is 2 ≤ k ≤ 5. Note that for k = 1 the overlay core is a tree. We experiment with the value of k in Section 4.2.1. Finally, the source chooses as many neighbors as its maximum degree allows.
The chosen super nodes receive data directly from the source.
3. Cluster construction: The RP contacts the initial members that have not been chosen as
super nodes. They join a cluster using the join algorithm described in the next section.


Figure 3.6: The join algorithm: message flow (the left part shows the data topology of the cluster, the right part the message exchange over time: A and B send JoinRq messages to C; C answers A with JoinRsp(ACK) and B with a JoinRsp(REJ, ...) listing C's children; B then measures the round trip times to those children with RttTest/RttRsp messages and sends a JoinRq to the nearest one)

After joining, they notify the RP. As soon as all initial members have confirmed their
positions, the RP tells the source to start the data delivery.

3.2.4 Join algorithm


The join algorithm is implemented as described in Section 2.3.2. It is illustrated in Figure
3.6. The left part shows the data topology of a DMMP cluster. We only consider the subtree
rooted at C. D is the only child of C. Assume that C has a maximum degree of three; that
means, it can accept one more child. A and B attempt to join the cluster. Both have queried
the RP for a list of s randomly chosen super nodes. Based on distance measurements and
previous join attempts, both have independently chosen C as their desired parent. Let A
have a higher capacity (see below) than B.
First, B sends a join request to C, including B's capacity. C does not accept B as a child right away, but waits for additional join requests. After all, it can only accept one more child. C waits for tj seconds, then it chooses a child from among the joining end-hosts. In the example, C receives a join request from A within that time. C chooses A as a child because A has
a higher capacity than B. From Figure 3.6 it can be observed that the end-to-end delay from
A to C is actually higher than the delay from B to C; however, children are chosen solely
based on capacity. If C had a higher maximum degree, it would accept B as a child as well.
In this case, there would be no advantage in waiting for additional join requests. C adds an
overlay link to A, meaning that C adds A to its internal list of children. When C receives data,


it forwards the data to all end-hosts on that list. C also starts sending refresh messages to
A. This is covered in Section 3.2.6. C sends a join response to both A and B. The response
to A indicates that A has been accepted as a child; A knows that it has successfully joined
the cluster, and starts sending refresh messages to C. B receives a join response indicating
that it has been rejected as a child. This rejection includes the IP addresses and capacities of C's children: A, which has just been accepted as a child, and D. B uses an internal candidate parent cache to store this information. C is removed from the cache because it is currently saturated. B chooses a parent from among C's children; other entries in the candidate parent cache are not considered. They are only considered if none of C's children accepts B. B
determines its distance to A and D by measuring the round trip times of (empty) test messages. There are other possible distance metrics. For example, the length of the shortest path
in the underlying network could be measured. However, this method seems less reliable.
Measuring the round trip time is a simple, intuitive approach. B sends a test message to
A and D simultaneously. The test response of A arrives first. That way, B knows that A is
nearer than D, and sends a join request to A. In our example, A has not yet received the join response from C when it receives the test message; this is unproblematic. When A receives the join request from B, it has already joined the cluster. For simplicity, assume that A has
a maximum degree of two. If it had a higher maximum degree, it might try to switch its
position with C, which would make things more complicated. Consequently, A can accept
one child. There are no concurrent join requests, so A accepts B as a child, and B has joined
the cluster.
Now assume that the maximum degree of A is only one, that is A cannot accept a child,
and thus is a leaf node. In this case, C does not include the address of A in the rejection
message because C knows that A does not have enough capacity. Let A and D both be leaf
nodes. In this scenario, B receives an empty rejection (that is, without any IP addresses).
Then B contacts one of the members in its candidate parent cache. If the cache is empty, B
knows that all clusters that it tried to join are saturated. As the ping-RP query provides a list of randomly chosen super nodes, there may be other clusters that are not saturated. Therefore B queries the RP for a new list of super nodes.
Below some interesting parameters of the join algorithm are discussed:
Each member waits for tj seconds before choosing its children. A high value of tj can avoid promotions because end-hosts with low capacity are placed at low cluster positions from the beginning. With tj = 0, members do not wait for concurrent join requests. That way, end-hosts can join and, more importantly, rejoin faster, but more promotions are necessary. If the average degree of end-hosts is high, it is clearly pointless to wait for join requests, because most of the time all joining hosts can be accepted. As DMMP targets high-bandwidth applications, the average degree will typically not be that high, though. We set tj > 0 during the initial cluster construction. In this phase, a high number of join requests is sent at the same time. In addition, users may be
a high number of join requests is sent at the same time. In addition, users may be


more willing to tolerate delays before the playback has initially started (in the case of multimedia streaming). In most cases, the media player buffers incoming data before starting the playback. That means, there is an initial delay no matter how fast the join algorithm is. We believe that it is better to set tj = 0 after the initialization phase, mostly in order to speed up rejoin attempts, and thus reduce packet loss.
There are some other timeouts involved in the join algorithm that have not been mentioned for simplicity. Clearly, joining end-hosts cannot wait for a join response indefinitely. The desired parent may have left the group before sending a join response.
The same applies for responses to round trip time tests. In both cases, the joining host
is not interested in responses of distant members. Therefore the timeouts should be
short.
Recall Equation 2.1. In [25], the units of the bandwidth and uptime are not specified. If "bit per second" is used as the unit of the bandwidth, and "seconds" as the unit of the uptime, then the uptime is virtually irrelevant; the capacity c of an average member would be

c = b + t/n,

where b is the average last-hop bandwidth, t is the average uptime, and n is the group size. In a typical scenario, the maximum uptime tmax is a few thousand seconds (a few hours), and the group has thousands of members. In our implementation, the units stated above are used, but there is a weighting factor u:

c(j) := b(j) + (b(j) / Σ(i=1..n) b(i)) · t(j) · u,   1 ≤ j ≤ n.   (3.1)

We give both summands roughly equal weight by choosing

u = B / tmax,

where B is an estimate of the total bandwidth. Assuming that B = Σ(i=1..n) b(i), we get

b(j) ≤ c(j) ≤ 2 · b(j),   1 ≤ j ≤ n.   (3.2)

The capacity of a member m1 with maximal uptime is 2 · b(m1), and the capacity of a member m2 that has just joined the group is b(m2). When a high number of membership changes is expected, the uptime can be given more weight, and vice versa.
Members obtain the total bandwidth as follows: for every member, the RP keeps track
of (1) the IP address, (2) the last-hop bandwidth, and (3) whether the member is a super node. The last field is required to answer ping-RP queries. The RP periodically
calculates the total bandwidth and delivers that information to the group. The total
bandwidth updates are attached to some of the periodic refresh messages. These refresh messages also need to carry a timestamp. That way, a member that receives a
refresh message can decide if the total bandwidth update is outdated.


The RP answers ping-RP queries with a list of s super nodes. s needs to be limited
because each newly joining end-host measures its distance to all s super nodes. For
large s, the associated control overhead at the joining host is unreasonable.

3.2.5 Data delivery


The data dissemination in the clusters is easy to implement: each cluster member keeps
track of its children, and forwards incoming data to all children. Things are slightly more
complicated in the overlay core: [25] suggests using DVMRP for data delivery, as does [13]. More precisely, every super node is supposed to maintain a routing table, and to keep
track of the shortest path to each other super node. Adjacent super nodes should exchange
routing updates by attaching their routing tables to the periodic refresh messages. These
comprehensive routing tables are mainly used to decide if mesh links should be added or
dropped. For simplicity, we do not implement this behavior. We also use a simpler routing
algorithm.
The source assigns every data packet an ascending sequence number q. Super nodes keep track of the sequence number qL of the last received packet. When a super node receives a packet, it considers the packet as new if and only if q > qL. New packets are forwarded via all incident mesh links, except the one the packet arrived on. This flooding algorithm has been described in [14]. Up to two copies of each data packet are sent per mesh link. This is very inefficient, and can be avoided. Each super node identifies its children in
the data delivery tree. As described in [14], the nodes can identify their children by keeping
track of each neighbor's distance to S, for all sources S. This is referred to as Reverse Path
Broadcasting. In DMMP, there is only one source. It is sufficient to maintain a routing table
with two columns: IP address and distance to the source. We use overlay hops as the distance metric. Each super node adds its own best known distance to the source to the refresh
messages that it exchanges with its mesh neighbors.
Note that this routing algorithm is only a temporary solution. The routing updates in
Narada induce much higher control overhead. In particular, the size of a routing update
is in O(n), where n is the group size, whereas our routing updates have a size of O(1).
In order to make our results more realistic, we increase the size of the refresh messages
accordingly. This is not difficult in OMNeT++: the size of a message can be set arbitrarily; it
is not automatically calculated based on the content of the message.

3.2.6 Handling inactive non-super nodes


When non-super nodes leave the group, the local cluster may become partitioned. In this
section we describe how inactive non-super nodes are detected, and how the data topology
is repaired.


Figure 3.7: Refresh message exchange between parent and child (after the JoinRsp(ACK), parent P and child C periodically exchange Refresh messages; each side schedules a refresh timer and resets its probe timer whenever a refresh from the other side arrives)

Detecting inactive non-super nodes: In our implementation, cluster members exchange control messages only with their parents and children, not with any other relatives. Maintaining these relationships seems rather difficult to implement, so we leave this as future work. Nevertheless, it would have been interesting to see how these additional mesh links affect the performance, and how frequently remote relatives should exchange refresh messages. For simplicity, we do not implement refresh requests and inactive responses (see Section 2.3.2), either. We are also uncertain whether a real-world implementation should use these message types. It is sufficient to exchange refresh messages periodically, and lost inactive reports can be tolerated. Additionally, inactive reports are sent rarely, due to the trimmed control topology.
Refresh messages are sent when a parent-child relationship has been established. Subsequently, they are sent every tc seconds.3 In Figure 3.7, member P accepts C as a child. Right after that, P sends a refresh message to C, and starts a probe timer. When
this timer expires, P sends a probe request to C. If P receives a refresh message from
C before that, it resets the probe timer. Refresh messages are sent periodically, for this
purpose P schedules a refresh timer (tc seconds). When the refresh timer expires, it is
reset, and a refresh message is sent.
The parameter tc is clearly one of the most important factors for the control overhead and the packet loss. With a high value of tc, it takes members a long time (td ≥ tc) to detect missing neighbors. Consequently, it takes a long time to repair partitions. Furthermore, capacity updates and total bandwidth updates are included in the periodic refresh messages. If tc is high, capacities are updated less frequently. This is a minor factor, though, because capacity updates are mainly needed for promotions, and the promotion algorithm does not rely on the capacities being up-to-date. On the other
3 c stands for "cluster" here; refresh messages may be sent more frequently in the overlay core (mesh, hence tm).


Figure 3.8: Detection of inactive members: message flow (C's probe timer for its parent P expires without a Refresh arriving; C sends a ProbeRq and starts the inactive timer)

hand, a low tc leads to high control overhead. Probe timers expire after tp ≥ tc seconds. As the transmission times of the refresh messages may vary, the probe timers should be set conservatively. On a side note: if the first refresh message that P sends happens to reach C before the join response, then C ignores the refresh message. Chances are that the next refresh message arrives before C's probe timer expires. Otherwise, C probes P, which is unproblematic as well.
Figure 3.8 shows how inactive members are detected: C has scheduled a probe timer
for its parent P. P has left ungracefully, and eventually the probe timer expires. C then
suspects P of being inactive and sends it a probe request. It uses an inactive timer to
wait for a probe response. The timer runs for ti seconds. After that, C assumes that P
has left ungracefully. In Figure 3.8, the refresh messages sent by C have been left out
for clarity.
Above, an example of a spurious probe request has been given. When an active member receives a probe request, it immediately sends a probe response back. In Figure
3.8, C would reset the probe timer in that case. A gracefully leaving member sends an
inactive report about itself. Sending the inactive reports takes a small amount of time.
We minimize this delay: leaving members send only one inactive report. This report includes the IP addresses of the leaving member's children, and it is sent to the leaving member's parent. The parent sends an inactive report about the leaving member to each of the now orphaned children. If the parent leaves before receiving the inactive
report, the orphans will not be notified. In this case, they have to detect by themselves
that their parent is missing.


The total time to detect an inactive member is td ≤ tp + ti. Therefore ti has a high impact on packet losses. Then again, the purpose of the probe requests is to make sure that active members are not falsely assumed to be inactive. In the simulation model, this can be easily assured because no messages are lost. There are only a few very special situations in which refresh messages are ignored or get delayed. Therefore a small value for ti would be sufficient. However, in a real network, refresh messages can be delayed or lost due to congestion. In this case, it is well possible that the congestion also affects the probe messages. For that reason we set ti to a rather high value.
Treating inactive non-super nodes: When a cluster member detects that one of its children
is inactive, it removes the inactive end-host from its list of children; it stops forwarding
data to the inactive end-host. In addition, the cluster member notifies the RP. The RP
needs this information to calculate and update the total bandwidth.
When a cluster member detects that its parent is inactive, it rejoins the overlay by finding a new parent. In our implementation, rejoining is not much different from joining
initially. As mentioned above, cluster members have no relatives that could help them
to rejoin. The replacement mechanism is not used either. It is not as easy to implement as it may seem. Members may leave the group ungracefully right after sending
a replacement request. The inactive end-host's parent, which receives the replacement requests, may also leave at any time. An implementation has to be able to tolerate these events. There is another problem: recall the example on page 28. I is chosen as a replacement for D, and J tries to join as a child of I. However, I may not be able to accept another child. In this case, J has to find a different parent. However, the
replacement mechanism should be further analyzed in the future because it is a rather
unusual approach. For example, in HMTP and HostCast, which also take the tree-first
approach, inactive nodes cannot be replaced.
All in all, we believe that our implementation lacks adequate support for rejoining end-hosts. They have to rely on the candidate parent cache, which contains the IP addresses and
round trip times for members that have been contacted in previous join attempts. The
candidate parent cache can indeed be helpful in finding a suitable, new parent; but the
information in the cache can also be outdated. For example, all entries in the cache
may refer to members that have already left the group.
In summary, end-hosts rejoin the overlay using the candidate parent cache as a starting
point of the join algorithm that we described in Section 3.2.4.

3.2.7 Handling inactive super nodes


Leaving super nodes are more difficult to handle than leaving non-super nodes. This is
because all super nodes need to learn about the departure. In addition, the mesh can become


partitioned.
Handling inactive super nodes: When a super node leaves ungracefully, its former children detect this as described in Section 3.2.6 and rejoin the overlay. Again, the missing
node cannot be replaced. Our implementation chooses all super nodes from among
the initial members. It is not possible to allocate additional super nodes later on.
Adjacent super nodes exchange refresh messages. We set the timer for the periodic
refresh messages to tm . These messages contain the same information as the refresh
messages in the clusters. Additionally, they carry the sender's distance to the source (routing information), and sequence numbers, which are used to detect mesh partitions (this is described later in this section). Because of this, refresh messages should possibly be exchanged more frequently in the overlay core, that is, tm ≤ tc. Clearly, tm
has a big impact on how quickly mesh partitions can be detected. We experiment with
the value of tm in Section 4.2.2.
Each super node maintains a list of all other super nodes. This list is mostly needed to
detect mesh partitions. When a super node learns that one of its neighbors is inactive,
it removes the neighbor from its list of super nodes and from its list of neighbors. Then
it notifies all other super nodes using inactive reports. The inactive reports are flooded
via the mesh. When a super node receives an inactive report about a node that is not
on its list of super nodes, it does not further forward the inactive report. That way, the
flood stops at some point. When a super node discovers a missing neighbor by sending it a probe request, it not only notifies all other super nodes but also sends an inactive report to the RP. That means the RP typically receives several
redundant inactive reports about a missing super node. This redundancy is desirable
in a real network because inactive reports can be lost. It is important to make sure that
the RP learns about inactive super nodes. Otherwise it would answer ping-RP queries
incorrectly.
Ideally, super nodes that wish to leave should remain in the group until the other super nodes have changed their routes. For simplicity, this has not been implemented.
Gracefully leaving super nodes notify their children in the local cluster and the adjacent super nodes.
Handling mesh partitions: We adopt the algorithms described in [13]. Each super node
stores a refresh table with one entry for every super node. Each entry consists of an
IP address, a timestamp and a sequence number. The table is initialized during the
mesh construction, as soon as the super node knows the IP addresses of all other super nodes. The initialization is done as follows:


For all super nodes s do {
    table.addEntry(s.ip, simTime(), 0)
}
That means, the timestamp is set to the current simulation time, and sequence numbers start at zero. Super nodes include their tables in refresh messages. The timestamp column does not have to be included; this is the purpose of the sequence numbers.
When a super node receives a refresh message m from a mesh neighbor, it updates its
refresh table:
For all entries r in m do {
    Entry& e := table.find(r.ip)
    If e.seq < r.seq then {
        e.seq := r.seq
        e.timestamp := simTime()
    }
}
That is, if the existing entry has a lower sequence number than the entry in m, it is
updated by adopting the sequence number and updating the timestamp. Note that
the find operation in the pseudocode returns a reference to e.
With the help of the refresh tables, partitions can be detected. Super nodes check for
partitions periodically. As this check does not involve additional message exchange, it
should be performed frequently. The pseudocode below describes the exact algorithm:
For all entries e in the refresh table do {
    float t := simTime() - e.timestamp
    If Tmin < t < Tmax then {
        queue.push(e)
    }
}
For all entries e in the queue do {
    float t := simTime() - e.timestamp
    If t < Tmin then {
        queue.erase(e)
    }
    If t > Tmax then {
        handlePartition(e)
    }
}
With a probability of (queue.size / table.size) do {
    handlePartition(queue.pop())
}
Tmin and Tmax are constants, and have already been introduced in Section 2.2.5. They
determine how aggressively links are added in order to repair potential partitions.
When an entry is not updated for at least Tmin seconds, it is copied to a queue. Entries remain in the queue for up to Tmax - Tmin seconds. After that time, a partition is assumed and handlePartition() is called. Entries can be removed from the queue earlier
when an update or inactive report about the corresponding super node is received.
Furthermore, every time a member checks for a partition, the super node that has
been on the queue for the longest time is assumed to be partitioned with a probability
depending on the size of the queue. The reason for this is as follows: assume that a
super node s checks for partitions, and that there are many entries on the queue. That
means, there is a high number of super nodes from which s has not received an update. This is a strong indicator for a partition. It is important to detect mesh partitions
quickly because any number of members can be affected; but if Tmin and Tmax are too
low, a high number of unnecessary links may be added. The values should be chosen
carefully as a function of tm and the diameter of the mesh.
The handlePartition() function is called when a super node x has detected a partition.
It takes a refresh table entry e as an argument and does the following: first, a probe
request is sent to the super node y with the IP address e.ip. If there is no response, handlePartition() returns. In this case, x has not received an update about y because y has left the group. Note that in the worst case, x learns about an inactive super node tm · d seconds after the departure, where d is the diameter of the mesh.4 Typically, Tmin < tm · d holds, so the probe request is important. If there is a probe response,
a link that connects x and y is added to the mesh. [25] and [13] do not specify how
this should be done. Clearly, x needs to tell y that there is a new link, so that y can update its list of neighbors. In our implementation, x sends a status report that notifies
y about the new link. Remember that the additional mesh link consumes bandwidth;
both super nodes may be unable to support an additional mesh neighbor. We assume
4 It can take longer if the super node leaves ungracefully.


Figure 3.9: The data topology before and after a promotion

that both x and y have children in their local clusters. If necessary, resources are freed
by sending a breakup message to one child. A cluster member that receives a breakup
message rejoins the overlay as if its parent had left the group. There are other possible
solutions. For example, x and y could drop a mesh link to free resources. [7] recommends that the "mesh degree bound for hosts should not be strictly enforced to ensure
connectivity. Instead additional mechanisms that limit the degree of the data path on
the mesh should be used." However, suitable mechanisms are not described. Our approach makes sure that mesh partitions, once they are detected, are repaired quickly
and without causing new mesh partitions.

3.2.8 Self-improvement
As mentioned in Section 3.2, the mesh self-improvement mechanism is not implemented.
Instead, we try to construct a reasonable initial mesh. That means, the quality of the overlay
can decrease when super nodes leave because disadvantageous links may be added to repair
partitions, and because a lower number of super nodes affects the performance. However,
we do not believe that this has a big impact on our measurements. In our scenarios, the
ratio of super nodes to non-super nodes does not change drastically over time. We consider
the self-improvement of the clusters via promotions more important; this is described below.
A cluster member that has a higher capacity than its parent can be promoted, meaning that
it swaps places with its parent. The basic idea is that first child and parent swap their positions. This involves communication between the promoted node, its parent and its grandparent. Then, the promoted node adopts the children of its former parent. This is shown in
Figure 3.9. The left part shows the data topology of a small DMMP cluster. Assume that C
has a significantly higher capacity than its parent B. The right part shows the topology after the promotion of C. C has taken the position of B, and it has adopted its former sibling D.


Implementing the promotions turned out to be a time consuming task. There are a number
of difficulties:
- Several nodes are involved in a promotion.
- Each node may leave at any time.
- The promoted node may not be able to take over all children of its former parent. In fact, it may not have enough bandwidth to accept its former parent as a child. This becomes more complicated when end-hosts join during a promotion.
We implement the promotion mechanism because we consider it an essential feature of
DMMP that distinguishes it from other ALM protocols.
[25] does not describe promotions in detail. However, Jun Lei, one of the authors, suggested the following algorithm: consider Figure 3.9 again. First, C requests a promotion
by sending a status request to B. B acknowledges this by sending a status report back. This
status report contains the address of A. Moreover, B breaks its connection to A and C. When
C receives the status report, it contacts A, which adds C as a child. Then, B is notified by C
and rejoins as a child of C. Meanwhile, B stores A as its backup parent. If C turns out to be
inactive (it may have left the group after requesting the promotion), B can rejoin as a child
of A. That means, the original topology as shown in the left part of Figure 3.9 is restored.
When B has successfully joined as a child of C, B breaks its connection to D if C has enough
capacity to accept an additional child. Finally, D joins as a child of C. E is not involved in the
promotion. It is included in the Figure to emphasize that the promoted node may already
have children, and hence can be saturated. We have implemented this algorithm with a few
enhancements.
We use the example shown in Figure 3.10 to describe the promotion algorithm. In the first
part, C and B swap their positions. In the second part, D joins as a child of C.
First part: C knows B's capacity either from a join response or from a refresh message. It
decides if it should request a promotion in the following way:
If this.capacity > parent.capacity + threshold then {
    requestPromotion()
}
C also makes sure that B is not a super node. Swapping positions with a super node
is not implemented, and perhaps not desirable either. The threshold is there to avoid
oscillation. As the capacity is a function of the current uptime, it changes over time.
Therefore members have to check constantly if a promotion is appropriate.


[Figure: sequence of messages exchanged during a promotion: PromoRq, BreakUpRq, BreakUpRsp, PromoRsp(ACK), JoinRq, JoinRsp, JoinRsp, PositionConfirm, BreakUpRsp, JoinRq, JoinRsp, PositionConfirm]

Figure 3.10: The promotion algorithm: message flow

C decides to send a promotion request to B. The request includes C's capacity and the currently unused degree. Then, B checks if C's capacity is indeed higher. (C may have
made its request based on slightly outdated information.) In addition, B must not be
involved in another promotion. If a promotion is not possible, B sends C a promotion
response indicating denial. Otherwise, B sends a breakup message to its parent A, but
keeps A as a backup parent. A also keeps B as a temporary child. That way, B does not
lose data during the promotion. Now A has enough available bandwidth to accept
C as a child. A reserves this bandwidth for C so that other joining end-hosts cannot
interfere. Next, A tells B that it has reserved bandwidth using a breakup response, and in turn B sends a promotion response to C acknowledging the promotion. B adds
C as a temporary child. C sends a join request to A and breaks the connection to B.
It is necessary that B waits until A has reserved bandwidth. If it acknowledges the
promotion right away, C's join request may reach A earlier than B's breakup message.
In this case, A may not be able to accept C as a child.
After joining successfully, C notifies B using a join response. B removes C from its list
of temporary children, and confirms its own position to A, which in turn removes B
from its list of temporary children.
Second part: B knows C's currently available bandwidth from the promotion request. Therefore, it can determine the number n of additional children that C can accept. B chooses
n of its own children arbitrarily (in the example it chooses only D), and sends breakup
messages to them. The breakup messages contain the address of C. The chosen children are added to B's list of temporary children. They are removed when they confirm their new positions. D tries to join as a child of C, and keeps B as a backup parent meanwhile.

[Figure: class diagram showing DMMP, DMMPMember, DMMPSource, RP, MemberMap, MemberInfo, RefreshTable and RefreshEntry, connected via std::map, std::list, std::vector and std::string]

Figure 3.11: DMMP implementation design: class diagram
This algorithm can tolerate arbitrary membership changes. For example, B and D can rejoin
as children of their backup parents if C leaves. There is also little or no packet loss. Nevertheless, we believe that the algorithm can be further optimized and simplified. Note that
there are several timers involved that have not been mentioned above.

3.2.9 Summary
Figure 3.11 summarizes the design of the DMMP implementation. For clarity, the interactions with the rest of the simulation model are not shown. The message classes are left out,
too. DMMPSource implements the behavior of the source. It receives data from a higher
layer module, wraps that data in messages, attaches the DMMP header, and sends the messages to its mesh neighbors. The DMMPMember class implements all the algorithms described in the previous sections of this chapter. IP addresses are stored as std::strings; whenever more information about a group member needs to be stored, a MemberInfo object is
used. MemberInfo objects have data members for all relevant properties of group members,
for example the maximum degree or the best known distance to the source. MemberMap
objects aggregate several MemberInfo objects. An std::map<std::string,MemberInfo*> is used
as an underlying data structure. DMMPMembers use several MemberMaps to keep track of
(1) the children in the local cluster, (2) mesh neighbors, (3) all super nodes, (4) buffered join
requests, and (5) candidate parents. (2) and (3) are only used by super nodes, and (5) is only
used by non-super nodes. During promotions, DMMPMembers also keep track of bandwidth reservations and temporary children. In this case, only the IP addresses are stored.


Each DMMPMember has a RefreshTable (see Section 3.2.7), which consists of a number of
RefreshEntries and a list of potentially unreachable super nodes.
In Section 3.2.1, we noted that efficiency and scalability of the implementation are critical.
As we do not implement all DMMP features, there is also much future work. For this reason,
the code also needs to be extensible, in particular readable. To a degree, these two goals are
conflicting. For example, members store their children in a MemberMap which uses an
std::map internally. This is convenient because lookup operations are easy to implement
and easy to understand. However, iteration over an std::map is rather inefficient. Iteration
is needed e.g. whenever a DMMPMember forwards data to its children. However, all of
the MemberMaps contain relatively few MemberInfo objects. All in all, we believe that our
implementation is reasonably efficient. Another concern that was pointed out in Section
3.2.1 is debugging. The OMNeT++ simulation library provides a macro that can make any
data member visible in the object inspector. In addition, OverSim provides functions to
visualize the overlay topology. This, coupled with the animation of the message exchange,
makes it fairly easy to debug network models with a small number of nodes in the Tkenv.
Many interesting situations occur only with a high number of nodes though. Example: in
Section 3.2.8, some possible complications during promotions have been mentioned. It is highly unlikely that these would occur with a group of, say, ten members. With a high
number of nodes, the animated network is difficult to overview, and the execution speed is
too slow. Consequently, we have mainly used logs and assertions for debugging, which is a
time consuming method. We believe that all major bugs have been resolved, though.


4 Performance evaluation
Scalability, resilience, efficiency, and service quality are major concerns with application
layer multicast. These properties are difficult to analyze with mathematical models or testbeds.
In this chapter, we evaluate the performance of DMMP using network simulation experiments. We analyze the efficiency of the data delivery, and the service quality experienced by
the participating end-hosts in dynamic scenarios with a high number of end-hosts. The exact
setup of our experiments is described in Section 4.1. The results are presented in Section 4.2.

4.1 Evaluation methodology


First, we will describe the performance metrics and how they are implemented in Section
4.1.1. Then the simulation scenarios will be illustrated in Section 4.1.2. This includes the
churn distributions, the underlying network topologies, and the traffic created by the source.
In Section 4.1.3, we revisit the performance metrics and state which values we expect based
on the design of DMMP.

4.1.1 Performance metrics


We are interested in measuring the efficiency of the data delivery, and the service quality:
- We consider only the bandwidth consumption. Other resources, such as memory and computation time, are less relevant: DMMP does not change the behavior of the routers, and the end-hosts can be assumed to have plenty of memory and fast processors. In order to improve the design of DMMP, it is also important to determine how much bandwidth is consumed by data traffic, and how much by control traffic.
- The service quality that the group members experience mostly depends on the number of missed data messages, and the latency.
In the previous research, a number of performance metrics for application layer multicast
architectures have been proposed. Some of them are introduced below, roughly following
[6].
Stress: The stress metric is used for analyzing the efficiency of the data delivery. For each
data message that a source multicasts to the group, the number of generated copies is
counted. Most importantly, the stress of a link in the underlying network is the number
of copies sent over that link (in any direction). Similarly, the stress of a router equals
the number of forwarded copies. Network layer multicast achieves minimal stress; no


[Figure: example overlay with per-link and per-router transmission counts]

Figure 4.1: The stress metric: example

link and no router has a stress greater than one. This cannot be achieved with application layer multicast; nevertheless, the stress should be as low as possible. Figure
4.1 shows an example. The dotted arrows show the overlay data topology. The white
arrows represent data packet transmissions. For each link and router the number of
transmissions is shown. The overlay network is dynamic in nature. For that reason,
the link stress changes over time, and needs to be evaluated for each data packet that
the source produces. By numbering the packets, links and routers, we can define the stress of a link or router j, 1 ≤ j ≤ M, as the average number of transmissions:

s(j) := (1/N) · Σ_{i=1}^{N} n(i, j),    (4.1)

where n(i, j) is the number of copies of packet i (1 ≤ i ≤ N) that are transmitted via
j. Given a network with many links and routers, it is more helpful to determine the
average stress
s̄ := (1/(M - z)) · Σ_{j=1}^{M} s(j).    (4.2)

z is the number of links or routers with a stress of zero. [13] states that only the "links
active in data transmission" should be counted. Clearly, it does not make sense to
consider parts of the underlying network which are not involved in the multicast, for
example routers without attached group members. We can also measure the load of
end-hosts, by calculating the stress of the last-hop links. Note that the last-hop stress
is closely related to the degree, which can be easily determined mathematically (see
Section 4.1.3).
Control overhead: The main task of application layer multicast protocols is the delivery of
higher layer application data from a source to all other group members. The overhead
of controlling the data exchange between the participating nodes is referred to as control overhead. In the case of DMMP, control overhead is caused mainly by establishing
and maintaining the overlay network. There are several ways to measure this overhead. For example, in [22] the number of control messages is counted. It would also
be possible to measure the amount of bandwidth that is used for control traffic.
Loss rate: Many application layer multicast protocols do not guarantee data delivery. When
a source sends a data message to the group, some members may not receive the message. This can happen due to an unreliable transport protocol. More importantly, the
data topology can be temporarily partitioned.
Message losses can be evaluated by calculating the loss rate l for each group member i:
l(i) = (N(i) - r(i)) / N(i),    (4.3)

where N(i) is the number of data messages sent to the group within the lifetime of i,
and r (i ) is the number of unique data messages that i has received. While multimedia
applications may be able to tolerate rather high loss rates, it is nevertheless desirable
to achieve a high probability of delivery. Moreover, knowing how long partitions last,
and how many members are affected can give insights about the resilience of an application layer multicast protocol. This is measured in [7], for example.
Latency, data path length, and stretch: All these metrics consider the distances between a
source and the other group members. Assume that a source sends a data message at
ts. It is forwarded by several routers and end-hosts, and at ta it arrives at the group member m. Then, the latency currently experienced by m is ta - ts. For large-scale experiments, the latency should be averaged over all messages and over all members.
Low latency is especially important for interactive applications.
The data path length p of a group member i is the length of the path from the source to
i. More precisely, we define p(i ) as the number of physical links that packets traverse
until reaching i, averaged over all packets that are sent to the group. A long data path
does not always imply high latency, but it can be an indicator of high jitter and packet
loss; packets that traverse a high number of links are more likely to be delayed or lost.
The stretch of a group member i is the ratio p(i) / c(i), where c(i) is the length of the shortest unicast path from the source to i. When the source unicasts data to all destinations, the stretch is one for all members by definition.
In our simulations, we use mostly the performance metrics that have been used in [7] in
order to produce comparable results. We measure the stress, control overhead, loss rate,
and data path length as follows:
Each group member keeps track of the number of sent and received data messages, and routers count the number of forwarded data messages. End-hosts calculate their last-hop stress based on Equation 4.1, and record it when they leave the group. Similarly,


routers report their router stress at the end of the simulation. In addition, each router
remembers for each link if it has sent or received data via that link at all. This way, links
with a stress of zero can be identified. All statistics are aggregated by the RP, which
eventually stores them using output vectors and output scalars (see Section 3.1.3). The
average router stress and last-hop stress are determined according to Equation 4.2.
The average link stress is then computed as a function of the average router stress
and the average last-hop stress.
The authors of [7] have measured the control overhead in bit per second at routers
and end-hosts. For simplicity, we omit these measurements for the routers, and only
report the control overhead incurred by the end-hosts. It is important to consider the
size of the control messages, and not only their number. In Narada, the number of
control messages grows linearly with the group size, whereas the control traffic in bit
per second exhibits quadratic growth. We expect the number of super nodes in DMMP
to have a similar effect.
It is not entirely clear whether protocol headers are regarded as control overhead in [7]. It
is not stated how frequently data messages are sent by the source. If the headers of
data messages were counted, this would be important information. In our measurements we consider the headers of control messages, but disregard the headers of data
messages.
We measure the loss rate as described in Equation 4.3. The number of missed data
packets is determined by observing gaps in the sequence numbers. When a member
receives a data message, it calculates the difference between the message's sequence number
and the sequence number of the last previously received data message. If the difference is greater than one, packets have been missed. It is also possible that a member
leaves while rejoining. For that case, members also store the timestamp of the last
received data message. When a member leaves, it can tell by the timestamp if it has
missed any messages.
We do not measure latencies for three reasons:
- Latencies have not been measured in [7].
- DMMP is not intended for applications that are latency sensitive.
- We have not carefully optimized timeouts.
Instead, we measure the data path length. This metric has been used in [7] as well.
We determine the data path lengths by adding a hop counter to each data message.
Every router or end-host that forwards a data message, increments the hop counter.
When a data message arrives at a group member, the data path length of that group
member is equal to the hop counter. For simplicity, we omit the stretch metric.


4.1.2 Simulation scenarios


In this section, the scenarios used for our simulation experiments are described. Remember that in OverSim, protocol behavior and scenarios are separated. By "scenario" we mean
mostly the network topology, dynamic topology changes, the lower layer protocols, and the
traffic patterns of our simulation model. Two different types of scenarios are used. The first
one is used to compare DMMP and NICE, and is therefore very similar to the scenarios described in [7]. We refer to these scenarios as the NICE scenarios. In the NICE scenarios, there
is a long phase without any churn. Afterward, a high number of randomly chosen members
leave the group at once. As DMMP is designed with frequent membership changes in mind,
we use a second scenario type with a churn model that is more typical for DMMP. We call
them the typical scenarios. In these scenarios, there are frequent membership changes all the
time. Below, we describe the network topology, the lower layer protocols, and the traffic
pattern, which are the same in both scenario types. Then the two different churn models are
described in more detail.
Network topology: OverSim uses a very simple algorithm to generate the underlying network topology. The algorithm is implemented in NED. There are two types of routers,
backbone routers and access routers. Links between backbone routers are placed at random. The algorithm does not guarantee that the resulting backbone graph is connected; if it is not connected, OverSim exits. Each access router is attached to one
backbone router. End-hosts are placed dynamically, depending on the churn model.
Each end-host is connected to exactly one access router via PPP. The access router is
chosen at random when the host is created. As mentioned in Section 3.1.4, routers
consume a high amount of memory, which limits the number of routers that can be
simulated. The algorithm that generates the network topology is also designed for a
small number of routers: the probability p that the backbone is connected decreases
with the number of backbone routers.1
However, a more sophisticated algorithm requires dynamic router placement (that is,
manually written C++ code) because the semantics of the NED language are too limited. Another possibility is using scripting languages like perl or awk to generate a
NED file that explicitly lists the neighbors of each individual backbone router. Both
methods could be based on the output of an external topology generator [34, page 35].
For lack of time, we use the mechanism provided by OverSim. With one gigabyte main
memory we are able to simulate up to around 2,500 routers. For our experiments, we
use 1,500 backbone routers and 1,000 access routers. With p ≈ 0.5, the average router degree is d ≈ 5.8. The ratio of backbone routers to access routers is chosen because a
higher number of backbone routers leads to a lower p or a higher d. A higher number
of access routers is not reasonable either. As we attach end-hosts to randomly chosen
1 It is p = (1 - (1 - d/b)^(b-1))^(b-1), where b is the number of backbone routers, and d is the average degree of the backbone routers.


access routers, many access routers end up without any attached end-hosts, and thus
do not interact with the rest of the simulation model at all.
In [7] a topology with 10,000 routers and a smaller router degree has been used. This
makes it difficult to compare DMMP with NICE. For the typical scenarios, a larger
topology would be desirable as well. We will discuss how the size of the topology
affects the performance evaluation in Section 4.2.
The links between routers represent fiber optic cables with a propagation delay of five
milliseconds and a bandwidth of one gigabit per second. In [24] it has been pointed
out that a big part of the end-hosts on the Internet do not have enough upstream
bandwidth to support children in the overlay topology. We model this by setting the
maximum degree of these hosts to one. That means, the bandwidth is chosen based
on the bitrate of the data stream. We also need some end-hosts with a high maximum
degree that can become super nodes. In the INET underlying network model, all links
between an access router and its attached end-hosts have the same propagation delay
and bandwidth. Therefore, all end-hosts attached to the same access router have the
same maximum degree. Hence, we can say that each access network has a maximum
degree. When an access router is placed, the maximum degree of the access network
is set to one with a probability of 0.4. The maximum degree distribution is shown in
the table below. The degree of the source is determined in the same way, but it cannot
be greater than five. Assume a source with a degree of 14 that directly provides all
or most super nodes with data. In this case, the performance of the overlay core does
not affect the overall results. We avoid this kind of degenerate overlay topologies by
limiting the degree of the source.
maximum degree    probability
1                 0.4
                  0.2
                  0.2
10                0.1
14                0.1

We set the last-hop propagation delays to zero because otherwise, considering the
rather small backbone network, propagation delays would dominate our round trip
time measurements. Remember that we use the length of the data delivery path as
a performance metric, and that we do not measure end-to-end latencies. If members
chose their parents mostly based on last-hop propagation delays, the data delivery
paths would clearly be long, and there would be no point in measuring their lengths.
Lower layer protocols: The simulation model does not consider the details of the physical
layer. Instead, when a frame is sent over a link, the transmission time is calculated as a

function of the link's bandwidth and propagation delay, as described in Section 3.2.2.
A simplified version of PPP is used as the link layer protocol. It does not conceal bit
errors; in fact, the INET underlying network model does not model bit errors at all. The
network layer implementation is slightly more complex. Routing tables are globally
computed using Dijkstra's algorithm. The underlying topology does not change over
time, so there are no dynamic routing updates. Messages that are bigger than the
maximum transmission unit (MTU) are fragmented. UDP constitutes the transport
layer. This protocol stack is illustrated in Figure 3.2. Routers use the same link layer
and network layer modules.
Traffic pattern: "Traffic" refers to the payloads of the higher layer application here, in contrast to the control traffic that DMMP generates by itself. OverSim encourages users to
implement traffic sources as higher layer modules. A stack of application layer protocols can be placed on top of the overlay protocol. In Figure 3.2, there are three higher
layer modules: "tier1", "tier2" and "tier3". We implemented a very simple module representing a multimedia application. Only the source uses this module. The module
is notified by the RP as soon as the initial control topology is constructed. Then it
produces data at a constant bitrate r and forwards it to the DMMP module, which
provides the multicast functionality. For all our experiments, we use a small bitrate of
64 kilobit per second. The multimedia module creates three data messages per second,
each with a size of 64/3 kilobit, in order to avoid IP fragmentation. Frequent data messages
slow the simulation down because each data packet is routed individually. Therefore
we do not use higher bitrates. For our experiments, this is unproblematic, although
DMMP is designed for high-bandwidth applications. The maximum degrees of the
end-hosts are much more important. We adjust the bandwidth of the end-hosts based
on r, so that the maximum degrees do not depend on r.
Churn models: In OverSim, churn models are implemented as churn generators. Churn generators are simple modules that are derived from the ChurnGenerator base class. They
place and remove end-hosts over the course of a simulation run. We have implemented a churn generator for the NICE scenarios. For the typical scenarios we use the
built-in ParetoChurn generator.
The churn model described in [7] has three subsequent phases:
1. Join phase: Within the first 200 seconds, end-hosts join the multicast group uniformly at random.
2. Stabilize phase: The overlay is given 1,800 seconds to stabilize. Within that phase,
there are no membership changes.
3. Leave phase: Every 100 seconds, a high number of randomly chosen members
leave ungracefully within 10 seconds. This is done five times, so the total simulation time is 2,500 seconds.
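The three phases can be sketched as an event schedule. The following is an illustrative sketch only; the group size and the fraction of members removed per leave phase (`leave_fraction`) are assumed values, not the exact numbers used in [7]:

```python
import random

def nice_churn_schedule(n_hosts=128, leave_fraction=0.15, seed=42):
    """Sketch of the three-phase NICE churn model as a sorted (time, event, host) list."""
    rng = random.Random(seed)
    events = []
    # 1. Join phase: hosts join uniformly at random within the first 200 seconds.
    for host in range(n_hosts):
        events.append((rng.uniform(0.0, 200.0), "join", host))
    # 2. Stabilize phase: no membership changes between 200 s and 2,000 s.
    # 3. Leave phase: every 100 seconds, randomly chosen members leave
    #    ungracefully within 10 seconds; this happens five times.
    members = list(range(n_hosts))
    rng.shuffle(members)
    per_phase = int(leave_fraction * n_hosts)
    for phase in range(5):
        start = 2000.0 + 100.0 * phase
        for _ in range(per_phase):
            events.append((start + rng.uniform(0.0, 10.0), "leave", members.pop()))
    return sorted(events)  # all events lie within the 2,500-second simulation
```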

This model allows an analysis of the multicast protocol and its convergence time under optimal conditions, without disruptions caused by membership changes. Then,
in the leave phase, it can be measured how quickly the protocol recovers from severe
changes.
However, constant membership changes are certainly more realistic, and as DMMP
is designed with dynamic scenarios in mind, we believe that a scenario with a more
typical churn model can provide important insights as well. In particular, DMMP
calculates the capacity of a member as a function of that member's current uptime. It
is implicitly assumed that members with a higher uptime are more stable. In the NICE
scenarios, this assumption does not hold. Leaving members are chosen uniformly at
random in the leave phase. In addition, all members join more or less at the same
time, and thus have roughly the same uptime. Therefore, an important condition for
the churn model in the typical scenarios is
P(L > k + t | L > k) > P(L > t)    for all k, t ∈ {r ∈ ℝ : r > 0},    (4.4)

where the random variable L is the lifetime of an arbitrary member, that is, L = t_l − t_j,
where the member joins the group at t_j and leaves the group at t_l. In short: the
distribution of L must not be memoryless. The ParetoChurn generator provided by
OverSim uses a Pareto distribution, which satisfies this condition. First, there is a short join phase of
100 seconds. A high number of members join during the join phase, and there are no
departures. Then, members join and leave constantly, creating a stable equilibrium.
That means, the size of the group hardly changes over the course of the simulation.
The total simulation time is 500 seconds. We believe that this is sufficient because
DMMP can establish the initial overlay network quickly. We set the expected value of
L to 7,500 seconds. Measurements for groups of 1,000 members show that with these
settings, a member leaves the group about every five seconds. More frequent leaves
would lead to a rapidly decreasing number of super nodes because super nodes cannot
be replaced by newly joining end-hosts. Departures are not always ungraceful in this
model. Members leave gracefully with a probability of 0.5. Note that this is unlike the
NICE scenarios. The idea of the typical scenarios is to evaluate DMMP under ordinary
conditions, whereas the NICE scenarios perform stress tests.
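Condition 4.4 can be checked numerically. The sketch below (function names and parameter values are ours, chosen for illustration) contrasts a Pareto lifetime distribution, whose conditional survival probability grows with the current uptime, with a memoryless exponential distribution:

```python
import math

def pareto_survival(t, alpha=2.0, xm=1.0):
    # P(L > t) for a Pareto(alpha, xm) lifetime, valid for t >= xm
    return (xm / t) ** alpha

def exp_survival(t, rate=1.0 / 7500.0):
    # P(L > t) for an exponential lifetime; this distribution is memoryless
    return math.exp(-rate * t)

def conditional_survival(surv, k, t):
    # P(L > k + t | L > k) = P(L > k + t) / P(L > k)
    return surv(k + t) / surv(k)
```

For the Pareto distribution, a member that has already survived 100 seconds is far more likely to survive another 50 seconds than a fresh member, which is exactly the assumption behind the uptime-based capacity calculation; for the exponential distribution, the two probabilities coincide.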

4.1.3 Expected results


As mentioned in the previous sections, we cannot simulate as many routers as the authors
of [7] did. All results could be affected by this. The effect of the number of routers is difficult to predict. As our underlying topology has a smaller diameter, we expect shorter data
paths. The router and link stress should be higher because the traffic is distributed among a
rather small number of routers. As we do not model transmission errors, control overhead
and loss rate should not depend on the size of the underlying network.

Ignoring the problems concerning the number of routers, we expect the following results:
Stress: First of all, the last-hop stress is easy to predict. The data delivery tree has
n − 1 edges, where n is the group size. There is one end-to-end transmission per edge,
and each transmission crosses two last-hop links: one at the sender and one at the
receiver. As there are n last-hop links in total, the average last-hop stress s_h is

    s_h = 2(n − 1) / n,    (4.5)

which approaches two for large groups. Considering that DMMP produces some redundant transmissions, s_h should always be about two. We measure s_h mostly to verify
the implementation.
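Equation 4.5 is simple enough to verify directly; a minimal sketch:

```python
def avg_last_hop_stress(n):
    # The data delivery tree has n - 1 edges with one end-to-end transmission
    # each; every transmission crosses two last-hop links (sender and
    # receiver), and there are n last-hop links in total.
    return 2.0 * (n - 1) / n
```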
As the average stress of the network depends on the network size, it is difficult to estimate. Based on the theoretical analysis in [24], we expect the average link stress s_l
of DMMP to be slightly lower than the average link stress of NICE. However, these
considerations are "based on the assumption of a large member population uniformly
distributed in the network". s_l is necessarily greater than one because only links that
carry at least one transmission (stress ≥ 1) are considered. Moreover, the average stress
of the last-hop links is about two, as we have shown above.

Control overhead: We consider the control overhead a weak spot of NICE. Maintaining
the hierarchy with all its invariants is costly. Additionally, [7] have reported that the
control overhead at routers grows logarithmically with the number of end-hosts. We
expect DMMP to induce slightly less control overhead. In particular, the control overhead within the clusters should be low. The overlay core may produce higher control
overhead because the size of the refresh tables grows linearly with the number of super nodes. Hence, the control overhead of the overlay core is in O(m²), where m is the
number of super nodes. However, m is bounded.
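The quadratic growth can be illustrated with a simple cost model. The entry size and the number of mesh neighbors per super node are assumptions made for this sketch only:

```python
def core_refresh_bytes(m, neighbors=3, entry_bytes=16):
    # Each of the m super nodes sends its full refresh table (one entry per
    # super node, i.e. m entries) to each of its mesh neighbors once per
    # refresh interval, so the total grows quadratically in m.
    return m * neighbors * m * entry_bytes
```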
Loss rate: In our simulations, data messages are lost only due to partitions. On the one
hand, it may take cluster nodes a relatively long time to rejoin the group. On the
other hand, departures in the overlay core are unproblematic as long as the mesh does
not get partitioned. DMMP also considers the uptime of the group members, but
in the NICE scenarios this does not make a difference, as we have explained in the
previous section. In summary, it is hard to predict if our DMMP implementation can
achieve a lower loss rate than NICE. In our simulation scenarios, members do not
leave extremely frequently. Considering that losses only occur when members leave,
the average loss rate should certainly be less than one percent.
Data path length: Our DMMP implementation does not optimize the overlay for distance.
Therefore we expect the average data path of DMMP to be longer than the average

data path length of NICE. As several factors determine the positions of the end-hosts
in the data delivery tree, the data path length is not easy to estimate. For most of the
performance metrics, we cannot provide good estimations. This is the motivation for
our simulation experiments.

4.2 Results
We have discussed the application-specific parameters of DMMP in Section 3.2. In Section
4.2.1, we experiment with some of these parameters using typical scenarios. We also evaluate the general performance in the typical scenarios. Then, we present the results that we
produced using the NICE scenarios, and compare DMMP with NICE.
Some parameters are the same in all the experiments because we did not have the time to
adjust them carefully.
- The threshold for promotions (see Section 3.2.8) is set to 100,000. Recall Equation 3.2.
  As the source sends at a small bitrate of 64 kilobit per second, we also use rather low
  values for the bandwidth of the end-hosts. More precisely, it is

      128,000 ≤ b(i) ≤ c(i) ≤ 2 · b(i) ≤ 2,000,000.

  Hence, 100,000 seems to be a reasonable threshold.
- The timeouts, as well as T_min and T_max (see Section 3.2.7), are chosen fairly arbitrarily,
  based on observations made during implementation and testing.
In our initial experiments, it took partitioned cluster members a long time to rejoin. In many
cases, rejoining end-hosts received five or more rejections before finding a suitable parent.
This problem has already been pointed out in Section 3.2.4. Implementing the control topology links between relatives would have been the best solution. For lack of time, we made
some slight changes to the join algorithm instead. In our implementation, joining end-hosts
send several join requests at the same time, which allows hosts to join, and, more importantly, rejoin faster. Nevertheless, this algorithm is only a temporary solution because it can
produce loops in the data topology.

4.2.1 Typical scenarios


Before analyzing the performance of DMMP, we adjust the number of super nodes and the
density of the mesh. We simulate a group of 1,000 end-hosts. OverSim supports far more
end-hosts. However, the execution speed decreases linearly with the number of end-hosts.
With 1,000 end-hosts, simulations run roughly in real-time on a three gigahertz processor.
That is, each run takes five to ten minutes. Every run is repeated three times with different
random number generator seeds. We report only the mean values. The results of a single

Figure 4.2: Impact of the overlay core size

run can vary greatly (this is especially true for the average loss rate) depending on, for example, the number of mesh partitions. In fact,
more repetitions would have been desirable. Then again, we attempt a rough initial analysis, so that some lack of precision is tolerable.

Number of super nodes: The impact of the number of super nodes is shown in Figure 4.2.
We measured the average router stress, the average data path length and the average loss rate (left panel), and the average control overhead in kilobit per second (right
panel) for different numbers of super nodes. Regarding the absolute values, note that
we have set the refresh frequency to tc = tm = 1.5 seconds, which will be discussed in
the following section. The loss rate has been multiplied by 1,000, and the data path
length has been divided by ten in order to adjust the values to one scale.
A very low number of super nodes leads to a poor overall performance. Here are some
possible explanations: The data path length and stress are comparatively high because
few super nodes lead to bigger and thus deeper clusters. The control overhead is low
because most of the refresh messages are exchanged in the clusters. The refresh messages in the clusters have a relatively small, constant size. The relatively high loss rates
could be due to the large clusters, which are vulnerable to partitions. Interestingly, the
control overhead for five super nodes is higher than the control overhead for ten super
nodes. The additional control overhead could be caused by frequent cluster partitions.
For high numbers of super nodes, we measured high control overheads. In fact, the
control overhead seems to exhibit quadratic growth. The average control overhead is
almost 16 kilobits per second for 100 super nodes. This is not unexpected. It can be easily shown mathematically that the overhead induced by refresh messages inside the

Figure 4.3: Impact of k

overlay core grows quadratically. It is more surprising that the overall performance
decreases with more than 30 super nodes. The redundant transmissions produced by
DVMRP could be responsible for the increased stress. The data paths could be long
because the mesh is not optimized for distance at all. A more complete DMMP implementation could perhaps achieve an overlay core with shorter data paths. The high
loss rate is difficult to explain. Possibly, a larger mesh is more vulnerable to mesh partitions. Having said that, we did increase the mesh density k from two to three in the
experiments with 40 or more super nodes. The impact of k is discussed below.
In summary, a reasonable number of super nodes should be between ten and 30. For
the remaining experiments with 1,000 end-hosts, we use 30 super nodes. According
to Figure 4.2, the stress and data path length are relatively low for 30 super nodes. At
the same time, the control overhead is still acceptable. Ultimately, our measurements
indicate that the exact number of super nodes is not very important. Any moderate
number seems to be fine. However, it is reassuring that extreme numbers of super
nodes entail poor performance. Without any super nodes, DMMP is basically a tree-first protocol similar to HMTP. When all members are super nodes, DMMP behaves
like Narada, a mesh-first protocol. Our results indicate that the two-tier architecture
of DMMP is in principle superior to those two approaches, at least for rather large
groups.
k: The impact of the mesh density is shown in Figure 4.3. k is the number of interleaved
spanning trees that the mesh is composed of, that is, the number of mesh links is about
n · k, where n is the number of super nodes. Remember that the centralized mesh construction
algorithm should have a parameter that controls the density of the mesh. Therefore,

Figure 4.4: Scalability of DMMP

the analysis of k is relevant.
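The idea of k interleaved spanning trees can be sketched as follows. This is an illustrative stand-in for the centralized mesh construction, not the actual OverSim module; `build_mesh` and its parameters are hypothetical:

```python
import random

def build_mesh(super_nodes, k, seed=7):
    # Union of k random spanning trees over the super nodes, yielding
    # roughly n * k mesh links (fewer if trees share edges).
    rng = random.Random(seed)
    links = set()
    for _ in range(k):
        nodes = list(super_nodes)
        rng.shuffle(nodes)
        connected = [nodes[0]]
        for v in nodes[1:]:
            # attach each remaining node to a random already-connected node
            links.add(frozenset((rng.choice(connected), v)))
            connected.append(v)
    return links
```

For k = 1, the resulting data topology of the overlay core is a single tree, which matches the high loss rate observed at k = 1.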


Again, it is not surprising that the control overhead grows with the number of mesh links.
A higher average super node degree implies more refresh messages. The high loss rate
at k = 1 is also easy to explain: for k = 1, the data topology of the overlay core is a tree.
Whenever a non-leaf super node becomes inactive, the overlay core is partitioned. k
could have two conflicting effects on the data path length: (1) a higher mesh density
leads to better reverse path trees, and (2) super nodes need to reserve more bandwidth
for reverse path forwarding, which leads to deeper clusters. For example, assume that
k = 5. As there are only 30 super nodes, all of them should have the highest possible
maximum degree of 14. That means, up to 420 cluster nodes could be direct children of
super nodes. For k = 5, where each super node reserves degree for about five mesh neighbors, there are only about 270 direct children. This is also a possible
explanation for the slightly increasing loss rate. The router stress may also be affected
by a growing number of redundant transmissions created by DVMRP.
All in all, a rather low density of k = 2 or k = 3 yields the best results. For the
remaining experiments, we set either k = 2 or k = 3, depending on the number of
super nodes.
Scalability: DMMP has been designed for large-scale applications. Hence, the scalability is
a very important aspect. We measure the performance for 500 to 2,000 end-hosts. The
results are illustrated in Figure 4.4. The right panel shows the average stress. The
last-hop stress is always around two, just as Equation 4.5 implies. Unfortunately, the
router stress exhibits linear growth. In Section 4.1.3, we stated that the router stress
could be affected by the size of the underlying topology. We have repeated our measurements with 1,000 end-hosts and only 1,000 backbone routers (instead of 1,500) to

Figure 4.5: Distribution of the router stress

analyze how the size of the underlying network influences our measurements. We
observed a significantly higher router stress of about 3.9. Our other results for 1,000
backbone routers did not differ much from the results shown in Figure 4.4. In particular, the data path length is almost the same in both scenarios. This indicates that the
router stress measurements are warped by the low number of routers. Therefore, we
cannot draw conclusions about the large-scale router and link stress.
The results in the left panel indicate that the control overhead remains more or less
constant with increasing group size. Note that the number of super nodes has been
adapted to the group size, which may explain the variations of the control overhead
and data path length. With increasing group size, the depth of the data delivery tree
grows necessarily. Nevertheless, the data path length seems to grow very slowly. The
trends shown in the left panel indicate that DMMP has the potential to scale to larger
group sizes.
Stress distribution: For completeness, we include our analysis of the individual router stress.
Consider a star topology as an example. Physical links near the center experience very
high stress. This kind of traffic concentration is clearly undesirable. Figure 4.5 indicates that the DMMP traffic is rather well-dispersed. Most of the packets are handled
by routers with stress five or less. There are few routers with stress greater than 15,
and the maximum router stress is 30. However, as mentioned above, the stress could
be very different for a bigger and thus more realistic network. Figure 4.5 also shows
that about 100 routers have not forwarded any data at all. Note that the router stress
has been rounded, meaning that some of these routers actually have a stress between

Figure 4.6: Comparison of DMMP and NICE

zero and 0.5.


The results produced using the typical scenarios suggest that DMMP has the potential to
scale to large groups. In the next section, we will analyze if DMMP can achieve better performance than other application layer multicast approaches, by comparing DMMP with
NICE.

4.2.2 Comparison with NICE


We compare the performance of DMMP to the "aggregate results" reported in [7]. Consider
Figure 4.6. The left panel shows the average data path lengths for a varying number of end-hosts. The average data path length of DMMP seems to exhibit slow linear or logarithmic
growth, whereas the average data path length of NICE is shorter for 2048 hosts than for 1024 hosts.
It would be interesting to see if these trends continue for larger group sizes. We can also state
that the data paths are slightly longer for DMMP in all experiments. As our implementation
of DMMP is not optimized for distance, we think that this is acceptable. The promotion
mechanism of DMMP can lead to shorter clusters; however, this does not necessarily mean
that it shortens the data paths. It should reduce the number of end-to-end transmissions
though. The last-hop links are often performance bottlenecks. That means, the latency metric would presumably be more favorable to DMMP. The right panel compares the average
router and link stress of the two protocols. Clearly, the stress of DMMP is much higher and
increases much faster than the stress of NICE. Again, we attribute this mainly to the size of
the underlying network.
We have also measured the control overhead of DMMP. Note that this is the control overhead observed by the group members. In [7], the overhead observed by the routers has
been reported. It is also stated that, "since the control traffic gets aggregated inside the network, the overhead at routers is significantly higher than the overhead at an end-host". That
means, these results are not comparable. However, the control overhead at end-hosts is
briefly mentioned in [7] as well. It is 0.97 kilobit per second for a group of 128 members during the stabilize phase. The worst-case control overhead at end-hosts is proven to "increase
logarithmically with increase in group size". Initially, DMMP generated higher control overhead: the average control overheads shown in Figure 4.4 are all greater than 2 kilobits per
second. These results were achieved with tc = tm = 1.5 seconds though. (The parameters tc
and tm have been introduced in Section 3.2.6.) In the experiments with NICE, heartbeat messages were exchanged only every five seconds. For tc = tm = 5 seconds, we have measured
a significantly lower control overhead of about 0.5 kilobit per second in scenarios with 1,000
group members.
However, the heartbeat is not necessarily comparable to tc and tm . Remember that there
are five leave phases in the NICE scenarios. That way, the partition handling is evaluated
under extreme conditions. NICE recovers from the leave phases "within 30 seconds" [7]. It is
questionable whether DMMP can repair mesh partitions within 30 seconds with tm = 5 seconds. In
Figure 4.7, we compare the recovery times of DMMP for tc = tm = 1.5 seconds, tc = tm = 5
seconds, and for tm = 2.5 seconds and tc = 1.5 seconds. We recorded the current average
loss rate over all members (denoted on the vertical axis) every 10 seconds. Prior to the first
leave phase, the loss rate is almost zero. Then, 128 members leave within 10 seconds. Consequently, the loss rate rises to rather high values of around 0.1, meaning that every tenth
data message is lost. The individual peak values are not so significant since Figure 4.7 is the
result of only one simulation run for each scenario. As soon as the data topology is repaired,
the loss rate drops back to around zero. For tm = 5 seconds, this does not always happen
within 30 seconds. The loss rate is high from simulation time 700 to 740, and also from time
400 to 440. It seems that for tm = 2.5 seconds, the data topology is always repaired within
30 seconds. With tm = 1.5 seconds, the loss rates and recovery times are even lower. Since a constant
average control overhead of around 2 kilobits per second seems acceptable to us, we have
used this configuration for the typical scenarios.
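A back-of-the-envelope calculation shows why tm = 5 seconds is risky. The sketch assumes, hypothetically, that a super node declares a mesh neighbor inactive only after three consecutive missed refresh messages; the actual grace period in our implementation may differ:

```python
def worst_case_detection_time(t_refresh, missed_refreshes=3):
    # Time until a failed neighbor is detected; repairing the partition and
    # rejoining come on top of this, so detection is only a lower bound on
    # the recovery time.
    return missed_refreshes * t_refresh
```

Under these assumptions, tm = 5 seconds means detection alone may take 15 seconds, leaving little of the 30-second budget for the actual repair, whereas tm = 1.5 seconds bounds detection at 4.5 seconds.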
The results shown in Figure 4.6 have been produced with tc = 3.5 seconds and tm = 2.5 seconds. Note
that the control overhead is still comparable to NICE for small groups, and lower than the
control overhead generated by NICE for large groups. Moreover, the control overhead of
DMMP does not seem to grow with the number of end-hosts. However, it is important to
point out that our control overhead measurements disregard a number of aspects:
- We did not implement the refresh message exchange between relatives. Then again,
  this mechanism may also alleviate the overall control overhead by reducing the costs
  of rejoin attempts.
- The control overhead induced by the initial mesh construction has not been considered.

Figure 4.7: Effect of the refresh timeout on the loss rate

- We have not measured the overhead of RP queries.


- Bandwidth measurements have not been considered. Note that in order to support
  high-bandwidth applications, NICE may have to perform bandwidth measurements
  as well.


5 Conclusion
5.1 Lessons learnt
One important aim of this thesis was the comparison of DMMP and NICE with regard to
the stress metric. Unfortunately, our results are not comparable to the results reported in [7].
We can learn several things from this:
- Comparing the two protocols based on the reported results was not a very good idea
  to begin with. Complex simulation scenarios are difficult to reproduce using a different
  network simulator. Even if we had been able to simulate enough routers, the
  comparison would not have been completely fair. If at all possible, network protocols
  should be compared using the same network simulator. That way, much cleaner
  results can be produced.
- If we had considered the simulation scenarios and the accompanying hardware
  requirements earlier, we could have chosen a network simulator that provides a more
  suitable underlying network model. Alternatively, we could have come up with such
  a model ourselves.
- At the beginning of the implementation, we did not focus enough on debugging and
  verification. Later on, these tasks became very time-consuming. While early testing
  is always important when writing moderately complex software, it seems to be
  fundamental to complex network simulations.
Nevertheless, OMNeT++ turned out to be a good choice in principle, as it provides a consistent and convenient API.

5.2 Conclusions
In this thesis, we tried to evaluate whether DMMP can meet the requirements of high-bandwidth,
large-scale multimedia streaming, or whether the general concerns about the scalability of application layer multicast apply to DMMP. We also investigated whether DMMP can achieve
better performance than other application layer multicast approaches. We have conducted
network simulations to analyze the scalability and efficiency of DMMP, in particular the induced stress and control overhead, in comparison with NICE.
The stress measurements show that, given a small network topology, the stress of DMMP
grows linearly with the number of end-hosts. However, our results also suggest that the

average link stress depends greatly on the size of the underlying network. Therefore, we
cannot conclude that DMMP causes high link stress. In fact, we cannot estimate the behavior for a large network. Based on our other measurements, we can state that the control
overhead is low and hardly dependent on the group size. Our only concern with this result
is that we used a simplified version of DMMP. Some of the simplifications might affect the
control overhead. Similarly, the average data path length seemed to increase only slowly
with the group size. We conclude that DMMP has the potential to scale to large groups of
receivers, and that the two-tier architecture of DMMP is a promising approach.
In comparison with NICE, DMMP produces only slightly longer data paths, although
DMMP does not focus on minimizing the data path lengths. In addition, DMMP generates
less control overhead. Overall, the performance of DMMP is comparable to the performance
of NICE. Considering that several important features of DMMP have not been implemented
yet, this is a promising result.

5.3 Future work


First, we describe the future work on the DMMP module. Most of these issues have already
been pointed out in the previous chapters. Then we talk about possible enhancements to
DMMP itself.
The following features of DMMP need to be implemented in the future:
- The super nodes need to exchange proper routing updates. Based on those, they can
  decide to add or drop mesh links.
- The initial mesh should be constructed in a distributed manner instead of centrally.
- Each cluster member M should keep track of its relatives. If the parent of M leaves,
  the relatives can help M to rejoin the cluster. Currently, the control topology of the
  clusters is a tree.
- Super nodes cannot be allocated dynamically yet. Inactive super nodes should be
  replaced by one of their former children. Inactive non-super nodes should be replaced
  in the same way.
- Bandwidth measurements should be added.
- Gracefully leaving super nodes should remain in the group for some time so that the
  other super nodes can update their routing tables.
- The promotion algorithm should be revised because it is too complicated.
Using the revised DMMP module, the protocol behavior for larger numbers of end-hosts
and routers could be analyzed. So far, we do not have reliable results regarding the link

stress. It is also not evident how well the data path length scales. In Section 4.2.2, we
have conjectured that DMMP can achieve favorable latencies. This should be checked by
evaluating the latency of DMMP. DMMP could also be compared with other multicast architectures, for example with TOMA, which is claimed to be even more efficient than NICE.
In network simulations, certain member distributions and churn models are assumed.
Therefore, DMMP should be evaluated in a real network at some point.
There are numerous possible extensions to DMMP:
Security issues: So far, DMMP cannot handle malicious nodes. A malicious super node
could disrupt the data dissemination severely.
NAT and firewall support: Firewalls and NAT boxes are common in practice. For example,
[12] have implemented a multimedia streaming system. They report that initially 20
to 30 percent of the users had to be turned down because their system lacked firewall
and NAT support.
Overlay initialization details: The initial overlay construction may overload the RP. Several RPs could be used, but the details have not been thought out yet.
Shortening the data path length: Depending on the further analysis of the data path length,
it may be necessary to shorten the data paths, or to find other ways to improve the service quality experienced by the leaf nodes.
Peer-to-Peer concepts: Some successful peer-to-peer file sharing concepts, such as swarming or trackers, could be incorporated into DMMP. Conversely, DMMP
could be adapted to support additional applications, for example file sharing.


Bibliography
[1] OMNEST™ Simulation Environment. URL
http://www.omnest.com/.
[2] Tcl SourceForge Project. URL
http://tcl.sourceforge.net/.
[3] K. Almeroth. The Evolution of Multicast: From the Mbone to Inter-domain Multicast
to Internet2 Deployment. IEEE Networks, 2000.
[4] S. Bajaj, L. Breslau, D. Estrin, K. Fall, S. Floyd, P. Haldar, M. Handley, A. Helmy, J. Heidemann, P. Huang, S. Kumar, S. McCanne, R. Rejaie, P. Sharma, K. Varadhan, Y. Xu,
H. Yu, and D. Zappala. Improving Simulation for Network Research. Technical Report 99-702b, Information Sciences Institute, University of Southern California, 1999.
[5] A. Ballardie, P. Francis, and J. Crowcroft. Core Based Trees (CBT). In Proceedings of the
ACM SIGCOMM, pages 85–95, 1993.
[6] S. Banerjee and B. Bhattacharjee. A Comparative Study of Application Layer Multicast
Protocols. Work under submission, 2002.
[7] S. Banerjee, B. Bhattacharjee, and C. Kommareddy. Scalable Application Layer Multicast. In Proceedings of the ACM SIGCOMM, 2002.
[8] S. Banerjee, C. Kommareddy, K. Kar, B. Bhattacharjee, and S. Khuller. An Efficient
Overlay Multicast Infrastructure for Real-time Applications. Computer Networks, 50(6),
2006. Special Issue on Overlay Distribution Structures and their Applications.
[9] I. Baumgart, B. Heep, and S. Krause. OverSim: A Flexible Overlay Network Simulation Framework. In Proceedings of the 10th IEEE Global Internet Symposium, 2007.
[10] L. Breslau, D. Estrin, K. Fall, S. Floyd, J. Heidemann, A. Helmy, P. Huang, S. McCanne,
K. Varadhan, Y. Xu, and H. Yu. Advances in Network Simulation. Computer, 33(5):59–67, 2000.
[11] R. Carter and M. Crovella. Measuring Bottleneck Link Speed in Packet-Switched Networks. Technical Report BU-CS-96-006, Computer Science Department, Boston University, 1996.
[12] Y. Chu, A. Ganjam, T. Ng, S. Rao, K. Sripanidkulchai, J. Zhan, and H. Zhang. Early Experience with an Internet Broadcast System Based on Overlay Multicast. Technical Report CMU-CS-03-214, Carnegie Mellon University, 2003.

[13] Y.-H. Chu, S. G. Rao, and H. Zhang. A Case for End System Multicast. In Proceedings
of the ACM SIGMETRICS, 2000.
[14] S. Deering. Multicast Routing in Internetworks and Extended LANs. In Proceedings of
the ACM SIGCOMM, 1988.
[15] D. Estrin, M. Handley, J. Heidemann, S. McCanne, Y. Xu, and H. Yu. Network Visualization with the VINT Network Animator Nam. Technical Report 99-703b, Computer
Science Department, University of Southern California, 1999.
[16] E. Zegura et al. Modeling Topology of Large Internetworks. URL
http://www.cc.gatech.edu/projects/gtitm/, 2000.
[17] A. Helmy and S. Kumar. VINT. Virtual InterNetwork Testbed. URL
http://www.isi.edu/nsnam/vint/, 1997.
[18] J. Jannotti, D. Gifford, K. Johnson, M. Kaashoek, and J. O'Toole. Reliable Multicasting with an Overlay Network. In Proceedings of the 4th Symposium on Operating Systems Design and Implementation, 2000.
[19] G. Kesidis and J. Walrand. Quick Simulation of ATM Buffers with On-off Multiclass Markov Fluid Sources. ACM TOMACS, 3(3):269–276, 1993.
[20] B. Khumawala. An Efficient Branch and Bound Algorithm for the Warehouse Location Problem. Management Science, 18(12):B718–B731, 1972. Application Series.
[21] L. Lao, J.-H. Cui, and M. Gerla. A Scalable Overlay Multicast Architecture for Large-Scale Applications. Technical Report UCLA CSD 040008, Computer Science Department, University of California, Los Angeles, 2004.
[22] L. Lao, J.-H. Cui, and M. Gerla. TOMA: A Viable Solution for Large-Scale Multicast
Service Support. In Proceedings of the IFIP Networking, 2005.
[23] J. Lei, X. Fu, and D. Hogrefe. DMMP: A New Dynamic Mesh-based Overlay Multicast
Protocol Framework. Work in progress, not published yet.
[24] J. Lei, X. Fu, and D. Hogrefe. DMMP: A New Dynamic Mesh-based Overlay Multicast
Protocol Framework. In Proceedings of the 2007 IEEE Consumer Communications and
Networking Conference - Workshop on Peer-to-Peer Multicasting (P2PM 2007), Las Vegas,
Nevada, USA, 2007.
[25] J. Lei, X. Fu, X. Yang, and D. Hogrefe. A Dynamic Mesh-based Overlay Multicast Protocol (DMMP). Internet Draft, draft-lei-samrg-dmmp-02.txt, 2007.
[26] J. Lei, I. Juchem, X. Fu, and D. Hogrefe. Architectural Thoughts and Requirements Considerations on Video Streaming over the Internet. Technical Report ISSN 1611-1044, IFI-TB-2005-06, Institute for Informatics, Georg-August-Universitaet Goettingen, 2005.

[27] Z. Li and P. Mohapatra. HostCast: A New Overlay Multicasting Protocol. In Proceedings of the IEEE International Conference on Communications, 2003.
[28] S. Naicken, A. Basu, B. Livingston, and S. Rodhetbhai. A Survey of Peer-to-Peer Network Simulators. In Proceedings of the 7th Annual Postgraduate Symposium, 2006.
[29] D. Pendarakis, S. Shi, D. Verma, and M. Waldvogel. ALMI: An Application Level Multicast Infrastructure. In Proceedings of the 3rd USENIX Symposium on Internet Technologies & Systems, 2001.
[30] B. Premore and D. Nicol. Parallel Simulation of TCP/IP Using TeD. In Proceedings of
the Winter Simulation Conference, 1997.
[31] J. Saltzer, D. Reed, and D. Clark. End-to-End Arguments in System Design. ACM Transactions on Computer Systems, 2(4):195–206, 1984.
[32] Andrew S. Tanenbaum. Computer Networks. Prentice-Hall India, 4th edition, 2006.
[33] A. Varga. The OMNeT++ Discrete Event Simulation System. In Proceedings of the 15th European Simulation Multiconference, 2001.
[34] A. Varga. OMNeT++. Discrete Event Simulation System. User Manual. Version 3.2, 2005.
[35] D. Xu, M. Hefeeda, S. Hambrusch, and B. Bhargava. On Peer-to-Peer Media Streaming. In Proceedings of the 22nd International Conference on Distributed Computing Systems, pages 363–371, 2002.
[36] A. Young, J. Chen, Z. Ma, and A. Krishnamurthy. Overlay Mesh Construction Using
Interleaved Spanning Trees. In Proceedings of INFOCOM, 2004.
[37] B. Zhang, S. Jamin, and L. Zhang. Host Multicast: A Framework for Delivering Multicast to End Users. In Proceedings of the IEEE INFOCOM, 2002.

