Göttingen
Zentrum für Informatik
ISSN 1612-6793
Number ZFI-BM-2007-41
Bachelor's Thesis
in the study program "Angewandte Informatik" (Applied Computer Science)
Performance Evaluation of a
Novel Overlay Multicast Protocol
David Weiss
Forschungsgruppe für Computernetzwerke
Georg-August-Universität Göttingen
Zentrum für Informatik
Lotzestraße 16-18
37083 Göttingen
Germany
office@informatik.uni-goettingen.de
WWW www.informatik.uni-goettingen.de
I hereby declare that I wrote this thesis independently and used no sources or aids other than those indicated.
Göttingen, 15 November 2007
Bachelor thesis
Performance Evaluation of a
Novel Overlay Multicast Protocol
David Weiss
2007/11/15
Abstract
The demand for high-bandwidth media streaming over the Internet is growing. For large
groups of receivers, media streaming places a heavy burden on the network. IP Multicast
can alleviate this problem, but it is not widely deployed. In recent years, application layer
multicast and overlay multicast have been proposed as alternatives. However, there are still
concerns about the efficiency, scalability and deployment of these architectures.
In this thesis, a novel application layer multicast approach, called the Dynamic Mesh-based Overlay Multicast Protocol (DMMP), is evaluated. DMMP establishes an overlay network core consisting of super nodes, which are end-hosts with particularly high capacities. Each super node manages a cluster of non-super nodes. We use network simulations to analyze the performance of DMMP. For that purpose, we have implemented a DMMP module in OverSim, an overlay network simulation framework based on OMNeT++. We compare DMMP with NICE, a well-known application layer multicast protocol that is claimed to achieve low link stress and low control overhead. We experiment with groups of up to 2048 members.
Our results indicate that DMMP can achieve comparable service quality with less control
overhead, and that DMMP has the potential to scale to a high number of receivers.
Keywords: multicast, application layer multicast, overlay, media streaming, network simulation
Contents

1 Introduction
   1.1
   1.2 Thesis contribution
   1.3
2 Related work
   2.1 Network layer multicast
   2.2 Application layer multicast . . . 10
      2.2.1 . . . 10
      2.2.2 Basic concepts . . . 11
      2.2.3 Categories . . . 13
      2.2.4 HMTP . . . 15
      2.2.5 Narada . . . 17
      2.2.6 NICE . . . 19
      2.2.7 TOMA . . . 21
      2.2.8 Similar protocols . . . 23
   2.3 DMMP . . . 24
      2.3.1 Overview . . . 24
      2.3.2 Details . . . 27
3 Implementation . . . 31
   3.1 . . . 31
      3.1.1 . . . 32
      3.1.2 nsnam . . . 34
      3.1.3 OMNeT++ . . . 35
      3.1.4 . . . 37
   3.2 DMMP simulation . . . 39
      3.2.1 Extending OverSim . . . 40
      3.2.2 Components . . . 42
      3.2.3 Overlay construction . . . 44
      3.2.4 Join algorithm . . . 46
      3.2.5 Data delivery . . . 49
      3.2.6 . . . 49
      3.2.7 . . . 52
      3.2.8 Self-improvement . . . 56
      3.2.9 . . . 59
4 Performance evaluation . . . 61
   4.1 Evaluation methodology . . . 61
      4.1.1 Performance metrics . . . 61
      4.1.2 Simulation scenarios . . . 65
      4.1.3 Expected results . . . 68
   4.2 Results . . . 70
      4.2.1 Typical scenarios . . . 70
      4.2.2 . . . 75
   4.3 Summary
5 Conclusion . . . 79
   5.1 Lessons learnt . . . 79
   5.2 Conclusions . . . 79
   5.3 Future work . . . 80
Bibliography . . . 82
1 Introduction
Multicast is the delivery of a message to several destinations. On the Internet, multicast can be achieved by sending a distinct message to each individual destination. However, if there are many destinations, this approach imposes a high load on the network and the source. Due to the growing number of applications for multicast, network layer multicast was introduced in the late 1980s [3]. In this architecture, the source sends a single packet to a multicast address. The routers replicate that packet and deliver it to each destination. That way, redundant transmissions are avoided.
Deployment of network layer multicast requires changes to every router. In addition, there are concerns about its scalability, along with a number of technical issues. Today, network layer multicast has still not been widely deployed. Nevertheless, there is a growing demand for efficient multicast service. In particular, a scalable architecture for multimedia streaming applications such as IP television is needed. Consequently, application layer multicast has been proposed as an alternative to network layer multicast in recent years. In this architecture, the participating end-hosts are organized into a (virtual) overlay network. Typically, the topology of the overlay network is a mesh. Messages are distributed via a subtree of the mesh; that is, some of the destinations forward the received messages to other destinations. Application layer multicast is transparent to the routers: no changes to the routers are necessary. Therefore, application layer multicast can be deployed much more easily than network layer multicast. However, it is questionable whether application layer multicast can achieve comparable efficiency, especially for large multicast groups. One important problem with application layer multicast is the maintenance of
the overlay network. As end-hosts can leave the multicast group at any time, overlay links
have to be added and dropped constantly. A variety of concepts has been proposed to tackle
this problem, and to improve the overall performance of application layer multicast. Some
approaches establish an overlay core consisting of statically placed network infrastructure.
This is often referred to as "overlay multicast". Simulation experiments in [22] indicate that
overlay multicast can achieve a performance that is comparable to network layer multicast.
On the other hand, it is costly to deploy the overlay core, and it also lacks flexibility.
This thesis studies a Dynamic Mesh-based Overlay Multicast Protocol (DMMP). DMMP
has been specified in an Internet draft [25], which is currently under revision. In DMMP,
the overlay core consists of end-hosts with especially high capacities, called super nodes.
Ideally, super nodes should have more bandwidth than other end-hosts, and they should
be more stable. While non-super nodes are organized in clusters with a tree topology, the
overlay core is a mesh. That way, DMMP creates a stable, efficient overlay core without
using static infrastructure. We compare DMMP with NICE [7], another application layer
multicast protocol. NICE organizes the end-hosts into a hierarchy. Simulation experiments
indicate that NICE scales to large groups and imposes a relatively low load on the network.
Initial mathematical analysis [24] suggests that DMMP can achieve slightly better performance than NICE. As DMMP and NICE are complex protocols, a theoretical analysis has to
be based on a number of assumptions which may or may not hold in practice. Then again,
it is also difficult to analyze the behavior of DMMP for thousands of network nodes using
testbeds.
We use network simulations to evaluate the performance of DMMP and compare it to
the performance of NICE. For that purpose, we incorporate a DMMP implementation into
the OverSim [9] network simulation framework, which is based on the OMNeT++ discrete event simulation system [33]. As we do not have access to a NICE simulation implementation, we have to rely on reproducing the simulation setup reported in [7].
It goes without saying that the multicast architecture needs to be deployable. Dedicated infrastructure or changes to routers can be problematic.
The resource consumption needs to be low. As the source sends at a high bitrate,
redundant transmissions should be avoided.
Few packets should be lost or delayed, so that a high service quality can be achieved.
Based on [6] and the above considerations, we briefly introduce the performance metrics for
DMMP:
The control overhead is the amount of traffic that is caused by establishing and maintaining the multicast tree.
The stress of a link is the number of identical data messages sent over that link. If the
stress of a link is greater than one, there are redundant transmissions.
The stretch of a member refers to the length of the data path from the source to the
member, compared with the length of the direct unicast path.
The loss rate is the ratio of the number of lost packets to the number of packets that
should have been received.
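To make these metrics concrete, the following sketch computes link stress, stretch, and loss rate for a tiny invented topology. The node names, hop counts, and packet counts are illustrative assumptions, not taken from the thesis:

```python
# Physical links traversed by each overlay hop (source S sends to A and B
# via a shared router r1; all names are hypothetical).
overlay_hops = {
    ("S", "A"): [("S", "r1"), ("r1", "A")],
    ("S", "B"): [("S", "r1"), ("r1", "B")],
}

def link_stress(hops):
    """Stress of a physical link = number of identical copies sent over it."""
    stress = {}
    for path in hops.values():
        for link in path:
            stress[link] = stress.get(link, 0) + 1
    return stress

def stretch(overlay_path_len, unicast_path_len):
    """Stretch of a member: overlay data path length vs. direct unicast path."""
    return overlay_path_len / unicast_path_len

def loss_rate(received, expected):
    """Ratio of lost packets to packets that should have been received."""
    return 1 - received / expected

stress = link_stress(overlay_hops)
print(stress[("S", "r1")])   # link S-r1 carries two identical copies: stress 2
print(stretch(4, 2))         # overlay path twice as long as unicast: stretch 2.0
print(loss_rate(90, 100))    # 90 of 100 packets received: 10% loss
```

A stress of 2 on link S-r1 is exactly the kind of redundant transmission that network layer multicast would avoid.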
multicast briefly, and concentrate on application layer multicast. Finally, DMMP is introduced. Chapter 3 points out the benefits of network simulations, both in general and specifically for the analysis of DMMP. After that, the implementation of the DMMP module is documented. We focus on describing the exact algorithms, deviations from [25], and application-specific parameters. OverSim and OMNeT++ are described as well. The evaluation methodology of the simulation experiments is described in Chapter 4.
We present our results, describe their implications, and try to explain them. In Chapter 5,
we summarize our experiences with the network simulator, draw our conclusions, and list
possible future work.
2 Related work
This chapter gives an overview of previously proposed multicast architectures. We start
with network layer multicast (Section 2.1). Then DMMP and other application layer multicast approaches are described in Sections 2.2 and 2.3.
[Figure 2.1 appeared here: an example overlay topology in which nodes A, B and C form the overlay core and D, E and G are further end-hosts.]
redundant transmissions and low control overhead) and high service quality (low loss rate and low latency). In ALM, some redundant transmissions are inevitable. Furthermore, end-hosts have little knowledge about the underlying network topology. Without such knowledge, data delivery paths may be longer than with network layer multicast. Nevertheless, it is possible to obtain some information about the underlying topology. For example, the end-to-end delay between two hosts can be determined by measuring packet round trip times. However, such a method is not reliable, and it generates high control overhead as well.
Additionally, end-hosts are not as stable as routers. End-hosts leaving the multicast group
may partition the data delivery tree. Then other end-hosts cannot receive data until the tree
is repaired. Detecting and repairing these partitions increases the control overhead.
Apparently, the performance-related aspects require careful consideration. For now, we can state that one design principle of the Internet has so far been to implement relatively little intelligence in the network, so that most functions (for example congestion control) are provided only by end-hosts. This end-to-end principle has worked well in the past.
members directly interact with each other, and each overlay link corresponds to a path in the
underlying topology. The construction of the overlay topology is an important task of ALM
protocols. [6] distinguishes the data topology from the control topology. The data topology
determines who receives data from whom. In Figure 2.1, the link from A to E means that E
receives data from A. The data topology is a tree.1 Additionally, members exchange control
messages regularly, mostly for maintaining the data topology. Unlike routers, end-hosts may leave the group at any time. An end-host that leaves after a short time is called a transient node. Ideally, a leaving member should notify some remaining members about its departure; we refer to this as graceful leaving. However, members may also leave without notice, for example due to a software crash; we call this ungraceful leaving. In general, we refer to membership changes (join and leave) as churn. Either way, the data topology is likely
to become partitioned when members leave. Control messages are needed to establish and
improve the data topology as members join, and to detect and repair partitions. Links in the
control topology indicate exchange of control messages. Usually the control topology is a
mesh and the data topology is a subtree of the control topology.
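The relationship between the two topologies can be illustrated with a small sketch. The mesh below is invented, and a breadth-first traversal stands in for whatever routing algorithm a concrete ALM protocol actually uses; the point is only that the resulting data topology is a subtree of the control topology:

```python
from collections import deque

# Hypothetical control topology: each member and its mesh neighbors.
mesh = {
    "S": ["A", "B"],
    "A": ["S", "B", "C"],
    "B": ["S", "A", "C"],
    "C": ["A", "B"],
}

def data_tree(mesh, source):
    """Derive a data topology from the mesh: BFS gives each member one parent."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in mesh[node]:
            if neighbor not in parent:   # first path found becomes the tree edge
                parent[neighbor] = node
                queue.append(neighbor)
    return parent

tree = data_tree(mesh, "S")
# Every tree edge is also a mesh edge, so the data topology
# is a subtree of the control topology.
assert all(p is None or child in mesh[p] for child, p in tree.items())
```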
Once end-hosts have joined the control topology, they can exchange control messages via the mesh. When end-hosts initially join, however, they have to find a suitable position in the control topology without being part of it yet. Therefore, ALM protocols need a way to bootstrap. There is virtually always a rendezvous point (RP) that is assumed to be globally known, and
that can provide newly joining hosts with (for example) the root of the data delivery tree or
some random members. Note that the RP in CBT performs similar tasks, see [5]. In a way,
the RP introduces a single point of failure and a potential performance bottleneck. However, as in CBT, the RP could be replicated at the cost of additional complexity. Similarly,
several groups could share an RP. Furthermore, an RP failure does not disrupt the data dissemination (in contrast to CBT); ALM can tolerate a temporarily unavailable RP [37]. As the
bootstrapping procedure is similar in all ALM architectures, we do not pay much attention
to it.
A good overlay network should have the following properties in order to meet the requirements listed in Section 1.1:
1. The overlay should resemble the underlying topology. This will lead to low latency
and fewer redundant transmissions on physical links.
2. There should be little control traffic. It is important that the control traffic does not
limit the scalability.
3. The loss rate should be low; that is, members should miss as few data packets as
possible.
1 Each member M may receive the same data from several other members, but we consider only one of those members the parent of M. That is, we disregard the redundant transmissions. Consequently, the data topology is a tree. This is consistent with [14] and [32], which use the terms "shortest reverse-path tree" and "tree built by reverse path forwarding" in the context of DVMRP.
4. The overlay should be resilient to churn, failures, and changes of the underlying topology.
5. The available resources, especially upstream bandwidth, may be insufficient. End-hosts have heterogeneous capacities; for example, a member with low upstream bandwidth can only support a small number of children in the data delivery tree. Such degree constraints have to be considered. Moreover, the burden of data and control traffic should be fairly shared among the members.
2.2.3 Categories
In this section, we categorize ALM approaches based on the following questions: (1) Are there statically placed proxies? (2) Is the data topology derived from the control topology, or is it the other way around? (3) Is the overlay constructed in a centralized or a distributed manner? (4) How many members are allowed to send to the group?
1. Overlay multicast has been introduced briefly in Section 2.2. In this type of architecture, proxies, which can be routers or end-hosts, take over some group management functions and form an overlay core. In Figure 2.1, nodes A, B and C are proxies and constitute the overlay core. In approaches without proxies, by contrast, the participating nodes act more like peers. Overlay multicast can construct a highly efficient and stable overlay core [22]: proxies are placed statically with full information about the underlying topology available, and they can be assumed to be more stable than normal end-hosts. They may also have more resources, such as memory, computation power and, most importantly, upstream bandwidth. Therefore, they can be used to relieve less capable members of control traffic. Unfortunately, the deployment of overlay multicast is difficult: somebody (a multicast service provider) has to place and maintain the proxies. The bandwidth used by a proxy, which is expected to be high compared to that of a normal end-host, can also be costly. Dimensioning the overlay core, which includes choosing the number of proxies and their locations, is problematic too. Usually, bandwidth has to be purchased in advance. These decisions require the service provider to estimate the number of users, their geographical distribution, and churn. Over-dimensioning the core wastes money, but if capacities are insufficient at runtime, users will have to be turned down. In any case, it is difficult or impossible to adapt the overlay core to dynamic changes; overlay multicast lacks flexibility. Finally, it may be difficult to bill the users. An example of an overlay multicast architecture is TOMA, which will be discussed in Section 2.2.7.
2. The tree-first approach: This approach is intuitive: the participating nodes are first organized into a shared tree for data delivery. The control topology is then constructed by adding some links to the tree.
A side note about the terminology: consider node G in Figure 2.2. Its parent is C and its children are L and M. We call C's parent A the grandparent of G; G is a grandchild of A. All nodes on the same tree level as G are G's siblings. All nodes on the same level as G's parent, including C, are G's parent-level nodes. All nodes on the same level as G's children, including L and M, are G's child-level nodes.
One advantage of this approach is that it is usually sufficient for each member to maintain state about some "relatives" in the tree. As long as the topology does not change, the control overhead is rather low. However, the tree-first approach lacks resilience: tree partitions usually take a long time to repair, and joining procedures tend to be costly. The tree-first approach is also vulnerable to loops in the data topology. More importantly, it is hardly suitable for latency-sensitive applications with many sources (for example games), because it builds a shared data delivery tree. An example of such an approach is HMTP, which will be further examined in Section 2.2.4.
The mesh-first approach: With this approach, the control topology (a mesh) is constructed first. Then the data topology is determined by running a routing algorithm on top of the mesh. Typically, the data topology is a source-specific tree. For example, Narada, which will be described in detail in Section 2.2.5, uses DVMRP to (implicitly) construct the data delivery trees. The mesh-first approach tends to offer higher resilience than the tree-first approach. While source-specific trees should reduce latencies compared to the tree-first approach, they are not optimal either: their quality is limited by the richness of the control topology [6]. Depending on the routing algorithm, some redundant end-to-end transmissions are possible. This makes the mesh-first approach less suitable for high-bandwidth applications.
3. ALM approaches can be further categorized by considering the way that the overlay is
built. We distinguish centralized from distributed overlay construction. This concept is
similar to the distinction of centralized and decentralized network layer routing. For
example, an end-host that joins the NICE hierarchy finds an appropriate position by
itself. The RP, as a central entity, only provides the address of the root. This is distributed overlay construction. In contrast, TOMA clusters are constructed centrally
by the proxies. Then the cluster nodes are informed of their neighbors and start exchanging control messages with them. The centralized approach imposes a high load
on the node that constructs the overlay; this central entity may become a performance
bottleneck. Then again, a centrally constructed topology may have better properties,
as more knowledge of the underlying topology is available.
4. A fourth classification criterion for ALM architectures is the number of sources. A single-source approach allows only one group member to send to the group (one-to-many). It is easier to optimize the overlay for one source, and this approach also simplifies concepts and implementation. Many-to-many applications like chat or games require every member to be a source, though. However, in [18] it is pointed out that "many applications which appear to need multi-source multicast, such as a distributed lecture allowing questions from the class, do not". Additional senders, for example students that ask questions, can send their data to the source; the source then broadcasts it to the group.
Join algorithm: A recursive algorithm is used. A node J that wishes to join the multicast
group first contacts the root R of the data delivery tree. R responds with a list of its
children. When J receives a list l of nodes from their common parent P, it determines
its end-to-end delay to all of them. A short delay to a node X means that X is probably
located nearby in the underlying topology and would be a suitable parent in the data
delivery tree. If the delay to all l-nodes is higher than the delay to P, then J sends a
join request to P, asking to be added as a child. Otherwise, J queries the "nearest"
l-node for a list of its children. In a nutshell: joining nodes are relayed down the data
delivery tree. This algorithm assures that members with short end-to-end delay are
"clustered together" [37], so that the overlay resembles the underlying topology. Note
that join requests may be denied, for example because the desired parent P does not
have enough upstream bandwidth to support another child. In this case, the joining
member sends a join request to one of P's children. If there are no children, the joining
node may be forced to "go up the tree" again to explore a different branch. Joining
a large group may take a long time. Therefore the authors of [37] suggest a "foster
child" mechanism: the joining node is attached to a temporary parent so that it can
start receiving data quickly.
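The recursive walk down the tree described above can be sketched as follows. The tree and the delay values are invented, and `delay()` stands in for a real round-trip-time measurement; denied join requests and the foster-child mechanism are omitted:

```python
# Hypothetical end-to-end delays from the joining node J to each tree node.
delays = {"R": 80, "A": 30, "B": 90, "L": 25, "M": 70}

# Hypothetical data delivery tree rooted at R.
children = {"R": ["A", "B"], "A": ["L", "M"], "B": [], "L": [], "M": []}

def delay(node):
    """Stand-in for measuring the RTT between J and `node`."""
    return delays[node]

def join(root):
    """Relay the joining node down the tree toward the 'nearest' member."""
    node = root
    while True:
        kids = children[node]
        if not kids:
            return node                   # leaf reached: attach here
        nearest = min(kids, key=delay)
        if delay(nearest) >= delay(node):
            return node                   # current node is closest: attach here
        node = nearest                    # otherwise descend to the nearest child

print(join("R"))  # J is relayed R -> A -> L, so 'L' becomes its parent
```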
Partition handling: A gracefully leaving node informs its parent and children. Each node
exchanges refresh messages with its parent and children periodically to detect ungraceful leaves. The control topology is very limited; if a member X detects that its parent P
is inactive, then X cannot notify its siblings or its grandparent because it has no knowledge about them. Any inactive non-leaf node partitions the data delivery tree: there is
one partitioned branch rooted at each child of the inactive node. Those branches stay
intact while their roots rejoin. Rejoining is done in "reverse order", and is usually not
as time-consuming as joining initially: each member keeps track of the path from itself
to the root (root path). A rejoining node first contacts its former grandparent and then
works its way up the tree. That way, nodes that are known to be near are contacted
first. The root paths are also used to improve the tree over time, and help to avoid and
detect loops. They are updated as follows: when a node P receives a refresh message
from one of its children C, then P responds with its own root path. C adds P to the
path in order to update its root path.
Self-improvement: As nodes leave the group, their parents can accept new children. For
the remaining nodes, this means more suitable positions in the tree may become available over time. Each node X looks for nearer parents periodically, starting with a
random node in the root path. The exact algorithm is similar to the join algorithm
described above. If a potential parent P with significantly shorter end-to-end delay
is found, X changes its parent. That is, the subtree rooted at X is attached to P. As
delay measurements are influenced by cross traffic, there is a threshold for changing
the parent. This avoids changing the parent back and forth ("oscillation").
On the one hand, HMTP is an intuitive approach that avoids redundant end-to-end transmissions. With a single source as the root, low latency can be achieved. On the other hand, all disadvantages of the tree-first approach, as described in Section 2.2.3, apply. In particular, HMTP is vulnerable to membership changes, as its control topology is sparse. Note that HMTP assumes that some of the members are (stable) routers. With many sources, the shared tree approach is suboptimal. The join algorithm is another issue, as it induces high control overhead.
2.2.5 Narada
The Narada protocol was proposed and evaluated in [13] to show that ALM can achieve
acceptable performance (compared with network layer multicast). As for the categories
that have been introduced in Section 2.2.3: Narada is an example of a (1) peer-to-peer, (2)
mesh-first approach that is (3) fully distributed and (4) allows any number of sources. It is
intended for rather small groups; reasonable performance has been shown by simulation
experiments with up to 128 members. Experiments in [7] indicate that the scalability of
Narada is limited mostly by the produced control overhead. Although the control topology
is not a complete graph, all members have knowledge about each other. Below, we will
describe Narada in greater detail because some aspects of DMMP are very similar.
Data delivery: The data topology is built by running a distance vector protocol similar to
DVMRP on top of the mesh, which produces source-specific spanning trees. The distance metric is application-specific. The count-to-infinity problem is solved by exchanging minimum distance paths among the members. Note that such a routing
protocol leads to some redundant transmissions: there is up to one transmission per
overlay link.
Join algorithm: A joining end-host sends a join request to some random group members
and attaches itself to the first responding member. That member should be near in
the underlying topology, as its response arrived quickly. However, only a potentially
small number of randomly chosen members is contacted, and therefore the initial
mesh position of a joining end-host may be unfavorable. As for degree constraints:
a member that does not have enough available capacity can choose not to respond to join
requests.
Partition handling: Members that wish to leave the group are supposed to advertise routing updates indicating a large distance to all destinations for some time. This allows
the remaining members to adapt their routes. That way, packet loss can be avoided.
Gracefully leaving members should also notify their neighbors. As in most other ALM
protocols, members exchange refresh messages periodically to detect ungraceful leaves. Routing updates are combined with the refresh messages. If a member X leaves ungracefully, its neighbors do not receive any more refresh messages. Narada
[Figure 2.3 appeared here: an example mesh in which members C, D and E are connected to the rest of the group only through member A.]
does not assume reliable message transport; refresh messages can be lost occasionally.
For that reason, each neighbor independently sends X a probe message to verify that
X is inactive. As X has left, it does not respond to the probe messages. Its neighbors
assume that X is inactive. Inactive end-hosts are removed from routing tables; that
means, the data delivery tree can adapt quickly. Leaving members can also partition
the mesh though.
It is relatively difficult to detect and repair mesh partitions. In Figure 2.3, consider that
member A leaves the group, which is noticed by its neighbors B and C. Members C,
D and E cannot receive data until the partition is detected and repaired. Partitions are
detected as follows. Each member stores in a table all other members' IP addresses
and a timestamp indicating the last time it heard of them. Those tables are exchanged
among neighbors as part of the periodic refresh messages. (Additionally, a sequence
number is stored in the table. That way, timestamps do not have to be included in
refresh messages.) Upon receiving refresh messages, members update their tables. If
the table entry for a member M is not updated for some time, a partition is likely. The
partition can be repaired quickly by adding a link to M. However, there is a problem
with this approach. If there is in fact a partition, several members may detect it at the
same time and more links than necessary will be added. Narada uses a randomized
algorithm to decide if a link should be added. If there is no update for at least Tmin
seconds, a link may be added. A link is guaranteed to be added after time Tmax . The
algorithms for partition detection and repair are described in more detail in Section
3.2.7.
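The randomized repair decision can be sketched as follows. The linear probability schedule between Tmin and Tmax and the threshold values are illustrative assumptions made for this sketch, not the exact algorithm from [13]:

```python
import random

# Hypothetical thresholds (seconds): no repair before T_MIN,
# repair guaranteed once T_MAX has passed.
T_MIN, T_MAX = 30.0, 120.0

def should_add_link(silence, rng=random.random):
    """Decide whether to add a link to a member not heard from for `silence` seconds."""
    if silence < T_MIN:
        return False          # table entry still fresh: no partition suspected
    if silence >= T_MAX:
        return True           # a link is guaranteed to be added after T_MAX
    # In between, add the link with a probability that grows with the silence,
    # so that usually only one of several detecting members reacts and
    # unnecessary duplicate links are avoided.
    p = (silence - T_MIN) / (T_MAX - T_MIN)
    return rng() < p

print(should_add_link(10.0))   # fresh entry: no repair
print(should_add_link(200.0))  # long silence: repair guaranteed
```

The randomization is what prevents all detecting members from adding redundant repair links at the same time.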
Self-improvement: The mesh repair algorithm and the join algorithm do not pick mesh
links carefully. This leads to long data delivery paths and high latency. Additionally,
unnecessary mesh links are maintained. The mesh can improve its quality over time
by distributedly adding and dropping links.
Adding links: Periodically each member X chooses another member Y at random and
requests its routing table. Based on that, X computes how much a link from X to Y
would shorten the distance from X to all other members. The resulting number
is called the utility of that link. If the utility is greater than a threshold, the link is
added.
Dropping links: Each member periodically determines its least useful incident link.
This is done by computing the consensus cost for each link. This value depends
on the number of destinations for which the link is used as the outgoing interface. If the consensus cost of the least useful link is below a threshold, the link
is dropped. The exact algorithm is provided in [13]; it guarantees that the mesh does not become partitioned when a link is dropped.
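As a rough sketch of the utility computation for adding a link: the numbers below are invented, and the actual metric in [13] is more involved (it weighs relative latency gains), whereas this sketch merely counts the destinations that would get closer:

```python
# X's current routing distances and Y's advertised routing table
# (hypothetical members A, B, C and delay values).
dist_from_x = {"A": 5, "B": 7, "C": 9}
dist_from_y = {"A": 1, "B": 6, "C": 2}
delay_x_y = 2  # measured delay of the candidate link X-Y

def utility(dist_from_x, dist_from_y, delay_x_y):
    """Count destinations that a direct link to Y would bring closer to X."""
    gain = 0
    for member, current in dist_from_x.items():
        via_y = delay_x_y + dist_from_y[member]
        if via_y < current:
            gain += 1
    return gain

print(utility(dist_from_x, dist_from_y, delay_x_y))  # -> 2 (A and C get closer)
```

If this utility exceeds the threshold, X adds the link to Y; dropping links uses the analogous consensus-cost computation described above.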
For small groups, Narada achieves low latency given some time to stabilize. Simulation experiments in [13] indicate that the latency is almost comparable to network layer multicast. Narada accomplishes this for any number of sources and adapts well to dynamic changes. It is not designed for large groups, though, and therefore it does not scale well: [7] states that "Narada has O(n^2) aggregate control overhead".
2.2.6 NICE
The NICE4 protocol has been proposed in [7]. We study NICE because we see it as a main contender of DMMP. NICE is motivated by the shortcomings of Narada. As mentioned in Section 2.2.5, Narada does not support large groups. Its high control overhead is particularly unreasonable for low-bandwidth applications, where control traffic may consume more bandwidth than the application data. NICE, in contrast, is designed with such applications in mind; examples are Internet radio or stock market tickers. The application characteristics can be summarized as: (1) large groups (thousands of members); (2) multiple sources; (3) low-bandwidth traffic; (4) only soft real-time requirements, i.e. timely delivery is desirable but slightly delayed data is still usable; (5) loss tolerance.
NICE is a fully distributed peer-to-peer approach. The concept is to establish and maintain a hierarchical control topology. The data delivery algorithm results from this topology;
the data topology is implicitly defined by the properties of the control topology. Therefore
NICE is classified as an implicit approach (as opposed to the tree-first or mesh-first approach)
in [7]. The NICE hierarchy consists of several levels, and on each level there are several clusters. The group members are organized in such clusters. Each member belongs to one cluster
on the lowest level (level 0) and may belong to other clusters on higher levels. Each cluster
has a size between k and 3k−1, where k is typically small, for example three. This invariant is enforced by merging small clusters and splitting big ones. Each cluster has a leader,
which should ideally be the topological center of the cluster. Clusters on level n (n ≥ 1) are
comprised of the cluster leaders of level n−1. Consequently, there is only one cluster on
the highest level. The leader of this cluster (the root of the hierarchy) is either the RP or the
4 NICE is a recursive acronym and stands for "NICE is the Internet Cooperative Environment".
source (for a single-source application). The root is a member of O(log(n)) clusters and has
the highest control traffic. The control topology inside the clusters is a complete graph.
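The cluster-size invariant can be made concrete with a small sketch. The split into two halves follows the description in [7]; the exact partitioning used by NICE also considers member distances, which is omitted here:

```cpp
#include <cassert>
#include <utility>

// NICE cluster-size invariant: every cluster holds between k and 3k-1 members.
bool clusterSizeOk(int size, int k) {
    return size >= k && size <= 3 * k - 1;
}

// When a cluster exceeds 3k-1 members, its leader splits it into two halves;
// both halves satisfy the invariant again (each has at least about 1.5k members).
std::pair<int, int> splitCluster(int size) {
    return { size / 2, size - size / 2 };
}
```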
Data delivery: The source provides all nodes in all clusters it belongs to with data. When a
member receives data, it does the same, except that it does not forward to the member
it received the data from. That means, for each cluster, there is a core-based data
delivery tree and the maximum number of forwarded packets is in O(log(n)) for each
member.
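The forwarding rule can be sketched as follows. Cluster membership is represented as plain sets of node ids; this is an illustration of the idea, not the NICE implementation:

```cpp
#include <cassert>
#include <set>
#include <vector>

// Implicit data delivery in NICE: a member that receives a packet in
// cluster `fromCluster` forwards it to the peers of all *other* clusters
// it belongs to; a source uses fromCluster = -1 and sends everywhere.
std::set<int> forwardTargets(const std::vector<std::set<int>>& myClusters,
                             int fromCluster, int self)
{
    std::set<int> targets;
    for (int c = 0; c < (int)myClusters.size(); ++c) {
        if (c == fromCluster)
            continue;                  // never send back into the arrival cluster
        for (int peer : myClusters[c])
            if (peer != self)
                targets.insert(peer);
    }
    return targets;
}
```

For example, a leader belonging to a level 0 cluster and a level 1 cluster that receives data in its level 0 cluster forwards only to its level 1 peers.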
Join algorithm: Nodes join by contacting the root, which responds with a member list of
the top level cluster. The joining node determines the distance to each of these members (e.g. by measuring the round trip time). Then the nearest member is asked for its
member list. In this way, the joining node is redirected until it receives a member list of a
level 0 cluster, which it joins; thus, the node ends up in the nearest cluster. Joining takes a
long time and requires O(k log(n)) queries, as every joining node is "passed" down the hierarchy
to level 0. In contrast, joining HMTP nodes can obtain a higher position if they find
a suitable parent. Therefore, [7] suggests "peering" joining nodes temporarily to allow
them to receive data. This is identical to the foster child mechanism of HMTP.
Newly joined nodes belong to a level 0 cluster. They can become members of higher
level clusters later because cluster leaders change over time. When a cluster is joined
by a new member, the center of the cluster may change. In this case, the new center
becomes the cluster leader. The NICE protocol requires the center of a cluster to be the
cluster leader; this is an important invariant. Note that cluster leaders are not chosen
based on their available resources such as upstream bandwidth. In particular, degree
constraints are not considered. NICE is designed for low bandwidth applications, so
most members can be assumed to support a high number of children in the data delivery tree. However, for high bandwidth applications, this is problematic. As members
join, clusters grow beyond the maximum cluster size. In this case, the cluster leader
splits the cluster, dividing the other members into two new clusters. It also chooses
leaders for both clusters.
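One step of the join descent above can be sketched as a nearest-member selection; `rtt` stands in for a real round-trip-time measurement, and the member list is a plain vector:

```cpp
#include <cassert>
#include <limits>
#include <vector>

// One step of the NICE join descent: among the members of the current
// cluster, pick the nearest one (smallest measured RTT); the joining node
// then asks that member for a member list one level below, until level 0.
struct Member { int id; double rtt; };

int nearestMember(const std::vector<Member>& cluster)
{
    int best = -1;
    double bestRtt = std::numeric_limits<double>::infinity();
    for (const Member& m : cluster)
        if (m.rtt < bestRtt) { bestRtt = m.rtt; best = m.id; }
    return best;
}
```

Repeating this step once per level gives the O(k log(n)) query bound: O(log(n)) levels, with up to 3k−1 distance measurements per level.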
Partition handling: NICE detects ungraceful leaves in the same manner as HMTP. NICE's
refresh messages are called "heartbeat" messages. When a cluster leader leaves, the
remaining members negotiate a new leader. If the cluster size falls below the minimum
cluster size, its leader contacts the leader of the nearest cluster on the same level of the
hierarchy, and the clusters are merged.
Self-improvement: Each member X regularly measures the distance to leaders of foreign
clusters. If the distance to the leader of a foreign cluster C is smaller than the distance
to the current cluster leader, X moves to the cluster C.
Simulation and testbed experiments in [7] with up to 2048 members suggest that NICE scales
well and thus supports groups with thousands of members; its average aggregated control
It is worth mentioning that the exact algorithms concerning cluster leadership changes were not
described in detail above, and are in fact rather complex.
[Figure: a TOMA multicast overlay network (MSON) with proxies P1, P2, P3, and P4. The
white clusters belong to group F. It shares an aggregated tree with group G, which
consists of the gray clusters. The tree is not optimized for G; data targeted at G is
sent to P2 although P2 has no members of G. This may waste bandwidth. OLAMP
defines rules for deciding if G should have an own tree instead. Such a tree would be
computed using a multicast routing algorithm.
Cluster management: Each cluster is attached to one proxy. The proxy stores information
about all cluster members and arranges them into a core-based tree rooted at the proxy.
The cluster members periodically measure the round trip time to their parent and
children, and send the results to the proxy. In turn, the proxy informs the members
of topology changes. Members also exchange probe messages periodically to detect
ungraceful leaves. When a tree-partition is discovered, the proxy is asked to repair it.
The data delivery works as in HMTP or any other tree-first approach. The difference is
that the proxy forwards received data to the rest of the MSON, along the aggregated
tree. Proxies that receive data from the MSON forward it to the local cluster.
Deployment: As mentioned in Section 2.2.3, the deployment of overlay multicast tends to
be difficult. [22] suggests the following model: The multicast service provider (which
may additionally offer other Internet services) sets up the MSON and estimates its
usage. Then it purchases the necessary bandwidth from a network provider. The
initiators of multicast groups pay for the multicast service and bill the participating
end-users.
Overlay core dimensioning: Given the geographical distribution of the end-users it is not
difficult to find approximately optimal locations for the proxies; this optimization
problem is known as the "warehouse location problem", and a solution is presented
for example in [20]. Estimating the user distribution is more problematic. The necessary bandwidth is even harder to predict. [21] suggests over-dimensioning the MSON
slightly. When the purchased bandwidth turns out to be insufficient, it may be possible to lease additional bandwidth at runtime. However, this would probably be expensive.
In summary, TOMA achieves a stable overlay core that is efficient regarding redundant
transmissions and latency, even for large groups. TOMA relies on statically placed infrastructure, though. Some disadvantages of this approach have been noted in Section 2.2.3. It
seems questionable whether TOMA can be deployed quickly. This aspect is critical because,
after all, deployment difficulty is mainly what is inhibiting network layer multicast.
So far we have referred to ALM without proxies as a peer-to-peer architecture. The authors
of [35] argue that having virtual connections to many other nodes is "server-like" behavior,
which does not correspond to the peer-to-peer paradigm. Therefore they do not consider
ALM a peer-to-peer approach. ALM is certainly not a typical peer-to-peer architecture like
a file sharing system. In both cases, an overlay topology is constructed among the participating end-hosts; for example, Chord arranges end-hosts in a ring [32, pages 380–384]. In a
typical peer-to-peer system, however, files (or whatever kind of records the peers exchange)
are completely transferred before they are consumed. That means, there are no real-time
constraints as in ALM. In-order delivery is not necessary either; instead reliable service is
usually important. Typical peer-to-peer systems also require some sort of lookup mechanism, for example a distributed hash table, as each peer offers a large number of records.
Because of these differences typical peer-to-peer systems will not be further discussed.
2.3.1 Overview
DMMP is intended for multimedia streaming applications as described in Section 1.1. That
is, there is only one source. In the future, many-to-many communication may be supported
as well. DMMP makes no assumptions about the transport layer, but [25] advocates UDP
because control message exchange follows the request-response pattern and lost multimedia data can be tolerated.
The DMMP architecture has two tiers:
1. The upper tier consists of super nodes, which form the overlay core. Super nodes are
selected from among members based on their capacity. The capacity of a member is
a function of the available bandwidth and the uptime. Uptime refers to the elapsed
time since the member joined the group. Members with high uptime are assumed
to be relatively stable. Other values can be taken into account to profile DMMP. As
DMMP is directed at large groups, the chosen super nodes are expected to be well
dispersed over the underlying topology. Super nodes are assumed to be ordinary end-hosts, but in principle they can also be provided (statically or on-demand) by a group
coordinator. The super nodes and the source self-organize into a mesh. This works the
same way as in Narada.
2. The lower tier consists of all members that are not super nodes. Each super node is
responsible for one cluster of non-super nodes. The clusters are organized as trees.
Joining non-super nodes use distance measurements (for example round trip time) to
choose a cluster and find a suitable position in the tree. Therefore, the clusters resemble
core-based trees.
Figure 2.5 shows the data delivery algorithm. S is the source, the white circles are super
nodes, and the black circles are non-super-nodes. The ellipses are clusters; only one cluster
is shown in detail. The source sends its data directly to some of the super nodes. Like Narada,
DMMP runs a distance vector protocol similar to DVMRP on top of the mesh to create a
source-specific spanning tree for data delivery. Consequently, there can be redundant transmissions. In Figure 2.5, the spanning tree is represented by arrows. The dotted arrows show
redundant transmissions; they are not part of the tree. Unlike Narada nodes, super nodes do
not only forward data to their mesh neighbors, but also to the children in their local cluster.
In turn, cluster nodes forward the data to their children. That way, data is simply delivered
from top to bottom inside the clusters.
Inactive super nodes are detected using refresh messages. Possibly, they can be replaced
by one of their children. However, this may not always be feasible. The missing super
node's cluster could be empty, for example, or all its children could have insufficient capacity. That means the mesh can become partitioned. DMMP treats this situation using
the same algorithms as Narada. In the clusters, refresh messages are used, too. Each cluster node periodically exchanges refresh messages with its (1) parent level nodes, (2) child
level nodes, and (3) siblings. In other words, there are control topology links between those
nodes.7 When an inactive node is detected, it may be possible to replace it with one of its
former children. Otherwise, the data delivery tree may become partitioned. In that case,
the former children of the inactive node look for new parents starting with their remaining cluster neighbors. Because the join algorithm is based on distance measurements, those
nodes are assumed to be near. Therefore, they are suitable new parents. That said, DMMP
clusters are not solely optimized for distance. Ideally, the clusters should be sorted by capacity, meaning that no node has a higher capacity than its parent. This has a number of
advantages:
1. The probability that a node with a high position in the data delivery tree leaves is
reduced, and partitions are more likely to be local problems affecting only a small
number of nodes. Note that partitioned members are cut off from the data stream
until the partition is repaired.
2. Nodes with high bandwidth can accept more children. Optimizing the clusters for capacity makes them broader and shorter, which should eventually improve the average
latency as well. Shorter clusters also expedite join attempts.
3. It is likelier that inactive super nodes can be replaced, since no cluster node has a
higher capacity than the super node or its children.
Another consequence is that nodes close to the overlay core contribute more bandwidth.
However, they are also likely to experience higher service quality, with lower loss rate and
latency. That means, there is an incentive to contribute resources to the group. Two mechanisms are used to optimize the clusters for capacity:
1. When a cluster node receives several join requests within a short period of time and
cannot satisfy all of them due to a lack of bandwidth, the nodes with the highest capacity are chosen as children.
2. Nodes with high capacity can step up to higher tree levels over time. If a node's
capacity is significantly higher than its parent's capacity, the two nodes may switch
their positions. We refer to this as a promotion.
7 HostCast
In summary, clusters resemble heaps. HMTP and HostCast (see Sections 2.2.4 and 2.2.8)
have no such mechanism. Instead, the data delivery tree is optimized exclusively for latency. Note that the overlay core improves itself over time as well. Mesh links may be
added and dropped; the algorithm is the same one that Narada uses.
In the next section, more details about the construction of the initial control topology, the
join algorithm, partition handling in the clusters, and the promotion mechanism are given.
We do not further discuss the data delivery algorithm, as it is very similar to DVMRP. We
also omit the adding and dropping of mesh links. Details about this can be found in Section
3.2 and in [13].
2.3.2 Details
In this section, we will highlight some DMMP protocol details. Most of them will be revisited in Section 3.2.
Overlay construction: End-hosts that wish to join the group obtain the IP address of the
RP via the Domain Name System (DNS). Assume that there are some initial group
members which have subscribed to the RP. The source is part of this initial group. For
each initial member, the available upstream bandwidth is determined. It would also
be possible to let the users manually specify the available bandwidth. This approach is
taken in [12], for example. However, users may misstate such information. Instead, the
bandwidth is measured for each participating end-host. This can be done by sending a
series of test packets and measuring the inter-arrival times of the responses. That way,
the bandwidth of the bottleneck link, which is the link with the lowest bandwidth,
can be determined. More sophisticated techniques are required to handle competing
traffic [11]. These measurements may have to be repeated regularly, as the available
bandwidth can change over time. After obtaining the upstream bandwidth b(i) for
each initial member i, the RP calculates the maximum outdegree d(i) as

    d(i) = ⌊b(i) / r⌋,

where r is the constant bitrate the source sends at. Note that members with a maximum outdegree of zero have to be leaf nodes in the data topology.
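With both values in the same unit, the computation is a plain integer division; the flooring is implied by the requirement that members with a maximum outdegree of zero become leaves:

```cpp
#include <cassert>

// Maximum outdegree of a member: how many full copies of an r-bit/s
// stream its upstream bandwidth b can carry (b and r in the same unit).
int maxOutdegree(double b, double r)
{
    return static_cast<int>(b / r);  // d(i) = floor(b(i) / r)
}
```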
Next, an application-specific number of super nodes is chosen in order of maximum
outdegree. Other metrics can be considered, again depending on the application. For
example, it is desirable to assure that super nodes do not behave in a malicious way.8
The number of super nodes should in general be less than 100. This is because the super nodes basically use Narada, which is known to have significant control overhead
already for 128 end-hosts. Clusters may consist of hundreds of members, so DMMP
8 Security
groups can have tens of thousands of members. The super nodes and the source self-organize into a mesh as described in [13]. Then the clusters are formed using the join
algorithm, which is described below. In the overlay construction, end-to-end connectivity is assumed. That means problems concerning NAT and firewalls have not been
considered yet. When all members have contacted the RP to confirm their positions,
the initial overlay is constructed.
Join algorithm: A newly joining end-host queries the RP for a list of super nodes. The
length of the list is application-specific. Then it measures its distance to each super
node on the list and chooses the nearest one. The joining host sends a join request
to the chosen super node. The request includes the available bandwidth and the current uptime. Whenever a member M receives a join request, it waits for concurrent
join requests. After some amount of time, M accepts as many joining end-hosts as its
maximum outdegree allows. The child selection is based on the capacity of the joining end-hosts. The capacity is a function of the available bandwidth and the uptime;
an example will be given below (see Equation 2.1). The accepted hosts receive an acknowledgment. M sends a list of its children to the rejected hosts. Based on that list,
the rejected hosts continue to look for a suitable parent. This is described in more detail
in Section 3.2.4. All in all, this algorithm is similar to the HMTP join algorithm. The
main difference is that in HMTP, a node X only sends a join request to another node Y
if the distance from X to Y is shorter than the distance from X to any of Y's children.
A DMMP host would try to obtain a position as a child of Y before considering the
children of Y. DMMP generally focuses less on optimizing latencies.
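The child selection described above can be sketched as follows. Capacities are taken as precomputed values here; in DMMP they would be derived from the bandwidth and uptime reported in the join request:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// A member that has collected several concurrent join requests accepts
// the highest-capacity candidates up to its free outdegree; the rest are
// rejected and receive the member's child list to continue their search.
struct JoinRequest { int host; double capacity; };

std::vector<int> selectChildren(std::vector<JoinRequest> reqs, int freeSlots)
{
    std::sort(reqs.begin(), reqs.end(),
              [](const JoinRequest& a, const JoinRequest& b) {
                  return a.capacity > b.capacity;  // highest capacity first
              });
    std::vector<int> accepted;
    for (std::size_t i = 0; i < reqs.size() && (int)i < freeSlots; ++i)
        accepted.push_back(reqs[i].host);
    return accepted;
}
```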
Partition handling in the clusters: A leaving end-host should at least notify its parent or
one of its children of its departure. Ungraceful leaves are detected using refresh messages. Each cluster member periodically exchanges refresh messages with its neighbors in the control topology. Members can also request refresh messages from their
neighbors. If a member does not respond to a refresh request, or does not send a refresh message on its own for some period of time, it is suspected to be inactive. It is
sent a probe message to confirm this. If a member M does not respond to a probe message, the sender of that message assumes that M is inactive. When a member learns
that one of its neighbors has become inactive, it notifies its remaining neighbors.
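The detection logic amounts to a two-stage timeout. The state names and the timeout parameter below are illustrative, not taken from [25]:

```cpp
#include <cassert>

// Two-stage failure detection: a neighbor that has sent no refresh within
// the timeout is suspected and probed; only an unanswered probe makes it
// inactive, so a merely slow neighbor is not declared dead prematurely.
enum class NeighborState { Active, Inactive };

NeighborState classifyNeighbor(double now, double lastRefresh,
                               double refreshTimeout, bool probeAnswered)
{
    if (now - lastRefresh <= refreshTimeout)
        return NeighborState::Active;              // refresh seen recently enough
    return probeAnswered ? NeighborState::Active   // probe cleared the suspicion
                         : NeighborState::Inactive;
}
```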
Consider Figure 2.2 again. Assume that it shows the data topology of a DMMP cluster,
and that D has left the group. That is, the tree is partitioned: I and J cannot receive
data. In DMMP, there are two ways to fix this:
Either I or J can replace D. Here is how it works: I and J both send a replacement
request to B. Note that in the DMMP control topology, nodes have no link to their
grandparent. Therefore, I and J may have to query the responsible super node
A to obtain B's address. After receiving the first replacement request, B waits
for additional replacement requests. After some time, it chooses a replacement
[Figure 2.6: capacity as a function of time for two nodes X and Y with bandwidths b(X) and b(Y); Y's capacity grows faster and eventually exceeds X's.]
based on capacity. Assume that I has a higher capacity than J. B will send I an
acknowledgment. J receives the address of I and tries to join as a child of I. This
approach has been proposed in [23].
[25] suggests that rejoining nodes use their remaining neighbors as a starting
point for the join algorithm that has been described above. The neighbors of a
node are its siblings, its child level nodes, and its parent level nodes. In Figure 2.2,
J's neighbors are E, F, G, H, I, K, L and M, for example. The nodes on the lowest
level of a cluster tend to have the lowest available bandwidth and the highest
number of neighbors. This is worrying because maintaining those links may induce high control overhead. However, it might be sufficient to exchange refresh
messages less frequently between "remote" relatives.
Promotions: If the capacity of a member X exceeds the capacity of its parent Y by a threshold, X requests Y to switch positions with it. How does X know that it has higher
capacity than Y? All members announce their capacity via refresh messages: each
refresh message contains the sender's uptime and bandwidth. [24] proposes the following capacity function:
    c(j) = b(j) + ( b(j) / Σ_{i=1}^{n} b(i) ) · t(j),   1 ≤ j ≤ n,   (2.1)

where n is the group size, b(x) is the bandwidth of member x, and t(x) is that member's uptime. In Figure 2.6, node Y has joined the group at time t = 0, whereas X has
joined at t > 0. Until one of the nodes leaves, Y has higher uptime than X. However,
as Y has more available bandwidth than X, the capacity of Y grows faster over time
and exceeds the capacity of X at some point. That means, the capacity function can
prevent transient nodes from obtaining high positions in the clusters, and will eventually help high-bandwidth nodes to climb up.
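Equation 2.1 can be evaluated directly; the numbers below are made up to illustrate the comparison between a high-bandwidth and a low-bandwidth node:

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Capacity from Equation 2.1: c(j) = b(j) + (b(j) / sum_i b(i)) * t(j).
// A member's capacity grows linearly with uptime, at a rate given by its
// share of the group's total bandwidth.
double capacity(double b_j, double t_j, const std::vector<double>& bandwidths)
{
    double total = std::accumulate(bandwidths.begin(), bandwidths.end(), 0.0);
    return b_j + (b_j / total) * t_j;
}
```

In a two-member group with bandwidths 100 and 300, the node with the larger bandwidth share accumulates capacity three times as fast, so it overtakes a longer-lived low-bandwidth node over time.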
[25] does not contain all the details about the promotion mechanism; further details
may be added in the next revision.
Message types: [25] proposes seven control message types; for each type, there is a request-response message pair. A 24-byte DMMP header is appended to all DMMP messages.
A description of the header fields can be found in [25].
1. Subscription: Used to obtain the address of the RP via DNS.
2. Ping-RP: Newly joining end-hosts send a ping-RP request to the RP. The RP
replies with a list of randomly chosen super nodes.
3. Join: Newly joining and rejoining end-hosts send join requests to cluster members. Join responses indicate if the joining host has been accepted as a child or
not. When a joining host is rejected, the join response contains the addresses of
the sender's children. This has been described above in more detail.
4. Refresh: Members that are adjacent in the control topology exchange refresh messages regularly. Refresh messages contain some information about the sender's
capacity, and may also contain routing updates.
5. Probe: When a member receives no refresh message from one of its neighbors for
some period of time, it suspects the neighbor to be inactive, and sends a probe
request to it. If the neighbor is not inactive, it sends a probe response back immediately.
6. Inactive report: When a member learns about an inactive member, it can notify
other members using an inactive report. The report indicates which member has
become inactive.
7. Status report: This message type is used for miscellaneous tasks. For example,
promotions could be arranged using status reports.
3 Implementation
A network simulator is a program that simulates a computer network by calculating the interactions between the nodes. In order to evaluate the performance of DMMP, we incorporate
it into the OverSim simulation framework. First, we provide some background on
network simulation in Section 3.1. Then, the DMMP implementation is described in detail
(Section 3.2).
to analyze mathematically.
Following [4], we distinguish between two types of network simulators:
1. Customized simulators are designed for analyzing one specific protocol in a certain scenario. They are optimized for this protocol and offer exactly the desired degree of
detail. Comparing network protocols is difficult with customized simulators. Ideally,
all relevant protocols should be implemented by the same people. For example in [7],
a customized simulator has been used to compare NICE with Narada. Both protocols
have been implemented, as well as all the underlying protocols. We cannot use their
implementations for comparing NICE and DMMP. In addition, our results are not fully
comparable with their results because we do not use the same network simulator.
2. Common simulators provide a generic simulation framework that is independent of the
simulated protocols. With such a modular framework, protocols need to be implemented only once. Then everybody can use them for their own simulations. Typically,
the people who have devised a protocol will provide an implementation; that means,
efficient implementations can be expected.
In Section 3.1.1, common simulators will be described in more detail; two typical examples
(nsnam, OMNeT++) will be introduced in Sections 3.1.2 and 3.1.3. With those simulators a
wide range of protocols can be simulated. We will address simulation frameworks that are
designed for overlay and peer-to-peer protocols in Section 3.1.4. In that context, OverSim,
which is used for our simulation experiments, will be introduced. OverSim is based on
OMNeT++.
nsnam is intended for simulating Internet protocols. In contrast, other simulators provide a generic "simulation language" [10] coupled with protocol libraries. These simulators are not limited to network simulation, but support a wide range of systems.
Degree of abstraction: Network protocols and their environment can be very complex. There
is clearly a trade-off between the degree of realism and the consumed resources. Without abstracting from some of the details, simulations of large networks may take a long time
to execute and demand a large amount of memory. It is desirable that the degree of
abstraction can be adjusted. That way, it is possible to analyze the protocol details, as
well as the large-scale behavior.
Efficiency: An efficient simulator consumes few resources, and can handle complex networks. Hence, the programming language that the simulator is implemented in plays
a role.
Extensibility: An important idea of common simulators is that different researchers can implement a large number of protocols on top of the simulation framework. Therefore,
common simulators need to be highly extensible. Clean interfaces and a comprehensive documentation can make extensions easier.
Availability of protocol implementations: Common simulators allow the combination of
existing protocol implementations in order to create more complex simulation models,
instead of building them from scratch. For this reason, simulators with a large library
of protocols are more attractive to researchers. For simulations of Internet protocols, it
is important that basic Internet protocols like IP or TCP are provided.
Visualization: Most common simulators can create an animation of the simulation model,
allowing the user to observe the protocol behavior. This is important for debugging.
Verification and debugging of protocol implementations is in general difficult, so additional (non-visual) debugging and tracing features can be useful. Visualization can
also help users to acquire an intuitive understanding of the simulated protocols [4].
Scenario generation: The term scenario refers to the entire simulated environment, including the network topology, dynamic topology changes and traffic patterns. Typically,
network protocols are mostly independent of the scenario. The protocol is defined
once, and is then analyzed in a number of different scenarios. A simulator should provide support for (1) implementing the protocols efficiently, (2) defining and changing
scenarios quickly, and (3) adapting protocol parameters to fit the scenario. Scenarios
often cannot be specified manually because they are too complex. Therefore network
simulators should also support randomized scenario generation (e.g. topology generators) and scenario libraries (including e.g. real world topologies).
Statistics support: In order to evaluate the performance of a protocol, statistical data needs
to be gathered. Some simulators explicitly support this by offering an easy interface
for recording statistics. Additionally, tools for visualizing the collected data may be
provided.
Interaction with real networks: This concept has mainly two applications. (1) Confront a
real network with simulated traffic, and (2) confront simulated nodes with real traffic.
In the latter case, one or several real nodes are emulated.
We try to illustrate these properties in the next sections by introducing two popular, object-oriented common simulators, nsnam and OMNeT++. We focus on OMNeT++, which is used
for our simulation experiments. Both simulators are free/open-source software. As they are
both written in C++ and use a discrete event processing engine, their efficiency is roughly
comparable. OPNET Modeler is another popular common simulator with a comprehensive
protocol library. We leave OPNET Modeler out because it is not fundamentally different
from nsnam and OMNeT++.
3.1.2 nsnam
nsnam, which is also referred to as the VINT [17] simulator, consists of two tightly coupled
parts: ns-2 [4] provides the simulation engine; ns stands for "network simulator", and "2" is a version number. ns-3 is currently still a work in progress. nam [15] (for "network animator")
provides visualization. ns-2 generates a trace file, which nam interprets. nsnam focuses on
Internet protocols, and it is highly popular in that domain. For example, it has been used a
lot in research on TCP. Consequently, many Internet protocols have been implemented for
nsnam. A "split-programming model" [10] is used. That means, simulations are specified
using two different programming languages. OTcl, an object-oriented scripting language,
handles the parts of the simulation that change frequently, that is, mostly the scenario generation. The protocol details are implemented in C++. When OTcl scripts are executed,
the instantiated OTcl objects are mirrored to C++ objects, so that they can interact with the
native C++ code. The split-programming model expedites scenario generation, and at the
same time allows performance-critical algorithms to be implemented efficiently. C++ simulation objects can be combined to create more complex "macro-objects" [10] using OTcl.
However, macro-objects cannot be further combined. This limits the extensibility of nsnam.
Another problem is that the simulation engine and the implementations of the basic Internet
protocols are not cleanly separated [33].
The degree of abstraction can be adjusted. Three network layer models are provided: in
the first model, hop-by-hop forwarding and dynamic routing updates are simulated. In the
second model, routing is static and centralized. In the third model, routers are not simulated
at all. nsnam uses traffic model and topology libraries, and supports topology generators
(e.g. the popular GT-ITM package [16]). A framework for systematic testing of protocol
implementations, called "STRESS", is provided as debugging and validation support, in addition to nam. Interaction with real networks is possible as well. A part of the nsnam code
has been transformed into the Telecommunications Description Language, which allows
distributed simulation [30]; nsnam itself does not have this feature.
3.1.3 OMNeT++
OMNeT++ [33][34] is free for academic non-profit use. There is also a commercial version
called OMNEST [1]. OMNeT++ has a much broader focus than nsnam. Any system that
can be modeled as a discrete event system can be simulated. OMNeT++ is mostly used
for computer network simulation, but it could also be used for e.g. analysis of hardware
architectures. It consists of a simulation kernel, a simulation library, component libraries and
user interfaces [34, pages 211–212].
The simulation kernel mainly handles the discrete event processing. It supports distributed simulation.
The simulation library offers support for common simulation tasks. It includes, for
example, random number generators and containers, as well as classes for gathering
statistics. We will elaborate a bit on the statistics support: Output vectors are collections
of (time, value) pairs, which are recorded over the course of a simulation run. For
example, assume that packet round trip times are measured regularly in a simulation.
Then all the individual measurements could be stored in an output vector. The output
vector writes the data to a file. The data can later be plotted using plove, a tool that
comes with OMNeT++. Figure 4.7 has been generated by plove. The file format is very
simple, which makes post-processing using external tools rather easy. An output scalar
stores a single scalar value and a description string. Scalars are typically recorded at
the end of a simulation run. Example: one could count the lost packets over the course
of a run, and record the total number as a scalar at the end. The tool scalars can be used
for post-processing.
There are two alternative user interfaces: the text-based, non-interactive Cmdenv for batch execution, and the richer graphical user interface Tkenv.1 Tkenv not only provides animation, but also additional debugging and tracing support. Most notably, it is possible to inspect all simulation objects, such as messages, modules, parameters (see below) or output vectors, at run time. Tkenv is shown in Figures 3.2 and 3.4.
The component libraries mostly contain the protocol implementations. OMNeT++ is completely independent of these libraries; in fact, it does not come with any component libraries. For example, the INET framework provides the essential Internet protocols. In contrast, in nsnam the basic Internet protocols are an integral part of the simulator itself. Simulation objects are wrapped in modules. These modules can be arbitrarily combined to build more sophisticated modules. Component libraries consist of a number of related modules.
1 Tkenv is based on the graphical user interface toolkit Tk [2], hence the name.

[Figure 3.1: module hierarchy of an end-host — the system module contains a standard host, which contains ppp, eth and a network layer compound module with ip, icmp and further protocols.]

OMNeT++ simulation models are implemented as follows: simple modules are implemented as C++ classes; they are not composed of other modules. Compound modules contain other modules, which may be simple or compound. They are described using the NED language,
which is a simple compiled programming language with a syntax similar to C. OMNeT++
provides a compiler that translates NED code to C++ code. This means that there is also C++ code for compound modules, but it is usually not written manually. All modules of a simulation are included in the system model; thus, there is a module hierarchy rooted at the system model. For network simulation, the system model usually represents a network.
In Figure 3.1, a simple module hierarchy is shown. The example is taken from a modified
version of the INET framework, which will be further described in Section 3.1.4. The gray
boxes are simple modules, and the white boxes are compound modules. The system module is a network that consists only of a single end-host. The end-host is represented by a
compound module, and composed of various protocols and data structures. For example,
the Point-to-Point Protocol (PPP) is implemented as a simple module. The network layer
protocols are wrapped in another compound module. Remember that this kind of nested
model is not supported by nsnam. Normally, some modules are connected to each other.
This is not shown in Figure 3.1. Modules can be connected to modules on the same level
of the hierarchy, and to their parent and child modules. Connected modules communicate
by exchanging messages. For network simulation, messages are typically packets or timers.
Messages are very similar to events (see discrete event processing in Section 3.1.1): the arrival of a packet can be seen as an event, and, conversely, a timer can be seen as a message that a module sends to itself. Modules can also be parametrized. Parameters are set in
compound module definitions using NED or in a separate configuration file. That way, protocol implementations can be quickly adapted to the simulated scenario. For example, the
IP module in Figure 3.1 has a parameter "time to live" that determines the initial value of
the time to live field in the IP header. Clearly, this value has to be adjusted to the simulated
network topology.
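As an illustration, a module with such a parameter might be declared in NED roughly as sketched below. The syntax is a loose approximation of OMNeT++ 3.x-era NED; the module and gate names are illustrative and not taken verbatim from the INET framework.

```ned
// Illustrative sketch only -- not verbatim INET code.
simple IP
    parameters:
        timeToLive : numeric;   // initial value of the IP header's TTL field
    gates:
        in: fromNetwork;
        out: toNetwork;
endsimple

module SimpleHost
    submodules:
        ip: IP
            parameters:
                timeToLive = 32;   // adjusted to the simulated topology
endmodule
```

Because the parameter lives in NED (or in a separate configuration file), the TTL can be changed per scenario without recompiling the C++ implementation of the module.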
In summary, there is a clean separation of the simulation kernel, the simulation library,
user interfaces and component libraries, which makes it easy to extend OMNeT++. In particular, scenario generation and protocol implementation are separated. OMNeT++ uses two
different programming languages: protocols are implemented as simple modules in C++;
network topologies are described in NED. Via parameters, any other simulation properties that have to be changed frequently can be specified in NED as well. This is similar to the split-programming model of nsnam. However, nsnam macro-objects do not have parameters. There are arguably more protocol implementations for nsnam and OPNET Modeler than for OMNeT++, though.
In addition, there are several simple application models. In summary, the INET model
is very detailed. The developers of OverSim have profiled the INET modules to increase their efficiency and scalability. For example, a static routing protocol has been added.
The underlying network model and the overlay protocols are cleanly separated. For example, it is possible to exchange the underlying network model without changing the overlay
protocols. OverSim also offers some support specifically for peer-to-peer protocols. This includes a generic lookup function, bootstrapping support, visualization of the overlay topology, and collection of some statistical data, such as "the number of sent, received, forwarded and dropped packets per node" [9]. Most of these features are too specific to be used for the
DMMP simulations, but they can be turned off easily, and they seem to come at little cost.
The authors of [9] claim and demonstrate that runs with 100,000 end-hosts are feasible.
What OverSim offers us, beyond the functionality of a common simulator, is mostly the INET model, which is very suitable for our purposes. Many other overlay simulators lack a detailed underlying network model. Unfortunately, OverSim also has some shortcomings. Firstly, its scalability is not quite satisfying. In particular, OverSim consumes large amounts of memory. According to [9], each node requires about 70 kilobytes of memory. However, we have found that routers require far more memory. Each router creates one entry in its routing table for every other router. Therefore, the memory consumption increases quadratically with the number of routers. Several gigabytes of memory are necessary to simulate 10,000 routers. Secondly, OverSim does not support topology generators; in [9], this is listed as future work. OverSim does not provide a suitable traffic model either; however, this can be added easily. Thirdly, the documentation is rather sparse in parts; for example, it does not describe how to add new overlay protocols.
Figure 3.2: End-host in the INET underlying network model (screenshot from Tkenv)
[Figure: class diagram — BaseOverlay, derived from cModule, with the methods initializeOverlay(), finishOverlay(), handleUDPMessage(msg: Message) and handleAppMessage(msg: Message); the classes DMMP, DMMPMember and DMMPSource.]
rarely, a real world implementation has to be able to handle it. In our simulations, we can
disregard spurious inactive reports entirely. However, many simplifications are not admissible because they would affect the performance measurements significantly.
3.2.2 Components
Figure 3.4 shows the initial components of our simulation model. The underlying network
consists of a number of interconnected routers (illustrated as towers), and does not change
over the course of a simulation run. The underlying network is further described in Section
4.1.2; most of the details are not important here because in OverSim, the overlay modules
are independent of the underlying network model. The notebook symbols represent end-hosts. They are the initial members of a DMMP multicast group. One of the initial group members is the source; the others do not send data to the group. Over time, more end-hosts
will be added to the model, and others will be removed (the source is not removed though).
All end-hosts are placed using dynamic module creation [34, pages 76–78], meaning that they
are placed via manually written C++ code.2 Each end-host is attached to exactly one router.
Links have a bandwidth and a propagation delay [34, pages 40–41]. Consider Figure 3.5. Node A sends a message to node B over a network link. The first bit is sent at time t0, the last bit is sent at t1. The difference t1 − t0 is a function of the link bandwidth and the message size. With a bandwidth of b bits per second and a message size of s bits, it is t1 − t0 = s/b. The propagation delay is simply t2 − t0, where t2 is the arrival time of the first bit. As bandwidth and
2 Simulation objects can also be created via NED, which is called static creation because, that way, the objects are
placed immediately at the beginning of the simulation run, and destroyed at the end. This method is used
for placing the routers.
Figure 3.4: The initial components of the DMMP simulation model (screenshot from Tkenv)
[Figure 3.5: node A sends a message to node B over a link — first bit sent at t0, last bit sent at t1, first bit arrives at t2]
the bandwidth has to be measured and may change over time. Effective bandwidth
measurements are rather difficult to implement, so we leave them out for simplicity.
Instead, we "cheat" by obtaining the bandwidth directly from the link. OMNeT++
implements links as cConnection objects, the bandwidth can be accessed easily. Unfortunately, this approach leads to slightly unrealistic results because real measurements
would be less accurate and would induce some control overhead.
2. Construction of the overlay core: The RP stores the IP address and bandwidth of each
initial member. It chooses a number of initial members as super nodes in order of
bandwidth. The number of super nodes, which also determines the average cluster
size, is an application-specific parameter. The effect of this parameter is investigated
in Section 4.2.1. DMMP adopts the overlay core maintenance from Narada. In Narada,
the initial mesh can have poor quality, but it improves itself over time. As we do not
implement the self-improvement mechanism, we use a relatively simple centralized
algorithm to construct a reasonable initial mesh. First, the maximum degree of each
super node i is determined as

    degree(i) = b(i) / (r + c(i)),

where b(i) is the last-hop bandwidth of i in bits per second, r is the constant bitrate that the source sends at, and c(i) is the expected control overhead in bits per second. The control overhead will be analyzed in Section 4.2.1. As the underlying network topology
is modeled as an undirected graph, we cannot differentiate between outdegree and
indegree. For each packet that the source sends to the group, there is up to one end-to-end transmission per mesh link. This is because a reverse path routing algorithm is
used. Therefore, the maximum number of mesh neighbors is degree(i ) for each super
node i; each incident mesh link "consumes" one "degree unit". It seems reasonable
to let each super node "reserve" some degree for children in its local cluster, so super
nodes with high maximum degree should have more neighbors than super nodes with
relatively low maximum outdegree. A modified version of Prim's algorithm is used
to construct suitable spanning trees. We do not go into more detail here because the
centralized mesh construction is only a temporary solution, and should be replaced by
an efficient, distributed algorithm. Using the algorithm described in [36], k spanning trees are constructed; that is, the density of the mesh is adjusted via the parameter k. As the number of super nodes should be less than 100, and a very dense mesh requires a high amount of reserved degree, a reasonable value is 2 ≤ k ≤ 5. Note
that for k = 1 the overlay core is a tree. We experiment with the value of k in Section
4.2.1. Finally, the source chooses as many neighbors as its maximum degree allows.
The chosen super nodes receive data directly from the source.
3. Cluster construction: The RP contacts the initial members that have not been chosen as
super nodes. They join a cluster using the join algorithm described in the next section.
[Figure: join message exchange and resulting data topology — JoinRq, JoinRsp(REJ, D, B), JoinRsp(ACK), RttTest and RttRsp messages over time]
After joining, they notify the RP. As soon as all initial members have confirmed their
positions, the RP tells the source to start the data delivery.
it forwards the data to all end-hosts on that list. C also starts sending refresh messages to
A. This is covered in Section 3.2.6. C sends a join response to both A and B. The response
to A indicates that A has been accepted as a child; A knows that it has successfully joined
the cluster, and starts sending refresh messages to C. B receives a join response indicating
that it has been rejected as a child. This rejection includes the IP addresses and capacities of
C's children A, which has just been accepted as a child, and D. B uses an internal candidate parent cache to store this information. C is removed from the cache because it is currently saturated. B chooses a parent from among C's children; other entries in the candidate parent cache are only considered if none of C's children accepts B. B
determines its distance to A and D by measuring the round trip times of (empty) test messages. There are other possible distance metrics. For example, the length of the shortest path
in the underlying network could be measured. However, this method seems less reliable.
Measuring the round trip time is a simple, intuitive approach. B sends a test message to
A and D simultaneously. The test response of A arrives first. That way, B knows that A is
nearer than D, and sends a join request to A. In our example, A has not received the join
response from C, when it receives the test message; this is unproblematic. When A receives
the join request from B, it has already joined the cluster. For simplicity, assume that A has
a maximum degree of two. If it had a higher maximum degree, it might try to switch its
position with C, which would make things more complicated. Consequently, A can accept
one child. There are no concurrent join requests, so A accepts B as a child, and B has joined
the cluster.
Now assume that the maximum degree of A is only one; that is, A cannot accept a child and thus is a leaf node. In this case, C does not include the address of A in the rejection
message because C knows that A does not have enough capacity. Let A and D both be leaf
nodes. In this scenario, B receives an empty rejection (that is, without any IP addresses).
Then B contacts one of the members in its candidate parent cache. If the cache is empty, B
knows that all clusters that it tried to join are saturated. As the ping-RP query provides a
list of randomly chosen super nodes, there may be other clusters, which are not saturated.
Therefore B queries the RP for a new list of super nodes.
Below, some interesting parameters of the join algorithm are discussed:

Each member waits for t_j seconds before choosing its children. A high value of t_j can avoid promotions because end-hosts with low capacity are placed at low cluster positions from the beginning. With t_j = 0, members do not wait for concurrent join requests. That way, end-hosts can join and, more importantly, rejoin faster, but more promotions are necessary. If the average degree of end-hosts is high, it is clearly pointless to wait for join requests, because most of the time all joining hosts can be accepted. As DMMP targets high-bandwidth applications, the average degree will typically not be that high, though. We set t_j > 0 during the initial cluster construction. In this phase, a high number of join requests is sent at the same time. In addition, users may be
a high number of join requests is sent at the same time. In addition, users may be
more willing to tolerate delays before the playback has initially started (in the case of
multimedia streaming). In most cases, the media player buffers incoming data before starting the playback. This means that there is an initial delay no matter how fast the join algorithm is. We believe that it is better to set t_j = 0 after the initialization phase, mostly in order to speed up rejoin attempts, and thus reduce packet loss.
There are some other timeouts involved in the join algorithm that have not been mentioned for simplicity. Clearly, joining end-hosts cannot wait for a join response indefinitely. The desired parent may have left the group before sending a join response.
The same applies for responses to round trip time tests. In both cases, the joining host
is not interested in responses of distant members. Therefore the timeouts should be
short.
Recall Equation 2.1. In [25], the units of the bandwidth and uptime are not specified.
If "bit per second" is used as the unit of the bandwidth, and "seconds" as the unit of the
uptime, then the uptime is virtually irrelevant; the capacity c of an average member
would be
    c = b + t/n,
where b is the average last-hop bandwidth, t is the average uptime, and n is the group
size. In a typical scenario, the maximum uptime tmax is a few thousand seconds (a few
hours), and the group has thousands of members. In our implementation, the units
stated above are used, but there is a weighting factor u:
    c(j) := b(j) + ( b(j) / Σ_{i=1}^{n} b(i) ) · t(j) · u,    1 ≤ j ≤ n.    (3.1)

Setting u := B / t_max, where B = Σ_{i=1}^{n} b(i), and using 0 ≤ t(j) ≤ t_max for 1 ≤ j ≤ n, we get

    b(j) ≤ c(j) ≤ 2 · b(j),    1 ≤ j ≤ n.    (3.2)
The capacity of a member m1 with maximal uptime is 2 · b(m1), and the capacity of a member m2 that has just joined the group is b(m2). When a high number of membership changes is expected, the uptime can be given more weight, and vice versa.
Members obtain the total bandwidth as follows: for every member, the RP keeps track
of (1) the IP address, (2) the last-hop bandwidth, and (3) whether the member is a super node. The last field is required to answer ping-RP queries. The RP periodically
calculates the total bandwidth and delivers that information to the group. The total
bandwidth updates are attached to some of the periodic refresh messages. These refresh messages also need to carry a timestamp. That way, a member that receives a
refresh message can decide whether the total bandwidth update is outdated.
The RP answers ping-RP queries with a list of s super nodes. s needs to be limited
because each newly joining end-host measures its distance to all s super nodes. For
large s, the associated control overhead at the joining host is unreasonable.
Figure 3.7: Refresh message exchange between parent and child
Detecting inactive non-super nodes: In our implementation, cluster members exchange control messages only with their parents and children, not with any other relatives. Maintaining these relationships seems rather difficult to implement, so we leave this as
future work. Nevertheless, it would have been interesting to see how these additional mesh links affect the performance, and how frequently remote relatives should exchange refresh messages. For simplicity, we do not implement refresh requests and inactive responses (see Section 2.3.2) either. We are also uncertain whether a real-world implementation should use these message types. It is sufficient to exchange refresh
messages periodically, and lost inactive reports can be tolerated. Additionally, inactive reports are sent rarely, due to the trimmed control topology.
Refresh messages are sent when a parent-child relationship has been established. Subsequently, they are sent every t_c seconds.3 In Figure 3.7, member P accepts C as a
child. Right after that, P sends a refresh message to C, and starts a probe timer. When
this timer expires, P sends a probe request to C. If P receives a refresh message from
C before that, it resets the probe timer. Refresh messages are sent periodically; for this purpose, P schedules a refresh timer (t_c seconds). When the refresh timer expires, it is reset, and a refresh message is sent.
The parameter t_c is clearly one of the most important factors for the control overhead and the packet loss. With a high value of t_c, it takes members a long time t_d ≥ t_c to detect missing neighbors. Consequently, it takes a long time to repair partitions. Furthermore, capacity updates and total bandwidth updates are included in the periodic refresh messages. If t_c is high, capacities are updated less frequently. This is a minor factor, though, because capacity updates are mainly needed for promotions, and the promotion algorithm does not rely on the capacities being up-to-date. On the other
3 The subscript c stands for "cluster" here; refresh messages may be sent more frequently in the overlay core (mesh, hence t_m).
[Figure 3.8: detection of an inactive parent P — C's probe timer expires, C sends a probe request (ProbeRq) and starts an inactive timer]
hand, a low t_c leads to high control overhead. Probe timers expire after t_p > t_c seconds. As the transmission times of the refresh messages may vary, the probe timers should be set conservatively. On a side note: if the first refresh message that P sends happens to reach C before the join response, then C ignores the refresh message. Chances are that the next refresh message arrives before C's probe timer expires. Otherwise, C
probes P, which is unproblematic as well.
Figure 3.8 shows how inactive members are detected: C has scheduled a probe timer
for its parent P. P has left ungracefully, and eventually the probe timer expires. C then
suspects P of being inactive and sends it a probe request. It uses an inactive timer to wait for a probe response. The timer runs for t_i seconds. After that, C assumes that P
has left ungracefully. In Figure 3.8, the refresh messages sent by C have been left out
for clarity.
Above, an example of a spurious probe request has been given. When an active member receives a probe request, it immediately sends a probe response back. In Figure
3.8, C would reset the probe timer in that case. A gracefully leaving member sends an
inactive report about itself. Sending the inactive reports takes a small amount of time.
We minimize this delay: leaving members send only one inactive report. This report includes the IP addresses of the leaving member's children, and it is sent to the leaving member's parent. The parent sends an inactive report about the leaving member
to each of the now orphaned children. If the parent leaves before receiving the inactive
report, the orphans will not be notified. In this case, they have to detect by themselves
that their parent is missing.
partitioned.
Handling inactive super nodes: When a super node leaves ungracefully, its former children detect this as described in Section 3.2.6 and rejoin the overlay. Again, the missing
node cannot be replaced. Our implementation chooses all super nodes from among
the initial members. It is not possible to allocate additional super nodes later on.
Adjacent super nodes exchange refresh messages. We set the timer for the periodic refresh messages to t_m. These messages contain the same information as the refresh messages in the clusters. Additionally, they carry the sender's distance to the source (routing information) and sequence numbers, which are used to detect mesh partitions (this is described later in this section). Because of this, refresh messages should possibly be exchanged more frequently in the overlay core, that is, t_m ≤ t_c. Clearly, t_m has a big impact on how quickly mesh partitions can be detected. We experiment with the value of t_m in Section 4.2.2.
Each super node maintains a list of all other super nodes. This list is mostly needed to
detect mesh partitions. When a super node learns that one of its neighbors is inactive,
it removes the neighbor from its list of super nodes and from its list of neighbors. Then
it notifies all other super nodes using inactive reports. The inactive reports are flooded
via the mesh. When a super node receives an inactive report about a node that is not
on its list of super nodes, it does not further forward the inactive report. That way, the
flood stops at some point. When a super node discovers a missing neighbor by sending it a probe request, the super node not only notifies all other super nodes but also sends an inactive report to the RP. This means that the RP typically receives several redundant inactive reports about a missing super node. This redundancy is desirable
in a real network because inactive reports can be lost. It is important to make sure that
the RP learns about inactive super nodes. Otherwise it would answer ping-RP queries
incorrectly.
Ideally, super nodes that wish to leave should remain in the group until the other super nodes have changed their routes. For simplicity, this has not been implemented.
Gracefully leaving super nodes notify their children in the local cluster and the adjacent super nodes.
Handling mesh partitions: We adopt the algorithms described in [13]. Each super node
stores a refresh table with one entry for every super node. Each entry consists of an
IP address, a timestamp and a sequence number. The table is initialized during the
mesh construction, as soon as the super node knows the IP addresses of all other super nodes. The initialization is done as follows:
        queue.erase(e)
    }
    If t > Tmax then {
        handlePartition(e)
    }
}
With a probability of (queue.size / table.size) do {
    handlePartition(queue.pop())
}
Tmin and Tmax are constants, and have already been introduced in Section 2.2.5. They
determine how aggressively links are added in order to repair potential partitions.
When an entry is not updated for at least Tmin seconds, it is copied to a queue. Entries remain in the queue for up to Tmax − Tmin seconds. After that time, a partition is assumed and handlePartition() is called. Entries can be removed from the queue earlier
when an update or inactive report about the corresponding super node is received.
Furthermore, every time a member checks for a partition, the super node that has
been on the queue for the longest time is assumed to be partitioned with a probability
depending on the size of the queue. The reason for this is as follows: assume that a
super node s checks for partitions, and that there are many entries on the queue. This means that there is a high number of super nodes from which s has not received an update, which is a strong indicator for a partition. It is important to detect mesh partitions
quickly because any number of members can be affected; but if Tmin and Tmax are too
low, a high number of unnecessary links may be added. The values should be chosen
carefully as a function of tm and the diameter of the mesh.
The handlePartition() function is called when a super node x has detected a partition.
It takes a refresh table entry e as an argument and does the following: first, a probe
request is sent to the super node y with the IP address e.ip. If there is no response, handlePartition() returns. In this case, x has not received an update about y because y has left the group. Note that in the worst case, x learns about an inactive super node t_m · d seconds after the departure, where d is the diameter of the mesh.4 Typically, it should be Tmin < t_m · d, so the probe request is important. If there is a probe response,
a link that connects x and y is added to the mesh. [25] and [13] do not specify how
this should be done. Clearly, x needs to tell y that there is a new link, so that y can update its list of neighbors. In our implementation, x sends a status report that notifies
y about the new link. Remember that the additional mesh link consumes bandwidth;
both super nodes may be unable to support an additional mesh neighbor. We assume
that both x and y have children in their local clusters. If necessary, resources are freed
by sending a breakup message to one child. A cluster member that receives a breakup
message rejoins the overlay as if its parent had left the group. There are other possible
solutions. For example, x and y could drop a mesh link to free resources. [7] recommends that the "mesh degree bound for hosts should not be strictly enforced to ensure
connectivity. Instead additional mechanisms that limit the degree of the data path on
the mesh should be used." However, suitable mechanisms are not described. Our approach makes sure that mesh partitions, once they are detected, are repaired quickly
and without causing new mesh partitions.
3.2.8 Self-improvement
As mentioned in Section 3.2, the mesh self-improvement mechanism is not implemented.
Instead, we try to construct a reasonable initial mesh. This means that the quality of the overlay
can decrease when super nodes leave because disadvantageous links may be added to repair
partitions, and because a lower number of super nodes affects the performance. However,
we do not believe that this has a big impact on our measurements. In our scenarios, the
ratio of super nodes to non-super nodes does not change drastically over time. We consider
the self-improvement of the clusters via promotions more important; this is described below.
A cluster member that has a higher capacity than its parent can be promoted, meaning that
it swaps places with its parent. The basic idea is that, first, child and parent swap their positions. This involves communication between the promoted node, its parent and its grandparent. Then, the promoted node adopts the children of its former parent. This is shown in
Figure 3.9. The left part shows the data topology of a small DMMP cluster. Assume that C
has a significantly higher capacity than its parent B. The right part shows the topology after the promotion of C. C has taken the position of B, and it has adopted its former sibling D.
Implementing the promotions turned out to be a time-consuming task. There are a number
of difficulties:
Several nodes are involved in a promotion.
Each node may leave at any time.
The promoted node may not be able to take over all children of its former parent. In
fact, it may not have enough bandwidth to accept its former parent as a child. This
becomes more complicated when end-hosts join during a promotion.
We implement the promotion mechanism because we consider it an essential feature of
DMMP that distinguishes it from other ALM protocols.
[25] does not describe promotions in detail. However, Jun Lei, one of the authors, suggested the following algorithm: consider Figure 3.9 again. First, C requests a promotion
by sending a status request to B. B acknowledges this by sending a status report back. This
status report contains the address of A. Moreover, B breaks its connection to A and C. When
C receives the status report, it contacts A, which adds C as a child. Then, B is notified by C
and rejoins as a child of C. Meanwhile, B stores A as its backup parent. If C turns out to be
inactive (it may have left the group after requesting the promotion), B can rejoin as a child
of A. That means, the original topology as shown in the left part of Figure 3.9 is restored.
When B has successfully joined as a child of C, B breaks its connection to D if C has enough
capacity to accept an additional child. Finally, D joins as a child of C. E is not involved in the promotion. It is included in the figure to emphasize that the promoted node may already have children, and hence can be saturated. We have implemented this algorithm with a few
enhancements.
We use the example shown in Figure 3.10 to describe the promotion algorithm. In the first
part, C and B swap their positions. In the second part, D joins as a child of C.
First part: C knows B's capacity either from a join response or from a refresh message. It decides whether it should request a promotion in the following way:
If this.capacity > parent.capacity + threshold then {
    requestPromotion()
}
C also makes sure that B is not a super node. Swapping positions with a super node
is not implemented, and perhaps not desirable either. The threshold is there to avoid
oscillation. As the capacity is a function of the current uptime, it changes over time.
Therefore members have to check constantly if a promotion is appropriate.
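As a sketch, this check could look as follows in C++; note that NodeInfo, shouldRequestPromotion, and the field names are illustrative and not taken from our implementation:

```cpp
#include <cassert>

// Illustrative sketch of the promotion check described above.
// The capacity is assumed to grow with the member's uptime.
struct NodeInfo {
    double capacity;    // current capacity (a function of bandwidth and uptime)
    bool   isSuperNode; // swapping positions with a super node is not implemented
};

// A member requests a promotion only if its capacity exceeds the parent's
// by more than a fixed threshold; the threshold damps oscillation, because
// capacities change continuously with uptime.
bool shouldRequestPromotion(const NodeInfo& self, const NodeInfo& parent,
                            double threshold) {
    if (parent.isSuperNode)
        return false;
    return self.capacity > parent.capacity + threshold;
}
```

Members re-evaluate this predicate periodically, since a check that passes now may fail later (and vice versa) as uptimes grow.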
3 Implementation
[Figure 3.10: message sequence of a promotion: PromoRq, BreakUpRq, BreakUpRsp, PromoRsp(ACK), JoinRq, JoinRsp, PositionConfirm]
C decides to send a promotion request to B. The request includes C's capacity and the
currently unused degree. Then, B checks if C's capacity is indeed higher. (C may have
made its request based on slightly outdated information.) In addition, B must not be
involved in another promotion. If a promotion is not possible, B sends C a promotion
response indicating denial. Otherwise, B sends a breakup message to its parent A, but
keeps A as a backup parent. A also keeps B as a temporary child. That way, B does not
lose data during the promotion. Now A has enough available bandwidth to accept
C as a child. A reserves this bandwidth for C so that other joining end-hosts cannot
interfere. Next, A tells B that it has reserved the bandwidth using a breakup response,
and in turn B sends a promotion response to C acknowledging the promotion. B adds
C as a temporary child. C sends a join request to A and breaks the connection to B.
It is necessary that B waits until A has reserved bandwidth. If it acknowledges the
promotion right away, C's join request may reach A earlier than B's breakup message.
In this case, A may not be able to accept C as a child.
After joining successfully, C notifies B using a join response. B removes C from its list
of temporary children, and confirms its own position to A, which in turn removes B
from its list of temporary children.
Second part: B knows C's currently available bandwidth from the promotion request. Therefore, it can determine the number n of additional children that C can accept. B chooses
n of its own children arbitrarily (in the example it chooses only D), and sends breakup
messages to them. The breakup messages contain the address of C. The chosen children
are added to B's list of temporary children. They are removed when they confirm
their new positions. Meanwhile, D tries to join as a child of C, and keeps B as a backup
parent.

[Figure 3.11: class diagram of the DMMP implementation: DMMP, DMMPSource, DMMPMember, RP, MemberMap (based on std::map), MemberInfo, RefreshTable, RefreshEntry; tmpChildren and reservations are std::vectors of std::strings]
This algorithm can tolerate arbitrary membership changes. For example, B and D can rejoin
as children of their backup parents if C leaves. There is also little or no packet loss. Nevertheless, we believe that the algorithm can be further optimized and simplified. Note that
there are several timers involved that have not been mentioned above.
3.2.9 Summary
Figure 3.11 summarizes the design of the DMMP implementation. For clarity, the interactions with the rest of the simulation model are not shown. The message classes are left out,
too. DMMPSource implements the behavior of the source. It receives data from a higher
layer module, wraps that data in messages, attaches the DMMP header, and sends the messages to its mesh neighbors. The DMMPMember class implements all the algorithms described in the previous sections of this chapter. IP addresses are stored as std::strings; whenever more information about a group member needs to be stored, a MemberInfo object is
used. MemberInfo objects have data members for all relevant properties of group members,
for example the maximum degree or the best known distance to the source. MemberMap
objects aggregate several MemberInfo objects. An std::map<std::string,MemberInfo*> is used
as an underlying data structure. DMMPMembers use several MemberMaps to keep track of
(1) the children in the local cluster, (2) mesh neighbors, (3) all super nodes, (4) buffered join
requests, and (5) candidate parents. (2) and (3) are only used by super nodes, and (5) is only
used by non-super nodes. During promotions, DMMPMembers also keep track of bandwidth reservations and temporary children. In this case, only the IP addresses are stored.
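A minimal sketch of this storage scheme is shown below. It is simplified: MemberInfo is stored by value here instead of by pointer, and it carries only two of its data members; the method names are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Simplified sketch of the MemberInfo/MemberMap scheme described above.
struct MemberInfo {
    int    maxDegree;        // maximum number of children this member supports
    double distanceToSource; // best known distance to the source
};

// MemberMap aggregates MemberInfo objects, keyed by IP address strings.
class MemberMap {
public:
    void add(const std::string& ip, const MemberInfo& info) { members_[ip] = info; }
    void remove(const std::string& ip) { members_.erase(ip); }
    const MemberInfo* find(const std::string& ip) const {
        auto it = members_.find(ip);
        return it == members_.end() ? nullptr : &it->second;
    }
    std::size_t size() const { return members_.size(); }
private:
    // Lookups are simple to implement and the maps stay small, so std::map
    // is adequate even though iterating over it (e.g. when forwarding data
    // to all children) is not optimal.
    std::map<std::string, MemberInfo> members_;
};
```

A DMMPMember would then hold several such maps, one per role (children, mesh neighbors, super nodes, buffered join requests, candidate parents).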
Each DMMPMember has a RefreshTable (see Section 3.2.7), which consists of a number of
RefreshEntries and a list of potentially unreachable super nodes.
In Section 3.2.1, we noted that efficiency and scalability of the implementation are critical.
As we do not implement all DMMP features, there is also much future work. For this reason,
the code also needs to be extensible, in particular readable. To a degree, these two goals are
conflicting. For example, members store their children in a MemberMap which uses an
std::map internally. This is convenient because lookup operations are easy to implement
and easy to understand. However, iteration over an std::map is rather inefficient. Iteration
is needed e.g. whenever a DMMPMember forwards data to its children. However, all of
the MemberMaps contain relatively few MemberInfo objects. All in all, we believe that our
implementation is reasonably efficient. Another concern that was pointed out in Section
3.2.1 is debugging. The OMNeT++ simulation library provides a macro that can make any
data member visible in the object inspector. In addition, OverSim provides functions to
visualize the overlay topology. This, coupled with the animation of the message exchange,
makes it fairly easy to debug network models with a small number of nodes in the Tkenv.
Many interesting situations, however, occur only with a high number of nodes. For example,
Section 3.2.8 mentioned possible complications during promotions; these are
highly unlikely to occur in a group of, say, ten members. With a high
number of nodes, the animated network is difficult to overview, and the execution speed is
too slow. Consequently, we have mainly used logs and assertions for debugging, which is a
time consuming method. We believe that all major bugs have been resolved, though.
4 Performance evaluation
Scalability, resilience, efficiency, and service quality are major concerns with application
layer multicast. These properties are difficult to analyze with mathematical models or testbeds.
In this chapter, we evaluate the performance of DMMP using network simulation experiments. We analyze the efficiency of the data delivery, and the service quality experienced by
the participating end-hosts in dynamic scenarios with a high number of end-hosts. The exact
setup of our experiments is described in Section 4.1. The results are presented in Section 4.2.
[Figure 4.1: example of link and router stress]
link and no router has a stress greater than one. This cannot be achieved with application layer multicast; nevertheless, the stress should be as low as possible. Figure
4.1 shows an example. The dotted arrows show the overlay data topology. The white
arrows represent data packet transmissions. For each link and router the number of
transmissions is shown. The overlay network is dynamic in nature. For that reason,
the link stress changes over time, and needs to be evaluated for each data packet that
the source produces. By numbering the packets, links and routers, we can define the
stress of a link or router j, 1 ≤ j ≤ M, as the average number of transmissions:

s(j) := \frac{1}{N} \sum_{i=1}^{N} n(i, j),    (4.1)

where n(i, j) is the number of copies of packet i (1 ≤ i ≤ N) that are transmitted via
j. Given a network with many links and routers, it is more helpful to determine the
average stress

\bar{s} := \frac{1}{M - z} \sum_{j=1}^{M} s(j).    (4.2)

z is the number of links or routers with a stress of zero. [13] states that only the "links
active in data transmission" should be counted. Clearly, it does not make sense to
consider parts of the underlying network which are not involved in the multicast, for
example routers without attached group members. We can also measure the load of
end-hosts by calculating the stress of the last-hop links. Note that the last-hop stress
is closely related to the degree, which can be easily determined mathematically (see
Section 4.1.3).
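Equations 4.1 and 4.2 can be sketched in code as follows; the function names and the per-link transmission counts are illustrative, not taken from our simulation module:

```cpp
#include <cassert>
#include <vector>

// Equation 4.1: the stress of one link or router j, i.e. the number of
// copies n(i, j) of each packet i transmitted via j, averaged over the
// N packets that the source produced.
double stress(const std::vector<int>& copiesPerPacket) {
    double sum = 0.0;
    for (int n : copiesPerPacket) sum += n;
    return sum / copiesPerPacket.size();
}

// Equation 4.2: the average stress over all M links or routers, where the
// z links with a stress of zero are excluded from the denominator.
double averageStress(const std::vector<std::vector<int>>& counts) {
    double sum = 0.0;
    int active = 0;  // corresponds to M - z in Equation 4.2
    for (const std::vector<int>& link : counts) {
        double s = stress(link);
        if (s > 0.0) { sum += s; ++active; }
    }
    return active > 0 ? sum / active : 0.0;
}
```

In the simulation, the per-link counts are collected by the routers and end-hosts themselves and aggregated by the RP, as described in the measurement setup below.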
Control overhead: The main task of application layer multicast protocols is the delivery of
higher layer application data from a source to all other group members. The overhead
of controlling the data exchange between the participating nodes is referred to as control overhead. In the case of DMMP, control overhead is caused mainly by establishing
and maintaining the overlay network. There are several ways to measure this over-
head. For example in [22], the number of control messages is counted. It would also
be possible to measure the amount of bandwidth that is used for control traffic.
Loss rate: Many application layer multicast protocols do not guarantee data delivery. When
a source sends a data message to the group, some members may not receive the message. This can happen due to an unreliable transport protocol. More importantly, the
data topology can be temporarily partitioned.
Message losses can be evaluated by calculating the loss rate l for each group member i:

l(i) = \frac{N(i) - r(i)}{N(i)},    (4.3)

where N(i) is the number of data messages sent to the group within the lifetime of i,
and r(i) is the number of unique data messages that i has received. While multimedia
applications may be able to tolerate rather high loss rates, it is nevertheless desirable
to achieve a high probability of delivery. Moreover, knowing how long partitions last,
and how many members are affected can give insights about the resilience of an application layer multicast protocol. This is measured in [7], for example.
Latency, data path length, and stretch: All these metrics consider the distances between a
source and the other group members. Assume that a source sends a data message at
t_s. It is forwarded by several routers and end-hosts, and at t_a it arrives at the group
member m. Then, the latency currently experienced by m is t_a - t_s. For large-scale experiments, the latency should be averaged over all messages and over all members.
Low latency is especially important for interactive applications.
The data path length p of a group member i is the length of the path from the source to
i. More precisely, we define p(i ) as the number of physical links that packets traverse
until reaching i, averaged over all packets that are sent to the group. A long data path
does not always imply high latency, but it can be an indicator of high jitter and packet
loss; packets that traverse a high number of links are more likely to be delayed or lost.
The stretch of a group member i is the ratio p(i) / c(i), where c(i) is the length of the
unicast path from the source to i. When the source unicasts data to all destinations,
the stretch is one for all members by definition.
In our simulations, we use mostly the performance metrics that have been used in [7] in
order to produce comparable results. We measure the stress, control overhead, loss rate,
and data path length as follows:
Each group member keeps track of the number of sent and received data messages,
and routers count the number of forwarded data messages. End-hosts calculate their
last-hop stress based on Equation 4.1, and record it when they leave the group. Similarly,
routers report their router stress at the end of the simulation. In addition, each router
remembers for each link if it has sent or received data via that link at all. This way, links
with a stress of zero can be identified. All statistics are aggregated by the RP, which
eventually stores them using output vectors and output scalars (see Section 3.1.3). The
average router stress and last-hop stress are determined according to Equation 4.2.
The average link stress is then computed as a function of the average router stress
and the average last-hop stress.
The authors of [7] have measured the control overhead in bit per second at routers
and end-hosts. For simplicity, we omit these measurements for the routers, and only
report the control overhead incurred by the end-hosts. It is important to consider the
size of the control messages, and not only their number. In Narada, the number of
control messages grows linearly with the group size, whereas the control traffic in bit
per second exhibits quadratic growth. We expect the number of super nodes in DMMP
to have a similar effect.
It is not entirely clear whether protocol headers are regarded as control overhead in [7]. It
is not stated how frequently data messages are sent by the source. If the headers of
data messages were counted, this would be important information. In our measurements we consider the headers of control messages, but disregard the headers of data
messages.
We measure the loss rate as described in Equation 4.3. The number of missed data
packets is determined by observing gaps in the sequence numbers. When a member
receives a data message, it calculates the difference of the message's sequence number
and the sequence number of the last previously received data message. If the difference is greater than one, packets have been missed. It is also possible that a member
leaves while rejoining. For that case, members also store the timestamp of the last
received data message. When a member leaves, it can tell by the timestamp if it has
missed any messages.
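A sketch of this bookkeeping, with hypothetical names; the real implementation additionally performs the timestamp check described above for members that leave before the next message reveals a gap:

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the loss-rate bookkeeping (Equation 4.3): gaps in the sequence
// numbers of received data messages are counted as missed packets.
class LossTracker {
public:
    void onDataMessage(std::int64_t seqNr) {
        if (lastSeq_ >= 0 && seqNr > lastSeq_ + 1)
            missed_ += seqNr - lastSeq_ - 1;  // gap: packets were lost
        if (seqNr > lastSeq_) lastSeq_ = seqNr;
        ++received_;
    }
    // l(i) = (N(i) - r(i)) / N(i); here N(i) is approximated as the number
    // of received messages plus the observed gaps.
    double lossRate() const {
        std::int64_t total = received_ + missed_;
        return total > 0 ? static_cast<double>(missed_) / total : 0.0;
    }
private:
    std::int64_t lastSeq_  = -1;  // sequence number of the last received message
    std::int64_t received_ = 0;   // r(i): messages received
    std::int64_t missed_   = 0;   // messages observed as lost via gaps
};
```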
We do not measure latencies for three reasons:
Latencies have not been measured in [7].
DMMP is not intended for applications that are latency sensitive.
We have not carefully optimized timeouts.
Instead, we measure the data path length. This metric has been used in [7] as well.
We determine the data path lengths by adding a hop counter to each data message.
Every router or end-host that forwards a data message increments the hop counter.
When the data message arrives at a group member, the data path length of that group
member is equal to the hop counter. For simplicity, we omit the stretch metric.
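The hop-counter mechanism can be sketched as follows (names are illustrative; in the simulation the counter would live in the DMMP data message header):

```cpp
#include <cassert>

// Sketch of the hop counter used to measure data path lengths.
struct DataMessage {
    int seqNr;
    int hopCount;  // incremented by every forwarding router or end-host
};

// Called by each node (router or end-host) that forwards the message.
void forwardHop(DataMessage& msg) { ++msg.hopCount; }

// At the receiving group member, the data path length equals the counter.
int dataPathLength(const DataMessage& msg) { return msg.hopCount; }
```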
b 1 b 1
1 1 db
, where b is the number of backbone routers, and d is the average degree of the
backbone routers.
access routers, many access routers end up without any attached end-hosts, and thus
do not interact with the rest of the simulation model at all.
In [7] a topology with 10,000 routers and a smaller router degree has been used. This
makes it difficult to compare DMMP with NICE. For the typical scenarios, a larger
topology would be desirable as well. We will discuss how the size of the topology
affects the performance evaluation in Section 4.2.
The links between routers represent fiber optic cables with a propagation delay of five
milliseconds and a bandwidth of one gigabit per second. In [24] it has been pointed
out that a big part of the end-hosts on the Internet do not have enough upstream
bandwidth to support children in the overlay topology. We model this by setting the
maximum degree of these hosts to one. That means, the bandwidth is chosen based
on the bitrate of the data stream. We also need some end-hosts with a high maximum
degree that can become super nodes. In the INET underlying network model, all links
between an access router and its attached end-hosts have the same propagation delay
and bandwidth. Therefore, all end-hosts attached to the same access router have the
same maximum degree. Hence, we can say that each access network has a maximum
degree. When an access router is placed, the maximum degree of the access network
is set to one with a probability of 0.4. The maximum degree distribution is shown in
the table below. The degree of the source is determined in the same way, but it cannot
be greater than five. Assume a source with a degree of 14 that directly provides all
or most super nodes with data. In this case, the performance of the overlay core does
not affect the overall results. We avoid such degenerate overlay topologies by
limiting the degree of the source.
maximum degree    1      ?      ?      10     14
probability       0.4    0.2    0.2    0.1    0.1
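Sampling the maximum degree of an access network from this distribution could be sketched as follows. The degrees 1 (probability 0.4), 10, and 14 (probability 0.1 each) follow the table above; the two degree values with probability 0.2 did not survive in the table, so the values 2 and 5 used here are illustrative assumptions, not the ones used in our experiments:

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Sketch of drawing an access network's maximum degree from a discrete
// distribution. The middle degree values (2 and 5) are assumptions.
struct DegreeDistribution {
    std::vector<int>    degrees{1, 2, 5, 10, 14};
    std::vector<double> weights{0.4, 0.2, 0.2, 0.1, 0.1};

    int sample(std::mt19937& rng) const {
        std::discrete_distribution<std::size_t> d(weights.begin(), weights.end());
        return degrees[d(rng)];
    }
};
```

Because all end-hosts behind one access router share the same link parameters, one draw per access router fixes the maximum degree of the whole access network.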
We set the last-hop propagation delays to zero because otherwise, considering the
rather small backbone network, propagation delays would dominate our round trip
time measurements. Remember that we use the length of the data delivery path as
a performance metric, and that we do not measure end-to-end latencies. If members
chose their parents mostly based on last-hop propagation delays, the data delivery
paths would clearly be long, and there would be no point in measuring their lengths.
Lower layer protocols: The simulation model does not consider the details of the physical
layer. Instead, when a frame is sent over a link, the transmission time is calculated as a
function of the link's bandwidth and propagation delay, as described in Section 3.2.2.
A simplified version of PPP is used as the link layer protocol. It does not conceal bit
errors; in fact, the INET underlying network model does not model bit errors at all. The
network layer implementation is slightly more complex. Routing tables are globally
computed using Dijkstra's algorithm. The underlying topology does not change over
time, so there are no dynamic routing updates. Messages that are bigger than the
maximum transmission unit (MTU) are fragmented. UDP constitutes the transport
layer. This protocol stack is illustrated in Figure 3.2. Routers use the same link layer
and network layer modules.
Traffic pattern: "Traffic" refers to the payloads of the higher layer application here, in contrast to the control traffic that DMMP generates by itself. OverSim encourages users to
implement traffic sources as higher layer modules. A stack of application layer protocols can be placed on top of the overlay protocol. In Figure 3.2, there are three higher
layer modules: "tier1", "tier2" and "tier3". We implemented a very simple module representing a multimedia application. Only the source uses this module. The module
is notified by the RP as soon as the initial control topology is constructed. Then it
produces data at a constant bitrate r and forwards it to the DMMP module, which
provides the multicast functionality. For all our experiments, we use a small bitrate of
64 kilobit per second. The multimedia module creates three data messages per second, each
with a size of 64/3 kilobit. Higher bitrates would slow the simulation down because each
data packet is routed individually. Therefore we do not use higher bitrates. For our
experiments, this is unproblematic, although
DMMP is designed for high-bandwidth applications. The maximum degrees of the
end-hosts are much more important. We adjust the bandwidth of the end-hosts based
on r, so that the maximum degrees do not depend on r.
Churn models: In OverSim, churn models are implemented as churn generators. Churn generators are simple modules that are derived from the ChurnGenerator base class. They
place and remove end-hosts over the course of a simulation run. We have implemented a churn generator for the NICE scenarios. For the typical scenarios we use the
built-in ParetoChurn generator.
The churn model described in [7] has three subsequent phases:
1. Join phase: Within the first 200 seconds, end-hosts join the multicast group uniformly at random.
2. Stabilize phase: The overlay is given 1,800 seconds to stabilize. Within that phase,
there are no membership changes.
3. Leave phase: Every 100 seconds, a high number of randomly chosen members
leave ungracefully within 10 seconds. This is done five times, so the total simulation time is 2,500 seconds.
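The three phases above could be generated as follows; this is a sketch with illustrative names, and the number of leavers per wave is a parameter rather than the value used in [7]:

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <vector>

// Sketch of the three-phase NICE churn schedule: joins uniformly at random
// in [0, 200) s, no changes during the 1,800 s stabilize phase, then five
// leave waves of 10 s each, starting 100 s apart at t = 2,000 s.
struct ChurnEvent { double time; bool join; };

std::vector<ChurnEvent> niceSchedule(int members, int leaversPerWave,
                                     std::mt19937& rng) {
    std::vector<ChurnEvent> events;
    std::uniform_real_distribution<double> joinTime(0.0, 200.0);
    for (int i = 0; i < members; ++i)
        events.push_back({joinTime(rng), true});
    // Stabilize phase (200 s .. 2,000 s): no membership changes.
    for (int wave = 0; wave < 5; ++wave) {
        double waveStart = 2000.0 + wave * 100.0;
        std::uniform_real_distribution<double> leaveTime(waveStart, waveStart + 10.0);
        for (int i = 0; i < leaversPerWave; ++i)
            events.push_back({leaveTime(rng), false});
    }
    std::sort(events.begin(), events.end(),
              [](const ChurnEvent& a, const ChurnEvent& b) { return a.time < b.time; });
    return events;
}
```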
This model allows an analysis of the multicast protocol and its convergence time under optimal conditions, without disruptions caused by membership changes. Then,
in the leave phase, it can be measured how quickly the protocol recovers from severe
changes.
However, constant membership changes are certainly more realistic, and as DMMP
is designed with dynamic scenarios in mind, we believe that a scenario with a more
typical churn model can provide important insights as well. In particular, DMMP
calculates the capacity of a member as a function of that members current uptime. It
is implicitly assumed that members with a higher uptime are more stable. In the NICE
scenarios, this assumption does not hold. Leaving members are chosen uniformly at
random in the leave phase. In addition, all members join more or less at the same
time, and thus have roughly the same uptime. Therefore, an important condition for
the churn model in the typical scenarios is
P(L > k + t \mid L > k) > P(L > t) \quad \forall\, k, t \in \{ r \in \mathbb{R} : r > 0 \},    (4.4)
where the random variable L is the lifetime of an arbitrary member, that is, L = t_l - t_j,
where the member joins the group at t_j and leaves the group at t_l. In short: the
distribution of L must not be memoryless. The ParetoChurn generator provided by
OverSim uses a Pareto distribution, which is fine. First, there is a short join phase of
100 seconds. A high number of members join during the join phase, and there are no
departures. Then, members join and leave constantly, creating a stable equilibrium.
That means, the size of the group hardly changes over the course of the simulation.
The total simulation time is 500 seconds. We believe that this is sufficient because
DMMP can establish the initial overlay network quickly. We set the expected value of
L to 7,500 seconds. Measurements for groups of 1,000 members show that with these
settings, a member leaves the group about every five seconds. More frequent leaves
would lead to a rapidly decreasing number of super nodes because super nodes cannot
be replaced by newly joining end-hosts. Departures are not always ungraceful in this
model. Members leave gracefully with a probability of 0.5. Note that this is unlike the
NICE scenarios. The idea of the typical scenarios is to evaluate DMMP under ordinary
conditions, whereas the NICE scenarios perform stress tests.
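Condition (4.4) can be checked numerically for the Pareto distribution. This sketch assumes the standard Pareto survival function P(L > x) = (x_m / x)^alpha for x >= x_m, which may differ in details from OverSim's ParetoChurn parameterization:

```cpp
#include <cassert>
#include <cmath>

// P(L > x) for a Pareto distribution with scale xm and shape alpha.
double paretoSurvival(double x, double xm, double alpha) {
    return x < xm ? 1.0 : std::pow(xm / x, alpha);
}

// P(L > k + t | L > k) = P(L > k + t) / P(L > k). For an exponential
// (memoryless) distribution this would equal P(L > t); for the Pareto
// distribution it is larger, i.e. long-lived members tend to stay longer.
double conditionalSurvival(double k, double t, double xm, double alpha) {
    return paretoSurvival(k + t, xm, alpha) / paretoSurvival(k, xm, alpha);
}
```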
Ignoring the problems concerning the number of routers, we expect the following results:
Stress: First of all, the last-hop stress is easy to predict. The data delivery tree has n - 1
edges, where n is the group size. There is one end-to-end transmission per edge, and
each transmission traverses two last-hop links, the sender's and the receiver's.
Hence, the average last-hop stress s_h is

s_h = \frac{2(n - 1)}{n},    (4.5)

which approaches two for large groups. Considering that DMMP produces some redundant transmissions, s_h should always be about two. We measure s_h mostly to verify
the implementation.
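As a quick check, Equation 4.5 can be evaluated directly:

```cpp
#include <cassert>

// Equation 4.5: the expected average last-hop stress for a group of
// size n, which approaches two for large groups.
double expectedLastHopStress(int n) {
    return 2.0 * (n - 1) / n;
}
```

For a group of two members the value is exactly one, and for our largest groups of 2048 members it is already indistinguishable from two.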
As the average stress of the network depends on the network size, it is difficult to estimate. Based on the theoretical analysis in [24], we expect the average link stress s_l
of DMMP to be slightly lower than the average link stress of NICE. However, these
considerations are "based on the assumption of a large member population uniformly
distributed in the network". s_l is necessarily greater than one because only links i with
s(i) >= 1 are considered. Moreover, the average stress of the last-hop links is two, as we
have shown above.
Control overhead: We consider the control overhead a weak spot of NICE. Maintaining
the hierarchy with all its invariants is costly. Additionally, the authors of [7] have reported that the
control overhead at routers grows logarithmically with the number of end-hosts. We
expect DMMP to induce slightly less control overhead. In particular, the control overhead within the clusters should be low. The overlay core may produce higher control
overhead because the size of the refresh tables grows linearly with the number of super nodes. Hence, the control overhead of the overlay core is in O(m^2), where m is the
number of super nodes. However, m is bounded.
Loss rate: In our simulations, data messages are lost only due to partitions. On the one
hand, it may take cluster nodes a relatively long time to rejoin the group. On the
other hand, departures in the overlay core are unproblematic as long as the mesh does
not get partitioned. DMMP also considers the uptime of the group members, but
in the NICE scenarios this does not make a difference, as we have explained in the
previous section. In summary, it is hard to predict if our DMMP implementation can
achieve a lower loss rate than NICE. In our simulation scenarios, members do not
leave extremely frequently. Considering that losses only occur when members leave,
the average loss rate should certainly be less than one percent.
Data path length: Our DMMP implementation does not optimize the overlay for distance.
Therefore we expect the average data path of DMMP to be longer than the average
data path length of NICE. As several factors determine the positions of the end-hosts
in the data delivery tree, the data path length is not easy to estimate. For most of the
performance metrics, we cannot provide good estimations. This is the motivation of
our simulation experiments.
4.2 Results
We have discussed the application-specific parameters of DMMP in Section 3.2. In Section
4.2.1, we experiment with some of these parameters using typical scenarios. We also evaluate the general performance in the typical scenarios. Then, we present the results that we
produced using the NICE scenarios, and compare DMMP with NICE.
Some parameters are the same in all the experiments because we did not have the time to
adjust them carefully.
The threshold for promotions (see Section 3.2.8) is set to 100,000. Recall Equation 3.2.
As the source sends at a small bitrate of 64 kilobit per second, we also use rather low
values for the bandwidth of the end-hosts. More precisely, it is
128,000 <= b(i) <= c(i) <= 2 b(i) <= 2,000,000.
Hence, 100,000 seems to be a reasonable threshold.
The timeouts, Tmin, and Tmax (see Section 3.2.7) are chosen fairly arbitrarily based
on observations during the implementation and testing.
In our initial experiments, it took partitioned cluster members a long time to rejoin. In many
cases, rejoining end-hosts received five or more rejections before finding a suitable parent.
This problem has already been pointed out in Section 3.2.4. Implementing the control topology links between relatives would have been the best solution. For lack of time, we made
some slight changes to the join algorithm instead. In our implementation, joining end-hosts
send several join requests at the same time, which allows hosts to join, and, more importantly, rejoin faster. Nevertheless, this algorithm is only a temporary solution because it can
produce loops in the data topology.
run can vary greatly depending on, for example, the number of mesh partitions. In fact,
more repetitions would have been desirable. Then again, we attempt only a rough initial analysis, so some lack of precision is tolerable.
Number of super nodes: The impact of the number of super nodes is shown in Figure 4.2.
We measured the average router stress, the average data path length and the average loss rate (left panel), and the average control overhead in kilobit per second (right
panel) for different numbers of super nodes. Regarding the absolute values, note that
we have set the refresh frequency to tc = tm = 1.5 seconds, which will be discussed in
the following section. The loss rate has been multiplied by 1,000, and the data path
length has been divided by ten in order to adjust the values to one scale.
A very low number of super nodes leads to a poor overall performance. Here are some
possible explanations: The data path length and stress are comparatively high because
few super nodes lead to bigger and thus deeper clusters. The control overhead is low
because most of the refresh messages are exchanged in the clusters. The refresh messages in the clusters have a relatively small, constant size. The relatively high loss rates
could be due to the large clusters, which are vulnerable to partitions. Interestingly, the
control overhead for five super nodes is higher than the control overhead for ten super
nodes. The additional control overhead could be caused by frequent cluster partitions.
For high numbers of super nodes, we measured high control overheads. In fact, the
control overhead seems to exhibit quadratic growth. The average control overhead is
almost 16 kilobits per second for 100 super nodes. This is not unexpected. It can be easily shown mathematically that the overhead induced by refresh messages inside the
overlay core grows quadratically. It is more surprising that the overall performance
decreases with more than 30 super nodes. The redundant transmissions produced by
DVMRP could be responsible for the increased stress. The data paths could be long
because the mesh is not optimized for distance at all. A more complete DMMP implementation could perhaps achieve an overlay core with shorter data paths. The high
loss rate is difficult to explain. Possibly, a larger mesh is more vulnerable to mesh partitions. Having said that, we did increase the mesh density k from two to three in the
experiments with 40 or more super nodes. The impact of k is discussed below.
In summary, a reasonable number of super nodes should be between ten and 30. For
the remaining experiments with 1,000 end-hosts, we use 30 super nodes. According
to Figure 4.2, the stress and data path length are relatively low for 30 super nodes. At
the same time, the control overhead is still acceptable. Ultimately, our measurements
indicate that the exact number of super nodes is not very important. Any moderate
number seems to be fine. However, it is reassuring that extreme numbers of super
nodes entail poor performance. Without any super nodes, DMMP is basically a tree-first protocol similar to HMTP. When all members are super nodes, DMMP behaves
like Narada, a mesh-first protocol. Our results indicate that the two-tier architecture
of DMMP is in principle superior to those two approaches, at least for rather large
groups.
k: The impact of the mesh density is shown in Figure 4.3. k is the number of interleaved
spanning trees that the mesh is composed of; that is, the number of mesh links is about
n * k, where n is the number of super nodes. Remember that the centralized mesh construction
algorithm should have a parameter that controls the density of the mesh. Therefore,
analyze how the size of the underlying network influences our measurements. We
observed a significantly higher router stress of about 3.9. Our other results for 1,000
backbone routers did not differ much from the results shown in Figure 4.4. In particular, the data path length is almost the same in both scenarios. This indicates that the
router stress measurements are warped by the low number of routers. Therefore, we
cannot draw conclusions about the large-scale router and link stress.
The results in the left panel indicate that the control overhead remains more or less
constant with increasing group size. Note that the number of super nodes has been
adapted to the group size, which may explain the variations of the control overhead
and data path length. With increasing group size, the depth of the data delivery tree
grows necessarily. Nevertheless, the data path length seems to grow very slowly. The
trends shown in the left panel indicate that DMMP has the potential to scale to larger
group sizes.
Stress distribution: For completeness, we include our analysis of the individual router stress.
Consider a star topology as an example. Physical links near the center experience very
high stress. This kind of traffic concentration is clearly undesirable. Figure 4.5 indicates that the DMMP traffic is rather well-dispersed. Most of the packets are handled
by routers with stress five or less. There are few routers with stress greater than 15,
and the maximum router stress is 30. However, as mentioned above, the stress could
be very different for a bigger and thus more realistic network. Figure 4.5 also shows
that about 100 routers have not forwarded any data at all. Note that the router stress
has been rounded, meaning that some of these routers actually have a stress between zero and 0.5.
work, the overhead at routers is significantly higher than the overhead at an end-host". Hence, these results are not comparable. However, the control overhead at end-hosts is briefly mentioned in [7] as well: it is 0.97 kilobits per second for a group of 128 members during the stabilize phase, and the worst-case control overhead at end-hosts is proven to "increase logarithmically with increase in group size". Initially, DMMP generated higher control overhead: the average control overheads shown in Figure 4.4 are all greater than 2 kilobits per second. These results were achieved with tc = tm = 1.5 seconds, though. (The parameters tc and tm have been introduced in Section 3.2.6.) In the experiments with NICE, heartbeat messages were exchanged only every five seconds. For tc = tm = 5 seconds, we measured a significantly lower control overhead of about 0.5 kilobits per second in scenarios with 1,000 group members.
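The relationship between the refresh intervals and the measured control overhead can be approximated with a back-of-envelope sketch. The heartbeat size and neighbor count below are illustrative assumptions, not values from our simulations.

```python
def control_overhead_kbps(msg_bytes, interval_s, neighbors):
    """Average control traffic a member sends, in kilobits per second,
    when it refreshes each neighbor every interval_s seconds.
    All parameter values used below are illustrative assumptions."""
    return msg_bytes * 8 * neighbors / (interval_s * 1000.0)

# With hypothetical 100-byte refresh messages and 4 neighbors, lowering
# the refresh rate cuts the overhead proportionally:
print(control_overhead_kbps(100, 1.5, 4))  # about 2.1 kbit/s
print(control_overhead_kbps(100, 5.0, 4))  # 0.64 kbit/s
```

This simple proportionality reflects the qualitative trend of our measurements; the absolute values depend on the actual message formats and neighbor sets.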
However, the heartbeat is not necessarily comparable to tc and tm . Remember that there
are five leave phases in the NICE scenarios. That way, the partition handling is evaluated
under extreme conditions. NICE recovers from the leave phases "within 30 seconds" [7]. It is
questionable if DMMP can repair mesh partitions within 30 seconds with tm = 5 seconds. In
Figure 4.7, we compare the recovery times of DMMP for tc = tm = 1.5 seconds, tc = tm = 5
seconds, and for tm = 2.5 seconds and tc = 1.5 seconds. We recorded the current average
loss rate over all members (denoted on the vertical axis) every 10 seconds. Prior to the first
leave phase, the loss rate is almost zero. Then, 128 members leave within 10 seconds. Consequently, the loss rate rises to rather high values of around 0.1, meaning that every tenth
data message is lost. The individual peak values are not so significant since Figure 4.7 is the
result of only one simulation run for each scenario. As soon as the data topology is repaired,
the loss rate drops back to around zero. For tm = 5 seconds, this does not always happen
within 30 seconds. The loss rate is high from simulation time 700 to 740, and also from time
400 to 440. It seems that for tm = 2.5 seconds, the data topology is always repaired within
30 seconds. With tm = 1.5 seconds, the loss rates and recovery times are even lower. Since a constant average control overhead of around 2 kilobits per second seems acceptable to us, we have used this configuration for the typical scenarios.
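The loss-rate samples plotted in Figure 4.7 can be expressed as a small sketch: every 10 seconds, each member reports how many data messages it received versus how many it expected, and the average per-member loss rate is recorded. The bookkeeping shown here is a simplified assumption, not the actual OverSim instrumentation.

```python
def average_loss_rate(samples):
    """Average per-member loss rate for one 10-second sampling window.

    samples: (received, expected) data-message counts per member,
             collected by hypothetical per-member bookkeeping.
    """
    rates = [1.0 - received / expected
             for received, expected in samples if expected > 0]
    return sum(rates) / len(rates) if rates else 0.0

# Three members; the second one lost a fifth of its data messages.
print(average_loss_rate([(100, 100), (80, 100), (100, 100)]))  # about 0.067
```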
The results shown in Figure 4.6 have been produced with tc = 3.5 seconds and tm = 2.5 seconds. Note
that the control overhead is still comparable to NICE for small groups, and lower than the
control overhead generated by NICE for large groups. Moreover, the control overhead of
DMMP does not seem to grow with the number of end-hosts. However, it is important to
point out that our control overhead measurements disregard a number of aspects:
- We did not implement the refresh message exchange between relatives. Then again, this mechanism may also alleviate the overall control overhead by reducing the costs of rejoin attempts.
- The control overhead induced by the initial mesh construction has not been considered.
5 Conclusion
5.1 Lessons learnt
One important aim of this thesis was the comparison of DMMP and NICE with regard to
the stress metric. Unfortunately, our results are not comparable to the results reported in [7].
We can learn several things from this:
- Comparing the two protocols based on the reported results was not a very good idea to begin with. Complex simulation scenarios are difficult to reproduce using a different network simulator. Even if we had been able to simulate enough routers, the comparison would not have been completely fair. If at all possible, network protocols should be compared using the same network simulator. That way, much cleaner results can be produced.
- If we had considered the simulation scenarios and the accompanying hardware requirements earlier, we could have chosen a network simulator that provides a more suitable underlying network model. Alternatively, we could have come up with such a model ourselves.
- At the beginning of the implementation, we did not focus enough on debugging and verification. Later on, these tasks became very time-consuming. While early testing is always important when writing moderately complex software, it seems to be fundamental to complex network simulations.
Nevertheless, OMNeT++ turned out to be a good choice in principle, as it provides a consistent and convenient API.
5.2 Conclusions
In this thesis, we tried to evaluate whether DMMP can meet the requirements of high-bandwidth, large-scale multimedia streaming, or whether the general concerns about the scalability of application layer multicast apply to DMMP. We also investigated whether DMMP can achieve
better performance than other application layer multicast approaches. We have conducted
network simulations to analyze the scalability and efficiency of DMMP, in particular the induced stress and control overhead, in comparison with NICE.
The stress measurements show that, given a small network topology, the stress of DMMP
grows linearly with the number of end-hosts. However, our results also suggest that the
average link stress depends greatly on the size of the underlying network. Therefore, we cannot conclude that DMMP causes high link stress. In fact, we cannot estimate the behavior for a large network. Based on our other measurements, we can state that the control
overhead is low and hardly dependent on the group size. Our only concern with this result
is that we used a simplified version of DMMP. Some of the simplifications might affect the
control overhead. Similarly, the average data path length seemed to increase only slowly
with the group size. We conclude that DMMP has the potential to scale to large groups of
receivers, and that the two-tier architecture of DMMP is a promising approach.
In comparison with NICE, DMMP produces only slightly longer data paths, although
DMMP does not focus on minimizing the data path lengths. In addition, DMMP generates
less control overhead. Overall, the performance of DMMP is comparable to the performance
of NICE. Considering that several important features of DMMP have not been implemented
yet, this is a promising result.
stress. It is also not evident how well the data path length scales. In Section 4.2.2, we
have conjectured that DMMP can achieve favorable latencies. This should be checked by
evaluating the latency of DMMP. DMMP could also be compared with other multicast architectures, for example with TOMA, which is claimed to be even more efficient than NICE.
In network simulations, certain member distributions and churn models are assumed.
Therefore, DMMP should be evaluated in a real network at some point.
There are numerous possible extensions to DMMP:
Security issues: So far, DMMP cannot handle malicious nodes. A malicious super node
could disrupt the data dissemination severely.
NAT and firewall support: Firewalls and NAT boxes are common in practice. For example, the authors of [12] implemented a multimedia streaming system; they report that initially 20 to 30 percent of the users had to be turned down because their system lacked firewall and NAT support.
Overlay initialization details: The initial overlay construction may overload the RP. Several RPs could be used, but the details have not been thought out yet.
Shortening the data path length: Depending on the further analysis of the data path length,
it may be necessary to shorten the data paths, or to find other ways to improve the service quality experienced by the leaf nodes.
Peer-to-Peer concepts: Some successful peer-to-peer file-sharing concepts, such as swarming or trackers, could be incorporated into DMMP. Conversely, DMMP could be adapted to support additional applications, for example file sharing.
81
Bibliography
[1] OMNEST™ Simulation Environment. URL
http://www.omnest.com/.
[2] Tcl SourceForge Project. URL
http://tcl.sourceforge.net/.
[3] K. Almeroth. The Evolution of Multicast: From the Mbone to Inter-domain Multicast
to Internet2 Deployment. IEEE Network, 2000.
[4] S. Bajaj, L. Breslau, D. Estrin, K. Fall, S. Floyd, P. Haldar, M. Handley, A. Helmy, J. Heidemann, P. Huang, S. Kumar, S. McCanne, R. Rejaie, P. Sharma, K. Varadhan, Y. Xu,
H. Yu, and D. Zappala. Improving Simulation for Network Research. Technical Report 99-702b, Information Sciences Institute, University of Southern California, 1999.
[5] A. Ballardie, P. Francis, and J. Crowcroft. Core Based Trees (CBT). In Proceedings of the ACM SIGCOMM, pages 85–95, 1993.
[6] S. Banerjee and B. Bhattacharjee. A Comparative Study of Application Layer Multicast
Protocols. Work under submission, 2002.
[7] S. Banerjee, B. Bhattacharjee, and C. Kommareddy. Scalable Application Layer Multicast. In Proceedings of the ACM SIGCOMM, 2002.
[8] S. Banerjee, C. Kommareddy, K. Kar, B. Bhattacharjee, and S. Khuller. An Efficient
Overlay Multicast Infrastructure for Real-time Applications. Computer Networks, 50(6),
2006. Special Issue on Overlay Distribution Structures and their Applications.
[9] I. Baumgart, B. Heep, and S. Krause. OverSim: A Flexible Overlay Network Simulation Framework. In Proceedings of the 10th IEEE Global Internet Symposium, 2007.
[10] L. Breslau, D. Estrin, K. Fall, S. Floyd, J. Heidemann, A. Helmy, P. Huang, S. McCanne,
K. Varadhan, Y. Xu, and H. Yu. Advances in Network Simulation. Computer, 33(5):59–67, 2000.
[11] R. Carter and M. Crovella. Measuring Bottleneck Link Speed in Packet-Switched Networks. Technical Report BU-CS-96-006, Computer Science Department, Boston University, 1996.
[12] Y. Chu, A. Ganjam, T. Ng, S. Rao, K. Sripanidkulchai, J. Zhan, and H. Zhang. Early Experience with an Internet Broadcast System Based on Overlay Multicast. Technical Report CMU-CS-03-214, Carnegie Mellon University, 2003.
[13] Y.-H. Chu, S. G. Rao, and H. Zhang. A Case for End System Multicast. In Proceedings
of the ACM SIGMETRICS, 2000.
[14] S. Deering. Multicast Routing in Internetworks and Extended LANs. In Proceedings of
the ACM SIGCOMM, 1988.
[15] D. Estrin, M. Handley, J. Heidemann, S. McCanne, Y. Xu, and H. Yu. Network Visualization with the VINT Network Animator Nam. Technical Report 99-703b, Computer
Science Department, University of Southern California, 1999.
[16] E. Zegura et al. Modeling Topology of Large Internetworks. URL
http://www.cc.gatech.edu/projects/gtitm/, 2000.
[17] A. Helmy and S. Kumar. VINT. Virtual InterNetwork Testbed. URL
http://www.isi.edu/nsnam/vint/, 1997.
[18] J. Jannotti, D. Gifford, K. Johnson, M. Kaashoek, and J. O'Toole. Reliable Multicasting
with an Overlay Network. In Proceedings of the 4th Symposium on Operating Systems
Design and Implementation, 2000.
[19] G. Kesidis and J. Walrand. Quick Simulation of ATM Buffers with On-off Multiclass
Markov Fluid Sources. ACM TOMACS, 3(3):269–276, 1993.
[20] B. Khumawala. An Efficient Branch and Bound Algorithm for the Warehouse Location Problem. Management Science, 18(12):B718–B731, 1972. Application Series.
[21] L. Lao, J.-H. Cui, and M. Gerla. A Scalable Overlay Multicast Architecture for Large-Scale Applications. Technical Report UCLA CSD 040008, Computer Science Department, University of California, Los Angeles, 2004.
[22] L. Lao, J.-H. Cui, and M. Gerla. TOMA: A Viable Solution for Large-Scale Multicast
Service Support. In Proceedings of the IFIP Networking, 2005.
[23] J. Lei, X. Fu, and D. Hogrefe. DMMP: A New Dynamic Mesh-based Overlay Multicast
Protocol Framework. Work in progress, not published yet.
[24] J. Lei, X. Fu, and D. Hogrefe. DMMP: A New Dynamic Mesh-based Overlay Multicast
Protocol Framework. In Proceedings of the 2007 IEEE Consumer Communications and
Networking Conference - Workshop on Peer-to-Peer Multicasting (P2PM 2007), Las Vegas,
Nevada, USA, 2007.
[25] J. Lei, X. Fu, X. Yang, and D. Hogrefe. A Dynamic Mesh-based Overlay Multicast Protocol (DMMP). Internet Draft, draft-lei-samrg-dmmp-02.txt, 2007.
[26] J. Lei, I. Juchem, X. Fu, and D. Hogrefe. Architectural Thoughts and Requirements Considerations on Video Streaming over the Internet. Technical Report ISSN 1611-1044, IFI-TB-2005-06, Institute for Informatics, Georg-August-Universitaet Goettingen, 2005.
[27] Z. Li and P. Mohapatra. HostCast: A New Overlay Multicasting Protocol. In Proceedings of the IEEE International Conference on Communications, 2003.
[28] S. Naicken, A. Basu, B. Livingston, and S. Rodhetbhai. A Survey of Peer-to-Peer Network Simulators. In Proceedings of the 7th Annual Postgraduate Symposium, 2006.
[29] D. Pendarakis, S. Shi, D. Verma, and M. Waldvogel. ALMI: An Application Level Multicast Infrastructure. In Proceedings of the 3rd USENIX Symposium on Internet Technologies & Systems, 2001.
[30] B. Premore and D. Nicol. Parallel Simulation of TCP/IP Using TeD. In Proceedings of
the Winter Simulation Conference, 1997.
[31] J. Saltzer, D. Reed, and D. Clark. End-to-End Arguments in System Design. ACM Transactions on Computer Systems, 2(4):195–206, 1984.
[32] A. S. Tanenbaum. Computer Networks. Prentice-Hall India, 4th edition, 2006.
[33] A. Varga. The OMNET++ Discrete Event Simulation System. In Proceedings of the 15th
European Simulation Multiconference, 2001.
[34] A. Varga. OMNeT++. Discrete Event Simulation System. User Manual. Version 3.2, 2005.
[35] D. Xu, M. Hefeeda, S. Hambrusch, and B. Bhargava. On Peer-to-Peer Media Streaming. In Proceedings of the 22nd International Conference on Distributed Computing Systems, pages 363–371, 2002.
[36] A. Young, J. Chen, Z. Ma, and A. Krishnamurthy. Overlay Mesh Construction Using
Interleaved Spanning Trees. In Proceedings of INFOCOM, 2004.
[37] B. Zhang, S. Jamin, and L. Zhang. Host Multicast: A Framework for Delivering Multicast to End Users. In Proceedings of the IEEE INFOCOM, 2002.