
Networking Performance for Microkernels

Chris Maeda
Brian N. Bershad

School of Computer Science


Carnegie Mellon University
Pittsburgh, PA 15213
March 17, 1992

Abstract

Performance measurements of network protocols in microkernel systems have been discouraging; typically 2 to 5 times slower than comparable macrokernel systems. This disparity has led many to conclude that the microkernel approach, where protocols reside at user-level, is inherently flawed and that protocols must be placed in the kernel to get reasonable performance. We show that user-level network protocols have performed poorly because they rely on code designed to run in a kernel environment. As a result, they make assumptions about the costs of primitive protocol operations such as scheduling, preemption, and data transfer which can require substantial overhead to satisfy at user level. Good user-level protocol performance can be achieved by restructuring protocol servers to take advantage of microkernel facilities, rather than ignore them.

1 Introduction

Microkernel operating systems, such as Mach 3.0 [Accetta et al. 86], provide support for scheduling, virtual memory, and cross-address space IPC. Higher level services, such as Unix emulation and network communication, are implemented in user-level servers. In contrast, "macrokernel" systems, such as Sprite [Ousterhout et al. 88] and Berkeley's 4.3BSD [Leffler et al. 89], provide complete operating system functionality within the kernel.

Network protocols can run in kernel space or user space. Macrokernel systems put protocols in the kernel for performance reasons; input processing can be done at interrupt level, which reduces latency, and packets need only be transferred from the kernel to the receiver's address space, which reduces data copying. In contrast, microkernel systems are more likely to put the protocols in user space for flexibility; user-level code is easier to debug and modify. Moreover, microkernels generally relegate the protocols to a specific user address space and require protocol clients to indirect through that address space.

Critics of microkernel operating systems argue that important system services, such as network communication, have higher latency or lower throughput than in monolithic kernels. For example, in UX, CMU's single-task Unix server for Mach 3.0 [Golub et al. 90], the round trip latency for UDP datagrams is about three times greater than for the Mach 2.5 system, where all Unix functionality is in the kernel.

In this paper we show that the structure of a protocol implementation, and not the location, is the primary determinant of performance. We support our position by measuring the performance of a simple protocol, UDP, in several different versions of the Mach operating system. Although UDP performance for Mach's single server Unix system is poor, the performance problems lie in the Unix server, not the microkernel. The goal of the Unix server implementation was to get the code working quickly and then, where necessary, to get quickly working code. This approach optimizes programmer time, since it is possible to build a Unix server by taking an in-kernel version of Unix and applying "wrapper" code so it runs at user level.
To justify our argument, we show that a carefully implemented user-level protocol server can perform as well as an in-kernel version. Our "proof" consists of a reimplementation of UDP which does not assume that it is running in the "Unix kernel" but is instead optimized for the microkernel environment of Mach 3.0.

2 Network Performance Measurements

Based on the performance of network protocols in Mach 3.0, it would be easy to conclude that putting protocols at user-level is a bad idea. Table 1 shows the average round trip time for small UDP packets (1 data byte plus headers) using an in-kernel and out-of-kernel protocol implementation. All times were collected by sending 1000 packets back and forth between two user-level processes running on DECstation 2100's connected by a 10 Mb/s ethernet. The data in the tables represent the mean values of multiple 1000-packet trials. The variance in all cases was between 5 and 10 percent of the mean. The DECstation 2100 uses a MIPS R2000 processor running at 12 MHz. Our machines each had 16 MB of RAM.

    UDP Implementation        Time (ms)
    Mach 2.5 kernel
    Unix server (IPC, VM)     19.5

Table 1: UDP Round Trip Times for Various Systems. This is what you would observe if you ran the Mach 2.5 macrokernel (first line), and Mach 3.0 version MK65 with the Unix single server version UX29 (second line). The Unix server uses Mach IPC and VM operations to communicate with an in-kernel device driver.
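For concreteness, the following is a minimal sketch of the kind of ping-pong measurement loop described above, written against the ordinary BSD socket interface. The port number and the absence of error handling are illustrative simplifications; only the pattern (1 data byte, 1000 round trips, mean reported) follows the text. The peer simply echoes each datagram back with a recvfrom()/sendto() loop.

    /*
     * Sketch of a UDP ping-pong benchmark: send a small datagram, wait
     * for the echo, repeat, and report the mean round trip time.
     */
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    #define TRIALS 1000

    int main(int argc, char **argv)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in peer;
        struct timeval start, end;
        char buf[1];                                 /* 1 data byte plus headers */
        int i;

        memset(&peer, 0, sizeof peer);
        peer.sin_family = AF_INET;
        peer.sin_port = htons(5001);                 /* illustrative port */
        peer.sin_addr.s_addr = inet_addr(argv[1]);   /* echoing peer host */

        gettimeofday(&start, 0);
        for (i = 0; i < TRIALS; i++) {
            sendto(s, buf, sizeof buf, 0, (struct sockaddr *)&peer, sizeof peer);
            recvfrom(s, buf, sizeof buf, 0, 0, 0);   /* wait for the echo */
        }
        gettimeofday(&end, 0);

        printf("mean round trip: %.2f ms\n",
               ((end.tv_sec - start.tv_sec) * 1e6 +
                (end.tv_usec - start.tv_usec)) / 1000.0 / TRIALS);
        return 0;
    }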
In Mach 2.5, UDP is in the kernel. In Mach 3.0, UDP is in the Unix server but uses nearly the same code as in Mach 2.5. The Unix server described in Table 1 uses Mach IPC to send outgoing packets to an in-kernel ethernet device driver. Incoming packets are read by the device driver and then demultiplexed to user tasks (normally just the Unix server) by the packet filter [Mogul et al. 87]. Once they are demultiplexed, these incoming packets are forwarded via Mach IPC to a Unix server thread which does subsequent protocol processing.
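The packet filter of [Mogul et al. 87] is a small program, written in the filter's own stack-machine language and installed in the kernel by the receiving task, that tells the kernel which incoming frames to deliver to that task. As a rough illustration of what such a filter computes for a UDP receiver, the C predicate below accepts Ethernet frames carrying IP/UDP traffic for one destination port. It assumes BSD-style header definitions, and it is only an illustration of the predicate; the real filter is interpreted inside the kernel, not compiled C.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <netinet/ip.h>
    #include <netinet/udp.h>

    /* Return nonzero if this Ethernet frame should be delivered to the
     * task that registered a filter for `port`. */
    int udp_filter_match(const unsigned char *frame, unsigned int frame_len,
                         unsigned short port)
    {
        const struct ip *ip;
        const struct udphdr *udp;
        unsigned int ethertype;

        if (frame_len < 14 + sizeof(struct ip) + sizeof(struct udphdr))
            return 0;                               /* too short to be IP/UDP */
        ethertype = (frame[12] << 8) | frame[13];   /* type field after 6+6 MAC bytes */
        if (ethertype != 0x0800)                    /* not an IP datagram */
            return 0;

        ip = (const struct ip *)(frame + 14);
        if (ip->ip_p != IPPROTO_UDP)                /* not UDP */
            return 0;

        udp = (const struct udphdr *)((const unsigned char *)ip + ip->ip_hl * 4);
        return ntohs(udp->uh_dport) == port;        /* addressed to our port? */
    }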
The advantage of the IPC interface is that it allows network-transparent device access; clients of a device may be located on nodes to which the device is not physically attached (as might be the case on a multicomputer like the Touchstone). The disadvantage of the IPC interface is that outgoing data must be copied into a message buffer (or manipulated through VM operations), and that there is additional IPC latency between the protocol server and the device. The Unix server sends each outbound packet to the ethernet device driver as an IPC message with "out-of-band" data, which allows the IPC system to map the data into the receiver's (here, the kernel's) address space rather than copy it. In Mach, however, it can be more efficient to copy, rather than remap, small amounts of data between address spaces.

2.1 Some Easy Improvements

We can reduce the latency of a round trip message by eliminating the VM and IPC operations on the outgoing path. VM overhead can be eliminated by sending the data "in-band" in an IPC message to the device driver. IPC overhead can be eliminated by using a system call to trap directly into the kernel device driver instead of using Mach IPC.[1] The first and second rows of Table 2 show the improvement in latency which comes by first eliminating the VM operations, and then the VM operations in conjunction with the IPC.

[1] A user-to-kernel RPC for Mach 3.0 on the DECstation 2100 takes between 100 and 300 µsecs, depending on cache effects [Bershad et al. 91], whereas a system call takes only about 30 µsecs. Little effort has been spent trying to reduce the latency of user-to-kernel RPC, whereas significant effort has gone towards reducing user-to-user RPC latency [Draves et al. 91].
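The copy-versus-remap tradeoff exploited by the in-band variant can be summarized with a simple cost model: remapping pages out-of-band costs a roughly fixed amount of VM work per message, while copying in-band costs time proportional to the data size, so small packets favor copying. The constants in the sketch below are hypothetical placeholders, not measurements from this paper.

    /*
     * Back-of-the-envelope model for the copy-vs-remap tradeoff.  The
     * constants are hypothetical: remapping has a fixed per-message VM
     * cost, copying scales with the number of bytes.
     */
    #include <stdio.h>

    #define REMAP_FIXED_COST_USEC   400.0   /* hypothetical: map + unmap + faults */
    #define COPY_COST_USEC_PER_KB    25.0   /* hypothetical: bcopy bandwidth      */

    static double copy_cost(double bytes)  { return COPY_COST_USEC_PER_KB * bytes / 1024.0; }
    static double remap_cost(double bytes) { (void)bytes; return REMAP_FIXED_COST_USEC; }

    int main(void)
    {
        double sizes[] = { 64, 512, 1500, 8192, 65536 };
        int i;

        for (i = 0; i < 5; i++)
            printf("%6.0f bytes: copy %7.1f usec, remap %7.1f usec -> %s\n",
                   sizes[i], copy_cost(sizes[i]), remap_cost(sizes[i]),
                   copy_cost(sizes[i]) < remap_cost(sizes[i]) ? "copy in-band"
                                                              : "remap out-of-band");
        return 0;
    }

For the 1-byte datagrams measured here, the copy is negligible, so sending the data in-band avoids the VM operations at essentially no cost.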
A more aggressive approach is to map the device directly into the protocol server's address space. This technique, described in [Forin et al. 91], allows the protocol server to also act as the network device driver, controlling the network interface directly, and avoiding the copy and the IPC to the in-kernel device driver. The round trip UDP latency for this approach is shown in the third row of Table 2. The 6 ms improvement is because a user-to-kernel IPC and several VM operations have been avoided.

    UDP Implementation                      Time (ms)
    Unix server (IPC, no VM)
    Unix server (syscall, no IPC, no VM)
    Unix server (mapped driver)

Table 2: Second Cut UDP Round Trip Times for Various Systems. This is what you would observe if you ran Mach 3.0 with an experimental version of UX29 (first row), or if you extend the kernel system call interface to include a direct device-write system call (second row), or if you enable mapped ethernet support in a standard system (third row).

2.2 What Do The Numbers Mean?

Although the times in Table 2 are better than the Mach 3.0 times in Table 1, they are still worse than Mach 2.5. It is tempting to conclude that the network performance in Mach 3.0 is slower because the protocols are out of kernel. Since the actual protocol code which executes in the two systems is the same, the slowdown must be caused by the extra interprocess communication costs associated with the microkernel architecture. However, a few observations about the cost of crossing address space boundaries between the Unix server and the kernel, and between the Unix server and the protocol clients, show that this conclusion can't be right.

On a DECstation 2100, it takes about 490 µsecs to access the Unix server's socket layer from a user program. This path includes one pass through the system call emulation library (about 280 µsecs), which translates system calls into IPC messages, and one IPC round trip between the user task and the Unix server (about 180 µsecs). The IPC time can be further broken down into 140 µsecs for the actual RPC and 40 µsecs for the stub code. In contrast, it takes about 60 µsecs to access the in-kernel socket layer using a system call in Mach 2.5.

Each UDP round trip includes four socket accesses (a read and write on each host). If we assume the worst case where there is no overlap of reads and writes on the two hosts, having the sockets in the Unix server should add no more than 2 ms (4 * (490 - 60) µsecs) to the total round trip time.

It takes about 200 µsecs to write a packet to the ethernet device driver using a kernel syscall. Receiving a message from the packet filter with Mach IPC takes about 300 µsecs. (UX29 with the mapped ethernet driver does not currently support the packet filter.) Thus, crossing the address-space boundary between the device driver in the kernel and the protocol stack in the Unix server should add about 1 ms (2 * 500 µsecs) to the total round trip time.

Assuming the protocol stack runs as fast in the Unix server as it does in the kernel, a UDP round trip in Mach 3.0 should then be about 3 ms slower than in Mach 2.5 (2 ms to talk to the client and 1 ms to talk to the device driver). The indirection through the Unix server should thus have a relatively small effect on network performance.
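The budget above is just arithmetic on the measured crossing costs; the fragment below restates it with the numbers quoted in the text.

    /*
     * The latency budget argued above, restated as arithmetic on the
     * measured costs quoted in the text (all times in microseconds).
     */
    #include <stdio.h>

    int main(void)
    {
        double socket_ux  = 490.0;  /* user program -> Unix server socket layer */
        double socket_25  = 60.0;   /* user program -> in-kernel socket layer   */
        double driver_hop = 500.0;  /* Unix server <-> in-kernel device driver  */

        /* Four socket accesses per round trip (a read and a write on each
         * host), two driver crossings (one send, one receive). */
        double added = 4.0 * (socket_ux - socket_25) + 2.0 * driver_hop;

        printf("expected extra latency: %.2f ms\n", added / 1000.0);  /* about 2.7 ms */
        return 0;
    }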

2.3 Where Does The Time Go?

Little attention has been paid to the implementation of the protocols themselves within the Unix server, so even small measures can still go a long way. For example, we were able to gain an easy 30% performance improvement in network round trip times simply by changing the way in which the lowest level of the Unix server interacts with the device driver. In the rest of this section we describe two other problems with the Unix server implementation.

The wrong kind of synchronization

The Mach 2.5 and BSD kernels use processor priorities (splx) to synchronize internally. The spl machinery was designed for a uniprocessor architecture and for a kernel split into a non-preemptive top half and an interrupt-driven bottom half. However, the Unix server is multithreaded and relies on fine-grain locks for concurrency control. Instead of converting the approximately 1000 spl calls in the server to use lock-based synchronization, we simulate the spl calls at user-level with locks and software interrupts. Each UDP round trip has about 50 spls on the critical path, with each requiring about 12 µsecs on a DECstation 2100. We observed a speedup of about 150 µsecs per round trip when we switched to a cheaper locking primitive [Bershad 91].
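The paper does not show the emulation itself, but the idea can be sketched as follows: a lock protects a simulated priority level, the network input thread posts a software interrupt instead of calling the protocol input routine directly while the level is raised, and splx() runs the deferred work once the level drops back to spl0. The sketch uses POSIX threads for self-containment; the Unix server is actually built on Mach's C-Threads, and the real code must cover the full set of spl levels.

    #include <pthread.h>

    static pthread_mutex_t spl_lock = PTHREAD_MUTEX_INITIALIZER;
    static int spl_level;                /* 0 = spl0, >0 = "interrupts" masked */
    static int netintr_pending;          /* software interrupt posted while masked */

    static void netintr(void);           /* deferred protocol input processing */

    /* Raise the simulated priority; returns the previous level for splx(). */
    int splnet(void)
    {
        int old;
        pthread_mutex_lock(&spl_lock);
        old = spl_level++;
        pthread_mutex_unlock(&spl_lock);
        return old;
    }

    /* Restore the previous level; run any deferred network "interrupt"
     * once the level drops back to spl0. */
    void splx(int old)
    {
        int run;
        pthread_mutex_lock(&spl_lock);
        spl_level = old;
        run = (spl_level == 0 && netintr_pending);
        if (run)
            netintr_pending = 0;
        pthread_mutex_unlock(&spl_lock);
        if (run)
            netintr();
    }

    /* Called by the network input thread for each arriving packet. */
    void post_network_interrupt(void)
    {
        int run;
        pthread_mutex_lock(&spl_lock);
        run = (spl_level == 0);
        if (!run)
            netintr_pending = 1;         /* defer until splx() reaches spl0 */
        pthread_mutex_unlock(&spl_lock);
        if (run)
            netintr();
    }

    static void netintr(void)
    {
        /* Protocol input processing would run here; its own critical
         * sections are bracketed with splnet()/splx(). */
    }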

Too many levels of scheduling

The Unix server uses Mach's C-Threads library to multiplex several user-level threads on top of a few kernel threads [Cooper & Draves 88]. Synchronization between user-level threads is generally cheap, since it is little more than a coroutine switch. However, the Unix server's network input threads are wired to dedicated kernel threads. This means that scheduling operations between the network interrupt thread and other threads in the Unix server (such as those waiting on incoming packets or locks) involve two kernel threads. Mach kernel threads can only synchronize with Mach IPC, so interthread synchronization within the Unix server involves additional passes through the kernel.

3 A New User-Level UDP Server

To better show that the implementation, rather than the address space, is responsible for the poor network performance, we built a standalone UDP protocol server which exports an RPC interface analogous to Unix's sendto and recvfrom interface. UDP clients communicate with their local UDP server using Mach RPC without memory remapping to move data between address spaces.
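The paper does not list the server's interface, so the header below is only a hypothetical sketch of what an RPC interface "analogous to Unix's sendto and recvfrom" might look like; every name, type, and signature here is illustrative.

    /*
     * Hypothetical sketch of the RPC interface a standalone UDP server
     * might export, analogous to Unix sendto()/recvfrom().  Illustrative
     * only; the real interface is not given in the paper.
     */
    #include <mach/mach.h>          /* mach_port_t, kern_return_t */

    typedef struct {
        unsigned long  addr;        /* foreign IP address, network order */
        unsigned short port;        /* foreign UDP port, network order   */
    } udp_endpoint_t;

    /* Allocate a socket bound to a local UDP port; the server returns a
     * port that names it in later calls. */
    kern_return_t udp_bind(mach_port_t server, unsigned short local_port,
                           mach_port_t *socket /* out */);

    /* Send `len` bytes, carried in-line in the request message. */
    kern_return_t udp_sendto(mach_port_t socket, udp_endpoint_t to,
                             const char *data, unsigned int len);

    /* Block until a datagram arrives; data returns in-line in the reply. */
    kern_return_t udp_recvfrom(mach_port_t socket, udp_endpoint_t *from /* out */,
                               char *data /* out */, unsigned int *len /* inout */);

In practice, an interface of roughly this shape would be specified in MIG, which generates the client-side stubs that stand in for the sendto()/recvfrom() system calls.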
The UDP server's lowest layer is the same as the Unix server's; incoming packets are received from the packet filter and outgoing packets are written to the device driver with a device_write system call. The server takes care of checksumming, packet formatting, port dispatching, etc.

The main differences between our UDP server and the Unix server are:

  - The incoming packet path is optimized so that packets are only copied once, from a buffer on the network input thread's stack to an IPC message buffer that is sent to the destination task (see the sketch following this list). This path only requires one primitive locking operation.

  - None of the server's C-threads are wired to kernel threads, so interthread context switching is cheap.

  - Protocol clients call the server via Mach IPC instead of Unix syscalls, thus avoiding the syscall-to-IPC translation step in the system call emulator.
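The sketch below illustrates the single-copy receive path from the first bullet. The helper names (packet_filter_recv, lookup_port, ipc_send) are hypothetical stand-ins for the corresponding packet filter and Mach IPC operations; only the structure, one stack buffer and exactly one copy into the outgoing IPC message, is the point.

    #include <string.h>

    #define MAX_PACKET 1514

    struct udp_reply_msg {
        /* IPC header fields omitted */
        unsigned int len;
        char data[MAX_PACKET];      /* payload travels in-line in the message */
    };

    extern int  packet_filter_recv(char *buf, unsigned int maxlen);  /* hypothetical */
    extern int  lookup_port(const char *pkt, unsigned int len);      /* hypothetical */
    extern void ipc_send(int dest, struct udp_reply_msg *msg);       /* hypothetical */

    void network_input_thread(void)
    {
        char buf[MAX_PACKET];                /* buffer on this thread's stack */
        struct udp_reply_msg msg;
        int len, dest;

        for (;;) {
            len = packet_filter_recv(buf, sizeof buf);
            dest = lookup_port(buf, len);    /* the one primitive lock is here */
            if (dest < 0)
                continue;                    /* no one is listening; drop it  */
            msg.len = len;
            memcpy(msg.data, buf, len);      /* the single copy */
            ipc_send(dest, &msg);
        }
    }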

The round trip times for two user-level applications to communicate over the network via the new UDP server are shown on the first lines of Table 3. The table also repeats the times for the unoptimized Unix server and for the in-kernel protocol implementation for comparison. Table 3 shows that it is possible to achieve round trip latencies comparable to a macrokernel with a user-level protocol implementation. Furthermore, since the top-level (user) and bottom-level (kernel) interfaces to the protocol stack are the same, it is clear that the speedup over the Unix server's UDP is due to the difference in implementation. The difference between the user-to-user and server-to-server times reflects the cost of the additional RPC's through the Mach 3.0 microkernel.

    UDP Implementation                  Time (ms)
    New UDP Server (user-to-user)
    New UDP Server (server-to-server)
    Mach 2.5 kernel
    Unix server (IPC, VM)               19.5

Table 3: UDP Round Trip Times for Various Systems. The first line is for UDP access through a special UDP server running on each machine. The second line is for server-to-server communication. This eliminates the Mach IPC between the end users and the protocol server. The third and fourth lines are repeated from earlier tables.

4 Absolute Performance Is What Counts

There is one compelling observation which weakens our claim that protocols can be effectively implemented at user level: our UDP server is still slow. Even a 4 ms UDP round trip time, in-kernel or out-of-kernel, is slow. We agree. Mach 3.0 has an in-kernel network RPC that is about twice as fast [Barrera 91].[2] Researchers at DEC SRC [Schroeder & Burrows 90] have shown that the way to reduce latency to the bare minimum is to put pieces of the protocol in the kernel's interrupt handler, thereby eliminating an additional dispatch and RPC.

[2] Unlike UDP, Mach RPC has kernel-oriented semantics, such as the ability to pass out-of-line data and port rights, which also motivate an in-kernel implementation.

There are three responses to the "sorry, not fast enough" argument. The first is that we could make it faster if we tried; we just haven't. Our goal was to show that a user-level implementation could match an in-kernel implementation, and we've done that. We could, for example, improve the implementation of the packet filter, or use it more aggressively by having each receiving thread register its own filter. We could also use shared memory to pass data between the UDP server and its clients, as is done now in some cases by the Unix server,[3] or between the kernel and the Unix server [Reynolds & Heller 91].

[3] Work is underway at CMU to pass network messages between the Unix server and protocol clients via shared memory buffers.

The second response is that, for low latency request-response protocols, the important parts of the protocol (that is, the common cases) can be implemented directly within the sending and receiving address spaces. This enables the protocol server to be bypassed altogether. With the exception of using a general packet filter, this is the approach taken in DEC SRC's RPC implementation for the Firefly.

The third response is that not all protocols are expected to have low latency. A protocol such as TCP, where throughput and not latency is crucial, can run efficiently at user level [Forin et al. 91].

5 Conclusions

We have shown that at least one user-level protocol is slow when running on top of a microkernel and that it can be made fast by "de-Unixifying" it. We expect that other protocols will be amenable to this general approach as well.

The UDP server described in this paper is not part of the standard Mach distribution from CMU. It was done within the context of an experimental kernel and user environment so that we could demonstrate what was and what was not responsible for microkernel networking performance.

References

[Accetta et al. 86] Accetta, M. J., Baron, R. V., Bolosky, W., Golub, D. B., Rashid, R. F., Tevanian, Jr., A., and Young, M. W. Mach: A New Kernel Foundation for UNIX Development. In Proceedings of the Summer 1986 USENIX Conference, pages 93-113, July 1986.

[Barrera 91] Barrera, J. S. A Fast Mach Network IPC Implementation. In Proceedings of the Second Usenix Mach Workshop, November 1991.

[Bershad 91] Bershad, B. N. Mutual Exclusion for Uniprocessors. Technical Report CMU-CS-91-116, School of Computer Science, Carnegie Mellon University, April 1991.

[Bershad et al. 91] Bershad, B. N., Draves, R. P., and Forin, A. Using Microbenchmarks to Evaluate System Performance. Submitted to WWOS 92, December 1991.

[Cooper & Draves 88] Cooper, E. C. and Draves, R. P. C-Threads. Technical Report CMU-CS-88-54, School of Computer Science, Carnegie Mellon University, February 1988.

[Draves et al. 91] Draves, R. P., Bershad, B. N., Rashid, R. F., and Dean, R. W. Using Continuations to Implement Thread Management and Communication in Operating Systems. In Proceedings of the 13th ACM Symposium on Operating Systems Principles, pages 122-136, October 1991.

[Forin et al. 91] Forin, A., Golub, D. B., and Bershad, B. N. An I/O System for Mach 3.0. In Proceedings of the Second Usenix Mach Workshop, November 1991.

[Golub et al. 90] Golub, D., Dean, R., Forin, A., and Rashid, R. Unix as an Application Program. In Proceedings of the Summer 1990 USENIX Conference, pages 87-95, June 1990.
[Leffler et al. 89] Leffler, S., McKusick, M., Karels, M., and Quarterman, J. The Design and Implementation of the 4.3BSD UNIX Operating System. Addison-Wesley, Reading, MA, 1989.

[Mogul et al. 87] Mogul, J., Rashid, R., and Accetta, M. The Packet Filter: An Efficient Mechanism for User-level Network Code. In Proceedings of the Eleventh ACM Symposium on Operating Systems Principles, pages 39-51, 1987.

[Ousterhout et al. 88] Ousterhout, J. K., Cherenson, A. R., Douglis, F., Nelson, M. N., and Welch, B. B. The Sprite Network Operating System. IEEE Computer, 21(2):23-36, February 1988.

[Reynolds & Heller 91] Reynolds, F. and Heller, J. Kernel Support for Network Protocol Servers. In Proceedings of the Second Usenix Mach Workshop, November 1991.

[Schroeder & Burrows 90] Schroeder, M. D. and Burrows, M. Performance of Firefly RPC. ACM Transactions on Computer Systems, 8(1):1-17, February 1990. Also appeared in Proceedings of the 12th ACM Symposium on Operating Systems Principles, December 1989.

