
Understanding the Packet Forwarding Capability of General-Purpose Processors

Katerina Argyraki, Kevin Fall, Gianluca Iannaccone, Allan Knies, Maziar Manesh, Sylvia Ratnasamy
EPFL, Intel Research

Abstract

Compared to traditional high-end network equipment built on specialized hardware, software routers running on commodity servers offer significant advantages: lower costs due to large-volume manufacturing, a widespread supply/support chain, and, most importantly, programmability and extensibility. The challenge is scaling software-router performance to carrier-level speeds. As a first step, in this paper, we study the packet-processing capability of modern commodity servers; we identify the packet-processing bottlenecks, examine to what extent these can be alleviated through upcoming technology advances, and discuss what further changes are needed to take software routers beyond the small enterprise.

1 Introduction

To what extent are general-purpose processors capable of high-speed packet processing? The answer to this question could have significant implications for how future network infrastructure is built. To date, the development of network equipment (switches, routers, various middleboxes) has focused primarily on achieving high performance for relatively limited forms of packet processing. However, as networks take on increasingly sophisticated functionality (e.g., data loss protection, application acceleration, intrusion detection), and as major ISPs compete in offering new services (e.g., video, mobility support services), there is an increasing need for network equipment that is programmable and extensible. And indeed, both industry and research have already taken initial steps to tackle the issue [4, 6, 7, 9, 21].

In current networking equipment, high performance and programmability are competing goals, if not mutually exclusive. On the one hand, we have high-end switches and routers that rely on specialized hardware and software and offer high performance, but are notoriously difficult to extend, program, or otherwise experiment with. On the other hand, we have software routers, where all significant packet-processing steps are performed in software running on commodity PC/server platforms; these are, of course, easily programmable, but only suitable for low-packet-rate environments such as small enterprises [6].

The challenge of building network infrastructure that is both programmable and capable of high performance can be approached from one of two extreme starting points. One approach would be to start with existing high-end, specialized devices and retrofit programmability into them. For example, some router vendors have announced plans to support limited APIs that will allow third-party developers to change or extend the software part of their products (which does not typically involve core packet processing) [7, 9]. A larger degree of programmability is possible with network-processor chips, which offer a semi-specialized option: they implement only the most expensive packet-processing operations in specialized hardware and run the rest on programmable processors. While certainly an improvement, we note that, in practice, network processors have proven hard to program: in the best case, the programmer needs to learn a new language; in the worst, she must be aware of (and program to avoid) low-level issues like resource contention during parallel execution or expensive memory accesses [14, 16].

From the opposite end of the spectrum, a different approach would be to start with software routers and optimize their packet-processing performance. The allure of this approach is that it would allow network infrastructure to tap into the many desirable properties of the PC-based ecosystem, including lower costs due to large-volume manufacturing, rapid advances in power management, familiar programming environments and operating systems, and a widespread supply/support chain. In other words, if feasible, this approach could enable a network infrastructure that is programmable in much the same way as end-systems are today. The challenge is taking this approach beyond the small enterprise, i.e., scaling PC/server packet-processing performance to carrier-level speeds.

It is perhaps too early to tell which approach will dominate; in fact, it is more likely that each approach will result in different tradeoffs between programmability and performance, and these tradeoffs will cause each to be adopted where appropriate. As yet, however, there has been little research exposing what tradeoffs are achievable. As a first step in this direction, in this paper, we explore the performance limitations of packet processing on commodity servers.

A legitimate question at this point is whether the performance requirements for network equipment are just too high and our exploration is a fool's errand. The bar is indeed high: in terms of individual link/port speeds, 10Gbps is already widespread and 40Gbps is being deployed at major ISPs; in terms of aggregate switching speeds, carrier-grade routers range from 40Gbps to a high of 92Tbps. Two developments, however, lend us hope. The first is a recent research proposal [11] that presents a solution whereby a cluster of N servers can be interconnected to achieve aggregate switching speeds of NR bps, provided each server can process packets at a rate on the order of R bps. This result implies that, in order to scale software routers, it is sufficient to scale a single server to individual line speeds (10-40Gbps) rather than aggregate speeds (40Gbps-92Tbps). This reduction makes for a much more plausible target. Second, we expect the current trajectory of server technology trends to work in favor of packet-processing workloads. For example, packet processing appears naturally suited to exploiting the tremendous computational power that multicore processors offer parallel applications. Similarly, I/O bandwidth has gained tremendously from the transition from PCI-X to PCIe, allowing 10Gbps Ethernet NICs to enter the PC market [1]. And finally, as we discuss in Section 4, the impending arrival of multiprocessor architectures with multiple independent memory controllers should offer a similar boost in available memory bandwidth.

While there is widespread awareness of these advances in server technology, we find little comprehensive evaluation of how they can or do translate into performance improvements for packet-processing workloads. Hence, in this paper, we undertake a measurement study aimed at exploring these issues. Specifically, we focus on the following questions: what are the packet-processing bottlenecks in modern general-purpose platforms; what (hardware or software) architectural changes can help remove these bottlenecks; and do the current technology trends for general-purpose platforms favor packet processing? As we shall see, answering these seemingly straightforward questions requires a surprising amount of sleuthing. Modern processors and operating systems are both beasts of great complexity. And while current hardware and software offer extensive hooks for measurement and system profiling, these can be equally overwhelming. For example, current x86 processors have over 400 performance counters that can be programmed for detailed tracing of everything from branch mispredictions to I/O data transactions. It is thus easy (as we discovered) to sink into a morass of performance-monitoring data. Part of our contribution is thus a methodology by which to go about such an evaluation. Our study adopts a top-down approach in which we start with black-box testing and then recursively identify and drill down into only those aspects of the overall system that merit further scrutiny.

Finally, it is important to note that, even though our study stemmed from an interest in programmable network infrastructure, our findings are relevant to more than just the network context. Packet processing is just one instance of a more general class of stream-based applications (such as real-time video delivery, stock trading, etc.), and our findings apply equally to these.

The remainder of this paper is organized as follows. We start in Section 2 with some high-level analysis estimating upper bounds on the packet-processing performance of different server architectures. Section 3 follows with a measurement study aimed at identifying the bottlenecks and overheads on these servers. We present the inferences from our measurement study in Section 4 and discuss potential improvements in Section 5. We discuss related work in Section 6 and finally conclude.

2 Optimistic Back-of-the-Envelope Analysis


Before delving into experimentation, we would like to calibrate our expectations. We thus start with a simple thought experiment aimed at estimating absolute upper bounds on the packet-forwarding performance of both existing and next-generation servers. Since our goal is quick calibration, our reasoning here is deliberately both coarse-grained and optimistic; the experimental results that follow will show where reality lies. Figures 1 and 2 present a high-level view of two server architectures: Fig. 1 depicts a traditional shared-bus architecture used in current x86 servers [3], while Fig. 2 represents a point-to-point architecture as will be supported by the next generation of x86 servers [8]. In the shared-bus architecture, communication between the CPUs, memory, and I/O is routed through the chipset, which includes the memory and I/O bus controllers. There are three main system buses in this architecture. The front-side bus (FSB) is used for communication both between different CPUs and between a CPU¹ and the chipset. The PCIe bus connects I/O devices, including network interfaces, to the chipset via one or more high-speed serial channels known as lanes. Finally, the memory bus connects the memory to the chipset.
¹In this paper, we use the terms CPU, socket, and processor interchangeably to refer to a multi-core processor.

Figure 1: Traditional shared bus architecture.

The point-to-point server (Fig. 2) represents two significant architectural changes relative to the above. First, the FSB is replaced by a mesh of dedicated point-to-point links, thus removing a potential bottleneck for inter-CPU communication. Second, the point-to-point architecture replaces the single external memory controller shared across CPUs with a memory controller integrated within each CPU; this leads to a dramatic increase in aggregate memory bandwidth, since each CPU now has a dedicated link to a portion of the overall memory space. Servers based on such point-to-point architectures and with up to 32 cores (4 sockets and 8 cores/socket) are due to emerge in the near future [10].

Figure 2: Point-to-point architecture.

To estimate a server's packet-forwarding capability, we consider the following minimal set of operations typically required to forward an incoming packet, and the corresponding load they impose on each of the primary system components:

1. The incoming packet is DMA-ed from the network card (NIC) to main memory (one transaction on each of the PCIe and memory buses).
2. The CPU reads the packet header (one transaction on each of the FSB and memory buses).
3. The CPU performs any necessary packet processing (CPU-only, assuming no bus transactions).
4. The CPU writes the modified packet header to memory (one transaction on each of the memory bus and FSB).
5. The packet is DMA-ed from memory to the NIC (one transaction on each of the memory and PCIe buses).

Figures 1 and 2 also show how each of these operations maps onto the various system buses for the architecture in question. As we see, for the shared-bus architecture, a single packet results in 4 transactions on the memory bus and 2 on each of the FSB and PCIe buses; thus, a line rate of R bps leads to (roughly) loads of 4R, 2R, and 2R on the memory, FSB, and PCIe buses.² Currently available technology advertises memory, FSB, and PCIe bandwidths of approximately 100Gbps, 85Gbps, and 64Gbps respectively (assuming DDR2 SDRAM at 800MHz, a 64-bit-wide 1.33GHz FSB, and 32-lane PCIe 1.1); these numbers suggest that a current shared-bus architecture could sustain line rates up to R = 25Gb/s.

For the point-to-point architecture, each packet contributes 4 memory-bus transactions, 4 transactions on the inter-socket point-to-point links, and 2 PCIe transactions; since we have 4 memory buses, 6 inter-socket links, and 4 PCIe links, assuming uniform load distribution across the system, a line rate of R bps yields loads of R, 2R/3, and R/2 on each memory bus, inter-socket link, and PCIe link respectively. If we (conservatively) assume the same technology constants as before (memory, inter-socket, and PCIe bandwidths of 100Gbps, 85Gbps, and 64Gbps respectively), this suggests a point-to-point architecture could scale to line rates of 40Gb/s and even higher.

In terms of CPU resources: if we assume min-sized packets of 40 bytes, the packet inter-arrival time is 32ns at R = 10Gb/s and 8ns at R = 40Gb/s. For the shared-bus server with 8 CPUs, each with a speed of 3GHz (available today), this implies a budget of 3072 and 768 cycles/pkt for line rates of 10Gbps and 40Gbps respectively. Assuming a cycles-per-instruction (CPI) ratio of 1, this suggests a budget of 3072 (768) instructions per packet at 10Gb/s (40Gb/s). With 32 cores at similar speeds, the point-to-point server would see a budget of 12288 and 3072 instructions/pkt for 10Gb/s and 40Gb/s respectively.

In summary, based on the above, one might conclude that current shared-bus architectures may scale to 25Gb/s but not 40Gb/s, while emerging servers may scale even to 40Gb/s.
²This estimate assumes that the entire packet (rather than just the header) is read to/from memory and the CPU for packet processing. A more accurate estimate would account for packet-header sizes (or cache-line sizes, if smaller than header lengths). We ignore this here, since our tests in the following section consider only min-sized packets of 64 bytes, equal to a cache-line length, which makes the inaccuracy of little relevance.
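The accounting above lends itself to a small calculator. The following sketch is ours, not the authors'; the per-packet transaction counts, link counts, and the idea of a per-packet cycle budget are taken from this section:

```python
# A small calculator for the back-of-the-envelope accounting above (our own
# sketch; the per-packet transaction counts and link counts are the ones
# assumed in this section).

def shared_bus_loads(R):
    # 4 memory-bus, 2 FSB, and 2 PCIe transactions per packet -> load in bps.
    return {"memory": 4 * R, "fsb": 2 * R, "pcie": 2 * R}

def p2p_loads(R):
    # 4 memory, 4 inter-socket, and 2 PCIe transactions per packet, spread
    # uniformly over 4 memory buses, 6 inter-socket links, and 4 PCIe links.
    return {"memory": R, "inter_socket": 2 * R / 3, "pcie": R / 2}

def cycle_budget(line_rate_bps, pkt_bytes, n_cores, clock_hz):
    # Packet inter-arrival time multiplied by the aggregate clock rate.
    return (pkt_bytes * 8 / line_rate_bps) * n_cores * clock_hz

print(shared_bus_loads(10e9))  # 40/20/20 Gbps on memory/FSB/PCIe
print(p2p_loads(40e9))         # point-to-point loads at a 40Gb/s line rate
```

Comparing each returned load against the advertised bus bandwidths reproduces the 25Gb/s and 40Gb/s ceilings argued for in the text.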

3 Measurement-based Analysis

We now turn to experimentation. We first describe our experimental setup and then present the packet-forwarding rates achieved by unmodified software and hardware.

Experimental setup: For our experiments, we use a mid-level server running SMP Click [18]. Our server is a dual-socket machine with 1.6GHz quad-core CPUs, a 4MB L2 cache, two 1.066GHz FSBs (one to each socket), and 8GB of DDR2-667 SDRAM. With the exception of the CPU speeds, these ratings are similar to those of the shared-bus architecture of Figure 1 and, hence, our results should be comparable. The machine has a total of sixteen 1GigE NICs. To source and sink traffic, we use two additional, similar servers with 8 GigE NICs each; each is connected to 8 of the 16 NICs on our test machine. We instrument our servers with Intel EMON, a performance-monitoring tool similar to Intel VTune, as well as a chipset-specific tool that allows us to monitor memory-bus usage.³

The forwarding rates achieved will depend on the nature of the traffic workload. To a first approximation, this workload can be characterized by: (1) the incoming packet arrival rate r, measured in packets/sec; (2) the size of packets, P, measured in bytes (hence the incoming rate R = r x P); and (3) the processing per packet. We focus on evaluating the fundamental capability of the system to move packets through it and, hence, start by considering only the first two factors (packet rate and size), without any sophisticated packet processing. We therefore remove the IP-routing components from our Click configuration and implement only simple forwarding that enforces a route between source and destination NICs; i.e., packets arriving on NIC #0 are sent to NIC #1, NIC #2 to NIC #3, and so on. We have 16 NICs and, hence, use 8 kernel threads, each pinned to one core and each in charge of one input/output NIC pair. In the results that follow, where the input rate to the system is under 8Gbps, we use one of our traffic-generation servers as the source and the other as the sink; for tests that require higher traffic rates, each server acts as both source and sink, allowing us to generate input traffic up to 16Gbps.

³Although our tools are proprietary, many of the measures they report are derived from public performance counters and, in those cases, our tests are reproducible. In an extended technical report, we will present in detail how our measures can, where possible, be derived from the public performance counters available on x86 processors.

Measured performance: We start by looking at the loss-free forwarding rate the server can sustain (i.e., without dropping packets) under increasing input packet rates and for various packet sizes. We plot the sustained rate in terms of both bits per second (bps) and packets per second (pps) in Figures 3 and 4 respectively.

Figure 3: Forwarding rate under increasing load for different packet sizes.

We see that, for larger packet sizes (1024 bytes and higher), the server scales to 14.9Gbps and keeps up with the offered load up to the maximum traffic we can generate given the number of slots on the server; i.e., packet forwarding is not limited by any bottleneck inside the server. However, for 64-byte packets, performance saturates at around 3.4Gbps, or 6.4 million pps. As Figure 4 suggests, the server is troubled by the high input packet rate (pps) rather than the bit rate (bps). Note that 64-byte packets represent the worst-case traffic scenario; though unlikely in reality, this case plays an important role, as it is the reference benchmark used by network equipment manufacturers.

Relative to the back-of-the-envelope estimates of the previous section, we can conclude that, while our server approaches the estimated rates for larger packet sizes, for small packets the achievable rates are well below our estimates. At a high level, our reasoning could have been wildly off-target for two reasons: (1) in assuming that the nominal/advertised rates for each system component (PCIe, memory, FSB) are attainable in practice, and/or (2) in our estimates of the per-packet overheads (4x, 2x, etc.). In what follows, we look into both possibilities. In Section 3.1 we attempt to track down the bottleneck(s) that limit the forwarding rate for small packets and, in so doing, estimate attainable performance limits for the different system components. In Section 3.2 we take a closer look at the per-packet overheads and attempt to deconstruct them into their component causes.
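The pps-versus-bps distinction can be made concrete with a two-line helper (ours, for illustration; Ethernet preamble and inter-frame gap are ignored, so the figures are slightly optimistic relative to measured packet rates):

```python
# Packet rate implied by a given bit rate and packet size, ignoring
# per-frame framing overhead for simplicity.

def packets_per_second(line_rate_bps, pkt_bytes):
    return line_rate_bps / (pkt_bytes * 8)

# At the same 3.4Gbps bit rate, 64-byte packets force the server to handle
# 16x more packets (and hence 16x more per-packet work) than 1024-byte ones.
print(packets_per_second(3.4e9, 64))
print(packets_per_second(3.4e9, 1024))
```

This is why the 64-byte case saturates at a bit rate far below the large-packet case: the bottleneck is per-packet work, not bytes moved.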

Figure 4: Forwarding rate under increasing load for different packet sizes in pps.

Figure 5: Bus bandwidths for 64 byte packets.



3.1 Bottleneck Analysis

We look for the bottleneck through a process of elimination, starting with the four major system components discussed earlier (the CPUs and the three system buses) and drilling deeper as and when it appears warranted.

CPU: The CPUs are plausible candidates, since CPU processing depends on the incoming packet rate, and performance saturates as soon as we reach a specific packet rate (the same for 64-byte and 128-byte packets, as shown in Figure 4). Note that the traditional metric of CPU utilization reveals little here, because Click operates in a pure polling mode, where the CPUs are always 100% utilized. Instead, we look at the number of empty polls, i.e., the number of times the CPU polls for packets to process but none are available in memory. Our measurements reveal that, even at the saturation rate (3.4Gbps for 64-byte packets), we still see a non-trivial number of empty polls: approximately 62,000 per second for each core. Hence, we eliminate CPU processing as a candidate bottleneck.

System buses: Our tools allow us to directly measure the load in bits/sec on the FSB and memory bus; the difference between the two gives us an estimate of the PCIe load. Note that this is not always a good estimate, since FSB bandwidth can be consumed by inter-socket communication, which does not appear on the memory bus; however, it does make sense in our particular setup (with each input/output port pair consistently served by the same socket), which yields little inter-socket communication. Figures 5 and 6 plot the load on each of the FSB, memory, and PCIe buses for 64-byte and 1024-byte packets under increasing input rates. We see that, for any particular line rate, the load on all three

buses is always higher with 64-byte packets than with 1024-byte ones. Hence, any of the buses could be the bottleneck, and we proceed to examine each one more closely.

Figure 6: Bus bandwidths for 1024 byte packets.

FSB: Under the covers, the FSB consists of separate data and address buses, and our tools allow us to measure the utilization of each separately. The results are shown in Figures 7 and 8: while it is clear that the data bus is under-utilized, it is not immediately obvious whether this is also the case for the address bus. To gauge the maximum attainable utilization on each bus, we wrote a simple benchmark program (the "stream benchmark" from now on) that creates and writes to a very large array. This benchmark consumes 50Gbps of FSB bandwidth, which translates into 37% data-bus utilization and 74% address-bus utilization. These numbers are well above the utilization levels of our packet-forwarding workload, which means that the latter does not saturate the FSB. Hence, we conclude that the FSB is not the bottleneck.

Figure 7: FSB data and address bus utilization for 64 byte packets.

Figure 8: FSB data and address bus utilization for 1024 byte packets.

PCIe: Unlike the FSB, where all operations are in fixed-size units (64 bytes, a cache line), the PCIe bus supports variable-length transfers; hence, if the PCIe bus is the bottleneck, this could be due either to the incoming bit rate or to the requested operation rate (which depends on the incoming packet rate). To test the former, we simply look at the maximum bit rate the PCIe bus has sustained: from Figures 5 and 6, we see that the maximum PCIe load for 1024-byte packets exceeds the PCIe load recorded at saturation for 64-byte packets. Hence, the PCIe bit rate is not the problem. To test whether the PCIe operation rate is the bottleneck, we measure the maximum packet rate that can be sustained by each individual PCIe lane. Our rationale is the following: given that PCIe lanes are independent of each other, if we can successfully drive the packet rate on a single lane beyond the per-lane rate recorded at saturation, then we will have shown that our packet-forwarding workload does not saturate the PCIe lanes and, hence, packet rate is not the problem. To this end, we start with a single pair of input/output ports pinned to a single core and gradually add ports and cores. The results are shown in Figure 9, where we plot both the sustained forwarding rate and the PCIe load: we already know that, for 64-byte packets at saturation, each input/output port pair sustains approximately 0.4Gbps (Figure 3); from Figure 9, we see that each individual port pair (and, hence, the corresponding PCIe lanes) can go well beyond that rate (approx. 0.75Gbps). Hence, we conclude that the PCIe bus is not the bottleneck either.

Figure 9: Forwarding rate (top) and PCIe load (bottom) for 64 byte packets as a function of the number of cores.

Memory: This leaves us with the memory bus as the only potential culprit. To estimate the maximum attainable memory bandwidth, we use the stream benchmark described above, which consumes 51Gbps of memory-bus bandwidth. This is about 35% higher than the 33Gbps maximum consumed by our 64-byte packet-forwarding workload, surprisingly suggesting that aggregate memory bandwidth is not the bottleneck either.⁴ This would seem to return us to square one. However, memory-system performance is notoriously sensitive to details like access patterns and load balancing; hence, we look further into these details.

⁴Note that even 51Gbps is fairly low relative to the nominal rating of 100Gbps we used in estimating upper bounds. It turns out this limit is due to saturation of the address bus: recall that address-bus utilization is 74% for the stream test, and prior work [24] and discussions with architects reveal that an address bus is regarded as saturated at approximately 75% utilization. This is in keeping with the general perception that, in a shared-bus architecture, the vast majority of applications are bottlenecked on the FSB.

We consider two potential reasons why our packet-forwarding workload might reduce memory-system efficiency relative to the stream benchmark. The first is the fact that the sequence of memory locations accessed by our workload is highly irregular, as opposed to the nicely in-sequence access pattern of the stream benchmark. To assess the impact of irregular accesses, we rerun the stream benchmark but, instead of writing to each array entry in sequence, we write to random locations. This modification does cause a drop in memory bandwidth, but the drop is modest (from 51Gbps to about 46Gbps), indicating that irregular accesses are not the problem. The second reason is sub-optimal use of the physical memory space. The memory system is internally organized as multiple memory channels (or branches), each of which is organized as a grid of ranks and banks. In particular, the 8GB memory on our machine consists of two memory channels, each comprising a grid of 4 banks x 4 ranks; our tools report the memory traffic to different rank/bank pairs aggregated across memory channels, i.e., the memory traffic we report for (say) the pair {bank 1, rank 3} is the sum of the traffic seen on {bank 1, rank 3} for each of the two memory channels. Figure 10 shows the distribution of memory traffic over the various ranks and banks for three workloads: (1) 64-byte packets at the saturation rate of 3.4Gbps, (2) 1024-byte packets at 15.2Gbps, and (3) the stream benchmark. Notice that, while memory traffic is perfectly balanced for the stream benchmark (and reasonably balanced for the 1024-byte packet workload), for the 64-byte packet workload it is all concentrated on two rank/bank elements (in reality, we see one overloaded rank/bank pair on each channel; the figure shows the aggregate load over the two channels).

Figure 10: Memory load distribution across banks and ranks. Left: 64 byte packets. Middle: 1024 byte packets. Right: the stream benchmark.
This result suggests that the bottleneck is not the aggregate memory bandwidth, but the bandwidth to the individual rank/bank elements that, for some reason, end up carrying most of the 64-byte packet workload. To verify this, we measure the maximum attainable bandwidth to a single rank/bank pair; we do this through a simple test that creates multiple threads, all of which continuously read and write a single location in memory. The result is 7.2Gbps of memory traffic (all on a single rank/bank pair), which is almost equal to the maximum per-rank/bank load recorded at saturation for the 64-byte packet workload. We should note that both the CPUs and the FSB are under-utilized during this memory test. Hence, we conclude that the bottleneck is the memory system: not because it lacks the necessary capacity, but because of the imbalance in accessed memory locations.

We now look into why this imbalance takes place. We see from Figure 10 that, for 1024-byte packets, the memory load is much better distributed than for 64-byte packets. This leads us to suspect that the imbalance is related to the manner in which packets are laid out onto the rank/bank grid. We test this with an experiment in which we maintain a fixed packet rate (400,000 packets/sec) and measure the resulting memory-load distribution for different packet sizes (64 to 1500 bytes). Figure 11 shows the outcome: ignoring for the moment the load on {bank 2, rank 3}, we observe that, as the packet size increases, the additional memory load is distributed over increasing numbers of rank/bank pairs and that, within a single memory channel, this spilling over to additional ranks and banks happens at the granularity of 64 bytes; for example, for 256-byte packets we see increased load on 3 rank/bank pairs, for 512-byte packets on 4 rank/bank pairs, and so forth.
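The spill-over behavior can be mimicked with a toy interleaving model. The 4x4 rank/bank grid, the two channels, and the 64-byte granularity come from the measurements above; the exact address-to-cell mapping is our assumption (the real controller's mapping function is hardware-specific), so the model only reproduces the effect qualitatively:

```python
# Toy model: physical addresses interleave across 2 channels x 4 banks x
# 4 ranks at 64-byte granularity (the bit layout is assumed, not measured).
CHANNELS, BANKS, RANKS, GRAIN = 2, 4, 4, 64
CELLS = CHANNELS * BANKS * RANKS  # 32 cells of 64 bytes each

def cell(addr):
    # Consecutive 64-byte chunks land on consecutive cells, wrapping around.
    return (addr // GRAIN) % CELLS

def cells_touched(buf_size, pkt_size, n_pkts=64):
    # Packets sit at the start of fixed-size buffers, so only the first
    # pkt_size bytes of every buf_size-byte buffer are ever accessed.
    touched = set()
    for p in range(n_pkts):
        base = p * buf_size
        for off in range(0, pkt_size, GRAIN):
            touched.add(cell(base + off))
    return touched

print(len(cells_touched(2048, 64)))    # 64B packets in 2KB buffers: 1 hot cell
print(len(cells_touched(1024, 64)))    # halving the buffer doubles the spread
print(len(cells_touched(2048, 1024)))  # 1024B packets: load spreads to 16 cells
```

Under this assumed mapping, a 2KB buffer wraps the whole grid, so 64-byte packets always land on the same cell; halving the buffer size doubles the number of loaded cells, which is the same doubling effect the 1KB-buffer experiment below exhibits.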
Moreover, we observe that this growth starts from the low-ordered banks, i.e., bank 1 is loaded first, then bank 2, and so on.⁵ These observations lead us to the following theory: the default packet-buffer size in Linux is 2KB; each such
⁵Regarding the high load on {bank 2, rank 3}: we suspect it is caused by the large number of empty polls we see at the low packet rate of this test, and that the location corresponds to the memory-mapped I/O register being polled. We find that the load on this rank/bank pair drops with increasing packet rates, further supporting this conjecture.

64 bytes
Gbps 4 2 0 12 bank 34 3 4 1 2 rank 4 2 0 12

128 bytes
4 2 0 34 3 4 1 2 12

256 bytes

34

3 4 1 2

512 bytes
4 2 0 12 34 3 4 1 2 4 2 0

1024 bytes
4 2

1500 bytes

Original Click w/2048 byte buffers


10 Gbps 10 Gbps 5 0 1 2 1 3 4 1 2 3 4

Modified Click w/1024 byte buffers

0 12 34 3 4 1 2 12 34 3 4 1 2

5 0

Figure 11: Memory load distribution across banks and ranks for different packets sizes and a xed packet rate. buffer spans the entire rank/bank grid, which would allow high memory throughput if we were using the entire 2KB allocation. However, our 64-byte packet workload ends up using only one of the rank/bank pairs on each of the memory channels, leading to the two spikes we see in Figure 10. To test this theory, we repeat our earlier experiment with 64-byte packets from Figure 10, but now change the default buffer allocation size to 1KB. If our theory is right and a 2KB address space spans the entire grid, then 1KB should span half the grid and, hence, the two spikes in Figure 10 should split into 4 spikes. Figure 12 shows that this is indeed the case. Unfortunately (for some reason we do not fully understand as yet), we have not been able to allocate yet smaller buffer sizes (e.g., 128B), due to the need for the device driver to accomodate for additional data structures, and hence we do not experiment with even smaller allocations. Nonetheless, our experiment with 1024-byte buffers clearly shows the cause (and potential to remedy) the problem of skewed memory load. As we discuss in Section 5, we believe this issue could be xed in a general manner through the use of a modied memory allocator that allows for variable-size buffer allocations. Finally, if our conjecture that this imbalance was the performance bottleneck was right then reducing the imbalance should translate to higher packet-forwarding rates. Happily, using 1024B buffers we do see a 29.5% increase in forwarding rate from 3.4Gbps to 4.4Gbps; Figure 13 shows this improvement in terms of the packet rate (from 6.4 to 8.2 Mpps). Summary of bottleneck analysis The presented experiments showed what rates are achievable on each system component for hand-crafted workloads like our stream benchmark. We use these rates as re-calibrated 8

Figure 12: Memory load distribution across banks and ranks for 64B packets and two different sizes of packet buffers.
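The splitting of the spikes can be reproduced with a toy interleaving model. A sketch, assuming an 8-pair rank/bank grid per channel with 256B interleave granularity so that one 2KB buffer spans the whole grid; the real address-to-bank mapping is chipset-specific:

```python
# Toy model of DRAM rank/bank interleaving (assumed: 8 rank/bank pairs
# per channel, 256B interleave, so a 2KB buffer spans the whole grid).
PAIRS = 8
INTERLEAVE = 256  # bytes

def pair_of(addr):
    return (addr // INTERLEAVE) % PAIRS

def pairs_touched(buffer_size, num_buffers=64):
    # A small packet occupies only the first bytes of its buffer, so
    # with 64B packets only each buffer's starting pair gets accessed.
    return {pair_of(i * buffer_size) for i in range(num_buffers)}

print(pairs_touched(2048))  # 2KB buffers: all traffic hits one pair
print(pairs_touched(1024))  # 1KB buffers: two pairs, halving the skew
```

Under this model, halving the buffer size doubles the number of rank/bank pairs touched per channel, which is exactly the spike-splitting behavior the figure shows.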

Figure 13: Before-and-after forwarding rates for 64B packets and two different sizes of packet buffers.

upper bounds on the performance of each component and compare them to the corresponding rates measured for the 64-byte packet workload at saturation. To quantify our observations, we define, for each component, the room for growth as the percentage increase in usage that could be accommodated on the component before we hit the upper bound. For example, for the stream benchmark, we measured 51Gbps of maximum aggregate memory bandwidth; for our 64-byte packet workload, at saturation, we measured 33Gbps of aggregate memory bandwidth; thus, ignoring other bottlenecks (such as the per rank/bank load), there is room to increase memory-bus usage by about 54% ((51 - 33)/33) before hitting the 51Gbps upper bound. Table 1 summarizes our results. We see that, if we can eliminate the problem of poor memory allocation (we discuss potential solutions in Section 5), then there is room for a fairly substantial improvement in the minimum-sized packet forwarding rate: approximately 50%. The next section looks for additional sources of inefficiency, this time due to software overheads.

    system component    attainable limit (Gbps)     load w/ 64B router (Gbps)   percentage room-to-grow
    1 rank-bank         7.2 (from volatile-int)     7.168                       0
    FSB address bus     74 (from stream)            50                          48
    aggregate memory    51 (from stream)            33                          54
    PCIe                36 (from 1KB pkt tests)     20                          80
    FSB data bus        37 (from stream)            9                           311

Table 1: Room for growth on each of the system components, computed as the percentage increase in measured usage for the 64B packet-forwarding workload that can be accommodated before we hit the achievable performance limits obtained from specially crafted benchmark tests.
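The room-to-grow figures in Table 1 follow mechanically from the measured and attainable rates; a sketch of the computation:

```python
# Room-to-grow = how much the measured load on each bus could still
# rise before hitting the attainable limit, per Table 1 (Gbps).
attainable = {"FSB address bus": 74, "aggregate memory": 51,
              "PCIe": 36, "FSB data bus": 37}  # from benchmarks
measured = {"FSB address bus": 50, "aggregate memory": 33,
            "PCIe": 20, "FSB data bus": 9}     # 64B workload, saturated

for bus, limit in attainable.items():
    load = measured[bus]
    room = 100.0 * (limit - load) / load
    print(f"{bus}: {room:.0f}% room to grow")
```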

3.2 Overhead Analysis

The previous section treated the system as a black box, measuring the load on each bus but making no attempt to justify it. We now try to deconstruct the measured load into its components, as a way to assess the packet-forwarding efficiency of our system.

First, we adjust the back-of-the-envelope analysis of Section 2 to our particular experimental setup and use it to estimate the expected load on each bus. In Section 2, we argued that an incoming traffic rate of R bps should roughly lead to loads of 2R, 2R, and 4R on the FSB, PCIe, and memory bus, respectively. These numbers were based on two assumptions: first, that bus loads are only due to moving packets around; second, that the CPU reads and updates each incoming packet, thus contributing to FSB and memory-bus load. The second assumption does not hold in our experiments, because we use static routing, where the CPU does not even need to read packet headers to determine the output port. Hence, with the optimistic reasoning of Section 2, in our experiments, an incoming traffic rate of R bps should roughly result in loads of 0, 2R, and 2R on the FSB, PCIe, and memory bus, all of them due to moving packets from NIC to memory and back.

Not surprisingly, these estimates are below the loads that we actually measure, indicating that, beyond moving packets around, all three buses incur an extra per-packet overhead. We quantify this overhead as the number of extra per-packet transactions (i.e., transactions that are not due to moving packets between NIC and memory) performed on each bus. We compute it as follows:

    overhead = (measured load - estimated load) / (packet rate × transaction size)

Figure 14: Memory and FSB per-packet overhead.

Figure 14 plots this number for the FSB and memory bus as a function of the packet rate and size; the PCIe overhead is simply the difference between the other two. The FSB and PCIe overheads start around 6, while the memory-bus overhead starts around 12; all overheads slightly drop as the packet rate increases. It turns out that these overheads make sense once we consider the transactions for book-keeping socket-buffer descriptors. For each packet transfer from NIC to memory, there are three such transactions on each of the FSB and PCIe bus: the NIC updates the corresponding socket-buffer descriptor, as well as the descriptor ring (two PCIe and memory-bus transactions); the CPU reads the updated descriptor, writes a new (empty) descriptor to memory, and updates the descriptor ring accordingly (three FSB and memory-bus transactions); finally, the NIC reads the new descriptor (one PCIe and memory-bus transaction). Each packet transfer from memory to NIC involves similar transactions; hence, descriptor book-keeping accounts for the 6 extra per-packet transactions we measure on the FSB and PCIe bus and, consequently, for the 12 extra transactions measured on the memory bus. The slight overhead drop as the packet rate increases is due to the cache optimization that transfers multiple (up to four) descriptors with each 64-byte transaction (each descriptor is 16 bytes long); this optimization kicks in more often at higher packet rates.

We should note that these extra per-packet transactions translate into surprisingly high traffic overheads, especially for small packets: for 1024-byte packets, 12 per-packet transactions on the memory bus translate into 37.5% traffic overhead; for 64-byte packets, this number becomes 600%. As we discuss in Section 5, these overheads can be reduced by amortizing descriptors over multiple packets whenever possible (similar techniques are already common in high-speed capture cards).
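The overhead computation of Section 3.2 can be sketched numerically; the 1 Mpps workload below is an illustrative stand-in for a bus-analyzer measurement, not a value from our experiments:

```python
# Extra per-packet bus transactions, computed as in Section 3.2:
#   (measured load - estimated load) / (packet rate x transaction size)
TXN_BITS = 64 * 8  # one 64-byte bus transaction, in bits

def overhead_ratio(measured_bps, estimated_bps, pkt_rate):
    return (measured_bps - estimated_bps) / (pkt_rate * TXN_BITS)

# Descriptor book-keeping predicts 12 extra memory-bus transactions per
# packet; check with a hypothetical 1 Mpps stream of 64-byte packets:
pkt_rate = 1e6                                   # packets/sec (assumed)
estimated = 2 * (pkt_rate * 64 * 8)              # data movement alone
measured = estimated + 12 * pkt_rate * TXN_BITS  # + descriptor traffic
print(overhead_ratio(measured, estimated, pkt_rate))  # 12.0

# The same 12 transactions as a fraction of useful memory traffic:
for pkt_bytes in (1024, 64):
    print(pkt_bytes, 100 * 12 * 64 / (2 * pkt_bytes))  # 37.5% and 600%
```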

4 Inferring Server Potential

We now apply our findings from the last two sections to answer the following questions: (i) given our analysis of current-server packet-forwarding performance, what can we improve, and what levels of performance are attainable? (ii) what packet-forwarding performance should we expect from next-generation servers? We answer these through extrapolative analysis and leave validation to future work.

4.1 Shared-bus Architectures

In Section 3.1, we saw that the first packet-forwarding bottleneck arises from inefficient use of the memory system, in particular, the imbalanced layout of packets across memory ranks and banks. The question is, how much could we improve performance by fixing this imbalance? According to our overhead analysis (Section 3.2), per-packet overhead on the memory bus does not increase with packet rate; hence, if we eliminated the problematic packet layout, we could increase our forwarding rate until we hit the next bottleneck. According to our bottleneck analysis (Section 3.1), that is the FSB address bus, and we could increase our forwarding rate by 50% before hitting it. Hence, we argue that eliminating the problematic layout could increase our minimum-size-packet forwarding rate by 50%, i.e., from 3.4Gbps to approximately 5.1Gbps.

A second area for improvement, identified in Section 3.2, is the use of socket-buffer descriptors and the chatty manner in which these are maintained. We now estimate how much we could improve performance by simply amortizing descriptor transfer across multiple packet transfers. We start by considering the memory bus. From Section 3.2, we can approximate the load on the memory bus as 2 × bit rate + 10 × packet rate × transaction size. Were we to transfer, say, 10 descriptors with a single transaction, that would immediately reduce the memory-bus load to 2 × bit rate + packet rate × transaction size; for 64-byte packets and 64-byte transactions, this corresponds to a factor-of-4 reduction. Applying a similar line of reasoning to the FSB and PCIe bus, we can show that, for 64-byte packets, descriptor amortization stands to reduce the load on each bus by factors of 10 and 2.5, respectively. Recall from Table 1 that we had 0%, 50%, and 80% room for growth on the memory, FSB, and PCIe buses; hence, the load on each of these buses could grow by factors of 4 (4 × 1.0), 15 (10 × 1.5), and 4.5 (2.5 × 1.8), respectively. Since the maximum improvement that can be accommodated on all buses is a factor of 4, we argue that reducing descriptor-related overheads could improve our minimum-size-packet forwarding rate from 3.4Gbps to 13.6Gbps. Finally, combining both optimizations should allow us to climb still higher, to approximately 20Gbps, though, of course, the limited number of network slots on our machine would limit us to 16Gbps.
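The amortization estimate can be reproduced numerically; this is a sketch of our back-of-the-envelope model (unit packet rate, 64-byte packets and transactions), not a measurement:

```python
# Memory-bus load model from Section 3.2: 2 x bit_rate + 10 x pkt_rate x txn.
# Amortizing 10 descriptors into one transaction turns the 10 into a 1.
TXN = 64 * 8                  # 64-byte transactions, in bits
pkt_rate = 1.0                # arbitrary unit rate
bit_rate = pkt_rate * 64 * 8  # 64-byte packets

before = 2 * bit_rate + 10 * pkt_rate * TXN
after = 2 * bit_rate + 1 * pkt_rate * TXN
print(before / after)  # 4.0, the factor-of-4 reduction

# Combining the per-bus load reductions with Table 1's room-to-grow,
# the most constrained bus bounds the overall speedup:
growth = {"memory": 4 * 1.0, "FSB": 10 * 1.5, "PCIe": 2.5 * 1.8}
speedup = min(growth.values())  # 4x, limited by the memory bus
print(speedup * 3.4)  # 4 x 3.4Gbps = 13.6Gbps
```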
While the above is an extrapolation (albeit one derived from empirical observations), it nonetheless points to the tremendous untapped potential of current servers. Even if our estimates are off by a factor of two, it still seems possible that current servers can achieve forwarding rates of 10Gbps, a number currently considered the realm of specialized (and expensive) equipment. We close by noting that the suggested fixes involve only modest operating-system rearchitecting.

4.2 Point-to-point Architectures

We now apply the results of our measurement study to estimate the packet forwarding rates that might be achievable with point-to-point (p2p) server architectures, as introduced in Section 2 (Figure 2). At times, we make assumptions where the necessary details of p2p architectures aren't yet known; in such cases, we explicitly note our assumptions as such.

In a p2p architecture, the role of the FSB (that of carrying traffic between sockets and between CPUs and their non-local memory) is played by point-to-point links such as Intel's QuickPath [5]. In our analysis, we assume our findings from the FSB apply to these inter-socket links. Specifically, we assume that the 50% room-to-grow that we measured on the FSB applies to these inter-socket links. If anything, this seems like a wildly conservative assumption, for two reasons: (1) the nominal speed of these inter-socket links is 200Gbps [2], compared to 85Gbps for current FSBs, and (2) the operations seen on the single FSB are now spread across six inter-socket links.

We compute the expected performance for the p2p architecture by considering the different factors that will offer a performance improvement relative to the shared-bus server we've studied so far. These factors are:

(1) Reduced per-bus overheads. This improvement results simply from the transition from a shared-bus to a point-to-point architecture, as discussed in Section 2. These overheads and the corresponding reduction are summarized in the first three columns of Table 2 (see footnote 6).

(2) Room-to-grow. As before, this records the capacity for growth on each bus. For this, we use our findings from Section 3.1.

(3) Technology improvements. This accounts for the standard technology improvements expected in this next generation of servers. We assume a 2x improvement on the FSB and PCIe buses by observing that (for example) the Intel QuickPath inter-socket links for use in the Nehalem server line support speeds that are over 2x faster than current FSBs. Likewise, PCIe 2.0 runs 2x faster than current PCIe 1.1, and the recently announced PCIe 3.0 is to run at 2x the speed of PCIe 2.0 [20] (our test server uses PCIe 1.1).
We conservatively assume that memory technology will not improve. Table 2 summarizes these performance factors and computes the combined performance improvement that we can expect on each system component. As we see, the overall performance improvement is still limited by memory, both because we are assuming the rank-bank imbalance problem remains and because memory technology improves more slowly. Despite this, we are left with a 4x improvement, suggesting that a next-generation p2p server running unmodified Linux+Click will scale to approximately 13.6Gbps. The additional use of the optimizations described above could further improve performance to potentially exceed 40Gbps. Figure 15 summarizes the various forwarding rates for the different architectures and optimizations considered.

Figure 15: Forwarding rates for shared-bus and p2p server architectures with and without different optimizations.

In summary, current shared-bus servers scale to (min-sized) packet forwarding rates of 3.4Gbps, and we estimate future p2p servers will scale to 10Gbps. Moreover, our analysis suggests that modifications to eliminate key bottlenecks and overheads stand to improve these rates to over 10Gbps and 40Gbps, respectively.

Footnote 6: Note that, while Section 3.2 revealed that the overheads we see in practice are far higher than those from our analysis, we are assuming that the relative reduction across architectures will still hold. This appears reasonable, since this reduction is entirely due to the offered load being split across more system components: 6 vs. 1 inter-socket buses, 4 vs. 1 memory buses, and 4 vs. 1 PCIe buses.
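The per-bus arithmetic behind Table 2 multiplies out as follows (factors copied from the table; the minimum bounds the system-wide gain):

```python
# Overall gain per bus = (reduced-overhead gain) x (room-to-grow)
# x (technology trend), as in Table 2; the smallest per-bus gain
# bounds the whole-system improvement.
factors = {
    "memory":  (4, 1.0, 1.0),
    "FSB/CSI": (3, 1.5, 2.0),
    "PCIe":    (4, 1.8, 2.0),
}
gains = {bus: a * b * c for bus, (a, b, c) in factors.items()}
print(gains)                      # memory 4x, FSB/CSI 9x, PCIe 14.4x
print(min(gains.values()) * 3.4)  # 4 x 3.4Gbps = 13.6Gbps
```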

5 Recommendations and Discussion

5.1 Eliminating the Bottlenecks


    system bus   shared-bus overheads (Sec. 2)   p2p overheads (Sec. 2)   gain from reduced overheads   room-to-grow (Table 1)   gain w/ tech trends   overall gain
    memory       4R                              R                        4x                            1.0x                     1.0x                  4x
    FSB/CSI      2R                              2R/3                     3x                            1.5x                     2x                    9x
    PCIe         2R                              R/2                      4x                            1.8x                     2x                    14.4x

Table 2: Computing the performance improvement with a p2p server architecture. R denotes the line rate in bits/second.

We believe the bottlenecks and overheads identified in the previous sections can be addressed through relatively modest changes to operating systems and NIC firmware. Unfortunately, the need to modify NIC firmware makes it difficult to experiment with these changes. We describe these modifications at a high level and note that they do not impact the programmability of the system.

Improved memory allocators. Recall that our results in Section 3.1 suggest that the imbalance in memory accesses with regard to (skb) packet buffers in the kernel occurs because the kernel allocates all packet buffers to be a single size, with a default of 2KB. This problem can be addressed by simply creating packet buffers of various sizes (e.g., 64B, 256B, 1024B, and 2048B) and allocating a packet to the buffer appropriate for its size. This can be implemented by creating multiple descriptor rings, one for each buffer size; on receiving an incoming packet, the NIC simply uses the descriptor ring appropriate to the size of the received packet. While more wasteful of system memory, this isn't an issue, since the memory requirements for a router workload are a small fraction of the available server memory. This approach is in fact inspired by similar approaches in hardware routers that pre-divide memory space into separate regions for use by packets of different sizes [13]. The imbalance due to packet descriptors can likewise be tackled by arranging for packet descriptors to consume a greater portion of the memory space, for example, by using larger descriptor rings and/or multiple descriptor rings. Conveniently, however, the use of amortized packet descriptors, as described below, would also greatly reduce the descriptor-related traffic to memory; hence, implementing amortized descriptors might suffice to reduce this problem.

Amortizing packet descriptors. Section 3.2 reveals that handling packet descriptors imposes an inordinate per-packet overhead, particularly for small packet sizes. As alluded to earlier, a simple strategy is to have a single descriptor summarize multiple (up to a parameter k) packets. This amortization is similar to what is already implemented on capture cards designed for specialized monitoring equipment. Such amortization is easily accommodated for k smaller than the number of packets that fit in the packet-buffer memory already on the NIC. Since we imagine that k can be a fairly small number (around 10), and since current NICs already have buffer capacity for a fair number of packets (e.g., our cards have room for 64 full-sized packets), such amortization should not increase the storage requirements on NICs. Amortization can, however, impose increased delay. This can be controlled by having a timeout that regulates the maximum time period the NIC can wait to transfer packets. Setting this timeout to a small multiple (e.g., 2k times) of the reception time for small packets should be an acceptable delay penalty.
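A minimal sketch of the size-class idea: hypothetical driver logic that picks a descriptor ring by packet size (the ring sizes and selection rule are our illustrative assumptions, not an actual NIC interface):

```python
# Hypothetical receive-side logic: one descriptor ring per buffer size
# class; an incoming packet goes to the smallest buffer that fits it,
# instead of a one-size-fits-all 2KB skb.
RING_SIZES = (64, 256, 1024, 2048)  # bytes, one ring per class (assumed)

def pick_ring(pkt_len):
    for size in RING_SIZES:
        if pkt_len <= size:
            return size
    raise ValueError("packet larger than the largest buffer class")

print(pick_ring(64))    # a 64B packet lands in a 64B buffer, not 2KB
print(pick_ring(1500))  # a full-sized frame still gets a 2KB buffer
```

Because buffers of different sizes start at different address strides, small packets no longer all begin at 2KB-aligned addresses, spreading their accesses across more rank/bank pairs.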

5.2 Discussion
When we set out to study the forwarding performance of commodity servers, we already expected the memory system to be the bottleneck; the fact, however, that the bottleneck was due to an unfortunate combination of packet layout and memory-chip organization came as a surprise. While trying to figure this out, we looked at how the kernel allocates memory for its structures; not surprisingly, it favors adjacent memory addresses to leverage caching. However, given that the kernel uses physical addresses, nearby addresses often correspond to physically nearby locations that fall on the same memory rank and bank. As a result, workloads that cannot benefit from caching may end up hitting the same memory rank/bank pairs and, hence, be unable to benefit from aggregate memory bandwidth either. In short, when combined with an unfortunate data layout, locality can hurt rather than help.

Another surprise was the lack of literature on the behavior and performance of system components outside the CPUs. Increasing processor speeds and the rise of multi-processor systems mean that, from now on, processing data is less likely to be the bottleneck than moving it around between CPUs and other I/O devices. Hence, it is important to be able to measure and understand system performance beyond the CPUs.

Finally, we were surprised by the lack of efficiency in moving data between system components. In many cases, data is unnecessarily transferred to memory (contributing to memory-bus load) when it could be transferred directly from the NIC to the appropriate CPU cache. Packet forwarding and processing workloads would benefit significantly from techniques along the lines of Direct Cache Access (DCA), where the memory controller directly places incoming packets into the right CPU cache by snooping the DMA transfer from NIC to memory [17].

6 Related and Future Work

The idea of a software router based on general-purpose hardware and operating systems is not a new one. In fact, the 13 NSFNET NSSs (nodal switching subsystems) each included 9 systems running Berkeley UNIX, interconnected with a 4Mb/s IBM token ring. Click [18] and Scout [22] explored the question of how to architect router software for improved programmability and extensibility; SMP Click [12] extends the early Click architecture to better exploit the performance potential of multiprocessor PCs. These efforts focused primarily on designing the software architecture for packet processing and, while they do report on the performance of their systems, they do so at a fairly high level, using purely black-box evaluation. By contrast, our work assumes Click's software architecture but delves under the covers (of both hardware and software) to understand why performance is limited and how these limitations carry over to future server architectures.

As a slight digression: it is somewhat interesting to note the role of time in the performance of these (fairly similar) software routers. The early NSF nodes achieved forwarding rates of 1K packets/sec (circa 1986); Click (at SOSP'99) reported a maximum forwarding rate of 330Kpps, which SMP Click improves to 494Kpps (2001); we find that unmodified Click achieves about 6.5Mpps. This is of course somewhat anecdotal, since we are not necessarily comparing the same Click configurations, but it nonetheless suggests the general trajectory.

There is an extensive body of work on benchmarking various application workloads on general-purpose processors. The vast majority of this work is in the context of computation-centric workloads and benchmarks such as TPC-C. Closer to our interest in packet processing are efforts similar to those of Veal et al. [24] that look for the bottlenecks in server-like workloads that involve a fair load of TCP termination. Their analysis reveals that such workloads are bottlenecked on the FSB address bus. A similar conclusion has been arrived at for several, more traditional, workloads. (We refer the reader to [24] for additional references to the literature on such evaluations.) As our results indicate, the bottleneck to packet processing lies elsewhere.

There is similarly a large body of work on packet processing using specialized hardware (e.g., see [19] and the references therein). Most recently, Turner et al. describe a Supercharged PlanetLab Platform [23] for high-performance overlays that combines general-purpose servers with network processors (for slow- and fast-path processing, respectively); they achieve forwarding rates of up to 5Gbps for 130B packets. We focus instead on general-purpose processors, and our results suggest that these offer competitive performance.

Closest to our work is a recent independent effort by Egi et al. [15]. Motivated by the goal of building high-performance virtualized routers on commodity hardware, the authors undertake a measurement study to understand the performance limitations of modern PCs. They observe similar performance and, like us, arrive at the conclusion that something is amiss in the memory system. Through inference based on black-box testing, the authors suggest that non-contiguous memory writes initiated by the PCIe controller are the likely culprit. Our access to chipset tools allows us to probe the internals of the memory system, and our findings there lead us to a somewhat different conclusion. Finally, our work also builds on a recent position paper making the case for cluster-based software routers [11]; the paper identifies the need to scale servers to line rate but doesn't explore the issue of bottlenecks and performance in any detail.

In terms of future work, we plan to extend our work along three main directions. First, we are exploring the possibility of implementing the modified descriptor and buffer-allocator schemes described above. Second, we hope to repeat our analysis on the Nehalem server platforms once available [8]. Finally, we are currently working to build a cluster-based router prototype as described in earlier work [11] and hope to leverage our findings here to both evaluate and improve our prototype.

7 Conclusion

A long-held and widespread perception has been that general-purpose processors are incapable of high-speed packet forwarding, motivating an entire industry around the development of specialized (and often expensive) network equipment. Likewise, the barrier to scalability has been variously attributed to limitations in I/O, memory throughput, and various other factors. While these notions might each have been true at various points in time, modern PC technology evolves rapidly, and hence it is important that we calibrate our perceptions against the current state of technology. In this paper, we revisit old questions about the scalability of in-software packet processing in the context of current and emerging off-the-shelf server technology. Another, perhaps more important, contribution of our work is to offer concrete data on questions that have often been answered through anecdotal or indirect experience. Our results suggest that, particularly with a little care, modern server platforms do in fact hold the potential to scale to the high rates typically associated with specialized network equipment, and that emerging technology trends (multicore, NUMA-like memory architectures, etc.) should only further improve this scalability. We hope that our results, taken together with the growing need for more flexible network infrastructure, will spur further exploration into the role of commodity PC hardware/software in building future networks.

References

[1] Intel 10 Gigabit XF SR Server Adapters. http://www.intel.com/network/connectivity/products/10gbexfsrserveradapter.htm.
[2] Intel QuickPath Interconnect. http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect.
[3] Intel Xeon Processor 5000 Sequence. http://www.intel.com/products/processor/xeon5000.
[4] NetFPGA. http://yuba.stanford.edu/NetFPGA/.
[5] Next-Generation Intel Microarchitecture. http://www.intel.com/technology/architecture-silicon/next-gen.
[6] Vyatta: Open Source Networking. http://www.vyatta.com/products/.
[7] Cisco Opening Up IOS. http://www.networkworld.com/news/2007/121207-cisco-ios.html, Dec. 2007.
[8] Intel Demonstrates Industry's First 32nm Chip and Next-Generation Nehalem Microprocessor Architecture. Intel News Release, Sept. 2007. http://www.intel.com/pressroom/archive/releases/20070918corp_a.htm.
[9] Juniper Open IP Solution Development Program. http://www.juniper.net/company/presscenter/pr/2007/pr-071210.html, 2007.
[10] Intel Corporation's Multicore Architecture Briefing, Mar. 2008. http://www.intel.com/pressroom/archive/releases/20080317fact.htm.
[11] K. Argyraki et al. Can software routers scale? In ACM SIGCOMM Workshop on Programmable Routers for Extensible Services (PRESTO), Aug. 2008.
[12] B. Chen and R. Morris. Flexible control of parallelism in a multiprocessor PC router. In Proc. of the USENIX Annual Technical Conference, June 2001.
[13] Cisco Systems, Inc. Introduction to Cisco IOS Software. http://www.ciscopress.com/articles/.
[14] D. Comer. Network Processors. http://www.cisco.com/web/about/ac123/ac147/archived_issues/ipj_7-4/network_processors.html.
[15] N. Egi, A. Greenhalgh, M. Handley, M. Hoerdt, and L. Mathy. Towards performant virtual routers on commodity hardware. Technical Report Research Note RN/08/XX, University College London and Lancaster University, May 2008.
[16] R. Ennals, R. Sharp, and A. Mycroft. Task partitioning for multi-core network processors. In Proc. of the International Conference on Compiler Construction, 2005.
[17] R. Huggahalli, R. Iyer, and S. Tetrick. Direct Cache Access for High Bandwidth Network I/O. In Proc. of ISCA, 2005.
[18] E. Kohler, R. Morris, B. Chen, J. Jannotti, and M. F. Kaashoek. The Click modular router. ACM Transactions on Computer Systems, 18(3):263-297, Aug. 2000.
[19] J. Mudigonda, H. Vin, and S. W. Keckler. Reconciling performance and programmability in networking systems. In Proc. of SIGCOMM, 2007.
[20] PCI-SIG. PCI Express Base 2.0 Specification, 2007. http://www.pcisig.com/specifications/pciexpress/base2.
[21] ACM SIGCOMM Workshop on Programmable Routers for Extensible Services. http://www.sigcomm.org/sigcomm2008/workshops/presto/, 2008.
[22] T. Spalink, S. Karlin, L. Peterson, and Y. Gottlieb. Building a Robust Software-Based Router Using Network Processors. In Proc. of the 18th ACM SOSP, 2001.
[23] J. Turner et al. Supercharging PlanetLab: a high performance, multi-application, overlay network platform. In Proc. of SIGCOMM, 2007.
[24] B. Veal and A. Foong. Performance scalability of a multi-core web server. In Proc. of ACM ANCS, Dec. 2007.

