General Terms
Measurement, Performance, Design, Experimentation
Keywords
Network Firewall, Netfilter, Throughput, Network Processor, Prototype, Hybrid Firewall
1. INTRODUCTION
Network firewalls occupy an essential role in computer security, guarding the perimeter of LANs. The goal of any network firewall is to allow desired packets unimpeded access to the network while dropping undesirable packets. In so doing, the firewall protects the data and compute resources of the LAN it guards.
In our hybrid firewall, a network processor (NP) performs the basic packet filtering operation required for firewall operation, deciding to forward or drop each packet on its own. We select an NP for our hybrid firewall both because of its packet processing prowess and because it is programmable, allowing changes in response to changing security needs. Programmability is an important factor in our selection of an NP over custom hardware. Packets requiring additional computation, for instance packets defining a new flow, are processed by a Netfilter firewall running on Intel Xeon Processors (called microprocessors or CPUs). The Intel Xeon Processor is selected for its ability to rapidly complete the complex firewall processing required by a full-featured Netfilter firewall.

We first describe the Linux Netfilter firewall and the significant challenge posed by minimum-size packets. In order to build our hybrid firewall we constructed a library for packet communication between the NPs and the microprocessors over PCI. We describe this library, the two variations implemented, and the performance measured. The hybrid firewall itself is described next, with a focus on interfacing Netfilter to the NP-resident firewall. Performance measurements show that our hybrid firewall achieves our goal of providing line-rate firewall processing for all packets with more intensive firewall processing for a fraction of the packets. We dissect the performance achieved and present the hard lessons learned along the way. Related work and conclusions finish the paper.
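Before describing the components in detail, the per-packet division of labor between the two firewalls can be sketched conceptually as follows. The C fragment is illustrative only; the type and helper names (for example, flow_table_lookup) are placeholders and are not our microengine code.

    #include <stdbool.h>
    #include <stddef.h>

    /* Illustrative types; the real packet and flow structures are not shown here. */
    struct packet     { unsigned char data[1]; /* parsed headers, payload, ... */ };
    struct flow_entry { bool allowed; };

    /* Hypothetical lookup into the NP-resident copy of the connection table. */
    extern struct flow_entry *flow_table_lookup(const struct packet *pkt);

    enum verdict { NP_FORWARD, NP_DROP, NP_TO_CPU };

    /* Per-packet decision on the NP: packets belonging to known flows are
     * forwarded or dropped at line rate; packets that need more work, such as
     * those defining a new flow, are escalated to Netfilter over PCI. */
    static enum verdict np_filter(const struct packet *pkt)
    {
        struct flow_entry *flow = flow_table_lookup(pkt);

        if (flow == NULL)
            return NP_TO_CPU;                       /* new flow: full Netfilter processing */
        return flow->allowed ? NP_FORWARD : NP_DROP; /* known flow: handle on the NP */
    }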
2. NETFILTER FIREWALL
Netfilter [6] is a set of Linux kernel modifications that enables packet operations on packets received from the Linux IP networking layer. Introduced in the Linux 2.4 code base, Netfilter and its associated modules replace the previous instantiations of IP filtering used in older versions of Linux: ipfwadm in Linux 2.0 and IPChains in Linux 2.2. Netfilter provides the system code and the application programming interfaces needed to provide access from either kernel processes or user applications. IPTables is a set of user-space tools and kernel loadable modules used in conjunction with Netfilter. IPTables allows a user to define sets of rules governing packet-filtering behavior, and it includes kernel loadable modules supporting many different types of packet operations. Used together, Netfilter and IPTables support packet filtering, packet forwarding, Network Address Translation (NAT), event logging, connection tracking, and more advanced operations (often called packet mangling) capable of modifying the L4+ packet header or payload. In this paper, we refer to Netfilter and IPTables interchangeably.

The basic packet flow through Netfilter and IPTables is determined by a series of filter blocks known as chains, the same concept used with IPChains in Linux 2.2. The three basic chains are the forwarding chain, the input chain, and the output chain, as shown in Figure 1.

Figure 1. Netfilter chains: packets flow between the Linux IP layer, the input, forward, and output chains, and local processes.

The forwarding chain provides a path and rule set for packets not destined for the local host machine, that is, packets to be routed through the network. This chain allows packet filtering based on user-defined rules, packet header data, and state collected from previous packets. The input chain provides similar functionality for packets destined to a process on the local host machine. Packets can be passed to the application expecting them, or dropped without notifying the application. The output chain provides a similar mechanism for packets originating from the local host machine.
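To make the kernel-side API concrete, the fragment below shows a minimal Netfilter hook module registered on the forward chain. It is an illustrative sketch for a 2.6-era kernel rather than code from our firewall; the hook prototype and hook-number constants differ slightly across kernel versions.

    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/netfilter.h>
    #include <linux/netfilter_ipv4.h>
    #include <linux/skbuff.h>

    /* Hook function invoked for every packet traversing the forward chain.
     * A real filter would inspect headers and connection state here and
     * could return NF_DROP; this sketch accepts everything. */
    static unsigned int fw_hook(unsigned int hooknum,
                                struct sk_buff **skb,
                                const struct net_device *in,
                                const struct net_device *out,
                                int (*okfn)(struct sk_buff *))
    {
        return NF_ACCEPT;
    }

    static struct nf_hook_ops fw_ops = {
        .hook     = fw_hook,
        .pf       = PF_INET,
        .hooknum  = NF_IP_FORWARD,     /* the forwarding chain */
        .priority = NF_IP_PRI_FILTER,
    };

    static int  __init fw_init(void) { return nf_register_hook(&fw_ops); }
    static void __exit fw_exit(void) { nf_unregister_hook(&fw_ops); }

    module_init(fw_init);
    module_exit(fw_exit);
    MODULE_LICENSE("GPL");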
3. NP/CPU COMMUNICATION
Our hybrid firewall consists of a Netfilter firewall on a dual 2.4 GHz Intel Xeon Processor system coupled to a simple custom packet filtering firewall executing on a dual 1.4 GHz IXP2800 CompactPCI card. Figure 2 is a picture of the hybrid firewall hardware. The two sets of processors connect via a 64-bit, 66 MHz Peripheral Components Interface (PCI) bus. The IXP2800 network processor features sixteen RISC engines called microengines (MEs). Each ME holds enough registers to keep state for up to 8 computational threads and can context swap between these threads in just a few clocks. Our hybrid firewall's Ethernet receive, Ethernet transmit, and IP processing, along with the packet forwarding firewall, execute on the MEs. To move packets and data between the ME-based firewall and the microprocessor-based firewall, we had to create the communications library described next.
Figure 3. Message Passing Memory Configuration

Figure 3 shows the conceptual memory map within a given processor's address space. The system keeps two copies of packet buffers and circular queues, one in local memory and one in remote memory (the other processor's memory across PCI). Since the physical and virtual addresses of each region are different for each processor, a pointer value on one processor means nothing to the other processor. Applications use a system of base and offset pointers to coordinate their data structures, communicating physical offsets within the shared regions rather than passing virtual address pointers.
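The convention can be pictured with a small sketch; the helper names below are placeholders rather than our library's API. Each side keeps its own base pointer for its mapping of the shared region, and only region-relative offsets cross PCI inside queue elements.

    #include <stdint.h>

    /* Each processor's view of a shared packet-buffer region. */
    struct shared_region {
        uint8_t *base;   /* this processor's mapping of the region */
        uint32_t size;   /* region size in bytes */
    };

    /* Translate a local pointer into an offset the peer can interpret. */
    static inline uint32_t ptr_to_offset(const struct shared_region *r, const void *p)
    {
        return (uint32_t)((const uint8_t *)p - r->base);
    }

    /* Translate an offset received from the peer back into a local pointer. */
    static inline void *offset_to_ptr(const struct shared_region *r, uint32_t off)
    {
        return r->base + off;   /* caller ensures off < r->size */
    }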
When updating queue elements, the application writes both the local and remote copies. Software performs all reads of queue state locally, so these reads return quickly without having to wait for PCI latency. The developer must impose constraints on when each portion of the application can access particular data fields within the shared application state. For instance, the producer must never write an element in the FULL state, whereas the consumer must never write an element in the FREE state. Further, software must ensure that all related updates are complete prior to changing the value of the status bits.
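The sketch below illustrates these rules for the producer side. The element layout, state encoding, and barrier primitive are assumptions made for illustration, not our actual data structures; the key points are that queue state is read locally, that data fields are written to both the local copy and the remote mirror, and that the status only flips to FULL after the data writes are ordered.

    #include <stdint.h>

    #define ELEM_FREE 0u
    #define ELEM_FULL 1u

    /* Illustrative queue element mirrored in local and remote memory. */
    struct queue_elem {
        uint32_t pkt_offset;          /* offset of the packet in the shared region */
        uint32_t pkt_len;
        volatile uint32_t status;     /* ELEM_FREE or ELEM_FULL */
    };

    /* Hypothetical write barrier; in Linux kernel code this would be wmb(). */
    #define write_barrier() __sync_synchronize()

    /* Producer-side enqueue: only touch FREE elements, mirror all data writes
     * across PCI, then flip the status so the consumer never sees a
     * half-written element. */
    static int enqueue(struct queue_elem *local, struct queue_elem *remote,
                       uint32_t pkt_offset, uint32_t pkt_len)
    {
        if (local->status != ELEM_FREE)      /* queue state is always read locally */
            return -1;                       /* element still owned by the consumer */

        local->pkt_offset  = pkt_offset;
        local->pkt_len     = pkt_len;
        remote->pkt_offset = pkt_offset;     /* mirror the data fields across PCI */
        remote->pkt_len    = pkt_len;

        write_barrier();                     /* data must land before the status flip */

        local->status  = ELEM_FULL;
        remote->status = ELEM_FULL;
        return 0;
    }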
The mirrored memory technique removes all runtime reads from the PCI bus. This write-write interface markedly improves overall bus efficiency. The trade-off is the introduction of many small writes to update remote queue elements. Figure 6 shows measurements of the final bandwidth achieved. Both tests assumed an ideal consumer. The data shows that queue state updates represent an acceptably small impact to performance using the mirrored memory approach.
Figure 6. PCI Bus Performance Showing Impact of Small Queue Management Writes (throughput in Mb/s for one- and two-channel PCI reads and writes across a range of transfer sizes).

This model allows for maximum flexibility in application design because the processors may initiate, drop, and modify any packets they own without regard to any latent state on the other processor. The principal drawback of this model is complete packet data copies in both directions across the PCI bus. This may waste PCI bus cycles and processor cycles for applications that do not modify packet data.
The six clock overhead estimate assumes only one clock to propagate across the non-transparent bridge. Calculation assumes a perfect target with no contention from other PCI bus masters.
The coupled model instead shares packet buffers between the two processors, passing ownership of a packet rather than moving data over PCI. The resulting coupled packet sharing mechanism delivers about double the data rate available from the decoupled paradigm for periods of heavy PCI utilization, as shown in Figure 7.
Figure 7. Coupled vs. Decoupled Maximum Data Rates (throughput in Mb/s versus packet size in bytes).

Table 1 and Table 2 show processor cycle counts during each mode of operation. These cycle counts reveal the five routines most frequently called while servicing 1518-byte packets at the maximum attainable data rate. In coupled mode, the processor is mostly busy handling the PCI Comms tasklet, polling the ring buffers, and receiving packets; packet transmission to the MEs is a low-overhead operation. The same analysis performed on the decoupled model reveals the source of the performance disparity. Copies to the MEs' memory during packet transmission dominate performance, consuming up to 23% of the processor's time. These copies account for just 5% of the actual retired instructions, suggesting that the processor spends many cycles stalled within the copy routine. These stalls arise because the processor supplies data at a much greater rate than the PCI bus can accept it.

Table 1. Cycle Count, Coupled mode, 1518-byte packets (top 5 shown)
  % of Total   Image           Symbol
  15.0         Linux           tasklet_action
  13.0         CPU/ME Driver   ia_poll
  11.7         CPU/ME Driver   ia_ingress_tasklet
  10.6         Linux           net_rx_action
   8.1         Linux           do_softirq
  41.6                         other
Table 2. Cycle Count, Decoupled mode, 1518-byte packets (top 5 shown)
  % of Total   Image           Symbol
  23.3         CPU/ME Driver   copy_to_me
  11.9         Linux           tasklet_action
  10.8         CPU/ME Driver   ia_poll
   9.6         CPU/ME Driver   ia_ingress_tasklet
   8.4         Linux           net_rx_action
  35.9                         other

With the PCI communications established between the MEs and the CPU, we move on to describing the hybrid firewall. While this firewall supports concurrent coupled and decoupled operation, this paper addresses the coupled mode as it provides the best performance.
Figure 8. Hybrid Firewall Software Architecture (the Linux kernel and the PCI Comms kernel module on the host connect over the 64-bit/66 MHz PCI bus to the PCI Comms microblock, IPv4 forwarding, and Ethernet blocks on the network processor).

The PCI Comms kernel module polls the ME-to-CPU communication rings for new packets arriving from the MEs. We chose a polling interface both because it is a good match for our high traffic rate application and because interrupts from the MEs were not available to us. For heavy traffic loads, polling mode works best since it allows the operating system to schedule the work, ensuring other applications run uninterrupted. We simulated interrupts combined with polling by scheduling our polling routine as a tasklet during times of light traffic. The operating system schedules tasklets with a much lower frequency than the polling interface, saving CPU cycles. If during polling the driver determined that there were no packets received from the PCI Comms microblock, the driver removed itself from the polling queue and scheduled a tasklet to check the rings later. Because the latency between tasklet calls can be quite long, we had our tasklet immediately place our driver back in the polling queue even if there were no packets received. If interrupts had been available, we would have used them during times of low link utilization, switching to polling under heavy load.

The Linux operating system uses a packet descriptor structure called an sk_buff to hold packet data and metadata as a packet traverses the operating system. NIC devices typically DMA directly into the packet data portion of the sk_buff. Similarly, the PCI Comms kernel module keeps a pool of allocated sk_buffs for use in microengine DMA transactions. The pool slowly drains over time; software replenishes it when it reaches a low water mark, or when there is a break in the packet passing work. We chose this design because keeping DMA buffers available on the queue for the NP requires the CPU to minimize the time spent managing buffers between each poll.

The network driver interface under Linux requires that the driver set the protocol field in the sk_buff structure to contain the type field from the Ethernet packet header. This protocol field is then used by the Linux network core to dispatch packets to whichever protocol has registered a receive handler for the specific protocol ID in that packet's descriptor. For example, packets with a protocol ID of 0x800 normally proceed to the IP receive handling routine. The PCI Comms library uses a unique Application ID to determine the destination application for each packet. For the hybrid firewall application, because Ethernet and IP packet processing has already occurred in the NP, packets should simply be handed to IPTables for firewall processing. We use the existing Linux dispatch code and bypass the Linux IP stack by registering a receive handler for each unique application ID and setting the protocol field in the sk_buff to match the application ID indicated by the microblock. Linux forwards packets to our application-specific protocol as long as the protocol field in the sk_buff matches the protocol field registered by the application with the networking core. This paper refers to the protocol module as the NF Protocol Interface, because it is responsible for interfacing to the Netfilter modules. This method provides two very important advantages. First, we do not have to write our own dispatch code. Second, by using the existing network dispatch code in the Linux network core, we enable other legacy applications to receive our packets as well. For example, to send a packet to an existing Linux protocol stack, the NP sets the Application ID to that particular network protocol ID (such as 0x800 for TCP/IP).
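The fragment below illustrates how such a receive handler can be registered through the standard Linux packet_type mechanism. It is a hedged sketch rather than the NF Protocol Interface source: the application ID value is a placeholder, the handler body is a stub, and the handler prototype shown is the one used by kernels around 2.6.13 and later (older 2.4/2.6 kernels omit the last argument).

    #include <linux/module.h>
    #include <linux/netdevice.h>
    #include <linux/skbuff.h>
    #include <linux/if_ether.h>

    /* Placeholder application ID used as the sk_buff protocol value;
     * the actual numbering scheme is not shown in this paper. */
    #define NFPI_APP_ID 0x88b5

    /* Receive handler invoked by the Linux network core for sk_buffs whose
     * protocol field matches NFPI_APP_ID. A real NF Protocol Interface would
     * hand the packet to the Netfilter chains here; this stub just frees it. */
    static int nfpi_rcv(struct sk_buff *skb, struct net_device *dev,
                        struct packet_type *pt, struct net_device *orig_dev)
    {
        kfree_skb(skb);
        return 0;
    }

    static struct packet_type nfpi_ptype = {
        .type = __constant_htons(NFPI_APP_ID),
        .func = nfpi_rcv,
    };

    static int  __init nfpi_init(void) { dev_add_pack(&nfpi_ptype); return 0; }
    static void __exit nfpi_exit(void) { dev_remove_pack(&nfpi_ptype); }

    module_init(nfpi_init);
    module_exit(nfpi_exit);
    MODULE_LICENSE("GPL");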
To avoid these pitfalls, an ME-specific copy of the Netfilter connection table was created for the ME-resident firewall. We implemented a UDP socket interface between the microprocessors and the IXP2800 firewall software to enable the microprocessor to place a copy of its connection tracking state into NP-resident memory. This simple RPC-like interface allows the CPU-resident NFPI to update state in the IXP2800's memory. NP-resident state is formatted specifically for optimized ME accesses. Runtime firewall rule updates could use the same mechanism, but were not implemented.
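The sketch below illustrates what such an RPC-like update could look like from the microprocessor side. The message layout, port number, and addressing are assumptions made for illustration only; they are not the actual interface used in this work.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Hypothetical per-entry connection-table update message. */
    struct ct_update {
        uint32_t src_ip, dst_ip;      /* network byte order */
        uint16_t src_port, dst_port;  /* network byte order */
        uint8_t  proto;               /* IPPROTO_TCP, IPPROTO_UDP, ... */
        uint8_t  allowed;             /* 1 = established, permitted flow */
    } __attribute__((packed));

    /* Send one connection-tracking entry to an agent on the NP over UDP.
     * np_addr and port are illustrative; error handling is trimmed. */
    static int push_ct_entry(const char *np_addr, uint16_t port,
                             const struct ct_update *upd)
    {
        struct sockaddr_in dst;
        ssize_t n;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0)
            return -1;

        memset(&dst, 0, sizeof(dst));
        dst.sin_family = AF_INET;
        dst.sin_port   = htons(port);
        inet_pton(AF_INET, np_addr, &dst.sin_addr);

        n = sendto(fd, upd, sizeof(*upd), 0,
                   (struct sockaddr *)&dst, sizeof(dst));
        close(fd);
        return n == (ssize_t)sizeof(*upd) ? 0 : -1;
    }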
Table 3. Cycle Count for Hybrid Firewall
  % of Total   Image              Symbol
  15.3         Linux              tasklet_action
  13.0         PCI Comms Driver   ia_poll
  11.7         PCI Comms Driver   ia_ingress_tasklet
  10.1         Linux              net_rx_action
   7.9         Linux              do_softirq
   4.6         PCI Comms Driver   ia_get_msg
   4.2         Linux              netif_receive_skb
   3.9         PCI Comms Driver   ia_rx_msg
   3.9         PCI Comms Driver   ia_msg_rx_alloc_backup
   3.8         Linux              tasklet_schedule
   2.0         PCI Comms Driver   ia_check_rings
   2.0         Linux              eth_type_trans
   2.0         Linux              skb_dequeue
   1.9         Linux              skb_release_data
   1.7         Linux              kfree
   1.6         Linux              kmalloc
   1.3         Linux              skb_queue_tail
   1.2         Linux              alloc_skb
   1.2         Linux              memcpy
   1.2         PCI Comms Driver   me_pkt_forward
The actual firewall packet processing is only a small percentage of the overall cycle count, so small that the functions responsible do not appear in the table. All observed Netfilter function calls accounted for only 0.91% of the cycle count in the above profile. Likewise, the NF Protocol Interface module accounted for 0.66% of the cycle count. This analysis identifies several high-cost tasks. Managing the OS buffers (sk_buffs) accounts for about 10% of the CPU time (netif_receive_skb, skb_dequeue, skb_release_data, kfree, kmalloc, alloc_skb, skb_queue_tail). Polling rings within the PCI communications library uses a combination of a polling loop (ia_poll, 13%) and Linux tasklet scheduling (tasklet_action, 15.3%, and ia_ingress_tasklet, 11.7%). Clearly, polling is the area we would first look to tune; having a system with interrupts available would eliminate this tasklet scheduling overhead.

6. CONCLUSIONS

Using Linux Netfilter, an existing full-featured firewall running on a high performance general purpose CPU, we were able to successfully add a simple packet filtering firewall running on a network processor and achieve substantial performance gains. In fact, for small packets our hybrid firewall prototype was able to achieve 2 Gb/s line rate with almost 30% of the packets forwarded to Netfilter. This is greater than a 4X throughput gain over the standard Netfilter firewall. Our hybrid firewall exhibited the superior performance we suspected it would.

A number of characteristics of the PCI connection between our microprocessors and network processor served to make programming difficult. The multiple address spaces associated with each processor, mapped together through the non-transparent PCI-to-PCI bridge, were difficult for the programmer to manage. Synchronizing data structures among the multiple address spaces is both complicated and tedious. The lack of atomic transactions to PCI-resident memory resources drove constraints into the creation and use of data structures accessible from both the microprocessor and the MEs. The inability of the MEs to use microprocessor virtual addresses also complicated data structure creation and use. The low bandwidth and long latency associated with ME PCI reads drove our message passing architecture to consume extra memory bandwidth and made effective state sharing almost impossible. Polling and memory allocation for packet buffers consumed more microprocessor cycles than we hoped. We will seek to improve these features in future implementations of our hybrid approach.

7. RELATED WORK

Brink et al. [7] present measurements for IPv4 forwarding, a Netfilter firewall, and IPsec on Linux 2.4.18 running on a dual Intel Xeon Processor based system. Their measurements show lower-than-line-rate throughput for smaller packets, concurring with ours. The authors also present IPv4 forwarding results for a dual Intel IXP2400 network processor system showing close to maximum theoretical line rate for all packet sizes. Based on benchmarks published by the Network Processing Forum [8] and the IETF [9], Kean [10] presents a methodology for benchmarking firewalls along with measurements for an earlier generation network processor, the Intel IXP1200. Kean's measurements show significant roll-off for smaller packet sizes, though not as large as the Netfilter firewall's.

Alternate firewall platform architectures have been explored before. The Twin Cities prototype provided a coherent shared memory interface between an Intel Pentium III Processor and an IXP1200 [11]. Twin Cities did not target an existing CPU firewall (Netfilter) and used custom hardware. Other authors have explored different architectures, including FPGA-based firewalls [12] and firewall-targeted network processor designs [13]. Corrent [14] advertises a firewall that most closely matches the work described here, using a combination of network processors and general-purpose processors to execute a CheckPoint Firewall. Corrent even shows that their approach leads to enhanced performance for small packets. IPFabrics [15] offers both a PCI add-in card that holds two Intel IXP2350s and a Packet Processing Language that enables quick programming of applications like firewalls for the NPs.
8. ACKNOWLEDGEMENTS
The authors would like to acknowledge Santosh Balakrishnan, Alok Kumar, and Ellen Deleganes for the NP-based firewall that served as an early version of the NP code used in this paper. We would also like to extend special thanks to Rick Coulson and Sanjay Panditji for their steadfast support of this work and to Raj Yavatkar for his valuable guidance.
9. REFERENCES
[1] W. R. Cheswick, S. Bellovin, A. Rubin. Firewalls and Internet Security, Second Edition. Addison-Wesley, 2003.
[2] M. Kounavis, A. Kumar, H. Vin, R. Yavatkar, A. Campbell. Directions in Packet Classification for Network Processors. 2004. http://comet.ctr.columbia.edu/~campbell/papers/np2.pdf
[3] R. Peschi, P. Chandra, M. Castelino. IP Forwarding Application Level Benchmark v1.6. Network Processing Forum, May 12, 2003. http://www.npforum.org/techinfo/approved.shtml#ias
[4] Computer Emergency Response Team. CERT Advisory CA-1996-21 TCP SYN Flooding and IP Spoofing Attacks. Nov 29, 2000. http://www.cert.org/advisories/CA-1996-21.html
[5] Intel IXP2800 Network Processor. Intel Corporation. http://www.intel.com/design/network/products/npfamily/IXP2800.htm
[6] H. Welte. What is Netfilter/IPTables? http://www.netfilter.org
[7] P. Brink, M. Castelino, D. Meng, C. Rawal, H. Tadepalli. Network Processing Performance Metrics for IA- and NP-Based Systems. Intel Technology Journal, Volume 7, Issue 4, 2003, pp. 78-91.
[8] NPF Benchmarking Implementation Agreements. Network Processing Forum. http://www.npforum.org/techinfo/approved.shtml#ias
[9] B. Hickman, D. Newman, S. Tadjudin, T. Martin. RFC 3511: Benchmarking Methodology for Firewall Performance. The Internet Society, April 2003. http://www.faqs.org/rfcs/rfc3511.html
[10] L. Kean, S. B. M. Nor. A Benchmarking Methodology for NPU-Based Stateful Firewall. APCC 2003, Volume 3, 21-24 Sept. 2003, pp. 904-908.
[11] F. Hady, T. Bock, M. Cabot, J. Meinecke, K. Oliver, W. Talarek. Platform Level Support for High Throughput Edge Applications: The Twin Cities Prototype. IEEE Network, July/August 2003, pp. 22-27.
[12] A. Kayssi, L. Harik, R. Ferzli, M. Fawaz. FPGA-based Internet Protocol Firewall Chip. Electronics, Circuits and Systems, 2000 (ICECS 2000), Volume 1, 17-20 Dec. 2000, pp. 316-319.
[13] K. Vlachos. A Novel Network Processor for Security Applications in High-Speed Data Networks. Bell Labs Technical Journal 8(1), 2003, pp. 131-149.
[14] Corrent Security Appliances Sustain Maximum Throughput Under Attack. http://www.intel.com/design/embedded/casestudies/corrent.pdf
[15] Double Espresso. IPFabrics. http://www.ipfabrics.com/products/de.php