
Network Processor Acceleration for a Linux* Netfilter Firewall

Kristen Accardi, Tony Bock, Frank Hady, Jon Krueger


Intel Corporation, 2111 NE 25th Ave, Hillsboro, OR 97124

{kristen.c.accardi, tony.bock, frank.hady, jon.krueger}@intel.com

ABSTRACT


Network firewalls occupy a central role in computer security, protecting data, compute, and networking resources while still allowing useful packets to flow. Increases in both the work per network packet and the packet rate make it increasingly difficult for general-purpose processor based firewalls to maintain line rate. To address these evolving requirements we have prototyped a hybrid firewall, using a simple firewall running on a network processor to accelerate a Linux* Netfilter firewall executing on a general-purpose processor. The simple firewall on the network processor provides high rate packet processing for all packets, while the general-purpose processor delivers high rate, full featured firewall processing for those packets that need it. This paper describes the hybrid firewall prototype with a focus on the software created to accelerate Netfilter with a network processor resident firewall. Measurements show our hybrid firewall able to maintain close to 2 Gb/sec line rate for all packet sizes, a significant improvement over the original firewall. We also include the hard won lessons learned while implementing the hybrid firewall.

Categories and Subject Descriptors


C.2.0 [Computer-Communication Networks]: General - Security and protection (e.g., firewalls)

General Terms
Measurement, Performance, Design, Experimentation

Keywords
Network Firewall, Netfilter, Throughput, Network Processor, Prototype, Hybrid Firewall

1. INTRODUCTION
Network firewalls occupy an essential role in computer security, guarding the perimeter of LANs. The goal of any network firewall is to allow desired packets unimpeded access to the network while dropping undesirable packets. In so doing, the firewall protects the data and compute resources of the organization that owns the LAN. Increases in LAN and Internet bandwidth rates, coupled with requirements for more sophisticated packet filtering, have made the work of the firewall more difficult.

All firewalls make forward/drop decisions for each packet, but the complexity of the work performed to make those decisions varies widely. Packet filter firewalls [1] make forward/drop decisions based on the contents of the packet header and a set of rules. Stateful firewalls make forward/drop decisions based on packet headers, rules lists, and state collected from previous packets. Application layer firewalls (a.k.a. application layer gateways) reconstruct application level data objects carried within sets of packets and make forward/drop decisions based upon the state of those objects. In reaction to increasingly sophisticated threats, firewall complexity is increasing. A recent study [2] showed a commercial ISP firewall containing 3000 rules. Almost all commercial firewalls are stateful and most include application layer features. Firewalls are also incorporating Intrusion Detection and Intrusion Prevention, improving security at the cost of increased computation. The trend toward more secure and robust firewall protection drives an escalating need for more processing power within the firewall.

Since firewalls make per-packet decisions, packet rate is a critical metric of firewall performance. As a rule, firewalls must be able to handle minimum sized packets at the maximum rate delivered by the attached media. Minimum sized packets occur frequently in real traffic: the Network Processing Forum [3] specifies 40-byte packets to occupy 56% of the Internet Mix of IP packets. Denial-of-service attacks like SYN flood [4] commonly exploit minimum sized packets because they represent the most difficult workload for many firewall implementations.

In this paper, we explore the advantages of a hybrid firewall, one that includes an application layer Linux Netfilter firewall running on a general-purpose processor and a simple packet filtering firewall executing on a Network Processor (NP), by building and measuring the performance of such a firewall. The Intel IXP2800 Network Processor [5] classifies all packets arriving at the firewall. For simple cases, the NP completes the packet filtering operation required for firewall operation, deciding to forward or drop the packet on its own.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ANCS'05, October 26-28, 2005, Princeton, New Jersey, USA. Copyright 2005 ACM 1-59593-082-5/05/0010...$5.00.

Intel is a registered trademark of Intel Corporation or its subsidiaries in the United States and other countries. Xeon is a trademark of Intel Corporation or its subsidiaries in the United States and other countries. * Other brands and names are the property of their respective owners.


We select an NP for our hybrid firewall both because of its packet processing prowess and because it is programmable, allowing changes in response to changing security needs. Programmability is an important factor in our selection of an NP over custom hardware. Packets requiring additional computation, for instance packets defining a new flow, are processed by a Netfilter firewall running on Intel Xeon processors (called microprocessors or CPUs here). The Intel Xeon processor is selected for its ability to rapidly complete the complex firewall processing required by a full featured Netfilter firewall.

We first describe the Linux Netfilter firewall and the significant challenge posed by minimum sized packets. In order to build our hybrid firewall we constructed a library for packet communication between the NPs and the microprocessors over PCI. We describe this library, the two variations implemented, and the performance measured. The hybrid firewall itself is described next, with a focus on interfacing Netfilter to the NP resident firewall. Performance measurements show that our hybrid firewall achieves our goal of providing line rate firewall processing for all packets, with more intensive firewall processing for a fraction of the packets. We dissect the performance achieved and present the hard lessons learned along the way. Related work and conclusions finish the paper.

2. NETFILTER FIREWALL

Netfilter [6] is a set of Linux kernel modifications that enables packet operations on packets received from the Linux IP networking layer. Introduced in the Linux 2.4 code base, Netfilter and its associated modules replace the previous instantiations of IP filtering used in older versions of Linux: ipfwadm in Linux 2.0 and IPChains in Linux 2.2. Netfilter provides the system code and the application programming interfaces needed to provide access from either kernel processes or user applications. IPTables is a set of user-space tools and kernel loadable modules used in conjunction with Netfilter. IPTables allows a user to define sets of rules governing packet-filtering behavior. IPTables also includes kernel loadable modules supporting many different types of packet operations. Used together, Netfilter and IPTables support packet filtering, packet forwarding, Network Address Translation (NAT), event logging, connection tracking, and more advanced operations (often called packet mangling) capable of modifying the L4+ packet header or payload. In this paper, we will refer to Netfilter and IPTables interchangeably.

The basic packet flow through Netfilter and IPTables is determined by a series of filter blocks known as chains, the same concept used with IPChains in Linux 2.2. The three basic chains are the forwarding chain, the input chain, and the output chain, as shown in Figure 1. The forwarding chain provides a path and rule set for packets not destined for the local host machine, that is, packets to be routed through the network. This chain allows packet filtering based on user-defined rules, packet header data, and state collected from previous packets. The input chain provides similar functionality for packets destined to a process on the local host machine. Packets can be passed to the application expecting them, or dropped without notifying the application. The output chain provides a similar mechanism for packets originating from the local host machine.

Figure 1. Rules Chains for Netfilter and IPTables (IP packets arriving at the Linux IP layer flow through the input, forwarding, and output chains; the input and output chains connect to local processes)


Stateful firewalls based on Netfilter/IPTables store per-connection state. In some cases, connection tracking enables the firewall to consult a rules chain only when establishing a new connection, improving performance. IPTables evaluates new connections using helper modules that have protocol specific knowledge. Connection state storage also enables more advanced firewall features such as Network Address Translation (NAT) and Application Layer Gateways (ALGs). ALGs for protocols like FTP perform packet data inspection and modification and retain significant connection state across packets within a connection. We used Linux 2.6.7 Netfilter and IPTables to implement the microprocessor based portion of our firewall.

Studies have shown that Netfilter firewalls cannot maintain Gigabit/sec line rates for small packets. Brink et al. [7] showed multi-Gb/sec Netfilter forwarding rates for large packets, but only about 200 Mbps for 64-byte Ethernet packets. We measured slightly higher Netfilter firewall performance, but still fell well short of our 2 Gb/sec line rate for 64-byte Ethernet packets. Our hybrid firewall uses an NP based firewall in concert with the Netfilter firewall to provide near 2 Gb/sec line rate processing for all packet sizes.

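Netfilter exposes these forward/drop decision points to kernel code through registered hook functions; IPTables itself is built on them. As a generic illustration only (not code from our prototype), the following minimal module attaches one packet filter rule to the FORWARD chain using the 2.6-era API our Linux 2.6.7 base exposed (nf_register_hook and the struct sk_buff ** hook signature; both changed in later kernels).

/*
 * Illustrative only: a minimal packet filter hook on Netfilter's FORWARD
 * chain, written against the 2.6-era kernel API.
 */
#include <linux/module.h>
#include <linux/init.h>
#include <linux/types.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/skbuff.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>

static unsigned int drop_telnet_hook(unsigned int hooknum,
                                     struct sk_buff **pskb,
                                     const struct net_device *in,
                                     const struct net_device *out,
                                     int (*okfn)(struct sk_buff *))
{
        struct iphdr *iph = (*pskb)->nh.iph;    /* 2.6-era header access */
        struct tcphdr *tcph;

        if (iph->protocol != IPPROTO_TCP)
                return NF_ACCEPT;               /* rule applies to TCP only */

        tcph = (struct tcphdr *)((u32 *)iph + iph->ihl);
        if (tcph->dest == htons(23))            /* example rule: block telnet */
                return NF_DROP;

        return NF_ACCEPT;
}

static struct nf_hook_ops forward_ops = {
        .hook     = drop_telnet_hook,
        .pf       = PF_INET,
        .hooknum  = NF_IP_FORWARD,              /* the forwarding chain */
        .priority = NF_IP_PRI_FIRST,
};

static int __init filter_init(void)
{
        return nf_register_hook(&forward_ops);
}

static void __exit filter_exit(void)
{
        nf_unregister_hook(&forward_ops);
}

module_init(filter_init);
module_exit(filter_exit);
MODULE_LICENSE("GPL");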

3. NP/CPU COMMUNICATION
Our hybrid firewall consists of a Netfilter firewall on a dual 2.4 GHz Intel Xeon processor system coupled to a simple, custom packet filtering firewall executing on a dual 1.4 GHz IXP2800 compact PCI card. Figure 2 is a picture of the hybrid firewall hardware. The two sets of processors connect via a 64-bit, 66 MHz Peripheral Component Interconnect (PCI) bus. The IXP2800 network processor features sixteen RISC engines called microengines (MEs). Each ME holds enough registers to keep state for up to 8 computational threads and can context swap between these threads in just a few clocks. Our hybrid firewall's Ethernet receive, Ethernet transmit, and IP processing, along with the packet forwarding firewall, execute on the MEs. To move packets and data between the ME based firewall and the microprocessor based firewall, we had to create the communications library described next.



Figure 2. Hybrid Firewall prototype hardware

3.1 Message-Passing Architecture


The first step towards packet passing between the MEs and the microprocessors was enabling reads and writes from one processor's memory space to the other's. Unlike peripheral devices, which share a small portion of the platform's global address space, processors commonly assume control over the entire address space; this was true of both the CPU and the NP in our firewall. A non-transparent PCI-to-PCI bridge provides connectivity between the two processors' address spaces by translating a subset of each processor's address space into the address space of the other. After enabling reads and writes, we built a message-passing architecture between the two groups of processors. Performance measurements of different read and write operations across PCI heavily influenced the final architecture.

3.1.1 Memory Configuration

To implement a message passing architecture, the application establishes a set of circular queues visible to both processors. In order to eliminate the need for mutually exclusive accesses (not available over PCI), the library uses a unidirectional producer/consumer model. Complementary pairs of these queues (CPU→ME plus ME→CPU) establish bidirectional message passing. At initialization time, software on each side of the non-transparent bridge communicates the offset of the base address of this circular queue within the PCI address window provided by the non-transparent bridge. Queue size and structure are constant (set at compile time). Each queue consists of a fixed number of entries, or elements, with each element containing the following fields:

- IA Packet ID: unique identifier for packets allocated by the microprocessor
- NP Packet ID: unique identifier for packets allocated by the NP
- Packet Length: length of the packet in bytes, including headers
- Status bits: unidirectional semaphore; the producer raises it, the consumer lowers it
- Application ID: message field telling the consumer how to dispatch the packet
- Packet Buffer Offset: pre-allocated buffer provided by the consumer into which the producer copies packet data

Figure 3 shows the conceptual memory map within a given processor's address space. This system keeps two copies of the packet buffers and circular queues, one in local memory and one in remote (the other processor's) memory across PCI. Since the physical and virtual addresses of each region are different for each processor, a pointer value on one processor means nothing to the other processor. Applications use a system of base and offset pointers to coordinate their data structures, communicating physical offsets within the shared regions rather than passing virtual address pointers.

Figure 3. Message Passing Memory Configuration (local and remote copies of the circular queues and packet buffers, connected across PCI)
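To make the element layout concrete, the following C sketch captures the fields listed above; the names, 32-bit widths, and packing are illustrative assumptions rather than the exact layout used by the PCI Communications Library.

/*
 * Sketch of one circular-queue element as described in Section 3.1.1.
 * Field names and widths are assumptions for illustration only.
 */
#include <stdint.h>

enum elem_status {
        ELEM_FREE = 1,          /* consumer has refreshed the element      */
        ELEM_FULL = 2           /* producer has published a packet into it */
};

struct queue_element {
        uint32_t ia_packet_id;  /* unique ID for packets allocated by the CPU */
        uint32_t np_packet_id;  /* unique ID for packets allocated by the NP  */
        uint32_t packet_length; /* bytes copied, including headers            */
        uint32_t status;        /* unidirectional semaphore: FREE/FULL        */
        uint32_t app_id;        /* tells the consumer how to dispatch         */
        uint32_t buf_offset;    /* physical offset of the pre-allocated buffer
                                   inside the shared packet buffer region     */
};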

3.1.2 Message Passing Operations


Operation of the queues is similar in both directions. To prepare or refresh an element for the producer, the consumer allocates a packet buffer and calculates the corresponding physical byte offset to that buffer. It then records this offset, along with a unique packet identifier, into the queue element and raises the element status bits to the FREE state to indicate to the producer that this element is ready for the produce operation.

To send a packet, the producer reads the next queue element to make sure the status bits indicate FREE. When the queue is full, the next queue element will instead read FULL. If the queue is not full, the producer uses the packet buffer offset in the queue element to copy the packet data, including headers, to the consumer's (remote) memory. Within the queue element, the producer writes the packet's unique identifier, the length in bytes of the copied data, and an indication of the operations to perform on the packet via the Application ID field. When the copy completes, the producer sets the status bits to FULL, indicating to the consumer that valid data is available.

The consume operation consists of reading the local unique identifier from the queue element and passing the packet to the function or application indicated by the Application ID field. With the packet dispatched, the consumer performs the refresh operation outlined above and raises the status bits to FREE once again.

Our measurements show reads across the PCI bus to exhibit a latency of hundreds of nanoseconds, too long compared to small packet arrival times. Using a technique herein referred to as mirrored memory, the PCI Communications Library therefore instantiates a complete copy of the circular queue structures in both the microprocessor and NP memories.



When updating queue elements, the application writes both the local and remote copies. Software performs all reads of queue state locally, so these reads return quickly without waiting out PCI latency. The developer must impose constraints on when each portion of the application can access particular data fields within the shared application state. For instance, the producer must never write an element in the FULL state, whereas the consumer must never write an element in the FREE state. Further, software must ensure that all related updates are complete prior to changing the value of the status bits.

The mirrored memory technique removes all runtime reads from the PCI bus. This write-write interface markedly improves overall bus efficiency. The trade-off is the introduction of many small writes to update the remote queue elements. Figure 6 shows measurements of the final bandwidth achieved; both tests assumed an ideal consumer. The data shows that queue state updates represent an acceptably small impact on performance using the mirrored memory approach.

Figure 6. PCI Bus Performance Showing Impact of Small Queue Management Writes (throughput vs. transfer size, packet data only vs. packet data plus queue management writes)
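The produce step under the mirrored memory scheme can be sketched as follows. All names here (pci_copy, circ_queue, and the helper stubs) are hypothetical stand-ins for the library's internals; the point is the ordering: packet data and descriptor fields cross PCI first, and the status bits are raised to FULL, in both copies, only after those writes complete.

/*
 * Sketch of the mirrored-memory produce operation (Section 3.1.2).
 * pci_copy() stands in for posted writes across the non-transparent
 * PCI-to-PCI bridge and is stubbed with memcpy here.
 */
#include <stdint.h>
#include <string.h>

enum elem_status { ELEM_FREE = 1, ELEM_FULL = 2 };

struct queue_element {                 /* repeated from the earlier sketch */
        uint32_t ia_packet_id, np_packet_id, packet_length;
        uint32_t status, app_id, buf_offset;
};

struct circ_queue {
        struct queue_element *local;   /* copy in this processor's memory  */
        struct queue_element *remote;  /* mirrored copy across PCI         */
        uint8_t *remote_buf_base;      /* consumer's packet buffer region  */
        unsigned int head, nelems;
};

static void pci_copy(void *remote_dst, const void *src, size_t len)
{
        memcpy(remote_dst, src, len);  /* really a write over the bridge   */
}

int produce_packet(struct circ_queue *q, const void *pkt, uint32_t len,
                   uint32_t pkt_id, uint32_t app_id)
{
        struct queue_element *e = &q->local[q->head];

        if (e->status != ELEM_FREE)    /* local read only: no PCI latency  */
                return -1;             /* queue full                       */

        /* copy packet data into the consumer's pre-allocated buffer */
        if (len)
                pci_copy(q->remote_buf_base + e->buf_offset, pkt, len);

        e->np_packet_id  = pkt_id;     /* fill the descriptor locally      */
        e->packet_length = len;
        e->app_id        = app_id;

        /* mirror the descriptor before touching the status bits */
        pci_copy(&q->remote[q->head], e, sizeof(*e));

        /* publish last, in both copies, so the consumer never sees a
         * FULL element with stale fields */
        e->status = ELEM_FULL;
        pci_copy(&q->remote[q->head].status, &e->status, sizeof(e->status));

        q->head = (q->head + 1) % q->nelems;
        return 0;
}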

3.1.3 Performance Considerations

Moving packets between processing domains requires a copy of packet data across the PCI bus. The IXP2800 provides DMA engines for ME→CPU data moves. Figure 4 shows that these DMA engines achieve data rates of almost 3 Gb/s. Using the NP's DMA engines to perform CPU→ME data movement requires PCI reads; Figure 4 shows the throughput achieved here is much lower than in the write case, too low for our needs. We therefore selected an alternate method for CPU→ME data moves, using the microprocessor to write the data across the PCI bus to NP memory. To maximize the size of the CPU PCI writes, and thereby improve PCI efficiency, we mapped the memory shared with the NP as write-combining. Write-combining memory enables greater CPU write performance than uncached memory.

Figure 4. NP Raw DMA Performance, 1 and 2 channels (PCI read and write throughput vs. transfer size)

Using the CPU to move data from CPU→ME costs CPU cycles. When writing a large block of data to an uncached or write-combining region, the processor stalls until the entire write completes. Figure 5 is a conservative estimate of the CPU cycles spent moving 512 bytes across PCI:

    512 bytes / (8 bytes per PCI clock) + 6 clocks of overhead = 70 PCI clocks
    70 PCI clocks at 66 MHz = 3150 CPU clocks at 3 GHz

Figure 5. Cost of Moving Data with CPU Cycles. (The six clock overhead estimate assumes only one clock to propagate across the non-transparent bridge; the calculation assumes a perfect target with no contention from other PCI bus masters.)

3.2 Application Usage Models

The PCI link between the processors is a likely performance bottleneck. To best manage this limitation, the PCI Communications Library provides two usage models: one optimized for maximum flexibility (decoupled) and another designed for greater performance for a subset of applications (coupled).

3.2.1 Decoupled Packet Passing


In this usage model, once the producer sets the status bits to FULL during the produce operation, the consumer owns the packet. The producer then drops the original packet and frees the associated memory.

This model allows for maximum flexibility in application design because the processors may initiate, drop, and modify any packets they own without regard to any latent state on the other processor. The principal drawback of this model is that complete packet data copies cross the PCI bus in both directions, which may waste PCI bus cycles and processor cycles for applications that do not modify packet data.



3.2.2 Coupled Packet Passing


Coupled packet passing optimizes for packet filtering applications like firewalls and intrusion detection. Software using coupled packet passing assumes all packets originate within the MEs. As needed, these packets pass to the microprocessor for additional processing, but a copy of the original packet data remains within NP memory for later use. Software on the microprocessor then returns just the original packet identifier, with an action directive like PASS or DROP coded into the Application ID field. The NP references the packet identifier to locate the original packet data and uses this stored copy to perform the indicated action. Coupled packet passing conserves PCI bus cycles by reducing PCI traffic by roughly half compared to the same application in decoupled mode. Further, the CPU need not burn cycles moving data over PCI.



The resulting coupled packet sharing mechanism delivers about double the data rate available from the decoupled paradigm during periods of heavy PCI utilization, as shown in Figure 7.
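In coupled mode, the CPU-side return path reduces to a descriptor-only produce: only the NP packet ID and a verdict travel back across PCI. A minimal sketch, reusing the hypothetical element layout from Section 3.1.1 and assuming PASS/DROP encodings for the Application ID field:

/*
 * Sketch of a coupled-mode verdict return (Section 3.2.2): no packet data
 * crosses PCI, only the NP packet ID and a PASS/DROP code in the
 * Application ID field. Layout and encodings are illustrative assumptions.
 */
#include <stdint.h>

enum elem_status { ELEM_FREE = 1, ELEM_FULL = 2 };
enum verdict     { APP_ID_PASS = 0x10, APP_ID_DROP = 0x11 };  /* assumed */

struct queue_element {
        uint32_t ia_packet_id, np_packet_id, packet_length;
        uint32_t status, app_id, buf_offset;
};

/*
 * 'local' and 'remote' are the mirrored copies of the next CPU->ME queue
 * element; the caller has already verified that its status reads FREE.
 */
void return_verdict(struct queue_element *local,
                    volatile struct queue_element *remote,
                    uint32_t np_packet_id, enum verdict v)
{
        local->np_packet_id  = np_packet_id;  /* names the NP's stored copy */
        local->packet_length = 0;             /* descriptor only, no data   */
        local->app_id        = (uint32_t)v;

        remote->np_packet_id  = local->np_packet_id;   /* small PCI writes  */
        remote->packet_length = 0;
        remote->app_id        = local->app_id;

        local->status  = ELEM_FULL;           /* publish last, both copies  */
        remote->status = ELEM_FULL;
}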
Figure 7. Coupled vs. Decoupled Maximum Data Rates (throughput vs. packet size)

Table 1 and Table 2 show processor cycle counts during each mode of operation. These cycle counts reveal the five routines most frequently called while servicing 1518-byte packets at the maximum attainable data rate. In coupled mode, the processor is mostly busy handling the PCI Comms tasklet, polling the ring buffers, and receiving packets; packet transmission to the MEs is a low-overhead operation. The same analysis performed on the decoupled model reveals the source of the performance disparity: copies to the MEs' memory during packet transmission dominate, consuming up to 23% of the processor's time. These copies account for just 5% of the actual retired instructions, suggesting that the processor spends many cycles stalled within the copy routine. These stalls arise because the processor supplies data at a much greater rate than the PCI bus can accept it.

Table 1. Cycle Count, Coupled mode, 1518-byte packets (top 5 shown)
  % of Total   Image           Symbol
  15.0         Linux           tasklet_action
  13.0         CPU/ME Driver   ia_poll
  11.7         CPU/ME Driver   ia_ingress_tasklet
  10.6         Linux           net_rx_action
   8.1         Linux           do_softirq
  41.6         -               other

Table 2. Cycle Count, Decoupled mode, 1518-byte packets (top 5 shown)
  % of Total   Image           Symbol
  23.3         CPU/ME Driver   copy_to_me
  11.9         Linux           tasklet_action
  10.8         CPU/ME Driver   ia_poll
   9.6         CPU/ME Driver   ia_ingress_tasklet
   8.4         Linux           net_rx_action
  35.9         -               other

With the PCI communications established between the MEs and the CPU, we move on to describing the hybrid firewall. While this firewall supports concurrent coupled and decoupled operation, this paper addresses the coupled mode, as it provides the best performance.

4. THE HYBRID FIREWALL


Our hybrid firewall distributes the firewall work between the Netfilter firewall on the microprocessors and the packet filtering firewall on the network processor. For our prototype, Netfilter packet mangling was not enabled, allowing use of the coupled PCI communications library and the benefit of its performance advantages. Packets arrive from the network at the IXP2800 MEs. The MEs handle base packet processing for all packets received by the firewall, including Ethernet processing, IPv4 forwarding operations, and application of simple firewall rules. Some packets require processing not included in the NP based packet filtering firewall; for our hybrid firewall this includes stateful or application level firewall processing, such as ALG processing. These packets are sent to the Netfilter firewall executing on the microprocessors, allowing our hybrid firewall to benefit from both the time-tested robustness of the Netfilter firewall and the complex processing speed of the microprocessor, while still providing the highly optimized packet processing features of the NP.

A key assumption in the construction of our hybrid firewall is that a large fraction of the packets received may be processed by the packet filtering firewall alone. There is good reason to believe this is true. We studied traces from seven Internet connection points collected on June 6, 2004 by the National Laboratory for Applied Network Research (www.nlanr.net). 90% of the packets were TCP, and we measured an average of 15 packets per connection, so flow setup and teardown, even at the TCP level, could be accomplished by forwarding as little as 13% of the packets to the microprocessor. FTP traffic, which would require special ALG handling, represented only 4% of the packets seen; SSH traffic was also 4%. While the fraction of packets filterable by just the packet filtering firewall will vary with rule and traffic mix, real traffic seems to support the hybrid approach.



4.1 Getting Packets to our Applications


As shown in Figure 8, the hybrid firewall builds upon the PCI communications library and utilizes pre-existing software within the Linux kernel and the IXP. The PCI Comms Kernel Module exposes a standard Linux network driver interface, allowing packets to travel between the MEs and the Linux network stack in both directions. Upper layer modules use this driver just as they would a regular NIC driver. A lightweight Netfilter protocol interface module (NFPI) injects packets into the standard Netfilter infrastructure, bypassing the IPv4 forwarding and Ethernet processing already completed by the NP.



Figure 8. Hybrid Firewall Software Architecture (user space Snort IDS and IPTables tools above the Linux kernel network stack, Netfilter firewall, NFPI, and PCI Comms Kernel Module on the general purpose processor; PCI Comms microblock, IPv4 forwarding, and Ethernet processing on the network processor, connected over 64/66 PCI)

The PCI Comms Kernel Module polls the ME→CPU communication rings for new packets arriving from the MEs. We chose a polling interface both because it is a good match for our high traffic rate application and because interrupts from the MEs were not available to us. For heavy traffic loads, polling works best since it allows the operating system to schedule the work, ensuring other applications run uninterrupted. We simulated interrupts combined with polling by scheduling our polling routine as a tasklet during times of light traffic. The operating system schedules tasklets much less frequently than the polling interface runs, saving CPU cycles. If during polling the driver determined that no packets had been received from the PCI Comms microblock, the driver removed itself from the polling queue and scheduled a tasklet to check the rings later. Because the latency between tasklet calls can be quite long, we had our tasklet immediately place the driver back in the polling queue even if no packets had been received. If interrupts had been available, we would have used them during times of low link utilization, switching to polling under heavy load.

The Linux operating system uses a packet descriptor structure called an sk_buff to hold packet data and metadata as the packet traverses the operating system. NIC devices typically DMA directly into the packet data portion of the sk_buff. Similarly, the PCI Comms Kernel Module keeps a pool of allocated sk_buffs for use in microengine DMA transactions. The pool slowly drains over time; software replenishes it when it reaches a low water mark, or when there is a break in the packet passing work. We chose this design because keeping DMA buffers available on the queue for the NP requires that the CPU minimize the time spent managing buffers between polls.

The network driver interface under Linux requires that the driver set the protocol field in the sk_buff structure to contain the type field from the Ethernet packet header. This protocol field is then used by the Linux network core to dispatch packets to whichever protocol has registered a receive handler for the specific protocol ID in that packet's descriptor. For example, packets with a protocol ID of 0x800 normally proceed to the IP receive handling routine. The PCI Comms library uses a unique Application ID to determine the destination application for each packet. For the hybrid firewall application, because Ethernet and IP packet processing has already occurred on the NP, packets should simply be handed to IPTables for firewall processing. We use the existing Linux dispatch code and bypass the Linux IP stack by registering a receive handler for each unique Application ID and setting the protocol field in the sk_buff to match the Application ID indicated by the microblock. Linux forwards IP packets to our application specific protocol as long as the protocol field in the sk_buff matches the protocol field registered by the application with the networking core. This paper refers to the protocol module as the NF Protocol Interface (NFPI), because it is responsible for interfacing to the Netfilter modules. This method provides two very important advantages. First, we do not have to write our own dispatch code. Second, by using the existing network dispatch code in the Linux network core, we enable other legacy applications to receive our packets as well. For example, to send a packet to an existing Linux protocol stack, the NP sets the Application ID to that particular network protocol ID (such as 0x800 for TCP/IP).
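The dispatch path just described can be illustrated with the kernel's standard packet_type interface. This is not the prototype's driver code: ETH_P_NFPI is a made-up protocol value standing in for an Application ID (the local experimental EtherType is used here), and the three-argument handler matches early 2.6 kernels (later kernels add a fourth net_device argument).

/*
 * Illustrative NFPI-style receive registration against the 2.6-era API.
 * The real prototype stamps skb->protocol with an Application ID chosen
 * by the microblock; the value below is hypothetical.
 */
#include <linux/module.h>
#include <linux/init.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/if_ether.h>

#define ETH_P_NFPI 0x88b5   /* local experimental EtherType as a stand-in */

static int nfpi_rcv(struct sk_buff *skb, struct net_device *dev,
                    struct packet_type *pt)
{
        /*
         * The NP has already done Ethernet and IP processing, so a real
         * implementation would adjust the descriptor and hand the packet
         * to IPTables here (Section 4.2). This sketch simply frees it.
         */
        kfree_skb(skb);
        return 0;
}

static struct packet_type nfpi_ptype = {
        .type = __constant_htons(ETH_P_NFPI),
        .func = nfpi_rcv,           /* Linux dispatches matching skbs to us */
};

static int __init nfpi_init(void)
{
        dev_add_pack(&nfpi_ptype);
        return 0;
}

static void __exit nfpi_exit(void)
{
        dev_remove_pack(&nfpi_ptype);
}

module_init(nfpi_init);
module_exit(nfpi_exit);
MODULE_LICENSE("GPL");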

4.2 Filtering Packets


Netfilter's design assumes operation within the context of an IP network protocol. The NF Protocol Interface performs the minimal processing required on the packet's descriptor, adjusting the fields in the sk_buff to point to the IP header as if IP packet processing had occurred. For the FORWARD hook required by packet filtering, this is all that is required to ensure proper IPTables processing. Note that this path avoids unnecessary TCP/IP processing on the general-purpose processor. Once packets circulate through the standard IPTables entry points, IPTables makes a forward/drop determination based on its own rule set. IPTables then sends the sk_buff to the PCI Communications Module through the NF Protocol Interface, along with a flag to indicate whether to drop or accept the packet. The drop/accept notification is placed on the CPU→ME communication ring.
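The hand-off into IPTables can be pictured with the 2.6-era NF_HOOK macro and sk_buff layout (skb->nh.raw; later kernels use skb_reset_network_header()). This is a sketch, not the prototype's NFPI: nfpi_send_verdict() is a hypothetical helper that places the packet ID and verdict on the CPU→ME ring, and DROP reporting is only hinted at in a comment.

/*
 * Sketch only: handing an NP-delivered packet to Netfilter's FORWARD chain.
 */
#include <linux/skbuff.h>
#include <linux/netdevice.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/ip.h>

static void nfpi_send_verdict(struct sk_buff *skb, int pass)
{
        /* would place the packet's NP ID plus PASS/DROP on the CPU->ME ring */
}

/* okfn: Netfilter invokes this only if every FORWARD rule said ACCEPT */
static int nfpi_accept(struct sk_buff *skb)
{
        nfpi_send_verdict(skb, 1);  /* tell the NP to transmit its stored copy */
        kfree_skb(skb);             /* coupled mode: the CPU copy is done      */
        return 0;
}

int nfpi_filter(struct sk_buff *skb, struct net_device *in,
                struct net_device *out)
{
        /* The NP already validated Ethernet and IP, so just point the
         * descriptor at the IP header as if ip_rcv() had run. */
        skb->nh.raw = skb->data;

        /* Packets the rules reject are freed inside Netfilter and never
         * reach nfpi_accept(); the real NFPI reports that as a DROP. */
        return NF_HOOK(PF_INET, NF_IP_FORWARD, skb, in, out, nfpi_accept);
}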

4.3 Sharing State


Sharing state (rules tables and connection tracking tables) between a general-purpose processor and the MEs is difficult in our prototype. The MEs cannot match the virtual to physical address mapping performed on the microprocessor, and so are unable to follow the virtual pointers within microprocessor resident data structures. Moreover, the MEs cannot atomically update data structures held in CPU memory, since PCI does not provide atomic memory transactions. Without atomic updates, it is impossible to implement semaphores to guard read/write state, forcing the programmer into a single-writer model that may not be performance efficient. Netfilter does not provide for such state sharing with PCI resident processors. Connection tracking state in IPTables contains both data and pointers to data; rules state likewise contains pointers as well as data. To avoid address translation pitfalls, an ME-specific copy of the Netfilter connection table was created for the ME resident firewall.


We implemented a UDP socket interface between the microprocessors and the IXP2800 firewall software to enable the microprocessor to place a copy of its connection tracking state into NP resident memory. This simple RPC-like interface allows the CPU resident NFPI to update state in the IXP2800's memory. The NP resident state is formatted specifically for optimized ME accesses. Runtime firewall rules updates could use the same mechanism, but were not implemented.
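The state push amounts to serializing connection entries into datagrams. The sketch below shows the idea in plain user-space C for clarity; the message layout, port number, and NP address are assumptions, and the prototype's actual interface lives inside the kernel-resident NFPI.

/*
 * Sketch of the RPC-like connection-state push described in Section 4.3.
 * All concrete values (port, struct layout) are illustrative assumptions.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

#define NP_STATE_PORT 5555            /* hypothetical */

struct ct_update {                    /* one connection-tracking entry */
        uint32_t src_ip, dst_ip;      /* network byte order            */
        uint16_t src_port, dst_port;
        uint8_t  proto;               /* IPPROTO_TCP, IPPROTO_UDP, ... */
        uint8_t  state;               /* e.g. NEW/ESTABLISHED encoding */
} __attribute__((packed));

int push_ct_entry(const char *np_addr, const struct ct_update *u)
{
        struct sockaddr_in dst = { .sin_family = AF_INET,
                                   .sin_port   = htons(NP_STATE_PORT) };
        ssize_t rc;
        int fd;

        if (inet_pton(AF_INET, np_addr, &dst.sin_addr) != 1)
                return -1;
        fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0)
                return -1;

        /* One datagram per entry; the NP-side handler reformats the tuple
         * into its ME-optimized table (Section 4.3). */
        rc = sendto(fd, u, sizeof(*u), 0,
                    (struct sockaddr *)&dst, sizeof(dst));
        close(fd);
        return rc == (ssize_t)sizeof(*u) ? 0 : -1;
}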

4.4 Extending the hybrid firewall


By sticking with standard (rather than proprietary) interfaces to the Linux kernel, our hybrid firewall can be extended with unmodified, off-the-shelf software. We tested this by adding the Snort* Intrusion Detection System (IDS) program to our firewall, as shown in Figure 8. This addition required nothing more than the standard installation of the IDS application.

5. HYBRID FIREWALL PERFORMANCE

An expected advantage of the hybrid firewall, and in fact our motivation for exploring it, is superior performance. Figure 9 shows the performance of the hybrid firewall for various percentages of packets sent to the microprocessors. The chart also contains a Netfilter-only line showing the performance of the original, non-hybrid Netfilter firewall on a platform using a pair of gigabit Ethernet NICs in place of the NP. Throughput includes Ethernet header and higher layer bits, but not Ethernet preamble or inter-packet gap. The 0% line (NP alone handles all packets) is 2 Gb/s: full line rate. When all of the packets go to Netfilter (100%), the performance of the hybrid firewall is roughly equivalent to the Netfilter-only firewall, indicating that our NP based packet filtering firewall and its connection to the Netfilter firewall perform reasonably compared with the off-the-shelf configuration of the Netfilter-only case. For cases where the NP handles most or all of the packets, the advantages of our hybrid firewall become clear. The 0% and 10% lines (i.e., 10% forwarded to Netfilter) achieve full line rate for every packet size. The 30% case achieves full line rate with packets of 128 bytes or longer. Finally, the system is able to supply full line rate, even with half of the packets going to the general-purpose CPU, for continuous streams of packets as small as 256 bytes. This represents an excellent performance improvement for the hybrid firewall over the standard Netfilter firewall, verifying our hypothesis.

Figure 9. Hybrid Firewall Performance (throughput vs. packet size for various percentages of packets forwarded to Netfilter, plus a Netfilter-only reference)

5.1 Cost of Netfilter Processing

To measure the cost of Netfilter processing, we built a loopback kernel module that returns packets received from the PCI Comms Kernel Module with no processing. Figure 10 compares the throughput achieved by the hybrid firewall sending all packets to Netfilter with the same setup using the loopback driver in place of Netfilter. Netfilter imposes only a slight processing burden: for 64-byte packets, the introduction of Netfilter processing results in an 11% throughput reduction, and Netfilter does not impose enough overhead to influence performance for large packets. Netfilter is clearly not the bottleneck limiting small packet performance.

Figure 10. Netfilter Firewall compared to Loopback
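The loopback module used for this measurement can be pictured as a drop-in replacement for the NFPI receive path that acknowledges every packet without consulting any rules, so the throughput difference against the full firewall isolates Netfilter's cost. The sketch reuses the hypothetical ETH_P_NFPI value and nfpi_send_verdict() helper from the earlier sketches; it is illustrative, not the prototype's code.

/*
 * Sketch of the Section 5.1 loopback: accept everything, no rule lookup.
 * Handler signature follows the early 2.6 packet_type interface, as before.
 */
#include <linux/module.h>
#include <linux/init.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

#define ETH_P_NFPI 0x88b5                     /* hypothetical, as before */

static void nfpi_send_verdict(struct sk_buff *skb, int pass)
{
        /* would enqueue the packet ID plus PASS/DROP on the CPU->ME ring */
}

static int loopback_rcv(struct sk_buff *skb, struct net_device *dev,
                        struct packet_type *pt)
{
        nfpi_send_verdict(skb, 1);    /* immediate accept, no firewall work */
        kfree_skb(skb);
        return 0;
}

static struct packet_type loopback_ptype = {
        .type = __constant_htons(ETH_P_NFPI),
        .func = loopback_rcv,
};

static int __init lb_init(void)  { dev_add_pack(&loopback_ptype); return 0; }
static void __exit lb_exit(void) { dev_remove_pack(&loopback_ptype); }

module_init(lb_init);
module_exit(lb_exit);
MODULE_LICENSE("GPL");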

5.2 Analysis of CPU Processing Cycles


Table 3 shows microprocessor utilization for one of the two Intel Xeon processors used; both processors showed very similar utilization. The table includes all functions consuming greater than 1% of the processor's cycle count. The data shows the hybrid firewall processing a stream of 64-byte packets, with 100% forwarded to Netfilter. All of these functions relate to packet processing and together account for 95% of the total cycle count.


Table 3. Cycle Count for Hybrid Firewall (continuous stream of 64-byte packets)
  % of Total   Image              Symbol
  15.3         Linux              tasklet_action
  13.0         PCI comms Driver   ia_poll
  11.7         PCI comms Driver   ia_ingress_tasklet
  10.1         Linux              net_rx_action
   7.9         Linux              do_softirq
   4.6         PCI comms Driver   ia_get_msg
   4.2         Linux              netif_receive_skb
   3.9         PCI comms Driver   ia_rx_msg
   3.9         PCI comms Driver   ia_msg_rx_alloc_backup
   3.8         Linux              tasklet_schedule
   2.0         PCI comms Driver   ia_check_rings
   2.0         Linux              eth_type_trans
   2.0         Linux              skb_dequeue
   1.9         Linux              skb_release_data
   1.7         Linux              kfree
   1.6         Linux              kmalloc
   1.3         Linux              skb_queue_tail
   1.2         Linux              alloc_skb
   1.2         Linux              memcpy
   1.2         PCI comms Driver   me_pkt_forward

The actual firewall packet processing is only a small percentage of the overall cycle count, so small that the functions responsible do not appear in the table. All observed Netfilter function calls accounted for only 0.91% of the cycle count in the above profile; likewise, the NF Protocol Interface module accounted for only 0.66%. This analysis identifies several high cost tasks. Managing the OS buffers (sk_buff) accounts for about 10% of the CPU time (netif_receive_skb, skb_dequeue, skb_release_data, kfree, kmalloc, alloc_skb, skb_queue_tail). Polling rings within the PCI communications library uses a combination of a polling loop (ia_poll, 13%) and Linux tasklet scheduling (tasklet_action, 15.3%, and ia_ingress_tasklet, 11.7%). Clearly, polling is the area we would first look to tune; having a system with interrupts available would eliminate this tasklet scheduling overhead.

6. CONCLUSIONS

Using Linux Netfilter, an existing full featured firewall running on a high performance general-purpose CPU, we were able to add a simple packet filtering firewall running on a network processor and achieve substantial performance gains. For small packets our hybrid firewall prototype was able to achieve 2 Gb/s line rate with almost 30% of the packets forwarded to Netfilter, a greater than 4X throughput gain over the standard Netfilter firewall. Our hybrid firewall exhibited the superior performance we suspected it would.

A number of characteristics of the PCI connection between our microprocessors and the network processor made programming difficult. The multiple address spaces associated with each processor, mapped together through the non-transparent PCI-to-PCI bridge, were difficult for the programmer to manage; synchronizing data structures among the multiple address spaces is both complicated and tedious. The lack of atomic transactions to PCI resident memory drove constraints into the creation and use of data structures accessible from both the microprocessor and the MEs. The inability of the MEs to use microprocessor virtual addresses also complicated data structure creation and use. The low bandwidth and long latency associated with ME PCI reads drove our message passing architecture to consume extra memory bandwidth and made effective state sharing almost impossible. Polling and memory allocation for packet buffers consumed more microprocessor cycles than we hoped. We will seek to improve these features in future implementations of our hybrid approach.

7. RELATED WORK
Brink et al. [7] present measurements for IPv4 forwarding, a Netfilter firewall, and IPsec on Linux* 2.4.18 running on a dual Intel Xeon processor system. Their measurements show lower-than-line-rate throughput for smaller packets, concurring with ours. The authors also present IPv4 forwarding results for a dual Intel IXP2400 network processor system showing close to the maximum theoretical line rate for all packet sizes. Based on benchmarks published by the Network Processing Forum [8] and the IETF [9], Kean [10] presents a methodology for benchmarking firewalls, along with measurements for an earlier generation network processor, the Intel IXP1200. Kean's measurements show significant roll-off for smaller packet sizes, though not as large as the Netfilter firewall's. Alternate firewall platform architectures have been explored before. The Twin Cities prototype provided a coherent shared memory interface between an Intel Pentium III processor and an IXP1200 [11]. Twin Cities did not target an existing CPU firewall (Netfilter) and used custom hardware. Other authors have explored different architectures, including FPGA based firewalls [12] and firewall targeted network processor designs [13]. Corrent* [14] advertises a firewall that most closely matches the work described here, using a combination of network processors and general-purpose processors to execute a CheckPoint* firewall. Corrent even shows that their approach leads to enhanced performance for small packets. IPFabrics* [15] offers both a PCI add-in card that holds two Intel IXP2350s and a Packet Processing Language that enables quick programming of applications like firewalls for the NPs.



8. ACKNOWLEDGEMENTS
The authors would like to acknowledge Santosh Balakrishnan, Alok Kumar, and Ellen Deleganes for the NP based firewall that served as an early version of the NP code used in this paper. We would also like to extend a special thank you to Rick Coulson and Sanjay Panditji for their steadfast support of this work, and to Raj Yavatkar for his valuable guidance.



Pentium III is a registered trademark of Intel Corporation or its subsidiaries in the United States and other countries. * Other brands and names are the property of their respective owners.


9. REFERENCES
[1] W. R. Cheswick, S. Bellovin, A. Rubin. Firewalls and Internet Security, Second Edition. Addison-Wesley, San Francisco, 2003.
[2] M. Kounavis, A. Kumar, H. Vin, R. Yavatkar, A. Campbell. Directions in Packet Classification for Network Processors. 2004. http://comet.ctr.columbia.edu/~campbell/papers/np2.pdf
[3] R. Peschi, P. Chandra, M. Castelino. IP Forwarding Application Level Benchmark v1.6. Network Processing Forum, May 12, 2003. http://www.npforum.org/techinfo/approved.shtml#ias
[4] Computer Emergency Response Team. CERT Advisory CA-1996-21: TCP SYN Flooding and IP Spoofing Attacks. Nov. 29, 2000. http://www.cert.org/advisories/CA-1996-21.html
[5] Intel IXP2800 Network Processor. Intel Corporation. http://www.intel.com/design/network/products/npfamily/IXP2800.htm
[6] H. Welte. What is Netfilter/IPTables? http://www.netfilter.org
[7] P. Brink, M. Castelino, D. Meng, C. Rawal, H. Tadepalli. Network Processing Performance Metrics for IA- and NP-Based Systems. Intel Technology Journal, Volume 7, Issue 4, 2003, pp. 78-91.

[8] NPF Benchmarking Implementation Agreements. Network Processing Forum. http://www.npforum.org/techinfo/approved.shtml#ias
[9] B. Hickman, D. Newman, S. Tadjudin, T. Martin. RFC 3511: Benchmarking Methodology for Firewall Performance. The Internet Society, April 2003. http://www.faqs.org/rfcs/rfc3511.html
[10] L. Kean, S. B. M. Nor. A Benchmarking Methodology for NPU-Based Stateful Firewall. APCC 2003, Volume 3, 21-24 Sept. 2003, pp. 904-908.
[11] F. Hady, T. Bock, M. Cabot, J. Meinecke, K. Oliver, W. Talarek. Platform Level Support for High Throughput Edge Applications: The Twin Cities Prototype. IEEE Network, July/August 2003, pp. 22-27.
[12] A. Kayssi, L. Harik, R. Ferzli, M. Fawaz. FPGA-based Internet Protocol Firewall Chip. Electronics, Circuits and Systems, 2000 (ICECS 2000), Volume 1, 17-20 Dec. 2000, pp. 316-319.
[13] K. Vlachos. A Novel Network Processor for Security Applications in High-Speed Data Networks. Bell Labs Technical Journal 8(1), 2003, pp. 131-149.
[14] Corrent Security Appliances Sustain Maximum Throughput Under Attack. http://www.intel.com/design/embedded/casestudies/corrent.pdf
[15] Double Espresso. IPFabrics*. http://www.ipfabrics.com/products/de.php

