

Dedication

I dedicate this dissertation to my loving parents, my wife, my kids (Suha, Amr, Marwh), and my brothers and sister.


Acknowledgment
I thank Allah (SWT) for granting me the guidance, patience, health, and determination to successfully accomplish this work. I would like to express my deep appreciation and gratitude to my thesis advisor, Dr. Khalid Salah, for his constant help, guidance, encouragement, and invaluable support. Thanks are due to my thesis committee members, Dr. Muhammed Alsuwaiyel and Dr. Nasir Darwish, for their cooperation, comments, and support. I would like to thank King Fahd University of Petroleum and Minerals for sponsoring me throughout my graduate studies. I also would like to thank my parents, my wife, and my kids, who always support me with their love, patience, encouragement, and constant prayers. I would like to thank my brothers and sister for their love and support throughout my study. Finally, I would like to thank all my friends, mostly those in the ICS Department, and all those who helped in this thesis. Special thanks to Mr. M. Saif and Mr. W. Al-Kuti for their feedback and encouragement.


Table of Contents

Dedication  iii
Acknowledgment  iv
Table of Contents  v
List of Tables  viii
List of Figures  ix
Thesis Abstract  xi
INTRODUCTION  1
1.1 Background  1
1.2 Motivation  5
1.3 Main Contributions  6
1.4 Thesis Organization  7
LITERATURE REVIEW  8
2.1 Receive Livelock Phenomenon  8
2.2 Ideal Scheme  10
2.3 Normal Interruption  10
2.4 Interrupt Disable-Enable Scheme  11
2.5 Interrupt Coalescing  12
2.6 Polling  13
2.7 Assumed Architecture Model  16
2.8 Self-Similar Traffic  19
SIMULATION STUDY USING POISSON TRAFFIC  22
3.1 Poisson Traffic Model  22
3.2 Generating Poisson Traffic  23
3.3 Simulation Model  24
3.4 Simulation Guidelines  24
3.5 Simulation Parameters  25
3.6 Comparison and Numerical Results  26
SIMULATION STUDY USING BURSTY TRAFFIC  35
4.1 Bursty Traffic Model  36
4.2 Comparison and Numerical Results  52
HYBRID SCHEME  60
5.1 Background  61
5.2 Switching Mechanism Evaluation  63
5.3 Performance Evaluation of the Hybrid Scheme  70
ARRIVAL RATE ESTIMATION  77
6.1 Requirements  78
6.2 Existing Rate Estimation Algorithms  79
6.3 Evaluation and Comparison  88
6.4 Design Issues  97
6.5 Implementation of the Proposed EWMA in Linux  104
CONCLUSION AND FUTURE WORK  110
Bibliography  113
Vita  120


LIST OF TABLES
Table 5.1. Investigation on the Number of Switching Modes for Bursty Traffic  70
Table 6.2. Cost of Multiplication and Alternate One  109


List of Figures
Figure 2.1. Receive livelock phenomenon  9
Figure 2.2. Pure Polling Algorithm  14
Figure 2.3. NAPI Polling Algorithm  15
Figure 2.4. Architecture model of DMA-based design reprinted in [SAL03a]  18
Figure 3.5. CPU Availability for Poisson Traffic  27
Figure 3.6. System Delay for Poisson Traffic  30
Figure 3.7. System Delay at Low Rate for Poisson Traffic  30
Figure 3.8. System Throughput for Poisson Traffic  32
Figure 4.9. Bursty Traffic versus Poisson Traffic  37
Figure 4.10. Pareto and Exponential Density Functions  41
Figure 4.11. Comparison of Actual and Synthetic Ethernet Traffic [LAL95]  44
Figure 4.12. A Single ON/OFF Source  46
Figure 4.13. Self-similar Traffic Generation Model  47
Figure 4.14. Burstiness for 10 and 100 ms Time Unit  51
Figure 4.15. Comparison of Simulation AC to Theoretical AC  52
Figure 4.16. CPU Availability for Bursty Traffic  55
Figure 4.17. System Delay for Bursty Traffic  56
Figure 4.18. System Delay at Low Rate for Bursty Traffic  57
Figure 4.19. System Throughput for Bursty Traffic  58
Figure 4.20. System Delay under Bursty and Poisson Traffic  59
Figure 5.21. Switching Mechanism in terms of CPU Availability (Poisson Traffic)  66
Figure 5.22. Switching Mechanism in terms of Latency (Poisson Traffic)  67
Figure 5.23. Switching Mechanism in terms of CPU Availability (Bursty Traffic)  68
Figure 5.24. Switching Mechanism in terms of Latency (Bursty Traffic)  68
Figure 5.25. CPU Availability Including Hybrid Scheme (Poisson Traffic)  72
Figure 5.26. System Delay at High Rate Including Hybrid Scheme (Poisson Traffic)  72
Figure 5.27. System Delay at Low Rate Including Hybrid Scheme (Poisson Traffic)  73
Figure 5.28. System Throughput Including Hybrid Scheme (Poisson Traffic)  73
Figure 5.29. CPU Availability Including Hybrid Scheme (Bursty Traffic)  74
Figure 5.30. Zoom in of Figure 5-9  74
Figure 5.31. System Delay at High Rate Including Hybrid Scheme (Bursty Traffic)  75
Figure 5.32. System Delay at Low Rate Including Hybrid Scheme (Bursty Traffic)  75
Figure 5.33. System Throughput Including Hybrid Scheme (Bursty Traffic)  76
Figure 6.34. Our Proposed Measurement for EWMA  88
Figure 6.35. Average Relative Estimation Error for Bursty Traffic  90
Figure 6.36. Average Relative Estimation Error for Poisson Traffic  90
Figure 6.37. Stability Evaluation with Bursty Traffic  92
Figure 6.38. Stability Evaluation with Bursty Traffic (Zoom in Figure 6.37)  92
Figure 6.39. Stability Evaluation with Poisson Traffic  93
Figure 6.40. Stability Evaluation with Poisson Traffic (Zoom in Figure 6.39)  93
Figure 6.41. Agility Evaluation with Bursty Traffic  95
Figure 6.42. Agility Evaluation with Bursty Traffic (Zooming in Figure 6.41)  95
Figure 6.43. Agility Evaluation with Poisson Traffic  96
Figure 6.44. Agility Evaluation with Poisson Traffic (Zooming Figure 6.43)  96
Figure 6.45. Time Window Size for Bursty Traffic  99
Figure 6.46. Time Window Size for Poisson Traffic  100
Figure 6.47. Stability with Different Weight Factors for Bursty Traffic  102
Figure 6.48. Agility with Different Values of Weight Factor for Bursty Traffic  102
Figure 6.49. Stability for Different Weight Factors for Poisson Traffic  103
Figure 6.50. Agility for Different Weight Factors for Poisson Traffic  103
Figure 6.51. Timer Structure  105
Figure 6.52. Generating a New Estimator  107
Figure 6.53. Estimation Function  108

Thesis Abstract
NAME: Fahd AbdulSalam AL-Haidari
TITLE: Impact of Bursty Traffic on the Performance of Popular Interrupt Handling Schemes for Gigabit-Network Hosts
MAJOR FIELD: Computer Science
DATE OF DEGREE: June 2007

With the current wide deployment of Gigabit Ethernet technology, the network performance bottleneck has shifted from the network to the end hosts and servers. The performance of Gigabit network hosts can be severely degraded by the interrupt overhead caused by heavy incoming traffic. To mitigate interrupt overhead, different solutions have been proposed in the literature. The most popular interrupt-handling schemes primarily include interrupt coalescing, polling, and interrupt disabling and enabling. When the performance of these interrupt handling schemes is studied, the network traffic is often assumed to be Poisson, an assumption that can be invalid. This thesis presents a study of the impact of bursty traffic on the performance of these schemes in terms of throughput, system latency, packet loss, and CPU utilization. The study is conducted using discrete-event simulation, and the simulation results are compared with those reported in previous work based on Poisson traffic. In addition, the thesis studies a novel hybrid scheme combining interrupt disabling-enabling and pure polling. The performance of the hybrid scheme is compared with that of the other schemes, and its design and implementation aspects are addressed.

MASTER OF SCIENCE DEGREE KING FAHD UNIVERSITY OF PETROLEUM AND MINERALS DHAHRAN, SAUDI ARABIA




CHAPTER 1

INTRODUCTION

1.1 Background

Interrupt overhead of high-speed network devices can severely impact system performance. Traditional operating systems were designed to handle network devices that interrupt at a rate of around 1000 packets per second, as is the case for shared 10 Mbps Ethernet. The cost of handling interrupts in these older systems was low enough that any normal system would spend only a fraction of its CPU time handling interrupts. For Gigabit networks, however, the interrupt rate for maximum-sized packets of 1500 bytes rises to 80,000 interrupts per second; with 10 Gigabit Ethernet and smaller packets, the problem only gets worse. With Gigabit Ethernet at the highest possible rate of 1.23 million interrupts per second, for a minimum-sized packet of 64 bytes, the CPU must process a packet in less than 0.82 µs in order to keep up with such a rate [MOR00]. Moreover, the interrupt cost is not expected to fall linearly with increasing processor clock speed, as I/O and memory speed limits dominate. Consequently, the shift to higher packet arrival rates can subject a host to congestive collapse.
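Rates of this order follow from simple line-rate arithmetic. The sketch below illustrates the calculation; the framing constants and the exact figures depend on which per-frame overhead (preamble, inter-frame gap) is counted, so they differ slightly from the [MOR00] numbers quoted above.

```python
# Back-of-the-envelope packet rates on Gigabit Ethernet (illustrative only;
# figures vary with the framing overhead assumed).

LINE_RATE_BPS = 1_000_000_000   # 1 Gbps
PREAMBLE_SFD = 8                # preamble + start-of-frame delimiter, bytes
IFG = 12                        # inter-frame gap, bytes

def max_packets_per_second(frame_bytes: int) -> float:
    """Maximum back-to-back frame rate for a given Ethernet frame size."""
    wire_bytes = frame_bytes + PREAMBLE_SFD + IFG
    return LINE_RATE_BPS / (wire_bytes * 8)

# 1518-byte maximum frames: roughly 81,000 frames/s, i.e. ~80,000
# interrupts/s when each packet raises one interrupt.
print(round(max_packets_per_second(1518)))

# 64-byte minimum frames: on the order of 1.5 million frames/s, leaving
# well under a microsecond of CPU time per packet.
print(round(max_packets_per_second(64)))
print(1e6 / max_packets_per_second(64))   # per-packet budget in microseconds
```

The per-packet CPU budget is simply the reciprocal of the packet rate, which is why sub-microsecond processing is required at minimum frame size.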

Interrupt-driven systems tend to perform very badly under heavy load. Every incoming packet generates a hardware interrupt, which carries the context-switching cost of saving and restoring processor state. More importantly, interrupt-level handling, by definition, has absolute priority over all other tasks. If the interrupt rate is high enough, the system will spend all of its time responding to interrupts and nothing else will be performed; hence, the system throughput will drop to zero. This situation is called receive livelock [RAM93]. In this situation, the system is not deadlocked, but it makes no progress on any of its tasks, causing any task scheduled at a lower priority to starve. At low packet arrival rates, the cost of interrupt overhead and the latency for handling incoming packets are low. However, interrupt overhead increases directly with the packet arrival rate, eventually causing receive livelock. A number of interrupt-handling schemes and solutions have been proposed in the literature [MOG97, IND98, ARO00, DOV01, BRU96, KEN96] to mitigate interrupt overhead and improve OS performance. The most popular interrupt-handling schemes primarily include interrupt coalescing, polling, and interrupt disabling and enabling. Other solutions include OS-bypass protocols, zero-copying, jumbo frames, pushing some or all protocol processing to hardware, etc.

In recent analyses of traffic measurements, evidence of non-Markovian effects, such as burstiness across multiple time scales, long-range dependence, and self-similarity, has been observed in a wide variety of traffic sources. Non-Markovian effects have been

observed in signaling traffic, variable bit-rate (VBR) video, Ethernet LAN traffic, ATM cell traffic, MAN traffic, and general Internet WAN traffic [PAX95]. Given the evidence of self-similarity in such a wide variety of sources, it is clear that any general model for data traffic must account for these properties, since the self-similarity of a traffic arrival process has significant effects on all metrics of networking performance, including throughput, packet loss rate, response time, and buffer occupancy. In this investigation, a simulation model accounting for these properties is used to compare and analyze the performance of the most popular interrupt-handling schemes, and subsequently to find the interrupt handling scheme that best suits Gigabit network traffic. The proposed simulation model considers much more realistic assumptions, such as variable packet sizes and bursty or self-similar traffic resulting from aggregating multiple streams (one stream per source), each of which consists of alternating Pareto-distributed ON-OFF periods.

Several studies examined the trade-offs between interrupts and polling and therefore suggested hybrid schemes. One implementation of such a scheme was proposed for the Windows NT platform [HAN97]. The implementation uses a two-level scheme in which interrupt-driven servicing of the network interfaces is used up to a certain level of network traffic; at that level and above, a polling thread is used. The switching mechanism is based on detecting user thread starvation, which indicates receive livelock. In [HAN97], the interrupt processing time is used as an indication of system

load. This indication can be measured by monitoring the execution time of the interrupt routine. Another implementation that combines interrupts with polling is called Hybrid Interrupt-Polling (HIP) [DOV01]. The HIP scheme adaptively switches between the use of interrupts and polling based on the observed rate of packet arrivals. Specifically, if packet arrivals are frequent and predictable, the receive mechanism operates in polling mode and interrupts are disabled. Otherwise, if packet arrivals are infrequent or less predictable, or if the number of consecutive unsuccessful polls exceeds a threshold, the receive mechanism operates in interrupt mode. This hybrid scheme uses normal interruption, which performs very poorly at light and moderately light loads in terms of throughput, CPU availability, and latency. Also in [DOV01], the switching between normal interruption and polling is carried out somewhat arbitrarily. In [BIS06], Biswas proposes a hybrid poll engine that operates in normal interrupt mode when low packet arrival rates are expected and switches to polling when higher packet arrival rates are expected. In every poll cycle, after completion of the event data processing task, the expected arrival rate is forecast. A hybrid scheme that combines interrupt disabling-enabling and pure polling has been proposed in [SAL07]. In this thesis, the hybrid scheme proposed in [SAL07] is studied using both Poisson and bursty traffic. In addition, some of its important design

and implementation issues are addressed, including the switching mechanism and the rate estimation algorithm.
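Rate-based switching of the kind used by HIP and by the scheme studied in this thesis requires an online estimate of the packet arrival rate. As a rough illustration only (the class name, the fixed-window sampling discipline, and all parameter values below are hypothetical and are not the thesis's algorithm), an exponentially weighted moving average (EWMA) estimator can be sketched as:

```python
# Hedged sketch of an EWMA arrival-rate estimator: packets are counted over
# a fixed time window, and each window's sample rate is blended into the
# running estimate with weight alpha.  All names/values are illustrative.

class EwmaRateEstimator:
    def __init__(self, window_s: float, alpha: float):
        self.window_s = window_s   # measurement window length in seconds
        self.alpha = alpha         # weight given to the newest sample
        self.count = 0             # packets seen in the current window
        self.rate = 0.0            # estimated packets per second

    def on_packet(self) -> None:
        self.count += 1

    def on_window_expiry(self) -> None:
        sample = self.count / self.window_s
        self.rate = self.alpha * sample + (1 - self.alpha) * self.rate
        self.count = 0

est = EwmaRateEstimator(window_s=0.001, alpha=0.25)
for packets_in_window in (100, 100, 400):   # a sudden burst in window 3
    for _ in range(packets_in_window):
        est.on_packet()
    est.on_window_expiry()
print(est.rate)   # estimate lags the burst: well below 400,000 pkt/s
```

A small alpha gives a stable but sluggish estimate, while a large alpha tracks bursts quickly at the cost of jitter; this stability/agility trade-off is exactly what Chapter 6 evaluates.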

1.2 Motivation

With the emergence of Gigabit networks, achieving high-performance communication becomes a challenge. Most modern operating systems depend on interrupts for event notification. As mentioned earlier, interrupt-driven systems tend to perform very badly in a Gigabit network environment. Different solutions have been proposed to eliminate interrupt overhead and resolve the receive livelock problem, including interrupt coalescing, interrupt enabling and disabling, polling, jumbo frames, etc. The performance of these solutions has been studied using the Poisson model as an input. Many studies have shown that this model fails to capture LAN traffic and have argued for the existence of self-similar phenomena in LAN traffic, yet none of these performance studies modeled the behavior of the popular interrupt handling schemes under bursty traffic. Here, we investigate the performance of those interrupt handling schemes in terms of throughput, system latency, packet loss, and CPU utilization using a discrete-event simulation model that considers bursty traffic and much more realistic assumptions suiting modern Gigabit network environments and hosts, such as a modern 2.53 GHz Pentium IV machine.

1.3 Main Contributions

The objective of this research work is to develop an accurate model that describes the data traffic and then, using this model, to develop a discrete-event simulation for studying the performance of the popular interrupt handling schemes. The main contributions of this thesis work are to:

- Develop a discrete-event simulation model for network traffic incorporating the aspects of bursty traffic.
- Study the performance of the interrupt handling schemes under much more realistic assumptions, such as variable packet sizes and input parameters that suit modern Gigabit network environments and hosts.
- Compare the developed model to previous models based on Poisson processes.
- Conduct a literature search on implementations of bursty traffic in simulation models.
- Obtain simulation results for various performance metrics, such as system throughput, latency, CPU availability to user applications, CPU utilization by interrupt overhead and protocol processing, packet loss, saturation conditions, and livelock conditions.
- Study a novel hybrid scheme.
- Discuss some of its design and implementation issues.

1.4 Thesis Organization

This thesis is organized as follows. Chapter 2 presents a review of existing interrupt handling schemes. A simulation model using Poisson traffic is introduced in Chapter 3. Chapter 4 details the techniques for generating bursty traffic and then evaluates the interrupt handling schemes under this traffic model. Based on the results reported in Chapters 3 and 4, Chapter 5 presents a novel hybrid scheme. Since rate estimation is key to the performance of the hybrid scheme, Chapter 6 studies arrival rate estimation and reports the corresponding results. Finally, Chapter 7 draws the conclusions and discusses possible future work directions.

CHAPTER 2

LITERATURE REVIEW
In this chapter, important topics of interest to the thesis work are discussed. These topics include the receive livelock phenomenon and five interrupt handling schemes: the ideal scheme, normal interruption, the interrupt disable-enable scheme, polling, and interrupt coalescing. In addition, the assumed architecture model and self-similar or bursty traffic are described. Finally, the problem statement of the thesis is outlined.

2.1 Receive Livelock Phenomenon

In this section, the phenomenon of receive livelock is briefly discussed. Incoming network packets received at a host must either be forwarded to other hosts, as is the case in PC-based routers, or delivered to application processes where they are consumed. The delivered system throughput is a measure of the rate at which such packets are processed successfully. Figure 2.1, adopted from [RAM93, MOG97], shows the delivered system throughput as a function of the offered input load. It should be noted that the figure conceptually illustrates the expected behavior of the system and does not illustrate analytical behavior. The figure shows that in the ideal case, regardless of packet arrival rate, every incoming packet is processed. However, all practical systems have finite

processing capacity, and cannot receive and process packets beyond a maximum rate. This rate is called the Maximum Loss-Free Receive Rate (MLFRR) [RAM93]. Up to the MLFRR the throughput is acceptable, and a well-behaved system maintains a relatively constant throughput beyond it.

[Figure: delivered throughput versus offered load, showing the ideal curve, the acceptable curve peaking at the MLFRR, and the livelock curve collapsing to zero]

Figure 2.1. Receive livelock phenomenon

Under network input overload, a host can be swamped with incoming packets to the extent that the effective system throughput falls to zero. Such a situation, in which a host has not crashed but is unable to perform useful work, such as delivering received packets to user processes or running other ready processes, is known as receive livelock. Similarly, under receive livelock, a PC-based router would be unable to forward packets to its outgoing interfaces. The main reason for receive livelock is that interrupts are handled at a very high priority level, higher than the software interrupts or input threads that process the packets

further up the protocol stack. At low packet arrival rates, such a design allows the kernel to process the interrupt of an incoming packet almost immediately, freeing up CPU processing power for other user tasks or threads before the arrival of the next packet. However, if another packet arrives before the handling of the first one completes (e.g., at high packet arrival rates), user tasks and threads starve, which results in dropped packets due to queue overflows, excessive network latency, and poor system throughput.
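The collapse described above can be illustrated with a deliberately crude fixed-budget model. All costs below are hypothetical (microseconds of CPU per packet) and serve only to show the shape of the livelock curve: interrupt handling has absolute priority, so its cost is paid for every arriving packet before any protocol processing happens.

```python
# Minimal fixed-budget illustration of receive livelock (hypothetical costs).

ISR_US = 5     # CPU microseconds of interrupt handling per packet
PROTO_US = 10  # CPU microseconds of protocol processing per packet

def delivered_throughput(arrival_rate: int) -> int:
    """Packets delivered per second when ISRs preempt everything else."""
    # ISR cost is charged first, for every arriving packet.
    cpu_left_us = max(0, 1_000_000 - arrival_rate * ISR_US)
    # Whatever CPU remains goes to protocol processing.
    return min(arrival_rate, cpu_left_us // PROTO_US)

for rate in (20_000, 66_000, 100_000, 200_000):
    print(rate, delivered_throughput(rate))
# Throughput tracks the offered load, peaks (the MLFRR of this toy model),
# then declines and finally collapses to zero: receive livelock.
```

Note how the model never deadlocks; at 200,000 packets/s the CPU is fully busy servicing interrupts, yet zero packets are delivered.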

2.2 Ideal Scheme

In the ideal scheme, the overhead involved in generating and handling interrupts is assumed to be zero. The ideal system gives the best performance that can possibly be obtained, thus serving as a reference or benchmark against which the other schemes are compared.

2.3 Normal Interruption

In normal interruption, interrupt overhead is incurred for each packet arrival, except when a packet arrives while an interrupt is already being handled. It should be kept in mind that modeling interrupts is a challenging and difficult task. Interrupt service routine (ISR) execution preempts protocol processing, and hence one may think that such an interrupt-driven system can be simply modeled as a priority queuing system

with preemption, in which there are two arrival streams of different priorities: the arrival of ISRs, with the higher priority, and the arrival of incoming packets, with the lower priority. However, this is an invalid model, because ISR handling is not incurred for every packet arrival. ISR handling is skipped if the system is already servicing another interrupt of the same level. In other words, if the system is currently executing an ISR, a new interrupt of the same priority level will be masked off and will receive no separate service.
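This masking behavior can be made concrete with a small sketch (the ISR cost and arrival times are hypothetical). As in the text's description, an interrupt that fires while the CPU is already inside a same-level ISR is masked off and charged no service of its own, which is why the system is not a simple two-priority preemptive queue.

```python
# Sketch of same-level interrupt masking: ISR time is charged only for
# interrupts that arrive while the CPU is NOT already inside an ISR.
# Times and the ISR cost are hypothetical.

ISR_COST = 3.0   # time units per serviced interrupt

def isr_time_charged(arrivals):
    """Total ISR time for a sorted list of packet-arrival times."""
    busy_until = 0.0
    total = 0.0
    for t in arrivals:
        if t >= busy_until:          # CPU idle at interrupt level: ISR runs
            busy_until = t + ISR_COST
            total += ISR_COST
        # else: same-level interrupt masked during the current ISR -- no charge
    return total

# Four packets, but only two ISRs are serviced (arrivals at 1.0 and 2.0
# fall inside the ISR started at 0.0 and are masked off).
print(isr_time_charged([0.0, 1.0, 2.0, 10.0]))
```

A naive priority-queue model would charge four ISRs here; the masked model charges two, which is the distinction the paragraph above draws.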

2.4 Interrupt Disable-Enable Scheme

The key idea behind the interrupt disable-enable handling scheme is inspired by [MOG97] and is implemented in some operating systems, such as Linux 2.4.x and the more recent Linux 2.6.x with NAPI (New API) [SALI01, DER04]. The idea of the pure interrupt disable-enable scheme is to keep interrupts turned off, or disabled, as long as there are packets to be processed by the kernel's protocol stack, i.e., as long as the protocol buffer is not empty. When the buffer becomes empty, the interrupts are turned on again, or re-enabled. Any packet that arrives while interrupts are disabled is DMA'd quietly to the protocol buffer without incurring any interrupt overhead.
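A minimal sketch of this discipline follows, using a hypothetical receiver class rather than the actual Linux/NAPI code: the first packet of a burst raises one interrupt and disables further interrupts, subsequent packets are buffered silently, and interrupts are re-enabled only once the protocol stack has drained the buffer.

```python
# Sketch of the interrupt disable-enable discipline (illustrative
# pseudocode-in-Python, not the Linux/NAPI implementation).

from collections import deque

class DisableEnableReceiver:
    def __init__(self):
        self.buffer = deque()           # protocol buffer / DMA ring
        self.interrupts_enabled = True
        self.interrupt_count = 0

    def packet_arrives(self, pkt) -> None:
        self.buffer.append(pkt)         # DMA'd in regardless of interrupt state
        if self.interrupts_enabled:
            self.interrupt_count += 1   # one interrupt opens the burst
            self.interrupts_enabled = False

    def protocol_stack_drains(self) -> None:
        while self.buffer:
            self.buffer.popleft()       # process one packet
        self.interrupts_enabled = True  # re-enable only when buffer is empty

rx = DisableEnableReceiver()
for p in range(5):
    rx.packet_arrives(p)
rx.protocol_stack_drains()
print(rx.interrupt_count)   # a single interrupt covered the 5-packet burst
```

The attraction of the scheme is visible directly: interrupt overhead is paid once per burst rather than once per packet, while low-rate traffic still gets the low latency of immediate interrupt notification.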


2.5

Interrupt Coalescing

One of the most popular solutions to mitigate interrupt overhead for Gigabit network hosts is interrupt coalescing (IC). This scheme is sometimes known in the literature as interrupt batching. In recent years, most network adapters or network interface cards (NICs) have been manufactured with interrupt coalescing support. Additionally, many operating systems, including Windows 2000 and Linux, support IC. IC is a mode or feature in which the NIC generates a single interrupt for a group of incoming packets, as opposed to the normal interruption mode in which the NIC generates an interrupt for every incoming packet. Although IC is an important feature that is widely available to mitigate interrupt overhead and improve performance, little research has been done to study its performance, except for what has been reported in [IND98, PRA04, ZEC02, KIM01]. In [IND98], a timer-based IC was studied using NIC emulation, with the OS providing overload conditions to the NIC to adjust the coalescing time. In [KIM01], the performance of interrupt coalescing was analyzed using an experiment that consisted of modifying the Linux kernel of a gateway. In [PRA04, ZEC02], the impact of hosts using interrupt coalescing on the overall bandwidth and latency of IP networks was investigated experimentally.

In IC, there are two schemes to mitigate the rate of interrupts: count-based IC and time-based IC. In the count-based IC mode, the NIC generates an interrupt when a predefined number of packets has been received. In the time-based IC mode, the NIC waits a predefined time period before it generates an interrupt; during this period, multiple packets can be received. The time period is restarted only when the previous period has expired and a fresh packet has been received. The coalescing parameters, namely the number of packets and the time period, are tunable and configurable parameters set by the device driver.

2.6

Polling

The idea of polling is to disable interrupts of incoming packets altogether, thereby eliminating interrupt overhead completely. In polling, the OS periodically polls its host system memory (i.e., the protocol processing buffer or DMA Rx ring) to find packets to process. In general, exhaustive polling is rarely implemented. Polling with quota is usually the case, whereby only a maximum number of packets is processed in each poll in order to leave some CPU power for application processing. There are primarily two drawbacks to polling. First, unsuccessful polls can be encountered, as packets are not guaranteed to be present at all times in the host memory, and thus CPU power is wasted. Second, processing of incoming packets is not performed immediately, as packets are queued until they are polled. Selecting the polling period is crucial. Very frequent polling can be detrimental to performance, as significant overhead is encountered at each poll. On the other hand, if polling is performed infrequently, packets may encounter long delays.

Pure Polling vs. NAPI Polling. Polling was proposed in [RAM93, MAQ96, ARO00, DOV01, DER04] to completely mitigate the interrupt overhead generated by packet arrivals. Releases of FreeBSD 4.6 and Linux 2.6 and thereafter can be configured for polling mode. Both of these releases use polling with quota; however, there is a difference. In FreeBSD polling, the interrupt is completely disabled for incoming packets. The algorithm for pure polling (or basic FreeBSD style) is illustrated in Figure 2.2. During the polling period, a limited number of packets, say Q, are processed by the protocol stack. In the situation where polling is triggered while in the midst of a polling cycle (i.e., while servicing packets by the protocol stack), the trigger is ignored, but polling is turned on again and overhead is incurred, which is purely a waste of CPU cycles.

Poll_INT (start of polling cycle T_Poll):
    1. Set Poll_Mode ON
    2. Trigger IP Processing

IP Processing:
    1. Set Poll_Mode OFF
    2. Process up to Q packets or until no more packets in buffer
    3. If Poll_Mode is ON goto Step 1

Figure 2.2. Pure Polling Algorithm

In Linux NAPI polling [SALI01, BIA05], a combination of the interrupt disable-enable scheme and polling is used. This is achieved by disabling the interrupts of incoming packets once a packet is received and triggering polling immediately (as illustrated in the NAPI algorithm of Figure 2.3). After processing Q packets, if the protocol processing buffer is not empty, polling is triggered again on the next polling cycle; otherwise, polling is turned off and the interrupts of incoming packets are re-enabled. The key idea behind Linux NAPI polling is to mitigate interrupt overhead at high load while improving responsiveness at low load.
Rcvd_Pkt_INT:
    1. Disable Rcvd_Pkt_INT
    2. Enable Poll_INT and trigger the handling of Poll_INT

Poll_INT (start of polling cycle T_Poll):
    1. Set Poll_Mode ON
    2. Trigger IP Processing

IP Processing:
    1. Set Poll_Mode OFF
    2. Process up to Q packets or until no more packets in buffer
    3. If Poll_Mode is ON goto Step 1
    4. If buffer is empty then disable Poll_INT and enable Rcvd_Pkt_INT

Figure 2.3. NAPI Polling Algorithm

The polling period in the latest versions of FreeBSD and Linux is not deterministic. Linux polls occur via softirqs; like all softirqs, they are typically executed at the end of a hardware interrupt and just before returning from kernel to user mode. FreeBSD polls occur at the end of clock interrupts and system calls, and within idle loops. With these techniques, context-switching overhead and cache pollution are decreased. It is also possible to avoid more context-switching overhead and cache pollution in polling by utilizing soft timers [ARO00]. With soft timers, the OS can choose to poll the protocol buffer at more convenient trigger points. OS-convenient points can occur when the system is already in the right context and has already suffered cache pollution [ARO00]. As opposed to Linux, the quota during polling in FreeBSD is dynamically adjusted. In FreeBSD, the quota depends on a number of configured parameters such as system load, CPU speed, the remaining CPU fraction for the polling process, and the maximum quota.

2.7

Assumed Architecture Model

The network interface consists of several hardware and software interacting units in both the computer and the NIC (Network Interface Card). The NIC consists of a receive part and a transmit part. The major components of the receive part of the network interface system are shown in Figure 2.4, reprinted from [SAL03a]. The design of the receive part is critical because there is less control over traffic being received concurrently from multiple stations. Also, the bursty nature of the received traffic affects the receiving side, since bursty traffic is believed to have a significant impact on network performance [PAR00]. Network adapters (NICs) are designed to transfer data between the NIC buffer and the host system memory. This can be done by employing Programmed Input/Output (PIO) engines or by employing Direct Memory Access (DMA) engines. With PIO engines, the copying of arrived packets from the NIC buffer to host kernel memory, or from the host kernel to the adapter, is performed by the CPU as part of Interrupt Service Routine (ISR) handling for each incoming packet. In PIO, the CPU during ISR handling sits in a tight loop copying data from the NIC memory into the host memory. After the packet is copied, the ISR then sets a software interrupt to trigger packet protocol processing. It is very likely that one or more incoming packets arrive during the execution of the ISR. For this reason, the ISR handling must not exit until all incoming packets are copied from the NIC to the kernel memory. When the network packet processing is triggered, the kernel processes the incoming packet by executing the network protocol stack and delivers the packet data to the user application [MOG97]. A major drawback of a PIO-based design is burdening the CPU with copying incoming packets from the NIC to kernel memory. With DMA engines, the NIC directly reads from and writes to the host system memory without any CPU involvement. This is done simply as the CPU gives the NIC a memory address, and the NIC writes to or reads from it through the bus interface, such as PCI. As a result, the CPU cycles consumed in copying packets from the NIC to kernel memory are minimal. In a Gigabit environment, the use of DMA becomes necessary in order to eliminate any CPU overhead involved in copying packets. Figure 2.4 shows the assumed typical system architecture model and the flow path of an incoming packet between the NIC, host memory, and application. The packet is moved from the NIC Rx buffer, through the bus interface such as PCI, to the Rx system buffer ring in the host memory, and then to the user application. As shown, at initialization the descriptor of the system Rx buffer ring is loaded into the DMA engine. The descriptor is basically a circular linked list of packet descriptors. Each packet descriptor contains the start address of the packet and its corresponding length. The length field of each packet is updated by the Rx DMA engine after the packet is copied into the host memory.


[Figure: DMA-based receive-path architecture. An incoming packet flows from the network link into the NIC Rx MAC, is copied by the Rx DMA engine across the PCI bus into the Rx circular buffer in host system memory (whose starting address is loaded at initialization), and is then handed by the device driver and network protocol stack to user-space applications.]

Figure 2.4. Architecture model of DMA-based design, reprinted from [SAL03a]

It is worth noting that the device driver for the network adapter is typically configured such that an interrupt is generated after the incoming packet has been completely moved into the host system memory. In order to minimize the ISR execution time, ISR handling mainly sets a software interrupt to trigger the protocol processing of the incoming packet. It should be noted that, in this situation, if two or more packets arrive during an ISR handling, the ISR time for servicing all of these packets equals the ISR time for servicing a single packet, with no extra time introduced [SAL03a]. After the notification of a new packet arrival, the kernel processes the packet by first examining the type of frame being received and then immediately invoking the proper handling stack function or protocol, e.g., ARP, IP, TCP, UDP, etc. Note that TCP or UDP processing includes IP processing. The packet remains in the kernel or system memory until it is discarded or delivered to the user application. The network protocol processing of packets carried out by the kernel continues as long as there are packets available in the system memory buffer. However, this protocol processing can be interrupted by ISR executions as a result of new packet arrivals, because packet processing by the kernel runs at a lower priority than the ISR. There are two possible options for delivering packets to user applications. The first option is to perform an extra copy of the packet from kernel space to user space. This is done as part of the OS protection and isolation of user space and kernel space. This option stretches the protocol processing time for each incoming packet. A second option eliminates this extra copy using different techniques described in [DRU96a, DRU96b, SHI01, BRU96, DIT97, KEN96], in which the kernel is written such that the packet is delivered to the application using block pointer manipulations. The simulation model proposed in this thesis captures both options by using a smaller processing time for the second option.

2.8

Self-Similar Traffic

When modeling network traffic, packet and connection arrivals are often assumed to be Poisson processes because such processes have attractive theoretical properties [FRO94]. A number of studies have shown, however, that for both local-area and wide-area network traffic, the distribution of packet inter-arrivals clearly differs from exponential [DAN92]. In [PAX95], Paxson and Floyd show that in some cases commonly-used Poisson models seriously underestimate the burstiness of TCP traffic over a wide range of time scales. They show that for SMTP (email) and NNTP (network news), connection arrivals are not well modeled as Poisson, since both types of connections are machine-initiated and can be timer-driven, so this phenomenon cannot be described with Poisson models. They also found that FTP data connections within a single FTP session (which are initiated whenever the user lists a directory or transfers a file) come clustered in bursts, and these arrivals are not well modeled as Poisson processes. In recent analyses of traffic measurements, evidence of non-Markovian effects, such as burstiness across multiple time scales, long-range dependence, and self-similarity, has been observed in a wide variety of traffic sources. Non-Markovian effects have been observed in signaling traffic, variable bit-rate (VBR) video, Ethernet LAN traffic, ATM cell traffic, MAN traffic, and general Internet WAN traffic. Some recent measurements of traditional telephony traffic have even shown signs of heavy-tailed holding times, which are closely related to the self-similar or long-range dependent (LRD) nature of traffic [PAX95]. Self-similarity represents the scale-invariant property: the burst structure of the flow density fluctuation shows the same tendency at various observation time scales. For actually measured traffic, the correlation in traffic can extend over a wide range of time scales; mathematically, the correlation function of realistic traffic decays slowly with lag time, which is the property of LRD, whereas for traditional model-generated traffic the correlation function decays exponentially fast, namely, short-range dependence.

It has been argued that a few independent mechanisms in different protocol layers lead to the self-similarity phenomenon. In the application layer, the distribution of file sizes is an important factor. In the UNIX file system, this is known to follow a power-law distribution. Moreover, Crovella [CRO97b] showed that the file size distribution of a web server decays as a power law. The lower-layer protocol splits the data from an application into packets, which are transmitted on the physical link; hence, the resulting packet traffic flow in the network is necessarily affected by self-similarity at the application level. Similarly, in the data link layer, the CSMA/CD mechanism is known to generate phase transition phenomena [FUK00]. In the transport layer, Veres and Boda [VER00] demonstrated that the deterministic process in TCP creates self-similarity.

CHAPTER 3

SIMULATION STUDY USING POISSON TRAFFIC


This chapter discusses the simulation study of the performance metrics, including CPU availability, latency, and throughput, assuming a Poisson model for the input traffic. The study includes the Poisson traffic model, the generation of Poisson traffic, and the performance evaluation of the interrupt handling schemes based on this model. In this chapter, the simulation guidelines presented in [LAW99] are adopted. The main contribution of the proposed study consists of considering much more realistic assumptions that suit the modern Gigabit network environment and hosts, such as a modern 2.53 GHz Pentium-IV machine. Furthermore, the simulation study includes the NAPI interrupt scheme.

3.1

Poisson Traffic Model

Stochastic models of packet traffic used in the past were almost exclusively Markovian in nature, or more generally short-range dependent traffic processes. Those traffic models, now called classical models, assume a Poisson arrival rate. These mathematical models, used to describe the old telephony system, were based on the observation that the traffic followed the nature of a Poisson distribution. This distribution implies that call arrivals to the network are independent and that call inter-arrival times are exponentially distributed. The new data traffic has vastly different statistical characteristics, which are much more irregular and variable, undermining the basis for the traditional models. As described in Chapter 4, the inter-arrival time distribution in data traffic is not of a Poisson nature. However, as discussed in [PAX95], the call arrival distribution might follow a Poisson distribution, as in the case of FTP or Telnet connections.

3.2

Generating Poisson Traffic

To generate Poisson sampling intervals, one first determines the rate λ at which the singleton measurements will on average be made (e.g., for an average sampling interval of 30 seconds, λ = 1/30 if the units of time are seconds). One then generates a series of exponentially distributed random numbers E1, E2, ..., En. The first measurement is made at time E1, the next at time E1 + E2, and so on. One technique for generating exponentially distributed random numbers is based on the ability to generate random numbers U1, U2, ..., Un that are uniformly distributed between 0 and 1. Given such Ui, one generates Ei using:

Ei = -ln(Ui) / λ    (3-1)

where ln(Ui) is the natural logarithm of Ui.

3.3

Simulation Model

This section presents the simulation work to study and compare the performance of various interrupt handling schemes, considering fixed packet sizes and Poisson arrivals. A discrete-event simulation model is developed and written in the C language. A detailed description and flowcharts of the simulation model for normal interruption can be found in [SAL05]. The simulation model reported in [SAL05] is extended here to the schemes of pure and NAPI polling, count-based and time-based IC, and interrupt disabling and enabling.

3.4

Simulation Guidelines

The simulation closely follows the guidelines given in [LAW99]. A PMMLCG (prime modulus multiplicative linear congruential generator) is used as the random number generator. The simulation is automated to produce independent replications with different initial seeds that are ten million apart. During the simulation run, checks for overlapping of the random number streams were carried out to ascertain that such a condition does not occur. The simulation was terminated upon achieving a precision of no more than 10% of the mean with a confidence of 95%. The replication/deletion approach for means, discussed in [LAW99], is employed and dynamically implemented. The length of the initial transient period is computed using the MCR (Marginal Confidence Rule) heuristic developed by White [WHI97]. Each replication run lasts for five times the length of the initial transient period.

3.5

Simulation Parameters

The simulation uses realistic values for system parameters that suit a modern Gigabit network environment and hosts. A detailed discussion of these parameters can be found in [SAL05]. For a modern 2.53 GHz Pentium-IV machine, the following parameters are used.

The overall NIC interrupt cost is 3.73 µs. This cost includes both the interrupt overhead and the interrupt handling. The former was measured experimentally on a 500 MHz Pentium-III machine to be in the vicinity of 4.36 µs, as reported in [ARO00]; however, in [SAL05] it was concluded that the cost of the interrupt overhead with a null handler, for a modern 2.53 GHz Pentium-IV machine, can be 2.62 µs. The latter was measured experimentally by [MOR00] to be 5.53 µs on a 450 MHz Pentium-III machine; that measurement of interrupt handling included substantial work and major cache pollution. In [SAL05], Salah and Badawi concluded that the handling cost is 1.11 µs for a modern 2.53 GHz Pentium-IV machine. As a result, the overall interrupt cost is 3.73 µs.


The protocol processing overhead is assumed to be 5.25 µs for the TCP latency overhead, in addition to 748 Mbytes/s for the TCP bandwidth. The former, which includes OS overhead as well as actual TCP processing, was reported in [ASH04] to be 5.25 µs. The latter, which includes the copying of the packet payload to the user application (buffering and copying), was reported in [ASH04] to be 748 Mbytes/s.

The kernel's protocol processing buffer B has a size of 1024 packets. The cost of updating the NIC register is assumed to be 0.5 µs. For polling, the quota is assumed to be 3 packets per poll, the polling period 20 µs, and the polling overhead 1.59 µs.

For count-based IC, the coalescing parameters δ = 1 and δ = 8 were used. For time-based coalescing, the coalescing parameters T = 0 and T = 50 µs were used.

3.6

Comparison and Numerical Results

The evaluation results are given for key performance indicators, which include mean system throughput, CPU utilization, and latency. The performance of all interrupt schemes, including the ideal system, normal interruption, pure and NAPI polling, interrupt disable-enable, and time-based and count-based interrupt coalescing, is reported and compared.

Figure 3.5 depicts the different behaviors of system availability for different values of traffic load. At light load (less than 0.1), it is noticed that the pure polling system gives the worst availability. This can be interpreted as follows: the inter-arrival time at such light load is very long, so many unsuccessful polling cycles may occur. In this situation, the extra overhead due to context switching exceeds the ISR overhead caused by normal interruption. It is worth noting that this is not true for NAPI polling, because of the disable-enable mechanism involved, so it shows better availability than the pure scheme.

Figure 3.5. CPU Availability for Poisson Traffic

At moderate rates (0.1 to 0.4), an important observation can be made for the disable-enable scheme and the NAPI scheme with a quota of 3 packets. Both schemes show very comparable availability. Intuitively, this can be interpreted as follows: the number of packets that may be processed in one NAPI polling cycle is equivalent to the number that may be processed by the disable-enable scheme per ISR. This means that NAPI performs as many ISRs, notifying the system and starting a new polling cycle, as the disable-enable scheme performs notifying the system of incoming packets. In other words, the effective service time for both is comparable. This can be verified by noting that both NAPI and disable-enable cause a comparable latency at such loads, as shown in Figure 3.7. However, the other interrupt schemes give better availability than disable-enable and NAPI, except for normal interruption, which gives the worst availability. At higher rates (greater than 0.5), the NAPI and pure polling schemes show comparable availability that outperforms the others. This indicates that the polling mechanism with a quota of 3 works perfectly at such high rates, at which the probability that the queue is empty is minimal. As a result, the number of unsuccessful polling cycles is minimal for pure polling, and the number of ISRs needed by NAPI is minimal; hence, both NAPI and pure polling show comparable availability. However, for the normal and coalescing interrupt schemes, when the traffic rate increases right after the saturation point, the CPU becomes increasingly utilized handling ISRs, so the availability reaches bottom. Another important consideration is the sharp decrease in the CPU availability for user applications due to the increase in the effective service rate at such high rates; as a result, these schemes show more throughput and less latency.

We conclude that pure polling outperforms the others in terms of CPU availability under different traffic load conditions, except at very low rates, where it gives worse availability than the others. NAPI polling is comparable in availability to pure polling under high traffic load and comparable to the disable-enable scheme under low traffic load. The other schemes fail badly in terms of availability under high traffic load. Figure 3.6, and its zoomed view in Figure 3.7, show the performance of all interrupt schemes in terms of latency. It is observed that the interrupt coalescing schemes give worse latency than the others at very low rates. Interrupt coalescing notifies the protocol processing only when the queued packets reach the coalescing quota; at such rates the packets arrive far apart, so they have to stay a long time in the queue before being serviced. It is worth noting that increasing the coalescing quota will cause more latency at such low rates. Figure 3.7 also depicts that the disable-enable scheme gives worse latency than normal interruption at extremely low rates. This is because the packets arrive far apart, so each packet causes an ISR even with the disable-enable scheme. Moreover, for the disable-enable scheme, each packet causes extra latency due to the overhead of enabling and disabling the interrupts (updating the NIC registers).


Figure 3.6. System Delay for Poisson Traffic

Figure 3.7. System Delay at Low Rate for Poisson Traffic

Another important consideration at low rates is that pure polling gives worse latency than both NAPI polling and disable-enable interruption. The NAPI and disable-enable schemes incur overhead notifying the protocol processing as soon as a new packet arrives, and the packet is served immediately after the ISR handling finishes (for both NAPI and disable-enable) plus the context switch for NAPI. For pure polling, however, the new packet may wait in the queue for up to a polling period until it is serviced, which takes more time than the ISR does. Hence, pure polling gives worse latency than all the others except interrupt coalescing. When the rate increases, the latency in general increases. At high rates right after the saturation point, normal interruption gives the worst latency, as expected, because of the sharp decrease in the effective service rate after the saturation point due to the ISR handling overhead for each packet. The polling schemes with a quota of 3 packets and a polling period of 20 µs give worse latency than disable-enable, but better latency than normal interruption. According to the analytical study of the polling system described in [SAL05], which comes in line with the simulation study, the effective service rate for polling is affected by the values of both the quota and the polling period. By increasing the quota and decreasing the polling period, the effective service rate increases; thus the polling schemes can be improved in terms of latency and throughput by adjusting the quota and polling period such that CPU availability is not reduced too much. For latency, we conclude that the disable-enable interrupt scheme outperforms the others. Although it gives worse latency than normal interruption at extremely low rates, this drawback can be eliminated by considering today's traffic characteristics, such as burstiness, as will be discussed in Chapter 4.

Figure 3.8. System Throughput for Poisson Traffic

From Figure 3.8, it is observed that the maximum throughput occurs at 187 kpps. For normal interruption, it can be noted that the saturation or cliff point of the system occurs at 127 kpps. At this point, the corresponding CPU utilization (for ISR handling plus protocol processing) is 100%, resulting in a CPU availability of zero. Therefore, user applications will starve and livelock at this point. Figure 3.8 also shows that as the arrival rate increases beyond the saturation point, the system throughput starts to decline.

One observation can be made about the coalescing schemes with a parameter of δ = 1 in the case of count-based coalescing and T = 0 in the case of time-based coalescing. In these cases, both coalescing schemes reduce exactly, as expected, to normal interruption. There are also a number of important observations and conclusions to be made when examining and comparing the performance of all interrupt handling schemes. It can be concluded that no single scheme gives the best performance. For example, the interrupt disable-enable scheme outperforms all other schemes in terms of throughput, and also in terms of latency except at extremely low rates. However, in terms of CPU availability, interrupt disabling and enabling gives the worst performance after normal interruption. When comparing polling with the others, the polling schemes outperform all other schemes in terms of availability at high rates. However, pure polling gives the worst availability at very low rates, and NAPI gives availability comparable to the disable-enable scheme at low rates. When comparing pure polling to NAPI polling, it is obvious from the plots that both give comparable results in terms of throughput, availability, and latency at high load. However, NAPI polling outperforms pure polling in terms of latency at low load (and in terms of CPU availability at very low load). Pure polling outperforms NAPI in terms of availability at low and moderate rates.

When comparing polling to coalescing, it can be noted that coalescing can give similar throughput and CPU availability (and latency at very high load) with large coalescing values of δ and T.

CHAPTER 4

SIMULATION STUDY USING BURSTY TRAFFIC


Chapter 3 discussed and evaluated the performance of various interrupt schemes assuming fixed packet sizes and Poisson traffic. However, in recent analyses of traffic measurements, evidence of non-Markovian effects, such as burstiness across multiple time scales, has been observed in a wide variety of traffic sources, such as signaling traffic, variable bit-rate (VBR) video, Ethernet LAN traffic, ATM cell traffic, MAN traffic, and general Internet WAN traffic [PAX95]. Given the evidence of self-similarity in such a wide variety of sources, it is clear that any general model for data traffic must account for these properties. This chapter presents a simulation model that accounts for these properties. It is organized as follows. Section 4.1 presents the bursty traffic model, including its definition, characteristics, modeling, and traffic generator. Section 4.2 describes the simulation results assuming bursty traffic as input, and then gives the comparisons and numerical results obtained from applying the different interrupt schemes.


4.1

Bursty Traffic Model

Stochastic models of packet traffic used in the past were almost exclusively Markovian in nature, or more generally short-range dependent traffic processes. Those traffic models, now called classic models, assumed a Poisson arrival rate. However, a number of recent empirical studies provide evidence of the prevalence of self-similar traffic patterns with long-range dependence in measured traffic [BRU00, ALO01]. Those traffic models, called bursty or self-similar models, have vastly different statistical characteristics. In this section, this model, including its definition, characteristics, modeling, and traffic generator, is introduced.

4.1.1 Definition

Intuitively, self-similar phenomena display structural similarities across a wide range of time scales. This intuition is illustrated graphically in the first row of Figure 4.9, which shows the same Ethernet traffic trace at various levels of aggregation. Starting with a time unit of 100 ms, each subsequent plot is obtained from the previous one by increasing the time resolution by a factor of 10 and by concentrating on a randomly chosen subinterval. The plots show that even as the level of aggregation increases, the plots look the same. This means that the reference structure repeats itself over a wide range of scales; thus, the statistics of the process do not change with a change in the time scale. By contrast, in the second row of Figure 4.9 (a Poisson process), it is clear that at higher levels of aggregation the process has different characteristics, becoming less fluctuating and more regular.


Figure 4.9. Bursty Traffic versus Poisson Traffic

Mathematically, self-similar processes are described and defined as follows. Assume an increment process X = (X_t ; t = 1, 2, 3, ...), where X_t denotes the number of packets arriving in interval t. The aggregated series is defined as

    X_k^(m) = ( X_{(k-1)m+1} + ... + X_{km} ) / m ,   k ≥ 1,        (4-1)

that is, by summing the original series X over non-overlapping blocks of size m and normalizing by dividing by m. Then X is said to be self-similar with parameter H if, for all positive m, the following sequences have the same distribution:

    { m X_k^(m) , k ≥ 1 }  =_d  { m^H X_k , k ≥ 1 }.        (4-2)

4.1.2 Characteristics

Many properties characterize bursty traffic, including self-similarity, long-range dependence, slowly decaying variance, and heavy-tailed distributions. However, the most interesting features used in designing such bursty traffic are long-range dependence and heavy-tailed distributions.
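As a concrete illustration of the block aggregation in Equation (4-1), the following sketch (using a synthetic short-range-dependent series, not the thesis's traffic) aggregates a series over non-overlapping blocks and shows its variance shrinking roughly as 1/m, i.e., the smoothing that self-similar traffic does not exhibit:

```python
import random

def aggregate(series, m):
    """Equation (4-1): average X over non-overlapping blocks of size m."""
    return [sum(series[k * m:(k + 1) * m]) / m
            for k in range(len(series) // m)]

def variance(s):
    mean = sum(s) / len(s)
    return sum((v - mean) ** 2 for v in s) / len(s)

random.seed(1)
# Synthetic short-range-dependent counts: Binomial(20, 0.5) per time slot.
x = [sum(random.random() < 0.5 for _ in range(20)) for _ in range(50000)]

for m in (1, 10, 100):
    # variance drops roughly tenfold per aggregation level
    print(m, round(variance(aggregate(x, m)), 3))
```

For a self-similar trace with Hurst parameter H, the aggregated variance would instead decay as m^(2H-2), i.e., much more slowly.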

Long-range dependence. Statistically, the property that characterizes self-similarity is a slowly decaying autocorrelation function. The autocorrelation r(k) of a stochastic process X_t is defined as


    r(k) = E[ (X_t − μ)(X_{t+k} − μ) ] / σ²,        (4-3)

where μ and σ² are the mean and variance of X_t.

The autocorrelation is a measure of the correlation between elements of X_t that are a distance k apart. A stochastic self-similar process with an autocorrelation function r(k) ~ k^(−β) as k → ∞, where 0 < β < 1, is said to exhibit long-range dependence. In contrast, for a short-range dependent process, r(k) ~ ρ^k as k → ∞, where 0 < ρ < 1. Therefore, a long-range-dependent process has an autocorrelation function that decays much more slowly than the exponential decay exhibited by traditional Poissonian traffic, whose autocorrelation decays at least exponentially fast. Exactly self-similar processes have autocorrelation functions that satisfy

    r(k) = [ (k + 1)^(2H) − 2 k^(2H) + (k − 1)^(2H) ] / 2,        (4-4)

where H (0.5 < H < 1.0) is the Hurst parameter, or degree of self-similarity. Exactly self-similar processes with H in this range are long-range dependent. It is important to note that self-similarity or long-range dependence cannot be verified for a finite sample. However, following the terminology used in [LEL94], a finite sample can be said to be self-similar in nature if it is statistically consistent with a sample of a self-similar process.
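Equation (4-4) can be evaluated directly. The sketch below (an illustration, not the thesis's simulation code) contrasts its slow polynomial decay for H = 0.9 with an exponential decay:

```python
def theoretical_ac(k, H):
    """Equation (4-4): autocorrelation of an exactly self-similar process."""
    return 0.5 * ((k + 1) ** (2 * H) - 2 * k ** (2 * H) + (k - 1) ** (2 * H))

for k in (1, 10, 100):
    # An exponential decay (here rho = 0.5) is negligible long before
    # k = 100, while r(k) for H = 0.9 is still far from zero.
    print(k, round(theoretical_ac(k, 0.9), 3), 0.5 ** k)
```

Note that for H = 0.5 Equation (4-4) gives r(k) = 0 for all k ≥ 1, i.e., an uncorrelated process.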


Heavy-Tailed Distributions. Strongly related to the concept of self-similarity is the property of a heavy-tailed distribution. Heavy-tailed distributions can be used to characterize probability densities that describe traffic processes such as packet interarrival times and burst lengths. Mathematically, a distribution is heavy-tailed if

    Pr[ X > x ] ~ 1 / x^α   as x → ∞,  α > 0.        (4-5)

In general, a heavy-tailed distribution exhibits infinite variance, and possibly an infinite mean. Intuitively, an infinite variance means that the random variable X can fluctuate far away from its central range. These characteristics destroy the traffic-smoothing property usually associated with light-tailed traffic distributions such as the Poisson. A frequently used heavy-tailed distribution is the Pareto distribution, whose density and distribution functions are given by

    f(x) = α k^α / x^(α+1),    F(x) = 1 − (k/x)^α    (x > k; α > 0).        (4-6)

The parameter k, called the location parameter, specifies the minimum value that the random variable can take. The parameter α is the shape parameter, which determines the mean and variance of the random variable: with α ≤ 2 the distribution has infinite variance, and with α ≤ 1 it has infinite mean and variance.

In Figure 4.10, the density functions of the Pareto and exponential distributions are plotted on a log-linear scale to show clearly the meaning of the heavy-tailed property of the Pareto. Note that the exponential density function appears as a straight line, reflecting the exponential decay of the distribution. The tail of the Pareto decays much more slowly than the exponential; hence the term heavy tail.

Figure 4.10. Pareto and Exponential Density Functions
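A Pareto variate can be drawn by inverting the distribution function F(x) of Equation (4-6). This hypothetical sketch (parameter values are illustrative) shows the heavy tail empirically:

```python
import random

def pareto_sample(alpha, k, rng):
    """Invert F(x) = 1 - (k/x)**alpha from Equation (4-6)."""
    return k / (1.0 - rng.random()) ** (1.0 / alpha)

rng = random.Random(42)
samples = [pareto_sample(1.3, 64.0, rng) for _ in range(100000)]

# With alpha = 1.3 < 2 the variance is infinite; the mean exists
# (alpha > 1) and equals alpha*k/(alpha-1) = 277.3, but individual
# samples occasionally wander very far from that central value.
print(round(min(samples), 1), round(max(samples), 1))
```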

4.1.3 Ethernet and Self-similar Traffic

Starting with the extensive analysis of traffic measurements from Ethernet LANs over a 4-year period described in [LEL94], there have been a number of recent empirical studies that provide evidence of self-similar traffic patterns in measured traffic. The original paper [LEL93], subsequently revised and expanded in [LEL94], reports the results of the Ethernet traffic measurements conducted between 1989 and 1992 at the Bellcore Morris Research and Engineering Centre (MRE).

The left column of Figure 4.11 shows plots of the actual measurements, recording the number of packets seen on the Ethernet per unit time. The first plot shows the entire 27-hour run, using a time unit of 100 seconds. Each subsequent plot is obtained from the previous one by increasing the time resolution by a factor of 10 and displaying a randomly chosen subinterval. Thus, the last plot in the first column covers a period of approximately 0.0027 hours with a time unit of 100 ms. The plots of the actual measurements look intuitively similar to one another, as they all show a fair amount of burstiness. Thus, Ethernet traffic seems to look the same at large time scales (minutes or hours) as at small ones (seconds or milliseconds). This feature of Ethernet traffic is significantly different from both conventional telephone traffic and the stochastic models traditionally used in data network analysis and design, such as the Poisson model.

The middle column of Figure 4.11 shows a set of plots generated in the same fashion as the Ethernet plots, but using synthetic Poisson traffic. It is clear from the plots that such a model behaves differently in terms of burstiness under different levels of aggregation. As the level of aggregation increases, the traffic pattern smooths out. Thus, the top plot, covering the long time scale of 100 seconds, is the smoothest one. This

is expected, as the variance in the data is reduced by a factor of 10 with each level of aggregation. The right column of Figure 4.11 shows a set of plots for a synthetic self-similar model with a Hurst parameter H of 0.9, equivalent to the value estimated in [LEL93, LEL94, LEL95] for the Ethernet traffic shown in the left column. The plots show the same general character as those for the real Ethernet traffic.


Figure 4.11. Comparison of Actual and Synthetic Ethernet Traffic [LEL95]

Now it is clear that self-similar processes are radically different from the stochastic models traditionally used for network traffic. The latter models, such as the Poisson model, predict that as the number of users and the amount of traffic increase, the traffic becomes smoother and less bursty. Leland and Wilson contradict these models by showing that as the number of users and the amount of traffic increase, the network traffic becomes less smooth and more bursty.

4.1.4 Modeling Bursty Traffic

After reviewing the main concepts pertaining to long-range dependence and heavy-tailed distributions, the models proposed for capturing them are considered in the context of network traffic. From a modeling viewpoint, several approaches are available for modeling self-similar features. The models most commonly used for bursty network traffic in LAN and WAN environments are the ON-OFF model, the fractional Gaussian noise model, fractional ARIMA processes, and chaotic deterministic maps. Focus will be given to the best-known one, the ON-OFF model [ULA02], sometimes called a structural model. This model has advantages over the others, including its ability to describe the impact of network design parameters [OLI02] and the fact that it can be implemented on parallel computers [LEL95].

The ON-OFF model is simple and very attractive for its ability to provide intuition into the self-similarity of Ethernet traffic. Figure 4.12 displays a schematic view of a section of the ON-OFF process. X_1, X_2, ..., X_n are i.i.d. non-negative random variables representing the durations of the ON states S_1, S_2, ..., S_n, and Y_1, Y_2, ..., Y_n are i.i.d. non-negative random variables representing the durations of the OFF states. The aggregate traffic is generated by multiplexing a number of independent ON-OFF sources, where each source alternates between these two states.

Figure 4.12. A Single ON/OFF Source
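The alternating process in Figure 4.12 can be sketched as follows; the Pareto parameter values used here are placeholders, not the thesis's parameters:

```python
import random

def pareto(rng, alpha, b):
    """Pareto(alpha, b) variate via the inverse CDF."""
    return b / (1.0 - rng.random()) ** (1.0 / alpha)

def on_off_source(rng, alpha_on, b_on, alpha_off, b_off, horizon):
    """Yield (start, end) ON intervals up to `horizon` time units;
    the gaps between consecutive intervals are the OFF periods."""
    t = 0.0
    while t < horizon:
        on = pareto(rng, alpha_on, b_on)
        yield t, min(t + on, horizon)
        t += on + pareto(rng, alpha_off, b_off)

rng = random.Random(7)
ons = list(on_off_source(rng, 1.3, 512.0, 1.5, 2000.0, 1e6))
busy = sum(e - s for s, e in ons)
print(round(busy / 1e6, 2))  # empirical fraction of time spent ON
```

Multiplexing several independent instances of this generator yields the aggregate stream described in Section 4.1.5.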

4.1.5 Generating Bursty Ethernet Traffic

It has been shown in [TAQ97] that self-similar network traffic can be generated by multiplexing several sources with Pareto-distributed ON and OFF periods. The authors concluded that the resulting self-similar traffic is obtained by aggregating multiple streams (one stream per source), each consisting of alternating Pareto-distributed ON and OFF periods. Figure 4.13 graphically illustrates the model used to generate bursty traffic from multiple streams, following the notation reported in [LEL95]. To generate synthetic traffic of a predefined load, the resulting load L is simply the sum of the loads L_i generated by each individual source i. Given N sources,

    L = Σ_{i=1}^{N} L_i.        (4-7)

Thus, it is important to get a good estimate of the load generated by one source. The load generated by one source is the mean size of a packet train divided by the sum of the mean packet-train size and the mean inter-train gap; stated otherwise, it is the mean ON-period size over the mean size of the ON and OFF periods combined:

    L_i = E[ON_i] / ( E[ON_i] + E[OFF_i] ).        (4-8)

[Figure: N independent ON/OFF sources S1, S2, ..., SN feed an aggregator, producing the aggregated self-similar stream.]

Figure 4.13. Self-similar Traffic Generation Model

In practice, the Pareto distribution is chosen to model the heavy-tailed traffic trains for both ON periods and OFF periods. Thus α_ON and α_OFF are the shape parameters (between 1 and 2), and b_ON and b_OFF the location parameters. For the Pareto distribution, the mean ON-period size E[ON] and the mean OFF-period size E[OFF] are calculated as

    E[ON] = α_ON · b_ON / (α_ON − 1),    E[OFF] = α_OFF · b_OFF / (α_OFF − 1).        (4-9)

The location parameter b_ON is the minimum ON period and depends on the minimum Ethernet frame size of 64 bytes; it is fixed at 64 × 8 = 512 bit times, or 512 ns at 1 Gbps. The value of b_OFF can be computed by solving Equations (4-8) and (4-9), so that b_OFF is expressed (in terms of the known parameters L_i, b_ON, α_ON, and α_OFF) as

    b_OFF = b_ON · ( α_ON / (α_ON − 1) ) · ( (α_OFF − 1) / α_OFF ) · ( (1 − L_i) / L_i ).        (4-10)
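Equation (4-10) follows from solving Equations (4-8) and (4-9) for b_OFF. A small sketch, with an illustrative per-source load value, verifies the algebra by recovering the load from the resulting means:

```python
def b_off(b_on, alpha_on, alpha_off, load_i):
    """Equation (4-10): location parameter of the OFF-period Pareto
    for a single source contributing load load_i (0 < load_i < 1)."""
    mean_on = alpha_on * b_on / (alpha_on - 1.0)
    return mean_on * (alpha_off - 1.0) / alpha_off * (1.0 - load_i) / load_i

# Consistency check against Equations (4-8) and (4-9):
b = b_off(512.0, 1.3, 1.5, 0.1)
mean_on = 1.3 * 512.0 / 0.3
mean_off = 1.5 * b / 0.5
print(round(mean_on / (mean_on + mean_off), 6))  # recovers the load 0.1
```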

During the ON period, packets are generated back to back at a rate of 1 Gbps. The packet sizes are not fixed; they follow an empirical distribution based on real measurements of packet sizes on the MCI backbone. The measurements are reported in [CLA98] and are available online at http://www.caida.org. In [CLA98], the reported packet size distribution represents IP datagram sizes. To obtain the Ethernet frame size distribution, the packet sizes were modified to include an 18-byte header (12 bytes for the destination and source addresses, 2 bytes for length/type, and 4 bytes for the FCS). In addition, all frames shorter than 64 bytes (i.e., payloads under 46 bytes) were padded to 64 bytes, so that the minimum Ethernet frame size is 64 bytes and the maximum is 1518 bytes.
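The IP-to-Ethernet size adjustment described above can be sketched as follows (a hypothetical helper, not code from the thesis):

```python
def ethernet_frame_size(ip_datagram_bytes):
    """Add the 18-byte Ethernet header/trailer, pad up to the 64-byte
    minimum frame size, and cap at the 1518-byte maximum."""
    return min(max(ip_datagram_bytes + 18, 64), 1518)

print(ethernet_frame_size(40), ethernet_frame_size(1500))  # 64 1518
```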

It has been shown in the literature that several parameters affect the generated traffic. These parameters include α_ON and α_OFF, which determine the Hurst parameter H and hence the degree of self-similarity of the traffic. Other parameters that affect the produced aggregate traffic include the number of sources and the total length of time to simulate. The values given to these parameters are chosen according to measurements of the traffic to be simulated.

The parameters α_ON and α_OFF are set to 1.3 and 1.5, respectively. These values are commonly used and are supported by measurements on actual Ethernet traffic performed in [WIL97]. They are also consistent with the conclusion made in [OLI02] that the OFF duration can have higher variability than the ON duration, since some source-model phenomena triggered by humans (e.g., HTTP sessions) have extremely long periods of latency.

Regarding the number of sources needed to generate such synthetic traffic, [MAL04] concluded that a large number of sources only increases the complexity and is not required. In [ULA02], it was shown that self-similarity depends only minimally on the number of sources: different numbers of sources were used with a fixed α = 1.3, the results were parallel, and the resulting values of H were nearly equal for each model. In [PHI02], it was concluded that traffic that is self-similar in nature can be produced with many fewer sources than used in previous studies, and good results were obtained with as few as 8 sources. Thus, the number of sources is set to 8 in the simulation study carried out in this thesis.

The total time required by the bursty traffic simulation is very important. A simulation run with bursty traffic cannot be automated to stop when a desired precision for the estimated mean is achieved. This is due to the irregular incoming traffic resulting from the use of empirical variable packet sizes and from the superposition of multiple streams, each producing ON and OFF periods with huge variance. Therefore, simulation with such traffic converges to steady state very slowly, and the confidence interval (CI) can be very wide [CRO97a]. To obtain acceptable accuracy and precision, the simulation has to produce a huge number of samples. In [CRO97a], an estimate of the number of samples N required to achieve k-digit accuracy is given:

    N ≈ 10^( k / (1 − 1/α) ).        (4-11)

Therefore, to achieve two-digit accuracy with α = 1.3, N has to be about 10^8 samples. In [PHI02], a total of 1024 × 10^5 packets is suggested as suitable for producing self-similar traffic. Hence, for the current simulation, each replication generated 10^8 packets.

To verify the proposed model for generating self-similar traffic, the produced traffic was examined at different time units (10 ms and 100 ms); it shows the self-similar property, as shown in Figure 4.14.
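The sample-size estimate of Equation (4-11) can be evaluated directly; note that the exact form used here is a reconstruction of the garbled original from [CRO97a], so treat it as an assumption:

```python
def required_samples(k_digits, alpha):
    """Equation (4-11): samples needed for k-digit accuracy of the
    sample mean under a heavy-tailed distribution with shape alpha."""
    return 10.0 ** (k_digits / (1.0 - 1.0 / alpha))

# Two-digit accuracy with alpha = 1.3 -> on the order of 10**8 samples.
print(f"{required_samples(2, 1.3):.2e}")
```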


Figure 4.14. Burstiness for 10 and 100 ms Time Units

Self-similarity can also be determined by examining several functions. One is the autocorrelation (AC) function given by Equation (4-3). A plot of the AC function of the generated traffic against the theoretical AC of a self-similar process given by Equation (4-4) is shown in Figure 4.15. Since α is set to 1.3, the expected value of H, given by H = (3 − α)/2, is 0.85. Figure 4.15 shows that the AC appears to be slowly decaying and agrees with the theoretical AC, with a relative inaccuracy in the resulting H of about 10%.


Figure 4.15. Comparison of Simulation AC to Theoretical AC

It is important to note that self-similarity or long-range dependence cannot be verified for a finite sample. However, it should be stressed that the presence of self-similarity in the synthetic traffic matters more than the exact accuracy of the resulting Hurst parameter.

4.2 Comparison and Numerical Results

The simulation results were obtained by following the guidelines described in Chapter 3 (see Section 3.3.1). For the bursty traffic, as discussed in Section 4.1.5, 8 stream sources were used, each with Pareto shape parameters of 1.3 and 1.5 for the ON and OFF periods, respectively. The simulations were stopped after producing 100 million packets. The packet sizes are not fixed and follow the empirical distribution discussed in Section 4.1.5. The other parameters, related to the host, are equal to those used in Chapter 3 for the Poisson model, suitable for a modern 2.53 GHz Pentium-IV machine.

The evaluation results are given for key performance indicators, including mean system throughput, CPU utilization, and latency. Results are reported to compare the performance of all interrupt schemes: the ideal system, normal interruption, pure and NAPI polling, interrupt disable-enable, and time-based and count-based interrupt coalescing.

For the bursty traffic, the input rate is expressed as the load, which is the ratio of the input rate (bps) to the link capacity (bps). The throughput is expressed in the same manner; it is translated from packets per second into a load value as follows:

    Load (0 ≤ Load ≤ 1) = rate (pps) × Avg_pkt_size (bits) / LinkCapacity (bps),        (4-12)

where the Avg_pkt_size (bits) at a particular rate in the simulation can be obtained by summing up the packet sizes and dividing by the number of packets, and the LinkCapacity is 2^30 (bps). For example, for the Poisson traffic, the throughput of 187 kpps was translated to the corresponding load of 0.09 using a fixed packet size of 64 bytes, and to 0.76 using a fixed packet size of 512 bytes. For the bursty traffic, the average packet size can be obtained from the empirical distribution, which gives 557 bytes [SAL07]. Hence, the throughput of 167 kpps was translated to a load of about 0.74.

Figure 4.16 shows the performance in terms of CPU availability. It indicates that pure polling outperforms the other schemes under different traffic load conditions, except at very low rate, where it gives lower availability. NAPI polling is comparable in availability to pure polling under high traffic load and comparable to the disable-enable scheme under low traffic load. The other schemes perform very poorly in terms of availability under high traffic load. Comparing these results to those of the Poisson model, they are very close to each other.
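The translation of Equation (4-12) can be sketched as follows (the link capacity default reflects the 2^30 bps value used above):

```python
def to_load(rate_pps, avg_pkt_bits, link_bps=2 ** 30):
    """Equation (4-12): normalized offered load on the link."""
    return rate_pps * avg_pkt_bits / link_bps

# 187 kpps of fixed 64-byte packets on a 2**30 bps link
print(round(to_load(187_000, 64 * 8), 2))  # -> 0.09
```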


Figure 4.16. CPU Availability for Bursty Traffic

Figure 4.17 shows the performance of all interrupt schemes in terms of latency; Figure 4.18 is a zoomed-in version of Figure 4.17. Both figures show that the disable-enable interrupt scheme outperforms the others, even at very low rate, which differs from the disable-enable behavior under Poisson traffic. This behavior is related to the bursty model, in which packets arrive in groups (back to back); thus the disable-enable scheme, at very low rate, does not cause an ISR for each packet. As the rate increases, the latency in general increases.

Another important observation concerns the coalescing schemes. The count-based coalescing scheme gives worse latency than the others at very low rate, because many ON and OFF periods may be required to reach the coalescing quota before the protocol processing is notified. The time-based coalescing scheme, however, can benefit from the OFF period, which is very long at such low rates. Hence, time-based coalescing shows better latency than count-based.

Figure 4.17. System Delay for Bursty Traffic


Figure 4.18. System Delay at Low Rate for Bursty Traffic

From Figure 4.19, it is observed that the maximum throughput occurs at 167 kpps. For normal interruption, the saturation or cliff point of the system occurs at 116 kpps. Figure 4.19 also shows that as the arrival rate increases beyond the saturation point, the system throughput starts to decline.


Figure 4.19. System Throughput for Bursty Traffic

Furthermore, a number of important observations can be made when examining and comparing the performance of all interrupt handling schemes. No single scheme gives the best performance. For example, the scheme of interrupt disabling and enabling outperforms all other schemes in terms of throughput and latency; however, in terms of CPU availability it gives the second worst performance after normal interruption, and the polling schemes outperform the others.

When comparing the performance of the interrupt schemes under Poisson traffic to that under bursty traffic, the performance curves are, in general, close to each other except for latency. In Figure 4.20, the results obtained from the ideal interrupt scheme for both bursty and Poisson traffic are reported to show clearly the difference between the two models. In general, the bursty traffic shows more latency than the Poisson traffic, and this is also true for the loss probability. This observation is consistent with many studies which argued that the self-similar characteristic dramatically increases both packet delay and loss in data networks, as the queue size distribution becomes heavy-tailed [PAR00].

Figure 4.20. System Delay under Bursty and Poisson Traffic

CHAPTER 5

HYBRID SCHEME
Conclusions drawn in Chapters 3 and 4 show that no particular interrupt handling scheme gives the best performance under all load conditions for both Poisson and bursty traffic. It was shown that the scheme of disabling and enabling interrupts outperforms, for the most part, all other schemes in terms of throughput and latency. However, when it comes to CPU availability, pure polling is the most appropriate scheme to use. This means that the appropriate scheme depends primarily on the system performance requirements, the most important performance metric of interest, and the traffic load. For example, there are systems and applications where reducing latency is the major issue. For instance, for a loosely coupled multiprocessor system, the network latency is part of the computational delay and therefore has to be minimized [MUK97]. In contrast, in multimedia applications the audio and video streams arriving from the network are queued for a playback delay of several tens or hundreds of milliseconds before being delivered to the user [SCH96]. Consequently, an increase of a few milliseconds in the receive latency would normally go unnoticed in that type of application.


Based on these important observations, and in order to compensate for the poor CPU availability of the interrupt disable-enable scheme, a novel hybrid scheme combining interrupt disable-enable and pure polling is studied. This hybrid scheme makes up for the CPU availability drawback of the interrupt disable-enable scheme when the host is under heavy load. In short, the scheme operates in interrupt disable-enable mode until reaching a heavy-load region, at which point the system switches to pure polling. This chapter is organized as follows. Section 5.1 briefly presents some existing interrupt-polling schemes. The switching mechanism is described in Section 5.2, along with the evaluation of its parameters, including the cliff point. Section 5.3 gives the evaluation results and discussion. Finally, in Section 5.4, conclusions are made and future work is discussed.

5.1 Background

Trade-offs between interrupts and polling have been extensively discussed in the literature, and many hybrid schemes have been suggested. One implementation of such a scheme was carried out for the Windows NT platform [HAN97]. The implementation uses a two-level scheme in which interrupt-driven servicing of the network interfaces is used up to a certain level of network traffic, and above that level a polling thread is used. The switching mechanism was based on detecting user-thread starvation, i.e., receive livelock. The authors use the interrupt processing time as an indication of system load; this time can be measured by monitoring the execution time of the interrupt routine.

Another implementation that combines interrupts with polling, namely Hybrid Interrupt-Polling (HIP), is described in [DOV01]. The basic idea of HIP is to adaptively switch between interrupts and polling based on the observed rate of packet arrivals. Specifically, if packet arrivals are frequent and predictable, the receive mechanism operates in polling mode and interrupts are disabled. On the other hand, if packet arrivals are infrequent or less predictable, or if the number of consecutive unsuccessful polls exceeds a threshold, the receive mechanism operates in interrupt mode; polling is stopped and interrupts are enabled. This hybrid scheme uses normal interruption, which performs very poorly at light and moderately light loads in terms of throughput, CPU availability, and latency, as discussed in the previous chapters. Also, in [DOV01] the switching between normal interruption and polling was done somewhat arbitrarily.

In [BIS06], Biswas and Sinha proposed a hybrid poll engine that operates in normal interrupt mode when it expects low packet arrival rates and switches to polling when it expects higher packet arrival rates. In every poll cycle, after completion of the event data processing task, the expected arrival rate is forecast. None of these solutions proposed a hybrid scheme that combines interrupt disabling-enabling with pure polling; such a scheme was proposed in [SAL07].


5.2 Switching Mechanism Evaluation

In this section, an evaluation of the cliff point, which is the saturation point of the normal interrupt scheme, and of the choice of the switching point is carried out. Identifying the switching point is critical. Under Poisson traffic, the analytical work provided equations that identify where the cliff point occurs. The cliff point can simply be identified as the saturation point of normal interruption. It was derived in [SAL05] and can be expressed as

    λ_cliff = (r/2) · [ √(1 + 4μ/r) − 1 ],        (5-13)

where μ is the protocol processing service rate and 1/r is the mean ISR overhead.
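Equation (5-13) can be computed directly. In this sketch the ISR overhead 1/r is an illustrative placeholder, not the thesis's measured parameter:

```python
import math

def cliff_point(mu, r):
    """Equation (5-13): saturation arrival rate of normal interruption,
    given protocol processing rate mu and ISR service rate r (pps)."""
    return 0.5 * r * (math.sqrt(1.0 + 4.0 * mu / r) - 1.0)

mu = 1.0 / 5.99e-6   # 5.99 us mean protocol service time (bursty case)
r = 1.0 / 3.0e-6     # assumed 3 us mean ISR overhead (illustrative only)
print(round(cliff_point(mu, r) / 1e3, 1), "kpps")
```

The cliff point is always below μ and rises as the ISR overhead shrinks (larger r), which matches the intuition that cheaper interrupt handling delays saturation.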

For Poisson traffic, Equation (5-13) gives a cliff point at 127 kpps. For bursty traffic, Equation (5-13) surprisingly gives an adequate approximation of the cliff point, which can be computed once the average packet size is measured. Based on the empirical packet size distribution, the average packet size is 557 bytes, which yields an average protocol service time of 5.99 μs. Consequently, applying Equation (5-13), the cliff point λ_cliff for bursty traffic is 116 kpps.

Another problem related to the switching mechanism is how to detect the occurrence of the cliff point. The rate estimation algorithm, discussed and evaluated later in Chapter 6, is used for this purpose. The algorithm gives the estimated rate, which is an indicator of the traffic load; the switching decision is made by comparing the estimated rate with the determined cliff point.

Many mechanisms may be used to decide where the switching point should occur relative to the cliff point. One mechanism is to place the switching point exactly at the cliff point. Another mechanism uses double operation thresholds, an upper threshold θ_U and a lower threshold θ_L, where θ_U = α·λ_cliff and θ_L = β·λ_cliff. Here α and β are tunable design parameters, and their selection depends on how aggressively or conservatively the system should switch between interrupts and polling. This mechanism was introduced in [SAL05] with α = 0.95 and β = 0.85, so that the switching point falls before the cliff point.

These mechanisms are evaluated in terms of the number of switchings around the cliff point and in terms of the performance metrics, including CPU availability and latency. For Poisson traffic, the rate estimation algorithm discussed in Chapter 6 is used with a weight factor of 0.25 and a threshold offset δ of 5%. For bursty traffic, an extensive evaluation over different values of the weight and of δ is carried out. For both Poisson and bursty traffic, the following cases are considered:

Single Threshold: The switching point occurs exactly at the cliff point; the system switches when the estimated rate is greater than λ_cliff.

Early-Bounded Thresholds: The switching point occurs before λ_cliff. When the estimated rate is greater than θ_U = (1 − δ)·λ_cliff, the system switches to pure polling; it switches back to interrupt disable-enable when the estimated rate is less than θ_L = (1 − 2δ)·λ_cliff.

Vicinity-Bounded Thresholds: The switching point occurs in the vicinity of λ_cliff. When the estimated rate is greater than θ_U = (1 + δ)·λ_cliff, the system switches to pure polling; it switches back to interrupt disable-enable when the estimated rate is less than θ_L = (1 − δ)·λ_cliff. The cliff point lies between θ_L and θ_U.

For Poisson traffic, Figure 5.21 and Figure 5.22 show that using a single threshold produces more frequent mode switching around the cliff point, and hence more latency and less availability around the load of 0.5, which is the cliff point. Double thresholds are necessary to avoid the significant overhead that can result from frequent fluctuation around a single threshold. Figures 5.21 and 5.22 also show that placing the double thresholds θ_U and θ_L before the cliff point, as in the Early-Bounded case, outperforms the Vicinity-Bounded thresholds. With Early-Bounded thresholds, the switch to pure polling occurs at the start of the load of 0.5 (the cliff point), and pure polling operates throughout that load; with Vicinity-Bounded thresholds, the switch occurs at a load of 0.6. For the bounded thresholds, only one mode switch occurred, at the load of 0.5; this is because of the stability of the Poisson traffic, i.e., in Poisson traffic measurements there are few short peaks and they are not far from the mean.
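A minimal sketch of the double-threshold (hysteresis) switching logic described above, with illustrative parameter values:

```python
def make_switcher(cliff, delta, vicinity=False):
    """Stateful mode switcher for the bounded-threshold mechanisms:
    Early-Bounded uses (1-delta) and (1-2*delta) times the cliff point;
    Vicinity-Bounded uses (1+delta) and (1-delta)."""
    if vicinity:
        upper, lower = (1 + delta) * cliff, (1 - delta) * cliff
    else:
        upper, lower = (1 - delta) * cliff, (1 - 2 * delta) * cliff
    state = {"mode": "disable-enable"}

    def update(est_rate):
        if state["mode"] == "disable-enable" and est_rate > upper:
            state["mode"] = "polling"
        elif state["mode"] == "polling" and est_rate < lower:
            state["mode"] = "disable-enable"
        return state["mode"]

    return update

# Early-Bounded with a 116 kpps cliff and delta = 10%
switch = make_switcher(116_000, 0.10)
for rate in (50_000, 105_000, 110_000, 100_000, 90_000):
    print(rate, switch(rate))
```

The gap between the two thresholds is what prevents the rapid back-and-forth switching that a single threshold suffers from when the estimated rate hovers near the cliff point.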

Figure 5.21. Switching Mechanism in terms of CPU Availability (Poisson Traffic)


Figure 5.22. Switching Mechanism in terms of Latency (Poisson Traffic)

For the bursty traffic, the results of the rate estimation show less stability than for the Poisson traffic. Consequently, more undesirable frequent mode switches may occur at moderate rates and around the cliff point. Figures 5.23 and 5.24 show that at very low rates all the evaluated switching mechanisms follow the disable-enable scheme closely. At very high rates, the results for all mechanisms are stable and follow pure polling. Around the cliff point, however, all the evaluated switching mechanisms produce many undesirable mode switches, so that the performance metric values (CPU availability and system delay) lie between the performance of the disable-enable and pure polling schemes. Figure 5.24 shows that at the cliff point the Early-Bounded mechanism converges to pure polling more quickly than the other evaluated switching mechanisms.


Figure 5.23. Switching Mechanism in terms of CPU Availability (Bursty Traffic)

Figure 5.24. Switching Mechanism in terms of Latency (Bursty Traffic)

69 The undesirable switching modes, shown for the bursty traffic, represent a major problem that may degrade the performance of the hybrid scheme. However, reducing these undesirable switching modes is possible by tuning the parameters that enhance the stability of the estimator. These parameters include the weight factor of the estimator and the percentage of the upper and lower thresholds. Switching mechanisms have been investigated for the bursty traffic by applying different values of the weight factor of the estimator and different percentages of the thresholds. The simulation was done for 5 millions packet arrivals for each point of offered load that was incremented gradually by 0.1. The results consider the number of switching modes that have been occurred before the load of 0.5 (0.1-0.4) which is the cliff point, at 0.5, and then after 0.5 (0.6-1). In Table 5.1, the Early-Bounded threshold show less undesirable switches as the rate reaches the cliff point, but it shows more undesirable switches during the low rate. The Vicinity-Bounded threshold shows less undesirable switches during the low rate, but it shows more switches when the rate reaches the cliff point. Table 5.1 shows that the minimum the weight factor, the more stable are the results. Hence, it is clear from this table, that the number of undesirable switches is reduced as the weight of the rate estimation is minimized. It is shown that the best case is with a weight of 0.0078125 which is equivalent to 2-7. However, choosing a very small weight factor leads to a delay on reacting to real changes in the traffic load as will be discussed in Chapter 6. Thus, one has to choose an acceptable value of estimator weight such that the number of undesirable switching modes is inconsiderable. We have used the

estimator weight of 0.03125 (corresponding to 2^-5), and the Early-Bounded threshold with upper and lower thresholds 10% apart from the cliff point.

Table 5.1. Investigation on the Number of Switching Modes for Bursty Traffic

                             Single threshold       Early-Bounded          Vicinity-Bounded
Weight           Thresholds  low   cliff  high      low   cliff  high      low   cliff  high
0.25 (2^-2)      5%          6     42     26        41    44     2         29    20     2
                 10%         6     42     26        51    10     0         25    12     4
                 15%         6     42     26        73    2      0         21    8      2
0.03125 (2^-5)   5%          3     8      4         12    5      0         10    5      0
                 10%         3     8      4         15    0      0         9     4      0
                 15%         3     8      4         11    0      0         9     2      0
0.015625 (2^-6)  5%          5     10     0         7     4      0         3     6      0
                 10%         5     10     0         7     0      0         3     4      0
                 15%         5     10     0         7     0      0         3     2      0
0.0078125 (2^-7) 5%          1     6      0         1     2      0         1     4      0
                 10%         1     6      0         5     0      0         1     2      0
                 15%         1     6      0         3     0      0         0     1      0

5.3 Performance Evaluation of the Hybrid Scheme

In this section, the hybrid scheme is evaluated in terms of availability, latency and throughput for both Poisson and bursty traffic. The cliff point, described in Equation (5-13), is used: for Poisson traffic the cliff point is 127 kpps, and for bursty traffic it is 116 kpps. As described in Section 5.2, double threshold points U and L are used. The results, illustrated in Figures 5.25 to 5.33, show that the hybrid scheme outperforms the others for both Poisson and bursty traffic, as it makes up for the CPU availability drawback of the interrupt disable-enable scheme when operating at high load. At low rates, the hybrid scheme works as disable-enable, which outperforms the other schemes in terms of latency, availability and throughput.


Figure 5.25. CPU Availability Including Hybrid Scheme (Poisson Traffic)

Figure 5.26. System Delay at High Rate Including Hybrid Scheme (Poisson Traffic)


Figure 5.27. System Delay at Low Rate Including Hybrid Scheme (Poisson Traffic)

Figure 5.28. System Throughput Including Hybrid Scheme (Poisson Traffic)


Figure 5.29. Availability Including Hybrid Scheme (Bursty Traffic)

Figure 5.30. Zoom-in of Figure 5.29


Figure 5.31. System Delay at High Rate Including Hybrid Scheme (Bursty Traffic)

Figure 5.32. System Delay at Low Rate Including Hybrid Scheme (Bursty Traffic)


Figure 5.33. System Throughput Including Hybrid Scheme (Bursty Traffic)

CHAPTER 6

ARRIVAL RATE ESTIMATION


The estimation of the arrival rate is an integral part of many problems in communication and networking, including network planning, management, monitoring, and maintenance decisions. For example, estimated values of packet inter-arrival time are used to measure the traffic rate in the QoS-enabled Internet [AGH03]. Rate estimation is an essential part of call admission [GRO99], link-sharing [FLO95], and fair scheduling algorithms [AGH01]. In a hybrid interrupt-polling architecture, a correct switching decision is the key to superior performance, and the switching decision is based on the current estimate of the traffic rate. Our goal is to design a satisfactory estimator that guides the system into switching at the appropriate moment in time, which in turn leads to improved performance. Otherwise, the estimator may misguide the system into undesirable frequent mode switching, which wastes CPU resources. This chapter presents some existing rate estimation algorithms and evaluates them against the requirements of our approach. The evaluation is done with discrete-event simulation models in order to arrive at an algorithm that satisfies the requirements of the architecture.


The remainder of this chapter is organized as follows: Section 6.1 describes the requirements of the hybrid interrupt-polling approach. Section 6.2 presents some existing rate estimation algorithms, including the EWMA and TSW algorithms, and our proposed algorithm. Section 6.3 gives the evaluation results and discussion. Section 6.4 evaluates some design issues, including the cost overhead, the weight factor, and the time-window size. Section 6.5 describes an efficient implementation of the proposed algorithm in Linux.

6.1 Requirements

In a hybrid interrupt-polling architecture, the proper switching decision depends on the quality of the estimator, which can be judged by the following requirements:

Accuracy: the algorithm should produce results that are close to the actual data rate. Agility also conveys part of the meaning of accuracy, while ignoring short peaks. Accuracy can be evaluated in terms of the relative error of the estimated rate, defined as:

relative_error = |real − estimated| / real × 100%        (6-14)

Stability: The algorithm should ignore short-term changes in traffic behavior [AGH03]. For the hybrid interrupt-polling architecture this requirement is very important, since instability means that the estimator will follow short peaks and thus lead to frequent, undesirable mode switches. Stability is affected directly by the weight factor used in the estimation algorithm. The goal is to consider these short changes with a lower weight.

Agility: The algorithm should quickly discover an increase or decrease in the actual data rate of the traffic. However, it should only include these changes into its result, when a long-term change is about to happen. The goal is to consider these permanent changes with higher weight.

Cost: The cost overhead can be expressed in various terms: computational complexity, memory, and calculation time. The optimum is of course a simple, fast algorithm that does not need to store much data. Another factor is the number of samples that are created and have to be processed by the system; an algorithm that produces a large number of samples adds more overhead to the system.

6.2 Existing Rate Estimation Algorithms

Traffic rate estimation can be performed off-line or on-line [RFC3272], depending on the objective. Off-line estimation is used to acquire long-term traffic trends for network planning, management and maintenance decisions; complex and sophisticated algorithms can be used to accurately derive diverse traffic characteristics. On-line estimation, by contrast, is used by traffic management algorithms that require a current estimate of the traffic rate, for example in the QoS-enabled Internet [AGH03] and in calculating the round trip time (RTT) in TCP congestion control [SIL03]. We consider the on-line estimation that is suitable for our approach. Many techniques are used to estimate the traffic rate on-line, and recursive estimators are the most commonly used. There are two main types of recursive rate estimators: the Exponentially Weighted Moving Average (EWMA) and the Time Sliding Window (TSW). In what follows, we first introduce the conventional algorithms that have been used frequently in network applications, including the EWMA, the EWMA with modifications to adapt the weight factor, and the TSW. Then, a description of the proposed algorithm is given.

6.2.1 EWMA Algorithm

EWMA (Exponentially Weighted Moving Average) is a maximum-likelihood estimator that employs exponential data weighting and is widely used in many different areas and applications. For example, it is the principal method for network condition estimation and is used to calculate the round trip time (RTT) in TCP congestion control [SIL03]. EWMA is well suited for measuring the average rate of a traffic flow; it is used in [FLO95] to estimate the bandwidth over an appropriate time interval, to determine whether or not each class has been receiving its link-sharing bandwidth. Algorithm 6.1 shows the conventional EWMA packet-arrival estimator, which uses the mean inter-arrival time, i.e. the time between two consecutive packet arrival events. The current average at sample k, e_k, is calculated from the previous estimate e_{k-1} and the current observation o_k, such that:

e_k = (1 − α) · e_{k−1} + α · o_k        (6-15)

Algorithm 6.1. EWMA Packet-Arrival

Initialization:
  α = a constant weight factor
  Avg_rate = 0
  T_front = 0
For each packet arrival do:
  interarrival ← now − T_front
  Avg_rate ← (1 − α) · Avg_rate + α · interarrival
  T_front ← now
  return Avg_rate
End

The weight α determines the time constant of the estimator. If the rate suddenly changes, causing the observed value o_k to jump from one value to another, then it takes −1/ln(1 − α) steps before the estimated rate moves 63% of the way from the old value of o_k to the new value [FLO95]. This corresponds to a time constant of roughly
−A / ln(1 − α)  seconds,        (6-16)

where A is the interval over which we measure the rate. According to Equation (6-16), a large weight factor results in a small time constant, so the estimator follows changes quickly and is more agile. A small weight factor results in a large time constant, causing the estimator to ignore short-term variations, making it more stable. Agility and stability are interdependent and trade off against each other; the decision to tune the estimator for agility or for stability depends on its application. Consequently, the performance of this algorithm is limited by its fixed weight factor.

6.2.2 EWMA Dynamic Weight Algorithm

The basic assumption behind the dynamic-weight modifications is that sharp increases or decreases are first treated as peaks; only if the change persists should it be taken into account (but then very quickly). The goal is to consider such transient changes with a lower weight. Smaller changes or stagnancy indicate a stable trend, so those measurements can be considered with a higher weight.

In [BUR02], three different modifications of the EWMA were presented. For two of them, the weight was dynamically calculated to adapt to different load situations. For the third, a smoothing algorithm was used to filter out short-term effects before applying the EWMA; the conventional EWMA algorithm was then adapted to get the maximum out of the smoothed results by modifying the history. Algorithm 6.2 shows a way to adapt the weight factor using the gradient, as described in [BUR02]. The gradient captures the change of the averaged rate over time. The weight is calculated from the current gradient and the maximum weight α_max as:

α_i = α_max · 1 / (1 + |m_i| / m_norm)        (6-17)

where m_i is the current gradient and m_norm is a link-dependent normalization, taken as half of the maximum gradient during a time window. It is calculated as:

m_norm = (1/2) · capacity / time_window        (6-18)


Algorithm 6.2. EWMA Dynamic Weight

Initialization:
  α_opt = a constant representing the optimal weight
  α_max = a constant representing the maximum weight
  Avg_rate = 0
  m_norm = Link_Capacity / (2 · Time_Window)
For each window do:
  N ← number of packets during this window
  M_rate ← N / Time_Window
  m ← (M_rate − Avg_rate) / Time_Window
  α ← α_max / (1 + |m| / m_norm)
  Avg_rate ← (1 − α) · Avg_rate + α · M_rate
  return Avg_rate
End

6.2.3 Time Sliding Window (TSW) Algorithm

A TSW is a time-based estimator that employs a rectangular data weighting function based on a fixed size time window. This scheme is specified in [RFC2859] as a rate estimator for measuring the average sending rate of the traffic stream based on the bytes in the IP header and IP payload. TSW estimates the rate upon each packet arrival and decays, or forgets, the past history over time [CLA98].

Algorithm 6.3 shows the TSW algorithm, which maintains three state variables: Win_length, measured in units of time; Avg_rate, the rate estimate upon each packet arrival; and T_front, the time of the last packet arrival. TSW estimates the rate upon each packet arrival, so the state variables are updated each time a packet arrives. The now variable is the time of the current packet arrival, and Pkt_size is the size of the arriving packet.

Algorithm 6.3. TSW

Initialization:
  Win_length = a constant
  Avg_rate = 0
  T_front = 0
For each packet arrival do:
  Bytes_in_TSW ← Avg_rate · Win_length
  New_bytes ← Bytes_in_TSW + Pkt_size
  Avg_rate ← New_bytes / (now − T_front + Win_length)
  T_front ← now
  return Avg_rate
End

The agility and stability of TSW depend on the size of the time window (Win_length). If Win_length is small, the estimator is more agile, as the past carries less weight and the estimator can follow new changes in the traffic rate quickly. On the other hand, if Win_length is large, the estimator is more stable, as the past carries more weight and the estimator forgets short peaks that are brief and only temporary.

6.2.4 Proposed Algorithm

The proposed algorithm is a minor modification of the conventional EWMA. The conventional EWMA estimates the rate using the mean inter-arrival time, i.e. the time between two consecutive packet arrival events, and the rate is then derived from the respective time interval. This measurement, used in [DOV01], is described in [BUR02] as quite accurate, but it produces a large number of samples that have to be processed by the system. In addition, it is impossible to determine the exact moment of a packet arrival in a practical system, which suffers from event delivery latencies or sometimes operates in polling mode [BIS06]. Algorithm 6.4 shows the proposed algorithm, which is based on a time window instead of the inter-packet measurement. The concept of the time window is illustrated in Figure 6.34. For a fixed time window, the number of arriving packets is accumulated; at the end of the window, the ratio of the total arriving packets to the window length gives the mean rate over that interval. The time window is then restarted. This method is very stable and produces no direct problems [BUR02].


Algorithm 6.4. EWMA Time-Window

Initialization:
  α = a constant weight factor
  Avg_rate = 0
  Time_Window = a constant representing the size of the jumping window
For each window do:
  N ← number of packets during this window
  M_rate ← N / Time_Window
  Avg_rate ← (1 − α) · Avg_rate + α · M_rate
  return Avg_rate
End

The size of the interval affects the quality of the estimator in terms of stability and agility. In general, the larger the size, the more stable the estimator, but the more delay it introduces in following permanent changes, i.e. the less agile it is. An acceptable size must be determined according to the purpose of the estimator. For example, in [BUR02] a bandwidth estimator was deployed in nodes of IP networks that perform quality of service (QoS) routing based on the available bandwidth; an interval of 1 s was chosen, based on the assumption that most routing protocols have fixed restrictions on the minimum time between routing information updates. In the proposed approach, the estimator should use an interval that reflects reality as closely as possible.


[Figure: two consecutive jumping windows of length T, starting at T_i and T_{i+1} and containing N and M packets respectively, yielding the rate measurements λ_i = N/T and λ_{i+1} = M/T.]

Figure 6.34. Our Proposed Measurement for EWMA

6.3 Evaluation and Comparison

In this section, we compare and evaluate the existing rate estimation algorithms and our proposed one in terms of accuracy, stability and agility, considering both Poisson and bursty traffic. For all simulations, we have chosen the time-window measurement approach with a jumping window of 10 ms to calculate the measured rate: for each window, the packets are summed up and divided by the window size. The results for this measurement are always shown as a reference with a solid line. The results of the other schemes are shown as:

EWMA Packet-Arrival (HIP) that shows the results of the EWMA algorithm based on the inter-arrival time as it is used in [DOV01].

TSW that shows the results of the Time sliding window algorithm with a window length of 10ms.


EWMA Time-Window that shows the results of our proposed algorithm that is based on a jumping window time of 10 ms for all simulations except those used to evaluate different values of the window time.

EWMA Dynamic Weight that shows the results of the EWMA algorithm with a dynamic weight factor adapted by the gradient values.

Accuracy. In this case, we consider the accuracy requirement to evaluate the existing estimator algorithms. The fixed weight factor is 0.3 for all algorithms, as used in [DOV01], and the maximum weight for the dynamic weight is 0.6, as used in [BUR02]. The window length for the TSW algorithm is 10 ms. The best result is the one with the least average relative error. For the bursty traffic, we offered a constant load of 70% during 60 seconds of simulation. Figure 6.35 shows the average relative estimation error for the bursty model. The results show that the EWMA based on the inter-arrival time (HIP) performs poorly in terms of accuracy, with a much larger relative estimation error than the others; only small differences exist among the remaining algorithms. For the Poisson traffic, we offered a constant rate of 150,000 pps during 60 seconds of simulation. Figure 6.36 shows the average relative estimation error for the Poisson model. The results show the same behavior as with the bursty traffic, except that the average accuracy is better for all algorithms under Poisson traffic than under bursty traffic.



Figure 6.35. Average Relative Estimation Error for Bursty Traffic


Figure 6.36. Average Relative Estimation Error for Poisson Traffic

Stability. In this case, we evaluate the stability requirement for the existing algorithms and the proposed one. To present the stability of the algorithms clearly, we have chosen a weight factor of 0.125 for all except the TSW, which is based on a window length of 10 ms. The maximum weight is 0.25 for the dynamic weight factor used with the EWMA. The weight factor of 0.125 was chosen to produce more stable results for all algorithms; the best algorithm is then the one that follows short peaks as little as possible. For the bursty traffic, we offered a constant load of 70% during the simulation; for the Poisson traffic, we offered a constant arrival rate of 150,000 pps. Figure 6.37 shows the results of the bursty model for evaluating stability. The results show that the EWMA packet-arrival (HIP) performs poorly in terms of stability. Figure 6.38 zooms in to show the difference between the other algorithms: the EWMA time-window behaves much like the EWMA with a dynamic weight factor, and only for strong deviations of the rate measurement (e.g. between 100 and 300) can real differences be observed, which show that the EWMA time-window outperforms the others. Figure 6.39 shows the results for the Poisson traffic. The EWMA packet-arrival (HIP) also performs poorly in terms of stability with the Poisson traffic. Figure 6.40 zooms in and shows clearly that the EWMA time-window outperforms the others.


Figure 6.37. Stability Evaluation with Bursty Traffic

Figure 6.38. Stability Evaluation with Bursty Traffic (Zoom-in of Figure 6.37)


Figure 6.39. Stability Evaluation with Poisson Traffic

Figure 6.40. Stability Evaluation with Poisson Traffic (Zoom-in of Figure 6.39)

Agility. In this case, we evaluate the agility requirement for the existing algorithms and the proposed one. The best algorithm is the one that follows permanent changes as quickly as possible. The simulation used a fixed weight factor of 0.3 for all algorithms, as used in [DOV01], and a maximum weight of 0.6 for the dynamic weight, as used in [BUR02]. The window length for the TSW algorithm is 10 ms. For the bursty traffic, we offered a changing load during the simulation so that the quickness of following these changes could be measured; for the Poisson traffic, we offered a changing arrival rate. Figure 6.41 shows the results of the bursty model for evaluating agility. The results show that the EWMA packet-arrival performs poorly in terms of agility. Figure 6.42 zooms in to show the difference between the other algorithms: the permanent changes in the load, from high to low at 200 and from low to high at 400, show that the EWMA time-window follows these permanent changes slowly, unlike the TSW and the EWMA dynamic weight algorithms. Figure 6.43 shows the results for the Poisson traffic. The EWMA packet-arrival (HIP) also performs poorly in terms of agility for the Poisson traffic. Figure 6.44 zooms in and depicts the same behavior as with the bursty traffic.


Figure 6.41. Agility Evaluation with Bursty Traffic

Figure 6.42. Agility Evaluation with Bursty Traffic (Zoom-in of Figure 6.41)


Figure 6.43. Agility Evaluation with Poisson Traffic

Figure 6.44. Agility Evaluation with Poisson Traffic (Zoom-in of Figure 6.43)

From the previous evaluation cases, it can be concluded that:

The EWMA packet-arrival (HIP) performs poorly in terms of accuracy and stability for both bursty and Poisson traffic, and so it is not suitable for the proposed approach.

The EWMA time-window, which is the proposed algorithm, follows permanent changes slowly. The other algorithms show little difference in terms of agility.

However, in terms of stability, the EWMA time-window algorithm outperforms the other algorithms for both bursty and Poisson traffic.

6.4 Design Issues

This section discusses the cost overhead, a requirement that should be as low as possible, and the selection of parameters, which should be chosen carefully.

6.4.1 The Cost Overhead

The goal of the proposed approach is to enhance the performance of the system. This goal cannot be achieved with an estimator that itself introduces a large overhead. For example, algorithms designed to be executed on each packet arrival, like HIP and TSW, introduce more overhead than those executed once per acceptable time interval, like the EWMA time-window algorithm. Another issue affecting the cost overhead is the time complexity, expressed in the CPU cycles needed to execute the estimator. For example, the EWMA time-window estimator with a dynamic weight factor has more overhead than the EWMA time-window estimator with a fixed weight factor, since it needs more operations, including division and multiplication, to calculate the gradient used to adapt the weight. As a result, the EWMA time-window estimator with a fixed weight, our proposed algorithm, is the estimator with the least overhead: it is executed once per interval instead of on each packet arrival, and its fixed weight factor requires no extra computation for weight adaptation.

6.4.2 Selection of Parameters

Our proposed estimator is based on two parameters, the interval size and the weight factor, which determine its quality. In this section, we evaluate different values for the interval size and weight factor to arrive at the most appropriate values. The Interval Size Parameter. We evaluate three different intervals, 32 ms, 16 ms and 8 ms, using a fixed weight of 0.25, considering both bursty and Poisson traffic, to find an interval that satisfies the requirements. Figure 6.45 shows the results for the bursty traffic. The intervals of 32 ms and 16 ms are more stable than the interval of 8 ms, as shown between 100 and 200 and also between 400 and 500. However, the interval of 8 ms is more agile than the others, as shown at the 200 time slot, and it is acceptable in terms of stability as it does not differ much from the others.

Figure 6.45. Time Window Size for Bursty Traffic

Figure 6.46 shows the results for the Poisson traffic. The algorithms hardly differ in terms of stability, as shown between 100 and 200, and hardly differ in terms of agility, as shown at 200 and 400 for the decrease and increase of the rate, respectively.

Figure 6.46. Time Window Size for Poisson Traffic

The Weight Factor Parameter. We evaluate different values of the weight factor in terms of stability and agility. We consider the values 0.125, 0.25, and 0.5 for two reasons. First, these values can be represented as a negative power of 2, so the estimated rate can be computed with one shift and two add instructions, which are inexpensive operations when implemented in Linux [JAC88]. Second, the value proposed for this weight factor was 0.3 in [DOV01] and [BUR02], and 0.25 in [JAC88]; the value 0.25 is close to 0.3 and can be expressed as a negative power of 2. The evaluation is done for both bursty and Poisson traffic, using a window time of 8 ms.

Figure 6.47 shows the results for the bursty traffic: the weight factor of 0.125 produces more stable results than the other values, but with less agility, as shown in Figure 6.48. The weight factor of 0.5 shows more agility but less stability. The results also depict that the bursty traffic generates many short peaks that deviate largely from the measurements. Consequently, for bursty traffic, the stability requirement is very important and must be met with a small weight factor; otherwise, the estimator will cause many undesirable mode switches that degrade the performance of the system. For the Poisson traffic, Figure 6.49 and Figure 6.50 show that the Poisson traffic is more stable than the bursty traffic, so the weight factor of 0.25 is adequate for the stability requirement under Poisson traffic. This weight value is also acceptable to balance the stability and agility requirements.


Figure 6.47. Stability with Different Weight Factors for Bursty Traffic

Figure 6.48. Agility with Different Values of Weight Factor for Bursty Traffic


Figure 6.49. Stability for Different Weight Factors for Poisson Traffic

Figure 6.50. Agility for Different Weight Factors for Poisson Traffic


6.5 Implementation of the Proposed EWMA in Linux

This section presents, in detail, the implementation of the proposed algorithm in Linux with the parameters chosen according to the evaluation study discussed in Section 6.4. The implementation takes into account the following considerations: implementing the parameters efficiently, including the interval size and the weight factor; and implementing the computation of the estimated rate efficiently to reduce the computation cost and minimize the total overhead.

6.5.1 Selection of Parameters

The interval size can easily be implemented with a software timer in Linux. Figure 6.51 shows the structure of the timer, which should be initialized with the proper values so that the estimation function is invoked at a future moment equal to the time interval (8 ms). This value should be assigned to the expires field and expressed in jiffies. To understand the expression of the interval in jiffies, the jiffies and HZ parameters in Linux should be discussed. The global variable jiffies holds the number of ticks that have occurred since the system booted. On boot, the kernel initializes the variable to zero, and it is incremented by one on each timer interrupt. The frequency of the system timer (the tick rate) is programmed on system boot based on a static preprocessor define, HZ.

struct timer_list {
    struct list_head entry;
    unsigned long expires;
    spinlock_t lock;
    unsigned long magic;
    void (*function)(unsigned long);
    unsigned long data;
    tvec_base_t *base;
};

static struct timer_list est_timer;

Figure 6.51. Timer Structure

The kernel defines the value of HZ in the header file <asm/param.h>. The value of HZ differs for each supported architecture. The tick rate has a frequency of HZ hertz and a period of 1/HZ seconds. Thus, because there are HZ timer interrupts in a second, there are HZ jiffies in a second. For example, the HZ value in the i386 architecture is 1000; the timer interrupt occurs 1000 times per second and the period is 1 millisecond [LOV05]. The representation of the interval size in jiffy units can be expressed as:

(HZ/250) << idx  jiffies,        (6-19)

which is equivalent to:

(1 << idx)/250  seconds,        (6-20)

where 0 <= idx <= 5, which produces different classes of the interval size (4 ms, 8 ms, 16 ms, 32 ms, 64 ms or 128 ms). With idx = 1, the interval size is 8 ms.

The EWMA weight factor is chosen as a negative power of 2, and it is implemented as:

α = 2^(−ewma_log)        (6-21)

The ewma_log value is defined in the header file param.h. According to the results of the simulation study, we have chosen a weight factor of 0.25, so the corresponding value of ewma_log is 2.

6.5.2 Modifications to Linux Code

Figure 6.52 shows the gen_new_estimator function, which should be called to initialize the timer structure shown in Figure 6.51. The length of the interval defined in Equation (6-19) is added to the current jiffies in the expires field. The kernel then checks the timer every jiffy; if the value of jiffies in the system is greater than or equal to the value of the expires field, the EWMA function is invoked to estimate the current rate.

int gen_new_estimator(struct gnet_stats_basic *bstats)
{
    struct gen_estimator *est;

    est->bstats = bstats;
    est->ewma_log = 2;
    est->last_packets = bstats->packets;
    est->avpps = rate_est->pps << 10;
    init_timer(&est_timer);
    est_timer.data = (HZ/250) << idx;
    est_timer.expires = jiffies + ((HZ/250) << idx);
    est_timer.function = estimation_timer;
    add_timer(&est_timer);
    return 0;
}

Figure 6.52. Generating a New Estimator

Figure 6.53 shows the implementation of the EWMA function, called estimation_timer. When this function is executed, the average rate measurement is calculated by dividing the packets obtained in the current interval by the interval length defined in Equation (6-20). The average rate is calculated as:

Rate = ((npackets − last_packets) << (10 − idx)) * 250,        (6-22)

The function then updates the expires field periodically using the mod_timer function as:

mod_timer(&est_timer, jiffies + est_timer.data),        (6-23)

where est_timer.data holds the length of the interval in jiffy units defined in Equation (6-19).
static void estimation_timer(void)
{
    struct gen_estimator *e;
    u32 npackets;
    u32 rate;

    if ((e = est_list) != NULL) {
        npackets = e->bstats->packets;
        rate = (npackets - e->last_packets) << (10 - idx);
        rate = (rate << 8) - (rate << 2) - (rate << 1);   /* rate *= 250 */
        e->last_packets = npackets;
        e->avpps += ((long)rate - (long)e->avpps) >> e->ewma_log;
        e->rate_est->pps = (e->avpps + 0x1FF) >> 10;
    }
    mod_timer(&est_timer, jiffies + est_timer.data);
}

Figure 6.53. Estimation Function

The multiplication operation in Equation (6-22) is executed every 8 ms, the chosen interval. According to [FOG06], the cost of a multiplication operation on the Pentium 4 is 11 cycles. With a 2 GHz processor, the cycle length is 0.5 ns, so the cost of the multiplication in the rate estimation is 343.75 ns per second. As a result, the multiplication is replaced with shift and subtract operations that cost one cycle each. The alternative code for calculating the average rate is:

rate = (npackets - e->last_packets) << (10 - idx);
rate = (rate << 8) - (rate << 2) - (rate << 1);

Table 6.2 shows the difference in cost between the multiplication code and the alternative one.

Table 6.2. Cost of Multiplication and the Alternative Code

Operation          Cycles    Cost per Second
Multiplication     11        343.75 ns
Alternative code   6         187.5 ns

CHAPTER 7

CONCLUSION AND FUTURE WORK


This chapter summarizes our major contributions in this thesis work to study operating system performance under different interrupt handling schemes, and it gives indications of future research directions. In this thesis work, a discrete-event simulation model was developed to study the performance of interrupt handling schemes under bursty traffic with variable packet sizes. The interrupt handling schemes studied included the ideal system, normal interruption, pure and NAPI polling, interrupt disable-enable, time-based and count-based interrupt coalescing, as well as the hybrid interrupt scheme. Performance was studied in terms of system throughput, system latency, and CPU availability for user processes.

In Chapter 3, a simulation model was developed to evaluate the performance metrics of Gigabit-network hosts assuming Poisson input traffic. The reported simulation results were verified by considering special cases and by comparing the results to existing analytical models. It was concluded that no particular interrupt handling scheme gives the best performance under all load conditions. Selection of the most appropriate scheme depends primarily on the system performance requirements, the most important performance metric, and the present traffic load. It was shown that the disable-enable interrupt scheme outperforms, in general, all other schemes in terms of throughput and latency. However, for CPU availability, pure polling is the most appropriate scheme at moderate and high traffic loads.

In Chapter 4, a simulation model was introduced to evaluate the performance metrics of Gigabit-network hosts assuming bursty traffic, instead of Poisson traffic, as the input. Again, it was concluded that no particular interrupt handling scheme gives the best performance under all load conditions. The scheme of interrupt disabling and enabling outperforms all other schemes in terms of throughput and latency. However, in terms of CPU availability, interrupt disabling and enabling gives the second worst performance, after normal interruption, and the polling schemes outperform the others. It can also be concluded that bursty traffic causes more latency than Poisson traffic, as the queue size follows a heavy-tailed distribution. Consequently, no particular interrupt handling scheme gives the best performance under all load conditions for either Poisson or bursty traffic. However, the scheme of disabling and enabling interrupts outperforms, for the most part, all other schemes in terms of throughput and latency, while for CPU availability pure polling is the most appropriate scheme to use.

Chapter 5 studied a novel hybrid interrupt-polling scheme, which operates in the disable-enable scheme at low rates and in pure polling at high rates, beyond the saturation point. It was concluded that the hybrid scheme outperforms the other schemes under heavy traffic in terms of CPU availability, making up for the drawback of the interrupt disable-enable scheme, as expected. The switching mechanism was evaluated in terms of the cliff point, the number of switching modes, and the occurrence of the switching point.

Finally, as rate estimation is the key to the superior performance of the hybrid scheme, Chapter 6 presented different rate estimation methods, including the EWMA, the EWMA with modifications to adapt the weight factor, the TSW, and the proposed algorithm. It was concluded that the proposed estimation algorithm, EWMA time-window, is the best designed estimator: it requires little overhead and outperforms the others in terms of stability. The implementation of the proposed algorithm in Linux was presented. The implementation is simple and optimal in time complexity. The code is efficient because it avoids division and multiplication, which are costly arithmetic operations; instead, we used alternative code with shift and subtraction operations, which are cheap arithmetic operations.

The following are some future work directions:
- Experimental implementation of the hybrid scheme, which was shown to be superior.
- Experimental implementation of the proposed rate estimation method, and a study of its overhead and efficiency.

Bibliography
[3COM] 3Com Corporation, "Gigabit Server Network Interface Cards 7100xx Family," http://www.costcentral.com/pdf/DS/3COMBC/DS3COMBC109285.PDF
[AGH01] Agharebparast, F., and Leung, V. C. M., "Efficient Fair Queuing with Decoupled Delay-Bandwidth Guarantees," Proc. IEEE Globecom '01, San Antonio, TX, Nov. 2001, pp. 2601-2605.
[AGH03] Agharebparast, F., and Leung, V. C. M., "A New Traffic Rate Estimation and Monitoring Algorithm for the QoS-Enabled Internet," Proc. IEEE GLOBECOM 2003, vol. 7, pp. 3883-3887, 2003.
[ALO01] Alouf, S., Nain, P., and Towsley, D., "Inferring Network Characteristics via Moment-Based Estimators," Proc. IEEE INFOCOM 2001, Anchorage, Alaska, April 2001, pp. 1045-1054.
[ALTE] Alteon WebSystems Inc., "Jumbo Frames," www.alteon-websystems.com/oducts/white_papers/jumbo
[ARO00] Aron, M., and Druschel, P., "Soft Timers: Efficient Microsecond Software Timer Support for Network Processing," ACM Transactions on Computer Systems, vol. 18, no. 3, pp. 197-228, August 2000.
[ASH04] Ashford Computer Consulting Service, "GigaBit Ethernet to the Desktop - Client1 System Benchmarks," 2004, http://www.accs.com/p_and_p/GigaBit/Client1.html
[BAD03] El-Badawi, K., "Performance Evaluation of Interrupt Handling Schemes in Gigabit Networks," M.S. Thesis, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia, April 2003.
[BAR98] Barr, D. R., Glen, A. G., and Graf, H. F., "The Straightforward Nature of Arrival Rate Estimation?," The American Statistician, vol. 52, no. 4, pp. 346-350, Nov. 1998.
[BHO98] Bhoedjang, R., Ruhl, T., and Bal, H., "User-Level Network Interface Protocols," IEEE Computer, vol. 31, no. 11, Nov. 1998, pp. 53-60.
[BIA05] Bianco, A., Finochietto, J. M., Galante, G., Mellia, M., and Neri, F., "Open-Source PC-Based Software Routers: A Viable Approach to High-Performance Packet Switching," Proc. QoS-IP 2005, Catania, Italy, February 2005, pp. 353-366.
[BIS06] Biswas, A., and Sinha, P., "Efficient Real-Time Linux Interface for PCI Devices: A Study on Hardening a Network Intrusion Detection System," Proc. SANE, May 2006.
[BRU00] Brun, O., and Garcia, J. M., "Analytical Solution of Finite Capacity M/D/1 Queues," Journal of Applied Probability, vol. 37, no. 4, December 2000, pp. 1092-1098.
[BRU96] Brustoloni, J., and Steenkiste, P., "Effects of Buffering Semantics on I/O Performance," Proc. Second USENIX Symposium on Operating Systems Design and Implementation, October 1996, pp. 277-291.
[BUR02] Burgstahler, L., and Neubauer, M., "New Modifications of the Poisson Moving Average Algorithm for Bandwidth Estimation," Proc. 15th ITC Specialist Seminar, July 2002.
[CLA98] Claffy, K. C., Miller, G., and Thompson, K., "The Nature of the Beast: Recent Traffic Measurements from an Internet Backbone," Proc. INET 1998, Geneva, Switzerland, July 1998.
[CLA98] Clark, D., and Fang, W., "Explicit Allocation of Best-Effort Packet Delivery Service," IEEE/ACM Transactions on Networking, vol. 6, no. 4, pp. 362-373, August 1998.
[CRO97a] Crovella, M., and Lipsky, L., "Long-Lasting Transient Conditions in Simulations with Heavy-Tailed Workloads," Proc. 1997 Winter Simulation Conference, Atlanta, GA, pp. 1005-1012, 1997.
[CRO97b] Crovella, M., and Bestavros, A., "Self-Similarity in World Wide Web Traffic: Evidence and Possible Causes," IEEE/ACM Transactions on Networking, vol. 5, no. 6, pp. 835-846, 1997.
[DAN92] Danzig, P., Jamin, S., Caceres, R., Mitzel, D., and Estrin, D., "An Empirical Workload Model for Driving Wide-Area TCP/IP Network Simulations," Internetworking: Research and Experience, vol. 3, no. 1, pp. 1-26, March 1992.
[DER04] Deri, L., "Improving Passive Packet Capture: Beyond Device Polling," Proc. 4th International System Administration and Network Engineering Conference, Amsterdam, September 2004.
[DIT97] Ditta, Z., Parulkar, G., and Cox, J., "The APIC Approach to High Performance Network Interface Design: Protected DMA and Other Techniques," Proc. IEEE INFOCOM 1997, Kobe, Japan, April 1997, pp. 179-187.
[DOV01] Dovrolis, C., Thayer, A., and Ramanathan, P., "HIP: Hybrid Interrupt-Polling for the Network Interface," ACM Operating Systems Review, vol. 35, pp. 50-60, Oct. 2001.
[DRU96a] Druschel, P., and Banga, G., "Lazy Receiver Processing (LRP): A Network Subsystem Architecture for Server Systems," Proc. Second USENIX Symposium on Operating Systems Design and Implementation, October 1996, pp. 261-276.
[DRU96b] Druschel, P., "Operating System Support for High-Speed Communication," Communications of the ACM, vol. 39, no. 9, September 1996, pp. 41-51.
[DUN01] Dunkels, A., "Design and Implementation of the lwIP TCP/IP Stack," February 2001, http://www.sics.se/~adam/lwip/doc/lwip.pdf
[FEN00] Feng, W., "Is TCP an Adequate Protocol for High-Performance Computing Needs?," Proc. SC2000, Dallas, Texas, USA, November 2000.
[FLO95] Floyd, S., and Jacobson, V., "Link-Sharing and Resource Management Models for Packet Networks," IEEE/ACM Transactions on Networking, vol. 3, pp. 365-386, Aug. 1995.
[FOG06] Fog, A., "Optimizing Software in C++: An Optimization Guide for Windows, Linux and Mac Platforms," Copenhagen University, 2006.
[FOO03] Foong, A., Huff, T., Hum, H., Patwardhan, J., and Regnier, G., "TCP Performance Re-Visited," IEEE Symposium on Performance Analysis of Systems and Software, March 2003, pp. 70-79.
[FRO94] Frost, V., and Melamed, B., "Traffic Modeling for Telecommunications Networks," IEEE Communications Magazine, vol. 32, no. 3, pp. 70-80, March 1994.
[FUK00] Fukuda, K., Takayasu, M., and Takayasu, H., "Origin of Critical Behavior in Ethernet Traffic," Physica A, vol. 287, pp. 289-301, 2000.
[GAL99] Gallatin, A., Chase, J., and Yocum, K., "Trapeze/IP: TCP/IP at Near-Gigabit Speeds," Proc. Annual USENIX Technical Conference, Monterey, CA, June 1999.
[GRO99] Grossglauser, M., and Tse, D., "A Framework for Robust Measurement-Based Admission Control," IEEE/ACM Transactions on Networking, vol. 7, no. 3, pp. 293-309, June 1999.
[GUF03] Guffens, V., "Path of a Packet in the Linux Kernel," April 2003, www.auto.ucl.ac.be/~guffens/doc/path_packet.pdf
[IND98] Indiresan, A., Mehra, A., and Shin, K. G., "Receive Livelock Elimination via Intelligent Interface Backoff," TCL Technical Report, University of Michigan, 1998.
[JAC88] Jacobson, V., "Congestion Avoidance and Control," Proc. ACM SIGCOMM '88, Aug. 1988.
[KEN96] Keng, H., and Chu, J., "Zero-Copy TCP in Solaris," Proc. USENIX 1996 Annual Technical Conference, January 1996.
[KIM01] Kim, I., Moon, J., and Yeom, H. Y., "Timer-Based Interrupt Mitigation for High Performance Packet Processing," Proc. 5th International Conference on High-Performance Computing in the Asia-Pacific Region, Gold Coast, Australia, September 2001.
[KOCH] Kochetkov, K., "Intel PRO/1000 T Desktop Adapter Review," http://www.digit-life.com/articles/intelpro1000t
[LAW99] Law, A., and Kelton, W., Simulation Modeling and Analysis, 2nd Edition, McGraw-Hill, 1999.
[LEL93] Leland, W., Taqqu, M., Willinger, W., and Wilson, D., "On the Self-Similar Nature of Ethernet Traffic," Proc. ACM SIGCOMM '93, Sept. 1993.
[LEL94] Leland, W., Taqqu, M., Willinger, W., and Wilson, D., "On the Self-Similar Nature of Ethernet Traffic (Extended Version)," IEEE/ACM Transactions on Networking, vol. 2, no. 1, pp. 1-15, February 1994.
[LOV05] Love, R., Linux Kernel Development, 2nd Edition, Novell Press, Jan. 2005.
[MAL04] Malik, S., and Killat, U., "How Many Traffic Sources are Enough?," Proc. Performance Modelling and Evaluation of Heterogeneous Networks (HET-NETs), Ilkley, U.K., July 2004.
[MAQ96] Maquelin, O., Gao, G. R., Hum, H. H. J., Theobald, K. G., and Tian, X., "Polling Watchdog: Combining Polling and Interrupts for Efficient Message Handling," Proc. 23rd Annual International Symposium on Computer Architecture, Philadelphia, PA, 1996, pp. 178-188.
[MOG97] Mogul, J., and Ramakrishnan, K. K., "Eliminating Receive Livelock in an Interrupt-Driven Kernel," ACM Transactions on Computer Systems, vol. 15, no. 3, pp. 217-252, August 1997.
[MOR00] Morris, R., Kohler, E., Jannotti, J., and Kaashoek, M., "The Click Modular Router," ACM Transactions on Computer Systems, vol. 18, no. 3, pp. 263-297, Aug. 2000.
[MUK97] Mukherjee, S., and Hill, M. D., "A Survey of User-Level Network Interfaces for System Area Networks," Tech. Rep. TR 1340, Computer Sciences Department, University of Wisconsin-Madison, Feb. 1997.
[OLI02] Olivier, E., "Long-Range Dependence and Heavy-Tail Modeling for Teletraffic Data," January 2002.
[PAR00] Park, K., and Willinger, W. (eds.), Self-Similar Network Traffic and Performance Evaluation, Wiley & Sons, Inc., 2000.
[PAX95] Paxson, V., and Floyd, S., "Wide-Area Traffic: The Failure of Poisson Modeling," IEEE/ACM Transactions on Networking, vol. 3, no. 3, pp. 226-244, June 1995.
[PHI02] Philip, M., "Simulation of Self-Similar Network Traffic Using High Variance ON/OFF," M.S. Thesis, Clemson University, May 2002.
[PRA04] Prasad, R., Jain, M., and Dovrolis, C., "Effects of Interrupt Coalescence on Network Measurements," Proc. Passive and Active Measurement (PAM) Workshop, France, April 2004.
[RAM93] Ramakrishnan, K. K., "Performance Considerations in Designing Network Interfaces," IEEE Journal on Selected Areas in Communications, vol. 11, no. 2, pp. 203-219, 1993.
[RFC2395] Friend, R., and Monsour, R., "IP Payload Compression Using LZS," RFC 2395, Dec. 1998.
[RFC2859] Fang, W., Seddigh, N., and Nandy, B., "A Time Sliding Window Three Colour Marker (TSWTCM)," RFC 2859, June 2000.
[RFC3272] Awduche, D., Chiu, A., Elwalid, A., Widjaja, I., and Xiao, X., "Overview and Principles of Internet Traffic Engineering," RFC 3272, May 2002.
[RUB01] Rubini, A., and Corbet, J., Linux Device Drivers, O'Reilly, 2001.
[SAL03a] Salah, K., and El-Badawi, K., "Evaluating System Performance in Gigabit Networks," Proc. 28th IEEE Conference on Local Computer Networks (LCN), Bonn/Königswinter, Germany, October 20-24, 2003, pp. 498-505.
[SAL03b] Salah, K., and El-Badawi, K., "Performance Evaluation of Interrupt-Driven Kernels in Gigabit Networks," Proc. IEEE Globecom 2003, San Francisco, USA, December 1-5, 2003, pp. 3953-3957.
[SAL05] Salah, K., and El-Badawi, K., "Analysis and Simulation of Interrupt Overhead Impact on OS Throughput in High-Speed Networks," International Journal of Communication Systems, Wiley, vol. 18, no. 5, June 2005, pp. 501-526.
[SAL07] Salah, K., Badawi, K., and Al-Haidari, F., "Performance Analysis and Comparison of Interrupt-Handling Schemes in Gigabit Networks," submitted for publication, 2007.
[SALI01] Salim, J. H., "Beyond Softnet," Proc. 5th Annual Linux Showcase and Conference, November 2001, pp. 165-172.
[SCH96] Schulzrinne, H., Casner, S., Frederick, R., and Jacobson, V., "RTP: A Transport Protocol for Real-Time Applications," RFC 1889, Jan. 1996.
[SHI01] Shivam, P., Wyckoff, P., and Panda, D., "EMP: Zero-Copy OS-Bypass NIC-Driven Gigabit Ethernet Message Passing," Proc. SC2001, Denver, Colorado, USA, November 2001.
[SIL03] Silberschatz, A., Galvin, P., and Gagne, G., Operating System Concepts, 4th Edition, John Wiley & Sons, Inc., 2003.
[SIN04] Sinha, A., Sarat, S., and Shapiro, J., "Network Subsystems Reloaded: A High-Performance, Defensible Network Subsystem," Proc. USENIX Annual Technical Conference, Boston, MA, June 2004, pp. 213-226.
[TAQ97] Taqqu, M. S., Willinger, W., and Sherman, R., "Proof of a Fundamental Result in Self-Similar Traffic Modeling," ACM SIGCOMM Computer Communication Review, vol. 27, no. 2, 1997, pp. 5-23.
[TRA93a] Traw, C., and Smith, J., "Hardware/Software Organization of a High Performance ATM Host Interface," IEEE Journal on Selected Areas in Communications, vol. 11, no. 2, February 1993.
[TRA93b] Traw, C., and Smith, J., "Giving Applications Access to Gb/s Networking," IEEE Network, vol. 7, no. 4, July 1993, pp. 44-52.
[ULA02] Ulanovs, P., and Petersons, E., "Modeling Methods of Self-Similar Traffic for Network Performance Evaluation," Scientific Proceedings of RTU, Series 7, Telecommunications and Electronics, 2002.
[VER00] Veres, A., and Boda, M., "The Chaotic Nature of TCP Congestion Control," Proc. IEEE INFOCOM 2000, Tel Aviv, Israel, 2000, pp. 1715-1723.
[WAN03] Wang, C., and Wolff, D., "Efficient Simulation of Queues in Heavy Traffic," ACM Transactions on Modeling and Computer Simulation, vol. 13, no. 1, January 2003, pp. 62-81.
[WHI97] White, J., "An Effective Truncation Heuristic for Bias Reduction in Simulation Output," Simulation, vol. 69, no. 6, pp. 323-334, December 1997.
[WIL97] Willinger, W., Taqqu, M. S., Sherman, R., and Wilson, D. V., "Self-Similarity Through High-Variability: Statistical Analysis of Ethernet LAN Traffic at the Source Level," IEEE/ACM Transactions on Networking, vol. 5, pp. 71-86, February 1997.
[ZEC02] Zec, M., Mikuc, M., and Zagar, M., "Estimating the Impact of Interrupt Coalescing Delays on Steady State TCP Throughput," Proc. 10th SoftCOM, October 2002.

VITA
Al-Haidari, Fahd AbdulSalam. Born in Yemen on August 28, 1975. Completed the Bachelor of Science (B.Sc.) degree in Computer Science at the University of Mosul, Iraq, in July 1999. Completed the M.S. degree in Computer Science at King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia, in June 2007. Email: fahdhyd@yahoo.com.
