
2017 IEEE International Conference on Cluster Computing

Evaluating Effect of Write Combining on PCIe Throughput to Improve HPC Interconnect
Mahesh Chaudhari, Kedar Kulkarni, Shreeya Badhe
HPC-Technologies Group, Centre for Development of Advanced Computing, Pune, India

Dr. Vandana Inamdar
Dept. of Computer Engineering and Information Technology, College of Engineering, Pune, India

Abstract— The HPC interconnect is a crucial component of any HPC machine, and its performance is one of the contributing factors to the overall performance of the system. The most popular interface for connecting a Network Interface Card (NIC) to the CPU is PCI Express (PCIe). With denser core counts in compute servers and steadily maturing fabric interconnect speeds, there is a need to maximize packet data movement throughput between system memory and the fabric interface network buffers, so that the rate at which applications (on the CPU) generate data matches the rate at which the fabric consumes it. Thus, PCIe throughput for small and medium messages, transferred via Programmed Input/Output (PIO), needs to improve to keep pace with core processing rates and fabric speeds. There is scope for such improvement by increasing the payload size of the PCIe Transaction Layer Packet (TLP). Traditionally, the CPU issues memory writes as 8-byte stores (the TLP payload), underutilizing the PCIe bus because the per-TLP overhead is large compared to the payload. Write combining can increase the TLP payload size, leading to more efficient utilization of the available bus bandwidth and thereby improving overall throughput.

This work evaluates the performance that can be gained by using the Write Combine Buffers (WCB) available on Intel CPUs for the send side interface of an HPC interconnect. These buffers aggregate small (usually 8-byte) memory-mapped I/O stores into larger PCIe Transaction Layer Packets (TLPs), which leads to better bus bandwidth utilization. It is observed that this technique improves peak PIO bandwidth by 2x compared to normal PIO, and that up to 4096 bytes, write-combine-enabled PIO outperforms DMA.

Keywords— HPC interconnect; write combining; network I/O; PCI bus bandwidth.

I. INTRODUCTION

A cluster-based HPC system mainly comprises three components: high-end compute servers, a high-performance fabric interconnect, and highly optimized software. The compute server is densely populated with a high core count, large system memory, and efficient I/O buses. System software, a parallel programming library, installation and execution scripts, statistics monitoring daemons, etc. are bundled into the software suite of the HPC system. The metrics for evaluating the performance of a fabric interconnect are bandwidth in bytes per second, latency in microseconds, and message rate in millions of messages per second. Optimizations at all levels and in all components of the HPC interconnect are required to achieve competitive performance. The send side interface (between host and NIC), the fabric, and the receive side interface (between NIC and host) broadly constitute any HPC fabric interconnect. This work targets improvement of the send side interface.

The function of the send side interface is to pass application data (which may or may not already be packetized) to the Network Interface Card (NIC), which sends the packetized data through the fabric to the destination node. This I/O is performed in two ways: Programmed Input/Output (PIO) and Direct Memory Access (DMA). Applications adopting message passing as their Inter-Process Communication (IPC) mechanism generally have two types of messages to share: latency-sensitive small messages (bytes) and bandwidth-hungry bulk transfers (megabytes). Choosing the I/O mode according to the message size can improve performance, and the threshold at which the I/O mode is switched from PIO to DMA should be set carefully to gain maximum performance. This work also derives this threshold for an HPC interconnect.

The write combine (WC) technique allows non-temporal streaming data to be stored temporarily in intermediate buffers and released together later in a burst, instead of being written to the destination in small chunks. The destination may be a next-level cache, system memory, or I/O memory. Intel CPUs contain special buffers, called write combine store buffers, to ease the L1 data cache miss penalty. When a memory region is defined as WC memory type, locations within that region are not cached and coherency is not enforced; speculative reads are allowed and writes may be delayed. WC is therefore most suitable for applications that require strongly ordered uncached reads but only weakly ordered streaming stores, such as graphics applications and network I/O.

II. RELATED WORK

Similar work on improving I/O bus bandwidth was done by Steen Larsen and Ben Lee, who compared the throughput achieved using DMA-based descriptors versus write-combined PIO [1]. The present work differs from Larsen's work [1] in the following ways. First, it deploys direct user I/O, i.e., the user application can PIO-write data directly without going through the kernel, saving the context switch penalty. The issue with bypassing the kernel is a sharing problem, which

causes contention between multiple processes for shared hardware queues. However, this problem is taken care of by allocating separate queues to the contending processes, so that each user has exclusive access to the queue it owns. Second, the PCIe Gen2 x8 link used in this work overcomes the 1 GB/s theoretical limitation of the x4 PCIe Gen1 bus in [1]. Third, the hardware developed for this work is a custom I/O adapter, not a standard Ethernet or InfiniBand adapter.

III. PROPOSED METHOD

Traditionally, PIO writes are executed in 8-byte chunks, since the CPU issues them as 8-byte stores. The problem with this is inefficient use of PCIe bandwidth, since the transaction layer overhead of the PCIe stack is 20-24 bytes for an 8-byte payload. In write combining, small PIO writes to the same cache line are buffered in dedicated buffers on the Intel CPU, and full cache line transfers happen over PCIe instead of partial cache line transfers. This technique is exploited in this work to do network I/O. It is achieved by declaring the I/O memory on the hardware device as WC memory type: the I/O memory is mapped directly into the address space of the user application with the write-combine attribute set. Small (8-byte) I/O writes by the application are buffered into the write-combine buffers (a.k.a. fill buffers) on the CPU core until a buffer is full; once full, the buffer is automatically evicted to the destination memory.

Timings used to calculate bandwidth are measured by reading the Time Stamp Counter (TSC) register on the CPU core using the rdtscp instruction [3]. The TSC is calibrated for the test system before use; on our test machine the TSC runs at 2.6025 GHz. The overhead of reading the TSC register is 44 TSC cycles, which is subtracted from the readings to get the pure PIO write time. Other system noise is also avoided, to prevent diluting the readings: to avoid the context switch penalty, the in-house developed benchmark process is pinned to one core, which it owns exclusively so that no other process shares it, and hardware interrupts are disabled for that core. Tests are repeated 1000 times, and the average of the readings within one standard deviation above the mean (μ + σ) is taken as the final reading.

This work also required the development of a simple PIO controller in the custom I/O adapter. Logic in this device processes PIO requests received from the CPU over the PCIe bus. The adapter has 2 KB of on-board memory available for user access.

IV. EXPERIMENTAL SETUP

Experiments for this work are carried out on an Intel Xeon E5-2650 v2, a Sandy Bridge based 16-core machine (2 threads per core) clocked at 2.6 GHz. The required custom I/O adapter is implemented on a Xilinx Kintex-7 FPGA (KC705 evaluation kit) operated at 250 MHz, and is interfaced to the CPU over a PCIe Gen2 interface. The CentOS 6.4 (2.6.32-358 kernel) distribution of Linux is used to operate the test server.

V. OBSERVATIONS AND ANALYSIS

It can be seen from Figure 1 that a 2x improvement is achieved in peak bandwidth for write-combine-enabled PIO (2.49 GB/s at 64 bytes) over normal PIO (1.182 GB/s at 32 bytes). Write-combine-enabled PIO bandwidth saturates at 665 MB/s (for data sizes > 64 KB), while normal PIO bandwidth saturates at 60 MB/s (for data sizes > 512 B); nearly 11x improvement is observed in sustained bandwidth for write-combine-enabled PIO against normal PIO. Figure 1 also shows that write-combine-enabled PIO outperforms DMA for message sizes below 4096 bytes.

The peak bit rate of an x8 Gen2 PCIe link is 40 Gb/s. Physical layer encoding reduces it by 20% (to 32 Gb/s, or 4 GB/s), and transaction layer overhead (a 24-byte header for 64 bytes of data) further limits it to 72.7% of that. Thus the practically achievable throughput is 2.88 GB/s.

Figure 1. Bandwidth comparison of write-combine-enabled PIO vs. normal PIO vs. DMA.

VI. CONCLUSION AND FUTURE WORK

Write-combined PIO can be used to transfer small and medium messages (up to 4 KB) from system memory to NIC buffers for inter-node IPC. DMA remains the suitable network I/O mode for large messages (> 4 KB).

There is still scope for improvement, as we could achieve only ~87% of the practically possible peak bandwidth of the x8 PCIe Gen2 link. The following optimizations can be tried to extract more out of PCIe. A parallel version of the in-house benchmark can be implemented, in which multiple threads or processes are spawned and each pumps data over the PCIe link to its exclusively owned memory on the hardware; together they might saturate the PCIe link. Pipelining PIO writes across all available write combine buffers (if this does not already happen implicitly) may also offer scope for performance improvement. Non-temporal data movement instructions, which bypass L1 and L2 lookups, may also help saturate the link [2]. Finally, the experiments can be repeated on more advanced configurations, such as a PCIe Gen3 interface and the Haswell platform, to evaluate the performance.

REFERENCES

[1] S. Larsen and B. Lee, "Reevaluation of Programmed I/O with Write-Combining Buffers to Improve I/O Performance on Cluster Systems," Int. Conf. on Networking, Architecture and Storage (NAS), Aug. 2015.
[2] L. Wang, "How to Implement a 64B PCIe* Burst Transfer on Intel Architecture," Intel, Feb. 2013.
[3] G. Paoloni, "How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures," Intel white paper, Sept. 2010.

2168-9253/17 $31.00 ©2017 IEEE    DOI 10.1109/CLUSTER.2017.109