
2010 39th International Conference on Parallel Processing

A Lightweight, GPU-Based Software RAID System


Matthew L. Curry, H. Lee Ward, Anthony Skjellum, Ron Brightwell

Department of Computer and Information Sciences, University of Alabama at Birmingham, Birmingham, Alabama 35294-1170
Email: {curryml,tony}@cis.uab.edu

Computer Science Research Institute, Sandia National Laboratories, Albuquerque, New Mexico 87185
Email: {lee,rbbrigh}@sandia.gov

Abstract: While RAID is the prevailing method of creating reliable secondary storage infrastructure, many users desire more flexibility than current implementations offer. Traditionally, RAID capabilities have been implemented largely in hardware in order to achieve the best possible performance, but hardware RAID has rigid designs that are costly to change. Software implementations are much more flexible, but software RAID has historically been viewed as much less capable of high throughput than hardware RAID controllers. This work presents a system, Gibraltar RAID, that attains high RAID performance by offloading the calculations related to error correcting codes to GPUs. This paper describes the architecture, performance, and qualities of the system. A comparison to a well-known software RAID implementation, the md driver included with the Linux operating system, is presented. While this work is presented in the context of high performance computing, the findings also apply to the general RAID market.

I. INTRODUCTION

RAID (Redundant Array of Independent Disks) is a technology that allows groups of disks to be joined and viewed as a single logical block device [1]. Chen et al. describe RAID as a method of combining several disks in order to present larger, single volumes in a familiar way, improve the overall reliability of the storage, and increase the overall speed of operations on the volume by exploiting parallel transfers from all disks. They define RAID levels 5 and 6, which obtain parallelism between disks by splitting data across them in a block-cyclic manner using a fixed chunk size. They show that transfers significantly larger than this chunk size can exploit multiple disks simultaneously for different parts of the transfer, while small transfers that are not close together can also exercise several disks independently.

Fault tolerance is a necessary part of RAID because combining multiple components increases the rate at which some component in the system fails. For example, RAID 0 is the only RAID level that does not contain redundant information. A RAID 0 array composed of n disks has the following Mean Time To Data Loss (MTTDL): MTTDL = MTTF/n, where MTTF is the mean time to failure of a single disk. Single disks can have an MTTF on the order of one million hours [2], yielding an expected lifetime of 114 years. However, even a moderately sized RAID 0 array of twenty disks can have an MTTDL of less than an installation's desired lifetime.
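To make that claim concrete, the arithmetic for the twenty-disk case, using the one-million-hour MTTF quoted above, is

\[ \mathrm{MTTDL} = \frac{\mathrm{MTTF}}{n} = \frac{10^{6}\ \text{hours}}{20} = 5 \times 10^{4}\ \text{hours} \approx 5.7\ \text{years}, \]

which is far less than the service life expected of most storage installations.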

Most RAID configurations include some level of redundancy, including RAID 5 and RAID 6, which contain parity. These parity-based levels generally implement systems of k + m disks in which any m disks may fail without losing data. RAID 5 and 6 have fixed values m = 1 and m = 2, respectively. Reliability is incorporated into the array by interleaving m chunks of parity for every k chunks of data, spreading them evenly over all n = k + m disks at the same offset to create a stripe. The documents describing RAID [1] specify that the parity is generated such that any k chunks of data and parity suffice to recover the k original chunks of data. The means of generation is left to the implementer, with many specialized erasure codes existing for RAID 6 [3], [4]. Reed-Solomon coding [5] can be used to perform k + m codings for m ≥ 2.

RAID implementations for high-performance workloads are typically realized in hardware, because hardware has historically provided much higher performance in many situations. RAID performs erasure correcting coding on all data written [1], necessitating a high-speed implementation of erasure correction. This work aims to achieve greatly improved software RAID performance by offloading erasure correction operations onto commodity graphics processing units (GPUs). The authors have already shown that GPUs are better suited than CPUs to performing Reed-Solomon coding [6], [7], [8]. This advantage is attributable to a lack of appropriate x86-64 instructions for performing Reed-Solomon coding [9]. While GPUs also lack these instructions, their memory architecture and large number of cores both contribute to much higher rates of coding and decoding. This work integrates the Gibraltar library [8], a GPU-based Reed-Solomon coding library, into a system called Gibraltar RAID.

II. MOTIVATION

This work occupies an interesting space between hardware- and software-based RAID controllers, inheriting attributes from each. The overall goal is to provide increased flexibility over hardware-based RAID implementations, and increased speed over software-based implementations. This section details the benefits of a high-performance software RAID implementation over hardware RAID.

A. Lightweight

One reason for the expense of high-performance RAID controllers is the lack of HPC-specific controllers in the marketplace. Instead, HPC customers purchase products aimed at the larger enterprise customer base. The enterprise market demands controllers that often have many more features than are strictly needed for HPC, such as deduplication [10]. By implementing only the features required for high-performance streaming I/O (that is, storing data reliably and serving data to a parallel filesystem), costs can remain relatively low while servicing HPC (and similar) needs just as effectively.

B. Inexpensive

Commodity hardware has impacted HPC for the better, yielding systems that are much more powerful while maintaining cost-effectiveness. For example, today's primary architecture for supercomputers is the cluster of workstations, which is largely created from commodity components. Initial investigations into this architecture were performed to field an economical alternative to expensive, proprietary systems [11]. As the commodity computing market grew, improvements to the processing capacity of home and business computers were leveraged by new clusters, forming more powerful supercomputers. The result is that the largest machines are built with commodity processor technology [12].

GPU vendors have continued the commodity HPC trend by opening a general-purpose API to program processors that are usually intended to render graphics for interactive animation [13]. GPUs have mostly been used to improve on CPU performance, with several important applications accelerated significantly by GPUs [14]. Applying GPUs to RAID is unusual because the GPU is intended to replace a non-commodity part: the erasure correction engines used in RAID controllers. The benefits of high production volume and continual improvement driven by the graphics market allow this application to benefit from improving GPUs without much effort. This situation is in direct contrast to the improvement of RAID controllers, which usually require a redesign to incorporate improvements in technology.

C. Extended RAID Features

While a reduced set of features is certainly a benefit, another goal of this project is to provide exactly the features that are required, including those not offered by less full-featured RAID controllers. Certain applications require high levels of reliability and data integrity, but RAID controllers often do not provide features beyond simple RAID 6. We have integrated two uncommon features into Gibraltar RAID: extra parity and read verification.

1) Extra Parity: Recently, much has been written indicating that RAID 6, which allows two disks to fail without data loss, will not be sufficient to ensure data reliability in the near future. A specific example questions the validity of hard disk manufacturers' reliability estimates of their own disks [15]. Others have also shown that many hard disk failures are correlated, resulting in an increased probability of array failure once the first disk has failed [16].

Further, the growing discrepancy between disk size and disk speed is increasing the chance of a disk failure during array reconstruction once a disk has already failed.

An additional problem is encountering an unrecoverable read error (URE) during reconstruction, after redundancy has been eliminated from the array. UREs can occur for many reasons, including media damage, a reduction of magnetic coating on the platters, or component wear, and they result in data loss for the affected sector. A RAID array operating without any failed disks can recover from this loss, but an array that lacks redundancy cannot. For example, if a RAID 6 array with two failed disks attempts to rebuild onto clean disks but receives a URE from one of the remaining disks, the contents of the affected sector are completely lost. If the two corresponding sectors on the failed disks are used for data blocks, then those sectors are not recoverable.

UREs have not historically been a great concern, but typical hard disk size has increased while the URE rate has not decreased. Given a typical SATA disk with a URE rate of one sector per 10^14 bits [17], a read error can occur (on average) once every 12.5 TB read. Many large RAID arrays are significantly larger than 12.5 TB and are heavily loaded, implying that such arrays have a high probability of encountering a URE in the course of a rebuild, causing data loss.

One mitigating strategy for UREs is scrubbing [18]. During scrubbing of an array, the RAID controller reads all disks from beginning to end, attempting to detect UREs and parity mismatches as a background process during normal operation of the array. When a URE is encountered, the controller recovers the contents of the sector from parity and requests that the disk rewrite the information. The disk will either rewrite the data (if the sector is not damaged) or remap the sector to one from a pool of spare sectors. This process is not possible without redundancy in the array to use for data recovery. Scrubbing is not a perfect solution: it consumes bandwidth, reducing the bandwidth available to client applications, and it does not work once parity in the array is exhausted and the array is rebuilding, which is exactly when the array is most likely to encounter a URE.

One way to lessen the above problems of UREs, less reliable disks, and batch-correlated failures is to introduce higher degrees of parity. Current hardware RAID controllers are statically configured with industry-standard and vendor-specific RAID levels, and do not allow arbitrary k + m redundancy within an array. The primary distinguishing feature of Gibraltar RAID is the ability to use whatever value of m users require for their applications, constrained only by space utilization and performance requirements. An array implemented with the controller described here can be dynamically tuned to provide the desired amount of parity for each installation, making efficient trade-offs among performance, reliability, capacity, and scale.


2) Read Verification: In large computer installations, component failure is unavoidable. Some of the failures that occur are not easily diagnosable because certain trouble conditions cannot be detected directly. For example, a faulty disk cable can be responsible for a large amount of silent data corruption between a RAID controller and its disks. A controller that does not verify reads cannot detect the problems imposed by a bad cable, creating mysterious data corruption and failures in applications. Furthermore, increasingly small feature sizes and the resulting highly dense computer installations increase the probability of single events (like an alpha particle or neutron strike) causing silent corruption of data. While most components do include their own error correction on the paths between components to guard against data corruption, missing error correction circuitry (and less robust types) can allow corrupted data to be propagated to a user application.

Gibraltar RAID detects corruption arising from these problems by verifying reads at high speed. Read verification can be used to recover from single events, to detect intermittent or chronic problems with components, and to continue operating in spite of less severe problems. Additionally, read verification can correct UREs without requesting extra data or seriously reducing the speed of operation, as the data required to recover the affected sector has already been requested for verification. This reduces the latency between encountering a URE and having the data available to recover the associated sector.

High-speed recovery of data also lends itself to read verification when some disks within an array have failed and a rebuild has not been completed. Gibraltar RAID supports this by recovering more of the data than is strictly necessary, based on the operating mode of the array; specifically, during recovery of a missing chunk, a different chunk in the same stripe that is available on disk is read and also recovered in the same recovery operation that regenerates the missing chunk. The recovered chunk and the read chunk can then be compared to ensure correctness. This process does not work without extra redundancy in the array, again indicating a need for enough parity to ensure that redundancy is always present.
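To illustrate the idea of read verification, the sketch below re-encodes the data chunks just read and compares the result against the parity read alongside them; any mismatch indicates silent corruption somewhere along the path. This is a minimal sketch, not the Gibraltar RAID code, and it uses single XOR parity (the RAID 5 case) in place of the Reed-Solomon codes Gibraltar RAID actually employs; the function name is illustrative.

#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Recompute single XOR parity over the k data chunks just read from disk and
 * compare it to the parity chunk read with them. Returns false on mismatch,
 * i.e., silent corruption somewhere between the platters and these buffers. */
bool verify_stripe_xor(unsigned char **data, int k,
                       const unsigned char *parity_on_disk, size_t chunk_len)
{
    unsigned char *recomputed = calloc(1, chunk_len);
    if (recomputed == NULL)
        return false;

    for (int d = 0; d < k; d++)
        for (size_t i = 0; i < chunk_len; i++)
            recomputed[i] ^= data[d][i];

    bool ok = (memcmp(recomputed, parity_on_disk, chunk_len) == 0);
    free(recomputed);
    return ok;
}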

III. ARCHITECTURAL DESCRIPTION

Because a GPU is used for coding (generating parity) and decoding (recovering lost data), Gibraltar RAID was implemented entirely in user space. Interactions between the RAID array and network clients can be managed through the Linux SCSI Target Framework [19]. The interface between Gibraltar RAID and the target is general, allowing Gibraltar RAID to be used with other network storage packages that operate in user space. Gibraltar RAID cannot take advantage of many facilities for I/O within the operating system kernel, so many portions of the underlying secondary storage stack had to be reimplemented within Gibraltar RAID in user space. This section details the mechanics of the components that have been reimplemented, and how the components communicate. Figure 1 is provided to aid the description that follows. Data paths are labeled as follows: writes are denoted with a w, reads with an r, and victimization with a v. Each interaction is noted with a letter indicating the data path and a number indicating the order in which the interactions occur.

A. Architectural Overview

Two main restrictions govern the overall design of a GPU-based RAID system. First and foremost, an application using the NVIDIA CUDA toolkit must run in user space: no public, documented interfaces to the NVIDIA runtime are available from a driver running in kernel space. Given that the iSCSI target used for this project runs in user space, the requirement that the coding occur in user space presents little logistical difficulty. If desired, a small driver that passes block requests to a user space daemon would not be difficult to construct; regardless, GPU operations must be performed in user space. To simplify design and development, and to minimize user mode to kernel mode transitions, the other components are also located in user space.

Second, as a user space service that requires knowledge of the contents of its cache in order to verify reads, Gibraltar RAID must bypass the Linux buffer cache when accessing disks. If this is not done, a user application may request data that is still in the buffer cache, and Gibraltar RAID will have no means of knowing whether that data came from the disk or was already in RAM. Parity would then be re-verified, which is expensive compared to simply serving the request from the buffer cache. It is possible to keep a large cache in user space and still use the buffer cache to schedule I/O to and from disk. However, an application that must manage large amounts of information in its own cache needs a high degree of control over the memory it uses (and the memory used by the kernel on the application's behalf) in order to avoid running afoul of Linux's optimistic memory allocation. If the system is low on memory and an application calls malloc(), virtual memory can be allocated to the process in the absence of physical memory; the assumption is that the memory either will not be completely used, or that more memory will become available to back the allocation. For storage applications like RAID, the general pattern is to write data to disk, then acquire more data from users or other input devices. If the buffer cache is using a large amount of memory, as it tends to do, it can interfere with the application's ability to maintain high bandwidth.

The requirement to bypass the Linux buffer cache necessitates using the O_DIRECT flag when opening the devices, which allows a user to perform I/O operations directly from user space memory rather than through the Linux buffer cache. O_DIRECT unfortunately conflicts with some CUDA functionality, including the special memory allocators for optimizing transfers between host memory and GPUs. Such allocators can map host memory into the GPU, which provides increased overlap between PCI-Express transfers and computation within the GPU, and can prevent memory regions from being paged out by the virtual memory system, which saves a memory copy when transferring data to a GPU. Direct I/O operations from memory regions obtained from these allocators fail, so these optimizations are unavailable. This decreases the effectiveness of the GPU significantly because of the need for extra host memory copies, but GPU operations still maintain significant speed improvements over a CPU-based software RAID.
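For reference, a minimal sketch of the direct-I/O pattern this implies is shown below (illustrative only; the device path and sizes are placeholders, not taken from Gibraltar RAID). O_DIRECT requires that buffers, offsets, and lengths be suitably aligned, which is one reason ordinary malloc() buffers cannot simply be handed to the disks.

#define _GNU_SOURCE               /* exposes O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* Open a member disk, bypassing the Linux buffer cache. */
    int fd = open("/dev/sdb", O_RDWR | O_DIRECT);
    if (fd < 0)
        return 1;

    /* Direct I/O needs sector-aligned memory; 4096 bytes covers both
     * 512-byte and 4 KB sector sizes. */
    void *chunk;
    if (posix_memalign(&chunk, 4096, 64 * 1024) != 0) {
        close(fd);
        return 1;
    }

    /* Read one 64 KB chunk from the start of the device. */
    ssize_t n = pread(fd, chunk, 64 * 1024, 0);

    free(chunk);
    close(fd);
    return n == 64 * 1024 ? 0 : 1;
}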


Fig. 1: Gibraltar RAID Architecture and Data Flow Diagram. (Figure not reproduced; it shows the Client Service Thread, Stripe Cache, Victimizer, Victim Cache, Erasure Coding component, I/O Scheduler, I/O Notifier, and Linux Async I/O, with pread and pwrite as entry points, connected by write (w), read (r), and victimization (v) data paths.)

B. System/RAID Interface

All interaction between the target and Gibraltar RAID occurs through functions implementing the interfaces of the standard C library calls pread() and pwrite(). Each function takes a pointer to a user buffer, an offset into the RAID device, and the length of the data to be copied to or from that buffer. Each request is mapped by these functions to the stripes it affects, and asynchronous requests for those stripes are submitted to the Gibraltar RAID cache. These requests are issued without knowledge of whether the stripes are present in the cache, and a stripe can be requested in one of two ways:

• Request the stripe with its full, verified contents; or
• Request a clean stripe, which may return an uninitialized stripe and does not incur disk I/O.

The second option is useful if the stripe is to be completely overwritten by a write request, or if it is expected to be overwritten. Client service threads, threads created by the target to satisfy read and write requests from network clients, call pread and pwrite.
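The mapping from a byte-level pread()/pwrite() request to the stripes it affects is simple arithmetic. A sketch of that mapping follows; the geometry structure and function names are hypothetical, not the actual Gibraltar RAID types.

#include <stdint.h>

/* Hypothetical array geometry: k data disks per stripe, fixed chunk size. */
struct geometry {
    uint64_t k;            /* number of data chunks per stripe */
    uint64_t chunk_bytes;  /* chunk size, e.g. 64 KB           */
};

/* Map a byte offset on the logical RAID device to the stripe that holds it
 * and the offset of that byte within the stripe's data region. */
void map_offset(const struct geometry *g, uint64_t byte_off,
                uint64_t *stripe_idx, uint64_t *off_in_stripe)
{
    uint64_t stripe_bytes = g->k * g->chunk_bytes;  /* data bytes per stripe */
    *stripe_idx    = byte_off / stripe_bytes;
    *off_in_stripe = byte_off % stripe_bytes;
}

/* A request covering [off, off + len) touches stripes first..last inclusive;
 * one asynchronous cache request is submitted for each of them. */
void map_request(const struct geometry *g, uint64_t off, uint64_t len,
                 uint64_t *first, uint64_t *last)
{
    uint64_t unused;
    map_offset(g, off, first, &unused);
    map_offset(g, off + len - 1, last, &unused);
}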

For reads, additional stripes may also be requested in order to provide a read-ahead capability that takes advantage of the spatial locality of disk accesses. In the case of streaming reads, there is a high probability that a read for one block of data will quickly be followed by a request for the next block. The Linux kernel buffer cache does this as well; read ahead is necessary to get the best performance out of a storage system under streaming workloads.

After all read or write requests have been registered, the client service thread waits for the requests to be fulfilled. In the case of a read, the routine passes each stripe to the erasure coding component to be verified or recovered. Writes can be recorded as incomplete updates to a stripe, anticipating that the whole stripe will eventually be overwritten. If this does not happen before the stripe is chosen for victimization, the stripe must be read, verified, merged with the contents of the incomplete update, updated with regenerated parity, and written.

C. Stripe Cache

Gibraltar RAID includes a stripe cache that operates asynchronously with respect to the I/O thread and the user's read and write requests. On a read request to the cache, the cache submits requests to the I/O thread to read the relevant stripes from disk in their entirety, including parity blocks for read verification. This full-stripe operation can be viewed as a slight mandatory read ahead.

Writes have a simple implementation in the stripe cache, as they do not require immediate disk operations. The cache is optimistic: if a write request does not fill an entire stripe, the cache assumes that the stripe will be completely populated before being flushed to disk. Therefore, no reads are required before a partial update to a stripe is made. If a stripe is in the process of being read, the cache forces the client thread to wait until the read has completed. If the stripe has already been read, it is simply given to the client thread to be updated with the write contents. Otherwise, a blank stripe is returned, and the client is responsible for maintaining the clean/dirty statistics related to the writes requested.

A victimizer thread occasionally (in response to an external stimulus from the cache related to memory pressure) deletes clean stripes and flushes dirty stripes. The interface between the victimizer and the cache is flexible, allowing many different types of algorithms to be implemented. The default mode of operation is the Least Recently Used (LRU) caching algorithm, but higher quality algorithms may be used. Furthermore, the victimizer can have an internal timer used to implement write-behind caching.
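A skeletal view of the victimization pass just described, with LRU ordering, is sketched below; the structures and helper names are hypothetical, not the actual Gibraltar RAID code. Clean stripes are simply dropped, while dirty stripes are handed to the I/O scheduler for write-out.

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stripe bookkeeping, not the actual Gibraltar RAID structures. */
struct stripe {
    struct stripe *lru_prev;   /* toward the most recently used end */
    bool dirty;                /* holds data not yet written to disk */
    bool busy;                 /* currently referenced by a client   */
};

static void submit_flush(struct stripe *s)   { (void)s; /* hand to the I/O scheduler */ }
static void destroy_stripe(struct stripe *s) { (void)s; /* release the stripe memory */ }

/* Walk from the least recently used end until enough stripes have been
 * reclaimed or scheduled for write-out. Invoked on memory pressure, or by an
 * internal timer when write-behind caching is enabled. */
void victimize(struct stripe *lru_tail, int wanted)
{
    for (struct stripe *s = lru_tail; s != NULL && wanted > 0; ) {
        struct stripe *prev = s->lru_prev;   /* s may be destroyed below */
        if (!s->busy) {
            if (s->dirty)
                submit_flush(s);             /* flush dirty stripe */
            else
                destroy_stripe(s);           /* drop clean stripe  */
            wanted--;
        }
        s = prev;
    }
}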

D. I/O Scheduler

The scheduler receives requests from the cache for reading stripes, and requests from the victimizer for writing stripes. It accumulates these requests in a queue until it is ready to service them. All of the requests are received as a batch to facilitate combining requests that are adjacent on disk.

Only write requests are combined, for the following reasons. Combining reads would hurt client latency: reads are necessarily synchronous, so a large combined read request would force all clients to wait until the entire combined request is serviced, and our experiments show that short contiguous reads already obtain good performance from a disk. Writes, however, can be significantly slower (depending on qualities of the system such as hardware and configuration) unless a comparatively large amount of data is accumulated and written at once. Figure 2 demonstrates this property: contiguous writes of 16 megabytes are slightly slower than contiguous reads of 64 kilobytes. This performance degradation is only apparent for files or devices opened with the O_DIRECT flag, which bypasses the Linux kernel buffer cache. For normal disk operation, write requests are combined in the buffer cache, hiding this potential performance issue.

Fig. 2: Performance of a single disk in a RS-1600-F4-SBD switched JBOD over 4 Gbps Fibre Channel. (Plot not reproduced; streaming read and write throughput in MB/s versus operation size from 1 KB to 16384 KB.)

The scheduler takes all waiting requests as a batch, ordering and combining them as necessary to achieve the highest possible disk bandwidth. The ordering algorithm currently used is the circular elevator algorithm (C-SCAN).
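For illustration, one way to produce a C-SCAN ordering of a batch of requests is sketched below (a minimal sketch with a hypothetical request record, not the Gibraltar RAID scheduler, which also combines adjacent writes): requests at or beyond the current head position are serviced in ascending order, and the head then wraps around to the lowest outstanding offset.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical pending request; only the starting disk offset matters here. */
struct request {
    uint64_t offset;
    /* ... buffers, length, completion bookkeeping ... */
};

static int by_offset(const void *a, const void *b)
{
    uint64_t x = ((const struct request *)a)->offset;
    uint64_t y = ((const struct request *)b)->offset;
    return (x > y) - (x < y);
}

/* Reorder 'n' requests into 'out' in C-SCAN order, given the current head
 * position: ascending offsets from 'head' to the end, then wrap to the start. */
void cscan_order(struct request *reqs, size_t n, uint64_t head,
                 struct request *out)
{
    qsort(reqs, n, sizeof *reqs, by_offset);

    size_t split = 0;
    while (split < n && reqs[split].offset < head)
        split++;                                            /* first request >= head */

    memcpy(out, reqs + split, (n - split) * sizeof *reqs);  /* head .. end   */
    memcpy(out + (n - split), reqs, split * sizeof *reqs);  /* wrap to start */
}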

Combining requests is conceptually simple: if two writes are adjacent on disk, a vector write can combine them into a single call. One implementation of this type of call available to applications is pwritev(), a vectored version of pwrite(). Vectored operations allow a contiguous area of a disk to be the target of a combined write gathered from non-contiguous areas of memory. However, using pwritev() is not the best strategy for a RAID implementation. Asynchronous I/O, which allows the Linux kernel to manage read and write requests in the background, is more sensible for storage-intensive applications. Initially, this implementation used the pthreads library to perform synchronous reads and writes with one thread assigned per disk. Switching to asynchronous reads and writes allowed for more efficient use of resources than CPU-intensive pthread condition variables with a high thread-to-core ratio allow.

While asynchronous I/O has been available in the Linux kernel and C libraries for some time, the methods for performing asynchronous vector I/O are not well documented. There is a way to use io_submit() to submit iovec structures, but it is unconventional and difficult to discover. The typical usage of io_submit() takes an array of iocb structures as a parameter, each describing an individual I/O operation to submit asynchronously. To use the relatively new vectored read and write capabilities, however, one points the iocb at an array of iovec structures rather than filling in the usual iocb_common fields, and these are then used to perform a vectored I/O operation. This is not noted in the system documentation.
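As a reference point, the sketch below submits one combined, vectored write using the libaio helper io_prep_pwritev(), which fills in the iocb fields that the text above describes setting by hand. The helper and wrapper names here (submit_vectored_write, wait_for_one) are illustrative; the file descriptor, buffers, and offset are placeholders; ctx is an io_context_t previously initialized with io_setup(); and error handling is omitted. The data buffers must remain valid until the matching completion event is collected.

/* Build with: cc -o aiov aiov.c -laio */
#include <libaio.h>
#include <stdlib.h>
#include <sys/uio.h>

/* Submit one vectored write: two non-contiguous memory regions written to a
 * single contiguous region of the disk starting at 'offset'. */
int submit_vectored_write(io_context_t ctx, int fd,
                          void *buf_a, void *buf_b, size_t len,
                          long long offset)
{
    struct iovec iov[2] = {
        { .iov_base = buf_a, .iov_len = len },
        { .iov_base = buf_b, .iov_len = len },
    };
    struct iocb cb;
    struct iocb *list[1] = { &cb };

    io_prep_pwritev(&cb, fd, iov, 2, offset);  /* vectored, asynchronous write */
    return io_submit(ctx, 1, list);            /* > 0 on success: iocbs queued */
}

/* Block until one completion arrives; in Gibraltar RAID this role belongs to
 * the I/O Notifier thread described below. */
int wait_for_one(io_context_t ctx)
{
    struct io_event ev;
    return io_getevents(ctx, 1, 1, &ev, NULL);
}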


E. I/O Notifier

The I/O Notifier is a thread that collects the events resulting from the asynchronous I/O calls and performs record keeping on a per-stripe basis. Once all of the asynchronous I/O calls for a stripe have completed, the notifier notifies the other threads that depend on that stripe. If the stripe is undergoing eviction from the cache, this thread initiates destruction of the stripe upon I/O completion.

F. Victim Cache

Given the rate at which new I/O requests can arrive at the RAID controller, there is a significant delay between the decision to victimize a dirty stripe and the completion of the write associated with it. Canceling a write in progress is inefficient because of write combining and the asynchronous completion of writes. To aid in victimization, a victim cache is included that allows a client read or write request to rescue a stripe from being deleted before it has been written, or even while the write is still in progress. Maintaining a separate cache allows many flush requests to be in flight simultaneously without overloading the hash table of the main cache. A separate cache also requires fewer lookups, and thus less locking, during the victimization of a stripe.

G. Erasure Coding Component

The erasure coding component uses the Gibraltar library, which was designed to perform Reed-Solomon encoding and decoding at high rates using GPUs [8]. Briefly, the Gibraltar library accepts k buffers of data, such as the striped data of a RAID array, and returns m buffers of parity. This GPU-based parity calculation can encode and decode at well over four gigabytes per second for RAID 6 workloads on commodity GPUs. Figure 3 shows RAID 6 performance over a variety of stripe sizes compared to a CPU implementation of Reed-Solomon coding, Jerasure [20].

A unique feature of Gibraltar RAID is the ability to extend far beyond RAID 6 in the number of disks that may fail without data loss. Figure 4 shows performance for RAID TP, a triple-parity RAID level that can tolerate three disk failures. These tests were performed with a GeForce 285 and an Intel Extreme Edition 965.

The most compelling result is that the GeForce 285, though costing approximately 60% less than the Intel processor, demonstrated significantly higher performance. GPU parity calculation entails transferring significant amounts of data across the PCI-Express bus to the GPU, which means that using the Gibraltar library in this system also incurs significant PCI-Express traffic. This traffic can be a concern if other hardware, such as network adapters and host bus adapters, also makes heavy use of the PCI-Express bus.


Fig. 3: RAID 6 (m = 2) Performance of GPU, 1 MB Chunk Size. (Plot not reproduced; throughput in MB/s versus k for Gibraltar coding, Gibraltar decoding, Jerasure coding, and Jerasure decoding.)


Fig. 4: RAID-TP (m = 3) Performance of GPU, 1 MB Chunk Size. (Plot not reproduced; same series as Figure 3.)

IV. PERFORMANCE

To test the performance of Gibraltar RAID, an experiment was constructed to measure the speed of streaming raw I/O to a RAID array managed by Linux md, the software RAID driver included with the Linux operating system, and by Gibraltar RAID. All measurements were taken without the iSCSI target, instead targeting the base devices directly. For both reads and writes, requests are submitted to the array starting at offset zero with an I/O size of one megabyte (2^20 bytes), and the operations are continued for 100 gigabytes (100 × 2^30 bytes). All configurations are tested in normal mode (i.e., no disks failed) and in all degraded modes (i.e., at least one disk unavailable) supported by the RAID levels tested.

Each test has been run three times, and the maximum bandwidths are reported; for cases with high variation, minimum bandwidths are also reported. Each array was configured to have 16 disks of data capacity, plus the number of additional disks required by the RAID level implemented (i.e., two extra disks for RAID 6 and three extra disks for RAID TP). A chunk size of 64 KB was used. With Linux md, the array was built with the assume-clean option because md would otherwise attempt to calculate all parity blocks for the array. While the data in the md array is not initially consistent, the inconsistency does not cause any difficulty with the tests as executed; writes are performed prior to reads, so all data is consistent when read.

Tests for both Gibraltar RAID and md were performed on the same server. It has two quad-core Intel Xeon X5550 CPUs and 24 GB of DDR3 RAM clocked at 1333 MHz. The GPU used is an NVIDIA Tesla C1060. Samsung Spinpoint F1 HE103UJ disks are connected to an LSI SAS1068E disk controller.

Table I gives the results of the performance testing. One notable feature is that Gibraltar RAID's read bandwidth is slightly lower than that of md in normal mode. This is expected, as Gibraltar RAID performs read verification with all available parity, which consumes additional disk bandwidth. As the above configuration is bandwidth-limited for reads by the disk controller, the bandwidth consumed by reading parity slightly reduces the delivered data rate. However, total bandwidth can be calculated for Gibraltar RAID by dividing its rate by the number of data blocks per stripe and multiplying by the total number of blocks per stripe, yielding comparable total bandwidths: 99.2% and 97.3% of md's rate for Gibraltar RAID 6 and Gibraltar RAID TP, respectively. This is an important result: parity verification does not significantly reduce the speed of reads in these configurations; only the extra bandwidth required to read parity causes a performance decrease.

One notable trend is that the variability between runs of the write test on Gibraltar RAID can be quite high, with up to a 40% difference between runs, while the read bandwidth tests had almost no variability. As this is an early prototype that focuses first on functionality rather than performance, some variation is to be expected. We estimate that the variation results from a potential serialization driven by memory usage constraints. While the cache is filling, the amount of free memory available to the cache decreases. When the victimizer is awakened by the cache as it runs out of memory, the nearly full cache causes the writes accepted into the system to slow to the rate at which disk writes and GPU coding occur serially, as this is the rate at which memory is freed for new requests. A batch must proceed through the GPU parity generation component and be written out to disk before a new set of writes can be accepted into the memory that has just been released. In many situations this condition does not occur and writes proceed at the same speed as reads; however, this is not always the case, and the tests reflect this undesired behavior. This can be considered a lesson learned, and an opportunity for future improvement.


TABLE I: Performance of Gibraltar RAID compared to Linux md for streaming operations to the raw device. Measured in MB/s; higher is better.

              Linux md, RAID 6    Gibraltar, RAID 6    Gibraltar, RAID TP
Write         612                 637 (min: 438)       727 (min: 589)
Read          878                 774                  720
Write (sf)    369                 669 (min: 470)       743 (min: 438)
Read (sf)     262                 795                  721
Write (df)    369                 602 (min: 501)       759 (min: 462)
Read (df)     253                 789                  742
Write (tf)    N/A                 N/A                  649 (min: 489)
Read (tf)     N/A                 N/A                  767

sf: single failure; df: double failure; tf: triple failure.
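As a check of the read-verification overhead argument above, scaling the normal-mode read rates in Table I by the ratio of total blocks to data blocks per stripe (18/16 for RAID 6, 19/16 for RAID TP) reproduces the quoted figures:

\[ 774\ \mathrm{MB/s} \times \frac{18}{16} \approx 871\ \mathrm{MB/s}, \qquad 720\ \mathrm{MB/s} \times \frac{19}{16} = 855\ \mathrm{MB/s}, \]

which are 99.2% and roughly 97.3% of md's 878 MB/s, matching the percentages cited in the text.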

Degraded mode performance for Gibraltar RAID is significantly better than that of md, showing no performance reduction relative to normal mode. In fact, performance typically improves in degraded mode because less data is read for parity verification. In comparison, md suffers a performance decrease of 70% for reads. Our previous work on the Gibraltar library focused on making coding and decoding similar operations, allowing degraded mode to be as fast as normal mode [8].

V. CONCLUSION

High-performance software RAID with commodity hardware assist has the potential to provide an economical means of high-speed storage for a multitude of uses. Its flexibility enables applications that even many hardware RAID controllers are not capable of servicing. For example, environments that are particularly harsh and space-constrained can apply the extra parity and read verification to provide much more reliable storage in the face of difficult conditions.

However, the work presented in this paper does not require extraordinary conditions to show benefits to users. High-performance streaming I/O is a basic requirement of many applications, but typical software RAID and inexpensive hardware RAID controllers may not be capable of supporting that level of performance. Many applications require only high-performance streaming I/O and data reliability, but historically only expensive RAID controllers could provide both; Gibraltar RAID, when paired with an economically priced GPU, can also fill that niche. This work demonstrates that parity generation and data recovery after disk failure can occur at line speed for many disk installations by using a GPU. Further, as Gibraltar RAID can verify data at high speed, users are protected from a whole class of errors that cannot be prevented without similar measures in other controllers. As many software RAID packages do not verify reads, the type of technology presented here would be a natural upgrade. This is a significant point because hardware RAID controllers with this capability are much more expensive than the GPU required to perform these computations. Furthermore, hardware RAID controllers that do support read verification are rare.

While this work currently uses GPUs, GPUs are not the only type of device that can be used to calculate parity. As technology evolves, different computation devices will be preferred based on analyses of cost versus performance. The Gibraltar library, the underlying Reed-Solomon coding and decoding library for Gibraltar RAID, is flexible enough to allow different (or multiple) devices to be tasked with parity computation. The changing computational landscape, with a multitude of accelerators and CPU types available, will be well served in the future by a user space RAID infrastructure. We intend to test Gibraltar with a variety of other computation devices and methods.

VI. FUTURE WORK

It has been shown that GPUs are fully capable of sustaining parity generation and data reconstruction beyond the rates sustainable by the software RAID implementations available for this study. However, previously mentioned issues such as memory fragmentation, pinning and mapping of memory, and direct I/O have made parity operations slower than they could be. In the future, we plan to investigate methods of solving these memory problems within the controller. Extra computational capacity obtained from such improvements could be used to provide other I/O-related services, or to serve other applications entirely.

While the tests in this paper cover the speed of the raw device, it is important to determine the achievable performance through the intended means of access. This RAID infrastructure is currently implemented as an easily integrated modification to the Linux Target Framework, which allows network access to storage via iSCSI, iSER, FCoE, and other protocols. The current top-priority task as of this writing is to integrate machines running this software into clusters with high-speed interconnects, so that we can perform multi-client tests over a variety of protocols. A further access option that can be provided is a small driver allowing direct access to the array as a local block device.

VII. ACKNOWLEDGEMENTS

This work was supported by the United States Department of Energy under Contract DE-AC04-94AL85000. This work was also supported by the National Science Foundation under grant CNS-0821497.

REFERENCES

[1] P. M. Chen, E. K. Lee, G. A. Gibson, R. H. Katz, and D. A. Patterson, RAID: High-performance, reliable secondary storage, ACM Computing Surveys, vol. 26, no. 2, pp. 145-185, 1994.
[2] Seagate Technology LLC, Barracuda ES.2 data sheet. http://www.seagate.com/docs/pdf/datasheet/disc/ds_barracuda_es_2.pdf, 2008.
[3] P. Corbett, B. English, A. Goel, T. Grcanac, S. Kleiman, J. Leong, and S. Sankar, Row-diagonal parity for double disk failure correction, in Proceedings of the 3rd USENIX Symposium on File and Storage Technologies (FAST 04), pp. 1-14, 2004.
[4] J. S. Plank, J. Luo, C. D. Schuman, L. Xu, and Z. Wilcox-O'Hearn, A performance evaluation and examination of open-source erasure coding libraries for storage, in FAST-2009: 7th USENIX Conference on File and Storage Technologies, February 2009.
[5] I. S. Reed and G. Solomon, Polynomial codes over certain finite fields, Journal of the Society for Industrial and Applied Mathematics, vol. 8, no. 2, pp. 300-304, 1960.
[6] M. L. Curry, A. Skjellum, H. Ward, and R. Brightwell, Accelerating Reed-Solomon coding in RAID systems with GPUs, in IEEE International Symposium on Parallel and Distributed Processing, 2008, pp. 1-6, April 2008.
[7] M. L. Curry, H. L. Ward, A. Skjellum, and R. Brightwell, Arbitrary dimension Reed-Solomon coding and decoding for extended RAID on GPUs, in 3rd Petascale Data Storage Workshop held in conjunction with SC08, November 2008.
[8] M. L. Curry, A. Skjellum, H. L. Ward, and R. Brightwell, Gibraltar: A library for RAID-like Reed-Solomon coding on programmable graphics processors, 2010. Submitted to Concurrency and Computation: Practice and Experience.
[9] R. Bhaskar, P. K. Dubey, V. Kumar, and A. Rudra, Efficient Galois field arithmetic on SIMD architectures, in SPAA 03: Proceedings of the Fifteenth Annual ACM Symposium on Parallel Algorithms and Architectures, (New York, NY, USA), pp. 256-257, ACM Press, 2003.
[10] D. Geer, Reducing the storage burden via data deduplication, Computer, vol. 41, pp. 15-17, Dec. 2008.
[11] T. Sterling, D. J. Becker, D. Savarese, J. E. Dorband, U. A. Ranawake, and C. V. Packer, Beowulf: A parallel workstation for scientific computation, in Proceedings of the 24th International Conference on Parallel Processing, pp. 11-14, CRC Press, 1995.
[12] Top500 supercomputing sites. http://www.top500.org.
[13] NVIDIA Corporation, NVIDIA CUDA Compute Unified Device Architecture Programming Guide. Santa Clara, CA, 2007.
[14] J. Owens, M. Houston, D. Luebke, S. Green, J. Stone, and J. Phillips, GPU computing, Proceedings of the IEEE, vol. 96, pp. 879-899, May 2008.
[15] B. Schroeder and G. A. Gibson, Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?, in Proceedings of the 5th USENIX Conference on File and Storage Technologies, (Berkeley, CA, USA), pp. 1-1, USENIX Association, 2007.
[16] J.-F. Pâris and D. D. E. Long, Using device diversity to protect data against batch-correlated disk failures, in StorageSS 06: Proceedings of the Second ACM Workshop on Storage Security and Survivability, (New York, NY, USA), pp. 47-52, ACM Press, 2006.
[17] Seagate Technology LLC, Barracuda LP SATA datasheet. http://www.seagate.com/ww/v/index.jsp?name=st32000542as-bcuda-lp-sata-2tb-hd&vgnextoid=1f70e5daa90b0210VgnVCM1000001a48090aRCRD&locale=en-US#tTabContentSpecifications, 2010.
[18] J.-F. Pâris, A. Amer, D. Long, and T. Schwarz, Evaluating the impact of irrecoverable read errors on disk array reliability, pp. 379-384, Nov. 2009.
[19] The Linux SCSI Target Framework. http://stgt.sourceforge.net/.
[20] J. S. Plank, S. Simmerman, and C. D. Schuman, Jerasure: A library in C/C++ facilitating erasure coding for storage applications - Version 1.2, Tech. Rep. CS-08-627, University of Tennessee, August 2008.

