Documenti di Didattica
Documenti di Professioni
Documenti di Cultura
ABSTRACT sult, a write-intensive workload can wear out the SSD within
Deployment of SSDs in enterprise settings is limited by the months. Also, this erasure limit continues to decrease as
low erase cycles available on commodity devices. Redun- MLC devices increase in capacity and density. As a conse-
dancy solutions such as RAID can potentially be used to pro- quence, the reliability of MLC devices remains a paramount
tect against the high Bit Error Rate (BER) of aging SSDs. concern for its adoption in servers [4].
Unfortunately, such solutions wear out redundant devices at
similar rates, inducing correlated failures as arrays age in In this paper, we explore the possibility of using device-level
unison. We present Diff-RAID, a new RAID variant that redundancy to mask the effects of aging on SSDs. Clustering
distributes parity unevenly across SSDs to create age dispari- options such as RAID can potentially be used to tolerate the
ties within arrays. By doing so, Diff-RAID balances the high higher BERs exhibited by worn out SSDs. However, these
BER of old SSDs against the low BER of young SSDs. Diff- techniques do not automatically provide adequate protec-
RAID provides much greater reliability for SSDs compared tion for aging SSDs; by balancing write load across devices,
to RAID-4 and RAID-5 for the same space overhead, and solutions such as RAID-5 cause all SSDs to wear out at ap-
offers a trade-off curve between throughput and reliability. proximately the same rate. Intuitively, such solutions end
up trying to protect data on old SSDs by storing it redun-
dantly on other, equally old SSDs. Later in the paper, we
1. INTRODUCTION quantify the ineffectiveness of such an approach.
Solid State Devices (SSDs) have emerged in the last few
years as a viable secondary storage option for laptops and
We propose Differential RAID (Diff-RAID), a new RAID
personal computers. Now, SSDs are poised to replace con-
technique designed to provide a reliable server storage so-
ventional disks within large-scale data centers, potentially
lution using MLCs. Diff-RAID leverages the fact that the
providing massive gains in I/O throughput and power effi-
BER of an SSD increases continuously through its lifetime,
ciency for high-performance applications. The major barrier
starting out very low and reaching the maximum specified
to SSD deployment in such settings is cost [3]. This barrier
rate as it wears out. Accordingly, Diff-RAID attempts to
is gradually being lifted through high-capacity Multi-Level
create an age differential among devices in the array, bal-
Cell (MLC) technology, which has driven down the cost of
ancing the high BER of older devices against the low BER
flash storage significantly.
of young devices.
Yet, MLC devices are severely hamstrung by low endurance
To create and maintain this age differential, Diff-RAID mod-
limits. Individual flash blocks within an SSD require expen-
ifies conventional RAID in two ways. First, it distributes
sive erase operations between successive page writes. Each
parity unequally across devices; since parity blocks are up-
erasure makes the device less reliable, increasing the Bit Er-
dated more frequently than data blocks due to random writes,
ror Rate (BER) observed by accesses. Consequently, SSD
this unequal distribution forces some devices to age faster
manufacturers specify not only a maximum BER (usually
than others. Diff-RAID supports arbitrary parity assign-
between 10−14 to 10−15 [2], as with conventional hard disks),
ments, providing a trade-off curve between throughput and
but also a limit on the number of erasures within which this
reliability. Second, Diff-RAID redistributes parity during
BER guarantee holds. For MLC devices, the rated erasure
device replacements to ensure that the oldest device in the
limit is typically 5,000 to 10,000 cycles per block; as a re-
array always holds the most parity and ages at the highest
rate. When the oldest device in the array is replaced at
some threshold age, its parity blocks are assigned across all
Permission to make digital or hard copies of all or part of this work for
the devices in the array and not just to its replacement.
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies Diff-RAID’s ability to tolerate high BERs on aging SSDs
bear this notice and the full citation on the first page. To copy otherwise, to provides multiple advantages. First, it provides a much
republish, to post on servers or to redistribute to lists, requires prior specific higher degree of reliability for SSD arrays compared to stan-
permission and/or a fee. dard RAID-5 or RAID-4 configurations. Second, it opens
ACM SIGOPS Operating Systems Review (c) 2010 ACM ISSN 0163-5980
/10/01...$10.00
the door to using commodity SSDs past their erasure limit,
55
protecting the data on expired SSDs by storing it redun- 3. WHY NOT RAID?
dantly on younger devices. Third, it potentially reduces the RAID techniques induce similar update loads on multiple
need for expensive hardware Error Correction Codes (ECC) disks, which in turn, may induce correlated failures in SSDs
in the devices; as MLC densities continue to increase, the as noted above. For RAID-1, RAID-10 and RAID-01, mir-
cost of masking high error rates with ECC is prohibitive. rors are exposed to identical workloads and grow old to-
These benefits are achieved at a slight cost in throughput gether. All the devices in RAID-5 receive roughly equiva-
degradation and in device replacement complexity; these lent workloads and age at similar rates. RAID-4 does im-
trade-offs are explored in detail below. pose a slightly staggered wear-out schedule; the parity device
wears out at a much faster rate than the other devices for
2. BACKGROUND workloads dominated by random writes, and has to be re-
NAND-based SSDs exhibit failure behaviors which differ placed much before them. However, the non-parity devices
substantially from hard disks. Most crucially, the Bit Error in RAID-4 age at similar rates. Thus, each of these RAID
Rate (BER) of modern SSDs is uniform across all disk pages, options imposes a lock-step wear-out schedule on devices in
and grows strongly with the number of updates the SSD the array.
receives. In order to understand this behavior, we briefly
overview basic features of SSDs. As long as each SSD is replaced immediately as it hits the
erasure limit, the SSD array is as reliable as an array con-
SSDs are composed of pages, which are the smallest units structed from hard disks of comparable maximum BER.
that can be read or programmed (written). Pages are struc- However, if the SSDs are allowed to continue past their era-
tured together into larger units called blocks, which are the sure limit, the reliability of the array falls drastically. Hence,
smallest units that can be erased (all bits reset to 1s). Shar- when one of the devices fails, the probability of another fail-
ing erasure circuitry across multiple pages allows for higher ure during the RAID reconstruction is very high. Conse-
bit densities by freeing up chip area, reducing overall cost. quently, the overall reliability of conventional RAID arrays
Bits can only be programmed in one direction, from 1s to 0s; of SSDs varies highly, hitting a peak as multiple devices
as a result, a rewrite to a specific page requires an erasure reach their erasure limit simultaneously. A more quantita-
of the block containing the page. tive analysis of this is given in Section 5 below.
Modern SSDs provide excellent performance despite these Essentially, the wear-out schedule imposed on different de-
design restrictions by using wear-levelling algorithms that vices by a redundancy scheme heavily impacts the overall
store data in log format on flash. These wear-levelling al- reliability of the array. The load balancing properties of
gorithms ensure that wear is spread evenly across the SSD, schemes such as RAID-5 – which make them so attractive
resulting in a relatively uniform BER across the SSD. How- for hard disks – result in lower array reliability for SSDs.
ever, SSDs are still subject to an erasure limit, beyond which They induce a lock-step SSD wear-out schedule that leaves
the BER of the device becomes unacceptably high. SSDs the array vulnerable to correlated failures.
typically use hardware ECC to correct bit errors and extend
the erasure limit; in this paper, we use the term BER to re- Required is a redundancy scheme that can distribute writes
fer to the post-ECC error rate, also called the Uncorrectable unevenly across devices, creating an imbalance in update
Bit Error Rate (UBER), unless specified otherwise. load across the array. Such a load imbalance will translate
into different aging rates for SSDs, allowing staggered device
First-generation SSDs used SLC (Single-Level Cell) flash, wear-outs over the lifetime of the array and reducing the
where each flash cell stores a single bit value. This variant probability of correlated failure. This observation leads us to
of flash has relatively high endurance limits – around 100,000 introduce novel RAID variants, which we collectively name
erase cycles per block – but is prohibitively expensive; as a Diff-RAID.
result, it is now used predominantly in top-tier SSDs meant
for high-performance applications. Recently, the emergence 4. DESIGN
of MLC (Multi-Level Cell) technology has dropped the price The goal of Diff-RAID is to use device-level redundancy to
of SSDs significantly (for example, the Intel X25-M [1] is mask the high BERs of aging SSDs. The basic idea is to store
a popular MLC SSD priced at $4 per GB at the time of data redundantly across devices of different ages, balancing
writing), storing multiple bits within each flash cell to ex- the high BER of old SSDs past their erasure limit against
pand capacity without increasing cost. However, MLC de- the low BER of young SSDs. Diff-RAID implements this
vices have a major drawback: the endurance limit drops idea by creating and maintaining age differentials within a
to around 10,000 erase cycles per block. As flash geome- RAID array.
tries shrink and MLC technology packs more bits into each
cell, the price of SSDs is expected to continue decreasing; We measure the age of a device by the average number of
unfortunately, their pre-ECC BER is expected to increase, cycles used by each block in it, assuming a perfect wear-
creating the possibility of lower erasure limits. levelling algorithm that equalizes this number across blocks.
For example, if each block within a device has used up
In this paper, we re-examine the use of standard device-level roughly 7,500 cycles, we say that the device has used up
redundancy techniques in face of the unreliability character- 7,500 cycles of its lifetime.
istics of MLC devices, as they approach and cross the erasure
limit. Since, as we noted above, SSDs differ from hard disks Diff-RAID controls the aging rates of different devices in the
in fundamental ways, this requires a rethink of traditional array by distributing parity blocks unevenly across them.
redundancy solutions such as RAID. Random writes cause parity blocks to attract much more
56
Replacement 1 Parity other parity assignments will result in less disparity, with
Distribution: RAID-5 at the other extreme providing no disparity.
10,000 Erasures
SSD0: 70%
SSD1: 10% Ideally, we would like the aging rate of a device to be pro-
portional to its age, so that the older a device, the faster it
Age
SSD2: 10% ages. The intuition here is simple: the presence of an old
SSD3: 10% SSD makes the array more unreliable, and hence we want to
use up its erase cycles rapidly and remove it as soon as pos-
sible. Conversely, young SSDs make the array more reliable,
and hence we want to keep them young as long as possible.
If two SSDs are at the same age, we want to age one faster
SSD 0 SSD 1 SSD 2 SSD 3 than the other in order to create an imbalance in ages.
57
100
Age (% of Cycles) 90
80
70
60
50
40
30
20
10
0
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Number of Replacements
Figure 2: Age distribution of devices for Diff-RAID parity assignment (100, 0, 0, 0, 0) (i.e., RAID-4 with age-
driven shuffling on device replacements). Each set of bars represents the age distribution of devices in the
array when the oldest SSD wears out.
level as the write performance required by the workload fluc- signment corresponding to RAID-4. The parity assignment
tuates. used by Diff-RAID is (1, 5, 5, 5, 5, 5) — effectively, the only
difference between conventional RAID-4 and the Diff-RAID
When a device is replaced and a new device takes its place, version is the age-driven shuffling of parity on each device
age-driven shuffling of the parity assignments has to occur. replacement. Conventional RAID-5 causes all SSDs age in
One implementation option involves treating a device re- lock-step fashion, and conventional RAID-4 does so with
placement as a failure and triggering the RAID reconstruc- the data devices; as a result, the probability of data loss
tion process to rebuild the contents of the replaced SSD on a on an SSD failure climbs to almost 1 for both solutions as
new device, but modifying the reconstruction logic so that it the array ages, and periodically resets to almost zero when-
interchanges parity blocks and data blocks on different SSDs ever all SSDs are replaced simultaneously. In contrast, Diff-
to achieve the required parity assignment over a shuffled or- RAID never experiences a data loss probability of more than
dering of the devices. We expect to explore different imple- 17%, and once the age distribution converges, it consistently
mentation options that perform device replacement without maintains a data loss probability under 4%, which is compa-
triggering RAID reconstruction logic, as well. rable to the 3% data loss probability of a SATA disk array.
In this simulation, we assume that the oldest device in the
5. SIMULATION RESULTS array fails initially, and compute the probability of data loss
In this section, we show through simulation that Diff-RAID in the remaining devices; if any device is allowed to fail
provides much better reliability for SSD arrays than con- initially with equal probability, the RAID-5 curve remains
ventional RAID. Also, we show that Diff-RAID can push unchanged, the RAID-4 curve does not change significantly,
inexpensive, low-end SSDs to twice their rated erasure limit and the Diff-RAID curve is bounded at 35%.
without compromising the overall reliability of the array. We
model the Bit Error Rate of an individual SSD using data Figure 4 shows the space of possible Diff-RAID configura-
in a study published by Intel and Micron [2]. That paper tions for a 6-device array. Each point in this graph rep-
contains raw (pre-ECC) BERs for four MLC devices up to resents a different parity assignment; on the x-axis is the
10,000 erase cycles, of which one device is rated at 5,000 throughput of the solution as a fraction of the total array
cycles. We use the raw BER curve of this 5,000 cycle device throughput, and on the y-axis is the probability of data loss
as a starting point, assume 2-bit ECC, and modify the re- upon a device failure. As expected, the parity assignment
sulting UBER data slightly to obtain a smooth curve that corresponding to RAID-4 is on one extreme of the trade-off
hits an UBER of 10−14 at 5,000 cycles, comparable to the curve, providing high reliability but very low throughput,
UBER of SATA hard disks. and the assignment corresponding to RAID-5 is at the op-
posite end, with high throughput but very low reliability.
We quantify the reliability of a RAID array by considering
a common failure mode — a disk fails in the array, and dur- Sequential Writes: So far, we have assumed a workload
ing reconstruction an uncorrectable bit error is encountered consisting only of random writes. Sequential writes that up-
elsewhere in the array, resulting in the loss of data. For a 6- date the entire stripe can potentially reduce the reliability
disk RAID-5 array of 80 GB SATA disks (with UBER equal of a Diff-RAID array, since they reduce the ability of parity
to 10−14 ), the probability of data loss given a disk failure blocks to imbalance write load. When we ran the experi-
is around 3%, which remains fairly constant throughout the ment shown in Figure 3 with a workload consisting of 10%
lifetime of the array. In other words, for every 100 instances sequential writes, the worst case unreliability for Diff-RAID
of a disk failure in such an array, roughly 3 will experience in the initial period climbed to 23%; after convergence, the
some data loss. unreliability stays under 8%. With a workload where se-
quential writes comprise 50% of all writes, Diff-RAID is still
Figure 3 illustrates the difference between conventional RAID- bounded at 25% unreliability after convergence, providing 4
5 and RAID-4 used with SSDs and a Diff-RAID parity as- times the reliability of conventional RAID-4 and RAID-5. In
58
1
Probability of Data Loss 0.9 RAID-5
on 1 Disk Failure 0.8 RAID-4
0.7 Diff-RAID
0.6
0.5
0.4
0.3
0.2
0.1
0
0 50000 100000 150000 200000 250000
Total Array Writes (in multiples of 20M pages)
Figure 3: RAID-5 and RAID-4 versus Diff-RAID for a 6-device array. Diff-RAID is configured to use a RAID-
4 parity assignment – (100, 0, 0, 0, 0, 0) – but differs by performing age-driven shuffling on device replacements.
59