
IEEE TRANSACTIONS ON DEVICE AND MATERIALS RELIABILITY, VOL. 5, NO. 3, SEPTEMBER 2005

IBM z990 Soft Error Detection and Recovery


Patrick J. Meaney, Scott B. Swaney, Pia N. Sanda, Senior Member, IEEE, and Lisa Spainhower, Member, IEEE
Invited Paper

Abstract: Soft errors in logic are becoming more significant in the design of computer systems due to increased sensitivities of latches and combinatorial logic and the increased number of transistors on a chip. At the same time, users of computer systems continue to expect higher levels of system reliability. Therefore, the investment in hardware and firmware/software mitigation is likely to continue to rise. The IBM eServer z990 system is designed to detect and recover from myriad instances of soft and permanent errors. The error detection and recovery within the z990 processors and the nest chips is described with respect to the system-level protection against soft errors.

Index Terms: Error-correcting code (ECC), error detection, recovery, reliability, availability, and serviceability (RAS), single-event upset (SEU), soft error rate (SER).

I. INTRODUCTION

SINGLE-EVENT upsets (SEUs) in logic paths can lead to an increased soft error rate (SER) and potentially cause silent errors if unmitigated. Silent errors are undetected errors that lead to incorrect machine results. As CMOS devices continue to get smaller and increase in density, the risk of SEUs in logic paths increases. These events can be caused by transient particles, such as cosmic rays, neutrons, or alpha particles emitted from impurities in packaging. The susceptible structures include latches and register files today in 130- and 90-nm technology nodes; combinatorial circuits are expected to contribute to SER in future CMOS generations. SRAM arrays, if not fully protected by error detection and correction schemes such as error-correcting codes (ECCs), would increasingly contribute to the total uncorrectable system SER. Earlier papers in this issue address the technology trends and the silicon processing and circuit-level SER mitigation techniques. Processing and circuit design, even when combined, cannot practically meet the mitigation needs of all terrestrial computing systems. Alpha-particle reduction techniques, such as barrier layers, will substantially help control alpha-particle flux at the silicon devices, but such interventions may not be completely effective in all cases. They are also not effective against cosmic particle-induced upsets. Furthermore, transient upsets occurring due to noise are not mitigated by alpha-particle inhibitors, and future CMOS circuits will become increasingly sensitive to noise.

A. System Reliability, Availability, and Serviceability (RAS) Characteristics

IBM zSeries servers are designed for continuous operation and have RAS features throughout, including hardware, firmware, and software. In keeping with the zSeries tradition, the z990 is tolerant to errors, both hard and soft. It is designed to detect the occurrence of SEUs and to recover from almost all upsets. The z990 is a scalable machine ranging from one to four nodes. The memory and L2 cache across these nodes are fully ECC protected. ECC is also used to protect most of the system data and control buses. Parity is used in many other places to allow for the detection of errors. Many of the detected errors can be recovered using instruction or operation retry. In cases where retry is not successful, processors may be forced to checkstop and a spare processor may be used. In very rare cases, the system may be checkstopped to protect against silent data errors. In order to achieve high levels of protection in the processor cores, instruction and execution units within the microprocessor are duplicated and compared. Techniques of checkpointing (saving critical states of the processor over time) and rollback (restoring those critical states in the presence of errors) allow recovery from soft errors. Hard and soft error information is monitored on every z990 machine. For hard errors, line delete, memory sparing, and CPU sparing are some of the techniques used to keep the hardware running. There are also proactive calls home for parts maintenance (e.g., node repairs). For soft errors, events and rates are tracked and compared against predictions.
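The layered response just described (transparent correction or retry first, processor checkstop with CP sparing next, and system checkstop only when silent data corruption would otherwise be possible) can be summarized as a small decision procedure. The sketch below is a hypothetical illustration of that policy; the function, its arguments, and the checkstop conditions are invented for clarity and are not z990 firmware.

```python
from enum import Enum, auto

class Action(Enum):
    CONTINUE = auto()          # error corrected or retried transparently
    CP_CHECKSTOP = auto()      # stop the failing core and bring in a spare CP
    SYSTEM_CHECKSTOP = auto()  # stop the machine to avoid silent data errors

def respond_to_error(retry_succeeded: bool,
                     integrity_at_risk: bool,
                     spare_cp_available: bool) -> Action:
    """Hypothetical escalation policy mirroring the z990 description above."""
    if retry_succeeded:
        return Action.CONTINUE            # most soft errors end here
    if integrity_at_risk:
        return Action.SYSTEM_CHECKSTOP    # very rare: protect data correctness
    if spare_cp_available:
        return Action.CP_CHECKSTOP        # persistent core error: spare the CP
    return Action.SYSTEM_CHECKSTOP
```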

II. OVERVIEW OF THE z990 NODE

The IBM eServer z990 introduced a new nodal technology that supports concurrent capacity upgrade. The z990 contains up to four pluggable nodes (books) connected through a planar board in a daisy chain (ring) interconnect structure. The details of the system design, including the book, are described by Fair et al. [1]. The elements in each book are schematically illustrated in Fig. 1. Each node contains up to 64 GB physical memory and 32 MB L2 cache, for a system capacity of 256 GB physical memory and 128 MB L2 cache. The physical layout of the multichip module (MCM) is shown in Fig. 2. The MCM contains the following: eight dual-core central processor (CP) chips; the system control element (SCE), consisting

Manuscript received February 28, 2005; revised July 29, 2005. The authors are with the IBM Systems and Technology Group, Poughkeepsie, NY 12601 USA (e-mail: sanda@us.ibm.com; meaney@us.ibm.com). Digital Object Identifier 10.1109/TDMR.2005.859577

1530-4388/$20.00 © 2005 IEEE


Fig. 1. Elements of a z990 node. The MCM is shown, with connections to memory, I/O (through MBA chips), and other nodes (through the ring connections).

Fig. 3. Micrograph of the SCD chip.

Fig. 2. MCM is shown, containing eight dual core processor chips, four SCD chips with L2 cache, SCC control chip, MSC chips, and clock.

of a system controller chip (SCC) and four L2 cache/data chips (SCD); a main storage controller (MSC); and a clock chip to provide oscillator signals to the chips. The MSC connections to the synchronous memory interface (SMI) chips on the two memory cards are shown. Three input/output (I/O) memory bus adapter (MBA) chips for I/O connectivity are also contained within the book. The data buses into and out of each SCD chip to the CPs, MBAs, MSC, and internode (rings) are all one word (4 B) wide.
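As a quick consistency check (an inference from the figures quoted above, not a statement made in the paper), the one-word bus width per SCD chip lines up with the quadword interleaving described in the next subsection, where each quadword is split across all four lock-stepped SCD chips:

$$ 4~\text{SCD chips} \times 4~\mathrm{B}~\text{per chip per transfer} = 16~\mathrm{B} = 1~\text{quadword}. $$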

A. Cache Structure

The z990 utilizes a multinode cache structure [2]. The SCD chip, built in IBM's 130-nm technology, is shown in Fig. 3. The 8 MB of cache per chip is physically and logically divided in half. Cache A, associated with Bitstack A, corresponds to logical pipe 0 (address bit 55 = 0). Cache B, associated with Bitstack B, corresponds to logical pipe 1 (address bit 55 = 1). These pipes are address-interleaved on 256 B line boundaries. Each side has 4 MB of SRAM and has a corresponding dataflow bitstack, which controls communication with the memory (MSC), the dual-core processor chips (CPs), the I/O (MBA chips), and the other books (rings). The bitstacks hold data routing muxes as well as buffers and registers to stage data. Each quadword of logical data is split across all four SCD chips, which always act in lock-step with each other within each pipe. Each 32-bit word in each SCD chip is protected by (39/32) single error correction/double error detection (SEC/DED) ECC. The high-level view of the node dataflow is shown in Fig. 4. Fetch and store requests from each pipe may originate from any local core or I/O. Core requests are sent from the CP chips to the SCE. I/O requests are routed through the MBA chips, also to the SCE. The controls are handled by the SCC chip, while the data are all routed through the SCD chips. The SCC chip resolves the cache coherency and conflicts and communicates with the other nodes to control all the data transfers between nodes, memory, cores, and I/O. The SCD chips act as dataflows that are controlled by the SCC chip.

Microprocessor: A micrograph of the dual-core processor chip, built in IBM's 130-nm CMOS technology, is shown in Fig. 5.
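Returning to the cache interleaving just described, the address steering can be made concrete with a short sketch. It assumes z/Architecture big-endian bit numbering (bit 0 is the most significant of 64 address bits), under which bit 55 carries weight 2^8 = 256 and therefore toggles exactly on 256 B line boundaries; the word-to-chip assignment is an illustrative assumption, since the paper does not spell out the exact mapping.

```python
def pipe_select(addr: int) -> int:
    """Logical pipe from address bit 55 (big-endian numbering: the bit
    with weight 2**8), so consecutive 256 B lines alternate pipes."""
    return (addr >> 8) & 1

def scd_word_lane(addr: int) -> int:
    """Hypothetical word-to-SCD assignment: the four 4 B words of each
    16 B quadword are spread across the four lock-stepped SCD chips."""
    return (addr >> 2) & 0x3

# Two lines 256 B apart land in different pipes; the words within one
# quadword land on different SCD chips.
assert pipe_select(0x0000) != pipe_select(0x0100)
assert {scd_word_lane(a) for a in (0x0, 0x4, 0x8, 0xC)} == {0, 1, 2, 3}
```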


Fig. 4. High-level SCD dataflow.

Fig. 5. Micrograph of the dual-core microprocessor chip. The elements within one of the cores, core 0, are outlined. Each core has its own L1 D-Cache and I-Cache. The two cores share a common L2 interface that is also highlighted in the micrograph. The floating point unit (FPU), fixed point unit (FXU), and instruction unit (IU) are duplicated to protect against SEUs and are shown at the bottom of core 0 with the designation mirror.

III. NODE-LEVEL RAS

Without mitigation, the highest SERs in a system occur in the dynamic random access memory (DRAM, found in system memory), followed by the SRAM (used for caches and large arrays), followed by register files, latches, and logic. We will discuss the techniques used on the z990 for each of these areas.

A. Memory

The z990 is configurable to up to 64 GB of memory per book. The memory is protected by a (140, 128) S2EC single symbol correction/double symbol detection (SSC/DSD) code with 2-bit symbols. The ECC protects against hard errors (including wires) as well as soft errors on the data [3]. Correctable errors (CEs) can originate from 1-bit errors or from 2 bits (in the same grouping, or symbol). Uncorrectable errors (UEs) occur when more than one symbol has one or more errors. The ECC will also detect addressing errors. Normally, if data are either stored to or fetched from an incorrect address, the data still have correct ECC, and silent errors are undetectable. However, by using extra ECC code space (unused combinations of data/check bits), hard and soft address errors are detected to avoid silent data errors in memory [4]. In order to protect against the use of previously corrupted data, special UEs (SPUEs, marked with particular ECC patterns) are stored in the memory to indicate that the data was corrupted somewhere else and not by the memory. In the case of the z990, two varieties of special UE syndromes are stored to help isolate the source of the error, one for memory known bad, and another for cache or other sources [5]. As in past machines, scrubbing (systematic fetching to search for and correct soft CEs so they do not line up to become UEs)




TABLE I ARRAY ERROR PROTECTION

and sparing (using redundant chips to repair against hardware defects) are used in the memory to protect against hard and soft errors.

B. L2 Cache

The L2 cache has (39/32) SEC/DED ECC to protect against hard and soft errors. When a CE is detected in the cache, an immediate line purge occurs. If the line was changed, the data are forced to go through ECC on the castout, thus cleaning the data. If the line was not changed, the cache line is invalidated. It is later refetched from memory or a remote L2 cache where the data are still clean. Repeated CEs from the same cell are assumed to be a hard failure and will cause a line delete (the cache line location will no longer be used for that congruence class). Deleting line locations from the cache is a way of ensuring that cache hardware defects do not continue to corrupt line data. Nonspecial UEs will cause line purges as well as immediate line deletes, due to the likelihood of hard failures and the severity of reusing the line locations. Special UEs in the cache are purged (to avoid an aging special UE (SPUE) becoming a normal UE or a miscorrected CE) but are never deleted (since they indicate the cache is not necessarily at fault). For further soft error protection, the cache SRAMs (and most other SRAMs in the z990) are laid out such that ECC words never cover array cells that are in close proximity. In fact, with the 16-way associativity and two- to four-way bitline decodes, an SEU would have to affect two cells that are 64 bit-pitches apart to cause a UE in the cache. Therefore, the L2 cache with its ECC, purge, and delete provides a design that is robust, even in the presence of high SERs.
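The CE/UE handling just described amounts to a small policy decision per detected error. The following sketch restates it in code; the function, its arguments, and the way the repeat-CE threshold is passed in are illustrative assumptions, not the actual SCC control logic.

```python
def l2_cache_error_actions(kind: str, line_changed: bool,
                           ce_threshold_reached: bool) -> list[str]:
    """Sketch of the z990 L2 cache error policy described above.
    kind is one of 'CE', 'UE', or 'SPUE'."""
    if kind == "CE":
        # Purge the line: changed data is cast out through ECC (cleaned);
        # unchanged data is invalidated and refetched later from a clean copy.
        actions = ["castout through ECC" if line_changed else "invalidate"]
        if ce_threshold_reached:
            # Repeated CEs from the same cell look like a hard failure.
            actions.append("line delete")
        return actions
    if kind == "UE":
        # Likely a hard failure; stop using this line location immediately.
        return ["purge", "line delete"]
    if kind == "SPUE":
        # Data already known bad elsewhere: purge so the SPUE cannot age
        # into a normal UE, but never delete (the cache is not at fault).
        return ["purge"]
    raise ValueError(kind)
```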

C. Other SRAMs and Register Files

The remaining SRAMs in the nest (i.e., SCC, SCD, MSC) are either protected by ECC (with or without purge/delete support) or parity, or have no measurable effect on system performance and so do not need protection (e.g., LRU, trace). Register files are covered with similar techniques as the SRAMs. If they hold data, they are covered by ECC (or parity, if they can be reliably recovered with a retry operation). If they contain controls, they are typically covered by parity with retry. Some register files have parity and cause overindications on compares. Some register files do not need checking, since they are used for diagnostics only. Table I shows a summary of the RAS protection for the various SRAMs and register files in the SCC, SCD, and MSC chips. It is important to point out that the L2 cache and the L2 key directory can tolerate UEs. The L2 directory, on the other hand, is forced to system checkstop on a UE to avoid silent data errors.

D. ECC-Protected Dataflow

Most of the dataflow throughout the SCE is protected with ECC. This includes all off-node (ring) interfaces, internal dataflow bitstacks (registers, register files, and buses), memory store and fetch buses, CP store buses, and even data buses used by firmware. When UEs occur on data, every interface and bitstack element uses UE tags (separate UE indicator signals) to indicate which transient lines are bad. When these need to traverse across another interface, the ECC is cleanly generated, and the tags are transferred as parity-protected fields. This way, the


propagated errors do not appear the same as interface errors. This improves the fault isolation of checks so correct parts can be replaced and diagnosed. There are some interfaces for which the ECC is instead generated with special UE patterns. There are two unique UE tag states, namely: 1) memory special UE; and 2) SCE (nonmemory) special UE. If the parity of the UE tags is in error, the data are assumed to be an SCE special UE.

E. Parity-Protected CP Fetch Data Bus and I/O Buses

The CP fetch data bus is parity protected. There is no need to have ECC on this bus, because the L1 cache is store-thru. When a parity error is detected on the fetch bus (or most places in the processor), the processor is sent into checkpoint recovery, and the entire L1 cache can be cleared and refetched as needed from the L2 and memory. If CP refetches are unsuccessful (due to a hard error), the CP will threshold and cause a CP checkstop and a CP sparing event. The I/O buses from the SCD chip to and from the MBA chip are also parity protected. Operations on these buses can be retried, so ECC is not necessary on these buses to protect against soft errors.

F. ECC-Protected SMP Off-Chip Address and Controls

The command and address controls across the multiple-node interfaces are ECC protected. This way, if there are hard or soft errors near the interfaces, they will be corrected. For hard errors, there are threshold checking and maintenance requests for CEs to ensure that solid CEs get attention before they can line up with other CEs, causing UEs.

G. L2 Pipeline Retry

The L2 cache pipeline is a resource pipeline that controls the cache coherency of the system. It controls the L2 directories, address compares, store stacks, configuration arrays, cross-interrogation logic, castout resources, miss logic, memory fetch/store controllers, remote node processing, and many other resources related to cache coherency. There is a priority queue in front of the pipeline to effectively select one requesting operation to use the resources available in the pipe. If any resources that are needed are not available in the pipe, an operation may be rejected until that resource frees up. The request then enters the pipe when the resource is available. This reject scheme is also used for address compares and other cache-coherence protocols. The L2 cache pipeline address flow is shown in Fig. 6. The central pipe itself is made up of register stages (C0, C1, C2, C3, C4, and C5). Each stage includes the following parity-protected fields (each field has its own parity group or groups):
1) valid;
2) command;
3) address;
4) ownership;
5) requestor identification (ID);
6) compartment.

Fig. 6. L2 cache pipeline address flow.
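To make the stage contents concrete, the sketch below models one pipeline stage with the six independently parity-protected field groups listed above and the invalidate-on-early-parity-error behavior that drives the reject handling described next. The dataclass, the single aggregate parity flag, and the function name are illustrative assumptions, not the actual SCC design.

```python
from dataclasses import dataclass

@dataclass
class PipeStage:
    """One L2 pipeline stage; each field has its own parity group(s)."""
    valid: bool
    command: int
    address: int
    ownership: int
    requestor_id: int
    compartment: int
    parity_ok: bool  # stands in for the per-field parity checks

def check_early_stage(name: str, stage: PipeStage) -> str:
    """In C0 or C1, bad parity invalidates the operation for the later
    stages (C2 to C5 simply ignore invalid stages) and rejects it back
    to the requestor, which retries the pipe pass."""
    if name in ("C0", "C1") and not stage.parity_ok:
        stage.valid = False
        return "reject to requestor"
    return "proceed"
```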

If the C0 or C1 stage has bad parity on any field, the subsequent stages (C2 through C5) in the pipe are set to be invalid for that operation, and the operation is rejected back to the requestor (e.g., Req A). In some cases, the pipeline operations immediately after a rejected operation are also rejected to maintain cache coherency (e.g., operations going into C0 and C1 are also rejected in Fig. 6). There is no performance penalty from the wasted pipe cycles, because soft errors are so rare. The downstream pipe resources automatically ignore the invalid pipe stages. This reject causes the source requestor to refresh and retry the pipeline operation. If the next retry of the pipe operation also fails, the requestor rejects the operation back to the source (e.g., a core). This may force a core or I/O requestor into a retry or recovery state. The source may subsequently recover with good control data, and the pipeline operation may then succeed after the core recovery. If there is a hard error (broken register, logic, wire, etc.), the core will likely checkstop and be spared. Since a new core will use different requestor logic, the spare processor should be able to proceed without error. Failures that occur later in the pipeline, or that corrupt the directory states, may be past the point of retry. These errors force a system checkstop to protect against silent data corruption.

H. Store Address Stacks

The CP store address stacks (made up of register arrays on the SCC chip) are parity protected. When a parity error is detected in the store address stack, that processor is sent into


instruction processor damage (IPD) recovery to indicate that the store could not be processed. The processor issues a machine check to the OS, which will attempt to recover the instruction stream. The address stacks are also used for address compares to make sure stores are properly handled. If there are stores pending and an LRU or cross-interrogate (XI) castout is requested, the castout must wait until the store stack entries for that line are complete. If there are parity errors in the store stack, rather than checkstopping the machine, the hardware will overindicate the compare (i.e., treat the bad store stack entries as if they compared). The only impact is a slight delay to the castout, which waits for that valid entry to attempt to store (even if the address was not really for the same line).

I. MBA (I/O) Operations

The MBA interface has several forms of recovery. These include command reject responses (in the case where the operation had a detected failure, usually parity), special UE/bad parity data return (the mechanism of returning uncorrectable, known bad data to the MBA chip), no response or forced hang (in the case where the requestor ID may be broken and there is no information on which request to respond to), and various key and storage responses that indicate errors. All of these could be caused by soft errors and will recover if the operation is attempted again following a soft error.

J. SCE Sense/Control Operations

When millicode (i.e., vertical microcode) sense/control operations are issued and UEs are detected, the UEs are reported back to millicode. In each case, the millicode has the option of retrying the command. Therefore, in the case of a soft error, the retry will succeed and the operation will complete normally.

K. Memory Interface Address/Command Parity

The memory address/command interface between the SCC and MSC chips is parity protected. When parity errors are detected, the interface operation is retried. A failing retry causes a system checkstop in the case of a hard failure. However, soft failures in that area should recover without problems.

L. SCD Bitstack

The data in the SCD bitstacks are protected by word (39/32) SEC/DED ECC. There are also control signals that control the bitstack operations. These include mux selects, register file addresses, and register ingates. These controls are checked through several different methods. Muxes have built-in orthogonality checking. Operationally, a bitstack mux is either idle (no input selected) or has only one port actively selected. There is checking to make sure that more than one port is not selected; if more than one port were selected, data could become corrupted. Depending on where the data are going, this could cause a UE tag to be issued on the data or could cause a system checkstop to protect against silent data corruption.
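The mux orthogonality checking mentioned above reduces to verifying that a select vector is either all zeros (idle) or one-hot. A minimal sketch, with invented names, follows:

```python
def mux_selects_ok(selects: list[bool]) -> bool:
    """Orthogonality check for a bitstack mux: either no port selected
    (idle) or exactly one port selected. Two or more active selects
    could corrupt the data being routed."""
    return sum(selects) <= 1

assert mux_selects_ok([False, False, False])    # idle
assert mux_selects_ok([False, True, False])     # one-hot: legal
assert not mux_selects_ok([True, True, False])  # overlap: raise an error
# On a violation, the hardware either tags the outgoing data as a UE or
# forces a system checkstop, depending on where the data is headed.
```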

There is also parity checking on many of the controls, along with parity prediction circuits. For instance, address registers that feed register files typically have incrementors with parity predict. If the address has bad parity, UE tags will often be set for that line. One particular feature that protects against silent data corruption is a bitstack cross-check circuit [6]. This consists of a multiple-input shift register (MISR). Normally, an MISR is used in a test circuit and is not utilized during normal component operation. However, this circuit monitors and compares many control inputs from one area of a design with the control inputs from another area using only a few signals. By design, each of the four SCD chips has two bitstacks, one per pipe. Since all four SCDs operate in lock-step, all four pipe 0 bitstacks should have identical control signals. Likewise, all four pipe 1 bitstacks should also have identical control signals. The SCD0 and SCD1 chips share compressed information to help compare pipe 0 operations. They also share pipe 1 information. Chips SCD2 and SCD3 share similar information. The compared information consists of an XOR of critical controls (a fast detect for any odd number of control errors) as well as MISR bit 0 of those controls (a slower but more robust way of checking multiple-bit errors in the bitstacks). Therefore, even in the presence of many errors caused by an SEU, this will easily detect a problem. The z990 uses the MISR and XOR checking to checkstop the machine and protect against silent data corruption.

M. Microprocessor Recovery in Logic Paths

The recovery strategy for the z990 is the same as in past generations of zSeries microprocessors, which is to maintain an architectural checkpoint on every hardware instruction boundary, so that in the event of an error, in-flight operations can be purged, the error cleared, the last checkpoint restored, and the instruction processing resumed from the last checkpoint [7]-[9]. The register checkpoint is implemented in a checkpoint array in the recovery unit (Runit). The checkpoint array contains a copy of every architected register in the processor microarchitecture, protected by SEC/DED ECC. Updates to all architected registers are captured in the Runit and held until instruction commit (checkpoint) time. The Runit has a pipeline for capturing register results so that the checkpointing lags execution. It is important for the checkpoint pipeline to lag execution to allow time for errors during execution to be detected and reported in time to block bad results from making it to the checkpoint. Register updates are not checkpointed until all the values for an instruction (or group of superscalar instructions) are available, without any errors detected. The z990 microarchitecture state is mapped into 256 sixty-four-bit registers. Registers that are only updated as a result of a load or computational result, and only accessed as a register operand, are checkpointed in compact register file arrays. Registers that are updated by other sources or directly control processor state (i.e., instruction address, condition code, exception information, processor status, control registers, etc.) are checkpointed in discrete latches.
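A highly simplified model of the checkpoint discipline just described follows: register updates are held until the instruction (or superscalar group) completes with no error reported, and only then become part of the checkpoint; on an error, the in-flight updates are discarded and processing resumes from the last good state. The class and its methods are illustrative assumptions, not the actual Runit design.

```python
class RunitCheckpointSketch:
    """Toy model of checkpointing and rollback on instruction-group
    boundaries, as described for the z990 recovery unit (Runit)."""

    def __init__(self):
        self.checkpoint = {}  # architected state (ECC-protected in hardware)
        self.pending = {}     # register results awaiting error-free completion

    def capture(self, reg: str, value: int) -> None:
        self.pending[reg] = value            # captured, not yet committed

    def complete_group(self, error_reported: bool) -> None:
        if error_reported:
            self.pending.clear()             # block bad results from the checkpoint
        else:
            self.checkpoint.update(self.pending)  # commit the group
            self.pending.clear()

    def recover(self) -> dict:
        """Purge in-flight work and restart from the last checkpoint."""
        self.pending.clear()
        return dict(self.checkpoint)
```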


Fig. 7. Basic lock-step replication.

Fig. 8. Replication, with mirrored logic delayed one cycle.

Because of the additional latency for the checkpoint pipeline in the Runit, registers that are accessed frequently as source operands or for address generation (i.e., general purpose registers, floating point registers, access registers) have working copies in the instruction and execution units, so the checkpointed Runit copy is used only for recovery purposes. However, many of the various control registers (roughly half of the total registers in the microarchitecture) are accessed infrequently, so they may be accessed directly from the Runit without a significant performance penalty. This greatly reduces the overhead of maintaining the Runit checkpoint, as it integrates the register space with the mainline function of the processor. Error detection throughout the processor is essential to protect the checkpoint. Although the checkpoint is maintained on hardware instruction group boundaries, recovery is performed on a processor basis, not by instruction. Errors do not need to be associated with a particular instruction, as long as they are reported early enough to block checkpointing of the affected instruction(s). It is perfectly acceptable to back up to a checkpoint earlier than the instruction that encountered the error. As in prior generations, the z990 processor relies on unit-level replication for robust error detection in much of the computational logic. Replication of logic for the purpose of comparing the outputs is an effective means of detecting soft or hard errors. For caches and other large data structures (typically implemented with SRAMs), replication is very inefficient (both for area and for power) compared to more traditional parity or ECC. However, for general varied control logic, replication is often an affordable solution that supports high frequency. For example, Fig. 7 shows the method used in earlier generations of zSeries microprocessors. The IU (fetching, decode, address generation) and execution units (fixed point and floating point arithmetic, load/store, operand fetching) are duplicated, but the cache and recovery units are not. Normal computational flow only requires one

copy of the replicated logic; the second copy is only used for error detection. The light areas represent the overhead for this simple duplication checking scheme. In technologies where the propagation delay within a processor cycle is dominated by silicon, such a duplication scheme is excellent for high-frequency design as it avoids impacting critical paths for generating or validating check bits (i.e., parity). The functional units are simply instantiated twice, with comparators added. The computational flow is not impacted. As technology density increases, transistor switching speed increases, which allows faster processor operating frequencies. However, since wire delay does not scale with transistor switching speed, the wire delay consumes a larger portion of a processor cycle and also limits how far a signal can be sent on a wire within one cycle. In order not to impact the operating frequency, the floor plan of the chip is optimized for the computational flow. This may result in the second instances of the duplicated units being too far away to receive signals in the same cycle as the computational units. In this case, latches are added to signals to and from the duplicated units such that the duplicated units run one machine cycle behind the units used for the computational flow. Because signals to and from the duplicated units are delayed, the signals from the computational flow must be delayed two cycles before they can be compared to the signals from the duplicated units, as illustrated in Fig. 8. In general, denser technologies lead to increased superscalar parallelism. The added latches to reach to and from the duplicated units and align the values from the computational flow include control signals, address buses, and data buses. As frequency increases, multiple cycles may be required to reach the duplicated units, and the accumulated cost of these latches becomes appreciable. Fig. 9 illustrates the increased overhead when accounting for superscalar parallelism and two cycles to reach to and from the duplicated units. The mechanism of backing up and resuming from a prior checkpoint is very effective for recovering from soft errors,



Fig. 9. Superscalar replication with a two-cycle delay.

but not as effective for errors that persist when retrying from the checkpoint (hard errors). The checkpoint is also used for recovering from hard errors, by restoring the checkpoint to a different physical processor. As technology density continues to increase, wires will start to dominate the propagation delay. Furthermore, area is impacted: transistors need to be spread apart to allow room for the wiring, which adds even more delay. It will become more difficult to avoid impacting the computational flow with the duplicated-unit checking approach. At some point, the spreading out of the logic due to the staging latches and wiring will likely have more impact than incorporating more traditional lower-overhead checking techniques (such as parity prediction and residue checking). As the number of cycles to reach the duplicated units increases, the checkpointing of instructions must also be delayed to allow errors to be detected in time to block the checkpointing. This deeper checkpoint pipeline delays the access to the Runit register operands, possibly impacting performance, even on infrequent operations.
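The timing relationship in Figs. 7 and 8 can be sketched with simple delay queues: the mirrored unit receives each input one cycle late and its result takes another cycle to come back, so the computational result must be staged two cycles before the compare. The model below is a toy illustration of that relationship (the delay parameter and queue structure are assumptions), not the actual checker implementation.

```python
from collections import deque

def run_delayed_lockstep(inputs, unit, cycles_to_mirror=1):
    """Sketch of Fig. 8-style checking: the mirror sees each input
    'cycles_to_mirror' cycles late and its result takes as long to come
    back, so the main result is staged 2*cycles_to_mirror cycles before
    comparison. 'unit' is any pure function standing in for the logic."""
    to_mirror = deque([None] * cycles_to_mirror)        # wires toward the mirror
    from_mirror = deque([None] * cycles_to_mirror)      # wires back to the checker
    staged_main = deque([None] * (2 * cycles_to_mirror))
    for x in inputs:
        main_out = unit(x)                    # computational flow, this cycle
        to_mirror.append(x)
        delayed_in = to_mirror.popleft()
        mirror_out = unit(delayed_in) if delayed_in is not None else None
        from_mirror.append(mirror_out)
        staged_main.append(main_out)
        compare_a = staged_main.popleft()
        compare_b = from_mirror.popleft()
        if compare_a is not None and compare_a != compare_b:
            raise RuntimeError("miscompare: block checkpointing, start recovery")

# Example: an error-free run never miscompares.
run_delayed_lockstep(range(8), unit=lambda v: v * 3 + 1)
```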

IV. FIELD TRACKING

Most of IBM's mainframe customers have systems that call home for immediate response in the rare case of a system failure and that initiate a repair action if fail-over resources, such as a spare processor, have been used. The logged information is limited to machine state (customer workload and data are not visible), and the data set is often pruned and compressed to minimize the storage required. For hard errors, extensive failure information has historically been available for diagnosis of the failing parts. This includes trace arrays, failing signatures and bits, and other debug data. On the z990, more information is available related to soft errors than was ever available before. In the past, these data were often pruned to make way for other system debug data. However, a balance was reached between data volume and critical logging of soft error information. Field data on SEUs are useful for comparing real SEU performance against predictions. Unfortunately, statistically relevant data are usually limited to prior server generations, until enough systems and hours of operation are available on the current generation.

A. Microprocessor Logic Path Recovery Data for z900

As described in the previous section, I/E-unit mirroring has been used for generations of IBM CMOS mainframes for SEU protection. These events were mined for the z900 population for a period of time totaling 198 M powered-on processor core hours. For example, for a 12-way system running for 1 h, 12 processor hours would be logged. Hours before the system is powered on are not logged, nor are hours while it is powered off. For the z900, for which the microprocessor is fabricated in IBM's 180-nm technology, three events were positively identified as miscompares between the mirrored units in the 198 M processor hours. Data for the z990 are currently being accumulated and are in the process of being analyzed and compared against predictions.

V. SUMMARY AND CONCLUSION

Robustness against SEUs in latches and combinatorial logic is an increasing requirement in the design of future computing systems. Because of the continuous availability and data correctness requirements of mainframe workloads, IBM zSeries servers have required designs for SEU robustness in logic paths as well as in arrays. The strategy and implementation for SEU robustness have been described in detail for the most recent generation, the z990. Microarchitectural-level SEU mitigation features include the following: extensive use of ECC and parity with retry on data and controls; full SRAM ECC and parity protection; operational retries; microprocessor I/E-unit mirroring, checkpointing, and rollback; and other hardware derating techniques. These are examples of approaches that are available for future mainframe, general purpose, and application-specific computing systems as SEU rates increase due to rising device sensitivities and levels of integration.

ACKNOWLEDGMENT

The authors are indebted to P. Mak, T. Slegel, M. Mueller, and C. Webb for their technical leadership in the microarchitectural approaches to SER mitigation described here, as well as for helpful conversations. L. Alvez, D. Modi, B. Fair, C. Lin, C. Walters, and others contributed to the RAS reviews and provided invaluable feedback to improve the designs. The authors also acknowledge the zSeries design team for their expert implementations. They also thank C. Henderson for his support in the z900 field data mining, as well as A. Munteanu for database improvements for future data mining.


REFERENCES

[1] M. L. Fair, C. R. Conkin, S. B. Swaney, P. J. Meaney, W. J. Clarke, L. C. Alves, I. N. Modi, F. Freier, W. Fischer, and N. E. Weber, "Reliability, availability, and serviceability (RAS) of the IBM eServer z990," IBM J. Res. Develop., vol. 48, no. 3/4, p. 519, May/Jul. 2004.
[2] P. R. Turgeon, P. Mak, M. A. Blake, M. F. Fee, C. B. Ford, III, P. J. Meaney, R. Seigler, and W. W. Shen, "The S/390 G5/G6 binodal cache," IBM J. Res. Develop., vol. 43, no. 5/6, p. 661, Sep./Nov. 1999.
[3] C. L. Chen, M. Y. Hsiao, P. Meaney, and W. Shen, "Single symbol correction double symbol detection code employing a modular H-matrix," U.S. Patent 6 463 563, Oct. 8, 2002.
[4] C. L. Chen, M. Y. Hsiao, P. Meaney, and W. Shen, "Detecting address faults in an ECC-protected memory," U.S. Patent 6 457 154, Sep. 24, 2002.
[5] C. L. Chen, M. Y. Hsiao, P. Meaney, and W. Shen, "Generating special uncorrectable error codes for failure isolation," U.S. Patent 6 519 736, Feb. 11, 2003.
[6] P. Meaney, "Method for identifying SMP bus transfer errors," U.S. Patent 6 055 660, Apr. 25, 2000.
[7] L. Spainhower and T. A. Gregg, "G4: A fault tolerant CMOS mainframe," in Proc. 28th Annu. Int. Symp. Fault-Tolerant Computing, Munich, Germany, 1998, pp. 432-440.
[8] L. Spainhower and T. A. Gregg, "IBM S/390 parallel enterprise server G5 fault tolerance: A historical perspective," IBM J. Res. Develop., vol. 43, no. 5/6, pp. 863-873, Sep./Nov. 1999.
[9] M. Mueller et al., "RAS strategy for IBM S/390 G5 and G6," IBM J. Res. Develop., vol. 43, no. 5/6, pp. 875-888, Sep./Nov. 1999.

Scott B. Swaney received the B.S. degree in electrical engineering from the Pennsylvania State University, University Park, in 1988. He joined IBM that same year in the Enterprise Systems Division as a VLSI Design Engineer. He is currently a Senior Technical Staff Member working on hardware and system design for IBM eServers, Poughkeepsie, NY. He specializes in design for high availability in microprocessors. He holds multiple patents related to processor recovery.

Pia N. Sanda (M'91-SM'03) received the B.S. degree in engineering and the Ph.D. degree in physics from Cornell University, Ithaca, NY, in 1976 and 1982, respectively. She is a Senior Technical Staff Member and the Program Director for Soft Error Management at IBM, Poughkeepsie, NY. She is one of the founders of picosecond imaging circuit analysis (PICA), a method for viewing timing of operating ICs that is now widely used for failure analysis. She also invented methods for autogeneration of phase shift masks and has designed devices and circuits for microprocessors. Her research interests include IC reliability from devices to systems.

Patrick J. Meaney received the B.S. degree in electrical and computer engineering from Clarkson University, Potsdam, NY, in 1986, and the M.S. degree in computer engineering from Syracuse University, Syracuse, NY, in 1991. He joined IBM in Poughkeepsie, NY, in 1986, where he is a Senior Technical Staff Member. He is currently the zSeries nest pervasive design lead and has worked in the areas of reliability, availability, and serviceability (RAS), fault tolerance, error-correcting code (ECC), recovery, and timing. He holds 23 U.S. patents and has ten patents pending.

Lisa Spainhower (M'99) is a Distinguished Engineer in the System Design organization of the IBM System and Technology Group, Poughkeepsie, NY, responsible for high availability and fault-tolerant server design. Ms. Spainhower is a member of the IBM Academy of Technology, IEEE Computer Society, Executive Committee of the IEEE Technical Committee on Fault Tolerant Computing, and IFIP WG 10.4 on Dependable Computing and Fault Tolerance.
