How to use ECC to protect embedded memories
Sandeep Kaushik - May 23, 2013

The scaling of semiconductor technologies has led to a lower operating voltage in semiconductor
devices, which, in turn, reduces the charge available on the capacitors for volatile memories. The
overall effect of this is that devices are generally more sensitive to soft or transient errors, because
even low-energy alpha particles can easily flip the bits stored in storage cells or change the values
stored in sequential logic elements, producing erroneous results.
Increasing memory density, system-on-chip (SoC) memory content, performance, and technology scaling combined with reduced voltages increase the probability of multi-bit transient errors.
Notably, transient errors are no longer restricted to aerospace applications. Now applications such
as biomedical, automotive, networking, and high-end computing are susceptible to transient errors
and have a need for high reliability and safety.
Transient error sources are, in many cases, self-inflicted because alpha particles are commonly
generated in materials adjacent to the chip, in solders, and in the packaging. Due to the higher
susceptibility to multiple-bit (multi-bit) transient errors, and an increasing requirement for high
reliability, there is a greater need to mitigate transient errors in embedded memories. In this article
we discuss transient error detection and correction methods using advanced error correction code
(ECC) based solutions for embedded memories in order to meet the requirements of today's high-reliability applications.
Understanding Errors
Transient or soft errors are functional errors resulting from strikes by energetic particles such as
neutrons and alpha particles. They are random in nature and typically lead to data corruption or
cause electronic systems to crash. For less critical applications, transient errors are eclipsed by
more common issues and can be fixed by resetting or rewriting the device, and generally the time
required for resetting or rewriting and bringing the device back to its normal operation is acceptable
to users.
However, for critical applications such as networking, transient errors can be catastrophic. Just
relying on the reset strategy for transient error mitigation can be very expensive, as the system will
be unavailable during the length of the reset or cycle time. This delay might not be acceptable given
that some of these mission-critical systems require 99.999% availability.
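To put the 99.999% figure in perspective, the short calculation below converts an availability target into a yearly downtime budget. The 30-second reset time is a hypothetical value chosen only for illustration; it does not come from any measured system.

```python
# Convert an availability target into a yearly downtime budget.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap days


def downtime_minutes_per_year(availability):
    """Allowed downtime per year for an availability given as a fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


budget = downtime_minutes_per_year(0.99999)

# Hypothetical reset duration, used only to illustrate the budget.
RESET_SECONDS = 30.0
resets_per_year = budget * 60.0 / RESET_SECONDS

print(f"Downtime budget: {budget:.2f} minutes per year")
print(f"Resets that fit in the budget: {resets_per_year:.1f}")
```

Under these assumptions, five-nines availability leaves a little over five minutes of downtime per year, so only about ten such resets would fit in the budget.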
In addition to disruptions in high availability, transient memory errors can cause security
vulnerabilities. Since transient errors have been around and causing electronic systems to fail for
years, JEDEC JESD89A was defined to standardize the requirements and procedures for soft-error-rate testing of integrated circuits and reporting of results. However, the options to take any
corrective action based on the testing for errors after a design is complete are limited.

Transient errors, like a lot of other design issues, are very costly to address as an afterthought. They
can be proactively handled in a much less expensive and more effective manner. In fact, it is
significantly more advantageous to address transient errors during the design phase.

Figure 1: Embedded memory content is expected to command 88% of area in high-end SoCs by 2020
As shown in Figure 1, embedded memories dominate the SoC area, making transient error
mitigation for SRAMs crucial for high reliability. Once a transient error causes the bit stored in a
storage cell to flip, there is no mechanism to recover the bit other than to explicitly rewrite the value
or correct the errors while reading out. The most effective solution to address transient errors for
SRAMs is to use traditional ECC, such as Hamming codes. ECC allows data that is being accessed to
be checked for errors and corrected on the fly. It differs from the basic parity-checking in which
errors are only detected and not corrected.
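The limitation of basic parity checking can be seen in a minimal sketch. It assumes a single even-parity bit per 8-bit word; the word value and the flipped bit positions are arbitrary choices for illustration.

```python
def parity_bit(word):
    """Even-parity bit: 1 when the word contains an odd number of 1 bits."""
    return bin(word).count("1") & 1


def parity_ok(word, stored_parity):
    """True when the recomputed parity matches the stored parity bit."""
    return parity_bit(word) == stored_parity


data = 0b10110010              # arbitrary example word
stored = parity_bit(data)      # parity bit written alongside the data

single_flip = data ^ (1 << 3)               # one bit upset
double_flip = data ^ (1 << 3) ^ (1 << 4)    # two neighboring bits upset

print(parity_ok(data, stored))         # True: no error
print(parity_ok(single_flip, stored))  # False: error detected, position unknown
print(parity_ok(double_flip, stored))  # True: the two flips cancel and go unnoticed
```

A parity mismatch says only that something changed. It does not identify which bit, so the word cannot be repaired, and an even number of flips is missed entirely.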
ECC for memories has been deployed for many years. It uses extra check bits or parity bits for each
data item. These bits are computed and compared whenever the data item is accessed. This
traditional ECC approach provides single error correction (SEC) and double error detection (DED).
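The cost of those extra bits can be estimated with the standard Hamming bound: a single-error-correcting code over m data bits needs the smallest r with 2^r >= m + r + 1, and SEC-DED adds one overall parity bit. The sketch below applies that rule; the function names are illustrative and the word widths are just common examples.

```python
def sec_check_bits(data_bits):
    """Smallest r such that 2**r >= data_bits + r + 1 (Hamming SEC)."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits):
    """SEC-DED adds one overall parity bit to the SEC check bits."""
    return sec_check_bits(data_bits) + 1


for width in (8, 16, 32, 64, 128):
    extra = secded_check_bits(width)
    overhead = 100.0 * extra / width
    print(f"{width:3d} data bits -> {extra} check bits ({overhead:.1f}% overhead)")
```

The relative overhead falls as the word gets wider, which is one reason wide words are attractive for ECC-protected memories.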
The main limitation of traditional ECC is that it only corrects single-bit errors. If two or more
neighboring bits are impacted simultaneously by a transient error hit, the classical ECC will not be
able to provide the required error correction.
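The SEC-DED behavior described above can be illustrated with a small extended Hamming code. This is a minimal software sketch for an 8-bit word, not the hardware implementation of any particular memory; the bit layout (overall parity at index 0, check bits at power-of-two positions) is one common textbook convention.

```python
def encode(data_bits):
    """Encode a list of 0/1 data bits into an extended Hamming codeword.

    Index 0 holds the overall parity bit, power-of-two positions hold the
    Hamming check bits, and all other positions hold data bits.
    """
    m = len(data_bits)
    r = 0
    while (1 << r) < m + r + 1:
        r += 1
    n = m + r
    code = [0] * (n + 1)
    bits = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):              # not a power of two: data position
            code[pos] = next(bits)
    for i in range(r):
        p = 1 << i
        # Check bit p covers every position whose index has bit p set.
        code[p] = sum(code[pos] for pos in range(1, n + 1) if pos & p) % 2
    code[0] = sum(code) % 2              # makes the whole codeword even parity
    return code


def decode(code):
    """Return (status, data_bits); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    n = len(code) - 1
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos              # XOR of the positions holding a 1
    overall_odd = sum(code) % 2 == 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd and syndrome <= n:
        code[syndrome] ^= 1              # single error; syndrome 0 means bit 0
        status = "corrected"
    else:
        status = "uncorrectable"         # e.g. a double-bit error
    data = [code[pos] for pos in range(1, n + 1) if pos & (pos - 1)]
    return status, data


data = [1, 0, 1, 1, 0, 0, 1, 0]
word = encode(data)

one_flip = list(word)
one_flip[5] ^= 1                         # single-bit upset

two_flips = list(word)
two_flips[5] ^= 1                        # two neighboring bits upset
two_flips[6] ^= 1

print(decode(word)[0])                   # ok
print(decode(one_flip)[0])               # corrected
print(decode(two_flips)[0])              # uncorrectable
```

The single flip is located by the syndrome and repaired. The double flip is only flagged: the code knows the word is bad but cannot restore the original data.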
The rapid evolution in memory IP design and technology scaling have had a significant impact on
transient errors. Both the increased complexity and the greater packing density of memory tend to
raise multi-bit soft error rates. Memory errors occur mostly during read/write activity, so
higher speeds contribute to higher error rates. With the increasing packing density and technology
scaling, multi-bit upsets have an increased probability of occurrence. Based on a recent study, multi-bit upsets are as likely to occur as single-bit upsets at 28-nanometer technology nodes and
below. Traditional transient error mitigation approaches are unable to provide sufficient protection
for multi-bit upsets because the code may correct only one bit of a multi-bit upset.
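One widely used way to relate physical layout to multi-bit upsets is bit interleaving, where physically adjacent cells belong to different logical words. The sketch below is an illustration of that general idea only; it is an assumption for discussion and is not presented as how any specific ECC product works. The interleave factor of 4 and the burst positions are hypothetical.

```python
# Hypothetical layout: four logical words share one physical row, so
# physical column c stores bit (c // 4) of logical word (c % 4).
INTERLEAVE = 4


def physical_to_logical(col):
    """Map a physical column to (logical word index, bit index in that word)."""
    return col % INTERLEAVE, col // INTERLEAVE


def errors_per_word(start_col, burst_len):
    """Count how many bits of each logical word a contiguous upset touches."""
    counts = [0] * INTERLEAVE
    for col in range(start_col, start_col + burst_len):
        word, _bit = physical_to_logical(col)
        counts[word] += 1
    return counts


print(errors_per_word(5, 3))   # [0, 1, 1, 1]: every word sees at most one error
print(errors_per_word(5, 5))   # [1, 2, 1, 1]: word 1 now has two errors
```

With this layout, an upset spanning up to four adjacent cells leaves at most one bad bit per word, which single-error correction can handle. A wider upset puts two errors in one word, and single-error correction is no longer sufficient.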
High-reliability applications require a solution that not only provides multi-bit error detection and
correction, but also optimizes performance and area impact. For example, Synopsys DesignWare
STAR ECC IP aims to be an automated solution for multi-bit error detection and correction. It uses
the structural information of the memory to allow multi-bit error detection and correction. It
automatically generates ECC Verilog code for single-port and multi-port SRAM memories.
Additionally, STAR ECC generates Verilog testbenches and scripts to automate the verification,
timing, and synthesis of generated IP.


Transient errors, particularly multi-bit ones, are a matter of increasing concern in SoCs, especially
with scaling technologies and embedded memories occupying larger areas of the SoC. For critical
applications requiring high availability and reliability, transient errors can be disastrous and very
expensive. Existing ECC methods lack the multi-bit error correction and automation needed to meet
reliability and volume production goals. What is needed is a fully automated and configurable
solution that provides protection against multi-bit upsets, enabling designers to meet the reliability
and safety goals of their mission-critical applications.
