memories
Sandeep Kaushik - May 23, 2013
The scaling of semiconductor technologies has led to a lower operating voltage in semiconductor
devices, which, in turn, reduces the charge available on the capacitors for volatile memories. The
overall effect of this is that devices are generally more sensitive to soft or transient errors, because
even low-energy alpha particles can easily flip the bits stored in storage cells or change the values
stored in sequential logic elements, producing erroneous results.
Increases in memory density, system-on-chip (SoC) memory content, and performance, together with technology scaling and reduced voltages, raise the probability of multi-bit transient errors.
Notably, transient errors are no longer restricted to aerospace applications. Now applications such
as biomedical, automotive, networking, and high-end computing are susceptible to transient errors
and have a need for high reliability and safety.
Transient error sources are, in many cases, self-inflicted, because alpha particles are commonly
generated in materials adjacent to the chip, in solders, and in the packaging. Due to the higher
susceptibility to multiple-bit (multi-bit) transient errors, and an increasing requirement for high
reliability, there is a greater need to mitigate transient errors in embedded memories. In this article
we discuss transient error detection and correction methods using advanced error correction code
(ECC) based solutions for embedded memories in order to meet the requirements of today's high-reliability applications.
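To make the idea of ECC-based detection and correction concrete, here is a minimal sketch of a textbook Hamming(7,4) code, which protects 4 data bits with 3 parity bits and can correct any single flipped bit. It is an illustration only, not the specific scheme discussed in this article; the function names `hamming74_encode` and `hamming74_decode` and the bit layout are assumptions made for the example.

```python
# Illustrative Hamming(7,4) single-error-correcting code.
# Assumed layout: codeword positions 1..7, parity bits at positions 1, 2, 4,
# data bits at positions 3, 5, 6, 7. Bit (pos - 1) of the integer holds position pos.

def hamming74_encode(nibble):
    """Encode a 4-bit value into a 7-bit codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0 is unused so that indices match positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    return sum(code[pos] << (pos - 1) for pos in range(1, 8))


def hamming74_decode(word):
    """Return (nibble, syndrome); syndrome is 0 if no error was seen,
    otherwise the position (1..7) of the bit that was flipped back."""
    code = [0] + [(word >> (pos - 1)) & 1 for pos in range(1, 8)]
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set-bit positions is 0 for a valid codeword
    if syndrome:
        code[syndrome] ^= 1  # correct the single flipped bit
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, syndrome


if __name__ == "__main__":
    stored = hamming74_encode(0b1011)
    upset = stored ^ (1 << 4)  # simulate a particle strike flipping position 5
    print(hamming74_decode(upset))
```

A code of this kind handles only one flipped bit per codeword: two flips in the same word produce a nonzero syndrome that points at the wrong position, so the decoder silently mis-corrects. That limitation is why the growing likelihood of multi-bit errors calls for stronger codes.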
Understanding Errors
Transient or soft errors are functional errors resulting from strikes by energetic particles such as
neutrons and alpha particles. They are random in nature and typically lead to data corruption or
cause electronic systems to crash. For less critical applications, transient errors are eclipsed by
more common issues and can be fixed by resetting or rewriting the device. The time required to do
so and bring the device back to normal operation is generally acceptable to users.
However, for critical applications such as networking, transient errors can be catastrophic. Just
relying on the reset strategy for transient error mitigation can be very expensive, as the system will
be unavailable during the length of the reset or cycle time. This delay might not be acceptable given
that some of these mission-critical systems require 99.999% availability.
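To put the 99.999% figure in perspective, the short calculation below converts an availability target into a yearly downtime budget. The helper name `downtime_budget_seconds` and the 365-day year are assumptions made for illustration.

```python
# Convert an availability target into the downtime it allows per period.
# Assumption: a 365-day year (31,536,000 seconds) as the default period.

SECONDS_PER_YEAR = 365 * 24 * 3600


def downtime_budget_seconds(availability_percent, period_seconds=SECONDS_PER_YEAR):
    """Return the maximum allowed downtime, in seconds, over the period."""
    return period_seconds * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    budget = downtime_budget_seconds(99.999)
    print(f"{budget:.1f} s per year ({budget / 60:.2f} min)")
```

At 99.999% the budget works out to roughly 315 seconds, a little over five minutes per year, so even a handful of reset cycles lasting a minute or two each could consume it.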
In addition to disruptions in high availability, transient memory errors can cause security
vulnerabilities. Since transient errors have been around and causing electronic systems to fail for
years, JEDEC JESD89A was defined to standardize the requirements and procedures for soft-error-rate testing of integrated circuits and reporting of results. However, the options to take any
corrective action based on the testing for errors after a design is complete are limited.
Transient errors, like a lot of other design issues, are very costly to address as an afterthought. They
can be proactively handled in a much less expensive and more effective manner. In fact, it is
significantly more advantageous to address transient errors during the design phase.