
EMT 480/3: RELIABILITY & FAILURE ANALYSIS

Original version by Noraini Othman
Edited by Hasnizah Aris

Lecture contents
1. Reliability Engineering
2. Design for Reliability (DFR)
3. Reliability Prediction Techniques
4. FMEA
5. FTA
6. Reliability Life Testing

Terms & Definitions

Reliability
the ability of a product to conform to its
electrical and visual/mechanical specifications
over a specified period of time under
specified conditions at a specified confidence
level

Reliability Engineering
refers to the development of technology, processes and standards to ensure the reliability of semiconductors during applications. Focuses on eliminating maintenance requirements

Reliability Monitoring
consists of getting finished product samples from the line and
subjecting these to reliability testing. Valid reliability failures
should undergo root cause analysis for reliability improvement

Wafer-level Reliability Testing


once an integrated circuit has been designed and the first
silicon comes out, reliability tests at wafer-level are done to
assess the reliability of the die

Package-level Reliability Testing


refers to the assessment of the overall reliability of the device
in a packaged form.

New Product Qualification


operationally the same as package-level
reliability testing, except that it is systemized
with the objective of generating official
reliability data that would justify the mass
production of a new product.

Designing for Reliability (DFR)

The concept is to exert as much effort as possible to design a product to be inherently reliable

This consists of following all known design rules for making a product reliable, not only electrically but also visually and mechanically

Building reliability into a product as early as the design phase is a must.

Reliability design begins with the specification of reliability goals consistent with cost and performance objectives

These goals must be translated into individual component, subcomponent and part specifications

Various design methods are then applied in order to meet the goals (such as stress-strength analysis, simplification etc.)

A failure analysis is then performed to determine whether specifications are being met and to provide a systematic approach for identifying, ranking and eliminating failure modes

If either the reliability or the safety goals are not met, the design process must continue

Often, this may require a complete redesign
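Stress-strength analysis, named above as one of the design methods, can be sketched numerically. The sketch below assumes that stress and strength are independent and normally distributed, which is a common textbook simplification and not something stated in these notes. The function names and the numbers are illustrative only.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def stress_strength_reliability(mean_strength, sd_strength, mean_stress, sd_stress):
    """Probability that strength exceeds stress.

    Assumes independent, normally distributed stress and strength.
    """
    margin = mean_strength - mean_stress
    spread = math.hypot(sd_strength, sd_stress)  # sqrt(sd_strength^2 + sd_stress^2)
    return normal_cdf(margin / spread)

# Hypothetical values in arbitrary but consistent units.
r = stress_strength_reliability(100.0, 10.0, 70.0, 10.0)
print(f"Estimated reliability: {r:.4f}")
```

If the result falls short of the reliability goal, the design can be revisited by raising the mean strength, lowering the applied stress, or tightening the variation of either.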

Reliability Design

In summary, an excellent reliability engineering system would have all of the following components:
(a) design for reliability
(b) wafer-level reliability testing
(c) package-level reliability testing
(d) new product/process qualification
(e) reliability monitoring

What Are Reliability Prediction Techniques?

Reliability prediction is a design-assist process by which the reliability characteristics of a system are obtained, by calculating the anticipated system RAMS (Reliability, Availability, Maintainability and Safety-Integrity) from assumed component failure rates.

The Importance of Reliability Prediction:

(a) provides an early indication of a system's potential to meet the design reliability requirements
(b) enables assessment of life-cycle costs to be carried out
(c) enables one to establish which components, or areas in a design, contribute to the major portion of unreliability
(d) enables trade-offs to be made, for example between reliability and maintainability in achieving a given availability
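A prediction from assumed component failure rates can be sketched as follows. The sketch assumes a series system (any component failure fails the system) and constant failure rates, so that component rates simply add. The component names, failure rates and repair time are hypothetical and chosen only to illustrate points (a), (c) and (d) above.

```python
import math

# Hypothetical component failure rates in failures per hour (illustrative only).
failure_rates = {
    "power_supply": 20e-6,
    "processor": 5e-6,
    "memory": 3e-6,
    "connector": 2e-6,
}

def system_failure_rate(rates):
    """Series system with constant failure rates: the rates add."""
    return sum(rates.values())

def reliability(rate, hours):
    """Probability of surviving `hours` under a constant failure rate."""
    return math.exp(-rate * hours)

def availability(mtbf, mttr):
    """Steady-state availability from MTBF and mean time to repair (MTTR)."""
    return mtbf / (mtbf + mttr)

lam = system_failure_rate(failure_rates)
mtbf = 1.0 / lam

# Share of the system failure rate owed to each component.
contributions = {name: rate / lam for name, rate in failure_rates.items()}
worst = max(contributions, key=contributions.get)

print(f"System failure rate: {lam:.2e} per hour")
print(f"MTBF: {mtbf:.0f} hours")
print(f"R(1000 h): {reliability(lam, 1000):.4f}")
print(f"Availability with 8 h MTTR: {availability(mtbf, 8):.6f}")
print(f"Largest contributor: {worst} ({contributions[worst]:.0%})")
```

The contribution table shows where design effort is best spent, and the availability function shows how a shorter repair time can compensate for a lower MTBF.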

Why Are Reliability Prediction Techniques Needed?

Traditionally, reliability has been achieved through extensive testing and use of techniques such as probabilistic reliability modeling (these are techniques done in the late stages of development)

The challenge is to design in quality and reliability early in the development cycle

The reliability of a device could then be known up-front, during the design phase and before the device is manufactured

This could avoid costly redesign cycles.

What Are The Factors That Affect The Reliability Performance of Electronic Components?

Material quality

Operating temperature

Vibration and miscellaneous mechanical factors

Electrical stress levels

Introduction to FMEA
(a) Introduction

FMEA stands for Failure Modes and Effects Analysis

It is a methodology designed
(i) to identify potential failure modes for a product or
process;
(ii) to assess the risk associated with those failure modes;
(iii) to rank the issues in terms of importance; and
(iv) to identify and carry out corrective actions to address the
most serious concerns

For easy understanding, just remember that FMEA is intended to document:
(i) a Failure
(ii) its Mode
(iii) its Effects
(iv) by Analysis

(b) Types of FMEA

There are 2 standard categories of FMEA:

Design FMEA (DFMEA)
addresses potential failure modes arising during design of components and subsystems

Process FMEA (PFMEA)
addresses potential failure modes arising during manufacturing and assembly processes

(c) Process for Conducting FMEA

The process for conducting FMEA is summarized as follows:
(a) Describe Product or Process
(b) Define Functions
(c) Identify Potential Failure Modes
(d) Describe Effects of Failures
(e) Determine Causes
(f) Detection Methods or Current Controls
(g) Calculate Risks using the Risk Priority Number (RPN)
(h) Take Action
(i) Assess Results

A typical FMEA incorporates some method to evaluate the risk associated with the potential problems identified through the analysis. One such method uses Risk Priority Numbers (RPN)

To use the RPN method to assess risk, the analysis team must:
(a) Rate the severity of each effect of failure
(b) Rate the likelihood of occurrence for each cause of failure
(c) Rate the likelihood of prior detection for each cause of failure
(d) Calculate the RPN by obtaining the product of the three ratings:
RPN = Severity x Occurrence x Detection
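The RPN calculation and the ranking of issues can be sketched in a few lines. The 1 to 10 rating scales and the three failure modes below are assumptions made for illustration. The notes do not specify a scale, and the ratings are not taken from any real analysis.

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    description: str
    severity: int    # assumed scale: 1 (negligible) to 10 (hazardous)
    occurrence: int  # assumed scale: 1 (remote) to 10 (almost certain)
    detection: int   # assumed scale: 1 (almost always detected) to 10 (undetectable)

    @property
    def rpn(self) -> int:
        """Risk Priority Number = Severity x Occurrence x Detection."""
        return self.severity * self.occurrence * self.detection

# Hypothetical failure modes and ratings.
modes = [
    FailureMode("Marking illegible", 2, 5, 2),
    FailureMode("Bond wire lifts from pad", 8, 3, 6),
    FailureMode("Package crack after reflow", 7, 2, 4),
]

# Highest RPN first: these are the issues to address first.
ranked = sorted(modes, key=lambda m: m.rpn, reverse=True)
for m in ranked:
    print(f"{m.rpn:4d}  {m.description}")
```

After corrective action, the team would re-rate the affected failure modes and recompute the RPN to assess the results.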

An Example of FMEA Hazard Assessment

(d) Benefits of FMEA

Improve product/process reliability and quality

Increase customer satisfaction

Early identification and elimination of potential product/process failure modes

Prioritize product/process deficiencies

Capture engineering/organization knowledge

Introduction to Fault Tree Analysis (FTA)
(a) What is a FTA?

FTA stands for Fault Tree Analysis

It is a graphical representation of the major faults or critical failures associated with a product, the causes for the faults, and potential countermeasures

The tool helps identify areas of concern for new product design or for improvement of existing products. It also helps identify corrective actions to correct or mitigate problems

In a Fault Tree, one works in a failure space, and looks at system failure combinations

(b) When to use it?

FTA is useful both in designing new products/services and in dealing with identified problems in existing products/services.

In the quality planning process, the analysis can be used to optimize process features and goals and to design for critical factors and human error. As part of process improvement, it can be used to help identify root causes of trouble and to design remedies and

(c) Basic Constructs of FTA

The basic constructs in a Fault Tree Diagram are
(a) gates (represent conditions)
(b) events (represent the system failure mode)

The two most commonly used gates are:
(a) AND gates
(b) OR gates

If occurrence of either event causes the top event to occur, then these events (blocks) are connected using an OR gate

Alternatively, if both events need to occur to cause the top event to occur, they are connected by an AND gate

Example:
For the Top Event to occur, either A or B must happen. In other words, failure of A or B causes the system to fail.

Figure: Fault tree and its equivalent reliability block diagram
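The gate logic can be checked numerically. The sketch below assumes that A and B are independent events. The probabilities are hypothetical and the function names are illustrative. It also shows why an OR gate corresponds to a series block diagram: the system survives only if every block survives.

```python
from math import prod

def or_gate(probabilities):
    """Output event occurs if at least one independent input event occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)

def and_gate(probabilities):
    """Output event occurs only if all independent input events occur."""
    return prod(probabilities)

# Hypothetical failure probabilities of A and B over some mission time.
p_a, p_b = 0.01, 0.02

p_top_or = or_gate([p_a, p_b])    # either A or B fails the system
p_top_and = and_gate([p_a, p_b])  # both A and B must fail

# Series block diagram: system reliability is the product of block reliabilities.
r_series = (1.0 - p_a) * (1.0 - p_b)

print(f"P(top) with OR gate : {p_top_or:.4f}")
print(f"P(top) with AND gate: {p_top_and:.4f}")
print(f"Series reliability  : {r_series:.4f}")
```

The OR-gate result and the series reliability sum to 1, which is the equivalence between the two diagrams.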

Symbols used in FTA

(d) How to Perform FTA in 6 steps

1. Select a top level event for analysis. Try to be specific, for example, "Email server down for more than 4 hours." Sources of top level events include: Problem/Known Error Records; potential failures from brainstorming; etc.

2. Identify faults that could lead to the top level event. Continuing the above example, one possible fault leading to an outage lasting more than four hours might be "loss of power"; another might be "hardware failure." List all the faults under the top level event in boxes and connect the fault boxes to the top level event box by drawing lines.

3. For each fault, list as many causes as possible in boxes below the related fault. Continuing the example above, in the case of "loss of power," some causes might be "electrical outage," "power supply failure," and so on. Connect the boxes to the appropriate fault box.

4. Draw a diagram of the fault tree. Two logic operators, AND and OR, also known as logic gates, are used to represent the sequencing of faults and causes. For example, "Email server down for more than 4 hours" could be caused by "loss of power or hardware fault." Another might be "loss of building power and battery backup exhausted." Update faults and causes by grouping logically related items using AND or OR between faults and events, and between faults and causes. Re-draw the lines from top level event to logic gates to faults to logic gates to causes.

5. Continue identifying causes for each fault until you reach a root cause (reactive FTA), or one that you can do something about (proactive FTA). For example, the root cause of "power supply failure" might be "filter clogged"; the root cause of "battery backup exhausted" might be "battery backup too small."

6. Consider countermeasures. A root cause is one you can do something about, so now you need to think of the countermeasures you might apply to each root cause. List countermeasures for each root cause in a box under the root cause. For example, for "filter clogged" a countermeasure might be "clean filter monthly." Link the countermeasure to the root cause by drawing a line.
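Once the tree is drawn, the combinations of basic events that cause the top level event (the minimal cut sets) can be listed mechanically. The sketch below encodes one reading of the email server example, in which "loss of power" is taken to be the AND of "loss of building power" and "battery backup exhausted". That structure and the nested-tuple representation are assumptions made for illustration.

```python
from itertools import product

# A node is either a basic-event name (str) or a tuple (gate, [children]).
# Top level event: "Email server down for more than 4 hours".
tree = ("OR", [
    ("AND", ["loss of building power", "battery backup exhausted"]),
    "hardware failure",
])

def cut_sets(node):
    """Return the cut sets of a node as a list of frozensets of basic events."""
    if isinstance(node, str):
        return [frozenset([node])]
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":
        # Any cut set of any child is enough to trigger the gate.
        return [cs for sets in child_sets for cs in sets]
    if gate == "AND":
        # One cut set from every child must occur together.
        return [frozenset().union(*combo) for combo in product(*child_sets)]
    raise ValueError(f"unknown gate: {gate}")

def minimal(sets):
    """Drop duplicates and any cut set that strictly contains another one."""
    unique = set(sets)
    return [s for s in unique if not any(other < s for other in unique)]

minimal_cut_sets = sorted(minimal(cut_sets(tree)), key=lambda s: (len(s), sorted(s)))
for cs in minimal_cut_sets:
    print(" AND ".join(sorted(cs)))
```

Single-event cut sets such as "hardware failure" are single points of failure and are natural first candidates for countermeasures.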


Reliability Life Testing


(a) Objectives

Measure the reliability performance of the product

Provide a level of confidence for a new product design's reliability performance

Verify aspects of the functionality of the design

Determine the breaking point of the product

Uncover any weaknesses in components or the packaging of the design and develop appropriate corrective action

(b) Burn-In and Screening

Burn-in:
A process of operating items at elevated stress
levels (particularly temperature, humidity and
voltage) in order to accelerate the processes
leading to failure. The populations of defective
items are thus reduced

Screening:
An enhancement to Quality Control whereby
additional detailed visual and electrical/mechanical
tests seek to reveal defective features which would
otherwise increase the population of weak items

(c) Stress testing

In stress testing, a device is stressed until it fails. Stresses can be classified as environmental or self-generated

The environmental stresses may be any combination of temperature, vibration, humidity, shock or ingression of foreign bodies

The self-generated stresses include power dissipation, applied voltage and current, self-generated vibration and wear.

(d) Environmental Stress Screening (ESS)

ESS is a screening process in which a product is subjected to environmentally generated stress to precipitate latent product defects

ESS techniques can precipitate latent failures, which cannot be detected with electrical testing or visual inspection, so that infant-mortality cases can be eliminated and the product can enter the useful-life phase of the bath-tub curve at the end of the ESS testing

Several types of ESS testing available are listed as follows:
(i) Temperature Cycling
(ii) Thermal Shock
(iii) Humidity Testing
(iv) Temperature, Humidity, Bias (THB) Testing

(i) Temperature Cycling

Refers to the process in which a product is subjected to multiple cycles of changing temperatures between pre-determined extremes at relatively high rates of change, fatiguing the product and causing inferior product to fail

Cycling will show at what temperature, both high and low, a product will cease to function properly

(ii) Thermal Shock

Refers to rapid temperature changes from an extremely cold to a hot environment in order to thermally shock and stress a product

This causes permanent changes in electrical performance and can cause sudden overloading of materials

Thermal shock failures are due to thermal mismatches, that is, materials with differences in rates of thermal expansion and contraction

(iii) Humidity Testing

Humidity testing normally involves high heat to aid in forcing water vapor through weakly sealed components

Many electronic devices are susceptible to the damaging effects of moisture, both by direct condensation and by indirect effects

Direct condensation is where water comes out of the air and forms droplets on a device

These droplets may find their way into the device and attack sensitive components

Common effects include shorting of electrical components and initiation of corrosive effects

Indirect effects are numerous

An example is moisture breaching sealed devices, which results in failures over time

(iv) Temperature, Humidity, Bias (THB) Testing

THB Testing is a reliability test designed to accelerate metal corrosion, particularly that of the metallization on the die surface of the device

Aside from temperature and humidity, which are enough to promote corrosion of metals in the presence of contaminants, bias is applied to the device to provide the potential differences needed to trigger the corrosion process, as well as to drive mobile contaminants to areas of concentration on the die

(e) Other Types of Test

- Marginal Testing: involves proving the various system functions at the extreme limits of the electrical and mechanical parameters

- High Reliability Testing: for example, verification of a product MTBF of 10^6 hours involving 2000 elapsed hours of testing

- Testing for Packaging and Transport: involves consideration of waterproofing, hermetic seals, ventilation holes, adequate padding, adequate storage facilities etc.
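The High Reliability Testing example implies testing many units in parallel, since a single unit cannot demonstrate a very large MTBF in 2000 hours. A rough sizing can be sketched as follows. It assumes a constant failure rate, a test that passes only with zero failures, a target MTBF of 10^6 hours, and confidence levels of 60% and 90%. None of the confidence levels or the zero-failure plan is stated in the notes.

```python
import math

def required_unit_hours(target_mtbf, confidence):
    """Total failure-free unit-hours needed to demonstrate target_mtbf.

    With a constant failure rate, the chance of zero failures in T unit-hours
    is exp(-T / MTBF). Requiring that chance to be at most (1 - confidence)
    when the true MTBF equals the target gives T >= -MTBF * ln(1 - confidence).
    """
    return -target_mtbf * math.log(1.0 - confidence)

def required_units(target_mtbf, confidence, elapsed_hours):
    """Number of units to run in parallel for `elapsed_hours` each."""
    return math.ceil(required_unit_hours(target_mtbf, confidence) / elapsed_hours)

# Assumed values: MTBF target of 1e6 hours, 2000 elapsed hours of testing.
units_60 = required_units(1e6, 0.60, 2000)
units_90 = required_units(1e6, 0.90, 2000)
print(f"Units needed at 60% confidence: {units_60}")
print(f"Units needed at 90% confidence: {units_90}")
```

The hundreds of units required under these assumptions are one reason high reliability testing is often combined with accelerated stress conditions.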

Many accelerated life tests (ALT) of semiconductors involve temperature, as semiconductor properties usually have a strong temperature dependency
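Temperature acceleration is commonly quantified with the Arrhenius model, which is introduced here as an assumption because the notes do not name a model. The activation energy of 0.7 eV and the two temperatures below are hypothetical. Real values depend on the failure mechanism being accelerated.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K

def arrhenius_acceleration_factor(activation_energy_ev, use_temp_c, stress_temp_c):
    """Acceleration factor of a stress temperature relative to a use temperature.

    AF = exp[(Ea / k) * (1 / T_use - 1 / T_stress)], temperatures in kelvin.
    """
    t_use = use_temp_c + 273.15
    t_stress = stress_temp_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use - 1.0 / t_stress)
    return math.exp(exponent)

# Hypothetical case: 0.7 eV mechanism, 55 C in use, 125 C under test.
af = arrhenius_acceleration_factor(0.7, 55.0, 125.0)
print(f"Acceleration factor: {af:.1f}")
print(f"1000 test hours represent about {1000 * af:.0f} use hours")
```

The factor is only meaningful when the elevated temperature accelerates the same failure mechanism that occurs in normal use.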

The most common accelerated test conditions are as follows:
(a) Mechanical Shock
(b) Drop Shock (Test)
(c) Voltage Extremes
(d) High Humidity
(e) Random Vibration Test; etc
