
February 2008

An Overview of a Root Cause Failure Analysis (RCFA) Process
Roger Zavagnin
EnCana Corporation

This presentation on the fundamentals of a Root Cause Failure Analysis (RCFA) process
briefly describes
What RCFA is, and why it is done,
Types of root causes,
7 generic steps in an RCFA investigation,
Challenges in setting up an RCFA process,
Setting up and sustaining an RCFA process.

2008 IPEIA Conference, Banff, Canada


What and Why RCFA?

A class of problem-solving methods to
eliminate recurrence of failure, or
manage the consequences of failure

Reactive method

Failure classifications
Chronic
Sporadic

Root Cause Failure Analysis is a class of problem-solving methods that use a step-by-step
approach to discover the basic causes of failure.
Many commercial solutions exist with associated training, consulting, and software costs,
but all of these methods share the fundamentals described in this presentation.
RCFA is implemented for two fundamental purposes:
to eliminate the recurrence of a failure, or
to manage the consequences of a failure should it occur again.
In many cases, it is not possible to completely eliminate the probability of a failure, which
is why we consider failure management policies that reduce the consequences to a
tolerable risk level.
Note that RCFA requires a failure to occur before it can be investigated and analyzed. Thus, it
is a reactive means of developing failure management policies. If the consequences of failure
are intolerable, then a proactive method is required.
Failures typically are classified as sporadic or chronic.
Sporadic failure events often are one-time events that gain significant
attention because they usually involve significant, unexpected, and severe
consequences.
Chronic events, unfortunately, are those that are accepted but may have significant
cumulative losses over a long period. Most failure events that occur more than once
should be considered chronic.
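The chronic / sporadic distinction can be sketched in a few lines of Python. This is an illustration only: the `FailureEvent` record, its field names, the equipment tags, and the loss figures are all assumptions made for the example. The rule applied is simply the one stated above, that an event occurring more than once should be considered chronic.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureEvent:
    """One recorded failure (hypothetical record layout)."""
    equipment: str
    failure_mode: str
    loss: float  # cost of this single occurrence, in any consistent unit


def classify_failures(events):
    """Group events by (equipment, failure mode) and label each group.

    A group with more than one occurrence is labelled "chronic",
    otherwise "sporadic". Returns {key: (label, count, cumulative_loss)}.
    """
    groups = defaultdict(list)
    for event in events:
        groups[(event.equipment, event.failure_mode)].append(event)

    summary = {}
    for key, group in groups.items():
        label = "chronic" if len(group) > 1 else "sporadic"
        summary[key] = (label, len(group), sum(e.loss for e in group))
    return summary


# Made-up failure history, for illustration only.
history = [
    FailureEvent("pump P-101", "seal leak", 4_000.0),
    FailureEvent("pump P-101", "seal leak", 3_500.0),
    FailureEvent("pump P-101", "seal leak", 4_200.0),
    FailureEvent("separator V-200", "elbow rupture", 250_000.0),
]

for key, (label, count, total) in classify_failures(history).items():
    print(key, label, count, total)
```

Summing the losses per group makes the cumulative cost of an accepted, recurring failure visible next to the one-time cost of a sporadic event.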


What are Root Causes?

Physical
Physical component or material that failed

Human
Human actions or decisions leading to failure
event

Latent
Reasons why actions or decisions were made

What are Root Causes? The root cause is the most basic reason for the problem
occurring that, if eliminated, will prevent the failure event from recurring. We define three
types of root causes: physical, human, and latent. Often, a failure results from one or more
root causes.
Physical root causes are the physical components that caused the failure event. These
are almost always present and are typically the overall physical reason the event occurred.
Traditional Failure Analysis is key to determining the physical root cause. Unfortunately, if
we rely only upon it, we will stop too early and implement a physical redesign because all we
know is what physically failed. This is a common error in performing RCFA.
Human root causes are the last human actions that led to the failure event. These are
usually, but not always, present. Too often, organizations seek out the individual who took
the wrong action and stop there. This is counterproductive and will make people unsupportive
of RCFA because it becomes a witch-hunt.
Latent root causes are the reasons why the decisions that resulted in the error were made.
There will usually be more than one latent root, and typically, if these did not exist, then the
human root likely would have been avoided. Examples are organizational systems and
processes that led the person to think a certain way and take the improper action.
Eliminating latent root causes will eliminate the failure event, and should be the focus of the
investigation.
Generally, analysis teams are hesitant to address the latent root causes because these
are weaknesses in the existing organization. It is important that teams are supported in this
approach, and that the method is fully understood by all individuals.


7 Steps of RCFA

Scoping

Preserving evidence and collecting data

ORganizing

Analyzing

Documenting

Implementing

Confirming

These seven steps describe the method used in almost any RCFA process. Each is
described in the following slides.
Note that the steps can be remembered by the acronym SPORADIC, a classification of
failure discussed earlier.
Not all of these steps are performed by the same individuals. It is important that RCFA is
viewed as a problem-solving method that spans the organization and beyond!


Scoping

Consequences and risk

Who needs to be involved?
Internal non-technical functions?
External jurisdictional bodies?

Formal vs. informal methods

Local vs. external principal analysts

Our method starts with scoping the failure. Scoping first evaluates the consequences
of the failure and its risk. Evaluating the risk identifies the consequences of what could
happen if the failure recurs, and the associated frequency or probability of
recurrence. Doing so allows us to understand the reasonable worst-case consequences
and either eliminate or manage them.
Scoping also considers the nature of the failure event and directs the effort to include
internal and external functions as required. Internal functions, such as Environment / Health
/ Safety departments, Insurance, Communications, etc., or external functions such as
jurisdictional bodies, may be required to participate in the investigation. In some cases,
clearance to proceed with an investigation is necessary, and good policies defining scoping
ensure appropriate steps are taken.
The last purpose of scoping is to determine the level of formality. Small, relatively simple
failure events often follow a straightforward investigative method using local facilitators or
analysts.
Complex failures, or those involving multiple internal and external parties, commonly use
an external and experienced facilitator to provide an unbiased approach. In
some cases, these principal analysts come from an external source such as a jurisdictional
body.


Preserving Evidence and Collecting Data

Most important step!

Basic skills using best practices

Typical tasks
Coordinating activities among all parties
Interviewing and taking notes
Photographing
Handling parts
Collecting logs, databooks, alarm data, etc.

Preserving evidence and collecting data is the most important step in RCFA. Too often
we work in environments focused on repairing equipment and returning it to service as soon
as possible. Within minutes, key evidence is lost or altered.
Without effective evidence preservation and data collection, an RCFA becomes lengthy and
drawn out, may identify the wrong root causes, wastes resources, and allows failures to
recur.
Developing basic skills using best practices is viewed as a suitable responsibility for
nearly all field operations, mechanical staff, and contractors. These individuals typically
have the most experience with the equipment,
are present when the failure occurs, or are first-responders, and
participate in or coordinate the repairs and/or clean-up.
Common tasks during this step include
Coordinating activities to preserve evidence and collect data, both at the site and off-site
(repair shops, labs, etc.),
Interviewing parties, taking notes, and collecting witness statements,
Photographing the overall site, the unaltered scene, damage, and all stages of
disassembly,
Handling parts, including disassembly, cutting, or torching methods that avoid altering
the evidence, and preservation / packaging,
Collecting logs, databooks, alarm data, drawings, manuals, etc.
Because so many stages are involved in the disassembly, repair, and lab analysis, it is
reasonable to expect this step to span weeks.


ORganizing the Analysis

Analysis team
Facilitator (Principal Analyst)
Participants

Reviewer(s)

Implementers

Organizing the analysis team usually occurs while evidence is being preserved, but sometimes
is completed afterwards when the failure is better understood.
Typically, the analysis team consists of a Facilitator or Principal Analyst responsible for
managing the analysis team through the analysis stage and documenting the findings /
recommendations. Typically, it is undesirable to have a Facilitator who is a technical expert
since bias may be introduced. Ideally, the Facilitator thoroughly understands the RCFA
method, plans the project, and manages the dynamics among the participants during the
sessions.
Participants are individuals with expertise in the equipment (its manufacture, fabrication,
application, operation, servicing, and maintenance). The RCFA flows better and much more
efficiently when Participants have basic training in the RCFA process. Participants are not
expected to be Facilitators.
Reviewers often are not on the analysis team, but later validate the technical conclusions
leading to the root causes and the technical feasibility of the recommendations.
Often the implementation of tasks is the responsibility of individuals other than the analysis
team or reviewers. Communication is essential!


Analyzing Sequence of Events

Good for simple, linear events with few root causes
Evidence-based
Identify means to break the chain of events

[Figure: sequence-of-events diagram for a gas leak at a separator elbow. The labels below are
recovered from the slide; their order and the arrows connecting them are not recoverable.]
Liquid levels rising since April 30/06
4 sand found in separator
Entrained sand entering separator (during regular operation)
Sand building up in separator (since cleaning on Apr 3/04)
Elbow erodes (between June 1/06 through Sept 3/06)
Eroded wall at elbow found
Site logs do not have regular entries and inconsistent levels recorded
Level controller jammed by sand
Gas leak at elbow
Level controller fouled by sand (around June 1/06)
20% LEL alarm trips, Sept 3/06 09:45
Sand carried in gas / liquid stream from separator (after June 1/06)
Site ESDs close, Sept 3/06 09:45
High gas concentration present

The next step is the analysis. One method is the Sequence of Events method.
Sequence of events analysis is very useful for
straightforward problems that have a known sequence of events leading to the failure
event,
complex problems where combinations of root causes exist and the approach is to
determine which cause(s) must be eliminated to break the chain,
establishing timelines and identifying which events require some other analysis tool
such as a logic tree.
It requires an understanding of what is controllable, and of the resulting outcome of the
control, action, or response.
Approach:
1. Map the sequence of events that led to the failure.
2. For each event, determine if it is controllable and, if so, what alternatives exist to
change what happens.
3. Compare the alternatives and identify which can be implemented to break the chain
of events.
4. Create recommendations for physical, human, and latent roots contributing to the
sequence of events.
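The first three steps of the approach above can be sketched as follows. This is a minimal illustration, not a prescribed tool: the `Event` class, the `controllable` flag, the effort scores, and the example alternatives are all assumptions made for the sketch; only the event descriptions are borrowed from the separator example on the slide. Step 4, writing recommendations against the physical, human, and latent roots, remains a judgment for the analysis team.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """One link in the chain of events (hypothetical structure)."""
    description: str
    controllable: bool = False
    # Maps a proposed alternative to a rough effort score (assumed scale:
    # lower is easier). Only meaningful when the event is controllable.
    alternatives: dict = field(default_factory=dict)


def chain_breakers(chain):
    """Return (event, alternative, effort) tuples, lowest effort first."""
    candidates = []
    for event in chain:                 # step 1: the mapped sequence
        if not event.controllable:      # step 2: skip uncontrollable events
            continue
        for action, effort in event.alternatives.items():
            candidates.append((event.description, action, effort))
    # step 3: compare the alternatives; here simply by effort score
    return sorted(candidates, key=lambda c: c[2])


# Alternatives and scores below are invented for illustration.
chain = [
    Event("Entrained sand entering separator"),
    Event("Sand building up in separator", True,
          {"Schedule regular separator cleaning": 2}),
    Event("Level controller fouled by sand", True,
          {"Review site logs for abnormal liquid levels": 1}),
    Event("Elbow erodes", True,
          {"Redesign elbow with erosion-resistant material": 5}),
    Event("Gas leak at elbow"),
]

for description, action, effort in chain_breakers(chain):
    print(f"{effort}: {action} (breaks chain at: {description})")
```

Sorting by a single effort score is the simplest possible comparison; a real evaluation would also weigh how reliably each alternative breaks the chain.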


Analyzing Logic Tree

Good for complex / ambiguous problems with many root causes
Hypothesis development
Simple questions
Evidence-based confirmation
Accommodates confidence

Another analysis method uses a Logic Tree. It is very well-suited for complex or
ambiguous problems with many root causes.
The analysis is managed by constructing a logic tree using structured and simple
questions. These questions are used to
first define a failure event at the top of the tree,
identify possible hypotheses for the preceding cause of the failure,
test the hypotheses using evidence and data collected earlier.
This hypothesis / verification cycle continues until the trail can be traced back to a latent
root for which a suitable failure management policy can be defined. Each level of hypotheses
must clearly flow from its predecessor. If it is clear that a step is missing
between causes, it is added in and evidence is sought to support its presence.
Once the logic tree is completed and checked for logical flow, the team then determines
recommendations to prevent the sequence of causes and effects from recurring.
This method also accommodates a confidence rating based on the accuracy or quality of
collected evidence.
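The hypothesis / verification cycle and the confidence rating can be sketched as a small tree walk. This is an illustrative sketch under stated assumptions: the `Node` class, the 0 to 1 confidence scale, the 0.7 threshold, and the example hypotheses are invented for the example and are not taken from any commercial RCFA method.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node in the logic tree: the failure event or a hypothesis."""
    statement: str
    confidence: float = 1.0  # assumed 0..1 rating of the supporting evidence
    children: list = field(default_factory=list)


def verified_paths(node, threshold=0.7, path=()):
    """Yield root-to-leaf paths in which every node meets the threshold.

    A hypothesis that is disproved or weakly supported (confidence below
    the threshold) is pruned, and nothing beneath it is explored.
    """
    if node.confidence < threshold:
        return
    path = path + (node.statement,)
    if not node.children:
        yield path
        return
    for child in node.children:
        yield from verified_paths(child, threshold, path)


# Hypothetical tree; statements and ratings are made up for illustration.
tree = Node("Loss of containment at elbow", children=[
    Node("Wall thinned by erosion", 0.9, [
        Node("Sand carried over from separator", 0.8, [
            Node("Separator cleaning interval not defined", 0.75),
        ]),
    ]),
    Node("Wall thinned by corrosion", 0.2, [
        Node("Inhibitor program lapsed", 0.9),
    ]),
])

for p in verified_paths(tree):
    print(" <- ".join(p))
```

Pruning at the first low-confidence hypothesis is what keeps the team from spending effort below a branch the evidence does not support, even if a deeper node would have rated highly on its own.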


Analyzing Logic Tree


[Figure: logic tree diagram. The labels and guiding questions below are recovered from the
slide, top to bottom; the connections between nodes are not recoverable.]
Factual Causes
Event: What is the abnormal state of failure?
Mode 1, Mode 2, Mode 3, Mode 4: How has this event occurred in the past? What evidence do
we have at hand describing what caused the failure?
Hypothesis and Verification
Hyp. 1, Hyp. 2, Hyp. 3: How could the preceding event have occurred?
Hyp. 4, Hyp. 5, Hyp. 6 (with nodes marked P and H): What was the action or decision that
allowed this physical root to occur? How do business practices and systems contribute to
this thinking?

This slide demonstrates the simple questions used throughout the logic tree. We will not
discuss the questions in detail, given the time constraints of this presentation.
In short, the failure event is a straightforward description of the loss of function, not of the failure
itself.
It is followed by asking "How has this event occurred in the past?" (for chronic failures), or "What
evidence do we have at hand describing what caused the failure?" (for sporadic failures). In both
cases, only the facts are listed, without any guesses about the causes.
Hypotheses for physical roots follow, using the question "How could the preceding event have
occurred?" so that an educated guess can be made. Evidence is used to prove or disprove the
hypothesis. If a hypothesis is disproved, or has a low confidence associated with it, then it is no
longer pursued. Only the developing roots that are proven with high confidence are pursued. This
prevents wasting resources chasing red herrings.
At some point, the question for physical roots no longer makes sense. Usually this is when we
transition into discovering human root causes, and the question becomes "What was the action or
decision that allowed this physical root to occur?" As stated earlier, the analysis does not stop
here. This question only allows us to understand the human root cause.
Once the human root cause is identified, it becomes apparent that the more suitable question is
"How do business practices and systems contribute to this thinking?" Both internal and external
business practices and systems are within the scope of this question. Simply put, include your
manufacturers, suppliers, vendors, engineers, packagers, distributors, shippers, constructors,
commissioners, operators, and maintainers.
Typical latent root causes include training, skills verification, operating procedures, standards of
workmanship, time pressures, methods, drawing updates, communications, role and responsibility
definition, work scope definition, work conditions, management of change, holdpoints, inspections,
and procedures.


Documenting, Implementing, Confirming

3 stages of communication

Selecting recommendations
Effort vs. likelihood to prevent recurrence
Not all causes need a corrective action
High payback is not uncommon, but surprising!

Long time periods to confirm results


Investigate similar failures

New causes?

Originated before the RCFA?

Communicating the analysis involves three stages.
1st: a summary of the failure event, the root causes, and the associated
recommendations coming out of the analysis;
2nd: which recommendations were selected during the evaluation, how they will be
implemented, when, and by whom;
3rd: whether or not the implemented recommendations were successful.
It is important to understand that not all causes need a corrective action applied to them
to prevent recurrence or to adequately manage the consequences of failure. For example,
a Sequence of Events analysis requires the sequence to be broken, and often only a few
recommendations with a high impact require implementation. A Logic Tree analysis could
identify a number of root causes, but only a few have technically feasible recommendations
or have such a high impact that the remaining risk of recurrence is tolerable.
During the selection of recommendations, it is not uncommon to find payback in the range
of 30:1 or higher! Because latent roots deal with organizational systems, policies, and
procedures, the effort to change and manage those is significantly less than that of complex
physical redesigns.
Lastly, note that it may take months, years, or decades to confirm whether the
implementation was successful. Too often, organizations pursuing an RCFA program
expect immediate results within a financial quarter or two. Many failure mechanisms
commence thousands of hours before the failure is recognized. It is foolish to immediately
conclude the implementation was unsuccessful without understanding the root cause of the
failure!
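One simple way to compare recommendations on effort versus benefit is a payback ratio. The sketch below is only an illustration: the function name, the recommendation names, and every figure are assumptions invented for the example, chosen so that one entry shows what a 30:1 payback looks like.

```python
def payback_ratio(loss_avoided, implementation_cost):
    """Benefit-to-cost ratio; 30.0 corresponds to a 30:1 payback."""
    if implementation_cost <= 0:
        raise ValueError("implementation_cost must be positive")
    return loss_avoided / implementation_cost


# (recommendation, estimated loss avoided, estimated cost): made-up numbers
recommendations = [
    ("Redesign piping elbow", 200_000, 80_000),
    ("Update operating procedure", 150_000, 5_000),
    ("Add operator training module", 90_000, 6_000),
]

# Highest payback first.
ranked = sorted(recommendations,
                key=lambda r: payback_ratio(r[1], r[2]),
                reverse=True)

for name, benefit, cost in ranked:
    print(f"{name}: {payback_ratio(benefit, cost):.1f}:1")
```

In this made-up data the procedural and training changes rank ahead of the physical redesign, which mirrors the point above about the relative effort of addressing latent roots.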


Challenges

Attitude: The failure is preventable or manageable!

Learning: History and experience have value

Capacity: Busy repairing vs. busy eliminating

Capability: Follow a simple, well-defined process with technical support

Expectations: Results may take a long time

Change: Already doing it to a degree

Now we consider some challenges in setting up an RCFA process.
First, the culture or status quo likely accepts failures because "stuff happens". Starting on
small, chronic failures allows for a quick demonstration that failures are preventable or
manageable, and provides quick, real results.
Engaging experienced employees throughout the process fosters a culture of learning from
their experience, and gains buy-in by recognizing the importance of that experience.
Often, being too busy repairing is a challenge. The question to be asked of employees,
and demonstrated by their supervisors, is whether they intend to remain busy repairing, or
get busy eliminating. Eliminating chronic failures tends to hit a critical mass where reduced
repair time easily accommodates additional RCFA activities.
Starting with a straightforward, simple RCFA process that everyone can comprehend, and in
which everyone can identify their responsibilities, is key. Ensure experienced RCFA individuals
are available to train, coach, and do analyses.
Setting the right expectations is important. As stated earlier, it may take years to confirm
the prevention of sporadic failures, or perhaps months for chronic failures. It is important
that sponsors understand this duration. After implementation, it is necessary to ensure the
organization does not slip back into its bad practices.
Lastly, fear of change is common. Most technical people already do ad-hoc RCFA,
although not to the level of identifying latent roots (typically just physical roots, leading to
expensive redesigns). Building upon existing thought processes is a good start to fine-tuning
their skills for this more thorough analysis method.


Setting-Up & Sustaining RCFA

Dedicated Trainer / Coach resource

Training based on roles & responsibilities
Preserving Evidence & Data Collection
Participant / Facilitators / Reviewers

Leadership & Active Sponsorship
Assign resources / select / implement / track

Starting right
Chronic vs. Sporadic problems
Sensible and achievable method
Learn the method before a software tool
Focused efforts in a friendly sandbox

Here are some considerations for setting up and sustaining an RCFA process. First, dedicate at
least one individual as a trainer, coach, and doer until a larger network of facilitators is
established.
Establish competencies among your field staff by setting up training for Preserving
Evidence & Data Collection. Within EnCana, we have an online e-learning module
supported with a Quick Reference Card to make training accessible to nearly everyone.
Train your principal analysts (facilitators), reviewers, and participants (generally your
technical specialists) in the RCFA method. Preparing them for the analysis ensures they are
fluent in RCFA terminology and work with a common methodology.
Ensure you have leadership and active sponsorship for the RCFAs. Success during the
first few analyses is essential to demonstrate the efficiency, simplicity, and effectiveness of
the method.
Focus on chronic problems since these have a faster confirmation of results.
Pick a sensible and achievable method of doing your RCFA work. Many commercialized
methods exist, but you must consider scalability costs and suitability of training for all roles
and responsibilities.
Learn the method before attempting to use software as a tool. Developing the thought
processes is more important!
If possible, start in a friendly sandbox surrounded by sponsors and peers who
understand there will be glitches, but will accept these as the process is tuned.


Acknowledgements & Further Reading

Root Cause Analysis: Improving Performance for Bottom-Line Results, 2nd ed.
Robert Latino, Kenneth Latino
ISBN 0-8493-1318-X

Root Cause Failure Analysis
R. Keith Mobley
ISBN 0-7506-7158-0

The two books above are recommended if you are seeking additional information on
RCFA. The first book (Latino) presents the logic tree analysis in the PROACT methodology.
The second book (Mobley) presents the sequence of events analysis tool.
In conclusion, I encourage you to implement an RCFA process if you have not done so
yet. Strive to discover the latent root causes within your organization and externally. By doing
so, you will avoid pursuing many expensive physical redesigns and realize significant
reductions in your environmental / safety incidents and production costs.

