This presentation on the fundamentals of a Root Cause Failure Analysis (RCFA) process
briefly describes
What RCFA is, and why it is done,
Types of root causes,
7 generic steps in an RCFA investigation,
Challenges in setting up an RCFA process,
Setting up and sustaining an RCFA process.
February 2008
Reactive method
Failure classifications
Chronic
Sporadic
Root Cause Failure Analysis is a class of problem solving methods using a step-by-step
method to discover the basic causes of failure.
Many commercial solutions exist with associated training, consulting, and software costs,
but all of these methods share the same fundamentals within this presentation.
The fundamental purposes of implementing RCFA are
to eliminate the recurrence of a failure, or
to manage the consequences of a failure should it occur again.
In many cases, it is not possible to completely eliminate the probability of a failure, and
that is why we consider failure management policies to manage the consequences to a
tolerable risk level.
Note that RCFA requires a failure to occur first before investigating and analyzing. Thus, it
is a reactive means of developing failure management policies. If the consequences of failure
are intolerable, then a proactive method is required.
These failures typically are classified as sporadic or chronic.
Sporadic failure events often are one-time events that usually gain significant
attention because they usually involve significant, unexpected, and severe
consequences.
Chronic events, unfortunately, are those that are accepted but may have significant
cumulative losses over a long period. Most failure events that occur more than once
should be considered chronic.
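As a minimal sketch of the rule of thumb above, the following Python snippet counts recorded failure events per equipment item and failure mode, and labels any pair seen more than once as chronic. The record layout, the equipment names, and the `classify_failures` helper are hypothetical and serve only as an illustration.

```python
from collections import Counter

def classify_failures(events):
    """Label each (equipment, failure mode) pair as 'chronic' or 'sporadic'.

    Assumption: a pair recorded more than once is treated as chronic,
    following the rule of thumb in the notes above.
    """
    counts = Counter(events)
    return {key: ("chronic" if n > 1 else "sporadic") for key, n in counts.items()}

# Hypothetical failure log: (equipment, failure mode)
log = [
    ("pump P-101", "seal leak"),
    ("pump P-101", "seal leak"),
    ("compressor K-200", "rod failure"),
]
labels = classify_failures(log)
print(labels)
```

In practice the grouping key would need to match how an organization records failures; the pair used here is only one possible choice.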
Physical
Physical component or material that failed
Human
Human actions or decisions leading to failure event
Latent
Reasons why actions or decisions were made
What are Root Causes? The root cause is the most basic reason for the problem
occurring that, if eliminated, will prevent the failure event from recurring. We define three
types of root causes: physical, human, and latent. Often, a failure results from one or more
root causes.
Physical root causes are the physical components or materials that caused the failure event. These
are almost always present and are typically the overall physical reason the event occurred.
Traditional Failure Analysis is key to determining the physical root cause. Unfortunately, if
we only rely upon it, we will stop too early and implement a physical redesign because all we
know is what physically failed. This is a common error in performing RCFA.
Human root causes are the last human actions that led to the failure event. These are
usually, but not always, present. Too often, organizations seek out the individual who took the
wrong action and stop there. This is counterproductive and will make people unsupportive
of RCFA because it becomes a witch-hunt.
Latent root causes are the reasons why decisions were made that resulted in the error.
There will usually be more than one latent root, and typically, if these did not exist, then the
human root likely would have been avoided. Examples are organizational systems and
processes that made the human think a certain way and take the improper action.
Eliminating latent root causes will eliminate the failure event, and should be the focus of the
investigation.
Generally, analysis teams are hesitant to address the latent root causes because these
are weaknesses in the existing organization. It is important that they are supported in this
approach, and the method is fully understood among all individuals.
7 Steps of RCFA?
Scoping
Preserving
ORganizing
Analyzing
Documenting
Implementing
Confirming
These seven steps describe the method used in almost any RCFA process. Each is
described in following slides.
Note the steps can be remembered by the acronym SPORADIC, a classification of
failure discussed earlier.
Not all of these steps are performed by the same individuals. It is important that RCFA is
viewed as a problem-solving method that spans the organization and beyond!
Scoping
Our method starts with a scoping of the failure. Scoping first evaluates the consequences
of the failure and its risk. Evaluating the risk identifies the consequences for what could
have happened if the failure recurs, and the associated frequency or probability of
recurrence. Doing so allows us to understand the reasonable, worst-case consequences
and either eliminate or manage them.
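The risk evaluation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the presentation does not specify a formula, so risk is expressed here as consequence cost multiplied by expected frequency (a common convention), and the function names, the dollar figures, and the tolerable-risk threshold are all hypothetical.

```python
def annual_risk(consequence_cost, events_per_year):
    """Expected annual loss: consequence multiplied by frequency (assumed convention)."""
    return consequence_cost * events_per_year

def scope(consequence_cost, events_per_year, tolerable_risk):
    """Return the risk and whether it is within the tolerable level (hypothetical threshold)."""
    risk = annual_risk(consequence_cost, events_per_year)
    return {"risk": risk, "tolerable": risk <= tolerable_risk}

# Hypothetical numbers: a 200,000 worst-case consequence expected once every 4 years,
# compared against a tolerable risk of 20,000 per year.
result = scope(consequence_cost=200_000, events_per_year=0.25, tolerable_risk=20_000)
print(result)
```

A result flagged as not tolerable would suggest that the failure must be eliminated or its consequences managed down to the tolerable level.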
Scoping also considers the nature of the failure event and directs the efforts to include the
internal and external functions as required. Internal functions, such as Environment / Health
/ Safety departments, Insurance, Communications, etc., or external functions such as
jurisdictional bodies, may be required to participate in the investigation. In some cases,
clearance to proceed with an investigation is necessary, and good policies defining scoping
ensure appropriate steps are taken.
The last purpose of scoping is to determine the level of formality. Small, relatively simple
failure events often follow a straightforward investigative method using local facilitators or
analysts.
It is not uncommon that complex failures, or those involving multiple internal and external
parties, use an external and experienced facilitator to provide an unbiased approach. In
some cases, these principal analysts come from an external source such as a jurisdictional
body.
Typical tasks
Coordinating activities among all parties
Interviewing and taking notes
Photographing
Handling parts
Collecting logs, databooks, alarm data, etc.
Preserving evidence and collecting data is the most important step in RCFA. Too often
we work in environments focused on repairing and returning equipment to service as soon
as possible. Within minutes, key evidence is lost or altered.
Without effective evidence preservation and data collection, an RCFA can become lengthy
and drawn out, identify the wrong root causes, waste resources, and allow failures to
recur.
Developing basic skills using best practices is viewed as a suitable responsibility for
nearly all field operations, mechanical staff, and contractors. These individuals typically
have the most experience with the equipment,
are present when the failure occurs, or are first-responders, and
participate in or coordinate the repairs and/or clean-up.
Common tasks during this step include
Coordinating activities to preserve evidence and collect data - at the site and off-site
such as repair shops, labs, etc.
Interviewing parties, taking notes, and witness statements
Photographing the overall site, the unaltered scene, damage, and all stages of
disassembly,
Handling parts, including disassembly methods, cutting or torching to avoid altering
the evidence, and preservation / packaging
Collecting logs, databooks, alarm data, drawings, manuals, etc.
Because so many stages are involved with the disassembly, repair, and lab analysis, it is
reasonable to expect this step to span weeks.
Analysis team
Facilitator (Principal Analyst)
Participants
Reviewer(s)
Implementers
Organizing the analysis team usually occurs during preserving evidence but sometimes is
completed afterwards when the failure is better understood.
Typically, the analysis team consists of a Facilitator or Principal Analyst responsible for
managing the analysis team through the analysis stage and documenting the findings /
recommendations. Typically, it is undesirable for the facilitator to be a technical expert
since bias may be introduced. Ideally, the Facilitator thoroughly understands the RCFA
method, plans the project, and manages the dynamics among the participants during the
sessions.
Participants are individuals with expertise in the equipment (its manufacture, fabrication,
application, operation, servicing, and maintenance). The RCFA flows better and much more
efficiently when Participants have basic training in the RCFA process. Participants are not
expected to be Facilitators.
Reviewers often are not on the analysis team, but later validate the technical conclusions
leading to the root causes and the technical feasibility of the recommendations.
Often the implementation of tasks is the responsibility of individuals other than the analysis
team or reviewers. Communication is essential!
Entrained sand entering separator (during regular operation)
Sand building up in separator (since cleaning on Apr 3/04)
Elbow erodes (between June 1/06 through Sept 3/06)
Eroded wall at elbow found
Level controller jammed by sand
Gas leak at elbow
Level controller fouled by sand (around June 1/06)
Sand carried in gas / liquid stream from separator (after June 1/06)
High gas concentration present
The next step is the analysis. One method is the Sequence of Events method.
Sequence of events analysis is very useful for
straightforward problems that have a known sequence of events leading to the failure
event,
complex problems where combinations of root causes exist and the approach is to
determine which cause(s) must be eliminated to break the chain,
establishing timelines and identifying which events require some other analysis tool
such as a logic tree.
It requires an understanding of what is controllable, and the resulting outcome of the
control, action, or response.
Approach:
1. Map the sequence of events that led to the failure.
2. For each event, determine if it is controllable, and if so, what alternatives exist to
change what happens,
3. Compare the alternatives and identify which can be implemented to break the chain
of events.
4. Create recommendations for physical, human, and latent roots contributing to the
sequence of events.
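The approach above can be sketched in Python as follows. The event descriptions come from the sand erosion example on the slide, but their ordering, the controllable flags, the alternative actions, and the `Event` and `chain_breakers` names are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    description: str
    controllable: bool = False                         # assumed flag, not from the slide
    alternatives: list = field(default_factory=list)   # assumed alternative actions

def chain_breakers(sequence):
    """Return the controllable events that have at least one alternative action."""
    return [e for e in sequence if e.controllable and e.alternatives]

# Step 1: map the sequence of events (order assumed for this sketch)
sequence = [
    Event("Entrained sand entering separator"),
    Event("Sand building up in separator", True, ["Shorten cleaning interval"]),
    Event("Level controller fouled by sand", True, ["Inspect level controller routinely"]),
    Event("Sand carried in gas / liquid stream from separator"),
    Event("Elbow erodes", True, ["Monitor wall thickness at elbow"]),
    Event("Gas leak at elbow"),
]

# Steps 2 and 3: list the candidate points where the chain could be broken
breakers = chain_breakers(sequence)
for event in breakers:
    print(event.description, "->", event.alternatives)
```

Comparing the listed alternatives (step 3) and writing recommendations for the physical, human, and latent roots (step 4) remain judgment calls for the analysis team; the sketch only organizes the information.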
Another analysis method uses a Logic Tree. It is very well-suited for complex or
ambiguous problems with many root causes.
The analysis is managed by constructing a logic tree using structured and simple
questions. These questions are used to
first define a failure event at the top of the tree,
identify possible hypotheses for the preceding cause of the failure,
test the hypotheses using evidence and data collected earlier.
This hypothesis/verification continues until the trail can be traced back to a latent root for
which a suitable failure management policy can be defined. The next level of hypotheses
must clearly flow from its predecessor (the one before it). If it is clear that a step is missing
between causes, it is added in and evidence is sought to support its presence.
Once the logic tree is completed and checked for logical flow, the team then determines
recommendations to prevent the sequence of causes and effects from recurring.
This method also accommodates a confidence rating based on the accuracy or quality of
collected evidence.
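A minimal sketch of the hypothesis/verification idea with a confidence rating is shown below. It is not the PROACT method or any commercial tool; the `Hypothesis` class, the 0.0 to 1.0 confidence scale, the 0.7 threshold, and the example statements are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    statement: str
    confidence: float                               # 0.0 to 1.0, judged from evidence (assumed scale)
    children: list = field(default_factory=list)    # hypotheses for the preceding cause

def verified_paths(node, threshold=0.7, path=()):
    """Yield chains from the failure event down to a leaf in which every
    hypothesis meets the confidence threshold. Branches below the threshold
    are not pursued, so a node whose children are all rejected yields nothing."""
    if node.confidence < threshold:
        return
    path = path + (node.statement,)
    if not node.children:
        yield path
    for child in node.children:
        yield from verified_paths(child, threshold, path)

# Hypothetical tree: failure event at the top, hypotheses beneath it
root = Hypothesis("Loss of containment at elbow", 1.0, [
    Hypothesis("Elbow wall eroded by sand", 0.9, [
        Hypothesis("Separator cleaning interval too long", 0.8),
    ]),
    Hypothesis("Elbow wall corroded", 0.2),
])

paths = list(verified_paths(root))
for p in paths:
    print(" <- ".join(p))
```

The surviving chains are the ones a team would trace back toward latent roots; the low-confidence corrosion branch is dropped rather than pursued.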
[Figure: logic tree with the failure Event at the top, followed by hypotheses Hyp. 2 through Hyp. 6, with markers P and H]
This slide demonstrates the simple questions used throughout the logic tree. We will not
discuss the questions in detail considering the time constraints around this presentation.
In short, the failure event is a straightforward description of the loss of function, not of the failure
itself.
It is followed by asking "How has this event occurred in the past?" (for chronic failures), or "What
evidence do we have at hand describing what caused the failure?" (for sporadic failures). In both
cases, only the facts are listed without any guesses on the causes.
Hypotheses for physical roots follow using the question "How could the preceding event have
occurred?" so an educated guess can be made. Evidence is used to prove or disprove the
hypothesis. If a hypothesis is disproved, or has a low confidence associated with it, then it is no
longer pursued. Only the developing roots that are proven with high confidence are pursued. This
prevents wasting resources on chasing red herrings.
At some point, the question for physical roots no longer makes sense. Usually this is when we
transition into discovering human root causes, and the question becomes "What was the action or
decision that allowed this physical root to occur?" As stated earlier, the analysis does not stop here.
This question only allows us to understand the human root cause.
Once the human root cause is identified, it becomes apparent that the more suitable question is
"How do business practices and systems contribute to this thinking?" Both internal and external
business practices and systems are within the scope of this question. Simply put, include your
manufacturers, suppliers, vendors, engineers, packagers, distributors, shippers, constructors,
commissioners, operators, and maintainers.
Typical latent root causes include training, skills verification, operating procedures, standards of
workmanship, time pressures, methods, drawing updates, communications, role and responsibility
definition, work scope definition, work conditions, management of change, holdpoints, inspections,
and procedures.
3 stages of communication
Selecting recommendations
Effort vs. likelihood to prevent recurrence
Not all causes need a corrective action
High payback is not uncommon, but surprising!
New causes?
Challenges
Starting right
Here are some considerations to set up and sustain an RCFA process. First, dedicate at
least one individual as a trainer, coach, and doer until a larger network of facilitators is
established.
Establish competencies among your field staff by setting up training for Preserving
Evidence & Data Collection. Within EnCana, we have an online e-learning module
supported with a Quick Reference Card to make training accessible to nearly everyone.
Train your principal analysts (facilitators), reviewers, and participants (generally your
technical specialists) in the RCFA method. Preparing them for the analysis ensures their
fluency in RCFA terminology and working with a common methodology.
Ensure you have leadership and active sponsorship for the RCFAs. Success during the
first few analyses is essential to demonstrate the efficiency, simplicity, and effectiveness of
the method.
Focus on chronic problems since these have a faster confirmation of results.
Pick a sensible and achievable method of doing your RCFA work. Many commercialized
methods exist, but you must consider scalability costs and suitability of training for all roles
and responsibilities.
Learn the method before attempting to use software as a tool. Developing the thought
processes is more important!
If possible, start in a friendly sandbox surrounded with sponsors and peers who
understand there will be glitches, but will accept these as the process is tuned.
The two books above are recommended if you are seeking additional information on
RCFA. The first book (Latino) presents the logic tree analysis in the PROACT methodology.
The second book (Mobley) presents the sequence of events analysis tool.
In conclusion, I encourage you to implement an RCFA process if you have not done so
yet. Strive to discover the latent root causes within your organization and externally. By doing
so, you will avoid pursuing many expensive physical redesigns and realize significant
reductions in your environmental / safety incidents and production costs.