The selection of logic solver and field device technologies and configurations to meet
safety and availability requirements in the process industries
Paul Gruhn, P.E., ISA 84 Expert
ICS Triplex | Rockwell Automation, Houston, Texas, USA
pgruhn@ra.rockwell.com, 281-330-0393
Keywords
Safety Instrumented System, Safety Instrumented Function, Safety Integrity Level, Safe
Failure Fraction, Fault Tolerance, ISA 84 (IEC 61511) standard
Abstract
When a restaurant has only two choices (e.g., hamburger or chicken sandwich), choosing is
easy and fast. However, when a restaurant has a menu that is twelve pages long, choosing is
neither easy nor fast. Designing a safety instrumented system is similarly problematic. The sheer
number of choices available, such as configuration (e.g., single, dual, triple, quad), design
options (e.g., certified vs. prior use, centralized vs. distributed), determining test intervals, and
the multitude of different vendor products and technologies for both logic solvers (e.g., relays,
solid state, programmable) and field devices (e.g., switches, transmitters) means that choosing
and designing a system is no longer as easy and simple as it used to be back in the days of relays
and discrete switches. This paper will review different design configurations and options for
safety instrumented systems in the process industries and their impact on system performance.
Background Concepts
In order to compare logic and field device technologies and configurations, their impact on
system performance, and understand the fault tolerance requirements listed in the standards, it
will first be necessary to review and define a few basic concepts such as failure modes, safe
failure fraction, hardware fault tolerance, and the real impact of redundancy.
Failure Modes

Safety system failures have long been categorized in two different modes: safe and dangerous.
Safe failures result in nuisance trips and lost production. Dangerous failures are where
the system will not respond to an actual demand. SIL (Safety Integrity Level) is a measure of
performance against dangerous failures only. In other words, knowing the SIL a function meets
tells you nothing about its nuisance trip performance.

ISAFrance,2012

Gruhn

Safe Failure Fraction

Intelligent devices (i.e., those with microprocessors) are able to detect some failures.
However, diagnostic coverage can never be 100%.
Therefore, four failure categories may be defined, as
shown in Figure 1.
Safe failure fraction (SFF), a term used in recent
industry standards, is defined as safe failures
(detected and undetected), plus dangerous detected
failures, divided by the total of all failures. With the
splits shown in Figure 1, which are merely an
example for illustrative purposes, the safe failure
fraction is 75%.
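
As a minimal sketch of the definition above, the safe failure fraction can be computed from the four failure categories. The function name and the equal 25% splits are assumptions chosen for illustration (Figure 1 is not reproduced here); they are simply one set of splits that yields the 75% quoted above.

```python
def safe_failure_fraction(safe_detected, safe_undetected,
                          dangerous_detected, dangerous_undetected):
    """Return SFF = (SD + SU + DD) / (SD + SU + DD + DU)."""
    total = (safe_detected + safe_undetected
             + dangerous_detected + dangerous_undetected)
    if total <= 0:
        raise ValueError("total of all failures must be positive")
    return (safe_detected + safe_undetected + dangerous_detected) / total


# Hypothetical equal splits: each category is 25% of all failures.
sff = safe_failure_fraction(0.25, 0.25, 0.25, 0.25)
print(f"SFF = {sff:.0%}")  # prints "SFF = 75%"
```

Only the dangerous undetected share lowers the result, which is why better diagnostics raise the safe failure fraction.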

Figure 1: Failure Categories

Hardware Fault Tolerance

Redundancy and fault tolerance are not the same. Redundancy is a vague term open to various
interpretations. Anything more than one is redundant. The definition of fault tolerance is more
specific.
A hardware fault tolerance of N (i.e., 0, 1 or 2) means that N+1 dangerous faults could cause a
loss of the safety function. A non-redundant configuration (1oo1: one-out-of-one) has a fault
tolerance of zero. A two-out-of-two (2oo2) configuration also has a fault tolerance of zero.
One-out-of-two (1oo2) and two-out-of-three (2oo3) configurations have a fault tolerance of one. A
fault tolerance of two requires either a 1oo3 or 2oo4 configuration.
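
The configurations listed above all follow one rule: an M-out-of-N arrangement tolerates N minus M dangerous faults. The short sketch below checks that rule against each configuration named in the text; the function name is hypothetical.

```python
def hardware_fault_tolerance(m, n):
    """Fault tolerance of an M-out-of-N (MooN) voting arrangement.

    MooN trips when M of the N channels call for a trip, so up to
    N - M channels may fail dangerously before the function is lost.
    """
    if not 1 <= m <= n:
        raise ValueError("need 1 <= M <= N")
    return n - m


for m, n in [(1, 1), (2, 2), (1, 2), (2, 3), (1, 3), (2, 4)]:
    print(f"{m}oo{n}: hardware fault tolerance = {hardware_fault_tolerance(m, n)}")
```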
Table 1 lists the fault tolerance requirements for software based devices in order to meet the
different integrity levels based on the safe failure fraction, per IEC 61508.
Safe Failure Fraction    Hardware Fault Tolerance
                         0              1          2
< 60 %                   Not Allowed    SIL 1      SIL 2
60 % to < 90 %           SIL 1          SIL 2      SIL 3
90 % to < 99 %           SIL 2          SIL 3      SIL 4
> 99 %                   SIL 3          SIL 4      SIL 4

Table 1: Hardware Fault Tolerance Requirements for Type B Devices (per IEC 61508)

Devices are considered type B when:

a) the failure mode of at least one constituent component is not well defined, or
b) the behavior of the subsystem under fault conditions cannot be completely determined, or
c) there is insufficient dependable failure data from field experience to support claims for
rates of failure for detected and undetected dangerous failures.

Software based programmable devices/systems are considered type B. A common or typical
safe failure fraction for a general purpose PLC (Programmable Logic Controller) would be in the
range of 70-80%. Table 1 means that if your performance target were SIL 1, you could meet it with
a 1oo1 or 2oo2 PLC configuration. If your target were SIL 2, you would need a 1oo2 or 2oo3
configuration.
However, if you had a safety PLC (one designed originally to meet the requirements of the
IEC 61508 standard) that had a safe failure fraction of 95%, you could achieve SIL 2 with a 1oo1
configuration, and SIL 3 with a 2oo3 configuration. If the safe failure fraction were greater than
99%, you could achieve SIL 3 with a 1oo1 configuration. There are a variety of systems from
different vendors that meet all of the above.
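
The PLC examples above are lookups in Table 1, which can be sketched as a small function. The names are hypothetical, and placing a safe failure fraction of exactly 99% in the top band is an assumption about the boundary.

```python
# Table 1 rows for type B devices: (lower bound of the SFF band,
# highest SIL for hardware fault tolerance 0, 1 and 2).
# None stands for "Not Allowed".
TABLE_1_TYPE_B = [
    (0.99, (3, 4, 4)),
    (0.90, (2, 3, 4)),
    (0.60, (1, 2, 3)),
    (0.00, (None, 1, 2)),
]


def max_sil_type_b(sff, hft):
    """Highest SIL allowed by Table 1 for a given SFF (0..1) and fault tolerance."""
    if not 0.0 <= sff <= 1.0:
        raise ValueError("SFF must be a fraction between 0 and 1")
    if hft not in (0, 1, 2):
        raise ValueError("Table 1 only covers a fault tolerance of 0, 1 or 2")
    for lower_bound, sils in TABLE_1_TYPE_B:
        if sff >= lower_bound:
            return sils[hft]


print(max_sil_type_b(0.75, 0))   # general purpose PLC, 1oo1 or 2oo2
print(max_sil_type_b(0.75, 1))   # general purpose PLC, 1oo2 or 2oo3
print(max_sil_type_b(0.95, 0))   # safety PLC, 1oo1
print(max_sil_type_b(0.95, 1))   # safety PLC, 2oo3
print(max_sil_type_b(0.995, 0))  # SFF above 99%, 1oo1
```

The table only caps the SIL that may be claimed on architectural grounds. The probability of failure on demand must still meet the target separately.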
The Real Impact of Redundancy

Few terms are as misunderstood as redundancy, and the same is true of its impact. Most people believe
that if one is good, two must be better, three must be better than that, and since a couple of
vendors offer quad, that must be the best. If marketers can do it with razor blades, why not safety
systems? Strange as it may sound, dual is not always better than single, triple is not always better
than dual, and what some offer as quad is not as quad as people might think. The impact of
redundancy depends upon the failure mode.
Single (1oo1)

Start with a base case of a non-redundant one-out-of-one (1oo1) system. Assume a safe
(nuisance trip) failure probability in one year of 0.04 (4%). You could think of it as 4 systems out
of 100 causing a nuisance trip within a year, or 1 system in 25 causing a nuisance trip, or a mean
time to fail safe (MTTFsafe) of 25 years (1/0.04).
Assume a dangerous failure probability in one year of 0.02 (2%). You could think of it as 2
systems out of 100 not responding in a year, or 1 in 50 not responding in a year, or a mean time
to fail dangerously (MTTFdanger) of 50 years (1/0.02). These numbers are just for comparison
purposes at this point.
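
The base case conversions can be sketched as follows. The helper name is hypothetical, and treating the mean time to fail as the reciprocal of the one-year probability is the same simplification used in the text.

```python
def mean_time_to_fail_years(annual_probability):
    """Approximate mean time to fail as 1 / (failure probability in one year)."""
    if not 0.0 < annual_probability <= 1.0:
        raise ValueError("annual probability must be in (0, 1]")
    return 1.0 / annual_probability


p_safe = 0.04       # 1oo1 nuisance trip probability in one year
p_dangerous = 0.02  # 1oo1 dangerous failure probability in one year

print(f"MTTFsafe   = {mean_time_to_fail_years(p_safe):.1f} years")
print(f"MTTFdanger = {mean_time_to_fail_years(p_dangerous):.1f} years")
```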
Dual (1oo2)

A one-out-of-two (1oo2) configuration has the outputs wired in series (assuming closed and
energized contacts, as shown in Figure 2). Either channel can shut the system down. Since there
is twice as much hardware, there are twice as many nuisance trips. Therefore, the 0.04 of a single
system doubles to 0.08. You could think of it as 8 systems out of 100 causing a nuisance trip
within a year, or 1 system in 12.5 causing a nuisance trip, or a MTTFsafe of 12.5 years.
A 1oo2 configuration would fail to function in the dangerous mode only if both channels were
to fail dangerously at the same time. If one were stuck, the other could still de-energize and shut
down the system. What is the probability of two simultaneous failures? It is the probability of the
single event squared (like two coins landing heads). So the probability of two channels failing at
the same time is remote (0.02 x 0.02 = 0.0004). You could think of it as 4 systems out of 10,000
not responding in a year, or 1 in 2,500 not responding in a year, or a MTTFdanger of 2,500
years.
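
The 1oo2 arithmetic above can be reproduced in a few lines. The variable names are illustrative, and the channels are assumed to fail independently. Doubling the nuisance trip probability is a simplification that holds for small numbers, so the sketch also prints the exact value for two independent channels.

```python
p_safe = 0.04       # single-channel nuisance trip probability in one year
p_dangerous = 0.02  # single-channel dangerous failure probability in one year

# 1oo2: either channel can trip the system, so nuisance trips roughly double.
p_safe_1oo2 = 2 * p_safe
# Exact probability that at least one of two independent channels trips.
p_safe_1oo2_exact = 1 - (1 - p_safe) ** 2

# 1oo2: both channels must fail dangerously for the function to be lost.
p_dangerous_1oo2 = p_dangerous ** 2

print(f"Nuisance trip:     {p_safe_1oo2:.4f} (exact {p_safe_1oo2_exact:.4f}), "
      f"MTTFsafe about {1 / p_safe_1oo2:.1f} years")
print(f"Dangerous failure: {p_dangerous_1oo2:.4f}, "
      f"MTTFdanger about {1 / p_dangerous_1oo2:.0f} years")
```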

In other words, a 1oo2 configuration is very safe (the probability of a dangerous system
failure is very small), but the system suffers twice as many nuisance trips as single, which is not
desirable from a lost production standpoint.

Figure 2: The Impact of Redundancy


Dual (2oo2)

A two-out-of-two (2oo2) configuration has the outputs wired in parallel. Both channels must
de-energize in order to perform a shutdown. This system would fail dangerously if a single
channel had a dangerous failure. Since this configuration has twice as much hardware compared
to single, it has twice as many dangerous failures. Therefore the 0.02 of a single configuration
doubles to 0.04. You could think of it as 4 systems out of 100 not responding in a year, or 1 in 25
not responding in a year, or a MTTFdanger of 25 years.
For this configuration to have a nuisance trip, both channels would have to suffer safe failures
at the same time. As before, the probability of two simultaneous failures is the probability of a
single event squared. Therefore, nuisance trip failures in this configuration are unlikely (0.04 x
0.04 = 0.0016). You could think of it as 16 systems out of 10,000 causing a nuisance trip within
a year, or 1 system in 625 causing a nuisance trip, or a MTTFsafe of 625 years.
So a 2oo2 configuration protects against nuisance trips (i.e., the probability of safe failures is
very small), but the system is less safe than single, which is not desirable from a safety
standpoint. This is not to imply that 2oo2 configurations are bad or should not be designed. If
the probability of failure on demand meets the overall safety integrity level requirements, then
the configuration is safe enough.
Triple (2oo3)

Triple Modular Redundant (TMR) concepts were developed in the 1970s through research
with NASA (National Aeronautics and Space Administration) and released as commercial products in
the early and mid 1980s. The reason for triplication back then was very simple: early computer
based systems had limited diagnostics. For example, if there were only two signals and they
disagreed, it was not always possible to determine which one was correct. Adding a third channel
solved the problem. One can assume that a channel in disagreement has an error and can simply
be outvoted by the other two. A two-out-of-three (2oo3) configuration is a majority voting
system. Whatever two or more channels say, that is what the system does.
What initially surprises people is that a 2oo3 system has a greater nuisance trip rate than a
2oo2 system, and a greater probability of a dangerous failure than a 1oo2 system. (Refer to
Figure 2 once again to compare the numbers.) Some people initially think, "Wait a minute, that
can't be!" Actually, it is intuitively obvious; you just have to think about it for a moment.
How many simultaneous failures does a 1oo2 configuration need in order to have a dangerous
failure? Two. How many simultaneous failures does a 2oo3 configuration need in order to have a
dangerous failure? Two. A triplicated configuration has more hardware, hence three times as
many dual failure combinations! (A+B, A+C, B+C)
How many simultaneous failures does a 2oo2 configuration need in order to suffer a nuisance
trip? Two. How many simultaneous failures does a 2oo3 configuration need in order to suffer a
nuisance trip? Two. Same thing, a triplicated configuration has three times as many dual failure
combinations. A triplicated configuration is actually somewhat of a tradeoff. Overall, it is good
in both modes, but not as good as the two different dual systems. However, a traditional dual
system is either good in one mode or the other, not both.
Note: all of the above comparisons ignore common cause (i.e., a single stressor or failure that
makes multiple components fail at the same time).
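
The counting argument above can be generalized into a short sketch that compares all four configurations. It assumes independent channels, no common cause, and the same simplification as the text: count the smallest failure combinations and multiply by the single-channel probability raised to the number of failures needed. The function name is hypothetical, and the printed values come from this arithmetic rather than from Figure 2.

```python
from math import comb

P_SAFE = 0.04       # single-channel nuisance trip probability in one year
P_DANGEROUS = 0.02  # single-channel dangerous failure probability in one year


def moon_probabilities(m, n, p_safe=P_SAFE, p_dangerous=P_DANGEROUS):
    """Return (nuisance trip, dangerous failure) probabilities for MooN voting."""
    safe_needed = m               # M channels failing safe cause a trip
    dangerous_needed = n - m + 1  # this many stuck channels block a trip
    p_trip = comb(n, safe_needed) * p_safe ** safe_needed
    p_fail = comb(n, dangerous_needed) * p_dangerous ** dangerous_needed
    return p_trip, p_fail


for m, n in [(1, 1), (1, 2), (2, 2), (2, 3)]:
    p_trip, p_fail = moon_probabilities(m, n)
    print(f"{m}oo{n}: nuisance trip {p_trip:.4f} (MTTFsafe {1 / p_trip:7.1f} y), "
          f"dangerous {p_fail:.4f} (MTTFdanger {1 / p_fail:7.1f} y)")
```

For 2oo3, the factor of three from `comb(3, 2)` is the "three times as many dual failure combinations" (A+B, A+C, B+C) described above.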
1oo2D

If you look carefully at the numbers in Figure 2 you can see that the 1oo2 configuration is
safer than 2oo3, and the 2oo2 configuration offers better nuisance trip performance than 2oo3. If
a dual configuration could be designed to provide the best performance of both dual systems,
such a system could outperform a triplicated system, at least in theory.
Improvements made in both hardware and software since the early 1980s mean that failures in
dual redundant computer-based systems can now generally be diagnosed well enough to tell
which of two channels is correct if they disagree. The industry refers to this relatively newer dual
design as 1oo2D. The "D" stands for diagnostics (usually tied to some form of secondary
outputs).
When Simplex is Not Really Simplex

So how can a simplex configuration get certified to SIL 3? Simple: diagnostics. As shown in
Table 1, if the safe failure fraction can be shown to exceed 99%, the system can meet SIL 3 with
a fault tolerance of 0. However, how does such a system achieve such a high level of
diagnostics? Simple: redundancy.
What?! Yes, redundancy. Just because a system has a single processor or I/O module does
not mean it is actually simplex. In fact, there are redundant circuits and/or processing, operating
in either a 1oo2 or 2oo2 configuration. One such system utilizes diverse software compiled from
the same source code and processed in a different manner. In effect, 1oo2 processing.
As shown earlier, while 1oo2 is safe (which is all the standards and certification agencies
cover), such systems result in more nuisance trips. Whenever the channels disagree (and they
will at some point), the system will shut down. Safe, but not very available. This may be fine for
a machinery application (which is what some of these systems were originally designed for)
where one of thirty punch presses going down will not have a major impact on overall
operations. However, applying such a system to a refinery, where downtime can cost over
$1,000,000/day, is a completely different matter. Uptime is often just as important as safety. The
terms used for this type of performance are availability and/or MTTFsp (Mean Time To Failure,
spurious).
When Quad is Not Really Quad

Some systems are promoted as quad (2oo4D) redundant. Prospective users are cautioned to
investigate how much of these systems are actually quad. These systems were originally
developed as 1oo2D. Due to limitations with their degraded run-time restrictions (i.e., how long
they were allowed to continue operating with a detected failure, such as 72 hours), they ran into
marketing and sales difficulties from their triplicated competitors who did not have such
stringent restrictions. The dual vendors' solution for this problem was to make the processors
more redundant (quad). I/O modules are still either simplex or dual redundant, not quad. The
quad systems have the same safety rating and nuisance trip performance as they did with a
1oo2D configuration; only the degraded runtime restriction changed. Yet "quad" sounds like
more and better than triplicated.
Centralized vs. Distributed Logic Solvers
Some systems (and most early triplicated systems) were designed for relatively large I/O
(input and output) applications. These systems are relatively expensive and difficult to justify for
low I/O count systems (e.g., < 100 I/O). In order to justify and implement such systems in the
past, many users combined what were formerly separate, smaller systems into one larger
centralized system. One monolithic centralized system may be economical and suitable for your
needs, especially for larger I/O counts. (The same could be said for most early DCS (Distributed
Control Systems).) However, a number of very economical systems are now available for small
I/O counts, making them cost effective to implement in the originally desired distributed manner.
Such a design can minimize the impact of a common cause failure (i.e., one that could affect the
entire facility if everything were handled in one system), as well as lower the cost of field wiring
(as the distributed systems can now be located much closer to the field devices). Such systems
can utilize standard networks to pass safety critical data back and forth between controllers.

Fault Tolerance of Field Devices


Hardware fault tolerance was defined earlier in this paper. Table 2 (below) appears in the
ISA 84 (IEC 61511) standard and describes the required level of fault tolerance for field devices for
different SILs. However, there are exceptions to every rule and the standard describes cases
where the fault tolerance numbers may be decreased by one in some circumstances, yet must be
increased by one in other circumstances.
The financial impact of redundant field devices is significant. The installed cost of a second
transmitter is approximately $10,000. The installed cost for a second valve is significantly
higher. Therefore, simply going from SIL 1 to SIL 2 may increase the cost of a single function
by $40,000. The additional costs of going from SIL 2 to SIL 3 are even greater.
Safety Integrity Level    Minimum Hardware Fault Tolerance Requirement
SIL 1                     0 (i.e., 1oo1, 2oo2)
SIL 2                     1 (i.e., 1oo2, 2oo3)
SIL 3                     2 (i.e., 1oo3)
SIL 4                     Special Requirements Apply - See IEC 61508

Table 2: Minimum Hardware Fault Tolerance Requirements for Field Devices
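
A minimal sketch of using Table 2 as a design check follows. It assumes the four rows correspond to SIL 1 through SIL 4 in order, and it does not model the cases where the standard allows the requirement to be decreased or requires it to be increased by one. The names are hypothetical.

```python
# Table 2: SIL -> minimum hardware fault tolerance for field devices.
# SIL 4 is left out on purpose: special requirements apply (see IEC 61508).
MIN_HFT_FIELD_DEVICES = {1: 0, 2: 1, 3: 2}


def meets_table_2(sil, m, n):
    """True if an MooN field device arrangement has enough fault tolerance."""
    if sil not in MIN_HFT_FIELD_DEVICES:
        raise ValueError("not covered by this sketch (SIL 4: see IEC 61508)")
    if not 1 <= m <= n:
        raise ValueError("need 1 <= M <= N")
    return (n - m) >= MIN_HFT_FIELD_DEVICES[sil]


print(meets_table_2(1, 1, 1))  # single transmitter for SIL 1
print(meets_table_2(2, 1, 1))  # single transmitter for SIL 2
print(meets_table_2(2, 1, 2))  # 1oo2 transmitters for SIL 2
print(meets_table_2(3, 2, 3))  # 2oo3 transmitters for SIL 3
```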

Logic & Field Device Design Options for Safety and Availability
Figures 3 through 5 graphically compare the nuisance trip (MTTFsp: Mean Time To Fail
spurious) and safety performance (RRF: Risk Reduction Factor, which is 1/PFD (Probability of
Failure on Demand)) of different sensor, logic and final element configurations. The exact
numbers for each are shown in Table 3. However, quantification does not reveal everything. SIL
2 is usually the highest requirement in most process industry applications. A simplex (1oo1)
system can be designed to meet this requirement, and is the lowest cost design. Yet a fault
tolerant SIL 3 rated logic solver can still offer advantages. For example, if you live in a town
with 100 people, and there is one central bank in town, and every family contributes $100,000,
you expect a certain level of security from that bank. Yet if you live in a city with 1,000,000
people, and there is only one central bank in town, and every family contributes $100,000, the
risk to each family is exactly the same, yet you would expect a different (higher) level of security
from the larger centralized bank. The concept is similar if you are combining 1,000 functions in
one logic solver (vs. only 10).

Figure 3: Performance of Sensors

Figure 4: Performance of Logic Solvers

Figure 5: Performance of Valves

Table 3: Performance of Different Sensor, Logic and Valve Configurations

Note that while a logic solver may have an RRF of 1,300, this does not mean it is suitable for use in SIL 3 applications (a Risk
Reduction Factor of 1,000 to 10,000 for an entire Safety Instrumented Function). Logic solvers are usually allocated 10-15% of the
total Probability of Failure on Demand for a function. (A function consists of a sensor, logic and final element.) A logic solver number
of 1,300 therefore means it is suitable for use in SIL 2. This is graphically represented in Figure 4.
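
The allocation reasoning in the note can be sketched as a budget check. Two things here are assumptions for illustration: the function is budgeted against the least demanding end of each SIL band (a PFD of 0.1 for SIL 1, 0.01 for SIL 2, and so on), and the logic solver may use at most the stated share of that budget. The function name is hypothetical.

```python
def logic_solver_sil(rrf_logic_solver, allocation=0.15):
    """Highest SIL whose PFD budget the logic solver fits within (0 if none).

    allocation is the share of the function's total PFD given to the logic solver.
    """
    if rrf_logic_solver <= 0 or not 0.0 < allocation <= 1.0:
        raise ValueError("RRF must be positive and allocation in (0, 1]")
    pfd_logic_solver = 1.0 / rrf_logic_solver  # RRF is 1/PFD
    best = 0
    for sil in (1, 2, 3, 4):
        function_pfd_limit = 10.0 ** -sil      # upper PFD limit of the SIL band
        if pfd_logic_solver <= allocation * function_pfd_limit:
            best = sil
    return best


print(logic_solver_sil(1300, allocation=0.15))
print(logic_solver_sil(1300, allocation=0.10))
```

With either end of the 10-15% range, an RRF of 1,300 fits the SIL 2 budget but not the SIL 3 budget, which matches the note above.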
Assumptions:

1. Switches: 30 year safe & dangerous MTTF, 0% diagnostic coverage
2. Transmitters: 60 year safe & dangerous MTTF, 30% diagnostic coverage simplex, 90% dual,
99% triplicated
3. Safety transmitters: 60 year safe & dangerous MTTF, 95% diagnostic coverage simplex, 99% dual
4. Relays: 300 year MTTF, 95% safe, 0% diagnostic coverage
5. Standard PLC: 20 year CPU MTTF, 60% safe, 85% diagnostic coverage; 50 year I/O MTTF,
75% safe, 20% coverage
6. Safety PLC: Similar, but 95% diagnostic coverage for 1oo1D, 99% for 1oo2D and 2oo3
7. Valves: 40 year safe & dangerous MTTF, 0% diagnostic coverage, 80% w/ partial stroke test
8. 1 year manual proof test
9. 10% common cause Beta value (for redundant configurations)
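
To give a feel for how assumptions like these turn into numbers, the sketch below applies the widely used simplified equation for a simplex (1oo1) device, PFDavg = (dangerous undetected failure rate) x (test interval) / 2. It reads "N year dangerous MTTF" as a dangerous failure rate of 1/N per year, and it ignores repair time for detected failures, common cause, and imperfect proof tests. It is not claimed to reproduce Table 3, and the names are hypothetical.

```python
def pfd_avg_simplex(mttf_dangerous_years, diagnostic_coverage,
                    test_interval_years=1.0):
    """Simplified 1oo1 PFDavg = lambda_DU * TI / 2."""
    if mttf_dangerous_years <= 0 or not 0.0 <= diagnostic_coverage <= 1.0:
        raise ValueError("MTTF must be positive and coverage between 0 and 1")
    lambda_dangerous = 1.0 / mttf_dangerous_years
    lambda_dangerous_undetected = lambda_dangerous * (1.0 - diagnostic_coverage)
    return lambda_dangerous_undetected * test_interval_years / 2.0


# (dangerous MTTF in years, diagnostic coverage) taken from the list above.
devices = {
    "Switch": (30, 0.0),
    "Transmitter": (60, 0.30),
    "Safety transmitter": (60, 0.95),
    "Valve (no partial stroke test)": (40, 0.0),
}

for name, (mttf, coverage) in devices.items():
    pfd = pfd_avg_simplex(mttf, coverage)  # 1 year manual proof test
    print(f"{name:32s} PFDavg = {pfd:.5f}, RRF = {1 / pfd:,.0f}")
```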
Conclusions
1. Redundancy is not the magic answer for safety; diagnostics is.
2. Diagnostics of valves is obtained through partial stroke testing.
3. A properly designed simplex (1oo1) system can meet SIL 2 for the lowest cost. (SIL 2 is
often the highest target selected in the process industries.)
4. Quantification of performance is a useful tool, but cannot account for everything. All models
are wrong; some are just less wrong than others.
References
1. IEC 61508, Functional safety of electrical/electronic/programmable electronic safety-related
systems, 2010 (2nd edition)
2. ANSI/ISA-84.00.01-2004 (IEC 61511 Mod), Functional Safety: Safety Instrumented
Systems for the Process Sector, 2004
3. Safety Shutdown Systems: Design, Analysis and Justification, 2nd edition, Gruhn & Cheddie,
ISA press, 2005
4. Things to consider when selecting a safety instrumented system, presented at the ISA Safety
Division Symposium, April 2009
5. New trends for safety instrumented systems, Hydrocarbon Processing, April 2009
6. Not all safety integrity level 3 safety systems are the same, Hydrocarbon Processing, March
2006
7. Get full value from partial stroking, Chemical Processing, March 2007
8. Safety instrumented system design: valuable lessons learned, Hydrocarbon Processing,
August 2000

Author Bio
Paul Gruhn is the Global Process Safety Consultant at ICS Triplex | Rockwell Automation in
Houston, Texas. Paul is an ISA Fellow, a member of the ISA 84 standard committee, the
developer and instructor of ISA courses on safety systems, and the primary author of the ISA
textbook on the subject. Paul developed the first commercial safety system modeling program
over 17 years ago. He has a B.S. degree in Mechanical Engineering from Illinois Institute of
Technology, is a licensed Professional Engineer (P.E.) in Texas, and an ISA 84 Expert.
