Documenti di Didattica
Documenti di Professioni
Documenti di Cultura
Automation Industrielle
Industrielle Automation
9.1
Dependability - Overview
Sret de fonctionnement - Vue densemble
Verlsslichkeit - bersicht
Prof. Dr. H. Kirrmann & Dr. B. Eschermann
ABB Research Center, Baden, Switzerland
2005-06-14 HK
2/40
Industrial Automation
3/40
Transportation
Nuclear Applications
Networks
Business
Medicine
Irradiation equipment,
Life support equipment
Industrial Processes
Industrial Automation
4/40
2002
2003
2004
2005
2006
Industrial Automation
5/40
fault
manifestation
function
repair
on
latency
Industrial Automation
off
on
outage
6/40
fault
may
cause
Industrial Automation
error
may
cause
7/40
failure
9.1 Dependability - Overview
Hierarchy of Faults/Failures
fault failure
fault failure
fault failure
system level
e.g. computer delivers wrong outputs
Industrial Automation
8/40
Types of Faults
Computers can be affected by two kinds of faults:
physical faults
design faults
Industrial Automation
9/40
Transient errors , firm errors, soft errors,.... do not use these terms
Industrial Automation
10/40
software
unsuccessful
recovery
15%
hardware
35%
20%
30%
handling
11/40
Basic concepts
Basic concepts
Industrial Automation
12/40
Availability
failure
good
failure
bad
up
repair
state
state
down
MTTF
good
bad
up
time
down
repair
up
MDT
up
time
Industrial Automation
13/40
Failure/Repair Cycle
Without repair:
system works
time
MTTF
MTTF: mean time to fail
With repair:
repair
system works
MUT
(MTTF)
repair
system works
down
MDT
(MTTR)
MUT
time
MDT
MTBF
MTTR: mean time to repair ~ MDT (mean down time)
MTBF: mean time between failures, (*n'est pas "moyenne des temps de bon fonctionnement )
MTBF = MTTF + MTTR
if MTTR MTTF: MTBF MTTF
Industrial Automation
14/40
Redundancy
Increasing safety or availability requires the introduction of redundancy (resources which
are not needed if there were no failures).
Faults are detected by introducing a check redundancy.
Operation is continued thanks to operational redundancy (can do the same task)
Increasing reliability and maintenance quality increases both safety and availability
switch to red:
no accident risk (safe)
decreased traffic performance
detected
fault
(dont know
about failure)
Industrial Automation
switch to green:
accident risk
traffic continues (available)
15/40
failure
up
impaired
down
up
repair
2nd failure
When redundancy is available, the system does not fail until redundancy is
exhausted (or redundancy switchover is unsuccessful)
Industrial Automation
16/40
Maintenance
"The combination of all technical and administrative actions, including supervision actions intended to
retain a component in, or restore it to, a state in which it can perform its required function"
17/40
Differed maintenance
failure
degraded
state
up
up
unscheduled
maintenance
down
preventive
maintenance
up
down
MTTR
MTTFcomp
MTTR
MTBR
Industrial Automation
18/40
Preventive maintenance
In principle, preventive maintenance restores the initially good state at regular intervals.
This assumes that the coverage of the tests is 100% and that no uncorrected aging
takes place.
Industrial Automation
19/40
Safety
we distinguish:
hazards caused by the presence of control system itself:
explosion-proof design of measurement and control equipment
(e.g. Ex-proof devices, see "Instrumentation")
Industrial Automation
20/40
Safety
The probability that the system does not behave in a way considered as dangerous.
Expressed by the probability that the system does not enter a state defined as dangerous
accidental event
in normal operation
dangerous failure
failure
up
repair
safe (down)
states
dangerous
states
damage
no way back
difficulty of defining which states are dangerous level of damage ? acceptable risk ?
Industrial Automation
21/40
Safe States
Safe state
exists: sensitive system
does not exist: critical system
Sensitive systems
railway: train stops, all signals red (but: fire in tunnel?)
nuclear power station: switch off chain reaction by removing moderator
(may depend on how reactor is constructed)
Critical systems
military airplanes: only possible to fly with computer control system
(plane inherently instable)
Industrial Automation
22/40
Types of Redundancy
Structural redundancy (hardware):
Extend system with additional components that are not necessary to achieve the required
functionality (e.g. overdimension wire gauge, use 2-out-of-3 computers)
Information redundancy:
Time redundancy:
Industrial Automation
23/40
Availability
availability depends on a
functional redundancy (which can
take over the function) and on the
quality of maintenance
Safety and Availability are often contradictory (completely safe systems are
unavailable) since they share redundancy.
Industrial Automation
24/40
losses (US$)
4
damages
protection
trip
protection
does not trip
1
T grace T detect
Industrial Automation
stand-still
costs
T trip
T damage
25/40
time
26/40
Industrial Automation
27/40
Dependability
(Sret de fonctionnement, Verlsslichkeit)
goals
reliability
availability
maintainability
safety
security
by error passivation
fault isolation
reconfiguration
(on-line repair)
Industrial Automation
achieved by
fault avoidance
fault detection/diagnosis
fault tolerance
(= error avoidance)
guaranteed by
quantitative analysis
qualitative analysis
by error recovery
by error compensation
forward recovery
fault masking
backward recovery error correction
28/40
29/40
Industrial Automation
30/40
Secondary
System
C
bus
Distributed
Computer System
C
Control, Protection
Monitoring,
Diagnosis
Process
Primary
System
Environment
Availability/safety depends on output of computer system and process/environment.
Industrial Automation
31/40
persistency breach
integrity breach
32/40
Safety Threats
depending on the controlled process,
safety can be threatened by failures of the control system:
integrity breach
persistency breach
Requirement:
fail-silent (fail-safe, fail-stop) computer
"rather stop than fail"
Requirement:
fail-operate computer
"rather some wrong data than none"
Safety depends on the tolerance of the process against failure of the control system
Industrial Automation
33/40
discrete systems
F(nT)
n
time
Industrial Automation
34/40
secondary
system
primary
system
integrity
availability
substation protection
railway signalling
airplane control
persistency
Industrial Automation
safety
35/40
Protection system:
Not acting normally,
forces safe state (trip) if necessary
Maximal failure rate given in failures per
demand.
Control
+
Process
Display
Protection
Measurement
Industrial Automation
36/40
Process state
power plant
line
protection
substation
bay
busbar
protection
busbar
substation
to consumers
Two kinds of malfunctions:
An underfunction (not working when it should) of a protection system is a safety threat
An overfunction (working when it should not) of a protection system is an availability threat
Industrial Automation
37/40
Findings
Reliability and fault tolerance must be considered early in the development process,
they can hardly be increased afterwards.
Reliability is closely related to the concept of quality, its root are laid in the design process,
starting with the requirement specs, and accompanying through all its lifetime.
Industrial Automation
38/40
References
H. Nussbaumer: Informatique industrielle IV; PPUR.
J.-C. Laprie (ed.): Dependable computing and fault tolerant systems; Springer.
J.-C. Laprie (ed.): Guide de la sret de fonctionnement; Cpadus.
D. Siewiorek, R. Swarz: The theory and practice of reliable system design; Digital
Press.
T. Anderson, P. Lee: Fault tolerance - Principles and practice; Prentice-Hall.
A. Birolini: Quality and reliability of technical systems; Springer.
M. Lyu (ed.): Software fault tolerance: Wiley.
Journals: IEEE Transactions on Reliability, IEEE Transactions on Computers
Conferences: International Conference on Dependable Systems and Networks,
European Dependable Computing Conference
Industrial Automation
39/40
Assessment
which kinds of fault exist and how are they distinguished
explain the difference between reliability, availability, safety in terms of a state diagram
explain the trade-off between availability and safety
what is the difference between safety and security
explain the terms MTTF, MTTR, MTBF, MTBR
how does a protection system differ from a control system when considering failures ?
which forms of redundancy exist for computers ?
how does the type of plant influence its behaviour towards faults ?
Industrial Automation
40/40
Industrial Automation
Automation Industrielle
Industrielle Automation
9.2
Dependability - Evaluation
Estimation de la fiabilit
Verlsslichkeitsabschtzung
Prof. Dr. H. Kirrmann
ABB Research Center, Baden, Switzerland
2006-06-14, HK
Dependability Evaluation
This part of the course applies to any system that may fail.
Industrial Automation
2/72
Industrial Automation
3/72
Reliability
Reliability = probability that a mission is executed successfully
(definition of success? : a question of satisfaction)
Reliability depends on:
duration (tant va la cruche leau., "der Krug geht zum Brunnen bis er bricht))
environment: temperature, vibrations, radiations, etc...
R(t)
1,0
lim R(t) = 0
t
25
25
vehicle
85
laboratory
40
85
time
4/72
remaining
good bulbs
R(t)
time
l
aging
infancy
mature
t
t + Dt
time
Reliability R(t): number of good bulbs remaining at time t divided by initial number of bulbs
Failure rate l(t): number of bulbs that failed in interval t, t+Dt, divided by number of remaining bulbs
Industrial Automation
5/72
lim R(t) = 0
t
Failure rate l(t) = probability that a (still good) element fails during the next time unit dt.
dR(t) / dt
l(t) =
R(t)
R(t)
and: R(t) = e
l(x) dx
MTTF =
R(t) dt
Industrial Automation
6/72
l(t)
bathtub
aging
childhood
(burn-in)
by discrete expression
mature
R(t)
1
0.8
R (t) = e -lt
0.6
R(t) l= bathtub
0.4
0.2
MTTF =
MTTF
Industrial Automation
7/72
e -lt dt =
1
114'000
years
Element
Rating
resistor
capacitor
capacitor
processor
RAM
Flash
FPGA
PLC
digital I/O
analog I/O
battery
VLSI
soldering
0.25 W
(dry) 100 nF
(elect.) 100 mF
486
4MB
4MB
5000 gates
compact
32 points
8 points
per element
per package
per point
failure rate
0.1 fit
0.5 fit
10 fit
500 fit
1 fit
12 fit
80 fit
6500 fit
2000 fit
1000 fit
400 fit
100 fit
0.01 fit
These figures can be obtained from catalogues such as MIL Standard 217F or from the
manufacturers data sheets.
Warning: Design failures outweigh hardware failures for small series
Industrial Automation
8/72
Industrial Automation
9/72
10/72
11/72
Industrial Automation
12/72
R(t)
reliability
of redundant
1
element
Cold redundancy (cold standby): the reserve is switched off and has zero failure rate
reliability
of reserve
element
R(t)
1
Industrial Automation
t
13/72
Industrial Automation
14/72
R total = R1 * R2 * .. * Rn = P (Ri)
I=1
Assuming a constant failure rate l allows to calculate easily the failure rate of a system
by summing the failure rates of the individual components.
R NooN = e -Sli t
This is the base for the calculation of the failure rate of systems (MIL-STD-217F)
Industrial Automation
15/72
lcontrol = 0.00005
power
electronics
h-1
encoder
motor
motor+encoder
controller
16/72
Industrial Automation
17/72
R1
R1 ok
ok
R2 ok ok
R2
R1 good
R2 good
1-R1
R1oo2 =
R1 good
R2 down
R1R2 + R1 (1-R2) +
R1 down
R2 good
(1-R1) R2
R1
R1oo2 = 1 - (1-R2)(1-R1)
R2
with R1 = R2 = R:
R1oo2 = 2 R - R2
1-R2
with R = e -lt
R1oo2 = 2 e -lt - e -2lt
Industrial Automation
18/72
apply:
R1oo1 = e -lt
R2oo2 = e -2lt
assuming there is no common mode of failure (bad fuel or oil, hail, birds,)
Industrial Automation
19/72
R1
R2
work
fail
ok ok ok
ok
ok ok
ok
ok
ok
ok ok
ok
R3
R2
R1
R3
R1 good
R2 good
R3 good
2/3
R1 bad
R2 good
R3 good
R1 good
R2 bad
R3 good
R1 good
R2 good
R3 bad
20/72
Industrial Automation
21/72
Example with
N=4
R4
R3
R2
R1
one of N fail
no fail
RKooN =
RN
+(
N
N-1
1 ) (1-R) R +
two of N fail
N
( 2 ) (1-R)2RN-2 +...+
K of N fail
all fail
N
( K ) (1-R)KRN-K +....+ (1-R)N = 1
N of N
N + (N-1) of N
N + (N-1) + (N-2) of N
RKooN =
Industrial Automation
i=0
N
)
Ri (1 R)N-i
i
22/72
Comparison chart
1.000
1oo4
1oo1
2oo4
0.400
1oo2
3oo4
1oo1
1.
8
1.
2
0.
8
0.
4
0.
2
0.000
0.
6
8oo12
1.
6
2oo3
1.
4
0.200
0.600
0.800
t
Industrial Automation
23/72
Summary
Assumes: all units have identical failure rates and comparison/voting hardware does not fail.
1oo2 (duplication and
error detection)
1oo1 (nonredundant)
R1oo1 = R
R1oo2 = 2R R2
RKooN =
Industrial Automation
S
i=0
N ) Ri (1 R)N-i
i
24/72
What is better ?
25/72
with redundancy
ARL
R2
R1
simplex
MT1
MT2
MIF:
RIF:
time
26/72
l = 0.001
1
0.8
1oo2
but:
0.6
8
0.4
1oo1
MTTF1oo2 = (2
-lt -
-2lt)
dt =
0.2
3
2l
Industrial Automation
27/72
R3
R2
R1
2/3
(3e
MTTF2oo3 =
-2lt -
2e
-3lt)
dt
5
6l
1
0.8
0.6
1oo1
1oo2
0.4
Industrial Automation
0.2
2oo3
0
28/72
R1
R1
R1
R2
output
Industrial Automation
29/72
Industrial Automation
30/72
Repair
Industrial Automation
31/72
Preventive maintenance
R(t)
1
MTBPM
Mean Time between preventive maintenance
Preventive maintenance reduces the probability of failure, but does not prevent it.
in systems with wear, preventive maintenance prevents aging (e.g. replace oil, filters)
Preventive maintenance is a regenerative process (maintained parts as good as new)
Industrial Automation
32/72
Considering Repair
Industrial Automation
33/72
Industrial Automation
34/72
Markov
Define distinct states of the system depending on fault-relevant events
States must be
mutually exclusive
collectively exhaustive
protection failure
normal
OK
s
threat to plant
(not dangerous)
protection
not working
PD
repair
lightning strikes
DG
danger
pi(t) = 1
all states
35/72
State 2
l
P2
dP1/dt = P2 l P1,
dP2/dt = l P1 P2
Note: there also exist discrete Markov Chains, in which the time takes discrete steps t = 0, 1, 2,
etc., with similar definition
Industrial Automation
36/72
lk1
inflow
outflow
dpi(t) = lk pk(t) dt
lk1
pk1(t)
Pi
P1
li pi(t)
lk2
lk3
P4
P3
lk2
lk3
li
State Sk1
pump
pi(t)
State Si
li pi(t)
37/72
good
arbitrary transitions:
l(t)
fail
good
dp0 = - l p0
dt
dp1 = + l p0
dt
R(t) = p0 = e -lt
fail
down
fail1
all
all
ok
up2
non-terminal states
Industrial Automation
up1
fail2
terminal states
38/72
good
Reliability
Availability
l(t)
failure rate
bad
up
failure rate
down
repair rate
state
state
MTTF
good
bad
up
time
repair
down
39/72
up
up
MDT
time
Industrial Automation
40/72
Markov:
2l
fail
l = constant
Solution:
initial conditions:
p0 (0) = 1 (initially good)
dp0 = - 2l p0
dt
p1 (0) = 0
dp1 = + 2l p0 - lp1
dt
dp2 =
+ lp1
dt
p2 (0) = 0
p0 (t) = e -2lt
p1 (t) = 2 e -lt - 2 e -2lt
R(t) = p0 (t) + p1 (t) = 2 e -lt - e -2lt
Industrial Automation
41/72
2l
0
absorbing state
m
repair rate
Linear
Differential
Equations:
dp0 = - 2l p0
dt
+ m p1
dp1 = + 2l p0 - (l+m) p1
dt
dp2 =
dt
+ l p1
initial conditions:
p0 (0) = 1 (initially good)
p1 (0) = 0
p2 (0) = 0
Ultimately , the absorbing states will be filled, the non-absorbing will be empty.
Industrial Automation
42/72
good ln
dp0 = - 2l p0
0
b
lb
dt
lb
dp1 = + l p0 - (l+m) p1
3
ln
dt
dp2 = + l p0
fail
dt
is equivalent to:
1+2
- (l+m) p2
dp3 =
+ l p1
dp0 = - 2l p0
+ m p1 + m p2
dt
dt
2l
+ m p1 + m p2
+ l p2
dt
dp3 =
dt
with mn = mb ; ln = lb
+ l (p1+p2)
it is easier to model with a repair team for each failed unit (no serialisation of repair)
Industrial Automation
43/72
we
wedo
donot
not
consider
considershort
short
mission
missiontime
time
with:
W=
l2 + 6l + 2
l = 0.01
1
m = 10 h-1
0.8
0.6
m = 1.0 h-1
1oo2 no repair
repair
repairdoes
doesnot
not
interrupt
interrupt
mission
mission
0.4
m = 0.1 h-1
0.2
Time in hours
R(t) accurate, but not very helpful - MTTF is a better index for long mission time
Industrial Automation
44/72
absorbing states j
R(t)
non-absorbing states i
1.0000
0.8000
0.6000
MTTF =
Spi(t) dt
0.4000
0
0.2000
0.0000
0
Industrial Automation
10
45/72
12
14
time
9.2 Dependability - Evaluation
+ mP1(s)
sP2(s) - 0 =
lim
t
-1 = - 2 l P0
0 = + 2l P0
MTTF = P0 + P1 = (m + l) +
Industrial Automation
2l2
46/72
+ mP1
- (l+m)P1
1 = m/l + 3
2l
l
1
0
0
..
= M Pna
Industrial Automation
47/72
2lc
l (1-c)
l (1-c)
1
l
2
l
(absorbing state)
1 = - 2l P0
+ mP1
0 = + 2lc P0 - (l+m)P1
0 = + l(1-c) P0
- lP2
Industrial Automation
MTTF =
48/72
2l (1-c)
2lc
m
absorbing state
applying Markov:
-1 = - 2l P0
+ mP1
0 = + 2lc P0 - (l+m)P1
2
MTTF =
0 = + 2l(1-c) P0
+ lP1
(1+2c) + m/l
2 ( l + m (1-c) )
The results are nearly the same as with the previous four-state model
Industrial Automation
49/72
MTTF (c)
600000
500000
400000
300000
200000
100000
1.
0
0. 000
99 0
0. 99 0
99 99
0. 99
99 90
0. 99
99 00
0. 90
99 00
0. 00
90 00
0. 00
90 00
0. 00
6 0
0. 000 0
00 0
00 0
00
coverage
Industrial Automation
)
lim MTTF = ( +
2l
l 2
m 0
50/72
1
lim MTTF =
l/m 0 l (1-c)
9.2 Dependability - Evaluation
x
coverage is assumed to be the probability that
self-check detects an error in the controller.
when self-check detects an
error, it passivates the controller
(output is disconnected)
and the other controller takes control.
control
a1
selfcheck
selfcheck
control
a2
Industrial Automation
51/72
log (MTTF)
16.00
1 second
14.00
6 minutes
12.00
10.00
1 Mio years
or
oronce
onceper
peryear
yearon
onaa
million
millionvehicles
vehicles
30 minutes
8.00
6.00
0.1% undetected
4.00
conclusion:
the repair interval does not matter when
coverage is poor
Industrial Automation
2.00
0.00
1
poor
52/72
uncoverage
10
excellent
OK
PD
s
threat to plant
(not dangerous)
DG
protection down
(detection and repair)
threat to plant
danger
Industrial Automation
53/72
l1
t = test rate (e.g. 1/6 months)
m = repair rate (e.g. 1/8 hours)
0
Normal
repaired
3
test rate
repaired
l3
unavailable
states
lurking overfunction
(unwanted trip at next attack)
plant s
threat
l2
Plant down
Double fault
repaired
plant threat
s
protection failed
6
by underfunction
(fail-to-trip)
detected
error
5
t
test rate
lurking
underfunction
s2 (unlikely)
Danger
since there exist back-up protection systems, utilities are more concerned by non-productive states
Industrial Automation
54/72
Industrial Automation
55/72
Stationary availability A =
MTTF
MTTF + MTTR
Industrial Automation
56/72
Availability
l
0
down
up
up
A = availability = lim
up
up
up
down
? up times
? (up times + down times)
57/72
substation automation
telecom power supply
Industrial Automation
> 99,95%
5 * 10-7
58/72
up states i
Availability =
down
up
Spi(t = )
Industrial Automation
Unavailability =
59/72
down states j
(non-absorbing)
Spj (t = oo)
Computation of Availabilities
Divide states into two sets: UP states (system works) and DOWN states (system
doesnt work).
The stationary availability is given by the formula A = MTTF / (MTTF + MTTR).
The MTTR is given by the inverse of the repair rate, MTTR = 1/, to get the system
back from the set of down states to the set of up states.
The MTTF is given by the following set of equations:
MTTF(i) = 1/ri + S (rij / ri) MTTF(j)
where i, j denote states, rij is the transition rate from state i to state j, rii = 0, the sum is
taken over all states j which belong to the set of UP states and
ri = S rij
and MTTF = MTTF(i) if i is the initial state in which the system starts
Industrial Automation
60/72
system
OK
2l
UP
latent
fault
no
function
DOWN
2
61/72
2l
0
1
m
down state
(but not absorbing)
2m
A=
1+
stationary state:
lim dp = dp = dp = 0
0
1
2
t 8 dt
dt
dt
unavailability U = (1 - A) =
2l2
m2 + 2l
Industrial Automation
62/72
lim U<<1
2l2
m2 + 2l
Availability calculation
1) Set up differential equations for all states
2) Identify up and down states (no absorbing states allowed !)
3) Remove one state equation save one (arbitrary, for numerical reasons take unlikely state)
4) Add as first equation the precondition: 1 = ? p (all states)
1
0
0
..
= M Pall
Industrial Automation
63/72
9.2.6 Examples
9.2.1 Reliability definitions
9.2.2 Reliability of series and parallel systems
9.2.3 Considering repair
9.2.4 Markov models
9.2.5 Availability evaluation with Markov
9.2.6 Examples
Industrial Automation
64/72
reserve
member
N
member
R
member
N
member
R
member
N
member
R
MVB
I/O system
Assumption: each unit has a back-up unit which is switched on when the on-line unit fails
The error detection coverage c of each unit is imperfect
The switchover is not always bumpless - when the back-up unit is not correctly actualized, the main
switch trips and the locomotive is stuck on the track
What is the probability of the locomotive to be stuck on track ?
Industrial Automation
65/72
all OK
member N failure
detected
(1-s-b)
s
S0
cl
(1-c) l
p
train stop
and
reboot
member R
failure
detected
takeover
unsuccessful
l member N fails
member R l
fails
undetected
Industrial Automation
member R
on-line
l
member R fails
stuck on track
member N fails
l = 10-4
m = 0.1
c = 0.9
b = 0.9
s = 0.01
r = 10
p = 1/8765
66/72
32%
61%
Stuck:
2nd failure before repair
Stuck: after reboot
0.0009%
0.00045%
OK after
reboot
67/72
current sensor
circuit breaker
IEC 61508 characterizes a protection device by its Probability to Fail on Demand (PFD):
PFD = (1 - availability of the non-faulty system) (State 0)
overfunction
underfunction
good
(1-u)l
mR
ul
mR
plant down
Industrial Automation
u = probability of underfunction
4
plant damaged
68/72
l(1-u)
mR
ucl
l: protection failure
danger
overfunction
4
u(1-c)l
mT
normal
PFD = 1 - P0 = 1 -
with:
1
1 + l u (1-c) + l u c
R
T
l = 10-7 h-1
MTTR = 8 hours -> R =0.125 h-1
Test Period = 3 months -> T =2/4380
coverage = 90%
Industrial Automation
lu(
(1-c)
T
c
R
S8
S6
S2
self-check
overfunction
l1
PLANT DOWN
DOUBLE FAULT
s1
s1
l1 (1-c)
l2
S10
le2
l2 c
l2 (1-c)
S1
S9
dT
dM
l1 c
S5
dT
dM
S11
S3
s2
self-check
underfunction
S4
(1-c)
l3 c l3
le1
P1
l3
s2
dM
s2
S7
DANGER
s2
P8, P9: error
detection failed
Industrial Automation
70/72
Reliability
down
all
all
ok
up
up
fail
down
fail
all
all
ok
fail
up
good
Industrial Automation
up
down
down
down
71/72
Industrial Automation
72/72
Industrial Automation
Automation Industrielle
Industrielle Automation
2001 May 3, BE
9.3.2
9.3.3
2001 May 3, BE
usual behavior
of loco driver
main signal
advance signal
2001 May 3, BE
on-board system
vital computer
speed
track-side devices
brake
2001 May 3, BE
Eurocab: Motivation
TODAY
TOMORROW
2001 May 3, BE
2001 May 3, BE
companyspecific
(competition)
ManMachine
Interface
European
Vital
Computer
Data
Logger
Eurocab
bus
companyspecific
(competition)
standard
Speed and
Distance
Measurement
2001 May 3, BE
Train
Interface
Specific
Interface 1
Specific
Interface n
data
safety
protocol
vital process
data
safety
protocol
data
non-vital
equipment
non-vital
process
non-vital
protocol
data
bus
protocol
(non-vital)
bus
protocol
(non-vital)
bus
protocol
(non-vital)
data
serial bus
EPFL - Industrial Automation
2001 May 3, BE
bus system
(untrusted)
vital process
vital equipment
(trusted)
CRC
time
stamp
source
bus
expected
safety ID
sink
EPFL - Industrial Automation
2001 May 3, BE
clocks
have to be
synchronised
9.3 Eurocab Case Study
source identifier
bus
master
BUS
slaves
sink
source
sink
2nd phase:
Slave
Response
value
bus
master
slaves
2001 May 3, BE
BUS
source
sink
10
sink
9.3 Eurocab Case Study
example value
comment
safety ID
0F11
name of telegram
measured_speed
for identification
length
256 bits
periodic/sporadic
periodic
broadcast/point-to-point
broadcast
source function
SDM
sink function
any
grace period
- 1 ms, + 257 ms
etc.
...
2001 May 3, BE
11
safety ID
time stamp
data
32
16
CRC
32
only have to be
have to be transmitted on the bus (explicitly)
checked
(implicitly via CRC)
EPFL - Industrial Automation
2001 May 3, BE
12
Creation
Checking
2001 May 3, BE
13
error in ...
... content
Safety CRC
Safety CRC
... address
Implicit Safety ID
Safety ID
... time
... sequence
2001 May 3, BE
14
Industrial Automation
Automation Industrielle
Industrielle Automation
Dependable Architectures
9.4 Architectures sres de fonctionnement
Verlssliche Architekturen
Prof. Dr. H. Kirrmann
ABB Research Center, Baden, Switzerland
2005-06-14, HK
1/52
9.4.2
Fault-Tolerant Structures
9.4.3
9.4.4
9.4.5
Industrial Automation
2/52
processor
E
D
Error Detection
E
D
processor
processor
active
E
D
workby
change-over
logic
off-switch
outputs
output
inputs
a) Integer
b) Persistent
" rather wrong than nothing "
"fail-operate"
processor
processor
2/3
2/3 voter
outputs
c) Integer & persistent
error masking
Industrial Automation
3/52
9.4.1
9.4.2
Fault-Tolerant Structures
9.4.3
9.4.4
9.4.5
Industrial Automation
4/52
Error Detection
Industrial Automation
5/52
Error detection depends on the type of component, its error rate and its complexity.
Data transmission lines
medium to high error rate, memoryless
parity, CRC, watchdog
Regular memory elements
medium error rate, large storage
parity, Hamming codes, CRC on disk.
Processors and controllers
low error rate, high complexity
duplication and comparison, coded logic
Supporting elements
high error rate, high diversity
mechanical integrity, power supply supervision, watchdogs,...
Industrial Automation
6/52
Industrial Automation
7/52
absolute test
relative test
on-line
watchdog (time-out)
control flow checking
error-detecting code (CRC, etc.)
off-line
comparison with
precomputed test result
(fixed inputs)
Industrial Automation
8/52
safe
switch
supply
voltage
reset
time
> k ms
inhibit
The application processor periodically resets the watchdog timer. If it fails to do it, the
watchdog processor will shut down and restart the processor processor.
Industrial Automation
9/52
worker
sync
checker
comparator
switch
safe output
Conditions:
Variant:
Industrial Automation
10/52
Integer processors
Integer processors are often called fail-safe processors, but they are only safe
when used in plants where a safe state can be reached by passive means.
This requires a high coverage, that is usually achieved by duplication and comparison.
For operation, both computers must be operational, this is a 2oo2 structure
(2 out of 2).
Industrial Automation
11/52
E
D
E
D
E
D
E
D I/O
parallel
backplane bus
(self-test by
parity)
E MEM
D
stable storage
(with EDC)
changeover logic
to safe state
serial bus
(CRC)
Vs
safe value
12/52
The dual channel should be extended as far as possible into the plant
worker
checker
worker
checker
controller
E
D
Industrial Automation
13/52
9.4.1
9.4.2
Fault-Tolerant Structures
9.4.3
9.4.4
9.4.5
Industrial Automation
14/52
Industrial Automation
15/52
sync
sync
16/52
primary unit
standby unit
switch
output
What are standby units used for?
only as redundancy
for other functions (can get lower priority in case of primary unit failure)
better performance (graceful degradation in case of failure)
Industrial Automation
17/52
Industrial Automation
reconfiguration unit:
the pilot judges which
FCDM to trust in case of
discrepancy
18/52
Hot standby
sync
on-line
Cold standby
sync
workby
on-line
standby
19/52
Standby is no operational
Error detection needed.
Long switchover period
with loss of state info.
No aging of reserve unit.
input
synchronization
synchronization
Worker
Matching
E Worker
D
CoWorker
Output
Matching
Co- E
Worker D
Output
commutator
comparator
disjunctor
output
output
INTEGER
2oo2
PERSISTENT
1oo2D
Industrial Automation
20/52
Hybrid Redundancy
Mixture of workby (static redundancy) and standby (dynamic redundancy).
workby
workby
workby
standby
standby
Reconfiguration
(self-purging
redundancy)
workby
voter
failed
workby
workby
standby
voter
Industrial Automation
21/52
General designation
NooK: N out-of K
1oo1:
1oo2:
2oo2:
1oo2D:
2oo3:
2oo4:
simplex system
duplicated system, one unit is sufficient to perform the function
duplicated system, both units must be operational (fail-safe)
duplicated system with self-check error detection (fail-operational)
triple modular redundancy: 2 out of three must be operational (masking)
masking architecture
Industrial Automation
22/52
9.4.3 Workby
9.4.1
9.4.2
Fault-Tolerant Structures
9.4.3
9.4.4
9.4.5
Industrial Automation
23/52
identical,
deterministic,
synchronized
state machines
24/52
input A
computer
A
redundant
matching
input B
computer
B
25/52
Matching
The matched value depends on the semantics of the variables.
Matching needs knowledge of the dynamic and physical behaviour.
Matching stretches over several consecutive values of the variables.
jitter
A
Binary variables:
Analog variables:
time
B
time
Industrial Automation
26/52
computer
B
computer
C
27/52
attack
attack
attack
retreat
B
attack
attack
retreat
attack
attack
retreat C
attack
28/52
Industrial Automation
29/52
instruction number
CPU 1
101
102
103
synchronized
CPU (same clock)
104
105
407
CPU 2
101
102
103
just after
104
106
408
101
407
101
408
time
30/52
clock
D
Q
- 100 ns
Analogy
golf ball
E = kynetic energy
E < Ecrit
E ~ Ecrit
E > Ecrit
matching must rely on the exchange of defined signals, common signals are
no suitable mean for reaching a consensus.
Industrial Automation
31/52
read inputs
read inputs
build
consensus
build
consensus
build
consensus
compute
compute
compute
synchro
outputs
synchro
outputs
synchro
outputs
The last decision on the correct value must be made in the process itself.
Industrial Automation
32/52
motors
control
surfaces
Damaged Unit
power
electronics
and control
Industrial Automation
33/52
9.4.4 Standby
rserve asynchrone, unbeteiligter Ersatz
9.4.1
9.4.2
Fault-Tolerant Structures
9.4.3
9.4.4
9.4.5
Industrial Automation
34/52
Industrial Automation
35/52
Recovery
It is not sufficient that a back-up unit exists, it must be loaded with the same
data and be in a state as near possible to the state of the on-line unit.
The actualisation of the back-up assumes that computers are deterministic
and identical machines.
Given two identical machines, initially in the same state, the states of these
machines will follow each other provided they always act on the same inputs,
received in the same sequence.
workby
both machines are fed with the
same, synchronized inputs
and modify their states based on
these inputs only in the same
manner
standby
the on-line unit regularly copies its
state and its inputs to the back-up.
36/52
a) STANDBY
ED = Error Detection
INPUT
error
detection
INPUT
INPUT'
track I/O
INPUT"
SYNC
save
E
D
Back-Up
(stand-by)
On-line
on-line
restore
E
D
E
D
Back-Up
(work-by)
On-Line
E
D
match
on-line
back-up
back-up
OUTPUT
SWITCHOVER
UNIT
OUTPUT
Checkpointing
Saving enough information to reconstruct a previous, known-good state.
To limit the data to save (checkpoint duration, distance between checkpoints),
only the parts of the state modified since last checkpoint are saved.
full
delta
back-up back-up CP
CP
CP
CP
CP
CP
ON-LINE
On-Line
stable
storage
(or stand-by's
memory)
recover
reconstruct
known-good
state
CP
CP
CP
Stand-By
reconstruct initial state
apply deltas to full back-up
38/52
Checkpointing
The amount of data to save to reconstruct a previous known-good state
depend on the instant the checkpoint is taken.
processor
microregister
registers
cache
RAM
disk
world (cannot be rolled back !)
Recovery depends on which parts of the state are trusted after a crash = stable
storage , and which are not (volatile storage)
and on which parts are relevant.
Industrial Automation
39/52
Checkpointing Strategy
Industrial Automation
40/52
Logging
For faster recovery and closer checkpointing, the stand-by monitors the
input-output interactions of the on-line unit in a log (fifo).
After reconstructing a know-good state, the stand-by resumes computation and applies
the log of interactions to it:
Checkpoint
full back-up
Checkpoint
On-line
external world
Checkpoint
Stand-by
log entries
reconstruct
known-good state
replay
log
regular
operation
It takes its input data from the log instead of reading them directly.
It suppresses outputs if they are already in the log (counts them)
It resumes normal computations when the log is void.
Industrial Automation
41/52
Domino Effect
As long as a failed unit does not communicate with the outer world, there is no harm.
The failure of a unit can oblige to roll back another unit which did not fail,because it acted
on incorrect data.
This roll-back can propagate under evil circumstances ad infinitum (Domino-effect)
This effect can be easily prevented by placing the checkpoints in function of
communication - each communication point should be a checkpoint.
6
Process 1
Process 2
5
Process 3
4
Industrial Automation
42/52
2/3 voting
lock-step
synchronization
1/2 workby
workby/
standby
common
memory
local
network
10 ms
0.1s
1s
standby
wide area
network
10s
The time available for recovery depends on the tolerance of the plant against outages.
When this time is long enough, stand-by operation becomes possible
Industrial Automation
43/52
9.4.1
9.4.2
Fault-Tolerant Structures
9.4.3
9.4.4
9.4.5
Industrial Automation
44/52
E
D
SIDE B
E
D
E
D
E
D
E
D
USU
E
D
E
D
E
D
I/O
I/O
E
D
duplicated
input/output
Commutator
Input
Output Input"
45/52
System
Features
Central repository
Redundant 2oo3
Duplication of connectivity severs
each maintains its own A&E and history log
Network
Dual lines, dual interfaces,
dual ports on controller CPU
Connectivity
Server
Aspect
Server
Controller CPU
Hot standby, 1oo2
PROFIBUS DP/V1 line redundancy
Single bus interface, dual lines
PROFIBUS DP/V1 slave redundancy
S800, S900, dual bus interfaces
Redundant I/O, remote
Dual power supplies
Supervision of A and B power lines in
AC 800M, S800 I/O, S900 I/O
Power back-up for workplaces and servers
UPS (Uninterruptible Power Supply) technology
Industrial Automation
46/52
Source device
Sink device
match
decoder
match
decoder
decoder
decoder
line A
?
line B
Skew: 10 ns
Industrial Automation
Skew: 8 us
Skew: 8 us
47/52
B777: airplane
Industrial Automation
48/52
Industrial Automation
49/52
Industrial Automation
50/52
B777 Modules
Industrial Automation
51/52
signal
mgt.
Motorola
68040
Intel
80486
AMD
29050
Primary
Flight
Computer
(PFC 1)
PFC 2
PFC 3
(Intel)
(AMD)
triplicated
output bus
actuator control
left actuator
Industrial Automation
actuator control
centre actuator
52/52
actuator control
right actuator
9.4 Dependable Architectures
GPC 1
CPU 1
IOP 1
GPC 2
CPU 2
IOP 2
GPC 3
CPU 3
IOP 3
GPC 4
CPU 4
IOP 4
GPC 5
CPU 5
IOP 5
Control
Panels
28
1 - MHz
serial data
buses
( 23 shared,
5 dedicated )
Intercomputer (5)
Mass memory (2)
Display system (4)
Payload operation (2)
Launch function (2)
Flight instrument (5;1 dedicated per GPC)
Flight - critical sensor and control (8)
GNC sensors
Main engine interface
Aerosurface actuators
Thrust - vector control
actuators
Primary flight displays
Mission event controllers
Master time
Navigation aids
Mass
memory Telemetry
units
Industrial Automation
CRT
display
53/52
payloadinterface
Manipulator
uplink
Solid rocket
boosters
Ground umbilicals
Ground support
equipment
Wrap-up
Industrial Automation
54/52
Industrial Automation
55/52
Industrial Automation
Automation Industrielle
Industrielle Automation
9.5
Dependable Software
Logiciel fiable
Verlssliche Software
9.5.3 Examples
Automatic Train Protection
High-Voltage Substation Protection
Industrial Automation
2/40
safety
integrity level
control systems
[per hour]
protection systems
[per operation]
10
-9
to < 10
-8
10
-5
to < 10
-4
10
-8
to < 10
-7
10
-4
to < 10
-3
10
-7
to < 10
-6
10
-3
to < 10
-2
10
-6
to < 10
-5
10
-2
to < 10
-1
3/40
Software Problems
Did you ever see software that did not fail once in 10 000 years
(i.e. it never failed during your lifetime)?
4/40
"The range gate's prediction of where the Scud will next appear is a function of the Scud's known velocity and the
time of the last radar detection.
Velocity is a real number that can be expressed as a whole number and a decimal (e.g., 3750.2563...miles per
hour).
Time is kept continuously by the system's internal clock in tenths of seconds but is expressed as an integer or whole
number (e.g., 32, 33, 34...).
The longer the system has been running, the larger the number representing time. To predict where the Scud will
next appear, both time and velocity must be expressed as real numbers. Because of the way the Patriot computer
performs its calculations and the fact that its registers are only 24 bits long, the conversion of time from an integer
to a real number cannot be any more precise than 24 bits. This conversion results in a loss of precision causing a
less accurate time calculation. The effect of this inaccuracy on the range gate's calculation is directly proportional
to the target's velocity and the length of the system has been running. Consequently, performing the conversion
after the Patriot has been running continuously for extended periods causes the range gate to shift away from the
center of the target, making it less likely that the target, in this case a Scud, will be successfully intercepted."
Industrial Automation
5/40
http://www.ima.umn.edu/~arnold/disasters/ariane5rep.html
(no more available at the original site)
"The failure of the Ariane 501 was caused by the complete loss of guidance and attitude information 37 seconds
after start of the main engine ignition sequence (30 seconds after lift-off). This loss of information was due to
specification and design errors in the software of the inertial reference system.
The internal SRI* software exception was caused during execution of a data conversion from 64-bit floating
point to 16-bit signed integer value. The floating point number which was converted had a value greater than
what could be represented by a 16-bit signed integer. "
*SRI stands for Systme de Rfrence Inertielle or Inertial Reference System.
Code was reused from the Ariane 4 guidance system. The Ariane 4 has different flight characteristics in the first 30 s of
flight and exception conditions were generated on both inertial guidance system (IGS) channels of the Ariane 5. There
are some instances in other domains where what worked for the first implementation did not work for the second.
6/40
A 1988 survey conducted by the United Kingdom's Health & Safety Executive (Bootle,
U.K.) of 34 "reportable" accidents in the chemical process industry revealed that
inadequate specifications could be linked to 20% (the #1 cause) of these accidents.
Industrial Automation
7/40
physical
system
(e.g. HV
substation,
train, factory)
environment
(e.g. persons,
buildings, etc.)
Industrial Automation
8/40
Which Faults?
software
design faults
???
systematic faults
solution: diversity
???
physical faults
hardware
statistics
random faults
solution: redundancy
Industrial Automation
9/40
Approach 1: Layered
fail-safe software
fail-safe hardware
against
design faults
fail-safe software
against
physical faults
hardware
systematic
less flexible
flexible
less expensive
expensive
Industrial Automation
10/40
1)
Fault removal
design diversity
Industrial Automation
11/40
Requirements
analysis
a
requirements
specification
complete
system
System/Software
Design
Verification &
Validation
design
specification
Integration
Implementation
program
Industrial Automation
12/40
Validation:
dynamic techniques
static techniques
test
review
simulation
proof
Industrial Automation
13/40
3.00 / limit
99 %
4.61 / limit
99.9 %
6.91 / limit
99.99 %
9.21 / limit
99.999 %
11.51 / limit
Example: c = 99.99 % , failure rate 10 -9/h test length > 1 million years
Industrial Automation
14/40
Testing
Testing requires a test specification, test rules (suite) and test protocol
specification
implementation
test rules
test procedure
test results
Testing can only reveal errors, not demonstrate their absence ! (Dikstra)
Industrial Automation
15/40
SDL
LOTOS
Esterel
graphical syntax
syntax analysis,
static checks
interactive simulation
deterministic simulation
stochastic simulation
code generation
Industrial Automation
16/40
Statecharts
C, Ada
Formal Proofs
Implementation Proofs
Property Proofs
informal
requirements
formal
spec.
formalization
construction
proof
formal
spec.
formal
implementation
Industrial Automation
required
properties
proof
17/40
VDM
Z
SDL
LOTOS
NP
mathematical foundation
dynamic logic
(pre- and postconditions)
predicate logic, set theory
finite-state machines
process algebra
propositional logic
example tools
Mural
from University of Manchester
SpecBox from Adelard
ProofPower from ICL Secure Systems
DST-fuzz from Deutsche System Technik
SDT
from Telelogic
Geode from Verilog
The LOTOS Toolbox from Information
Technology Architecture B.V.
NP-Tools
from Logikkonsult NP
Dilemma:
Either the language is not very powerful,
or the proof process cannot be easily automated.
Industrial Automation
18/40
19/40
Acceptance Tests
allowed
states
Industrial Automation
20/40
Design errors are difficult to detect and even more difficult to correct on-line.
The cost of diverse software can often be invested more efficiently in
off-line testing and validation instead.
Rate of safety-critical failures (assuming independence between versions):
r(t)
development
version 1
rs(t)
development
version 2
rd(t)
rdi(t)
debugging two versions (stretched by factor 2)
t1
Industrial Automation
21/40
periodical tests
example test
overhead
?
continuous supervision
plausibility check
Industrial Automation
acceptance test
22/40
redundancy/diversity
hardware/software/time
?
range checks
structural checks
timing checks
coding checks
reversal checks
compute y = x; check x = y 2
Industrial Automation
23/40
Recovery Blocks
input
recovery
state
acc.
test
passed
result
alternate
version 1
versions exhausted
unrecoverable error
Industrial Automation
24/40
different teams
different languages
different data structures
different operating system
different tools (e.g. compilers)
different sites (countries)
different specification languages
software 1
software 2
specification
software n
run time:
time
f1
f2
=
f1'
f3
=
f2'
Industrial Automation
f4
=
f3'
f4'
f5
f6
=
f5'
25/40
f6'
f7
f8
=
f7'
=
f8'
Industrial Automation
26/40
no
yes
no
P > Pth?
yes
Industrial Automation
T > Tth?
branch 2
branch 1
branch 3
27/40
Redundant Data
Redundantly linked list
status
status
status
Data diversity
in 1
in
input
diversification
in 2
out 1
algorithm
decision
out
out 3
in 3
Industrial Automation
out 2
28/40
Examples
Use of formal methods
Formal specification with Z
Tektronix: Specification of reusable oscilloscope architecture
Formal specification with SDL
ABB Signal: Specification of automatic train protection systems
Formal software verification with Statecharts
GEC Alsthom: SACEM - speed control of RER line A trains in Paris
Use of design diversity
2x2-version programming
Aerospatiale: Fly-by wire system of Airbus A310
2-version programming
US Space Shuttle: PASS (IBM) and BFS (Rockwell)
2-version programming
ABB Signal: Error detection in automatic train protection system EBICAB
900
Industrial Automation
29/40
Both for physical faults and design faults (single processor time redundancy).
data
input
algorithm A
algorithm B
A = B?
data
output
time
- 2 separate teams for algorithms A and B
3rd team for A and B specs and synchronisation
- B data is inverted, single bytes mirrored compared with A data
- A data stored in increasing order, B data in decreasing order
- Comparison between A and B data at checkpoints
- Single points of failure (e.g. data input) with special protection (e.g. serial input with CRC)
Industrial Automation
30/40
power plant
power plant
line
protection
substation
bay
busbar
protection
busbar
substation
to consumers
Industrial Automation
31/40
secondary system:
busbar protection
current
measurement
tripping
primary system:
busbar
Industrial Automation
Kirchhoffs
current law
32/40
REB 500 is a
distributed
real-time
computer system
(up to 250
processors).
central unit
CMP
BIO
CSP
bay units
AI
BIO
CT
current
measurement
AI
BIO
CT
tripping,
busbar replica
busbar
Industrial Automation
33/40
Software Self-Supervision
Each processor in the system runs application objects and self-supervision tasks.
CMP appl.
CMP SSV
CSP appl.
AI appl.
AI SSV
CSP SSV
BIO appl.
BIO SSV
Industrial Automation
34/40
Application Objects
data (in)
=?
data (out)
status
self-supervision (n)
Self-Supervision Objects
periodic/
start-up
HW tests
deblock (n+1)
status classification
continuous
application
monitoring
deblock (n)
self-supervision (n-1)
Industrial Automation
35/40
Safety CRC
Implicit safety ID (source/sink)
Time-stamp
Receiver time-out
Input Consistency:
Safe Storage:
Duplicate data
Check cyclic production/consumption with toggle bit
Diverse tripping:
Industrial Automation
36/40
CMP
CSP
AI
running
running
major error
running
running
deblock
CSP
running
BIO
blocked
busbar
zone 2
busbar
zone 1
Industrial Automation
37/40
Industrial Automation
38/40
Industrial Automation
39/40
Industrial Automation
40/40
X
b
a
Y
write a program to determine the x,y coordinates of the robot head H, given that EC and
CH are known.
The (absolute) angles are given by a resolver with 16 bits (0..65535), at joints E and C
Industrial Automation
41/40
Industrial Automation
Automation Industrielle
Industrielle Automation
9.6
Dr. B. Eschermann
ABB Research Center, Baden, Switzerland
2004 June BE
2004 June BE
effect on system ?
component 1
failure
mode 1
component n
failure
mode 1
failure
mode k
failure
mode k
2004 June BE
component
failure mode
effect on system
water tank
empty
no coffee produced
too full
electronics damaged
empty
no coffee produced
too full
too full
2004 June BE
2004 June BE
2004 June BE
2004 June BE
Criticality Grid
Criticality levels
I
II
III
IV
very low
low
2004 June BE
medium
high
Probability
of failure
Failure Criticalities
IV: Any event which could potentially cause the loss of primary system function(s)
resulting in significant damage to the system or its environment and causes
the loss of life
III: Any event which could potentially cause the loss of primary system function(s)
resulting in significant damage to the system or its environment and negligible
hazards to life
II: Any event which degrades system performance function(s) without appreciable
damage to either system, environment or lives
I: Any event which could cause degradation of system performance function(s)
resulting in negligible damage to either system or environment and no
damage to life
2004 June BE
FMEA/FMECA: Result
2004 June BE
10
2) Identify the functional structure of the system and how the components
contribute to functions.
f1
EPFL - Industrial Automation
f2
2004 June BE
f3
f5
f4
11
f6
f7
9.6 Dependability Analysis
failure
mode
2004 June BE
failure
cause
failure effect
local global
12
failure
other
remark
detection provision
- false actuation
- fails to open
- fails to stop
- fails to close
- fails to start
- fails if open
- fails to switch
- fails if closed
- restricted flow
- inadvertent operation
- loss of input
- intermittent operation
- loss of output
- premature operation
- erroneous indication
- delayed operation
- leakage
2004 June BE
13
2004 June BE
14
no problem
failure mode x
&
common source
failure mode y
serious
consequence
no problem
2004 June BE
15
pressure p1
coil with
inductivity L2
iron core
coil with
inductivity L1
i2(t)
i1(t)
u1(t)
u2(t)
2004 June BE
16
p1 L1
sensor data
preparation
different
failure
effects
p2 L2
sensor data
processing
output data
generation
processing 1
watchdog
processing 2
pstatic
checking
(limits,
consistency)
Tempsens
Tempelec
safe
output
(e.g.
upscale)
=
A/D
conversion
power
supply
controlled
current
generator
2004 June BE
17
Fai l ure
Mo de
Lo c al Ef f e c t
De t e c t i o n
Me c h an i s m
Gl o b al Ef f e c t
Co mme n t s
1.1.1
out of failsafe
accuracy
range
go to safe state
output driven to
up/downscale
consistency check
(comp. with p2),
detection of small
failures not guaranteed
(allowed difference p1p2)
go to safe state
output driven to
up/downscale
consistency check
(comp. with p1),
detection of small
failures not guaranteed
(allowed difference p1p2)
n/a
p1
measurement
1.1.2
1.2.1
1.2.2
p2
measurement
out of failsafe
accuracy
range
wrong but
pressure input via
within fail- L2 slightly wrong
safe
accuracy
range
2004 June BE
18
FTA
failures
of system
system state
to avoid
The main problem with both FMEA and FTA is to not forget anything important.
Doing both FMEA and FTA may help to become more complete (2 different views).
EPFL - Industrial Automation
2004 June BE
19
coffee machine
doesnt work
water tank
empty
&
no coffee
beans
undeveloped event:
analyzed elsewhere
2004 June BE
20
power
switch off
basic event:
not further
developed
tripping algorithm 1
inputs
&
trip
signal
underfunctions increased
Putot = 2Pu - Pu2
tripping algorithm 2
tripping algorithm 1
inputs
comparison
trip
signal
&
2004 June BE
dynamic
modeling
necessary
repair
tripping algorithm 2
overfunctions reduced
Potot = Po2
21
2004 June BE
22
Markov Model
l2(1-c)
(l1+l2)(1-c)
latent underfunction
2 chains, n. detectable
latent overfunction
1 chain, n. detectable
(l1+l2)c+l3
(l1+l2)c+l3
s1+l1(1-c)
(l1+l2+l3)c
OK
m
detectable error
1 chain, repair
l1(1-c)
s2
l1+l2+l3c
l3(1-c)
s2
latent underfunction
not detectable
l1=0.01, l2=l3=0.025, s1=5, s2=1, m=365,
EPFL - Industrial Automation
2004 June BE
overfunction
23
underfunction
c=0.9 [1/ Y]
9.6 Dependability Analysis
Analysis Results
mean time to
underfunction [Y]
400
weekly test
300
permanent comparison (red. HW)
200
2-yearly test
50
5000
500
2004 June BE
24
mean time to
overfunction [Y]
safety
integrity level
control systems
[per hour]
protection systems
[per operation]
10
-9
to < 10
-8
10
-5
to < 10
-4
10
-8
to < 10
-7
10
-4
to < 10
-3
10
-7
to < 10
-6
10
-3
to < 10
-2
10
-6
to < 10
-5
10
-2
to < 10
-1
2004 June BE
25
concept
overall planning
overall
operation and
maintenance
planning
overall
safety
validation
planning
overall
installation and
commissioning
planning
2004 June BE
safety-related
systems:
E/E/PES
realisation
12
overall installation
and commissioning
13
14
16
26
10
safety-related
systems:
other
technology
realisation
15
11
external risk
reduction
facilities
realisation
overall modifications
and retrofit
IEC 61580
2004 June BE
27
2004 June BE
28
2004 June BE
29