
ISA TRANSACTIONS

ELSEVIER

ISA Transactions 34 (1995) 193-198

Using Markov models for safety analysis of programmable electronic systems


Julia V. Bukowski a, William M. Goble b,*
a Villanova University, Villanova, PA 19085, USA b Moore Products Company, Sunnytown Pike, Spring House, PA 19477, USA

Abstract

Markov Models (diagrams showing failure states) can easily represent the operation of a fault tolerant programmable electronic system (PES) as various system components fail and/or are repaired. These models can account for multiple failure rates as a function of failure state, common cause failures, on-line diagnostic capability of a PES, multiple failure modes, and different repair rates as a function of failure state. Further, the same physical system may behave differently in different operating modes and this can be accounted for by different Markov models. Such models can be constructed simply and accurately when a systematic method is used. This paper describes the systematic method and shows examples of the reliability and safety analysis developed for a new fault tolerant control system under two different operating modes. The importance of including the operating mode in the modeling and analysis is clearly demonstrated. One operating mode is substantially safer than the other.

Keywords: Reliability analysis; Safety analysis; Programmable electronic controls; Quad architecture

1. Introduction

One reliability and safety modeling tool has shown the flexibility to realistically model a programmable electronic control system. This tool is called a Markov model (or Failure State Diagram). The method can account for on-line diagnostic coverage, common cause failures, multiple failure states, different repair times as a function of what has failed, and variable failure rates. A Markov model can be developed in a systematic manner in the context of a single drawing. This allows that drawing to qualitatively document the operation of the PES under various component failure conditions.

* Corresponding author.

0019-0578/95/$09.50 © 1995 Elsevier Science B.V. All rights reserved. SSDI 0019-0578(95)00008-9

A Markov model is a state diagram. Circles are used to represent combinations of failed or working components. Arrows (directed arcs) are used to show the possible ways the controller may change states (due to the failure or repair of the controller components). The values on the arcs represent the rates at which the controller moves between states. Generally, λ's represent failure rates and μ's represent repair rates. Fig. 1 shows a Markov model for a single repairable component with one failure mode. The component is either successful (state 1) or failed (state 2). The model can move from state 1 to state 2 at a rate of λ12, the failure rate. The model can move from failure, state 2, to success, state 1, at the repair rate, μ21.

Fig. 1. Markov model - repairable component, one failure mode.
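As a minimal sketch of the single-component model of Fig. 1, the two states and two rates can be written as a 2 x 2 transition matrix with a one-hour time step and iterated until the state probabilities settle. The numeric rates, the function names `two_state_matrix` and `step`, and the choice to compare against the closed-form stationary availability μ21 / (λ12 + μ21) are illustrative assumptions and do not come from the paper.

```python
def two_state_matrix(lam, mu):
    """Transition matrix for the Fig. 1 model with a one-hour time step.

    Row/column 0 is state 1 (success); row/column 1 is state 2 (failed).
    """
    return [
        [1.0 - lam, lam],
        [mu, 1.0 - mu],
    ]


def step(dist, P):
    """Advance a row vector of state probabilities by one time step."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


# Hypothetical per-hour rates, chosen only for illustration.
LAMBDA_12 = 1.0e-4   # failure rate, state 1 -> state 2
MU_21 = 0.125        # repair rate, state 2 -> state 1 (8-hour mean repair)

P = two_state_matrix(LAMBDA_12, MU_21)

# Start in state 1 (component working) and iterate hour by hour.
dist = [1.0, 0.0]
for _ in range(1000):
    dist = step(dist, P)

availability_iterated = dist[0]
# Closed-form stationary probability of state 1 for a two-state chain.
availability_closed_form = MU_21 / (LAMBDA_12 + MU_21)

print(availability_iterated, availability_closed_form)
```

The iteration is the same row-vector-times-matrix update that a spreadsheet would perform cell by cell, which is one reason a small model like this is easy to solve without special software.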

Fig. 2 shows a simplified Markov model for a PES showing two system success states and two failure modes. In this Markov model, state 1 represents the situation where all controller components are operating and, consequently, the system is fully operational. State 2 represents the state where the system is operating but some of its components have failed, i.e., state 2 represents a degraded mode of operation. States 3 (FS) and 4 (FD) represent two different failure modes. Assuming all rates are on a per hour basis, all information needed to analyze the model can be placed in a single matrix, P. This matrix contains all the "from-to" information in a Markov model. For i not equal to j, the entry in P for the ith row and jth column is the transition probability from state i to state j. The entries along the diagonal are simply 1 - (sum of all the other entries in that row). As an example, note that λ12 goes from state 1 to state 2. This value appears in the first row, second column. Continuing, for the model in Fig. 2,

\[
P = \begin{bmatrix}
1 - \lambda_{12} & \lambda_{12} & 0 & 0 \\
\mu_{21} & 1 - \mu_{21} - \lambda_S - \lambda_D & \lambda_S & \lambda_D \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
\]

Various reliability metrics including Mean Time To any Failure (MTTF), Mean Time To Fail in the D state (MTTFD), Reliability, Availability, Probability of Failure on Demand (PFD), etc. can be calculated via relatively straightforward techniques. Numeric results can be quickly obtained using a personal computer spreadsheet. General techniques are detailed in [1] and calculation of MTTFD is presented in [2,3]. Background material on Markov model solutions can be obtained in many college textbooks including [4].

2. Systematic method

Markov modeling can be done more quickly and more accurately when a systematic method is used to build the model. The following procedure is used:
1. Identify failure modes and failure rates of all system components.
2. Start the Markov model with a state where all components are operating successfully.
3. For each system success state, build a checklist of all failure rates and failure modes of all operating components. Show each failure rate as an exit arc.
4. Repeat step 3 until no successful components remain.
5. For each state with a failed component, add appropriate repair rates.
6. Simplify the model by merging states. Any states that have identical exit rates to the same states can be merged.

The use of these techniques almost forces the construction of an accurate Markov model. The actual operation of the system under component failure conditions becomes obvious. Sometimes the result can be quite surprising. Unpredicted failure states appear. Given the full picture of how a system operates under failure conditions, designs can be optimized. New architectures can be created.
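Returning to the simplified four-state model of Fig. 2, the sketch below builds the matrix P from the four rates and derives two of the metrics named in the text. It uses the standard absorbing-chain technique (fundamental matrix N = (I - Q)^-1), which is a textbook approach and not necessarily the spreadsheet method the authors used. All numeric rates and the helper names `build_p`, `invert_2x2`, and `analyze` are hypothetical.

```python
def build_p(lam12, mu21, lam_s, lam_d):
    """Transition matrix for Fig. 2; states are ordered 1, 2, 3 (FS), 4 (FD)."""
    return [
        [1.0 - lam12, lam12, 0.0, 0.0],
        [mu21, 1.0 - mu21 - lam_s - lam_d, lam_s, lam_d],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]


def invert_2x2(m):
    """Closed-form inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def analyze(P):
    """Return (MTTF in hours from state 1, [P(end in FS), P(end in FD)])."""
    Q = [row[:2] for row in P[:2]]   # success states -> success states
    R = [row[2:] for row in P[:2]]   # success states -> failure states
    i_minus_q = [
        [(1.0 if i == j else 0.0) - Q[i][j] for j in range(2)]
        for i in range(2)
    ]
    N = invert_2x2(i_minus_q)        # expected hours spent in each success state
    mttf = N[0][0] + N[0][1]
    absorb = [sum(N[0][k] * R[k][j] for k in range(2)) for j in range(2)]
    return mttf, absorb


# Hypothetical per-hour rates, for illustration only.
LAM_12, MU_21, LAM_S, LAM_D = 2.0e-5, 0.125, 1.0e-5, 1.0e-6

P = build_p(LAM_12, MU_21, LAM_S, LAM_D)
mttf_hours, (p_fail_safe, p_fail_danger) = analyze(P)
print(mttf_hours, p_fail_safe, p_fail_danger)
```

Because both failure states are reached only from state 2 in this model, the split between FS and FD reduces to λS / (λS + λD) and λD / (λS + λD), which gives a quick sanity check on the matrix result.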

Fig. 2. Simplified Markov model.

3. Quad logic architecture example

Fig. 3 shows a PES architecture designed to provide both high availability and high safety via four channels of electronics: two logic solving channels and two diagnostic channels. The logic solver channels read process signals, perform process calculations, and generate outputs. The diagnostic channels monitor operation of the input, logic solver, and output circuits. Measurements of voltages, currents, waveforms, memory patterns, and timing are compared to reference levels which indicate proper operation. One logic solver channel and one diagnostic channel are packaged into a set of modules. Thus, two physical sets of modules form the system.

Fig. 3. QUAD LOGIC architecture.

The objective of this Markov analysis is to compare two different modes of operation in the quad logic architecture. In the first mode, called calculate/calculate, both logic solvers are operating synchronously. Each logic solver calculates process outputs in parallel. Both output sets are energized. In mode two, called calculate/verify, one logic solver calculates process outputs with its diagnostic output energized and the other repeats the calculation to verify correct operation but with its diagnostic output de-energized. The role of calculator is constantly switched between units. The calculate/verify mode offers diagnostic advantages since all components can be fully exercised during the periodic switching. In addition, the calculate/verify mode avoids common cause software failures characteristic of synchronous, identical software systems. In spite of known advantages for the calculate/verify mode, a question remains as to how the system is affected by the different operating modes under failure conditions. Two Markov models must be developed and compared.

3.1. Calculate/calculate mode Markov model

To build a Markov model of the quad logic architecture operating in the calculate/calculate mode, the systematic method is used. The first step is to identify all failure modes and failure rates. Two failure modes are distinguished: those that cause outputs to fail de-energized (fail-safe in a de-energize to trip system) and those that cause outputs to energize or freeze. These failure modes are generally referred to as safe and dangerous. Failures of either mode that are detected by on-line diagnostics must be distinguished from those not detected by diagnostics. These categories are called detected and undetected. Thus far four failure rates have been established:
- λSD: safe, detected failures;
- λSU: safe, undetected failures;
- λDD: dangerous, detected failures;
- λDU: dangerous, undetected failures.

To properly account for common cause failures, each failure rate should be partitioned into normal and common cause. This results in eight failure rates for each physical set of channels in our PES:
- λSDN: safe, detected normal stress failures;
- λSUN: safe, undetected normal stress failures;
- λSDC: safe, detected common cause failures;
- λSUC: safe, undetected common cause failures;
- λDDN: dangerous, detected normal stress failures;
- λDUN: dangerous, undetected normal stress failures;
- λDDC: dangerous, detected common cause failures;
- λDUC: dangerous, undetected common cause failures.

With all failure rates identified, Markov model construction begins with a state in which all components operate successfully - state 1. A checklist chart (Fig. 4) shows all failure rates that must exit. The first step Markov model is shown in Fig. 5. All failure rates in the chart have been placed into the model. So far, the Markov model has a fail-safe state (5), a fail-danger state (6), a state where all components are successful (1) and three degraded system success states (2, 3, 4).

[Fig. 4 checklist entries: SDC, SUC, DDC, DUC, SDN, SDN, SUN, SUN, DDN, DDN, DUN, DUN]

Fig. 4. State 0 failure rate checklist.

Fig. 5. Step 1 Markov model - Calculate/calculate mode.

Fig. 7. Calculate/verify mode Markov model.

The Markov model construction continues from state 2. Again a checklist is created of failure rates. Since only one physical set of modules is in operation, no common cause partitioning need be done. The checklist at this point is simply λSD, λSU, λDD and λDU. The same failure rates are present for the remaining system success states (3 and 4). The final Markov model is shown in Fig. 6.

In state 1, all system components operate successfully. In state 2, one set of modules has failed with a safe, detected failure. The system is still successful because the other set of modules has control. A similar situation has occurred in state 3: a dangerous, detected failure has occurred. The system is still successful because the diagnostic channel has de-energized the failed outputs and the other set of modules has control. In state 4 the system is also degraded but successful. One module set has experienced a safe, undetected failure. Its outputs are de-energized and the other module set has control. On-line repair rates have been added to states 2 and 3 where diagnostics have detected the failure. Under such circumstances repair to the fully operational state can be made quickly.
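The partition of failure rates described in Section 3.1 and the state-1 checklist of Fig. 4 can be sketched as follows. The paper states only that each rate is split into detected/undetected and normal/common cause; the specific arithmetic here (a coverage factor giving the detected fraction, and a beta factor giving the common cause fraction) is an assumption based on common practice and on the input names in Table 1. The per-set rates and the function name `partition` are hypothetical.

```python
def partition(rate, coverage, beta):
    """Split one failure-mode rate into four parts (assumed model).

    coverage: fraction of failures detected by on-line diagnostics.
    beta:     fraction of failures attributed to common cause.
    Keys: DN/UN = detected/undetected normal, DC/UC = detected/undetected
    common cause.
    """
    detected = coverage * rate
    undetected = (1.0 - coverage) * rate
    return {
        "DN": (1.0 - beta) * detected,
        "UN": (1.0 - beta) * undetected,
        "DC": beta * detected,
        "UC": beta * undetected,
    }


# Hypothetical per-hour rates for ONE physical set of modules.
LAMBDA_SAFE = 5.0e-6
LAMBDA_DANGEROUS = 1.0e-6
COVERAGE_SAFE = 0.9
COVERAGE_DANGEROUS = 0.99
BETA = 0.05

# Eight rates named as in the text: SDN, SUN, SDC, SUC, DDN, DUN, DDC, DUC.
rates = {}
for mode, rate, cov in (
    ("S", LAMBDA_SAFE, COVERAGE_SAFE),
    ("D", LAMBDA_DANGEROUS, COVERAGE_DANGEROUS),
):
    for suffix, value in partition(rate, cov, BETA).items():
        rates[mode + suffix] = value

# State-1 checklist: each normal rate appears twice (two module sets),
# each common cause rate once, matching the twelve entries of Fig. 4.
checklist = []
for name, value in rates.items():
    copies = 1 if name.endswith("C") else 2
    checklist.extend([(name, value)] * copies)

total_exit_rate = sum(value for _, value in checklist)
print(len(checklist), total_exit_rate)
```

Listing every entry explicitly, instead of pre-summing, mirrors step 3 of the systematic method: each checklist item must leave the state on some arc, so a missing arc shows up as an unassigned entry.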

3.2. Calculate/verify mode Markov model


Failure rate categories are identical for the calculate/verify mode Markov model. In the calculate/verify mode of operation the diagnostic cut-off switch is de-energized in the verify mode. This difference in switch position changes how the system responds under failure conditions. Using the same systematic technique, a new Markov model is constructed (Fig. 7).

3.3. Comparison

A close examination of the models shows that the calculate/verify mode is less likely to go to a fail-danger state. The arcs exiting from state 1 are the significant ones. Comparing the arcs:
* State 1 to state 5 (FS) - The calculate/verify mode has a higher failure rate. It includes two sets of safe undetected normal failures.
* State 1 to state 6 (FD) - The calculate/verify mode has a lower failure rate. In calculate/verify mode only a dangerous, undetected common cause failure rate is present. With safety class diagnostics and high common cause strength, this rate would be near zero. The calculate/calculate mode has two sets of dangerous undetected failures in addition to the dangerous undetected common cause. This extra failure rate makes this mode less safe.

Fig. 6. Calculate/calculate mode Markov model.

So, overall, it appears that the calculate/verify mode trades a lower availability for higher safety. In safety instrumented systems, this trade-off can be a significant positive for the user.

A numerical example will verify the comparison. Using a set of failure rate data, a spreadsheet model was created to solve for:
- MTTF (Mean Time To any Failure), a measure of availability;
- PFD (Probability of Failure on Demand), a measure of safety;
- HRF (Hazard Reduction Factor), the inverse of PFD.

The input data set is shown in Table 1. Results are shown in Table 2.

Table 1
Input data for comparison spreadsheet

Variable                                                Symbol     Value   Units
Input circuit - safe failure rate                       l s ic     50      FITS
Logic solver - safe failure rate                        l s mp     350     FITS
Output circuit - safe failure rate                      l s oc     100     FITS
Input circuit - dangerous failure rate                  l d ic     100     FITS
Logic solver - dangerous failure rate                   l d mp     150     FITS
Output circuit - dangerous failure rate                 l d oc     100     FITS
Software - safe failure rate                            l s sw     1000    FITS
Software - dangerous failure rate                       l d sw     0       FITS
Coverage factor - safe input circuit failures           c s ic     0.9     Prob.
Coverage factor - dangerous input circuit failures      c d ic     0.99    Prob.
Coverage factor - safe logic solver failures            c s mp     0.9     Prob.
Coverage factor - dangerous logic solver failures       c d mp     0.99    Prob.
Coverage factor - safe output circuit failures          c s oc     0.9     Prob.
Coverage factor - dangerous output circuit failures     c d oc     0.99    Prob.
Coverage factor - safe software failures                c s sw     0.9     Prob.
Coverage factor - dangerous software failures           c d sw     0.99    Prob.
Quantity of input circuits                              n          16      Quantity
Quantity of output circuits                             m          8       Quantity
Probability of common cause failure - hardware          beta hw    0.05    Prob.
Probability of common cause failure - software          beta sw    0.1     Prob.

The calculate/calculate mode shows an MTTF of 1,190,171 hours, while the calculate/verify mode is lower with an MTTF of 1,091,421 hours. The calculate/calculate mode has an HRF of 2,251 while the calculate/verify mode shows a substantial improvement with an HRF of 46,898. Using the spreadsheet, comparisons were made over wide ranges of input data. With the exception of extremely poor diagnostic coverage and low common cause strength, the comparison was always the same. The calculate/verify mode was substantially safer. Since the calculate/verify mode avoids the inherent common cause problems of synchronous software systems, it becomes the clear choice.
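The reported figures can be put on a common footing with a few lines of arithmetic. The sketch below uses only the MTTF and HRF values quoted above, the paper's definition of HRF as the inverse of PFD, and the usual definition of a FIT as one failure per 10^9 hours. The function and variable names are illustrative, not taken from the authors' spreadsheet.

```python
HOURS_PER_FIT = 1.0e9  # 1 FIT = 1 failure per 10**9 hours


def fits_to_per_hour(fits):
    """Convert a failure rate in FITS to failures per hour."""
    return fits / HOURS_PER_FIT


def pfd_from_hrf(hrf):
    """PFD is the inverse of the Hazard Reduction Factor."""
    return 1.0 / hrf


# Values as reported in the comparison results.
results = {
    "calculate/calculate": {"mttf_hours": 1_190_171, "hrf": 2_251},
    "calculate/verify": {"mttf_hours": 1_091_421, "hrf": 46_898},
}

cc = results["calculate/calculate"]
cv = results["calculate/verify"]

# Fractional loss of MTTF when moving to calculate/verify.
mttf_reduction = 1.0 - cv["mttf_hours"] / cc["mttf_hours"]
# Factor by which the hazard reduction improves.
hrf_gain = cv["hrf"] / cc["hrf"]
# Implied probability of failure on demand for each mode.
pfd = {mode: pfd_from_hrf(r["hrf"]) for mode, r in results.items()}

print(f"MTTF reduction: {mttf_reduction:.1%}")
print(f"HRF gain: {hrf_gain:.1f}x")
for mode, value in pfd.items():
    print(f"PFD ({mode}): {value:.2e}")
```

Expressed this way, the trade-off described in the text is a single-digit percentage loss in MTTF against roughly a twentyfold gain in hazard reduction.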

Table 2
Comparison results

                Calculate/calculate mode    Calculate/verify mode
MTTF (hours)    1,190,171                   1,091,421
HRF             2,251                       46,898

4. Conclusion

Markov model techniques have the capability to fully account for the inherent complexities of PES. In our modeling experience we have been able to use these techniques to compare architectures, compare operating modes, and determine that diagnostics and common cause susceptibility are critical variables in PES. Safety and availability modeling allows designers to make superior choices in sometimes counter-intuitive situations. It becomes clear why the ISA 84.02 subcommittee has chosen Markov modeling as the preferred technique for safety and availability evaluation.

References

[1] W.M. Goble, Evaluating Control System Reliability: Techniques and Applications (ISA, Research Triangle Park, NC, 1992).
[2] J.V. Bukowski and W.M. Goble, "Reliability analysis of controllers for safety shutdown systems", Proc. Ninth Internat. Conf. of the Israel Society for Quality Assurance (ISQA), Jerusalem, Israel (November 1992).
[3] J.V. Bukowski and W.M. Goble, "The reliability analysis of PES safety-systems", Proc. of the Food and Pharmaceutical Industries Symposium, Instrument Society of America, Toronto, Canada (1992).
[4] D.P. Maki and M. Thompson, Mathematical Models and Applications (Prentice-Hall, Englewood Cliffs, NJ, 1973).
