
2012 International Conference on Computing Sciences

Fault Tolerant ALU System


Ayon Majumdar, Sahil Nayyar, Jitendra Singh Sengar
School of Electronics and Communication Engineering
Lovely Professional University
majumdar007ayon@gmail.com

Abstract: This paper presents the design of a fault tolerant ALU
system using Triple Modular Redundancy (TMR). The ALU is a critical
component of a microprocessor and the core component of the central
processing unit; it is therefore necessary to make the ALU fault
tolerant. Voting logic and a disagreement detector are employed to
make the ALU system fault tolerant. The source code was developed in
Verilog HDL using Xilinx ISE.

Keywords: fault tolerance, redundancy, TMR, ALU, voting logic

I. INTRODUCTION

When some part of a system fails, a fault tolerant design enables the
system to continue operating, possibly at a reduced level, rather than
failing completely. The failure of a single component, whether
hardware or software, does not bring down the whole system [1]. For
example, a motor vehicle carries a spare tire so that it remains
drivable when one of its tires is punctured. Similarly, the integrity
of a structure is maintained in spite of failures such as corrosion
and fatigue [1].
There are two major types of faults:
1. Permanent faults are due to manufacturing defects, early-life
failures and wear-out failures.
2. Temporary faults are present only for a short period of time and
are mostly caused by external disturbances or marginal design
parameters.
Permanent faults are hard to avoid because they arise from
manufacturing defects, but the effects of temporary faults can be
avoided. To protect a system against temporary faults, we make it a
fault tolerant system.

II. FAULT TOLERANT SYSTEM

Sometimes a system is able to continue its normal operation even when
some of its components fail. This property is called fault tolerance
[2]. In a fault tolerant design, the decrease in operating quality is
proportional to the severity of the failure, whereas in a naively
designed system even a small failure can cause a total breakdown [2].
Fault tolerance becomes an essential design criterion in applications
where the reliability of the hardware is crucial. Medical, military
and long-range missions are applications in which the fault tolerance
of the hardware became a key issue [3].

A. Fault Tolerance Requirement

The basic characteristics of a fault tolerant system are [1]:
1. In case of failure, the system should continue its normal
operation during the repair process without any interruption.
2. A failure should be isolated to the faulty component instead of
propagating to the whole system.
3. Mechanisms for isolating faulty components are required for
system protection.

B. Deciding Parameters for the System to be Fault Tolerant

Making every component of a system fault tolerant is not an ideal
option. The following criteria should be kept in mind when deciding
which component to make fault tolerant:
1. Importance of the component. In a laptop, for example, the
microprocessor is the most critical component, so it is a better
candidate for fault tolerance than any other component.
2. Probability of failure of the component. A component that is more
likely to fail than the others should be made fault tolerant.
3. Cost of making the component fault tolerant. For example,
providing a redundant heat sink for a laptop is too expensive, both
economically and in terms of weight and board space.

C. System Level Operation

In hardware fault tolerance, the faulty part must be replaced with a
spare one while the system is still in operation. Systems that have a
single backup are known as single point tolerant. In such systems,
the repair time should be much shorter than the mean time between
failures [1].
Suppose the state of system operation is represented as $S$, where
$S = 0$ means the system operates normally and $S = 1$ represents
system failure. Then $S$ is a function of time $t$, as shown in
Fig. 1 [4].
978-0-7695-4817-3/12 $26.00 2012 IEEE
DOI 10.1109/ICCS.2012.36


Fig. 1 System Operation and Repair

Suppose the system is in normal operation at $t_0 = 0$, it fails at
$t_1$, and normal system operation is recovered at $t_2$ by some
software modification, reset, or hardware replacement. Similar failure
and repair events happen at $t_3$ and $t_4$ [4]. The duration of
normal system operation ($T_n$), for intervals such as $t_1 - t_0$ and
$t_3 - t_2$, is generally assumed to be a random variable that is
exponentially distributed. This is known as the exponential failure
law.
Hence, the probability that a system will operate normally until time
$t$, referred to as reliability, is given by:

$$P(T_n > t) = e^{-\lambda t} \tag{1}$$

where $\lambda$ is the failure rate [4]. Because a system is composed
of a number of components, the overall failure rate of the system is
the sum of the individual failure rates ($\lambda_i$) of the $k$
components:

$$\lambda = \sum_{i=1}^{k} \lambda_i \tag{2}$$

The mean time between failures (MTBF) is given by:

$$\mathrm{MTBF} = \frac{1}{\lambda} \tag{3}$$

Similarly, the repair time ($R$) is also assumed to obey an
exponential distribution and is given by:

$$P(R > t) = e^{-\mu t} \tag{4}$$

where $\mu$ is the repair rate [4]. Hence, the mean time to repair
(MTTR) is given by:

$$\mathrm{MTTR} = \frac{1}{\mu} \tag{5}$$

The fraction of time that a system is operating normally
(failure-free) is the system availability and is given by:

$$\text{System availability} = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}} \tag{6}$$

This formula is widely used in reliability engineering; for example,
telephone systems are required to have a system availability of 0.9999
(simply called four nines), while high-reliability systems may require
seven nines or more [4].

Redundancy is the most critical concept in making a system fault
tolerant.

III. REDUNDANCY

The critical components or functions of a system are duplicated, or
even triplicated, to increase the reliability of the system [5]. This
practice is called redundancy. For example, the control system for the
hydraulic systems of an aircraft may be triplicated. If an error
occurs in one component, it is voted out by the other two components
[5]. Thus, the probability of failure of the system as a whole is
greatly reduced.

A. Types of Redundancy
The four major forms of redundancy are as follows [5]:
1. Hardware redundancy, for example DMR and TMR.
2. Information redundancy, for example error detection and
correction methods.
3. Time redundancy, which performs the same operation twice to check
whether both runs give the same output.
4. Software redundancy, such as N-version programming.

B. Functions of Redundancy
Redundancy serves two functions: passive redundancy and active
redundancy [5].
Passive redundancy uses excess capacity to reduce the impact of
component failures. A common example is increasing the build quality
of components that are critical to the device [5].
In active redundancy, the performance of each device is monitored and
any decline in performance is eliminated. This monitoring is used in
voting logic, so voting logic can be used for fault masking. Because
the voting logic is linked to switching, it automatically
reconfigures the components [5].

IV. TRIPLE MODULAR REDUNDANCY

It has long been known that the reliability of digital systems can be
improved through the use of redundant components, provided these
additional components are properly employed. The most common
redundancy method is Triple Modular Redundancy (TMR), which is
explained further in this paper [7].
Triple modular redundancy (TMR) is a fault-tolerant form of N-modular
redundancy in which three systems perform the same process and their
results are processed by a voting system to produce a single output
[6]. If any one of the three systems fails, the other two can correct
and mask the fault. If the voter fails, the complete system fails.
The majority voter uses voting logic, as shown in Fig. 2.


Fig. 2 Example of Triple Modular Redundancy

In TMR, as shown in Fig. 2, the outputs of all three modules are
compared by the majority voter and the majority value is passed as the
final output. When two of the three modules give the same output, the
two-to-one vote tells the majority voter which replica is in error.
After this only two modules are left, and the majority voter can
switch to dual modular redundancy (DMR). The same scheme extends to N
replicated modules (N-modular redundancy). The redundant system will
not fail if none of the three modules fails, or if exactly one of the
three modules fails [7]. It is assumed that the failures of the three
modules are independent [7]. Since the two events are mutually
exclusive, the reliability $R$ of the redundant system is equal to the
sum of the probabilities of these two events [7]. Hence,

$$R = R_m^3 + 3R_m^2(1 - R_m) = 3R_m^2 - 2R_m^3 \tag{7}$$

where $R_m$ is the reliability of a single module.

The voting logic compares the outputs of all the modules and passes
the majority output: if all three outputs are the same, that value
becomes the final output, and if two of the three outputs are the
same, the value they share becomes the final output. Consequently, if
the two matching outputs are both erroneous, the erroneous value still
becomes the final output.

V. ARITHMETIC LOGIC UNIT
The ALU (arithmetic logic unit) is a critical component of a
microprocessor and the core component of the central processing unit
[8]. ALUs comprise the combinational logic that implements logic
operations, such as AND and OR, and arithmetic operations, such as ADD
and SUBTRACT [8]. Most of a processor's operations are performed by
one or more ALUs. The data is loaded from the input registers into the
ALU, and the Control Unit decides which operation the ALU performs on
that data [9]. The result is stored in output registers. The Control
Unit also transfers the processed data between the registers, the ALU
and memory [9]. The ALU designed in this work implements a total of 16
functions: 8 arithmetic functions and 8 logical functions. Most ALUs
can perform the following operations:
1. Bitwise logic operations (AND, NOT, OR, XOR, NAND, NOR, XNOR)
2. Integer arithmetic operations
3. Bit-shifting operations

VI. FAULT TOLERANT ALU SYSTEM

The ALU is an essential part of the CPU; therefore it is more
critical to make it fault tolerant than any other component.

Fig. 3 Fault Tolerant ALU System

To make the ALU fault tolerant we have used the method of Triple
Modular Redundancy. In this method the implemented ALU is triplicated,
with each copy receiving the same inputs, thus making the system
triple modular redundant.
The outputs of the three ALUs are passed through the voting circuit,
which compares them and passes the majority output. If any two ALUs
give the same output, that output is passed by the voting circuit and
becomes the final output of the whole circuit. If all three ALUs give
the same output, that output becomes the final output. If all three
ALUs give different outputs, the voting circuit faces a conflict and
fails; the final output is then indeterminate.
The disagreement detector compares the outputs of the three ALUs and
indicates which ALU is giving a different output, i.e. which ALU is
the faulty one. If all three outputs are the same, it indicates that
no ALU is faulty. The disagreement detector fails if any two ALUs
become faulty: whenever the two faulty ALUs agree with each other, it
flags the one fault-free ALU as faulty.


Thus, we have made the ALU system fault tolerant to a great extent,
but the problem is not eliminated, because in practice a 100%
fault-free system cannot be built. We can reduce the rate at which
faults occur, but we cannot remove them entirely. The fault tolerant
ALU system above has a limitation: with $N = 3$ modules it fails if
$N - 1 = 2$ of them become faulty. More generally, a majority-voted
system of $N$ modules ($N$ odd) fails once a majority of the modules,
$(N + 1)/2$, are faulty. In the case of the ALU, if any two of the
three ALUs fail then the whole model fails.
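
The effect of this limitation on reliability can be estimated from Eq. (7). The following Python sketch evaluates that expression for a few module reliabilities. The function name `tmr_reliability` and the sample values are illustrative assumptions, not part of the original design, and the voter is assumed to be perfect.

```python
def tmr_reliability(r_m: float) -> float:
    """Reliability of a TMR system with a perfect voter, as in Eq. (7).

    r_m is the reliability of a single module (assumed identical and
    independent for all three modules).
    """
    if not 0.0 <= r_m <= 1.0:
        raise ValueError("module reliability must lie in [0, 1]")
    # Either all three modules work, or exactly one of the three fails.
    return r_m**3 + 3 * r_m**2 * (1 - r_m)


if __name__ == "__main__":
    # Hypothetical module reliabilities, chosen only for illustration.
    for r_m in (0.99, 0.9, 0.5, 0.3):
        print(f"R_m = {r_m:.2f} -> R = {tmr_reliability(r_m):.6f}")
```

Under these assumptions, triplication improves on a single module only when the module reliability exceeds 0.5; at 0.5 the two are equal, and below it the TMR system is less reliable than one module.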

A. Result of the ALU Implemented

An 8-bit ALU was implemented in Verilog HDL. It has two input ports,
a and b, one output port, out, and one port for the command. The
simulated output of the ALU is shown in Fig. 4 and its RTL schematic
in Fig. 5.

Fig. 4 Simulated output of the ALU

The implemented 8-bit ALU has 8 arithmetic and 8 logical functions.
Its simulated output in Fig. 4 shows all of these functions.
The variable command determines which function is executed and when.
If command is 0, the addition function is executed, as 0 has been
assigned to addition. If command is 8, logical AND is performed, as 8
has been assigned to it, and so on. The output enable oe determines
the availability of the output: when oe is 1 the output is available,
and when oe is 0 no output is obtained. Therefore oe is held high by
default so that the output is received.

The RTL schematic in Fig. 5 shows the blocks for the various
functions, such as addition, subtraction, multiplication and division.

Fig. 5 RTL Schematic of the ALU
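
The command decoding and output enable described in this subsection can be sketched behaviourally in Python. This is only an illustration under stated assumptions, not the Verilog source: the text fixes just two command codes (0 for addition, 8 for logical AND), so the codes for subtraction and OR below are hypothetical, as are the truncation of results to 8 bits and the use of `None` to model "no output".

```python
from typing import Callable, Dict, Optional

MASK = 0xFF  # 8-bit data path (result truncation is an assumption)

# Only codes 0 (addition) and 8 (logical AND) are stated in the text;
# codes 1 and 9 are hypothetical placeholders for illustration.
OPERATIONS: Dict[int, Callable[[int, int], int]] = {
    0: lambda a, b: a + b,  # addition (from the text)
    1: lambda a, b: a - b,  # subtraction (assumed code)
    8: lambda a, b: a & b,  # logical AND (from the text)
    9: lambda a, b: a | b,  # logical OR (assumed code)
}


def alu(a: int, b: int, command: int, oe: int = 1) -> Optional[int]:
    """Return the 8-bit result, or None when the output is disabled."""
    if oe == 0:
        return None  # no output is obtained when oe is low
    if command not in OPERATIONS:
        raise ValueError(f"unsupported command {command}")
    # Mask the operands and the result to the 8-bit width.
    return OPERATIONS[command](a & MASK, b & MASK) & MASK


if __name__ == "__main__":
    print(alu(5, 3, command=0))            # addition
    print(alu(0b1100, 0b1010, command=8))  # logical AND
    print(alu(5, 3, command=0, oe=0))      # output disabled
```

A lookup table keeps the mapping from command code to function in one place, which mirrors how a case statement would select the operation in an HDL description.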


B. Result of Fault Tolerant ALU System


The simulated output of the fault tolerant ALU system, designed in
Verilog HDL, is shown in Fig. 6.

Fig. 6 Simulated Output of Fault Tolerant ALU System

The algorithm for the fault tolerant ALU system is as follows:

1. Design an ALU system and then triplicate it to achieve TMR.
2. Design the voting circuit, which compares the three ALU outputs:
   a. Let a, b and c be the outputs of the three ALUs and y the
      majority output passed by the voting circuit.
   b. If a = b and a ≠ c then y = a.
   c. If b = c and b ≠ a then y = b.
   d. If c = a and c ≠ b then y = c.
   e. If a = b = c then y = a (equivalently y = b or y = c).
3. Design the disagreement detector, which again compares the three
   ALU outputs:
   a. Let p, q and r be the outputs of the three ALUs.
   b. Let u, v and w be the fault indicators for p, q and r
      respectively.
   c. If p = q and p ≠ r then ALU_3 is faulty; w = 1.
   d. If q = r and q ≠ p then ALU_1 is faulty; u = 1.
   e. If r = p and r ≠ q then ALU_2 is faulty; v = 1.
   f. If p = q = r then no ALU is faulty; u = 0, v = 0 and w = 0.
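
Steps 2 and 3 of the algorithm can be sketched behaviourally in Python as follows. This is a minimal illustration, not the Verilog implementation. The function names are hypothetical, and two behaviours are assumptions because the algorithm does not specify them: the voter returns `None` when all three outputs differ (the indeterminate case), and the detector raises all three indicators in that case.

```python
from typing import Optional, Tuple


def majority_vote(a: int, b: int, c: int) -> Optional[int]:
    """Return the value shared by at least two inputs, else None."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    return None  # all three disagree: the output is indeterminate


def disagreement_detector(p: int, q: int, r: int) -> Tuple[int, int, int]:
    """Return the indicators (u, v, w) for ALU_1, ALU_2 and ALU_3."""
    if p == q == r:
        return (0, 0, 0)  # no ALU is faulty
    if p == q:
        return (0, 0, 1)  # ALU_3 disagrees
    if q == r:
        return (1, 0, 0)  # ALU_1 disagrees
    if r == p:
        return (0, 1, 0)  # ALU_2 disagrees
    return (1, 1, 1)  # assumption: flag all when no two outputs agree


if __name__ == "__main__":
    # Hypothetical case: the second ALU returns a wrong sum for 5 + 3.
    outputs = (8, 9, 8)
    print(majority_vote(*outputs))
    print(disagreement_detector(*outputs))
```

In the hypothetical case shown, the voter passes the value on which ALU_1 and ALU_3 agree and the detector raises only the indicator for ALU_2.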

Fig. 7 RTL Schematic for Fault Tolerant ALU System


The schematic in Fig. 7 shows three ALU modules integrated into a
single module, thus exhibiting triple modular redundancy.
The algorithm given earlier describes the design of the fault tolerant
ALU system in Verilog HDL. In the voting circuit, a, b and c denote
the outputs of ALU_1, ALU_2 and ALU_3 respectively. Similarly, in the
disagreement detector, p, q and r denote the outputs of ALU_1, ALU_2
and ALU_3 respectively.
The simulated output of the fault tolerant ALU system is shown in
Fig. 6. In that figure, a and b are the primary inputs, oe is the
output enable and command selects the function of the ALU. The signals
out1, out2 and out3 in Fig. 6 are the outputs of the three ALUs,
whereas dout is the output of the disagreement detector. The
indicators u, v and w appear as x, y and z respectively.
In this simulation the second ALU module is taken to be faulty, as
can be seen in Fig. 6, and the function performed by the ALU is
addition.
VII. CONCLUSION
Ideal systems that are completely fault tolerant or fail-safe do not
exist in the real world. The fault tolerant ALU system therefore has
limitations, which can be overcome by replacing a faulty module with
a spare one. For this, the system should be designed so that the mean
time between failures (MTBF) is greater than the mean time to repair
(MTTR). The faulty module can then be replaced with a spare before
another module fails, while the system continues its normal
operation. In addition, the build quality can be improved, along with
other measures, so that the ALU becomes less likely to fail. In this
way the ALU system becomes fault tolerant to a great extent, since
achieving sufficient fault tolerance is the major design issue.
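
The relationship between MTBF and MTTR stated here can be explored numerically with a short Python sketch. It uses the standard relations for exponentially distributed failure times: reliability e^(-λt), MTBF = 1/λ for a system whose failure rate λ is the sum of its component rates, and availability = MTBF / (MTBF + MTTR). The function names and all numeric values are hypothetical and chosen only for illustration.

```python
import math
from typing import Iterable


def reliability(failure_rate: float, t: float) -> float:
    """Probability of failure-free operation until time t."""
    return math.exp(-failure_rate * t)


def mtbf(component_failure_rates: Iterable[float]) -> float:
    """MTBF of a system whose failure rate is the sum of its parts."""
    return 1.0 / sum(component_failure_rates)


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system operates normally."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical failure rates (per hour) of three components.
    system_mtbf = mtbf([0.0005, 0.0003, 0.0002])
    print(f"MTBF = {system_mtbf:.1f} h")
    print(f"R(100 h) = {reliability(0.001, 100):.4f}")
    for mttr_hours in (10.0, 1.0, 0.1):
        a = availability(system_mtbf, mttr_hours)
        print(f"MTTR = {mttr_hours:5.1f} h -> availability = {a:.6f}")
```

With these illustrative numbers, shortening the repair time relative to the MTBF is what moves the availability towards the "four nines" level.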
REFERENCES
[1] "Fault Tolerant Design" [Online]. Available: http://www.bgb.gr/storage/
[2] P. J. Denning, "Fault Tolerant Operating Systems," ACM Computing
    Surveys (CSUR), December 1976.
[3] B. Baykant Alagoz, "Hierarchical Triple-Modular Redundancy (H-TMR)
    Network For Digital Systems."
[4] Laung Terng Wang, Cheng Wen Wu and Xiaoqing Wen, VLSI Test
    Principles and Architectures: Design for Testability, The Morgan
    Kaufmann Series in Systems on Silicon, 2008.
[5] "Redundancy Management Technique for Space Shuttle Computers," IBM
    Research.
[6] David Ratter, "FPGAs on Mars."
[7] R. E. Lyons and W. Vanderkulk, "The Use of Triple-Modular
    Redundancy to Improve Computer Reliability."
[8] Samuel Winchenbach and Mohammed Driss, "8 Bit Arithmetic Logic
    Unit," University of Maine, Orono.
[9] William Stallings, Computer Organization & Architecture: Designing
    for Performance, 7th ed., Pearson Prentice Hall, 2006.
