
Fault Tolerance in Grid Computing

1. INTRODUCTION
Grid computing enables the aggregation and sharing of geographically distributed computational, data and other resources as a single, unified resource for solving large-scale compute- and data-intensive applications in dynamic, multi-institutional virtual organizations. Since Grid resources are highly heterogeneous and dynamic, faults are more likely to occur in a Grid environment. The generally accepted definitions of some basic terms are as follows:

Fault: A fault is a violation of a system's underlying assumptions.
Error: An error is an internal data state that reflects a fault.
Failure: A failure is an externally visible deviation from specifications.

A fault need not result in an error, nor an error in a failure. For example, in the Ethernet link layer of the network stack, a packet collision is an error that does not result in a failure, because the Ethernet layer handles it transparently.

1.1 GRID ARCHITECTURE


Figure 1 describes the grid architecture considered here.

Fig. 1: Grid Architecture

The architecture modules include:
A User Interface (UI), through which users can submit their jobs to the grid.
A Grid Scheduler (GS), which assigns jobs received from users to grid resources.


A Resource Information Server (RIS), which collects resource capability information, such as CPU capacities, memory size, etc. Scheduling decisions taken by the GS are based upon the information provided by the RIS.
A Fault Handler (FH), which handles failures in the system. Handling failures includes fault detection and fault recovery.

The grid consists of geographically dispersed resources managed by a single administrative unit. Currently, only a grid environment with a single centralized GS and RIS is considered. In this architecture, resources exhibit varying failure behaviour, so a fault handler is included that can deal with failures when they occur.
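
The interaction between these modules can be illustrated with a small sketch. The following Python fragment is only a toy model under assumed names (none of the classes or methods belong to any real Grid middleware): the scheduler asks the RIS for the most capable resource, and the fault handler recovers a failed job by resubmitting it elsewhere.

    # Hypothetical sketch of the modules in Fig. 1 (all names are illustrative).
    class ResourceInformationServer:
        """Collects resource capability information (CPU count, memory, ...)."""
        def __init__(self, resources):
            self.resources = resources                    # e.g. {"node1": {"cpus": 8}}
        def best_resource(self, exclude=frozenset()):
            candidates = {r: c for r, c in self.resources.items() if r not in exclude}
            return max(candidates, key=lambda r: candidates[r]["cpus"])

    class FaultHandler:
        """Detects a failed job and recovers it by asking the scheduler to resubmit."""
        def recover(self, scheduler, job, failed_node):
            print(f"{failed_node} failed while running {job}; resubmitting")
            scheduler.submit(job, exclude={failed_node})

    class GridScheduler:
        """Assigns jobs to resources based on information provided by the RIS."""
        def __init__(self, ris, fault_handler):
            self.ris, self.fault_handler = ris, fault_handler
        def submit(self, job, exclude=frozenset()):
            node = self.ris.best_resource(exclude)
            print(f"scheduling {job} on {node}")
            return node

    ris = ResourceInformationServer({"node1": {"cpus": 8}, "node2": {"cpus": 4}})
    gs = GridScheduler(ris, FaultHandler())
    node = gs.submit("job-42")                            # arrives via the user interface
    gs.fault_handler.recover(gs, "job-42", node)          # simulate a detected failure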

1.2 FAULT TOLERANCE


Fault tolerance is the ability to preserve the delivery of expected services despite the presence of fault-caused errors within the system itself. It aims at the avoidance of failures in the presence of faults. A fault-tolerant service detects errors and recovers from them without the participation of any external agents, such as humans.

Errors are detected and corrected, and permanent faults are located and removed, while the system continues to deliver acceptable services. Strategies to recover from errors include:
1. roll-back, which implies bringing the system to a correct state saved before the error occurred;
2. roll-forward, i.e. bringing the system to a fresh state without errors; and
3. compensation, i.e. masking an error, in situations where the system contains enough redundancy to do so.

Hence fault tolerance can be considered the survival attribute of computer systems. In the case of Grid computing, these approaches, although still very useful, may not be enough. Fault tolerance is an important property in Grid computing, as the dependability of individual Grid resources may not be guaranteed. As resources are used outside of organizational boundaries, it becomes increasingly difficult to guarantee that a resource being used is not malicious in some way. In many cases, an organization may send out jobs for remote execution on machines upon which trust cannot be placed; for example, the machines may be outside of its organizational boundaries, or may be desktop machines that many people have access to. A fault-tolerant approach may therefore be useful in order to prevent a malicious node from affecting the overall performance of the application. As applications scale to take advantage of Grid resources, their size and complexity will increase dramatically.

However, experience shows that systems with complex asynchronous and interacting activities are very prone to errors and failures due to their extreme complexity, and we simply cannot expect such applications to be fault free, no matter how much effort is invested in fault avoidance and fault removal. In fact, the likelihood of errors occurring may be exacerbated by the fact that many Grid applications perform long tasks that may require several days of computation. Hence, the cost and difficulty of recovering from faults in Grid applications is higher than in normal applications. Furthermore, the heterogeneous nature of Grid nodes means that many Grid applications will be functioning in environments where interaction faults are more likely to occur between disparate Grid nodes. The heterogeneous architectures and operating system platforms give rise to a number of problems that are not present in traditional homogeneous systems. The complexity of both (a) varying architectural features, such as data representation and instruction sets, and (b) varying operating system features, such as process management and communication interfaces, must be masked from the application programmer. If no fault tolerance is provided, the system cannot continue when one or several processes fail, and the whole program crashes. In this sense, techniques are needed that enable a system to continue to execute even in the presence of a fault. Since the primary purpose of grids is to provide computational power to scientists, the intended application developer is a domain expert, typically a physicist, a biologist or a meteorologist, not necessarily a grid expert. Most application developers are therefore unaware of the different types of failures that may occur in the grid environment, and hence fault tolerance becomes increasingly important.


2. TYPES OF FAILURE
Faults may be classified based on several factors:

Network faults: faults due to network partition, packet loss, or packet corruption.
Processor faults: machine or operating system crashes.
Process faults: resource shortage, software bugs.
Interaction faults: protocol incompatibilities, security incompatibilities, policy problems, timing overhead.

With respect to time, three types of failures can occur in computer systems: permanent, intermittent and transient.

Permanent: These failures are caused, for example, by accidentally cutting a wire or by power breakdowns. It is easy to reproduce these failures. They can cause major disruptions, and some part of the system may not function as desired.
Intermittent: These are non-deterministic failures that appear occasionally. They are often overlooked while testing the system and only appear once the system goes into operation. Therefore, it is hard to predict the extent of damage these failures can bring to the system.
Transient: These failures are caused by some inherent fault in the system. However, they are corrected by retrying, such as restarting software or resending a message. These failures are very common in computer systems.
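
Transient failures, in particular, are usually masked by simple retry logic. The sketch below is a minimal illustration, assuming a hypothetical operation (for example a send_message call) that may raise an error on a transient fault; it is not taken from any Grid toolkit.

    import time

    def retry(operation, attempts=3, delay=1.0):
        """Retry an operation that may suffer transient failures (e.g. a lost message)."""
        for attempt in range(1, attempts + 1):
            try:
                return operation()
            except OSError as exc:              # stand-in for a transient fault
                print(f"attempt {attempt} failed: {exc}")
                time.sleep(delay)
        raise RuntimeError("still failing after retries; treat the fault as permanent")

    # Usage (send_message is an assumed function, not defined here):
    # result = retry(lambda: send_message(node, payload))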

2.1 BEHAVIOUR OF FAILED SYSTEMS


The types of failures described above cause the system to behave in different ways. Four types of behaviour are possible in systems after a failure: fail-stop, Byzantine, fail-fast and crash failure systems.

Fail-stop system: The system does not output any data once it has failed. It immediately stops sending any events or messages and does not respond to any messages.
Byzantine system: The system does not stop after a failure but instead behaves in an inconsistent way. It may send out wrong information or respond late to a message.


Fail-fast system: The system behaves like a Byzantine system for some time but moves into a fail-stop mode after a short period. It does not matter what type of fault or failure has caused this behaviour, but it is necessary that the system does not perform any operation once it has failed; in other words, it simply stops doing anything following a failure.
Crash failure system: The system simply stops working at a specific point in time.

2.2 FOUR FORMS OF FAULT TOLERANCE


To behave correctly, a distributed program A must satisfy both its safety and its liveness properties. This may no longer be the case if faults from a certain fault class are allowed to occur. So if faults occur, how are the properties of A affected? Four different possible combinations are shown in Table 1.

Table 1: Forms of fault tolerance

              live           not live
  safe        masking        fail safe
  not safe    non-masking    none

If a program A still satisfies both its safety and its liveness properties in the presence of faults from a specified fault class F, then we say that A is masking fault tolerant for fault class F. This is the strictest, most costly, and most desirable form of fault tolerance, because the program is able to tolerate the faults transparently. If neither safety nor liveness is guaranteed in the presence of faults from F, then the program does not offer any form of fault tolerance. This is the weakest, cheapest, most trivial and most undesirable form of fault tolerance.


3. FAULT TOLERANCE TECHNIQUES


Based on the above-mentioned faults and failures, different fault tolerance techniques have been devised. The rest of the report concentrates on each of them. Section 3.1 describes how fault tolerance can be achieved by replication techniques, Section 3.2 describes fault tolerance by checkpointing mechanisms, Section 3.3 describes fault tolerance by scheduling policies, Section 3.4 describes failure detection by heartbeat signal mechanisms, and Section 3.5 discusses the disadvantages of fault tolerance.

3.1 FAULT TOLERANCE BY REPLICATION:


Replication is an important method for achieving fault tolerance in grids. The different replication techniques are:
Job replication
Component replication
Data replication

Fig.2: Architecture of Replication method

3.1.1 Job replication


As shown in Fig. 2, grid resources with a fault tolerance service register themselves with a UDDI repository. The fault tolerance service is capable of receiving jobs, executing them, performing checksum operations on them, and sending the results back. A client application then instantiates one or more coordination services, which are designed to function under the Distributed Recovery Block scheme. There can be any number of these services, but it is recommended that at least two be used, to guard against a single point of failure. The coordination services contact one or more UDDI registries in order to determine the location and number of compatible resources available. They may also contact any appropriate metering services, in order to determine the relevant costs of using the different processing nodes, if this information is not made available in the meta-data provided by the nodes themselves.

The coordination services then wait for the client to send them a job via multicast; once this has been received, the primary coordination service (if the primary fails, the secondary takes over) determines the nodes to which replicas of the job should be sent, and then sends the replicas to these nodes. The nodes process the job until completion or until they fail and/or leave the system. Upon completion, the fault tolerance service on each node generates a checksum based upon the results it has produced and broadcasts this checksum to the coordination services. The coordination services wait until a given number of nodes have completed, and then vote on the returned checksums. A correct checksum is assumed to be the one returned by the majority (i.e., at least n/2 + 1) of the nodes that broadcast a checksum. If a consensus is not reached, the coordination service can wait for more checksums to be received, send out more replicas (preferably to nodes belonging to organizations different from the nodes whose checksums did not reach consensus), or return an error message to the client application. If a consensus is reached, a node with a correct checksum is randomly selected and requested to send the full result back to the client application. The coordination service returns the appropriate checksum to the client, and the client then generates a checksum on the received result, compares it with the consensus checksum returned by the coordination service, and accepts the result if there is a match. The drawback of this approach is high overhead, both because of the voting method and because voting cannot commence until a suitable number of results have been generated. (A simplified sketch of the checksum voting step is given at the end of this section.)

3.1.2. Component replication

When components are replicated on different machines in the Grid, and any component or machine fails, the application can be transferred to and run on another machine having the required components.

3.1.3. Data replication

Replication is also commonly used by fault tolerance mechanisms to enhance availability in Grid-like environments where failures are more likely to occur. When a node hosting a data copy crashes, other copies are made available by other nodes. Data replication may be of two types.

a) Synchronous replication: In this method, each local database (replica) needs to get acknowledgements from all other replicas, or at least a majority of replicas, whenever modifications are made. It provides the highest degree of consistency for replicated data, but at the cost of performance.

b) Asynchronous replication: Because of the relatively low performance of write operations in a synchronously replicated environment, asynchronous replication is introduced, at the cost of lower consistency.
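
To make the checksum voting of Section 3.1.1 concrete, the following sketch shows one way a coordination service could compare checksums returned by replica nodes and accept a result only when a majority agree. It is a simplified, assumption-based illustration, not the Distributed Recovery Block implementation itself.

    import hashlib
    from collections import Counter

    def checksum(result: bytes) -> str:
        return hashlib.sha256(result).hexdigest()

    def vote(checksums):
        """Return the majority checksum, or None if no consensus was reached."""
        if not checksums:
            return None
        value, count = Counter(checksums).most_common(1)[0]
        return value if count > len(checksums) // 2 else None   # i.e. at least n/2 + 1

    # Replies from three replica nodes; one node is faulty or malicious.
    replies = [checksum(b"result-A"), checksum(b"result-A"), checksum(b"corrupted")]
    agreed = vote(replies)
    print("consensus reached" if agreed else "no consensus; schedule more replicas")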


3.2 FAULT TOLERANCE BY CHECKPOINTING:


Checkpointing is the process of saving the state of a running application to stable storage. In case of a fault, this saved state can be used to resume execution of the application from the point in the computation where the checkpoint was last taken, instead of restarting the application from the very beginning. Resuming execution from such intermediate points reduces the execution time to a large extent. For example, if an application does not complete because of a machine crash, the application's execution state is reloaded from the physical checkpoint file once the machine has been fixed and rebooted. Once loading completes, the application is recovered and resumes execution. Checkpointing can be classified based on the following attributes:
1. Level of abstraction
2. In-transit and orphan messages
3. Who instruments the application to take checkpoints
4. Granularity of checkpointing
5. Scope of the checkpoint
6. Storage space requirement

3.2.1. Level of abstraction

The first criterion is based on the level of abstraction at which the state of a process is saved. Accordingly, checkpointing falls under the following categories:
System-level checkpoints
User- or application-level checkpoints
Mixed-level checkpoints

3.2.1.1. System-Level Checkpointing (SLC)

System-level checkpointing is a technique which provides automatic, transparent checkpointing of applications at the operating system or middleware level. The application is seen as a black box, and the checkpointing mechanism has no knowledge of any of its characteristics. Typically, this involves capturing the complete process image of the application: the bits that constitute the state of the process, such as the contents of the program counter, registers and memory, are saved on stable storage. Examples of systems that do system-level checkpointing are Condor and Libckpt. Some systems, like Starfish, give the programmer some control over what is saved. However, complete system-level checkpointing of parallel machines with thousands of processors can be impractical, because each system checkpoint can require thousands of nodes sending terabytes of data to stable storage.


For this reason, system-level checkpointing is not done on large machines such as the IBM BlueGene or the ASCI machines.

3.2.1.2. Application-Level Checkpointing (ALC)

Applications can obtain fault tolerance by providing their own checkpointing code. The application is written such that it can correctly restart from various positions in the code by storing certain information to a restart file. Major middleware tools that make use of application-level checkpointing are BOINC and XtremWeb. A key difference between system-level and application-level checkpoints is transparency: in user-defined checkpointing, the programmer is responsible for specifying the data to be included in the checkpoint and where the checkpoints can be taken within the application code, whereas system-level checkpointing is transparent to the user and requires little or no programmer effort.
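
As an illustration of application-level checkpointing, the sketch below periodically writes the loop state of a long computation to a restart file and resumes from it after a crash. The file name, the checkpoint interval and the computation itself are assumptions made for the example; real tools such as BOINC expose their own checkpoint APIs.

    import json, os

    CHECKPOINT = "restart.json"              # hypothetical restart file

    def load_state():
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT) as f:
                return json.load(f)          # resume from the last saved state
        return {"i": 0, "partial_sum": 0}

    def save_state(state):
        tmp = CHECKPOINT + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, CHECKPOINT)          # atomic rename so a crash cannot corrupt it

    state = load_state()
    for i in range(state["i"], 1_000_000):
        state["partial_sum"] += i
        state["i"] = i + 1
        if i % 100_000 == 0:
            save_state(state)                # checkpoint every 100,000 iterations
    print(state["partial_sum"])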

3.2.1.3. Mixed-Level Checkpointing (MLC)

It is clear that neither SLC nor ALC is always the better solution; efficiency and correctness are difficult issues for both approaches. Mixed-level checkpointing (MLC) therefore combines aspects of both SLC and ALC.

3.3 FAULT TOLERANCE BY SCHEDULING:


To overcome the drawbacks of checkpointing and replication mechanisms, fault tolerance can be factored into Grid scheduling. Scheduling policies for Grid systems can be classified into space-sharing and time-sharing policies; it is also possible to combine these two types of policies into a hybrid policy. In a time-sharing scheme, processors are shared over time by executing different applications on the same processors during different time intervals, commonly known as a time slice or quantum. In contrast, in the space-sharing approach, processors are partitioned into disjoint sets and each application executes in isolation on one of these sets. Several studies have been made on fault-tolerant scheduling, such as Charlotte, Bayanihan, Javelin, GUCHA and XtremWeb. Charlotte introduces a fault tolerance mechanism called eager scheduling for load balancing and failure masking: it reschedules a task to idle processors as long as the task's result has not been returned, so crashes can be handled without the need to detect them. Assigning a single task to multiple processors also guarantees that a crash-failed machine, or a slow machine running a worker process, is transparently bypassed by faster workers. Bayanihan uses a credibility-based fault tolerance mechanism on the basis of eager scheduling; the credibility-based mechanism provides fault tolerance against malicious volunteers. Javelin uses an advanced eager scheduling which enhances scalability: Javelin forms a problem tree to keep track of the computation status, the selection of a host from which to steal work uses an algorithm based on the tree structure, and if a host fails, Javelin provides fault tolerance by using a tree repair scheme. In GUCHA, the scheduling algorithm is based on transfer, information and placement policies; GUCHA selects the volunteers with the highest capability for task execution. In XtremWeb, tasks are scheduled according to a FIFO scheme. A simplified sketch of the eager-scheduling idea is given below.
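
In this assumption-based sketch, every unfinished task remains eligible for assignment, so an idle or recovered worker simply asks for work again, and crashed workers need no explicit detection because their tasks are eventually handed to someone else. The data structures are illustrative and not taken from Charlotte or Bayanihan.

    # Hypothetical eager scheduler: tasks are handed out until a result is returned.
    pending = {"t1": None, "t2": None, "t3": None}   # task -> result (None = unfinished)

    def next_task():
        """Give an idle worker any task whose result has not yet been returned."""
        for task, result in pending.items():
            if result is None:
                return task
        return None

    def report(task, result):
        if pending[task] is None:                    # first result wins; duplicates ignored
            pending[task] = result

    # Workers repeatedly request work; a slow or crashed worker is bypassed because
    # its task stays eligible and is simply reassigned to the next idle worker.
    while (task := next_task()) is not None:
        report(task, f"result-of-{task}")
    print(pending)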

Although these scheduling mechanisms tolerate crash and link failures by using eager scheduling, they have some drawbacks, such as redundant recomputation, and they do not consider volunteer autonomy failures. When the above-mentioned scheduling mechanisms are applied to volunteer autonomy failures, they result in an independent live-lock problem. To overcome these problems, the following scheduling mechanisms have been proposed:
Distributed Fault Tolerant Scheduling (DFTS)
Volunteer Availability based Fault Tolerant Scheduling (VAFTS)

Distributed Fault Tolerant Scheduling (DFTS)

System model

Fig.3: Architecture of Grid scheduling

Fig. 3 shows the basic components of a grid system of interest, which consists of N sites; each site is composed of a set of nodes, P1, ..., Pn, and a set of disk storage systems. The resources in a given site are sharable (i.e., community based), and the nodes may fail with probability f, 0 ≤ f ≤ 1, and be repaired independently. Each site in the grid has a Single Resource Manager (SRM), which provides scheduling and resource management services.

The SRM also provides resource reservation and job staging facilities. Users can submit a job to the SRM, which determines where and when to run the job. Each job Ji that arrives at the SRM can be decomposed into t tasks, Ji = {T1, ..., Tt}, and each task Ti executes sequential code and is fully preemptable. By default, allocation of resources from multiple sites to a single job is allowed; this capability can be disabled system-wide or on a per-job basis. Users must log onto a site and submit jobs to the local SRM of that site using a command language such as the matchmaker used in Condor-G. Note that access to resources is typically subject to the individual access, accounting, priority, and security policies of the resource owners; those policies are typically enforced by local management systems. Here the assumption is that every SRM in the system is reachable from any other SRM unless there is a failure in the network or in the node housing the SRM. DFTS employs a peer-to-peer fail-over strategy such that every SRM is a backup of another SRM in the system. The primary SRM (PSRM) and the backup SRM (BSRM) communicate periodically such that the BSRM assumes the responsibility of the PSRM in the event that the latter fails. In case of a link failure between the PSRM and remote SRMs, the BSRM assumes monitoring of job progress and informs the PSRM if possible. If the BSRM cannot monitor the jobs, the SRM with the lowest id is allowed to monitor the execution of the job. The following parameters are used in the DFTS pseudo-code:
k: the replica threshold, k ≤ n, set based on the system state.
R: the number of sites that have responded to the poll message of the home SRM.
Monitor: the time interval between checks on the health of the replicas.
H: the number of healthy replicas of a running job at any given point in time.

There are two main components of the DFTS policy: job placement and replica management.

Job placement algorithm

1. Poll all sites for availability information.
   (a) IF (R ≥ n) THEN
       i. Choose the best n sites.
       ii. Designate one of them as the backup home SRM and notify the n sites of the backup.
   (b) ELSE
       i. Reserve n − R sites that are expected to finish soon.
   ENDIF
2. IF there are at least n sites THEN
   (a) Send a replica of the job to each site.
   (b) Update the job table and the backup scheduler.
   ENDIF

Description: When a job arrives, DFTS chooses a set of n candidate sites, n ≥ 1, for job execution (including the home site) and orders them by an estimate of the job completion time. Note that even if the home SRM cannot find n candidate sites, the job is still scheduled. When reserved sites become available, they contact the SRM that reserved them. When a PSRM recruits n SRMs on which the job executes, it also sends the identity of the BSRM to these n site managers. Once the candidate sites are selected, the home SRM designates one of the n SRMs as a backup home SRM and sends the identity of the backup SRM to the candidate sites. The home SRM and the backup SRM communicate periodically such that the backup SRM assumes the responsibility of the home SRM in case the latter fails. The home Site Manager (SM) then sends a replica of the job to each site and updates the job table and backup scheduler with this information. In case not all replicas of the job have been scheduled, the home SRM monitors the availability of the sites it reserved by listening to them; as soon as one becomes available, the SRM schedules a replica to that site. If a job successfully completes, the home SRM sends a release message to each site it had reserved so that these sites can be used for running other jobs. In order to avoid a race problem, a timer-based reservation scheme is introduced such that, once a remote SM offers its resources to execute a job, it will not accept any request from another SM until it is released by the Home Site Manager (HSM) that reserved it, or the time limit expires, or the job assigned to it completes. A simplified sketch of this placement step is given below.
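
In the following sketch, the site dictionaries, the availability flags and the completion estimates are illustrative assumptions rather than DFTS data structures or APIs; the function only mirrors the placement decisions described above.

    def place_job(job, sites, n):
        """Pick the best n available sites; reserve busy ones if fewer than n responded."""
        available = [s for s in sites if s["available"]]            # R = len(available)
        busy = [s for s in sites if not s["available"]]
        chosen = sorted(available, key=lambda s: s["est_finish"])[:n]
        shortfall = n - len(chosen)
        reserved = sorted(busy, key=lambda s: s["est_finish"])[:shortfall]
        backup_srm = chosen[0]["name"] if chosen else None          # backup home SRM
        for site in chosen:
            print(f"replica of {job} -> {site['name']} (backup SRM: {backup_srm})")
        for site in reserved:
            print(f"reserved {site['name']}; a replica will follow when it becomes free")
        return chosen, reserved, backup_srm

    sites = [{"name": "A", "available": True, "est_finish": 30},
             {"name": "B", "available": True, "est_finish": 10},
             {"name": "C", "available": False, "est_finish": 5}]
    place_job("J1", sites, n=2)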

Replica management algorithm

1. Prompt all remote SRMs that have not reported job status.
2. Determine the number of healthy replicas (i.e., H).
3. IF any replica is done THEN
   (a) Tell all remote SRMs to terminate their replica.
   (b) Update the job table.
4. ELSEIF (H > k) THEN
   (a) Select the last replica scheduled and terminate it.
   (b) Update the job table.
5. ELSE
   (a) Pick the next site from the candidate set.
   (b) Inform the remote SRM to execute the job.
   (c) Update the job table.
   ENDIF

Description: Every remote SRM running a replica of a job informs the primary SRM of the status of that replica every Monitor interval. The PSRM wakes up immediately if it receives a job-completed message from one of the remote SRMs; in this case, it informs all other remote SRMs to halt execution of the job. In all other cases, the primary SRM periodically checks the application status table to see who missed the status report. It then queries all remote SRMs to obtain machine and network status and to monitor the health of any job replicas running within the site. If a failure of a replica is detected, a replacement site is sought immediately. First, the primary SRM determines the number of healthy replicas, H, and compares it with the replica threshold, k. If H > k, then the last replica to be scheduled is terminated. However, if H < k, a replacement site is sought by polling all sites on which a replica of the job is not running. If a site is found, a replica is forwarded to the site and both the backup SRM and the job table are updated. However, if no site is available, one of the sites on which a replica of the job is not running is reserved. When this site becomes available, it contacts the SRM that reserved it, and if H is still below k, a replica of the job is then sent to this SRM. When a replica of the job successfully completes, the algorithm informs all sites on which a replica is running to terminate it. Also, any site that has been reserved is notified to cancel the reservation. A simplified sketch of this decision logic is given below.
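
In this sketch, the status values, the replica threshold k and the returned action dictionaries are assumptions chosen for readability, not DFTS message formats.

    def manage_replicas(statuses, k):
        """statuses maps site -> "done" | "running" | "failed"; k is the replica threshold."""
        healthy = [s for s, st in statuses.items() if st == "running"]
        done = [s for s, st in statuses.items() if st == "done"]
        if done:
            return {"action": "terminate_all", "sites": healthy}            # job finished somewhere
        if len(healthy) > k:
            return {"action": "terminate_last", "site": healthy[-1]}        # too many replicas
        if len(healthy) < k:
            return {"action": "replace_failed", "count": k - len(healthy)}  # seek new sites
        return {"action": "none"}

    print(manage_replicas({"A": "done", "B": "running", "C": "failed"}, k=2))
    print(manage_replicas({"A": "failed", "B": "running", "C": "failed"}, k=2))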

3.4 FAILURE DETECTION BY HEARTBEAT MECHANISM:


Though the above sections detailed fault-tolerance techniques, failure detectors are an integral part of any fault-tolerant distributed system, so this section presents different failure detectors. However, traditional failure detectors do not perform efficiently when applied to Grid environments: most of the earlier proposed detectors were designed either for local area networks or to handle a small number of nodes, and hence lack scalability and efficiency and have long running times. In most real-life distributed systems, the failure detection service is implemented via variants of the heartbeat mechanism. The heartbeat mechanism considered here is centralized.

3.4.1. Centralized approach

In this approach there is only one centralized monitor that receives the heartbeat signals of all the nodes. If the heartbeat signal is not received from a node within the heartbeat interval, that node is suspected. There are three models in the centralized approach; the push and pull models are described below. All the models have three entities each: a monitorable object, which is the one being monitored for failure; a monitor, for monitoring one or more monitorable objects and detecting any failures; and finally a client, to which the results of monitoring one or more monitorable objects are reported.


Push model

In this model, the monitorable object sends out a heartbeat message at regular intervals to the monitor. If the monitor detects that a message has not arrived within its expected time bounds, then the monitorable object is suspected. This is shown in Fig. 4.

Fig. 4: Push Model

Pull model

In this model, the monitor sends out a liveness request to the monitorable object, and the monitorable object responds to the request. A reply to the liveness request means that the monitorable object is alive; otherwise the monitorable object is suspected. This is shown in Fig. 5.

Fig. 5: Pull Model
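
A minimal push-model monitor can be sketched as follows; the two-second heartbeat interval and the in-process calls standing in for network messages are assumptions for illustration only.

    import time

    class HeartbeatMonitor:
        """Push model: monitorable objects send heartbeats; silent ones are suspected."""
        def __init__(self, interval):
            self.interval = interval
            self.last_seen = {}

        def heartbeat(self, node):
            self.last_seen[node] = time.monotonic()      # heartbeat message received

        def suspected(self):
            now = time.monotonic()
            return [n for n, t in self.last_seen.items()
                    if now - t > self.interval]          # no heartbeat within the interval

    monitor = HeartbeatMonitor(interval=2.0)
    monitor.heartbeat("node1")
    monitor.heartbeat("node2")
    time.sleep(2.5)
    monitor.heartbeat("node2")        # node2 keeps reporting, node1 has gone silent
    print(monitor.suspected())        # -> ['node1']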


3.5 DISADVANTAGES OF FAULT TOLERANCE:


Fault tolerance techniques make a grid system more reliable. Though fault tolerance is a merit for a grid system, it has some disadvantages. The following are some important demerits:

Interference with fault detection in the same or a different component: If a component B depends upon output from A, and B is fault tolerant, then the fault tolerance in B can hide a problem in A; when a small change later causes B to fail suddenly, it may appear that B is the problem although the root problem lies with A.

Reduction in the priority of fault detection: Even if operation is fault tolerant, fault tolerance tends to reduce the perceived importance of repairing the fault. If faults are not corrected and the fault-tolerant component itself eventually fails, then the system will fail completely.

Cost: Fault-tolerant components and redundant components tend to increase cost.


4. CONCLUSION
Fault tolerance is a very important technique in grid computing. Grid computing is more prone to faults because it assembles resources from different sources. Also, as applications grow to use more resources for longer periods of time, they will inevitably encounter an increasing number of resource failures. When failures occur, they affect the execution of the jobs assigned to the failed resources, so a fault-tolerant service is important in grids. Fault tolerance is the ability to preserve the delivery of expected services despite the presence of failures within the grid itself. The categories of failures in grid computing systems include resource failure, network failure, and application failure. Grid applications must have fault-tolerant services that detect faults and resolve them. These services enable applications to carry on their computations on the resources of the grid in case of failure, without terminating the applications. These services must also satisfy the minimum quality of service (QoS) requirements of applications, such as the deadline to complete the execution, the number of computing resources, the type of platform, and so on. Among the different fault-tolerant services, fault tolerance by scheduling is the most beneficial: it allocates resources and schedules jobs on them appropriately, while the heartbeat mechanism is used to detect faults. The algorithms, merits and demerits of the different fault-tolerant techniques have been discussed.


5. REFERENCES
Mohammed Amoon, "A fault-tolerant scheduling system for computational grids", Computers and Electrical Engineering, Elsevier, 2012.
S. Siva Sathya, K. Syam Babu, "Survey of fault tolerant techniques for grid", Computer Science Review, Elsevier, 2010.
Babar Nazir, Taimoor Khan, "Fault Tolerant Job Scheduling in Computational Grid", 2nd International Conference on Emerging Technologies, 2006.
Felix C. Gärtner, "Fundamentals of Fault-Tolerant Distributed Computing in Asynchronous Environments", ACM Computing Surveys, 1999.
www.google.com (The grid computing blog)

