Proceedings of International Conference on Computing Sciences
WILKES100 ICCS 2013
ISBN: 978-93-5107-172-3
Efficient check-pointing techniques for distributed systems
Harjinder Kaur 1* and Rachit Garg 2
1 Lecturer, School of Computer Applications, Lovely Professional University, PB, India
2 Assistant Professor, School of Computer Applications, Lovely Professional University, PB, India

Abstract
A checkpoint is the state of a process saved on stable storage, and checkpointing is a fault-tolerance technique used to recover a system after a failure. A global state is said to be consistent if no local checkpoint in it happened before another. In checkpointing, processes take checkpoints that together form a consistent global state. After a failure, the system restarts its execution from the most recent consistent global state saved on stable storage, so only the computation done after that checkpoint needs to be redone. In this paper, we present some inconsistencies in existing checkpointing protocols for distributed systems and give guidelines for reviewing such protocols [1, 4, 5-10].
2013 Elsevier Science. All rights reserved.
Keywords: Checkpointing, happened before, global state, Clock ordering

1. Introduction
A distributed system is a collection of autonomous, spatially distributed processes that exchange information over communication channels. Each process in a distributed system executes a number of events [1, 2, 3, 4]. The problem is deciding which of two events occurred first. To address this problem we use the partial ordering defined by the happened-before relationship: if event e1 occurs before event e2, this is written e1 -> e2.

Check-pointing in Distributed systems
A checkpoint is defined as the saved state of a process. Checkpointing is difficult to implement in distributed systems because there are multiple streams of execution at a time and there is no global clock [5].
Due to the absence of a global clock it is difficult to start a checkpoint in all streams at the same instant. To permit consistent rollback recovery, the checkpoints from the individual streams are selected so that the selected checkpoints are pairwise concurrent. The following sections describe the concepts that help in selecting one checkpoint per process so that together they form a globally consistent checkpoint, which allows global rollback recovery [4, 6].

* Corresponding Author: Harjinder Kaur
Elsevier Publications, 2013

Ordering of events in distributed systems
In a distributed environment, processors exchange information, which creates dependencies among events on different processors and makes a total ordering difficult to establish. Lamport proposed a solution to this problem, known as the happened-before relation, which introduces a partial ordering of events as an alternative to a total ordering [5-8]. The following definitions describe events and checkpoints in a distributed environment.

Lamport's happened-before relation (Definition 1)
1. If a and b are two events occurring in the same process and a occurs before b, then a -> b.
2. If a is the event of sending a message and b is the event of receiving the same message in another process, then a -> b.
3. If a -> b and b -> c, then a -> c.

Concurrent events (Definition 2)
Two events a and b are said to be concurrent iff neither a -> b nor b -> a.

Local checkpoint (Definition 3)
A local checkpoint is an event that records the state of a process on a processor at a given instant.

Global checkpoint (Definition 4)
A global checkpoint is a collection of local checkpoints, one from each processor.

Consistent global checkpoint (Definition 5)
A consistent global checkpoint Gc is a collection of local checkpoints, one from each processor, such that each local checkpoint is concurrent with every other local checkpoint.
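
Definitions 1, 2 and 5 can be checked mechanically on a small recorded execution. The following Python sketch is only an illustration under stated assumptions: the function names (`happened_before`, `concurrent`, `is_consistent`) and the event names are hypothetical, and the relation is closed under transitivity, which Lamport's relation includes. It is not taken from any of the reviewed protocols.

```python
from itertools import combinations


def happened_before(process_events, messages):
    """Return the set of pairs (a, b) such that a -> b.

    process_events: one list of event names per process, in local order.
    messages: (send_event, receive_event) pairs.
    """
    # Direct edges: local order within a process, and send before receive.
    successors = {e: set() for events in process_events for e in events}
    for events in process_events:
        for earlier, later in zip(events, events[1:]):
            successors[earlier].add(later)
    for send, receive in messages:
        successors[send].add(receive)

    # Transitive closure: depth-first search from every event.
    relation = set()
    for start in successors:
        stack, seen = list(successors[start]), set()
        while stack:
            event = stack.pop()
            if event not in seen:
                seen.add(event)
                stack.extend(successors[event])
        relation.update((start, event) for event in seen)
    return relation


def concurrent(a, b, relation):
    """Definition 2: neither a -> b nor b -> a."""
    return (a, b) not in relation and (b, a) not in relation


def is_consistent(checkpoints, relation):
    """Definition 5: every pair of local checkpoints is concurrent."""
    return all(concurrent(a, b, relation)
               for a, b in combinations(checkpoints, 2))


# Hypothetical run: P1 takes checkpoint c1 and then sends message m;
# P2 takes checkpoint c2a, receives m, and then takes checkpoint c2b.
p1 = ["c1", "send_m"]
p2 = ["c2a", "recv_m", "c2b"]
hb = happened_before([p1, p2], [("send_m", "recv_m")])

print(is_consistent(["c1", "c2a"], hb))  # True: c1 and c2a are unrelated
print(is_consistent(["c1", "c2b"], hb))  # False: c1 -> send_m -> recv_m -> c2b
```

In the second case the checkpoint c2b records the receipt of m while c1 does not record its sending, so rolling back to {c1, c2b} would leave a message that was received but never sent.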

The Partial ordering
The occurrence of events is usually described in terms of time. For example, we say that an event happened at 2:30 if it occurred after the clock read 2:30 and before it read 2:31. If the specification of an event is expressed in terms of physical clocks, the system must contain real clocks in order to observe the events. A system is a collection of processes, each of which is in turn a collection of events. For example, in a communication process, sending a message and receiving a message are considered two different events. We can order these events with the happened-before relationship, written ->. If the sending of a message occurred before its receipt, this is written sending -> receiving [8-12].

Logical clocks
The difficulties related to physical clocks are overcome by using logical clocks. A logical clock is a way of assigning to each event a number that is treated as the time at which the event occurred. The clock Ci of process Pi assigns a number Ci(e) to each event e in that process. Logical clocks are implemented with counters [5, 7]. When there are multiple events, the numbers assigned to them must respect the following Clock Condition:
Clock Condition: For any events e1 and e2, if e1 -> e2 then C(e1) < C(e2); that is, an event that happened before another receives the smaller clock value.
The condition places no constraint on concurrent events, that is, events not related by -> (Definition 2). The Clock Condition is satisfied if the following two conditions hold:
1. If e1 and e2 are events in a process Pi and e1 comes before e2, then Ci(e1) < Ci(e2).
2. If e1 is the sending of a message by process Pi and e2 is the receipt of that message by process Pj, then Ci(e1) < Cj(e2).

Checkpoint algorithm assumptions for message passing system
There are a number of check-pointing algorithms for message-passing systems; here we discuss the one proposed by Chandy and Lamport. According to Chandy and Lamport, the distributed system consists of a finite set of processors and a finite number of channels through which the processors communicate. The algorithm is based on the following assumptions:
1. The distributed system consists of a finite set of processors and a finite set of channels.
2. All communication between processors takes place through the available communication channels.
3. All the channels are fault free.
4. The global state of the system = local states of all the processors + states of the communication channels.
5. The state of a channel is the set of messages sent through the channel but not yet received by the destination.
6. Infinite-capacity buffers are available.
7. Termination of the algorithm ensures fault-free communication.

Types of Check-pointing
In distributed systems different types of checkpoints [5, 7, 8-10] are available. This section describes each type of check-pointing:
Centralized vs. distributed checkpoints: In centralized check-pointing a single node initiates the checkpoints and coordinates with the other participating nodes. The problem with the centralized approach is that all other participating nodes have to take a checkpoint once the central node decides to checkpoint. In distributed check-pointing there is no single initiating node; any individual node can initiate a checkpoint independently.
Complete vs. selective check-pointing/rollback: In complete check-pointing, every node has to participate in every global checkpoint.
In selective check-pointing, only groups of nodes that depend on each other participate in the check-pointing process. Complete rollback forces all the nodes in the system to roll back and restart in order to maintain consistency, whereas in selective rollback only the group of dependent nodes needs to roll back and the others can continue with their operations.
Static vs. dynamic check-pointing: In static check-pointing the locations of the checkpoints are identified before program execution starts. Static check-pointing is best suited to uniprocessor systems. In dynamic check-pointing the locations of the checkpoints are identified during execution of the program by initiating the checkpoint algorithm.
Periodic vs. non-periodic checkpoints: Periodic checkpoint algorithms force the nodes to initiate checkpoints at predetermined times, whereas non-periodic algorithms do not. The cost incurred by periodic algorithms lies in constructing a globally consistent state, which is not the case for non-periodic algorithms, in which there are no concurrent checkpoints.

Example of Distributed System
To illustrate the definition of a distributed system, consider a system consisting of two processes S1 and S2 and two channels D1 and D2.
Fig 1: The simple distributed system
To illustrate the single-token conservation system, consider a system containing a single token that is passed from one process to another. Each process has two states, T1 and T2, where T1 is the state in which the process does not hold the token and T2 is the state in which it does.
Fig 2: State transition diagram of a process
Fig 3: Global states and transitions of the single-token conservation system
In the above example only one token is passed from one process to another.
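
The single-token system, together with the assumption that the global state consists of the local states plus the channel states, can be sketched in a few lines of Python. This is a minimal illustration, not an implementation from the paper. Several details are assumptions made only for the sketch: the class name `TokenSystem`, the direction of the channels (D1 from S1 to S2, D2 from S2 to S1), and S1 holding the token initially. Each process also keeps a counter-based logical clock, updated with the usual Lamport rule, so that every send is stamped earlier than the matching receive.

```python
from collections import deque

T1, T2 = "T1", "T2"  # T1: process does not hold the token, T2: it does


class TokenSystem:
    """Two processes S1 and S2 joined by channels D1 and D2.

    Assumptions for illustration only: D1 carries messages from S1 to S2,
    D2 from S2 to S1, and S1 holds the token initially.
    """

    def __init__(self):
        self.state = {"S1": T2, "S2": T1}
        self.clock = {"S1": 0, "S2": 0}  # one logical clock (counter) per process
        self.channel = {"D1": deque(), "D2": deque()}
        self.out_channel = {"S1": "D1", "S2": "D2"}
        self.in_channel = {"S1": "D2", "S2": "D1"}

    def send(self, p):
        """Transition T2 -> T1: p places the token on its outgoing channel."""
        assert self.state[p] == T2, "only the token holder can send"
        self.clock[p] += 1  # tick for the send event
        # The token carries the timestamp of the send event.
        self.channel[self.out_channel[p]].append(self.clock[p])
        self.state[p] = T1
        return self.clock[p]

    def receive(self, p):
        """Transition T1 -> T2: p takes the token from its incoming channel."""
        sent_at = self.channel[self.in_channel[p]].popleft()
        # The receive is stamped later than the send and any local event.
        self.clock[p] = max(self.clock[p], sent_at) + 1
        self.state[p] = T2
        return self.clock[p]

    def global_state(self):
        """Local states plus the number of messages in transit per channel."""
        in_transit = {name: len(q) for name, q in self.channel.items()}
        return dict(self.state), in_transit

    def token_count(self):
        """Tokens held by processes plus tokens in transit on channels."""
        held = sum(1 for s in self.state.values() if s == T2)
        return held + sum(len(q) for q in self.channel.values())


system = TokenSystem()
stamps = []
for sender, receiver in [("S1", "S2"), ("S2", "S1")]:
    sent = system.send(sender)
    assert system.token_count() == 1  # the token is in the channel
    received = system.receive(receiver)
    assert system.token_count() == 1  # the token is held by the receiver
    stamps.append((sent, received))

print(stamps)                 # [(1, 2), (3, 4)]
print(system.global_state())  # ({'S1': 'T2', 'S2': 'T1'}, {'D1': 0, 'D2': 0})
```

Counting the token only in the process states would report zero tokens while the token is in transit, which is why the channel contents are treated as part of the global state.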
Each process has two events: 1) sending the token, which is a transition from T2 to T1, because T2 is the state holding the token; 2) receiving the token, which is a transition from T1 to T2. The global states and transitions are shown in Fig. 3.

Conclusion
We have pointed out some inconsistencies in some checkpointing protocols for distributed systems, investigated the problems [1, 8, 9, 11-13], and presented some findings to avoid them. Our work will help in the further design of efficient checkpointing techniques that are free of inconsistencies [10].

References
[1] Garg, R., Checkpointing with light Checkpoints for Mobile Distributed Computing System, International Conference on Advanced Computing and Communication Technologies, 16th November, 2013, InderScience Publishers, Geneva, Switzerland and Guangdong University of Technology, China (Accepted).
[2] Malhotra, N., Garg, R., Mahajan, R., Quantitative Detection of AODV against Black Hole and Worm Hole Attacks in MANET, International Journal of Computer Application, Volume 68, Number 11, January 2013 (Foundation of Computer Science, New York, USA).
[3] Thind, T., & Garg, R., "Mobile distributed system: concepts, issues, challenges", National Conference on Emerging Trends in Computer Science & Engineering (ETCSE-2012), 11th-12th May, 2012, Guru Kashi University, Talwandi Sabo, Punjab, India.
[4] Khunteta A., Sharma P., & Garg R., New and efficient Low Overheads Algorithm for Mobile Distributed Systems, ICWET11, ACM Digital Library, New York, USA, February 2011.
[5] Garg, R., & Kumar P., A Review of Fault Tolerant Checkpointing Protocols for Mobile Computing System, International Journal of Computer Applications, Vol 3, No 2, June 2010 (Foundation of Computer Science, New York, USA).
[6] Kumar P., & Garg R., An Efficient Synchronous Checkpointing Protocol for Mobile Distributed Systems, Global Journal of Computer Science and Technology, Vol. 10, Issue 5, June/July 2010.
[7] Garg, R., & Kumar P., A Review of Checkpointing Fault Tolerance Techniques in Distributed Mobile Systems, International Journal on Computer Science and Engineering, June/July 2010.
[8] Garg, R., & Kumar P., A Non-blocking Coordinated Checkpointing Algorithm for Mobile Computing System, International Journal of Computer Science Issues, Vol 3, Issue 3, No 3, May 2010.
[9] Garg, R., & Kumar P., Low Overhead Checkpointing Protocols for Mobile Distributed Systems: A Comparative Study, International Journal on Engineering Science and Technology, June/July 2010.
[10] Kumar P., & Garg R., Soft-Checkpointing Based Coordinated Checkpointing Protocol for Mobile Distributed System, International Journal of Computer Science Issues, Vol 3, Issue 3, No 5, May 2010.
[11] Garg, R., Sensor Networks: Opportunities and Challenges, Computer Society of India, CSI-Communications (Monthly Journal), Volume No. 30, Issue No. 6, September 2006, pp. 50-54.
[12] Garg, R., Singh, M. and Singh, Baldev, Sensor Networks: Technology Trends, Proc. Second National Conference on Electronic Circuits and Communication Systems ECCS-2006, February 9-10, 2006, Thapar Institute of Engineering and Technology (Deemed University), Patiala, India, pp. 433-437.
[13] Kuljeet Kaur, Technologies to Overcome from Intimidation of Wireless Network Security, International Journal of Applied Information Systems 2(1):25-29, May 2012. Published by Foundation of Computer Science, New York, USA.