978-1-4799-2572-8/14/$31.00 © 2014 IEEE
benefits offered by Cloud Computing are so immense that it has brought a new dimension to providing services and resources to its users.

A. Key Characteristics of Cloud Computing

• Improves agility, as users can easily and inexpensively provision technological resources.
• Reduces cost by converting capital expenditure to operational expenditure. End users do not need to purchase and manage hardware, servers, etc.
• Can be accessed from any location using a variety of devices with internet connectivity, such as smartphones, laptops, and PCs.
• Multi-tenancy enables sharing of resources and costs across a large pool of users, allowing increased peak capacity and efficient usage of under-utilized resources.
• Provides scalability via on-demand provisioning of resources in real time.

Even though Cloud Computing is a general trend in all industries, there are issues in the Cloud which need to be addressed. One such issue is ensuring continuous reliability and guaranteed availability of the services provided by Cloud Computing. Although the services offered by the Cloud go beyond the traditional approach, benefits are always accompanied by some risks and failures. For example, Amazon's Elastic Compute Cloud (EC2) experienced a failure in Elastic Block Storage (EBS) drives and network configuration, bringing down thousands of hosted applications and websites for 24-72 hours [1].

1) Data Failures: This type involves failures due to corruption of data, missing source data, and other flaws in the data.

2) Computation Failures: This type involves all kinds of hardware or infrastructure failures, such as faulty or slow VMs, storage access exceptions, etc.

Most of the applications hosted by the Cloud are real-time High Performance Computing (HPC) systems which require a higher level of fault tolerance. According to studies conducted by Schroeder and Gibson [3], most faults in the Cloud occur due to hardware failures, mainly in processors, hard disk drives, integrated circuit sockets, and memory. A large number of processors are provisioned in the Cloud. These processors create virtual instances, communication links, integrated circuit sockets, etc. One way or another, these processors are prone to failure. It has been predicted that a system with 100,000 processors will experience a processor failure every few minutes [4]. Besides hardware faults, there are software faults causing application failure and network faults due to server overload, network congestion, etc., which inhibit communication between the Cloud and the end users. So there is a need for a mature fault tolerance method which manages faults of diverse kinds.

This paper is organized as follows. Section II describes the concepts of fault tolerance, Section III summarizes the related works and draws an analytical comparison among different FT models, and finally Section IV presents the conclusion.
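To make the scale of the processor-failure prediction cited above concrete, the following sketch estimates how often a large system sees a processor failure. It is only a back-of-the-envelope model, not the method used in [4]: it assumes processors fail independently with exponentially distributed lifetimes, so the per-processor failure rates add up. The function name and the one-year per-processor MTBF are illustrative assumptions.

```python
# Rough estimate of the mean time between processor failures in a large system.
# Assumption (not from the paper): processors fail independently with
# exponentially distributed lifetimes, so their failure rates simply add up.

MINUTES_PER_YEAR = 365 * 24 * 60


def system_mtbf_minutes(processor_mtbf_minutes: float, n_processors: int) -> float:
    """Mean time between processor failures anywhere in the system."""
    if n_processors <= 0:
        raise ValueError("n_processors must be positive")
    # Each processor fails at rate 1 / MTBF; n independent processors fail
    # at n times that rate, so the system-wide MTBF shrinks by a factor of n.
    return processor_mtbf_minutes / n_processors


if __name__ == "__main__":
    # Hypothetical per-processor MTBF of one year (an assumed value).
    mtbf = system_mtbf_minutes(MINUTES_PER_YEAR, 100_000)
    print(f"Expected time between processor failures: {mtbf:.2f} minutes")
```

Under these assumed numbers the estimate is roughly five minutes between failures, which is the same order of magnitude as the "every few minutes" prediction.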
II. FAULT TOLERANCE – AN OVERVIEW
In a conventional system, Fault Tolerance (FT) deals with the quick repair and replacement of faulty devices to keep the system running, whereas in Cloud Computing, fault tolerance is the ability of the Cloud to withstand abrupt changes that occur due to hardware faults, software faults, network congestion, etc.
Jameela Al-Jaroodi et al. [18] propose a delay-tolerant fault tolerance algorithm which adapts to failures by effectively reducing execution time, thus minimizing the fault discovery and recovery overhead in the Cloud. The authors claim that the algorithm can be used efficiently in environments like the Cloud which handle distributed tasks. According to them, the algorithm ensures that data is downloaded reliably from replicated servers and that applications execute efficiently on multiple independent distributed servers in the Cloud.

Yilei Zhang et al. [19] propose BFTCloud, a Byzantine Fault Tolerant framework for Cloud Computing. Replication is used to provide the basic fault tolerance. In addition, BFTCloud selects voluntary nodes based on QoS characteristics and reliability performance. According to the authors, their extensive experiments on various types of Cloud environments show that BFTCloud guarantees robustness of systems when up to f of a total of 3f + 1 resource providers are faulty, including crash faults, arbitrary behaviour faults, etc. However, Giuliana Santos Veronese et al. [20] claim that it is possible to reduce the number of replicas to 2f + 1 while preserving the same properties as traditional BFT algorithms. This is achieved by using a simple trusted service, and the reduction in replicas in turn reduces the cost of infrastructure in the Cloud. Peter Garraghan et al. [21] also discuss harnessing the potential and feasibility of Byzantine fault tolerant systems by developing a framework called FT-FC and applying it to federated Clouds.

Yuesheng Tan et al. [22] suggest a better fault tolerant system compared to the traditional Byzantine fault tolerant algorithm. They developed a virtualization intrusion tolerance system by adopting a hybrid fault model. This FT model involves active and passive replicas, updating the state of the system, state transfer, and proactive recovery. Their results show that the system allows tolerating f faulty replicas

JiSu Park et al. [24] concentrate on mobile Cloud Computing and provide a monitoring technique for fault tolerance. In mobile Cloud Computing, mobile devices are used as resources, which are unstable because their state information changes dynamically. Based on a Markov chain model, a monitoring technique is created to collect the state information. This state information is necessary to calculate the reliability of fault tolerance in mobile Cloud Computing. With this technique, it is possible to change the monitoring time interval dynamically.

Guisheng Fan et al. [25] put forward a model-based Byzantine fault detection technique. In this technique, a Cloud Computing Fault Net (CFN) is created to model the different components of Cloud Computing, such as service resources, detection, and the failure process. Petri nets are used to create the different components of Cloud Computing, which are integrated dynamically into the CFN model. Based on the CFN model, the properties of the components are analyzed to develop a fault detection strategy at each level, which dynamically detects faults in the execution process.

Thanyalak Chalermarrewong et al. [8] propose a fault management framework which places emphasis on hardware fault tolerance. An ARMA model with a fault tree and fault analysis technique is employed as a proactive fault tolerance technique to predict system failures. Based on the prediction, the resource manager decides whether the machine requires task migration to prevent possible failures. In order to get accurate prediction results, the framework includes a model adequacy checking function which can be used to adjust the prediction model as required.

In another paper, Magdalena Slawinska et al. [26] focus on heterogeneity in Cloud Computing with a system named Unibus. The paper discusses how to employ Unibus to orchestrate the resources
and provide a fault tolerance platform capable of executing messages using the Message Passing Interface (MPI). In order to support fault tolerance in Unibus, Distributed MultiThreaded Checkpointing (DMTCP) is used, which enables checkpointing at the end user level.

Sheheryar Malik et al. [27] propose a fault tolerance model for real-time Cloud Computing. In this model, faults are managed based on the reliability of processing nodes or virtual machines. According to the authors, the reliability of nodes changes in every computational cycle. The proposed fault tolerance model collects and analyses the performance or reliability metrics of a particular virtual machine. If a VM produces the correct results within the stipulated time, that node or VM is considered a worthy node and its reliability increases. There is a minimum reliability value which a node must meet to be considered a worthy, fault tolerable VM. If a node fails to produce the correct result within the specified time, its reliability decreases and the system undergoes backward recovery or a safety measure.

Finally, Pranesh Das et al. [28] propose a Virtualization and Fault Tolerance (VFT) technique which increases system availability and reduces service time. This reactive fault tolerant technique consists of a Cloud Manager (CM) module and a Decision Maker (DM), which are used to manage virtualization and load balancing and to handle faults. The first step involves virtualization and load balancing, and in the second step fault tolerance is achieved by redundancy, checkpointing, and a fault handler. The virtualization includes a fault handler. Not all faults are recoverable. The fault handler finds the unrecoverable faulty nodes and excludes these virtual nodes from future requests or usage. It also helps to remove temporary software faults from recoverable nodes, making them available for future requests.

Based on some metrics obtained from these FT models, an analytical comparison is made of some of the commonly used FT models. A number of parameters, or FT properties, are used to analytically evaluate these FT models. Table I illustrates the comparison among FT models based on these parameters. The different parameters are:

1. Type of FT technique – which can be proactive or reactive.
2. Performance – the efficiency of the system.
3. Response Time – the amount of time required to respond to a particular procedure or algorithm. The value should be minimal.
4. Reliability – the aim of giving accurate results within a real-time environment.

IV. CONCLUSION

Over the past years, Cloud Computing has become a popular computational technology across all industries. The Cloud brings vast advantages, such as access to large amounts of data and resources, on-demand service provisioning, and reduced cost of managing the infrastructure, setting it apart from other technologies. As the famous quote says, "with great power comes great responsibility": Cloud Computing, with its immense benefits, has to ensure continuous reliability and guaranteed availability of the services provided. So there is a need for an efficient fault tolerance method which shields the Cloud from faults or failures. In this paper, we concentrate on the standard fault tolerance concepts in Cloud Computing. Since Cloud Computing is a new field of research compared to other technologies, a lot of research is being carried out, especially on developing a standalone fault tolerance method. Numerous FT methods have been proposed by researchers in this field. Our ultimate aim is to analyze these FT methods, understand their limitations, and develop an FT method which manages all types of faults of diverse kinds.

V. REFERENCES

[1] R. Jhawar, V. Piuri, and M. D. Santambrogio, "A comprehensive conceptual system-level approach to fault tolerance in cloud computing." In Systems Conference (SysCon), 2012 IEEE International, pp. 1-5. IEEE, Mar 2012.
[11] R. Jhawar, V. Piuri, and M. D. Santambrogio, "Fault tolerance management in IaaS clouds." In Satellite Telecommunications (ESTEL), 2012 IEEE 1st AESS European Conference on, pp. 1-6. IEEE, 2012.

[12] R. Jhawar, V. Piuri, and M. D. Santambrogio, "Fault Tolerance Management in Cloud Computing: A System-Level Perspective," Systems Journal, IEEE, vol. 7, no. 2, pp. 288-297, June 2013.

[13] S. Sidiroglou, O. Laadan, C. Perez, N. Viennot, J. Nieh, and A. D. Keromytis. "Assure: automatic software self-healing using rescue points." In ACM Sigplan Notices, vol. 44, no. 3, pp. 37-48, 2009.

[14] G. Chen, H. Jin, D. Zou, B. B. Zhou, W. Qiang, and G. Hu. "SHelp: Automatic Self-healing for Multiple Application Instances in a Virtual Machine Environment." In Cluster Computing (CLUSTER), 2010 IEEE International Conference on, pp. 97-106. IEEE, 2010.

[25] G. Fan, H. Yu, L. Chen, and D. Liu. "Model Based Byzantine Fault Detection Technique for Cloud Computing." In Services Computing Conference (APSCC), 2012 IEEE Asia-Pacific, pp. 249-256. IEEE, 2012.

[26] M. Slawinska, J. Slawinski, and V. Sunderam. "Unibus: Aspects of heterogeneity and fault tolerance in cloud computing." In Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on, pp. 1-10. IEEE, 2010.

[27] S. Malik and F. Huet. "Adaptive Fault Tolerance in Real Time Cloud Computing." In Services (SERVICES), 2011 IEEE World Congress on, pp. 280-287. IEEE, 2011.

[28] P. Das and P. M. Khilar. "VFT: A virtualization and fault tolerance approach for cloud computing." In Information & Communication Technologies (ICT), 2013 IEEE Conference on, pp. 473-478. IEEE, 2013.

[15] L. Wu, B. Liu, and W. Lin. "A Dynamic Data Fault-Tolerance Mechanism for Cloud Storage." In Emerging Intelligent Data and Web