MASARYKOVA UNIVERZITA
FAKULTA INFORMATIKY
DIPLOMA THESIS
Martina Kollarova
Acknowledgement
I hereby give my thanks to my advisor, Marek Grac, for help with the organization of this work, my colleagues Attila Darazs, Peter Belanyi, Tal Kammer and Fabio Di Nitto for the technical advice, and Red Hat for the resources that allowed me to create this.
Abstract
Keywords
Contents
1 Introduction
2 OpenStack and Fault Tolerance
  2.1 Terminology
  2.2 Overview of the OpenStack Ecosystem
  2.3 About Fault Tolerance
    2.3.1 Sources of Failures
  2.4 Highly-available OpenStack
    2.4.1 OpenStack Swift
3 Problem Analysis
  3.1 About Fault-injection Testing
    3.1.1 OpenStack and Fault-injection Methods
  3.2 Related Tools
    3.2.1 Tempest
    3.2.2 Chaos Monkey
    3.2.3 Gigan
    3.2.4 ORCHESTRA
    3.2.5 ComFIRM
    3.2.6 Tools for Simulating Disk Failures
4 Test Design
  4.1 OpenStack Swift Tests
  4.2 High-availability Tests
    4.2.1 VM Creation and Scheduling
  4.3 Other Test Ideas
5 Framework Design and Implementation
  5.1 Support of Multiple Topologies
  5.2 Virtualization
  5.3 State Restoration
    5.3.1 Manual
    5.3.2 Snapshots
  5.4 System Deployment Script
  5.5 The Framework's Capabilities and Drawbacks
    5.5.1 Unimplemented features
  5.6 Implementation of Fault-injection Tests
    5.6.1 Results
6 Conclusion
Appendix A Attachments
Appendix B User tutorial
  B.1 Requirements
  B.2 Running the tested system in VirtualBox
  B.3 Running the tested system inside OpenStack VMs
Bibliography
Index
Chapter 1
Introduction
Chapter 2
OpenStack and Fault Tolerance
1. https://openstack.org/
2.1 Terminology
These are the most common terms used throughout this work. Many of them were defined in the paper by Laprie [10], which sets the terminology for fault-tolerant computing.

downtime - when a user cannot get his work done, the system is considered down [11] (also called an outage)
When a user wants to create a new virtual machine through the Nova API, the request is first sent to Keystone to authenticate him. Afterwards, the Nova scheduler finds a server with a hypervisor3 that has free resources where it could run the VM. A request is sent to Neutron, which connects it to the networks the user selected. Cinder creates block storage (a disk) for it, and Glance finds the image that was chosen by the user, let's say a Fedora Linux installation image (which could be stored directly on the file system, in Swift, or in some other storage system), and the VM is booted from it. Most of the communication between the servers is done through a messaging service that implements the Advanced Message Queuing Protocol (AMQP), and the state of the virtual machines is kept in an SQL database.
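The user-facing side of this workflow can be illustrated with the python-novaclient library; the following is only a rough sketch of the era's client API, and the endpoint, credentials, image and flavor names are placeholder values, not part of any deployment described here.

from novaclient.v1_1 import client

# Hedged sketch: booting a VM through the Nova API. All values are examples.
nova = client.Client("admin", "123456", "demo",
                     "http://controller:5000/v2.0/")
image = nova.images.find(name="Fedora")        # image registered in Glance
flavor = nova.flavors.find(name="m1.small")
server = nova.servers.create(name="test-vm", image=image, flavor=flavor)
print(server.status)  # scheduling, networking and storage happen behind the API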
Other types of failures that can be encountered are disk failures or network connectivity issues (e.g., an overloaded networking switch), and failures caused by the environment (flood, fire). A network partition happens when a failure in the network causes the system to be split into parts that are unable to communicate with each other.
Mean time to failure, or MTTF, is a commonly used measurement [6] of availability, defined as

    MTTF = uptime / number of failures.
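As a small worked example (with made-up numbers, purely to illustrate the formula):

# Illustration of the MTTF formula with invented numbers.
uptime_hours = 6 * 30 * 24        # a node observed for roughly six months
number_of_failures = 2            # failures seen during that period
mttf = uptime_hours / float(number_of_failures)
print(mttf)                       # 2160.0 hours between failures on average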
A study by Google [6] characterized the availability properties of a cloud
storage system and found that the MTTF of a node is 4.3 months and less
than 10% of events (failures) last longer than 15 minutes. They also noted
that a large number of failures are correlated, for example because of power outages or rolling upgrades of the system (a scheduled gradual upgrade of the system). Another study [9] found that the MTTF of a disk was 10-50 years, but the annual failure rate (the probability that a component fails during one year) of storage systems was between 2% and 4%, or even more with certain kinds of disks. This is because disk failures are not always the dominant factor that causes issues: disks contribute to 20-55% of storage subsystem failures, while physical interconnects (cables, networks, power outages) contribute 27-68%.
In the book by Marcus and Stern [11], statistics show that unplanned downtime is caused by system software bugs in 27% of the cases, by hardware in 23%, human error 18%, network failures 17%, and by natural disasters in 8% of the collected cases.
The drawback of an active/passive setup is that it takes time to detect the failure and replace the service with the backup. To make the replacement faster, the service can already be kept running on the backup node, which is called hot standby. In an active/active setup, there is also a backup server for a service, but both of them are used concurrently. The difficulty with active/active is that the state of all the redundant services has to be kept in sync [11, 15]. A load balancer (in OpenStack, HAProxy is commonly used) manages the traffic to these systems, ensuring that the operational systems handle the requests. The replacement of a failed node (also called failover) is provided in OpenStack by Pacemaker [16].
Figure 2.2 shows how Red Hat plans to make OpenStack highly available. Some of the services are shown as single nodes, either because they don't need to be highly available (e.g. Foreman, which is a deployment and administration tool), or because they are managed by the components themselves: Swift has HA capabilities of its own (see Section 2.4.1), Compute (Nova) nodes are taken care of by the Nova scheduler, and the availability of tenant instances is the responsibility of the end-users. MongoDB and MariaDB (a fork of MySQL) represent the databases.
Mirantis describes [12] an alternative HA topology using similar concepts and tools.
4. Source: https://github.com/fabbione/rhos-ha-deploy
5. https://aws.amazon.com/s3/
6. Horizontal scaling means that more nodes are added to the system, in contrast to vertical scaling, which adds more power to the existing nodes.
Chapter 3
Problem Analysis
The goal is to make sure that the OpenStack system can handle failures. Even though it may seem to be highly available, the first real fault can prove that belief wrong. Since the failures that could trigger the fault-tolerance mechanisms are infrequent, we need to simulate them to verify the system.
Imagine a person, let us call him Mallet,1 testing the system's fault tolerance. Mallet installs a part of the OpenStack components to be highly available, since installing all of them in an HA arrangement is an endeavour expensive in both time and resources. He starts randomly restarting some services and doing other damage to the system, always checking whether it still works from the normal user's point of view. He could also try running part of the test suite (see Section 3.2.1) and checking the system messages for errors.
Doing so would take him anywhere between hours and days, and once he finds an unexpected error, he might not be sure what caused it: was it just the last injected fault, or the previous one (which he didn't notice at the time), or an unlucky combination of them? Unsure of what exactly happened, he installs another system with the same topology and tries to repeat the latest actions.
If Mallet is lucky, he will reproduce the error and report a bug about it. He will provide the developers with a reproducer: a list of commands or a script that re-creates the problem, and information on how to set the system up the way he did. But even in this optimistic case, it could take the developer hours of her time to reproduce the problem, and then even more when the bug fix has to be verified.
However, if he is unlucky, he won't be able to easily re-create the problem. He will try repeating some of the other failures he caused before, trying them out in a different order, looking into the system messages for hints as to what happened. Yet if he encountered a race condition that only appears infrequently, he is out of luck: the error could prove forever elusive, or will only occur once a year on some customer's setup, causing a big outage.
1. Mallet is a name used in cryptography for the person who is a malicious attacker, in contrast to Eve, who is usually just a passive eavesdropper.
To make testing easier, we need some tool that can do all of Mallet's actions automatically. This tool has to provide:

repeatability - it needs to be possible to automatically repeat the conditions under which the error occurred
3.1 About Fault-injection Testing

The term fault injection covers a range of testing techniques, all the way from white-box testing (on the level of source code) to black-box testing (without peering into the internal workings); from damaging the pins on a chip to giving the program random input. Most of the existing techniques fall into five main categories [20], depending on what they are based upon:

hardware - accomplished at the physical level, for example by heavy-ion radiation or by modifying the values on the pins of the circuit; also called HWIFI (Hardware-Implemented Fault Injection)
Software-based fault-injection testing is the method best suited for OpenStack. Hardware-based testing is meant for circuits, there is no formal model of OpenStack on which to do simulation-based testing, and the system is probably too complex to efficiently create even an abstract formal model of it. Emulation-based testing is meant for VHDL model testing, which is inapplicable here.

The high availability of OpenStack has to be tested with higher-level faults than usual. Examples of such faults are a server shutdown, a network partition or a disk error. The tests could be named high-availability tests, but this is not a commonly used term. Protocol fault injection could be used to test the communication, but it would be focused on the messaging service. This work combines features of black-box testing (it doesn't know how things are implemented) and white-box testing: it partially looks into internal structures (e.g. it checks the replica count by directly accessing the Swift objects in the tests described in Section 4.1). This type of testing is called gray-box testing.
Compile-time fault injection of OpenStack would be possible, but it is not the focus of this work: we are interested in the fault tolerance and high availability of the whole system, not in the small parts and components on which this testing method focuses. This is because it shouldn't matter to the system that a single service crashed (see Section 2.3). Nonetheless, it can become a problem if a service doesn't crash, but starts sending incorrect information to neighboring components. This kind of fault would best be simulated by protocol fault injection, because the services all communicate through the REST API or AMQP.
3.2 Related Tools

There are existing tools that test OpenStack and there are frameworks for fault-injection testing, but until now they didn't intersect. This section provides an overview of SWIFI tools that somehow relate to this work, and of existing OpenStack tests. We combine a few concepts from these and could potentially use some of them as external libraries.

Most of the existing fault-injection tools are specific to some proprietary technology. For example, ORCHESTRA (see Section 3.2.4) was originally developed for the Mach operating system and later ported to Solaris. Nevertheless, we can use the concepts from these tools, and this section compares them to the design of this work.
3.2.1 Tempest
The main testing framework of OpenStack is called Tempest2 and is an open-source project with more than 2000 tests. Its design principles3 require that the tests only access the public interfaces: no direct queries to the database or remote commands to the servers are allowed, thus it is only meant for black-box testing. It also strives to be topology independent, therefore the tests don't know whether OpenStack is installed on a single node or on hundreds of nodes; neither is it possible to find out how many servers with a specific service there are, or to gain access to them.

These design principles make it impossible to create fault-injection tests with Tempest, because to inject failures you need to have access to the machines (e.g., to simulate a disk failure, restart a service, etc.) and to know where certain services are installed. Without control of the servers, it isn't possible to restore the state of the system after an injected fault. Thus, this framework is unsuitable for the type of testing we need.

However, Tempest could be used as an external tool to verify that a test was successful and the system is still in working order. After each fault injection, a relevant subset of Tempest could be run to see if the tested components respond correctly to API calls. Only a relevant part should be run, because the whole test suite is big and therefore slow (it takes more than 30 minutes to run all the tests).
2. https://github.com/openstack/tempest
3. http://docs.openstack.org/developer/tempest/overview.html#design-principles
4. https://github.com/Netflix/SimianArmy/wiki/Chaos-Monkey
3.2.3 Gigan
3.2.4 ORCHESTRA
3.2.5 ComFIRM
The ComFIRM tool [4] is similar to ORCHESTRA, but inserts code directly into the Linux kernel, into the message exchange subsystem. It also supports message omission and timing faults, but is architecture independent and able to run on versions 2.4 and 2.6 of Linux. It would therefore be a good choice as an external tool for our framework in the future, should it become desirable to create protocol fault-injection tests of the messaging service or the REST API.
3.2.6 Tools for Simulating Disk Failures

These tools will be necessary to simulate disk errors for the Swift tests (see Section 4.1).
This will overwrite the beginning of the disk with random bytes, thus damaging the partition table and file system. A similar approach can also simulate a full disk, but we could also use /dev/full, which is a special device that always returns the error "No space left on device".
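A minimal sketch of this kind of disk damage is given below, wrapped in Python since the framework drives the servers with commands anyway; the device name and the number of megabytes are arbitrary example values, and the command is destructive, so it should only ever target the extra disks of a disposable test VM.

# Overwrite the beginning of a disk with random bytes, destroying the
# partition table and file system metadata (destructive - test VMs only).
import subprocess

def damage_disk(device="/dev/vdb", megabytes=10):
    subprocess.check_call(["dd", "if=/dev/urandom", "of=" + device,
                           "bs=1M", "count=%d" % megabytes])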
5. https://www.kernel.org/doc/Documentation/fault-injection/fault-injection.txt
6. http://blog.wpkg.org/2007/11/08/using-fault-injection/
Chapter 4
Test Design
The design of the framework was preceded by the design of the tests, to create an estimate of what tools and functions it should provide. Not all of the tests were implemented, mostly because of the lack of a good OpenStack deployment tool that would be able to install the required system topologies. You can find more information on the implementation and results of the tests in Section 5.6.
4.1 OpenStack Swift Tests

The test design process started with OpenStack Swift, the object storage service, because reliability is its main focus and it has high-availability capabilities of its own (see Section 2.4.1). Therefore, parts of it can be tested without a complicated HA setup, which is useful especially because the deployment tools are in development and often have problems deploying highly-available setups.
As described in Section 2.4.1, replicas are copies of objects that are kept for data redundancy. Usually there are three replicas of each object, and this is what the following tests assume. When a disk fails, the system recognizes it and creates another replica in a different location so that there are three copies of each object again; this process is called replica regeneration here.

In all the tests, there is a set time limit for the replica regeneration and the test should fail if it takes any longer than that. At the beginning of each test, Swift should already be populated with some random objects and have all the replicas correctly distributed and consistent. Since it would be difficult to reverse-engineer a file that would get saved by Swift into the location we are testing and observing, there should be enough files uploaded so that each disk has at least one. The same is true when a test adds additional data to Swift. Ideally, a tool should be created to provide statistics on how the files are distributed, and if some disk doesn't have any data, more would be uploaded until it does.
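Populating Swift with random objects could look roughly like the following sketch, which uses the python-swiftclient library; the endpoint, credentials, container name and object count are example values only.

# Hedged sketch: upload a number of small random objects so that every disk
# is likely to hold at least one replica. All values are examples.
import os
from swiftclient import client as swift_client

conn = swift_client.Connection(authurl="http://192.168.33.11:5000/v2.0/",
                               user="admin", key="123456",
                               tenant_name="admin", auth_version="2")
conn.put_container("destroystack-test")
for i in range(100):
    conn.put_object("destroystack-test", "object-%03d" % i,
                    contents=os.urandom(1024))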
Figure 4.1: Minimum OpenStack topology required for the replication tests (a node with the Swift proxy and Keystone, and the Swift data servers).
1 upload random files to Swift
2 damage disk on data_server[0]
3 start_time = get_time()
4 while (get_swift_replica_count() < 3):
5     current_time = get_time()
6     if current_time - start_time > timeout:
7         fail test
Listing 4.1 demonstrates a basic replication test. The term "damage disk" on line 2 can be implemented by force-unmounting some disk or by using the dd command as described in Section 3.2.6. The test could be repeated in the same form, but with some tool that creates the disk errors at random. In that case, the timeout for the replica regeneration should be increased, since it could take Swift a while to recognize those errors. The same could be done for the rest of the tests in this section.
Whether the test succeeded has to be verified by checking that each object has the appropriate number of replicas and that the contents of the objects are correct. Since Swift doesn't provide any API to check this (and it should not be trusted even if it existed), it is necessary to look into the Swift object ring to find where the objects are, and download each of them directly from the data server. The administration tool swift-get-nodes1 should be used for this, since it takes the hash of the object and the ring file as input and produces direct links to the files as output. In the example test in Listing 4.1, this is done by the get_swift_replica_count function on line 4, and the test fails if it doesn't recover into three replicas for each object.
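A runnable skeleton of the polling loop from Listing 4.1 might look as follows; get_swift_replica_count is the helper assumed by the pseudocode and is left unimplemented here, since it wraps swift-get-nodes and direct downloads from the data servers.

import time

TIMEOUT = 360  # seconds; corresponds to the "timeout" value in the configuration

def wait_for_replica_regeneration(expected_replicas=3):
    """Fail the test if the replicas do not regenerate within the timeout."""
    start_time = time.time()
    while get_swift_replica_count() < expected_replicas:
        if time.time() - start_time > TIMEOUT:
            raise AssertionError("replicas did not regenerate in time")
        time.sleep(5)  # poll periodically instead of busy-waiting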
 1 upload random files to Swift
 2 damage disk on data_server[0]
 3 damage disk on data_server[1]
 4 start_time = get_time()
 5 # wait until at least a part of the replicas is recovered
 6 while (get_swift_replica_count() < 2):
 7     current_time = get_time()
 8     if current_time - start_time > timeout:
 9         fail test
10
11 damage disk on data_server[0]
12 while (get_swift_replica_count() < 3):
13     current_time = get_time()
14     if current_time - start_time > timeout:
15         fail test
The test in Listing 4.2 uses the same concepts as the previous one. First it injects failures into two disks on different data servers, since Swift places replicas as far away from each other as it can and we want to be sure we damage at least two replicas of an object. After this, the replicas should recover from their third copy onto handoff nodes, but we only have to wait until there are two of each file. This should make it feasible to recover even if a third disk gets damaged, which is done on line 11. In this case, it can be done on either of the two data servers, but if there are more of them, the damage should be made on another node. This is again because Swift places copies of the data as far away from each other as possible: if different zones are not an option, it will try using another server. If that is not feasible, it will at least place them on different disks. After three disks are gone, there should exist an object that is only on the handoff nodes, which is the special case we wanted to test here.
The test could continue damaging the disks as described in Listing 4.2
until there is only a single disk left in the system, which would hold all
the data, assuming it has capacity for it. The user request for a file might
not work anymore, because the proxy server tries only a set number of
nodes before it gives up and declares the object missing; and writing new
data would stop working, because that requires the majority of writes to be
successful (in this case with replica count set to three, at least two replicas
would have to be created, which is no longer possible). However, the data
would still be there and could be recovered [13].
Most of the other tests are similar to Listings 4.1 and 4.2 and all of them use a timeout for the replica regeneration, for example:

expected failure
1. damage three disks at the same time (or more, if the replica count is higher)
2. check that the replicas didn't regenerate even after some time period
3. fail if the replicas regenerated (this tests whether the tests themselves are correct)
Swift will regenerate the replicas only if a disk failed, not if the entire node is down [13, p. 133], therefore we don't need to test for replica regeneration if a whole server fails. However, new objects should be written onto the handoff nodes, so we should check whether there is the correct number of replicas of the new objects.
Another kind of damage to the system happens when a disk fills up. This should be simulated manually and not through Swift, because Swift tries to distribute the files evenly. If a disk is full, it is handled similarly to a damaged disk: handoff nodes get used to store the data instead.

A similar test should be done when the node isn't completely full, but has some small amount of space left, and a file larger than that is uploaded. Another edge case would be to have three or more disks already filled up and try to write new data.
Swift behaves similarly with zones as with servers, i.e. if the system has only one zone, it behaves as if each server were a separate zone. When we define zones (groups of servers, usually with independent power supplies), Swift will try to put the replicas into different zones, so that the loss of one group of servers doesn't cause a loss of data availability. Therefore, all the tests should be repeated with actions like "select disk in first data server" replaced by "select disk in any server in first zone".
The tests could also be repeated while a Swift rebalance is in progress: when a new disk or group of disks is added and the data are being evenly redistributed in the data center. However, the expected behaviour in this situation doesn't seem to be specified and it would first have to be studied, but as a minimum it should not affect the users' file uploads and downloads.
To test network partitions, two Swift regions could be created and the connection between them damaged (imagine cutting the cable between a data center in Prague and another one in Brno). Swift uses eventual consistency
and the latest change should be the one with the priority. This should be
tested by uploading new files into both data centers while the connection is
cut. A user in the first region and another user in the second region would
both write into the same files in a shared project. Afterwards the connection
would be restored and the tests would wait until the replication is done,
then check whether the correct (latest) changes are written into the shared
files.
An observational study could be made of what happens when Swift gets
slowly filled up with data. To study this efficiently, the disks should have a
small capacity. This might be made into a test case, but the test result would
be difficult to measure, as we expect it to fail at some point.
4.2 High-availability Tests

To test a stateless service, all that is required is to damage it and check if the system still works. The basic outline of such tests is as follows:
1. inject a fault into the service (e.g., stop or restart it on one of the nodes),
2. check that the system as a whole still works.
The second step could be done with some selected API calls to the service, or by running a relevant part of Tempest. It should be done immediately if the service is set up as active/active. Ideally, some API call would be repeated with high frequency in parallel with the fault injection and monitored for failures. If the component is set up as active/passive or in hot standby, the service should be checked after a timeout. In the case of the Nova services, the command nova-manage should be used to check whether the damage is shown in the output. Depending on how many nodes there are in the system, the test can be repeated multiple times, until there is only one node with the service left.
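Checking the nova-manage output could be sketched roughly as follows; the parsing is only illustrative (the historical nova-manage service list output marked unavailable services with "XXX"), and run_on_controller stands for whatever remote-command helper the framework provides.

# Hedged sketch: verify that the injected damage shows up in the output of
# `nova-manage service list`.
def assert_service_reported_down(service="nova-compute"):
    output = run_on_controller("nova-manage service list")  # assumed helper
    down_lines = [line for line in output.splitlines()
                  if service in line and "XXX" in line]
    assert down_lines, "%s is not reported as down" % service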
The most sensitive service could be the messaging queue, which is stateful, and all the services have to reconnect to it after a failure. A full messaging queue failure should be performed, when all the nodes running it are restarted, after which the whole OpenStack system should be checked.
The tests should be independent of the specific load balancer and cluster resource manager software if possible. Pacemaker itself (or any other clustering software used instead of it) has to be tested, since it manages all the resources and its failure could cause a cascading failure.
4.2.1 VM Creation and Scheduling

Figure 4.2: Basic topology for the VM creation and scheduling tests (a Controller node with the Nova scheduler, API, database, storage, etc., and the Compute nodes).
Figure 4.2 shows how the basic topology for the virtual machine creation and scheduling tests should look, but it can be extended to more nodes. The Compute nodes contain the nova-compute service, the hypervisor and the necessary networking services; the Controller has everything else, for example the Nova scheduler, Keystone, the database, storage, etc. A load balancer or Pacemaker isn't necessary.
The Nova scheduler selects a Compute server (one that contains the hypervisor) that is reporting that it is functioning properly and has enough resources. It creates the requested VM on it and contacts the other services for resources (networking, storage). Afterwards, Nova checks whether the VM was created correctly; if not, it throws it away and tries again on another node (by default, it tries this three times) [14]. If it fails even after that, the VM is set to an error state. By default, the scheduler doesn't remember which nodes failed before, but there is a filter scheduler which assigns the nodes weights based on their availability and other factors.
The basic outline of most of the tests is to damage the Compute node selected by the scheduler during VM creation and to verify that the VM still gets created correctly on another node.
The steps of the tests could be executed again with the next selected node being damaged, as many times as the scheduler is set to retry. A test case should be made that does the damage more times than the set number of retries and expects that the VM will report an error state, after which the VM should be removed to test whether it is responsive and doesn't get stuck.
Instead of damaging the compute service, we can inject a fault into a resource it requires, for example the networking or storage service, but only for the short duration of the first attempt at VM creation, after which the damage should be restored.
The tests could be extended to the filter scheduler, where repeated damage and failed VM creation would be done on one node, and it would be monitored whether the scheduler stops placing new VMs onto this node.
Additionally, a study could be made of the behaviour of the system when the memory on all nodes starts running out, but since there is no defined behaviour for this, it should not be made into a test case.
4.3 Other Test Ideas

Communication
In related works [3, 4], the ORCHESTRA and ComFIRM tools manipulate the communication between the nodes of a distributed system by modifying, dropping or delaying messages (see Section 3.2). Most of these faults don't need to be tested at the OpenStack level and would rather belong to tests of the underlying messaging service.
Configuration
All of the services could be tested by damaging the configuration files and restarting the service, as a kind of fuzz testing. This would simulate human error, which is a main contributor to unscheduled downtime of systems [11]. When the damage is expected to completely fail the service, the restart should fail and the system should handle it the same way as in the high-availability tests described in Section 4.2 (assuming the system is set up as HA). However, a small damage to the configuration that doesn't fail the service restart should not break the other services. The behaviour depends on the service under test and might not always be specified, therefore it requires further study.
Chapter 5
Framework Design and Implementation
The tool is designed to emulate the actions that would be taken by a person testing the system by trying to damage services, as described in Chapter 3. The tests are deterministic (though it would be possible to create random tests too) and are essentially implemented as remote commands to the servers on which the OpenStack system under test is installed. The created framework, called DestroyStack, keeps complete control of all the nodes. To provide state restoration, repeatability and isolation, the framework uses virtualization to create snapshots of the system. The Gigan tool [8] used a similar approach, but DestroyStack doesn't use the virtualization to inject low-level failures into the hardware, thus it isn't directly dependent on it and can run on physical hardware too, if state restoration isn't necessary. It is flexible enough to support multiple topologies and doesn't require that the system be reinstalled after the injected failures damage the system too much, and is thus resource efficient.
The language of choice is Python, because the whole OpenStack ecosystem uses it and there are Python libraries available for each component, whereas another language would force us to work on the level of the REST API and make development slower. To control the servers by remote commands, the Python Paramiko library1 is used to communicate through the Secure Shell (SSH) protocol. It doesn't provide the convenient commands that the deployment tools Fabric2 or Ansible3 have, but sadly those are unsuitable to be used as libraries from Python, since they expect to have full control of the program execution, and it would be difficult to integrate them with the rest of the tools.
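Running a remote command with Paramiko is straightforward; the following sketch uses example credentials and a harmless command, whereas the framework wraps this in its own server-management helpers.

# Minimal example of executing a command on a test server over SSH.
import paramiko

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect("192.168.33.22", username="root", password="123456")
stdin, stdout, stderr = ssh.exec_command("df -h")
print(stdout.read())
ssh.close()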
The framework uses JSON (JavaScript Object Notation) for configuration, which is an open standard format that is human-readable and easy to parse.
1. http://www.paramiko.org/
2. Fabric is a simple imperative (i.e. a sequence of commands) deployment tool written in Python, available from: http://fabric.readthedocs.org
3. Ansible is also an imperative deployment tool in Python, but provides more high-level functionality than Fabric, see https://github.com/ansible/ansible
{
    "timeout": 360,
    "servers": [
        {
            "ip": "192.168.33.11",
            "roles": ["swift_proxy", "keystone"]
        },
        {
            "ip": "192.168.33.22",
            "extra_disks": ["vdb", "vdc", "vdd"],
            "roles": ["swift_data"]
        },
        {
            "ip": "192.168.33.33",
            "extra_disks": ["vdb", "vdc", "vdd"],
            "roles": ["swift_data"]
        }
    ],
    "keystone": {
        "user": "admin",
        "password": "123456"
    },
    "management": {
        "type": "manual"
    }
}
The server with the roles swift_proxy and keystone represents the Swift proxy server in the diagram, while the other two are the Swift data servers. The key extra_disks points to the disk devices in the /dev/ directory on the server. They will be used for Swift data, and most of the injected failures in the Swift tests will be performed on them. The keystone section contains the authentication for the OpenStack clients, and the management part is related to state restoration, which is explained more closely in Section 5.3. A JSON schema for the configuration file is provided in the source code, both as documentation and as a validation tool.
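Validation against the provided schema can be done, for example, with the jsonschema library; the file paths below follow the repository layout from Appendix A.

# Validate a configuration file against etc/schema.json.
import json
import jsonschema

with open("etc/schema.json") as schema_file:
    schema = json.load(schema_file)
with open("etc/config.json") as config_file:
    config = json.load(config_file)

jsonschema.validate(config, schema)  # raises ValidationError on a bad config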
A user gets an example file like Listing 5.1 and usually just needs to update the addresses of the servers to match his topology. He can add any number of servers and assign them the roles that match his installation.
if len(manager.get_all(role='swift_data')) < 2 \
        or manager.get(roles=['keystone', 'swift_proxy']) is None:
    raise SkipTest
This allows for flexibility in both the system deployment tools and test grouping: it is possible to create a topology which satisfies only the requirements of a certain subset of the tests that interest us, and theoretically a topology that satisfies all of them. The latter might not always be possible, since some tests could require a maximum number of a certain kind of server role, while another requires a minimum higher than that. In those special cases, it may be possible to work around it by shutting down the extra services, but some requirements may be contradictory and the framework has to take this into consideration.
The roles are also used by the installation script to deploy the system
and decide what to install where based on these data. For more information
on these tools, see Section 5.4.
5.2 Virtualization
Since the nature of the tests doesn't require high performance, using virtual machines for the tested OpenStack system is possible. It allows fast and flexible deployment, since it doesn't require that the person running the tool search for extra hardware, and it can provide virtual networks, thus allowing us to create various system topologies without manual reconfiguration. Virtualization is a commonly used tool for testing with fault injection, for example in the Gigan test framework [8].
A problematic behavior may only manifest when a rare ordering of events occurs. To discover such cases, we need to be able to run the tests often and ideally without human intervention. If we used physical machines, a failure from which the system couldn't recover would stop the test run, and somebody would have to re-install the system. Using VMs allows us to restore the state of the system under test (see Section 5.3) to the state before the failure, and makes it possible to isolate and repeat the failures. However, using virtualization is not enforced and the tests don't use it directly. In Gigan [8], the tests inject failures into CPU registers (causing bit flips) using virtualization that gave them direct access to the hardware, but in the tests we have designed for OpenStack, such tight coupling isn't necessary. This allows us to use different virtualization managers and even physical hardware, if state restoration isn't necessary (though it could be implemented even there, see Section 5.3.2, under LVM).
5.3 State Restoration

OpenStack should be able to survive the individual tests, but not necessarily a combination of them. The injected faults tend to damage the system in a way that is difficult to recover from. Some of them would even require that the whole system be reinstalled, which is currently a slow process. The goal of the framework was to be resource efficient, to make the full run of tests faster and require less work from the user. State restoration makes this feasible, along with providing repeatability and isolation of the tests, and it even makes it possible to create negative test cases: tests where the system is expected to fail, which check whether the tests themselves are correct.
5.3.1 Manual
5.3.2 Snapshots
Meta-OpenStack
Since the future users of DestroyStack are testers and developers of OpenStack, it is likely they have access to a large and stable OpenStack cloud where they can create VMs. Thus they would create a number of virtual machines in a meta-OpenStack, inside of which they would install the system that they want to test.
For the tool to use this kind of state restoration, it has to be given the credentials of the managing OpenStack system, as shown in Listing 5.3, which is a part of a configuration file like the one in Listing 5.1. In this case, DestroyStack is required to run from a separate server and not on the tested system, since it cannot restore the state of the instance on which it is itself running.
"management": {
    "type": "metaopenstack",
    "auth_url": "http://myopenstack.com:5000/v2.0/",
    "user": "myuser",
    "tenant": "mytenant",
    "password": "1234"
}
Listing 5.3: Part of the configuration file when DestroyStack uses meta-OpenStack snapshots to restore state.
Vagrant
For users who don't have a meta-OpenStack available, the tool Vagrant4 has been chosen, since it can easily create and manage VirtualBox5 and libvirt6 virtual machines (although the latter support is still in development7). It would have been possible to support both of them with native commands, but Vagrant allows us to use a simple command to snapshot all the VMs and provides a unified interface.
4. http://www.vagrantup.com/
5. https://www.virtualbox.org/
6. http://libvirt.org/
7. https://github.com/pradels/vagrant-libvirt
LVM
The most general solution for state restoration might be LVM8 (Linux Vol-
ume Manager) snapshots, since they could be used on physical machines.
However, the system images available at the time of writing didnt use LVM
and creating a general method of restoring the full contents of a systems
disk, including the root partition, is not trivial. Due to these problems, snap-
shotting with LVM has been left as an option and possible feature in the
future, if there is demand for it.
8. https://en.wikipedia.org/wiki/Logical_Volume_Manager_(Linux)
9. https://github.com/stackforge/packstack
10. https://github.com/redhat-openstack/khaleesi
5.5 The Framework's Capabilities and Drawbacks

Theoretically, the framework can be used for most kinds of tests. There are no artificial restrictions: you can run any command on any of the system's servers, and the state restoration is a tool you can use, but don't have to if it is not necessary. However, if a test doesn't require direct access to the servers or the state restoration, it is recommended that it be put into the main test suite, Tempest (see Section 3.2.1).
DestroyStack was not designed for performance or scalability tests, since state restoration usually requires that the nodes are virtual machines. If the performance tests don't need state restoration, or if the alternative methods get developed (LVM snapshots, some manual method), it could be possible to use it this way.
The framework is mainly designed to provide tools for OpenStack tests, but it isn't tightly coupled with OpenStack and the tools could be used for tests of other systems. Especially the state restoration functionality is helpful for general fault-injection testing. If necessary, DestroyStack could be split into two projects: the tools and the tests. In that case, the set of tools could be packaged and imported into other projects.
Because of the state restoration, the framework requires that the nodes are virtual machines in one of the supported virtualization managers and that the user can snapshot them. However, not everybody has access to these kinds of resources. It is still possible to use the best-effort manual restoration or to disable the restoration completely, but then it might not be possible to run all the tests successfully (see Section 5.3).
The tests could be used on a system before it is deemed stable and production-ready. In this case, the state restoration would be disabled and the administrator would run a single test that would verify whether his setup is truly highly available and reliable. Preferably, the Tempest test suite would be run before and after this destructive test. However, while Tempest can be used even on a production system, because it shouldn't cause any damage to it, using DestroyStack would be dangerous and might cause downtime. Therefore, the fault-injection tests should not be used as a setup verification mechanism on production systems.
5.5.1 Unimplemented features

This section contains an overview of ideas for new functionality and of elements that have been delegated to other tools.
Physical Hosts
Logging
Collection of the system logs is left to the script that will be running the
tests, which will most likely be Khaleesi in the future. Khaleesi already sup-
ports this. DestroyStack only collects the logs from the framework and tests
themselves. There are already multiple logging monitoring tool that can
match the events from the system logs and DestroyStack logs and display
the evens in a human readable format. The OpenStack Ceilometer service
provides this functionality, but its installation and usage is currently left to
the user and collecting these data will be also left to Khaleesi in the future.
5.6 Implementation of Fault-injection Tests

The tests use the Python nosetests11 framework. It was chosen because at the time the project was created, the main set of OpenStack tests, Tempest, was using it too. It is usually used for unit testing (testing of small units of code), but it has features like module and package-level test setups that make it an acceptable tool for other kinds of tests. It is able to collect the output from the tests and report the results in XML format.
The basic outline of how the tests are implemented is in the template file shown in Listing 5.4, meant to be used as a starting point for creating new tests. The ServerManager object keeps track of all the servers and provides functions to filter them by role, and to save and restore the state. The requirements function on line 5 specifies under what conditions the tests should run, as described in Section 5.1. A snapshot of the system is taken on line 19, in the setupClass method, which is run only once per group of tests. If the snapshots already exist, they won't be created again, so it is possible to create multiple groups of tests and the operation won't be repeated. On the other hand, a group of tests can specify a tag and thereby create its own set of snapshots. After the state of the system is saved, the setUp method is executed and a file is created on one of the servers. The test on line 28 only checks whether the file exists. After this, the state of the system is restored in the tearDown method and the file is gone.
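A condensed, hedged reconstruction of that template is sketched below; the class and method names follow the description above, while the exact ServerManager API (module path, configuration argument, save_state/restore_state and the cmd helper) is assumed for illustration only.

import unittest
from destroystack.tools.server_manager import ServerManager  # assumed module path

class TestExample(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # runs once per group of tests: snapshot the whole tested system
        cls.manager = ServerManager("etc/config.json")
        cls.manager.save_state(tag="example")

    def setUp(self):
        # runs before each test: create a file on one of the servers
        self.server = self.manager.get(role="swift_proxy")
        self.server.cmd("touch /tmp/destroystack_example")

    def test_file_exists(self):
        self.server.cmd("test -f /tmp/destroystack_example")

    def tearDown(self):
        # restore the snapshot; the created file is gone afterwards
        self.manager.restore_state(tag="example")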
11. https://nose.readthedocs.org
5.6.1 Results
Some of the Swift tests from Section 4.1 were implemented as a demonstration of how the framework is to be used. The nosetests tool finds all the tests in the directory and prints the results on the command line, while collecting the log output and standard output from the tests, as shown in Listing 5.5. The results are also written into an XML file that can be used by other tools.

TestSwiftSmallSetup
    test_disk_replacement #1 FAIL
    test_one_disk_down #2 OK
    test_one_disk_down_restore #3 OK
    test_two_disks_down #4 OK
    test_two_disks_down_third_later #5 FAIL

So far, DestroyStack has only found issues related to the OpenStack deployment tool.12 However, the test "two disks down, third later" (also described in Listing 4.2) is failing in approximately 50% of the test runs, because only two copies of the objects are found instead of three. It could be an issue with Swift, but so far the failure hasn't been traced down and reported. The disk replacement test failure is probably just an issue with the test implementation, since it was successful before the tools were recently redesigned.

12. Bug #1072070 - Packstack fails if more than one Swift disk is specified
Bug #1072099 - Cannot specify Swift disk, only loopback device
Bug #1020480 - swift-init exits with 0 even if the service fails to start
It takes approximately two minutes to execute a Swift test successfully, because the replica regeneration takes some time. In the case of an unsuccessful test, the duration depends on the timeout set in the configuration. The time needed for state restoration depends on the speed of the system running the virtual machines. It usually takes between one and five minutes to take the snapshots of all the virtual machines, which is done in parallel, so it shouldn't get worse when bigger topologies are added, though it might cause a big increase in I/O operations on the underlying system. Restoring the VMs back to the state before the failure takes another 1-2 minutes, but is also done in parallel. A full test run, not including the system deployment, therefore currently takes approximately 20 minutes.
Chapter 6
Conclusion
1. https://github.com/mkollaro/destroystack
Appendix A
Attachments
destroystack/test_*
    Fault-injection tests; each file contains a group of tests that have similar requirements about the tested system.
destroystack/test_template.py.sample
    Template for tests; new users can use it as a starting point to understand the tests.
destroystack/tools/
    Source code of the framework, containing server management tools and state restoration mechanisms.
etc/
    Configuration file samples.
etc/schema.json
    JSON schema specifying the configuration format and options.
README.md
    Project description and usage tutorial, similar to the contents of Appendix B.
TEST_PLAN.md
    Simple description of the tests with ASCII drawings of the required topologies, similar to the contents of Chapter 4, but less detailed.
Appendix B
User tutorial
B.1 Requirements
You will either need access to some VMs running in an OpenStack cloud, or VirtualBox locally (a script for setting up the VirtualBox VMs is already provided). Using VMs is necessary because the machines are being snapshotted between the tests to provide test isolation and to recover from faults that damaged the system. Support for Amazon AWS and libvirt VMs might be added in the future. If you need bare metal, you can add support for LVM snapshotting, or you can use the manual best-effort recovery.

The tests don't tend to be computationally intensive. For now, you should be fine if you can spare 2GB of memory for the VMs in total. Certain topologies need extra disks for Swift, but their size isn't important - 1GB per disk is enough.
So far, it has been tested only with RHEL and Fedora Linux, plus the OpenStack versions RDO Havana and RHOS 4.0 (Red Hat OpenStack), installed by Packstack1. The tests themselves don't really care what is deployed or how. The tests use the nosetests framework and the OpenStack clients, both of which will be installed as dependencies if you install this repository with python-pip.
1. https://github.com/stackforge/packstack
B.2 Running the tested system in VirtualBox

You can try the tests out with Vagrant and VirtualBox (libvirt may be added later). While easier to use, it isn't fast: creating the virtual machines will take a few minutes, installing OpenStack on them takes another 15 minutes, and the tests themselves take a while to run.
1. install the latest version of Vagrant2 and VirtualBox3
2. install Vagrant plugin for creating snapshots
$ cd destroystack/
$ vagrant up
$ cp etc/config.json.vagrant.sample etc/config.json
$ python bin/packstack_deploy.py
9. run tests
$ nosetests
2. http://www.vagrantup.com/downloads.html
3. https://www.virtualbox.org/wiki/Downloads
4. https://github.com/redhat-openstack/khaleesi
Bibliography
[3] Dawson, S., Jahanian, F., Mitton, T., and Tung, T.-L. Testing of fault-tolerant and real-time distributed systems via protocol fault injection. In Fault Tolerant Computing, 1996. Proceedings of Annual Symposium on (1996), IEEE, pp. 404-414.
[5] Fifield, T., Fleming, D., Gentle, A., Hochstein, L., Proulx, J., Toews, E., and Topjian, J. OpenStack Operations Guide. O'Reilly Media, May 2014. Available at http://docs.openstack.org/ops/.
[6] Ford, D., Labelle, F., Popovici, F., Stokely, M., Truong, V.-A., Barroso, L., Grimes, C., and Quinlan, S. Availability in globally distributed storage systems. In Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation (2010).
[9] Jiang, W., Hu, C., Zhou, Y., and Kanevsky, A. Are disks the dominant contributor for storage failures?: A comprehensive study of storage subsystem failure characteristics. Trans. Storage 4, 3 (Nov. 2008), 7:1-7:25.
[11] Marcus, E., and Stern, H. Blueprints for High Availability: Designing Resilient Distributed Systems. John Wiley & Sons, Inc., 2003.
Index