
Principles of Antifragile Software

Martin Monperrus
University of Lille & Inria, France
martin.monperrus@univ-lille1.fr
January 27, 2017

Abstract

There are many software engineering concepts and techniques related to software errors. But is this enough? Have we already completely explored the software engineering noosphere with respect to errors and reliability? In this paper, I discuss a novel concept, called "software antifragility", that is unconventional and has the capacity to improve the way we engineer errors and dependability in a disruptive manner. This paper first discusses the foundations of software antifragility, from classical fault tolerance to the most recent advances on automatic software repair and fault injection in production. This paper then explores the relation between the antifragility of the development process and the antifragility of the resulting software product.

1 Introduction

The software engineering body of knowledge on software errors and reliability is not short of concepts, starting from the classical definitions of faults, errors and failures [1], continuing with the techniques for fault-freeness proofs, fault removal and fault tolerance, etc. But is this enough? Have we already completely explored the space of software engineering concepts related to errors? In this paper, I discuss a novel concept, that I call "software antifragility", which has the capacity to radically change the way we reason about software errors and the way we engineer reliability.

The notion of "antifragility" comes from the book by Nassim Nicholas Taleb simply entitled "Antifragile" [14]. Antifragility is a property of systems, whether natural or artificial: a system is antifragile if it thrives and improves when facing errors. Taleb has a broad definition of "error": it can be volatility (e.g. for financial systems), attacks and shocks (e.g. for immune systems), death (e.g. for human systems), etc. Yet, Taleb's essay is not at all about engineering, and it remains to translate the power and breadth of his vision into a set of sound engineering principles. This paper provides a first step in this direction and discusses the relations between traditional software engineering concepts and antifragility.

First, I relate software antifragility to classical fault tolerance. Second, I show the link between antifragility and the most recent advances on automatic software repair and fault injection. Third, I explore the relation between the antifragility of the development process and the antifragility of the resulting software product. This paper is a revised version of an Arxiv paper [9].

2 Software Fragility

There are many pieces of evidence of software fragility, sometimes referred to as "software brittleness" [13]. For instance, the inaugural flight of Ariane 5 ended up with the total destruction of the rocket, because of an overflow in a subcomponent of the system. At a totally different scale, in the Eclipse development environment, a single external plugin of a low-level library providing optional features crashes the whole system and makes it unusable (a recent example of fragility from December 2013, see https://bugs.eclipse.org/bugs/show_bug.cgi?id=334466). Software fragility seems independent of scale, domain and implementation technology.

There are means to combat fragility: fault prevention, fault tolerance, fault removal, and fault forecasting [1]. Software engineers strive for dependability: they do their best to prevent, detect and repair errors. They prevent bugs by following best practices, they detect bugs by extensively testing the software and comparing the implementation against the specification, and they repair bugs reported by testers or users and ship the fixes in the next release. However, despite those efforts, most software remains fragile. There are pragmatic explanations for this fragility: lack of education, technical debt in legacy systems, or the economic pressure for writing cheap code. However, I think that the reason is more fundamental: we do not take the right perspective on errors.

3 Software Antifragility

As Taleb puts it, an antifragile system "loves errors". Software engineers do not. First, errors cost money: it is time-consuming to find and to repair bugs. Second, they are unpredictable: one can hardly forecast when and where they will occur, and one cannot precisely estimate the difficulty of repairing them. Software errors are traditionally considered as a plague to be eradicated, and this is the problem.

Possibly, instead of damning errors, one can see them as an intrinsic characteristic of the systems we build. Complex systems have errors: in biological systems, errors constantly occur; DNA pairs are not properly copied, cells mutate, etc. Software systems of reasonable size and complexity also naturally suffer from errors, as complex biological and ecological systems do. Once one acknowledges the necessary existence of software errors in large and interconnected software systems [13, 10], it changes the game.

3.1 Fault-tolerance and Antifragility

Instead of aiming at error-free software, there are software engineering techniques to constantly detect errors in production (aka self-checking software [18]) and to tolerate them as well (aka fault tolerance [11]). Self-checking, self-testing and fault tolerance are not literally loving errors, but they are an interesting first step.
In Taleb's view, a key point of antifragility is that an antifragile system becomes better and stronger under continuous attacks and errors. The immune system, for instance, has this property: it requires constant pressure from microbes to stay reactive. Self-detection of bugs is not antifragile: the software may detect a lot of erroneous states, but encountering more errors does not make it detect more. For fault tolerance, the frontier blurs. If the fault tolerance mechanism is static, there is no advantage from having more faults. If the fault tolerance mechanism is adaptive [6] and if something is learned when an error happens, the system always improves. We hit here a first characteristic of software antifragility. A software system with dynamic, adaptive fault tolerance capabilities is antifragile: exposed to faults, it continuously improves.
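
What an adaptive mechanism could look like is sketched below (this is a toy illustration with hypothetical names, not the Chameleon infrastructure of [6]): an operation has several redundant variants, and every failure updates a score, so that each new fault leaves the component slightly better configured.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.function.Supplier;

    // Sketch of adaptive fault tolerance: redundant variants of an operation,
    // with a score per variant updated on every success or failure. Each error
    // teaches the component which variant to prefer next time.
    public class AdaptiveExecutor<T> {
        private final List<Supplier<T>> variants = new ArrayList<>();
        private final Map<Supplier<T>, Integer> score = new HashMap<>();

        public void addVariant(Supplier<T> variant) {
            variants.add(variant);
            score.put(variant, 0);
        }

        // Try variants from best-scored to worst, and learn from each outcome.
        public T execute() {
            variants.sort((a, b) -> score.get(b) - score.get(a));
            RuntimeException lastFailure = null;
            for (Supplier<T> variant : variants) {
                try {
                    T result = variant.get();
                    score.merge(variant, 1, Integer::sum);   // reward success
                    return result;
                } catch (RuntimeException e) {
                    score.merge(variant, -1, Integer::sum);  // penalize failure
                    lastFailure = e;
                }
            }
            if (lastFailure == null) {
                throw new IllegalStateException("no variant registered");
            }
            throw lastFailure; // all variants failed
        }

        public static void main(String[] args) {
            AdaptiveExecutor<String> exec = new AdaptiveExecutor<>();
            exec.addVariant(() -> { throw new RuntimeException("flaky primary"); });
            exec.addVariant(() -> "answer from fallback");
            System.out.println(exec.execute()); // the fallback succeeds and is promoted
            System.out.println(exec.execute()); // the fallback is now tried first
        }
    }

The exact learning rule does not matter here; what matters is that the component is measurably better configured after each fault than before it.
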
cur, one can not precisely estimate the difficulty
of repairing them. Software errors are tradition-
ally considered as a plague to be eradicated and 3.2 Automatic Runtime Bug Re-
this is the problem. pair
Possibly, instead of damning errors, one can
Fault removal, i.e. bug repair, is one means to
see them as an intrinsic characteristic of the sys-
attain reliability [1]. Let us now consider soft-
tems we build. Complex systems have errors: in
ware that repairs its own bugs at runtime and
biological systems, errors constantly occur: DNA
call the corresponding body of techniques “au-
pairs are not properly copied, cells mutate, etc.
tomatic runtime repair” (also called “automatic
Software systems of reasonable size and com-
recovery” and also “self-healing” [7]).
1 https://bugs.eclipse.org/bugs/show_bug.cgi? There are two kinds of automatic software re-
id=334466 pair: state repair and behavioral repair [8]. State

2
repair consists in modifying a program’s state system’s error recovery capabilities; if the sys-
during its execution (the registers, the heap, the tem can handle those injected faults, it is likely
stack, etc.). Demsky and Rinard’s paper on data to handle real-world natural faults of the same
structure repair [4] is an example of such state re- nature. Third, monitoring the impact of each in-
pair. Behavioral repair consists in modifying the jection gives the opportunity to learn something
program behavior, with runtime patches. The on the system itself and the real environmental
patch, whether binary or source, is synthesized conditions.
and applied at runtime, with no human in the Because of these three effects, injecting faults
loop. For instance, the application communities in production makes the system better. This
of Locasto and colleagues [7] share behavioral corresponds to the main characteristic of an-
patches for repairing faults in C code. tifragility: “the antifragile loves error”. It is not
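
As an illustration of state repair, here is a minimal sketch in the spirit of data structure repair [4] (the data structure and the repair action are hypothetical): a consistency property is checked at runtime and, when it is violated, the state itself is patched so that execution can continue.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeSet;

    // Sketch of runtime state repair: a list that must stay sorted and
    // duplicate-free. When the consistency property is violated, the state
    // is repaired in place instead of letting the error propagate.
    public class RepairableIndex {
        private List<Integer> keys = new ArrayList<>();

        // A (deliberately) unsafe update that may corrupt the invariant.
        public void insertUnchecked(int key) {
            keys.add(key); // no sorting, no duplicate check
        }

        private boolean isConsistent() {
            for (int i = 1; i < keys.size(); i++) {
                if (keys.get(i - 1) >= keys.get(i)) {
                    return false; // not strictly increasing: invariant violated
                }
            }
            return true;
        }

        // State repair: rebuild a consistent approximation of the structure.
        private void repair() {
            keys = new ArrayList<>(new TreeSet<>(keys)); // sorted, duplicates dropped
        }

        // Reads go through a check-and-repair step, so callers always observe
        // a consistent data structure, even after a faulty update.
        public List<Integer> consistentKeys() {
            if (!isConsistent()) {
                repair();
            }
            return keys;
        }

        public static void main(String[] args) {
            RepairableIndex index = new RepairableIndex();
            index.insertUnchecked(5);
            index.insertUnchecked(3); // corrupts the ordering invariant
            index.insertUnchecked(5); // introduces a duplicate
            System.out.println(index.consistentKeys()); // prints [3, 5] after repair
        }
    }

A behavioral repair would instead change the code of insertUnchecked itself, for instance with a runtime patch synthesized and applied without a human in the loop.
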
As said previously, a software system can be considered as antifragile as long as it learns something from the bugs that occur. Automatic runtime bug repair at the behavioral level corresponds to antifragility, since each fixed bug results in a change in the code, in a better system. This means "loving errors": a software system with runtime bug repair capabilities loves errors because those errors continuously trigger improvements of the system itself.

3.3 Failure Injection in Production

If you really "love errors", you always want more of them. In software, one can create artificial errors using techniques called fault and failure injection. So, literally, software that "loves errors" would continuously self-inject faults and perturbations. Would it make sense?

By self-injecting failures, a software system constantly exercises its error-recovery capabilities. If the system resists those injected failures, it will likely resist similar real-world failures. For instance, in a distributed system, servers may crash or be disconnected from the rest of the network. In a distributed system with fault injection, a fault injector may randomly crash some servers (an example of such an injector is the Chaos Monkey [2]).
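
Below is a minimal sketch of such a fault injector (the ServiceRegistry interface and instance identifiers are hypothetical; this is an illustration of the principle, not Netflix's actual Chaos Monkey [2]): at a regular interval, with some probability, one randomly chosen production instance is terminated.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Sketch of a chaos-monkey-like fault injector: periodically, with a small
    // probability, terminate one randomly chosen production instance so that
    // the error-recovery machinery is exercised continuously.
    public class RandomInstanceKiller {

        // Hypothetical hook into the deployment infrastructure.
        public interface ServiceRegistry {
            List<String> runningInstances();
            void terminate(String instanceId);
        }

        private final ServiceRegistry registry;
        private final Random random = new Random();
        private final double killProbability;

        public RandomInstanceKiller(ServiceRegistry registry, double killProbability) {
            this.registry = registry;
            this.killProbability = killProbability;
        }

        // One injection round: maybe kill one instance, and log the event so
        // that the impact of the injected failure can be studied afterwards.
        void injectOnce() {
            List<String> instances = registry.runningInstances();
            if (!instances.isEmpty() && random.nextDouble() < killProbability) {
                String victim = instances.get(random.nextInt(instances.size()));
                System.out.println("[chaos] terminating instance " + victim);
                registry.terminate(victim);
            }
        }

        // The schedule is a policy choice; here, one injection round per hour.
        public void start() {
            ScheduledExecutorService scheduler =
                    Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleAtFixedRate(this::injectOnce, 1, 1, TimeUnit.HOURS);
        }

        public static void main(String[] args) {
            // In-memory stand-in for a real deployment, for demonstration only.
            List<String> live = new ArrayList<>(List.of("srv-1", "srv-2", "srv-3"));
            ServiceRegistry fake = new ServiceRegistry() {
                public List<String> runningInstances() { return live; }
                public void terminate(String id) { live.remove(id); }
            };
            RandomInstanceKiller monkey = new RandomInstanceKiller(fake, 0.5);
            for (int i = 0; i < 5; i++) {
                monkey.injectOnce(); // in production this would run on the schedule
            }
            System.out.println("still running: " + live);
        }
    }
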
Ensuring the occurrence of faults has three positive effects on the system. First, it forces engineers to think of error recovery as a first-class engineering element: the system must at least be able to resist the injected faults. Second, it gives engineers and users confidence in the system's error recovery capabilities; if the system can handle those injected faults, it is likely to handle real-world natural faults of the same nature. Third, monitoring the impact of each injection gives the opportunity to learn something about the system itself and the real environmental conditions.

Because of these three effects, injecting faults in production makes the system better. This corresponds to the main characteristic of antifragility: "the antifragile loves errors". It is not purely the injected faults that improve the system, it is the impact of injected faults on the engineering ecosystem (the design principles, the mindset of engineers, etc.). I will come back to the profound relation between product and process in Section 4. A software system using fault self-injection in production is antifragile: by continuously exercising its error-handling code, it decreases the risk of that code being missing, incorrect, or rotten.

Injecting faults in production must come with a careful analysis of the dependability losses. There must be a balance between the dependability losses (due to injected system failures) and the dependability gains (due to software improvements) that result from using fault injection in production. Measuring this tradeoff is the key point of antifragile software engineering.
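
One possible way to make this balance operational, sketched below with hypothetical metric names, is to treat injected failures like any other source of unavailability and to gate injection on an error budget: injection proceeds only while the measured availability stays above the service-level objective, so the dependability losses caused by injection are bounded by construction.

    // Sketch of a guardrail for fault injection in production: injected
    // failures consume the same error budget as natural failures, so the
    // dependability cost of injection stays bounded and measurable.
    public class InjectionBudget {
        private final double availabilityObjective; // e.g. 0.999
        private long successfulRequests = 0;
        private long failedRequests = 0;

        public InjectionBudget(double availabilityObjective) {
            this.availabilityObjective = availabilityObjective;
        }

        // Fed by production monitoring (hypothetical integration point).
        public void record(boolean success) {
            if (success) successfulRequests++; else failedRequests++;
        }

        public double measuredAvailability() {
            long total = successfulRequests + failedRequests;
            return total == 0 ? 1.0 : (double) successfulRequests / total;
        }

        // Inject only while the error budget is not exhausted: this is where
        // the dependability losses and gains of injection are traded off.
        public boolean injectionAllowed() {
            return measuredAvailability() > availabilityObjective;
        }

        public static void main(String[] args) {
            InjectionBudget budget = new InjectionBudget(0.999);
            for (int i = 0; i < 10_000; i++) budget.record(true);
            System.out.println(budget.injectionAllowed()); // true: budget available
            for (int i = 0; i < 20; i++) budget.record(false);
            System.out.println(budget.injectionAllowed()); // false: budget consumed
        }
    }
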
The idea of fault injection in production is unconventional but not new. In 1975, Yau and Cheung [18] proposed inserting fake "ghost planes" in an air traffic control system. If all the ghost planes land safely while interacting with the system and human operators, one can really trust the system. Recently, a company named Netflix released a "simian army" [5, 2], whose different kinds of monkeys inject faults in their services and datacenters. For instance, the "Chaos Monkey" randomly crashes some production servers, and the "Latency Monkey" arbitrarily increases and decreases the latency in the server network. They call this practice "chaos engineering".

From 1975 to today, the idea of fault injection in production has remained almost invisible: automated fault injection in production has rather been overlooked so far (this concept is not mentioned in the cornerstone paper by Avizienis, Laprie and Randell [1]). However, the nascent chaos engineering community may signal a real shift.

4 Software Development Process Antifragility

On the one hand, there is the software, the product, and on the other hand there is the process that builds the product. In Taleb's view, antifragility is a concept that also applies to processes. For instance, he says that the Silicon Valley innovation process is quite antifragile, because it deeply admits errors: both inventors and investors know that many startups will eventually fail. I now discuss the antifragility aspect of the software development process.

4.1 Test-driven Development

In test-driven development, developers write automated tests for each feature they write. When a bug is found, a test that reproduces the bug is first written; then the bug is fixed. The resulting strength of the test suite gives developers much confidence in the ability of their code to resist changes. Concretely, this confidence enables them to treat "refactoring" as a key phase of development. Since developers have an aid (the test suite) to assess the correctness of their software, they can continuously refine the design or the implementation. They refactor fearlessly, with little fear of breaking something in a way that would go unnoticed. Furthermore, test-driven development allows continuous deployment, as opposed to long release cycles. Continuous deployment means that features and bug fixes are released in production in a daily manner (and sometimes several times a day). It is the trust given by automated tests that allows continuous deployment.
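
The test-first reaction to a bug report can be made concrete with a small sketch (JUnit 5, with a hypothetical PriceCalculator and a hypothetical bug report): the failing input is first captured as an automated test, then the fix is written, and the test stays in the suite to protect future refactorings.

    import static org.junit.jupiter.api.Assertions.assertEquals;

    import org.junit.jupiter.api.Test;

    // Sketch of the test-first reaction to a bug: write the failing test,
    // then the fix; the test remains as a permanent guard for refactoring.
    class PriceCalculatorTest {

        // Hypothetical bug report: a 100% discount used to produce a negative price.
        @Test
        void fullDiscountYieldsZeroPrice() {
            PriceCalculator calculator = new PriceCalculator();
            assertEquals(0.0, calculator.discountedPrice(50.0, 1.0), 1e-9);
        }

        // The production class under test (hypothetical, shown inline for brevity).
        static class PriceCalculator {
            double discountedPrice(double price, double discountRate) {
                // fixed version; the hypothetical buggy version subtracted the discount twice
                return price * (1.0 - discountRate);
            }
        }
    }
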
What is interesting with test-driven development is the second-order effect. With continuous deployment, errors have smaller impacts. No massive groups of interacting features and fixes arrive in production at the same time. When an error is found in production, the new version can be released very quickly, before a catastrophic propagation.

Also, when an error is found in production, it applies to a version that is close to the most recent version of the software product (the "HEAD" version). Fixing an error in HEAD is usually much easier than fixing an error in a past version, because the patch can seamlessly be applied to all close versions, and because the developers usually have the latest version in mind. Both properties (ease of deployment, ease of fixing) contribute to minimize the effects of errors. We recognize here a property of antifragility as Taleb puts it: "If you want to become antifragile, put yourself in the situation 'loves errors' [...] by making these numerous and small in harm" [14].

4.2 Bus Factor

In software development, the "bus factor" measures to what extent people are essential to a project. If a key developer is hit by a bus (or anything similar in effect), could it bring the whole project down? In dependability terms, such a consequence means that there is a failure propagation from a minor issue to a catastrophic effect.

There are management practices to cope with this critical risk. For instance, one technique is to regularly move people from project to project, so that nobody concentrates essential knowledge. At one extreme is "If a programmer is indispensable, get rid of him as quickly as possible" [17]. In the short term, moving people is sub-optimal. From a people perspective, they temporarily lose some productivity when they join a new project, in order to learn a new set of techniques, conventions, and communication patterns. They will often feel frustrated and unhappy because of this. From a project perspective, when a developer leaves, the project experiences a small slow-down. The slow-down lasts until the rest of the team grasps the knowledge and know-how of the developer who has just left. However, from a long-term perspective, it decreases the bus factor. In other terms, moving people transforms rare and irreversible large errors (project failure) into lots of small errors (productivity loss, slow-down). This is again antifragile.

4.3 Conway's Law

In programming, Conway's law states that "organizations which design systems [...] are constrained to produce designs which are copies of the communication structures of these organizations" [3]. Raymond famously put this as "If you have four groups working on a compiler, you'll get a 4-pass compiler" [12].

More generally, the engineering process has an impact on the product architecture and properties. In other terms, some properties of a system emerge from the process employed to build it. Since antifragility is a property, there may be software development processes that hinder antifragility in the resulting software and others that foster it. The latter would be "antifragile software engineering".

I tend to think that the engineers who set up antifragile processes know the nature of errors better than others. I believe that developers enrolled in an antifragile process become imbued with some of the values of antifragility. Tseitlin's concept of "antifragile organizations" is along the same lines [15]. Because of this, I hypothesize that antifragile software development processes are better at producing antifragile software systems.

5 Conclusion

This is only the beginning of antifragile software engineering. Beyond the vision presented here, research now has to devise sound engineering principles and techniques regarding self-checking, self-repair and fault injection in production. Because of the amount of legacy software, a major research avenue is to invent ways to develop antifragile software on top of existing brittle programming languages and execution environments. That would be a 21st-century echo of von Neumann's dream of building reliable systems from unreliable components [16].

References

[1] A. Avizienis, J.-C. Laprie, B. Randell, et al. Fundamental concepts of dependability. Technical report, University of Newcastle upon Tyne, 2001.
[2] A. Basiri, N. Behnam, R. de Rooij, L. Hochstein, L. Kosewski, J. Reynolds, and C. Rosenthal. Chaos engineering. IEEE Software, 33(3):35–41, 2016.
[3] M. E. Conway. How do committees invent? Datamation, 14(4):28–31, 1968.
[4] B. Demsky and M. Rinard. Automatic detection and repair of errors in data structures. ACM SIGPLAN Notices, 38(11):78–95, 2003.
[5] Y. Izrailevsky and A. Tseitlin. The Netflix simian army. http://techblog.netflix.com/2011/07/netflix-simian-army.html, 2011.
[6] Z. T. Kalbarczyk, R. K. Iyer, S. Bagchi, and K. Whisnant. Chameleon: A software infrastructure for adaptive fault tolerance. IEEE Transactions on Parallel and Distributed Systems, 10(6):560–579, 1999.
[7] M. E. Locasto, S. Sidiroglou, and A. D. Keromytis. Software self-healing using collaborative application communities. In Proceedings of the Symposium on Network and Distributed Systems Security, 2006.
[8] M. Monperrus. A critical review of "Automatic patch generation learned from human-written patches": Essay on the problem statement and the evaluation of automatic software repair. In Proceedings of the International Conference on Software Engineering, 2014.
[9] M. Monperrus. Principles of antifragile software. Technical Report 1404.3056, Arxiv, 2014.
[10] H. Petroski. To Engineer is Human: The Role of Failure in Successful Design. Vintage Books, 1992.
[11] B. Randell. System structure for software fault tolerance. IEEE Transactions on Software Engineering, SE-1(2):220–232, June 1975.
[12] E. S. Raymond et al. The jargon file. http://catb.org/jargon/, last accessed Jan. 2014.
[13] M. Shaw. Self-healing: softening precision to avoid brittleness. In Proceedings of the First Workshop on Self-Healing Systems, 2002.
[14] N. N. Taleb. Antifragile. Random House, 2012.
[15] A. Tseitlin. The antifragile organization. Communications of the ACM, 56(8):40–44, Aug. 2013.
[16] J. von Neumann. Probabilistic logics and the synthesis of reliable organisms from unreliable components. Automata Studies, 1956.
[17] G. M. Weinberg. The Psychology of Computer Programming. Van Nostrand Reinhold, New York, 1971.
[18] S. Yau and R. Cheung. Design of self-checking software. In ACM SIGPLAN Notices, volume 10, pages 450–455. ACM, 1975.
