
Reasons to Repeat Tests

by James Bach
(with help from colleagues Doug Hoffman, Michael Bolton, Ken Pugh, Cem Kaner, Bret
Pettichord, Jim Batterson, Geoff Sutton, plus numerous students who have participated in
the "Minefield Debate" as part of my testing class. The minefield analogy as I talk about
it was inspired by Brian Marick's talk Classic Testing Mistakes.)
Testing to find bugs is like searching a minefield for mines. If you just travel the same
path through the field again and again, you won't find a lot of mines. Actually, that's a
great way to avoid mines. The space represented by a modern software product is hugely
more complex than a minefield, so it's even more of a problem to assume that some
small number of "paths", say, a hundred, thousand, or million, when endlessly repeated,
will find every important bug. As many tests as a team of testers can physically perform
in a few weeks or months is still not that many tests compared to all the things that can
happen to a product in the field.
The minefield analogy is really just another way of saying that testing is a sampling
process, and we probably want a larger sample, rather than a tiny sample repeated over
and over again. Hence the minefield heuristic is do different tests instead of
repeating the same tests.
But what do I mean by repeat the same test? It's easy to see that no test can be
repeated exactly, any more than you can exactly retrace your footsteps. You can get
close, but you will always be a tiny bit off. Does repeating a test mean that the second
time you run the test you have to make sure that sunlight is shining at the same angle
onto your mousepad? Maybe. Don't laugh. I did experience a bug, once, that was
triggered by sunlight hitting an optical sensor inside a mouse. You just can't say for sure
what factors are going to affect a test. However, when you test you have a certain goal
and a certain theory of the system. You may very well be able to repeat a test with
respect to that goal and theory in every respect that A) you know about and B) you care
about and C) isn't too expensive to repeat. Nothing is necessarily intractable about that.
Therefore, by a repeated test, I mean a test that includes elements already known to be
covered in other tests. To repeat a test is to repeat some aspect of a previous test. The
minefield heuristic is saying that it's better to try to do something you haven't yet done,
than to do something you already have done.
If you disagree with this idea, or if you agree with it, please read further. Because...
...this analysis is too simplistic! In fact, even though diversity in testing is important and
powerful, and even though the argument against repetition is generally valid, I do know
of ten exceptions. There are ten specific reasons why, in some particular situation, it is
not unreasonable to repeat tests. It may even be important to repeat some tests.
For technical reasons you might rationally repeat tests...
1. Recharge: if there is a substantial probability of a new problem or a recurring old
problem that would be caught by a particular existing test, or if an old test is
applied to a new code base. This includes re-running a test to verify a fix, or
repeating a test on successively earlier builds as you try to discover when a
particular problem or behavior was introduced. This also includes running an old
test on the same software that is running on a new O/S. In other words, a tired old
test can be "recharged" by changes to the technology under test. Note that the
recharge effect doesn't necessarily mean you should run the same old tests, only
that it isn't necessarily irrational to do so.
2. Intermittence: if you suspect that the discovery of a bug is not guaranteed by
one correct run of a test, perhaps due to important variables involved that you
can't control in your tests. Performing a test that is, to you, exactly the same as a
test you've performed before, may result in discovery of a bug that was always
there but not revealed until the uncontrolled variables line up in a certain way.
This is the same reason that a gambler at a slot machine plays again after losing
the first time.
3. Retry: if you aren't sure that the test was run correctly the other time(s) it was
performed. A variant of this is having several testers follow the same instructions
and check to see that they all get the same result.
4. Mutation: if you are changing an important part of the test while keeping
another part constant. Even though you are repeating some elements of the test,
the test as a whole is new, and may reveal new behavior. I mutate a test because
although I have covered something before, I haven't yet covered it well enough. A
common form of mutation is to operate the product the same way while using
different data (a minimal sketch of this appears just after this list). The key
difference between mutating a test and intermittence or retry is that with
mutation the change is directly under your control. Mutation is intentional,
intermittence results from incidental factors, and you retry a test because of
accidental factors.
5. Benchmark: if the repeated tests comprise a performance standard that gets its
value by comparison with previous executions of the same exact tests. When
historical test data is used as an oracle, you must take care that the tests you
perform are comparable to the historical data. Holding tests constant may not be
the only way to make results comparable, but it might be the best choice
available. (A minimal sketch of this appears after the business reasons below.)
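To make the mutation idea (reason #4) concrete, here is a minimal sketch in
Python. The parse_date function and the myapp module are hypothetical stand-ins
for whatever you are testing; pytest's parametrize feature is one common way to
hold the operation constant while mutating only the data.

    import pytest
    from myapp import parse_date  # hypothetical module and function under test

    @pytest.mark.parametrize("text", [
        "2004-12-31",   # the already-covered happy path
        "2004-2-9",     # unpadded fields
        "2004-13-01",   # out-of-range month
        "31/12/2004",   # a different separator convention
        "",             # empty input
    ])
    def test_parse_date(text):
        # The operation is repeated exactly; only the data mutates. Each run
        # repeats elements of earlier tests, but the test as a whole is new
        # and may reveal new behavior.
        try:
            result = parse_date(text)
        except ValueError:
            return  # a clean rejection is acceptable behavior here
        assert result.year >= 1  # hypothetical sanity check on the result

Each case repeats the procedure of an earlier test while covering it a little
better, which is exactly the sense in which mutation both is and is not
repetition.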
For business reasons you might rationally repeat tests...
6. Inexpensive: if they have some value and are sufficiently inexpensive compared
to the cost of new and different tests. These tests may not be enough to justify
confidence in the product, however.
7. Importance: if a problem that could be discovered by those tests is likely to have
substantially more importance than problems detectable by other tests. The
distribution of the importance of product behavior is not necessarily uniform.
Sometimes a particular problem may be considered intolerable just because it's
already impacted an important user once (a "never let it happen again" situation).
This doesn't necessarily mean that you must run the same exact test, just
something that is sufficiently similar to catch the problem (see Mutation). Be
careful not to confuse the importance of a problem with the importance of a test.
A test might be important for many reasons, even if the problems it detects are
not critical ones. Also, don't make the mistake of spending so much effort on one
test that looks for an important bug that you neglect other tests that might be just
as good or better at finding that kind of problem.
8. Enough: if the tests you repeat represent the only tests that seem worth doing.
This is the virus scanner argument: maybe a repeated virus scan is okay for an
ordinary user, instead of constantly changing virus tests. However, we may
introduce variation because we don't know which tests truly are worth doing, or
we are unable to achieve enoughness via repeated tests.
9. Mandated: if, due to contract, management edict, or regulation, you are forced
to run the same exact tests. However, even in these situations, it is often not
necessary that the mandated tests be the only tests you perform. You may be
able to run new tests without violating the mandate.
10. Indifference/Avoidance: if the "tests" are being run for some reason other than
finding bugs, such as for training purposes, demo purposes (such as an
acceptance test that you desperately hope will pass when the customer is
watching), or to put the system into a certain state. If one of your goals in running
a test is to avoid bugs, then the principal argument for variation disappears.
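Returning to reason #5, here is a minimal sketch of historical results used as an
oracle. The measure_response_time helper, the baseline.json file, and the 20%
tolerance are all hypothetical choices, not a standard; the point is only that
the comparison is meaningful because the test itself is held constant between
runs.

    import json
    import time

    def measure_response_time(run_once):
        # Time a single execution of the operation under test.
        start = time.perf_counter()
        run_once()
        return time.perf_counter() - start

    def check_against_baseline(name, run_once, tolerance=0.20):
        # Historical data is the oracle: the observed result is comparable
        # only because the test has not changed since the baseline was
        # recorded.
        with open("baseline.json") as f:
            baseline = json.load(f)[name]  # seconds, from earlier runs
        observed = measure_response_time(run_once)
        assert observed <= baseline * (1 + tolerance), (
            f"{name}: {observed:.3f}s is more than {tolerance:.0%} over "
            f"the recorded baseline of {baseline:.3f}s"
        )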
I have collected these reasons in the course of probably a hundred hours of debate with
testing students and colleagues. Many of my colleagues prefer different words or a
different breakdown of reasons. There's nothing particularly sacred about my way of
doing it (except that some breakdowns would lead to long lists of very similar items). The
important thing is that when I hear a reason that seems not to fit within the ones I
already have, I add that reason to this list. I started with two reasons, in 1997. I added
the tenth one in late 2004.

Applying the Minefield: An Example


Ward Cunningham wrote "I believe the automation required of TDD [Test-Driven Development]
(and Fit) is exempt from the analogy because the searching we are doing is for the best
expression of a program in the presence of tests, not the best tests."
Here's how I think it applies:
Your unit tests might pass or they might fail. You write them so that they will fail in the
event that some interesting expectation is violated. So, you call them tests and they
seem to be tests.
We introduce the minefield criticism the first time you run any given test in your unit test
suite. The first time you run it, it fails, right? Of course, since it wouldn't be TDD
otherwise. The questions below are inspired by the Minefield heuristic "vary your tests
instead of repeating them."
Question: Why run it again?
Answer: Exception #1, "recharge." You run it again because you have added code to
make the test pass. Running the test again is therefore not merely redundant: the
value of the test has been recharged by the code changing around it.
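As a concrete (and entirely hypothetical) picture of that recharge cycle:

    # A sketch of one red-green step, not Ward's code. The test fails (red)
    # on its first run, before add() does the right thing; the product code
    # written afterward is what recharges the value of running the very same
    # test again.

    def test_add():
        assert add(2, 3) == 5   # red before add() is written, green after

    def add(a, b):              # the product code that turns the test green
        return a + b

The test text never changes between the red run and the green run; the code
around it changes, and that change is the recharge.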
Question: During the course of development, but after the first time the test passes,
why not delete it? Why bother to run it again?
Answer: Several reasons. Recharge still applies a little bit, since you may accidentally
break the product during development, but it could be argued that most of those unit
tests most of the time don't fail, and some of them are extremely unlikely to fail even if
you change the code quite a bit. But here you have the second reason: exception #6,
"inexpensive." It's so cheap to create these tests and to run them and to keep them
running, while at the same time they do have some value, even if not a lot. And you have
a third reason for some of the tests: exception #7, "importance." For a good many of the
unit tests, failure would indicate a very serious problem, were it to occur. If you are
testing something that is particularly complex, or involves many interacting sub-systems,
you may also want to repeat because of exception #2, "intermittence." Perhaps
something will fail after the forty-third run because of probabilistic factors in the test.

Finally, there's #3, the "retry" exception, which reminds us that we might not have run
the test correctly, before. As you once said, Ward, something might give off a bad smell
only after you've seen the test run a hundred times or so. In other words, as a result of
running a test many times, you might come to an insight about the product that reveals
a failure that was there all along, but never noticed.
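One cheap way to act on the intermittence point above is simple repetition in a
loop. A minimal sketch, assuming a hypothetical flaky_operation whose failure
depends on uncontrolled timing or probabilistic factors:

    from myapp import flaky_operation  # hypothetical operation under test

    def test_flaky_operation_repeatedly():
        # Exception #2, "intermittence": one correct run does not guarantee
        # discovery, so repeat until the uncontrolled variables have had a
        # fair chance to line up.
        for attempt in range(100):
            assert flaky_operation().ok, f"failed on attempt {attempt + 1}"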
Question: Let's say that I'm a really good developer and though I write good tests, they
just don't fail because I just don't put bugs into my code. I have a whole lot of tests and
they don't fail. What was the sense in investing in such tests?
Answer: Two potential reasons. Exception #10, "avoidance/indifference." You may create
the tests as a form of documentation for future developers and you like them to be
exactly the same in order to minimize the chance that they will fail (and thus be less
useful as documentation). Or maybe you want to impress a customer with your great
software, and they won't be as impressed if the tests don't pass. A second reason is
exception #9, "mandated." You may work this way because your peer group or your
manager requires you to. This is a little like avoidance, except that with a mandate you
do, in fact, want to find bugs. You are searching for them; you are just required to use a
certain technique to do so.
We therefore see that the fairly simple, often repeated unit tests of TDD may indeed be
exempt from the minefield-based argument in favor of varying tests, inasmuch as the
reasons I cited apply. But TDD is not exempt from this kind of heuristic analysis. It is
always reasonable to question the value of repeated tests, and that's what the minefield
invites us to do.
