Marc Fisher, Mingming Cao, Gregg Rothermel, Curtis R. Cook, Margaret M. Burnett
Computer Science Department
Oregon State University
Corvallis, Oregon
grother@cs.orst.edu
ABSTRACT
Spreadsheet languages, which include commercial spreadsheets and various research systems, have had a substantial
impact on end-user computing. Research shows, however,
that spreadsheets often contain faults. Thus, in previous
work, we presented a methodology that assists spreadsheet
users in testing their spreadsheet formulas. Our empirical
studies have shown that this methodology can help end users test spreadsheets more adequately and efficiently; however, the process of generating test cases can still represent a significant impediment. To address this problem, we have
been investigating how to automate test case generation for
spreadsheets in ways that support incremental testing and
provide immediate visual feedback. We have utilized two
techniques for generating test cases, one involving random
selection and one involving a goal-oriented approach. We
describe these techniques, and report results of an experiment examining their relative costs and benefits.
1. INTRODUCTION
Perhaps the most widely used programming paradigm today is the spreadsheet paradigm. Little research, however,
has addressed the software engineering tasks that arise in
creating and maintaining spreadsheets. This inattention is
surprising given the role played by spreadsheets in significant matters such as budgets, grades, and business decisions.
In fact, recent research reports that spreadsheets often
contain faults. A survey of the literature [20] provides details: in four field audits of operational spreadsheets, faults
were found in an average of 20.6% of the spreadsheets audited; in eleven experiments in which participants created
spreadsheets, faults were found in an average of 60.8% of
those spreadsheets; in four experiments in which participants inspected spreadsheets for faults, an average of 55.8%
of those faults were missed. Compounding these problems
is the unwarranted confidence spreadsheet users have in the
correctness of their spreadsheets [3, 29].
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
Copyright 2001 ACM 1-58113-472-X ...$5.00.
To help address these problems, in previous work we presented a methodology for testing spreadsheet formulas [22,
24]. Our "What You See Is What You Test" (WYSIWYT)
methodology lets spreadsheet users incrementally apply test
inputs and validate outputs, and provides visual feedback
about the effectiveness of their testing. Empirical studies
have shown that this methodology can help users test their
spreadsheets more adequately and more efficiently [25].
As presented to date, the WYSIWYT methodology has
relied solely on the intuitions of spreadsheet users to identify test cases for their spreadsheets. In general, the process
of manually identifying appropriate test cases is laborious,
and its success depends on the experience of the tester. This
problem is especially serious for users of spreadsheet languages, who typically are not experienced programmers and
lack background in testing. Existing research on automated
test case generation (e.g., [6, 8, 10, 13, 14, 15, 21]), however, has been directed at imperative languages, and we can
find no research specifically addressing automated test case
generation for spreadsheet languages.
To address this problem, we have been investigating how
to automate the generation of test cases for spreadsheets in
ways that support incremental testing and provide immediate visual feedback. We have utilized two techniques for test
case generation: one using random selection and one using
a goal-oriented approach [10]. We describe these techniques
and their integration into our WYSIWYT methodology, and
report results of experiments examining their efficiency and effectiveness.
2. BACKGROUND
Users of spreadsheet languages "program" by specifying
cell formulas. Each cell's value is defined by that cell's formula, and as soon as the user enters a formula, it is evaluated
and the result is displayed. The best-known examples of
spreadsheet languages are found in commercial spreadsheet
systems, but there are also many research systems (e.g. [4,
5, 16, 27]) based on this paradigm.
In this article, we present examples of spreadsheets in the
research language Forms/3 [4]. Figure 1 shows an example
of a Forms/3 spreadsheet, GrossPay, which calculates an
employee's weekly pay given a payrate and the hours they
worked during that week. As the figure shows, Forms/3
spreadsheets, like traditional spreadsheets, consist of cells;
however, these cells are not restricted to grids. Also, in the
figure, cell formulas are displayed, but in general the user
can display or hide formulas.
Figure 2: Forms/3 spreadsheet GrossPay with testing information displayed after a user validation.
have been used in producing a validated value. In our example, all interactions ending at references in the formula for
TotalHours have been exercised; hence, that cell's border is
now fully blue (black in this paper).
If users choose, they can also view interactions caused by
cell references by displaying dataflow arrows between cells
or subexpressions in formulas; in the example, the user has
chosen to view interactions ending at cell GrossPay. These
arrows depict testedness information at a finer granularity,
following the same color scheme as for the cell borders.
If the user next modifies a formula, interactions potentially affected by this modification are identified by the system, and information on those interactions is updated to
indicate that they require retesting. The updated information is immediately reflected in changes in the various visual
indicators just discussed (e.g., replacement of blue border
colors by less blue colors).
Although a user of our methodology need not be aware of it, the methodology is based on the use of a dataflow test adequacy criterion adapted from the output-influencing-all-du-pairs dataflow adequacy criterion defined for imperative programs [9]; for brevity, we call our adaptation of this criterion the du-adequacy criterion. We define this criterion precisely in [22]; here, we summarize that presentation.
The du-adequacy criterion is defined through an abstract model of spreadsheets called a cell relation graph (CRG). Figure 3 shows the CRG for spreadsheet GrossPay. A CRG consists of a set of cell formula graphs (enclosed in rectangles in the figure) that summarize the control flow within formulas, connected by edges (dashed lines in the figure) summarizing data dependencies between cells. Each cell formula graph is a directed graph, similar to a control flow graph for imperative languages, in which each node represents an expression in a cell formula and each edge represents flow of control between expressions. There are three types of nodes: entry and exit nodes, representing initiation and termination of the evaluation of the formula; definition nodes, representing simple expressions that define a cell's value; and predicate nodes, representing predicate expressions in formulas. Two edges extend from each predicate node; these represent the true and false branches of the predicate expression.
A definition of cell C is a node in C's formula graph representing an expression that defines C, and a use of C is either
Figure 3: Cell relation graph (CRG) for spreadsheet GrossPay, comprising formula graphs for cells Mon, Tues, Wed, Thurs, Fri, PayRate, TotalHours, and GrossPay; GrossPay's formula graph contains predicate node 23 ("if TotalHours <= 40"), whose true branch leads to definition node 24 ("Payrate * TotalHours").
executed. Ferguson and Korel distinguish two types of path-oriented techniques: those based on symbolic execution, and those that are execution-oriented. Techniques based on symbolic execution [6, 8, 13, 18] use symbolic execution to find the constraints, in terms of input variables, that must be satisfied in order to execute a target path, and attempt to solve this system of constraints. The solution provides a test case for that path. A disadvantage of these techniques is that they can waste effort attempting to find inputs for infeasible paths; they can also require large amounts of memory to store the expressions encountered during symbolic execution, and powerful constraint solvers to solve complex equalities and inequalities. They may also have difficulties handling complex expressions. Execution-oriented techniques [15] alleviate some of these difficulties by incorporating dynamic execution information into the search for inputs, using function minimization to solve subgoals that contribute toward an intended coverage goal.
Goal-oriented test case generation techniques [10, 14], like execution-oriented techniques, are also dynamic, and use function minimization to solve subgoals leading toward an intended coverage goal; however, goal-oriented techniques focus on the final goal rather than on a specific path, concentrating on executions that can be determined (e.g., through the use of data dependence information) to possibly influence progress toward the goal. Like execution-oriented methods, these techniques take advantage of the actual variable values obtained during execution to try to solve problems with complex expressions; however, by not focusing on specific paths, the techniques gain effectiveness [10].
communication, Jeff Offutt.
Due to space limitations, the description is somewhat abbreviated; a more detailed version, with additional description of Ferguson and Korel's approach, can be found in [11].
4 An additional requirement present for imperative programs, namely that the definition "reach" the use, is achieved automatically in spreadsheets of the type we consider, provided the definition and use are both executed, since these spreadsheets do not contain loops or "redefinitions" of cells [22].
(or u). For example, the constraint paths for nodes 20 and 25 in the CRG for GrossPay are (19,20) and (22,23,25), respectively. The constraint path for du-association (d,u) consists of the concatenation of the constraint paths for d and u. Thus, for example, the constraint path for du-association (20,25) in GrossPay is (19,20,22,23,25).
When considering du-association (d,u), Chaining first constructs the constraint path for (d,u). Given our methodology, it is necessarily the case that under current inputs, (d,u) is not exercised; if (d,u) were exercised, it would not be included in UsefulDUs. Thus, it must be the case that under current inputs, one or more predicates in the constraint path are being evaluated in a manner that causes nodes in the constraint path not to be reached. Chaining's task is to alter this situation by finding inputs that cause all nodes on the constraint path to be reached.
To do this, Chaining compares the constraint path for (d,u) to the path built by concatenating the execution traces for the cells containing d and u. These execution traces consist of the lists of CRG nodes executed in the cells during the cells' most recent evaluations, and they can be retrieved from the spreadsheet engine, which previously collected them for use by the WYSIWYT subsystem. In the example we have been considering, the relevant concatenated execution trace, assuming the spreadsheet's input cells have values as shown in Figure 2, is (19,20,22,23,24). Chaining identifies the break point in the constraint path: the pair of nodes consisting of the first node in the concatenated execution traces not present in the constraint path, together with the predicate node immediately preceding it (or, less formally, the earliest "incorrectly taken branch" in the execution traces). In our example, the break point is (23,24).
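A sketch of the break-point computation (ours, not the authors' code), using the GrossPay example from the text:

```python
# Find the break point: the first node in the concatenated execution
# trace that is not on the constraint path, paired with the predicate
# node immediately preceding it in the trace.

def find_break_point(constraint_path, trace):
    """Return (predicate, wrong_node) where the trace first diverges
    from the constraint path, or None if no divergence occurs."""
    on_path = set(constraint_path)
    prev = None
    for node in trace:
        if node not in on_path:
            return (prev, node)   # prev is the mispredicted predicate
        prev = node
    return None

# Constraint path for du-association (20, 25): concatenation of the
# constraint paths for the definition and the use.
constraint_path = (19, 20) + (22, 23, 25)
# Concatenated execution trace under the inputs shown in Figure 2.
trace = (19, 20, 22, 23, 24)
print(find_break_point(constraint_path, trace))  # (23, 24)
```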
Given a break point (n1,n2), Chaining's next task is to find inputs that cause the predicate in n1 to take the opposite branch. To do this, the technique uses a constrained linear search procedure over the input space; we describe this procedure later in this section. Three outcomes of this search procedure are possible. (1) The search does not succeed. In this case, n1 is designated a problem node and dealt with by a procedure described momentarily. (2) The search succeeds, and du-association (d,u) is now executed. In this case, the technique has succeeded and terminates. (3) The search succeeds, and inputs that cause the desired branch from the predicate in n1 have been found, but a subsequent predicate on the constraint path has not been satisfied (i.e., another break point exists), and (d,u) has not yet been executed. In this case, Chaining repeats the above process, finding the next break point and initiating a new search, to try to make further progress.5
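The three-outcome loop just described can be sketched schematically (this is our own illustration, not the authors' implementation; the constrained linear search is stubbed as an assumed `search` callback, and `execute` stands in for the spreadsheet engine's trace collection):

```python
# Schematic sketch of Chaining's outer loop over break points.
# `execute(inputs)` returns the concatenated execution trace;
# `search(predicate, inputs)` tries to flip the branch taken at a
# predicate node and returns new inputs, or None on failure.

def diverging_predicate(constraint_path, trace):
    """Predicate node at which the trace first leaves the constraint path."""
    on_path = set(constraint_path)
    prev = None
    for node in trace:
        if node not in on_path:
            return prev
        prev = node
    return None  # no divergence: the constraint path was followed

def chain(constraint_path, inputs, execute, search):
    while True:
        trace = execute(inputs)
        predicate = diverging_predicate(constraint_path, trace)
        if predicate is None:
            return inputs           # outcome 2: du-association exercised
        new_inputs = search(predicate, inputs)
        if new_inputs is None:
            return None             # outcome 1: problem node, search failed
        inputs = new_inputs         # outcome 3: repeat at the next break point

# Toy model: predicate 23 takes its false branch (node 25) only when
# the input makes TotalHours exceed 40.
execute = lambda inp: (19, 20, 22, 23, 25) if inp["Mon"] > 40 else (19, 20, 22, 23, 24)
search = lambda pred, inp: {"Mon": 50}
print(chain((19, 20, 22, 23, 25), {"Mon": 8}, execute, search))  # {'Mon': 50}
```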
In the example we have been considering, the only outcomes possible are outcomes 1 and 2: the search fails, or
5 A variation on this algorithm lets Chaining report success on executing any du-association in UsefulDUs, an event which can occur if UsefulDUs contains more than one du-association and if, in attempting to execute one specific du-association, Chaining happens on a set of inputs that execute a different du-association in UsefulDUs. The results of this variation make sense from an end user's point of view, because the fact that Chaining iterates through du-associations is incidental to the user's request: the user requested only that some du-association in a set of such du-associations be executed. Our prototype in fact implements this variation; however, to simplify the presentation, we focus here on the single du-association being iterated on.
Relational Operator   Branch Function
l < r                 r - l
l > r                 l - r
l <= r                if r - l >= 0 then (r - l) + 1 else r - l
l >= r                if l - r >= 0 then (l - r) + 1 else l - r
l = r                 if l = r then 1 else -|l - r|
l ≠ r                 |l - r|
Logical Operator   Branch Function
and                True branch:  if f(l,true) > 0 and f(r,true) > 0
                                 then f(l,true) + f(r,true)
                                 else min(f(l,true), f(r,true))
                   False branch: if f(l,false) > 0 and f(r,false) > 0
                                 then f(l,false) + f(r,false)
                                 else max(f(l,false), f(r,false))
or                 True branch:  if f(l,true) > 0 and f(r,true) > 0
                                 then f(l,true) + f(r,true)
                                 else max(f(l,true), f(r,true))
                   False branch: if f(l,false) > 0 and f(r,false) > 0
                                 then f(l,false) + f(r,false)
                                 else min(f(l,false), f(r,false))
not                True branch:  f(e,false)
                   False branch: f(e,true)
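The branch-function combination rules for the logical operators can be written as executable functions. The sketch below is ours, under the assumption (our reading of the tables) that a positive branch-function value indicates the corresponding branch condition is satisfied, and that the magnitude estimates distance from satisfaction:

```python
# Illustrative branch functions (our sketch; the sign convention assumed
# here is: positive value = desired branch's condition holds).

def bf_less(l, r):          # branch function for constraint l < r
    return r - l

def bf_equal(l, r):         # branch function for constraint l = r
    return 1 if l == r else -abs(l - r)

def bf_and(fl, fr):         # both operands' branch functions must be satisfied;
    # the combined value is positive iff both fl and fr are positive
    return fl + fr if fl > 0 and fr > 0 else min(fl, fr)

def bf_or(fl, fr):          # at least one operand's branch function satisfied;
    # the combined value is positive iff fl or fr is positive
    return fl + fr if fl > 0 and fr > 0 else max(fl, fr)

# For the overtime branch of GrossPay, "TotalHours <= 40" must be false;
# the search manipulates inputs until the relevant branch function is
# positive.  With TotalHours = 45, the constraint 45 < 41 is unsatisfied:
print(bf_less(45, 41))   # -4 (negative: unsatisfied, distance 4)
```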
Comparison of Approaches
Space does not permit a detailed comparison of Ferguson and Korel's [10] Chaining technique and our adaptation of that technique. For readers familiar with that approach, however, we summarize the differences between the approaches. For brevity in the following, we refer to Ferguson and Korel's technique as CF-Chaining, due to its use of control flow information.
These differences primarily involve simplifications to Ferguson and Korel's approach, made possible by the spreadsheet evaluation model. As such, this work illustrates the potential suitability of Ferguson and Korel's overall approach to that evaluation model. These simplifications improve the efficiency of the approach, a critical matter for spreadsheets, which require rapid response.
4. EMPIRICAL STUDIES
Our test case generation methodology is intended to help
users achieve du-adequate testing, which is communicated to
the user with devices such as cell border colors. Determining
whether this methodology achieves this goal requires user
studies; however, before undertaking such studies we must
first address more fundamental questions: namely, whether
the methodology can in fact generate inputs that exercise
RQ2: How do
RQ3: Does the provision of range information alter the effectiveness and efficiency of Random and Chaining?
4.1 Subjects
We used ten spreadsheets as subjects (see Table 3). These
spreadsheets were created by experienced Forms/3 users to
perform a wide variety of tasks: Digits is a number-to-digits
splitter, Grades translates quiz scores into letter grades, FitMachine and MicroGen are simulations, NetPay calculates
an employee's income after deductions, Budget determines
whether a proposed purchase is within a budget, Solution is
a quadratic equation solver, NewClock is a graphical desktop
clock, RandomJury determines statistically whether a panel
of jury members was selected randomly, and MBTI implements a version of the Myers-Briggs Type Indicator (a personality test). Table 3 provides data indicating the complexity of the spreadsheets considered, including the numbers of
cells, du-associations, expressions, and predicates contained
in each spreadsheet.
Our test case generation prototype handles only integer
type inputs; thus, all input cells in these subject spreadsheets are of integer type. Since commercial spreadsheets
contain infeasible du-associations, all subject spreadsheets
in our experiments also contain infeasible du-associations.
To measure the effectiveness of our techniques at exercising
feasible du-associations in this experiment, we determined
all the infeasible du-associations through inspection.
4.2 Measures
To investigate our research questions, we use two measures: effectiveness and efficiency. Since our underlying testing
system uses du-adequacy as a testing criterion, we measured
a test case generation technique's effectiveness by the percentage of feasible du-associations exercised by the test cases
it generated. To measure a test case generation technique's
efficiency, we measured the amount of (wall clock) time required to generate a test case that exercises one or more
du-associations.
spreadsheet level. The three independent variables manipulated in this experiment are:
effectiveness
efficiency
4.5.1 Effectiveness
We consider two different views of effectiveness. First, we consider the ability of our techniques to generate test cases to cover all the feasible du-associations in the subject spreadsheets; we refer to this as their ultimate effectiveness. Table 4 lists, for each of the subject spreadsheets, the ultimate effectiveness of Random and Chaining with and without range information, averaged across 35 runs.
Spreadsheet    Random          Random       Chaining        Chaining
               without Range   with Range   without Range   with Range
Digits             59.44%        97.89%       100.00%        100.00%
FitMachine         50.50%        50.50%        97.93%         97.90%
Grades             67.10%        99.82%        99.71%         99.89%
MBTI               25.64%       100.00%        99.87%         99.64%
MicroGen           71.43%        99.18%       100.00%        100.00%
NetPay             40.00%       100.00%       100.00%        100.00%
NewClock           57.14%       100.00%        99.01%         99.36%
Budget             96.57%       100.00%       100.00%        100.00%
RandomJury         78.78%        83.23%        94.29%         92.69%
Solution           57.69%        78.79%       100.00%        100.00%

Table 4: Ultimate effectiveness of each technique on each subject spreadsheet, averaged across 35 runs.
cases in which there was the greatest potential for improvement, addition of range information actually decreased (by less than 2%) ultimate effectiveness. To determine whether the differences in ultimate effectiveness between Chaining with and without range information were statistically significant, we used unpaired t-tests on pairs of effectiveness values per technique per spreadsheet. The differences between the techniques were statistically significant only for MBTI (p < .05).8
Random without range information behaved much differently than Chaining. In only one case did Random without range information achieve an ultimate effectiveness greater than 90% (Budget), and in six of ten cases it achieved an ultimate effectiveness less than 60%. Ultimate effectiveness also varied widely for this technique, ranging from 25.64% to 96.57%. On all ten spreadsheets, the ultimate effectiveness of Random without ranges was less than that of Chaining without ranges; differences between the techniques ranged from 3.4% to 74.2% across spreadsheets (average overall difference 38%). Unpaired t-tests showed that the effectiveness differences between Random without ranges and Chaining without ranges were all statistically significant (p < .05).
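For readers who wish to reproduce such comparisons, an unpaired t statistic can be computed with the standard library alone. This sketch uses Welch's unequal-variance form (the paper does not specify the exact variant used, and the sample values below are illustrative placeholders, not the experiment's data):

```python
# Stdlib-only sketch of an unpaired (Welch's) t statistic, as could be
# used to compare per-run effectiveness values of two techniques.
from math import sqrt
from statistics import mean, variance   # variance() is the sample variance

def welch_t(a, b):
    va, vb = variance(a), variance(b)
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb
    t = (mean(a) - mean(b)) / sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Placeholder samples (NOT the experiment's data):
t, df = welch_t([1, 2, 3], [2, 3, 4])
print(round(t, 3), round(df, 1))   # -1.225 4.0
```

The resulting t and df would then be compared against the t distribution at the chosen significance level (.05 in the paper).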
In contrast to the results observed for Chaining, addition of range information to Random did affect its performance, in all but one case increasing ultimate effectiveness, and in seven of ten cases increasing it by more than 20%. Unpaired t-tests showed that all increases were statistically significant; effectiveness remained unchanged only on FitMachine.
Addition of range information to Random also helped its performance in comparison to Chaining. On two spreadsheets, MBTI and NewClock, Random with range information achieved greater ultimate effectiveness than Chaining with range information; however, this difference, though statistically significant, was less than 1%. On five spreadsheets (Digits, FitMachine, MicroGen, RandomJury, and Solution), on the other hand, Chaining with range information resulted in statistically greater ultimate effectiveness than Random with range information, and in two of these cases the difference exceeded 20%. (On Grades, NetPay, and Budget, differences were not statistically significant.)
Our second view of effectiveness (Table 5) considers the ability of the four test case generation techniques, across all ten subject spreadsheets, to generate test cases reaching four different levels of effectiveness. The table shows the number of times, out of 350 runs, each technique achieved at least 50%, 75%, 90%, or 100% effectiveness. As the table shows, Random without range information is the only technique to sometimes fail to reach 50% effectiveness, and is far less successful than the other techniques at reaching higher levels of effectiveness. Also, the Chaining techniques achieved 75%, 90%, and 100% effectiveness more often than Random with range information, and Chaining without range information achieved 90% and 100% effectiveness a few more times than Chaining with range information.
Technique                 50%   75%   90%   100%
Random without Range      274    76    35      5
Random with Range         350   312   245    218
Chaining without Range    350   349   347    252
Chaining with Range       350   350   342    239
Table 5: Number of runs (out of 350) in which techniques reached specific levels of effectiveness.
Technique              Total test cases   Test cases generated   Test cases generated
                       generated          in < 4 secs            in < 10 secs
Random w/o Range        1318               1026 (77.8%)           1114 (84.5%)
Random w/ Range         9731               6886 (70.8%)           8239 (84.7%)
Chaining w/o Range     12750              10354 (81.2%)          11262 (88.3%)
Chaining w/ Range      12115              11001 (90.8%)          11605 (95.8%)
Table 6: Number and percentage of test cases generated in less than 4 and less than 10 seconds, across
all 10 spreadsheets.
4.5.2 Efficiency
We also consider two views of efficiency. First, Table 6 reports response time characteristics of the test case generation techniques; that is, the amount of time the user must wait after pushing the "Help Me Test" button until a suitable test case is displayed. The table shows, for each technique, the total number of test cases successfully generated by that technique, and the numbers and percentages of the times in which test case generation succeeded in less than 4 seconds, and less than 10 seconds, respectively.9
As the table illustrates, with all techniques, a test case was
generated within 4 seconds over 70% of the time, and within
10 seconds over 84% of the time. With or without range
information, Chaining achieved faster response times more
often than Random. The table also shows that Random without ranges was more responsive than Random with ranges at the 4-second mark; however, this should be qualified by the fact that Random without ranges was able to generate only about one-eighth as many test cases as Random with ranges.
Finally, the table shows that the use of range information
did improve Chaining's response time: in fact, Chaining
with range information achieved the greatest number of low
response times, succeeding within 4 seconds on over 90% of
the runs, and within 10 seconds on over 95% of the runs.
Table 7 provides a second view, showing the efficiencies of the techniques relative to various levels of coverage across all ten subject spreadsheets as the techniques are applied repeatedly; that is, how long test generation might take to attain the desired level of coverage. For each of the four techniques, four levels of coverage (50%, 75%, 90%, and 100%) are considered.
As the table shows, both Chaining techniques achieved 100% coverage much more quickly (over 400 seconds more quickly) than their Random counterparts. Chaining with range information was fastest, reaching each level of coverage more quickly than Chaining without range information; however, the time differences between the two Chaining techniques never exceeded 94 seconds. Further, Chaining without range information was somewhat slower than Random with range information at reaching the 50%, 75%, and 90% levels of coverage, although the difference never exceeded 54 seconds, and decreased as coverage level increased.
Two entries in this table bear further scrutiny. Random without range information appears surprisingly fast at the 90% coverage level; however, this should be interpreted in light of the small number of cases in which the technique
9 In the literature on usability, 4 seconds is cited as an appropriate limit for common tasks [26], and 10 seconds as a limit for keeping a user's attention focused on a task [17].
Coverage   Random          Random       Chaining        Chaining
           without Range   with Range   without Range   with Range
100%           622.4           521.1        119.4            25.9
Table 7: Average cumulative time (seconds) required to achieve given levels of coverage, considering only test cases that reached those levels (see numbers reported in Table 5).
achieved that level of coverage (see Table 5). Specifically, the technique reached the 90% coverage level on only 35 test cases (all on one spreadsheet), so these results simply show that on that one spreadsheet, 90% coverage could be achieved quickly. Similarly, results for Random without range information at the 100% coverage level are based on only the five test cases on which the technique reached that level.
4.7 Discussion
Keeping in mind the limitations imposed by the threats to validity just described, our results have several implications.
First, our results suggest that, from the point of view of effectiveness and efficiency, automated test case generation for spreadsheets seems to be feasible. In the cases we considered, Chaining was highly effective (both with and without range information) at generating test cases, achieving 100% coverage of feasible du-associations on half of the spreadsheets considered, greater than 97% coverage on all but one spreadsheet, and greater than 92% coverage on that one spreadsheet. In a high percentage of cases, Chaining achieved this coverage within reasonable time limits. These results thus motivate further work on test case generation techniques for spreadsheets, and on the Chaining technique, and suggest that studies of whether end users can profit from the use of such techniques would be appropriate.
Our results also highlight several tradeoffs between techniques. First, we had initially conjectured that with spreadsheets, random test case generation might perform nearly as well as a more complex heuristic, thus providing a more easily implemented approach to test case generation. Our experiments suggest that this conjecture is false. In the cases we observed, Random techniques were much less effective at covering du-associations in spreadsheets than Chaining techniques, over half of the time achieving less than 80% coverage. Further, Random techniques were much less consistent than Chaining techniques in terms of effectiveness: whereas Chaining's effectiveness ranged only from 92% to 100% coverage, the effectiveness of Random techniques ranged from 25% to 100% coverage, a range nine times larger than that of the Chaining techniques. Random techniques also exhibited fast response times less often than Chaining techniques, and although a Random technique using range information was capable of achieving low levels of coverage more quickly than Chaining without range information, the latter outperformed the former at high levels of coverage.
At the outset of this work we also postulated that provision of range information would benefit both test case generation techniques. Where Random was concerned, this proved correct: Random with ranges often achieved far greater levels of coverage than Random without ranges. We were surprised, however, that Chaining did not benefit, in terms of effectiveness, from the provision of range information. In fact, Chaining without range information was marginally better than Chaining with range information at achieving higher coverage levels. Also, while range information did help Chaining achieve results more quickly than Chaining without such information, the speedup was not large. On reflection, we suspect that the Chaining algorithm, restricted by ranges, is less able than its unrestricted counterpart to jump beyond local minima/maxima and find solutions, though when it does find solutions it can do so in fewer steps.
Overall, these results support some stronger suggestions
about automated test case generation for spreadsheets:
5. ACKNOWLEDGMENTS
We thank the Visual Programming Research Group for their
work on Forms/3 and their feedback on our methodologies.
This work has been supported by the National Science Foundation by ESS Award CCR-9806821 and ITR Award CCR-0082265 to Oregon State University.
6. REFERENCES