Smart Experimental Design
Course Notes

Vlaams Instituut voor Biotechnologie

Luc Wouters
April 2016
Contents

1 Introduction
2
3
  3.2 Types of experiments
4
  4.1 Some terminology
  4.2 Defining the experimental unit
  4.4 Variation is omnipresent
  4.8.1.2 Blinding
  4.8.1.3 The presence of a technical protocol
  4.8.1.4 Calibration
  4.8.1.5 Randomization
  4.8.1.6 Random sampling
  4.8.1.7 Standardization
  4.8.2 Strategies for controlling variability by experimental design
    4.8.2.1 Replication
    4.8.2.2 Subsampling
    4.8.2.3 Blocking
    4.8.2.4 Covariates
  4.9 Simplicity of design
5
  5.1 Error-control designs
    5.1.2.2 Efficiency of the randomized complete block design
  5.2 Treatment designs
    5.2.1 One-way layout
    5.2.2 Factorial designs
6
  Mead's resource requirement equation
  Sequential plans
7
  Significance tests
  Multiplicity
8
9
  9.1.1 Experimental design
  9.1.2 Statistical methods
  9.2.2 Graphical displays
10 Concluding Remarks and Summary
  10.2 Recommended reading
  10.3 Summary
References
Appendices
Appendix A Glossary of Statistical Terms
Appendix B
  B.1.1 MS Excel
  B.1.2 R-Language
  B.2.2 R-Language
1. Introduction
More often than not, we are unable to reproduce findings published by researchers in journals.
Glenn Begley, Vice President Research Amgen (2015)
The way we do our research [with our animals] is stone-age.
Ulrich Dirnagl, Charité University Medicine Berlin (2013)
Over the past decade, the biosciences have been confronted with a growing concern about the reproducibility of research findings. This lack of reliability can be attributed in large part to statistical fallacies, misconceptions, and other methodological errors, ... that there is a definite need to transform and improve current research practice.

... analysis carried out by Potti et al. (2006), but the ... based on the erroneous results. ... asked to have a look at the data. What they found was a mess of poorly conducted data analysis (Baggerly and Coombes, 2009). Some of the data was mislabeled, some samples were duplicated in the data, and sometimes samples were marked as both sensitive and resistant. In 2011, after several corrections, the original study by Potti et al. was retracted.

Example 1.2. In 2009, a group of researchers from Amgen ...countered. During a decade, Begley and Ellis (2012) identified a set of 53 landmark publications in preclinical cancer research from reputable labs. A team of 100 scientists tried to replicate the results. To their surprise, in 47 of the 53 cases the findings could not be

2 Formally, we consider replicability as the replication of scientific findings using independent investigators, methods, data, equipment, and protocols. Replicability has long been, and will continue to be, the standard by which scientific claims are evaluated. On the other hand, reproducibility means that, starting from the data gathered by the scientist, we can reproduce the same results, p-values, confidence intervals, tables and figures as those reported by the scientist (Peng, 2009).
replicated. This outcome was particularly disturbing since Begley and Ellis made every effort to work in close collaboration with the authors of the original papers and even tried to replicate the experiments in the laboratory of the original investigator. In some cases, 50 attempts were made to reproduce the original data, without obtaining the claimed result (Begley, 2012). What is even more troubling is that Amgen's findings were consistent with those of others. In a similar setting, Bayer researchers found that only 25% of the original findings in target discovery could be validated (Prinz et al., 2011).
Example 1.3. Séralini et al. (2012) published a 2-year feeding study in rats investigating the health effects of genetically modified (GM) maize NK603 with and without glyphosate-containing herbicides. The authors of the study concluded that GM maize NK603 and low levels of glyphosate herbicide formulations, at concentrations well below officially set safe limits, induce severe adverse health effects, such as tumors, in rats. Apart from the publication, Séralini also presented his findings in a press conference, which was widely covered in the media, showing shocking photos of rats with enormous tumors. Consequently, this study had a severe impact on the general public and also on the interest in GMOs in Russia and Kenya. However, shortly after its publication many scientists, among them also researchers from the VIB (Vlaams Instituut voor Biotechnologie, 2012), heavily criticized the study and expressed their concerns about the validity of the findings. A polemic debate started with opponents of GMOs and also within the scientific community, which inspired the media to refer to the controversy as "The Séralini affair" or "Séralini tumor-gate". Subsequently, the European Food Safety Authority (2012) thoroughly scrutinized the study and found that it was of inadequate design, analysis and reporting. Specifically, the number of animals was considered too small and not sufficient for reaching a solid conclusion. Eventually, the journal retracted the publication, which did not reach the journal's threshold of publication (Hayes, 2014).¹

... amount of work in this study and the tight deadlines that were set up. A sample size evaluation conducted after the study was completed indicated that sampling as few as 100 cells per lobe would have been without appreciable loss of information.

Figure: Root causes of irreproducibility in preclinical research (after Freedman et al., 2015): unreliable biological reagents and reference materials, 36.1%; improper study design, 27.6%; inadequate data analysis and reporting, 25.5%; laboratory protocol errors, 10.8%.

Doing good science and producing high-quality data should be the concern of every serious research scientist. Unfortunately, as shown by the first three examples, this is not always the case. As mentioned above, there is a genuine concern about the reproducibility of research findings and it has been argued that most research findings are false (Ioannidis, 2005). In a recent paper, Begley and Ioannidis (2015) estimated that 85% of biomedical research is wasted at large. Freedman et al. (2015) tried to identify the root causes of the replicability problem and to estimate its economic impact. They estimated that in the United States alone approximately $28 billion a year is spent on preclinical research that cannot be replicated. The main problems causing this lack of replicability are summarized in the figure.

1 Séralini managed to republish the study in Environmental Sciences Europe (Séralini et al., 2014), a journal with a considerably lower impact factor.
publications. Fang et al. (2012) found that 21.3% of the retractions were due to error, while 67.4% of the retractions were attributable to misconduct.

In contrast to the techniques of statistical analysis, statistical thinking and reasoning are generalist skills that focus on their application to specific research problems and what the implications are in terms of data collection and experimentation. ... hand, we will consider statistical reasoning as ... Studies such as those by Potti et al. (2006), Scholl ... reasoning.
2.1

2.1.1 Observational versus experimental research

2.1.2
There are two basic approaches to implementing a scientific research project. One approach is to conduct an observational study², in which we investigate the effect of naturally occurring variation and the assignment of treatments is outside the control of the investigator. Although there are often good and valid reasons for conducting an observational study, their main drawback is that the presence of confounding variables can bias the comparison of treatments. A well-designed experimental study eliminates the bias caused by confounding variables. The great power of a controlled experiment, provided it is well conceived, lies in the fact that it allows ...

The research process can be divided into five phases, each with its own deliverable:

1. definition
2. design
3. data collection
4. analysis
5. reporting

Phase             Deliverable
Definition        Research proposal
Design            Protocol
Data collection   Data set
Analysis          Conclusions
Reporting         Report

2 also ...
proposal, stating the hypothesis related to the research (research hypothesis) and the implications or predictions that follow from it. The design of the experiment needed for testing the research hypothesis is formalized in a written protocol. After the experiment has been carried out, the data are collected, providing the experimental data set. Statistical analysis of this data set will yield conclusions that answer the research question by accepting or rejecting the research hypothesis.

2.2

Definition and reporting belong to the abstract world (Figure 2.3); they are conceptual and complex tasks requiring a great deal of abstract reasoning. Conversely, experimental work and data collection are very concrete, measurable tasks dealing with the practical
2.1.3

Scientific research is not a simple static activity but, as depicted in Figure 2.2, an iterative and highly dynamic process. A research project is carried out within some organizational or manage...

Based on the relative fraction of the available resources they are willing to spend at each phase, different archetypes of researchers can be distinguished (Figure 2.4): ... analysis time; ...

Figure 2.4. Archetypes of researchers based on the relative fraction of the available resources that they are willing to spend at each phase of the research process. D(1): definition phase, D(2): design phase, C: data collection, A: analysis, R: reporting
Statistics: specialist skill; science; closure, seclusion; introvert; discrete interventions; valued as a skill in itself.

Statistical thinking: generalist skill; technology; informed practice; principles, patterns; ambiguous, dialogue; extravert; permeates the research process; builds on good thinking.
the lab freak, who strongly believes that if enough data are collected, something interesting ...

In contrast to statistics, which operates in a closed and secluded mathematical context, statistical thinking is a generalist skill based on informed practice and focused on the applications of ... design. The scientist ... is skilled in statistical thinking for his application area, or he collaborates ... In this capacity, the statistical thinker acts as a diagnoser of the research project's success. Whereas the role of statistics in the research process is limited to discrete interventions, statistical thinking ... helps the sci... with its associated cost ...

2.3
3. The design of an experiment balances between its internal validity (proper control of noise) and its external validity (the experiment's generalizability).
4. Good experimental practice provides the key to bias minimization.
5. Good experimental design is the key to the control of variability.
6. Experimental design integrates various disciplines.
7. A priori consideration of statistical power is an indispensable pillar of an effective experiment.
3.1

... (1984). ... contribute to knowledge about the subject? Sometimes a preliminary exploratory experiment is useful to generate clear questions that will be answered in the actual experiment. The study objectives should be stated as explicitly as possible. It is wise to limit the objectives of a study to a maximum of, say, three (Sel... ). ...

... imental model. By doing so, an auxiliary hypothesis ... Using the same experiment also for its confirmation involves circular reasoning. Exploratory studies tend to consist of a pack... The predictions will follow logically from the hypotheses that we wish to test, and not from other rival hypotheses. Good predictions will also lead to ... be tested.
3.2 Types of experiments

In ...

... of research hypotheses. A second type of experiment is the optimization experiment, which has the objective of finding conditions that give rise to a maximum or minimum response, e.g. the combination of temperature and pressure that gives rise to the maximum yield in a chemical production plant. Dose ...

3.3

... data. Preliminary experiments on a limited scale, or pilot experiments, are especially useful when ... A pilot study can be of help to make sure that a sensible research question was asked. For instance, if ... They also allow the researcher to practice, validate and standardize ... plate experiments. These sources of variation can be plate effects, row effects, column effects, and the ... dataset.
4.1 Some terminology

... the specific treatment, the effect of the experimental design, and an error component that describes the ... Each treatment is independently applied.

4.2 Defining the experimental unit

The characteristic feature of the experimental unit is that the response in one unit is unaffected by the response obtained in another unit, and that the occurrence of a high or low result in one unit has no effect on the result of another unit. Correct identification ... of a particular response variable. Researchers sometimes err in identifying the proper basic unit in their experimental material. In these cases, the term pseudo-replication is often used (Fry, 2014). Pseudo-replication can result in a false estimate of the precision of the experimental results, leading to invalid conclusions (Lazic, 2010).

4.3 ...
Figure 4.2. Morphometric analysis of the diameter of bile canaliculi in wild-type and Cx32-deficient liver. Means ± SEM from three livers. *: P < 0.005 (after Temme et al. (2001))

Temme et al. (2001) compared two genetic strains of mice, wild-type and connexin 32 (Cx32)-deficient. They measured the diameters of bile canaliculi in the livers of three wild-type and of three Cx32-deficient animals, making several observations on each liver. Their results are shown in Figure 4.2. It should be clear that Temme et al. (2001) mistakenly took the cells, which were the observational units, for experimental units and used them also as units of analysis. If we consider the genotype as the treatment, then it is clear that not the cell but the animal is the experimental unit. Moreover, cells from the same animal will be more alike than cells from different animals. This interdependency of the cells invalidates the statistical analysis as it was carried out by the investigators. Therefore, the correct experimental unit and unit of analysis is the animal, not the cell.¹

The choice of plants as experimental units is incorrect, because the observations of growth made on adjacent plants within a plot are not independent of one another. If one plant is genetically superior and grows particularly large, it will tend to shade its inferior neighbors and cause them to grow more slowly. An appropriate design would be to put the plants into separate pots and randomly treat each pot with fertilizer or not. In this case there is no competition for resources between plants. Another alternative would make use of several fertilized and unfertilized plots of ground. The outcome would then be the crop yield in each plot.

1 If we recalculate the standard errors of the mean (SEM) using the appropriate number of experimental units, then they are a factor of 7–10 larger than the reported ones.
... with 12 animals distributed over 4 cages per treatment. Since the treatment was supplied in the drinking water, it is impossible to provide different treatments to any two individual rats. For some outcomes, variability is expected to be reduced when animals are more content when ...

The choice of the experimental unit is also of particular concern in plant research, when the treatment condition has been applied to whole pots, trays or growth chambers rather than to individual plants.

... and pups grouped five to a cage, and the effects on the offspring were observed. Here, although observations on the individual offspring were made, the experimental units are the mutant dams that were randomly assigned to treatment. Therefore, the observations on the offspring should be averaged to give a single figure for each dam, and these data, one for each dam, are to be used for comparing the treatments.

... necessary information. This is in contrast to other scientific areas such as physics, chemistry and engineering, where the studied effects are much larger than the natural variation.

4.5
A single individual can also relate to several experimental units. This is illustrated by the following example.
Example 4.5. (Fry, 2014) The efficacy of two agents
at promoting regrowth of epithelium across a wound
was evaluated by making 12 small wounds in a standardized way in a grid pattern on the back of a pig.
The wounds were far enough apart for effects on
each to be independent. One of four treatments
would then be applied at random to the wound in
each square of the grid. In this case the experimental unit would be the wound and, as there are 12
wounds, for each treatment there would be three
replicates.
Internal validity refers to the fact that in a well-conceived experiment the effect of a given treatment is unequivocally attributed to that treatment. However, the effect of the treatment can be masked by the presence of the uncontrolled variation of the experimental material. An experiment with a high level of internal validity should have a great chance of detecting the effect ...

4.4 Variation is omnipresent

... experimental material.
17
and weight range and other characteristics determine the target population and make the study as
thousands of genes or proteins simultaneously. Minor differences in a number of non-biological variables, such as reagents from different lots, differ-
maintained.
Tuesday all diseased samples, thus masking the effect of disease with that of the two batches. Worse
is that these batch effects do not affect the entire
4.6
microarray in the same manner. Correlation patterns differ by batch, and even reverse sign across
bathes (Leek et al., 2010).
ment groups.
between the two treatment groups. It is clear that ability are also related to the concepts of accuracy
this treatment effect is a biased result since the dif- and precision of a measurement process. Absence
ference between the two groups is completely con- of bias means that our measurement is accurate,
founded with the difference between both sexes.
... in this case the study may still reach the correct conclusions.

4.7

4.8 Maximizing the signal-to-noise ratio

Strategies for minimizing the bias are based on good experimental practice, such as: the use of controls, blinding, the presence of a protocol, calibration, randomization, random sampling, and standardization. Strategies for controlling variability are based on experimental design and include replication, blocking, covariate measurement, and sub-sampling. In addition, random sampling can be added to enhance the external validity. We will now consider each of these strategies in more detail.
4.8.1

4.8.1.1 The use of controls

... the first three requirements in the preceding sections. The following section provides some basic ... of the experiment¹.

1 Active controls play a special role in so-called equivalence or non-inferiority studies, where the purpose is to show that a given therapy is equivalent or non-inferior to an existing standard.
... the shortcoming that the effect of treatment is completely confounded with the effect of time, thus introducing a potential source of bias. Furthermore, ... Vehicle control (laboratory experiments) or placebo control (clinical trials) are terms that refer to a control group that receives a matching treatment con...

4.8.1.2 Blinding

... the number of statistically significant findings was substantially larger as compared to blinded studies, as highlighted by Hirst et al. (2014). Despite its importance, blinding of experimenters is often neglected. Blinding in laboratory experiments means that both the experimenter and the observer are uninformed about the treatment ... corresponds to which particular treatment. The sequence of the treatments in the list should be randomized ... (et al., 1995).
4.8.1.3 The presence of a technical protocol

4.8.1.4 Calibration

... in a reduction in the sensitivity to detect anomalies. In this context, Holland and Holland (2011) ...

4.8.1.5 Randomization

... stochastic law¹. Randomization is a critical element in proper study design. It is an objective and ... Randomization of the experimental conditions is also a necessary condition for a rigorous statistical analysis, in the sense that it provides unbiased estimates of treatment effects, makes experimental units independent of one another, and justifies the use of significance tests. ... further in Chapter 8.

1 By the term stochastic is meant that it involves some elements of chance, such as picking numbers out of a hat or, preferably, using a computer program to assign experimental units to treatment groups.
21
the laboratory staff. Of course, this can be accomplished by maintaining the original randomization
sequence throughout the experiment.
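The notes implement formal randomization in MS Excel and in R (Appendix B). Purely as an illustration of what such a computer-based randomization device does, here is a minimal Python sketch for the twelve-rat, two-treatment setting discussed below (the function name and the seed are my own, not from the notes):

```python
import random

def randomize(units, treatments, seed=None):
    """Randomly allocate experimental units evenly over the treatments."""
    rng = random.Random(seed)        # a seed makes the allocation reproducible
    shuffled = units[:]
    rng.shuffle(shuffled)
    k = len(shuffled) // len(treatments)
    return {t: shuffled[i * k:(i + 1) * k] for i, t in enumerate(treatments)}

# Twelve rats, two treatments A and B: six rats per group.
allocation = randomize(list(range(1, 13)), ["A", "B"], seed=42)
print(allocation)
```

Recording the seed together with the allocation makes it possible to maintain the original randomization sequence throughout the experiment, as recommended above.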
Formal randomization requires the use of a randomization device. This can be the tossing of a coin or, preferably, a computer program. Examples in MS Excel and the R system (R Core Team, 2013) are contained in Appendix B.

For example, an investigator wishes to compare the effect of two treatments (A, B) on the body weight of rats (... et al., 2009). All twelve animals are delivered ... Consider the following scenario: heavy animals react slower and are easier to catch than the smaller animals. Some investigators are convinced that not randomizing ...: they always assign treatment A to the first member of the pair and B to the remaining unit. However, if ... each pair consistently yields a higher or lower result ...
In an experiment, brain cells were taken from animals and placed in Petri dishes, such that one Petri dish corresponded to one particular animal. The Petri dishes were then randomly divided into two groups and placed in an incubator. After 72 hrs of incubation, one group of Petri dishes was treated with the experimental drug, while the other group received solvent.

... especially the case in studies that attempt to make a broad inference towards the target population (population model), like gene expression experiments that try to relate a specific pathology to the differential expression of certain gene probes. ... In such an experiment, the bias in the results is minimized only if it is a random sample from the target population.
4.8.1.6 Random sampling

4.8.1.7 Standardization

4.8.2 Strategies for controlling variability by experimental design

4.8.2.1 Replication

Ronald Fisher¹ noted in his pioneering book The Design of Experiments ... judged. ... applicable.
For two treatment groups (X̄1) and (X̄2) and an equal number n of experimental units per treatment group, this standard error equals

    σ_{X̄1−X̄2} = σ·√(2/n),

where σ is the common standard deviation².

Alternatively, one could choose units of a larger size, such that ... Increasing the size of the experimental unit is an effective but expensive strategy to control variability. As we will see later, choosing an appropriate experimental design that takes into account the different sources of variability that can be identified is a more efficient way to increase the precision.

Figure 4.7. The effect of blocking illustrated by a study of the effect of diet on running speed of dogs. Not taking the age of the dog into account (left panel) masks most of the effect of the diet. In the right panel, dogs are grouped (blocked) according to age and comparisons are made within each age group. The latter design is much more efficient.

1 Sir Ronald Aylmer Fisher (London, 1890 – Adelaide, 1962) is considered a genius who almost single-handedly created the foundations of modern statistical science and experimental design.
2 The standard deviation refers to the variation of the individual experimental units, whereas the standard error refers to the random variation of an estimate (mostly the mean) from a whole experiment. The standard deviation is a basic property of the underlying distribution and, unlike the standard error, is not altered by replication.
Reducing the ... deviation is only possible to a very limited extent. This can be accomplished by standardization of the experimental conditions, but also this ...

4.8.2.2 Subsampling

When subsampling is present, the standard deviation used in the comparison of the treatment means is composed of the variability between the experimental units (between-unit variability, σ_b) and the variability within the experimental units (within-unit variability, σ_w). It can be shown that, in the presence of subsampling, the overall standard deviation of the experiment is equal to

    σ = √(σ_b² + σ_w²/m),

where m is the number of subsamples per experimental unit. The standard error of the difference between two treatment means now becomes

    σ_{X̄1−X̄2} = √((2/n)·(σ_b² + σ_w²/m)).
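As a numerical check of the formula above, here is a small Python sketch (the symbols σ_b, σ_w, n and m are as defined in the text; the numerical values are made up for illustration):

```python
import math

def se_diff(sigma_b, sigma_w, n, m):
    """Standard error of the difference between two treatment means,
    with n experimental units per group and m subsamples per unit."""
    return math.sqrt((2.0 / n) * (sigma_b**2 + sigma_w**2 / m))

# Made-up example: sigma_b = 2, sigma_w = 3, n = 5 units, m = 10 subsamples.
print(se_diff(2, 3, 5, 10))    # 1.4
print(se_diff(2, 3, 5, 1000))  # taking many more subsamples helps only a little
```

Note that increasing m shrinks only the within-unit term σ_w²/m; the between-unit term σ_b² can be reduced only by true replication (larger n), which is why subsamples are no substitute for experimental units.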
4.8.2.3 Blocking

Example 4.10. Consider a (hypothetical) experiment in which the effect of two diets on the running speed of dogs is studied. We can carry out the experiment by taking 6 dogs of varying age and randomly allocating 3 dogs to diet A and the 3 remaining dogs to diet B. However, as shown in the left panel of Figure 4.7, ...

... factors. Such groupings are referred to as blocks or strata. Units within a block are then randomly assigned to the treatments, thus removing the effect of the blocking factor from the treatment comparisons.

4.8.2.4 Covariates

... a covariate. It is an uncontrollable but measurable attribute of the experimental units (or their ...), e.g. ... the value of the response variable before treatment, etc. The covariate fil...

Figure 4.9. Results of an experiment with baseline as covariate. There is a linear relationship between the covariate and the response, and this relationship is the same in both treatment groups.
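The blocking procedure just described, i.e. randomizing the treatments afresh within each block, can be sketched as follows (a hypothetical Python illustration echoing the dog example, with age groups as blocks; the notes' own code is in R/Excel):

```python
import random

def randomize_within_blocks(blocks, treatments, seed=None):
    """Randomized complete block design: each treatment occurs once per block."""
    rng = random.Random(seed)
    design = {}
    for block, units in blocks.items():
        order = treatments[:]
        rng.shuffle(order)            # a fresh random order in every block
        design[block] = dict(zip(units, order))
    return design

# Hypothetical blocks: three age groups of two dogs each, diets A and B.
blocks = {"young": ["dog1", "dog2"],
          "adult": ["dog3", "dog4"],
          "old":   ["dog5", "dog6"]}
print(randomize_within_blocks(blocks, ["A", "B"], seed=1))
```

Within every block both diets appear exactly once, so the diet comparison is made between dogs of similar age, which is what removes the age effect from the comparison.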
4.9 Simplicity of design

This is the last of Cox's precepts for a good experiment (see Section 4.7, page 18). ...

Table 4.1. Multiplication factor to correct for the bias in estimates of the standard deviation based on small samples, after Bolch (1968).

n    Factor
2    1.253
3    1.128
4    1.085
5    1.064
6    1.051

4.10

Alternatively, one can also make use of the results of previous experiments to guesstimate the new experiment's standard deviation. However, we then make the strong assumption that the random variation is the same in the new experiment.
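The factors in Table 4.1 coincide with the usual small-sample correction 1/c4 for the sample standard deviation; assuming that this is indeed what Bolch (1968) tabulates, they can be recomputed from the gamma function (a Python sketch for illustration):

```python
import math

def sd_bias_factor(n):
    """Multiplier that makes the sample standard deviation unbiased:
    1/c4(n), with c4(n) = sqrt(2/(n-1)) * Gamma(n/2) / Gamma((n-1)/2)."""
    c4 = math.sqrt(2.0 / (n - 1)) * math.gamma(n / 2) / math.gamma((n - 1) / 2)
    return 1.0 / c4

for n in range(2, 7):
    print(n, round(sd_bias_factor(n), 3))  # reproduces 1.253, 1.128, ..., 1.051
```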
... Latin squares, Youden square designs, lattice designs, Plackett-Burman designs, simplex designs, ...

Figure 5.1. The three aspects of the design determine its complexity and the required resources

... should be tested and at which level? Is the interaction of two treatment factors of interest or not? The sampling & observation aspect of our experiment is about how experimental units are sampled from the population, how and how many subsamples should be drawn, etc.

5.1 Error-control designs

The error-control design implements the strategies that we learned in Section 4.8.2 to filter out different sources of variability.
... study are determined by these three aspects of experimental design. The required resources, namely ...

5.1.1

... of the standard error is only possible in a randomized experiment. In addition, randomization has other advantages.

... codes and prepared the daily drug solutions. Treatment codes were concealed from the rest of the laboratory staff, which was responsible for the daily treatment administration and final histological evaluation.

Similar findings were reported by Levasseur et al. (1995) and Faessel et al. (1999), who described the presence of parabolic patterns (Figure 5.2) in cell growth experiments carried out in 96-well microtiter plates. They were not able to show conclusively the underlying causes of this systematic error, which, as shown in Figure 5.2, could be of considerable magnitude. Therefore, they concluded that only by random allocation of the treatments to the wells could these systematic errors be avoided.

Figure 5.3. Scheme for the addition and dilution of drug-containing medium to tubes in a 96-tube rack, randomization of the tubes, and then addition of drug-containing medium to cell-containing wells of a 96-well microtiter plate (after Faessel et al. (1999))
5.1.2

Figure 5.2. Presence of bias in 96-well microtiter plates (after Levasseur et al. (1995))

Randomization is one way of dealing with bias in microtiter plates. Alternative methods consist of choosing an appropriate experimental design, such as a Latin square design (see Section 5.1.4, page 33), or constructing a statistical model that allows correcting for the row-column bias (Schlain et al., 2001; Straetemans et al., 2005).

Example 5.3. In the completely randomized design ... individually housed in a rack consisting of 5 shelves of 8 cages each. On different shelves, rats are likely to be
exposed to multiple varieties of light intensity, temperature, humidity, sounds, views, etc. The investigators were convinced that shelf height affected the results. Therefore, they decided to switch to ... (Hinkelmann and Kempthorne, 2008). Sometimes, this is accomplished by combining several sources of variation into one aggregate blocking factor, as ...

Figure 5.4. Outline of a paired experiment on isolated cardiomyocytes. Cardiomyocytes of a single animal were isolated and seeded in plastic Petri dishes. From the resulting five pairs of Petri dishes, one member was randomly assigned to drug treatment, while the remaining member received the vehicle.

5.1.2.1

... come of the experiment. It would be most unfortunate if our randomization procedure yielded a design in which there was a great imbalance on ...
Table 5.1. Results of an experiment using a paired design for testing the effect of a drug on the number of viable cardiomyocytes after calcium overload

Rat No.   Vehicle   Drug   Drug − Vehicle
1         44        46      2
2         64        75     11
3         60        67      7
4         50        64     14
5         76        77      1
with differences in this nuisance factor and be biased. The second main reason for a randomized complete block design is its ability to considerably reduce the error variation in our experiment, thereby making the comparisons more precise. The main objection to a randomized complete block design is that it makes the strong assumption that there is no interaction between the treatment variable and the blocking characteristics, i.e. that the effect of the treatments is the same in all blocks.

The basic idea behind blocking is to partition the ...

Example 5.5. Isolated cardiomyocytes provide an easy tool to assess the effect of drugs on calcium overload (Ver Donck et al., 1986). Figure 5.4 illustrates the experimental setting: cardiomyocytes of a single animal were isolated and seeded in plastic Petri dishes.
Figure 5.5. Gain in efficiency induced by blocking illustrated in a paired design. In the left panel, the myocyte experiment is considered as a completely randomized design in which the two samples largely overlap one another. In the right panel the lines connect the data of the same animal and show a marked effect of the treatment.
After a stabilization period, the cells were exposed to a stimulating substance (i.e. veratridine) and the percentage of viable, i.e. rod-shaped, cardiomyocytes in a dish was counted. Although comparison of the treatment with the solvent control within a single animal provides the best precision, it lacks external validity. Therefore, a paired experiment with myocytes from different animals and with animal as blocking factor was carried out. From each animal, two Petri dishes containing exactly 100 cardiomyocytes were prepared. From the resulting five pairs of Petri dishes, one member was randomly assigned to drug treatment, while the remaining member received the vehicle. After stabilization and exposure to the stimulus, the number of viable cardiomyocytes in each Petri dish was counted. The resulting data are contained in Table 5.1 and displayed in Figure 5.5.

There are 10 experimental units, since Petri dishes can be independently assigned to vehicle or drug. However, the statistical analysis should take the particular structure of the experiment into account. More specifically, the pairing has imposed restrictions on the randomization, such that data obtained from one animal cannot be freely interchanged with those from another animal. This is illustrated in the right panel of Figure 5.5 by the lines that connect the data from the same animal. It is clear that for each pair the drug-treated Petri dish consistently yielded a higher result than its vehicle-control counterpart. Since the different pairs (animals) are independent from one another, the mean difference and its standard error can be calculated. The mean difference is 7.0 with a standard error of 2.51.

5.1.2.2 Efficiency of the randomized complete block design

Example 5.6. Suppose that in Example 5.5 the experimenter had not used blocking, i.e. consider it as if he had used myocytes originating from 10 completely different animals. The 10 Petri dishes would then be randomly distributed over the two treatment groups and we would have been confronted with a completely randomized design. Assume also that the results of this hypothetical experiment were identical to those obtained in the actual paired experiment. As illustrated in the left panel of Figure 5.5, the two groups largely overlap one another. Since all experimental units are now independent of one another, the effect of the treatment is assessed by taking the difference between the two mean values and comparing it with its standard error¹. Obviously, the mean difference is the same as in the paired experiment. However,

1 As already mentioned in Section 4.8.2.1, page 22, the standard error of the difference between two means is equal to σ·√(2/n).
32
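In R, the paired analysis boils down to a one-sample summary of the within-animal differences. A minimal sketch with made-up counts (the actual Table 5.1 values are not reproduced here; only the structure of the calculation matters):

```r
# Hypothetical counts of viable myocytes for 5 animals (illustrative only)
vehicle <- c(40, 45, 52, 48, 44)
drug    <- c(49, 50, 61, 53, 51)
d <- drug - vehicle            # within-animal (pairwise) differences
mean(d)                        # mean difference
sd(d) / sqrt(length(d))        # standard error of the mean difference
```

With the real data of Table 5.1, this calculation yields the mean difference of 7.0 and standard error of 2.51 quoted above.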
However, the standard error of the mean difference has risen considerably from a value of 2.51 to 7.83; i.e. the use of blocking induced a substantial increase in the precision of the experiment¹. This illustrates blocking's ability to enhance the precision of the experiment considerably, while the conclusions keep the same validity.

Balanced incomplete block designs (BIB) exist for only certain combinations of the number of treatments and the number and size of blocks. Software such as the R package crossdes can be used to search for a valid arrangement.
Table 5.2. Balanced incomplete block design for Example 5.7 with treatments A, B, C and D

Lamb    First   Second
1       A       B
2       A       C
3       A       D
4       B       C
5       B       D
6       C       D
5.1.3
require(crossdes)
# find an incomplete block design with 4 treatments in 6 blocks of size 2
inc.bl <- find.BIB(4, 6, 2)
# check whether it is a valid balanced design
isGYD(inc.bl)
¹ et al. (2004) provide a method to compare designs on the basis of their relative efficiency. For the design in Example 5.5, the calculations show that the paired design is 7.7 times more efficient than the completely randomized design in Example 5.6. In other words, about 8 times as many replications per treatment would be required with a completely randomized design to achieve the same results.
5.1.4

The Latin square design is an extension of the randomized complete block design, but now blocking is done simultaneously on two characteristics that affect the response variable. This eliminates two sources of error. The disadvantage is the strong assumption that there are no interactions between the blocking factors and the treatments. In addition, Latin square designs are limited by the fact that the number of treatments, the number of rows, and the number of columns must all be equal. Fortunately, there are arrangements that do not have this limitation (Cox, 1958).
repetition chambers for each treatment. The four climate treatments were (i) control (CO, with ambient conditions), (ii) increased air temperature = air warming (AW), (iii) reduced irrigation = drought (D), and (iv) air warming and drought (AWD).

In a t × t Latin square, only t experimental units are assigned to each treatment group. However, it may happen that more experimental units are required to obtain an adequate precision. The Latin square can then be replicated and several squares can be used to obtain the necessary sample size. In doing this, there are two possibilities to consider. Either one stacks the squares on top of each other and keeps them as separate independent squares, or one combines them into a single arrangement.

In the above example 4 × 4 Latin squares were used to control for the location in the ecosystem chambers. Another example from laboratory practice concerns microtiter plates (Burrows et al., 1984). Still another example is about experiments on neuronal
Row \ Column    1   2   3   4
1               B   C   A   D
2               A   B   D   C
3               D   A   C   B
4               C   D   B   A
extreme results.
Table 5.4. A balanced lattice square arrangement of 16 treatments with 5 replicate squares on a single microtiter plate with eight lettered rows and twelve numbered columns (after Burrows et al., 1984).
Samples diluted 1:1,600 and serially diluted standards (100 µL) were placed in wells. Samples were arranged in triads, each containing samples from a case and two matched controls. In order to minimize measurement error due to a spatial gradient in binding efficiency within the plate, microplate wells were grouped into 6 blocks of 16 (4 × 4) wells and each set of 3 samples was placed in the same block, along with standards, using a version of Latin square design. Placement patterns were changed across blocks such that the influence of the spatial gradient on the signal-standard level relationship was minimized. The authors do not explicitly state what they mean by a version of Latin square. When they write that placement patterns were changed across blocks, they presumably mean that randomization of the rows and columns was used. The interested reader is referred to the literature (Cox, 1958) for this more advanced topic.

Strategies that control for row- and column-effects by means of Latin-square designs are not restricted to 96-well microtiter plates, but also apply to their larger variants.

Random Latin squares can be produced using the R package magic (Hankin, 2005). A possible layout of the experiment in Example 5.8 is obtained by:

> require(magic)
> trts <- c("CO","AW","D","AWD")
> ii <- rlatin(4)
> matrix(trts[ii], nrow=4, ncol=4, byrow=FALSE)
     [,1]  [,2]  [,3]  [,4]
[1,] "D"   "AW"  "AWD" "CO"
[2,] "AWD" "CO"  "D"   "AW"
[3,] "AW"  "D"   "CO"  "AWD"
[4,] "CO"  "AWD" "AW"  "D"

5.2 Treatment designs

5.2.1 One-way layout

5.2.2 Factorial designs
In a full factorial design, the factors and all combinations (hence full) of the levels of the factors are studied. The factorial design allows estimation of the effect of each factor as well as their interaction effect, i.e. the deviation from additivity of their joint effect. We will use the 2 × 2 full factorial design to explore the concepts of factorial designs and of statistical interaction.

Example 5.10. (Bate and Clark, 2014) A study was conducted to assess whether the serum chemokines JE and KC could be used as markers of atherosclerosis development in mice (Parkin et al., 2004). Two strains of apolipoprotein-E-deficient (apoE-/-) mice, C3H apoE-/- and C57BL apoE-/-, were used in the study.

No interaction. When there is no interaction between strain and diet, the mean profiles for the two diets are parallel to one another, as is shown in Figure 5.6. The diet effect is the same in both strains. There is also an overall effect of the strain: lesions are larger in one of the strains, and this difference is the same for both diets.

Since the difference between the diets is the same in both strains, the external validity of the conclusions is broadened, since they apply to both strains. In addition, the comparison between the two strains can be made within the same set of experimental units. This makes a factorial design a highly efficient design, since all the animals are used to test both factors.

Figure 5.6. Plot of mean lesion area for the case where there is no interaction between strains and diets.

Figure 5.7. Plot of mean lesion area for the case where there is a moderate interaction between strains and diets.
Moderate interaction. When there is a moderate interaction, the direction of the effect of the first factor does not change with the level of the second factor. This is exemplified by Figure 5.7, where the lines are not parallel, though both indicate an increase in lesion size. When the interaction is strong, the opposite is true: the direction of the effect of one factor can reverse with the level of the other factor, as shown in Figure 5.8.

Factorial designs are widely used in plant research, process optimization, etc. Recent applications include microarray experiments, e.g. a 2 × 2 factorial combined with a balanced incomplete block design. Further examples of factorial designs for microarray experiments can be found in the specialized literature (Glonek and Solomon, 2004; Banerjee and Mukerjee, 2007).

Figure 5.8. Plot of mean lesion area for the case where there is a strong interaction between strains and diets.
Although our discussion here is restricted to the 2 × 2 factorial, more factors and more levels can be used.

5.3

5.3.1
In this experiment Petri dishes are placed inside constant temperature incubators (see Figure 5.10). Within each incubator growth media are randomly assigned to the
5.3.2
Figure 5.12. A typical repeated measures design. Animals are randomized to different treatment groups, the variable of interest
(e.g. blood pressure) is measured at the start of the experiment
and at different time points following treatment application.
the presence of carry-over effects by which the results obtained for a treatment are influenced by
the previous treatment. In addition, any confound-
one another.
Table 5.5. Bioassay experiment of Example 5.14. Row and column indicators refer to the conventional coding of 96-well plates. The four samples (A, B, C, D) are applied in duplicate to the rows, and the serial dilution level (dose) is applied to an entire column. A1/1 in a cell means sample A, first replicate at dilution level 1; B2/3 the second replicate of sample B at dilution 3, etc.

        1      2      3      4      5      6      7      8      9      10     11     12
A    B1/2   B1/8   B1/10  B1/1   B1/11  B1/3   B1/12  B1/7   B1/4   B1/5   B1/6   B1/9
B    D2/2   D2/8   D2/10  D2/1   D2/11  D2/3   D2/12  D2/7   D2/4   D2/5   D2/6   D2/9
C    B2/2   B2/8   B2/10  B2/1   B2/11  B2/3   B2/12  B2/7   B2/4   B2/5   B2/6   B2/9
D    C1/2   C1/8   C1/10  C1/1   C1/11  C1/3   C1/12  C1/7   C1/4   C1/5   C1/6   C1/9
E    A1/2   A1/8   A1/10  A1/1   A1/11  A1/3   A1/12  A1/7   A1/4   A1/5   A1/6   A1/9
F    A2/2   A2/8   A2/10  A2/1   A2/11  A2/3   A2/12  A2/7   A2/4   A2/5   A2/6   A2/9
G    C2/2   C2/8   C2/10  C2/1   C2/11  C2/3   C2/12  C2/7   C2/4   C2/5   C2/6   C2/9
H    D1/2   D1/8   D1/10  D1/1   D1/11  D1/3   D1/12  D1/7   D1/4   D1/5   D1/6   D1/9
6.1

The more replicates, the more confidence we have in our conclusions. Therefore, we would prefer to carry out our experiments with as many replicates as possible. The estimation of the appropriate size of the experiment should take place during its planning, not at the end of the study when the data are analyzed.

6.2

6.3
                          State of Nature
Decision made             Null hypothesis true      Alternative hypothesis true
Accept null hypothesis    Correct decision (1 − α)  False negative (β)
Reject null hypothesis    False positive (α)        Correct decision (1 − β)
The basis of sample size calculation is formed by the rates of false positive and false negative decisions. The rate of false positives is called the level of significance or alpha level and is usually set at values of 0.01, 0.05, or 0.10. The false negative rate depends on the postulated alternative hypothesis and is usually described by its complement, i.e. the probability of rejecting the null hypothesis when the alternative hypothesis holds. This is called the power of the statistical hypothesis test. Power levels are usually expressed as percentages, and values of 80% or 90% are standard in sample size calculations. It is convenient, for quantitative data, to express the difference in means as an effect size by dividing it by the standard deviation.

6.4

6.4.1

Now that we are familiar with the concepts of hypothesis testing and the determinants of sample size, we can proceed with the actual calculations. There is a significant amount of free software available to make elementary sample size calculations. In particular, there is the R package pwr (Champely, 2009).
Example 6.1. Consider the completely randomized experiment about cardiomyocytes discussed in Example 5.6 (page 31). The standard deviations of the two groups are each about 12.5. A large effect of 0.8 in this case corresponds to a difference between both groups of 10 myocytes. Let's assume that we wish to plan a new experiment to detect such a difference with a power of 80% and we want to reject the null hypothesis of no difference at a level of significance of 0.05, whatever the direction of the difference between the two samples (i.e. a two-sided test¹). The calculations are carried out in R in a single line of code and show that 26 experimental units are required in each of the two treatment groups:

¹ When comparing mean values from two independent groups, the standard deviation for calculating the effect size can be taken from either group when the variances of the two groups are homogeneous, or alternatively a pooled standard deviation can be used.
> require(pwr)
> pwr.t.test(d=0.8, power=0.8, sig.level=0.05,
+            type="two.sample",
+            alternative="two.sided")
6.4.2

     Two-sample t test power calculation

              n = 25.52457
              d = 0.8
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

Rounding up, 26 experimental units are required per group. Conversely, the same function shows that an experiment with only 5 units per group would have a power of barely 20%:

              n = 5
              d = 0.8
      sig.level = 0.05
          power = 0.2007395
    alternative = two.sided
A simple rule of thumb for a two-sided comparison of two groups at α = 0.05 and 80% power is

    n ≈ 16 / Δ²

where Δ represents the effect size and n stands for the required sample size in each treatment group. For the above example, the equation results in n = 16/0.8² = 25.
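As a quick check, the rule of thumb can be evaluated in one line of R and compared with the exact pwr.t.test result shown above (25.52, rounded up to 26):

```r
delta <- 0.8        # standardized effect size
16 / delta^2        # rule-of-thumb sample size per group: 25
```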
Figure 6.1. Power curves for a two-group comparison to detect a difference of 1 with a two-sided t-test at significance level α = 0.05, as a function of the number of subsamples m. Lines are drawn for different numbers of experimental units n in each group. For both the left and the right panel the between-sample standard deviation (σn) is 1, while the within-sample standard deviation (σm) is 1 in the left panel and 2 in the right panel. The dots connected by the dashed line indicate where the total number of subsamples 2·n·m equals 192. The vertical line indicates an upper bound to the useful number of subsamples of m = 4(σm²/σn²).
With eight blocks, N = 31, B = 7, T = 3 and E = 31 − 7 − 3 = 21 instead of 28. However, blocking nearly always reduces the inherent variability.

Example 6.3. If we consider again, in this context, the paired experiment of Example 5.5 (page 30), we have N = 9, B = 4, T = 1. Hence, E = 9 − 4 − 1 = 4. Obviously the sample size of 10 experimental units was too small to allow an adequate estimate of the error. At least 2 experimental units should be added.

6.5

With n experimental units per treatment group and m subsamples per unit, the variance of a treatment-group mean is (σn² + σm²/m)/n, where n and m denote the numbers of experimental units and subsamples and σn and σm the between-sample and within-sample standard deviations. In the power curves of Figure 6.1 the within-sample standard deviation σm is 1 in the left panel and 2 in the right panel. The dots connected by a dashed line represent the power for experiments where the total number of subsamples equals 192 (2 treatment groups × n × m).
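Mead's resource equation can be wrapped in a small helper function; the call below reproduces Example 6.3 (10 Petri dishes in 5 blocks with 2 treatments):

```r
# Error degrees of freedom E = N - B - T, with
# N = (number of units - 1), B = (number of blocks - 1),
# T = (number of treatments - 1)
resource.E <- function(units, blocks, treatments)
  (units - 1) - (blocks - 1) - (treatments - 1)
resource.E(10, 5, 2)    # gives E = 4, as in Example 6.3
```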
Figure 6.2. Required sample size of a two-sided test with a significance level of 0.05 and a power of 80% (left panel) and 90% (right panel) as a function of the number of comparisons that are carried out. Lines are drawn for different values of the effect size (Δ). Note that the y-axis is logarithmic.
There is little point in increasing the number of subsamples m beyond 4(σm²/σn²). This is known as Cox's rule of thumb, at least when the cost of subsamples and experimental units is not taken into consideration. When these costs are taken into account, the optimal number of subsamples per experimental unit is

    m = √( (cn/cm) · (σm²/σn²) )

where cn is the cost of an experimental unit and cm the cost of a subsample. Taking more subsamples makes sense when the cost of an experimental unit cn is large relative to the cost of subsamples cm, or when the variation among subsamples σm is large relative to the variation between units σn. In this example, taking more subsamples does make sense: the power curves keep increasing until the number of subsamples is about 16.
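Both quantities are easily evaluated in R, using the variance components and cost ratio quoted in the example below (σn² = 4.58, σm² = 13.7, an animal costing about 100 times one measurement):

```r
s2n <- 4.58          # between-animal variance component
s2m <- 13.7          # within-animal variance component
4 * s2m / s2n        # Cox's rule-of-thumb upper bound: about 12
cost.ratio <- 100    # cost of one animal relative to one measurement
sqrt(cost.ratio * s2m / s2n)   # cost-based optimum: about 17
```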
Figure 6.3. Running the cardiomyocyte experiment a large number of times, the measured effect sizes follow a broad distribution. In both plots the true effect size is 0.8. The dark area represents statistically significant results (two-sided p ≤ 0.05) and the vertical dotted line indicates the effect size which is just large enough to be statistically significant. Left panel: 26 animals are used per treatment group, which corresponds to a power of 80%. Right panel: only 5 animals per treatment group are used, which results in an underpowered experiment.
6.6
sured. A sophisticated statistical technique known statistical test is carried out on the data, the overas mixed model analysis of variance allowed to es- all rate of false positive findings is higher than the
2
as 4.58 and 13.7
timate from the data n2 and m
respectively. Surprisingly variability within an ani- cumvent this inflation of the false positive error
mal was larger than between animals. If we were to rate, the critical value of each individual test is usually set at a more stringent level. The most simber of measurements to 4 13.7/4.58 12 per ani- ple adjustment, Bonferronis adjustment, consists
mal. Alternatively, we can take the differential costs of just dividing the significance level of each indiset up a new experiment, we could limit the num-
of experimental units and subsamples into account. vidual test by the number of comparisons. BonferIt makes sense to assume that the cost of 1 ani- ronis adjustment maintains the error rate of the
mal is about 100 times the cost of one diameter totality of tests that are carried out in the same conmeasurement. Making this assumption, the opti- text at its original level. But, as we already noted
mum number of subsamples per animal would be above, when the significance level is set at a lower
p
100 13.7/4.58 17. Thus the total number value, the required sample size will necessarily inof diameter measurements could be reduced from crease. Fortunately, the increase in required num1300 to 220. Even if animals would cost 1000 times ber of replicates is surprisingly small.
more than a diameter measurement, the optimum
number of subsamples per animal would be about
This holds for all values of Δ and power (100 × (1 − β)). For powers of 80% and 90%, carrying out two independent tests while keeping the overall error rate at its level of α = 0.05 requires only a modest increase in sample size. Similarly, when 3 or 4 independent tests are involved, the required sample size increases with 30% or 40% respectively.

6.7

Another consequence of low statistical power is that the significant effect sizes from these experiments follow a distribution as displayed in the left-hand panel of Figure 6.3.
As shown in Figure 6.4, effect inflation is worst for small low-powered studies which can only detect treatment effects that happen to be large. Therefore, significant research findings of small studies are biased in favour of inflated effects. This has consequences when an attempt is made to replicate a published finding and the sample size is computed based on the published effect. When this is an inflated estimate, the sample size of the confirmatory experiment will be too low. To summarize, effect inflation due to small, underpowered experiments is one of the major reasons for the lack of replicability in scientific research.

6.8 Sequential plans

Like small studies that happen to yield a significant result, sequential plans are prone to exaggerate the treatment effect. There is certainly a place for sequential plans in contexts such as early screening, but a fixed sample size confirmatory experiment is needed to provide an unbiased estimate of the treatment effect.

In one such application, a candidate compound was selected for further development. An advantage of this screening procedure was that, given the biologically relevant level of activity that must be detected, the expected fraction of false positive and false negative results was known and fixed. A disadvantage of the method was that a dedicated computer program was required for the follow-up of the results.
7.1
7.2
ture must completely determine the statistical procedure by which this estimate is to be calculated. If this were not so, no inference could result.
Figure 7.2. Distribution of the test statistic t for the cardiomyocyte example, under the assumption that the null hypothesis of no
difference between the samples is true.
appropriate statistical analysis is straightforward. However, some important statistical issues remain, such as the type of data and the assumptions we make about the distribution of the data.
7.3 Significance tests

Significance testing is related to, but not exactly the same as hypothesis testing (see Section 6.3). Significance testing differs from the Neyman-Pearson hypothesis testing approach in that there is no need to define an alternative hypothesis. Here, we only define a null hypothesis and calculate the probability to obtain results as extreme or more extreme than what was actually observed, assuming the null hypothesis is true. This is done by calculating, from the experimental data, a quantity called the test statistic. Then, based on the statistical model, the distribution of this test statistic is derived under the null hypothesis. With this null-distribution, the probability is calculated of obtaining a test statistic that is as extreme or more extreme than the one observed. This probability is referred to as the p-value. It is common practice to compare this p-value to a preset level of significance α (usually 0.05). When the p-value is smaller than α, the null hypothesis is rejected; otherwise the result is inconclusive. However, this conflates the two worlds of significance testing and the Neyman-Pearson approach of hypothesis testing. For Fisher, the p-value was an informal measure to see how surprising the data were and whether they deserved a second look.

The cardiomyocytes experiment of Example 5.5 (page 30) will help us to illustrate the significance test. The test is set up to test the null hypothesis of no difference between vehicle and drug. This null hypothesis is tested at a level of significance of 0.05, i.e. we want to limit the probability of a false positive result to 0.05. The paired design of this experiment is a special case of the randomized complete block design with only two treatments, and the response is a continuously distributed variable. In this design, calculations can be simplified by evaluating the treatment effect for each pair separately, thus removing the block effect. This is done in Table 5.1 in the column with the Drug − Vehicle differences. We now must make some assumptions about the statistical model that generated the data. Specifically, we assume that the differences are independent from one another and originate from a normal distribution.¹

¹ The standard error of a sample mean equals the standard deviation divided by the square root of the sample size, i.e. sx̄ = SD/√n = 5.61/√5 = 2.51.
When the null hypothesis of no difference between the two treatment conditions holds, the distribution of this statistic is known² and is depicted in Figure 7.2. In the left panel of Figure 7.2, the value of the test statistic obtained from the experimental data, 2.79, is indicated and the area under the curve to the right of this value is shaded in grey. This area corresponds to the one-sided p-value, i.e. the probability of obtaining a greater value for the test statistic than the one obtained in the experiment. Since by definition the total area under the curve equals one, we can calculate the value of the shaded area. For our example, this results in a value of 0.024, which is the probability of obtaining a value for the test statistic as extreme or more extreme than the one obtained in the experiment, provided the null hypothesis holds.

7.4

When the inferential results are sensitive to the distributional and other assumptions of the statistical analysis, it is essential that these assumptions are also verified. The aptness of the statistical model is preferably assessed by informal methods such as diagnostic plotting (Grafen and Hails, 2002; Kutner et al., 2004). When planning the experiment, historical data or the results of exploratory or pilot experiments can already be used for a preliminary verification of the model assumptions. Another option is to use statistical methods that are robust against departures from the assumptions (Lehmann, 1975). It is also wise, before carrying out formal tests, to make graphical displays of the data. This allows identification of outliers and gives early indications whether the statistical model is appropriate or not. Such exploratory work is also a tool for gaining insight into the research project and can lead to new hypotheses.
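The calculation sketched above takes two lines of R; the mean difference 7.0, its standard error 2.51 and the 4 degrees of freedom are those of the paired cardiomyocyte example:

```r
t.obs <- 7.0 / 2.51      # observed test statistic (about 2.79)
1 - pt(t.obs, df = 4)    # one-sided p-value, close to the 0.024 quoted above
```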
Figure 7.3. One hundred drugs are tested for activity against a biological target. Each drug occupies a square in the grid; the top row are the drugs that are truly active. Statistically significant results are obtained only for the darker-grey drugs. The black cells are false positives (after Reinhart (2015)).

This allows us to reject the null hypothesis at the pre-specified significance level of 0.05 using a two-sided test.

² Under the null hypothesis and when the assumptions are true, the test statistic is distributed as a Student t-distribution with n − 1 degrees of freedom.

7.5

    FDR = α(1 − π) / [α(1 − π) + (1 − β)π]        (7.1)

where α is the significance level, (1 − β) the power, and π the prevalence of true effects among the hypotheses that are tested.
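Equation (7.1), as reconstructed here, is easily turned into a small R function; the values of α, power and prevalence below are illustrative:

```r
# False discovery rate from significance level alpha, power (1 - beta)
# and prevalence prev of true effects among the tested hypotheses
fdr <- function(alpha, power, prev)
  alpha * (1 - prev) / (alpha * (1 - prev) + power * prev)
fdr(0.05, 0.8, 0.5)      # e.g. about 0.059 when half the hypotheses are true
```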
What does a statistically significant result actually mean? In how many instances does this result reflect a true difference? We already deduced from the above reasoning that the FDR depends on the threshold α, the power (1 − β) and the prevalence π of true effects. The most favourable case is equivalent to the minimum FDR (MFDR). This gives the values in Table 7.1.

Table 7.1. Minimum false discovery rate (MFDR) for selected p-values

p-value   0.1     0.05    0.01    0.005   0.001
MFDR      0.385   0.289   0.111   0.067   0.0184
Some values of the MFDR are given in Table 7.1. For p = 0.05 the MFDR = 0.289, which means that a researcher who claims a discovery when p ≤ 0.05 is observed will make a fool of him-/herself in about 30% of the cases. The FDR is certainly one of the key factors responsible for the lack of replicability in research and puts the decision-theoretic approach, with its irrational dichotomization of the p-value into significant and non-significant, certainly into question.

As it was noted already in the introductory chapter, the issues of reproducibility and replicability of research findings have been concerning the scientific, but certainly also the statistical, community deeply. This has led the board of the American Statistical Association to issue a statement on March 6, 2016 in which the society warns against the misuse of p-values (Wasserstein and Lazar, 2016). This is the first time in its 177-year-old history that explicit recommendations on a fundamental statistical matter were issued. In summary, the ASA advises in its statement that researchers avoid drawing scientific conclusions or making decisions based on p-values alone. P-values should certainly not be interpreted as measuring the probability that the studied hypothesis is true or the probability that the data were produced by chance alone.

7.6 Multiplicity

Example 7.3. Suppose a drug is tested at 20 different doses on a specific variable. Further, suppose that we reject the null hypothesis of no treatment effect for each dose separately when the probability of falsely rejecting the null hypothesis (the significance level α) is less than or equal to 0.05. Then the overall probability of falsely declaring the existence of a treatment effect when all underlying null hypotheses are in fact true is 1 − (1 − 0.05)²⁰ = 0.64. This means that we are more likely to get at least one significant result than not. The same multiplicity problem arises when a single dose of the drug is tested on 20 variables that are mutually independent. Testing multiple related hypotheses also raises the type I error rate, and the same problem arises when a study includes a large number of outcomes. The multiplicity problem must at least be recognized; the exploratory character of the study should be stressed and the results interpreted with great care.

The problem of multiplicity is of particular importance and magnitude in gene expression microarray experiments (Bretz et al., 2005). For example, a microarray experiment examines the differential expression of 30,000 genes in a wildtype and in a mutant. Assume that for each gene an appropriate test is carried out.
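The inflation in Example 7.3, and the effect of a Bonferroni adjustment, can be verified directly:

```r
alpha <- 0.05
k <- 20                       # number of independent tests
1 - (1 - alpha)^k             # familywise error rate: about 0.64
1 - (1 - alpha / k)^k         # after Bonferroni adjustment: about 0.049
```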
9.1.1
Experimental design
how blocking was dealt with in the statistical analysis. When there is ambiguity about the experimental unit, the unit used in the statistical analysis
should be specified and a justification for its choice
should be provided.
use technical statistical terms such as random, normal, significant, correlation, and sample in their everyday meaning, i.e. out of the statistical context. We
9.1.2 Statistical methods

The statistical methods should be described with enough detail to enable a knowledgeable reader with access to the original data to verify the reported results. The authors should report and justify which methods they used. A term like tests of significance is too vague and should be more detailed.
Spurious precision detracts from a paper's readability and credibility. Therefore, unnecessary precision, particularly in tables, should be avoided. Percentages should not be expressed to more than one decimal place and, with samples of less than 100, the use of decimal places should be avoided.

9.2

9.2.1

The number of experimental units used in the analysis should always be clearly specified. Any discrepancies with the number of units actually randomized to treatment conditions should be accounted for. Whenever possible, findings should be quantified and presented with appropriate indicators of measurement error or uncertainty. As measures of spread and precision, standard deviations (SD) and standard errors (SEM) should not be confused. Standard deviations are a measure of spread and as such a descriptive statistic, while standard errors are a measure of precision of the mean. Normally distributed data should preferably be summarized as mean (SD), not as mean ± SD. For non-normally distributed data, medians and inter-quartile ranges are the most appropriate summary statistics. The practice of reporting mean ± SEM should preferably be replaced by the reporting of confidence intervals, which are more informative. Extremely small datasets should not be summarized at all, but should preferably be reported or displayed as raw data.

9.2.2 Graphical displays

Generally, for percentages and counts, the mean minus 2 SD (or minus 2 SEM × √n), which indicates a lower 2.5% limit, can fall below zero; graphs such as Figure 9.1 and Figure 9.2 are much more informative than such summaries, especially when groups are small, and can be of great help in avoiding such a pitfall.
9.2.3 Interpreting and reporting significance tests

It is of little help to have in the Methods section a statement such as statistical methods included analysis of variance and tests of significance without any reference to which specific procedure is reported in the Results part.

Tests of statistical significance should be two-sided. When comparing two means or two proportions, there is a choice between a two-sided or a one-sided test. In a one-sided test the investigator's alternative hypothesis specifies the direction of the difference, e.g. experimental treatment greater than control. In a two-sided test, no such direction is specified. If a one-sided test is used and the result is in the wrong direction, then the result cannot be declared significant.

Exact p-values, rather than statements such as p < 0.05 or, even worse, NS (not significant), should be reported where possible. The practice of dichotomizing p-values into significant and not significant leads to situations where a study yielding a p-value of 0.049 would be flagged significant, while an almost equivalent result would not. When reporting a p-value, it also happens that a value that is technically larger than the significance level of 0.05, say 0.051, is rounded down to p = 0.05. This is inaccurate and, to avoid this error, p-values should be reported to the third decimal.

When the result is not significant, researchers sometimes assume that the null hypothesis can be accepted. Consequently, they conclude that there is no effect of the treatment or that there is no difference between the treatment groups. However, from a philosophical point of view, one can never prove the nonexistence of something. Therefore, one should avoid sole reliance on statistical hypothesis testing and preferably supplement one's findings with confidence intervals, which are more informative. Confidence intervals on a difference of means or proportions provide information about the magnitude and precision of the effect.
et al., 2011)
line. However, effect sizes that have no biological relevance are still plausible as is shown by the
upper limit of the confidence interval. The second
row shows the result of an experiment that was not
significant at the 0.05 level. However, the confidence interval reaches well within the area of biological relevance. Therefore, notwithstanding the
nonsignificant outcome, this experiment is inconclusive. The third outcome concerns a result that
was not significant, but the 95% confidence interval does not reach beyond the boundaries of scientific relevance. The nonsignificant result here can also be interpreted as meaning that, with 95% confidence, the treatment effect is irrelevant from a scientific point of view.
Figure 9.3. Use of confidence intervals for interpreting statistical results. Estimated treatment effects are displayed with their
95% confidence intervals. The shaded area indicates the zone of
biological relevance.
One should avoid the practice of making a sharp distinction between significant and nonsignificant findings and of making comparisons of the sort X is statistically significant, while Y is not. When interpreting the results of an experiment, the scientist should also bear in mind the topics covered in Section 6.7 about effect size inflation.
10.1
What we didn't touch upon yet is the role of the statistician in the research project. The statistician is
10.2
Recommended reading
Kempthorne, 2008).
periment.
ics1 .
tion and reporting of scientific results. In particular, the problems with a blind trust on statistical
10.3
Summary
1 http://www.datascope.be
References
Altman, D. G., Gore, S. M., Gardner, M. J., and Pocock, S. J. (1983). Statistical guidelines for contributors to medical journals. BMJ 286, 1489–1493.
URL http://www.bmj.com/content/286/6376/1489
Amaratunga, D. and Cabrera, J. (2004). Exploration and Analysis of DNA Microarray and Protein Array Data. New York, NY: J.
Wiley.
Anderson, V. and McLean, R. (1974). Design of Experiments.
New York, NY: Marcel Dekker Inc.
Aoki, Y., Helzlsouer, K. J., and Strickland, P. T. (2014).
Arylesterase phenotype-specific positive association between
arylesterase activity and cholinesterase specific activity in human serum. Int. J. Environ. Res. Public Health 11, 1422–1443. doi:10.3390/ijerph110201422.
Babij, C. J., Zhang, Y., Kurzeja, R. J., Munzli, A., Shehabeldin, A., Fernando, M., Quon, K., Kassner, P. D., Ruefli-Brasse, A. A., Watson, V. J., Fajardo, F., Jackson, A., Zondlo, J., Sun, Y., Ellison, A. R., Plewa, C. A., T., S., Robinson, J., McCarter, J., Judd, T., Carnahan, J., and Dussault, I. (2011). STK33 kinase activity is nonessential in KRAS-dependent cancer cells. Cancer Research 71, 5818–5826. doi:10.1158/0008-5472.CAN-11-0778.
Baggerly, K. A. and Coombes, K. R. (2009). Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology. Annals of Applied Statistics 3, 1309–1334. doi:10.1214/09-AOAS291.
Bailar III, J. C. and Mosteller, F. (1988). Guidelines for statistical reporting in articles for medical journals. Ann. Int. Med. 108, 226–273.
URL http://www.people.vcu.edu/~albest/Guidance/guidelines_for_statistical_reporting.htm
Banerjee, T. and Mukerjee, R. (2007). Optimal factorial designs
for cDNA microarray experiments. Ann. Appl. Stat. 2, 366–385.
doi:10.1214/07-AOAS144.
Bate, S. and Clark, R. (2014). The Design and Statistical Analysis
of Animal Experiments. Cambridge, UK: Cambridge University
Press.
Begley, C. G. and Ellis, L. M. (2012). Raise standards for preclinical research. Nature 483, 531–533. doi:10.1038/483531a.
Begley, C. G. and Ioannidis, J. P. A. (2015). Reproducibility in science.
Circ. Res. 116, 116–126. doi:10.1161/CIRCRESAHA.114.303819.
Begley, S. (2012). In cancer science, many "discoveries" don't hold up. Reuters March 28.
URL
http://www.reuters.com/article/2012/03/28/usscience-cancer-idUSBRE82R12P20120328
Bland, M. and Altman, D. (1994). One and two sided tests of
significance. BMJ 309, 248.
Bretz, F., Landgrebe, J., and Brunner, E. (2005). Multiplicity issues in microarray experiments. Methods Inf. Med. 44, 431–437.
Brien, C. J., Berger, B., Rabie, H., and Tester, M. (2013). Accounting for variation in designing greenhouse experiments
with special reference to greenhouses containing plants on
conveyor systems. Plant Methods 9, 5–26. doi:10.1186/1746-4811-9-5.
Burrows, P. M., Scott, S. W., Barnett, O., and McLaughlin,
M. R. (1984). Use of experimental designs with quantitative
ELISA. J. Virol. Methods 8, 207–216.
Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A.,
Flint, J., Robinson, E. S. J., and Munafo, M. R. (2013). Power
failure: Why small sample size undermines the reliability
of neuroscience. Nature Reviews Neuroscience 14, 1–12. doi:
10.1038/nrn3475.
Casella, G. (2008). Statistical Design. New York, NY: Springer.
Champely, S. (2009). pwr: Basic functions for power analysis.
R package version 1.1.1.
URL http://CRAN.R-project.org/package=pwr
Cleveland, W. S. (1993). Visualizing Data. Summit, NJ: Hobart
Press.
Cleveland, W. S. (1994). The Elements of Graphing Data. Summit, NJ: Hobart Press.
Clewer, A. G. and Scarisbrick, D. H. (2001). Practical Statistics
and Experimental Design for Plant and Crop Science. Chichester,
UK: J. Wiley.
Cochran, W. and Cox, G. (1957). Experimental Designs. New
York, NY: John Wiley & Sons Inc., 2nd edition.
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. Hillsdale, NJ: Lawrence Erlbaum Associates, 2nd edition.
Cokol, M., Ozbay, F., and Rodriguez-Esteban, R. (2008).
Retraction rates are on the rise. EMBO Rep. 9, 2. doi:
10.1038/sj.embor.7401143.
URL
http://www.ncbi.nlm.nih.gov/pmc/articles/
PMC2246630/
Colquhoun, D. (2014). An investigation of the false discovery rate and the misinterpretation of p-values. R. Soc. open sci. 1, 140216. doi:10.1098/rsos.140216.
Council of Europe (2006). Appendix A of the European Convention for the Protection of Vertebrate Animals used for Experimental and other Scientific Purposes (ETS No. 123). Guidelines for accommodation and care of animals (Article 5 of the Convention). Approved by the multilateral consultation.
URL https://www.aaalac.org/about/AppA-ETS123.pdf
Cox, D. (1958). Planning of Experiments. New York, NY: J. Wiley.
Curran-Everett, D. (2000). Multiple comparisons: philosophies and illustrations. Am. J. Physiol. Regulatory Integrative
Comp. Physiol. 279, R1–R8.
Bretz, F., Hothorn, T., and Westfall, P. (2010). Multiple Comparisons Using R. Boca Raton, FL: CRC Press.
Eklund, A. (2010). beeswarm: The bee swarm plot, an alternative to stripchart. R package version 0.0.7.
URL http://CRAN.R-project.org/package=beeswarm
European Food Safety Authority (2012). Final review of the
Séralini et al. (2012) publication on a 2-year rodent feeding
study with glyphosate formulations and GM maize NK603 as
published online on 19 September 2012 in Food and Chemical
Toxicology. EFSA Journal 10, 2986. doi:10.2903/j.efsa.2012.2986.
Everitt, B. S. and Hothorn, T. (2010). A Handbook of Statistical
Analyses using R. Boca Raton, FL: Chapman and Hall/CRC,
2nd edition.
Faessel, H., Levasseur, L., Slocum, H., and Greco, W. (1999).
Parabolic growth patterns in 96-well plate cell growth experiments. In Vitro Cell. Dev. Biol. Anim. 35, 270–278.
Fang, F. C., Steen, R. G., and Casadevall, A. (2012). Misconduct accounts for the majority of retracted scientific publications. Proc. Natl. Acad. Sci. U.S.A. 109, 17028–17033. doi:
10.1073/pnas.1212247109.
Hempel, C. G. (1966). Philosophy of Natural Science. Englewood Cliffs, NJ: Prentice-Hall.
Hinkelmann, K. and Kempthorne, O. (2008). Design and Analysis of Experiments. Volume 1. Introduction to Experimental Design.
Hoboken, NJ: J. Wiley, 2nd edition.
Hirst, J. A., Howick, J., Aronson, J. K., Roberts, N., Perera, R.,
Koshiaris, C., and Heneghan, C. (2014). The need for randomization in animal trials: An overview of systematic reviews.
PLOS ONE 9, e98856. doi:10.1371/journal.pone.0098856.
Holland, T. and Holland, C. (2011). Unbiased histological
examinations in toxicological experiments (or, the informed
leading the blinded examination). Toxicol. Pathol. 39, 711–714.
doi:10.1177/0192623311406288.
Holman, L., Head, M. L., Lanfear, R., and Jennions, M. D.
(2015). Evidence of experimental bias in the life sciences: why
we need blind data recording. PLoS Biol. 13, e1002190. doi:
10.1371/journal.pbio.1002190.
Hotz, R. L. (2007). Most science studies appear to be tainted
by sloppy analysis. The Wall Street Journal September 14.
URL http://online.wsj.com/article/SB118972683557627104.
html
Gelman, A. and Stern, H. (2006). The difference between "significant" and "not significant" is not itself statistically significant. Am. Stat. 60, 328–331.
Giesbrecht, F. G. and Gumpertz, M. L. (2004). Planning, Construction, and Statistical Analysis of Comparative Experiments.
New York, NY: J. Wiley.
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Med. 2, e124. doi:10.1371/journal.pmed.
0020124.
Ioannidis, J. P. A. (2014). How to make more published research true. PLoS Med. 11, e1001747. doi:10.1371/journal.
pmed.1001747.
Tukey, J. W. (1980). We need both exploratory and confirmatory. The American Statistician 34, 23–25.
Van Belle, G. (2008). Statistical Rules of Thumb. Hoboken, NJ: J.
Wiley, 2nd edition.
van der Worp, B., Howells, D. W., Sena, E. S., Porritt, M., Rewell, S., O'Collins, V., and Macleod, M. R. (2010). Can animal models of disease reliably inform human studies? PLoS Med. 7, e1000245. doi:10.1371/journal.pmed.1000245.
van Luijk, J., Bakker, B., Rovers, M. M., Ritskes-Hoitinga, M.,
de Vries, R. B. M., and Leenaars, M. (2014). Systematic reviews of animal studies; missing link in translational research?
PLOS ONE 9, e89981. doi:10.1371/journal.pone.0089981.
Vandenbroeck, P., Wouters, L., Molenberghs, G., Van Gestel,
J., and Bijnens, L. (2006). Teaching statistical thinking to life
scientists: a case-based approach. J. Biopharm. Stat. 16, 61–75.
Ver Donck, L., Pauwels, P. J., Vandeplassche, G., and Borgers,
M. (1986). Isolated rat cardiac myocytes as an experimental
model to study calcium overload: the effect of calcium-entry
blockers. Life Sci. 38, 765–772.
Verheyen, F., Racz, R., Borgers, M., Driesen, R. B., Lenders,
M. H., and Flameng, W. J. (2014). Chronic hibernating myocardium in sheep can occur without degenerating events and
is reversed after revascularization. Cardiovasc Pathol. 23, 160–168. doi:10.1016/j.carpath.2014.01.003.
Vlaams Instituut voor Biotechnologie (2012). A scientific analysis of the rat study conducted by Gilles-Eric Séralini et al.
URL http://www.vib.be/en/news/Documents/20121008_
EN_Analyse\rattenstudieS{}ralini\et\al.pdf
Wacholder, S., Chanock, S., Garcia-Closas, M., El Ghormli, L., and Rothman, N. (2004). Assessing the probability that a positive report is false: an approach for molecular epidemiology studies. J Natl Cancer Inst 96, 434–442. doi:10.1093/jnci/
djh075.
Wasserstein, R. and Lazar, N. (2016). The ASA's statement on p-values: context, process, and purpose. The American Statistician 70. doi:10.1080/00031305.2016.1154108. In press.
URL http://dx.doi.org/10.1080/00031305.2016.1154108
Weissgerber, T. L., Milic, N. M., Winham, S. J., and Garovic,
V. D. (2015). Beyond bar and line graphs: time for a new
data presentation paradigm. PLoS Biol. 13, e1002128. doi:
10.1371/journal.pbio.1002128.
Wilcoxon, F., Rhodes, L. J., and Bradley, R. A. (1963). Two sequential two-sample grouped rank tests with applications to
screening experiments. Biometrics 19, 58–84.
Wilks, S. S. (1951). Undergraduate statistical education. J. Amer. Statist. Assoc. 46, 1–18.
Tufte, E. R. (1983). The Visual Display of Quantitative Information. Cheshire, CT: Graphics Press.
Appendices
Appendix A Glossary of Statistical Terms
is used to indicate the reliability of an estimate. For a given confidence level, if several
fects.
Alternative hypothesis : the hypothesis that, in contrast to the null hypothesis, there is a real effect of the treatment.
Critical value : the cutoff or decision value in hypothesis testing which separates the acceptance region from the rejection region.
experimental design in which the same number of observations is taken for each combination of treatments.
lation parameter.
Experimental unit : the smallest unit to which different treatments or experimental conditions can be applied.
Explanatory variable : also called predictor, a variable which is used in a relationship to explain the response variable.
Exploratory study : exploratory studies are used to refine experimental procedures and provide information on sources of variation.
False negative : the error of accepting the null hypothesis when it is false.
False positive : the error of rejecting the null hypothesis when it is true.
Hypothesis testing : a formal statistical procedure
where one tests a particular hypothesis on
the basis of experimental data.
Internal validity : extent to which a causal conclusion based on a study is warranted.
Level of significance : the allowable rate of false
positives, set prior to analysis of the data.
Null hypothesis : a hypothesis indicating no difference which will either be accepted or rejected as a result of a statistical test.
Observational unit : the unit on which the response is measured or observed; this is
not necessarily identical to the experimental
unit.
of both.
when subsampling is
Test statistic : a statistic used in hypothesis testing; extreme values of the test statistic are unlikely under the null hypothesis.
Treatment : a specific combination of factor levels.
Two-sided test : a statistical test for which the rejection region consists of both very large and very small values of the test statistic.
Type I error : error made by the incorrect rejection of a true null hypothesis.
Variability : the random fluctuation of a measurement process about its central value.
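The glossary entries on the level of significance and the false positive (Type I) error can be illustrated with a small simulation; the sample sizes, the number of simulated experiments, and the seed are arbitrary choices for this sketch. When the null hypothesis is true, a test carried out at the 0.05 level of significance should reject in roughly 5% of experiments.

```python
import random
import statistics

random.seed(1)  # arbitrary seed, fixed only for reproducibility

def two_sample_t(x, y):
    """Two-sample t statistic assuming equal variances."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * statistics.variance(x) +
           (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / \
           (sp2 * (1 / nx + 1 / ny)) ** 0.5

t_crit = 2.101  # two-sided 5% critical value for 18 degrees of freedom
n_sim, n_reject = 2000, 0
for _ in range(n_sim):
    # Both groups come from the same distribution: the null hypothesis is true
    x = [random.gauss(0, 1) for _ in range(10)]
    y = [random.gauss(0, 1) for _ in range(10)]
    if abs(two_sample_t(x, y)) > t_crit:
        n_reject += 1  # a false positive (Type I error)

print(f"false positive rate: {n_reject / n_sim:.3f}")
```

The observed rate fluctuates around the nominal 0.05 level, which is exactly what "the allowable rate of false positives" means in practice.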
B.1 Completely randomized design

Suppose 21 experimental units have to be randomly assigned to three treatment groups, such that each group contains seven animals.

B.1.1 MS Excel

A first column is created containing the 21 treatment labels and a second column with pseudo-random numbers between 0 and 1. Subsequently the two columns are selected and the Sort command from the Data-menu is executed. In the Sort-window that appears now, we select column B as the column to be sorted by. The sorted treatment labels then constitute the random allocation.

B.1.2 R-Language

In R, the same allocation is obtained by randomly permuting a vector of treatment labels:

> treat <- rep(c("A", "B", "C"), 7)  # seven replicates of each treatment
> treat <- sample(treat)             # random permutation of the 21 labels

B.2 Randomized complete block design

Suppose 20 experimental units are grouped in five homogeneous blocks of four units each, and four treatments A, B, C, and D have to be randomized within each block.

B.2.1 MS Excel

The procedure is analogous to that of Section B.1.1, except that the sorting by the column of pseudo-random numbers is carried out separately within each block.

B.2.2 R-Language

The design is first constructed in systematic order and the treatments are subsequently randomized within each block:

> treat <- rep(c("A", "B", "C", "D"), 5)
> design <- data.frame(block=rep(1:5, rep(4, 5)), treat=treat)
> head(design, 10) # first 10 exp units
   block treat
1      1     A
2      1     B
3      1     C
4      1     D
5      2     A
6      2     B
7      2     C
8      2     D
9      3     A
10     3     B
> rand <- order(design$block, runif(20)) # random order within each block
> design <- design[rand, ]
> head(design, 10)
   block treat
3      1     C
4      1     D
1      1     A
2      1     B
8      2     D
6      2     B
7      2     C
5      2     A
9      3     A
11     3     C
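As a closing check on the block randomization illustrated in this appendix, the construction can be sketched in a few lines: treatments are shuffled separately within each block, and the resulting design must still contain every treatment exactly once per block. The block and treatment labels follow the example above; the seed is an arbitrary choice for reproducibility.

```python
import random
from collections import Counter

random.seed(2016)  # arbitrary seed, fixed only for reproducibility

# Randomized complete block design: 5 blocks, treatments A-D once per block
design = []
for block in range(1, 6):
    labels = ["A", "B", "C", "D"]
    random.shuffle(labels)                  # randomize within the block
    design.extend((block, t) for t in labels)

# Sanity check: each treatment occurs exactly once in every block
for block in range(1, 6):
    counts = Counter(t for b, t in design if b == block)
    assert all(counts[t] == 1 for t in "ABCD")

print("balanced:", len(design), "units, each treatment once per block")
```

Such a check costs nothing and guards against the most common randomization mistake: shuffling over the whole design instead of within blocks, which destroys the balance that blocking was meant to provide.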