Statistical Thinking and Smart Experimental Design 2016

Statistical Thinking and
Smart Experimental
Design
Course Notes
V LAAMS I NSTITUUT
VOOR B IOTECHNOLOGIE
Luc Wouters
April 2016
Contents
1
Introduction
Smart Research Design by Statistical

Thinking
2.1
4.8.2
The architecture of experimental re-
20
4.8.1.6
Random sampling .
22
4.8.1.7
Standardization . .
22
Strategies
for
controlling
variability - good experi5
mental design . . . . . . . . .
22
2.1.1
The controlled experiment . .
4.8.2.1
Replication . . . . .
22
2.1.2
Scientific
4.8.2.2
Subsampling . . . .
23
4.8.2.3
Blocking . . . . . . .
24
4.8.2.4
Covariates . . . . .
24
2.1.3
research
as
a
5
Scientific research as an iterative, dynamic process . . . .
4.9
Simplicity of design . . . . . . . . . .
25
4.10 The calculation of uncertainty . . . .
25
2.2
Research styles - The smart researcher
2.3
Principles of statistical thinking . . .
7
5
Randomization . . .
search . . . . . . . . . . . . . . . . . .
phased process . . . . . . . .
4.8.1.5
Common Designs in Biological Experi-
Planning the Experiment
mentation
27
3.1
The planning process . . . . . . . . .
5.1
28
3.2
Types of experiments . . . . . . . . .
10
3.3
The pilot study . . . . . . . . . . . .
11
Principles of Statistical Design
13
4.1
Some terminology . . . . . . . . . . .
13
4.2
The structure of the response variable 13
4.3
Defining the experimental unit . . .
13
4.4
Variation is omnipresent . . . . . . .
16
4.5
Balancing internal and external validity . . . . . . . . . . . . . . . . . .
16
4.6
Bias and variability . . . . . . . . . .
17
4.7
Requirements for a good experiment
18
4.8
Strategies for minimizing bias and

maximizing signal-to-noise ratio . .
4.8.1
5.1.1
The completely randomized

design . . . . . . . . . . . . .
5.1.2
5.2
18
5.3
bias - good experimental

practice . . . . . . . . . . . . .
18
4.8.1.1
The use of controls .
18
4.8.1.2
Blinding . . . . . . .
19 6
4.8.1.3
The presence of a
block design . . . . . . . . . .
29
5.1.2.1
The paired design .
30
5.1.2.2
Efficiency
of
the
com-
plete block design .
31
5.1.3
Incomplete block designs . .
32
5.1.4
The Latin square design . . .
33
Treatment designs . . . . . . . . . . .
34
5.2.1
One-way layout . . . . . . . .
34
5.2.2
Factorial designs . . . . . . .
34
More complex designs . . . . . . . .
37
5.3.1
The split-plot and strip-plot
5.3.2
The repeated measures design 38
designs . . . . . . . . . . . . .
20
Calibration . . . . .
20
6.1
41
Determining sample size is a risk cost assessment . . . . . . . . . . . .
37
The Required Number of Replicates Sample Size
technical protocol .
28
The randomized complete
randomized
Strategies for minimizing
4.8.1.4
Error-control designs . . . . . . . . .
41
ii
CONTENTS
6.2
The context of biomedical experiments 41
9.1.1
Experimental design . . . . .
59
6.3
The hypothesis testing context - the
9.1.2
Statistical methods . . . . . .
59
The Results section . . . . . . . . . .
60
6.4
Sample size calculations . . . . . . .
42
9.2.1
Summarizing the data . . . .
60
42
9.2.2
Graphical displays . . . . . .
60
9.2.3
Interpreting and reporting
6.4.1
Power analysis computations
6.4.2
Meads
resource
require43
6.5
How many subsamples . . . . . . . .
44
6.6
Multiplicity and sample size . . . . .
46
6.7
The problem with underpowered

studies . . . . . . . . . . . . . . . . .
47
Sequential plans . . . . . . . . . . . .
48
significance tests . . . . . . .
10 Concluding Remarks and Summary
10.2 Recommended reading . . . . . . . .
63
10.3 Summary . . . . . . . . . . . . . . . .
64
References
65
7.1
The statistical triangle . . . . . . . .
7.2
The statistical model revisited . . . .
7.3
Significance tests . . . . . . . . . . .
51 Appendices
51
52 Appendix A Glossary of Statistical Terms
7.4
Verifying the statistical assumptions
53
7.5
The meaning of the p-value and sta53
Multiplicity . . . . . . . . . . . . . .
55
The Study Protocol
57
Interpretation and Reporting
59
9.1
59
The Methods section . . . . . . . . .
63
63
51
tistical significance . . . . . . . . . .
61
10.1 Role of the statistician . . . . . . . .
The Statistical Analysis
7.6
8
41
ment equation . . . . . . . . .
6.8
9.2
population model . . . . . . . . . . .
73
Appendix B Tools for randomization in MS

Excel and R
77
B.1 Completely randomized design . . .
77
B.1.1
MS Excel . . . . . . . . . . . .
77
B.1.2
R-Language . . . . . . . . . .
77
B.2 Randomized complete block design

B.2.1 MS Excel . . . . . . . . . . . .
78
78
B.2.2
R-Language . . . . . . . . . .
78
1. Introduction
More often than not, we are unable to reproduce findings published by researchers in journals.
Glenn Begley, Vice President Research Amgen (2015)
The way we do our research [with our animals] is stone-age.
Ulrich Dirnagl, Charit University Medicine Berlin (2013)
Over the past decade, the biosciences have been
sensitive and resistant, etc. Baggerly and Coombes
plagued by problems with the replicability and re-
concluded that they were unable to reproduce the
producibility of research findings. This lack of re- analysis carried out by Potti et al. (2006), but the
liability can be attributed in large part to statistical
fallacies, misconceptions, and other methodolog-
damage was done. Several clinical trials had started
ical issues (Begley and Ioannidis, 2015; Loscalzo,
based on the erroneous results. In 2011, after several corrections, the original study by Potti et al.
2012; Peng, 2015; Prinz et al., 2011; Reinhart, 2015;
was retracted from Nature Medicine because we
van der Worp et al., 2010). The following exam-
have been unable to reproduce certain crucial exper-
ples illustrate some of these problems and show
iments (Potti et al., 2011).
that there is a definite need to transform and imExample 1.2. In 2009, a group of researchers from
prove the research process.
the Harvard Medical School published a study

Example 1.1. In 2006, a group of researchers from showing that cancer tumors could be destroyed by
Duke University led by Anil Potti published a pa- targeting the STK33 protein (Scholl et al., 2009).
per claiming that they had built an algorithm us- Scientists at Amgen Inc. pounced on the idea and
ing genomic microarray data that allowed to predict assigned a group of 24 researchers to try to repeat
which cancer patients would respond to chemother- the experiment with the objective of developing a
apy (Potti et al., 2006). This would spare patients new medicine. After six months intensive lab work,
the side effects of ineffective treatments. Of course it turned out that the project was a waste of time
this paper drew a lot of attention and many inde- and money, since it was impossible for the Amgen
pendent investigators tried to reproduce the results. scientists to replicate the results (Babij et al., 2011;
Keith Baggerly and Kevin Coombes, two statisti-
Naik, 2011). Unfortunately, this was not the only
cians at MD Anderson Cancer Center, were also
problem of replicability the Amgen researchers en-
asked to have a look at the the data. What they countered. During a decade Begley and Ellis (2012)
found was a mess of poorly conducted data analysis identified a set of 53 landmark publications in pre(Baggerly and Coombes, 2009). Some of the data
clinical cancer research, i.e. papers in top journals
was mislabeled, some samples were duplicated in from reputable labs. A team of 100 scientists tried
the data, sometimes samples were marked as both to replicate the results. To their surprise, in 47 of
2 Formally, we consider replicability as the replication of scientific findings using independent investigators, methods,
data, equipment, and protocols. Replicability has long been, and will continue to be, the standard by which scientific claims
are evaluated. On the other hand, reproducibility means that starting from the data gathered by the scientist, we can
reproduce the same results, p-values, confidence intervals, tables and figures as those reported by the scientist (Peng, 2009).
the 53 studies (i.e. 89%) the findings could not be
CHAPTER 1. INTRODUCTION
Seralinis paper, claiming that it did not reach the
replicated. This outcome was particularly disturb- journals threshold of publication (Hayes, 2014) 1 .
ing since Begley and Ellis did every effort to work in
close collaboration with the authors of the original
papers and even tried to replicate the experiments
in the laboratory of the original investigator. In
some cases, 50 attempts were made to reproduce
the original data, without obtaining the claimed result (Begley, 2012). What is even more troubling is
that Amgens findings were consistent with those of
others. In a similar setting, Bayer researchers found
that only 25% of the original findings in target discovery could be validated (Prinz et al., 2011).
Example 1.4. Selwyn (1996) describes a study

where an investigator examined the effect of a test
compound on hepatocyte diameters. The experimenter decided to study eight rats per treatment
group, three different lobes of each rats liver, five
fields per lobe, and approximately 1,000 to 2,000
cells per field. At that time, most of the work,
i.e. measuring the cell diameters, was done manually, making the total amount of work, i.e. 15,000
- 30,000 measurements per rat, substantial. The
experimenter complained about the overwhelming
Example 1.3. Seralini et al. (2012) published a 2- amount of work in this study and the tight deadyear feeding study in rats investigating the health lines that were set up. A sample size evaluation
effects of genetically modified (GM) maize NK603 conducted after the study was completed indicated
with and without glyphosate-containing herbicides. that sampling as few as 100 cells per lobe would
The authors of the study concluded that GM maize have been without appreciable loss of information.
NK603 and low levels of glyphosate herbicide formulations, at concentrations well below officiallyset safe limits, induce severe adverse health effects,
Unreliable biological
reagents and
reference materials 36.1%
such as tumors, in rats. Apart from the publication, Seralini also presented his findings in a press
conference, which was widely covered in the media
showing shocking photos of rats with enormous tumors. Consequently, this study had severe impact
on the general public and also on the interest of
Improper study
design 27.6%
Laboratory protocol
errors 10.8%
Inadequate
data analysis
and reporting 25.5%
industry. The paper was used in the debate over

a referendum over labeling of GM food in California, and it led to bans on importation of certain
Figure 1.1. Categories of errors that contribute to the problem

of replicability in life science research (source Freedman et al.
2015)
Doing good science and producing highGMOs in Russia and Kenya. However, short after
its publication many scientists, among them also quality data should be the concern of every serious
researchers from the VIB (Vlaams Instituut voor research scientist. Unfortunately, as shown by the
Biotechnologie, 2012), heavily criticized the study first three examples, this is not always the case. As
and expressed their concerns about the validity of mentioned above, there is a genuine concern about
the findings. A polemic debate started with oppo- the reproducibility of research findings and it has
nents of GMOs and also within the scientific com- been argued that most research findings are false
munity, which inspired media to refer to the contro- (Ioannidis, 2005). In a recent paper Begley and
versy as The Seralini affair or Seralini tumor-gate. Ioannidis (2015) estimated that 85% of biomedical
Subsequently, the European Food Safety Authority research is wasted at large. Freedman et al. (2015)
(2012) thoroughly scrutinized the study and found tried to identify the root causes of the replicability
that it was of inadequate design, analysis and re- problem and to estimate its economic impact. They
porting. Specifically, the number of animals was estimated that in the United States alone approxiconsidered too small and not sufficient for reaching
mately US$28B/year is spent on research that can-
a solid conclusion. Eventually, the journal retracted not be replicated. The main problems causing this
1 S
eralini managed to republish the study in Environmental Sciences Europe (S
eralini et al., 2014), a journal with a
considerably lower impact factor.
lack of replicability are summarized in Figure 1.1.
or no thought is given to methodological issues, in
Issues in study design and data analysis accounted
particular to the statistical aspects of the study de-
for more than 50% of the studies that could not be
sign, the studies are often seriously flawed and are
replicated. This was also concluded by Kilkenny
not capable of meeting their intended purpose. In
et al. (2009) who surveyed 271 papers reporting
some cases, such as the Sralini study study, the
laboratory animal experiments. They found that
experiments were designed too small to enable an
most of the studies had flaws in their design and
answer to the research question. Conversely, like
almost one third of the papers that used statistical
in Example 1.4, there are also studies that use too
methods did not describe them or did not present
much experimental material, so that valuable re-
their results adequately.
sources are wasted.
Not only scientists, but also the journals have a
To improve on these issues of credibility and effi-
great responsibility in guarding the quality of their
ciency, we need effective interventions and change
publications.
Peer reviewers and editors, who
the way scientists look at the research process
have little or no statistical training, let methodolog-
(Ioannidis, 2014; Reinhart, 2015). This can be ac-
ical errors pass undetected. Moreover, high-impact
complished by introducing statistical thinking and
journals tend to focus on statistically significant re-
statistical reasoning as powerful informed skills,
sults of unexpected findings, often without looking
based on the fundamentals of statistics, that en-
at the practical importance. Especially, in studies
hance the quality of the research data (Vanden-
with insufficient sample size, this publication bias
broeck et al., 2006). While the science of statis-
causes high numbers of irreproducible and even
tics is mostly involved with the complexities and
false results (Ioannidis, 2005; Reinhart, 2015).
techniques of statistical analysis, statistical thinking and reasoning are generalist skills that focus on
In addition to the problem of replicability of re-
the application of nontechnical concepts and prin-
search findings, there has also been a dramatic rise
ciples. There are no clear, generally accepted defi-
in the number of journal retractions over the last
nitions of statistical thinking and reasoning. In our
decades (Cokol et al., 2008). In a review of all
conceptualization we consider statistical thinking
2,047 biomedical and life-science research articles

indexed by PubMed as retracted on May 3, 2012,
as a skill that helps to better understand how sta-
Fang et al. (2012) found that 21.3% of the retractions were due to error, while 67.4% of the retrac-
to specific research problems and what the implications are in terms of data collection, experimen-
tions were attributable to misconduct, including
tal setup, data analysis, and reporting. Statistical
fraud or suspected fraud (43.4%), duplicate pub-
thinking will provide us with a generic methodol-
lication (14.2%), and plagiarism (9.8%).
ogy to design insightful experiments. On the other
tistical methods can contribute to finding answers
hand, we will consider statistical reasoning as beStudies such as those by Potti et al. (2006), Scholl
ing more involved with the presentation and inter-
et al. (2009), and Sralini et al. (2012), as well
pretation of the statistical analysis. Of course, as
as the lack of replicability in general and the in-
is apparent from the above, there is a large over-
creased number of retractions have also caught the
lap between the concepts of statistical thinking and
attention of mainstream media (Begley, 2012; Hotz,
reasoning.
2007; Lehrer, 2010; Naik, 2011; Zimmer, 2012) and

have put the integrity of science into question by
the general public.
Statistical thinking permeates the entire research

process and, when adequately implemented, can
lead to a highly successful and productive research
To summarize, a substantial part of the issues of
enterprise. This was demonstrated by the eminent
replicability can be attributed to a lack of quality in
scientist, the late Dr. Paul Janssen. As pointed out
the design and execution of the studies. When little
by Lewi (2005), the success of Dr. Paul Janssen
CHAPTER 1. INTRODUCTION
could be attibruted to a large extent on having a
was such a success that, when he retired in 1991,
set of statistical precepts being accepted by his col-
his laboratory had produced 77 original medicines
laborators. These formed the statistical founda-
over a period of less than 40 years. This still rep-
tion upon which his research was built and insured
resents a world record. In addition, at its peak, the
that research proceeded in an orderly and planned
Janssen laboratory produced more than 200 scien-
fashion, while at the same time having an open
tific publications per year (Lewi and Smith, 2007).
mind for unexpected opportunities. His approach
2. Smart Research Design by Statistical

Thinking
Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and
write !
Samuel S. Wilks, (1951).
2.1
2.1.1
The architecture of experi-
tistical thinking and reasoning can be of use to op-
mental research
timize their design and interpretation.
2.1.2
The controlled experiment
Scientific research as a phased process
There are two basic approaches to implement a scientific research project. One approach is to conduct an observational study2 in which we investigate the effect of naturally occurring variation and
the assignment of treatments is outside the control
of the investigator. Although there are often good
and valid reasons for conducting an observational
study, their main drawback is that the presence of
Phase
Deliverable
Definition
Research Proposal
Design
Protocol
Data Collection
Data set
Analysis
Conclusions
Reporting
Report
Figure 2.1. Research is a phased process with each of the phases

having a specific deliverable
concomitant confounding variables can never be
From a systems analysis point of view, the sci-
excluded, thus weakening the conclusions.
entific research process can be divided into five

distinct stages:
An alternative to an observational study is an
1. definition of the research question
experimental or manipulative study in which the

investigator manipulates the experimental system
2. design of the experiment
and measures the effect of his manipulations on

the experimental material. Since the manipulation
3. conduct of the experiment and data collec-
of the experimental system is under control of the
tion
experimenter, one also speaks of controlled experi4. data analysis
ments. A well-designed experimental study eliminates the bias caused by confounding variables.
5. reporting
The great power of a controlled experiment, provided it is well conceived, lies in the fact that it al-
Each of these phases results in a specific deliver-
lows us to demonstrate causal relationships. We
able (Figure 2.1). The definition of the research
will focus on controlled experiments and how sta-
question will usually result in a research or grant
2 also
called correlational study
CHAPTER 2. SMART RESEARCH DESIGN BY STATISTICAL THINKING
proposal, stating the hypothesis related to the research (research hypothesis) and the implications
or predictions that follow from it. The design of
2.2
Research styles - The smart

researcher
the experiment needed for testing the research hypothesis is formalized in a written protocol. After
the experiment has been carried out, the data will
be collected providing the experimental data set.
Statistical analysis of this data set will yield conclusions that answer the research question by accept-
Figure 2.3. Modulating between the concrete and abstract world
ing or rejecting the formalized hypothesis. Finally,
The five phases that make up the research pro-
a well carried out research project will result in a
cess modulate between the concrete and the ab-
report, thesis, or journal article.
stract world (Figure 2.3). Definition and reporting are conceptual and complex tasks requiring a
great deal of abstract reasoning. Conversely, experimental work and data collection are very concrete, measurable tasks handling with the practical
2.1.3
Scientific research as an iterative, details and complications of the specific research

domain.
dynamic process
Figure 2.2. Scientific research as an iterative process
Scientific research is not a simple static activity, but as depicted in Figure 2.2, an iterative and
highly dynamic process. A research project is carried out within some organizational or manage-
Figure 2.4. Archetypes of researchers based on the relative fraction of the available resources that they are willing to spend at
each phase of the research process. D(1): definition phase, D(2):
design phase, C: data collection, A: analysis, R: reporting
ment context which can be rather authoritative;
Scientists exhibit different styles in their research
this context can be academic, governmental, or cor-
depending on the relative fraction of the available
porate (business). In this context, the management
resources that they are willing to spend at each
objectives of the research project are put forward.
phase of the research process. This allows us to
The aim of our research project itself is to fill an
recognize different archetypes of researchers (Fig-
existing information gap. Therefore, the research
ure 2.4):
question is defined, the experiment is designed

and carried out and the data are analyzed. The
results of this analysis allow informed decisions
to be made and provide a way of feedback to ad-
the novelist who needs to spend a lot of time

distilling a report from an ill-conceived experiment;
just the definition of the research question. On the
the data salvager who believes that no matter
other hand, the experimental results will trigger re-
how you collect the data or set up the exper-
search management to reconsider their objectives
iment, there is always a statistical fix-up at
and eventually request for more information.
analysis time;
2.3. PRINCIPLES OF STATISTICAL THINKING
Table 2.1. Statistical thinking versus statistics
Statistics
Statistical Thinking
Specialist skill
Science
Technology
Closure, seclusion
Introvert
Discrete interventions
Builds on good thinking
Generalist skill
Informed practice
Principles, patterns
Ambiguous, dialogue
Extravert
Permeates the research process
Valued skill itself
the lab freak who strongly believes that if formed practice and focused on the applications of
enough data are collected something inter-
nontechnical concepts and principles.
esting will always emerge;

The statistical thinker attempts to understand
the smart researcher who is aware of the ar- how statistical methods can contribute to finding
chitecture of the experiment as a sequence of answers to specific research problems in terms of
steps and allocates a major part of his time
data collection, experimental setup, data analysis
budget to the first two steps: definition and
and reporting. He or she is able to postulate which
design.
statistical expertise is required to enhance the re-
The smart researcher is convinced that time spent

planning and designing an experiment at the outset will save time and money in the long run. He
opposes the lab freak by trying to reduce the number of measurements to be taken, thus effectively
search projects success. In this capacity the statistical thinker acts as a diagnoser.
In contrast to statistics, which operates in a
closed and secluded mathematical context, statis-
tical thinking is a practice that is fully integrated

reducing the time spent in the lab. In contrast to
with the researchers scientific field, not merely an
the lab freak, the smart researcher recognizes that the
autonomous science. Hence, the statistical thinker
design of the experiment will govern how the data
operates in a more ambiguous setting, where he
will be analyzed, thereby reducing time spent at
is deeply involved in applied research, with a
the data analysis stage to a minimum. By carefully
good working knowledge of the substantive scipreparing and formalizing the definition and deence. In this role, the statistical thinker acts as an
sign phase, the smart researcher can look ahead to
(intermediary) between scientists and statisticians
the reporting phase with peace of mind, which is
and goes into dialogue with them. He attempts to
in contrast to the novelist.
integrate the several potentially competing priori-
2.3
Principles of statistical thinking
The smart researcher recognizes the value of statis-
ties that make up the success of a research project:

resource economy, statistical power, and scientific
relevance, into a coherent and statistically underpinned research strategy.
While the impact of the statistician on the re-
tical thinking for his application area and he him- search process is limited to discrete interventions,
self is skilled in statistical thinking, or he collabo-
the statistical thinker truly permeates the research
rates with a professional who masters this skill. As
process. His combined skills lead to increased ef-
noted before, statistical thinking is related to, but
ficiency, which is important to increase the speed
distinct from statistical science (Table 2.1). While
with which research data, analyses, and conclu-
statistics is a specialized technical skill based on
sions become available. Moreover, these skills al-
mathematical statistics as a science on its own, sta-
low to enhance the quality and to reduce the asso-
tistical thinking is a generalist skill based on in- ciated cost. Statistical thinking then helps the sci-
CHAPTER 2. SMART RESEARCH DESIGN BY STATISTICAL THINKING
entist to build a case and negotiate it on fair and

objective grounds with those in the organization
seeking to contribute to more business-oriented
measures of performance. In that sense, the successful statistical thinker is a persuasive communicator. This comparison clearly shows that the power
of statistics in research is actually founded upon
good statistical thinking.
Smart research design is based on the seven basic principles of statistical thinking:
1. Time spent thinking on the conceptualization
and design of an experiment is time wisely
spent.
2. The design of an experiment reflects the contributions from different sources of variabil-
ity.
3. The design of an experiment balances between its internal validity (proper control of
noise) and external validity (the experiments
generalizability).
4. Good experimental practice provides the
clue to bias minimization.
5. Good experimental design is the clue to the
control of variability.
6. Experimental design integrates various disciplines.
7. A priori consideration of statistical power is
an indispensable pillar of an effective experiment.
3. Planning the Experiment

Experimental observations are only experience carefully planned in advance, and designed to form a
secure basis of new knowledge.
R. A. Fisher (1935).
3.1
The planning process
be realistic in what can be accomplished in a single

study.
Example 3.1. The study by Seralini et al. (2012) is a
typical example of a study where the research team
tried to accomplish too much objectives. In this
study 10 treatments were examined in both female
and male rats. Since the research team apparently
had a very limited amount of resources available,
the investigators used only 10 animals per treatment per sex. This was far below the 50 animals
per treatment group that are standard in long term
Figure 3.1. The planning process
carcinogenicity studies (Gart et al., 1986; Haseman,
The first step in planning an experiment (Fig-
1984).
ure 3.1) is the specification of its objectives. The

researcher should realize what the actual goal is
After having formulated the research objec-
of his experiment and how it integrates into the
tives, the scientist will then try to transfer them
whole set of related studies on the subject. How
into scientific hypotheses that might answer the
does it relate to management or other objectives?
question. Often it is impossible to study the re-
How will the results from this particular study
search objective directly, but some surrogate ex-
contribute to knowledge about the subject? Sometimes a preliminary exploratory experiment is use-
perimental model is used instead. For example
ful to generate clear questions that will be answered in the actual experiment. The study ob-
toxic in rats. The real objective was to establish

the toxicity in humans. As a surrogate for man, the
jectives should be well defined and written out as
Sprague-Dawley strain of rat was chosen as exper-
Sralini was not interested whether GMOs were
explicitly as possible. It is wise to limit the objec- imental model. By doing so, an auxiliary hypothesis
tives of a study to a maximum of, say three (Sel-
(Hempel, 1966) was put forward, namely that the
wyn, 1996). Any more than that risks designing an
experimental model was adequate to the research
overly complex experiment and could compromise
objectives. Sralinis choice of the Sprague-Dawley
the integrity of the study. Trying to accomplish
rat strain received much criticism (European Food
each of many objectives in a single study stretches
Safety Authority, 2012), since this strain is prone to
its resources too thin and as a result, often none of
the development of tumors. Auxiliary hypotheses
the study objectives is satisfied. Objectives should
also play a role when it is difficult or even impos-
also be reasonable and attainable and one should
sible to measure the variable of interest directly. In

9
10
CHAPTER 3. PLANNING THE EXPERIMENT
this case, an indirect measure as a surrogate for the
are used to explore a new research area. They pro-
target variable might be available and the investi-
vide a powerful method for discovery (Hempel,
gator relies on the premiss that the indirect mea-
1966), i.e they are performed to generate new hy-
sure is a valid surrogate for the actual target vari-
potheses that can then be formally tested in confir-
able.
matory experiments. Replication, sample size and

formal hypothesis testing are less appropriate with
Based on both the scientific and auxiliary hy-
this type of experiment. Currently, the vast major-
potheses, the researcher will then predict the test
ity of published research in the biomedical sciences
implications of what to expect if these hypotheses
originates from this sort of experiment (Kimmel-
are true. Each of these predictions should be the
man et al., 2014). The exploratory nature of these
strongest possible test of the scientific hypotheses.
studies is also reflected in the way the data are
The deduction of these test implications also in-
analysed. Exploratory data analysis as opposed to
volves additional auxiliary hypotheses. As stated
confirmatory data analysis is a flexible approach,
by Hempel (1966), reliance on auxiliary hypotheses
based mainly on graphical displays, towards for-
is the rule, rather than the exception, in testing sci-
mulating new theories (Tukey, 1980). Exploratory
entific hypotheses. Therefore, it is important that
studies aim primarily at developing these new re-
the researcher is aware of the auxiliary assump-
search hypotheses, but they do not answer unam-
tions he makes when predicting the test implica-
biguously the research question, since using the
tions. Generating sensible predictions is one of the
same data that generated the research hypothesis
key factors of good experimental design. Good
also for its confirmation, involves circular reasoning. Exploratory studies tend to consist of a pack-
predictions will follow logically from the hypotheses that we wish to test, and not from other rival
hypotheses. Good predictions will also lead to in-
age of small and flexible experiments using differ-
sightful experiments that allow the predictions to
study by Sralini et al. (2012) was in fact an ex-
be tested.
ploratory experiment and much of the controver-
ent methodologies (Kimmelman et al., 2014). The
sies around this study would not have arisen, if it

The next step in the planning process is then to
decide which data are required to confirm or refute
the predicted test implications. Throughout the sequence of question, hypothesis, prediction it is essential to assess each step critically with enough
skepticism and even ask a colleague to play the
devils advocate. During the design and planning
stage of the study one should already have the
person refereeing the manuscript in mind. It is
much better that problems are identified at this
early stage of the research process than after the
experiment started. At the end of the experiment
the scientist should be able to determine whether
the objectives have been met, i.e. whether the research questions were answered to satisfaction.
would have been presented as such.

Pilot experiments are designed to make sure the
research question is sensible, they allow to refine
the experimental procedures, to determine how
variables should be measured, whether the experimental setup is feasible, etc. Pilot experiments
are especially useful when the actual experiment
is large, time-consuming or expensive (Selwyn,
1996). Information obtained in the pilot experiment is of particular importance when writing the
technical and study protocol of such studies. Pilot
experiments are discussed in more detail in Section
3.3.
Confirmatory experiments are used to assess the
test implications of a scientific hypothesis.
3.2
Types of experiments
In
biomedical research this assessment is based on

statistical methodology. In contrast to exploratory
We first distinguish between exploratory, pilot and
studies, confirmatory experiments make use of
confirmatory experiments. Exploratory experiments
rigid pre-specified designs and a priori stated hy-
3.3. THE PILOT STUDY
11
potheses. Exploratory and confirmatory studies
combination of row and column effects (Burrows
complement one another in the sense that the for-
et al., 1984). A variation experiment can also tell
mer generates the hypotheses that can be put to
us about the importance of cage location in animal
crucial testing in the latter. Confirmatory exper-
experiments, where animals are kept in racks of 24
iments are the main topic of this tutorial.
cages. Animals in cages close to the ventilation

could respond differently from the rest (Young,
A further distinction between different types of
1989). Other examples of variation experiments
experiments is based on the type of objective of
come from the area plant research where the posi-
the study in question. A comparative experiment is
tion of plants in the greenhouse (Brien et al., 2013)
one in which two or more techniques, treatments,
or the growth chamber (Potvin et al., 1990) could
or levels of an explanatory variable are to be com-
be the subject of investigation.
pared with one another. There are many examples

of comparative experiments in biomedical areas.
For example in nutrition studies different diets can
3.3
The pilot study
be compared to one another in laboratory animals.

In clinical studies, the efficacy of an experimental
As researchers are often under considerable time
drug is assessed in a trial by comparing it to treat-
pressure, there is the temptation to start as soon
ment with placebo. We will focus primarily on de-
as possible with the actual experiment. However,
signing comparative experiments for confirmation
a critical step in a new research project, that is of-
of research hypotheses.
ten missed, is to spend a bit of time and resources

on the beginning of the study collecting some pilot
A second type of experiment is the optimiza- data. Preliminary experiments on a limited scale,
tion experiment which has the objective of finding or pilot experiments, are especially useful when
conditions that give rise to a maximum or mini-
we deal with time-consuming, important or expen-
mum response. Optimization experiments are of-
sive studies and are of great value for assessing the
ten used in product development, such as finding
feasibility of the actual experiment. During the pi-
the optimum combination of concentration, tem-
lot stage the researcher is allowed to make varia-
perature, and pressure that gives rise to the maximum yield in a chemical production plant. Dose
tions in experimental conditions such as measure-
finding trials in clinical development are another
study can be of help to make sure that a sensible research question was asked. For instance, if
example of optimization experiments.
ment method, experimental set-up, etc. The pilot
our research question was about whether there is

The third type of experiment is the prediction experiment in which the objective is to provide some
statistical/mathematical model to predict new responses. Examples are dose response experiments
in pharmacology and immuno-assay experiments.
The final experimental type is the variation experiment. This type of experiment has as objective to study the size and structure of bias and
random variation. Variation experiments are implemented as uniformity trials, i.e. studies with-
a difference in concentration of a certain protein

between diseased and non-diseased tissue, it is of
importance that this protein is present in a measurable amount. Carrying out a pilot experiment
in this case can save considerable time, resources
and eventual embarrassment. One could also wonder whether the effect of an intervention is large
enough to warrant further study. A pilot study can
then give a preliminary idea about the size of this
effect and could be of help in making such a strategic decision.
out different treatment conditions. For example,

the assessment of sources of variation in microtiter
A second crucial role for the pilot study is for
plate experiments. These sources of variation can be the researcher to practice, validate and standardplate effects, row effects, column effects, and the
ize the experimental techniques that will be used
12
CHAPTER 3. PLANNING THE EXPERIMENT
in the full study. When appropriate, trial runs of
just the required sample size of the experiment and
different types of assays allow fine-tuning them so
to set up the data analysis environment.
that they will give optimal results. Finally, the pilot

study provides basic data to debug and fine-tune
The pilot study still belongs to the exploratory
the experimental design. Provided the experimen-
phase of the research project and is not part of the
tal techniques work well, carrying out a small-scale
actual, final experiment. In order to preserve the
version of the actual experiment will yield some
quality of the data and the validity of the statistical
preliminary experimental data. These pilot data
analysis, the pilot data cannot be included in the final
can be very valuable and allow to calculate or ad-
dataset.
4. Principles of Statistical Design

It is easy to conduct an experiment in such a way that no useful inferences can be made.
William Cochran and Gertrude Cox (1957).
4.1
Some terminology
the specific treatment, the effect of the experimental design, and an error component that describes the
We refer to a factor as the condition or set of con-
deviation of this particular experimental unit from
ditions that we manipulate in the experiment, e.g.
the mean value of its treatment group. There are

the concentration of a drug. The factor level is the some strong assumptions associated with this simparticular value of a factor, e.g. 10-6 M, 10-5 M. A ple model:
treatment consists of a specific combination of fac the treatment terms add rather than, for extor levels. In single-factor studies a treatment corample multiply;
responds to a factor level. The experimental unit is

defined as the smallest physical entity to which a
treatment effects are constant;
treatment is independently applied. The character the response in one unit is unaffected by the
istic that is measured and on which the effect of the
treatment applied to the other units.
different treatments is investigated and analysed is

referred to as the response or dependent variable. The
These assumptions are particularly important in
observational unit is the unit on which the response
the statistical analysis. A statistical analysis is only
is measured or observed. Often the observational
valid when all of these assumptions are met.
unit is identical to the experimental unit, but this

is not necessarily always the case. The definition
4.3
of additional statistical terms can be found in Appendix ??.
4.2
Defining
the
experimental
unit
The structure of the response

variable
The experimental unit corresponds to the smallest

division of the experimental material to which a
treatment can (randomly) be assigned, such that
any two units can receive different treatments. It
is important that the experimental units respond
independently of one another, in the sense that a
treatment applied to one unit cannot affect the re-
Figure 4.1. The response variable as the result of an additive

model
We assume that the response obtained for a par-
sponse obtained in another unit and that the occurrence of a high or low result in one unit has no
effect on the result of another unit. Correct iden-
ticular
tification of the experimental unit is of paramount

importance for a valid design and analysis of the
experimental unit can be described by a simple
additive model (Figure 4.1) consisting of the effect of study.

13
14
CHAPTER 4. PRINCIPLES OF STATISTICAL DESIGN
In many experiments the choice of the experi-
there were only three experimental units per treat-
mental unit is obvious. However, in studies where
ment, certainly not 280 and 162 units1 . The cor-
replication is at multiple levels, or when replicates
rect method of analysis calculates for each animal
cannot be considered independent, it often hap-
the average cell diameter and takes this value as the
pens that investigators have difficulties recogniz-
response variable.
ing the proper basic unit in their experimental material. In these cases, the term pseudoreplication is
often used (Fry, 2014). Pseudo-replication can result in a false estimate of the precision of the experimental results leading to invalid conclusions
(Lazic, 2010).
Mistakes as in the above example are abundant

whenever microscopy is concerned and the individual cell is used as experimental unit. One could
wonder whether these are mistakes made out of ignorance or out of convenience. The concern is even
greater when such studies get published in peer reviewed high impact scientific journals.
The following example represents a situation

commonly encountered in biomedical research
when multiple levels are present.
Independence of units can be an issue in plant

research and in studies using laboratory animals
as is illustrated by the following example.
Example 4.2. (LeBlanc, 2004) In order to assess the
effect of fertilization on corn plant growth, two plots
are established: one fertilized and one without fertilizer (control). In each plot 100 corn plants are
grown. To determine the effect of fertilization, the
average mass of the corn plants is compared between the two plots.
Figure 4.2. Morphometric analysis of the diameter of bile canaliculi in wild-type and Cx32-deficient liver. MeansSEM from
three livers. *: P<0.005 (after Temme et al. (2001))
The choice of plants as experimental units is incorrect, because the observations of growth made
on adjacent plants within a plot are not indepen-
Example 4.1. Temme et al. (2001) compared two
dent of one another. If one plant is genetically sugenetic strains of mice, wild-type and connexin 32 perior and grows particularly large, it will tend to
(Cx32)-deficient. They measured the diameters of shade its inferior neighbors and cause them to grow
bile canaliculi in the livers of three wild-type and more slowly. An appropriate design would be to put
of three Cx32-deficient animals, making several ob- the plants into separate pots and randomly treatservations on each liver. Their results are shown in ing each pot with fertilizer or not. In this case there
Figure 4.2. It should be clear that Temme et al. is no competition of resources between plants. An(2001) mistakenly took cells, which were the ob- other alternative would make use of several fertilservational units, for experimental units and used ized and unfertilized plots of ground. The outcome
them also as units of analysis. If we consider the would then be the crop yield in each plot.
genotype as the treatment, then it is clear that not
the cell but the animal is the experimental unit.
Moreover, cells from the same animal will be more
alike than cells from different animals. This interdependency of the cells invalidates the statisti-
Laboratory animals such as mice and rats are

often housed together in cages and the treatment
can than conveniently be supplied by the tap water, such as in the following example:
cal analysis, as it was carried out by the investiga- Example 4.3. Rivenson et al. (1988) studied the toxtors. Therefore, the correct experimental unit and icity of N-nitrosamines in rats and described their
unit of analysis is the animal, not the cell. Hence, experimental set-up as:
1 If we recalculate the standard errors of the mean (SEM) using the appropriate number of experimental units, then they
are a factor 7-10 larger than the reported ones.
4.3. DEFINING THE EXPERIMENTAL UNIT
15
The rats were housed in groups of 3 in
is not just a simple multiple of the number of an-
solid-bottomed polycarbonate cages with
imals in the cage and the required number of ex-
hardwood bedding under standard condi-
perimental units (cages). An experiment requiring
tions diet and tap water with or without
10 animals per treatment group when housed in-
N-nitrosamines were given ad libitum.
dividually is almost equivalent to an experiment
with 12 animals distributed over 4 cages per treatSince the treatment was supplied in the drinking ment. For some outcomes variability is expected to
water, it is impossible to provide different treat- be reduced when animals are more content when
ments to any two individual rats.
Furthermore, they are group-housed, which would enhance the

the responses obtained within the different animals latter experiments efficiency (Fry, 2014).
within a cage can be considered to be dependent
upon one another in the sense that the occurrence
of extreme values in one unit can affect the result
of another unit. Therefore, the experimental unit
here is not the single rat, but the cage.
The choice of the experimental unit is also of particular concern in plant research when the treatment condition has been applied to whole pots,
trays or growth chambers rather than to individual
basic units is found in the study by Sralini et al.
plants. By now, it should be clear that in these cases

it is the plot, tray or growth chamber that con-
(2012). In their study, rats were housed in groups
stitutes the experimental unit (Fernandez, 2007).
of two per cage and the treatment was present in
Especially in the case of growth chambers, when
the food delivered to the cages.
the treatment condition is an environmental fac-
An identical problem with independence of the
tor, it is difficult to obtain a sufficient number of

Even when the animals are individually treated,
e.g. by injection, group-housing can cause animals in the same cage to interact which would invalidate the assumption of independence of units.
For instance, in studies with rats, a socially dominant animal may prevent others from eating at certain times. Mice housed in a group usually lie together, thereby reducing their total surface area.
A reduced heat loss per animal in the group is
the result. Due to this behavioral thermoregulation
their metabolic rate is altered (Ritskes-Hoitinga
genuine replications. Therefore, experimenters are

tempted to consider the individual plants, pots,
trays or Petri dishes as experimental units, which is
wrong. Because the treatment condition (temperature, light intensity/duration, etc.) was applied to
the entire chamber, it is indeed the growth chamber itself that constitutes the experimental unit.
For a sufficient number of replications when the
number of growth chambers is limited, the experiment should be replicated using the same growth
chamber repeatedly over time (Fernandez, 2007).
and Strubbe, 2007).

Two-generation reproductive studies which inNevertheless, single housing of gregarious animal species is considered detrimental to their welfare and regulations in Europe concerning animal
welfare insist on group housing of such species
(Council of Europe, 2006). However, when animals
volve exposure in utero are standard procedures

in teratology. Also here, the entire litter rather
than the individual pup constitutes the experimental unit (Gart et al., 1986). This also applies to other
experiments in reproductive biology.
are housed together, the cage rather than the indi-
Example 4.4. (Fry, 2014) A drug was tested for its
vidual animal should be considered as the experi-
capacity to reduce the effect of a mutation causing
mental unit (Fry, 2014; Gart et al., 1986). Statistical
a common condition. To accomplish this, homozy-
analysis should take this into account by using ap-
gous mutant female rats were randomly assigned to
propriate techniques. Fortunately, as is pointed out
drug-treated and control groups. Then they were
by (Fry, 2014), when the cage is taken as the exper-
mated with homozygous mutant males, producing
imental unit the total number of animals needed
homozygous mutant offspring. Litters were weaned
16
and pups grouped five to a cage and the effects on essary information. This is in contrast to other scithe offspring were observed. Here, although obser- entific areas such as physics, chemistry and engivations on the individual offspring were made, the neering where the studied effects are much larger
experimental units are the mutant dams that were than the natural variation.
randomly assigned to treatment. Therefore, the observations on the offspring should be averaged to
give a single figure for each dam and these data,
one for each dam, are to be used for comparing the
4.5
Balancing internal and external validity
treatments.
A single individual can also relate to several experimental units. This is illustrated by the following example.
Example 4.5. (Fry, 2014) The efficacy of two agents
at promoting regrowth of epithelium across a wound
was evaluated by making 12 small wounds in a standardized way in a grid pattern on the back of a pig.
The wounds were far enough apart for effects on
each to be independent. One of four treatments
would then be applied at random to the wound in
each square of the grid. In this case the experimental unit would be the wound and, as there are 12
wounds, for each treatment there would be three
replicates.
Figure 4.3. The basic dilemma: balancing between internal and

external validity
Internal validity refers to the fact that in a wellconceived experiment the effect of a given treatment is unequivocally attributed to that treatment.
However, the effect of the treatment is masked by
the presence of the uncontrolled variation of the
experimental material.
An experiment with a high level of internal validity should have a great chance to detect the ef-
4.4
Variation is omnipresent
fect of the treatment. If we consider the treatment

effect as a signal and the inherent variation of our
Variation is everywhere in the natural world and
experimental material as noise, then a good exper-
is often substantial in the life sciences. Despite ac-
imental design will maximize the signal-to-noise ra-
curate execution of the experiment, the measure-
tio. Increasing the signal can be accomplished by
ments obtained in identically treated objects will

yield different results. For example, cells grown in
choosing experimental material that is more sensi-
test tubes will vary in their growth rates and in an-
tive to the treatment. Identification of factors that

increase the sensitivity of the experimental ma-
imal research no two animals will behave exactly
terial could be carried out in preliminary experi-
the same. In general, the more complex the system
ments. Reducing the noise is another way to in-
that we study, the more factors will interact with
crease the signal-to-noise ratio. This can be accom-
each other and the greater will be the variation
plished by repeating the experiment in a number
between the experimental units. Experiments in
of animals, but this is not a very efficient way of
whole animals will undoubtedly show more varia-
reducing the noise. An alternative for noise reduc-
tion than in vitro studies on isolated organs. When
tion is by using very uniform experimental mate-
the variation cannot be controlled or its source can-
rial. The use of cells harvested from a single animal
not be measured, we will refer to it as noise, ran-
is an example of noise reduction by uniformity of
dom variation or error. This uncontrollable variation
experimental material.
masks the effects under investigation and is the

reason why replication of experimental units and
External validity is related to the extent that our
statistical methods are required to extract the nec-
conclusions can be generalized to the target popu-
4.6. BIAS AND VARIABILITY
17
lation. The choice of the target population, how a

sample is selected from this population and the experimental procedures used in the study are all determinants of its external validity. Clearly, the experimental material should mimic the target population as close as possible. In animal experiments
specifying species and strain of the animal, the age
Figure 4.4. Bias and variability illustrated by a marksman shot

at a bulls eye
and weight range and other characteristics determine the target population and make the study as
Example 4.7. Confounding bias can enter a study
realistic and informative as possible. External va-
through less obvious routes. For example, micorar-
lidity can be very low if we work in a highly con-
rays allow biologists to measure the expression of
trolled environment using very uniform material.
thousands of genes or proteins simultaneously. Minor differences in a number of non-biological variables, such as reagents from different lots, differ-
Thus there is a trade-off between internal and ex-
ent technicians or even changing atmospheric ozone
ternal validity (Figure 4.3), as one goes up the other
levels can impact the data. Microarrays are usually
comes down. Fortunately, as we will see, there are

statistical strategies for designing a study such that
processed in batches and in large studies different
the noise is reduced while the external validity is
Consider a naive experimental setup in which on
maintained.
Monday all control samples are measured and on
institutes can collaberate using different equipment.
Tuesday all diseased samples, thus masking the effect of disease with that of the two batches. Worse
is that these batch effects do not affect the entire
4.6
Bias and variability
Bias and variability (Figure 4.4) are two important
microarray in the same manner. Correlation patterns differ by batch, and even reverse sign across
bathes (Leek et al., 2010).
concepts when dealing with the design of exper-
Studies using laboratory animals can be subject
iments. A good experiment will minimize or, at
to bias when all animals assigned to a specific treat-
best, try to eliminate bias and will control for vari-
ment are kept in the same cage. In Section 4.3, we
ability. By bias, we mean a systematic deviation in
already mentioned the problem of independence
observed measurements from the true value. One

of the most important sources of bias in a study is
when animals were housed in groups. In addi-
the way experimental units are allocated to treat-
same treatment, the effects due to the conditions
ment groups.
in the cage are superimposed on the effects of the
tion to these, when all animals in a cage receive the
treatments. When the experiment is restricted to

Example 4.6. Consider an animal study in which a single cage per treatment, the comparisons bethe effect of an experimental treatment is investi- tween the treatments can be biased(Fry, 2014). The
gated with respect to a control treatment. Suppose same reasoning applies to the location of the cages
the experimenter allocates all males to the control in the rack (Gart et al., 1986).
treatment and all females to the experimental treatment. Further assume that at the end of the ex-
By variability, we mean a random fluctuation
periment the investigator finds a strong difference
about a central value. The terms bias and vari-
between the two treatment groups. It is clear that ability are also related to the concepts of accuracy
this treatment effect is a biased result since the dif- and precision of a measurement process. Absence
ference between the two groups is completely con- of bias means that our measurement is accurate,
founded with the difference between both sexes.
while little variability means that the measurement
18
is precise. Good experiments are as free as possi-
4.8
ble from bias and variability. Of the two, bias is
Strategies for minimizing bias

and
the most important. Failure to minimize the bias
maximizing
signal-to-
noise ratio
of an experiment leads to erroneous conclusions

and thereby jeopardizes the internal validity. Conversely, if the outcome of the experiment shows
To safeguard the internal validity of his study, the
too much variability, this can sometimes be reme-
scientists needs to optimize the signal-to-noise ra-
diated by refinement of the experimental methods,
tio (Figure 4.5). This constitutes the basic principle
increasing the sample size, or other techniques. In
of statistical design of experiments. The signal can
this case the study may still reach the correct con-
be maximized by the proper choice of the measur-
clusions.
ing device and experimental domain. The noise is

minimized by reducing bias and variability. Strate-
4.7
Requirements for a good experiment
gies for minimizing the bias are based on good experimental practice, such as: the use of controls,
blinding, the presence of a protocol, calibration,
randomization, random sampling, and standard-
Cox (1958) enunciated the following requirements
ization. Variability can be minimized by elements
for a good experiment:
of experimental design: replication, blocking, co-
1. treatment comparisons should as far as possible be free of systematic error (bias);

2. the comparisons should also be made sufficiently precise (signal-to-noise);
3. the conclusions should have a wide range of
variate measurement, and sub-sampling. In addition random sampling can be added to enhance
the external validity. We will now consider each of
these strategies in more detail.
4.8.1
Strategies for minimizing bias good experimental practice
4.8.1.1
The use of controls
validity (external validity);

4. the experimental arrangement should be as
simple as possible;
5. uncertainty in the conclusions should be assessable.
In biomedical studies, a control or reference standard is a standard treatment condition against

which all others may be compared. The control
can either be a negative control or a positive control.
These five requirements will determine the basic
The term active control is also used for the latter. In
elements of the design. We have discussed already
some studies, both negative and positive controls
the first three requirements in the preceding sections. The following section provides some basic
are present. In this case, the purpose of the positive
strategies to fulfill these requirements.
of the experiment1 .
control is mostly to provide an internal validation
When negative controls are used, subjects can

sometimes act as their own control (self-control). In
this case the subject is first evaluated under standard conditions (i.e. untreated). Subsequently,
the treatment is applied and the subject is reevaluated.
Figure 4.5. Overview of strategies for minimizing the bias and
maximizing the signal-to-noise ratio
This design, also called pre-post de-
sign, has the property that all comparisons are

made within the same subject. In general, vari-
1 Active
controls play a special role in so-called equivalence or non-inferiority studies, where the purpose is to show that
a given therapy is equivalent or non-inferior to an existing standard.
4.8. STRATEGIES FOR MINIMIZING BIAS AND MAXIMIZING SIGNAL-TO-NOISE RATIO
19
ability within a subject is smaller than between
In a recent survey of studies in evolutionary bi-
subjects. Therefore, this is a more efficient design
ology and of the life sciences at large, Holman
than comparing control and treatment in two sep-
et al. (2015) found that in unblinded studies the
arate groups. However, the use of self-control has
mean reported effect size was inflated by 27% and
the shortcoming that the effect of treatment is com- the number of statistically significant findings was
pletely confounded with the effect of time, thus intro- substantially larger as compared to blinded studducing a potential source of bias. Furthermore,
ies. The importance of blinding in combination
blinding, which is another method to minimize
with randomization in animal studies was also
bias, is impossible in this type of design.
highlighted by Hirst et al. (2014). Despite its importance, blinding of experimenters is often ne-
Another type of negative control is where one

treated group does not receive any treatment at all,
i.e. the experimental units remain untouched. Just
as in the previous case of self-control, untreated controls cannot be blinded. Furthermore, applying the
treatment (e.g. a drug) often requires extra manipulation of the subjects (e.g. injection). The effect of
the treatment is then intertwined with that of the
manipulation and consequently it is potentially biased.
glected in biomedical research. For example, in

a systematic review of studies on animals in nonclinical research, van Luijk et al. (2014) found that
only 24% reported blinded assessment of the outcome, while only 15% considered blinding of the
caretaker/investigator.
Two types of blinding must be distinguished. In
single blinding the investigators are uninformed regarding the treatment condition of the experimental subjects. Single blinding neutralizes investigator
bias. The term double blinding in laboratory exper-
Vehicle control (laboratory experiments) or placebo iments means that both the experimenter and the
control (clinical trials) are terms that refer to a con- observer are uninformed about the treatment control group that receives a matching treatment con-
dition of the experimental units. In clinical trials
dition without the active ingredient. Another term
double blinding means that both investigators and
for this type of control, in the context of experi-
subjects are unaware of the treatment condition.
mental surgery, is sham control. In the sham control

group subjects or animals undergo a faked opera-
Two methods of blinding have found their way
tive intervention that omits the step thought to be
to the laboratory: group blinding and individual
therapeutically necessary. This type of vehicle con-
blinding. Group blinding involves identical codes,
trol, placebo control or sham control is the most
say A, B, C, etc., for entire treatment groups. This
desirable and truly minimizes bias. In clinical re-
approach is less appropriate, since when results ac-
search the placebo controlled trial has become the

gold standard. However, in the same context of
cumulate and the investigator is able to break the

code, blindness is completely destroyed. A much
clinical research ethical consideration may some-
better approach is to assign a code (e.g. sequence
times preclude its application.
number) to each experimental unit individually

and to maintain a list that indicates which code
4.8.1.2
Blinding
corresponds to which particular treatment. The sequence of the treatments in the list should be ran-
Researchers expectations may influence the study
domized. In practice, this individual blinding pro-
outcome at many stages. For instance, the exper-
cedure often involves an independent person that
imental material may unintentionally be handled
maintains the list and prepares the treatment con-
differently based on the treatment group, or ob-
ditions (e.g. drugs).
servations may be biased to confirm prior beliefs.

Blinding is a very useful strategy for minimizing
this subconscious experimenter bias.
Especially when the outcome of the experiment

is subjectively evaluated, blinding must be consid-
20
ered. However, there is one situation where blind-
of treatment effects across a range of disease ar-
ing does not seem to be appropriate, namely in tox-
eas and outcome measures. Randomization has
icologic histopathology. Here, the bias that would
been successfully applied in many areas of re-
be reduced by blinding is actually a bias favor-
search, such as microarray studies (Ghlmann and
ing diagnosis of a toxicological hazard and a con-
Talloen, 2009; Yang et al., 2008) and studies using
servative safety evaluation, which is appropriate
96-well microplates (Faessel et al., 1999; Levasseur
in this context. Blinded evaluation would result
et al., 1995).
in a reduction in the sensitivity to detect anomalies. In this context, Holland and Holland (2011)
Formal randomization, in our context, is the pro-
suggested that for toxicological work both an un-
cess of allocating experimental units to treatment
blinded and blinded evaluation of histologic mate-
groups or conditions according to a well-defined
rial should be carried out.
stochastic law1 . Randomization is a critical element in proper study design. It is an objective and
4.8.1.3
The presence of a technical protocol
scientifically accepted method for the allocation of

experimental units to treatment groups. Formal
The presence of a written technical protocol de-
randomization ensures that the effect of uncon-
scribing in full detail the specific definitions of
trolled sources of variability has equal probabil-
measurement and scoring methods is imperative
ity in all treatment groups. In the long run, ran-
to minimize potential bias. The technical protocol
domization balances treatment groups on unim-
specifies practical actions and gives guidelines for
portant or unobservable variables, of which we
lab technicians on how to manipulate the experi-
are often unaware. Any differences that exist in
mental units (animals, etc.), the materials involved

in the experiment, the required logistics, etc. It
these variables after randomized treatment alloca-
also gives details on data collection and process-
In other words, randomization is an operation that
ing. Last but not least, the technical protocol lays
effectively turns lethal bias into more manageable
down the personal responsibilities of the techni-
random error (Vandenbroeck et al., 2006). The ran-
cal staff. The importance and contents of the other
dom allocation of experimental units to treatment
protocol, the study protocol, will be discussed fur-
conditions is also a necessary condition for a rigorous statistical analysis, in the sense that it provides
ther in Chapter 8.
tion are then to be attributed to the play of chance.
an unbiased estimate of the standard error of the

4.8.1.4
Calibration
treatment effects, makes experimental units independent of one another and justifies the use of sig-
Calibration is an operation that compares the out-
nificance tests (Cox, 1958; Fisher, 1935; Lehmann,
put of a measurement device to standards of
1975). In addition, randomization is also of use as
known value, leading to correction of the values
a device for blinding the experiment.
indicated by the measurement device. Calibration

neutralizes instrument bias, i.e. the bias in the investigators measurement system.
Example 4.8. In neurological research, animals are

randomly allocated to treatments. At the end of
the experimental procedures, the animals are sacrificed, slides are made from certain target areas of
4.8.1.5
Randomization
the brain and these slides are investigated micro-
Together with blinding, randomization is an im-
scopically. At each of these stages errors can arise
portant tool for bias elimination in animal stud-
leading to biased results if the original randomiza-
ies. In an overview of systematic reviews on an-
tion order is not maintained.
imal studies Hirst et al. (2014) found that failure
Errors may arise at various stages in the exper-
to randomize is likely to result in overestimation
iment. Therefore, to eliminate all possible bias, it
1 By
the term stochastic is meant that it involves some elements of chance, such as picking numbers out of a hat, or
preferably, using a computer program to assign experimental units to treatment groups.
21
is essential that the randomization procedure cov-
more commonly used in the laboratory, e.g. the al-
ers all important sources of variation connected
ternating sequence AB, BA, AB, BA,. . . . Here too, it
with the experimental units. In addition, as far
cannot be excluded that a certain pattern in the un-
as practical, experimental units receiving the same
controlled variability coincides with this arrange-
treatment should be dealt with separately and in-
ment. For instance, if 8 experimental units are
dependently at all stages at which important er-
tested in one day, the first unit on a given day will
rors may arise. If this is not the case, additional
always receive treatment A. Furthermore, when a
randomization procedures should be introduced
systematic arrangement has been applied, the sta-
(Cox, 1958). To summarize, randomization should
tistical analysis is based on the false assumption of
apply to each stage of the experiment (Fry, 2014):
randomness and can be totally misleading.
allocation of independent experimental units

to treatment groups
order of exposure to test alteration within an
environment
order of measurement
Researchers are sometimes tempted to improve

on the random allocation of animals by re-arranging
individuals so that the mean weights are almost
identical. However, by reducing the variability
between the treatment groups as is done in Figure 4.6, the variability within the groups is altered
and can now differ between groups. This reduces
Therefore, in animal experiments, when the cage
the precision of the experiment and invalidates the
is the experimental unit, the arrangement of cages
subsequent statistical analysis. Later, we will see
within the rack or room, the administration of sub-
that the randomized block design instead of sys-
stances, taking of samples, etc. should also be ran-
tematic arrangement is the correct way of handling
domized, even though this adds an extra burden to
these last two cases.
the laboratory staff. Of course, this can be accomplished by maintaining the original randomization
sequence throughout the experiment.
Formal randomization requires the use of a randomization device. This can be the tossing of a coin,
Figure 4.6. Trying to improve the random allocation by reducing

the intergroup variability increases the intragroup variability
use of randomization tables (Cox, 1958), or use of

computer software (Kilkenny et al., 2009). Meth-
Formal randomization must be distinguished from
ods of randomization using MS Excel and the R sys-
haphazard allocation to treatment groups (Kilkenny
tem (R Core Team, 2013) are contained in Appendix et al., 2009). For example, an investigator wishes to
compare the effect of two treatments (A, B) on the
B.
body weight of rats. All twelve animals are delivSome investigators are convinced that not ran-
ered in a single cage to the laboratory. The labo-
domization, but a systematic arrangement is the pre-
ratory technician takes six animals out of the cage
ferred way to eliminate the influence of uncon-
and assigns them to treatment A, while the remain-
trolled variables. For example when two treat-
ing animals will receive treatment B. Although,
ments A and B have to be compared, one possibil-
many scientists would consider this as a random
ity is to set up pairs of experimental units and al-
assignment, it is not. Indeed, one could imagine
ways assign treatment A to the first member of the the following scenario: heavy animals react slower
pair and B to the remaining unit. However, if there and are easier to catch than the smaller animals.
each pair consistently yields a higher or lower re-
Consequently, the first six animals will on average

weigh more than the remaining six.
sult than the second member, the estimated treat-
Example 4.9. An important issue in the design of
ment effect will be biased. Other arrangements are
an experiment is the moment of randomization. In
is a systematic effect such that the first member of
22
an experiment brain cells were taken from animals pecially the case in studies that attempt to make
and placed in Petri dishes, such that one Petri dish a broad inference towards the target population
corresponded to one particular animal. The Petri (population model), like gene expression experdishes were then randomly divided into two groups iments that try to relate a specific pathology to
and placed in an incubator. After 72 hrs incuba- the differential expression of certain genes probes
tion one group of Petri dishes was treated with the
(Nadon and Shoemaker, 2002). For such an exper-
experimental drug, while the other group received iment, the bias in the results is minimized only if it
solvent.
is based on a random sample from the target pop-
Although the investigators made a serious effort to
ulation.
introduce randomization in their experiment, they

overlooked the fact the placement of the Petri dishes 4.8.1.7
in the incubator introduced a systematic error. Instead of randomly dividing the Petri dishes in two
groups at the start of the experiment, they should
have made random treatment allocation after the
incubation period.
Standardization
Standardization of the experimental conditions is

an effective way to minimize the bias. In addition it also can also be used to reduce the intrinsic variability in the results. Examples of standardization of the experimental conditions are the use
As pointed out before, it is important that the
of genetically or phenotypically uniform animals
randomization covers all substantial sources of varia-
or plants, environmental and nutritional control,
tion connected with the experimental units. As a
acclimatization, and standardization of the mea-
rule, randomization should be performed imme-
surement system. As discussed before, too much
diately before treatment application. Furthermore,
standardization of the experimental conditions can
after the randomization process has been carried
jeopardize the external validity of the results.
out the randomized sequence of the experimental

units must be maintained, otherwise a new ran-
4.8.2
Strategies for controlling variability

- good experimental design
4.8.2.1
Replication
domization procedure is required.

4.8.1.6
Random sampling
Ronald Fisher1 noted in his pioneering book The
Using a random sample in our study increases its
Design of Experiments that replication at the level

external validity and allows us to make a broad of the experimental unit serves two purposes. The
inference, i.e. a population model of inference first is to increase the precision of estimation and
(Lehmann, 1975). However, in practice it is of-
the second is to supply an estimate of error by
ten difficult and/or impractical to conduct studies
which the significance of the comparisons is to be
with true random sampling. Clinical trials are usu-
judged.
ally conducted using eligible patients from a small

number of study sites. Animal experiments are
The precision of an experiment depends on the
based on the available animals. This certainly lim-
standard deviation2 () of the experimental mate-
its the external validity of these studies and is one
rial and inversely on the number of experimental
of the reasons that the results are not always repli-
units (n). In a comparative experiment with two
cable.
treatment groups (X1 ) and (X2 ) and equal number of experimental units per treatment group, this
In some cases, maximizing the external validity

of the study is of great importance. This is es-
precision is quantified by the standard error of the

difference between the two averages (X1 X2 ) as:
1 Sir Ronald Aylmer Fisher (Londen,1890 - Adelaide 1962) is considered a genius who almost single-handedly created the
foundations of modern statistical science and experimental design.
2 The standard deviation refers to the variation of the individual experimental units, whereas the standard error refers
to the random variation of an estimate (mostly the mean) from a whole experiment. The standard deviation is a basic
property of the underlying distribution and, unlike the standard error, is not altered by replication.
23
Figure 4.7. The effect of blocking illustrated by a study of the effect of diet on running speed of dogs. Not taking age of the dog
into account (left panel) masks most of the effect of the diet. In the right panel dogs are grouped (blocked) according to age and
comparisons are made within each age group. The latter design is much more efficient.
X1 X2 =
2/n, where is the common stan- one could choose units of a larger size, such that
dard deviation and n is the number of experimen-
the estimates are more precise. In still other ex-
tal units in each treatment group.
periments, there are multiple levels of sampling.

The process of taking samples below the primary
The standard deviation is composed of the in-
level of the experimental unit is known as subsam-
trinsic variability of the experimental material and
pling (Cox, 1958; Selwyn, 1996) or pseudoreplica-
the precision of the experimental work. Reduc-
tion (Fry, 2014; Lazic, 2010; LeBlanc, 2004). The ex-
tion of the standard deviation is only possible to
periment reported by Temme et al. (2001) where
a limited extend by refining experimental proce-
the diameter of many liver cells was measured in
dures. However, one can by increasing the num-
3 animals/experimental condition, is an example
ber of experimental units effectively enhance the
of subsampling with animals at the primary level
experiments precision. Unfortunately, due to the

inverse square root dependency of the standard er-
and cells at the subsample level. Multiple obser-
ror on the sample size, this is not an efficient way to
pseudoreplications or subsamples. In biological
control the precision. Indeed, the standard error is
and chemical analysis, it is common practice to du-
halved by a fourfold increase in the number of ex-
plicate or triplicate independent determinations on
perimental units, but a hundredfold increase in the
samples from the same experimental unit. In this
number of units is required to divide the standard
case samples and determinations within samples
error by ten. In other words, replication at the level of
constitute two distinct levels of subsampling.
vations or measurements made over time are also
the experimental unit is an effective but expensive strategy to control variability. As we will see later, choosing an appropriate experimental design that takes
into account the different sources of variability that
can be identified, is a more efficient way to increase
the precision.
4.8.2.2
Subsampling
As mentioned above, reduction of the standard
When subsampling is present, the standard deviation used in the comparison of the treatment means is composed of the variability between the experimental units (between-unit variability) and the variability within the experimental
units(within-unit variability). It can be shown that
in the presence of subsampling the overall standard deviation of the experiment is equal to:
deviation is only possible to a very limited extend. This can be accomplished by standardization of the experimental conditions, but also this
r
=
n2 +
2
m
m
method is limited and it jeopardizes the external
where n and m are the number of experimental
validity of the experiment. However, in some ex-
units and subsamples and n and m the between
periments it is possible to manipulate the phys-
sample and within sample standard deviation.
ical size of the experimental units. In this case
In Section 4.8.2.1, we defined the precision of
24
the comparative 2-treatments experiment.
This
r
X1 X2 =
fectively do by blocking is dividing the variation

between the individuals into variation between
now becomes:
2
2 2 m
(n +
)
n
m
blocks and variation within blocks. If the blocking

factor has an important effect on the response, then
the between-block variation is much greater than
Thus, by increasing the number of experimental
the within block variation. We will take this into

units n we reduce the total variability, while sub- account in the analysis of the data (analysis of varisample replication m only affects the within-unit ance or ANOVA with blocks as additional factor).
variability. A large number of subsamples makes Comparisons of treatments are then carried out
only sense when the variability of the measure-
within blocks, where the variation is much smaller.

ment at the sublevel m is substantial as compared Blocking is an effective and efficient way to ento the between-unit variability n . How to deter- hance the precision of the experiment. In addition,
mine the required number of subsamples will be
blocking allows to reduce the bias due to an im-
discussed in Section 6.5. As a conclusion, we can
balance in baseline characteristics that are known
say that subsample replication is not identical and not
to affect the outcome. However, blocking does not
as effective as replication on the level of the true experi-
eliminate the need for randomization. Within each
mental unit.
block treatments are randomly assigned to the experimental units, thus removing the effect of the
4.8.2.3
remaining unknown sources of bias.
Blocking
Example 4.10. Consider a (hypothetical) experiment in which the effect of two diets on running
4.8.2.4
Covariates
speed of dogs is studied. We can carry out the experiment by taking 6 dogs with varying age and
randomly allocating 3 dogs to diet A and the 3 remaining to diet B. However, as shown in the left
Figure 4.8. Additive model with a linear covariate adjustment
panel of Figure 4.7, the effect of diet will be greatly

masked by the variability between dogs. A more
Blocking on a baseline characteristic such as
intelligent way to set up the experiment is to group
body weight is one possible strategy to elimi-
the dogs by age and make all comparisons within
nate the variability induced by the heterogeneity
the same age group, thus removing the effect of dif-
in weight of the animals or patients. Instead of
ferent ages. This is illustrated in the right panel
grouping in blocks, or in addition to, one can also
of Figure 4.7. With the variability due to age re-
make use of the actual value of the measurement.
moved, the effect of the diets within the age groups
Such a concomitant measurement is referred to as
is much more clear.
a covariate. It is an uncontrollable but measurable attribute of the experimental units (or their
If one or more factors other than the treatment
environment) that is unaffected by the treatments but
conditions can be identified as potentially influenc-
may have an influence on the measured response.
ing the outcome of the experiment, it may be ap-
Examples of covariates are body weight, age, am-
propriate to group the experimental units on these
bient temperature, measurement of the response
factors. Such groupings are referred to as blocks variable before treatment, etc. The covariate filor strata. Units within a block are then randomly
ters out the effect of a particular source of variabil-
assigned to the treatments. Examples of blocking
ity. Rather than blocking it represents a quantifi-
factors are plates (in microtiter plate experiments),
able attribute of the system studied. The statisti-
greenhouses, animals, litters, date of experiment,
cal model underlying the design of an experiment
or based on categorizations of continuous base-
with covariate adjustment is conceptualized in Fig-
line characteristics such as body weight, baseline
ure 4.8. The model implies that there is a linear re-
measurement of the response, etc. What we ef-
lationship between the covariate and the response
4.9. SIMPLICITY OF DESIGN
25
and that this relationship is the same in all treat-
tistical requirement, but it is also the most impor-
ment groups. In other words, there is a series of
tant one. Unfortunately, it is also the requirement
parallel curves, one per treatment group, relating
that is most often neglected. Fisher (1935) already
the response to the covariate. This is exemplified
lamented that:
in Figure 4.9 showing the results of an experiment

with two treatment groups in which the baseline
It is possible, and indeed it is all too fre-
measurement of the response variable served as
quent, for an experiment to be so conducted
covariate. There is a linear relationship between
that no valid estimate of error is available.
the covariate and the response and this is almost

the same in both treatment groups as is shown by
Without the ability to estimate error, there is no ba-
the fact that the two lines are almost parrallel to
sis for statistical inference. Therefore, in a well-
one another.
conceived experiment, we should always be able

to calculate the uncertainty in the estimates of the
treatment comparisons. This usually means estimating the standard error of the difference between the treatment means. To make this calculation in a rigorous manner, the set of experimental units must respond independently to a specific
treatment and may only differ in a random way
from the set of experimental units assigned to the
other treatments. This requirement again stresses
the importance of independence of experimental
units and randomness in treatment allocation.
Figure 4.9. Results of an experiment with baseline as covariate. There is a linear relationship between the covariate and
the response and this relationship is the same in both treatment
groups.
4.9
Table 4.1. Multiplication factor to correct for the bias in estimates of the standard deviation based on small samples, after
Bolch (1968).
Simplicity of design
In addition to external validity, bias and precision,

Cox (1958) also stated that the design of our exper-
Factor
2
3
4
5
6
1.253
1.128
1.085
1.064
1.051
iment should be as simple as possible. When the

When the number of experimental units is
design of the experiment is too complex, it may

be difficult to ensure adherence to a complicated
small, the sample estimate of the standard devia-
schedule of alterations, especially if these are to be
tion is biased and underestimates the true stan-
carried out by relatively unskilled people. An ef-
dard deviation. A multiplication factor to correct
ficient and simple experimental design has the ad-
for this bias for normal distributions is given in Ta-
ditional advantage that the statistical analysis will
ble 4.1. For a sample size of 3, the estimate should
also be simple without making unreasonable as-
be increased with 13% to obtain the actual stan-
sumptions.
dard deviation.
4.10
The calculation of uncertainty
Alternatively, one can also make use of the results of previous experiments to guesstimate the
new experiments standard deviation. However,
This is the last of Coxs precepts for a good experi- we then make the strong assumption that random
ment (see Section 4.7, page 18). It is the only sta- variation is the same in the new experiment.
26
5. Common Designs in Biological

Experimentation
And so it was ... borne in upon me that very often, when the most elaborate statistical refinements
possible could increase the precision by only a few percent, yet a different design involving little or no
additional experimental labour might increase the precision two-fold, or five-fold or even more
R. A. Fisher (1962)
There is a multitude of designs that can be con-
ciple of experimental design is to provide a syn-
sidered when planning an experiment and some
thetic approach to minimize bias and control variabil-
have been employed more commonly than oth-
ity. Furthermore, as shown in Figure 5.1, we can
ers in the area of biological research. Unfortu-
consider each of the specialized experimental de-
nately, the literature about experimental designs
signs as integrating three different aspects of the
is littered with technical jargon, which makes its
design (Hinkelmann and Kempthorne, 2008):
understanding quite a challenge. To name a few,
the treatment design,
there are: completely randomized designs, ran-
the error-control design,
domized complete block designs, factorial designs,

split plot designs, Latin square designs, Greco-
the sampling & observation design.
Latin squares, Youden square designs, lattice designs, Placket-Burman designs, simplex designs,
The treatment design is concerned about which
Box-Behrken designs, etc.
treatments are to be included in the experiment

and is closely linked to the goals and aims of
It helps to find our way through this jungle of
study. Should a negative or positive control be
designs by keeping in mind that the major prin-
incorporated in the experiment, or should both
Figure 5.1. The three aspects of the design determine its complexity and the required resources
27
28
CHAPTER 5. COMMON DESIGNS IN BIOLOGICAL EXPERIMENTATION
be present? How many doses or concentrations
5.1
Error-control designs
should be tested and at which level? Is the interaction of two treatment factors of interest or not?
The error-control design implements the strategies
that we learned in Section 4.8.2 to filter out different sources of variability. The sampling & observation aspect of of our experiment is about how experimental units are sampled from the population,
how and how many subsamples should be drawn,
etc.
5.1.1
The completely randomized design
This is the most common and simplest possible

error-control design for comparative experiments.
Each experimental unit is randomly assigned to exactly one treatment condition. This is often the
default design used by investigators who do not
really think about design problems. The advantage of the completely randomized design is that it
is simple to implement, as experimental units are
simply randomized to the various treatments. The
obvious disadvantage is the lack of precision in the
comparisons among the treatments, which is based
The complexity and the required resources of a
on the variation between the experimental units.
study are determined by these three aspects of experimental design. The required resources, namely
In the following example of a completely ran-
the number of experimental units, of a study are
domized design the investigators used randomiza-
governed by the number of treatments, the num-
tion, blinding, and individual housing of animals
ber of blocks and the standard error. The more
to guarantee the absence of systematic error and
treatments, or the more blocks, the more experi-
independence of experimental units.
mental units are needed. The complexity of the

experiment is determined by the underlying sta-
Example 5.1. To investigate the interaction of
tistical model of Figure 4.1. In particular the error-
chronic treatment with drugs on the proliferation
control design defines the studys complexity. The
of gastric epithelial cells in rats, an experiment was
randomization process is a major part of this error-
set up in which two experimental drugs were com-
control design. A justified and rigorous estimation
pared with their respective vehicles. A total of 40
of the standard error is only possible in a randomized experiment. In addition, randomization has
rats were randomly divided into four groups of each
the advantage that it distributes the effects of un-
cedure described in Appendix B. To guarantee in-
controlled variability randomly over the treatment
dependence of the experimental units, the animals
groups. Replication of experimental units should
were kept in separate cages. Cages were distributed

over the racks according to their sequence number.
be sufficient, such that an adequate number of de-
ten animals, using the MS Excel randomization pro-
grees of freedom are available for estimating the
Blinding was accomplished by letting the sequence
experiments precision (standard error). This pa-
number of each animal correspond to a given treat-
rameter is related to the sampling & observation
ment. One laboratory worker was familiar with the
aspect of the design.
codes and prepared the daily drug solutions. Treatment codes were concealed from the rest of the laboratory staff that was responsible for the daily treatment administration and final histological evaluation.
These three aspects of experimental design pro-
Example 5.2. Another example of a completely
vide a framework for classifying and comparing
randomized design is about eliminating the bias
the different types of experimental design that are
present in experiments using 96-well mictrotiter
used in the life sciences. As we will see, each
plates. Burrows et al. (1984) already described the
of these designs has its advantages and disadvan-
presence of plate location effects in ELISA assays
tages.
carried out in 96-well microtiter plates.
Similar
5.1. ERROR-CONTROL DESIGNS
29
findings were reported by Levasseur et al. (1995) a statistical model that allows to correct for the
and Faessel et al. (1999) who described the presence row-column bias (Schlain et al., 2001; Straetemans
of parabolic patterns (Figure 5.2) in cell growth ex- et al., 2005).
periments carried out in 96-well microtiter plates.
They were not able to show conclusively the underlying causes of this systematic error, which, as
shown in Figure 5.2, could be of considerable magnitude. Therefore, they concluded that only by random allocation of the treatments to the wells these
systematic errors could be avoided.
Figure 5.3. Scheme for the addition and dilution of drugcontaining medium to tubes in a 96-tube rack, randomization
of the tubes, and then addition of drug-containing medium to
cell-containing wells of a 95-wel microtiter plate. (after Faessel
et al. (1999))
5.1.2
Figure 5.2. Presence of bias in 96-well microtiter plates (after
Levasseur et al. (1995))
The randomized complete block design
The concept of blocking as a tool to increase effi-
ciency by enhancing the signal-to-noise ratio has

To accomplish this, they developed an ingenious
already been introduced in Section 4.8.2.3 (page
method for randomizing treatments in 96-well mi24). In a randomized complete block design a sincrotiter plates. Figure 5.3 sketches this procedure.
gle isolated extraneous source of variability (block)
Drugs are serially diluted into tubes in a 96-tube
closely related to the response is eliminated from
rack. Next, a randomization map is generated usthe comparisons between treatment groups. The
ing an Excel macro. The randomization map is then
design is complete, since all treatments are applied
taped to the top of an empty tube rack. The original
within each block. Consequently, treatments can be
tubes are then transferred to the new rack by pushcompared with one another within the blocks. The
ing the numbered tubes through the correspondrandomization procedure now randomizes treating numbered rack position. Using a multipipette
ments separately within each block. The randomsystem drug-containing medium is then transferred
ized complete block design is a very useful and refrom the tubes to the wells of a 96-well microtiter
liable error-control design, since all treatment complate. At the end of the assay, the random data file
parisons can be made within a block. When a
generated by the plate reader is imported into the
study is designed such that the number of experExcel spreadsheet and automatically resorted.
imental units within each block and treatment is
The above example is also a paradigm of how randomization can be used to eliminate systematic errors by transforming them into random noise. This
equal, the design is called a balanced design. A

few examples will illustrate its use in the laboratory.
is one way of dealing with bias in microtiterplates. Example 5.3. In the completely randomized design
Alternative methods consist of choosing an appro-
of Example 5.1 (page 28), the rats were individu-
priate experimental design such as a Latin square ally housed in a rack consisting of 5 shelves of each
design (see Section 5.1.4, page 33), or to construct 8 cages. On different shelves, rats are likely to be
30
Figure 5.4. Outline of a paired experiment on isolated cardiomyocytes. Cardiomyocytes of a single animal were isolated and seeded
in plastic Petri dishes. From the resulting five pairs of Petri dishes, one member was randomly assigned to drug treatment, while
the remaining member received the vehicle.
exposed to multiple varieties of light intensity, tem- and Kempthorne, 2008). Sometimes, this is accomperature, humidity, sounds, views, etc. The inves-
plished by combining several extraneous sources
tigators were convinced that shelf height affected of variation into one aggregate blocking factor as
the results. Therefore, they decided to switch to
shown by the following example.
a randomized complete block design in which the

Example 5.4. In addition to the shelf height, the inblocks corresponded to the 5 shelves of the rack.
vestigators also suspected that the body weight of
Within each block separately the animals were ranthe animals might affect the results. Therefore, the
domly allocated to the treatments, such that in
animals were numbered in order of increasing body
each block each treatment condition occured exweight. The first eight animals were placed on the
actly twice. This example illustrates also that, altop shelf and randomized to the four treatment conthough all treatments are present in each block in a
ditions, then the next eight animals were placed on
randomized complete block design, more than one
the second shelf and randomized, etc. The resulting
experimental unit per block can be allocated to a
design was now simultaneously controlled for shelf
treatment condition.
height and body weight. With regard to these two
There are two main reasons for choosing a randomized complete block design above a com-
characteristics, animals within a block were as much

alike as possible.
pletely randomized design. Suppose there is an

extraneous factor that is strongly related to the out-
5.1.2.1
The paired design
come of the experiment. It would be most unfortunate if our randomization procedure yielded a
design in which there was a great imbalance on
Table 5.1. Results of an experiment using a paired design for testing the effect of a drug on the number of viable cardiomyocytes
after calcium overload
this factor. If this were the case, the comparisons
Rat No.
Vehicle
Drug
Drug - Vehicle
between treatment groups would be confounded
1
2
3
4
5
44
64
60
50
76
46
75
67
64
77
2
11
7
14
1
with differences in this nuisance factor and be biased. The second main reason for a randomized
complete block design is its possibility to considerably reduce the error variation in our experiment, thereby making the comparisons more precise. The main objection to a randomized complete
block design is that it makes the strong assumption
When only two treatments are compared, the

randomized complete block design can be simplified to a paired design.
that there is no interaction between the treatment vari- Example 5.5. Isolated cardiomyocytes provide an
able and the blocking characteristics, i.e. that the effect easy tool to assess the effect of drugs on calciumof the treatments is the same among all blocks.
overload (Ver Donck et al., 1986). Figure 5.4 illustrates the experimental setting. Cardiomyocytes of
The basic idea behind blocking is to partition the
a single animal were isolated and seeded in plastic
total set of experimental units into subsets (blocks)
Petri dishes. The Petri dishes were treated with
that are as homogeneous as possible. (Hinkelmann
the experimental drug or with its solvent. After
31
70
70
60
50
% Viable Myocytes
% Viable Myocytes
60
50
Control
Treated
Control
Treatment
Treated
Treatment
Figure 5.5. Gain in efficiency induced by blocking illustrated in a paired design. In the left panel, the myocyte experiment is
considered as a completely randomized design in which the two samples largely overlap one another. In the right panel the lines
connect the data of the same animal and show a marked effect of the treatment.
a stabilization period the cells were exposed to a that connect the data from the same animal. It is
stimulating substance (i.e. veratridine) and the per- clear that for each pair the drug-treated Petri dish
centage viable, i.e. rod-shaped, cardiomyocytes in yielded consistently a higher result than its vehicle
a dish was counted. Although comparison of the control counterpart. Since the different pairs (anitreatment with the solvent control within a single mals) are independent from one another, the mean
animal provides the best precision, it lacks external difference and its standard error can be calculated.
validity. Therefore, a paired experiment with my- The mean difference is 7.0 with a standard error of
ocytes from different animals and with animal as 2.51.
blocking factor was carried out. From each animal
two Petri dishes containing exactly 100 cardiomy- 5.1.2.2 Efficiency of the randomized complete
ocytes were prepared. From the resulting five pairs
block design
of Petri dishes, one member was randomly assigned
Example 5.6. Suppose that in Example 5.5 the exto drug treatment, while the remaining member reperimenter would not have used blocking, i.e. conceived the vehicle. After stabilization and exposure
sider it as if he had used myocytes originating from
to the stimulus, the number of viable cardiomy10 completely different animals. The 10 Petri dishes
ocytes in each Petri dish was counted. The resultwould then be randomly distributed over the two
ing data are contained in Table 5.1 and displayed
treatment groups and we would have been conin Figure 5.5.
fronted with a completely randomized design. Assume also that the results of this hypothetical exThere are 10 experimental units, since Petri periment were identical to those obtained in the acdishes can be independently assigned to vehicle or tual paired experiment. As is illustrated in the left
drug. However, the statistical analysis should take panel of Figure 5.5, the two groups largely overthe particular structure of the experiment into ac- lap one another. Since all experimental units are
count. More specifically, the pairing has imposed now independent of one another, the effect of the
restrictions on the randomization such that data
drug is evaluated by calculating the difference be-
obtained from one animal cannot be freely inter- tween the two mean values and comparing it with
changed with that from another animal. This is il- its standard error1 Obviously, the mean difference
lustrated in the right panel of Figure 5.5 by the lines is the same as in the paired experiment. However,
1
p As already mentioned in Section 4.8.2.1, page 22, the standard error on the difference between two means is equal to
2/n
32
the standard error on the mean difference has risen Balanced incomplete block designs (BIB) exist for
considerably from a value of 2.51 to 7.83, i.e. the only certain combinations of the number of treatuse of blocking induced a substantial increase in the ments and number and size of blocks. Software
precision of the experiment1 .
to construct incomplete block designs and to test

whether a candidate design satisfies the criteria for
Examples 5.5 and 5.6 clearly demonstrate that

carrying out a paired experiment has the possibil-
being balanced is provided by the crossdes package

in R (Sailer, 2013).
ity to enhance the precision of the experiment considerably, while the conclusions have the same va-
Table 5.2. Balanced incomplete block design for Example 5.7 with
treatments A,B,C and D
lidity as in a completely randomized experiment.

However, the forming of blocks of experimental
Lamb
units is only successful when the criterion upon
First
Second
which the pairing is based, is related to the out-
Sibling lamb pair

1
A
B
A
C
A
D
B
C
B
D
C
D
come of the experiment. Using a characteristic that

does not have an important effect on the response
variables as blocking factor, is worse than useless
since the statistical analysis will lose a bit of power
by taking the blocking into account. This can be of
particular importance for small sized experiments.
Example 5.7. (Anderson and McLean, 1974; Bate

and Clark, 2014). An experiment was carried out
to assess the effect that vitamin A and a protein dietary supplement have on the weight gain of lambs
over a 2-month period. There were four treatments,
labeled A, B, C and D in the study corresponding
5.1.3
Incomplete block designs
In some circumstances, the block size is smaller

than the number of treatments and consequently
it is impossible to assign all treatments within each
of the blocks. When a certain comparison is of specific interest, such as comparison with control, it is
wise to include it in each of the blocks.
Balanced incomplete block designs allow all
pairwise comparisons of treatments with equal
precision using a block size that is less than the
number of treatments. To achieve this, the balanced incomplete block design has to satisfy the
following conditions (Bate and Clark, 2014; Cox,
1958):
each block contains the same number of
units;
each treatment occurs the same number of
times in the design;
every pair of treatments occurs together in
the same number of blocks.
1 Kutner
to low dose and high dose of vitamin A combined

with low dose and high dose of protein. Three replicates per treatment was considered as sufficient and
blocking was carried using pairs of sibling lambs, so
six pairs of siblings were used. With the number
of treatments restricted to two per block, the balanced incomplete block design shown in Table 5.2
was used.
A possible layout of the experiment is obtained
in R by:
>
>
>
>
>
require(crossdes)
# find a incomplete block design
inc.bl<-find.BIB(4,6,2)
# check if it is a valid BIB
isGYD(inc.bl)
[1] The design is a balanced incomplete block design w.r.t. rows.

>
>
>
>
>
>
# if OK make a suitable display

# replace numbers by treatment codes
aBIB<-LETTERS[inc.bl]
dim(aBIB)<-dim((inc.bl))
# transpose displays matrix as in text
t(aBIB)
[,1] [,2] [,3] [,4] [,5] [,6]

[1,] "B" "A" "A" "C" "B" "A"
[2,] "D" "C" "B" "D" "C" "D"
et al. (2004) provide a method to compare designs on the basis of their relative efficiency. For the design in
Example 5.5, the calculations show that this paired design is 7.7 times more efficient than the completely randomized design
in Example 5.6. In other words, about 8 times as many replications per treatment with a completely randomized design
are required to achieve the same results.
33
It is clear that the BIB-design generated in the R-
In a Latin square design the k treatments are ar-
session is equivalent to the design shown in Table
ranged in a k k square such as in Table 5.3 Each
5.2.
of the four treatments A, B, C, and D occurs exactly

When looking for a suitable balanced incom-
once in each row and exactly once in each column.
plete block design, it can happen that if we add one
The Latin square design is categorized as a two-way
or more additional treatments, a more suitable de-
block error-control design.
sign is found. Alternatively, omitting one or more

treatments can yield a more efficient design (Cox,
The main advantage of the Latin square design

is that it simultaneous balances out two sources
1958).
of error. The disadvantage is the strong assumption that there are no interactions between the
5.1.4
The Latin square design
The Latin square design is an extension of the randomized complete block design, but now blocking
is done simultaneously on two characteristics that
affect the response variable.
blocking variables or between the treatment variable and blocking variables.
In addition Latin
square designs are limited by the fact that the number of treatments, number of row, and number of
columns must all be equal. Fortunately, there are
arrangements that do not have this limitation (Cox,
A total of 16 1958; Hinkelmann and Kempthorne, 2008).
Example 5.8. (Hu et al., 2015).
large (3m height,6.7m2 area) hexagonal ecosystem

chambers were arranged in a Latin square with four
In a k k Latin square, only k experimental
repetition chambers for each treatment. The four units are assigned to each treatment group. Howclimate treatments were (i) control (CO, with ambi- ever, it may happen that more experimental units
ent conditions), (ii) increased air temperature = air are required to obtain an adequate precision. The
warming (AW), (iii) reduced irrigation = drought Latin square can then be replicated and several
(D), and (iv) air warming and drought (AWD).
squares can be used to obtain the necessary sample size. In doing this, there are two possibilities
In the above example 4 4 Latin squares were to consider. Either one stacks the squares on top
used to control for the location in the ecosystem of each other and keeps them as separate indepenchambers. Another example from laboratory prac-
dent squares, or one completely randomizes the
tice concerns the simultaneous control, without
order of the rows (or columns) of the design. In
loss of precision, of the row and column effects in
general, keeping the squares separate is not a good
microtiter plates (Burrows et al., 1984). Still another example is about experiments on neuronal
idea and leads to less precise estimation and loss

of degrees of freedom, especially in the case of a
protection, where a pair of animals was tested each
2 2 Latin square. It is only when there is a rea-
day and the investigator expected a systematic dif-
son to believe that the column (or row) effects are
ference not only between the pairs but also be-
different between the squares, that keeping them
tween the animal tested in the morning and the
separate makes sense.
one tested in the afternoon.

Table 5.3. Arrangement for a 4 4 Latin square design controlling
for column and row effects. The letters A-D indicate the four
treatments
Row
1
2
3
4
In the case of 96-well microplates consisting of

8 rows and 12 columns, scientists often omit the
edge rows and columns which are known to yield
Column
1
B
C
A
D
A
B
D
C
D
A
C
B
C
D
B
A
extreme results.
The remaining 6 rows and 10
columns are then used for experimental purposes.

However, using a Latin square design that appropriately corrects for row and column effects, the
complete 96-well microplate can be used as is illustrated by the following example.
34
Table 5.4. A balanced lattice square arrangement of 16 treatments with 5 replicate squares on a single microtiter plate with eight
lettered rows and twelve numbered columns (after (Burrows et al., 1984))
A
B
C
D
E
F
G
H
1
11
12
9
10
12
3
13
6
2
7
8
5
6
14
5
11
4
3
3
4
1
2
7
16
2
9
4
15
16
13
14
1
10
8
15
5
11
6
1
16
5
1
9
13
6
10
7
4
13
11
15
7
3
7
12
5
2
15
4
8
16
12
9
15
9
4
6
10
5
3
10
16
11
12
14
7
1
12
2
8
13
11
The principle of simultaneously controlling for
Example 5.9. Aoki et al. (2014) describe their assay

as follows:
8
9
8
3
14
14
10
2
6
Samples diluted 1:1,600 and serially- row- and column-effect by means of Latin-square
diluted standards (100 L) were placed in wells. designs are not restricted to 96-well microtiter
Samples were arranged in triads, each containing plates. but also apply to their larger variants, like
samples from a case and two-matched controls. In
the 384 and 1536 microwell plates.
order to minimize measurement error due to a spaRandom Latin squares can be produced using
the R-package magic (Hankin, 2005). A possible
croplate wells were grouped into 6 blocks of 16 (44) layout of the experiment in Example 5.8 is obtained
wells and each set of 3 samples was placed in the by:
tial gradient in binding efficiency within plate, mi-
same block using a version of Latin square design along with standards. Placement patterns were
changed across blocks such that the influence of the
>
>
>
>
require(magic)
trts<-c("CO","AW","D","AWD")
ii<-rlatin(4)
matrix(trts[ii],nrow=4,ncol=4,byrow=FALSE)
spatial gradient on the signal-standard level relationship was minimized. The authors do not explicitly state what they mean by version of Latin
square.
Presumably lattice square designs were
[1,]
[2,]
[3,]
[4,]
[,1]
"D"
"AWD"
"AW"
"CO"
[,2]
"AW"
"CO"
"D"
"AWD"
[,3]
"AWD"
"D"
"CO"
"AW"
[,4]
"CO"
"AW"
"AWD"
"D"
used. The interested reader is referred to the literature (Cox, 1958) for this more advanced topic, but
5.2
Treatment designs
the idea is clear that this systematic arrangement

allows for unbiased comparisons. By placement
5.2.1
One-way layout
patterns were changed across blocks they presumably mean randomization of the rows and columns
The examples that we discussed up to now (apart
of the individual Latin squares.
from Examples 5.7 and 5.8), all considered the
In microplate experiments, variations on Latin

square designs such as Latin rectangles (Hinkelmann and Kempthorne, 2008) and incomplete
Latin squares such as Youden squares and lattice
squares (Cox, 1958) can be of use. Burrows et al.
(1984) investigated the properties of a number of
these designs for quantitative ELISA. One of the
designs they considered (see Table 5.4) was a bal-
treatment aspect of the design as consisting of a

single factor. This factor can represent presence or
absence of a single condition or several different
related treatment conditions (e.g. Drug A, Drug B,
Drug C ). The treatment aspect of these designs is
referred to as single factor or one-way layout.
5.2.2
Factorial designs
anced lattice square design that allows to compare
In some types of experimental work, such as in Ex-
within one plate 16 treatments with one another,
ample 5.7 (page 32), it can be of interest to assess
with 5 replicates for each treatment. Other system-
the joint effect of two or more factors, e.g. high
atic arrangements specifically for estimating rela-
and low dose of Vitamin A combined with high
tive potencies in the presence of microplate loca-
and low dose of protein. This is a typical example
tion effects have also been proposed (Schlain et al.,
of a 2 2 full factorial design, the simplest and most
2001).
frequently used factorial treatment design. In this
5.2. TREATMENT DESIGNS
35
design, the factors and all combinations (hence full) No interaction When there is no interaction beof the levels of the factors are studied. The factorial
tween strains and diets, the difference between the
treatment design of this example allows estimating
two diets is the same, irrespective of the mouse
the main effects of the individual treatments, as well
strain. The lines connecting the mean values of the
as their interaction effect, i.e. the deviation from ad- two diets are parallel to one another, as is shown in
ditivity of their joint effect. We will use the 22 full Figure 5.6.
factorial design to explore the concepts of factorial
designs and of statistical interaction.
There is an overall effect of diet, animals fed

with the Western diet show larger lesions, and this
Example 5.10. (Bate and Clark, 2014) A study was effect is the same in both strains. There is also an
conducted to assess whether the serum chemokines overall effect of the strain. Lesions are larger in the
JE and KC could be used as markers of atheroscle-
C3H apoE-/- than in the C57BL apoE-/- strain and
rosis development in mice (Parkin et al., 2004). Two this difference is the same for both diets.
strains of apolipoprotein-E-deficient (apoE-/- ) mice,
C3H apoE-/- and C57BL apoE-/- were used in the
Since the differences between the diets is the
study. These mice were fed either a normal diet

or a diet containing cholesterol (the Western diet).
After 12 weeks the animals were sacrificed and their
atherosclerotic lesion area was determined.
same regardless which strain of mice they are fed

to, it is appropriate to average the results from each
diet across both strains and make a single comparison between the diets rather than making the comparison separately for each strain. In doing this,
5000
the external validity of the conclusions is broadened since they apply to both strains. In addition, the comparison between the two strains can
4000
be made, irrespective of the diet the animals are
3500
C57BL apoE-/- regardless of the diet. In both cases
3000
receiving. C3H apoE-/- have larger lesions than

the comparisons make use of all the experimental
2500
units. This makes a factorial design a highly efficient design, since all the animals are used to test
2000
Lesion area ( m2)
4500
C3H
C57BL
1500
simultaneously two hypotheses.
5000
Western
Diet
dent diet and the Western diet. In total there were

four combinations of factor levels:
1. C3H apoE-/- + normal diet
2. C3H apoE-/- + Western diet
3. C57BL apoE-/- + normal diet
4. C57BL apoE-/- + Western diet
Let us consider some possible outcomes of this
experiment.
4000
3500
3000
2000
diet factor also contained two levels, the normal ro-
1500
of two levels, C3H apoE-/- and C57BL apoE-/- . The
Lesion area ( m2)
The study design consisted of two categorical

factors, Strain and Diet. The strain factor consisted
C3H
C57BL
4500
Figure 5.6. Plot of mean lesion area for the case where there is no
interaction between strains and diets.
2500
Normal
Normal
Western
Diet
Figure 5.7. Plot of mean lesion area for the case where there is a
moderate interaction between strains and diets.
36
Moderate interaction When there is a moderate in- 2 2 factorial combined with a balanced incomteraction, the direction of the effect of the first fac-
plete block design, whereas a 2 2 factorial design
tor is the same regardless of the level of the sec-
is used in connection with a Latin square error con-
ond factor. However, the size of the effect varies
trol design in Example 5.8 (page 33).
with the level of the second factor. This is exemplified by Figure 5.7, where the lines are not parallel, though both indicate an increase in lesion size
Factorial designs are widely used in plant research, process optimization, etc. Recent appli-
for the Western diet as compared to the normal
cations in the biomedical field include cDNA mi-
diet. This increase is more pronounced in the C3H
croarray experiments.
apoE-/- strain than in the C57BL apoE-/- animals.

Hence, the C3H apoE-/- strain is more sensitive to Example 5.11. Glonek and Solomon (2004) describe
an example of a 2 2 factorial design using cDNA
changes in diet.
microarrays and, in doing this, highlight the imporStrong interaction The effect of the first factor can
tance of the interaction effect in this context:
also be entirely dependent on the level of the sec-
A 2 2 factorial experiment was con-
ond factor. This is illustrated in Figure 5.8, where
ducted to compare the two mutants at
feeding the Western diet to the C3H apoE-/- has

little effect on lesion size, whereas the effect on
C57BL apoE-/- mice is substantial. This is an exam-
times zero hours and 24 hours; it was

anticipated that measuring changes over
time would distinguish genes involved
ple of strong interaction. The Western diet always
in promoting or blocking differentiation,
results in bigger lesions, but the effect in the C57BL
or that suppress or enhance growth, as
apoE-/- strain is much more pronounced than in

the C3H apoE-/- mice. Furthermore, when fed with
normal diet the C3H apoE-/- mice show a larger lesion area than the C57BL apoE-/- strain. However,
genes potentially involved in leukaemia.

We are interested in genes differentially
expressed between the two samples i.e.
in the sample main effect, but more par-
when the animals receive Western diet the oppo-
ticularly, in those genes which are dif-
site is true.
ferentially expressed in the two samples

at time 24 hours but not at time zero
5000
hours. This is the interaction of sample

and time.
4000
In the case of cDNA experiments, the design
3500
in other application areas. More information on
3000
of a 2 2 factorial is not as straightforward as

how to set up factorial designs involving microar-
2500
ray experiments can be found in the specialized literature (Glonek and Solomon, 2004) and (Banerjee
2000
Lesion area ( m2)
4500
C3H
C57BL
1500
and Mukerjee, 2007).
Normal
Western
Diet
Figure 5.8. Plot of mean lesion area for the case where there is a
moderate interaction between strains and diets.
It happens that researchers have actually

planned a factorial experiment, but in the design
and analysis phase failed to recognize it as such.
In that case they do not take full advantage of the
factorial structure for interpreting treatment effects
The factorial treatment design can be combined
and often analyze and interpret the experiment
with the error control designs that we encountered
with an incorrect procedure (Nieuwenhuis et al.,
in Section 5.1. Example 5.7 (page 32) illustrates a
2011). We will come back to that in Section 9.2.3.
5.3. MORE COMPLEX DESIGNS
37
When the two factors consist of concentrations
5.3.1
The split-plot and strip-plot designs
or dosages of drugs, researchers tend to confuse

the statistical concept of interaction with the pharmacological concept of synergism. However, the
requirements for two drugs to be synergistic with
each other are much more stringent than just the
superadditivity associated with statistical interaction (Greco et al., 1995; Tallarida, 2001). It is easy
to demonstrate that, due to the nonlinearity of the
log-dose response relationship, superadditive ef-
These types of design incorporate subsampling

and allow to make comparisons among different
treatments at two or more sampling levels. The
split-plot design allows assessment of the effect
of two independent factors using different experimental units as is illustrated by the following examples.
fects will always be present for the combination,

since the total drug dosage has increased, thus implying that a drug could be synergistic with itself.
In connection with the 96-well plates and the presence of the plate location effects that we saw earlier, Straetemans et al. (2005) provide a statistical
method for assessing synergism or antagonism.
Higher dimensional factorial designs
Although
our discussion here is restricted to the 2 2 factorial, more factors and more levels can be used.
Figure 5.9. Outline of the split-plot experiment of 5.12. Cages

each containing two mice each were assigned at random to a
number of dietary treatments and the color-marked mice within
the cage were randomly selected to receive one of two vitamin
treatments by injection.
However, the number of experimental units that

are involved may become rather large. For instance, an experiment with three factors at three
levels involves 3 3 3 = 27 different treatment
combinations. When there are at least two experimental units in each treatment group, then 54 units
will be required. Such an experiment may soon
be too large to be feasible. In addition, three factor and higher order interactions are difficult to interpret (Clewer and Scarisbrick, 2001). Therefore,
when a large number of factors are involved, the
presence of higher order interactions is often neglected. Such high dimensional fractional factorial
Example 5.12. An example of a split-plot design is

the following hypothetical experiment on diets and
vitamins (see Figure 5.9). Cages each containing
two mice were assigned at random to a number of
dietary treatments (i.e. cage was the experimental unit for comparing diets), and the color-marked
mice within the cage were randomly selected to receive one of two vitamin treatments by injection
(i.e. mice were the experimental units for the vitamin effect).
designs are mostly used in process optimization.

(Selwyn, 1996).
Example 5.13. Another example is about the effects

of temperature and growth medium on yeast growth
rate (Ruxton and Colegrave, 2003). In this experi-
5.3
More complex designs
ment Petri dishes are placed inside constant temperature incubators (see Figure 5.10). Within each incubator growh media are randomly assigned to the
We will now consider some sepecialized exper-
individual Petri dishes. Temperature is then consid-
imental designs consisting of a somewhat more
ered as the main-plot factor and growth medium the
complex error-control design that is intertwined
subplot factor. The experiment has to be repeated
with a factorial treatment design.
using several incubators for each temperature.
38
originally developed for agricultural field trials,

the next example shows that these designs have
found a place in the modern laboratory.
Example 5.14. (Casella, 2008; Lansky, 2002) describe strip-plot designs that accomodate for serial
dilutions in 96-well microtiter plates. In this case,
Figure 5.10. Outline of the split-plot experiment of Example 5.13.
Six incubators were randomly assigned to 3 temperature levels
in duplicate. In each incubator 8 Petri dishes were placed. Four
growth media were randomly applied to the Petri dishes.
the 8 rows of the microplate are randomly assigned

to the samples (e.g. reference and three test samples in duplicate) and the columns are assigned to
The term split-plot originates from agricultural
the serial dilution level (dose). The dilution level is
research where fields are randomly assigned to dif-
fixed for a given column, such that all samples in
ferent levels of a primary factor and smaller ar-
a column have the same serial dilution level, which
eas within the fields are randomly assigned to one
makes it a strip-plot design.
level of another secondary factor. Actually, the

split-plot design can be considered as two random-
5.3.2
The repeated measures design
ized complete block designs supperimposed upon

one another (Hinkelmann and Kempthorne, 2008).
The split-plot design is a two-way crossed (factorial)
treatment design and a split-plot error design.
A special case of the split-plot design, is called
the strip-plot or split-block design. The strip-plot
design differs from the split-plot design by applying a special kind of randomization. In the stripplot or strip-block design, both factors are applied
to whole-plots which are placed orthogonal to
each other.
Figure 5.12. A typical repeated measures design. Animals are randomized to different treatment groups, the variable of interest
(e.g. blood pressure) is measured at the start of the experiment
and at different time points following treatment application.
The repeated measures design is a special case

of the split-plot and more specifically of the stripplot designs. In a repeated measures design, we
typically take multiple measurements on a subject
over time. If any treatment is applied to the subjects, they immediately become the whole plots
and Time is the subplot factor. A typical experimental set-up is displayed in Figure 5.12 where
two groups of animals are randomized over 2 (or
more) treatment groups and the variable of interest is measured just before and at several time
points following treatment application. The major disadvantage of repeated measures designs is
Figure 5.11. Schematic representation of a strip-plot experiment.

Factor A is applied to whole plots in the horizontal direction at
levels a1, a2, a3, a4, and a5. Factor B is also applied to whole
plots at levels b1, b2 , and b3.
the presence of carry-over effects by which the results obtained for a treatment are influenced by
the previous treatment. In addition, any confound-
Schematically, this is represented as in Fig-
ing of the treatment effect with time, as was the
ure 5.11. There are two factors A and B that are
case in the use of self-controls in Section 4.8.1.1,
applied to the two types of whole plots. As in Fig-
must be avoided. Therefore, like in the example of
ure 5.11, both plots are placed orthogonal upon
Figure 5.12, a parallel control group which protects
one another.
against time related bias must always be included.
Although strip plot designs were
5.3. MORE COMPLEX DESIGNS
39
Table 5.5. Bioassay experiment of Example 5.14. Row and column indicators refer to the conventional coding of 96-well plates. The
four samples (A, B, C, D) are in duplo applied to the rows, and the serial dilution level (dose) is applied to an entire column. A1/1
in a cell means sample A, first replicate at dilution level 1, B2/3 second replicate of sample B at dilution 3, etc.
A
B
C
D
E
F
G
H
10
11
12
B1/2
D2/2
B2/2
C1/2
A1/2
A2/2
C2/2
D1/2
B1/8
D2/8
B2/8
C1/8
A1/8
A2/8
C2/8
D1/8
B1/10
D2/10
B2/10
C1/10
A1/10
A2/10
C2/10
D1/10
B1/1
D2/1
B2/1
C1/1
A1/1
A2/1
C2/1
D1/1
B1/11
D2/11
B2/11
C1/11
A1/11
A2/11
C2/11
D1/11
B1/3
D2/3
B2/3
C1/3
A1/3
A2/3
C2/3
D1/3
B1/12
D2/12
B2/12
C1/12
A1/12
A2/12
C2/12
D1/12
B1/7
D2/7
B2/7
C1/7
A1/7
A2/7
C2/7
D1/7
B1/4
D2/4
B2/4
C1/4
A1/4
A2/4
C2/4
D1/4
B1/5
D2/5
B2/5
C1/5
A1/5
A2/5
C2/5
D1/5
B1/6
D2/6
B2/6
C1/6
A1/6
A2/6
C2/6
D1/6
B1/9
D2/9
B2/9
C1/9
A1/9
A2/9
C2/9
D1/9
40
6. The Required Number of Replicates Sample Size

Data, data, data. I cannot make bricks without clay.
Sherlock Holmes
The Adventure of the Copper Beeches, A.C. Doyle.
6.1
Determining sample size is a
tion, interval estimation, and hypothesis testing, of
risk - cost assessment
which hypothesis testing is definitely the most important in biomedical studies.
Replication is the basis of all experimental design

and a natural question that arises in each study is
6.3
how many replicates are required. The more repli-
The hypothesis testing context - the population model
cates, the more confidence we have in our conclusions. Therefore, we would prefer to carry out our
In the hypothesis testing context2 one defines a
experiment on a sample that is as large as possi-
null hypothesis and, for the purpose of sample size
ble. However, increasing the number of replicates
estimation, an alternative hypothesis of interest.
incurs a rise in cost. Thus, the answer to how large
The null hypothesis will often be that the response
an experiment should be, is that it should be just
variable does not really depend on the treatment
big enough to give confidence that any biologically
condition. For example, one may state as a null hy-
meaningful effect that exists can be detected.
pothesis that the population means of a particular

measurement are equal under two or more differ-
6.2
The context of biomedical ex-
ent treatment conditions and that any differences

found can be attributed to chance.
periments
At the end of the study when the data are anThe estimation of the appropriate size of the exper-
alyzed (see Section 7.3), we will either accept or
iment is straightforward and depends on the statis-
reject the null hypothesis in favor of the alterna-
tical context, the assumptions made, and the study
tive hypothesis. As is indicated in Table 6.1, there
specifications. Context and specifications on their
are four possible outcomes at the end of the exper-
turn depend on the study objectives and the design
iment: When the null hypothesis is true and we

failed to reject it, we have made the correct deci-
of the experiment.
sion. This is also the case when the null hypothesis

In practice, the most frequently encountered
is false and we did reject it. However, there are two
contexts in statistical inference are point estima-
decisions that are erroneous. If the null hypothesis
2 The
hypothesis testing context is in statistics also known as the Neyman-Pearson system
41
42
CHAPTER 6. THE REQUIRED NUMBER OF REPLICATES - SAMPLE SIZE
Table 6.1. The decision process in hypothesis testing
State of Nature
Decision
made
Null hypothesis true

true
Alternative hypothesis
true
Do not reject null hypothesis
Correct decision
(1 )
False positive
False negative
Correct decision
(1 )
Reject null hypothesis
is true and we incorrectly rejected it, then we made
When the significance level decreases or the
a false positive decision. Conversely, if the alternative
power increases, the required sample size will be-
hypothesis is true (i.e. the null hypothesis is false)
come larger.
and we failed to reject the null hypothesis we have
larger or the difference to be detected smaller, the
made a false negative decision. In statistics, a false
required sample size will also become larger. Con-
positive decision is also referred to as a type I error
versely, when the difference to be detected is large
and a false negative decision as a type II error.
or variability low, the required sample size will be
Similarly, when the variability is
small. It is convenient, for quantitative data, to express the difference in means as effect size by dividThe basis of sample size calculation is formed by
ing it by the standard deviation1 . The effect size
specifying an allowable rate of false positives and
then takes both the difference and inherent vari-
an allowable rate of false negatives for a particu-
ability into account. Cohen (1988) argues that ef-
lar alternative hypothesis and then to estimate a
fect sizes of 0.2, 0.5 and 0.8 can be regarded respec-
sample size just large enough, so that these low er-
tively as small, medium and large. However, in
ror rates can be achieved. The allowable rate for
basic biomedical research the size of the effect that
false positives is called the level of significance or alone is interested in, can be substantially larger.
pha level and is usually set at values of 0.01, 0.05, or
0.10. The false negative rate depends on the postulated alternative hypothesis and is usually described by its complement, i.e. the probability of
rejecting the null hypothesis when the alternative
6.4
6.4.1
Sample size calculations

Power analysis computations
hypothesis holds. This is called the power of the Now that we are familiar with the concepts of hystatistical hypothesis test. Power levels are usually pothesis testing and the determinants of sample
expressed as percentages and values of 80% or 90% size, we can proceed with the actual calculations.
are standard in sample size calculations.
There is a significant amount of free software available to make elementary sample size calculations.
In particular, there is the R-package pwr (Cham-
Significance level and power are already two of

the four major determinants of the sample size required for hypothesis testing. The remaining two
are the inherent variability in the study parameter of interest and the size of the difference to be
detected in the postulated alternative hypothesis.
Other key factors that determine the sample size
are the number of treatments and the number of
blocks used in the experimental design.
pely, 2009).
Example 6.1. Consider the completely randomized
experiment about cardiomyocytes discussed in Example 5.6 (page 31). The standard deviations of the
two groups are each about 12.5. A large effect of
0.8 in this case, corresponds to a difference between
both groups of 10 myocytes. Lets assume that we
wish to plan a new experiment to detect such a
difference with a power of 80% and we want to reject the null hypothesis of no difference at a level
of significance of 0.05, whatever the direction of the
difference between the two samples (i.e. a two-sided
1 When comparing mean values from two independent groups, the standard deviation for calculating the effect size, can
be from either group when
q variances of the two groups are homogeneous, or alternatively a pooled standard deviation can
be calculated as sp = (s21 + s22 )/2.

1 See Section 7.3 for a discussion of one-sided and two-sided tests.
6.4. SAMPLE SIZE CALCULATIONS
test1 ). The calculations are carried out in R in a single line of code and show that 26 experimental units
are required in each of the two treatment groups:
> require(pwr)
> pwr.t.test(d=0.8,power=0.8,sig.level=0.05,
+
type="two.sample",
+
alternative="two.sided")
43
6.4.2
Meads resource requirement equation
There are occasions when it is difficult to use a

power analysis, because there is no information on
the inherent variability (i.e. standard deviation)
and/or because the effect size of interest is dif-
Two-sample t test power calculation

n
d
sig.level
power
alternative
=
=
=
=
=
25.52457
0.8
0.05
0.8
two.sided
NOTE: n is number in *each* group
ficult to specify. An alternative, quick and dirty

method for approximate sample size determination was proposed by Mead (1988). The method
is appropriate for comparative experiments which
can be analyzed using analysis of variance (Grafen
and Hails, 2002; Kutner et al., 2004), such as:
Exploratory experiments
Alternatively we could ask what would be the power

of an experiment with say, 5 animals per treatment
group:
> pwr.t.test(d=0.8,n=5,sig.level=0.05,
+
type="two.sample",
+
alternative="two.sided")
Two-sample t test power calculation

n
d
sig.level
power
alternative
=
=
=
=
=
5
0.8
0.05
0.2007395
two.sided
NOTE: n is number in *each* group
For a completely randomized experiment with 2

groups of 5 animals, as described in Example 5.6,
the power to detect a difference of 10 myocytes between treatment groups ( = 0.8) is only 20%.
Complex biological experiments with several

factors and treatments
Any eperiment where the power analysis
method is not possible or practicable.
The method depends on the law of diminishing returns: adding one experimental unit to a small experiment gives good returns, while adding it to a
large experiment does not do so. It has been used
by statisticians for decades, but has been explicitly
justified by Mead (1988). An appropriate sample
size can be roughly determined by the number of
degrees of freedom for the error term in the analysis of variance (ANOVA) or t test given by the formula:
E =N T B
where E, N, T and B are the total, error, treatment
A quick and dirty method for sample size cal-
and block degrees of freedom (number of occurrences
culation against a two-sided alternative in a two
or levels minus 1) in the ANOVA. In order to ob-
group comparison, with a power of 0.8 and Type I
tain a good estimate of error it is necessary to have
error of 0.05 is provided by Lehrs equation (Lehr,
at least 10 degrees of freedom for E, and many
1992; Van Belle, 2008):
statisticians would take 12 or 15 degrees of free-
n 16 /2
where represents the effect size and n stands for
the required sample size in each treatment group.
For the above example, the equation results in:
dom as their preferred lower limit. On the other

hand, if E is allowed to be large, say greater than
20, then the experimenter is wasting resources. It
is recommended that, in a non-blocked design the,
E should be between ten and twenty.
16/0.64 = 25 animals per treatment group. The nu-
Example 6.2. Suppose an experiment is planned
merator of Lehrs equation depends on the desired
with four treatments, with eight animals per group
power. Alternative values for the numerator are 8
(32 rats total). In this case N=31, B=0 (no block-
and 21 for powers of 50% and 90%, respectively.
ing), T=3, hence E=28. This experiment is a bit
44
n=32
n=32
80
80
n=16
n=16
n=12
60
n=8
40
n=12
Power
Power
60
n=8
40
n=6
n=6
20
10
15
20
25
n=4
20
30
10
Number of Subsamples
15
20
25
n=4
30
Number of Subsamples
Figure 6.1. Power curves for a two-group comparison to detect a difference of 1, with a two-sided t-test with significance level
= 0.05 as a function of the number of subsamples m. Lines are drawn for different numbers of experimental units n in each
group. For both left and right panel the between sample standard deviation (n ) is 1, while within sample standard deviation (m )
is 1 in the left panel and 2 in the right panel. The dots connected by the dashed line indicate where the total number of subsamples
2
2
2 n m equals 192. The vertical line indicates an upper bound to the useful number of subsamples of m = 4(m
/n
).
too large, and six animals per group might be more
6.5
How many subsamples
appropriate (23 - 3 = 20).

In Section 4.8.2.2 we defined the standard error of
the experiment when subsamples are present as:
There is one problem with this simple equation. It appears as though blocking is bad, because
it reduces the error degrees of freedom. If in the
above example, the experiment would be done in
2
2 2 m
(n +
)
n
m
where n and m are the number of experimental
eight blocks, then N = 31, B = 7, T = 3 and units and subsamples and n and m the between
E = 31 7 3 = 21 instead of 28. However, sample and within sample standard deviation. Usblocking nearly always reduces the inherent vari-
ing this expression, we can establish the power for
ability, which more than compensates for the de-
different configurations of an experiment.
crease in the error degrees of freedom, unless the

experiment is very small and the blocking criterion
was not well related to the response. Therefore,
when blocking is present and the error degrees of
freedom is not less than about 6, the experiment is
probably still of an adequate size.
Figure 6.1 shows the influence of the number of

experimental units (n) and the number of subsamples (m) per experimental unit on the power of two
experiments to detect a difference between two
mean values of 1. For both experiments n = 1.
The left panel shows the case where m = 1, while
Example 6.3. If we consider again, in this context in the right panel m = 2. The dots connected by
the paired experiment of Example 5.5 (page 30), we a dashed line represent the power for experiments
have N = 9, B = 4, T = 1. Hence, E = 941 = 4. where the total number of subsamples equals 192
Obviously the sample size of 10 experimental units (2 treatment groups n p).
was too small to allow an adequate estimate of
the error. At least 2 experimental units should be
added.
As is illustrated in the left panel of Figure 6.1,

subsampling has only a limited effect on the power
6.5. HOW MANY SUBSAMPLES
45
128
128
= 0.5
= 0.5
= 0.8
= 1.0
= 1.5
= 2.0
64
32
= 0.8
= 1.0
16
= 1.5
= 2.0
Required sample size per group
Required sample size per group
64
32
16
10
Number of comparisons
10
Number of comparisons
Figure 6.2. Required sample size of a two-sided test with a significance level of 0.05 and a power of 80% (left panel) and 90%
(right panel) as a function of the number of comparisons that are carried out. Lines are drawn for different values of the effect size
(). Note that the y-axis is logarithmic.
of the experiment when the within sample vari-
In both examples the power curves have flat-
ability m is the same size (or smaller) as the be-
tened after crossing the vertical line where m =
tween sample variability n . In this case it has
2
4(m
/n2 ). This is known as Coxs rule of thumb
no sense taking more than say 4 subsamples per
(Cox, 1958) about subsamples, which states that
experimental unit, as is indicated by the vertical
for the completely randomized design there is not
line in Figure 6.1. Furthermore, the sharp decline
much increase in power when the number of sub-
of the dashed line connecting the points with the
2
samples m is greater than 4(m
/n2 ). Coxs ratio
same total number of subsamples indicates that
provides an upper limit for a useful number of sub-
subsampling in this case is rather inefficient, at
samples. However, this rule of thumb does not
least when the cost of subsamples and experimental units is not taken into consideration. An experi-
take the different costs involved with experimental
ment with 32 experimental units and 3 subsamples
units and subsamples into account. In many cases,

especially in animal research, the cost of the exper-
has a power of more than 90%, while for an exper-
imental unit is substantially larger than that of the
iment with the same total number of subsamples
subunit. Taking these differential costs into consid-
but with 4 experimental units and 24 subsamples
eration, the optimum number of subsamples can
per unit, the power is only about 20%.
be derived as (Snedecor and Cochran, 1980):

s
m=
cn
2
m
cm
n2
The right panel of Figure 6.1 shows the case

where the within sample standard deviation m
This equation shows that taking subsamples is of
is twice the standard deviation between samples
interest when the cost of experimental units cn is
n . In this example, taking more subsamples does large relative to the cost of subsamples cm , or when
make sense. The power curves keep increasing the variation among subsamples m is large relauntil the number of subsamples is about 16. The
tive to the variation among experimental units n .
loss in efficiency by taking subsamples is also more

moderate, as is indicated by the less sharp decline
Example 6.4. In a morphologic study (Verheyen
of the dotted line.
et al., 2014), the diameter of cardiomyocytes was
46
0.06
0.06
Fraction
Fraction
0.04
0.04
0.02
0.02
0.00
0.00
0.0
0.5
1.0
1.5
2.0
Effect size
Effect size
Figure 6.3. Running the cardiomyocyte experiment a large number of times, the measured effect sizes follow a broad distribution. In
both plots the true effect size is 0.8. The dark area represents statistically significant results (two-sided p 0.05) and the vertical
dotted line indicates the effect size which is just large enough to be statistically significant. Left panel: 26 animals are used per
treatment group which corresponds to a power of 80%. Right panel: Only 5 animals per treatment group are used which results in
an underpowered experiment
examined in 7 sheep that underwent surgery and 6
6.6
Multiplicity and sample size
sheep that were used as control. For each animal

the diameter of about 100 epicardial cells was mea-
As we shall see in Section 7.6, when more than one
sured. A sophisticated statistical technique known statistical test is carried out on the data, the overas mixed model analysis of variance allowed to es- all rate of false positive findings is higher than the
2
as 4.58 and 13.7
timate from the data n2 and m
false positive rate for each test separately. To cir-
respectively. Surprisingly variability within an ani- cumvent this inflation of the false positive error
mal was larger than between animals. If we were to rate, the critical value of each individual test is usually set at a more stringent level. The most simber of measurements to 4 13.7/4.58 12 per ani- ple adjustment, Bonferronis adjustment, consists
mal. Alternatively, we can take the differential costs of just dividing the significance level of each indiset up a new experiment, we could limit the num-
of experimental units and subsamples into account. vidual test by the number of comparisons. BonferIt makes sense to assume that the cost of 1 ani- ronis adjustment maintains the error rate of the
mal is about 100 times the cost of one diameter totality of tests that are carried out in the same conmeasurement. Making this assumption, the opti- text at its original level. But, as we already noted
mum number of subsamples per animal would be above, when the significance level is set at a lower
p
100 13.7/4.58 17. Thus the total number value, the required sample size will necessarily inof diameter measurements could be reduced from crease. Fortunately, the increase in required num1300 to 220. Even if animals would cost 1000 times ber of replicates is surprisingly small.
more than a diameter measurement, the optimum
number of subsamples per animal would be about
Figure 6.2 shows for a two-sided Student t-test

55, which is still a reduction of about 50% of the with a significance level of 0.05 and a power of
original workload. As a conclusion, this is a typi- 80% (left panel) and 90% (right panel), how the
cal example of a study in which statistical input at required sample size increases with an increasing
the onset would have improved research efficiency number of comparisons. The percent increase in
considerably.
sample size due to adding an extra comparison,

corresponds to the slope of the line segment connecting adjacent points in Figure 6.2. The evolution of the relative sample size is comparable for
6.7. THE PROBLEM WITH UNDERPOWERED STUDIES
47
all values of and power (100 (1 )). For pow- other consequence of low statistical power is that
ers of 80% and 90 %, carrying out two indepen-
effect sizes are overestimated and results become
dent statistical tests instead of one, involves a 20%
less reproducible (Button et al., 2013). This is best
larger sample size to maintain the overall error rate
illustrated by the following example.
at its level of = 0.05. Similarly, when 3 or 4 independent tests are involved, the required sample
size increases with 30% or 40% respectively. After
Example 6.5. Consider the cardiomyocyte example
4 comparisons, the effect tapers off and all curves
as discussed in Example 5.6. A sample size calcu-
approach linearity. Adding an extra comparison in
lation, treating the experiment as if it was a com-
the range of 4 - 10 comparisons, will increase the
pletely randomized design (which it was not) was
required sample size with about 2.7%, leading to a
carried out on page 42 and yielded a required sam-
total increase in sample size for 10 comparisons of
ple size of 26 animals in each group to detect a
about 70%. For a larger number of comparisons,
large treatment effect = 0.8 with a power of
Witte et al. (2000) noted that the relative sample
80%. Imagine running several copies of this experi-
size increases linear with the logarithm of the num-
ment, say 10,000. The effect sizes that are obtained
ber of comparisons.
from these experiments follow a distribution as displayed in the left-hand panel of Figure 6.3. The
Figure 6.2 illustrates also how sample size de-
dark shaded area corresponds to experiments that
pends on the effect size. Large sample sizes are
yielded a statistically significant result. This subset
indeed required for detecting moderate-small dif-
yields a slightly increased estimate of the effect size
ferences. However, for large and very large dif-
of 0.89, which corresponds to an effect size infla-
ferences, as we usually want to detect in early re-
tion of 11%. This effect size inflation is to be ex-
search, the required sample size reduces to an at-
pected when an effect has to pass a certain threshold
tainable level.
such as statistical significance. A relative inflation

of 11% as in this case is acceptable.
6.7
The problem with underpowered studies
A survey of articles that were published in 2011
The situation is completely different in the right-
(Tressoldi et al., 2013), showed that in prestigious
hand panel of Figure 6.3 where the experiments now
journals such as Science and Nature, fewer than 3%
use only 5 animals per treatment group and corre-
of the publications calculate the statistical power
sponding power has dropped to 20%. The variabil-
before starting their study. More specifically in
ity of the results is substantially larger as displayed
the field of neuroscience, published studies have

a power between 8 and 32% to detect a true effect
by the larger scale of the x-axis. While the stan-
(Button et al., 2013). Low statistical power might
periment was 0.28, this has now increased to 0.63,
lead the researcher to wrongly conclude there is

no effect from an experimental treatment when in
an increase with a factor 2.25 which corresponds

p
to 26/5. The significant experiments now consti-
fact an effect does exist. In addition, in research in-
tute a much smaller part of the distribution. The
volving animals underpowered studies raise a sig-
mean effect size in this subset has now increased
nificant ethical concern. If each individual study
to 1.57, an inflation of 96%. In addition, the max-
is underpowered, the true effect will only likely be
imum effect detected in all studies is now 3.16 as
discovered after many studies using many animals
compared to 1.95 in the larger experiment. This
have been completed and analyzed, using far more
phenomenon is known as truth inflation, type M er-
animal subjects than if the study had been done
ror (M stands for magnitude) or the winnerss curse
properly the first time (Button et al., 2013). An-
(Button et al., 2013; Reinhart, 2015).
dard deviation of the effect sizes in the larger ex-
48
6.8
120
Sequential plans
Sequential plans allow investigators to save on
Relative bias of research finding (%)
the experimental material, by testing at different

100
stages, as data accumulate. These procedures have

been used in clinical research and are now advo-
80
cated for use in animal experiments (Fitts, 2010,
2011).
60
40
all types of designs that we discussed before can
be implemented in a sequential manner. Sequen-
20
tial procedures are entirely based on the Neyman-
0
20
40
Sequential plans are sometimes referred
to as "sequential designs", but strictly speaking
60
80
Statistical power of study (%)
Pearson hypothesis decision making approach that

we saw in Section 6.3 and do not consider the accuracy or precision of the treatment effect estima-
Figure 6.4. The winners curse: effect size inflation as a function

tion.
of statistical power.
Therefore, in the case of early termination for
As shown in Figure 6.4, effect inflation is worst a significant result, sequential plans are prone to
for small low-powered studies which can only detect exaggerate the treatment effect. There is certainly a
treatment effects that happen to be large. There-
place for these procedures in exploratory research
fore, significant research findings of small studies such as early screening, but a fixed sample size
are biased in favour of inflated effects. This has confirmatory experiment is needed to provide an
consequences when an attempt is made to replicate
a published finding and the sample size is computed
based on the published effect. When this is an inflated estimate, sample size of the confirmatory experiment will be too low. To summarize, effect inflation due to small, underpowered experiments is
one of the major reasons for the lack of replicability
in scientific research.
unbiased and precise estimate of the effect size.

Example 6.6. In a search for compounds that offer protection against traumatic brain injury, a rat
model was used as a screening test. Preliminary
power calculations showed that at least 25 animals per treatment group were required to detect
a protective effect with a power of 80% against a
one-sided alternative with a type I error of 0.05.
Taking into consideration that a large number of
test compounds would be inactive, a fixed sample size approach was regarded as unethical and
inefficient. Therefore, a one-sided sequential procedure (Wilcoxon et al., 1963) was considered as
more appropriate. The procedure operated in different stages (Figure 6.5). At each stage a number of animals was selected, such that the group
was as homogeneous as possible. The animals were
then randomly allocated to the different treatment
groups, as three per group. At a given stage the
treatments consisted of several experimental compounds and their control vehicle. After measuring
the response, the procedure allowed the investigator
to make the decision of rejecting the drug as uninteresting, accepting the drug as active, or to continue with a new group of animals in a next stage.
Figure 6.5. Outline of a sequential experiment
After having tested about 50 treatment conditions,
6.8. SEQUENTIAL PLANS
49
a candidate compound was selected for further de- tion of false positive and false negative results was
velopment. An advantage of this screening proce- known and fixed. A disadvantage of the method was
dure was that, given the biologically relevant level that a dedicated computer program was required for
of activity that must be detected, the expected frac- the follow-up of the results.
50
7. The Statistical Analysis

How absurdly simple !, I cried.
Quite so !, said he, a little nettled. Every problem becomes very childish when once it is explained
to you.
Dr. Watson and Sherlock Holmes
The Adventure of the Dancing Men, A.C. Doyle.
We teach it because its what we do; we do it because its what we teach. (on the use of p < 0.05)
John Cobb (2014)
7.1
The statistical triangle
ities of the statistical analysis have now almost become trivial
There is a one-to-one correspondence between the

study objectives, the study design and the analy-
7.2
sis. The objectives of the study will indicate which

of the designs may be considered. Once a study de-
The statistical model revisited
sign is selected, it will on its turn determine which

type of analysis is appropriate. This principle, that
the statistical analysis is determined by the way
the experiment is conducted, was enunciated by
Fisher (1935):
All that we need to emphasize immediately
is that, if an experiment does allow us to
calculate a valid estimate of error, its struc-
Figure 7.1. The statistical triangle: a conceptual framework for

the statistical analysis
ture must completely determine the statistical procedure by which this estimate is to
be calculated. If this were not so, no in-
Example 7.1. We already stated that every experi-
terpretation of the data could ever be un-
mental design is underpinned by a statistical model
ambiguous; for we could never be sure that
and that the experimental results should be consid-
some other equally valid method of inter-
ered as being generated by this experimental model.
pretation would not lead to a different re-
This conceptual framework as illustrated in Figure
sult.
7.1, greatly simplifies the statistical analysis to just

fitting the statistical model to the data and compar-
In other words, choice of the statistical methods
ing the model component related to the treatment
follows directly from the objectives and design of
effect with the error component (Grafen and Hails,
the study. With this in mind, many of the complex-
2002; Kutner et al., 2004). Hence, the choice of the

51
0.3
0.2
0.1
0.0
0.0
0.1
0.2
0.3
0.4
CHAPTER 7. THE STATISTICAL ANALYSIS
0.4
52
2 2.79
4 2.792
2 2.79
Figure 7.2. Distribution of the test statistic t for the cardiomyocyte example, under the assumption that the null hypothesis of no
difference between the samples is true.
appropriate statistical analysis is straightforward. approach of hypothesis testing. For Fisher, the pHowever, some important statistical issues remain, value was an informal measure to see how surprissuch as the type data and the assumptions we make ing the data were and whether they deserved a secabout the distribution of the data.
ond look (Nuzzo, 2014; Reinhart, 2015). It is good

practice to follow Fisher and to report the actual
7.3
Significance tests
p-values rather than p 0.05 (see Section 9.2.3),

since this allows anyone to construct their own hy-
Significance testing is related to, but not exactly the pothesis tests. The cardiomyocytes experiment of
same as hypothesis testing (see Section 6.3). Signif- Example 5.5 (page 30) will help us to illustrate the
icance testing differs from the Neyman-Pearson hy-
idea of significance testing. The experiment was
pothesis testing approach in that there is no need to set up to test the null hypothesis of no difference
define an alternative hypothesis. Here, we only de- between vehicle and drug. This null hypothesis is
fine a null hypothesis and calculate the probability tested at a level of significance of 0.05, i.e. we
to obtain results as extreme or more extreme than want to limit the probability of a false positive rewhat was actually observed, assuming the null hy- sult to 0.05. The paired design of this experiment
pothesis is true. This is done by calculating, from is a special case of the randomized complete block
the experimental data, a quantity called test statis- design with only two treatments and the response is
tic. Then, based on the the statistical model, the a continuously distributed variable. In this design,
distribution of this test statistic is derived under calculations can be simplified by evaluating for each
the null hypothesis. With this null-distribution, the pair separately the treatment effect, thus removing
probability is calculated of obtaining a test statis- the block effect. This is done in Table 5.1 in the coltic that is as extreme or more extreme than the umn with the Drug - Vehicle differences. We now
one observed. This probability is referred to as p- must make some assumptions about the statistical
value. It is common practice to compare this p- model that generated the data. Specifically, we asvalue to a preset level of significance (usually sume that the differences are independent from one
0.05). When the p-value is smaller then , the null another and originate from a normal distribution.
hypothesis is rejected, otherwise the result is inconclusive. However, this conflates the two worlds of
Next, we define a relevant test statistic, in this
significance testing and the formal decision-making
case it is the mean value of the differences (7), di-
1 the
standard error of the mean

is obtained by dividing the sample standard deviation by the square root of
of a sample
the sample size, i.e. sx = SD/ n = 5.61/ 5 = 2.51
7.4. VERIFYING THE STATISTICAL ASSUMPTIONS
vided by its standard error1 . For our example, we
obtain a value of 7/ 2.51 = 2.79 for this statistic.
7.4
53
Verifying the statistical assumptions
Under the assumptions made above and provided
the null hypothesis of no difference between the two When the inferential results are sensitive to the distreatment conditions holds, the distribution of this tributional and other assumptions of the statistical
statistic is known2 and is depicted in Figure 7.2. analysis, it is essential that these assumptions are
On the left panel of Figure 7.2, the value for the also verified. The aptness of the statistical model is
test statistic 2.79 that was obtained from the ex- preferably assessed by informal methods such as
perimental data is indicated and the area under the diagnostic plotting (Grafen and Hails, 2002; Kutcurve, right to this value is shaded in grey. This ner et al., 2004). When planning the experiment,
area corresponds to the one sided p-value, i.e. the historical data, or the results of exploratory or piprobability of obtaining a greater value for the test lot experiments can already be used for a prelimstatistic than the one obtained in the experiment. inary verification of the model assumptions. AnSince by definition the total area under the curve other option is to use statistical methods that are
equals one, we can calculate the value of the shaded robust against departures from the assumptions
area. For our example, this results in a value of (Lehmann, 1975). It is also wise, before carrying
0.024, which is the probability of obtaining a value out formal tests, to make graphical displays of the
for the test statistic as extreme or more extreme data. This allows to identify outliers and gives althan the one obtained in the experiment, provided ready indications whether the statistical model is
the null hypothesis holds.
appropriate or not. Such exploratory work is also
a tool for gaining insight in the research project and
can lead to new hypotheses.
However, before the experiment was carried out,

we were also interested in finding the opposite result, i.e. we were also interested in a decrease in
viable myocytes. Therefore, when we consider more
extreme results, we should also look at values that
are less than -2.79. This is done in the right panel
of figure 7.2. The sum of the two areas is called the
two-sided p-value and corresponds to the probability of obtaining under the null hypothesis, a more
extreme result than 2.79. In our example, the obtained two-sided p-value is 0.049, which allows us
Figure 7.3. One hundred drugs are tested for activity against a
biological targe. Each drug occupies a square in the grid, the top
row are the drugs that are truly active. Statistically significant
results are obtained only for the darker-grey drugs. The black
cells are false positives (after Reinhart (2015)).
to reject the null-hypothesis at the pre-specified significance level of 0.05 using a two-sided test.
7.5
The meaning of the p-value

and statistical significance
There is one important caveat in all this: Signif-
The literature in the life sciences is literally flooded
icance tests only reflect whether the obtained result
with p-values and yet this is also the most misun-
could be attributed to chance alone, but do not tell
derstood, misinterpreted and sometimes miscalcu-
whether the difference is meaningful or not from a
lated measure (Goodman, 2008). When we ob-
scientific point of view.
tain a result that is not significant, this does not
2 Under
the null hypothesis and when the assumptions are true, the test statistic is distributed as a Student t-distribution
with n 1 degrees of freedom.
54
mean that there was no difference between treat-
and the prevalence or base rate as:
ment groups. Sample size of the experiment could

just be too small to establish a statistically signifi-
F DR = (1 )/[(1 ) + (1 )]
cant result (See Chapter 6). But, what does a sig-
= 1/{1 + [/(1 )][(1 )/]}
(7.1)
nificant result mean?

Example 7.2. In a laboratory 100 experimental
compounds are tested against a particular biological target. Figure 7.3 illustrates the situation. Each
square in the grid represents a tested compound. In
reality only 10 drugs are active against the target,
these are located in the top row. We call this value
of 10% the prevalence or base rate. Lets assume
that our statistical test has a power of 80%. This
means that of the 10 active drugs 8 will be correctly detected. These are shown in darker grey.
The threshold for the p-value to declare a drug statistically significant is set to 0.05, thus there is a
5% chance to declare an inactive compound incorrectly as active. There are 90 drugs that are in
reality in active, so about 5 will yield a significant
effect. These are shown on the second row in black.
Hence, in the experiment 13 drugs are declared active, of which only 8 are truly effective, i.e. the
positive predictive value is about 8/13 = 62%, or its
complement the false discovery rate (FDR) is about
38%.
For our example, the above derivation of the FDR

yields a value of 0.36 when the prevalence of active drugs is 10%, significance threshold of 0.05
and a power 1 of 80%. This rises to 0.69 when
the power is reduced to 20%. Meaning that under these conditions, 69% of the drugs (or other research questions) that were declared active, are in
fact false positives.
The FDR depends highly on the prevalence rate
as is illustrated in Figure 7.4, leading to the conclusion that when working in new areas where the a
priori probability of a finding is low, say 1/100, a
significant result does not necessarily imply a genuine activity. In fact, under these circumstances
even in a well-powered experiment (80% power)
with a significance level of 0.05, 69% of the positives findings are false. To make things worse, it
are such surprise, groundbreaking findings, often
combined with exaggerated effect sizes due to a
small sample size (Section 6.7), that are likely to be
published in prestigious journals like Nature and
Science.
Table 7.1. Minimum false discovery rate MFDR for some commonly used critical values of p
1.0
p-value
False Discovery Rate
0.8
MFDR
0.1
0.05
0.01
0.005
0.001
0.385
0.289
0.111
0.067
0.0184
0.6
What is the value of p 0.05 Consider an exper-
0.4
iment whose results yield a p-value close to 0.05,

say between 0.045 and 0.05. What does it actu-
20%
50%
80%
0.2
ally mean? In how many instances does this result reflect a true difference? We already deduced
0.0
0.0
0.2
0.4
0.6
0.8
that, when the power or the prevalence rate are
1.0
low, the FDR can easily reach 70%. But what is
Prevalence
Figure 7.4. False discovery rate as a function of the prevalence

) and the power 100 (1 ), ( = 0.05), lines are drawn for
power 100 (1 ) of 80%, 50% and 20%
the most optimistic scenario? In other words what

is the minimum value for the FDR? Irrespective
of power, sample size and prior probability, Sellke
follows
et al. (2001) derived an expression for what they
(Colquhoun, 2014; Wacholder et al., 2004) that the
From
the
above
reasoning,
it
call the conditional error probability, which is equiv-
FDR depends on the threshold , the power (1) alent to the minimum FDR (MFDR). This gives the
7.6. MULTIPLICITY
55
minimum probability that, when a test is declared
jectives not only increases the studys complexity,
significant, the null hypothesis is in fact true.
but also results in more hypotheses that are to be
Some values of the MFDR are given in Table 7.1. tested. Testing multiple related hypotheses also
For p = 0.05 the M F DR = 0.289, which means raises the type I error rate. The same problem of
that a researcher who claims a discovery when p multiplicity arises when a study includes a large
0.05 is observed will make a fool of him-/herself in
number of variables, or when measurements are
about 30% of the cases. Even for a p-value of 0.01,
made at a large number of time points. Only in
the null hypothesis can still be true in 11% of the
studies of the most exploratory nature the statisti-
cases (Colquhoun, 2014).
cal analysis of every possible variable or time point

is acceptable. In this case, the exploratory nature of
The FDR is certainly one of the key factors re- the study should be stressed and the results intersponsible for the lack of replicability in research preted with great care.
and puts the decision-theoretic approach with its
irrational dichotomization of the p-value into significant and non-significant certainly into question.
As it was noted already in the introductory
chapter, the issues of reproducibility and replicability of research founding have been concerning
the scientific, but certainly also the statistical community, deeply. This has led the board of the American Statistical Association to issue a statement on
March 6, 2016 in which the society warns against
the misuse of p-values (Wasserstein and Lazar,
2016). This is the first time in its 177 year old
history that explicit recommendations on a fun-
Example 7.3. Suppose a drug is tested at 20 different doses on a specific variable. Further, suppose
that we reject the null hypothesis of no treatment
effect for each dose separately when the probability
of falsely rejecting the null hypothesis (the significance level ) is less than or equal to 0.05. Then
the overall probability of falsely declaring the existence of a treatment effect when all underlying null
hypotheses are in fact true is 1(10.05)20 = 0.64.
This means that we are more likely to get one significant result than not. The same multiplicity problem arises when a single dose of the drug is tested
on 20 variables that are mutually independent.
damental matter in statistics were done. In sum-
The problem of multiplicity is of particular immary, the ASA advises in its statement researchers portance and magnitude in gene expression mito avoid drawing scientific conclusions or making croarray experiments (Bretz et al., 2005). For examdecisions based on p-values alone. P-values should ple a microarray experiment examines the differcertainly not be interpreted as measuring the prob- ential expression of 30,000 genes in wildtype and
ability that the studied hypothesis is true or the in a mutant. Assume that for each gene an approprobability that the data were produced by chance
priate two-sided two-sample test is performed at
alone. Researchers should describe not only the
the 5% significance level. Then we expect to obtain
data analyses that produced statistically significant

results, the society says, but all statistical tests and
roughly 1,500 false-positives. Strategies for dealing
choices made in calculations.
in microarrays are provided by Amaratunga and
with, what is often called the curse of multiplicity,

Cabrera (2004) and Bretz et al. (2005).
7.6
Multiplicity
The multiplicity problem must at least be recog-
In Section 3.1, we already pointed out that it is wise
nized at the planning stage. Ways to deal with it
to limit the number of objectives in a study. As al-
(Bretz et al., 2010; Curran-Everett, 2000) should be
ready mentioned in Section 6.6, increasing the ob-
investigated and specified in the protocol.
56
8. The Study Protocol

Nothing clears up a case so much as stating it to another person.
Sherlock Holmes
The Memoirs of Sherlock Holmes. Silver Blaze. A.C. Doyle.
The writing of the study protocol finalizes the
Writing down the statistical analysis plan before-
end of the research design phase. Every study
hand prevents also from trying several methods
should have a written formal protocol before it is
of analysis and report only those results that suit
started. The complete study protocol consists of a
the investigator. Such a practice is of course inap-
more conceptual research protocol and the techni-
propriate, unscientific, and unethical. In this con-
cal protocol, which we already discussed in Section
text, the study protocol is a safeguard for the re-
4.8.1.3. The research protocol states the rational for
producibility of research findings.
performing the study, the studys objectives, the

related hypotheses and working hypotheses that
are tested and their consequential predictions. It
Many investigators consider writing a detailed
should contain a section on experimental design,
protocol a waste of time. However, the smart re-
how treatments will be assigned to experimen-
searcher understands that by writing a good pro-
tal units, information and justification of planned
tocol he is actually preparing his final study re-
sample sizes, and a description of the statistical

analysis that is to be performed. Defining the sta-
port. A well-written protocol is even more essen-
tistical methods in the protocol is of importance
collaborative. Once the protocol has been formal-
since it allows preparation of the data analytic pro-
ized, it is important that it is followed as good as

possible and every deviation of it should be docu-
tial when the design is complex or the study is
cedures beforehand and ensures against the mis-
leading practice of data dredging or data snooping. mented.
57
58
CHAPTER 8. THE STUDY PROTOCOL
9. Interpretation and Reporting

No isolated experiment, however significant in itself, can suffice for the experimental demonstration
of any natural phenomenon; for the one chance in a million will undoubtedly occur, with no less
and no more than its appropriate frequency, however surprised we may be that it should occur to us.
R. A. Fisher (1935)
9.1.1
While the previous chapters focused on the

planning phase of the study with the protocol as fi-
Experimental design
Readers should be told about weaknesses and

strengths in study design, e.g. when randomiza-
nal deliverable, this section deals with some points

to consider when interpreting and reporting the
tion and/or blinding was used since it adds to the
results of the statistical analysis. Several journals
reliability of the data. A detailed description of the
(Altman et al., 1983; Bailar III and Mostelller, 1988)
randomization and blinding procedures, and how
have published guidelines for reporting statistical
and when these were applied will allow the reader
results. Although these guidances focus mostly on
to judge the quality of the study. Reasons for block-
clinical research, many of the precepts also apply
ing and the blocking factors should be given and
to other types of research.
how blocking was dealt with in the statistical analysis. When there is ambiguity about the experimental unit, the unit used in the statistical analysis
should be specified and a justification for its choice
As a general rule in writing reports containing

statistical methodology, it is recommended not to
should be provided.
use technical statistical terms such as random, normal,significant,correlation, and sample in their everyday meaning, i.e. out of the statistical context. We
9.1.2
Statistical methods
Statistical methods should be described with
will now discuss the different sections of a scien-
enough detail to enable a knowledgeable reader
tific publication in which statistical reasoning is in-
with access to the original data to verify the re-
volved.
ported results. The authors should report and justify which methods they used. A term like tests of
significance is too vague, and should be more de-
9.1
tailed.
The Methods section
The level of significance and, when applicable,

The Methods section should contain details of the
direction of statistical tests should be specified,
experimental design, such as the size and number
e.g. two-sided p-values less than or equal to 0.05
of experimental groups, how experimental units
were considered to indicate statistical significance.
were assigned to treatment groups, how experi-
Some procedures, e.g. analysis of variance, chi-
mental outcomes were assessed, and what statis-
square tests, etc. are by definition two-sided. Is-
tical and analytical methods were used.
sues about multiplicity (Section 7.6) and a justifi59
60
CHAPTER 9. INTERPRETATION AND REPORTING
cation of the strategy that deals with them should

also be addressed here.
Spurious precision detracts from a papers readability and credibility. Therefore, unnecessary precision, particularly in tables, should be avoided.
The software used in the statistical analysis and

its version should also be specified. When the Rsystem is used (R Core Team, 2013), both R and the
packages that were used should be referenced.
When presenting means and standard deviations,

it is important to bear in mind the precision of the
original data. Means should be given one additional decimal place more than the raw data. Standard deviations and standard errors usually require one more extra decimal place. Percentages
9.2
The Results section
9.2.1
Summarizing the data
should not be expressed to more than one decimal place and with samples less than 100, the use
of decimal places should be avoided. Percentages
The number of experimental units used in the analysis should always be clearly specified. Any discrepancies with the number of units actually ran-
should not be used at all for small samples. Note

that the remarks about rounding only apply to the
presentation of results, rounding should not be
done at all before or during the statistical analysis.
domized to treatment conditions should be accounted for. Whenever possible, findings should
be quantified and presented with appropriate indicators of measurement error or uncertainty. As
9.2.2
Graphical displays
measures of spread and precision, standard deviations (SD) and standard errors (SEM) should not
be confused. Standard deviations are a measure of
spread and as such a descriptive statistic , while
standard errors are a measure of precision of the
mean. Normally distributed data should preferably be summarized as mean (SD), not as mean
SD. For non-normally distributed data, medians
and inter-quartile ranges are the most appropriate summary statistics. The practice of reporting
mean SEM should preferably be replaced by the
reporting of confidence intervals, which are more
informative. Extremely small datasets should not
be summarized at all, but should preferably be reported or displayed as raw data.
Figure 9.1. Scatter diagram with indication of median values and

95% distribution-free confidence intervals
Graphical displays complement tabular presentations of descriptive statistics.
Generally,
graphs are better suited than tables for identifying

patterns in the data, whereas tables are better for
providing large amounts of data with a high de-
When reporting SD (or SEM) one should realize
gree of numerical detail. Whenever possible, one
that for positive variables (i.e. variables measured
should always attempt to graph individual data
on a ratio scale) such as concentrations, durations,
points. This is especially the case when treatment
and counts, the mean minus 2 SD (or minus groups are small. Graphs such as Figure 9.1 and
n) which indicates a lower 2.5% of Figure 9.2 are much more informative than the
2 SEM
the distribution, can lead to a ridiculous negative
usual bar and line graphs showing mean values
value. In this case, an appropriate 95% confidence
SEM (Weissgerber et al., 2015). These graphs are
interval based on the lognormal distribution, or al-
easily constructed using the R-language. Specifi-
ternatively a distribution-free interval, will avoid
cally, the R-package beeswarm (Eklund, 2010) can
such a pitfall.
be of great help.
9.2. THE RESULTS SECTION
61
sult of 0.051 would be flagged as NS. Reporting

exact p-values, would allow readers to compare the
reported p-value with their own choice of significance levels. One should also avoid reporting a pvalue as p = 0.000, since a value with zero probability of occurrence is, by definition, an impossible
value. No observed event can ever have a probability of zero. Therefore, such an extreme small
p-value must be reported as p < 0.001. In roundFigure 9.2. Graphical display of longitudinal data showing individual subject profiles
ing a p-value, it happens that a value that is technically larger than the significance level of 0.05, say
0.051, is rounded down to p = 0.05. This is inac-
9.2.3
Interpreting and reporting signifi- curate and, to avoid this error, p-values should be
reported to the third decimal. If a one-sided test is
cance tests
used and the result is in the wrong direction, then
When data are summarized in the Results section,
the report must state that p > 0.05 (Levine and
the statistical methods that were used to analyze
Atkin, 2004), or even better report the complement
them should be specified. It is to the reader of
of the p-value, i.e. 1 p.
little help to have in the Methods section a statement such as statistical methods included analysis of
There is a common misconception among scien-
variance, regression analysis, as well as tests of signifi-
tists that a nonsignificant result implies that the
cance without any reference to which specific pro- null hypothesis can be accepted. Consequently,
cedure is reported in the Results part.
they conclude that there is no effect of the treatment or that there is no difference between the
Tests of statistical significance should be two- treatment groups. However, from a philosophisided. When comparing two means or two pro- cal point of view, one can never prove the nonexportions, there is a choice between a two-sided or
istence of something.
a one-sided test (see Section 7.3. In a one-sided
pointed out:
test the investigator alternative hypothesis specifies the direction of the difference, e.g. experimental treatment greater than control. In a two-sided
test, no such direction is specified. A one-sided test
As Fisher (1935) clearly
it should be noted that the null hypothesis is

never proved or established, but is possibly
disproved, in the course of experimentation.
is rarely appropriate and when one-sided tests are
To state it otherwise: Lack of evidence is no evidence
used, their use should be justified (Bland and Alt-
for lack of effect. Conversely, an effect that is sta-
man, 1994). For all two group comparisons, the
tistically significant is not necessarily of biomedi-
report should clearly state whether one-sided or
cal importance, nor is it replicable (see Section 7.5.
two-sided p-values are reported.
Therefore, one should avoid sole reliance on statistical hypothesis testing and preferably supplement
Exact p-values, rather than statements such as ones findings with confidence intervals which are
p < 0.0500 or even worse NS (not significant), more informative. Confidence intervals on a difshould be reported where possible. The practice ference of means or proportions provide informaof dichotomizing p-values into significant and not
tion about the size of an effect and its uncertainty
significant has no rational scientific basis at all and
and are of particular value when the results of the
should be abandoned. This lack of rationality be-
test fail to reject the null hypothesis. This is illus-
comes apparent when one considers the situation
trated in Figure 9.3 showing treatment effects and
where a study yielding a p-value of 0.049 would be their 95% confidence intervals. The shaded area
flagged significant, while an almost equivalent re-
indicates the region in which results are important
62
CHAPTER 9. INTERPRETATION AND REPORTING
from a scientific (biological) point of view. Three
Y is not. A representative, though fictitious, ex-
possible outcomes for treatment effects are shown
ample of such claims is for instance:
here as mean values and 95% confidence intervals.

The region encompassed by the confidence inter-
The percentage of neurons showing cue-
val can be interpreted as the set of plausible val-
related activity increased with training in
ues of the treatment effect. The top row shows a
the mutant mice (P < 0.05), but not in the
result that is statistically significant, the 95% confi-
control mice (P > 0.05). (Nieuwenhuis
dence interval does not encompass the zero effect
et al., 2011)
line. However, effect sizes that have no biological relevance are still plausible as is shown by the
upper limit of the confidence interval. The second
row shows the result of an experiment that was not
significant at the 0.05 level. However, the confidence interval reaches well within the area of biological relevance. Therefore, notwithstanding the
nonsignificant outcome, this experiment is inconclusive. The third outcome concerns a result that
was not significant, but the 95% confidence interval does not reach beyond the boundaries of scientific relevance. The nonsignificant result here can
also be interpreted that, with 95% confidence, the
treatment effect was also irrelevant from a scientific point of view.
Such comparisons are absurd, inappropriate and

can be misleading. Indeed, the difference between
significant and not significant is not itself statistically significant (Gelman and Stern, 2006). Unfortunately, such a practice is commonplace. A recent review by Nieuwenhuis et al. (2011) showed
that in the area of cellular and molecular neuroscience the majority of authors erroneously claim
an interaction effect when they obtained a significant result in one group and a nonsignificant result in the other. In view of our discussion of pvalues and nonsignificant findings, it is needless
to say that this approach is completely wrong and
misleading. The correct approach would be to design a factorial experiment and test the interaction effect of genotype on training. In this context,
it must also be noted that carrying out a statistical test to prove equivalence of baseline measurements is also pointless. Tests of significance are not
tests of equivalence. When baseline measurements
are present, their value should be included in the
Figure 9.3. Use of confidence intervals for interpreting statistical results. Estimated treatment effects are displayed with their
95% confidence intervals. The shaded area indicates the zone of
biological relevance.
In the scientific literature, it is common prac-
statistical model.
Finally, when interpreting the results of the ex-
tice to make a sharp distinction between significant periment, the scientist should bear in mind the topand nonsignificant findings and making compar- ics covered in Section 6.7 about effect size inflation
isons of the sort X is statistically significant, while
and Section 7.5 about the pitfalls of p-values.
10. Concluding Remarks and Summary

You know my methods. Apply them.
Sherlock Holmes
The Sign of the Four A.C. Doyle.
To consult the statistician after an experiment is finished is often merely to ask him to conduct a post
mortem examination. He can perhaps say what the experiment died of.
R.A. Fisher (1938)
10.1
Role of the statistician
perhaps they can only say what the experiment

died of.
What we didnt touch yet was the role of the statistician in the research project. The statistician is
10.2
a professional particularly skilled in solving research problems. She should be considered as a
Recommended reading
Statistics Done Wrong: The Woefully Complete Guide
team member and often even as collaborator or
by Reinhart (2015) is in my opinion essential
partner in the research process in which she can
reading material for all scientists working in
have a critical role. Whenever possible, the statisti-
biomedicine and the life sciences in general. This
cian should be consulted when there is doubt with
small book (152 pages) provides a well-written
regard to design, sample size, or statistical analy-
very accessible guide to the most popular statis-
sis. A statistician working closely together with
tical errors and slip-ups committed by scientists
a scientist can greatly improve the projects like-
every day, in the lab and in peer-reviewed jour-
lihood of success. Many applied statisticians be-
nals. Scientists working with laboratory animals
come involved into the subject area and, by virtue
should certainly read the article by Fry (2014) and
of their statistical training, take on the role of statis-
the book The Design and Statistical Analysis of An-
tical thinker, thereby permeating the research pro-
imal Experiments by Bate and Clark (2014). For
cess. In a great many instances this key role of the
those interested in the history of statistics and the

life of famous statisticians, The Lady Tasting Tea
statistician is recognized and granted with a coauthorship.
by Salsburg (2001) is a lucidly written account of

the history of statistics, experimental design and
The most effective way to work with a con-
how statistical thinking revolutionized 20th Cen-
sulting statistician is to include her or him from
tury science. A clear, comprehensive and highly
the very beginning of the project, when the
recommended work on experimental design, is the
study objectives are formulated (Hinkelmann and
book by Selwyn (1996), while, on a more introduc-
Kempthorne, 2008).
What always should be
tory level, there is the book by Ruxton and Cole-
avoided, is contacting the statistical support group
grave (2003). A gentle introduction to statistics
after the experiment has reached its completion,
in general and hypothesis testing, confidence in63
64
CHAPTER 10. CONCLUDING REMARKS AND SUMMARY
tervals and analysis of variance in particular, can
tistical thinking was introduced as a non-specialist
be found in the highly recommended book of the
generalist skill that permeates the entire research
two Wonnacott brothers (Wonnacott and Wonna-
process. The seven principles of statistical think-
cott, 1990). Comprehensive works at an advanced
ing were formulated as: 1) time spent thinking on
level on statistics and experimental design are the
the conceptualization and design of an experiment
books by Kutner et al. (2004), Hinkelmann and
is time wisely spent; 2) the design of an experiment
Kempthorne (2008), Casella (2008), and Giesbrecht
reflects the contributions from different sources of
and Gumpertz (2004). The latter two also provide
variability; 3) the design of an experiment balances
designs suitable for 96-well microtiter plates. For
between its internal validity (proper control of
those who want to carry out their analyses in the
noise) and external validity (the experiments gen-
freely available R-language (R Core Team, 2013) is
eralizability); 4) good experimental practice pro-
the book by Dalgaard (2002) a good starter, while
vides the clue to bias minimization; 5) good ex-
the book by Everitt and Hothorn (2010) is at a
perimental design is the clue to the control of vari-
more advanced level. Guidances and tips for ef-
ability; 6) experimental design integrates various
ficient data visualization can be found in the work
disciplines; 7) a priori consideration of statistical
of Tufte (1983) and in the two books by William
power is an indispensable pillar of an effective ex-
Cleveland (Cleveland, 1993, 1994). Finally, there
periment.
is the freely available e-book Speaking of Graphics

(Lewi, 2006), which takes the reader on a fascinat-
We elaborated on each of these and finally dis-
ing journey through the history of statistical graph-
cussed some points to consider in the interpreta-
ics1 .
tion and reporting of scientific results. In particular, the problems with a blind trust on statistical
10.3
Summary
We have looked at the complexities of the research

process from the vantage point of a generalist. Sta-
1 http://www.datascope.be
hypothesis tests and of exaggerated effect sizes in

small significant studies were highlighted.
REFERENCES
65
References
Altman, D. G., Gore, S. M., Gardner, M. J., and Pocock, S. J.
(1983). Statistical guidelines for contributors to medical journals. BMJ 286, 14891493.
URL http://www.bmj.com/content/286/6376/1489
Amaratunga, D. and Cabrera, J. (2004). Exploration and Analysis of DNA Microarray and Protein Array Data. New York, NY: J.
Wiley.
Anderson, V. and McLean, R. (1974). Design of Experiments.
New York, NY: Marcel Dekker Inc.
Aoki, Y., Helzlsouer, K. J., and Strickland, P. T. (2014).
Arylesterase phenotype-specific positive association between
arylesterase activity and cholinesterase specific activity in human serum. Int. J. Environ. Res. Public Health 11, 14221443.
doi:doi:10.3390/ijerph110201422.
Babij, C. J., Zhang, R. J., Y. anf Kurzeja, Munzli, A., Shehabeldin, A., Fernando, M., Quon, K., Kassner, P. D., RuefliBrasse, A. A., Watson, V. J., Fajardo, F., Jackson, A., Zondlo,
J., Sun, Y., Ellison, A. R., Plewa, C. A., T., S., Robinson, J.,
McCarter, J., Judd, T., Carnahan, J., and Dussault, I. (2011).
STK33 kinase activity is nonessential in KRAS-dependent cancer cells. Cancer Research 71, 58185826. doi:10.1158/00085472.CAN-11-0778.
Baggerly, K. A. and Coombes, K. R. (2009).
Deriving
chemosensitivity from cell lines: Forensic bioinformatics and
reproducible research in high-throughput biology. Annals of
Applied Statistics 3, 13091334. doi:10.1214/09-AOAS291.
Bailar III, J. C. and Mostelller, F. (1988). Guidelines for statistical reporting in articles for medical journals. Ann. Int. Med.
108, 226273.
URL http://www.people.vcu.edu/$\sim$albest/Guidance/
guidelines_for_statistical_reporting.htm
Banerjee, T. and Mukerjee, R. (2007). Optimal factorial designs
for cDNA microarray experiments. Ann. Appl. Stat. 2, 366385.
doi:10.1214/07-AOAS144.
Bate, S. and Clark, R. (2014). The Design and Statistical Analysis
of Animal Experiments. Cambridge, UK: Cambridge University
Press.
Begley, C. G. and Ellis, L. M. (2012). Raise standards for preclinical research. Nature 483, 531533. doi:10.1038/483531a.
Begley, C. G. and Ioannidis, J. P. A. (2015). Reproducibility in science.
Circ. Res. 116, 116126.
doi:10.1161/
CIRCRESAHA114.303819.
Begley, S. (2012). In cancer science, many "discoveries" dont
hold up. Reuters March 28.
URL
http://www.reuters.com/article/2012/03/28/usscience-cancer-idUSBRE82R12P20120328
Bland, M. and Altman, D. (1994). One and two sided tests of
significance. BMJ 309, 248.
Bretz, F., Landgrebe, J., and Brunner, E. (2005). Multiplicity issues in microarray experiments. Methods Inf. Med. 44, 431437.
Brien, C. J., Berger, B., Rabie, H., and Tester, M. (2013). Accounting for variation in designing greenhouse experiments
with special reference to greenhouses containing plants on
conveyor systems. Plant Methods 9, 526. doi:10.1186/17464811-9-5.
Burrows, P. M., Scott, S. W., Barnett, O., and McLaughlin,
M. R. (1984). Use of experimental designs with quantitative
ELISA. J. Virol. Methods 8, 207216.
Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A.,
Flint, J., Robinson, E. S. J., and Munafo, M. R. (2013). Power
failure: Why small sample size undermines the reliability
of neuroscience. Nature Reviews Neuroscience 14, 112. doi:
10.1038/nrn3475.
Casella, G. (2008). Statistical Design. New Tork, NY: Springer.
Champely, S. (2009). pwr: Basic functions for power analysis.
R package version 1.1.1.
URL http://CRAN.R-project.org/package=pwr
Cleveland, W. S. (1993). Visualizing Data. Summit, NJ: Hobart
Press.
Cleveland, W. S. (1994). The Elements of Graphing Data. Summit, NJ: Hobart Press.
Clewer, A. G. and Scarisbrick, D. H. (2001). Practical Statistics
and Experimental Design for Plant and Crop Science. Chichester,
UK: J. Wiley.
Cochran, W. and Cox, G. (1957). Experimental Designs. New
York, NY: John Wiley & Sons Inc., 2nd edition.
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. Hillsdale, NJ: Lawrence Erlbaum Associates, 2nd edition.
Cokol, M., Ozbay, F., and Rodriguez-Esteban, R. (2008).
Retraction rates are on the rise. EMBO Rep. 9, 2. doi:
10.1038/sj.embor.7401143.
URL
http://www.ncbi.nlm.nih.gov/pmc/articles/
PMC2246630/
Colquhoun, D. (2014). An investigation of the discovery rate
and the misinterpretation of p-values. R. Soc. open sci. 1,
140216. doi:10.1098/rsos/140216.
Council of Europe (2006). Appendix a of the european convention for the protection of vertebrate animals used for experimental and other scientific purposes (ets no. 123). guidelines
for accomodation and care of animals (article 5 of the convention). approved by the multilateral consultation.
URL https://www.aaalac.org/about/AppA-ETS123.pdf
Cox, D. (1958). Planning of Experiments. New York, NY: J. Wiley.
Bolch, B. (1968). More on unbiased estimation of the standard

deviation. The American Statician 22, 27.
Curran-Everett, D. (2000). Multiple comparisons: philosophies and illustrations. Am. J. Physiol. Regulatory Integrative
Comp. Physiol. 279, R1R8.
Bretz, F., Hothorn, T., and Westfall, P. (2010). Multiple Comparisons Using R. Boca Raton, FL: CRC Press.
Dalgaard, P. (2002). Introductory Statistics with R. New York,

NY: Springer.
66
Eklund, A. (2010). beeswarm: The bee swarm plot, an alternative to stripchart. R package version 0.0.7.
URL http://CRAN.R-project.org/package=beeswarm
European Food Safety Authority (2012). Final review of the
Sralini et al. (2012) publication on a 2-year rodent feeding
study with glyphosate formulations and GM maize NK603 as
published online on 19 September 2012 in Food and Chemical
Toxicology. EFSA Journal 10, 2986. doi:10.2903/j.efsa.2012.
2986.
Everitt, B. S. and Hothorn, T. (2010). A Handbook of Statistical
Analyses using R. Boca Raton, FL: Chapman and Hall/CRC,
2nd edition.
Faessel, H., Levasseur, L., Slocum, H., and Greco, W. (1999).
Parabolic growth patterns in 96-well plate cell growth experiments. In Vitro Cell. Dev. Biol. Anim. 35, 270278.
Fang, F. C., Steen, R. C., and Casadevall, A. (2012). Misconduct accounts for the majority of retracted scientific publications. Proc. Natl. Acad. Sci. U.S.A. 109, 1702817033. doi:
10.1073/pnas.1212247109.
REFERENCES
Ghlmann, H. and Talloen, W. (2009). Gene Expression Studies

Using Affymetrix Microarrays. Boca Raton, FL: CRC Press.
Goodman, S. (2008). A dirty dozen: Twelve p-value misconceptions. Semin Hematol 45, 135140. doi:10.1053/j.
seminhematol.2008.04.003.
Grafen, A. and Hails, R. (2002). Modern Statistics for the Life
Sciences. Oxford, UK: Oxford University Press.
Greco, W. R., Bravo, G., and Parsons, J. C. (1995). The search
for synergy: a critical review from a response surface perspective. Pharmacol. Rev. 47, 331385.
Hankin, R. K. S. (2005). Recreational mathematics with r: introducing the magic package. R News 5.
Haseman, J. K. (1984). Statistical issues in the design, analysis
and interpretation of animal carcinogenicity studies. Enviromental Health Perspect. 58, 385392.
URL
http://www.ncbi.nlm.nih.gov/pmc/articles/
PMC1569418/
Fernandez, G. C. J. (2007). Design and analysis of commonly

used comparative horticultural experiments. HortScience 42,
10521069.
Hayes, W. (2014). Retraction notice to "Long term toxicity of a

Roundup herbicide and a Roundup-tolerant genetically modified maize" [Food Chem. Toxicol. 50 (2012): 4221-4231]. Food
Chem. Toxicol. 52, 244. doi:10.1016/j.fct.2013.11.047.
Fisher, R. (1962). The place of the design of experiments in the

logic of scientific inference. Colloques Int. Centre Natl. Recherche
Sci. Paris 110, 1319.
Hempel, C. G. (1966).
Philosophy of Natural Science.
Englewood-Cliffs, NJ: Prentice-Hall.
Fisher, R. A. (1935). The Design of Experiments. Edinburgh, UK:

Oliver and Boyd.
Hinkelmann, K. and Kempthorne, O. (2008). Design and Analysis of Experiments. Volume 1. Introduction to Experimental Design.
Hoboken, NJ: J. Wiley, 2nd edition.
Fisher, R. A. (1938). Presidential address: The first session of

the Indian Statistical Conference, Calcutta. Sankhya 4, 1417.
Fitts, D. A. (2010). Improved stopping rules for the design of
small-scale experiments in biomedical and biobehavioral research. Behavior Research Methods 42, 322. doi:10.3758/BRM.
42.1.3.
Fitts, D. A. (2011). Minimizing animal numbers: the variablecriteria stopping rule. Comparative Medicine 61, 206218.
Freedman, L. P., Cockburn, I. M., and Simcoe, T. S. (2015). The
economics of reproducibility in preclinical research. PLoS Biol.
13, e1002165. doi:10.1371/journal.pbio.1002165.
Fry, D. (2014). Experimental design: reduction and refinement
in studies using animals. In K. Bayne and P. Turner, editors,
Laboratory Animal Welfare, chapter 8, pages 95112. London,
UK: Academic Press.
Gart, J., Krewski, D., P.N., L., Tarone, R., and Wahrendorf,
J. (1986). The design and analysis of long-term animal experiments, volume 3 of Statistical Methods in Cancer Research. Lyon,
France: International Agency for Research on Cancer.
Hirst, J. A., Howick, J., Aronson, J. K., Roberts, N., Perera, R.,
Koshiaris, C., and Heneghan, C. (2014). The need for randomization in animal trials: An overview of systematic reviews.
PLOS ONE 9, e98856. doi:10.1371/journal.pone.0098856.
Holland, T. and Holland, C. (2011). Unbiased histological
examinations in toxicological experiments (or, the informed
leading the blinded examination). Toxicol. Pathol. 39, 711714.
doi:10.1177/0192623311406288.
Holman, L., Head, M. L., Lanfear, R., and Jennions, M. D.
(2015). Evidence of experimental bias in the life sciences: why
we need blind data recording. PLoS Biol. 13, e1002190. doi:
10.1371/journal.pbio.1002190.
Hotz, R. L. (2007). Most science studies appear to be tainted
by sloppy analysis. The Wall Street Journal September 14.
URL http://online.wsj.com/article/SB118972683557627104.
html
Gelman, A. and Stern, H. (2006). The difference between "significant" and "not significant" is not itself statistically significant. Am. Stat. 60, 328331.
Hu, B., Simon, J., Gnthardt-Goerg, M. S., Arend, M., Kuster,

T. M., and Rennenberg, H. (2015). Changes in the dynamics of
foliar n metabolites in oak saplings by drought and air warming depend on species and soil type. PLOS ONE 10, e0126701.
doi:10.1371/journal.pone.0126701.
Giesbrecht, F. G. and Gumpertz, M. L. (2004). Planning, Construction, and Statistical Analysis of Comparative Experiments.
New York, NY: J. Wiley.
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Med. 2, e124. doi:10.1371/journal.pmed.
0020124.
Glonek, G. F. V. and Solomon, P. J. (2004). Factorial and time

course designs for cDNA microarray experiments. Biostatistics
5, 89111.
Ioannidis, J. P. A. (2014). How to make more published research true. PLoS Med. 11, e1001747. doi:10.1371/journal.
pmed.1001747.
REFERENCES
Kilkenny, C., Parsons, N., Kadyszewski, E., Festing, M. F. W.,

Cuthill, I. C., Fry, D., Hutton, J., and Altman, D. G. (2009). Survey of the quality of experimental design, statistical analysis
and reporting of research using animals. PLOS ONE 4, e7824.
doi:10.1371/journal.pone.0007824.
Kimmelman, J., Mogil, J. S., and Dirnagl, U. (2014). Distinguishing between exploratory and confirmatory preclinical research will improve translation. PLoS Biol. 12, e1001863. doi:
10.1371/journal.pbio.1001863.
Kutner, M. H., Nachtsheim, C., Neter, J., and Li, W. (2004).
Applied Linear Statistical Models.
Chicago, IL: McGrawHill/Irwin, 5th edition.
Lansky, D. (2002). Strip-plot designs, mixed models, and comparisons between linear and non-linear models for microtitre
plate bioassays. In W. Brown and A. R. Mire-Sluis, editors, The
Design and Analysis of Potency Assays for Biotechnology Products.
Dev. Biol., volume 107, pages 1123. Basel: Karger.
Lazic, S. (2010). The problem of pseudoreplication in neuroscientific studies: is it affecting your analysis. BMC Neuroscience
11. doi:10.1186/1471-2202-11-5.
LeBlanc, D. C. (2004). Statistics: Concepts and Applications for
Science. Sudbury, MA: Jones and Bartlett Publishers.
Leek, J. T., Scharpf, R. B., Bravo, H. C., Simcha, D., Langmead,
B., Johnson, E. W., Geman, D., Baggerly, K., and Irizarry, R. A.
(2010). Tackling the widespread and critical impact of batch
effects in high-throughput data. Nature Reviews Genetics 11,
733739. doi:10.1038/nrg2825.
Lehmann, E. L. (1975). Nonparametrics: Statistical Methods
Based on Ranks. San Francisco, CA: Holden-Day.
Lehr, R. (1992). Sixteen s squared over d squared: a relation
for crude sample size estimates. Stat. Med. 11, 10991102.
Lehrer, J. (2010). The truth wears off. The New Yorker [online]
December 13.
URL http://www.newyorker.com/magazine/2010/12/13/
the-truth-wears-off
Levasseur, L., Faessel, H., Slocum, H., and Greco, W. . (1995).
Precision and pattern in 96-well plate cell growth experiments.
In Proceedings of the American Statistical Association, Biopharmaceutical Section, pages 227232. Alexandria, Virginia: American Statistical Association.
Levine, T. R. and Atkin, C. (2004). The accurate reporting of
software-generated p-values: a cautionary note. Comm. Res.
Rep. 21, 324327. doi:10.1080/08824090409359995.
Lewi, P. J. (2005). The role of statistics in the success of a pharmaceutical research laboratory: a historical case description. J
Chemometr. 19, 282287.
Lewi, P. J. (2006). Speaking of graphics.
URL http://www.datascope.be
Lewi, P. J. and Smith, A. (2007). Successful pharmaceutical
discovery: Paul Janssens concept of drug research. R&D Management 37, 355361. doi:10.1111/j.1467-9310.2007.00481.x.
Loscalzo, J. (2012).
Irreproducible experimental results:
causes, (mis)interpretations, and consequences.
Circulation 125, 12111214. doi:10.1161/CIRCULATIONAHA.112.
098244.
67
Mead, R. (1988). The design of experiments: statistical principles

for practical application. Cambridge, UK: Cambridge University
Press.
Nadon, R. and Shoemaker, J. (2002). Statistical issues with microarrays: processing and analysis. Trends in Genetics 15, 265
271.
Naik, G. (2011). Scientists elusive goal: Reproducing study
results. The Wall Street Journal December 2.
URL
http://online.wsj.com/article/
SB10001424052970203764804577059841672541590.html
Nieuwenhuis, S., Forstmann, B. U., and Wagenmakers, E.-J.
(2011). Erroneous analysis of interactions in neuroscience: a
problem of significance. Nat. Neurosci. 14, 11051107.
Nuzzo, R. (2014). Scientific method: Statistical errors. Nature
506, 150152. doi:10.1038/506150a.
Parkin, S., Pritchett, J., Grimsdich, D., Bruckdorfer, K., Sahota,
P., Lloyd, A., and Overend, P. (2004). Circulating levels of
the chemokines JE and KC in female C3H apolipoprotein-Edeficient and C57BL apolipoprotein-E-deficient mice as potential markers of atherosclerosis development. Biochemical Society Transactions 32, 128130.
Peng, R. (2009). Reproducible research and biostatistics. Biostatistics 10, 405408. doi:10.1093/biostatistics/kxp014.
Peng, R. (2015). The reproducibility crisis in science. Significance 12, 3032. doi:10.1111/j.1740-9713.2015.00827.x.
Potti, A., Dressman, H. K., Bild, A., Riedel, R., Chan, G., Sayer,
R., Cragun, J., Cottrill, H., Kelley, M. J., Petersen, R., Harpole,
D., Marks, J., Berchuck, A., Ginsburg, G. S., Febbo, P., Lancaster, J., and Nevins, J. R. (2006). Genomic signature to guide
the use of chemotherapeutics. Nature Medicine 12, 12941300.
doi:10.1038/nm1491. (Retracted).
Potti, A., Dressman, H. K., Bild, A., Riedel, R., Chan, G., Sayer,
R., Cragun, J., Cottrill, H., Kelley, M. J., Petersen, R., Harpole,
D., Marks, J., Berchuck, A., Ginsburg, G. S., Febbo, P., Lancaster, J., and Nevins, J. R. (2011). Retracted: Genomic signature to guide the use of chemotherapeutics. Nature Medicine
17, 135. doi:10.1038/nm0111-135. (Retracted).
Potvin, C., Lechowicz, M. J., Bell, G., and Schoen, D. (1990).
Spatial, temporal and species-specific patterns of heterogeneity in growth chamber experiments. Canadian Journal of Botany
68, 499504.
Prinz, F., Schlange, A., and Asadullah, K. (2011). Believe it
or not: how much can we rely on published data on potential drug targets. Nature Rev. Drug Discov. 10, 712713. doi:
10.1038/nrd3439-c1.
R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
URL http://www.R-project.org
Reinhart, A. (2015). Statistics Done Wrong: The Woefully Complete Guide. San Francisco, CA: no starch press.
Ritskes-Hoitinga, M. and Strubbe, J. (2007). Nutrition and animal welfare. In E. Kaliste, editor, The Welfare of Laboratory Animals, chapter 5, pages 95112. Dordrecht, The Netherlands:
Springer.
68
Rivenson, A., Hoffmann, D., Prokopczyk, B., Amin, S., and

Hecht, S. S. (1988). Induction of lung and exocrine pancreas
tumors in F344 rats by tobacco-specific and Areca-derived Nnitrosamines. Cancer Res. 48, 69126917.
Ruxton, G. D. and Colegrave, N. (2003). Experimental Design
for the Life Sciences. Oxford, UK: Oxford University Press.
Sailer, M. O. (2013). crossdes: Construction of Crossover Designs.
R package version 1.1-1.
URL http://CRAN.R-project.org/package=crossdes
Salsburg, D. (2001). The Lady Tasting Tea. New York, NY.: Freeman.
Schlain, B., Jethwa, H., Subramanyam, M., Moulder, K., Bhatt,
B., and Molley, M. (2001). Designs for bioassays with plate
location effects. BioPharm International 14, 4044.
Scholl, C., Frhling, S., Dunn, I., Schinzel, A. C., Barbie, D. A.,
Kim, S. Y., Silver, S. J., Tamayo, P., Wadlow, R. C., Ramaswamy,
S., Dhner, K., Bullinger, L., Sandy, P., J.S., B., Root, D. E., Jacks,
T., Hahn, W., and Gilliland, D. G. (2009). Synthetic lethal interaction between oncogenic KRAS dependency and STK33
suppression in human cancer cells. Cell 137, 821834. doi:
10.1016/j.cell.2009.03.017.
Sellke, T., Bayarri, M., and Berger, J. (2001). Calibration of p
values for testing precise null hypotheses. The American Statistician 55, 6271.
Selwyn, M. R. (1996). Principles of Experimental Design for the
Life Sciences. Boca Raton, FL: CRC Press.
Sralini, G.-E., Claire, E., Mesnage, R., Gress, S., Defarge, N.,
Malatesta, M., Hennequin, D., and Vendmois, J. (2012). Long
term toxicity of a roundup herbicide and a roundup-tolerant
genetically modified maize. Food Chem. Toxicol. 50, 42214231.
doi:10.1016/j.fct.2012.08.005.
Sralini, G.-E., Claire, E., Mesnage, R., Gress, S., Defarge, N.,
Malatesta, M., Hennequin, D., and Vendmois, J. (2014). Republished study: long term toxicity of a roundup herbicide
and a roundup-tolerant genetically modified maize. Environmental Sciences Europe 26, 14. doi:10.1186/s12302-014-0014-5.
Snedecor, G. W. and Cochran, W. G. (1980). Statistical Methods.
Ames, IA: Iowa State University Press, 7th edition.
Straetemans, R., OBrien, T., Wouters, L., Van Dun, J., Janicot,
M., Bijnens, L., Burzykowski, T., and M, A. (2005). Design and
analysis of drug combination experiments. Biometrical J 47,
299308.
Tallarida, R. J. (2001). Drug synergism: its detection and applications. J. Pharm. Exp. Ther. 298, 865872.
Temme, A., Smpel, F., Rieber, G. S. E. P., Willecke, K. J. K.,
and Ott, T. (2001). Dilated bile canaliculi and attenuated
decrease of nerve-dependent bile secretion in connexin32deficient mouse liver. Eur. J. Physiol. 442, 961966.
REFERENCES
Tukey, J. W. (1980). We need both exploratory and confirmatory. The American Statistician 34, 2325.
Van Belle, G. (2008). Statistical Rules of Thumb. Hoboken, NJ: J.
Wiley, 2nd edition.
van der Worp, B., Howells, D. W., Sena, E. S., Porritt, M.,
Rewell, S., OCollins, V., and Macleod, M. R. (2010). Can animal models of disease reliably inform human studies. PLoS
Med. 7, e1000245. doi:10.1371/journal.pmed.1000245.
van Luijk, J., Bakker, B., Rovers, M. M., Ritskes-Hoitinga, M.,
de Vries, R. B. M., and Leenaars, M. (2014). Systematic reviews of animal studies; missing link in translational research?
PLOS ONE 9, e89981. doi:10.1371/journal.pone.0089981.
Vandenbroeck, P., Wouters, L., Molenberghs, G., Van Gestel,
J., and Bijnens, L. (2006). Teaching statistical thinking to life
scientists: a case-based approach. J. Biopharm. Stat. 16, 6175.
Ver Donck, L., Pauwels, P. J., Vandeplassche, G., and Borgers,
M. (1986). Isolated rat cardiac myocytes as an experimental
model to study calcium overload: the effect of calcium-entry
blockers. Life Sci. 38, 765772.
Verheyen, F., Racz, R., Borgers, M., Driesen, R. B., Lenders,
M. H., and Flameng, W. J. (2014). Chronic hibernating myocardium in sheep can occur without degenerating events and
is reversed after revascularization. Cardiovasc Pathol. 23, 160
168. doi:10.1016/j.carpath.2014.01.003.
Vlaams Instituut voor Biotechnologie (2012). A scientific
analysis of the rat study conducted by Gilles-Eric Sralini et
al.
URL http://www.vib.be/en/news/Documents/20121008_
EN_Analyse\rattenstudieS{}ralini\et\al.pdf
Wacholder, S., Chanoch, S., Garcia-Closas, M., El ghormli, L.,
and Rothman, N. (2004). Assessing the probability that a positive report is false: an approach for molecular epidemiology
studies. J Natl Cancer Inst 96, 434442. doi:10.1093/jnci/
djh075.
Wasserstein, R. and Lazar, N. (2016). The asas statement on
p-values: context, process, and purpose. The American Statistician 70. doi:10.1080/00031305.2016.1154108. In press.
URL http://dx.doi.org/10.1080/00031305.2016.1154108
Weissgerber, T. L., Milic, N. M., Wionham, S. J., and Garovic,
V. D. (2015). Beyond bar and line graphs: time for a new
data presentation paradigm. PLoS Biol. 13, e1002128. doi:
10.1371/journal.pbio.1002128.
Wilcoxon, F., Rhodes, L. J., and Bradley, R. A. (1963). Two sequential two-sample grouped rank tests with applications to
screening experiments. Biometrics 19, 5884.
Wilks, S. S. (1951). Undergraduate statistical education. J.
Amer. Statist. Assoc. 46, 118.
Tressoldi, P. E., Giofr, D., Sella, F., and Cumming, G. (2013).

High impact = high statistical standards? not necessarily so.
Nature Reviews Neuroscience 8, e56180. doi:10.1371/journal.
pone.0056180.
Witte, J., Elston, R., and Cardon, L. (2000). On the relative

sample size required for multiple comparisons. Statist. Med.
19, 369372.
Tufte, E. R. (1983). The Visual Display of Quantitative Information. Cheshire, CT.: Graphics Press.
Wonnacott, T. H. and Wonnacott, R. J. (1990). Introductory

Statistics. New York, NY.: J. Wiley, 5th edition.
REFERENCES
Yang, H., Harrington, C. A., Vartanian, K., Coldren, C. D.,

Hall, R., and Churchill, G. A. (2008). Randomization in laboratory procedure is key to obtaining reproducible microarray results. PLOS ONE 3, e3724. doi:10.1371/journal.pone.
0003724.
Young, S. S. (1989). Are there location/cage/systematic nontreatment effects in long-term rodent studies? a question re-
69
visited. Fundam Appl Toxicol. 13, 183188.

Zimmer, C. (2012). A sharp rise in retractions prompts calls
for reform. The New York Times April 17.
URL http://www.nytimes.com/2012/04/17/science/risein-scientific-journal-retractions-prompts-calls-forreform.html
70
REFERENCES
Appendices
71
A. Glossary of Statistical Terms

ANOVA : see analysis of variance.
is used to indicate the reliability of an estimate. For a given confidence level, if several
Accuracy : the degree to which a measurement
confidence intervals are constructed based
process is free of bias.
on independent repeats of the study, then on
Additive model : a model in which the combined
the long run, the proportion of such intervals
effect of several explanatory variables or fac-
that contain the true value of the parameter
tors is equal to the sum of their separate ef-
will correspond the confidence level.
fects.
Alternative hypothesis :
Confounding : the phenomenon in which an extraa hypothesis which is
neous variable, not under control of the in-
presumed to hold if the null hypothesis does
vestigator, influences both the factors under
not; the alternative hypothesis is necessary in
study and the response variable.
deciding upon the direction of the test and in

Covariate : a concomitant measurement that is re-
estimating sample sizes.
lated to the response but is not affected by the

Analysis of variance :a statistical method of infer-
treatment.
ence for making simultaneous comparisons

between two or more means.
Critical value : the cutoff or decision value in hypothesis testing which separates the accep-
Balanced design : a term usually applied to any
tance and rejection regions of a test.
experimental design in which the same number of observations is taken for each combi-
Data set : a general term for observations and
nation of the experimental factors.
measurements collected during any type of

scientific investigation.
Bias : the long-run difference between the average

of a measurement process and its true value.
Degrees of freedom : the number of values that are
als are uninformed as to the treatment con-
free to vary in the calculation of a statistic,

e.g. for the standard deviation, the mean is
ditions of the experimental units.
already calculated and puts a restriction on
Blinding : the condition under which individu-
the number of values that can vary; therefore
Block : a set of units which are expected to re-
the degrees of freedom of the standard devi-
spond similarly as a result of treatment.
ation is the number of observations minus 1.
Completely randomized design : a design in which
Effect size : when comparing treatment differ-
each experimental unit is randomized to a
ences, the effect size is the mean difference
single treatment condition or set of treat-
divided by the standard deviation (not stan-
ments.
dard error); the standard deviation can be
Confidence interval : a random interval that de-
from either group, or a poooled standard de-
pends on the data obtained in the study and
viation can be used.

73
74
APPENDIX A. GLOSSARY OF STATISTICAL TERMS
Estimation : an inferential process that uses the
P-value : the probability of obtaining a test statis-
value of a statistic derived from a sample to
tic as extreme as or more extreme than the
estimate the value of a corresponding popu-
observed one, provided the null hypothesis
lation parameter.
is true; small p-values are unlikely when the
Experimental unit : the smalles unit to which different treatments or experimental conditions
can be applied.
null hypothesis holds.

Parameter : a population quantity of interest, examples are the population mean and standard deviation of a normal distribution.
Explanatory variable : also called predictor, a variable which is used in a relationship to explain
Pilot study : a preliminary study performed to
or to predict changes in the values of another
gain initial information to be used in plan-
variable; the latter called the dependent vari-
ning a subsequent, definitive study; pilot
able.
studies are used to refine experimental procedures and provide information on sources
External validity : extent to which the results of a
of bias and variability.
study can be generalized to other situations.

Population : the collection of all subjects or units
Factor : the condition or set of conditions that is
manipulated by the investigator.
Factor level : the particular value of a factor.
about which inference is desired.

Precision : the degree to which a measurement
process is limited in terms of its variability
about a central value.
False negative : the error of accepting the null hypothesis when it is false.
False positive : the error of rejecting the null hypothisis when it is true.
Hypothesis testing : a formal statistical procedure
where one tests a particular hypothesis on
the basis of experimental data.
Internal validity : extent to which a causal conclusion based on a study is warranted.
Level of significance : the allowable rate of false
positives, set prior to analysis of the data.
Protocol : a document describing the plan for

a study; protocols typically contain information on the rationale for performing the
study, the study objectives, experimental
procedures to be followed, sample sizes and
their justification, and the statistical analsyses to be performed; the study protocol must
be distinguished from the technical protocol
which is more about lab instructions.
Randomization :
a well-defined stochastic law
for assigning experimental units to differing

treatment conditions; randomization may
also be applied elsewhere in the experiment.
Null hypothesis : a hypothesis indicating no difference which will either be accepted or rejected as a result of a statistical test.
Observational unit : the unit on which the response is measured or observed; this is
not necessarily identical to the experimental
unit.
Statistic : a mathematical function of the observed

data.
Statistical inference : the process of drawing conclusions from data that is subject to random
variation.
Stochastic : non-determistic, chance dependent.
One-sided test : a statistical test for which the re-
Subsampling : the situation in which measure-
jection region consists of either very large or
ments are taken at several nested levels; the
very small values of the test statistic, but not
highest level is called the primary sampling
of both.
unit; the next level is called the secondary
75
sampling unit, etc.
when subsampling is
very small values of the test statistic.
present, it is of great importance to identify

the correct experimental unit.
Type II error : error made by not rejecting the

null hypothesis when the alternative hypoth-
Test statistic : a statistic used in hypothesis testing; extreme values of the test statistic are unlikely under the null hypothesis.
Treatment : a specific combination of factor levels.
Two-sided test : a statistical test for which the rejection region consists of both very large or
esis is true.
Type I error : error made by the incorrect rejection of a true null hypothesis.
Variability : the random fluctuation of a measurement process about its central value.
76
APPENDIX A. GLOSSARY OF STATISTICAL TERMS
B. Tools for randomization in MS Excel

and R
B.1
Completely randomized de-
treatment codes in column A are now in random
sign
order, i.e. the first animal will receive treatment 2,

the second treatment 3, etc.
Suppose 21 experimental units have to be randomly assigned to three treatment groups, such
B.1.2
that each treatment group contains exactly seven
In the open source statistical language R, the same

result is obtained by
animals
B.1.1
R-Language
>
>
>
>
>
MS Excel
A randomization list is easily constructed using a

spreadsheet program like Microsoft Excel. This is
illustrated in Figure B.1. We enter in the first col-
# make randomization process reproducible

set.seed(14391)
# sequence of treatment codes A,B,C repeated 7 times
x<-rep(c("A","B","C"),7)
x
[1]
[7]
[13]
[19]
umn of the spreadsheet the code for the treatment

(1, 2, 3). Using the RAND() function, we fill the
"A"
"A"
"A"
"A"
"B"
"B"
"B"
"B"
"C" "A" "B" "C"

"C" "A" "B" "C"
"C" "A" "B" "C"
"C"
second column with pseudo-random numbers between 0 and 1. Subsequently the two columns are
selected and the Sort command from the Data-menu
is executed. In the Sort-window that appears now,
we select column B as column to be sorted by. The
> # randomize the sequence in x

> rx<-sample(x)
> rx
[1] "B" "B" "B" "A" "A" "C"
[7] "C" "C" "B" "A" "C" "C"
Figure B.1. Generating a completely randomized design in MS Excel
77
78
APPENDIX B. TOOLS FOR RANDOMIZATION IN MS EXCEL AND R
Figure B.2. Generating a randomized complete block design in MS Excel
.
[13] "A" "B" "C" "B" "A" "A"
[19] "C" "A" "B"
> design<-data.frame(block=rep(1:5,rep(4,5)),treat=treat)
> head(design,10) # first 10 exp units
B.2
1
2
3
4
5
6
7
8
9
10
Randomized complete block

design
Suppose 20 experimental units, organized in 5 blocks of size 4

have to be randomly assigned to 4 treatment groups A, B, C, D,
such that each treatment occurs exactly once in each block.
B.2.1
MS Excel
To generate the design in MS Excel, follow the procedure that

is depicted in Figure B.2. We enter in the first column of the
spreadsheet the code for the treatment (A, C, B, D). The second column (Column B) is filled with an indication of the block
(1:5). Using the RAND() function, we fill the third column with
pseudo-random numbers between 0 and 1. Subsequently, the
three columns are selected and the Sort command from the
Data-menu is executed. In the Sort window that appears now,
we select Column B as first sort criterion and Column C as second sort criterion. The treatment codes in column A are now
for each block in random order, i.e. the first animal in block 1
will receive treatment A, the second treatment D, etc.
B.2.2
>
>
>
>
R-Language
set.seed(3223) # some number

# treatments repeated 5 times
treat<- rep(c("A","B","C","D"),5)
# blocks and treatments
>
>
>
>
>
>
>
block treat
1
A
1
B
1
C
1
D
2
A
2
B
2
C
2
D
3
A
3
B
# randomly distribute units

rdesign<-design[sample(dim(design)[1]),]
# order by blocks for convenience
rdesign<-rdesign[order(rdesign[,"block"]),]
# sequence of units within blocks
# is randomly assigned to treatments
head(rdesign,10)
3
4
1
2
8
6
7
5
9
11
block treat
1
C
1
D
1
A
1
B
2
D
2
B
2
C
2
A
3
A
3
C

Statistical Thinking and Smart Experimental Design 2016

Caricato da

Informazioni sul documento

Copyright

Formati disponibili

Condividi questo documento

Condividi o incorpora il documento

Opzioni di condivisione

Hai trovato utile questo documento?

Questo contenuto è inappropriato?

Copyright:

Formati disponibili

Statistical Thinking and Smart Experimental Design 2016

Caricato da

Copyright:

Formati disponibili

Statistical Thinking and

Smart Research Design by Statistical

The architecture of experimental re-

variability - good experi5

The controlled experiment . .

Scientific research as an iterative, dynamic process . . . .

4.10 The calculation of uncertainty . . . .

Research styles - The smart researcher

Principles of statistical thinking . . .

Common Designs in Biological Experi-

Planning the Experiment

The planning process . . . . . . . . .

The pilot study . . . . . . . . . . . .

Principles of Statistical Design

The structure of the response variable 13

Defining the experimental unit . . .

Balancing internal and external validity . . . . . . . . . . . . . . . . . .

Bias and variability . . . . . . . . . .

Requirements for a good experiment

Strategies for minimizing bias and

The completely randomized

bias - good experimental

The use of controls .

The paired design .

plete block design .

Incomplete block designs . .

The Latin square design . . .

More complex designs . . . . . . . .

The split-plot and strip-plot

The repeated measures design 38

Determining sample size is a risk cost assessment . . . . . . . . . . . .

The Required Number of Replicates Sample Size

The randomized complete

Strategies for minimizing

The context of biomedical experiments 41

The hypothesis testing context - the

The Results section . . . . . . . . . .

Sample size calculations . . . . . . .

Summarizing the data . . . .

Interpreting and reporting

Power analysis computations

How many subsamples . . . . . . . .

Multiplicity and sample size . . . . .

The problem with underpowered

The statistical triangle . . . . . . . .

The statistical model revisited . . . .

Verifying the statistical assumptions

The meaning of the p-value and sta53

The Study Protocol

Interpretation and Reporting

The Methods section . . . . . . . . .

10.1 Role of the statistician . . . . . . . .

The Statistical Analysis

Appendix B Tools for randomization in MS

B.1 Completely randomized design . . .

B.2 Randomized complete block design

sensitive and resistant, etc. Baggerly and Coombes

plagued by problems with the replicability and re-

concluded that they were unable to reproduce the

damage was done. Several clinical trials had started

ical issues (Begley and Ioannidis, 2015; Loscalzo,

2012; Peng, 2015; Prinz et al., 2011; Reinhart, 2015;

was retracted from Nature Medicine because we

van der Worp et al., 2010). The following exam-