Independent and dependent variables
The scientist concludes that, because both groups came into contact with nothing other than measured amounts of soil, warmth, water and light, it could not have been anything else but the new wonder-fertilizer that caused the experimental group to flourish so well. The key factors in the experiment were the following:

O holding every other variable constant for the two groups
O the final measurement of yield and growth to compare the control and experimental groups and to look at differences from the pretest results (the post-test)
O the comparison of one group with another.

Frequently in learning experiments in classroom settings, the independent variable is a stimulus of some kind, a new method in arithmetical computation for example, and the dependent variable is a response, the time taken to do twenty sums using the new method. Most empirical studies in educational settings, however, are quasi-experimental rather than experimental. The single most important difference between the quasi-experiment and the true experiment is that in the former case the researcher undertakes the study with groups that are intact, that is to say, the groups have been constituted by means other than random selection. In this chapter we identify the essential features of true experimental and quasi-experimental designs, our intention being to introduce the reader to the meaning and purpose of control in educational experimentation.

In experiments, researchers can remain relatively aloof from the participants, bringing a degree of objectivity to the research (Robson 2002: 98). Observer effects can distort the experiment: for example, researchers may record inconsistently, inaccurately or selectively, or, less consciously, they may be having an effect on the experiment. Further, participant effects might distort the experiment (see the discussion of the Hawthorne effect in Chapter 6); the fact of simply being in an experiment, rather than what the experiment is doing, might be sufficient to alter participants' behaviour.

In medical experiments these twin concerns are addressed by giving placebos to certain participants, to monitor any changes, and experiments are blind or double blind. In blind experiments, participants are not told whether they are in a control group or an experimental group, though which group they are in is known to the researcher. In a double blind experiment not even the researcher knows whether a participant is in the control or experimental group – that knowledge resides with a third party. These devices are intended to reduce the subtle effects of participants knowing whether they are in a control or experimental group. In educational research it is easier to conduct a blind experiment than a double blind experiment, and it is even possible not to tell participants that they are in an experiment at all, or to tell them that the experiment is about X when, in fact, it is about Y, i.e. to 'put them off the scent'. This form of deception needs to be justified; a common justification is that it enables the experiment to be conducted under more natural conditions, without participants altering their everyday behaviour.

Designs in educational experimentation

There are several different kinds of experimental design, for example:

O the controlled experiment in laboratory conditions (the 'true' experiment): two or more groups
O the field or quasi-experiment in the natural setting rather than the laboratory, but where variables are isolated, controlled and manipulated
O the natural experiment in which it is not possible to isolate and control variables.

We consider these in this chapter (see http://www.routledge.com/textbooks/9780415368780 – Chapter 13, file 13.1.ppt). The laboratory experiment (the classic true experiment) is conducted in a specially contrived, artificial environment, so that variables can be isolated, controlled and manipulated (as in the example of the wheat seeds above). The field experiment is similar to the laboratory experiment in that variables are isolated, controlled and manipulated, but the setting is the real world rather than the artificially constructed world of the laboratory.

Sometimes it is not possible, desirable or ethical to set up a laboratory or field experiment. For example, let us imagine that we wanted to investigate the trauma effects on people in road traffic accidents. We could not require a participant to run under a bus, or another to stand in the way of a moving lorry, or another to be hit by a motorcycle, and so on. Instead we might examine hospital records to see the trauma effects of victims of bus accidents, lorry accidents and motorcycle accidents, and see which group seems to have sustained the greatest traumas. It may be that the lorry accident victims had the greatest trauma, followed by the bus victims. Now, although it is not possible to say with 100 per cent certainty what caused the trauma, some meaningful inferences can still be drawn from such a natural experiment.

Designs considered later in this chapter include:

O the post-test two experimental groups design
O the pretest-post-test two treatment design
O the matched pairs design
O the parametric design
O repeated measures designs.
In the pretest-post-test control and experimental group design, the effect of the intervention can be calculated as follows:

1 Subtract the pretest score from the post-test score for the experimental group to yield score 1.
2 Subtract the pretest score from the post-test score for the control group to yield score 2.
3 Subtract score 2 from score 1.

Using Campbell's and Stanley's terminology, the effect of the experimental intervention is:

(O2 − O1) − (O4 − O3)

If the result is negative then the causal effect was negative.
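To make this calculation concrete, here is a minimal Python sketch (the scores and variable names are invented for illustration, not taken from the source):

# A minimal sketch of the pretest-post-test control group calculation:
# the experimental group's gain minus the control group's gain.
experimental_pretest = 42.0   # O1 (invented score)
experimental_posttest = 58.0  # O2
control_pretest = 41.0        # O3
control_posttest = 47.0       # O4

score_1 = experimental_posttest - experimental_pretest  # gain for E group
score_2 = control_posttest - control_pretest            # gain for C group
effect = score_1 - score_2                              # (O2 - O1) - (O4 - O3)
print(f"Effect of the intervention: {effect:+.1f}")     # +10.0 with these figures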
One problem that has been identified with this particular experimental design is the interaction effect of testing. Good (1963) explains that whereas the various threats to the validity of the experiments listed in Chapter 6 can be thought of as main effects, manifesting themselves in mean differences independently of the presence of other variables, interaction effects, as their name implies, are joint effects and may occur even when no main effects are present. For example, an interaction effect may occur as a result of the pretest measure sensitizing the subjects to the experimental variable. Interaction effects can be controlled for by adding to the pretest-post-test control group design two more groups that do not experience the pretest measures. The result is a four-group design, as suggested by Solomon (1949) below. Later in the chapter, we describe an educational study which built into a pretest-post-test group design a further control group to take account of the possibility of pretest sensitization.

Randomization, Smith (1991: 215) explains, produces equivalence over a whole range of variables, whereas matching produces equivalence over only a few named variables. The use of randomized controlled trials (RCTs), a method used in medicine, is a putative way of establishing causality and generalizability (though, in medicine, the sample sizes for some RCTs are necessarily so small – there being limited sufferers from a particular complaint – that randomization is seriously compromised).

A powerful advocacy of RCTs for planning and evaluation is provided by Boruch (1997). Indeed he argues that the problem of poor experimental controls has led to highly questionable claims being made about the success of programmes (Boruch 1997: 69). Examples of the use of RCTs can be seen in Maynard and Chalmers (1997).

The randomized controlled trial is the 'gold standard' of many educational researchers, as it purports to establish controllability, causality and generalizability (Coe et al. 2000; Curriculum, Evaluation and Management Centre 2000). How far this is true is contested (Morrison 2001b). For example, complexity theory replaces simple causality with an emphasis on networks, linkages, holism, feedback, relationships and interactivity in context (Cohen and Stewart 1995), emergence, dynamical systems, self-organization and an open system (rather than the closed world of the experimental laboratory). Even if we could conduct an experiment, its applicability to ongoing, emerging, interactive, relational, changing, open situations, in practice, may be limited (Morrison 2001b). It is misconceived to hold variables constant in a dynamical, evolving, fluid, open situation.

Further, the laboratory is a contrived, unreal and artificial world. Schools and classrooms are not the antiseptic, reductionist, analysed-out or analysable-out world of the laboratory. Indeed the successionist conceptualization of causality (Harré 1972), wherein researchers make inferences about causality on the basis of observation, must admit its limitations. One cannot infer causes from effects or multiple causes from multiple effects. Generalizability from the laboratory to the classroom is dangerous, yet with field experiments, with their loss of control of variables, generalizability might be equally dangerous. Classical experimental methods, abiding by the need for replicability and predictability, may not be particularly fruitful since, in complex phenomena, results are never clearly replicable or predictable: we never step into the same river twice. In linear thinking small causes bring small effects and large causes bring large effects, but in complexity theory small causes can bring huge effects and huge causes may have little or no effect. Further, to atomize phenomena into measurable variables is to risk losing sight of the whole.
So, for example, the designs might be (see http://www.routledge.com/textbooks/9780415368780 – Chapter 13, file 13.8.ppt):

Experimental1  R  O1  X  O2
Experimental2  R  O3  X  O4
Control        R  O5     O6

This can be extended to the post-test control and experimental group design and the post-test two experimental groups design, and the pretest-post-test two treatment design.

The matched pairs design

In the matched pairs design, participants are matched on independent variables that are considered to have an influence on the dependent variable, such as sex, age and ability. So, first, pairs of participants are selected who are matched in terms of the independent variable under consideration (e.g. whose scores on a particular measure are the same or similar), and then each of the pair is randomly assigned to the control or experimental group. Randomization takes place at the pair rather than the group level. Although, as its name suggests, this ensures effective matching of control and experimental groups, in practice it may not be easy to find sufficiently close matching, particularly in a field experiment, although finding such a close match in a field experiment may increase the control of the experiment considerably. Matched pairs designs are useful if the researcher cannot be certain that individual differences will not obscure treatment effects, as they enable these individual differences to be controlled.

Borg and Gall (1979: 547) set out a useful series of steps in the planning and conduct of an experiment:

1 Carry out a measure of the dependent variable.
2 Assign participants to matched pairs, based on the scores and measures established from Step 1.
3 Randomly assign one member of each pair to the control group and the other to the
experimental group.
4 Administer the experimental treatment/
intervention to the experimental group and,
if appropriate, a placebo to the control
group. Ensure that the control group is not
subject to the intervention.
5 Carry out a measure of the dependent
variable with both groups and
compare/measure them in order to
determine the effect and its size on the
dependent variable.
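By way of illustration, here is a minimal Python sketch of steps 1-3 (the participants, scores and pairing rule are invented for illustration, not Borg and Gall's):

import random

# Step 1: a measure of the dependent variable (hypothetical pretest scores).
participants = {"p1": 52, "p2": 47, "p3": 61, "p4": 49, "p5": 58, "p6": 60}

# Step 2: rank by score so that adjacent participants form matched pairs.
ranked = sorted(participants, key=participants.get)

# Step 3: randomly assign one member of each pair to each condition.
control, experimental = [], []
for a, b in zip(ranked[::2], ranked[1::2]):
    first, second = random.sample((a, b), 2)  # randomization at pair level
    control.append(first)
    experimental.append(second)

print("Control group:     ", control)
print("Experimental group:", experimental)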
Borg and Gall indicate that difficulties arise
in the close matching of the sample of the control
and experimental groups. This involves careful
identification of the variables on which the
matching must take place. Borg and Gall (1979:
547) suggest that matching on a number of
variables that correlate with the dependent
variable is more likely to reduce errors than
matching on a single variable. The problem, of
course, is that the greater the number of variables
that have to be matched, the harder it is actually
to find the sample of people who are matched.
Hence the balance must be struck between
having too few variables such that error can
occur, and having so many variables that it is
impossible to draw a sample. Instead of
matched pairs, random allocation is possible, and
this is discussed below.
Mitchell and Jolley (1988: 103) pose three
important questions that researchers need to
consider when comparing two groups:
O Are the two groups equal at the commence-
ment of the experiment?
O Would the two groups have grown apart
naturally, regardless of the intervention?
O To what extent has initial measurement error
of the two groups been a contributory factor in
differences between scores?
Factorial designs also have to take account of the interaction of the independent variables. For example, one factor (independent variable) may be 'sex' and the other 'age' (Box 13.3). The researcher may be investigating their effects on motivation for learning mathematics (see http://www.routledge.com/textbooks/9780415368780 – Chapter 13, file 13.10.ppt). Here one can see that the difference in motivation for mathematics is not constant between males and females, but that it varies according to the age of the participants. There is an interaction effect between age and sex, such that the effect of sex depends on age. A factorial design is useful for examining interaction effects.

At their simplest, factorial designs may have two levels of an independent variable, e.g. its presence or absence, but, as has been seen here, they can become more complex. That complexity is bought at the price of increasing exponentially the number of groups required.

In a parametric design, by contrast, an independent variable may be divided into several levels: for example, pupils might be grouped into poor, average, good and outstanding readers (four levels of the independent variable 'reading ability'). Four experimental groups are set up to receive the intervention, thus: experimental group one (poor readers); experimental group two (average readers); experimental group three (good readers); and experimental group four (outstanding readers). The control group (group five) would receive no intervention. The researcher could chart the differential effects of the intervention on the groups, and thus have a more sensitive indication of its effects than if there were only one experimental group containing a wide range of reading abilities; the researcher would know which group was most and least affected by the intervention. Parametric designs are useful if an independent variable is considered to have different levels or a range of values which may have a bearing on the outcome (confirmatory research) or if the researcher wishes to discover whether different levels of an independent variable have an effect on the outcome (exploratory research).
Box 13.3
Interaction effects in an experiment (graph: motivation for mathematics plotted against age for males and females)
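To make the interaction effect in Box 13.3 concrete, here is a minimal Python sketch with invented cell means (not data from the source):

# Mean motivation for mathematics by (sex, age group) -- hypothetical values.
means = {
    ("male", "younger"): 60, ("female", "younger"): 70,
    ("male", "older"): 75, ("female", "older"): 55,
}

for age in ("younger", "older"):
    gap = means[("male", age)] - means[("female", age)]
    print(f"{age} pupils: male - female difference = {gap:+d}")

# The difference reverses sign across the age groups: an interaction
# effect, because the effect of sex depends on age.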
Suppose that just such a project has been undertaken and that the researcher finds that O2 scores indicate greater tolerance of ethnic minorities than O1 scores. How justified is the researcher in attributing the cause of O1 − O2 differences to the experimental treatment (X), that is, the term's project work? At first glance the assumption of causality seems reasonable enough. The situation is not that simple, however. Compare for a moment the circumstances represented in our hypothetical educational example with those which typically obtain in experiments in the physical sciences. Physicists who apply heat to a metal bar can confidently attribute the observed expansion to the rise in temperature that they have introduced, because within the confines of the laboratory they have excluded (i.e. controlled) all other extraneous sources of variation (Pilliner 1973).

The same degree of control can never be attained in educational experimentation. At this point readers may care to reflect upon some possible influences other than the ten-week curriculum project that might account for the O1 − O2 differences in our hypothetical educational example. They may conclude that factors to do with the pupils, the teacher, the school, the classroom organization, the curriculum materials and their presentation, the way that the subjects' attitudes were measured, to say nothing of the thousand and one other events that occurred in and about the school during the course of the term's work, might all have exerted some influence upon the observed differences in attitude. These kinds of extraneous variables, which are outside the experimenters' control in one-group pretest-post-test designs, threaten to invalidate their research efforts. We later identify a number of such threats to the validity of educational experimentation.
A pre-experimental design: the one-group post-tests only design

Although such a design appears to be akin to an experiment (an intervention and a post-test), the lack of a pretest, of a control group, of random allocation, and of controls, renders this a flawed methodology.

A pre-experimental design: the post-tests only non-equivalent groups design

Again, although this appears to be akin to an experiment, the lack of a pretest, of matched groups, of random allocation, and of controls, renders this a flawed methodology.

A quasi-experimental design: the pretest-post-test non-equivalent group design

One of the most commonly used quasi-experimental designs in educational research can be represented as:

Experimental  O1  X  O2
-----------------------
Control       O3     O4

The dashed line separating the parallel rows in the diagram of the non-equivalent control group indicates that the experimental and control groups have not been equated by randomization – hence the term 'non-equivalent'. The addition of a control group makes the present design a decided improvement over the one-group pretest-post-test design, for to the degree that experimenters can make E and C groups as equivalent as possible, they can avoid the equivocality of interpretations that plague the pre-experimental design discussed earlier. The equivalence of groups can be strengthened by matching, followed by random assignment to E and C treatments.

Where matching is not possible, the researcher is advised to use samples from the same population or samples that are as alike as possible (Kerlinger 1970). Where intact groups differ substantially, however, matching is unsatisfactory due to regression effects which lead to different group means on post-test measures. Campbell and Stanley (1963) put it this way:

If [in the non-equivalent control group design] the means of the groups are substantially different, then the process of matching not only fails to provide the intended equation but in addition insures the occurrence of unwanted regression effects. It becomes predictably certain that the two groups will differ on their post-test scores altogether independently of any effects of X, and that this difference will vary directly with the difference between the total populations from which the selection was made and inversely with the test-retest correlation.
(Campbell and Stanley 1963: 49)
The one-group time series

Here the one group is the experimental group, and it is given more than one pretest and more than one post-test. The time series uses repeated tests or observations both before and after the treatment, which, in effect, enables the participants to become their own controls and reduces the effects of reactivity. Time series allow for trends to be observed, and avoid reliance on one single pretesting and post-testing data collection point. This enables trends to be observed such as no effect at all (e.g. continuing an existing upward, downward or even trend), a clear effect (e.g. a sustained rise or drop in performance) or delayed effects (e.g. some time after the intervention has occurred). Time series studies have the potential to increase reliability.
Single-case research: ABAB design

At the beginning of Chapter 11, we described case study researchers as typically engaged in observing the characteristics of an individual unit, be it a child, a classroom, a school or a whole community. We went on to contrast case study researchers with experimenters, whom we described as typically concerned with the manipulation of variables in order to determine their causal significance. That distinction, as we shall see, is only partly true. Increasingly, in recent years, single-case research as an experimental methodology has extended to such diverse fields as clinical psychology, medicine, education, social work, psychiatry and counselling. Most of the single-case studies carried out in these (and other) areas share the following characteristics:

O They involve the continuous assessment of some aspect of human behaviour over a period of time, requiring on the part of the researcher the administration of measures on multiple occasions within separate phases of a study.
O They involve 'intervention effects' which are replicated in the same subject(s) over time.

Continuous assessment measures are used as a basis for drawing inferences about the effectiveness of intervention procedures.

The characteristics of single-case research studies are discussed by Kazdin (1982) in terms of ABAB designs, the basic experimental format in most single-case researches. ABAB designs, Kazdin observes, consist of a family of procedures in which observations of performance are made over time for a given client or group of clients. Over the course of the investigation, changes are made in the experimental conditions to which the client is exposed. The basic rationale of the ABAB design is illustrated in Box 13.4. What it does is this: it examines the effects of an intervention by alternating the baseline condition (the A phase), when no intervention is in effect, with the intervention condition (the B phase). The A and B phases are then repeated to complete the four phases. As Kazdin (1982) says, the effects of the intervention are clear if performance improves during the first intervention phase, reverts to or approaches original baseline levels of performance when the treatment is withdrawn, and improves again when treatment is recommenced in the second intervention phase.

An example of the application of the ABAB design in an educational setting is provided by Dietz (1977), whose single-case study sought to measure the effect that a teacher could have upon the disruptive behaviour of an adolescent boy whose persistent talking disturbed his fellow classmates in a special education class. In order to decrease the unwelcome behaviour, a reinforcement programme was devised in which the boy could earn extra time with the teacher by decreasing the number of times he called out. The boy was told that when he made three (or fewer) interruptions during any fifty-five-minute class period, he would earn the reward.
Box 13.4
The ABAB design (graph: performance plotted against days across four phases – baseline, intervention, baseline, intervention; the solid lines in each phase present the actual data, and the dashed lines indicate the projected level of performance from the previous phase)

Single-case designs such as these have been able to provide an experimental technique for evaluating interventions for the individual subject. Moreover, such interventions can be directed towards the particular subject or group.
Box 13.5
An ABAB design in an educational setting (graph: frequency of the target behaviour plotted against sessions across baseline and intervention phases; DRL, differential reinforcement of low rates)
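A minimal Python sketch (with invented counts, not Dietz's data) of how the four phases of an ABAB record might be summarized:

from statistics import mean

# Hypothetical frequency of the target behaviour per session in each phase.
phases = {
    "A1 (baseline)":     [28, 30, 27, 29],
    "B1 (intervention)": [12, 9, 8, 7],
    "A2 (baseline)":     [22, 25, 26, 24],
    "B2 (intervention)": [6, 5, 7, 4],
}

for phase, counts in phases.items():
    print(f"{phase}: mean = {mean(counts):.1f}")

# The intervention effect is credible if behaviour improves in B1, reverts
# towards baseline in A2, and improves again in B2.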
Procedures in conducting experimental research

First, researchers must identify and define the research problem as precisely as possible, always supposing that the problem is amenable to experimental methods. Second, researchers must formulate hypotheses that they wish to test. This involves making predictions about relationships between specific variables and at the same time making decisions about other variables that are to be excluded from the experiment by means of controls. Variables, remember, must have two properties. The first property is that variables must be measurable. Physical fitness, for example, is not directly measurable until it has been operationally defined: making the variable 'physical fitness' operational means defining it by letting something else that is measurable stand for it – a gymnastics test, perhaps. The second property is that the proxy variable must be a valid indicator of the hypothetical variable in which one is interested. That is to say, a gymnastics test probably is a reasonable proxy for physical fitness; height, on the other hand, most certainly is not. Excluding variables from the experiment is inevitable, given constraints of time and money. It follows therefore that one must set up priorities among the variables in which one is interested so that the most important of them can be varied experimentally while others are held constant.

Third, researchers must select appropriate levels at which to test the independent variables. By way of example, suppose an educational psychologist wishes to find out whether longer or shorter periods of reading make for better reading attainment in school settings (see Simon 1978). The psychologist will hardly select five-hour and five-minute periods as appropriate levels; rather, she is more likely to choose thirty-minute and sixty-minute levels, in order to compare with the usual timetabled periods of forty-five minutes' duration. In other words, the experimenter will vary the stimuli at such levels as are of practical interest in the real-life situation. Pursuing the example of reading attainment somewhat further, our hypothetical experimenter will be wise to vary the stimuli in large enough intervals so as to obtain measurable results. Comparing reading periods of forty-four minutes, or forty-six minutes, with timetabled reading lessons of forty-five minutes is scarcely likely to result in observable differences in attainment.

Fourth, researchers must decide which kind of experiment they will adopt, perhaps from the varieties set out in this chapter.
8 Conduct the intervention.
9 Conduct the post-test.
10 Analyse the results.
The sequence of steps 6 and 7 can be reversed;
the intention in putting them in the present
sequence is to ensure that the two groups are
randomly allocated and matched. In experiments
and fixed designs, data are aggregated rather
than related to specific individuals, and the
analysis looks for averages, the range of results,
and their variation.
In calculating differences or similarity between
groups at the stages of the pretest and the post-
test, the t-test for independent samples is often
used.
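For example, a minimal sketch of the independent-samples t-test, assuming SciPy is available (the scores are invented):

from scipy import stats

# Hypothetical post-test scores for the two groups.
experimental = [58, 62, 60, 65, 59, 63, 61, 64]
control = [52, 55, 50, 54, 53, 51, 56, 52]

t, p = stats.ttest_ind(experimental, control)
print(f"t = {t:.2f}, p = {p:.4f}")  # a small p-value suggests a group difference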
Random allocation works only when there are enough subjects so that 'the principle of randomization has a chance to operate as a powerful control'. It is doubtful whether twenty-six pupils in each of the three groups in Bhadwal and Panda's (1991) study constituted 'enough subjects'.
In addition to the matching procedures in
drawing up the sample, and the random
allocation of pupils to experimental and control
groups, the researchers also used analysis of
covariance, as a further means of controlling for
initial differences between E and C groups on
their pretest mean scores on the independent
variables, study habits and attitudes.
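A minimal sketch of such an analysis of covariance, assuming pandas and statsmodels are available (the data are invented, not Bhadwal and Panda's):

import pandas as pd
import statsmodels.formula.api as smf

# Post-test scores modelled on group membership with the pretest as covariate,
# controlling for initial differences between E and C groups.
df = pd.DataFrame({
    "posttest": [60, 64, 58, 66, 52, 55, 50, 54],
    "pretest":  [48, 51, 47, 52, 49, 50, 46, 48],
    "group":    ["E", "E", "E", "E", "C", "C", "C", "C"],
})

model = smf.ols("posttest ~ pretest + C(group)", data=df).fit()
print(model.summary())  # the C(group) coefficient is the adjusted group effect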
The experimental programme involved im-
proving teaching skills, classroom organization,
teaching aids, pupil participation, remedial help,
peer-tutoring and continuous evaluation. In addi-
tion, provision was also made in the experimental
group for ensuring parental involvement and
extra reading materials. It would be startling if such a package of teaching aids and curriculum strategies did not effect significant changes in their recipients, and such was the case in the experimental results. The Experimental Group made highly significant gains in respect of its level of study habits as compared with Control Group 2, where students did not show a marked change. What did surprise the investigators, we suspect, was the significant increase in levels of study habits in Control Group 1. Maybe, they opined, this unexpected result occurred because Control Group 1 pupils were tested immediately prior to the beginning of their annual examinations. On the other hand, they conceded, some unaccountable variables might have been operating. There is, surely, a lesson here for all researchers! (For a set of examples of problematic experiments see http://www.routledge.com/textbooks/9780415368780 – Chapter 13, file 13.1.doc.)
Evidence-based educational research and meta-analysis

Evidence-based education is becoming a widely used method of investigation, bringing together different studies to provide evidence to inform policy-making and planning. Meta-analysis is a research strategy in itself. That this is happening significantly is demonstrated in the establishment of the EPPI-Centre (Evidence for Policy and Practice Information and Co-ordinating Centre) at the University of London (http://eppi.ioe.ac.uk/EPPIWeb/home.aspx), the Social, Psychological, Educational and Criminological Controlled Trials Register (SPECTR), later transferred to the Campbell Collaboration (http://www.campbellcollaboration.org), a parallel to the Cochrane Collaboration in medicine (http://www.cochrane.org/index0.htm), which undertakes systematic reviews and meta-analyses of, typically, experimental evidence in medicine, and the Curriculum, Evaluation and Management (CEM) centre at the University of Durham (http://www.cemcentre.org). 'Evidence' here typically comes from randomized controlled trials of one hue or another (Tymms 1999; Coe et al. 2000; Thomas and Pring 2004: 95), with their emphasis on careful sampling, control of variables, both extraneous and included, and measurements of effect size. The cumulative evidence from collected RCTs is intended to provide a reliable body of knowledge on which to base policy and practice (Coe et al. 2000). Such accumulated data, it is claimed, deliver evidence of 'what works', although Morrison (2001b) suggests that this claim is suspect.

The roots of evidence-based practice lie in medicine, where the advocacy by Cochrane (1972) of randomized controlled trials, together with their systematic review and documentation, led to the foundation of the Cochrane Collaboration (Maynard and Chalmers 1997), which is now worldwide. The careful, quantitatively based research studies that can contribute to the accretion of an evidential base are seen to be a powerful counter to the often untried and under-tested schemes that are injected into practice. More recently evidence-based education has entered the worlds of social policy, social work (MacDonald 1997) and education (Fitz-Gibbon 1997).

At the forefront of educational research in this area are Fitz-Gibbon (1996; 1997; 1999) and Tymms (1996), who, at the Curriculum, Evaluation and Management Centre at the University of Durham, have established one of the world's largest monitoring centres in education. Fitz-Gibbon's work is critical of multilevel modelling and, instead, suggests how indicator systems can be used with experimental methods to provide clear evidence of causality and a ready answer to her own question, 'How do we know what works?' (Fitz-Gibbon 1999: 33).

Echoing Anderson and Biddle (1991), Fitz-Gibbon suggests that policy-makers shun evidence in the development of policy and that practitioners, in the hurly-burly of everyday activity, call upon tacit knowledge rather than the knowledge which is derived from RCTs. However, in a compelling argument (Fitz-Gibbon 1997: 35–6), she suggests that evidence-based approaches are necessary in order to challenge the imposition of unproven practices, solve problems and avoid harmful procedures, and create improvement that leads to more effective learning. Further, such evidence, she contends, should examine effect sizes rather than statistical significance.

While the nature of information in evidence-based education might be contested by researchers whose sympathies (for whatever reason) lie outside randomized controlled trials, the message from Fitz-Gibbon will not go away: the educational community needs evidence on which to base its judgements and actions. The development of indicator systems worldwide attests to the importance of this, be it through assessment and examination data, inspection findings, national and international comparisons of achievement, or target setting. Rather than being a shot in the dark, evidence-based education suggests that policy formation should be informed, and policy decision-making should be based on the best information to date rather than on hunch, ideology or political will. It is bordering on the unethical to implement untried and untested recommendations in educational practice, just as it is unethical to use untested products and procedures on hospital patients without their consent.
Reviews of research that are not carried out systematically, it is suggested, may:

O fail to recognize that sampling error can play a part in creating variations in findings among studies
O overlook differing and conflicting research findings
O fail to examine critically the evidence, methods and conclusions of previous reviews
O overlook the extent to which findings from research are mediated by the characteristics of the sample
O overlook the importance of intervening variables in research
O be unreplicable, because the procedures for integrating the research findings have not been made explicit.
Results, it is argued, should be reported in terms of effect size, that is to say, in terms of how much difference they make rather than only in terms of whether or not the effects are statistically significant at some arbitrary level such as 5 per cent. Because, with effect sizes, it becomes easier to concentrate on the educational significance of a finding rather than trying to assess its importance by its statistical significance, we may finally see statistical significance kept in its place as just one of many possible threats to internal validity. The move towards elevating effect size over significance levels is very important (see also Chapter 24), and signals an emphasis on 'fitness for purpose' (the size of the effect having to be suitable for the researcher's purposes) over arbitrary cut-off points in significance levels as determinants of utility.

The term 'meta-analysis' originated in 1976 (Glass 1976), and early forms of meta-analysis used calculations of combined probabilities and frequencies with which results fell into defined categories (e.g. statistically significant at given levels), although problems of different sample sizes confounded rigour (e.g. large samples would yield significance in trivial effects, while important data from small samples would not be discovered because they failed to reach statistical significance) (Light and Smith 1971; Glass et al. 1981; McGaw 1997: 371).

Glass (1976) and Glass et al. (1981) suggested three levels of analysis:

O primary analysis of the data
O secondary analysis, a re-analysis using different statistics
O meta-analysis, analysing the results of several studies statistically in order to integrate the findings.

Glass et al. (1981) and Hunter et al. (1982) suggest eight steps in the procedure:

1 Identify the variables for focus (independent and dependent).
2 Identify all the studies which feature the variables in which the researcher is interested.
3 Code each study for those characteristics that might be predictors of outcomes and effect sizes (e.g. age of participants, gender, ethnicity, duration of the intervention).
4 Estimate the effect sizes through calculation for each pair of variables (dependent and independent variable) (see Glass 1977), weighting the effect size by the sample size.
5 Calculate the mean and the standard deviation of effect sizes across the studies, i.e. the variance across the studies.
6 Determine the effects of sampling errors, measurement errors and range of restriction.
7 If a large proportion of the variance is attributable to the issues in Step 6, then the average effect size can be considered an accurate estimate of relationships between variables.
8 If a large proportion of the variance is not attributable to the issues in Step 6, then review those characteristics of interest which correlate with the study effects.
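A minimal Python sketch (with invented values) of the weighting described in steps 4 and 5:

# Hypothetical (effect size, sample size) pairs for several studies.
studies = [(0.40, 30), (0.25, 120), (0.60, 45), (0.10, 200)]

total_n = sum(n for _, n in studies)
weighted_mean = sum(es * n for es, n in studies) / total_n
weighted_var = sum(n * (es - weighted_mean) ** 2 for es, n in studies) / total_n

print(f"Weighted mean effect size: {weighted_mean:.3f}")
print(f"Variance across studies:   {weighted_var:.4f}")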
Cook et al. (1992: 7–12) set out a five-step model for an integrative review as a research process, covering:

1 Problem formulation, where a high-quality meta-analysis must be rigorous in its attention to the design, conduct and analysis of the review.
2 Data collection, where sampling of studies for review has to demonstrate fitness for purpose.
3 Data retrieval and analysis, where threats to validity in non-experimental research – of which integrative review is an example – are addressed. Validity here must demonstrate fitness for purpose, reliability in coding, and attention to the methodological rigour of the original pieces of research.
4 Analysis and interpretation, where the accumulated findings of several pieces of research should be regarded as complex data points that have to be interpreted by meticulous statistical analysis.

Fitz-Gibbon (1984: 141–2) sets out four steps in conducting a meta-analysis:

1 Finding studies (e.g. published, unpublished, reviews) from which effect sizes can be computed.
2 Coding the study characteristics (e.g. date, publication status, design characteristics, quality of design, status of researcher).
3 Measuring the effect sizes (e.g. locating the experimental group as a z-score in the control group distribution) so that outcomes can be measured on a common scale, controlling for 'lumpy data' (non-independent data from a large data set).
4 Correlating effect sizes with context variables (e.g. to identify differences between well-controlled and poorly controlled studies).

Effect sizes (e.g. Cohen's d and eta squared) are the preferred statistics over statistical significance in meta-analyses, and we discuss this in Part Five. Effect size is a measure of the degree to which a phenomenon is present or the degree to which a null hypothesis is not supported. Wood (1995: 393) suggests that effect size can be calculated by dividing the significance level by the sample size. Glass et al. (1981: 29, 102) calculate the effect size as:

(mean of experimental group − mean of control group) / standard deviation of the control group
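In code, this calculation might look as follows (a minimal sketch with invented scores):

from statistics import mean, stdev

experimental = [58, 62, 60, 65, 59, 63, 61, 64]
control = [52, 55, 50, 54, 53, 51, 56, 52]

# Glass's effect size: the mean difference divided by the standard
# deviation of the control group.
effect_size = (mean(experimental) - mean(control)) / stdev(control)
print(f"Effect size: {effect_size:.2f}")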
Hedges (1981) and Hunter et al. (1982) suggest alternative equations to take account of differential weightings due to sample size variations. The two most frequently used indices of effect size are standardized mean differences and correlations (Hunter et al. 1982: 373), though non-parametric statistics, e.g. the median, can be used. Lipsey (1992: 93–100) sets out a series of statistical tests for working on effect sizes, effect size means and homogeneity. It is clear from this that Glass and others assume that meta-analysis can be undertaken only for a particular kind of research – the experimental type – rather than for all types of research; this might limit its applicability.

Glass et al. (1981) suggest that meta-analysis is particularly useful when it uses unpublished dissertations, as these often contain weaker correlations than those reported in published research, and hence act as a brake on misleading, more spectacular generalizations. Meta-analysis, it is claimed (Cooper and Rosenthal 1980), is a means of avoiding Type II errors (failing to find effects that really exist), synthesizing research findings more rigorously and systematically, and generating hypotheses for future research. However, Hedges and Olkin (1980) and Cook et al. (1992: 297) show that Type II errors become more likely as the number of studies included in the sample increases. Further, Rosenthal (1991) has indicated a method for avoiding Type I errors (finding an effect that, in fact, does not exist) that is based on establishing how many unpublished studies that average a null result would need to be undertaken to offset the group of published statistically significant studies. For one example he shows a ratio of 277:1 of unpublished to published research, thereby indicating the limited bias in published research.

Meta-analysis is not without its critics (e.g. Wolf 1986; Elliott 2001; Thomas and Pring 2004). Wolf (1986: 14–17) suggests six main areas:

O It is difficult to draw logical conclusions from studies that use different interventions, measurements, definitions of variables, and participants.
O Results from poorly designed studies take their place alongside results from higher quality studies.
O Published research is favoured over unpublished research.
O Multiple results from a single study are used, making the overall meta-analysis appear more reliable than it is, since the results are not independent.
O Interaction effects are overlooked in favour of main effects.
O Meta-analysis may have 'mischievous consequences' (Wolf 1986: 16) because its apparent objectivity and precision may disguise procedural invalidity in the studies.

Wolf (1986) provides a robust response to these criticisms, both theoretically and empirically. Wolf (1986: 55–6) also suggests a ten-step sequence for carrying out meta-analyses rigorously:

1 Make clear the criteria for inclusion and exclusion of studies.
Because only a limited number of common variables tend to be measured in each case, explains Tripp (1985), cumulation of the studies tends to increase sample size much more than it increases the complexity of the data in terms of the number of variables. Meta-analysis risks attempting to synthesize studies which are insufficiently similar to each other to permit this with any legitimacy (Glass et al. 1981: 22; McGaw 1997: 372) other than at an unhelpful level of generality. The analogy here might be to try to keep together oil and water as 'liquids'; meta-analysts would argue that differences between studies and their relationships to findings can be coded and addressed in meta-analysis. Eysenck (1978) suggests that early meta-evaluation studies mixed apples with oranges. Morrison (2001b) asks:

How can we be certain that meta-analysis is fair if the hypotheses for the separate experiments were not identical, if the hypotheses were not operationalizations of the identical constructs, if the conduct of the separate RCTs (e.g. time frames, interventions and programmes, controls, constitution of the groups, characteristics of the participants, measures used) were not identical?
(Morrison 2001b: 78)

Although Glass et al. (1981: 218–20) address these kinds of charges, it remains the case (McGaw 1997) that there is a risk in meta-analysis of dealing indiscriminately with a large and sometimes incoherent body of research literature. It is unclear, too, how meta-analysis differentiates between 'good' and 'bad' research – e.g. between methodologically rigorous and poorly constructed research (Cook et al. 1992: 297). Smith and Glass (1977) and Levačić and Glatter (2000) suggest that it is possible to use study findings regardless of their methodological quality, though Glass and Smith (1978) and Slavin (1984a, 1984b), in a study of the effects of class size, indicate that methodological quality does make a difference. Glass et al. (1981: 220–6) effectively address the charge of using data from 'poor' studies, arguing, among other points, that many weak studies can add up to a strong conclusion, and that the differences in the size of experimental effects between high-validity and low-validity studies are surprisingly small (Glass et al. 1981: 221, 226).

Further, Wood (1995: 296) suggests that meta-analysis oversimplifies results by concentrating on overall effects to the neglect of the interaction of intervening variables. To the charge that, because meta-analyses are frequently conducted on large data sets where multiple results derive from the same study (i.e. that the data are non-independent), they are therefore unreliable, Glass et al. (1981: 153–216) indicate how this can be addressed by using sophisticated data analysis techniques. Finally, a practical concern is the time required not only to use the easily discoverable studies (typically large-scale published studies) but also to include the smaller-scale unpublished studies; the effect of neglecting the latter might be to build in bias in the meta-analysis. It is the traditional pursuit of generalizations from each quantitative study which has most hampered the development of a database adequate to reflect the complexity of the social nature of education. The cumulative effects of 'good' and 'bad' experimental studies are graphically illustrated in Box 13.6.

An example of meta-analysis in educational research

Glass and Smith (1978) and Glass et al. (1981: 35–44) identified 77 empirical studies of the relationship between class size and pupil learning. These studies yielded 725 comparisons of the achievements of smaller and larger classes, the comparisons resting on data accumulated from nearly 900,000 pupils of all ages and aptitudes studying all manner of school subjects. Using regression analysis, the 725 comparisons were integrated into a single curve showing the relationship between class size and achievement in general. This curve revealed a definite inverse relationship between class size and pupil learning.
Box 13.6
Class size and learning in well-controlled and poorly controlled studies (graph: regression lines for the regression of achievement, expressed in percentile ranks, onto class size for studies that were well-controlled and poorly controlled in the assignment of pupils to classes)
When the researchers derived similar curves for a variety of circumstances that they hypothesized would alter the basic relationship (for example, grade level, subject taught, pupil ability, etc.), virtually none of these special circumstances altered the basic relationship. Only one factor substantially affected the curve – whether the original study controlled adequately in the experimental sense for initial differences among pupils and teachers in smaller and larger classes. Adequate and inadequate control curves are set out in Box 13.6.
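A minimal Python sketch (with invented summary data, not the actual 725 comparisons) of this kind of regression integration:

import numpy as np

# Hypothetical mean achievement (percentile ranks) at various class sizes.
class_size = np.array([10, 15, 20, 25, 30, 35, 40], dtype=float)
achievement = np.array([82, 76, 71, 68, 65, 63, 62], dtype=float)

slope, intercept = np.polyfit(class_size, achievement, 1)
print(f"achievement = {intercept:.1f} + ({slope:.2f} x class size)")
# A negative slope reflects the inverse relationship between class size
# and pupil learning reported by Glass and Smith (1978).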