Independent and dependent variables
The scientist concludes that, because both groups came into contact with nothing other than measured amounts of soil, warmth, water and light, it could not have been anything else but the new wonder-fertilizer that caused the experimental group to flourish so well. The key factors in the experiment were the following:

O holding every other variable constant for the two groups
O the final measurement of yield and growth to compare the control and experimental groups and to look at differences from the pretest results (the post-test)
O the comparison of one group with another.

Frequently in learning experiments in classroom settings, the independent variable is a stimulus of some kind, a new method in arithmetical computation for example, and the dependent variable is a response, the time taken to do twenty sums using the new method. Most empirical studies in educational settings, however, are quasi-experimental rather than experimental. The single most important difference between the quasi-experiment and the true experiment is that in the former case the researcher undertakes the study with groups that are intact, that is to say, the groups have been constituted by means other than random selection. In this chapter we identify the essential features of true experimental and quasi-experimental designs, our intention being to introduce the reader to the meaning and purpose of control in educational experimentation.

In experiments, researchers can remain relatively aloof from the participants, bringing a degree of objectivity to the research (Robson 2002: 98). Observer effects can distort the experiment: for example, researchers may record inconsistently, inaccurately or selectively, or, less consciously, they may be having an effect on the experiment. Further, participant effects might distort the experiment (see the discussion of the Hawthorne effect in Chapter 6); the fact of simply being in an experiment, rather than what the experiment is doing, might be sufficient to alter participants' behaviour.

In medical experiments these twin concerns are addressed by giving placebos to certain participants, to monitor any changes, and experiments are blind or double blind. In blind experiments, participants are not told whether they are in a control group or an experimental group, though which group they are in is known to the researcher. In a double blind experiment not even the researcher knows whether a participant is in the control or experimental group – that knowledge resides with a third party. These devices are intended to reduce the subtle effects of participants knowing whether they are in a control or experimental group. In educational research it is easier to conduct a blind experiment than a double blind experiment, and it is even possible not to tell participants that they are in an experiment at all, or to tell them that the experiment is about X when, in fact, it is about Y, i.e. to 'put them off the scent'. This form of deception needs to be justified; a common justification is that it enables the experiment to be conducted under more natural conditions, without participants altering their everyday behaviour.

Designs in educational experimentation

There are several different kinds of experimental design, for example:

O the controlled experiment in laboratory conditions (the 'true' experiment): two or more groups
O the field or quasi-experiment in the natural setting rather than the laboratory, but where variables are isolated, controlled and manipulated
O the natural experiment in which it is not possible to isolate and control variables.

We consider these in this chapter (see http://www.routledge.com/textbooks/9780415368780 – Chapter 13, file 13.1.ppt). The laboratory experiment (the classic true experiment) is conducted in a specially contrived, artificial environment, so that variables can be isolated, controlled and manipulated (as in the example of the wheat seeds above). The field experiment is similar to the laboratory experiment in that variables are isolated, controlled and manipulated, but the setting is the real world rather than the artificially constructed world of the laboratory.

Sometimes it is not possible, desirable or ethical to set up a laboratory or field experiment. For example, let us imagine that we wanted to investigate the trauma effects on people in road traffic accidents. We could not require a participant to run under a bus, or another to stand in the way of a moving lorry, or another to be hit by a motorcycle, and so on. Instead we might examine hospital records to see the trauma effects of victims of bus accidents, lorry accidents and motorcycle accidents, and see which group seems to have sustained the greatest traumas. It may be that the lorry accident victims had the greatest trauma, followed by the bus victims. Now, although it is not possible to say with 100 per cent certainty what caused the trauma, some meaningful inferences can still be drawn from such a natural experiment.

Designs considered later in this chapter include:

O the post-test two experimental groups design
O the pretest-post-test two treatment design
O the matched pairs design
O the parametric design
O repeated measures designs.
In the pretest-post-test control and experimental group design, the effect of the intervention can be calculated as follows:

1 Subtract the pretest score from the post-test score for the experimental group to yield score 1.
2 Subtract the pretest score from the post-test score for the control group to yield score 2.
3 Subtract score 2 from score 1.

Using Campbell's and Stanley's terminology, the effect of the experimental intervention is:

(O2 − O1) − (O4 − O3)

If the result is negative then the causal effect was negative.
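To make this calculation concrete, here is a minimal Python sketch (the scores and variable names are invented for illustration, not taken from the source):

# A minimal sketch of the pretest-post-test control group calculation:
# the experimental group's gain minus the control group's gain.
experimental_pretest = 42.0   # O1 (invented score)
experimental_posttest = 58.0  # O2
control_pretest = 41.0        # O3
control_posttest = 47.0       # O4

score_1 = experimental_posttest - experimental_pretest  # gain for E group
score_2 = control_posttest - control_pretest            # gain for C group
effect = score_1 - score_2                              # (O2 - O1) - (O4 - O3)
print(f"Effect of the intervention: {effect:+.1f}")     # +10.0 with these figures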
One problem that has been identified with this particular experimental design is the interaction effect of testing. Good (1963) explains that whereas the various threats to the validity of the experiments listed in Chapter 6 can be thought of as main effects, manifesting themselves in mean differences independently of the presence of other variables, interaction effects, as their name implies, are joint effects and may occur even when no main effects are present. For example, an interaction effect may occur as a result of the pretest measure sensitizing the subjects to the experimental variable. Interaction effects can be controlled for by adding to the pretest-post-test control group design two more groups that do not experience the pretest measures. The result is a four-group design, as suggested by Solomon (1949) below. Later in the chapter, we describe an educational study which built into a pretest-post-test group design a further control group to take account of the possibility of pretest sensitization.

Randomization, Smith (1991: 215) explains, produces equivalence over a whole range of variables, whereas matching produces equivalence over only a few named variables. The use of randomized controlled trials (RCTs), a method used in medicine, is a putative way of establishing causality and generalizability (though, in medicine, the sample sizes for some RCTs are necessarily so small – there being limited sufferers from a particular complaint – that randomization is seriously compromised).

A powerful advocacy of RCTs for planning and evaluation is provided by Boruch (1997). Indeed he argues that the problem of poor experimental controls has led to highly questionable claims being made about the success of programmes (Boruch 1997: 69). Examples of the use of RCTs can be seen in Maynard and Chalmers (1997).

The randomized controlled trial is the 'gold standard' of many educational researchers, as it purports to establish controllability, causality and generalizability (Coe et al. 2000; Curriculum, Evaluation and Management Centre 2000). How far this is true is contested (Morrison 2001b). For example, complexity theory replaces simple causality with an emphasis on networks, linkages, holism, feedback, relationships and interactivity in context (Cohen and Stewart 1995), emergence, dynamical systems, self-organization and an open system (rather than the closed world of the experimental laboratory). Even if we could conduct an experiment, its applicability to ongoing, emerging, interactive, relational, changing, open situations, in practice, may be limited (Morrison 2001b). It is misconceived to hold variables constant in a dynamical, evolving, fluid, open situation.

Further, the laboratory is a contrived, unreal and artificial world. Schools and classrooms are not the antiseptic, reductionist, analysed-out or analysable-out world of the laboratory. Indeed the successionist conceptualization of causality (Harré 1972), wherein researchers make inferences about causality on the basis of observation, must admit its limitations. One cannot infer causes from effects or multiple causes from multiple effects. Generalizability from the laboratory to the classroom is dangerous, yet with field experiments, with their loss of control of variables, generalizability might be equally dangerous. Classical experimental methods, abiding by the need for replicability and predictability, may not be particularly fruitful since, in complex phenomena, results are never clearly replicable or predictable: we never step into the same river twice. In linear thinking small causes bring small effects and large causes bring large effects, but in complexity theory small causes can bring huge effects and huge causes may have little or no effect. Further, to atomize phenomena into measurable variables is to risk losing sight of the whole.
So, for example, the designs might be (see http://www.routledge.com/textbooks/9780415368780 – Chapter 13, file 13.8.ppt):

Experimental1  R  O1  X  O2
Experimental2  R  O3  X  O4
Control        R  O5     O6

This can be extended to the post-test control and experimental group design and the post-test two experimental groups design, and the pretest-post-test two treatment design.

The matched pairs design

In the matched pairs design, participants are matched on independent variables that are considered to have an influence on the dependent variable, such as sex, age and ability. So, first, pairs of participants are selected who are matched in terms of the independent variable under consideration (e.g. whose scores on a particular measure are the same or similar), and then each of the pair is randomly assigned to the control or experimental group. Randomization takes place at the pair rather than the group level. Although, as its name suggests, this ensures effective matching of control and experimental groups, in practice it may not be easy to find sufficiently close matching, particularly in a field experiment, although finding such a close match in a field experiment may increase the control of the experiment considerably. Matched pairs designs are useful if the researcher cannot be certain that individual differences will not obscure treatment effects, as they enable these individual differences to be controlled.

Borg and Gall (1979: 547) set out a useful series of steps in the planning and conduct of an experiment:

1 Carry out a measure of the dependent variable.
2 Assign participants to matched pairs, based on the scores and measures established from Step 1.
3 Randomly assign one member of each pair to the control group and the other to the
experimental group.
4 Administer the experimental treatment/
intervention to the experimental group and,
if appropriate, a placebo to the control
group. Ensure that the control group is not
subject to the intervention.
5 Carry out a measure of the dependent
variable with both groups and
compare/measure them in order to
determine the effect and its size on the
dependent variable.
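By way of illustration, here is a minimal Python sketch of steps 1-3 (the participants, scores and pairing rule are invented for illustration, not Borg and Gall's):

import random

# Step 1: a measure of the dependent variable (hypothetical pretest scores).
participants = {"p1": 52, "p2": 47, "p3": 61, "p4": 49, "p5": 58, "p6": 60}

# Step 2: rank by score so that adjacent participants form matched pairs.
ranked = sorted(participants, key=participants.get)

# Step 3: randomly assign one member of each pair to each condition.
control, experimental = [], []
for a, b in zip(ranked[::2], ranked[1::2]):
    first, second = random.sample((a, b), 2)  # randomization at pair level
    control.append(first)
    experimental.append(second)

print("Control group:     ", control)
print("Experimental group:", experimental)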
Borg and Gall indicate that difficulties arise
in the close matching of the sample of the control
and experimental groups. This involves careful
identification of the variables on which the
matching must take place. Borg and Gall (1979:
547) suggest that matching on a number of
variables that correlate with the dependent
variable is more likely to reduce errors than
matching on a single variable. The problem, of
course, is that the greater the number of variables
that have to be matched, the harder it is actually
to find the sample of people who are matched.
Hence the balance must be struck between
having too few variables such that error can
occur, and having so many variables that it is
impossible to draw a sample. Instead of
matched pairs, random allocation is possible, and
this is discussed below.
Mitchell and Jolley (1988: 103) pose three
important questions that researchers need to
consider when comparing two groups:
O Are the two groups equal at the commence-
ment of the experiment?
O Would the two groups have grown apart
naturally, regardless of the intervention?
O To what extent has initial measurement error
of the two groups been a contributory factor in
differences between scores?
Factorial designs also have to take account of the interaction of the independent variables. For example, one factor (independent variable) may be 'sex' and the other 'age' (Box 13.3). The researcher may be investigating their effects on motivation for learning mathematics (see http://www.routledge.com/textbooks/9780415368780 – Chapter 13, file 13.10.ppt). Here one can see that the difference in motivation for mathematics is not constant between males and females, but that it varies according to the age of the participants. There is an interaction effect between age and sex, such that the effect of sex depends on age. A factorial design is useful for examining interaction effects.

At their simplest, factorial designs may have two levels of an independent variable, e.g. its presence or absence, but, as has been seen here, they can become more complex. That complexity is bought at the price of increasing exponentially the number of groups required.

In a parametric design, by contrast, an independent variable may be divided into several levels: for example, pupils might be grouped into poor, average, good and outstanding readers (four levels of the independent variable 'reading ability'). Four experimental groups are set up to receive the intervention, thus: experimental group one (poor readers); experimental group two (average readers); experimental group three (good readers); and experimental group four (outstanding readers). The control group (group five) would receive no intervention. The researcher could chart the differential effects of the intervention on the groups, and thus have a more sensitive indication of its effects than if there were only one experimental group containing a wide range of reading abilities; the researcher would know which group was most and least affected by the intervention. Parametric designs are useful if an independent variable is considered to have different levels or a range of values which may have a bearing on the outcome (confirmatory research) or if the researcher wishes to discover whether different levels of an independent variable have an effect on the outcome (exploratory research).
Box 13.3
Interaction effects in an experiment (graph: motivation for mathematics plotted against age for males and females)
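To make the interaction effect in Box 13.3 concrete, here is a minimal Python sketch with invented cell means (not data from the source):

# Mean motivation for mathematics by (sex, age group) -- hypothetical values.
means = {
    ("male", "younger"): 60, ("female", "younger"): 70,
    ("male", "older"): 75, ("female", "older"): 55,
}

for age in ("younger", "older"):
    gap = means[("male", age)] - means[("female", age)]
    print(f"{age} pupils: male - female difference = {gap:+d}")

# The difference reverses sign across the age groups: an interaction
# effect, because the effect of sex depends on age.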
Suppose that just such a project has been undertaken and that the researcher finds that O2 scores indicate greater tolerance of ethnic minorities than O1 scores. How justified is the researcher in attributing the cause of O1 − O2 differences to the experimental treatment (X), that is, the term's project work? At first glance the assumption of causality seems reasonable enough. The situation is not that simple, however. Compare for a moment the circumstances represented in our hypothetical educational example with those which typically obtain in experiments in the physical sciences. Physicists who apply heat to a metal bar can confidently attribute the observed expansion to the rise in temperature that they have introduced, because within the confines of the laboratory they have excluded (i.e. controlled) all other extraneous sources of variation (Pilliner 1973).

The same degree of control can never be attained in educational experimentation. At this point readers may care to reflect upon some possible influences other than the ten-week curriculum project that might account for the O1 − O2 differences in our hypothetical educational example. They may conclude that factors to do with the pupils, the teacher, the school, the classroom organization, the curriculum materials and their presentation, the way that the subjects' attitudes were measured, to say nothing of the thousand and one other events that occurred in and about the school during the course of the term's work, might all have exerted some influence upon the observed differences in attitude. These kinds of extraneous variables, which are outside the experimenters' control in one-group pretest-post-test designs, threaten to invalidate their research efforts. We later identify a number of such threats to the validity of educational experimentation.
A pre-experimental design: the one-group post-tests only design

Although such a design appears to be akin to an experiment (an intervention and a post-test), the lack of a pretest, of a control group, of random allocation, and of controls, renders this a flawed methodology.

A pre-experimental design: the post-tests only non-equivalent groups design

Again, although this appears to be akin to an experiment, the lack of a pretest, of matched groups, of random allocation, and of controls, renders this a flawed methodology.

A quasi-experimental design: the pretest-post-test non-equivalent group design

One of the most commonly used quasi-experimental designs in educational research can be represented as:

Experimental  O1  X  O2
-----------------------
Control       O3     O4

The dashed line separating the parallel rows in the diagram of the non-equivalent control group indicates that the experimental and control groups have not been equated by randomization – hence the term 'non-equivalent'. The addition of a control group makes the present design a decided improvement over the one-group pretest-post-test design, for to the degree that experimenters can make E and C groups as equivalent as possible, they can avoid the equivocality of interpretations that plague the pre-experimental design discussed earlier. The equivalence of groups can be strengthened by matching, followed by random assignment to E and C treatments.

Where matching is not possible, the researcher is advised to use samples from the same population or samples that are as alike as possible (Kerlinger 1970). Where intact groups differ substantially, however, matching is unsatisfactory due to regression effects which lead to different group means on post-test measures. Campbell and Stanley (1963) put it this way:

If [in the non-equivalent control group design] the means of the groups are substantially different, then the process of matching not only fails to provide the intended equation but in addition insures the occurrence of unwanted regression effects. It becomes predictably certain that the two groups will differ on their post-test scores altogether independently of any effects of X, and that this difference will vary directly with the difference between the total populations from which the selection was made and inversely with the test-retest correlation.
(Campbell and Stanley 1963: 49)
The one-group time series

Here the one group is the experimental group, and it is given more than one pretest and more than one post-test. The time series uses repeated tests or observations both before and after the treatment, which, in effect, enables the participants to become their own controls and reduces the effects of reactivity. Time series allow for trends to be observed, and avoid reliance on one single pretesting and post-testing data collection point. This enables trends to be observed such as no effect at all (e.g. continuing an existing upward, downward or even trend), a clear effect (e.g. a sustained rise or drop in performance) or delayed effects (e.g. some time after the intervention has occurred). Time series studies have the potential to increase reliability.
Single-case research: ABAB design

At the beginning of Chapter 11, we described case study researchers as typically engaged in observing the characteristics of an individual unit, be it a child, a classroom, a school or a whole community. We went on to contrast case study researchers with experimenters, whom we described as typically concerned with the manipulation of variables in order to determine their causal significance. That distinction, as we shall see, is only partly true. Increasingly, in recent years, single-case research as an experimental methodology has extended to such diverse fields as clinical psychology, medicine, education, social work, psychiatry and counselling. Most of the single-case studies carried out in these (and other) areas share the following characteristics:

O They involve the continuous assessment of some aspect of human behaviour over a period of time, requiring on the part of the researcher the administration of measures on multiple occasions within separate phases of a study.
O They involve 'intervention effects' which are replicated in the same subject(s) over time.

Continuous assessment measures are used as a basis for drawing inferences about the effectiveness of intervention procedures.

The characteristics of single-case research studies are discussed by Kazdin (1982) in terms of ABAB designs, the basic experimental format in most single-case researches. ABAB designs, Kazdin observes, consist of a family of procedures in which observations of performance are made over time for a given client or group of clients. Over the course of the investigation, changes are made in the experimental conditions to which the client is exposed. The basic rationale of the ABAB design is illustrated in Box 13.4. What it does is this: it examines the effects of an intervention by alternating the baseline condition (the A phase), when no intervention is in effect, with the intervention condition (the B phase). The A and B phases are then repeated to complete the four phases. As Kazdin (1982) says, the effects of the intervention are clear if performance improves during the first intervention phase, reverts to or approaches original baseline levels of performance when the treatment is withdrawn, and improves again when treatment is recommenced in the second intervention phase.

An example of the application of the ABAB design in an educational setting is provided by Dietz (1977), whose single-case study sought to measure the effect that a teacher could have upon the disruptive behaviour of an adolescent boy whose persistent talking disturbed his fellow classmates in a special education class. In order to decrease the unwelcome behaviour, a reinforcement programme was devised in which the boy could earn extra time with the teacher by decreasing the number of times he called out. The boy was told that when he made three (or fewer) interruptions during any fifty-five-minute class period, he would earn the reward.
Box 13.4
The ABAB design (graph: performance plotted against days across four phases – baseline, intervention, baseline, intervention; the solid lines in each phase present the actual data, and the dashed lines indicate the projected level of performance from the previous phase)

Single-case designs such as these have been able to provide an experimental technique for evaluating interventions for the individual subject. Moreover, such interventions can be directed towards the particular subject or group.
Box 13.5
An ABAB design in an educational setting (graph: frequency of the target behaviour plotted against sessions across baseline and intervention phases; DRL, differential reinforcement of low rates)
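A minimal Python sketch (with invented counts, not Dietz's data) of how the four phases of an ABAB record might be summarized:

from statistics import mean

# Hypothetical frequency of the target behaviour per session in each phase.
phases = {
    "A1 (baseline)":     [28, 30, 27, 29],
    "B1 (intervention)": [12, 9, 8, 7],
    "A2 (baseline)":     [22, 25, 26, 24],
    "B2 (intervention)": [6, 5, 7, 4],
}

for phase, counts in phases.items():
    print(f"{phase}: mean = {mean(counts):.1f}")

# The intervention effect is credible if behaviour improves in B1, reverts
# towards baseline in A2, and improves again in B2.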
Procedures in conducting experimental research

First, researchers must identify and define the research problem as precisely as possible, always supposing that the problem is amenable to experimental methods. Second, researchers must formulate hypotheses that they wish to test. This involves making predictions about relationships between specific variables and at the same time making decisions about other variables that are to be excluded from the experiment by means of controls. Variables, remember, must have two properties. The first property is that variables must be measurable. Physical fitness, for example, is not directly measurable until it has been operationally defined: making the variable 'physical fitness' operational means defining it by letting something else that is measurable stand for it – a gymnastics test, perhaps. The second property is that the proxy variable must be a valid indicator of the hypothetical variable in which one is interested. That is to say, a gymnastics test probably is a reasonable proxy for physical fitness; height, on the other hand, most certainly is not. Excluding variables from the experiment is inevitable, given constraints of time and money. It follows therefore that one must set up priorities among the variables in which one is interested so that the most important of them can be varied experimentally while others are held constant.

Third, researchers must select appropriate levels at which to test the independent variables. By way of example, suppose an educational psychologist wishes to find out whether longer or shorter periods of reading make for better reading attainment in school settings (see Simon 1978). The psychologist will hardly select five-hour and five-minute periods as appropriate levels; rather, she is more likely to choose thirty-minute and sixty-minute levels, in order to compare with the usual timetabled periods of forty-five minutes' duration. In other words, the experimenter will vary the stimuli at such levels as are of practical interest in the real-life situation. Pursuing the example of reading attainment somewhat further, our hypothetical experimenter will be wise to vary the stimuli in large enough intervals so as to obtain measurable results. Comparing reading periods of forty-four minutes, or forty-six minutes, with timetabled reading lessons of forty-five minutes is scarcely likely to result in observable differences in attainment.

Fourth, researchers must decide which kind of experiment they will adopt, perhaps from the varieties set out in this chapter.
8 Conduct the intervention.
9 Conduct the post-test.
10 Analyse the results.
The sequence of steps 6 and 7 can be reversed;
the intention in putting them in the present
sequence is to ensure that the two groups are
randomly allocated and matched. In experiments
and fixed designs, data are aggregated rather
than related to specific individuals, and the
analysis looks for averages, the range of results,
and their variation.
In calculating differences or similarity between
groups at the stages of the pretest and the post-
test, the t-test for independent samples is often
used.
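For example, a minimal sketch of the independent-samples t-test, assuming SciPy is available (the scores are invented):

from scipy import stats

# Hypothetical post-test scores for the two groups.
experimental = [58, 62, 60, 65, 59, 63, 61, 64]
control = [52, 55, 50, 54, 53, 51, 56, 52]

t, p = stats.ttest_ind(experimental, control)
print(f"t = {t:.2f}, p = {p:.4f}")  # a small p-value suggests a group difference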
Random allocation works only when there are enough subjects so that 'the principle of randomization has a chance to operate as a powerful control'. It is doubtful whether twenty-six pupils in each of the three groups in Bhadwal and Panda's (1991) study constituted 'enough subjects'.
In addition to the matching procedures in
drawing up the sample, and the random
allocation of pupils to experimental and control
groups, the researchers also used analysis of
covariance, as a further means of controlling for
initial differences between E and C groups on
their pretest mean scores on the independent
variables, study habits and attitudes.
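A minimal sketch of such an analysis of covariance, assuming pandas and statsmodels are available (the data are invented, not Bhadwal and Panda's):

import pandas as pd
import statsmodels.formula.api as smf

# Post-test scores modelled on group membership with the pretest as covariate,
# controlling for initial differences between E and C groups.
df = pd.DataFrame({
    "posttest": [60, 64, 58, 66, 52, 55, 50, 54],
    "pretest":  [48, 51, 47, 52, 49, 50, 46, 48],
    "group":    ["E", "E", "E", "E", "C", "C", "C", "C"],
})

model = smf.ols("posttest ~ pretest + C(group)", data=df).fit()
print(model.summary())  # the C(group) coefficient is the adjusted group effect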
The experimental programme involved im-
proving teaching skills, classroom organization,
teaching aids, pupil participation, remedial help,
peer-tutoring and continuous evaluation. In addi-
tion, provision was also made in the experimental
group for ensuring parental involvement and
extra reading materials. It would be startling if such a package of teaching aids and curriculum strategies did not effect significant changes in their recipients, and such was the case in the experimental results. The Experimental Group made highly significant gains in respect of its level of study habits as compared with Control Group 2, where students did not show a marked change. What did surprise the investigators, we suspect, was the significant increase in levels of study habits in Control Group 1. Maybe, they opined, this unexpected result occurred because Control Group 1 pupils were tested immediately prior to the beginning of their annual examinations. On the other hand, they conceded, some unaccountable variables might have been operating. There is, surely, a lesson here for all researchers! (For a set of examples of problematic experiments see http://www.routledge.com/textbooks/9780415368780 – Chapter 13, file 13.1.doc.)
Evidence-based educational research and meta-analysis

Evidence-based education is becoming a widely used method of investigation, bringing together different studies to provide evidence to inform policy-making and planning. Meta-analysis is a research strategy in itself. That this is happening significantly is demonstrated in the establishment of the EPPI-Centre (Evidence for Policy and Practice Information and Co-ordinating Centre) at the University of London (http://eppi.ioe.ac.uk/EPPIWeb/home.aspx), the Social, Psychological, Educational and Criminological Controlled Trials Register (SPECTR), later transferred to the Campbell Collaboration (http://www.campbellcollaboration.org), a parallel to the Cochrane Collaboration in medicine (http://www.cochrane.org/index0.htm), which undertakes systematic reviews and meta-analyses of, typically, experimental evidence in medicine, and the Curriculum, Evaluation and Management (CEM) centre at the University of Durham (http://www.cemcentre.org). 'Evidence' here typically comes from randomized controlled trials of one hue or another (Tymms 1999; Coe et al. 2000; Thomas and Pring 2004: 95), with their emphasis on careful sampling, control of variables, both extraneous and included, and measurements of effect size. The cumulative evidence from collected RCTs is intended to provide a reliable body of knowledge on which to base policy and practice (Coe et al. 2000). Such accumulated data, it is claimed, deliver evidence of 'what works', although Morrison (2001b) suggests that this claim is suspect.

The roots of evidence-based practice lie in medicine, where the advocacy by Cochrane (1972) of randomized controlled trials, together with their systematic review and documentation, led to the foundation of the Cochrane Collaboration (Maynard and Chalmers 1997), which is now worldwide. The careful, quantitatively based research studies that can contribute to the accretion of an evidential base are seen to be a powerful counter to the often untried and under-tested schemes that are injected into practice. More recently evidence-based education has entered the worlds of social policy, social work (MacDonald 1997) and education (Fitz-Gibbon 1997).

At the forefront of educational research in this area are Fitz-Gibbon (1996; 1997; 1999) and Tymms (1996), who, at the Curriculum, Evaluation and Management Centre at the University of Durham, have established one of the world's largest monitoring centres in education. Fitz-Gibbon's work is critical of multilevel modelling and, instead, suggests how indicator systems can be used with experimental methods to provide clear evidence of causality and a ready answer to her own question, 'How do we know what works?' (Fitz-Gibbon 1999: 33).

Echoing Anderson and Biddle (1991), Fitz-Gibbon suggests that policy-makers shun evidence in the development of policy and that practitioners, in the hurly-burly of everyday activity, call upon tacit knowledge rather than the knowledge which is derived from RCTs. However, in a compelling argument (Fitz-Gibbon 1997: 35–6), she suggests that evidence-based approaches are necessary in order to challenge the imposition of unproven practices, solve problems and avoid harmful procedures, and create improvement that leads to more effective learning. Further, such evidence, she contends, should examine effect sizes rather than statistical significance.

While the nature of information in evidence-based education might be contested by researchers whose sympathies (for whatever reason) lie outside randomized controlled trials, the message from Fitz-Gibbon will not go away: the educational community needs evidence on which to base its judgements and actions. The development of indicator systems worldwide attests to the importance of this, be it through assessment and examination data, inspection findings, national and international comparisons of achievement, or target setting. Rather than being a shot in the dark, evidence-based education suggests that policy formation should be informed, and policy decision-making should be based on the best information to date rather than on hunch, ideology or political will. It is bordering on the unethical to implement untried and untested recommendations in educational practice, just as it is unethical to use untested products and procedures on hospital patients without their consent.
Reviews of research that are not carried out systematically, it is suggested, may:

O fail to recognize that sampling error can play a part in creating variations in findings among studies
O overlook differing and conflicting research findings
O fail to examine critically the evidence, methods and conclusions of previous reviews
O overlook the extent to which findings from research are mediated by the characteristics of the sample
O overlook the importance of intervening variables in research
O be unreplicable, because the procedures for integrating the research findings have not been made explicit.
Results, it is argued, should be reported in terms of effect size, that is to say, in terms of how much difference they make rather than only in terms of whether or not the effects are statistically significant at some arbitrary level such as 5 per cent. Because, with effect sizes, it becomes easier to concentrate on the educational significance of a finding rather than trying to assess its importance by its statistical significance, we may finally see statistical significance kept in its place as just one of many possible threats to internal validity. The move towards elevating effect size over significance levels is very important (see also Chapter 24), and signals an emphasis on 'fitness for purpose' (the size of the effect having to be suitable for the researcher's purposes) over arbitrary cut-off points in significance levels as determinants of utility.

The term 'meta-analysis' originated in 1976 (Glass 1976), and early forms of meta-analysis used calculations of combined probabilities and frequencies with which results fell into defined categories (e.g. statistically significant at given levels), although problems of different sample sizes confounded rigour (e.g. large samples would yield significance in trivial effects, while important data from small samples would not be discovered because they failed to reach statistical significance) (Light and Smith 1971; Glass et al. 1981; McGaw 1997: 371).

Glass (1976) and Glass et al. (1981) suggested three levels of analysis:

O primary analysis of the data
O secondary analysis, a re-analysis using different statistics
O meta-analysis, analysing the results of several studies statistically in order to integrate the findings.

Glass et al. (1981) and Hunter et al. (1982) suggest eight steps in the procedure:

1 Identify the variables for focus (independent and dependent).
2 Identify all the studies which feature the variables in which the researcher is interested.
3 Code each study for those characteristics that might be predictors of outcomes and effect sizes (e.g. age of participants, gender, ethnicity, duration of the intervention).
4 Estimate the effect sizes through calculation for each pair of variables (dependent and independent variable) (see Glass 1977), weighting the effect size by the sample size.
5 Calculate the mean and the standard deviation of effect sizes across the studies, i.e. the variance across the studies.
6 Determine the effects of sampling errors, measurement errors and range of restriction.
7 If a large proportion of the variance is attributable to the issues in Step 6, then the average effect size can be considered an accurate estimate of relationships between variables.
8 If a large proportion of the variance is not attributable to the issues in Step 6, then review those characteristics of interest which correlate with the study effects.
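A minimal Python sketch (with invented values) of the weighting described in steps 4 and 5:

# Hypothetical (effect size, sample size) pairs for several studies.
studies = [(0.40, 30), (0.25, 120), (0.60, 45), (0.10, 200)]

total_n = sum(n for _, n in studies)
weighted_mean = sum(es * n for es, n in studies) / total_n
weighted_var = sum(n * (es - weighted_mean) ** 2 for es, n in studies) / total_n

print(f"Weighted mean effect size: {weighted_mean:.3f}")
print(f"Variance across studies:   {weighted_var:.4f}")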
Cook et al. (1992: 7–12) set out a five-step model for an integrative review as a research process, covering:

1 Problem formulation, where a high-quality meta-analysis must be rigorous in its attention to the design, conduct and analysis of the review.
2 Data collection, where sampling of studies for review has to demonstrate fitness for purpose.
3 Data retrieval and analysis, where threats to validity in non-experimental research – of which integrative review is an example – are addressed. Validity here must demonstrate fitness for purpose, reliability in coding, and attention to the methodological rigour of the original pieces of research.
4 Analysis and interpretation, where the accumulated findings of several pieces of research should be regarded as complex data points that have to be interpreted by meticulous statistical analysis.

Fitz-Gibbon (1984: 141–2) sets out four steps in conducting a meta-analysis:

1 Finding studies (e.g. published, unpublished, reviews) from which effect sizes can be computed.
2 Coding the study characteristics (e.g. date, publication status, design characteristics, quality of design, status of researcher).
3 Measuring the effect sizes (e.g. locating the experimental group as a z-score in the control group distribution) so that outcomes can be measured on a common scale, controlling for 'lumpy data' (non-independent data from a large data set).
4 Correlating effect sizes with context variables (e.g. to identify differences between well-controlled and poorly controlled studies).

Effect sizes (e.g. Cohen's d and eta squared) are the preferred statistics over statistical significance in meta-analyses, and we discuss this in Part Five. Effect size is a measure of the degree to which a phenomenon is present or the degree to which a null hypothesis is not supported. Wood (1995: 393) suggests that effect size can be calculated by dividing the significance level by the sample size. Glass et al. (1981: 29, 102) calculate the effect size as:

(mean of experimental group − mean of control group) / standard deviation of the control group
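In code, this calculation might look as follows (a minimal sketch with invented scores):

from statistics import mean, stdev

experimental = [58, 62, 60, 65, 59, 63, 61, 64]
control = [52, 55, 50, 54, 53, 51, 56, 52]

# Glass's effect size: the mean difference divided by the standard
# deviation of the control group.
effect_size = (mean(experimental) - mean(control)) / stdev(control)
print(f"Effect size: {effect_size:.2f}")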
Hedges (1981) and Hunter et al. (1982) suggest alternative equations to take account of differential weightings due to sample size variations. The two most frequently used indices of effect size are standardized mean differences and correlations (Hunter et al. 1982: 373), though non-parametric statistics, e.g. the median, can be used. Lipsey (1992: 93–100) sets out a series of statistical tests for working on effect sizes, effect size means and homogeneity. It is clear from this that Glass and others assume that meta-analysis can be undertaken only for a particular kind of research – the experimental type – rather than for all types of research; this might limit its applicability.

Glass et al. (1981) suggest that meta-analysis is particularly useful when it uses unpublished dissertations, as these often contain weaker correlations than those reported in published research, and hence act as a brake on misleading, more spectacular generalizations. Meta-analysis, it is claimed (Cooper and Rosenthal 1980), is a means of avoiding Type II errors (failing to find effects that really exist), synthesizing research findings more rigorously and systematically, and generating hypotheses for future research. However, Hedges and Olkin (1980) and Cook et al. (1992: 297) show that Type II errors become more likely as the number of studies included in the sample increases. Further, Rosenthal (1991) has indicated a method for avoiding Type I errors (finding an effect that, in fact, does not exist) that is based on establishing how many unpublished studies that average a null result would need to be undertaken to offset the group of published statistically significant studies. For one example he shows a ratio of 277:1 of unpublished to published research, thereby indicating the limited bias in published research.

Meta-analysis is not without its critics (e.g. Wolf 1986; Elliott 2001; Thomas and Pring 2004). Wolf (1986: 14–17) suggests six main areas:

O It is difficult to draw logical conclusions from studies that use different interventions, measurements, definitions of variables, and participants.
O Results from poorly designed studies take their place alongside results from higher quality studies.
O Published research is favoured over unpublished research.
O Multiple results from a single study are used, making the overall meta-analysis appear more reliable than it is, since the results are not independent.
O Interaction effects are overlooked in favour of main effects.
O Meta-analysis may have 'mischievous consequences' (Wolf 1986: 16) because its apparent objectivity and precision may disguise procedural invalidity in the studies.

Wolf (1986) provides a robust response to these criticisms, both theoretically and empirically. Wolf (1986: 55–6) also suggests a ten-step sequence for carrying out meta-analyses rigorously:

1 Make clear the criteria for inclusion and exclusion of studies.
Because only a limited number of common variables tend to be measured in each case, explains Tripp (1985), cumulation of the studies tends to increase sample size much more than it increases the complexity of the data in terms of the number of variables. Meta-analysis risks attempting to synthesize studies which are insufficiently similar to each other to permit this with any legitimacy (Glass et al. 1981: 22; McGaw 1997: 372) other than at an unhelpful level of generality. The analogy here might be to try to keep together oil and water as 'liquids'; meta-analysts would argue that differences between studies and their relationships to findings can be coded and addressed in meta-analysis. Eysenck (1978) suggests that early meta-evaluation studies mixed apples with oranges. Morrison (2001b) asks:

How can we be certain that meta-analysis is fair if the hypotheses for the separate experiments were not identical, if the hypotheses were not operationalizations of the identical constructs, if the conduct of the separate RCTs (e.g. time frames, interventions and programmes, controls, constitution of the groups, characteristics of the participants, measures used) were not identical?
(Morrison 2001b: 78)

Although Glass et al. (1981: 218–20) address these kinds of charges, it remains the case (McGaw 1997) that there is a risk in meta-analysis of dealing indiscriminately with a large and sometimes incoherent body of research literature. It is unclear, too, how meta-analysis differentiates between 'good' and 'bad' research – e.g. between methodologically rigorous and poorly constructed research (Cook et al. 1992: 297). Smith and Glass (1977) and Levačić and Glatter (2000) suggest that it is possible to use study findings regardless of their methodological quality, though Glass and Smith (1978) and Slavin (1984a, 1984b), in a study of the effects of class size, indicate that methodological quality does make a difference. Glass et al. (1981: 220–6) effectively address the charge of using data from 'poor' studies, arguing, among other points, that many weak studies can add up to a strong conclusion, and that the differences in the size of experimental effects between high-validity and low-validity studies are surprisingly small (Glass et al. 1981: 221, 226).

Further, Wood (1995: 296) suggests that meta-analysis oversimplifies results by concentrating on overall effects to the neglect of the interaction of intervening variables. To the charge that, because meta-analyses are frequently conducted on large data sets where multiple results derive from the same study (i.e. that the data are non-independent), they are therefore unreliable, Glass et al. (1981: 153–216) indicate how this can be addressed by using sophisticated data analysis techniques. Finally, a practical concern is the time required not only to use the easily discoverable studies (typically large-scale published studies) but also to include the smaller-scale unpublished studies; the effect of neglecting the latter might be to build in bias in the meta-analysis. It is the traditional pursuit of generalizations from each quantitative study which has most hampered the development of a database adequate to reflect the complexity of the social nature of education. The cumulative effects of 'good' and 'bad' experimental studies are graphically illustrated in Box 13.6.

An example of meta-analysis in educational research

Glass and Smith (1978) and Glass et al. (1981: 35–44) identified 77 empirical studies of the relationship between class size and pupil learning. These studies yielded 725 comparisons of the achievements of smaller and larger classes, the comparisons resting on data accumulated from nearly 900,000 pupils of all ages and aptitudes studying all manner of school subjects. Using regression analysis, the 725 comparisons were integrated into a single curve showing the relationship between class size and achievement in general. This curve revealed a definite inverse relationship between class size and pupil learning.
Box 13.6
Class size and learning in well-controlled and poorly controlled studies (graph: regression lines for the regression of achievement, expressed in percentile ranks, onto class size for studies that were well-controlled and poorly controlled in the assignment of pupils to classes)
When the researchers derived similar curves for a variety of circumstances that they hypothesized would alter the basic relationship (for example, grade level, subject taught, pupil ability, etc.), virtually none of these special circumstances altered the basic relationship. Only one factor substantially affected the curve – whether the original study controlled adequately in the experimental sense for initial differences among pupils and teachers in smaller and larger classes. Adequate and inadequate control curves are set out in Box 13.6.
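A minimal Python sketch (with invented summary data, not the actual 725 comparisons) of this kind of regression integration:

import numpy as np

# Hypothetical mean achievement (percentile ranks) at various class sizes.
class_size = np.array([10, 15, 20, 25, 30, 35, 40], dtype=float)
achievement = np.array([82, 76, 71, 68, 65, 63, 62], dtype=float)

slope, intercept = np.polyfit(class_size, achievement, 1)
print(f"achievement = {intercept:.1f} + ({slope:.2f} x class size)")
# A negative slope reflects the inverse relationship between class size
# and pupil learning reported by Glass and Smith (1978).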