Single group pre-post research designs methodological concerns


Publisher: Routledge
Oxford Review of Education

Single group, pre- and post-test research designs: Some methodological concerns
Emma Marsden (University of York, UK) & Carole J. Torgerson (University of Durham, UK)
Published online: 02 Oct 2012.
To cite this article: Emma Marsden & Carole J. Torgerson (2012) Single group, pre- and post-test
research designs: Some methodological concerns, Oxford Review of Education, 38:5, 583-616, DOI:
10.1080/03054985.2012.731208
To link to this article: http://dx.doi.org/10.1080/03054985.2012.731208
Oxford Review of Education, Vol. 38, No. 5, October 2012, pp. 583-616
This article provides two illustrations of some of the factors that can influence findings from
pre- and post-test research designs in evaluation studies, including regression to the mean
(RTM), maturation, history and test effects. The first illustration involves a re-analysis of data
from a study by Marsden (2004), in which pre-test scores are plotted against gain scores to
demonstrate RTM effects. The second illustration is a methodological review of single group,
pre- and post-test research designs (pre-experiments) that evaluate causal relationships between
intervention and outcome. Re-analysis of Marsden's prior data shows that learners with higher
baseline scores consistently made smaller gains than those with lower baseline scores, demonstrating that RTM is clearly observable in single group, pre-post test designs. Our review found
that 13% of the sample of 490 articles were evaluation studies. Of these evaluation studies,
about half used an experimental design. However, a quarter used a single group, pre-post test
design, and researchers using these designs did not mention possible RTM effects in their explanations, although other explanatory factors were mentioned. We conclude by describing how
using experimental or quasi-experimental designs would have enabled researchers to explain
their findings more accurately, and to draw more useful implications for pedagogy.
Keywords: education research; pre-experiment; regression to the mean; control group; research design;
research methods
Introduction
Pre-experimental designs in evaluation studies
Evaluations of educational policy and practice interventions that rely on the single
group pre-experimental research design (also known as the 'before and after' or
'pre- and post-test' design) may be threatened by a number of biases. Typically, in
this design, participants are selected (sometimes on the basis of performance below
a pre-specified threshold in a test), pre-tested, exposed to an educational intervention and then post-tested. Observed improvements in the outcome measures may
be ascribed to the intervention in a causal relationship. However, any evaluative
approach that uses this design provides weak information about the counterfactual
inference (Cook & Campbell, 1979; Shadish et al., 2002; Torgerson & Torgerson,
2008) and may be subject to a number of confounding variables, such as history,
maturation, test effects and the statistical phenomenon known as the regression to
the mean (RTM) effect (Campbell & Stanley, 1963; Cook & Campbell, 1979;
Shadish et al., 2002). Shadish et al.'s nine threats to internal validity include the
four confounding variables with which we are concerned here: history; maturation;
regression and testing (Shadish et al., 2002, p. 55).
Corresponding author: Emma Marsden, Department of Education, University of York, Heslington, York YO10 5DD, UK. Email: emma.marsden@york.ac.uk
In this paper, we argue that it can be inappropriate to ascribe improvements
in outcome measures to the intervention being evaluated without considering other possible explanations, such as maturation or the RTM phenomenon. We first discuss
how a range of factors can affect the causal validity of pre-experimental designs in
evaluation research. Second, we demonstrate that one of these, RTM, is a clearly
observable phenomenon in pre-experimental designs, by analysing pre-post test
gains as a function of baseline achievement across a battery of outcome measures.
Finally, we present a methodological review of studies published in educational
research journals in 2009. We illustrate the issue of deriving causal inference from
an observed change in outcome without consideration of alternative explanations
for the change observed. The selected papers describe empirical studies that evaluated education interventions, adopting a pre-experimental design with no control
or comparison group. We argue that many of these papers do not directly address
the possibility that regression, maturation, test effects or other possible confounders
could account for their findings, and we discuss the implications for interpreting
their findings.
Threats to the causal validity of the single group pre- and post-test design
A number of threats to the single group design weaken a causal interpretation.
Some of these, such as attrition or un-blinded assessment, are common to experimental or multiple group designs and we will not discuss them further (Cook &
Campbell, 1979). Others, however, such as maturation, history, test effects and
regression effects cannot be controlled for using a single group pre- and post-test
design, and we discuss these below.
History
The pre-experimental design cannot control for the contemporaneous effects of
normal educational experience or innovations in practice and policy that may
account for some or all of the observed changes. A design using a control or
comparison group is usually necessary to account for the possible effects of these
on post-test scores (Cook & Campbell, 1979; Shadish et al., 2002; Torgerson &
Torgerson, 2008).
Maturation
Learners tend to improve in their educational outcomes over time simply due to
increasing maturity. In the absence of a control group we cannot control for maturation effects because these will tend to affect post-test scores regardless of any
new intervention being evaluated. The greater the time difference between pre- and post-test, the greater the potential effects due to maturation (Cook & Campbell, 1979; Shadish et al., 2002).
Test effects
Evaluations usually involve some form of measurement, before and after the intervention. It is possible that improvements can result from the test itself, attributable
to factors such as participants remembering questions or the questions raising
awareness and triggering learning after the pre-test, independent of the subsequent
intervention. Ideally, two or more equivalent versions of the same test should be
used, counter-balanced amongst participants at pre- and post-test. However, to
fully ascertain whether any learning occurred as a result of simply having done the
test, it is necessary to test participants who did not receive the pre-test. Thus,
there can be four groups of participants: pre- and post-test no intervention; pre- and post-test with intervention; post-test only no intervention; post-test only with
intervention. This is the Solomon four group design (Shadish et al., 2002).
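As a rough illustration of how the four groups separate a test effect from an intervention effect, the sketch below compares post-test means across the four cells. The scores are invented for illustration only, and the function name `solomon_effects` is hypothetical; a real analysis would normally use a factorial model with appropriate inference rather than raw differences of means.

```python
from statistics import fmean

def solomon_effects(post_scores):
    """Estimate effects from post-test scores in a Solomon four group layout.

    post_scores maps (pretested, treated) -> list of post-test scores.
    Returns simple differences of cell means (no significance testing).
    """
    m = {cell: fmean(scores) for cell, scores in post_scores.items()}
    effect_if_pretested = m[(True, True)] - m[(True, False)]
    effect_if_not_pretested = m[(False, True)] - m[(False, False)]
    return {
        # Average intervention effect across pre-tested and non-pre-tested arms
        "intervention": (effect_if_pretested + effect_if_not_pretested) / 2,
        # Average effect of merely having sat the pre-test
        "testing": ((m[(True, True)] - m[(False, True)])
                    + (m[(True, False)] - m[(False, False)])) / 2,
        # Whether the pre-test changes how well the intervention works
        "interaction": effect_if_pretested - effect_if_not_pretested,
    }

# Invented post-test scores, for illustration only
example = {
    (True, True): [70, 72, 74],    # pre- and post-test, with intervention
    (True, False): [60, 62, 64],   # pre- and post-test, no intervention
    (False, True): [66, 68, 70],   # post-test only, with intervention
    (False, False): [56, 58, 60],  # post-test only, no intervention
}
effects = solomon_effects(example)
print(effects)
```

In this invented example the pre-tested groups score higher than the groups that were not pre-tested regardless of intervention, which is the pattern a test effect would produce.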
The RTM phenomenon and an illustration of its effects
RTM is a statistical phenomenon that affects all pre-experimental designs that
include, or analyse data from, participants selected on the basis of an extreme,
usually low but sometimes high, pre-test score (Cook & Campbell, 1979).
Galton (1886) discovered the statistical phenomenon of RTM in his work on
the heritability of height, describing it as regression to mediocrity. Thorndike
(1942) reminded us how RTM, or the regression fallacy, can affect educational
research. The phenomenon can affect measurement known to contain an error
component (as well as a true measurement), for example a reading test. Many
test results have a normal distribution with most values clustered around the mean
and a smaller number of markedly lower or higher results. In almost all educational tests a proportion of the results will have a random error component (the
error term). Scores towards the ends of a distribution will, on average, be more
likely to have a higher error term than those nearer the mean. When students are
re-tested the results towards the ends of the distribution will tend to move closer
to the mean (to their true value) than the results in the middle range. A minority
will not regress to the mean but the majority will, moving the mean of the subgroups towards the whole sample mean on post-test. The regression effect is,
therefore, most evident for students with the lowest and highest pre-test scores.
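The mechanism described above can be reproduced with a small simulation. This is a minimal sketch, assuming each observed score is a stable true score plus independent, normally distributed error, with no intervention between the two test occasions; the mean, standard deviations and sample size are illustrative assumptions, not values from any study.

```python
import random
import statistics

def simulate_retest(n=20000, true_sd=10.0, error_sd=6.0, mean=50.0, seed=1):
    """Simulate two administrations of a test with no intervention in between.

    Observed score = stable true score + independent random error.
    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    pre, post = [], []
    for _ in range(n):
        true_score = rng.gauss(mean, true_sd)
        pre.append(true_score + rng.gauss(0.0, error_sd))
        post.append(true_score + rng.gauss(0.0, error_sd))
    return pre, post

def mean_gain_by_pretest_band(pre, post, fraction=0.25):
    """Mean gain (post - pre) for the lowest, middle and highest pre-test bands."""
    order = sorted(range(len(pre)), key=lambda i: pre[i])
    k = int(len(pre) * fraction)
    bands = {"lowest": order[:k], "middle": order[k:-k], "highest": order[-k:]}
    return {name: statistics.fmean(post[i] - pre[i] for i in idx)
            for name, idx in bands.items()}

pre, post = simulate_retest()
gains = mean_gain_by_pretest_band(pre, post)
for band, gain in gains.items():
    print(f"{band:>7} pre-test band: mean gain = {gain:+.2f}")
```

Because nothing changes between the two occasions except the random error, any apparent gain in the lowest band and apparent loss in the highest band is purely a regression artefact.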
In order to illustrate the RTM phenomenon we re-analysed data from a study
undertaken by the first author (Marsden, 2004, 2006). To ascertain whether
RTM affected the data, participants' change scores (i.e., post-test minus pre-test
scores) were plotted against pre-test scores. If RTM were present a negative correlation would be observed, because participants with high pre-test scores would, on
average, tend to have smaller gains than participants with low pre-test scores. As
expected, Figure 1 shows a strong negative correlation (-0.65, p < 0.001) between
the pre-test and change scores.1
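A negative correlation between pre-test and change scores arises from measurement error alone, even when every participant's true score improves by exactly the same amount. The sketch below assumes a simple model in which observed score = true score + independent error, with equal error variance at both occasions; under that assumption the expected correlation is -sqrt((1 - reliability) / 2). The reliabilities, the uniform gain and the sample size are illustrative assumptions and are unrelated to the data in Figure 1.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pre_change_correlation(reliability, uniform_gain=5.0, n=20000, seed=7):
    """Correlation of pre-test with change when everyone truly gains the same.

    Total observed variance is fixed at 1; `reliability` is the share of it
    that is true-score variance (an assumed value, not an estimate).
    """
    rng = random.Random(seed)
    true_sd = math.sqrt(reliability)
    error_sd = math.sqrt(1.0 - reliability)
    pre, change = [], []
    for _ in range(n):
        true_score = rng.gauss(0.0, true_sd)
        x1 = true_score + rng.gauss(0.0, error_sd)
        x2 = true_score + uniform_gain + rng.gauss(0.0, error_sd)
        pre.append(x1)
        change.append(x2 - x1)
    return pearson(pre, change)

for rel in (0.9, 0.7, 0.5):
    simulated = pre_change_correlation(rel)
    expected = -math.sqrt((1.0 - rel) / 2.0)
    print(f"reliability {rel:.1f}: simulated r = {simulated:+.3f}, expected r = {expected:+.3f}")
```

The less reliable the test, the stronger the negative correlation, so the plot of change against pre-test score cannot by itself distinguish differential effectiveness from RTM.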
The lower and upper quartiles of the pre-test scores were extracted for each of
the four measures (listening, speaking, reading and writing). This created two subgroups, lower and upper, coming from the same contextual group with the same
intervention. The pre- to post-test gains made by these lower and upper groups
were compared. This simulated a pre-experiment that compared the effectiveness
of the intervention for those with the lowest and highest scores at the outset.
(Once the inter-quartile range of the test scores was eliminated, the remaining
samples were very small, and it is emphasised that our aim was solely to illustrate
the existence of RTM effects.) Of the 16 comparisons (i.e., pre-to-post and
pre-to-delayed post tests, in two groups, in four outcome measures), six showed
statistically significantly larger gains by the lower groups than the upper groups. A
further four comparisons suggested a borderline statistically significant difference
in the same direction. For the remaining six gain scores, the lower groups' gains
were larger than the upper groups' gains in all but one case, though the differences
were not statistically significant. Critically, the lower group never improved less
than the upper group.2
Figure 1. Pre-test scores on a test plotted against gains between pre- and post-tests (data from
Marsden, 2004). Horizontal axis: pre-test score; vertical axis: pre-post gain.
Design issues
We can address the issues illustrated above by introducing a control or comparison
group formed by random allocation. Then, maturation, history, test effects and
RTM effects will affect both groups similarly and cancel out when comparing
groups. Consequently, group differences in changes from pre- to post-tests can be
appropriately ascribed to the intervention.
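The cancelling-out argument can be sketched with a simulation in which only low scorers are recruited and then randomly allocated to two arms. Every number here (true effect of 3 points, maturation of 2 points, the recruitment cutoff, the error variance) is an assumption chosen for illustration, and `run_trial` is a hypothetical helper, not a procedure from the studies discussed.

```python
import random
from statistics import fmean

def run_trial(true_effect=3.0, maturation=2.0, n_screened=40000,
              cutoff=45.0, seed=11):
    """Recruit low pre-test scorers, randomise them, and compare estimates.

    Returns (single_group_estimate, controlled_estimate) of the effect.
    """
    rng = random.Random(seed)
    gains = {"intervention": [], "control": []}
    for _ in range(n_screened):
        true_score = rng.gauss(50.0, 10.0)
        pre = true_score + rng.gauss(0.0, 6.0)
        if pre >= cutoff:
            continue  # only participants below the threshold are recruited
        arm = rng.choice(("intervention", "control"))  # random allocation
        effect = true_effect if arm == "intervention" else 0.0
        # Maturation and fresh measurement error affect both arms alike
        post = true_score + maturation + effect + rng.gauss(0.0, 6.0)
        gains[arm].append(post - pre)
    # What a single group, pre-post design would report as "the effect"
    single_group_estimate = fmean(gains["intervention"])
    # Difference in mean gains between randomised arms
    controlled_estimate = single_group_estimate - fmean(gains["control"])
    return single_group_estimate, controlled_estimate

single_group_estimate, controlled_estimate = run_trial()
print(f"single group pre-post gain:      {single_group_estimate:.2f}")
print(f"gain difference between arms:    {controlled_estimate:.2f}")
print("true effect assumed in the model: 3.00")
```

The single group gain bundles together the intervention, maturation and regression, whereas the between-arm difference should land close to the assumed true effect because the other components are shared by both arms.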
Although including a randomised control group will deal with temporal and RTM
effects, using a selected control group, not formed by random allocation, may not.
There are several ways a selected control group can introduce bias, the most obvious
of which is selection bias. The members of a control group may be systematically different in some variable, often unobserved, which influences outcome; consequently,
any difference observed between the groups at outcome may be due to selection
rather than to treatment. Secondly, even well matched control groups may introduce
bias through difference in history. For example, groups may have been exposed to
different interventions that may accelerate maturation. Also, RTM can differentially
affect different contextual groups even though they may appear to be matched at pre-test. This is because it is an individual's position in relation to their own contextual
group's mean that determines whether their score is likely to regress up or down to
their group's mean, and by how much. Most of the extreme values will, regardless of
the presence or effectiveness of an intervention, regress towards the mean, though a
minority do not. Identifying those values that will regress on re-testing, given no
intervention, and those that will not, is difficult, if not impossible.
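The point that matched pre-test scores can regress in different directions can be made concrete with the standard linear prediction of a re-test score. This is a sketch under stated assumptions: equal score variance at both occasions, and a test-retest correlation (`reliability`) that applies within each participant's own contextual group. The group means and the value 0.7 are hypothetical.

```python
def expected_retest_score(pre_score, group_mean, reliability):
    """Expected re-test score with no intervention, under a linear model.

    Assumes equal variances at both occasions; `reliability` is the assumed
    test-retest correlation within the participant's own group.
    """
    return group_mean + reliability * (pre_score - group_mean)

# Two hypothetical learners "matched" on a pre-test score of 50,
# drawn from contextual groups with different means.
same_pre_score = 50.0
reliability = 0.7  # assumed value, for illustration only

from_low_group = expected_retest_score(same_pre_score, group_mean=40.0,
                                       reliability=reliability)
from_high_group = expected_retest_score(same_pre_score, group_mean=60.0,
                                        reliability=reliability)
print(f"learner from group with mean 40: expected re-test = {from_low_group:.1f}")
print(f"learner from group with mean 60: expected re-test = {from_high_group:.1f}")
```

Under these assumptions the learner from the lower-scoring group is expected to drop while the learner from the higher-scoring group is expected to rise, even though neither received any intervention.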
The effects of history, maturation, test effects and RTM often do not operate
alone. Usually, differences in pre- and post-test gains are a combination of all four
factors. A methodological review of studies of psychological, educational and
behavioural treatments (Lipsey & Wilson, 1993) showed that the pre- and post-test
design consistently overestimates effectiveness by an average of 61% compared with
studies with a control group. This greater improvement seen in before and after
studies compared with quasi-experiments (i.e., experiments with a control group) is
entirely predictable given what we know about history, maturation, test effects and
RTM effects. Indeed, gains in control or comparison groups can be observed,
demonstrating that not all of the gains in the experimental group are attributable to
the intervention itself. For example, Norris and Ortega's (2000) meta-analysis of
78 second language education studies found that the average effect size of true
control and comparison treatment groups was d = 0.30 (SD = 0.39), that 15
groups with no experimental intervention made small but important gains, and that
change over time in control groups was a consistent phenomenon.
Methodological review of empirical studies

We undertook a small-scale methodological review of a systematically assembled
dataset of published pre-experimental studies.
Review methods
We searched 13 educational research journals: British Educational Research Journal, Cambridge Journal of Education, Educational Studies, International Journal of
Science Education, Journal of Educational Research, Journal of Research in Reading,
Journal of Teacher Education, Language Learning and Technology, Oxford Review of
Education, Reading and Writing: An Interdisciplinary Journal, Research in the Teaching of English, Reading Research Quarterly and Science Education, for 2009, using
the database Educational Resources Information Centre. The year 2009 provided the
most recent full cycle of journal issues before the start of the review process. This
is not a representative sample of education journals or of educational research
itself, and we note that the number of articles that fit our criteria
from any one journal is likely to be positively correlated with the number of articles published by that journal in that year, and/or with the amount of detail provided in the report.
In order to be selected for the review, papers had to: be unique empirical studies; compare a construct (e.g., attitude, knowledge, behaviour) before and after an
intervention; have at least one quantified measure; employ a study design that did
not include a control or comparison group or any other mechanism that could
have potentially addressed the known biases of using a single group design. Independent data extraction of the studies (with double data extraction of over 80% of
the studies) retrieved information about: the topic; the nature of the intervention;
the outcome measures; and the results. We also recorded whether: the author/s
derived a causal inference between intervention and outcome; the author/s mentioned RTM as a possible explanation for the results; the author/s mentioned other
potential explanatory factors for the results.
Results
In total, 490 articles were published in 2009 in the 13 journals. We found 64
(13%) evaluated innovative interventions and used experimental, quasi-experimental or pre-experimental designs (with quantitative and/or qualitative outcome measurements).3 Of these 64, 19 were included at the first screening stage. At the
second screening stage we excluded three studies (Graebner et al., 2009; MacArthur & Lembo, 2009; Tsaparlis & Papaphotis, 2009) because they did not fit our
criteria. This left 16 (25%) evaluation studies that met our criteria (i.e., pre-post
designs without a control or comparison group). (Note, 48 studies evaluated interventions using designs with control or comparison groups.)
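The proportions can be checked with a short tally. The first three counts come from the paragraph above; the split of the 48 controlled studies into 14 and 34 is taken from the figures given in the Discussion section of this article. The helper name `share` is only illustrative.

```python
counts = {
    "all articles": 490,
    "evaluation studies": 64,
    "single group pre-post": 16,
    "comparison group, no pre-intervention measure": 14,
    "pre-post with control/comparison group": 34,
}

def share(part, whole):
    """Percentage of `whole` represented by `part`, rounded to a whole number."""
    return round(100 * part / whole)

evaluations = counts["evaluation studies"]
with_comparison = (counts["comparison group, no pre-intervention measure"]
                   + counts["pre-post with control/comparison group"])

print("evaluations as % of all articles:", share(evaluations, counts["all articles"]))
print("single group pre-post as % of evaluations:",
      share(counts["single group pre-post"], evaluations))
print("any control/comparison group as % of evaluations:",
      share(with_comparison, evaluations))
```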
Detailed information about each pre-experiment is presented in the Appendix.
The nature of the arguments about causal relationships in the studies

All 16 studies argued that there was a causal relationship between the intervention
and observed changes on outcome measurements. Fifteen of the 16 studies documented improvement between pre- and post-tests. No author mentioned the possibility that RTM effects could have partly or wholly explained the results observed.
Six authors did not mention any factors, other than the intervention, that could
potentially partly explain the observed changes over time. Ten of the authors
acknowledged, indirectly or directly, that the experimental intervention may not
have been the (only) cause of the observed gains (see Table 1 and Appendix).
Maturation or time was cited as a possible cause for observed learning gains
in two studies (Sherrod & Wilhelm, 2009; Wilhelm, 2009). The test effect itself
was not mentioned, though Sherin & van Es (2009) described this issue indirectly
as they speculated that the outcome measurements themselves (instructional
behaviour in the classroom) may have encouraged change over time (p. 33);
O'Byrne (2009) intentionally used part of the pre-test as a learning tool in the
subsequent intervention; and Guisasola et al. (2009) acknowledged that the tests
were part of the intervention and may have caused some of the improvements.
If the studies reviewed here had used a control or comparison group, the results
may have supported the authors' claims about a causal relationship between the
intervention and the results. The absence of a control/comparison group does not
eliminate the possibility of a causal relationship, but the use of a control/comparison group is necessary to warrant claims about the existence and strength of such
a relationship.
Table 1. Consideration of potential factors explaining the observed changes, other than, or in
addition to, the experimental intervention

| Characteristic of study | Studies |
| --- | --- |
| No mention of any potential explanatory factor (except the intervention). | Annetta et al.; Ducate & Lomicka; Newton & Newton; Sherrod & Wilhelm; Spalding et al.; Taylor & Jones. |
| Acknowledgement that other (unspecified) factors may be involved. | Grace; Park et al. |
| Specific alternative explanatory factors mentioned. | Brady et al. (characteristics of intervention and other extraneous variables pertaining to quality of intervention: support, time, resources; the measure was not standardised). Evagorou et al. (self-selection bias). Guisasola et al. (tests influence learning). Jones et al. (self-selection bias). Miedijensky & Tal (influence of regular school, time, maturation). O'Byrne (indirectly: tests influence learning). Sherin & van Es (relationship between the intervention and the outcome measure may be cyclical). Wilhelm (differential maturation, differential exposure to the intervention, differential motivation). |
The role of RTM in cases of ceiling performance at pre-test

One study did not document improvement between pre- and post-test: Ducate &
Lomicka's (2009) findings could have been attributable to the RTM phenomenon,
as the authors note that the pre-test scores were at ceiling. RTM would predict
that these students would naturally regress to the population mean (move from
ceiling downwards) given no intervention. In fact, it is possible that the instruction
did actually enhance the participants' learning as there was no downward trend in
the post-test results, but stable scores or non-significant gains were observed. Furthermore, other factors such as maturation, history and test effects may also have
contributed to the observed lack of gains, as these factors have less impact when
pre-test scores are already high or at ceiling. Critically, however, none of these
interpretations can be supported by a study design that did not include a comparison group.
The use of different pre- and post-test measures
Twelve studies used pre- and post-test measures that were either identical or two
slightly different versions of the same test. One study (Park et al., 2009) used a
pre-intervention measure that was different from the follow-up test: the pre-test
was the grade point average from the previous year's science tests, and the post-test was a specially designed astronomy test. Because two different tests were used,
this allowed regression effects to be more pronounced than if the same test had
been used. All tests have an error value, but two different but correlated tests
(indicated by the assumption that both measure a certain body of knowledge) will
have an even greater error component, allowing greater scope for regression
effects. When large and randomised samples are used in experiments with two or
more comparison groups, pre-tests are not, in fact, necessary in order to infer causality, but in pre-experimental designs the nature of the pre- and post-tests has
important consequences for the claims made.
The notion of differential effectiveness as a function of pre-test score
In three studies (Brady et al., 2009; Park et al., 2009; Wilhelm, 2009) the authors
argued that the intervention had the greatest effect amongst those who achieved
low scores on the pre-test (see Appendix). Brady et al. (2009) stated that those
teachers who had higher scores at the outset generally had smaller gains; likewise,
teachers with initially low scores had the potential for larger improvement (p.
436). Wilhelm (2009) argued that although the females scored lower than the
males on every [...] domain, females made gains that brought each of their post
domain items up to or beyond those of the males' pre-scores. Also ... the post-[...]
tests displayed no significant difference between genders [despite] a significant difference between groups favouring males [at pretest] (p. 2118). Park et al. (2009)
divided their pre-intervention scores into five bands and found that the improvement
was different in accordance with the students' pre-achievement, F(4,229) = 7.853,
p = .000 and gender. [...] the lowest pre-achievement group's improvement
was greater than other groups [...] low achieving students made the most significant
gains after Computer Assisted Instruction (CAI) (p < .05) compared with
students who had high pre and post achievement levels (pp. 1006-1007). They
found that highest scorers at pre-test deteriorated at post-test (table 2, p. 1004).
They also argued that girls achieved higher at pre-test and did not improve at
post-test, and boys achieved worse at pre-test and made gains at post-test (table 4,
p. 1005). The authors suggested that the intervention had a differential impact
based on both gender and pre-intervention achievement. However, these two factors appear to be related, as the girls' baseline scores were generally higher than
the boys'. It is therefore possible that RTM effects were, in part, responsible for
the gains seen. The conflation of pre-achievement level and gender, in the absence
of considering RTM effects, is also observed in the analysis of attitude data:
attitudes of lower achievers at pre-test were enhanced, while attitudes of higher
achievers at pre-test did not improve or deteriorated (p. 1003) and the boys
significantly enhanced their attitude to science through CAI, but CAI did not
appear to have the same significance for girls' attitude to science (p. 1006).
Another study (Evagorou et al., 2009) may also illustrate that pre-achievement
level can partly explain the gains observed. Although the authors made no claims
about differential effectiveness as a function of pre-test scores, they did note that
four of the 7 skills tested were initially quite undeveloped (Evagorou et al.,
2009, p. 670) (i.e., as these initial scores could have been skewed to the lower end
of the distribution, gains due to RTM were more likely).
These examples demonstrate that pre-post gains are sometimes used to argue
that interventions are most effective for those who have low scores at baseline.
Other examples can be found from outside our review, such as an evaluation by
Moore and Wade (1998) which concluded that five or six years after the intervention the Reading Recovery teaching, the weakest group [had] overtaken initially
more able readers and performed better in both reading accuracy and comprehension (p. 201). Similarly, Benati et al. (2010) argued that learners who scored
lower on the pre-test improved more than the high scorers such that the two
groups were equal on the post-test (p. 127) and Bell et al. (2009) argued that
students of lower attainment at Key Stage 3 appear to perform better [in Science
GCSE] than would have been predicted from their Key Stage 3 attainment, but
that higher attaining pupils perform less well (p. 119).
The conclusion that an intervention is more beneficial for low achievers at baseline is only warranted when an equivalent low-scoring sub-group from the control
or comparison group does not make gains equivalent to the low achievers from the
experimental group. This is demonstrated by Ben-David and Zohar (2009). They
randomly assigned equal numbers of low and high achievement participants to a
control and an intervention group, and were therefore able to conclude that their
intervention resulted in more learning gains for low achieving students than for
high achieving students. McCutchen et al. (2009) also reported a differential
impact of an intervention on low and high achievers (though assignment to conditions used matched randomisation at the school level rather than at the level of
individual participants, and so RTM may have affected the different groups
unevenly, as described above).
Discussion
Control (or comparison) groups are important for avoiding unwarranted interpretations of data from pre-post measurements. It should be noted that 14 of the 64
evaluation studies did use a comparison group, without pre-intervention measures;
and 34 of the 64 studies used both a pre-post design and a control/comparison
group (with or without random allocation to groups).
The use of control and comparison groups principally avoids unwarranted interpretations (internal validity). It can also improve ecological validity. For example,
using test only groups can inform decisions when the intervention would be
added to the normal programme, and using comparison groups can help practitioners determine the relative merits of different interventions. As discussed above,
random allocation is the best way of addressing history, maturation, test effects
and RTM effects. If a control group cannot be formed by random assignment then
a contemporaneous control group is preferable to no control group.
Another way of partly controlling for RTM effects is to undertake repeated multiple baseline measurements, in an interrupted time series design, until a stable score
is achieved so as to reduce the margin of error of the test. This improves the validity
of associating any future gains with the intervention rather than RTM. This is often
done in cognitive psychology research in order to find an asymptote that is more
likely to reflect the true value of the construct being measured. MacArthur and Lembo (2009) evaluated cognitive strategy instruction for writing skills. The three participants wrote between three and five pre-test essays to obtain a stable baseline (p. 1029).
The post-test consisted of three more essays. For two students, post-test scores were
all higher than stable baseline scores. For the other student, a slight increase was
observed at post-test over baseline. The authors note that the percentage of nonoverlapping data between stable baseline and post-test was 100% (p. 1029).
Whilst such a research design is statistically more satisfactory, it is time consuming for the participant, teacher and policy-maker, and difficult to justify pedagogically. Randomisation is therefore probably a preferable method of addressing
the RTM problem, particularly as it also eliminates selection bias.
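Why repeated baseline measurement reduces the regression artefact can be illustrated by simulation. As a simplification, the sketch below stands in for "measuring until a stable score is achieved" by averaging k baseline tests; participants are then selected as the lowest quarter on that baseline, and no intervention is given. All parameter values and the function name are illustrative assumptions.

```python
import random
from statistics import fmean

def rtm_gain_with_k_baselines(k, n=40000, true_sd=10.0, error_sd=6.0, seed=3):
    """Mean spurious 'gain' of the lowest quarter when baseline = mean of k tests.

    No intervention is simulated, so any gain is a regression artefact.
    """
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        true_score = rng.gauss(50.0, true_sd)
        # Averaging k error-laden measurements shrinks the error in the baseline
        baseline = fmean(true_score + rng.gauss(0.0, error_sd) for _ in range(k))
        post = true_score + rng.gauss(0.0, error_sd)
        records.append((baseline, post))
    records.sort(key=lambda record: record[0])
    lowest_quarter = records[: n // 4]
    return fmean(post - baseline for baseline, post in lowest_quarter)

spurious_gain = {k: rtm_gain_with_k_baselines(k) for k in (1, 3, 5)}
for k, gain in spurious_gain.items():
    print(f"{k} baseline test(s): spurious gain of lowest quarter = {gain:+.2f}")
```

The spurious gain shrinks as more baseline measurements are averaged but does not vanish, which is consistent with describing this approach as only partly controlling for RTM.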
Pre-experimental designs do, however, have a role to play in educational
research. For example, before and after data can determine the promise of an intervention during its development phase. In this case researchers will investigate the
potential for an intervention to improve scores in an iterative cycle of testing and
developing, though the researcher should guard against over-interpretation beyond
the observation that the intervention has promise. Many of the studies we
reviewed also made useful contributions by demonstrating feasibility of implementation. However, pre-experimental research in which the observed magnitude of
gains over time is ascribed uniquely to a causal relationship between the intervention and the outcome measures is a concern. Furthermore, caution must be exercised when using pre-experimental research to inform sample size calculations for
RCTs because such studies over-estimate the intervention effects and lead to an
underestimation of the sample size (Torgerson & Torgerson, 2008).
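The consequence for trial planning can be shown with the usual normal-approximation formula for a two-arm comparison of means, n per group = 2(z_(1-alpha/2) + z_power)^2 / d^2. In this sketch the true standardised effect of d = 0.30 is a hypothetical value, and treating the 61% average overestimate reported by Lipsey and Wilson (1993) as a multiplicative inflation of d is a simplifying assumption made only for illustration.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate participants per arm for a two-sided, two-group comparison.

    Uses the normal approximation; exact t-based calculations give
    slightly larger numbers.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size_d ** 2)

true_d = 0.30                # hypothetical true effect size
inflated_d = true_d * 1.61   # same effect inflated by 61% (assumption)

needed = n_per_group(true_d)
planned = n_per_group(inflated_d)
print(f"per-arm sample size needed for d = {true_d:.2f}: {needed}")
print(f"per-arm sample size planned from d = {inflated_d:.3f}: {planned}")
```

With these assumed values the trial would be planned with 68 participants per arm when 175 are needed, so it would be substantially underpowered to detect the true effect.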
We do not know the extent to which the effects outlined earlier influenced the
findings reported in the studies we reviewed. Thirteen of the 16 studies included all
the participants in all analyses, and did not split the pre-test data into high and low
scorers. In such studies one might argue that the movement up to the mean from
the lower scorers and the movement down to the mean from the higher scorers
may have cancelled out the effects of RTM. However, this is by no means certain,
as the movement of the lower and upper outliers due to RTM may not have been
equivalent. Indeed, equal upwards and downwards movement is unlikely given the
combined effects of history, maturation, test effects and the intervention (experimental or comparison). The combined effects of these factors may reduce any
regression down to the mean of the higher scorers but increase the regression up to
the mean of the lower scorers. Clearly, some of the difference might be due to the
intervention actually being effective at improving the outcomes measured, but how
much, if any, is impossible to know due to the limitations within the design.
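The argument in this paragraph can be sketched as a small simulation. The model is an assumption made purely for illustration: each score is a stable ability plus independent measurement error, there is no intervention effect at all, and `shift` stands in for the combined influence of history, maturation and test effects. All names and parameter values are hypothetical.

```python
import random
from statistics import mean, median


def mean_gains_by_pretest_half(n=20000, shift=0.0, seed=1):
    """Return (mean gain of lower pre-test half, mean gain of upper half).

    Scores are ability + independent error, so pre- and post-test are
    imperfectly correlated and regression to the mean (RTM) occurs.
    `shift` is added to every post-test score; no intervention is modelled.
    """
    rng = random.Random(seed)
    pre, post = [], []
    for _ in range(n):
        ability = rng.gauss(50, 10)
        pre.append(ability + rng.gauss(0, 10))
        post.append(ability + rng.gauss(0, 10) + shift)
    cut = median(pre)
    low = [b - a for a, b in zip(pre, post) if a < cut]
    high = [b - a for a, b in zip(pre, post) if a >= cut]
    return mean(low), mean(high)


# RTM alone: lower scorers rise and higher scorers fall by similar amounts.
low0, high0 = mean_gains_by_pretest_half(shift=0.0)
# RTM plus a uniform upward shift: the two movements are no longer equal.
low3, high3 = mean_gains_by_pretest_half(shift=3.0)
print(round(low0, 2), round(high0, 2), round(low3, 2), round(high3, 2))
```

With no shift, the upward and downward movements approximately cancel in the whole-group mean. Once every post-test score is raised by the same amount, the apparent gain of the lower scorers grows and the apparent loss of the higher scorers shrinks, even though nothing in the model is an effect of instruction.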
Conclusions
In our small-scale methodological review of pre-experimental studies we have illustrated that a number of authors of such research designs did not take into account
the potential biasing effects of history, maturation, test effects and RTM in the
discussion of their results. We found several studies that divided the participants
on the basis of their pre-test scores into low and high achievers and argued that an
intervention was more beneficial for those with low scores at baseline, but did not
discuss RTM as a possible factor influencing this finding.
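Why such a subgroup finding is expected under RTM alone can be shown with a short worked formula. It is a sketch under assumptions that are ours, not those of the reviewed studies: pre-test X and post-test Y are taken to be jointly normal with the same mean and standard deviation, a test-retest correlation below one, and a gain that is identical for every participant.

```latex
% Assumed notation (for illustration only):
%   X = pre-test score, Y = post-test score,
%   \mu = common mean, \rho = test-retest correlation (\rho < 1),
%   \delta = a gain that is the same for every participant
%            (history, maturation, test effects or a genuine effect).
\begin{align}
  E[Y \mid X = x]     &= \mu + \delta + \rho\,(x - \mu), \\
  E[Y - X \mid X = x] &= \delta - (1 - \rho)\,(x - \mu).
\end{align}
```

For participants who scored below the mean at pre-test, the second term is positive, so their expected gain exceeds the uniform gain; for those above the mean it falls short of it. An apparently greater benefit for low scorers therefore arises even when the intervention works equally well, or not at all, for everyone.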
In pre-experiments, history, maturation, test effects or RTM effects may not
explain all of the pre-post differences observed in these studies, and the experimental interventions may be responsible for some of the effects observed. However, because random allocation to experimental and comparison groups was not
used, we cannot tell the extent to which the differences were due to history, maturation, test effects or the regression artefact. We know, however, that some of the
observed difference is likely to be artefactual.
Randomised controlled trials are widely used to control for selection bias, that
is, where participants are selected on characteristics that may bias the results. This
paper has highlighted how randomised control groups are also important to
control for history, maturation, test effects and the RTM phenomenon. Our review found
about one fifth of the evaluation studies did use a comparison group, and about
half used pre-intervention measures in addition to a comparison group, some with
random allocation. This illustrates that such designs are feasible in instructional
settings.
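How a randomised comparison group removes these artefacts can be sketched with a simulation. The model and every value in it are illustrative assumptions: low pre-test scorers are recruited (so RTM is present), every participant's post-test is raised by a common `shift` (standing in for history, maturation and test effects), and only the randomly allocated treated arm receives `true_effect`.

```python
import random
from statistics import mean


def simulate_trial(n=20000, true_effect=2.0, shift=3.0, seed=7):
    """Return (single-group estimate, randomised-comparison estimate)."""
    rng = random.Random(seed)
    gains = {"treated": [], "control": []}
    for _ in range(n):
        ability = rng.gauss(50, 10)
        pre = ability + rng.gauss(0, 10)
        # Random allocation, independent of the pre-test score.
        arm = "treated" if rng.random() < 0.5 else "control"
        post = ability + rng.gauss(0, 10) + shift
        if arm == "treated":
            post += true_effect
        # Recruit only low scorers, so RTM inflates gains in both arms.
        if pre < 45:
            gains[arm].append(post - pre)
    naive = mean(gains["treated"])                 # effect + shift + RTM
    controlled = naive - mean(gains["control"])    # artefacts cancel
    return naive, controlled


naive, controlled = simulate_trial()
print(round(naive, 2), round(controlled, 2))
```

In this sketch the pre-post gain of the treated group alone is several times larger than the effect built into the model, whereas the difference in gains between the two randomised arms lies close to it, because history, maturation, test effects and RTM act on both arms alike.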
Acknowledgements
We thank David Torgerson for his useful comments and suggestions on an earlier
draft of the paper.
Notes
1. The data were, in fact, from a trial that used matched randomisation to an experimental and a comparison group, thereby controlling for RTM effects.
2. For reasons of space statistics are not provided, but the data can be found in Marsden, 2004.
3. The majority of studies that were NOT evaluation studies aimed to explore potential relationships, define constructs, or document processes.
Notes on contributors
Emma Marsden is Senior Lecturer in Language Education in the Department of
Education at the University of York. Her interests are in research methods
and design, both for general educational and applied linguistics research,
language learning theories, and foreign language education. She has worked
on projects funded by the ESRC, DfES, British Academy, HEA LLAS
Subject Centre and the British Council.
Carole Torgerson is Professor of Education in the School of Education at Durham
University. Her main methodological research interests are in experimental
methods (randomised controlled trials and quasi-experiments) and research
synthesis. She has received awards from the DfES, DfCSF, Home Office,
ESRC, HEA, CfBT, HEFCE, NIHR and a range of other organisations.
References
Annetta, L., Mangrum, J., Holmes, S., Collazo, K. & Cheng, M. (2009) Bridging realty to virtual reality: investigating gender effect and student engagement on learning through video game play in an elementary school, International Journal of Science Education, 31(8), 1091-1113.
Bell, J., Donnelly, J., Homer, M. & Pell, G. (2009) A value-added study of the impact of science curriculum reform using the national pupil database, British Educational Research Journal, 35(1), 119-135.
Benati, A., Lee, J. & McNulty, E. (2010) Exploring the effects of Processing Instruction on a discourse-level guided composition, in: A. Benati & J. Lee (Eds) Processing instruction and discourse (London, Continuum), 97-147.
Ben-David, A. & Zohar, A. (2009) Contribution of meta-strategic knowledge to scientific inquiry learning, International Journal of Science Education, 31(12), 1657-1682.
Brady, S., Gillis, M., Smith, T., Lavalette, M., Liss-Bronstein, L., Lowe, E., North, W., Russo, E. & Wilder, T.D. (2009) First grade teachers' knowledge of phonological awareness and code concepts: examining gains from an intensive form of professional development and corresponding teacher attitudes, Reading and Writing: An Interdisciplinary Journal, 22(4), 425-429.
Campbell, D.T. & Stanley, J.C. (1963) Experimental and quasi-experimental designs for research (Chicago, IL, Rand McNally).
Cook, T.D. & Campbell, D.T. (1979) Quasi-experimentation: design and analysis issues for field settings (Boston, MA, Houghton Mifflin).
Ducate, L. & Lomicka, L. (2009) Podcasting: an effective tool for honing language students' pronunciation? Language Learning and Technology, 13(3), 66-86.
Evagorou, M., Korfiatis, K., Nicolaou, C. & Constantinou, C. (2009) An investigation of the potential of interactive simulations for developing thinking skills in elementary school: a case study with fifth-graders and sixth-graders, International Journal of Science Education, 31(5), 655-674.
Galton, F. (1886) Regression towards mediocrity in hereditary stature, The Journal of the Anthropological Institute of Great Britain and Ireland, 15, 246-263.
Grace, M. (2009) Developing high quality decision-making discussions about biological conservation in a normal classroom setting, International Journal of Science Education, 31(4), 551-570.
Graebner, I.T., de Souza, E.M.T. & Saito, C.H. (2009) Action-research and food and nutrition security: a school experience mediated by conceptual graphic representation tool, International Journal of Science Education, 31(6), 809-827.
Guisasola, J., Solbes, J., Barragues, J.-I., Morentin, M. & Moreno, A. (2009) Students' understanding of the special theory of relativity and design for a guided visit to a science museum, International Journal of Science Education, 31(15), 2085-2104.
Jones, G., Taylor, A. & Broadwell, B. (2009) Estimating linear size and scale: body rulers, International Journal of Science Education, 31(11), 1495-1509.
Lipsey, M.W. & Wilson, D.B. (1993) The efficacy of psychological, educational and behavioral treatment: confirmation from meta-analysis, American Psychologist, 48(12), 1181-1209.
MacArthur, C.A. & Lembo, L. (2009) Strategy instruction in writing for adult literacy learners, Reading and Writing, 22(9), 1021-1039.
Marsden, E. (2004) Teaching and learning of French verb inflections: a classroom experiment using processing instruction. Unpublished Ph.D. dissertation, University of Southampton.
Marsden, E. (2006) Exploring input processing in the classroom: an experimental comparison of processing instruction and enriched input, Language Learning, 56, 507-566.
McCutchen, D., Green, L., Abbott, R. & Sanders, E. (2009) Further evidence for teacher knowledge: supporting struggling readers in grades three through five, Reading and Writing: An Interdisciplinary Journal, 22(4), 401-423.
Miedijensky, S. & Tal, T. (2009) Embedded assessment in project-based science courses for the gifted: insights to inform teaching all students, International Journal of Science Education, 31(18), 2411-2435.
Moore, M. & Wade, B. (1998) Reading and comprehension: a longitudinal study of ex-Reading Recovery students, Educational Studies, 24, 195-203.
Newton, D.P. & Newton, L.D. (2009) Knowledge development at the time of use: a problem-based approach to lesson planning in primary teacher training in a low knowledge, low skill context, Educational Studies, 35(3), 311-321.
Norris, J. & Ortega, L. (2000) Effectiveness of L2 instruction: a research synthesis and quantitative meta-analysis, Language Learning, 50, 417-528.
O'Byrne, B. (2009) Knowing more than words can say: using multimodal assessment tools to excavate and construct knowledge about wolves, International Journal of Science Education, 31(4), 523-539.
Park, H., Khan, S. & Petrina, S. (2009) ICT in science education: a quasi-experimental study of achievement, attitudes toward science, and career aspirations of Korean middle school students, International Journal of Science Education, 31(8), 993-1012.
Shadish, W.R., Cook, T.D. & Campbell, D.T. (2002) Experimental and quasi-experimental designs for generalized causal inference (Boston, Houghton Mifflin).
Sherin, M.G. & van Es, E.A. (2009) Effects of video club participation on teachers' professional vision, Journal of Teacher Education, 60(1), 20-37.
Sherrod, S.E. & Wilhelm, J. (2009) A study of how classroom dialogue facilitates the development of geometric spatial concepts related to understanding the cause of moon phases, International Journal of Science Education, 31(7), 873-894.
Spalding, E., Wang, J., Lin, E. & Hu, G. (2009) Analyzing voice in the writing of Chinese teachers of English, Research in the Teaching of English, 44(1), 23-51.
Taylor, A. & Jones, G. (2009) Proportional reasoning ability and concepts of scale: Surface area to volume relationships in science, International Journal of Science Education, 31(9), 1231-1247.
Thorndike, R.L. (1942) Regression fallacies in the matched groups experiment, Psychometrika, 7, 85-102.
Torgerson, C. & Torgerson, D. (2008) Designing and running randomised trials in health, education and the social sciences: an introduction (Basingstoke, Palgrave Macmillan).
Tsaparlis, G. & Papaphotis, G. (2009) High-school students' conceptual difficulties and attempts at conceptual change: the case of basic quantum chemical concepts, International Journal of Science Education, 31(7), 895-930.
Wilhelm, J. (2009) Gender differences in lunar-related scientific and mathematical understandings, International Journal of Science Education, 31(15), 2105-2122.
Brady, Gillis,
Smith,
Annetta,
Mangrum,
Holmes,
Collazo &
Cheng.
Study
U.S.A.
n=74 students
First grade
teachers;
The gain from pretest to post-test

overall was
significant for the
sample exposed to
MEGA ... (p.
1100). The overall
gain from pre-test to
post-test was
significant (0.000), f
= 67.02 (p. 1100)
Tables 2 & 3.
No.
(Continued)
RTM mentioned?
(Points that could
relate to RTM, as
interpreted by
reviewers)
No.
Yes. the MEGA
integrated into an
elementary school
science class did result
in the learning of key
science concepts for
fifth-grade boys and
girls learning of
simple machines
(p. 1104).
Examine efficacy of Scores on knowledge Yes. Encouragingly,

an intensive form of survey: indicated
this model of PD
Measures: pre and

post test measuring
basic knowledge of
the six simple
machines and the
purpose of each.
Elementary
school;
Sample size
5th grade
Examine students
learning of simple
students of
machines;
varying
academic levels
1011 years;
Measures
Participants;
Science
Education;
Topic;
Setting;
Country
Objective
of study;
Results, as
reported by
authors, including
statistics and
Did authors ascribe
references to
a causal
relevant tables
where appropriate relationship?
Data extracted from included studies with a pre-experiment (single group pre-post test) design
Appendix

597
Elementary
school;
Professional
Lavalette,
development of
Lissteachers;
Bronstein,
Lowe, North,
Russo &
Wilder
Study
Topic;
Setting;
Country
Appendix (Continued)
Measures
Participants;
Sample size
professional
development for
building the
knowledge of firstgrade teachers in the
areas of phonological
awareness and
phonics;
Objective
of study;
weak knowledge of
phonological
awareness and
phonics concepts
prior to PD [average
42.6% correct] and
large, significant
gains in each year by
year-end [average
74.1% correct] on
all [three] subtests
and on the total
score (pp. 436,
437). Repeated
Measures ANOVAs
showed statistically
significant differences
for phonological
awareness, Code,
Fluency and Oral
language, (p. 437).
With large effect
sizes for [PA] and
[C] (.73 and .80) (p.
443).
Attitudes: Repeated
measure ANOVAs
RTM mentioned?
(Points that could
relate to RTM, as
interpreted by
reviewers)
(Continued)
generated substantial
overall gains on the
survey of teachers
knowledge (p. 443)
Thus, there are effects
for time with final
teacher knowledge
scores ... being
significantly higher
than their respective
beginning scores (p.
437) assessment of
teacher attitudes
indicates that positive
feelings about the PD
increased, as did
personal commitment
to participate (p. 439)
The present study
demonstrated the value
of an intensive form of
PD provided by skilled
mentors for building
teacher knowledge (p.
447).
But authors state: At (But authors state
this point one cannot we only have been
Results, as
reported by
authors, including
statistics and
Did authors ascribe
references to
a causal
relevant tables
598
Ducate &
Lomicka
Study
Foreign
language
education;
U.S.A.
Topic;
Setting;
Country
Measures
Participants;
Sample size
n=57 (n=65
for analysis of
Teacher
Knowledge)
Measures: Survey to
assess teacher
knowledge needed to
teach basic reading
skills and a Teacher
Attitude Survey
(TAS) to measure
attitudes to the
professional
development.
Undergraduates Using podcasts to
improve foreign
(1822 years
language
old);
pronunciation;
Objective
of study;
conclude whether the

gains stem from the
extent of classroom
support provided for
individual teachers or
from other attributes
(p. 444). Other factors
mentioned such as
variable administrative
support and sufficient
time and resources for
PD meetings.
And authors state The
instrument has not
been normed or
standardized p. 446).
Yes, indirectly.
Negative result
reported, causal
relationship implied
indicated significant
effects of time with
higher scores at the
end of the year for
self-efficacy
and positive
attitudes toward PD
lower scores at
the end of the year
on negative attitudes
(p. 438). Tables 3 &
4.
No statistically
significant
improvement in
most of the measures
Results, as
reported by
authors, including
statistics and
references to
Did authors ascribe
relevant tables
a causal
No.
(Continued)
able to identify a
modest portion of
the variance
accounting for
teachers responses
and scores)
(p. 446).
RTM mentioned?
(Points that could
relate to RTM, as
interpreted by
reviewers)

599
Study
U.S.A.
University;
Topic;
Setting;
Country
Sample size
Attitudes towards
pronunciation using
Pronunciation
Attitude Inventory.
Measures: Pre and

post assessments of
speech samples, from
identical scripted
podcast, for
comprehensibility
and accentedness.
Measures
Participants;
n=12 German
students; n=10
French
students.
Objective
of study;
(comprehensibility,
accentedness, and
attitudes towards
pronunciation) (p.
73). One sig
difference in French
comprehensibility
ratings (p. 73).
(p. 67). Reasons given

for lack of gains: 16
weeks is not a
sufficient amount of
time to make gains in
pronunciation ;
podcasting and
repeated readings
alone are not enough
to improve
pronunciation over an
academic semester
(pp.76, 77).
Results, as
reported by
authors, including
statistics and
Did authors ascribe
references to
a causal
relevant tables
(Continued)
(But authors argue

gains were not
observed because
pre-test scores were
high and future
research should be
done with lower
achievers as
significant
improvement may
then be detected (p.
77).)
RTM mentioned?
(Points that could
relate to RTM, as
interpreted by
reviewers)
600
Grace
Measures
Participants;
Sample size
n=13
Measures: 2 tests,
both used as pre and
post tests. Each test
with tasks
corresponding to
seven thinking skills.
1112 year olds; To investigate the

impact of a
simulation-based
learning environment
on students
development of
system thinking
skills;
Objective
of study;
Cyprus.
Decision-making 1516 year olds; Can peer group
in science
decision-making
classrooms;
discussions help
develop students
Elementary
school;
Thinking skills;
Evagorou,
Korfiatis,
Nicolaou &
Constantinou.
Study
Topic;
Setting;
Country
About three quarters

of the students
modified their
proposed solutions
Considerable
improvements in the
participants system
thinking skills, on six
measures, but not on
feedback thinking
(pp. 664669 and
Tables 2, 3 & 4).
Yes and No.

Discussions had a
marked impact on
students proposed
No.
(Continued)
RTM mentioned?
(Points that could
relate to RTM, as
interpreted by
reviewers)
No.
Yes (and indirectly,
no). The proposed
learning environment
provoked considerable
improvements in some
system thinking skills
during a relatively brief
learning process.
(p. 656) after the
instruction the total
number of referred
elements increased
(p. 664). We have to
admit the failure of
our intervention in
promoting feedback
thinking (p. 671).
But authors state that
results: could be
positively affected by
the fact that [the
students] voluntarily
participated in the
project (p. 669).
Results, as
reported by
authors, including
statistics and
Did authors ascribe
references to
a causal
relevant tables

601
Study
Measures
Participants;
Sample size
n=131 (4 intact
classes)
Topic;
Setting;
Country
Secondary
school;
personal reasoning in
relation to
conservation issues?
Measures: Pre and
post questionnaire
Objective
of study;
(p. 557). A
comparison of preand post-test
comments revealed a
general shift to
higher-level
responses following
the discussions
(p. 559). 54% of
student exhibited an
increased quality of
response; 40%
remained at the same
level, and 6%
dropped down a
level Almost 20% of
students moved from
level 3 to level 4 (p.
559).
But it is not possible
to establish with
certainty that the
differences between an
individuals pre-test
and post-test
statements were the
direct result of the
solutions (p. 567).

And the time span
between the pre and
post-tests was
considered short
enough to minimise
the possible impact of
other external
influences (p. 557).
And most students
knowledge and
awareness of values
increased after peer
discussion (p. 567).
Results, as
reported by
authors, including
statistics and
references to
Did authors ascribe
relevant tables
a causal
(Continued)
RTM mentioned?
(Points that could
relate to RTM, as
interpreted by
reviewers)
602
Guisasola,
Solbes,
Barragues,
Morentin &
Moreno
Study
Objective
of study;
Measures
Participants;
Sample size
U.K.
How does a museum
Physics education 1st year
undergraduates; visit influence
in Engineering
students
course;
understanding of the
Special theory of
Relativity (STR) and
its applications? Do
students use more
scientific arguments
when discussing
topics related to the
Special theory of
Relativity after
visiting the
exhibition?
University
n=35
Measures: to
measure
understanding, a
questionnaire as the
pre-test; a written
report structured
around similar
questions to pre-test
for the post test.
Topic;
Setting;
Country
Increases between
pre and post
measures in
understanding.
Figures 1, 2 and 3
showing differences
between pre and post
measures for: correct
explanations of
aspects of STR;
scientific arguments
applied; proportions
of three or more
mentions of
applications.
Acknowledge testeffect: change in the

students
understanding, can be
(Continued)
RTM mentioned?
(Points that could
relate to RTM, as
interpreted by
reviewers)
Yes. The results show No.

that the teaching
sequence and
exhibition visit have
increased the students
interest, knowledge,
and understanding of
the STR and its
applications
(p. 2100).
discussions (p.
556).
Results, as
reported by
authors, including
statistics and
Did authors ascribe
references to
a causal
relevant tables

603
Spain
Jones, Taylor, & Maths

Broadwell
Education;
Study
Topic;
Setting;
Country
Sample size
6th 9th grade
Measures
Participants;
(Continued)
RTM mentioned?
(Points that could
relate to RTM, as
interpreted by
reviewers)
attributed not only to

the visit to the
museum, but rather to
the overall pre-visit,
visit, and post-visit
teaching process
(p. 2091).
No.
Yes. Results of a
The results of this
To examine the
study revealed that paired-sample t-test
impact of teaching
students to use their teaching students to suggested that the
significant changes in
use their body as a
bodies as rough
measurement tools rough measurement pre-test and post-test
scores for the LMA
tool increased
on their ability to
were not due to
their ability to
estimate linear
accurately estimate random chance but
measurements.
instead are probably
linear sizes (see
Table 1). The mean due to the intervention
score on the pre-test the students received
as a result of
for the LMA was
completing the metric
26.21 (SD = 4.57)
measurement tasks
whereas the mean
(p. 1504) .
score for the posttest increased to
30.68 (SD = 3.43).
(p. 1504, and see p.
1495); Paired t-tests
found significant
Objective
of study;
Results, as
reported by
authors, including
statistics and
Did authors ascribe
references to
a causal
relevant tables
604
Miedijensky
&Tal
Study
Measures
Participants;
Sample size
1113 years;
n=19
1215 year olds; To document
student views on and
reactions to
assessment for
learning, amongst
students taking oneyear project-based
science courses for
the gifted.
Topic;
Setting;
Country
Summer Camp;
U.S.A.
Impacts of
assessment for
learning
approach
amongst gifted
and talented;
pull-out
programme for
gifted and
talented;
Measure: 20 item
test to assess
understanding of
metric scale.
Objective
of study;
Significant
differences between
the pre/post
questionnaires were
found with regard to
the three main
categories and most
of the subcategories.
differences for
object estimation ,
kinaesthetic estimation
..., and body ruler
(p. 1504)
(Continued)
RTM mentioned?
(Points that could
relate to RTM, as
interpreted by
reviewers)
No.
Yes. Causal links
between experiencing
AFL and positive
views about AFL: Our
findings indicate
positive impacts of
AFL on the students
views of assessment
(p. 2430). Also,
relationship between
But authors state: All

of the participants
were volunteers
enrolled in science
summer camp. As
such, they are most
probably not
representative of the
variation that would
exist for all students of
this age (p. 1503).
Results, as
reported by
authors, including
statistics and
Did authors ascribe
references to
a causal
relevant tables

605
Study
Measures
Participants;
Sample size
n=86
Topic;
Setting;
Country
Israel
Measures: pre-post
questionnaire:
general view of
assessment, ideas
about assessment
modes, and
relationships
Objective
of study;
AFL and learning:

Figure 3 and By
following the students
through ... projects,
developing the
assessments with them,
... we showed how
assessment supported
learning the
findings of this study
have strengthened our
belief that the
students voice is
important to further
improve the
assessment and its
impact on learning
(p. 2432).
But authors state:
significant shift
Since the courses took
toward a more
complex view of the time, and the meetings
different dimensions occurred only once a
of assessment p. week, one could claim
2421; Table 4 & 5; that other factors such
as the regular school
Figure 2
or even time and
maturation should be
Results, as
reported by
authors, including
statistics and
Did authors ascribe
references to
a causal
relevant tables
(Continued)
RTM mentioned?
(Points that could
relate to RTM, as
interpreted by
reviewers)
606
Topic;
Setting;
Country
Teacher
education;
Study
Newton &
Newton
Measures
Sample size
between assessment
and learning;
12 post-treatment
interviews
Primary teacher Impact of a problemtrainees, and
solving approach to
PGCE tutors;
lesson planning in an
area where trainees
Objective
of study;
Participants;
Statistically
significant increase,
(very large effect
size), in students
reported confidence
Yes. there was a

very large increase in
student confidence in
planning science
lessons which they
considered to
contribute to this shift.
We cannot entirely
denounce this concern;
however, our data
indicate nothing of this
sort of assessment was
employed in the
regular schools; and
indicate the students
strongly associated
their views to the
assessment
components, and
provided relevant
examples that support
our claim (p. 2430).
Results, as
reported by
authors, including
statistics and
Did authors ascribe
references to
a causal
relevant tables
No.
(Continued)
RTM mentioned?
(Points that could
relate to RTM, as
interpreted by
reviewers)

607
Study
Measures
Participants;
Sample size
n=75 PGCE
students; and
Topic;
Setting;
Country
PGCE teacher
training course;
Measures: Before
and after
comparisons of a)
have little subject

knowledge.
Objective
of study;
in planning to teach, ascribed to the PBL

element (effect size,
mean score
increasing from 3.24 2.17) (pp. 318319).
at the outset to 6.49
effect size, 2.17
(pp. 317319). The
grades awarded for
solutions to
Problems 1 and 6
suggested an increase
in the students
lesson planning skills
(effect size, 0.95) (p.
319); and the mean
score for the
relatively easy
Problem 1 was 4.40.
For the relatively
difficult Problem 6,
it was 6.82, an
increase that was
statistically
significant (p. 318).
these judgements [by
the tutor about quality
Results, as
reported by
authors, including
statistics and
Did authors ascribe
references to
a causal
relevant tables
(Continued)
RTM mentioned?
(Points that could
relate to RTM, as
interpreted by
reviewers)
608
OByrne
Study
reported selfconfidence and b)
quality of solutions
to problems.
Sample size
n=3 PGCE
tutors
Primary school;
n (aggregated
over 3
consecutive
years)=43;
concept of form
emerged once they
were provided carefully
constructed assessment
and learning tasks that
drew out this
knowledge (p. 533);
Some of these shifts
were no doubt the
result of small-group
investigations
(p. 534). Multimodal
(Continued)
RTM mentioned?
(Points that could
relate to RTM, as
interpreted by
reviewers)
Yes. Their mastery of No.

the
of lesson plans] were

subjective and not
based on objective
criteria (p. 318).
Results, as
reported by
authors, including
statistics and
Did authors ascribe
references to
a causal
relevant tables
Descriptive statistics
of individual test
items (pp. 529535).
Tables 1 & 2.
Measurementdependent gains
between pre and post
test, e.g. Every pair
of prepost-unit
drawings showed
refinement of
concepts of wolves in
Measures: Pre-post nature as distinct
True-false test; Pre- from fictional or
imaginary wolves,
post drawings of
more commonly
wolves.
featured in pre-unit
drawings (p. 538).
Measures
Participants;
Impact of multimodal activities and

small group
discussion on
developing
understanding of
wolves;
Objective
of study;
U.K.
Understanding of Grade 2 pupils
wolves;
from one
school;
Topic;
Setting;
Country

609
Park, Khan &

Petrina
Study
Measures
Participants;
Sample size
Middle school;
Measures:
Comparisons of
Impact of Computer
Assisted Instruction
(CAI) on
achievement and
attitudes;
Objective
of study;
After one year

aggregated n=28
ICT in science
Grade 8
education;
students from
one school;
U.S.A.
Topic;
Setting;
Country
... After CAI classes

students
achievement in
science improved
significantly ... ,
[and] The mean
differences of
students Attitude to
Science before and
after CAI were
significant ... , with
students having more
positive attitudes
towards science after
CA. (Table 1, p.
1003).
But Although there

are a number of
(Continued)
RTM mentioned?
(Points that could
relate to RTM, as
interpreted by
reviewers)
No.
Yes. CAI was
significantly correlated
with improvement in
most of the
achievement groups
(p. 1003).
Collectively, student
achievement in the
post-achievement test
improved significantly
compared to their
achievement in science
prior to CAI (p.
1006).
Assessment Tools
Support Gradual
Concept Change (p.
537).
Implicit in design is
that test-effect used,
indirectly, as a learning
tool.
Results, as
reported by
authors, including
statistics and
Did authors ascribe
references to
a causal
relevant tables
610
n=234
Korea
Teachers
Sample size
Impact on attitudes
to science and future
courses and career
aspirations measured
by pre- and postquestionnaires.
Document effect of
discussing video
recorded lessons
(video clubs) on
teacher learning.
post-achievement
test to students
Grade Point Average
in the previous year.
Measures
Participants;
Topic;
Setting;
Country
Sherin & van Es Maths Teacher

education;
Elementary and
middle schools;
U.S.A.
Study
Objective
of study;
Quantitative
improvement on all
indicators of
teachers attention to
students
mathematical
thinking, amongst all
teachers on all
measures. Tables 3
11, pp. 2532 ... not
only did the teachers,
over time, come to
use more
sophisticated
(Continued)
RTM mentioned?
(Points that could
relate to RTM, as
interpreted by
reviewers)
No.
Yes. there is a
strong alignment
between the reasoning
strategies developed in
the video club and
those displayed in the
later classroom
observations.
factors that likely

contributed to the
outcomes measured in
this study, the
potential contributions
of CAI on low
achieving students and
girls in science are
intriguing (p. 1008).
Results, as
reported by
authors, including
statistics and
Did authors ascribe
references to
a causal
relevant tables

611
n=4 + 7
Understanding
7th grade;
geometric spatial
Measures: quantity
and nature of
discussion during
video clubs (prepost); classroom
observations (earlylate); noticing
interviews (pre +
post)
Will classroom
dialogue facilitate
Sample size
Sherrod &
Wilhem
Measures
Participants;
Study
Objective
of study;
Topic;
Setting;
Country
Learners
demonstrated new
strategies for
reasoning about
student thinking ...
they also came to
notice more complex
issues of student
thinking (p. 27).
And in Meeting 1,
only 25% of the
comments about the
student concerned
mathematical
thinking ... in
Meeting 10, 92% ...
were ... to do with
mathematical
thinking (p. 28).
Yes (implicitly).
Following classroom

the causal relationship
may be bidirectionalteachers
instruction and video
clubs mutually
influencing (p. 33).
Results, as
reported by
authors, including
statistics and
references to
Did authors ascribe
relevant tables
a causal
No.
(Continued)
RTM mentioned?
(Points that could
relate to RTM, as
interpreted by
reviewers)
612
Middle school;
concepts related
to the cause of
lunar phases;
Measures
Participants;
Sample size
Effect of 3 week
summer writing
workshop on
n= 92 (5 classes Measures: 2 D
taught by same drawings, before and
teacher)
after dialogue. And
the Lunar Phases
Concept Inventory
pre- and postmeasure.
students
understanding of
lunar concepts
related to geometric
spatial visualisation?
Objective
of study;
U.S.A.
Spalding,
Development of Chinese
Wang, Lin & writing in English teachers of
English as a
Hu
as a foreign
language;
Study
Topic;
Setting;
Country
Scoring showed that

the teachers writing
improved
significantly in the
understanding of
three concepts
tested: significant
gains in scientific
and geometric spatial
understanding ,
but also
accelerated
dedication to inquiry
teaching (p. 877);
increased proportions
of students
demonstrating
understanding
(p. 881).
(Continued)
RTM mentioned?
(Points that could
relate to RTM, as
interpreted by
reviewers)
No.
Yes. The writing
course led to
significant gains in
writing scores (p. 48).
discourse, 8.7%, 6.5%

and 7.6%
demonstrated new
understanding of the
geometric
configuration
Tables 3, 4 & 5
(p. 882).
Results, as
reported by
authors, including
statistics and
Did authors ascribe
references to
a causal
relevant tables

613
China.
Taylor & Jones Science

education;
Study
Topic;
Setting;
Country
Measures
Participants;
Sample size
Measures: Pre- and

post-workshop
writing samples,
assessed using the 6
+ 1 Trait[R]
analytical model.
1113 year olds Impact of a series of
science investigations
enrolled on a
science summer on improving
understanding of
camp;
surface area to
volume relationships
n=57
foreign language teachers English

(grades 312); writing.
Objective
of study;
A significant
correlation between
proportional
reasoning ability and
students'
understanding of
surface area to
volume relationships.
Mean score on the
pre-test was 54.42
(SD 20.41) whereas
the mean score on
the post-test
increased to 75.89
(SD 19.71) (pp.
course of the
institute (p. 23).
Greatest gains in
voice. Tables 1, 2
& 3.
No.
Yes. Results of a
paired-sample t-test
suggested that the
significant changes in
pre-test and post-test
for the ASAVA were
not due to random
chance but instead are
probably due to the
intervention the
students received as a
result of completing
the surface area to
volume application
tasks (p. 1236).
Wilhelm
n=19
Middle level
students;
Middle school;
U.S.A.
Science
education;
Examine gender
differences in lunar
phases
understanding.
Measures:
Understanding tested
by pre-post
achievement tests.
Relationship of
understanding to
proportional
reasoning ability
tested in one off
achievement test.
The mean pre-test score was 31.2% and the mean post-test score was 52.9%. A repeated-measures ANOVA revealed a significant increase from pre-test to post-test on overall test scores (see Table 3) (p. 2112).
1235–1236, 1236–1237).
No.
Yes. The partial η² value of 0.703 indicates that approximately 70.3% of the gain in lunar-related understanding can be directly attributed to the inquiry Moon unit (p. 2113). The partial η² value of 0.151 indicates that approximately 15.1%
of the gain in lunar-
n=123
University;
U.S.A.
Measures: a Lunar
Phases Concept
Inventory (20 item)
and a Geometric
Spatial Assessment
(GSA) (16 item).
GSA: The mean pre-test score was 49.4% and the mean post-test score was 56.2%. A repeated-measures ANOVA revealed a significant increase (see Table 6) (p. 2116).
related understanding
can be directly
attributed to the
inquiry Moon unit
(p. 2116). Findings
suggest that both
scientific and
mathematical
understandings can be
significantly improved
for both sexes through
the use of spatially
focused, inquiry-oriented curriculum
such as REAL
(pp. 2105, 2120).
But authors state: 'The other 30% [of gain in scores] could be attributed to differential maturation, differential exposure to the intervention, differential motivation, and so forth' (p. 2113).
