An Introduction to Meta-Analysis
With Articles From The Journal of
Educational Research (1992-2002)
ELISHA A. CHAMBERS
University of Missouri-St. Louis
The Journal of Educational Research, September/October 2004 [Vol. 98(No. 1)]
ABSTRACT The author provides researchers, editors, and readers with recommendations and guidelines for meta-analytic research. These recommendations are for procedures and for reporting meta-analytic findings. An applied example illustrates a meta-analysis with a data set and considers 2 research questions. The four main sections include (a) a description of meta-analytic research methodology, (b) an illustration of a meta-analysis, (c) a review of The Journal of Educational Research articles in which meta-analyses were investigated from 1992-2002, and (d) recommendations and guidelines for conducting and reporting meta-analyses.
Key words: guidelines for reporting meta-analytic research, meta-analyses reviewed in The Journal of Educational Research, years 1992-2002
One of the most salient ways to quantitatively synthesize research findings is through a meta-analysis. Gene V Glass first introduced the meta-analytic technique in 1976. This statistical procedure requires the investigator to calculate effect sizes from data reported in research articles. Hedges, Shymansky, and Woodworth (1989) explained that the procedures for conducting meta-analysis are composed of a series of steps in which the earliest stage is forming the problem and the last stage is preparing a report that includes methods, data, and a discussion. Meta-analytic results allow the researcher to report on the efficacy of a program or intervention where the reported effect size is a standardized value that can be interpreted in terms of proportions or percentiles. Furthermore, one can use the effect size to examine the relationship between variables; for example, in a meta-analytic investigation, Cohen (1981) considered the relationship between student ratings of instruction and achievement.
The goal of this article was to provide researchers, editors, and readers with recommendations for procedures and guidelines for reporting meta-analytic research. An applied example is presented (see the section on the efficacy of computers in K-12 classrooms) that illustrates a meta-analysis with a data set and considers two research questions. There are four main sections in this article. The first section describes meta-analytic research methodology. The second section illustrates a meta-analysis. The third section reviews The Journal of Educational Research articles from 1992-2002 in which meta-analyses have been investigated. The fourth, and final, section examines recommendations for conducting and reporting meta-analyses.
Description of Methodology
Meta-analysis, as introduced by Glass (1976), requires the investigator to calculate effect sizes from data reported in research articles. Glass stated the following:

Primary analysis is the original analysis of data in a research study. It is what one typically imagines as the application of statistical methods. Secondary analysis is the reanalysis of data for answering the original research question with better statistical techniques, or answering new questions with old data. Meta-analysis refers to the analysis of analyses. I use it to refer to the statistical analysis of a large collection of analysis results from individual studies for the purpose of integrating the findings.
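The statistical analysis that is conducted on the research findings yields an effect size. An effect size is a standardized value that is computed by subtracting the control group mean from the experimental group mean (or one group mean from another) and then dividing the difference by either the control group standard deviation or the pooled standard deviation (see Glass, 1978). The formula is as follows:

$$ES = \frac{\bar{X}_{E} - \bar{X}_{C}}{SD}$$

where $\bar{X}_{E}$ is the experimental group mean, $\bar{X}_{C}$ is the control group mean, and $SD$ is either the control group standard deviation or the pooled standard deviation.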
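To make the formula concrete, here is a minimal Python sketch (the article itself reports using SPSS, not Python). It computes the standardized mean difference with the control-group SD in the denominator (often called Glass's delta) or, alternatively, with the pooled SD. The function names and the example numbers are hypothetical.

```python
import math


def glass_delta(mean_e: float, mean_c: float, sd_c: float) -> float:
    """(M_E - M_C) / SD_C: standardizes by the control-group SD only."""
    return (mean_e - mean_c) / sd_c


def pooled_sd(sd_e: float, n_e: int, sd_c: float, n_c: int) -> float:
    """Pooled SD of two independent groups, weighting each variance by its df."""
    return math.sqrt(((n_e - 1) * sd_e**2 + (n_c - 1) * sd_c**2) / (n_e + n_c - 2))


def pooled_d(mean_e: float, mean_c: float, sd_e: float, n_e: int,
             sd_c: float, n_c: int) -> float:
    """(M_E - M_C) / SD_pooled."""
    return (mean_e - mean_c) / pooled_sd(sd_e, n_e, sd_c, n_c)


# Hypothetical study: treatment mean 78, control mean 75, control SD 10.
print(glass_delta(78, 75, 10))                      # 0.3
print(round(pooled_d(78, 75, 12, 20, 10, 20), 3))   # 0.272
```

The control-group denominator is the choice the illustration later in this article makes, on the grounds that the control-group SD is not affected by the treatment.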
Address correspondence to Elisha A. Chambers, Division of Educational Psychology, Research and Evaluation, College of Education, Marillac Hall 402, University of Missouri-St. Louis, One University Blvd., St. Louis, MO 63121-4400. (E-mail: elishachambers@umsl.edu)

An effect size also can be interpreted in terms of proportions or percentiles. Hedges and Olkin (1985) explained that it can also be understood as "the proportion of the control scores that are less than the average score in the experimental group" (p. 76). For example, an effect size of .30 means that the average student who receives the treatment would outperform 62% of the students in the control group. In terms of percentiles, with an effect size of .30, a typical student in the control group would perform at the 50th percentile, whereas a typical student in the treatment group would perform at the 62nd percentile.
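The percentile reading above follows from the standard normal cumulative distribution function, assuming normally distributed scores with equal variances in both groups (an assumption, not something every study guarantees). A minimal Python sketch using the standard library's statistics.NormalDist; the function name is hypothetical:

```python
from statistics import NormalDist


def percentile_of_average_treated(effect_size: float) -> float:
    """Percentile of the control distribution reached by the average treated
    student, assuming normal scores with equal variances (Cohen's U3)."""
    return 100 * NormalDist().cdf(effect_size)


print(round(percentile_of_average_treated(0.30)))  # 62
print(percentile_of_average_treated(0.0))          # 50.0
```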
In terms of synthesis, the meta-analytic researcher attempts to examine all studies on a given topic, thereby avoiding selection bias in support of the author's research interest (Wolf, 1986). In a meta-analysis, an effect size is calculated from the results of all studies in a given area; therefore, subjective weight is not an issue. In addition, meta-analytic researchers are able to examine inconsistencies by examining the impact that different study features have on the outcome variable of interest. As Wolf noted, moderating variables are largely overlooked. It is possible for one to examine moderating variables in a meta-analysis by conducting analyses on the other features of the studies. For example, Dunn, Griggs, Olson, and Beasley (1995) examined seven potential moderators (e.g., socioeconomic status, length of intervention) in their meta-analytic study on learning styles. As another example, in his study of the relationship between student ratings of their instructors and achievement, Cohen (1981) surmised that class size might be a moderating variable. For that analysis, the data in the meta-analysis were partitioned so that one statistic represented larger class sizes and another represented smaller class sizes. Because the two statistics were significantly different, class size was seen as a moderating variable.

Limitations
Although there are many benefits to conducting a meta-analysis, there also are some limitations to this technique. Glass, McGaw, and Smith (1981) identified four major disadvantages in conducting a meta-analysis. The first disadvantage is called the "apples and oranges" comparison. Some authors (e.g., Gallo, 1978; Presby, 1978) noted that aggregating results that use different research techniques is inappropriate because they are too dissimilar. However, it is fairly simple to correct this issue because one can code for different techniques, where appropriate, and test whether the results are too dissimilar. In the second disadvantage, arguments have been made against including poorly designed studies in a meta-analysis (e.g., Eysenck, 1978). Again, like the resolution for the "apples and oranges" argument, the meta-analytic researcher can code study features to test whether there are differences based on study quality. In the third disadvantage, Rosenthal (1979) noted publication bias. The author described this issue as the "file drawer problem"; that is, there is a discrepancy between published and unpublished research in which results from potential studies are filed away and never published. The published literature therefore tends to favor research that yields positive, statistically significant results. The results reported in other meta-analytic research have somewhat substantiated that claim because published research is more likely to be positive (e.g., Kulik & Kulik, 1991; Ryan, 1991). The problem can be addressed when one obtains unpublished manuscripts by contacting researchers directly or locating unpublished manuscripts through resources such as the Educational Resources Information Center (ERIC). It also is possible to estimate the number of studies that would have to be in "file drawers" to change the results of a meta-analysis (see Rosenthal, 1979).
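Rosenthal's (1979) estimate is usually called the fail-safe N. A minimal Python sketch, assuming each study's result has already been converted to a one-tailed standard normal deviate Z and that the missing studies average Z = 0; the function name and the example Z values are hypothetical.

```python
import math
from statistics import NormalDist


def fail_safe_n(z_values, alpha=0.05):
    """Rosenthal's (1979) fail-safe N.

    Number of unpublished studies averaging Z = 0 that would have to exist
    for the combined (Stouffer) Z, sum(Z) / sqrt(k + X), to fall to the
    one-tailed critical value. Truncated to a whole number of studies.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha)  # about 1.645 for alpha = .05
    k = len(z_values)
    x = sum(z_values) ** 2 / z_crit**2 - k
    return max(0, math.floor(x))


print(fail_safe_n([2.1, 1.8, 2.5, 0.9, 1.6]))  # 24
```

A large fail-safe N relative to the number of located studies suggests the combined result is robust to unretrieved null findings; a small one suggests caution.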
The fourth, and final, disadvantage is the use of multiple findings from the same study, thereby potentially biasing the results. Like the other three disadvantages, this problem also can be overcome. It is important that one calculate effect sizes that are independent, so it is a common practice for researchers to use only one effect size for each study. A second alternative is for researchers to calculate multiple effect sizes for a study if it can be established that the samples are independent.
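One simple way to honor the one-effect-size-per-study practice is to average a study's effect sizes before pooling them. The Python sketch below does that; averaging is an assumption here (the illustration later in this article instead uses the effect size tied to each study's major focus), and the record layout is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def one_effect_per_study(records):
    """Collapse (study_id, effect_size) pairs so each study contributes a
    single effect size: the mean of that study's effect sizes."""
    by_study = defaultdict(list)
    for study_id, es in records:
        by_study[study_id].append(es)
    return {study_id: mean(values) for study_id, values in by_study.items()}


records = [("Study A", 0.40), ("Study A", 0.60), ("Study B", 0.20)]
print(one_effect_per_study(records))  # {'Study A': 0.5, 'Study B': 0.2}
```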
In addition to the limitations outlined in the preceding paragraphs, Onwuegbuzie and Levin (2003) discussed concerns about the sensitivity of effect sizes. Onwuegbuzie and Levin noted that nine factors should be considered when investigating effect sizes: research objective, research design, effect-size measure, interpretation guidelines, sampling issues, distribution nonnormality, score variability, measurement error, and scale of measurement. Because meta-analytic research depends on effect-size measures, it is important that researchers consider those factors.
In meta-analytic research, a sizable discrepancy tends to exist between the number of studies located and the number of studies actually included in a meta-analysis. For example, Bangert-Drowns, Kulik, and Kulik (1985) located almost 500 apparently pertinent articles but found that over 90% were not suitable. Likewise, Gillingham and Guthrie (1987) reported that their sample was composed of only 6% of the targeted literature because they could not calculate effects from the limited data provided. Thus, it is likely that there will be a discrepancy between the apparent number of usable research articles and the actual number of usable research articles.
Overall, meta-analytic research is a very effective method for researchers to statistically examine an abundant amount of research. It is important that one be aware of the potential pitfalls in conducting a meta-analysis; however, when these pitfalls are accounted for, meta-analytic research can provide important and meaningful answers.
Procedures
The procedures used in meta-analytic research are similar to those used in primary research. Multiple phases to the process include (a) select a topic, (b) define a problem, (c) conduct a literature review, (d) state a research question or hypothesis, (e) collect data, (f) analyze the data, and (g) evaluate the findings (see Creswell, 2002; Fraenkel & Wallen, 2005). After selecting a topic, defining a problem, and conducting the literature review, the researcher is almost ready to begin data collection. Prior to obtaining manuscripts, the meta-analyst needs to set criteria for inclusion in the research synthesis, for example, the age of the participants, the geographic location of the study, or the date of publication. Once the inclusion criteria have been established, the meta-analyst should select, modify, or create an instrument to assist in the coding and organization of the data (a coding scheme also should be documented). Coding different study features allows the meta-analytic researcher to examine potential moderating variables. It is helpful if the instrument is reviewed by another individual (e.g., an expert in the field) to ensure that there have been no oversights; piloting the instrument also is informative. The meta-analyst is then ready to collect data.
To obtain published and unpublished manuscripts, there are at least two steps in the data collection procedures. The first step is to contact individuals who are involved in programs of research that are related to the topic under investigation. Requests can then be made for unpublished manuscripts and conference proceedings. The second step is to conduct several searches for manuscripts through pertinent electronic databases. Manuscript acquisition is sometimes limited by library availability; however, interlibrary loans can be extremely helpful. It also is possible to obtain manuscripts by consulting the reference lists of collected articles. Once the manuscripts have been obtained and coded, interrater reliability should be calculated to establish whether independent raters are able to code the information consistently (a higher percentage of agreement between coders indicates more consistent coding of the information).
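Percentage agreement, the index reported later in this article's illustration, is straightforward to compute. The Python sketch below also includes Cohen's kappa, which corrects agreement for chance; kappa is an addition here, not a statistic the article reports. The coder labels in the example are hypothetical.

```python
from collections import Counter


def percent_agreement(coder_a, coder_b):
    """Percentage of items on which two coders assigned the same code."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must code the same, non-empty set of items")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return 100 * matches / len(coder_a)


def cohens_kappa(coder_a, coder_b):
    """Chance-corrected agreement: (p_observed - p_expected) / (1 - p_expected).
    Undefined when p_expected equals 1."""
    n = len(coder_a)
    p_obs = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_exp = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (p_obs - p_exp) / (1 - p_exp)


# Hypothetical grade-level codes assigned by two coders to five studies.
coder_1 = ["K-5", "6-8", "9-12", "K-5", "6-8"]
coder_2 = ["K-5", "6-8", "9-12", "6-8", "6-8"]
print(percent_agreement(coder_1, coder_2))       # 80.0
print(round(cohens_kappa(coder_1, coder_2), 3))  # 0.688
```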
An Example: Efficacy of Computers in K-12 Classrooms
The following illustration synthesizes the literature through a meta-analysis as introduced by Glass (1976, 1978). The purpose of this example is to examine the efficacy of computers in K-12 classrooms from 1992-2002. The two research questions for this illustration are: (a) Is educational technology generally efficacious? (b) What effect sizes are revealed for each of the three grade levels (i.e., K-5, 6-8, 9-12)?
Selection of Studies
The articles chosen for this illustration were scaled down for the purposes of this article; thus, a randomly drawn stratified sample (n = 30) was selected from a larger meta-analytic study (Chambers, 2003). The stratification was based on grade level (n = 10 each for K-5, 6-8, and 9-12). The targeted studies were those that implemented computer-assisted instruction in K-12 classrooms, provided achievement outcomes, and were reported between the years 1992 and 2002. The studies were identified through various electronic databases such as Dissertation Abstracts International, PsycINFO, and ERIC. The articles were limited to those available at the university's library or through interlibrary loan. The articles needed to include sufficient information to calculate an effect size, and the research had to be of experimental or quasi-experimental design (Campbell & Stanley, 1966); thus, the studies needed to include a treatment or experimental group and a control or comparison group. Furthermore, the treatment group and the control group had to yield a total minimum sample size of 10 (i.e., 5 students in each group) because instructional sessions with fewer than 5 students could be considered tutorials (Trace & Leitner, 1984).
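The stratified draw described above (10 studies per grade band) can be sketched with Python's random module. The pool of candidate studies, the seed, and the function name are hypothetical; fixing a seed makes the draw reproducible for reporting.

```python
import random
from collections import Counter


def stratified_sample(studies, stratum_of, per_stratum, seed=None):
    """Draw `per_stratum` studies at random, without replacement, from each
    stratum. Raises ValueError if a stratum has too few studies."""
    rng = random.Random(seed)
    strata = {}
    for study in studies:
        strata.setdefault(stratum_of(study), []).append(study)
    sample = []
    for key in sorted(strata):
        sample.extend(rng.sample(strata[key], per_stratum))
    return sample


# Hypothetical pool: (study_id, grade_band) pairs from a larger synthesis.
pool = [(f"{band}-{i}", band) for band in ("K-5", "6-8", "9-12") for i in range(40)]
chosen = stratified_sample(pool, stratum_of=lambda s: s[1], per_stratum=10, seed=1)
print(len(chosen), Counter(band for _, band in chosen))
```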
Procedures
This study began with the establishment of a coding scheme that included 37 items; however, for the purposes of this illustration, there were only 4 items of interest (i.e., grade level, achievement, total instructional time, and frequency of instruction). To address the file drawer problem (see Rosenthal, 1979), I attempted to locate unpublished manuscripts by contacting researchers and educational project coordinators directly; however, only 5% of those who were contacted responded. Once the articles were obtained, they were coded and data were entered into SPSS 11.0 for Windows (SPSS, 2001). Interrater reliability was established in the larger study (86% agreement), and discrepancies were resolved through discussion.
Treatment of Data
Effect sizes were calculated for all the studies under investigation. In this example, I used the standard deviation of the control group because it was not contaminated by the treatment (see Hedges et al., 1989). In attempting to gather the necessary information from studies, I had to calculate some effect sizes from other test statistics (see Wolf, 1986). In addition, when there were unequal numbers of participants in the control group and the experimental group, I used the effect formula reported by Holmes (1984).

In order not to violate the assumption of independence, I calculated one effect size for each study (see Lipsey & Wilson, 2001). That single effect size was calculated on the basis of the study's major focus (individual effect sizes are reported in the Appendix). After I calculated individual effect sizes, I calculated overall mean effect sizes, as well as standard errors for the mean effect sizes (see Lipsey & Wilson) and the upper and lower values for 95% confidence intervals. This reporting is in accordance with the recommendations of Wilkinson and the Task Force on Statistical Inference (1999), who stated that researchers should "always present effect sizes for primary outcomes" (p. 599).
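The unweighted summary (mean, standard error, and confidence interval) can be sketched in Python as follows, assuming the standard error is the SD of the effect sizes divided by the square root of the number of studies. The critical value is left as a parameter, and the example effect sizes are hypothetical.

```python
import math
from statistics import mean, stdev


def unweighted_summary(effect_sizes, crit=1.96):
    """Unweighted mean effect size, its standard error (SD / sqrt(k)),
    and a confidence interval mean +/- crit * SE."""
    k = len(effect_sizes)
    m = mean(effect_sizes)
    se = stdev(effect_sizes) / math.sqrt(k)
    return m, se, (m - crit * se, m + crit * se)


m, se, (lower, upper) = unweighted_summary([0.2, 0.4, 0.6, 0.8, 1.0])
print(round(m, 2), round(se, 3), round(lower, 2), round(upper, 2))  # 0.6 0.141 0.32 0.88
```

With the summary values reported in the Results (M = 0.61, SD = 1.13, k = 30), SE = 1.13/√30 ≈ 0.21. The reported limits of 0.19 and 1.03 appear consistent with a t critical value of about 2.045 (df = 29), since 0.61 ± 2.045 × 0.206 gives roughly 0.19 to 1.03, whereas crit=1.96 would give roughly 0.21 to 1.01.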
Results
The 30 studies yielded 30 effect sizes for 4,467 students (see Table 1 for a summary of the effect sizes). By coding the length and frequency of the different instructional interventions, one can report descriptive information for both variables. The instructional interventions varied from 2 hr to as much as 105 hr (M = 29.71, SD = 42.08, Median = 9.0). The frequency of instruction varied from two times per week to as much as five times per week (M = 3.67, Median = 4.0).

Effect Size Adjustments

Both weighted and unweighted mean effect sizes are reported below. To correct for bias (Hedges & Olkin, 1985; Lipsey & Wilson, 2001; Rudner, Glass, Evartt, & Emery, 2002), I calculated weighted values with the inverse variance weight formula (Hedges & Olkin; Lipsey & Wilson; Rudner et al.).
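The inverse variance weighting described above can be sketched in Python as follows. It assumes each study supplies its effect size and the two group sizes, and it uses the large-sample variance of a standardized mean difference given by Hedges and Olkin (1985) and Lipsey and Wilson (2001); the function names and example studies are hypothetical.

```python
import math


def d_variance(d, n_e, n_c):
    """Large-sample sampling variance of a standardized mean difference."""
    return (n_e + n_c) / (n_e * n_c) + d**2 / (2 * (n_e + n_c))


def weighted_summary(studies, crit=1.96):
    """studies: iterable of (d, n_e, n_c).

    Returns the inverse-variance weighted mean, its SE = sqrt(1 / sum(w)),
    and the interval mean +/- crit * SE.
    """
    weights, effects = [], []
    for d, n_e, n_c in studies:
        weights.append(1 / d_variance(d, n_e, n_c))
        effects.append(d)
    w_sum = sum(weights)
    m = sum(w * d for w, d in zip(weights, effects)) / w_sum
    se = math.sqrt(1 / w_sum)
    return m, se, (m - crit * se, m + crit * se)


# Hypothetical studies: (effect size, n treatment, n control).
m, se, ci = weighted_summary([(0.30, 20, 20), (0.50, 60, 60), (0.10, 15, 12)])
print(round(m, 2), round(se, 2), [round(x, 2) for x in ci])
```

As a rough check, if Study No. 21's 671 students were split about evenly between groups (an assumption; the split is not reported here), math.sqrt(d_variance(5.89, 335, 336)) is about 0.18, in line with the SE reported for that study. Because a study's weight is the inverse of its variance, a single large study can dominate the weighted mean.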
TABLE 1. Effect Size by Study

Study   Grade level   N (total)
1       K-5           [illegible]
2       K-5           [illegible]
3       K-5           260
4       K-5           85
5       K-5           [illegible]
6       K-5           198
7       K-5           48
8       K-5           89
9       K-5           66
10      K-5           [illegible]
11      6-8           [illegible]
12      6-8           246
13      6-8           [illegible]
14      6-8           99
15      6-8           96
16      6-8           151
17      6-8           [illegible]
18      6-8           [illegible]
19      6-8           [illegible]
20      6-8           [illegible]
21      9-12          671
22      9-12          [illegible]
23      9-12          130
24      9-12          134
25      9-12          [illegible]
26      9-12          126
27      9-12          52
28      9-12          189
29      9-12          [illegible]
30      9-12          [illegible]

To capture an accurate picture of effect sizes, the data were examined for outliers. The unweighted achievement mean effect size for this sample was 0.61 (SD = 1.13). An examination of the data revealed one potential outlier (an effect size beyond 0.61 ± 3.39) that was obtained from Study No. 21 and was 5.89 (SE = 0.18, n_total = 671). The authors of Study No. 21 acknowledged that their study yielded a large effect size (they controlled for the students' examination scores, current grade-point average, and pretest scores).
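Screening for outliers can be sketched in Python as flagging any effect size that lies more than a chosen number of standard deviations from the unweighted mean. The three-SD cutoff is an assumption exposed as a parameter, and the data are hypothetical. A flagged study is a candidate for closer inspection, as Study No. 21 received above, rather than for automatic deletion.

```python
from statistics import mean, stdev


def flag_outliers(effect_sizes, n_sd=3.0):
    """Indices of effect sizes more than n_sd sample SDs from the unweighted mean."""
    m, sd = mean(effect_sizes), stdev(effect_sizes)
    lower, upper = m - n_sd * sd, m + n_sd * sd
    return [i for i, es in enumerate(effect_sizes) if es < lower or es > upper]


# Hypothetical data: 20 moderate effects plus one extreme value.
effects = [0.3, 0.4, 0.5, 0.6, 0.7] * 4 + [5.0]
print(flag_outliers(effects))  # [20]
```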
Homogeneity of Effect Sizes
The Q statistic was computed in a test of homogeneity of the effect sizes (see Hedges & Olkin, 1985; see Step 3: Data Analysis for a discussion on homogeneity). The Q statistic revealed that the effect sizes were heterogeneous because the test for homogeneity was significant, Q(29) = 2919.11, critical χ²(29, N = 30) = 42.56, p = .05. On closer examination, it appears that the outlier discussed previously (Study No. 21) contributed to the heterogeneity; however, the test was still significant when the outlier was excluded, Q(28) = 82.89, critical χ²(28, N = 29) = 41.34, p = .05. Thompson (1994) suggested that such heterogeneity is often a result of methodological or participant differences; therefore, it is important to examine the different factors associated with these effect sizes. For instance, larger effect sizes have been associated with lower ability students for computer-assisted instruction (e.g., Bangert-Drowns, 1993). Similarly, the participants in Study No. 21 were solely low-ability students. However, because of the limited scope of this article, only grade level was examined.
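Under the fixed-effect approach of Hedges and Olkin (1985), Q is the weighted sum of squared deviations of each effect size from the inverse-variance weighted mean, referred to a chi-square distribution with k - 1 degrees of freedom. A minimal Python sketch; the critical value is left to the caller (for example, scipy.stats.chi2.ppf(0.95, 29) returns about 42.56, the value used above), and the example inputs are hypothetical.

```python
def q_statistic(effect_sizes, variances):
    """Hedges & Olkin (1985) homogeneity statistic Q and its degrees of freedom."""
    weights = [1 / v for v in variances]
    w_sum = sum(weights)
    mean_w = sum(w * d for w, d in zip(weights, effect_sizes)) / w_sum
    q = sum(w * (d - mean_w) ** 2 for w, d in zip(weights, effect_sizes))
    return q, len(effect_sizes) - 1


q, df = q_statistic([0.2, 0.4, 0.6], [0.04, 0.04, 0.04])
critical = 5.99  # chi-square critical value for df = 2, alpha = .05
print(round(q, 2), df, q > critical)  # 2.0 2 False
```

A Q larger than the critical value, as in both tests reported above, indicates more variability among effect sizes than sampling error alone would produce.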
Research Questions
To answer the first research question, "Is educational technology generally efficacious?" I calculated a mean effect size. The unweighted mean effect size for achievement was 0.61 (SE = 0.21, Median = 0.47), and individual effect sizes varied from -1.29 to 5.89. The upper limit of the 95% confidence interval (CI) was 1.03, and the lower limit was 0.19. The weighted mean effect size for achievement was 1.37 (SE = 0.08); the upper 95% CI limit was 1.53, whereas the lower limit was 1.21. That finding indicates that students in the computer groups outperformed their peers in comparison groups by over half a standard deviation in measures of achievement (or by more than one standard deviation when examining the weighted mean).
To answer the second research question, "What effect sizes are revealed for each of the three grade levels?" I calculated mean effect sizes for each grade-level category (i.e., K-5, 6-8, 9-12). The unweighted and weighted mean effect sizes, standard errors, and confidence intervals for each grade level are reported in Table 2. (See Figure 1 for a graphic representation of the unweighted means and Figure 2 for a graphic representation of the weighted means.)
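Per-grade-level results of this kind can be sketched in Python as inverse-variance weighted means and standard errors computed within each subgroup. As an addition the article does not report, the sketch also returns a between-groups Q (the fixed-effect analog to ANOVA described by Lipsey and Wilson, 2001), which tests whether grade level acts as a moderator. The record layout and example values are hypothetical.

```python
import math
from collections import defaultdict


def weighted_mean(items):
    """Inverse-variance weighted mean and total weight of (d, variance) pairs."""
    weights = [1 / v for _, v in items]
    total = sum(weights)
    return sum(w * d for w, (d, _) in zip(weights, items)) / total, total


def subgroup_analysis(studies):
    """studies: iterable of (group, d, variance).

    Returns {group: (weighted mean, SE)}, Q_between, and its df
    (number of groups - 1).
    """
    groups = defaultdict(list)
    for group, d, v in studies:
        groups[group].append((d, v))
    grand_mean, _ = weighted_mean([item for items in groups.values() for item in items])
    summary, q_between = {}, 0.0
    for group, items in groups.items():
        m, w_total = weighted_mean(items)
        summary[group] = (m, math.sqrt(1 / w_total))
        q_between += w_total * (m - grand_mean) ** 2
    return summary, q_between, len(groups) - 1


# Hypothetical (grade band, d, variance) records.
data = [("K-5", 0.2, 0.04), ("K-5", 0.2, 0.04), ("9-12", 0.6, 0.04), ("9-12", 0.6, 0.04)]
summary, q_b, df = subgroup_analysis(data)
print({g: round(m, 2) for g, (m, _) in summary.items()}, round(q_b, 2), df)  # {'K-5': 0.2, '9-12': 0.6} 4.0 1
```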
The median effect size for elementary school students (K-5) was 0.61, and the effect sizes varied from -0.43 to 0.99. The median effect size for middle school students (6-8) was 0.47, and the effect sizes varied from -0.55 to 1.33. The median effect size for high school students (9-12) was 0.36, and the effect sizes varied from -1.30 to 5.89. Those findings, and those reported in Table 2, suggest that overall, students at any grade level benefit from educational technology in terms of achievement outcomes. Students in the computer groups outperformed their peers in comparison groups by over a third of a standard deviation in measures of achievement at all three grade levels when examining either the unweighted or weighted mean effect sizes.

FIGURE 1. Unweighted mean effect sizes for achievement, by grade level (Elementary [K-5], Junior High [6-8], High School [9-12]).

FIGURE 2. Weighted mean effect sizes for achievement, by grade level (y-axis: Mean Effect Size, 95% Confidence Intervals).
Discussion
In general, and at specific grade levels (i.e., K-5, 6-8, 9-12), the meta-analytic findings in this illustration favored educational technology interventions and programs. That finding is consistent with previous meta-analyses in which educational technology in K-12 classrooms was examined (e.g., Bangert-Drowns, Kulik, & Kulik, 1985; Niemiec & Walberg, 1985; Ryan, 1991). However, past meta-analyses typically reported that the effect sizes for children in lower grades were higher than those found for students in higher grades (e.g., Fletcher-Flinn & Gravatt, 1995; Kulik & Kulik, 1991; Kulik, Kulik, & Bangert-Drowns, 1985). Perhaps the way in which technology is presently being used is different, as the mean effect sizes reported in many of the past meta-analyses are from studies that are over 8 years old. It is important to investigate such differences further to better understand the impact that computers and educational technology have on achievement in K-12 classrooms.
Review of The Journal of Educational Research
Articles (1992-2002)
In this section, I examine meta-analytic work that has been published in The Journal of Educational Research from 1992-2002. The following eight articles are considered: Dunn et al. (1995); Fan (2001); Guthrie, Schafer, Von Secker, and Alban (2000); Hough and Hall (1994); Lou, Abrami, and Spence (2000); Mavrogenes and Bezruczko (1994); McGregor (1993); and Razel (2001). Two of the articles reported on issues associated with conducting meta-analyses and reporting effect sizes (Fan; Hough & Hall, 1994), whereas in six of the other articles a meta-analysis was conducted or effect sizes were reported (Dunn et al.; Guthrie et al.; Lou et al., 2000; Mavrogenes & Bezruczko, 1994; McGregor; Razel). Therefore, I first review the articles in which meta-analytic issues are explored and then examine the meta-analytic and effect-size articles.
Meta-analytic Issues
In the first two articles, I examined meta-analytic issues that included statistical significance and effect sizes (Fan, 2001) and the use of different formulas in calculating effect sizes (Hough & Hall, 1994). Fan examined the sampling variability of the effect size measures d and R², whereas Hough and Hall contrasted effect-size formulas (see Glass, McGaw, & Smith, 1981; Schmidt & Hunter, 1977).

By using a Monte Carlo experiment, Fan (2001) found that sample d appears to be an unbiased estimator of population d; however, sample R² appears to have an upward bias, consistently exceeding the population value. Therefore, Fan recommended that researchers use a bias correction because the adjusted R² value is very close to the population R² value.
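The upward bias of sample R² can be seen in a small simulation. The Python sketch below is hypothetical and far simpler than Fan's (2001) design: it draws an outcome unrelated to a single predictor (so the population R² is 0), and it applies the common adjusted R² formula 1 - (1 - R²)(n - 1)/(n - p - 1), which may not be the exact correction Fan used.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation (R^2 for a one-predictor regression)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy**2 / (sxx * syy)


def adjusted_r_squared(r2, n, p):
    """Common shrinkage adjustment for n cases and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def simulate(n=20, reps=2000, seed=0):
    """Mean sample R^2 and mean adjusted R^2 when the population R^2 is 0."""
    rng = random.Random(seed)
    raw_total = adj_total = 0.0
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [rng.gauss(0, 1) for _ in range(n)]  # y is unrelated to x
        r2 = r_squared(x, y)
        raw_total += r2
        adj_total += adjusted_r_squared(r2, n, p=1)
    return raw_total / reps, adj_total / reps


raw, adj = simulate()
print(f"mean R^2 = {raw:.3f}, mean adjusted R^2 = {adj:.3f}")
```

With n = 20 and one predictor, the mean sample R² should land near 1/19 ≈ 0.05 even though the population value is 0, while the adjusted values average close to 0.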
TABLE 2. Summary of Unweighted and Weighted Mean Effect Sizes (ES), Standard Errors (SE), and Confidence Intervals (CI) for Achievement

Grade level               Mean ES       CI.95 upper   CI.95 lower   Weighted mean ES   CI.95 upper   CI.95 lower
Elementary school (K-5)   0.53          0.82          0.23          0.55               [illegible]   0.49
  SE                      [illegible]                               0.03
Middle school (6-8)       0.55          0.95          0.15          [illegible]        [illegible]   [illegible]
  SE                      [illegible]                               0.02
High school (9-12)        0.76          2.12          -0.60         2.90               [illegible]   2.53
  SE                      0.60                                      [illegible]

Similarly, Hough and Hall (1994) found differences between the formulas that they investigated. By using the data from three meta-analyses, they found that two of the three effect sizes were slightly larger for the Hunter-Schmidt formula than for the Glass formula (0.30, 0.34, 0.79 versus 0.29, 0.34, 0.75). In addition, when they used the measurement error correction recommended by Schmidt and Hunter (1977), they found that two of the three corrected effect sizes were significantly larger than the effect sizes calculated with the Glass formula. However, they did conclude that those differences appear to be minimal and that it can be difficult to acquire the reliability coefficients necessary to correct for measurement error. As noted by Hough and Hall, the Hunter-Schmidt formula is more accurate (especially when measurement error corrections are conducted); however, the Glass formula is an acceptable alternative that requires fewer resources to calculate. Even Hough and Hall noted that they were not able to acquire the necessary information for 15% of the studies (even after consulting test manuals). It is important that meta-analytic researchers be aware of those issues and discrepancies when they assess which of the formulas would best suit their needs, and that they note any limitations.
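The measurement error correction discussed by Hough and Hall depends on reliability coefficients, which is why missing test-manual information matters. As a rough Python sketch, assuming the standard correction for attenuation applied to the outcome measure only (the exact form Hough and Hall applied is not given here), the corrected effect size is d divided by the square root of the outcome's reliability. The function name and example values are hypothetical.

```python
import math


def correct_for_attenuation(d, reliability_y):
    """Disattenuate an effect size for measurement error in the outcome:
    d / sqrt(r_yy), where r_yy is the outcome measure's reliability."""
    if not 0 < reliability_y <= 1:
        raise ValueError("reliability must be in (0, 1]")
    return d / math.sqrt(reliability_y)


print(round(correct_for_attenuation(0.75, 0.81), 3))  # 0.833
```

Because reliabilities are at most 1, the corrected value is never smaller in absolute value than the uncorrected one, which is consistent with Hough and Hall's finding that corrected effect sizes were larger.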
Articles
In the articles published in The Journal of Educational Research between 1992 and 2002, authors used meta-analyses and examined a variety of topics. (Table 3 provides a brief summary of the information from the articles.) In two of the studies, measures of association were used; in the remaining studies, effect sizes were used. In three of the four studies in which effect sizes were used, the pooled standard deviation formula was employed (Hedges & Olkin, 1985); in one study, the authors used the control-group standard deviation formula (Glass, 1978). The selection and inclusion of studies that were analyzed varied somewhat from study to study; however, the authors established selection criteria. The studies also reported on issues of validity, homogeneity (and moderators), outliers, weighting, and interrater coding. The way in which results were reported also varied greatly.
Dunn et al. (1995) were the only researchers to report that they examined the studies in their meta-analysis in terms of threats to validity. By using criteria that Campbell and Stanley (1966) established, Dunn and colleagues found that 6 of the 42 articles in which experimental designs were used had serious threats to internal and external validity; therefore, those 6 studies were not included in their meta-analysis. Dunn and colleagues also used a variety of techniques to examine homogeneity.
In addition to Dunn et al. (1995), Lou et al. (2000) also examined homogeneity (see Hedges & Olkin, 1985). Dunn and colleagues used three indicators of homogeneity (i.e., residual standard deviation, percentage of observed variance accounted for by sampling error, and a chi-square test of homogeneity), whereas Lou and colleagues used one indicator (i.e., a chi-square test of homogeneity). The authors of both studies found that their data were heterogeneous. Even after Lou and colleagues removed five studies that appeared to be outliers from their analyses, they found that the data were heterogeneous. In both studies, the possibility of moderators was examined (see Thompson, 1994).
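Two of the homogeneity indicators named above, the percentage of observed variance accounted for by sampling error and the residual standard deviation, can be sketched in Python as below. This is a simplified version in the spirit of Hunter and Schmidt: it assumes each study's sampling variance is already known (for example, from the Hedges and Olkin variance formula) and weights by sample size. The exact computations Dunn and colleagues used are not described here, and the inputs are hypothetical.

```python
import math


def sampling_error_share(effect_sizes, sample_sizes, sampling_variances):
    """Percentage of observed between-study variance attributable to sampling
    error, and the residual SD left after removing it (N-weighted sketch)."""
    n_total = sum(sample_sizes)
    d_bar = sum(n * d for n, d in zip(sample_sizes, effect_sizes)) / n_total
    var_obs = sum(n * (d - d_bar) ** 2 for n, d in zip(sample_sizes, effect_sizes)) / n_total
    var_err = sum(n * v for n, v in zip(sample_sizes, sampling_variances)) / n_total
    pct = 100 * var_err / var_obs if var_obs > 0 else float("inf")
    residual_sd = math.sqrt(max(0.0, var_obs - var_err))
    return pct, residual_sd


pct, resid = sampling_error_share([0.2, 0.6], [50, 50], [0.02, 0.02])
print(round(pct), round(resid, 3))  # 50 0.141
```

A commonly cited Hunter and Schmidt rule of thumb treats the effect sizes as homogeneous when sampling error accounts for at least 75% of the observed variance; the hypothetical data above fall well short of that.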
Another issue that was addressed was that of weighting (see Hedges & Olkin, 1985; Lipsey & Wilson, 2001; Rudner et al., 2002). Only Dunn et al. (1995) and Lou et al. (2000) reported that they weighted their findings. Both studies reported the use of approaches consistent with Hedges and Olkin. Weighting is important to consider because it corrects for bias that could result from studies with smaller sample sizes (Hedges & Olkin; Lipsey & Wilson).
Reliability issues were examined in three studies (Dunn et al., 1995; Guthrie et al., 2000; Lou et al., 2000), and validity was reported in two studies (Guthrie et al.; Mavrogenes & Bezruczko, 1994). Whereas Dunn and colleagues contacted the primary researchers directly, Lou and colleagues and Guthrie and colleagues examined interrater coding. When Dunn and colleagues contacted the primary researchers, they requested verification of coding and sought to collect any missing information. Lou and colleagues calculated the percentage of intercoder agreement for study features and effect-size information, and Guthrie and colleagues reported on intercoder reliability for scores on a performance assessment. Guthrie and colleagues and Mavrogenes and Bezruczko also reported on internal consistency. In addition, Guthrie and colleagues addressed face validity, convergent validity, and construct validity.
The eight studies provided some type of graphic representation of their findings: eight studies provided tables, and four studies presented figures (Dunn et al., 1995; Fan, 2001; McGregor, 1993; Razel, 2001). The content in the tables varied greatly, from guidelines (Fan, 2001), to moderators (Dunn et al.), to weighted effect-size reliability coefficients (Guthrie et al., 2000). The content in the figures also varied greatly, from a student writing sample (Dunn et al.), to correlations (Razel), to normal distribution curves (McGregor, 1993).

TABLE 3. Summary of Measure of Effect Findings From The Journal of Educational Research (1992-2002)

Dunn, Griggs, Olson, & Beasley (1995)
  N (population): [illegible]
  Measure of effect: Weighted correlation coefficient
  Overall finding(s): +0.38
  Finding summary: Learning style
Guthrie, Schafer, Von Secker, & Alban (2000)
  N (population): 2,719
  Measure of effect: Hedges & Olkin (1985) pooled standard deviation formula
  Overall finding(s): Multiple effect sizes reported
  Finding summary: Effects of instructional reading practices on achievement
Lou, Abrami, & Spence (2000)
  N (population): N/A
  Measure of effect: Hedges & Olkin (1985) pooled standard deviation formula; weighted
  Overall finding(s): +0.16
  Finding summary: Favoring within-class grouping over whole-class instruction
Mavrogenes & Bezruczko (1994)
  N (population): 186
  Measure of effect: Pooled standard deviation formula
  Overall finding(s): Multiple effect sizes reported
  Finding summary: Examined gender differences in writing
McGregor (1993)
  N (population): N/A
  Measure of effect: Glass (1978) control group standard deviation formula
  Overall finding(s): +0.42; +0.48
  Finding summary: In favor of role playing on student racial attitudes (+0.42); in favor of antiracist teaching on student attitudes (+0.48)
Razel (2001)
  N (population): 1,022,000
  Measure of effect: [illegible]
  Overall finding(s): Multiple findings reported (90% negative)
  Finding summary: Relationship between television and achievement varies by age

a. A meta-analysis was not conducted in this study; rather, it was a primary study that reported effect sizes.
Like the graphic representations, the narrative portion of the reports also varied greatly in terms of the kinds of information that were included. In addition to the anticipated effect sizes, confidence intervals also were included (Lou et al., 2000), as were correlations (Mavrogenes & Bezruczko, 1994) and coefficients of variation (McGregor, 1993). The following section addresses recommendations and guidelines for conducting and reporting meta-analytic studies.
Recommendations and Guidelines
There are four basic steps for conducting meta-analytic research: design, data collection (manuscript acquisition), data analysis, and reporting. Each of the steps is discussed in the following paragraphs. This section concludes with a brief examination of the limitations associated with conducting meta-analytic research.
Step 1: Designing a Meta-analytic Study
To design a meta-analysis, researchers must consider several factors, including criteria for manuscript inclusion, instrumentation, and coding schemes. Researchers need to establish inclusion criteria for the studies that will be included in the meta-analysis. Inclusion criteria may include (a) study design (e.g., experimental, quasi-experimental), (b) types of outcome measures (e.g., achievement test scores), (c) age or grade level of participants, or (d) group sample sizes. Once inclusion criteria have been established, instrumentation and coding schemes should be selected, modified, or created. It is important that researchers have a protocol that can be used to consistently document the necessary information from each study under investigation. In addition, a coding scheme is necessary for analyzing the data that are collected. Following data collection, intercoder or interrater reliability should be assessed. Once those design issues are established, it is time to start searching for potential manuscripts.
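Once two coders have applied the coding scheme to the same studies, their agreement can be quantified. The Python sketch below is a minimal illustration, not a procedure prescribed in this article: it assumes categorical codes (the labels "exp" and "quasi" and the function names are hypothetical) and computes simple percent agreement and Cohen's kappa, one widely used chance-corrected index. Other reliability indices may suit continuous codes better.

```python
from collections import Counter


def percent_agreement(coder_a, coder_b):
    """Proportion of studies to which both coders assigned the same code."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("coders must rate the same, non-empty set of studies")
    return sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)


def cohens_kappa(coder_a, coder_b):
    """Chance-corrected agreement: (p_observed - p_expected) / (1 - p_expected)."""
    p_obs = percent_agreement(coder_a, coder_b)
    n = len(coder_a)
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    # Agreement expected if each coder assigned codes independently at their own base rates.
    p_exp = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / n ** 2
    if p_exp == 1:
        return float("nan")  # undefined: both coders used one identical code throughout
    return (p_obs - p_exp) / (1 - p_exp)


# Hypothetical design codes from two coders for the same four studies.
coder_1 = ["exp", "exp", "quasi", "quasi"]
coder_2 = ["exp", "quasi", "quasi", "quasi"]
print(percent_agreement(coder_1, coder_2))  # 0.75
print(cohens_kappa(coder_1, coder_2))       # 0.5
```

Kappa is lower than raw agreement here because some agreement on "quasi" would be expected by chance alone.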
Step 2: Data Collection or Manuscript Acquisition
There are three major ways to obtain manuscripts: contact individuals, conduct searches of electronic databases, and consult reference lists. It is possible for researchers to contact individuals who are likely to have unpublished or published manuscripts that relate to the meta-analytic study under investigation. The second alternative involves conducting literature searches of electronic databases such as Dissertation Abstracts International and ERIC. Once a list of potential articles has been identified, researchers should read through the abstracts and summaries and decide whether to include each study if they can make the determination from the information provided in the abstract or summary. Otherwise, researchers might have to obtain and review the manuscript in its entirety. Manuscripts that have been selected can then be coded for data analysis. However, prior to calculating measures of effect, researchers might have to contact the primary source to obtain missing or additional information.
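Because the same manuscript often turns up through more than one route (for example, in both ERIC and Dissertation Abstracts International, or again in a reference list), it can help to merge the search results before screening abstracts. The Python sketch below is illustrative only: matching on a normalized title is a simplifying assumption (real searches may also need authors, years, or database identifiers), and the function names and titles are hypothetical.

```python
import re


def normalize_title(title):
    """Lowercase and collapse punctuation/whitespace so near-identical titles compare equal."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", title.lower()).split())


def merge_sources(sources):
    """Map each normalized title to the list of sources in which it was found.

    `sources` is a dict: source name -> list of titles retrieved from that source.
    """
    merged = {}
    for source, titles in sources.items():
        for title in titles:
            found_in = merged.setdefault(normalize_title(title), [])
            if source not in found_in:
                found_in.append(source)
    return merged


# Hypothetical search results from the three routes described in the text.
results = {
    "ERIC": ["Computers and Achievement", "Study B"],
    "Dissertation Abstracts International": ["computers and achievement."],
    "reference lists": ["Study C"],
}
candidates = merge_sources(results)
print(len(candidates))  # 3 unique manuscripts to screen
```

Keeping the list of sources for each title also documents how every manuscript was located, which is useful when the data collection procedures are reported.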
Step 3: Data Analysis
After the studies have been obtained and missing or additional information is collected, measures of effect need to be calculated. The most appropriate measure needs to be selected, and the data should be analyzed accordingly. There are a variety of resources available to assist with this selection (e.g., Glass, 1978; Glass et al., 1981; Fan, 2001; Hedges & Olkin, 1985; Hedges et al., 1989; Lipsey & Wilson, 2001; Maxwell & Delaney, 2000; Rosenthal, 1984; Rudner et al., 2002; Wolf, 1986). In addition, when the sample sizes of the control group and the experimental group are uneven, additional effect size formulas are available (e.g., Holmes, 1984). Furthermore, it is possible to use transformations on the reported data to calculate effect sizes; if means and standard deviations are not reported, effect sizes can be calculated from other test statistics such as t, F, or r, as transformation formulas are widely available (e.g., Wolf). Statistical independence is important. There is a concern with independence when multiple findings from the same study are included in the analysis, thereby potentially biasing the results. That problem can be avoided by calculating effect sizes that are independent; therefore, it is common practice to use only one effect size for each study. However, Lipsey and Wilson (2001) stated that "effect sizes can usually be assumed to be statistically independent if, for a given distribution, no more than one effect size comes from any subject sample" (p. 112). Therefore, multiple effect sizes can be reported for a single study if it is established that the samples are independent.
The Journal of Educational Research
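The effect size calculations described in this step can be sketched in Python. This is a minimal illustration under stated assumptions rather than the procedure of any one source: it uses the standardized mean difference with a pooled standard deviation, the Hedges and Olkin (1985) small-sample correction, and common textbook conversions from t, from a two-group F (1 numerator degree of freedom), and from r; the r-to-d conversion assumes roughly equal group sizes. All function names are hypothetical. The last function shows one simple way to keep a single effect size per study, by averaging effect sizes that come from the same sample.

```python
import math
from collections import defaultdict


def pooled_sd(s1, s2, n1, n2):
    """Pooled standard deviation of two groups."""
    return math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))


def d_from_means(m1, m2, s1, s2, n1, n2):
    """Standardized mean difference (experimental minus control)."""
    return (m1 - m2) / pooled_sd(s1, s2, n1, n2)


def hedges_g(d, n1, n2):
    """Small-sample bias correction: d * (1 - 3 / (4(n1 + n2) - 9))."""
    return d * (1 - 3 / (4 * (n1 + n2) - 9))


def d_from_t(t, n1, n2):
    """Convert an independent-samples t statistic to d."""
    return t * math.sqrt(1 / n1 + 1 / n2)


def d_from_f(f, n1, n2, sign=1):
    """Two-group F only; F carries no direction, so the sign must be supplied."""
    return sign * math.sqrt(f * (1 / n1 + 1 / n2))


def d_from_r(r):
    """Convert a correlation to d (assumes roughly equal group sizes)."""
    return 2 * r / math.sqrt(1 - r ** 2)


def variance_of_d(d, n1, n2):
    """Approximate sampling variance of d, used later for weighting."""
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))


def one_effect_per_study(pairs):
    """pairs: (study_id, effect_size); averages within each study."""
    groups = defaultdict(list)
    for study, es in pairs:
        groups[study].append(es)
    return {study: sum(v) / len(v) for study, v in groups.items()}


# Hypothetical study: means 105 vs. 100, both SDs 10, 20 students per group.
d = d_from_means(105, 100, 10, 10, 20, 20)
print(round(d, 3), round(hedges_g(d, 20, 20), 3))  # 0.5 0.49
```

Averaging within a study is only one option; if a study reports results for independent samples, the quotation from Lipsey and Wilson above suggests those effect sizes may be kept separately.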
In general, homogeneity of effect sizes is examined; homogeneity indicates whether the distribution of the effect sizes accurately reflects the mean effect size for the population (Lipsey & Wilson, 2001). The Q statistic is computed in a test of homogeneity and is distributed approximately as a chi-square with k − 1 degrees of freedom, where k is the number of effect sizes (see Hedges & Olkin, 1985). Thompson (1994) noted that when tests of homogeneity are significant, it is often a result of methodological differences or participant differences; thus, it is important that one examine different factors associated with these effect sizes. In addition, outliers should be noted and possibly removed from the analyses. Lipsey and Wilson defined outliers as values beyond two or three standard deviations of the mean. When data exploration is complete, weights can be calculated.
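A test of homogeneity and a simple outlier screen can be sketched in Python with only the standard library. The assumptions are stated plainly: each effect size comes with a known sampling variance; Q is the inverse-variance weighted sum of squared deviations of each effect size from the weighted mean, referred to a chi-square distribution with k − 1 degrees of freedom; and outliers are flagged at a user-chosen two or three standard deviations from the unweighted mean, following the Lipsey and Wilson rule of thumb. The chi-square tail probability comes from a standard series expansion rather than a statistics package, and the function names and data are hypothetical.

```python
import math
import statistics


def chi2_sf(x, df):
    """Upper-tail probability P(chi-square with df degrees of freedom > x)."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    # Series for the regularized lower incomplete gamma function P(a, z).
    term = math.exp(a * math.log(z) - z - math.lgamma(a + 1))
    if term == 0.0:
        # Underflow only happens far into a tail: return the limiting value.
        return 0.0 if x > df else 1.0
    total, n = term, 0
    while term > 1e-15 * total and n < 100_000:
        n += 1
        term *= z / (a + n)
        total += term
    return min(1.0, max(0.0, 1.0 - total))


def homogeneity_test(effects, variances):
    """Return (Q, df, p) for k effect sizes with known sampling variances."""
    weights = [1.0 / v for v in variances]
    mean_w = sum(w * es for w, es in zip(weights, effects)) / sum(weights)
    q = sum(w * (es - mean_w) ** 2 for w, es in zip(weights, effects))
    df = len(effects) - 1
    return q, df, chi2_sf(q, df)


def flag_outliers(effects, k=2.0):
    """Indices of effect sizes more than k standard deviations from the mean."""
    m, s = statistics.mean(effects), statistics.stdev(effects)
    return [i for i, es in enumerate(effects) if abs(es - m) > k * s]


# Hypothetical effect sizes and sampling variances (illustration only).
effects = [0.10, 0.20, 0.15, 0.12, 0.18, 0.11, 0.14, 0.16, 0.13, 2.00]
variances = [0.04] * 10
q, df, p = homogeneity_test(effects, variances)
print(f"Q = {q:.2f}, df = {df}, p = {p:.2g}")
print(flag_outliers(effects, k=2.0))  # [9]
```

In this made-up example the single extreme value both produces a significant Q and is flagged as an outlier; with k = 3 it would not be flagged, because the outlier itself inflates the standard deviation.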
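Another predominant issue is deciding whether or not to weight the effect sizes, because bias could result from studies with smaller sample sizes (Hedges & Olkin, 1985; Lipsey & Wilson, 2001; Rudner et al., 2002). One can calculate weighted values by using the inverse variance weight formula, in which each effect size is weighted by the inverse of its sampling variance ($w_i = 1/v_i$), to correct for this bias (Hedges & Olkin; Lipsey & Wilson; Rudner et al.).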
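Inverse variance weighting can be illustrated with a short Python sketch. It assumes the effect sizes are standardized mean differences whose sampling variance is approximated from the two group sizes and the effect size itself (the approximation given by Hedges & Olkin, 1985, and Lipsey & Wilson, 2001); the function names and the two example studies are hypothetical. The example shows that a study with 100 participants per group receives far more weight than one with 10 per group.

```python
def variance_of_d(d, n1, n2):
    """Approximate sampling variance of a standardized mean difference."""
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))


def inverse_variance_weights(studies):
    """studies: list of (d, n1, n2) tuples; returns w_i = 1 / v_i for each study."""
    return [1.0 / variance_of_d(d, n1, n2) for d, n1, n2 in studies]


# Hypothetical studies: one small (10 per group), one large (100 per group).
studies = [(0.8, 10, 10), (0.2, 100, 100)]
weights = inverse_variance_weights(studies)
shares = [w / sum(weights) for w in weights]
print([round(s, 3) for s in shares])  # [0.085, 0.915]
```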
The final phase in this step is for researchers to calculate mean effect sizes (both weighted and unweighted mean effect sizes can be calculated) along with other pertinent information, such as standard errors and confidence intervals. The formula for the mean effect size is $\overline{ES} = \sum ES / n$, where $n$ represents the number of effect sizes in the calculation and $\sum ES$ represents the sum of all the effect sizes from the studies used in the investigation. The standard error for the mean effect size also can be calculated (see Lipsey & Wilson, 2001).
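The unweighted formula above and its weighted counterpart can be compared in a few lines of Python. The weighted mean and its standard error, $\sqrt{1/\sum w_i}$, follow the inverse variance approach described by Lipsey and Wilson (2001); the function names and example values are hypothetical.

```python
import math


def unweighted_mean(effects):
    """Mean effect size: the sum of the effect sizes divided by their number n."""
    return sum(effects) / len(effects)


def weighted_mean_and_se(effects, weights):
    """Inverse-variance weighted mean effect size and its standard error."""
    total_w = sum(weights)
    mean_w = sum(w * es for w, es in zip(weights, effects)) / total_w
    return mean_w, math.sqrt(1.0 / total_w)


# Hypothetical effect sizes with their inverse-variance weights.
effects = [0.8, 0.2]
weights = [1.0, 3.0]
print(round(unweighted_mean(effects), 3))  # 0.5
mean_w, se = weighted_mean_and_se(effects, weights)
print(round(mean_w, 3), round(se, 3))      # 0.35 0.5
```

The weighted mean moves toward the effect size with the larger weight, which is why the two means can differ noticeably when study sizes are unequal.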
Thompson (2002) noted the importance of reporting confidence intervals, as they indicate the limits within which the population mean is likely to be located, thereby providing an index of precision (Lipsey & Wilson, 2001). Furthermore, if the confidence interval encompasses zero, then it is possible that there is no effect. Cumming and Finch (2001) noted that "[Confidence intervals] support meta-analysis and meta-analytic thinking focused on estimation. This feature of [confidence intervals] has been little explored or exploited in the social sciences but is in [their] view crucial and deserving of much thought and development" (p. 534). The appropriate use and calculation of confidence intervals have been widely discussed (see Algina & Moulder, 2001; Cumming & Finch, 2001; Fan & Thompson, 2001; Fidler & Thompson, 2001; Henson & Cumming, 2003; Mendoza & Stafford, 2001; Smithson, 2000, 2001). Reporting of confidence intervals also is consistent with the recommendations of Wilkinson and the Task Force on Statistical Inference (1999).
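A confidence interval for the weighted mean effect size, and the check of whether it includes zero, can be sketched as follows. This assumes a normal approximation (mean ± z × SE, with z ≈ 1.96 for a 95% interval), the kind of interval described in Lipsey and Wilson (2001); intervals based on noncentral distributions, as discussed by Cumming and Finch (2001) and Smithson (2001), are not shown. The function names and numbers are hypothetical.

```python
from statistics import NormalDist


def confidence_interval(mean, se, level=0.95):
    """Normal-theory interval: mean +/- z * se, z being the two-sided critical value."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for level=0.95
    return mean - z * se, mean + z * se


def includes_zero(interval):
    """True when the interval spans zero, i.e., 'no effect' cannot be ruled out."""
    lower, upper = interval
    return lower <= 0 <= upper


# Hypothetical weighted mean effect sizes and standard errors.
for mean, se in [(0.35, 0.10), (0.10, 0.10)]:
    ci = confidence_interval(mean, se)
    print(f"{mean:.2f} [{ci[0]:.2f}, {ci[1]:.2f}] includes zero: {includes_zero(ci)}")
```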
September/October 2004 [Vol. 98(No. 1)]
Step 4: Reporting Meta-analytic Findings
The final step is the reporting of meta-analytic findings. It is important that one provide graphic and tabular representations of the findings (see Henson & Cumming, 2003; Thompson, 2002; Wilkinson & the Task Force on Statistical Inference, 1999); for example, Thompson (2002) provided an illustration of graphing confidence intervals for effect sizes. The potential formats are varied but should enhance interpretation of the findings. In addition, Wilkinson and the Task Force stated, "Interval estimates should be given for any effect sizes involving principal outcomes" (p. 599). Furthermore, they stated that "it is time for authors to take advantage of them and for editors and reviewers to urge authors to do so" (p. 602). Therefore, it is important for researchers to consult the guidelines and recommendations of the American Psychological Association (2001) because they are intended to facilitate the dissemination of information.
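One way to combine the graphic and tabular displays recommended above is a forest plot, in which each study's effect size is drawn with its confidence interval next to the numbers themselves. The Python sketch below draws a rough plain-text version so that it needs no plotting library; the display range, the marker characters, the function name, and the example rows are all arbitrary assumptions, and a published report would normally use a proper graphics package.

```python
def text_forest_plot(rows, lo=-1.0, hi=2.0, width=41):
    """rows: (label, estimate, lower, upper). Returns one text line per row.

    'o' marks the estimate, '-' spans the interval, '|' marks zero (no effect).
    """
    if not lo < 0 < hi:
        raise ValueError("the display range must include zero")

    def col(x):
        x = min(max(x, lo), hi)  # clamp values that fall outside the display range
        return round((x - lo) / (hi - lo) * (width - 1))

    label_width = max(len(label) for label, *_ in rows)
    lines = []
    for label, est, lower, upper in rows:
        cells = [" "] * width
        for i in range(col(lower), col(upper) + 1):
            cells[i] = "-"
        cells[col(0.0)] = "|"
        cells[col(est)] = "o"
        lines.append(f"{label:<{label_width}} {''.join(cells)} {est:.2f} [{lower:.2f}, {upper:.2f}]")
    return lines


# Hypothetical studies and weighted mean (illustration only).
rows = [("Study A", 0.50, 0.10, 0.90), ("Study B", 0.20, -0.30, 0.70), ("Mean", 0.35, 0.15, 0.55)]
print("\n".join(text_forest_plot(rows)))
```

An interval whose dashes cross the zero marker corresponds to the case, noted in Step 3, in which the possibility of no effect cannot be excluded.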
Summary
Meta-analysis is a salient way to synthesize research findings. In this article, I provided a description of meta-analytic research with an illustration of a meta-analysis, in addition to a review of articles in which meta-analyses have been investigated in The Journal of Educational Research from 1992 to 2002. Furthermore, recommendations associated with conducting a meta-analysis were presented, thereby providing researchers, editors, and readers with guidelines for meta-analytic research.
NOTE
1. The term homogeneous refers to populations or samples that have low variability (Vogt, 1993, p. 105). Furthermore, Dunn et al. (1995) noted that a function of the meta-analysis is the determination of homogeneity, which describes studies sharing a common effect size (p. 359).
REFERENCES
Algina, J., & Moulder, B. C. (2001). Sample sizes for confidence intervals on the increase in the squared multiple correlation coefficient. Educational and Psychological Measurement, 61(4), 633–649.
American Psychological Association. (2001). Publication manual of the American Psychological Association (5th ed.). Washington, DC: Author.
Bangert-Drowns, R. L. (1993). The word processor as an instructional tool: A meta-analysis of word processing in writing instruction. Review of Educational Research, 63(1), 69–93.
Bangert-Drowns, R. L., Kulik, J. A., & Kulik, C. C. (1985). Effectiveness of computer-based education in secondary schools. Journal of Computer-Based Instruction, 12(3), 59–68.
Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for research. Chicago: Rand McNally.
Chambers, E. A. (2003). Efficacy of educational technology in elementary and secondary classrooms: A meta-analysis of the research literature from 1992–2002 (Doctoral dissertation, Southern Illinois University, 2002). Dissertation Abstracts International, 63, 316.
Cohen, P. A. (1981). Student ratings of instruction and student achievement: A meta-analysis of multisection validity studies. Review of Educational Research, 51(3), 281–309.
Creswell, J. W. (2002). Educational research: Planning, conducting, and evaluating quantitative and qualitative research. Upper Saddle River, NJ: Pearson Education.
Cumming, G., & Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational and Psychological Measurement, 61(4), 532–574.
Dunn, R., Griggs, S. A., Olson, J., & Beasley, M. (1995). A meta-analytic validation of the Dunn and Dunn model of learning-style preferences. The Journal of Educational Research, 88, 353–362.
Eysenck, H. J. (1978). An exercise in mega-silliness. American Psychologist, 33, 517.
Fan, X. (2001). Statistical significance and effect size in education research: Two sides of a coin. The Journal of Educational Research, 94, 275–282.
Fan, X., & Thompson, B. (2001). Confidence intervals about score reliability coefficients, please: An EPM guidelines editorial. Educational and Psychological Measurement, 61(4), 517–531.
Fidler, F., & Thompson, B. (2001). Computing correct confidence intervals for ANOVA fixed- and random-effects effect sizes. Educational and Psychological Measurement, 61(4), 575–604.
Fletcher-Flinn, C. M., & Gravatt, B. (1995). The efficacy of computer assisted instruction (CAI): A meta-analysis. Journal of Educational Computing Research, 12(3), 219–242.
Fraenkel, J. R., & Wallen, N. E. (2003). How to design and evaluate research in education. Boston: McGraw-Hill.
Gallo, P. S. (1978). Meta-analysis: A mixed metaphor? American Psychologist, 33, 515–517.
Gillingham, M. G., & Guthrie, J. T. (1987). Relationships between CBI and research on teaching. Contemporary Educational Psychology, 12, 189–199.
Glass, G. V (1976). Primary, secondary, and meta-analysis of research. Educational Researcher, 5, 3–8.
Glass, G. V (1978). Integrating findings: The meta-analysis of research. Review of Research in Education, 5, 351–379.
Glass, G. V, McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills, CA: Sage.
Guthrie, J. T., Schafer, W. D., Von Secker, C., & Alban, T. (2000). Contributions of instructional practices to reading achievement in a statewide improvement program. The Journal of Educational Research, 93, 211–236.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Orlando, FL: Academic Press.
Hedges, L. V., Shymansky, J. A., & Woodworth, G. (1989). A practical guide to modern methods of meta-analysis. Washington, DC: National Science Teachers Association.
Henson, R. K., & Cumming, G. (2003, April). Confidence intervals for effect sizes and other [...] statistics: A graphical approach. Paper presented at the annual meeting of the American Educational Research Association, Chicago.
Holmes, C. T. (1984). Effect size estimation in meta-analysis. Journal of Experimental Education, 52, 106–109.
Hough, S. L., & Hall, B. W. (1994). Comparison of the Glass and Hunter-Schmidt meta-analytic techniques. The Journal of Educational Research.
Kulik, C. C., & Kulik, J. A. (1991). Effectiveness of computer-based instruction: An updated analysis. Computers in Human Behavior, 7, 75–94.
Kulik, J. A., Kulik, C. C., & Bangert-Drowns, R. L. (1985). Effectiveness of computer-based education in elementary schools. Computers in Human Behavior, 1, 59–74.
Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage.
Lou, Y., Abrami, P. C., & Spence, J. C. (2000). Effects of within-class grouping on student achievement: An exploratory model. The Journal of Educational Research, 94, 101–113.
Mavrogenes, N. A., & Bezruczko, N. (1994). A study of the writing of [...] children. The Journal of Educational Research, 87, 228–283.
Maxwell, S. E., & Delaney, H. D. (2000). Designing experiments and analyzing data: A model comparison perspective. Mahwah, NJ: Erlbaum.
McGregor, J. (1993). Effectiveness of role playing and antiracist teaching in reducing student prejudice. The Journal of Educational Research, 86, 215–226.
Mendoza, J. L., & Stafford, K. L. (2001). Confidence intervals, power calculation, and sample size estimation for the squared multiple correlation coefficient under the fixed and random regression models: A computer program and useful standard tables. Educational and Psychological Measurement, 61(4), 650–667.
Niemiec, R. P., & Walberg, H. J. (1985). Computers and achievement in the elementary schools. Journal of Educational Computing Research, 1(4), 435–440.
Onwuegbuzie, A. J., & Levin, J. R. (2003, April). Do effect-size measures measure up? The good, the bad, and the [...]. Paper presented at the annual meeting of the American Educational Research Association, Chicago.
Presby, S. (1978). Overly broad categories obscure important differences between therapies. American Psychologist, 33, 514–515.
Razel, M. (2001). The complex model of television viewing and educational achievement. The Journal of Educational Research, 94, 371–379.
Rosenthal, R. (1979). The "file drawer problem" and tolerance for null results. Psychological Bulletin, 86(3), 638–641.
Rosenthal, R. (1984). Meta-analytic procedures for social research. Beverly Hills, CA: Sage.
Rudner, L. M., Glass, G. V, Evartt, D. L., & Emery, P. J. (2002). A user's guide to the meta-analysis of research studies. College Park, MD: ERIC Clearinghouse on Assessment and Evaluation, University of Maryland.
Ryan, A. W. (1991). Meta-analysis of achievement effects of microcomputer applications in elementary schools. Educational Administration Quarterly, 27(2), 161–184.
Schmidt, F. L., & Hunter, J. E. (1977). Development of a general solution to the problem of validity generalization. Journal of Applied Psychology, 62, 529–540.
Smithson, M. (2000). Statistics with confidence. Thousand Oaks, CA: Sage.
Smithson, M. (2001). Correct confidence intervals for various regression effect sizes and parameters: The importance of noncentral distributions in computing intervals. Educational and Psychological Measurement, 61(4), 605–632.
SPSS. (2001). SPSS 11.0 for Windows [Computer software]. Chicago: SPSS Inc.
Thompson, B. (2002). What future quantitative social science research could look like: Confidence intervals for effect sizes. Educational Researcher, 31(3), 25–32.
Thompson, S. G. (1994). Systematic review: Why sources of heterogeneity in meta-analysis should be investigated. BMJ, 309, 1351–1355.
Thies, S. M., & Lester, D. W. (1984, September). [...]. Paper presented at the annual meeting of the Mid-Western Educational Research Association, Chicago.
Vogt, W. P. (1993). Dictionary of statistics and methodology. Newbury Park, CA: Sage.
Wilkinson, L., & the Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54(8), 594–604.
Wolf, F. M. (1986). Meta-analysis: Quantitative methods for research synthesis. Beverly Hills, CA: Sage.
APPENDIX
Criteria for Evaluating a Meta-Analysis*
Criterion    Yes    No    N/A
Problem criteria
1. Presenting a significant issue
2. Relating issue to theory
3. Identifying confounds
Design criteria
4. Inclusion criteria for the studies (e.g., experimental, age of participants)
   Describing sample of studies: How representative is the sample of the population of studies?
5. Data collection procedures (e.g., contact individuals, database searches, reference lists)
   Coding scheme and interrater reliability
   Total number of studies
6. Describing characteristics of the study
   Number of participants in each study
   Number that are experiments
   Percentage male and female students
   Adults and/or children
   Elementary and/or secondary and/or postsecondary students
7. Single effect size for each study versus multiple effect sizes (i.e., statistical independence)
8. Clearly identifying delimitations
Results criteria
9. Test of homogeneity
10. Visual display of effect sizes (graphic, tabular)
11. Central tendency, spread of effect sizes, and confidence intervals
Interpretation criteria
12. Significance testing of effect sizes: Are they greater than chance?
13. Comparing effect sizes of various kinds of subjects, designs, sample sizes, etc.
14. Articulating cautions/limitations
15. Advancing recommendations for practice and research
*These criteria were amended from Hall, J. A., & Rosenthal, R. (1995). Interpreting and evaluating meta-analysis. Evaluation & the Health Professions, 18, 393–407.