
ARTICLE

Verdicts or Inventions?
Interpreting Results From Randomized Controlled
Experiments in Criminology

LAWRENCE W. SHERMAN
University of Pennsylvania

HEATHER STRANG

Australian National University

The social benefits of randomized controlled trials (RCTs) would be enhanced by general
recognition of three problems of their interpretation and a redefinition of their mission in
relation to program development and evaluation. One problem is that of "forest versus
trees," or the sampling relationship between each test of a hypothesis and the conclusions
drawn from all such tests taken together. A second problem is interpreting RCTs as testing
theory or policy when they cannot achieve a high correlation between the treatments
assigned and treatments actually applied in each case. The third problem is "what works for
whom," or whether identical treatments cause different effects, on average, for different kinds
of people, groups, situations, or other units of analysis that were different at the point of random assignment. Confronting these three problems suggests that RCTs should not only seek
verdicts about what works but also should seek better inventions of crime prevention
programs for further testing.
Keywords: randomized controlled trials; experiments; criminology; forest graphs; systematic
reviews; Campbell Collaboration

In the past four decades, the question of what works in preventing crime has
received increasing attention from policy makers and administrators (Attorney
General's Task Force on Violent Crime, 1981; Martinson, 1974; President's
Commission, 1967; Sherman et al., 1997, 2002). The growing interest may stem
from high crime rates, from frustration with traditional criminal justice solutions, or from an accumulation of crime prevention evaluations suggesting that
many crime prevention strategies are ineffective (Butterfield, 1997). Those
Authors' Note: This article was prepared with the support of the Jerry Lee Center of Criminology,
University of Pennsylvania; Criminology Research Council (Australia); Department of Health (Australia); Department of Transport and Communications (Australia); the Australian National University; and the Smith Richardson Foundation. Points of view or opinions expressed herein are those of
the authors and do not necessarily reflect the official position of any or all of the funding organizations.
AMERICAN BEHAVIORAL SCIENTIST, Vol. 47 No. 5, January 2004 575-607
DOI: 10.1177/0002764203259294
© 2004 Sage Publications


Downloaded from abs.sagepub.com at UNIV FEDERAL DE PERNAMBUCO on March 11, 2015

negative conclusions may reflect a growing use of more rigorous research designs, as recommended by sources as diverse as the National Academy of Sciences (White & Krislov, 1977) and the Federal Judicial Center (1981) of the
U.S. Supreme Court. These strong research designs tend to find less success in
crime prevention strategies than do weaker, less-controlled research designs
(Weisburd, Lum, & Petrosino, 2001) that are more prone to distortion and bias
(see also Glazerman, Levy, & Myers, 2003). Most powerful among the stronger
research designs in eliminating rival alternative explanations (Campbell &
Stanley, 1963) is the randomized controlled trial.
Randomized controlled trials (RCTs) have now demonstrated that criminal
sanctions do not consistently deter crime. Rather, both threats and imposition of
criminal penalties have highly variable effects on crime. Depending on the kinds
of offenders and the kinds of offenses, criminal punishment can increase crime,
reduce crime, or have no effect on crime (Sherman, 1993). Although field experiments have clearly established the range of possibilities, they are still just
beginning to map the genome-like complexity of both the delivery of, and
responses to, legal punishment of varying kinds.
Similar discoveries also have been made about crime prevention by means
other than punishment or legal threats. Strategies as diverse as preschool education (Yoshikawa, 1994), nurses visiting at-risk infants and mothers in their
homes (Olds et al., 1998), and drug abuse treatment for adults (Gottfredson &
Exum, 2002) have shown clear crime prevention benefits in RCTs. Taken
together, the growing body of research and researchers using RCTs to test the
effects of action intended to prevent crime constitute the emerging field of
experimental criminology. Although the actual number of RCTs funded by the
U.S. Department of Justice varies widely in a pattern of feast or famine (Garner
& Visher, 2003), the growing emphasis on research reviews and syntheses gives
increasing prominence to accumulated RCTs already reported in the literature.
The growth of experimental criminology also is evident in the recent emergence of two organizations for advancing its work. One is the Academy of
Experimental Criminology, an international group founded in 1999 to recognize
the major contributors to the field by election as Fellows of the Academy
(www.crim.upenn.edu/aec). The second organization is the Campbell Crime
and Justice Group (www.aic.gov.au/ccjg), which is part of the larger Campbell
Collaboration (www.campbell.org), named in memory of experimental methodologist Donald T. Campbell. The Campbell Collaboration (known as C2)
was organized in 2000 with the encouragement of the Cochrane Collaboration
(www.cochrane.org), a medical research organization devoted to compilation of
systematic reviews of randomized trials to reach the most reliable conclusions
possible about what works in medicine. The Cochrane Collaboration was
founded in 1992 with the support of the British National Health Service to
cope with the profusion of medical research (an estimated 1 million RCTs; Millenson, 1997, p. 130) that no practicing physician or healthcare policy
maker had time to review properly. The goal of the Cochrane Collaboration is the same as the goal of experimental criminology: learning as much as possible about the relative effectiveness of different strategies for reducing specific kinds
of harm under specific circumstances. But in the case of criminology, there is
a growing strategic question about whether that learning is removed from the
process of inventing those strategies or an integral part of it. The purpose of
this article is to explore that question, especially in relation to the unique role
of RCTs.
Verdicts or inventions? The growth of experimental criminology has now
made it possible to reach the same kind of strong verdicts about the safety and
effectiveness of crime programs that are reached from medical RCTs. Perhaps
the most powerful example is Petrosino, Turpin-Petrosino, and Finckenauer's
(2000; Petrosino, Turpin-Petrosino, & Buehler, 2003) conclusion, from a systematic review of seven RCTs, that Scared Straight prison-visit programs
probably increase, rather than reduce, crime. The publication of this conclusion
failed to prevent the Governor of Illinois from signing legislation requiring that
Chicago schools use this program with all high-risk youth (Long & Chase,
2003). But the Chicago case illustrates the potential value of reaching verdicts
on programs that cause more harm than good, or simply waste money. A prominent example of the latter is the verdict that Drug Abuse Resistance Education
(D.A.R.E.) had no effect on drug abuse (Gottfredson, Wilson, & Najaka, 2002),
which in that case led to a major overhaul of the program and RCTs to evaluate
the new program design.
The benefit of such verdicts is clear. What is less clear is the harm that verdicts may do to the process of reinventing crime prevention (Sherman, 2003).
The regulatory overtones of RCTs reaching verdicts, as they do in medicine, can
create an adversarial relationship between evaluators and crime prevention professionals.1 As Zigler (2003) has observed, there may be nothing more threatening than having someone conclude that your whole life's work has been a
failure. At best, such verdicts may prematurely close off promising lines of
research for statistical reasons illustrated below. At worst, such verdicts may
undermine the entire political economy of program evaluation in crime and justice, turning professionals against evaluators and weakening public support for
further RCTs. This prospect stands in sharp contrast to medicine, where practicing professionals invent and test new methods in RCTs that they conduct themselves, and serve as the primary consumers of their own evaluation research products. Thus, it is all the more important for RCTs in crime prevention to
contribute to the process of invention as well as reach clearer verdicts at each
stage of the invention process.
The experience of medicine in reaching externally valid verdicts about what
works based on accumulations of internally valid RCTs offers the largest body
of evidence available for guiding the growth of experimental criminology.
Although the volume of criminological RCTs remains a tiny fraction of the
volume of RCTs in medicine, there are about 600 RCTs published in criminology (Chalmers, 2003). This growing abundance of RCTs on crime prevention
raises problems of interpretation that were not evident when virtually all experiments were a first-ever test of a hypothesis and replications were unknown.
Given the crucial role of replication in science, it is essential for both policy
makers and scientists to interpret the results of each RCT in the context of actual
or potential replications. That vision foresees a world in which verdicts on what
works would almost never be made on the basis of a single RCT (see Sherman &
Cohn, 1989) but almost always on the basis of multiple tests reaching similar
conclusions that can be integrated in a commonly understood framework.
Three crucial problems. Achieving that vision in criminology requires both
experimenters and systematic reviewers of RCTs to recognize three crucial
problems. One is the problem of "forest versus trees," which led three decades
ago to the invention of the forest graph to plot the magnitude and direction of
effect sizes in all available tests of a single hypothesis. The problem that this
invention addressed, but by no means solved, is the sampling relationship
between each individual RCT and a hypothetical universe of all potential RCTs
of its kind. Without a clear understanding of that relationship, the interpretation
of RCTs will suffer problems of external validity that limit the value of research
for learning what works. The crucial issue is whether there is enough homogeneity of program content and research design across the accumulated tests to
offer a metaphorical apple (tree) to apple (tree) comparison or whether the
assembled sample of RCTs includes orange and peach trees in what is supposed
to be a forest of nothing but apple trees.
The second major problem revealed by the growing number of RCTs in
crime prevention is the "theory versus policy" problem of defining what hypothesis has been tested: a theory, a policy, or both. Ideally, all RCTs try to test a
theory of causation that underlies a policy or program that is randomly assigned
to some units but not others. In practice, the test is rarely completed with all units
randomly assigned to the program. This result forces a choice between analyzing data on the basis of randomly assigned intentions to treat (ITT) or on the
basis of selectively occurring treatment on the treated (TOT), or both. ITT
analysis tests the hypothesis that a policy of attempting to apply a program
causes less crime. TOT analysis tests the hypothesis that the theory of applying a
program to reduce crime is correct when the program is actually applied.
Although there are advanced statistical techniques for doing this in a way that
minimizes the selection bias (Angrist, Imbens, & Rubin, 1996) inherent in the
unadjusted comparisons of program completers and randomly assigned controls often reported in the literature (see Gorman, 2002), such adjustments are
often not possible with the relatively small sample sizes in medical and criminological experiments.
The effect of actual treatment, and not just the intention to treat, may seem to be the more useful hypothesis to test. The loss of complete random assignment, however, presents some conceptual and estimation challenges. The conceptual issue arises insofar as treatment effects are heterogeneous across the
units being treated: one may not be able to safely extrapolate from one treated
group to the treatment effect in the population at large. The estimation issues are
twofold. First, there is the hazard of confusing the untreated treatment group
with the control group; when this occurs, the estimated treatment effects may be
severely biased. Second, as we discuss below, instrumental variable (IV) estimators used to estimate the TOT parameter consistently will have low power
unless a high proportion of the treatment group is actually treated. This problem
of interpreting the effects of programs on crime is vastly compounded when
each test becomes a tree in a forest of tests, with each test varying in the proportion of units to which treatment was successfully applied as randomly assigned.
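The ITT/TOT distinction, and the Wald-type IV adjustment associated with Angrist, Imbens, and Rubin (1996), can be sketched numerically. All figures below are invented for illustration, and the simple rescaling assumes one-sided noncompliance (no control cases receive the treatment); it is a sketch of the logic, not a full IV analysis.

```python
# Illustrative sketch of ITT vs. TOT estimation under incomplete
# treatment delivery. All numbers are hypothetical, not drawn from
# any experiment discussed in the text.

# Hypothetical trial: offenders randomized to a program or to control.
# Only 70% of those assigned to the program actually received it,
# and no one assigned to control received it (one-sided noncompliance).
compliance_rate = 0.70

# Hypothetical one-year reoffending rates, by *assigned* group.
reoffend_assigned_treatment = 0.40
reoffend_assigned_control = 0.50

# ITT: compare groups as randomized, ignoring delivery. This tests
# the *policy* of attempting to apply the program.
itt_effect = reoffend_assigned_treatment - reoffend_assigned_control
print(f"ITT effect: {itt_effect:+.3f}")  # prints -0.100

# Comparing program completers directly to controls would invite the
# selection bias noted in the text, because completers are not a
# random subset of those assigned to treatment.

# Wald/IV estimate of the effect of treatment on the treated (TOT):
# scale the ITT effect by the compliance rate. The same ITT effect
# implies a larger effect per unit actually treated.
tot_effect = itt_effect / compliance_rate
print(f"Wald/IV TOT estimate: {tot_effect:+.3f}")  # prints -0.143
```

Note that the ITT effect is divided by the compliance rate: as compliance falls, the divisor shrinks and the estimate becomes less precise, which is one way to see why IV estimates of the TOT parameter have low power unless a high proportion of the treatment group is actually treated.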
The third problem of interpretation concerns heterogeneity of effects of the
same program on different kinds of populations, or "what works for whom?"
(Fonagy, Target, Cottrell, Phillips, & Kurtz, 2002). In agriculture (Cox, 1958),
medicine (Millenson, 1997), and criminology (Sherman & Smith, 1992), there
are well-documented examples of different effects of identical treatments
among different kinds of populations. This problem suggests the great importance of replication, which is often the best means for discovering variability in
effects. It also suggests the value of large sample sizes, in which theoretically
defined groups at baseline can be randomly assigned to different treatments to
comprise "experiments within experiments." Verdicts about what works drawn
from one or two RCTs with small sample sizes may therefore run a high risk of
obscuring crucially important variability of effects and of premature closure of
research fed by the common public preference for one-size-fits-all policies. This
problem is part of a larger problem of premature conclusions that programs do
not work, when in fact they may be quite promising (Weisburd, Lum, & Yang,
2003).
One basic question. These three problems of interpretation provide a conceptual basis for framing a basic question: What is the purpose of RCTs in crime
prevention? This question can be informed by comparing criminology to engineering and to medicine (Sherman, 2003). The history of engineering shows that
under properly controlled conditions, repeated trials that alter only one or two
variables at a time have been crucial to the process of invention (Uglow, 2002,
pp. 57-58). The use of modern statistics for meta-analysis and forest graphs,
however, may have obscured the value of historically great inventions. Unlocking the complexity of discovery from relatively rigid statistical formulas
requires recognition of the "lightbulb problem," in which one unusually good result may be discarded in systematic reviews as an outlier rather than being hailed as a prime "what's promising" candidate for exact replication. The alternative is a process of reconciliation, in which analysts consider the reasons for
differences in results of similar programs in independent tests (Moffitt, 2003),
exploring, rather than just deploring, the lack of exact replication of program
designs and operations.


Exact replication is the underlying theme of all of these problems of interpretation. All field sciences (such as medicine and agriculture), as distinct from laboratory sciences (such as physics and chemistry), suffer great threats to exact
replication and must struggle hard to overcome them. Less tolerance of inexact
replication, and greater investment in exact replication of those RCTs yielding
large crime prevention effects, would produce both better science and better
public policy. Less pressure to create forests from inexactly replicated trees, and more incentives to produce exactly replicated trees, also might keep the jury out long enough to accumulate more evidence about what could work if we just keep trying to improve on what has been done so far.
PROBLEM 1:
TREES VERSUS THE FOREST
What is the relationship of a single RCT result to an externally valid verdict
about whether a crime prevention program works? Our first impulse in answering this question may come from the great experiments in laboratory science, in
which one experimental result told the whole story. The truth of the initial experiment was supported by its replication by other researchers using the exact same
methods that yielded exactly the same result: Galileo dropping weights from
different heights, Boyle measuring barometric air pressure, and other such master strokes (Harré, 1983) come to mind. But these examples, alas, are limited to
the laboratory sciences and are not appropriate exemplars for discovery of
cause-and-effect relationships in field studies using heterogeneous materials
such as soil or people. Research in less exact sciences such as medicine and agriculture expects a range of effect sizes, and even directions, across repeated experiments. This variation requires not one but many experiments to reach a general,
externally valid conclusion about the cause-effect hypothesis tested in each
experiment.
Even in the laboratory sciences, a spectrum of replicability exists and measures can be variable. But a bar graph of the results of five trials of the same
experiment would be expected to show a "forest" (or pattern) of five rather similar "trees" (results). In the inexact sciences, a graph of five trials would virtually
never show the same results (Cox, 1958, p. 3). In the laboratory sciences, each
tree is thought to be as good as any other tree for reaching a conclusion, and the
number of trees all reaching the same result may be irrelevant. In the inexact sciences, the addition or subtraction of any one tree can alter the conclusion
reached about the shape of the forest.
The critical importance of forest analysis has only recently been understood
in medicine. In a famous example now used as the logo of the Cochrane Collaboration (see Figure 1), a corticosteroid treatment for preventing morbidity and mortality associated with premature birth was tested seven times (Chalmers, 2003).

Figure 1: Effects of Corticosteroids on Infant Mortality in Seven Randomized Controlled Trials
SOURCE: Chalmers (2003).
NOTE: The Cochrane Logo is used with permission from The Cochrane Collaboration, www.cochrane.org.

In five of the seven tests, the impact of the treatment on infant mortality was statistically insignificant. By the traditional vote-counting
methods of a narrative literature review (including the methodology used by Sherman and his colleagues, 1997, in their Preventing Crime report to the U.S. Congress), the verdict from this evidence would have been that most tests of this treatment show that it does not work. It was only when all seven effect sizes
and their confidence intervals were horizontally displayed as the "trees" across a vertical line indicating zero effect that the shape of the "forest" emerged, showing
reduced infant mortality from premature birth as the average effect across all
seven (Chalmers, 2003). The best interpretation of the forest graph is that all
seven tests show roughly the same results, within confidence intervals: that the treatment reduces mortality.
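The logic of the forest graph can be sketched as a fixed-effect meta-analysis in which each trial's effect estimate is weighted by the inverse of its squared standard error. The seven effect sizes and standard errors below are invented for illustration; they are not the actual corticosteroid trial data.

```python
import math

# Hypothetical log odds ratios and standard errors for seven trials.
# Negative values favor treatment (reduced mortality). Each trial's
# own 95% confidence interval here crosses zero, so each looks
# "statistically insignificant" on its own.
trials = [(-0.60, 0.40), (-0.20, 0.35), (-0.45, 0.50),
          (-0.70, 0.55), (-0.10, 0.45), (-0.55, 0.30), (-0.30, 0.60)]

# Fixed-effect pooling: weight each trial by 1 / SE^2, so more
# precise trials count for more in the average of averages.
weights = [1.0 / se**2 for _, se in trials]
pooled = sum(w * est for (est, _), w in zip(trials, weights)) / sum(weights)
pooled_se = math.sqrt(1.0 / sum(weights))

lo, hi = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
print(f"pooled log OR = {pooled:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Vote counting over the individual confidence intervals would call every one of these invented trials inconclusive, yet the pooled interval excludes zero, which is precisely the point the Cochrane logo illustrates.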


The Cochrane Collaboration logo provides a clear contrast between the variability of individual patients' (and samples of patients') responses to a chemically identical treatment (corticosteroids) and the constancy of results of identical tests of, say, gravity. The law of gravity may perfectly predict the speed with
which any bowling ball drops from the Washington Monument, but perfect prediction has defied the study of medical patients, as well as of cornfields, voters,
and criminal offenders. The exact effects of fertilizer on harvest volume in each
cornfield, of a political advertisement on each voter's choice, and of prison term
length on the recidivism of each criminal offender cannot be discovered because
the units of analysis vary in so many unknown ways. Repeated tests of a fertilizer on samples of 100 potato fields therefore show a wide range of responses (Cox, 1958, p. 1), just as offenders vary widely in their recidivism rates
(Blumstein, Cohen, Roth, & Visher, 1986). Rather than seeking the uniform
cause-effect laws of physics, the inexact sciences seek to estimate average
effects of interventions across large samples of units. The logic of RCTs allows
us to discover the average effect of one action on a sample of units compared to
the average effect of a different (or no) action on a similar sample of units. Our
goal is to discover the relative difference of average effects of two treatment conditions rather than the absolute effect of a treatment condition on a highly
diverse population.
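The contrast between individual-level unpredictability and sample-level averages can be sketched with a small simulation; the response model and all numbers are invented. Each unit's own treatment response is never observed, yet the difference in randomized group means recovers the average effect.

```python
import random

random.seed(7)  # fixed seed so the sketch is reproducible

# 1,000 hypothetical units, each with its own baseline outcome level
# and its own (never observed) response to treatment. No uniform law
# links treatment to outcome at the individual level.
n = 1000
baselines = [random.uniform(0.2, 0.8) for _ in range(n)]
effects = [random.gauss(-0.10, 0.15) for _ in range(n)]  # true mean: -0.10

# Random assignment: half the units to treatment, half to control.
ids = list(range(n))
random.shuffle(ids)
treat_ids, control_ids = ids[: n // 2], ids[n // 2:]

# Observed outcomes: baseline, shifted by the unit's own effect if treated.
treat_mean = sum(baselines[i] + effects[i] for i in treat_ids) / len(treat_ids)
control_mean = sum(baselines[i] for i in control_ids) / len(control_ids)

# The difference in group means estimates the *average* effect of
# treatment relative to control, despite the unit-level heterogeneity.
avg_effect = treat_mean - control_mean
print(f"Estimated average effect: {avg_effect:+.3f}")  # close to -0.10
```

The estimate lands near the true average effect of -0.10 even though no individual unit's response is knowable, which is the sense in which RCTs estimate relative differences in averages rather than uniform laws.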
This point is critical to the need for multiple RCTs testing each intervention.
Once the variability of units within samples is recognized, the variability of
(average) responses to interventions between samples becomes central. Just as a
single RCT attempts to estimate relative differences in average effects between
treatment and control groups, a meta-analysis or other approach to reviewing
RCTs attempts to estimate the average of the average effects across all RCTs.
The publication of these studies one by one can be a source of frustration to lay
readers confused by apparently conflicting results. "Why don't they just get it
right first and then announce the conclusions?" is a question many people ask,
for example, about medical research.
This problem is further complicated by the use of very different medical
research designs, such as observational and experimental studies of diets, to test
the same hypotheses. Differences in method often yield differences in conclusions (Weisburd et al., 2001), which is why systematic reviews of medical
research rarely mix together studies using different methods. Yet such mixtures
are commonly found in social science (Wilson & Lipsey, 2000), partly because
of a dearth of studies using identical research designs. The recent growth of
experimental criminology, however, now makes it possible to limit at least some
reviews to RCTs only. That trend will simplify but not eliminate the problems of
drawing externally valid conclusions from a sample of RCTs done in slightly (or
very) different ways with diverse kinds of populations or settings. In anticipating the growing importance of those problems, this article focuses solely on the
issues of interpreting RCTs and sets aside the ongoing debates about whether to emphasize RCTs, or even how to integrate them with less rigorous research
designs.
The future success of efforts to reach a single verdict from a systematic
review of all RCTs on a single crime prevention program or practice depends on
each RCT becoming simply one unit in a theoretical sampling frame of all possible tests of the same program. How well the achieved sample of tests ever conducted corresponds to a theoretical sampling frame is hard to judge. This is true
either in advance of doing any research or after many tests have already been
completed. The problem lies in conceptualizing just what the theoretical basis of
a sampling frame might be, given a wide range of possibilities. This problem
makes scientists properly reluctant to base major policy decisions about large,
diverse populations solely on a single RCT with a tiny fraction of that population. Viewing all results as part of a distribution of possible results, any one
result could either be placed right on the central tendency of all results or far out
on the fringes of the distribution as a freak occurrence. It is the risk of being misled by a single result that leads to the great preference for replication, more so in
inexact sciences than in the laboratory sciences. Only in extreme life-or-death
situations, such as the treatment of AIDS with AZT, does medical research stop
at a single RCT with a no-treatment control group (Hilts, 1986).
The problem with reaching verdicts about what works on the basis of multiple RCTs, rather than from a single RCT, is that we rarely can limit the
evidence, or conclusions, to exact replications of population, community
context, demographic characteristics of the sample, and other variables ad infinitum. The strength of using diverse samples and program content is that similar
findings suggest robust predictions, or high probability of accuracy, in concluding that such programs will work almost anywhere. The weakness is that important preconditions of success in the achieved sample of RCTs may not be obvious, leading to failure in predicting that a program will work in, say, Community
8 based on RCTs in Communities 1 to 7. This inevitably leaves us with the problem of imperfect prediction of what works where. This problem of external
validity lacks a theory, and even a body of evidence from which to induce a theory,
to help deduce reasonable generalizations from tests in one setting to tests in
another. We have little scientific basis for concluding that what works in Kansas
City, Missouri, is likely to work in New York or that what works in Australia is
likely to work in England.2
The huge scientific gap in knowing how and when to generalize from one,
or even many, RCTs is a cause of great public dissatisfaction with social
science, and perhaps with all inexact sciences. Policy makers in New York do
not want to know what works to prevent crime in Kansas City. They want
to know what will work in New York, but without bothering to replicate RCTs
in New York that have been conducted in Kansas City. Policy makers expect
criminologists to tell them, based on what the literature shows, whether what
works in Kansas City is likely to work in New York. There are two ways we can do this. One is to guess (not recommended but widely practiced), and the other is to develop enough RCTs across diverse urban populations, in which the treatment is replicated as exactly as possible, so that we can estimate the likely
effects of a test in New York on the basis of the distribution of results of prior
tests.
But criminology has yet to achieve such a distribution based on exact replications: a forest of trees, all of the same species. We have not even come close. Our
best efforts to date are forests with such diverse species of trees that it is hard for
us to predict just what fruit, on average, each will bear.3 Such diversity of the
RCT trees raises problems of both internal validity in drawing conclusions from
such mixed-forest reviews and of external validity in generalizing from the conclusions of such reviews. Both kinds of validity may be understood in relation to
our inductive theory about systematic reviews of RCTs.
The theory is straightforward but multidimensional: The more homogeneous
the program, research design, and analytic methods used by RCTs included in a
systematic review, the more internally and externally valid the conclusions from
the review will be. Note that this proposition excludes consideration of sample
population characteristics, which provide the basis for separate propositions
about both external validity of reviews (what works where) and the issue raised
below about what works for whom. Every other aspect of an RCT, however, is
theoretically relevant to the theory of internal validity, independent of sample
characteristics. Unlike a single RCT, in which randomization creates very similar distributions of internal variability across units within each treatment group
for comparisons between treatment groups, a universe of all RCTs carries no
guarantee of similarity from one RCT to the next. The idea of an average effect
of a treatment across RCTs may therefore make sense only in direct proportion
to how similar the RCTs are in their methodological characteristics. The methodological variability of RCTs necessarily makes this theory of validity multidimensional in ways that risk exactly the kind of specification error that random
assignment itself was designed to eliminate (Fisher, 1935). With that risk painfully in mind, we offer a theory of validity that specifies five methodological
dimensions on which we have observed substantial variability across RCTs:
(a) treatment content, (b) implementation dosage, (c) control group treatment,
(d) outcome measures, and (e) attrition.4
These dimensions can be illustrated with both a criminological and a medical
example. The criminological example is a systematic review of seven restorative
justice RCTs for the Campbell Collaboration's Crime and Justice Group and the
Smith Richardson Foundation (Strang & Sherman, 2003). The medical example
is a systematic, Cochrane Collaboration review of 39 RCTs of a common treatment for back pain (Assendelft, Morton, Yu, Suttorp, & Shekelle, 2003).
A comparison of the medical and criminological examples produces two
conclusions. One is that even when holding all else constant, repeated tests of
the same treatment produce different measured results in different tests with different samples, just as in Figure 1 above. The other conclusion is that the medical example shows far greater tendency than the criminological example to
hold all else (besides sample characteristics) methodologically constant. This
difference is apparently not unique to the examples chosen but reflective of the
greater tendency toward standardization of practices in medicine as compared to
crime prevention. This makes the science of crime prevention, at least to date, even less exact than other inexact sciences. That fact is one we can try to
change, even as we must take it into account in reaching verdicts based on the
available evidence.
Nonetheless, even medicine suffers substantial variability in the quality of
study methods. The authors of our medical example conclude that variable quality constitutes the major limitation to their systematic review because there is "little consensus on how study quality should be assessed and on the optimal method of incorporating study quality in statistical pooling" (Assendelft et al., 2003, p. 877; see also Juni, Altman, & Egger, 2001). What follows may not generate consensus about how to handle methodological variability in experimental
criminology or in RCTs for any social science, but it may at least generate
discussion.
TREATMENT CONTENT

The treatment content in the medical example appears to be far more consistent than in the criminological example. The widely used medical treatment,
known as "the recommended therapy" for patients with low back pain (Assendelft et al., 2003, p. 879), is called spinal manipulative therapy. This consists of a hands-on twisting maneuver, designed to relieve back pain, conducted in a standard manner by doctors. The criminological example is called a
restorative justice conference, in which an admitted criminal offender meets
for 90 to 180 min with his family or friends, sometimes with the crime victim(s),
and always with a discussion facilitator (usually a police officer) who seeks to
evoke an intensely emotional dialogue about the harm the crime caused and how
the offender might repair that harm. This treatment was a legal diversion from
regular processing by juvenile courts, except in two RCTs that included mostly
adult offenders.
How much did these treatments vary across RCTs? Very little variation was
reported across all 39 RCTs in the medical case. The crime prevention treatment
varied widely across the seven RCTs, with victims and their families present in
five of the seven, a standard video interview with a victim shown in one, the
number of families and friends of victim and offender varying widely from each
experiment to the next, and mean length of discussion time differing across
RCTs. After consultation with Campbell Collaboration reviewers and consultants, the reviewers decided to disaggregate the review by treatment content,
separating those programs testing face-to-face meetings with personal victims
of crimes of violence and property from restorative justice conferences that did
not include personal (as distinct from corporate or community) victims.

AMERICAN BEHAVIORAL SCIENTIST

[Figure 2 here: bar chart; trials Beth-V (N=76), Beth-P (N=113), Indy (N=167), JVC (N=62), JPP (N=124), JPS (N=77), and PCA (N=450), with bar heights of 83%, 83%, 79%, 90%, 67%, 50%, and 32% on a 0% to 100% scale.]
Figure 2: Percentage of Restorative Justice Conferences Treated as Randomly Assigned in Seven Randomized Controlled Trials
SOURCE: McCold and Wachtel (1998); McGarrell, Olivares, Crawford, and Kroovand (2000); Sherman, Strang, and Woods (2000).
IMPLEMENTATION DOSAGE

The medical example did not report the percentages of patients randomly
assigned to the experimental group who were actually treated as intended, but
they appeared to vary across the 39 RCTs. This variation was great enough so
that a regression analysis of study quality characteristics included whether each
study employed an intention-to-treat (ITT) analysis as distinct from a treatment-on-the-treated (TOT) analysis, with ITT
given the higher quality score. What the back therapy review did not do, apparently, was to disaggregate or exclude RCTs based on failure to meet a theoretical
threshold of the percentage of patients assigned to a therapy who actually started
or completed the therapy (see Weinstein & Levin, 1989).
In the criminological example, the percentage of cases randomly assigned to
restorative justice that actually received that treatment featured prominently in
the review because it varied widely, from less than 50% to more than 90% (see
Figure 2). The percentage of cases assigned to the control group (most often prosecution in court) that actually completed the control treatment also varied,
but not as widely. After consultation with Campbell Collaboration reviewers
and consultants, the reviewers decided to disaggregate the review by treatment
dosage, excluding any RCT in which less than 66% of the offenders assigned to
receive a restorative justice treatment actually attended such a conference.
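That exclusion rule is mechanical enough to sketch in a few lines; the trial names and delivery rates below are hypothetical placeholders rather than the published figures:

```python
# Hypothetical treatment-as-assigned (delivery) rates for four RCTs.
delivery_rates = {"Trial A": 0.90, "Trial B": 0.83, "Trial C": 0.50, "Trial D": 0.32}

THRESHOLD = 0.66  # minimum share of assigned cases that must receive the treatment

# Retain only trials that delivered the treatment often enough to count
# as a test of the theory rather than merely of the policy.
included = {name: rate for name, rate in delivery_rates.items() if rate >= THRESHOLD}
```

Any trial falling below the threshold would drop out of the forest graph for theory-testing purposes, although it could still inform a verdict on the policy.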
CONTROL GROUP TREATMENT

The medical example systematically disaggregated the results according to


the content of the five control treatments used across all 39 RCTs for two
dependent variables: chronic low back pain and acute low back pain. Because
some studies reported on both outcomes, the count of 39 RCTs actually understates the evidence: a total of 41 effect-size "trees" were disaggregated into 10 different subgroups by control comparison group (5) and dependent variable (2).
The control group comparisons consisted of sham therapy (3 RCTs for
chronic and 1 for acute), General Practitioner care (4 RCTs for chronic and 3 for
acute), physical therapy or exercise (2 RCTs for chronic and 6 for acute), traction, corset or other therapy (7 RCTs for chronic and 10 for acute), and Back
School (3 RCTs for chronic and 2 for acute). In one sense, this procedure created 10 separate "reviews" with an average of 4 RCTs each, with each of these subreviews addressing a different form of the same question: What is the relative effect of the leading back pain treatment when compared to treatment X1 . . . or X2 . . . or X3 to X5 on Outcome A (chronic) or B (acute)? In these efforts to compare RCTs with replications that are as exactly alike as possible, we see the constant awareness that answering the question of "what works?" must always be prefaced by answering the question, "as compared to what?"
The criminological example of restorative justice also employs such disaggregation, but with far fewer RCT cases to analyze. In one case, the control
group was diversion to any one of 23 alternative programs for juvenile delinquents, including community service and special education; in another RCT, the
control group was prosecution in adult court, most commonly resulting in a fine
and suspension of driver's license (for drunk driving); in a third, the control
group was mixed between prosecution in either juvenile or adult court depending on the age of the offender; in the other four, the control group was prosecution in juvenile court.
OUTCOME MEASURES

In the medical example, the outcome measure was calculated by the reviewers as the effect size of differences in a measure of pain. Most of the RCTs
seemed to employ a standard and widely used scale in back medicine, the "100-mm visual analogue scale [VAS]," but the published article is unclear about how
many or which of the RCTs used this scale.5 Although the authors reported a
range of sensitivity analyses, they made no mention of having done so with differences in outcome measures. Use of effect sizes makes diverse outcome measures mathematically comparable, but that does not necessarily make them
measure the same things.
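For readers unfamiliar with how such forests are summarized: effect sizes are conventionally pooled across trials by inverse-variance weighting. A minimal fixed-effect sketch, with invented effect sizes and standard errors:

```python
from math import sqrt

# (standardized effect size, standard error) for each hypothetical RCT "tree"
trees = [(-0.30, 0.15), (-0.10, 0.10), (0.05, 0.12), (0.20, 0.18)]

weights = [1 / se ** 2 for _, se in trees]        # inverse-variance weights
pooled = sum(d * w for (d, _), w in zip(trees, weights)) / sum(weights)
pooled_se = sqrt(1 / sum(weights))

# 95% confidence interval for the pooled "forest" estimate
ci_low, ci_high = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
```

With these invented numbers the pooled interval spans zero even though individual trees point in opposite directions, which is the forest-versus-trees problem in miniature: the pooled verdict is only as meaningful as the decision about which trees belong in the forest.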
In the criminological example, the measures were clearer but more diverse. A
National Academy of Sciences report has documented substantial differences
between the participation rate in offending (the percentage of any group of people with at least one measured crime in a given time period) and the frequency of offending (the rate of measured crimes per individual offender within
a given time period) (Blumstein et al., 1986). The most sensitive approach to
estimating crime prevention effects is the measurement of offense frequency
because a small proportion of offenders commit a high proportion of all crimes.


However, in three of the seven restorative justice RCTs, participation rate was
the only outcome measured. This rate was defined as the percentage of offenders
with at least one arrest in the first year following random assignment (after-only
differences) and was available in all seven RCTs. In four of the RCTs, investigators also examined the before-after difference of differences in mean arrest frequency (2 years before versus 2 years after random assignment). Our sensitivity analysis of the four RCTs
with both outcome measures revealed substantially different conclusions
derived from examining the less sensitive participation rate and the more sensitive frequency rate. These results give pause to meta-analyses of crime prevention evaluations focusing only on standardized effect sizes that do not take into
account whether the outcome measure is a participation or a frequency rate and
whether the time period of follow-up varies widely across the studies included
in the sample.
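The divergence between the two measures is easy to see with invented arrest counts in which a few chronic offenders account for most of the offending:

```python
# Hypothetical arrest counts per offender in the year after random assignment.
treatment_arrests = [0, 0, 0, 0, 0, 0, 5, 9]   # two chronic offenders
control_arrests   = [1, 1, 1, 1, 0, 0, 0, 0]

def participation_rate(counts):
    """Share of offenders with at least one arrest (prevalence)."""
    return sum(c > 0 for c in counts) / len(counts)

def frequency_rate(counts):
    """Mean number of arrests per offender (incidence)."""
    return sum(counts) / len(counts)
```

On these counts the participation rate favors treatment (25% vs. 50% of offenders rearrested) while the frequency rate favors control (1.75 vs. 0.50 arrests per offender), so the two outcome measures would support opposite verdicts.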
On this last point, it is worth noting that the medical review also employed
quite heterogeneous time periods in the outcome measurement, defining short-term follow-up as any measure taken within 6 weeks of treatment and long-term
follow-up as any measure taken after 6 weeks, using the measure taken closest to
6 months.
ATTRITION

The problem of attrition is a two-headed monster in all experiments testing


treatments that are spread out over time. In both medicine and criminology,
these treatments run the risk of failure to complete the treatment regimen and
failure to measure the effects of whatever treatment was applied. The completion of treatment is defined in this article as a dosage matter, considered above as
the second dimension on this list. In experimental criminology and social science generally, it may be more useful to define attrition solely in terms of subjects lost to follow-up measurement after random assignment.
In this respect, criminology has the advantage over medicine. Arrest records
provide a wide surveillance net over criminal conduct, although no such surveillance system exists for many minor forms of morbidity such as back pain (in
contrast to, e.g., SARS or AIDS). In the medical review, the only measure of outcome was drawn from direct interaction between staff and patient. If the patient
withdrew or dropped out, no measurement could occur. In the criminological
review, however, whether offenders ever attended a restorative justice conference had no effect on whether their repeat offending could be measured, imperfectly but consistently and without treatment group bias, by their subsequent
arrest records.
The medical review used attrition as a quality score indicator but did not otherwise test for sensitivity of the findings to differences in attrition rates. In the
criminological review, there was no attrition measured in the follow-up period,
although no doubt some offenders died or left the police jurisdiction in which
the arrest records were examined. Absent evidence or theory that such attrition
from the surveillance system would be related to treatment group assignment,
we may conclude that this study characteristic is a strength of criminological
experiments focusing on official records. Experiments relying on self-reported
offending or victimization data, however, experience far greater problems with
attrition (Sherman, 1992).
SAMPLE POPULATION CHARACTERISTICS

Even when methodological characteristics of RCTs are identical, we may expect results to vary by the characteristics of
the sample population. Age, gender, race, education, community wealth or poverty, and many other characteristics could interact with the treatments being
tested. The only empirical way to assess the effects of these variables is to
repeatedly test the treatments (with exact replications of method) in different
populations. As Moffitt (2002) notes, social science RCTs rarely do this systematically. Medicine also appears relatively insensitive to the possibility of population differences, even in Cochrane Collaboration systematic reviews. That may,
in turn, have affected systematic reviews in social science, which also may be
aggregating results prematurely across widely diverse populations. The alternative model, followed by the restorative justice review analyzed above, is to stake
out the disaggregation of RCT results by population characteristics on theoretical grounds from the outset. This approach reduces the emphasis on estimates of
average treatment effects in a haphazard sample of diverse populations whose external validity is entirely unknown and which lacks theoretical coherence. In this respect, medicine has the advantage over criminology because
many medical conditions may have far less demographic variability in the first
place, let alone sensitivity to demographic differences.
The population characteristics in the medical example appeared to be relatively consistent across the 39 RCTs: middle-aged to older people suffering back
pain who sought out medical treatment from U.S. and European doctors affiliated with teaching-research hospitals. The population characteristics of the
criminological example varied widely by nation (three United States, four Australian), age (one 7 to 14, one 18 and older, one 14 to 29, four 14 to 17), offense
type (two just violence, two just property crimes with personal victims, one
drinking-driving, one shoplifting from large stores, one a mix of property and
violent crimes), and largest minority groups (four with Aboriginal Australians,
two with Hispanic Americans, one with African Americans). All of these differences were confounded to some extent with the differences in methodological
characteristics described above.
When both the medical and criminological review examples arrayed all of
the RCT trees into forest graphs, they reached the same general conclusion: The
intervention being tested does not work. More precisely, the forest graphs reveal
no pattern of positive effects, on average, across the tests. But that conclusion is
based solely on the premise that the definition of the forest is correct. In the medical example, there would be widespread agreement with the claim that the 39
Figure 3: Forest Graph of Randomized Controlled Trials in Restorative Justice: Effects on Repeat Offending
SOURCE: McCold and Wachtel (1998); McGarrell, Olivares, Crawford, and Kroovand (2000); Sherman, Strang, and Woods (2000).

RCTs tested a very similar treatment, with similar methods: most experimental
cases treated as intended, similar outcome measures, and minimal attrition.
Only the control group comparisons varied across the trials, and even those were
systematically disaggregated. The medical example provides the kind of well-defined forest, with sufficient homogeneity, needed to reach a high level of
probability that the verdict is correct: The treatment had no measured effect relative to any known comparison. But in the criminological example, the verdict
would be very different depending on whether the analysis were based on (a) all seven diverse RCTs taken together or (b) the results disaggregated in important ways.
Figure 3 shows the ITT effects of restorative justice on crime in the seven
completed RCTs with 1-year, after-only differences in percentage of offenders
with one or more arrests. The average effect across all seven tests is not significant at .05. However, the design and population differences across the seven
tests are so great that it is arguably inappropriate to reach a verdict about the
seven RCTs as a whole. For the trees with clear crime reduction effects, lumping
them into the average effect is a form of "guilt by association." As the Strang and
Sherman (2003) draft Campbell Review of these RCTs shows, there are at least
15 ways to disaggregate and rearrange the data on theoretical grounds that lead
to very different conclusions. With eight additional RCTs testing restorative justice still underway, the population of RCTs will more than double in the next 4
years. These new RCTs will add more trees to several of the most promising forests, such as the treatment of violent crimes in RCTs with more than 66% delivery of restorative justice to those cases randomly assigned to that treatment. The
new RCTs also will allow contrasts across populations defined in multivariate
terms, such as juvenile and adult property crimes versus juvenile and adult
violent crimes.
The procedural question for reaching a verdict on this program is whether
these diverse varieties of the same program should be defined as codefendants or
each given separate trials. In the forest-trees metaphor, the question is whether
they are all trees in one forest or whether they belong in up to seven different forests: single case studies rather than standardized tests. Do they form part of a
single distribution or are they better conceptualized as each one belonging to
more precisely defined distributions? It seems that the more conservative
approach would be to avoid any general statement about whether restorative justice "works" in favor of seven specific statements about where it has and
has not worked under very specific and restricted conditions.
Note that even such restricted statements can be misinterpreted, as Sir
Charles Pollard, an English public official who has promoted research and
development (including RCTs) on restorative justice, has pointed out. The mere
statement that restorative justice failed to show a crime reduction effect on shoplifting in one Australian test, he observes, has led many policy makers to conclude that it does not, and hence cannot, work for shoplifting. Taking the
point of the preceding paragraphs, however, would lead to just the opposite
inference: restorative justice has not yet shown a crime reduction effect for shoplifting, but there has only been one RCT focused on that question to date, and
that one result could be a chance outlier in a larger distribution.6 Earlier reviews,
such as Sherman et al. (1997), have failed to sufficiently emphasize the limitations of conclusions based on a single RCT, especially in the context of differences between RCTs and weaker forms of research design.
In that context, less systematic reviews of the evidence on restorative justice
have reached exactly the opposite conclusion from that suggested by Figure 3:
restorative justice, on average, does indeed work (Latimer, Dowden, & Muise,
2001; McCold & Wachtel, 2000). The integration of findings from RCT and
weaker research designs poses even greater difficulties in defining the nature of
the forest and profoundly reduces the precision of comparisons of results across
research designs. Even among RCTs, however, there is a question of what trees
to include or exclude, based on the extent to which randomly assigned treatments were actually applied in each case.

PROBLEM 2:
THEORY VERSUS POLICY
Perhaps the most controversial issue in the interpretation of all RCTs is
whether they test a theory about the effects of applying a treatment
or a policy of attempting to apply a treatment. The logic of random assignment
for reaching causal inference is greatly weakened by noncompliance with the
theory of a treatment, such as patients not taking pills or offenders not attending
restorative justice conferences. There is widespread consensus among statisticians of all stripes that there is an unacceptable risk of selection bias (and erroneous conclusions) from reporting analysis of only the cases that completed a
treatment in comparison to a control group randomly assigned to get no treatment (see Gorman, 2002), at least without any attempt at statistical adjustment.
There is far less consensus about whether a mathematically sophisticated, multivariate analysis can create a fairer comparison to the completed cases to reach a
conclusion that does not suffer from selection bias. On this latter point, medical
statisticians and econometricians differ widely about how best to deal with
divergence from random assignment.
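That selection-bias risk can be demonstrated with a small simulation built on two deliberately artificial assumptions: the treatment has no true effect, and low-risk subjects are more likely to complete it:

```python
import random

random.seed(42)
N = 20_000  # subjects per randomly assigned arm

def run_arm(treated):
    """Return (all outcomes, completer outcomes) for one arm."""
    everyone, completers = [], []
    for _ in range(N):
        risk = random.random()                 # baseline reoffending risk
        reoffends = random.random() < risk     # treatment has NO true effect
        everyone.append(reoffends)
        # Low-risk subjects are more likely to complete the treatment.
        if treated and random.random() < 1 - risk:
            completers.append(reoffends)
    return everyone, completers

assigned_treatment, completed_treatment = run_arm(treated=True)
assigned_control, _ = run_arm(treated=False)

mean = lambda xs: sum(xs) / len(xs)
itt_difference = mean(assigned_treatment) - mean(assigned_control)     # near zero
naive_difference = mean(completed_treatment) - mean(assigned_control)  # spurious
```

Here the ITT difference hovers near zero, as it should, while the completers-only comparison manufactures a sizable "effect" (roughly -0.17 in expectation under these assumptions) purely through self-selection.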
In essence, the medical view is that ITT analysis should be retained, without
adjustment, regardless of the extent of nondelivery of treatment, in order to estimate
the effects of a policy of offering such treatment (Peto et al., 1976; Piantadosi,
1997; but see Weinstein & Levin, 1989). The corollary medical (and often statistical) view is that the only valid way to test the theory of the intervention
per se is to reduce divergence from treatment after random assignment (Cox,
1958). The logical foundation of the medical argument for ITT is that the reasons for divergence from random assignment may be so confounded with outcomes that no extremes of mathematical adjustment or assumptions can patch
up the breakdown in random assignment.
The econometric view is that noncompliance can be modeled with instrumental variables (IVs) so that
the effects of TOT7 can be estimated with virtually the same degree of confidence as in an ITT analysis with no deviation from random assignment (Angrist
et al., 1996). An IV is a variable that is exogenous to the econometric model and unrelated to the causal processes affecting the outcome measure. The use of random
assignment as an instrument, it is argued, satisfies the test of exogeneity, thus
allowing the average effect of treatment on the treated and untreated to be estimated from the difference in repeat offending between the control group and the
untreated subset of the treatment group.
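In the simplest case of one-sided noncompliance, this instrumental-variable logic reduces to the Wald estimator: the ITT difference scaled up by the compliance rate. A sketch with hypothetical reoffending rates:

```python
# Hypothetical 1-year reoffending rates, measured on everyone as assigned.
rate_assigned_treatment = 0.30   # whole treatment group, treated or not
rate_assigned_control   = 0.40
compliance_rate         = 0.65   # share of the treatment group actually treated

itt_effect = rate_assigned_treatment - rate_assigned_control  # policy effect
tot_effect = itt_effect / compliance_rate                     # effect on the treated
```

The scaling is only as credible as its assumptions, chiefly that random assignment affects reoffending solely through actual receipt of the treatment.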
Most criminal sanctioning experiments share with long-term medical treatment (Peto et al., 1976) the character of allocating relatively small samples of
cases to a stream of sequential and conditional decisions rather than allocating large samples to a single treatment state, whether 1-day (Gerber & Green, 2000) or long-term but unchanging. Such treatment states as draft lottery number (Angrist
et al., 1996), welfare benefits offered (Moffitt, 2002), or eligibility for school
vouchers (Howell & Peterson, 2002) may be applied to thousands or millions of
people at once. When sample sizes are only in the hundreds, these ongoing interactions between offender and official behavior can create far too many subgroups within randomly assigned treatment conditions to reliably compute the
multivariate estimates of TOT effects. The stream of criminal sanctioning usually conditions treatments at one stage on prior offender behavior at another
stage (such as failure to appear in court), creating two or more subgroups of
varying sizes within randomly assigned treatment groups of 50 to 100 cases.
When the cumulative multiplication of subgroups reduces the cell size and
statistical power needed to create dummy variables for each of the numerous
treatment-received groups for an IV estimate, the assumptions needed for the
estimation procedure are violated and the procedure becomes inappropriate.
ITT estimates, however, remain appropriate as long as there is a reasonably high
correlation between random assignment and treatment actually applied.
More complex theories of sanctioning, such as restorative justice, create even
more contingencies in the application of a policy of using such sanctions and
may need far larger sample sizes than can be achieved at present in order to use
IV methods of estimating TOT effects. Yet, many of those complications can be
overcome in the design of future RCTs to have higher correlations between
assigned and delivered treatments. The long-term effect may be to make interpretation less complex and more elegantly clear to policy makers as RCTs
create closer matches between tests of policies and tests of crime reduction
theories.
The restorative justice review provides a prime example of the design versus
analysis solutions to this problem. In the Bethlehem, Pennsylvania, restorative
justice experiments, McCold and Wachtel (1998) randomly assigned restorative
justice to juvenile offenders before seeking consent from either the offenders' parents or the victims of the crimes. The combined refusal rates of these two essential players in a restorative justice treatment led to very low rates of treatment as
randomly assigned (see Figure 2). The authors did not use an IV model to analyze the effect of TOT, and it is not clear that a well-specified model could have
been developed with the relatively small sample size available. But if all future
RCTs of restorative justice postpone random assignment until agreement in
principle to meet has been firmly committed by all parties, then the treatment
implementation rates should be far higher and the analysis of the RJ policy
should be a closer approximation of RJ theory.8
THE BABY AND THE BATH WATER

Both econometricians and less sophisticated analysts of treatment-on-the-treated effects aspire to a worthy goal: not throwing out the baby with the bath
water. Premature verdicts that an innovation does not work can clearly extinguish future funding for research and development with that invention. RCTs
that fail to test the theory of the innovation may kill off further tests, even though
improvements in either research designs or treatment delivery systems could
produce very different results. The common flaw of crime prevention RCTs
reporting unadjusted TOT effects may be motivated by a reasonable desire to
promote an innovation that may well work in a revised delivery system (see, e.g.,
Latimer et al., 2001; McCold & Wachtel, 2000, on restorative justice effects for
victims). However, because unadjusted TOT estimates are entirely misleading
(Gorman, 2002), such practices threaten the enterprise of evaluation. The goal
of using evaluation for better inventions is far better served by strong science
that requires stricter exclusion of RCTs from systematic reviews based on
implementation rate than by systematic review procedures that include raw
TOT-effect RCT results.
This crucial distinction between reaching a verdict on policies and on theories can be illustrated with the first modern RCT in medical research
conducted by Sir Austin Bradford Hill (Streptomycin Tuberculosis Trials Committee, 1948), a medical statistician whose only degree, as it happened, was in
economics and who found an effective cure for the dreaded disease. (For example, in Figure 1 the fourth tree in the forest may just as likely be a "lightbulb" for a better treatment as a chance variation.) The illustration here is not the actual trial but a "what-if" hypothetical alternative scenario. Let us suppose that half of some
200 tuberculosis patients who were randomly assigned to receive streptomycin
had taken the drug in the form of a pill that they immediately regurgitated. This
could have resulted in a reduction of statistical power sufficient to obscure the
effect of streptomycin on survival rates, which was the theory being tested. Such
a result would have been a highly valid test of the policy of giving streptomycin to tuberculosis patients in pill form. But it would not have been a valid test of the theory that some means of administering the drug (whether intravenously, by injection, or mixed with a liquid) would reduce death rates. Three possible verdicts
could have been reached from the RCT result. One would have been to abandon
further tests of streptomycin, throwing out the baby with the bath water. A second would have been to employ an IV approach to the analysis of TOT, which
may or may not have reached the conclusion that the treatment worked. A third
possible verdict would have been to redesign the experiment with a different
policy on delivery systems, perhaps trial and error with each patient to find the
delivery system most likely to have the patient retain the drug.
The econometric analysis may have been of little use. An IV analysis of the
effect of treatment on the treated (those who took the pills and did not regurgitate them) may have lacked sufficient statistical power to detect modest but
lifesaving treatment effects on the treated. Weinstein and Levin (1989), for example, estimate that for an RCT with 400 patients in each treatment group and
10% mortality in the control group, a 25% rate of random assignment failure
reduces the statistical power of a .05 test from .77 to .57, well below conventional power requirements.
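The flavor of that calculation can be reproduced with a standard normal-approximation power formula. The sketch below assumes mortality halved from 10% to 5% with 400 patients per arm; it illustrates the direction and rough size of the power loss rather than Weinstein and Levin's exact figures, which rest on their own assumptions:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_power(p1, p2, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two proportions."""
    se = sqrt(p1 * (1 - p1) / n_per_arm + p2 * (1 - p2) / n_per_arm)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p1 - p2) / se - z_crit)

control_rate, treated_rate, n = 0.10, 0.05, 400
full_power = two_proportion_power(control_rate, treated_rate, n)

# If 25% of the treatment arm is effectively untreated, the observed
# treatment-arm rate is diluted toward the control rate.
diluted_rate = 0.75 * treated_rate + 0.25 * control_rate
diluted_power = two_proportion_power(control_rate, diluted_rate, n)
```

Under these assumptions the diluted trial loses a substantial fraction of its power, consistent with the qualitative point that noncompliance can push a well-powered test below conventional thresholds.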
The best solution to such a problem might be throwing out the "bath water" of the first RCT while putting the "baby" of streptomycin into a new drug delivery system that could achieve higher treatment-as-assigned rates. That solution would not only provide stronger causal inference and greater statistical
power but also would help reengineer the operating system that would have to be
rolled out on a large scale as policy if the theory of streptomycin curing TB was
supported by the second RCT. Although there is no question that the failure
of the initial RCT to test the theory would have been a major disappointment
and perhaps a political disaster, as a model for developing new programs in criminology it would have made an important point: refine and revise treatment delivery systems in Phase I, nonrandomized studies before beginning an RCT to ensure reasonably high take-up rates after random assignment in the Phase II RCT.9
The theory-versus-policy issue is even more important at the point of systematic reviews. Systematic reviewers may generally prefer to have more rather
than fewer RCTs to review given the enormous effort it takes to locate studies
meeting a long list of stringent criteria. It is therefore likely that they are biased
to err on the side of inclusion of studies that meet all other criteria except a
requirement that they meet some threshold of treatment implementation as randomly assigned. That is just what we did in our first draft of the Campbell Collaboration Review of restorative justice RCTs. But after presenting the data on
treatment failures across RCTs to a Campbell group, we were persuaded by the
discussion that it is better science to exclude weak tests of the theory, even if
there are fewer trees left with which to construct a forest graph. In the end, our
initial forest of seven RCTs was reduced to three that tested the theory of face-to-face meetings of victims, offenders, and their supporters, in which more than
two thirds of cases randomly assigned to such meetings actually experienced
one. That refinement left the analysis far less complex, but it also changed the
substantive conclusion: from restorative justice not working to prevent crime across all trials in Figure 3 to the program achieving substantial, nonchance
crime prevention benefits in two of the three cases, with the best results found in the two RCTs with higher rates of treatment as assigned.
The final argument for strict exclusion rules for judging theories (not policies) with RCTs is that only high treatment-as-assigned rates can reveal different treatment effects on different kinds of people. This may be the most daunting
challenge to interpreting RCTs in criminology, one that can only be met by the
hard work of completing many RCTs of the same treatment with different kinds
of people.
PROBLEM 3:
WHAT WORKS FOR WHOM?
The problem of differential response to treatment affects both internal and
external validity. The challenge for internal validity is to identify different parts
of the forest, in which program effects clearly differ from the effects found in
other parts of the forest. The challenge for external validity is to predict whether
a program that worked in one test also will work in a different community. Both
[Figure 4 here: bar chart; series "Arrest" and "Warning" by police action group, with bar heights of 503, 750, 599, and 523 on a 0 to 1,000 scale.]
Figure 4: Subgroup Effects of Arrest on Domestic Violence in the Milwaukee Domestic Violence Experiment
SOURCE: Sherman (1992).

challenges can be better met by the use of systematic reviews across RCTs as
well as producing more RCTs to review.
Internal validity. In the analysis of main effects of a treatment across a large
sample (Cox, 1958), there are numerous examples of two subgroups showing
exactly opposite reactions. The net effect of these intrasample differences may
be to show no overall difference between randomly assigned treatments. But
there is a strong case for viewing the random assignment of treatments within
different subgroups as separate and independent RCTs, as long as the subgroups
were identified on the basis of characteristics existing prior to the date of random
assignment; the subgroups did not differ substantially in treatment-as-assigned
rates or any other methodological characteristic, such as attrition in
follow-up measurement; and the theoretical basis for identifying the subgroup
was specified independent of a post hoc, data-dredging fishing expedition in
which the interaction effect was discovered by chance (preferably ex ante rather
than post hoc).
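The comparability criteria above can be sketched as a quick check. The counts, dictionary field names, and the simple two-proportion z-test below are illustrative assumptions for exposition, not part of the original experiments:

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """z-statistic for the difference between two proportions."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

def subgroups_comparable(stats_a, stats_b, z_crit=1.96):
    """Check two of the criteria in the text: the subgroups should not differ
    substantially in treatment-as-assigned rate or in follow-up retention.
    The dict keys are hypothetical field names, not from any original study."""
    for count_key in ("treated_as_assigned", "followed_up"):
        z = two_proportion_z(stats_a[count_key], stats_a["n_assigned"],
                             stats_b[count_key], stats_b["n_assigned"])
        if abs(z) > z_crit:
            return False
    return True

# Hypothetical employed vs. unemployed subgroups in one arrest experiment.
employed = {"n_assigned": 600, "treated_as_assigned": 570, "followed_up": 540}
unemployed = {"n_assigned": 400, "treated_as_assigned": 372, "followed_up": 356}
print(subgroups_comparable(employed, unemployed))  # comparable on both criteria
```

A full comparability audit would also inspect baseline covariates, but the same two-proportion logic extends to any pre-assignment characteristic.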
The case for subdividing large RCTs into smaller RCTs is driven by the enormous
benefits from learning what works for whom. Although "one size fits all"
may be the egalitarian goal for many societies, especially when it comes to criminal
punishment ("let the punishment fit the crime"), there is growing evidence
that equal treatment produces unequal results. In the domestic violence arrest
experiments, for example, the lack of main effects in Milwaukee, Omaha,
Miami, and Colorado Springs masked the fact that arrest had opposite effects for
employed and unemployed offenders (Berk, Campbell, Klap, & Western, 1992;
Pate & Hamilton, 1992; Sherman & Smith, 1992): Employed offenders were
less likely to reoffend if they were arrested than if they were warned, whereas
unemployed offenders were more likely to reoffend when they were arrested
than if they were warned (see Figure 4). The measures of unemployment in all
four RCTs covered the date of arrest, whereas the decision to test unemployment
was suggested in print 4 years prior to the launch of the RCTs (Sherman, 1984).
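A minimal sketch of how such a subgroup interaction can hide inside a null main effect, using hypothetical counts loosely patterned on the employment finding (not the actual experimental data):

```python
# Hypothetical reoffending counts: (reoffended, total) for each
# subgroup x police-action cell. Illustrative numbers only.
data = {
    ("employed", "arrest"): (30, 250),
    ("employed", "warning"): (45, 250),
    ("unemployed", "arrest"): (60, 150),
    ("unemployed", "warning"): (40, 150),
}

def arrest_effect(subgroup):
    """Risk difference: arrest reoffending rate minus warning reoffending rate.
    Negative means arrest reduced reoffending for that subgroup."""
    a_r, a_n = data[(subgroup, "arrest")]
    w_r, w_n = data[(subgroup, "warning")]
    return a_r / a_n - w_r / w_n

for g in ("employed", "unemployed"):
    print(g, round(arrest_effect(g), 3))

# Pooling the subgroups lets the opposite-signed effects largely cancel,
# which is how a null "main effect" can mask real subgroup differences.
pooled_arrest = (30 + 60) / (250 + 150)
pooled_warning = (45 + 40) / (250 + 150)
print("pooled", round(pooled_arrest - pooled_warning, 3))
```

With these made-up counts the employed effect is negative, the unemployed effect is positive, and the pooled difference is close to zero.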
In a recent medical example, federal Medicare reimbursement policy was
changed based on subgroup analysis of a large RCT, at the cost of $60,000 per
patient. Federal officials decided to pay for lung volume reduction surgery
after suspending payments for this recently invented approach due to the initial
absence of data on its effectiveness. To generate RCT data, the government only
paid for the surgery on patients who agreed to enroll in the National Emphysema
Treatment Trial (NETT) that began in 1996. This RCT assigned eligible patients
to surgery or nonsurgical treatment in large enough numbers to discover very
different effects on three subgroups. In one subgroup, the operation actually
shortened life; in a second group, it did not prolong life but did improve quality
of life; in the third group, about 25% of the sample, the operation improved both
quality of life and length of survival. The three groups were identifiable at the
time of random assignment on the basis of severity and within-lung location of
the disease, as well as on the basis of capacity to exercise. The new federal policy
will pay for the surgery for the subgroups found to benefit from it but not for
those whose predicted impact would be zero or negative (Grady, 2003).
However defensible such subgroup RCTs may be, they leave many readers
with a certain degree of doubt if separate random assignment sequences are not
used for each subgroup. More convincing evidence for such readers may be
RCTs that were explicitly designed to test for subgroup differences. This would
be far easier if research funding agencies were to view subgroup differences as
normal and acceptable, with a standard policy of funding follow-up RCTs that
would seek optimal treatments for each subgroup identified in the initial RCTs.
Had the National Institute of Justice followed up the 1992 findings that domestic
violence arrest effects varied by offender employment status with new RCTs,
we would know much more by now about preventing domestic violence. Nonarrest alternatives designed to protect victims without angering offenders might
have been developed and tested as an alternative to one-size-fits-all mandatory
arrest laws, which have only recently come under criticism from victims' advocate
groups (Sontag, 2002). Multiple tests of the same approaches could then
have led to systematic reviews of the effects of those approaches on unemployed
spouse assaulters as a diagnostically relevant group of offenders. Although
experimental criminology has never been able to proceed with such continuity
(Skogan & Frydl, 2003), the path described here is clearly feasible, and arguably
a more profitable investment than long-term debate over what the original RCTs
found. The same can be said about the debate in this volume over subgroup
effects of school vouchers on reading scores by race of family (Howell & Peterson, 2004 [this issue]; Krueger & Zhu, 2004 [this issue]). Fortunately, in our
own work, we have been able to reengineer the process of random assignment in


Figure 5: Percentage of Restorative Justice Conferences Treated as Randomly
Assigned in Current United Kingdom Experiments (within 90 days of random
assignment)
[Bar chart of eight experiments: LOB (n = 45), LOR (n = 17), NGP (n = 14), NAC (n = 7), NJO (n = 17), NJA (n = 15), TVC (n = 6), TVP (n = 26); conference-held rates range from 73% to 100%.]

the later RCTs further into the process of recruiting cases, obtaining higher rates
of treatment as intended (cf. Figures 2 and 5).
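The treatment-as-assigned rates plotted in Figures 2 and 5 are simple proportions of cases in which the randomly assigned conference actually occurred. A sketch with hypothetical counts for three of the Figure 5 site labels (the numbers are illustrative, not the published data):

```python
# Hypothetical case counts; "held" means the randomly assigned
# conference actually took place within 90 days.
cases = {
    "LOB": {"assigned": 45, "held": 45},
    "NJA": {"assigned": 15, "held": 13},
    "TVP": {"assigned": 26, "held": 19},
}

def treated_as_assigned_rate(site):
    """Proportion of randomly assigned conferences actually held."""
    c = cases[site]
    return c["held"] / c["assigned"]

for site in cases:
    print(site, f"{treated_as_assigned_rate(site):.0%}")
```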
EXTERNAL VALIDITY

Findings of differential effects within one community are only one kind of
variability in treatment effects. The central issue of external validity also requires testing for differential effects across communities. Findings such as the
employment-interaction effects that are replicated across very different cities
give some confidence that programs may have very similar effects despite contextual differences. But a few examples hardly constitute strong evidence. This
challenge to interpretation of RCTs in experimental criminology can be met by
strategic production of RCTs in very different types of communities, with combinations of characteristics that might be theoretically relevant to treatment
effects.
In the 1986 decision of the National Institute of Justice to fund six replications of the Minneapolis Domestic Violence Experiment (Sherman & Berk,
1984), every effort was made to spread the RCTs across different kinds of communities.
Miami offered high rates of immigration and a large Hispanic population,
as well as a high rate of marriage. Charlotte offered a large African
American community with booming employment, whereas Milwaukee offered an
African American community with concentrated poverty and hypersegregation;
Atlanta fell in between Charlotte and Milwaukee on those dimensions. Omaha
offered a majority White community, as did Colorado Springs, although the latter had a large military population and a substantial African American working
class.
These differences have never been systematically examined in a Campbell
Collaboration review. The likely result of such a review, however, is that the
overall community context made less difference than the characteristics of the
neighborhoods and individuals populating the RCT samples. To the extent that
these correlated with offender employment status, the evidence suggests that the
findings from any one site were externally valid to the other sites as long as
employment status subgroups were specified. With an accumulation of systematic reviews reaching such conclusions, it may be possible for experimental
criminology to show that what works where is not nearly as important as what
works for whom.
The eight restorative justice RCTs underway in the United Kingdom also
may reach such a conclusion, with four RCTs in one site and two in each of
two other sites. The sites vary widely in wealth and ethnicity: from White and
wealthy Thames Valley to White and poor parts of Northumbria to ethnically
diverse and economically mixed London. The content and sample populations
also vary across these sites, so there will be only limited capacity to examine
external validity in a cross-site analysis. But in a new round of RCTs that may
begin in 2004, it also may be possible to structure RCTs of identical restorative justice interventions across highly diverse communities. Whether similar
results can be found across nations, or from Common Law to Civil Law systems,
remains to be seen. But it is precisely the kind of science that the consortium of
national research agencies on crime and justice could foster, or at least discuss,
at their annual meetings.
THE LIGHTBULB QUESTION:
VERDICTS, INVENTIONS, OR BOTH?
Perhaps the most important challenge to interpretation of RCTs is what we
may call the lightbulb question: How can we identify the optimal version of
a program that has been implemented in widely varying ways? The answer
requires that experimenters and RCT reviewers know when to reverse their
logic: when to look for outliers rather than averages. For example, in Figure 1
the fourth tree in the forest may just as likely be a lightbulb for a better treatment as a chance variation. If we conceive of program development and testing
as a process of invention, and not just reaching verdicts, the goal becomes quite
different. Experimental criminologists, at least, should not only try to learn what
will probably happen under average circumstances but also what could happen
under above-average circumstances.
Thomas Edison was not interested in the average life of all previous versions
of the lightbulb, and the Wright brothers were not interested in the average
length of flight of all previous airplanes. Their goal, similar to most inventors,
was to refine previous efforts that had failed, in order to find ways to make them work.
Had they limited their work to systematic reviews of previous trials, they would
not have been led to such refinements. Indeed, they would have been bound by
prudent statistical practice to discard outliers in a sensitivity analysis, to see
whether the outliers affected the estimate of average effects (just as Assendelft
et al. did in their review of back pain therapy). Rather than discarding outliers,
however, the famous inventors embraced outliers and sought to replicate them,
redirecting their work to a new forest of trials driven by the best outlier of the
previous trials.
Thus, the meaning of the word trial takes on dramatically different connotations in different disciplines. In criminal law, a trial decides whether a defendant
is guilty. In medical and social statistics, a (randomized) trial determines the
average effects of one treatment compared to another. But in the engineering
language employed by most inventors of machines, a trial is simply one treatment of one case: one attempt to fly, one lightbulb left burning until it fails, one
attempt to make a smooth glaze on a china plate. In the latter example, Josiah
Wedgwood kept careful notes on more than 5,000 trials of glazing methods in
his pottery factory from 1759 to 1794 (Uglow, 2002, p. 53). In a more recent British
example, James Dyson, a vacuum cleaner inventor, tested more than 5,000 versions of a product that needs no bag for emptying the dirt it inhales (Eastman,
2003).
In our first four RCTs on restorative justice, we found clear benefits in only
one of the four trials. This could have led us to conclude that, on average, there
was no crime prevention effect to be gained from this approach. But we were
equally justified in seizing the large crime reduction effect in one trial, the only
one treating violent crime, and seeking opportunities to replicate it. The latter
course was far more consistent with the history of invention, although the former course may be more consistent with the history of modern program
evaluation.
The interpretation of an outlier is clearly subject to debate. Research and
development funding is scarce, and it is important to invest scarce funding
where it will have the greatest likelihood of a crime prevention yield. If an outlier is merely a chance result, as suggested by the overall distribution in which it
is found, then pursuing a replication of that outlier may well be a waste of
money. But this logic only holds up when the programs and methods have been
replicated exactly across all trials. Given the evidence presented above about
variability in content and research methods across criminological RCTs, it will
rarely be the case that exact replications suggest an outlier is due merely to
chance. More often, a close examination of the details of the successful outlier
may suggest ways to attempt a successful replication.
Even more challenging as a lightbulb problem is deciding when to continue
trials when no successful outliers have been found. The cost of mounting any
one RCT in crime prevention is far higher in researcher time than the cost of
altering one lightbulb design, or the cost of changing the glaze on a Wedgwood
plate. But the statistical odds of success may be no different, and the number of
trials needed to find success may still be the same. This gives scant comfort to
experimental criminologists but may demonstrate the need for more incentives
to support more trials of ways to prevent crime.
All research runs the risk of pursuing a dead end. In pharmaceuticals, the
odds against success are presumed to be overwhelming. The odds of any newly
tested cancer treatment working, for example, are only 1 in 50,000 (Oldham,
1987). Deciding to pursue any strategy in the face of disappointing test results
could always be the act of mad genius, simple foolishness, or raw determination.
In public policy, decisions to close off dead ends are properly driven by opportunity costs, in relation to what else might be tested. Such decisions are complicated, however, by emotional attachments to the programs by advocates or
clients of the program. There appears to be little rational basis for what crime
prevention ideas are funded, let alone pursued with further RCTs. Despite their
repeated failures, programs such as Scared Straight (Petrosino et al., 2003) and
D.A.R.E. (Gottfredson et al., 2002) continue to attract funding. Meanwhile, programs
with a very strong initial crime prevention effect, such as issuing arrest
warrants for domestic violence after suspects have left the scene (Dunford,
1990), go unreplicated and unimplemented. This topic goes well beyond the
scope of this article, but it hangs over experimental criminology like a thick fog.
We remain hopeful that better science can clear the air and perhaps provide some
guidelines for discussion.
It is tempting to suggest an arbitrary cutoff point for funding or testing a
crime prevention strategy, such as an accumulation of an RCT forest of 6 very
similar trees, 6 being the average number of RCTs in each of the 998 Cochrane
Reviews of medical evaluations completed to date (Mallett & Clarke, 2002). Yet
the question of just how similar the RCT trees are may be crucial. If they all varied as much as in our restorative justice example, one outlier may well be worth
pursuing. Thus, it may take several iterations of RCTs and forest graphs to reach
the point where 6 really similar trees would be available for examination.
What may be more prudent than using any arbitrary number of RCTs is to
examine the shape of the distribution of those effects rather than the average
effect. If one result stands out as very successful, there may still be great value in
qualitative examination of how that RCT differed from the other five. That kind
of "reconciliation" process (Moffitt, 2003) may identify the most "promising"
(Sherman et al., 1997) versions of the program for future testing, even though
there is a risk that not-so-promising strategies get pursued largely on the basis of
ideological appeal.
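One way to operationalize examining the shape of the distribution rather than the average is a crude outlier screen over the forest of effect sizes. This is a toy heuristic under assumed numbers, not a formal heterogeneity test (which would weight each trial by its variance):

```python
import statistics

def flag_outlier_trials(effects, threshold=1.5):
    """Flag trials whose effect size lies more than `threshold` standard
    deviations from the mean of the forest: candidates for qualitative
    follow-up rather than automatic discarding in a sensitivity analysis."""
    mean = statistics.mean(effects)
    sd = statistics.stdev(effects)
    return [i for i, e in enumerate(effects) if abs(e - mean) > threshold * sd]

# Six hypothetical effect sizes (negative = crime reduction): five near zero
# and one standout, echoing the one-successful-trial pattern in the text.
forest = [-0.02, 0.01, 0.03, -0.01, 0.00, -0.50]
print(flag_outlier_trials(forest))  # the sixth trial (index 5) stands out
```

The flagged trial is exactly the one an inventor would study qualitatively before deciding whether to replicate it.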


CONCLUSIONS:
BETTER LIGHTBULBS AND BETTER VERDICTS
To summarize, we suggest that RCTs of crime prevention programs may be
more accurately interpreted to the extent that they are systematically reviewed at
much greater levels of specificity rather than aggregated into very heterogeneous groups of test results; distinguish tests of theories (in which most cases are
treated as randomly assigned) from tests of policies only (in which there is great
discrepancy between treatment assigned and applied); and distinguish what
works for whom among different subgroups that can be identified at the point of
random assignment, and hence at the point of routine treatment.
We also suggest that RCTs in crime prevention be focused primarily on the
process of inventing and discovering more effective crime prevention strategies
and only secondarily on rendering independent verdicts about the effects of
established programs. This conclusion is difficult to resolve in principle but
works well in practice. For the practitioners and academics engaged in developing RCTs on crime prevention, it can help clarify the nature of their task and provide exemplars of experimental criminology for guidance.
The latter recommendation should be useful even if we accept an inevitable
tension between evaluation and invention. Depending on the results in each
case, RCTs may provide constructive feedback on improving programs that
work or destructive feedback on programs that do not. But the interpretation
of RCTs will be more nuanced if it recognizes the risk of mistaken verdicts.
Few people regret the verdict against selling thalidomide to pregnant women
(Sunday Times, 1979), but many would have regretted a termination of research
on polio vaccines after the initial disasters that killed human subjects (Smith,
1990). Modern medical research may err on the side of unsafe verdicts at the
expense of invention, but it does so within a highly inventive framework and a
culture of science. Experimental criminology faces greater barriers to both
invention and verdicts. Yet it, too, can aspire to achieve both clear verdicts that
help guide policy and creative discovery of better ways to prevent crime.
The Cochrane Collaboration in medicine has expanded both the scope and
the concept of a verdict on medical practices. The scope now applies to practices
that are not regulated by government agencies, such as exercise during pregnancy. The concept of a verdict now goes well beyond the irrevocability of a
courtroom trial. A court of law is required to consider all available evidence by
the time a jury is sent out to reach a verdict. A court is not required to keep
the trial going indefinitely while each side searches for more evidence. However,
the never-ending assessment of evidence is the hallmark of science, and
Cochrane Reviews are never final. New evidence can always be added and verdicts can be changed at any time. If that same spirit guides the Campbell Collaboration Crime and Justice Group, then the tension between verdicts and
inventions may be reduced.


The key to the success of RCTs for both verdicts and inventions may be ongoing attention to their political economy of practitioner and public support. The
producers of crime prevention evaluations have ignored the consumers of such
evaluations at the peril of their very future, which may render irrelevant fine distinctions about their methods. Continuing support for RCTs requires strong
partnerships between academics and crime prevention practitioners, who identify
themselves as co-inventors. The arm's-length, judicial dispassion of the independent
evaluator may have some uses in policy making. But the role of RCTs in
fostering inventions seems to require close and long-term collaborations in
which practitioners and academics act both as engineers designing the programs
and as scientists testing program effects. The collaborations with practitioners by experimental criminologists such as David Olds et al. (1998), Denise
Gottfredson and Exum (2002), Patricia Chamberlain, and David Weisburd
all show that this model is possible. The question for crime prevention policy
is how to foster an infrastructure to make such collaborations multiply and
prosper.
NOTES
1. This issue and most others discussed in this article apply just as much to nonrandomized cohort
studies as they do to randomized controlled trials (RCTs).
2. The authors are currently engaged in work on the latter example, in which similar (but inexactly replicated) RCTs we have conducted in Australia (Sherman, Strang, & Woods, 2000) are
underway in England (Sherman & Strang, 2003). Yet, the results of such tests, which have already
produced near-identical findings across four RCTs, two in each country, will constitute only one or
two cases in the much larger forest that would be needed to induce a theory of external validity of
crime prevention RCTs.
3. In a more general discussion of this problem in social policy evaluations, several Campbell
Collaboration commentators used the metaphor of a stew, in contrast to a program consisting of a
single ingredient (Cottingham, 2003).
4. This list differs from the list of 10 quality indicators used in the medical example on the basis
of a consensus within the Cochrane Collaboration Back Review Group for Spinal Disorders (van
Tulder, Assendelft, Koes, & Bouter, 1997), which Assendelft, Morton, Yu, Suttorp, and Shekelle
(2003, p. 872) used to rate the quality of the 39 RCTs they reviewed by giving equal weight to the
clear presence of each of the following: (a) adequate procedure for generating the random assignment sequence; (b) concealed randomization; (c) care provider blinded; (d) control for cointerventions; (e) co-interventions reported for each group separately; (f) patient blinded; (g) outcome assessor blinded; (h) less than 20% attrition for short-term follow-up, 30% for long-term, and
no substantial bias by treatment group defined as differences in attrition rates; (i) identical timing of
outcome assessment; and (j) intention-to-treat analysis. As Assendelft et al. point out, there is no
consensus on a gold standard for RCT quality in medicine. The factors most important to internal
validity of trials appear to vary by medical specialty, which suggests they will vary across social science disciplines and even programmatic areas within criminology, such as police patrol programs
versus sentencing and correctional programs. The five factors we select for emphasis here were
induced from the draft systematic Campbell reviews presented at the Smith Richardson Foundation
Test-Bed Conference on Systematic Reviews in St. Michaels, Maryland, in May 2003, and may or
may not have applicability beyond experimental criminology to such fields as educational programs,
teen pregnancy prevention, and social services.
5. The back therapy review also examined patient ability to function as measured by the Roland
Disability Questionnaire (RDQ) but the main focus of the analysis was on pain.
6. Moreover, one RCT shows clear crime reduction benefits in an after-only, 1-year arrest prevalence
measure, which is reflected in the "lowest common denominator" analysis in Figure 3 and would
be the interpretation used in many meta-analyses. Before-after analysis, however, reveals that the
randomly assigned treatment group had half the prevalence of arrest at baseline of the control group
and showed a far greater increase in prevalence rates after random assignment to restorative justice
than the control group. Before-after difference of differences in frequency of repeat offending also
failed to show any benefits of restorative justice. By this conservative definition, we conclude that
restorative justice did not reduce repeat offending among shoplifters in the Canberra RCT, but we would not
claim anything conclusive about the external validity of that conclusion to London, Indianapolis,
Indiana, or Toronto, Canada.
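The before-after logic of this note can be sketched with hypothetical arrest-prevalence proportions (not the Canberra data), built so the treatment group starts at half the control group's baseline:

```python
# Hypothetical proportions of cases with any arrest, before and after
# random assignment. Illustrative numbers only.
baseline = {"treatment": 0.20, "control": 0.40}  # before random assignment
followup = {"treatment": 0.35, "control": 0.45}  # after random assignment

# After-only comparison: the treatment group looks better.
after_only = followup["treatment"] - followup["control"]

# Before-after difference of differences: the treatment group actually
# increased more than the control group, so the apparent benefit vanishes.
change = {g: followup[g] - baseline[g] for g in baseline}
diff_of_diffs = change["treatment"] - change["control"]

print(round(after_only, 2), round(diff_of_diffs, 2))
```

The after-only contrast is negative (appearing protective) while the difference of differences is positive, which is why the conservative before-after definition reverses the verdict.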
7. Medical statisticians refer to this approach as TR, or treatment received (Piantadosi, 1997),
although the active role of humans in avoiding treatment may be better captured by the more neutral
concept of just plain treated without assuming anything about why treatment happened.
8. Even the authors best efforts to follow this advice in eight current RCTs on restorative justice
in the United Kingdom, however, reveals that the treatment itself is inherently prone to last-minute
changes in consent by some offenders and victims, although at far lower rates of treatment failure
than in the Bethlehem experiments and even the Canberra RCTs. The policy question about such
treatments may in fact be the most relevant theory question: What is the cost-effectiveness of getting
everyone concerned to agree to meet in a conference? Because the meeting failures remain rare and
unpredictable, there may be little practical value in knowing whether the theory is correct if officials
cannot know when it is possible to apply it.
9. This is what we failed to do separately for all four of the separate RCTs in Canberra, Australia,
(Sherman et al., 2000) but that we did relentlessly with our operational partners for almost a year with
eight subsequent RCTs in the United Kingdom. In the process, we decided to cancel plans for one
RCT that had too few cases and too high a last-minute cancellation rate.

REFERENCES
Angrist, J. D., Imbens, G. W., & Rubin, D. B. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91, 444-455.
Assendelft, W. J. J., Morton, S. C., Yu, E. I., Suttorp, M. J., & Shekelle, P. G. (2003). Spinal manipulative therapy for low back pain: A meta-analysis of effectiveness relative to other therapies.
Annals of Internal Medicine, 138, 871-881.
Attorney General's Task Force on Violent Crime. (1981). Report. Washington, DC: U.S. Department
of Justice.
Berk, R. A., Campbell, A., Klap, R., & Western, B. (1992). The deterrent effect of arrest in incidents
of domestic violence: A Bayesian analysis of four field experiments. American Sociological
Review, 57, 698-708.
Blumstein, A., Cohen, J., Roth, J. A., & Visher, C. A. (1986). Criminal careers and career criminals. Washington, DC: National Academies Press.
Butterfield, F. (1997, April 16). Most efforts to stop crime fall far short, study finds. New York Times, p. 1.
Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for research.
Chicago: Rand-McNally.
Chalmers, I. (2003). Trying to do more good than harm in policy and practice: The role of rigorous,
transparent, up-to-date evaluations. Annals of the American Academy of Political and Social Science, 589, 22-40.


Cottingham, P. (2003). Summing up the evidence from experiments. Westport, CT: Smith Richardson Foundation.
Cox, D. R. (1958). Planning of experiments. London: Wiley.
Dunford, F. (1990). System-initiated warrants for suspects of misdemeanor domestic assault: A pilot
study. Justice Quarterly, 7, 631-653.
Eastman, J. (2003, June 16). Sweeping changes: James Dyson, inventor of the cyclonic vacuum,
wants to revolutionize invention itself. Los Angeles Times, p. 1.
Federal Judicial Center. (1981). Experimentation in the law. Washington, DC: Federal Judicial
Center.
Fisher, R. A. (1935). The design of experiments. Edinburgh, Scotland: Oliver & Boyd.
Fonagy, P., Target, M., Cottrell, D., Phillips, J., & Kurtz, Z. (Eds.). (2002). What works for whom? A
critical review of treatments for children and adolescents. New York: Guilford.
Garner, J. H., & Visher, C. A. (2003). The production of criminological experiments. Evaluation
Review, 27, 316-335.
Gerber, A. S., & Green, D. P. (2000). The effects of canvassing, direct mail, and telephone contact on
voter turnout: A field experiment. American Political Science Review, 94, 653-663.
Glazerman, S., Levy, D. M., & Myers, D. (2003). Nonexperimental vs. experimental estimates of
earnings impacts. Annals of the American Academy of Political and Social Science, 589, 63-93.
Gorman, D. M. (2002). The science of drug and alcohol prevention: The case of the randomized
trial of the Life Skills Training program. International Journal of Drug Policy, 13, 21-26.
Gottfredson, D. C., & Exum, L. M. (2002). The Baltimore city drug treatment court: One-year results
from a randomized study. Journal of Research in Crime and Delinquency, 39, 337-356.
Gottfredson, D. C., Wilson, D. B., & Najaka, S. S. (2002). School-based crime prevention. In L. W.
Sherman, D. Farrington, B. C. Welsh, & D. L. MacKenzie (Eds.), Evidence-based crime prevention. London: Routledge.
Grady, D. (2003, August 21). Medicare to pay for major lung operation. New York Times, p. A22.
Harre, R. (1983). Great scientific experiments. Oxford, UK: Oxford University Press.
Hilts, P. J. (1986, September 20). Results of preliminary tests prompt AIDS drug's release. The
Washington Post, p. A1.
Howell, W., & Peterson, P. (2002). The education gap: Vouchers and urban schools. Washington,
DC: Brookings Institution.
Howell, W., & Peterson, P. (2004, this issue). Uses of theory in randomized field trials: Lessons from
school voucher research on disaggregation, missing data, and the generalization of findings.
American Behavioral Scientist.
Juni, P., Altman, D. G., & Egger, M. (2001). Systematic reviews in health care: Assessing the quality
of controlled clinical trials. British Medical Journal, 323, 42-46.
Krueger, A., & Zhu, P. (2004, this issue). Inefficiency, subsample selection bias, and nonrobustness:
A Response to Paul E. Peterson and G. William Howell. American Behavioral Scientist.
Latimer, J., Dowden, C., & Muise, D. (2001). The effectiveness of restorative justice practices: A
meta-analysis. Ottawa, Canada: Department of Justice, Research and Statistics Division Methodological Series.
Long, R., & Chase, J. (2003, August 19). Schools to target bad kids for prison: But law to scare
youths has critics. Chicago Tribune, C1.
Mallett, S., & Clarke, M. (2002). The typical Cochrane review. How many trials? How many participants? International Journal of Technology Assessment in Health Care, 18, 820-823.
Martinson, R. (1974). What works? Questions and answers about prison reform. The Public Interest,
35, 22-54.
McCold, P., & Wachtel, B. (1998). Restorative policing experiment: The Bethlehem, Pennsylvania,
Police Family Group Conferencing Project. Pipersville, PA: Community Service Foundation.
McCold, P., & Wachtel, T. (2000, October 1-4). Restorative justice theory validation. Paper presented to the 4th International Conference on Restorative Justice for Juveniles, University of
Tübingen, Germany, http://www.iirp.org/Pages/eforum001.html.


McGarrell, E. F., Olivares, K., Crawford, K., & Kroovand, N. (2000). Returning justice to the community: The Indianapolis Juvenile Restorative Justice Experiment. Indianapolis, IN: Hudson
Institute, Crime Control Policy Center.
Millenson, M. L. (1997). Demanding medical excellence: Doctors and accountability in the information age. Chicago: University of Chicago Press.
Moffitt, R. (2002, August 20). The role of randomized field trials in social science research: A perspective from evaluations of reforms of social welfare programs. Paper presented to the Conference on Randomized Experiments in the Social Sciences, Yale University Institution for Social
and Policy Studies.
Moffitt, R. (2003, September 4). Statement presented to the Workshop on Improving Evaluation of
Criminal Justice Programs, Committee on Law and Justice, National Research Council, Washington, DC, National Academy of Sciences.
Oldham, R. K. (1987). Patient-funded cancer research. New England Journal of Medicine, 316, 46-47.
Olds, D. L., Henderson, C. R., Cole, R., Eckenrode, J., Kitzman, H., Luckey, D., et al. (1998). Long-term effects of nurse home visitation on children's criminal and antisocial behavior: 15-year
follow-up of a randomized controlled trial. Journal of the American Medical Association, 280,
1238-1244.
Pate, A. M., & Hamilton, E. H. (1992). Formal and informal deterrents to domestic violence: The
Dade County spouse assault experiment. American Sociological Review, 57, 691-697.
Peto, R., Pike, M., Armitage, P., Breslow, N. E., Cox, D. R., Howard, S. V., et al. (1976). Design and
analysis of randomized clinical trials requiring prolonged observation of each patient: Part 1.
Introduction and design. British Journal of Cancer, 34, 585-612.
Petrosino, A., Turpin-Petrosino, C., & Buehler, J. (2003). Scared Straight and other juvenile awareness programs for preventing juvenile delinquency: A systematic review of randomized experimental evidence. Annals of the American Academy of Political and Social Science, 589, 41-62.
Petrosino, A., Turpin-Petrosino, C., & Finckenauer, J. O. (2000). Well-meaning programs can have
harmful effects! Lessons from Scared Straight and other like programs. Crime and Delinquency,
46, 354-379.
Piantadosi, S. (1997). Clinical trials: A methodologic perspective. London: Wiley.
President's Commission on Law Enforcement and Administration of Justice. (1967). The challenge
of crime in a free society. Washington, DC: Government Printing Office.
Sherman, L. W. (1984). Experiments in police discretion: Scientific boon or dangerous knowledge?
Law and Contemporary Problems, 47, 61-81.
Sherman, L. W. (1992). Policing domestic violence. New York: Free Press.
Sherman, L. W. (1993). Defiance, deterrence and irrelevance: A theory of the criminal sanction.
Journal of Research in Crime and Delinquency, 30, 445-473.
Sherman, L. W. (2003). Reason for emotion: Reinventing justice with theories, innovations and
research. Criminology, 41, 1-37.
Sherman, L. W., & Berk, R. A. (1984). The specific deterrent effects of arrest for domestic assault: A
field experiment. American Sociological Review, 49, 261-272.
Sherman, L. W., & Cohn, E. G. (1989). The impact of research on legal policy: The Minneapolis
Domestic Violence Experiment. Law and Society Review, 23, 117-144.
Sherman, L. W., Farrington, D. P., Welsh, B. C., & MacKenzie, D. L. (2002). Evidence-based crime
prevention. New York: Routledge.
Sherman, L. W., Gottfredson, D. C., MacKenzie, D. L., Eck, J., Reuter, P., & Bushway, S. (1997).
Preventing crime: What works, what doesn't, what's promising. Washington, DC: U.S. Department of Justice.
Sherman, L. W., & Smith, D. A. (1992). Crime, punishment and stake in conformity: Legal and informal control of domestic violence. American Sociological Review, 57, 680-690.
Sherman, L. W., & Strang, H. (2003, August). Smart justice and hard science: Testing restorative
justice. Paper presented to the 13th World Congress of Criminology, Rio de Janeiro.


Sherman, Strang / VERDICTS OR INVENTIONS?


Sherman, L. W., Strang, H., & Woods, D. (2000). Recidivism patterns in the Canberra Reintegrative
Shaming Experiments. Canberra: Australian National University, Research School of Social Sciences, Centre for Restorative Justice, www.aic.gov.au/rjustice/rise/index.html.
Skogan, W., & Frydl, K. (Eds.). (2003). Committee to review research on police policy and practices:
Fairness and effectiveness in policing. The evidence. Washington, DC: The National Academies
Press.
Smith, J. (1990). Patenting the sun: Polio and the Salk vaccine. New York: William Morrow.
Sontag, D. (2002, November 17). Fierce entanglements. New York Times Magazine, p. 52.
Strang, H., & Sherman, L. W. (2003). Effects of face-to-face restorative justice on repeat offending and victim satisfaction: A systematic review for the Campbell Crime and Justice Group. Philadelphia: Jerry Lee Center of Criminology, University of Pennsylvania, www.crim.upenn.edu/campbell.html
Streptomycin Tuberculosis Trials Committee. (1948). Streptomycin treatment of pulmonary tuberculosis: A Medical Research Council investigation. British Medical Journal, 2, 769-782.
Sunday Times. (1979). Suffer the children: The story of thalidomide. New York: Viking.
Uglow, J. (2002). The lunar men: Five friends whose curiosity changed the world. New York: Farrar,
Straus and Giroux.
van Tulder, M. W., Assendelft, W. J., Koes, B. W., & Bouter, L. M. (1997). Method guidelines for
systematic reviews in the Cochrane Collaboration Back Review Group for Spinal Disorders.
Spine, 22, 2323-2330.
Weinstein, G. S., & Levin, B. (1989). Effect of crossover on the statistical power of randomized studies. Annals of Thoracic Surgery, 48, 490-495.
Weisburd, D., Lum, C. M., & Petrosino, A. (2001). Does research design affect study outcomes in
criminal justice? Annals of the American Academy of Political and Social Science, 578, 50-70.
Weisburd, D., Lum, C. M., & Yang, S. -M. (2003). When can we conclude that treatments or programs dont work? Annals of the American Academy of Political and Social Science, 587, 31-48.
White, S. O., & Krislov, S. (1977). Understanding crime: An evaluation of the National Institute of Law Enforcement and Criminal Justice. Washington, DC: National Academy of Sciences.
Wilson, D. B., & Lipsey, M. W. (2000). Practical meta-analysis. Thousand Oaks, CA: Sage.
Yoshikawa, H. (1994). Prevention as cumulative protection: Effects of early family support and education on chronic delinquency and its risks. Psychological Bulletin, 115, 28-54.
Zigler, J. (2003, September 4). Statement presented to the Workshop on Improving Evaluation of
Criminal Justice Programs, Committee on Law and Justice, National Research Council, Washington, DC, National Academy of Sciences.
LAWRENCE W. SHERMAN is the Albert M. Greenfield Professor of Human Relations and
Chair of the Department of Criminology at the University of Pennsylvania. He has designed
and directed more than 25 randomized controlled trials of crime prevention strategies with
police, courts, and prisons in the United States, Australia, and the United Kingdom. His
recent work includes Reason for Emotion: Reinventing Justice With Theories, Innovations,
and Research (Criminology, February 2003) and Misleading Evidence and Evidence-Led
Policy: Making Social Science More Experimental (Annals of the AAPSS, September
2003).
HEATHER STRANG is the Director of the Centre for Restorative Justice at the Research
School of Social Sciences of the Australian National University. Since 1995, she has directed
12 randomized controlled trials in restorative justice in England and Australia, and she is
the primary reviewer of the Campbell Collaboration Systematic Review of Restorative Justice.
Her recent work includes Repair or Revenge: Victims and Restorative Justice (Oxford University
Press, 2002) and Restorative Justice and Civil Society (edited with John Braithwaite, Cambridge University Press, 2001).

