Statistical Practice
The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis

John M. Hoenig and Dennis M. Heisey
John M. Hoenig is Professor, Virginia Institute of Marine Science, College of William and Mary, Gloucester Point, VA 23062 (E-mail: hoenig@vims.edu). Dennis M. Heisey is Statistician, Department of Surgery and Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI 53792. Order of authorship determined by randomization. The authors thank Marilyn Lewis for research assistance and the anonymous reviewers for helpful comments. This is VIMS Contribution No. 2335.

It is well known that statistical power calculations can be valuable in planning an experiment. There is also a large literature advocating that power calculations be made whenever one performs a statistical test of a hypothesis and one obtains a statistically nonsignificant result. Advocates of such post-experiment power calculations claim the calculations should be used to aid in the interpretation of the experimental results. This approach, which appears in various forms, is fundamentally flawed. We document that the problem is extensive and present arguments to demonstrate the flaw in the logic.

KEY WORDS: Bioequivalence testing; Burden of proof; Observed power; Retrospective power analysis; Statistical power; Type II error.

1. INTRODUCTION

It is well known among applied scientists that a lack of impact or effect is not sufficiently established by a failure to demonstrate statistical significance. A failure to reject the null hypothesis of no effect may be the result of low statistical power when an important effect actually exists and the null hypothesis of no effect is in fact false. This can be called the dilemma of the nonrejected null hypothesis: what should we do when we fail to reject a hypothesis? Dismayingly, there is a large, current literature that advocates the inappropriate use of post-experiment power calculations as a guide to interpreting tests with statistically nonsignificant results. These ideas are held tenaciously in a variety of disciplines, as evidenced by methodological recommendations in 19 applied journals (Table 1). In our experience as consulting statisticians, authors are not infrequently required to perform such calculations by journal reviewers or editors; at least two journals ask for these calculations as a matter of policy (Anon. 1995; Anon. 1998). We emphasize that these calculations are sought primarily with the thought that they are useful for explaining the observed data, rather than for the purpose of planning some future experiment. We even found statistical textbooks that illustrate the flawed approach (e.g., Rosner 1990; Winer, Brown, and Michels 1991; Zar 1996). Researchers need to be made aware of the shortcomings of power calculations as data analytic tools and taught more appropriate methodology.

It is important to understand the motivation of applied scientists for using power analysis to interpret hypothesis tests with nonsignificant results. The traditional, widely accepted standard has been to protect the investigator from falsely concluding that some treatment has an effect when indeed it has none. However, there is increasing recognition that a reversal of the usual scientific burden of proof (e.g., Dayton 1998) is preferred in many areas of scientific inference. Areas where this is a particular concern include making decisions about environmental impacts, product safety, and public welfare, where some people want to be protected from failing to reject a null hypothesis of no impact when a serious (e.g., harmful or dangerous) effect exists. We believe that the post-hoc power approaches that have consequently arisen are due to applied scientists being heavily tradition-bound to test the usual no impact null hypothesis, despite it not always being the relevant null hypothesis for the question at hand.

We describe the flaws in trying to use power calculations for data-analytic purposes and suggest that statistics courses should have more emphasis on the investigator's choice of hypotheses and on the interpretation of confidence intervals. We also suggest that introducing the concept of equivalence testing may help students understand hypothesis tests. For pedagogical reasons, we have kept our explanations as simple as possible.

2. INAPPROPRIATE USES OF POWER ANALYSIS

2.1 Observed Power

There are two common applications of power analysis when a nonrejected null hypothesis occurs. The first is to compute the power of the test for the observed value of the test statistic. That is, assuming the observed treatment effects and variability are equal to the true parameter values, the probability of rejecting the null hypothesis is computed. This is sometimes referred to as observed power.

Table 1. Journals With Articles Advocating Post-Experiment Power Analysis

American Journal of Physical Anthropology: Hodges and Schell (1988)
American Naturalist: Toft and Shea (1983); Rotenberry and Wiens (1985)
*Animal Behavior: Thomas and Juanes (1996); Anon. (1998)
Aquaculture: Searcy-Bernal (1994)
Australian Journal of Marine and Freshwater Research: Fairweather (1991)
Behavioral Research Therapy: Hallahan and Rosenthal (1996)
Bulletin of the Ecological Society of America: Thomas and Krebs (1997)
Canadian Journal of Fisheries and Aquatic Sciences: Peterman (1989, 1990a)
Conservation Biology: Reed and Blaustein (1995, 1997); Hayes and Steidl (1997); Thomas (1997)
Ecology: Peterman (1990b)
Journal of Counseling Psychology: Fagley (1985)
*Journal of Wildlife Management: Anon. (1995); Steidl, Hayes, and Schauber (1997)
Marine Pollution Bulletin: Peterman and M'Gonigle (1992)
Neurotoxicology and Teratology: Muller and Benignus (1992)
Rehabilitation Psychology: McAweeney, Forchheimer, and Tate (1997)
Research in the Teaching of English: Daly and Hexamer (1983)
Science: Dayton (1998)
The Compass of Sigma Gamma Epsilon: Smith and Kuhnhenn (1983)
Veterinary Surgery: Markel (1991)

NOTE: * indicates journal requires or requests post-experiment power calculations when test results are nonsignificant.

Several widely distributed statistical software packages, such as SPSS, provide observed power in conjunction with data analyses (see Thomas and Krebs 1997). Advocates of observed power argue that there is evidence for the null hypothesis being true if statistical significance was not achieved despite the computed power being high at the observed effect size. (Usually, this is stated in terms of the evidence for the null hypothesis (no effect) being weak if observed power was low.)
Observed power can never fulfill the goals of its advocates because the observed significance level of a test (the p value) also determines the observed power; for any test the observed power is a 1:1 function of the p value. A p value is a random variable, $P$, on $[0, 1]$. We represent the cumulative distribution function (cdf) of the p value as $\Pr(P \le p) = G_\theta(p)$, where $\theta$ is the parameter value. Consider a one-sample $Z$ test of the hypothesis $H_0\colon \mu \le 0$ versus $H_a\colon \mu > 0$ when the data are from a normal distribution with known $\sigma$. Let $\theta = \sqrt{n}\,\mu/\sigma$. Then $G_\theta(p) = 1 - \Phi(Z_p - \theta)$, where $Z_p$ is the $100(1-p)$th percentile of the standard normal distribution (Hung, O'Neill, Bauer, and Kohne 1997). That is, $Z_p$ is the observed statistic. Both p values and observed power are obtained from $G_\theta(p)$. A p value is obtained by setting $\theta = 0$, so $G_0(p) = 1 - \Phi(Z_p) = p$. Observed power is obtained by setting the parameter to the observed statistic and finding the percentile for $P < \alpha$, so observed power is given by $G_{Z_p}(\alpha) = 1 - \Phi(Z_\alpha - Z_p)$, and thus the observed power is determined completely by the p value and therefore adds nothing to the interpretation of results.

An interesting special case occurs when $P = \alpha$; for the $Z$ test example above it is immediately obvious that observed power = .5 because $Z_\alpha = Z_p$. Thus, computing observed power can never lead to a statement such as "because the null hypothesis could not be rejected and the observed power was high, the data support the null hypothesis." Because of the one-to-one relationship between p values and observed power, nonsignificant p values always correspond to low observed powers (Figure 1). Computing the observed power after observing the p value should cause nothing to change about our interpretation of the p value. These results are easily extended to two-sided tests.

[Figure 1. Observed Power as a Function of the p Value for a One-Tailed $Z$ Test in Which $\alpha$ is Set to .05. When a test is marginally significant ($P = .05$), the estimated power is 50%.]
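The one-to-one mapping from p value to observed power is easy to make concrete. The following minimal Python sketch (not part of the original article; it assumes scipy is available, and the function name is illustrative) evaluates $G_{Z_p}(\alpha) = 1 - \Phi(Z_\alpha - Z_p)$ for the one-sided $Z$ test:

```python
# Minimal sketch: observed power as a deterministic function of the p value
# for the one-sided Z test of H0: mu <= 0 versus Ha: mu > 0 with known sigma.
from scipy.stats import norm

def observed_power(p, alpha=0.05):
    z_p = norm.ppf(1 - p)          # observed statistic implied by the p value
    z_alpha = norm.ppf(1 - alpha)  # critical value of the test
    return 1 - norm.cdf(z_alpha - z_p)  # G_{Z_p}(alpha)

for p in (0.01, 0.05, 0.10, 0.25, 0.50):
    print(f"p = {p:.2f} -> observed power = {observed_power(p):.3f}")
# p = .05 gives exactly .5; every nonsignificant p gives observed power < .5.
```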

There is a misconception about the relationship between observed power and p value in the applied literature which is likely to confuse nonstatisticians. Goodman and Berlin (1994), Steidl, Hayes, and Schauber (1997), Hayes and Steidl (1997), and Reed and Blaustein (1997) asserted without proof that observed power will always be less than .5 when the test result is nonsignificant. An intuitive counterexample is as follows. In a two-tailed $Z$ test, the test statistic has the value $Z = 1.96$ if the test is marginally significant at $\alpha = .05$. Therefore, the probability of observing a test statistic above 1.96, if the true mean of $Z$ is 1.96, is .5. The probability of rejecting the null hypothesis is the probability of getting a test statistic above 1.96 or below $-1.96$. Therefore, the probability is slightly larger than .5. In fact, it is rather easy to produce special examples of test statistics with skewed distributions that can produce arbitrarily high observed powers for $p = \alpha$.
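The borderline computation can be checked numerically. In this sketch (again not from the article; scipy assumed), the second tail contributes the small extra probability beyond .5:

```python
# Two-tailed observed power: power of the two-sided Z test evaluated at a
# true mean equal to the observed statistic z (sketch; assumes scipy).
from scipy.stats import norm

def observed_power_two_sided(z, alpha=0.05):
    z_crit = norm.ppf(1 - alpha / 2)  # 1.96 for alpha = .05
    return (1 - norm.cdf(z_crit - z)) + norm.cdf(-z_crit - z)

print(f"{observed_power_two_sided(1.96):.5f}")  # ~0.50004, slightly above .5
```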
A number of authors have noted that observed power may not be especially useful, but to our knowledge a fatal logical flaw has gone largely unnoticed. Consider two experiments that gave rise to nonrejected null hypotheses. Suppose the observed power was larger in the first experiment than the second. Advocates of observed power would interpret this to mean that the first experiment gives stronger support favoring the null hypothesis. Their logic is that if power is low one might have missed detecting a real departure from the null hypothesis but if, despite high power, one fails to reject the null hypothesis, then the null is probably true or close to true. This is easily shown to be nonsense. For example, consider the one-sided $Z$ test described above. Let $Z_{p_1}$ and $Z_{p_2}$ refer to the observed test statistics in the respective experiments. The observed power was highest in the first experiment, and we know this implies $Z_{p_1} > Z_{p_2}$ because observed power is $G_{Z_p}(\alpha)$, which is an increasing function of the $Z$ statistic. So by usual standards of using the p value as statistical evidence, the first experiment gives the stronger support against the null, contradicting the power interpretation. We will refer to this inappropriate interpretation as the power approach paradox (PAP): higher observed power does not imply stronger evidence for a null hypothesis that is not rejected.
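The paradox is visible in a two-line computation. In this illustrative sketch (hypothetical test statistics, scipy assumed), experiment 1 has both the smaller p value and the higher observed power, so the two evidence orderings collide:

```python
# PAP in numbers: two nonsignificant one-sided Z tests (sketch; scipy assumed).
from scipy.stats import norm

alpha = 0.05
z_alpha = norm.ppf(1 - alpha)
for label, z_obs in (("experiment 1", 1.50), ("experiment 2", 1.00)):
    p = 1 - norm.cdf(z_obs)                # evidence against the null
    power = 1 - norm.cdf(z_alpha - z_obs)  # observed power G_{Z_p}(alpha)
    print(f"{label}: p = {p:.3f}, observed power = {power:.3f}")
# experiment 1: p = .067, observed power = .442
# experiment 2: p = .159, observed power = .259
```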
2.2 Detectable Effect Size and Biologically Significant Effect Size

A second, perhaps more intriguing, application of post-experiment power calculations is finding the hypothetical true difference that would have resulted in a particular power, say .9. This is an attempt to determine the detectable effect size. It is applied as follows: an experiment is performed that fails to reject the null. Then, based on the observed variability, one computes what the effect size would have needed to have been to have a power of .9. Advocates of this approach view this detectable effect size as an upper bound on the true effect size; that is, because significance was not achieved, nature is unlikely to be near this state where power is high. The closer the detectable effect size is to the null hypothesis of 0, the stronger the evidence is taken to be for the null. For example, in a one-tailed $Z$ test of the hypothesis $H_0\colon \mu \le 0$ versus $H_a\colon \mu > 0$, one might observe a sample mean $\bar{X} = 1.4$ with $\sigma_{\bar{X}} = 1$. Thus, $Z = 1.4$ and $P = .08$, which is not significant at $\alpha = .05$. We note that if the true value of $\mu$ were 3.29 (and $\sigma_{\bar{X}}$ were 1) we would have power = .95 to reject $H_0$. Hence, 3.29 would be considered an upper bound on the likely value of the true mean. (Note that a 95% upper confidence bound on $\mu$ would be 3.04. We return to this point later.)
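The numbers in this example are easily reproduced. A short sketch (not from the article; scipy assumed) solves the one-sided power equation $\pi = 1 - \Phi(Z_\alpha - \delta/\sigma_{\bar{X}})$ for the detectable effect size $\delta$ and compares it with the upper confidence bound:

```python
# Detectable effect size for the one-tailed Z test example (sketch; scipy
# assumed): solve power = 1 - Phi(Z_alpha - delta/se) for delta.
from scipy.stats import norm

alpha, power = 0.05, 0.95
xbar, se = 1.4, 1.0                      # observed mean and its standard error

delta = (norm.ppf(1 - alpha) + norm.ppf(power)) * se
print(f"effect size detectable with power .95: {delta:.2f}")          # 3.29

upper_bound = xbar + norm.ppf(1 - alpha) * se
print(f"95% upper confidence bound on the mean: {upper_bound:.2f}")   # 3.04
```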
A variant of the detectable effect size approach is the biologically significant effect size approach, where one computes the power at some effect size deemed to be biologically important. The higher the computed power is for detecting meaningful departures from the null, the stronger the evidence is taken to be for nature to be near the null when the null is not rejected.

These inferential approaches have not been justified formally. Cohen (1988, p. 16) claimed that if you design a study to have high power $1 - \beta$ to detect a departure $\Delta$ from the null hypothesis, and you fail to reject the null hypothesis, then "the conclusion that the true parameter value lies within $\Delta$ units of the null value is significant at the $\beta$ level. Thus, in using the same logic as that with which we reject the null hypothesis with risk equal to $\alpha$, the null hypothesis can be accepted in preference to that which holds that ES [the effect size] $= \Delta$ with risk equal to $\beta$." (We have changed Cohen's notation in the above to conform to that used here.) Furthermore, Cohen stated (p. 16) "proof by statistical induction is probabilistic" without elaboration. He appeared to be making a probabilistic statement about the true value of the parameter, which is invalid in a classical statistical context. Furthermore, because his procedure chooses the sample size to have a specified, fixed power before conducting the experiment, his argument assumes that the actual power is equal to the intended power and, additionally, his procedure ignores the experimental evidence about effect size and sampling variability because the value of $\Delta$ is not updated according to the experimental results. Rotenberry and Wiens (1985) and Searcy-Bernal (1994) cited Cohen in justifying their interpretation of post-experiment computed power.

Although many find the detectable effect size and biologically significant effect size approaches more appealing than the observed power approach, these approaches also suffer from the fatal PAP. Consider the previous two experiments where the first was closer to significance; that is, $Z_{p_1} > Z_{p_2}$. Furthermore, suppose that we observed the same estimated effect size in both experiments and the sample sizes were the same in both. This implies $\hat{\sigma}_1 < \hat{\sigma}_2$. For some desired level of power $\pi$, one solves $\pi = 1 - \Phi(Z_\alpha - \sqrt{n}\,\delta/\hat{\sigma})$ for $\delta$ to obtain the desired detectable effect size, $\delta$. It follows that the computed detectable effect size will be smaller in the first experiment. And, for any conjectured effect size, the computed power will always be higher in the first experiment. These results lead to the nonsensical conclusion that the first experiment provides the stronger evidence for the null hypothesis (because the apparent power is higher but significant results were not obtained), in direct contradiction to the standard interpretation of the experimental results (p values).

Various suggestions have been made for improving post-experiment power analyses. Some have noted certain estimates of general effect sizes (e.g., noncentrality parameters) may be biased (Thomas 1997; Gerard, Smith, and Weerakkody 1998), which potentially could be corrected. Others have addressed the fact that the standard error used in power calculations is known imprecisely, and have suggested computing confidence intervals for post-experiment power estimates (Thomas 1997; Thomas and Krebs 1997). This is curious because, in order to evaluate a test result, one apparently needs to examine power but, in order to evaluate (test) if power is adequate, one does not consider the power of a test for adequate power. Rather, one switches the inferential framework to one based on confidence intervals.
These suggestions are superfluous in that they do nothing to correct the fundamental PAP.

3. POWER ANALYSIS VERSUS CONFIDENCE INTERVALS

From a pedagogic point of view, it is interesting to compare the inference one would obtain from consideration of confidence intervals to that obtained from the power analysis approach. Confidence intervals have at least two interpretations. One interpretation is based on the equivalence of confidence intervals and hypothesis tests. That is, if a confidence interval does not cover a hypothesized parameter value, then the value is refuted by the observed data. Conversely, all values covered by the confidence interval could not be rejected; we refer to these as the set of nonrefuted values. If the nonrefuted states are clustered tightly about a specific null value, one has confidence that nature is near the null value. If the nonrefuted states range widely from the null, one must obviously be cautious about interpreting the nonrejection as an indication of a near-null state. The more widely known interpretation is that confidence intervals cover the true value with some fixed level of probability. Using either interpretation, the breadth of the interval tells us how confident we can be of the true state of nature being close to the null.

Once we have constructed a confidence interval, power calculations yield no additional insights. It is pointless to perform power calculations for hypotheses outside of the confidence interval because the data have already told us that these are unlikely values. What about values inside the confidence interval? We already know that these are values that are not refuted by the data. It would be a mistake to conclude that the data refute any value within the confidence interval. However, there can be values within a 95% confidence interval that yield computed powers of nearly .975. Thus, it would be a mistake to interpret a value associated with high power as representing some type of upper bound on the plausible size of the true effect, at least in any straightforward sense. The proposition that computed power for effect sizes within a confidence interval can be very high can be demonstrated as follows. Consider the case where the random variable $\bar{X}$ has a normal distribution. We wish to test the null hypothesis that the mean is zero versus the alternative that it is not zero. A random sample of large size is taken which has a mean, $\bar{x}$, of 2 and a standard error of the mean of 1.0255. The upper critical region for a two-sided $Z$ test then corresponds to values of the mean greater than $1.96 \times 1.0255 = 2.01$. Therefore, we fail to reject the null hypothesis. A 95% confidence interval would be $(-.01, 4.01)$. We note that a value of 4 for the population mean is not refuted by the data. Now a post-hoc power calculation indicates the probability of rejecting the null hypothesis if the mean is actually 4 is $\Pr(|\bar{X}| > 2.01) = \Pr(Z > (2.01 - 4)/1.0255) + \Pr(Z < (-2.01 - 4)/1.0255)$, which is about .974. Thus, the power calculation suggests that a value of 4 for the mean is unlikely; otherwise we ought to have rejected the null hypothesis. This contradicts the standard theory of hypothesis tests.
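A sketch of the arithmetic (not from the article; scipy assumed, values taken from the example above) confirms that a parameter value inside the confidence interval can nonetheless carry computed power near .975:

```python
# Confidence interval versus post-hoc power for the two-sided Z test
# example with mean 2 and standard error 1.0255 (sketch; scipy assumed).
from scipy.stats import norm

xbar, se = 2.0, 1.0255
crit = norm.ppf(0.975) * se                       # 1.96 * 1.0255 = 2.01
print(f"95% CI: ({xbar - crit:.2f}, {xbar + crit:.2f})")   # (-0.01, 4.01)

mu = 4.0                                          # inside the interval
power = (1 - norm.cdf((crit - mu) / se)) + norm.cdf((-crit - mu) / se)
print(f"computed power at mu = 4: {power:.3f}")   # ~0.974
```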
4. EQUIVALENCE TESTING

Simply saying that an experiment demonstrates that a treatment is near-null because the confidence interval is narrow about the null value may seem unsatisfactorily seat-of-the-pants. However, this can be formulated as a rigorous test. Suppose that we are willing to conclude that a treatment is negligible if its absolute effect is no greater than some small positive value $\Delta$. Demonstrating such practical equivalence requires reversing the traditional burden of proof; it is not sufficient to simply fail to show a difference, one must be fairly certain that a large difference does not exist. Thus, in contrast to the traditional casting of the null hypothesis, the null hypothesis becomes that a treatment has a large effect, or $H_0\colon |D| \ge \Delta$, where $D$ is the actual treatment effect. The alternative hypothesis is the hypothesis of practical equivalence, or $H_A\colon |D| < \Delta$.

Schuirmann (1987) showed that if a $1 - 2\alpha$ confidence interval lies entirely between $-\Delta$ and $\Delta$, we can reject the null hypothesis of nonequivalence in favor of equivalence at the $\alpha$ level. The equivalence test is at the $\alpha$ level because it involves two one-tailed $\alpha$ level tests, which together describe a $1 - 2\alpha$ level confidence interval. This approach to equivalence testing is actually always a bit on the conservative side; the actual level $\alpha'$ for normally distributed data from a one-sample experiment with known $\sigma$ and nominal level $\alpha$ is $\alpha' = \alpha - 1 + \Phi(2\Delta\sqrt{n}/\sigma - Z_\alpha)$, which shows the conservatism will be slight in many practical applications where $2\Delta\sqrt{n}/\sigma$ substantially exceeds $Z_\alpha$. More powerful equivalence testing procedures exist (e.g., Berger and Hsu 1996), but for well-behaved problems with simple structures the simplicity of this approach seems to make it a compelling choice to recommend to the researcher involved in analysis (Hauck and Anderson 1996).
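Schuirmann's two one-sided tests procedure reduces to an interval-inclusion check. The sketch below (illustrative function names, scipy assumed) implements that check for the one-sample normal case with known $\sigma$, together with the actual level $\alpha'$ from the formula above:

```python
# Schuirmann's TOST for the one-sample normal case with known sigma
# (sketch; scipy assumed, function names illustrative).
from math import sqrt
from scipy.stats import norm

def tost_equivalence(dbar, sigma, n, delta, alpha=0.05):
    # Declare practical equivalence (|D| < delta) if the 1 - 2*alpha CI
    # for D lies entirely inside (-delta, delta).
    half_width = norm.ppf(1 - alpha) * sigma / sqrt(n)
    return (dbar - half_width > -delta) and (dbar + half_width < delta)

def actual_level(delta, sigma, n, alpha=0.05):
    # alpha' = alpha - 1 + Phi(2*delta*sqrt(n)/sigma - Z_alpha)
    return alpha - 1 + norm.cdf(2 * delta * sqrt(n) / sigma - norm.ppf(1 - alpha))

print(tost_equivalence(dbar=0.1, sigma=1.0, n=100, delta=0.3))   # True
print(f"{actual_level(delta=0.3, sigma=1.0, n=100):.4f}")        # ~0.0500
```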
Considering the power approach as a formal test in the above equivalence testing framework makes it clear why it is logically doomed. The power approach requires two outcomes before declaring equivalence, which are (1) the null hypothesis of no difference $H_0\colon D = 0$ cannot be rejected, and (2) some predetermined level of power must be achieved for $|D| = \Delta$. To achieve outcome 1, the absolute value of the observed test statistic must be less than $Z_{\alpha/2}$. This in turn implies that the observed absolute difference $|\bar{d}|$ must be less than $Z_{\alpha/2}\,\sigma/\sqrt{n}$. Thus, as $|D|$ becomes more precisely estimated by increasing $n$ or decreasing $\sigma$, the observed difference $|\bar{d}|$ must become progressively smaller if we want to demonstrate equivalence. This simply does not make sense: it should become easier, not more difficult, to conclude equivalence as $|D|$ becomes better characterized. Schuirmann (1987) noted that when viewed as a formal test of equivalence, the power approach results in a critical region that is essentially upside down from what a reasonable equivalence test should have.
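The upside-down behavior can be seen directly: under the power approach, the observed difference allowed for "equivalence" shrinks as the sample size grows. A brief sketch (scipy assumed):

```python
# Power-approach ceiling on |dbar| for a nonrejected H0: D = 0 (sketch;
# scipy assumed). Better-characterized |D| makes "equivalence" harder.
from math import sqrt
from scipy.stats import norm

sigma, alpha = 1.0, 0.05
for n in (10, 100, 1000, 10000):
    ceiling = norm.ppf(1 - alpha / 2) * sigma / sqrt(n)
    print(f"n = {n:>5}: |dbar| must stay below {ceiling:.4f}")
```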
5. DISCUSSION

Because of the prominence of post-hoc power calculations for data analysis in the literature, elementary statistics texts should devote some attention to explaining what should not be done. However, there is a larger lesson to be learned from the confusion about power analysis. We believe the central focus of good data analysis should be to find which parameter values are supported by the data and which are not. Perhaps unwittingly, advocates of post-hoc power analysis are seemingly grappling with exactly this question.

The reader with Bayesian inclinations would probably think "what foolishness"; the whole issue would be moot if people just focused on the sensible task of obtaining posterior distributions. Philosophically, we find this attractive as it avoids some nagging issues in frequentist statistics concerning p values and confidence intervals (e.g., Berry 1993; Freeman 1993; Schervish 1996; Goodman 1999a,b). But the real world of data analysis is for the most part solidly frequentist and will remain so into the foreseeable future. Within the limitations of the frequentist framework, it is important that analyses be as appropriate as possible.

Introductory statistics classes can focus on characterizing which parameter values are supported by the data by emphasizing confidence intervals more and placing less emphasis on hypothesis testing. One might argue that a rigorous understanding of confidence intervals requires a rigorous understanding of hypothesis testing and p values. We feel that researchers often do not need a rigorous understanding of confidence intervals to use them to good advantage. Although we cannot demonstrate it formally, we suspect that imperfectly understood confidence intervals are more useful and less dangerous than imperfectly understood p values and hypothesis tests. For example, it is surely prevalent that researchers interpret confidence intervals as if they were Bayesian credibility regions; to what extent does this lead to serious practical problems? The indirect logic of frequentist hypothesis testing is simply nonintuitive and hard for most people to understand (Berry 1993; Freeman 1993; Goodman 1999a,b). If informally motivated confidence intervals lead to better science than rigorously motivated hypothesis testing, then perhaps the rigor normally presented to students destined to be applied researchers can be sacrificed.

Of course, researchers must be exposed to hypothesis tests and p values in their statistical education if for no other reason than so they are able to read their literatures. However, more emphasis should be placed on general principles and less emphasis on mechanics. Typically, almost no attention is given to why a particular null hypothesis is chosen and there is virtually no consideration of other options. As Hauck and Anderson (1996) noted, both statisticians and nonstatisticians often test the wrong hypothesis because they are so conditioned to test null hypotheses of no difference. Statisticians need to be careful not to present statistical analysis as a rote process. Introductory statistics students frequently ask the question, why focus on protection against erroneously rejecting a true null of no difference? The stock answer is often something like "it is bad for science to conclude a difference exists when it does not." This is not sufficient. In matters of public health and regulation, it is often more important to be protected against erroneously concluding no difference exists when one does.

In any particular analysis, one needs to ask whether it is more appropriate to use the no difference null hypothesis rather than the nonequivalence null hypothesis. This is a question that regulators, researchers, and statisticians need to be asked and be asking constantly. We doubt whether many researchers are even aware that they have choices with respect to the null hypotheses they test and that the choices reflect where the burden of proof is placed.

We would not entirely rule out the use of power-type concepts in data analysis, but their application is extremely limited. One potential application might be to examine whether several experiments were similar, except for sample size; this might be an issue for example in meta-analyses (Hung, O'Neill, Bauer, and Kohne 1997). The goal here, examining homogeneity, differs from the usual motivations for post hoc power considerations.

Power calculations tell us how well we might be able to characterize nature in the future given a particular state and statistical study design, but they cannot use information in the data to tell us about the likely states of nature. With traditional frequentist statistics, this is best achieved with confidence intervals, appropriate choices of null hypotheses, and equivalence testing. Confusion about these issues could be reduced if introductory statistics classes for researchers placed more emphasis on these concepts and less emphasis on hypothesis testing.

[Received July 2000. Revised September 2000.]
REFERENCES

Anon. (1995), "Journal News," Journal of Wildlife Management, 59, 196-199.

Anon. (1998), "Instructions to Authors," Animal Behavior, 55, i-viii.

Berger, R. L., and Hsu, J. C. (1996), "Bioequivalence Trials, Intersection-Union Tests and Equivalence Confidence Sets," Statistical Science, 11, 283-319.

Berry, D. A. (1993), "A Case for Bayesianism in Clinical Trials," Statistics in Medicine, 12, 1377-1393.

Cohen, J. (1988), Statistical Power Analysis for the Behavioral Sciences (2nd ed.), Hillsdale, NJ: Lawrence Erlbaum Associates.

Daly, J. A., and Hexamer, A. (1983), "Statistical Power in Research in English Education," Research in the Teaching of English, 17, 157-164.

Dayton, P. K. (1998), "Reversal of the Burden of Proof in Fisheries Management," Science, 279, 821-822.

Fagley, N. S. (1985), "Applied Statistical Power Analysis and the Interpretation of Nonsignificant Results by Research Consumers," Journal of Counseling Psychology, 32, 391-396.

Fairweather, P. G. (1991), "Statistical Power and Design Requirements for Environmental Monitoring," Australian Journal of Marine and Freshwater Research, 42, 555-567.

Freeman, P. R. (1993), "The Role of P values in Analysing Trial Results," Statistics in Medicine, 12, 1443-1452.

Gerard, P. D., Smith, D. R., and Weerakkody, G. (1998), "Limits of Retrospective Power Analysis," Journal of Wildlife Management, 62, 801-807.

Goodman, S. N. (1999a), "Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy," Annals of Internal Medicine, 130, 995-1004.

Goodman, S. N. (1999b), "Toward Evidence-Based Medical Statistics. 2: The Bayes Factor," Annals of Internal Medicine, 130, 1005-1013.

Goodman, S. N., and Berlin, J. A. (1994), "The Use of Predicted Confidence Intervals When Planning Experiments and the Misuse of Power When Interpreting Results," Annals of Internal Medicine, 121, 200-206.

Hallahan, M., and Rosenthal, R. (1996), "Statistical Power: Concepts, Procedures, and Applications," Behavioral Research Therapy, 34, 489-499.

Hauck, W. W., and Anderson, S. (1996), Comment on "Bioequivalence Trials, Intersection-Union Tests and Equivalence Confidence Sets," Statistical Science, 11, 303-304.

Hayes, J. P., and Steidl, R. J. (1997), "Statistical Power Analysis and Amphibian Population Trends," Conservation Biology, 11, 273-275.

Hodges, D. C., and Schell, L. M. (1988), "Power Analysis in Biological Anthropology," American Journal of Physical Anthropology, 77, 175-181.

Hung, H. M. J., O'Neill, R. T., Bauer, P., and Kohne, K. (1997), "The Behavior of the P Value When the Alternative Hypothesis is True," Biometrics, 53, 11-22.

Markel, M. D. (1991), "The Power of a Statistical Test: What Does Insignificance Mean?," Veterinary Surgery, 20, 209-214.

McAweeney, M. J., Forchheimer, M., and Tate, D. G. (1997), "Improving Outcome Research in Rehabilitation Psychology: Some Methodological Recommendations," Rehabilitation Psychology, 42, 125-135.

Muller, K. E., and Benignus, V. A. (1992), "Increasing Scientific Power With Statistical Power," Neurotoxicology and Teratology, 14, 211-219.

Peterman, R. (1989), "Application of Statistical Power Analysis to the Oregon Coho Salmon (Oncorhynchus kisutch) Problem," Canadian Journal of Fisheries and Aquatic Sciences, 46, 1183.

Peterman, R. (1990a), "Statistical Power Analysis Can Improve Fisheries Research and Management," Canadian Journal of Fisheries and Aquatic Sciences, 47, 2-15.

Peterman, R. (1990b), "The Importance of Reporting Statistical Power: the Forest Decline and Acidic Deposition Example," Ecology, 71, 2024-2027.

Peterman, R., and M'Gonigle, M. (1992), "Statistical Power Analysis and the Precautionary Principle," Marine Pollution Bulletin, 24, 231-234.

Reed, J. M., and Blaustein, A. R. (1995), "Assessment of 'Nondeclining' Amphibian Populations Using Power Analysis," Conservation Biology, 9, 1299-1300.

Reed, J. M., and Blaustein, A. R. (1997), "Biologically Significant Population Declines and Statistical Power," Conservation Biology, 11, 281-282.

Rosner, B. (1990), Fundamentals of Biostatistics (3rd ed.), Boston: PWS-Kent Publishing.

Rotenberry, J. T., and Wiens, J. A. (1985), "Statistical Power Analysis and Community-Wide Patterns," American Naturalist, 125, 164-168.

Schervish, M. J. (1996), "P Values: What They Are and What They Are Not," The American Statistician, 50, 203-206.

Schuirmann, D. J. (1987), "A Comparison of the Two One-sided Tests Procedure and the Power Approach for Assessing the Equivalence of Average Bioavailability," Journal of Pharmacokinetics and Biopharmaceutics, 15, 657-680.

Searcy-Bernal, R. (1994), "Statistical Power and Aquacultural Research," Aquaculture, 127, 371-388.

Smith, A. D., and Kuhnhenn, G. L. (1983), "A Statistical Note on Power Analysis as Applied to Hypothesis Testing Among Selected Petrographic Point-Count Data," The Compass of Sigma Gamma Epsilon, 61, 22-30.

Steidl, R. J., Hayes, J. P., and Schauber, E. (1997), "Statistical Power Analysis in Wildlife Research," Journal of Wildlife Management, 61, 270-279.

Thomas, L. (1997), "Retrospective Power Analysis," Conservation Biology, 11, 276-280.

Thomas, L., and Juanes, F. (1996), "The Importance of Statistical Power Analysis: An Example from Animal Behaviour," Animal Behavior, 52, 856-859.

Thomas, L., and Krebs, C. J. (1997), "A Review of Statistical Power Analysis Software," Bulletin of the Ecological Society of America, 78, 126-139.

Toft, C. A., and Shea, P. J. (1983), "Detecting Community-wide Patterns: Estimating Power Strengthens Statistical Inference," American Naturalist, 122, 618-625.

Winer, B. J., Brown, D. R., and Michels, K. M. (1991), Statistical Principles in Experimental Design (3rd ed.), New York: McGraw-Hill.

Zar, J. H. (1996), Biostatistical Analysis (3rd ed.), Upper Saddle River, NJ: Prentice Hall.
