
Power analysis

Summary
Before you do an experiment, you should perform a power analysis to
estimate the number of observations you need to have a good chance of
detecting the effect you're looking for.

Introduction
When you are designing an experiment, it is a good idea to estimate
the sample size you'll need. This is especially true if you're proposing to
do something painful to humans or other vertebrates, where it is
particularly important to minimize the number of individuals (without
making the sample size so small that the whole experiment is a waste of
time and suffering), or if you're planning a very time-consuming or
expensive experiment. Methods have been developed for many statistical
tests to estimate the sample size needed to detect a particular effect, or to
estimate the size of the effect that can be detected with a particular
sample size.
In order to do a power analysis, you need to specify an effect size. This
is the size of the difference between your null hypothesis and the
alternative hypothesis that you hope to detect. For applied and clinical
biological research, there may be a very definite effect size that you want
to detect. For example, if you're testing a new dog shampoo, the
marketing department at your company may tell you that producing the
new shampoo would only be worthwhile if it made dogs' coats at least
25% shinier, on average. That would be your effect size, and you would
use it when deciding how many dogs you would need to put through the
canine reflectometer.
When doing basic biological research, you often don't know how big a
difference you're looking for, and the temptation may be to just use the
biggest sample size you can afford, or use a similar sample size to other
research in your field. You should still do a power analysis before you do
the experiment, just to get an idea of what kind of effects you could
detect. For example, some anti-vaccination kooks have proposed that the
U.S. government conduct a large study of unvaccinated and vaccinated
children to see whether vaccines cause autism. It is not clear what effect

size would be interesting: 10% more autism in one group? 50% more?
twice as much? However, doing a power analysis shows that even if the
study included every unvaccinated child in the United States aged 3 to 6,
and an equal number of vaccinated children, there would have to be 25%
more autism in one group in order to have a high chance of seeing a
significant difference. A more plausible study, of 5,000 unvaccinated and
5,000 vaccinated children, would detect a significant difference with
high power only if there were three times more autism in one group than
the other. Because it is unlikely that there is such a big difference in
autism between vaccinated and unvaccinated children, and because
failing to find a relationship with such a study would not convince anti-vaccination kooks that there was no relationship (nothing would convince them there's no relationship; that's what makes them kooks),
the power analysis tells you that such a large, expensive study would not
be worthwhile.

Parameters
There are four or five numbers involved in a power analysis. You must
choose the values for each one before you do the analysis. If you don't
have a good reason for using a particular value, you can try different
values and look at the effect on sample size.

Effect size
The effect size is the minimum deviation from the null hypothesis that
you hope to detect. For example, if you are treating hens with something
that you hope will change the sex ratio of their chicks, you might decide
that the minimum change in the proportion of sexes that you're looking
for is 10%. You would then say that your effect size is 10%. If you're
testing something to make the hens lay more eggs, the effect size might
be 2 eggs per month.
Occasionally, you'll have a good economic or clinical reason for
choosing a particular effect size. If you're testing a chicken feed
supplement that costs $1.50 per month, you're only interested in finding
out whether it will produce more than $1.50 worth of extra eggs each
month; knowing that a supplement produces an extra 0.1 egg a month is
not useful information to you, and you don't need to design your
experiment to find that out. But for most basic biological research, the
effect size is just a nice round number that you pulled out of your butt.

Let's say you're doing a power analysis for a study of a mutation in a


promoter region, to see if it affects gene expression. How big a change in
gene expression are you looking for: 10%? 20%? 50%? It's a pretty
arbitrary number, but it will have a huge effect on the number of
transgenic mice who will give their expensive little lives for your science.
If you don't have a good reason to look for a particular effect size, you
might as well admit that and draw a graph with sample size on the X-axis
and effect size on the Y-axis. G*Power will do this for you.

Alpha
Alpha is the significance level of the test (the P value), the probability
of rejecting the null hypothesis even though it is true (a false positive).
The usual value is alpha=0.05. Some power calculators use the one-tailed alpha, which is confusing, since the two-tailed alpha is much more common. Be sure you know which you're using.

Beta or power
Beta, in a power analysis, is the probability of accepting the null
hypothesis, even though it is false (a false negative), when the real
difference is equal to the minimum effect size. The power of a test is the
probability of rejecting the null hypothesis (getting a significant result)
when the real difference is equal to the minimum effect size. Power is
1-beta. There is no clear consensus on the value to use, so this is another
number you pull out of your butt; a power of 80% (equivalent to a beta of
20%) is probably the most common, while some people use 50% or 90%.
The cost to you of a false negative should influence your choice of power;
if you really, really want to be sure that you detect your effect size, you'll
want to use a higher value for power (lower beta), which will result in a
bigger sample size. Some power calculators ask you to enter beta, while
others ask for power (1-beta); be very sure you understand which you
need to use.

Standard deviation
For measurement variables, you also need an estimate of the standard
deviation. As standard deviation gets bigger, it gets harder to detect a
significant difference, so you'll need a bigger sample size. Your estimate
of the standard deviation can come from pilot experiments or from
similar experiments in the published literature. Your standard deviation
once you do the experiment is unlikely to be exactly the same, so your

experiment will actually be somewhat more or less powerful than you


had predicted.
For nominal variables, the standard deviation is determined by the sample size and the proportions themselves, so you don't need to estimate it separately.

How it works
The details of a power analysis are different for different statistical
tests, but the basic concepts are similar; here I'll use the exact binomial
test as an example. Imagine that you are studying wrist fractures, and
your null hypothesis is that half the people who break one wrist break
their right wrist, and half break their left. You decide that the minimum
effect size is 10%; if the percentage of people who break their right wrist
is 60% or more, or 40% or less, you want to have a significant result from
the exact binomial test. I have no idea why you picked 10%, but that's
what you'll use. Alpha is 5%, as usual. You want power to be 90%, which
means that if the percentage of broken right wrists really is 40% or 60%,
you want a sample size that will yield a significant (P<0.05) result 90%
of the time, and a non-significant result (which would be a false negative
in this case) only 10% of the time.

The first graph shows the probability distribution under the null
hypothesis, with a sample size of 50 individuals. If the null hypothesis is

true, you'll see less than 36% or more than 64% of people breaking their
right wrists (a false positive) about 5% of the time. As the second graph
shows, if the true percentage is 40%, the sample data will be less than 36% or more than 64% only 21% of the time; you'd get a true positive only
21% of the time, and a false negative 79% of the time. Obviously, a
sample size of 50 is too small for this experiment; it would only yield a
significant result 21% of the time, even if there's a 40:60 ratio of broken
right wrists to left wrists.

The next graph shows the probability distribution under the null
hypothesis, with a sample size of 270 individuals. In order to be
significant at the P<0.05 level, the observed result would have to be less
than 43.7% or more than 56.3% of people breaking their right wrists. As
the second graph shows, if the true percentage is 40%, the sample data
will be this extreme 90% of the time. A sample size of 270 is pretty good
for this experiment; it would yield a significant result 90% of the time if
there's a 40:60 ratio of broken right wrists to left wrists. If the ratio of
broken right to left wrists is further away from 50:50, you'll have an even
higher probability of getting a significant result.
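The numbers above can be checked with a short script. Below is a minimal sketch in base R (the helper function exact_binom_power is an illustrative name, not something from the handbook) that finds the rejection region of the exact binomial test under the null hypothesis and then asks how often a true proportion of 40% lands in that region.

exact_binom_power <- function(n, p_null = 0.5, p_true = 0.4, alpha = 0.05) {
  k <- 0:n
  # two-sided exact P value for every possible outcome under the null
  p_vals <- sapply(k, function(x) binom.test(x, n, p_null)$p.value)
  reject <- p_vals < alpha            # the rejection region
  # power = probability, under the true proportion, of landing in that region
  sum(dbinom(k[reject], n, p_true))
}

exact_binom_power(50)    # roughly 0.2, as described above
exact_binom_power(270)   # roughly 0.9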

Examples

You plan to cross peas that are heterozygotes for Yellow/green pea
color, where Yellow is dominant. The expected ratio in the offspring is 3
Yellow: 1 green. You want to know whether yellow peas are actually more
or less fit, which might show up as a different proportion of yellow peas
than expected. You arbitrarily decide that you want a sample size that
will detect a significant (P<0.05) difference if there are 3% more or fewer
yellow peas than expected, with a power of 90%. You will test the data
using the exact binomial test of goodness-of-fit if the sample size is small
enough, or a G-test of goodness-of-fit if the sample size is larger. The
power analysis is the same for both tests.
Using G*Power as described for the exact test of goodness-of-fit, the
result is that it would take 2109 pea plants if you want to get a significant
(P<0.05) result 90% of the time, if the true proportion of yellow peas is
78%, and 2271 peas if the true proportion is 72% yellow. Since you'd be
interested in a deviation in either direction, you use the larger number,
2271. That's a lot of peas, but you're reassured to see that it's not a
ridiculous number. If you want to detect a difference of 0.1% between the
expected and observed numbers of yellow peas, you can calculate that
you'll need 1,970,142 peas; if that's what you need to detect, the sample
size analysis tells you that you're going to have to include a pea-sorting
robot in your budget.
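If you prefer R to G*Power, a minimal sketch with the pwr package gives a similar answer. Note that pwr.p.test uses a normal approximation based on the arcsine-transformed effect size h rather than the exact binomial test, so the result is approximate, not identical.

library(pwr)

h <- abs(ES.h(0.72, 0.75))   # effect size for 72% yellow vs. the expected 75%
pwr.p.test(h = h, sig.level = 0.05, power = 0.90)
# n should come out near 2,270 plants, close to the 2,271 from the exact test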
The example data for the two-sample t-test shows that the average
height in the 2 p.m. section of Biological Data Analysis was 66.6 inches
and the average height in the 5 p.m. section was 64.6 inches, but the
difference is not significant (P=0.207). You want to know how many
students you'd have to sample to have an 80% chance of a difference this
large being significant. Using G*Power as described on the two-sample t-test page, enter 2.0 for the difference in means. Using the
STDEV function in Excel, calculate the standard deviation for each
sample in the original data; it is 4.8 for sample 1 and 3.6 for sample 2.
Enter 0.05 for alpha and 0.80 for power. The result is 72, meaning that if
5 p.m. students really were two inches shorter than 2 p.m. students,
you'd need 72 students in each class to detect a significant difference
80% of the time, if the true difference really is 2.0 inches.
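The same calculation can be sketched in R with the pwr package instead of G*Power; the pooled standard deviation and Cohen's d below are computed from the numbers quoted above, and the output may differ slightly from G*Power's.

library(pwr)

sd_pooled <- sqrt((4.8^2 + 3.6^2) / 2)   # pool the two standard deviations
d <- 2.0 / sd_pooled                     # Cohen's d for a 2.0-inch difference
pwr.t.test(d = d, sig.level = 0.05, power = 0.80, type = "two.sample")
# n per group comes out near 72, matching the G*Power result described above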

How to do power analyses

G*Power
G*Power is an excellent free program, available for Mac and
Windows, that will do power analyses for a large variety of tests. I will
explain how to use G*Power for power analyses for most of the tests in
this handbook.

R
Salvatore Mangiafico's R Companion has sample R programs to do
power analyses for many of the tests in this handbook; go to the page for
the individual test and scroll to the bottom for the power analysis
program.

SAS
SAS has a PROC POWER that you can use for power analyses. You
enter the needed parameters (which vary depending on the test) and
enter a period (which symbolizes missing data in SAS) for the parameter
you're solving for (usually ntotal, the total sample size,
or npergroup, the number of samples in each group). I find that
G*Power is easier to use than SAS for this purpose, so I don't
recommend using SAS for your power analyses.
Power Analysis

Overview
Power analysis is an important aspect of experimental design. It allows us to determine the sample size
required to detect an effect of a given size with a given degree of confidence. Conversely, it allows us
to determine the probability of detecting an effect of a given size with a given level of confidence,
under sample size constraints. If the probability is unacceptably low, we would be wise to alter or
abandon the experiment.
The following four quantities have an intimate relationship:

1. sample size
2. effect size
3. significance level = P(Type I error) = probability of finding an effect that is not there
4. power = 1 - P(Type II error) = probability of finding an effect that is there

Given any three, we can determine the fourth.

Power Analysis in R
The pwr package, developed by Stéphane Champely, implements power analysis as outlined by Cohen (1988). Some of the more important functions are listed below.
function          power calculations for
pwr.2p.test       two proportions (equal n)
pwr.2p2n.test     two proportions (unequal n)
pwr.anova.test    balanced one-way ANOVA
pwr.chisq.test    chi-square test
pwr.f2.test       general linear model
pwr.p.test        proportion (one sample)
pwr.r.test        correlation
pwr.t.test        t-tests (one sample, 2 sample, paired)
pwr.t2n.test      t-test (two samples with unequal n)
For each of these functions, you enter three of the four quantities (effect size, sample size,
significance level, power) and the fourth is calculated.
The significance level defaults to 0.05. Therefore, to calculate the significance level, given an effect
size, sample size, and power, use the option "sig.level=NULL".
Specifying an effect size can be a daunting task. ES formulas and Cohen's suggestions (based on social
science research) are provided below. Cohen's suggestions should only be seen as very rough guidelines.
Your own subject matter experience should be brought to bear.

t-tests
For t-tests, use the following functions:
pwr.t.test(n = , d = , sig.level = , power = , type = c("two.sample", "one.sample", "paired"))
where n is the sample size, d is the effect size, and type indicates a two-sample t-test, one-sample t-test, or paired t-test. If you have unequal sample sizes, use
pwr.t2n.test(n1 = , n2= , d = , sig.level =, power = )
where n1 and n2 are the sample sizes.
For t-tests, the effect size is assessed as

d = |mean1 - mean2| / sigma

where sigma is the pooled standard deviation.
Cohen suggests that d values of 0.2, 0.5, and 0.8 represent small, medium, and large effect sizes
respectively.
You can specify alternative="two.sided", "less", or "greater" to indicate a two-tailed or one-tailed test. A two-tailed test is the default.

ANOVA
For a one-way analysis of variance use
pwr.anova.test(k = , n = , f = , sig.level = , power = )
where k is the number of groups and n is the common sample size in each group.
For a one-way ANOVA, effect size is measured by f, where

f = sqrt( sum( p_i * (mu_i - mu)^2 ) ) / sigma

with p_i the proportion of observations in group i, mu_i the group means, mu the grand mean, and sigma the common within-group standard deviation.
Cohen suggests that f values of 0.1, 0.25, and 0.4 represent small, medium, and large effect sizes
respectively.

Correlations
For correlation coefficients use
pwr.r.test(n = , r = , sig.level = , power = )
where n is the sample size and r is the correlation. We use the population correlation coefficient as the
effect size measure. Cohen suggests that r values of 0.1, 0.3, and 0.5 represent small, medium, and
large effect sizes respectively.

Linear Models
For linear models (e.g., multiple regression) use
pwr.f2.test(u =, v = , f2 = , sig.level = , power = )
where u and v are the numerator and denominator degrees of freedom. We use f2 as the effect size measure, defined in one of two ways:

f2 = R2 / (1 - R2)

f2 = (R2_AB - R2_A) / (1 - R2_AB)

The first formula is appropriate when we are evaluating the impact of a set of predictors on an outcome (R2 is the squared multiple correlation). The second formula is appropriate when we are evaluating the impact of one set of predictors (B) above and beyond a second set of predictors or covariates (A), where R2_A is the variance accounted for by A alone and R2_AB is the variance accounted for by A and B together. Cohen suggests f2 values of 0.02, 0.15, and 0.35 represent small, medium, and large effect sizes.
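As a hedged illustration (the specific numbers here are made up for the example, not taken from the text), solving for the error degrees of freedom in a regression with three predictors and a medium effect looks like this:

library(pwr)

# three predictors (u = 3), medium effect (f2 = 0.15), alpha = 0.05, power = 0.80
pwr.f2.test(u = 3, f2 = 0.15, sig.level = 0.05, power = 0.80)
# the returned v is the error df; total sample size is approximately v + u + 1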

Tests of Proportions
When comparing two proportions use
pwr.2p.test(h = , n = , sig.level =, power = )
where h is the effect size and n is the common sample size in each group. The effect size h is defined as

h = 2*arcsin(sqrt(p1)) - 2*arcsin(sqrt(p2))
Cohen suggests that h values of 0.2, 0.5, and 0.8 represent small, medium, and large effect sizes
respectively.
For unequal n's use
pwr.2p2n.test(h = , n1 = , n2 = , sig.level = , power = )
To test a single proportion use
pwr.p.test(h = , n = , sig.level = , power = )
For both two-sample and one-sample proportion tests, you can specify alternative="two.sided", "less", or "greater" to indicate a two-tailed or one-tailed test. A two-tailed test is the default.

Chi-square Tests
For chi-square tests use

pwr.chisq.test(w =, N = , df = , sig.level =, power = )


where w is the effect size, N is the total sample size, and df is the degrees of freedom. The effect size
w is defined as

w = sqrt( sum( (p0_i - p1_i)^2 / p0_i ) )

where p0_i and p1_i are the cell probabilities under the null and alternative hypotheses, respectively.
Cohen suggests that w values of 0.1, 0.3, and 0.5 represent small, medium, and large effect sizes
respectively.
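For example (an illustrative choice of numbers, not from the text), the total sample size needed to detect a medium effect in a 2 x 2 table (df = 1) might be estimated as:

library(pwr)

pwr.chisq.test(w = 0.3, df = 1, sig.level = 0.05, power = 0.80)
# N is the total sample size across all cells (just over 87 for these inputs)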

Some Examples

library(pwr)

# For a one-way ANOVA comparing 5 groups, calculate the sample size
# needed in each group to obtain a power of 0.80, when the effect size
# is moderate (0.25) and a significance level of 0.05 is employed.

pwr.anova.test(k=5, f=.25, sig.level=.05, power=.8)

# What is the power of a one-tailed t-test, with a significance level
# of 0.01, 25 people in each group, and an effect size equal to 0.75?

pwr.t.test(n=25, d=0.75, sig.level=.01, alternative="greater")

# Using a two-tailed test of proportions, and assuming a significance
# level of 0.01 and a common sample size of 30 for each proportion,
# what effect size can be detected with a power of .75?

pwr.2p.test(n=30, sig.level=0.01, power=0.75)

Creating Power or Sample Size Plots


The functions in the pwr package can be used to generate power and sample size graphs.
# Plot sample size curves for detecting various correlations of
# various sizes.

library(pwr)

# range of correlations
r <- seq(.1, .5, .01)
nr <- length(r)

# power values
p <- seq(.4, .9, .1)
np <- length(p)

# obtain sample sizes
samsize <- array(numeric(nr*np), dim=c(nr, np))
for (i in 1:np){
  for (j in 1:nr){
    result <- pwr.r.test(n = NULL, r = r[j],
                         sig.level = .05, power = p[i],
                         alternative = "two.sided")
    samsize[j,i] <- ceiling(result$n)
  }
}

# set up graph
xrange <- range(r)
yrange <- round(range(samsize))
colors <- rainbow(length(p))
plot(xrange, yrange, type="n",
     xlab="Correlation Coefficient (r)",
     ylab="Sample Size (n)")

# add power curves
for (i in 1:np){
  lines(r, samsize[,i], type="l", lwd=2, col=colors[i])
}

# add annotation (grid lines, title, legend)
abline(v=0, h=seq(0, yrange[2], 50), lty=2, col="grey89")
abline(h=0, v=seq(xrange[1], xrange[2], .02), lty=2, col="grey89")
title("Sample Size Estimation for Correlation Studies\n Sig=0.05 (Two-tailed)")
legend("topright", title="Power", as.character(p), fill=colors)

Statistical Computing Seminars


Introduction to Power Analysis
Introduction
This is the first of a three-part series of seminars on power analysis. This seminar treats power and the various
factors that affect power on both a conceptual and a mechanical level. While we will not cover the formulas
needed to actually run a power analysis, later on we will discuss some of the software packages that can be
used to conduct power analyses. We encourage everyone to review the material in this seminar before moving

onto the intermediate power analysis seminar and advanced power analysis seminar, as those seminars build
on the material presented in this one.
OK, let's start off with a basic definition of what power is. Power is the probability of detecting an effect, given
that the effect is really there. In other words, it is the probability of rejecting the null hypothesis when it is in fact
false. For example, let's say that we have a simple study with drug A and a placebo group, and that the drug
truly is effective; the power is the probability of finding a difference between the two groups. So, imagine that
we had a power of .8 and that this simple study was conducted many times. Having power of .8 means that
80% of the time, we would get a statistically significant difference between the drug A and placebo groups.
This also means that 20% of the times that we run this experiment, we will not obtain a statistically significant
effect between the two groups, even though there really is an effect in reality.
There are several reasons why one might do a power analysis. Perhaps the most common use is to
determine the necessary number of subjects needed to detect an effect of a given size. Note that trying to find
the absolute, bare minimum number of subjects needed in the study is often not a good idea. Additionally,
power analysis can be used to determine power, given an effect size and the number of subjects available.
You might do this when you know, for example, that only 75 subjects are available (or that you only have the
budget for 75 subjects), and you want to know if you will have enough power to justify actually doing the study.
In most cases, there is really no point to conducting a study that is seriously underpowered. Besides the issue
of the number of necessary subjects, there are other good reasons for doing a power analysis. For example, a
power analysis is often required as part of a grant proposal. And finally, doing a power analysis is often just
part of doing good research. A power analysis is a good way of making sure that you have thought through
every aspect of the study and the statistical analysis before you start collecting data.
Despite these advantages of power analyses, there are some limitations. One limitation is that power analyses
do not typically generalize very well. If you change the methodology used to collect the data or change the
statistical procedure used to analyze the data, you will most likely have to redo the power analysis. In some
cases, a power analysis might suggest a number of subjects that is inadequate for the statistical procedure.
For example, a power analysis might suggest that you need 30 subjects for your logistic regression, but logistic
regression, like all maximum likelihood procedures, requires much larger sample sizes. Perhaps the most
important limitation is that a standard power analysis gives you a "best case scenario" estimate of the
necessary number of subjects needed to detect the effect. In most cases, this "best case scenario" is based on
assumptions and educated guesses. If any of these assumptions or guesses are incorrect, you may have less
power than you need to detect the effect. Finally, because power analyses are based on assumptions and
educated guesses, you often get a range of the number of subjects needed, not a precise number. For
example, if you do not know what the standard deviation of your outcome measure will be, you guess at this
value, run the power analysis and get X number of subjects. Then you guess a slightly larger value, rerun the
power analysis and get a slightly larger number of necessary subjects. You repeat this process over the
plausible range of values of the standard deviation, which gives you a range of the number of subjects that you
will need.
After all of this discussion of power analyses and the necessary number of subjects, we need to stress that
power is not the only consideration when determining the necessary sample size. For example, different
researchers might have different reasons for conducting a regression analysis. One might want to see if the
regression coefficient is different from zero, while the other wants to get a very precise estimate of the
regression coefficient with a very small confidence interval around it. This second purpose requires a larger

sample size than does merely seeing if the regression coefficient is different from zero. Another consideration
when determining the necessary sample size is the assumptions of the statistical procedure that is going to be
used. The number of statistical tests that you intend to conduct will also influence your necessary sample size:
the more tests that you want to run, the more subjects that you will need. You will also want to consider the
representativeness of the sample, which, of course, influences the generalizability of the results. Unless you
have a really sophisticated sampling plan, the greater the desired generalizability, the larger the necessary
sample size. Finally, please note that most of what is in this presentation does not readily apply to people who
are developing a sampling plan for a survey or psychometric analyses.

Definitions
Before we move on, let's make sure we are all using the same definitions. We have already defined power as
the probability of detecting a "true" effect, when the effect exists. Most recommendations for power fall
between .8 and .9. We have also been using the term "effect size", and while intuitively it is an easy concept,
there are lots of definitions and lots of formulas for calculating effect sizes. For example, the current APA
manual has a list of more than 15 effect sizes, and there are more than a few books mostly dedicated to the
calculation of effect sizes in various situations. For now, let's stick with one of the simplest definitions, which is
that an effect size is the difference of two group means divided by the pooled standard deviation. Going back
to our previous example, suppose the mean of the outcome variable for the drug A group was 10 and it was 5
for the placebo group. If the pooled standard deviation was 2.5, we would have an effect size equal to (10-5)/2.5 = 2 (which is a large effect size).
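In R, that arithmetic is just the following (a throwaway sketch using the numbers from the example above):

mean_drug <- 10; mean_placebo <- 5; sd_pooled <- 2.5
d <- (mean_drug - mean_placebo) / sd_pooled
d   # 2, a large effect size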
We also need to think about "statistically significant" versus "clinically relevant". This issue comes up often
when considering effect sizes. For example, for a given number of subjects, you might only need a small effect
size to have a power of .9. But that effect size might correspond to a difference between the drug and placebo
groups that isn't clinically meaningful, say reducing blood pressure by two points. So even though you would
have enough power, it still might not be worth doing the study, because the results would not be useful for
clinicians.
There are a few other definitions that we will need later in this seminar. A Type I error occurs when the null
hypothesis is true (in other words, there really is no effect), but you reject the null hypothesis. A Type II error
occurs when the alternative hypothesis is correct, but you fail to reject the null hypothesis (in other words, there
really is an effect, but you failed to detect it). Alpha inflation refers to the increase in the nominal alpha level
when the number of statistical tests conducted on a given data set is increased. The intermediate power
analysis seminar will discuss things like multiplicity, familywise and experimentwise alpha, so we will leave
those definitions for then.
When discussing statistical power, we have four inter-related concepts: power, effect size, sample size and
alpha. These four things are related such that each is a function of the other three. In other words, if three of
these values are fixed, the fourth is completely determined (Cohen, 1988, page 14). We mention this because,
by increasing one, you can decrease (or increase) another. For example, if you can increase your effect size,
you will need fewer subjects, given the same power and alpha level. Specifically, increasing the effect size, the
sample size and/or alpha will increase your power.
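A quick way to see this in practice is base R's power.t.test, which solves for whichever of the four quantities you leave unspecified (a minimal sketch with made-up inputs, not values from the seminar):

# leave n out to solve for the sample size per group
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)

# leave power out to solve for power at a fixed sample size
power.t.test(n = 64, delta = 0.5, sd = 1, sig.level = 0.05)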
While we are thinking about these related concepts and the effect of increasing things, let's take a quick look at
a standard power graph. (This graph was made in SPSS Sample Power, and for this example, we've used .61
and .4 for our two proportion positive values.)

We like these kinds of graphs because they make clear the diminishing returns you get for adding more and
more subjects. For example, let's say that we have only 10 subjects per group. We can see that we have a
power of about .15, which is really, really low. We add 50 subjects per group, now we have a power of about .6, an increase of .45. However, if we started with 100 subjects per group (power of about .8) and added 50 per
group, we would have a power of .95, an increase of only .15. So each additional subject gives you less
additional power. This curve also illustrates the "cost" of increasing your desired power from .8 to .9.
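A similar curve can be drawn in R with the pwr package. The sketch below assumes the same two proportion positive values (.61 and .4) used for the SPSS Sample Power graph, so the numbers should be close to, but not exactly, the ones quoted above.

library(pwr)

h <- ES.h(0.61, 0.40)                    # arcsine-transformed difference in proportions
n_per_group <- seq(10, 200, by = 10)
pwr_curve <- sapply(n_per_group,
                    function(n) pwr.2p.test(h = h, n = n, sig.level = 0.05)$power)
plot(n_per_group, pwr_curve, type = "b",
     xlab = "Subjects per group", ylab = "Power")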

Knowing your research project


As we mentioned before, one of the big benefits of doing a power analysis is making sure that you have
thought through every detail of your research project. Now most researchers have thought through most, if not
all, of the substantive issues involved in their research. While this is absolutely necessary, it often is not
sufficient. Researchers also need to carefully consider all aspects of the experimental design, the variables
involved, and the statistical analysis technique that will be used. As you will see in the next sections of this
presentation, a power analysis is the union of substantive knowledge (i.e., knowledge about the subject
matter), experimental or quasi-experimental design issues, and statistical analysis. Almost every aspect of the
experimental design can affect power. For example, the type of control group that is used or the number of
time points that are collected will affect how much power you have. So knowing about these issues and
carefully considering your options is important. There are plenty of excellent books that cover these issues in
detail, including Shadish, Cook and Campbell (2002), Cook and Campbell (1979), Campbell and Stanley
(1963), Brickman (2000a, 2000b), Campbell and Russo (2001), Campbell and Sechrest (2000) and Anderson
(2001).
Also, you want to know as much as possible about the statistical technique that you are going to use. If you
learn that you need to use a binary logistic regression because your outcome variable is 0/1, don't stop there;
rather, get a sample data set (there are plenty of sample data sets on our web site) and try it out. You may
discover that the statistical package that you use doesn't do the type of analysis that you need to do. For example,

if you are an SPSS user and you need to do a weighted multilevel logistic regression, you will quickly discover
that SPSS doesn't do that (as of version 23), and you will have to find (and probably learn) another statistical
package that will do that analysis. Maybe you want to learn another statistical package, or maybe that is
beyond what you want to do for this project. If you are writing a grant proposal, maybe you will want to include
funds for purchasing the new program. You will also want to learn what the assumptions are and what the
"quirks" are with this particular type of analysis. Remember that the number of necessary subjects given to you
by a power analysis assumes that all of the assumptions of the analysis have been met, so knowing what those
assumptions are is important for deciding whether they are likely to be met or not.
The point of this section is to make clear that knowing your research project involves many things, and you may
find that you need to do some research about experimental design or statistical techniques before you do your
power analysis. We want to emphasize that this is time and effort well spent. We also want to remind you that
for almost all researchers, this is a normal part of doing good research. UCLA researchers are welcome and
encouraged to come by walk-in consulting at this stage of the research process to discuss issues and ideas,
check out books and try out software.

What you need to know to do a power analysis


In the previous section, we discussed in general terms what you need to know to do a power analysis. In this
section we will discuss some of the actual quantities that you need to know to do a power analysis for some
simple statistics. Although we understand very few researchers test their main hypothesis with a t-test or a chi-square test, our point here is only to give you a flavor of the types of things that you will need to know (or guess
at) in order to be ready for a power analysis.
- For an independent samples t-test, you will need to know the population means of the two groups (or the
difference between the means), and the population standard deviations of the two groups. So, using our
example of drug A and placebo, we would need to know the difference in the means of the two groups, as well
as the standard deviation for each group (because the group means and standard deviations are the best
estimate that we have of those population values). Clearly, if we knew all of this, we wouldn't need to conduct
the study. In reality, researchers make educated guesses at these values. We always recommend that you
use several different values, such as decreasing the difference in the means and increasing the standard
deviations, so that you get a range of values for the number of necessary subjects.
In SPSS Sample Power, we would have a screen that looks like the one below, and we would fill in the
necessary values. As we can see, we would need a total of 70 subjects (35 per group) to have a power of 91%
if we had a mean of 5 and a standard deviation of 2.5 in the drug A group, and a mean of 3 and a standard
deviation of 2.5 in the placebo group. If we decreased the difference in the means and increased the standard
deviations such that for the drug A group, we had a mean of 4.5 and a standard deviation of 3, and for the
placebo group a mean of 3.5 and a standard deviation of 3, we would need 190 subjects per group, or a total of
380 subjects, to have a power of 90%. In other words, seemingly small differences in means and standard
deviations can have a huge effect on the number of subjects required.
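The same two scenarios can be checked in R with power.t.test (a sketch, not the SPSS Sample Power screen the text describes); note that each call uses a single common standard deviation for both groups.

# drug A: mean 5, placebo: mean 3, common SD 2.5, power .91
power.t.test(delta = 5 - 3, sd = 2.5, sig.level = 0.05, power = 0.91)
# roughly 35 per group, about 70 subjects in total

# drug A: mean 4.5, placebo: mean 3.5, common SD 3, power .90
power.t.test(delta = 4.5 - 3.5, sd = 3, sig.level = 0.05, power = 0.90)
# roughly 190 per group, about 380 subjects in total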

- For a correlation, you need to know/guess at the correlation in the population. This is a good time to
remember back to an early stats class where they emphasized that correlation is a large N procedure (Chen
and Popovich, 2002). If you guess that the population correlation is .6, a power analysis would suggest (with
an alpha of .05 and for a power of .8) that you would need only 16 subjects. There are several points to be
made here. First, common sense suggests that N = 16 is pretty low. Second, a population correlation of .6 is
pretty high, especially in the social sciences. Third, the power analysis assumes that all of the assumptions of
the correlation have been met. For example, we are assuming that there is no restriction of range issue, which
is common with Likert scales; the sample data for both variables are normally distributed; the relationship
between the two variables is linear; and there are no serious outliers. Also, whereas you might be able to say
that the sample correlation does not equal zero, you likely will not have a very precise estimate of the
population correlation coefficient.

- For a chi-square test, you will need to know the proportion positive for both populations (i.e., rows and
columns). Let's assume that we will have a 2 x 2 chi-square, and let's think of both variables as 0/1. Let's say
that we wanted to know if there was a relationship between drug group (drug A/placebo) and improved health.
In SPSS Sample Power, you would see a screen like this.

In order to get the .60 and the .30, we would need to know (or guess at) the number of people whose health
improved in both the drug A and placebo groups. We would also need to know (or guess at) either the number
of people whose health did not improve in those two groups, or the total number of people in each group.

                     Improved health (positive)   Not improved health   Row total
Drug A (positive)    33 (33/55 = .6)               22                    55
Placebo              17 (17/55 = .3)               38                    55
Column total         50                            60                    Grand Total = 110
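Treating the same comparison as a test of two proportions, a hedged R sketch with the pwr package (not the SPSS Sample Power screen shown in the seminar) looks like this:

library(pwr)

# .60 improved in the drug A group vs. .30 in the placebo group
pwr.2p.test(h = ES.h(0.60, 0.30), sig.level = 0.05, power = 0.80)
# n is per group; roughly 42 per group under these assumptions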

- For an ordinary least squares regression, you would need to know things like the R2 for the full and reduced
model. For a simple logistic regression analysis with only one continuous predictor variable, you would need to
know the probability of a positive outcome (i.e., the probability that the outcome equals 1) at the mean of the
predictor variable and the probability of a positive outcome at one standard deviation above the mean of the
predictor variable. Especially for the various types of logistic models (e.g., binary, ordinal and multinomial), you
will need to think very carefully about your sample size, and information from a power analysis will only be part
of your considerations. For example, according to Long (1997, pages 53-54), 100 is a minimum sample size
for logistic regression, and you want *at least* 10 observations per predictor. This does not mean that if you
have only one predictor you need only 10 observations. Also, if you have categorical predictors, you may need
to have more observations to avoid computational difficulties caused by empty cells or cells with few
observations. More observations are needed when the outcome variable is very lopsided; in other words, when
there are very few 1s and lots of 0s, or vice versa. These cautions emphasize the need to know your data set
well, so that you know if your outcome variable is lopsided or if you are likely to have a problem with empty
cells.
The point of this section is to give you a sense of the level of detail about your variables that you need to be
able to estimate in order to do a power analysis. Also, when doing power analyses for regression models,
power programs will start to ask for values that most researchers are not used to providing. Guessing at the mean and standard deviation of your response variable is one thing, but an increment to R2 is a metric in which
few researchers are used to thinking. In our next section we will discuss how you can guestimate these
numbers.

Obtaining the necessary numbers to do a power analysis


There are at least three ways to guestimate the values that are needed to do a power analysis: a literature
review, a pilot study and using Cohen's recommendations. We will review the pros and cons of each of these
methods. For this discussion, we will focus on finding the effect size, as that is often the most difficult number
to obtain and often has the strongest impact on power.
Literature review: Sometimes you can find one or more published studies that are similar enough to yours that
you can get an idea of the effect size. If you can find several such studies, you might be able to use meta-analysis techniques to get a robust estimate of the effect size. However, oftentimes there are no studies similar enough to your study to get a good estimate of the effect size. Even if you can find such a study, the

necessary effect sizes or other values are often not clearly stated in the article and need to be calculated (if
they can) based on the information provided.
Pilot studies: There are lots of good reasons to do a pilot study prior to conducting the actual study. From a
power analysis perspective, a pilot study can give you a rough estimate of the effect size, as well as a rough
estimate of the variability in your measures. You can also get some idea about where missing data might
occur, and as we will discuss later, how you handle missing data can greatly affect your power. Other benefits
of a pilot study include allowing you to identify coding problems, setting up the database, and inputting the data
for a practice analysis. This will allow you to determine if the data are input in the correct shape, etc. (please
listen to our podcast # for more information on this).
Of course, there are some limitations to the information that you can get from a pilot study. (Many of these
limitations apply to small samples in general.) First of all, when estimating effect sizes based on nonsignificant
results, the effect size estimate will necessarily have an increased error; in other words, the standard error of
the effect size estimate will be larger than when the result is significant. The effect size estimate that you
obtain may be unduly influenced by some peculiarity of the small sample. Also, you often cannot get a good
idea of the degree of missingness and attrition that will be seen in the real study. Despite these limitations, we
strongly encourage researchers to conduct a pilot study. The opportunity to identify and correct "bugs" before
collecting the real data is often invaluable. Also, because of the number of values that need to be guestimated
in a power analysis, the precision of any one of these values is not that important. If you can estimate the
effect size to within 10% or 20% of the true value, that is probably sufficient for you to conduct a meaningful
power analysis, and such fluctuations can be taken into account during the power analysis.
Cohen's recommendations: Jacob Cohen has many well-known publications regarding issues of power and
power analyses, including some recommendations about effect sizes that you can use when doing your power
analysis. Many researchers (including Cohen) consider the use of such recommendations as a last resort,
when a thorough literature review has failed to reveal any useful numbers and a pilot study is either not
possible or not feasible. From Cohen (1988, pages 24-27):
- Small effect: 1% of the variance; d = 0.25 (too small to detect other than statistically; lower limit of what is
clinically relevant)
- Medium effect: 6% of the variance; d = 0.5 (apparent with careful observation)
- Large effect: at least 15% of the variance; d = 0.8 (apparent with a superficial glance; unlikely to be the focus
of research because it is too obvious)
Lipsey and Wilson (1993) did a meta-analysis of 302 meta-analyses of over 10,000 studies and found that the
average effect size was .5, adding support to Cohen's recommendation that, as a last resort, guess that the
effect size is .5 (cited in Bausell and Li, 2002). Sedlmeier and Gigerenzer (1989) found that the average effect
size for articles in The Journal of Abnormal Psychology was a medium effect. According to Keppel and
Wickens (2004), when you really have no idea what the effect size is, go with the smallest effect size of
practical value. In other words, you need to know how small of a difference is meaningful to you. Keep in mind
that research suggests that most researchers are overly optimistic about the effect sizes in their research, and
that most research studies are underpowered (Keppel and Wickens, 2004; Tversky and Kahneman, 1971).
This is part of the reason why we stress that a power analysis gives you a lower limit to the number of
necessary subjects.

Factors that affect power


From the preceding discussion, you might be starting to think that the number of subjects and the effect size
are the most important factors, or even the only factors, that affect power. Although effect size is often the
largest contributor to power, saying it is the only important issue is far from the truth. There are at least a
dozen other factors that can influence the power of a study, and many of these factors should be considered
not only from the perspective of doing a power analysis, but also as part of doing good research. The first
couple of factors that we will discuss are more "mechanical" ways of increasing power (e.g., alpha level,
sample size and effect size). After that, the discussion will turn to more methodological issues that affect
power.
1. Alpha level: One obvious way to increase your power is to increase your alpha (from .05 to say, .1).
Whereas this might be an advisable strategy when doing a pilot study, increasing your alpha usually is not a
viable option. We should point out here that many researchers are starting to prefer to use .01 as an alpha
level instead of .05 as a crude attempt to assure results are clinically relevant; this alpha reduction reduces
power.
1a. One- versus two-tailed tests: In some cases, you can test your hypothesis with a one-tailed test. For
example, if your hypothesis was that drug A is better than the placebo, then you could use a one-tailed test.
However, you would fail to detect a difference, even if it was a large difference, if the placebo was better than
drug A. The advantage of one-tailed tests is that they put all of your power "on one side" to test your
hypothesis. The disadvantage is that you cannot detect differences that are in the opposite direction of your
hypothesis. Moreover, many grant and journal reviews frown on the use of one-tailed tests, believing it is a way
to feign significance (Stratton and Neil, 2004).
2. Sample size: A second obvious way to increase power is simply collect data on more subjects. In some
situations, though, the subjects are difficult to get or extremely costly to run. For example, you may have
access to only 20 autistic children or only have enough funding to interview 30 cancer survivors. If possible,
you might try increasing the number of subjects in groups that do not have these restrictions, for example, if
you are comparing to a group of normal controls. While it is true that, in general, it is often desirable to have
roughly the same number of subjects in each group, this is not absolutely necessary. However, you get
diminishing returns for additional subjects in the control group: adding an extra 100 subjects to the control
group might not be much more helpful than adding 10 extra subjects to the control group.
3. Effect size: Another obvious way to increase your power is to increase the effect size. Of course, this is
often easier said than done. A common way of increasing the effect size is to increase the experimental
manipulation. Going back to our example of drug A and placebo, increasing the experimental manipulation
might mean increasing the dose of the drug. While this might be a realistic option more often than increasing
your alpha level, there are still plenty of times when you cannot do this. Perhaps the human subjects
committee will not allow it, it does not make sense clinically, or it doesn't allow you to generalize your results
the way you want to. Many of the other issues discussed below indirectly increase effect size by providing a
stronger research design or a more powerful statistical analysis.
4. Experimental task: Well, maybe you can not increase the experimental manipulation, but perhaps you can
change the experimental task, if there is one. If a variety of tasks have been used in your research area,
consider which of these tasks provides the most power (compared to other important issues, such as

relevancy, participant discomfort, and the like). However, if various tasks have not been reviewed in your field,
designing a more sensitive task might be beyond the scope of your research project.
5. Response variable: How you measure your response variable(s) is just as important as what task you have
the subject perform. When thinking about power, you want to use a measure that is as high in sensitivity and
low in measurement error as is possible. Researchers in the social sciences often have a variety of measures
from which they can choose, while researchers in other fields may not. For example, there are numerous
established measures of anxiety, IQ, attitudes, etc. Even if there are not established measures, you still have
some choice. Do you want to use a Likert scale, and if so, how many points should it have? Modifications to
procedures can also help reduce measurement error. For example, you want to make sure that each subject
knows exactly what he or she is supposed to be rating. Oral instructions need to be clear, and items on
questionnaires need to be unambiguous to all respondents. When possible, use direct instead of indirect
measures. For example, asking people what tax bracket they are in is a more direct way of determining their
annual income than asking them about the square footage of their house. Again, this point may be more
applicable to those in the social sciences than those in other areas of research. We should also note that
minimizing the measurement error in your predictor variables will also help increase your power.
Just as an aside, most texts on experimental design strongly suggest collecting more than one measure of the
response in which you are interested. While this is very good methodologically and provides marked benefits
for certain analyses and missing data, it does complicate the power analysis. Theintermediate power analysis
seminar will discuss this issue.
6. Experimental design: Another thing to consider is that some types of experimental designs are more
powerful than others. For example, repeated measures designs are virtually always more powerful than
designs in which you only get measurements at one time. If you are already using a repeated measures
design, increasing the number of time points a response variable is collected to at least four or five will also
provide increased power over fewer data collections. There is a point of diminishing return when a researcher
collects too many time points, though this depends on many factors such as the response variable, statistical
design, age of participants, etc.
7. Groups: Another point to consider is the number and types of groups that you are using. Reducing the
number of experimental conditions will reduce the number of subjects that is needed, or you can keep the
same number of subjects and just have more per group. When thinking about which groups to exclude from
the design, you might want to leave out those in the middle and keep the groups with the more extreme
manipulations. Going back to our drug A example, let's say that we were originally thinking about having a total
of four groups: the first group will be our placebo group, the second group would get a small dose of drug A, the
third group a medium dose, and the fourth group a large dose. Clearly, much more power is needed to detect
an effect between the medium and large dose groups than to detect an effect between the large dose group
and the placebo group. If we found that we were unable to increase the power enough such that we were likely
to find an effect between small and medium dose groups or between the medium and the large dose groups,
then it would probably make more sense to run the study without these groups. In some cases, you may even
be able to change your comparison group to something more extreme. For example, we once had a client who
was designing a study to compare people with clinical levels of anxiety to a group that had subclinical levels of
anxiety. However, while doing the power analysis and realizing how many subjects she would need to detect
the effect, she found that she needed far fewer subjects if she compared the group with the clinical levels of
anxiety to a group of "normal" people (a number of subjects she could reasonably obtain).

8. Statistical procedure: Changing the type of statistical analysis may also help increase power, especially
when some of the assumptions of the test are violated. For example, as Maxwell and Delaney (2004) noted,
"Even when ANOVA is robust, it may not provide the most powerful test available when its assumptions have
been violated." In particular, violations of assumptions regarding independence, normality and heterogeneity
can reduce power. In such cases, nonparametric alternatives may be more powerful.
9. Statistical model: You can also modify the statistical model. For example, interactions often require more
power than main effects. Hence, you might find that you have reasonable power for a main effects model, but
not enough power when the model includes interactions. Many (perhaps most?) power analysis programs do
not have an option to include interaction terms when describing the proposed analysis, so you need to keep
this in mind when using these programs to help you determine how many subjects will be needed. When
thinking about the statistical model, you might want to consider using covariates or blocking variables. Ideally,
both covariates and blocking variables reduce the variability in the response variable. However, it can be
challenging to find such variables. Moreover, your statistical model should use as many of the response
variable time points as possible when examining longitudinal data. Using a change-score analysis when one
has collected five time points makes little sense and ignores the added power from these additional time
points. The more the statistical model "knows" about how a person changes over time, the more variance that
can be pulled out of the error term and ascribed to an effect.
9a. Correlation between time points: Understanding the expected correlation between a response variable
measured at one time in your study with the same response variable measured at another time can provide
important and power-saving information. As noted previously, when the statistical model has a certain amount
of information regarding the manner by which people change over time, it can enhance the effect size
estimate. This is largely dependent on the correlation of the response measure over time. For example, in a
before-after data collection scenario, response variables with a .00 correlation from before the treatment to
after the treatment would provide no extra benefit to the statistical model, as we can't better understand a
subject's score by knowing how he or she changes over time. Rarely, however, do variables have a .00
correlation on the same outcomes measured at different times. As discussed in the intermediate power
analysis seminar, it is important to know that outcome variables with larger correlations over time provide
enhanced power when used in a complementary statistical model.
10. Modify response variable: Besides modifying your statistical model, you might also try modifying your
response variable. Possible benefits of this strategy include reducing extreme scores and/or meeting the
assumptions of the statistical procedure. For example, some response variables might need to be log
transformed. However, you need to be careful here. Transforming variables often makes the results more
difficult to interpret, because now you are working in, say, a logarithm metric instead of the metric in which the
variable was originally measured. Moreover, if you use a transformation that adjusts the model too much, you
can lose more power than is necessary. Categorizing continuous response variables (sometimes used as a
way of handling extreme scores) can also be problematic, because logistic or ordinal logistic regression often
requires many more subjects than does OLS regression. It makes sense that categorizing a response variable
will lead to a loss of power, as information is being "thrown away."
11. Purpose of the study: Different researchers have different reasons for conducting research. Some are
trying to determine if a coefficient (such as a regression coefficient) is different from zero. Others are trying to
get a precise estimate of a coefficient. Still others are replicating research that has already been done. The
purpose of the research can affect the necessary sample size. Going back to our drug A and placebo study,

let's suppose our purpose is to test the difference in means to see if it equals zero. In this case, we need a
relatively small sample size. If our purpose is to get a precise estimate of the means (i.e., minimizing the
standard errors), then we will need a larger sample size. If our purpose is to replicate previous research, then
again we will need a relatively large sample size. Tversky and Kahneman (1971) pointed out that we often
need more subjects in a replication study than were in the original study. They also noted that researchers are
often too optimistic about how much power they really have. They claim that researchers too readily assign
"causal" reasons to explain differences between studies, instead of sampling error. They also mentioned that
researchers tend to underestimate the impact of sampling and think that results will replicate more often than is
the case.
12. Missing data: A final point that we would like to make here regards missing data. Almost all researchers
have issues with missing data. When designing your study and selecting your measures, you want to do
everything possible to minimize missing data. Handling missing data via imputation methods can be very tricky
and very time-consuming. If the data set is small, the situation can be even more difficult. In general, missing
data reduces power; poor imputation methods can greatly reduce power. If you have to impute, you want to
have as few missing data points on as few variables as possible. When designing the study, you might want to
collect data specifically for use in an imputation model (which usually involves a different set of variables than
the model used to test your hypothesis). It is also important to note that the default technique for handling
missing data by virtually every statistical program is to remove the entire case from an analysis (i.e., casewise
deletion). This process is undertaken even if the analysis involves 20 variables and a subject is missing only
one datum of the 20. Casewise deletion is one of the biggest contributors to loss of power, both because of the
omnipresence of missing data and because of the omnipresence of this default setting in statistical programs
(Graham et al., 2003).
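The bite of casewise deletion is easy to demonstrate with toy data. In the sketch below (all numbers invented), each of 20 variables is missing for only 5% of subjects, yet most cases are discarded by default deletion.

```python
# Toy illustration: a little missing data on many variables, plus casewise
# deletion, throws away most of the sample.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n, k, miss_rate = 300, 20, 0.05           # subjects, variables, missingness per variable
data = pd.DataFrame(rng.normal(size=(n, k)),
                    columns=[f"x{i}" for i in range(k)])
# sprinkle 5% missing values independently into each of the 20 variables
data = data.mask(rng.random(data.shape) < miss_rate)

complete = data.dropna()                  # what casewise deletion keeps
print(f"started with {n} subjects, {len(complete)} remain after casewise deletion")
# with 20 variables at 5% missing each, roughly 0.95**20 (about 36%) of cases survive
```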
This ends the section on the various factors that can influence power. We know that was a lot, and we
understand that much of this can be frustrating because there is very little that is "black and white". We hope
that this section made clear the close relationship between the experimental design, the statistical analysis and
power.

Cautions about small sample sizes and sampling variation


We want to take a moment here to mention some issues that frequently arise when using small samples. (We
aren't going to put a lower limit on what we mean by "small sample size.") While there are situations in which a researcher can only get or afford a small number of subjects, in most cases the researcher has some
choice in how many subjects to include. Considerations of time and effort argue for running as few subjects as
possible, but there are some difficulties associated with small sample sizes, and these may outweigh any gains
from the saving of time, effort or both. One obvious problem with small sample sizes is that they have low
power. This means that you need to have a large effect size to detect anything. You will also have fewer
options with respect to appropriate statistical procedures, as many common procedures, such as correlations,
logistic regression and multilevel modeling, are not appropriate with small sample sizes. One is also more
likely to violate the assumptions of the statistical procedure that is used (especially assumptions like
normality). In most cases, the statistical model must be smaller when the data set is small. Interaction terms,
which often test interesting hypotheses, are frequently the first casualties. Generalizability of the results may also be compromised, and it can be difficult to argue that a small sample is representative of a large and varied population. Missing data are also more problematic; fewer imputation methods are available to you, and those that remain (such as mean imputation) are not considered desirable.

Finally, with a small sample size, alpha inflation issues can be more difficult to address, and you are more likely
to run as many tests as you have subjects.
While the issue of sampling variability is relevant to all research, it is especially relevant to studies with small
sample sizes. To quote Murphy and Myors (2004, page 59), "The lack of attention to power analysis (and the
deplorable habit of placing too much weight on the results of small sample studies) are well documented in the
literature, and there is no good excuse to ignore power in designing studies." In an early article entitled Belief in the Law of Small Numbers, Tversky and Kahneman (1971) stated that many researchers act like the Law of Large
Numbers applies to small numbers. People often believe that small samples are more representative of the
population than they really are.
The last two points to be made here are that there is usually no point to conducting an underpowered study, and
that underpowered studies can cause chaos in the literature because studies that are similar methodologically
may report conflicting results.

Software
We will briefly discuss some of the programs that you can use to assist you with your power analysis. Most
programs are fairly easy to use, but you still need to know effect sizes, means, standard deviations, etc.
Among the programs specifically designed for power analysis, we use SPSS Sample Power, PASS and
GPower. These programs have a friendly point-and-click interface and will do power analyses for things like
correlations, OLS regression and logistic regression. We have also started using Optimal Design for repeated
measures, longitudinal and multilevel designs. We should note that Sample Power is a stand-alone program
that is sold by SPSS; it is not part of SPSS Base or an add-on module. PASS can be purchased directly from
NCSS at http://www.ncss.com/index.htm. GPower (please see http://www.psycho.uni-duesseldorf.de/aap/projects/gpower/ for details) and Optimal Design (please see http://sitemaker.umich.edu/group-based/home for details) are free.
Several general use stat packages also have procedures for calculating power. SAS has proc power, which
has a lot of features and is pretty nice. Stata has the sampsi command, as well as many user-written
commands, including fpower, powerreg and aipe (written by our IDRE statistical consultants). Statistica has
an add-on module for power analysis. There are also many programs online that are free.
For more advanced/complicated analyses, Mplus is a good choice. It will allow you to do Monte Carlo
simulations, and there are some examples
at http://www.statmodel.com/power.shtml and http://www.statmodel.com/ugexcerpts.shtml.
Most of the programs that we have mentioned do roughly the same things, so when selecting a power analysis
program, the real issue is your comfort; all of the programs require you to provide the same kind of information.

Multiplicity
This issue of multiplicity arises when a researcher has more than one outcome of interest in a given study.
While it is often good methodological practice to have more than one measure of the response variable of
interest, additional response variables mean more statistical tests need to be conducted on the data set, and
this leads to the question of experimentwise alpha control. Returning to our example of drug A and placebo, if we
have only one response variable, then only one t test is needed to test our hypothesis. However, if we have

three measures of our response variable, we would want to do three t tests, hoping that each would show
results in the same direction. The question is how to control the Type I error (AKA false alarm) rate. Most
researchers are familiar with Bonferroni correction, which calls for dividing the prespecified alpha level
(usually .05) by the number of tests to be conducted. In our example, we would have .05/3 = .0167. Hence, .0167 would be our new critical alpha level, and statistics with a p-value greater than .0167 would be classified
as not statistically significant. It is well-known that the Bonferroni correction is very conservative, so in our next
seminar, we will discuss other ways of adjusting the alpha level.
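For example, the Bonferroni arithmetic above can be reproduced with statsmodels; the three p-values fed in below are hypothetical.

```python
# Bonferroni adjustment for three hypothetical t tests (p-values are made up).
from statsmodels.stats.multitest import multipletests

pvals = [0.012, 0.030, 0.041]             # hypothetical p-values from the three t tests
reject, p_adj, _, alpha_bonf = multipletests(pvals, alpha=0.05, method="bonferroni")

print("Bonferroni critical alpha:", round(alpha_bonf, 4))   # .05 / 3 = .0167
print("Adjusted p-values:", [round(p, 3) for p in p_adj])
print("Reject H0?", list(reject))
```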

Afterthoughts: A post-hoc power analysis


In general, just say "No!" to post-hoc analyses. There are many reasons, both mechanical and theoretical, why
most researchers should not do post-hoc power analyses. Excellent summaries can be found in Hoenig and
Heisey (2001) The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis and
Levine and Ensom (2001) Post Hoc Power Analysis: An Idea Whose Time Has Passed?. As Hoenig and
Heisey show, power is mathematically directly related to the p-value; hence, calculating power once you know
the p-value associated with a statistic adds no new information. Furthermore, as Levine and Ensom clearly
explain, the logic underlying post-hoc power analysis is fundamentally flawed.
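Hoenig and Heisey's point can be illustrated in a few lines: for a two-sided z test, "observed power" is just a function of the p-value. This is a sketch using the normal distribution, and the p-values fed in are arbitrary.

```python
# "Observed power" computed from the p-value alone for a two-sided z test:
# it adds no information beyond the p-value itself.
from scipy.stats import norm

def observed_power(p_value, alpha=0.05):
    z_obs = norm.ppf(1 - p_value / 2)          # |z| implied by the p-value
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.cdf(z_obs - z_crit) + norm.cdf(-z_obs - z_crit)

for p in (0.01, 0.05, 0.20, 0.50):
    print(p, round(observed_power(p), 3))      # p = .05 gives observed power of about .50
```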
However, there are some things that you should look at after your study is completed. Have a look at the
means and standard deviations of your variables and see how close they are (or are not) to the values that
you used in the power analysis. Many researchers do a series of related studies, and this information can aid
in making decisions in future research. For example, if you find that your outcome variable had a standard
deviation of 7, and in your power analysis you were guessing it would have a standard deviation of 2, you may
want to consider using a different measure that has less variance in your next study.
The point here is that in addition to answering your research question(s), your current research project can also
assist with your next power analysis.

Conclusions
Conducting research is kind of like buying a car. While buying a car isn't the biggest purchase that you will
make in your life, few of us enter into the process lightly. Rather, we consider a variety of things, such as need
and cost, before making a purchase. You would do your research before you bought a car, because once you drive the car off the dealer's lot, there is nothing you can do if you realize it isn't the car you need. Choosing the type of analysis is like choosing which kind of car to buy. The number of subjects
is like your budget, and the model is like your expenses. You would never go buy a car without first having
some idea about what the payments will be. This is like doing a power analysis to determine approximately
how many subjects will be needed. Imagine signing the papers for your new Maserati only to find that the
payments will be twice your monthly take-home pay. This is like wanting to do a multilevel model with a binary
outcome, 10 predictors and lots of cross-level interactions and realizing that you can't do this with only 50
subjects. You don't have enough "currency" to run that kind of model. You need to find a model that is "more in
your price range." If you had $530 a month budgeted for your new car, you probably wouldn't want exactly
$530 in monthly payments. Rather you would want some "wiggle-room" in case something cost a little more
than anticipated or you were running a little short on money that month. Likewise, if your power analysis says
you need about 300 subjects, you wouldn't want to collect data on exactly 300 subjects. You would want to
collect data on 300 subjects plus a few, just to give yourself some "wiggle-room" just in case.

Don't be afraid of what you don't know. Get in there and try it BEFORE you collect your data. Correcting things
is easy at this stage; after you collect your data, all you can do is damage control. If you are in a hurry to get a
project done, perhaps the worst thing that you can do is start collecting data now and worry about the rest
later. The project will take much longer if you do this than if you do what we are suggesting and do the power
analysis and other planning steps. If you have everything all planned out, things will go much more smoothly and you will have fewer and/or less intense panic attacks. Of course, something unexpected will always happen, but it is unlikely to be as big a problem. UCLA researchers are always welcome and strongly encouraged to
come into our walk-in consulting and discuss their research before they begin the project.
Power analysis = planning. You will want to plan not only for the test of your main hypothesis, but also for
follow-up tests and tests of secondary hypotheses. You will want to make sure that "confirmation" checks will
run as planned (for example, checking to see that interrater reliability was acceptable). If you intend to use
imputation methods to address missing data issues, you will need to become familiar with the issues surrounding the particular procedure, as well as include any additional variables in your data collection
procedures. Part of your planning should also include a list of the statistical tests that you intend to run and
consideration of any procedure to address alpha inflation issues that might be necessary.
The number output by any power analysis program is often more a starting point for thought than a final answer to the question of how many subjects will be needed. As we have seen, you also need to consider the
purpose of the study (coefficient different from 0, precise point estimate, replication), the type of statistical test
that will be used (t-test versus maximum likelihood technique), the total number of statistical tests that will be
performed on the data set, generalizability from the sample to the population, and probably several other things
as well.
The take-home message from this seminar is "do your research before you do your research."

References
Anderson, N. H. (2001). Empirical Direction in Design and Analysis. Mahwah, New Jersey: Lawrence
Erlbaum Associates.
Bausell, R. B. and Li, Y. (2002). Power Analysis for Experimental Research: A Practical Guide for the Biological, Medical and Social Sciences. New York: Cambridge University Press.
Bickman, L., Editor. (2000). Research Design: Donald Campbell's Legacy, Volume 2. Thousand Oaks, CA:
Sage Publications.
Bickman, L., Editor. (2000). Validity and Social Experimentation. Thousand Oaks, CA: Sage Publications.
Campbell, D. T. and Russo, M. J. (2001). Social Measurement. Thousand Oaks, CA: Sage Publications.
Campbell, D. T. and Stanley, J. C. (1963). Experimental and Quasi-experimental Designs for Research.
Reprinted from Handbook of Research on Teaching. Palo Alto, CA: Houghton Mifflin Co.
Chen, P. and Popovich, P. M. (2002). Correlation: Parametric and Nonparametric Measures. Thousand
Oaks, CA: Sage Publications.
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences, Second Edition. Hillsdale, New
Jersey: Lawrence Erlbaum Associates.

Cook, T. D. and Campbell, D. T. (1979). Quasi-experimentation: Design and Analysis Issues for Field Settings. Palo Alto, CA: Houghton Mifflin Co.
Graham, J. W., Cumsille, P. E., and Elek-Fisk, E. (2003). Methods for handling missing data. In J. A. Schinka
and W. F. Velicer (Eds.), Handbook of psychology (Vol. 2, pp. 87-114). New York: Wiley.
Green, S. B. (1991). How many subjects does it take to do a regression analysis? Multivariate Behavioral
Research, 26(3), 499-510.
Hoenig, J. M. and Heisey, D. M. (2001). The Abuse of Power: The Pervasive Fallacy of Power Calculations
for Data Analysis. The American Statistician, 55(1), 19-24.
Kelley, K and Maxwell, S. E. (2003). Sample size for multiple regression: Obtaining regression coefficients
that are accurate, not simply significant. Psychological Methods, 8(3), 305-321.
Keppel, G. and Wickens, T. D. (2004). Design and Analysis: A Researcher's Handbook, Fourth Edition. Upper Saddle River, New Jersey: Pearson Prentice Hall.
Kline, R. B. (2004). Beyond Significance Testing: Reforming Data Analysis Methods in Behavioral Research. Washington, D.C.: American Psychological Association.
Levine, M., and Ensom M. H. H. (2001). Post Hoc Power Analysis: An Idea Whose Time Has
Passed? Pharmacotherapy, 21(4), 405-409.
Lipsey, M. W. and Wilson, D. B. (1993). The Efficacy of Psychological, Educational, and Behavioral
Treatment: Confirmation from Meta-analysis. American Psychologist, 48(12), 1181-1209.
Long, J. S. (1997). Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks,
CA: Sage Publications.
Maxwell, S. E. (2000). Sample size and multiple regression analysis. Psychological Methods, 5(4), 434-458.
Maxwell, S. E. and Delaney, H. D. (2004). Designing Experiments and Analyzing Data: A Model Comparison Perspective, Second Edition. Mahwah, New Jersey: Lawrence Erlbaum Associates.
Murphy, K. R. and Myors, B. (2004). Statistical Power Analysis: A Simple and General Model for Traditional
and Modern Hypothesis Tests. Mahwah, New Jersey: Lawrence Erlbaum Associates.
Publication Manual of the American Psychological Association, Fifth Edition. (2001). Washington, D.C.:
American Psychological Association.
Sedlmeier, P. and Gigerenzer, G. (1989). Do Studies of Statistical Power Have an Effect on the Power of
Studies? Psychological Bulletin, 105(2), 309-316.
Shadish, W. R., Cook, T. D. and Campbell, D. T. (2002). Experimental and Quasi-experimental Designs for
Generalized Causal Inference. Boston: Houghton Mifflin Co.
Stratton, I. M. and Neil, A. (2004). How to ensure your paper is rejected by the statistical reviewer. Diabetic
Medicine, 22, 371-373.
Tversky, A. and Kahneman, D. (1971). Belief in the Law of Small Numbers. Psychological Bulletin, 76(2), 105-110.

Webb, E., Campbell, D. T., Schwartz, R. D., and Sechrest, L. (2000). Unobtrusive Measures, Revised Edition.
Thousand Oaks, CA: Sage Publications.


Power Analysis
Last revised: 5/8/2015

Contents
Power Analysis

Designing an Experiment, Power Analysis

General Purpose
Power Analysis and Sample Size Calculation in Experimental Design
o Sampling Theory
o Hypothesis Testing Logic
o Calculating Power
o Calculating Required Sample Size
o Graphical Approaches to Power Analysis
Noncentrality Interval Estimation and the Evaluation of Statistical Models
o Inadequacies of the Hypothesis Testing Approach
o Advantages of Interval Estimation
o Why Interval Estimates are Seldom Reported
o Replacing Traditional Hypothesis Tests with Interval Estimates

General Purpose
The techniques of statistical power analysis, sample size estimation, and advanced confidence interval estimation are discussed here. The main goal of the first two techniques is to allow you to decide, while in the process of designing an experiment, (a) how large a sample is needed to enable statistical judgments that are accurate and reliable and (b) how likely your statistical test will be to detect effects of a given size in a particular situation. The third technique is useful in implementing objectives (a) and (b) and in evaluating the size of experimental effects in practice.

Performing power analysis and sample size estimation is an important aspect of experimental design,
because without these calculations, sample size may be too high or too low. If sample size is too low,
the experiment will lack the precision to provide reliable answers to the questions it is investigating. If
sample size is too large, time and resources will be wasted, often for minimal gain.
In some power analysis software programs, a number of graphical and analytical tools are available to
enable precise evaluation of the factors affecting power and sample size in many of the most
commonly encountered statistical analyses. This information can be crucial to the design of a study
that is cost-effective and scientifically useful.
Noncentrality interval estimation and other sophisticated confidence interval procedures provide methods for analyzing the importance of an observed experimental result. An increasing number of influential statisticians are suggesting that confidence
interval estimation should augment or replace traditional hypothesis testing approaches in the analysis
of experimental data.

Power Analysis and Sample Size Calculation in Experimental Design


There is a growing recognition of the importance of power analysis and sample size calculation in the
proper design of experiments. Click on the links below for a discussion of the fundamental ideas behind
these methods.

Sampling Theory
Hypothesis Testing Logic
Calculating Power
Calculating Required Sample Size
Graphical Approaches to Power Analysis

Sampling Theory
In most situations in statistical analysis, we do not have access to an entire statistical population of
interest, either because the population is too large, is not willing to be measured, or the measurement
process is too expensive or time-consuming to allow more than a small segment of the population to be
observed. As a result, we often make important decisions about a statistical population on the basis of
a relatively small amount of sample data.
Typically, we take a sample and compute a quantity called a statistic in order to estimate some
characteristic of a population called a parameter.
For example, suppose a politician is interested in the proportion of people who currently favor her
position on a particular issue. Her constituency is a large city with a population of about 1,500,000
potential voters. In this case, the parameter of interest, which we might call π, is the proportion of

people in the entire population who favor the politician's position. The politician is going to
commission an opinion poll, in which a (hopefully) random sample of people will be asked whether or
not they favor her position. The number (call it N) of people to be polled will be quite small, relative
to the size of the population. Once these people have been polled, the proportion of them favoring the
politician's position will be computed. This proportion, which is a statistic, can be called p.
One thing is virtually certain before the study is ever performed: p will not be equal to π!
Because p involves "the luck of the draw," it will deviate from π. The amount by which p is wrong,
i.e., the amount by which it deviates from π, is called sampling error.
In any one sample, it is virtually certain there will be some sampling error (except in some highly
unusual circumstances), and that we will never be certain exactly how large this error is. If we knew
the amount of the sampling error, this would imply that we also knew the exact value of the
parameter, in which case we would not need to be doing the opinion poll in the first place.
In general, the larger the sample size N, the smaller sampling error tends to be. (You can never be sure
what will happen in a particular experiment, of course.) If we are to make accurate decisions about a
parameter like π, we need to have an N large enough so that sampling error will tend to be
"reasonably small." If N is too small, there is not much point in gathering the data, because the results
will tend to be too imprecise to be of much use.
On the other hand, there is also a point of diminishing returns beyond which increasing N provides little
benefit. Once N is "large enough" to produce a reasonable level of accuracy, making it larger simply
wastes time and money.
So some key decisions in planning any experiment are, "How precise will my parameter estimates tend
to be if I select a particular sample size?" and "How big a sample do I need to attain a desirable level of
precision?"
The purpose of Power Analysis and Sample Size Estimation is to provide you with the statistical
methods to answer these questions quickly, easily, and accurately. A good statistical software program
will provide simple dialogs for performing power calculations and sample size estimation for many of
the classic statistical procedures as well as special noncentral distribution routines to allow the
advanced user to perform a variety of additional calculations.

Hypothesis Testing Logic


Suppose that the politician was interested in showing that a majority of people supported her position. Her question, in statistical terms, is: "Is π > .50?" Being an optimist, she believes that it is.
In statistics, the following strategy is quite common. State as a "statistical null hypothesis" something
that is the logical opposite of what you believe. Call this hypothesis H0. Gather data. Then, using
statistical theory, show from the data that it is likely H0 is false, and should be rejected.

By rejecting H0, you support what you actually believe. This kind of situation, which is typical in many
fields of research, for example, is called "Reject-Support testing" (RS testing), because rejecting the null hypothesis supports the experimenter's theory.
The null hypothesis is either true or false, and the statistical decision process is set up so that there
are no "ties." The null hypothesis is either rejected or not rejected. Consequently, before undertaking
the experiment, we can be certain that only 4 possible things can happen. These are summarized in the table below.

                          State of the World
Decision              H0                      H1
H0                    Correct Acceptance      Type II Error
H1                    Type I Error            Correct Rejection
Note that there are two kinds of errors represented in the table. Many statistics textbooks present a point of view that is common in the social sciences, i.e., that α, the Type I error rate, must be kept at or below .05, and that, if at all possible, β, the Type II error rate, must be kept low as well. "Statistical power," which is equal to 1 - β, must be kept correspondingly high. Ideally, power should be at least .80 to detect a reasonable departure from the null hypothesis.
The conventions are, of course, much more rigid with respect to α than with respect to β. For example, in the social sciences seldom, if ever, is α allowed to stray above the magical .05 mark.
Significance Testing (RS/AS). In the context of significance testing, we can define two basic kinds of
situations, reject-support (RS) (discussed above) and accept-support (AS). In RS testing, the null
hypothesis is the opposite of what the researcher actually believes, and rejecting it supports the
researcher's theory. In a two group RS experiment involving comparison of the means of an
experimental and control group, the experimenter believes the treatment has an effect, and seeks to
confirm it through a significance test that rejects the null hypothesis.
In the RS situation, a Type I error represents, in a sense, a "false positive" for the researcher's theory.
From society's standpoint, such false positives are particularly undesirable. They result in much wasted
effort, especially when the false positive is interesting from a theoretical or political standpoint (or
both), and as a result stimulates a substantial amount of research. Such follow-up research will usually
not replicate the (incorrect) original work, and much confusion and frustration will result.
In RS testing, a Type II error is a tragedy from the researcher's standpoint, because a theory that is true
is, by mistake, not confirmed. So, for example, if a drug designed to improve a medical condition is
found (incorrectly) not to produce an improvement relative to a control group, a worthwhile therapy
will be lost, at least temporarily, and an experimenter's worthwhile idea will be discounted.

As a consequence, in RS testing, society, in the person of journal editors and reviewers, insists on keeping α low. The statistically well-informed researcher makes it a top priority to keep β low as well. Ultimately, of course, everyone benefits if both error probabilities are kept low, but unfortunately there is often, in practice, a trade-off between the two types of error.
The RS situation is by far the more common one, and the conventions relevant to it have come to dominate popular views on statistical testing. As a result, the prevailing views on error rates are that relaxing α beyond a certain level is unthinkable, and that it is up to the researcher to make sure statistical power is adequate. You might argue how appropriate these views are in the context of RS testing, but they are not altogether unreasonable.
In AS testing, the common view on error rates we described above is clearly inappropriate. In AS
testing, H0 is what the researcher actually believes, so accepting it supports the researcher's theory. In
this case, a Type I error is a false negative for the researcher's theory, and a Type II error constitutes a
false positive. Consequently, acting in a way that might be construed as highly virtuous in the RS
situation, for example, maintaining a very low Type I error rate like .001, is actually "stacking the deck"
in favor of the researcher's theory in AS testing.
In both AS and RS situations, it is easy to find examples where significance testing seems strained and
unrealistic. Consider first the RS situation. In some such situations, it is simply not possible to have
very large samples. An example that comes to mind is social or clinical psychological field research.
Researchers in these fields sometimes spend several days interviewing a single subject. A year's
research may only yield valid data from 50 subjects. Correlational tests, in particular, have very low
power when samples are that small. In such a case, it probably makes sense to relax α beyond .05, if it means that reasonable power can be achieved.
On the other hand, it is possible, in an important sense, to have power that is too high. For
example, you might be testing the hypothesis that two population means are equal (i.e., Mu1 = Mu2)
with sample sizes of a million in each group. In this case, even with trivial differences between groups,
the null hypothesis would virtually always be rejected.
The situation becomes even more unnatural in AS testing. Here, if N is too high, the researcher almost
inevitably decides against the theory, even when it turns out, in an important sense, to be an excellent
approximation to the data. It seems paradoxical indeed that in this context experimental precision
seems to work against the researcher.
To summarize:
In Reject-Support research:
1. The researcher wants to reject H0.
2. Society wants to control Type I error.
3. The researcher must be very concerned about Type II error.
4. High sample size works for the researcher.
5. If there is "too much power," trivial effects become "highly significant."

In Accept-Support research:
1. The researcher wants to accept H0.
2. "Society" should be worrying about controlling Type II error, although it sometimes gets
confused and retains the conventions applicable to RS testing.
3. The researcher must be very careful to control Type I error.
4. High sample size works against the researcher.
5. If there is "too much power," the researcher's theory can be "rejected" by a significance test
even though it fits the data almost perfectly.

Calculating Power
Properly designed experiments must ensure that power will be reasonably high to detect reasonable
departures from the null hypothesis. Otherwise, an experiment is hardly worth doing. Elementary
textbooks contain detailed discussions of the factors influencing power in a statistical test. These
include
1. What kind of statistical test is being performed. Some statistical tests are inherently more
powerful than others.
2. Sample size. In general, the larger the sample size, the larger the power. However, increasing sample size generally involves tangible costs in time, money, and effort. Consequently, it is important to make sample size "large enough," but not wastefully large.
3. The size of experimental effects. If the null hypothesis is wrong by a substantial amount,
power will be higher than if it is wrong by a small amount.
4. The level of error in experimental measurements. Measurement error acts like "noise" that can
bury the "signal" of real experimental effects. Consequently, anything that enhances the
accuracy and consistency of measurement can increase statistical power.

Calculating Required Sample Size


To ensure a statistical test will have adequate power, you usually must perform special analyses prior
to running the experiment, to calculate how large an N is required.

Let's briefly examine the kind of statistical theory that lies at the foundation of the calculations used
to estimate power and sample size. Return to the original example of the politician, contemplating
how large an opinion poll should be taken to suit her purposes.
Statistical theory, of course, cannot tell us what will happen with any particular opinion poll. However,
through the concept of a sampling distribution, it can tell us what will tend to happen in the long
run, over many opinion polls of a particular size.
A sampling distribution is the distribution of a statistic over repeated samples. Consider the sample proportion p resulting from an opinion poll of size N, in the situation where the population proportion π is exactly .50. Sampling distribution theory tells us that p will have a distribution that can be calculated from the binomial theorem. This distribution, for reasonably large N, and for values of π not too close to 0 or 1, looks very much like a normal distribution with a mean of π and a standard deviation (called the "standard error of the proportion") of

σp = sqrt(π(1 - π)/N)
Suppose, for example, the politician takes an opinion poll based on an N of 100. Then the distribution of p, over repeated samples, will look like this if π = .5.

The values are centered around .5, but a small percentage of values are greater than .6 or less than .4.
This distribution of values reflects the fact that an opinion poll based on a sample of 100 is an
imperfect indicator of the population proportion π. If p were a "perfect" estimate of π, the standard error of the proportion would be zero, and the
sampling distribution would be a spike located at 0.5. The spread of the sampling distribution indicates
how much "noise" is mixed in with the "signal" generated by the parameter.
Notice from the equation for the standard error of the proportion that, as N increases, the standard
error of the proportion gets smaller. If N becomes large enough, we can be very certain that our
estimate p will be a very accurate one.

Suppose the politician uses a decision criterion as follows. If the observed value of p is greater than .58, she will decide that the null hypothesis that π is less than or equal to .50 is false. This rejection rule is diagrammed below.

You may, by adding up all the probabilities (computable from the binomial distribution), determine that the probability of rejecting the null hypothesis when π = .50 is .044. Hence, this decision rule controls the Type I error rate, α, at or below .044. It turns out that this is the lowest decision criterion that maintains α at or below .05.
However, the politician is also concerned about power in this situation, because it is by rejecting the
null hypothesis that she is able to support the notion that she has public opinion on her side.
Suppose that 55% of the people support the politician, that is, that π = .55 and the null hypothesis is
actually false. In this case, the correct decision is to reject the null hypothesis. What is the probability
that she will obtain a sample proportion greater than the "cut-off" value of .58 required to reject the
null hypothesis?
In the figure below, we have superimposed the sampling distribution for p when π = .55. Clearly, only a small percentage of the time will the politician reach the correct decision that she has majority support. The probability of obtaining a p greater than .58 is only .241.
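These two probabilities can be checked directly from the binomial distribution. The sketch below assumes the rule "reject when the sample proportion exceeds .58," i.e., when at least 59 of the 100 respondents are supporters.

```python
# Checking the politician example with the exact binomial distribution.
from scipy.stats import binom

N, cutoff = 100, 59                       # reject H0 when at least 59 of 100 support her

alpha = binom.sf(cutoff - 1, N, 0.50)     # P(X >= 59 | pi = .50): Type I error rate
power = binom.sf(cutoff - 1, N, 0.55)     # P(X >= 59 | pi = .55): power
print(round(alpha, 3))                    # about .044
print(round(power, 3))                    # about .24, as quoted above
```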

Needless to say, there is no point in conducting an experiment in which, if your position is correct, it
will only be verified 24.1% of the time! In this case a statistician would say that the significance test
has "inadequate power to detect a departure of 5 percentage points from the null hypothesized value."
The crux of the problem lies in the width of the two distributions in the preceding figure. If the sample
size were larger, the standard error of the proportion would be smaller, and there would be little
overlap between the distributions. Then it would be possible to find a decision criterion that provides a low α and high power.
The question is, "How large an N is necessary to produce a power that is reasonably high in this situation, while maintaining α at a reasonably low value?"
You could, of course, go through laborious, repetitive calculations in order to arrive at such a sample
size. However, a good software program will perform them automatically, with just a few clicks of the
mouse. Moreover, for each analytic situation that it handles, it will provide extensive capabilities for
analyzing and graphing the theoretical relationships between power, sample size, and the variables
that affect them. Assuming that the user will be employing the well-known chi-square test, rather than the exact binomial test, suppose that the politician decides that she requires a power of .80 to detect a π of .55. It turns out that a sample size of 607 will yield a power of exactly .8009. (The actual alpha of this test, which has a nominal level of .05, is .0522 in this situation.)
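Those figures can be reproduced, at least approximately, by brute force. The sketch below assumes the rejection rule is a one-sided z (chi-square) rule at a nominal .05 level and then evaluates its exact alpha and power from the binomial distribution; it is a reconstruction for illustration, not the software's own algorithm.

```python
# Approximate check of the N = 607 example: one-sided z rule at nominal alpha = .05,
# exact alpha and power evaluated from the binomial distribution.
import math
from scipy.stats import binom, norm

N, pi0, pi1, alpha_nom = 607, 0.50, 0.55, 0.05

# reject H0 when the observed count X is at least this critical value
x_crit = math.ceil(N * pi0 + norm.ppf(1 - alpha_nom) * math.sqrt(N * pi0 * (1 - pi0)))

actual_alpha = binom.sf(x_crit - 1, N, pi0)   # exact alpha of the rule (about .052)
power = binom.sf(x_crit - 1, N, pi1)          # exact power against pi = .55 (about .80)
print(x_crit, round(actual_alpha, 4), round(power, 4))
```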

Graphical Approaches to Power Analysis


In the preceding discussion, we arrived at a necessary sample size of 607 under the assumption that π is precisely .55. In practice, of course, we would be foolish to perform only one power
calculation, based on one hypothetical value. For example, suppose the function relating required sample size to π is particularly steep in this case. It might then be that the sample size required to reliably detect a π of .55 is much different from that required to reliably detect a π of .60.

Intelligent analysis of power and sample size requires the construction, and careful evaluation, of
graphs relating power, sample size, the amount by which the null hypothesis is wrong (i.e., the
experimental effect), and other factors such as Type I error rate.
In the example discussed in the preceding section, the goal, from the standpoint of the politician, is to
plan a study that can decide, with a low probability of error, whether the support level is greater
than .50. Graphical analysis can shed a considerable amount of light on the capabilities of a statistical
test to provide the desired information under such circumstances.
For example, the researcher could plot power against sample size, under the assumption that the true
level is .55, i.e., 55%. The user might start with a graph that covers a very wide range of sample sizes,
to get a general idea of how the statistical test behaves. The following graph shows power as a
function of sample sizes ranging from 20 to 2000, using a "normal approximation" to the exact binomial
distribution.

The previous graph demonstrates that power reaches an acceptable level (often considered to be
between .80 and .90) at a sample size of approximately 600.
Remember, however, that this calculation is based on the supposition that the true value of π is .55. It may be that the shape of the curve relating power and sample size is very sensitive to this value. The question immediately arises, "How sensitive is the slope of this graph to changes in the actual value of π?"
There are a number of ways to address this question. You can plot power vs. sample size for other values of π, for example. Below is a graph of power vs. sample size for π = .6.

You can see immediately in the preceding graph that the improvement in power for increases in N occurs much more rapidly for π = .6 than for π = .55. The difference is striking if you merge the two
graphs into one, as shown below.

In planning a study, particularly when a grant proposal must be submitted with a proposed sample
size, you must estimate what constitutes a reasonable minimum effect that you wish to detect, a
minimum power to detect that effect, and the sample size that will achieve that desired level of
power. This sample size can be obtained by analyzing the above graphs (additionally, some software
packages can calculate it directly). For example, if the user requests the minimum sample size required to achieve a power of .90 when π = .55, some programs can calculate this directly. The result is reported in a spreadsheet like the one below:
One Proportion, Z (or Chi-Square) Test
H0: Pi <= Pi0

                                        Value
Null Hypothesized Proportion (Pi0)      .5000
Population Proportion (Pi)              .5500
Alpha (Nominal)                         .0500
Required Power                          .9000
Required Sample Size (N)                853.0000
Actual Alpha (Exact)                    .0501
Power (Normal Approximation)            .9001
Power (Exact)                           .9002

For a given level of power, a graph of sample size vs. π can show how sensitive the required sample size is to the actual value of π. This can be important in gauging how sensitive the estimate of a required sample size is. For example, the following graph shows values of N needed to achieve a power of .90 for various values of π, when the null hypothesis is that π = .50.

The preceding graph demonstrates how the required N drops off rapidly as π varies from .55 to .60. To be able to reliably detect a difference of .05 (from the null hypothesized value of .50) requires an N greater than 800, but reliable detection of a difference of .10 requires an N of only around 200.
Obviously, then, required sample size is somewhat difficult to pinpoint in this situation. It is much
better to be aware of the overall performance of the statistical test against a range of
possibilities before beginning an experiment, than to be informed of an unpleasant reality after the
fact. For example, imagine that the experimenter had estimated the required sample size on the basis of reliably (with power of .90) detecting a π of .6. The experimenter budgets for a sample size of, say, 220, and imagines that minor departures of π from .6 will not require substantial differences in N. Only

later does the experimenter realize that a small change in π requires a huge increase in N, and that the
planning for the experiment was optimistic. In some such situations, a "window of opportunity" may
close before the sample size can be adjusted upward.
Across a wide variety of analytic situations, power analysis and sample size estimation involve steps that are fundamentally the same:
1. The type of analysis and null hypothesis are specified.
2. Power and required sample size for a reasonable range of effects are investigated.
3. The sample size required to detect a reasonable experimental effect (i.e., departure from the null hypothesis), with a reasonable level of power, is calculated, while allowing for a reasonable margin of error.

Noncentrality Interval Estimation and the Evaluation of Statistical Models


Power Analysis and Interval Estimation includes a number of confidence intervals that are not widely
available in general purpose statistics packages. Several of these are discussed within a common
theoretical framework, called "noncentrality interval estimation," by Steiger and Fouladi (1997). In this
section, we briefly review some of the basic rationale behind the emerging popularity of confidence
intervals.

Inadequacies of the Hypothesis Testing Approach


Strictly speaking, the outcome of a significance test is the dichotomous decision whether or not to
reject the null hypothesis. This dichotomy is inherently dissatisfying to many scientists who use the null
hypothesis as a statement of no effect, and are more interested in knowing how big an effect is than
whether it is (precisely) zero. This has led to behavior like putting one, two, or three asterisks next to
results in tables, or listing p-values next to results, when, in fact, such numbers, across (or sometimes
even within!) studies need not be monotonically related to the best estimates of strength of
experimental effects, and hence can be extremely misleading. Some writers (e.g., Guttman, 1977)
view asterisk-placing behavior as inconsistent with the foundations of significance testing logic.
Probability levels can deceive about the "strength" of a result, especially when presented without
supporting information. For example, if, in an ANOVA table, one effect had a p-value of .019, and the
other a p-value of .048, it might be an error to conclude that the statistical evidence supported the
view that the first effect was stronger than the second. A meaningful interpretation would require
additional information. To see why, suppose someone reports a p-value of .001. This could be
representative of a trivial population effect combined with a huge sample size, or a powerful
population effect combined with a moderate sample size, or a huge population effect with a small
sample. Similarly a p-value of .075 could represent a powerful effect operating with a small sample, or
a tiny effect with a huge sample. Clearly then, we need to be careful when comparing p-values.

In Accept-Support testing, which occurs frequently in the context of model fitting in factor analysis or
"causal modeling," significance testing logic is basically inappropriate. Rejection of an "almost true" null
hypothesis in such situations frequently has been followed by vague statements that the rejection
shouldn't be taken too seriously. Failure to reject a null hypothesis usually results in a demand by a
vigilant journal editor for cumbersome power calculations. Such problems can be avoided to some
extent by using confidence intervals.

Advantages of Interval Estimation


Much research is exploratory. The fundamental questions in exploratory research are "What is our best
guess for the size of the population effect?" and "How precisely have we determined the population
effect size from our sample data?" Significance testing fails to answer these questions directly. Many a
researcher, faced with an "overwhelming rejection" of a null hypothesis, cannot resist the temptation
to report that it was "significant well beyond the .001 level." Yet it is widely agreed that a p-value
following a significance test can be a poor vehicle for conveying what we have learned about the
strength of population effects.
Confidence interval estimation provides a convenient alternative to significance testing in most
situations. Consider the 2-tailed hypothesis of no difference between means. Recall first that the significance test rejects at the α significance level if and only if the 1 - α confidence interval for the mean difference excludes the value zero. Thus the significance test can be performed with the
confidence interval. Most undergraduate texts in behavioral statistics show how to compute such a
confidence interval. The interval is exact under the assumptions of the standard t test. However, the
confidence interval contains information about experimental precision that is not available from the
result of a significance test. Assuming we are reasonably confident about the metric of the data, it is
much more informative to state a confidence interval on Mu1 - Mu2 than it is to give the p-value for the t test of the hypothesis that Mu1 - Mu2 = 0. In summary, we might say that, in general, a confidence
interval conveys more information, in a more naturally usable form, than a significance test.
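The duality between the test and the interval can be verified on simulated data. The sketch below uses the standard pooled-variance t test; the group sizes and means are made up.

```python
# Simulated check: the two-sided t test at alpha = .05 rejects exactly when the
# 95% confidence interval for Mu1 - Mu2 excludes zero.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
g1 = rng.normal(0.0, 1.0, 40)
g2 = rng.normal(0.5, 1.0, 40)

t, p = stats.ttest_ind(g1, g2)            # pooled-variance t test
n1, n2 = len(g1), len(g2)
diff = g1.mean() - g2.mean()
sp2 = ((n1 - 1) * g1.var(ddof=1) + (n2 - 1) * g2.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
tcrit = stats.t.ppf(0.975, n1 + n2 - 2)
ci = (diff - tcrit * se, diff + tcrit * se)

print("p-value:", round(p, 4))
print("95% CI for Mu1 - Mu2:", tuple(round(c, 3) for c in ci))
# p < .05 if and only if the interval above does not contain zero
```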
This is seen most clearly when confidence intervals from several studies are graphed alongside one
another, as in the figure below.

The figure shows confidence intervals for the difference between means for 3 experiments, all
performed in the same domain, using measures with approximately the same variability. Experiments 1
and 3 yield a confidence interval that fails to include zero. For these experiments, the null hypothesis
was rejected. The second experiment yields a confidence interval that includes zero, so the null
hypothesis of no difference is not rejected. A significance testing approach would yield the impression
that the second experiment did not agree with the first and the third.
The confidence intervals suggest a different interpretation, however. The first experiment had a very
large sample size, and very high precision of measurement, reflected in a very narrow confidence
interval. In this experiment, a small effect was found, and determined with such high precision that
the null hypothesis of no difference could be rejected at a stringent significance level.
The second experiment clearly lacked precision, and this is reflected in the very wide confidence
interval. Evidently, the sample size was too small. It may well be that the actual effect in conditions
assessed in the second experiment was larger than that in the first experiment, but the experimental
precision was simply inadequate to detect it.
The third experiment found an effect that was statistically significant, and perhaps substantially higher
than the first experiment, although this is partly masked by the lower level of precision, reflected in a
confidence interval that, though narrower than Experiment 2, is substantially wider than Experiment 1.
Suppose the 3 experiments involved testing groups for differences in IQ. In the final analysis, we may have had too much power in Experiment 1, as we are declaring "highly significant" a rather minuscule effect substantially less than a single IQ point. We had far too little power in Experiment 2. Experiment
3 seems about right.
Many of the arguments we have made on behalf of confidence intervals have been made by others as
cogently as we have made them here. Yet, confidence intervals are seldom reported in the literature.
Most important, as we demonstrate in the succeeding sections, there are several extremely useful
confidence intervals that virtually never are reported. In what follows, we discuss why the intervals are
seldom reported.

Why Interval Estimates are Seldom Reported


In spite of the obvious advantages of interval estimates, they are seldom employed in published
articles in many areas of science. On those infrequent occasions when interval estimates are reported,
they are often not the optimal ones. There are several reasons for this status quo:
Tradition. Traditional approaches to statistics emphasize significance testing much more than interval
estimation.
Pragmatism. In RS situations, interval estimates are sometimes embarrassing. When they are narrow
but close to zero, they suggest that a "highly significant" result may be statistically significant but
trivial. When they are wide, they betray a lack of experimental precision.
Ignorance. Many people are simply unaware of some of the very valuable interval estimation
procedures that are available. For example, many textbooks on multivariate analysis never mention
that it is possible to compute a confidence interval on the squared multiple correlation coefficient.
Lack of availability. Some of the most desirable interval estimation procedures are computer
intensive, and have not been implemented in major statistical packages. This has made it less likely
that anyone will try the procedure.

Replacing Traditional Hypothesis Tests with Interval Estimates


There are a number of confidence interval procedures that can replace and/or augment the traditional
hypothesis tests used in classical testing situations. For a review of these techniques, see Steiger &
Fouladi (1997).
Analysis of Variance. One area where confidence intervals have seldom been employed is in assessing
strength of effects in the Analysis of Variance (ANOVA).
For example, suppose you are reading a paper which reports that, in a 1-Way ANOVA with 4 groups and N = 60 per group, an F statistic was found that is significant at the .05 level ("F = 2.70, p = .0464").
This result is statistically significant, but how meaningful is it in a practical sense? What have we
learned about the size of the experimental effects?
Fleischman (1980) discusses a technique for setting a confidence interval on the overall effect size in
the Analysis of Variance. This technique allows you to set a confidence interval on the RMSSE, the root-mean-square standardized effect. Standardized effects are reported in standard deviation units, and hence remain constant when the unit of measurement changes. So, for example, an
experimental effect reported in pounds would be different from the same effect reported in kilograms,

whereas the standardized effect would be the same in each case. In the case of the data mentioned
above, the F statistic that is significant at the .05 level yields a 90% confidence interval for the RMSSE
that ranges from .0190 to .3139. The lower limit of this interval stands for a truly mediocre effect, less
than 1/50th of a standard deviation. The upper limit of the interval represents effects on the order of
1/3 of a standard deviation, moderate but not overwhelming. It seems, then, that the results from this
study need not imply really strong experimental effects, even though the effects are statistically
"significant."
Multiple Regression. The squared multiple correlation is reported frequently as an index of the overall
strength of a prediction equation. After fitting a regression equation, the most natural questions to ask
are, (a) "How effective is the regression equation at predicting the criterion?" and (b) "How precisely
has this effectiveness been determined?"
Hence, one very common statistical application that practically cries out for a confidence interval is
multiple regression analysis. Publishing an observed squared multiple R together with the result of a
hypothesis test that the population squared multiple correlation is zero, conveys little of the available
statistical information. A confidence interval on the population squared multiple correlation is much
more informative.
One software package computes exact confidence intervals for the population squared multiple
correlation, following the approach of Steiger and Fouladi (1992). As an example, suppose a criterion is
predicted from 45 independent observations on 5 variables and the observed squared multiple
correlation is .40. In this case a 95% confidence interval for the population squared multiple correlation
ranges from .095 to .562! A 95% lower confidence limit is at .129. On the other hand, the sample
multiple correlation value is significant "beyond the .001 level," because the p-value is .0009, and the
shrunken estimator is .327. Clearly, it is far more impressive to state that "the squared multiple R value
is significant at the .001 level" than it is to state that "we are 95% confident that the population
squared multiple correlation is between .095 and .562." But we believe the latter statement conveys
the quality and meaning of the statistical result more accurately than the former.
Some writers, like Lee (1972), prefer a lower confidence limit, or "statistical lower bound" on the
squared multiple correlation to a confidence interval. The rationale, apparently, is that we
are primarily interested in assuring that the percentage of variance "accounted for" in the regression
equation exceeds some value. Although we understand the motivation behind this view, we hesitate to
accept it. The confidence interval, in fact, contains a lower bound, but also includes an upper bound,
and, in the interval width, a measure of precision of estimation. It seems to us that adoption of a lower
confidence limit can lead to a false sense of security, and reduces the amount of information available
in the model assessment process.

Statistical power
by Sarini Abdullah on 20/08/2015 in Inferential statistics

Did you know Santa once took a statistics class? He had trouble remembering which
hypothesis should have the equal sign so he would keep repeating: the null hypothesis,
the null hypothesis, the null hypothesis.
In fact, to this day you can hear him say Ho, Ho, Ho!

(Mark Eakin)

What is statistical power?


Statistical power, sometimes simply called the power of a test at significance level α, is defined as the probability that the test will reject the null hypothesis when that hypothesis is false.
In brief, this can be summarized in the following table.

What does that mean?
It helps to go through the terms above one by one.
What is a hypothesis?
A statement whose truth is to be tested. "Truth" here is relative, namely how strongly the statement is supported by the data.
Hypotheses come in two kinds:

The null hypothesis, usually written Ho.

It usually takes one particular, specific value.

The alternative hypothesis, the complement of the null hypothesis, usually written H1 or Ha.

It can be a single specific value, or an interval.

Our suspicion is usually stated here.

Example:
The government states that the market price of beef is Rp 100,000 per kg. In reality, however, people complain that beef is expensive, even far above Rp 100,000. To clarify whether the government's statement can still be believed or not, we can run a test, after collecting data on beef prices in the market, of course.
The hypotheses to be tested are:
Ho: the average price of beef per kg = Rp 100,000
Ha: the average price of beef per kg is above Rp 100,000
As a basis for the decision, we set a significance level α, usually α = 0.05 or 5%.
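Just to make the mechanics concrete, here is a toy version of this test in Python; the ten prices (in rupiah per kg) are invented for illustration, and the one-sided p-value is obtained by halving the two-sided one.

```python
# Toy one-sided test of the beef-price hypotheses above, with made-up prices.
import numpy as np
from scipy import stats

prices = np.array([112_000, 105_000, 98_000, 120_000, 110_000,
                   101_000, 115_000, 108_000, 99_000, 117_000])
t, p_two_sided = stats.ttest_1samp(prices, popmean=100_000)
p_one_sided = p_two_sided / 2 if t > 0 else 1 - p_two_sided / 2
print(round(t, 2), round(p_one_sided, 4))   # reject Ho if p_one_sided < alpha = 0.05
```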
What is the significance level α?
A value between 0 and 1 that states the tolerance we set for making an error when
drawing a conclusion. Naturally we want the chance of making a mistake to be as small
as possible, right?
Meaning?
α is the probability of rejecting Ho when Ho is actually true. We reject Ho only when
the chance that the observed result arose by mere coincidence under Ho is very small,
which means the data must reflect a real departure from Ho.
That "very small" chance is what is known as the p-value, so the simple rule in
hypothesis testing is: if p-value < α, reject Ho.
Why? Because the probability that we are making an error is then smaller than the
error tolerance we allowed ourselves, α.
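To make the decision rule concrete, here is a tiny R sketch with made-up beef prices; the vector prices and its values are purely hypothetical and only illustrate the comparison of the p-value with α.

# hypothetical sample of beef prices (Rp per kg)
prices <- c(103000, 98000, 110000, 105000, 101000, 99000, 107000, 104000)
alpha  <- 0.05

# one-sided test of Ho: mean price = 100000 against Ha: mean price > 100000
result <- t.test(prices, mu = 100000, alternative = "greater")
result$p.value < alpha   # TRUE means reject Ho at level alpha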
So what is statistical power?
Isn't α alone enough?
In practice, if Ho is rejected we usually do not need to compute the power. If Ho is
not rejected, however, it is considered necessary to compute it.
Why?
The practical reason:
We usually run a test to confirm a suspicion about some phenomenon, and we put that
suspicion in H1. We suspect that what is really going on is not what is stated in Ho,
as in the beef price case above.
So our suspicion ought to be supported by the data, right? That would mean H1 is
accepted and Ho is rejected.
But then why is Ho not rejected?
This is why we want to make sure that nothing is wrong with the test itself.
How do we know that the test result can really be trusted?

With the power.


How so?
It is simple. We would rather lean on something strong than on something weak,
wouldn't we? That strength is measured by the power: a test is considered good, and
its result trustworthy, if its power is large.
Recall the meaning of power:

the probability of rejecting Ho when Ho is in fact false,

which is a correct decision, isn't it?
If Ho really is false, it should be rejected with a high level of confidence.
If the rejection is half-hearted (the power is small), then we cannot really be sure
whether Ho is truly false.
So when Ho is not rejected, is it really because Ho is true?
Remember the logic of implication?
"If p then q" is equivalent to "if not q then not p".
"If it rains, I use an umbrella"
is equivalent to
"if I am not using an umbrella, then it must not be raining" (because if it were raining,
I would certainly use an umbrella).
By the same logic,

"If Ho is false, then Ho is rejected"

is equivalent to

"if Ho is not rejected, it must be because Ho is not false."

How far this statement holds for our test is measured by the power, and a trustworthy
test should have a large power.
How large should the power be?
The closer to 1 the better: it means our confidence in the correctness of the test
result approaches 100%. In practice, 0.8 is usually considered good enough.
How do we get large power?
The power is determined by:
1. The variability of the data (measured by the variance): the larger the variance, the
smaller the power. That makes sense, right? It is harder to draw conclusions about
something whose condition keeps changing. This variability is usually inherent in the
population and hard to control, but it can be reduced somewhat by designing the data
collection carefully.
2. The sample size: the larger the sample, the larger the power.
That also makes sense: the larger the sample, the more information we obtain and the
more convincing the test result. This is where the researcher has the most influence:
increase the sample size to increase the power of the test. Constraints still arise, of
course, such as limits on cost, time, and personnel or other resources. (Both effects
are illustrated in the short sketch after this list.)
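As a quick illustration of both points, one can compare the power of the same one-sample test under different spreads and sample sizes. The numbers below are arbitrary and chosen only to show the direction of the effect; the pwr package used here is the one introduced just below.

library(pwr)

# same hypothesized difference (0.3), same n: more spread in the data means less power
pwr.t.test(n = 100, d = 0.3 / 1.0, sig.level = 0.05, type = "one.sample")$power
pwr.t.test(n = 100, d = 0.3 / 2.0, sig.level = 0.05, type = "one.sample")$power

# same spread, larger sample: more power
pwr.t.test(n = 50,  d = 0.3 / 1.5, sig.level = 0.05, type = "one.sample")$power
pwr.t.test(n = 200, d = 0.3 / 1.5, sig.level = 0.05, type = "one.sample")$power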
Can we know the power of a test before we even collect the data?
Yes, provided we have information about the population (its variance), either from
experience or from the literature, along with the sample size we plan to take. Or we
can run a pilot study first.
In the R software, this can be done with the pwr package.
For a t-test, follow the syntax below.
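The original post shows the syntax only as an image; as a rough substitute, the call in the pwr package looks like the sketch below. The numbers in the example line are arbitrary and only illustrate the interface.

library(pwr)   # install.packages("pwr") if it is not yet installed

# pwr.t.test(n, d, sig.level, power, type, alternative)
#   n           sample size (per group for a two-sample test)
#   d           standardized effect size, i.e. delta / sd
#   sig.level   significance level alpha
#   power       power of the test
#   type        "one.sample", "two.sample", or "paired"
#   alternative "two.sided", "less", or "greater"
# Exactly one of n, d, power, and sig.level is left NULL, and the function solves for it.

# illustrative call: power of a one-sample, two-sided t test with n = 30 and d = 0.5
pwr.t.test(n = 30, d = 0.5, sig.level = 0.05,
           type = "one.sample", alternative = "two.sided")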

Example:
Hypothetical heart rate data, recorded in the variable hr (heart rate), are used to test
whether the mean heart rate is 150 or not.
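The post's test output is shown only as a screenshot and is not reproduced here. Assuming the measurements are stored in a numeric vector named hr, as in the post, the call would be along these lines.

# one-sample, two-sided t test of Ho: mean heart rate = 150
t.test(hr, mu = 150, alternative = "two.sided")
# the printed output reports t, df (269 here, so n = 270) and the p-value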

The result obtained is that Ho is not rejected.
We therefore want to know the power of the test. Since df = 269, we know that n = 269
+ 1 = 270.

The power can then be computed with a call like the one sketched below, where the mean, mean(hr), and the standard deviation, sd(hr), are computed directly from the data.
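The original call is again shown only as an image. A sketch of the corresponding calculation, using the pwr package and the hr vector from above (the name d.obs is introduced here just for this example), is:

library(pwr)

# observed standardized effect: distance of the sample mean from 150,
# in units of the sample standard deviation
d.obs <- abs(mean(hr) - 150) / sd(hr)

pwr.t.test(n = 270, d = d.obs, sig.level = 0.05,
           type = "one.sample", alternative = "two.sided")
# equivalently, base R's power.t.test(n = 270, delta = abs(mean(hr) - 150),
#                                     sd = sd(hr), type = "one.sample")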
Suppose we have no data at all yet, but we want to know the power if we were to use a
sample of size 270, expecting the observed mean to differ only slightly from our
hypothesized value (delta = 0.3), with an estimated standard deviation of 1.5 and a
significance level of 5%. With the following procedure:
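The procedure itself appears in the post as an image; under the stated assumptions (n = 270, delta = 0.3, sd = 1.5, alpha = 0.05, one-sample two-sided test) it corresponds to a call of roughly this form.

library(pwr)

# standardized effect size: delta / sd = 0.3 / 1.5 = 0.2
pwr.t.test(n = 270, d = 0.3 / 1.5, sig.level = 0.05,
           type = "one.sample", alternative = "two.sided")
# the returned power is about 0.906, matching the value quoted below
# (base R's power.t.test(n = 270, delta = 0.3, sd = 1.5, type = "one.sample")
#  gives the same answer)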

we obtain power = 0.906. Quite large.


But suppose we want a larger power, say 99%. How large a sample would we need?
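Again the original call is shown only as a screenshot. Asking for power = 0.99 and leaving n unspecified makes the same function solve for the required sample size; this is a sketch under the same assumptions as above.

library(pwr)

pwr.t.test(d = 0.3 / 1.5, sig.level = 0.05, power = 0.99,
           type = "one.sample", alternative = "two.sided")
# rounded up, the returned n reproduces the 462 observations quoted below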

It turns out that at least 462 observations are needed for power that high. Fair
enough, isn't it? No pain, no gain. If you want a convincing result, work harder and
collect more data.
