
Summary sheet from last time:

Confidence intervals

- Confidence intervals take on the usual form: parameter = statistic ± tcrit · SE(statistic)

    parameter            SE
    a                    se · sqrt(1/N + mx²/ssxx)
    b                    se / sqrt(ssxx)
    y (mean)             se · sqrt(1/N + (xo - mx)²/ssxx)
    ynew (individual)    se · sqrt(1/N + (xnew - mx)²/ssxx + 1)

- Where tcrit is with N-2 degrees of freedom, and
  se = sqrt(Σ(yi - ŷi)²/(N-2)) = sqrt((ssyy - b·ssxy)/(N-2))

Hypothesis testing

- Of course, any of the confidence intervals on the previous slide can be turned into hypothesis tests by computing tobt = (observed - expected)/SE, and comparing with tcrit.
- Testing H0: ρ = 0:
  - tobt = r·sqrt(N-2)/sqrt(1-r²)
  - Compare with tcrit for N-2 degrees of freedom.
- Testing H0: ρ = ρ0:
  - Need to use a different test statistic for this.

Chi-square tests and non-parametric statistics
9.07
4/13/2004

Statistics
- Descriptive
  - Graphs
  - Frequency distributions
  - Mean, median, & mode
  - Range, variance, & standard deviation
- Inferential
  - Confidence intervals
  - Hypothesis testing
Two categories of inferential statistics

- Parametric
  - What we've done so far
- Nonparametric
  - What we'll talk about in this lecture

Parametric statistics

- Limited to quantitative sorts of dependent variables (as opposed to, e.g., categorical or ordinal variables)
- Require that dependent variable scores are normally distributed, or at least that their means are approximately normal.
  - There exist some parametric statistical procedures that assume some other, non-normal distribution, but mostly a normal distribution is assumed.
  - The general point: parametric statistics operate with some assumed form for the distribution of the data.
- Sometimes require that population variances are equal

Parametric statistics

- Best to design a study that allows you to use parametric procedures when possible, because parametric statistics are more powerful than nonparametric.
- Parametric procedures are robust -- they will tolerate some violation of their assumptions.
- But if the data severely violate these assumptions, this may lead to an increase in Type I errors, i.e. you are more likely to reject H0 when it is in fact true.

Nonparametric statistics

- Use them when:
  - The dependent variable is quantitative, but has a very non-normal distribution or unequal population variances
  - The dependent variable is categorical
    - Male or female
    - Democrat, Republican, or independent
  - Or the dependent variable is ordinal
    - Child A is most aggressive, child B is 2nd most aggressive

Nonparametric statistics

- The design and logic of nonparametric statistics are very similar to those for parametric statistics:
  - Would we expect to see these results by chance, if our model of the population is correct (one-sample tests)?
  - Do differences in samples accurately represent differences in populations (two-sample tests)?
- H0, Ha, sampling distributions, Type I & Type II errors, alpha levels, critical values, maximizing power -- all this stuff still applies in nonparametric statistics.

Chi-square goodness-of-fit test

- A common nonparametric statistical test
- Considered nonparametric because it operates on categorical data, and because it can be used to compare two distributions regardless of the distributions' forms
- Also written χ²

- As an introduction, first consider a parametric statistical test which should be quite familiar to you by now...

Coin flip example (yet again)

- You flip a coin 100 times, and get 60 heads.
- You can think of the output as categorical (heads vs. tails)
- However, to make this a problem we could solve, we considered the dependent variable to be the number of heads

Coin flip example

- Null hypothesis: the coin is a fair coin
- Alternative hypothesis: the coin is not fair
- What is the probability that we would have seen 60 or more heads in 100 flips, if the coin were fair?

Coin flip example

- We could solve this using either the binomial formula to compute p(60/100) + p(61/100) + ..., or by using the z-approximation to binomial data
- Either way, we did a parametric test

Solving this problem via z-approximation

- zobt = (0.60 - 0.50)/sqrt(0.5²/100) = 2
- p ≈ 0.046
- If our criterion were α = 0.05, we would decide that this coin was unlikely to be fair.
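The z-approximation arithmetic on this slide is easy to check in code. This is a minimal sketch (not part of the original slides); the variable names are ours, and the two-tailed p-value comes from the standard normal tail via math.erfc:

```python
import math

# z-approximation to the binomial: 100 flips, 60 heads, fair-coin null (p0 = 0.5)
n, heads, p0 = 100, 60, 0.5
z_obt = (heads / n - p0) / math.sqrt(p0 * (1 - p0) / n)  # (0.60 - 0.50)/sqrt(0.5^2/100) = 2

# Two-tailed p-value: P(|Z| >= z_obt) = erfc(z_obt / sqrt(2))
p_two_tailed = math.erfc(z_obt / math.sqrt(2))           # about 0.046
```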

What if we were rolling a die, instead of tossing a coin?

- A gambler rolls a die 60 times, and gets the results shown below.
- Is the gambler using a fair die?

    Number on die    Number of rolls
    1                4
    2                6
    3                17
    4                16
    5                8
    6                9

We cannot solve this problem in a good way using the parametric procedures we've seen thus far

- Why not use the same technique as for the coin flip example?
- In the coin flip example, the process was a binomial process
- However, a binomial process can have only two choices of outcomes on each trial: success (heads) and failure (tails)
- Here we have 6 possible outcomes per trial

The multinomial distribution

- Can solve problems like the fair die exactly using the multinomial distribution
  - This is an extension of the binomial distribution
- However, this can be a bear to calculate, particularly without a computer
- So statisticians developed the χ² test as an alternative, approximate method

Determining if the die is loaded

- Null hypothesis: the die is fair
- Alternative hypothesis: the die is loaded

- As in the coin flip example, we'd like to determine whether the die is loaded by comparing the observed frequency of each outcome in the sample to the expected frequency if the die were fair

In 60 throws, we expect 10 of each outcome

    Number on die    Observed frequency    Expected frequency
    1                4                     10
    2                6                     10
    3                17                    10
    4                16                    10
    5                8                     10
    6                9                     10
    Sum:             60                    60

- Some of these lines may look suspicious (e.g. freqs 3 & 4 seem high)
- However, with 6 lines in the table, it's likely at least one of them will look suspicious, even if the die is fair
- You're essentially doing multiple statistical tests by eye

Basic idea of the χ² test for goodness of fit

- For each line of the table, there is a difference between the observed and expected frequencies
- We will combine these differences into one overall measure of the distance between the observed and expected values
- The bigger this combined measure, the more likely it is that the model which gave us the expected frequencies is not a good fit to our data
  - The model corresponds to our null hypothesis (that the die is fair). If the model is not a good fit, we reject the null hypothesis

The χ² statistic

    χ² = Σ (i = 1 to rows) (observed frequency_i - expected frequency_i)² / expected frequency_i

The χ² statistic

- When the observed frequency is far from the expected frequency (regardless of whether it is too small or too large) the corresponding term is large.
- When the two are close, the term is small.
- Clearly the χ² statistic will be larger the more terms you add up (the more rows in your table), and we'll have to take into account the number of rows somehow.

Back to our example

    Number on die    Observed frequency    Expected frequency
    1                4                     10
    2                6                     10
    3                17                    10
    4                16                    10
    5                8                     10
    6                9                     10
    Sum:             60                    60

- χ² = (4-10)²/10 + (6-10)²/10 + (17-10)²/10 + (16-10)²/10 + (8-10)²/10 + (9-10)²/10
     = 142/10 = 14.2
- What's the probability of getting χ² ≥ 14.2 if the die is actually fair?

Probability that χ² ≥ 14.2 if the die is actually fair

- With some assumptions about the distribution of observed frequencies about expected frequencies (not too hard to come by, for the die problem), one can solve this problem exactly.
- When Pearson invented the χ² test, however (1900), computers didn't exist, so he developed a method for approximating this probability using a new distribution, called the χ² distribution.
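The χ² arithmetic above can be reproduced directly; a minimal sketch:

```python
# chi-square statistic for the die example: observed counts from the table,
# expected 10 of each face in 60 fair rolls
observed = [4, 6, 17, 16, 8, 9]
expected = [10] * 6

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))  # 142/10 = 14.2
```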

The χ² distribution

- The statistic χ² is approximately distributed according to a χ² distribution
- As with the t-distribution, there is a different χ² distribution for each number of degrees of freedom
- In this case, however, the χ² distribution changes quite radically as you change d.f.

The χ² distribution

- Random χ² factoids:
  - Mean = d.f.
  - Mode = d.f. - 2
  - Median ≈ d.f. - 0.7
  - Skewed distribution; skew decreases as d.f. increases

[Figure: χ² density curves for d.f. = 4 and d.f. = 10, plotted from 0 to 24. Figure by MIT OCW.]

The χ² test

- Use the tables at the back of your book, for the appropriate d.f. and α. The area α runs along the top of the table, d.f. along the side, and the χ² values are the entries in the table.

    Degrees of
    freedom    99%       95%      5%       1%
    1          .00016    .0039    3.84     6.64
    2          0.020     0.10     5.99     9.21
    3          0.12      0.35     7.82     11.34
    4          0.30      0.71     9.49     13.28
    5          0.55      1.14     11.07    15.09

What degrees of freedom do you use?

- Let k = # categories = # expected frequencies in the table = # terms in the equation for χ²
  - For the die example, k = 6, for the 6 possible sides of the die
- Then, d.f. = k - 1 - d
  - Where d is the number of population parameters you had to estimate from the sample in order to compute the expected frequencies

Huh?

- This business about d will hopefully become more clear later, as we do more examples.
- For now, how did we compute the expected frequencies for the die?
  - With 60 throws, for a fair die we expect 10 throws for each possible outcome
- This didn't involve any estimation of population parameters from the sample
  - d = 0

Degrees of freedom for the die example

- So, d.f. = k - 1 = 6 - 1 = 5
- Why the -1?
  - We start off with 6 degrees of freedom (the 6 expected frequencies, ei, i=1:6), but we know the total number of rolls (Σei = 60), which removes one degree of freedom
  - I.e. if we know ei, i=1:5, these 5 and the total number of rolls determine e6, so there are only 5 degrees of freedom

Back to the example

- We found χ² = 14.2
- What is p(χ² ≥ 14.2)?
- Looking it up in the χ² tables, for 5 d.f.:

    Degrees of
    freedom    99%     95%     5%       1%
    5          0.55    1.14    11.07    15.09

- 14.2 > 11.07, but < 15.09
- p is probably a little bit bigger than 0.01 (it's actually about 0.014, if we had a better table)

So, is the die fair?

- We're fairly certain (p < 0.05) that the die is loaded.
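With a computer, the "better table" value of about 0.014 can be computed exactly. The χ² survival function follows from the regularized upper incomplete gamma recurrence Q(a+1, z) = Q(a, z) + z^a·e^(-z)/Γ(a+1); the helper name chi2_sf below is ours, a sketch rather than anything from the slides:

```python
import math

def chi2_sf(x, df):
    """P(chi-square with df degrees of freedom >= x), computed exactly.

    Starts from Q(1/2, z) = erfc(sqrt(z)) for odd df, or Q(1, z) = exp(-z)
    for even df, then applies Q(a+1, z) = Q(a, z) + z^a e^{-z} / Gamma(a+1).
    """
    z = x / 2.0
    if df % 2:                       # odd d.f.: a = 1/2, 3/2, ...
        q, a = math.erfc(math.sqrt(z)), 0.5
    else:                            # even d.f.: a = 1, 2, ...
        q, a = math.exp(-z), 1.0
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1)
        a += 1.0
    return q

p = chi2_sf(14.2, 5)                 # about 0.014, as the slide notes
```

As a sanity check, the function reproduces the table: chi2_sf(3.84, 1) and chi2_sf(5.99, 2) both come out near 0.05.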

Rule of thumb

- Once again, this χ² distribution is only an approximation to the distribution of χ² values we'd expect to see
- As a rule of thumb, the approximation can be trusted when the expected frequency in each line of the table, ei, is 5 or more

When would this not be a good approximation?

- Example: suppose your null hypothesis for the die game was instead that you had the following probabilities of each outcome:
  - (1: 0.01), (2: 0.01), (3: 0.01), (4: 0.95), (5: 0.01), (6: 0.01)
- In 100 throws of the die, your expected frequencies would be (1, 1, 1, 95, 1, 1)
- These 1's are too small for the approximation to be good.
- For this model, you'd need at least 500 throws to be confident in using the χ² tables for the χ² test

Other assumptions for the χ² test

- Each observation belongs to one and only one category
- Independence: each observed frequency is independent of all the others
  - One thing this means: χ² is not for repeated-measures experiments in which, e.g., each subject gets rated on a task both when they do it right-handed, and when they do it left-handed.

Summary of steps

- Collect the N observations (the data)
- Compute the k observed frequencies, oi
- Select the null hypothesis
- Based upon the null hypothesis and N, predict the expected frequencies for each outcome, ei, i=1:k
- Compute the χ² statistic: χ² = Σ (i = 1 to k) (oi - ei)²/ei
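These steps can be sketched end-to-end for the die example. The raw rolls below are hypothetical, constructed only so that the counts match the earlier table:

```python
from collections import Counter

# Full recipe: raw data -> observed frequencies -> chi-square statistic.
# Hypothetical raw rolls, built to reproduce the slide's counts (N = 60).
rolls = [1] * 4 + [2] * 6 + [3] * 17 + [4] * 16 + [5] * 8 + [6] * 9

k = 6
counts = Counter(rolls)
o = [counts[face] for face in range(1, k + 1)]   # observed frequencies
e = [len(rolls) / k] * k                         # fair-die null: N/k per face
chi2 = sum((oi - ei) ** 2 / ei for oi, ei in zip(o, e))
# Compare with chi2_crit = 11.07 (k - 1 - 0 = 5 d.f., alpha = 0.05) and reject H0
```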

Summary of steps, continued

- For a given α, find χ²crit for k - 1 - d degrees of freedom
  - Note that the number of degrees of freedom depends upon the model, not upon the data
- If χ² > χ²crit, then reject the null hypothesis -- decide that the model is not a good fit.
- Else maintain the null hypothesis as valid.

An intuition worth checking

- How does N affect whether we accept or reject the null hypothesis?
- Recall from the fair coin example that if we tossed the coin 10 times and 60% of the flips were heads, we were not so concerned about the fairness of the coin.
- But if we tossed the coin 1000 times, and still 60% of the tosses were heads, then, based on the law of averages, we were concerned about the fairness of the coin.
- Surely, all else being equal, we should be more likely to reject the null hypothesis if N is larger.
- Is this true for the χ² test? On the face of it, it's not obvious -- d.f. is a function of k, not N.

Checking the intuition about the effect of N

- Suppose we double N, and keep the same relative observed frequencies
  - i.e. if we observed 10 instances of outcome i with N observations, in 2N observations we observe 20 instances of outcome i.
- What happens to χ²?

    (χ²)new = Σ (i = 1 to k) (2oi - 2ei)²/(2ei) = Σ (i = 1 to k) 4(oi - ei)²/(2ei) = 2χ²

Effect of N on χ²

- All else being equal, increasing N increases χ²
- So, for a given set of relative observed frequencies, for larger N we are more likely to reject the null hypothesis
- This matches with our intuitions about the effect of N
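The doubling argument can be checked numerically on the die data (the helper name chi2_stat is ours):

```python
# Doubling every count doubles chi-square, as the algebra shows:
# sum (2o - 2e)^2 / (2e) = 2 * sum (o - e)^2 / e
def chi2_stat(obs, exp):
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

obs, exp = [4, 6, 17, 16, 8, 9], [10] * 6
chi2_once = chi2_stat(obs, exp)
chi2_doubled = chi2_stat([2 * o for o in obs], [2 * e for e in exp])
# chi2_doubled == 2 * chi2_once, while the critical value (d.f. = k - 1) is unchanged
```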


What happens if we do the coin flip example the χ² way instead of the z-test way?

- 100 coin flips, see 60 heads. Is the coin fair?

    Outcome    Observed frequency    Expected frequency
    Heads      60                    50
    Tails      40                    50
    Sum:       100                   100

- χ² = (60-50)²/50 + (40-50)²/50 = 4
- Looking it up in the χ² table under d.f. = 1, it looks like p is just slightly < 0.05
- We conclude that it's likely that the coin is not fair.

Look at this a bit more closely

- Our obtained value of the χ² statistic was 4
- zobt was 2
- (zobt)² = χ²
- Also, zcrit² = χ²crit for d.f. = 1
- So, we get the same answer using the z-test as we do using the χ² test
- This is always true when we apply the χ² test to the binomial case (d.f. = 1)

Another use of the χ² test

- Suppose you take N samples from a distribution, and you want to know if the samples seem to come from a particular distribution, e.g. a normal distribution
  - E.g. heights of N students
- We can use the χ² test to check whether this data is well fit by a normal distribution

A χ² test for goodness of fit to a normal distribution

- First, we must bin the data into a finite number of categories, so we can apply the χ² test:

    Height (inches)    Observed frequency    Expected frequency
    h < 62.5           5
    62.5 ≤ h < 65.5    18
    65.5 ≤ h < 68.5    42
    68.5 ≤ h < 71.5    27
    71.5 ≤ h           8
    Sum:               100


A χ² test for goodness of fit to a normal distribution

- Next, we find the candidate parameters of the normal distribution, from the mean height and its standard deviation
- It turns out that for this data, mh = 67.45, and s = 2.92

A χ² test for goodness of fit to a normal distribution

- Given that mean and standard deviation, find the z-scores for the category boundaries
- We will need to know what fraction of the normal curve falls within each category, so we can predict the expected frequencies

    Boundary    62.5     65.5     68.5    71.5
    z           -1.70    -0.67    0.36    1.39

Get the area under the normal curve from -∞ to z, for each boundary value of z

- See http://davidmlane.com/hyperstat/z_table.html for a convenient way to get these values

    Boundary        62.5      65.5      68.5      71.5
    z               -1.70     -0.67     0.36      1.39
    Area below z    0.0446    0.2514    0.6406    0.9177

What fraction of the area under the normal curve is in each category?

    Height (inches)    Observed frequency    Fraction of the area
    h < 62.5           5                     0.0446
    62.5 ≤ h < 65.5    18                    0.2068
    65.5 ≤ h < 68.5    42                    0.3892
    68.5 ≤ h < 71.5    27                    0.2771
    71.5 ≤ h           8                     0.0823
    Sum:               100                   1
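Instead of an online z-table, the "area below z" row can be computed with the standard normal CDF, Φ(z) = (1 + erf(z/√2))/2. A sketch (the helper name phi is ours):

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

z_bounds = [-1.70, -0.67, 0.36, 1.39]
areas_below = [phi(z) for z in z_bounds]        # ~0.0446, 0.2514, 0.6406, 0.9177

# Fraction of the curve in each of the 5 bins: differences of successive areas
fractions = [areas_below[0]]
fractions += [b - a for a, b in zip(areas_below, areas_below[1:])]
fractions.append(1.0 - areas_below[-1])         # ~0.0446, 0.2068, 0.3892, 0.2771, 0.0823
```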


What frequency do we expect for each category?

- Multiply the expected fraction by the total number of samples, 100
- If you do any rounding, make sure the sum = N = 100!

    Height (inches)    Observed frequency    Fraction of the area    Expected frequency
    h < 62.5           5                     0.0446                  4
    62.5 ≤ h < 65.5    18                    0.2068                  21
    65.5 ≤ h < 68.5    42                    0.3892                  39
    68.5 ≤ h < 71.5    27                    0.2771                  28
    71.5 ≤ h           8                     0.0823                  8
    Sum:               100                   1                       100

- (Don't worry too much about one bin out of 5 having slightly fewer than 5 counts)

Now (finally) we're ready to do the χ² test

- Compute
    χ² = (5-4)²/4 + (18-21)²/21 + (42-39)²/39 + (27-28)²/28 + (8-8)²/8
       = 1/4 + 9/21 + 9/39 + 1/28 = 0.9451
- Degrees of freedom = k - 1 - d = 2
  - k = 5 (= # categories = # terms in the χ² equation)
  - d = 2 (= number of population parameters estimated from the data in computing the expected frequencies. Estimated μ and σ.)
- Find the critical value for α = 0.05: χ²crit = 5.99
- The normal distribution seems to be a very good fit to the data (we maintain the null hypothesis as viable)
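The expected frequencies and χ² for this example can be checked in a few lines, using the slide's rounded expected counts:

```python
# Normal-fit example: rounded expected counts (100 * fraction), which still sum to N = 100
observed = [5, 18, 42, 27, 8]
expected = [4, 21, 39, 28, 8]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# chi2 is about 0.9451, well below the alpha = 0.05 critical value of 5.99 for 2 d.f.
```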

Another example

- How to combine results across independent experiments
- Results in the lower tail of the χ² distribution (as opposed to the upper tail, which is what we've looked at so far)

The lower tail of the χ² distribution

- In the previous examples, we looked in the upper tail of the χ² distribution, to judge whether the disagreement between observed and expected frequencies was great enough that it was unlikely to have occurred by chance, if the null hypothesis model was correct
- We can also look in the lower tail, and judge whether the agreement between data and model is "too good to be true"

Gregor Mendel and genetics

- In 1865, Mendel published an article providing a scientific explanation for heredity.
- This eventually led to a revolution in biology.

Gregor Mendel and genetics

- Mendel ran a series of experiments on garden peas
- Pea seeds are either yellow or green. Seed color is an indicator, even before you plant it, of the genes of the child plant the seed would grow into.
- Mendel bred a strain that had only yellow seeds, and another that had only green seeds.
- Then he crossed the two, to get a 1st generation yellow-green hybrid: all seeds yellow
- Then he crossed pairs of 1st generation hybrids, to get 2nd generation hybrids: approximately 75% of seeds were yellow, 25% were green

Genetic model

- Mendel postulated a theory in which y (yellow) is dominant, and g (green) is recessive
  - y/y, y/g, and g/y make yellow
  - g/g makes green
- One gene is chosen at random from each parent
- First generation parents are either g/y or y/g. So how many green seeds and how many yellow seeds do we expect from them?

Genetic model

- The model which describes the number of green and yellow seeds should (Mendel postulated) be one in which you randomly choose, with replacement, a yellow or green ball from a box with 3 yellow balls and one green ball
- In the long run, you'd expect the seeds to be roughly 75% yellow, 25% green


One of Mendel's experiments

- In fact, this is very close to the results that Mendel got.
- For example, in one experiment, he obtained 8023 2nd-generation hybrid seeds
- He expected 8023/4 ≈ 2006 green seeds, and got 2001
- Very close to what's predicted by the model!

Mendel, genetics, and goodness-of-fit

- Mendel made a great discovery in genetics, and his theory has survived the rigor of repeated testing and proven to be extremely powerful
- However, R.A. Fisher complained that Mendel's fits were too good to have happened by chance.
- He thought (being generous) that Mendel had been deceived by a gardening assistant who knew what answers Mendel was expecting

Pooling the results of multiple experiments

- The thing is, the experimental results just mentioned have extremely good agreement with theoretical expectations
- This much agreement is unlikely, but nonetheless could happen by chance
- But it happened for Mendel on every one of his experiments but one!
- Fisher used the χ² test to pool the results of Mendel's experiments, and test the likelihood of getting so much agreement with the model

Pooling the results of multiple experiments

- When you have multiple, independent experiments, the results can be pooled by adding up the separate χ² statistics to get a pooled χ².
- The degrees of freedom also add up, to get a pooled d.f.


Pooling the results of multiple experiments: mini-example

- Exp. 1 gives χ² = 5.8, d.f. = 5
- Exp. 2 gives χ² = 3.1, d.f. = 2
- The two experiments pooled together give χ² = 5.8 + 3.1 = 8.9, with 5 + 2 = 7 d.f.

Fisher's results

- For Mendel's data, Fisher got a pooled χ² value under 42, with 84 degrees of freedom
- If one looks up the area to the left of 42 under the χ² curve with 84 degrees of freedom, it is only about 4/100,000
- The probability of getting such a small value of χ², i.e. such good agreement with the data, is only 0.00004
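Pooling is just addition, and Fisher's lower-tail probability can be computed exactly: for even d.f. = 2m, the χ² upper tail has the closed form P(χ² > x) = e^(-x/2) · Σ_{j<m} (x/2)^j / j!, so the lower tail is one minus that sum. The function name below is ours, a sketch rather than Fisher's actual calculation:

```python
import math

def chi2_cdf_even_df(x, df):
    """P(chi-square with even df <= x), using the closed-form upper tail."""
    assert df % 2 == 0
    z, term, s = x / 2.0, 1.0, 1.0
    for j in range(1, df // 2):       # accumulate sum of z^j / j! for j = 0..m-1
        term *= z / j
        s += term
    return 1.0 - math.exp(-z) * s

pooled_chi2 = 5.8 + 3.1               # mini-example: 8.9 ...
pooled_df = 5 + 2                     # ... on 7 d.f.

p_lower = chi2_cdf_even_df(42.0, 84)  # about 4e-5: agreement too good to be true
```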

Fisher's results

- Fisher took Mendel's genetic model for granted -- that wasn't what he was testing
- Based on his results, Fisher rejected his null hypothesis that Mendel's data were gathered honestly, in favor of his alternative hypothesis that Mendel's data were fudged

A lower tail example -- testing whether our data fits a normal distribution

- Recall our earlier example in which we wanted to know if our height data fit a normal distribution
- We got a χ² value of 0.9451, with 2 degrees of freedom, and concluded from looking at the area in the upper tail of the χ² distribution (χ² > 0.9451) that we could not reject the null hypothesis
  - The data seemed to be well fit by a normal distribution
- Could the fit have been too good to be true?


A lower tail example -- testing whether our data fits a normal distribution

- To test whether the fit was "too good," we look at the area in the lower tail of the χ² distribution
- With α = 0.05, we get a critical value of χ²0.05 = 0.10
- 0.9451 is considerably larger than this critical value
- We maintain the null hypothesis, and reject the alternative hypothesis that the fit is too good to be true. The data seem just fine.

One-way vs. two-way χ²

- Our examples so far have been one-way χ² tests.
- Naming of tests is like for factorial designs
  - One-way means one factor
    - E.g. factor = number on die
    - E.g. factor = color of peas
    - E.g. factor = into which category did our sample fall?
- Two-way χ²: two factors
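The lower-tail critical value of 0.10 quoted above can be solved in closed form, since for 2 d.f. the χ² CDF is F(x) = 1 - e^(-x/2). A small sketch (variable names are ours):

```python
import math

# Lower-tail critical value for 2 d.f. at alpha = 0.05: solve 1 - exp(-x/2) = 0.05
crit_lower = -2.0 * math.log(1.0 - 0.05)   # about 0.103, the table's 0.10
fit_too_good = 0.9451 < crit_lower         # False: we keep the null hypothesis
```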

Two-way χ²

- Often used to test independence
  - E.g.: are handedness (right-handed vs. left-handed vs. ambidextrous) and gender (male vs. female) independent?
  - In other words, is the distribution of handedness the same for men as for women?
- Test of independence = test of whether two distributions are equal

Handedness vs. gender

- Suppose you have a sample of men and women in the population, and you ask them whether they are left-handed, right-handed, or ambidextrous. You get data something like this (an m × n = 3 × 2 table):

                    Men     Women
    Right-handed    934     1070
    Left-handed     113     92
    Ambidextrous    20      8
    Total:          1067    1170


Is the distribution of handedness the same for men as for women?

- It's hard to judge from the previous table, because there are more men than women.
- Convert to percentages, i.e. what percent of men (women) are right-handed, etc.?

                    Men      Women
    Right-handed    87.5%    91.5%
    Left-handed     10.6%    7.9%
    Ambidextrous    1.9%     0.7%
    Sum:            100%     ~100%

- (The numbers in the table don't quite add up right due to rounding -- don't worry about it, we're just eyeballing the percentages)

Is the distribution of handedness the same for men as for women?

- From this table, it seems that the distribution in the sample is not the same for men as for women.
  - Women are more likely to be right-handed, less likely to be left-handed or ambidextrous
- Do women have more developed left brains (are they more rational?)
- Do women feel more social pressure to conform and be right-handed?
- Or is the difference just due to chance? Even if the distributions are the same in the population, they might appear different just due to chance in the sample.

Using the χ² test to test if the observed difference between the distributions is real or due to chance

- Basic idea:
  - Construct a table of the expected frequencies for each combination of handedness and gender, based on the null hypothesis of independence (Q: how?)
  - Compute χ² as before:

      χ² = Σ (i = 1 to m) Σ (j = 1 to n) (oij - eij)² / eij

  - Compare the computed χ² with the critical value from the χ² table (Q: how many degrees of freedom?)

Computing the table of expected frequencies

- First, take row and column totals in our original table:

                    Men     Women    Total
    Right-handed    934     1070     2004
    Left-handed     113     92       205
    Ambidextrous    20      8        28
    Total:          1067    1170     2237


Computing the table of expected frequencies

- From this table, the percentage of right-handers in the sample is 2004/2237 ≈ 89.6%
- If handedness and sex are independent, the number of right-handed men in the sample should be 89.6% of the number of men (1067)
  - (0.896)(1067) ≈ 956
- Similarly, the number of left-handed women should be (205/2237)(1170) ≈ 107, and so on for the other table entries.

The full table, and χ² computation

                    Observed          Expected
                    Men     Women     Men     Women
    Right-handed    934     1070      956     1048
    Left-handed     113     92        98      107
    Ambidextrous    20      8         13      15

- χ² = (934-956)²/956 + (1070-1048)²/1048
     + (113-98)²/98 + (92-107)²/107
     + (20-13)²/13 + (8-15)²/15
     ≈ 12
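The expected table and χ² can be generated from the observed counts alone; this sketch keeps the expected values unrounded, so the total comes out near 11.8 rather than the rounded table's 12:

```python
# Expected counts under independence: e_ij = (row total)(column total)/(grand total)
observed = [[934, 1070],    # right-handed
            [113, 92],      # left-handed
            [20, 8]]        # ambidextrous

row_totals = [sum(row) for row in observed]          # 2004, 205, 28
col_totals = [sum(col) for col in zip(*observed)]    # 1067, 1170
grand = sum(row_totals)                              # 2237

expected = [[r * c / grand for c in col_totals] for r in row_totals]
chi2 = sum((o - e) ** 2 / e
           for orow, erow in zip(observed, expected)
           for o, e in zip(orow, erow))              # about 11.8, unrounded
```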

Degrees of freedom

- When testing independence in an m × n table, the degrees of freedom is (m-1)(n-1)
  - There's again a -d term if you had to estimate d population parameters in the process of generating the table.
    - Estimating the % of right-handers in the population doesn't count
  - There won't be a -d term in any of our two-way examples or homework problems
- So, d.f. = (3-1)(2-1) = 2

So, are the two distributions different?

- χ² = 12, d.f. = 2
- Take α = 0.01
- χ²crit = 9.21 -> p < 0.01
- We reject the null hypothesis that handedness is independent of gender (alt: that the distribution of handedness is the same for men as for women), and conclude that there seems to be a real difference in handedness for men vs. women.


Summary

- One-way χ² test for goodness-of-fit
  - d.f. = k - 1 - d for k categories (k rows in the one-way table)
- Two-way χ² test for whether two variables are independent (alt: whether two distributions are the same)
  - d.f. = (m-1)(n-1) - d for an m × n table
- Can test both the upper tail (is the data poorly fit by the expected frequencies?) and the lower tail (is the fit too good to be true?)
- χ² = sum[(observed - expected)² / expected]

