
Assumptions for Statistical Inference

Floyd Bullard
The NC School of Science & Mathematics

Outline: Overview; Random samples; Limiting distributions of statistics; Modeling assumptions: linear regression; When assumptions aren't met; Conclusion

26-27 January 2007

Assumptions in the sciences

Some assumptions we might make when solving problems in the other sciences:
- Physics: There is no air resistance.
- Ecology: Foxes and rabbits are the only animals.
- Epidemiology: People only die of disease or old age.
- Oceanography: Seawater has the same composition everywhere.
- Archaeology: At a given site, older objects are deeper in the ground than younger objects.
- etc.

Assumptions in (AP) Statistics

In AP Statistics, nearly all assumptions are of three types:
- The sample is representative of the population.
- The sample is large enough that the distribution of some statistic is approximately equal to its limiting distribution.
- Modeling assumptions. (In AP Statistics, these arise in the regression context.)

Random samples

When we extrapolate information from a sample to a population, we are naturally assuming that the sample is representative of the population in some way. In particular, let's suppose that X is some random variable whose distribution over the population is f(x). We will be observing data whose distribution is not f(x), but rather g(x|S), the conditional distribution of X given membership in the sample.

Is it fair to observe g(x|S) and treat it as if it were f(x)? Under what conditions are the conditional and unconditional distributions of X the same?

The distributions f(x) and g(x|S) will be the same if and only if X and S are independent, that is, if the value of the random variable and the element's membership in the sample are completely unrelated to one another. Can we guarantee that?

Of course we can. If membership in the sample is completely random, then it is independent of anything we can think of. That's why we like random samples so much. They allow us to treat the Xs in our sample as if they had the same distribution as those in the population. Random sampling permits inference.
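A small simulation can make the independence condition concrete. The sketch below is not from the talk: it builds a hypothetical population in which X is a yes/no attribute, then compares a sample whose membership is unrelated to X with one whose membership depends on X. The population size, the 30% rate, the 3-to-1 acceptance odds, and the function names are all assumptions chosen for illustration.

```python
import random

def draw_sample(population, k, accept_prob, rng):
    """Visit members in random order and accept each with probability
    accept_prob(x) until k members have been accepted."""
    order = list(population)
    rng.shuffle(order)
    sample = []
    for x in order:
        if rng.random() < accept_prob(x):
            sample.append(x)
            if len(sample) == k:
                break
    return sample

def proportion(values):
    """Share of members with X = 1."""
    return sum(values) / len(values)

# Hypothetical population: X = 1 for 30% of 1,000 members.
population = [1] * 300 + [0] * 700
rng = random.Random(2007)

# Membership independent of X: every member is accepted with the same probability.
random_sample = draw_sample(population, 100, lambda x: 1.0, rng)
# Membership related to X: members with X = 1 are three times as likely to be accepted.
biased_sample = draw_sample(population, 100, lambda x: 1.0 if x == 1 else 1 / 3, rng)

print("population proportion:", proportion(population))
print("random sample:", proportion(random_sample))
print("biased sample:", proportion(biased_sample))
```

Under these assumed acceptance odds, the second sample should land well above the population proportion on average, because its g(x|S) is no longer f(x).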

A Problem

But there's a problem. Random samples are hard to come by. So we often assume for the sake of inference that our sample is random even though we know for a fact it isn't. Is that okay? What will happen if the assumption is really quite wrong?

Alice's project

A student named Alice wants to estimate the proportion of students in her school who can name her state's two U.S. Senators. She plans to sample 100 students and ask them to name the two senators. She'll use the sample proportion she gets to construct a confidence interval estimate of the population proportion. How should she get her sample?

Alice's project (continued)

Here are some ways she might sample 100 students:
- Include all the students in her classes until she gets 100.
- Include her friends and her friends' friends.
- Send out an all-school email and include the first 100 students who reply.
- Stand outside the school in the morning and include every fifth student until she has 100.

Robert's project

Robert's school is considering starting school a half hour later in the morning and ending a half hour later in the afternoon. Robert wants to estimate the proportion of students in the school who would be in favor of this. Would Alice's sampling method work for him?

A popular class project

You plan to guide your students through a class project in which they will estimate the quality of five brands of paper towels. (The students will determine how to define quality.) You buy one roll of each of five brands of paper towels and bring them to class. The students take six towels of each brand and measure each one's quality. Parallel boxplots of the brands' quality scores give an idea of which brands are better than others.

What assumption are you and your students making? Is it justified?

Capture/recapture

Forty squirrels are captured in a park and tagged. A month later, fifty squirrels in the park are captured, and ten are found to be tagged. That's 20% of the second sample, so we might assume that $\hat{N} = 5 \times 40 = 200$ is a good estimate of the number of squirrels in the park. What assumptions are being made here? Are they reasonable? What will the effect be on the population size estimator $\hat{N}$ if they are not reasonable?
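The arithmetic in the squirrel example is the estimator usually called the Lincoln-Petersen estimator: the number tagged in the first capture, times the size of the second capture, divided by the number of tagged animals recaptured. A minimal sketch follows; the function and parameter names are made up for illustration, and the second call uses a hypothetical recapture count to show what happens if tagged squirrels become harder to catch.

```python
def lincoln_petersen(tagged_first, caught_second, tagged_in_second):
    """Estimate population size from one capture/recapture experiment."""
    if tagged_in_second == 0:
        raise ValueError("no tagged animals were recaptured; the estimate is undefined")
    return tagged_first * caught_second / tagged_in_second

# The numbers from the example: 40 tagged, 50 caught later, 10 of them tagged.
print(lincoln_petersen(40, 50, 10))  # 200.0

# Hypothetical: tagged squirrels avoid traps, so only 5 are recaptured.
print(lincoln_petersen(40, 50, 5))   # 400.0
```

If tagged animals are less likely to be recaptured than untagged ones, the estimate is pushed upward; if they are more likely to be recaptured, it is pushed downward.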

The upshot: In practice, we often do not have the luxury of true random samples. We may make the assumption that a sample is a simple random sample (SRS) so that we may extrapolate its properties to the population. Whether this is reasonable or not depends on whether we believe that sample membership and the properties of interest are more or less independent of one another. Reasonable people may disagree about whether the assumption is justified.

Limiting distributions of statistics

What do all of the following statements have in common?

$\hat{p} \sim N$
$\hat{p}_1 - \hat{p}_2 \sim N$
$\bar{X} \sim N$
$\dfrac{\bar{X} - \mu}{s/\sqrt{n}} \sim t(n-1)$
$\dfrac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{s_1^2/n_1 + s_2^2/n_2}} \sim t(n^*)$
$\sum \dfrac{(O_i - E_i)^2}{E_i} \sim \chi^2(\mathrm{df})$

They're all limiting distributions.

We rely on sample sizes being large enough to justify using a limiting distribution. How do we know what's large enough?

Proportions

For proportions, we often require that $np$ and $n(1-p)$ both be at least 10 (or sometimes 5). At least one text requires the single condition that $np(1-p) > 5$. Where did these come from?

Let's require that the mean of $\hat{p}$ (which is p) be at least three standard deviations (one standard deviation is $\sqrt{p(1-p)/n}$) above 0.

$p > 3\sqrt{p(1-p)/n}$
$p^2 > 9p(1-p)/n$
$np^2 > 9p(1-p)$
$np > 9(1-p)$

Note that this is guaranteed by $np > 10$. (Do you see why?) And $np > 5$ would guarantee that $p > 2\sqrt{p(1-p)/n}$.
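One way to see both claims, written out as a short derivation (these steps are filled in here and do not appear on the original slide): because $0 < p < 1$, the factor $1 - p$ is less than 1, so $9(1-p) < 9$ and $4(1-p) < 4$.

```latex
\begin{align*}
np > 10 \;&\Rightarrow\; np > 9 > 9(1-p), \\
np > 5 \;&\Rightarrow\; np > 4 > 4(1-p)
  \;\Rightarrow\; p^2 > \frac{4p(1-p)}{n}
  \;\Rightarrow\; p > 2\sqrt{\frac{p(1-p)}{n}}.
\end{align*}
```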

The requirement $n(1-p) > 10$ will similarly ensure that p is at least three standard deviations below 1. If we are comparing two proportions, then both must obey this rule of thumb.
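The rule of thumb and the distance it is meant to guarantee can both be computed directly. The following is a sketch under stated assumptions: the function names and the `threshold` parameter are hypothetical, and the (n, p) pairs are arbitrary illustrations rather than values from the talk.

```python
import math

def sds_from_boundaries(n, p):
    """Return how many standard deviations of p-hat separate p from 0 and from 1."""
    sd = math.sqrt(p * (1 - p) / n)
    return p / sd, (1 - p) / sd

def large_enough(n, p, threshold=10):
    """Rule of thumb: np and n(1 - p) must both be at least `threshold`."""
    return n * p >= threshold and n * (1 - p) >= threshold

for n, p in [(100, 0.5), (100, 0.2), (50, 0.1)]:
    from_zero, from_one = sds_from_boundaries(n, p)
    print(n, p, large_enough(n, p), round(from_zero, 2), round(from_one, 2))
```

Whenever `large_enough` returns True with the default threshold, both distances come out above 3, which is the point of the derivation.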

Means

$\bar{X}$ will have an approximately normal distribution (and hence $\dfrac{\bar{X} - \mu}{s/\sqrt{n}}$ will have an approximately $t(n-1)$ distribution) if the sample size n is large enough.

Here's a common rule of thumb.
- If $n \le 10$ and the data display no obvious outliers or skew, then continue with inference using the t distribution; but the inference still relies on the assumption that the population is approximately normal.
- If $10 < n \le 40$ and the data display at most one or two outliers and no severe skew, then continue with inference using the t distribution; the population need not be approximately normal.
- If $n > 40$, then except for extraordinarily severe skew (which would be indicated by numerous outliers), inference using the t distribution is okay. (But you might question whether inference on such a population's mean is what you really want to be doing.)
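The three cases of this rule of thumb can be written as a small decision function. This is only a sketch: judging skew and counting outliers still has to be done by eye from a plot, and the four skew labels used here are hypothetical names introduced for the example, not terms from the talk.

```python
def t_inference_supported(n, n_outliers, skew):
    """Return True when the rule of thumb supports using the t distribution.

    skew is the analyst's own reading of a plot of the data:
    "none", "some", "severe", or "extreme" (hypothetical labels for this sketch).
    """
    levels = ("none", "some", "severe", "extreme")
    if skew not in levels:
        raise ValueError(f"skew must be one of {levels}")
    if n <= 10:
        # Small samples: no obvious outliers or skew; a normal population is still assumed.
        return n_outliers == 0 and skew == "none"
    if n <= 40:
        # Moderate samples: at most one or two outliers and no severe skew.
        return n_outliers <= 2 and skew in ("none", "some")
    # Large samples: acceptable unless the skew is extraordinarily severe.
    return skew != "extreme"

print(t_inference_supported(8, 0, "none"))     # True
print(t_inference_supported(25, 3, "none"))    # False
print(t_inference_supported(60, 5, "severe"))  # True
```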

Modeling assumptions: linear regression

Our third type of assumption is the modeling assumption. We choose a mathematical model that we think will describe the underlying phenomenon that generated our data. If the model is very poor, then our inference will be meaningless.

The only example of this that students see in AP Statistics is the linear regression model, which is $y_i = \beta_0 + \beta_1 x_i + e_i$, where $e_i \overset{\text{iid}}{\sim} N(0, \sigma)$. In this model there are three parameters to be estimated: $\beta_0$, $\beta_1$, and $\sigma$.

Another way of stating the model is: $y_i \sim N(\beta_0 + \beta_1 x_i, \sigma)$. In other words, the means of the ys have a linear relationship with the xs, but there is variability in the actual y data about those means: normally distributed errors with constant variability across all values of x.

To check whether the model is reasonable, we:
- Look at the residuals from the linear regression to see whether there is a pattern.
- Verify that the residuals are of roughly constant magnitude for all xs.
- Check to see whether the residuals appear to be approximately normally distributed.
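The first two checks start from the residuals themselves, which are easy to compute by hand. The sketch below fits a least-squares line and compares the residual spread over the lower and upper halves of the x values. The data are simulated from the model with made-up parameters (intercept 2, slope 0.5, sigma 1), and all names are illustrative assumptions. In practice these checks are usually made by looking at a residual plot and a normal probability plot rather than at summary numbers.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def residuals(xs, ys):
    """Observed y minus fitted y for each data point."""
    b0, b1 = fit_line(xs, ys)
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

def root_mean_square(values):
    return (sum(v * v for v in values) / len(values)) ** 0.5

# Hypothetical data simulated from the linear model (xs are in increasing order).
rng = random.Random(0)
xs = [i / 2 for i in range(40)]
ys = [2 + 0.5 * x + rng.gauss(0, 1) for x in xs]

res = residuals(xs, ys)
half = len(res) // 2
print("estimates (b0, b1):", fit_line(xs, ys))
print("residual spread, lower half of x:", root_mean_square(res[:half]))
print("residual spread, upper half of x:", root_mean_square(res[half:]))
```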

Sample is not random

If a sample is assumed to be random when in fact there is an association between sample membership and a measured variable of interest, then the sampling procedure is biased. Conclusions will tend to systematically overestimate or underestimate the parameters of interest.

Proportions: small samples


At each lattice point in this graph, 10,000 random samples were simulated and a 95% confidence interval estimate of p constructed under the assumption that $\hat{p}$ has a normal distribution. Coverage rates are shown.

Figure: Confidence level accuracy when assuming normality. Horizontal axis: sample size (10 to 100); vertical axis: true population proportion. Reference curves mark $np = 5$, $np = 10$, $n(1-p) = 5$, $n(1-p) = 10$, and $np(1-p) > 5$.
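A coverage simulation of this kind can be sketched in a few lines. The version below is an illustration, not the code behind the figure: it assumes the interval is the usual $\hat{p} \pm 1.96\sqrt{\hat{p}(1-\hat{p})/n}$ (often called the Wald interval), uses 2,000 repetitions instead of 10,000 so that it runs quickly, and tries only a handful of (n, p) pairs chosen for the example.

```python
import math
import random

Z_95 = 1.96  # standard normal critical value for 95% confidence

def wald_interval(successes, n, z=Z_95):
    """Normal-approximation confidence interval for a proportion."""
    p_hat = successes / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def coverage_rate(n, p, reps=2000, rng=None):
    """Fraction of simulated intervals that contain the true p."""
    rng = rng or random.Random()
    hits = 0
    for _ in range(reps):
        successes = sum(rng.random() < p for _ in range(n))
        low, high = wald_interval(successes, n)
        if low <= p <= high:
            hits += 1
    return hits / reps

rng = random.Random(2007)
for n, p in [(10, 0.1), (50, 0.1), (100, 0.1), (100, 0.5)]:
    print(n, p, coverage_rate(n, p, rng=rng))
```

When np is small, many samples contain no successes at all, the interval collapses to a single point at 0, and the coverage rate falls far below the nominal 95%.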

Means: small samples


Figure: These four distributions were used to simulate random samples of different sizes. (Four panels, each showing a density f(x) plotted against x.)

10,000 random samples were simulated for each sample size from 2 to 50. Confidence intervals were computed assuming $\bar{X} \sim N$. Coverage rates are shown.

Figure: Four panels, one per distribution, each plotting estimated coverage probability (0.85 to 1) against sample size (0 to 50).
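The same kind of experiment can be sketched for means. The four distributions used in the talk are not recoverable from the text, so this example substitutes two populations of its own (a standard normal and an exponential with mean 1) and assumes the intervals are t intervals. Critical values come from a standard t table for a few sample sizes only, and 2,000 repetitions are used instead of 10,000.

```python
import math
import random

# Two-sided 95% critical values of t with n - 1 degrees of freedom (standard table values).
T_CRIT_95 = {5: 2.776, 10: 2.262, 30: 2.045, 50: 2.010}

def t_interval(sample, t_crit):
    """Interval mean +/- t_crit * s / sqrt(n)."""
    n = len(sample)
    mean = sum(sample) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    margin = t_crit * sd / math.sqrt(n)
    return mean - margin, mean + margin

def coverage_rate(n, draw, true_mean, reps=2000, rng=None):
    """Fraction of simulated t intervals that contain the true mean."""
    rng = rng or random.Random()
    hits = 0
    for _ in range(reps):
        sample = [draw(rng) for _ in range(n)]
        low, high = t_interval(sample, T_CRIT_95[n])
        if low <= true_mean <= high:
            hits += 1
    return hits / reps

# Stand-in populations (assumptions of this sketch): name -> (sampler, true mean).
populations = {
    "normal(0, 1)": (lambda r: r.gauss(0, 1), 0.0),
    "exponential(1)": (lambda r: r.expovariate(1.0), 1.0),
}

rng = random.Random(2007)
for name, (draw, mu) in populations.items():
    for n in sorted(T_CRIT_95):
        print(name, n, coverage_rate(n, draw, mu, rng=rng))
```

For the normal population the coverage should sit near 0.95 at every sample size; for the skewed population it starts lower and improves as n grows.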

Chi-square goodness-of-fit test: small samples

Here we tested the claimed distribution: (0.45, 0.25, 0.20, 0.05, 0.05). Rejection rates from 10,000 samples are shown.

Figure: Chi-square rejection rates when the null is true. Vertical axis: rejection rate when $\alpha = 0.05$ (0.03 to 0.11); horizontal axis: sample size (20 to 120).

Here we tested the claimed distribution: (0.45, 0.25, 0.20, 0.09, 0.01). Rejection rates from 10,000 samples are shown.

Figure: Chi-square rejection rates when the null is true. Vertical axis: rejection rate when $\alpha = 0.05$ (0.03 to 0.11); horizontal axis: sample size (20 to 120).
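Rejection rates like these can be estimated by sampling from the claimed distribution itself, so that the null hypothesis is true, and counting how often the test rejects. The sketch below does this for both claimed distributions. It is an illustration rather than the code behind the figures: it uses 2,000 repetitions, three sample sizes picked for the example, and the table value 9.488 as the upper 5% point of the chi-square distribution with 4 degrees of freedom.

```python
import random

CHI2_CRIT_DF4_05 = 9.488  # upper 5% point of chi-square with 4 degrees of freedom

def chi_square_stat(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def rejection_rate(probs, n, reps=2000, rng=None):
    """Fraction of samples drawn from `probs` for which the test rejects at the 5% level."""
    rng = rng or random.Random()
    categories = range(len(probs))
    expected = [n * p for p in probs]
    rejections = 0
    for _ in range(reps):
        observed = [0] * len(probs)
        for c in rng.choices(categories, weights=probs, k=n):
            observed[c] += 1
        if chi_square_stat(observed, expected) > CHI2_CRIT_DF4_05:
            rejections += 1
    return rejections / reps

rng = random.Random(2007)
for probs in [(0.45, 0.25, 0.20, 0.05, 0.05), (0.45, 0.25, 0.20, 0.09, 0.01)]:
    for n in (20, 60, 120):
        print(probs, n, rejection_rate(probs, n, rng=rng))
```

A rate far from 0.05 signals that the chi-square approximation is poor at that sample size, which is most likely when some expected counts are very small.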

Modeling assumptions: linear model is a bad model.


If data do not come from a linear model, but you impose one anyway, then estimates of $\beta_0$, $\beta_1$, and $\sigma$ won't mean much.

Figure: The orbiting planet of 51 Pegasi. Radial velocity plotted against time, with fitted values $\hat{\beta}_0 = 23.8$, $\hat{\beta}_1 = 33.4$, s = 37.7.

But note that nothing about your software will prevent you from estimating them anyway! Without checking the modeling assumptions, you won't know whether the model is any good or not!
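To illustrate, the sketch below fits a straight line to data that follow a pure sinusoid. The data are a hypothetical stand-in, not the 51 Pegasi measurements, and the amplitude and number of periods are arbitrary. The fit returns an intercept, a slope, and a residual standard deviation without complaint; only the residuals reveal the problem, here summarized by their lag-1 autocorrelation (a choice made for this example in place of looking at a residual plot).

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def lag1_autocorrelation(values):
    """Correlation between each value and the next; near 1 means a smooth pattern."""
    mean = sum(values) / len(values)
    num = sum((a - mean) * (b - mean) for a, b in zip(values, values[1:]))
    den = sum((v - mean) ** 2 for v in values)
    return num / den

# Hypothetical periodic data: two full periods of a sinusoid with amplitude 50.
xs = [4 * math.pi * i / 199 for i in range(200)]
ys = [50 * math.sin(x) for x in xs]

b0, b1 = fit_line(xs, ys)
res = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
s = math.sqrt(sum(e * e for e in res) / (len(xs) - 2))

print("b0 =", round(b0, 2), "b1 =", round(b1, 2), "s =", round(s, 2))
print("lag-1 autocorrelation of residuals:", round(lag1_autocorrelation(res), 3))
```

Residuals from a well-fitting linear model should show no such pattern, so an autocorrelation close to 1 is a strong hint that the model is wrong.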

A sample problem

Suppose the following is an AP exam question. 8 apples are randomly sampled from an orchard, and their weights in pounds are measured to be 0.44, 0.43, 0.33, 0.56, 0.50, 0.50, 0.45, 0.38. Estimate the mean weight of apples in the orchard, including a reasonable margin of error.

A sample problem (continued)

The rubric requires students to check inferential assumptions. Suppose one student writes the following:
- $np > 10$ and $n(1-p) > 10$
- $n < 40$, assume normality.

How do you think the rubric will score this response for the check of assumptions?

A sample problem (continued)


The following would make an AP reader happy.
- Random sample: yes, given in problem.
- Small data set (n = 8) requires population to be approximately normal. Reasonable?

Figure: Normal probability plot of the eight apple weights.

Normal probability plot looks roughly linear, so assumption is reasonable. We continue with the construction of a 95% t-confidence interval...
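For readers who want to follow the computation through, here is a sketch of the interval for the eight apple weights. It is not part of the original slides. It assumes the usual one-sample t interval, $\bar{x} \pm t^* s/\sqrt{n}$, and takes the critical value 2.365 for 7 degrees of freedom from a standard t table.

```python
import math

weights = [0.44, 0.43, 0.33, 0.56, 0.50, 0.50, 0.45, 0.38]  # pounds, from the problem
T_CRIT_95_DF7 = 2.365  # two-sided 95% critical value of t with 7 degrees of freedom

n = len(weights)
mean = sum(weights) / n
sd = math.sqrt(sum((w - mean) ** 2 for w in weights) / (n - 1))
margin = T_CRIT_95_DF7 * sd / math.sqrt(n)

print(f"mean = {mean:.3f} lb, margin of error = {margin:.3f} lb")
print(f"95% interval: ({mean - margin:.3f}, {mean + margin:.3f})")
```

With these data the estimate comes out to roughly 0.45 pounds with a margin of error of about 0.06 pounds.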

Conclusion

Three types of assumptions in AP Statistics:
- Sample is random. (Reasonableness cannot be checked with the data.)
- Sample is large enough to assume a limiting distribution for a statistic.
- Linear model is appropriate for bivariate data.

Checking assumptions is a big part of all statistics. This should be a part of every inference problem students do all year, not just something they study in an isolated unit.

Thank you for coming! bullard@ncssm.edu
