
SIMPLE RANDOM SAMPLING

INTRODUCTION

1. Objective: Take a sample from the population, measure some characteristic on each of the sampled units, and use this information to estimate (infer) the characteristic in the entire population.

2. Simple random sampling is the most basic procedure for drawing the sample.

3. Simple random sampling forms the basis for many of the more complicated sampling procedures.

4. Simple random sampling is easy to describe but is often very difficult to carry out in the field, where there is rarely a complete list of all the members of the population.

DEFINITION: A simple random sample is a sample of size n drawn from a population of size N in such a way that every possible sample of size n has the same chance of being selected.

5. Note that this definition requires that we know the population size N.

ASSUMPTIONS FOR SIMPLE RANDOM SAMPLING

Simple random sampling is one form of the general set of sampling procedures referred to as probability sampling. Probability sampling procedures must meet 4 criteria (Cochran, 1977:9):

1. We can define the set of distinct samples which the procedure is capable of selecting.

2. Each possible sample has assigned to it a known probability of selection.

3. We select one of the samples by a random process in which each sample receives its appropriate probability of being selected.

4. The method for computing the estimate must lead to a unique estimate for any specific sample.

For any sampling procedure of this type we can calculate the frequency distribution of the estimates it generates when repeatedly applied to the same population, and therefore determine the bias and variance of the estimator. In general we do not assume that the underlying population follows a normal distribution, but in order to calculate bounds and confidence intervals from single samples it may be useful to assume that the estimates follow a normal distribution. This assumption will be appropriate for large sample sizes but will be problematic for small samples drawn from highly skewed populations. Cochran (1977:42) suggested the "crude rule" for positively skewed distributions that n should be greater than 25(G1)², where G1 is Fisher's measure of skewness. Alternately, Tchebysheff's theorem states that at least 75% of the observations from any probability distribution will be within 2 standard deviations of their mean (Scheaffer et al., 1986:16). Taking all of this together, we can state the following assumptions for simple random sampling:

1. The units included in the sample must be selected in a truly random manner from the population (a random, independent sample).

2. The estimator must follow a normal distribution for the bound and confidence interval to give correct coverage (large sample or normal population).

3. Failing 2 above, the bound calculated as 2 times the standard error of the estimator will include the true value of the population parameter in at least 75% of cases.
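The Tchebysheff guarantee in point 3 can be checked empirically. A minimal sketch, using a deliberately skewed (exponential-like) population rather than any data from this document:

```python
import random

# Empirical check of Tchebysheff's theorem: for ANY distribution,
# at least 75% of observations fall within 2 standard deviations
# of the mean. We use a skewed exponential population as a stress test.
random.seed(1)
data = [random.expovariate(1.0) for _ in range(10_000)]

mean = sum(data) / len(data)
var = sum((y - mean) ** 2 for y in data) / len(data)
sd = var ** 0.5

within = sum(1 for y in data if abs(y - mean) <= 2 * sd)
fraction = within / len(data)
print(f"fraction within 2 SD: {fraction:.3f}")  # guaranteed >= 0.75
```

For this skewed but not pathological population the observed fraction is well above the 75% floor; the theorem's bound is conservative.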

DRAWING A SIMPLE RANDOM SAMPLE

1. Two commonly used "random" sampling procedures do not yield simple random samples:

Haphazard sampling: units are chosen arbitrarily, as convenient, with no formal random mechanism.

Representative sampling: units are deliberately chosen because they are judged "typical" of the population.

2. True random samples virtually always involve the use of random numbers, from a random number table or an algorithm on a computer.

3. Drawing a simple random sample is accomplished by making a complete list of all the elements in a population, assigning each a number, and then drawing a set of random numbers which identifies the n members of the population to be sampled. Any random number which repeats a previously drawn number is rejected, so that each element of the population is sampled at most once. This is termed sampling without replacement. The alternative, sampling with replacement, is also possible and is useful in some situations.

4. EXAMPLE: Suppose that we wanted to sample a stream to estimate the mean number of fish per pool. We could travel along the stream from its mouth to its headwaters, identifying the pools and assigning each pool a number. Then we could pick n random numbers from a random number table and sample the pools corresponding to those numbers.

5. An alternative way to use random numbers to select samples, if you have access to a computer, is the following:

A. Enter a list of the elements of your population into a spreadsheet, database, or statistical data set.
B. Assign each element a random number.
C. Sort the elements by the random numbers.
D. Print out the sorted list.
E. The first n elements on your list constitute a random sample of size n from your population.

An advantage of this approach is that if you later decide to increase your sample size you can simply add the next few observations on your sorted list, or you can decrease your sample size by dropping the last few. In reality we often have to make these kinds of adjustments to our sampling as a study proceeds.
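The sort-by-random-number procedure in step 5 can be sketched in a few lines. The pool labels and sizes here are hypothetical stand-ins for a real sampling frame:

```python
import random

# Steps A-E above: assign each population element a random number,
# sort on it, and take the first n elements as the sample.
population = [f"pool-{i}" for i in range(1, 101)]  # hypothetical frame, N = 100
n = 10

random.seed(42)  # seeded only for reproducibility of this sketch
keyed = [(random.random(), unit) for unit in population]
keyed.sort()  # order the whole frame by the random keys

sample = [unit for _, unit in keyed[:n]]
print(sample)
# Enlarging the sample later just means taking more elements from the
# front of the same sorted list, e.g. keyed[:n + 2].
```

Because every ordering of the frame is equally likely, every subset of size n is equally likely to occupy the first n positions, which is exactly the SRS definition.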

ESTIMATING THE POPULATION MEAN

1. Estimated Mean:

ȳ = (Σ yᵢ) / n

2. Estimated Variance of Mean:

V̂(ȳ) = ((N − n) / N) (s² / n)

where s² is the sample variance and (N − n)/N is the finite population correction.

3. Bound on the error of estimation (B):

B = t √V̂(ȳ)

Note that the t in the equation for the bound is a Student's t value for n − 1 d.f. at the chosen level of significance (α). In most cases the approximation t ≈ 2 (for a 95% bound) is reasonable.

4. Confidence Interval:

ȳ ± t(α/2) √V̂(ȳ) = ȳ ± B

5. Sample size:

n = N σ² / ((N − 1) D + σ²), where D = B² / 4

(σ² is usually estimated by s² from a preliminary sample.)

Note: Use t=2 for 95% Bound etc.
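The five quantities above can be computed directly. A minimal sketch using made-up data: counts of fish in n = 8 pools sampled from a hypothetical stream with N = 40 pools:

```python
import math

# Hypothetical data: fish counts in 8 of 40 stream pools.
N = 40
y = [4, 7, 2, 9, 5, 6, 3, 8]
n = len(y)

ybar = sum(y) / n                                  # 1. estimated mean
s2 = sum((yi - ybar) ** 2 for yi in y) / (n - 1)   # sample variance
var_ybar = ((N - n) / N) * (s2 / n)                # 2. variance of mean (with fpc)
B = 2 * math.sqrt(var_ybar)                        # 3. bound, using t ≈ 2
ci = (ybar - B, ybar + B)                          # 4. approximate 95% CI

# 5. sample size needed to achieve a chosen bound B0, estimating sigma^2 by s2
B0 = 1.0
D = B0 ** 2 / 4
n_needed = math.ceil(N * s2 / ((N - 1) * D + s2))

print(ybar, var_ybar, B, ci, n_needed)
```

For these numbers ȳ = 5.5 and s² = 6.0, so V̂(ȳ) = (32/40)(6/8) = 0.6, and a bound of B0 = 1 fish per pool would require n = 16 pools.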

ESTIMATING THE POPULATION TOTAL

1. Estimate of Total:

τ̂ = N ȳ

2. Estimated Variance of Total:

V̂(τ̂) = N² V̂(ȳ) = N² ((N − n) / N) (s² / n)

3. Estimated Bound on Total:

B = t √V̂(τ̂)

4. Sample Size:

n = N σ² / ((N − 1) D + σ²), where D = B² / (4 N²)

Note: Use t=2 for 95% Bound etc.

SAMPLING WITH REPLACEMENT

Consider a population of potato sacks, each of which has either 12, 13, 14, 15, 16, 17, or 18 potatoes, and all the values are equally likely. Suppose that, in this population, there is exactly one sack with each number. So the whole population has seven sacks. If I sample two with replacement, then I first pick one (say 14). I had a 1/7 probability of choosing that one. Then I replace it. Then I pick another. Every one of them still has 1/7 probability of being chosen. And there are exactly 49 different possibilities here (assuming we distinguish between the first and the second). They are: (12,12), (12,13), (12,14), (12,15), (12,16), (12,17), (12,18), (13,12), (13,13), (13,14), etc.
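The 49 possibilities can be enumerated mechanically:

```python
from itertools import product

# All ordered samples of size 2 drawn WITH replacement from the
# seven potato sacks: 7 choices each draw, so 7 * 7 = 49.
sacks = [12, 13, 14, 15, 16, 17, 18]
samples = list(product(sacks, repeat=2))

print(len(samples))  # 49
print(samples[:3])   # [(12, 12), (12, 13), (12, 14)]
```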

SAMPLING WITHOUT REPLACEMENT

Consider the same population of potato sacks, each of which has either 12, 13, 14, 15, 16, 17, or 18 potatoes, and all the values are equally likely. Suppose that, in this population, there is exactly one sack with each number. So the whole population has seven sacks. If I sample two without replacement, then I first pick one (say 14). I had a 1/7 probability of choosing that one. Then I pick another. At this point, there are only six possibilities: 12, 13, 15, 16, 17, and 18. So there are only 42 different possibilities here (again assuming that we distinguish between the first and the second). They are: (12,13), (12,14), (12,15), (12,16), (12,17), (12,18), (13,12), (13,14), (13,15), etc.
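Again the possibilities can be enumerated; without replacement they are the ordered pairs of distinct sacks:

```python
from itertools import permutations

# All ordered samples of size 2 drawn WITHOUT replacement:
# 7 choices for the first sack, then 6 for the second = 42.
sacks = [12, 13, 14, 15, 16, 17, 18]
samples = list(permutations(sacks, 2))

print(len(samples))        # 42
print((14, 14) in samples) # False: no sack can be drawn twice
```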

WHAT'S THE DIFFERENCE?

When we sample with replacement, the two sample values are independent. Practically, this means that what we get on the first draw doesn't affect what we get on the second. Mathematically, this means that the covariance between the two is zero.

In sampling without replacement, the two sample values aren't independent. Practically, this means that what we got on the first draw affects what we can get on the second. Mathematically, this means that the covariance between the two isn't zero, and that complicates the computations. In particular, if we have a SRS (simple random sample) without replacement from a population with variance σ², then the covariance of two different sample values is −σ² / (N − 1), where N is the population size.
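The formula cov(y₁, y₂) = −σ²/(N − 1) can be verified by brute force on the potato-sack population, averaging over all 42 ordered without-replacement samples:

```python
from itertools import permutations

# Check cov(y1, y2) = -sigma^2 / (N - 1) by enumerating every
# equally likely ordered sample of size 2 without replacement.
sacks = [12, 13, 14, 15, 16, 17, 18]
N = len(sacks)
mu = sum(sacks) / N
sigma2 = sum((x - mu) ** 2 for x in sacks) / N  # population variance

pairs = list(permutations(sacks, 2))
cov = sum((a - mu) * (b - mu) for a, b in pairs) / len(pairs)

print(cov, -sigma2 / (N - 1))  # the two values agree
```

Here μ = 15 and σ² = 4, so the covariance is −4/6 ≈ −0.667. Since the magnitude is σ²/(N − 1), it shrinks toward zero as the population grows, which is the point of the next section.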

POPULATION SIZE - LEADING TO A DISCUSSION OF "INFINITE" POPULATIONS

When we sample without replacement, and get a non-zero covariance, the covariance depends on the population size. If the population is very large, this covariance is very close to zero. In that case, sampling with replacement isn't much different from sampling without replacement. In some discussions, people describe this difference as sampling from an infinite population (sampling with replacement) versus sampling from a finite population (without replacement).

CONFIDENCE INTERVALS

The idea behind confidence intervals is that it is not enough just to use the sample mean to estimate the population mean. The sample mean by itself is a single point, and this does not give people any idea of how good the estimate of the population mean is. To assess the accuracy of the estimate we use confidence intervals, which tell us how good our estimation is.

A confidence interval, viewed before the sample is selected, is an interval which has a pre-specified probability of containing the parameter. To obtain this confidence interval you need to know the sampling distribution of the estimate. Once we know the distribution, we can talk about confidence. We want to be able to say something about μ, or rather about ȳ − μ, because ȳ should be close to μ. So the type of statement that we want to make will look like this:

P(|ȳ − μ| < d) = 1 − α

Thus, we need to know the distribution of ȳ. In certain cases the distribution of ȳ can be stated easily, though there are many different types of distributions. The normal distribution is easy to use as an example because it does not bring with it too much complexity.

When we talk about the Central Limit Theorem for the sample mean, what are we talking about? What happens when n gets large? ȳ has population mean μ and standard deviation σ/√n; since we do not know σ, we use s to estimate it. We can thus estimate the standard deviation of ȳ by s/√n. The √n in the denominator helps us because as n gets larger, the standard deviation of ȳ gets smaller.

The distribution of ȳ is very complicated when the sample size is small; when the sample size is larger there is more regularity and it is easier to see the distribution. We want to find a confidence interval for μ. If we go about picking samples we can compute ȳ and from it construct an interval about the mean. However, a slight complication comes out of σ/√n: we have two unknowns, μ and σ. We estimate σ by s, and then (ȳ − μ)/(s/√n) no longer has a normal distribution but a t distribution. Thus, a 100(1 − α)% confidence interval for μ can be derived as follows:

(ȳ − μ) / √Var(ȳ) ~ N(0, 1), whereas (ȳ − μ) / √V̂(ȳ) ~ t(n−1)

Now, we can compute the confidence interval as:

ȳ ± t(α/2) √V̂(ȳ)

In addition, we are sampling without replacement here, so we need to apply the finite population correction to get a formula for our sampling scheme that is more precise. If we want a 100(1 − α)% confidence interval for μ, this is:

ȳ ± t(α/2) √(((N − n) / N) (s² / n))

That is the confidence interval for μ; the confidence interval for the population total τ follows from it. A 100(1 − α)% confidence interval for τ is given by:

N ȳ ± t(α/2) √(N (N − n) (s² / n))

Be careful now: when can we use these? In what situations are these confidence intervals applicable? These approximate intervals are good when n is large (because of the Central Limit Theorem), or when the observations y1, y2, ..., yn are normal.

When the sample size is 30 or more, we consider the sample size to be large and, by the Central Limit Theorem, ȳ will be approximately normal even if the sample does not come from a normal distribution. Thus, when the sample size is 30 or more, there is no need to check whether the sample comes from a normal distribution, and we can use the t-interval. When the sample size is 8 to 29, we would usually use a normal probability plot to see whether the data come from a normal distribution; if the plot does not contradict the normal assumption, we can go ahead and use the t-interval. However, when the sample size is 7 or less, a normal probability plot may fail to reject normality simply because the sample is too small. In the examples here in these lessons and in the textbook we typically use small sample sizes, and this might give the wrong impression; these small samples have been set up for illustration purposes only. When you have a sample size of 5 you really do not have enough power to say the distribution is normal, and we would use nonparametric methods instead of t.

For the beetle example in the text, an approximate 95% CI for μ is:

ȳ ± t(α/2) √(((N − n) / N) (s² / n))

Note that the t-value for α/2 = 0.025 at n − 1 = 8 − 1 = 7 df can be found from the t-table to be 2.365.

ȳ ± t(α/2) √(((N − n) / N) (s² / n)) = 222.875 ± 2.365 √222.256 = 222.875 ± 2.365 × 14.908 = 222.875 ± 35.258

And an approximate 95% CI for the total τ is then:

N ȳ ± t(α/2) √(N (N − n) (s² / n)) = 22287.5 ± 2.365 √2222560 = 22287.5 ± 3525.802
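The beetle arithmetic can be reproduced from the quantities stated in the text: ȳ = 222.875, V̂(ȳ) = 222.256, t = 2.365 for 7 df, with N = 100 inferred from the totals (22287.5 = 100 × 222.875):

```python
import math

# Reproduce the beetle-example CI arithmetic from the given quantities.
t = 2.365         # t(0.025) with 7 df, from the t-table
ybar = 222.875    # sample mean
var_ybar = 222.256  # estimated variance of the mean, ((N-n)/N)(s^2/n)
N = 100           # inferred: 22287.5 / 222.875

half = t * math.sqrt(var_ybar)
print(f"mean CI:  {ybar} +/- {half:.3f}")          # 222.875 +/- 35.258

tau_hat = N * ybar
half_tau = t * math.sqrt(N ** 2 * var_ybar)        # N^2 * V(ybar) = 2222560
print(f"total CI: {tau_hat} +/- {half_tau:.3f}")   # 22287.5 +/- 3525.802
```

The half-width for the total is exactly N times the half-width for the mean, since V̂(τ̂) = N² V̂(ȳ).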
