
STATISTICS AND DATA ANALYSIS B01.1305.

30
Peter Lakner
Handout 2
EXAMPLE: An industrial process produces five-liter cans of paint thinner. The history
of this process indicates a mean fill of 5.02 liters, with a standard deviation of 0.21 liter.
The quality control experts watch this process and select a can for inspection every hour.
This process runs for 12 hours every day. The exact contents of the selected can are
then determined. The process is said to be "in control" if the volume is within the range
$5.02 \pm a(0.21)$. The question here involves the choice of $a$.
Suppose that $a = 2.0$. About how often, by chance alone, will a can be declared out of
control?
Repeat this for $a = 2.5$ and $a = 3.0$.
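The question can be answered numerically under one added assumption that the example does not state outright: that fill volumes are approximately normal. The chance of a false alarm on one can is then $P(|Z| > a)$. The sketch below uses only the Python standard library; the function name is hypothetical.

```python
from math import erfc, sqrt

def false_alarm_probability(a):
    """P(|Z| > a) for a standard normal Z, i.e. the chance that an
    in-control can falls outside mean +/- a standard deviations."""
    return erfc(a / sqrt(2))

INSPECTIONS_PER_DAY = 12  # one can per hour, 12 hours per day (from the example)

for a in (2.0, 2.5, 3.0):
    p = false_alarm_probability(a)
    per_day = INSPECTIONS_PER_DAY * p  # expected false alarms per day
    print(f"a = {a}: P(false alarm) = {p:.4f}, "
          f"about one every {1 / per_day:.1f} days")
```

Under the normality assumption the three probabilities are roughly 4.6%, 1.2% and 0.3%, which is why $a = 3.0$ produces so few false alarms.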
NOTE: The technique implied here is generally implemented by plotting the points on a
control chart. The use of a = 3.0 is perhaps the most common choice.
These ideas about control charts should certainly be noted:
Perhaps the most important benefit of control charts is that they cause people to watch the
process.
Control charts force people to confront the concept of statistical variability.
The choice $a = 3$ assures that very few false alarms will be issued. If the process mean suddenly
shifts by one standard deviation, the shift will be detected immediately with probability
about 2%. (In the paint thinner example, this could be a change from mean 5.02 to 5.02
+ 0.21 = 5.23.) It could take a while to notice this shift. On the other hand, a shift of two
standard deviations will be noticed quickly, since each inspected can then has roughly a
16% chance of falling outside the limits.
The example used here dealt only with single cans of paint thinner. In other contexts,
one takes small samples and watches the sample standard deviation as well as the sample
mean.
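The detection figures in the notes above can be checked with a short calculation. This is a sketch that assumes normally distributed fills and independent inspections (neither is stated in the handout), with limits at the original mean plus or minus $a$ standard deviations; the function name is hypothetical.

```python
from statistics import NormalDist

def detection_probability(shift_in_sd, a=3.0):
    """Chance that a single can falls outside mean +/- a*sd after the
    process mean has moved up by `shift_in_sd` standard deviations."""
    z = NormalDist()  # standard normal
    above = 1 - z.cdf(a - shift_in_sd)   # beyond the upper limit
    below = z.cdf(-a - shift_in_sd)      # beyond the lower limit
    return above + below

for shift in (1.0, 2.0):
    p = detection_probability(shift)
    # With independent inspections the wait is geometric with mean 1/p.
    print(f"shift of {shift} sd: P(detect on one can) = {p:.3f}, "
          f"expected wait = {1 / p:.0f} inspections")
```

A one-standard-deviation shift is caught on any given can with probability of about 2%, so several days of hourly inspections may pass before it shows up, while a two-standard-deviation shift is typically caught within the first handful of cans.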
Confidence intervals
200 households are drawn at random from a population, and their incomes are recorded.
We have $\bar{X} = 29{,}000$. Suppose that $\sigma = 8{,}000$. The Central Limit Theorem states that
$$\bar{X} \sim N\left(\mu, \frac{8{,}000}{\sqrt{200}}\right)$$
which is $N(\mu, 566)$. Hence
$$P\left(-1.96 < \frac{\bar{X} - \mu}{566} < 1.96\right) = .95$$
The .95 figure is accepted as a large probability, but it is selected rather arbitrarily.
Rearranging the inequalities we see that
$$P\left(\bar{X} - 1.96 \times 566 < \mu < \bar{X} + 1.96 \times 566\right) = .95$$
In other words, we are 95% confident that the unknown mean $\mu$ is in the interval
$(\bar{X} - 1.96 \times 566,\ \bar{X} + 1.96 \times 566)$. In our case we conclude, with 95% confidence,
that the population mean falls in the interval (27,890, 30,109). This interval is called a
95% confidence interval.
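The arithmetic for the household income example can be reproduced in a few lines. This is only a sketch of the calculation, using the figures given in the example; the variable names are illustrative.

```python
from math import sqrt

x_bar, sigma, n = 29_000, 8_000, 200   # figures from the example
z = 1.96                               # multiplier for 95% confidence

standard_error = sigma / sqrt(n)       # about 566
margin = z * standard_error
lower, upper = x_bar - margin, x_bar + margin

print(f"standard error = {standard_error:.0f}")
print(f"95% CI = ({lower:,.0f}, {upper:,.0f})")
```

The endpoints may differ from the handout's by about one unit, because the handout rounds the standard error to 566 before multiplying.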
The general formula for a 95% confidence interval is
$$\left(\bar{X} - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right)$$
This is valid if $n$ is large, or if the original population is normal. The rule of thumb for large
$n$ is $n > 30$. Sometimes one wants to change the confidence level. For example, one may
want 99% certainty instead of 95%. Here is a general formula to do this. Let $z_a$ be a
number such that $P(Z > z_a) = a$. Suppose that we want a $100(1-\alpha)\%$ level confidence
interval (for example, if we want a 95% confidence interval then $\alpha = 1 - .95 = .05$). Find
$z_{\alpha/2}$ and use this number instead of 1.96. So the formula for a $100(1-\alpha)\%$ level
confidence interval is
$$\left(\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$$
Here is an example. Suppose that we want a 90% confidence interval. Then $\alpha = .1$,
$\alpha/2 = .05$, and $z_{.05} = 1.64$ (this last figure comes from finding .45 among the probabilities in
the normal probability table, and reading the corresponding $z$ value).
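As an alternative to the printed table, the multiplier can be computed directly. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `z_upper` is hypothetical.

```python
from statistics import NormalDist

def z_upper(a):
    """Return z_a, the point with P(Z > z_a) = a for a standard normal Z."""
    return NormalDist().inv_cdf(1 - a)

for confidence in (0.90, 0.95, 0.99):
    alpha = 1 - confidence
    print(f"{confidence:.0%} interval: alpha/2 = {alpha / 2:.3f}, "
          f"z = {z_upper(alpha / 2):.3f}")
```

For 90% confidence this gives about 1.645. The probability .45 falls between two table entries, so a table lookup yields 1.64 or 1.65 depending on which neighbor is chosen.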
However, the problem with this is that usually $\sigma$ is unknown. Then we replace $\sigma$ by $s$, which
is the sample standard deviation. Now we have to use the t-table instead of the normal
probability table to find the multiplier of the $s/\sqrt{n}$ factor.
The degrees of freedom will be $n - 1$. Suppose that the sample size is 20 and we want
a 95% confidence interval. Then we have $\alpha/2 = .025$ so use that column (in the table it is
denoted by $a$ instead of $\alpha/2$) and the d.f. row 19. This gives $t_{.025} = 2.093$. This is a bit
larger than 1.96, which would come from the normal distribution, so it results in a slightly wider
confidence interval. But if $n$ is large (say $n > 30$) then the $t$ distribution and the normal
distribution almost coincide. In the $n > 30$ case we can use the $\infty$ line of the t table, which
gives exactly the same numbers as the normal table.
When using the sample standard deviation $s$ instead of $\sigma$ we need to assume that the
original population follows approximately a normal distribution.
EXAMPLE (Exercise 7.36 from page 261 in the textbook): A random sample of 20 taste-
testers rate the quality of a proposed new product on a 0-100 scale. The data is available
in the data file EX0736.
Open Minitab
OPEN WORKSHEET
X:/SOR/B011305/HOG/EXERCISE FILES/MINITAB/EX0736
We find under STATISTICS - BASIC STATISTICS - DISPLAY DESCRIPTIVE STATISTICS
that $\bar{X} = 54.9$ and $s/\sqrt{n} = 3.96$ (the latter figure under "standard error of the mean").
Next we do a normality test.
STAT - BASIC STAT - NORMALITY TEST
We usually accept normality if the p-value > .05. Here it is only .03 so we reject the
assumption that this sample comes from a normal distribution. The interpretation of
the p-value is that, assuming a normal population, there is only a 3% chance of drawing a
sample that departs from normality at least this much. We can still calculate a 95% confidence
interval, but we take it with a grain of salt since the population is unlikely to be normal.
The 95% confidence interval will be $54.9 \pm 2.093 \times 3.96$, which is (46.61, 63.19). We can
get the same results from Minitab by
STAT - BASIC STAT - 1 sample t
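The interval can also be rebuilt by hand from the summary figures. This sketch takes the t multiplier as an input read from the t-table rather than computing it, since the Python standard library has no t quantile function; the helper name is hypothetical.

```python
def t_interval(x_bar, standard_error, t_multiplier):
    """Return (lower, upper) = x_bar -/+ t_multiplier * standard_error."""
    margin = t_multiplier * standard_error
    return x_bar - margin, x_bar + margin

# x_bar and the standard error are the Minitab figures for EX0736;
# 2.093 is the t-table entry for the .025 column and 19 degrees of freedom.
lower, upper = t_interval(x_bar=54.9, standard_error=3.96, t_multiplier=2.093)
print(f"95% CI = ({lower:.2f}, {upper:.2f})")
```

Using 1.96 in place of 2.093 would give a slightly narrower interval, which understates the uncertainty for a sample of only 20.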
EXAMPLE (Exercise 7.43 on page 283 in the textbook). A bottling process fills 16-ounce
bottles. It is important that the average volume placed in the containers is 16 ounces, that
is, overfilling or underfilling is a problem. The quality control inspector selects 20 bottles
from the filling process and measures the volume of liquid each contains. The data are
available in the file EX0743. Check for normality and calculate a 95% confidence interval
for the population mean.
Confidence interval for a proportion
$n = 2{,}200$ eligible voters have been asked about their voting preferences in an election
with two candidates, A and B. 471 of the 2,200 will vote for A. Let $\pi$ be the proportion of
people in the population who will vote for candidate A. This proportion is unknown. However,
we can estimate it by the sample fraction
$$\hat{\pi} = \frac{Y}{n} = \frac{471}{2{,}200} = .214$$
where $Y$ is the number of people in the sample who vote for A. We want to create a
confidence interval for $\pi$. $Y$ follows a binomial distribution with parameters $n$ and $\pi$, thus
$$SD(Y) = \sqrt{n\pi(1-\pi)} \approx \sqrt{2{,}200 \times .214 \times .786} = 19.24$$
This is an approximation because we replaced $\pi$ with $\hat{\pi}$. Then
$$SD(\hat{\pi}) = \frac{19.24}{2{,}200} = .0087$$
Approximately, $Y$ follows a normal distribution, and therefore so does $\hat{\pi} = Y/n$, so with
95% probability
$$-1.96 < \frac{\hat{\pi} - \pi}{.0087} < 1.96$$
We can rearrange this and say with 95% confidence that $|\hat{\pi} - \pi| < 1.96 \times .0087 = .017$.
Since $\hat{\pi} = .214$ we can convert this to a 95% confidence interval for $\pi$, which is $.214 \pm .017$.
This gives an interval (.197, .231). We are 95% confident that the population fraction of
people voting for candidate A falls in this interval.
Here is a general formula for a $100(1-\alpha)\%$ confidence interval for the unknown population
proportion:
$$\hat{\pi} \pm z_{\alpha/2}\sqrt{\frac{\hat{\pi}(1-\hat{\pi})}{n}}$$
We can say with 95% confidence that candidate A will lose the election. Of course this
is much more interesting for a more tightly contested election. Suppose that $n = 10{,}000$
and $Y = 4{,}700$. It is not clear now what the outcome of the election will be since 4,700 is
too close to 5,000. Let's compute a 95% confidence interval. Here $\hat{\pi} = .47$ and
$$.47 \pm 1.96\sqrt{\frac{.47 \times .53}{10{,}000}} = .47 \pm .01$$
So we are 95% confident that the proportion of votes candidate A will receive in the election
is between 46% and 48%. We can say with 95% confidence that A will lose the election.
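Both polling examples follow the same steps, so they can be sketched as one small function. It applies the normal approximation with the sample fraction substituted for the unknown proportion, as in the text; the function name and the default multiplier of 1.96 are illustrative choices.

```python
from math import sqrt

def proportion_interval(successes, n, z=1.96):
    """Approximate confidence interval for a population proportion:
    p_hat -/+ z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# The two polls from the text: 471 of 2,200 and 4,700 of 10,000.
for successes, n in ((471, 2_200), (4_700, 10_000)):
    lower, upper = proportion_interval(successes, n)
    print(f"{successes}/{n}: ({lower:.3f}, {upper:.3f})")
```

In both cases the whole interval lies below .5, which is what supports the statement that A will lose.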
Sample size for estimating a population mean.
EXAMPLE: Suppose that we would like to estimate the average wage in an industry. How
large a sample is needed to be 90% sure that the distance between the sample mean and the
population mean is no more than .5?
Suppose that $\sigma = \$4.00$. Then we have to set
$$z_{\alpha/2}\frac{4}{\sqrt{n}} = .5$$
Since $\alpha = .1$ and $z_{.05} = 1.65$, this gives $n = 1.65^2 \times 4^2 \times 2^2 = 174.24$. We round this up to
175. This way the distance between $\bar{X}$ and $\mu$ will be no more than .5 with 90% probability.
What happens if we don't know $\sigma$? Here are two options:
1. Have a pilot study with a small $n$ and use $s$ from this sample to estimate $\sigma$;
2. If you can guess the maximal and the minimal values in the population then estimate
$\sigma$ by (MAX - MIN)/4.
Here is a general formula for calculating $n$. Suppose that we want to be $100(1-\alpha)\%$
sure that the distance between the sample mean and the population mean is no more than
$E$. Then
$$n = \left(\frac{z_{\alpha/2}\,\sigma}{E}\right)^2$$
In this case we always round $n$ up.
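The general formula translates directly into code. This is a minimal sketch in which the multiplier is passed in as read from a table; the function and parameter names are hypothetical.

```python
from math import ceil

def sample_size_for_mean(sigma, max_error, z):
    """Smallest n with z * sigma / sqrt(n) <= max_error (always rounded up)."""
    return ceil((z * sigma / max_error) ** 2)

# Wage example: sigma = 4, E = .5, with the multiplier 1.65 used in the text.
print(sample_size_for_mean(sigma=4.0, max_error=0.5, z=1.65))
# The same requirement with the 95% multiplier 1.96 needs a larger sample.
print(sample_size_for_mean(sigma=4.0, max_error=0.5, z=1.96))
```

The first call reproduces the 175 from the wage example; the second shows how much the required sample grows when the multiplier 1.96 is used instead.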
Sample size for estimating a population proportion.
EXAMPLE: We want to organize a poll before an election. We are interested in the
proportion of votes candidate A will receive. Our objective is that the distance between
the sample proportion and the population proportion should be no more than $E = .01$
with 95% probability. Formally, we want $|\hat{\pi} - \pi| \le .01$ with 95% probability.
We want to make
$$1.96\sqrt{\frac{\pi(1-\pi)}{n}} = .01$$
This gives
$$n = \frac{1.96^2\,\pi(1-\pi)}{.01^2}$$
The problem is, of course, that we don't know $\pi$. We could have a pilot study and estimate
it with $\hat{\pi}$. However, instead it is customary to do the following. Find the maximum
of the quantity $\pi(1-\pi)$, which represents the worst case scenario. Here is the graph of
$\pi(1-\pi)$ created with Minitab:
CALC- MAKE PATTERNED DATA - SIMPLE SET OF NUMBERS
STORE: C1
FROM: 0
TO: 1
IN STEPS: .01
CALC - CALCULATOR
STORE: C2
EXPRESSION: C1*(1-C1)
GRAPH - SCATTERPLOT
Y:C2 X:C1
It turns out that the maximum of $\pi(1-\pi)$ is achieved at $\pi = .5$ and the maximum value
is 1/4. Use this value so
$$n = \frac{1.96^2}{4\,(.01)^2}$$
which gives, after rounding up if necessary, $n = 9{,}604$. The general formula for the sample size such
that the distance between the sample proportion and the population proportion is no more than $E$
with $100(1-\alpha)\%$ probability is
$$n = \frac{z_{\alpha/2}^2}{4E^2}$$
With the help of additional information about $\pi$ we can reduce the necessary sample size.
Suppose that we know that $\pi < .4$. The maximal value of $\pi(1-\pi)$ on the interval (0, .4)
is achieved at $\pi = .4$ so this is the value we shall use as the worst case scenario. Now
$$n = \frac{1.96^2 \times .4 \times .6}{.01^2} = 9219.84$$
which gives, after rounding, $n = 9{,}220$. This is a small reduction compared to 9,604.
However, suppose that we know that $\pi < .1$. The worst case scenario is now $\pi = .1$ so
$$n = \frac{1.96^2 \times .1 \times .9}{.01^2} \approx 3{,}458$$
This is a huge reduction compared to 9,604.
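The three sample sizes can be reproduced with one function that takes an upper bound on the unknown proportion. This is a sketch with hypothetical names. It uses exact rational arithmetic (`fractions.Fraction`) as a design choice, because floating-point noise in a result that is exactly an integer, such as 9,604, could otherwise push the rounded-up value one too high.

```python
from fractions import Fraction
from math import ceil

def sample_size_for_proportion(max_error, z="1.96", pi_bound="0.5"):
    """Worst-case n = z^2 * pi * (1 - pi) / E^2, rounded up.

    `pi_bound` is an upper bound on the unknown proportion (at most 0.5).
    Arguments are decimal strings so that the arithmetic is exact.
    """
    z, e, pi = Fraction(z), Fraction(max_error), Fraction(pi_bound)
    if not 0 < pi <= Fraction(1, 2):
        raise ValueError("pi_bound must lie in (0, 0.5]")
    return ceil(z ** 2 * pi * (1 - pi) / e ** 2)

for bound in ("0.5", "0.4", "0.1"):
    print(bound, sample_size_for_proportion("0.01", pi_bound=bound))
```

The loop covers the no-information case and the two bounds discussed in the text, and returns 9,604, 9,220 and 3,458 respectively.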
Hypothesis testing
Example 8.7 on page 324 of the textbook: An airline institutes a "snake system"
waiting line at its counters to try to reduce the average waiting time. The mean waiting
time under specific conditions with the previous system was 6.1 minutes. A sample of 14
waiting times is taken; the times are measured at widely separated intervals to eliminate
the possibility of dependent observations. The resulting sample mean is $\bar{X} = 5.043$ and
the standard deviation is 2.266.
Denote by $\mu$ the unknown average waiting time under the snake system. We set up two
hypotheses. The null hypothesis will be
$$H_0: \mu = 6.1$$
whereas the alternative hypothesis will be
$$H_1: \mu < 6.1$$
We would like to know whether the sample provides sufficient evidence to reject
$H_0$. The logic of this decision making is the following. If $H_0$ were true then
$$t_{\mathrm{stat}} = \frac{\bar{X} - 6.1}{s/\sqrt{n}}$$
would follow a $t$ distribution with $14 - 1 = 13$ degrees of freedom. In this case $t_{\mathrm{stat}} = -1.75$.
Look up in the t table the number $t_{.1}$ such that $P(t_{\mathrm{stat}} > t_{.1}) = P(t_{\mathrm{stat}} < -t_{.1}) = .1$ (still
temporarily assuming $\mu = 6.1$). It turns out that $t_{.1} = 1.35$. There is only a 10% chance that
$t_{\mathrm{stat}}$ would be below $-1.35$ if $H_0$ were true. Since in our case $t_{\mathrm{stat}} < -1.35$, we would rather
reject $H_0$ than believe that such an unlikely event happened.
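For readers who want to check the test statistic, here is the arithmetic written out from the sample figures in the example ($\bar{X} = 5.043$, $s = 2.266$, $n = 14$), with intermediate values rounded:

$$t_{\mathrm{stat}} = \frac{5.043 - 6.1}{2.266/\sqrt{14}} \approx \frac{-1.057}{0.6056} \approx -1.75$$

Since $-1.75$ lies below $-1.35$, the statistic falls in the rejection region at the .1 level.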
The selection of the number .1 was rather arbitrary. It is called the significance level and
denoted by $\alpha$. In many cases it is selected as .05 or .01.
Exercise: Repeat the above procedure for $\alpha = .01$.
There are two kinds of errors we can commit in this procedure. The type I error is that we
reject $H_0$ when it is true. The type II error is that we accept $H_0$ when in fact it is false.
The significance level is the probability of the type I error.
Such tests as this are called one-sided because we only consider one side of 6.1 in the
alternative hypothesis. If it had been $\mu \neq 6.1$, we would call it a two-sided test, or two-sided
alternative hypothesis.
The basic logic of this is the following. If the outcome of the study under the null
hypothesis is less likely than the pre-determined (small) significance level then we reject
$H_0$. Otherwise we can not reject $H_0$. We use the terminology "can not reject" instead
of "accept $H_0$", because we can not say that there is a low probability of accepting $H_0$
when it is false. However, if we reject $H_0$ we feel more secure because the probability of
rejecting $H_0$ when it is true is small (it is equal to $\alpha$). Sometimes instead of the words
"could not reject $H_0$" we use the words "the study was inconclusive". This reflects the
fact that usually when we set up the study we would like to reject $H_0$. In this example
we set up the study exactly to prove that the snake system is better. It is interesting to
notice that our findings are somewhat ambiguous, because we rejected $H_0$ at $\alpha = .1$ but
could not reject it at $\alpha = .01$. In order to avoid any tampering with the significance level
to get a desired result, one should always set up $\alpha$ before looking at the data.
Here is a general recipe for the above procedure. Let $\mu$ be an unknown population mean,
and $\mu_0$ a given number (this was the 6.1 figure in the above example). We set up two
hypotheses
$$H_0: \mu = \mu_0$$
$$H_a: \mu < \mu_0$$
select a significance level $\alpha$, and calculate the t-statistic
$$t_{\mathrm{stat}} = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}$$
Look up $t_{\alpha}$ from the t-table. If $t_{\mathrm{stat}} < -t_{\alpha}$ then reject $H_0$. Otherwise we can not reject
$H_0$.
This process needs a little modification if the alternative hypothesis is $\mu > \mu_0$. In that
case we shall reject $H_0$ if $t_{\mathrm{stat}}$ is large. Here are the exact details:
We set up two hypotheses
$$H_0: \mu = \mu_0$$
$$H'_a: \mu > \mu_0$$
select a significance level $\alpha$, and calculate the t-statistic
$$t_{\mathrm{stat}} = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}$$
Look up $t_{\alpha}$ from the t-table. If $t_{\mathrm{stat}} > t_{\alpha}$ then reject $H_0$. Otherwise we can not reject $H_0$.
In case we reject $H_0$ we use the words (in the first case, with $H_a: \mu < \mu_0$) "the sample
mean is significantly below $\mu_0$", and in the second case (with $H'_a: \mu > \mu_0$), upon rejecting
$H_0$, we say "the sample mean is significantly above $\mu_0$".
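The two one-sided recipes differ only in the direction of the comparison, so they can be sketched as a single function. The critical value is supplied by the caller from a t-table with $n - 1$ degrees of freedom. The function name and the `alternative` labels are hypothetical, and the value 2.650 for the .01 level with 13 degrees of freedom is a standard t-table entry that the handout itself does not quote.

```python
from math import sqrt

def one_sided_t_test(x_bar, s, n, mu_0, t_alpha, alternative):
    """Apply the one-sided recipe; `t_alpha` is read from a t-table
    with n - 1 degrees of freedom. Returns (t_stat, reject_h0)."""
    t_stat = (x_bar - mu_0) / (s / sqrt(n))
    if alternative == "less":        # alternative: mu < mu_0
        reject = t_stat < -t_alpha
    elif alternative == "greater":   # alternative: mu > mu_0
        reject = t_stat > t_alpha
    else:
        raise ValueError("alternative must be 'less' or 'greater'")
    return t_stat, reject

# Airline example: 13 degrees of freedom, table values 1.35 (level .1)
# and 2.650 (level .01).
for alpha, t_alpha in ((0.10, 1.35), (0.01, 2.650)):
    t_stat, reject = one_sided_t_test(5.043, 2.266, 14, 6.1, t_alpha, "less")
    print(f"alpha = {alpha}: t_stat = {t_stat:.2f}, reject H0: {reject}")
```

Run on the airline data, the sketch rejects the null hypothesis at the .1 level but not at the .01 level, matching the ambiguity discussed in the text.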