
These materials contain information that is proprietary and confidential to Bank of America. These materials shall not be duplicated. 2005 Bank of America. All rights reserved.
January 3 2005 ver. 4.4 - ActionLegal Copy Service.
Central Limit Theorem and Confidence Intervals
Objectives
Understand the Central Limit Theorem and why it is important to other statistical tools.
Understand and calculate confidence intervals.
Central Limit Theorem Exercise
Divide into four teams.
Each team will be given a process that delivers an output.
Run the process and measure the output to get 30 data points.
Calculate the statistics of the process output:
  Mean
  Standard Deviation
  P-value for Anderson-Darling Normality Test
Show histogram
Central Limit Theorem Exercise
Team    Sample Size    Sample Mean    Sample Std. Dev.    P-value
1
2
3
4
Central Limit Theorem
Given a population with a mean of μ and a variance of σ², if we sample that population repeatedly using a sample size of n, and further plot a distribution of the means of those samples, then the following will be true:
The mean of the sampling distribution will be μ.
The variance of the sampling distribution will be σ² / n.
The shape of the sampling distribution will approach a normal distribution as n gets larger, regardless of the shape of the original population.
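The three statements above can be checked with a small simulation. The sketch below is only an illustration, not part of the class exercise: it assumes the population is a fair six-sided die (mean 3.5, variance 35/12), and the function names, seed, and number of repetitions are arbitrary choices.

```python
import random
import statistics

def roll_die(rng):
    """One draw from a uniform (clearly non-normal) population: a fair die."""
    return rng.randint(1, 6)

def sample_means(draw, n, num_samples, rng):
    """Return the means of `num_samples` samples, each of size `n`."""
    return [
        statistics.fmean(draw(rng) for _ in range(n))
        for _ in range(num_samples)
    ]

# Population parameters for a fair six-sided die (assumed population).
MU = 3.5
SIGMA_SQ = 35 / 12  # variance of the values 1..6

rng = random.Random(2005)  # fixed seed so the run is repeatable
results = {}
for n in (2, 6, 25):
    means = sample_means(roll_die, n, 20_000, rng)
    results[n] = (statistics.fmean(means), statistics.pvariance(means))
    print(f"n={n:2d}  mean of x-bars={results[n][0]:.3f}  "
          f"variance of x-bars={results[n][1]:.4f}  "
          f"sigma^2/n={SIGMA_SQ / n:.4f}")
```

For each n, the mean of the sample means should land close to 3.5 and their variance close to σ²/n. Plotting a histogram of `means` for each n would show the shape moving toward a bell curve as n grows.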
Central Limit Theorem
Reference: The Black Belt Memory Jogger, p. 139ff.
Dice animation: http://www.stat.sc.edu/~west/javahtml/CLT.html
Simulation with various distributions: http://www.statisticalengineering.com/central_limit_theorem.htm
[Figure: two population distributions of x, each shown with the distributions of x-bar for n = 2, n = 6, and n = 25.]
n = sample size used to calculate the x-bars that are plotted in the histograms.
Central Limit Theorem
How can we use the Theorem to our advantage?
A sample average is an estimate of the population mean.
If we wanted to estimate the population mean of a set of dice that were numbered something different from 1-6 (but numbered consecutively), and we could only take one sample, what sample size would you choose: 1, 2, 5, or 10? Why?
Could we estimate the maximum expected difference between the sample average and the population mean?
To the 95% confidence level, how far off might the true mean be from the sample mean based on your dice exercise?
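One way to explore these questions is to compute the approximate 95% margin, Z × σ / √n, for each candidate sample size. This is a sketch under stated assumptions: the dice are six-sided (six consecutive faces always have variance 35/12, whatever the starting number), and the normal approximation from the Central Limit Theorem is used even though it is rough for very small n, especially n = 1. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def margin_of_error(sigma, n, confidence=0.95):
    """Approximate half-width of the interval around the sample mean."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sigma / math.sqrt(n)

# Six consecutive faces (e.g. 1-6 or 11-16) always have variance 35/12.
SIGMA_DIE = math.sqrt(35 / 12)

margins = {n: margin_of_error(SIGMA_DIE, n) for n in (1, 2, 5, 10)}
for n, m in margins.items():
    print(f"n={n:2d}: sample mean is within about +/-{m:.2f} "
          f"of the population mean")
```

The margin shrinks in proportion to 1/√n, which is why the largest available sample size gives the tightest estimate.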
Confidence Interval
[Figure: the population (the truth) with its true mean marked, compared with sample 1, sample 2, sample 3, and sample 4.]
Confidence Intervals
Statistics such as the sample mean and standard deviation are only estimates of the population Mu (μ) and Sigma (σ) and are based upon a limited amount of data.
Because there is variability in these estimates from sample to sample, we can quantify our uncertainty using statistically grounded confidence intervals based on the Central Limit Theorem. Confidence intervals provide a range of plausible values for the population parameters (μ and σ).
Confidence Intervals
Most of the time, we calculate 95% confidence intervals (CIs) for a parameter (occasionally 90% or 99%).
The CI is interpreted as:
"We are 95% certain that our calculated interval surrounds the true population parameter (e.g., μ or σ)."
In technical terms it would be more correct to say, "the method we use for calculating the interval will yield an interval that contains the true parameter 95% of the time."
What is a Confidence Interval?
Usually, confidence intervals have a ± uncertainty:
Estimate ± margin of error
Sample statistic (e.g., x-bar, s, ...) ± [confidence factor × measure of variability]
In some cases the uncertainty is not symmetrical and the + term is different from the - term; e.g., for σ.
CI Exercise 1
To see the variability in sample estimates, let's define a process that has a normal distribution with:
known (true) mean value = 70
known (true) standard deviation = 5
Each member in the class will generate 20 observations from this process, with mean = 70 and standard deviation = 5 (in Minitab, use Calc>Random Data>Normal).
Use the graphical descriptive statistics procedure in Minitab to calculate the 95% confidence interval for the mean and sigma based on your sample of 20 data points.
Do they cover the true mean 70 and the true sigma 5?
Based on a class size of 30, we would expect 1 or 2 CIs NOT to contain 70 for the mean, and also 1 or 2 that do NOT contain 5 for sigma.
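The exercise can also be imitated outside Minitab. The sketch below is an assumption-laden stand-in, not the Minitab procedure: it covers only the interval for the mean, uses the t-based formula x-bar ± t × s / √n, and takes the t value for 19 degrees of freedom (2.093) from a standard t table. Function names and the seed are illustrative.

```python
import math
import random
import statistics

TRUE_MEAN, TRUE_SIGMA = 70.0, 5.0
N_OBS = 20
T_CRIT = 2.093  # t table value for alpha/2 = 0.025 and 19 degrees of freedom

def mean_ci(data, t_crit):
    """t-based confidence interval for the mean: x-bar +/- t * s / sqrt(n)."""
    x_bar = statistics.fmean(data)
    half_width = t_crit * statistics.stdev(data) / math.sqrt(len(data))
    return x_bar - half_width, x_bar + half_width

def count_misses(class_size, rng):
    """Number of class members whose interval fails to contain the true mean."""
    misses = 0
    for _ in range(class_size):
        data = [rng.gauss(TRUE_MEAN, TRUE_SIGMA) for _ in range(N_OBS)]
        low, high = mean_ci(data, T_CRIT)
        if not (low <= TRUE_MEAN <= high):
            misses += 1
    return misses

rng = random.Random(70)  # fixed seed so the run is repeatable
print("misses in one class of 30:", count_misses(30, rng))

# Repeating the exercise many times shows the long-run miss rate near 5%.
miss_rate = count_misses(20_000, rng) / 20_000
print(f"long-run miss rate: {miss_rate:.3f}")
```

A single class of 30 may show zero, one, two, or occasionally more misses; only the long-run rate settles near 5%, which matches the expectation of 30 × 0.05 = 1.5 misses per class.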
Luck of the Draw
We want to be certain that our confidence interval contains the population parameters. But certainty only comes with measuring the entire population. Therefore, for the vast majority of cases, we have to live with being 95% certain that our sample has captured the population parameters inside the confidence intervals. We say that we are "95% confident."
In reality we will never know whether our sample was one of the "lucky" 95% whose interval actually contained the true parameter, or one of the "unlucky" 5% whose interval did not.
[Figure: population parameters μ, σ and sample statistics X-bar, s.]
CIs from Minitab
When Minitab has raw data, it is simple to calculate the confidence intervals.
Mean, Standard Deviation
  Stat>Basic Statistics>Display Descriptive Statistics (with Graphical Summary)
Proportion
  Stat>Basic Statistics>1 Proportion
CI Exercise 2
1. Find CI for Mean and Standard Deviation for data in: CI_Exercise_2.MTW
   Note also that Minitab calculates the CI for the Median.
   Repeat at the 90% confidence level.
2. Calculate the 95% CI for the proportion of defects from a process based on a sample of 431 items where 48 were found to be defective.
CIs without Raw Data
Without raw data, Minitab cannot calculate CIs.
For continuous data, given the mean, standard deviation, sample size, and confidence level, the formulas are pretty straightforward.
A formula also exists for CIs for proportions using the summarized count data for defects and sample size.
Reference:
Quality Engineering Statistics by Robert A. Dovich.
The Black Belt Memory Jogger, p. 143ff.
CI for Mean of Continuous Data
(σ known or >50 Samples)
$$CI = \bar{X} \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$
$\bar{X}$ = mean of data
$\sigma$ = population standard deviation
$n$ = sample size
$Z_{\alpha/2}$ = normal distribution value for a given confidence level
WARNING!!!!
This formula only applies when σ is known, which is rare. If the sample size is large (exceeds 50), it is a good approximation.
Example
Mean = 2.15
Standard Deviation = 0.8
Sample Size = 55
α = 0.05
What is the confidence interval of the mean for this situation?
Mean of Continuous Data
(σ known or >50 Samples)
Answer to Example:
$\bar{X}$ = 2.15, $\alpha$ = 0.05, $\sigma$ = 0.8, $n$ = 55
$$CI = 2.15 \pm Z_{0.05/2} \frac{0.8}{\sqrt{55}} = 2.15 \pm 1.96 \cdot \frac{0.8}{\sqrt{55}} = 2.15 \pm 0.211$$
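The same arithmetic can be reproduced in a few lines of Python. This is a minimal sketch: `z_interval` is a hypothetical helper name, and the Z value comes from the standard library's `statistics.NormalDist` rather than from a printed table.

```python
import math
from statistics import NormalDist

def z_interval(mean, sigma, n, alpha):
    """CI for the mean when sigma is known (or n is large)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    half_width = z * sigma / math.sqrt(n)
    return mean - half_width, mean + half_width, half_width

low, high, half_width = z_interval(mean=2.15, sigma=0.8, n=55, alpha=0.05)
print(f"2.15 +/- {half_width:.3f}  ->  ({low:.3f}, {high:.3f})")
```

The half-width agrees with the 0.211 in the worked answer, giving an interval of roughly 1.94 to 2.36.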
CI for Mean of Continuous Data
$$CI = \bar{X} \pm t_{\alpha/2,\,n-1} \frac{s}{\sqrt{n}}$$
This formula has no restrictions on sample size. It is based on the Student's t Distribution.
$\bar{X}$ = mean
$s$ = standard deviation
$n$ = sample size
$\nu$ = degrees of freedom, used in some tables and calculated as n-1 for this test.
$t_{\alpha/2,\,n-1}$ = value from t distribution
This t value comes from the t table using the column for the alpha risk divided by 2 (risk divided between each tail) and the row for the degrees of freedom, n-1.
What is this t-distribution?
The t-distributions comprise a family of distributions with one extra parameter (degrees of freedom, where df = sample size - 1).
They are similar in shape to the normal distribution (symmetric and bell-shaped), although wider, with heavier tails.
Used for estimating population parameters when the sample size is small (<50).
The smaller the sample size, the heavier the distribution tails.
[Figure: t-distribution with 4 d.f. (n = 5), frequency versus t from -3 to 3; the area to the right of t = 2.78 is 0.025.]
For smaller sample sizes, the uncertainty (as a multiple of s) is larger because:
(a) the 1/sqrt(n) factor is larger
(b) the t critical point is larger
Selected t-values
Here are values from the t-distribution for various sample sizes (for 95% confidence intervals):
Sample Size    t-value (α/2 = 0.025)
2              12.71
3              4.30
5              2.78
10             2.26
20             2.09
30             2.05
100            1.98
1000           1.96
As the sample size increases, what happens to the t-value?
Why is a sample size of 1 not in the table?
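For readers without a t table at hand, the values above can be approximated numerically. The sketch below is one possible approach, not how Minitab or any printed table derives them: it integrates the Student's t density with Simpson's rule and uses bisection to find the point with 0.025 of the area to its right. All function names, the step count, and the search range (0 to 100) are illustrative choices.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with `df` degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on the interval [0, t]."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def t_critical(n, tail_area=0.025):
    """t value with `tail_area` to its right, for sample size n (df = n - 1)."""
    low, high = 0.0, 100.0  # search range; wide enough for this table
    for _ in range(40):  # bisection
        mid = (low + high) / 2
        if upper_tail(mid, n - 1) > tail_area:
            low = mid  # too much area remains to the right: move right
        else:
            high = mid
    return (low + high) / 2

t_values = {n: t_critical(n) for n in (2, 3, 5, 10, 20, 30, 100, 1000)}
for n, t in t_values.items():
    print(f"n={n:5d}  t={t:.2f}")
```

With n = 1 there are zero degrees of freedom, so the density is undefined and no t value exists, which bears on the second question above.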
Mean of Continuous Data
Same example as before, but what if only 25 samples instead of 55:
$\bar{X}$ = 2.15, $\alpha$ = 0.05, $s$ = 0.8, $n$ = 25
$$CI = 2.15 \pm t_{0.05/2,\,24} \frac{0.8}{\sqrt{25}} = 2.15 \pm 2.064 \cdot \frac{0.8}{\sqrt{25}} = 2.15 \pm 0.330$$
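A short Python sketch of this calculation follows. The helper name `t_interval` is hypothetical, and the t value (2.064 for α/2 = 0.025 with 24 degrees of freedom) is passed in as a table lookup, since the Python standard library has no t distribution.

```python
import math

def t_interval(mean, s, n, t_crit):
    """CI for the mean with sigma unknown: mean +/- t * s / sqrt(n)."""
    half_width = t_crit * s / math.sqrt(n)
    return mean - half_width, mean + half_width, half_width

# t table value for alpha/2 = 0.025 and n - 1 = 24 degrees of freedom.
T_025_24 = 2.064

low, high, half_width = t_interval(mean=2.15, s=0.8, n=25, t_crit=T_025_24)
print(f"2.15 +/- {half_width:.3f}  ->  ({low:.3f}, {high:.3f})")
```

Compared with the 55-sample case (± 0.211), the interval is wider both because √n is smaller and because the t value exceeds 1.96.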
Now You Try!
Mean = 15.82
Standard Deviation = 6.54
Sample Size = 30
α = 0.01
What is the confidence interval of the mean for this situation?
Variance of Continuous Data
The variance CI is based on the χ² Distribution. Since it is based on this distribution, the CI will not be symmetrical!
$$\frac{(n-1)s^2}{\chi^2_{\alpha/2,\,n-1}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-\alpha/2,\,n-1}}$$
Where
$n$ = sample size
$s^2$ = variance
$\alpha$ = risk
$\chi^2_{\alpha/2,\,n-1}$ = χ² lookup value
$\chi^2_{1-\alpha/2,\,n-1}$ = χ² lookup value
Similar to the t table, the alpha term indicates which column of the table to use and the degrees of freedom term which row.
Variance of Continuous Data
Example:
n = 25, s = 0.673, α = 0.10
$$\frac{(25-1)(0.673)^2}{\chi^2_{0.10/2,\,25-1}} \le \sigma^2 \le \frac{(25-1)(0.673)^2}{\chi^2_{1-0.10/2,\,25-1}}$$
$$0.30 \le \sigma^2 \le 0.78$$
What does this mean?
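The example can be reproduced as follows. This is a sketch with the χ² lookup values entered by hand from a standard chi-square table for 24 degrees of freedom (36.415 with area 0.05 to its right, 13.848 with area 0.95 to its right); the function and constant names are hypothetical.

```python
def variance_interval(s, n, chi2_upper, chi2_lower):
    """CI for sigma^2: (n-1)s^2/chi2_upper <= sigma^2 <= (n-1)s^2/chi2_lower."""
    scaled = (n - 1) * s ** 2
    return scaled / chi2_upper, scaled / chi2_lower

# Chi-square table values for 24 degrees of freedom (alpha = 0.10).
CHI2_005_24 = 36.415  # area 0.05 to the right
CHI2_095_24 = 13.848  # area 0.95 to the right

var_low, var_high = variance_interval(0.673, 25, CHI2_005_24, CHI2_095_24)
print(f"{var_low:.2f} <= sigma^2 <= {var_high:.2f}")
print(f"{var_low ** 0.5:.2f} <= sigma <= {var_high ** 0.5:.2f}")
```

The sample variance, 0.673² ≈ 0.453, sits closer to the lower limit than to the upper one, which illustrates the asymmetry of this interval. Taking square roots of both limits gives the corresponding interval for the standard deviation.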
Proportions
The proportion CI is based on the fact that the Z Distribution is a fairly good approximation of the binomial distribution at reasonable sample sizes. This formula only applies to sample sizes of 30 or more. Minitab does an exact calculation and is a better tool.
$$p \pm Z_{\alpha/2} \sqrt{\frac{p(1-p)}{n}}$$
Where
$p$ = average proportion seen in the sample
$n$ = sample size
$\alpha$ = risk
Proportions Example
n = 700, #defectives = 16, α = 0.10
$$0.023 \pm Z_{0.10/2} \sqrt{\frac{0.023(1-0.023)}{700}} = 0.023 \pm 1.645 \sqrt{\frac{0.023(1-0.023)}{700}} = 0.023 \pm 0.0093$$
What does this mean?
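The proportion example can be checked with a short Python sketch. It applies the normal-approximation formula only, not the exact calculation Minitab performs, and `proportion_interval` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def proportion_interval(defectives, n, alpha):
    """Normal-approximation CI for a proportion: p +/- Z * sqrt(p(1-p)/n)."""
    p = defectives / n
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.645 for alpha = 0.10
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, half_width

p, half_width = proportion_interval(defectives=16, n=700, alpha=0.10)
print(f"p = {p:.4f} +/- {half_width:.4f}")
```

The result is a defect proportion of about 2.3% with a margin of roughly 0.9 percentage points, so at 90% confidence the process defect rate plausibly lies between about 1.4% and 3.2%.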
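Formula Reference
Confidence Intervals for:
Mean (σ known or at least 50 samples): $\bar{x} \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
Mean (σ unknown): $\bar{x} \pm t_{\alpha/2,\,n-1} \frac{s}{\sqrt{n}}$
Standard Deviation: $s \sqrt{\frac{n-1}{\chi^2_{\alpha/2,\,n-1}}} \le \sigma \le s \sqrt{\frac{n-1}{\chi^2_{1-\alpha/2,\,n-1}}}$
Proportions: $p \pm Z_{\alpha/2} \sqrt{\frac{p(1-p)}{n}}$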
Summary of Confidence Intervals
Confidence Intervals provide realistic bounds for parameter estimates, i.e., an interval of plausible values.
If we have raw data, Minitab will calculate the CI for us.
If we don't have the raw data, using the mean, standard deviation, sample size, and confidence level we can still calculate confidence intervals for parameters such as the mean and standard deviation of the population. We can also use the sample size and number defective to calculate the CI for proportions.
The factors that affect the width of a CI are:
Variation
Sample Size
Risk
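The effect of each factor can be seen by varying one input at a time. The sketch below uses the large-sample formula for the mean, 2 × Z × s / √n, purely as an illustration; the baseline numbers are borrowed from the earlier mean example, and the other values are arbitrary.

```python
import math
from statistics import NormalDist

def ci_width(s, n, alpha):
    """Full width of the large-sample CI for the mean: 2 * Z * s / sqrt(n)."""
    return 2 * NormalDist().inv_cdf(1 - alpha / 2) * s / math.sqrt(n)

baseline = ci_width(s=0.8, n=55, alpha=0.05)
more_variation = ci_width(s=1.6, n=55, alpha=0.05)  # doubling s doubles the width
larger_sample = ci_width(s=0.8, n=220, alpha=0.05)  # 4x the sample halves the width
less_risk = ci_width(s=0.8, n=55, alpha=0.01)       # lower risk widens the interval

for label, width in [("baseline", baseline),
                     ("more variation", more_variation),
                     ("larger sample", larger_sample),
                     ("less risk", less_risk)]:
    print(f"{label:15s} width = {width:.3f}")
```

Width grows in direct proportion to the variation, shrinks with the square root of the sample size, and grows as the accepted risk α is reduced.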
CI Exercise 3
In the top 25 markets, the Government reports the $ value of mortgages funded each month. This value is fairly inaccurate the first time it is reported, but the numbers are revised monthly as the figures firm up.
We need to calculate our market share in these markets, but since the Government's numbers keep changing, we are not sure when it is safe to do it.
The numbers keep changing, so who really knows the truth? We have made a decision that surely by five months after the fact, the numbers are good. But do we have to wait a full five months to calculate our market share?
CI Exercise 3
Using confidence intervals, determine in which month we can first have 95% confidence that on average 90% of the final mortgage values have been reported.
The data we are using are for the month of February 2003 and are found in: CI_Exercise_3.MTW.
All of the data are for the month of Feb-03. The columns represent the Feb-03 dollar values as updated and reported in the month indicated by the column label.
CI Exercise 3
Since we have decided that the fifth month (Jul-03) is as accurate as Feb-03 will ever get, we have divided the reported amount for each month by the July amount to get the proportion that was reported in that month. Therefore July is 1.000.
As an example, in Atlanta only 27.6% of the final value was reported in March. 81.8% of the final value was reported in April, 92.8% in May, and 99.6% in June.
Using the data and confidence intervals, determine how many months we would have to wait to be 95% confident that in the worst case a minimum of 90% of the final value (based on July) has been reported.
Black Belt Key Learnings
Does this tool have an application to my current project?
__________________________________________________________________________
__________________________________________________________________________
This tool can help me answer the following questions:
__________________________________________________________________________
__________________________________________________________________________
What are the key learnings about this tool and/or subject?
__________________________________________________________________________
__________________________________________________________________________
How comfortable will I be in training my team on this tool?
__________________________________________________________________________
__________________________________________________________________________
