
PROBABILITY & STATISTICAL INFERENCE
LECTURE 4
MSc in Computing (Data Analytics)

Lecture Outline
Recap of Statistical Inference
Central Limit Theorem
Confidence Intervals
Section Takeaways
Statistical Analysis Process
[Diagram: a representative sample is drawn from the population; sample statistics describe the sample and are used to make inferences about the population]
Populations vs. Samples
How do Irish voters intend voting in the next election?
The voting population of Ireland: 2,680,000 [1]

A sample of 1,008 adults was taken and surveyed for their voting intention in the next election [2]

1. Source: http://www.nationmaster.com/graph/dem_pre_ele_vot_age_pop-presidential-elections-voting-age-population
2. http://redcresearch.ie/wp-content/uploads/2012/01/Report.pdf
Populations vs. Samples
How do Irish voters intend voting in the next election?
1,008 voters were asked how they intended to vote in the next election:
Fine Gael: 30%
Labour: 14%
Fianna Fail: 18%
Sinn Fein: 17%
Other: 21%
Populations vs. Samples
The term population is used in statistics to represent
all possible measurements or outcomes that are of
interest to us in a particular study or piece of
analysis
In the example the population of interest was the voting
intentions of all voters in Ireland
The term sample refers to a subset of the
population that is selected for analysis
In the example the polling company selected a sample
of 1,008 voters
Sampling
In choosing a sample it is important that it is
representative of the population
No bias should exist in the sample
There are a number of sampling methods available
to ensure that your data is representative
A simple random sample is the most straightforward of these methods
Statistical Inference
The statistical methods used to draw conclusions about populations, based on the statistics describing a sample, are known as statistical inference
We want to make decisions based on evidence from
a sample i.e. extrapolate from sample evidence to
a general population
To make such decisions we need to be able to
quantify our (un)certainty about how good or bad
our sample information is
Statistical Inference
Statistical Inference is divided into two major areas:
Parameter Estimation: This is where sample statistics
are used to estimate population parameters
Hypothesis Testing: A statistical hypothesis is a
statement about the parameters of one or more
populations. Hypothesis testing tests whether a
hypothesis is supported by data collected
Population Statistics Point Estimation
The population mean is denoted by μ (mu). In general, given a sufficiently large sample, we use the sample mean x̄ as a point estimate of μ.

The population variance is denoted by σ² (sigma-squared). In general, given a sufficiently large sample, we use the sample variance s² as a point estimate of σ².

Population Statistics Point Estimation
An estimate of the proportion, p, of items in a population that belong to a class of interest is calculated as:

p̂ = c / n

where c is the number of items in a random sample of size n that belong to the class of interest. This is known as the sample proportion.
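A minimal Python sketch (using numpy) of these point estimates: it computes x̄, s² and p̂ from a sample. The sample values and the 6,000-hour cut-off below are purely illustrative, not data from the lecture.

import numpy as np

# Hypothetical sample of component lifetimes in hours (illustrative data only)
sample = np.array([5200, 6100, 4800, 7300, 5900, 6400, 5100, 6800])

x_bar = sample.mean()            # sample mean: point estimate of mu
s_squared = sample.var(ddof=1)   # sample variance s^2: point estimate of sigma^2
                                 # (ddof=1 divides by n-1 rather than n)

c = np.sum(sample > 6000)        # count of items in the class of interest
n = len(sample)
p_hat = c / n                    # sample proportion: point estimate of p

print(x_bar, s_squared, p_hat)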
Central Limit Theorem
Demonstration

Central Limit Theorem Explained by
Example
The distribution shown is a Poisson distribution with λ = 3.
This could represent the distribution of the number of clicks on a particular link in one second.
Take 200 samples, each with a large sample size.
Calculate the mean of each sample.
Central Limit Theorem
What has happened?

As the sample sizes increased the shape of
the histogram of means tended towards a
normal distribution

As the sample sizes increased the spread
(standard deviation) between the sample
means decreased
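
A minimal simulation of this demonstration, assuming the Poisson(λ = 3) "clicks per second" population above: for a few sample sizes, draw 200 samples, compute the mean of each, and plot histograms of those means. The particular sample sizes chosen here are illustrative.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

for sample_size in (5, 30, 200):
    # 200 samples from a Poisson(3) population, each of the given size
    samples = rng.poisson(lam=3, size=(200, sample_size))
    means = samples.mean(axis=1)                     # one mean per sample
    plt.hist(means, bins=20, alpha=0.5,
             label=f"n = {sample_size}, sd of means = {means.std(ddof=1):.3f}")

plt.xlabel("Sample mean")
plt.legend()
plt.show()   # larger n: the histogram of means looks more Normal and is narrower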


Central Limit Theorem
These histograms are pictures of The Sampling
Distribution of the Mean

This phenomenon will happen in ALL cases

The proof of this is called the Central Limit Theorem (CLT) and involves some fairly non-trivial mathematics

Definition: Central Limit Theorem (continued)
The sampling distribution of the mean has an average value equal to μ (the population mean).

The sampling distribution of the mean has a standard deviation of σ/√n, where σ is the population standard deviation and n is the sample size taken.

This value is called the standard error of the mean.

The Sampling Distribution of the Mean will be a Normal distribution if the sample size is large.
Central Limit Theorem - Definition
If a random sample is taken from a population, where:
Each member of the sample can be considered to be independent of each other
They are all members of the same population
That population has a mean value μ and a standard deviation σ
Then...
Central Limit Theorem - Definition
The central limit theorem states that, given a distribution with a mean μ and variance σ², the sampling distribution of the mean approaches a normal distribution with a mean μ and a variance σ²/n as n, the sample size, increases.

This is a non-mathematical definition of the Central Limit Theorem (CLT).

The Distribution of the Sample Means:

x̄ ~ N(μ, σ²/n), i.e. the sample mean is approximately Normal with mean μ and standard deviation σ/√n
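The two claims in this definition can be checked numerically with a short simulation. This sketch again assumes the Poisson(3) population from the earlier demonstration (for which μ = 3 and σ = √3); the sample size of 50 is an arbitrary choice.

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 3.0, np.sqrt(3.0), 50      # Poisson(3) has mean 3 and variance 3

samples = rng.poisson(lam=mu, size=(10_000, n))
means = samples.mean(axis=1)              # simulated sampling distribution of the mean

print("mean of sample means:", means.mean())        # should be close to mu = 3
print("sd of sample means:  ", means.std(ddof=1))   # should be close to sigma/sqrt(n)
print("sigma / sqrt(n):     ", sigma / np.sqrt(n))  # the standard error of the mean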
Confidence Intervals
How can we use the CLT?
The Central Limit Theorem avoids the necessity of specifying a complete statistical model for all the sampled data.
All we have to do is specify a probability model for the sample mean.
For any sample mean, calculated from a large independent random sample taken from any population with a mean μ and standard deviation σ, we know from the CLT that this sample mean is a random variable from a Normal distribution with mean = μ and standard deviation = σ/√n.
Practical use for the CLT continued
Take a single sample and calculate x̄.

This is an estimate of the true (but unknown) population mean μ.

But, how good is this estimate?

We assume that x̄ is not exactly μ, but is somewhere near μ; but how near is it likely to be?
Confidence Intervals Introduction
We would like to make probability statements as to how close x̄ is likely to be to μ.

If the sample size is sufficiently large then the estimate x̄ can be considered as a random variable from a Normal distribution, so probability statements are possible.

This is how we use the CLT in practical data analysis.
Confidence Intervals Introduction
For a Normal distribution, we know that 95% of values will be within 1.96 standard deviations of the mean μ.
So, given one estimate x̄, we can say that this estimate is within 1.96 standard errors of the actual population mean μ, with 95% confidence.

[Figure: Normal curve with 95% of the area shaded, within 1.96 standard errors either side of the mean]

We can turn this knowledge on its head: given x̄ we can be 95% confident that the true mean μ is within 1.96 standard errors of it.
Confidence Interval
From this we can specify a range of values within which we are 95% confident that the population mean (μ) lies.
This is called a confidence interval.

95% Confidence Interval for a population mean (from a large enough sample):

x̄ ± 1.96 × σ/√n,   i.e.   x̄ ± 1.96 × standard error

Remarkably, this result holds for samples of size 30 or more. So, a large sample in this context is a sample of 30 or more.
Example
One sample of size 30 from the electronic components yields a sample mean x̄ = 5,873 hours. We know σ = 3,959, so a 95% confidence interval would be:

x̄ ± 1.96 × standard error = x̄ ± 1.96 × σ/√n = 5,873 ± 1.96 × 3,959/√30 = 5,873 ± 1,417 = 4,456 to 7,290 hours

Interpretation: we would say that the average lifetime of all components (μ) is between 4,456 and 7,290 hours with 95% confidence.
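A sketch of this calculation in Python, using the numbers from the slide (x̄ = 5,873, σ = 3,959, n = 30); it should reproduce the ±1,417 half-width and the 4,456 to 7,290 interval up to rounding.

import math

x_bar, sigma, n = 5873, 3959, 30
z = 1.96                                   # for 95% confidence

standard_error = sigma / math.sqrt(n)
half_width = z * standard_error            # about 1,417 hours

print(f"95% CI: {x_bar - half_width:.0f} to {x_bar + half_width:.0f} hours")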
Confidence Intervals
Why is this any good?
Before: one estimate, x̄ = 5,873, but no idea of how good or bad it was, i.e. how close to μ it was likely to be.
Now: 95% confident that μ is between 4,456 and 7,290 hours.
So, using the CLT leads to Confidence Intervals, which enable us to estimate a statistic with a certain level of confidence.
In other words, it gives us an objective measure of the actual amount of information contained in our sample about the likely location of μ.

Problem with σ
All of the above assumes that the population standard deviation (i.e. σ) is known.

In practice this is not known (just like μ).
So, we need to estimate σ as well as μ.
We get this estimate from the standard deviation of the sample, given that the sample is large enough.
The sample standard deviation is called s.

Estimate σ by s:

s = √( Σ(x − x̄)² / (n − 1) )
General Confidence Interval for μ (Large Samples)
The general formula is:

CI(1−α) = x̄ ± z(1−α/2) × s/√n

Where:
α is a value between 0 and 1,
(1−α) × 100% is the confidence level you want,
z(1−α/2) is a value from the Normal distribution table.

Example: for a 95% CI, α = 0.05
(1−α) × 100% = 95%
z(1−α/2) = 1.96
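A minimal sketch of the large-sample formula as a reusable function, taking z(1−α/2) from scipy rather than a table. The function name and the illustrative data are assumptions for the example, not part of the lecture.

import numpy as np
from scipy import stats

def mean_ci_large_sample(data, confidence=0.95):
    """Large-sample CI for the mean: x_bar +/- z(1-alpha/2) * s/sqrt(n)."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    x_bar = data.mean()
    s = data.std(ddof=1)                   # sample standard deviation s
    alpha = 1 - confidence
    z = stats.norm.ppf(1 - alpha / 2)      # e.g. 1.96 for 95% confidence
    half_width = z * s / np.sqrt(n)
    return x_bar - half_width, x_bar + half_width

# Usage with illustrative data (n of at least ~30 for the large-sample result to apply):
rng = np.random.default_rng(1)
lifetimes = rng.exponential(scale=6000, size=40)
print(mean_ci_large_sample(lifetimes, confidence=0.95))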
Z-Values
The values of z(1−α/2) for other % confidence intervals are given in standard tables.

Confidence Level    α/2              z(1−α/2)
90%                 0.05 (5%)        1.6449
95%                 0.025 (2.5%)     1.96
99%                 0.005 (0.5%)     2.5758
99.9%               0.0005 (0.05%)   4.4172

Example
Using these we get the following results for the electronic component example:

Confidence Level    z(1−α/2)   CI
90%                 1.6449     4,681 to 7,065
95%                 1.96       4,456 to 7,290
99%                 2.5758     4,011 to 7,735
99.9%               4.4172     2,679 to 9,067

Note: as α gets smaller the CI gets wider.
Also, at the same time, as n gets bigger the CI narrows. So big samples lead to more precise estimates (i.e. narrower confidence intervals).

What CIs and sample sizes should I use?
You can't control s: it is inherent in the data (population).
You can't control x̄ either.
You can control z(1−α/2), but in practice scientific convention sets this to reflect 90%, 95% or 99% confidence, with 95% being the accepted default.
You can choose n, but resources may limit you.
There is a whole topic called sample size determination which you may want to review before collecting data or starting research.
Confidence Interval Assumptions
Sample size 40 or greater

Experimental units are independent of each other

Experimental units were randomly sampled

The independence assumption requires that the value of the variable for one experimental unit should not tell us anything about the value of another.

Randomness is required to avoid systematic bias in
selection.

Exercise
Complete Exercise 1 & 2
Calculation of CIs for small samples
What about small samples?

In the case of CIs about a mean we can use the
Student-t distribution.

The process turns out to be very similar, but the CLT no longer applies

History of the Student t test
William Gosset used the publishing pseudonym Student.
He derived the correct sampling distribution for the mean
of samples < 40 and called it the t distribution.

In his honour, it is often called the Student t distribution.

Gosset was a chief brewer for Guinness.

The mathematical details are complicated, but it turns out that we perform exactly the same calculations as before, with the one change that the t distribution is used instead of the normal distribution.

Assumptions
Student's t result only referred to a mean where the distribution of the population was normally distributed with some mean μ and finite standard deviation σ.

This is in contrast to the CLT for large samples that
required no such assumption about normality.

The t-test also requires the assumption regarding
independence in the sample.

Statistical Model for mean from small
samples
The experimental units are independently sampled from a population with mean = μ and standard deviation = σ.

The population is normally distributed (we don't need this with large samples).

So, to use the t-test for a small sample, you need to establish that the data is sampled from a population that is normally distributed. You could look at the histogram of the sample and see if it is symmetric and bell shaped, or use other methods.
The t-Statistic
If the assumptions are met, the statistic:

t = (x̄ − μ) / (s / √n)

can be shown to be distributed according to a (Student) t-distribution.

The t-distribution has one parameter, called degrees of freedom (df).
The t-Distribution
The t-distribution itself is bell shaped and symmetric
just like the normal distribution but is flatter.

There are many t distributions: one for each sample size.

The rule used is: for a sample of size n, use the t distribution with degrees of freedom = n − 1.
Example: if the sample size is 15, then use a t distribution with degrees of freedom 15 − 1 = 14.

Note: degrees of freedom is often abbreviated to df.
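The df = n − 1 rule can be checked with scipy: for a sample of size 15 (df = 14) and a 95% interval, the t critical value should come out at about 2.145, noticeably larger than the Normal value of 1.96.

from scipy import stats

n = 15
df = n - 1                                 # degrees of freedom = n - 1 = 14
alpha = 0.05                               # for a 95% confidence interval

t_crit = stats.t.ppf(1 - alpha / 2, df)    # about 2.145
z_crit = stats.norm.ppf(1 - alpha / 2)     # 1.96, for comparison
print(t_crit, z_crit)                      # t is larger, reflecting the flatter, heavier-tailed shape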

[Figure: probability density curves for Normal(0,1), t(df=4) and t(df=1)]
The t-Distribution
The t probability density function with k degrees of freedom:

f(x) = Γ((k+1)/2) / ( √(kπ) × Γ(k/2) ) × (1 + x²/k)^(−(k+1)/2)
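As a quick sanity check of this formula, the sketch below evaluates it directly (using math.gamma) and compares it with scipy's built-in t density; the test points and the df value are arbitrary.

import math
from scipy import stats

def t_pdf(x, k):
    # The t density with k degrees of freedom, written out from the formula above
    coeff = math.gamma((k + 1) / 2) / (math.sqrt(k * math.pi) * math.gamma(k / 2))
    return coeff * (1 + x * x / k) ** (-(k + 1) / 2)

for x in (-2.0, 0.0, 1.5):
    print(t_pdf(x, k=4), stats.t.pdf(x, df=4))   # the two values should agree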
General Confidence Interval for μ (Small Samples)
The general formula is:

CI(1−α) = x̄ ± t(n−1, 1−α/2) × s/√n

Where (1−α) × 100% is the confidence level you want and t(n−1, 1−α/2) is a value from the t distribution with df = n−1 and the specified α level.

What is t(n−1, 1−α/2)?
A value from the t distribution with n−1 df such that 100(1−α)% of values lie within that range around the mean.
How do you find t(n−1, 1−α/2)?
From a table specifically designed to give it to you, or use a computer.

Confidence Level    α/2              t(df=1)   t(df=10)   t(df=30)
90%                 0.05 (5%)        6.314     1.812      1.697
95%                 0.025 (2.5%)     12.71     2.228      2.042
99%                 0.005 (0.5%)     63.66     3.169      2.750
99.9%               0.0005 (0.05%)   636.6     4.587      3.646

Note: as α gets smaller the CI gets wider;
as df gets smaller the CI gets wider.
Example
Internal temperature of autoclaved aerated concrete used in building. An engineer recorded the following data:
23.01, 22.22, 22.04, 22.62, 22.59
95% CI for the population mean?

CI(1−α) = x̄ ± t(n−1, 1−α/2) × s/√n = 22.5 ± 2.776 × 0.3793/√5 = 22.5 ± 0.4696 = (22.03, 22.97)
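The same calculation in code, using the five temperature readings above; scipy's t quantile stands in for the table value 2.776, and the result should match (22.03, 22.97) up to rounding.

import numpy as np
from scipy import stats

temps = np.array([23.01, 22.22, 22.04, 22.62, 22.59])

n = len(temps)
x_bar = temps.mean()                        # about 22.5
s = temps.std(ddof=1)                       # about 0.38
t_crit = stats.t.ppf(0.975, df=n - 1)       # 2.776 for df = 4

half_width = t_crit * s / np.sqrt(n)
print(x_bar - half_width, x_bar + half_width)   # roughly (22.03, 22.97)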
Exercise
Answer Questions 3-6
Confidence Intervals for Proportions
(Large Samples)
Proportions (including %) are often a statistic of interest.

Think of the proportion of defective items on a production line, the proportion of people who respond favourably to a survey question, or the proportion of successes versus failures in some experiment.

Proportions are also covered by the CLT - remember that a proportion is a different kind of average.

Confidence Intervals for Proportions
(Large Samples)
Take a sample of size n of electronic components coming off a production line, and test each one for defects. The statistic of interest is the proportion of defectives produced by the production process.

The estimated proportion from the sample is:

p̂ = (No. of Defectives in the Sample) / n   (where n is the total sample size)

where p̂ (p-hat) is the symbol used for the estimated proportion from the sample.
Confidence Intervals for Proportions
(Large Samples)
If the sample size is sufficiently large and we repeat the experiment a large number of times, then:
The sampling distribution of the proportion will be normally distributed, by the CLT
The mean of this distribution will be p, i.e. the 'true' population proportion
The standard deviation of the sampling distribution of the proportion, called the standard error of the proportion, is estimated by:

S.E. of proportion = √( p̂(1 − p̂) / n )
Example:
A pharmaceutical company produces 400,000
capsules per day of a particular drug. They test
200 of the capsules for defects (too much/little
active compound). If the population p = 0.05, and they take 10,000 repeated samples, the histogram of the sample proportions they would get is approximately Normal and centred on p = 0.05.
Sample Size
How big does the sample have to be for the CLT to
work with proportions?
The rule is different from the rule for means. Do the following test.
A rule of thumb: the sample size is big enough if
1. np > 5 and
2. n(1-p) > 5

General Confidence Interval Formula for a Population Proportion (Large Sample)

CI(1−α) = p̂ ± z(1−α/2) × √( p̂(1 − p̂) / n )

where (1−α) × 100% is the confidence level and z(1−α/2) is a value from the standard normal distribution such that 100(1−α)% of values of a standard normal distribution lie within that range around the mean.

So the z(1−α/2) values used for a population proportion are the same as those used for a population mean.
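A minimal sketch of this formula as a function; the name is illustrative, and it assumes the large-sample rule of thumb (np̂ > 5 and n(1 − p̂) > 5) has already been checked. The usage line reuses the capsule example figures (10 defectives out of 200, i.e. p̂ = 0.05).

import math
from scipy import stats

def proportion_ci(successes, n, confidence=0.95):
    """Large-sample CI for a proportion: p_hat +/- z(1-alpha/2) * sqrt(p_hat(1-p_hat)/n)."""
    p_hat = successes / n
    alpha = 1 - confidence
    z = stats.norm.ppf(1 - alpha / 2)
    se = math.sqrt(p_hat * (1 - p_hat) / n)   # standard error of the proportion
    return p_hat - z * se, p_hat + z * se

print(proportion_ci(10, 200))   # e.g. 10 defective capsules out of 200 sampled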
Example
How many voters will give F.F. a first preference in the next general election? There are 2 different estimates:
Researcher A (10 people) => 40%
Researcher B (100 people) => 25%
How much 'better' is estimate B than estimate A?
Step one: Can we use the large-sample formula?
1. Researcher A: np = 10 × 0.4 = 4 => 4 is not greater than 5, therefore you cannot use the large-sample method
2. Researcher B: np = 100 × 0.25 = 25 and n(1 − p) = 100 × (1 − 0.25) = 75; both figures are greater than 5, therefore you can use the large-sample method
Example Continued
Researcher B - 95% Confidence Interval:

CI(95) = p̂ ± z(1−α/2) × √( p̂(1 − p̂) / n )
       = 0.25 ± 1.96 × √( 0.25 × 0.75 / 100 )
       = 0.25 ± 1.96 × 0.04
       = 0.25 ± 0.08
       = 0.17 to 0.33

So, the 95% CI is 17% to 33%.
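The same calculation in a few lines of code; keeping the unrounded standard error (about 0.0433 rather than 0.04) gives roughly 0.165 to 0.335, consistent with the 17% to 33% quoted above.

import math

p_hat, n, z = 0.25, 100, 1.96

se = math.sqrt(p_hat * (1 - p_hat) / n)    # about 0.0433
half_width = z * se                        # about 0.085 (0.08 with the rounded SE)
print(p_hat - half_width, p_hat + half_width)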
Example Continued
NB: In fact we can get a 95% CI for Researcher A's findings using small sample theory (an exact CI) - this is available in SAS and other software:
Exact CIs are often based on direct use of probability models.
The method is based directly on calculations for the binomial distribution (see lecture 3).
What do we have to do?
Using the CLT, we found that the 95% CI was composed of the set of values for the mean such that a hypothesis test would not reject the null hypothesis for any of those values in the set, using the α = 0.05 level.

Using SAS we can calculate a 95% CI for Researcher A:
95% CI for Researcher A = 12% to 74%
which is too wide to be informative anyway!

If we use the same technique for Researcher B we get:
95% CI for Researcher B = 17% to 35%
which is virtually the same as before using the CLT.
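For reference, an exact (Clopper-Pearson) interval can also be computed outside SAS; this sketch assumes a reasonably recent scipy (1.7 or later), whose binomtest result exposes an exact CI. For Researcher A (4 first preferences out of 10) it should give roughly 12% to 74%, matching the figure above.

from scipy.stats import binomtest

# Researcher A: 4 first-preference answers out of 10 people sampled
result = binomtest(k=4, n=10)
ci = result.proportion_ci(confidence_level=0.95, method="exact")   # Clopper-Pearson
print(ci.low, ci.high)   # roughly 0.12 to 0.74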

Exact CI and tests for population
proportions
These work for small samples as well as large samples.

With large samples they will give essentially the same results as the CLT.

They must be used for small samples, however.

They are based on the binomial probability distribution.

Difference between Exact and CLT
based methods
When sample sizes are large they will give the same results, but exact tests can be very hard to compute even with modern PCs.

When sample sizes are small, exact methods must be used.

The CIs from small samples tend to be very wide: there is no short cut, you need to collect as much high quality data as you can manage.

Exercise
Answer Questions 7-9
