Introduction
COMP 245 STATISTICS
Dr N A Heard
1 Introduction to Statistics
Definition of Statistics
Statistics is the science and practice of developing human knowledge through the use of
empirical data.
Statistical theory is a branch of mathematics using probability theory to model randomness
and uncertainty in data.
Statistical inference is the process of using inductive methods and statistical theory on
sampled data to reason about a defined population.
A statistic is a numerical summary of data.
Population vs. Sample
The previous definitions suggested an important distinction between a sample and a population.
Loosely, we can think of a population as being a large, perhaps infinite, collection of individuals
or objects or quantities in which we are interested. For reasons of generality we would wish to
make inferences about the entire population.
Often it will be impractical or impossible to exhaustively observe every member of a population.
So instead we observe what we hope is a representative sample from the population.
To best ensure the sample is representative and not biased in some way, where possible we
draw the sample at random from the population.
Statistical methods are then used to relate the measurements of the sample to the characteristics
of the entire population.
2 Statistical Modelling
Model-based inference
In this course, we will consider very simple parametric statistical models to represent our
populations of interest.
Statistical inference will thus mean estimating model parameters using our observed
sample.
Likelihood methods will be our main tool for this task. We will learn to calculate the
likelihood of a particular parameter solution given our observed sample.
3 Course Organisation
Teaching & Assessment Methods
Over 9 weeks, Term 1:
2 × 50-minute lectures each week.
1 × 50-minute problems class each week.
1 office hour each week for more involved questions/general problems with the course.
Assessment:
1 assessed coursework (counts as one exam question).
Exam. Answer all 4 questions. Q1 is multiple choice.
Learning Material
Printed handouts of course notes.
9 exercise sheets and solutions.
Course website: http://www2.imperial.ac.uk/~naheard/Comp245/index.html
1. Probabilities for events
For events A, B and C:   P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
More generally:   P(∪_i A_i) = Σ_i P(A_i) − Σ_{i<j} P(A_i ∩ A_j) + Σ_{i<j<k} P(A_i ∩ A_j ∩ A_k) − . . .
The odds in favour of A:   P(A) / P(Aᶜ)
Conditional probability:   P(A | B) = P(A ∩ B) / P(B),   provided that P(B) > 0
Chain rule:   P(A ∩ B ∩ C) = P(A) P(B | A) P(C | A ∩ B)
Bayes rule:   P(A | B) = P(A) P(B | A) / {P(A) P(B | A) + P(Aᶜ) P(B | Aᶜ)}
A and B are independent if P(B | A) = P(B)
A, B and C are independent if P(A ∩ B ∩ C) = P(A)P(B)P(C), and
P(A ∩ B) = P(A)P(B),   P(B ∩ C) = P(B)P(C),   P(C ∩ A) = P(C)P(A)
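As an aside (not part of the original formula sheet), the Bayes rule line can be checked numerically. The probabilities below are made-up illustrative values, not taken from the notes.

```python
# Illustrative check of Bayes rule with made-up numbers.
p_A = 0.3             # P(A)
p_B_given_A = 0.9     # P(B | A)
p_B_given_Ac = 0.2    # P(B | A complement)

# Denominator by the law of total probability.
p_B = p_A * p_B_given_A + (1 - p_A) * p_B_given_Ac

# Bayes rule: P(A | B) = P(A) P(B | A) / P(B).
p_A_given_B = p_A * p_B_given_A / p_B
print(f"P(B) = {p_B:.3f}, P(A | B) = {p_A_given_B:.3f}")
```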
2. Probability distribution, expectation and variance
The probability distribution for a discrete random variable X is called the
probability mass function (pmf) and is the complete set of probabilities {p_x} = {P(X = x)}
Expectation   E(X) = μ = Σ_x x p_x
For a function g(x) of x,   E{g(X)} = Σ_x g(x) p_x,   so   E(X²) = Σ_x x² p_x
Sample mean   x̄ = (1/n) Σ_k x_k   estimates μ from a random sample x_1, x_2, . . . , x_n
Variance   var(X) = σ² = E{(X − μ)²} = E(X²) − μ²
Sample variance   s² = {Σ_k x_k² − (1/n)(Σ_j x_j)²} / (n − 1)   estimates σ²
Standard deviation   sd(X) = σ
If value y is observed with frequency n_y:   n = Σ_y n_y,   Σ_k x_k = Σ_y y n_y,   Σ_k x_k² = Σ_y y² n_y
Skewness   γ_1 = E{((X − μ)/σ)³}   is estimated by   (1/(n − 1)) Σ_i ((x_i − x̄)/s)³
Kurtosis   γ_2 = E{((X − μ)/σ)⁴} − 3   is estimated by   (1/(n − 1)) Σ_i ((x_i − x̄)/s)⁴ − 3
Sample median   x̃ or x_med. Half the sample values are smaller and half larger.
If the sample values x_1, . . . , x_n are ordered as x_(1) ≤ x_(2) ≤ . . . ≤ x_(n),
then x̃ = x_((n+1)/2) if n is odd, and x̃ = (1/2){x_(n/2) + x_((n+2)/2)} if n is even
α-quantile   Q(α) is such that P(X ≤ Q(α)) = α
Sample α-quantile   Q̂(α):   proportion α of the data values are smaller
Lower quartile   Q1 = Q̂(0.25):   one quarter are smaller
Upper quartile   Q3 = Q̂(0.75):   three quarters are smaller
Sample median   x̃ = Q̂(0.5)   estimates the population median Q(0.5)
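The estimators above translate directly into a few lines of code. The following Python sketch is an illustrative addition (the sample is made up); it computes the sample mean, the (n − 1)-divisor sample variance, the skewness estimate and the median exactly as defined above.

```python
import math

x = [7, 2, 4, 12, 5, 9, 3]        # made-up sample
n = len(x)

mean = sum(x) / n                                            # x bar
s2 = (sum(v * v for v in x) - sum(x) ** 2 / n) / (n - 1)     # sample variance
s = math.sqrt(s2)
skew = sum(((v - mean) / s) ** 3 for v in x) / (n - 1)       # estimate of gamma_1

xs = sorted(x)                    # order statistics x_(1) <= ... <= x_(n)
median = xs[n // 2] if n % 2 == 1 else 0.5 * (xs[n // 2 - 1] + xs[n // 2])

print(mean, s2, skew, median)
```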
3. Probability distribution for a continuous random variable
The cumulative distribution function (cdf)   F(x) = P(X ≤ x) = ∫_{x_0 = −∞}^{x} f(x_0) dx_0
The probability density function (pdf)   f(x) = dF(x)/dx
E(X) = μ = ∫_{−∞}^{∞} x f(x) dx,   var(X) = σ² = E(X²) − μ²,   where E(X²) = ∫_{−∞}^{∞} x² f(x) dx
4. Discrete probability distributions
Discrete uniform distribution Uniform(n)
p_x = 1/n   (x = 1, 2, . . . , n);   μ = (n + 1)/2,   σ² = (n² − 1)/12
Binomial distribution Binomial(n, θ)
p_x = (n choose x) θ^x (1 − θ)^(n−x)   (x = 0, 1, 2, . . . , n);   μ = nθ,   σ² = nθ(1 − θ)
Poisson distribution Poisson(λ)
p_x = λ^x e^(−λ) / x!   (x = 0, 1, 2, . . .) (with λ > 0);   μ = λ,   σ² = λ
Geometric distribution Geometric(θ)
p_x = (1 − θ)^(x−1) θ   (x = 1, 2, 3, . . .);   μ = 1/θ,   σ² = (1 − θ)/θ²
5. Continuous probability distributions
Uniform distribution Uniform(α, β)
f(x) = 1/(β − α)   (α < x < β),   0 otherwise;   μ = (α + β)/2,   σ² = (β − α)²/12
Exponential distribution Exponential(λ)
f(x) = λ e^(−λx)   (0 < x < ∞),   0   (−∞ < x ≤ 0);   μ = 1/λ,   σ² = 1/λ²
Normal distribution N(μ, σ²)
f(x) = {1/√(2πσ²)} exp{−(1/2)((x − μ)/σ)²}   (−∞ < x < ∞);   E(X) = μ,   var(X) = σ²
Standard normal distribution N(0, 1)
If X is N(μ, σ²), then Y = (X − μ)/σ is N(0, 1)
6. Reliability
For a device in continuous operation with failure time random variable T having pdf f(t) (t > 0):
The reliability function at time t   R(t) = P(T > t)
The failure rate or hazard function   h(t) = f(t)/R(t)
The cumulative hazard function   H(t) = ∫_{t_0 = 0}^{t} h(t_0) dt_0 = −ln{R(t)}
The Weibull(α, β) distribution has H(t) = α t^β
7. System reliability
For a system of k devices, which operate independently, let
R_i = P(D_i) = P(device i operates)
The system reliability, R, is the probability of a path of operating devices.
A system of devices in series operates only if every device operates:
R = P(D_1 ∩ D_2 ∩ . . . ∩ D_k) = R_1 R_2 . . . R_k
A system of devices in parallel operates if any device operates:
R = P(D_1 ∪ D_2 ∪ . . . ∪ D_k) = 1 − (1 − R_1)(1 − R_2) . . . (1 − R_k)
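The series and parallel formulas are easy to evaluate directly. The sketch below is an illustrative addition with made-up device reliabilities.

```python
from functools import reduce

R = [0.95, 0.90, 0.99]   # made-up reliabilities R_i of k = 3 independent devices

# Series system: operates only if every device operates.
R_series = reduce(lambda a, b: a * b, R)

# Parallel system: operates if any device operates.
R_parallel = 1 - reduce(lambda a, b: a * b, [1 - r for r in R])

print(f"series {R_series:.4f}, parallel {R_parallel:.6f}")
```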
8. Covariance and correlation
The covariance of X and Y   cov(X, Y) = E(XY) − {E(X)}{E(Y)}
From pairs of observations (x_1, y_1), . . . , (x_n, y_n):
S_xy = Σ_k x_k y_k − (1/n)(Σ_i x_i)(Σ_j y_j)
S_xx = Σ_k x_k² − (1/n)(Σ_i x_i)²,   S_yy = Σ_k y_k² − (1/n)(Σ_j y_j)²
Sample covariance   s_xy = S_xy / (n − 1)   estimates cov(X, Y)
Correlation coefficient   ρ = corr(X, Y) = cov(X, Y) / {sd(X) sd(Y)}
Sample correlation coefficient   r = S_xy / √(S_xx S_yy)   estimates ρ
9. Sums of random variables
E(X + Y) = E(X) + E(Y)
var(X + Y) = var(X) + var(Y) + 2 cov(X, Y)
cov(aX + bY, cX + dY) = ac var(X) + bd var(Y) + (ad + bc) cov(X, Y)
If X is N(μ_1, σ_1²), Y is N(μ_2, σ_2²), and cov(X, Y) = c, then X + Y is N(μ_1 + μ_2, σ_1² + σ_2² + 2c)
10. Bias, standard error, mean square error
If t estimates θ (with random variable T giving t):
Bias of t   bias(t) = E(T) − θ
Standard error of t   se(t) = sd(T)
Mean square error of t   MSE(t) = E{(T − θ)²} = {se(t)}² + {bias(t)}²
If x̄ estimates μ, then bias(x̄) = 0,   se(x̄) = σ/√n,   MSE(x̄) = σ²/n,   and the estimated standard error is ŝe(x̄) = s/√n
Central limit property   If n is fairly large, x̄ is from N(μ, σ²/n) approximately
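A quick simulation illustrates the claim se(x̄) = σ/√n. This Python sketch is an illustrative addition; the Exponential population and its parameter are arbitrary choices, not from the notes.

```python
import random, statistics, math

random.seed(1)
n, reps = 30, 5000
lam = 2.0                      # Exponential(lambda) population: mu = sigma = 1/lam

# Draw many samples of size n and record the sample mean of each.
means = [statistics.fmean(random.expovariate(lam) for _ in range(n)) for _ in range(reps)]

sigma = 1 / lam
print("empirical sd of x bar:      ", statistics.pstdev(means))
print("theoretical se = sigma/sqrt(n):", sigma / math.sqrt(n))
```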
11. Likelihood
The likelihood is the joint probability as a function of the unknown parameter θ.
For a random sample x_1, x_2, . . . , x_n:
L(θ; x_1, x_2, . . . , x_n) = P(X_1 = x_1) · · · P(X_n = x_n)   (discrete distribution)
L(θ; x_1, x_2, . . . , x_n) = f(x_1) f(x_2) · · · f(x_n)   (continuous distribution)
The maximum likelihood estimator (MLE) is the value θ̂ for which the likelihood is a maximum
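For example, for an Exponential(λ) sample the log-likelihood is ℓ(λ) = n log λ − λ Σ x_i, maximised at λ̂ = n / Σ x_i = 1/x̄. The sketch below is an illustrative addition (made-up data) checking the closed form against a crude grid search.

```python
import math

x = [0.8, 1.3, 0.4, 2.1, 0.9]     # made-up sample, assumed Exponential(lambda)
n, sx = len(x), sum(x)

def log_lik(lam):
    # log of the product of densities lam * exp(-lam * x_i)
    return n * math.log(lam) - lam * sx

mle_closed_form = n / sx                       # lambda hat = 1 / sample mean
grid = [i / 1000 for i in range(1, 5000)]      # crude grid over (0, 5)
mle_grid = max(grid, key=log_lik)

print(mle_closed_form, mle_grid)
```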
12. Confidence intervals
If x_1, x_2, . . . , x_n are a random sample from N(μ, σ²) and σ² is known, then
the 95% confidence interval for μ is (x̄ − 1.96 σ/√n, x̄ + 1.96 σ/√n)
If σ² is estimated, then from the Student t table for t_{n−1} we find t_0 = t_{n−1, 0.05}
The 95% confidence interval for μ is (x̄ − t_0 s/√n, x̄ + t_0 s/√n)
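A minimal Python sketch of the known-σ interval (illustrative addition; the data and the assumed σ are made up, and 1.96 is the value from the standard normal table below):

```python
import math, statistics

x = [5.1, 4.8, 5.6, 5.0, 4.7, 5.3, 5.2, 4.9]   # made-up sample
sigma = 0.3                                     # assumed known population sd
n = len(x)
xbar = statistics.fmean(x)

half_width = 1.96 * sigma / math.sqrt(n)
print(f"95% CI for mu: ({xbar - half_width:.3f}, {xbar + half_width:.3f})")
```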
13. Standard normal table   Values of the pdf φ(y) = f(y) and cdf Φ(y) = F(y)
y    φ(y)  Φ(y)    y    φ(y)  Φ(y)    y    φ(y)  Φ(y)    y      Φ(y)
0 .399 .5 .9 .266 .816 1.8 .079 .964 2.8 .997
.1 .397 .540 1.0 .242 .841 1.9 .066 .971 3.0 .999
.2 .391 .579 1.1 .218 .864 2.0 .054 .977 0.841 .8
.3 .381 .618 1.2 .194 .885 2.1 .044 .982 1.282 .9
.4 .368 .655 1.3 .171 .903 2.2 .035 .986 1.645 .95
.5 .352 .691 1.4 .150 .919 2.3 .028 .989 1.96 .975
.6 .333 .726 1.5 .130 .933 2.4 .022 .992 2.326 .99
.7 .312 .758 1.6 .111 .945 2.5 .018 .994 2.576 .995
.8 .290 .788 1.7 .094 .955 2.6 .014 .995 3.09 .999
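The table values can be reproduced without lookup, since Φ(y) = ½{1 + erf(y/√2)} and φ(y) = exp(−y²/2)/√(2π). A short illustrative Python check:

```python
import math

def phi(y):   # standard normal pdf
    return math.exp(-0.5 * y * y) / math.sqrt(2 * math.pi)

def Phi(y):   # standard normal cdf via the error function
    return 0.5 * (1 + math.erf(y / math.sqrt(2)))

for y in (0.0, 1.0, 1.96, 2.576):
    print(f"y={y}: pdf={phi(y):.3f} cdf={Phi(y):.4f}")
```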
14. Student t table   Values t_{m,p} of x for which P(|X| > x) = p, when X is t_m
m p= 0.10 0.05 0.02 0.01 m p= 0.10 0.05 0.02 0.01
1 6.31 12.71 31.82 63.66 9 1.83 2.26 2.82 3.25
2 2.92 4.30 6.96 9.92 10 1.81 2.23 2.76 3.17
3 2.35 3.18 4.54 5.84 12 1.78 2.18 2.68 3.05
4 2.13 2.78 3.75 4.60 15 1.75 2.13 2.60 2.95
5 2.02 2.57 3.36 4.03 20 1.72 2.09 2.53 2.85
6 1.94 2.45 3.14 3.71 25 1.71 2.06 2.48 2.78
7 1.89 2.36 3.00 3.50 40 1.68 2.02 2.42 2.70
8 1.86 2.31 2.90 3.36 ∞ 1.645 1.96 2.326 2.576
15. Chi-squared table   Values χ²_{k,p} of x for which P(X > x) = p, when X is χ²_k and p = .995, .975, etc.
k .995 .975 .05 .025 .01 .005 k .995 .975 .05 .025 .01 .005
1 .000 .001 3.84 5.02 6.63 7.88 18 6.26 8.23 28.87 31.53 34.81 37.16
2 .010 .051 5.99 7.38 9.21 10.60 20 7.43 9.59 31.42 34.17 37.57 40.00
3 .072 .216 7.81 9.35 11.34 12.84 22 8.64 10.98 33.92 36.78 40.29 42.80
4 .207 .484 9.49 11.14 13.28 14.86 24 9.89 12.40 36.42 39.36 42.98 45.56
5 .412 .831 11.07 12.83 15.09 16.75 26 11.16 13.84 38.89 41.92 45.64 48.29
6 .676 1.24 12.59 14.45 16.81 18.55 28 12.46 15.31 41.34 44.46 48.28 50.99
7 .990 1.69 14.07 16.01 18.48 20.28 30 13.79 16.79 43.77 46.98 50.89 53.67
8 1.34 2.18 15.51 17.53 20.09 21.95 40 20.71 24.43 55.76 59.34 63.69 66.77
9 1.73 2.70 16.92 19.02 21.67 23.59 50 27.99 32.36 67.50 71.41 76.15 79.49
10 2.16 3.25 18.31 20.48 23.21 25.19 60 35.53 40.48 79.08 83.30 88.38 91.95
12 3.07 4.40 21.03 23.34 26.22 28.30 70 43.28 48.76 90.53 95.02 100.4 104.2
14 4.07 5.63 23.68 26.12 29.14 31.32 80 51.17 57.15 101.9 106.6 112.3 116.3
16 5.14 6.91 26.30 28.85 32.00 34.27 100 67.33 74.22 124.3 129.6 135.8 140.2
16. The chi-squared goodness-of-fit test
The frequencies n_y are grouped so that the fitted frequency n̂_y for every group exceeds about 5.
X² = Σ_y (n_y − n̂_y)² / n̂_y   is referred to the table of χ²_k with significance point p,
where k is the number of terms summed, less one for each constraint, e.g. matching the total frequency,
and matching x̄ with μ̂.
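The statistic is simple to compute directly. The sketch below is an illustrative addition with made-up observed and fitted counts over four groups; with one constraint (matching the total frequency) there are k = 3 degrees of freedom, and 7.81 is the χ²_{3, 0.05} value from the table above.

```python
observed = [18, 30, 35, 17]    # made-up group frequencies n_y
fitted   = [20, 30, 30, 20]    # made-up fitted frequencies, same total

X2 = sum((o - e) ** 2 / e for o, e in zip(observed, fitted))

critical = 7.81                # chi-squared table, k = 3, p = 0.05
print(f"X^2 = {X2:.2f}; reject at 5% level: {X2 > critical}")
```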
17. Joint probability distributions
Discrete distribution {p_xy}, where p_xy = P({X = x} ∩ {Y = y}).
Let p_x = P(X = x) and p_y = P(Y = y); then
p_x = Σ_y p_xy   and   P(X = x | Y = y) = p_xy / p_y
Continuous distribution
Joint cdf   F(x, y) = P({X ≤ x} ∩ {Y ≤ y}) = ∫_{x_0 = −∞}^{x} ∫_{y_0 = −∞}^{y} f(x_0, y_0) dy_0 dx_0
Joint pdf   f(x, y) = d²F(x, y) / dx dy
Marginal pdf of X   f_X(x) = ∫_{−∞}^{∞} f(x, y_0) dy_0
Conditional pdf of X given Y = y   f_{X|Y}(x|y) = f(x, y) / f_Y(y)   (provided f_Y(y) > 0)
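For a discrete joint pmf stored as a table, the marginal and conditional formulas translate directly into code. This is an illustrative addition with a made-up 2 × 3 joint pmf.

```python
# Made-up joint pmf p_xy over x in {0, 1} and y in {0, 1, 2}; entries sum to 1.
p = {(0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
     (1, 0): 0.15, (1, 1): 0.25, (1, 2): 0.20}

def p_x(x):                      # marginal P(X = x) = sum over y of p_xy
    return sum(v for (xx, _), v in p.items() if xx == x)

def p_y(y):                      # marginal P(Y = y) = sum over x of p_xy
    return sum(v for (_, yy), v in p.items() if yy == y)

def p_x_given_y(x, y):           # conditional P(X = x | Y = y) = p_xy / p_y
    return p[(x, y)] / p_y(y)

print(p_x(0), p_y(1), p_x_given_y(0, 1))
```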
18. Linear regression
To fit the linear regression model y = α + βx by ŷ_x = α̂ + β̂x from observations
(x_1, y_1), . . . , (x_n, y_n), the least squares fit is   α̂ = ȳ − x̄ β̂,   β̂ = S_xy / S_xx
The residual sum of squares   RSS = S_yy − S_xy² / S_xx
σ̂² = RSS / (n − 2);   (n − 2) σ̂² / σ² is from χ²_{n−2}
E(α̂) = α,   E(β̂) = β,
var(α̂) = {Σ_i x_i² / (n S_xx)} σ²,   var(β̂) = σ² / S_xx,   cov(α̂, β̂) = −(x̄ / S_xx) σ²
ŷ_x = α̂ + β̂x,   E(ŷ_x) = α + βx,   var(ŷ_x) = {1/n + (x − x̄)²/S_xx} σ²
(α̂ − α)/ŝe(α̂),   (β̂ − β)/ŝe(β̂),   (ŷ_x − α − βx)/ŝe(ŷ_x)   are each from t_{n−2}
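The least squares formulas translate directly into a few lines of Python. This is an illustrative sketch with made-up data, not part of the original formula sheet.

```python
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1]          # made-up observations
n = len(x)

Sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
Sxx = sum(a * a for a in x) - sum(x) ** 2 / n
Syy = sum(b * b for b in y) - sum(y) ** 2 / n

beta_hat = Sxy / Sxx                   # slope estimate
alpha_hat = sum(y) / n - (sum(x) / n) * beta_hat   # intercept estimate

RSS = Syy - Sxy ** 2 / Sxx
sigma2_hat = RSS / (n - 2)

print(alpha_hat, beta_hat, sigma2_hat)
```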
COMP 245 Statistics
Exercises 0 - Mathematical Methods Revision
1. Find the nth term and infinite sum of the following sequences, stating for which real values of
x the infinite sums converge:
(a) 1/x, 1/(4x), 1/(16x), 1/(64x), . . .   (b) 1/x, 1/x², 1/x³, 1/x⁴, . . .   (c) 1, 1/x, 1/x², 1/x³, 1/x⁴, . . .
2. Using your answer from 1b, find the value of x such that Σ_{i=1}^{∞} x^(−i) = 1.
3. Find df/dx for the following functions f(x):
(a) f(x) = Σ_{i=0}^{n} a_i x^i   (a_i ∈ ℝ, n ∈ ℤ⁺);
(b) f(x) = x log(x);
(c) f(x) = e^(e^x).
4. Integrate the following functions f(x) with respect to x:
(a) f(x) = Σ_{i=0}^{n} a_i x^i   (a_i ∈ ℝ, n ∈ ℤ⁺);
(b) f(x) = x log(x);
(c) f(x) = e^(−ax)   (a ∈ ℝ, a ≠ 0);
(d) f(x) = x e^(−ax)   (a ∈ ℝ, a ≠ 0).
5. Using your answer from 4c, find the value of a such that ∫_{0}^{∞} e^(−ax) dx = 1.
6. Integrate the function f(x, y) = xy over the interior of the quarter-ellipse which satisfies
x²/2 + y² = 1, x > 0, y > 0.
What would be the integral of the function g(x, y) = |xy| over the interior of the entire ellipse?
7. For the function f : ℝ → ℝ, f(x) = x² + 1, find the inverse image of [1, 3).
8. Suppose a curve is known to pass through the following points (x, y):
(1.4, 3.0), (0.2, 1.6), (3.0, 0.9)
By linear interpolation, find approximate y-values of the curve at
(a) x = 0.8; (b) x = 1.0.
COMP 245 Statistics
Solutions 0 - Mathematical Methods Revision
1. (a) x_n = 1/(4^(n−1) x),   S_∞ = 4/(3x)   for all x ≠ 0.
(b) x_n = 1/x^n,   S_∞ = 1/(x − 1)   for all x such that |x| > 1.
(c) x_n = 1/x^(n−1),   S_∞ = x/(x − 1)   for all x such that |x| > 1.
2. 1 = Σ_{i=1}^{∞} x^(−i) = 1/(x − 1)   ⇒   x = 2.
(So the infinite sequence 1/2, 1/4, 1/8, 1/16, . . . sums to 1.)
3. (a) Σ_{i=1}^{n} i a_i x^(i−1);
(b) log(x) + 1;
(c) e^(x + e^x).
4. (a) Σ_{i=0}^{n} {a_i/(i + 1)} x^(i+1) + c;
(b) (x²/2){log(x) − 1/2} + c;
(c) −e^(−ax)/a + c;
(d) −{e^(−ax)/a}(x + 1/a) + c.
5. 1 = ∫_{0}^{∞} e^(−ax) dx = [−e^(−ax)/a]_{0}^{∞} = 0 − (−1/a) = 1/a   ⇒   a = 1.
6. The interior of the quarter-ellipse is given by the region
E = {(x, y) | 0 < x < √2, 0 < y < √(1 − x²/2)}.
The integral of f over E is then
∫_E f(x, y) dx dy = ∫_{x=0}^{√2} x dx ∫_{y=0}^{√(1 − x²/2)} y dy = ∫_{x=0}^{√2} x [y²/2]_{y=0}^{√(1 − x²/2)} dx = ∫_{x=0}^{√2} (x/2 − x³/4) dx
= [x²/4 − x⁴/16]_{x=0}^{√2} = 1/2 − 1/4 = 1/4.
By symmetry, the integral of g over the whole ellipse would be equal to 4 × 1/4 = 1.
7. (−√2, √2).
8. (a) At x = 0.8, y = 0.7.
(b) At x = 1.0, y = 0.6625.
Mathematical Methods
COMP 245 STATISTICS
Dr N A Heard
1 Notation
R, Z, N
The following conventions and set notation will be used:
Notation   Set                               Description
ℝ          (−∞, ∞)                           The real numbers
ℝ⁺         (0, ∞)                            The positive real numbers
ℤ          {. . . , −2, −1, 0, 1, 2, . . .}  The integers
ℤ⁺         {1, 2, 3, . . .}                  The positive integers
ℕ          {0, 1, 2, 3, . . .}               The natural numbers
2 Log and Exponential
log
Where there is any ambiguity, by log I will mean the natural logarithm, sometimes written ln elsewhere. For any other base b (e.g. b = 10), I will write log_b. So
log ≡ log_e ≡ ln.
Rules:
log(x·y) = log(x) + log(y)   ⇒   log(∏_i x_i) = Σ_i log(x_i)
log(x^y) = y·log(x)
log(e^x) = x
lim_{x→0} log(x) = −∞
exp
For the exponential, I will use the notations e or exp interchangeably. So
e^x ≡ exp(x).
Rules:
exp(x + y) = exp(x)·exp(y)   ⇒   exp(Σ_i x_i) = ∏_i exp(x_i)
{exp(x)}^y = exp(x·y)
exp{log(x)} = x
exp(0) = 1
3 Arithmetic and Geometric Progressions
Arithmetic Progressions
Consider the infinite sequence of numbers
a, a + d, a + 2d, a + 3d, . . .
The first term in this sequence is a, and then each subsequent term is equal to the previous
term plus d, the common difference. Any such sequence is known as an arithmetic progression (a.p.).
Formulae:
nth term = a + (n − 1)d
Sum of first n terms,   S_n = (n/2){2a + (n − 1)d}
(Infinite sum, S_∞ = ±∞, unless a = d = 0)
Geometric Progressions
Consider the infinite sequence of numbers
a, a·r, a·r², a·r³, . . .
The first term in this sequence is a, and then each subsequent term is equal to the previous
term multiplied by r, the common ratio. Any such sequence is known as a geometric progression (g.p.).
Formulae:
nth term = a·r^(n−1)
Sum of first n terms,   S_n = a(1 − r^n)/(1 − r),   if r ≠ 1
Infinite sum,   S_∞ = a/(1 − r),   if |r| < 1;   (S_∞ diverges otherwise for a ≠ 0)
4 Calculus
Differentiation
Let f, g be functions of a variable x; for the derivative of f with respect to x, we use the
notations df/dx or f′(x) interchangeably. So
df/dx ≡ f′(x) ≡ lim_{h→0} {f(x + h) − f(x)}/h.
Formulae:
Chain Rule:   (d/dx) f(g(x)) = f′(g(x)) g′(x)
Product Rule:   (d/dx) {f(x)g(x)} = f′(x)g(x) + f(x)g′(x)
Quotient Rule: for g(x) ≠ 0,   (d/dx) {f(x)/g(x)} = {f′(x)g(x) − f(x)g′(x)} / {g(x)}²
Integration
Fundamental Theorem of Calculus:
(d/dx) ∫_{u=a}^{x} f(u) du = f(x),
so integration is antidifferentiation.
Formulae:
Change of variable: if y = g(x),   ∫_{a}^{b} f(x) dx = ∫_{g(a)}^{g(b)} f(g⁻¹(y)) (g⁻¹)′(y) dy
By parts:   ∫_{a}^{b} f(x) g′(x) dx = [f(x)g(x)]_{a}^{b} − ∫_{a}^{b} f′(x) g(x) dx
Linearity
Both differentiation and integration are additive. That is, for functions f, g,
(d/dx){f(x) + g(x)} = df(x)/dx + dg(x)/dx,
∫ {f(x) + g(x)} dx = ∫ f(x) dx + ∫ g(x) dx.
And for any constant c ∈ ℝ,
(d/dx){c f(x)} = c df(x)/dx,
∫ {c f(x)} dx = c ∫ f(x) dx.
Double Integrals
We will only consider the integral of bivariate functions f : ℝ² → ℝ over normal domains
D ⊆ ℝ² of the form
D = {(x, y) | a < x < b, φ(x) < y < ψ(x)}.
Then
∫∫_D f(x, y) dx dy ≡ ∫_{x=a}^{b} {∫_{y=φ(x)}^{ψ(x)} f(x, y) dy} dx.
5 Function images and inverses
Image of a function
Suppose f is a function f : X → Y.
For A ⊆ X, the image of A under f, written f(A), is the subset of Y given by
f(A) = {y ∈ Y | f(x) = y for some x ∈ A}.
The image of X under f can be referred to simply as the image of f.
Inverse image of a function
Recall the inverse of f, should it exist, is denoted f⁻¹ and has the property
f⁻¹(f(x)) = x
for any value x ∈ X.
When f is invertible, the inverse image of B ⊆ Y is defined as the image of B under f⁻¹.
More generally, for B ⊆ Y, the inverse image of B under (possibly non-invertible) f is given by
f⁻¹(B) = {x ∈ X | f(x) ∈ B}.
6 Interpolation
Approximating a function
Consider a function f : ℝ → ℝ of unknown form, where all that is known about f is the
value it takes at each of a pre-determined discrete set of points X = {a = x_1 < x_2 < . . . < x_k = b};
for each of these values x_i, denote the corresponding function value f_i = f(x_i).
Interpolation is the task of finding an approximate value of f for a general point x in the
interval [a, b], say f̂(x). (Extrapolation would be the task of finding an approximate value of f
for x outside the interval [a, b].)
Linear interpolation
The most commonly used approximation is linear interpolation, which assumes the underlying
function f can be considered approximately piecewise linear between the set of known points.
[Figure: the known points of f, with the linear interpolant f̂(x) drawn between the neighbouring knots (x_ℓ, f_ℓ) and (x_u, f_u).]
Let x_ℓ, x_u be the nearest pair of points in X on either side of x. Then f(x) is linearly approximated by
f̂(x) = f_ℓ + (x − x_ℓ) (f_u − f_ℓ)/(x_u − x_ℓ).
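The linear interpolation formula can be coded directly. The following Python sketch is an illustrative addition; the knot points used in the example are made up.

```python
def interp(x, xs, fs):
    """Linearly interpolate f at x from knots xs (increasing) with values fs."""
    # Find the nearest pair of knots x_l <= x <= x_u on either side of x.
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            x_l, x_u = xs[i], xs[i + 1]
            f_l, f_u = fs[i], fs[i + 1]
            return f_l + (x - x_l) * (f_u - f_l) / (x_u - x_l)
    raise ValueError("x outside [a, b]; that would be extrapolation")

# Example with made-up knots:
print(interp(2.5, [1.0, 2.0, 3.0], [10.0, 14.0, 13.0]))   # 13.5
```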
COMP 245 Statistics
Exercises 1 - Numerical Summaries
For the first three questions, let (x_1, x_2, . . . , x_n) be a sample of n real numbers and m ∈ ℝ some
measure of location of those data.
1. Show that m = x̄, the sample mean, is the minimiser of the sum of squared deviations
Σ_{i=1}^{n} (x_i − m)².
2. Show by induction that m = x_((n+1)/2), the sample median, minimises the sum of absolute
deviations
Σ_{i=1}^{n} |x_i − m|.
(Note that this is not necessarily a unique minimiser.)
[Hint: Assume the samples are ordered. Consider the base cases of n = 1 and n = 2 first; and
then when assumed true for all sizes up to n, consider the case of n + 2.]
3. If m were to be the mode of the sample, construct your own measure of dispersion for which
this would be the minimiser. Describe how the equation you give acts as a (crude) measure
of dispersion.
4. The blood plasma beta endorphin concentration levels for 11 runners who collapsed towards
the end of the Great North Run were
66 72 79 84 102 110 123 144 162 169 414
Calculate the median and mean of this sample. Why might one have predicted beforehand
that the mean would be larger than the median? Why might the standard deviation not be a
very good measure of dispersion?
5. The table below gives the blood plasma beta endorphin concentrations of 11 runners before
and after the race. Find the median, the mean, and the standard deviation of the before-after
differences. Also, calculate the covariance and correlation of the before and after concentration
levels.
Before 4.3 4.6 5.2 5.2 6.6 7.2 8.4 9.0 10.4 14.0 17.8
After 29.6 25.1 15.5 29.6 24.1 37.8 20.2 21.9 14.2 34.6 46.2
6. The data below give the percentage of silica found in each of 22 chondrite meteors. Find the
median and the upper and lower quartiles of the data.
20.77, 22.56, 22.71, 22.99, 26.39, 27.08, 27.32, 27.33, 27.57, 27.81, 28.69, 29.36, 30.25,
31.89, 32.88, 33.23, 33.28, 33.40, 33.52, 33.83, 33.95, 34.82
7. The list below shows the survival time (in days) of patients undergoing treatment for stomach
cancer. Using cells 0-99, 100-199, 200-299, . . . , 1100-1199, plot a histogram of the data.
Compute the mean and standard deviation of the data. Why is the mean larger than the
apparent mode of the data? Calculate the skewness of the data and of the log transformed
data.
124, 42, 25, 45, 412, 51, 1112, 46, 103, 876, 146, 340, 396
8. A car travels for 10 miles at 30 mph and 10 miles at 60 mph. What was its average speed?
COMP 245 Statistics
Solutions 1 - Numerical Summaries
1. Taking the derivative wrt m,
(d/dm) Σ_{i=1}^{n} (x_i − m)² = −2 Σ_{i=1}^{n} (x_i − m) = 2(mn − Σ_{i=1}^{n} x_i).
Setting this equal to zero yields the stationary point m = Σ_{i=1}^{n} x_i / n = x̄. To check this is a
minimiser, differentiate again wrt m,
(d²/dm²) Σ_{i=1}^{n} (x_i − m)² = 2n,
which is positive for all m.
2. For ease of notation but w.l.o.g., assume all samples are ordered so x_1 ≤ x_2 ≤ . . . ≤ x_n.
The case of n = 1 is trivial, and for n = 2
Σ_{i=1}^{n} |x_i − m| = |x_1 − m| + |x_2 − m| ≥ x_2 − x_1,
with equality attained for m in the range x_1 ≤ m ≤ x_2, which includes the median.
Suppose the result holds for all samples up to size n, and now consider an ordered sample
of size n + 2. First note that the median of x_2, x_3, . . . , x_n, x_{n+1} is equal to the median of
x_1, x_2, x_3, . . . , x_n, x_{n+1}, x_{n+2} (since in the larger sample we have simply appended a data point
on either side), but that the former is a sample of size n. So we wish to show that the median
of x_2, x_3, . . . , x_n, x_{n+1} is a minimiser of Σ_{i=1}^{n+2} |x_i − m|. Then
Σ_{i=1}^{n+2} |x_i − m| = |x_1 − m| + |x_{n+2} − m| + Σ_{i=2}^{n+1} |x_i − m| ≥ x_{n+2} − x_1 + Σ_{i=2}^{n+1} |x_i − m|,
with equality attained for m in the range x_1 ≤ m ≤ x_{n+2}; and clearly the median of x_2, x_3, . . . , x_n, x_{n+1}
lies within this range and is also a minimiser of Σ_{i=2}^{n+1} |x_i − m| by the inductive hypothesis.
3. A corresponding measure of dispersion would be
Σ_{i=1}^{n} I(x_i ≠ m).
If m is our measure of location of the data, then this measure of dispersion counts how many
of the sample take some different value. This would be minimised by the mode.
4. Median = 110, mean = 138.6.
Because of the right skew.
Because it will be sensitive to the outlying value of 414.
5. Differences are: -25.3, -20.5, -10.3, -24.4, -17.5, -30.6, -11.8, -12.9, -3.8, -20.6, -28.4.
Mean = -18.74, median = -20.5, sd = 7.94 (or 8.33).
Covariance = 19.24, correlation = 0.51.
6. The lower quartile LQ = x_((n+1)/4) = x_(23/4), which is three quarters of the way between
x_(5) = 26.39 and x_(6) = 27.08. Hence LQ = 26.39 + (27.08 − 26.39) × 3/4 = 26.908.
Similarly, the upper quartile UQ = x_((n+1)·3/4) = x_(69/4), which is one quarter of the way
between x_(17) = 33.28 and x_(18) = 33.40. Hence UQ = 33.28 × 3/4 + 33.40 × 1/4 = 33.31.
The median is x_((n+1)/2) = x_(23/2) = 28.69/2 + 29.36/2 = 29.0.
7. [Figure: histogram of the survival times (days), using cells 0-99, 100-199, . . . , 1100-1199, on a density scale.]
Mean = 286, sd = 332.72 (or 346.3 for the 1/(n − 1) formula).
Because of the skewness of the data.
Skewness = 1.43 (or similar); skewness of the log transformed data = 0.26 (or similar).
8. The car travels a total of 20 miles in (10/30 hours plus 10/60 hours). That is, 20 miles in 0.5
hours. That is, 40 miles per hour. (Not (30+60)/2.)
This can be most simply calculated using the harmonic mean:
2 / (1/30 + 1/60) = 2 / (3/60) = 40.
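An illustrative one-line Python check of the harmonic-mean calculation (an addition, not part of the original solutions):

```python
speeds = [30, 60]                     # mph over two equal-distance legs
harmonic_mean = len(speeds) / sum(1 / v for v in speeds)
print(harmonic_mean)                  # 40.0, not the arithmetic mean 45
```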
Numerical Summaries
COMP 245 STATISTICS
Dr N A Heard
1 Introduction
Once a sample of data has been drawn from the population of interest, the first task of the
statistical analyst might be to calculate various numerical summaries of these data.
This procedure serves two purposes:
The first is exploratory. Calculating statistics which characterise general properties of the
sample, such as location, dispersion, or symmetry, helps us to understand the data we
have gathered. This aim can be greatly aided by the use of graphical displays represent-
ing the data.
The second, as we shall see later, is that these summaries will commonly provide the
means for relating the sample we have learnt about to the wider population in which we
are truly interested. Later, we will assess the properties of these numerical summaries as
estimators of population parameters.
2 Summary Statistics
2.1 Measures of Location
Sample Mean
The arithmetic mean (or just mean for short) of a sample of real values (x_1, . . . , x_n) is the
sum of the values divided by their number. That is,
x̄ = (1/n) Σ_{i=1}^{n} x_i
This is often colloquially referred to as the average.
Ex. The mean of (7, 2, 4, 12, 5) is
(7 + 2 + 4 + 12 + 5)/5 = 30/5 = 6.
Order Statistics
For a sample of real values (x_1, . . . , x_n), define the i-th order statistic x_(i) to be the i-th smallest
value of the sample.
So
x_(1) ≡ min(x_1, . . . , x_n) is the smallest value;
x_(2) is the next smallest, and so on, up to
x_(n) ≡ max(x_1, . . . , x_n) being the largest value.
Furthermore, in an abuse of notation it will be useful to define x_(i+δ) for integer 1 ≤ i < n
and non-integer δ ∈ (0, 1) as the linear interpolant
x_(i+δ) = (1 − δ) x_(i) + δ x_(i+1),
where the order statistics x_(i) are defined as before.
Ex.
x_(4.2) = 0.8 x_(4) + 0.2 x_(5).
Sample Median
The median of a sample of real values (x_1, . . . , x_n) is the middle value of the order statistics.
That is, using our extended notation,
median = x_({n+1}/2) = x_({n+1}/2)                      if n is odd,
                       {x_(n/2) + x_(n/2+1)} / 2        if n is even.
Ex.
The median of (7, 2, 4, 12, 5) is 5.
The median of (7, 2, 4, 12, 5, 15) is 6.
Mean vs. Median
The mean is sensitive to outlying points, whilst the median is not.
Ex.
(1, 2, 3, 4, 5) has median = mean = 3.
(1, 2, 3, 4, 40) again has median = 3, but now mean = 10.
The arithmetic mean is the most commonly used location statistic, followed by the median.
Sample Mode
The mode of a sample of real values (x_1, . . . , x_n) is the value of the x_i which occurs most
frequently in the sample.
Ex.
The mode of (3, 5, 7, 2, 10, 14, 12, 2, 5, 2) is 2.
Some data sets are multimodal.
Geometric Mean
Two other useful measures of location (other averages):
For positive data, the geometric mean
x̄_G = (∏_{i=1}^{n} x_i)^(1/n).
Note: It is easy to show that x̄_G = exp{(1/n) Σ_{i=1}^{n} log x_i}, the exponential of the arithmetic
mean of the logs of the data.
⇒ It is thus less severely affected by exceptionally large values.
Harmonic Mean
The harmonic mean
x̄_H = {(1/n) Σ_{i=1}^{n} 1/x_i}⁻¹ = n / Σ_{i=1}^{n} (1/x_i),
which is most useful when averaging rates.
Note: For positive data (x_1, . . . , x_n),
Arithmetic mean ≥ geometric mean ≥ harmonic mean.
2.2 Measures of Dispersion
Range
The range of a sample of real values (x_1, . . . , x_n) is the difference between the largest and
the smallest values. That is,
range = x_(n) − x_(1)
Ex.
The range of (7, 2, 4, 12, 5) is 12 − 2 = 10.
[Figure: arithmetic, geometric and harmonic means for two data points (x_1, x_2), where x_1 = 1.]
Quartiles
Consider again the order statistics of a sample, (x_(1), . . . , x_(n)).
We defined the median so that it lay approximately 1/2 of the way through the ordered sample.
(Not necessarily exactly or uniquely since there may be tied values or n even.)
Similarly, we can define the first and third quartiles respectively as being values 1/4 and 3/4
of the way through the ordered sample.
Interquartile Range
first quartile = x_({n+1}/4)
third quartile = x_(3{n+1}/4)
and thus we define the interquartile range as the range of the data lying between the first
and third quartiles:
interquartile range = third quartile − first quartile
                    = x_(3{n+1}/4) − x_({n+1}/4)
Five Figure Summary
The five figure summary of a set of data lists, in order:
The minimum value in the sample
The lower quartile
The sample median
The upper quartile
The maximum value
Sample Variance
The most widely used measure of dispersion is based on the squared differences between
the data points and their mean, (x_i − x̄)².
The average (the mean) of these squared differences is the mean square or sample variance
s² = (1/n) Σ_{i=1}^{n} (x_i − x̄)².
Equivalently, it is often more convenient to rewrite this formula as
s² = (1/n) Σ_{i=1}^{n} x_i² − {(1/n) Σ_{i=1}^{n} x_i}²
That is, the mean of the squares minus the square of the mean.
Sample Standard Deviation
The square root of the variance is the root mean square or sample standard deviation
s = √{(1/n) Σ_{i=1}^{n} (x_i − x̄)²}
Unlike the variance, the standard deviation is in the same units as the x_i.
Summary
We can see analogies between the numerical summaries for location and dispersion, and
their robustness properties are comparable.
             Least Robust             More Robust   Most Robust
Location     {x_(1) + x_(n)}/2        x̄             x_({n+1}/2)
Dispersion   x_(n) − x_(1)            s²            x_(3{n+1}/4) − x_({n+1}/4)
(where {x_(1) + x_(n)}/2 would be the midpoint of our data, halfway between the minimum and
maximum values in the sample, which provides another alternative descriptor of location.)
2.3 Skewness
Sample Skewness
Skewness is a measure of asymmetry. The skewness of a sample of real values (x_1, . . . , x_n)
is given by
skewness = (1/n) Σ_{i=1}^{n} {(x_i − x̄)/s}³.
A sample is positively (negatively) or right (left) skewed if the upper tail of the histogram
of the sample is longer (shorter) than the lower tail.
[Figure: two example histograms, one showing positive skew and one showing negative skew.]
Since the mean is more sensitive to outlying points than is the median, one might choose
the median as a more suitable measure of average value if the sample is skewed.
Can expect skewness for example when the data can only take positive (or only negative)
values and if the values are not far from zero.
Can remove skewness by transforming the data. In the case above, we need a transformation
which has larger effect on the larger values: e.g. square root, log (though beware 0 values).
Note: in a positively skewed sample the mean is often greater than the median.
2.4 Covariance and Correlation
Sample Covariance
Suppose we have a sample made up of ordered pairs of real values ((x_1, y_1), . . . , (x_n, y_n)).
The value x_i might correspond to the measurement of one quantity x of individual i, and y_i to
another quantity y of the same individual.
The covariance between the samples of x and y is given by
s_xy = (1/n) Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ).
It gives a measurement of relatedness between the two quantities x and y.
As before, the covariance can be rewritten equivalently as
s_xy = (Σ_{i=1}^{n} x_i y_i)/n − x̄ ȳ.
Sample Correlation
Note that the magnitude of s_xy varies according to the scale on which the data have been
measured. The correlation between the samples of x and y is defined to be
r_xy = s_xy / (s_x s_y) = (Σ_{i=1}^{n} x_i y_i − n x̄ ȳ) / (n s_x s_y),
where s_x and s_y are the sample standard deviations of (x_1, . . . , x_n) and (y_1, . . . , y_n) respectively.
Unlike covariance, correlation gives a measurement of relatedness between the two quantities
x and y which is scale-invariant.
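These definitions translate directly into code. The Python sketch below is an illustrative addition with made-up paired data, using the 1/n divisor exactly as in the definitions above.

```python
import math

x = [4.3, 4.6, 5.2, 6.6, 7.2]          # made-up paired data
y = [2.1, 2.4, 2.2, 3.0, 3.3]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
s_xy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / n    # sample covariance
s_x = math.sqrt(sum((a - xbar) ** 2 for a in x) / n)             # sample sd of x
s_y = math.sqrt(sum((b - ybar) ** 2 for b in y) / n)             # sample sd of y

r_xy = s_xy / (s_x * s_y)                                        # sample correlation
print(s_xy, r_xy)
```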
3 Related Graphical Displays
3.1 Box-and-Whisker Plots
Box-and-Whisker/Box Plots
Based on the five figure summary:
Median - middle line in the box
3rd & 1st Quartiles - top and bottom of the box
Whiskers - extend out to any points which are within (3/2 × interquartile range) from the box
Any extreme points out to the maximum and minimum which are beyond the whiskers
are plotted individually.
3.2 Empirical pmf & cdf
Empirical pmf
The empirical probability mass function (epmf) of a sample of real values (x_1, . . . , x_n) is
given by
p̂_n(x) = (1/n) Σ_{i=1}^{n} I(x_i = x).
For any real number x, p̂_n(x) returns the proportion of the data which take the value x. The
non-zero values of p̂_n(x) are the normalised frequency counts of the data.
Note that a mode of the sample (x_1, . . . , x_n) yields a maximal value of p̂_n(x).
[Figure: box plots of the counts of insects found in agricultural experimental units treated with six
different insecticides (A-F).]
Empirical cdf
The empirical cumulative distribution function (ecdf) of a sample of real values (x_1, . . . , x_n)
is given by
F̂_n(x) = (1/n) Σ_{i=1}^{n} I(x_i ≤ x).
For any real number x, F̂_n(x) returns the proportion of the data having values which do
not exceed x.
Note this is a step function, with change points at the sampled values.
[Figure: empirical cdf of the insect count data, collecting the insecticide data from before together across
the different treatments.]
COMP 245 Statistics
Exercises 2 - Probability
1. For two events E and F, show that
P(E ∪ F) = P(E) + P(F) − P(E ∩ F).
2. Suppose two events E and F are mutually exclusive. State the precise conditions under which
they may also be independent.
3. What is the probability that a single roll of a die will give an odd number if
(a) no other information is given;
(b) you are told that the number is less than 4.
4. (a) What's the probability of getting two sixes with two dice?
(b) What's the probability of getting a total of 3 with two dice?
5. Two students try to solve a problem they've been set. Student A has a probability of 2/5 of being
able to solve the problem, and student B has a probability of 1/3. If both try it independently,
what is the probability that the problem is solved?
6. A straight line AB of unit length is divided internally at a point X, where X is equally likely to
be any point of AB. What is the probability that AX·XB < 3/16?
7. (a) In one spin of a European roulette wheel (which has pockets numbered 0, 1, 2, up to
and including 36) what is the probability that the outcome is odd?
(b) An urn contains x red balls and y green ones (both larger than 2). You remove them,
without replacing them, one at a time.
i. What is the chance that the first is red?
ii. What is the chance that the second is red?
iii. What is the chance that the first two are red?
iv. What is the chance that the last but one is red?
8. (a) An experiment consists of tossing a fair coin and rolling a fair die. What is the probability
of the joint event heads with an odd number of spots?
(b) In a particular class, 30% were female, and 90% of the males and 80% of the females
passed the examination. What percentage of the class passed the examination altogether?
9. On any day the chance of rain is 25%. The chance of rain on two consecutive days is 10%.
(a) Does this mean that the events of rain on two consecutive days are independent or
dependent events?
(b) Given that it is raining today, what is the chance of rain tomorrow?
(c) Given that it will rain tomorrow, what is the chance of rain today?
10. A university lecturer leaves his umbrella behind with probability 1/4 every time he visits a shop
(and, once he has left it, he does not collect it again).
(a) If he sets out with his umbrella to visit four different shops, what is the probability that
he will leave it in the fourth shop?
(b) If he arrives home without his umbrella, what is the probability that he left it in the fourth
shop?
(c) If he arrives home without it, and was seen to be carrying it after leaving the first shop,
what is the probability that he left it in the fourth shop?
11. A warehouse contains packs of electronic components. Forty percent of the packs contain
components of low quality for which the probability that any given component will prove
satisfactory is 0.8; forty percent contain components of medium quality for which this prob-
ability is 0.9; and the remaining twenty percent contain high quality components which are
certain to be satisfactory.
(a) If a pack is chosen at random and one component from it is tested, what is the probability
that this component is satisfactory?
(b) If a pack is chosen at random and two components from it are tested, what is the
probability that exactly one of the components tested is satisfactory?
(c) If it was found that just one of the components tested was satisfactory, what is the
probability that the selected pack contained medium quality components?
(d) If both components were found to be satisfactory, what is the probability that the selected
pack contained high quality components?
12. Prove that if P(A) > P(B) then P(A|B) > P(B|A).
COMP 245 Statistics
Solutions 2 - Probability
1. First, note that E ∪ F = (E ∩ F) ∪ (Eᶜ ∩ F) ∪ (E ∩ Fᶜ) is a disjoint union. So from the axioms of
probability,
P(E ∪ F) = P(E ∩ F) + P(Eᶜ ∩ F) + P(E ∩ Fᶜ).   (1)
Also, note that F = (Eᶜ ∩ F) ∪ (E ∩ F) is a disjoint union, and so
P(Eᶜ ∩ F) = P(F) − P(E ∩ F),   (2)
and similarly
P(E ∩ Fᶜ) = P(E) − P(E ∩ F).   (3)
Substituting equations (2) and (3) into equation (1) yields the result.
2. Since E and F are mutually exclusive, E ∩ F = ∅ and so P(E ∩ F) = 0.
For independence, we require 0 = P(E ∩ F) = P(E)P(F), and so one or both of P(E) and
P(F) must be zero.
3. (a) 3/6 = 1/2.
(b) If the number is less than 4, it must be one of 1, 2, or 3. Two of these three are odd.
Hence the answer is 2/3.
4. (a) There are 36 possibilities, only one of which is two 6s. Hence 1/36.
(b) Two of the 36 possibilities will give a total of 3, these being (1,2) and (2,1). Hence 2/36 = 1/18.
5. The problem is solved if either one or both of the students solve the problem. Thus
P(solved) = 1 − P(not solved) = 1 − P(not solved by either)
= 1 − P(A does not solve it and B does not solve it)
= 1 − P(A does not solve it)·P(B does not solve it)   (by independence)
= 1 − (1 − 2/5)(1 − 1/3) = 1 − (3/5)(2/3) = 1 − 2/5 = 3/5.
6. Let AX = a. Then first note that AX·XB = a(1 − a). Then a(1 − a) < 3/16 if a < 1/4 or a > 3/4. Thus
P(a(1 − a) < 3/16) = P(a < 1/4) + P(a > 3/4) = 1/4 + 1/4 = 1/2.
7. (a) 18/37.
(b) i. x/(x + y).
ii. P(2nd red) = P(2nd red | first red)P(first red) + P(2nd red | first green)P(first green)
= {(x − 1)/(x + y − 1)}·{x/(x + y)} + {x/(x + y − 1)}·{y/(x + y)} = x/(x + y).
iii. P(first two red) = P(2nd red | first red)P(first red) = {(x − 1)/(x + y − 1)}·{x/(x + y)}.
iv. x/(x + y), by the second argument in 7(b)ii.
8. (a) 1/4.
(b) 0.9 × 0.7 + 0.8 × 0.3 = 0.87.
9. (a) Dependent. If they were independent the chance of rain on two consecutive days would
be 0.25 0.25 = 0.0625, not 0.10.
(b) P(rain tomorrow | rain today) = P(rain tomorrow and rain today) / P(rain today) = 0.1 /
0.25 = 0.40.
(c) The numbers are the same.
10. (a) This is the probability that he doesn't leave it in the first three shops but does leave it in
the fourth shop given that he didn't leave it in the first three = (3/4) × (3/4) × (3/4) × (1/4) = 27/256.
(b) Let E_i be the event that he left it in the i-th shop. We want P(E_4 | E_1 ∪ E_2 ∪ E_3 ∪ E_4).
Now
P(E_4 | E_1 ∪ E_2 ∪ E_3 ∪ E_4) = P(E_4 ∩ (E_1 ∪ E_2 ∪ E_3 ∪ E_4)) / P(E_1 ∪ E_2 ∪ E_3 ∪ E_4)
= P(E_4) / P(E_1 ∪ E_2 ∪ E_3 ∪ E_4)
= P(E_4) / {1 − P(E_1ᶜ ∩ E_2ᶜ ∩ E_3ᶜ ∩ E_4ᶜ)}
= {(3/4) × (3/4) × (3/4) × (1/4)} / {1 − (3/4) × (3/4) × (3/4) × (3/4)} = 0.15.
(c) This is the same problem as in 10b, but as if there were only three shops. Hence
{(3/4) × (3/4) × (1/4)} / {1 − (3/4) × (3/4) × (3/4)} = 0.24.
11. Let S = satisfactory and U = S̄ = unsatisfactory; L = low quality, M = medium quality, H =
high quality.
(a)
P(S) = P(S|L) P(L) + P(S|M) P(M) + P(S|H) P(H)
= 0.8 × 0.4 + 0.9 × 0.4 + 1 × 0.2
= 0.88.
(b) If two components are tested there can be four outcomes: SS, SU, US, UU. If the
probability of an S is p, these four possibilities have respective probabilities p², p(1 − p),
(1 − p)p, (1 − p)². Thus the probability of exactly one S (US ∪ SU) is 2p(1 − p). So
P(US ∪ SU) = P(US ∪ SU|L) P(L) + P(US ∪ SU|M) P(M) + P(US ∪ SU|H) P(H)
= 2 × 0.8 × 0.2 × 0.4 + 2 × 0.9 × 0.1 × 0.4 + 2 × 1 × 0 × 0.2
= 0.2.
(c)
P(M|US ∪ SU) = P(US ∪ SU|M) P(M) / P(US ∪ SU)
= (2 × 0.9 × 0.1 × 0.4) / 0.2
= 0.36.
(d)
P(SS) = P(SS|L) P(L) + P(SS|M) P(M) + P(SS|H) P(H)
= 0.8 × 0.8 × 0.4 + 0.9 × 0.9 × 0.4 + 1 × 1 × 0.2
= 0.78.
P(SS ∩ H) = P(SS|H) P(H) = 1 × 1 × 0.2 = 0.2.
So P(H|SS) = P(SS ∩ H) / P(SS) = 0.2 / 0.78 = 0.26.
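These calculations follow the same partition-rule-then-Bayes pattern throughout; as an added illustrative check (a Python sketch of our own, not part of the original solutions, with our own variable names), they can be reproduced as follows:

```python
# Cross-check of the component-pack calculations in question 11.
quality = {"L": 0.4, "M": 0.4, "H": 0.2}      # P(quality of chosen pack)
p_sat   = {"L": 0.8, "M": 0.9, "H": 1.0}      # P(one component satisfactory | quality)

# (a) P(S) by the partition rule
p_S = sum(quality[q] * p_sat[q] for q in quality)

# (b) P(exactly one of two tested is satisfactory) = sum over q of 2p(1-p)P(q)
p_one = sum(quality[q] * 2 * p_sat[q] * (1 - p_sat[q]) for q in quality)

# (c) P(M | exactly one satisfactory), by Bayes' theorem
p_M_given_one = quality["M"] * 2 * p_sat["M"] * (1 - p_sat["M"]) / p_one

# (d) P(H | both satisfactory)
p_SS = sum(quality[q] * p_sat[q] ** 2 for q in quality)
p_H_given_SS = quality["H"] * p_sat["H"] ** 2 / p_SS

print(p_S, p_one, p_M_given_one, p_H_given_SS)   # 0.88, 0.2, 0.36, ~0.256
```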
12. P(A) > P(B) ⇒ P(A)/P(B) > 1. Then, from Bayes Theorem, P(A|B) = P(B|A) × P(A)/P(B),
so P(A|B) > P(B|A) × 1 = P(B|A).
Elementary Set Theory
COMP 245 STATISTICS
Dr N A Heard
Room 543 Huxley Building
Dr N A Heard (Room 543 Huxley Building) Elementary Set Theory 1 / 10
Sets and notation
A set is any collection of distinct objects, and is a fundamental object of
mathematics.
∅ = {}, the empty set containing no objects, is included.
The objects in a set can be anything, for example integers, real
numbers, or the objects may even themselves be sets.
Notation and abbreviations:
∈ - is an element of (set membership)
⇔ - if and only if (equivalence)
⇒ - implies
∃ - there exists
∀ - for all
s.t. or | - such that
wrt - with respect to
Dr N A Heard (Room 543 Huxley Building) Elementary Set Theory 2 / 10
Subsets, Complements and Singletons
If a set B contains all of the objects contained in another set A, and
possibly some other objects besides, we say A is a subset of B and
write A ⊆ B.
Suppose A ⊆ B for two sets A and B. If we also have B ⊆ A we write
A = B, whereas if we know B ⊈ A we write A ⊂ B.
The complement of a set A wrt a universal set Ω (say, of all possible
values ω) is Ā = {ω ∈ Ω | ω ∉ A}.
A singleton is a set with exactly one element - {ω} for some ω ∈ Ω.
Dr N A Heard (Room 543 Huxley Building) Elementary Set Theory 3 / 10
Unions and Intersections
Consider two sets A and B.
The union of A and B, A ∪ B = {ω | ω ∈ A or ω ∈ B}.
The intersection of A and B, A ∩ B = {ω | ω ∈ A and ω ∈ B}.
More generally, for sets A₁, A₂, . . . we define
⋃ᵢ Aᵢ = {ω | ∃i s.t. ω ∈ Aᵢ},
⋂ᵢ Aᵢ = {ω | ∀i, ω ∈ Aᵢ}.
If ∀i ≠ j, Aᵢ ∩ Aⱼ = ∅, we say the sets are disjoint. Furthermore, if we
also have ⋃ᵢ Aᵢ = Ω, we say the sets form a partition of Ω.
Dr N A Heard (Room 543 Huxley Building) Elementary Set Theory 4 / 10
Both operators are commutative:
A ∪ B = B ∪ A,
A ∩ B = B ∩ A.
The union operator is distributive over intersection, and vice versa:
A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C),
A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C).
Dr N A Heard (Room 543 Huxley Building) Elementary Set Theory 5 / 10
Differences and De Morgan's Laws
The difference of A and B, A\B = A ∩ B̄ = {ω | ω ∈ A and ω ∉ B}.
Notice A and B are disjoint ⇔ A\B = A.
The following rules, known as De Morgan's Laws, provide useful
relations between complements, unions and intersections:
De Morgan's Laws
(A ∪ B)̄ = Ā ∩ B̄,   (A ∩ B)̄ = Ā ∪ B̄.
Dr N A Heard (Room 543 Huxley Building) Elementary Set Theory 6 / 10
Examples
Let Ω = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} be our universal set and
A = {1, 2, 3, 4, 5, 6} and B = {5, 6, 7, 8, 9} be two sets of elements of Ω.
A ∪ B = {1, 2, 3, 4, 5, 6, 7, 8, 9}.
A ∩ B = {5, 6}.
A\B = {1, 2, 3, 4}.
(A ∪ B)̄ = {10}.
(A ∩ B)̄ = {1, 2, 3, 4, 7, 8, 9, 10}.
Ā = {7, 8, 9, 10}, B̄ = {1, 2, 3, 4, 10}.
Ā ∩ B̄ = {10}.
Ā ∪ B̄ = {1, 2, 3, 4, 7, 8, 9, 10}.
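As an added aside (not part of the original notes), these identities can be checked mechanically with Python's built-in set type; the sets below are exactly those of the example above.

```python
# The universal set and the two subsets from the example above.
omega = set(range(1, 11))
A = {1, 2, 3, 4, 5, 6}
B = {5, 6, 7, 8, 9}

print(A | B)            # union: {1,...,9}
print(A & B)            # intersection: {5, 6}
print(A - B)            # difference A\B: {1, 2, 3, 4}
print(omega - (A | B))  # complement of the union: {10}
print(omega - (A & B))  # complement of the intersection

# De Morgan's laws hold element-wise:
assert omega - (A | B) == (omega - A) & (omega - B)
assert omega - (A & B) == (omega - A) | (omega - B)
```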
Dr N A Heard (Room 543 Huxley Building) Elementary Set Theory 7 / 10
Cartesian products of sets
For two sets Ω₁, Ω₂, their Cartesian product is the set of all ordered
pairs of their elements. That is,
Ω₁ × Ω₂ = {(ω₁, ω₂) | ω₁ ∈ Ω₁, ω₂ ∈ Ω₂}.
More generally, the Cartesian product for sets Ω₁, Ω₂, . . . is written
∏ᵢ Ωᵢ.
Dr N A Heard (Room 543 Huxley Building) Elementary Set Theory 8 / 10
Cardinality
A useful measure of a set is the size, or cardinality.
The cardinality of a finite set is simply the number of elements it
contains. For infinite sets, there are again an infinite number of
different cardinalities they can take. However, amongst these there is a
most important distinction: between those which are countable and
those which are not.
A set Ω is countable if ∃ a function f : N → Ω s.t. f(N) ⊇ Ω. That is,
the elements of Ω can satisfactorily be written out as a possibly
unending list {ω₁, ω₂, ω₃, . . .}. Note that all finite sets are countable.
A set is countably infinite if it is countable but not finite. Clearly N is
countably infinite. So is N × N.
A set which is not countable is uncountable. R is uncountable.
Dr N A Heard (Room 543 Huxley Building) Elementary Set Theory 9 / 10
The empty set has zero cardinality, |∅| = 0.
For finite sets A and B:
If A and B are disjoint, then
|A ∪ B| = |A| + |B|;
otherwise,
|A ∪ B| = |A| + |B| − |A ∩ B|.
Dr N A Heard (Room 543 Huxley Building) Elementary Set Theory 10 / 10
Elementary Set Theory
COMP 245 STATISTICS
Dr N A Heard
Contents
1 Sets, subsets and complements 1
1.1 Sets and notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Subsets, Complements and Singletons . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 Set operations 2
2.1 Unions and Intersections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2.2 Cartesian Products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
3 Cardinality 3
1 Sets, subsets and complements
1.1 Sets and notation
A set is any collection of distinct objects, and is a fundamental object of mathematics.
∅ = {}, the empty set containing no objects, is included.
The objects in a set can be anything, for example integers, real numbers, or the objects may
even themselves be sets.
Notation and abbreviations:
∈ - is an element of (set membership)
⇔ - if and only if (equivalence)
⇒ - implies
∃ - there exists
∀ - for all
s.t. or | - such that
wrt - with respect to
1.2 Subsets, Complements and Singletons
If a set B contains all of the objects contained in another set A, and possibly some other objects
besides, we say A is a subset of B and write A ⊆ B.
Suppose A ⊆ B for two sets A and B. If we also have B ⊆ A we write A = B, whereas if
we know B ⊈ A we write A ⊂ B.
The complement of a set A wrt a universal set Ω (say, of all possible values ω) is
Ā = {ω ∈ Ω | ω ∉ A}.
A singleton is a set with exactly one element - {ω} for some ω ∈ Ω.
2 Set operations
2.1 Unions and Intersections
Consider two sets A and B.
The union of A and B, A ∪ B = {ω | ω ∈ A or ω ∈ B}.
The intersection of A and B, A ∩ B = {ω | ω ∈ A and ω ∈ B}.
More generally, for sets A₁, A₂, . . . we define
⋃ᵢ Aᵢ = {ω | ∃i s.t. ω ∈ Aᵢ},
⋂ᵢ Aᵢ = {ω | ∀i, ω ∈ Aᵢ}.
If ∀i ≠ j, Aᵢ ∩ Aⱼ = ∅, we say the sets are disjoint. Furthermore, if we also have ⋃ᵢ Aᵢ = Ω,
we say the sets form a partition of Ω.
Both operators are commutative:
A ∪ B = B ∪ A,
A ∩ B = B ∩ A.
The union operator is distributive over intersection, and vice versa:
A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C),
A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C).
Differences and De Morgan's Laws
The difference of A and B, A\B = A ∩ B̄ = {ω | ω ∈ A and ω ∉ B}.
Notice A and B are disjoint ⇔ A\B = A.
The following rules, known as De Morgan's Laws, provide useful relations between com-
plements, unions and intersections:
(A ∪ B)̄ = Ā ∩ B̄,   (A ∩ B)̄ = Ā ∪ B̄.
Examples
Let Ω = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} be our universal set and A = {1, 2, 3, 4, 5, 6} and B =
{5, 6, 7, 8, 9} be two sets of elements of Ω.
A ∪ B = {1, 2, 3, 4, 5, 6, 7, 8, 9}.
A ∩ B = {5, 6}.
A\B = {1, 2, 3, 4}.
(A ∪ B)̄ = {10}.
(A ∩ B)̄ = {1, 2, 3, 4, 7, 8, 9, 10}.
Ā = {7, 8, 9, 10}, B̄ = {1, 2, 3, 4, 10}.
Ā ∩ B̄ = {10}.
Ā ∪ B̄ = {1, 2, 3, 4, 7, 8, 9, 10}.
2.2 Cartesian Products
For two sets Ω₁, Ω₂, their Cartesian product is the set of all ordered pairs of their elements.
That is,
Ω₁ × Ω₂ = {(ω₁, ω₂) | ω₁ ∈ Ω₁, ω₂ ∈ Ω₂}.
More generally, the Cartesian product for sets Ω₁, Ω₂, . . . is written ∏ᵢ Ωᵢ.
3 Cardinality
A useful measure of a set is the size, or cardinality.
The cardinality of a finite set is simply the number of elements it contains. For infinite sets,
there are again an infinite number of different cardinalities they can take. However, amongst
these there is a most important distinction: between those which are countable and those
which are not.
A set Ω is countable if ∃ a function f : N → Ω s.t. f(N) ⊇ Ω. That is, the elements of Ω
can satisfactorily be written out as a possibly unending list {ω₁, ω₂, ω₃, . . .}. Note that all finite
sets are countable.
A set is countably infinite if it is countable but not finite. Clearly N is countably infinite.
So is N × N.
A set which is not countable is uncountable. R is uncountable.
The empty set has zero cardinality, |∅| = 0.
For finite sets A and B:
If A and B are disjoint, then
|A ∪ B| = |A| + |B|;
otherwise,
|A ∪ B| = |A| + |B| − |A ∩ B|.
3
COMP 245 Statistics
Exercises 2b - Further Probability
1. Show that if three events A, B, and C are independent, then A and B C are independent.
2. The sample space S of a random experiment is given by S = {a, b, c, d}, with probabilities
P({a}) = 0.2, P({b}) = 0.3, P({c}) = 0.4, P({d}) = 0.1. Let A denote event {a, b}, and B the event
{b, c, d}. Determine the following probabilities:
(a) P(A)
(b) P(B)
(c) P(Ā)
(d) P(A ∪ B)
(e) P(A ∩ B)
3. Two factories produce similar parts. Factory 1 produces 1000 parts, 100 of which are
defective. Factory 2 produces 2000 parts, 150 of which are defective. A part is selected at
random and found to be defective; what is the probability that it came from factory 1?
4. In an experiment in which two fair dice are thrown, let A be the event that the first die is odd,
let B be the event that the second die is odd, and let C be the event that the sum is odd. Show
that events A, B, and C are pairwise independent, but A, B, and C are not jointly independent.
5. A company producing mobile phones has three manufacturing plants, producing 50, 30,
and 20 percent respectively of its product. Suppose that the probabilities that a phone
manufactured by these plants is defective are 0.02, 0.05, and 0.01 respectively.
(a) If a phone is selected at random from the output of the company, what is the probability
that it is defective?
(b) If a phone selected at random is found to be defective, what is the probability that it was
manufactured by the second plant?
6. In a gambling game called craps, a pair of dice is rolled and the outcome is the sum of the
dice. The player wins on the first roll if the sum is 7 or 11 and loses if the sum is 2, 3, or 12.
If the sum is 4, 5, 6, 8, 9, or 10, that number is called the player's point. Once the point
is established, the rule is: If the player rolls a 7 before the point, the player loses; but if the
point is rolled before a 7, the player wins. Compute the probability of winning in the game of
craps.
COMP 245 Statistics
Solutions 2b - Further Probability
1.
P{A ∩ (B ∪ C)} = P{(A ∩ B) ∪ (A ∩ C)}
= P(A ∩ B) + P(A ∩ C) − P(A ∩ B ∩ C)
= P(A)P(B) + P(A)P(C) − P(A)P(B)P(C)
= P(A){P(B) + P(C) − P(B ∩ C)}
= P(A)P(B ∪ C),
so A and B ∪ C are independent.
2. (a) P(A) = P({a}) + P({b}) = 0.2 + 0.3 = 0.5.
(b) P(B) = P({b}) + P({c}) + P({d}) = 0.3 + 0.4 + 0.1 = 0.8.
(c) P(Ā) = P({c, d}) = P({c}) + P({d}) = 0.4 + 0.1 = 0.5.
(d) P(A ∪ B) = P({a, b, c, d}) = P(S) = 1.
(e) P(A ∩ B) = P({b}) = 0.3.
3. Let A = Part came from factory 1, B = Part is defective. We want P(A|B).
We know P(B|A) = 100/1000, P(B|Ā) = 150/2000. Also, since a part is selected at random,
there is a 1000/(1000+2000) = 1/3 chance that it comes from factory 1. That is, we know
P(A) = 1/3 and P(Ā) = 2/3.
The overall probability that a selected part will be defective, P(B) = (100+150)/(1000+2000) =
250/3000 = 1/12. Hence
P(A|B) = P(B|A) P(A)/P(B)
= (1/10 × 1/3)/(1/12)
= 0.4.
4. Sample space:
(1,1) (1,2) (1,3) (1,4) (1,5) (1,6)
(2,1) (2,2) (2,3) (2,4) (2,5) (2,6)
(3,1) (3,2) (3,3) (3,4) (3,5) (3,6)
(4,1) (4,2) (4,3) (4,4) (4,5) (4,6)
(5,1) (5,2) (5,3) (5,4) (5,5) (5,6)
(6,1) (6,2) (6,3) (6,4) (6,5) (6,6)
There are 36 equally likely possible outcomes altogether. In 18 of these, the first die is odd,
and in 18 the second die is odd. Hence P(A) = P(B) = 1/2. Likewise, from the table we see
that in 18 of the 36 the sum is odd, so that P(C) = 1/2.
Also from the table we can see that P(A ∩ B) = P(A ∩ C) = P(B ∩ C) = 9/36 = 1/4.
However, P(A) · P(B) = 1/2 × 1/2 = 1/4, which is equal to P(A ∩ B), so that A and B are
independent. Similarly for B and C and for A and C.
Thus A, B, and C are pairwise independent.
Since the sum of two odd numbers is even, we have P(A ∩ B ∩ C) = 0, which is not equal to
P(A)P(B)P(C) = 1/8, so the three events A, B, and C are not jointly independent.
5. Let D be the event that the phone is defective, and let Mᵢ be the event that the phone is
manufactured by plant i (i = 1, 2, 3).
(a) P(D) = Σ³ᵢ₌₁ P(D|Mᵢ)P(Mᵢ) = 0.02 × 0.5 + 0.05 × 0.3 + 0.01 × 0.2 = 0.027.
(b) P(M₂|D) = P(D|M₂)P(M₂) / P(D) = (0.05 × 0.3) / 0.027 = 0.556.
6. Let A, B, and C be the events that the player wins, the player wins on the first roll, and the
player establishes a point and then wins, respectively. Then P(A) = P(B) + P(C).
Now P(B) = P(sum = 7) + P(sum = 11) = 6/36 + 2/36 = 2/9.
Let Aₖ be the event that the point k occurs before a 7. Then
P(C) = Σ_{k∈{4,5,6,8,9,10}} P(Aₖ) P(point = k).
Well P(Aₖ) = P(sum = k) / [P(sum = k) + P(sum = 7)].
We have P(sum = 4) = 3/36, P(sum = 5) = 4/36, P(sum = 6) = 5/36, P(sum = 8) = 5/36,
P(sum = 9) = 4/36 and P(sum = 10) = 3/36.
Hence it follows that P(A₄) = 1/3, P(A₅) = 2/5, P(A₆) = 5/11, P(A₈) = 5/11, P(A₉) = 2/5
and P(A₁₀) = 1/3.
So P(A) = P(B) + P(C) = 2/9 + 134/495 = 0.49293.
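As an added check (not part of the original solutions), the exact value 244/495 ≈ 0.4929 can be confirmed by simulating the game; the function name below is our own.

```python
import random

rng = random.Random(1)

def play_craps():
    """Simulate one game of craps; return True if the player wins."""
    roll = lambda: rng.randint(1, 6) + rng.randint(1, 6)
    first = roll()
    if first in (7, 11):
        return True
    if first in (2, 3, 12):
        return False
    while True:                      # keep rolling until the point or a 7 appears
        s = roll()
        if s == first:
            return True
        if s == 7:
            return False

n = 200_000
wins = sum(play_craps() for _ in range(n))
print(wins / n)                      # should be close to 244/495 = 0.4929...
```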
Probability
COMP 245 STATISTICS
Dr N A Heard
Room 543 Huxley Building
Dr N A Heard (Room 543 Huxley Building) Probability 1 / 47
Sample Spaces
We consider a random experiment whose range of possible outcomes
can be described by a set S, called the sample space.
We use S as our universal set (Ω).
Ex.
Coin tossing: S = {H, T}.
Die rolling: S = {⚀, ⚁, ⚂, ⚃, ⚄, ⚅}.
2 coins: S = {(H, H), (H, T), (T, H), (T, T)}.
Dr N A Heard (Room 543 Huxley Building) Probability 2 / 47
Events
An event E is any subset of the sample space, E S; it is a collection of
some of the possible outcomes.
Ex.
Coin tossing: E = {H}, E = {T}.
Die rolling: E = {⚅}, E = {Even numbered face} = {⚁, ⚃, ⚅}.
2 coins: E = {Head on the first toss} = {(H, H), (H, T)}.
Extreme possible events are ∅ (the null event) or S.
The singleton subsets of S (those subsets which contain exactly one
element from S) are known as the elementary events of S.
Dr N A Heard (Room 543 Huxley Building) Probability 3 / 47
Suppose we now perform this random experiment; the outcome will
be a single element s∗ ∈ S. Then for any event E ⊆ S, we will say E has
occurred if and only if s∗ ∈ E.
If E has not occurred, it must be that s∗ ∉ E ⇔ s∗ ∈ Ē, so Ē has
occurred; so Ē can be read as the event "not E".
First notice that the smallest event which will have occurred will be
the singleton {s∗}. For any other event E, E will occur if and only if
{s∗} ⊆ E. Thus we can immediately draw two conclusions before the
experiment has even been performed.
Dr N A Heard (Room 543 Huxley Building) Probability 4 / 47
Remark
For any sample space S, the following statements will always be true:
1. the null event ∅ will never occur;
2. the universal event S will always occur.
Hence it is only for events E in between these extreme events,
∅ ⊂ E ⊂ S, for which we have uncertainty about whether E will occur.
It is precisely for quantifying this uncertainty over these events that
we require the notion of probability.
Dr N A Heard (Room 543 Huxley Building) Probability 5 / 47
Set operators on events
Consider a set of events {E₁, E₂, . . .}.
The event ⋃ᵢ Eᵢ = {s ∈ S | ∃i s.t. s ∈ Eᵢ} will occur if and only if at
least one of the events {Eᵢ} occurs. So E₁ ∪ E₂ can be read as event
E₁ or E₂.
The event ⋂ᵢ Eᵢ = {s ∈ S | ∀i, s ∈ Eᵢ} will occur if and only if all of
the events {Eᵢ} occur. So E₁ ∩ E₂ can be read as events E₁ and E₂.
The events are said to be mutually exclusive if ∀i ≠ j, Eᵢ ∩ Eⱼ = ∅
(i.e. they are disjoint). At most one of the events can occur.
Dr N A Heard (Room 543 Huxley Building) Probability 6 / 47
σ-algebras of events
So for our random experiment with sample space S of possible
outcomes, for which events/subsets E ⊆ S would we like to define the
probability of E occurring?
Every subset? If S is finite or even countable, then this is fine. But for
uncountably infinite sample spaces it can be shown that we can very
easily start off defining sensible, proper probabilities for an initial
collection of subsets of S in a way that leaves it impossible to then
carry on and consistently define probability for all the remaining
subsets of S.
For this reason, when defining a probability measure on S we (usually
implicitly) simultaneously agree on a collection of subsets of S that we
wish to measure with probability. Generically, we will refer to this set of
subsets as 𝒮.
Dr N A Heard (Room 543 Huxley Building) Probability 7 / 47
There are three properties we will require of 𝒮, the reasons for which
will become immediately apparent when we meet the axioms of
probability.
We need 𝒮 to be
1. nonempty, S ∈ 𝒮;
2. closed under complements: E ∈ 𝒮 ⇒ Ē ∈ 𝒮;
3. closed under countable unions: E₁, E₂, . . . ∈ 𝒮 ⇒ ⋃ᵢ Eᵢ ∈ 𝒮.
Such a collection of sets is known as a σ-algebra.
Dr N A Heard (Room 543 Huxley Building) Probability 8 / 47
Axioms of Probability
A probability measure on the pair (S, 𝒮) is a mapping P : 𝒮 → [0, 1]
satisfying the following three axioms for all subsets of S on which it is
defined (𝒮, the measurable subsets of S):
1. ∀E ∈ 𝒮, 0 ≤ P(E) ≤ 1;
2. P(S) = 1;
3. Countably additive: for disjoint subsets E₁, E₂, . . . ∈ 𝒮,
P(⋃ᵢ Eᵢ) = Σᵢ P(Eᵢ).
Dr N A Heard (Room 543 Huxley Building) Probability 9 / 47
Simple Probability Results
Exercises:
From 1-3 it is easy to derive the following:
1. P(Ē) = 1 − P(E);
2. P(∅) = 0;
3. For any events E and F,
P(E ∪ F) = P(E) + P(F) − P(E ∩ F).
Dr N A Heard (Room 543 Huxley Building) Probability 10 / 47
Independence
Two events E and F are said to be independent if and only if
P(E ∩ F) = P(E)P(F). This is sometimes written E ⊥ F.
More generally, a set of events {E₁, E₂, . . .} are said to be independent
if for any finite subset {E₁, E₂, . . . , Eₙ},
P(⋂ᵢ₌₁ⁿ Eᵢ) = ∏ᵢ₌₁ⁿ P(Eᵢ).
Dr N A Heard (Room 543 Huxley Building) Probability 11 / 47
If events E and F are independent, then Ē and F are also independent.
Proof:
Since F = (E ∩ F) ∪ (Ē ∩ F) is a disjoint union,
P(F) = P(E ∩ F) + P(Ē ∩ F) by Axiom 3. So P(Ē ∩ F) =
P(F) − P(E ∩ F) = P(F) − P(E)P(F) = (1 − P(E))P(F) = P(Ē)P(F).
Dr N A Heard (Room 543 Huxley Building) Probability 12 / 47
Classical
If S is finite and the elementary events are considered equally likely,
then the probability of an event E is the proportion of all outcomes in S
which lie inside E,
P(E) = |E| / |S|.
Ex.
Rolling a die: Elementary events are {⚀}, {⚁}, . . . , {⚅}.
P({⚀}) = P({⚁}) = . . . = P({⚅}) = 1/6.
P(Odd number) = P({⚀, ⚂, ⚄}) = 3/6 = 1/2.
Randomly drawn playing card: 52 elementary events
{2♣}, {3♣}, . . . , {A♣}, {2♦}, {3♦}, . . . , . . . , {K♠}, {A♠}.
P(♣) = P(♦) = P(♥) = P(♠) = 1/4.
The joint event {Suit is red and value is 3} contains two of 52
elementary events, so P({red 3}) = 2/52 = 1/26. Since suit and face
value should be independent, check that
P({red 3}) = P({♥, ♦}) × P({any 3}).
Dr N A Heard (Room 543 Huxley Building) Probability 13 / 47
The equally likely (uniform) idea can be extended to infinite spaces,
by apportioning probability to sets not by their cardinality but by
other standard measures, like volume or mass.
Ex.
If a meteorite were to strike Earth, the probability that it will strike
land rather than sea would be given by
(Total area of land) / (Total area of Earth).
Dr N A Heard (Room 543 Huxley Building) Probability 14 / 47
Frequentist
Observation shows that if one takes repeated observations in
identical random situations, in which event E may or may not occur,
then the proportion of times in which E occurs tends to some limiting
value - called the probability of E.
Ex.
Proportion of heads in tosses of a coin: H, H, T, H, T, T, H, T, T, . . . → 1/2.
Dr N A Heard (Room 543 Huxley Building) Probability 15 / 47
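As an added illustration of this limiting behaviour (not part of the original slides), the following Python sketch tracks the running proportion of heads in simulated fair-coin tosses:

```python
import random

rng = random.Random(0)
tosses = [rng.random() < 0.5 for _ in range(100_000)]   # True = head

for n in (10, 100, 1_000, 10_000, 100_000):
    prop = sum(tosses[:n]) / n      # proportion of heads in the first n tosses
    print(n, round(prop, 4))        # tends towards 0.5 as n grows
```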
Subjective
Probability is a degree of belief held by an individual.
For example, De Finetti (1937/1964) suggested the following: Suppose
a random experiment is to be performed, where an event E S may or
may not happen. Now suppose an individual is entered into a game
regarding this experiment where he has two choices, each leading to
monetary (or utility) consequences:
1. Gamble: If E occurs, he wins 1; if Ē occurs, he wins 0;
2. Stick: Regardless of the outcome of the experiment, he receives
P(E) for some real number P(E).
The critical value of P(E) for which the individual is indifferent
between options 1 and 2 is defined to be the individual's probability
for the event E occurring.
Dr N A Heard (Room 543 Huxley Building) Probability 16 / 47
This procedure can be repeated for all possible events E in S.
Suppose after this process of elicitation of the individual's preferences
under the different events, we can simultaneously arrange an arbitrary
number of monetary bets with the individual based on the outcome of
the experiment.
If it is possible to choose these bets in such a way that the individual is
certain to lose money (this is called a Dutch Book), then the
individual's degrees of belief are said to be incoherent.
To be coherent, it is easily seen, for example, that we must have
0 ≤ P(E) ≤ 1 for all events E, E ⊆ F ⇒ P(E) ≤ P(F), etc.
Dr N A Heard (Room 543 Huxley Building) Probability 17 / 47
De Méré's problem
Antoine Gombaud, chevalier de Méré (1607-1684) posed to Pascal the
following gambling problem: which of these two events is more
likely?
1. E = {4 rolls of a die yield at least one ⚅}.
2. F = {24 rolls of two dice yield at least one pair of ⚅s}.
De Méré observed that E seemed to lead to a profitable even money
bet whereas F did not.
Dr N A Heard (Room 543 Huxley Building) Probability 18 / 47
We calculate P(E) and P(F).
1. Each roll of the die is independent from the previous rolls, and so
there are 6⁴ equally likely outcomes. Of these, 5⁴ show no ⚅s.
So the probability of no ⚅ showing is 5⁴/6⁴ ≈ 0.4823.
So P(E), the probability of at least one ⚅ showing, is
1 − 0.4823 = 0.5177.
2. There are 36²⁴ equally likely outcomes here. Of these, 35²⁴ don't
show a pair of ⚅s.
So the probability of no pair of ⚅s is 35²⁴/36²⁴ ≈ 0.5086.
So P(F), the probability of at least one pair of ⚅s, is
1 − 0.5086 = 0.4914.
Hence P(E) ≈ 0.5177 > 1/2 > 0.4914 ≈ P(F).
Dr N A Heard (Room 543 Huxley Building) Probability 19 / 47
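As an added numerical check (not part of the original slides), both probabilities can be computed exactly and confirmed by simulation:

```python
import random

# Exact values
p_E = 1 - (5 / 6) ** 4       # at least one six in 4 rolls        -> 0.5177...
p_F = 1 - (35 / 36) ** 24    # at least one double six in 24 rolls -> 0.4914...
print(p_E, p_F)

# Monte Carlo confirmation
rng = random.Random(42)
n = 100_000
e_hits = sum(any(rng.randint(1, 6) == 6 for _ in range(4)) for _ in range(n))
f_hits = sum(any(rng.randint(1, 6) == 6 and rng.randint(1, 6) == 6
                 for _ in range(24)) for _ in range(n))
print(e_hits / n, f_hits / n)
```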
Coin and Die
Consider tossing a coin and rolling a die.
We would consider each of the 12 possible combinations of Head/Tail
and die value as equally likely.
So we can construct a probability table:
       ⚀      ⚁      ⚂      ⚃      ⚄      ⚅
H     1/12   1/12   1/12   1/12   1/12   1/12   1/2
T     1/12   1/12   1/12   1/12   1/12   1/12   1/2
      1/6    1/6    1/6    1/6    1/6    1/6
From this table we can calculate the probability of any event we might
be interested in, simply by adding up the probabilities of all the
elementary events it contains.
Dr N A Heard (Room 543 Huxley Building) Probability 20 / 47
For example, the event of getting a head on the coin
{H} = {(H, ⚀), (H, ⚁), . . . , (H, ⚅)}
has probability
P({H}) = P({(H, ⚀)}) + P({(H, ⚁)}) + . . . + P({(H, ⚅)})
= 1/12 + 1/12 + . . . + 1/12 = 1/2.
Notice the two experiments satisfy our probability definition of
independence, since for example
P({(H, ⚅)}) = 1/12 = 1/2 × 1/6 = P({H}) × P({⚅}).
Dr N A Heard (Room 543 Huxley Building) Probability 21 / 47
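As an added illustration (not part of the original slides), the row and column sums of such a probability table, and the independence check, can be carried out programmatically; the variable names below are our own:

```python
from fractions import Fraction as F

# Joint probabilities P(coin, die) for the fair coin + fair die experiment.
joint = {(c, d): F(1, 12) for c in "HT" for d in range(1, 7)}

# Marginal probabilities are row and column sums of the table.
p_coin = {c: sum(p for (ci, _), p in joint.items() if ci == c) for c in "HT"}
p_die  = {d: sum(p for (_, di), p in joint.items() if di == d) for d in range(1, 7)}

print(p_coin["H"], p_die[6])        # 1/2 and 1/6

# Independence: every joint probability factorises into its marginals.
assert all(joint[c, d] == p_coin[c] * p_die[d] for c in "HT" for d in range(1, 7))
```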
Coin and Two Dice
A crooked die called a top has the same faces on opposite sides.
Suppose we have two dice, one normal and one which is a top with
opposite faces numbered ⚀, ⚂ or ⚄.
Now suppose we first flip the coin. If it comes up heads, we roll the
normal die; tails, and we roll the top.
To calculate the probability table easily, we notice that this is
equivalent to the previous game using one normal die except with the
change after tails that a roll of a ⚅ is relabelled as a ⚀, a ⚃ as a ⚂,
and a ⚁ as a ⚄.
So we can just merge those probabilities in the tails row.
Dr N A Heard (Room 543 Huxley Building) Probability 22 / 47
       ⚀      ⚁      ⚂      ⚃      ⚄      ⚅
H     1/12   1/12   1/12   1/12   1/12   1/12   1/2
T     1/6    0      1/6    0      1/6    0      1/2
      1/4    1/12   1/4    1/12   1/4    1/12
The probabilities of the different outcomes of the dice change
according to the outcome of the coin toss. And note, for example,
P({(H, ⚅)}) = 1/12 ≠ 1/24 = 1/2 × 1/12 = P({H}) × P({⚅}).
So the two experiments are now dependent.
Conditional Probability
For two events E and F in S where P(F) ≠ 0, we define the conditional
probability of E occurring given that we know F has occurred as
Defn.
P(E|F) = P(E ∩ F) / P(F).
Note that if E and F are independent, then
P(E|F) = P(E ∩ F) / P(F) = P(E)P(F) / P(F) = P(E).
Dr N A Heard (Room 543 Huxley Building) Probability 24 / 47
Example 1 - Rolling a ⚄
We roll a normal die once.
1. What is the probability of E = {the die shows a ⚄}?
2. What is the probability of E = {the die shows a ⚄} given we know
F = {the die shows an odd number}?
Solution:
1. P(E) = (Number of ways a ⚄ can come up) / (Total number of possible outcomes) = 1/6.
2. Now the set of possible outcomes is just F = {⚀, ⚂, ⚄}.
So P(E|F) = (Number of ways a ⚄ can come up) / (Total number of possible outcomes) = 1/3.
Note P(F) = 1/2 and E ∩ F = E, and hence we have P(E|F) = P(E ∩ F) / P(F).
Dr N A Heard (Room 543 Huxley Building) Probability 25 / 47
Example 2 - Rolling two dice
Suppose we roll two normal dice, one from each hand.
Then the sample space comprises all of the ordered pairs of dice values
S = {(⚀, ⚀), (⚀, ⚁), . . . , (⚅, ⚅)}.
Let E be the event that the die thrown from the left hand will show a
larger value than the die thrown from the right hand.
P(E) = (# outcomes with left value > right) / (total # outcomes) = 15/36.
Dr N A Heard (Room 543 Huxley Building) Probability 26 / 47
Suppose we are now informed that an event F has occurred, where
F = {the value of the left hand die is ⚄}.
How does this change the probability of E occurring?
Well since F has occurred, the only sample space elements which could
have possibly occurred are exactly those elements in
F = {(⚄, ⚀), (⚄, ⚁), (⚄, ⚂), (⚄, ⚃), (⚄, ⚄), (⚄, ⚅)}.
Similarly the only sample space elements in E that could have
occurred now must be in E ∩ F = {(⚄, ⚀), (⚄, ⚁), (⚄, ⚂), (⚄, ⚃)}.
So our revised probability is
(# outcomes with left value > right) / (total # outcomes (⚄, ·)) = 4/6 = P(E ∩ F)/P(F) = P(E|F).
Dr N A Heard (Room 543 Huxley Building) Probability 27 / 47
Discussion of Examples
In both examples, we considered the probability of an event E, and
then reconsidered what this probability would be if we were given the
knowledge that F had occurred. What happened?
Answer: The sample space S was replaced by F, and the event E was
replaced by E ∩ F. So originally, we had
P(E) = P(E|S) = P(E ∩ S) / P(S)
(since E ∩ S = E, and P(S) = 1 by Axiom 2).
So we can think of probability conditioning as a shrinking of the
sample space, with events replaced by their intersections with the
reduced space and a consequent rescaling of probabilities.
Dr N A Heard (Room 543 Huxley Building) Probability 28 / 47
Earlier we met the concept of independence of events according to a
probability measure P. We can now extend that idea to conditional
probabilities since P(·|F) is itself a perfectly good probability measure
obeying the axioms of probability.
For three events E₁, E₂ and F, the event pair E₁ and E₂ are said to be
conditionally independent given F if and only if
P(E₁ ∩ E₂|F) = P(E₁|F)P(E₂|F). This is sometimes written E₁ ⊥ E₂ | F.
Dr N A Heard (Room 543 Huxley Building) Probability 29 / 47
Bayes Theorem
For two events E and F in S, we have
P(E ∩ F) = P(F)P(E|F); (1)
but also, since E ∩ F ≡ F ∩ E,
P(E ∩ F) = P(E)P(F|E). (2)
Equating the RHS of (1) and (2), provided P(F) ≠ 0 we can rearrange
to obtain
Bayes Theorem
P(E|F) = P(E)P(F|E) / P(F).
Dr N A Heard (Room 543 Huxley Building) Probability 30 / 47
Partition Rule
Consider a set of events {F₁, F₂, . . .} which form a partition of S.
Then for any event E ⊆ S,
Partition Rule
P(E) = Σᵢ P(E|Fᵢ)P(Fᵢ).
Proof:
E = E ∩ S = E ∩ (⋃ᵢ Fᵢ) = ⋃ᵢ (E ∩ Fᵢ).
Dr N A Heard (Room 543 Huxley Building) Probability 31 / 47
So
P(E) = P(⋃ᵢ (E ∩ Fᵢ)),
which, by countable additivity (Axiom 3) and noting that since the
{F₁, F₂, . . .} are disjoint so are {E ∩ F₁, E ∩ F₂, . . .}, implies
P(E) = Σᵢ P(E ∩ Fᵢ). (3)
(3) is known as the law of total probability; and it can be rewritten
P(E) = Σᵢ P(E|Fᵢ)P(Fᵢ).
Dr N A Heard (Room 543 Huxley Building) Probability 32 / 47
For any events E and F in S, note that {F, F̄}, say, form a partition of S.
So by the law of total probability we have
P(E) = P(E ∩ F) + P(E ∩ F̄)
= P(E|F)P(F) + P(E|F̄)P(F̄).
Dr N A Heard (Room 543 Huxley Building) Probability 33 / 47
Terminology
When considering multiple events, say E and F, we often refer to
probabilities of the form P(E|F) as conditional probabilities;
probabilities of the form P(E ∩ F) as joint probabilities;
probabilities of the form P(E) as marginal probabilities.
Dr N A Heard (Room 543 Huxley Building) Probability 34 / 47
Example 1 - Defective Chips
Ex.
A box contains 5000 VLSI chips, 1000 from company X and 4000 from
Y. 10% of the chips made by X are defective and 5% of those made by Y
are defective. If a randomly chosen chip is found to be defective, find
the probability that it came from X.
Let E = chip was made by X;
let F = chip is defective.
First of all, which probabilities have we been given?
Dr N A Heard (Room 543 Huxley Building) Probability 35 / 47
A box contains 5000 VLSI chips, 1000 from company X and 4000
from Y.
⇒ P(E) = 1000/5000 = 0.2,
P(Ē) = 4000/5000 = 0.8.
10% of the chips made by X are defective and 5% of those made by
Y are defective.
⇒ P(F|E) = 10% = 0.1,
P(F|Ē) = 5% = 0.05.
Dr N A Heard (Room 543 Huxley Building) Probability 36 / 47
We have enough information to construct the probability table
       E      Ē
F     0.02   0.04   0.06
F̄     0.18   0.76   0.94
      0.2    0.8
The law of total probability has enabled us to extract the marginal
probabilities P(F) and P(F̄) as 0.06 and 0.94 respectively.
So by Bayes Theorem we can calculate the conditional probabilities. In
particular, we want
P(E|F) = P(E ∩ F) / P(F) = 0.02 / 0.06 = 1/3.
Dr N A Heard (Room 543 Huxley Building) Probability 37 / 47
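As an added illustration (not part of the original slides), the same answer follows from a few lines of Python applying the law of total probability and then Bayes Theorem:

```python
p_E = 1000 / 5000            # chip made by company X
p_F_given_E    = 0.10        # defective | made by X
p_F_given_notE = 0.05        # defective | made by Y

# Law of total probability, then Bayes' theorem.
p_F = p_F_given_E * p_E + p_F_given_notE * (1 - p_E)   # 0.06
p_E_given_F = p_F_given_E * p_E / p_F                  # 1/3
print(p_F, p_E_given_F)
```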
Example 2 - Kidney stones
Kidney stones are small (< 2cm diam) or large (> 2cm diam).
Treatment can succeed or fail. The following data were collected from
a sample of 700 patients with kidney stones.
            Success (S)   Failure (S̄)
Large (L)       247            96      343
Small (L̄)       315            42      357
Total           562           138      700
For a patient randomly drawn from this sample, what is the
probability that the outcome of treatment was successful, given the
kidney stones were large?
Dr N A Heard (Room 543 Huxley Building) Probability 38 / 47
Clearly we can get the answer directly from the table by ignoring the
small stone patients
P(S|L) = 247/343,
or we can go the long way round:
P(L) = 343/700,
P(S ∩ L) = 247/700,
P(S|L) = P(S ∩ L) / P(L) = (247/700) / (343/700) = 247/343.
Dr N A Heard (Room 543 Huxley Building) Probability 39 / 47
Example 3 - Multiple Choice Question
A multiple choice question has c available choices. Let p be the
probability that the student knows the right answer, and 1 − p that he
does not. When he doesn't know, he chooses an answer at random.
Given that the answer the student chooses is correct, what is the
probability that the student knew the correct answer?
Let A be the event that the question is answered correctly;
let K be the event that the student knew the correct answer.
Then we require P(K|A).
By Bayes Theorem
P(K|A) = P(A|K)P(K) / P(A)
and we know P(A|K) = 1 and P(K) = p, so it remains to find P(A).
Dr N A Heard (Room 543 Huxley Building) Probability 40 / 47
By the partition rule,
P(A) = P(A|K)P(K) + P(A|K̄)P(K̄)
and since P(A|K̄) = 1/c, this gives
P(A) = 1 × p + (1/c)(1 − p).
Hence
P(K|A) = p / [p + (1 − p)/c] = cp / (cp + 1 − p).
Note: the larger c is, the greater the probability that the student knew
the answer, given that they answered correctly.
Dr N A Heard (Room 543 Huxley Building) Probability 41 / 47
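As an added illustration (not part of the original slides), the formula cp/(cp + 1 − p) is easy to explore numerically; the helper function below and its name are our own:

```python
def p_knew_given_correct(c: int, p: float) -> float:
    """P(student knew the answer | answered correctly) for c choices."""
    return c * p / (c * p + 1 - p)

for c in (2, 4, 10):
    print(c, round(p_knew_given_correct(c, p=0.5), 3))
# With p = 0.5 this gives 0.667, 0.8, 0.909: larger c strengthens the inference.
```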
Example 4 - Super Computer Jobs
Measurements at the North Carolina Super Computing Center (NCSC)
on a certain day showed that 15% of the jobs came from Duke, 35%
from UNC, and 50% from NC State University. Suppose that the
probabilities that each of these jobs is a multitasking job is 0.01, 0.05,
and 0.02 respectively.
1. Find the probability that a job chosen at random is a
multitasking job.
2. Find the probability that a randomly chosen job comes from UNC,
given that it is a multitasking job.
Dr N A Heard (Room 543 Huxley Building) Probability 42 / 47
Solution:
Let Uᵢ = job is from university i, i = 1, 2, 3 for Duke, UNC, NC State
respectively;
let M = job uses multitasking.
1. P(M) = P(M|U₁)P(U₁) + P(M|U₂)P(U₂) + P(M|U₃)P(U₃)
= 0.01 × 0.15 + 0.05 × 0.35 + 0.02 × 0.5 = 0.029.
2. P(U₂|M) = P(M|U₂)P(U₂) / P(M) = (0.05 × 0.35) / 0.029 = 0.603.
Dr N A Heard (Room 543 Huxley Building) Probability 43 / 47
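Calculations of this kind always follow the same two steps (law of total probability, then Bayes Theorem), so as an added sketch (not part of the original slides; the function name is our own) they can be wrapped in a small reusable helper:

```python
def posterior(priors, likelihoods):
    """Bayes' theorem over a partition: return ({P(F_i | E)}, P(E)).

    priors[i] = P(F_i), likelihoods[i] = P(E | F_i)."""
    joint = {i: priors[i] * likelihoods[i] for i in priors}
    p_E = sum(joint.values())                 # law of total probability
    return {i: joint[i] / p_E for i in joint}, p_E

post, p_M = posterior({"Duke": 0.15, "UNC": 0.35, "NCState": 0.50},
                      {"Duke": 0.01, "UNC": 0.05, "NCState": 0.02})
print(p_M, post["UNC"])                       # 0.029 and ~0.603
```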
Example 5 - HIV Test
A new HIV test is claimed to correctly identify 95% of people who are
really HIV positive and 98% of people who are really HIV negative. Is
this acceptable?
If only 1 in a 1000 of the population are HIV positive, what is the
probability that someone who tests positive actually has HIV?
Solution:
Let H = has the HIV virus;
let T = test is positive.
Dr N A Heard (Room 543 Huxley Building) Probability 44 / 47
We have been given P(T|H) = 0.95, P(T̄|H̄) = 0.98 and P(H) = 0.001.
We wish to find P(H|T).
P(H|T) = P(T|H)P(H) / [P(T|H)P(H) + P(T|H̄)P(H̄)]
= (0.95 × 0.001) / (0.95 × 0.001 + 0.02 × 0.999)
= 0.045.
That is, less than 5% of those who test positive really have HIV.
Dr N A Heard (Room 543 Huxley Building) Probability 45 / 47
Example 5 - continued
If the HIV test shows a positive result, the individual might wish to
retake the test. Suppose that the results of a person retaking the HIV
test are conditionally independent given HIV status (clearly two
results of the test would certainly not be unconditionally independent).
If the test again gives a positive result, what is the probability that the
person actually has HIV?
Solution:
Let Tᵢ = ith test is positive.
P(H|T₁ ∩ T₂) = P(T₁ ∩ T₂|H)P(H) / P(T₁ ∩ T₂)
= P(T₁ ∩ T₂|H)P(H) / [P(T₁ ∩ T₂|H)P(H) + P(T₁ ∩ T₂|H̄)P(H̄)]
= P(T₁|H)P(T₂|H)P(H) / [P(T₁|H)P(T₂|H)P(H) + P(T₁|H̄)P(T₂|H̄)P(H̄)]
by conditional independence.
Dr N A Heard (Room 543 Huxley Building) Probability 46 / 47
Since P(Tᵢ|H) = 0.95 and P(Tᵢ|H̄) = 0.02,
P(H|T₁ ∩ T₂) = (0.95 × 0.95 × 0.001) / (0.95 × 0.95 × 0.001 + 0.02 × 0.02 × 0.999) ≈ 0.693.
So almost a 70% chance after taking the test twice and both times
showing as positive. For three times, this goes up to 99%.
Dr N A Heard (Room 543 Huxley Building) Probability 47 / 47
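As an added illustration (not part of the original slides), the effect of each further positive result can be computed by applying Bayes Theorem once per test; sequential updating is valid here precisely because of the conditional independence assumption.

```python
p_H = 0.001                  # prevalence, P(H)
sens = 0.95                  # P(T | H)
p_false_pos = 0.02           # P(T | not H)

post = p_H
for k in range(1, 4):
    # Update the current probability of HIV with one more positive result.
    post = sens * post / (sens * post + p_false_pos * (1 - post))
    print(k, round(post, 3))    # ~0.045, ~0.693, ~0.991
```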
Probability
COMP 245 STATISTICS
Dr N A Heard
Contents
1 Sample Spaces and Events 1
1.1 Sample Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Combinations of Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Probability 3
2.1 Axioms of Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Simple Probability Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.3 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.4 Interpretations of Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3 Examples 6
3.1 De M er es problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.2 Joint events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
4 Conditional Probability 8
4.1 Denition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
4.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
4.3 Conditional Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
4.4 Bayes Theorem and the Partition Rule . . . . . . . . . . . . . . . . . . . . . . . . . 9
4.5 More Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1 Sample Spaces and Events
1.1 Sample Spaces
We consider a random experiment whose range of possible outcomes can be described by a set
S, called the sample space.
We use S as our universal set (Ω).
Ex.
Coin tossing: S = {H, T}.
Die rolling: S = {⚀, ⚁, ⚂, ⚃, ⚄, ⚅}.
2 coins: S = {(H, H), (H, T), (T, H), (T, T)}.
1
1.2 Events
An event E is any subset of the sample space, E S; it is a collection of some of the possible
outcomes.
Ex.
Coin tossing: E = {H}, E = {T}.
Die rolling: E = { }, E = {Even numbered face} = { , , }.
2 coins: E = {Head on the rst toss} = {(H, H), (H, T)}.
Extreme possible events are (the null event) or S.
The singleton subsets of S (those subsets which contain exactly one element from S) are
known as the elementary events of S.
Suppose we now perform this random experiment; the outcome will be a single element
s∗ ∈ S. Then for any event E ⊆ S, we will say E has occurred if and only if s∗ ∈ E.
If E has not occurred, it must be that s∗ ∉ E ⇔ s∗ ∈ Ē, so Ē has occurred; so Ē can be
read as the event "not E".
First notice that the smallest event which will have occurred will be the singleton {s∗}. For
any other event E, E will occur if and only if {s∗} ⊆ E. Thus we can immediately draw two
conclusions before the experiment has even been performed.
Remark 1.1. For any sample space S, the following statements will always be true:
1. the null event ∅ will never occur;
2. the universal event S will always occur.
Hence it is only for events E in between these extreme events, ∅ ⊂ E ⊂ S, for which we
have uncertainty about whether E will occur. It is precisely for quantifying this uncertainty
over these events that we require the notion of probability.
1.3 Combinations of Events
Set operators on events
Consider a set of events {E₁, E₂, . . .}.
The event ⋃ᵢ Eᵢ = {s ∈ S | ∃i s.t. s ∈ Eᵢ} will occur if and only if at least one of the events
{Eᵢ} occurs. So E₁ ∪ E₂ can be read as event E₁ or E₂.
The event ⋂ᵢ Eᵢ = {s ∈ S | ∀i, s ∈ Eᵢ} will occur if and only if all of the events {Eᵢ} occur.
So E₁ ∩ E₂ can be read as events E₁ and E₂.
The events are said to be mutually exclusive if ∀i ≠ j, Eᵢ ∩ Eⱼ = ∅ (i.e. they are disjoint).
At most one of the events can occur.
2
2 Probability
2.1 Axioms of Probability
σ-algebras of events
So for our random experiment with sample space S of possible outcomes, for which events/subsets
E ⊆ S would we like to define the probability of E occurring?
Every subset? If S is finite or even countable, then this is fine. But for uncountably infinite
sample spaces it can be shown that we can very easily start off defining sensible, proper prob-
abilities for an initial collection of subsets of S in a way that leaves it impossible to then carry
on and consistently define probability for all the remaining subsets of S.
For this reason, when defining a probability measure on S we (usually implicitly) simulta-
neously agree on a collection of subsets of S that we wish to measure with probability. Generi-
cally, we will refer to this set of subsets as 𝒮.
There are three properties we will require of 𝒮, the reasons for which will become imme-
diately apparent when we meet the axioms of probability.
We need 𝒮 to be
1. nonempty, S ∈ 𝒮;
2. closed under complements: E ∈ 𝒮 ⇒ Ē ∈ 𝒮;
3. closed under countable unions: E₁, E₂, . . . ∈ 𝒮 ⇒ ⋃ᵢ Eᵢ ∈ 𝒮.
Such a collection of sets is known as a σ-algebra.
Axioms of Probability
A probability measure on the pair (S, 𝒮) is a mapping P : 𝒮 → [0, 1] satisfying the fol-
lowing three axioms for all subsets of S on which it is defined (𝒮, the measurable subsets of
S):
1. ∀E ∈ 𝒮, 0 ≤ P(E) ≤ 1;
2. P(S) = 1;
3. Countably additive: for disjoint subsets E₁, E₂, . . . ∈ 𝒮,
P(⋃ᵢ Eᵢ) = Σᵢ P(Eᵢ).
2.2 Simple Probability Results
Exercises:
From 1-3 it is easy to derive the following:
1. P(Ē) = 1 − P(E);
2. P(∅) = 0;
3. For any events E and F,
P(E ∪ F) = P(E) + P(F) − P(E ∩ F).
3
2.3 Independence
Two events E and F are said to be independent if and only if P(E ∩ F) = P(E)P(F). This is
sometimes written E ⊥ F.
More generally, a set of events {E₁, E₂, . . .} are said to be independent if for any finite subset
{E₁, E₂, . . . , Eₙ},
P(⋂ᵢ₌₁ⁿ Eᵢ) = ∏ᵢ₌₁ⁿ P(Eᵢ).
If events E and F are independent, then Ē and F are also independent.
Proof:
Since F = (E ∩ F) ∪ (Ē ∩ F) is a disjoint union, P(F) = P(E ∩ F) + P(Ē ∩ F) by Axiom 3. So
P(Ē ∩ F) = P(F) − P(E ∩ F) = P(F) − P(E)P(F) = (1 − P(E))P(F) = P(Ē)P(F).
2.4 Interpretations of Probability
Classical
If S is finite and the elementary events are considered equally likely, then the probability
of an event E is the proportion of all outcomes in S which lie inside E,
P(E) = |E| / |S|.
Ex.
Rolling a die: Elementary events are {⚀}, {⚁}, . . . , {⚅}.
P({⚀}) = P({⚁}) = . . . = P({⚅}) = 1/6.
P(Odd number) = P({⚀, ⚂, ⚄}) = 3/6 = 1/2.
Randomly drawn playing card: 52 elementary events
{2♣}, {3♣}, . . . , {A♣}, {2♦}, {3♦}, . . . , . . . , {K♠}, {A♠}.
P(♣) = P(♦) = P(♥) = P(♠) = 1/4.
The joint event {Suit is red and value is 3} contains two of 52 elementary events, so
P({red 3}) = 2/52 = 1/26. Since suit and face value should be independent, check that
P({red 3}) = P({♥, ♦}) × P({any 3}).
The equally likely (uniform) idea can be extended to infinite spaces, by apportioning
probability to sets not by their cardinality but by other standard measures, like volume or mass.
Ex.
If a meteorite were to strike Earth, the probability that it will strike land rather than sea
would be given by
(Total area of land) / (Total area of Earth).
4
Frequentist
Observation shows that if one takes repeated observations in identical random situa-
tions, in which event E may or may not occur, then the proportion of times in which E occurs
tends to some limiting value - called the probability of E.
Ex.
Proportion of heads in tosses of a coin: H, H, T, H, T, T, H, T, T, . . . → 1/2.
Subjective
Probability is a degree of belief held by an individual.
For example, De Finetti (1937/1964) suggested the following: Suppose a random experi-
ment is to be performed, where an event E S may or may not happen. Now suppose an
individual is entered into a game regarding this experiment where he has two choices, each
leading to monetary (or utility) consequences:
1. Gamble: If E occurs, he wins 1; if Ē occurs, he wins 0;
2. Stick: Regardless of the outcome of the experiment, he receives P(E) for some real num-
ber P(E).
The critical value of P(E) for which the individual is indifferent between options 1 and 2 is
defined to be the individual's probability for the event E occurring.
This procedure can be repeated for all possible events E in S.
Suppose after this process of elicitation of the individual's preferences under the different
events, we can simultaneously arrange an arbitrary number of monetary bets with the indi-
vidual based on the outcome of the experiment.
If it is possible to choose these bets in such a way that the individual is certain to lose money
(this is called a Dutch Book), then the individual's degrees of belief are said to be incoherent.
To be coherent, it is easily seen, for example, that we must have 0 ≤ P(E) ≤ 1 for all events
E, E ⊆ F ⇒ P(E) ≤ P(F), etc.
5
3 Examples
3.1 De Méré's problem
Antoine Gombaud, chevalier de Méré (1607-1684) posed to Pascal the following gambling prob-
lem: which of these two events is more likely?
1. E = {4 rolls of a die yield at least one ⚅}.
2. F = {24 rolls of two dice yield at least one pair of ⚅s}.
De Méré observed that E seemed to lead to a profitable even money bet whereas F did not.
We calculate P(E) and P(F).
1. Each roll of the die is independent from the previous rolls, and so there are 6⁴ equally
likely outcomes. Of these, 5⁴ show no ⚅s.
So the probability of no ⚅ showing is 5⁴/6⁴ ≈ 0.4823.
So P(E), the probability of at least one ⚅ showing, is 1 − 0.4823 = 0.5177.
2. There are 36²⁴ equally likely outcomes here. Of these, 35²⁴ don't show a pair of ⚅s.
So the probability of no pair of ⚅s is 35²⁴/36²⁴ ≈ 0.5086.
So P(F), the probability of at least one pair of ⚅s, is 1 − 0.5086 = 0.4914.
Hence P(E) ≈ 0.5177 > 1/2 > 0.4914 ≈ P(F).
3.2 Joint events
Coin and Die
Consider tossing a coin and rolling a die.
We would consider each of the 12 possible combinations of Head/Tail and die value as
equally likely.
So we can construct a probability table:
       ⚀      ⚁      ⚂      ⚃      ⚄      ⚅
H     1/12   1/12   1/12   1/12   1/12   1/12   1/2
T     1/12   1/12   1/12   1/12   1/12   1/12   1/2
      1/6    1/6    1/6    1/6    1/6    1/6
From this table we can calculate the probability of any event we might be interested in,
simply by adding up the probabilities of all the elementary events it contains.
For example, the event of getting a head on the coin
{H} = {(H, ⚀), (H, ⚁), . . . , (H, ⚅)}
has probability
P({H}) = P({(H, ⚀)}) + P({(H, ⚁)}) + . . . + P({(H, ⚅)})
= 1/12 + 1/12 + . . . + 1/12 = 1/2.
Notice the two experiments satisfy our probability definition of independence, since for
example
P({(H, ⚅)}) = 1/12 = 1/2 × 1/6 = P({H}) × P({⚅}).
Coin and Two Dice
A crooked die called a top has the same faces on opposite sides.
Suppose we have two dice, one normal and one which is a top with opposite faces num-
bered ⚀, ⚂ or ⚄.
Now suppose we first flip the coin. If it comes up heads, we roll the normal die; tails, and
we roll the top.
To calculate the probability table easily, we notice that this is equivalent to the previous
game using one normal die except with the change after tails that a roll of a ⚅ is relabelled as
a ⚀, a ⚃ as a ⚂, and a ⚁ as a ⚄. So we can just merge those probabilities in the tails row.
       ⚀      ⚁      ⚂      ⚃      ⚄      ⚅
H     1/12   1/12   1/12   1/12   1/12   1/12   1/2
T     1/6    0      1/6    0      1/6    0      1/2
      1/4    1/12   1/4    1/12   1/4    1/12
The probabilities of the different outcomes of the dice change according to the outcome of
the coin toss. And note, for example,
P({(H, ⚅)}) = 1/12 ≠ 1/24 = 1/2 × 1/12 = P({H}) × P({⚅}).
So the two experiments are now dependent.
7
4 Conditional Probability
4.1 Definition
For two events E and F in S where P(F) ≠ 0, we define the conditional probability of E occur-
ring given that we know F has occurred as
P(E|F) = P(E ∩ F) / P(F).
Note that if E and F are independent, then
P(E|F) = P(E ∩ F) / P(F) = P(E)P(F) / P(F) = P(E).
4.2 Examples
Example 1 - Rolling a ⚄
We roll a normal die once.
1. What is the probability of E = {the die shows a ⚄}?
2. What is the probability of E = {the die shows a ⚄} given we know
F = {the die shows an odd number}?
Solution:
1. P(E) = (Number of ways a ⚄ can come up) / (Total number of possible outcomes) = 1/6.
2. Now the set of possible outcomes is just F = {⚀, ⚂, ⚄}.
So P(E|F) = (Number of ways a ⚄ can come up) / (Total number of possible outcomes) = 1/3.
Note P(F) = 1/2 and E ∩ F = E, and hence we have P(E|F) = P(E ∩ F) / P(F).
Example 2 - Rolling two dice
Suppose we roll two normal dice, one from each hand.
Then the sample space comprises all of the ordered pairs of dice values
S = {(⚀, ⚀), (⚀, ⚁), . . . , (⚅, ⚅)}.
Let E be the event that the die thrown from the left hand will show a larger value than the
die thrown from the right hand.
P(E) = (# outcomes with left value > right) / (total # outcomes) = 15/36.
Suppose we are now informed that an event F has occurred, where
F = {the value of the left hand die is ⚄}.
How does this change the probability of E occurring?
Well since F has occurred, the only sample space elements which could have possibly oc-
curred are exactly those elements in F = {(⚄, ⚀), (⚄, ⚁), (⚄, ⚂), (⚄, ⚃), (⚄, ⚄), (⚄, ⚅)}.
Similarly the only sample space elements in E that could have occurred now must be in
E ∩ F = {(⚄, ⚀), (⚄, ⚁), (⚄, ⚂), (⚄, ⚃)}.
So our revised probability is
(# outcomes with left value > right) / (total # outcomes (⚄, ·)) = 4/6 = P(E ∩ F)/P(F) = P(E|F).
Discussion of Examples
In both examples, we considered the probability of an event E, and then reconsidered what
this probability would be if we were given the knowledge that F had occurred. What hap-
pened?
Answer: The sample space S was replaced by F, and the event E was replaced by E ∩ F. So
originally, we had
P(E) = P(E|S) = P(E ∩ S) / P(S)
(since E ∩ S = E, and P(S) = 1 by Axiom 2).
So we can think of probability conditioning as a shrinking of the sample space, with events
replaced by their intersections with the reduced space and a consequent rescaling of probabil-
ities.
4.3 Conditional Independence
Earlier we met the concept of independence of events according to a probability measure P.
We can now extend that idea to conditional probabilities since P(·|F) is itself a perfectly good
probability measure obeying the axioms of probability.
For three events E₁, E₂ and F, the event pair E₁ and E₂ are said to be conditionally in-
dependent given F if and only if P(E₁ ∩ E₂|F) = P(E₁|F)P(E₂|F). This is sometimes written
E₁ ⊥ E₂ | F.
4.4 Bayes Theorem and the Partition Rule
Bayes Theorem
For two events E and F in S, we have
P(E ∩ F) = P(F)P(E|F); (1)
but also, since E ∩ F ≡ F ∩ E,
P(E ∩ F) = P(E)P(F|E). (2)
Equating the RHS of (1) and (2), provided P(F) ≠ 0 we can rearrange to obtain
P(E|F) = P(E)P(F|E) / P(F).
Partition Rule
Consider a set of events {F₁, F₂, . . .} which form a partition of S.
Then for any event E ⊆ S,
P(E) = Σᵢ P(E|Fᵢ)P(Fᵢ).
Proof:
E = E ∩ S = E ∩ (⋃ᵢ Fᵢ) = ⋃ᵢ (E ∩ Fᵢ).
So
P(E) = P(⋃ᵢ (E ∩ Fᵢ)),
which, by countable additivity (Axiom 3) and noting that since the {F₁, F₂, . . .} are disjoint so
are {E ∩ F₁, E ∩ F₂, . . .}, implies
P(E) = Σᵢ P(E ∩ Fᵢ). (3)
(3) is known as the law of total probability; and it can be rewritten
P(E) = Σᵢ P(E|Fᵢ)P(Fᵢ).
For any events E and F in S, note that {F, F̄}, say, form a partition of S. So by the law of
total probability we have
P(E) = P(E ∩ F) + P(E ∩ F̄)
= P(E|F)P(F) + P(E|F̄)P(F̄).
Terminology
When considering multiple events, say E and F, we often refer to
probabilities of the form P(E|F) as conditional probabilities;
probabilities of the form P(E ∩ F) as joint probabilities;
probabilities of the form P(E) as marginal probabilities.
10
4.5 More Examples
Example 1 - Defective Chips
Ex.
A box contains 5000 VLSI chips, 1000 from company X and 4000 from Y. 10% of the chips
made by X are defective and 5% of those made by Y are defective. If a randomly chosen chip
is found to be defective, find the probability that it came from X.
Let E = chip was made by X;
let F = chip is defective.
First of all, which probabilities have we been given?
A box contains 5000 VLSI chips, 1000 from company X and 4000 from Y.
⇒ P(E) = 1000/5000 = 0.2,
P(Ē) = 4000/5000 = 0.8.
10% of the chips made by X are defective and 5% of those made by Y are defective.
⇒ P(F|E) = 10% = 0.1,
P(F|Ē) = 5% = 0.05.
We have enough information to construct the probability table
       E      Ē
F     0.02   0.04   0.06
F̄     0.18   0.76   0.94
      0.2    0.8
The law of total probability has enabled us to extract the marginal probabilities P(F) and
P(F̄) as 0.06 and 0.94 respectively.
So by Bayes Theorem we can calculate the conditional probabilities. In particular, we want
P(E|F) = P(E ∩ F) / P(F) = 0.02 / 0.06 = 1/3.
11
Example 2 - Kidney stones
Kidney stones are small (< 2cm diam) or large (> 2cm diam). Treatment can succeed or
fail. The following data were collected from a sample of 700 patients with kidney stones.
            Success (S)   Failure (S̄)
Large (L)       247            96      343
Small (L̄)       315            42      357
Total           562           138      700
For a patient randomly drawn from this sample, what is the probability that the outcome
of treatment was successful, given the kidney stones were large?
Clearly we can get the answer directly from the table by ignoring the small stone patients
P(S|L) = 247/343,
or we can go the long way round:
P(L) = 343/700,
P(S ∩ L) = 247/700,
P(S|L) = P(S ∩ L) / P(L) = (247/700) / (343/700) = 247/343.
Example 3 - Multiple Choice Question
A multiple choice question has c available choices. Let p be the probability that the student
knows the right answer, and 1 − p that he does not. When he doesn't know, he chooses an
answer at random. Given that the answer the student chooses is correct, what is the probability
that the student knew the correct answer?
Let A be the event that the question is answered correctly;
let K be the event that the student knew the correct answer.
Then we require P(K|A).
By Bayes Theorem
P(K|A) = P(A|K)P(K) / P(A)
and we know P(A|K) = 1 and P(K) = p, so it remains to find P(A).
By the partition rule,
P(A) = P(A|K)P(K) + P(A|K̄)P(K̄)
and since P(A|K̄) = 1/c, this gives
P(A) = 1 × p + (1/c)(1 − p).
Hence
P(K|A) = p / [p + (1 − p)/c] = cp / (cp + 1 − p).
Note: the larger c is, the greater the probability that the student knew the answer, given
that they answered correctly.
12
Example 4 - Super Computer Jobs
Measurements at the North Carolina Super Computing Center (NCSC) on a certain day
showed that 15% of the jobs came from Duke, 35% from UNC, and 50% from NC State Uni-
versity. Suppose that the probabilities that each of these jobs is a multitasking job is 0.01, 0.05,
and 0.02 respectively.
1. Find the probability that a job chosen at random is a multitasking job.
2. Find the probability that a randomly chosen job comes from UNC, given that it is a mul-
titasking job.
Solution:
Let Uᵢ = job is from university i, i = 1, 2, 3 for Duke, UNC, NC State respectively;
let M = job uses multitasking.
1.
P(M) = P(M|U₁)P(U₁) + P(M|U₂)P(U₂) + P(M|U₃)P(U₃)
= 0.01 × 0.15 + 0.05 × 0.35 + 0.02 × 0.5 = 0.029.
2.
P(U₂|M) = P(M|U₂)P(U₂) / P(M) = (0.05 × 0.35) / 0.029 = 0.603.
Example 5 - HIV Test
A new HIV test is claimed to correctly identify 95% of people who are really HIV positive
and 98% of people who are really HIV negative. Is this acceptable?
If only 1 in a 1000 of the population are HIV positive, what is the probability that someone
who tests positive actually has HIV?
Solution:
Let H = has the HIV virus;
let T = test is positive.
We have been given P(T|H) = 0.95, P(T̄|H̄) = 0.98 and P(H) = 0.001.
We wish to find P(H|T).
P(H|T) = P(T|H)P(H) / [P(T|H)P(H) + P(T|H̄)P(H̄)]
= (0.95 × 0.001) / (0.95 × 0.001 + 0.02 × 0.999)
= 0.045.
That is, less than 5% of those who test positive really have HIV.
13
Example 5 - continued
If the HIV test shows a positive result, the individual might wish to retake the test. Suppose
that the results of a person retaking the HIV test are conditionally independent given HIV
status (clearly two results of the test would certainly not be unconditionally independent). If
the test again gives a positive result, what is the probability that the person actually has HIV?
Solution:
Let Tᵢ = ith test is positive.
P(H|T₁ ∩ T₂) = P(T₁ ∩ T₂|H)P(H) / P(T₁ ∩ T₂)
= P(T₁ ∩ T₂|H)P(H) / [P(T₁ ∩ T₂|H)P(H) + P(T₁ ∩ T₂|H̄)P(H̄)]
= P(T₁|H)P(T₂|H)P(H) / [P(T₁|H)P(T₂|H)P(H) + P(T₁|H̄)P(T₂|H̄)P(H̄)]
by conditional independence.
Since P(Tᵢ|H) = 0.95 and P(Tᵢ|H̄) = 0.02,
P(H|T₁ ∩ T₂) = (0.95 × 0.95 × 0.001) / (0.95 × 0.95 × 0.001 + 0.02 × 0.02 × 0.999) ≈ 0.693.
So almost a 70% chance after taking the test twice and both times showing as positive. For
three times, this goes up to 99%.
14
COMP 245 Statistics
Exercises 3 - Discrete Random Variables
1. An experiment involves tossing two unbiased coins.
(a) What is the sample space of this experiment?
(b) What is the probability mass function of the random variable X, which takes value 2 if
two heads show, 1 if one head shows, and 0 if no heads show?
(c) What is the probability mass function of the random variable Y, which takes the value
3 if at least one head shows and 1 if no head shows?
2. Suppose that two fair dice are thrown and define a random variable X as the total number
of spots showing. Make a table showing the probability mass function, p(x) of X and plot a
graph of p(x).
3. In tossing a fair coin four times, what is the probability that one will obtain
(a) four heads;
(b) three heads;
(c) at least two heads;
(d) not more than one head?
4. An urn holds 5 white and 3 black marbles.
(a) If two marbles are drawn at random without replacement and X denotes the number of
white marbles
i. find the probability mass function of X, and
ii. plot the cumulative distribution function of X.
(b) Repeat 4a if the marbles are drawn with replacement.
5. The probability that a student will pass a particular course is 0.4. Find the probability that,
out of 5 students
(a) none pass; (b) one passes; (c) at least one passes.
6. (a) If each student in a class of 110 has the same probability, 0.8, of passing an examination,
what is
i. the expected number of passes?
ii. the standard deviation of the number of passes?
(b) If each student in a college of 11000 has the same probability of graduating, what is
i. the expected number of graduates?
ii. the standard deviation of the number of graduates?
7. An insurance salesman sells policies to 5 computer companies. The probability that each of
these companies will make a claim over the next five years is 1/5. Find the probability that,
over the next five years
(a) all companies will claim;
(b) at least three companies will claim;
(c) only two will claim;
(d) at least one will not claim.
8. Compute the mean, sd, and the skewness for the following binomial distributions, and
comment on the results:
(a) Binomial(100, 0.9);
(b) Binomial(100, 0.7);
(c) Binomial(100, 0.5);
(d) Binomial(1000, 0.9);
(e) Binomial(1000, 0.7);
(f) Binomial(1000, 0.5).
9. In a class of 20 students taking an examination,
2 have probability 0.4 of passing;
4 have probability 0.6 of passing;
5 have probability 0.7 of passing;
7 have probability 0.8 of passing;
2 have probability 0.9 of passing.
(a) What is the expected number of passes?
(b) What is the standard deviation of the number of passes?
10. (Geometric distribution.)
A computer class has a limited number of terminals available for use. A student notices that,
on average, there is a 0.4 chance that there will be a free terminal each time he tries to use a
machine.
(a) What is the average number of times he will have to try to use a machine until he is
successful?
(b) What is his chance of being successful the first time he tries?
(c) What is his probability of being successful the first time on each of three different
occasions?
11. (a) What is the mean and variance of a sum of n independent Bernoulli random variables,
each with parameter p?
(b) What if they have different parameters, (p_1, p_2, . . . , p_n)?
(c) What can you say if I now tell you that they are not independent?
COMP 245 Statistics
Solutions 3 - Discrete Random Variables
1. (a) S = {HH, HT, TH, TT}.
(b) p_X(0) = 1/4, p_X(1) = 1/2, p_X(2) = 1/4.
(c) p_Y(1) = 1/4, p_Y(3) = 3/4.
2.
x     2     3     4     5     6     7     8     9     10    11    12
p(x)  1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
[Plot of p(x) against x omitted: a symmetric triangular shape peaking at p(7) = 1/6.]
3. Let X be a random variable giving the number of heads obtained. Then X ~ Binomial(4, 1/2).
(a) p(4) = (1/2)^4 = 1/16.
(b) p(3) = \binom{4}{1} (1/2)^3 (1/2)^1 = 4/16 = 1/4.
(c) p(2) + p(3) + p(4) = \binom{4}{2} (1/2)^2 (1/2)^2 + p(3) + p(4) = 11/16.
(d) By symmetry (switch heads with tails), p(0) + p(1) = p(4) + p(3) = 5/16.
4. (a) i. p(0) = 3/8 × 2/7 = 3/28; p(1) = 3/8 × 5/7 + 5/8 × 3/7 = 15/28; p(2) = 5/8 × 4/7 = 5/14.
ii. [Plot of the cdf F_X(x) omitted: a step function with jumps at x = 0, 1, 2 reaching heights 3/28, 9/14 and 1.]
(b) i. p(0) = 3/8 × 3/8 = 9/64; p(1) = 3/8 × 5/8 + 5/8 × 3/8 = 15/32; p(2) = 5/8 × 5/8 = 25/64.
ii. [Plot of the cdf F_X(x) omitted: a step function with jumps at x = 0, 1, 2 reaching heights 9/64, 39/64 and 1.]
5. (a) \binom{5}{0} 0.4^0 0.6^5 = 0.078.
(b) \binom{5}{1} 0.4^1 0.6^4 = 0.26.
(c) 1 − P(none pass) = 0.922.
6. (a) The distribution is Binomial(110, 0.8), so
i. Mean = 110 × 0.8 = 88.
ii. Standard deviation = √(110 × 0.8 × (1 − 0.8)) = 4.195.
(b) The distribution is Binomial(11000, 0.8), so
i. Mean = 11000 × 0.8 = 8800.
ii. Standard deviation = √(11000 × 0.8 × (1 − 0.8)) = 41.95.
7. The distribution is Binomial(5, 1/5), so
(a) 0.2^5.
(b) \binom{5}{3} 0.2^3 0.8^2 + \binom{5}{4} 0.2^4 0.8^1 + \binom{5}{5} 0.2^5 0.8^0.
(c) \binom{5}{2} 0.2^2 0.8^3.
(d) 1 − 0.2^5.
8. For Binomial(n, p): mean = np, standard deviation = √(np(1 − p)), skewness = (1 − 2p)/√(np(1 − p)).
(a) 90, 3, −0.2667.
(b) 70, 4.58, −0.0873.
(c) 50, 5, 0.
(d) 900, 9.49, −0.0843.
(e) 700, 14.49, −0.0276.
(f) 500, 15.81, 0.
The absolute value of the skewness decreases as p gets closer to 1/2 and as the sample size n increases.
9. Let X be the total number of passes. This is a random variable formed as the sum of five
Binomial random variables, one Binomial(2,0.4), one Binomial(4,0.6), etc.
(a) E(X) = 2 × 0.4 + 4 × 0.6 + 5 × 0.7 + 7 × 0.8 + 2 × 0.9 = 14.1.
(b) Var(X) = 2 × 0.4 × 0.6 + 4 × 0.6 × 0.4 + 5 × 0.7 × 0.3 + 7 × 0.8 × 0.2 + 2 × 0.9 × 0.1 = 3.79,
so sd(X) = 1.95.
10. (a) 1/0.4 = 2.5.
(b) 0.4.
(c) 0.4^3 = 0.064.
11. (a) np and np(1 − p).
(b) Σ_{i=1}^{n} p_i and Σ_{i=1}^{n} p_i(1 − p_i).
(c) The mean is unaffected, but further information would be needed to calculate the variance.
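A short simulation sketch for question 11(b) (the parameter values below are illustrative, not taken from the exercises): the sample mean and variance of a sum of independent Bernoulli(p_i) variables should approach Σ p_i and Σ p_i(1 − p_i).

```python
# Sketch: simulate a sum of independent Bernoulli(p_i) variables with
# different (assumed, illustrative) parameters and compare with the formulae.
import random

p = [0.1, 0.4, 0.5, 0.7, 0.9]                    # illustrative parameters
trials = 200_000
sums = [sum(random.random() < p_i for p_i in p) for _ in range(trials)]

mean = sum(sums) / trials
var = sum((s - mean) ** 2 for s in sums) / trials
print(mean, sum(p))                              # both close to 2.6
print(var, sum(p_i * (1 - p_i) for p_i in p))    # both close to 0.88
```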
Discrete Random Variables
COMP 245 STATISTICS
Dr N A Heard
Room 545 Huxley Building
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 1 / 62
Definition
Suppose, as before, that for a random experiment we have identified a
sample space S and a probability measure P(E) defined on
(measurable) subsets E ⊆ S.
A random variable is a mapping from the sample space to the real
numbers. So if X is a random variable, X : S → R. Each element of the
sample space s ∈ S is assigned by X a (not necessarily unique)
numerical value X(s).
If we denote the unknown outcome of the random experiment as s*,
then the corresponding unknown outcome of the random variable
X(s*) will be generically referred to as X.
The probability measure P already defined on S induces a probability
distribution on the random variable X in R:
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 2 / 62
For each x ∈ R, let S_x ⊆ S be the set containing just those elements of S
which are mapped by X to numbers no greater than x. That is, let S_x be
the inverse image of (−∞, x] under the function X. Then, noting the
equivalence
X(s*) ≤ x ⟺ s* ∈ S_x,
we see that
P_X(X ≤ x) ≡ P(S_x).
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 3 / 62
The image of S under X is called the range of the random variable:
Defn.
range(X) ≡ X(S) = {x ∈ R | ∃ s ∈ S s.t. X(s) = x}
So as S contains all the possible outcomes of the experiment, range(X)
contains all the possible outcomes for the random variable X.
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 4 / 62
Example
Let our random experiment be tossing a fair coin, with sample space
{H, T} and probability measure P({H}) = P({T}) = 1/2.
We can define a random variable X : {H, T} → R taking values, say,
X(T) = 0, X(H) = 1.
In this case, what does S_x look like for each x ∈ R?
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 5 / 62
S_x = ∅ if x < 0;  {T} if 0 ≤ x < 1;  {H, T} if x ≥ 1.
This defines a range of probabilities P_X on the continuum R:
P_X(X ≤ x) = P(S_x) = P(∅) = 0 if x < 0;  P({T}) = 1/2 if 0 ≤ x < 1;  P({H, T}) = 1 if x ≥ 1.
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 6 / 62
[Plot omitted: the step function P_X(X ≤ x), equal to 0 for x < 0, 1/2 for 0 ≤ x < 1 and 1 for x ≥ 1.]
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 7 / 62
cdf
The cumulative distribution function (cdf) of a random variable X,
written F_X(x) (or just F(x)), is the probability that X takes value less
than or equal to x.
Defn.
F_X(x) = P_X(X ≤ x)
For any random variable X, F_X is right-continuous, meaning that if a
decreasing sequence of real numbers x_1, x_2, . . . → x, then
F_X(x_1), F_X(x_2), . . . → F_X(x).
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 8 / 62
cdf Properties
For a given function F_X(x), to check this is a valid cdf, we need to
make sure the following conditions hold.
1. 0 ≤ F_X(x) ≤ 1, ∀x ∈ R;
2. Monotonicity: ∀x_1, x_2 ∈ R, x_1 < x_2 ⟹ F_X(x_1) ≤ F_X(x_2);
3. F_X(−∞) = 0, F_X(∞) = 1.
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 9 / 62
For finite intervals (a, b] ⊂ R, it is easy to check that
P_X(a < X ≤ b) = F_X(b) − F_X(a).
Unless there is any ambiguity, we generally suppress the subscript
of P_X(·) in our notation and just write P(·) for the probability
measure for the random variable.
That is, we forget about the underlying sample space and just think
about the random variable and its probabilities.
Often, it will be most convenient to work this way and consider the
random variable directly from the very start, with the range of X
being our sample space.
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 10 / 62
Simple Random Variables
We say a random variable is simple if it can take only a finite number
of possible values. That is,
Defn.
X is simple ⟺ range(X) is finite.
Suppose X is simple, and can take one of m values X = {x_1, x_2, . . . , x_m}
ordered so that x_1 < x_2 < . . . < x_m. Each sample space element s ∈ S is
mapped by X to one of these m values.
Then in this case, we can partition the sample space S into m disjoint
subsets {E_1, E_2, . . . , E_m} so that s ∈ E_i ⟺ X(s) = x_i, i = 1, 2, . . . , m.
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 11 / 62
We can then write down the probability of the random variable X
taking the particular value x_i as
P_X(X = x_i) ≡ P(E_i).
It is also easy to check that
P_X(X = x_i) = F_X(x_i) − F_X(x_{i−1}),
where we can take x_0 = −∞.
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 12 / 62
Examples of Simple Random Variables
Consider once again the experiment of rolling a single die.
Then S = {⚀, ⚁, ⚂, ⚃, ⚄, ⚅} and for any s ∈ S (any one face) we have
P({s}) = 1/6.
An obvious random variable we could define on S would be
X : S → R, s.t.
X(⚀) = 1, X(⚁) = 2, . . . , X(⚅) = 6.
Then e.g. P_X(1 < X ≤ 5) = P({⚁, ⚂, ⚃, ⚄}) = 4/6 = 2/3
and P_X(X ∈ {2, 4, 6}) = P({⚁, ⚃, ⚅}) = 1/2.
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 13 / 62
Alternatively, we could define a random variable Y : S → R, s.t.
Y(⚀) = Y(⚂) = Y(⚄) = 0,
Y(⚁) = Y(⚃) = Y(⚅) = 1.
Then clearly P_Y(Y = 0) = P({⚀, ⚂, ⚄}) = 1/2 and
P_Y(Y = 1) = P({⚁, ⚃, ⚅}) = 1/2.
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 14 / 62
Comments
Note that under either random variable X or Y, we still got the
same probability of getting an even number of spots on the die.
Indeed, this would be the case for any random variable we may
care to define.
A random variable is simply a numeric relabelling of our
underlying sample space, and all probabilities are derived from
the associated underlying probability measure.
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 15 / 62
Discrete Random Variables
A simple random variable is a special case of a discrete random variable.
We say a random variable is discrete if it can take only a countable
number of possible values. That is,
Defn.
X is discrete ⟺ range(X) is countable.
Suppose X is discrete, and can take one of the countable set of values
X = {x_1, x_2, . . .} ordered so that x_1 < x_2 < . . .. Each sample space
element s ∈ S is mapped by X to one of these values.
Then in this case, we can partition the sample space S into a countable
collection of disjoint subsets {E_1, E_2, . . .} s.t.
s ∈ E_i ⟺ X(s) = x_i, i = 1, 2, . . ..
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 16 / 62
As with simple random variables, we can write down the probability of
the discrete random variable X taking the particular value x_i as
P_X(X = x_i) ≡ P(E_i).
We again have
P_X(X = x_i) = F_X(x_i) − F_X(x_{i−1}).
For a discrete random variable X, F_X is a monotonic increasing step
function with jumps only at points in X.
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 17 / 62
Example: Poisson(5) cdf
[Plot omitted: the step-function cdf F(x) of a Poisson(5) random variable, shown for x from 0 to 20.]
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 18 / 62
Probability Mass Function
For a discrete random variable X and x ∈ R, we define the probability
mass function (pmf), p_X(x) (or just p(x)), as
Defn.
p_X(x) = P_X(X = x)
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 19 / 62
pmf Properties
To check we have a valid pmf, we need to make sure the following
conditions hold.
If X can take values X = {x_1, x_2, . . .} then we must have:
1. 0 ≤ p_X(x) ≤ 1, ∀x ∈ R;
2. Σ_{x∈X} p_X(x) = 1.
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 20 / 62
Example: Poisson(5) pmf
[Plot omitted: the pmf p(x) of a Poisson(5) random variable, shown for x from 0 to 20 and peaking near x = 5.]
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 21 / 62
Knowing either the pmf or cdf of a discrete random variable
characterises its probability distribution.
That is, from the pmf we can derive the cdf, and vice versa:
\[
p(x_i) = F(x_i) - F(x_{i-1}), \qquad F(x_i) = \sum_{j=1}^{i} p(x_j).
\]
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 22 / 62
Links with Statistics
We can now see the first links between the numerical summaries and
graphical displays we saw in earlier lectures and probability theory:
We can often think of a set of data (x_1, x_2, . . . , x_n) as n realisations of a
random variable X defined on an underlying population for the data.
Recall the normalised frequency counts we considered for a set of
data, known as the empirical probability mass function. This can
be seen as an empirical estimate for the pmf of their underlying
population.
Also recall the empirical cumulative distribution function. This
too is an empirical estimate, but for the cdf of the underlying
population.
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 23 / 62
E(X)
For a discrete random variable X we define the expectation of X,
Defn.
\[
E_X(X) = \sum_{x} x\, p_X(x).
\]
E_X(X) (often just written E(X) or even μ_X) is also referred to as the
mean of X.
It gives a weighted average of the possible values of the random
variable X, with the weights given by the probabilities of each
outcome.
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 24 / 62
Examples
1. If X is a r.v. taking the integer value scored on a single roll of a fair
die, then
\[
E(X) = \sum_{x=1}^{6} x\, p(x) = 1\cdot\tfrac{1}{6} + 2\cdot\tfrac{1}{6} + 3\cdot\tfrac{1}{6} + 4\cdot\tfrac{1}{6} + 5\cdot\tfrac{1}{6} + 6\cdot\tfrac{1}{6} = \tfrac{21}{6} = 3.5.
\]
2. If now X is the score from a student answering a single multiple
choice question with four options, with 3 marks awarded for a
correct answer, −1 for a wrong answer and 0 for no answer, what is
the expected value if they answer at random?
E(X) = 3 · P(Correct) + (−1) · P(Incorrect) = 3 × 1/4 − 1 × 3/4 = 0.
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 25 / 62
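A minimal Python sketch (not part of the slides) of the weighted-average definition of expectation, applied to the two examples above:

```python
# Sketch: E(X) as a probability-weighted average over the pmf.
def expectation(pmf):
    """Expectation of a discrete r.v. given its pmf as a dict {value: probability}."""
    return sum(x * p for x, p in pmf.items())

die = {x: 1 / 6 for x in range(1, 7)}        # fair die
mcq = {3: 1 / 4, -1: 3 / 4}                  # guessing one of four options
print(expectation(die))                      # 3.5
print(expectation(mcq))                      # 0.0
```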
E{g(X)}
More generally, for a function of interest g : R → R of the random
variable X, first notice that the composition g(X),
g(X)(s) = (g ∘ X)(s),
is also a random variable. It follows that
Defn.
\[
E_X\{g(X)\} = \sum_{x} g(x)\, p_X(x). \qquad (1)
\]
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 26 / 62
Linearity of Expectation
Consider the linear function g(X) = aX + b for constants a, b ∈ R. We
can see from (1) that
\[
E_X(aX + b) = \sum_{x} (ax + b)\, p_X(x) = a \sum_{x} x\, p_X(x) + b \sum_{x} p_X(x),
\]
and since Σ_x x p_X(x) = E(X) and Σ_x p_X(x) = 1 we have
E(aX + b) = aE(X) + b,  ∀a, b ∈ R.
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 27 / 62
It is equally easy to check that for g, h : R → R, we have
E{g(X) + h(X)} = E{g(X)} + E{h(X)}.
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 28 / 62
Var(X)
Consider another special case of g(X), namely g(X) = {X − E(X)}².
The expectation of this function wrt P_X gives a measure of dispersion
or variability of the random variable X around its mean, called the
variance and denoted Var_X(X) (or sometimes σ²_X):
Defn.
Var_X(X) = E_X[{X − E_X(X)}²].
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 29 / 62
We can expand the expression {X − E(X)}² and exploit the linearity of
expectation to get an alternative formula for the variance.
{X − E(X)}² = X² − 2E(X)X + {E(X)}²
⟹ Var(X) = E[X² − {2E(X)}X + {E(X)}²] = E(X²) − 2E(X)E(X) + {E(X)}²,
and hence
Var(X) = E(X²) − {E(X)}².
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 30 / 62
Variance of a Linear Function of a Random Variable
We saw earlier that for constants a, b ∈ R the linear combination aX + b
had expectation aE(X) + b. What about the variance?
It is easy to show that the corresponding result is
Var(aX + b) = a²Var(X),  ∀a, b ∈ R.
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 31 / 62
sd(X)
The standard deviation of a random variable X, written sd_X(X) (or
sometimes σ_X), is the square root of the variance.
Defn.
sd_X(X) = √Var_X(X).
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 32 / 62
Skewness
The skewness (γ_1) of a discrete random variable X is given by
Defn.
\[
\gamma_1 = \frac{E_X[\{X - E_X(X)\}^3]}{\mathrm{sd}_X(X)^3}.
\]
That is, if μ = E(X) and σ = sd(X),
\[
\gamma_1 = \frac{E[(X - \mu)^3]}{\sigma^3}.
\]
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 33 / 62
Examples
1. If X is a r.v. taking the integer value scored with a single roll of a
fair die, then
\[
\mathrm{Var}(X) = \sum_{x=1}^{6} x^2 p(x) - 3.5^2 = 1^2\cdot\tfrac{1}{6} + 2^2\cdot\tfrac{1}{6} + \ldots + 6^2\cdot\tfrac{1}{6} - 3.5^2 = \tfrac{35}{12} \approx 2.92.
\]
2. If now X is the score from a student answering a single multiple
choice question with four options, with 3 marks awarded for a
correct answer, −1 for a wrong answer and 0 for no answer, what is
the standard deviation if they answer at random?
E(X²) = 3² · P(Correct) + (−1)² · P(Incorrect) = 9 × 1/4 + 1 × 3/4 = 3
⟹ sd(X) = √(3 − 0²) = √3.
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 34 / 62
Links with Statistics
We have met three important quantities for a random variable, defined
through expectation - the mean μ, the variance σ² and the standard
deviation σ.
Again we can see a duality with the corresponding numerical
summaries for data which we met - the sample mean x̄, the sample
variance s² and the sample standard deviation s.
The duality is this: if we were to consider the data sample as the
population and draw a random member from that sample as a random
variable, this r.v. would have cdf F_n(x), the empirical cdf. The mean of
the r.v. μ = x̄, variance σ² = s² and standard deviation σ = s.
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 35 / 62
Expectation of Sums of Random Variables
Let X_1, X_2, . . . , X_n be n random variables, perhaps with different
distributions and not necessarily independent.
Let S_n = Σ_{i=1}^{n} X_i be the sum of those variables, and S_n/n be their average.
Then the mean of S_n is given by
\[
E(S_n) = \sum_{i=1}^{n} E(X_i), \qquad E\!\left(\frac{S_n}{n}\right) = \frac{\sum_{i=1}^{n} E(X_i)}{n}.
\]
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 36 / 62
Variance of Sums of Random Variables
However, for the variance of S_n, only if X_1, X_2, . . . , X_n are independent
do we have
\[
\mathrm{Var}(S_n) = \sum_{i=1}^{n} \mathrm{Var}(X_i), \qquad \mathrm{Var}\!\left(\frac{S_n}{n}\right) = \frac{\sum_{i=1}^{n} \mathrm{Var}(X_i)}{n^2}.
\]
So if X_1, X_2, . . . , X_n are independent and identically distributed with
E(X_i) = μ_X and Var(X_i) = σ²_X, we get
\[
E\!\left(\frac{S_n}{n}\right) = \mu_X, \qquad \mathrm{Var}\!\left(\frac{S_n}{n}\right) = \frac{\sigma_X^2}{n}.
\]
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 37 / 62
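A small simulation sketch (not from the slides) illustrating the last result: for i.i.d. fair-die rolls, the variance of the average of n = 10 rolls should be close to σ²/n = (35/12)/10.

```python
# Sketch: Var(S_n / n) = sigma^2 / n for i.i.d. variables, checked by simulation.
import random

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

n, reps = 10, 50_000
averages = [sum(random.randint(1, 6) for _ in range(n)) / n for _ in range(reps)]
print(variance(averages))        # close to (35/12)/10, i.e. about 0.2917
```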
Bernoulli(p)
Consider an experiment with only two possible outcomes, encoded as
a random variable X taking value 1, with probability p, or 0, with
probability (1 − p), accordingly.
(Ex.: Tossing a coin, X = 1 for a head, X = 0 for tails, p = 1/2.)
Then we say X ~ Bernoulli(p) and note the pmf to be
p(x) = p^x (1 − p)^{1−x},  x = 0, 1.
Using the formulae for mean and variance, it follows that
μ = p,  σ² = p(1 − p).
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 38 / 62
Example: Bernoulli(p) pmf
[Plot omitted.]
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 39 / 62
Binomial(n, p)
Consider n identical, independent Bernoulli(p) trials X_1, . . . , X_n.
Let X = Σ_{i=1}^{n} X_i be the total number of 1s observed in the n trials.
(Ex.: Tossing a coin n times, X is the number of heads obtained, p = 1/2.)
Then X is a random variable taking values in {0, 1, 2, . . . , n}, and we
say X ~ Binomial(n, p).
From the Binomial Theorem we find the pmf to be
\[
p(x) = \binom{n}{x} p^x (1-p)^{n-x}, \qquad x = 0, 1, 2, \ldots, n.
\]
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 40 / 62
To calculate the Binomial pmf we recall that
\[
\binom{n}{x} = \frac{n!}{x!(n-x)!} \quad \text{and} \quad x! = \prod_{i=1}^{x} i. \quad \text{(Note } 0! = 1.)
\]
It can be shown, either directly from the pmf or from the results for
sums of random variables, that the mean and variance are
μ = np,  σ² = np(1 − p).
Similarly, the skewness is given by
\[
\gamma_1 = \frac{1 - 2p}{\sqrt{np(1-p)}}.
\]
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 41 / 62
Example: Binomial(20, p) pmf
[Plot omitted.] Go to Poi(5) pmf
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 42 / 62
Example
Suppose that 10 users are authorised to use a particular computer
system, and that the system collapses if 7 or more users attempt to log
on simultaneously. Suppose that each user has the same probability
p = 0.2 of wishing to log on in each hour.
What is the probability that the system will crash in a given hour?
Solution: The probability that exactly x users will want to log on in any
hour is given by Binomial(n, p) = Binomial(10, 0.2).
Hence the probability of 7 or more users wishing to log on in any hour
is
\[
p(7) + p(8) + p(9) + p(10) = \binom{10}{7} 0.2^7\, 0.8^3 + \ldots + \binom{10}{10} 0.2^{10}\, 0.8^0 = 0.00086.
\]
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 43 / 62
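A quick check of this calculation (not part of the slides), summing the Binomial(10, 0.2) pmf over x = 7, . . . , 10:

```python
# Sketch: P(X >= 7) for X ~ Binomial(10, 0.2), the hourly crash probability.
from math import comb

n, p = 10, 0.2
crash = sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(7, n + 1))
print(round(crash, 5))           # 0.00086
```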
Two more examples
A manufacturing plant produces chips with a defect rate of 10%.
The quality control procedure consists of checking samples of size
50. Then the distribution of the number of defectives is expected
to be Binomial(50, 0.1).
When transmitting binary digits through a communication
channel, the number of digits received incorrectly out of n
transmitted digits can be modelled by a Binomial(n, p), where p is
the probability that a digit is transmitted incorrectly.
Note: Recall the independence condition necessary for these models to
be reasonable.
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 44 / 62
Geometric(p)
Consider a potentially infinite sequence of independent Bernoulli(p)
random variables X_1, X_2, . . ..
Suppose we define a quantity X by
X = min{i | X_i = 1}
to be the index of the first Bernoulli trial to result in a 1.
(Ex.: Tossing a coin, X is the number of tosses until the first head is
obtained, p = 1/2.)
Then X is a random variable taking values in Z⁺ = {1, 2, . . .}, and we
say X ~ Geometric(p).
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 45 / 62
Clearly the pmf is given by
p(x) = p(1 − p)^{x−1},  x = 1, 2, . . ..
The mean and variance are
\[
\mu = \frac{1}{p}, \qquad \sigma^2 = \frac{1-p}{p^2}.
\]
The skewness is given by
\[
\gamma_1 = \frac{2 - p}{\sqrt{1-p}},
\]
and so is always positive.
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 46 / 62
Example: Geometric(p) pmf
[Plot omitted.]
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 47 / 62
Alternative Formulation
If X ~ Geometric(p), let us consider Y = X − 1.
Then Y is a random variable taking values in N = {0, 1, 2, . . .}, and
corresponds to the number of independent Bernoulli(p) trials before we
obtain our first 1. (Some texts refer to this as the Geometric
distribution.)
Note we have pmf
p_Y(y) = p(1 − p)^y,  y = 0, 1, 2, . . .,
and the mean becomes
μ_Y = (1 − p)/p,
while the variance and skewness are unaffected by the shift.
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 48 / 62
Example
Suppose people have problems logging onto a particular website once
every 5 attempts, on average.
1. Assuming the attempts are independent, what is the probability
that an individual will not succeed until the 4th attempt?
p = 4/5 = 0.8.  p(4) = (1 − p)^3 p = 0.2^3 × 0.8 = 0.0064.
2. On average, how many trials must one make until succeeding?
Mean = 1/p = 5/4 = 1.25.
3. What's the probability that the first successful attempt is the 7th or
later?
p(7) + p(8) + p(9) + . . . = p(1 − p)^6 / (1 − (1 − p)) = (1 − p)^6 = 0.2^6.
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 49 / 62
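The three answers above can be checked with a few lines of Python (a sketch, not part of the slides), using the Geometric(p) formulae with p = 0.8:

```python
# Sketch: the website-login example, success probability p = 0.8 per attempt.
p = 0.8
print((1 - p) ** 3 * p)        # first success on attempt 4: 0.0064
print(1 / p)                   # expected number of attempts: 1.25
print((1 - p) ** 6)            # first success on attempt 7 or later: 0.2^6
```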
Example (contd. from Binomial)
Again suppose that 10 users are authorised to use a particular
computer system, and that the system collapses if 7 or more users
attempt to log on simultaneously. Suppose that each user has the same
probability p = 0.2 of wishing to log on in each hour.
Using the Binomial distribution we found the probability that the
system will crash in any given hour to be 0.00086.
Using the Geometric distribution formulae, we are able to answer
questions such as: on average, after how many hours will the system
crash?
Mean = 1/p = 1/0.00086 ≈ 1163 hours.
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 50 / 62
Example: Mad(?) Dictators and Birth Control
A dictator, keen to maximise the ratio of males to females in his
country (so he could build up his all-male army), ordered that each
couple should keep having children until a boy was born and then
stop.
Calculate the expected number of boys that a couple will have, and
the expected number of girls, given that P(boy) = 1/2.
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 51 / 62
Assume for simplicity that each couple can have arbitrarily many
children (although this is not necessary to get the following results).
Then since each couple stops when 1 boy is born, the expected number
of boys per couple is 1.
On the other hand, if Y is the number of girls given birth to by a
couple, Y clearly follows the alternative formulation for the
Geometric(1/2) distribution.
So the expected number of girls for a couple is (1 − 1/2)/(1/2) = 1.
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 52 / 62
Poi(λ)
Let X be a random variable on N = {0, 1, 2, . . .} with pmf
\[
p(x) = \frac{e^{-\lambda}\lambda^x}{x!}, \qquad x = 0, 1, 2, \ldots,
\]
for some λ > 0.
Then X is said to follow a Poisson distribution with rate parameter λ
and we write X ~ Poi(λ).
Poisson random variables are concerned with the number of random
events occurring per unit of time or space, when there is a constant
underlying probability rate of events occurring across this unit.
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 53 / 62
Examples
the number of minor car crashes per day in the U.K.;
the number of mistakes on each of my slides;
the number of potholes in each mile of road;
the number of jobs which arrive at a database server per hour;
the number of particles emitted by a radioactive substance in a
given time.
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 54 / 62
An interesting property of the Poisson distribution is that it has equal
mean and variance, namely
μ = λ,  σ² = λ.
The skewness is given by
\[
\gamma_1 = \frac{1}{\sqrt{\lambda}},
\]
so is always positive but decreasing as λ → ∞.
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 55 / 62
Example: Poi(5) pmf (again)
Go to Binomial(20, p) pmf
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 56 / 62
Poisson Approximation to the Binomial
Notice the similarity between the pmf plots we've seen for
Binomial(20, p) and Poi(5).
It can be shown that for Binomial(n, p), when p is small and n is large,
this distribution can be well approximated by the Poisson distribution
with rate parameter np, Poi(np).
(Although p in the example above is not small, we would typically prefer
p < 0.1 for the approximation to be useful.)
The usefulness of this approximation is in using probability tables:
tabulating a single Poisson(λ) distribution encompasses an infinite
number of possible corresponding Binomial distributions,
Binomial(n, λ/n).
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 57 / 62
Ex.
A manufacturer produces VLSI chips, of which 1% are defective. Find
the probability that in a box of 100 chips none are defective.
We want p(0) from Binomial(100, 0.01). Since n is large and p is small,
we can approximate this distribution by Poi(100 × 0.01) ≡ Poi(1).
Then
\[
p(0) \approx \frac{e^{-1} 1^0}{0!} = 0.3679.
\]
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 58 / 62
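A short sketch (not part of the slides) comparing the exact Binomial(100, 0.01) answer with the Poi(1) approximation used above:

```python
# Sketch: exact Binomial(100, 0.01) versus the Poi(1) approximation at x = 0.
from math import comb, exp

exact = comb(100, 0) * 0.01**0 * 0.99**100   # Binomial(100, 0.01) at x = 0
approx = exp(-1)                             # Poi(1) at x = 0
print(round(exact, 4), round(approx, 4))     # 0.366 versus 0.3679
```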
Fitting a Poisson Distribution to Data
Ex.
The number of particles emitted by a radioactive substance which
reached a Geiger counter was measured for 2608 time intervals, each
of length 7.5 seconds.
The (real) data are given in the table below:
x     0    1    2    3    4    5    6    7    8    9    10
n_x   57   203  383  525  532  408  273  139  45   27   16
Do these data correspond to 2608 independent observations of an
identical Poisson random variable?
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 59 / 62
The total number of particles, Σ_x x n_x, is 10,094, and the total number of
intervals observed, n = Σ_x n_x, is 2608, so that the average number
reaching the counter in an interval is 10094/2608 = 3.870.
Since the mean of Poi(λ) is λ, we can try setting λ = 3.87 and see how
well this fits the data.
For example, considering the case x = 0, for a single experiment
interval the probability of observing 0 particles would be
p(0) = e^{−3.87} × 3.87^0 / 0! = 0.02086. So over n = 2608 repetitions, our
(Binomial) expectation of the number of 0 counts would be
n × p(0) = 54.4.
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 60 / 62
Similarly for x = 1, 2, . . ., we obtain the following table of expected
values from the Poi(3.87) model:
x       0     1      2      3      4      5      6      7      8     9     10
O(n_x)  57    203    383    525    532    408    273    139    45    27    16
E(n_x)  54.4  210.5  407.4  525.5  508.4  393.5  253.8  140.3  67.9  29.2  17.1
(O = Observed, E = Expected).
The two sets of numbers appear sufficiently close to suggest the
Poisson approximation is a good one. Later, when we come to look at
hypothesis testing, we will see how to make such judgements
quantitatively.
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 61 / 62
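A sketch (not part of the slides) reproducing the expected counts in the table from the fitted Poi(3.87) model; note that the final column of the table aggregates counts of 10 or more:

```python
# Sketch: expected counts n * p(x) under Poi(3.87) for the Geiger-counter data.
from math import exp, factorial

lam, n = 3.87, 2608
expected = [n * exp(-lam) * lam**x / factorial(x) for x in range(11)]
print([round(e, 1) for e in expected])
# [54.4, 210.5, 407.4, 525.5, 508.4, 393.5, 253.8, 140.3, 67.9, 29.2, 11.3]
# The last entry differs from the 17.1 in the table because the table's final
# column is "10 or more": n * P(X >= 10) is approximately 17.1.
```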
U({1, 2, . . . , n})
Let X be a random variable on {1, 2, . . . , n} with pmf
p(x) = 1/n,  x = 1, 2, . . . , n.
Then X is said to follow a discrete uniform distribution and we write
X ~ U({1, 2, . . . , n}).
The mean and variance are
\[
\mu = \frac{n+1}{2}, \qquad \sigma^2 = \frac{n^2 - 1}{12},
\]
and the skewness is clearly zero.
Dr N A Heard (Room 545 Huxley Building) Discrete Random Variables 62 / 62
Discrete Random Variables
COMP 245 STATISTICS
Dr N A Heard
Contents
1 Random Variables 2
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Cumulative Distribution Function . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Discrete Random Variables 4
2.1 Simple Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Discrete Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Probability Mass Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3 Mean and Variance 8
3.1 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.2 Variance, Standard Deviation and Skewness . . . . . . . . . . . . . . . . . . . . . 9
3.3 Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.4 Sums of Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4 Discrete Distributions 12
4.1 Bernoulli, Binomial and Geometric Distributions . . . . . . . . . . . . . . . . . . . 12
4.2 Poisson Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.3 Discrete Uniform Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1
1 Random Variables
1.1 Introduction
Denition
Suppose, as before, that for a random experiment we have identied a sample space S and
a probability measure P(E) dened on (measurable) subsets E S.
A random variable is a mapping from the sample space to the real numbers. So if X is a
random variable, X : S R. Each element of the sample space s S is assigned by X a (not
necessarily unique) numerical value X(s).
If we denote the unknown outcome of the randomexperiment as s

, then the corresponding


unknown outcome of the random variable X(s

) will be generically referred to as X.


The probability measure P already dened on S induces a probability distribution on the
random variable X in R:
For each x R, let S
x
S be the set containing just those elements of S which are mapped
by X to numbers no greater than x. That is, let S
x
be the inverse image of (, x] under the
function X. Then noting the equaivalence
X(s

) x s

S
x
.
we see that
P
X
(X x) P(S
x
).
The image of S under X is called the range of the random variable:
range(X) X(S) = {x R|s S s.t. X(s) = x}
So as S contains all the possible outcomes of the experiment, range(X) contains all the
possible outcomes for the random variable X.
Example
Let our random experiment be tossing a fair coin, with sample space {H, T} and probability
measure P({H}) = P({T}) = 1/2.
We can define a random variable X : {H, T} → R taking values, say, X(T) = 0, X(H) = 1.
In this case, what does S_x look like for each x ∈ R?
S_x = ∅ if x < 0;  {T} if 0 ≤ x < 1;  {H, T} if x ≥ 1.
This defines a range of probabilities P_X on the continuum R:
P_X(X ≤ x) = P(S_x) = P(∅) = 0 if x < 0;  P({T}) = 1/2 if 0 ≤ x < 1;  P({H, T}) = 1 if x ≥ 1.
[Plot omitted: the corresponding step-function graph of P_X(X ≤ x).]
1.2 Cumulative Distribution Function
cdf
The cumulative distribution function (cdf) of a random variable X, written F
X
(x) (or just
F(x)) is the probability that X takes value less than or equal to x.
F
X
(x) = P
X
(X x)
For any random variable X, F
X
is right-continuous, meaning if a decreasing sequence of real
numbers x
1
, x
2
, . . . x, then F
X
(x
1
), F
X
(x
2
), . . . F
X
(x).
cdf Properties
For a given function F
X
(x), to check this is a valid cdf, we need to make sure the following
conditions hold.
1. 0 F
X
(x) 1, x R;
2. Monotonicity: x
1
, x
2
R, x
1
< x
2
F
X
(x
1
) F
X
(x
2
);
3. F
X
() = 0, F
X
() = 1.
For nite intervals (a, b] R, it is easy to check that
P
X
(a < X b) = F
X
(b) F
X
(a).
Unless there is any ambiguity, we generally suppress the subscript of P
X
() in our nota-
tion and just write P() for the probability measure for the random variable.
That is, we forget about the underlying sample space and just think about the ran-
dom variable and its probabilities.
Often, it will be most convenient to work this way and consider the randomvariable
directly from the very start, with the range of X being our sample space.
3
2 Discrete Random Variables
2.1 Simple Random Variables
We say a random variable is simple if it can take only a nite number of possible values. That
is,
X is simple range(X) is nite.
Suppose X is simple, and can take one of m values X = {x
1
, x
2
, . . . , x
m
} ordered so that
x
1
< x
2
< . . . < x
m
. Each sample space element s S is mapped by X to one of these m values.
Then in this case, we can partition the sample space S into mdisjoint subsets {E
1
, E
2
, . . . , E
m
}
so that s E
i
X(s) = x
i
, i = 1, 2, . . . , m.
We can then write down the probability of the random variable X taking the particular
value x
i
as
P
X
(X = x
i
) P(E
i
).
It is also easy to check that
P
X
(X = x
i
) = F
X
(x
i
) F
X
(x
i1
),
where we can take x
0
= .
Examples of Simple Random Variables
Consider once again the experiment of rolling a single die.
Then S = { , , , , , } and for any s S (e.g. s = ) we have P({s}) =
1
6
.
An obvious random variable we could dene on S would be X : S R, s.t.
X( ) = 1,
X( ) = 2,
.
.
.
X( ) = 6.
Then e.g. P
X
(1 < X 5) = P({ , , , }) =
4
6
=
2
3
and P
X
(X {2, 4, 6}) = P({ , , }) =
1
2
.
Alternatively, we could dene a random variable Y : S R, s.t.
Y( ) = Y( ) = Y( ) = 0,
Y( ) = Y( ) = Y( ) = 1.
Then clearly P
Y
(Y = 0) = P({ , , }) =
1
2
and P
Y
(Y = 1) = P({ , , }) =
1
2
.
4
Comments
Note that under either randomvariable X or Y, we still got the same probability of getting
an even number of spots on the die.
Indeed, this would be the case for any random variable we may care to dene.
A random variable is simply a numeric relabelling of our underlying sample space, and
all probabilities are derived from the associated underlying probability measure.
2.2 Discrete Random Variables
A simple random variable is a special case of a discrete random variable. We say a random
variable is discrete if it can take only a countable number of possible values. That is,
X is discrete range(X) is countable.
Suppose X is discrete, and can take one of the countable set of values X = {x
1
, x
2
, . . .}
ordered so that x
1
< x
2
< . . .. Each sample space element s S is mapped by X to one of these
values.
Then in this case, we can partition the sample space S into a countable collection of disjoint
subsets {E
1
, E
2
, . . .} s.t. s E
i
X(s) = x
i
, i = 1, 2, . . ..
As with simple randomvariables we can write down the probability of the discrete random
variable X taking the particular value x
i
as
P
X
(X = x
i
) P(E
i
).
We again have
P
X
(X = x
i
) = F
X
(x
i
) F
X
(x
i1
),
For a discrete random variable X, F
X
is a monotonic increasing step function with jumps
only at points in X.
Example: Poisson(5) cdf
[Plot omitted: the step-function cdf F(x) of a Poisson(5) random variable, shown for x from 0 to 20.]
5
2.3 Probability Mass Function
For a discrete random variable X and x R, we dene the probability mass function (pmf),
p
X
(x) (or just p(x)) as
p
X
(x) = P
X
(X = x)
pmf Properties
To check we have a valid pmf, we need to make sure the following conditions hold.
If X can take values X = {x
1
, x
2
, . . .} then we must have:
1. 0 p
X
(x) 1, x R;
2.

xX
p
X
(x) = 1.
Example: Poisson(5) pmf
[Plot omitted: the pmf p(x) of a Poisson(5) random variable, shown for x from 0 to 20.]
Knowing either the pmf or cdf of a discrete random variable characterises its probability
distribution.
That is, from the pmf we can derive the cdf, and vice versa:
\[
p(x_i) = F(x_i) - F(x_{i-1}), \qquad F(x_i) = \sum_{j=1}^{i} p(x_j).
\]
6
Links with Statistics
We can now see the rst links between the numerical summaries and graphical displays
we saw in earlier lectures and probability theory:
We can often think of a set of data (x
1
, x
2
, . . . , x
n
) as n realisations of a random variable X
dened on an underlying population for the data.
Recall the normalised frequency counts we considered for a set of data, known as the
empirical probability mass function. This can be seen as an empirical estimate for the pmf
of their underlying population.
Also recall the empirical cumulative distribution function. This too is an empirical esti-
mate, but for the cdf of the underlying population.
7
3 Mean and Variance
3.1 Expectation
E(X)
For a discrete random variable X we define the expectation of X,
\[
E_X(X) = \sum_{x} x\, p_X(x).
\]
E_X(X) (often just written E(X) or even μ_X) is also referred to as the mean of X.
It gives a weighted average of the possible values of the random variable X, with the
weights given by the probabilities of each outcome.
Examples
1. If X is a r.v. taking the integer value scored on a single roll of a fair die, then
\[
E(X) = \sum_{x=1}^{6} x\, p(x) = 1\cdot\tfrac{1}{6} + 2\cdot\tfrac{1}{6} + 3\cdot\tfrac{1}{6} + 4\cdot\tfrac{1}{6} + 5\cdot\tfrac{1}{6} + 6\cdot\tfrac{1}{6} = \tfrac{21}{6} = 3.5.
\]
2. If now X is the score from a student answering a single multiple choice question with four
options, with 3 marks awarded for a correct answer, −1 for a wrong answer and 0 for no
answer, what is the expected value if they answer at random?
E(X) = 3 · P(Correct) + (−1) · P(Incorrect) = 3 × 1/4 − 1 × 3/4 = 0.
E{g(X)}
More generally, for a function of interest g : R → R of the random variable X, first notice
that the composition g(X),
g(X)(s) = (g ∘ X)(s),
is also a random variable. It follows that
\[
E_X\{g(X)\} = \sum_{x} g(x)\, p_X(x). \qquad (1)
\]
Linearity of Expectation
Consider the linear function g(X) = aX + b for constants a, b ∈ R. We can see from (1) that
\[
E_X(aX + b) = \sum_{x} (ax + b)\, p_X(x) = a \sum_{x} x\, p_X(x) + b \sum_{x} p_X(x),
\]
and since Σ_x x p_X(x) = E(X) and Σ_x p_X(x) = 1 we have
E(aX + b) = aE(X) + b,  ∀a, b ∈ R.
It is equally easy to check that for g, h : R → R, we have
E{g(X) + h(X)} = E{g(X)} + E{h(X)}.
3.2 Variance, Standard Deviation and Skewness
Var(X)
Consider another special case of g(X), namely g(X) = {X − E(X)}².
The expectation of this function wrt P_X gives a measure of dispersion or variability of the
random variable X around its mean, called the variance and denoted Var_X(X) (or sometimes
σ²_X):
Var_X(X) = E_X[{X − E_X(X)}²].
We can expand the expression {X − E(X)}² and exploit the linearity of expectation to get
an alternative formula for the variance:
{X − E(X)}² = X² − 2E(X)X + {E(X)}²
⟹ Var(X) = E[X² − {2E(X)}X + {E(X)}²] = E(X²) − 2E(X)E(X) + {E(X)}²,
and hence
Var(X) = E(X²) − {E(X)}².
Variance of a Linear Function of a Random Variable
We saw earlier that for constants a, b ∈ R the linear combination aX + b had expectation
aE(X) + b. What about the variance?
It is easy to show that the corresponding result is
Var(aX + b) = a²Var(X),  ∀a, b ∈ R.
sd(X)
The standard deviation of a random variable X, written sd_X(X) (or sometimes σ_X), is the
square root of the variance:
sd_X(X) = √Var_X(X).
Skewness
The skewness (γ_1) of a discrete random variable X is given by
\[
\gamma_1 = \frac{E_X[\{X - E_X(X)\}^3]}{\mathrm{sd}_X(X)^3}.
\]
That is, if μ = E(X) and σ = sd(X),
\[
\gamma_1 = \frac{E[(X - \mu)^3]}{\sigma^3}.
\]
Examples
1. If X is a r.v. taking the integer value scored with a single roll of a fair die, then
\[
\mathrm{Var}(X) = \sum_{x=1}^{6} x^2 p(x) - 3.5^2 = 1^2\cdot\tfrac{1}{6} + 2^2\cdot\tfrac{1}{6} + \ldots + 6^2\cdot\tfrac{1}{6} - 3.5^2 = \tfrac{35}{12} \approx 2.92.
\]
2. If now X is the score from a student answering a single multiple choice question with four
options, with 3 marks awarded for a correct answer, −1 for a wrong answer and 0 for no
answer, what is the standard deviation if they answer at random?
E(X²) = 3² · P(Correct) + (−1)² · P(Incorrect) = 9 × 1/4 + 1 × 3/4 = 3
⟹ sd(X) = √(3 − 0²) = √3.
3.3 Comments
Links with Statistics
We have met three important quantities for a randomvariable, dened through expectation
- the mean , the variance
2
and the standard deviation .
Again we can see a duality with the corresponding numerical summaries for data which
we met - the sample mean x, the sample variance s
2
and the sample standard deviation s.
The duality is this: If we were to consider the data sample as the population and draw a
random member from that sample as a random variable, this r.v. would have cdf F
n
(x), the
empirical cdf. The mean of the r.v. = x, variance
2
= s
2
and standard deviation = s.
3.4 Sums of Random Variables
Expectation of Sums of Random Variables
Let X_1, X_2, . . . , X_n be n random variables, perhaps with different distributions and not
necessarily independent.
Let S_n = Σ_{i=1}^{n} X_i be the sum of those variables, and S_n/n be their average.
Then the mean of S_n is given by
\[
E(S_n) = \sum_{i=1}^{n} E(X_i), \qquad E\!\left(\frac{S_n}{n}\right) = \frac{\sum_{i=1}^{n} E(X_i)}{n}.
\]
Variance of Sums of Random Variables
However, for the variance of S_n, only if X_1, X_2, . . . , X_n are independent do we have
\[
\mathrm{Var}(S_n) = \sum_{i=1}^{n} \mathrm{Var}(X_i), \qquad \mathrm{Var}\!\left(\frac{S_n}{n}\right) = \frac{\sum_{i=1}^{n} \mathrm{Var}(X_i)}{n^2}.
\]
So if X_1, X_2, . . . , X_n are independent and identically distributed with E(X_i) = μ_X and
Var(X_i) = σ²_X we get
\[
E\!\left(\frac{S_n}{n}\right) = \mu_X, \qquad \mathrm{Var}\!\left(\frac{S_n}{n}\right) = \frac{\sigma_X^2}{n}.
\]
11
4 Discrete Distributions
4.1 Bernoulli, Binomial and Geometric Distributions
Bernoulli(p)
Consider an experiment with only two possible outcomes, encoded as a random variable
X taking value 1, with probability p, or 0, with probability (1 − p), accordingly.
(Ex.: Tossing a coin, X = 1 for a head, X = 0 for tails, p = 1/2.)
Then we say X ~ Bernoulli(p) and note the pmf to be
p(x) = p^x (1 − p)^{1−x},  x = 0, 1.
Using the formulae for mean and variance, it follows that
μ = p,  σ² = p(1 − p).
Example: Bernoulli(p) pmf [plot omitted]
Binomial(n, p)
Consider n identical, independent Bernoulli(p) trials X_1, . . . , X_n.
Let X = Σ_{i=1}^{n} X_i be the total number of 1s observed in the n trials.
(Ex.: Tossing a coin n times, X is the number of heads obtained, p = 1/2.)
Then X is a random variable taking values in {0, 1, 2, . . . , n}, and we say X ~ Binomial(n, p).
From the Binomial Theorem we find the pmf to be
\[
p(x) = \binom{n}{x} p^x (1-p)^{n-x}, \qquad x = 0, 1, 2, \ldots, n.
\]
To calculate the Binomial pmf we recall that
\[
\binom{n}{x} = \frac{n!}{x!(n-x)!} \quad \text{and} \quad x! = \prod_{i=1}^{x} i. \quad \text{(Note } 0! = 1.)
\]
It can be shown, either directly from the pmf or from the results for sums of random
variables, that the mean and variance are
μ = np,  σ² = np(1 − p).
Similarly, the skewness is given by
\[
\gamma_1 = \frac{1 - 2p}{\sqrt{np(1-p)}}.
\]
Example: Binomial(20, p) pmf [plot omitted]
Example
Suppose that 10 users are authorised to use a particular computer system, and that the
system collapses if 7 or more users attempt to log on simultaneously. Suppose that each user
has the same probability p = 0.2 of wishing to log on in each hour.
What is the probability that the system will crash in a given hour?
Solution: The probability that exactly x users will want to log on in any hour is given by
Binomial(n, p) = Binomial(10, 0.2).
Hence the probability of 7 or more users wishing to log on in any hour is
\[
p(7) + p(8) + p(9) + p(10) = \binom{10}{7} 0.2^7\, 0.8^3 + \ldots + \binom{10}{10} 0.2^{10}\, 0.8^0 = 0.00086.
\]
Two more examples
A manufacturing plant produces chips with a defect rate of 10%. The quality control
procedure consists of checking samples of size 50. Then the distribution of the number
of defectives is expected to be Binomial(50, 0.1).
When transmitting binary digits through a communication channel, the number of digits
received incorrectly out of n transmitted digits can be modelled by a Binomial(n, p), where
p is the probability that a digit is transmitted incorrectly.
Note: Recall the independence condition necessary for these models to be reasonable.
Geometric(p)
Consider a potentially infinite sequence of independent Bernoulli(p) random variables
X_1, X_2, . . ..
Suppose we define a quantity X by
X = min{i | X_i = 1}
to be the index of the first Bernoulli trial to result in a 1.
(Ex.: Tossing a coin, X is the number of tosses until the first head is obtained, p = 1/2.)
Then X is a random variable taking values in Z⁺ = {1, 2, . . .}, and we say X ~ Geometric(p).
Clearly the pmf is given by
p(x) = p(1 − p)^{x−1},  x = 1, 2, . . ..
The mean and variance are
\[
\mu = \frac{1}{p}, \qquad \sigma^2 = \frac{1-p}{p^2}.
\]
The skewness is given by
\[
\gamma_1 = \frac{2 - p}{\sqrt{1-p}},
\]
and so is always positive.
Example: Geometric(p) pmf [plot omitted]
Alternative Formulation
If X ~ Geometric(p), let us consider Y = X − 1.
Then Y is a random variable taking values in N = {0, 1, 2, . . .}, and corresponds to the
number of independent Bernoulli(p) trials before we obtain our first 1. (Some texts refer to this
as the Geometric distribution.)
Note we have pmf
p_Y(y) = p(1 − p)^y,  y = 0, 1, 2, . . .,
and the mean becomes
μ_Y = (1 − p)/p,
while the variance and skewness are unaffected by the shift.
Example
Suppose people have problems logging onto a particular website once every 5 attempts, on
average.
1. Assuming the attempts are independent, what is the probability that an individual will
not succeed until the 4th attempt?
p = 4/5 = 0.8.  p(4) = (1 − p)^3 p = 0.2^3 × 0.8 = 0.0064.
2. On average, how many trials must one make until succeeding?
Mean = 1/p = 5/4 = 1.25.
3. What's the probability that the first successful attempt is the 7th or later?
p(7) + p(8) + p(9) + . . . = p(1 − p)^6 / (1 − (1 − p)) = (1 − p)^6 = 0.2^6.
Example (contd. from Binomial)
Again suppose that 10 users are authorised to use a particular computer system, and that
the system collapses if 7 or more users attempt to log on simultaneously. Suppose that each
user has the same probability p = 0.2 of wishing to log on in each hour.
Using the Binomial distribution we found the probability that the system will crash in any
given hour to be 0.00086.
Using the Geometric distribution formulae, we are able to answer questions such as: on
average, after how many hours will the system crash?
Mean = 1/p = 1/0.00086 ≈ 1163 hours.
Example: Mad(?) Dictators and Birth Control
A dictator, keen to maximise the ratio of males to females in his country (so he could build
up his all-male army), ordered that each couple should keep having children until a boy was
born and then stop.
Calculate the expected number of boys that a couple will have, and the expected number
of girls, given that P(boy) = 1/2.
Assume for simplicity that each couple can have arbitrarily many children (although this
is not necessary to get the following results). Then since each couple stops when 1 boy is born,
the expected number of boys per couple is 1.
On the other hand, if Y is the number of girls given birth to by a couple, Y clearly follows
the alternative formulation for the Geometric(1/2) distribution.
So the expected number of girls for a couple is (1 − 1/2)/(1/2) = 1.
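A simulation sketch of the dictator example (not part of the notes), checking that the policy leaves the expected numbers of boys and girls equal:

```python
# Sketch: simulate the "stop at the first boy" policy with P(boy) = 1/2.
import random

def family():
    girls = 0
    while random.random() >= 0.5:   # each birth is a girl with probability 1/2
        girls += 1
    return 1, girls                 # exactly one boy when the couple stops

reps = 100_000
boys, girls = zip(*(family() for _ in range(reps)))
print(sum(boys) / reps, sum(girls) / reps)   # both close to 1.0
```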
15
4.2 Poisson Distribution
Poi(λ)
Let X be a random variable on N = {0, 1, 2, . . .} with pmf
\[
p(x) = \frac{e^{-\lambda}\lambda^x}{x!}, \qquad x = 0, 1, 2, \ldots,
\]
for some λ > 0.
Then X is said to follow a Poisson distribution with rate parameter λ and we write X ~ Poi(λ).
Poisson random variables are concerned with the number of random events occurring per
unit of time or space, when there is a constant underlying probability rate of events occurring
across this unit.
Examples
the number of minor car crashes per day in the U.K.;
the number of mistakes on each of my slides;
the number of potholes in each mile of road;
the number of jobs which arrive at a database server per hour;
the number of particles emitted by a radioactive substance in a given time.
An interesting property of the Poisson distribution is that it has equal mean and variance,
namely
μ = λ,  σ² = λ.
The skewness is given by
\[
\gamma_1 = \frac{1}{\sqrt{\lambda}},
\]
so is always positive but decreasing as λ → ∞.
Example: Poi(5) pmf (again) [plot omitted]
Poisson Approximation to the Binomial
Notice the similarity between the pmf plots we've seen for Binomial(20, p) and Poi(5).
It can be shown that for Binomial(n, p), when p is small and n is large, this distribution can
be well approximated by the Poisson distribution with rate parameter np, Poi(np).
(Although p in the example above is not small, we would typically prefer p < 0.1 for the
approximation to be useful.)
The usefulness of this approximation is in using probability tables: tabulating a single
Poisson(λ) distribution encompasses an infinite number of possible corresponding Binomial
distributions, Binomial(n, λ/n).
Ex.
A manufacturer produces VLSI chips, of which 1% are defective. Find the probability that
in a box of 100 chips none are defective.
We want p(0) from Binomial(100, 0.01). Since n is large and p is small, we can approximate
this distribution by Poi(100 × 0.01) ≡ Poi(1).
Then
\[
p(0) \approx \frac{e^{-1} 1^0}{0!} = 0.3679.
\]
Fitting a Poisson Distribution to Data
Ex.
The number of particles emitted by a radioactive substance which reached a Geiger counter
was measured for 2608 time intervals, each of length 7.5 seconds.
The (real) data are given in the table below:
x     0    1    2    3    4    5    6    7    8    9    10
n_x   57   203  383  525  532  408  273  139  45   27   16
Do these data correspond to 2608 independent observations of an identical Poisson random
variable?
The total number of particles, Σ_x x n_x, is 10,094, and the total number of intervals observed,
n = Σ_x n_x, is 2608, so that the average number reaching the counter in an interval is
10094/2608 = 3.870.
Since the mean of Poi(λ) is λ, we can try setting λ = 3.87 and see how well this fits the
data.
For example, considering the case x = 0, for a single experiment interval the probability of
observing 0 particles would be p(0) = e^{−3.87} × 3.87^0 / 0! = 0.02086. So over n = 2608
repetitions, our (Binomial) expectation of the number of 0 counts would be n × p(0) = 54.4.
Similarly for x = 1, 2, . . ., we obtain the following table of expected values from the Poi(3.87)
model:
x       0     1      2      3      4      5      6      7      8     9     10
O(n_x)  57    203    383    525    532    408    273    139    45    27    16
E(n_x)  54.4  210.5  407.4  525.5  508.4  393.5  253.8  140.3  67.9  29.2  17.1
(O = Observed, E = Expected).
The two sets of numbers appear sufficiently close to suggest the Poisson approximation is
a good one. Later, when we come to look at hypothesis testing, we will see how to make such
judgements quantitatively.
4.3 Discrete Uniform Distribution
U({1, 2, . . . , n})
Let X be a random variable on {1, 2, . . . , n} with pmf
p(x) = 1/n,  x = 1, 2, . . . , n.
Then X is said to follow a discrete uniform distribution and we write X ~ U({1, 2, . . . , n}).
The mean and variance are
\[
\mu = \frac{n+1}{2}, \qquad \sigma^2 = \frac{n^2 - 1}{12},
\]
and the skewness is clearly zero.
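A two-line numerical check (not part of the notes) of the discrete uniform mean and variance formulae, here with the illustrative choice n = 10:

```python
# Sketch: mean and variance of U({1, ..., n}) from the pmf versus the closed forms.
n = 10
values = range(1, n + 1)
mean = sum(values) / n
var = sum(x**2 for x in values) / n - mean**2
print(mean, (n + 1) / 2)            # 5.5 and 5.5
print(var, (n**2 - 1) / 12)         # 8.25 and 8.25
```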
18
COMP 245 Statistics
Exercises 4 - Continuous Random Variables
1. Suppose X is a continuous random variable with density function f which is symmetric around
zero, so ∀x ∈ R, f(−x) = f(x).
Show that the cdf satisfies F(−x) = 1 − F(x).
2. Electrons hit a circular plate with unit radius. Let X be the random variable representing the
distance of a particle strike from the centre of the plate. Assuming that a particle is equally
likely to strike anywhere on the plate,
(a) for 0 < r < 1 find P(X < r), and hence write down the full cumulative distribution
function of X, F_X;
(b) find P(r < X < s), where r < s;
(c) find the probability density function for X, f_X;
X
.
(d) calculate the mean distance of a particle strike from the origin.
3. Prove that the mean and variance of an Exp(λ) random variable are 1/λ and 1/λ² respectively.
4. Let X ~ U(0, 1). Find the cdf and hence the pdf of the transformed variable Y = e^X.
5. Let X ~ N(μ, σ²), and let Y = (X − μ)/σ. Using the results on transformations of variables,
validate the claim in lectures that Y ~ N(0, 1).
6. Let X be a continuous random variable, with cdf F_X(x) and pdf f_X(x). Let Y = aX + b, where
a ≠ 0, b ∈ R are constants.
(a) Considering in turn the two cases a > 0 and a < 0, use the definition of a cdf to find
expressions for the cdf of Y, F_Y(y), in terms of F_X.
(b) Using the relationship between a pdf and its cdf, show that the pdf for Y is given by
\[
f_Y(y) = \frac{1}{|a|} f_X\!\left(\frac{y-b}{a}\right).
\]
7. If "area" refers to the area under the curve of the standard normal probability density
function, find the value or values of z such that
(a) the area between 0 and z is 0.3770;
(b) the area to the left of z is 0.8621;
(c) the area between -1.5 and z is 0.0217.
8. Find the area under the standard normal curve
(a) between z = 0 and z = 1.2;
(b) between z = -0.68 and z = 0;
(c) between z = -0.46 and z = 2.21;
(d) between z = 0.81 and z = 1.94;
(e) to the right of z = -1.28.
COMP 245 Statistics
Solutions 4 - Continuous Random Variables
1. Begin with
\[
F(-x) + F(x) = \int_{-\infty}^{-x} f(u)\,du + \int_{-\infty}^{x} f(v)\,dv.
\]
Taking the change of variable v = −u in the first integral leads to
\[
F(-x) + F(x) = \int_{x}^{\infty} f(-v)\,dv + \int_{-\infty}^{x} f(v)\,dv = \int_{x}^{\infty} f(v)\,dv + \int_{-\infty}^{x} f(v)\,dv = \int_{-\infty}^{\infty} f(v)\,dv = 1.
\]
2. (a) Since a particle is equally likely to hit anywhere on the plate, for 0 < r < 1 the probability
that it will strike inside a circle of radius r is πr²/(π1²) = r². Hence
F(x) = 0 for x ≤ 0;  x² for 0 < x < 1;  1 for x ≥ 1.
(b) P(r < X < s) = P(X < s) − P(X < r) = s² − r².
(c) From 2a the cdf of X is given by F(x) = x² for 0 ≤ x ≤ 1. So the pdf, f(x) = F′(x) = 2x
for 0 ≤ x ≤ 1, and 0 everywhere else.
(d) The expected distance from the origin is
\[
E(X) = \int_{-\infty}^{\infty} x f(x)\,dx = \int_{0}^{1} x \cdot 2x\,dx = \left[\frac{2x^3}{3}\right]_0^1 = \frac{2}{3}.
\]
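A Monte Carlo sketch (not part of the solutions) of part (d): sampling strike points uniformly over the unit disc and averaging their distance from the centre should give a value close to 2/3.

```python
# Sketch: mean strike distance from the centre of a unit disc, by simulation.
import math, random

reps = 200_000
total = 0.0
for _ in range(reps):
    while True:                       # rejection sampling from the unit disc
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if x * x + y * y <= 1:
            break
    total += math.hypot(x, y)
print(total / reps)                   # close to 2/3
```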
3. A random variable X ~ Exp(λ) has density f(x) = λe^{−λx} for x ≥ 0. So
\[
E(X) = \int_{-\infty}^{\infty} x f(x)\,dx = \int_{0}^{\infty} x \lambda e^{-\lambda x}\,dx = \left[-x e^{-\lambda x}\right]_{0}^{\infty} + \int_{0}^{\infty} e^{-\lambda x}\,dx = [0 - 0] - \left[\frac{e^{-\lambda x}}{\lambda}\right]_{0}^{\infty} = 0 - \left(-\frac{1}{\lambda}\right) = \frac{1}{\lambda}.
\]
\[
E(X^2) = \int_{-\infty}^{\infty} x^2 f(x)\,dx = \int_{0}^{\infty} x^2 \lambda e^{-\lambda x}\,dx = \left[-x^2 e^{-\lambda x}\right]_{0}^{\infty} + \int_{0}^{\infty} 2x e^{-\lambda x}\,dx = [0 - 0] + \frac{2}{\lambda} E(X) = \frac{2}{\lambda^2}.
\]
\[
\mathrm{Var}(X) = E(X^2) - \{E(X)\}^2 = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}.
\]
4. X has range [0, 1] and pdf f_X(x) = 1 for 0 < x < 1, and 0 otherwise.
Thus the cdf for Y = e^X is given by
\[
F_Y(y) = P_Y(Y \le y) = P(e^X \le y) = P(X \le \log(y)) = \int_{-\infty}^{\log(y)} f_X(x)\,dx = \int_{0}^{\log(y)} dx = \log(y),
\]
for 0 < log(y) < 1; that is, for 1 < y < e.
For the pdf, differentiating F_Y(y) wrt y gives
f_Y(y) = 1/y for 1 < y < e, and 0 otherwise.
5. Let g(x) = (x − μ)/σ, so Y = g(X). First we note that g is clearly a continuous, monotonically
increasing function of x. Therefore we have
f_Y(y) = f_X{g^{-1}(y)} |g^{-1}′(y)|.
Well g^{-1}(y) = σy + μ, so g^{-1}′(y) = σ. Since X ~ N(μ, σ²),
\[
f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{(x - \mu)^2}{2\sigma^2}\right\}
\]
and hence
\[
f_Y(y) = f_X(\sigma y + \mu)\,\sigma = \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{(\sigma y + \mu - \mu)^2}{2\sigma^2}\right\} = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{y^2}{2}\right) \implies Y \sim N(0, 1).
\]
6. (a) For a ≠ 0, F_Y(y) = P_Y(Y ≤ y) = P(aX + b ≤ y) = P(aX ≤ y − b).
If a > 0,
F_Y(y) = P(X ≤ (y − b)/a) = F_X((y − b)/a),
whereas if a < 0,
F_Y(y) = P(X ≥ (y − b)/a) = 1 − F_X((y − b)/a).
(b) The pdf of a continuous random variable is the derivative of the cdf, f_Y(y) = F_Y′(y).
So if a > 0,
\[
f_Y(y) = \frac{d}{dy} F_X\!\left(\frac{y-b}{a}\right) = \frac{1}{a} F_X'\!\left(\frac{y-b}{a}\right) = \frac{1}{a} f_X\!\left(\frac{y-b}{a}\right),
\]
whereas if a < 0,
\[
f_Y(y) = \frac{d}{dy} \left\{1 - F_X\!\left(\frac{y-b}{a}\right)\right\} = -\frac{1}{a} F_X'\!\left(\frac{y-b}{a}\right) = -\frac{1}{a} f_X\!\left(\frac{y-b}{a}\right).
\]
So either way, we have
\[
f_Y(y) = \frac{1}{|a|} f_X\!\left(\frac{y-b}{a}\right).
\]
7. (a) z = 1.16 or z = -1.16. (b) z = 1.09. (c) z = -1.35 or z = -1.69.
8. (a) 0.3849
(b) 0.2517
(c) 0.6636
(d) 0.1828
(e) 0.8997
Continuous Random Variables
COMP 245 STATISTICS
Dr N A Heard
Room 545 Huxley Building
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 1 / 56
Definition
Suppose again we have a random experiment with sample space S and
probability measure P.
Recall our definition of a random variable as a mapping X : S → R
from the sample space S to the real numbers, inducing a probability
measure P_X(B) = P{X^{-1}(B)}, B ⊆ R.
We define the random variable X to be (absolutely) continuous if
∃ f_X : R → R s.t.
\[
P_X(B) = \int_{x \in B} f_X(x)\,dx, \qquad B \subseteq R, \qquad (1)
\]
in which case f_X is referred to as the probability density function
(pdf) of X.
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 2 / 56
Comments
A connected sequence of comments:
One consequence of this definition is that the probability of any
singleton set B = {x}, x ∈ R, is zero for a continuous random
variable,
P_X(X = x) = P_X({x}) = 0.
This in turn implies that any countable set B = {x_1, x_2, . . .} ⊂ R
will have zero probability measure for a continuous random
variable, since P_X(X ∈ B) = P_X(X = x_1) + P_X(X = x_2) + . . ..
This automatically implies that the range of a continuous random
variable will be uncountable.
This tells us that a random variable cannot be both discrete and
continuous.
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 3 / 56
Examples
The following quantities would typically be modelled with continuous
random variables. They are measurements of time, distance and other
phenomena that can, at least in theory, be determined to an arbitrarily
high degree of accuracy.
The height or weight of a randomly chosen individual from a
population.
The duration of this lecture.
The volume of fuel consumed by a bus on its route.
The total distance driven by a taxi cab in a day.
Note that there are obvious ways to discretise all of the above examples.
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 4 / 56
pdf as the derivative of the cdf
From (1), we notice the cumulative distribution function (cdf) for a
continuous random variable X is therefore given by
F
X
(x) =
_
x

f
X
(t)dt, x R,
This expression leads us to a denition of the pdf of a continuous r.v. X
for which we already have the cdf; by the Fundamental Theorem of
Calculus we nd the pdf of X to be given by
Defn.
f
X
(x) =
d
dx
F
X
(x) or F

X
(x).
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 5 / 56
Properties of a pdf
Since the pdf is the derivative of the cdf, and because we know that a
cdf is non-decreasing, this tells us the pdf will always be non-negative.
So, in the same way as we did for cdfs and discrete pmfs, we have the
following checklist to ensure f
X
is a valid pdf.
pdf:
1
f
X
(x) 0, x R;
2
_

f
X
(x)dx = 1.
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 6 / 56
Interval Probabilities
Suppose we are interested in whether continuous a r.v. X lies in an
interval (a, b]. From the denition of a continuous random variable,
this is given by
P
X
(a < X b) =
_
b
a
f
X
(x)dx.
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 7 / 56
That is, the area under the pdf between a and b.
[Figure: a pdf curve f(x) with the area between a and b shaded, representing P(a < X ≤ b).]
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 8 / 56
Further Comments:
Besides still being a non-decreasing function satisfying F_X(−∞) = 0, F_X(∞) = 1, the cdf F_X of a continuous random variable X is also (absolutely) continuous.
For a continuous r.v., since ∀x, P(X = x) = 0, we have F_X(x) = P(X ≤ x) = P(X < x).
For small δx, f_X(x)δx is approximately the probability that X takes a value in the small interval [x, x + δx).
Since the density (pdf) f_X(x) is not itself a probability, then unlike the pmf of a discrete r.v. we do not require f_X(x) ≤ 1.
From (1) it is clear that the pdf of a continuous r.v. X completely characterises its distribution, so we often just specify f_X.
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 9 / 56
Example
Suppose we have a continuous random variable X with probability density function
    f(x) = cx²,  0 < x < 3;
    f(x) = 0,  otherwise,
for some unknown constant c.
Questions
1. Determine c.
2. Find the cdf of X.
3. Calculate P(1 < X < 2).
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 10 / 56
Solutions
1. To find c: We must have
    1 = ∫_{−∞}^{∞} f(x) dx = ∫_{0}^{3} cx² dx = c [x³/3]₀³ = 9c  ⟹  c = 1/9.
2.  F(x) = 0, for x < 0;
    F(x) = ∫_{−∞}^{x} f(u) du = ∫_{0}^{x} u²/9 du = x³/27, for 0 ≤ x ≤ 3;
    F(x) = 1, for x > 3.
3. P(1 < X < 2) = F(2) − F(1) = 8/27 − 1/27 = 7/27 ≈ 0.2593.
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 11 / 56
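These three results can be reproduced numerically; the following sketch is an illustrative addition (not part of the original slides) and assumes Python with SciPy is available.

```python
from scipy.integrate import quad

f = lambda x: x**2 / 9           # pdf with c = 1/9
print(quad(f, 0, 3)[0])          # ~1.0, confirming c = 1/9 normalises the density
F = lambda x: x**3 / 27          # cdf on [0, 3]
print(F(2) - F(1))               # 7/27 ~ 0.2593
print(quad(f, 1, 2)[0])          # same probability by direct integration of the pdf
```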
[Figure: the pdf f(x) = x²/9 on (0, 3) and the corresponding cdf F(x) = x³/27, plotted for x between −1 and 5.]
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 12 / 56
Transforming random variables
Suppose we have a continuous random variable X and wish to consider the transformed random variable Y = g(X) for some function g : ℝ → ℝ, s.t. g is continuous and strictly monotonic (so g⁻¹ exists).
Suppose g is monotonic increasing. Then for y ∈ ℝ, Y ≤ y ⟺ X ≤ g⁻¹(y). So
    F_Y(y) = P_Y(Y ≤ y) = P_X(X ≤ g⁻¹(y)) = F_X(g⁻¹(y)).
By the chain rule of differentiation,
    f_Y(y) = F′_Y(y) = f_X{g⁻¹(y)} g⁻¹′(y).
Note g⁻¹′(y) = d/dy g⁻¹(y) is positive since we assumed g was increasing.
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 13 / 56
If we had g monotonic decreasing, Y ≤ y ⟺ X ≥ g⁻¹(y) and
    F_Y(y) = P_X(X ≥ g⁻¹(y)) = 1 − F_X(g⁻¹(y)).
So by comparison with before, we would have
    f_Y(y) = F′_Y(y) = −f_X{g⁻¹(y)} g⁻¹′(y),
with g⁻¹′(y) always negative.
So overall, for Y = g(X) we have
    f_Y(y) = f_X{g⁻¹(y)} |g⁻¹′(y)|.    (2)
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 14 / 56
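Formula (2) can be checked by simulation; the sketch below (an illustrative addition assuming Python with NumPy/SciPy) takes X ∼ Exp(1) and g(x) = x², so the implied density is f_Y(y) = e^{−√y}/(2√y).

```python
import numpy as np
from scipy.integrate import quad

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)   # X ~ Exp(1)
y = x**2                                       # Y = g(X), g(x) = x^2 increasing on [0, inf)

# density implied by formula (2): f_Y(y) = f_X(sqrt(y)) * 1/(2*sqrt(y))
f_Y = lambda y: np.exp(-np.sqrt(y)) / (2 * np.sqrt(y))

print(np.mean((1 < y) & (y < 4)))   # empirical P(1 < Y < 4)
print(quad(f_Y, 1, 4)[0])           # ~ same value (~0.233) from the transformed density
```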
E(X)
For a continuous random variable X we define the mean or expectation of X,
Defn.
    μ_X or E_X(X) = ∫_{−∞}^{∞} x f_X(x) dx.
More generally, for a function of interest of the random variable g : ℝ → ℝ we have
Defn.
    E_X{g(X)} = ∫_{−∞}^{∞} g(x) f_X(x) dx.
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 15 / 56
Linearity of Expectation
Clearly, for continuous random variables we again have linearity of expectation
    E(aX + b) = aE(X) + b,  ∀a, b ∈ ℝ,
and that for two functions g, h : ℝ → ℝ, we have additivity of expectation
    E{g(X) + h(X)} = E{g(X)} + E{h(X)}.
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 16 / 56
Var(X)
The variance of a continuous random variable X is given by
Defn.
    σ²_X or Var_X(X) = E{(X − μ_X)²} = ∫_{−∞}^{∞} (x − μ_X)² f_X(x) dx,
and again it is easy to show that
    Var_X(X) = ∫_{−∞}^{∞} x² f_X(x) dx − μ²_X = E(X²) − {E(X)}².
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 17 / 56
For a linear transformation aX + b we again have
    Var(aX + b) = a² Var(X),  ∀a, b ∈ ℝ.
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 18 / 56
Q_X(α)
Recall we defined the lower and upper quartiles and median of a sample of data as points (¼, ½, ¾)-way through the ordered sample.
For α ∈ [0, 1] and a continuous random variable X we define the α-quantile of X, Q_X(α), as
    Q_X(α) = min_{q∈ℝ} {q : F_X(q) = α}.
If F_X is invertible then
Defn.
    Q_X(α) = F_X⁻¹(α).
In particular the median of a random variable X is Q_X(0.5). That is, the median is a solution to the equation F_X(x) = 1/2.
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 19 / 56
Example (continued)
Again suppose we have a continuous random variable X with probability density function given by
    f(x) = x²/9,  0 < x < 3;
    f(x) = 0,  otherwise.
Questions
1. Calculate E(X).
2. Calculate Var(X).
3. Calculate the median of X.
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 20 / 56
Solutions
1. E(X) = ∫_{−∞}^{∞} x f(x) dx = ∫_{0}^{3} x · x²/9 dx = [x⁴/36]₀³ = 3⁴/36 = 2.25.
2. E(X²) = ∫_{−∞}^{∞} x² f(x) dx = ∫_{0}^{3} x² · x²/9 dx = [x⁵/45]₀³ = 3⁵/45 = 5.4.
   So Var(X) = E(X²) − {E(X)}² = 5.4 − 2.25² = 0.3375.
3. From earlier, F(x) = x³/27, for 0 < x < 3.
   Setting F(x) = 1/2 and solving, we get x³/27 = 1/2 ⟹ x = ∛(27/2) = 3/∛2 ≈ 2.3811 for the median.
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 21 / 56
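The same mean, variance and median can be recovered numerically; this is an illustrative sketch added here (not from the original slides), assuming Python with SciPy.

```python
from scipy.integrate import quad
from scipy.optimize import brentq

f = lambda x: x**2 / 9                        # pdf on (0, 3)
EX  = quad(lambda x: x * f(x), 0, 3)[0]       # 2.25
EX2 = quad(lambda x: x**2 * f(x), 0, 3)[0]    # 5.4
print(EX, EX2 - EX**2)                        # mean 2.25, variance 0.3375
F = lambda x: x**3 / 27
print(brentq(lambda x: F(x) - 0.5, 0, 3))     # median ~ 2.3811, root of F(x) = 1/2
```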
U(a, b)
Suppose X is a continuous random variable with probability density function
    f(x) = 1/(b − a),  a < x < b;
    f(x) = 0,  otherwise,
and hence corresponding cumulative distribution function
    F(x) = 0,  x ≤ a;
    F(x) = (x − a)/(b − a),  a < x < b;
    F(x) = 1,  x ≥ b.
Then X is said to follow a uniform distribution on the interval (a, b) and we write X ∼ U(a, b).
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 22 / 56
Example: U(0,1)
[Figure: the U(0,1) pdf, constant at 1 on (0, 1), and the U(0,1) cdf, rising linearly from 0 to 1 on (0, 1).]
Notice from the cdf that the quantiles of U(0,1) are the special case where Q(α) = α.
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 23 / 56
Relationship between U(a, b) and U(0,1)
Suppose X ∼ U(0, 1), so F_X(x) = x, 0 ≤ x ≤ 1. For a < b ∈ ℝ, if Y = a + (b − a)X then Y ∼ U(a, b).
[Figure: the linear map taking X ∈ (0, 1) to Y = a + (b − a)X ∈ (a, b).]
Proof:
We first observe that for any y ∈ (a, b),
    Y ≤ y ⟺ a + (b − a)X ≤ y ⟺ X ≤ (y − a)/(b − a).
From this we find Y ∼ U(a, b), since
    F_Y(y) = P(Y ≤ y) = P(X ≤ (y − a)/(b − a)) = F_X((y − a)/(b − a)) = (y − a)/(b − a).
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 24 / 56
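This scaling is exactly how uniform samples on an arbitrary interval are usually generated; a small sketch (an addition for illustration, assuming Python with NumPy) follows.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = 2.0, 5.0
x = rng.random(100_000)         # X ~ U(0, 1)
y = a + (b - a) * x             # Y ~ U(a, b)
print(y.min(), y.max())         # samples lie in (2, 5)
print(y.mean(), y.var())        # ~ (a+b)/2 = 3.5 and (b-a)^2/12 = 0.75
```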
Mean and Variance of U(a, b)
To find the mean of X ∼ U(a, b),
    E(X) = ∫_{−∞}^{∞} x f(x) dx = ∫_{a}^{b} x · 1/(b − a) dx = [x²/{2(b − a)}]_a^b
         = (b² − a²)/{2(b − a)} = (b − a)(b + a)/{2(b − a)} = (a + b)/2.
Similarly we get Var(X) = E(X²) − E(X)² = (b − a)²/12, so
    μ = (a + b)/2,  σ² = (b − a)²/12.
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 25 / 56
Exp(λ)
Suppose now X is a random variable taking value on ℝ⁺ = [0, ∞) with pdf
    f(x) = λe^{−λx},  x ≥ 0,
for some λ > 0.
Then X is said to follow an exponential distribution with rate parameter λ and we write X ∼ Exp(λ).
Straightforward integration between 0 and x leads to the cdf,
    F(x) = 1 − e^{−λx},  x ≥ 0.
The mean and variance are given by
    μ = 1/λ,  σ² = 1/λ².
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 26 / 56
Example: Exp(1), Exp(0.5) & Exp(0.2) pdfs
[Figure: exponential pdfs for λ = 1, λ = 0.5 and λ = 0.2 on 0 ≤ x ≤ 10.]
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 27 / 56
Example: Exp(0.2), Exp(0.5) & Exp(1) cdfs
[Figure: the corresponding exponential cdfs on 0 ≤ x ≤ 10.]
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 28 / 56
Lack of Memory Property
First notice that from the exponential distribution cdf equation we have P(X > x) = e^{−λx}.
An important (and not always desirable) characteristic of the exponential distribution is the so called lack of memory property.
For x, s > 0, consider the conditional probability P(X > x + s | X > s) of the additional magnitude of an exponentially distributed random variable given we already know it is greater than s.
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 29 / 56
Well
    P(X > x + s | X > s) = P(X > x + s) / P(X > s),
which, when X ∼ Exp(λ), gives
    P(X > x + s | X > s) = e^{−λ(x+s)} / e^{−λs} = e^{−λx},
again an exponential distribution with parameter λ.
So if we think of the exponential variable as the time to an event, then knowledge that we have waited time s for the event tells us nothing about how much longer we will have to wait - the process has no memory.
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 30 / 56
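The lack of memory property is easy to see by simulation; the sketch below is an illustrative addition (not from the original slides), assuming Python with NumPy.

```python
import numpy as np

rng = np.random.default_rng(2)
lam, s, x = 0.5, 2.0, 3.0
t = rng.exponential(scale=1/lam, size=1_000_000)   # draws from Exp(0.5)

cond = np.mean(t[t > s] > s + x)    # empirical P(X > x + s | X > s)
print(cond)
print(np.mean(t > x))               # empirical unconditional P(X > x)
print(np.exp(-lam * x))             # theoretical value e^{-lambda x} ~ 0.223; all three agree
```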
Examples
Exponential random variables are often used to model the time until
occurrence of a random event where there is an assumed constant risk
(λ) of the event happening over time, and so are frequently used as the
simplest model, for example, in reliability analysis. So examples
include:
the time to failure of a component in a system;
the time until we nd the next mistake on my slides;
the distance we travel along a road until we nd the next pothole;
the time until the next job arrives at a database server.
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 31 / 56
Link with Poisson Distribution
Notice the duality between some of the exponential r.v. examples and those we saw for a Poisson distribution. In each case, "number of events" has been replaced with "time between events".
Claim:
If events in a random process occur according to a Poisson distribution with rate λ then the time between events has an exponential distribution with rate parameter λ.
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 32 / 56
Proof:
Suppose we have some random event process such that ∀x > 0, the number of events occurring in [0, x], N_x, follows a Poisson distribution with rate parameter λx, so N_x ∼ Poi(λx). Such a process is known as a homogeneous Poisson process. Let X be the time until the first event of this process arrives.
Then we notice that
    P(X > x) = P(N_x = 0) = (λx)⁰ e^{−λx} / 0! = e^{−λx},
and hence X ∼ Exp(λ). The same argument applies for all subsequent inter-arrival times.
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 33 / 56
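The duality can also be illustrated the other way round by simulation: building a Poisson process from exponential gaps and checking the event counts. This sketch is an added illustration (assuming Python with NumPy), not part of the original slides.

```python
import numpy as np

rng = np.random.default_rng(3)
lam, T, reps = 2.0, 10.0, 10_000

counts = []
for _ in range(reps):
    t, n = 0.0, 0
    while True:
        t += rng.exponential(1/lam)   # i.i.d. Exp(lam) inter-arrival times
        if t > T:
            break
        n += 1
    counts.append(n)                  # number of events in [0, T]

print(np.mean(counts), lam * T)       # empirical mean ~ 20, as Poi(lam * T) predicts
```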
N(μ, σ²)
Suppose X is a random variable taking value on ℝ with pdf
    f(x) = 1/(σ√(2π)) exp{−(x − μ)²/(2σ²)},
for some μ ∈ ℝ, σ > 0.
Then X is said to follow a Gaussian or normal distribution with mean μ and variance σ², and we write X ∼ N(μ, σ²).
The cdf of X ∼ N(μ, σ²) is not analytically tractable for any (μ, σ), so we can only write
    F(x) = 1/(σ√(2π)) ∫_{−∞}^{x} exp{−(t − μ)²/(2σ²)} dt.
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 34 / 56
Example: N(0,1), N(2,1) & N(0,4) pdfs
[Figure: normal pdfs for N(0,1), N(2,1) and N(0,4) on −4 ≤ x ≤ 6.]
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 35 / 56
Example: N(0,1), N(2,1) & N(0,4) cdfs
[Figure: the corresponding normal cdfs on −4 ≤ x ≤ 6.]
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 36 / 56
N(0, 1)
Setting μ = 0, σ = 1 and Z ∼ N(0, 1) gives the special case of the standard normal, with simplified density
    f(z) ≡ φ(z) = 1/√(2π) e^{−z²/2}.
Again for the cdf, we can only write
    F(z) ≡ Φ(z) = 1/√(2π) ∫_{−∞}^{z} e^{−t²/2} dt.
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 37 / 56
Statistical Tables
Since the cdf, and therefore any probabilities, associated with a normal distribution are not analytically available, numerical integration procedures are used to find approximate probabilities.
In particular, statistical tables contain values of the standard normal cdf Φ(z) for a range of values z ∈ ℝ, and the quantiles Φ⁻¹(α) for a range of values α ∈ (0, 1). Linear interpolation is used for approximation between the tabulated values.
But why just tabulate N(0, 1)? We will now see how all normal distribution probabilities can be related back to probabilities from a standard normal distribution.
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 38 / 56
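In practice the same numerical routines that produce the tables are available directly in software; for instance (an illustrative addition, assuming Python with SciPy):

```python
from scipy.stats import norm

print(norm.cdf(1.96))    # ~0.975, the tabulated Phi(1.96)
print(norm.ppf(0.975))   # ~1.96, the quantile Phi^{-1}(0.975)
print(norm.cdf(-1.2))    # ~0.115, matching 1 - Phi(1.2)
```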
Linear Transformations of Normal Random Variables
Suppose we have X ∼ N(μ, σ²). Then it is also true that for any constants a, b ∈ ℝ, the linear combination aX + b also follows a normal distribution. More precisely,
    X ∼ N(μ, σ²) ⟹ aX + b ∼ N(aμ + b, a²σ²),  ∀a, b ∈ ℝ.
(Note that the mean and variance parameters of this transformed distribution follow from the general results for expectation and variance of any random variable under linear transformation.)
In particular, this allows us to standardise any normal r.v.,
    X ∼ N(μ, σ²) ⟹ (X − μ)/σ ∼ N(0, 1).
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 39 / 56
Standardising Normal Random Variables
So if X ∼ N(μ, σ²) and we set Z = (X − μ)/σ, then since σ > 0 we can first observe that for any x ∈ ℝ,
    X ≤ x ⟺ (X − μ)/σ ≤ (x − μ)/σ ⟺ Z ≤ (x − μ)/σ.
Therefore we can write the cdf of X in terms of Φ,
    F_X(x) = P(X ≤ x) = P(Z ≤ (x − μ)/σ) = Φ((x − μ)/σ).
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 40 / 56
Table of Φ
  z    Φ(z)  |  z    Φ(z)  |  z    Φ(z)  |  z      Φ(z)
  0    .5    |  0.9  .816  |  1.8  .964  |  2.8    .997
  .1   .540  |  1.0  .841  |  1.9  .971  |  3.0    .998
  .2   .579  |  1.1  .864  |  2.0  .977  |  3.5    .9998
  .3   .618  |  1.2  .885  |  2.1  .982  |  1.282  .9
  .4   .655  |  1.3  .903  |  2.2  .986  |  1.645  .95
  .5   .691  |  1.4  .919  |  2.3  .989  |  1.96   .975
  .6   .726  |  1.5  .933  |  2.4  .992  |  2.326  .99
  .7   .758  |  1.6  .945  |  2.5  .994  |  2.576  .995
  .8   .788  |  1.7  .955  |  2.6  .995  |  3.09   .999
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 41 / 56
Using Table of Φ
First of all notice that Φ(z) has been tabulated for z > 0.
This is because the standard normal pdf is symmetric about 0, so φ(−z) = φ(z). For the cdf Φ, this means
    Φ(−z) = 1 − Φ(z).
So for example, Φ(−1.2) = 1 − Φ(1.2) ≈ 1 − 0.885 = 0.115.
Similarly, if Z ∼ N(0, 1) and we want P(Z > z), then for example P(Z > 1.5) = 1 − P(Z ≤ 1.5) = 1 − Φ(1.5).
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 42 / 56
Important Quantiles of N(0, 1)
We will often have cause to use the 97.5% and 99.5% quantiles of N(0, 1), given by Φ⁻¹(0.975) and Φ⁻¹(0.995).
    Φ(1.96) ≈ 97.5%.
So with 95% probability an N(0, 1) r.v. will lie in [−1.96, 1.96] (≈ [−2, 2]).
    Φ(2.58) ≈ 99.5%.
So with 99% probability an N(0, 1) r.v. will lie in [−2.58, 2.58].
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 43 / 56
More generally, for α ∈ (0, 1) and defining z_{1−α/2} to be the (1 − α/2) quantile of N(0, 1), if Z ∼ N(0, 1) then
    P_Z(Z ∈ [−z_{1−α/2}, z_{1−α/2}]) = 1 − α.
More generally still, if X ∼ N(μ, σ²), then
    P_X(X ∈ [μ − σz_{1−α/2}, μ + σz_{1−α/2}]) = 1 − α,
and hence [μ − σz_{1−α/2}, μ + σz_{1−α/2}] gives a (1 − α) probability region for X centred around μ.
This can be rewritten as
    P_X(|X − μ| ≤ σz_{1−α/2}) = 1 − α.
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 44 / 56
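As a quick numerical illustration of this probability region (an added sketch, not from the original slides, assuming Python with NumPy/SciPy):

```python
import numpy as np
from scipy.stats import norm

alpha = 0.05
z = norm.ppf(1 - alpha / 2)          # ~1.96 = z_{1 - alpha/2}
mu, sigma = 200.0, 16.0
rng = np.random.default_rng(4)
x = rng.normal(mu, sigma, 1_000_000)
print(np.mean(np.abs(x - mu) <= sigma * z))   # ~0.95 empirical coverage
```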
Example
An analogue signal received at a detector (measured in microvolts) may be modelled as a Gaussian random variable X ∼ N(200, 256).
1. What is the probability that the signal will exceed 240 μV?
2. What is the probability that the signal is larger than 240 μV given that it is greater than 210 μV?
Solutions:
1. P(X > 240) = 1 − P(X ≤ 240) = 1 − Φ((240 − 200)/√256) = 1 − Φ(2.5) ≈ 0.00621.
2. P(X > 240 | X > 210) = P(X > 240)/P(X > 210) = {1 − Φ((240 − 200)/√256)} / {1 − Φ((210 − 200)/√256)} ≈ 0.02335.
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 45 / 56
The Central Limit Theorem
Let X₁, X₂, . . . , X_n be n independent and identically distributed (i.i.d.) random variables from any probability distribution, each with mean μ and variance σ².
From before we know
    E(∑_{i=1}^{n} X_i) = nμ,  Var(∑_{i=1}^{n} X_i) = nσ².
First notice
    E(∑_{i=1}^{n} X_i − nμ) = 0,  Var(∑_{i=1}^{n} X_i − nμ) = nσ².
Dividing by σ√n,
    E((∑_{i=1}^{n} X_i − nμ)/(σ√n)) = 0,  Var((∑_{i=1}^{n} X_i − nμ)/(σ√n)) = 1.
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 46 / 56
But we can now present the following, astonishing result.
Theorem
    (∑_{i=1}^{n} X_i − nμ)/(σ√n) → N(0, 1) in distribution as n → ∞.
This can also be written as
    (X̄ − μ)/(σ/√n) → N(0, 1) in distribution as n → ∞,
where X̄ = (∑_{i=1}^{n} X_i)/n.
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 47 / 56
Or finally, for large n we have approximately
    X̄ ∼ N(μ, σ²/n),
or
    ∑_{i=1}^{n} X_i ∼ N(nμ, nσ²).
We note here that although all these approximate distributional results hold irrespective of the distribution of the {X_i}, in the special case where X_i ∼ N(μ, σ²) these distributional results are, in fact, exact. This is because the sum of independent normally distributed random variables is also normally distributed.
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 48 / 56
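The theorem is easy to see empirically; this added sketch (assuming Python with NumPy) averages exponential variables, for which μ = σ² = 1, and checks that the standardised mean behaves like N(0, 1).

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 50, 20_000
samples = rng.exponential(1.0, size=(reps, n))   # X_i ~ Exp(1): mu = 1, sigma^2 = 1
xbar = samples.mean(axis=1)

print(xbar.mean(), xbar.var())         # ~ mu = 1 and sigma^2 / n = 0.02
z = (xbar - 1.0) / (1.0 / np.sqrt(n))  # standardised sample means
print(np.mean(np.abs(z) <= 1.96))      # ~0.95, as for a N(0, 1) variable
```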
Example
Consider the most simple example, that X₁, X₂, . . . are i.i.d. Bernoulli(p) discrete random variables taking value 0 or 1.
Then the {X_i} each have mean μ = p and variance σ² = p(1 − p).
By definition, we know that for any n,
    ∑_{i=1}^{n} X_i ∼ Binomial(n, p).
But now, by the Central Limit Theorem (CLT), we also have for large n that approximately
    ∑_{i=1}^{n} X_i ∼ N(nμ, nσ²) ≡ N(np, np(1 − p)).
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 49 / 56
So for large n
    Binomial(n, p) ≈ N(np, np(1 − p)).
Notice that the LHS is a discrete distribution, and the RHS is a continuous distribution.
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 50 / 56
Binomial(10,½) pmf & N(5,2.5) pdf
[Figure: the Binomial(10, 0.5) pmf alongside the N(5, 2.5) pdf.]
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 51 / 56
Binomial(100,½) pmf & N(50,25) pdf
[Figure: the Binomial(100, 0.5) pmf alongside the N(50, 25) pdf.]
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 52 / 56
Binomial(1000,½) pmf & N(500,250) pdf
[Figure: the Binomial(1000, 0.5) pmf alongside the N(500, 250) pdf.]
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 53 / 56
So suppose X was the number of heads found on 1000 tosses of a fair coin, and we were interested in P(X ≤ 490).
Using the binomial distribution pmf, we would need to calculate
    P(X ≤ 490) = p_X(0) + p_X(1) + p_X(2) + . . . + p_X(490) (!) (≈ 0.27).
However, using the CLT we have approximately X ∼ N(500, 250) and so
    P(X ≤ 490) ≈ Φ((490 − 500)/√250) = Φ(−0.632) = 1 − Φ(0.632) ≈ 0.26.
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 54 / 56
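Both the exact binomial probability and its normal approximation are a one-liner in software; the sketch below is an illustrative addition (assuming Python with SciPy), not part of the original slides.

```python
import numpy as np
from scipy.stats import binom, norm

n, p = 1000, 0.5
exact = binom.cdf(490, n, p)                            # ~0.274, the exact binomial sum
approx = norm.cdf((490 - n*p) / np.sqrt(n*p*(1 - p)))   # ~0.264, the CLT approximation
print(exact, approx)
```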
Log-Normal Distribution
Suppose X ∼ N(μ, σ²), and consider the transformation Y = e^X.
Then if g(x) = e^x, g⁻¹(y) = log(y) and g⁻¹′(y) = 1/y.
Then by (2) we have
    f_Y(y) = 1/(yσ√(2π)) exp{−(log(y) − μ)²/(2σ²)},  y > 0,
and we say Y follows a log-normal distribution.
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 55 / 56
Example: LN(0,1), LN(2,1) & LN(0,4) pdfs
[Figure: log-normal pdfs for LN(0,1), LN(2,1) and LN(0,4) on 0 ≤ x ≤ 10.]
Dr N A Heard (Room 545 Huxley Building) Continuous Random Variables 56 / 56
Continuous Random Variables
COMP 245 STATISTICS
Dr N A Heard
Contents
1 Continuous Random Variables 2
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Probability Density Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Mean, Variance and Quantiles 6
2.1 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Quantiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3 Continuous Distributions 8
3.1 Uniform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.2 Exponential . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.3 Gaussian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1
1 Continuous Random Variables
1.1 Introduction
Definition
Suppose again we have a random experiment with sample space S and probability measure P.
Recall our definition of a random variable as a mapping X : S → ℝ from the sample space S to the real numbers, inducing a probability measure P_X(B) = P{X⁻¹(B)}, B ⊆ ℝ.
We define the random variable X to be (absolutely) continuous if there exists f_X : ℝ → ℝ s.t.
    P_X(B) = ∫_{x∈B} f_X(x) dx,  ∀B ⊆ ℝ,    (1)
in which case f_X is referred to as the probability density function (pdf) of X.
Comments
A connected sequence of comments:
One consequence of this definition is that the probability of any singleton set B = {x}, x ∈ ℝ, is zero for a continuous random variable,
    P_X(X = x) = P_X({x}) = 0.
This in turn implies that any countable set B = {x₁, x₂, . . .} ⊆ ℝ will have zero probability measure for a continuous random variable, since P_X(X ∈ B) = P_X(X = x₁) + P_X(X = x₂) + . . . .
This automatically implies that the range of a continuous random variable will be uncountable.
This tells us that a random variable cannot be both discrete and continuous.
Examples
The following quantities would typically be modelled with continuous random variables.
They are measurements of time, distance and other phenomena that can, at least in theory, be
determined to an arbitrarily high degree of accuracy.
The height or weight of a randomly chosen individual from a population.
The duration of this lecture.
The volume of fuel consumed by a bus on its route.
The total distance driven by a taxi cab in a day.
Note that there are obvious ways to discretise all of the above examples.
2
1.2 Probability Density Functions
pdf as the derivative of the cdf
From (1), we notice the cumulative distribution function (cdf) for a continuous random variable X is therefore given by
    F_X(x) = ∫_{−∞}^{x} f_X(t) dt,  x ∈ ℝ.
This expression leads us to a definition of the pdf of a continuous r.v. X for which we already have the cdf; by the Fundamental Theorem of Calculus we find the pdf of X to be given by
    f_X(x) = d/dx F_X(x) or F′_X(x).
Properties of a pdf
Since the pdf is the derivative of the cdf, and because we know that a cdf is non-decreasing, this tells us the pdf will always be non-negative.
So, in the same way as we did for cdfs and discrete pmfs, we have the following checklist to ensure f_X is a valid pdf.
pdf:
1. f_X(x) ≥ 0, ∀x ∈ ℝ;
2. ∫_{−∞}^{∞} f_X(x) dx = 1.
Interval Probabilities
Suppose we are interested in whether a continuous r.v. X lies in an interval (a, b]. From the definition of a continuous random variable, this is given by
    P_X(a < X ≤ b) = ∫_{a}^{b} f_X(x) dx.
That is, the area under the pdf between a and b.
[Figure: a pdf curve f(x) with the area between a and b shaded, representing P(a < X ≤ b).]
Further Comments:
Besides still being a non-decreasing function satisfying F_X(−∞) = 0, F_X(∞) = 1, the cdf F_X of a continuous random variable X is also (absolutely) continuous.
For a continuous r.v., since ∀x, P(X = x) = 0, we have F_X(x) = P(X ≤ x) = P(X < x).
For small δx, f_X(x)δx is approximately the probability that X takes a value in the small interval [x, x + δx).
Since the density (pdf) f_X(x) is not itself a probability, then unlike the pmf of a discrete r.v. we do not require f_X(x) ≤ 1.
From (1) it is clear that the pdf of a continuous r.v. X completely characterises its distribution, so we often just specify f_X.
Example
Suppose we have a continuous random variable X with probability density function
    f(x) = cx²,  0 < x < 3;
    f(x) = 0,  otherwise
for some unknown constant c.
Questions
1. Determine c.
2. Find the cdf of X.
3. Calculate P(1 < X < 2).
Solutions
1. To find c: We must have
    1 = ∫_{−∞}^{∞} f(x) dx = ∫_{0}^{3} cx² dx = c [x³/3]₀³ = 9c  ⟹  c = 1/9.
2.  F(x) = 0, for x < 0;
    F(x) = ∫_{−∞}^{x} f(u) du = ∫_{0}^{x} u²/9 du = x³/27, for 0 ≤ x ≤ 3;
    F(x) = 1, for x > 3.
3. P(1 < X < 2) = F(2) − F(1) = 8/27 − 1/27 = 7/27 ≈ 0.2593.
[Figure: the pdf f(x) = x²/9 on (0, 3) and the corresponding cdf F(x) = x³/27.]
1.3 Transformations
Transforming random variables
Suppose we have a continuous random variable X and wish to consider the transformed random variable Y = g(X) for some function g : ℝ → ℝ, s.t. g is continuous and strictly monotonic (so g⁻¹ exists).
Suppose g is monotonic increasing. Then for y ∈ ℝ, Y ≤ y ⟺ X ≤ g⁻¹(y). So
    F_Y(y) = P_Y(Y ≤ y) = P_X(X ≤ g⁻¹(y)) = F_X(g⁻¹(y)).
By the chain rule of differentiation,
    f_Y(y) = F′_Y(y) = f_X{g⁻¹(y)} g⁻¹′(y).
Note g⁻¹′(y) = d/dy g⁻¹(y) is positive since we assumed g was increasing.
If we had g monotonic decreasing, Y ≤ y ⟺ X ≥ g⁻¹(y) and
    F_Y(y) = P_X(X ≥ g⁻¹(y)) = 1 − F_X(g⁻¹(y)).
So by comparison with before, we would have
    f_Y(y) = F′_Y(y) = −f_X{g⁻¹(y)} g⁻¹′(y),
with g⁻¹′(y) always negative.
So overall, for Y = g(X) we have
    f_Y(y) = f_X{g⁻¹(y)} |g⁻¹′(y)|.    (2)
5
2 Mean, Variance and Quantiles
2.1 Expectation
E(X)
For a continuous random variable X we define the mean or expectation of X,
    μ_X or E_X(X) = ∫_{−∞}^{∞} x f_X(x) dx.
More generally, for a function of interest of the random variable g : ℝ → ℝ we have
    E_X{g(X)} = ∫_{−∞}^{∞} g(x) f_X(x) dx.
Linearity of Expectation
Clearly, for continuous random variables we again have linearity of expectation
    E(aX + b) = aE(X) + b,  ∀a, b ∈ ℝ,
and that for two functions g, h : ℝ → ℝ, we have additivity of expectation
    E{g(X) + h(X)} = E{g(X)} + E{h(X)}.
2.2 Variance
Var(X)
The variance of a continuous random variable X is given by
    σ²_X or Var_X(X) = E{(X − μ_X)²} = ∫_{−∞}^{∞} (x − μ_X)² f_X(x) dx,
and again it is easy to show that
    Var_X(X) = ∫_{−∞}^{∞} x² f_X(x) dx − μ²_X = E(X²) − {E(X)}².
For a linear transformation aX + b we again have Var(aX + b) = a² Var(X), ∀a, b ∈ ℝ.
6
2.3 Quantiles
Q_X(α)
Recall we defined the lower and upper quartiles and median of a sample of data as points (¼, ½, ¾)-way through the ordered sample.
For α ∈ [0, 1] and a continuous random variable X we define the α-quantile of X, Q_X(α), as
    Q_X(α) = min_{q∈ℝ} {q : F_X(q) = α}.
If F_X is invertible then
    Q_X(α) = F_X⁻¹(α).
In particular the median of a random variable X is Q_X(0.5). That is, the median is a solution to the equation F_X(x) = 1/2.
Example (continued)
Again suppose we have a continuous random variable X with probability density function given by
    f(x) = x²/9,  0 < x < 3;
    f(x) = 0,  otherwise.
Questions
1. Calculate E(X).
2. Calculate Var(X).
3. Calculate the median of X.
Solutions
1. E(X) = ∫_{−∞}^{∞} x f(x) dx = ∫_{0}^{3} x · x²/9 dx = [x⁴/36]₀³ = 3⁴/36 = 2.25.
2. E(X²) = ∫_{−∞}^{∞} x² f(x) dx = ∫_{0}^{3} x² · x²/9 dx = [x⁵/45]₀³ = 3⁵/45 = 5.4.
   So Var(X) = E(X²) − {E(X)}² = 5.4 − 2.25² = 0.3375.
3. From earlier, F(x) = x³/27, for 0 < x < 3.
   Setting F(x) = 1/2 and solving, we get x³/27 = 1/2 ⟹ x = ∛(27/2) = 3/∛2 ≈ 2.3811 for the median.
7
3 Continuous Distributions
3.1 Uniform
U(a, b)
Suppose X is a continuous random variable with probability density function
    f(x) = 1/(b − a),  a < x < b;
    f(x) = 0,  otherwise,
and hence corresponding cumulative distribution function
    F(x) = 0,  x ≤ a;
    F(x) = (x − a)/(b − a),  a < x < b;
    F(x) = 1,  x ≥ b.
Then X is said to follow a uniform distribution on the interval (a, b) and we write X ∼ U(a, b).
Example: U(0,1)
[Figure: the U(0,1) pdf, constant at 1 on (0, 1), and the U(0,1) cdf, rising linearly from 0 to 1 on (0, 1).]
Notice from the cdf that the quantiles of U(0,1) are the special case where Q(α) = α.
Relationship between U(a, b) and U(0,1)
Suppose X ∼ U(0, 1), so F_X(x) = x, 0 ≤ x ≤ 1. For a < b ∈ ℝ, if Y = a + (b − a)X then Y ∼ U(a, b).
[Figure: the linear map taking X ∈ (0, 1) to Y = a + (b − a)X ∈ (a, b).]
Proof:
We first observe that for any y ∈ (a, b), Y ≤ y ⟺ a + (b − a)X ≤ y ⟺ X ≤ (y − a)/(b − a).
From this we find Y ∼ U(a, b), since F_Y(y) = P(Y ≤ y) = P(X ≤ (y − a)/(b − a)) = F_X((y − a)/(b − a)) = (y − a)/(b − a).
Mean and Variance of U(a, b)
To find the mean of X ∼ U(a, b),
    E(X) = ∫_{−∞}^{∞} x f(x) dx = ∫_{a}^{b} x · 1/(b − a) dx = [x²/{2(b − a)}]_a^b
         = (b² − a²)/{2(b − a)} = (b − a)(b + a)/{2(b − a)} = (a + b)/2.
Similarly we get Var(X) = E(X²) − E(X)² = (b − a)²/12, so
    μ = (a + b)/2,  σ² = (b − a)²/12.
3.2 Exponential
Exp(λ)
Suppose now X is a random variable taking value on ℝ⁺ = [0, ∞) with pdf
    f(x) = λe^{−λx},  x ≥ 0,
for some λ > 0.
Then X is said to follow an exponential distribution with rate parameter λ and we write X ∼ Exp(λ).
Straightforward integration between 0 and x leads to the cdf,
    F(x) = 1 − e^{−λx},  x ≥ 0.
The mean and variance are given by
    μ = 1/λ,  σ² = 1/λ².
Example: Exp(1), Exp(0.5) & Exp(0.2) pdfs
[Figure: exponential pdfs for λ = 1, λ = 0.5 and λ = 0.2 on 0 ≤ x ≤ 10.]
Example: Exp(0.2), Exp(0.5) & Exp(1) cdfs
[Figure: the corresponding exponential cdfs on 0 ≤ x ≤ 10.]
Lack of Memory Property
First notice that from the exponential distribution cdf equation we have P(X > x) = e^{−λx}.
An important (and not always desirable) characteristic of the exponential distribution is the so called lack of memory property.
For x, s > 0, consider the conditional probability P(X > x + s | X > s) of the additional magnitude of an exponentially distributed random variable given we already know it is greater than s.
Well
    P(X > x + s | X > s) = P(X > x + s) / P(X > s),
which, when X ∼ Exp(λ), gives
    P(X > x + s | X > s) = e^{−λ(x+s)} / e^{−λs} = e^{−λx},
again an exponential distribution with parameter λ.
So if we think of the exponential variable as the time to an event, then knowledge that we have waited time s for the event tells us nothing about how much longer we will have to wait - the process has no memory.
Examples
Exponential random variables are often used to model the time until occurrence of a random event where there is an assumed constant risk (λ) of the event happening over time, and so are frequently used as the simplest model, for example, in reliability analysis. So examples include:
the time to failure of a component in a system;
the time until we find the next mistake on my slides;
the distance we travel along a road until we find the next pothole;
the time until the next job arrives at a database server.
10
Link with Poisson Distribution
Notice the duality between some of the exponential r.v. examples and those we saw for a Poisson distribution. In each case, "number of events" has been replaced with "time between events".
Claim:
If events in a random process occur according to a Poisson distribution with rate λ then the time between events has an exponential distribution with rate parameter λ.
Proof:
Suppose we have some random event process such that ∀x > 0, the number of events occurring in [0, x], N_x, follows a Poisson distribution with rate parameter λx, so N_x ∼ Poi(λx). Such a process is known as a homogeneous Poisson process. Let X be the time until the first event of this process arrives.
Then we notice that
    P(X > x) = P(N_x = 0) = (λx)⁰ e^{−λx} / 0! = e^{−λx},
and hence X ∼ Exp(λ). The same argument applies for all subsequent inter-arrival times.
3.3 Gaussian
N(μ, σ²)
Suppose X is a random variable taking value on ℝ with pdf
    f(x) = 1/(σ√(2π)) exp{−(x − μ)²/(2σ²)},
for some μ ∈ ℝ, σ > 0.
Then X is said to follow a Gaussian or normal distribution with mean μ and variance σ², and we write X ∼ N(μ, σ²).
The cdf of X ∼ N(μ, σ²) is not analytically tractable for any (μ, σ), so we can only write
    F(x) = 1/(σ√(2π)) ∫_{−∞}^{x} exp{−(t − μ)²/(2σ²)} dt.
11
Example: N(0,1), N(2,1) & N(0,4) pdfs
[Figure: normal pdfs for N(0,1), N(2,1) and N(0,4) on −4 ≤ x ≤ 6.]
Example: N(0,1), N(2,1) & N(0,4) cdfs
[Figure: the corresponding normal cdfs on −4 ≤ x ≤ 6.]
N(0, 1)
Setting μ = 0, σ = 1 and Z ∼ N(0, 1) gives the special case of the standard normal, with simplified density
    f(z) ≡ φ(z) = 1/√(2π) e^{−z²/2}.
Again for the cdf, we can only write
    F(z) ≡ Φ(z) = 1/√(2π) ∫_{−∞}^{z} e^{−t²/2} dt.
Statistical Tables
Since the cdf, and therefore any probabilities, associated with a normal distribution are not analytically available, numerical integration procedures are used to find approximate probabilities.
In particular, statistical tables contain values of the standard normal cdf Φ(z) for a range of values z ∈ ℝ, and the quantiles Φ⁻¹(α) for a range of values α ∈ (0, 1). Linear interpolation is used for approximation between the tabulated values.
But why just tabulate N(0, 1)? We will now see how all normal distribution probabilities can be related back to probabilities from a standard normal distribution.
Linear Transformations of Normal Random Variables
Suppose we have X ∼ N(μ, σ²). Then it is also true that for any constants a, b ∈ ℝ, the linear combination aX + b also follows a normal distribution. More precisely,
    X ∼ N(μ, σ²) ⟹ aX + b ∼ N(aμ + b, a²σ²),  ∀a, b ∈ ℝ.
(Note that the mean and variance parameters of this transformed distribution follow from the general results for expectation and variance of any random variable under linear transformation.)
In particular, this allows us to standardise any normal r.v.,
    X ∼ N(μ, σ²) ⟹ (X − μ)/σ ∼ N(0, 1).
Standardising Normal Random Variables
So if X ∼ N(μ, σ²) and we set Z = (X − μ)/σ, then since σ > 0 we can first observe that for any x ∈ ℝ,
    X ≤ x ⟺ (X − μ)/σ ≤ (x − μ)/σ ⟺ Z ≤ (x − μ)/σ.
Therefore we can write the cdf of X in terms of Φ,
    F_X(x) = P(X ≤ x) = P(Z ≤ (x − μ)/σ) = Φ((x − μ)/σ).
Table of Φ
  z    Φ(z)  |  z    Φ(z)  |  z    Φ(z)  |  z      Φ(z)
  0    .5    |  0.9  .816  |  1.8  .964  |  2.8    .997
  .1   .540  |  1.0  .841  |  1.9  .971  |  3.0    .998
  .2   .579  |  1.1  .864  |  2.0  .977  |  3.5    .9998
  .3   .618  |  1.2  .885  |  2.1  .982  |  1.282  .9
  .4   .655  |  1.3  .903  |  2.2  .986  |  1.645  .95
  .5   .691  |  1.4  .919  |  2.3  .989  |  1.96   .975
  .6   .726  |  1.5  .933  |  2.4  .992  |  2.326  .99
  .7   .758  |  1.6  .945  |  2.5  .994  |  2.576  .995
  .8   .788  |  1.7  .955  |  2.6  .995  |  3.09   .999
Using Table of Φ
First of all notice that Φ(z) has been tabulated for z > 0.
This is because the standard normal pdf is symmetric about 0, so φ(−z) = φ(z). For the cdf Φ, this means
    Φ(−z) = 1 − Φ(z).
So for example, Φ(−1.2) = 1 − Φ(1.2) ≈ 1 − 0.885 = 0.115.
Similarly, if Z ∼ N(0, 1) and we want P(Z > z), then for example P(Z > 1.5) = 1 − P(Z ≤ 1.5) = 1 − Φ(1.5).
Important Quantiles of N(0, 1)
We will often have cause to use the 97.5% and 99.5% quantiles of N(0, 1), given by Φ⁻¹(0.975) and Φ⁻¹(0.995).
    Φ(1.96) ≈ 97.5%.
So with 95% probability an N(0, 1) r.v. will lie in [−1.96, 1.96] (≈ [−2, 2]).
    Φ(2.58) ≈ 99.5%.
So with 99% probability an N(0, 1) r.v. will lie in [−2.58, 2.58].
More generally, for α ∈ (0, 1) and defining z_{1−α/2} to be the (1 − α/2) quantile of N(0, 1), if Z ∼ N(0, 1) then
    P_Z(Z ∈ [−z_{1−α/2}, z_{1−α/2}]) = 1 − α.
More generally still, if X ∼ N(μ, σ²), then
    P_X(X ∈ [μ − σz_{1−α/2}, μ + σz_{1−α/2}]) = 1 − α,
and hence [μ − σz_{1−α/2}, μ + σz_{1−α/2}] gives a (1 − α) probability region for X centred around μ.
This can be rewritten as
    P_X(|X − μ| ≤ σz_{1−α/2}) = 1 − α.
Example
An analogue signal received at a detector (measured in microvolts) may be modelled as a Gaussian random variable X ∼ N(200, 256).
1. What is the probability that the signal will exceed 240 μV?
2. What is the probability that the signal is larger than 240 μV given that it is greater than 210 μV?
Solutions:
1. P(X > 240) = 1 − P(X ≤ 240) = 1 − Φ((240 − 200)/√256) = 1 − Φ(2.5) ≈ 0.00621.
2. P(X > 240 | X > 210) = P(X > 240)/P(X > 210) = {1 − Φ((240 − 200)/√256)} / {1 − Φ((210 − 200)/√256)} ≈ 0.02335.
The Central Limit Theorem
Let X₁, X₂, . . . , X_n be n independent and identically distributed (i.i.d.) random variables from any probability distribution, each with mean μ and variance σ².
From before we know E(∑_{i=1}^{n} X_i) = nμ and Var(∑_{i=1}^{n} X_i) = nσ². First notice E(∑_{i=1}^{n} X_i − nμ) = 0, Var(∑_{i=1}^{n} X_i − nμ) = nσ². Dividing by σ√n, E((∑_{i=1}^{n} X_i − nμ)/(σ√n)) = 0, Var((∑_{i=1}^{n} X_i − nμ)/(σ√n)) = 1.
But we can now present the following, astonishing result.
    (∑_{i=1}^{n} X_i − nμ)/(σ√n) → N(0, 1) in distribution as n → ∞.
This can also be written as
    (X̄ − μ)/(σ/√n) → N(0, 1) in distribution as n → ∞,
where X̄ = (∑_{i=1}^{n} X_i)/n.
Or finally, for large n we have approximately
    X̄ ∼ N(μ, σ²/n),
or
    ∑_{i=1}^{n} X_i ∼ N(nμ, nσ²).
We note here that although all these approximate distributional results hold irrespective of the distribution of the {X_i}, in the special case where X_i ∼ N(μ, σ²) these distributional results are, in fact, exact. This is because the sum of independent normally distributed random variables is also normally distributed.
Example
Consider the most simple example, that X₁, X₂, . . . are i.i.d. Bernoulli(p) discrete random variables taking value 0 or 1.
Then the {X_i} each have mean μ = p and variance σ² = p(1 − p).
By definition, we know that for any n,
    ∑_{i=1}^{n} X_i ∼ Binomial(n, p).
But now, by the Central Limit Theorem (CLT), we also have for large n that approximately
    ∑_{i=1}^{n} X_i ∼ N(nμ, nσ²) ≡ N(np, np(1 − p)).
So for large n
    Binomial(n, p) ≈ N(np, np(1 − p)).
Notice that the LHS is a discrete distribution, and the RHS is a continuous distribution.
Binomial(10,½) pmf & N(5,2.5) pdf
[Figure: the Binomial(10, 0.5) pmf alongside the N(5, 2.5) pdf.]
Binomial(100,½) pmf & N(50,25) pdf
[Figure: the Binomial(100, 0.5) pmf alongside the N(50, 25) pdf.]
Binomial(1000,½) pmf & N(500,250) pdf
[Figure: the Binomial(1000, 0.5) pmf alongside the N(500, 250) pdf.]
So suppose X was the number of heads found on 1000 tosses of a fair coin, and we were interested in P(X ≤ 490).
Using the binomial distribution pmf, we would need to calculate P(X ≤ 490) = p_X(0) + p_X(1) + p_X(2) + . . . + p_X(490) (!) (≈ 0.27).
However, using the CLT we have approximately X ∼ N(500, 250) and so
    P(X ≤ 490) ≈ Φ((490 − 500)/√250) = Φ(−0.632) = 1 − Φ(0.632) ≈ 0.26.
Log-Normal Distribution
Suppose X ∼ N(μ, σ²), and consider the transformation Y = e^X.
Then if g(x) = e^x, g⁻¹(y) = log(y) and g⁻¹′(y) = 1/y.
Then by (2) we have
    f_Y(y) = 1/(yσ√(2π)) exp{−(log(y) − μ)²/(2σ²)},  y > 0,
and we say Y follows a log-normal distribution.
Example: LN(0,1), LN(2,1) & LN(0,4) pdfs
[Figure: log-normal pdfs for LN(0,1), LN(2,1) and LN(0,4) on 0 ≤ x ≤ 10.]
17
COMP 245 Statistics
Exercises 5 - Jointly Distributed Random Variables
1. Suppose the joint pdf of a pair of continuous random variables is given by
    f(x, y) = k(x + y),  0 < x < 2, 0 < y < 2;
    f(x, y) = 0,  otherwise.
(a) Find the constant k.
(b) Find the marginal pdfs of X and Y.
(c) Find if X and Y are independent.
2. A manufacturer has been using two different manufacturing processes to make computer
memory chips. Let X and Y be two continuous random variables, where X denotes the
time to failure of chips made by process A and Y denotes the time to failure of chips made
by process B. Assuming that the joint pdf of (X, Y) is
    f(x, y) = ab e^{−(ax+by)},  x, y > 0;
    f(x, y) = 0,  otherwise,
where a = 10⁻⁴ and b = 1.2 × 10⁻⁴, determine P(X > Y).
3. The joint probability mass function of two discrete random variables X and Y is given by
p(x, y) = cxy for x = 1, 2, 3 and y = 1, 2, 3, and zero otherwise. Find
(a) the constant c;
(b) P(X = 2, Y = 3);
(c) P(X ≤ 2, Y ≤ 2);
(d) P(X ≥ 2);
(e) P(Y < 2);
(f) P(X = 1);
(g) P(Y = 3).
4. Let X and Y be continuous random variables having joint density function f(x, y) = c(x² + y²) when 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, and f(x, y) = 0 otherwise. Determine
(a) the constant c;
(b) P(X < 1/2, Y > 1/2);
(c) P(1/4 < X < 3/4);
(d) P(Y < 1/2);
(e) whether X and Y are independent.
COMP 245 Statistics
Solutions 5 - Jointly Distributed Random Variables
1. (a) We require
    1 = ∫_{y=−∞}^{∞} ∫_{x=−∞}^{∞} f_XY(x, y) dx dy = ∫_{y=0}^{2} ∫_{x=0}^{2} k(x + y) dx dy
      = k ∫_{y=0}^{2} [x²/2 + yx]_{x=0}^{2} dy = 2k ∫_{y=0}^{2} (y + 1) dy = 2k [y²/2 + y]₀² = 8k
    ⟹ k = 1/8.
(b) f_X(x) = ∫_{y=−∞}^{∞} f_XY(x, y) dy = (1/8) ∫_{y=0}^{2} (x + y) dy = (1/8) [xy + y²/2]_{y=0}^{2} = (1/4)(x + 1),
    for 0 < x < 2, and 0 otherwise. Identically for f_Y(y).
(c) Since f(x, y) ≠ f(x) f(y), X and Y are not independent.
2.  P(X > Y) = ∫_{y=−∞}^{∞} ∫_{x=y}^{∞} f_XY(x, y) dx dy = ∫_{y=0}^{∞} ∫_{x=y}^{∞} ab e^{−(ax+by)} dx dy
             = ab ∫_{y=0}^{∞} e^{−by} ∫_{x=y}^{∞} e^{−ax} dx dy = ab ∫_{y=0}^{∞} e^{−by} [−e^{−ax}/a]_{x=y}^{∞} dy
             = b ∫_{y=0}^{∞} e^{−by} e^{−ay} dy = b ∫_{y=0}^{∞} e^{−(a+b)y} dy = b [−e^{−(a+b)y}/(a + b)]_{y=0}^{∞}
             = b/(a + b) = 1.2 × 10⁻⁴ / (10⁻⁴ + 1.2 × 10⁻⁴) = 12/22 ≈ 0.545.
3. (a) 1/36. (b) 1/6. (c) 1/4. (d) 5/6. (e) 1/6. (f) 1/6. (g) 1/2.
4. (a) 3/2. (b) 1/4. (c) 29/64. (d) 5/16.
   (e) Not independent, f(x, y) ≠ f(x) f(y).
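Several of these answers can be checked numerically; the sketch below is an illustrative addition (not part of the original solutions), assumes Python with NumPy/SciPy, and takes the parameter values a = 10⁻⁴ and b = 1.2 × 10⁻⁴ as read above.

```python
import numpy as np
from scipy.integrate import dblquad

# Question 1: with k = 1/8 the joint pdf integrates to 1 over (0,2) x (0,2)
f1 = lambda y, x: (x + y) / 8
print(dblquad(f1, 0, 2, 0, 2)[0])      # ~1.0

# Question 2: P(X > Y) for independent exponential failure times, by simulation
rng = np.random.default_rng(7)
a, b = 1e-4, 1.2e-4
x = rng.exponential(1/a, 500_000)
y = rng.exponential(1/b, 500_000)
print(np.mean(x > y))                  # ~ b/(a+b) = 12/22 ~ 0.545
```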
Jointly Distributed Random Variables
COMP 245 STATISTICS
Dr N A Heard
Room 545 Huxley Building
Dr N A Heard (Room 545 Huxley Building) Jointly Distributed Random Variables 1 / 25
Joint Probability Distribution
Suppose we have two random variables X and Y defined on a sample space S with probability measure P(E), E ⊆ S. Jointly, they form a map (X, Y) : S → ℝ² where s ↦ (X(s), Y(s)).
From before, we know to define the marginal probability distributions P_X and P_Y by, for example,
    P_X(B) = P(X⁻¹(B)),  B ⊆ ℝ.
We now define the joint probability distribution P_XY by
Defn.
    P_XY(B_X, B_Y) = P{X⁻¹(B_X) ∩ Y⁻¹(B_Y)},  B_X, B_Y ⊆ ℝ.
Dr N A Heard (Room 545 Huxley Building) Jointly Distributed Random Variables 2 / 25
So P_XY(B_X, B_Y), the probability that X ∈ B_X and Y ∈ B_Y, is given by the probability P of the set of all points in the sample space that get mapped both into B_X by X and into B_Y by Y.
More generally, for a single region B_XY ⊆ ℝ², find the collection of sample space elements
    S_{B_XY} = {s ∈ S | (X(s), Y(s)) ∈ B_XY}
and define
Defn.
    P_XY(B_XY) = P(S_{B_XY}).
Dr N A Heard (Room 545 Huxley Building) Jointly Distributed Random Variables 3 / 25
Joint Cumulative Distribution Function
We thus define the joint cumulative distribution function as
Defn.
    F_XY(x, y) = P_XY((−∞, x], (−∞, y]),  x, y ∈ ℝ.
It is easy to check that the marginal cdfs for X and Y can be recovered by
    F_X(x) = F_XY(x, ∞),  x ∈ ℝ,
    F_Y(y) = F_XY(∞, y),  y ∈ ℝ,
and that the two definitions will agree.
Dr N A Heard (Room 545 Huxley Building) Jointly Distributed Random Variables 4 / 25
Properties of a Joint cdf
For F_XY to be a valid cdf, we need to make sure the following conditions hold.
1. 0 ≤ F_XY(x, y) ≤ 1, ∀x, y ∈ ℝ;
2. Monotonicity: ∀x₁, x₂, y₁, y₂ ∈ ℝ,
   x₁ < x₂ ⟹ F_XY(x₁, y₁) ≤ F_XY(x₂, y₁) and y₁ < y₂ ⟹ F_XY(x₁, y₁) ≤ F_XY(x₁, y₂);
3. ∀x, y ∈ ℝ,
   F_XY(x, −∞) = 0, F_XY(−∞, y) = 0 and F_XY(∞, ∞) = 1.
Dr N A Heard (Room 545 Huxley Building) Jointly Distributed Random Variables 5 / 25
Interval Probabilities
Suppose we are interested in whether the random variable pair (X, Y) lie in the interval cross product (x₁, x₂] × (y₁, y₂]; that is, if x₁ < X ≤ x₂ and y₁ < Y ≤ y₂.
[Figure: the rectangle (x₁, x₂] × (y₁, y₂] in the (X, Y) plane.]
First note that P_XY((x₁, x₂], (−∞, y]) = F_XY(x₂, y) − F_XY(x₁, y).
It is then easy to see that P_XY((x₁, x₂], (y₁, y₂]) is given by
    F_XY(x₂, y₂) − F_XY(x₁, y₂) − F_XY(x₂, y₁) + F_XY(x₁, y₁).
Dr N A Heard (Room 545 Huxley Building) Jointly Distributed Random Variables 6 / 25
Joint pmfs
If X and Y are both discrete random variables, then we can define the joint probability mass function as
Defn.
    p_XY(x, y) = P_XY({x}, {y}),  x, y ∈ ℝ.
We can recover the marginal pmfs p_X and p_Y since, by the law of total probability, ∀x, y ∈ ℝ
    p_X(x) = ∑_y p_XY(x, y),  p_Y(y) = ∑_x p_XY(x, y).
Dr N A Heard (Room 545 Huxley Building) Jointly Distributed Random Variables 7 / 25
Properties of a Joint pmf
For p_XY to be a valid pmf, we need to make sure the following conditions hold.
1. 0 ≤ p_XY(x, y) ≤ 1, ∀x, y ∈ ℝ;
2. ∑_y ∑_x p_XY(x, y) = 1.
Dr N A Heard (Room 545 Huxley Building) Jointly Distributed Random Variables 8 / 25
Joint pdfs
On the other hand, if ∃ f_XY : ℝ×ℝ → ℝ s.t.
    P_XY(B_XY) = ∫∫_{(x,y)∈B_XY} f_XY(x, y) dx dy,  ∀B_XY ⊆ ℝ×ℝ,
then we say X and Y are jointly continuous and we refer to f_XY as the joint probability density function of X and Y.
In this case, we have
    F_XY(x, y) = ∫_{t=−∞}^{y} ∫_{s=−∞}^{x} f_XY(s, t) ds dt,  x, y ∈ ℝ.
Dr N A Heard (Room 545 Huxley Building) Jointly Distributed Random Variables 9 / 25
By the Fundamental Theorem of Calculus we can identify the joint pdf as
Defn.
    f_XY(x, y) = ∂²/∂x∂y F_XY(x, y).
Furthermore, we can recover the marginal densities f_X and f_Y:
    f_X(x) = d/dx F_X(x) = d/dx F_XY(x, ∞) = d/dx ∫_{y=−∞}^{∞} ∫_{s=−∞}^{x} f_XY(s, y) ds dy.
Dr N A Heard (Room 545 Huxley Building) Jointly Distributed Random Variables 10 / 25
By the Fundamental Theorem of Calculus, and through a symmetric argument for Y, we thus get
    f_X(x) = ∫_{y=−∞}^{∞} f_XY(x, y) dy,  f_Y(y) = ∫_{x=−∞}^{∞} f_XY(x, y) dx.
Dr N A Heard (Room 545 Huxley Building) Jointly Distributed Random Variables 11 / 25
Properties of a Joint pdf
For f_XY to be a valid pdf, we need to make sure the following conditions hold.
1. f_XY(x, y) ≥ 0, ∀x, y ∈ ℝ;
2. ∫_{y=−∞}^{∞} ∫_{x=−∞}^{∞} f_XY(x, y) dx dy = 1.
Dr N A Heard (Room 545 Huxley Building) Jointly Distributed Random Variables 12 / 25
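These two conditions, and the marginal recovery formula above, can be checked numerically for a candidate joint density; the sketch below is an illustrative addition (assuming Python with SciPy) using the hypothetical example density f(x, y) = (x + y)/8 on (0, 2) × (0, 2).

```python
from scipy.integrate import dblquad, quad

f = lambda y, x: (x + y) / 8 if (0 < x < 2 and 0 < y < 2) else 0.0  # example joint pdf

print(dblquad(f, 0, 2, 0, 2)[0])                   # condition 2: total probability ~1.0
f_X = lambda x: quad(lambda y: f(y, x), 0, 2)[0]   # marginal of X by integrating y out
print(f_X(1.0))                                    # ~0.5, matching (x + 1)/4 at x = 1
```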
Independence of Random Variables
Two random variables X and Y are independent iff ∀B_X, B_Y ⊆ ℝ,
    P_XY(B_X, B_Y) = P_X(B_X) P_Y(B_Y).
More specifically, two discrete random variables X and Y are independent iff
Defn.
    p_XY(x, y) = p_X(x) p_Y(y),  ∀x, y ∈ ℝ;
and two continuous random variables X and Y are independent iff
Defn.
    f_XY(x, y) = f_X(x) f_Y(y),  ∀x, y ∈ ℝ.
Dr N A Heard (Room 545 Huxley Building) Jointly Distributed Random Variables 13 / 25
Conditional Distributions
For two r.v.s X, Y we define the conditional probability distribution P_{Y|X} by
Defn.
    P_{Y|X}(B_Y | B_X) = P_XY(B_X, B_Y) / P_X(B_X),  B_X, B_Y ⊆ ℝ, P_X(B_X) ≠ 0.
This is the revised probability of Y falling inside B_Y given that we now know X ∈ B_X.
Then we have X and Y are independent ⟺ P_{Y|X}(B_Y | B_X) = P_Y(B_Y), ∀B_X, B_Y ⊆ ℝ.
Dr N A Heard (Room 545 Huxley Building) Jointly Distributed Random Variables 14 / 25
For discrete r.v.s X, Y we define the conditional probability mass function p_{Y|X} by
Defn.
    p_{Y|X}(y|x) = p_XY(x, y) / p_X(x),  x, y ∈ ℝ, p_X(x) ≠ 0,
and for continuous r.v.s X, Y we define the conditional probability density function f_{Y|X} by
Defn.
    f_{Y|X}(y|x) = f_XY(x, y) / f_X(x),  x, y ∈ ℝ.
Dr N A Heard (Room 545 Huxley Building) Jointly Distributed Random Variables 15 / 25
[Justification:
    P(Y ≤ y | X ∈ [x, x + dx)) = P_XY([x, x + dx), (−∞, y]) / P_X([x, x + dx))
        = {F(y, x + dx) − F(y, x)} / {F(x + dx) − F(x)}
        = [{F(y, x + dx) − F(y, x)}/dx] / [{F(x + dx) − F(x)}/dx]
⟹ f(y | X ∈ [x, x + dx)) = [∂{F(y, x + dx) − F(y, x)}/∂y∂x] / [{F(x + dx) − F(x)}/dx]
⟹ f(y|x) = lim_{dx→0} f(y | X ∈ [x, x + dx)) = f(y, x)/f(x).]
Dr N A Heard (Room 545 Huxley Building) Jointly Distributed Random Variables 16 / 25
In either case, X and Y are independent ⟺
    p_{Y|X}(y|x) = p_Y(y) or f_{Y|X}(y|x) = f_Y(y),  ∀x, y ∈ ℝ.
Dr N A Heard (Room 545 Huxley Building) Jointly Distributed Random Variables 17 / 25
E{g(X, Y)}
Suppose we have a bivariate function of interest of the random variables X and Y, g : ℝ×ℝ → ℝ.
If X and Y are discrete, we define E{g(X, Y)} by
Defn.
    E_XY{g(X, Y)} = ∑_y ∑_x g(x, y) p_XY(x, y).
If X and Y are jointly continuous, we define E{g(X, Y)} by
Defn.
    E_XY{g(X, Y)} = ∫_{y=−∞}^{∞} ∫_{x=−∞}^{∞} g(x, y) f_XY(x, y) dx dy.
Dr N A Heard (Room 545 Huxley Building) Jointly Distributed Random Variables 18 / 25
Immediately from these definitions we have the following:
If g(X, Y) = g₁(X) + g₂(Y),
    E_XY{g₁(X) + g₂(Y)} = E_X{g₁(X)} + E_Y{g₂(Y)}.
If g(X, Y) = g₁(X)g₂(Y) and X and Y are independent,
    E_XY{g₁(X)g₂(Y)} = E_X{g₁(X)} E_Y{g₂(Y)}.
In particular, considering g(X, Y) = XY for independent X, Y we have
    E_XY(XY) = E_X(X) E_Y(Y).
Dr N A Heard (Room 545 Huxley Building) Jointly Distributed Random Variables 19 / 25
E_{Y|X}(Y | X = x)
In general E_XY(XY) ≠ E_X(X) E_Y(Y).
Suppose X and Y are discrete r.v.s with joint pmf p(x, y). If we are given the value x of the r.v. X, our revised pmf for Y is the conditional pmf p(y|x), for y ∈ ℝ.
The conditional expectation of Y given X = x is therefore
Defn.
    E_{Y|X}(Y | X = x) = ∑_y y p(y|x).
Dr N A Heard (Room 545 Huxley Building) Jointly Distributed Random Variables 20 / 25
Similarly, if X and Y were continuous,
Defn.
    E_{Y|X}(Y | X = x) = ∫_{y=−∞}^{∞} y f(y|x) dy.
In either case, the conditional expectation is a function of x but not the unknown Y.
Dr N A Heard (Room 545 Huxley Building) Jointly Distributed Random Variables 21 / 25
Covariance
For a single variable X we considered the expectation of g(X) = (X − μ_X)(X − μ_X), called the variance and denoted σ²_X.
The bivariate extension of this is the expectation of g(X, Y) = (X − μ_X)(Y − μ_Y). We define the covariance of X and Y by
    σ_XY = Cov(X, Y) = E_XY[(X − μ_X)(Y − μ_Y)].
Dr N A Heard (Room 545 Huxley Building) Jointly Distributed Random Variables 22 / 25
Correlation
Covariance measures how the random variables move in tandem with one another, and so is closely related to the idea of correlation.
We define the correlation of X and Y by
    ρ_XY = Cor(X, Y) = σ_XY / (σ_X σ_Y).
Unlike the covariance, the correlation is invariant to the scale of the r.v.s X and Y.
It is easily shown that if X and Y are independent random variables, then σ_XY = ρ_XY = 0.
Dr N A Heard (Room 545 Huxley Building) Jointly Distributed Random Variables 23 / 25
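Covariance and correlation are quick to estimate from simulated data; the sketch below is an illustrative addition (assuming Python with NumPy), not part of the original slides.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=100_000)
y = 2 * x + rng.normal(size=100_000)   # Y depends on X, so they are correlated
print(np.cov(x, y)[0, 1])              # ~2 = Cov(X, Y)
print(np.corrcoef(x, y)[0, 1])         # ~0.89 = 2 / sqrt(1 * 5), scale-free correlation
z = rng.normal(size=100_000)           # independent of X
print(np.corrcoef(x, z)[0, 1])         # ~0 for independent variables
```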
Example 1
Suppose that the lifetime, X, and brightness, Y, of a light bulb are modelled as continuous random variables. Let their joint pdf be given by
    f(x, y) = λ₁λ₂ e^{−λ₁x − λ₂y},  x, y > 0.
Are lifetime and brightness independent?
The marginal pdf for X is
    f(x) = ∫_{y=−∞}^{∞} f(x, y) dy = ∫_{y=0}^{∞} λ₁λ₂ e^{−λ₁x − λ₂y} dy = λ₁ e^{−λ₁x}.
Similarly f(y) = λ₂ e^{−λ₂y}. Hence f(x, y) = f(x)f(y) and X and Y are independent.
Dr N A Heard (Room 545 Huxley Building) Jointly Distributed Random Variables 24 / 25
Example 2
Suppose continuous r.v.s (X, Y) ∈ ℝ² have joint pdf
    f(x, y) = 1/π,  x² + y² ≤ 1;
    f(x, y) = 0,  otherwise.
Determine the marginal pdfs for X and Y.
Well x² + y² ≤ 1 ⟺ |y| ≤ √(1 − x²). So
    f(x) = ∫_{y=−∞}^{∞} f(x, y) dy = ∫_{y=−√(1−x²)}^{√(1−x²)} 1/π dy = (2/π)√(1 − x²).
Similarly f(y) = (2/π)√(1 − y²). Hence f(x, y) ≠ f(x)f(y) and X and Y are not independent.
Dr N A Heard (Room 545 Huxley Building) Jointly Distributed Random Variables 25 / 25
Jointly Distributed Random Variables
COMP 245 STATISTICS
Dr N A Heard
Contents
1 Jointly Distributed Random Variables 1
1.1 Denition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Joint cdfs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Joint Probability Mass Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Joint Probability Density Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Independence and Expectation 4
2.1 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Conditional Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.4 Covariance and Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3 Examples 6
1 Jointly Distributed Random Variables
1.1 Denition
Joint Probability Distribution
Suppose we have two random variables X and Y defined on a sample space S with probability measure P(E), E ⊆ S. Jointly, they form a map (X, Y) : S → ℝ² where s ↦ (X(s), Y(s)).
From before, we know to define the marginal probability distributions P_X and P_Y by, for example,
    P_X(B) = P(X⁻¹(B)),  B ⊆ ℝ.
We now define the joint probability distribution P_XY by
    P_XY(B_X, B_Y) = P{X⁻¹(B_X) ∩ Y⁻¹(B_Y)},  B_X, B_Y ⊆ ℝ.
So P_XY(B_X, B_Y), the probability that X ∈ B_X and Y ∈ B_Y, is given by the probability P of the set of all points in the sample space that get mapped both into B_X by X and into B_Y by Y.
More generally, for a single region B_XY ⊆ ℝ², find the collection of sample space elements
    S_{B_XY} = {s ∈ S | (X(s), Y(s)) ∈ B_XY}
and define
    P_XY(B_XY) = P(S_{B_XY}).
1
1.2 Joint cdfs
Joint Cumulative Distribution Function
We thus define the joint cumulative distribution function as
    F_XY(x, y) = P_XY((−∞, x], (−∞, y]),  x, y ∈ ℝ.
It is easy to check that the marginal cdfs for X and Y can be recovered by
    F_X(x) = F_XY(x, ∞),  x ∈ ℝ,
    F_Y(y) = F_XY(∞, y),  y ∈ ℝ,
and that the two definitions will agree.
Properties of a Joint cdf
For F_XY to be a valid cdf, we need to make sure the following conditions hold.
1. 0 ≤ F_XY(x, y) ≤ 1, ∀x, y ∈ ℝ;
2. Monotonicity: ∀x₁, x₂, y₁, y₂ ∈ ℝ,
   x₁ < x₂ ⟹ F_XY(x₁, y₁) ≤ F_XY(x₂, y₁) and y₁ < y₂ ⟹ F_XY(x₁, y₁) ≤ F_XY(x₁, y₂);
3. ∀x, y ∈ ℝ,
   F_XY(x, −∞) = 0, F_XY(−∞, y) = 0 and F_XY(∞, ∞) = 1.
Interval Probabilities
Suppose we are interested in whether the random variable pair (X, Y) lie in the interval cross product (x₁, x₂] × (y₁, y₂]; that is, if x₁ < X ≤ x₂ and y₁ < Y ≤ y₂.
[Figure: the rectangle (x₁, x₂] × (y₁, y₂] in the (X, Y) plane.]
First note that P_XY((x₁, x₂], (−∞, y]) = F_XY(x₂, y) − F_XY(x₁, y).
It is then easy to see that P_XY((x₁, x₂], (y₁, y₂]) is given by
    F_XY(x₂, y₂) − F_XY(x₁, y₂) − F_XY(x₂, y₁) + F_XY(x₁, y₁).
1.3 Joint Probability Mass Functions
If X and Y are both discrete random variables, then we can define the joint probability mass function as
    p_XY(x, y) = P_XY({x}, {y}),  x, y ∈ ℝ.
We can recover the marginal pmfs p_X and p_Y since, by the law of total probability, ∀x, y ∈ ℝ
    p_X(x) = ∑_y p_XY(x, y),  p_Y(y) = ∑_x p_XY(x, y).
Properties of a Joint pmf
For p_XY to be a valid pmf, we need to make sure the following conditions hold.
1. 0 ≤ p_XY(x, y) ≤ 1, ∀x, y ∈ ℝ;
2. ∑_y ∑_x p_XY(x, y) = 1.
1.4 Joint Probability Density Functions
On the other hand, if ∃ f_XY : ℝ×ℝ → ℝ s.t.
    P_XY(B_XY) = ∫∫_{(x,y)∈B_XY} f_XY(x, y) dx dy,  ∀B_XY ⊆ ℝ×ℝ,
then we say X and Y are jointly continuous and we refer to f_XY as the joint probability density function of X and Y.
In this case, we have
    F_XY(x, y) = ∫_{t=−∞}^{y} ∫_{s=−∞}^{x} f_XY(s, t) ds dt,  x, y ∈ ℝ.
By the Fundamental Theorem of Calculus we can identify the joint pdf as
    f_XY(x, y) = ∂²/∂x∂y F_XY(x, y).
Furthermore, we can recover the marginal densities f_X and f_Y:
    f_X(x) = d/dx F_X(x) = d/dx F_XY(x, ∞) = d/dx ∫_{y=−∞}^{∞} ∫_{s=−∞}^{x} f_XY(s, y) ds dy.
By the Fundamental Theorem of Calculus, and through a symmetric argument for Y, we thus get
    f_X(x) = ∫_{y=−∞}^{∞} f_XY(x, y) dy,  f_Y(y) = ∫_{x=−∞}^{∞} f_XY(x, y) dx.
Properties of a Joint pdf
For f_XY to be a valid pdf, we need to make sure the following conditions hold.
1. f_XY(x, y) ≥ 0, ∀x, y ∈ ℝ;
2. ∫_{y=−∞}^{∞} ∫_{x=−∞}^{∞} f_XY(x, y) dx dy = 1.
3
2 Independence and Expectation
2.1 Independence
Independence of Random Variables
Two random variables X and Y are independent iff ∀B_X, B_Y ⊆ ℝ,
    P_XY(B_X, B_Y) = P_X(B_X) P_Y(B_Y).
More specifically, two discrete random variables X and Y are independent iff
    p_XY(x, y) = p_X(x) p_Y(y),  ∀x, y ∈ ℝ;
and two continuous random variables X and Y are independent iff
    f_XY(x, y) = f_X(x) f_Y(y),  ∀x, y ∈ ℝ.
Conditional Distributions
For two r.v.s X, Y we define the conditional probability distribution P_{Y|X} by
    P_{Y|X}(B_Y | B_X) = P_XY(B_X, B_Y) / P_X(B_X),  B_X, B_Y ⊆ ℝ, P_X(B_X) ≠ 0.
This is the revised probability of Y falling inside B_Y given that we now know X ∈ B_X.
Then we have X and Y are independent ⟺ P_{Y|X}(B_Y | B_X) = P_Y(B_Y), ∀B_X, B_Y ⊆ ℝ.
For discrete r.v.s X, Y we define the conditional probability mass function p_{Y|X} by
    p_{Y|X}(y|x) = p_XY(x, y) / p_X(x),  x, y ∈ ℝ, p_X(x) ≠ 0,
and for continuous r.v.s X, Y we define the conditional probability density function f_{Y|X} by
    f_{Y|X}(y|x) = f_XY(x, y) / f_X(x),  x, y ∈ ℝ.
[Justification:
    P(Y ≤ y | X ∈ [x, x + dx)) = P_XY([x, x + dx), (−∞, y]) / P_X([x, x + dx))
        = {F(y, x + dx) − F(y, x)} / {F(x + dx) − F(x)}
        = [{F(y, x + dx) − F(y, x)}/dx] / [{F(x + dx) − F(x)}/dx]
⟹ f(y | X ∈ [x, x + dx)) = [∂{F(y, x + dx) − F(y, x)}/∂y∂x] / [{F(x + dx) − F(x)}/dx]
⟹ f(y|x) = lim_{dx→0} f(y | X ∈ [x, x + dx)) = f(y, x)/f(x).]
In either case, X and Y are independent ⟺ p_{Y|X}(y|x) = p_Y(y) or f_{Y|X}(y|x) = f_Y(y), ∀x, y ∈ ℝ.
4
2.2 Expectation
E{g(X, Y)}
Suppose we have a bivariate function of interest of the random variables X and Y, g : ℝ×ℝ → ℝ.
If X and Y are discrete, we define E{g(X, Y)} by
    E_XY{g(X, Y)} = ∑_y ∑_x g(x, y) p_XY(x, y).
If X and Y are jointly continuous, we define E{g(X, Y)} by
    E_XY{g(X, Y)} = ∫_{y=−∞}^{∞} ∫_{x=−∞}^{∞} g(x, y) f_XY(x, y) dx dy.
Immediately from these definitions we have the following:
If g(X, Y) = g₁(X) + g₂(Y),
    E_XY{g₁(X) + g₂(Y)} = E_X{g₁(X)} + E_Y{g₂(Y)}.
If g(X, Y) = g₁(X)g₂(Y) and X and Y are independent,
    E_XY{g₁(X)g₂(Y)} = E_X{g₁(X)} E_Y{g₂(Y)}.
In particular, considering g(X, Y) = XY for independent X, Y we have
    E_XY(XY) = E_X(X) E_Y(Y).
2.3 Conditional Expectation
E_{Y|X}(Y | X = x)
In general E_XY(XY) ≠ E_X(X) E_Y(Y).
Suppose X and Y are discrete r.v.s with joint pmf p(x, y). If we are given the value x of the r.v. X, our revised pmf for Y is the conditional pmf p(y|x), for y ∈ ℝ.
The conditional expectation of Y given X = x is therefore
    E_{Y|X}(Y | X = x) = ∑_y y p(y|x).
Similarly, if X and Y were continuous,
    E_{Y|X}(Y | X = x) = ∫_{y=−∞}^{∞} y f(y|x) dy.
In either case, the conditional expectation is a function of x but not the unknown Y.
5
2.4 Covariance and Correlation

Covariance

For a single variable X we considered the expectation of g(X) = (X - \mu_X)(X - \mu_X), called the variance and denoted \sigma^2_X.

The bivariate extension of this is the expectation of g(X, Y) = (X - \mu_X)(Y - \mu_Y). We define the covariance of X and Y by

\sigma_{XY} = Cov(X, Y) = E_{XY}[(X - \mu_X)(Y - \mu_Y)].

Correlation

Covariance measures how the random variables move in tandem with one another, and so is closely related to the idea of correlation.

We define the correlation of X and Y by

\rho_{XY} = Cor(X, Y) = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}.

Unlike the covariance, the correlation is invariant to the scale of the r.v.s X and Y.

It is easily shown that if X and Y are independent random variables, then \sigma_{XY} = \rho_{XY} = 0.
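The sample analogues of these quantities are easy to compute. The fragment below (Python with numpy; the simulated variables are illustrative) estimates covariance and correlation for a dependent pair and an independent pair.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100_000
    x = rng.normal(size=n)
    y_dep = 2 * x + rng.normal(size=n)   # built from x: correlation should be well away from 0
    y_ind = rng.normal(size=n)           # generated independently of x: correlation near 0

    print(np.cov(x, y_dep)[0, 1], np.corrcoef(x, y_dep)[0, 1])
    print(np.cov(x, y_ind)[0, 1], np.corrcoef(x, y_ind)[0, 1])

Note that zero correlation does not in general imply independence; the implication in the notes runs the other way.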
3 Examples

Example 1

Suppose that the lifetime, X, and brightness, Y, of a light bulb are modelled as continuous random variables. Let their joint pdf be given by

f(x, y) = \lambda_1 \lambda_2 e^{-\lambda_1 x - \lambda_2 y}, \quad x, y > 0.

Are lifetime and brightness independent?

The marginal pdf for X is

f(x) = \int_{y=-\infty}^{\infty} f(x, y) \, dy = \int_{y=0}^{\infty} \lambda_1 \lambda_2 e^{-\lambda_1 x - \lambda_2 y} \, dy = \lambda_1 e^{-\lambda_1 x}.

Similarly f(y) = \lambda_2 e^{-\lambda_2 y}. Hence f(x, y) = f(x) f(y) and X and Y are independent.
Example 2

Suppose continuous r.v.s (X, Y) \in R^2 have joint pdf

f(x, y) = \begin{cases} \frac{1}{\pi}, & x^2 + y^2 \le 1, \\ 0, & \text{otherwise.} \end{cases}

Determine the marginal pdfs for X and Y.

Well, x^2 + y^2 \le 1 \iff |y| \le \sqrt{1 - x^2}. So

f(x) = \int_{y=-\infty}^{\infty} f(x, y) \, dy = \int_{y=-\sqrt{1-x^2}}^{\sqrt{1-x^2}} \frac{1}{\pi} \, dy = \frac{2}{\pi}\sqrt{1 - x^2}.

Similarly f(y) = \frac{2}{\pi}\sqrt{1 - y^2}. Hence f(x, y) \neq f(x) f(y) and X and Y are not independent.
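A numerical check of this marginal is straightforward. The sketch below (Python with scipy; the chosen value x = 0.3 is arbitrary) integrates the joint density over y at a fixed x and compares with 2/π √(1 − x²).

    import numpy as np
    from scipy.integrate import quad

    def f_xy(y, x):
        # joint pdf of the uniform distribution on the unit disc
        return 1.0 / np.pi if x**2 + y**2 <= 1 else 0.0

    x = 0.3
    half_width = np.sqrt(1 - x**2)
    numeric, _ = quad(f_xy, -half_width, half_width, args=(x,))
    print(numeric, 2 / np.pi * np.sqrt(1 - x**2))   # the two values should agree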
Estimation

COMP 245 STATISTICS

Dr N A Heard

Contents

1 Parameter Estimation
  1.1 Introduction
  1.2 Estimators

2 Point Estimates
  2.1 Introduction
  2.2 Bias, Efficiency and Consistency
  2.3 Maximum Likelihood Estimation

3 Confidence Intervals
  3.1 Introduction
  3.2 Normal Distribution with Known Variance
  3.3 Normal Distribution with Unknown Variance
1 Parameter Estimation

1.1 Introduction

In statistics we typically analyse a set of data by considering it as a random sample from a larger, underlying population about which we wish to make inference.

1. The chapter on numerical summaries considered various summary sample statistics for describing a particular sample of data. We defined quantities such as the sample mean x̄, and sample variance s².

2. The chapters on random variables, on the other hand, were concerned with characterising the underlying population. We defined corresponding population parameters such as the population mean E(X), and population variance Var(X).

We noticed a duality between the two sets of definitions of statistics and parameters.

In particular, we saw that they were equivalent in the extreme circumstance that our sample exactly represented the entire population (so, for example, the cdf of a new randomly drawn member of the population is precisely the empirical cdf of our sample).

Away from this extreme circumstance, the sample statistics can be seen to give approximate values for the corresponding population parameters. We can use them as estimates.

For convenient modelling of populations (point 2), we met several simple parameterised probability distributions (e.g. Poi(λ), Exp(λ), U(a, b), N(μ, σ²)). There, population parameters such as mean and variance are functions of the distribution parameters. So more generally, we may wish to use the data, or just their sample statistics, to estimate distribution parameters.

For a sample of data x = (x_1, . . . , x_n), we can consider these observed values as realisations of corresponding random variables X = (X_1, . . . , X_n).

If the underlying population (from which the sample has been drawn) is such that the distribution of a single random draw X has probability distribution P_{X|θ}(·|θ), where θ is a generic parameter or vector of parameters, we typically then assume that our n data point random variables X are i.i.d. P_{X|θ}(·|θ).

Note that P_{X|θ}(·|θ) is the conditional distribution for draws from our model for the population given the true (but unknown) parameter values θ.
1.2 Estimators

Statistics, Estimators and Estimates

Consider a sequence of random variables X = (X_1, . . . , X_n) corresponding to n i.i.d. data samples to be drawn from a population with distribution P_X. Let x = (x_1, . . . , x_n) be the corresponding realised values we observe for these r.v.s.

A statistic is a function T : Rⁿ → R^p applied to the random variable X. Note that T(X) = T(X_1, . . . , X_n) is itself a random variable. For example, X̄ = Σ_{i=1}^{n} X_i / n is a statistic.

The corresponding realised value of a statistic T is written t = t(x) (e.g. t = x̄).

If a statistic T(X) is to be used to approximate parameters θ of the distribution P_{X|θ}(·|θ), we say T is an estimator for those parameters; we call the actual realised value of the estimator for a particular data sample, t(x), an estimate.
2 Point Estimates

2.1 Introduction

Definition

A point estimate is a statistic estimating a single parameter or characteristic of a distribution.

For a running example which we will return to, consider a sample of data (x_1, . . . , x_n) from an Exponential(λ) distribution with unknown λ; we might construct a point estimate for either λ itself, or perhaps for the mean of the distribution (= 1/λ), or the variance (= 1/λ²).

Concentrating on the mean of the distribution in this example, a natural estimator of this could be the sample mean, X̄. But alternatively, we could propose simply the first data point we observe, X_1, as our point estimator; or, if the data had been given to us already ordered we might (lazily) suggest the sample median, X_({n+1}/2).

How do we quantify which estimator is better?
Sampling Distribution

Suppose for a moment we actually knew the parameter values θ of our population distribution P_{X|θ}(·|θ) (so suppose we know λ in our exponential example).

Then since our sampled data are considered to be i.i.d. realisations from this distribution (so each X_i ~ Exp(λ) in that example), it follows that any statistic T = T(X_1, . . . , X_n) is also a random variable with some distribution which also only depends on these parameters.

If we are able to (approximately) identify this sampling distribution of our statistic, call it P_{T|θ}, we can then find the conditional expectation, variance, etc. of our statistic.

Sometimes P_{T|θ} will have a convenient closed-form expression which we can derive, but in other situations it will not.

For those other situations, for the special case where our statistic T is the sample mean, then provided that our sample size n is large, we use the CLT to give us an approximate distribution for P_{T|θ}. Whatever the form of the population distribution P_{X|θ}, we know from the CLT that approximately

\bar{X} \sim N(E[X], Var[X]/n).

For our X_i ~ Exp(λ) example, it can be shown that the statistic T = X̄ is a continuous random variable with pdf

f_{T|\lambda}(t|\lambda) = \frac{(n\lambda)^n t^{n-1} e^{-n\lambda t}}{(n-1)!}, \quad t > 0.

This is actually the pdf of a Gamma(n, nλ) random variable, a well known continuous distribution, so T ~ Gamma(n, nλ).

So using the fact that Gamma(α, β) has expectation α/β, here we have

E(\bar{X}) = E_{T|\lambda}(T|\lambda) = \frac{n}{n\lambda} = \frac{1}{\lambda} = E(X).

So the expected value of X̄ is the true population mean.
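This sampling distribution can also be seen empirically. The following Python sketch (numpy/scipy assumed; λ, n and the number of replicates are arbitrary illustrative choices) simulates many sample means of n exponential observations and compares their empirical distribution with Gamma(n, nλ).

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    lam, n, reps = 2.0, 20, 50_000
    means = rng.exponential(scale=1/lam, size=(reps, n)).mean(axis=1)  # realisations of X-bar

    print(means.mean(), 1/lam)   # empirical mean of X-bar vs theoretical E(X-bar) = 1/lambda
    # compare the empirical cdf of X-bar with the Gamma(n, n*lambda) cdf at a few points
    for t_val in (0.3, 0.5, 0.7):
        print((means <= t_val).mean(), stats.gamma.cdf(t_val, a=n, scale=1/(n*lam)))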
2.2 Bias, Efficiency and Consistency

Bias

The previous result suggests that X̄ is, at least in one respect, a good statistic for estimating the unknown mean of an exponential distribution.

Formally, we define the bias of an estimator T for a parameter θ by

bias(T) = E[T|θ] − θ.

If an estimator has zero bias we say that estimator is unbiased. So in our example, X̄ gives an unbiased estimate of μ = 1/λ, the mean of an exponential distribution.

[In contrast, the sample median is a biased estimator of the mean of an exponential distribution. For example if n = 3, it can be shown that E(X_({n+1}/2)) = 5/(6λ).]
In fact, the unbiasedness of X̄ is true for any distribution; the sample mean x̄ will always be an unbiased estimate for the population mean μ:

E(\bar{X}) = E\left(\frac{\sum_{i=1}^{n} X_i}{n}\right) = \frac{\sum_{i=1}^{n} E(X_i)}{n} = \frac{n\mu}{n} = \mu.

Similarly, there is an estimator for the population variance σ² which is unbiased, irrespective of the population distribution. Disappointingly, this is not the sample variance

S^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2,

as this has one too many degrees of freedom. (Note that if we knew the population mean μ, then \frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)^2 would be unbiased for σ².)
Bias-Corrected Sample Variance

However, we can instead define the bias-corrected sample variance,

S^2_{n-1} = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2,

which is then always an unbiased estimator of the population variance σ².

Warning: Because of its usefulness as an unbiased estimate of σ², many statistical text books and software packages (and indeed your formula sheet for the exam!) refer to s^2_{n-1} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 as the sample variance.
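This is exactly the distinction controlled by the ddof argument of numpy's variance function. A short illustrative check (synthetic data, numpy assumed):

    import numpy as np

    x = np.array([4.1, 5.3, 2.8, 6.0, 4.7])   # small synthetic sample
    n = len(x)

    s2_biased = np.var(x)             # divides by n      (the plain sample variance S^2)
    s2_corrected = np.var(x, ddof=1)  # divides by n - 1  (bias-corrected S^2_{n-1})

    print(s2_biased, s2_corrected, s2_biased * n / (n - 1))   # the last two values agree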
Efficiency

Suppose we have two unbiased estimators for a parameter θ, which we will call θ̂(X) and θ̃(X). And again suppose we have the corresponding sampling distributions for these estimators, P_{θ̂|θ} and P_{θ̃|θ}, and so can calculate their means, variances, etc.

Then we say θ̂ is more efficient than θ̃ if:

1. ∀θ, Var_{θ̂|θ}(θ̂) ≤ Var_{θ̃|θ}(θ̃);

2. ∃θ s.t. Var_{θ̂|θ}(θ̂) < Var_{θ̃|θ}(θ̃).

That is, the variance of θ̂ is never higher than that of θ̃, no matter what the true value of θ is; and for some value of θ, θ̂ has a strictly lower variance than θ̃.

If θ̂ is more efficient than any other possible estimator, we say θ̂ is efficient.
Example

Suppose we have a population with mean μ and variance σ², from which we are to obtain a random sample X_1, . . . , X_n.

Consider two estimators for μ: μ̂ = X̄, the sample mean, and μ̃ = X_1, the first observation in the sample.

Well, we have seen E(X̄) = μ always, and certainly E(X_1) = μ, so both estimators are unbiased.

We also know Var(X̄) = σ²/n, and of course Var(X_1) = σ², independent of μ. So if n ≥ 2, μ̂ is more efficient than μ̃ as an estimator of μ.
Consistency

In the previous example, the worst aspect of the estimate μ̃ = X_1 is that it does not change, let alone improve, no matter how large a sample n of data is collected. In contrast, the variance of μ̂ = X̄ gets smaller and smaller as n increases.

Technically, we say an estimator θ̂ is a consistent estimator for the parameter θ if θ̂ converges in probability to θ. That is, ∀ε > 0, P(|θ̂ − θ| > ε) → 0 as n → ∞.

This is hard to demonstrate, but if θ̂ is unbiased we do have:

lim_{n→∞} Var(θ̂) = 0 \Rightarrow θ̂ is consistent.

So returning to our example, we see X̄ is a consistent estimator of μ for any underlying population.
2.3 Maximum Likelihood Estimation

Deriving an Estimator

We have met different criteria for measuring the relative quality of rival estimators, but no principled manner for deriving these estimates.

There are many ways of deriving estimators, but we shall concentrate on just one - maximum likelihood estimation.

Likelihood

If the underlying population is a discrete distribution with an unknown parameter θ, then each of the samples X_i are i.i.d. with probability mass function p_{X|θ}(x_i).

Since the n data samples are independent, the joint probability of all of the data, x = (x_1, . . . , x_n), is

L(\theta|x) = \prod_{i=1}^{n} p_{X|\theta}(x_i).

The function L(θ|x) is called the likelihood function and is considered as a function of the parameter θ for a fixed sample of data x. L(θ|x) is the probability of observing the data we have, x, if the true parameter were θ.

If, on the other hand, the underlying population is a continuous distribution with an unknown parameter θ, then each of the samples X_i are i.i.d. with probability density function f_{X|θ}(x_i), and the likelihood function is defined by

L(\theta|x) = \prod_{i=1}^{n} f_{X|\theta}(x_i).
Maximising the Likelihood

Clearly, for a fixed set of data, varying the population parameter θ would give different probabilities of observing these data, and hence different likelihoods.

Maximum likelihood estimation seeks to find the parameter value θ̂_MLE which maximises the likelihood function,

\hat{\theta}_{MLE} = \arg\sup_{\theta} L(\theta|x).

This value θ̂_MLE is known as the maximum likelihood estimate (MLE).

Log-Likelihood Function

For maximising the likelihood function, it often proves more convenient to consider the log-likelihood, ℓ(θ|x) = log{L(θ|x)}. Since log(·) is a monotonic increasing function, the argument maximising ℓ maximises L.

The log-likelihood function can be written as

\ell(\theta|x) = \sum_{i=1}^{n} \log\{p_{X|\theta}(x_i)\} \quad \text{or} \quad \ell(\theta|x) = \sum_{i=1}^{n} \log\{f_{X|\theta}(x_i)\},

for discrete and continuous distributions respectively.

In either case, solving \frac{\partial \ell}{\partial \theta}(\hat{\theta}) = 0 yields the MLE θ̂ if \frac{\partial^2 \ell}{\partial \theta^2}(\hat{\theta}) < 0.
MLE Example: Binomial(10, p)

Suppose we have a population which is Binomial(10, p), where the probability parameter p is unknown.

Then suppose we draw a sample of size 100 from the population; that is, we observe 100 independent Binomial(10, p) random variables X_1, X_2, . . . , X_100, and obtain the data summarised in the table below.

x      0   1   2   3   4   5   6   7   8   9   10
Freq.  2  16  35  22  21   3   1   0   0   0   0

Each of our Binomial(10, p) samples X_i has pmf

p_X(x_i) = \binom{10}{x_i} p^{x_i} (1-p)^{10-x_i}, \quad i = 1, 2, . . . , 100.

Since the n = 100 data samples are assumed independent, the likelihood function for p for all of the data is

L(p) = \prod_{i=1}^{n} p_X(x_i) = \prod_{i=1}^{n} \left[ \binom{10}{x_i} p^{x_i} (1-p)^{10-x_i} \right] = \left[ \prod_{i=1}^{n} \binom{10}{x_i} \right] p^{\sum_{i=1}^{n} x_i} (1-p)^{10n - \sum_{i=1}^{n} x_i}.
So the log-likelihood is given by

\ell(p) = \log\left[ \prod_{i=1}^{n} \binom{10}{x_i} \right] + \log(p) \sum_{i=1}^{n} x_i + \log(1-p) \left( 10n - \sum_{i=1}^{n} x_i \right).

Notice the first term in this equation is constant wrt p, so finding the value of p maximising ℓ(p) is equivalent to finding p maximising

\log(p) \sum_{i=1}^{n} x_i + \log(1-p) \left( 10n - \sum_{i=1}^{n} x_i \right),

and this becomes apparent when we differentiate wrt p, with the constant term differentiating to zero:

\frac{\partial \ell}{\partial p}(p) = 0 + \frac{\sum_{i=1}^{n} x_i}{p} - \frac{10n - \sum_{i=1}^{n} x_i}{1-p}.
Setting this derivative equal to zero, we get

\frac{\sum_{i=1}^{n} x_i}{p} - \frac{10n - \sum_{i=1}^{n} x_i}{1-p} = 0

\iff (1-p) \sum_{i=1}^{n} x_i = p \left( 10n - \sum_{i=1}^{n} x_i \right)

\iff \sum_{i=1}^{n} x_i = p \left( 10n - \sum_{i=1}^{n} x_i + \sum_{i=1}^{n} x_i \right)

\iff p = \frac{\sum_{i=1}^{n} x_i}{10n} = \frac{\bar{x}}{10}.

To check this point is a maximum of ℓ, we require the second derivative

\frac{\partial^2 \ell}{\partial p^2}(p) = -\frac{\sum_{i=1}^{n} x_i}{p^2} - \frac{10n - \sum_{i=1}^{n} x_i}{(1-p)^2} = -\frac{n\bar{x}}{p^2} - \frac{10n - n\bar{x}}{(1-p)^2} = -n\left( \frac{\bar{x}}{p^2} + \frac{10 - \bar{x}}{(1-p)^2} \right)

(which is in fact < 0 ∀p; the likelihood is log concave).

Substituting p = x̄/10, this gives

-100n\left( \frac{1}{\bar{x}} + \frac{1}{10 - \bar{x}} \right) = \frac{-1000n}{(10 - \bar{x})\bar{x}},

which is clearly < 0. So the MLE for p is x̄/10 = 0.257.
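As a small illustrative computation (plain Python, with the frequency table above hard-coded), the MLE can be recovered directly from the data summary:

    # frequency table: number of samples taking each value 0..10
    freqs = [2, 16, 35, 22, 21, 3, 1, 0, 0, 0, 0]
    n = sum(freqs)                                   # 100 samples
    total = sum(x * f for x, f in enumerate(freqs))  # sum of all x_i = 257

    x_bar = total / n
    p_mle = x_bar / 10
    print(x_bar, p_mle)   # 2.57 and 0.257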
Second MLE Example: N(μ, σ²)

If X ~ N(μ, σ²), then

f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right).

So for an i.i.d. sample x = (x_1, . . . , x_n), the likelihood function for (μ, σ²) for all of the data is

L(\mu, \sigma^2) = \prod_{i=1}^{n} f_X(x_i) = (2\pi\sigma^2)^{-\frac{n}{2}} \exp\left( -\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{2\sigma^2} \right).

The log likelihood is

\ell(\mu, \sigma^2) = -\frac{n}{2} \log(2\pi) - n\log(\sigma) - \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{2\sigma^2}.
If we are interested in the MLE for μ, we can take the partial derivative wrt μ and set this equal to zero:

0 = \frac{\partial \ell}{\partial \mu}(\mu, \sigma^2) = \frac{\sum_{i=1}^{n} (x_i - \mu)}{\sigma^2}

\iff 0 = \sum_{i=1}^{n} (x_i - \mu) = \sum_{i=1}^{n} x_i - n\mu \iff \mu = \frac{\sum_{i=1}^{n} x_i}{n} = \bar{x}.

To check this is a maximum, we look at the second derivative,

\frac{\partial^2 \ell}{\partial \mu^2}(\mu, \sigma^2) = -\frac{n}{\sigma^2},

which is negative everywhere, so x̄ is the MLE for μ, independently from the value of σ².

CLT significance:

This result is particularly useful. We have already seen that X̄ is always an unbiased estimator for the population mean μ. And now, using the CLT, we also have that for large n, X̄ is always approximately the MLE for the population mean μ, irrespective of the distribution of X.
Properties of the MLE

So how good an estimator of θ is the MLE?

- The MLE is not necessarily unbiased.

+ For large n, the MLE is approximately normally distributed with mean θ.

+ The MLE is consistent.

+ The MLE is always asymptotically efficient, and if an efficient estimator exists, it is the MLE.

+ Because it is derived from the likelihood of the data, it is well-principled. This is the likelihood principle, which asserts that all the information about a parameter from a set of data is held in the likelihood.
3 Confidence Intervals

3.1 Introduction

Uncertainty of Estimates

In most circumstances, it will not be sufficient to report simply a point estimate θ̂ for an unknown parameter of interest. We would usually want to also quantify our degree of uncertainty in this estimate: if we were to repeat the whole data collection and parameter estimation procedure, how different an estimate might we have obtained?

If we were again to suppose we had knowledge of the true value of our unknown parameter θ, or at least had access to the (approximate) true sampling distribution of our statistic, P_{T|θ}, then the variance of this distribution would give such a measure.

The solution we consider is to plug our estimated value of θ, θ̂, into P_{T|θ} and hence use the (maybe further) approximated sampling distribution, P_{T|θ̂}.
Confidence Intervals

In particular, we know by the CLT that for any underlying distribution (mean μ, variance σ²) for our sample, the sample mean X̄ is approximately normally distributed with mean μ and variance σ²/n. We now further approximate this by imagining X̄ ~ N(x̄, σ²/n).

Then, if we knew the true population variance σ², we would be able to say that had the true mean parameter been x̄, then for large n, from our standard normal tables, with 95% probability we would have observed our statistic X̄ within the interval

\left( \bar{x} - 1.96\frac{\sigma}{\sqrt{n}}, \; \bar{x} + 1.96\frac{\sigma}{\sqrt{n}} \right).

This is known as the 95% confidence interval for μ.

More generally, for any desired coverage probability level 1 − α we can define the 100(1 − α)% confidence interval for μ by

\left( \bar{x} - z_{1-\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}, \; \bar{x} + z_{1-\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} \right),

where z_α is the α-quantile of the standard normal (so before we used α = 0.05 and hence z_{0.975} to obtain our 95% C.I.).

Loose interpretation: Amongst all the possible intervals x̄ ± z_{1−α/2} σ/√n we might have observed (that is, from samples we might have drawn), 95% would have contained the unknown true parameter value μ.
Example

A corporation conducts a survey to investigate the proportion of employees who thought the board was doing a good job. 1000 employees, randomly selected, were asked, and 732 said they did. Find a 99% confidence interval for the value of the proportion in the population who thought the board was doing a good job.

Clearly each observation X_i ~ Bernoulli(p) for some unknown p, and we want to find a C.I. for p, which is also the mean of X_i.

We have our estimate p̂ = x̄ = 0.732, for which we can use the CLT. Since the variance of Bernoulli(p) is p(1 − p), we can use x̄(1 − x̄) = 0.196 as an approximate variance.

So an approximate 99% C.I. is

\left( 0.732 - 2.576\sqrt{\frac{0.196}{1000}}, \; 0.732 + 2.576\sqrt{\frac{0.196}{1000}} \right).
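Evaluating this interval numerically takes only a few lines; a brief Python sketch (scipy used solely for the normal quantile) is given below.

    from scipy.stats import norm
    import math

    p_hat, n = 0.732, 1000
    z = norm.ppf(0.995)                             # 99% interval uses the 0.995 quantile (about 2.576)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    print(p_hat - half_width, p_hat + half_width)   # roughly (0.696, 0.768)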
3.2 Normal Distribution with Known Variance

Confidence Intervals for N(μ, σ²) Samples - σ² Known

The CI given in the Bernoulli(p) example was only an approximate interval, relying on the Central Limit Theorem, and also assuming the population variance σ² was known.

However, if we in fact know that X_1, . . . , X_n are an i.i.d. sample from N(μ, σ²), then we have exactly

\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right), \quad \text{or} \quad \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).

In which case,

\left( \bar{x} - z_{1-\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}, \; \bar{x} + z_{1-\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} \right)

is an exact confidence interval for μ, assuming we know σ².
3.3 Normal Distribution with Unknown Variance

Confidence Intervals for N(μ, σ²) Samples - σ² Unknown

However, in any applied example where we are aiming to fit a normal distribution model to real data, it will usually be the case that both μ and σ² are unknown.

However, if again we have X_1, . . . , X_n as an i.i.d. sample from N(μ, σ²) but with σ² now unknown, then we have exactly

\frac{\bar{X} - \mu}{s_{n-1}/\sqrt{n}} \sim t_{n-1},

where s_{n-1} = \sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}} is the bias-corrected sample standard deviation, and t_ν is the Student's t distribution with ν degrees of freedom.

Then it follows that an exact 100(1 − α)% confidence interval for μ is

\left( \bar{x} - t_{n-1,1-\frac{\alpha}{2}}\frac{s_{n-1}}{\sqrt{n}}, \; \bar{x} + t_{n-1,1-\frac{\alpha}{2}}\frac{s_{n-1}}{\sqrt{n}} \right),

where t_{ν,α} is the α-quantile of t_ν.
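For a concrete illustration, a short Python sketch (scipy assumed; the summary statistics here are invented purely for the example) computes such an interval from a sample mean and bias-corrected standard deviation:

    from scipy.stats import t
    import math

    n, x_bar, s = 25, 10.4, 2.1                   # illustrative summary statistics
    alpha = 0.05
    t_crit = t.ppf(1 - alpha / 2, df=n - 1)       # t_{n-1, 1-alpha/2}
    half_width = t_crit * s / math.sqrt(n)
    print(x_bar - half_width, x_bar + half_width)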
[Figure: pdfs of t_ν for ν = 1, 5, 10, 30, plotted together with the N(0, 1) pdf over x ∈ (−3, 3); the t densities have heavier tails, approaching the normal curve as ν increases.]
Notes:

- t_ν is heavier tailed than N(0, 1) for any number of degrees of freedom ν.

- Hence the t distribution CI will always be wider than the Normal distribution CI. So if we know σ², we should use it.

- lim_{ν→∞} t_ν = N(0, 1).

- For ν > 40 the difference between t_ν and N(0, 1) is so insignificant that the t distribution is not tabulated beyond this many degrees of freedom, and so there we can instead revert to N(0, 1) tables.
Table of t_{ν,α}

ν    α=0.95  α=0.975  α=0.99  α=0.995  |  ν    α=0.95  α=0.975  α=0.99  α=0.995
1    6.31    12.71    31.82   63.66    |  9    1.83    2.26     2.82    3.25
2    2.92    4.30     6.96    9.92     |  10   1.81    2.23     2.76    3.17
3    2.35    3.18     4.54    5.84     |  12   1.78    2.18     2.68    3.05
4    2.13    2.78     3.75    4.60     |  15   1.75    2.13     2.60    2.95
5    2.02    2.57     3.36    4.03     |  20   1.72    2.09     2.53    2.85
6    1.94    2.45     3.14    3.71     |  25   1.71    2.06     2.48    2.78
7    1.89    2.36     3.00    3.50     |  40   1.68    2.02     2.42    2.70
8    1.86    2.31     2.90    3.36     |  ∞    1.645   1.96     2.326   2.576
Example

A random sample of 100 observations from a normally distributed population has sample mean 83.2 and bias-corrected sample standard deviation of 6.4.

1. Find a 95% confidence interval for μ.

2. Give an interpretation for this interval.

Solution:

1. An exact 95% confidence interval would be given by x̄ ± t_{99,0.975} s_{n−1}/√100. Since n = 100 is large, we can approximate this by

\bar{x} \pm z_{0.975}\frac{s_{n-1}}{\sqrt{100}} = 83.2 \pm 1.96 \times \frac{6.4}{10} = [81.95, 84.45].

2. With 95% confidence, we can say that the population mean μ lies between 81.95 and 84.45.
COMP 245 Statistics

Exercises 6 - Estimation

1. If (X_1, . . . , X_n) are a random sample from an exponential distribution with rate parameter λ, find the maximum likelihood estimate for λ.

2. Derive the maximum likelihood estimate for λ for n independent samples from Poisson(λ).

3. In a study of traffic congestion, data were collected on the number of occupants in private cars on a certain road. These data, collected for 1469 cars, are given below (the final column records cars with 6 or more occupants).

Count      1    2    3    4   5   6+
Frequency  902  403  106  38  16  4

One theory suggests that these data may have arisen from a modified geometric distribution, in which the probability that there are x occupants in a car is

p(x) = p(1-p)^{x-1}, \quad x = 1, 2, . . . .

(a) Find the maximum likelihood estimate of the parameter p of the geometric distribution for these data. (Note that P(X ≥ x) = (1 − p)^{x−1}.)

(b) [To be attempted after the lectures on hypothesis testing] Describe how a hypothesis test could be carried out, at the 1% level, to see if these data do come from a geometric distribution.
4. (a) For a random sample of size n from a normal distribution with unknown mean μ and known variance σ², what is the confidence level for each of the following confidence limits for μ?

i. x̄ ± 1.96 σ/√n;
ii. x̄ ± 1.645 σ/√n;
iii. x̄ ± 2.575 σ/√n;
iv. x̄ ± 0.99 σ/√n.

(b) A random sample of 64 observations from a population produced the following summary statistics: Σ_i x_i = 700, Σ_i (x_i − x̄)² = 4238. Find a 95% confidence interval for μ, and interpret this interval.

5. Compute confidence intervals at the 95% level for the means of the distributions from which the following sample values were obtained:

(a) n = 100, Σ_i x_i = 250, Σ_i x_i² = 725000;
(b) n = 100, x̄ = 83.2, s_{n−1} = 6.4.

6. The following random sample was selected from a normal distribution:

7.53, 4.35, 7.66, 7.54, 5.83, 1.92, 3.14, 4.41

(a) Construct a 90% confidence interval for the population mean.
(b) Construct a 99% confidence interval for the population mean.
COMP 245 Statistics

Solutions 6 - Estimation

1. The pdf for each sample X_i is given by f(x_i) = λe^{−λx_i}, and hence the log-likelihood function is

\ell(\lambda) = \sum_{i=1}^{n} \log\{f(x_i)\} = \sum_{i=1}^{n} \{\log(\lambda) - \lambda x_i\} = n\log(\lambda) - \lambda \sum_{i=1}^{n} x_i.

To find the MLE for λ, we calculate the derivative of ℓ(λ) wrt λ,

\frac{d\ell}{d\lambda}(\lambda) = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i,

which is zero when λ = n/Σ_{i=1}^{n} x_i = 1/x̄. To check this is a maximum, we examine the second derivative of ℓ,

\frac{d^2\ell}{d\lambda^2}(\lambda) = -\frac{n}{\lambda^2},

which is in fact negative for any positive parameter λ, so λ̂ = 1/x̄ is the MLE.
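One way to sanity-check this answer is to maximise the log-likelihood numerically. The sketch below (Python with numpy/scipy; the simulated data and true rate are arbitrary) should return something very close to 1/x̄.

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(3)
    x = rng.exponential(scale=0.5, size=200)      # simulated data, true rate lambda = 2

    def neg_log_lik(lam):
        # negative exponential log-likelihood: -(n log(lam) - lam * sum(x))
        return -(len(x) * np.log(lam) - lam * x.sum())

    result = minimize_scalar(neg_log_lik, bounds=(1e-6, 100), method="bounded")
    print(result.x, 1 / x.mean())   # numerical maximiser vs closed-form MLE 1/x-bar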
2. Let (x_1, . . . , x_n) be the random sample from Poisson(λ). Then

L(\lambda) = \prod_{i=1}^{n} \frac{\lambda^{x_i} \exp(-\lambda)}{x_i!}

\Rightarrow \ell(\lambda) = \log(\lambda) \sum_{i=1}^{n} x_i - n\lambda - \sum_{i=1}^{n} \log(x_i!)

\Rightarrow \frac{d\ell(\lambda)}{d\lambda} = \frac{\sum_{i=1}^{n} x_i}{\lambda} - n \Rightarrow \hat{\lambda} = \frac{\sum_{i=1}^{n} x_i}{n}.

Since the second derivative

\frac{d^2\ell(\lambda)}{d\lambda^2} = -\frac{\sum_{i=1}^{n} x_i}{\lambda^2}

is negative (provided not all the samples are zero) for all λ, then clearly the MLE for λ is given by the sample mean. (The case when all of the samples are zero is also easy to check: simply note that the first derivative is then negative everywhere and hence the likelihood is a decreasing function of λ.)
3. (a) Let X be the number of individuals in a car. Then for x ∈ {1, 2, 3, 4, 5} we have a contribution to the likelihood given by the pmf p_x = p(x) = p(1 − p)^{x−1}; on the other hand, for the data for which we only know x ≥ 6, these are observed with probability p_6 = P(X ≥ 6) = (1 − p)^{6−1}. Combining all this together, we get a likelihood function for all the data

L(p) = \prod_{i=1}^{6} p_i^{n_i} = p(1)^{n_1} p(2)^{n_2} p(3)^{n_3} p(4)^{n_4} p(5)^{n_5} P(X \ge 6)^{n_6}
     = \left[ \prod_{i=1}^{5} \{p(1-p)^{i-1}\}^{n_i} \right] (1-p)^{5 n_6}
     = p^{n_1+n_2+n_3+n_4+n_5} (1-p)^{n_2+2n_3+3n_4+4n_5+5n_6}

\Rightarrow \ell(p) = (n_1+n_2+n_3+n_4+n_5)\log(p) + (n_2+2n_3+3n_4+4n_5+5n_6)\log(1-p),

where n_1, . . . , n_5 are the number of times we observed 1, . . . , 5 people in a car, and n_6 is the number of times we observed at least 6 people in a car.

To find the maximum, we differentiate ℓ(p) wrt p and set equal to zero:

0 = \frac{d\ell}{dp}(p) = \frac{n_1+n_2+n_3+n_4+n_5}{p} - \frac{n_2+2n_3+3n_4+4n_5+5n_6}{1-p}

\iff (n_1+n_2+n_3+n_4+n_5)(1-p) = (n_2+2n_3+3n_4+4n_5+5n_6)p

\iff p = \frac{n_1+n_2+n_3+n_4+n_5}{n_1+2n_2+3n_3+4n_4+5n_5+5n_6}.

Substituting in the values of the {n_i} from our table, we get

\hat{p}_{MLE} = \frac{1465}{902 + 2\times403 + 3\times106 + 4\times38 + 5\times16 + 5\times4} = \frac{1465}{2278} = 0.643,

and this is a maximum since the second derivative of ℓ(p),

\frac{d^2\ell}{dp^2}(p) = -\frac{n_1+n_2+n_3+n_4+n_5}{p^2} - \frac{n_2+2n_3+3n_4+4n_5+5n_6}{(1-p)^2},

is negative everywhere.
(b) Substitute p = p̂_MLE = 0.643 into the bin probability functions p_i, i ∈ {1, . . . , 6}, from which the expected numbers are given as E_i = 1469 p_i. Using O_i = n_i, compute the χ² test statistic, and compare it with the 1% level of χ²(6 − 1 − 1).
4. (a) i. 95%. ii. 90%. iii. 99%. iv. 68%.

(b) 10.9375 ± 2.0094. We can be 95% confident that μ lies in this interval.

5. The sample size n = 100 is quite large, so we can use the Normal distribution as an approximation to the t, so in both cases the confidence limits are x̄ ± 1.96 s_{n−1}/√n.

(a) \bar{x} = \frac{250}{100} = 2.5, \quad s_{n-1} = \sqrt{\frac{\sum_i x_i^2 - n\bar{x}^2}{n-1}} = \sqrt{\frac{725000 - 100 \times 2.5^2}{99}} = 85.539. So the confidence limits are 2.5 ± 16.7656.

(b) 83.2 ± 1.254.

6. In both cases we use the formula x̄ ± t_{n−1,α} s_{n−1}/√n with n = 8.

(a) α = 0.95, C.I.: [3.83, 6.77].

(b) α = 0.995, C.I.: [2.59, 8.01].
Hypothesis Testing

COMP 245 STATISTICS

Dr N A Heard

Contents

1 Hypothesis Testing
  1.1 Introduction
  1.2 Error Rates and Power of a Test

2 Testing for a population mean
  2.1 Normal Distribution with Known Variance
    2.1.1 Duality with Confidence Intervals
  2.2 Normal Distribution with Unknown Variance

3 Testing for differences in population means
  3.1 Two Sample Problems
  3.2 Normal Distributions with Known Variances
  3.3 Normal Distributions with Unknown Variances

4 Goodness of Fit
  4.1 Count Data and Chi-Square Tests
  4.2 Proportions
  4.3 Model Checking
  4.4 Independence
1 Hypothesis Testing

1.1 Introduction

Hypotheses

Suppose we are going to obtain a random i.i.d. sample X = (X_1, . . . , X_n) of a random variable X with an unknown distribution P_X. To proceed with modelling the underlying population, we might hypothesise probability models for P_X and then test whether such hypotheses seem plausible in light of the realised data x = (x_1, . . . , x_n).

Or, more specifically, we might fix upon a parametric family P_{X|θ} with unknown parameter θ and then hypothesise values for θ. Generically let θ_0 denote a hypothesised value for θ. Then after observing the data, we wish to test whether we can indeed reasonably assume θ = θ_0.

For example, if X ~ N(μ, σ²) we may wish to test whether μ = 0 is plausible in light of the data x.

Formally, we define a null hypothesis H_0 as our hypothesised model of interest, and also specify an alternative hypothesis H_1 of rival models against which we wish to test H_0.

Most often we simply test H_0 : θ = θ_0 against H_1 : θ ≠ θ_0. This is known as a two-sided test. In some situations it may be more appropriate to consider alternatives of the form H_1 : θ > θ_0 or H_1 : θ < θ_0, known as one-sided tests.
Rejection Region for a Test Statistic

To test the validity of H_0, we first choose a test statistic T(X) of the data for which we can find the distribution, P_T, under H_0.

Then, we identify a rejection region R ⊂ ℝ of low probability values of T under the assumption that H_0 is true, so

P(T ∈ R | H_0) = α

for some small probability α (typically 5%).

A well chosen rejection region will have relatively high probability under H_1, whilst retaining low probability under H_0.

Finally, we calculate the observed test statistic t(x) for our observed data x. If t ∈ R we reject the null hypothesis at the 100α% level.

p-Values

For each possible significance level α ∈ (0, 1), a hypothesis test at the 100α% level will result in either rejecting or not rejecting H_0.

As α → 0 it becomes less and less likely that the null hypothesis will be rejected, as the rejection region is becoming smaller and smaller. Similarly, as α → 1 it becomes more and more likely that the null hypothesis will be rejected.

For a given data set and resulting test statistic, we might, therefore, be interested in identifying the critical significance level which marks the threshold between us rejecting and not rejecting the null hypothesis. This is known as the p-value of the data.

Smaller p-values suggest stronger evidence against H_0.
1.2 Error Rates and Power of a Test

Test Errors

There are two types of error in the outcome of a hypothesis test:

Type I: Rejecting H_0 when in fact H_0 is true. By construction, this happens with probability α. For this reason, the significance level α of a hypothesis test is also referred to as the Type I error rate.

Type II: Not rejecting H_0 when in fact H_1 is true. The probability with which this type of error will occur depends on the unknown true value of θ, so to calculate values we plug in a plausible alternative value for θ ≠ θ_0, θ_1 say, and let β = P(T ∉ R | θ = θ_1) be the probability of a Type II error.

Power

We define the power of a hypothesis test by

1 − β = P(T ∈ R | θ = θ_1).

For a fixed significance level α, a well chosen test statistic T and rejection region R will have high power - that is, maximise the probability of rejecting the null hypothesis when the alternative is true.
2 Testing for a population mean

2.1 Normal Distribution with Known Variance

N(μ, σ²) - σ² Known

Suppose X_1, . . . , X_n are i.i.d. N(μ, σ²) with σ² known and μ unknown.

We may wish to test if μ = μ_0 for some specific value μ_0 (e.g. μ_0 = 0, μ_0 = 9.8).

Then we can state our null and alternative hypotheses as

H_0 : μ = μ_0;
H_1 : μ ≠ μ_0.

Under H_0 : μ = μ_0, we then know both μ and σ². So for the sample mean X̄ we have a known distribution for the test statistic

Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1).

So if we define our rejection region R to be the 100α% tails of the standard normal distribution,

R = \left(-\infty, -z_{1-\frac{\alpha}{2}}\right) \cup \left(z_{1-\frac{\alpha}{2}}, \infty\right), \quad \text{i.e. } |z| > z_{1-\frac{\alpha}{2}},

we have P(Z ∈ R) = α under H_0.

We thus reject H_0 at the 100α% significance level ⟺ our observed test statistic z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} ∈ R.

The p-value is given by 2 × {1 − Φ(|z|)}.
[Figure: Ex. 5% rejection region for an N(0, 1) statistic; the standard normal pdf φ(z) with the two 2.5% tails beyond ±1.96 shaded and labelled R.]
2.1.1 Duality with Confidence Intervals

There is a strong connection in this context between hypothesis testing and confidence intervals.

Suppose we have constructed a 100(1 − α)% confidence interval for θ. Then this is precisely the set of values {θ_0} for which there would be insufficient evidence to reject a null hypothesis H_0 : θ = θ_0 at the 100α% level.
Example

A company makes packets of snack foods. The bags are labelled as weighing 454g; of course they won't all be exactly 454g, and let's suppose the variance of bag weights is known to be 70g². The following data show the mass in grams of 50 randomly sampled packets.

464, 450, 450, 456, 452, 433, 446, 446, 450, 447, 442, 438, 452, 447, 460, 450, 453, 456,
446, 433, 448, 450, 439, 452, 459, 454, 456, 454, 452, 449, 463, 449, 447, 466, 446, 447,
450, 449, 457, 464, 468, 447, 433, 464, 469, 457, 454, 451, 453, 443

Are these data consistent with the claim that the mean weight of packets is 454g?

1. We wish to test H_0 : μ = 454 vs. H_1 : μ ≠ 454. So set μ_0 = 454.

2. Although we have not been told that the packet weights are individually normally distributed, by the CLT we still have that the mean weight of the sample of packets is approximately normally distributed, and hence we still approximately have Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1).

3. x̄ = 451.22 and n = 50 ⟹ z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = -2.350.

4. For a 5%-level significance test, we compare the statistic z = -2.350 with the rejection region R = (−∞, −z_{0.975}) ∪ (z_{0.975}, ∞) = (−∞, −1.96) ∪ (1.96, ∞). Clearly we have z ∈ R, and so at the 5%-level we reject the null hypothesis that the mean packet weight is 454g.

5. At which significance levels would we have not rejected the null hypothesis?

For a 1%-level significance test, the rejection region would have been R = (−∞, −z_{0.995}) ∪ (z_{0.995}, ∞) = (−∞, −2.576) ∪ (2.576, ∞). In which case z ∉ R, and so at the 1%-level we would not have rejected the null hypothesis.

The p-value is

2 × {1 − Φ(|z|)} = 2 × {1 − Φ(|−2.350|)} ≈ 2 × (1 − 0.9906) = 0.019,

and so we would only reject the null hypothesis for α > 1.9%.
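The arithmetic in steps 3 to 5 is easily reproduced in a few lines of Python (scipy used for the normal cdf; the data vector is the one listed above):

    import numpy as np
    from scipy.stats import norm

    weights = np.array([464, 450, 450, 456, 452, 433, 446, 446, 450, 447,
                        442, 438, 452, 447, 460, 450, 453, 456, 446, 433,
                        448, 450, 439, 452, 459, 454, 456, 454, 452, 449,
                        463, 449, 447, 466, 446, 447, 450, 449, 457, 464,
                        468, 447, 433, 464, 469, 457, 454, 451, 453, 443])

    mu0, sigma2 = 454, 70
    z = (weights.mean() - mu0) / np.sqrt(sigma2 / len(weights))
    p_value = 2 * (1 - norm.cdf(abs(z)))
    print(z, p_value)   # approximately -2.35 and 0.019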
2.2 Normal Distribution with Unknown Variance

N(μ, σ²) - σ² Unknown

Similarly, if σ² in the above example were unknown, we still have that

T = \frac{\bar{X} - \mu_0}{s_{n-1}/\sqrt{n}} \sim t_{n-1}.

So for a test of H_0 : μ = μ_0 vs. H_1 : μ ≠ μ_0 at the α level, the rejection region of our observed test statistic t = \frac{\bar{x} - \mu_0}{s_{n-1}/\sqrt{n}} is

R = \left(-\infty, -t_{n-1,1-\frac{\alpha}{2}}\right) \cup \left(t_{n-1,1-\frac{\alpha}{2}}, \infty\right), \quad \text{i.e. } |t| > t_{n-1,1-\frac{\alpha}{2}}.
[Figure: Ex. 5% rejection region for a t_5 statistic; the t_5 pdf with the two 2.5% tails beyond ±t_{5,0.975} shaded and labelled R.]
Example 1

Consider again the snack food weights example. There, we assumed the variance of bag weights was known to be 70. Without this, we could have estimated the variance by

s^2_{n-1} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = 70.502.

Then the corresponding t-statistic becomes

t = \frac{\bar{x} - \mu_0}{s_{n-1}/\sqrt{n}} = -2.341,

very similar to the z-statistic of before.

And since n = 50, we compare with the t_49 distribution, which is approximately N(0, 1). So the hypothesis test results and p-value would be practically identical.

Example 2

A particular piece of code takes a random time to run on a computer, but the average time is known to be 6 seconds. The programmer tries an alternative optimisation in compilation and wishes to know whether the mean run time has changed. To explore this, he runs the re-optimised code 16 times, obtaining a sample mean run time of 5.8 seconds and bias-corrected sample standard deviation of 1.2 seconds. Is the code any faster?

1. We wish to test H_0 : μ = 6 vs. H_1 : μ ≠ 6. So set μ_0 = 6.

2. Assuming the run times are approximately normal, T = \frac{\bar{X} - \mu_0}{s_{n-1}/\sqrt{n}} \sim t_{n-1}. That is, \frac{\bar{X} - 6}{s_{n-1}/\sqrt{16}} \sim t_{15}. So we reject H_0 at the 100α% level if |T| > t_{15,1-\frac{\alpha}{2}}.

3. x̄ = 5.8, s_{n−1} = 1.2 and n = 16 ⟹ t = \frac{\bar{x} - \mu_0}{s_{n-1}/\sqrt{n}} = -\frac{2}{3}.

4. We have |t| = 2/3 << 2.13 = t_{15,0.975}, so we have insufficient evidence to reject H_0 at the 5% level.

5. In fact, the p-value for these data is 51.51%, so there is very little evidence to suggest the code is now any faster.
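With only summary statistics available, the t statistic and p-value can be computed directly; a brief Python sketch (scipy assumed) for this example:

    from scipy.stats import t
    import math

    n, x_bar, s, mu0 = 16, 5.8, 1.2, 6.0
    t_stat = (x_bar - mu0) / (s / math.sqrt(n))
    p_value = 2 * (1 - t.cdf(abs(t_stat), df=n - 1))
    print(t_stat, p_value)   # about -0.667 and 0.515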
6
3 Testing for differences in population means
3.1 Two Sample Problems
Samples from 2 Populations
Suppose, as before, we have a random sample $X = (X_1, \ldots, X_{n_1})$ from an unknown population distribution $P_X$.

But now, suppose we have a further random sample $Y = (Y_1, \ldots, Y_{n_2})$ from a second, different population $P_Y$.

Then we may wish to test hypotheses concerning the similarity of the two distributions $P_X$ and $P_Y$.

In particular, we are often interested in testing whether $P_X$ and $P_Y$ have equal means. That is, to test
$$H_0: \mu_X = \mu_Y \quad \text{vs.} \quad H_1: \mu_X \neq \mu_Y.$$
Paired Data
A special case is when the two samples X and Y are paired. That is, if $n_1 = n_2 = n$ and the data are collected as pairs $(X_1, Y_1), \ldots, (X_n, Y_n)$ so that, for each $i$, $X_i$ and $Y_i$ are possibly dependent.

For example, we might have a random sample of n individuals and $X_i$ represents the heart rate of the $i$th person before light exercise and $Y_i$ the heart rate of the same person afterwards.

In this special case, for a test of equal means we can consider the sample of differences $Z_1 = X_1 - Y_1, \ldots, Z_n = X_n - Y_n$ and test $H_0: \mu_Z = 0$ using the single sample methods we have seen. In the above example, this would test whether light exercise causes a change in heart rate.
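As an illustration of the paired approach, the sketch below (Python with SciPy assumed) uses made-up heart-rate figures, not data from the course, and checks that testing the differences against zero agrees with SciPy's built-in paired test:

```python
from scipy.stats import ttest_1samp, ttest_rel

before = [72, 68, 75, 80, 66, 71, 78, 69]   # hypothetical resting heart rates
after  = [75, 70, 74, 86, 70, 74, 83, 72]   # same individuals after light exercise

diffs = [b - a for b, a in zip(before, after)]
print(ttest_1samp(diffs, 0))    # single-sample test of H0: mu_Z = 0 on the differences
print(ttest_rel(before, after)) # equivalent built-in paired t-test, same statistic and p-value
```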
3.2 Normal Distributions with Known Variances
$N(\mu_X, \sigma_X^2)$, $N(\mu_Y, \sigma_Y^2)$

Suppose

$X = (X_1, \ldots, X_{n_1})$ are i.i.d. $N(\mu_X, \sigma_X^2)$ with $\mu_X$ unknown;
$Y = (Y_1, \ldots, Y_{n_2})$ are i.i.d. $N(\mu_Y, \sigma_Y^2)$ with $\mu_Y$ unknown;
the two samples X and Y are independent.

Then we still have that, independently,
$$\bar{X} \sim N\left(\mu_X, \frac{\sigma_X^2}{n_1}\right), \qquad \bar{Y} \sim N\left(\mu_Y, \frac{\sigma_Y^2}{n_2}\right).$$

From this it follows that the difference in sample means,
$$\bar{X} - \bar{Y} \sim N\left(\mu_X - \mu_Y, \frac{\sigma_X^2}{n_1} + \frac{\sigma_Y^2}{n_2}\right),$$
and hence
$$\frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sqrt{\sigma_X^2/n_1 + \sigma_Y^2/n_2}} \sim N(0, 1).$$
So under the null hypothesis $H_0: \mu_X = \mu_Y$, we have
$$Z = \frac{\bar{X} - \bar{Y}}{\sqrt{\sigma_X^2/n_1 + \sigma_Y^2/n_2}} \sim N(0, 1).$$
$N(\mu_X, \sigma_X^2)$, $N(\mu_Y, \sigma_Y^2)$ - $\sigma_X^2$, $\sigma_Y^2$ Known

So if $\sigma_X^2$ and $\sigma_Y^2$ are known, we immediately have a test statistic
$$z = \frac{\bar{x} - \bar{y}}{\sqrt{\sigma_X^2/n_1 + \sigma_Y^2/n_2}}$$
which we can compare against the quantiles of a standard normal.

That is,
$$R = \left\{|z| > z_{1-\frac{\alpha}{2}}\right\}$$
gives a rejection region for a hypothesis test of $H_0: \mu_X = \mu_Y$ vs. $H_1: \mu_X \neq \mu_Y$ at the $100\alpha\%$ level.
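This test is easy to automate. The helper below is a sketch (Python with SciPy assumed; the function two_sample_z and the numbers in the example call are mine, purely for illustration):

```python
from math import sqrt
from scipy.stats import norm

def two_sample_z(xbar, ybar, var_x, var_y, n1, n2, alpha=0.05):
    """Two-sided z-test of H0: mu_X = mu_Y when both population variances are known."""
    z = (xbar - ybar) / sqrt(var_x / n1 + var_y / n2)
    crit = norm.ppf(1 - alpha / 2)
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return z, crit, abs(z) > crit, p_value

# e.g. two_sample_z(451.2, 448.7, 70, 70, 50, 50) returns the statistic, the
# critical value 1.96, the reject/do-not-reject decision and the p-value.
```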
3.3 Normal Distributions with Unknown Variances
$N(\mu_X, \sigma^2)$, $N(\mu_Y, \sigma^2)$ - $\sigma^2$ Unknown

On the other hand, suppose $\sigma_X^2$ and $\sigma_Y^2$ are unknown.

Then if we know $\sigma_X^2 = \sigma_Y^2 = \sigma^2$ but $\sigma^2$ is unknown, we can still proceed.

We have
$$\frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sigma\sqrt{1/n_1 + 1/n_2}} \sim N(0, 1),$$
and so, under $H_0: \mu_X = \mu_Y$,
$$\frac{\bar{X} - \bar{Y}}{\sigma\sqrt{1/n_1 + 1/n_2}} \sim N(0, 1),$$
but with $\sigma$ unknown.
Pooled Estimate of Population Variance
We need an estimator for the variance using samples from two populations with different means. Just combining the samples together into one big sample would over-estimate the variance, since some of the variability in the samples would be due to the difference in $\mu_X$ and $\mu_Y$.

So we define the bias-corrected pooled sample variance
$$S^2_{n_1+n_2-2} = \frac{\sum_{i=1}^{n_1}(X_i - \bar{X})^2 + \sum_{i=1}^{n_2}(Y_i - \bar{Y})^2}{n_1 + n_2 - 2},$$
which is an unbiased estimator for $\sigma^2$.

We can immediately see that $s^2_{n_1+n_2-2}$ is indeed an unbiased estimate of $\sigma^2$ by noting
$$S^2_{n_1+n_2-2} = \frac{n_1 - 1}{n_1 + n_2 - 2} S^2_{n_1-1} + \frac{n_2 - 1}{n_1 + n_2 - 2} S^2_{n_2-1};$$
that is, $s^2_{n_1+n_2-2}$ is a weighted average of the bias-corrected sample variances for the individual samples x and y, which are both unbiased estimates for $\sigma^2$.

Then substituting $S_{n_1+n_2-2}$ in for $\sigma$ we get
$$\frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{S_{n_1+n_2-2}\sqrt{1/n_1 + 1/n_2}} \sim t_{n_1+n_2-2},$$
and so, under $H_0: \mu_X = \mu_Y$,
$$T = \frac{\bar{X} - \bar{Y}}{S_{n_1+n_2-2}\sqrt{1/n_1 + 1/n_2}} \sim t_{n_1+n_2-2}.$$

So we have a rejection region for a hypothesis test of $H_0: \mu_X = \mu_Y$ vs. $H_1: \mu_X \neq \mu_Y$ at the $100\alpha\%$ level given by
$$R = \left\{|t| > t_{n_1+n_2-2,1-\frac{\alpha}{2}}\right\},$$
for the statistic
$$t = \frac{\bar{x} - \bar{y}}{s_{n_1+n_2-2}\sqrt{1/n_1 + 1/n_2}}.$$
Example
The same piece of C code was repeatedly run after compilation under two different C compilers, and the run times under each compiler were recorded. The sample mean and bias-corrected sample variance for Compiler 1 were 114s and 310 respectively, and the corresponding figures for Compiler 2 were 94s and 290. Both sets of data were each based on 15 runs.

Suppose that Compiler 2 is a refined version of Compiler 1, and so if $\mu_1, \mu_2$ are the expected run times of the code under the two compilations, we might fairly assume $\mu_2 \leq \mu_1$.

Conduct a hypothesis test of $H_0: \mu_1 = \mu_2$ vs. $H_1: \mu_1 > \mu_2$ at the 5% level.

Until now we have mostly considered two-sided tests, that is, tests of the form $H_0: \mu = \mu_0$ vs. $H_1: \mu \neq \mu_0$.

Here we need to consider one-sided tests, which differ by the alternative hypothesis being of the form $H_1: \mu < \mu_0$ or $H_1: \mu > \mu_0$.

This presents no extra methodological challenge and requires only a slight adjustment in the construction of the rejection region.

We still use the t-statistic
$$t = \frac{\bar{x} - \bar{y}}{s_{n_1+n_2-2}\sqrt{1/n_1 + 1/n_2}},$$
where $\bar{x}, \bar{y}$ are the sample mean run times under Compilers 1 and 2 respectively. But now the one-sided rejection region becomes
$$R = \left\{t \mid t > t_{n_1+n_2-2,1-\alpha}\right\}.$$

First calculating the bias-corrected pooled sample variance, we get
$$s^2_{n_1+n_2-2} = \frac{14 \times 310 + 14 \times 290}{28} = 300.$$
(Note that since the sample sizes $n_1$ and $n_2$ are equal, the pooled estimate of the variance is the average of the individual estimates.)

So
$$t = \frac{\bar{x} - \bar{y}}{s_{n_1+n_2-2}\sqrt{1/n_1 + 1/n_2}} = \frac{114 - 94}{\sqrt{300}\sqrt{1/15 + 1/15}} = \sqrt{10} = 3.162.$$

For a one-sided test we compare $t = 3.162$ with $t_{28,0.95} = 1.701$ and conclude that we reject the null hypothesis at the 5% level; the second compilation is significantly faster.
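A numerical check of this example, as a sketch assuming SciPy; the summary figures 114, 94, 310, 290 and 15 are those quoted above:

```python
from math import sqrt
from scipy.stats import t as t_dist

xbar, ybar = 114, 94          # mean run times under Compilers 1 and 2
s2_x, s2_y = 310, 290         # bias-corrected sample variances
n1 = n2 = 15

s2_pooled = ((n1 - 1) * s2_x + (n2 - 1) * s2_y) / (n1 + n2 - 2)     # = 300
t_stat = (xbar - ybar) / (sqrt(s2_pooled) * sqrt(1 / n1 + 1 / n2))  # = sqrt(10) = 3.162
crit = t_dist.ppf(0.95, df=n1 + n2 - 2)                             # one-sided 5% critical value, 1.701

print(round(s2_pooled, 1), round(t_stat, 3), round(crit, 3), t_stat > crit)
```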
4 Goodness of Fit
4.1 Count Data and Chi-Square Tests
Count Data
The results in the previous sections relied upon the data being either normally distributed, or at least through the CLT having the sample mean being approximately normally distributed. Tests were then developed for making inference on population means under those assumptions. These tests were very much model-based.

Another important but very different problem concerns model checking, which can be addressed through a more general consideration of count data for simple (discrete and finite) distributions.

The following ideas can then be trivially extended to infinite range discrete and continuous r.v.s by binning observed samples into a finite collection of predefined intervals.
Samples from a Simple Random Variable
Let X be a simple random variable taking values in the range $\{x_1, \ldots, x_k\}$, with probability mass function $p_j = P(X = x_j)$, $j = 1, \ldots, k$.

A random sample of size n from the distribution of X can be summarised by the observed frequency counts $O = (O_1, \ldots, O_k)$ at the points $x_1, \ldots, x_k$ (so $\sum_{j=1}^{k} O_j = n$).

Suppose it is hypothesised that the true pmf $\{p_j\}$ is from a particular parametric model $p_j = P(X = x_j \mid \theta)$, $j = 1, \ldots, k$ for some unknown parameter p-vector $\theta$.

To test this hypothesis about the model, we first need to estimate the unknown parameters so that we are able to calculate the distribution of any statistic under the null hypothesis $H_0: p_j = P(X = x_j \mid \theta)$, $j = 1, \ldots, k$. Let $\hat{\theta}$ be such an estimator, obtained using the sample O.

Then under $H_0$ we have estimated probabilities for the pmf $\hat{p}_j = P(X = x_j \mid \hat{\theta})$, $j = 1, \ldots, k$, and so we are able to calculate estimated expected frequency counts $E = (E_1, \ldots, E_k)$ by $E_j = n\hat{p}_j$. (Note again we have $\sum_{j=1}^{k} E_j = n$.)

We then seek to compare the observed frequencies with the expected frequencies to test for goodness of fit.
Chi-Square Test
To test $H_0: p_j = P(X = x_j \mid \theta)$ vs. $H_1: 0 \leq p_j \leq 1$, $\sum p_j = 1$ we use the chi-square statistic
$$X^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}.$$
If $H_0$ were true, then the statistic $X^2$ would approximately follow a chi-square distribution with $\nu = k - p - 1$ degrees of freedom.

k is the number of values (categories) the simple r.v. X can take.
p is the number of parameters being estimated ($\dim(\theta)$).

For the approximation to be valid, we should have $\forall j$, $E_j \geq 5$. This may require some merging of categories.
[Figure: $\chi^2_\nu$ pdf for $\nu = 1, 2, 3, 5, 10$ - the chi-square densities plotted over $x \in [0, 10]$.]
Rejection Region
Clearly larger values of $X^2$ correspond to larger deviations from the null hypothesis model. That is, if $X^2 = 0$ the observed counts exactly match those expected under $H_0$.

For this reason, we always perform a one-sided goodness of fit test using the $\chi^2$ statistic, looking only at the upper tail of the distribution.

Hence the rejection region for a goodness of fit hypothesis test at the $100\alpha\%$ level is given by
$$R = \left\{x^2 \mid x^2 > \chi^2_{k-p-1,1-\alpha}\right\}.$$
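In practice this is only a few lines of code. The helper below is a sketch (Python with SciPy assumed; the function chi_square_gof and its arguments are mine, not from the notes), with n_params the number of parameters estimated from the data:

```python
from scipy.stats import chi2

def chi_square_gof(observed, expected, n_params=0, alpha=0.05):
    """Goodness-of-fit test: returns the X^2 statistic, degrees of freedom,
    critical value and whether the null model is rejected."""
    x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - n_params - 1
    crit = chi2.ppf(1 - alpha, df)
    return x2, df, crit, x2 > crit
```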
4.2 Proportions
Example
Each year, around 1.3 million people in the USA suffer adverse drug effects (ADEs). A
study in the Journal of the American Medical Association (July 5, 1995) gave the causes of 95
ADEs below.
Cause Number of ADEs
Lack of knowledge of drug 29
Rule violation 17
Faulty dose checking 13
Slips 9
Other 27
Test whether the true percentages of ADEs differ across the 5 causes.
Under the null hypothesis that the 5 causes are equally likely, we would have expected counts of $95/5 = 19$ for each cause.
So our $\chi^2$ statistic becomes
$$x^2 = \frac{(29-19)^2}{19} + \frac{(17-19)^2}{19} + \frac{(13-19)^2}{19} + \frac{(9-19)^2}{19} + \frac{(27-19)^2}{19} = \frac{100}{19} + \frac{4}{19} + \frac{36}{19} + \frac{100}{19} + \frac{64}{19} = \frac{304}{19} = 16.$$

We have not estimated any parameters from the data, so we compare $x^2$ with the quantiles of the $\chi^2_{5-1} = \chi^2_4$ distribution.

Well $16 > 9.49 = \chi^2_{4,0.95}$, so we reject the null hypothesis at the 5% level; we have reason to suppose that there is a difference in the true percentages across the different causes.
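This calculation can be reproduced directly, as a sketch assuming SciPy; the five counts are the observed ADE frequencies from the table:

```python
from scipy.stats import chisquare, chi2

observed = [29, 17, 13, 9, 27]         # ADE counts for the five causes
expected = [95 / 5] * 5                # 19 under the equally-likely hypothesis

stat, p_value = chisquare(observed, expected)    # stat = 16.0 on 4 degrees of freedom
print(stat, round(p_value, 4), chi2.ppf(0.95, df=4))   # 16.0 > 9.49, so reject at the 5% level
```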
4.3 Model Checking
Example - Fitting a Poisson Distribution to Data
Recall the example from the Discrete Random Variables chapter, where the number of particles emitted by a radioactive substance which reached a Geiger counter was measured for 2608 time intervals, each of length 7.5 seconds.

We fitted a Poisson($\lambda$) distribution to the data by plugging in the sample mean number of counts (3.870) for the rate parameter $\lambda$. (Which we now know to be the MLE!)

x          0      1      2      3      4      5      6      7      8      9     10
O(n_x)    57    203    383    525    532    408    273    139     45     27     16
E(n_x)  54.4  210.5  407.4  525.5  508.4  393.5  253.8  140.3   67.9   29.2   17.1

(O=Observed, E=Expected).

Whilst the fitted Poisson(3.87) expected frequencies looked quite convincing to the eye, at that time we had no formal method of quantitatively assessing the fit. However, we now know how to proceed.

x              0      1      2      3      4      5      6      7      8      9     10
O             57    203    383    525    532    408    273    139     45     27     16
E           54.4  210.5  407.4  525.5  508.4  393.5  253.8  140.3   67.9   29.2   17.1
O-E          2.6   -7.5  -24.4   -0.5   23.6   14.5   19.2   -1.3  -22.9   -2.2   -1.1
(O-E)^2/E  0.124  0.267  1.461  0.000  1.096  0.534  1.452  0.012  7.723  0.166  0.071

The statistic $x^2 = \sum \frac{(O-E)^2}{E} = 12.906$ should be compared with a $\chi^2_{11-1-1} = \chi^2_9$ distribution.

Well $\chi^2_{9,0.95} = 16.91$, so at the 5% level we do not reject the null hypothesis of a Poisson(3.87) model for the data.
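The same check in code, as a sketch assuming SciPy; the observed and expected counts are those tabulated above, with one degree of freedom deducted for the estimated rate:

```python
from scipy.stats import chi2

observed = [57, 203, 383, 525, 532, 408, 273, 139, 45, 27, 16]
expected = [54.4, 210.5, 407.4, 525.5, 508.4, 393.5, 253.8, 140.3, 67.9, 29.2, 17.1]

x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))    # about 12.9
df = len(observed) - 1 - 1                                        # k - p - 1 = 9
crit = chi2.ppf(0.95, df)                                         # about 16.9
print(round(x2, 3), round(crit, 2), x2 > crit)                    # 12.906 < 16.9: do not reject
```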
4.4 Independence
Contingency Tables
Suppose we have two simple random variables X and Y which are jointly distributed with unknown probability mass function $p_{XY}$.

We are often interested in trying to ascertain whether X and Y are independent. That is, determine whether $p_{XY}(x, y) = p_X(x)\,p_Y(y)$.
Let the ranges of the r.v.s X and Y be $\{x_1, \ldots, x_k\}$ and $\{y_1, \ldots, y_\ell\}$ respectively.

Then an i.i.d. sample of size n from the joint distribution of (X, Y) can be represented by a list of counts $n_{ij}$ ($1 \leq i \leq k$; $1 \leq j \leq \ell$) of the number of times we observe the pair $(x_i, y_j)$.

Tabulating these data in the following way gives what is known as a $k \times \ell$ contingency table.

              y_1    y_2   ...   y_l   |
    x_1      n_11   n_12   ...  n_1l   |  n_1.
    x_2      n_21   n_22   ...  n_2l   |  n_2.
     :         :      :           :    |   :
    x_k      n_k1   n_k2   ...  n_kl   |  n_k.
    -----------------------------------+------
             n_.1   n_.2   ...  n_.l   |  n

Note the row sums $(n_{1\cdot}, n_{2\cdot}, \ldots, n_{k\cdot})$ represent the frequencies of $x_1, x_2, \ldots, x_k$ in the sample (that is, ignoring the value of Y). Similarly for the column sums $(n_{\cdot 1}, n_{\cdot 2}, \ldots, n_{\cdot \ell})$ and $y_1, \ldots, y_\ell$.
Under the null hypothesis
$$H_0: X \text{ and } Y \text{ are independent},$$
the expected values of the entries of the contingency table, conditional on the row and column sums, can be estimated by
$$\hat{n}_{ij} = \frac{n_{i\cdot}\, n_{\cdot j}}{n}, \quad 1 \leq i \leq k, \; 1 \leq j \leq \ell.$$

To see this, consider the marginal distribution of X; we could estimate $p_X(x_i)$ by $\hat{p}_{i\cdot} = \dfrac{n_{i\cdot}}{n}$. Similarly for $p_Y(y_j)$ we get $\hat{p}_{\cdot j} = \dfrac{n_{\cdot j}}{n}$.

Then under the null hypothesis of independence $p_{XY}(x_i, y_j) = p_X(x_i)\,p_Y(y_j)$, and so we can estimate $p_{XY}(x_i, y_j)$ by
$$\hat{p}_{ij} = \hat{p}_{i\cdot}\,\hat{p}_{\cdot j} = \frac{n_{i\cdot}\, n_{\cdot j}}{n^2}.$$

Now that we have a set of expected frequencies to compare against our $k \times \ell$ observed frequencies, a $\chi^2$ test can be performed.

We are using both the row and column sums to estimate our probabilities, and there are k and $\ell$ of these respectively. So we compare our calculated $x^2$ statistic against a $\chi^2$ distribution with $k\ell - \{(k-1) + (\ell-1)\} - 1 = (k-1)(\ell-1)$ degrees of freedom.

Hence the rejection region for a hypothesis test of independence in a $k \times \ell$ contingency table at the $100\alpha\%$ level is given by
$$R = \left\{x^2 \mid x^2 > \chi^2_{(k-1)(\ell-1),1-\alpha}\right\}.$$
Example
An article in International Journal of Sports Psychology (July-Sept 1990) evaluated the relationship between physical fitness and stress. 549 people were classified as good, average, or poor fitness, and were also tested for signs of stress (yes or no). The data are shown in the table below.

              Poor Fitness   Average Fitness   Good Fitness
Stress             206             184              85        | 475
No stress           36              28              10        |  74
                   242             212              95        | 549

Is there any relationship between stress and fitness?

Under independence we would estimate the expected values to be

              Poor Fitness   Average Fitness   Good Fitness
Stress           209.4           183.4            82.2        | 475
No stress         32.6            28.6            12.8        |  74
                 242               212              95        | 549
Hence the $\chi^2$ statistic is calculated to be
$$X^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} = \frac{(206 - 209.4)^2}{209.4} + \ldots + \frac{(10 - 12.8)^2}{12.8} = 1.1323.$$
This should be compared with a $\chi^2$ distribution with $(2-1) \times (3-1) = 2$ degrees of freedom.

$\chi^2_{2,0.95} = 5.99$, so we have no significant evidence to suggest there is any relationship between fitness and stress.
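The same answer comes out of SciPy's built-in routine, shown here as a sketch; the six counts are those in the observed table above:

```python
from scipy.stats import chi2_contingency

table = [[206, 184, 85],
         [36, 28, 10]]

x2, p_value, df, expected = chi2_contingency(table)
print(round(x2, 4), df, round(p_value, 3))   # about 1.13 on 2 degrees of freedom
print(expected)                              # matches the estimated expected counts above
```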
COMP 245 Statistics
Exercises 7 - Hypothesis Testing
1. To decide from which manufacturer to purchase 2000 PCs for its undergraduates, a University
decided to carry out a test. It bought 50 PCs from manufacturer A and 50 from manufacturer B.
Over the course of a six month period 6 of those from manufacturer A experienced problems,
and 10 of those from manufacturer B. This suggested that, to be on the safe side, the University
should buy machines from manufacturer A. However, A's machines were more expensive than
B's. Before making the decision to buy from A, the University wanted to be confident that the
difference was a real one, and was not merely due to chance fluctuation. Carry out a test to
investigate this.
You may wish to use the extract from chi-squared tables given below:
Degrees of Upper tail area
freedom .10 .05 .01
1 2.706 3.841 6.635
2 4.605 5.991 9.210
3 6.251 7.815 11.345
4 7.779 9.488 13.277
5 9.236 11.071 15.086
2. In an ESP experiment, a subject in one room is asked to state the colour (red or blue) of
100 cards randomly selected, with replacement, from a pack of 25 red and 25 blue cards by
someone in another room. If the subject gets 34 right, determine whether the results are
significant at the 1% level. Clearly write down your null hypothesis, alternative hypothesis,
the test statistic you use, the rejection region, and interpret your conclusions.
Hint: you may regard 100 as a large sample. You may wish to use the following
extract from a standard normal distribution, showing the upper tail area from a N(0,1)
distribution for certain points x.

   x      Upper tail area
  1.28        0.100
  1.64        0.050
  1.96        0.025
  2.33        0.010
  2.58        0.005
3. Charles Darwin measured differences in height for 15 pairs of plants of the species Zea mays.
(Each plant had parents grown from the same seed - one plant in each pair was the progeny
of a cross-fertilisation, the other of a self-fertilisation. Darwin's measurements were the
differences in height between cross-fertilised and self-fertilised progeny.) The data, measured
in eighths of an inch, are given below.

49, -67, 8, 16, 6, 23, 28, 41, 14, 29, 56, 24, 75, 60, -48.

(a) Supposing that the observed differences $\{d_i \mid i = 1, \ldots, 15\}$ are independent observations
on a normally distributed random variable D with mean $\mu$ and variance $\sigma^2$, state appropriate
null and alternative hypotheses for a two-sided test of the hypothesis that there is
no difference between the heights of progeny of cross-fertilised and self-fertilised plants,
and state the null distribution of an appropriate test statistic.
(b) Obtain the form of the rejection region for the test you defined in part (a), assuming a
10% significance level.
(c) Calculate the value of the test statistic for this data set, and state the conclusions of your
test.
You may want to use the following extract from a t-distribution, giving the point x with
the specified area under the upper tail beyond x of a t-distribution with df degrees of
freedom.
df 5% 10%
13 1.7709 1.3502
14 1.7613 1.3450
15 1.7531 1.3406
4. Some of Student's original experiments involved counting the numbers of yeast cells found
on a microscope slide. The results of one experiment are given in the table below, which
shows the number of small squares on a slide which contain 0, 1, 2, 3, 4, or 5 cells. We
want to use these data to see if the mean number of cells per square is 0.6, using the 5%
significance level.
Number of cells in square 0 1 2 3 4 5
Frequency 213 128 37 18 3 1
(a) State the null hypothesis and the alternative hypothesis.
(b) The distribution of the numbers of cells is far from normal; it can take only positive
integer values, it is very far from symmetric, and dies away very quickly. Which familiar
distribution might be appropriate as a model for these data?
(c) Estimate the mean and the variance of the distribution you suggested in 4b.
(d) Explain why a critical region with the form
$$\left\{\bar{x} < \mu - 1.96\sqrt{\mu/n}\right\} \cup \left\{\bar{x} > \mu + 1.96\sqrt{\mu/n}\right\}$$
would be a reasonable region, making sure you explain the 1.96, the term $\sqrt{\mu/n}$ and the
implications of the union. What is the test statistic in mind?
(e) Compute the limits of the critical region.
(f) Draw a conclusion about whether or not the null hypothesis can be rejected.
5. A survey of 320 families with 5 children each, gave the distribution shown below. Is this table
consistent with the hypothesis that male and female children are equally probable? Obtain
results for both the 1% and 5% levels. Work through the details of the test - don't just hit the
chi-squared button on a statistical calculator.
Boys/girls 5/0 4/1 3/2 2/3 1/4 0/5
Number of families 18 56 110 88 40 8
You may wish to use the table extract from Q1.
6. As part of a telephone interview, a sample of 500 executives and a sample of 250 MBA students
were asked to respond to the question "Should corporations become more directly involved
with social problems such as homelessness, education, and drugs?" The results are shown
below. Test the hypothesis that the patterns of response for the two groups are the same.

                 More involved   Not more involved   Not sure
Executives            345               135             20
MBA students          222                20              8
You may wish to use the table extract from Q1.
7. (a) For a test at a fixed significance level, and with given null and alternative hypotheses,
what will happen to the power as the sample size increases?
(b) For a test of a given null hypothesis against a given alternative hypothesis, and with a
given sample size, describe what would happen to the power of the test if the significance
level was changed from 5% to 1%.
(c) A test of a given null hypothesis against a given alternative hypothesis, with a sample of
size n and significance level of , has power of 80%. What change could I make to the
test to increase my chance of rejecting a false null hypothesis?
(d) How can we attain a test which has a very low probability of Type I error and also a very
low probability of Type II error?
8. The data below show the frequency with which each of the balls numbered 1-49 have appeared
in the main draw in the National Lottery between its inception in November 1994 and
December 2008. This table shows that there are substantial differences between the numbers
of times different balls have appeared - for an extreme example, number 20 has been drawn
just 134 times, whereas 38 has appeared 197 times, almost 50% more often.

Should we conclude from this table that the balls have different probabilities of appearing?
Ball Freq.
1 160
2 168
3 163
4 163
5 155
6 174
7 168
8 158
9 176
10 171
Ball Freq.
11 185
12 178
13 147
14 160
15 157
16 147
17 158
18 164
19 166
20 134
Ball Freq.
21 148
22 169
23 183
24 164
25 182
26 161
27 173
28 165
29 164
30 177
Ball Freq.
31 180
32 170
33 176
34 159
35 173
36 148
37 150
38 197
39 166
40 178
Ball Freq.
41 137
42 167
43 182
44 184
45 162
46 164
47 177
48 177
49 169
COMP 245 Statistics
Solutions 7 - Hypothesis Testing
1. We wish to know if the two binary variables, manufacturer A/B and Faulty/Not faulty, are
independent. We can use a chi-squared test to explore this.

The observed numbers are

                 A    B
Faulty           6   10
Not Faulty      44   40

Under the null hypothesis of independence, the expected values are

                    A                  B
Faulty       16 x 50/100 = 8    16 x 50/100 = 8
Not Faulty   84 x 50/100 = 42   84 x 50/100 = 42

The test statistic is thus
$$X^2 = \frac{(6-8)^2}{8} + \frac{(10-8)^2}{8} + \frac{(44-42)^2}{42} + \frac{(40-42)^2}{42} = 1.19.$$
This is to be compared with a $\chi^2_1$ distribution. This is not significant, even at the 10% level, so
we have no reason for supposing that the underlying faulty PC rates of the two manufacturers
are different.
2. The null hypothesis is that the probability of getting a card right, p, is $\frac{1}{2}$. That is, $H_0: p = \frac{1}{2}$.

A possible alternative hypothesis would be $H_1: p \neq \frac{1}{2}$, but a better one would be $H_1: p > \frac{1}{2}$,
since we would probably really be interested in the possibility that the subject gets more than
half of the guesses right and not less than half.

Under the null hypothesis, the number right would be expected to follow the Binomial(100, $\frac{1}{2}$)
distribution. This has mean $100 \times \frac{1}{2} = 50$ and variance $100 \times \frac{1}{2} \times \frac{1}{2} = 25$.

Adopting the one-sided alternative hypothesis, we will use the upper tail of this distribution as
the rejection region. Since the sample size, 100, is large and the Binomial(100, $\frac{1}{2}$) is symmetric,
we can approximate it by a normal distribution, N(50, 25).

Our test statistic is therefore given by
$$z = \frac{34 - 50}{\sqrt{25}} = -3.2,$$
which should then be compared with the upper tail of a standard normal. Since the test is
one-sided, the rejection region for a test at the 1% level is given by
$$R = \{z \mid z > 2.33\}.$$
In fact, our observed test statistic is negative, so it is certainly not in R. Thus we have no
evidence to reject the null hypothesis.
3. (a) $H_0: \mu = 0$ vs. $H_1: \mu \neq 0$, where $\mu$ is the mean difference between the heights of a pair
of cross-fertilised and self-fertilised plants whose parents were grown from the same seed.

An appropriate test statistic is
$$T = \frac{\bar{D}}{S_{n-1}/\sqrt{n}}$$
with null distribution $t_{n-1}$, where $\bar{D}$ is the sample mean of the differences and $S_{n-1}$ is
their bias-corrected sample standard deviation.

(b) For a two-sided test at the 10% significance level, the rejection region is defined by
$$R = \{t \mid |t| > t_{14,0.95}\} = \{t \mid |t| > 1.761\}.$$

(c) For these data we have
$$t = \frac{\bar{d}}{s_{n-1}/\sqrt{n}} = \frac{20.93}{37.74/\sqrt{15}} = 2.15.$$
Thus we have $t \in R$ and so the hypothesis of zero difference is rejected in favour of the
alternative hypothesis that there is a difference in the mean height of cross-fertilised and
self-fertilised plants.
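A numerical check of part (c), as a sketch assuming SciPy; the differences are Darwin's data from the question:

```python
from scipy.stats import ttest_1samp

d = [49, -67, 8, 16, 6, 23, 28, 41, 14, 29, 56, 24, 75, 60, -48]

result = ttest_1samp(d, 0)     # two-sided test of H0: mu = 0
print(result.statistic)        # about 2.15, matching the value above
print(result.pvalue)           # just under 0.05, so certainly significant at the 10% level
```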
4. (a) $H_0: \mu = 0.6$; $H_1: \mu \neq 0.6$.

(b) Poisson($\lambda$).

(c) The mean number of cells in a square is 0.6825. This number estimates both the mean
and variance of the population, since the mean and variance of a Poisson distribution are
the same.

(d) The variance of a Poisson($\lambda$) distribution is $\lambda$. So the mean of a sample of size n from
Poisson($\lambda$) has variance $\lambda/n$, and so the standard deviation of the sample mean is $\sqrt{\lambda/n}$.
The 1.96 arises because the sample size is quite large, and we can approximate the
distribution of the mean by a normal distribution (by the Central Limit Theorem), and
this is the 5% critical region for a normal distribution with mean $\mu$ and standard deviation
$\sqrt{\lambda/n}$; this is a union of two terms, because the alternative hypothesis is two-sided. The
appropriate test statistic would clearly be the sample mean number of cells in a square.

(e) The limits of the critical region are
$$\left(0.6 - 1.96\sqrt{\frac{0.6}{400}},\; 0.6 + 1.96\sqrt{\frac{0.6}{400}}\right) = (0.5241, 0.6759).$$

(f) The observed value of the test statistic is 0.6825, and this lies in the critical region (it is
greater than 0.6759). Thus we reject the null hypothesis that the true mean number of
cells per square is 0.6 at the 5% level.
5. We assume that the sexes of children in the same family are independent. Then the distribution
hypothesised is Binomial(5, $\frac{1}{2}$). Thus the expected number of families with 1 boy and 4 girls,
for example, is
$$320 \times \binom{5}{1} \left(\frac{1}{2}\right)^1 \left(\frac{1}{2}\right)^4 = 320 \times \frac{5}{32} = 50.$$
Continuing in this way, we get

Number of boys     O     E    O-E   (O-E)^2/E
      0           18    10     8       6.4
      1           56    50     6       0.72
      2          110   100    10       1
      3           88   100   -12       1.44
      4           40    50   -10       2
      5            8    10    -2       0.4
                                     -------
                                      11.96

Comparing this to the $\chi^2_5$ distribution (no parameters have been estimated from the data) we
see that this is greater than 11.07, the 5% level, but less than 15.09, the 1% level. Thus we
can reject the null hypothesis that the births of males and females are equally likely at the 5%
level but not at the 1% level.
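This can be verified in a few lines, as a sketch assuming SciPy; the observed counts are from the question and the expected counts come from Binomial(5, 1/2):

```python
from scipy.stats import binom, chisquare

observed = [18, 56, 110, 88, 40, 8]
expected = [320 * binom.pmf(k, 5, 0.5) for k in range(6)]   # 10, 50, 100, 100, 50, 10

stat, p_value = chisquare(observed, expected)   # statistic about 11.96 on 5 degrees of freedom
print(round(stat, 2), round(p_value, 4))        # significant at 5% but not at 1%
```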
6. Compute the marginal totals: 345+135+20 = 500 etc. for the rows, and 345+222 = 567 etc. for the columns.

Compute the expected values: 500 x 567/750 etc.

Compute $\sum_i \dfrac{(O_i - E_i)^2}{E_i}$.

Compare with a $\chi^2$ distribution with $(2-1)(3-1) = 2$ degrees of freedom.
7. (a) Power will increase.
(b) Power would decrease.
(c) Either increase the sample size or increase $\alpha$.
(d) Use a large sample size.
8. A simple chi-squared test is appropriate. The null hypothesis is that the balls have equal
probabilities of appearing. The sum of all the ball frequencies (the number of lottery balls
drawn since November 1994) is 8,154.

The expected frequency of each ball, under the null hypothesis, is thus $\frac{8154}{49} = 166.4082$.

The test statistic
$$\sum_{i=1}^{49} \frac{(O_i - 166.4082)^2}{166.4082}$$
turns out to be 46.475. Comparing with the $\chi^2_{48}$ distribution, we see the null hypothesis
cannot be rejected at any reasonable significance level.
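A sketch of this computation (Python with SciPy assumed); the 49 frequencies are those tabulated in the question:

```python
from scipy.stats import chisquare

freq = [160, 168, 163, 163, 155, 174, 168, 158, 176, 171,
        185, 178, 147, 160, 157, 147, 158, 164, 166, 134,
        148, 169, 183, 164, 182, 161, 173, 165, 164, 177,
        180, 170, 176, 159, 173, 148, 150, 197, 166, 178,
        137, 167, 182, 184, 162, 164, 177, 177, 169]

stat, p_value = chisquare(freq)            # equal expected frequencies 8154/49 used by default
print(round(stat, 3), round(p_value, 3))   # statistic about 46.5 on 48 df; p-value well above 0.05
```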