Generalized Linear Models

1 Likelihoods for Generalized Linear Models
1.1 Some General Theory
We assume that Yi has the p.d.f. that is a member of the exponential family. That is,
f (yi; θi , φ) = exp{(yi θi − b(θi ))/ai (φ) + c(yi ; φ)}
for some specific functions ai (·), b(·) and c(·), with φ known.
• The parameter θ is called the canonical parameter.
• The parameter φ, termed the scale parameter, is constant for all i = 1, . . . , n. For now consider
φ to be specified.
Consider a log-likelihood for a single observation from the exponential family.
ℓ(θ, φ; y) ∝ log f (y; θ, φ)
= (yθ − b(θ))/a(φ) + c(y; φ)
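\[ S(\theta) = \frac{\partial \ell}{\partial \theta} = \frac{y - b'(\theta)}{a(\phi)} \]
\[ I(\theta) = -\frac{\partial^2 \ell}{\partial \theta^2} = \frac{b''(\theta)}{a(\phi)} \]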
Consider the following very general results :

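• Since f is a density,
\[ \int f(y;\theta,\phi)\,dy = 1 \]
\[ \frac{\partial}{\partial\theta}\int f(y;\theta,\phi)\,dy = \frac{\partial 1}{\partial\theta} \]
\[ \int \frac{\partial}{\partial\theta} f(y;\theta,\phi)\,dy = 0 \]
But
\[ \frac{\partial}{\partial\theta}\log f(y;\theta,\phi) = \frac{1}{f(y;\theta,\phi)}\,\frac{\partial}{\partial\theta} f(y;\theta,\phi) . \]
Therefore,
\[ \int \left(\frac{\partial}{\partial\theta}\log f(y;\theta,\phi)\right) f(y;\theta,\phi)\,dy = 0 \]
which in turn implies E(S(θ)) = 0.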

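• Differentiating again, we find
\[ \frac{\partial}{\partial\theta}\int \left(\frac{\partial}{\partial\theta}\log f(y;\theta,\phi)\right) f(y;\theta,\phi)\,dy = \frac{\partial 0}{\partial\theta} \]
\[ \int \left(\frac{\partial^2}{\partial\theta^2}\log f(y;\theta,\phi)\right) f(y;\theta,\phi)\,dy + \int \left(\frac{\partial}{\partial\theta}\log f(y;\theta,\phi)\right) \frac{\partial}{\partial\theta} f(y;\theta,\phi)\,dy = 0 \]
\[ \int \left(\frac{\partial^2}{\partial\theta^2}\log f(y;\theta,\phi)\right) f(y;\theta,\phi)\,dy + \int \left(\frac{\partial}{\partial\theta}\log f(y;\theta,\phi)\right)^2 f(y;\theta,\phi)\,dy = 0 , \]
which implies,
\[ E\left(-\frac{\partial^2 \log f(y;\theta,\phi)}{\partial\theta^2}\right) = E\left(\left(\frac{\partial \log f(y;\theta,\phi)}{\partial\theta}\right)^2\right) \]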
These facts are useful to us for deriving the following results.
• Mean :
E(S(θ)) = 0 ⇒ E((Y − b′(θ))/a(φ)) = 0
which in turn implies that E(Y ) = b′ (θ) = µ. The key result here is that µ = b′ (θ).
• Variance :
E(−∂ 2 log f (y; θ, φ)/∂θ2) = E((∂ log f (y; θ, φ)/∂θ)2 )
implies
E(b′′ (θ)/a(φ)) = E( ((y − b′ (θ))/a(φ))2 )
which in turn implies b′′ (θ)/a(φ) = Var(Y )/a(φ)2 and hence Var(Y ) = b′′ (θ)a(φ). The variance
can be written as a product of
b′′ (θ)
- a function of the canonical parameter and hence the mean of the distribution
- sometimes called the variance function and denoted by V (µ) when considered as a function
of µ, the mean
- note that this is not in general the variance of Y .
a(φ)
- a function only of φ
- can commonly be written as a(φ) = φ/w where φ is called a dispersion parameter and w
is a prior weight for this observation.
RJ Cook, October 21, 2008
Link Functions
The link functions relate η (the linear predictor) to µ, the expected value of the random variable
Y.
Canonical Links
Consider the log-likelihood function from the exponential family.
ℓ(θ, φ; y) = (yθ − b(θ))/a(φ) + c(y; φ).
If y1 , y2, ..., yn is a random sample of n observations we introduce a subscript i and write

\[ \ell(\theta, \phi; y) = \sum_{i=1}^{n} \left[ (y_i\theta_i - b(\theta_i))/a_i(\phi) + c(y_i; \phi) \right] \]
If we set
θi = ηi = x′i β
we say we have a canonical link (i.e. canonical parameter = linear predictor)
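\[
\begin{aligned}
\ell(\beta, \phi; y) &= \sum_{i=1}^{n} \left[ (y_i x_i'\beta - b(x_i'\beta))/a_i(\phi) + c(y_i; \phi) \right] \\
&= \sum_{j=1}^{p} \beta_j \left( \sum_{i=1}^{n} y_i x_{ij}/a_i(\phi) \right) - \sum_{i=1}^{n} b(x_i'\beta)/a_i(\phi) + \sum_{i=1}^{n} c(y_i; \phi)
\end{aligned}
\]
If φ is known, \(\sum_{i=1}^{n} y_i x_{ij}\) is a sufficient statistic for βj , or equivalently, Xy is a sufficient statistic
for β.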
Canonical links are useful in terms of their statistical properties, but context and goodness of
fit should motivate the choice of the link. It often turns out that the canonical links are in fact
appropriate (e.g. linear regression with normally distributed observations, logistic regression).

a. Normal Distribution
Dropping the subscript i, consider a single observation from the N(µ, σ 2 ) distribution. It has
density
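\[
\begin{aligned}
f(y; \theta, \phi) &= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\{-(y - \mu)^2/2\sigma^2\} \\
&= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\{-(y^2 - 2\mu y + \mu^2)/2\sigma^2\} \\
&= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\{(y\mu - \mu^2/2)/\sigma^2 - y^2/2\sigma^2\} \\
&= \exp\{(y\mu - \mu^2/2)/\sigma^2 - y^2/2\sigma^2 - \log(2\pi\sigma^2)/2\}
\end{aligned}
\]
Matching this up with the general expression for the exponential family, we see θ = µ, φ = σ²,
\[ a(\phi) = \phi \]
\[ b(\theta) = \tfrac{1}{2}\theta^2 = \tfrac{1}{2}\mu^2 \]
\[ c(y; \phi) = -\tfrac{1}{2}\left(y^2/\phi + \log(2\pi\phi)\right) = -\tfrac{1}{2}\left(y^2/\sigma^2 + \log(2\pi\sigma^2)\right). \]
\[ f(y; \theta, \phi) = \exp\{(y\mu - \mu^2/2)/\sigma^2 - (y^2/\sigma^2 + \log(2\pi\sigma^2))/2\} \]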

• mean : b′ (θ) = θ = µ
• variance : b′′ (θ)a(φ) = 1 · φ = σ 2
• variance function : b′′ (θ) = 1 {= V (µ)}
• canonical link : θ = µ = η (identity)
b. Poisson Distribution
\[ f(y; \lambda) = \lambda^{y} e^{-\lambda}/y! \]
\[ f(y; \theta, \phi) = \exp\{(y \log\lambda - \lambda) - \log y!\} \]

Matching these expressions up with the expression for the general exponential family, we see
θ = log λ and φ = 1. Furthermore,
a(φ) = 1
b(θ) = exp{θ} = λ
c(y; φ) = − log y!
These give
• mean : b′ (θ) = exp{θ} = µ = λ

• variance : b′′ (θ)a(φ) = exp{θ} · 1 = µ
• variance function : b′′ (θ) = exp{θ} = µ {= V (µ)}
• canonical link : θ = log µ = η (log link)
c. Binomial Distribution
If V ∼ Bin(m, π), then

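\[ f(v; \pi) = \binom{m}{v} \pi^{v} (1 - \pi)^{m-v}, \quad v = 0, 1, \ldots, m. \]
Now consider the transformation to Y = V/m. Then,
\[ f(y; \pi) = \binom{m}{my} \pi^{my} (1 - \pi)^{m-my}, \quad y = 0, 1/m, \ldots, 1 \]
We can write this as
\[
\begin{aligned}
f(y; \pi) &= \exp\left\{ my \log\pi + (m - my)\log(1 - \pi) + \log\binom{m}{my} \right\} \\
&= \exp\left\{ my \log(\pi/(1 - \pi)) + m\log(1 - \pi) + \log\binom{m}{my} \right\} \\
&= \exp\left\{ [y \log(\pi/(1 - \pi)) + \log(1 - \pi)]/m^{-1} + \log\binom{m}{my} \right\}
\end{aligned}
\]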
Here we have θ = log(π/(1 − π)), φ = 1, w = m, and
\[ a(\phi) = \phi/w = 1/m \]
\[ b(\theta) = \log(1 + \exp\{\theta\}) = -\log(1 - \pi) \]
\[ c(y; \phi) = \log\binom{m}{my} \]
• mean : b′ (θ) = exp{θ}/(1 + exp{θ}) = µ = π
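• variance :
\[
\begin{aligned}
b''(\theta)a(\phi) &= \frac{e^{\theta}(1 + e^{\theta}) - e^{\theta}(e^{\theta})}{(1 + e^{\theta})^2} \cdot \frac{1}{m} \\
&= \frac{e^{\theta}}{1 + e^{\theta}} \cdot \frac{1}{1 + e^{\theta}} \cdot \frac{1}{m} \\
&= \mu(1 - \mu)/m \\
&= \pi(1 - \pi)/m
\end{aligned}
\]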
• variance function : b′′ (θ) = µ(1 − µ) = V (µ)
• canonical link : θ = log(µ/(1 − µ)) = η (logit link)
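As a numerical sanity check on the three worked examples, the following Python sketch approximates b′(θ) and b′′(θ) by finite differences and compares them with the means and variances stated above. The helper name `derivs` and the parameter values are illustrative assumptions, not part of the notes.

```python
import math

def derivs(b, theta, h=1e-4):
    """Central finite-difference approximations to b'(theta) and b''(theta)."""
    b1 = (b(theta + h) - b(theta - h)) / (2 * h)
    b2 = (b(theta + h) - 2 * b(theta) + b(theta - h)) / h**2
    return b1, b2

# Normal: theta = mu, b(theta) = theta^2 / 2, a(phi) = sigma^2
mu, sigma2 = 1.5, 4.0
mean_norm, vfun_norm = derivs(lambda t: t**2 / 2, mu)
var_norm = vfun_norm * sigma2

# Poisson: theta = log(lambda), b(theta) = exp(theta), a(phi) = 1
lam = 3.0
mean_pois, var_pois = derivs(math.exp, math.log(lam))

# Binomial proportion Y = V/m: theta = logit(pi), b(theta) = log(1 + e^theta), a(phi) = 1/m
pi, m = 0.3, 10
mean_bin, vfun_bin = derivs(lambda t: math.log(1 + math.exp(t)), math.log(pi / (1 - pi)))
var_bin = vfun_bin / m

print(mean_norm, var_norm)   # compare with mu, sigma^2
print(mean_pois, var_pois)   # compare with lambda, lambda
print(mean_bin, var_bin)     # compare with pi, pi(1 - pi)/m
```

The printed values should agree with µ and σ², with λ and λ, and with π and π(1 − π)/m, up to finite-difference error.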
1.2 Iteratively Reweighted Least Squares

1.2.1 The Score Vector
This is a way of interpreting the Newton Raphson algorithm for maximization of the likelihood
function. Consider the log-likelihood for a single observation from the exponential family
ℓ(θ, φ; y) = [(yθ − b(θ))/a(φ) + c(y; φ)]
Recall
• ℓ is a function of θ (we initially assume that φ is known)
• µ can be expressed in terms of θ through µ = b′ (θ)
• η can be expressed in terms of µ through the link function and µ = b′ (θ)
• η can be expressed in terms of β through η = x′ β
To find an MLE of β, we want to solve S(β) = ∂ℓ/∂β = 0.

Consider differentiating with respect to a scalar βj .
By the chain rule
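\[ \frac{\partial\ell}{\partial\beta_j} = \frac{\partial\ell}{\partial\theta} \cdot \frac{\partial\theta}{\partial\mu} \cdot \frac{\partial\mu}{\partial\eta} \cdot \frac{\partial\eta}{\partial\beta_j} \]
where
\[ \frac{\partial\ell}{\partial\theta} = (y - b'(\theta))/a(\phi) \]
\[ \frac{\partial\theta}{\partial\mu} = \left(\frac{\partial\mu}{\partial\theta}\right)^{-1} = \frac{1}{b''(\theta)} \]
\[ \frac{\partial\mu}{\partial\eta} = \frac{\partial\mu}{\partial\eta} \quad \text{(determined by the link function)} \]
\[ \frac{\partial\eta}{\partial\beta_j} = x_j \]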

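Since µ = b′(θ), and V = b′′(θ),
\[
\begin{aligned}
\frac{\partial\ell}{\partial\beta_j} &= \frac{y - b'(\theta)}{a(\phi)} \cdot \frac{1}{b''(\theta)} \cdot \frac{\partial\mu}{\partial\eta} \cdot x_j \\
&= \frac{y - \mu}{a(\phi)} \cdot \frac{1}{V} \cdot \frac{\partial\mu}{\partial\eta} \cdot x_j \\
&= \frac{y - \mu}{\mathrm{Var}(Y)} \cdot \frac{\partial\mu}{\partial\eta} \cdot x_j \\
&= \frac{y - \mu}{\mathrm{Var}(Y)} \left(\frac{\partial\mu}{\partial\eta}\right)^{2} \frac{\partial\eta}{\partial\mu}\, x_j \\
&= (y - \mu) \cdot W \cdot \frac{\partial\eta}{\partial\mu}\, x_j
\end{aligned}
\]
where \(W^{-1} = \mathrm{Var}(Y)(\partial\eta/\partial\mu)^2\).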

With n observations
\[ \frac{\partial\ell}{\partial\beta_j} = \sum_{i=1}^{n} \frac{\partial\ell_i}{\partial\beta_j} = \sum_{i=1}^{n} (y_i - \mu_i)\, W_i\, \frac{\partial\eta_i}{\partial\mu_i}\, x_{ij} \]
where \(W_i^{-1} = \mathrm{Var}(Y_i)(\partial\eta_i/\partial\mu_i)^2\) and ∂ηi/∂µi means ∂η/∂µ evaluated at the covariate vector xi.
The score vector will then take the form S(β) = (∂ℓ/∂β0 , ∂ℓ/∂β1 , . . . , ∂ℓ/∂βp−1 )′. In vector form we
can write S(β) as XW[(y − µ) ∗ ∂η/∂µ], where y = (y1 , . . . , yn )′ and µ = (µ1 , . . . , µn )′
are n × 1 vectors, X = (x1 , . . . , xn ) is a p × n matrix, W denotes the diagonal matrix
\[ W = \begin{pmatrix} W_1 & & & \\ & W_2 & & \\ & & \ddots & \\ & & & W_n \end{pmatrix}, \]
and ∗ denotes an elementwise product.
1.2.2 Newton Raphson and Fisher Scoring
Newton Raphson
\[ \hat\beta^{(r+1)} = \hat\beta^{(r)} + I^{-1}(\hat\beta^{(r)})\, S(\hat\beta^{(r)}) \]
where I is the observed information matrix.
Fisher Scoring Method

Fisher suggested using the expected information matrix rather than the observed information
matrix. In general, this simplifies the computations as we shall see.
Consider, for a single observation (one observation of many, but subscript omitted for convenience)

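\[
\begin{aligned}
I_{jk} = -\frac{\partial^2\ell}{\partial\beta_j\,\partial\beta_k} &= -\frac{\partial}{\partial\beta_k}\left(\frac{\partial\ell}{\partial\beta_j}\right) \\
&= -\frac{\partial}{\partial\beta_k}\left[(y - \mu)\, W\, \frac{\partial\eta}{\partial\mu}\, x_j\right] \\
&= -(y - \mu)\frac{\partial}{\partial\beta_k}\left[W \frac{\partial\eta}{\partial\mu}\, x_j\right] - W \frac{\partial\eta}{\partial\mu}\, x_j\, \frac{\partial}{\partial\beta_k}(y - \mu)
\end{aligned}
\]
But
\[ \frac{\partial\mu}{\partial\beta_k} = \frac{\partial\mu}{\partial\eta} \cdot \frac{\partial\eta}{\partial\beta_k} = \frac{\partial\mu}{\partial\eta} \cdot x_k . \]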
Therefore,

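\[
\begin{aligned}
I_{jk} &= -(y - \mu)\frac{\partial}{\partial\beta_k}\left[W \frac{\partial\eta}{\partial\mu}\, x_j\right] + W \frac{\partial\eta}{\partial\mu}\, x_j\, \frac{\partial\mu}{\partial\eta}\, x_k \\
&= -(y - \mu)\frac{\partial}{\partial\beta_k}\left[W \frac{\partial\eta}{\partial\mu}\, x_j\right] + x_j W x_k
\end{aligned}
\]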
Taking expectations we get,
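\[
\begin{aligned}
I_{jk} = -E\left(\frac{\partial^2\ell}{\partial\beta_j\,\partial\beta_k}\right) &= -E\left\{(y - \mu)\frac{\partial}{\partial\beta_k}\left[W \frac{\partial\eta}{\partial\mu}\, x_j\right]\right\} + E\{x_j W x_k\} \\
&= -\frac{\partial}{\partial\beta_k}\left[W \frac{\partial\eta}{\partial\mu}\, x_j\right] E\{(y - \mu)\} + x_j W x_k
\end{aligned}
\]
Notice that the first term vanishes since E{(y − µ)} = 0 by definition. Then, for n observations we
can write
\[ I_{jk} = \sum_{i=1}^{n} x_{ij} W_i x_{ik} = (XWX')_{jk} \]
where again, W is a diagonal matrix
\[ W = \begin{pmatrix} W_1 & & & \\ & W_2 & & \\ & & \ddots & \\ & & & W_n \end{pmatrix} \]
and \(W_i^{-1} = \mathrm{Var}\{Y_i\}(\partial\eta_i/\partial\mu_i)^2\).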
The Fisher scoring method operates by utilizing

\[ \hat\beta^{(r+1)} = \hat\beta^{(r)} + I^{-1}(\hat\beta^{(r)})\, S(\hat\beta^{(r)}) \]
where I is the expected information matrix given above, as opposed to the observed information
matrix which is used in the Newton-Raphson algorithm.

1.3 Iteratively Re-weighted Least Squares
1.3.1 Overview
Why is this called the iteratively re-weighted least squares ?

It is because of the following manipulation :
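\[
\begin{aligned}
\hat\beta^{(r+1)} &= \hat\beta^{(r)} + I^{-1}(\hat\beta^{(r)})\, S(\hat\beta^{(r)}) \\
I(\hat\beta^{(r)})\,\hat\beta^{(r+1)} &= I(\hat\beta^{(r)})\,\hat\beta^{(r)} + S(\hat\beta^{(r)}) \\
(XW(\hat\beta^{(r)})X')\,\hat\beta^{(r+1)} &= (XW(\hat\beta^{(r)})X')\,\hat\beta^{(r)} + S(\hat\beta^{(r)}) \\
(XW(\hat\beta^{(r)})X')\,\hat\beta^{(r+1)} &= (XW(\hat\beta^{(r)})X')\,\hat\beta^{(r)} + XW(\hat\beta^{(r)})\left[(y - \mu(\hat\beta^{(r)})) * (\partial\eta(\hat\beta^{(r)})/\partial\mu)\right] \\
(XW(\hat\beta^{(r)})X')\,\hat\beta^{(r+1)} &= XW(\hat\beta^{(r)})\left[X'\hat\beta^{(r)} + (y - \mu(\hat\beta^{(r)})) * (\partial\eta(\hat\beta^{(r)})/\partial\mu)\right]
\end{aligned}
\]
Let z = η + (y − µ) ∗ ∂η/∂µ. Then
\[ \hat\beta^{(r+1)} = (XW(\hat\beta^{(r)})X')^{-1}\, XW(\hat\beta^{(r)})\, z(\hat\beta^{(r)}) \]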
This is the same as the weighted LS estimate of β with dependent variable z(β̂ (r) ) and weight matrix
W(β̂ (r) ). Since we are updating the weights W (and the working response z) with each iteration, it is
called re-weighted least squares, and since we have to repeat this estimation procedure until
convergence, it is called iteratively re-weighted least squares.
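The update above can be sketched in Python for one concrete case, Poisson responses with the log link. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name `irls_poisson`, the starting values µ = y + 0.5, and the toy data are all hypothetical, and the design matrix is stored as n × p (one row x_i′ per observation), i.e. the transpose of the p × n matrix X used in the text.

```python
import numpy as np

def irls_poisson(X, y, max_iter=50, tol=1e-10):
    """IRLS / Fisher scoring for Poisson regression with log link.

    X is n x p (rows are x_i'), the transpose of the p x n matrix in the text.
    """
    n, p = X.shape
    beta = np.zeros(p)
    mu = y + 0.5              # assumed starting values, kept away from zero
    eta = np.log(mu)
    for _ in range(max_iter):
        deta_dmu = 1.0 / mu                    # log link: eta = log(mu)
        W = 1.0 / (mu * deta_dmu**2)           # W^{-1} = Var(Y) (d eta / d mu)^2, Var(Y) = mu
        z = eta + (y - mu) * deta_dmu          # working response
        XtW = X.T * W                          # p x n, column i scaled by W_i
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)   # weighted least squares step
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        eta = X @ beta
        mu = np.exp(eta)
        if converged:
            break
    return beta

# Toy data (hypothetical): intercept plus one covariate
X = np.column_stack([np.ones(5), np.arange(5.0)])
y = np.array([1.0, 2.0, 4.0, 7.0, 12.0])
beta_hat = irls_poisson(X, y)
score = X.T @ (y - np.exp(X @ beta_hat))   # should be numerically zero at the MLE
print(beta_hat, score)
```

For the log link the weights reduce to W_i = µ_i, so each pass is an ordinary weighted least squares fit of z on the covariates. Checking that the score is close to zero at the returned value is a simple way to confirm convergence.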
Note:
1. Consider a Taylor series approximation of g(y) about µ to get
\[ g(y) = g(\mu) + g'(\mu)(y - \mu) + g''(\mu)(y - \mu)^2/2 + \cdots \]
z = η + (y − µ)∂η/∂µ can be thought of as a “linearized form” of the link function. That is, it provides
a linear approximation to the functional relationship between the mean of the distribution and
the linear predictor.
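2. This motivates the choice of W .
\[ \mathrm{Var}(Z) = \left(\frac{\partial\eta}{\partial\mu}\right)^{2} \mathrm{Var}(Y) = \left(\frac{\partial\eta}{\partial\mu}\right)^{2} a(\phi)\, b''(\theta) \]
\[ \text{inverse variance} \rightarrow W = \frac{1}{\mathrm{Var}(Z)} = (\mathrm{Var}(Y))^{-1}\left(\frac{\partial\mu}{\partial\eta}\right)^{2} = \frac{1}{a(\phi)\, b''(\theta)}\left(\frac{\partial\mu}{\partial\eta}\right)^{2} \]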

1.3.2 When is Fisher Scoring Method Equivalent to Newton Raphson?
This question can be equivalently re-phrased as “When is the expected information matrix the same
as the observed information matrix ? ”
Recall
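\[ I_{jk} = -\frac{\partial^2\ell}{\partial\beta_j\,\partial\beta_k} = -(y - \mu)\frac{\partial}{\partial\beta_k}\left[W \frac{\partial\eta}{\partial\mu}\, x_j\right] - W \frac{\partial\eta}{\partial\mu}\, x_j\, \frac{\partial}{\partial\beta_k}(y - \mu) \]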
Consider the first term of the above expression for the observed information matrix. Recall that
V = b′′ (θ) = ∂b′ (θ)/∂θ = ∂µ/∂θ .

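Then
\[
\begin{aligned}
W &= \frac{1}{a(\phi)\, V}\left(\frac{\partial\mu}{\partial\eta}\right)^{2} \\
&= \frac{1}{a(\phi)\, \partial\mu/\partial\theta}\, \frac{\partial\mu}{\partial\eta}\, \frac{\partial\mu}{\partial\eta} \\
&= \frac{1}{a(\phi)}\, \frac{\partial\theta}{\partial\mu}\, \frac{\partial\mu}{\partial\eta} \cdot \frac{\partial\mu}{\partial\eta} \\
&= \frac{1}{a(\phi)} \cdot \frac{\partial\mu}{\partial\eta} \qquad \text{(with canonical link: } \theta = \eta\text{)}
\end{aligned}
\]
Therefore, with the canonical link,
\[ W \frac{\partial\eta}{\partial\mu}\, x_j = x_j/a(\phi) \]
and since
\[ \frac{\partial}{\partial\beta_k}\left[x_j/a(\phi)\right] = 0 \]
the expected information matrix equals the observed information matrix. Hence, there is no
difference between the Newton Raphson algorithm and the Fisher Scoring algorithm.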
The difference arises when using other (non-canonical) link functions!
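The equivalence can be checked numerically. The sketch below, under assumed toy data and parameter values, compares a finite-difference estimate of the observed information with the expected information XWX′ for Poisson responses, once with the canonical log link and once with the (non-canonical) identity link. The function names and data are hypothetical, and X is stored as n × p, the transpose of the matrix in the text.

```python
import numpy as np

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # n x p, rows are x_i'
y = np.array([2.0, 1.0, 5.0])
beta = np.array([1.0, 0.5])   # arbitrary evaluation point (not the MLE)

def loglik(b, inv_link):
    mu = inv_link(X @ b)
    return np.sum(y * np.log(mu) - mu)   # Poisson log-likelihood without -log(y!)

def observed_info(b, inv_link, h=1e-4):
    """Negative Hessian of the log-likelihood by central finite differences."""
    p = len(b)
    H = np.zeros((p, p))
    for j in range(p):
        for k in range(p):
            ej = np.eye(p)[j] * h
            ek = np.eye(p)[k] * h
            H[j, k] = (loglik(b + ej + ek, inv_link) - loglik(b + ej - ek, inv_link)
                       - loglik(b - ej + ek, inv_link) + loglik(b - ej - ek, inv_link)) / (4 * h * h)
    return -H

def expected_info(b, inv_link, dmu_deta):
    eta = X @ b
    mu = inv_link(eta)
    W = dmu_deta(eta) ** 2 / mu          # W = (d mu / d eta)^2 / Var(Y), with Var(Y) = mu
    return (X.T * W) @ X

obs_log = observed_info(beta, np.exp)                       # canonical link
exp_log = expected_info(beta, np.exp, np.exp)
obs_id = observed_info(beta, lambda e: e)                   # identity link
exp_id = expected_info(beta, lambda e: e, np.ones_like)

print(np.round(obs_log - exp_log, 4))   # approximately zero
print(np.round(obs_id - exp_id, 4))     # clearly non-zero
```

With the log link the two matrices agree up to finite-difference error, while with the identity link they differ because the term multiplying (y − µ) no longer vanishes for the observed data.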
Questions
Problem 1.1.

Suppose a population is divided into n strata, and we sample mi subjects from the ith stratum,
i = 1, . . . , n. Let Yij ∼ N(µi , φ) (independently distributed) denote the response for the jth subject
in the ith stratum of the sample, j = 1, . . . , mi , i = 1, . . . , n. Suppose all that is available are
the sample means for each strata, ȳ1 , ȳ2 , . . . , ȳn , and an associated p × 1 covariate vector xi =
(xi0 , xi1 , . . . , xi,p−1 )′ , where xi0 = 1, i = 1, . . . , n. Then
\[ f(\bar y_i; \mu_i, \phi) = \frac{\sqrt{m_i}}{\sqrt{2\pi\phi}} \exp\{-m_i(\bar y_i - \mu_i)^2/(2\phi)\} \]
where −∞ < ȳi < ∞, −∞ < µi < ∞, and φ > 0.
a. Show that the distribution of Ȳi belongs to the exponential family and find the functions ai (·),
b(·), c(·; ·), E(Ȳi ), Var(Ȳi ), and the canonical link function of g(µ) = η.
b. Given the data ȳ1 , ȳ2, . . . , ȳn and the linear predictor ηi = xi ′ β, find the specific form of the score
vector and information matrix for β and explain how you would obtain maximum likelihood
estimates of β0 , β1 , . . . , βp−1 .
c. Relate the Newton-Raphson algorithm to any other method of model fitting you may have seen
before.
Problem 1.2.
Consider a setting where Yi1 , . . . , Yimi are mi independently distributed Poisson random variables
with Yij ∼ Poisson(µij ), j = 1, . . . , mi , i = 1, . . . , n. Moreover, assume that the Poisson counts are
generated by a time homogeneous Poisson process with µij = λi tij , where λi is an underlying rate
assumed to be common for Yi1 , . . . , Yimi and tij is the duration of observation leading to the count
yij , j = 1, . . . , mi , i = 1, . . . , n. Finally, assume that associated with Yi1 , . . . , Yimi is a p × 1 covariate
vector xi = (xi0 , xi1 , . . . , xi,p−1 )′ where xi0 = 1, i = 1, . . . , n, and let β = (β0 , . . . , βp−1 )′ .
a. Write down the likelihood for the rate functions λ = (λ1 , . . . , λn )′ .
b. Show that the distribution for \(Y_{i\cdot} = \sum_{j=1}^{m_i} Y_{ij}\) belongs to the exponential family and hence find
the functions a(·), b(·), c(·; ·), E(Yi· ), Var(Yi· ), and the canonical link function g(µ) = η.
c. Given summary data y1· , y2· , . . . , yn· and the linear predictors ηi = xi ′ β, find the specific
form for an entry of the score vector (i.e. ∂ℓ/∂βj ) and the expected information matrix (i.e.
E(−∂ 2 ℓ/∂βj ∂βk )) under the canonical link.
d. Briefly describe how to obtain maximum likelihood estimates of β0 , β1 , . . . , βp−1 using a Fisher
Scoring algorithm.
Problem 1.3.
Let Y1 , Y2 , . . . , Yn be independent Poisson random variables with means µ1 , µ2 , . . . , µn respectively,
and let Y = (Y1 , . . . , Yn )′ . Associated with each Yi is a covariate vector, xi = (1, xi1 , . . . , xi,p−1 )′ , of
length p.
a. Show that ηi = log µi is the natural parameter of the Poisson distribution.
b. Find the score vector for β.
c. Find the observed and expected information matrix for β and hence show how to obtain the
MLE for β.
d. FOR STAT 831 ONLY
Show that T = XY is a vector of sufficient statistics for β where X is the p × n matrix with
columns comprised of covariate vectors.
Problem 1.4.

Suppose you observe a sample y1 , y2, . . . , yn consisting of realizations of n independent Poisson

random variables where E(Yi ) = µi . Suppose that associated with yi is a p × 1 vector of explanatory
variables (1, xi1 , xi2 , . . . , xi,p−1 )′ . A Poisson regression model with the canonical link takes the form
log(µi ) = β0 + β1 xi1 + . . . + βp−1 xi,p−1 .
The likelihood for the vector of regression coefficients β is constructed by making the substitution
above into the likelihood for the n means µ1 , . . . , µn . To answer the following questions, you may
either calculate the derivatives using standard methods, or use the general results given in class for
the exponential family.
a. Write down the score vector for the regression coefficients βi , i = 0, . . . , p − 1.
b. Write down the observed and expected information matrix for the regression coefficients. Are
they the same or different? Why?
c. What is the form of the weight function? What types of observations will have the largest and
smallest weights?
Problem 1.5.
Suppose yi1 , yi2 , . . . , yimi are observations from a Gaussian distribution with mean µi and variance
σ 2 , i = 1, 2, . . . , I. Associated with each yi = (yi1 , yi2 , . . . , yimi )′ is a vector of explanatory variables
xi = (xi0 , xi1 , . . . , xi,p−1 )′ .
a. Show that \(\bar y_i = \sum_{j=1}^{m_i} y_{ij}/m_i\) is sufficient for µi.
b. Show that the distribution of ȳi is in the exponential family and identify the parameters θi and
φ, and the functions ai (φ), b(θi ), and c(yi; φ).
c. If we want to set up a regression model, what is the canonical link?
d. Write down the score and information function and indicate connections between the Fisher
scoring/Newton Raphson iterations and another method for estimating regression parameters
that you’ve encountered before.
Problem 1.6.
Consider the table below which summarizes data from two samples with independent binomial
responses for each.
                 Outcome
           Present    Absent        Total
Group 1    y          m1 − y        m1
Group 2    t − y      m2 − t + y    m2
Total      t          m. − t        m.
The conditional distribution of y given t, the first column total, (and m1 and m2 ) is
\[ f(y \mid t, m_1, m_2) = \frac{\binom{m_1}{y}\binom{m_2}{t-y}\exp\{y\alpha\}}{\sum_{v \in S}\binom{m_1}{v}\binom{m_2}{t-v}\exp\{v\alpha\}} \]
where S = {v : max(0, t − m2 ) ≤ v ≤ min(m1 , t)}.

a. Show that this distribution belongs to the exponential family and hence find the canonical
parameter, the functions a(·), b(·), c(· ; ·), E(Y ), var(Y ) and the canonical link function.
b. Suppose now that we have a series of n independent 2 × 2 tables of the sort above and let xi =
(1, xi1 , . . . , xi,p−1 )′ denote a p × 1 vector of explanatory variables for the ith table. Introducing
subscripts to distinguish data from different tables, we then summarize the data from table i as
                 Outcome
           Present    Absent           Total
Group 1    yi         m1i − yi         m1i
Group 2    ti − yi    m2i − ti + yi    m2i
Total      ti         m.i − ti         m.i
The conditional distribution of yi given ti , the first column total, (and m1i and m2i ) is
\[ f(y_i \mid t_i, m_{1i}, m_{2i}) = \frac{\binom{m_{1i}}{y_i}\binom{m_{2i}}{t_i-y_i}\exp\{y_i\alpha_i\}}{\sum_{v \in S_i}\binom{m_{1i}}{v}\binom{m_{2i}}{t_i-v}\exp\{v\alpha_i\}} \]
with Si = {v : max(0, ti − m2i ) ≤ v ≤ min(m1i , ti )}. Given the data y1 , y2 , . . . , yn and the linear
predictor ηi = x′i β where β = (β0 , . . . , βp−1)′ , find the specific form of the score and information
function and explain how you would obtain maximum likelihood estimates of β0 , β1 , . . . , βp−1.
Problem 1.7.
Consider n 2 × 2 tables where the ith table is given by
           Success    Failure          Total
Group 1    yi         mi1 − yi         mi1
Group 2    ti − yi    mi2 − ti + yi    mi2
Total      ti         mi − ti          mi
a. Derive the conditional distribution of Yi given mi1 , mi2 and ti and show that this belongs to the
exponential family of distributions. Find the canonical parameter and hence find the conditional
mean and variance of Yi .
b. Suppose that associated with the ith table is a p × 1 covariate vector xi = (1, xi1 , . . . , xi,p−1 )′ ,
i = 1, . . . , n. Use the canonical link and the systematic component as η = X ′ β where X is a
p × n matrix of column vectors x1 , . . . , xn. Find the score vector and the observed and expected
information matrix for β, and hence show how to obtain the MLE of β.
Problem 1.8.
Suppose a population is divided into n strata, and we sample mi subjects from the ith stratum,
i = 1, . . . , n. Let Yij ∼ Bernoulli(πi ) (independently distributed) denote the response for the jth
subject in the ith stratum of the sample, so that P(Yij = 1) = πi and P(Yij = 0) = 1 − πi ,
j = 1, . . . , mi , i = 1, . . . , n. Suppose that the responses available from the strata are simply the
means y1 , y2 , . . . , yn (where \(y_i = \sum_{j=1}^{m_i} y_{ij}/m_i\), i = 1, . . . , n), and that the associated sample sizes,
m1 , . . . , mn , are also available.
a. Show that the distribution of yi belongs to the exponential family by identifying the functions
a(·), b(·) and c(· ; ·), obtain E(Yi ) and Var(Yi), and name the canonical link function.

b. Suppose that associated with yi is a p × 1 covariate vector xi = (1, xi1 , . . . , xi,p−1 )′ , i = 1, . . . , n.

If β = (β0 , . . . , βp−1)′ is a p × 1 vector of regression coefficients, let ηi = x′i β denote the linear
predictor for stratum i, i = 1, . . . , n. Find the specific form of the score vector and information
matrix and explain how you would obtain the maximum likelihood estimate of β, which we
denote as β̂.
c. What is the mean vector and covariance matrix for the asymptotic distribution of the MLE β̂?
d. STAT 831 ONLY
Is there any “information” lost about the vector of regression coefficients, β, when only the
sample means and the sample sizes are available from each stratum (as opposed to when the
individual subjects’ responses are available)? Explain.

2 Basic methods for the Analysis of Binary Data
2.1 Introduction
Binary responses generally require different analysis techniques than have been considered so far in
regression courses. Examples of binary responses include disease status (diseased/not diseased) and
survival status (dead/alive). In addition to having such a binary response, we often have a single binary
covariate of interest. Examples include treatment (experimental/control) or exposure (exposed to
radiation/not exposed to radiation) variables. To summarize the data, we might construct a 2 × 2
table as follows.
Table 2.1. A 2 × 2 Table
                 Disease
           Present    Absent      Total
Group 1    y1         m1 − y1     m1
Group 2    y2         m2 − y2     m2
Total      y.         m. − y.     m.
If we have m1 and m2 fixed, we typically assume we’ve got two independent binomial samples with
Yk ∼ Bin(mk , πk ), k = 1, 2. There are many measures of association one can consider for such
tables, but here we will focus on the odds ratio. The odds of one event versus another is simply the
ratio of their respective probabilities. Therefore, the odds of disease versus no disease in Group 1
is π1 /(1 − π1 ). We sometimes just refer to this as the odds of disease in Group 1. In this context,
the odds is a 1-1 monotonically increasing function of π1 which takes on values on the non-negative
real line. The odds of disease in Group 2 is π2 /(1 − π2 ). The odds ratio reflecting the relative odds
of disease in Group 1 versus Group 2 is then
\[ \psi = \frac{\pi_1/(1 - \pi_1)}{\pi_2/(1 - \pi_2)} . \]
Note that in the case of a “rare” disease (i.e. when π1 and π2 are very small), ψ is close to the
relative risk, π1 /π2 . This can be seen by noting that
\[ \psi = \frac{\pi_1/(1 - \pi_1)}{\pi_2/(1 - \pi_2)} = \frac{\pi_1}{\pi_2}\left(\frac{1 - \pi_2}{1 - \pi_1}\right). \]
When π1 and π2 are both small, the fraction in parentheses is close to 1.
2.2 Estimation of the Odds Ratio
We would like to use likelihood theory to estimate ψ and therefore need to construct an appropriate
likelihood function. Note that

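\[ \Pr(Y_1 = y_1, Y_2 = y_2) = \binom{m_1}{y_1}\pi_1^{y_1}(1 - \pi_1)^{m_1 - y_1}\binom{m_2}{y_2}\pi_2^{y_2}(1 - \pi_2)^{m_2 - y_2} \]
\[
\begin{aligned}
L(\pi_1, \pi_2) &= \pi_1^{y_1}(1 - \pi_1)^{m_1 - y_1}\,\pi_2^{y_2}(1 - \pi_2)^{m_2 - y_2} \\
&= \left(\frac{\pi_1}{1 - \pi_1}\right)^{y_1}(1 - \pi_1)^{m_1}\left(\frac{\pi_2}{1 - \pi_2}\right)^{y_2}(1 - \pi_2)^{m_2} \\
&= \left(\frac{\pi_1/(1 - \pi_1)}{\pi_2/(1 - \pi_2)}\right)^{y_1}(1 - \pi_1)^{m_1}\left(\frac{\pi_2}{1 - \pi_2}\right)^{y_2 + y_1}(1 - \pi_2)^{m_2}
\end{aligned}
\]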
We want to reparameterize to get rid of π1 and so note that if ψ = π1 /(1 − π1 )/[π2 /(1 − π2 )] then
π1 = ψπ2 /[1 − π2 + ψπ2 ]. Substituting into the above likelihood we get
\[ L(\psi, \pi_2) = \psi^{y_1}\left(\frac{\pi_2}{1 - \pi_2}\right)^{y_2 + y_1}\left(\frac{1 - \pi_2}{1 - \pi_2 + \psi\pi_2}\right)^{m_1}(1 - \pi_2)^{m_2} \]
Now that we have a likelihood involving the parameters of interest, we can consider further
reparameterization to enable us to obtain Wald-type quantities for inference. Wald-type quantities are
most appealing when the corresponding parameters are unrestricted (i.e. the parameter space is the
real line). Therefore consider reparameterizing to β = log ψ and α = log(π2 /(1 − π2 )). Here we get
\[ L(\alpha, \beta) = e^{y_1\beta}(1 + e^{\alpha+\beta})^{-m_1}\, e^{y_\cdot\alpha}(1 + e^{\alpha})^{-m_2}, \]
where y· = y1 + y2 . Then the log-likelihood is
\[ \ell(\alpha, \beta) = y_1\beta - m_1\log(1 + e^{\alpha+\beta}) + y_\cdot\alpha - m_2\log(1 + e^{\alpha}). \]
Differentiating with respect to α and β we get

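\[ S_\alpha(\alpha, \beta) = \frac{\partial\ell(\alpha, \beta)}{\partial\alpha} = -\frac{m_1 e^{\alpha+\beta}}{1 + e^{\alpha+\beta}} + y_\cdot - \frac{m_2 e^{\alpha}}{1 + e^{\alpha}} \]
\[ S_\beta(\alpha, \beta) = \frac{\partial\ell(\alpha, \beta)}{\partial\beta} = y_1 - \frac{m_1 e^{\alpha+\beta}}{1 + e^{\alpha+\beta}} \]
If S(α, β) = (Sα (α, β), Sβ (α, β)), solving S(α, β) = 0 gives α̂ = log(y2 /(m2 − y2 )) and
\[ \hat\beta = \log\left(\frac{y_1/(m_1 - y_1)}{y_2/(m_2 - y_2)}\right). \]
These estimates are natural since they imply π̂2 = e^α̂ /(1 + e^α̂ ) = y2 /m2 , π̂1 = y1 /m1
and ψ̂ = [π̂1 /(1 − π̂1 )]/[π̂2 /(1 − π̂2 )]. Note that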

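\[
\begin{aligned}
I_{\alpha\alpha} &= -\left[-m_1\frac{e^{\alpha+\beta}(1 + e^{\alpha+\beta}) - e^{\alpha+\beta}e^{\alpha+\beta}}{(1 + e^{\alpha+\beta})^2} - m_2\frac{e^{\alpha}(1 + e^{\alpha}) - e^{2\alpha}}{(1 + e^{\alpha})^2}\right] \\
&= m_1\frac{e^{\alpha+\beta}}{(1 + e^{\alpha+\beta})^2} + m_2\frac{e^{\alpha}}{(1 + e^{\alpha})^2}
\end{aligned}
\]
\[
\begin{aligned}
I_{\alpha\beta} &= -\left[-m_1\frac{e^{\alpha+\beta}(1 + e^{\alpha+\beta}) - e^{\alpha+\beta}e^{\alpha+\beta}}{(1 + e^{\alpha+\beta})^2}\right] \\
&= m_1\frac{e^{\alpha+\beta}}{(1 + e^{\alpha+\beta})^2}
\end{aligned}
\]
\[
\begin{aligned}
I_{\beta\beta} &= -\left[-m_1\frac{e^{\alpha+\beta}(1 + e^{\alpha+\beta}) - e^{\alpha+\beta}e^{\alpha+\beta}}{(1 + e^{\alpha+\beta})^2}\right] \\
&= m_1\frac{e^{\alpha+\beta}}{(1 + e^{\alpha+\beta})^2}
\end{aligned}
\]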
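We are interested in the (β, β) entry of \(I^{-1}\) which we denoted by \(I^{\beta\beta}\) in section 1.2. This is given
by \(I^{\beta\beta}(\alpha, \beta) = [I_{\beta\beta} - I_{\beta\alpha} I_{\alpha\alpha}^{-1} I_{\alpha\beta}]^{-1}\) and here we obtain
\[ I^{\beta\beta}(\alpha, \beta) = \frac{1}{E(Y_1)} + \frac{1}{E(m_1 - Y_1)} + \frac{1}{E(Y_2)} + \frac{1}{E(m_2 - Y_2)}, \]
which we evaluate at α̂, β̂ to give
\[ I^{\beta\beta}(\hat\alpha, \hat\beta) = \frac{1}{y_1} + \frac{1}{m_1 - y_1} + \frac{1}{y_2} + \frac{1}{m_2 - y_2}. \]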
Proof : First note that we can write Iαα = m1 π1 (1 − π1 ) + m2 π2 (1 − π2 ), Iαβ = Iβα = m1 π1 (1 − π1 )

and Iββ = m1 π1 (1 − π1 ). Now we get
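\[
\begin{aligned}
[I^{-1}]_{\beta\beta} = I^{\beta\beta} &= [I_{\beta\beta} - I_{\beta\alpha} I_{\alpha\alpha}^{-1} I_{\alpha\beta}]^{-1} \\
&= \left[m_1\pi_1(1 - \pi_1) - \frac{(m_1\pi_1(1 - \pi_1))^2}{m_1\pi_1(1 - \pi_1) + m_2\pi_2(1 - \pi_2)}\right]^{-1} \\
&= \left[\frac{m_1\pi_1(1 - \pi_1)\, m_2\pi_2(1 - \pi_2)}{m_1\pi_1(1 - \pi_1) + m_2\pi_2(1 - \pi_2)}\right]^{-1} \\
&= \frac{1}{m_1\pi_1(1 - \pi_1)} + \frac{1}{m_2\pi_2(1 - \pi_2)} \\
&= \frac{1}{m_1\pi_1} + \frac{1}{m_1(1 - \pi_1)} + \frac{1}{m_2\pi_2} + \frac{1}{m_2(1 - \pi_2)} \\
&= \frac{1}{E(Y_1)} + \frac{1}{E(m_1 - Y_1)} + \frac{1}{E(Y_2)} + \frac{1}{E(m_2 - Y_2)} .
\end{aligned}
\]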
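The closed forms above can be checked numerically. The following Python sketch, using illustrative counts chosen here as an assumption, evaluates the score at (α̂, β̂), builds the 2 × 2 information matrix in the order (α, β), inverts it, and compares its (β, β) entry with the sum of reciprocal cell counts.

```python
import numpy as np

# Illustrative counts (assumed for this sketch)
y1, m1, y2, m2 = 20, 336, 46, 419

alpha_hat = np.log(y2 / (m2 - y2))
beta_hat = np.log((y1 / (m1 - y1)) / (y2 / (m2 - y2)))

# Fitted probabilities implied by (alpha_hat, beta_hat)
pi1 = 1.0 / (1.0 + np.exp(-(alpha_hat + beta_hat)))
pi2 = 1.0 / (1.0 + np.exp(-alpha_hat))

# Score vector (S_alpha, S_beta) evaluated at the estimates
score = np.array([-m1 * pi1 + (y1 + y2) - m2 * pi2,
                  y1 - m1 * pi1])

# Information matrix in the order (alpha, beta)
A = m1 * pi1 * (1 - pi1)
B = m2 * pi2 * (1 - pi2)
info = np.array([[A + B, A],
                 [A, A]])

var_beta = np.linalg.inv(info)[1, 1]
closed_form = 1 / y1 + 1 / (m1 - y1) + 1 / y2 + 1 / (m2 - y2)
print(score, var_beta, closed_form)
```

The score should be zero up to rounding, and the two variance expressions should agree.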
Given this result we may obtain a Wald-type approximate 95% CI for β as
\[ \left( \hat\beta - 1.96\left(\frac{1}{y_1} + \frac{1}{m_1 - y_1} + \frac{1}{y_2} + \frac{1}{m_2 - y_2}\right)^{1/2},\; \hat\beta + 1.96\left(\frac{1}{y_1} + \frac{1}{m_1 - y_1} + \frac{1}{y_2} + \frac{1}{m_2 - y_2}\right)^{1/2} \right) \]
which we denote as (β̂L , β̂U ). An approximate 95% CI for ψ is then given by \((e^{\hat\beta_L}, e^{\hat\beta_U})\). We are
not typically that interested in α or π2 in such 2 × 2 tables. Note that given the likelihood and
the asymptotic (large sample) results in section 1.2 we could use the likelihood itself to conduct
inference about β and hence ψ. The Wald-type pivotal used here is much more convenient and since
the range of values for β is unrestricted the results will generally agree very closely.
2.3 Multiple Regression for Binary Responses
The results of the preceding section were directed at the case with a single factor variable with two
levels and a binary response. This is a simple setting but more often we need multiple regression
methodology since we may
a. want to be able to control for confounding variables and hence want to examine the effect of
several (possibly related collinear) variables simultaneously,
b. want to examine the effect of categorical covariates (>2 levels) or continuous covariates.
c. want to develop sophisticated models that describe complex relationships.
Example : Consider the data in the following table which describes the relationship between the
level of prenatal care and fetal mortality. The data arose from two clinics which we refer to as Clinic
A and Clinic B (not their real names!).
Table 2.2. Prenatal Care Data from Two Clinics.
             Died    Survived    Total
Intensive     20       316        336
Regular       46       373        419
Total         66       689        755
Here we obtain ψ̂ = [π̂1 /(1 − π̂1 )]/[π̂2 /(1 − π̂2 )] = (20/316)/(46/373) = 0.51 and a 95% CI for ψ of
(0.30, 0.89). This suggests a strong association between level of prenatal care and fetal mortality.
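The estimate and interval quoted for Table 2.2 follow directly from the formulas of section 2.2. A minimal Python sketch is given below; the function name `odds_ratio_wald_ci` is a hypothetical helper introduced for illustration.

```python
import math

def odds_ratio_wald_ci(y1, m1, y2, m2, z=1.96):
    """Odds ratio estimate and Wald-type CI from two binomial samples.

    y1 of m1 events in Group 1, y2 of m2 events in Group 2.
    """
    beta_hat = math.log((y1 / (m1 - y1)) / (y2 / (m2 - y2)))
    se = math.sqrt(1 / y1 + 1 / (m1 - y1) + 1 / y2 + 1 / (m2 - y2))
    lower, upper = beta_hat - z * se, beta_hat + z * se
    # Exponentiate the interval for beta = log(psi) to get an interval for psi
    return math.exp(beta_hat), (math.exp(lower), math.exp(upper))

# Table 2.2: Intensive (Group 1) versus Regular (Group 2)
psi_hat, (psi_lo, psi_hi) = odds_ratio_wald_ci(20, 336, 46, 419)
print(round(psi_hat, 2), round(psi_lo, 2), round(psi_hi, 2))
```

Rounded to two decimals, the result matches the values 0.51 and (0.30, 0.89) reported in the text. The same function can be applied to the clinic-specific tables.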
However if we consider data from just those subjects who are at Clinic A, we get the following table.
Table 2.3. Prenatal Care Data from Patients at Clinic A
             Died    Survived    Total
Intensive     16       293        309
Regular       12       176        188
Total         28       469        497
This gives ψ̂ = 16 × 176/[12 × 293] = 0.80 with a 95% CI for ψ of (0.37, 1.73). While the odds ratio
estimate is in the direction of a protective effect with intensive prenatal care, the confidence interval
is quite wide and includes values above one (which correspond to an increased risk of mortality).
Now we consider the corresponding data from Clinic B.
Table 2.4. Prenatal Care Data from Patients at Clinic B
             Died    Survived    Total
Intensive      4        23         27
Regular       34       197        231
Total         38       220        258
This gives ψ̂ = 4 × 197/[23 × 34] = 1.01 with a 95% CI for ψ of (0.33, 3.10). Note that the
reduction in the odds of fetal mortality due to intensive prenatal care which we observed in the
pooled data appears to have vanished for patients in Clinic B, and we found above that the estimate
of benefit is considerably smaller (and no longer significant) among patients in Clinic A. For further
investigation, we examine the relationship between clinic and level of care.
Table 2.5. The Association Between Clinic and Level of Care
             Clinic A    Clinic B    Total
Intensive      309          27        336
Regular        188         231        419
Total          497         258        755
Here we get ψ̂ = 14.06 with a 95% CI for ψ of (9.12, 21.76). This suggests a very strong and
statistically significant relationship between clinic and the intensity of prenatal care. Specifically, we
can see that the proportion of patients in Clinic A receiving intensive prenatal care is considerably
higher than it is for patients in Clinic B. The following table displays the relationship between clinic
and mortality.
Table 2.6. The Association Between Clinic and Mortality Rate
Clinic A Clinic B
Died 28 38 66
Survived 469 220 689
497 258 755
Here we obtain ψ̂ = 0.35 with a 95% CI (0.21, 0.58). This suggests that there is a statistically
significantly (significant at the 5% level since the 95% confidence interval does not include one)
higher rate of mortality in Clinic B. Finally, we can tabulate the relationship between clinic and
mortality stratified by level of care. We do that in Table 2.7. We will return to this example
shortly.
To summarize what we found here, we found that there is an apparent strong association between
level of prenatal care and fetal mortality. When stratifying by clinic, evidence of this apparent
association is greatly reduced. When we stratify by level of prenatal care, there is a reduced risk
of mortality for patients in Clinic A versus Clinic B. We aim to study how these findings might be
reflected in a regression model.

Table 2.7. Stratified Tabulation of Clinic Effects
                   Intensive                    Regular
            Died   Survived   Total      Died   Survived   Total
Clinic A     16      293       309        12      176       188
Clinic B      4       23        27        34      197       231
Total        20      316       336        46      373       419
2.4 Setting up a Binomial Regression Model

2.4.1 Introduction and Notation
Let x1 , x2 , ..., xp−1 be a set of p − 1 explanatory variables and x = (1, x1 , x2 , ..., xp−1 )′ be a p × 1
vector of explanatory variables. Let β = (β0 , β1 , ..., βp−1)′ be a p ×1 vector of parameters. The scalar
quantity
η = x′ β = β0 + β1 x1 + · · · + βp−1 xp−1
is called the linear predictor. Let xi = (1, xi1 , xi2 , ..., xi,p−1 )′ be the vector of covariates for the ith
subject, i = 1, 2, . . . , n. Define
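\[ X = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_{11} & x_{21} & \cdots & x_{n1} \\ \vdots & \vdots & & \vdots \\ x_{1,p-1} & x_{2,p-1} & \cdots & x_{n,p-1} \end{pmatrix} = (x_1, x_2, \ldots, x_n) \]
Then the vector of linear predictors is given by
\[ \eta = \begin{pmatrix} \eta_1 \\ \vdots \\ \eta_n \end{pmatrix} = X'\beta \]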
Recall in the context of the Gaussian linear model Yi ∼ N(µi , σ 2 ) are independent and we set
E(Yi) = µi = ηi = β0 + β1 xi1 + β2 xi2 + · · · + βp−1 xi,p−1 ,
Now consider binomial data with Yi ∼ Bin(mi , πi ).
We might think of a regression model of the form
E(Yi/mi ) = πi = ηi = β0 + β1 xi1 + · · · + βp−1 xi,p−1
Is this a reasonable model? It is not particularly convenient to work with since we’ve got to impose
constraints on the RHS because 0 ≤ πi ≤ 1, i = 1, 2, ..., n. Therefore, rather than working with πi
directly, we work with a function of it. The so-called link function defines such a transformation,
which typically maps [0, 1] → (−∞, +∞). We denote the link function by g(π).
Name of Link Function    Expression
Identity                 g(π) = π
log-log                  g(π) = log(− log(π))
Probit†                  g(π) = Φ⁻¹(π)
Logit                    g(π) = log(π/(1 − π))

† Φ is the cdf for a standard normal random variable.
Having selected the logit link, our regression model takes the form g(π) = η. Introducing the
subscript i to distinguish individuals we have
\[ \log\left(\frac{\pi_i}{1 - \pi_i}\right) = x_i'\beta = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i,p-1} \]
2.4.2 The Logit Link and Odds Ratios
Let Y1 and Y2 denote the number of individuals with the outcome in Groups 1 and 2 respectively.
We let Yk ∼ Bin(mk , πk ), k = 1, 2. Let xi = 1 if the ith individual is in Group 1 and xi = 0
otherwise. We now consider a model for each individual’s response (which is binary, not binomial)
with the following form
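\[ \log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 x_i . \]
Consider the log odds for a subject in group 1 as
\[ \log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 \]
and the log odds for a subject in group 2 as
\[ \log\left(\frac{\pi_j}{1 - \pi_j}\right) = \beta_0 \]
This implies that
\[ \log\left(\frac{\pi_i}{1 - \pi_i}\right) - \log\left(\frac{\pi_j}{1 - \pi_j}\right) = \beta_1 \]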
which means log ψ = β1 where ψ is the odds ratio comparing the odds of an event for a subject in
group 1 versus a subject in group 2. Therefore, the regression coefficient from this logistic model
may be interpreted as a log odds ratio describing the association between group membership and
the outcome.
Frequently we are interested in the parameter π itself. In this case note that

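\[ \log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 x_i \]
\[ \frac{\pi_i}{1 - \pi_i} = e^{\beta_0 + \beta_1 x_i} \]
\[ \pi_i = \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}} \]
In a Gaussian model, given β̂, the fitted value for E(yi ) is µ̂i (xi ) = x′i β̂. In this binomial regression
model, the fitted value for E(Yi /mi ) is
\[ \hat\pi_i = \hat\pi(x_i) = \frac{e^{\hat\beta_0 + \hat\beta_1 x_{i1}}}{1 + e^{\hat\beta_0 + \hat\beta_1 x_{i1}}} \]
More generally, these fitted values may be written as
\[ \hat\pi_i = \hat\pi(x_i) = \frac{\exp(x_i'\hat\beta)}{1 + \exp(x_i'\hat\beta)} \]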
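A short Python sketch of the fitted-value formula follows. As an illustration it uses the two-group model with coefficients set to the estimates of section 2.2 computed from the Table 2.2 counts (β̂0 = log(46/373), β̂1 = log ψ̂); the function name `fitted_probabilities` and the row-wise layout of the covariates are assumptions made for this example.

```python
import numpy as np

def fitted_probabilities(X, beta_hat):
    """Inverse logit of the linear predictor; X has one row x_i' per subject."""
    eta = X @ beta_hat
    return np.exp(eta) / (1.0 + np.exp(eta))

# Two-group model: beta0 = log odds in Group 2, beta1 = log odds ratio
beta_hat = np.array([np.log(46 / 373), np.log((20 / 316) / (46 / 373))])

# One row per covariate pattern: Group 1 (x = 1) and Group 2 (x = 0)
X = np.array([[1.0, 1.0],
              [1.0, 0.0]])

pi_hat = fitted_probabilities(X, beta_hat)
print(pi_hat)   # compare with the observed proportions 20/336 and 46/419
```

With a single binary covariate the model is saturated, so the fitted probabilities reproduce the observed proportions in each group.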
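Now consider the case with two binary explanatory variables. Let
\[ x_{i1} = \begin{cases} 1 & \text{if factor A present} \\ 0 & \text{otherwise} \end{cases} \]
\[ x_{i2} = \begin{cases} 1 & \text{if factor B present} \\ 0 & \text{otherwise} \end{cases} \]
\[ x_{i3} = \begin{cases} 1 & \text{if A and B present} \\ 0 & \text{otherwise} \end{cases} \]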
Consider the model

\[ \log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} \]
and interpret the effect of xi1 . First we compute the log odds when factors A and B are present,
where xi = (1, 1, 1)′. Here we get
\[ \log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 + \beta_2 . \]
Then we get the log odds when factor A is absent but B is present by noting that xj = (1, 0, 1)′ and
\[ \log\left(\frac{\pi_j}{1 - \pi_j}\right) = \beta_0 + \beta_2 . \]
Taking the difference of these log odds gives
\[ \log\left(\frac{\pi_i}{1 - \pi_i}\right) - \log\left(\frac{\pi_j}{1 - \pi_j}\right) = \log\left(\frac{\pi_i(1 - \pi_j)}{\pi_j(1 - \pi_i)}\right) = \beta_1 \]
Therefore β1 is again the log odds ratio reflecting the effect of factor A, but this time we are
controlling for factor B and are specifying that factor B is present. Note that the effect of factor
A is the same regardless of the level of B. To see this we examine the log odds when factor A is
present and B is absent, where xi = (1, 1, 0)′ and
\[ \log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 \]
and the log odds when factor A is absent and B is absent, where xj = (1, 0, 0)′ and
\[ \log\left(\frac{\pi_j}{1 - \pi_j}\right) = \beta_0 . \]
Again the difference gives
\[ \log\left(\frac{\pi_i/(1 - \pi_i)}{\pi_j/(1 - \pi_j)}\right) = \beta_1 . \]

Now consider the model

$$\log\frac{\pi_i}{1-\pi_i} = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3}$$
where we have introduced an interaction term. The log odds when factors A and B are both present is
obtained by noting that xi = (1, 1, 1, 1)′ and so
$$\log\frac{\pi_i}{1-\pi_i} = \beta_0 + \beta_1 + \beta_2 + \beta_3 .$$
When factor A is absent and B is present xj = (1, 0, 1, 0)′ giving
$$\log\frac{\pi_j}{1-\pi_j} = \beta_0 + \beta_2 .$$
Taking the difference again we find
$$\log\frac{\pi_i}{1-\pi_i} - \log\frac{\pi_j}{1-\pi_j} = \log\frac{\pi_i(1-\pi_j)}{\pi_j(1-\pi_i)} = \beta_1 + \beta_3 .$$
The log odds when factor A is present and B is absent leads to xi = (1, 1, 0, 0)′ and
$$\log\frac{\pi_i}{1-\pi_i} = \beta_0 + \beta_1 .$$
The log odds when factor A is absent and B is absent (xj = (1, 0, 0, 0)′) is
$$\log\frac{\pi_j}{1-\pi_j} = \beta_0$$
giving
$$\log\frac{\pi_i(1-\pi_j)}{\pi_j(1-\pi_i)} = \beta_1 .$$
With the interaction term, we find that the effect of factor A depends on the
presence or absence of factor B. If factor B is absent the log odds ratio relating A to the outcome
is β1 , but if factor B is present it is β1 + β3 .
Now we consider the data from the prenatal care example from before. Here we set up a regres-
sion model for the analyses of interest.
Again our response Yi ∼ Bin(mi , πi ), i = 1, 2, . . . , n and we have explanatory variables

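$$x_{i1} = \begin{cases} 1 & \text{Clinic A} \\ 0 & \text{Clinic B} \end{cases}$$
$$x_{i2} = \begin{cases} 1 & \text{intensive level of care} \\ 0 & \text{regular level of care} \end{cases}$$
$$x_{i3} = \begin{cases} 1 & \text{intensive level of care and Clinic A} \\ 0 & \text{otherwise} \end{cases}$$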
Before considering the data analysis, we note that the parameters of the model can often be inter-
preted quickly and more easily if we write down the following tables. For a model with

$$\log\frac{\pi_i}{1-\pi_i} = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}$$
we can write

Clinic   Level of Care   xi            πi /(1 − πi )
B        regular         (1, 0, 0)′    e^{β0}
B        intensive       (1, 0, 1)′    e^{β0+β2}
A        regular         (1, 1, 0)′    e^{β0+β1}
A        intensive       (1, 1, 1)′    e^{β0+β1+β2}

where we make use of the fact that πi /(1 − πi ) = exp(x′i β). The last column reports the odds of
mortality for the four combinations of the risk factors. If we divide the corresponding terms we can
obtain odds ratios. For example among those patients in Clinic B the relative odds of mortality for
those with intensive versus regular care is
$$\frac{e^{\beta_0+\beta_2}}{e^{\beta_0}} = e^{\beta_2} .$$
Among those in Clinic A the relative odds of mortality for those with intensive versus regular care
is
$$\frac{e^{\beta_0+\beta_1+\beta_2}}{e^{\beta_0+\beta_1}} = e^{\beta_2} ,$$
the same expression we got before for those in Clinic B. By similar methods we can see that the
relative odds of mortality for those in Clinic A versus Clinic B is e^{β1} regardless of the level
of care.
Now consider the model

$$\log\frac{\pi_i}{1-\pi_i} = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3}$$
where now xi = (1, xi1 , xi2 , xi3 )′ and β = (β0 , β1 , β2 , β3 )′ .

Clinic   Level of Care   xi               πi /(1 − πi )
B        regular         (1, 0, 0, 0)′    e^{β0}
B        intensive       (1, 0, 1, 0)′    e^{β0+β2}
A        regular         (1, 1, 0, 0)′    e^{β0+β1}
A        intensive       (1, 1, 1, 1)′    e^{β0+β1+β2+β3}

Here the odds ratio of mortality for those with intensive versus regular care among those in Clinic
B is
$$\frac{e^{\beta_0+\beta_2}}{e^{\beta_0}} = e^{\beta_2} .$$
However the corresponding odds ratio among those in Clinic A is
$$\frac{e^{\beta_0+\beta_1+\beta_2+\beta_3}}{e^{\beta_0+\beta_1}} = e^{\beta_2+\beta_3} .$$
If β3 = 0 then the effect of level of prenatal care does not depend on the clinic and vice versa. If
β3 = 0 and β2 = 0 as well, it means not only does the effect of level of care not depend on clinic,
but there is no such effect.

2.4.3 Logistic Regression Analysis of Prenatal Care Data
What follows is the data file prenatal.dat in which the first line contains the variable labels and
the remaining four lines the data. As before we are using indicator variables for the explanatory
variables and have binomial response data.
clinic loc y m
0 0 34 231
0 1 4 27
1 0 12 188
1 1 16 309
The program used to analyse the data is given below.
Splus program for analysis of prenatal care data
help.start()
prenatal.dat_read.table("prenatal.dat", header=T)
# here we construct the response variable for the logistic regression analysis
prenatal.dat$resp_cbind(prenatal.dat$y,prenatal.dat$m-prenatal.dat$y)
prenatal.dat
# now we fit the model using the glm function and store the result in "model1"
# we indicate "resp" contains a binomial response and that we are using the
# logistic link function
model1_glm(resp ~ loc, family=binomial(link=logit),data=prenatal.dat)
summary(model1)
# the "names" function lists the contents of the object "model1" and following
# this statement we examine some of the contents of these objects (try it)
names(model1)
model1$family
model1$formula
model1$coefficients
model1$deviance
model1$fitted.values
model1$residuals
# now we fit a model to examine the relationship between level of care
# and mortality adjusting for clinic
model2_glm(resp ~ clinic + loc, family=binomial(link=logit),data=prenatal.dat)
summary(model2)
# here we examine whether the association between loc and mortality depends on
# the clinic
model3_glm(resp ~ loc + clinic + loc*clinic, family=binomial(link=logit),data=prenatal.dat)
summary(model3)
# now we examine the marginal relationship between mortality and clinic
model4_glm(resp ~ clinic, family=binomial(link=logit),data=prenatal.dat)
summary(model4)
A selection of the output printed from the summary commands summary(model1), summary(model2),
summary(model3) and summary(model4) follows.
> prenatal.dat$resp_cbind(prenatal.dat$y,prenatal.dat$m-prenatal.dat$y)
> prenatal.dat
clinic loc y m resp.1 resp.2
1 0 0 34 231 34 197
2 0 1 4 27 4 23
3 1 0 12 188 12 176
4 1 1 16 309 16 293
Here we print out the augmented data frame to see what the resp variable looks like. Next we fit the regression model
examining the relationship between the level of care and mortality. A portion of the output is reported below.

> model1_glm(resp ~ loc, family=binomial(link=logit),data=prenatal.dat)

> summary(model1)
Coefficients:
Value Std. Error t value
(Intercept) -2.0929355 0.1561487 -13.403476
loc -0.6670721 0.2782789 -2.397135
Null Deviance: 16.91763 on 3 degrees of freedom
Residual Deviance: 10.81438 on 2 degrees of freedom
Number of Fisher Scoring Iterations: 3
The numbers under the heading Value are the maximum likelihood estimates of the regression coefficients β̂0 and β̂2 (Here
we are using the convention that the subscripts for the regression coefficients coincide with the subscripts on the variables
themselves). The numbers under Std. Error are estimated standard errors based on the inverse of the information matrix (more
on this shortly). Finally, the numbers under t value are Wald-type test statistics for testing the hypothesis H0 : βk = 0
vs. HA : βk ≠ 0. Note that these are of the form
(β̂k − 0)/s.e.(β̂k ) .
These test statistics are approximately standard normal if the null hypothesis is true. To verify note that for testing the effect
of level of care based on model 1 we find −0.6670721/0.2782789 = −2.397135. The p−values can be computed as
p − value = 2 × pr(U > |(β̂k − 0)/s.e.(β̂k )|)

where U ∼ N (0, 1). Therefore, if we test the hypothesis that there is no relation between level of care and mortality (H0 : β2 =
0.0) we get 2 × pr(U > 2.397135) = 2 ∗ (1 − pnorm(2.397135)) = 0.01652383. Here we conclude that those patients receiving
more intensive care are at a significantly lower risk of mortality than those receiving standard level of care, and that this
evidence is rather strong. To further characterize this dependence we need to conduct inference about the odds ratio. Recall
that exp(β2 ) is the odds ratio parameter, exp(β̂2 ) is the MLE, and
(exp(β̂2 − 1.96 s.e.(β̂2 )) , exp(β̂2 + 1.96 s.e.(β̂2 )))

is an approximate 95 % confidence interval. Here we get a point estimate of exp(−0.6670721) = 0.51 and a 95 % CI of
(exp(−0.6670721 − 1.96 × 0.2782789), exp(−0.6670721 + 1.96 × 0.2782789)) = (0.30, 0.89)
as we did before. In fact the analysis before was exactly the same as this one and the results will be identical apart from
rounding. Finally note that while the Wald statistic is computed for β0 , it is seldom of interest. Instead we tend to focus on
coefficients of explanatory variables which have useful interpretations.
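The Wald statistic, its p−value and the odds ratio interval can be reproduced from the two printed numbers alone. This is a small Python sketch (not the Splus code used in these notes); the estimate and standard error are copied from the summary(model1) output above, and the helper name two_sided_p is our own.

```python
import math

def two_sided_p(z):
    """Two-sided p-value 2 * pr(U > |z|) for U ~ N(0, 1)."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Estimate and standard error for loc, copied from summary(model1).
beta_hat = -0.6670721
se = 0.2782789

z = (beta_hat - 0.0) / se                      # Wald statistic
p_value = two_sided_p(z)
odds_ratio = math.exp(beta_hat)                # MLE of the odds ratio
ci = (math.exp(beta_hat - 1.96 * se),          # approximate 95% CI,
      math.exp(beta_hat + 1.96 * se))          # built on the log scale

print(round(z, 6), round(p_value, 4))
print(round(odds_ratio, 2), tuple(round(v, 2) for v in ci))
```

The interval is computed for β2 and then exponentiated, which is why it is not symmetric about the point estimate of the odds ratio.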
Now we consider introducing clinic into the model. This generates a model in which we can examine the effect of level of
care on mortality, but adjusted for the clinic the patient attended.
> model2_glm(resp ~ clinic + loc, family=binomial(link=logit),data=prenatal.dat)
> summary(model2)
Coefficients:
(Intercept) -1.7410476 0.1784691 -9.7554564
clinic -0.9862793 0.3089322 -3.1925428
loc -0.1503053 0.3301670 -0.4552402
(Dispersion Parameter for Binomial family taken to be 1 )

Here we see that there is no longer any evidence that there is a relationship between level of care and mortality. The association
that did exist has been “explained away” by the clinic the patients attended. To see this, note that a test of H0 : β2 = 0, while
controlling for clinic, gives a p−value as 2 × pr(U > 0.4552402) = 2 ∗ (1 − pnorm(0.4552402)) ≈ 0.649. This is interesting, but
it does not mean that there is no effect of level of care for any patient. There may be an interaction between level of care and
clinic, for example, and the level of care variable may be significant in one of the clinics. To check this, we therefore fit the
model with loc and clinic main effects as well as the loc*clinic interaction.

> model3_glm(resp~loc+clinic+loc*clinic,family=binomial(link=logit),data=prenatal.dat)
> summary(model3)
Call: glm(formula = resp ~ loc + clinic + loc * clinic, family = binomial(link = logit), data = prenatal.dat)
Coefficients:
(Intercept) -1.756843204 0.1857092 -9.46018403
loc 0.007643349 0.5726826 0.01334657
clinic -0.928734141 0.3514300 -2.64272868
loc:clinic -0.229649891 0.6949054 -0.33047650
Residual Deviance: 0 on 0 degrees of freedom

It is not surprising that this interaction term is not significant, since when we tabulated the data we found that the odds
ratios relating level of care with mortality within the two strata were quite similar.
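Because model 3 is saturated (four parameters for four binomial observations), its coefficients can be recovered directly from the stratum-specific odds. The Python sketch below is an illustration added here, not the Splus analysis; the helper names are our own, and the coefficient values are copied from the summary(model3) output.

```python
import math

# (clinic, loc) -> (deaths y, patients m), copied from prenatal.dat
data = {(0, 0): (34, 231), (0, 1): (4, 27),
        (1, 0): (12, 188), (1, 1): (16, 309)}

def odds(y, m):
    """Observed odds of the outcome, y / (m - y)."""
    return y / (m - y)

def log_or_within(clinic):
    """Log odds ratio (loc = 1 versus loc = 0) within one clinic."""
    return math.log(odds(*data[(clinic, 1)]) / odds(*data[(clinic, 0)]))

# Coefficients copied from summary(model3).
b_loc, b_interaction = 0.007643349, -0.229649891

log_or_clinic0 = log_or_within(0)   # compare with b_loc
log_or_clinic1 = log_or_within(1)   # compare with b_loc + b_interaction

print(log_or_clinic0, b_loc)
print(log_or_clinic1, b_loc + b_interaction)
print(math.exp(log_or_clinic0), math.exp(log_or_clinic1))
```

The two within-clinic odds ratios come out at roughly 1.01 and 0.80, which is the similarity across strata referred to above.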
Finally we fit the model involving just the clinic main effect.
> model4_glm(resp ~ clinic, family=binomial(link=logit),data=prenatal.dat)
> summary(model4)
Call: glm(formula = resp ~ clinic, family = binomial(link = logit), data = prenatal.dat)
Deviance Residuals:
1 2 3 4
-0.004318004 0.01261876 0.4367092 -0.352063
Coefficients:
(Intercept) -1.756041 0.1756737 -9.996041
clinic -1.062357 0.2621205 -4.052933

When testing the hypothesis H0 : β1 = 0 versus HA : β1 ≠ 0 we get 2 × pr(U > 4.052933) = 2 ∗ (1 − pnorm(4.052933)) < 0.001.
Here we conclude that those patients at clinic A are at a significantly lower risk of mortality than those at clinic B and that
this evidence is very strong. Here we get a point estimate for the odds ratio reflecting the reduction in odds of mortality in
Clinic A compared to Clinic B as exp(−1.062357) = 0.346 and a corresponding 95 % CI as
(exp(−1.062357 − 1.96 × 0.2621205), exp(−1.062357 + 1.96 × 0.2621205)) = (0.21, 0.58)
2.5 Likelihood for Binary Regression
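We regard the responses y1 , y2 , ..., yn as observations from n independent random variables Y1 , ..., Yn where Yi ∼ Bin(mi , πi ) , i = 1, 2, . . . , n. Then
$$\mathrm{pr}(Y = y; \pi) = \prod_{i=1}^{n} \binom{m_i}{y_i} \pi_i^{y_i} (1-\pi_i)^{m_i-y_i}$$
which gives, after dropping the binomial coefficients (they do not involve π),
$$L(\pi; y) = \prod_{i=1}^{n} \pi_i^{y_i} (1-\pi_i)^{m_i-y_i}$$
$$\begin{aligned}
\ell(\pi; y) &= \sum_{i=1}^{n} \left[ y_i \log \pi_i + (m_i - y_i) \log(1-\pi_i) \right] \\
&= \sum_{i=1}^{n} \left[ y_i \log\frac{\pi_i}{1-\pi_i} + m_i \log(1-\pi_i) \right]
\end{aligned}$$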
Modeling procedures are based on expressing π1 , π2 , . . . , πn in terms of fewer parameters (this is with a view to data
reduction). These new parameters take the form of regression coefficients. This is done through a link function. The
logistic link g(πi ) = log(πi /(1 − πi )) = x′i β implies
$$\pi_i = \frac{e^{x_i'\beta}}{1 + e^{x_i'\beta}} .$$
If the dimension of β is less than n, we say we have an unsaturated model. In this case we take the dimension to be
p, corresponding to a model with p − 1 covariates. Under some models with the dimension of β equal to n, we have
a saturated model. Returning to the log-likelihood:
$$\ell(\beta; y) = \sum_{i=1}^{n} \left[ y_i (x_i'\beta) + m_i \log\frac{1}{1+e^{x_i'\beta}} \right] = \sum_{i=1}^{n} \left[ y_i (x_i'\beta) - m_i \log(1+e^{x_i'\beta}) \right]$$
Upon maximizing ℓ with respect to β, we obtain β̂ and compute π̂i = e^{x′i β̂}/(1 + e^{x′i β̂}). The quality of the fit of these regression models
will be judged by how well π̂1 , π̂2 , ..., π̂n fit the data (or equivalently, how well mi π̂i approximates yi , i = 1, 2, ..., n).
We need a criterion to assess how much worse unsaturated models are from the saturated model. A convenient way
of testing nested hypotheses is based on the likelihood ratio statistic.
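To make the maximization concrete, here is a minimal Python sketch of Newton-Raphson (equivalently Fisher scoring for the logit link) applied to the log-likelihood ℓ(β; y) above, for the prenatal data with the single covariate loc. It is an illustration under our own choices (starting values of zero, a hand-coded 2 × 2 matrix inverse, the function names), not the algorithm as implemented inside the Splus glm function.

```python
import math

# Grouped prenatal data as (loc, y, m), copied from prenatal.dat.
rows = [(0, 34, 231), (1, 4, 27), (0, 12, 188), (1, 16, 309)]

def expit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def loglik(b0, b1):
    """l(beta; y) = sum[ y_i * eta_i - m_i * log(1 + exp(eta_i)) ]."""
    return sum(y * (b0 + b1 * x) - m * math.log(1.0 + math.exp(b0 + b1 * x))
               for x, y, m in rows)

def fit(max_iter=25, tol=1e-10):
    b0, b1 = 0.0, 0.0                     # starting values (our choice)
    for _ in range(max_iter):
        s0 = s1 = i00 = i01 = i11 = 0.0   # score vector and information
        for x, y, m in rows:
            p = expit(b0 + b1 * x)
            r = y - m * p                 # observed minus expected
            w = m * p * (1.0 - p)         # binomial variance
            s0 += r
            s1 += r * x
            i00 += w
            i01 += w * x
            i11 += w * x * x
        det = i00 * i11 - i01 * i01
        d0 = (i11 * s0 - i01 * s1) / det  # step = inverse information * score
        d1 = (i00 * s1 - i01 * s0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    # Standard errors from the inverse information; evaluated at the last
    # pre-update iterate, which is the MLE to within tol at convergence.
    se0 = math.sqrt(i11 / det)
    se1 = math.sqrt(i00 / det)
    return b0, b1, se0, se1

b0_hat, b1_hat, se0, se1 = fit()
print(b0_hat, b1_hat, se0, se1)
```

The estimates should agree closely with the Value column reported for model1 in Section 2.4.3; the standard errors may differ slightly in the later decimals, presumably because of differences in the stopping rule.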
Likelihood Ratio Tests:

If L(θ) is the likelihood of a q-dimensional parameter vector θ, it may be maximized with no constraints on θ,
giving θ̃, or subject to constraints on θ, giving θ̂. In the latter case the effective dimension of θ will be denoted by p.
Note that regression models may be interpreted as imposing constraints on the mean responses. That is, we
force relationships between the µi values in linear regression, or πi values in binary regression. We may then
formulate the hypothesis that the constraint is a reasonable one and test it by seeing how consistent it is with
the data. The likelihood ratio statistic
$$-2 \log\left( L(\hat\theta)/L(\tilde\theta) \right)$$
has approximately a χ² distribution on ν = q − p degrees of freedom if the null hypothesis that the constraints are reasonable
is true, where ν is the difference in the effective number of parameters with and without the constraints.
Therefore if
$$-2 \log\left( L(\hat\theta)/L(\tilde\theta) \right) > \chi^2_{\nu}(\alpha)$$
we would reject H0 at the α significance level. It is more informative to examine the p−value and so we
compute
$$p = \mathrm{pr}\left( \chi^2_{q-p} > -2 \log( L(\hat\theta)/L(\tilde\theta) ) \right)$$
Returning to the log likelihood for binomial data we have

$$\ell(\pi; y) = \sum_{i=1}^{n} \left[ y_i \log\frac{\pi_i}{1-\pi_i} + m_i \log(1-\pi_i) \right] .$$
Let π̃ = (π̃1 , . . . , π̃n )′ = (y1 /m1 , . . . , yn /mn )′ represent the MLE under the saturated model and let π̂ = (π̂1 , · · · , π̂n )′
denote the MLE under the constrained model imposed by the regression equation. With a little algebra one can show
that the LR statistic −2 log (L(π̂)/L(π̃)) = −2(ℓ(π̂) − ℓ(π̃)) obtained by substituting the appropriate MLE’s into the
log-likelihood has the form
$$\begin{aligned}
2(\ell(\tilde\pi) - \ell(\hat\pi)) &= 2\left[ \sum_{i=1}^{n} \left\{ y_i \log(y_i/m_i) + (m_i - y_i) \log\left(\frac{m_i - y_i}{m_i}\right) \right\} - \sum_{i=1}^{n} \left\{ y_i \log\hat\pi_i + (m_i - y_i)\log(1-\hat\pi_i) \right\} \right] \\
&= 2 \sum_{i=1}^{n} \left[ y_i \log\left(\frac{y_i}{m_i\hat\pi_i}\right) + (m_i - y_i) \log\left(\frac{m_i - y_i}{m_i(1-\hat\pi_i)}\right) \right] .
\end{aligned}$$
This likelihood ratio statistic is central to the analysis of binary regression models and so has a special name. It is
called the deviance statistic and is represented by D if we think of it as random and d if we refer to a realized value
for it. Splus reports this as the residual deviance, and sometimes it will be called the scaled deviance for reasons that
will become clear shortly. Based on the general result above, we would expect it to have a χ² distribution on n − p
degrees of freedom. Unfortunately, this distributional approximation for the deviance statistic is not as good as one
might hope! It does perform very well, however, for testing nested unsaturated models which we will consider in the
next section. We remark in passing that the deviance statistic has the form
$$2 \sum O_{ij} \log\frac{O_{ij}}{E_{ij}}$$
where Oij is an observed quantity and Eij is an expected quantity. We use two subscripts here since we are summing
over both the yi cells and the (mi − yi ) cells.
The Pearson statistic is another statistic one can use for assessing “overall” fit of a model.
$$P = \sum_{i=1}^{n} \frac{(y_i - m_i\hat\pi_i)^2}{m_i\hat\pi_i(1-\hat\pi_i)}$$
which has the form Σ (Oi − Ei )²/Vi . As for the deviance statistic, P ∼ χ²_{n−p} approximately if the model provides a
reasonable fit to the data (i.e. if the assumed model is “true”). The chi-square approximation is a little bit better
than for the deviance statistic. Both however are poor if the sample sizes (mi ) are small. The deviance and Pearson statistics
can be shown to be asymptotically equivalent by a Taylor series expansion.
2.6 Testing Nested Non-saturated Models
Suppose we have a model

$$\log(\pi_i/(1-\pi_i)) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i,p-1}$$
and another model
$$\log(\pi_i/(1-\pi_i)) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i,p-1} + \cdots + \beta_{q-1} x_{i,q-1} .$$
We may be interested in testing whether the first model, which is a sub-model of the second, provides as good a
fit to the data. This is equivalent to testing the significance of the covariates xp , . . . , xq−1 , or to testing H0 : βp =
· · · = βq−1 = 0. Let π̂i denote the MLE of πi under the reduced model with p parameters and let π̃i denote the
MLE of πi under the full model with q parameters. Again with a little algebra one can show that the likelihood
ratio statistic corresponding to this test is given by the difference in the deviance of the two models. That is, the
appropriate likelihood ratio test statistic is ∆D = D0 − DA where D0 is the deviance under the null model and DA
is the deviance under the alternative. The statistic ∆D is approximately χ² distributed on q − p degrees of freedom,
and the approximation is far superior here than it is for the deviance itself.
If ∆D > χ²_{q−p}(α), we reject H0 and claim that the reduced model leads to a significantly worse fit to the data
than the full model, or equivalently that one or more of the covariates xp , . . . , xq−1 is important for predicting the
outcome. The p−value is given by
$$p\text{-value} = \mathrm{pr}(\chi^2_{q-p} > \Delta D)$$
2.7 Residuals for Binomial Data
2.7.1 Deviance Residuals
Recall
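$$D(\hat\pi; y) = 2 \sum_{i=1}^{n} \left[ y_i \log\left(\frac{y_i}{m_i\hat\pi_i}\right) + (m_i - y_i) \log\left(\frac{m_i - y_i}{m_i(1-\hat\pi_i)}\right) \right] = \sum_{i=1}^{n} d_i ,$$
where di denotes the contribution of the ith observation to the deviance (including the factor 2). Let
$$r_i^D = \mathrm{sign}(y_i - m_i\hat\pi_i) \sqrt{d_i}$$
where sign(yi − mi π̂i ) is + if yi − mi π̂i > 0 and sign(yi − mi π̂i ) is − if yi − mi π̂i < 0. We call these deviance residuals.
If the model is adequate, then these are approximately N (0, 1) distributed.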
2.7.2 Pearson residuals
Pearson residuals have the form

$$r_i^P = \frac{y_i - m_i\hat\pi_i}{\sqrt{m_i\hat\pi_i(1-\hat\pi_i)}}$$
and are approximately N (0, 1) if the model is adequate. As a general rule, if mi π̂i < 5 for one or more i, you should be
concerned about the validity of the approximation (χ² or N (0, 1)) and hence your conclusions (the same holds for
mi (1 − π̂i )).
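As a check on these definitions, the sketch below computes both kinds of residual for model 4 of the prenatal example (clinic only) in Python. It is an added illustration, not the Splus code; it assumes, as is true for an intercept plus one binary covariate, that the fitted probability for each clinic is the pooled proportion of deaths in that clinic, and the function names are our own.

```python
import math

# Grouped prenatal data as (clinic, y, m), copied from prenatal.dat.
rows = [(0, 34, 231), (0, 4, 27), (1, 12, 188), (1, 16, 309)]

def pooled_proportion(clinic):
    """Fitted probability for model 4: pooled y/m within a clinic."""
    y_tot = sum(y for c, y, m in rows if c == clinic)
    m_tot = sum(m for c, y, m in rows if c == clinic)
    return y_tot / m_tot

def o_log_o_over_e(o, e):
    """One O * log(O/E) term, with the convention 0 * log(0) = 0."""
    return 0.0 if o == 0 else o * math.log(o / e)

deviance_resid = []
pearson_resid = []
for c, y, m in rows:
    p = pooled_proportion(c)
    d_i = 2.0 * (o_log_o_over_e(y, m * p)
                 + o_log_o_over_e(m - y, m * (1.0 - p)))
    d_i = max(d_i, 0.0)                    # guard against rounding below zero
    sign = 1.0 if y - m * p > 0 else -1.0
    deviance_resid.append(sign * math.sqrt(d_i))
    pearson_resid.append((y - m * p) / math.sqrt(m * p * (1.0 - p)))

print([round(r, 6) for r in deviance_resid])
print([round(r, 6) for r in pearson_resid])
```

The deviance residuals can be compared with the Deviance Residuals line printed by summary(model4) in Section 2.4.3.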
2.8 Estimation of Prognosis for Children with Neuroblastoma
2.8.1 Background
Purpose of Study : To investigate the relationship between the probability of surviving 2 years free of disease following
diagnosis and treatment for neuroblastoma, and age at diagnosis and stage of disease at diagnosis. These data are
summarized in the following table where the cell entries are of the form y/m with y representing the number of
patients surviving 2 years, and m representing the number of patients in that age-stage combination at the start of
the study.

                           Stage
Age (months)   I       II      III     IV      V
0-11           11/12   15/16   2/4     5/18    18/19
12-23          3/4     3/7     5/8     0/25    1/3
24+            4/5     4/12    3/15    3/93    2/5

As an initial look at the data, consider the marginal distributions.

                           Stage
Age (months)   I       II      III     IV      V       Total
0-11           11/12   15/16   2/4     5/18    18/19   51/69 (0.74)
12-23          3/4     3/7     5/8     0/25    1/3     12/47 (0.26)
24+            4/5     4/12    3/15    3/93    2/5     16/130 (0.12)
Total          18/21   22/35   10/27   8/136   21/27   79/246 (0.32)
               (0.86)  (0.63)  (0.37)  (0.06)  (0.78)
2.8.2 Modeling Covariate Effects
Testing Parameters to Reduce Model
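Let
$$x_{i1} = \begin{cases} 1 & \text{if age 12-23 months} \\ 0 & \text{o.w.} \end{cases} \qquad
x_{i2} = \begin{cases} 1 & \text{if age 24+ months} \\ 0 & \text{o.w.} \end{cases} \qquad
x_{i3} = \begin{cases} 1 & \text{if stage II} \\ 0 & \text{o.w.} \end{cases}$$
$$x_{i4} = \begin{cases} 1 & \text{if stage III} \\ 0 & \text{o.w.} \end{cases} \qquad
x_{i5} = \begin{cases} 1 & \text{if stage IV} \\ 0 & \text{o.w.} \end{cases} \qquad
x_{i6} = \begin{cases} 1 & \text{if stage V} \\ 0 & \text{o.w.} \end{cases}$$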
Now consider the models:
logit(πi ) = β0 + β1 xi1 + β2 xi2 + β3 xi3 + β4 xi4 + β5 xi5 + β6 xi6

logit(πi ) = β0 + β1 xi1 + β2 xi2
logit(πi ) = β0 + β3 xi3 + β4 xi4 + β5 xi5 + β6 xi6
What follows is the data file neuro.dat in which the first line contains the variable labels and the remaining fifteen
lines the data. Here age and stage are categorical (ordinal) variables so we will have to compute the indicator variables
in the Splus program.
age stage y m
1 1 11 12
1 2 15 16

1 3 2 4
1 4 5 18
1 5 18 19
2 1 3 4
2 2 3 7
2 3 5 8
2 4 0 25
2 5 1 3
3 1 4 5
3 2 4 12
3 3 3 15
3 4 3 93
3 5 2 5
Splus program for analysis of neuroblastoma data
neuro.dat_read.table("neuro.dat", header=T)
# here we create indicator variables for age and stage using the treatment contrast
neuro.dat$agef_factor(neuro.dat$age)
neuro.dat$ageft_C(neuro.dat$agef,treatment)
neuro.dat$stagef_factor(neuro.dat$stage)
neuro.dat$stageft_C(neuro.dat$stagef,treatment)
# here we construct the response variable for logistic regression
neuro.dat$resp_cbind(neuro.dat$y,neuro.dat$m-neuro.dat$y)
neuro.dat
# here we fit the model with age and stage and print out summary statistics
model1_glm(resp ~ ageft + stageft, family=binomial(link=logit),data=neuro.dat)
summary(model1)
# here we record deviance residuals (rd1), linear predictor (lp1), and fitted values (fv1)
rd1_residuals.glm(model1,"deviance")
lp1_model1$linear.predictors
fv1_model1$fitted.values
# here we compute the Pearson residual as an exercise
rp1_(neuro.dat$y - neuro.dat$m*fv1)/sqrt(neuro.dat$m*fv1*(1-fv1))
# here we verify that fitted values agree with what we expect from the linear predictor
fv1
exp(lp1)/(1+exp(lp1))
cbind(lp1,fv1,rp1)
# plotting the deviance and Pearson residuals
motif()
plot(fv1,rd1,ylim=c(-3,3), xlab="FITTED VALUES",ylab="RESIDUALS",pch=11)
points(fv1,rp1,pch=10)
abline(h=-2)
abline(h= 2)
legend(locator(),c("Deviance Residual","Pearson Residual"), pch="\013\012",bty="n")
# title("Residuals Plotted Against Fitted Values")
# here we fit two reduced models to enable us to test the importance of age and stage
model2_glm(resp ~ ageft, family=binomial(link=logit),data=neuro.dat)
summary(model2)
model3_glm(resp ~ stageft, family=binomial(link=logit),data=neuro.dat)
summary(model3)
A selection of the output is provided. The final data object neuro.dat is given by
> neuro.dat
age stage y m agef ageft stagef stageft resp.1 resp.2
1 1 1 11 12 1 1 1 1 11 1

2 1 2 15 16 1 1 2 2 15 1
3 1 3 2 4 1 1 3 3 2 2
4 1 4 5 18 1 1 4 4 5 13
5 1 5 18 19 1 1 5 5 18 1
6 2 1 3 4 2 2 1 1 3 1
7 2 2 3 7 2 2 2 2 3 4
8 2 3 5 8 2 2 3 3 5 3
9 2 4 0 25 2 2 4 4 0 25
10 2 5 1 3 2 2 5 5 1 2
11 3 1 4 5 3 3 1 1 4 1
12 3 2 4 12 3 3 2 2 4 8
13 3 3 3 15 3 3 3 3 3 12
14 3 4 3 93 3 3 4 4 3 90
15 3 5 2 5 3 3 5 5 2 3
The estimates arising from the first model fit are given below along with a little additional information.
Coefficients:
(Intercept) 3.317448 0.7690988 4.313423
ageft2 -2.118049 0.5724491 -3.699978
ageft3 -2.612933 0.5005830 -5.219781
stageft2 -1.252876 0.7814548 -1.603262
stageft3 -1.775857 0.7982609 -2.224658
stageft4 -4.367750 0.7874140 -5.546955
stageft5 -1.022186 0.8621250 -1.185659

Correlation of Coefficients:
(Intercept) ageft2 ageft3 stageft2 stageft3 stageft4
ageft2 -0.4148044
ageft3 -0.4532413 0.5477983
stageft2 -0.7660224 0.0405768 0.0413439
stageft3 -0.7060820 -0.0363006 -0.0391957 0.7138739
stageft4 -0.8348667 0.1809823 0.1685087 0.7349739 0.7019729
stageft5 -0.7440306 0.1217271 0.1287567 0.6694787 0.6427576 0.6860101
Before interpreting these results too much, we should look to see how good the fit is to the data. The residual plot in Fig. 2.1
indicates a rather good fit and there are no data points which lead to unacceptably large residuals. We base these inferences on
the deviance residuals since they are somewhat better behaved than Pearson residuals. For completeness however, we consider
the following table of linear predictors, fitted values, and Pearson residuals.
> cbind(lp1,fv1,rp1)
lp1 fv1 rp1
1 3.31744824 0.96502256 -0.91175324
2 2.06457175 0.88741175 0.63385031
3 1.54159087 0.82369587 -1.69883998
4 -1.05030164 0.25916718 0.18019654
5 2.29526217 0.90848389 0.58782256
6 1.19939942 0.76841793 -0.08732117
7 -0.05347707 0.48663392 -0.30734765
8 -0.57645795 0.35974803 1.56325182
9 -3.16835046 0.04037428 -1.02558449
10 0.17721334 0.54418776 -0.73329034
11 0.70451480 0.66918800 0.62168142
12 -0.54836169 0.36624459 -0.23664028
13 -1.07134257 0.25514785 -0.48994036
14 -3.66323508 0.02500796 0.44776072
15 -0.31767128 0.42124338 -0.09620422
As an exercise, verify that you can reproduce these results by computer and by hand.
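One way to do the computer half of this exercise is sketched below in Python (an added illustration, not the Splus session). The coefficients and data are copied from the output and data file above; the dictionary keys and function name are our own labels.

```python
import math

# Estimates copied from the summary(model1) output for the age + stage model.
coef = {"intercept": 3.317448,
        "age2": -2.118049, "age3": -2.612933,
        "stage2": -1.252876, "stage3": -1.775857,
        "stage4": -4.367750, "stage5": -1.022186}

# (age, stage, y, m) copied from neuro.dat.
data = [(1, 1, 11, 12), (1, 2, 15, 16), (1, 3, 2, 4), (1, 4, 5, 18), (1, 5, 18, 19),
        (2, 1, 3, 4), (2, 2, 3, 7), (2, 3, 5, 8), (2, 4, 0, 25), (2, 5, 1, 3),
        (3, 1, 4, 5), (3, 2, 4, 12), (3, 3, 3, 15), (3, 4, 3, 93), (3, 5, 2, 5)]

def linear_predictor(age, stage):
    """x_i' beta under treatment contrasts (level 1 is the baseline)."""
    lp = coef["intercept"]
    if age > 1:
        lp += coef["age%d" % age]
    if stage > 1:
        lp += coef["stage%d" % stage]
    return lp

table = []
for age, stage, y, m in data:
    lp = linear_predictor(age, stage)
    fv = math.exp(lp) / (1.0 + math.exp(lp))            # fitted probability
    rp = (y - m * fv) / math.sqrt(m * fv * (1.0 - fv))  # Pearson residual
    table.append((lp, fv, rp))

for i, (lp, fv, rp) in enumerate(table, start=1):
    print(i, round(lp, 6), round(fv, 6), round(rp, 6))
```

Each printed row can be compared with the corresponding row of the cbind(lp1,fv1,rp1) output; small differences in the last digits are expected because the coefficients are entered to six decimals.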
Now we consider simplifying the model further by examining the decrease in the quality of the fit that results from dropping
the stage variable(s).

> model2_glm(resp ~ ageft, family=binomial(link=logit),data=neuro.dat)

> summary(model2)
Coefficients:
(Intercept) 1.041454 0.2741563 3.798759
ageft2 -2.111895 0.4324230 -4.883863
ageft3 -3.005062 0.3825334 -7.855686

(Intercept) ageft2
ageft2 -0.6340003
ageft3 -0.7166859 0.4543791
Now we fit the model excluding the age variable to examine the drop in the quality of fit from model one (with age and stage).
> model3_glm(resp ~ stageft, family=binomial(link=logit),data=neuro.dat)
> summary(model3)
Coefficients:
(Intercept) 1.7917595 0.6236082 2.873213
stageft2 -1.2656664 0.7150277 -1.770094
stageft3 -2.3223877 0.7400748 -3.138045
stageft4 -4.5643439 0.7220359 -6.321492
stageft5 -0.5389965 0.7766185 -0.694030

(Intercept) stageft2 stageft3 stageft4
stageft2 -0.8721456
stageft3 -0.8426286 0.7348948
stageft4 -0.8636804 0.7532550 0.7277618
stageft5 -0.8029788 0.7003144 0.6766129 0.6935170
Here we consider testing models using the likelihood ratio test, which in this context we are calling the “change in deviance”,
and denote by ∆D.
Factors In Model    Deviance    d.f.
Age & Stage         9.63        8
Age                 83.58       12
Stage               42.45       10
Consider the test of the null hypothesis that stage is unrelated to survival, or equivalently that H0 : β3 = · · · = β6 = 0. The
change in deviance in going from the first model to the second model is ∆D = 83.58 − 9.63 = 73.96. The random variable
corresponding to ∆D is taken to be approximately χ2 distributed on q − p = 12 − 8 = 4 degrees of freedom. Therefore
S.L. ≈ P(χ²_4 > 73.96) < 0.0001
and we reject the null hypothesis that stage is unrelated to survival since this is characterized as very strong evidence against
it.
Consider the test of the null hypothesis that age is unrelated to survival, or equivalently that H0 : β1 = β2 = 0. The change
in deviance in going from the first model to the third model is ∆D = 42.45 − 9.63 = 32.82. The random variable corresponding
to ∆D is taken to be approximately χ2 distributed on q − p = 10 − 8 = 2 degrees of freedom. Therefore
S.L. ≈ P(χ²_2 > 32.82) < 0.0001
and we also reject the null hypothesis that age is unrelated to survival. Therefore both age and stage play highly significant
roles in determining prognosis, even after controlling for the other variable.
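The two significance levels can be evaluated without tables. The Python sketch below (an added illustration) uses the closed form for the upper tail of a χ² distribution with an even number of degrees of freedom, P(χ²_{2k} > x) = e^{−x/2} Σ_{j<k} (x/2)^j / j!, which covers both tests here; the function name is our own and the deviances are the rounded values from the table above.

```python
import math

def chi2_sf_even(x, df):
    """P(chi-square_df > x) for even df, via the closed-form finite sum."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this shortcut only handles even, positive df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j           # (x/2)^j / j!
        total += term
    return math.exp(-half) * total

# Deviances and residual degrees of freedom copied from the table above.
dev_full, df_full = 9.63, 8        # age and stage
dev_age, df_age = 83.58, 12        # age only
dev_stage, df_stage = 42.45, 10    # stage only

delta_stage = dev_age - dev_full   # test of stage, 12 - 8 = 4 df
delta_age = dev_stage - dev_full   # test of age, 10 - 8 = 2 df

p_stage = chi2_sf_even(delta_stage, df_age - df_full)
p_age = chi2_sf_even(delta_age, df_stage - df_full)
print(delta_stage, p_stage)
print(delta_age, p_age)
```

Both p-values are far below 0.0001, in line with the conclusions above. For odd degrees of freedom this shortcut does not apply and a general χ² routine would be needed.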

[Figure: deviance residuals and Pearson residuals (vertical axis RESIDUALS, from −3 to 3) plotted against FITTED VALUES (horizontal axis, from 0.0 to 1.0).]
Fig. 2.1. Plot of Residuals by Fitted Values for Neuroblastoma Model with Age and Stage
Based on the Splus output provided answer these additional questions:
a. What is the odds ratio of surviving two years for a patient with disease in stage IV versus stage I?
age   stage   xi                            πi /(1 − πi )
NA    IV      (1, x1 , x2 , 0, 0, 1, 0)′    exp{β0 + β1 x1 + β2 x2 + β5 }
NA    I       (1, x1 , x2 , 0, 0, 0, 0)′    exp{β0 + β1 x1 + β2 x2 }
The ratio is e^{β5} which is estimated as e^{β̂5} = e^{−4.368} = 0.013. Therefore we can say, “When controlling for age, the
odds of surviving two years among those diagnosed in stage IV is 0.013 times the odds among subjects diagnosed
in stage I”. Clearly diagnosis in stage IV has a substantial effect on reducing the probability of survival.
b. What is the odds ratio of surviving for a patient aged 12-23 months versus 0-11 months?
age     stage   xi                              πi /(1 − πi )
12-23   NA      (1, 1, 0, x3 , x4 , x5 , x6 )′  exp{β0 + β1 + β3 x3 + β4 x4 + β5 x5 + β6 x6 }
0-11    NA      (1, 0, 0, x3 , x4 , x5 , x6 )′  exp{β0 + β3 x3 + β4 x4 + β5 x5 + β6 x6 }
The odds ratio is therefore e^{β1} which is estimated as e^{β̂1} = e^{−2.12} = 0.12. Therefore we can say, “When controlling
for stage, the odds of surviving two years among those diagnosed at 12-23 months is 0.12 times the odds of
surviving two years among subjects diagnosed at 0-11 months of age.” Diagnosis at the intermediate age range
greatly reduces the probability of surviving two years compared to the youngest age range.

c. What is the odds ratio of surviving for a patient aged 24+ months versus 12-23 months?
age     stage   xi                              πi /(1 − πi )
24+     NA      (1, 0, 1, x3 , x4 , x5 , x6 )′  exp{β0 + β2 + β3 x3 + β4 x4 + β5 x5 + β6 x6 }
12-23   NA      (1, 1, 0, x3 , x4 , x5 , x6 )′  exp{β0 + β1 + β3 x3 + β4 x4 + β5 x5 + β6 x6 }
The odds ratio is therefore e^{β2−β1} which is estimated as e^{β̂2−β̂1} = e^{−2.613−(−2.118)} = e^{−0.495} = 0.61. Therefore we can say,
“When controlling for stage, the odds of surviving two years among those diagnosed at 24+ months of age is 0.61
times the odds of surviving two years among subjects diagnosed at 12-23 months of age.” Note that here we have
an odds ratio that is a function of two regression coefficients and the estimate is a function of two estimated
coefficients.
d. Given the final model, what is the odds ratio of surviving for a patient whose age is in the 12-23 months range with
disease in stage II versus a patient whose age is in the 0-11 months range with disease in stage I?
age     stage   xi                        πi /(1 − πi )
12-23   II      (1, 1, 0, 1, 0, 0, 0)′    exp{β0 + β1 + β3 }
0-11    I       (1, 0, 0, 0, 0, 0, 0)′    exp{β0 }
The odds ratio is therefore e^{β1+β3} which is estimated as e^{β̂1+β̂3} = e^{−2.12+(−1.25)} = 0.03. Therefore we can say,
“The odds of surviving two years among those diagnosed in stage II at 12-23 months of age is 0.03 times the
odds of surviving two years among subjects diagnosed in stage I at 0-11 months of age.” Notice we do not say
“when controlling for X” here, since both factors define the groups being compared.
e. Given the final model, what is the odds ratio of surviving for a patient whose age is in the 24+ months range with
disease in stage IV versus a patient whose age is in the 12-23 months range with disease in stage III?
age     stage   xi                        πi /(1 − πi )
24+     IV      (1, 0, 1, 0, 0, 1, 0)′    exp{β0 + β2 + β5 }
12-23   III     (1, 1, 0, 0, 1, 0, 0)′    exp{β0 + β1 + β4 }
The odds ratio is therefore e^{β2+β5−β1−β4} which is estimated as
e^{β̂2+β̂5−β̂1−β̂4} = e^{−2.61−4.37−(−2.12)−(−1.78)} = e^{−3.08} = 0.046.
Therefore we can say, “The odds of surviving two years among those diagnosed in stage IV at 24+ months of age
is 0.046 times the odds of surviving two years among subjects diagnosed in stage III at 12-23 months of age.”
f. Construct a confidence interval for the odds ratios in a) and b).
For question a we get
(e^{−4.37−1.96(0.79)} , e^{−4.37+1.96(0.79)} ) = (0.003, 0.060)
and similar calculations for question b give (0.039, 0.369).

g. Can you construct a confidence interval for the odds ratios in c), d) and e)?
This is a trickier problem.
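The point estimates asked for in the questions above, and the two intervals that involve a single coefficient, can be checked with a few lines of Python. This is an added sketch, not the Splus session; the estimates and standard errors are copied from the summary(model1) output for the age and stage model, and the contrast vectors and names are our own encoding of each comparison.

```python
import math

# Order: intercept, ageft2, ageft3, stageft2, stageft3, stageft4, stageft5.
beta_hat = [3.317448, -2.118049, -2.612933, -1.252876,
            -1.775857, -4.367750, -1.022186]
std_err = [0.7690988, 0.5724491, 0.5005830, 0.7814548,
           0.7982609, 0.7874140, 0.8621250]

def log_odds_ratio(contrast):
    """c' beta_hat: difference of the two linear predictors being compared."""
    return sum(c * b for c, b in zip(contrast, beta_hat))

contrasts = {
    "a": [0, 0, 0, 0, 0, 1, 0],    # stage IV vs stage I
    "b": [0, 1, 0, 0, 0, 0, 0],    # age 12-23 vs age 0-11
    "c": [0, -1, 1, 0, 0, 0, 0],   # age 24+ vs age 12-23
    "d": [0, 1, 0, 1, 0, 0, 0],    # (12-23, II) vs (0-11, I)
    "e": [0, -1, 1, 0, -1, 1, 0],  # (24+, IV) vs (12-23, III)
}
odds_ratio = {k: math.exp(log_odds_ratio(c)) for k, c in contrasts.items()}

def single_coef_ci(index):
    """Approximate 95% CI for exp(beta_index), valid for one coefficient."""
    b, se = beta_hat[index], std_err[index]
    return (math.exp(b - 1.96 * se), math.exp(b + 1.96 * se))

ci_a = single_coef_ci(5)
ci_b = single_coef_ci(1)

for key in sorted(odds_ratio):
    print(key, round(odds_ratio[key], 3))
print(ci_a, ci_b)
```

Intervals for the comparisons that combine several coefficients need the covariances between the estimates as well as the standard errors, which is why they are harder.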

2.9 Confidence Intervals for non-linear functions of ηi
Recall that since β̂ is an MLE, β̂ ∼ M V N (β, I^{−1}(β̂)) approximately. This means that
$$x_i'\hat\beta \sim N\left( x_i'\beta,\; x_i' I^{-1}(\hat\beta) x_i \right)$$
and
$$\frac{x_i'\hat\beta - x_i'\beta}{\sqrt{x_i' I^{-1}(\hat\beta) x_i}} \sim N(0, 1)$$
approximately.
• An approximate 95% CI for ηi = x′i β is then given by
$$(\hat\eta_L, \hat\eta_U) = \left( x_i'\hat\beta - 1.96\sqrt{x_i' I^{-1}(\hat\beta) x_i},\;\; x_i'\hat\beta + 1.96\sqrt{x_i' I^{-1}(\hat\beta) x_i} \right)$$
• If the odds ratio of interest is expressed as exp{c′ β} where c is a column vector defining the contrast of the
regression coefficients, then an approximate 95 % confidence interval for this odds ratio is
$$\left( \exp\left\{ c'\hat\beta - 1.96\sqrt{c' I^{-1}(\hat\beta) c} \right\},\;\; \exp\left\{ c'\hat\beta + 1.96\sqrt{c' I^{-1}(\hat\beta) c} \right\} \right)$$
• An approximate 95% CI for πi = exp{x′i β}/(1 + exp{x′i β}) is then given by
$$\left( \exp\{\hat\eta_L\}/(1 + \exp\{\hat\eta_L\}),\;\; \exp\{\hat\eta_U\}/(1 + \exp\{\hat\eta_U\}) \right)$$
As an example we consider estimation of an approximate 95 % CI for the odds ratio in question c of the
neuroblastoma example. The vector defining the contrast of interest is c = (0, −1, 1, 0, 0, 0, 0)′.
The program used to compute the variance of a contrast of β̂ follows.
# use the summary.glm function and store the result in the tmp object
> tmp <- summary.glm(model1)
# examine the contents of the tmp object and store the covariance matrix in v
> names(tmp)
> v <- tmp$cov.unscaled
> v
# create x vector to get contrast of regression coefficients
> x <- c(0,-1,1,0,0,0,0)
> x <- matrix(x,7,1)
> dim(x)
# compute variance estimate of difference in estimates for age parameters
> t(x)%*%v%*%x
The output from these commands follows.
> names(tmp)
[1] "call" "terms" "coefficients" "dispersion"
[5] "df" "deviance.resid" "cov.unscaled" "correlation"
[9] "deviance" "null.deviance" "iter" "nas"
> v
(Intercept) ageft2 ageft3 stageft2 stageft3
(Intercept) 0.5915130 -0.18262588 -0.17449687 -0.46039170 -0.43349301
ageft2 -0.1826259 0.32769796 0.15697614 0.01815174 -0.01658807
ageft3 -0.1744969 0.15697614 0.25058331 0.01617304 -0.01566243
stageft2 -0.4603917 0.01815174 0.01617304 0.61067160 0.44531795

stageft3 -0.4334930 -0.01658807 -0.01566243 0.44531795 0.63722040
stageft4 -0.5055946 0.08157857 0.06642039 0.45225032 0.44123334
stageft5 -0.4933364 0.06007507 0.05556688 0.45103564 0.44234617
stageft4 stageft5
(Intercept) -0.50559456 -0.49333643
ageft2 0.08157857 0.06007507
ageft3 0.06642039 0.05556688
stageft2 0.45225032 0.45103564
stageft3 0.44123334 0.44234617
stageft4 0.62002078 0.46569746
stageft5 0.46569746 0.74325944
> x <- c(0,-1,1,0,0,0,0)
> x <- matrix(x,7,1)
> dim(x)
[1] 7 1
> t(x)%*%v%*%x
[,1]
[1,] 0.264329
We know from before that $e^{\hat\beta_2 - \hat\beta_1} = e^{-2.613 - (-2.118)} = e^{-0.495} = 0.61$. The Splus output above indicates that
$\widehat{\mathrm{var}}(\hat\beta_2 - \hat\beta_1) = 0.264$. An approximate 95% CI for $\beta_2 - \beta_1$ is therefore
$$\left(-0.495 - 1.96\sqrt{0.264},\; -0.495 + 1.96\sqrt{0.264}\right) = (-1.50, 0.51),$$
and the corresponding interval for the odds ratio is (0.22, 1.67).
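The same calculation can be sketched in Python instead of Splus. The helper names quad_form and wald_ci_odds_ratio are hypothetical. Because the contrast c has zeros everywhere except the two age positions, only the ageft2/ageft3 block of the covariance matrix printed above is needed.

```python
import math

def quad_form(c, V):
    """Return c' V c for a list c and a square matrix V given as nested lists."""
    n = len(c)
    return sum(c[i] * V[i][j] * c[j] for i in range(n) for j in range(n))

def wald_ci_odds_ratio(log_or, variance, z=1.96):
    """Wald interval for a log odds ratio, and the same interval exponentiated."""
    half_width = z * math.sqrt(variance)
    lo, hi = log_or - half_width, log_or + half_width
    return (lo, hi), (math.exp(lo), math.exp(hi))

# ageft2/ageft3 block of cov.unscaled, copied from the output above.
V_age = [[0.32769796, 0.15697614],
         [0.15697614, 0.25058331]]
c_age = [-1.0, 1.0]  # the non-zero part of c = (0, -1, 1, 0, 0, 0, 0)'

variance = quad_form(c_age, V_age)
log_or = -2.613 - (-2.118)
ci_log, ci_or = wald_ci_odds_ratio(log_or, variance)

print(round(variance, 6))
print([round(v, 2) for v in ci_log], [round(v, 2) for v in ci_or])
```

The quadratic form expands to var(β̂1) + var(β̂2) − 2 cov(β̂1, β̂2), which is why the covariance between the two age coefficients cannot be ignored when a contrast involves both.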

2.10 Bioassay and Dose Response Models
2.10.1 Definitions
One of the first applications of regression procedures for binary data was for the analysis of bioassay results. Bioassay
experiments are typically animal studies designed to investigate the toxicity of an agent. This is achieved by dividing
a sample of animals, say beetles, into several groups and exposing the groups to varying amounts of the toxin. The
impact of exposure to the toxin is then assessed by examining what happens to the beetles some fixed time after
exposure. Below are some relevant definitions.
Stimulus : Each group of animals is subjected to a toxin at a stated dose. The dose indicates the intensity and
is usually specified as the log of the concentration. The dose may vary from group to group but is assumed to be
constant within a group.
Response : As a result of the stimulus, subjects will manifest a binary response (often of the form died/survived).
Tolerance : For any one subject there is a certain level of intensity below which the response will not occur and
above which it will occur. This level is called the tolerance (or threshold). The tolerance varies from one individual to
another in the population and therefore from subject to subject in the sample. We can therefore ascribe a distribution
to it.
Tolerance Distribution : Let z = concentration of toxin. Work with x = log z to measure intensity of stimulus, i.e.
dose. Let f(x) denote the p.d.f. of the distribution of the tolerance in the population. Then if a dose $x_0$ were given
to the entire population, the proportion
$$\pi = \int_{-\infty}^{x_0} f(s)\,ds$$
would respond. (Note that $\int_{-\infty}^{\infty} f(s)\,ds = 1$ since they would all respond eventually.)
2.10.2 Modelling the Dose-Response Relationship
Let $y_j$ denote the number dying out of the original $m_j$, where we assume $Y_j \sim \mathrm{Bin}(m_j, \pi_j)$ and $\pi_j$ is the probability of
dying for the jth group, j = 1, 2, . . . , J. Let $x_j$ indicate the dose of the stimulus (on the log concentration scale). We want to
model $\pi_j$, j = 1, 2, . . . , J, as a function of $x_j$, where $x_j$ is a continuous covariate.
In general, write π(x) to indicate dependence on x. Since 0 ≤ π ≤ 1, the usual set up is to model with
$$g(\pi) = \beta_0 + \beta_1 x,$$
where g(·) is a link function, or equivalently
$$\pi(x) = F^*(\beta_0 + \beta_1 x), \quad \text{where } F^*(\cdot) = g^{-1}(\cdot).$$
One often observes dose response curves that are sigmoidal in shape. This suggests selecting link functions such that
$F^*(\cdot)$ is a c.d.f. If
$$\pi(x) = g^{-1}(\beta_0 + \beta_1 x) = F^*(\beta_0 + \beta_1 x) = \int_{-\infty}^{x} f(s)\,ds,$$
then the p.d.f. f is called the tolerance distribution. It determines how the “probability of a positive response” changes
with the value of the dose:
$$\frac{\partial \pi(x)}{\partial x} = (F^*)'(\beta_0 + \beta_1 x)\,\beta_1 = f(x).$$
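As a worked instance of this derivative, take the logistic choice of $F^*$ (one of the three tolerance distributions considered in this section) and assume $\beta_1 > 0$ so that the result is a proper density in x:

```latex
\begin{align*}
\pi(x) &= \frac{e^{\beta_0+\beta_1 x}}{1+e^{\beta_0+\beta_1 x}}, \\
\frac{\partial \pi(x)}{\partial x}
  &= \beta_1 \frac{e^{\beta_0+\beta_1 x}}{\left(1+e^{\beta_0+\beta_1 x}\right)^2}
   = \beta_1\,\pi(x)\{1-\pi(x)\} = f(x).
\end{align*}
```

The slope of the dose-response curve is therefore largest where π(x) = 1/2, that is at the dose x = −β0/β1.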
Consider a single binary response (died/survived) and a single covariate x, representing dose, with π the probability
of death and η = β0 + β1 x. The following example illustrates this setting.

2.10.3 A Dose-Response Problem
Consider an experiment by Bliss (Annals of Applied Biology, 1935) in which groups of beetles were exposed to varying
concentrations of carbon disulphide (CS2 ) gas. The following results were recorded.
Dose (xi)   # of insects killed (yi)   # of insects (mi)   yi/mi
1.6907       6    59    0.10
1.7242      13    60    0.22
1.7552      18    62    0.29
1.7842      28    56    0.50
1.8113      52    63    0.83
1.8369      53    59    0.90
1.8610      61    62    0.98
1.8839      60    60    1.00
What follows is the data file beetle.dat in which the first line contains the variable labels and the remaining 8 lines
the data.
dose y m
1.6907 6 59
1.7242 13 60
1.7552 18 62
1.7842 28 56
1.8113 52 63
1.8369 53 59
1.8610 61 62
1.8839 60 60
Splus program for analysis of dose-response data
beetle.dat <- read.table("beetle.dat", header=T)
beetle.dat$resp <- cbind(beetle.dat$y, beetle.dat$m-beetle.dat$y)
beetle.dat
# here we fit a logistic model involving dose
model1 <- glm(resp ~ dose, family=binomial(link=logit), data=beetle.dat)
summary(model1)
# here we record deviance residuals in rd1 and fitted values in fv1
rd1 <- residuals(model1, type="deviance")
fv1 <- fitted(model1)
# plotting the deviance residuals by dose and by fitted values
motif()
par(mfrow=c(3,2))
plot(beetle.dat$dose,rd1,ylim=c(-5,5), xlab="DOSE",ylab="DEVIANCE RESIDUALS")
abline(h=-2) ; abline(h= 2); title("Model 1 - logit link")
plot(fv1,rd1,ylim=c(-5,5), xlab="FITTED VALUE",ylab="DEVIANCE RESIDUALS")
abline(h=-2) ; abline(h= 2); title("Model 1 - logit link")
# here we fit a probit model involving dose
model2 <- glm(resp ~ dose, family=binomial(link=probit), data=beetle.dat)
summary(model2)

October 21, 2008
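Tolerance distributions and the corresponding link functions

Name: normal
Form: $f(s) = \beta_1 \exp\{-(\beta_0 + \beta_1 s)^2/2\}/\sqrt{2\pi}$
$\pi = g^{-1}(\eta)$: $\pi(x) = \int_{-\infty}^{x} \frac{\beta_1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(\beta_0 + \beta_1 s)^2}\,ds = \Phi(\beta_0 + \beta_1 x) = \Phi(\eta)$
$\eta = g(\pi)$: $\eta = \Phi^{-1}(\pi)$

Name: logistic
Form: $f(s) = \beta_1 e^{\beta_0 + \beta_1 s}/[1 + e^{\beta_0 + \beta_1 s}]^2$
$\pi = g^{-1}(\eta)$: $\pi(x) = \int_{-\infty}^{x} \frac{\beta_1 e^{\beta_0 + \beta_1 s}}{[1 + e^{\beta_0 + \beta_1 s}]^2}\,ds = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} = \frac{e^{\eta}}{1 + e^{\eta}}$
$\eta = g(\pi)$: $\eta = \log\left(\frac{\pi}{1 - \pi}\right)$

Name: extreme value
Form: $f(s) = \beta_1 \exp\{\beta_0 + \beta_1 s - \exp(\beta_0 + \beta_1 s)\}$
$\pi = g^{-1}(\eta)$: $\pi(x) = \int_{-\infty}^{x} \beta_1 e^{\beta_0 + \beta_1 s - e^{\beta_0 + \beta_1 s}}\,ds = \int_{-\infty}^{\eta} e^{\nu - e^{\nu}}\,d\nu = 1 - \exp(-e^{\eta})$
$\eta = g(\pi)$: $\eta = \log(-\log(1 - \pi))$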
RJ Cook
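# here we fit a complementary log-log model involving dose
model3 <- glm(resp ~ dose, family=binomial(link=cloglog), data=beetle.dat)
summary(model3)
# deviance residuals and fitted values for models 2 and 3, plotted as for model 1
rd2 <- residuals(model2, type="deviance"); fv2 <- fitted(model2)
plot(beetle.dat$dose,rd2,ylim=c(-5,5), xlab="DOSE",ylab="DEVIANCE RESIDUALS")
abline(h=-2); abline(h= 2); title("Model 2 - probit link")
plot(fv2,rd2,ylim=c(-5,5), xlab="FITTED VALUE",ylab="DEVIANCE RESIDUALS")
abline(h=-2); abline(h= 2); title("Model 2 - probit link")
rd3 <- residuals(model3, type="deviance"); fv3 <- fitted(model3)
plot(beetle.dat$dose,rd3,ylim=c(-5,5), xlab="DOSE",ylab="DEVIANCE RESIDUALS")
abline(h=-2); abline(h= 2); title("Model 3 - log-log link")
plot(fv3,rd3,ylim=c(-5,5), xlab="FITTED VALUE",ylab="DEVIANCE RESIDUALS")
abline(h=-2); abline(h= 2); title("Model 3 - log-log link")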
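Selected output from the model fitting is reported here. In each coefficient table the three columns are the estimate,
its standard error, and their ratio; the unlabelled value after each table is the correlation between the two
coefficient estimates.
> model1 <- glm(resp ~ dose, family=binomial(link=logit),data=beetle.dat)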
> summary(model1)
Call: glm(formula = resp ~ dose, family = binomial(link = logit), data = beetle.dat)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.594122 -0.3943905 0.8329144 1.259226 1.593991
Coefficients:
(Intercept) -60.71733 5.173518 -11.73618
dose 34.27026 2.908076 11.78451
(Intercept)
dose -0.9996809
> model2 <- glm(resp ~ dose, family=binomial(link=probit),data=beetle.dat)

> summary(model2)
Call: glm(formula = resp ~ dose, family = binomial(link = probit), data = beetle.dat)

Deviance Residuals:
-1.571435 -0.4702437 0.7500979 1.063238 1.34479
Coefficients:
(Intercept) -34.93468 2.647945 -13.19313
dose 19.72761 1.487237 13.26460

(Intercept)
dose -0.9996109
> model3 <- glm(resp ~ dose, family=binomial(link=cloglog),data=beetle.dat)

> summary(model3)
Call: glm(formula = resp ~ dose, family = binomial(link = cloglog), data = beetle.dat)
Deviance Residuals:
-0.8032665 -0.5513584 0.03089627 0.3831507 1.288832
Coefficients:
(Intercept) -39.57252 3.236839 -12.22567
dose 22.04129 1.797415 12.26277
(Intercept)
dose -0.9996995
[Figure: six panels in three rows, titled "Model 1 - logit link", "Model 2 - probit link" and "Model 3 - log-log link"; each row plots DEVIANCE RESIDUALS (axis from -4 to 4) against DOSE on the left and against FITTED VALUE on the right.]
Fig. 2.2. Plot of deviance residuals by dose and by fitted values

We can plot the actual data (as yi /mi ) against dose, and see how well the dose-response curves fit the data. The
program to do this is included below.
beetle.dat <- read.table("beetle.dat", header=T)
beetle.dat$resp <- cbind(beetle.dat$y, beetle.dat$m-beetle.dat$y)
beetle.dat
motif()
plot(beetle.dat$dose, beetle.dat$y/beetle.dat$m, xlim=c(1.65,1.95), ylim=c(0,1),
     xlab="DOSE", ylab="PROBABILITY OF DEATH")
x <- seq(1.65,1.95,by=0.001)
prob <- as.vector(rep(1,301))
# logit link: fitted curve drawn with line type 2
model <- glm(resp ~ dose, family=binomial(link=logit), data=beetle.dat)
beta <- as.vector(model$coefficients)
for(i in 1:301) { prob[i] <- exp(beta[1]+beta[2]*x[i])/(1+exp(beta[1]+beta[2]*x[i])) }
lines(x,prob,lty=2)
# probit link: the coefficients must be re-extracted after each new fit
model <- glm(resp ~ dose, family=binomial(link=probit), data=beetle.dat)
beta <- as.vector(model$coefficients)
for(i in 1:301) { prob[i] <- pnorm(beta[1] + beta[2] * x[i]) }
lines(x,prob,lty=5)
# complementary log-log link
model <- glm(resp ~ dose, family=binomial(link=cloglog), data=beetle.dat)
beta <- as.vector(model$coefficients)
for(i in 1:301) { prob[i] <- 1 - exp( - exp(beta[1] + beta[2] * x[i])) }
lines(x,prob,lty=1)
# click once on the plot to place the legend
legend(locator(1), c("LOGIT","PROBIT","CLOGLOG"), lty=c(2,5,1), bty="n")
[Figure: observed proportions plotted as points with the fitted LOGIT, PROBIT and CLOGLOG curves; PROBABILITY OF DEATH (0.0 to 1.0) against DOSE (1.65 to 1.95).]
Fig. 2.3. Plot of the data and the dose-response curves for Bliss (1935) data
Note that the curve for the complementary log-log link fits the data better than the other two, as one would expect
from the residual plots and the deviance statistics.
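The comparison of the three links can be sketched numerically in Python rather than Splus. The sketch below plugs the coefficient estimates reported in the output above into each inverse link and evaluates the usual binomial deviance, $2\sum_i [y_i \log(y_i/\hat\mu_i) + (m_i - y_i)\log\{(m_i - y_i)/(m_i - \hat\mu_i)\}]$ with $\hat\mu_i = m_i\hat\pi_i$. The function and variable names are illustrative, and because the coefficients are rounded the values are approximate.

```python
import math

# Beetle mortality data from the table above: dose, number killed, group size.
dose = [1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839]
killed = [6, 13, 18, 28, 52, 53, 61, 60]
size = [59, 60, 62, 56, 63, 59, 62, 60]

# Inverse link functions F*(eta) for the three tolerance distributions.
inverse_link = {
    "logit": lambda eta: 1.0 / (1.0 + math.exp(-eta)),
    "probit": lambda eta: 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0))),
    "cloglog": lambda eta: 1.0 - math.exp(-math.exp(eta)),
}

# (intercept, slope) estimates as reported in the Splus output above.
coef = {
    "logit": (-60.71733, 34.27026),
    "probit": (-34.93468, 19.72761),
    "cloglog": (-39.57252, 22.04129),
}

def xlogy(a, b):
    """a * log(a / b), with the convention 0 * log(0) = 0."""
    return 0.0 if a == 0 else a * math.log(a / b)

def binomial_deviance(link):
    b0, b1 = coef[link]
    total = 0.0
    for x, y, m in zip(dose, killed, size):
        fitted = m * inverse_link[link](b0 + b1 * x)
        total += 2.0 * (xlogy(y, fitted) + xlogy(m - y, m - fitted))
    return total

deviance = {link: binomial_deviance(link) for link in coef}
for link, value in deviance.items():
    print(link, round(value, 2))
```

All three models use two parameters on the same eight groups, so their deviances share the same degrees of freedom and can be compared directly.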

2.11 Maximum Likelihood for Binomial Models
2.11.1 The Algorithm for Estimation
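Recall that $\ell(\beta)$ is the log-likelihood in terms of β, with score vector
$$S(\beta) = \frac{\partial \ell(\beta)}{\partial \beta}$$
and information matrix with (j, k) entry
$$I_{jk}(\beta) = -\frac{\partial^2 \ell(\beta)}{\partial \beta_j\, \partial \beta_k}.$$
Then asymptotically,
$$(\hat\beta - \beta)'\, I(\beta)\, (\hat\beta - \beta) \sim \chi^2_p$$
and
$$(\hat\beta_j - \beta_j)\left[\{I^{-1}(\hat\beta)\}_{jj}\right]^{-1/2} \sim N(0, 1),$$
where dim(β) = p.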
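In general, a first-order Taylor expansion of the score about β gives
$$S(\hat\beta) \simeq S(\beta) + S'(\beta)(\hat\beta - \beta).$$
Since $S(\hat\beta) = 0$ at the MLE and $I(\beta) = -S'(\beta)$,
$$S(\beta) \simeq -S'(\beta)(\hat\beta - \beta)$$
$$S(\beta) \simeq I(\beta)(\hat\beta - \beta)$$
$$\hat\beta \simeq \beta + I^{-1}(\beta) S(\beta).$$
This suggests that a Newton-Raphson search for the MLE might be appropriate. The Newton-Raphson iterations
take the following form:
$$\hat\beta^{(r)} = \hat\beta^{(r-1)} + I^{-1}(\hat\beta^{(r-1)})\, S(\hat\beta^{(r-1)}),$$
where we iterate until convergence. Convergence criteria might be specified based on the norm of the score vector at $\hat\beta^{(r)}$
(e.g. $\|S(\hat\beta^{(r)})\| \le \epsilon$, where $\|x\|$ denotes the Euclidean norm of x and ε is a specified tolerance), or based on the
magnitude of the change in successive values of β (e.g. $\|\hat\beta^{(r)} - \hat\beta^{(r-1)}\| \le \epsilon$).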
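Consider the specific case of binomially distributed data:
$$\begin{aligned}
\ell(\beta) &= \sum_{i=1}^n \left[y_i \log(\pi_i/(1 - \pi_i)) + m_i \log(1 - \pi_i)\right] \\
&= \sum_{i=1}^n \left[y_i x_i'\beta - m_i \log(1 + \exp\{x_i'\beta\})\right] \\
&= \sum_{i=1}^n \left[y_i \eta_i - m_i \log(1 + \exp\{\eta_i\})\right] \\
&= \sum_{i=1}^n \ell_i(\beta)
\end{aligned}$$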
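Differentiating with respect to $\beta_j$ we get
$$\frac{\partial \ell}{\partial \beta_j} = \sum_{i=1}^n \frac{\partial \ell_i}{\partial \beta_j} = \sum_{i=1}^n \frac{\partial \eta_i}{\partial \beta_j}\,\frac{\partial \ell_i}{\partial \eta_i}.$$
We can write this in vector notation as $S(\beta) = X' S_\eta$ where $S_\eta = (\partial \ell_1/\partial \eta_1, \ldots, \partial \ell_n/\partial \eta_n)'$.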
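Differentiating again to get the information matrix, we get
$$\begin{aligned}
\frac{\partial^2 \ell}{\partial \beta_k\, \partial \beta_j}
&= \sum_{i=1}^n \frac{\partial^2 \ell_i}{\partial \beta_k\, \partial \beta_j} \\
&= \sum_{i=1}^n \frac{\partial}{\partial \beta_k}\left(\frac{\partial \eta_i}{\partial \beta_j}\,\frac{\partial \ell_i}{\partial \eta_i}\right) \\
&= \sum_{i=1}^n \left[\frac{\partial \eta_i}{\partial \beta_j} \cdot \frac{\partial}{\partial \beta_k}\left(\frac{\partial \ell_i}{\partial \eta_i}\right) + \frac{\partial \ell_i}{\partial \eta_i} \cdot \frac{\partial}{\partial \beta_k}(x_{ij})\right] \\
&= \sum_{i=1}^n \frac{\partial \eta_i}{\partial \beta_j} \cdot \frac{\partial}{\partial \beta_k}\left(\frac{\partial \ell_i}{\partial \eta_i}\right) + 0 \\
&= \sum_{i=1}^n \frac{\partial \eta_i}{\partial \beta_j} \cdot \frac{\partial}{\partial \eta_i}\left(\frac{\partial \ell_i}{\partial \eta_i}\right) \frac{\partial \eta_i}{\partial \beta_k} \\
&= \sum_{i=1}^n x_{ij}\, x_{ik}\, \frac{\partial^2 \ell_i}{\partial \eta_i^2}.
\end{aligned}$$
Since the (j, k) entry of I(β) is the negative of this second derivative, the information matrix can be written in matrix
notation as $I(\beta) = X' I_\eta(\eta) X$ where $I_\eta(\eta) = -\partial S_\eta/\partial \eta'$. Using these
results, the Newton-Raphson algorithm gives
$$\begin{aligned}
\hat\beta^{(r)} &= \hat\beta^{(r-1)} + I^{-1}(\hat\beta^{(r-1)})\, S(\hat\beta^{(r-1)}) \\
\hat\beta^{(r)} &= \hat\beta^{(r-1)} + [X' I_\eta(\hat\eta^{(r-1)}) X]^{-1} X' S_\eta(\hat\eta^{(r-1)}) \\
X' I_\eta(\hat\eta^{(r-1)}) X\, \hat\beta^{(r)} &= X' I_\eta(\hat\eta^{(r-1)}) X\, \hat\beta^{(r-1)} + X' S_\eta(\hat\eta^{(r-1)}) \\
X' I_\eta(\hat\eta^{(r-1)}) X\, \hat\beta^{(r)} &= X' I_\eta(\hat\eta^{(r-1)}) \left[X \hat\beta^{(r-1)} + I_\eta^{-1}(\hat\eta^{(r-1)})\, S_\eta(\hat\eta^{(r-1)})\right]
\end{aligned}$$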
The Newton-Raphson algorithm for models of this sort is often referred to as iteratively re-weighted least squares.
To explain this, consider the following. Let $Z^{(r-1)}$ denote a vector of artificially constructed responses defined at the
(r − 1)st iteration, and let $W^{(r-1)}$ be a matrix defined at the (r − 1)st iteration, as follows:
$$Z^{(r-1)} = X \hat\beta^{(r-1)} + I_\eta^{-1}(\hat\eta^{(r-1)})\, S_\eta(\hat\eta^{(r-1)})$$
$$W^{(r-1)} = I_\eta(\hat\eta^{(r-1)})$$
Then we can write the Newton-Raphson iterations above in terms of W and Z as
$$(X' W^{(r-1)} X)\, \hat\beta^{(r)} = X' W^{(r-1)} Z^{(r-1)}$$
$$\hat\beta^{(r)} = (X' W^{(r-1)} X)^{-1} X' W^{(r-1)} Z^{(r-1)}$$
This can be thought of as a series of weighted regressions of the vector $Z^{(r-1)}$ on a set of covariates X with weight
$W^{(r-1)}$. Since these calculations must be repeated, it is referred to as iteratively reweighted least squares (IRWLS, or
sometimes IWLS).
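Consider the Binomial Distribution with the Logistic Link
$$\ell = \sum_{i=1}^n \left[y_i \log(\pi_i/(1 - \pi_i)) + m_i \log(1 - \pi_i)\right]$$
The logistic link implies
$$\log(\pi_i/(1 - \pi_i)) = \eta_i, \qquad \pi_i = \exp\{\eta_i\}/(1 + \exp\{\eta_i\}),$$
which gives
$$\ell = \sum_{i=1}^n \left[y_i \eta_i - m_i \log(1 + \exp\{\eta_i\})\right].$$
Now differentiating with respect to $\eta_i$ we get
$$\frac{\partial \ell}{\partial \eta_i} = y_i - m_i \exp\{\eta_i\}/(1 + \exp\{\eta_i\}) = y_i - m_i \pi_i$$
and differentiating again we get
$$-\frac{\partial^2 \ell}{\partial \eta_i^2} = -\left(-m_i\, \frac{e^{\eta_i}(1 + e^{\eta_i}) - e^{\eta_i} e^{\eta_i}}{(1 + e^{\eta_i})^2}\right) = m_i \pi_i (1 - \pi_i).$$
Then here we have
$$Z_i^{(r)} = \hat\eta_i^{(r)} + \frac{1}{m_i \hat\pi_i^{(r)} (1 - \hat\pi_i^{(r)})}\,(y_i - m_i \hat\pi_i^{(r)})$$
and
$$W^{(r)} = \begin{pmatrix} w_{11}^{(r)} & 0 & \cdots & 0 \\ 0 & w_{22}^{(r)} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & w_{nn}^{(r)} \end{pmatrix}$$
where $w_{ii}^{(r)} = m_i \pi_i^{(r)} (1 - \pi_i^{(r)})$.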
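A minimal sketch of these iterations in Python (not the Splus used elsewhere in these notes), applied to the beetle data of Section 2.10.3 with a single covariate. The function name irls_logistic, the zero starting values, and the stopping rule on the change in the estimates are assumptions made for illustration; they are not taken from the Splus glm implementation.

```python
import math

# Beetle data from Section 2.10.3: dose, number killed, group size.
dose = [1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839]
killed = [6, 13, 18, 28, 52, 53, 61, 60]
size = [59, 60, 62, 56, 63, 59, 62, 60]

def irls_logistic(x, y, m, tol=1e-8, max_iter=50):
    """IRLS for logit(pi_i) = b0 + b1 * x_i with y_i ~ Bin(m_i, pi_i)."""
    b0, b1 = 0.0, 0.0  # starting values (an arbitrary choice for this sketch)
    for _ in range(max_iter):
        # Working weights w_ii and working responses Z_i at the current estimate.
        w, z = [], []
        for xi, yi, mi in zip(x, y, m):
            eta = b0 + b1 * xi
            pi = 1.0 / (1.0 + math.exp(-eta))
            wi = mi * pi * (1.0 - pi)
            w.append(wi)
            z.append(eta + (yi - mi * pi) / wi)
        # Solve the 2x2 weighted normal equations (X'WX) beta = X'WZ.
        s_w = sum(w)
        s_wx = sum(wi * xi for wi, xi in zip(w, x))
        s_wxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        s_wz = sum(wi * zi for wi, zi in zip(w, z))
        s_wxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = s_w * s_wxx - s_wx * s_wx
        new_b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        new_b1 = (s_w * s_wxz - s_wx * s_wz) / det
        change = math.hypot(new_b0 - b0, new_b1 - b1)
        b0, b1 = new_b0, new_b1
        if change <= tol:
            break
    # Diagonal of (X'WX)^{-1}, using the weights from the last update.
    std_err = (math.sqrt(s_wxx / det), math.sqrt(s_w / det))
    return (b0, b1), std_err

beta_hat, std_err = irls_logistic(dose, killed, size)
print(beta_hat, std_err)
```

With one covariate the weighted least squares step reduces to a 2 × 2 linear system, so no matrix library is needed; with more covariates the same update would be written with a general linear solver.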
2.12 Miscellaneous Remarks
2.12.1 A Note on Alternative Model Specifications
Suppose $Y_i \sim \mathrm{Bin}(m_i, \pi_i)$, i = 1, 2, . . . , n and there is a three level covariate indicating which of three age groups an
individual belongs to (e.g. 18-44 years old, 45-64 years old, and 65+ years old). Let
$$x_{i1} = \begin{cases} 1 & \text{45-64} \\ 0 & \text{o.w.} \end{cases} \qquad x_{i2} = \begin{cases} 1 & \text{65+} \\ 0 & \text{o.w.} \end{cases}$$
Then the logistic model may take the form
$$\log(\pi_i/(1 - \pi_i)) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}.$$
Alternatively, we could write the model as
$$\log(\pi_j/(1 - \pi_j)) = \alpha + \beta_j, \quad j = 1, 2, 3,$$
where $\pi_j$ is the probability for age group j and β1 = 0 is the required constraint. This method of writing down the model is particularly advantageous when
there are several covariates to be considered and they have several levels each. For example, if we had two factor
variables with 3 and 4 levels respectively, we could either write down a model based on 5 indicator variables explicitly,
or write
log (πjk /(1 − πjk )) = α + βj + γk , j = 1, 2, 3, k = 1, 2, 3, 4.
Here we need further constraints, which we might specify as
β1 = 0, γ1 = 0
These are sometimes called corner-point constraints (or treatment constraints in Splus). There are other constraints
that we can use, but these are often the most useful. We will return to this topic in the discussion of log-linear
models.
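To make the link between the two specifications concrete, the sketch below builds the row of indicator variables implied by the corner-point constraints β1 = 0, γ1 = 0 for the two-factor example with 3 and 4 levels. The function name corner_point_row is hypothetical, and Python is used only for illustration.

```python
def corner_point_row(j, k, levels_a=3, levels_b=4):
    """Design-matrix row for cell (j, k) under corner-point constraints.

    Columns: intercept (alpha), indicators for levels 2..levels_a of the first
    factor (beta_2, beta_3), indicators for levels 2..levels_b of the second
    factor (gamma_2, gamma_3, gamma_4). Level 1 of each factor is the baseline.
    """
    row = [1]
    row += [1 if j == level else 0 for level in range(2, levels_a + 1)]
    row += [1 if k == level else 0 for level in range(2, levels_b + 1)]
    return row

design = {(j, k): corner_point_row(j, k)
          for j in range(1, 4) for k in range(1, 5)}

for cell in sorted(design):
    print(cell, design[cell])
```

Each row has one intercept column plus 2 + 3 = 5 indicator columns, matching the count of indicator variables mentioned above, and the baseline cell (1, 1) has linear predictor α alone.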
2.12.2 A Note on the Pearson χ2 Statistic
Consider a table of binary responses of the following form.

Sample   Covariate   # Successes   # Failures   Total
1        x1          y1            m1 − y1      m1
2        x2          y2            m2 − y2      m2
↓        ↓           ↓             ↓            ↓
i        xi          yi            mi − yi      mi
↓        ↓           ↓             ↓            ↓
n        xn          yn            mn − yn      mn
For example, consider the 2 × 2 table :
           E     Ē          Total
x1 = 1     y1    m1 − y1    m1
x2 = 0     y2    m2 − y2    m2
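As we have stated already, Pearson’s χ2 goodness of fit statistic takes the form
$$\sum_{i=1}^n \frac{(y_i - m_i \hat\pi_i)^2}{m_i \hat\pi_i (1 - \hat\pi_i)} \sim \chi^2_{n-p}.$$
Let $o_{ij}$ and $e_{ij}$ denote the observed and expected cell counts for the (i, j) cell of the 2-way table. Consider a statistic
with the following form :
$$\begin{aligned}
\sum_{(i,j)} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}
&= \sum_{i=1}^n \left[\frac{(y_i - m_i \hat\pi_i)^2}{m_i \hat\pi_i} + \frac{((m_i - y_i) - m_i(1 - \hat\pi_i))^2}{m_i (1 - \hat\pi_i)}\right] \\
&= \sum_{i=1}^n \left[\frac{(y_i - m_i \hat\pi_i)^2}{m_i \hat\pi_i} + \frac{(y_i - m_i \hat\pi_i)^2}{m_i (1 - \hat\pi_i)}\right] \\
&= \sum_{i=1}^n \left[\frac{(y_i - m_i \hat\pi_i)^2 (1 - \hat\pi_i)}{m_i \hat\pi_i (1 - \hat\pi_i)} + \frac{(y_i - m_i \hat\pi_i)^2\, \hat\pi_i}{m_i \hat\pi_i (1 - \hat\pi_i)}\right] \\
&= \sum_{i=1}^n \frac{(y_i - m_i \hat\pi_i)^2}{m_i \hat\pi_i (1 - \hat\pi_i)}
\end{aligned}$$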
We have shown here that the Pearson χ2 statistic may be written in terms of all cells of the table, or in terms of
the binomial responses. The former way of writing it generalizes nicely to variables with more than two categories,
so we will see it again in this form when we consider log-linear models.
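The equality of the two forms can be checked numerically. The sketch below, in Python, uses the beetle data of Section 2.10.3 and fitted probabilities from the logit coefficients reported there; those choices are only for illustration, since the algebra above shows the identity holds for any fitted probabilities strictly between 0 and 1.

```python
import math

# Beetle data from Section 2.10.3 and the reported logit-link estimates.
dose = [1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839]
killed = [6, 13, 18, 28, 52, 53, 61, 60]
size = [59, 60, 62, 56, 63, 59, 62, 60]
b0, b1 = -60.71733, 34.27026

pi_hat = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in dose]

# Form 1: sum of (observed - expected)^2 / expected over both cells of each row.
cellwise = 0.0
for y, m, p in zip(killed, size, pi_hat):
    cellwise += (y - m * p) ** 2 / (m * p)                    # "success" cell
    cellwise += ((m - y) - m * (1 - p)) ** 2 / (m * (1 - p))  # "failure" cell

# Form 2: the binomial form, one term per row.
binomial_form = sum((y - m * p) ** 2 / (m * p * (1 - p))
                    for y, m, p in zip(killed, size, pi_hat))

print(cellwise, binomial_form)
```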
Questions
Problem 2.1.
Consider a 2 × 2 contingency table from a prospective study in which people who were or were not exposed to some
pollutant are followed-up and, after several years, categorized according to the presence or absence of a disease. The
table below shows the probabilities for each cell.
Probabilities for a prospective study
Diseased Not Diseased

Exposed π1 1 − π1
Unexposed π2 1 − π2
The odds of disease for each exposure group is given by $\pi_i/(1 - \pi_i)$, i = 1, 2, and so the odds ratio
$$\psi = \frac{\pi_1 (1 - \pi_2)}{\pi_2 (1 - \pi_1)}$$
is the relative odds of disease for the exposed versus the not exposed groups.
a. For the simple model πi = exp(αi )/(1 + exp(αi )) show that if there is no association between exposure status
and disease status, then there is no difference in the logits of the disease probabilities.
b. Consider J 2 × 2 tables like the table above, where there is one table for each combination of covariates in the
vector x = (1, x1 , x2 , . . . , xp )′ . Such data arise frequently when we wish to stratify by potentially confounding
variables such as sex, age (measured in decades perhaps), family history of disease, etc. Let $x_j$ denote the x for
stratum j, j = 1, . . . , J. For the logistic model
$$\pi_{ij} = \exp\{\alpha_i + x_j'\beta_i\}/(1 + \exp\{\alpha_i + x_j'\beta_i\}), \quad i = 1, 2, \; j = 1, \ldots, J,$$
show that the log odds ratio relating exposure status and disease status is constant over all tables if $\beta_1 = \beta_2$.
Problem 2.2.
Suppose Yk ∼ Bin(mk , πk ), k = 1, 2 are two independent binomial random variables. Let xk denote an explanatory
indicator variable which indicates which sample the response is from. Specifically xk = 1 if k = 1 and xk = 0 if k = 2.
Then consider the logistic regression model
log(πk /(1 − πk )) = β0 + β1 xk
a. Write down the likelihood in terms of π1 and π2 and then reparameterize it in terms of β0 and β1 .
b. Derive the score vector S(β) = (S1 , S2 )′ = (∂ℓ/∂β0 , ∂ℓ/∂β1 )′ , and the information matrix I(β) with (j, k) entry
$I_{jk}(\beta) = -\partial^2 \ell/\partial \beta_j \partial \beta_k$, j = 0, 1, k = 0, 1.
c. An asymptotic covariance matrix for β̂ is given by I −1 (β̂). Based on this show that the asymptotic variance for
β̂1 is estimated by
asvar(β̂1 ) = [1/y1 + 1/(m1 − y1 ) + 1/y2 + 1/(m2 − y2 )]
Hint: Upon finding the form of I(β), it will be convenient to re-express this matrix in terms of π1 and π2 for
simplification.
Problem 2.3.
The table below gives data reported by Gordon and Foss (1966). On each of 18 days very young babies in a hospital
nursery were chosen as subjects if they were not crying at a certain instant. One baby selected at random was rocked
for a set period, the remainder serving as controls. The numbers not crying at the end of a specified period are given
in the table. (There is no information about the extent to which the same infant enters the experiment on a number
of days so we will treat responses on different days as independent.)
a. Pool the data from the different days into a single 2 × 2 contingency table and test the hypothesis that the
probability of crying is the same for rocked and unrocked babies.
b. The analysis in (a) ignores the matching by days. To incorporate this aspect, re-analyse the data using a logistic
model with parameters for days, and control or experimental conditions. How well does it fit the data? Examine
the residuals to see if there are any patterns in the data which are not accounted for by the model. By fitting
a model which ignores the control or experimental effects, test the hypothesis that rocking does not affect the
probability of crying. What is the simplest model which describes the data well?
Crying Babies Data

Day   No. of control babies   No. not crying   No. of experimental babies   No. not crying
1 8 3 1 1
2 6 2 1 1
3 5 1 1 1
4 6 1 1 0
5 5 4 1 1
6 9 4 1 1
7 8 5 1 1
8 8 4 1 1
9 5 3 1 1
10 9 8 1 0
11 6 5 1 1
12 9 8 1 1
13 8 5 1 1
14 5 4 1 1
15 6 4 1 1
16 8 7 1 1
17 6 4 1 0
18 8 5 1 1
Problem 2.4.
A study was carried out to investigate the effect of a new drug in reducing operative mortality following major
abdominal surgery. Patients were assigned to treatment or control groups within each of three categories of surgical
risk: high, medium and low. The results were as follows:
Surgical Risk
Treatment Outcome Low Medium High
Control Died 3 7 6
Survived 12 6 3
Treated Died 2 3 2
Survived 10 10 8
Denote by π the probability of death following surgery. Let x1 = 1 for subjects on treatment, with x1 = 0
otherwise, x2 = 1 for subjects at medium risk, with x2 = 0 otherwise, x3 = 1 for subjects at high risk, with
x3 = 0 otherwise, x12 = x1 x2 and x13 = x1 x3 , where x12 and x13 are the interaction terms for treatment and
risk group covariates. Express each of the following hypotheses as a logistic regression model for π in terms of
the appropriate linear predictor and determine the degrees of freedom for the deviance statistic. Do not impose any
additional constraints besides those explicitly stated in words and define any additional covariates you require.
a. H1 : There is no association between treatment and outcome in any category of risk.

b. H2 : The log odds ratios associating treatment and outcome are constant across categories of surgical risk.
c. H3 : The log odds ratios associating treatment and outcome are linearly related to a continuous covariate, x4 ,
which expresses surgical risk.

d. H4 : The logit transform of the π is linearly related to x4 , with a common slope for treatment and control groups.
e. Fit a logistic model to the data above using a model that contains one factor at two levels for treatment status
and one factor at three levels for surgical risk.
a) Compute the deviance residual for the cell consisting of control patients at the low level of surgical risk.
b) Based on the available information, does this model appear to provide a reasonable fit?
c) Carefully explain the findings regarding the effect of treatment. Incorporate an estimate of the odds ratio, a
95% confidence interval, and a statement regarding the overall significance of this finding.
d) Construct a 95 % confidence interval for the probability of surgical death for a control patient with a medium
level of surgical risk based on this model.
Problem 2.5.
Consider a study of the occurrence of infection following birth by Caesarian section. Let y = 1 denote the occurrence
of an infection, and y = 0 denote the absence of an infection. The investigators are interested in learning about risk
factors for infection and have identified three potential explanatory variables: i) xi1 = 1 if the Caesarian was planned
and xi1 = 0 otherwise, ii) xi2 = 1 if any risk factors were present (such as diabetic mother, obese mother, etc.), and
xi2 = 0 otherwise, and iii) xi3 = 1 if any antibiotics were given as a preventative measure, and xi3 = 0 otherwise.
The data are summarized as follows :
                                    Caesarian Planned           Caesarian Not Planned
                                    Infection   No Infection    Infection   No Infection
Antibiotics      Risk Factors           1           17             11           87
Antibiotics      No Risk Factors        0            2              0            0
No Antibiotics   Risk Factors          28           30             23            3
No Antibiotics   No Risk Factors        8           32              0            9
a. Fit the logistic regression model containing only the main effects for the three explanatory variables.
b. Based on the fitted model above construct a 95 % confidence interval for the probability of infection when the
Caesarian was not planned, no risk factors are present, and no antibiotics have been prescribed.
c. Construct a 95% confidence interval for the odds ratio reflecting the change in risk of infection for a woman on
antibiotics with risk factors undergoing an unplanned Caesarian versus a woman not on antibiotics without risk
factors undergoing a planned Caesarian.
Problem 2.6.
The following table gives the analgesic effect of a treatment on patients with neuralgia. The response variable is given
in the last column and indicates whether there was pain or not. The duration variable represents the duration of the
complaint (in months) before treatment began.

Treatment Age Sex Duration Pain

Y 76 M 36 Y
Y 52 M 22 Y
N 80 F 33 N
Y 77 M 33 N
Y 73 F 17 N
N 82 F 84 N
Y 71 M 24 N
N 78 F 96 N
Y 83 F 61 Y
Y 75 F 60 Y
N 62 M 8 N
N 74 F 35 N
Y 78 F 3 Y
Y 70 F 27 Y
N 72 M 60 N
Y 71 F 8 Y
N 74 F 5 N
N 81 F 26 Y
a. Fit a simple logistic regression model to investigate the effect of treatment on the occurrence of pain, and interpret
the result. Report significance levels for any tests you carry out, as well as a point estimate and a 95 % confidence
interval for the corresponding odds ratio relating treatment to pain relief.
b. Do any of the other factors, taken in isolation, appear to have an effect on the occurrence of pain ?
c. Now consider models for simultaneously assessing the effect of these explanatory variables. Find the model you
feel adequately describes the variation in the response and interpret the effect of the explanatory variables as
above.
Problem 2.7.
The table below shows the number of strains of Staphylococcus aureus isolated from patients of a certain hospital
for each of the years 1947 to 1950. The strains were further classified as resistant or sensitive to 1 unit per ml of
penicillin.
1947 1948 1949 1950 Total

Resistant 45 41 113 53 252
Sensitive 194 145 185 107 631
Total 239 186 298 160 883
Proportion 0.19 0.22 0.38 0.33
The proportion of resistant strains appears to increase during the four year period and it would be useful to, i)
assess whether or not there are significant differences in the proportion of resistant strains over the four years, and
ii) try to characterize the nature of any apparent trend. You do not have to fit the models to answer the following
questions.

a. Suppose that X is a binary explanatory variable and Y is a binary response variable. Furthermore, suppose
P (X = 1|Y = 1) = γ1 , P (X = 0|Y = 1) = 1 − γ1 , P (X = 1|Y = 0) = γ0 , P (X = 0|Y = 0) = 1 − γ0 ,
P (Y = 1) = ζ and P (Y = 0) = 1 − ζ. Of interest is a model for P (Y = 1|X = x), the conditional probability of
the response given the explanatory variable.
i. Derive the expressions for P (Y = 1|X = 1) and P (Y = 1|X = 0). [2 marks]
ii. Consider two odds:
$$\frac{P(Y = 1|X = 1)}{1 - P(Y = 1|X = 1)} \quad \text{and} \quad \frac{P(Y = 1|X = 0)}{1 - P(Y = 1|X = 0)}.$$
Show that the first odds divided by the second is the same as an odds ratio given by dividing [2 marks]
$$\frac{P(X = 1|Y = 1)}{1 - P(X = 1|Y = 1)} \quad \text{by} \quad \frac{P(X = 1|Y = 0)}{1 - P(X = 1|Y = 0)}.$$
b. Let πi be the parameter representing the proportion of resistant strains for year 1946+i, i = 1, 2, 3, 4. Define
appropriate indicator variables and give the expression for the relevant saturated logistic regression model.
c. a) Provide a reduced model that you would use in combination with the saturated model to test the hypothesis
that there is no difference in the proportion of resistant strains over the four year period.
b) What are the degrees of freedom of this test?
d. Suppose you were interested in investigating the presence of a linear trend on the logit scale.
a) Write down the corresponding logistic model, and be sure to define any additional covariates you might
require.
b) What two models would you compare to assess the presence of a linear trend, and what would be the degrees
of freedom for this test ?
e. What other models would you consider for these data ?
Problem 2.8.
One design for longitudinal studies is to follow initially healthy individuals for a period of τ years and then determine
those who are still healthy, and those who have developed the disease in that period of time. At the beginning of the
period, information on suspected risk factors for the disease is obtained for each individual. Let xi be a p × 1 vector
of risk factors for the ith individual, and define
$$y_i = \begin{cases} 1, & \text{if the individual develops the disease} \\ 0, & \text{otherwise.} \end{cases}$$
It is of interest to model P (Yi = 1|xi ), the conditional probability of developing the disease given risk factors
xi . Suppose that given Yi , xi is multivariate normal with f1 (xi |Yi = 1) ≡ M V Np (µ1 , Σ) and f0 (xi |Yi = 0) ≡
M V Np (µ0 , Σ), where µ1 , µ0 and Σ are unknown. Let π1 be the unconditional probability of developing the disease
and π0 = 1 − π1 .
a. Obtain an expression for P (Yi = 1|xi ).

b. Identify the form of this expression.
Problem 2.9.
Does growing up in a right-handed world exert an overwhelming bias toward right-handedness? Data relevant to this
question were painstakingly gathered by Carter-Saltzman (1980) and are displayed in Table 1. Eight hundred and
eight children were sampled and 408 of these were adopted and 400 were not adopted. The handedness of the child

and both parents was determined. If handedness is mostly biologically determined, one would expect to find more
right-handed children among those with right-handed parents, but for adopted children, one would expect to find
the same proportions of right-handed children regardless of the handedness of the parent. Note that in this study
there were no couples in which both parents were left-handed.
Table 1. Handedness of Children and their Parents

Father Mother Biological Parents Adoptive Parents
Right Right 300/340 308/355
Right Left 29/38 12/16
Left Right 16/22 35/37
Note: Each entry gives the number of right-handed children (left of the slash) and the total number of children
(right of the slash) (Carter-Saltzman, 1980).
a. Test whether or not the handedness of the child is related to that of the biological parents.
b. Test whether or not the handedness of the child is related to that of the adoptive parents.
Problem 2.10.
A sample of 146 five-year-old children had their teeth examined and those with decayed, missing, or filled teeth
(dmft) were noted. From their address it was also determined whether their drinking water was fluoridated. Finally,
on the basis of various sociological measurements, their social class was designated as I, II, III, or unclassifiable. The
data are given in Table 2.
Table 2. Fluoridation and Dental Health
Class I Class II Class III Unclassified Total
Fluoridated 12/117 26/170 11/52 24/118 73/457
Unfluoridated 12/56 48/146 29/64 49/104 138/370
Note: Each entry gives the number of children with dmft from the total number of children in that cell (Carmichael
et al., 1989).
a. Test the hypothesis that the effect of fluoridation does not depend on social class.
b. Assuming a uniform effect of fluoridation across the different social classes, give a 95% interval for the odds ratio
reflecting the effect of fluoridation and provide a careful interpretation.
c. Summarize the effect of social class, if any.
Problem 2.11.
a. Derive the form of the residual deviance for a logistic regression model. [4 marks]
b. The data in the following table were obtained from the Florida Department of Highway Safety and Motor Vehicles
(Bishop et al., 1975). They summarize the outcomes of car accidents and the status of the driver of the car with
respect to seat-belt use and whether they were ejected from the car during the accident.
Injury
Seatbelt Ejected Non-fatal Fatal
Used Yes 1105 14
No 411,111 483
Not Used Yes 4624 497
No 157,342 1008

Attached is some Splus code and output.

Is there evidence of a need for the interaction term between seat-belt use and ejection in the logistic model for
the probability the injury was fatal? Explain why you feel this term is or is not needed. [2 marks]
c. Select and justify your choice of the most appropriate model and use it to make conclusions regarding the effects
of seat-belt use and ejection status on the fatality of the accidents. Use odds ratios and associated 95% confidence
intervals to help in your interpretations. [4 marks]
d. STAT 831 ONLY
Compute the deviance residuals for the third and fourth observations in the logistic regression model including
only seat-belt use (stored in the object fit2). [2 marks]
Problem 2.12.
In a biomedical study of an immuno-activating ability of two agents TNF (tumor necrosis factor) and IFN (interferon),
to induce cell differentiation, the number of cells that exhibit markers of differentiation after exposure to TNF and/or
IFN was recorded. At each of the 16 dose combinations of TNF/IFN, 200 cells were examined. The number (y) of
cells differentiating in one trial and the corresponding dose levels of the two factors are given in the table below.
Number (y) of
Differentiating Cells Dose of TNF (U/ml) Dose of IFN (U/ml)
11 0 0
18 0 4
20 0 20
39 0 100
22 1 0
38 1 4
52 1 20
69 1 100
31 10 0
68 10 4
69 10 20
128 10 100
102 100 0
171 100 4
180 100 20
193 100 100
a. Define continuous explanatory variables xi1 and xi2 based on the TNF and IFN doses respectively. Further
define the interaction term xi3 = xi1 × xi2 . Find the logistic model you feel is most appropriate for describing
the distribution of the responses, and carefully interpret the effects of the covariates.
b. Construct plots based on the deviance residuals. What do you conclude from these plots ?
c. Find the best fitting model using the complementary log-log link and interpret your results.
d. Find the best fitting model using the probit link and interpret your results. No confidence intervals are required
for these analyses.
Problem 2.13.

The following data are from a study of the effects of viruses on chicken eggs. Eggs were injected with various dilutions
of a virus and were monitored daily up to day 18 after injection. At the end of the study, the eggs were classified
into three groups:
a. Those that died.

b. Those which are alive and deformed.
c. Those which are alive and normal.
                Number                    Alive
Concentration   of Eggs   Dead   Deformed   Not Deformed
      18.8        16        4        1           11
     232.5        19        8        8            3
    3468.0        17       10        6            1
   51680.0        19       17        2            0
Write the data in the form
Dose x1 · · · xi · · · xn
No. at risk m1 · · · mi · · · mn
Dead y11 · · · yi1 · · · yn1
(Alive) Deformed y12 · · · yi2 · · · yn2
(Alive) Not Deformed y13 · · · yi3 · · · yn3
Assume that for the ith dose there are probabilities πi1 , πi2 , πi3 with πi1 + πi2 + πi3 = 1 such that the numbers
yij (j = 1, 2, 3) follow the multinomial distribution with mi trials and parameters (πi1 , πi2 , πi3 ). Consider a model
which has two parts:
log {πi1 /(πi2 + πi3 )} = α1 + βxi

log {πi2 /πi3 } = α2 + βxi
a. Explain what each part of the above model is fitting.

b. Show that the likelihood for the data based on the multinomial distribution and the above model is equivalent
to a likelihood based on 2 binomial distributions. Given this, explain how one could fit the model in Splus using
the glm function.
c. Fit the above model to these data using Splus and comment on the significance of any terms in the model.
Interpret the results.
Problem 2.14.
If x is the concentration of a toxin, call z = log(x) the dose. Subjects exposed to this toxin are observed to respond
or not respond to it, leading to a binary response. If a sample of m subjects is exposed to the toxin at dose z, let y
denote the number that respond where y ∼ Bin(m, π(z)).
Suppose the tolerance distribution is normal, N (µ, σ 2 ). This implies the probit link is appropriate and the model
is
Φ(β0 + β1 zi ) = πi i = 1, 2, . . . , n.
a. Express µ and σ in terms of β0 and β1 .

b. Consider the following data representing a dose-response experiment in which insects were exposed to varying
doses of ammonia.

Log concentration (mg/l)   Number of insects   Number responding
           z                       m                   y
         0.72                     58                   3
         0.80                     61                  19
         0.87                     63                  16
         0.93                     59                  37
         0.98                     57                  49
         1.02                     55                  54
         1.07                     57                  55
         1.10                     61                  60
Fit the probit model above. Is this a good fit?

c. From the model in (b), estimate the LD50. (The median lethal dose, commonly denoted LD50, is the dose required
to kill half of the subjects.)
Problem 2.15.
One of the earliest bioassay experiments was designed to study the reaction of mice to varying concentrations of
insulin (0.001 IU). After exposure to the assigned concentration of insulin, the mice were observed for the occurrence
of convulsions. The data are summarized in Table 1.
Let mi denote the number of mice exposed in the ith sample, and yi the number convulsing, i = 1, . . . , 14.
Suppose Yi ∼ Bin(mi , πi ) are independently distributed, where πi is the probability of a mouse convulsing in sample
i, i = 1, . . . , 14. Suppose xi1 denotes the dose (log concentration) of insulin in sample i, and xi2 = 1 if the insulin is
prepared using the test preparation in sample i and xi2 = 0 otherwise, i = 1, . . . , 14.
Number of mice with convulsions exposed to different concentrations (0.001 IU) of insulin
under two different preparations (Finney, 1978)
      Standard preparation                 Test preparation
Conc.   With convulsions   Total     Conc.   With convulsions   Total
 3.4           0             33        6.5           2             40
 5.2           5             32       10.0          10             30
 7.0          11             38       14.0          18             40
 8.5          14             37       21.5          21             35
10.5          18             40       29.0          27             37
13.0          21             37
18.0          23             31
21.0          30             37
28.0          27             30
a. It is common to assume that each mouse has an inherent tolerance to insulin exposure and that doses above this
tolerance will cause it to convulse, while doses below this tolerance will not cause it to convulse. The distribution
characterizing the variation in this latent tolerance in the population is called the tolerance distribution. Suppose
there are two different Gaussian tolerance distributions, one for the mice exposed to insulin under the standard
preparation (say N(µ_S, σ_S^2)), and one for the mice exposed to insulin under the test preparation (say N(µ_T, σ_T^2)).
If we fit a dose-response model to data of this form, which link function is implied for a binary regression model?
Note: If Z ∼ N(µ, σ^2) then
$$ f(z; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\tfrac{1}{2}(z-\mu)^2/\sigma^2\right\}, \qquad -\infty < z < \infty $$
b. If the fitted dose-response model involved a linear predictor
ηi = β0 + β1 xi1 + β2 xi2 + β3 xi1 xi2 ,
give an expression for each of the regression coefficients in terms of the parameters of the two tolerance distributions.
c. Some Splus output is attached which reports the results of fitting several models to these data. Justify your
choice of the model you feel is most appropriate for describing the distribution of responses as a function of dose
and method of preparation. Report your conclusions regarding the impact of insulin exposure and method of
preparation on the probability that convulsions are experienced.
d. Let δS and δT denote the doses at which 50% of the population will respond under the standard and test
preparations respectively. Derive an expression for δS − δT in terms of the regression coefficients of your final
model.
Problem 2.16.
Montgomery and Peck (1982) describe a study on the compressive strength of an alloy fastener used in the construction
of aircraft. Ten pressure loads, increasing in units of 200 psi from 2500 psi to 4300 psi, were used with different numbers
of fasteners being tested at each of these loads. The data in the table below refer to the number of fasteners failing
out of the number tested at each load.
Number of fasteners failing out of the number subjected to varying pressure loads
Load Sample size Number failing Proportion
2500 50 10 0.20
2700 70 17 0.24
2900 100 30 0.30
3100 60 21 0.35
3300 40 18 0.45
3500 85 43 0.51
3700 90 54 0.60
3900 50 33 0.66
4100 80 60 0.75
4300 65 51 0.78
Attached is some Splus code and output for the following questions. Pick one of the two models (fit1 or fit2) to
answer the following questions.

a. Estimate the effect of a 500 psi increase in the load on the odds of failure of a fastener. In your answer,
report an odds ratio and associated 95% confidence interval. [4 marks]
b. The maximum load per fastener under normal operation is 400 psi. Give an estimate and 95% confidence interval
for the probability of failure at this load. [4 marks]
c. Obtain the fitted value for the load of 2500 psi. If you had to estimate the probability of failure for a load of this
magnitude, would you report the raw proportion or the fitted value? Justify your answer based on the information
provided. [4 marks]

3
Log-linear Models
3.1 The Poisson Distribution
3.1.1 Poisson Regression Models for Counts
Suppose Y is a random variable with a Poisson distribution with mean µ. We denote this as Y ∼ Poisson(µ),
where
$$ f(y) = \frac{\mu^{y} e^{-\mu}}{y!}, \qquad y = 0, 1, 2, \ldots $$
Furthermore, E(Y) = µ and Var(Y) = µ, which defines the mean-variance relationship. For a single observation we can write
this in the exponential family form
ℓ(µ, y) = y log µ − µ − log y!
and identify θ = log µ, φ = 1, a(φ) = 1, b(θ) = e^θ, and c(y; φ) = − log y!. The mean is given in terms of these
functions as b′(θ) = e^θ = µ, and the variance is b′′(θ)a(φ) = e^θ · 1 = µ, so we retrieve what we would expect. As in the
binomial case, we can allow for possible overdispersion (extra Poisson variation) by introducing another parameter
(φ). We will return to this later.
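As a quick numerical check of these identities, the sketch below evaluates b′(θ) and b′′(θ)a(φ) for the Poisson case and compares them with the mean and variance obtained by summing the probability function directly. The value µ = 3.5 and the truncation point of the sums are arbitrary choices made for illustration; they do not come from the notes.

```python
import math

def poisson_pmf(y, mu):
    """Poisson probability function, computed on the log scale for stability."""
    return math.exp(y * math.log(mu) - mu - math.lgamma(y + 1))

mu = 3.5                          # illustrative mean (an assumption)
theta = math.log(mu)              # canonical parameter
b_prime = math.exp(theta)         # b'(theta)  = e^theta
b_double_prime = math.exp(theta)  # b''(theta) = e^theta, with a(phi) = 1

# Moments by direct summation; terms beyond y = 200 are negligible for this mu.
support = range(201)
mean = sum(y * poisson_pmf(y, mu) for y in support)
variance = sum((y - mean) ** 2 * poisson_pmf(y, mu) for y in support)

print(b_prime, mean)
print(b_double_prime, variance)
```

Both printed pairs should agree up to rounding error, which is the statement E(Y) = b′(θ) and Var(Y) = b′′(θ)a(φ) for this family.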
For a sample of n independent observations with Yi ∼ Poisson(µi ), i = 1, 2, ..., n, we denote the data as
y = (y1 , y2 , ..., yn )′ . The vector of corresponding means is µ = (µ1 , µ2 , ..., µn )′ . Then
$$ \Pr(Y_i = y_i,\ i = 1, \ldots, n;\ \mu) = \prod_{i=1}^{n} \frac{\mu_i^{y_i} e^{-\mu_i}}{y_i!} = \prod_{i=1}^{n} \mu_i^{y_i} \, \exp\Big(-\sum_{i=1}^{n} \mu_i\Big) \Big/ \prod_{i=1}^{n} y_i! $$
which gives a log-likelihood
$$ \ell(\mu; y) = \sum_{i=1}^{n} \left[ y_i \log \mu_i - \mu_i - \log y_i! \right] $$
If there is a column vector of covariates xi = (xi1 , xi2 , ..., xi,p−1 )′ associated with observation yi , together with xi0 = 1 for the
intercept, let log µi = ηi = x′i β = Σ_{j=0}^{p−1} xij βj . To obtain a log-likelihood in terms of β, substitute µi = exp(x′i β) as follows,
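$$\begin{aligned}
\ell(\beta; y) &= \sum_{i=1}^{n} \left[ y_i \log \mu_i - \mu_i - \log y_i! \right] \\
&= \sum_{i=1}^{n} \Big[ y_i \sum_{j=0}^{p-1} x_{ij}\beta_j - \exp\Big(\sum_{j=0}^{p-1} x_{ij}\beta_j\Big) - \log y_i! \Big] \\
&= \sum_{j=0}^{p-1} \Big[ \sum_{i=1}^{n} y_i x_{ij} \Big] \beta_j - \sum_{i=1}^{n} \Big[ \exp\Big\{\sum_{j=0}^{p-1} x_{ij}\beta_j\Big\} + \log y_i! \Big]
\end{aligned}$$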
Hence $T_j = \sum_{i=1}^{n} y_i x_{ij}$ is sufficient for βj by the factorization theorem.
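Differentiating the log-likelihood with respect to βj makes the role of Tj explicit. This step is not written out in the notes; it follows directly from the expression for ℓ(β; y) above.

```latex
\frac{\partial \ell}{\partial \beta_j}
  = \sum_{i=1}^{n} y_i x_{ij}
    - \sum_{i=1}^{n} x_{ij} \exp\Big( \sum_{k=0}^{p-1} x_{ik}\beta_k \Big)
  = T_j - \sum_{i=1}^{n} x_{ij}\,\mu_i ,
  \qquad j = 0, 1, \ldots, p-1 .
```

Setting these derivatives to zero shows that the maximum likelihood estimate matches each observed Tj to its fitted counterpart Σ xij µ̂i.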
3.1.2 Regression for Poisson Processes Observed over (0, τ ]
Frequently we are interested in modelling data from Poisson processes. A counting process N (t) is any non-decreasing
integer function of time such that N (0) = 0 and N (t) is the number of events happening in (0, t]. A counting process
is a Poisson process if it has independent increments (i.e. N (s1 + t1 ) − N (s1 ) is independent of N (s2 + t2 ) − N (s2 )
whenever the intervals (s1 , s1 + t1 ] and (s2 , s2 + t2 ] do not overlap) and the number of events occurring over (0, t] is distributed as
$$ P(N(t) = n; \lambda) = \frac{(\lambda t)^{n} e^{-\lambda t}}{n!}, \qquad n = 0, 1, 2, \ldots $$
In this case E(N (t)) = µ(t) = λt, and N (t) is a special kind of Poisson random variable. We therefore use the log-link
if we are interested in establishing regression methods. However,
log µ(t) = log λt = log λ + log t.
When we set up this regression model we have Ni (ti ), i = 1, 2, ..., n corresponding to the event counts for each
subject where the ith subject is observed over (0, ti ]. As before, we let xi denote a p × 1 column vector of explanatory
variables for this subject and hence
$$ \log \mu_i(t_i) = \log \lambda_i + \log t_i = x_i'\beta + \log t_i . $$
The term log ti is called an offset term. It “explains” some variation in the event counts across subjects, but does so
in a deterministic way.
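To see what the offset does in the simplest case, consider a model with an intercept only, log µi (ti ) = β0 + log ti . The score equation Σ(yi − ti exp(β0 )) = 0 then has the closed-form solution λ̂ = Σ yi / Σ ti , that is, total events divided by total exposure. The sketch below checks this numerically. The event counts and observation times are made-up values used only for illustration.

```python
import math

# Hypothetical event counts and observation times (illustrative values only).
counts = [2, 0, 5, 1]
times = [10.0, 4.0, 30.0, 6.0]

def score(beta0, counts, times):
    """Derivative of the Poisson log-likelihood in beta0 when log(t_i) is an offset."""
    return sum(y - t * math.exp(beta0) for y, t in zip(counts, times))

rate_hat = sum(counts) / sum(times)      # closed-form MLE of lambda
beta0_hat = math.log(rate_hat)
fitted = [t * rate_hat for t in times]   # mu_i = t_i * lambda: exposure scales the mean

print(rate_hat)
print(score(beta0_hat, counts, times))   # zero up to rounding error
print(fitted)
```

The fitted means differ across subjects only because the observation times differ, which is the sense in which the offset explains variation deterministically.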

Application: Ship Damage Incidents
McCullagh and Nelder (1989) discuss the analysis of a data set which records the number of times a certain type of
damage incident occurs in cargo ships. The damage is caused by waves and occurs in the forward section of various
cargo carrying vessels. In order to prevent this type of damage from occurring in the future, the investigators want
to identify risk factors including ship type (A-E), year of construction (1960-1964; 1965-1969; 1970-1974; 1975-1979),
and period of operation (1960-1974; 1975-1979).
The data are summarized below where we have adopted the coding convention that the ship type variable is 1, 2,
3, 4, and 5 for ship types A, B, C, D, and E respectively. The year of construction variable, cyr, is 1, 2, 3, and 4 for
eras 1960-1964, 1965-1969, 1970-1974, and 1975-1979 respectively, while the year of operation variable, oyr, is 1 for
1960-1974 and 2 for 1975-1979.
type cyr oyr months y
1 1 1 127 0
1 1 2 63 0
1 2 1 1095 3
1 2 2 1095 4
1 3 1 1512 6
1 3 2 3353 18
1 4 2 2244 11
2 1 1 44882 39
2 1 2 17176 29
2 2 1 28609 58
2 2 2 20370 53
2 3 1 7064 12
2 3 2 13099 44
2 4 2 7117 18
3 1 1 1179 1
3 1 2 552 1
3 2 1 781 0
3 2 2 676 1
3 3 1 783 6
3 3 2 1948 2
3 4 2 274 1
4 1 1 251 0
4 1 2 105 0
4 2 1 288 0
4 2 2 192 0
4 3 1 349 2
4 3 2 1208 11
4 4 2 2051 4
5 1 1 45 0
5 2 1 789 7
5 2 2 437 7
5 3 1 1157 5
5 3 2 2161 12
5 4 2 542 1
The program used to analyse the data is below.
ship.dat_read.table("ship.dat", header=T)
ship.dat$typef_factor(ship.dat$type)
ship.dat$typeft_C(ship.dat$typef,treatment)
ship.dat$cyrf_factor(ship.dat$cyr)
ship.dat$cyrft_C(ship.dat$cyrf,treatment)
ship.dat$oyrf_factor(ship.dat$oyr)
ship.dat$oyrft_C(ship.dat$oyrf,treatment)
ship.dat

# fitting the main effects with the offset term

model1_glm(y ~ typeft + cyrft + oyrft +offset(log(months)), family=poisson,
data=ship.dat)
summary(model1)
# fitting all main effects (treating offset as a covariate for diagnostics)

model2_glm(y ~ typeft + cyrft + oyrft +log(months), family=poisson,
data=ship.dat)
summary(model2)
# testing for the association between ship type and frequency of events
model3a_glm(y ~ cyrft + oyrft + offset(log(months)), family=poisson,
data=ship.dat)
model3a$deviance
model3a$df
1-pchisq(model3a$deviance-model1$deviance,model3a$df-model1$df)
# testing for the association between year of construction and event frequency
model3b_glm(y ~ typeft + oyrft + offset(log(months)), family=poisson,
data=ship.dat)
model3b$deviance
model3b$df
1-pchisq(model3b$deviance-model1$deviance,model3b$df-model1$df)
# testing for the association between year of operation and event frequency
model3c_glm(y ~ typeft + cyrft + offset(log(months)), family=poisson,
data=ship.dat)
model3c$deviance
model3c$df
1-pchisq(model3c$deviance-model1$deviance,model3c$df-model1$df)
# testing for the interaction between type of ship and year of construction
model4_glm(y ~ typeft + cyrft + oyrft + typeft*cyrft + offset(log(months)),
family=poisson, data=ship.dat)
model4$deviance
model4$df
1-pchisq(model1$deviance-model4$deviance,model1$df-model4$df)
summary(model4)
mrho_summary(model4)$correlation
mrho
# testing for the interaction between type of ship and year of operation
model5_glm(y ~ typeft + cyrft + oyrft + typeft*oyrft + offset(log(months)),
family=poisson, data=ship.dat)
1-pchisq(model1$deviance-model5$deviance,model1$df-model5$df)
# testing for the interaction between year of construction and operation
model6_glm(y ~ typeft + cyrft + oyrft + cyrft*oyrft + offset(log(months)),
family=poisson, data=ship.dat)
1-pchisq(model1$deviance-model6$deviance,model1$df-model6$df)
motif()
ship.dat$rdeviance_residuals.glm(model1,type="deviance")
plot(model1$fitted.values,ship.dat$rdeviance,ylim=c(-4,4),
xlab="FITTED VALUES",ylab="DEVIANCE RESIDUALS")
abline(h=-2)
abline(h= 2)
A selection of the output is provided.

> model1_glm(y ~ typeft + cyrft + oyrft +offset(log(months)), family=poisson,
data=ship.dat)
> model1$deviance
> summary(model1)

Call: glm(formula = y ~ typeft + cyrft + oyrft + offset(log(months)),

family = poisson, data = ship.dat)
Deviance Residuals:
-1.676867 -0.8292956 -0.4370106 0.505821 2.791171
Coefficients:
(Intercept) -6.4059016 0.2174315 -29.461700
typeft2 -0.5433443 0.1775869 -3.059596
typeft3 -0.6873773 0.3281646 -2.094611
typeft4 -0.0759614 0.2905555 -0.261435
typeft5 0.3255795 0.2358758 1.380300
cyrft2 0.6971404 0.1496286 4.659139
cyrft3 0.8184266 0.1697504 4.821354
cyrft4 0.4534266 0.2331490 1.944793
oyrft 0.3844670 0.1182584 3.251077
(Dispersion Parameter for Poisson family taken to be 1 )

(Intercept) typeft2 typeft3 typeft4 typeft5 cyrft2
typeft2 -0.8114317
typeft3 -0.3794021 0.4343318
typeft4 -0.3706315 0.4468646 0.2381810
typeft5 -0.4699103 0.5706801 0.3144642 0.3338068
cyrft2 -0.4842330 0.0855953 0.0359326 0.0276773 -0.0040845
cyrft3 -0.5500980 0.2713808 0.0456319 0.0285842 -0.0370840 0.6334918
cyrft4 -0.4015248 0.2284334 0.0973273 -0.0965854 0.0527938 0.4755087
oyrft -0.2161077 0.0253826 -0.0030794 -0.0047295 0.0269346 -0.1200544
cyrft3 cyrft4
typeft2
typeft3
typeft4
typeft5
cyrft2
cyrft3
cyrft4 0.5482314
oyrft -0.2635932 -0.3153428
> model2_glm(y ~ typeft + cyrft + oyrft +log(months), family=poisson,

data=ship.dat)
> summary(model2)
Call: glm(formula = y ~ typeft + cyrft + oyrft + log(months),

Deviance Residuals:
-1.657977 -0.8938599 -0.4900228 0.4675645 2.743528
Coefficients:
(Intercept) -5.5939581 0.8721424 -6.4140418
typeft2 -0.3498652 0.2701854 -1.2949080
typeft3 -0.7630863 0.3375247 -2.2608311
typeft4 -0.1354756 0.2970846 -0.4560169
typeft5 0.2739282 0.2417527 1.1330923
cyrft2 0.6625407 0.1536194 4.3128721
cyrft3 0.7597379 0.1776244 4.2772168
cyrft4 0.3697408 0.2458086 1.5041821
oyrft 0.3702568 0.1181308 3.1342948
log(months) 0.9027077 0.1017761 8.8695429


Note that the coefficient for the log(months) variable is 0.90 with standard error 0.10, so it is not significantly different from 1 (z = (0.903 − 1)/0.102 ≈ −0.96). This means that the
data are not inconsistent with the time-homogeneous model in this regard. Now we consider various nested sub-models of
model 1 to see if any of the main effects are not significant. Since the main effects are generally factor variables with many
levels, we need to carry out such tests based on the change in the deviance.
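The comparison of the log(months) coefficient with 1 is a Wald test and can be reproduced from the two numbers in the model 2 output. The sketch below uses only the estimate and standard error printed above; the 1.96 multiplier for an approximate 95% interval is the usual normal quantile.

```python
import math

estimate = 0.9027077      # coefficient of log(months) in model 2
std_error = 0.1017761     # its asymptotic standard error
null_value = 1.0          # value implied by treating log(months) as an offset

z = (estimate - null_value) / std_error
# Two-sided p-value from the standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
p_value = math.erfc(abs(z) / math.sqrt(2.0))

# Approximate 95% Wald interval for the coefficient.
lower = estimate - 1.96 * std_error
upper = estimate + 1.96 * std_error

print(round(z, 3), round(p_value, 3), round(lower, 3), round(upper, 3))
```

The interval contains 1, which agrees with the conclusion that the offset formulation is reasonable for these data.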
> model3a_glm(y ~ cyrft + oyrft + offset(log(months)), family=poisson,
data=ship.dat)
> model3a$deviance
[1] 62.36534
> model3a$df
[1] 29
> 1-pchisq(model3a$deviance-model1$deviance,model3a$df-model1$df)
[1] 9.299568e-05
The p−value for the likelihood ratio test of the hypothesis that there is no variation in the accident rate across ships of
different types is extremely small. This is strong evidence of a need to adjust for the difference in the accident rates between
ship types.
> model3b_glm(y ~ typeft + oyrft + offset(log(months)), family=poisson,
data=ship.dat)
> model3b$deviance
[1] 70.10294
> model3b$df
[1] 28
> 1-pchisq(model3b$deviance-model1$deviance,model3b$df-model1$df)
[1] 6.974977e-07
Again the p−value is extremely small suggesting there is strong evidence of a need to adjust for the difference in the accident
rates for ships manufactured in different eras.
> model3c_glm(y ~ typeft + cyrft + offset(log(months)), family=poisson,
data=ship.dat)
> model3c$deviance
[1] 49.35519
> model3c$df
[1] 26
> 1-pchisq(model3c$deviance-model1$deviance,model3c$df-model1$df)
[1] 0.001094692
The p−value is very small indicating strong evidence of a need to adjust for the difference in accident rates for ships used in
different eras.
> model4_glm(y ~ typeft + cyrft + oyrft + typeft*cyrft + offset(log(months)),
> model4$deviance
[1] 14.58767
> model4$df
[1] 13
> 1-pchisq(model1$deviance-model4$deviance,model1$df-model4$df)
[1] 0.01966758
> summary(model4)
Call: glm(formula = y ~ typeft + cyrft + oyrft + typeft * cyrft + offset(log(months)),

Deviance Residuals:
-1.99643 -0.09176195 -0.009573433 0.1384936 2.538268
Coefficients:
(Intercept) -14.4765415 56.9623361 -0.254142342
typeft2 7.5380111 56.9624437 0.132333001

typeft3 7.5736662 56.9667029 0.132949000

typeft4 -0.5963653 80.2258094 -0.007433584
typeft5 0.8801175 99.0523768 0.008885375
cyrft2 8.5198530 56.9635718 0.149566693
cyrft3 8.8843363 56.9626945 0.155967628
cyrft4 8.7733762 56.9631535 0.154018442
oyrft 0.3850453 0.1186346 3.245641702
typeft2cyrft2 -7.8493674 56.9637763 -0.137795769
typeft3cyrft2 -9.0982541 56.9767330 -0.159683675
typeft4cyrft2 -8.8894833 98.5878643 -0.090168129
typeft5cyrft2 0.4493244 99.0534513 0.004536181
typeft2cyrft3 -8.0983620 56.9629658 -0.142168897
typeft3cyrft3 -8.1033885 56.9681658 -0.142244154
typeft4cyrft3 1.0922871 80.2265495 0.013615033
typeft5cyrft3 -0.8287100 99.0528822 -0.008366339
typeft2cyrft4 -8.1997610 56.9637293 -0.143947053
typeft3cyrft4 -7.8686743 56.9762770 -0.138104395
typeft4cyrft4 -0.3253031 80.2279341 -0.004054736
typeft5cyrft4 -1.8572636 99.0578834 -0.018749276
The p−value from the likelihood ratio test is small suggesting we may want to consider putting the interaction term in the
model for the type of ship and the era of construction. However, if we examine the asymptotic standard errors then we see they
are huge. Examination of the correlation matrix of the regression parameter estimators reveals the model is overparameterized.
> mrho_summary(model4)$correlation
> mrho
(Intercept) typeft2 typeft3 typeft4
(Intercept) 1.0000000000 -0.9999974528 -9.999226e-01 -7.100246e-01
typeft2 -0.9999974528 1.0000000000 9.999207e-01 7.100232e-01
typeft3 -0.9999226000 0.9999206894 1.000000e+00 7.099701e-01
typeft4 -0.7100245778 0.7100231913 7.099701e-01 1.000000e+00
typeft5 -0.5750728851 0.5750714203 5.750284e-01 4.083159e-01
cyrft2 -0.9999772188 0.9999756009 9.999009e-01 7.100091e-01
cyrft3 -0.9999923094 0.9999909571 9.999163e-01 7.100200e-01
cyrft4 -0.9999838227 0.9999828370 9.999082e-01 7.100143e-01
oyrft -0.0008781622 0.0001284504 2.922393e-05 6.050829e-05
typeft2cyrft2 0.9999739057 -0.9999765836 -9.998973e-01 -7.100066e-01
typeft3cyrft2 0.9997465112 -0.9997446560 -9.998240e-01 -7.098451e-01
typeft4cyrft2 0.5777819250 -0.5777808871 -5.777377e-01 -8.137493e-01
typeft5cyrft2 0.5750664934 -0.5750651601 -5.750221e-01 -4.083114e-01
typeft2cyrft3 0.9999882256 -0.9999908253 -9.999115e-01 -7.100167e-01
typeft3cyrft3 0.9998969586 -0.9998950168 -9.999743e-01 -7.099519e-01
typeft4cyrft3 0.7100181198 -0.7100166552 -7.099636e-01 -9.999908e-01
typeft5cyrft3 0.5750699170 -0.5750684812 -5.750254e-01 -4.083138e-01
typeft2cyrft4 0.9999748840 -0.9999774311 -9.998981e-01 -7.100072e-01
typeft3cyrft4 0.9997545752 -0.9997526650 -9.998320e-01 -7.098508e-01
typeft4cyrft4 0.7100057744 -0.7100043879 -7.099513e-01 -9.999735e-01
typeft5cyrft4 0.5750409171 -0.5750394524 -5.749964e-01 -4.082932e-01
typeft5 cyrft2 cyrft3 cyrft4
(Intercept) -0.5750728851 -0.9999772188 -0.9999923094 -0.999983823
typeft2 0.5750714203 0.9999756009 0.9999909571 0.999982837
typeft3 0.5750283745 0.9999008726 0.9999162628 0.999908192
typeft4 0.4083158825 0.7100091003 0.7100200145 0.710014264
typeft5 1.0000000000 0.5750597843 0.5750684625 0.575063582
cyrft2 0.5750597843 1.0000000000 0.9999715036 0.999963623
cyrft3 0.5750684625 0.9999715036 1.0000000000 0.999979451
cyrft4 0.5750635820 0.9999636231 0.9999794514 1.000000000
oyrft 0.0005050073 -0.0003612134 -0.0007155276 -0.001204505
typeft2cyrft2 -0.5750578790 -0.9999962961 -0.9999676881 -0.999959653
typeft3cyrft2 -0.5749271106 -0.9997688942 -0.9997402920 -0.999732258
typeft4cyrft2 -0.3322667186 -0.5777949176 -0.5777784038 -0.577773783

typeft5cyrft2 -0.9999890645 -0.5750789283 -0.5750623502 -0.575057555

typeft2cyrft3 -0.5750661140 -0.9999664607 -0.9999946828 -0.999973756
typeft3cyrft3 -0.5750136288 -0.9998751801 -0.9999033952 -0.999882464
typeft4cyrft3 -0.4083121687 -0.7100025133 -0.7100225080 -0.710007589
typeft5cyrft3 -0.9999948782 -0.5750568641 -0.5750729406 -0.575060694
typeft2cyrft4 -0.5750584416 -0.9999530325 -0.9999683884 -0.999988285
typeft3cyrft4 -0.5749317480 -0.9997328515 -0.9997482391 -0.999768180
typeft4cyrft4 -0.4083050691 -0.7099902973 -0.7100012112 -0.710015353
typeft5cyrft4 -0.9999444105 -0.5750278170 -0.5750364947 -0.575047726
(remaining columns omitted)

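Inspection of the data makes it clear why this is the case. The model with the type*cyr interaction is trying to estimate the
effect of construction year separately for each type of ship, but for some combinations every observed count is zero (see ship
type 4 in the data file, which has no incidents for ships built in 1960-1964 or 1965-1969). The corresponding parameters
cannot be estimated reliably, which produces the huge standard errors.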
> model5_glm(y ~ typeft + cyrft + oyrft + typeft*oyrft + offset(log(months)),
[1] 0.2936317
The interaction between ship type and year of operation is not significant (p = 0.29).
> model6_glm(y ~ typeft + cyrft + oyrft + cyrft*oyrft + offset(log(months)),
[1] 0.4091268
The interaction between year of construction and operation is not significant (p = 0.41).
The residual plot looks quite reasonable, so we should be happy making inferences based on these models, in particular model
1.
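Because model 1 uses the log link, each coefficient is a log rate ratio relative to the reference level. The sketch below converts the model 1 estimates and standard errors printed in the output above into rate ratios with approximate 95% Wald intervals. The helper name `rate_ratio` is introduced here for illustration only.

```python
import math

# (estimate, standard error) pairs copied from the model 1 output above.
model1 = {
    "typeft2": (-0.5433443, 0.1775869),
    "typeft3": (-0.6873773, 0.3281646),
    "typeft4": (-0.0759614, 0.2905555),
    "typeft5": (0.3255795, 0.2358758),
    "cyrft2": (0.6971404, 0.1496286),
    "cyrft3": (0.8184266, 0.1697504),
    "cyrft4": (0.4534266, 0.2331490),
    "oyrft": (0.3844670, 0.1182584),
}

def rate_ratio(estimate, std_error, z=1.96):
    """Rate ratio exp(beta) with an approximate 95% Wald interval."""
    return (math.exp(estimate),
            math.exp(estimate - z * std_error),
            math.exp(estimate + z * std_error))

for name, (est, se) in model1.items():
    rr, lower, upper = rate_ratio(est, se)
    print(f"{name:8s} RR = {rr:.2f}  95% CI ({lower:.2f}, {upper:.2f})")
```

For example, the oyrft row gives a rate ratio of about 1.47, so the incident rate in 1975-1979 is estimated to be roughly 47% higher than in 1960-1974 after adjusting for ship type and construction period.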
Fig. 3.1. Deviance Residual Plot for Model 1 of Ship Damage Data (deviance residuals plotted against fitted values; figure not reproduced)
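The main-effects fit can also be reproduced without Splus. The following is a minimal sketch, not the algorithm used by glm itself: it maximizes the Poisson log-likelihood with the log(months) offset by Newton-Raphson (which coincides with iteratively reweighted least squares for the canonical log link), using the rows of the data table above and treatment contrasts. The starting value, the step-halving safeguard, and all function names are choices made here for illustration.

```python
import math

# Rows (type, cyr, oyr, months, y) copied from the ship damage table above.
ROWS = [
    (1, 1, 1, 127, 0), (1, 1, 2, 63, 0), (1, 2, 1, 1095, 3), (1, 2, 2, 1095, 4),
    (1, 3, 1, 1512, 6), (1, 3, 2, 3353, 18), (1, 4, 2, 2244, 11),
    (2, 1, 1, 44882, 39), (2, 1, 2, 17176, 29), (2, 2, 1, 28609, 58),
    (2, 2, 2, 20370, 53), (2, 3, 1, 7064, 12), (2, 3, 2, 13099, 44),
    (2, 4, 2, 7117, 18),
    (3, 1, 1, 1179, 1), (3, 1, 2, 552, 1), (3, 2, 1, 781, 0), (3, 2, 2, 676, 1),
    (3, 3, 1, 783, 6), (3, 3, 2, 1948, 2), (3, 4, 2, 274, 1),
    (4, 1, 1, 251, 0), (4, 1, 2, 105, 0), (4, 2, 1, 288, 0), (4, 2, 2, 192, 0),
    (4, 3, 1, 349, 2), (4, 3, 2, 1208, 11), (4, 4, 2, 2051, 4),
    (5, 1, 1, 45, 0), (5, 2, 1, 789, 7), (5, 2, 2, 437, 7), (5, 3, 1, 1157, 5),
    (5, 3, 2, 2161, 12), (5, 4, 2, 542, 1),
]

def design_row(ship_type, cyr, oyr):
    """Treatment contrasts: intercept, type 2-5, cyr 2-4, oyr 2."""
    return ([1.0] + [float(ship_type == k) for k in (2, 3, 4, 5)]
            + [float(cyr == k) for k in (2, 3, 4)] + [float(oyr == 2)])

X = [design_row(t, c, o) for t, c, o, _, _ in ROWS]
y = [float(row[4]) for row in ROWS]
offset = [math.log(row[3]) for row in ROWS]
P = len(X[0])

def means(beta):
    """mu_i = exp(x_i' beta + log t_i)."""
    return [math.exp(o + sum(a * b for a, b in zip(x, beta))) for x, o in zip(X, offset)]

def loglik(beta):
    """Poisson log-likelihood without the log y! constant."""
    return sum(yi * math.log(mi) - mi for yi, mi in zip(y, means(beta)))

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def newton_fit(max_iter=50, tol=1e-10):
    # Start from the common-rate model: total events over total months.
    beta = [math.log(sum(y) / sum(row[3] for row in ROWS))] + [0.0] * (P - 1)
    for _ in range(max_iter):
        mu = means(beta)
        score = [sum((yi - mi) * x[j] for x, yi, mi in zip(X, y, mu)) for j in range(P)]
        info = [[sum(mi * x[j] * x[k] for x, mi in zip(X, mu)) for k in range(P)]
                for j in range(P)]
        step = solve(info, score)
        scale = 1.0
        new = [b + s for b, s in zip(beta, step)]
        while loglik(new) < loglik(beta) and scale > 1e-8:   # step-halving safeguard
            scale /= 2.0
            new = [b + scale * s for b, s in zip(beta, step)]
        beta = new
        if max(abs(s) for s in step) < tol:
            break
    return beta

def deviance(mu):
    """Poisson deviance; y log(y/mu) is taken as 0 when y = 0."""
    return 2.0 * sum((yi * math.log(yi / mi) if yi > 0 else 0.0) - (yi - mi)
                     for yi, mi in zip(y, mu))

beta_hat = newton_fit()
dev1 = deviance(means(beta_hat))
# Likelihood ratio test for ship type, using the model3a deviance from the output (4 df).
change = 62.36534 - dev1
p_type = math.exp(-change / 2.0) * (1.0 + change / 2.0)   # chi-square upper tail, 4 df
print([round(b, 4) for b in beta_hat])
print(round(dev1, 3), p_type)
```

If the table has been transcribed correctly, the printed coefficients should agree closely with the Splus estimates for model 1, and the p-value should agree with the value reported for the model3a comparison.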

3.2 Analysis of Contingency Tables
3.2.1 Introduction
Contingency tables can be formed to display data when all response variables are categorical. A two-dimensional
contingency table is formed by the cross-classification of two variables and the observations consist of the cell counts
of that contingency table. The basic assumption in a contingency table like this is that each cell frequency has
an independent Poisson distribution with mean µij for the (i, j) cell. When we condition on the relevant “total”
frequencies (i.e. possibly those terms fixed by design) we get a multinomial, product multinomial, or non-central
hypergeometric distribution.
3.2.2 The Multinomial Distribution
Suppose a sample of Y.. = y.. individuals is cross-classified according to two categorical variables V and W with I
and J levels respectively. The data may then be summarized in the following contingency table:
                              FACTOR W
               1     2     3    ···    j    ···    J
          1   y11   y12   y13   ···   y1j   ···   y1J   y1.
          2   y21   y22   y23   ···   y2j   ···   y2J   y2.
          3   y31   y32   y33   ···   y3j   ···   y3J   y3.
FACTOR V  ⋮
          i   yi1   yi2   yi3   ···   yij   ···   yiJ   yi.
          ⋮
          I   yI1   yI2   yI3   ···   yIj   ···   yIJ   yI.
              y.1   y.2   y.3   ···   y.j   ···   y.J   y..
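Recall our initial assumption is that
$$ P(Y_{ij} = y_{ij}) = \frac{\mu_{ij}^{y_{ij}} e^{-\mu_{ij}}}{y_{ij}!} $$
Since the Poisson variates are assumed to be independent,
$$ P(Y_{ij} = y_{ij},\ i = 1, \ldots, I,\ j = 1, \ldots, J) = \prod_{i=1}^{I} \prod_{j=1}^{J} \frac{\mu_{ij}^{y_{ij}} e^{-\mu_{ij}}}{y_{ij}!} $$
We must condition on Y.. = y.. since this is fixed by the design, and we know
$$ \Pr(Y_{\cdot\cdot} = y_{\cdot\cdot}) = \frac{\mu_{\cdot\cdot}^{y_{\cdot\cdot}} e^{-\mu_{\cdot\cdot}}}{y_{\cdot\cdot}!} $$
where $\mu_{\cdot\cdot} = \sum_i \sum_j \mu_{ij}$. Therefore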

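$$\begin{aligned}
\Pr(Y_{ij} = y_{ij},\ i = 1, \ldots, I,\ j = 1, \ldots, J \mid Y_{\cdot\cdot} = y_{\cdot\cdot})
&= \frac{\prod_i \prod_j \mu_{ij}^{y_{ij}} e^{-\mu_{ij}} / y_{ij}!}{\mu_{\cdot\cdot}^{y_{\cdot\cdot}} \exp\{-\mu_{\cdot\cdot}\} / y_{\cdot\cdot}!} \\
&= \frac{\prod_i \prod_j \mu_{ij}^{y_{ij}} \ \prod_i \prod_j e^{-\mu_{ij}} \ \prod_i \prod_j \frac{1}{y_{ij}!}}{\mu_{\cdot\cdot}^{y_{\cdot\cdot}} \exp\{-\mu_{\cdot\cdot}\} / y_{\cdot\cdot}!} \\
&= \frac{y_{\cdot\cdot}!}{\prod_i \prod_j y_{ij}!} \prod_i \prod_j \Big( \frac{\mu_{ij}}{\mu_{\cdot\cdot}} \Big)^{y_{ij}} \\
&= \frac{y_{\cdot\cdot}!}{\prod_i \prod_j y_{ij}!} \prod_i \prod_j \pi_{ij}^{y_{ij}}
\end{aligned}$$
where πij = µij /µ.. . This is a multinomial distribution since $\sum_i \sum_j \pi_{ij} = 1$. The corresponding log-likelihood is
$$ \ell(\pi) = \sum_i \sum_j y_{ij} \log \pi_{ij} $$
where π = (π11 , . . . , πIJ )′ . One might be interested in testing independence of the two methods of classification where
H0 : πij = πi. π.j and HA : πij ≠ πi. π.j (i.e. there is some association). The log-likelihood under H0 is
$$\begin{aligned}
\ell(\pi) &= \sum_i \sum_j y_{ij} \log(\pi_{i\cdot} \pi_{\cdot j}) \\
&= \sum_i \sum_j y_{ij} (\log \pi_{i\cdot} + \log \pi_{\cdot j}) \\
&= \sum_i y_{i\cdot} \log \pi_{i\cdot} + \sum_j y_{\cdot j} \log \pi_{\cdot j}
\end{aligned}$$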
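The conditioning argument that produced the multinomial distribution can be checked numerically: the product of independent Poisson probabilities divided by the Poisson probability of the total should equal the multinomial probability with πij = µij /µ.. . The sketch below does this for a small table stored row by row. The means and counts are hypothetical values chosen for illustration.

```python
import math

def poisson_pmf(y, mu):
    """Poisson probability function, computed on the log scale."""
    return math.exp(y * math.log(mu) - mu - math.lgamma(y + 1))

def multinomial_pmf(counts, probs):
    """Multinomial probability of the cell counts given cell probabilities."""
    log_p = math.lgamma(sum(counts) + 1)
    for y, p in zip(counts, probs):
        log_p += y * math.log(p) - math.lgamma(y + 1)
    return math.exp(log_p)

# Hypothetical 2 x 2 table of Poisson means and observed counts, row by row.
cell_means = [4.0, 2.5, 1.5, 6.0]
cell_counts = [3, 2, 0, 7]

total_mean = sum(cell_means)
joint = math.prod(poisson_pmf(y, mu) for y, mu in zip(cell_counts, cell_means))
conditional = joint / poisson_pmf(sum(cell_counts), total_mean)
multinomial = multinomial_pmf(cell_counts, [mu / total_mean for mu in cell_means])

print(conditional, multinomial)
```

The two printed probabilities should agree up to rounding error for any choice of positive means and non-negative counts.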
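The MLEs of πi. and π.j under H0 are given by π̂i. = yi. /y.. and π̂.j = y.j /y.. respectively, i = 1, 2, ..., I, j = 1, 2, ..., J,
and the fitted values under H0 are
$$ \hat{\mu}_{ij} = y_{\cdot\cdot} \hat{\pi}_{ij} = y_{\cdot\cdot} \, \frac{y_{i\cdot}}{y_{\cdot\cdot}} \, \frac{y_{\cdot j}}{y_{\cdot\cdot}} = \frac{y_{i\cdot} y_{\cdot j}}{y_{\cdot\cdot}} . $$
Therefore,
$$ \ell(\hat{\pi}) = \sum_i \sum_j y_{ij} \log \frac{y_{i\cdot} y_{\cdot j}}{y_{\cdot\cdot}^2} $$
The MLEs of πij under HA are given by π̃ij = yij /y.. . Therefore
$$ \ell(\tilde{\pi}) = \sum_i \sum_j y_{ij} \log \frac{y_{ij}}{y_{\cdot\cdot}} . $$
The likelihood ratio statistic is given by
$$\begin{aligned}
D &= 2(\ell(\tilde{\pi}) - \ell(\hat{\pi})) \\
&= 2 \sum_i \sum_j y_{ij} \log \Big( \frac{y_{ij}}{y_{\cdot\cdot}} \Big/ \frac{y_{i\cdot} y_{\cdot j}}{y_{\cdot\cdot}^2} \Big) \\
&= 2 \sum_i \sum_j y_{ij} \log \Big( y_{ij} \Big/ \frac{y_{i\cdot} y_{\cdot j}}{y_{\cdot\cdot}} \Big) \\
&= 2 \sum_i \sum_j O_{ij} \log(O_{ij} / E_{ij})
\end{aligned}$$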

where Oij are the observed cell counts and Eij are the expected values under the null hypothesis. The degrees of
freedom are given by IJ − ((I − 1) + (J − 1) + 1) = (I − 1)(J − 1). This is a common form of the goodness of fit
statistics that we have seen before.
Alternatively, we could use the Pearson statistic,
$$ T = \sum_i \sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}} $$
which is also asymptotically χ2 distributed under the null hypothesis with the same degrees of freedom.
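Both statistics are easy to compute from a table of counts. The sketch below is a minimal illustration that returns the likelihood ratio statistic D, the Pearson statistic T and the degrees of freedom (I − 1)(J − 1). The function name and the 2 × 3 table are hypothetical and are not taken from the notes.

```python
import math

def independence_tests(table):
    """Return (D, T, df) for the test of independence in a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    deviance = 0.0
    pearson = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n   # E_ij = y_i. y_.j / y_..
            if observed > 0:                               # O log(O/E) is 0 when O = 0
                deviance += 2.0 * observed * math.log(observed / expected)
            pearson += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return deviance, pearson, df

# Hypothetical 2 x 3 table of counts (illustrative values only).
table = [[20, 30, 50],
         [30, 30, 40]]
D, T, df = independence_tests(table)
print(D, T, df)
```

For tables with large expected counts the two statistics are usually close, as they are here; each would be compared with a χ2 distribution on df degrees of freedom.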
3.2.3 The Product Multinomial Distribution
Suppose J independent Poisson (µij ) variates are observed in each of I populations.
                                FACTOR
                 1     2     3    ···    j    ···    J
            1   y11   y12   y13   ···   y1j   ···   y1J   y1.
            2   y21   y22   y23   ···   y2j   ···   y2J   y2.
            3   y31   y32   y33   ···   y3j   ···   y3J   y3.
POPULATION  ⋮
            I   yI1   yI2   yI3   ···   yIj   ···   yIJ   yI.
                y.1   y.2   y.3   ···   y.j   ···   y.J   y..
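Recall our initial assumption is still
$$ \Pr(Y_{ij} = y_{ij}) = \frac{\mu_{ij}^{y_{ij}} e^{-\mu_{ij}}}{y_{ij}!} $$
If a sample of size yi. is chosen from population i and subjects are then classified into levels 1, 2, ..., J according to
the factor, then we should condition on yi. , i = 1, 2, ..., I since these are all fixed by design. We know
$$ \Pr(Y_{i\cdot} = y_{i\cdot}) = \frac{\mu_{i\cdot}^{y_{i\cdot}} e^{-\mu_{i\cdot}}}{y_{i\cdot}!} $$
Therefore Pr(Yij = yij , i = 1, 2, ..., I, j = 1, 2, ..., J | yi. , i = 1, 2, ..., I) is given by
$$\begin{aligned}
\frac{\prod_i \prod_j \mu_{ij}^{y_{ij}} e^{-\mu_{ij}} / y_{ij}!}{\prod_{i=1}^{I} \mu_{i\cdot}^{y_{i\cdot}} e^{-\mu_{i\cdot}} / y_{i\cdot}!}
&= \frac{\Big( \prod_i \prod_j \mu_{ij}^{y_{ij}} \Big) \Big( \prod_i \prod_j \frac{1}{y_{ij}!} \Big)}{\Big( \prod_{i=1}^{I} \mu_{i\cdot}^{y_{i\cdot}} \Big) \Big( \prod_{i=1}^{I} \frac{1}{y_{i\cdot}!} \Big)} \\
&= \prod_{i=1}^{I} \Big[ \frac{y_{i\cdot}!}{\prod_j y_{ij}!} \prod_{j=1}^{J} \Big( \frac{\mu_{ij}}{\mu_{i\cdot}} \Big)^{y_{ij}} \Big]
\end{aligned}$$
where πij = µij /µi. . This is a product multinomial distribution where $\sum_{j=1}^{J} \pi_{ij} = 1$, for i = 1, . . . , I. Note then that the
πij ’s have a different interpretation here. The corresponding log-likelihood is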

$$ \ell(\pi) = \sum_{i=1}^{I} \sum_{j=1}^{J} y_{ij} \log \pi_{ij}, \qquad \sum_{j=1}^{J} \pi_{ij} = 1 $$
where π = (π11 , . . . , πIJ )′ .

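In this case we might be interested in testing H0 : π1j = π2j = · · · = πIj = πj , j = 1, 2, ..., J versus HA : at least
one inequality holds. The log-likelihood under H0 is
$$ \ell(\hat{\pi}) = \sum_i \sum_j y_{ij} \log \hat{\pi}_j, \qquad \sum_j \hat{\pi}_j = 1, $$
where π̂j = y·j /y·· , j = 1, 2, ..., J, are the MLEs under H0 . Therefore
$$ \ell(\hat{\pi}) = \sum_i \sum_j y_{ij} \log(y_{\cdot j} / y_{\cdot\cdot}). $$
The log-likelihood under HA is
$$ \ell(\tilde{\pi}) = \sum_i \sum_j y_{ij} \log \tilde{\pi}_{ij}, \qquad \sum_j \tilde{\pi}_{ij} = 1, \quad i = 1, 2, \ldots, I, $$
where π̃ij = yij /yi. , j = 1, 2, ..., J, i = 1, 2, ..., I. The likelihood ratio statistic takes the form
$$\begin{aligned}
D &= 2(\ell(\tilde{\pi}) - \ell(\hat{\pi})) \\
&= 2 \sum_i \sum_j y_{ij} \log \Big( \frac{y_{ij}}{y_{i\cdot}} \Big/ \frac{y_{\cdot j}}{y_{\cdot\cdot}} \Big) \\
&= 2 \sum_i \sum_j y_{ij} \log \big( y_{ij} / (y_{i\cdot} y_{\cdot j} / y_{\cdot\cdot}) \big) \\
&= 2 \sum_i \sum_j O_{ij} \log \frac{O_{ij}}{E_{ij}}
\end{aligned}$$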
The degrees of freedom are given by I(J − 1) − (J − 1) = (I − 1)(J − 1). Note that you get the same test statistic and
degrees of freedom under either the multinomial or the product multinomial sampling scheme! Alternatively, we could use the
Pearson chi-squared statistic given by
$$ T = \sum_i \sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}} $$
which is also asymptotically χ2 distributed under the null hypothesis with the same degrees of freedom.
3.3 Log Linear Models for Contingency Tables
3.3.1 Log Linear Models for 2-way Tables
Here we will show how associations in contingency tables can be investigated by the use of Poisson regression models.
Consider the following contingency table where we initially assume Yij ∼ Poisson(µij )

                              FACTOR W
               1     2     3    ···    j    ···    J
          1   y11   y12   y13   ···   y1j   ···   y1J   y1.
          2   y21   y22   y23   ···   y2j   ···   y2J   y2.
          3   y31   y32   y33   ···   y3j   ···   y3J   y3.
FACTOR V  ⋮
          I   yI1   yI2   yI3   ···   yIj   ···   yIJ   yI.
              y.1   y.2   y.3   ···   y.j   ···   y.J   y..
If y.. is fixed by design, as it often is, we work with the conditional distribution of yij , i = 1, 2, ..., I, j = 1, 2, ..., J,
given y·· . We have already shown that this is a multinomial distribution,
$$ \Pr(Y_{ij} = y_{ij},\ i = 1, \ldots, I,\ j = 1, \ldots, J \mid Y_{\cdot\cdot} = y_{\cdot\cdot}) = \frac{y_{\cdot\cdot}!}{\prod_i \prod_j y_{ij}!} \prod_i \prod_j \pi_{ij}^{y_{ij}} $$
where πij = µij /µ.. . For this multinomial sampling plan we are often interested in testing
H0 : πij = πi. π.j
or equivalently that µij = E(Yij ) = y..πi. π.j . Traditionally we would estimate πi. and π.j for i = 1, 2, . . . , I and
j = 1, 2, . . . , J from the marginal totals. Next we would estimate the expected value of the (i, j) cell entry using
these sample estimates, and perhaps compute a Pearson statistic or a deviance statistic to carry out a formal test.
Interestingly, there is a log-linear model we can fit to assess this hypothesis, and if this model fits the data well,
then we would conclude that there is little evidence against the null hypothesis of independence. In this case, the
corresponding log-linear model under H0 is
$$ \log \mu_{ij} = u + u_i^{V} + u_j^{W}, \qquad i = 1, 2, \ldots, I, \quad j = 1, 2, \ldots, J \tag{1} $$
where V is the variable corresponding to factor 1, W corresponds to factor 2, and i and j denote the levels. The
notation is different from that we have used in other regression contexts because in log-linear models we are typically
dealing with several variables each with several levels and the expression for the linear predictors using the previous
“covariate” notation becomes too long.
Under HA : πij ≠ πi. π.j , or equivalently µij = E(Yij ) ≠ y..πi. π.j , the corresponding log-linear model is saturated
and given by
$$ \log \mu_{ij} = u + u_i^{V} + u_j^{W} + u_{ij}^{VW}, \qquad i = 1, 2, \ldots, I, \quad j = 1, 2, \ldots, J \tag{2} $$
The test for independence is based on testing the fit of model (1) vs. model (2).
Model (2) defines the most general log-linear model for a two-way table. It introduces 1+I +J +IJ parameters, so
constraints are needed since we only have IJ observations. Corner-point constraints are obtained using treatment
contrasts in Splus. Here we set
$$ u_1^{V} = 0, \quad u_1^{W} = 0, \quad u_{1j}^{VW} = 0, \ j = 1, 2, \ldots, J, \quad u_{i1}^{VW} = 0, \ i = 1, 2, \ldots, I. $$

We are then left with 1 + (I − 1) + (J − 1) + (I − 1)(J − 1) = IJ parameters.

Parameters can be interpreted as reflecting deviations about the reference level for the corresponding effect. ANOVA
constraints correspond to the following.
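\[ \sum_i u_i^V = 0, \quad \sum_j u_j^W = 0, \quad \sum_i u_{ij}^{VW} = \sum_j u_{ij}^{VW} = 0, \]
or, in dot notation,
\[ u_{.}^V = 0, \quad u_{.}^W = 0, \quad u_{.j}^{VW} = u_{i.}^{VW} = 0. \]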
We are then left with 1 + (I − 1) + (J − 1) + (IJ − (I + J − 1)) = IJ parameters.

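Starting with the full model, under ANOVA constraints we get
\[
\begin{aligned}
\log \mu_{ij} &= u + u_i^V + u_j^W + u_{ij}^{VW} \\
\sum_i \log \mu_{ij} &= Iu + 0 + Iu_j^W \\
\sum_j \log \mu_{ij} &= Ju + Ju_i^V \\
\sum_i \sum_j \log \mu_{ij} &= IJu
\end{aligned}
\]
This in turn gives $u_j^W = \sum_i \log \mu_{ij}/I - u$, $u_i^V = \sum_j \log \mu_{ij}/J - u$, $u = \sum_i \sum_j \log \mu_{ij}/IJ$, and
\[ u_{ij}^{VW} = \log \mu_{ij} - u - u_i^V - u_j^W = \log \mu_{ij} - u - \Big( \sum_j \log \mu_{ij}/J - u \Big) - \Big( \sum_i \log \mu_{ij}/I - u \Big). \]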
As can be seen these are analogous to the constraints and interpretation of effects arising in ordinary ANOVA models.
Here parameters are interpreted in terms of deviations from lower order terms.
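The decomposition under ANOVA constraints can be checked numerically. The following is a minimal sketch in Python (not the Splus used in these notes); the function name `anova_parameters` and the 2 x 3 table of means are hypothetical and chosen only for illustration.

```python
import math

def anova_parameters(mu):
    """Decompose log(mu) for an I x J table under sum-to-zero (ANOVA) constraints."""
    I, J = len(mu), len(mu[0])
    log_mu = [[math.log(m) for m in row] for row in mu]
    # u is the grand mean of the log means
    u = sum(sum(row) for row in log_mu) / (I * J)
    # main effects are row / column means of log(mu) minus the grand mean
    u_V = [sum(log_mu[i]) / J - u for i in range(I)]
    u_W = [sum(log_mu[i][j] for i in range(I)) / I - u for j in range(J)]
    # interaction terms are what remains after removing lower order terms
    u_VW = [[log_mu[i][j] - u - u_V[i] - u_W[j] for j in range(J)]
            for i in range(I)]
    return u, u_V, u_W, u_VW

# Hypothetical table of Poisson means (illustration only, not from the notes).
mu_example = [[10.0, 20.0, 40.0],
              [30.0, 15.0, 5.0]]
u, u_V, u_W, u_VW = anova_parameters(mu_example)
```

By construction the main effects and every row and column of the interaction terms sum to zero, which mirrors the interpretation of effects as deviations from lower order terms.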
Maximum likelihood estimation for a two-way model proceeds as follows. Note that
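\[ \Pr(Y_{ij} = y_{ij},\ i = 1, 2, \ldots, I,\ j = 1, 2, \ldots, J) = \prod_i \prod_j \frac{\mu_{ij}^{y_{ij}} e^{-\mu_{ij}}}{y_{ij}!} = \prod_i \prod_j \exp(y_{ij} \log \mu_{ij} - \mu_{ij} - \log y_{ij}!) \]
and as before the resulting log likelihood takes the form
\[ \ell = \sum_i \sum_j (y_{ij} \log \mu_{ij} - \mu_{ij} - \log y_{ij}!) \]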
We can make the substitution according to any particular log-linear model. For the saturated model, for example,
we let $\log \mu_{ij} = u + u_i^V + u_j^W + u_{ij}^{VW}$, i = 1, 2, ..., I, j = 1, 2, ..., J, substitute
$\mu_{ij} = \exp\{u + u_i^V + u_j^W + u_{ij}^{VW}\}$ in the log likelihood, and maximize subject to cornerpoint or ANOVA constraints.
How is it that we can use the Poisson distribution to derive the MLEs in this context when in fact Y.. = y..
was fixed and the multinomial distribution was appropriate? We showed the relationship between the Poisson log
likelihood and the multinomial likelihood for a special case in class, but the result holds quite generally. If the log
likelihoods are related, then it is not surprising that for some log-linear models the MLEs of the means are the
same for the Poisson distribution and the corresponding multinomial or product multinomial distribution. For this
to hold, certain terms must be included in the log-linear regression model. The appropriate
terms are the parameters that correspond to the marginal totals that were fixed by design. For example
• If y.. is fixed, include the u term (multinomial case).
• If yi. are fixed, include u + $u_i^V$ terms (product multinomial case).
• If y.j are fixed, include u + $u_j^W$ terms (product multinomial case).
To illustrate why this is the case, we focus on the case of a product multinomial distribution. We want to base
inference on Pr(Yij = yij , i = 1, 2, ..., I, j = 1, 2, ..., J | y1. , ..., yI. ). Since the responses in different
samples are independent, we can write this as
\[ \prod_{i=1}^{I} \Pr(Y_{ij} = y_{ij},\ j = 1, 2, \ldots, J \mid Y_{i.} = y_{i.}). \]
If we focus on a single sample, say the ith one, we note that Yi. is Poisson(µi. ), i = 1, . . . , I. We can then write
\[ \Pr(Y_{i1} = y_{i1}, \ldots, Y_{iJ} = y_{iJ} \mid Y_{i.} = y_{i.}) = \frac{\Pr(Y_{i1} = y_{i1}, \ldots, Y_{iJ} = y_{iJ})}{\Pr(Y_{i.} = y_{i.})}. \]
Provided we include appropriate terms so that the yi. are fit exactly, then because Yi. is Poisson the log likelihood arising
from the conditional distribution on the left-hand side leads to the same estimates as the log likelihood arising from the
numerator on the right-hand side. This is because the conditioning statistic (here yi. ) is Poisson, and when it is fit exactly
its log-likelihood contribution is a constant that carries no further information about the remaining parameters.
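The factorization behind this argument can be verified numerically for a single sample. The sketch below is in Python rather than Splus; the counts and means are hypothetical, and the helper names `poisson_logpmf` and `multinomial_logpmf` are introduced here only for illustration. It checks that the joint Poisson log likelihood equals the Poisson log likelihood of the total plus the multinomial log likelihood given the total.

```python
import math

def poisson_logpmf(y, mu):
    """log Pr(Y = y) for Y ~ Poisson(mu)."""
    return y * math.log(mu) - mu - math.lgamma(y + 1)

def multinomial_logpmf(ys, probs):
    """log Pr(Y_1 = y_1, ..., Y_J = y_J | total) for a multinomial."""
    out = math.lgamma(sum(ys) + 1)
    for y, p in zip(ys, probs):
        out += y * math.log(p) - math.lgamma(y + 1)
    return out

# Hypothetical counts and Poisson means for one row (illustration only).
ys = [7, 3, 12]
mus = [6.5, 4.0, 11.0]

total_mu = sum(mus)
joint = sum(poisson_logpmf(y, m) for y, m in zip(ys, mus))
marginal = poisson_logpmf(sum(ys), total_mu)
conditional = multinomial_logpmf(ys, [m / total_mu for m in mus])
gap = joint - (marginal + conditional)
```

Here `gap` is zero up to rounding error, so once the row total is fit exactly the Poisson and conditional log likelihoods differ only by a term that does not involve the cell probabilities.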
Additional Notes:
a. Log-linear models derive their name from the fact that the effects are linear on the log scale.
b. In log-linear models there is no distinction between explanatory and response variables.
c. Log linear models can be used to analyse relationships between several variables. As a result, interaction terms
become of primary interest.
d. Log linear models are said to be hierarchical in the sense that higher order terms are not included in the model
unless all related lower order terms are. That is,
\[ \log \mu_{ij} = u + u_i^V + u_{ij}^{VW} \]
would never be fit as a log linear model since the $u_j^W$ term is not there!
e. Because the Poisson distribution is a member of the exponential family, IRWLS can be used for estimation.
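To make note (e) concrete, here is a bare-bones sketch of IRWLS for a Poisson log-linear model, written in Python rather than Splus. It is not the algorithm as implemented in glm, only an illustration of the idea. The counts are the melanoma data analysed in Section 3.4.1, the design matrix uses cornerpoint (treatment) coding, and the helper names `design_row`, `solve` and `irwls_poisson` are hypothetical.

```python
import math

# Melanoma counts (rows: tumour type, columns: site), from Section 3.4.1.
counts = [[22, 2, 10], [16, 54, 115], [19, 33, 73], [11, 17, 28]]

def design_row(i, j):
    """Intercept, three type indicators, two site indicators (level 1 = reference)."""
    return ([1.0] + [1.0 if i == t else 0.0 for t in (1, 2, 3)]
            + [1.0 if j == s else 0.0 for s in (1, 2)])

X = [design_row(i, j) for i in range(4) for j in range(3)]
y = [float(counts[i][j]) for i in range(4) for j in range(3)]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[r]] for r, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def irwls_poisson(X, y, n_iter=25):
    """Iteratively reweighted least squares for a Poisson model with log link."""
    p = len(X[0])
    eta = [math.log(v + 0.5) for v in y]          # starting linear predictor
    beta = [0.0] * p
    for _ in range(n_iter):
        mu = [math.exp(e) for e in eta]
        # working response z and weights W = mu for the log link
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        XtWX = [[sum(m * r[a] * r[b] for r, m in zip(X, mu)) for b in range(p)]
                for a in range(p)]
        XtWz = [sum(m * r[a] * zi for r, m, zi in zip(X, mu, z)) for a in range(p)]
        beta = solve(XtWX, XtWz)
        eta = [sum(bk * xk for bk, xk in zip(beta, r)) for r in X]
    return beta

beta = irwls_poisson(X, y)
```

Under these assumptions `beta` should agree, to several decimals, with the treatment-contrast coefficients reported for the main effects model in Section 3.4.1.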

3.4 Application
3.4.1 A Melanoma Study
A cross-sectional study was conducted in which y.. = 400 patients with malignant melanoma were classified
according to two factors, the site of the tumour and the histological type. The data are as follows.
Tumour Type              Head and Neck   Trunk   Extremities   Total
Hutchinson’s freckle                22       2            10      34
Superficial Spreading               16      54           115     185
Nodular                             19      33            73     125
Indeterminate                       11      17            28      56
Total                               68     106           226     400
Here we carry out an analysis to investigate whether the different types of tumour appear equally likely in the
different sites. That is we are assessing whether or not there is an association between histological type and tumour
site.
We begin with the assumption that yij ∼ Poisson(µij ), i = 1, 2, 3, 4, j = 1, 2, 3. We then condition on Y.. = y..
since fixed by design, leading to the multinomial distribution.
\[ P(Y_{ij} = y_{ij},\ \forall (i, j) \mid Y_{..} = y_{..}) = \frac{y_{..}!}{\prod_i \prod_j y_{ij}!} \prod_i \prod_j \pi_{ij}^{y_{ij}} \]
where πij = µij /µ... The null hypothesis is that the tumour type and site are independent, giving H0 : πij = πi. π.j ,
i = 1, 2, 3, 4, j = 1, 2, 3. Under H0 : µij = E(Yij ) = y..πi. π.j meaning we will have to fit the row and column totals to
allow estimation of πi. , i = 1, 2, 3, 4 and π.j , j = 1, 2, 3. Thus our log linear model under the null hypothesis is
\[ \log \mu_{ij} = u + u_i^V + u_j^W, \quad i = 1, 2, 3, 4,\ j = 1, 2, 3 \]
where V corresponds to tumour type variable (i indicating the level), and W corresponds to tumour site variable (j
indicating the level). If the model appears to fit the data well, then we have little evidence against the assumption
that tumour type and site are independent. If the model does not fit the data well, then some tumour types appear
more frequently in certain locations.
Here we have the Splus data file and the corresponding program.
type locat y
1 1 22
1 2 2
1 3 10
2 1 16
2 2 54
2 3 115
3 1 19
3 2 33
3 3 73
4 1 11
4 2 17
4 3 28
derm.dat_read.table("derm.dat", header=T)
derm.dat$typef_factor(derm.dat$type)
derm.dat$typeft_C(derm.dat$typef,treatment)
derm.dat$sitef_factor(derm.dat$locat)

derm.dat$siteft_C(derm.dat$sitef,treatment)
derm.dat
attach(derm.dat)
# fitting the model with both main effects

model1_glm(y ~ typeft + siteft, family=poisson)
summary(model1)
# fitting the model with only the "histological type" main effect
model2_glm(y ~ typeft, family=poisson)
1-pchisq(model2$deviance - model1$deviance,model2$df - model1$df)
# fitting the model with only the "site" main effect
model3_glm(y ~ siteft, family=poisson)
# here we fit the model with the anova (sum) constraints

derm.dat_read.table("derm.dat", header=T)
derm.dat$typef_factor(derm.dat$type)
derm.dat$typefs_C(derm.dat$typef,sum,3)
derm.dat$sitef_factor(derm.dat$locat)
derm.dat$sitefs_C(derm.dat$sitef,sum,2)
derm.dat
attach(derm.dat)

model1_glm(y ~ typefs + sitefs, family=poisson)
summary(model1)
# creating deviance residuals for diagnostic plots

derm.dat$fitted.values_model1$fitted.values
derm.dat$rdeviance_residuals.glm(model1,type="deviance")
derm.dat
A selection of Splus output follows.

> model1_glm(y ~ typeft + siteft, family=poisson)
> summary(model1)
Call: glm(formula = y ~ typeft + siteft, family = poisson)

Deviance Residuals:
-3.045336 -1.074106 0.1296569 0.5856518 5.135292
Coefficients:
(Intercept) 1.7544309 0.2034503 8.623388
typeft2 1.6939682 0.1860136 9.106689
typeft3 1.3019261 0.1928617 6.750567
typeft4 0.4989641 0.2169164 2.300260
siteft2 0.4439313 0.1553150 2.858264
siteft3 1.2010272 0.1382637 8.686499
(Intercept) typeft2 typeft3 typeft4 siteft2
typeft2 -0.7714645
typeft3 -0.7440714 0.8138197

typeft4 -0.6615586 0.7235722 0.6978797

siteft2 -0.4650394 0.0000000 0.0000000 0.0000000
siteft3 -0.5223902 0.0000000 0.0000000 0.0000000 0.6842897
Now we examine the fitted values and residuals.
> derm.dat
type locat y typef typeft sitef siteft fitted.values rdeviance
1 1 1 22 1 1 1 1 5.780157 5.13529196
2 1 2 2 1 1 2 2 9.010244 -2.82836146
3 1 3 10 1 1 3 3 19.210520 -2.31594072
4 2 1 16 2 2 1 1 31.450003 -3.04533648
5 2 2 54 2 2 2 2 49.025000 0.69899705
6 2 3 115 2 2 3 3 104.524998 1.00813997
7 3 1 19 3 3 1 1 21.250002 -0.49711123
8 3 2 33 3 3 2 2 33.125000 -0.02173227
9 3 3 73 3 3 3 3 70.624999 0.28104599
10 4 1 11 4 4 1 1 9.520001 0.46798405
11 4 2 17 4 4 2 2 14.840000 0.54787009
12 4 3 28 4 4 3 3 31.639999 -0.66016090
Note that we can verify that the row and column totals are fit exactly (e.g. sum the first three observations corresponding to
the total number of Hutchinson freckle cases, and sum the corresponding fitted values). We conclude that the model does not
provide a very good fit to the data since there are some rather large deviance residuals corresponding to the first two rows
of the table. Therefore our hypothesis that tumour type and site are independent does not seem plausible. Specifically, based
on the fitted values and residuals we see that Hutchinson’s freckle occurs more often on the head and neck than we would
expect under the independence assumption, and less often on the trunk and extremities. Furthermore, superficial spreading
melanoma occurs less often on the head and neck than we would expect.
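The fitted values and deviance residuals in the listing above can be reproduced directly from the marginal totals, since under independence the fitted value for cell (i, j) is the row total times the column total divided by y.. . The following Python sketch (not Splus) illustrates this; the helper name `deviance_residual` is introduced here for illustration only.

```python
import math

# Observed melanoma counts (rows: tumour type, columns: site).
counts = [[22, 2, 10], [16, 54, 115], [19, 33, 73], [11, 17, 28]]

row_tot = [sum(r) for r in counts]
col_tot = [sum(r[j] for r in counts) for j in range(3)]
n = sum(row_tot)

# Fitted values under independence: (row total) x (column total) / n.
fitted = [[row_tot[i] * col_tot[j] / n for j in range(3)] for i in range(4)]

def deviance_residual(y, mu):
    """Signed square root of the Poisson deviance contribution of one cell."""
    term = y * math.log(y / mu) if y > 0 else 0.0
    d = 2.0 * (term - (y - mu))
    return math.copysign(math.sqrt(max(d, 0.0)), y - mu)

resid = [[deviance_residual(counts[i][j], fitted[i][j]) for j in range(3)]
         for i in range(4)]
```

The first row of `fitted` sums to 34, the observed number of Hutchinson's freckle cases, which illustrates that the margins are fit exactly, and the largest residual belongs to the Hutchinson's freckle, head and neck cell.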
> model2_glm(y ~ typeft, family=poisson)
> 1-pchisq(model2$deviance - model1$deviance,model2$df - model1$df)
[1] 0
> model3_glm(y ~ siteft, family=poisson)

[1] 0
The above likelihood ratio tests indicate that the different tumour types do not occur equally often, and melanoma does not
occur equally often at the different sites of the body. These are not particularly interesting findings in themselves.
Now we examine the result of fitting a similar main effects model but with the anova (sum in Splus) contrasts. Here we
get
> model1_glm(y ~ typefs + sitefs, family=poisson)
> summary(model1)
Call: glm(formula = y ~ typefs + sitefs, family = poisson)

Deviance Residuals:
-3.045336 -1.074106 0.1296569 0.5856518 5.135292
Coefficients:
(Intercept) 3.1764650 0.06672231 47.607238
typefs1 -0.8737146 0.13555988 -6.445230
typefs2 0.8202536 0.08050656 10.188655
typefs3 0.4282115 0.08819635 4.855207
sitefs1 -0.5483195 0.08983258 -6.103793
sitefs2 -0.1043882 0.07946273 -1.313675
(Intercept) typefs1 typefs2 typefs3 sitefs1

typefs1 0.3892042
typefs2 -0.4518752 -0.4463883
typefs3 -0.3022502 -0.4617212 0.0601799
sitefs1 0.2880608 0.0000000 0.0000000 0.0000000
sitefs2 -0.0054655 0.0000000 0.0000000 0.0000000 -0.6821281
Note that the coefficients differ, as one would expect, and that the correlation matrix has changed. The inferences one draws
regarding the effects of the type and site variables are the same, however (try carrying out a likelihood ratio test in this sort of
setting and you will see it gives the same value as before). Again, the fitted values are the same, as we see below.
> derm.dat$fitted.values_model1$fitted.values
> derm.dat$rdeviance_residuals.glm(model1,type="deviance")
> derm.dat
type locat y typef typefs sitef sitefs fitted.values rdeviance

1 1 1 22 1 1 1 1 5.780157 5.13529196
2 1 2 2 1 1 2 2 9.010244 -2.82836146
3 1 3 10 1 1 3 3 19.210520 -2.31594072
4 2 1 16 2 2 1 1 31.450003 -3.04533648
5 2 2 54 2 2 2 2 49.025000 0.69899705
6 2 3 115 2 2 3 3 104.524998 1.00813997
7 3 1 19 3 3 1 1 21.250002 -0.49711123
8 3 2 33 3 3 2 2 33.125000 -0.02173227
9 3 3 73 3 3 3 3 70.624999 0.28104599
10 4 1 11 4 4 1 1 9.520001 0.46798405
11 4 2 17 4 4 2 2 14.840000 0.54787009
12 4 3 28 4 4 3 3 31.639999 -0.66016090
Additional tabulations give us further insight into the findings.
Row Percentages
Head and Neck Trunk Extremities All sites
Hutchinson’s Freckle 64.7 5.9 29.4 100
Superficial Spreading 8.6 29.2 62.2 100
Nodular 15.2 26.4 58.4 100
Indeterminate 19.6 30.4 50.0 100
All types 17.0 26.5 56.5 100
Column Percentages
Head and Neck Trunk Extremities All sites
Hutchinson’s Freckle 32.4 1.9 4.4 8.5
Superficial Spreading 23.5 50.9 50.9 46.25
Nodular 27.9 31.1 32.3 31.25
Indeterminate 16.2 16.0 12.4 14.00
All types 100 100 100 100

3.5 Log Linear Models for Three Way Tables
3.5.1 Introduction
Consider the general problem in which subjects are classified with respect to three factor variables denoted V , W , and
Z with I, J, and K levels respectively. As with two-way tables, we initially assume yijk ∼ Poisson(µijk ), i = 1, 2, ..., I,
j = 1, 2, ..., J, k = 1, 2, ..., K. As before, if Y··· = y··· is fixed by design (as it usually would be), we must condition on
this to give the multinomial distribution,
\[ P(Y_{ijk} = y_{ijk},\ \forall (i, j, k) \mid Y_{...} = y_{...}) = \frac{y_{...}!}{\prod_i \prod_j \prod_k y_{ijk}!} \prod_i \prod_j \prod_k \pi_{ijk}^{y_{ijk}} \]
where πijk = µijk /µ... are now the parameters of interest.

In the case of two-way contingency tables we discussed the connection between log-linear models and questions
about the association between the two factors. To recap, main effects accommodated non-uniform distributions of
the row and column totals, and the interaction terms allowed for association between the two factors of interest.
In terms of an association, it was either present or absent. As we will see in what follows, with three way tables
(contingency tables involving three factor variables) the nature of the associations present may be somewhat more
complicated.
a. Mutual Independence
H0 : πijk = πi.. π.j. π..k
The variables V , W , and Z are said to be mutually independent if we can write πijk = πi.. π.j. π..k . That is, all
three variables V , W and Z are independent of each other. This implies
y··· πijk = y...πi.. π.j. π..k
The corresponding log-linear model is given by
\[ \log \mu_{ijk} = u + u_i^V + u_j^W + u_k^Z \]
with $u_1^V = u_1^W = u_1^Z = 0$ (with cornerpoint constraints for example). We know this is the corresponding model
because in the hypothesis we can model the individual probabilities within the table simply in terms of the
margin totals and the above model will fit these exactly. It is convenient to have a shorthand notation to identify
log-linear models, and this model is sometimes denoted by (V, W, Z), where we list the highest order terms
involving each of the factors. The model d.f. is 1 + (I − 1) + (J − 1) + (K − 1) = I + J + K − 2 and the residual
d.f. is IJK − (I + J + K − 2).
b. Joint Independence
H0 : πijk = πi.k π.j.
W is said to be jointly independent of V and Z if we can write πijk = πi.k π.j. . That is, W is independent of
the “new variable” with IK levels defined by the combination of levels of V and Z. This implies that, given
the levels of V and Z, we have learned nothing about the distribution of factor W . Equivalently we may say that
the nature of the association which exists between V and Z does not vary with different levels of factor W . The
corresponding log-linear model has the form
\[ \log \mu_{ijk} = u + u_i^V + u_j^W + u_k^Z + u_{ik}^{VZ} \]
since again this model will ensure that the marginal probabilities required to model a general probability πijk are
fit exactly. If the hypothesis is true, then fitting these marginal totals should be sufficient to provide a good fit
to the data. In the short-hand notation, this model is denoted as (V Z, W ). The model d.f. is then [1 + (I − 1) +
(J − 1) + (K − 1) + (I − 1)(K − 1)] and the residual d.f. is (IJK) − [1 + (I − 1) + (J − 1) + (K − 1) + (I − 1)(K − 1)].
We could also have V as jointly independent of W and Z (denoted (V, W Z)) or Z jointly independent of V and
W (denoted (V W, Z)).
c. Conditional Independence
H0 : πij|k = πi.|k π.j|k
where πij|k = πijk /π..k determines the joint distribution of V and W at level k of Z. Thus V and W are said to
be conditionally independent given Z if we can write πij|k = πi.|k π.j|k , i = 1, 2, ..., I, j = 1, 2, ..., J. H0 implies
\[ \frac{\pi_{ijk}}{\pi_{..k}} = \frac{\pi_{i.k}}{\pi_{..k}} \, \frac{\pi_{.jk}}{\pi_{..k}} \quad \Rightarrow \quad \pi_{ijk} = \pi_{i.k} \pi_{.jk} / \pi_{..k} \]
The corresponding log-linear model is of the form
\[ \log \mu_{ijk} = u + u_i^V + u_j^W + u_k^Z + u_{ik}^{VZ} + u_{jk}^{WZ} \]
which is denoted as (V Z, W Z). Again, we know this is the corresponding model because under this model all
that is required to derive the probability πijk are the marginal probabilities πi·k and π·jk . The model d.f. is
[1 + (I − 1) + (J − 1) + (K − 1) + (I − 1)(K − 1) + (J − 1)(K − 1)] and the residual d.f. (IJK) − [1 + (I − 1) + (J −
1) + (K − 1) + (I − 1)(K − 1) + (J − 1)(K − 1)]. We could also have models of the form (V W, W Z) or (V W, V Z),
where the former would, for example, indicate that when we condition on W , V and Z are independent.
d. All Pairs Conditionally Independent
This is related to the association structure in (c) and involves the hypothesis
H0 : πij|k = πi.|k π.j|k , πik|j = πi.|j π.k|j , πjk|i = πj.|i π.k|i
Here we have the association between any two variables constant across levels of other variable. An equivalent
way of thinking about this model is that there is an association between all pairs of variables, but the nature of
this association is constant with respect to changing levels of the remaining variable.
\[ \log \mu_{ijk} = u + u_i^V + u_j^W + u_k^Z + u_{ij}^{VW} + u_{ik}^{VZ} + u_{jk}^{WZ} \]
which is denoted by (V W, V Z, W Z). The model d.f. is [1 + (I − 1) + (J − 1) + (K − 1) + (I − 1)(J − 1) + (I −

1)(K − 1) + (J − 1)(K − 1)] and the residual d.f. is (IJK) − [1 + (I − 1) + (J − 1) + (K − 1) + (I − 1)(J − 1) +
(I − 1)(K − 1) + (J − 1)(K − 1)].
The next most complicated model is the “saturated” model, which involves the three-way interaction term and a total of IJK
parameters.
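The degrees of freedom quoted for models (a) to (d) all follow the same counting rule: the intercept contributes 1, and each term contributes the product of (number of levels − 1) over the factors it involves. A small Python sketch of this bookkeeping follows; the function name `model_df` and the dictionary `models` are hypothetical, and the table dimensions I = 2, J = 3, K = 6 are those of the suicide data in Section 3.7.1.

```python
def model_df(I, J, K, terms):
    """Return (model d.f., residual d.f.) for a hierarchical log-linear model.

    `terms` lists every term in the model except the intercept, e.g.
    ["V", "W", "Z", "VZ"] for the joint independence model (VZ, W).
    """
    levels = {"V": I, "W": J, "Z": K}
    n_params = 1                      # the intercept u
    for term in terms:
        p = 1
        for factor in term:
            p *= levels[factor] - 1   # (levels - 1) per factor in the term
        n_params += p
    return n_params, I * J * K - n_params

main = ["V", "W", "Z"]
models = {
    "(V, W, Z)": main,
    "(VZ, W)": main + ["VZ"],
    "(VZ, WZ)": main + ["VZ", "WZ"],
    "(VW, VZ, WZ)": main + ["VW", "VZ", "WZ"],
    "(VWZ)": main + ["VW", "VZ", "WZ", "VWZ"],
}

# Dimensions of the sex x age x method table in Section 3.7.1.
df_table = {name: model_df(2, 3, 6, terms) for name, terms in models.items()}
```

For example, `df_table["(VW, VZ, WZ)"]` gives 26 model d.f. and 10 residual d.f., and the saturated model leaves 0 residual d.f.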

3.6 Goodness of Fit Statistics for Log Linear Models
3.6.1 Test Statistics
The fit of a log linear model can be judged based on the deviance assuming an underlying Poisson distribution for
the cell counts. We know from before that the deviance statistic has the form

\[ D = 2 \sum_i \sum_j \sum_k O_{ijk} \log \frac{O_{ijk}}{E_{ijk}} \]
where the d.f. is IJK − q and q is the number of parameters under H0 .
Then $D \sim \chi^2_{IJK-q}$ approximately. Alternatively, we can use Pearson’s χ2 statistic
\[ X^2 = \sum_i \sum_j \sum_k \frac{(O_{ijk} - E_{ijk})^2}{E_{ijk}} \]
where Oijk is the observed value for cell (i, j, k) and Eijk is the expected value for cell (i, j, k) under H0 ; again this is
approximately chi-squared distributed under the null hypothesis with d.f. IJK − q.
3.6.2 Residuals
The deviance residuals have the usual form under the Poisson model. The standardized Pearson residuals take the
form
\[ r^P_{ijk} = \frac{O_{ijk} - E_{ijk}}{\sqrt{E_{ijk}}} \]
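As an illustration of these statistics, the sketch below computes the deviance, Pearson's statistic and an approximate p-value in Python (not Splus). It is applied to the two-way melanoma table of Section 3.4.1 under independence, where the residual d.f. is 12 − 6 = 6. The function names are hypothetical, and `chisq_sf_even` uses the closed-form chi-squared tail probability that is available only when the d.f. is even; a general routine would be needed otherwise.

```python
import math

def deviance(obs, exp):
    """D = 2 * sum O log(O / E); valid when the fitted total equals the observed total."""
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(obs, exp) if o > 0)

def pearson(obs, exp):
    """Pearson's statistic: sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

def chisq_sf_even(x, df):
    """Pr(chi-squared_df > x) for even df, using the closed-form Poisson sum."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

# Melanoma counts and expected counts under independence.
counts = [[22, 2, 10], [16, 54, 115], [19, 33, 73], [11, 17, 28]]
rows = [sum(r) for r in counts]
cols = [sum(r[j] for r in counts) for j in range(3)]
n = sum(rows)
obs = [counts[i][j] for i in range(4) for j in range(3)]
exp = [rows[i] * cols[j] / n for i in range(4) for j in range(3)]

D_mel = deviance(obs, exp)
X2_mel = pearson(obs, exp)
p_mel = chisq_sf_even(D_mel, 6)
```

Both statistics are far in the upper tail of a chi-squared distribution on 6 d.f., in line with the conclusion drawn from the residuals in Section 3.4.1.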

3.7 Applications
3.7.1 Methods of Suicide
Suppose we are interested in examining the nature of the association among the 3 variables sex, age group, and
method of suicide, in the following table.
Cause of Death
Sex Age( yrs) Solid/Liquid Gas Respiratory Weapons Jumping Other
Male 10-40 398 121 455 155 55 124
Male 40-70 399 82 797 168 51 82
Male > 70 93 6 316 33 26 14
Female 10-40 259 15 95 14 40 38
Female 40-70 450 13 450 26 71 60
Female > 70 154 5 185 7 38 10
Note that here there is no obvious response variable to single out. Since we are interested in the association among
all three variables we consider methods based on log-linear models.
Here we carry out an analysis to investigate the association between sex, age and method of suicide. Let V denote
sex, W denote age, and Z denote the method of suicide. We know the log linear model
\[ \log \mu_{ijk} = u + u_i^V + u_j^W + u_k^Z + u_{ij}^{VW} + u_{ik}^{VZ} + u_{jk}^{WZ} + u_{ijk}^{VWZ} \]
will provide a perfect fit to the data (since it is a saturated model). We seek to find a simpler model which describes
the data well. In other words, we are looking for a simpler representation of the relationship between the sex, age
and method of suicide variables.
sex age method y
1 1 1 398
1 1 2 121
1 1 3 455
1 1 4 155
1 1 5 55
1 1 6 124
1 2 1 399
1 2 2 82
1 2 3 797
1 2 4 168
1 2 5 51
1 2 6 82
1 3 1 93
1 3 2 6
1 3 3 316
1 3 4 33
1 3 5 26
1 3 6 14
2 1 1 259
2 1 2 15
2 1 3 95
2 1 4 14
2 1 5 40
2 1 6 38
2 2 1 450

2 2 2 13
2 2 3 450
2 2 4 26
2 2 5 71
2 2 6 60
2 3 1 154
2 3 2 5
2 3 3 185
2 3 4 7
2 3 5 38
2 3 6 10
suic.dat_read.table("suic.dat", header=T)
suic.dat$sexf_factor(suic.dat$sex)
suic.dat$sexft_C(suic.dat$sexf,treatment)
suic.dat$agef_factor(suic.dat$age)
suic.dat$ageft_C(suic.dat$agef,treatment)
suic.dat$methodf_factor(suic.dat$method)
suic.dat$methodft_C(suic.dat$methodf,treatment)
attach(suic.dat)

model1_glm(y ~ sexft + ageft + methodft + sexft*ageft + sexft*methodft +
ageft*methodft + sexft*ageft*methodft, family=poisson)
summary(model1)
names(model1)
model1$df.residual
model1$deviance
model2_glm(y ~ sexft + ageft + methodft + sexft*ageft + sexft*methodft +

ageft*methodft, family=poisson)
model2$df.residual
model2$deviance
1-pchisq(model2$deviance-model1$deviance,model2$df.residual-model1$df.residual)
model3_glm(y ~ sexft + ageft + methodft + sexft*ageft + sexft*methodft,

family=poisson)
model3$df.residual
model3$deviance
model4_glm(y ~ sexft + ageft + methodft + sexft*ageft + ageft*methodft,

family=poisson)
model4$df.residual
model4$deviance
model5_glm(y ~ sexft + ageft + methodft + sexft*methodft + ageft*methodft,

family=poisson)
model5$df.residual
model5$deviance
# creating deviance residuals for diagnostic plots

suic.dat$fitted.values2_model2$fitted.values
suic.dat$rdeviance2_residuals.glm(model2,type="deviance")
suic.dat

motif()
par(mfrow=c(2,2))
plot(suic.dat$fitted.values2,suic.dat$rdeviance2,ylim=c(-5,5),
xlab="FITTED VALUES",ylab="DEVIANCE RESIDUALS"); title("Model 2")

> model1_glm(y ~ sexft + ageft + methodft + sexft*ageft + sexft*methodft +
ageft*methodft + sexft*ageft*methodft, family=poisson)
> model1$df.residual
[1] 0
> model1$deviance
[1] -1.865175e-13
Here we obtain a zero residual deviance as we would expect, and zero for the residual degrees of freedom.
> model2_glm(y ~ sexft + ageft + methodft + sexft*ageft + sexft*methodft +
ageft*methodft, family=poisson)
[1] 10
> model2$deviance
[1] 14.90066
> 1-pchisq(model2$deviance-model1$deviance,model2$df.residual-
model1$df.residual)
[1] 0.1357257
We find a small but nonsignificant drop in the quality of fit in going from model 1 to model 2, and therefore think of model 2
as our default model now. Further simplification is not possible as the following results show.
> model3_glm(y ~ sexft + ageft + methodft + sexft*ageft + sexft*methodft,
family=poisson)
[1] 20
> model3$deviance
[1] 293.1775
model2$df.residual)
[1] 0
> model4_glm(y ~ sexft + ageft + methodft + sexft*ageft + ageft*methodft,
family=poisson)
[1] 15
> model4$deviance
[1] 388.9973
model2$df.residual)
[1] 0
> model5_glm(y ~ sexft + ageft + methodft + sexft*methodft + ageft*methodft,
family=poisson)
[1] 12
> model5$deviance
[1] 154.7015
model2$df.residual)
[1] 0
Here we examine the fitted values and deviance residuals for model 2. This tabulation and the plot suggest a good fit to
the data.
> suic.dat$fitted.values2_model2$fitted.values
> suic.dat$rdeviance2_residuals.glm(model2,type="deviance")

> suic.dat
sex age method y fitted.values2 rdeviance2
1 1 1 1 398 410.853837 -0.63749626
2 1 1 2 121 122.699573 -0.15378908
3 1 1 3 455 439.224163 0.74830743
4 1 1 4 155 156.373386 -0.10998886
5 1 1 5 55 56.820757 -0.24285278
6 1 1 6 124 122.028284 0.17801260
7 1 2 1 399 379.442338 0.99557892
8 1 2 2 82 77.620529 0.49252029
9 1 2 3 797 819.882186 -0.80289911
10 1 2 4 168 166.268603 0.13404183
11 1 2 5 51 51.090948 -0.01272770
12 1 2 6 82 84.695397 -0.29445651
13 1 3 1 93 99.703825 -0.67912009
14 1 3 2 6 8.680256 -0.96385340
15 1 3 3 316 308.893651 0.40280000
16 1 3 4 33 33.358011 -0.06209777
17 1 3 5 26 24.088295 0.38452049
18 1 3 6 14 13.276319 0.19684871
19 2 1 1 259 246.146163 0.81230758
20 2 1 2 15 13.300433 0.45658974
21 2 1 3 95 110.775837 -1.53676135
22 2 1 4 14 12.626614 0.37979274
23 2 1 5 40 38.179243 0.29237485
24 2 1 6 38 39.971716 -0.31448402
25 2 2 1 450 469.557662 -0.90892922
26 2 2 2 13 17.379479 -1.10004443
27 2 2 3 450 427.117814 1.09752217
28 2 2 4 26 27.731397 -0.33229749
29 2 2 5 71 70.909052 0.01079815
30 2 2 6 60 57.304603 0.35332589
31 2 3 1 154 147.296175 0.54825361
32 2 3 2 5 2.320220 1.52256317
33 2 3 3 185 192.106349 -0.51592538
34 2 3 4 7 6.641989 0.13769372
35 2 3 5 38 39.911705 -0.30506648
36 2 3 6 10 10.723681 -0.22354952
The following analysis of deviance table summarizes our findings.

Model Form Residual Deviance Residual d.f. p−value
1 (VWZ) 0 0 NA
2 (VW,VZ,WZ) 14.9 10 0.14 (vs. 1)
3 (VW,VZ) 293.2 20 0.00 (vs. 2)
4 (VW,WZ) 389.0 15 0.00 (vs. 2)
5 (VZ,WZ) 154.7 12 0.00 (vs. 2)
We conclude that model 2 is appropriate and hence that all pairs of variables are conditionally independent. Another
way of saying this is that all variables are associated in a pairwise fashion, but this degree of association does not
depend on the level of the third variable.

Fig. 3.2. Plot of Residuals by Fitted Values for Models 2, 3, 4, and 5 of Suicide Data
[Figure not reproduced: four panels titled Model 2, Model 3, Model 4 and Model 5, each plotting DEVIANCE RESIDUALS (vertical axis, −4 to 4) against FITTED VALUES (horizontal axis).]
3.7.2 Seatbelt Use and Fatality of Accidents
We now consider the special case of a 2 × 2 × 2 table. That is, where each variable has only two levels. For illustration
we will discuss the data in the following table from the Florida Department of Highway Safety and Motor vehicles
(Bishop et al., 1975) which summarizes three variables pertaining to car accidents. Recorded are whether a seat-belt
was used, whether the victim was ejected from the car, and whether the injury was fatal.
                               Injury (Z)
Seatbelt (V )   Ejected (W )   Non-fatal   Fatal
Used            Yes                 1105      14
                No               411,111     483
Not Used        Yes                 4624     497
                No               157,342    1008
Rewriting this table in general notation we have
                                      Injury (Z)
Seatbelt (V )      Ejected (W )   Non-fatal (k = 1)   Fatal (k = 2)
Used (i = 2)       Yes (j = 2)    y221                y222
                   No (j = 1)     y211                y212
Not Used (i = 1)   Yes (j = 2)    y121                y122
                   No (j = 1)     y111                y112
where Yijk ∼ Poisson(µijk ), i = 1, 2, ..., I, j = 1, 2, ..., J, k = 1, 2, ..., K. The full model is
\[ \log \mu_{ijk} = u + u_i^V + u_j^W + u_k^Z + u_{ij}^{VW} + u_{ik}^{VZ} + u_{jk}^{WZ} + u_{ijk}^{VWZ}, \]

but we are interested in identifying simpler models which still fit the data well and are easy to interpret. Note that
in this table only y... is fixed and so only the intercept needs to be included by design.
seat eject injury y
1 1 1 157342
1 1 2 1008
1 2 1 4624
1 2 2 497
2 1 1 411111
2 1 2 483
2 2 1 1105
2 2 2 14
acc.dat_read.table("acc.dat", header=T)
acc.dat$seatf_factor(acc.dat$seat)
acc.dat$s_C(acc.dat$seatf,treatment)
acc.dat$ejectf_factor(acc.dat$eject)
acc.dat$e_C(acc.dat$ejectf,treatment)
acc.dat$injuryf_factor(acc.dat$injury)
acc.dat$i_C(acc.dat$injuryf,treatment)
acc.dat
model1_glm(y ~s*e*i , family=poisson,data=acc.dat); summary(model1)
model2_glm(y ~s*e+s*i+e*i , family=poisson,data=acc.dat); summary(model2)

acc.dat$fv_model2$fitted.values
acc.dat
model3_glm(y ~s*e+s*i+i , family=poisson,data=acc.dat); summary(model3)

model4_glm(y ~s*e+e*i+e, family=poisson,data=acc.dat); summary(model4)

model5_glm(y ~s*i+e*i+i, family=poisson,data=acc.dat); summary(model5)


> model1_glm(y ~s*e*i , family=poisson,data=acc.dat); summary(model1)
>
Call: glm(formula = y ~ s * e * i, family = poisson, data = acc.dat)
Coefficients:
(Intercept) 11.9661771 0.002521028 4746.54670
s 0.9604415 0.002964459 323.98545
e -3.5271616 0.014920407 -236.39848
i -5.0504536 0.031597770 -159.83576
s:e -2.3918563 0.033615895 -71.15254
s:i -1.6961483 0.055418813 -30.60600
e:i 2.8200282 0.056804529 49.64443
s:e:i -0.4419696 0.278627222 -1.58624
Null Deviance: 1624865 on 7 degrees of freedom
Residual Deviance: 0 on 0 degrees of freedom

(Intercept) s e i s:e s:i
s -0.8504177
e -0.1689651 0.1436909
i -0.0797850 0.0678506 0.0134809
s:e 0.0749951 -0.0881862 -0.4438498 -0.0059835
s:i 0.0454905 -0.0534919 -0.0076863 -0.5701632 0.0047173
e:i 0.0443808 -0.0377422 -0.2626623 -0.5562544 0.1165826 0.3171558
s:e:i -0.0090480 0.0106395 0.0535497 0.1134052 -0.1206483 -0.1988995
e:i
s
e
i
s:e
s:i
e:i
s:e:i -0.2038729
---------------------------------------------------------------------------
> model2_glm(y ~s*e+s*i+e*i , family=poisson,data=acc.dat); summary(model2)

>
Call: glm(formula = y ~ s * e + s * i + e * i, family = poisson, data = acc.dat)
Coefficients:
(Intercept) 11.9661334 0.002520935 4746.70421
s 0.9605018 0.002964258 324.02772
e -3.5256338 0.014879047 -236.95293
i -5.0436195 0.031201820 -161.64504
s:e -2.3996357 0.033340295 -71.97404
s:i -1.7173207 0.054015372 -31.79319
e:i 2.7977945 0.055255851 50.63345
(Intercept) s e i s:e s:i
s -0.8504057
e -0.1687532 0.1432888
i -0.0793250 0.0669801 0.0049262
s:e 0.0740365 -0.0870602 -0.4387264 0.0138669
s:i 0.0440104 -0.0517522 0.0078886 -0.5548114 -0.0311263
e:i 0.0428931 -0.0355879 -0.2541765 -0.5407262 0.0837227 0.2701176

[1] 0.09114565
> acc.dat$fv_model2$fitted.values
> acc.dat
seat eject injury y seatf s ejectf e injuryf i fv
1 1 1 1 157342 1 1 1 1 1 1 157335.13193
2 1 1 2 1008 1 1 1 1 2 2 1014.86807
3 1 2 1 4624 1 1 2 2 1 1 4630.86807
4 1 2 2 497 1 1 2 2 2 2 490.13193
5 2 1 1 411111 2 2 1 1 1 1 411117.86807
6 2 1 2 483 2 2 1 1 2 2 476.13193

7 2 2 1 1105 2 2 2 2 1 1 1098.13193
8 2 2 2 14 2 2 2 2 2 2 20.86807
---------------------------------------------------------------------------
> model3_glm(y ~s*e+s*i+i , family=poisson,data=acc.dat); summary(model3)
Call: glm(formula = y ~ s * e + s * i + i, family = poisson, data = acc.dat)
Coefficients:
(Intercept) 11.9633139 0.002524274 4739.30924
s 0.9632739 0.002967231 324.63734
e -3.4314580 0.014197737 -241.69050
i -4.6785782 0.025825428 -181.16169
s:e -2.4761440 0.033130957 -74.73807
s:i -2.0421345 0.051782723 -39.43660
(Intercept) s e i s:e
s -0.8507171
e -0.1761993 0.1498958
i -0.0947092 0.0805707 0.0000000
s:e 0.0755074 -0.0889496 -0.4285338 0.0000000
s:i 0.0472340 -0.0559712 0.0000000 -0.4987267 0.0000000

[1] 0
>
---------------------------------------------------------------------------
> model4_glm(y ~s*e+e*i+e, family=poisson,data=acc.dat); summary(model4)

>
Call: glm(formula = y ~ s * e + e * i + e, family = poisson, data = acc.dat)
Coefficients:
(Intercept) 11.9699436 0.002513903 4761.49686
s 0.9552297 0.002957139 323.02491
e -3.5142778 0.014693133 -239.17825
i -5.9434707 0.025914304 -229.35097
s:e -2.4761440 0.033131101 -74.73775
e:i 3.5265440 0.052943310 66.60981
(Intercept) s e i s:e
s -0.8494934
e -0.1710938 0.1453430

i -0.0270033 0.0000000 0.0046201

s:e 0.0758221 -0.0892557 -0.4141109 0.0000000
e:i 0.0132174 0.0000000 -0.2266473 -0.4894727 0.0000000

[1] 0
---------------------------------------------------------------------------
> model5_glm(y ~s*i+e*i+i, family=poisson,data=acc.dat); summary(model5)

>
Call: glm(formula = y ~ s * i + e * i + i, family = poisson, data = acc.dat)
Coefficients:
(Intercept) 11.985114 0.002488217 4816.74862
s 0.934161 0.002932430 318.56213
i -4.963265 0.029014282 -171.06284
e -4.597323 0.013209434 -348.03329
s:i -2.042119 0.051803432 -39.42053
e:i 3.526490 0.052921727 66.63597
(Intercept) s i e s:i
s -0.8460864
i -0.0857583 0.0725590
e -0.0535221 0.0000000 0.0045900
s:i 0.0478943 -0.0566069 -0.4461379 0.0000000
e:i 0.0133593 0.0000000 -0.4378953 -0.2496032 0.0000000

[1] 0

To summarize the models fit so far, consider the following analysis of deviance table.
Model              Deviance   Residual d.f.   p-value
(VWZ)                     0               0   NA
(VW, VZ, WZ) †         2.85               1   0.09
(VW, VZ)            1680.41               1   <0.001
(VW, WZ)            1144.64               1   <0.001
(VZ, WZ)            7133.98               1   <0.001
† reference model for deviance tests for last three models.
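The fitted values for the reference model (VW, VZ, WZ) have no closed form, but they can be approximated without glm. The Python sketch below uses iterative proportional fitting, which is a different algorithm from the IRWLS used by Splus but converges to the same maximum likelihood fitted values for hierarchical log-linear models. The function name `ipf_no_three_way`, the iteration limit and the tolerance are assumptions made for illustration.

```python
import math

# y[i][j][k]: i = seatbelt (0 = not used, 1 = used), j = ejected (0 = no, 1 = yes),
# k = injury (0 = non-fatal, 1 = fatal); counts copied from the data listing.
y = [[[157342, 1008], [4624, 497]],
     [[411111, 483], [1105, 14]]]

def ipf_no_three_way(y, max_iter=10000, tol=1e-7):
    """Fit the (VW, VZ, WZ) model by cycling through the three two-way margins."""
    I, J, K = len(y), len(y[0]), len(y[0][0])
    m = [[[1.0] * K for _ in range(J)] for _ in range(I)]
    for _ in range(max_iter):
        old = [[row[:] for row in plane] for plane in m]
        for i in range(I):                      # match the VW margin
            for j in range(J):
                scale = sum(y[i][j]) / sum(m[i][j])
                for k in range(K):
                    m[i][j][k] *= scale
        for i in range(I):                      # match the VZ margin
            for k in range(K):
                scale = (sum(y[i][j][k] for j in range(J))
                         / sum(m[i][j][k] for j in range(J)))
                for j in range(J):
                    m[i][j][k] *= scale
        for j in range(J):                      # match the WZ margin
            for k in range(K):
                scale = (sum(y[i][j][k] for i in range(I))
                         / sum(m[i][j][k] for i in range(I)))
                for i in range(I):
                    m[i][j][k] *= scale
        change = max(abs(m[i][j][k] - old[i][j][k])
                     for i in range(I) for j in range(J) for k in range(K))
        if change < tol:
            break
    return m

fit = ipf_no_three_way(y)
cells = [(a, b, c) for a in range(2) for b in range(2) for c in range(2)]
dev = 2.0 * sum(y[a][b][c] * math.log(y[a][b][c] / fit[a][b][c]) for a, b, c in cells)
p_value = math.erfc(math.sqrt(dev / 2.0))       # chi-squared tail area on 1 d.f.
```

Under these assumptions `fit` should agree closely with the `fv` column of the Splus output, and `dev` and `p_value` with the second row of the analysis of deviance table.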

We therefore select (V W, V Z, W Z) as our final model. We seek an easy interpretation to the association. Consider
the following table of fitted and observed values from this model
                               Injury (Z)
Seatbelt (V )   Ejected (W )   Non-Fatal                              Fatal
Used            Yes            µ̂221 = 1098.1,    y221 = 1105         µ̂222 = 20.9,    y222 = 14
                No             µ̂211 = 411,117.9, y211 = 411,111      µ̂212 = 476.1,   y212 = 483
Not Used        Yes            µ̂121 = 4630.9,    y121 = 4624         µ̂122 = 490.1,   y122 = 497
                No             µ̂111 = 157,335.1, y111 = 157,342      µ̂112 = 1014.9,  y112 = 1008
and the following row percentages of the row totals.
                               Injury (Z)
Seatbelt (V)  Ejected (W)  Non-Fatal  Fatal
Used          Yes             98.7      1.3
              No              99.9      0.1
Not Used      Yes             90.4      9.6
              No              99.4      0.6
Note that we can interpret log(µ_{112}/µ_{111}) as a log odds, since it is equal to log(π_{112}/π_{111}), where
π_{111} = 1 − π_{112} = µ_{111}/µ_{11·}. Under the corner-point constraints every term with a subscript equal to 1 is zero, so
log µ_{112} = u + u^V_1 + u^W_1 + u^Z_2 + u^{VW}_{11} + u^{VZ}_{12} + u^{WZ}_{12} = u + u^Z_2
log µ_{111} = u + u^V_1 + u^W_1 + u^Z_1 + u^{VW}_{11} + u^{VZ}_{11} + u^{WZ}_{11} = u
log(µ_{112}/µ_{111}) = u^Z_2 .
Therefore u^Z_2 is the log odds of a fatal accident in the last row of the table (i.e. those not using seatbelts who were
not ejected). In the third row we have
log µ_{122} = u + u^V_1 + u^W_2 + u^Z_2 + u^{VW}_{12} + u^{VZ}_{12} + u^{WZ}_{22} = u + u^W_2 + u^Z_2 + u^{WZ}_{22}
log µ_{121} = u + u^V_1 + u^W_2 + u^Z_1 + u^{VW}_{12} + u^{VZ}_{11} + u^{WZ}_{21} = u + u^W_2
log(µ_{122}/µ_{121}) = u^Z_2 + u^{WZ}_{22} .
Therefore u^Z_2 + u^{WZ}_{22} is the log odds of a fatality among those who were not wearing a seatbelt and were ejected.
This implies that
log(µ_{122}/µ_{121}) − log(µ_{112}/µ_{111}) = u^{WZ}_{22}
is a log odds ratio! Specifically, it is the log odds ratio comparing the odds of a fatal accident for those ejected
versus those not ejected, among passengers who did not use their seatbelt.
Identical calculations for the top half of the table would reveal that this is also the log odds ratio reflecting
the change in the odds of a fatal accident for ejected versus unejected passengers who did use their seatbelt. Suppose,
however, that the three-way term u^{VWZ}_{ijk} was in the model. In this case we would find
log(µ_{222}/µ_{221}) = u^Z_2 + u^{VZ}_{22} + u^{WZ}_{22} + u^{VWZ}_{222}
and, since log(µ_{212}/µ_{211}) = u^Z_2 + u^{VZ}_{22}, the log odds ratio would be
log(µ_{222}/µ_{221}) − log(µ_{212}/µ_{211}) = u^{WZ}_{22} + u^{VWZ}_{222} .
This u^{VWZ}_{ijk} plays the role of an interaction term, allowing the effect of one variable on another to vary with
the levels of the third.
We may construct several odds ratios for this data set for the saturated model.
                                        Odds Ratios
Outcome  Comparison        At     Form                               Value
Z = 2    W = 2 vs. W = 1   V = 1  exp(u^{WZ}_{22})                   exp(2.80) = 16.4
Z = 2    W = 2 vs. W = 1   V = 2  exp(u^{WZ}_{22} + u^{VWZ}_{222})   exp(2.80 + 0) = 16.4
Z = 2    V = 2 vs. V = 1   W = 1  exp(u^{VZ}_{22})                   exp(−1.72) = 0.18
Z = 2    V = 2 vs. V = 1   W = 2  exp(u^{VZ}_{22} + u^{VWZ}_{222})   exp(−1.72 + 0) = 0.18
W = 2    V = 2 vs. V = 1   Z = 1  exp(u^{VW}_{22})                   exp(−2.40) = 0.09
W = 2    V = 2 vs. V = 1   Z = 2  exp(u^{VW}_{22} + u^{VWZ}_{222})   exp(−2.40 + 0) = 0.09
These odds ratios make sense since they suggest the relative odds of a fatality among those ejected compared to
those not ejected is 16.4, the relative odds of fatality among those using a seatbelt compared to those who do not
use a seatbelt is 0.18, and the relative odds of ejection for those using a seatbelt compared to those who do not is
0.09. The fact that we could not reduce the model further means these terms are all significant.
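As a numerical check, the same odds ratios can be recovered as cross-product ratios of the fitted counts reported earlier for the (VW, VZ, WZ) model. The sketch below is in Python rather than S-Plus; the dictionary layout and the helper name odds_ratio are assumptions made for illustration. Because the model has no three-way term, each odds ratio should be the same at both levels of the remaining variable, up to rounding of the fitted values.

```python
# Fitted counts from the (VW, VZ, WZ) model, keyed by (v, w, z):
# v: 1 = seatbelt not used, 2 = used; w: 1 = not ejected, 2 = ejected;
# z: 1 = non-fatal, 2 = fatal.
mu = {
    (1, 1, 1): 157335.1, (1, 1, 2): 1014.9,
    (1, 2, 1): 4630.9,   (1, 2, 2): 490.1,
    (2, 1, 1): 411117.9, (2, 1, 2): 476.1,
    (2, 2, 1): 1098.1,   (2, 2, 2): 20.9,
}

def odds_ratio(a, b, c, d):
    """Cross-product ratio (a * d) / (b * c) of the 2 x 2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)

# Ejection (W) by fatality (Z), within each seatbelt stratum v.
or_wz = {v: odds_ratio(mu[v, 1, 1], mu[v, 1, 2], mu[v, 2, 1], mu[v, 2, 2])
         for v in (1, 2)}
# Seatbelt (V) by fatality (Z), within each ejection stratum w.
or_vz = {w: odds_ratio(mu[1, w, 1], mu[1, w, 2], mu[2, w, 1], mu[2, w, 2])
         for w in (1, 2)}
# Seatbelt (V) by ejection (W), within each injury stratum z.
or_vw = {z: odds_ratio(mu[1, 1, z], mu[1, 2, z], mu[2, 1, z], mu[2, 2, z])
         for z in (1, 2)}

for name, ratios in (("WZ", or_wz), ("VZ", or_vz), ("VW", or_vw)):
    print(name, {level: round(value, 3) for level, value in ratios.items()})
```

The three pairs of values agree with 16.4, 0.18 and 0.09 to the precision shown in the table.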
Since the odds ratios relating to fatality could have been obtained from a logistic model, it is natural to ask what
we have gained here. The gain is that we are able to examine the relationships among all of the variables, including
the one between V and W.

3.7.3 Methods of Suicide - Continued
Here we revisit the analysis of the associations between age, sex and method of suicide. Recall the data was provided
as follows.
                                        Cause of Death
Sex     Age (yrs)  Solid/Liquid  Gas  Respiratory  Weapons  Jumping  Other
Male    10-40           398      121      455        155       55     124
Male    40-70           399       82      797        168       51      82
Male    > 70             93        6      316         33       26      14
Female  10-40           259       15       95         14       40      38
Female  40-70           450       13      450         26       71      60
Female  > 70            154        5      185          7       38      10
Here we carry out an analysis to investigate the association between sex, age and method of suicide. Let V denote
sex, W denote age, and Z denote the method of suicide. We know the log-linear model
log µ_{ijk} = u + u^V_i + u^W_j + u^Z_k + u^{VW}_{ij} + u^{VZ}_{ik} + u^{WZ}_{jk} + u^{VWZ}_{ijk}
will provide a perfect fit to the data (since it is a saturated model). We seek to find a simpler model which describes
the data well. In other words, we are looking for a simpler representation of the relationship between the sex, age
and method of suicide variables. Our previous analyses are summarized in the following analysis of deviance table.
Model Form Residual Deviance Residual d.f. p−value

1 (VWZ) 0 0 NA
2 (VW,VZ,WZ) 14.9 10 0.14 (vs. 1)
3 (VW,VZ) 293.2 20 <0.001 (vs. 2)
4 (VW,WZ) 389.0 15 <0.001 (vs. 2)
5 (VZ,WZ) 154.7 12 <0.001 (vs. 2)
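The residual degrees of freedom and the p-value for Model 2 in this table can be reproduced with a short calculation. The Python sketch below is illustrative only: the dictionary names and the helper chi2_sf_even are assumptions, and the tail probability uses the closed form that holds for a chi-square variable with an even number of degrees of freedom.

```python
import math

I, J, K = 2, 3, 6  # levels of sex (V), age (W) and method of suicide (Z)

# Free parameters contributed by each term under corner-point constraints.
n_par = {
    "1": 1, "V": I - 1, "W": J - 1, "Z": K - 1,
    "VW": (I - 1) * (J - 1), "VZ": (I - 1) * (K - 1), "WZ": (J - 1) * (K - 1),
    "VWZ": (I - 1) * (J - 1) * (K - 1),
}
main = ["1", "V", "W", "Z"]
models = {
    "(VWZ)": main + ["VW", "VZ", "WZ", "VWZ"],
    "(VW,VZ,WZ)": main + ["VW", "VZ", "WZ"],
    "(VW,VZ)": main + ["VW", "VZ"],
    "(VW,WZ)": main + ["VW", "WZ"],
    "(VZ,WZ)": main + ["VZ", "WZ"],
}
# Residual d.f. = number of cells minus number of free parameters.
resid_df = {name: I * J * K - sum(n_par[t] for t in terms)
            for name, terms in models.items()}

def chi2_sf_even(x, df):
    """P(X > x) for a chi-square variable with an even, positive number of d.f."""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k      # (x/2)^k / k!
        total += term
    return math.exp(-half) * total

# Model 2 against the saturated model: deviance 14.9 on 10 d.f.
p_model2 = chi2_sf_even(14.9, resid_df["(VW,VZ,WZ)"])
print(resid_df, round(p_model2, 2))
```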
We found that the model with all two-way interactions provided a good fit, but that no simpler model did. The final
model is therefore
log µ_{ijk} = u + u^V_i + u^W_j + u^Z_k + u^{VW}_{ij} + u^{VZ}_{ik} + u^{WZ}_{jk} .
We conclude that no pair of variables is conditionally independent. Each pair of variables is associated given the
third, but the nature of the association does not depend on the level of the remaining variable. A partial data file,
some of the code, and more detailed results for the final model are provided below.
sex age method y
1 1 1 398
1 1 2 121
1 1 3 455
1 1 4 155
1 1 5 55
1 1 6 124
1 2 1 399
1 2 2 82
1 2 3 797
1 2 4 168
1 2 5 51
1 2 6 82
| | | |
| | | |
2 3 1 154
2 3 2 5
2 3 3 185
2 3 4 7
2 3 5 38
2 3 6 10
> summary(model2)
Call: glm(formula = y ~ sexft + ageft + methodft + sexft * ageft + sexft * methodft + ageft * methodft, family = poisson)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.536761 -0.3189374 -0.0009647755 0.3890904 1.522563
Coefficients:
Value Std. Error t value
(Intercept) 6.01823752 0.04598603 130.8710083
sexft -0.51231200 0.06497888 -7.8842854
ageft2 -0.07953488 0.06145781 -1.2941379
ageft3 -1.41603348 0.08944448 -15.8314247
methodft2 -1.20849865 0.09848962 -12.2703143
methodft3 0.06677238 0.06149132 1.0858831
methodft4 -0.96599088 0.08980039 -10.7570898
methodft5 -1.97833582 0.12157446 -16.2726262
methodft6 -1.21398467 0.09453852 -12.8411643
sexftageft2 0.72540047 0.07057738 10.2780866
sexftageft3 0.90255331 0.09146294 9.8679675
sexftmethodft2 -1.70963024 0.19501227 -8.7667829
sexftmethodft3 -0.86518923 0.06773258 -12.7736045
sexftmethodft4 -2.00412781 0.16368076 -12.2441259
sexftmethodft5 0.11470230 0.13117469 0.8744240
sexftmethodft6 -0.60376876 0.12897160 -4.6814085
ageft2methodft2 -0.37837205 0.14627609 -2.5866979
ageft3methodft2 -1.23265437 0.32332905 -3.8123836
ageft2methodft3 0.70368563 0.07520193 9.3572818
ageft3methodft3 1.06402062 0.09994185 10.6463976
ageft2methodft4 0.14089281 0.12072263 1.1670787
ageft3methodft4 -0.12891521 0.19514619 -0.6606084
ageft2methodft5 -0.02675948 0.14825462 -0.1804968
ageft3methodft5 0.55785782 0.18046659 3.0911972
ageft2methodft6 -0.28565673 0.12815869 -2.2289298
ageft3methodft6 -0.80223746 0.23289228 -3.4446717
(Intercept) sexft ageft2 ageft3 methodft2
sexft -0.5293849
ageft2 -0.6545075 0.2190292
ageft3 -0.4394480 0.1311005 0.3888462
methodft2 -0.4533271 0.2215140 0.2858969 0.1943518
methodft3 -0.6841670 0.2756100 0.4034836 0.2612433 0.3162147
methodft4 -0.4979142 0.2443126 0.3151823 0.2122326 0.2270934
methodft5 -0.3263533 0.1022024 0.1786165 0.1083398 0.1528525
methodft6 -0.4508464 0.1902968 0.2674112 0.1837221 0.2078568
sexftageft2 0.3397924 -0.6418626 -0.5140561 -0.2152121 -0.1200132
sexftageft3 0.2536292 -0.4791016 -0.2414450 -0.5326448 -0.0938820
sexftmethodft2 0.1062991 -0.2007973 0.0286658 0.0122047 -0.2304799
sexftmethodft3 0.2208293 -0.4171432 0.1774634 0.1780183 -0.1097188
sexftmethodft4 0.1060513 -0.2003292 0.0598045 0.0419839 -0.0508429
sexftmethodft5 0.1425449 -0.2692651 0.0505203 0.0685465 -0.0667879
sexftmethodft6 0.1610142 -0.3041534 0.0410371 0.0231834 -0.0740126
ageft2methodft2 0.2529359 -0.0503634 -0.3862841 -0.1507102 -0.6020723
ageft3methodft2 0.1112080 -0.0166989 -0.1000918 -0.2486458 -0.2683474
ageft2methodft3 0.4621401 -0.0415795 -0.7012516 -0.2872079 -0.2081306
ageft3methodft3 0.3396993 -0.0160971 -0.3110943 -0.7455278 -0.1536745
ageft2methodft4 0.3144334 -0.0760570 -0.4797001 -0.1886685 -0.1389475
ageft3methodft4 0.1896557 -0.0378682 -0.1706784 -0.4240949 -0.0846157
ageft2methodft5 0.1962951 0.0509259 -0.2975391 -0.1228269 -0.0921233
ageft3methodft5 0.1569320 0.0500076 -0.1420419 -0.3487947 -0.0735750
ageft2methodft6 0.2550316 0.0061021 -0.3891925 -0.1527229 -0.1163611
ageft3methodft6 0.1359793 0.0115986 -0.1237458 -0.3004775 -0.0623334
methodft3 methodft4 methodft5 methodft6 sexftageft2
sexft
ageft2
ageft3
methodft2
methodft3
methodft4 0.3470428
methodft5 0.2470905 0.1676903
methodft6 0.3243994 0.2280788 0.1600694
sexftageft2 -0.0830023 -0.1345844 0.0091555 -0.0650899
sexftageft3 -0.0473855 -0.1018097 0.0248777 -0.0563644 0.5224432
sexftmethodft2 -0.0751623 -0.0536014 -0.0365003 -0.0497402 0.0145057
sexftmethodft3 -0.2982165 -0.1202488 -0.1116984 -0.1250255 -0.1555133
sexftmethodft4 -0.0850828 -0.1796592 -0.0447405 -0.0550151 -0.0346600
sexftmethodft5 -0.1095480 -0.0734059 -0.4296993 -0.0701315 0.0004322
sexftmethodft6 -0.1139113 -0.0811430 -0.0554210 -0.3633460 0.0257253
ageft2methodft2 -0.1640848 -0.1236929 -0.0755765 -0.1081552 0.1502869
ageft3methodft2 -0.0705781 -0.0545302 -0.0312892 -0.0483916 0.0419511
ageft2methodft3 -0.6919440 -0.2288213 -0.1471802 -0.2049272 0.1966763
ageft3methodft3 -0.5088548 -0.1682945 -0.1040063 -0.1517437 0.1044275
ageft2methodft4 -0.2007209 -0.6873745 -0.0913101 -0.1325942 0.2049072
ageft3methodft4 -0.1179159 -0.4206746 -0.0512740 -0.0813944 0.0801582
ageft2methodft5 -0.1485757 -0.1009736 -0.6004452 -0.0966729 -0.0132118
ageft3methodft5 -0.1188513 -0.0806851 -0.4802351 -0.0771332 -0.0072822
ageft2methodft6 -0.1789791 -0.1278539 -0.0870692 -0.5810417 0.0712485
ageft3methodft6 -0.0947820 -0.0682922 -0.0455391 -0.3113944 0.0239577
sexftageft3 sexftmethodft2 sexftmethodft3 sexftmethodft4
sexft
ageft2
ageft3
methodft2
methodft3
methodft4
methodft5
methodft6
sexftageft2
sexftageft3
sexftmethodft2 0.0330197
sexftmethodft3 -0.1817508 0.1731028
sexftmethodft4 -0.0159376 0.0735935 0.2182056
sexftmethodft5 -0.0412894 0.0909166 0.2715988 0.1099042
sexftmethodft6 0.0423141 0.0953244 0.2624777 0.1111174
ageft2methodft2 0.0696734 -0.1193198 -0.0646977 -0.0228423
ageft3methodft2 0.0962804 -0.0690603 -0.0407222 -0.0111298
ageft2methodft3 0.1108900 -0.0256074 -0.1712706 -0.0410400
ageft3methodft3 0.2049051 -0.0201005 -0.1695428 -0.0351916
ageft2methodft4 0.0983250 -0.0152393 -0.0822141 -0.1127032
ageft3methodft4 0.1820997 -0.0076950 -0.0714789 -0.0885246
ageft2methodft5 -0.0002573 -0.0145400 -0.0408446 -0.0168942
ageft3methodft5 -0.0071624 -0.0151072 -0.0412567 -0.0174907
ageft2methodft6 0.0310918 -0.0160483 -0.0587959 -0.0225838
ageft3methodft6 0.0511431 -0.0098344 -0.0422720 -0.0144563
sexftmethodft5 sexftmethodft6 ageft2methodft2 ageft3methodft2
sexft
ageft2
ageft3
methodft2
methodft3
methodft4
methodft5
methodft6
sexftageft2
sexftageft3
sexftmethodft2
sexftmethodft3
sexftmethodft4
sexftmethodft5
sexftmethodft6 0.1380205
ageft2methodft2 -0.0214002 -0.0188304
ageft3methodft2 -0.0165786 -0.0085290 0.2022833
ageft2methodft3 -0.0431139 -0.0380815 0.2817386 0.0755249
ageft3methodft3 -0.0484958 -0.0319812 0.1243735 0.1957429
ageft2methodft4 -0.0260336 -0.0221505 0.1880540 0.0490921
ageft3methodft4 -0.0284351 -0.0131817 0.0669367 0.1078330
ageft2methodft5 -0.1585728 -0.0221231 0.1259021 0.0338279
ageft3methodft5 -0.1606040 -0.0228262 0.0601454 0.0968107
ageft2methodft6 -0.0246950 -0.1590233 0.1588169 0.0413490
ageft3methodft6 -0.0193417 -0.1101573 0.0505534 0.0804766
ageft2methodft3 ageft3methodft3 ageft2methodft4
sexft
ageft2
ageft3
methodft2
methodft3
methodft4
methodft5
methodft6
sexftageft2
sexftageft3
sexftmethodft2
sexftmethodft3
sexftmethodft4
sexftmethodft5
sexftmethodft6
ageft2methodft2
ageft3methodft2
ageft2methodft3
ageft3methodft3 0.4860919
ageft2methodft4 0.3458827 0.1534926
ageft3methodft4 0.1274191 0.3289479 0.3409389
ageft2methodft5 0.2463889 0.1090278 0.1522688
ageft3methodft5 0.1175691 0.3138438 0.0727012
ageft2methodft6 0.3018880 0.1323026 0.1940422
ageft3methodft6 0.0973867 0.2548121 0.0619062
ageft3methodft4 ageft2methodft5 ageft3methodft5
sexft
ageft2
ageft3
methodft2
methodft3
methodft4
methodft5
methodft6
sexftageft2
sexftageft3
sexftmethodft2
sexftmethodft3
sexftmethodft4
sexftmethodft5
sexftmethodft6
ageft2methodft2
ageft3methodft2
ageft2methodft3
ageft3methodft3
ageft2methodft4
ageft3methodft4
ageft3methodft5 0.1602452 0.4758838
ageft2methodft6 0.0691248 0.1450617 0.0693600
ageft3methodft6 0.1344971 0.0468263 0.1349523
ageft2methodft6
sexft
ageft2
ageft3
methodft2
methodft3
methodft4
methodft5
methodft6
sexftageft2
sexftageft3
sexftmethodft2
sexftmethodft3
sexftmethodft4
sexftmethodft5
sexftmethodft6
ageft2methodft2
ageft3methodft2
ageft2methodft3
ageft3methodft3
ageft2methodft4
ageft3methodft4
ageft2methodft5
ageft3methodft5
ageft2methodft6

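Splus code for a sample question. The objects tmp and v are not defined in this excerpt; judging from the output,
tmp is presumably summary(model2) and v is the 26 × 26 covariance matrix of the coefficient estimates.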

> as.vector(model2$coefficients)
[1] 6.01823752 -0.51231200 -0.07953488 -1.41603348 -1.20849865 0.06677238
[7] -0.96599088 -1.97833582 -1.21398467 0.72540047 0.90255331 -1.70963024
[13] -0.86518923 -2.00412781 0.11470230 -0.60376876 -0.37837205 -1.23265437
[19] 0.70368563 1.06402062 0.14089281 -0.12891521 -0.02675948 0.55785782
[25] -0.28565673 -0.80223746
> names(tmp)
[1] "call" "terms" "coefficients" "dispersion"
[5] "df" "deviance.resid" "cov.unscaled" "correlation"
[9] "deviance" "null.deviance" "iter" "nas"
> dim(v)
[1] 26 26
> x_c(0,0,0,0,0,0,-1,1,0,0,0,0,0,0,0,0,0,0,0,0,-1,0,1,0,0,0)
> x
[,1]
[1,] 0
[2,] 0
[3,] 0
[4,] 0
[5,] 0
[6,] 0
[7,] -1
[8,] 1
[9,] 0
[10,] 0
[11,] 0
[12,] 0
[13,] 0
[14,] 0
[15,] 0
[16,] 0
[17,] 0
[18,] 0
[19,] 0
[20,] 0
[21,] -1
[22,] 0
[23,] 1
[24,] 0
[25,] 0
[26,] 0
> t(x)%*%as.vector(model2$coefficients)
[,1]
[1,] -1.179997
> sqrt(t(x)%*%v%*%x)
[,1]
[1,] 0.1382256
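The contrast above can be reproduced outside S-Plus. The Python sketch below recomputes x′β̂ from the printed coefficients and forms an approximate 95% Wald interval from the printed standard error. Two points are assumptions rather than facts from the session: the interpretation of the contrast (read off the coefficient names, with treatment coding assumed) and the use of 1.96 as the normal quantile. The covariance matrix v is not reproduced here, so the standard error is copied from the output.

```python
import math

# Coefficient estimates as printed by as.vector(model2$coefficients).
beta = [
    6.01823752, -0.51231200, -0.07953488, -1.41603348, -1.20849865, 0.06677238,
    -0.96599088, -1.97833582, -1.21398467, 0.72540047, 0.90255331, -1.70963024,
    -0.86518923, -2.00412781, 0.11470230, -0.60376876, -0.37837205, -1.23265437,
    0.70368563, 1.06402062, 0.14089281, -0.12891521, -0.02675948, 0.55785782,
    -0.28565673, -0.80223746,
]

# Contrast vector from the session: 1-based positions 7 and 21 are -1
# (methodft4, ageft2methodft4); positions 8 and 23 are +1 (methodft5, ageft2methodft5).
x = [0] * 26
for pos, weight in ((7, -1), (8, 1), (21, -1), (23, 1)):
    x[pos - 1] = weight

# Assuming treatment coding, this is log(mu_{125} / mu_{124}): method 5 versus
# method 4 at sex level 1 and age level 2.
estimate = sum(xi * bi for xi, bi in zip(x, beta))

se = 0.1382256   # sqrt(x' v x), copied from the session output
z = 1.96         # approximate 97.5% standard normal quantile (assumption)
lower, upper = estimate - z * se, estimate + z * se

ratio = math.exp(estimate)
ratio_ci = (math.exp(lower), math.exp(upper))
print(round(estimate, 6), round(ratio, 3), tuple(round(c, 3) for c in ratio_ci))
```

Exponentiating the estimate and the interval limits expresses the contrast as a ratio of expected counts.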
Questions
Problem 3.1.
Consider the following data on newly diagnosed lung cancer cases occurring in four Danish cities between 1968 and
1971.

                                   City
         Fredericia      Horsens         Kolding         Vejle
Age      Cases  Pop.     Cases  Pop.     Cases  Pop.     Cases  Pop.
40-54      11   3059       13   2879        4   3142        5   2520
55-59      11    800        6   1083        8   1050        7    878
60-64      11    710       15    923        7    895       10    839
65-69      10    581       10    834       11    702       14    631
70-74      11    509       12    634        9    535        8    539
> 75       10    605        2    782       12    659        7    619
A Poisson model is often used to analyse data of this sort based on the expected number of cases (recall the Poisson
approximation for the binomial model). Variable population sizes, however, suggest a systematic (non-random) way
in which the number of cases will vary from city to city and across age groups. It is therefore preferable to model
the disease rates. This may be achieved using a Poisson regression model where the log of the population size is used
as an offset term. That is, if µ is the mean number of newly diagnosed lung cancer patients, x is a p × 1 covariate
vector, and β is a p × 1 vector of regression coefficients, then we may specify a model for the mean number of cases
as
log(µ) = x′ β + log(Pop.) .
Model based estimates of the expected number of new cases for a particular covariate configuration may be obtained
as exp{x′ β̂ + log(pop.)}.
a. Compute a model-based estimate of the mean number of newly diagnosed lung cancer patients over this three
year period in Kolding for the 55-59 year old age group.
b. From the fitted model estimate the relative increase in the lung cancer rate for the 70-74 year old age group,
compared to the 40-54 year old age group.
c. Does there appear to be an interaction between the factor variable for age and the factor variable for city?
d. Epidemiologists frequently express disease rates as the number of new cases per hundred thousand people in the
population per year. Compute a model-based estimate of the expected number of newly diagnosed lung cancer
patients per 100,000 person-years of risk in Kolding for the 55-59 year old age group for 1970. Assume that the
disease rates are stable over time.
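The role of the offset can be illustrated with a small Python sketch. The coefficient values below are hypothetical placeholders chosen only to show the arithmetic; they are not estimates from the Danish data. The function names and the three-year divisor (taken from the wording of part a) are likewise assumptions.

```python
import math

def expected_cases(beta, x, pop):
    """Mean count under log(mu) = x'beta + log(pop), i.e. mu = pop * exp(x'beta)."""
    eta = sum(b * xj for b, xj in zip(beta, x))
    return math.exp(eta + math.log(pop))

def rate_per_100k_person_years(beta, x, years):
    """Rate exp(x'beta) per person over the study period, rescaled to
    cases per 100,000 person-years."""
    eta = sum(b * xj for b, xj in zip(beta, x))
    return 1e5 * math.exp(eta) / years

# Hypothetical coefficients (intercept, indicator for one age group); illustration only.
beta_demo = [-5.0, 1.0]
x_demo = [1, 1]

mu_demo = expected_cases(beta_demo, x_demo, pop=1050)
rate_demo = rate_per_100k_person_years(beta_demo, x_demo, years=3)
print(mu_demo, rate_demo)
```

Because the offset enters with a fixed coefficient of one, exp(x′β) is a rate per person, and the population size only rescales it to an expected count.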
Problem 3.2.
In a prospective study subjects were randomly allocated to four different treatment groups and followed over time
to determine whether their health changed over the course of follow-up. One group received a placebo, one group
received drug A, one group received drug B, and one group received both drugs A and B. The responses of all patients
are summarized below.
Randomized Controlled Trial with Multinomial Response
                                        Response
Treatment            Improved (j=1)  No difference (j=2)  Worse (j=3)  Total
Placebo (i=1)              18                17                15        50
Drug A (i=2)               20                10                20        50
Drug B (i=3)               23                10                17        50
Drugs A & B (i=4)          25                13                12        50

Therefore, for this study there is one explanatory variable (treatment) and one trinomial response variable
(improved, no difference, worse). We want to know if the treatments have any effect on the response, or in other words,
if the distribution of responses is the same for each of the four treatment groups.
Let πij denote the probability of outcome j for the ith treatment group, let yij denote the frequency with which
this happens, j = 1, 2, 3, and let mi = yi1 + yi2 + yi3 , i = 1, 2, 3, 4.
a. Write down the likelihood function for the data in the table above under the most general model and give the null
hypothesis of interest (stated above) in terms of the parameters of this model.
b. Give expressions for the maximum likelihood estimate of πij under the general model in (a) and under the null
hypothesis.
c. Give the form of the log-linear model corresponding to the null hypothesis. Define any notation that you introduce.
d. Fit the model in (c) and draw conclusions regarding the presence of any association in this table.
Problem 3.3.
The data below arise from a study investigating the relation between cigarette smoking behaviour, hypertension
status and proteinuria (the presence of protein in urine) among expectant mothers.
Data on Smoking, Hypertension and Proteinuria
                          Hypertension
                     Yes                No
                 Proteinuria        Proteinuria
Cigarettes/day    Yes     No         Yes     No
0                 439   1740         294   5132
1-19              195    811         244   3625
>20                31    154          51    658
a. Find the most appropriate log-linear model to describe the association between these three factors and discuss
your findings. Use odds ratios to describe any associations you find.
b. Suppose proteinuria was a response in a logistic model. Find a suitable logistic model using the covariates
cigarettes/day and hypertension and interpret this model and the associations it implies.
c. Briefly discuss the differences in the two modeling strategies in (a) and (b) and the relation between the findings
in the two analyses.
Problem 3.4.
The first table below presents data collected by a social scientist on inter-generational social mobility in Britain. The
data were collected by randomly sampling males from the general population and assessing their social status and that
of their father. The rating of social status was summarized using five categories, with category 1 being the lowest and
5 the highest. The second table contains data from an analogous Danish study.
a. Write down the likelihood for the data from the British study under the most general multinomial model.
b. Give the form of the corresponding log-linear model you need to fit to assess the hypothesis of perfect social
mobility in Britain (define any notation you introduce). Fit the model and draw conclusions regarding this
hypothesis.
c. By fitting further log-linear models, assess whether the social mobility patterns are the same in Britain and
Denmark.

British Social Mobility Data
                         Son’s Status
Father’s Status      1      2      3      4      5
1                   50     45      8     18      8
2                   28    174     84    154     55
3                   11     78    110    223     96
4                   14    150    185    714    447
5                    0     42     72    320    411

Danish Social Mobility Data
                         Son’s Status
Father’s Status      1      2      3      4      5
1                   18     17     16      4      1
2                   24    105    109     59     21
3                   23     84    289    217     95
4                    8     49    175    348    198
5                    6      8     69    201    246
Problem 3.5.
The table below presents data collected by social scientists on inter-generational social mobility in Britain and
Denmark. The data were collected by randomly sampling males from the general populations, and assessing their
social status and that of their father. The rating of social status presented here was obtained by collapsing the
classifications on the original 5-point scale to give a simple designation of lower versus middle/upper class.
BRITISH STUDY                               DANISH STUDY
                  Son’s Status                                Son’s Status
Father’s Status   Lower  Mid/Upper          Father’s Status   Lower  Mid/Upper
Lower               297        327          Lower               164        210
Mid/Upper           295       2578          Mid/Upper           178       1838
Suppose that the saturated log-linear model for the full table (using data from both countries) is given by
log µ_{ijk} = u + u^F_i + u^S_j + u^C_k + u^{FS}_{ij} + u^{FC}_{ik} + u^{SC}_{jk} + u^{FSC}_{ijk}
where the superscripts F, S, and C represent father, son and country respectively, and u^F_1 = u^S_1 = u^C_1 = 0,
u^{FS}_{1j} = u^{FS}_{i1} = u^{FC}_{1k} = u^{FC}_{i1} = u^{SC}_{1k} = u^{SC}_{j1} = 0, and u^{FSC}_{1jk} = u^{FSC}_{i1k} = u^{FSC}_{ij1} = 0.
a. Write down the log-linear model you would fit and assess to test the null hypothesis of perfect social mobility
based only on the British data. Be sure to define any terms you introduce and any constraints they might have.
b. Give an explicit expression for the deviance statistic for testing the hypothesis of perfect social mobility in (a)
and state the degrees of freedom.

c. The attached Splus output contains results from fitting several models to the full data set. By comparing the fit of
any relevant models, draw conclusions regarding the nature of social mobility patterns in Britain and Denmark.
Provide estimates of one or more relevant odds ratios, associated 95% confidence intervals, and p-values in your
summary statements.
Problem 3.6.
a. In class it was stated that Poisson regression models may be used to model data in the form of two-way contingency
tables (where one factor variable is cross-tabulated against another). The distributions that we derived for the
contingency tables were either multinomial or product multinomial depending on whether the total of the entire
table was fixed by design, or just the row totals were fixed. Explain carefully why we can fit Poisson regression
models to address questions about multinomial or product multinomial distributions.
b. The following data relates to a flu vaccine trial in which patients were randomized to receive either a vaccine or
placebo treatment. The response of interest is the severity of their flu symptoms over the course of a full season;
this is classified as mild, moderate, or severe.
                    Response
          Mild  Moderate  Severe  Total
Placebo     25         8       5     38
Vaccine      6        18      11     35
Test the hypothesis that the response pattern is the same for the placebo and vaccine groups using the Pearson
chi-square statistic for an r × s contingency table.
c. Analyse the data using log-linear models. Describe and interpret the final model non-technically.
d. For the model corresponding to no difference in the response distribution for placebo and vaccinated individuals,
calculate the deviance residuals. Comment on these residuals and the nature of any lack of fit, if present.
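The mechanics of the Pearson chi-square statistic in part (b) can be sketched as follows. This is a minimal Python illustration of the standard calculation for an r × s table (expected counts from the margins, then the sum of squared standardized differences); the variable names are assumptions, and the closed-form tail probability used here applies only because the statistic has 2 degrees of freedom.

```python
import math

# Observed counts: rows = placebo, vaccine; columns = mild, moderate, severe.
table = [[25, 8, 5],
         [6, 18, 11]]

row_tot = [sum(row) for row in table]
col_tot = [sum(col) for col in zip(*table)]
n = sum(row_tot)

# Expected counts under a common response distribution: (row total * column total) / n.
expected = [[r * c / n for c in col_tot] for r in row_tot]

# Pearson statistic: sum over cells of (observed - expected)^2 / expected.
x2 = sum((obs - exp_) ** 2 / exp_
         for obs_row, exp_row in zip(table, expected)
         for obs, exp_ in zip(obs_row, exp_row))

df = (len(table) - 1) * (len(table[0]) - 1)

# For 2 d.f. the chi-square upper-tail probability is exp(-x / 2).
p_value = math.exp(-x2 / 2.0)
print(round(x2, 2), df, p_value)
```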
Problem 3.7.
The following table gives results of a study examining the relationship between smoking status and a breathing test
result, by age, based on individuals in industrial plants in Houston in 1974 and 1975.
                               Breathing Test Results
Age       Smoking Status    Normal  Not Normal
< 40      Never Smoked         577          34
< 40      Current Smoker       682          57
40 − 59   Never Smoked         164           4
40 − 59   Current Smoker       245          74
Using log-linear models, describe the nature of the association between the three factors of interest.
Problem 3.8.
Suppose a sample of Poisson counts is available. Let f_0, f_1, . . . , f_K denote the observed frequencies of 0, 1, 2, . . . , K
counts respectively, where for convenience if y ≥ K we denote it by y = K. If K is sufficiently large this will lead to
little loss of information. In this notation, the total sample size is given by n = \sum_{k=0}^{K} f_k.
The table below reports on the frequency of repairs required for n = 550 army vehicles over a 10 year period.
It is hoped that it is reasonable to assume that the distribution of the number of repairs may be represented by a
Poisson distribution with mean µ.
a. Give the null hypothesis in terms of relevant parameters. Define any new parameters that you introduce.
b. Write down the likelihood in terms of µ and the frequencies (i.e. under the null hypothesis), and under the
alternative hypothesis.
c. Find expressions for the MLE of µ and g_k(µ), where g_k(µ) is Pr(y = k).
d. Obtain the expected frequencies and test the goodness-of-fit of a Poisson distribution for the following data.
Frequency Distribution of Repair Visits for Army Vehicles
# of visits 0 1 2 3 4 5 6+
frequency 295 190 53 5 5 2 0
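A sketch of how the expected frequencies in part (d) might be computed is given below in Python. It is illustrative only: the variable names are assumptions, and the sample mean is used for µ because no vehicle falls in the open-ended 6+ class, so the grouped likelihood coincides with the ordinary Poisson likelihood for these data. The last class is given the remaining tail probability so that the expected frequencies sum to n.

```python
import math

# Observed frequencies of 0, 1, ..., 5 and 6+ repair visits.
freq = [295, 190, 53, 5, 5, 2, 0]
K = len(freq) - 1
n = sum(freq)

# Sample mean; valid as the estimate of mu here because f_K = 0.
mu_hat = sum(k * f for k, f in enumerate(freq)) / n

# Poisson probabilities for k = 0, ..., K-1, and the tail Pr(y >= K) for the last class.
probs = [math.exp(-mu_hat) * mu_hat ** k / math.factorial(k) for k in range(K)]
probs.append(1.0 - sum(probs))

expected = [n * p for p in probs]
for k, (obs, exp_) in enumerate(zip(freq, expected)):
    label = f"{k}+" if k == K else str(k)
    print(label, obs, round(exp_, 2))
```

Several of the expected frequencies in the upper classes are small, so classes would normally be pooled before a goodness-of-fit statistic is computed.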
Problem 3.9.
Ashford and Sowden (1970) conducted a study on the health of British coal miners who were smokers but did
not show radiological signs of a lung disorder called pneumoconiosis. A simple random sample of employed miners
were assessed to determine whether they became breathless or coughed upon mild exertion. The resulting data are
displayed in the following table, stratified according to five year age intervals.
                  Breathless
             Yes               No
           Coughed           Coughed
Age       Yes    No        Yes     No
20-24       9     7         95   1841
25-29      23     9        105   1654
30-34      54    19        177   1863
35-39     121    48        257   2357
40-44     169    54        273   1778
45-49     269    88        324   1712
50-54     404   117        245   1324
55-59     406   152        225    967
60-64     372   106        132    526
Suppose an analysis is planned to assess the association between age (A), breathlessness (B), and coughing (C).
The following saturated log-linear model could be fit
log µ_{ijk} = u + u^A_i + u^B_j + u^C_k + u^{AB}_{ij} + u^{AC}_{ik} + u^{BC}_{jk} + u^{ABC}_{ijk} ,
where the corner-point (treatment) constraints are u^A_1 = 0, u^B_1 = 0, u^C_1 = 0, u^{AB}_{1j} = u^{AB}_{i1} = 0, u^{AC}_{1k} = u^{AC}_{i1} = 0,
u^{BC}_{1k} = u^{BC}_{j1} = 0, and u^{ABC}_{1jk} = u^{ABC}_{i1k} = u^{ABC}_{ij1} = 0, i = 1, . . . , 9, j = 1, 2, k = 1, 2.
a. Fit log-linear models to these data. Justify your choice for the model that you feel is most appropriate for
characterizing the association between these three factors, and summarize the nature of the association(s).
b. Provide point estimates of two or more odds ratios characterizing the association between pairs of variables and
carefully interpret their meaning.
c. If interest was strictly in modeling coughing as a response, give the logistic regression model that you would
fit to obtain the same results you obtained in a). Give the numerical values for the regression coefficients in this
logistic regression model based on the estimated coefficients in part a).
d. What information do you lose in the logistic regression model in c) which you can obtain from the log-linear model
in a)?
Problem 3.10.
Let Yi0 denote the prior count for subject i, i = 1, . . . , 59. To analyze the epilepsy data set, assume Yi0 |ui ∼
Poisson(ui λ0 ) where, given ui , Yi0 and (Yi1 , Yi2 , Yi3 , Yi4 ) are independently distributed.
a. Derive the joint conditional distribution of (Yi1 , Yi2 , Yi3 , Yi4 ) given Yi0 + Yi1 + · · · + Yi4 and describe how to
estimate β1 based on a logistic regression model.
b. Fit a logistic regression model to estimate β1 (and give a 95% confidence interval) and contrast your findings
with those in 1). Use the data on epilepsy.

DATA FROM THE EPILEPSY TRIAL
PLACEBO PROGABIDE
TRT PRIOR TRT PRIOR
Y1 Y2 Y3 Y4 GROUP COUNT AGE Y1 Y2 Y3 Y4 GROUP COUNT AGE
5 3 3 3 0 11 31 0 4 3 0 1 19 20
3 5 3 3 0 11 30 3 6 1 3 1 10 20
2 4 0 5 0 6 25 2 6 7 4 1 19 18
4 4 1 4 0 8 36 4 3 1 3 1 24 24
7 18 9 21 0 66 22 22 17 19 16 1 31 30
5 2 8 7 0 27 29 5 4 7 4 1 14 35
6 4 0 2 0 12 31 2 4 0 4 1 11 57
40 20 23 12 0 52 42 3 7 7 7 1 67 20
5 6 6 5 0 23 37 4 18 2 5 1 41 22
14 13 6 0 0 10 28 2 1 1 0 1 7 28
26 12 6 22 0 52 36 0 2 4 0 1 22 23
12 6 8 5 0 33 24 5 4 0 3 1 13 40
4 4 6 2 0 18 23 11 14 25 15 1 46 43
7 9 12 14 0 42 36 10 5 3 8 1 36 21
16 24 10 9 0 87 26 19 7 6 7 1 38 35
11 0 0 5 0 50 26 1 1 2 4 1 7 25
0 0 3 3 0 18 28 6 10 8 8 1 36 26
37 29 28 29 0 111 31 2 1 0 0 1 11 25
3 5 2 5 0 18 32 102 65 72 63 1 151 22
3 0 6 7 0 20 21 4 3 2 4 1 22 32
3 4 3 4 0 12 29 8 6 5 7 1 42 25
3 4 3 4 0 9 21 1 3 1 5 1 32 35
2 3 3 5 0 17 32 18 11 28 13 1 56 21
8 12 2 8 0 28 25 6 3 4 0 1 24 41
18 24 76 25 0 55 30 3 5 4 3 1 16 32
2 1 2 1 0 9 40 1 23 19 8 1 22 26
3 1 4 2 0 10 19 2 3 0 1 1 25 21
13 15 13 12 0 47 22 0 0 0 0 1 13 36
11 14 9 8 1 76 18 1 4 3 2 1 12 37
8 6 9 4 1 38 32
Problem 3.11.
Poisson regression models may be used to examine the association of two or more factors in multi-way contingency
tables. Consider the case of a two-way contingency table in which two factors (X and Y, say) are cross classified.
a. Show why we can use a Poisson regression model to investigate the association between X and Y , even though
the data were collected this way.
b. What term must be in the Poisson model to provide valid inferences under this sampling plan?
Problem 3.12.
A sample of 1729 individuals were cross-classified according to whether or not they (1) read newspapers, (2) listen
to the radio, (3) do “solid” reading, (4) attend lectures, and (5) have a good knowledge regarding cancer.
                                         Radio                               No Radio
                             Solid Reading  No Solid Reading     Solid Reading  No Solid Reading
                               Knowledge       Knowledge           Knowledge       Knowledge
                              Good  Poor      Good  Poor          Good  Poor      Good  Poor
Newspapers     Lectures         23     8         8     4            27    18         7     6
               No Lectures     102    67        35    59           201   177        75   156
No Newspapers  Lectures          1     3         4     3             3     8         2    10
               No Lectures      16    16        13    50            67    83        84   393
a. Analyse the data using hierarchical log linear models without distinguishing between response and explanatory
variables.
b. Briefly explain the final model chosen as non-technically as possible.
c. Compare the analysis with your analysis based on the logistic model that you carried out in assignment 3 with
“good knowledge of cancer” as the response. Which do you feel more comfortable with?

RJ Cook 2 October 21, 2008
