We assume that Yi has the p.d.f. that is a member of the exponential family. That is,
f (yi; θi , φ) = exp{(yi θi − b(θi ))/ai (φ) + c(yi ; φ)}
for some specific functions ai (·), b(·) and c(·), with φ known.
• The parameter θ is called the canonical parameter.
• The parameter φ, termed the scale parameter, is constant for all i = 1, . . . , n. For now consider
φ to be specified.
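The score function for a single observation is
$$ S(\theta) = \frac{\partial \ell}{\partial \theta} = \frac{y - b'(\theta)}{a(\phi)}. $$
Since the density integrates to one,
$$ \frac{\partial}{\partial \theta} \int f(y; \theta, \phi)\,dy = \frac{\partial 1}{\partial \theta}, $$
so that, interchanging differentiation and integration,
$$ \int \frac{\partial}{\partial \theta} f(y; \theta, \phi)\,dy = 0. $$
But
$$ \frac{\partial}{\partial \theta} \log f(y; \theta, \phi) = \frac{1}{f(y; \theta, \phi)} \frac{\partial}{\partial \theta} f(y; \theta, \phi). $$
Therefore,
$$ \int \frac{\partial}{\partial \theta} \log f(y; \theta, \phi)\, f(y; \theta, \phi)\,dy = 0 $$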
which implies,
E(−∂ 2 log f (y; θ, φ)/∂θ2) = E((∂ log f (y; θ, φ)/∂θ)2 )
These facts are useful to us for deriving the following results.
• Mean :
E(S(θ)) = 0 ⇒ E((Y − b′(θ))/a(φ)) = 0
which in turn implies that E(Y ) = b′ (θ) = µ. The key result here is that µ = b′ (θ).
• Variance :
E(−∂ 2 log f (y; θ, φ)/∂θ2) = E((∂ log f (y; θ, φ)/∂θ)2 )
implies
E(b′′ (θ)/a(φ)) = E( ((y − b′ (θ))/a(φ))2 )
which in turn implies b′′ (θ)/a(φ) = Var(Y )/a(φ)2 and hence Var(Y ) = b′′ (θ)a(φ). The variance
can be written as a product of
b′′ (θ)
- a function of the canonical parameter and hence the mean of the distribution
- sometimes called the variance function and denoted by V (µ) when considered as a function
of µ, the mean
- note that this is not in general the variance of Y .
a(φ) - a function only of φ
- can commonly be written as a(φ) = φ/w where φ is called a dispersion parameter and w
is a prior weight for this observation.
Link Functions
The link functions relate η (the linear predictor) to µ, the expected value of the random variable
Y.
Canonical Links
Consider the log-likelihood function from the exponential family.
ℓ(θ, φ; y) = (yθ − b(θ))/a(φ) + c(y; φ).
If we set
θi = ηi = x′i β
we say we have a canonical link (i.e. canonical parameter = linear predictor)
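$$ \ell(\beta, \phi; y) = \sum_{i=1}^{n} \left[ \frac{y_i x_i'\beta - b(x_i'\beta)}{a_i(\phi)} + c(y_i; \phi) \right] $$
$$ = \sum_{j=1}^{p} \beta_j \left( \sum_{i=1}^{n} \frac{y_i x_{ij}}{a_i(\phi)} \right) - \sum_{i=1}^{n} \frac{b(x_i'\beta)}{a_i(\phi)} + \sum_{i=1}^{n} c(y_i; \phi) $$
If φ is known, $\sum_{i=1}^{n} y_i x_{ij}$ is a sufficient statistic for βj, or equivalently, Xy is a sufficient statistic for β.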
Canonical links are useful in terms of their statistical properties, but context and goodness of
fit should motivate the choice of the link. It often turns out that the canonical links are in fact
appropriate (e.g. linear regression with normally distributed observations, logistic regression).
a. Normal Distribution
Dropping the subscript i, consider a single observation from the N(µ, σ²) distribution. It has
density
$$ f(y; \theta, \phi) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\{-(y - \mu)^2 / 2\sigma^2\} $$
$$ = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\{-(y^2 - 2\mu y + \mu^2) / 2\sigma^2\} $$
$$ = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\{(y\mu - \mu^2/2)/\sigma^2 - y^2 / 2\sigma^2\} $$
$$ = \exp\{(y\mu - \mu^2/2)/\sigma^2 - y^2 / 2\sigma^2 - \log(2\pi\sigma^2)/2\} $$
Matching this up with the general expression for the exponential family, we see θ = µ, φ = σ²,
a(φ) = φ
b(θ) = θ²/2 = µ²/2
c(y; φ) = −y²/(2φ) − log(2πφ)/2
b. Poisson Distribution
$$ f(y; \lambda) = \lambda^{y} e^{-\lambda} / y! $$
b(θ) = exp{θ} = λ
c(y; φ) = − log y!
These give
• mean : b′ (θ) = exp{θ} = µ = λ
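The identities µ = b′(θ) and Var(Y) = b″(θ)a(φ) can be checked numerically for the Poisson case. The following is a minimal sketch, assuming a(φ) = 1 for the Poisson family and using finite differences for the derivatives of b; the helper names (`poisson_pmf`, `d1`, `d2`) and the value λ = 3.5 are illustrative choices, not part of the notes.

```python
import math

def poisson_pmf(y, lam):
    # lambda^y e^{-lambda} / y!, evaluated on the log scale for stability
    return math.exp(y * math.log(lam) - lam - math.lgamma(y + 1))

def b(theta):
    # cumulant function of the Poisson family: b(theta) = exp(theta)
    return math.exp(theta)

def d1(f, x, h=1e-4):
    # central finite-difference approximation to f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

def d2(f, x, h=1e-4):
    # central finite-difference approximation to f''(x)
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

lam = 3.5                 # illustrative rate
theta = math.log(lam)     # canonical parameter, since b(theta) = exp(theta) = lambda

# Mean and variance computed directly from the probability function
# (the support is truncated at 200, far in the tail for this rate).
support = range(0, 200)
mean = sum(y * poisson_pmf(y, lam) for y in support)
var = sum((y - mean) ** 2 * poisson_pmf(y, lam) for y in support)

print("mean from pmf:", mean, " b'(theta):", d1(b, theta))
print("var from pmf: ", var, " b''(theta) * a(phi):", d2(b, theta) * 1.0)
```

Both printed pairs should agree up to the finite-difference error, which illustrates that the mean and the variance of a Poisson variable coincide.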
c. Binomial Distribution
• variance :
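$$ b''(\theta)\,a(\phi) = \frac{e^{\theta}(1 + e^{\theta}) - e^{\theta}(e^{\theta})}{(1 + e^{\theta})^2} \cdot \frac{1}{m} $$
$$ = \frac{e^{\theta}}{1 + e^{\theta}} \cdot \frac{1}{1 + e^{\theta}} \cdot \frac{1}{m} $$
$$ = \mu(1 - \mu)/m $$
$$ = \pi(1 - \pi)/m $$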
This is a way of interpreting the Newton Raphson algorithm for maximization of the likelihood
function. Consider the log-likelihood for a single observation from the exponential family
ℓ(θ, φ; y) = [(yθ − b(θ))/a(φ) + c(y; φ)]
Recall
• ℓ is a function of θ (we initially assume that φ is known)
• µ can be expressed in terms of θ through µ = b′ (θ)
• η can be expressed in terms of µ through the link function and µ = b′ (θ)
• η can be expressed in terms of β through η = x′ β
where $W_i^{-1} = \mathrm{Var}(y_i)(\partial\eta_i/\partial\mu_i)^2$ and ∂ηi/∂µi means ∂η/∂µ evaluated at the covariate vector xi.
The score vector will then take the form S(β) = (∂ℓ/∂β0, ∂ℓ/∂β1, . . . , ∂ℓ/∂βp−1)′. In vector form we
can write S(β) as XW(y − µ) ∗ ∂η/∂µ, where y = (y1, . . . , yn)′ and µ = (µ1, . . . , µn)′
are n × 1 vectors, ∗ denotes elementwise multiplication, X = (x1, . . . , xn) is a p × n matrix, and W denotes the diagonal matrix
$$ W = \mathrm{diag}(W_1, W_2, \ldots, W_j, \ldots, W_n). $$
Newton Raphson
β̂ (r+1) = β̂ (r) + I −1 (β̂ (r) )S(β̂ (r) )
where I is the observed information matrix.
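$$ I_{jk} = -\frac{\partial^2 \ell}{\partial\beta_j \partial\beta_k} = -\frac{\partial}{\partial\beta_k} \frac{\partial \ell}{\partial\beta_j} $$
$$ = -\frac{\partial}{\partial\beta_k} \left[ (y - \mu)\, W \frac{\partial\eta}{\partial\mu} x_j \right] $$
$$ = -(y - \mu) \frac{\partial}{\partial\beta_k} \left[ W \frac{\partial\eta}{\partial\mu} x_j \right] - W \frac{\partial\eta}{\partial\mu} x_j \frac{\partial}{\partial\beta_k} (y - \mu) $$
But
$$ \frac{\partial\mu}{\partial\beta_k} = \frac{\partial\mu}{\partial\eta} \cdot \frac{\partial\eta}{\partial\beta_k} = \frac{\partial\mu}{\partial\eta} \cdot x_k . $$
Therefore,
$$ I_{jk} = -(y - \mu) \frac{\partial}{\partial\beta_k} \left[ W \frac{\partial\eta}{\partial\mu} x_j \right] + W \frac{\partial\eta}{\partial\mu} x_j \frac{\partial\mu}{\partial\eta} x_k $$
$$ = -(y - \mu) \frac{\partial}{\partial\beta_k} \left[ W \frac{\partial\eta}{\partial\mu} x_j \right] + x_j W x_k $$
Taking expectations we get,
$$ I_{jk} = -E\left[ \frac{\partial^2 \ell}{\partial\beta_j \partial\beta_k} \right] = -E\left\{ (y - \mu) \frac{\partial}{\partial\beta_k} \left[ W \frac{\partial\eta}{\partial\mu} x_j \right] \right\} + E\{ x_j W x_k \} $$
$$ = -\frac{\partial}{\partial\beta_k} \left[ W \frac{\partial\eta}{\partial\mu} x_j \right] E\{(y - \mu)\} + x_j W x_k $$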
Notice that the first term vanishes since E{(y − µ)} = 0 by definition. Then, for n observations we
can write
$$ I_{jk} = \sum_{i=1}^{n} x_{ij} W_i x_{ik} = (XWX')_{jk} $$
where again, W is the diagonal matrix
$$ W = \mathrm{diag}(W_1, W_2, \ldots, W_j, \ldots, W_n) $$
and $W_i^{-1} = \mathrm{Var}\{Y_i\}(\partial\eta_i/\partial\mu_i)^2$.
1.3.1 Overview
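$$ I(\hat\beta^{(r)})\,\hat\beta^{(r+1)} = I(\hat\beta^{(r)})\,\hat\beta^{(r)} + S(\hat\beta^{(r)}) $$
$$ (XW(\hat\beta^{(r)})X')\,\hat\beta^{(r+1)} = (XW(\hat\beta^{(r)})X')\,\hat\beta^{(r)} + XW(\hat\beta^{(r)}) \left[ (y - \mu(\hat\beta^{(r)})) * (\partial\eta(\hat\beta^{(r)})/\partial\mu) \right] $$
$$ (XW(\hat\beta^{(r)})X')\,\hat\beta^{(r+1)} = XW(\hat\beta^{(r)}) \left[ X'\hat\beta^{(r)} + (y - \mu(\hat\beta^{(r)})) * (\partial\eta(\hat\beta^{(r)})/\partial\mu) \right] $$
Writing z(β̂(r)) for the term in square brackets, this is the same as the weighted LS estimate of β with dependent variable z(β̂(r)) and weight matrix
W(β̂(r)). Since z and W are updated with each iteration, it is called re-weighted least squares, and since
we have to repeat this estimation procedure until convergence, it is called iteratively re-weighted
least squares.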
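The update above can be sketched in a few lines of code. This is a minimal illustration for the logit link with binomial proportions yi/mi, assuming a(φ) = 1/mi so that Wi = mi πi(1 − πi) and zi = ηi + (yi/mi − πi)/(πi(1 − πi)); the function names `solve` and `irls_logit` are hypothetical, and the data are the four rows of prenatal.dat that appear later in these notes, with level of care as the only covariate.

```python
import math

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def irls_logit(X, y, m, n_iter=25, tol=1e-10):
    """X: list of covariate vectors x_i; y: successes; m: numbers of trials."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        eta = [sum(xj * bj for xj, bj in zip(x, beta)) for x in X]
        pi = [1.0 / (1.0 + math.exp(-e)) for e in eta]
        # weights W_i and working response z_i for the logit link
        W = [mi * q * (1 - q) for mi, q in zip(m, pi)]
        z = [e + (yi / mi - q) / (q * (1 - q))
             for e, yi, mi, q in zip(eta, y, m, pi)]
        # weighted least squares step: (X W X') beta = X W z
        A = [[sum(w * x[j] * x[k] for w, x in zip(W, X)) for k in range(p)]
             for j in range(p)]
        rhs = [sum(w * x[j] * zi for w, x, zi in zip(W, X, z)) for j in range(p)]
        new_beta = solve(A, rhs)
        converged = max(abs(a - b_) for a, b_ in zip(new_beta, beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta

# rows of prenatal.dat: x_i = (1, loc), y = deaths, m = group size
X = [[1, 0], [1, 1], [1, 0], [1, 1]]
y = [34, 4, 12, 16]
m = [231, 27, 188, 309]
beta_hat = irls_logit(X, y, m)
print(beta_hat)
```

With a single binary covariate the maximum likelihood estimates have a closed form (the log odds in the reference group and the log odds ratio), which gives a convenient check on the iteration.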
Note:
1. Consider a Taylor series approximation of g(y) about µ to get
g(y) = g(µ) + g′(µ)(y − µ) + g″(µ)(y − µ)²/2 + · · ·
z = η + (y − µ) ∂η/∂µ can be thought of as a “linearized form” of the link function. That is, it provides
a linear approximation to the functional relationship between the mean of the distribution and
the linear predictor.
2. This motivates the choice of W .
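$$ \mathrm{Var}(Z) = \left( \frac{\partial\eta}{\partial\mu} \right)^2 \mathrm{Var}(Y) = \left( \frac{\partial\eta}{\partial\mu} \right)^2 a(\phi)\, b''(\theta) $$
$$ \text{inverse variance} \rightarrow W = \frac{1}{\mathrm{Var}(Z)} = (\mathrm{Var}(y))^{-1} \left( \frac{\partial\mu}{\partial\eta} \right)^2 = \frac{1}{a(\phi)\, b''(\theta)} \left( \frac{\partial\mu}{\partial\eta} \right)^2 $$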
This question can be equivalently re-phrased as “When is the expected information matrix the same
as the observed information matrix ? ”
Recall
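$$ I_{jk} = -\frac{\partial^2 \ell}{\partial\beta_j \partial\beta_k} = -(y - \mu) \frac{\partial}{\partial\beta_k} \left[ W \frac{\partial\eta}{\partial\mu} x_j \right] - W \frac{\partial\eta}{\partial\mu} x_j \frac{\partial}{\partial\beta_k} (y - \mu) $$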
Consider the first term of the above expression for the observed information matrix. Recall that
Questions
Problem 1.1.
Suppose a population is divided into n strata, and we sample mi subjects from the ith stratum,
i = 1, . . . , n. Let Yij ∼ N(µi , φ) (independently distributed) denote the response for the jth subject
in the ith stratum of the sample, j = 1, . . . , mi , i = 1, . . . , n. Suppose all that is available are
the sample means for each stratum, ȳ1, ȳ2, . . . , ȳn, and an associated p × 1 covariate vector xi =
(xi0, xi1, . . . , xi,p−1)′, where xi0 = 1, i = 1, . . . , n. Then
$$ f(\bar{y}_i; \mu_i, \phi) = \frac{\sqrt{m_i}}{\sqrt{2\pi\phi}} \exp\{-m_i(\bar{y}_i - \mu_i)^2 / (2\phi)\} $$
where −∞ < ȳi < ∞, −∞ < µi < ∞, and φ > 0.
a. Show that the distribution of Ȳi belongs to the exponential family and find the functions ai (·),
b(·), c(·; ·), E(Ȳi ), Var(Ȳi ), and the canonical link function of g(µ) = η.
b. Given the data ȳ1 , ȳ2, . . . , ȳn and the linear predictor ηi = xi ′ β, find the specific form of the score
vector and information matrix for β and explain how you would obtain maximum likelihood
estimates of β0 , β1 , . . . , βp−1 .
c. Relate the Newton-Raphson algorithm to any other method of model fitting you may have seen
before.
Problem 1.2.
Consider a setting where Yi1 , . . . , Yimi are mi independently distributed Poisson random variables
with Yij ∼ Poisson(µij ), j = 1, . . . , mi , i = 1, . . . , n. Moreover, assume that the Poisson counts are
generated by a time homogeneous Poisson process with µij = λi tij , where λi is an underlying rate
assumed to be common for Yi1 , . . . , Yimi and tij is the duration of observation leading to the count
yij , j = 1, . . . , mi , i = 1, . . . , n. Finally, assume that associated with Yi1 , . . . , Yimi is a p × 1 covariate
vector xi = (xi0 , xi1 , . . . , xi,p−1 )′ where xi0 = 1, i = 1, . . . , n, and let β = (β0 , . . . , βp−1 )′ .
a. Write down the likelihood for the rate functions λ = (λ1, . . . , λn)′.
b. Show that the distribution for $Y_{i\cdot} = \sum_{j=1}^{m_i} Y_{ij}$ belongs to the exponential family and hence find
the functions a(·), b(·), c(·; ·), E(Yi· ), Var(Yi· ), and the canonical link function g(µ) = η.
c. Given summary data y1· , y2· , . . . , yn· and the linear predictors ηi = xi ′ β, find the specific
form for an entry of the score vector (i.e. ∂ℓ/∂βj ) and the expected information matrix (i.e.
E(−∂ 2 ℓ/∂βj ∂βk )) under the canonical link.
d. Briefly describe how to obtain maximum likelihood estimates of β0 , β1 , . . . , βp−1 using a Fisher
Scoring algorithm.
Problem 1.3.
Let Y1 , Y2 , . . . , Yn be independent Poisson random variables with means µ1 , µ2 , . . . , µn respectively,
and let Y = (Y1 , . . . , Yn )′ . Associated with each Yi is a covariate vector, xi = (1, xi1 , . . . , xi,p−1 )′ , of
length p.
a. Show that ηi = log µi is the natural parameter of the Poisson distribution.
b. Find the score vector for β.
c. Find the observed and expected information matrix for β and hence show how to obtain the
MLE for β.
d. FOR STAT 831 ONLY
Show that T = XY is a vector of sufficient statistics for β where X is the p × n matrix with
columns comprised of covariate vectors.
Problem 1.4.
a. Show that this distribution belongs to the exponential family and hence find the canonical
parameter, the functions a(·), b(·), c(· ; ·), E(Y ), var(Y ) and the canonical link function.
b. Suppose now that we have a series of n independent 2 × 2 tables of the sort above and let xi =
(1, xi1 , . . . , xi,p−1 )′ denote a p × 1 vector of explanatory variables for the ith table. Introducing
subscripts to distinguish data from different tables, we then summarize the data from table i as
                  Outcome
           Present      Absent            Total
Group 1    yi           m1i − yi          m1i
Group 2    ti − yi      m2i − ti + yi     m2i
Total      ti           m.i − ti          m.i
The conditional distribution of yi given ti, the first column total, (and m1i and m2i) is
$$ f(y_i \mid t_i, m_{1i}, m_{2i}) = \frac{\binom{m_{1i}}{y_i} \binom{m_{2i}}{t_i - y_i} \exp\{y_i \alpha_i\}}{\sum_{v \in S_i} \binom{m_{1i}}{v} \binom{m_{2i}}{t_i - v} \exp\{v \alpha_i\}} $$
with Si = {v : max(0, ti − m2i ) ≤ v ≤ min(m1i , ti )}. Given the data y1 , y2 , . . . , yn and the linear
predictor ηi = x′i β where β = (β0 , . . . , βp−1)′ , find the specific form of the score and information
function and explain how you would obtain maximum likelihood estimates of β0 , β1 , . . . , βp−1.
Problem 1.7.
Consider n 2 × 2 tables where the ith table is given by
           Success      Failure           Total
Group 1    yi           mi1 − yi          mi1
Group 2    ti − yi      mi2 − ti + yi     mi2
Total      ti           mi − ti           mi
a. Derive the conditional distribution of Yi given mi1 , mi2 and ti and show that this belongs to the
exponential family of distributions. Find the canonical parameter and hence find the conditional
mean and variance of Yi .
b. Suppose that associated with the ith table is a p × 1 covariate vector xi = (1, xi1 , . . . , xi,p−1 )′ ,
i = 1, . . . , n. Use the canonical link and the systematic component as η = X ′ β where X is a
p × n matrix of column vectors x1 , . . . , xn. Find the score vector and the observed and expected
information matrix for β, and hence show how to obtain the MLE of β.
Problem 1.8.
Suppose a population is divided into n strata, and we sample mi subjects from the ith stratum,
i = 1, . . . , n. Let Yij ∼ Bernoulli(πi ) (independently distributed) denote the response for the jth
subject in the ith stratum of the sample, so that P(Yij = 1) = πi and P(Yij = 0) = 1 − πi ,
j = 1, . . . , mi, i = 1, . . . , n. Suppose that the responses available from the strata are simply the
means y1, y2, . . . , yn (where $y_i = \sum_{j=1}^{m_i} y_{ij}/m_i$, i = 1, . . . , n), and that the associated sample sizes,
m1, . . . , mn, are also available.
a. Show that the distribution of yi belongs to the exponential family by identifying the functions
a(·), b(·) and c(· ; ·), obtain E(Yi ) and Var(Yi), and name the canonical link function.
2.1 Introduction
Binary responses generally require different analysis techniques than have been considered so far in
regression courses. Examples of binary responses include disease status (diseased/not diseased) and
survival status (dead/alive). In addition to having such a binary response, we often have a single binary
covariate of interest. Examples include treatment (experimental/control) or exposure (exposed to
radiation/not exposed to radiation) variables. To summarize the data, we might construct a 2 × 2
table as follows.
               Disease
           Present    Absent      Total
Group 1    y1         m1 − y1     m1
Group 2    y2         m2 − y2     m2
Total      y.         m. − y.     m.
If we have m1 and m2 fixed, we typically assume we’ve got two independent binomial samples with
Yk ∼ Bin(mk , πk ), k = 1, 2. There are many measures of association one can consider for such
tables, but here we will focus on the odds ratio. The odds of one event versus another is simply the
ratio of their respective probabilities. Therefore, the odds of disease versus no disease in Group 1
is π1 /(1 − π1 ). We sometimes just refer to this as the odds of disease in Group 1. In this context,
the odds is a 1-1 monotonically increasing function of π1 which takes on values on the non-negative
real line. The odds of disease in Group 2 is π2 /(1 − π2 ). The odds ratio reflecting the relative odds
of disease in Group 1 versus Group 2 is then
$$ \psi = \frac{\pi_1 / (1 - \pi_1)}{\pi_2 / (1 - \pi_2)} . $$
Note that in the case of a “rare” disease (i.e. when π1 and π2 are very small), ψ is close to the
relative risk, π1/π2. This can be seen by noting that
$$ \psi = \frac{\pi_1 / (1 - \pi_1)}{\pi_2 / (1 - \pi_2)} = \frac{\pi_1}{\pi_2} \left( \frac{1 - \pi_2}{1 - \pi_1} \right) . $$
When π1 and π2 are both small, the fraction in parentheses is close to 1.
2.2. ESTIMATION OF THE ODDS RATIO
We would like to use likelihood theory to estimate ψ and therefore need to construct an appropriate
likelihood function. Note that
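$$ \Pr(Y_1 = y_1, Y_2 = y_2) = \binom{m_1}{y_1} \pi_1^{y_1} (1 - \pi_1)^{m_1 - y_1} \binom{m_2}{y_2} \pi_2^{y_2} (1 - \pi_2)^{m_2 - y_2} $$
$$ L(\pi_1, \pi_2) = \pi_1^{y_1} (1 - \pi_1)^{m_1 - y_1} \pi_2^{y_2} (1 - \pi_2)^{m_2 - y_2} $$
$$ L(\pi_1, \pi_2) = \left( \frac{\pi_1}{1 - \pi_1} \right)^{y_1} (1 - \pi_1)^{m_1} \left( \frac{\pi_2}{1 - \pi_2} \right)^{y_2} (1 - \pi_2)^{m_2} $$
$$ L(\pi_1, \pi_2) = \left( \frac{\pi_1 / (1 - \pi_1)}{\pi_2 / (1 - \pi_2)} \right)^{y_1} \left( \frac{\pi_2}{1 - \pi_2} \right)^{y_2 + y_1} (1 - \pi_1)^{m_1} (1 - \pi_2)^{m_2} $$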
We want to reparameterize to get rid of π1 and so note that if ψ = π1 /(1 − π1 )/[π2 /(1 − π2 )] then
π1 = ψπ2 /[1 − π2 + ψπ2 ]. Substituting into the above likelihood we get
$$ L(\psi, \pi_2) = \psi^{y_1} \left( \frac{\pi_2}{1 - \pi_2} \right)^{y_2 + y_1} \left( \frac{1 - \pi_2}{1 - \pi_2 + \psi\pi_2} \right)^{m_1} (1 - \pi_2)^{m_2} $$
Now that we have a likelihood involving the parameters of interest, we can consider further
reparameterization to enable us to obtain Wald-type quantities for inference. Wald-type quantities are
most appealing when the corresponding parameters are unrestricted (i.e. the parameter space is the
real line). Therefore consider reparameterizing to β = log ψ and α = log(π2/(1 − π2)). Here we get
$$ L(\alpha, \beta) = e^{y_1 \beta} (1 + e^{\alpha + \beta})^{-m_1} e^{y_{\cdot} \alpha} (1 + e^{\alpha})^{-m_2} , $$
where y· = y1 + y2. Then the log-likelihood is
$$ \ell(\alpha, \beta) = y_1 \beta - m_1 \log(1 + e^{\alpha + \beta}) + y_{\cdot} \alpha - m_2 \log(1 + e^{\alpha}) . $$
If S(α, β) = (Sα(α, β), Sβ(α, β)), solving S(α, β) = 0 gives α̂ = log(y2/(m2 − y2)) and
$$ \hat\beta = \log\left( \frac{y_1 / (m_1 - y_1)}{y_2 / (m_2 - y_2)} \right) . $$
These estimates are natural since they imply π̂2 = e^α̂/(1 + e^α̂) = y2/m2, π̂1 = y1/m1
and ψ̂ = π̂1/(1 − π̂1)/[π̂2/(1 − π̂2)]. Note that
the asymptotic (large sample) results in section 1.2 we could use the likelihood itself to conduct
inference about β and hence ψ. The Wald-type pivotal used here is much more convenient and since
the range of values for β is unrestricted the results will generally agree very closely.
The results of the preceding section were directed at the case with a single factor variable with two
levels and a binary response. This is a simple setting but more often we need multiple regression
methodology since we may
a. want to be able to control for confounding variables and hence want to examine the effect of
several (possibly related collinear) variables simultaneously,
b. want to examine the effect of categorical covariates (>2 levels) or continuous covariates.
c. want to develop sophisticated models that describe complex relationships.
Example : Consider the data in the following table which describes the relationship between the
level of prenatal care and fetal mortality. The data arose from two clinics which we refer to as Clinic
A and Clinic B (not their real names!).
Here we obtain ψ̂ = π̂1/(1 − π̂1)/[π̂2/(1 − π̂2)] = (20/316)/(46/373) = 0.51 and a 95% CI for ψ of
(0.30, 0.89). This suggests a strong association between level of prenatal care and fetal mortality.
However if we consider data from just those subjects who are at Clinic A, we get the following table.
             Died   Survived   Total
Intensive      16        293     309
Regular        12        176     188
Total          28        469     497
This gives ψ̂ = 16 × 176/[12 × 293] = 0.80 with a 95% CI for ψ of (0.37, 1.73). While the odds ratio
estimate is in the direction of a protective effect with intensive prenatal care, the confidence interval
is quite wide and includes values above one (which correspond to an increased risk of mortality).
Now we consider the corresponding data from Clinic B.
             Died   Survived   Total
Intensive       4         23      27
Regular        34        197     231
Total          38        220     258
This gives ψ̂ = 4 × 197/[23 × 34] = 1.01 with a 95% CI for ψ of (0.33, 3.10). Note that the
reduction in the odds of fetal mortality due to intensive prenatal care which we observed in the
pooled data appears to have vanished for patients in Clinic B, and we found above that the estimate
of benefit is considerably smaller (and no longer significant) among patients in Clinic A. For further
investigation, we examine the relationship between clinic and level of care.
Clinic A Clinic B
Intensive 309 27 336
Regular 188 231 419
497 258 755
Here we get ψ̂ = 14.06 with a 95% CI for ψ of (9.12, 21.76). This suggests a very strong and
statistically significant relationship between clinic and the intensity of prenatal care. Specifically, we
can see that the proportion of patients in Clinic A receiving intensive prenatal care is considerably
higher than it is for patients in Clinic B. The following table displays the relationship between clinic
and mortality.
Clinic A Clinic B
Died 28 38 66
Survived 469 220 689
497 258 755
Here we obtain ψ̂ = 0.35 with a 95% CI (0.21, 0.58). This suggests that there is a statistically
significantly (significant at the 5% level since the 95% confidence interval does not include one)
higher rate of mortality in Clinic B. Finally, we can tabulate the relationship between clinic and
mortality stratified by level of care. We do that in the following table. We will return to this example
shortly.
To summarize what we found here, we found that there is an apparent strong association between
level of prenatal care and fetal mortality. When stratifying by clinic, evidence of this apparent
association is greatly reduced. When we stratify by level of prenatal care, there is a reduced risk
of mortality for patients in Clinic A versus Clinic B. We aim to study how these findings might be
reflected in a regression model.
                   Intensive                   Regular
            Died   Survived   Total     Died   Survived   Total
Clinic A      16        293     309       12        176     188
Clinic B       4         23      27       34        197     231
Total         20        316     336       46        373     419
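The odds ratios and confidence intervals quoted in this example can be reproduced with a short calculation. The sketch below assumes the usual large-sample Wald interval on the log scale, with the standard error of log ψ̂ taken as the square root of the sum of reciprocal cell counts; that formula is the standard one for a 2 × 2 table but is not shown in the surviving text, and the function name `odds_ratio_ci` is hypothetical.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """2 x 2 table with rows (a, b) and (c, d).

    Returns (psi_hat, lower, upper), where the interval is
    exp(log(psi_hat) +/- z * se) and se = sqrt(1/a + 1/b + 1/c + 1/d).
    """
    psi = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(psi) - z * se)
    upper = math.exp(math.log(psi) + z * se)
    return psi, lower, upper

# cell counts taken from the tables in this example
tables = {
    "care vs mortality, pooled":   (20, 316, 46, 373),
    "care vs mortality, Clinic A": (16, 293, 12, 176),
    "care vs mortality, Clinic B": (4, 23, 34, 197),
    "clinic vs mortality":         (28, 38, 469, 220),
}

results = {name: odds_ratio_ci(*cells) for name, cells in tables.items()}
for name, (psi, lower, upper) in results.items():
    print(f"{name}: psi = {psi:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Working on the log scale keeps the interval inside the positive half-line and matches the reparameterization β = log ψ used for Wald-type inference.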
Let x1 , x2 , ..., xp−1 be a set of p − 1 explanatory variables and x = (1, x1 , x2 , ..., xp−1 )′ be a p × 1
vector of explanatory variables. Let β = (β0 , β1 , ..., βp−1)′ be a p ×1 vector of parameters. The scalar
quantity
η = x′ β = β0 + β1 x1 + · · · + βp−1 xp−1
is called the linear predictor. Let xi = (1, xi1 , xi2 , ..., xi,p−1 )′ be the vector of covariates for the ith
subject, i = 1, 2, . . . , n. Define
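$$ X = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_{11} & x_{21} & \cdots & x_{n1} \\ \vdots & \vdots & & \vdots \\ x_{1,p-1} & x_{2,p-1} & \cdots & x_{n,p-1} \end{pmatrix} = (x_1, x_2, \ldots, x_n) . $$
Then the vector of linear predictors is given by
$$ \eta = \begin{pmatrix} \eta_1 \\ \vdots \\ \eta_n \end{pmatrix} = X'\beta $$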
Recall in the context of the Gaussian linear model Yi ∼ N(µi , σ 2 ) are independent and we set
E(Yi) = µi = ηi = β0 + β1 xi1 + β2 xi2 + · · · + βp−1 xi,p−1 ,
Now consider binomial data with Yi ∼ Bin(mi , πi ).
We might think of a regression model of the form
E(Yi/mi ) = πi = ηi = β0 + β1 xi1 + · · · + βp−1 xi,p−1
Is this a reasonable model? It is not particularly convenient to work with since we’ve got to impose
constraints on the RHS because 0 ≤ πi ≤ 1, i = 1, 2, ..., n. Therefore, rather than working with πi
directly, we work with a function of it. The so-called link function defines such a transformation,
which typically maps (0, 1) → (−∞, +∞). We denote the link function by g(π).
Name of Link Function Expression
Identity g(π) = π
†
Φ is the cdf for a standard normal random variable.
Having selected the logit link, our regression model takes the form g(π) = η. Introducing the
subscript i to distinguish individuals we have
$$ \log\left( \frac{\pi_i}{1 - \pi_i} \right) = x_i'\beta = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i,p-1} $$
Let Y1 and Y2 denote the number of individuals with the outcome in Groups 1 and 2 respectively.
We let Yk ∼ Bin(mk , πk ), k = 1, 2. Let xi = 1 if the ith individual is in Group 1 and xi = 0
otherwise. We now consider a model for each individual’s response (which is binary, not binomial)
with the following form
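$$ \log\left( \frac{\pi_i}{1 - \pi_i} \right) = \beta_0 + \beta_1 x_i . $$
Consider the log odds for a subject in group 1 as
$$ \log\left( \frac{\pi_i}{1 - \pi_i} \right) = \beta_0 + \beta_1 $$
and the log odds for a subject in group 2 as
$$ \log\left( \frac{\pi_j}{1 - \pi_j} \right) = \beta_0 $$
This implies that
$$ \log\left( \frac{\pi_i}{1 - \pi_i} \right) - \log\left( \frac{\pi_j}{1 - \pi_j} \right) = \beta_1 $$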
which means log ψ = β1 where ψ is the odds ratio comparing the odds of an event for a subject in
group 1 versus a subject in group 2. Therefore, the regression coefficient from this logistic model
may be interpreted as a log odds ratio describing the association between group membership and
the outcome.
Frequently we are interested in the parameter π itself. In this case note that
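$$ \log\left( \frac{\pi_i}{1 - \pi_i} \right) = \beta_0 + \beta_1 x_i $$
$$ \frac{\pi_i}{1 - \pi_i} = e^{\beta_0 + \beta_1 x_i} $$
$$ \pi_i = \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}} $$
In a Gaussian model, given β̂, the fitted value for E(yi) is µ̂i(xi) = x′iβ̂. In this binomial regression
model, the fitted value for E(Yi/mi) is
$$ \hat\pi_i = \hat\pi(x_i) = \frac{\exp(x_i'\hat\beta)}{1 + \exp(x_i'\hat\beta)} $$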
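As a small illustration of the inverse-logit transformation, the sketch below converts a linear predictor into a fitted probability. The coefficient values are those printed for the model with level of care only in the prenatal care output later in these notes; the function name `expit` is an illustrative choice.

```python
import math

def expit(eta):
    # inverse of the logit link: pi = exp(eta) / (1 + exp(eta))
    return 1.0 / (1.0 + math.exp(-eta))

# estimates for the intercept and the level-of-care indicator
beta0, beta1 = -2.0929355, -0.6670721

pi_regular = expit(beta0)            # x = 0: regular care
pi_intensive = expit(beta0 + beta1)  # x = 1: intensive care

print("fitted P(death | regular care):  ", pi_regular)
print("fitted P(death | intensive care):", pi_intensive)
```

Because this model has one binary covariate, the fitted probabilities coincide (up to rounding of the printed coefficients) with the observed proportions 46/419 and 20/336 from the pooled table.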
Now consider the case with two binary explanatory variables. Let
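$$ x_{i1} = \begin{cases} 1 & \text{if factor A present} \\ 0 & \text{otherwise} \end{cases} $$
$$ x_{i2} = \begin{cases} 1 & \text{if factor B present} \\ 0 & \text{otherwise} \end{cases} $$
$$ x_{i3} = \begin{cases} 1 & \text{if A and B present} \\ 0 & \text{otherwise} \end{cases} $$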
the same expression we got before for those in Clinic B. By similar methods we can see that the
relative odds of mortality for those in Clinic A versus Clinic B is eβ1 regardless of their drinking
status.
Now consider the model
$$ \log\left( \frac{\pi_i}{1 - \pi_i} \right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} $$
where now xi = (1, xi1 , xi2 , xi3 )′ and β = (β0 , β1 , β2 , β3 )′ .
Clinic Level of Care xi πi /(1 − πi )
What follows is the data file prenatal.dat in which the first line contains the variable labels and
the remaining four lines the data. As before we are using indicator variables for the explanatory
variables and have binomial response data.
clinic loc y m
0 0 34 231
0 1 4 27
1 0 12 188
1 1 16 309
The program used to analyse the data is given below.
Splus program for analysis of prenatal care data
help.start()
prenatal.dat <- read.table("prenatal.dat", header=T)
# here we construct the response variable for the logistic regression analysis
prenatal.dat$resp <- cbind(prenatal.dat$y, prenatal.dat$m - prenatal.dat$y)
prenatal.dat
# now we fit the model using the glm function and store the result in "model1"
# we indicate "resp" contains a binomial response and that we are using the
# logistic link function
model1 <- glm(resp ~ loc, family=binomial(link=logit), data=prenatal.dat)
summary(model1)
# the "names" function lists the contents of the object "model1" and following
# this statement we examine some of the contents of these objects (try it)
names(model1)
model1$family
model1$formula
model1$coefficients
model1$deviance
model1$fitted.values
model1$residuals
# here we examine whether the association between loc and mortality depends on
# the clinic
model3 <- glm(resp ~ loc + clinic + loc*clinic, family=binomial(link=logit), data=prenatal.dat)
summary(model3)
A selection of the output printed from the summary commands summary(model1), summary(model2),
summary(model3) and summary(model4) follows.
> prenatal.dat$resp_cbind(prenatal.dat$y,prenatal.dat$m-prenatal.dat$y)
> prenatal.dat
clinic loc y m resp.1 resp.2
1 0 0 34 231 34 197
2 0 1 4 27 4 23
3 1 0 12 188 12 176
4 1 1 16 309 16 293
Here we print out the augmented dataframe to see what the resp variable looks like. Next we fit the regression model
examining the relationship between the level of care and mortality. A portion of the output is reported below.
Coefficients:
Value Std. Error t value
(Intercept) -2.0929355 0.1561487 -13.403476
loc -0.6670721 0.2782789 -2.397135
The numbers under the heading Value are the maximum likelihood estimates of the regression coefficients β̂0 and β̂2 (here
we are using the convention that the subscripts for the regression coefficients coincide with the subscripts on the variables
themselves). The numbers under Std. Error are estimated standard errors based on the inverse of the information matrix (more
on this shortly). Finally, the numbers under t value are Wald-type test statistics for testing the hypothesis H0 : βk = 0
vs. HA : βk ≠ 0. Note that these are of the form
(β̂k − 0)/s.e.(β̂k).
These test statistics are approximately standard normal if the null hypothesis is true. To verify, note that for testing the effect
of level of care based on model 1 we find −0.6670721/0.2782789 = −2.397135. The p-values can be computed as
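A minimal sketch of a two-sided p-value for such a Wald statistic, assuming the standard normal reference distribution mentioned above, so that p = 2{1 − Φ(|z|)}. The function name `wald_test` is hypothetical; `math.erfc` is used because 2{1 − Φ(|z|)} equals erfc(|z|/√2).

```python
import math

def wald_test(estimate, se):
    """Return the Wald statistic z = estimate / se and its two-sided p-value."""
    z = estimate / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

# level-of-care coefficient and standard error from the model1 output
z, p = wald_test(-0.6670721, 0.2782789)
print("z =", z, " two-sided p-value =", p)
```

The p-value falls below 0.05, in line with the 95% confidence interval for the pooled odds ratio excluding one.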
Now we consider introducing clinic into the model. This generates a model in which we can examine the effect of level of
care on mortality, but adjusted for the clinic the patient attended.
> model2_glm(resp ~ clinic + loc, family=binomial(link=logit),data=prenatal.dat)
> summary(model2)
Coefficients:
Value Std. Error t value
(Intercept) -1.7410476 0.1784691 -9.7554564
clinic -0.9862793 0.3089322 -3.1925428
loc -0.1503053 0.3301670 -0.4552402
> model3_glm(resp~loc+clinic+loc*clinic,family=binomial(link=logit),data=prenatal.dat)
> summary(model3)
Coefficients:
Value Std. Error t value
(Intercept) -1.756843204 0.1857092 -9.46018403
loc 0.007643349 0.5726826 0.01334657
clinic -0.928734141 0.3514300 -2.64272868
loc:clinic -0.229649891 0.6949054 -0.33047650
Coefficients:
Value Std. Error t value
(Intercept) -1.756041 0.1756737 -9.996041
clinic -1.062357 0.2621205 -4.052933
We regard the responses y1, y2, ..., yn as observations from n independent random variables Y1, ..., Yn where Yi ∼ Bin(mi, πi), i = 1, 2, . . . , n. Then
$$ \mathrm{pr}(Y = y; \pi) = \prod_{i=1}^{n} \binom{m_i}{y_i} \pi_i^{y_i} (1 - \pi_i)^{m_i - y_i} $$
which gives
Modeling procedures are based on expressing π1, π2, . . . , πn in terms of fewer parameters (this is with a view to data
reduction). These new parameters take the form of regression coefficients. This is done through a link function. The
logistic link g(πi) = log(πi/(1 − πi)) = x′iβ implies
$$ \pi_i = \frac{e^{x_i'\beta}}{1 + e^{x_i'\beta}} $$
If the dimension of β is less than n, we say we have an unsaturated model. In this case we take the dimension to be
p, corresponding to a model with p − 1 covariates. Under some models with the dimension of β equal to n, we have
a saturated model. Returning to the log-likelihood:
$$ \ell(\beta; y) = \sum_{i=1}^{n} \left[ y_i (x_i'\beta) + m_i \log \frac{1}{1 + e^{x_i'\beta}} \right] = \sum_{i=1}^{n} \left[ y_i (x_i'\beta) - m_i \log(1 + e^{x_i'\beta}) \right] $$
Upon maximizing ℓ wrt β, we obtain β̂ and compute $\hat\pi_i = e^{x_i'\hat\beta} / (1 + e^{x_i'\hat\beta})$. The quality of the fit of these regression models
will be judged by how well π̂1, π̂2, ..., π̂n fit the data (or equivalently, how well mi π̂i approximates yi, i = 1, 2, ..., n).
We need a criterion to assess how much worse unsaturated models are from the saturated model. A convenient way
of testing nested hypotheses is based on the likelihood ratio statistic.
has a χ2 distribution on ν = q − p degrees of freedom if the null hypothesis that the constraints are reasonable
is true, where ν is the difference in the effective number of parameters with and without the constraints.
Therefore if
−2 log(L(θ̂)/L(θ̃)) > χ2ν (α)
we would reject H0 at the α significance level. It is more informative to examine the p−value and so we
compute
p = pr(χ2q−p > −2 log(L(θ̂)/L(θ̃)))
Let π̃ = (π̃1, . . . , π̃n)′ = (y1/m1, . . . , yn/mn)′ represent the MLE under the saturated model and let π̂ = (π̂1, · · · , π̂n)′
denote the MLE under the constrained model imposed by the regression equation. With a little algebra one can show
that the LR statistic −2 log(L(π̂)/L(π̃)) = −2(ℓ(π̂) − ℓ(π̃)) obtained by substituting the appropriate MLE’s into the
log-likelihood has the form
$$ 2(\ell(\tilde\pi) - \ell(\hat\pi)) = 2 \left[ \sum_{i=1}^{n} \left\{ y_i \log(y_i / m_i) + (m_i - y_i) \log\left( \frac{m_i - y_i}{m_i} \right) \right\} - \sum_{i=1}^{n} \left\{ y_i \log \hat\pi_i + (m_i - y_i) \log(1 - \hat\pi_i) \right\} \right] $$
$$ = 2 \sum_{i=1}^{n} \left\{ y_i \log\left( \frac{y_i}{m_i \hat\pi_i} \right) + (m_i - y_i) \log\left( \frac{m_i - y_i}{m_i (1 - \hat\pi_i)} \right) \right\} $$
This likelihood ratio statistic is central to the analysis of binary regression models and so has a special name. It is
called the deviance statistic and is represented by D if we think of it as random and d if we refer to a realized value
for it. Splus reports this as the residual deviance, and sometimes it will be called the scaled deviance for reasons that
will become clear shortly. Based on the general result above, we would expect it to have a χ2 distribution on n − p
degrees of freedom. Unfortunately, this distributional approximation for the deviance statistic is not as good as one
might hope! It does perform very well, however, for testing nested unsaturated models which we will consider in the
next section. We remark in passing that the deviance statistic has the form
$$ 2 \sum O_{ij} \log\left( \frac{O_{ij}}{E_{ij}} \right) $$
where Oij is an observed quantity and Eij is an expected quantity. We use two subscripts here since we are summing
over both the yi cells and the (mi − yi) cells.
The Pearson statistic is another statistic one can use for assessing “overall” fit of a model.
$$ P = \sum_{i=1}^{n} \frac{(y_i - m_i \hat\pi_i)^2}{m_i \hat\pi_i (1 - \hat\pi_i)} $$
which has the form Σ(Oi − Ei)²/Vi. As for the deviance statistic, P ∼ χ2n−p approximately if the model provides a
reasonable fit to the data (i.e. if the assumed model is “true”). The chi-square approximation is a little bit better
than for the deviance statistic. Both however are poor if the sample sizes (mi) are small. The deviance and Pearson statistics
can be shown to be asymptotically equivalent by a Taylor series expansion.
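Both goodness-of-fit statistics are straightforward to compute once fitted probabilities are available. The following is a minimal sketch using the four rows of the prenatal care data with fitted values from the model containing level of care only (46/419 for regular care and 20/336 for intensive care, the pooled proportions); the function names and the helper `xlogy`, which applies the convention 0 log 0 = 0, are illustrative.

```python
import math

def xlogy(x, y):
    # x * log(y) with the convention that the term is 0 when x == 0
    return 0.0 if x == 0 else x * math.log(y)

def deviance(y, m, pi_hat):
    """D = 2 * sum[ y log(y/(m pi)) + (m - y) log((m - y)/(m (1 - pi))) ]."""
    total = 0.0
    for yi, mi, p in zip(y, m, pi_hat):
        total += xlogy(yi, yi / (mi * p))
        total += xlogy(mi - yi, (mi - yi) / (mi * (1 - p)))
    return 2.0 * total

def pearson(y, m, pi_hat):
    """P = sum (y - m pi)^2 / (m pi (1 - pi))."""
    return sum((yi - mi * p) ** 2 / (mi * p * (1 - p))
               for yi, mi, p in zip(y, m, pi_hat))

# rows of prenatal.dat in file order: (clinic, loc) = (0,0), (0,1), (1,0), (1,1)
y = [34, 4, 12, 16]
m = [231, 27, 188, 309]
pi_hat = [46 / 419, 20 / 336, 46 / 419, 20 / 336]

D = deviance(y, m, pi_hat)
P = pearson(y, m, pi_hat)
print("deviance:", D, " Pearson:", P)
```

Plugging in the saturated fit π̃i = yi/mi gives a deviance of zero, which is a quick sanity check on the implementation.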
We may be interested in testing whether the first model, which is a sub-model of the second, provides as good a
fit to the data. This is equivalent to testing the significance of the covariates xp, . . . , xq−1, or to testing H0 : βp =
· · · = βq−1 = 0. Let π̂i denote the MLE of πi under the reduced model with p parameters and let π̃i denote the
MLE of πi under the full model with q parameters. Again with a little algebra one can show that the likelihood
ratio statistic corresponding to this test is given by the difference in the deviance of the two models. That is, the
appropriate likelihood ratio test statistic is ∆D = D0 − DA where D0 is the deviance under the null model and DA
is the deviance under the alternative. The statistic ∆D is approximately χ2 distributed on q − p degrees of freedom,
and the approximation is far superior here than it is for the deviance itself.
If ∆D > χ2q−p(α), we reject H0 and claim that the reduced model leads to a significantly worse fit to the data
than the full model, or equivalently that one or more of the covariates xp, . . . , xq−1 is important for predicting the
outcome. The p-value is given by
p-value = pr(χ2q−p > ∆D)
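The tail probability pr(χ2 > ∆D) has a closed form for integer degrees of freedom, so it can be sketched without a statistics library. The function name `chi2_sf` is hypothetical, and the values of ∆D and the degrees of freedom in the usage lines are made up purely for illustration.

```python
import math

def chi2_sf(x, df):
    """Upper tail probability pr(chi-square on df degrees of freedom > x), integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # even df: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # odd df: 2 * (1 - Phi(sqrt(x))) plus a finite correction series
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * j + 1)
    return total

# hypothetical example: difference in deviance of 4.2 on q - p = 1 degree of freedom
delta_d, df = 4.2, 1
p_value = chi2_sf(delta_d, df)
print("p-value:", p_value)
```

As a check, the familiar 5% critical values (3.841 on 1 df, 5.991 on 2 df, 7.815 on 3 df) should each return a tail probability close to 0.05.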
Recall
$$ D(\hat\pi; y) = 2 \sum_{i=1}^{n} \left[ y_i \log\left( \frac{y_i}{m_i \hat\pi_i} \right) + (m_i - y_i) \log\left( \frac{m_i - y_i}{m_i (1 - \hat\pi_i)} \right) \right] = \sum_{i=1}^{n} d_i $$
Let
$$ r_i^{D} = \mathrm{sign}(y_i - m_i \hat\pi_i) \sqrt{d_i} $$
where sign(yi − mi π̂i) is + if yi − mi π̂i > 0 and sign(yi − mi π̂i) is − if yi − mi π̂i < 0. We call these deviance residuals.
If the model is adequate, then these are approximately N(0, 1) distributed.
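A minimal sketch of the deviance residuals, again using the prenatal care rows with fitted values from the model containing level of care only (an illustrative choice of data; the function names are hypothetical). Each residual is the signed square root of that observation's contribution di to the deviance, so the squared residuals sum to D.

```python
import math

def xlogy(x, y):
    # x * log(y) with the convention that the term is 0 when x == 0
    return 0.0 if x == 0 else x * math.log(y)

def deviance_residuals(y, m, pi_hat):
    residuals = []
    for yi, mi, p in zip(y, m, pi_hat):
        # contribution d_i of observation i to the deviance
        d_i = 2.0 * (xlogy(yi, yi / (mi * p))
                     + xlogy(mi - yi, (mi - yi) / (mi * (1 - p))))
        sign = 1.0 if yi - mi * p > 0 else -1.0
        # max(..., 0) guards against tiny negative values from rounding
        residuals.append(sign * math.sqrt(max(d_i, 0.0)))
    return residuals

y = [34, 4, 12, 16]
m = [231, 27, 188, 309]
pi_hat = [46 / 419, 20 / 336, 46 / 419, 20 / 336]

r = deviance_residuals(y, m, pi_hat)
print("deviance residuals:", r)
print("sum of squares (the deviance):", sum(ri ** 2 for ri in r))
```

Residuals that are large in absolute value point to rows the model fits poorly; here the signs show that the model with level of care alone underestimates mortality in one clinic and overestimates it in the other.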
2.8.1 Background
Purpose of Study : To investigate the relationship between the probability of surviving 2 years free of disease following
diagnosis and treatment for neuroblastoma, and age at diagnosis and stage of disease at diagnosis. These data are
summarized in the following table where the cell entries are of the form y/m with y representing the number of
patients surviving 2 years, and m representing the number of patients in that age-stage combination at the start of
the study.
                          Stage
Age (months)    I       II      III     IV      V
0-11            11/12   15/16   2/4     5/18    18/19
12-23           3/4     3/7     5/8     0/25    1/3
24+             4/5     4/12    3/15    3/93    2/5
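Let
\[
x_{i1} = \begin{cases} 1 & \text{if age 12--23 months} \\ 0 & \text{o.w.} \end{cases} \qquad
x_{i2} = \begin{cases} 1 & \text{if age 24+ months} \\ 0 & \text{o.w.} \end{cases} \qquad
x_{i3} = \begin{cases} 1 & \text{if stage II} \\ 0 & \text{o.w.} \end{cases}
\]
\[
x_{i4} = \begin{cases} 1 & \text{if stage III} \\ 0 & \text{o.w.} \end{cases} \qquad
x_{i5} = \begin{cases} 1 & \text{if stage IV} \\ 0 & \text{o.w.} \end{cases} \qquad
x_{i6} = \begin{cases} 1 & \text{if stage V} \\ 0 & \text{o.w.} \end{cases}
\]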
Now consider the models:
What follows is the data file neuro.dat, in which the first line contains the variable labels and the remaining fifteen
lines the data. Here age and stage are categorical (ordinal) variables, so we will have to compute the indicator variables
in the Splus program.
age stage y m
1 1 11 12
1 2 15 16
1 3 2 4
1 4 5 18
1 5 18 19
2 1 3 4
2 2 3 7
2 3 5 8
2 4 0 25
2 5 1 3
3 1 4 5
3 2 4 12
3 3 3 15
3 4 3 93
3 5 2 5
The program used to analyse the data is given below.
Splus program for analysis of neuroblastoma data
neuro.dat_read.table("neuro.dat", header=T)
# here we create indicator variables for age and stage using the treatment contrast
neuro.dat$agef_factor(neuro.dat$age)
neuro.dat$ageft_C(neuro.dat$agef,treatment)
neuro.dat$stagef_factor(neuro.dat$stage)
neuro.dat$stageft_C(neuro.dat$stagef,treatment)
# here we form the two-column binomial response (survivors, non-survivors)
neuro.dat$resp_cbind(neuro.dat$y, neuro.dat$m - neuro.dat$y)
# here we fit the model with age and stage and print out summary statistics
model1_glm(resp ~ ageft + stageft, family=binomial(link=logit),data=neuro.dat)
summary(model1)
# here we record deviance residuals (rd1), Pearson residuals (rp1),
# linear predictor (lp1), and fitted values (fv1)
rd1_residuals.glm(model1,"deviance")
rp1_residuals.glm(model1,"pearson")
lp1_model1$linear.predictors
fv1_model1$fitted.values
# here we verify that fitted values agree with what we expect from the linear predictor
fv1
exp(lp1)/(1+exp(lp1))
cbind(lp1,fv1,rp1)
# here we fit two reduced models to enable us to test the importance of age and stage
model2_glm(resp ~ ageft, family=binomial(link=logit),data=neuro.dat)
summary(model2)
model3_glm(resp ~ stageft, family=binomial(link=logit),data=neuro.dat)
summary(model3)
A selection of the output is provided. The final data object neuro.dat is given by
> neuro.dat
age stage y m agef ageft stagef stageft resp.1 resp.2
1 1 1 11 12 1 1 1 1 11 1
2 1 2 15 16 1 1 2 2 15 1
3 1 3 2 4 1 1 3 3 2 2
4 1 4 5 18 1 1 4 4 5 13
5 1 5 18 19 1 1 5 5 18 1
6 2 1 3 4 2 2 1 1 3 1
7 2 2 3 7 2 2 2 2 3 4
8 2 3 5 8 2 2 3 3 5 3
9 2 4 0 25 2 2 4 4 0 25
10 2 5 1 3 2 2 5 5 1 2
11 3 1 4 5 3 3 1 1 4 1
12 3 2 4 12 3 3 2 2 4 8
13 3 3 3 15 3 3 3 3 3 12
14 3 4 3 93 3 3 4 4 3 90
15 3 5 2 5 3 3 5 5 2 3
The estimates arising from the first model fit are given below along with a little additional information.
Coefficients:
Value Std. Error t value
(Intercept) 3.317448 0.7690988 4.313423
ageft2 -2.118049 0.5724491 -3.699978
ageft3 -2.612933 0.5005830 -5.219781
stageft2 -1.252876 0.7814548 -1.603262
stageft3 -1.775857 0.7982609 -2.224658
stageft4 -4.367750 0.7874140 -5.546955
stageft5 -1.022186 0.8621250 -1.185659
Correlation of Coefficients:
(Intercept) ageft2 ageft3 stageft2 stageft3 stageft4
ageft2 -0.4148044
ageft3 -0.4532413 0.5477983
stageft2 -0.7660224 0.0405768 0.0413439
stageft3 -0.7060820 -0.0363006 -0.0391957 0.7138739
stageft4 -0.8348667 0.1809823 0.1685087 0.7349739 0.7019729
stageft5 -0.7440306 0.1217271 0.1287567 0.6694787 0.6427576 0.6860101
Before interpreting these results too much, we should look to see how good the fit is to the data. The residual plot in Figure 2.1
indicates a rather good fit, and there are no data points which lead to unacceptably large residuals. We base these inferences on
the deviance residuals since they are somewhat better behaved than Pearson residuals. For completeness, however, we consider
the following table of linear predictors, fitted values, and Pearson residuals.
> cbind(lp1,fv1,rp1)
lp1 fv1 rp1
1 3.31744824 0.96502256 -0.91175324
2 2.06457175 0.88741175 0.63385031
3 1.54159087 0.82369587 -1.69883998
4 -1.05030164 0.25916718 0.18019654
5 2.29526217 0.90848389 0.58782256
6 1.19939942 0.76841793 -0.08732117
7 -0.05347707 0.48663392 -0.30734765
8 -0.57645795 0.35974803 1.56325182
9 -3.16835046 0.04037428 -1.02558449
10 0.17721334 0.54418776 -0.73329034
11 0.70451480 0.66918800 0.62168142
12 -0.54836169 0.36624459 -0.23664028
13 -1.07134257 0.25514785 -0.48994036
14 -3.66323508 0.02500796 0.44776072
15 -0.31767128 0.42124338 -0.09620422
As an exercise, verify that you can reproduce these results by computer and by hand.
Now we consider simplifying the model further by examining the decrease in the quality of the fit that results from dropping
the stage variable(s).
Coefficients:
Value Std. Error t value
(Intercept) 1.041454 0.2741563 3.798759
ageft2 -2.111895 0.4324230 -4.883863
ageft3 -3.005062 0.3825334 -7.855686
Correlation of Coefficients:
(Intercept) ageft2
ageft2 -0.6340003
ageft3 -0.7166859 0.4543791
Now we fit the model excluding the age variable to examine the drop in the quality of fit from model one (with age and stage).
> model3_glm(resp ~ stageft, family=binomial(link=logit),data=neuro.dat)
> summary(model3)
Coefficients:
Value Std. Error t value
(Intercept) 1.7917595 0.6236082 2.873213
stageft2 -1.2656664 0.7150277 -1.770094
stageft3 -2.3223877 0.7400748 -3.138045
stageft4 -4.5643439 0.7220359 -6.321492
stageft5 -0.5389965 0.7766185 -0.694030
Correlation of Coefficients:
(Intercept) stageft2 stageft3 stageft4
stageft2 -0.8721456
stageft3 -0.8426286 0.7348948
stageft4 -0.8636804 0.7532550 0.7277618
stageft5 -0.8029788 0.7003144 0.6766129 0.6935170
Here we consider testing models using the likelihood ratio test, which in this context we are calling the “change in deviance”
and denote by ∆D.
Factors In Model    Deviance    d.f.
Age, Stage          9.63        8
Age                 83.58       12
Stage               42.45       10
Consider the test of the null hypothesis that stage is unrelated to survival, or equivalently that $H_0 : \beta_3 = \cdots = \beta_6 = 0$. The
change in deviance in going from the first model to the second model is ∆D = 83.58 − 9.63 = 73.96. The random variable
corresponding to ∆D is taken to be approximately $\chi^2$ distributed on q − p = 12 − 8 = 4 degrees of freedom. Therefore
\[ \text{S.L.} \approx P(\chi^2_4 > 73.96) < 0.0001 \]
and we reject the null hypothesis that stage is unrelated to survival, since this is very strong evidence against
it.
Consider the test of the null hypothesis that age is unrelated to survival, or equivalently that $H_0 : \beta_1 = \beta_2 = 0$. The change
in deviance in going from the first model to the third model is ∆D = 42.45 − 9.63 = 32.82. The random variable corresponding
to ∆D is taken to be approximately $\chi^2$ distributed on q − p = 10 − 8 = 2 degrees of freedom. Therefore
\[ \text{S.L.} \approx P(\chi^2_2 > 32.82) < 0.0001 \]
and we also reject the null hypothesis that age is unrelated to survival. Therefore both age and stage play highly significant
roles in determining prognosis, even after controlling for the other variable.
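The two change-in-deviance tests can be checked numerically. The sketch below is written in Python rather than Splus and uses the deviances and degrees of freedom quoted in the text. It relies on the closed form of the chi-square survival function for an even number of degrees of freedom, which is sufficient here (4 and 2 d.f.); the function names are illustrative, not part of any library.

```python
import math


def chi2_sf_even_df(x, df):
    """P(chi^2_df > x) for an even number of degrees of freedom.

    For df = 2k the survival function has the closed form
    exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    terms = sum(half ** j / math.factorial(j) for j in range(df // 2))
    return math.exp(-half) * terms


def change_in_deviance_test(dev_reduced, df_reduced, dev_full, df_full):
    """Return (delta D, degrees of freedom, approximate p-value)."""
    delta = dev_reduced - dev_full
    df = df_reduced - df_full
    return delta, df, chi2_sf_even_df(delta, df)


full = (9.63, 8)                                         # age + stage
stage_test = change_in_deviance_test(83.58, 12, *full)   # reduced model: age only
age_test = change_in_deviance_test(42.45, 10, *full)     # reduced model: stage only
print(stage_test, age_test)
```

Both p-values fall far below 0.0001, in agreement with the significance levels stated above.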
[Figure: deviance residuals and Pearson residuals (vertical axis: RESIDUALS) plotted against fitted values.]
Fig. 2.1. Plot of Residuals by Fitted Values for Neuroblastoma Model with Age and Stage
a. What is the odds ratio of surviving two years for a patient with disease in stage IV versus stage I?
age stage xi πi /(1 − πi )
The ratio is $e^{\beta_5}$, which is estimated as $e^{\hat\beta_5} = e^{-4.368} = 0.013$. Therefore we can say, “When controlling for age, the
odds of surviving two years among those diagnosed in stage IV is 0.013 times the odds among subjects diagnosed
in stage I”. Clearly diagnosis in stage IV has a substantial effect on reducing the probability of survival.
b. What is the odds ratio of surviving for a patient aged 12-23 months versus 0-11 months?
age stage xi πi /(1 − πi )
The odds ratio is therefore $e^{\beta_1}$, which is estimated as $e^{\hat\beta_1} = e^{-2.12} = 0.12$. Therefore we can say, “When controlling
for stage, the odds of surviving two years among those diagnosed at 12-23 months is 0.12 times the odds of
surviving two years among subjects diagnosed at 0-11 months of age.” Diagnosis at the intermediate age range
greatly reduces the probability of surviving two years compared to the youngest age range.
c. What is the odds ratio of surviving for a patient aged 24+ months versus 12-23 months?
age stage xi πi /(1 − πi )
The odds ratio is therefore $e^{\beta_2 - \beta_1}$, which is estimated as $e^{\hat\beta_2 - \hat\beta_1} = e^{-2.613-(-2.118)} = 0.61$. Therefore we can say,
“When controlling for stage, the odds of surviving two years among those diagnosed at 24+ months of age is 0.61
times the odds of surviving two years among subjects diagnosed at 12-23 months of age.” Note that here we have
an odds ratio that is a function of two regression coefficients and the estimate is a function of two estimated
coefficients.
d. Given the final model, what is the odds ratio of surviving for a patient whose age is in the 12-23 months range with
disease in stage II versus a patient whose age is in the 0-11 months range with disease in stage I?
age stage xi πi /(1 − πi )
The odds ratio is therefore $e^{\beta_1 + \beta_3}$, which is estimated as $e^{\hat\beta_1 + \hat\beta_3} = e^{-2.12+(-1.25)} = 0.03$. Therefore we can say,
“The odds of surviving two years among those diagnosed in stage II at 12-23 months of age is 0.03 times the
odds of surviving two years among subjects diagnosed in stage I at 0-11 months of age.” Notice we do not say
“when controlling for X” here, since both factors define the groups being compared.
e. Given the final model, what is the odds ratio of surviving for a patient whose age is in the 24+ months range with
disease in stage IV versus a patient whose age is in the 12-23 months range with disease in stage III?
age stage xi πi /(1 − πi )
The odds ratio is therefore $e^{\beta_2 + \beta_5 - \beta_1 - \beta_4}$, which is estimated as
\[ e^{\hat\beta_2 + \hat\beta_5 - \hat\beta_1 - \hat\beta_4} = e^{-2.613 - 4.368 + 2.118 + 1.776} = e^{-3.087} = 0.046. \]
Therefore we can say, “The odds of surviving two years among those diagnosed in stage IV at 24+ months of age
is 0.046 times the odds of surviving two years among subjects diagnosed in stage III at 12-23 months of age.”
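The odds ratios in questions a to e are all of the form exp(c′β̂) for a suitable contrast c. As a rough check, the following Python sketch evaluates them from the coefficient estimates reported for the model with age and stage. The dictionary keys follow the Splus coefficient labels; the helper `odds_ratio` is a hypothetical name introduced only for this illustration.

```python
import math

# Coefficient estimates from the summary of model1 (age + stage).
beta = {
    "(Intercept)": 3.317448,
    "ageft2": -2.118049,    # beta_1: age 12-23 months
    "ageft3": -2.612933,    # beta_2: age 24+ months
    "stageft2": -1.252876,  # beta_3: stage II
    "stageft3": -1.775857,  # beta_4: stage III
    "stageft4": -4.367750,  # beta_5: stage IV
    "stageft5": -1.022186,  # beta_6: stage V
}


def odds_ratio(contrast):
    """exp(c' beta) for a contrast given as {coefficient label: weight}."""
    return math.exp(sum(w * beta[name] for name, w in contrast.items()))


or_a = odds_ratio({"stageft4": 1})                  # stage IV vs stage I
or_b = odds_ratio({"ageft2": 1})                    # 12-23 vs 0-11 months
or_c = odds_ratio({"ageft3": 1, "ageft2": -1})      # 24+ vs 12-23 months
or_d = odds_ratio({"ageft2": 1, "stageft2": 1})     # (12-23, II) vs (0-11, I)
or_e = odds_ratio({"ageft3": 1, "stageft4": 1,
                   "ageft2": -1, "stageft3": -1})   # (24+, IV) vs (12-23, III)
print(or_a, or_b, or_c, or_d, or_e)
```

Expressing each comparison as a contrast makes it clear that the intercept, and any covariates shared by the two groups, cancel out of the ratio.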
f. Construct a confidence interval for the odds ratios in a) and b).
For question a) we get
\[ \left(e^{-4.37-1.96(0.79)},\; e^{-4.37+1.96(0.79)}\right) = (0.003, 0.060) \]
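Recall that since β̂ is an MLE, $\hat\beta \sim MVN(\beta, I^{-1}(\hat\beta))$ approximately. This means that
\[ x_i'\hat\beta \sim N\left(x_i'\beta,\; x_i' I^{-1}(\hat\beta) x_i\right) \]
approximately, and
\[ \frac{x_i'\hat\beta - x_i'\beta}{\sqrt{x_i' I^{-1}(\hat\beta) x_i}} \sim N(0, 1) \]
• If the odds ratio of interest is expressed as exp{c′β} where c is a column vector defining the contrast of the
regression coefficients, then an approximate 95 % confidence interval for this odds ratio is
\[ \left( \exp\left\{c'\hat\beta - 1.96\sqrt{c' I^{-1}(\hat\beta) c}\right\},\; \exp\left\{c'\hat\beta + 1.96\sqrt{c' I^{-1}(\hat\beta) c}\right\} \right) \]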
As an example we consider estimation of an approximate 95 % CI for the odds ratio in question c of the
neuroblastoma example. The vector defining the contrast of interest is c = (0, −1, 1, 0, 0, 0, 0)′.
The program used to compute the variance of a contrast of β̂ follows.
# use the summary.glm function and store the result in the tmp object
> tmp_summary.glm(model1)
# examine the contents of the tmp object and store the covariance matrix in v
> names(tmp)
> v_tmp$cov.unscaled
> v
The output from the previous commands follows.
> tmp_summary.glm(model1)
> names(tmp)
[1] "call" "terms" "coefficients" "dispersion"
[5] "df" "deviance.resid" "cov.unscaled" "correlation"
[9] "deviance" "null.deviance" "iter" "nas"
> v_tmp$cov.unscaled
> v
(Intercept) ageft2 ageft3 stageft2 stageft3
(Intercept) 0.5915130 -0.18262588 -0.17449687 -0.46039170 -0.43349301
ageft2 -0.1826259 0.32769796 0.15697614 0.01815174 -0.01658807
ageft3 -0.1744969 0.15697614 0.25058331 0.01617304 -0.01566243
stageft2 -0.4603917 0.01815174 0.01617304 0.61067160 0.44531795
> x_c(0,-1,1,0,0,0,0)
> x_as.matrix(x,7,1)
> dim(x)
[1] 7 1
> t(x)%*%v%*%x
[,1]
[1,] 0.264329
We know from before that $e^{\hat\beta_2 - \hat\beta_1} = e^{-2.613-(-2.118)} = e^{-0.495} = 0.61$. The Splus output above indicates that
$\widehat{\mathrm{var}}(\hat\beta_2 - \hat\beta_1) = 0.264$. An approximate 95 % CI for $\beta_2 - \beta_1$ is therefore
\[ \left(-0.495 - 1.96\sqrt{0.264},\; -0.495 + 1.96\sqrt{0.264}\right) = (-1.50, 0.51), \]
and the corresponding interval for the odds ratio is (0.22, 1.67).
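The same interval can be reproduced without matrix software, because the contrast c = (0, −1, 1, 0, 0, 0, 0)′ only involves two coefficients. The Python sketch below assumes the variance and covariance entries for ageft2 and ageft3 shown in the printed matrix v; the variable names are illustrative.

```python
import math

# Entries of the estimated covariance matrix v involving ageft2 and ageft3
# (taken from the output above); only these enter the contrast beta_2 - beta_1.
var_age2 = 0.32769796
var_age3 = 0.25058331
cov_age2_age3 = 0.15697614

beta_age2 = -2.118049   # estimate of beta_1
beta_age3 = -2.612933   # estimate of beta_2

# For c = (0, -1, 1, 0, 0, 0, 0)', c' v c = var(b1) + var(b2) - 2 cov(b1, b2).
var_contrast = var_age2 + var_age3 - 2.0 * cov_age2_age3
estimate = beta_age3 - beta_age2
half_width = 1.96 * math.sqrt(var_contrast)

ci_log = (estimate - half_width, estimate + half_width)   # CI for beta_2 - beta_1
ci_or = (math.exp(ci_log[0]), math.exp(ci_log[1]))        # CI for the odds ratio
print(var_contrast, ci_log, ci_or)
```

Because the interval for the odds ratio contains 1, the data do not clearly distinguish the 24+ month group from the 12-23 month group once stage is controlled for.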
2.10.1 Definitions
One of the first applications of regression procedures for binary data was for the analysis of bioassay results. Bioassay
experiments are typically animal studies designed to investigate the toxicity of an agent. This is achieved by dividing
a sample of animals, say beetles, into several groups and exposing the groups to varying amounts of the toxin. The
impact of exposure to the toxin is then assessed by examining what happens to the beetles some fixed time after
exposure. Below are some relevant definitions.
Stimulus : Each group of animals is subjected to a toxin at a stated dose. The dose indicates the intensity and
is usually specified as the log of the concentration. The dose may vary from group to group but is assumed to be
constant within a group.
Response : As a result of the stimulus, subjects will manifest a binary response (often of the form died/survived).
Tolerance : For any one subject there is a certain level of intensity below which the response will not occur and
above which it will occur. This level is called the tolerance (or threshold). The tolerance varies from one individual to
another in the population and therefore from subject to subject in the sample. We can therefore ascribe a distribution
to it.
Tolerance Distribution : Let z = concentration of toxin. Work with x = log z to measure intensity of stimulus, i.e.
dose. Let f (x) denote the p.d.f. of the distribution of the tolerance in the population. Then if a dose $x_0$ were given
to the entire population, the proportion
\[ \pi = \int_{-\infty}^{x_0} f(s)\,ds \]
would respond. (Note that $\int_{-\infty}^{\infty} f(s)\,ds = 1$ since they would all respond eventually.)
Let yj denote the # dying out of the original mj , where we assume Yj ∼ Bin(mj , πj ) where πj is the probability of
dying for the jth group, j = 1, 2, . . . , J. Let xj indicate the dose of the stimulus (on log conc. scale). We want to
model πj , j = 1, 2, . . . , J as a function of xj where xj is a continuous covariate.
In general, write π(x) to indicate dependence on x. Since 0 ≤ π ≤ 1, the usual set up is to model with
g(π) = β0 + β1 x
One often observes dose response curves that are sigmoidal in shape. This suggests selecting link functions such that
$g^{-1}(\cdot) = F^*(\cdot)$ is a c.d.f. If
\[ \pi(x) = g^{-1}(\beta_0 + \beta_1 x) = F^*(\beta_0 + \beta_1 x) = \int_{-\infty}^{x} f(s)\,ds \]
then the p.d.f. f is called the tolerance distribution. It determines how the “probability of a positive response” changes
with the value of the dose:
\[ \frac{\partial \pi(x)}{\partial x} = (F^*)'(\beta_0 + \beta_1 x)\,\beta_1 = f(x) \]
Consider a single binary response (died/survived) and a single covariate x, representing dose.
If π is the probability of death and η = β0 + β1 x, consider the following example.
Consider an experiment by Bliss (Annals of Applied Biology, 1935) in which groups of beetles were exposed to varying
concentrations of carbon disulphide (CS2 ) gas. The following results were recorded.
Dose (xi)   # of insects killed (yi)   # of insects (mi)   yi /mi
1.6907      6                          59                  0.10
1.7242      13                         60                  0.22
1.7552      18                         62                  0.29
1.7842      28                         56                  0.50
1.8113      52                         63                  0.83
1.8369      53                         59                  0.90
1.8610      61                         62                  0.98
1.8839      60                         60                  1.00
What follows is the data file beetle.dat in which the first line contains the variable labels and the remaining 8 lines
the data.
dose y m
1.6907 6 59
1.7242 13 60
1.7552 18 62
1.7842 28 56
1.8113 52 63
1.8369 53 59
1.8610 61 62
1.8839 60 60
The program used to analyse the data is given below.
Splus program for analysis of dose-response data
beetle.dat_read.table("beetle.dat", header=T)
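Normal tolerance distribution (probit link):
\[
f(s) = \frac{\beta_1}{\sqrt{2\pi}} \exp\{-(\beta_0 + \beta_1 s)^2/2\}, \qquad
\pi(x) = \int_{-\infty}^{x} \frac{\beta_1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(\beta_0 + \beta_1 s)^2}\,ds = \Phi(\beta_0 + \beta_1 x) = \Phi(\eta), \qquad
\eta = \Phi^{-1}(\pi)
\]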
2.10. BIOASSAY AND DOSE RESPONSE MODELS
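Logistic tolerance distribution (logit link):
\[
f(s) = \frac{\beta_1 e^{\beta_0 + \beta_1 s}}{[1 + e^{\beta_0 + \beta_1 s}]^2}, \qquad
\pi(x) = \int_{-\infty}^{x} \frac{\beta_1 e^{\beta_0 + \beta_1 s}}{[1 + e^{\beta_0 + \beta_1 s}]^2}\,ds = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} = \frac{e^{\eta}}{1 + e^{\eta}}, \qquad
\eta = \log\left(\frac{\pi}{1 - \pi}\right)
\]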
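Extreme value tolerance distribution (complementary log-log link):
\[
f(s) = \beta_1 \exp\{\beta_0 + \beta_1 s - \exp(\beta_0 + \beta_1 s)\}, \qquad
\pi(x) = \int_{-\infty}^{x} \beta_1 e^{\beta_0 + \beta_1 s - e^{\beta_0 + \beta_1 s}}\,ds = \int_{-\infty}^{\eta} e^{\nu - e^{\nu}}\,d\nu = 1 - \exp(-e^{\eta}), \qquad
\eta = \log(-\log(1 - \pi))
\]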
RJ Cook
CHAPTER 2. BASIC ANALYSIS OF BINARY DATA
fv2_model2$fitted.values
Selected output from the model fitting is reported here.
> model1_glm(resp ~ dose, family=binomial(link=logit),data=beetle.dat)
> summary(model1)
Coefficients:
Value Std. Error t value
(Intercept) -60.71733 5.173518 -11.73618
dose 34.27026 2.908076 11.78451
Correlation of Coefficients:
(Intercept)
dose -0.9996809
Coefficients:
Value Std. Error t value
(Intercept) -34.93468 2.647945 -13.19313
dose 19.72761 1.487237 13.26460
Correlation of Coefficients:
(Intercept)
dose -0.9996109
Coefficients:
Value Std. Error t value
(Intercept) -39.57252 3.236839 -12.22567
dose 22.04129 1.797415 12.26277
Correlation of Coefficients:
(Intercept)
dose -0.9996995
[Figure: three panels of deviance residuals, one for each fitted model; vertical axis: DEVIANCE RESIDUALS, ranging from −4 to 4.]
We can plot the actual data (as yi /mi ) against dose, and see how well the dose-response curves fit the data. The
program to do this is included below.
beetle.dat_read.table("beetle.dat", header=T)
legend(locator(),c("LOGIT","PROBIT","CLOGLOG"),lty=c(2,5,1),bty="n")
[Figure: observed proportions yi /mi plotted against dose with the fitted LOGIT, PROBIT and CLOGLOG curves; vertical axis: PROBABILITY OF DEATH, from 0.0 to 1.0.]
Fig. 2.3. Plot of the data and the dose-response curves for Bliss (1935) data
Note that the curve for the complementary log-log link fits the data better than the other two, as one would expect
from the residual plots and the deviance statistics.
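The comparison of the three links can be sketched numerically. The Python code below evaluates each inverse link at the reported coefficient estimates and computes the binomial deviance for the beetle data. One assumption should be flagged: the output above labels only the first coefficient block (logit), so the pairing of the second block with the probit link and the third with the complementary log-log link is assumed here, not stated in the text. The function names are illustrative.

```python
import math

dose = [1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839]
killed = [6, 13, 18, 28, 52, 53, 61, 60]
total = [59, 60, 62, 56, 63, 59, 62, 60]

# Inverse links: pi = F*(eta) for the three tolerance distributions.
inverse_link = {
    "logit": lambda eta: 1.0 / (1.0 + math.exp(-eta)),
    "probit": lambda eta: 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0))),
    "cloglog": lambda eta: 1.0 - math.exp(-math.exp(eta)),
}

# (intercept, slope) from the three coefficient blocks reported above.
coef = {
    "logit": (-60.71733, 34.27026),
    "probit": (-34.93468, 19.72761),    # assumed pairing (block is unlabeled)
    "cloglog": (-39.57252, 22.04129),   # assumed pairing (block is unlabeled)
}


def xlogy(a, b):
    """Return a * log(b), using the convention 0 * log(0) = 0."""
    return 0.0 if a == 0 else a * math.log(b)


def deviance(link):
    """Binomial deviance of the beetle data under the given link."""
    b0, b1 = coef[link]
    dev = 0.0
    for x, y, m in zip(dose, killed, total):
        p = inverse_link[link](b0 + b1 * x)
        dev += 2.0 * (xlogy(y, y / (m * p))
                      + xlogy(m - y, (m - y) / (m * (1.0 - p))))
    return dev


deviances = {link: deviance(link) for link in coef}
print(deviances)
```

Under these assumptions the complementary log-log fit yields the smallest deviance, which is consistent with the visual impression from the fitted curves.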
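\[ S(\beta) = \frac{\partial \ell(\beta)}{\partial \beta} \]
\[ I(\beta) = \left[ -\frac{\partial^2 \ell(\beta)}{\partial \beta_j\, \partial \beta_k} \right] \]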
Then asymptotically,
where dim(β) = p.
In general,
This suggests a Newton-Raphson search for finding the MLE might be appropriate. The Newton-Raphson iterations
will take the following form:
\[ \hat\beta^{(r)} = \hat\beta^{(r-1)} + I^{-1}(\hat\beta^{(r-1)})\, S(\hat\beta^{(r-1)}) \]
where we iterate until convergence. Convergence criteria might be specified based on the norm of the score vector at $\hat\beta^{(r)}$
(e.g. $\|S(\hat\beta^{(r)})\| \le \epsilon$ where $\|x\|$ denotes the Euclidean norm of x and $\epsilon$ is a specified tolerance), or based on the
magnitude of the change in successive values of β (e.g. $\|\hat\beta^{(r)} - \hat\beta^{(r-1)}\| \le \epsilon$).
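\[
\frac{\partial \ell}{\partial \beta_j} = \sum_{i=1}^{n} \frac{\partial \eta_i}{\partial \beta_j} \frac{\partial \ell_i}{\partial \eta_i}
\]
Since $\partial\eta_i/\partial\beta_j = x_{ij}$, we can write this in vector notation as $S(\beta) = X' S_\eta$ where $S_\eta = (\partial\ell_1/\partial\eta_1, \ldots, \partial\ell_n/\partial\eta_n)'$.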
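Differentiating again to get the information matrix, we get
\begin{align*}
\frac{\partial^2 \ell}{\partial \beta_k\, \partial \beta_j}
&= \sum_{i=1}^{n} \frac{\partial^2 \ell_i}{\partial \beta_k\, \partial \beta_j} \\
&= \sum_{i=1}^{n} \frac{\partial}{\partial \beta_k}\left( \frac{\partial \eta_i}{\partial \beta_j} \frac{\partial \ell_i}{\partial \eta_i} \right) \\
&= \sum_{i=1}^{n} \left[ \frac{\partial \eta_i}{\partial \beta_j} \cdot \frac{\partial}{\partial \beta_k}\left(\frac{\partial \ell_i}{\partial \eta_i}\right) + \frac{\partial \ell_i}{\partial \eta_i} \cdot \frac{\partial}{\partial \beta_k}(x_{ij}) \right] \\
&= \sum_{i=1}^{n} \left[ \frac{\partial \eta_i}{\partial \beta_j} \cdot \frac{\partial}{\partial \beta_k}\left(\frac{\partial \ell_i}{\partial \eta_i}\right) + 0 \right] \\
&= \sum_{i=1}^{n} \frac{\partial \eta_i}{\partial \beta_j} \cdot \frac{\partial}{\partial \eta_i}\left(\frac{\partial \ell_i}{\partial \eta_i}\right) \frac{\partial \eta_i}{\partial \beta_k} \\
&= \sum_{i=1}^{n} x_{ij}\, x_{ik}\, \frac{\partial^2 \ell_i}{\partial \eta_i^2}
\end{align*}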
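The information matrix can be written in matrix notation as $I(\beta) = X' I_\eta(\eta) X$ where $I_\eta(\eta) = -\partial S_\eta/\partial \eta'$. Substituting these
expressions into the Newton-Raphson update and multiplying through by $X' I_\eta(\hat\eta^{(r-1)}) X$ gives
\begin{align*}
X' I_\eta(\hat\eta^{(r-1)}) X \hat\beta^{(r)} &= X' I_\eta(\hat\eta^{(r-1)}) X \hat\beta^{(r-1)} + X' S_\eta(\hat\eta^{(r-1)}) \\
X' I_\eta(\hat\eta^{(r-1)}) X \hat\beta^{(r)} &= X' I_\eta(\hat\eta^{(r-1)}) \left[ X \hat\beta^{(r-1)} + I_\eta^{-1}(\hat\eta^{(r-1)}) S_\eta(\hat\eta^{(r-1)}) \right]
\end{align*}
The Newton-Raphson algorithm for models of this sort is often referred to as iteratively re-weighted least squares.
To explain this, consider the following. Let $Z^{(r)}$ denote a vector of artificially constructed responses defined at the
(r − 1)st iteration, and $W^{(r-1)}$ be a matrix defined at the (r − 1)st iteration as follows:
\[
Z^{(r)} = X \hat\beta^{(r-1)} + I_\eta^{-1}(\hat\eta^{(r-1)}) S_\eta(\hat\eta^{(r-1)}), \qquad W^{(r-1)} = I_\eta(\hat\eta^{(r-1)}),
\]
so that
\[
\hat\beta^{(r)} = \left(X' W^{(r-1)} X\right)^{-1} X' W^{(r-1)} Z^{(r)}.
\]
This can be thought of as a series of weighted regressions of the vector $Z^{(r)}$ on a set of covariates X with weight
$W^{(r-1)}$. Since these calculations must be repeated, it is referred to as Iteratively Reweighted Least Squares (IRWLS, or
sometimes IWLS).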
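which gives
\[ \ell = \sum_{i=1}^{n} \left[ y_i \eta_i - m_i \log(1 + \exp\{\eta_i\}) \right] \]
Now differentiating with respect to $\eta_i$ we get
\[ \frac{\partial \ell}{\partial \eta_i} = y_i - m_i \exp\{\eta_i\}/(1 + \exp\{\eta_i\}) = y_i - m_i \pi_i \]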
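To make the Newton-Raphson iteration concrete, the following Python sketch applies it to the logistic model for the Bliss beetle data, using the score $\partial\ell/\partial\eta_i = y_i - m_i\pi_i$ derived above together with $-\partial^2\ell_i/\partial\eta_i^2 = m_i\pi_i(1-\pi_i)$. The starting value β = (0, 0)′, the tolerance and the function names are assumptions made for the illustration; this is not the algorithm as implemented in Splus.

```python
import math

dose = [1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839]
killed = [6, 13, 18, 28, 52, 53, 61, 60]
total = [59, 60, 62, 56, 63, 59, 62, 60]


def score_and_information(b0, b1):
    """S(beta) = X' S_eta and I(beta) = X' I_eta X for the binomial logit model."""
    s0 = s1 = i00 = i01 = i11 = 0.0
    for x, y, m in zip(dose, killed, total):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        resid = y - m * p          # d l_i / d eta_i
        w = m * p * (1.0 - p)      # -d^2 l_i / d eta_i^2
        s0 += resid
        s1 += x * resid
        i00 += w
        i01 += w * x
        i11 += w * x * x
    return (s0, s1), (i00, i01, i11)


def newton_raphson(b0=0.0, b1=0.0, tol=1e-8, max_iter=50):
    """Iterate beta <- beta + I^{-1}(beta) S(beta) until the step is small."""
    for _ in range(max_iter):
        (s0, s1), (i00, i01, i11) = score_and_information(b0, b1)
        det = i00 * i11 - i01 * i01
        step0 = (i11 * s0 - i01 * s1) / det    # first element of I^{-1} S
        step1 = (-i01 * s0 + i00 * s1) / det   # second element of I^{-1} S
        b0, b1 = b0 + step0, b1 + step1
        if math.hypot(step0, step1) <= tol:
            break
    # Standard errors from the diagonal of I^{-1} at the final estimate.
    _, (i00, i01, i11) = score_and_information(b0, b1)
    det = i00 * i11 - i01 * i01
    se = (math.sqrt(i11 / det), math.sqrt(i00 / det))
    return (b0, b1), se


beta_hat, se_hat = newton_raphson()
print(beta_hat, se_hat)
```

Because the logit link is canonical, the observed and expected information coincide, so this iteration is the same as iteratively reweighted least squares for this model.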
Suppose Yi ∼ Bin(mi , πi ), i = 1, 2, . . . , n and there is a three level covariate indicating which of three age groups an
individual belongs to (e.g. 18-44 years old, 45-64 years old, and 65+ years old). Let
where β1 = 0 is the required constraint. This method of writing down the model is particularly advantageous when
there are several covariates to be considered and they have several levels each. For example, if we had two factor
variables with 3 and 4 levels respectively, we could either write down a model based on 5 indicator variables explicitly,
or write
log (πjk /(1 − πjk )) = α + βj + γk , j = 1, 2, 3, k = 1, 2, 3, 4.
β1 = 0, γ1 = 0
These are sometimes called corner-point constraints (or treatment constraints in Splus). There are other constraints
that we can use, but these are often the most useful. We will return to this topic in the discussion of log-linear
models.
Let oij and eij denote the observed and expected cell counts for the (i, j) cell of the 2-way table. Consider a statistic
with the following form :
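\begin{align*}
\sum_{(i,j)} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}
&= \sum_{i=1}^{n} \left[ \frac{(y_i - m_i \hat\pi_i)^2}{m_i \hat\pi_i} + \frac{((m_i - y_i) - m_i(1 - \hat\pi_i))^2}{m_i (1 - \hat\pi_i)} \right] \\
&= \sum_{i=1}^{n} \left[ \frac{(y_i - m_i \hat\pi_i)^2}{m_i \hat\pi_i} + \frac{(y_i - m_i \hat\pi_i)^2}{m_i (1 - \hat\pi_i)} \right] \\
&= \sum_{i=1}^{n} \left[ \frac{(y_i - m_i \hat\pi_i)^2 (1 - \hat\pi_i)}{m_i \hat\pi_i (1 - \hat\pi_i)} + \frac{(y_i - m_i \hat\pi_i)^2\, \hat\pi_i}{m_i \hat\pi_i (1 - \hat\pi_i)} \right] \\
&= \sum_{i=1}^{n} \frac{(y_i - m_i \hat\pi_i)^2}{m_i \hat\pi_i (1 - \hat\pi_i)}
\end{align*}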
We have shown here that the Pearson χ2 statistic may be written in terms of all cells of the table, or in terms of
the binomial responses. The former way of writing it generalizes nicely to variables with more than two categories,
so we will see it again in this form when we consider log-linear models.
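The equivalence of the two forms of the Pearson statistic is easy to confirm numerically. The sketch below, in Python, uses the five neuroblastoma cells for the youngest age group together with their fitted values fv1 from the model with age and stage; any counts and fitted probabilities strictly between 0 and 1 would do, and the function names are illustrative.

```python
# Observed survivors y out of m, and fitted probabilities, for the five
# stage cells of the 0-11 month age group (rows 1 to 5 of neuro.dat).
y = [11, 15, 2, 5, 18]
m = [12, 16, 4, 18, 19]
p = [0.96502256, 0.88741175, 0.82369587, 0.25916718, 0.90848389]


def pearson_by_cells(y, m, p):
    """Sum of (o - e)^2 / e over both the success and failure cells."""
    total = 0.0
    for yi, mi, pi in zip(y, m, p):
        e_success = mi * pi
        e_failure = mi * (1.0 - pi)
        total += (yi - e_success) ** 2 / e_success
        total += ((mi - yi) - e_failure) ** 2 / e_failure
    return total


def pearson_binomial(y, m, p):
    """Sum of (y - m*pi)^2 / (m*pi*(1 - pi)) over the binomial responses."""
    return sum((yi - mi * pi) ** 2 / (mi * pi * (1.0 - pi))
               for yi, mi, pi in zip(y, m, p))


x2_cells = pearson_by_cells(y, m, p)
x2_binomial = pearson_binomial(y, m, p)
print(x2_cells, x2_binomial)
```

Each term of the second form is the square of the corresponding Pearson residual, so the statistic restricted to these cells is the sum of the first five squared entries of rp1.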
Questions
Problem 2.1.
Consider a 2 × 2 contingency table from a prospective study in which people who were or were not exposed to some
pollutant are followed-up and, after several years, categorized according to the presence or absence of a disease. The
table below shows the probabilities for each cell.
             Diseased   Not diseased
Exposed      π1         1 − π1
Unexposed    π2         1 − π2
The odds of disease for each exposure group is given by πi /(1 − πi ), i = 1, 2, and so the odds ratio
\[ \psi = \frac{\pi_1 (1 - \pi_2)}{\pi_2 (1 - \pi_1)} \]
is the relative odds of disease for the exposed versus the not exposed groups.
a. For the simple model πi = exp(αi )/(1 + exp(αi )) show that if there is no association between exposure status
and disease status, then there is no difference in the logits of the disease probabilities.
b. Consider J 2 × 2 tables like the table above, where there is one table for each combination of covariates in the
vector x = (1, x1 , x2 , . . . , xp )′ . Such data arise frequently when we wish to stratify by potentially confounding
variables such as sex, age (measured in decades perhaps), family history of disease, etc. Let xj denote the x for
stratum j, j = 1, . . . , J. For the logistic model
show that the log odds ratio relating exposure status and disease status is constant over all tables if β 1 = β 2 .
Problem 2.2.
Suppose Yk ∼ Bin(mk , πk ), k = 1, 2 are two independent binomial random variables. Let xk denote an explanatory
indicator variable which indicates which sample the response is from. Specifically xk = 1 if k = 1 and xk = 0 if k = 2.
Then consider the logistic regression model
log(πk /(1 − πk )) = β0 + β1 xk
a. Write down the likelihood in terms of π1 and π2 and then reparameterize it in terms of β0 and β1 .
b. Derive the score vector $S(\beta) = (S_1, S_2)' = (\partial\ell/\partial\beta_0, \partial\ell/\partial\beta_1)'$, and the information matrix I(β) with (j, k) entry
$I_{jk}(\beta) = -\partial^2\ell/\partial\beta_j\partial\beta_k$, j = 0, 1, k = 0, 1.
c. An asymptotic covariance matrix for β̂ is given by I −1 (β̂). Based on this show that the asymptotic variance for
β̂1 is estimated by
asvar(β̂1 ) = [1/y1 + 1/(m1 − y1 ) + 1/y2 + 1/(m2 − y2 )]
Hint: Upon finding the form of I(β), it will be convenient to re-express this matrix in terms of π1 and π2 for
simplification.
Problem 2.3.
The table below gives data reported by Gordon and Foss (1966). On each of 18 days very young babies in a hospital
nursery were chosen as subjects if they were not crying at a certain instant. One baby selected at random was rocked
for a set period, the remainder serving as controls. The numbers not crying at the end of a specified period are given
in the table. (There is no information about the extent to which the same infant enters the experiment on a number
of days so we will treat responses on different days as independent.)
a. Pool the data from the different days into a single 2 × 2 contingency table and test the hypothesis that the
probability of crying is the same for rocked and unrocked babies.
b. The analysis in (a) ignores the matching by days. To incorporate this aspect, re-analyse the data using a logistic
model with parameters for days, and control or experimental conditions. How well does it fit the data? Examine
the residuals to see if there are any patterns in the data which are not accounted for by the model. By fitting
a model which ignores the control or experimental effects, test the hypothesis that rocking does not affect the
probability of crying. What is the simplest model which describes the data well?
Day   No. of control babies   No. not crying (control)   No. of experimental babies   No. not crying (experimental)
1 8 3 1 1
2 6 2 1 1
3 5 1 1 1
4 6 1 1 0
5 5 4 1 1
6 9 4 1 1
7 8 5 1 1
8 8 4 1 1
9 5 3 1 1
10 9 8 1 0
11 6 5 1 1
12 9 8 1 1
13 8 5 1 1
14 5 4 1 1
15 6 4 1 1
16 8 7 1 1
17 6 4 1 0
18 8 5 1 1
Problem 2.4.
A study was carried out to investigate the effect of a new drug in reducing operative mortality following major
abdominal surgery. Patients were assigned to treatment or control groups within each of three categories of surgical
risk: high, medium and low. The results were as follows:
Surgical Risk
Treatment Outcome Low Medium High
Control Died 3 7 6
Survived 12 6 3
Treated Died 2 3 2
Survived 10 10 8
Denote by π the probability of death following surgery. Let x1 = 1 for subjects on treatment, with x1 = 0
otherwise, x2 = 1 for subjects at medium risk, with x2 = 0 otherwise, x3 = 1 for subjects at high risk, with
x3 = 0 otherwise, x12 = x1 x2 and x13 = x1 x3 , where x12 and x13 are the interaction terms for treatment and
risk group covariates. Express each of the following hypotheses as a logistic regression model for π in terms of
the appropriate linear predictor and determine the degrees of freedom for the deviance statistic. Do not impose any
additional constraints besides those explicitly stated in words, and define any additional covariates you require.
d. H4 : The logit transform of the π is linearly related to x4 , with a common slope for treatment and control groups.
e. Fit a logistic model to the data above using a model that contains one factor at two levels for treatment status
and one factor at three levels for surgical risk.
a) Compute the deviance residual for the cell consisting of control patients at the low level of surgical risk.
b) Based on the available information, does this model appear to provide a reasonable fit?
c) Carefully explain the findings regarding the effect of treatment. Incorporate an estimate of the odds ratio, a
95% confidence interval, and a statement regarding the overall significance of this finding.
d) Construct a 95 % confidence interval for the probability of surgical death for a control patient with a medium
level of surgical risk based on this model.
Problem 2.5.
Consider a study of the occurrence of infection following birth by Caesarian section. Let y = 1 denote the occurrence
of an infection, and y = 0 denote the absence of an infection. The investigators are interested in learning about risk
factors for infection and have identified three potential explanatory variables: i) xi1 = 1 if the Caesarian was planned
and xi1 = 0 otherwise, ii) xi2 = 1 if any risk factors were present (such as diabetic mother, obese mother, etc.), and
xi2 = 0 otherwise, and iii) xi3 = 1 if any antibiotics were given as a preventative measure, and xi3 = 0 otherwise.
The data are summarized as follows :
Caesarian Planned Caesarian Not Planned
Infection No Infection Infection No Infection
Antibiotics Risk Factors 1 17 11 87
Antibiotics No Risk Factors 0 2 0 0
No Antibiotics Risk Factors 28 30 23 3
No Antibiotics No Risk Factors 8 32 0 9
a. Fit the logistic regression model containing only the main effects for the three explanatory variables.
b. Based on the fitted model above construct a 95 % confidence interval for the probability of infection when the
Caesarian was not planned, no risk factors are present, and no antibiotics have been prescribed.
c. Construct a 95 % confidence interval for the odds ratio reflecting the change in risk of infection for a woman on
antibiotics with risk factors undergoing an unplanned Caesarian versus a woman not on antibiotics without risk
factors undergoing a planned Caesarian.
Problem 2.6.
The following table gives the analgesic effect of a treatment on patients with neuralgia. The response variable is given
in the last column and indicates whether there was pain or not. The duration variable represents the duration of the
complaint (in months) before treatment began.
a. Fit a simple logistic regression model to investigate the effect of treatment on the occurrence of pain, and interpret
the result. Report significance levels for any tests you carry out, as well as a point estimate and a 95 % confidence
interval for the corresponding odds ratio relating treatment to pain relief.
b. Do any of the other factors, taken in isolation, appear to have an effect on the occurrence of pain ?
c. Now consider models for simultaneously assessing the effect of these explanatory variables. Find the model you
feel adequately describes the variation in the response and interpret the effect of the explanatory variables as
above.
Problem 2.7.
The table below shows the number of strains of Staphylococcus aureus isolated from patients of a certain hospital
for each of the years 1947 to 1950. The strains were further classified as resistant or sensitive to 1 unit per ml of
penicillin.
The proportion of resistant strains appears to increase during the four year period and it would be useful to, i)
assess whether or not there are significant differences in the proportion of resistant strains over the four years, and
ii) try to characterize the nature of any apparent trend. You do not have to fit the models to answer the following
questions.
a. Suppose that X is a binary explanatory variable and Y is a binary response variable. Furthermore, suppose
P (X = 1|Y = 1) = γ1 , P (X = 0|Y = 1) = 1 − γ1 , P (X = 1|Y = 0) = γ0 , P (X = 0|Y = 0) = 1 − γ0 ,
P (Y = 1) = ζ and P (Y = 0) = 1 − ζ. Of interest is a model for P (Y = 1|X = x), the conditional probability of
the response given the explanatory variable.
i. Derive the expressions for P (Y = 1|X = 1) and P (Y = 1|X = 0). [2 marks]
ii. Consider two odds:
\[ \frac{P(Y = 1|X = 1)}{1 - P(Y = 1|X = 1)} \quad \text{and} \quad \frac{P(Y = 1|X = 0)}{1 - P(Y = 1|X = 0)}. \]
Show that the first odds divided by the second is the same as the odds ratio given by dividing [2 marks]
\[ \frac{P(X = 1|Y = 1)}{1 - P(X = 1|Y = 1)} \quad \text{by} \quad \frac{P(X = 1|Y = 0)}{1 - P(X = 1|Y = 0)}. \]
b. Let πi be the parameter representing the proportion of resistant strains for year 1946+i, i = 1, 2, 3, 4. Define
appropriate indicator variables and give the expression for the relevant saturated logistic regression model.
c. a) Provide a reduced model that you would use in combination with the saturated model to test the hypothesis
that there is no difference in the proportion of resistant strains over the four year period.
b) What are the degrees of freedom of this test?
d. Suppose you were interested in investigating the presence of a linear trend on the logit scale.
a) Write down the corresponding logistic model, and be sure to define any additional covariates you might
require.
b) What two models would you compare to assess the presence of a linear trend, and what would be the degrees
of freedom for this test ?
e. What other models would you consider for these data ?
Problem 2.8.
One design for longitudinal studies is to follow initially healthy individuals for a period of τ years and then determine
those who are still healthy, and those who have developed the disease in that period of time. At the beginning of the
period, information on suspected risk factors for the disease is obtained for each individual. Let xi be a p × 1 vector
of risk factors for the ith individual, and define
\[ y_i = \begin{cases} 1, & \text{if the individual develops the disease} \\ 0, & \text{otherwise.} \end{cases} \]
It is of interest to model P (Yi = 1|xi ), the conditional probability of developing the disease given risk factors
xi . Suppose that given Yi , xi is multivariate normal with f1 (xi |Yi = 1) ≡ M V Np (µ1 , Σ) and f0 (xi |Yi = 0) ≡
M V Np (µ0 , Σ), where µ1 , µ0 and Σ are unknown. Let π1 be the unconditional probability of developing the disease
and π0 = 1 − π1 .
Problem 2.9.
Does growing up in a right-handed world exert an overwhelming bias toward right-handedness? Data relevant to this
question were painstakingly gathered by Carter-Saltman (1980) and are displayed in Table 1. Eight hundred and
eight children were sampled and 408 of these were adopted and 400 were not adopted. The handedness of the child
and both parents was determined. If handedness is mostly biologically determined, one would expect to find more
right-handed children among those with right-handed parents, but for adopted children, one would expect to find
the same proportions of right-handed children regardless of the handedness of the parent. Note that in this study
there were no couples in which both parents were left-handed.
a. Test whether or not the handedness of the child is related to that of the biological parents.
b. Test whether or not the handedness of the child is related to that of the adoptive parents.
Problem 2.10.
A sample of 146 five-year-old children had their teeth examined and those with decayed, missing, or filled teeth
(dmft) were noted. From their address it was also determined whether their drinking water was fluoridated. Finally,
on the basis of various sociological measurements, their social class was designated as I, II, III, or unclassifiable. The
data are given in Table 2.
Table 2. Fluoridation and Dental Health
                Class I   Class II   Class III   Unclassified   Total
Fluoridated     12/117    26/170     11/52       24/118         73/457
Unfluoridated   12/56     48/146     29/64       49/104         138/370
Note: Each entry gives the number of children with dmft from the total number of children in that cell (Carmichael
et al., 1989).
a. Test the hypothesis that the effect of fluoridation does not depend on social class.
b. Assuming a uniform effect of fluoridation across the different social classes, give a 95% interval for the odds ratio
reflecting the effect of fluoridation and provide a careful interpretation.
c. Summarize the effect of social class, if any.
Problem 2.11.
a. Derive the form of the residual deviance for a logistic regression model. [4 marks]
b. The data in the following table were obtained from the Florida Department of Highway Safety and Motor Vehicles
(Bishop et al., 1975). They summarize the outcomes of car accidents and the status of the driver of the car with
respect to seat-belt use and whether the driver was ejected from the car during the accident.
                            Injury
Seatbelt    Ejected    Non-fatal    Fatal
Used        Yes             1105       14
            No           411,111      483
Not Used    Yes             4624      497
            No           157,342     1008
Problem 2.12.
In a biomedical study of the immuno-activating ability of two agents, TNF (tumor necrosis factor) and IFN (interferon),
to induce cell differentiation, the number of cells that exhibit markers of differentiation after exposure to TNF and/or
IFN was recorded. At each of the 16 dose combinations of TNF/IFN, 200 cells were examined. The number (y) of
cells differentiating in one trial and the corresponding dose levels of the two factors are given in the table below.
Number (y) of
Differentiating Cells Dose of TNF (U/ml) Dose of IFN (U/ml)
11 0 0
18 0 4
20 0 20
39 0 100
22 1 0
38 1 4
52 1 20
69 1 100
31 10 0
68 10 4
69 10 20
128 10 100
102 100 0
171 100 4
180 100 20
193 100 100
a. Define continuous explanatory variables xi1 and xi2 based on the TNF and IFN doses respectively. Further
define the interaction term xi3 = xi1 × xi2 . Find the logistic model you feel is most appropriate for describing
the distribution of the responses, and carefully interpret the effects of the covariates.
b. Construct plots based on the deviance residuals. What do you conclude from these plots ?
c. Find the best fitting model using the complementary log-log link and interpret your results.
d. Find the best fitting model using the probit link and interpret your results. No confidence intervals are required
for these analyses.
Problem 2.13.
The following data are from a study of the effects of viruses on chicken eggs. Eggs were injected with various dilutions
of a virus and were monitored daily up to day 18 after injection. At the end of the study, the eggs were classified
into three groups:
                 Number                   Alive
Concentration    of Eggs    Dead    Deformed    Not Deformed
18.8             16         4       1           11
232.5            19         8       8           3
3468.0           17         10      6           1
51680.0          19         17      2           0
Dose x1 · · · xi · · · xn
No. at risk m1 · · · mi · · · mn
Dead y11 · · · yi1 · · · yn1
(Alive) Deformed y12 · · · yi2 · · · yn2
(Alive) Not Deformed y13 · · · yi3 · · · yn3
Assume that for the ith dose there are probabilities πi1 , πi2 , πi3 with πi1 + πi2 + πi3 = 1 such that the numbers
yij (j = 1, 2, 3) follow the multinomial distribution with mi trials and parameters (πi1 , πi2 , πi3 ). Consider a model
which has two parts:
Problem 2.14.
If x is the concentration of a toxin, call z = log(x) the dose. Subjects exposed to this toxin are observed to respond
or not respond to it, leading to a binary response. If a sample of m subjects is exposed to the toxin at dose z, let y
denote the number that respond where y ∼ Bin(m, π(z)).
Suppose the tolerance distribution is normal, N (µ, σ 2 ). This implies the probit link is appropriate and the model
is
Φ(βo + β1 zi ) = πi i = 1, 2, . . . , n.
0.72 58 3
0.80 61 19
0.87 63 16
0.93 59 37
0.98 57 49
1.02 55 54
1.07 57 55
1.10 61 60
Problem 2.15.
One of the earliest bioassay experiments was designed to study the reaction of mice to varying concentrations of
insulin (0.001 IU). After exposure to the assigned concentration of insulin, the mice were observed for the occurrence
of convulsions. The data are summarized in Table 1.
Let mi denote the number of mice exposed in the ith sample, and yi the number convulsing, i = 1, . . . , 14.
Suppose Yi ∼ Bin(mi , πi ) are independently distributed, where πi is the probability of a mouse convulsing in sample
i, i = 1, . . . , 14. Suppose xi1 denotes the dose (log concentration) of insulin in sample i, and xi2 = 1 if the insulin is
prepared using the test preparation in sample i and xi2 = 0 otherwise, i = 1, . . . , 14.
Number of mice with convulsions exposed to different concentrations (0.001 IU) of insulin
under two different preparations (Finney, 1978)
         Standard preparation                  Test preparation
Conc.    With convulsions    Total    Conc.    With convulsions    Total
3.4      0                   33       6.5      2                   40
5.2      5                   32       10.0     10                  30
7.0      11                  38       14.0     18                  40
8.5      14                  37       21.5     21                  35
10.5     18                  40       29.0     27                  37
13.0     21                  37
18.0     23                  31
21.0     30                  37
28.0     27                  30
a. It is common to assume that each mouse has an inherent tolerance to insulin exposure and that doses above this
tolerance will cause it to convulse, while doses below this tolerance will not cause it to convulse. The distribution
characterizing the variation in this latent tolerance in the population is called the tolerance distribution. Suppose
there are two different Gaussian tolerance distributions, one for the mice exposed to insulin under the standard
preparation (say N (µS , σS2 )), and one for the mice exposed to insulin under the test preparation (say N (µT , σT2 )).
If we fit a dose-response model to data of this form, which link function is implied for a binary regression model?
Note: If Z ∼ N (µ, σ 2 ) then
$$f(z; \mu, \sigma) = \frac{\exp\left(-\tfrac{1}{2}(z - \mu)^2/\sigma^2\right)}{\sqrt{2\pi}\,\sigma}, \qquad -\infty < z < \infty$$
b. If the fitted dose-response model involved a linear predictor
give an expression for each of the regression coefficients in terms of the parameters of the two tolerance distributions.
c. Some Splus output is attached which reports the results of fitting several models to this data. Justify your
choice of the model you feel is most appropriate for describing the distribution of responses as a function of dose
and method of preparation. Report your conclusions regarding the impact of insulin exposure and method of
preparation on the probability that convulsions are experienced.
d. Let δS and δT denote the doses at which 50% of the population will respond under the standard and test
preparations respectively. Derive an expression for δS − δT in terms of the regression coefficients of your final
model
Problem 2.16.
Montgomery and Peck (1982) describe a study on the compressive strength of an alloy fastener used in the construction
of aircraft. Ten pressure loads, increasing in units of 200 psi from 2500 psi to 4300 psi, were used with different numbers
of fasteners being tested at each of these loads. The data in the table below refer to the number of fasteners failing
out of the number tested at each load.
Attached is some Splus code and output for the following questions. Pick one of the two models (fit1 or fit2) to
answer the following questions.
a. Estimate the effect of a 500 psi unit increase in the load on the odds of failure of a fastener. In your answer,
report an odds ratio and associated 95% confidence interval. [4 marks]
b. The maximum load per fastener under normal operation is 400 psi. Give an estimate and 95% confidence interval
for the probability of failure at this load. [4 marks]
c. Obtain the fitted value for the load of 2500 psi. If you had to estimate the probability of failure for a load of this
magnitude, would you report the raw proportion or the fitted value? Justify your answer based on the information
provided. [4 marks]
Log-linear Models
Suppose Y is a random variable with a Poisson distribution with mean µ. Then we denote this as Y ∼ Poisson(µ),
where
$$f(y) = \frac{\mu^{y} e^{-\mu}}{y!}, \qquad y = 0, 1, 2, \ldots$$
Furthermore E(Y) = µ and Var(Y) = µ defines the mean-variance relationship. For a single observation we can write
the log-likelihood in the exponential family form
ℓ(µ; y) = y log µ − µ − log y!
and identify θ = log µ, φ = 1, a(φ) = 1, b(θ) = e^θ, and c(y; φ) = − log y!. The mean is given in terms of these
functions as b′(θ) = e^θ = µ, and the variance as b′′(θ)a(φ) = e^θ · 1 = µ, so we retrieve what we would expect. As in the
binomial case, we can allow for possible overdispersion (extra-Poisson variation) by introducing another parameter
φ. We will return to this later.
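The identification b(θ) = e^θ can be checked numerically. The following is a minimal sketch, not part of the original notes: the mean value 3.5, the step size and the truncation point of the sums are illustrative assumptions. It compares finite-difference approximations of b′(θ) and b′′(θ) with the mean and variance computed directly from the Poisson probabilities.

```python
import math

def poisson_pmf(y, mu):
    """Poisson probability mu^y e^{-mu} / y!, evaluated on the log scale."""
    return math.exp(y * math.log(mu) - mu - math.lgamma(y + 1))

def b(theta):
    """Cumulant function b(theta) = exp(theta) of the Poisson family."""
    return math.exp(theta)

mu = 3.5               # illustrative mean (an assumption, not from the notes)
theta = math.log(mu)   # canonical parameter theta = log(mu)

# Central finite differences approximate b'(theta) and b''(theta).
h = 1e-4
b1 = (b(theta + h) - b(theta - h)) / (2 * h)
b2 = (b(theta + h) - 2 * b(theta) + b(theta - h)) / h ** 2

# Mean and variance from the pmf, truncating the sums far in the upper tail.
support = range(200)
mean = sum(y * poisson_pmf(y, mu) for y in support)
var = sum((y - mean) ** 2 * poisson_pmf(y, mu) for y in support)

print(b1, b2, mean, var)   # all four values should be close to mu
```

With φ = 1 and a(φ) = 1 the four printed numbers should agree with µ up to numerical error, which is the statement E(Y) = b′(θ) and Var(Y) = b′′(θ)a(φ).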
For a sample of n independent observations with Yi ∼ Poisson(µi), i = 1, 2, ..., n, we denote the data as
y = (y1, y2, ..., yn)′. The vector of corresponding means is µ = (µ1, µ2, ..., µn)′. Then
$$\Pr(Y_i = y_i,\ i = 1, \ldots, n;\ \mu) = \prod_{i=1}^{n} \frac{\mu_i^{y_i} e^{-\mu_i}}{y_i!} = \prod_{i=1}^{n} \mu_i^{y_i} \exp\Big(-\sum_{i=1}^{n} \mu_i\Big) \Big/ \prod_{i=1}^{n} y_i!$$
Frequently we are interested in modelling data from Poisson processes. A counting process N(t) is any non-decreasing
integer-valued function of time such that N(0) = 0 and N(t) is the number of events happening in (0, t]. A counting process
is a Poisson process if it has independent increments (i.e. N(s1 + t1) − N(s1) is independent of N(s2 + t2) − N(s2)
whenever the intervals (s1, s1 + t1] and (s2, s2 + t2] do not overlap) and the number of events occurring over (0, t] is
distributed as
$$P(N(t) = n; \lambda) = \frac{(\lambda t)^{n} e^{-\lambda t}}{n!}, \qquad n = 0, 1, 2, \ldots$$
In this case E(N(t)) = µ(t) = λt, and N(t) is a special kind of Poisson random variable. We therefore use the log-link
if we are interested in establishing regression methods. However, the mean depends on the length of the observation
period as well as on the rate λ, since log µ(t) = log λ + log t.
When we set up this regression model we have Ni(ti), i = 1, 2, ..., n corresponding to the event counts for each
subject where the ith subject is observed over (0, ti]. As before, we let xi denote a p × 1 column vector of explanatory
variables for this subject and hence
$$\log \mu_i(t_i) = \log t_i + x_i'\beta, \qquad i = 1, 2, \ldots, n.$$
The term log ti is called an offset term. It “explains” some variation in the event counts across subjects, but does so
in a deterministic way.
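To see what the offset does, consider the simplest case of a rate model with an intercept only, log µi = log ti + β0. The sketch below is an illustration rather than the analysis carried out in these notes: it uses the ship type A rows of the cargo ship table given later in this section, and the function name `fit_rate` and the starting value are assumptions. It runs Newton-Raphson on the Poisson log-likelihood and compares the result with the closed form total events divided by total exposure.

```python
import math

# (months of service, damage incidents) for ship type A, copied from the
# cargo ship table in this section.
rows = [(127, 0), (63, 0), (1095, 3), (1095, 4), (1512, 6), (3353, 18), (2244, 11)]

def fit_rate(rows, iterations=25):
    """Newton-Raphson for beta0 in the model log(mu_i) = log(t_i) + beta0."""
    total_y = sum(y for _, y in rows)
    total_t = sum(t for t, _ in rows)
    beta0 = math.log(total_y / total_t) + 1.0   # deliberately poor starting value
    for _ in range(iterations):
        fitted_total = math.exp(beta0) * total_t   # sum of fitted means
        score = total_y - fitted_total             # d loglik / d beta0
        information = fitted_total                 # -d^2 loglik / d beta0^2
        beta0 += score / information
    return beta0

beta0 = fit_rate(rows)
rate = math.exp(beta0)
closed_form = sum(y for _, y in rows) / sum(t for t, _ in rows)
print(rate, closed_form)   # estimated incidents per month of service
```

Because log ti enters with a fixed coefficient of one, exp(β0) is interpreted as a rate per unit of exposure rather than as an expected count.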
McCullagh and Nelder (1989) discuss the analysis of a data set which records the number of times a certain type of
damage incident occurs in cargo ships. The damage is caused by waves and occurs in the forward section of various
cargo carrying vessels. In order to prevent this type of damage from occurring in the future, the investigators want
to identify risk factors including ship type (A-E), year of construction (1960-1964; 1965-1969; 1970-1974; 1975-1979),
and period of operation (1960-1974; 1975-1979).
The data are summarized below, where we have adopted the coding convention that the ship type variable is 1, 2,
3, 4, and 5 for ship types A, B, C, D, and E respectively. The year of construction variable, cyr, is 1, 2, 3, and 4 for
eras 1960-1964, 1965-1969, 1970-1974, and 1975-1979 respectively, while the period of operation variable, oyr, is 1 for
1960-1974 and 2 for 1975-1979.
type cyr oyr months y
1 1 1 127 0
1 1 2 63 0
1 2 1 1095 3
1 2 2 1095 4
1 3 1 1512 6
1 3 2 3353 18
1 4 2 2244 11
2 1 1 44882 39
2 1 2 17176 29
2 2 1 28609 58
2 2 2 20370 53
2 3 1 7064 12
2 3 2 13099 44
2 4 2 7117 18
3 1 1 1179 1
3 1 2 552 1
3 2 1 781 0
3 2 2 676 1
3 3 1 783 6
3 3 2 1948 2
3 4 2 274 1
4 1 1 251 0
4 1 2 105 0
4 2 1 288 0
4 2 2 192 0
4 3 1 349 2
4 3 2 1208 11
4 4 2 2051 4
5 1 1 45 0
5 2 1 789 7
5 2 2 437 7
5 3 1 1157 5
5 3 2 2161 12
5 4 2 542 1
The program used to analyse the data is below.
ship.dat <- read.table("ship.dat", header=T)
ship.dat$typef <- factor(ship.dat$type)
ship.dat$typeft <- C(ship.dat$typef,treatment)
ship.dat$cyrf <- factor(ship.dat$cyr)
ship.dat$cyrft <- C(ship.dat$cyrf,treatment)
ship.dat$oyrf <- factor(ship.dat$oyr)
ship.dat$oyrft <- C(ship.dat$oyrf,treatment)
ship.dat
# main effects model; this fit is not shown in the original listing, and its form is
# assumed here from the first coefficient table reported below
model1 <- glm(y ~ typeft + cyrft + oyrft + offset(log(months)), family=poisson,
              data=ship.dat)
# testing for the association between ship type and frequency of events
model3a <- glm(y ~ cyrft + oyrft + offset(log(months)), family=poisson,
               data=ship.dat)
model3a$deviance
model3a$df
1-pchisq(model3a$deviance-model1$deviance,model3a$df-model1$df)
# testing for the association between year of construction and event frequency
model3b <- glm(y ~ typeft + oyrft + offset(log(months)), family=poisson,
               data=ship.dat)
model3b$deviance
model3b$df
1-pchisq(model3b$deviance-model1$deviance,model3b$df-model1$df)
# testing for the association between year of operation and event frequency
model3c <- glm(y ~ typeft + cyrft + offset(log(months)), family=poisson,
               data=ship.dat)
model3c$deviance
model3c$df
1-pchisq(model3c$deviance-model1$deviance,model3c$df-model1$df)
# testing for the interaction between type of ship and year of construction
model4 <- glm(y ~ typeft + cyrft + oyrft + typeft*cyrft + offset(log(months)),
              family=poisson, data=ship.dat)
model4$deviance
model4$df
1-pchisq(model1$deviance-model4$deviance,model1$df-model4$df)
summary(model4)
mrho <- summary(model4)$correlation
mrho
# testing for the interaction between type of ship and year of operation
model5 <- glm(y ~ typeft + cyrft + oyrft + typeft*oyrft + offset(log(months)),
              family=poisson, data=ship.dat)
1-pchisq(model1$deviance-model5$deviance,model1$df-model5$df)
# deviance residual plot for the main effects model
motif()
ship.dat$rdeviance <- residuals.glm(model1,type="deviance")
plot(model1$fitted.values,ship.dat$rdeviance,ylim=c(-4,4),
     xlab="FITTED VALUES",ylab="DEVIANCE RESIDUALS")
abline(h=-2)
abline(h= 2)
Coefficients:
Value Std. Error t value
(Intercept) -6.4059016 0.2174315 -29.461700
typeft2 -0.5433443 0.1775869 -3.059596
typeft3 -0.6873773 0.3281646 -2.094611
typeft4 -0.0759614 0.2905555 -0.261435
typeft5 0.3255795 0.2358758 1.380300
cyrft2 0.6971404 0.1496286 4.659139
cyrft3 0.8184266 0.1697504 4.821354
cyrft4 0.4534266 0.2331490 1.944793
oyrft 0.3844670 0.1182584 3.251077
Correlation of Coefficients:
(Intercept) typeft2 typeft3 typeft4 typeft5 cyrft2
typeft2 -0.8114317
typeft3 -0.3794021 0.4343318
typeft4 -0.3706315 0.4468646 0.2381810
typeft5 -0.4699103 0.5706801 0.3144642 0.3338068
cyrft2 -0.4842330 0.0855953 0.0359326 0.0276773 -0.0040845
cyrft3 -0.5500980 0.2713808 0.0456319 0.0285842 -0.0370840 0.6334918
cyrft4 -0.4015248 0.2284334 0.0973273 -0.0965854 0.0527938 0.4755087
oyrft -0.2161077 0.0253826 -0.0030794 -0.0047295 0.0269346 -0.1200544
cyrft3 cyrft4
typeft2
typeft3
typeft4
typeft5
cyrft2
cyrft3
cyrft4 0.5482314
oyrft -0.2635932 -0.3153428
Coefficients:
Value Std. Error t value
(Intercept) -5.5939581 0.8721424 -6.4140418
typeft2 -0.3498652 0.2701854 -1.2949080
typeft3 -0.7630863 0.3375247 -2.2608311
typeft4 -0.1354756 0.2970846 -0.4560169
typeft5 0.2739282 0.2417527 1.1330923
cyrft2 0.6625407 0.1536194 4.3128721
cyrft3 0.7597379 0.1776244 4.2772168
cyrft4 0.3697408 0.2458086 1.5041821
oyrft 0.3702568 0.1181308 3.1342948
log(months) 0.9027077 0.1017761 8.8695429
Coefficients:
Value Std. Error t value
(Intercept) -14.4765415 56.9623361 -0.254142342
typeft2 7.5380111 56.9624437 0.132333001
The p−value from the likelihood ratio test is small suggesting we may want to consider putting the interaction term in the
model for the type of ship and the era of construction. However, if we examine the asymptotic standard errors then we see they
are huge. Examination of the correlation matrix of the regression parameter estimators reveals the model is overparameterized.
> mrho <- summary(model4)$correlation
> mrho
(Intercept) typeft2 typeft3 typeft4
(Intercept) 1.0000000000 -0.9999974528 -9.999226e-01 -7.100246e-01
typeft2 -0.9999974528 1.0000000000 9.999207e-01 7.100232e-01
typeft3 -0.9999226000 0.9999206894 1.000000e+00 7.099701e-01
typeft4 -0.7100245778 0.7100231913 7.099701e-01 1.000000e+00
typeft5 -0.5750728851 0.5750714203 5.750284e-01 4.083159e-01
cyrft2 -0.9999772188 0.9999756009 9.999009e-01 7.100091e-01
cyrft3 -0.9999923094 0.9999909571 9.999163e-01 7.100200e-01
cyrft4 -0.9999838227 0.9999828370 9.999082e-01 7.100143e-01
oyrft -0.0008781622 0.0001284504 2.922393e-05 6.050829e-05
typeft2cyrft2 0.9999739057 -0.9999765836 -9.998973e-01 -7.100066e-01
typeft3cyrft2 0.9997465112 -0.9997446560 -9.998240e-01 -7.098451e-01
typeft4cyrft2 0.5777819250 -0.5777808871 -5.777377e-01 -8.137493e-01
typeft5cyrft2 0.5750664934 -0.5750651601 -5.750221e-01 -4.083114e-01
typeft2cyrft3 0.9999882256 -0.9999908253 -9.999115e-01 -7.100167e-01
typeft3cyrft3 0.9998969586 -0.9998950168 -9.999743e-01 -7.099519e-01
typeft4cyrft3 0.7100181198 -0.7100166552 -7.099636e-01 -9.999908e-01
typeft5cyrft3 0.5750699170 -0.5750684812 -5.750254e-01 -4.083138e-01
typeft2cyrft4 0.9999748840 -0.9999774311 -9.998981e-01 -7.100072e-01
typeft3cyrft4 0.9997545752 -0.9997526650 -9.998320e-01 -7.098508e-01
typeft4cyrft4 0.7100057744 -0.7100043879 -7.099513e-01 -9.999735e-01
typeft5cyrft4 0.5750409171 -0.5750394524 -5.749964e-01 -4.082932e-01
typeft5 cyrft2 cyrft3 cyrft4
(Intercept) -0.5750728851 -0.9999772188 -0.9999923094 -0.999983823
typeft2 0.5750714203 0.9999756009 0.9999909571 0.999982837
typeft3 0.5750283745 0.9999008726 0.9999162628 0.999908192
typeft4 0.4083158825 0.7100091003 0.7100200145 0.710014264
typeft5 1.0000000000 0.5750597843 0.5750684625 0.575063582
cyrft2 0.5750597843 1.0000000000 0.9999715036 0.999963623
cyrft3 0.5750684625 0.9999715036 1.0000000000 0.999979451
cyrft4 0.5750635820 0.9999636231 0.9999794514 1.000000000
oyrft 0.0005050073 -0.0003612134 -0.0007155276 -0.001204505
typeft2cyrft2 -0.5750578790 -0.9999962961 -0.9999676881 -0.999959653
typeft3cyrft2 -0.5749271106 -0.9997688942 -0.9997402920 -0.999732258
typeft4cyrft2 -0.3322667186 -0.5777949176 -0.5777784038 -0.577773783
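One plausible contributor to the huge standard errors and near-perfect correlations, offered here as an assumption rather than as a statement from the notes, is that some ship type by construction era combinations recorded no damage incidents at all. A model with a separate parameter for every such combination fits each combination's total count exactly, so a total of zero pushes the corresponding linear predictor towards minus infinity. The sketch below simply tabulates the incident totals from the ship damage table and lists the combinations with no incidents.

```python
from collections import defaultdict

# (type, cyr, oyr, months, y) rows copied from the ship damage table above.
ship = [
    (1, 1, 1, 127, 0), (1, 1, 2, 63, 0), (1, 2, 1, 1095, 3), (1, 2, 2, 1095, 4),
    (1, 3, 1, 1512, 6), (1, 3, 2, 3353, 18), (1, 4, 2, 2244, 11),
    (2, 1, 1, 44882, 39), (2, 1, 2, 17176, 29), (2, 2, 1, 28609, 58),
    (2, 2, 2, 20370, 53), (2, 3, 1, 7064, 12), (2, 3, 2, 13099, 44),
    (2, 4, 2, 7117, 18),
    (3, 1, 1, 1179, 1), (3, 1, 2, 552, 1), (3, 2, 1, 781, 0), (3, 2, 2, 676, 1),
    (3, 3, 1, 783, 6), (3, 3, 2, 1948, 2), (3, 4, 2, 274, 1),
    (4, 1, 1, 251, 0), (4, 1, 2, 105, 0), (4, 2, 1, 288, 0), (4, 2, 2, 192, 0),
    (4, 3, 1, 349, 2), (4, 3, 2, 1208, 11), (4, 4, 2, 2051, 4),
    (5, 1, 1, 45, 0), (5, 2, 1, 789, 7), (5, 2, 2, 437, 7), (5, 3, 1, 1157, 5),
    (5, 3, 2, 2161, 12), (5, 4, 2, 542, 1),
]

# Total number of incidents in each (ship type, construction era) combination.
events = defaultdict(int)
for ship_type, cyr, oyr, months, y in ship:
    events[(ship_type, cyr)] += y

zero_cells = sorted(cell for cell, total in events.items() if total == 0)
print(zero_cells)   # [(1, 1), (4, 1), (4, 2), (5, 1)]
```

The reference combination (type 1, era 1) is among those with no incidents, which would be consistent with the very negative intercept and its large standard error in the output for the interaction model.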
Fig. 3.1. Deviance Residual Plot for Model 1 of Ship Damage Data (horizontal axis: fitted values; vertical axis: deviance residuals)
3.2.1 Introduction
Contingency tables can be formed to display data when all response variables are categorical. A two-dimensional
contingency table is formed by the cross-classification of two variables and the observations consist of the cell counts
of that contingency table. The basic assumption in a contingency table like this is that each cell frequency has
an independent Poisson distribution with mean µij for the (i, j) cell. When we condition on the relevant “total”
frequencies (i.e. possibly those terms fixed by design) we get a multinomial, product multinomial, or non-central
hypergeometric distribution.
Suppose a sample of Y.. = y.. individuals is cross-classified according to two categorical variables V and W with I
and J levels respectively. The data may then be summarized in the following contingency table:
                                 FACTOR W
               1     2     3    ···    j    ···    J     Total
           1   y11   y12   y13  ···   y1j   ···   y1J    y1.
           2   y21   y22   y23  ···   y2j   ···   y2J    y2.
           3   y31   y32   y33  ···   y3j   ···   y3J    y3.
FACTOR V   ⋮
           i   yi1   yi2   yi3  ···   yij   ···   yiJ    yi.
           ⋮
           I   yI1   yI2   yI3  ···   yIj   ···   yIJ    yI.
       Total   y.1   y.2   y.3  ···   y.j   ···   y.J    y..
We must condition on Y.. = y.. since this is fixed by the design, and we know
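$$\Pr(Y_{\cdot\cdot} = y_{\cdot\cdot}) = \frac{\mu_{\cdot\cdot}^{y_{\cdot\cdot}} e^{-\mu_{\cdot\cdot}}}{y_{\cdot\cdot}!}$$
where µ.. = Σi Σj µij. Therefore
$$
\begin{aligned}
\Pr(Y_{ij} = y_{ij},\ i = 1, \ldots, I,\ j = 1, \ldots, J \mid Y_{\cdot\cdot} = y_{\cdot\cdot})
&= \frac{\prod_i \prod_j \mu_{ij}^{y_{ij}} e^{-\mu_{ij}} / y_{ij}!}{\mu_{\cdot\cdot}^{y_{\cdot\cdot}} \exp\{-\mu_{\cdot\cdot}\} / y_{\cdot\cdot}!} \\
&= \frac{\left(\prod_i \prod_j \mu_{ij}^{y_{ij}}\right) \left(\prod_i \prod_j e^{-\mu_{ij}}\right) \left(\prod_i \prod_j \frac{1}{y_{ij}!}\right)}{\mu_{\cdot\cdot}^{y_{\cdot\cdot}} \exp\{-\mu_{\cdot\cdot}\} / y_{\cdot\cdot}!} \\
&= \frac{y_{\cdot\cdot}!}{\prod_i \prod_j y_{ij}!} \prod_i \prod_j \left(\frac{\mu_{ij}}{\mu_{\cdot\cdot}}\right)^{y_{ij}} \\
&= \frac{y_{\cdot\cdot}!}{\prod_i \prod_j y_{ij}!} \prod_i \prod_j \pi_{ij}^{y_{ij}}
\end{aligned}
$$
where πij = µij/µ.. , which is a multinomial distribution since Σi Σj πij = 1. The corresponding log-likelihood is
$$\ell(\pi) = \sum_i \sum_j y_{ij} \log \pi_{ij}$$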
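where π = (π11, . . . , πIJ)′. One might be interested in testing independence of the two methods of classification where
H0 : πij = πi. π.j and HA : πij ≠ πi. π.j (i.e. there is some association). The log-likelihood under H0 is
$$
\begin{aligned}
\ell(\pi) &= \sum_i \sum_j y_{ij} \log(\pi_{i\cdot} \pi_{\cdot j}) \\
&= \sum_i \sum_j y_{ij} (\log \pi_{i\cdot} + \log \pi_{\cdot j}) \\
&= \sum_i y_{i\cdot} \log \pi_{i\cdot} + \sum_j y_{\cdot j} \log \pi_{\cdot j}
\end{aligned}
$$
The MLEs of πi. and π.j under H0 are given by π̂i. = yi./y.. and π̂.j = y.j/y.. respectively, i = 1, 2, ..., I, j = 1, 2, ..., J,
and the fitted values under H0 are
$$\hat{\mu}_{ij} = y_{\cdot\cdot} \hat{\pi}_{ij} = y_{\cdot\cdot} \, \frac{y_{i\cdot}}{y_{\cdot\cdot}} \, \frac{y_{\cdot j}}{y_{\cdot\cdot}}.$$
Therefore,
$$\ell(\hat{\pi}) = \sum_i \sum_j y_{ij} \log \frac{y_{i\cdot} y_{\cdot j}}{y_{\cdot\cdot}^2}$$
The MLEs of πij under HA are given by π̃ij = yij/y.. . Therefore
$$\ell(\tilde{\pi}) = \sum_i \sum_j y_{ij} \log \frac{y_{ij}}{y_{\cdot\cdot}}.$$
The likelihood ratio statistic is then
$$
\begin{aligned}
D &= 2(\ell(\tilde{\pi}) - \ell(\hat{\pi})) \\
&= 2 \sum_i \sum_j y_{ij} \log \left( \frac{y_{ij}}{y_{\cdot\cdot}} \Big/ \frac{y_{i\cdot} y_{\cdot j}}{y_{\cdot\cdot}^2} \right) \\
&= 2 \sum_i \sum_j y_{ij} \log \left( y_{ij} \Big/ \frac{y_{i\cdot} y_{\cdot j}}{y_{\cdot\cdot}} \right) \\
&= 2 \sum_i \sum_j O_{ij} \log(O_{ij} / E_{ij})
\end{aligned}
$$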
where Oij are the observed cell counts and Eij are the expected values under the null hypothesis. The degrees of
freedom are given by IJ − ((I − 1) + (J − 1) + 1) = (I − 1)(J − 1). This is a common form of the goodness of fit
statistics that we have seen before.
Alternatively, we could use the Pearson statistic,
$$T = \sum_i \sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
which is also asymptotically χ2 distributed under the null hypothesis with the same degrees of freedom.
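As a worked illustration of the two statistics, the sketch below computes D and T for the 4 × 3 melanoma table analysed in Section 3.4 (rows are histological types, columns are tumour sites). The function name `independence_tests` is an assumption made for this example, and the convention that a zero count contributes nothing to D is the usual limiting value rather than something stated in the notes.

```python
import math

def independence_tests(table):
    """Return (deviance D, Pearson T, degrees of freedom) for a two-way table."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    deviance = 0.0
    pearson = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n   # E_ij = y_i. y_.j / y_..
            if observed > 0:                         # 0 * log(0) is taken as 0
                deviance += 2.0 * observed * math.log(observed / expected)
            pearson += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return deviance, pearson, df

# Melanoma counts from Section 3.4: rows = type 1..4, columns = site 1..3.
melanoma = [[22, 2, 10], [16, 54, 115], [19, 33, 73], [11, 17, 28]]
D, T, df = independence_tests(melanoma)
print(round(D, 2), round(T, 2), df)
```

Both statistics would be referred to a χ2 distribution with (I − 1)(J − 1) = 6 degrees of freedom.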
                                   FACTOR
                 1     2     3    ···    j    ···    J     Total
             1   y11   y12   y13  ···   y1j   ···   y1J    y1.
             2   y21   y22   y23  ···   y2j   ···   y2J    y2.
             3   y31   y32   y33  ···   y3j   ···   y3J    y3.
POPULATION   ⋮
             i   yi1   yi2   yi3  ···   yij   ···   yiJ    yi.
             ⋮
             I   yI1   yI2   yI3  ···   yIj   ···   yIJ    yI.
         Total   y.1   y.2   y.3  ···   y.j   ···   y.J    y..
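$$\Pr(Y_{i\cdot} = y_{i\cdot}) = \frac{\mu_{i\cdot}^{y_{i\cdot}} e^{-\mu_{i\cdot}}}{y_{i\cdot}!}$$
$$
\begin{aligned}
\frac{\prod_i \prod_j \mu_{ij}^{y_{ij}} e^{-\mu_{ij}} / y_{ij}!}{\prod_{i=1}^{I} \mu_{i\cdot}^{y_{i\cdot}} e^{-\mu_{i\cdot}} / y_{i\cdot}!}
&= \frac{\left(\prod_i \prod_j \mu_{ij}^{y_{ij}}\right) \left(\prod_i \prod_j \frac{1}{y_{ij}!}\right)}{\left(\prod_{i=1}^{I} \mu_{i\cdot}^{y_{i\cdot}}\right) \left(\prod_{i=1}^{I} \frac{1}{y_{i\cdot}!}\right)} \\
&= \prod_{i=1}^{I} \left[ \frac{y_{i\cdot}!}{\prod_j y_{ij}!} \prod_{j=1}^{J} \left(\frac{\mu_{ij}}{\mu_{i\cdot}}\right)^{y_{ij}} \right]
\end{aligned}
$$
where πij = µij/µi. . This is a product multinomial distribution where Σj πij = 1, for i = 1, . . . , I. Note then that the
πij's have a different interpretation here. The corresponding log-likelihood is
$$\ell(\pi) = \sum_{i=1}^{I} \sum_{j=1}^{J} y_{ij} \log \pi_{ij}, \qquad \sum_{j=1}^{J} \pi_{ij} = 1$$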
where π̂j = y.j/y.. , j = 1, 2, ..., J, are the MLEs under H0. Therefore
$$\ell(\hat{\pi}) = \sum_i \sum_j y_{ij} \log(y_{\cdot j} / y_{\cdot\cdot}).$$
where π̃ij = yij/yi. , j = 1, 2, ..., J, i = 1, 2, ..., I. The likelihood ratio statistic takes the form
$$
\begin{aligned}
D &= 2(\ell(\tilde{\pi}) - \ell(\hat{\pi})) \\
&= 2 \sum_i \sum_j y_{ij} \log \left( \frac{y_{ij}}{y_{i\cdot}} \Big/ \frac{y_{\cdot j}}{y_{\cdot\cdot}} \right) \\
&= 2 \sum_i \sum_j y_{ij} \log \left( y_{ij} / (y_{i\cdot} y_{\cdot j} / y_{\cdot\cdot}) \right) \\
&= 2 \sum_i \sum_j O_{ij} \log \frac{O_{ij}}{E_{ij}}
\end{aligned}
$$
The degrees of freedom is given by I(J − 1) − (J − 1) = (I − 1)(J − 1). Note that you get the same test statistic and
degrees of freedom for either multinomial or product multinomial sampling scheme! Alternatively, we could use the
Pearson chi-squared statistic given by
$$T = \sum_i \sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
which is also asymptotically χ2 distributed under the null hypothesis with the same degrees of freedom.
Here we will show how associations in contingency tables can be investigated by the use of Poisson regression models.
Consider the following contingency table where we initially assume Yij ∼ Poisson(µij )
                                 FACTOR W
               1     2     3    ···    j    ···    J     Total
           1   y11   y12   y13  ···   y1j   ···   y1J    y1.
           2   y21   y22   y23  ···   y2j   ···   y2J    y2.
           3   y31   y32   y33  ···   y3j   ···   y3J    y3.
FACTOR V   ⋮
           i   yi1   yi2   yi3  ···   yij   ···   yiJ    yi.
           ⋮
           I   yI1   yI2   yI3  ···   yIj   ···   yIJ    yI.
       Total   y.1   y.2   y.3  ···   y.j   ···   y.J    y..
If y.. is fixed by design, as it often is, we work with the conditional distribution of yij, i = 1, 2, ..., I, j = 1, 2, ..., J,
given y.. . We have already shown that this is a multinomial distribution,
$$\Pr(Y_{ij} = y_{ij},\ i = 1, 2, \ldots, I,\ j = 1, 2, \ldots, J \mid Y_{\cdot\cdot} = y_{\cdot\cdot}) = \frac{y_{\cdot\cdot}!}{\prod_i \prod_j y_{ij}!} \prod_i \prod_j \pi_{ij}^{y_{ij}}$$
where πij = µij/µ.. . For this multinomial sampling plan we are often interested in testing
$$H_0 : \pi_{ij} = \pi_{i\cdot} \pi_{\cdot j}, \qquad i = 1, 2, \ldots, I, \quad j = 1, 2, \ldots, J,$$
or equivalently that µij = E(Yij) = y..πi. π.j . Traditionally we would estimate πi. and π.j for i = 1, 2, . . . , I and
j = 1, 2, . . . , J from the marginal totals. Next we would estimate the expected value of the (i, j) cell entry using
these sample estimates, and perhaps compute a Pearson statistic or a deviance statistic to carry out a formal test.
Interestingly, there is a log-linear model we can fit to assess this hypothesis, and if this model fits the data well,
then we would conclude that there is little evidence against the null hypothesis of independence. In this case, the
corresponding log-linear model under H0 is
$$\log \mu_{ij} = u + u^{V}_{i} + u^{W}_{j}, \qquad i = 1, 2, \ldots, I, \quad j = 1, 2, \ldots, J \qquad (1)$$
where V is the variable corresponding to factor 1, W corresponds to factor 2, and i and j denote the levels. The
notation is different from that we have used in other regression contexts because in log-linear models we are typically
dealing with several variables each with several levels and the expression for the linear predictors using the previous
“covariate” notation becomes too long.
Under HA : πij ≠ πi. π.j , or equivalently µij = E(Yij) ≠ y..πi. π.j , the corresponding log-linear model is saturated
and given by
$$\log \mu_{ij} = u + u^{V}_{i} + u^{W}_{j} + u^{VW}_{ij}, \qquad i = 1, 2, \ldots, I, \quad j = 1, 2, \ldots, J \qquad (2)$$
The test for independence is based on testing the fit of model (1) vs. model (2).
Model (2) defines the most general log-linear model for a two-way table. It introduces 1+I +J +IJ parameters, so
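constraints are needed since we only have IJ observations. Corner-point constraints are obtained using treatment
contrasts in Splus. Here we set
$$u^{V}_{1} = 0, \quad u^{W}_{1} = 0, \quad u^{VW}_{1j} = 0,\ j = 1, 2, \ldots, J, \quad u^{VW}_{i1} = 0,\ i = 1, 2, \ldots, I.$$
If we instead use sum-to-zero (ANOVA-type) constraints, in which each set of effects sums to zero over each of its
indices, this in turn gives
$$u = \sum_i \sum_j \log \mu_{ij} / IJ, \qquad u^{V}_{i} = \sum_j \log \mu_{ij} / J - u, \qquad u^{W}_{j} = \sum_i \log \mu_{ij} / I - u,$$
and
$$u^{VW}_{ij} = \log \mu_{ij} - u - u^{V}_{i} - u^{W}_{j} = \log \mu_{ij} - u - \Big(\sum_j \log \mu_{ij} / J - u\Big) - \Big(\sum_i \log \mu_{ij} / I - u\Big).$$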
As can be seen these are analogous to the constraints and interpretation of effects arising in ordinary ANOVA models.
Here parameters are interpreted in terms of deviations from lower order terms.
Maximum likelihood estimation for a two-way model proceeds as follows. Note that
$$
\begin{aligned}
\Pr(Y_{ij} = y_{ij},\ i = 1, 2, \ldots, I,\ j = 1, 2, \ldots, J) &= \prod_i \prod_j \frac{\mu_{ij}^{y_{ij}} e^{-\mu_{ij}}}{y_{ij}!} \\
&= \prod_i \prod_j \exp(y_{ij} \log \mu_{ij} - \mu_{ij} - \log y_{ij}!)
\end{aligned}
$$
We can make the substitution according to any particular log-linear model. For the saturated model, for example,
we let $\log \mu_{ij} = u + u^{V}_{i} + u^{W}_{j} + u^{VW}_{ij}$, i = 1, 2, ..., I, j = 1, 2, ..., J, substitute
$\mu_{ij} = \exp\{u + u^{V}_{i} + u^{W}_{j} + u^{VW}_{ij}\}$ in the log-likelihood, and maximize subject to corner-point or ANOVA constraints.
How is it that we can use the Poisson distribution to derive the MLEs in this context when in fact Y.. = y..
was fixed and the multinomial distribution was appropriate? We showed the relationship between the Poisson
log-likelihood and the multinomial log-likelihood for a special case in class, but the result holds quite generally. If the
log-likelihoods are related, then it is not surprising that for some log-linear models the estimates of the means are the
same for the Poisson distribution and the corresponding multinomial or product multinomial distribution. For this
to hold, certain terms must be included in the log-linear regression model. The appropriate
terms are the parameters that correspond to the marginal totals that were fixed by design. For example
To illustrate why this is the case, we focus on the case of a product multinomial distribution. We note that
we want to base inference on P r(Yij = yij , i = 1, 2, ..., I, j = 1, 2, ..., J|y1. , ..., yI. ). Since the responses in different
samples are independent, we can write this as
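$$\prod_{i=1}^{I} \Pr(Y_{ij} = y_{ij},\ j = 1, 2, \ldots, J \mid Y_{i\cdot} = y_{i\cdot}).$$
If we focus on a single sample, say the ith one, we note that Yi. is Poisson(µi.), i = 1, . . . , I. We can then write
$$\Pr(Y_{ij} = y_{ij},\ j = 1, 2, \ldots, J \mid Y_{i\cdot} = y_{i\cdot}) = \frac{\prod_{j=1}^{J} \Pr(Y_{ij} = y_{ij})}{\Pr(Y_{i\cdot} = y_{i\cdot})}.$$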
Provided we include appropriate terms so that yi. are fit exactly, then because Yi. is Poisson the log likelihood arising
from the conditional distribution on the left hand side is the same as the log likelihood arising from the right-hand
side. This is because the conditioning statistic (here yi. ) is Poisson and when it is fit exactly the corresponding
log-likelihood contribution is zero!
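The relationship between the conditional (multinomial) likelihood and the Poisson likelihoods for a single sample can be verified numerically. The sketch below is illustrative only: the cell means and counts are made-up values, and the helper names are assumptions. It checks that the multinomial log-probability with πj = µj/µ. equals the sum of the Poisson log-probabilities of the cells minus the Poisson log-probability of the total.

```python
import math

def log_poisson(y, mu):
    """log of mu^y e^{-mu} / y!"""
    return y * math.log(mu) - mu - math.lgamma(y + 1)

def log_multinomial(counts, probs):
    """log of n! / prod(y_j!) * prod(p_j^y_j), with n = sum of the counts."""
    out = math.lgamma(sum(counts) + 1)
    for y, p in zip(counts, probs):
        out += y * math.log(p) - math.lgamma(y + 1)
    return out

mu = [2.0, 5.5, 1.25]   # illustrative cell means for one sample (assumption)
y = [3, 4, 0]           # illustrative cell counts for the same sample
mu_total = sum(mu)
y_total = sum(y)

# Left-hand side: conditional distribution of the cells given their total.
lhs = log_multinomial(y, [m / mu_total for m in mu])
# Right-hand side: independent Poisson cells divided by the Poisson total.
rhs = sum(log_poisson(a, m) for a, m in zip(y, mu)) - log_poisson(y_total, mu_total)

print(lhs, rhs)   # the two values should agree up to rounding error
```

The term subtracted on the right depends on the means only through the total µ. , which is why fitting the fixed totals exactly leaves the remaining parameters with the same likelihood under either sampling model.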
Additional Notes:
a. Log-linear models derive their name from the fact that the effects are linear on the log scale.
b. In log-linear models there is no distinction between explanatory and response variables.
c. Log linear models can be used to analyse relationships between several variables. As a result, interaction terms
become of primary interest.
d. Log linear models are said to be hierarchical in the sense that higher order terms are not included in the model
unless all related lower order terms are. That is
3.4 Application
A cross-sectional study was conducted in which y.. = 400 patients with malignant melanoma were classified
according to two factors, the site of the tumour and the histological type. The data are as follows.
Here we carry out an analysis to investigate whether the different types of tumour appear equally likely in the
different sites. That is we are assessing whether or not there is an association between histological type and tumour
site.
We begin with the assumption that Yij ∼ Poisson(µij), i = 1, 2, 3, 4, j = 1, 2, 3. We then condition on Y.. = y.. ,
since it is fixed by design, leading to the multinomial distribution
$$P(Y_{ij} = y_{ij},\ \forall (i, j) \mid Y_{\cdot\cdot} = y_{\cdot\cdot}) = \frac{y_{\cdot\cdot}!}{\prod_i \prod_j y_{ij}!} \prod_i \prod_j \pi_{ij}^{y_{ij}}$$
where πij = µij/µ.. . The null hypothesis is that the tumour type and site are independent, giving H0 : πij = πi. π.j ,
i = 1, 2, 3, 4, j = 1, 2, 3. Under H0 : µij = E(Yij) = y..πi. π.j , meaning we will have to fit the row and column totals to
allow estimation of πi. , i = 1, 2, 3, 4 and π.j , j = 1, 2, 3. Thus our log-linear model under the null hypothesis is
$$\log \mu_{ij} = u + u^{V}_{i} + u^{W}_{j}, \qquad i = 1, 2, 3, 4, \quad j = 1, 2, 3,$$
where V corresponds to the tumour type variable (i indicating the level), and W corresponds to the tumour site variable (j
indicating the level). If the model appears to fit the data well, then we have little evidence against the assumption
that tumour type and site are independent. If the model does not fit the data well, then some tumour types appear
more frequently in certain locations.
Here we have the Splus data file and the corresponding program.
type locat y
1 1 22
1 2 2
1 3 10
2 1 16
2 2 54
2 3 115
3 1 19
3 2 33
3 3 73
4 1 11
4 2 17
4 3 28
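derm.dat <- read.table("derm.dat", header=T)
derm.dat$typef <- factor(derm.dat$type)
derm.dat$typeft <- C(derm.dat$typef,treatment)
derm.dat$sitef <- factor(derm.dat$locat)
derm.dat$siteft <- C(derm.dat$sitef,treatment)
derm.dat
attach(derm.dat)
# independence model with both main effects; this fit is not shown in the original
# listing, and its form is assumed here from the coefficient table reported below
model1 <- glm(y ~ typeft + siteft, family=poisson)
# fitting the model with only the "histological type" main effect
model2 <- glm(y ~ typeft, family=poisson)
1-pchisq(model2$deviance - model1$deviance,model2$df - model1$df)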
Coefficients:
Value Std. Error t value
(Intercept) 1.7544309 0.2034503 8.623388
typeft2 1.6939682 0.1860136 9.106689
typeft3 1.3019261 0.1928617 6.750567
typeft4 0.4989641 0.2169164 2.300260
siteft2 0.4439313 0.1553150 2.858264
siteft3 1.2010272 0.1382637 8.686499
Correlation of Coefficients:
(Intercept) typeft2 typeft3 typeft4 siteft2
typeft2 -0.7714645
typeft3 -0.7440714 0.8138197
Coefficients:
Value Std. Error t value
(Intercept) 3.1764650 0.06672231 47.607238
typefs1 -0.8737146 0.13555988 -6.445230
typefs2 0.8202536 0.08050656 10.188655
typefs3 0.4282115 0.08819635 4.855207
sitefs1 -0.5483195 0.08983258 -6.103793
sitefs2 -0.1043882 0.07946273 -1.313675
Correlation of Coefficients:
(Intercept) typefs1 typefs2 typefs3 sitefs1
typefs1 0.3892042
typefs2 -0.4518752 -0.4463883
typefs3 -0.3022502 -0.4617212 0.0601799
sitefs1 0.2880608 0.0000000 0.0000000 0.0000000
sitefs2 -0.0054655 0.0000000 0.0000000 0.0000000 -0.6821281
Note that the coefficients differ, as one would expect, and that the correlation matrix has changed. The inferences one draws
regarding the effects of the type and site variables are the same, however (try carrying out a likelihood ratio test in this sort of
setting and you will see it gives the same value as before). Again, the fitted values are the same as we see below.
> derm.dat$fitted.values <- model1$fitted.values
> derm.dat$rdeviance <- residuals.glm(model1,type="deviance")
> derm.dat
Row Percentages
                         Head and Neck   Trunk   Extremities   All sites
Hutchinson’s Freckle          64.7         5.9       29.4         100
Superficial Spreading          8.6        29.2       62.2         100
Nodular                       15.2        26.4       58.4         100
Indeterminate                 19.6        30.4       50.0         100
All types                     17.0        26.5       56.5         100

Column Percentages
                         Head and Neck   Trunk   Extremities   All sites
Hutchinson’s Freckle          32.4         1.9        4.4         8.5
Superficial Spreading         23.5        50.9       50.9        46.25
Nodular                       27.9        31.1       32.3        31.25
Indeterminate                 16.2        16.0       12.4        14.00
All types                    100         100        100         100
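The percentages above, and the fit of the independence model, can be checked directly from the counts. The following is a minimal sketch in Python (not the S-Plus used in these notes); it relies on the standard closed form for the fitted values under independence, (row total)(column total)/n, rather than on a call to glm, and the variable names are illustrative only.

```python
import math

# Observed counts from the data file above: rows = tumour type 1..4, columns = site 1..3.
counts = [
    [22, 2, 10],
    [16, 54, 115],
    [19, 33, 73],
    [11, 17, 28],
]

row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]
grand_total = sum(row_totals)

# Row and column percentages, as in the two tables above.
row_pct = [[100.0 * y / row_totals[i] for y in row] for i, row in enumerate(counts)]
col_pct = [[100.0 * y / col_totals[j] for j, y in enumerate(row)] for row in counts]

# Fitted values under independence of type and site: E_ij = y_i. * y_.j / y_..
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Deviance of the independence model, D = 2 * sum O * log(O / E); all counts are positive here.
deviance = 2.0 * sum(
    y * math.log(y / e)
    for obs_row, exp_row in zip(counts, expected)
    for y, e in zip(obs_row, exp_row)
)
resid_df = (len(row_totals) - 1) * (len(col_totals) - 1)

print("row % (type 1):", [round(p, 1) for p in row_pct[0]])
print("deviance:", round(deviance, 2), "on", resid_df, "d.f.")
```

A deviance this large relative to its residual degrees of freedom is what leads to rejecting independence of tumour type and site.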
3.5.1 Introduction
Consider the general problem in which subjects are classified with respect to three factor variables denoted V, W, and
Z with I, J, and K levels respectively. As with two-way tables, we initially assume yijk ∼ Poisson(µijk), i = 1, 2, ..., I,
j = 1, 2, ..., J, k = 1, 2, ..., K. As before, if Y··· = y··· is fixed by design (as it usually would be), we must condition on
this to give the multinomial distribution,
$$P(Y_{ijk} = y_{ijk},\ \forall (i,j,k) \mid Y_{\cdots} = y_{\cdots}) = \frac{y_{\cdots}!}{\prod_i \prod_j \prod_k y_{ijk}!} \prod_i \prod_j \prod_k \pi_{ijk}^{y_{ijk}}$$
a. Mutual Independence
The variables V, W, and Z are said to be mutually independent if we can write πijk = πi·· π·j· π··k. That is, all
three variables V, W and Z are independent of each other. This implies
$$\log \mu_{ijk} = u + u^V_i + u^W_j + u^Z_k$$
with $u^V_1 = u^W_1 = u^Z_1 = 0$ (with cornerpoint constraints for example). We know this is the corresponding model
because under the hypothesis we can model the individual probabilities within the table simply in terms of the
margin totals, and the above model will fit these exactly. It is convenient to have a shorthand notation to identify
log-linear models, and this model is sometimes denoted by (V, W, Z), where we list the highest order terms
involving each of the factors. The model d.f. is 1 + (I − 1) + (J − 1) + (K − 1) = I + J + K − 2 and the residual
d.f. is IJK − (I + J + K − 2).
b. Joint Independence
W is said to be jointly independent of V and Z if we can write πijk = πi·k π·j·. That is, W is independent of
the “new variable” with IK levels defined by the combinations of levels of V and Z. This implies that, given
the levels of V and Z, we have learned nothing about the distribution of factor W. Equivalently we may say that
the nature of the association which exists between V and Z does not vary with different levels of factor W. The
corresponding log-linear model has the form
$$\log \mu_{ijk} = u + u^V_i + u^W_j + u^Z_k + u^{VZ}_{ik}$$
since again this model will ensure that the marginal probabilities required to model a general probability πijk are
fit exactly. If the hypothesis is true, then fitting these marginal totals should be sufficient to provide a good fit
to the data. In the shorthand notation, this model is denoted as (VZ, W). The model d.f. is then [1 + (I − 1) +
(J − 1) + (K − 1) + (I − 1)(K − 1)] and the residual d.f. is IJK − [1 + (I − 1) + (J − 1) + (K − 1) + (I − 1)(K − 1)].
We could also have V jointly independent of W and Z (denoted (V, WZ)) or Z jointly independent of V and
W (denoted (VW, Z)).
c. Conditional Independence
Consider the conditional probabilities πij|k = πijk/π··k, where πij|k determines the joint distribution of V and W at
level k of Z. V and W are said to be conditionally independent given Z if we can write πij|k = πi·|k π·j|k,
i = 1, 2, ..., I, j = 1, 2, ..., J, for every level k of Z. This hypothesis, H0, implies
$$\frac{\pi_{ijk}}{\pi_{\cdot\cdot k}} = \frac{\pi_{i\cdot k}}{\pi_{\cdot\cdot k}} \cdot \frac{\pi_{\cdot jk}}{\pi_{\cdot\cdot k}} \quad\Rightarrow\quad \pi_{ijk} = \frac{\pi_{i\cdot k}\,\pi_{\cdot jk}}{\pi_{\cdot\cdot k}}$$
The corresponding log-linear model is of the form
$$\log \mu_{ijk} = u + u^V_i + u^W_j + u^Z_k + u^{VZ}_{ik} + u^{WZ}_{jk}$$
which is denoted as (VZ, WZ). Again, we know this is the corresponding model because under this model all
that is required to derive the probability πijk are the marginal probabilities πi·k and π·jk. The model d.f. is
[1 + (I − 1) + (J − 1) + (K − 1) + (I − 1)(K − 1) + (J − 1)(K − 1)] and the residual d.f. is IJK − [1 + (I − 1) + (J −
1) + (K − 1) + (I − 1)(K − 1) + (J − 1)(K − 1)]. We could also have models of the form (VW, WZ) or (VW, VZ),
where the former would, for example, indicate that when we condition on W, V and Z are independent.
d. All Pairs Conditionally Independent
This is related to the association structure in (3) and involves the hypothesis that there is no three-way interaction, so that
$$\log \mu_{ijk} = u + u^V_i + u^W_j + u^Z_k + u^{VW}_{ij} + u^{VZ}_{ik} + u^{WZ}_{jk}$$
which is denoted as (VW, VZ, WZ). Here the association between any two variables is constant across levels of the other variable. An equivalent
way of thinking about this model is that there is an association between all pairs of variables, but the nature of
this association is constant with respect to changing levels of the remaining variable.
The next most complicated model is the “saturated” model, involving the three-way interaction term and a total of IJK
parameters.
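The degrees-of-freedom bookkeeping for models (a) to (d), and the closed-form fitted values implied by conditional independence, can be sketched as follows. This is an illustrative Python sketch, not part of the original S-Plus analysis; the function names are hypothetical. The second function uses the relation πijk = πi·k π·jk / π··k from (c), applied to the seatbelt accident counts that appear later in these notes (V = seatbelt, W = ejected, Z = injury).

```python
import math

def model_df(I, J, K, interactions=()):
    """Return (model d.f., residual d.f.) for a hierarchical log-linear model.

    The intercept and all main effects are always included; `interactions`
    lists the two-factor terms, each written as "VW", "VZ" or "WZ".
    """
    levels = {"V": I, "W": J, "Z": K}
    n_par = 1 + sum(n - 1 for n in levels.values())
    for term in interactions:
        a, b = term
        n_par += (levels[a] - 1) * (levels[b] - 1)
    return n_par, I * J * K - n_par

def fit_conditional_independence(y):
    """Fitted counts y_i.k * y_.jk / y_..k for the model (VZ, WZ); y is indexed y[i][j][k]."""
    I, J, K = len(y), len(y[0]), len(y[0][0])
    fitted = [[[0.0] * K for _ in range(J)] for _ in range(I)]
    for k in range(K):
        y_k = sum(y[i][j][k] for i in range(I) for j in range(J))
        for i in range(I):
            y_ik = sum(y[i][j][k] for j in range(J))
            for j in range(J):
                y_jk = sum(y[a][j][k] for a in range(I))
                fitted[i][j][k] = y_ik * y_jk / y_k
    return fitted

# Suicide data dimensions: sex (2) x age (3) x method (6).
for name, terms in [("(V, W, Z)", []), ("(VZ, W)", ["VZ"]),
                    ("(VZ, WZ)", ["VZ", "WZ"]), ("(VW, VZ, WZ)", ["VW", "VZ", "WZ"])]:
    print(name, model_df(2, 3, 6, terms))

# Accident counts indexed as acc[seat-1][eject-1][injury-1], copied from the data file.
acc = [[[157342, 1008], [4624, 497]], [[411111, 483], [1105, 14]]]
fitted = fit_conditional_independence(acc)
print(math.log(fitted[0][0][0]))  # compare with the intercept of the fit with s:i and e:i
```

With treatment contrasts the intercept is the log of the fitted count in the baseline cell, which is why the last line can be compared with the reported intercept for the model containing only the s:i and e:i interactions.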
The fit of a log-linear model can be judged based on the deviance, assuming an underlying Poisson distribution for
the cell counts. We know from before that the deviance statistic has the form
$$D = 2 \sum_i \sum_j \sum_k O_{ijk} \log\frac{O_{ijk}}{E_{ijk}}$$
and the Pearson chi-squared statistic has the form
$$X^2 = \sum_i \sum_j \sum_k \frac{(O_{ijk} - E_{ijk})^2}{E_{ijk}}$$
where Oijk is the observed value for cell (i, j, k) and Eijk is the expected value for cell (i, j, k) under H0. Again, each statistic is
approximately chi-squared distributed under the null hypothesis with d.f. IJK − q, where q is the number of parameters in the fitted model.
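As a numerical illustration of these two statistics, the sketch below (Python, standard library only; the function name is hypothetical) evaluates them for the observed counts and the fitted values of the model with all two-way interactions in the seatbelt accident example that appears later in these notes. The fitted values are copied from the printed S-Plus output, so the results are only as accurate as that printout.

```python
import math

def deviance_and_pearson(observed, expected):
    """Return (D, X2) with D = 2*sum O*log(O/E) and X2 = sum (O-E)^2/E.

    The simple form of D assumes sum(O) == sum(E), which holds whenever the
    model contains an intercept. Cells with O = 0 contribute 0 to D.
    """
    dev = 0.0
    x2 = 0.0
    for o, e in zip(observed, expected):
        if o > 0:
            dev += o * math.log(o / e)
        x2 += (o - e) ** 2 / e
    return 2.0 * dev, x2

# Observed counts and fitted values (column fv) from the accident data output.
obs = [157342, 1008, 4624, 497, 411111, 483, 1105, 14]
fv = [157335.13193, 1014.86807, 4630.86807, 490.13193,
      411117.86807, 476.13193, 1098.13193, 20.86807]

D, X2 = deviance_and_pearson(obs, fv)
print("deviance:", round(D, 2), " Pearson:", round(X2, 2), " residual d.f.:", 8 - 7)
```

Both values would be compared with a chi-squared distribution on IJK − q = 8 − 7 = 1 degree of freedom, whose upper 5% point is 3.84.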
3.6.2 Residuals
The deviance residuals have the usual form under the Poisson model. The standardized Pearson residuals take the
form
$$r^P_{ijk} = \frac{O_{ijk} - E_{ijk}}{\sqrt{E_{ijk}}}$$
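A minimal sketch of both residuals for a single cell is given below, in Python rather than S-Plus. The deviance residual is written in the usual Poisson form, sign(O − E) √(2[O log(O/E) − (O − E)]), which is assumed here to be what residuals.glm(..., type="deviance") reports; the two cells used as a check are taken from the Model 2 listing for the suicide data later in this section.

```python
import math

def pearson_residual(o, e):
    """Pearson residual (O - E) / sqrt(E) for a Poisson cell count."""
    return (o - e) / math.sqrt(e)

def deviance_residual(o, e):
    """Poisson deviance residual: sign(O - E) * sqrt(2 * (O*log(O/E) - (O - E)))."""
    term = o * math.log(o / e) if o > 0 else 0.0
    contribution = max(2.0 * (term - (o - e)), 0.0)  # guard against tiny negative round-off
    return math.copysign(math.sqrt(contribution), o - e)

# (observed, fitted) pairs from rows 1 and 32 of the Model 2 listing.
for o, e in [(398, 410.853837), (5, 2.320220)]:
    print(o, e, round(deviance_residual(o, e), 4), round(pearson_residual(o, e), 4))
```

For large fitted values the two residuals are close; they differ more in sparse cells such as row 32.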
3.7 Applications
Suppose we are interested in examining the nature of the association among the 3 variables sex, age group, and
method of suicide, in the following table.
                                  Cause of Death
Sex     Age (yrs)  Solid/Liquid   Gas   Respiratory   Weapons   Jumping   Other
Male    10-40           398       121       455         155        55      124
Male    40-70           399        82       797         168        51       82
Male    > 70             93         6       316          33        26       14
Female  10-40           259        15        95          14        40       38
Female  40-70           450        13       450          26        71       60
Female  > 70            154         5       185           7        38       10
Note that here there is no obvious response variable to single out. Since we are interested in the association among
all three variables we consider methods based on log-linear models.
Here we carry out an analysis to investigate the association between sex, age and method of suicide. Let X denote
sex, Y denote age, and Z denote the method of suicide. We know the log-linear model
$$\log \mu_{ijk} = u + u^X_i + u^Y_j + u^Z_k + u^{XY}_{ij} + u^{XZ}_{ik} + u^{YZ}_{jk} + u^{XYZ}_{ijk}$$
will provide a perfect fit to the data (since it is a saturated model). We seek to find a simpler model which describes
the data well. In other words, we are looking for a simpler representation of the relationship between the sex, age
and method of suicide variables.
Here we have the Splus data file and the corresponding program.
sex age method y
1 1 1 398
1 1 2 121
1 1 3 455
1 1 4 155
1 1 5 55
1 1 6 124
1 2 1 399
1 2 2 82
1 2 3 797
1 2 4 168
1 2 5 51
1 2 6 82
1 3 1 93
1 3 2 6
1 3 3 316
1 3 4 33
1 3 5 26
1 3 6 14
2 1 1 259
2 1 2 15
2 1 3 95
2 1 4 14
2 1 5 40
2 1 6 38
2 2 1 450
2 2 2 13
2 2 3 450
2 2 4 26
2 2 5 71
2 2 6 60
2 3 1 154
2 3 2 5
2 3 3 185
2 3 4 7
2 3 5 38
2 3 6 10
suic.dat <- read.table("suic.dat", header=T)
# factors with treatment (corner-point) contrasts
suic.dat$sexf <- factor(suic.dat$sex)
suic.dat$sexft <- C(suic.dat$sexf,treatment)
suic.dat$agef <- factor(suic.dat$age)
suic.dat$ageft <- C(suic.dat$agef,treatment)
suic.dat$methodf <- factor(suic.dat$method)
suic.dat$methodft <- C(suic.dat$methodf,treatment)
attach(suic.dat)
# model2, ..., model5 are the fitted glm objects (their glm() calls are not shown here)
suic.dat$fitted.values2 <- model2$fitted.values
suic.dat$rdeviance2 <- residuals.glm(model2,type="deviance")
suic.dat$fitted.values3 <- model3$fitted.values
suic.dat$rdeviance3 <- residuals.glm(model3,type="deviance")
suic.dat$fitted.values4 <- model4$fitted.values
suic.dat$rdeviance4 <- residuals.glm(model4,type="deviance")
suic.dat$fitted.values5 <- model5$fitted.values
suic.dat$rdeviance5 <- residuals.glm(model5,type="deviance")
# open a graphics window and plot deviance residuals against fitted values, one panel per model
motif()
par(mfrow=c(2,2))
plot(suic.dat$fitted.values2,suic.dat$rdeviance2,ylim=c(-5,5),
xlab="FITTED VALUES",ylab="DEVIANCE RESIDUALS"); title("Model 2")
plot(suic.dat$fitted.values3,suic.dat$rdeviance3,ylim=c(-5,5),
xlab="FITTED VALUES",ylab="DEVIANCE RESIDUALS"); title("Model 3")
plot(suic.dat$fitted.values4,suic.dat$rdeviance4,ylim=c(-5,5),
xlab="FITTED VALUES",ylab="DEVIANCE RESIDUALS"); title("Model 4")
plot(suic.dat$fitted.values5,suic.dat$rdeviance5,ylim=c(-5,5),
xlab="FITTED VALUES",ylab="DEVIANCE RESIDUALS"); title("Model 5")
suic.dat
   sex age method y fitted.values2 rdeviance2
1 1 1 1 398 410.853837 -0.63749626
2 1 1 2 121 122.699573 -0.15378908
3 1 1 3 455 439.224163 0.74830743
4 1 1 4 155 156.373386 -0.10998886
5 1 1 5 55 56.820757 -0.24285278
6 1 1 6 124 122.028284 0.17801260
7 1 2 1 399 379.442338 0.99557892
8 1 2 2 82 77.620529 0.49252029
9 1 2 3 797 819.882186 -0.80289911
10 1 2 4 168 166.268603 0.13404183
11 1 2 5 51 51.090948 -0.01272770
12 1 2 6 82 84.695397 -0.29445651
13 1 3 1 93 99.703825 -0.67912009
14 1 3 2 6 8.680256 -0.96385340
15 1 3 3 316 308.893651 0.40280000
16 1 3 4 33 33.358011 -0.06209777
17 1 3 5 26 24.088295 0.38452049
18 1 3 6 14 13.276319 0.19684871
19 2 1 1 259 246.146163 0.81230758
20 2 1 2 15 13.300433 0.45658974
21 2 1 3 95 110.775837 -1.53676135
22 2 1 4 14 12.626614 0.37979274
23 2 1 5 40 38.179243 0.29237485
24 2 1 6 38 39.971716 -0.31448402
25 2 2 1 450 469.557662 -0.90892922
26 2 2 2 13 17.379479 -1.10004443
27 2 2 3 450 427.117814 1.09752217
28 2 2 4 26 27.731397 -0.33229749
29 2 2 5 71 70.909052 0.01079815
30 2 2 6 60 57.304603 0.35332589
31 2 3 1 154 147.296175 0.54825361
32 2 3 2 5 2.320220 1.52256317
33 2 3 3 185 192.106349 -0.51592538
34 2 3 4 7 6.641989 0.13769372
35 2 3 5 38 39.911705 -0.30506648
36 2 3 6 10 10.723681 -0.22354952
Fig. 3.2. Plot of Residuals by Fitted Values for Models 2, 3, 4, and 5 of Suicide Data
[Figure: four panels (Model 2, Model 3, Model 4, Model 5), each plotting DEVIANCE RESIDUALS against FITTED VALUES.]
We now consider the special case of a 2 × 2 × 2 table, that is, where each variable has only two levels. For illustration
we will discuss the data in the following table from the Florida Department of Highway Safety and Motor Vehicles
(Bishop et al., 1975), which summarizes three variables pertaining to car accidents. Recorded are whether a seat-belt
was used, whether the victim was ejected from the car, and whether the injury was fatal.
                                   Injury (Z)
Seatbelt (V)       Ejected (W)     Non-fatal            Fatal
Used               Yes                 1105                14
                   No               411,111               483
Not Used           Yes                 4624               497
                   No               157,342              1008
The cell counts are labelled as follows.
                                   Injury (Z)
Seatbelt (V)       Ejected (W)     Non-fatal (k = 1)    Fatal (k = 2)
Used (i = 2)       Yes (j = 2)     y221                 y222
                   No (j = 1)      y211                 y212
Not Used (i = 1)   Yes (j = 2)     y121                 y122
                   No (j = 1)      y111                 y112
The saturated model reproduces these counts exactly, but we are interested in identifying simpler models which still fit the data well and are easy to interpret. Note that
in this table only y··· is fixed and so only the intercept needs to be included by design.
Here we have the Splus data file and the corresponding program.
seat eject injury y
1 1 1 157342
1 1 2 1008
1 2 1 4624
1 2 2 497
2 1 1 411111
2 1 2 483
2 2 1 1105
2 2 2 14
acc.dat <- read.table("acc.dat", header=T)
# factors with treatment (corner-point) contrasts
acc.dat$seatf <- factor(acc.dat$seat)
acc.dat$s <- C(acc.dat$seatf,treatment)
acc.dat$ejectf <- factor(acc.dat$eject)
acc.dat$e <- C(acc.dat$ejectf,treatment)
acc.dat$injuryf <- factor(acc.dat$injury)
acc.dat$i <- C(acc.dat$injuryf,treatment)
acc.dat
Coefficients:
Value Std. Error t value
(Intercept) 11.9661771 0.002521028 4746.54670
s 0.9604415 0.002964459 323.98545
e -3.5271616 0.014920407 -236.39848
i -5.0504536 0.031597770 -159.83576
s:e -2.3918563 0.033615895 -71.15254
s:i -1.6961483 0.055418813 -30.60600
e:i 2.8200282 0.056804529 49.64443
s:e:i -0.4419696 0.278627222 -1.58624
Correlation of Coefficients:
(Intercept) s e i s:e s:i
s -0.8504177
e -0.1689651 0.1436909
i -0.0797850 0.0678506 0.0134809
s:e 0.0749951 -0.0881862 -0.4438498 -0.0059835
s:i 0.0454905 -0.0534919 -0.0076863 -0.5701632 0.0047173
e:i 0.0443808 -0.0377422 -0.2626623 -0.5562544 0.1165826 0.3171558
s:e:i -0.0090480 0.0106395 0.0535497 0.1134052 -0.1206483 -0.1988995
e:i
s
e
i
s:e
s:i
e:i
s:e:i -0.2038729
---------------------------------------------------------------------------
Coefficients:
Value Std. Error t value
(Intercept) 11.9661334 0.002520935 4746.70421
s 0.9605018 0.002964258 324.02772
e -3.5256338 0.014879047 -236.95293
i -5.0436195 0.031201820 -161.64504
s:e -2.3996357 0.033340295 -71.97404
s:i -1.7173207 0.054015372 -31.79319
e:i 2.7977945 0.055255851 50.63345
Correlation of Coefficients:
(Intercept) s e i s:e s:i
s -0.8504057
e -0.1687532 0.1432888
i -0.0793250 0.0669801 0.0049262
s:e 0.0740365 -0.0870602 -0.4387264 0.0138669
s:i 0.0440104 -0.0517522 0.0078886 -0.5548114 -0.0311263
e:i 0.0428931 -0.0355879 -0.2541765 -0.5407262 0.0837227 0.2701176
> acc.dat$fv <- model2$fitted.values
> acc.dat
seat eject injury y seatf s ejectf e injuryf i fv
1 1 1 1 157342 1 1 1 1 1 1 157335.13193
2 1 1 2 1008 1 1 1 1 2 2 1014.86807
3 1 2 1 4624 1 1 2 2 1 1 4630.86807
4 1 2 2 497 1 1 2 2 2 2 490.13193
5 2 1 1 411111 2 2 1 1 1 1 411117.86807
6 2 1 2 483 2 2 1 1 2 2 476.13193
7 2 2 1 1105 2 2 2 2 1 1 1098.13193
8 2 2 2 14 2 2 2 2 2 2 20.86807
---------------------------------------------------------------------------
Coefficients:
Value Std. Error t value
(Intercept) 11.9633139 0.002524274 4739.30924
s 0.9632739 0.002967231 324.63734
e -3.4314580 0.014197737 -241.69050
i -4.6785782 0.025825428 -181.16169
s:e -2.4761440 0.033130957 -74.73807
s:i -2.0421345 0.051782723 -39.43660
Correlation of Coefficients:
(Intercept) s e i s:e
s -0.8507171
e -0.1761993 0.1498958
i -0.0947092 0.0805707 0.0000000
s:e 0.0755074 -0.0889496 -0.4285338 0.0000000
s:i 0.0472340 -0.0559712 0.0000000 -0.4987267 0.0000000
---------------------------------------------------------------------------
Coefficients:
Value Std. Error t value
(Intercept) 11.9699436 0.002513903 4761.49686
s 0.9552297 0.002957139 323.02491
e -3.5142778 0.014693133 -239.17825
i -5.9434707 0.025914304 -229.35097
s:e -2.4761440 0.033131101 -74.73775
e:i 3.5265440 0.052943310 66.60981
Correlation of Coefficients:
(Intercept) s e i s:e
s -0.8494934
e -0.1710938 0.1453430
---------------------------------------------------------------------------
Coefficients:
Value Std. Error t value
(Intercept) 11.985114 0.002488217 4816.74862
s 0.934161 0.002932430 318.56213
i -4.963265 0.029014282 -171.06284
e -4.597323 0.013209434 -348.03329
s:i -2.042119 0.051803432 -39.42053
e:i 3.526490 0.052921727 66.63597
Correlation of Coefficients:
(Intercept) s i e s:i
s -0.8460864
i -0.0857583 0.0725590
e -0.0535221 0.0000000 0.0045900
s:i 0.0478943 -0.0566069 -0.4461379 0.0000000
e:i 0.0133593 0.0000000 -0.4378953 -0.2496032 0.0000000
To summarize the models fit so far, consider the following analysis of deviance table.
                                 Injury (Z)
Seatbelt (V)   Ejected (W)   Non-Fatal (%)   Fatal (%)
Used           Yes                98.7           1.3
               No                 99.9           0.1
Not Used       Yes                90.4           9.6
               No                 99.4           0.6
Note that we can interpret log(µ112/µ111) as a log odds since this is equal to log(π112/π111), where π111 = 1 − π112 =
µ111/µ11·.
Since log(µ112/µ111) = $u^Z_2$ under the cornerpoint constraints, $u^Z_2$ is the log odds of a fatal accident in the last row of the table (i.e. those not using seatbelts who were
not ejected). In the third row we have
$$\log(\mu_{122}/\mu_{121}) = u^Z_2 + u^{WZ}_{22}$$
Therefore $u^Z_2 + u^{WZ}_{22}$ is the log odds of a fatality among those who were not wearing a seatbelt and were ejected.
In the first row we have
$$\log(\mu_{222}/\mu_{221}) = u^Z_2 + u^{VZ}_{22} + u^{WZ}_{22} + u^{VWZ}_{222}$$
The term $u^{VWZ}_{ijk}$ plays the role of an interaction term, allowing the effect of one variable on another to vary with the levels of
the third.
We may construct several odds ratios for this data set for the saturated model.
Odds Ratios
Outcome   Comparison         At      Form                                   Value
Z = 2     W = 2 vs. W = 1    V = 1   exp(u^{WZ}_{22})                       exp(2.80) = 16.4
Z = 2     W = 2 vs. W = 1    V = 2   exp(u^{WZ}_{22} + u^{VWZ}_{222})       exp(2.80 + 0) = 16.4
These odds ratios make sense: they suggest that the relative odds of a fatality among those ejected compared to
those not ejected is 16.4, the relative odds of fatality among those using a seatbelt compared to those who do not
use a seatbelt is 0.18, and the relative odds of ejection for those using a seatbelt compared to those who do not is
0.09. The fact that we could not reduce the model further means these terms are all significant.
Since the odds ratios relating to fatality could have been obtained from a logistic model, it is natural to ask what
we have gained here. Specifically, we are able to examine the relationship between all variables, including that between V and W.
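The three odds ratios quoted above can be reproduced from the interaction coefficients of the fit with all two-way interactions (the second set of coefficients in the output above). The sketch below is in Python rather than S-Plus; the Wald-type 95% intervals, exp(estimate ± 1.96 × standard error), are an addition for illustration and are not reported in the original output.

```python
import math

# (estimate, standard error) copied from the printed coefficient table
# for the model containing s:e, s:i and e:i.
terms = {
    "fatality, ejected vs. not ejected (e:i)": (2.7977945, 0.055255851),
    "fatality, seatbelt vs. no seatbelt (s:i)": (-1.7173207, 0.054015372),
    "ejection, seatbelt vs. no seatbelt (s:e)": (-2.3996357, 0.033340295),
}

def odds_ratio_with_ci(estimate, se, z=1.96):
    """Return (odds ratio, lower limit, upper limit) from a log odds ratio and its s.e."""
    return (math.exp(estimate),
            math.exp(estimate - z * se),
            math.exp(estimate + z * se))

results = {name: odds_ratio_with_ci(est, se) for name, (est, se) in terms.items()}
for name, (odds_ratio, lower, upper) in results.items():
    print(f"{name}: OR = {odds_ratio:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")
```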
Here we revisit the analysis of the associations between age, sex and method of suicide. Recall the data was provided
as follows.
                                  Cause of Death
Sex     Age (yrs)  Solid/Liquid   Gas   Respiratory   Weapons   Jumping   Other
Male    10-40           398       121       455         155        55      124
Male    40-70           399        82       797         168        51       82
Male    > 70             93         6       316          33        26       14
Female  10-40           259        15        95          14        40       38
Female  40-70           450        13       450          26        71       60
Female  > 70            154         5       185           7        38       10
Here we carry out an analysis to investigate the association between sex, age and method of suicide. Let V denote
sex, W denote age, and Z denote the method of suicide. We know the log-linear model
$$\log \mu_{ijk} = u + u^V_i + u^W_j + u^Z_k + u^{VW}_{ij} + u^{VZ}_{ik} + u^{WZ}_{jk} + u^{VWZ}_{ijk}$$
will provide a perfect fit to the data (since it is a saturated model). We seek to find a simpler model which describes
the data well. In other words, we are looking for a simpler representation of the relationship between the sex, age
and method of suicide variables. Our previous analyses are summarized in the following analysis of deviance table.
We found that the model with all two-way interactions provided a good fit, but any simpler models did not. The final
model is therefore
$$\log \mu_{ijk} = u + u^V_i + u^W_j + u^Z_k + u^{VW}_{ij} + u^{VZ}_{ik} + u^{WZ}_{jk}$$
We conclude that all pairs of variables are conditionally independent. This means that each pair of variables is
associated, but the nature of the association does not depend on the remaining variable. A partial data file, some of
the code, and more detailed results for the final model are provided below.
sex age method y
1 1 1 398
1 1 2 121
1 1 3 455
1 1 4 155
1 1 5 55
1 1 6 124
1 2 1 399
1 2 2 82
1 2 3 797
1 2 4 168
1 2 5 51
1 2 6 82
| | | |
| | | |
2 3 1 154
2 3 2 5
2 3 3 185
2 3 4 7
2 3 5 38
2 3 6 10
> summary(model2)
Coefficients:
Value Std. Error t value
(Intercept) 6.01823752 0.04598603 130.8710083
sexft -0.51231200 0.06497888 -7.8842854
ageft2 -0.07953488 0.06145781 -1.2941379
ageft3 -1.41603348 0.08944448 -15.8314247
methodft2 -1.20849865 0.09848962 -12.2703143
methodft3 0.06677238 0.06149132 1.0858831
methodft4 -0.96599088 0.08980039 -10.7570898
methodft5 -1.97833582 0.12157446 -16.2726262
methodft6 -1.21398467 0.09453852 -12.8411643
sexftageft2 0.72540047 0.07057738 10.2780866
sexftageft3 0.90255331 0.09146294 9.8679675
sexftmethodft2 -1.70963024 0.19501227 -8.7667829
sexftmethodft3 -0.86518923 0.06773258 -12.7736045
sexftmethodft4 -2.00412781 0.16368076 -12.2441259
sexftmethodft5 0.11470230 0.13117469 0.8744240
sexftmethodft6 -0.60376876 0.12897160 -4.6814085
ageft2methodft2 -0.37837205 0.14627609 -2.5866979
ageft3methodft2 -1.23265437 0.32332905 -3.8123836
ageft2methodft3 0.70368563 0.07520193 9.3572818
ageft3methodft3 1.06402062 0.09994185 10.6463976
ageft2methodft4 0.14089281 0.12072263 1.1670787
ageft3methodft4 -0.12891521 0.19514619 -0.6606084
ageft2methodft5 -0.02675948 0.14825462 -0.1804968
ageft3methodft5 0.55785782 0.18046659 3.0911972
ageft2methodft6 -0.28565673 0.12815869 -2.2289298
ageft3methodft6 -0.80223746 0.23289228 -3.4446717
Correlation of Coefficients:
(Intercept) sexft ageft2 ageft3 methodft2
sexft -0.5293849
ageft2 -0.6545075 0.2190292
ageft3 -0.4394480 0.1311005 0.3888462
methodft2 -0.4533271 0.2215140 0.2858969 0.1943518
methodft3 -0.6841670 0.2756100 0.4034836 0.2612433 0.3162147
methodft4 -0.4979142 0.2443126 0.3151823 0.2122326 0.2270934
methodft5 -0.3263533 0.1022024 0.1786165 0.1083398 0.1528525
methodft6 -0.4508464 0.1902968 0.2674112 0.1837221 0.2078568
sexftageft2 0.3397924 -0.6418626 -0.5140561 -0.2152121 -0.1200132
sexftageft3 0.2536292 -0.4791016 -0.2414450 -0.5326448 -0.0938820
sexftmethodft2 0.1062991 -0.2007973 0.0286658 0.0122047 -0.2304799
sexftmethodft3 0.2208293 -0.4171432 0.1774634 0.1780183 -0.1097188
sexftmethodft4 0.1060513 -0.2003292 0.0598045 0.0419839 -0.0508429
ageft3
methodft2
methodft3
methodft4
methodft5
methodft6
sexftageft2
sexftageft3
sexftmethodft2
sexftmethodft3
sexftmethodft4
sexftmethodft5
sexftmethodft6 0.1380205
ageft2methodft2 -0.0214002 -0.0188304
ageft3methodft2 -0.0165786 -0.0085290 0.2022833
ageft2methodft3 -0.0431139 -0.0380815 0.2817386 0.0755249
ageft3methodft3 -0.0484958 -0.0319812 0.1243735 0.1957429
ageft2methodft4 -0.0260336 -0.0221505 0.1880540 0.0490921
ageft3methodft4 -0.0284351 -0.0131817 0.0669367 0.1078330
ageft2methodft5 -0.1585728 -0.0221231 0.1259021 0.0338279
ageft3methodft5 -0.1606040 -0.0228262 0.0601454 0.0968107
ageft2methodft6 -0.0246950 -0.1590233 0.1588169 0.0413490
ageft3methodft6 -0.0193417 -0.1101573 0.0505534 0.0804766
ageft2methodft3 ageft3methodft3 ageft2methodft4
sexft
ageft2
ageft3
methodft2
methodft3
methodft4
methodft5
methodft6
sexftageft2
sexftageft3
sexftmethodft2
sexftmethodft3
sexftmethodft4
sexftmethodft5
sexftmethodft6
ageft2methodft2
ageft3methodft2
ageft2methodft3
ageft3methodft3 0.4860919
ageft2methodft4 0.3458827 0.1534926
ageft3methodft4 0.1274191 0.3289479 0.3409389
ageft2methodft5 0.2463889 0.1090278 0.1522688
ageft3methodft5 0.1175691 0.3138438 0.0727012
ageft2methodft6 0.3018880 0.1323026 0.1940422
ageft3methodft6 0.0973867 0.2548121 0.0619062
ageft3methodft4 ageft2methodft5 ageft3methodft5
sexft
ageft2
ageft3
methodft2
methodft3
methodft4
methodft5
methodft6
sexftageft2
sexftageft3
sexftmethodft2
sexftmethodft3
sexftmethodft4
sexftmethodft5
sexftmethodft6
ageft2methodft2
ageft3methodft2
ageft2methodft3
ageft3methodft3
ageft2methodft4
ageft3methodft4
ageft2methodft5 0.0560627
ageft3methodft5 0.1602452 0.4758838
ageft2methodft6 0.0691248 0.1450617 0.0693600
ageft3methodft6 0.1344971 0.0468263 0.1349523
ageft2methodft6
sexft
ageft2
ageft3
methodft2
methodft3
methodft4
methodft5
methodft6
sexftageft2
sexftageft3
sexftmethodft2
sexftmethodft3
sexftmethodft4
sexftmethodft5
sexftmethodft6
ageft2methodft2
ageft3methodft2
ageft2methodft3
ageft3methodft3
ageft2methodft4
ageft3methodft4
ageft2methodft5
ageft3methodft5
ageft2methodft6
ageft3methodft6 0.2772938
> x <- c(0,0,0,0,0,0,-1,1,0,0,0,0,0,0,0,0,0,0,0,0,-1,0,1,0,0,0)
> x <- as.matrix(x,26,1)
> x
[,1]
[1,] 0
[2,] 0
[3,] 0
[4,] 0
[5,] 0
[6,] 0
[7,] -1
[8,] 1
[9,] 0
[10,] 0
[11,] 0
[12,] 0
[13,] 0
[14,] 0
[15,] 0
[16,] 0
[17,] 0
[18,] 0
[19,] 0
[20,] 0
[21,] -1
[22,] 0
[23,] 1
[24,] 0
[25,] 0
[26,] 0
> t(x)%*%as.vector(model2$coefficients)
[,1]
[1,] -1.179997
> sqrt(t(x)%*%v%*%x)
[,1]
[1,] 0.1382256
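To see what this contrast estimates, note that the vector x picks out four coefficients: −methodft4 + methodft5 − ageft2methodft4 + ageft2methodft5. Under treatment contrasts this appears to be the log of the ratio of expected counts for method 5 (jumping) to method 4 (weapons) at age level 2 for the baseline sex; that reading, and the Wald interval below, are offered as an illustration rather than as part of the original analysis. A minimal Python sketch:

```python
import math

# Coefficients copied from summary(model2) above; only the four picked out by x are needed.
coef = {
    "methodft4": -0.96599088,
    "methodft5": -1.97833582,
    "ageft2methodft4": 0.14089281,
    "ageft2methodft5": -0.02675948,
}
# Non-zero entries of the contrast vector x (positions 7, 8, 21 and 23).
contrast = {"methodft4": -1, "methodft5": 1, "ageft2methodft4": -1, "ageft2methodft5": 1}

estimate = sum(contrast[name] * coef[name] for name in contrast)  # x' beta-hat
se = 0.1382256                                                    # sqrt(x' V x) as reported above

ratio = math.exp(estimate)
ci = (math.exp(estimate - 1.96 * se), math.exp(estimate + 1.96 * se))

# The same ratio from the fitted values of rows 11 and 10 of the listing (51.090948 / 166.268603).
ratio_from_fitted = 51.090948 / 166.268603
print(round(estimate, 6), round(ratio, 3), [round(limit, 3) for limit in ci])
```

The agreement between exp(x′β̂) and the ratio of the two fitted values is a useful check that the contrast vector has been set up as intended.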
Questions
Problem 3.1.
Consider the following data on newly diagnosed lung cancer cases occurring in four Danish cities between 1968 and
1971.
                                     City
               Fredericia       Horsens        Kolding          Vejle
Age           Cases   Pop.   Cases   Pop.   Cases   Pop.   Cases   Pop.
40-54           11    3059     13    2879      4    3142      5    2520
55-59           11     800      6    1083      8    1050      7     878
60-64           11     710     15     923      7     895     10     839
65-69           10     581     10     834     11     702     14     631
70-74           11     509     12     634      9     535      8     539
> 75            10     605      2     782     12     659      7     619
A Poisson model is often used to analyse data of this sort based on the expected number of cases (recall the Poisson
approximation for the binomial model). Variable population sizes, however, suggest a systematic (non-random) way
in which the number of cases will vary from city to city and across age groups. It is therefore preferable to model
the disease rates. This may be achieved using a Poisson regression model where the log of the population size is used
as an offset term. That is, if µ is the mean number of newly diagnosed lung cancer patients, x is a p × 1 covariate
vector, and β is a p × 1 vector of regression coefficients, then we may specify a model for the mean number of cases
as
log(µ) = x′ β + log(Pop.) .
Model based estimates of the expected number of new cases for a particular covariate configuration may be obtained
as exp{x′ β̂ + log(pop.)}.
a. Compute a model-based estimate of the mean number of newly diagnosed lung cancer patients over this three
year period in Kolding for the 55-59 year old age group.
b. From the fitted model estimate the relative increase in the lung cancer rate for the 70-74 year old age group,
compared to the 40-54 year old age group.
c. Does there appear to be an interaction between the factor variable for age and the factor variable for city?
d. Epidemiologists frequently express disease rates as the number of new cases per hundred thousand people in the
population per year. Compute a model-based estimate of the expected number of newly diagnosed lung cancer
patients per 100,000 person-years of risk in Kolding for the 55-59 year old age group for 1970. Assume that the
disease rates are stable over time.
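The role of the offset can be sketched numerically as follows. This is a Python illustration only: the coefficient values are hypothetical and are not estimates from the Danish data, and the function names are introduced here for the example. The point is that exp{x′β} is a rate per person over the study period, and multiplying by the population size (adding log(Pop.) on the log scale) gives the expected number of cases.

```python
import math

def expected_cases(x, beta, population):
    """Expected count exp(x'beta + log(population)) under a log link with an offset."""
    linear_predictor = sum(xj * bj for xj, bj in zip(x, beta))
    return math.exp(linear_predictor + math.log(population))

def rate_per_100k_person_years(x, beta, years):
    """Rate per 100,000 person-years, assuming the counts accrued over `years` years."""
    linear_predictor = sum(xj * bj for xj, bj in zip(x, beta))
    return 1e5 * math.exp(linear_predictor) / years

# Hypothetical values for illustration only (NOT fitted to the table above):
x = [1.0, 1.0]                            # intercept and one age-group indicator
beta = [math.log(0.004), math.log(2.5)]   # baseline rate 0.004, rate ratio 2.5
cases = expected_cases(x, beta, population=1050)
rate = rate_per_100k_person_years(x, beta, years=1)
print(round(cases, 2), round(rate, 1))
```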
Problem 3.2.
In a prospective study subjects were randomly allocated to four different treatment groups and followed over time
to determine whether their health changed over the course of follow-up. One group received a placebo, one group
received drug A, one group received drug B, and one group received both drugs A and B. The responses of all patients
are summarized below.
Therefore, for this study there is one explanatory variable (treatment) and one trinomial response variable (improved,
no difference, worse). We want to know if the treatments have any effect on the response, or in other words,
if the distribution of responses is the same for each of the four treatment groups.
Let πij denote the probability of outcome j for the ith treatment group, let yij denote the frequency with which
this happens, j = 1, 2, 3, and let mi = yi1 + yi2 + yi3 , i = 1, 2, 3, 4.
a. Write down the likelihood function for the data in Table 1 under the most general model and give the null
hypothesis of interest (stated above) in terms of the parameters of this model.
b. Give expressions for the maximum likelihood estimate of πij under the general model and under the null hypothesis in (a).
c. Give the form of the log-linear model corresponding to the null hypothesis. Define any notation that you introduce.
d. Fit the model in (c) and draw conclusions regarding the presence of any association in this table.
Problem 3.3.
The data below arise from a study investigating the relation between cigarette smoking behaviour, hypertension
status and proteinurea (the presence of protein in urine) among expectant mothers.
Data on Smoking, Hypertension and Proteinurea
                      Hypertension: Yes       Hypertension: No
                      Proteinurea             Proteinurea
Cigarettes/day        Yes        No           Yes        No
0                     439      1740           294      5132
1-19                  195       811           244      3625
>20                    31       154            51       658
a. Find the most appropriate log-linear model to describe the association between these three factors and discuss
your findings. Use odds ratios to describe any associations you find.
b. Suppose proteinurea was a response in a logistic model. Find a suitable logistic model using the covariates
cigarettes/day and hypertension and interpret this model and the associations it implies.
c. Briefly discuss the differences in the two modeling strategies in (a) and (b) and the relation between the findings
in the two analyses.
Problem 3.4.
The first table below presents data collected by a social scientist on inter-generational social mobility in Britain. The
data were collected by randomly sampling males from the general population, assessing their social status, and that
of their father. The rating of social status was summarized using five categories, with category 1 being the lowest and
5 the highest. The second table contains data from an analogous Danish study.
a. Write down the likelihood for the British data under the most general multinomial model.
b. Give the form of the corresponding log-linear model you need to fit to assess this hypothesis of perfect social
mobility in Britain (define any notation you introduce). Fit the model and make conclusions regarding this
hypothesis.
c. By fitting further log-linear models, assess whether the social mobility patterns are the same in Britain and
Denmark.
Problem 3.5.
The table below presents data collected by social scientists on inter-generational social mobility in Britain and
Denmark. The data were collected by randomly sampling males from the general populations, and assessing their
social status and that of their father. The rating of social status presented here was obtained by collapsing the
classifications on the original 5-point scale to give a simple designation of lower versus middle/upper class.
Suppose that the saturated log-linear model for the full table (using data from both countries) is given by
$$\log \mu_{ijk} = u + u^F_i + u^S_j + u^C_k + u^{FS}_{ij} + u^{FC}_{ik} + u^{SC}_{jk} + u^{FSC}_{ijk}$$
where the superscripts F, S, and C represent father, son and country respectively, and $u^F_1 = u^S_1 = u^C_1 = 0$,
$u^{FS}_{1j} = u^{FS}_{i1} = u^{FC}_{1k} = u^{FC}_{i1} = u^{SC}_{1k} = u^{SC}_{j1} = 0$, and $u^{FSC}_{1jk} = u^{FSC}_{i1k} = u^{FSC}_{ij1} = 0$.
a. Write down the log-linear model you would fit and assess to test the null hypothesis of perfect social mobility
based only on the British data. Be sure to define any terms you introduce and any constraints they might have.
b. Give an explicit expression for the deviance statistic for testing the hypothesis of perfect social mobility in (a)
and state the degrees of freedom.
c. The attached Splus output contains results from fitting several models to the full data set. By comparing the fit of
any relevant models, draw conclusions regarding the nature of social mobility patterns in Britain and Denmark.
Provide estimates of one or more relevant odds ratios, associated 95% confidence intervals, and p-values in your
summary statements.
Problem 3.6.
a. In class it was stated that Poisson regression models may be used to model data in the form of two-way contingency
tables (where one factor variable is cross-tabulated against another). The distributions that we derived for the
contingency tables were either multinomial or product multinomial depending on whether the total of the entire
table was fixed by design, or just the row totals were fixed. Explain carefully why we can fit Poisson regression
models to address questions about multinomial or product multinomial distributions.
b. The following data relates to a flu vaccine trial in which patients were randomized to receive either a vaccine or
placebo treatment. The response of interest is the severity of their flu symptoms over the course of a full season;
this is classified as mild, moderate, or severe.
Response
Mild Moderate Severe Total
Placebo 25 8 5 38
Vaccine 6 18 11 35
Test the hypothesis that the response pattern is the same for the placebo and vaccine groups using the Pearson
chi-square statistic for an r × s contingency table.
c. Analyse the data using log-linear models. Describe and interpret the final model non-technically.
d. For the model corresponding to no difference in the response distribution for placebo and vaccinated individuals,
calculate the deviance residuals. Comment on these residuals and the nature of any lack of fit, if present.
Problem 3.7.
The following table gives results of a study examining the relationship between smoking status and a breathing test
result, by age, based on individuals in industrial plants in Houston in 1974 and 1975.
Breathing Test Results
Age Smoking Status Normal Not Normal
< 40 Never Smoked 577 34
< 40 Current Smoker 682 57
Problem 3.8.
Suppose a sample of Poisson counts is available. Let f0, f1, ..., fK denote the observed frequencies of 0, 1, 2, ..., K
counts respectively, where for convenience if y ≥ K we denote it by y = K. If K is sufficiently large this will lead to
little loss of information. In this notation, the total sample size is given by $n = \sum_{k=0}^{K} f_k$.
The table below reports on the frequency of repairs required for n = 550 army vehicles over a 10 year period.
It is hoped that it is reasonable to assume that the distribution of the number of repairs may be represented by a
Poisson distribution with mean µ.
a. Give the null hypothesis in terms of relevant parameters. Define any new parameters that you introduce.
b. Write down the likelihood in terms of µ and the frequencies (i.e. under the null hypothesis), and under the
alternative hypothesis.
c. Find expressions for the MLE of µ and gk(µ), where gk(µ) is Pr(y = k).
d. Obtain the expected frequencies and test the goodness-of-fit of a Poisson distribution for the following data.
# of visits     0     1     2     3     4     5    6+
frequency     295   190    53     5     5     2     0
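The steps in parts c and d can be sketched in plain Python as below. This is an illustration under stated assumptions, not a definitive solution: the names are hypothetical, and the MLE is taken to be the sample mean, which is exact here only because the pooled "6+" cell is empty (with a non-empty top cell the grouped likelihood would have to be maximized numerically).

```python
import math

# Observed frequencies for counts 0, 1, ..., 5 and the pooled cell "6+".
freq = [295, 190, 53, 5, 5, 2, 0]
K = len(freq) - 1          # index of the pooled top cell
n = sum(freq)

# Assumption: the pooled cell is empty, so the grouped-data likelihood
# reduces to the ordinary Poisson likelihood and the MLE is the sample mean.
mu_hat = sum(k * f for k, f in enumerate(freq)) / n

def cell_probs(mu, K):
    """g_k(mu) = Pr(y = k) for k < K, with the remaining mass in cell K."""
    g = [math.exp(-mu) * mu ** k / math.factorial(k) for k in range(K)]
    g.append(1.0 - sum(g))
    return g

# Expected frequencies n * g_k(mu_hat).
expected = [n * g for g in cell_probs(mu_hat, K)]

# Pearson goodness-of-fit statistic. One parameter is estimated, so the
# reference chi-square distribution has (number of cells - 1 - 1) df.
x2 = sum((f - e) ** 2 / e for f, e in zip(freq, expected))
df = len(freq) - 2

print(mu_hat, df, round(x2, 2))
print([round(e, 2) for e in expected])
```

Several of the upper cells have very small expected frequencies, so in practice those cells would usually be pooled before the statistic is referred to a chi-square distribution, with the degrees of freedom reduced accordingly.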
Problem 3.9.
Ashford and Sowden (1970) conducted a study on the health of British coal miners who were smokers but did
not show radiological signs of a lung disorder called pneumoconiosis. A simple random sample of employed miners
were assessed to determine whether they became breathless or coughed upon mild exertion. The resulting data are
displayed in the following table, stratified according to five year age intervals.
                     Breathless
              Yes                  No
            Coughed             Coughed
Age       Yes     No          Yes      No
20-24       9      7           95    1841
25-29      23      9          105    1654
30-34      54     19          177    1863
35-39     121     48          257    2357
40-44     169     54          273    1778
45-49     269     88          324    1712
50-54     404    117          245    1324
55-59     406    152          225     967
60-64     372    106          132     526
Suppose an analysis is planned to assess the association between age (A), breathlessness (B), and coughing (C).
The following saturated log-linear model could be fit
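The displayed equation does not survive at this point in the text. In conventional notation, a saturated log-linear model for the expected cell counts of a three-way table is usually written as shown here; the λ symbols and the indices i (age), j (breathlessness) and k (coughing) are assumptions and need not match the notation of the original.

```latex
\log \mu_{ijk} = \lambda + \lambda^{A}_{i} + \lambda^{B}_{j} + \lambda^{C}_{k}
  + \lambda^{AB}_{ij} + \lambda^{AC}_{ik} + \lambda^{BC}_{jk} + \lambda^{ABC}_{ijk},
\qquad i = 1, \dots, 9, \quad j, k = 1, 2,
```

with the usual identifiability constraints (for example, setting every term that involves a reference level to zero).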
a. Fit log-linear models to these data. Justify your choice for the model that you feel is most appropriate for
characterizing the association between these three factors, and summarize the nature of the association(s).
b. Provide point estimates of two or more odds ratios characterizing the association between pairs of variables and
carefully interpret their meaning.
c. If interest was strictly in modeling the coughing as a response, give the logistic regression model that you would
fit to obtain the same results you obtained in a). Give the numerical values for the regression coefficients in this
logistic regression model based on the estimated coefficients in part a).
d. What information do you lose in the logistic regression model in c) which you can obtain from a log-linear model
in a)?
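As a starting point for part b, the sample odds ratio between breathlessness and coughing can be computed within each age band. The sketch below uses hypothetical names and gives the raw stratum-specific estimates, which are the values a saturated model would reproduce; a reduced model chosen in part a would generally give different, smoothed estimates.

```python
# Counts per age band, in the order:
# (breathless & coughed, breathless & no cough,
#  not breathless & coughed, not breathless & no cough).
miners = {
    "20-24": (9, 7, 95, 1841),
    "25-29": (23, 9, 105, 1654),
    "30-34": (54, 19, 177, 1863),
    "35-39": (121, 48, 257, 2357),
    "40-44": (169, 54, 273, 1778),
    "45-49": (269, 88, 324, 1712),
    "50-54": (404, 117, 245, 1324),
    "55-59": (406, 152, 225, 967),
    "60-64": (372, 106, 132, 526),
}

def odds_ratio(n11, n12, n21, n22):
    """Cross-product ratio of a 2 x 2 table."""
    return (n11 * n22) / (n12 * n21)

# Breathlessness-by-coughing odds ratio within each age band.
stratum_or = {age: odds_ratio(*cells) for age, cells in miners.items()}

for age, value in stratum_or.items():
    print(age, round(value, 2))
```

A value above 1 in a given band means that the odds of coughing are higher among breathless miners than among non-breathless miners of that age.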
Problem 3.10.
Let Yi0 denote the prior count for subject i, i = 1, . . . , 59. To analyze the epilepsy data set, assume Yi0 |ui ∼
Poisson(ui λ0 ) where given ui , Yi0 and (Yi1 , Yi2 , Yi3 , Yi4 ) are independently distributed.
a. Derive the joint conditional distribution of (Yi1 , Yi2 , Yi3 , Yi4 ) given Yi0 + Yi1 + · · · + Yi4 and describe how to
estimate β1 based on a logistic regression model.
b. Fit a logistic regression model to estimate β1 (and give a 95% confidence interval) and contrast your findings
with those in 1). Use the data on epilepsy.
DATA FROM THE EPILEPSY TRIAL
             PLACEBO                              PROGABIDE
                   TRT   PRIOR                          TRT   PRIOR
 Y1  Y2  Y3  Y4  GROUP   COUNT  AGE     Y1  Y2  Y3  Y4  GROUP   COUNT  AGE
  5   3   3   3      0      11   31      0   4   3   0      1      19   20
  3   5   3   3      0      11   30      3   6   1   3      1      10   20
  2   4   0   5      0       6   25      2   6   7   4      1      19   18
  4   4   1   4      0       8   36      4   3   1   3      1      24   24
  7  18   9  21      0      66   22     22  17  19  16      1      31   30
  5   2   8   7      0      27   29      5   4   7   4      1      14   35
  6   4   0   2      0      12   31      2   4   0   4      1      11   57
 40  20  23  12      0      52   42      3   7   7   7      1      67   20
  5   6   6   5      0      23   37      4  18   2   5      1      41   22
 14  13   6   0      0      10   28      2   1   1   0      1       7   28
 26  12   6  22      0      52   36      0   2   4   0      1      22   23
 12   6   8   5      0      33   24      5   4   0   3      1      13   40
  4   4   6   2      0      18   23     11  14  25  15      1      46   43
  7   9  12  14      0      42   36     10   5   3   8      1      36   21
 16  24  10   9      0      87   26     19   7   6   7      1      38   35
 11   0   0   5      0      50   26      1   1   2   4      1       7   25
  0   0   3   3      0      18   28      6  10   8   8      1      36   26
 37  29  28  29      0     111   31      2   1   0   0      1      11   25
  3   5   2   5      0      18   32    102  65  72  63      1     151   22
  3   0   6   7      0      20   21      4   3   2   4      1      22   32
  3   4   3   4      0      12   29      8   6   5   7      1      42   25
  3   4   3   4      0       9   21      1   3   1   5      1      32   35
  2   3   3   5      0      17   32     18  11  28  13      1      56   21
  8  12   2   8      0      28   25      6   3   4   0      1      24   41
 18  24  76  25      0      55   30      3   5   4   3      1      16   32
  2   1   2   1      0       9   40      1  23  19   8      1      22   26
  3   1   4   2      0      10   19      2   3   0   1      1      25   21
 13  15  13  12      0      47   22      0   0   0   0      1      13   36
 11  14   9   8      1      76   18      1   4   3   2      1      12   37
  8   6   9   4      1      38   32
(The last two rows of the left-hand panel have TRT GROUP = 1 and continue the progabide listing.)
Problem 3.11.
Poisson regression models may be used to examine the association of two or more factors in multi-way contingency
tables. Consider the case of a two-way contingency table in which two factors (X and Y, say) are cross classified.
a. Show why we can use a Poisson regression model to investigate the association between X and Y , even though
the data were collected this way.
b. What term must be in the Poisson model to provide valid inferences under this sampling plan?
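One standard result that bears on part a is the link between independent Poisson counts and multinomial sampling. The sketch below assumes, purely for illustration, that the sampling plan fixes the overall total n; the plan itself is not stated at this point in the text, and the notation (μ_ij for cell means, a subscript + for summation over an index) is an assumption. If the cell counts Y_ij are independent Poisson(μ_ij), then conditioning on their total gives

```latex
\Pr\Bigl(Y = y \,\Big|\, \sum_{i,j} Y_{ij} = n\Bigr)
  = \frac{\prod_{i,j} e^{-\mu_{ij}} \mu_{ij}^{y_{ij}} / y_{ij}!}
         {e^{-\mu_{++}} \mu_{++}^{\,n} / n!}
  = \frac{n!}{\prod_{i,j} y_{ij}!} \prod_{i,j} \Bigl(\frac{\mu_{ij}}{\mu_{++}}\Bigr)^{y_{ij}},
\qquad \mu_{++} = \sum_{i,j} \mu_{ij},
```

which is a multinomial distribution with cell probabilities π_ij = μ_ij / μ_++. The same argument applied within each row covers the case where the row totals, rather than the overall total, are fixed by design.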
Problem 3.12.
A sample of 1729 individuals were cross-classified according to whether or not they (1) read newspapers, (2) listen
to the radio, (3) do “solid” reading, (4) attend lectures, and (5) have a good knowledge regarding cancer.
                                          Radio                                  No Radio
                              Solid Reading   No Solid Reading      Solid Reading   No Solid Reading
                                Knowledge        Knowledge            Knowledge        Knowledge
                               Good   Poor      Good   Poor          Good   Poor      Good   Poor
Newspapers      Lectures         23      8         8      4            27     18         7      6
                No Lectures     102     67        35     59           201    177        75    156
No Newspapers   Lectures          1      3         4      3             3      8         2     10
                No Lectures      16     16        13     50            67     83        84    393
a. Analyse the data using hierarchical log linear models without distinguishing between response and explanatory
variables.
b. Briefly explain the final model chosen as non-technically as possible.
c. Compare the analysis with your analysis based on the logistic model that you carried out in assignment 3 with
“good knowledge of cancer” as the response. Which do you feel more comfortable with?