Bayes 1
Chris Holmes
Professor of Biostatistics
Oxford Centre for Gene Function
Objectives of Course
Key References:
Gelman, A. et al. Bayesian Data Analysis. 2nd ed. (2004). Chapman & Hall.
Gilks, W. R. et al., eds. Markov chain Monte Carlo in Practice. (1996). Chapman & Hall.
◦ Axiomatic
– Bayesian statistics has an axiomatic foundation, from which all
procedures then follow.
– It is prescriptive in telling you how to act coherently
– “If you don’t adopt a Bayesian approach you must ask yourself
which of the axioms are you prepared to discard?” See
disadvantages.
◦ Unified framework.
– Random effects, Hierarchical models, Missing variables, Nested
and Non-nested models. All handled in the same framework.
◦ Intuitive
– For a model parameterised by θ , what is the interpretation of
P r(θ < 0|y), where y denotes “data”?
– Compare with confidence intervals for θ
– When testing a null hypothesis H0 against an alternative H1 , the
posterior probability Pr(H0 |y) has a direct interpretation, in
contrast to a p-value
◦ Inference is subjective
– You and I, on observing the same data, may be led to different
conclusions
– This seems to fly in the face of scientific reasoning
p(y, θ) = p(y|θ)p(θ)

where y = (y1 , . . . , yn ) denotes the data

• the prior p(θ) elicits a model space and a distribution on that model
space via a distribution on parameters

BAYES THEOREM

p(θ|y) = p(y, θ) / p(y)
       = p(y|θ)p(θ) / p(y)
p(ỹ|y) = ∫ p(ỹ|θ, y)p(θ|y) dθ
       = ∫ p(ỹ|θ)p(θ|y) dθ

Again, note that all three quantities are defined via probability statements
on the unknown variable of interest
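As a toy numerical illustration of Bayes theorem (a hypothetical coin-flip example, not from the course), the posterior can be computed on a discrete grid of θ values:

```python
import numpy as np

# Hypothetical example: posterior for a coin's heads-probability theta
# after observing y = 7 heads in n = 10 flips, on a discrete grid.
theta = np.linspace(0.01, 0.99, 99)       # grid over the parameter space
prior = np.ones_like(theta) / len(theta)  # uniform prior p(theta)
y, n = 7, 10
likelihood = theta**y * (1 - theta)**(n - y)  # p(y|theta), binomial kernel

# Bayes theorem: p(theta|y) = p(y|theta) p(theta) / p(y)
unnorm = likelihood * prior
posterior = unnorm / unnorm.sum()   # dividing by p(y), the sum of the numerator

print(theta[np.argmax(posterior)])  # posterior mode, near y/n = 0.7
```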
y = xβ + ε

where ε ∼ N (0, σ 2 I). Alternatively,

y ∼ N (y|xβ, σ 2 I)

and the least-squares estimate is

β̂ = (x′x)−1 x′y

Then,

p(β|y) ∝ p(y|β)p(β)
       ∝ (σ 2 )−n/2 exp[−(1/(2σ 2 ))(y − xβ)′(y − xβ)] × |v|−1/2 exp[−(1/2)β′v −1 β]
Note
◦ Practically, there are often standard and well-used forms for the set
{p(y|θ), p(θ)}
◦ In the example above, the choice of p(β) = N (β|0, v) leads to
easy (closed-form) calculations of p(β|y) and p(ỹ|y)
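As a sketch of these closed-form calculations (the simulated data and all hyperparameter values are illustrative, not from the course), the conjugate posterior is p(β|y) = N (m, S) with S = (x′x/σ 2 + I/v)−1 and m = Sx′y/σ 2 :

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data for the normal linear model y = x beta + eps, eps ~ N(0, sigma2 I)
n, p = 100, 2
sigma2 = 1.0                      # noise variance, assumed known
v = 10.0                          # prior variance: p(beta) = N(0, v I)
x = rng.normal(size=(n, p))
beta_true = np.array([1.5, -0.5])
y = x @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# Closed-form conjugate posterior p(beta|y) = N(m, S):
#   S = (x'x / sigma2 + I/v)^{-1},  m = S x'y / sigma2
S = np.linalg.inv(x.T @ x / sigma2 + np.eye(p) / v)
m = S @ x.T @ y / sigma2

print(m)  # posterior mean; close to the least-squares estimate when v is large
```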
◦ Often we will use forms for p(θ) that depend on p(y|θ) and which
make these calculations easy
– when the prior and the posterior are from the same family of
distributions (such as normal-normal) the prior is termed
conjugate
– Note, from a purist perspective this is putting the cart before the
horse
We have seen that for the normal linear regression with known noise
variance and prior p(β) = N (0, v), the posterior is
p(β|y) = N (β|m, S), where S = (x′x/σ 2 + v −1 )−1 and m = Sx′y/σ 2
Examples:
For a set A ⊆ Θ,

Pr(θ ∈ A|y) ≈ (1/M ) Σ_{i=1}^M I(θ (i) ∈ A)

where I(·) is the logical indicator function
F_{θj}(a) ≈ (1/M ) Σ_{i=1}^M I(θj (i) ≤ a)
Note that all these quantities can be computed from the same
bag of samples. That is, we can first collect θ (1) , . . . , θ (M ) as a
proxy for p(θ|y) and then use the same set of samples over and
again for whatever we are subsequently interested in.
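A sketch of this reuse of one bag of samples (here the "posterior" samples are simply drawn from a known N (0, 1) target so the estimates can be checked):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for MCMC output: pretend these M draws are samples from p(theta|y).
M = 100_000
theta = rng.normal(size=M)

# The same bag of samples answers several different questions:
prob_neg = np.mean(theta < 0)                   # Pr(theta < 0 | y)
prob_in_A = np.mean((theta > 1) & (theta < 2))  # Pr(theta in A | y), A = (1, 2)
cdf_at_1 = np.mean(theta <= 1.0)                # F_theta(1) = Pr(theta <= 1 | y)
post_mean = theta.mean()                        # E[theta | y]

print(prob_neg, prob_in_A, cdf_at_1, post_mean)
```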
2 Modelling Data
p(θ (i) ) ≈ p(θ|y)

P n (θ (0) , A) ≈ P (θ ∈ A|y)

That is, the distribution of the state of the chain, p(θ (i) ), converges
to the target density, p(θ|y), as i gets "large"
p(yi |xi )
3 MCMC Algorithms
First, recall that MCMC is an iterative procedure, such that given the
current state of the chain, θ (i) , the algorithm makes a probabilistic
update to θ (i+1)
–MCMC Algorithm–
θ (0) ← x
For i = 1 to M
    θ (i) ← f (θ (i−1) )
End
where f (·) outputs a draw from a conditional probability density
–M-H Algorithm–
θ (0) ← x
For i = 0 to M
    Draw θ̃ ∼ q(θ̃|θ (i) )
    Set θ (i+1) ← θ̃ with probability α(θ (i) , θ̃);
    else set θ (i+1) ← θ (i) , where

        α(a, b) = min{ 1, [p(b|y)q(a|b)] / [p(a|y)q(b|a)] }

End
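A minimal random-walk Metropolis-Hastings sketch (the N (3, 1) target and Gaussian proposal are illustrative choices, not from the course; with a symmetric proposal the q terms cancel in α):

```python
import numpy as np

rng = np.random.default_rng(2)

# log_target stands in for log p(theta|y); here a N(3, 1) density (up to a constant).
def log_target(theta):
    return -0.5 * (theta - 3.0) ** 2

M = 50_000
theta = np.empty(M + 1)
theta[0] = 0.0                                    # arbitrary starting state
for i in range(M):
    prop = theta[i] + rng.normal(scale=1.0)       # draw from q(.|theta_i), symmetric
    log_alpha = log_target(prop) - log_target(theta[i])
    if np.log(rng.uniform()) < log_alpha:         # accept w.p. min(1, ratio)
        theta[i + 1] = prop
    else:
        theta[i + 1] = theta[i]                   # else stay at the current state

burned = theta[5000:]                             # discard a burn-in portion
print(burned.mean(), burned.std())                # approx the target mean 3, sd 1
```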
Note:
◦ The acceptance term corrects for the fact that the proposal density
is not the target distribution
◦ To accept with probability α(a, b) = min{1, [p(b|y)q(a|b)] / [p(a|y)q(b|a)]},
draw u ∼ Uniform(0, 1) and accept the move if u ≤ α(a, b)
◦ Pretty much any q(a|b) will do, so long as it gets you around the
state space Θ. However, different choices of q(a|b) lead to different
levels of performance, in terms of the convergence rate to the target
distribution and exploration of the model space
–Gibbs Sampler–
θ (0) ← x
For i = 0 to M
    Set θ̃ ← θ (i)
    For j = 1 to p
        Draw X ∼ p(θj |θ̃−j , y)
        Set θ̃j ← X
    End
    Set θ (i+1) ← θ̃
End
Note:
y ∼ N (y|xβ, σ 2 I)
Suppose we take,
p(β) = N (β|0, v)
p(σ 2 ) = IG(σ 2 |a, b)
where IG(·|a, b) denotes the Inverse-Gamma density,
Extending this hierarchically with a hyperprior on v,
y ∼ N (y|xβ, σ 2 I)
β ∼ N (β|0, vI)
σ 2 ∼ IG(σ 2 |a, b)
v ∼ IG(v|c, d)
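A Gibbs sampler for a simplified version of this model (holding v fixed rather than sampling it, with simulated data and illustrative hyperparameters) alternates between the two standard full conditionals: β|σ 2 , y ∼ N (Sx′y/σ 2 , S) with S = (x′x/σ 2 + I/v)−1 , and σ 2 |β, y ∼ IG(a + n/2, b + (1/2)(y − xβ)′(y − xβ)):

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data from the normal linear model
n, p = 200, 2
sigma2_true = 0.5
x = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0])
y = x @ beta_true + rng.normal(scale=np.sqrt(sigma2_true), size=n)

v, a, b = 100.0, 2.0, 2.0            # prior hyperparameters (illustrative)
sigma2 = 1.0                         # initial state of the chain
M = 5000
beta_draws = np.empty((M, p))
sigma2_draws = np.empty(M)
for i in range(M):
    # beta | sigma2, y  ~  N(m, S): conjugate normal update
    S = np.linalg.inv(x.T @ x / sigma2 + np.eye(p) / v)
    m = S @ x.T @ y / sigma2
    beta = rng.multivariate_normal(m, S)
    # sigma2 | beta, y  ~  IG(a + n/2, b + 0.5 ||y - x beta||^2);
    # an IG(shape, rate) draw is 1 / Gamma(shape, scale=1/rate)
    resid = y - x @ beta
    sigma2 = 1.0 / rng.gamma(a + n / 2, 1.0 / (b + 0.5 * resid @ resid))
    beta_draws[i] = beta
    sigma2_draws[i] = sigma2

print(beta_draws[1000:].mean(axis=0), sigma2_draws[1000:].mean())
```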
Other sampling schemes include:
• Slice sampling
• Rejection sampling
• Ratio of uniforms
Some Issues
◦ Given the current state of the chain, θ (i) , the algorithm makes a
probabilistic update to θ (i+1)
◦ Early samples are drawn before the chain has converged to the target
distribution and should be discarded; that is, the first B samples,
{θ (0) , θ (1) , . . . , θ (B) }, are discarded
◦ This user-defined initial portion of the chain to discard is known as
the burn-in phase of the chain
http://www.mrc-bsu.cam.ac.uk/bugs/documentation/Download/cdaman03.pdf
Graphical Analysis
The first step in any output analysis is to eyeball sample traces,
{θj (1) , . . . , θj (M ) }, for a set of key variables j
There should be no continuous drift in the sequence of values
following burn-in (as the samples are supposed to follow the same
distribution)
For example, usually, θ (0) is far away from the major support of the
posterior density
Output 88
Initially then, the chain will often be seen to “migrate” away from θ (0)
towards a region of high posterior probability centred around a mode
of p(θ|y)
CODA offers four formal tests for convergence, perhaps the two most
popular being those reported by Geweke and those of Gelman and Rubin
4.2 Geweke
◦ The method compares the within-chain and between-chain variances for
each variable. When the chains have "mixed" (converged), the two
variance estimates should be in close agreement
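The within/between-chain variance comparison is the basis of the Gelman and Rubin diagnostic; a minimal sketch of the potential scale reduction statistic (a simplified single-variable version, on synthetic chains):

```python
import numpy as np

rng = np.random.default_rng(4)

def gelman_rubin(chains):
    """Potential scale reduction for several chains of one scalar variable.

    chains: array of shape (n_chains, n_samples). Compares the
    between-chain and within-chain variance estimates; values near 1
    suggest the chains have mixed.
    """
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()   # within-chain variance
    B = n * chain_means.var(ddof=1)         # between-chain variance
    var_hat = (n - 1) / n * W + B / n       # pooled variance estimate
    return np.sqrt(var_hat / W)

# Four well-mixed chains from the same distribution: statistic near 1
good = rng.normal(size=(4, 2000))
# Chains stuck in different regions of the space: statistic well above 1
bad = rng.normal(size=(4, 2000)) + np.array([[0.0], [5.0], [0.0], [5.0]])
print(gelman_rubin(good), gelman_rubin(bad))
```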
The Theory
f̄ = (1/M ) Σ_{i=1}^M f (θ (i) )
and E[f (θ)] denotes the true unknown expectation. Note that almost
all quantities of interest can be written as expectations
Hence, the greater the covariance between samples, the greater the
variance of the MCMC estimator (for a given sample size M )
In Practice
ESS = M / (1 + 2 Σ_{j=1}^k ρ(j))
If run time is not an issue but storage is, it is useful to thin the chain
by saving only one in every T samples; clearly this will reduce the
autocorrelation in the saved samples
5 Conclusions
Often the posterior will not be of standard form (for example when the
prior is non-conjugate)