Bayesian Statistics
Maximum likelihood estimation for the coin toss

Let D = {x^(1), ..., x^(n)} be the observed sequence of coin tosses, with h heads out of n tosses, and let θ be the probability of tossing a head. The likelihood is the probability of observing this particular sequence of heads and tails in D:

\[ L(\theta) = p(x^{(1)}, \dots, x^{(n)} \mid \theta) = p(x^{(1)}\mid\theta)\, p(x^{(2)}\mid\theta) \cdots p(x^{(n)}\mid\theta) = \theta^{h} (1-\theta)^{n-h} \]

\[ \log L(\theta) = h \log\theta + (n-h)\log(1-\theta) \]

\[ \frac{d}{d\theta}\log L(\theta) = \frac{h}{\theta} - \frac{n-h}{1-\theta} = 0 \quad\Longrightarrow\quad \theta_{ML} = \frac{h}{n} \]

• After one toss, if it comes up tails, our ML estimate predicts zero probability of seeing heads. If the first n tosses are all tails, the ML estimate continues to predict zero probability of seeing heads.
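The ML estimate and its failure mode can be sketched in a few lines of Python (the toss sequences below are illustrative, not from the text):

```python
# Maximum-likelihood estimate for the coin toss: theta_ML = h / n,
# where h is the number of heads (1s) among n tosses.
def ml_estimate(tosses):
    """ML estimate of P(heads) from a list of 0/1 tosses."""
    return sum(tosses) / len(tosses)

# All-tails data: the ML estimate assigns zero probability to heads.
print(ml_estimate([0, 0, 0]))     # -> 0.0
# 3 heads in 4 tosses.
print(ml_estimate([1, 0, 1, 1]))  # -> 0.75
```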
Including prior knowledge in the estimation process

• Even though the ML estimator might say θ_ML = 0, we "know" that the coin can come up both heads and tails, i.e. θ > 0.
• Our starting point is that θ is not only a number: we will give θ a full probability distribution.
• Suppose we know that the coin is either fair (θ = 0.5) with probability p, or biased in favor of tails (θ = 0.4) with probability 1 − p.
• We want to combine this prior knowledge with new data D (i.e. the number of heads in n throws) to arrive at a posterior distribution for θ. We will apply Bayes' rule.
Posterior distribution:

\[ p(\theta \mid D) = \frac{p(\theta, D)}{p(D)} = \frac{p(\theta)\, p(D \mid \theta)}{\int p(\theta')\, p(D \mid \theta')\, d\theta'} \]

(When θ takes only discrete values, as in our two-point example, the integral in the denominator becomes a sum.) The numerator is just the joint distribution of θ and D, evaluated at the particular observed D. The denominator is the marginal distribution of the data D; that is, it is just a number that makes the numerator integrate to one.
Maximum a-posteriori estimate

We define the MAP estimate as the maximum (i.e. the mode) of the posterior distribution.

MAP estimator:

\[ \theta_{MAP} = \arg\max_\theta\, p(\theta \mid D) = \arg\max_\theta\, p(D \mid \theta)\, p(\theta) = \arg\max_\theta\, \bigl[ \log p(D \mid \theta) + \log p(\theta) \bigr] \]

The latter version makes the comparison to the maximum likelihood estimate easy: we see that ML and MAP are identical if p(θ) is a constant that does not depend on θ. In that case our prior is a uniform distribution over the domain of θ. We call such a prior, for obvious reasons, a flat or uninformative prior.
Summary

• Bayesian estimation involves the application of Bayes' rule to combine a prior density and a conditional density to arrive at a posterior density:

\[ p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} \]

• Maximum a posteriori (MAP) estimation: if we need a "best guess" from our posterior distribution, the maximum (mode) of the posterior distribution is often used:

\[ \theta_{MAP} = \arg\max_\theta\, p(\theta \mid D) = \arg\max_\theta\, p(D \mid \theta)\, p(\theta) \]

• The MAP and ML estimates are identical when our prior on θ is uniformly distributed, i.e. flat or uninformative.
• With a two-way classification problem and data that are Gaussian given the category membership, the posterior is a logistic function, linear in the data:

\[ P(x = 1 \mid y) = \frac{1}{1 + \exp(-\boldsymbol{\theta}^T \tilde{y})} \]
In Bayesian statistics, a maximum a posteriori probability (MAP) estimate is
a mode of the posterior distribution. The MAP can be used to obtain a
point estimate of an unobserved quantity on the basis of empirical data. It is
closely related to Fisher's method of maximum likelihood (ML), but employs
an augmented optimization objective which incorporates a prior distribution
over the quantity one wants to estimate. MAP estimation can therefore be
seen as a regularization of ML estimation.
Bayesian estimation for a potentially biased coin

• Suppose that we believe that the coin is either fair, or that it is biased toward tails. Let θ = probability of tossing a head. We observe n coin tosses, D = {x^(1), ..., x^(n)}, out of which h trials are heads:

\[ p(D \mid \theta) = \theta^{h} (1-\theta)^{n-h}, \qquad p(\theta) = \begin{cases} p & \text{for } \theta = 0.5 \\ 1 - p & \text{for } \theta = 0.4 \\ 0 & \text{otherwise} \end{cases} \]

\[ P(\theta = 0.4 \mid D) = \frac{(1-p)\, 0.4^{h}\, 0.6^{\,n-h}}{p\, 0.5^{n} + (1-p)\, 0.4^{h}\, 0.6^{\,n-h}} \]

Now we can accurately calculate the probability that we have a fair coin, given some data D. In contrast to the ML estimate, which only gave us one number θ_ML, we have here a full probability distribution; that is, we also know how certain we are that we have a fair or unfair coin.

In some situations we would like a single number that represents our best guess of θ. One possibility for this best guess is the maximum a-posteriori (MAP) estimate.
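The two-point posterior above is easy to evaluate numerically. A minimal sketch (the values of h, n, and p below are illustrative, not from the text):

```python
# Posterior probability that the coin is biased (theta = 0.4) under the
# two-point prior: P(theta = 0.5) = p, P(theta = 0.4) = 1 - p.
def posterior_biased(h, n, p):
    """P(theta = 0.4 | D) for h heads observed in n tosses."""
    lik_fair = 0.5 ** n                      # p(D | theta = 0.5)
    lik_biased = 0.4 ** h * 0.6 ** (n - h)   # p(D | theta = 0.4)
    return (1 - p) * lik_biased / (p * lik_fair + (1 - p) * lik_biased)

# Few heads favor the tails-biased hypothesis; many heads favor the fair coin.
print(posterior_biased(h=3, n=10, p=0.5))
print(posterior_biased(h=7, n=10, p=0.5))
```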
Binomial distribution and discrete random variables

Suppose a random variable can take only one of two values (e.g. 0 and 1, success and failure, etc.). Such trials are termed Bernoulli trials.

\[ x \in \{0, 1\}, \qquad P(x = 1) = \theta, \qquad P(x = 0) = 1 - \theta \]

For a sequence of trials x = (x^(1), x^(2), ..., x^(N)), the probability distribution of a single trial is

\[ p(x^{(1)} \mid \theta) = \theta^{x^{(1)}} (1-\theta)^{1 - x^{(1)}} \]

The number of heads h in n trials then follows the binomial distribution:

\[ p(h \mid \theta, n) = \binom{n}{h} \theta^{h} (1-\theta)^{n-h} \]

[Figure: binomial distribution p(h) as a function of h for n = 20.]
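The binomial distribution shown in the figure can be reproduced directly from the formula (the choice θ = 0.5 is illustrative):

```python
from math import comb

# Binomial probability of h heads in n Bernoulli trials with P(head) = theta.
def binom_pmf(h, n, theta):
    return comb(n, h) * theta ** h * (1 - theta) ** (n - h)

# For a fair coin and n = 20 the distribution peaks at h = 10,
# and the probabilities over h = 0..20 sum to one.
probs = [binom_pmf(h, 20, 0.5) for h in range(21)]
print(max(range(21), key=lambda h: probs[h]))  # -> 10
print(abs(sum(probs) - 1.0) < 1e-12)           # -> True
```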
Formulating a continuous prior for the coin toss problem

• θ represents the probability of a head. We want a continuous distribution that is defined between 0 and 1, and is 0 at θ = 0 and θ = 1. The beta distribution has these properties:

\[ p(\theta) = \frac{1}{c}\, \theta^{a-1} (1-\theta)^{b-1}, \qquad c = \int_0^1 \theta^{a-1} (1-\theta)^{b-1}\, d\theta \quad \text{(normalizing constant)} \]

[Figure: beta distributions p(θ) for several parameter settings (α = 4, n = 8; α = 3, n = 6; α = 2, n = 4; α = 1, n = 2; and α = 1 with n = 8, 6, 4, 2), with θ = probability of tossing a head on the horizontal axis.]
When we apply Bayes' rule to integrate some old knowledge (the prior) in the form of a beta distribution with parameters a and b with some new knowledge h and n (coming from a binomial distribution), we find that the posterior distribution also has the form of a beta distribution, with parameters a + h and b + n − h:

\[ p(\theta) = \frac{1}{c}\, \theta^{a-1}(1-\theta)^{b-1}, \qquad c = \int_0^1 \theta^{a-1}(1-\theta)^{b-1}\, d\theta \]

\[ p(D \mid \theta) = \theta^{h}(1-\theta)^{n-h} \]

\[ p(\theta \mid D) = \frac{\theta^{a+h-1}(1-\theta)^{b+n-h-1}}{\displaystyle\int_0^1 \theta^{a+h-1}(1-\theta)^{b+n-h-1}\, d\theta} \]

The beta and binomial distributions are therefore called conjugate distributions.
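Because of conjugacy, the Bayesian update reduces to simple parameter arithmetic. A minimal sketch (the prior Beta(2, 2) and the data 7 heads in 10 tosses are illustrative):

```python
# Conjugate beta-binomial update: a Beta(a, b) prior combined with
# h heads in n tosses gives a Beta(a + h, b + n - h) posterior.
def beta_update(a, b, h, n):
    return a + h, b + n - h

a_post, b_post = beta_update(2, 2, h=7, n=10)
print(a_post, b_post)  # -> 9 5
```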
MAP estimator for the coin toss problem

Let us look at the MAP estimator if we start with a prior with a = 1, n = 2, i.e. we have a slight belief that the coin is fair:

\[ p(\theta) \propto \theta\,(1-\theta) \]

Our posterior is then:

\[ p(\theta \mid D) \propto \theta^{\,h+1} (1-\theta)^{\,n-h+1} \]

Its mode gives the MAP estimate:

\[ \theta_{MAP} = \frac{h+1}{n+2} \]

[Figure: posterior p(θ | D) for n = 2, h = 1.]
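Comparing the MAP and ML estimates for the same data shows how the prior removes the zero-probability problem (the data values are illustrative):

```python
# MAP estimate under the prior p(theta) ∝ theta(1 - theta): the posterior is
# proportional to theta^(h+1) (1-theta)^(n-h+1), whose mode is (h+1)/(n+2).
def map_estimate(h, n):
    return (h + 1) / (n + 2)

# Three tosses, all tails: ML would say 0.0, but MAP stays away from zero.
print(map_estimate(h=0, n=3))  # -> 0.2
```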
\[ P(x = 1 \mid y) = \frac{P(x = 1)\, p(y \mid x = 1)}{\displaystyle\sum_{i=0}^{1} P(x = i)\, p(y \mid x = i)} \]

Height is normally distributed in the population of men and in the population of women, with different means and similar variances. Let x be an indicator variable for being a female. Then the conditional distribution of y (the height) becomes:

\[ p(y \mid x = 1) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{1}{2\sigma^2} (y - \mu_f)^2 \right) \]

\[ p(y \mid x = 0) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{1}{2\sigma^2} (y - \mu_m)^2 \right) \]
Classification with a continuous conditional distribution

Let us further assume that we start with a prior distribution such that x is 1 with probability p.

\[ P(x = 1 \mid y) = \frac{P(x = 1)\, p(y \mid x = 1)}{P(x = 1)\, p(y \mid x = 1) + P(x = 0)\, p(y \mid x = 0)} \]

\[ = \frac{p \exp\!\left( -\frac{1}{2\sigma^2}(y - \mu_f)^2 \right)}{p \exp\!\left( -\frac{1}{2\sigma^2}(y - \mu_f)^2 \right) + (1 - p) \exp\!\left( -\frac{1}{2\sigma^2}(y - \mu_m)^2 \right)} \]

\[ = \frac{1}{1 + \exp\!\left( \log\frac{1-p}{p} + \frac{1}{2\sigma^2}\left[ (y - \mu_f)^2 - (y - \mu_m)^2 \right] \right)} \]

Expanding the squares (the y² terms cancel):

\[ = \frac{1}{1 + \exp\!\left( \log\frac{1-p}{p} + \frac{\mu_f^2 - \mu_m^2}{2\sigma^2} - \frac{\mu_f - \mu_m}{\sigma^2}\, y \right)} = \frac{1}{1 + \exp(-\boldsymbol{\theta}^T \tilde{y})} \]

with

\[ \tilde{y} = (1,\, y)^T, \qquad \boldsymbol{\theta} = \left( \log\frac{p}{1-p} + \frac{\mu_m^2 - \mu_f^2}{2\sigma^2},\; \frac{\mu_f - \mu_m}{\sigma^2} \right)^T \]

The posterior is a logistic function of a linear function of the data and parameters (remember this result from the section on classification!). The maximum-likelihood argument would just have decided under which model the data would have been more likely. The posterior distribution gives us the full probability that we have a male or a female, and we can also include prior knowledge in our scheme.
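The derivation above claims that the direct Bayes computation and the logistic/linear form are the same function. This can be checked numerically (the parameter values are illustrative):

```python
import math

# Posterior P(x = 1 | y) computed two ways: directly from Bayes' rule, and
# via the logistic form 1 / (1 + exp(-theta^T y~)) with y~ = (1, y).
def posterior_bayes(y, mu_f, mu_m, sigma, p):
    lf = math.exp(-(y - mu_f) ** 2 / (2 * sigma ** 2))
    lm = math.exp(-(y - mu_m) ** 2 / (2 * sigma ** 2))
    return p * lf / (p * lf + (1 - p) * lm)

def posterior_logistic(y, mu_f, mu_m, sigma, p):
    theta0 = math.log(p / (1 - p)) + (mu_m ** 2 - mu_f ** 2) / (2 * sigma ** 2)
    theta1 = (mu_f - mu_m) / sigma ** 2
    return 1 / (1 + math.exp(-(theta0 + theta1 * y)))

a = posterior_bayes(170, mu_f=166, mu_m=176, sigma=12, p=0.5)
b = posterior_logistic(170, mu_f=166, mu_m=176, sigma=12, p=0.5)
print(abs(a - b) < 1e-12)  # both forms agree -> True
```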
Classification with a continuous conditional distribution

Computing the probability that the subject is female, given that we observed height y:

\[ P(x = 1 \mid y) = \frac{1}{1 + \exp\!\left( \log\frac{1-p}{p} + \frac{\mu_f - \mu_m}{\sigma^2}\left( \frac{\mu_m + \mu_f}{2} - y \right) \right)} \]

with

\[ \mu_m = 176\,\text{cm}, \qquad \mu_f = 166\,\text{cm}, \qquad \sigma = 12\,\text{cm} \]

[Figure: posterior probability P(x = 1 | y) as a function of height y, for the prior probabilities P(x = 1) = 0.5 and P(x = 1) = 0.3.]
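Plugging in the numbers from the text gives the curves shown in the figure; a minimal sketch (the query heights are illustrative):

```python
import math

# Posterior probability of being female at height y (cm), using the values
# from the text: mu_m = 176, mu_f = 166, sigma = 12.
def p_female(y, p, mu_f=166.0, mu_m=176.0, sigma=12.0):
    z = math.log((1 - p) / p) \
        + (mu_f - mu_m) / sigma ** 2 * ((mu_m + mu_f) / 2 - y)
    return 1 / (1 + math.exp(z))

# With a flat prior (p = 0.5), the posterior is exactly 0.5 at the midpoint
# height (171 cm); a lower prior p = 0.3 shifts the whole curve down.
print(round(p_female(171, p=0.5), 3))  # -> 0.5
print(p_female(171, p=0.3) < 0.5)      # -> True
```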