Exponential Family and Generalized Linear Models

Piyush Rai

Probabilistic Machine Learning (CS772A)


Aug 22, 2017

Exponential Family (Pitman, Darmois, Koopman, Late 1930s)

Defines a class of distributions. An Exponential Family distribution is of the form

    p(x|θ) = (1/Z(θ)) h(x) exp[θᵀφ(x)] = h(x) exp[θᵀφ(x) − A(θ)]

x ∈ X^m is the random variable being modeled (where X denotes some space, e.g., R or {0, 1})
θ ∈ R^d: Natural parameters or canonical parameters defining the distribution
φ(x) ∈ R^d: Sufficient statistics (another random variable)
  Why “sufficient”: p(x|θ), as a function of θ, depends on x only via φ(x)
Z(θ) = ∫ h(x) exp[θᵀφ(x)] dx: Partition function
A(θ) = log Z(θ): Log-partition function (also called the cumulant function)
h(x): A constant (doesn’t depend on θ)

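To make the definitional pieces concrete, here is a minimal Python/NumPy sketch (the helper name expfam_pdf and the instantiation below are illustrative choices, not from the slides): it evaluates p(x|θ) = h(x) exp[θᵀφ(x) − A(θ)] for the Bernoulli, the N = 1 case of the Binomial worked out on the next slide, whose natural parameter is θ = log(p/(1 − p)) and whose log-partition function is A(θ) = log(1 + exp(θ)).

    import numpy as np

    def expfam_pdf(x, theta, phi, log_h, A):
        # p(x|theta) = h(x) * exp(theta^T phi(x) - A(theta))
        return np.exp(log_h(x) + np.dot(theta, phi(x)) - A(theta))

    # Bernoulli: phi(x) = x, h(x) = 1, theta = log(p/(1-p)), A(theta) = log(1 + e^theta)
    p = 0.3
    theta = np.array([np.log(p / (1 - p))])
    bern = lambda x: expfam_pdf(x, theta,
                                phi=lambda x: np.array([x]),
                                log_h=lambda x: 0.0,
                                A=lambda t: np.log1p(np.exp(t[0])))

    print(bern(1), bern(0))   # ~0.3 and ~0.7, i.e., p and 1 - p

Any of the distributions discussed below can be plugged into the same three-slot template (h, φ, A).
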
Binomial as Exponential Family

Let’s try to write the Binomial distribution in the exponential family form

    p(x|θ) = h(x) exp[θᵀφ(x) − A(θ)]

Recall the standard definition of the Binomial distribution

    Binomial(x|N, p) = (N choose x) p^x (1 − p)^(N−x)

where N: number of trials, x (a scalar): number of successes, p: probability of success in each trial

Can re-express the above representation of the Binomial as

    Binomial(x|N, p) = (N choose x) exp[ x log(p/(1 − p)) + N log(1 − p) ]

h(x) = (N choose x)
θ = log(p/(1 − p)), and p = 1/(1 + exp(−θ))
φ(x) = x
A(θ) = −N log(1 − p) = N log(1 + exp(θ))

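This identification is easy to sanity-check numerically; a minimal sketch (NumPy/SciPy assumed, variable names mine) compares h(x) exp[θx − A(θ)] against SciPy’s Binomial pmf:

    import numpy as np
    from scipy.stats import binom
    from scipy.special import comb

    N, p = 10, 0.3
    theta = np.log(p / (1 - p))          # natural parameter
    A = N * np.log1p(np.exp(theta))      # log-partition function N log(1 + e^theta)

    x = np.arange(N + 1)
    expfam_pmf = comb(N, x) * np.exp(theta * x - A)     # h(x) exp(theta * phi(x) - A(theta))

    print(np.allclose(expfam_pmf, binom.pmf(x, N, p)))  # True
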
(Univariate) Gaussian as Exponential Family

Let’s try to write the Gaussian distribution in the exponential family form

    p(x|θ) = h(x) exp[θᵀφ(x) − A(θ)]

Recall the standard definition of a univariate Gaussian

    N(x|µ, σ²) = (1/√(2πσ²)) exp[ −(x − µ)²/(2σ²) ]
               = (1/√(2π)) exp[ (µ/σ²) x − (1/(2σ²)) x² − µ²/(2σ²) − log σ ]
               = (1/√(2π)) exp[ (µ/σ², −1/(2σ²))ᵀ (x, x²) − (µ²/(2σ²) + log σ) ]

h(x) = 1/√(2π)
θ = (θ_1, θ_2) = (µ/σ², −1/(2σ²)), and (µ, σ²) = (−θ_1/(2θ_2), −1/(2θ_2))
φ(x) = (x, x²)
A(θ) = µ²/(2σ²) + log σ = −θ_1²/(4θ_2) − (1/2) log(−2θ_2)

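The same numeric check as for the Binomial works here (a sketch, NumPy/SciPy assumed): plugging the natural parameters (µ/σ², −1/(2σ²)) and A(θ) into h(x) exp[θᵀφ(x) − A(θ)] should reproduce the usual Gaussian pdf.

    import numpy as np
    from scipy.stats import norm

    mu, sigma = 1.5, 0.8
    theta = np.array([mu / sigma**2, -1.0 / (2 * sigma**2)])   # natural parameters
    A = mu**2 / (2 * sigma**2) + np.log(sigma)                 # log-partition function

    x = np.linspace(-3.0, 5.0, 9)
    phi = np.stack([x, x**2])                                  # sufficient statistics, shape (2, len(x))
    expfam_pdf = np.exp(theta @ phi - A) / np.sqrt(2 * np.pi)  # h(x) = 1/sqrt(2*pi)

    print(np.allclose(expfam_pdf, norm.pdf(x, loc=mu, scale=sigma)))   # True
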
A General Trick

A general trick to represent any distribution (assuming it is an exp-family dist.) in exp-family form:

Write the given distribution as exp(log p(·)) and simplify, e.g., for the Binomial

    exp(log Binomial(x|N, p)) = exp[ log( (N choose x) p^x (1 − p)^(N−x) ) ]
                              = exp[ log (N choose x) + x log p + (N − x) log(1 − p) ]
                              = (N choose x) exp[ x log(p/(1 − p)) + N log(1 − p) ]

Now compare the resulting expression with the exponential family form

    p(x|θ) = h(x) exp[θᵀφ(x) − A(θ)]

.. to identify the natural parameters, sufficient statistics, log-partition function, etc.

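For simple cases the regrouping can even be mechanized; a small SymPy sketch (SymPy assumed; it sets log h(x) aside first, exactly as in the derivation above) recovers the natural parameter and −A(θ) of the Binomial automatically:

    import sympy as sp

    x, N, p = sp.symbols('x N p', positive=True)

    # log h(x) = log (N choose x) is set aside; regroup the remaining exponent by powers of x
    exponent = x*sp.log(p) + (N - x)*sp.log(1 - p)
    grouped = sp.collect(sp.expand(exponent), x)

    print(grouped.coeff(x, 1))   # log(p) - log(1 - p), i.e., theta = log(p/(1 - p))
    print(grouped.coeff(x, 0))   # N*log(1 - p), i.e., -A(theta)
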
Other Examples

Many other distributions belong to the exponential family:

  Bernoulli
  Beta
  Gamma
  Multinoulli/Multinomial
  Dirichlet
  Multivariate Gaussian
  .. and many more (https://en.wikipedia.org/wiki/Exponential_family)

Note: Not all distributions belong to the exponential family, e.g.,

  Uniform distribution (x ∼ Unif(a, b))
  Student-t distribution
  Mixture distributions (e.g., mixture of Gaussians)

If the support of the distribution depends on its parameters, then it is not an exp. family dist.

Log-Partition Function

A(θ) = log Z(θ) = log ∫ h(x) exp[θᵀφ(x)] dx is the log-partition function

A(θ) is also called the cumulant function

Derivatives of A(θ) can be used to generate the cumulants of the sufficient statistics φ(x)

Exercise: Assume θ to be a scalar (thus φ(x) is also scalar). Show that the first and the second
derivatives of A(θ) are

    dA/dθ   = E_{p(x|θ)}[φ(x)] = mean[φ(x)]
    d²A/dθ² = E_{p(x|θ)}[φ(x)²] − (E_{p(x|θ)}[φ(x)])² = var[φ(x)]

Note: The above result also holds when θ and φ(x) are vector-valued (the “var” will be “covar”)

Important: A(θ) is a convex function of θ. Why?

Exercise: For the Binomial, using its expression of A(θ), derive the first and second cumulants of φ(x)

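For the Binomial these cumulants can be checked with finite differences (a quick sketch, NumPy assumed, using A(θ) = N log(1 + exp(θ)) derived earlier): the first derivative should come out as Np = E[φ(x)] and the second as Np(1 − p) = var[φ(x)].

    import numpy as np

    N, p = 10, 0.3
    theta = np.log(p / (1 - p))
    A = lambda t: N * np.log1p(np.exp(t))    # Binomial log-partition function

    eps = 1e-4
    dA  = (A(theta + eps) - A(theta - eps)) / (2 * eps)
    d2A = (A(theta + eps) - 2 * A(theta) + A(theta - eps)) / eps**2

    print(dA, N * p)               # both ~3.0  (mean of phi(x) = x)
    print(d2A, N * p * (1 - p))    # both ~2.1  (variance of phi(x) = x)
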
MLE for Exponential Family Distributions

Suppose we have data D = {x_1, . . . , x_N} drawn i.i.d. from an exponential family distribution

    p(x|θ) = h(x) exp[θᵀφ(x) − A(θ)]

To do MLE, we need the overall likelihood. This is simply a product of the individual likelihoods

    p(D|θ) = ∏_{i=1}^N p(x_i|θ) = [∏_{i=1}^N h(x_i)] exp[θᵀ ∑_{i=1}^N φ(x_i) − N A(θ)] = [∏_{i=1}^N h(x_i)] exp[θᵀφ(D) − N A(θ)]

To estimate θ (as we’ll see shortly), we only need φ(D) = ∑_{i=1}^N φ(x_i) and N

Size of φ(D) = ∑_{i=1}^N φ(x_i) does not grow with N (same as the size of each φ(x_i))

Only exponential family distributions have finite-sized sufficient statistics

No need to store all the data; can simply store and recursively update the sufficient statistics with
more and more data

Very useful when doing probabilistic/Bayesian inference with large-scale data sets. Also useful in
online parameter estimation problems.

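A sketch of the “store and recursively update the sufficient statistics” idea for a univariate Gaussian (NumPy assumed; the class and method names are illustrative): only the running totals ∑ x_i, ∑ x_i² and the count N are kept, yet they suffice to produce parameter estimates at any time.

    import numpy as np

    class GaussianSuffStats:
        """Running sufficient statistics phi(D) = (sum x_i, sum x_i^2) and count N."""
        def __init__(self):
            self.N = 0
            self.phi = np.zeros(2)

        def update(self, x):              # absorb one observation, then discard it
            self.N += 1
            self.phi += np.array([x, x**2])

        def mle(self):                    # recover (mu, sigma^2) from the running stats
            mu = self.phi[0] / self.N
            return mu, self.phi[1] / self.N - mu**2

    stats = GaussianSuffStats()
    for x in np.random.default_rng(0).normal(2.0, 1.5, size=10_000):
        stats.update(x)
    print(stats.mle())   # ~(2.0, 2.25), without ever storing the dataset
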
MLE for Exponential Family Distributions

The likelihood is of the form p(D|θ) = [∏_{i=1}^N h(x_i)] exp[θᵀφ(D) − N A(θ)]

The log-likelihood is (ignoring constants w.r.t. θ): log p(D|θ) = θᵀφ(D) − N A(θ)

Note: This is concave in θ (since −A(θ) is concave). Maximization will yield a global maximum of θ

MLE for exp-fam distributions can also be seen as doing moment-matching. To see this, note that

    ∇_θ [θᵀφ(D) − N A(θ)] = φ(D) − N ∇_θ A(θ) = φ(D) − N E_{p(x|θ)}[φ(x)] = ∑_{i=1}^N φ(x_i) − N E_{p(x|θ)}[φ(x)]

Therefore, at the “optimal” (i.e., MLE) θ̂, where the derivative is 0, the following must hold

    E_{p(x|θ̂)}[φ(x)] = (1/N) ∑_{i=1}^N φ(x_i)

This is basically matching the expected moments of the distribution with the empirical moments
(“empirical” here means what we compute using the observed data)

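Here is a small gradient-ascent sketch of this view (NumPy assumed; the step size and iteration count are arbitrary choices, not from the slides) for i.i.d. Binomial observations: the gradient is φ(D) − N·E_{p(x|θ)}[φ(x)], and at convergence the model mean n_trials·sigmoid(θ) matches the empirical mean, i.e., p̂ equals the empirical success rate.

    import numpy as np

    rng = np.random.default_rng(1)
    n_trials, p_true = 20, 0.35
    data = rng.binomial(n_trials, p_true, size=500)        # i.i.d. Binomial observations

    phi_D, n_obs = data.sum(), len(data)                   # phi(D) and the number of observations
    sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

    theta = 0.0                                            # natural parameter, arbitrary start
    for _ in range(200):
        grad = phi_D - n_obs * n_trials * sigmoid(theta)   # phi(D) - n_obs * E[phi(x)]
        theta += 1e-4 * grad                               # plain gradient ascent

    print(sigmoid(theta), data.mean() / n_trials)          # ~equal: moment matching at the optimum
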
Moment Matching: An Example

Given N observations x_1, . . . , x_N from a univariate Gaussian N(x|µ, σ²), doing moment-matching

    E[φ(x)] = (1/N) ∑_{i=1}^N φ(x_i)

The “true”, i.e., expected moments: E[φ(x)] = E[(x, x²)]. Therefore

    E[x]  = (1/N) ∑_{i=1}^N x_i
    E[x²] = (1/N) ∑_{i=1}^N x_i²

For a univariate Gaussian, note that E[x] = µ and E[x²] = var[x] + E[x]² = σ² + µ²

Thus we have two equations and two unknowns

From the first equation, we immediately get µ = (1/N) ∑_{i=1}^N x_i

From the second equation, we get σ² = E[x²] − µ² = (1/N) ∑_{i=1}^N x_i² − µ² = (1/N) ∑_{i=1}^N (x_i − µ)²

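The two moment-matching equations in code (a short illustrative sketch, NumPy assumed): matching E[x] and E[x²] to their empirical counterparts gives exactly the usual Gaussian MLE.

    import numpy as np

    x = np.random.default_rng(2).normal(-1.0, 2.0, size=5_000)

    m1, m2 = x.mean(), (x**2).mean()       # empirical first and second moments
    mu_hat, var_hat = m1, m2 - m1**2       # solve E[x] = m1 and E[x^2] = sigma^2 + mu^2 = m2

    print(mu_hat, var_hat)                 # ~(-1.0, 4.0)
    print(np.isclose(var_hat, x.var()))    # True: identical to (1/N) sum (x_i - mu)^2
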
Bayesian Inference for Exponential Family Distributions

We saw that the total likelihood given N i.i.d. observations D = {x_1, . . . , x_N} is

    p(D|θ) ∝ exp[θᵀφ(D) − N A(θ)]    where φ(D) = ∑_{i=1}^N φ(x_i)

Let’s choose the following prior (note: within the exponent, it looks similar to the likelihood in terms of θ)

    p(θ|ν_0, τ_0) = h(θ) exp[θᵀτ_0 − ν_0 A(θ) − A_c(ν_0, τ_0)]

Ignoring the prior’s log-partition function A_c(ν_0, τ_0) = log ∫ h(θ) exp[θᵀτ_0 − ν_0 A(θ)] dθ,

    p(θ|ν_0, τ_0) ∝ h(θ) exp[θᵀτ_0 − ν_0 A(θ)]

Comparing the prior’s form with the likelihood, we notice that

  ν_0 is like the number of “pseudo-observations” coming from the prior
  τ_0 is the total sufficient statistics of these ν_0 pseudo-observations

The Posterior Distribution

As we saw, the likelihood is

    p(D|θ) ∝ exp[θᵀφ(D) − N A(θ)]    where φ(D) = ∑_{i=1}^N φ(x_i)

And the prior we chose is

    p(θ|ν_0, τ_0) ∝ h(θ) exp[θᵀτ_0 − ν_0 A(θ)]

For this form of the prior, the posterior p(θ|D) ∝ p(θ)p(D|θ) will be

    p(θ|D) ∝ h(θ) exp[θᵀ(τ_0 + φ(D)) − (ν_0 + N) A(θ)]

Note that the posterior has the same form as the prior; such a prior is called a conjugate prior
(note: all exponential family distributions have a conjugate prior of the form shown above)

Thus the posterior hyperparams ν_0′, τ_0′ are obtained by simply adding “stuff” to the prior’s hyperparams

    ν_0′ ← ν_0 + N        (no. of pseudo-obs + no. of actual obs)
    τ_0′ ← τ_0 + φ(D)     (total suff-stats from pseudo-obs + total suff-stats from actual obs)

Note: The prior’s log-partition function A_c(ν_0, τ_0) updates to the posterior’s A_c(ν_0 + N, τ_0 + φ(D))

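A minimal sketch of these conjugate updates (NumPy assumed; the Bernoulli is used because its sufficient statistic φ(x) = x makes φ(D) just the number of 1s; function and variable names are mine):

    import numpy as np

    def conjugate_update(nu0, tau0, phi_D, N):
        # posterior hyperparameters: add the observation count and the total suff-stats
        return nu0 + N, tau0 + phi_D

    data = np.random.default_rng(3).binomial(1, 0.7, size=100)   # Bernoulli observations
    nu0, tau0 = 2.0, 1.0                      # pseudo-observation count and their suff-stats
    nu_post, tau_post = conjugate_update(nu0, tau0, data.sum(), len(data))

    print(nu_post, tau_post)   # 102.0 total (pseudo + real) observations, tau_post of them ones

For the Bernoulli these are exactly the count-style updates familiar from the Beta-Bernoulli example referred to later in the deck.
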
The Posterior Distribution

Assuming the prior p(θ|ν_0, τ_0) ∝ h(θ) exp[θᵀτ_0 − ν_0 A(θ)], the posterior was

    p(θ|D) ∝ h(θ) exp[θᵀ(τ_0 + φ(D)) − (ν_0 + N) A(θ)]

Assuming τ_0 = ν_0 τ̄_0, we can also write the prior as p(θ|ν_0, τ̄_0) ∝ exp[θᵀ ν_0 τ̄_0 − ν_0 A(θ)]

Can think of τ̄_0 = τ_0/ν_0 as the average sufficient statistics per pseudo-observation

The posterior can be written as

    p(θ|D) ∝ h(θ) exp[ θᵀ (ν_0 + N) · (ν_0 τ̄_0 + φ(D))/(ν_0 + N) − (ν_0 + N) A(θ) ]

Denoting φ̄ = φ(D)/N as the average suff-stats per real observation, the posterior updates are

    ν_0′ ← ν_0 + N
    τ̄_0′ ← (ν_0 τ̄_0 + N φ̄)/(ν_0 + N)

Note that the posterior hyperparam τ̄_0′ is a convex combination of the average suff-stats τ̄_0 of the
ν_0 pseudo-observations and the average suff-stats φ̄ of the N actual observations

Posterior Predictive Distribution

Assume some past (training) data D = {x 1 , . . . , x N } generated from an exp. family distribution

Assme some test data D0 = {x̃ 1 , . . . , x̃ N 0 } from the same distribution (N 0 ≥ 1)


The posterior predictive distribution of D0 (probability distribution of new data given old data)
Z
p(D0 |D) = p(D0 |θ)p(θ|D)dθ

We’ve already seen some specific examples of computing the posterior predictive dist., e.g.,
Beta-Bernoulli case: Posterior predictive distribution of next coin toss
Bayesian linear regression: Posterior predictive distribution of the response y∗ of test input x ∗

Nice Property: If the likelihood is an exponential family distribution and the prior is conjugate (and thus so is the
posterior), the posterior predictive always has a closed-form expression (shown next)

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 14
Posterior Predictive Distribution

Recall the form of the likelihood p(D|θ) for exp. family dist.
"N #
Y h i
p(D|θ) = h(x i ) exp θ> φ(D) − NA(θ)
i=1

The conjugate prior was


p(θ|ν0 , τ0 ) = h(θ) exp[θ> τ0 − ν0 A(θ) − Ac (ν0 , τ0 )]

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 15
Posterior Predictive Distribution

Recall the form of the likelihood p(D|θ) for exp. family dist.
"N #
Y h i
p(D|θ) = h(x i ) exp θ> φ(D) − NA(θ)
i=1

The conjugate prior was


p(θ|ν0 , τ0 ) = h(θ) exp[θ> τ0 − ν0 A(θ) − Ac (ν0 , τ0 )]

For this choice of the conjugate prior, the posterior was shown to be
p(θ|D) = h(θ) exp[θ> (τ0 + φ(D)) − (ν0 + N) A(θ) − Ac (ν0 + N, τ0 + φ(D))]

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 15
Posterior Predictive Distribution

Recall the form of the likelihood p(D|θ) for exp. family dist.
"N #
Y h i
p(D|θ) = h(x i ) exp θ> φ(D) − NA(θ)
i=1

The conjugate prior was


p(θ|ν0 , τ0 ) = h(θ) exp[θ> τ0 − ν0 A(θ) − Ac (ν0 , τ0 )]

For this choice of the conjugate prior, the posterior was shown to be
p(θ|D) = h(θ) exp[θ> (τ0 + φ(D)) − (ν0 + N) A(θ) − Ac (ν0 + N, τ0 + φ(D))]

For the test data D′, the likelihood will be
p(D′|θ) = [∏_{i=1}^{N′} h(x̃ i )] exp[θ> φ(D′) − N′ A(θ)]   where φ(D′) = Σ_{i=1}^{N′} φ(x̃ i )
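
As a tiny illustration of the posterior update above (an assumed example, not from the slides; the hyperparameter values are arbitrary), here it is in code for the Bernoulli case, where φ(x) = x, so φ(D) is just the number of 1s and the posterior hyperparameters are "prior hyperparameters plus sufficient statistics":

import numpy as np

# Conjugate update for an exp-family likelihood: (nu0, tau0) -> (nu0 + N, tau0 + phi(D))
def conjugate_update(nu0, tau0, D):
    return nu0 + len(D), tau0 + np.sum(D)   # phi(D) = sum_i phi(x_i), with phi(x) = x for Bernoulli

D = np.array([1, 0, 1, 1, 0])               # five Bernoulli observations
nu_N, tau_N = conjugate_update(nu0=2.0, tau0=1.0, D=D)
print(nu_N, tau_N)                          # 7.0 and 4.0, i.e., nu0 + N and tau0 + phi(D)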

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 15
Posterior Predictive Distribution
Therefore the posterior predictive distribution will be
p(D′|D) = ∫ p(D′|θ) p(θ|D) dθ

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 16
Posterior Predictive Distribution
Therefore the posterior predictive distribution will be
p(D′|D) = ∫ p(D′|θ) p(θ|D) dθ
         = [∏_{i=1}^{N′} h(x̃ i )] ∫ exp[θ> φ(D′) − N′ A(θ)] h(θ) exp[θ> (τ0 + φ(D)) − (ν0 + N) A(θ) − Ac (ν0 + N, τ0 + φ(D))] dθ
(both the ∏_{i=1}^{N′} h(x̃ i ) factor and the Ac (ν0 + N, τ0 + φ(D)) term are constants w.r.t. θ)

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 16
Posterior Predictive Distribution
Therefore the posterior predictive distribution will be
p(D′|D) = ∫ p(D′|θ) p(θ|D) dθ
         = [∏_{i=1}^{N′} h(x̃ i )] ∫ exp[θ> φ(D′) − N′ A(θ)] h(θ) exp[θ> (τ0 + φ(D)) − (ν0 + N) A(θ) − Ac (ν0 + N, τ0 + φ(D))] dθ
(both the ∏_{i=1}^{N′} h(x̃ i ) factor and the Ac (ν0 + N, τ0 + φ(D)) term are constants w.r.t. θ)

The above gets simplified further into

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 16
Posterior Predictive Distribution
Therefore the posterior predictive distribution will be
p(D′|D) = ∫ p(D′|θ) p(θ|D) dθ
         = [∏_{i=1}^{N′} h(x̃ i )] ∫ exp[θ> φ(D′) − N′ A(θ)] h(θ) exp[θ> (τ0 + φ(D)) − (ν0 + N) A(θ) − Ac (ν0 + N, τ0 + φ(D))] dθ
(both the ∏_{i=1}^{N′} h(x̃ i ) factor and the Ac (ν0 + N, τ0 + φ(D)) term are constants w.r.t. θ)

The above gets simplified further into


 0
R
N
h(θ) exp θ> (τ0 + φ(D) + φ(D0 )) − (ν0 + N + N 0 )A(θ) dθ
 
0
Y
p(D |D) =  h(x̃ i )
i=1
exp [Ac (ν0 + N, τ 0 + φ(D))]

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 16
Posterior Predictive Distribution
Therefore the posterior predictive distribution will be
p(D′|D) = ∫ p(D′|θ) p(θ|D) dθ
         = [∏_{i=1}^{N′} h(x̃ i )] ∫ exp[θ> φ(D′) − N′ A(θ)] h(θ) exp[θ> (τ0 + φ(D)) − (ν0 + N) A(θ) − Ac (ν0 + N, τ0 + φ(D))] dθ
(both the ∏_{i=1}^{N′} h(x̃ i ) factor and the Ac (ν0 + N, τ0 + φ(D)) term are constants w.r.t. θ)

The above gets simplified further into


 0
R
N
h(θ) exp θ> (τ0 + φ(D) + φ(D0 )) − (ν0 + N + N 0 )A(θ) dθ
 
0
Y
p(D |D) =  h(x̃ i )
i=1
exp [Ac (ν0 + N, τ 0 + φ(D))]
 0 
N 0 0
 h(x̃ i ) Zc (ν0 + N + N , τ 0 + φ(D) + φ(D ))
Y
=
i=1
exp [Ac (ν0 + N, τ 0 + φ(D))]

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 16
Posterior Predictive Distribution
Therefore the posterior predictive distribution will be
p(D′|D) = ∫ p(D′|θ) p(θ|D) dθ
         = [∏_{i=1}^{N′} h(x̃ i )] ∫ exp[θ> φ(D′) − N′ A(θ)] h(θ) exp[θ> (τ0 + φ(D)) − (ν0 + N) A(θ) − Ac (ν0 + N, τ0 + φ(D))] dθ
(both the ∏_{i=1}^{N′} h(x̃ i ) factor and the Ac (ν0 + N, τ0 + φ(D)) term are constants w.r.t. θ)

The above gets simplified further into


 0
R
N
h(θ) exp θ> (τ0 + φ(D) + φ(D0 )) − (ν0 + N + N 0 )A(θ) dθ
 
0
Y
p(D |D) =  h(x̃ i )
i=1
exp [Ac (ν0 + N, τ 0 + φ(D))]
 0 
N 0 0
 h(x̃ i ) Zc (ν0 + N + N , τ 0 + φ(D) + φ(D ))
Y
=
i=1
exp [Ac (ν0 + N, τ 0 + φ(D))]
h i
where Zc (ν0 + N + N 0 , τ 0 + φ(D) + φ(D 0 )) = h(θ) exp θ > (τ0 + φ(D) + φ(D 0 )) − (ν0 + N + N 0 )A(θ) dθ
R

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 16
Posterior Predictive Distribution
Since Ac = log Zc or Zc = exp(Ac ), we can write the posterior predictive distribution as

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 17
Posterior Predictive Distribution
Since Ac = log Zc or Zc = exp(Ac ), we can write the posterior predictive distribution as
p(D′|D) = [∏_{i=1}^{N′} h(x̃ i )] Zc (ν0 + N + N′, τ0 + φ(D) + φ(D′)) / Zc (ν0 + N, τ0 + φ(D))

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 17
Posterior Predictive Distribution
Since Ac = log Zc or Zc = exp(Ac ), we can write the posterior predictive distribution as
p(D′|D) = [∏_{i=1}^{N′} h(x̃ i )] Zc (ν0 + N + N′, τ0 + φ(D) + φ(D′)) / Zc (ν0 + N, τ0 + φ(D))
         = [∏_{i=1}^{N′} h(x̃ i )] exp[Ac (ν0 + N + N′, τ0 + φ(D) + φ(D′)) − Ac (ν0 + N, τ0 + φ(D))]

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 17
Posterior Predictive Distribution
Since Ac = log Zc or Zc = exp(Ac ), we can write the posterior predictive distribution as
p(D′|D) = [∏_{i=1}^{N′} h(x̃ i )] Zc (ν0 + N + N′, τ0 + φ(D) + φ(D′)) / Zc (ν0 + N, τ0 + φ(D))
         = [∏_{i=1}^{N′} h(x̃ i )] exp[Ac (ν0 + N + N′, τ0 + φ(D) + φ(D′)) − Ac (ν0 + N, τ0 + φ(D))]

Therefore the posterior predictive is proportional to ..

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 17
Posterior Predictive Distribution
Since Ac = log Zc or Zc = exp(Ac ), we can write the posterior predictive distribution as
p(D′|D) = [∏_{i=1}^{N′} h(x̃ i )] Zc (ν0 + N + N′, τ0 + φ(D) + φ(D′)) / Zc (ν0 + N, τ0 + φ(D))
         = [∏_{i=1}^{N′} h(x̃ i )] exp[Ac (ν0 + N + N′, τ0 + φ(D) + φ(D′)) − Ac (ν0 + N, τ0 + φ(D))]

Therefore the posterior predictive is proportional to ..


.. the ratio of two partition functions of two “posterior distributions” (one with N + N′ examples and
the other with N examples)

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 17
Posterior Predictive Distribution
Since Ac = log Zc or Zc = exp(Ac ), we can write the posterior predictive distribution as
p(D′|D) = [∏_{i=1}^{N′} h(x̃ i )] Zc (ν0 + N + N′, τ0 + φ(D) + φ(D′)) / Zc (ν0 + N, τ0 + φ(D))
         = [∏_{i=1}^{N′} h(x̃ i )] exp[Ac (ν0 + N + N′, τ0 + φ(D) + φ(D′)) − Ac (ν0 + N, τ0 + φ(D))]

Therefore the posterior predictive is proportional to ..


.. the ratio of two partition functions of two “posterior distributions” (one with N + N′ examples and
the other with N examples)
.. or exponential of the difference of the corresponding log-partition functions

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 17
Posterior Predictive Distribution
Since Ac = log Zc or Zc = exp(Ac ), we can write the posterior predictive distribution as
p(D′|D) = [∏_{i=1}^{N′} h(x̃ i )] Zc (ν0 + N + N′, τ0 + φ(D) + φ(D′)) / Zc (ν0 + N, τ0 + φ(D))
         = [∏_{i=1}^{N′} h(x̃ i )] exp[Ac (ν0 + N + N′, τ0 + φ(D) + φ(D′)) − Ac (ν0 + N, τ0 + φ(D))]

Therefore the posterior predictive is proportional to ..


.. the ratio of two partition functions of two “posterior distributions” (one with N + N′ examples and
the other with N examples)
.. or exponential of the difference of the corresponding log-partition functions

Note that the form of Zc (and Ac ) will simply depend on the chosen conjugate prior

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 17
Posterior Predictive Distribution
Since Ac = log Zc or Zc = exp(Ac ), we can write the posterior predictive distribution as
p(D′|D) = [∏_{i=1}^{N′} h(x̃ i )] Zc (ν0 + N + N′, τ0 + φ(D) + φ(D′)) / Zc (ν0 + N, τ0 + φ(D))
         = [∏_{i=1}^{N′} h(x̃ i )] exp[Ac (ν0 + N + N′, τ0 + φ(D) + φ(D′)) − Ac (ν0 + N, τ0 + φ(D))]

Therefore the posterior predictive is proportional to ..


.. the ratio of two partition functions of two “posterior distributions” (one with N + N′ examples and
the other with N examples)
.. or exponential of the difference of the corresponding log-partition functions

Note that the form of Zc (and Ac ) will simply depend on the chosen conjugate prior
Very useful result. Also holds for N = 0

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 17
Posterior Predictive Distribution
Since Ac = log Zc or Zc = exp(Ac ), we can write the posterior predictive distribution as
p(D′|D) = [∏_{i=1}^{N′} h(x̃ i )] Zc (ν0 + N + N′, τ0 + φ(D) + φ(D′)) / Zc (ν0 + N, τ0 + φ(D))
         = [∏_{i=1}^{N′} h(x̃ i )] exp[Ac (ν0 + N + N′, τ0 + φ(D) + φ(D′)) − Ac (ν0 + N, τ0 + φ(D))]

Therefore the posterior predictive is proportional to ..


.. the ratio of two partition functions of two “posterior distributions” (one with N + N′ examples and
the other with N examples)
.. or exponential of the difference of the corresponding log-partition functions

Note that the form of Zc (and Ac ) will simply depend on the chosen conjugate prior
Very useful result. Also holds for N = 0
In the N = 0 case, p(D′) = ∫ p(D′|θ) p(θ) dθ is simply the marginal likelihood of D′
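
As a concrete sanity check of this closed-form result (an illustrative sketch, not from the slides; the prior values a0, b0 and the data are arbitrary), here is the Beta-Bernoulli case in code: the conjugate prior's partition function Zc is the Beta function, so the posterior predictive of new tosses D′ given old tosses D is a ratio of two Beta functions.

import numpy as np
from scipy.special import betaln   # log Beta function, i.e., log Zc for the Beta prior

def log_post_predictive(D, D_new, a0=1.0, b0=1.0):
    n1, n0 = np.sum(D), len(D) - np.sum(D)               # sufficient statistics of D
    m1, m0 = np.sum(D_new), len(D_new) - np.sum(D_new)   # sufficient statistics of D′
    # log Zc(posterior given D and D′) − log Zc(posterior given D only)
    return betaln(a0 + n1 + m1, b0 + n0 + m0) - betaln(a0 + n1, b0 + n0)

D     = np.array([1, 1, 0, 1, 0, 1])   # past coin tosses
D_new = np.array([1, 0])               # future tosses whose joint probability we want
print(np.exp(log_post_predictive(D, D_new)))

Setting D to an empty array recovers the N = 0 case, i.e., the marginal likelihood of D_new.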

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 17
Exponential Family and GLM

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 18
Generalized Linear Models (GLM)
(Probabilistic) Linear regression: when response y is real-valued
p(y |x, w ) = N (w > x, β −1 )

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 19
Generalized Linear Models (GLM)
(Probabilistic) Linear regression: when response y is real-valued
p(y |x, w ) = N (w > x, β −1 )

Logistic regression: when response y is binary (0/1)

p(y |x, w ) = Bernoulli(σ(w > x)) = [σ(w > x)]^y [1 − σ(w > x)]^{1−y}
where σ(w > x) = 1/(1 + exp(−w > x)) = exp(w > x)/(1 + exp(w > x))

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 19
Generalized Linear Models (GLM)
(Probabilistic) Linear regression: when response y is real-valued
p(y |x, w ) = N (w > x, β −1 )

Logistic regression: when response y is binary (0/1)

p(y |x, w ) = Bernoulli(σ(w > x)) = [σ(w > x)]^y [1 − σ(w > x)]^{1−y}
where σ(w > x) = 1/(1 + exp(−w > x)) = exp(w > x)/(1 + exp(w > x))

In both, the model depends on the inputs x as w > x

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 19
Generalized Linear Models (GLM)
(Probabilistic) Linear regression: when response y is real-valued
p(y |x, w ) = N (w > x, β −1 )

Logistic regression: when response y is binary (0/1)

p(y |x, w ) = Bernoulli(σ(w > x)) = [σ(w > x)]^y [1 − σ(w > x)]^{1−y}
where σ(w > x) = 1/(1 + exp(−w > x)) = exp(w > x)/(1 + exp(w > x))

In both, the model depends on the inputs x as w > x


Can we extend it to other types of outputs?

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 19
Generalized Linear Models (GLM)
(Probabilistic) Linear regression: when response y is real-valued
p(y |x, w ) = N (w > x, β −1 )

Logistic regression: when response y is binary (0/1)

p(y |x, w ) = Bernoulli(σ(w > x)) = [σ(w > x)]^y [1 − σ(w > x)]^{1−y}
where σ(w > x) = 1/(1 + exp(−w > x)) = exp(w > x)/(1 + exp(w > x))

In both, the model depends on the inputs x as w > x


Can we extend it to other types of outputs?
Solution: Model the output using an exp-fam distribution (Gaussian and Bernoulli already are!)

p(y |η) = h(y ) exp(ηy − A(η)) (Generalized Linear Model (GLM))

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 19
Generalized Linear Models (GLM)
(Probabilistic) Linear regression: when response y is real-valued
p(y |x, w ) = N (w > x, β −1 )

Logistic regression: when response y is binary (0/1)

p(y |x, w ) = Bernoulli(σ(w > x)) = [σ(w > x)]^y [1 − σ(w > x)]^{1−y}
where σ(w > x) = 1/(1 + exp(−w > x)) = exp(w > x)/(1 + exp(w > x))

In both, the model depends on the inputs x as w > x


Can we extend it to other types of outputs?
Solution: Model the output using an exp-fam distribution (Gaussian and Bernoulli already are!)

p(y |η) = h(y ) exp(ηy − A(η)) (Generalized Linear Model (GLM))

.. where η is a scalar-valued natural parameter (depends on x) and sufficient statistics φ(y ) = y
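
As a small sanity check (an illustrative example, not part of the slides), the Bernoulli likelihood used in logistic regression can be verified numerically to have exactly this GLM form, with h(y) = 1, η = w > x and log-partition A(η) = log(1 + exp(η)):

import numpy as np

def bernoulli_glm(y, eta):
    # h(y) exp(eta*y − A(eta)), with h(y) = 1 and A(eta) = log(1 + exp(eta))
    return np.exp(eta * y - np.log1p(np.exp(eta)))

def bernoulli_sigmoid(y, eta):
    mu = 1.0 / (1.0 + np.exp(-eta))          # mean via the sigmoid response function
    return mu**y * (1.0 - mu)**(1 - y)

eta = 0.7                                    # e.g., eta = w^T x for some input x
for y in (0, 1):
    assert np.isclose(bernoulli_glm(y, eta), bernoulli_sigmoid(y, eta))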


Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 19
Generalized Linear Models: Formally

The GLM is of the form p(y |η) = h(y ) exp(ηy − A(η)) where η depends on x

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 20
Generalized Linear Models: Formally

The GLM is of the form p(y |η) = h(y ) exp(ηy − A(η)) where η depends on x

The inputs x appear in the model only as a linear combination ξ = w > x

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 20
Generalized Linear Models: Formally

The GLM is of the form p(y |η) = h(y ) exp(ηy − A(η)) where η depends on x

The inputs x appear in the model only as a linear combination ξ = w > x

Conditional mean µ of a response y ’s distribution is modeled via a response function f

µ = E[y ] = f (ξ) = f (w > x)

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 20
Generalized Linear Models: Formally

The GLM is of the form p(y |η) = h(y ) exp(ηy − A(η)) where η depends on x

The inputs x appear in the model only as a linear combination ξ = w > x

Conditional mean µ of a response y ’s distribution is modeled via a response function f

µ = E[y ] = f (ξ) = f (w > x)

- for (probabilistic) linear regression, f is identity, i.e., µ = f (w > x) = w > x,

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 20
Generalized Linear Models: Formally

The GLM is of the form p(y |η) = h(y ) exp(ηy − A(η)) where η depends on x

The inputs x appear in the model only as a linear combination ξ = w > x

Conditional mean µ of a response y ’s distribution is modeled via a response function f

µ = E[y ] = f (ξ) = f (w > x)

- for (probabilistic) linear regression, f is identity, i.e., µ = f (w > x) = w > x,


- for logistic regression, f is sigmoid, i.e., µ = f (w > x) = exp(w > x)/(1 + exp(w > x))

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 20
Generalized Linear Models: Formally

The GLM is of the form p(y |η) = h(y ) exp(ηy − A(η)) where η depends on x

The inputs x appear in the model only as a linear combination ξ = w > x

Conditional mean µ of a response y ’s distribution is modeled via a response function f

µ = E[y ] = f (ξ) = f (w > x)

- for (probabilistic) linear regression, f is identity, i.e., µ = f (w > x) = w > x,


- for logistic regression, f is sigmoid, i.e., µ = f (w > x) = exp(w > x)/(1 + exp(w > x))
The natural parameter η = ψ(µ) where ψ is the link function

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 20
Generalized Linear Models: Formally

The GLM is of the form p(y |η) = h(y ) exp(ηy − A(η)) where η depends on x

The inputs x appear in the model only as a linear combination ξ = w > x

Conditional mean µ of a response y ’s distribution is modeled via a response function f

µ = E[y ] = f (ξ) = f (w > x)

- for (probabilistic) linear regression, f is identity, i.e., µ = f (w > x) = w > x,


- for logistic regression, f is sigmoid, i.e., µ = f (w > x) = exp(w > x)/(1 + exp(w > x))
The natural parameter η = ψ(µ) where ψ is the link function

Note: Some GLMs can be represented as p(y |η, φ) = h(y , φ) exp[(ηy − A(η))/φ] where φ is a dispersion
parameter (Gaussian/gamma GLMs use this representation)
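
For instance (an illustrative check, not from the slides), the Gaussian fits this dispersion form with η = µ, A(η) = η²/2, dispersion φ = σ², and h(y, φ) = exp(−y²/(2φ))/√(2πφ):

import numpy as np
from scipy.stats import norm

y, mu, sigma2 = 1.3, 0.4, 2.0                                     # arbitrary values
h = np.exp(-y**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)    # h(y, phi)
p = h * np.exp((mu * y - mu**2 / 2) / sigma2)                     # h(y, phi) exp((eta*y − A(eta)) / phi)
assert np.isclose(p, norm.pdf(y, loc=mu, scale=np.sqrt(sigma2)))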

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 20
GLM with Canonical Response Function

A GLM has a canonical response function f if f = ψ −1

For such a GLM, η = ψ(µ) = ψ(f (w > x)) = w > x

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 21
GLM with Canonical Response Function

A GLM has a canonical response function f if f = ψ −1

For such a GLM, η = ψ(µ) = ψ(f (w > x)) = w > x


E.g., for logistic regression η = log[µ/(1 − µ)] = w > x

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 21
GLM with Canonical Response Function

A GLM has a canonical response function f if f = ψ −1

For such a GLM, η = ψ(µ) = ψ(f (w > x)) = w > x


E.g., for logistic regression η = log[µ/(1 − µ)] = w > x

Thus, for Canonical GLMs

p(y |η) = h(y ) exp(ηy − A(η))


= h(y ) exp(y w > x − A(η))

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 21
GLM with Canonical Response Function

A GLM has a canonical response function f if f = ψ −1

For such a GLM, η = ψ(µ) = ψ(f (w > x)) = w > x


E.g., for logistic regression η = log[µ/(1 − µ)] = w > x

Thus, for Canonical GLMs

p(y |η) = h(y ) exp(ηy − A(η))


= h(y ) exp(y w > x − A(η))

This form makes parameter estimation in canonical GLM easy (e.g., gradients easy to compute)

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 21
GLM with Canonical Response Function

A GLM has a canonical response function f if f = ψ −1

For such a GLM, η = ψ(µ) = ψ(f (w > x)) = w > x


E.g., for logistic regression η = log[µ/(1 − µ)] = w > x

Thus, for Canonical GLMs

p(y |η) = h(y ) exp(ηy − A(η))


= h(y ) exp(y w > x − A(η))

This form makes parameter estimation in canonical GLM easy (e.g., gradients easy to compute)
We will focus on canonical GLMs only (these are the most common)
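
A quick numerical check (illustrative, not from the slides) that the sigmoid response of logistic regression is the inverse of its logit link, so that η = ψ(f (w > x)) = w > x:

import numpy as np

logit   = lambda mu:  np.log(mu / (1 - mu))     # link function psi
sigmoid = lambda eta: 1 / (1 + np.exp(-eta))    # canonical response function f

eta = np.linspace(-3, 3, 7)                     # pretend these are values of w^T x
assert np.allclose(logit(sigmoid(eta)), eta)    # psi(f(eta)) recovers eta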

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 21
MLE for GLM
Log likelihood
L(η) = log p(Y |η) = log ∏_{n=1}^{N} h(yn ) exp(yn w > x n − A(ηn ))

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 22
MLE for GLM
Log likelihood
L(η) = log p(Y |η) = log ∏_{n=1}^{N} h(yn ) exp(yn w > x n − A(ηn )) = Σ_{n=1}^{N} log h(yn ) + w > Σ_{n=1}^{N} yn x n − Σ_{n=1}^{N} A(ηn )

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 22
MLE for GLM
Log likelihood
L(η) = log p(Y |η) = log ∏_{n=1}^{N} h(yn ) exp(yn w > x n − A(ηn )) = Σ_{n=1}^{N} log h(yn ) + w > Σ_{n=1}^{N} yn x n − Σ_{n=1}^{N} A(ηn )

Convexity of A(η) guarantees a global optimum. Taking the derivative w.r.t. w
Σ_{n=1}^{N} ( yn x n − A′(ηn ) dηn /dw )

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 22
MLE for GLM
Log likelihood
L(η) = log p(Y |η) = log ∏_{n=1}^{N} h(yn ) exp(yn w > x n − A(ηn )) = Σ_{n=1}^{N} log h(yn ) + w > Σ_{n=1}^{N} yn x n − Σ_{n=1}^{N} A(ηn )

Convexity of A(η) guarantees a global optimum. Taking the derivative w.r.t. w
Σ_{n=1}^{N} ( yn x n − A′(ηn ) dηn /dw ) = Σ_{n=1}^{N} ( yn x n − µn x n )

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 22
MLE for GLM
Log likelihood
L(η) = log p(Y |η) = log ∏_{n=1}^{N} h(yn ) exp(yn w > x n − A(ηn )) = Σ_{n=1}^{N} log h(yn ) + w > Σ_{n=1}^{N} yn x n − Σ_{n=1}^{N} A(ηn )

Convexity of A(η) guarantees a global optimum. Taking the derivative w.r.t. w
Σ_{n=1}^{N} ( yn x n − A′(ηn ) dηn /dw ) = Σ_{n=1}^{N} ( yn x n − µn x n ) = Σ_{n=1}^{N} ( yn − µn ) x n

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 22
MLE for GLM
Log likelihood
L(η) = log p(Y |η) = log ∏_{n=1}^{N} h(yn ) exp(yn w > x n − A(ηn )) = Σ_{n=1}^{N} log h(yn ) + w > Σ_{n=1}^{N} yn x n − Σ_{n=1}^{N} A(ηn )

Convexity of A(η) guarantees a global optimum. Taking the derivative w.r.t. w
Σ_{n=1}^{N} ( yn x n − A′(ηn ) dηn /dw ) = Σ_{n=1}^{N} ( yn x n − µn x n ) = Σ_{n=1}^{N} ( yn − µn ) x n

where µn = f (w > x n ) and ’f ’ (= ψ −1 ) depends on type of response y , e.g.,

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 22
MLE for GLM
Log likelihood
L(η) = log p(Y |η) = log ∏_{n=1}^{N} h(yn ) exp(yn w > x n − A(ηn )) = Σ_{n=1}^{N} log h(yn ) + w > Σ_{n=1}^{N} yn x n − Σ_{n=1}^{N} A(ηn )

Convexity of A(η) guarantees a global optimum. Taking the derivative w.r.t. w
Σ_{n=1}^{N} ( yn x n − A′(ηn ) dηn /dw ) = Σ_{n=1}^{N} ( yn x n − µn x n ) = Σ_{n=1}^{N} ( yn − µn ) x n

where µn = f (w > x n ) and ’f ’ (= ψ −1 ) depends on type of response y , e.g.,


Real-valued y (linear regression): f is identity, i.e., µn = w > x n

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 22
MLE for GLM
Log likelihood
L(η) = log p(Y |η) = log ∏_{n=1}^{N} h(yn ) exp(yn w > x n − A(ηn )) = Σ_{n=1}^{N} log h(yn ) + w > Σ_{n=1}^{N} yn x n − Σ_{n=1}^{N} A(ηn )

Convexity of A(η) guarantees a global optimum. Taking the derivative w.r.t. w
Σ_{n=1}^{N} ( yn x n − A′(ηn ) dηn /dw ) = Σ_{n=1}^{N} ( yn x n − µn x n ) = Σ_{n=1}^{N} ( yn − µn ) x n

where µn = f (w > x n ) and ’f ’ (= ψ −1 ) depends on type of response y , e.g.,


Real-valued y (linear regression): f is identity, i.e., µn = w > x n
Binary y (logistic regression): f is the logistic function, i.e., µn = exp(w > x n )/(1 + exp(w > x n ))

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 22
MLE for GLM
Log likelihood
L(η) = log p(Y |η) = log ∏_{n=1}^{N} h(yn ) exp(yn w > x n − A(ηn )) = Σ_{n=1}^{N} log h(yn ) + w > Σ_{n=1}^{N} yn x n − Σ_{n=1}^{N} A(ηn )

Convexity of A(η) guarantees a global optimum. Taking the derivative w.r.t. w
Σ_{n=1}^{N} ( yn x n − A′(ηn ) dηn /dw ) = Σ_{n=1}^{N} ( yn x n − µn x n ) = Σ_{n=1}^{N} ( yn − µn ) x n

where µn = f (w > x n ) and ’f ’ (= ψ −1 ) depends on type of response y , e.g.,


Real-valued y (linear regression): f is identity, i.e., µn = w > x n
Binary y (logistic regression): f is the logistic function, i.e., µn = exp(w > x n )/(1 + exp(w > x n ))
Count-valued y (Poisson regression): µn = exp(w > x n )

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 22
MLE for GLM
Log likelihood
L(η) = log p(Y |η) = log ∏_{n=1}^{N} h(yn ) exp(yn w > x n − A(ηn )) = Σ_{n=1}^{N} log h(yn ) + w > Σ_{n=1}^{N} yn x n − Σ_{n=1}^{N} A(ηn )

Convexity of A(η) guarantees a global optimum. Taking the derivative w.r.t. w
Σ_{n=1}^{N} ( yn x n − A′(ηn ) dηn /dw ) = Σ_{n=1}^{N} ( yn x n − µn x n ) = Σ_{n=1}^{N} ( yn − µn ) x n

where µn = f (w > x n ) and ’f ’ (= ψ −1 ) depends on type of response y , e.g.,


Real-valued y (linear regression): f is identity, i.e., µn = w > x n
Binary y (logistic regression): f is the logistic function, i.e., µn = exp(w > x n )/(1 + exp(w > x n ))
Count-valued y (Poisson regression): µn = exp(w > x n )
Positive reals y (gamma regression): µn = −1/(w > x n )

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 22
MLE for GLM
Log likelihood
L(η) = log p(Y |η) = log ∏_{n=1}^{N} h(yn ) exp(yn w > x n − A(ηn )) = Σ_{n=1}^{N} log h(yn ) + w > Σ_{n=1}^{N} yn x n − Σ_{n=1}^{N} A(ηn )

Convexity of A(η) guarantees a global optimum. Taking the derivative w.r.t. w
Σ_{n=1}^{N} ( yn x n − A′(ηn ) dηn /dw ) = Σ_{n=1}^{N} ( yn x n − µn x n ) = Σ_{n=1}^{N} ( yn − µn ) x n

where µn = f (w > x n ) and ’f ’ (= ψ −1 ) depends on type of response y , e.g.,


Real-valued y (linear regression): f is identity, i.e., µn = w > x n
Binary y (logistic regression): f is the logistic function, i.e., µn = exp(w > x n )/(1 + exp(w > x n ))
Count-valued y (Poisson regression): µn = exp(w > x n )
Positive reals y (gamma regression): µn = −1/(w > x n )

To estimate w via MLE (or MAP), either set the derivative to zero or use iterative methods (e.g.,
gradient descent, iteratively reweighted least squares, etc.)
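
Below is a minimal sketch of such an iterative method (gradient ascent) for one canonical GLM, Poisson regression, using the gradient Σ ( yn − µn ) x n derived above with µn = exp(w > x n ). The synthetic data, step size and iteration count are arbitrary choices, not from the slides.

import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 3
X = rng.normal(size=(N, d))
w_true = np.array([0.5, -0.3, 0.2])
y = rng.poisson(np.exp(X @ w_true))       # count-valued responses

w, lr = np.zeros(d), 0.1
for _ in range(2000):
    mu = np.exp(X @ w)                    # Poisson response function mu_n = exp(w^T x_n)
    grad = X.T @ (y - mu)                 # sum_n (y_n − mu_n) x_n
    w += lr * grad / N                    # (averaged) gradient ascent step

print(w)                                  # should be close to w_true, up to sampling noise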
Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 22
Bayesian Inference for GLM

If likelihood is conjugate to the prior on w , Bayesian inference can be done in closed form

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 23
Bayesian Inference for GLM

If likelihood is conjugate to the prior on w , Bayesian inference can be done in closed form
Example: Bayesian linear regression with Gaussian likelihood and Gaussian prior on w

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 23
Bayesian Inference for GLM

If likelihood is conjugate to the prior on w , Bayesian inference can be done in closed form
Example: Bayesian linear regression with Gaussian likelihood and Gaussian prior on w

Otherwise, approximate Bayesian inference is needed (e.g., Laplace, MCMC, variational inference, etc.)

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 23
Bayesian Inference for GLM

If likelihood is conjugate to the prior on w , Bayesian inference can be done in closed form
Example: Bayesian linear regression with Gaussian likelihood and Gaussian prior on w

Otherwise, approximate Bayesian inference is needed (e.g., Laplace, MCMC, variational inference, etc.)
Example: Bayesian logistic regression with sigmoid-Bernoulli likelihood and Gaussian prior on w

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 23
Bayesian Inference for GLM

If likelihood is conjugate to the prior on w , Bayesian inference can be done in closed form
Example: Bayesian linear regression with Gaussian likelihood and Gaussian prior on w

Otherwise, approximate Bayesian inference is needed (e.g., Laplace, MCMC, variational inference, etc.)
Example: Bayesian logistic regression with sigmoid-Bernoulli likelihood and Gaussian prior on w

Interesting class project idea: Design simple inference algorithms for non-conjugate GLMs
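
As one concrete sketch of the non-conjugate case (illustrative only; the data, prior precision alpha and number of Newton steps are arbitrary assumptions): the Laplace approximation for Bayesian logistic regression with a Gaussian prior N(0, alpha⁻¹ I) on w.

import numpy as np

rng = np.random.default_rng(1)
N, d, alpha = 100, 2, 1.0
X = rng.normal(size=(N, d))
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([1.0, -2.0])))))   # synthetic labels

sigmoid = lambda z: 1 / (1 + np.exp(-z))

w = np.zeros(d)                                            # find the MAP estimate by Newton's method
for _ in range(25):
    mu = sigmoid(X @ w)
    grad = X.T @ (y - mu) - alpha * w                      # gradient of the log posterior
    H = -(X.T * (mu * (1 - mu))) @ X - alpha * np.eye(d)   # Hessian of the log posterior
    w = w - np.linalg.solve(H, grad)                       # Newton update

Sigma = np.linalg.inv(-H)                                  # Laplace: posterior ≈ N(w_MAP, (−H)⁻¹)
print(w, Sigma)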

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 23
Summary

Exp. family distributions are very useful for modeling diverse types of data/parameters

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 24
Summary

Exp. family distributions are very useful for modeling diverse types of data/parameters
Conjugate priors to exp. family distributions make parameter updates very simple

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 24
Summary

Exp. family distributions are very useful for modeling diverse types of data/parameters
Conjugate priors to exp. family distributions make parameter updates very simple
Other quantities such as posterior predictive can be computed in closed form

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 24
Summary

Exp. family distributions are very useful for modeling diverse types of data/parameters
Conjugate priors to exp. family distributions make parameter updates very simple
Other quantities such as posterior predictive can be computed in closed form

Useful in designing generative classification models.

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 24
Summary

Exp. family distributions are very useful for modeling diverse types of data/parameters
Conjugate priors to exp. family distributions make parameter updates very simple
Other quantities such as posterior predictive can be computed in closed form

Useful in designing generative classification models. Choosing class-conditionals from the exponential
family with conjugate priors helps in parameter estimation

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 24
Summary

Exp. family distributions are very useful for modeling diverse types of data/parameters
Conjugate priors to exp. family distributions make parameter updates very simple
Other quantities such as posterior predictive can be computed in closed form

Useful in designing generative classification models. Choosing class-conditionals from the exponential
family with conjugate priors helps in parameter estimation
Useful in designing generative models for unsupervised learning

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 24
Summary

Exp. family distributions are very useful for modeling diverse types of data/parameters
Conjugate priors to exp. family distributions make parameter updates very simple
Other quantities such as posterior predictive can be computed in closed form

Useful in designing generative classification models. Choosing class-conditionals from the exponential
family with conjugate priors helps in parameter estimation
Useful in designing generative models for unsupervised learning
Also used in designing GLMs

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 24
Summary

Exp. family distributions are very useful for modeling diverse types of data/parameters
Conjugate priors to exp. family distributions make parameter updates very simple
Other quantities such as posterior predictive can be computed in closed form

Useful in designing generative classification models. Choosing class-conditionals from the exponential
family with conjugate priors helps in parameter estimation
Useful in designing generative models for unsupervised learning
Also used in designing GLMs

We will see several use cases when we discuss approximate inference algorithms (e.g., Gibbs
sampling, and especially variational inference)

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 24
