Piyush Rai
Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 1
Exponential Family (Pitman, Darmois, Koopman, Late 1930s)
Binomial as Exponential Family
Let’s try to write the Binomial distribution in the exponential family form
p(x|θ) = h(x) exp[θᵀ φ(x) − A(θ)]
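The generic form above can be coded directly as a small numerical sanity check (my sketch, not part of the slides). Here the Bernoulli, the N = 1 special case of the Binomial, is recovered by plugging in the standard choices h(x) = 1, φ(x) = x, θ = log(p/(1 − p)), and A(θ) = log(1 + eᶿ):

```python
import math

def expfam_pdf(x, theta, h, phi, A):
    # Generic exponential-family density: p(x|theta) = h(x) * exp(theta * phi(x) - A(theta))
    return h(x) * math.exp(theta * phi(x) - A(theta))

# Bernoulli(p) as a special case (standard identification, assumed here):
p = 0.3
theta = math.log(p / (1 - p))             # natural parameter
A = lambda t: math.log(1 + math.exp(t))   # log-partition function

for x in (0, 1):
    val = expfam_pdf(x, theta, h=lambda x: 1.0, phi=lambda x: x, A=A)
    print(x, val)   # matches p**x * (1 - p)**(1 - x)
```

Any exponential-family member fits the same function signature; only h, φ, θ, and A change.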
Binomial as Exponential Family
Let’s try to write the Binomial distribution in the exponential family form
p(x|θ) = h(x) exp[θ> φ(x) − A(θ)]
Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 4
(Univariate) Gaussian as Exponential Family
Let’s try to write the Binomial distribution in the exponential family form
Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 4
(Univariate) Gaussian as Exponential Family
Let’s try to write the Binomial distribution in the exponential family form
Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 4
(Univariate) Gaussian as Exponential Family
Let’s try to write the Binomial distribution in the exponential family form
µ
θ
θ= σ2 = 1
− 2σ1 2 θ2
Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 4
(Univariate) Gaussian as Exponential Family
Let’s try to write the Binomial distribution in the exponential family form
µ θ1
σ 2 θ1 µ − 2θ2
θ= = , and 2 =
− 2σ1 2 θ2 σ − 2θ1 2
Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 4
(Univariate) Gaussian as Exponential Family
Let’s try to write the Binomial distribution in the exponential family form
µ θ1
σ 2 θ1 µ − 2θ2
θ= = , and 2 =
− 2σ1 2 θ2 σ − 2θ1 2
x
φ(x) = 2
x
Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 4
(Univariate) Gaussian as Exponential Family
Let’s try to write the Binomial distribution in the exponential family form
µ θ1
σ 2 θ1 µ − 2θ2
θ= = , and 2 =
− 2σ1 2 θ2 σ − 2θ1 2
x
φ(x) = 2
x
µ2
A(θ) = 2σ 2 + log σ
Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 4
(Univariate) Gaussian as Exponential Family
Let’s try to write the Binomial distribution in the exponential family form
µ θ1
σ 2 θ1 µ − 2θ2
θ= = , and 2 =
− 2σ1 2 θ2 σ − 2θ1 2
x
φ(x) = 2
x
µ2 −θ12 1 1
A(θ) = 2σ 2 + log σ = 4θ2 − 2 log(−2θ2 ) − 2 log(2π)
Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 4
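As a quick numerical check (my sketch, not from the slides), the natural-parameter form above reproduces the usual Gaussian density when the constant 1/√(2π) is kept in h(x), and the two expressions for A(θ) agree:

```python
import math

mu, sigma = 1.5, 0.8
theta1 = mu / sigma**2               # natural parameter theta_1
theta2 = -1.0 / (2 * sigma**2)       # natural parameter theta_2
A = mu**2 / (2 * sigma**2) + math.log(sigma)   # log-partition function

def gauss_expfam(x):
    h = 1.0 / math.sqrt(2 * math.pi)           # base measure h(x)
    return h * math.exp(theta1 * x + theta2 * x**2 - A)

def gauss_pdf(x):
    # Ordinary N(x | mu, sigma^2) density for comparison
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

# Same A(theta) written in natural parameters:
A_natural = -theta1**2 / (4 * theta2) - 0.5 * math.log(-2 * theta2)

for x in (-1.0, 0.0, 2.3):
    assert abs(gauss_expfam(x) - gauss_pdf(x)) < 1e-12
assert abs(A - A_natural) < 1e-12
```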
A General Trick
A general trick to represent any distribution (assuming it is an exp-family dist.) in exp-family form:
Write the given distribution as exp(log p(·)) and simplify, e.g., for the Binomial (with C(N, x) denoting the binomial coefficient)
exp(log Binomial(x|N, p)) = exp[log(C(N, x) pˣ (1 − p)^(N−x))]
= exp[log C(N, x) + x log p + (N − x) log(1 − p)]
= C(N, x) exp[x log(p/(1 − p)) + N log(1 − p)]
Now compare the resulting expression with the exponential family form
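Carrying out the comparison gives h(x) = C(N, x), φ(x) = x, θ = log(p/(1 − p)), and A(θ) = −N log(1 − p) = N log(1 + eᶿ); a short script (my sketch) confirms the rewritten form matches the usual pmf for every x:

```python
import math

def binom_pmf(x, N, p):
    # Ordinary Binomial pmf for comparison
    return math.comb(N, x) * p**x * (1 - p)**(N - x)

def binom_expfam(x, N, p):
    # h(x) = C(N, x), phi(x) = x, theta = log(p/(1-p)), A(theta) = N*log(1 + e^theta)
    theta = math.log(p / (1 - p))
    A = N * math.log(1 + math.exp(theta))   # equals -N*log(1-p)
    return math.comb(N, x) * math.exp(theta * x - A)

N, p = 10, 0.35
assert all(abs(binom_pmf(x, N, p) - binom_expfam(x, N, p)) < 1e-12 for x in range(N + 1))
```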
Other Examples
If the support of the distribution depends on its parameters, then it is not an exp. family dist.
Log-Partition Function
dA/dθ = E_{p(x|θ)}[φ(x)] = mean[φ(x)]
d²A/dθ² = E_{p(x|θ)}[φ(x)²] − (E_{p(x|θ)}[φ(x)])² = var[φ(x)]
Note: The above result also holds when θ and φ(x) are vector-valued (the “var” will be “covar”)
Important: A(θ) is a convex function of θ. Why?
Exercise: For the Binomial, using its expression for A(θ), derive the first and second cumulants of φ(x)
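The derivative identities above can be checked numerically (a sketch under the standard Binomial facts E[x] = Np and var[x] = Np(1 − p)): finite-differencing A(θ) = N log(1 + eᶿ) recovers the mean and variance of φ(x) = x:

```python
import math

N, p = 10, 0.35
theta = math.log(p / (1 - p))                 # natural parameter
A = lambda t: N * math.log(1 + math.exp(t))   # Binomial log-partition function

eps = 1e-5
dA  = (A(theta + eps) - A(theta - eps)) / (2 * eps)              # central difference
d2A = (A(theta + eps) - 2 * A(theta) + A(theta - eps)) / eps**2  # second difference

# dA/dtheta = E[phi(x)] = N*p, d2A/dtheta2 = var[phi(x)] = N*p*(1-p)
assert abs(dA - N * p) < 1e-6
assert abs(d2A - N * p * (1 - p)) < 1e-3
```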
MLE for Exponential Family Distributions
Suppose we have data D = {x_1, …, x_N} drawn i.i.d. from an exponential family distribution
To do MLE, we need the overall likelihood. This is simply a product of the individual likelihoods
p(D|θ) = ∏_{i=1}^N p(x_i|θ) = [∏_{i=1}^N h(x_i)] exp[θᵀ ∑_{i=1}^N φ(x_i) − N A(θ)] = [∏_{i=1}^N h(x_i)] exp[θᵀ φ(D) − N A(θ)]
To estimate θ (as we’ll see shortly), we only need φ(D) = ∑_{i=1}^N φ(x_i) and N
The size of φ(D) = ∑_{i=1}^N φ(x_i) does not grow with N (it is the same as the size of each φ(x_i))
Only exponential family distributions have finite-sized sufficient statistics
No need to store all the data; we can simply store and recursively update the sufficient statistics as more data arrive
Very useful when doing probabilistic/Bayesian inference with large-scale data sets. Also useful in online parameter estimation problems.
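The recursive-update idea can be sketched for streaming Bernoulli data (an illustrative example of mine, not from the slides): the data are never stored, only φ(D) = ∑ xᵢ and the count N, yet the MLE is fully recoverable from them.

```python
# Streaming sufficient statistics for a Bernoulli: keep only (phi_D, N).
phi_D, N = 0.0, 0

def update(x):
    """Recursively fold one new observation into the sufficient statistics."""
    global phi_D, N
    phi_D += x          # phi(x) = x for the Bernoulli
    N += 1

stream = [1, 0, 1, 1, 0, 1, 0, 1]
for x in stream:
    update(x)           # the raw data can be discarded after each update

p_mle = phi_D / N       # the MLE depends on the data only through (phi_D, N)
assert abs(p_mle - sum(stream) / len(stream)) < 1e-12
```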
MLE for Exponential Family Distributions
The likelihood is of the form p(D|θ) = [∏_{i=1}^N h(x_i)] exp[θᵀ φ(D) − N A(θ)]
The log-likelihood (ignoring constants w.r.t. θ) is: log p(D|θ) = θᵀ φ(D) − N A(θ)
Note: This is concave in θ (since −A(θ) is concave). Maximization will yield a global maximum of θ
MLE for exp-family distributions can also be seen as doing moment matching. To see this, note that
∇_θ [θᵀ φ(D) − N A(θ)] = φ(D) − N ∇_θ A(θ) = φ(D) − N E_{p(x|θ)}[φ(x)] = ∑_{i=1}^N φ(x_i) − N E_{p(x|θ)}[φ(x)]
Therefore, at the “optimal” (i.e., MLE) θ̂, where the gradient is zero, the following must hold
E_{p(x|θ̂)}[φ(x)] = (1/N) ∑_{i=1}^N φ(x_i)
This is basically matching the expected moments of the distribution with the empirical moments
(“empirical” here means what we compute using the observed data)
Moment Matching: An Example
For a univariate Gaussian, note that E[x] = μ and E[x²] = var[x] + E[x]² = σ² + μ²
Thus we have two equations and two unknowns
From the first equation, we immediately get μ = (1/N) ∑_{i=1}^N x_i
From the second equation, we get σ² = E[x²] − μ² = (1/N) ∑_{i=1}^N x_i² − μ² = (1/N) ∑_{i=1}^N (x_i − μ)²
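The two moment-matching equations above take only a few lines to solve on data (a sketch with made-up numbers), and the two expressions for σ² agree exactly:

```python
# Moment matching for a univariate Gaussian on a toy sample.
xs = [2.1, -0.3, 1.4, 0.8, 3.0, 1.1]
N = len(xs)

m1 = sum(xs) / N                      # empirical first moment  -> mu
m2 = sum(x * x for x in xs) / N       # empirical second moment
mu_hat = m1
var_hat = m2 - m1**2                  # sigma^2 = E[x^2] - mu^2

# Same answer as averaging squared deviations directly:
assert abs(var_hat - sum((x - mu_hat)**2 for x in xs) / N) < 1e-9
```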
Bayesian Inference for Exponential Family Distributions
p(θ|ν0, τ0) ∝ h(θ) exp[θᵀ τ0 − ν0 A(θ)]
The Posterior Distribution
As we saw, the likelihood is
p(D|θ) ∝ exp[θᵀ φ(D) − N A(θ)], where φ(D) = ∑_{i=1}^N φ(x_i)
And the prior we chose is
p(θ|ν0, τ0) ∝ h(θ) exp[θᵀ τ0 − ν0 A(θ)]
For this form of the prior, the posterior p(θ|D) ∝ p(θ) p(D|θ) will be
p(θ|D) ∝ h(θ) exp[θᵀ (τ0 + φ(D)) − (ν0 + N) A(θ)]
Note that the posterior has the same form as the prior; such a prior is called a conjugate prior
(note: all exponential family distributions have a conjugate prior of the form shown above)
Thus the posterior hyperparams ν0′, τ0′ are obtained by simply adding “stuff” to the prior’s hyperparams
ν0′ ← ν0 + N (no. of pseudo-obs + no. of actual obs)
τ0′ ← τ0 + φ(D) (total suff-stats from pseudo-obs + total suff-stats from actual obs)
Note: The prior’s log-partition function A_c(ν0, τ0) updates to the posterior’s A_c(ν0 + N, τ0 + φ(D))
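A concrete instance of these updates (my example, not from the slides) is the Beta-Bernoulli pair; assuming the standard identification that a Beta(a, b) prior on the Bernoulli probability corresponds to ν0 = a + b pseudo-observations with total sufficient statistics τ0 = a, the generic rule reproduces the familiar Beta posterior counts:

```python
# Generic conjugate update, instantiated for Bernoulli data.
# Assumed identification: Beta(a, b) prior  <->  nu0 = a + b, tau0 = a.
def posterior_update(nu0, tau0, data):
    phi_D = sum(data)            # phi(x) = x for the Bernoulli
    N = len(data)
    return nu0 + N, tau0 + phi_D # nu0' = nu0 + N, tau0' = tau0 + phi(D)

a, b = 2.0, 3.0                  # Beta prior hyperparameters
data = [1, 1, 0, 1, 0, 1, 1]     # 5 ones out of 7 observations
nu, tau = posterior_update(a + b, a, data)

# Matches the familiar posterior Beta(a + #ones, b + #zeros):
assert nu == a + b + len(data)
assert tau == a + sum(data)
```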
The Posterior Distribution
Assuming the prior p(θ|ν0 , τ 0 ) ∝ h(θ) exp θ> τ0 − ν0 A(θ) , the posterior was
h i
p(θ|D) ∝ h(θ) exp θ> (τ0 + φ(D)) − (ν0 + N)A(θ)
Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 13
The Posterior Distribution
Assuming the prior p(θ|ν0 , τ 0 ) ∝ h(θ) exp θ> τ0 − ν0 A(θ) , the posterior was
h i
p(θ|D) ∝ h(θ) exp θ> (τ0 + φ(D)) − (ν0 + N)A(θ)
Assuming τ0 = ν0 τ̄0 , we can also write the prior as p(θ|ν0 , τ̄0 ) ∝ exp θ> ν0 τ̄0 − ν0 A(θ)
Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 13
The Posterior Distribution
The Posterior Distribution
Assuming the prior p(θ|ν0, τ0) ∝ h(θ) exp[θ^T τ0 − ν0 A(θ)], the posterior was
p(θ|D) ∝ h(θ) exp[θ^T (τ0 + φ(D)) − (ν0 + N) A(θ)]
Assuming τ0 = ν0 τ̄0, we can also write the prior as p(θ|ν0, τ̄0) ∝ exp[θ^T ν0 τ̄0 − ν0 A(θ)]
Can think of τ̄0 = τ0/ν0 as the average sufficient statistics per pseudo-observation
The posterior can then be written as
p(θ|D) ∝ h(θ) exp[θ^T (ν0 + N) (ν0 τ̄0 + φ(D))/(ν0 + N) − (ν0 + N) A(θ)]
Denoting φ̄ = φ(D)/N as the average sufficient statistics per real observation, the posterior updates are
ν0′ ← ν0 + N
τ̄0′ ← (ν0 τ̄0 + N φ̄)/(ν0 + N)
Note that the posterior hyperparameter τ̄0′ is a convex combination of the average suff-stats τ̄0 of the ν0 pseudo-observations and the average suff-stats φ̄ of the N actual observations
Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 13
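The hyperparameter updates above can be verified numerically. A minimal sketch for the Bernoulli likelihood, where φ(x) = x; the function name and the prior settings (ν0 = 4, τ̄0 = 0.5) are illustrative, not from the slides:

```python
import numpy as np

def conjugate_update(nu0, tau_bar0, data_suff_stats):
    """Posterior hyperparameter updates for an exp-family likelihood with a
    conjugate prior, in the (nu, tau_bar) parameterization from the slide.
    nu0: pseudo-observation count; tau_bar0: average suff-stats per
    pseudo-observation; data_suff_stats: phi(x_n) for the real data."""
    phi = np.asarray(data_suff_stats, dtype=float)
    N = phi.shape[0]
    phi_bar = phi.mean(axis=0)          # average suff-stats per real observation
    nu_post = nu0 + N                   # nu0' <- nu0 + N
    tau_bar_post = (nu0 * tau_bar0 + N * phi_bar) / (nu0 + N)  # convex combination
    return nu_post, tau_bar_post

# Bernoulli example: phi(x) = x, so tau_bar0 is a prior guess of the mean
coin_flips = np.array([1, 1, 0, 1, 1, 0, 1, 1])   # 6 heads out of 8
nu_post, tau_bar_post = conjugate_update(nu0=4.0, tau_bar0=0.5,
                                         data_suff_stats=coin_flips)
print(nu_post)        # 12.0
print(tau_bar_post)   # (4*0.5 + 8*0.75)/12 ≈ 0.6667
```

Note how the posterior mean estimate 0.6667 sits between the prior guess 0.5 and the empirical mean 0.75, weighted by ν0 and N respectively.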
Posterior Predictive Distribution
Assume some past (training) data D = {x_1, ..., x_N} generated from an exp. family distribution
We have already seen some specific examples of computing the posterior predictive distribution, e.g.,
Beta-Bernoulli case: posterior predictive distribution of the next coin toss
Bayesian linear regression: posterior predictive distribution of the response y_* for a test input x_*
Nice property: if the likelihood is an exponential family distribution and the prior is conjugate (and thus so is the posterior), the posterior predictive always has a closed-form expression (shown next)
Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 14
Posterior Predictive Distribution
Recall the form of the likelihood p(D|θ) for an exp. family distribution
p(D|θ) = [∏_{i=1}^N h(x_i)] exp[θ^T φ(D) − N A(θ)]
For this choice of the conjugate prior, the posterior was shown to be
p(θ|D) = h(θ) exp[θ^T (τ0 + φ(D)) − (ν0 + N) A(θ) − A_c(ν0 + N, τ0 + φ(D))]
Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 15
Posterior Predictive Distribution
Therefore the posterior predictive distribution of new data D′ = {x̃_1, ..., x̃_{N′}} will be
p(D′|D) = ∫ p(D′|θ) p(θ|D) dθ
= ∫ [∏_{i=1}^{N′} h(x̃_i)] exp[θ^T φ(D′) − N′ A(θ)] h(θ) exp[θ^T (τ0 + φ(D)) − (ν0 + N) A(θ) − A_c(ν0 + N, τ0 + φ(D))] dθ
Here ∏_{i=1}^{N′} h(x̃_i) and exp[−A_c(ν0 + N, τ0 + φ(D))] are constant w.r.t. θ and can be pulled out of the integral
Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 16
Posterior Predictive Distribution
Since A_c = log Z_c, i.e., Z_c = exp(A_c), we can write the posterior predictive distribution as
p(D′|D) = [∏_{i=1}^{N′} h(x̃_i)] Z_c(ν0 + N + N′, τ0 + φ(D) + φ(D′)) / Z_c(ν0 + N, τ0 + φ(D))
= [∏_{i=1}^{N′} h(x̃_i)] exp[A_c(ν0 + N + N′, τ0 + φ(D) + φ(D′)) − A_c(ν0 + N, τ0 + φ(D))]
Note that the form of Z_c (and A_c) simply depends on the chosen conjugate prior
Very useful result. Also holds for N = 0
In the N = 0 case, p(D′) = ∫ p(D′|θ) p(θ) dθ is simply the marginal likelihood of D′
Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 17
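As a concrete instance of this result, a minimal Beta-Bernoulli sketch (function names are illustrative; for the Beta(a, b) prior, Z_c is the Beta function B(a, b) and h(x) = 1, so the predictive probability is a ratio of Beta functions):

```python
import math

def log_beta(a, b):
    # log of the Beta function B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def bernoulli_post_pred(a, b, data, x_new):
    """p(x_new | D) for a Bernoulli likelihood with Beta(a, b) prior,
    computed as Z_c(posterior counts + new point) / Z_c(posterior counts)."""
    s, N = sum(data), len(data)
    log_num = log_beta(a + s + x_new, b + (N - s) + (1 - x_new))
    log_den = log_beta(a + s, b + (N - s))
    return math.exp(log_num - log_den)

data = [1, 0, 1, 1]                            # 3 heads, 1 tail
p1 = bernoulli_post_pred(2.0, 2.0, data, x_new=1)
print(p1)   # 0.625
```

The result matches the familiar closed form (a + s)/(a + b + N) = 5/8 for the probability of the next head, as the general Z_c-ratio formula promises.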
Exponential Family and GLM
Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 18
Generalized Linear Models (GLM)
(Probabilistic) Linear regression: when the response y is real-valued
p(y|x, w) = N(w^T x, β^{−1})
(Probabilistic) Logistic regression: when the response y is binary
p(y|x, w) = Bernoulli(σ(w^T x)) = [σ(w^T x)]^y [1 − σ(w^T x)]^{1−y}
where σ(w^T x) = 1/(1 + exp(−w^T x)) = exp(w^T x)/(1 + exp(w^T x))
Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 19
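The two expressions for σ(w^T x) above are algebraically identical. A small numerical check, using a numerically stable variant that only ever exponentiates −|z| (a common implementation choice, not something specific to these slides):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid computed so that np.exp only ever sees -|z|, avoiding
    overflow for large |z|; equals 1/(1+exp(-z)) = exp(z)/(1+exp(z))."""
    z = np.asarray(z, dtype=float)
    e = np.exp(-np.abs(z))
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

z = np.array([-30.0, -1.0, 0.0, 1.0, 30.0])
print(np.allclose(sigmoid(z), 1.0 / (1.0 + np.exp(-z))))  # True
```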
Generalized Linear Models: Formally
The GLM is of the form p(y|η) = h(y) exp(ηy − A(η)), where η depends on x
Note: some GLMs can be represented as p(y|η, φ) = h(y, φ) exp((ηy − A(η))/φ), where φ is a dispersion parameter (Gaussian/gamma GLMs use this representation)
Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 20
GLM with Canonical Response Function
This form makes parameter estimation in canonical GLMs easy (e.g., gradients are easy to compute)
We will focus on canonical GLMs only (these are the most common)
Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 21
MLE for GLM
Log-likelihood
L(η) = log p(Y|η) = log ∏_{n=1}^N h(y_n) exp(y_n w^T x_n − A(η_n)) = ∑_{n=1}^N log h(y_n) + w^T ∑_{n=1}^N y_n x_n − ∑_{n=1}^N A(η_n)
To estimate w via MLE (or MAP), either set the derivative to zero or use iterative methods (e.g., gradient descent, iteratively reweighted least squares, etc.)
Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 22
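With the canonical link, A′(η_n) is the mean of y_n, so the gradient of the log-likelihood above is ∑_n (y_n − A′(η_n)) x_n. A minimal gradient-ascent sketch for the Bernoulli (logistic regression) case, where A′(η) is the sigmoid; the function name, step size, and toy data are illustrative:

```python
import numpy as np

def glm_mle_gradient_ascent(X, y, lr=0.1, iters=500):
    """MLE for a canonical GLM via gradient ascent. With the canonical
    link, dL/dw = sum_n (y_n - A'(eta_n)) x_n with eta_n = w^T x_n.
    Here A'(eta) = sigmoid(eta), i.e., logistic regression."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-(X @ w)))   # A'(eta_n) = E[y_n]
        grad = X.T @ (y - mu)                 # gradient of the log-likelihood
        w += lr * grad / N
    return w

# Tiny separable example: y = 1 exactly when the feature is positive
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = glm_mle_gradient_ascent(X, y)
print(w[0] > 0)   # True: the learned weight points the right way
```

The same loop works for any canonical GLM by swapping in the appropriate mean function A′; only the line computing `mu` changes.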
Bayesian Inference for GLM
If the likelihood is conjugate to the prior on w, Bayesian inference can be done in closed form
Example: Bayesian linear regression with a Gaussian likelihood and a Gaussian prior on w
Otherwise, approximate Bayesian inference is needed (e.g., Laplace approximation, MCMC, variational inference, etc.)
Example: Bayesian logistic regression with a sigmoid-Bernoulli likelihood and a Gaussian prior on w
Interesting class project idea: design simple inference algorithms for non-conjugate GLMs
Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 23
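As one illustration of the non-conjugate case, a minimal Laplace-approximation sketch for Bayesian logistic regression with a N(0, σ²I) prior on w: find the MAP estimate, then approximate the posterior by a Gaussian centered there with covariance given by the inverse Hessian of the negative log joint. Function name and hyperparameters are illustrative, not from the slides:

```python
import numpy as np

def laplace_logistic(X, y, prior_var=1.0, iters=200, lr=0.1):
    """Laplace approximation for Bayesian logistic regression."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(iters):                      # gradient ascent to the MAP
        mu = 1.0 / (1.0 + np.exp(-(X @ w)))
        grad = X.T @ (y - mu) - w / prior_var   # log-likelihood + log-prior
        w += lr * grad / N
    mu = 1.0 / (1.0 + np.exp(-(X @ w)))
    # Hessian of the negative log joint at the MAP: X^T R X + (1/sigma^2) I
    R = np.diag(mu * (1.0 - mu))
    H = X.T @ R @ X + np.eye(D) / prior_var
    return w, np.linalg.inv(H)                  # Gaussian mean and covariance

X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w_map, cov = laplace_logistic(X, y)
print(w_map.shape, cov.shape)
```

The resulting Gaussian can then be used, e.g., to approximate the posterior predictive by averaging the sigmoid over posterior samples of w.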
Summary
Exp. family distributions are very useful for modeling diverse types of data/parameters
Conjugate priors for exp. family distributions make parameter updates very simple
Other quantities such as the posterior predictive can be computed in closed form
We will see several use cases when we discuss approximate inference algorithms (e.g., Gibbs sampling, and especially variational inference)
Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 24