Exponential Family and Generalized Linear Models

Piyush Rai

Probabilistic Machine Learning (CS772A)


Aug 22, 2017

Exponential Family (Pitman, Darmois, Koopman, Late 1930s)

Defines a class of distributions. An Exponential Family distribution is of the form

    p(x|θ) = (1/Z(θ)) h(x) exp[θᵀφ(x)] = h(x) exp[θᵀφ(x) − A(θ)]

x ∈ X^m is the random variable being modeled (where X denotes some space, e.g., R or {0, 1})
θ ∈ R^d: Natural parameters or canonical parameters defining the distribution
φ(x) ∈ R^d: Sufficient statistics (another random variable)
  Why “sufficient”: p(x|θ), as a function of θ, depends on x only via φ(x)
Z(θ) = ∫ h(x) exp[θᵀφ(x)] dx: Partition function
A(θ) = log Z(θ): Log-partition function (also called the cumulant function)
h(x): A constant (doesn’t depend on θ)

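To make the definitional pieces concrete, here is a minimal Python/NumPy sketch (the helper name expfam_pdf and the instantiation below are illustrative choices, not from the slides): it evaluates p(x|θ) = h(x) exp[θᵀφ(x) − A(θ)] for the Bernoulli, the N = 1 case of the Binomial worked out on the next slide, whose natural parameter is θ = log(p/(1 − p)) and whose log-partition function is A(θ) = log(1 + exp(θ)).

    import numpy as np

    def expfam_pdf(x, theta, phi, log_h, A):
        # p(x|theta) = h(x) * exp(theta^T phi(x) - A(theta))
        return np.exp(log_h(x) + np.dot(theta, phi(x)) - A(theta))

    # Bernoulli: phi(x) = x, h(x) = 1, theta = log(p/(1-p)), A(theta) = log(1 + e^theta)
    p = 0.3
    theta = np.array([np.log(p / (1 - p))])
    bern = lambda x: expfam_pdf(x, theta,
                                phi=lambda x: np.array([x]),
                                log_h=lambda x: 0.0,
                                A=lambda t: np.log1p(np.exp(t[0])))

    print(bern(1), bern(0))   # ~0.3 and ~0.7, i.e., p and 1 - p

Any of the distributions discussed below can be plugged into the same three-slot template (h, φ, A).
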
Binomial as Exponential Family

Let’s try to write the Binomial distribution in the exponential family form

    p(x|θ) = h(x) exp[θᵀφ(x) − A(θ)]

Recall the standard definition of the Binomial distribution

    Binomial(x|N, p) = (N choose x) p^x (1 − p)^(N−x)

where N: number of trials, x (a scalar): number of successes, p: probability of success in each trial

Can re-express the above representation of the Binomial as

    Binomial(x|N, p) = (N choose x) exp[ x log(p/(1 − p)) + N log(1 − p) ]

h(x) = (N choose x)
θ = log(p/(1 − p)), and p = 1/(1 + exp(−θ))
φ(x) = x
A(θ) = −N log(1 − p) = N log(1 + exp(θ))

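This identification is easy to sanity-check numerically; a minimal sketch (NumPy/SciPy assumed, variable names mine) compares h(x) exp[θx − A(θ)] against SciPy’s Binomial pmf:

    import numpy as np
    from scipy.stats import binom
    from scipy.special import comb

    N, p = 10, 0.3
    theta = np.log(p / (1 - p))          # natural parameter
    A = N * np.log1p(np.exp(theta))      # log-partition function N log(1 + e^theta)

    x = np.arange(N + 1)
    expfam_pmf = comb(N, x) * np.exp(theta * x - A)     # h(x) exp(theta * phi(x) - A(theta))

    print(np.allclose(expfam_pmf, binom.pmf(x, N, p)))  # True
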
(Univariate) Gaussian as Exponential Family

Let’s try to write the Gaussian distribution in the exponential family form

    p(x|θ) = h(x) exp[θᵀφ(x) − A(θ)]

Recall the standard definition of a univariate Gaussian

    N(x|µ, σ²) = (1/√(2πσ²)) exp[ −(x − µ)²/(2σ²) ]
               = (1/√(2π)) exp[ (µ/σ²) x − (1/(2σ²)) x² − µ²/(2σ²) − log σ ]
               = (1/√(2π)) exp[ (µ/σ², −1/(2σ²))ᵀ (x, x²) − (µ²/(2σ²) + log σ) ]

h(x) = 1/√(2π)
θ = (θ_1, θ_2) = (µ/σ², −1/(2σ²)), and (µ, σ²) = (−θ_1/(2θ_2), −1/(2θ_2))
φ(x) = (x, x²)
A(θ) = µ²/(2σ²) + log σ = −θ_1²/(4θ_2) − (1/2) log(−2θ_2)

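The same numeric check as for the Binomial works here (a sketch, NumPy/SciPy assumed): plugging the natural parameters (µ/σ², −1/(2σ²)) and A(θ) into h(x) exp[θᵀφ(x) − A(θ)] should reproduce the usual Gaussian pdf.

    import numpy as np
    from scipy.stats import norm

    mu, sigma = 1.5, 0.8
    theta = np.array([mu / sigma**2, -1.0 / (2 * sigma**2)])   # natural parameters
    A = mu**2 / (2 * sigma**2) + np.log(sigma)                 # log-partition function

    x = np.linspace(-3.0, 5.0, 9)
    phi = np.stack([x, x**2])                                  # sufficient statistics, shape (2, len(x))
    expfam_pdf = np.exp(theta @ phi - A) / np.sqrt(2 * np.pi)  # h(x) = 1/sqrt(2*pi)

    print(np.allclose(expfam_pdf, norm.pdf(x, loc=mu, scale=sigma)))   # True
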
A General Trick

A general trick to represent any distribution (assuming it is an exp-family dist.) in exp-family form:

Write the given distribution as exp(log p(·)) and simplify, e.g., for the Binomial

    exp(log Binomial(x|N, p)) = exp[ log( (N choose x) p^x (1 − p)^(N−x) ) ]
                              = exp[ log (N choose x) + x log p + (N − x) log(1 − p) ]
                              = (N choose x) exp[ x log(p/(1 − p)) + N log(1 − p) ]

Now compare the resulting expression with the exponential family form

    p(x|θ) = h(x) exp[θᵀφ(x) − A(θ)]

.. to identify the natural parameters, sufficient statistics, log-partition function, etc.

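For simple cases the regrouping can even be mechanized; a small SymPy sketch (SymPy assumed; it sets log h(x) aside first, exactly as in the derivation above) recovers the natural parameter and −A(θ) of the Binomial automatically:

    import sympy as sp

    x, N, p = sp.symbols('x N p', positive=True)

    # log h(x) = log (N choose x) is set aside; regroup the remaining exponent by powers of x
    exponent = x*sp.log(p) + (N - x)*sp.log(1 - p)
    grouped = sp.collect(sp.expand(exponent), x)

    print(grouped.coeff(x, 1))   # log(p) - log(1 - p), i.e., theta = log(p/(1 - p))
    print(grouped.coeff(x, 0))   # N*log(1 - p), i.e., -A(theta)
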
Other Examples

Many other distributions belong to the exponential family:

  Bernoulli
  Beta
  Gamma
  Multinoulli/Multinomial
  Dirichlet
  Multivariate Gaussian
  .. and many more (https://en.wikipedia.org/wiki/Exponential_family)

Note: Not all distributions belong to the exponential family, e.g.,

  Uniform distribution (x ∼ Unif(a, b))
  Student-t distribution
  Mixture distributions (e.g., mixture of Gaussians)

If the support of the distribution depends on its parameters, then it is not an exp. family dist.

Log-Partition Function

A(θ) = log Z(θ) = log ∫ h(x) exp[θᵀφ(x)] dx is the log-partition function

A(θ) is also called the cumulant function

Derivatives of A(θ) can be used to generate the cumulants of the sufficient statistics φ(x)

Exercise: Assume θ to be a scalar (thus φ(x) is also scalar). Show that the first and the second
derivatives of A(θ) are

    dA/dθ   = E_{p(x|θ)}[φ(x)] = mean[φ(x)]
    d²A/dθ² = E_{p(x|θ)}[φ(x)²] − (E_{p(x|θ)}[φ(x)])² = var[φ(x)]

Note: The above result also holds when θ and φ(x) are vector-valued (the “var” will be “covar”)

Important: A(θ) is a convex function of θ. Why?

Exercise: For the Binomial, using its expression of A(θ), derive the first and second cumulants of φ(x)

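For the Binomial these cumulants can be checked with finite differences (a quick sketch, NumPy assumed, using A(θ) = N log(1 + exp(θ)) derived earlier): the first derivative should come out as Np = E[φ(x)] and the second as Np(1 − p) = var[φ(x)].

    import numpy as np

    N, p = 10, 0.3
    theta = np.log(p / (1 - p))
    A = lambda t: N * np.log1p(np.exp(t))    # Binomial log-partition function

    eps = 1e-4
    dA  = (A(theta + eps) - A(theta - eps)) / (2 * eps)
    d2A = (A(theta + eps) - 2 * A(theta) + A(theta - eps)) / eps**2

    print(dA, N * p)               # both ~3.0  (mean of phi(x) = x)
    print(d2A, N * p * (1 - p))    # both ~2.1  (variance of phi(x) = x)
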
MLE for Exponential Family Distributions

Suppose we have data D = {x_1, . . . , x_N} drawn i.i.d. from an exponential family distribution

    p(x|θ) = h(x) exp[θᵀφ(x) − A(θ)]

To do MLE, we need the overall likelihood. This is simply a product of the individual likelihoods

    p(D|θ) = ∏_{i=1}^N p(x_i|θ) = [∏_{i=1}^N h(x_i)] exp[θᵀ ∑_{i=1}^N φ(x_i) − N A(θ)] = [∏_{i=1}^N h(x_i)] exp[θᵀφ(D) − N A(θ)]

To estimate θ (as we’ll see shortly), we only need φ(D) = ∑_{i=1}^N φ(x_i) and N

Size of φ(D) = ∑_{i=1}^N φ(x_i) does not grow with N (same as the size of each φ(x_i))

Only exponential family distributions have finite-sized sufficient statistics

No need to store all the data; can simply store and recursively update the sufficient statistics with
more and more data

Very useful when doing probabilistic/Bayesian inference with large-scale data sets. Also useful in
online parameter estimation problems.

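A sketch of the “store and recursively update the sufficient statistics” idea for a univariate Gaussian (NumPy assumed; the class and method names are illustrative): only the running totals ∑ x_i, ∑ x_i² and the count N are kept, yet they suffice to produce parameter estimates at any time.

    import numpy as np

    class GaussianSuffStats:
        """Running sufficient statistics phi(D) = (sum x_i, sum x_i^2) and count N."""
        def __init__(self):
            self.N = 0
            self.phi = np.zeros(2)

        def update(self, x):              # absorb one observation, then discard it
            self.N += 1
            self.phi += np.array([x, x**2])

        def mle(self):                    # recover (mu, sigma^2) from the running stats
            mu = self.phi[0] / self.N
            return mu, self.phi[1] / self.N - mu**2

    stats = GaussianSuffStats()
    for x in np.random.default_rng(0).normal(2.0, 1.5, size=10_000):
        stats.update(x)
    print(stats.mle())   # ~(2.0, 2.25), without ever storing the dataset
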
MLE for Exponential Family Distributions

The likelihood is of the form p(D|θ) = [∏_{i=1}^N h(x_i)] exp[θᵀφ(D) − N A(θ)]

The log-likelihood is (ignoring constants w.r.t. θ): log p(D|θ) = θᵀφ(D) − N A(θ)

Note: This is concave in θ (since −A(θ) is concave). Maximization will yield a global maximum of θ

MLE for exp-fam distributions can also be seen as doing moment-matching. To see this, note that

    ∇_θ [θᵀφ(D) − N A(θ)] = φ(D) − N ∇_θ A(θ) = φ(D) − N E_{p(x|θ)}[φ(x)] = ∑_{i=1}^N φ(x_i) − N E_{p(x|θ)}[φ(x)]

Therefore, at the “optimal” (i.e., MLE) θ̂, where the derivative is 0, the following must hold

    E_{p(x|θ̂)}[φ(x)] = (1/N) ∑_{i=1}^N φ(x_i)

This is basically matching the expected moments of the distribution with the empirical moments
(“empirical” here means what we compute using the observed data)

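Here is a small gradient-ascent sketch of this view (NumPy assumed; the step size and iteration count are arbitrary choices, not from the slides) for i.i.d. Binomial observations: the gradient is φ(D) − N·E_{p(x|θ)}[φ(x)], and at convergence the model mean n_trials·sigmoid(θ) matches the empirical mean, i.e., p̂ equals the empirical success rate.

    import numpy as np

    rng = np.random.default_rng(1)
    n_trials, p_true = 20, 0.35
    data = rng.binomial(n_trials, p_true, size=500)        # i.i.d. Binomial observations

    phi_D, n_obs = data.sum(), len(data)                   # phi(D) and the number of observations
    sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

    theta = 0.0                                            # natural parameter, arbitrary start
    for _ in range(200):
        grad = phi_D - n_obs * n_trials * sigmoid(theta)   # phi(D) - n_obs * E[phi(x)]
        theta += 1e-4 * grad                               # plain gradient ascent

    print(sigmoid(theta), data.mean() / n_trials)          # ~equal: moment matching at the optimum
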
Moment Matching: An Example

Given N observations x_1, . . . , x_N from a univariate Gaussian N(x|µ, σ²), doing moment-matching

    E[φ(x)] = (1/N) ∑_{i=1}^N φ(x_i)

The “true”, i.e., expected moments: E[φ(x)] = E[(x, x²)]. Therefore

    E[x]  = (1/N) ∑_{i=1}^N x_i
    E[x²] = (1/N) ∑_{i=1}^N x_i²

For a univariate Gaussian, note that E[x] = µ and E[x²] = var[x] + E[x]² = σ² + µ²

Thus we have two equations and two unknowns

From the first equation, we immediately get µ = (1/N) ∑_{i=1}^N x_i

From the second equation, we get σ² = E[x²] − µ² = (1/N) ∑_{i=1}^N x_i² − µ² = (1/N) ∑_{i=1}^N (x_i − µ)²

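The two moment-matching equations in code (a short illustrative sketch, NumPy assumed): matching E[x] and E[x²] to their empirical counterparts gives exactly the usual Gaussian MLE.

    import numpy as np

    x = np.random.default_rng(2).normal(-1.0, 2.0, size=5_000)

    m1, m2 = x.mean(), (x**2).mean()       # empirical first and second moments
    mu_hat, var_hat = m1, m2 - m1**2       # solve E[x] = m1 and E[x^2] = sigma^2 + mu^2 = m2

    print(mu_hat, var_hat)                 # ~(-1.0, 4.0)
    print(np.isclose(var_hat, x.var()))    # True: identical to (1/N) sum (x_i - mu)^2
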
Bayesian Inference for Exponential Family Distributions

We saw that the total likelihood given N i.i.d. observations D = {x_1, . . . , x_N} is

    p(D|θ) ∝ exp[θᵀφ(D) − N A(θ)]    where φ(D) = ∑_{i=1}^N φ(x_i)

Let’s choose the following prior (note: within the exponent, it looks similar to the likelihood in terms of θ)

    p(θ|ν_0, τ_0) = h(θ) exp[θᵀτ_0 − ν_0 A(θ) − A_c(ν_0, τ_0)]

Ignoring the prior’s log-partition function A_c(ν_0, τ_0) = log ∫ h(θ) exp[θᵀτ_0 − ν_0 A(θ)] dθ,

    p(θ|ν_0, τ_0) ∝ h(θ) exp[θᵀτ_0 − ν_0 A(θ)]

Comparing the prior’s form with the likelihood, we notice that

  ν_0 is like the number of “pseudo-observations” coming from the prior
  τ_0 is the total sufficient statistics of these ν_0 pseudo-observations

The Posterior Distribution

As we saw, the likelihood is

    p(D|θ) ∝ exp[θᵀφ(D) − N A(θ)]    where φ(D) = ∑_{i=1}^N φ(x_i)

And the prior we chose is

    p(θ|ν_0, τ_0) ∝ h(θ) exp[θᵀτ_0 − ν_0 A(θ)]

For this form of the prior, the posterior p(θ|D) ∝ p(θ)p(D|θ) will be

    p(θ|D) ∝ h(θ) exp[θᵀ(τ_0 + φ(D)) − (ν_0 + N) A(θ)]

Note that the posterior has the same form as the prior; such a prior is called a conjugate prior
(note: all exponential family distributions have a conjugate prior of the form shown above)

Thus the posterior hyperparams ν_0′, τ_0′ are obtained by simply adding “stuff” to the prior’s hyperparams

    ν_0′ ← ν_0 + N        (no. of pseudo-obs + no. of actual obs)
    τ_0′ ← τ_0 + φ(D)     (total suff-stats from pseudo-obs + total suff-stats from actual obs)

Note: The prior’s log-partition function A_c(ν_0, τ_0) updates to the posterior’s A_c(ν_0 + N, τ_0 + φ(D))

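A minimal sketch of these conjugate updates (NumPy assumed; the Bernoulli is used because its sufficient statistic φ(x) = x makes φ(D) just the number of 1s; function and variable names are mine):

    import numpy as np

    def conjugate_update(nu0, tau0, phi_D, N):
        # posterior hyperparameters: add the observation count and the total suff-stats
        return nu0 + N, tau0 + phi_D

    data = np.random.default_rng(3).binomial(1, 0.7, size=100)   # Bernoulli observations
    nu0, tau0 = 2.0, 1.0                      # pseudo-observation count and their suff-stats
    nu_post, tau_post = conjugate_update(nu0, tau0, data.sum(), len(data))

    print(nu_post, tau_post)   # 102.0 total (pseudo + real) observations, tau_post of them ones

For the Bernoulli these are exactly the count-style updates familiar from the Beta-Bernoulli example referred to later in the deck.
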
The Posterior Distribution

Assuming the prior p(θ|ν_0, τ_0) ∝ h(θ) exp[θᵀτ_0 − ν_0 A(θ)], the posterior was

    p(θ|D) ∝ h(θ) exp[θᵀ(τ_0 + φ(D)) − (ν_0 + N) A(θ)]

Assuming τ_0 = ν_0 τ̄_0, we can also write the prior as p(θ|ν_0, τ̄_0) ∝ exp[θᵀ ν_0 τ̄_0 − ν_0 A(θ)]

Can think of τ̄_0 = τ_0/ν_0 as the average sufficient statistics per pseudo-observation

The posterior can be written as

    p(θ|D) ∝ h(θ) exp[ θᵀ (ν_0 + N) · (ν_0 τ̄_0 + φ(D))/(ν_0 + N) − (ν_0 + N) A(θ) ]

Denoting φ̄ = φ(D)/N as the average suff-stats per real observation, the posterior updates are

    ν_0′ ← ν_0 + N
    τ̄_0′ ← (ν_0 τ̄_0 + N φ̄)/(ν_0 + N)

Note that the posterior hyperparam τ̄_0′ is a convex combination of the average suff-stats τ̄_0 of the
ν_0 pseudo-observations and the average suff-stats φ̄ of the N actual observations

Posterior Predictive Distribution

Assume some past (training) data D = {x 1 , . . . , x N } generated from an exp. family distribution

Assme some test data D0 = {x̃ 1 , . . . , x̃ N 0 } from the same distribution (N 0 ≥ 1)


The posterior predictive distribution of D0 (probability distribution of new data given old data)
Z
p(D0 |D) = p(D0 |θ)p(θ|D)dθ

We’ve already seen some specific examples of computing the posterior predictive dist., e.g.,
Beta-Bernoulli case: Posterior predictive distribution of next coin toss
Bayesian linear regression: Posterior predictive distribution of the response y∗ of test input x ∗

Nice Property: If the likelihood is an exponential family distribution and the prior is conjugate (and thus so is the
posterior), the posterior predictive always has a closed-form expression (shown next)

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 14
Posterior Predictive Distribution

Recall the form of the likelihood p(D|θ) for exp. family dist.
"N #
Y h i
p(D|θ) = h(x i ) exp θ> φ(D) − NA(θ)
i=1

The conjugate prior was


p(θ|ν0 , τ0 ) = h(θ) exp[θ> τ0 − ν0 A(θ) − Ac (ν0 , τ0 )]

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 15
Posterior Predictive Distribution

Recall the form of the likelihood p(D|θ) for exp. family dist.
"N #
Y h i
p(D|θ) = h(x i ) exp θ> φ(D) − NA(θ)
i=1

The conjugate prior was


p(θ|ν0 , τ0 ) = h(θ) exp[θ> τ0 − ν0 A(θ) − Ac (ν0 , τ0 )]

For this choice of the conjugate prior, the posterior was shown to be
p(θ|D) = h(θ) exp[θ> (τ0 + φ(D)) − (ν0 + N) A(θ) − Ac (ν0 + N, τ0 + φ(D))]

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 15
Posterior Predictive Distribution

Recall the form of the likelihood p(D|θ) for exp. family dist.
"N #
Y h i
p(D|θ) = h(x i ) exp θ> φ(D) − NA(θ)
i=1

The conjugate prior was


p(θ|ν0 , τ0 ) = h(θ) exp[θ> τ0 − ν0 A(θ) − Ac (ν0 , τ0 )]

For this choice of the conjugate prior, the posterior was shown to be
p(θ|D) = h(θ) exp[θ> (τ0 + φ(D)) − (ν0 + N) A(θ) − Ac (ν0 + N, τ0 + φ(D))]

For the test data D′, the likelihood will be
p(D′|θ) = [∏_{i=1}^{N′} h(x̃ i )] exp[θ> φ(D′) − N′ A(θ)]   where φ(D′) = Σ_{i=1}^{N′} φ(x̃ i )
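
As a tiny illustration of the posterior update above (an assumed example, not from the slides; the hyperparameter values are arbitrary), here it is in code for the Bernoulli case, where φ(x) = x, so φ(D) is just the number of 1s and the posterior hyperparameters are "prior hyperparameters plus sufficient statistics":

import numpy as np

# Conjugate update for an exp-family likelihood: (nu0, tau0) -> (nu0 + N, tau0 + phi(D))
def conjugate_update(nu0, tau0, D):
    return nu0 + len(D), tau0 + np.sum(D)   # phi(D) = sum_i phi(x_i), with phi(x) = x for Bernoulli

D = np.array([1, 0, 1, 1, 0])               # five Bernoulli observations
nu_N, tau_N = conjugate_update(nu0=2.0, tau0=1.0, D=D)
print(nu_N, tau_N)                          # 7.0 and 4.0, i.e., nu0 + N and tau0 + phi(D)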

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 15
Posterior Predictive Distribution
Therefore the posterior predictive distribution will be
p(D′|D) = ∫ p(D′|θ) p(θ|D) dθ

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 16
Posterior Predictive Distribution
Therefore the posterior predictive distribution will be
p(D′|D) = ∫ p(D′|θ) p(θ|D) dθ
         = [∏_{i=1}^{N′} h(x̃ i )] ∫ exp[θ> φ(D′) − N′ A(θ)] h(θ) exp[θ> (τ0 + φ(D)) − (ν0 + N) A(θ) − Ac (ν0 + N, τ0 + φ(D))] dθ
(both the ∏_{i=1}^{N′} h(x̃ i ) factor and the Ac (ν0 + N, τ0 + φ(D)) term are constants w.r.t. θ)

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 16
Posterior Predictive Distribution
Therefore the posterior predictive distribution will be
p(D′|D) = ∫ p(D′|θ) p(θ|D) dθ
         = [∏_{i=1}^{N′} h(x̃ i )] ∫ exp[θ> φ(D′) − N′ A(θ)] h(θ) exp[θ> (τ0 + φ(D)) − (ν0 + N) A(θ) − Ac (ν0 + N, τ0 + φ(D))] dθ
(both the ∏_{i=1}^{N′} h(x̃ i ) factor and the Ac (ν0 + N, τ0 + φ(D)) term are constants w.r.t. θ)

The above gets simplified further into

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 16
Posterior Predictive Distribution
Therefore the posterior predictive distribution will be
p(D′|D) = ∫ p(D′|θ) p(θ|D) dθ
         = [∏_{i=1}^{N′} h(x̃ i )] ∫ exp[θ> φ(D′) − N′ A(θ)] h(θ) exp[θ> (τ0 + φ(D)) − (ν0 + N) A(θ) − Ac (ν0 + N, τ0 + φ(D))] dθ
(both the ∏_{i=1}^{N′} h(x̃ i ) factor and the Ac (ν0 + N, τ0 + φ(D)) term are constants w.r.t. θ)

The above gets simplified further into


 0
R
N
h(θ) exp θ> (τ0 + φ(D) + φ(D0 )) − (ν0 + N + N 0 )A(θ) dθ
 
0
Y
p(D |D) =  h(x̃ i )
i=1
exp [Ac (ν0 + N, τ 0 + φ(D))]

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 16
Posterior Predictive Distribution
Therefore the posterior predictive distribution will be
p(D′|D) = ∫ p(D′|θ) p(θ|D) dθ
         = [∏_{i=1}^{N′} h(x̃ i )] ∫ exp[θ> φ(D′) − N′ A(θ)] h(θ) exp[θ> (τ0 + φ(D)) − (ν0 + N) A(θ) − Ac (ν0 + N, τ0 + φ(D))] dθ
(both the ∏_{i=1}^{N′} h(x̃ i ) factor and the Ac (ν0 + N, τ0 + φ(D)) term are constants w.r.t. θ)

The above gets simplified further into


 0
R
N
h(θ) exp θ> (τ0 + φ(D) + φ(D0 )) − (ν0 + N + N 0 )A(θ) dθ
 
0
Y
p(D |D) =  h(x̃ i )
i=1
exp [Ac (ν0 + N, τ 0 + φ(D))]
 0 
N 0 0
 h(x̃ i ) Zc (ν0 + N + N , τ 0 + φ(D) + φ(D ))
Y
=
i=1
exp [Ac (ν0 + N, τ 0 + φ(D))]

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 16
Posterior Predictive Distribution
Therefore the posterior predictive distribution will be
p(D′|D) = ∫ p(D′|θ) p(θ|D) dθ
         = [∏_{i=1}^{N′} h(x̃ i )] ∫ exp[θ> φ(D′) − N′ A(θ)] h(θ) exp[θ> (τ0 + φ(D)) − (ν0 + N) A(θ) − Ac (ν0 + N, τ0 + φ(D))] dθ
(both the ∏_{i=1}^{N′} h(x̃ i ) factor and the Ac (ν0 + N, τ0 + φ(D)) term are constants w.r.t. θ)

The above gets simplified further into


 0
R
N
h(θ) exp θ> (τ0 + φ(D) + φ(D0 )) − (ν0 + N + N 0 )A(θ) dθ
 
0
Y
p(D |D) =  h(x̃ i )
i=1
exp [Ac (ν0 + N, τ 0 + φ(D))]
 0 
N 0 0
 h(x̃ i ) Zc (ν0 + N + N , τ 0 + φ(D) + φ(D ))
Y
=
i=1
exp [Ac (ν0 + N, τ 0 + φ(D))]
h i
where Zc (ν0 + N + N 0 , τ 0 + φ(D) + φ(D 0 )) = h(θ) exp θ > (τ0 + φ(D) + φ(D 0 )) − (ν0 + N + N 0 )A(θ) dθ
R

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 16
Posterior Predictive Distribution
Since Ac = log Zc or Zc = exp(Ac ), we can write the posterior predictive distribution as

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 17
Posterior Predictive Distribution
Since Ac = log Zc or Zc = exp(Ac ), we can write the posterior predictive distribution as
p(D′|D) = [∏_{i=1}^{N′} h(x̃ i )] Zc (ν0 + N + N′, τ0 + φ(D) + φ(D′)) / Zc (ν0 + N, τ0 + φ(D))

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 17
Posterior Predictive Distribution
Since Ac = log Zc or Zc = exp(Ac ), we can write the posterior predictive distribution as
p(D′|D) = [∏_{i=1}^{N′} h(x̃ i )] Zc (ν0 + N + N′, τ0 + φ(D) + φ(D′)) / Zc (ν0 + N, τ0 + φ(D))
         = [∏_{i=1}^{N′} h(x̃ i )] exp[Ac (ν0 + N + N′, τ0 + φ(D) + φ(D′)) − Ac (ν0 + N, τ0 + φ(D))]

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 17
Posterior Predictive Distribution
Since Ac = log Zc or Zc = exp(Ac ), we can write the posterior predictive distribution as
p(D′|D) = [∏_{i=1}^{N′} h(x̃ i )] Zc (ν0 + N + N′, τ0 + φ(D) + φ(D′)) / Zc (ν0 + N, τ0 + φ(D))
         = [∏_{i=1}^{N′} h(x̃ i )] exp[Ac (ν0 + N + N′, τ0 + φ(D) + φ(D′)) − Ac (ν0 + N, τ0 + φ(D))]

Therefore the posterior predictive is proportional to ..

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 17
Posterior Predictive Distribution
Since Ac = log Zc or Zc = exp(Ac ), we can write the posterior predictive distribution as
p(D′|D) = [∏_{i=1}^{N′} h(x̃ i )] Zc (ν0 + N + N′, τ0 + φ(D) + φ(D′)) / Zc (ν0 + N, τ0 + φ(D))
         = [∏_{i=1}^{N′} h(x̃ i )] exp[Ac (ν0 + N + N′, τ0 + φ(D) + φ(D′)) − Ac (ν0 + N, τ0 + φ(D))]

Therefore the posterior predictive is proportional to ..


.. the ratio of two partition functions of two “posterior distributions” (one with N + N′ examples and
the other with N examples)

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 17
Posterior Predictive Distribution
Since Ac = log Zc or Zc = exp(Ac ), we can write the posterior predictive distribution as
p(D′|D) = [∏_{i=1}^{N′} h(x̃ i )] Zc (ν0 + N + N′, τ0 + φ(D) + φ(D′)) / Zc (ν0 + N, τ0 + φ(D))
         = [∏_{i=1}^{N′} h(x̃ i )] exp[Ac (ν0 + N + N′, τ0 + φ(D) + φ(D′)) − Ac (ν0 + N, τ0 + φ(D))]

Therefore the posterior predictive is proportional to ..


.. the ratio of two partition functions of two “posterior distributions” (one with N + N′ examples and
the other with N examples)
.. or exponential of the difference of the corresponding log-partition functions

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 17
Posterior Predictive Distribution
Since Ac = log Zc or Zc = exp(Ac ), we can write the posterior predictive distribution as
p(D′|D) = [∏_{i=1}^{N′} h(x̃ i )] Zc (ν0 + N + N′, τ0 + φ(D) + φ(D′)) / Zc (ν0 + N, τ0 + φ(D))
         = [∏_{i=1}^{N′} h(x̃ i )] exp[Ac (ν0 + N + N′, τ0 + φ(D) + φ(D′)) − Ac (ν0 + N, τ0 + φ(D))]

Therefore the posterior predictive is proportional to ..


.. the ratio of two partition functions of two “posterior distributions” (one with N + N′ examples and
the other with N examples)
.. or exponential of the difference of the corresponding log-partition functions

Note that the form of Zc (and Ac ) will simply depend on the chosen conjugate prior

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 17
Posterior Predictive Distribution
Since Ac = log Zc or Zc = exp(Ac ), we can write the posterior predictive distribution as
p(D′|D) = [∏_{i=1}^{N′} h(x̃ i )] Zc (ν0 + N + N′, τ0 + φ(D) + φ(D′)) / Zc (ν0 + N, τ0 + φ(D))
         = [∏_{i=1}^{N′} h(x̃ i )] exp[Ac (ν0 + N + N′, τ0 + φ(D) + φ(D′)) − Ac (ν0 + N, τ0 + φ(D))]

Therefore the posterior predictive is proportional to ..


.. the ratio of two partition functions of two “posterior distributions” (one with N + N′ examples and
the other with N examples)
.. or exponential of the difference of the corresponding log-partition functions

Note that the form of Zc (and Ac ) will simply depend on the chosen conjugate prior
Very useful result. Also holds for N = 0

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 17
Posterior Predictive Distribution
Since Ac = log Zc or Zc = exp(Ac ), we can write the posterior predictive distribution as
p(D′|D) = [∏_{i=1}^{N′} h(x̃ i )] Zc (ν0 + N + N′, τ0 + φ(D) + φ(D′)) / Zc (ν0 + N, τ0 + φ(D))
         = [∏_{i=1}^{N′} h(x̃ i )] exp[Ac (ν0 + N + N′, τ0 + φ(D) + φ(D′)) − Ac (ν0 + N, τ0 + φ(D))]

Therefore the posterior predictive is proportional to ..


.. the ratio of two partition functions of two “posterior distributions” (one with N + N′ examples and
the other with N examples)
.. or exponential of the difference of the corresponding log-partition functions

Note that the form of Zc (and Ac ) will simply depend on the chosen conjugate prior
Very useful result. Also holds for N = 0
In the N = 0 case, p(D′) = ∫ p(D′|θ) p(θ) dθ is simply the marginal likelihood of D′
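
As a concrete sanity check of this closed-form result (an illustrative sketch, not from the slides; the prior values a0, b0 and the data are arbitrary), here is the Beta-Bernoulli case in code: the conjugate prior's partition function Zc is the Beta function, so the posterior predictive of new tosses D′ given old tosses D is a ratio of two Beta functions.

import numpy as np
from scipy.special import betaln   # log Beta function, i.e., log Zc for the Beta prior

def log_post_predictive(D, D_new, a0=1.0, b0=1.0):
    n1, n0 = np.sum(D), len(D) - np.sum(D)               # sufficient statistics of D
    m1, m0 = np.sum(D_new), len(D_new) - np.sum(D_new)   # sufficient statistics of D′
    # log Zc(posterior given D and D′) − log Zc(posterior given D only)
    return betaln(a0 + n1 + m1, b0 + n0 + m0) - betaln(a0 + n1, b0 + n0)

D     = np.array([1, 1, 0, 1, 0, 1])   # past coin tosses
D_new = np.array([1, 0])               # future tosses whose joint probability we want
print(np.exp(log_post_predictive(D, D_new)))

Setting D to an empty array recovers the N = 0 case, i.e., the marginal likelihood of D_new.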

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 17
Exponential Family and GLM

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 18
Generalized Linear Models (GLM)
(Probabilistic) Linear regression: when response y is real-valued
p(y |x, w ) = N (w > x, β −1 )

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 19
Generalized Linear Models (GLM)
(Probabilistic) Linear regression: when response y is real-valued
p(y |x, w ) = N (w > x, β −1 )

Logistic regression: when response y is binary (0/1)

p(y |x, w ) = Bernoulli(σ(w > x)) = [σ(w > x)]^y [1 − σ(w > x)]^{1−y}
where σ(w > x) = 1/(1 + exp(−w > x)) = exp(w > x)/(1 + exp(w > x))

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 19
Generalized Linear Models (GLM)
(Probabilistic) Linear regression: when response y is real-valued
p(y |x, w ) = N (w > x, β −1 )

Logistic regression: when response y is binary (0/1)

p(y |x, w ) = Bernoulli(σ(w > x)) = [σ(w > x)]^y [1 − σ(w > x)]^{1−y}
where σ(w > x) = 1/(1 + exp(−w > x)) = exp(w > x)/(1 + exp(w > x))

In both, the model depends on the inputs x as w > x

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 19
Generalized Linear Models (GLM)
(Probabilistic) Linear regression: when response y is real-valued
p(y |x, w ) = N (w > x, β −1 )

Logistic regression: when response y is binary (0/1)

p(y |x, w ) = Bernoulli(σ(w > x)) = [σ(w > x)]^y [1 − σ(w > x)]^{1−y}
where σ(w > x) = 1/(1 + exp(−w > x)) = exp(w > x)/(1 + exp(w > x))

In both, the model depends on the inputs x as w > x


Can we extend it to other types of outputs?

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 19
Generalized Linear Models (GLM)
(Probabilistic) Linear regression: when response y is real-valued
p(y |x, w ) = N (w > x, β −1 )

Logistic regression: when response y is binary (0/1)

p(y |x, w ) = Bernoulli(σ(w > x)) = [σ(w > x)]^y [1 − σ(w > x)]^{1−y}
where σ(w > x) = 1/(1 + exp(−w > x)) = exp(w > x)/(1 + exp(w > x))

In both, the model depends on the inputs x as w > x


Can we extend it to other types of outputs?
Solution: Model the output using an exp-fam distribution (Gaussian and Bernoulli already are!)

p(y |η) = h(y ) exp(ηy − A(η)) (Generalized Linear Model (GLM))

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 19
Generalized Linear Models (GLM)
(Probabilistic) Linear regression: when response y is real-valued
p(y |x, w ) = N (w > x, β −1 )

Logistic regression: when response y is binary (0/1)

p(y |x, w ) = Bernoulli(σ(w > x)) = [σ(w > x)]^y [1 − σ(w > x)]^{1−y}
where σ(w > x) = 1/(1 + exp(−w > x)) = exp(w > x)/(1 + exp(w > x))

In both, the model depends on the inputs x as w > x


Can we extend it to other types of outputs?
Solution: Model the output using an exp-fam distribution (Gaussian and Bernoulli already are!)

p(y |η) = h(y ) exp(ηy − A(η)) (Generalized Linear Model (GLM))

.. where η is a scalar-valued natural parameter (depends on x) and sufficient statistics φ(y ) = y
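
As a small sanity check (an illustrative example, not part of the slides), the Bernoulli likelihood used in logistic regression can be verified numerically to have exactly this GLM form, with h(y) = 1, η = w > x and log-partition A(η) = log(1 + exp(η)):

import numpy as np

def bernoulli_glm(y, eta):
    # h(y) exp(eta*y − A(eta)), with h(y) = 1 and A(eta) = log(1 + exp(eta))
    return np.exp(eta * y - np.log1p(np.exp(eta)))

def bernoulli_sigmoid(y, eta):
    mu = 1.0 / (1.0 + np.exp(-eta))          # mean via the sigmoid response function
    return mu**y * (1.0 - mu)**(1 - y)

eta = 0.7                                    # e.g., eta = w^T x for some input x
for y in (0, 1):
    assert np.isclose(bernoulli_glm(y, eta), bernoulli_sigmoid(y, eta))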


Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 19
Generalized Linear Models: Formally

The GLM is of the form p(y |η) = h(y ) exp(ηy − A(η)) where η depends on x

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 20
Generalized Linear Models: Formally

The GLM is of the form p(y |η) = h(y ) exp(ηy − A(η)) where η depends on x

The inputs x appear in the model only as a linear combination ξ = w > x

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 20
Generalized Linear Models: Formally

The GLM is of the form p(y |η) = h(y ) exp(ηy − A(η)) where η depends on x

The inputs x appear in the model only as a linear combination ξ = w > x

Conditional mean µ of a response y ’s distribution is modeled via a response function f

µ = E[y ] = f (ξ) = f (w > x)

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 20
Generalized Linear Models: Formally

The GLM is of the form p(y |η) = h(y ) exp(ηy − A(η)) where η depends on x

The inputs x appear in the model only as a linear combination ξ = w > x

Conditional mean µ of a response y ’s distribution is modeled via a response function f

µ = E[y ] = f (ξ) = f (w > x)

- for (probabilistic) linear regression, f is identity, i.e., µ = f (w > x) = w > x,

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 20
Generalized Linear Models: Formally

The GLM is of the form p(y |η) = h(y ) exp(ηy − A(η)) where η depends on x

The inputs x appear in the model only as a linear combination ξ = w > x

Conditional mean µ of a response y ’s distribution is modeled via a response function f

µ = E[y ] = f (ξ) = f (w > x)

- for (probabilistic) linear regression, f is identity, i.e., µ = f (w > x) = w > x,


- for logistic regression, f is sigmoid, i.e., µ = f (w > x) = exp(w > x)/(1 + exp(w > x))

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 20
Generalized Linear Models: Formally

The GLM is of the form p(y |η) = h(y ) exp(ηy − A(η)) where η depends on x

The inputs x appear in the model only as a linear combination ξ = w > x

Conditional mean µ of a response y ’s distribution is modeled via a response function f

µ = E[y ] = f (ξ) = f (w > x)

- for (probabilistic) linear regression, f is identity, i.e., µ = f (w > x) = w > x,


- for logistic regression, f is sigmoid, i.e., µ = f (w > x) = exp(w > x)/(1 + exp(w > x))
The natural parameter η = ψ(µ) where ψ is the link function

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 20
Generalized Linear Models: Formally

The GLM is of the form p(y |η) = h(y ) exp(ηy − A(η)) where η depends on x

The inputs x appear in the model only as a linear combination ξ = w > x

Conditional mean µ of a response y ’s distribution is modeled via a response function f

µ = E[y ] = f (ξ) = f (w > x)

- for (probabilistic) linear regression, f is identity, i.e., µ = f (w > x) = w > x,


- for logistic regression, f is sigmoid, i.e., µ = f (w > x) = exp(w > x)/(1 + exp(w > x))
The natural parameter η = ψ(µ) where ψ is the link function

Note: Some GLMs can be represented as p(y |η, φ) = h(y , φ) exp[(ηy − A(η))/φ] where φ is a dispersion
parameter (Gaussian/gamma GLMs use this representation)
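
For instance (an illustrative check, not from the slides), the Gaussian fits this dispersion form with η = µ, A(η) = η²/2, dispersion φ = σ², and h(y, φ) = exp(−y²/(2φ))/√(2πφ):

import numpy as np
from scipy.stats import norm

y, mu, sigma2 = 1.3, 0.4, 2.0                                     # arbitrary values
h = np.exp(-y**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)    # h(y, phi)
p = h * np.exp((mu * y - mu**2 / 2) / sigma2)                     # h(y, phi) exp((eta*y − A(eta)) / phi)
assert np.isclose(p, norm.pdf(y, loc=mu, scale=np.sqrt(sigma2)))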

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 20
GLM with Canonical Response Function

A GLM has a canonical response function f if f = ψ −1

For such a GLM, η = ψ(µ) = ψ(f (w > x)) = w > x

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 21
GLM with Canonical Response Function

A GLM has a canonical response function f if f = ψ −1

For such a GLM, η = ψ(µ) = ψ(f (w > x)) = w > x


E.g., for logistic regression η = log[µ/(1 − µ)] = w > x

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 21
GLM with Canonical Response Function

A GLM has a canonical response function f if f = ψ −1

For such a GLM, η = ψ(µ) = ψ(f (w > x)) = w > x


E.g., for logistic regression η = log[µ/(1 − µ)] = w > x

Thus, for Canonical GLMs

p(y |η) = h(y ) exp(ηy − A(η))


= h(y ) exp(y w > x − A(η))

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 21
GLM with Canonical Response Function

A GLM has a canonical response function f if f = ψ −1

For such a GLM, η = ψ(µ) = ψ(f (w > x)) = w > x


E.g., for logistic regression η = log[µ/(1 − µ)] = w > x

Thus, for Canonical GLMs

p(y |η) = h(y ) exp(ηy − A(η))


= h(y ) exp(y w > x − A(η))

This form makes parameter estimation in canonical GLM easy (e.g., gradients easy to compute)

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 21
GLM with Canonical Response Function

A GLM has a canonical response function f if f = ψ −1

For such a GLM, η = ψ(µ) = ψ(f (w > x)) = w > x


E.g., for logistic regression η = log[µ/(1 − µ)] = w > x

Thus, for Canonical GLMs

p(y |η) = h(y ) exp(ηy − A(η))


= h(y ) exp(y w > x − A(η))

This form makes parameter estimation in canonical GLM easy (e.g., gradients easy to compute)
We will focus on canonical GLMs only (these are the most common)
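
A quick numerical check (illustrative, not from the slides) that the sigmoid response of logistic regression is the inverse of its logit link, so that η = ψ(f (w > x)) = w > x:

import numpy as np

logit   = lambda mu:  np.log(mu / (1 - mu))     # link function psi
sigmoid = lambda eta: 1 / (1 + np.exp(-eta))    # canonical response function f

eta = np.linspace(-3, 3, 7)                     # pretend these are values of w^T x
assert np.allclose(logit(sigmoid(eta)), eta)    # psi(f(eta)) recovers eta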

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 21
MLE for GLM
Log likelihood
L(η) = log p(Y |η) = log ∏_{n=1}^{N} h(yn ) exp(yn w > x n − A(ηn ))

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 22
MLE for GLM
Log likelihood
L(η) = log p(Y |η) = log ∏_{n=1}^{N} h(yn ) exp(yn w > x n − A(ηn )) = Σ_{n=1}^{N} log h(yn ) + w > Σ_{n=1}^{N} yn x n − Σ_{n=1}^{N} A(ηn )

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 22
MLE for GLM
Log likelihood
L(η) = log p(Y |η) = log ∏_{n=1}^{N} h(yn ) exp(yn w > x n − A(ηn )) = Σ_{n=1}^{N} log h(yn ) + w > Σ_{n=1}^{N} yn x n − Σ_{n=1}^{N} A(ηn )

Convexity of A(η) guarantees a global optimum. Taking the derivative w.r.t. w
Σ_{n=1}^{N} ( yn x n − A′(ηn ) dηn /dw )

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 22
MLE for GLM
Log likelihood
L(η) = log p(Y |η) = log ∏_{n=1}^{N} h(yn ) exp(yn w > x n − A(ηn )) = Σ_{n=1}^{N} log h(yn ) + w > Σ_{n=1}^{N} yn x n − Σ_{n=1}^{N} A(ηn )

Convexity of A(η) guarantees a global optimum. Taking the derivative w.r.t. w
Σ_{n=1}^{N} ( yn x n − A′(ηn ) dηn /dw ) = Σ_{n=1}^{N} ( yn x n − µn x n )

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 22
MLE for GLM
Log likelihood
L(η) = log p(Y |η) = log ∏_{n=1}^{N} h(yn ) exp(yn w > x n − A(ηn )) = Σ_{n=1}^{N} log h(yn ) + w > Σ_{n=1}^{N} yn x n − Σ_{n=1}^{N} A(ηn )

Convexity of A(η) guarantees a global optimum. Taking the derivative w.r.t. w
Σ_{n=1}^{N} ( yn x n − A′(ηn ) dηn /dw ) = Σ_{n=1}^{N} ( yn x n − µn x n ) = Σ_{n=1}^{N} ( yn − µn ) x n

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 22
MLE for GLM
Log likelihood
L(η) = log p(Y |η) = log ∏_{n=1}^{N} h(yn ) exp(yn w > x n − A(ηn )) = Σ_{n=1}^{N} log h(yn ) + w > Σ_{n=1}^{N} yn x n − Σ_{n=1}^{N} A(ηn )

Convexity of A(η) guarantees a global optimum. Taking the derivative w.r.t. w
Σ_{n=1}^{N} ( yn x n − A′(ηn ) dηn /dw ) = Σ_{n=1}^{N} ( yn x n − µn x n ) = Σ_{n=1}^{N} ( yn − µn ) x n

where µn = f (w > x n ) and ’f ’ (= ψ −1 ) depends on type of response y , e.g.,

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 22
MLE for GLM
Log likelihood
L(η) = log p(Y |η) = log ∏_{n=1}^{N} h(yn ) exp(yn w > x n − A(ηn )) = Σ_{n=1}^{N} log h(yn ) + w > Σ_{n=1}^{N} yn x n − Σ_{n=1}^{N} A(ηn )

Convexity of A(η) guarantees a global optimum. Taking the derivative w.r.t. w
Σ_{n=1}^{N} ( yn x n − A′(ηn ) dηn /dw ) = Σ_{n=1}^{N} ( yn x n − µn x n ) = Σ_{n=1}^{N} ( yn − µn ) x n

where µn = f (w > x n ) and ’f ’ (= ψ −1 ) depends on type of response y , e.g.,


Real-valued y (linear regression): f is identity, i.e., µn = w > x n

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 22
MLE for GLM
Log likelihood
L(η) = log p(Y |η) = log ∏_{n=1}^{N} h(yn ) exp(yn w > x n − A(ηn )) = Σ_{n=1}^{N} log h(yn ) + w > Σ_{n=1}^{N} yn x n − Σ_{n=1}^{N} A(ηn )

Convexity of A(η) guarantees a global optimum. Taking the derivative w.r.t. w
Σ_{n=1}^{N} ( yn x n − A′(ηn ) dηn /dw ) = Σ_{n=1}^{N} ( yn x n − µn x n ) = Σ_{n=1}^{N} ( yn − µn ) x n

where µn = f (w > x n ) and ’f ’ (= ψ −1 ) depends on type of response y , e.g.,


Real-valued y (linear regression): f is identity, i.e., µn = w > x n
Binary y (logistic regression): f is the logistic function, i.e., µn = exp(w > x n )/(1 + exp(w > x n ))

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 22
MLE for GLM
Log likelihood
L(η) = log p(Y |η) = log ∏_{n=1}^{N} h(yn ) exp(yn w > x n − A(ηn )) = Σ_{n=1}^{N} log h(yn ) + w > Σ_{n=1}^{N} yn x n − Σ_{n=1}^{N} A(ηn )

Convexity of A(η) guarantees a global optimum. Taking the derivative w.r.t. w
Σ_{n=1}^{N} ( yn x n − A′(ηn ) dηn /dw ) = Σ_{n=1}^{N} ( yn x n − µn x n ) = Σ_{n=1}^{N} ( yn − µn ) x n

where µn = f (w > x n ) and ’f ’ (= ψ −1 ) depends on type of response y , e.g.,


Real-valued y (linear regression): f is identity, i.e., µn = w > x n
Binary y (logistic regression): f is the logistic function, i.e., µn = exp(w > x n )/(1 + exp(w > x n ))
Count-valued y (Poisson regression): µn = exp(w > x n )

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 22
MLE for GLM
Log likelihood
L(η) = log p(Y |η) = log ∏_{n=1}^{N} h(yn ) exp(yn w > x n − A(ηn )) = Σ_{n=1}^{N} log h(yn ) + w > Σ_{n=1}^{N} yn x n − Σ_{n=1}^{N} A(ηn )

Convexity of A(η) guarantees a global optimum. Taking the derivative w.r.t. w
Σ_{n=1}^{N} ( yn x n − A′(ηn ) dηn /dw ) = Σ_{n=1}^{N} ( yn x n − µn x n ) = Σ_{n=1}^{N} ( yn − µn ) x n

where µn = f (w > x n ) and ’f ’ (= ψ −1 ) depends on type of response y , e.g.,


Real-valued y (linear regression): f is identity, i.e., µn = w > x n
Binary y (logistic regression): f is the logistic function, i.e., µn = exp(w > x n )/(1 + exp(w > x n ))
Count-valued y (Poisson regression): µn = exp(w > x n )
Positive reals y (gamma regression): µn = −1/(w > x n )

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 22
MLE for GLM
Log likelihood
L(η) = log p(Y |η) = log ∏_{n=1}^{N} h(yn ) exp(yn w > x n − A(ηn )) = Σ_{n=1}^{N} log h(yn ) + w > Σ_{n=1}^{N} yn x n − Σ_{n=1}^{N} A(ηn )

Convexity of A(η) guarantees a global optimum. Taking the derivative w.r.t. w
Σ_{n=1}^{N} ( yn x n − A′(ηn ) dηn /dw ) = Σ_{n=1}^{N} ( yn x n − µn x n ) = Σ_{n=1}^{N} ( yn − µn ) x n

where µn = f (w > x n ) and ’f ’ (= ψ −1 ) depends on type of response y , e.g.,


Real-valued y (linear regression): f is identity, i.e., µn = w > x n
Binary y (logistic regression): f is the logistic function, i.e., µn = exp(w > x n )/(1 + exp(w > x n ))
Count-valued y (Poisson regression): µn = exp(w > x n )
Positive reals y (gamma regression): µn = −1/(w > x n )

To estimate w via MLE (or MAP), either set the derivative to zero or use iterative methods (e.g.,
gradient descent, iteratively reweighted least squares, etc.)
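
Below is a minimal sketch of such an iterative method (gradient ascent) for one canonical GLM, Poisson regression, using the gradient Σ ( yn − µn ) x n derived above with µn = exp(w > x n ). The synthetic data, step size and iteration count are arbitrary choices, not from the slides.

import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 3
X = rng.normal(size=(N, d))
w_true = np.array([0.5, -0.3, 0.2])
y = rng.poisson(np.exp(X @ w_true))       # count-valued responses

w, lr = np.zeros(d), 0.1
for _ in range(2000):
    mu = np.exp(X @ w)                    # Poisson response function mu_n = exp(w^T x_n)
    grad = X.T @ (y - mu)                 # sum_n (y_n − mu_n) x_n
    w += lr * grad / N                    # (averaged) gradient ascent step

print(w)                                  # should be close to w_true, up to sampling noise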
Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 22
Bayesian Inference for GLM

If likelihood is conjugate to the prior on w , Bayesian inference can be done in closed form

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 23
Bayesian Inference for GLM

If likelihood is conjugate to the prior on w , Bayesian inference can be done in closed form
Example: Bayesian linear regression with Gaussian likelihood and Gaussian prior on w

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 23
Bayesian Inference for GLM

If likelihood is conjugate to the prior on w , Bayesian inference can be done in closed form
Example: Bayesian linear regression with Gaussian likelihood and Gaussian prior on w

Otherwise, approximate Bayesian inference is needed (e.g., Laplace, MCMC, variational inference, etc.)

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 23
Bayesian Inference for GLM

If likelihood is conjugate to the prior on w , Bayesian inference can be done in closed form
Example: Bayesian linear regression with Gaussian likelihood and Gaussian prior on w

Otherwise, approximate Bayesian inference is needed (e.g., Laplace, MCMC, variational inference, etc.)
Example: Bayesian logistic regression with sigmoid-Bernoulli likelihood and Gaussian prior on w

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 23
Bayesian Inference for GLM

If likelihood is conjugate to the prior on w , Bayesian inference can be done in closed form
Example: Bayesian linear regression with Gaussian likelihood and Gaussian prior on w

Otherwise, approximate Bayesian inference is needed (e.g., Laplace, MCMC, variational inference, etc.)
Example: Bayesian logistic regression with sigmoid-Bernoulli likelihood and Gaussian prior on w

Interesting class project idea: Design simple inference algorithms for non-conjugate GLMs
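
As one concrete sketch of the non-conjugate case (illustrative only; the data, prior precision alpha and number of Newton steps are arbitrary assumptions): the Laplace approximation for Bayesian logistic regression with a Gaussian prior N(0, alpha⁻¹ I) on w.

import numpy as np

rng = np.random.default_rng(1)
N, d, alpha = 100, 2, 1.0
X = rng.normal(size=(N, d))
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([1.0, -2.0])))))   # synthetic labels

sigmoid = lambda z: 1 / (1 + np.exp(-z))

w = np.zeros(d)                                            # find the MAP estimate by Newton's method
for _ in range(25):
    mu = sigmoid(X @ w)
    grad = X.T @ (y - mu) - alpha * w                      # gradient of the log posterior
    H = -(X.T * (mu * (1 - mu))) @ X - alpha * np.eye(d)   # Hessian of the log posterior
    w = w - np.linalg.solve(H, grad)                       # Newton update

Sigma = np.linalg.inv(-H)                                  # Laplace: posterior ≈ N(w_MAP, (−H)⁻¹)
print(w, Sigma)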

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 23
Summary

Exp. family distributions are very useful for modeling diverse types of data/parameters

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 24
Summary

Exp. family distributions are very useful for modeling diverse types of data/parameters
Conjugate priors to exp. family distributions make parameter updates very simple

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 24
Summary

Exp. family distributions are very useful for modeling diverse types of data/parameters
Conjugate priors to exp. family distributions make parameter updates very simple
Other quantities such as posterior predictive can be computed in closed form

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 24
Summary

Exp. family distributions are very useful for modeling diverse types of data/parameters
Conjugate priors to exp. family distributions make parameter updates very simple
Other quantities such as posterior predictive can be computed in closed form

Useful in designing generative classification models.

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 24
Summary

Exp. family distributions are very useful for modeling diverse types of data/parameters
Conjugate priors to exp. family distributions make parameter updates very simple
Other quantities such as posterior predictive can be computed in closed form

Useful in designing generative classification models. Choosing class-conditionals from the exponential
family with conjugate priors helps in parameter estimation

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 24
Summary

Exp. family distributions are very useful for modeling diverse types of data/parameters
Conjugate priors to exp. family distributions make parameter updates very simple
Other quantities such as posterior predictive can be computed in closed form

Useful in designing generative classification models. Choosing class-conditionals from the exponential
family with conjugate priors helps in parameter estimation
Useful in designing generative models for unsupervised learning

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 24
Summary

Exp. family distributions are very useful for modeling diverse types of data/parameters
Conjugate priors to exp. family distributions make parameter updates very simple
Other quantities such as posterior predictive can be computed in closed form

Useful in designing generative classification models. Choosing class-conditionals from the exponential
family with conjugate priors helps in parameter estimation
Useful in designing generative models for unsupervised learning
Also used in designing GLMs

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 24
Summary

Exp. family distributions are very useful for modeling diverse types of data/parameters
Conjugate priors to exp. family distributions make parameter updates very simple
Other quantities such as posterior predictive can be computed in closed form

Useful in designing generative classification models. Choosing class-conditionals from the exponential
family with conjugate priors helps in parameter estimation
Useful in designing generative models for unsupervised learning
Also used in designing GLMs

We will see several use cases when we discuss approximate inference algorithms (e.g., Gibbs
sampling, and especially variational inference)

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Exponential Family and Generalized Linear Models 24
