
CSE291D Lecture 4

Exponential Families
Generalized Linear Models

1
Why exponential families?
• So far, we’ve been studying fairly concrete models,
capturing phenomena such as coin flips, dice rolls,
drawing balls from urns…

• Today we’ll study a more general class of models at a
higher level of abstraction: the exponential family

2
Why exponential families?
• Many standard distributions are in the exponential family
(Gaussian, Bernoulli, Dirichlet, Poisson, exponential, …)

• Derive algorithms for the exponential family, and you
get all those distributions for free!

• A finite set of sufficient statistics captures all relevant
properties of a data set

• Conjugate priors

3
Exponential family distributions
in a nutshell
• Suppose we want to come up with a new probability distribution.

What’s the simplest way that parameters and data could be mapped to
probabilities?
– How about a linear mapping via a dot product?

• Probabilities and probability densities must be non-negative.

What’s the simplest way we can ensure non-negativity?

4
Exponential family distributions
in a nutshell

• Probabilities are proportional to e to the power
of a dot product of parameters and data

• The actual definition is slightly more complicated, but
just generalizes this basic idea.
7
Maximum entropy motivation
• Suppose we’d like to model some phenomenon with a
probability distribution, and all we know is how certain
features/properties behave on average.
– Which distribution should we use?

• The principle of maximum entropy states that in the
absence of further knowledge, we should select the
distribution with the most entropy

• Entropy measures the “flatness” of a distribution, and
can be interpreted as the amount of uncertainty

8
Entropy

• For a discrete distribution: H(p) = − Σ_x p(x) log p(x)

9
Maximum entropy motivation

• It can be shown (see the readings) that the
maximum entropy distribution has the form of the
exponential family.

• So, in the absence of any further information, we
should select an exponential family distribution.

10
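
The constrained problem behind this claim can be written out (a standard formulation, not recovered from the slides; φ_k are the features and α_k their known averages):

```latex
\max_{p}\; H(p)
\quad \text{s.t.} \quad
\mathbb{E}_{p}[\phi_k(x)] = \alpha_k,\;\; k = 1,\dots,K,
\qquad \sum_x p(x) = 1
\;\;\Longrightarrow\;\;
p(x) \propto \exp\!\Big(\textstyle\sum_k \theta_k\, \phi_k(x)\Big)
```

The Lagrange multipliers θ_k of the moment constraints become the natural parameters of the resulting exponential family distribution.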
Statistical mechanics
• The second law of thermodynamics:
“the entropy of the universe tends
towards a maximum”

Rudolf Clausius

• We would therefore expect exponential family
distributions to be prevalent in nature
11
Statistical mechanics
Ludwig Boltzmann

• Suppose we are modeling particles in a gas. The
Boltzmann distribution (Gibbs distribution) has the form:

p(state) ∝ exp( −E(state) / (k T) )

where E is the energy of the state, T is the temperature,
and k is Boltzmann’s constant.

12
Learning outcomes
By the end of the lesson, you should be able to:

• Write distributions in exponential family form

• Perform maximum likelihood estimation and
Bayesian inference for exponential families

• Specify generalized linear models by selecting an
appropriate exponential family distribution and
link function

13
Exponential family distributions
• A pdf or pmf is said to be in the exponential
family if it can be written in the form:

p(x | θ) ∝ h(x) exp( θᵀ φ(x) )

where h(x) is a scaling factor which doesn’t depend on the
parameters (often = 1), and θᵀ φ(x) is a dot product of
parameters θ and sufficient statistics φ(x).
17
With normalizing constant
• Two ways to write it:

p(x | θ) = (1 / Z(θ)) h(x) exp( θᵀ φ(x) ),   Z(θ) = ∫ h(x) exp( θᵀ φ(x) ) dx

Z(θ) is the “partition function”

• Or:

p(x | θ) = h(x) exp( θᵀ φ(x) − A(θ) ),   A(θ) = log Z(θ)

A(θ) is the “log partition function”
18
Sufficient statistics
• φ(x) is called sufficient because the likelihood
only depends on the data through φ(x)

• To learn the parameters, all the information we
need is captured in the sufficient statistics.

• Exponential family distributions are the only
distributions with a fixed-dimensional set of
sufficient statistics (under weak assumptions)
20
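
A small illustration of this idea (the data and variable names are made up): for a univariate Gaussian, the sufficient statistics are (n, Σx, Σx²), so a data set of any size is summarized by three numbers.

```python
# Fixed-dimensional sufficient statistics for a univariate Gaussian:
# (n, sum of x, sum of x^2) capture everything the likelihood needs.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(data)
s1 = sum(data)                  # sum of x
s2 = sum(x * x for x in data)   # sum of x^2

# Mean and (MLE) variance recovered from the statistics alone,
# without ever looking at the raw data again
mean = s1 / n
var = s2 / n - mean ** 2

print(n, s1, s2, mean, var)  # 8 40.0 232.0 5.0 4.0
```

No matter how many points are added, the summary stays three numbers; this is the fixed-dimensional property the slide refers to.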
A slight generalization
• Allow a different parameterization, as long as it can
still be written the same way after transformation

21
Summary / notation

22
Writing distributions in
exponential family form: Bernoulli

23
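
The equations on this slide did not survive extraction; the standard rewriting of the Bernoulli pmf into exponential family form is:

```latex
\mathrm{Bern}(x \mid \mu) = \mu^{x}(1-\mu)^{1-x}
= \exp\!\big( x\log\mu + (1-x)\log(1-\mu) \big)
= (1-\mu)\,\exp\!\Big( x \log\tfrac{\mu}{1-\mu} \Big)
```

So the sufficient statistic is φ(x) = x, the natural parameter is the log-odds θ = log(μ / (1 − μ)), h(x) = 1, and the log partition function is A(θ) = log(1 + e^θ). The middle line uses the over-complete statistics (x, 1 − x) discussed on the next slide; the last line is the minimal form.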
Minimal exponential families
• The sufficient statistics are in some sense redundant,
as we can compute x and 1 – x as linear functions of
each other.

• The representation is said to be over-complete.

• If the sufficient statistics are linearly independent,
the representation is said to be minimal.

27
Definitions
• Natural exponential family: the sufficient statistic is the
data itself, φ(x) = x

• Canonical form: the distribution is written directly in
terms of the natural parameters θ

• Curved exponential family: a representation where the
transformed parameters live in some “curved”
(nonlinear, lower-dimensional) space
33
Gaussian distribution

34
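
The derivation on this slide did not survive extraction; the standard rewriting of the Gaussian pdf into exponential family form is:

```latex
\mathcal{N}(x \mid \mu, \sigma^2)
= \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\Big(-\frac{(x-\mu)^2}{2\sigma^2}\Big)
= \underbrace{\frac{1}{\sqrt{2\pi}}}_{h(x)}
\exp\!\Big( \underbrace{\tfrac{\mu}{\sigma^2}}_{\theta_1}\, x
+ \underbrace{\big(\!-\tfrac{1}{2\sigma^2}\big)}_{\theta_2}\, x^2
- \underbrace{\big(\tfrac{\mu^2}{2\sigma^2} + \log\sigma\big)}_{A(\theta)} \Big)
```

So φ(x) = (x, x²): a two-dimensional sufficient statistic, matching the two parameters (μ, σ²).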
Poisson distribution

39
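
The equations here were lost in extraction; the standard rewriting of the Poisson pmf is:

```latex
\mathrm{Pois}(x \mid \lambda) = \frac{\lambda^{x} e^{-\lambda}}{x!}
= \frac{1}{x!}\,\exp\!\big( x \log\lambda - \lambda \big)
```

So θ = log λ, φ(x) = x, h(x) = 1/x!, and A(θ) = λ = e^θ.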
Gamma distribution

41
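
The equations here were lost in extraction; the standard rewriting of the Gamma pdf (shape a, rate b) is:

```latex
\mathrm{Gamma}(x \mid a, b) = \frac{b^{a}}{\Gamma(a)}\, x^{a-1} e^{-bx}
= \exp\!\big( (a-1)\log x - b\,x + a\log b - \log\Gamma(a) \big)
```

So φ(x) = (log x, x), θ = (a − 1, −b), h(x) = 1, and A(θ) = log Γ(θ₁ + 1) − (θ₁ + 1) log(−θ₂).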
Bayesian inference

42
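
The equations on these slides were lost in extraction; the standard conjugate-prior construction for an exponential family likelihood is:

```latex
p(\theta \mid \nu_0, \tau_0) \propto \exp\!\big( \theta^{\mathsf T}\tau_0 - \nu_0\, A(\theta) \big),
\qquad
p(\mathcal{D} \mid \theta) \propto \exp\!\Big( \theta^{\mathsf T}\textstyle\sum_i \phi(x_i) - N A(\theta) \Big)
```

Multiplying prior and likelihood gives a posterior of the same form with updated hyperparameters τ_N = τ₀ + Σᵢ φ(xᵢ) and ν_N = ν₀ + N: the prior acts like ν₀ pseudo-observations with total sufficient statistics τ₀.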
Derivatives of the log partition function
give expected sufficient statistics

46
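
The derivation on these slides was lost in extraction; the standard identity (differentiating under the integral sign) is:

```latex
A(\theta) = \log \int h(x)\,\exp\!\big(\theta^{\mathsf T}\phi(x)\big)\, dx
\;\;\Longrightarrow\;\;
\nabla A(\theta) = \mathbb{E}_{\theta}[\phi(x)],
\qquad
\nabla^2 A(\theta) = \mathrm{Cov}_{\theta}[\phi(x)]
```

Since covariance matrices are positive semi-definite, this also shows A(θ) is convex.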
MLE for exponential families
• The likelihood function is:

p(D | θ) = [ Π_i h(x_i) ] exp( θᵀ Σ_i φ(x_i) − N A(θ) )

• Take the log:

log p(D | θ) = Σ_i log h(x_i) + θᵀ Σ_i φ(x_i) − N A(θ)

51
MLE for exponential families
• Take the derivative and set to zero:

∇ log p(D | θ) = Σ_i φ(x_i) − N ∇A(θ) = 0

⇒ E_θ[φ(x)] = (1/N) Σ_i φ(x_i)

(using ∇A(θ) = E_θ[φ(x)])

55
MLE for exponential families
• At the MLE, the expected sufficient statistics under
the model match the average observed statistics

• If you were to simulate “fantasy data” under the
model, on average the statistics would match
those of the data
59
Maximum likelihood gradient updates

• The gradient is the difference between:


– the average sufficient statistics under the data
– the expected sufficient statistics under “fantasy data”

60
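
A sketch of this update for a Bernoulli model in natural parameterization (toy data; the learning rate and names are my own). The gradient step adds the observed average statistic and subtracts the model's expected statistic, which for the Bernoulli is σ(θ):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

data = [1, 1, 1, 0]                  # observed coin flips; sample mean 0.75
avg_stat = sum(data) / len(data)     # average sufficient statistic of the data

theta = 0.0                          # natural parameter (log-odds)
lr = 0.5
for _ in range(2000):
    expected_stat = sigmoid(theta)   # E[phi(x)] under current model ("fantasy data")
    theta += lr * (avg_stat - expected_stat)  # gradient ascent step

# At convergence the expected statistic matches the observed average,
# so sigmoid(theta) ~= 0.75 and theta ~= log(0.75 / 0.25) ~= 1.0986
print(theta, sigmoid(theta))
```

The fixed point is exactly the moment-matching condition from the previous slide: the gradient vanishes when model expectations equal data averages.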
Mean parameterization
• Many distributions are traditionally parameterized by
their mean

• For minimal exponential families, there is a
1-1 mapping between natural parameters and the
mean of the distribution. In this case, we can use
the mean parameterization.

• The MLE is just the sample mean!

61
Generalized linear models
• Linear regression: y ~ N(wᵀx, σ²)

• Logistic regression: p(y = 1 | x) = σ(wᵀx) = 1 / (1 + exp(−wᵀx))

[figure: logistic sigmoid curve, rising from 0 to 1]

• What about other types of data?

62
Generalized linear models

η = wᵀx (linear predictor)

μ = g⁻¹(η) (mean parameter)

y ~ p(y | μ) (mean-parameterized exponential family)

63
Link functions

• g⁻¹, which maps the linear predictor η to the mean μ,
is called the mean function, or inverse link function

• g, which maps μ back to η, is called the link function

64
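
Canonical examples (standard pairings, not extracted from the slide):

```latex
\text{Bernoulli:}\quad \mu = g^{-1}(\eta) = \frac{1}{1+e^{-\eta}},
\qquad g(\mu) = \log\frac{\mu}{1-\mu} \;\; \text{(logit)}
```
```latex
\text{Poisson:}\quad \mu = e^{\eta},
\qquad g(\mu) = \log \mu
\qquad\quad
\text{Gaussian:}\quad \mu = \eta,
\qquad g(\mu) = \mu \;\; \text{(identity)}
```

These are the canonical links: each one makes the linear predictor equal to the natural parameter of the response distribution.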
Example: Poisson regression

65
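
A minimal sketch of Poisson regression with a log link, fit by Newton–Raphson as described on the next slide. The toy data and variable names are hypothetical, chosen to lie exactly on a log-linear curve so the MLE is b0 = 0, b1 = log 2:

```python
import math

# Hypothetical counts lying exactly on y = exp(0 + log(2) * x)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 2.0, 4.0, 8.0]

b0, b1 = 0.0, 0.0
for _ in range(25):
    # Mean under the model: mu_i = exp(b0 + b1 * x_i)   (inverse link = exp)
    mus = [math.exp(b0 + b1 * x) for x in xs]
    # Gradient of the Poisson log-likelihood
    g0 = sum(y - m for y, m in zip(ys, mus))
    g1 = sum((y - m) * x for x, y, m in zip(xs, ys, mus))
    # Observed information (negative Hessian) entries
    h00 = sum(mus)
    h01 = sum(m * x for x, m in zip(xs, mus))
    h11 = sum(m * x * x for x, m in zip(xs, mus))
    # Newton-Raphson step: solve the 2x2 system H * delta = g
    det = h00 * h11 - h01 * h01
    b0 += (h11 * g0 - h01 * g1) / det
    b1 += (h00 * g1 - h01 * g0) / det

print(b0, b1)  # converges to b0 ~ 0.0, b1 ~ log(2)
```

Note the first-order condition has the familiar form Σᵢ (yᵢ − μᵢ) xᵢ = 0: at the MLE, the model's expected counts match the observed counts against each feature.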
Estimation of GLMs
• MLE or MAP estimation:
– Second-order optimization methods are common:
Newton-Raphson /
iteratively reweighted least squares (IRLS)

• Bayesian inference:
– MCMC, or Laplace approximation

66
GLM example: Think-pair-share
• You are a data scientist working on a competitor to
Google Maps. Your boss asks you to design a GLM to
predict the number of cars that pass each of various
stretches of the 5 freeway on any given hour on any
given day in the next 12 months.

• What input features will you use, and what exponential
family response distribution / link function / inverse
link function will you use for the GLM?
Assume you have all of the same data that Google can
collect.

67