
# CSE291D Lecture 4

Exponential Families and Generalized Linear Models
## Why exponential families?

• So far, we've been studying fairly concrete models, capturing phenomena such as coin flips, dice rolls, drawing balls from urns…

• Today we'll study a more general class of models at a higher level of abstraction: the exponential family
## Why exponential families?

• Many standard distributions are in the exponential family (Gaussian, Bernoulli, Dirichlet, Poisson, exponential, …)

• A finite set of sufficient statistics captures all relevant properties of a data set

• Conjugate priors exist
## Exponential family distributions in a nutshell

• Suppose we want to come up with a new probability distribution. What's the simplest way that parameters and data could be mapped to probabilities?
– How about a linear mapping via a dot product?

• Probabilities and probability densities must be non-negative. What is the simplest way that we can ensure non-negativity?
## Exponential family distributions in a nutshell

• Probabilities are proportional to e raised to the power of a dot product of parameters and data

• The actual definition is slightly more complicated, but it just generalizes this basic idea.
## Maximum entropy motivation

• Suppose we'd like to model some phenomenon with a probability distribution, and all we know is how certain features/properties behave on average.
– Which distribution should we use?

• The principle of maximum entropy states that, in the absence of further knowledge, we should select the distribution with the most entropy

• Entropy measures the "flatness" of a distribution, and can be interpreted as the amount of uncertainty
## Entropy

• For a discrete distribution p, the entropy is

H(p) = − Σ_x p(x) log p(x)

• It is maximized by the uniform distribution, and is zero when all the probability mass sits on a single outcome.
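The entropy formula above can be checked numerically. The sketch below (the function and variable names are my own, not from the slides) confirms that the uniform distribution is "flattest" in the entropy sense:

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum_i p_i * log(p_i), in nats."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # treat 0 * log 0 as 0
    return float(-np.sum(p * np.log(p)))

uniform = [0.25, 0.25, 0.25, 0.25]   # flattest distribution over 4 outcomes
peaked  = [0.97, 0.01, 0.01, 0.01]   # nearly all mass on one outcome

print(entropy(uniform))   # log(4) ~ 1.386, the maximum for 4 outcomes
print(entropy(peaked))    # much smaller: little uncertainty
```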
## Maximum entropy motivation

• It can be shown (see the readings) that the maximum entropy distribution has the form of the exponential family.

• So, in the absence of any further information, we should select an exponential family distribution.
## Statistical mechanics

• The second law of thermodynamics: "the entropy of the universe tends towards a maximum" (Rudolf Clausius)

• We would therefore expect exponential family distributions to be prevalent in nature
## Statistical mechanics

(Ludwig Boltzmann)

• Suppose we are modeling particles in a gas. The Boltzmann distribution (Gibbs distribution) has the form:

p(state) ∝ exp( −E(state) / (kT) )

where E is the energy of the state, T is the temperature, and k is Boltzmann's constant.
## Learning outcomes

By the end of the lesson, you should be able to:

• Perform maximum likelihood estimation and Bayesian inference for exponential families

• Specify generalized linear models by selecting an appropriate exponential family distribution and link function
## Exponential family distributions

• A pdf or pmf is said to be in the exponential family if it can be written in the form:

p(x | θ) ∝ h(x) exp( θᵀ φ(x) )

where h(x) is a scaling factor which doesn't depend on the parameters (often h(x) = 1), and θᵀ φ(x) is a dot product of the parameters θ and the sufficient statistics φ(x).
## With normalizing constant

• Two ways to write it:

p(x | θ) = (1 / Z(θ)) h(x) exp( θᵀ φ(x) ),  where Z(θ) = Σ_x h(x) exp( θᵀ φ(x) ) is the "partition function"

• Or:

p(x | θ) = h(x) exp( θᵀ φ(x) − A(θ) ),  where A(θ) = log Z(θ) is the "log partition function"
## Sufficient statistics

• φ(x) is called sufficient because the likelihood only depends on the data through φ(x)

• To learn the parameters, all the information we need is captured in the sufficient statistics.

• Exponential family distributions are the only distributions with a fixed-dimensional set of sufficient statistics (under weak assumptions)
## A slight generalization

• Allow a different parameterization, as long as it can still be written the same way after transformation:

p(x | θ) = h(x) exp( η(θ)ᵀ φ(x) − A(η(θ)) )
## Summary / notation

• θ (or η(θ)): natural parameters
• φ(x): sufficient statistics
• h(x): scaling factor / base measure
• Z(θ): partition function
• A(θ) = log Z(θ): log partition function
## Writing distributions in exponential family form: Bernoulli

Ber(x | μ) = μ^x (1 − μ)^(1 − x) = exp( x log μ + (1 − x) log(1 − μ) )

This is in exponential family form with sufficient statistics φ(x) = (x, 1 − x), natural parameters (log μ, log(1 − μ)), and h(x) = 1.
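As a sanity check, the minimal (single-statistic) exponential family form of the Bernoulli, with φ(x) = x, natural parameter θ = log-odds, and A(θ) = log(1 + e^θ), can be verified to match the standard pmf numerically. A quick sketch with my own function names:

```python
import numpy as np

def bernoulli_pmf(x, mu):
    """Standard Bernoulli pmf."""
    return mu**x * (1 - mu)**(1 - x)

def expfam_pmf(x, theta):
    """Minimal exponential-family form: h(x) = 1, phi(x) = x,
    log partition function A(theta) = log(1 + exp(theta))."""
    A = np.log1p(np.exp(theta))
    return np.exp(theta * x - A)

mu = 0.3
theta = np.log(mu / (1 - mu))        # natural parameter: the log-odds
for x in (0, 1):
    assert np.isclose(bernoulli_pmf(x, mu), expfam_pmf(x, theta))
print("exponential-family form matches the standard pmf")
```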
## Minimal exponential families

• The sufficient statistics (x, 1 − x) are in some sense redundant, as we can compute x and 1 − x as linear functions of each other.

• If the sufficient statistics are linearly independent, the representation is said to be minimal.
## Minimal exponential families

• For the Bernoulli, collecting terms gives a minimal representation with a single sufficient statistic:

Ber(x | μ) = exp( x log(μ / (1 − μ)) + log(1 − μ) )

so φ(x) = x and the natural parameter is the log-odds, θ = log(μ / (1 − μ)).

• Inverting this mapping recovers the mean: μ = 1 / (1 + e^(−θ)), the logistic (sigmoid) function.
## Definitions

• Natural exponential family: the sufficient statistic is the identity, φ(x) = x

• Canonical form: the distribution is written directly in terms of the natural parameters θ (as opposed to a representation where the transformed parameters η(θ) live in some "curved" space)
## Gaussian distribution

N(x | μ, σ²) = (1 / √(2πσ²)) exp( −(x − μ)² / (2σ²) )
            = (1 / √(2π)) exp( (μ/σ²) x − (1/(2σ²)) x² − μ²/(2σ²) − log σ )

This is in exponential family form with sufficient statistics φ(x) = (x, x²), natural parameters θ = (μ/σ², −1/(2σ²)), base measure h(x) = 1/√(2π), and log partition function A = μ²/(2σ²) + log σ.
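The Gaussian decomposition above is easy to get sign-wrong, so it is worth verifying that the exponential-family pieces multiply back to the usual density. A sketch under the parameterization above (function names are my own):

```python
import numpy as np

def gauss_pdf(x, mu, sigma2):
    """Standard Gaussian density N(x | mu, sigma2)."""
    return np.exp(-(x - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def expfam_gauss(x, mu, sigma2):
    """Same density assembled from its exponential-family pieces."""
    theta = np.array([mu / sigma2, -1 / (2 * sigma2)])    # natural parameters
    phi   = np.array([x, x**2])                           # sufficient statistics
    A     = mu**2 / (2 * sigma2) + 0.5 * np.log(sigma2)   # log partition function
    h     = 1 / np.sqrt(2 * np.pi)                        # base measure
    return h * np.exp(theta @ phi - A)

for x in (-1.0, 0.0, 2.5):
    assert np.isclose(gauss_pdf(x, 1.0, 4.0), expfam_gauss(x, 1.0, 4.0))
print("exponential-family pieces reassemble the Gaussian density")
```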
## Poisson distribution

Poi(x | λ) = e^(−λ) λ^x / x! = (1/x!) exp( x log λ − λ )

so φ(x) = x, the natural parameter is θ = log λ, h(x) = 1/x!, and A(θ) = λ = e^θ.
## Gamma distribution

Ga(x | a, b) = (b^a / Γ(a)) x^(a−1) e^(−bx) = exp( (a − 1) log x − bx + a log b − log Γ(a) )

so φ(x) = (log x, x) with natural parameters (a − 1, −b).
## Bayesian inference

• For a likelihood in exponential family form, p(x | θ) = h(x) exp( θᵀ φ(x) − A(θ) ), the conjugate prior has the same structure:

p(θ | τ₀, ν₀) ∝ exp( θᵀ τ₀ − ν₀ A(θ) )

• After observing data x₁, …, x_N, the posterior stays in the same family, with updates

τ₀ → τ₀ + Σᵢ φ(xᵢ),  ν₀ → ν₀ + N

• The hyperparameters act as pseudo-counts: τ₀ plays the role of prior sufficient statistics, and ν₀ of a prior sample size.
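The conjugate update is easiest to see in the Beta-Bernoulli case, where "add the sufficient statistics" means adding the count of 1s to one pseudo-count and the count of 0s to the other. A minimal sketch (the prior values and data here are invented for illustration):

```python
import numpy as np

# Beta(a, b) is the conjugate prior for the Bernoulli parameter; the
# posterior update simply adds observed sufficient statistics as counts.
a, b = 2.0, 2.0                            # assumed prior pseudo-counts
data = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

heads = data.sum()                         # sum of sufficient statistics phi(x) = x
N = len(data)

a_post, b_post = a + heads, b + (N - heads)
posterior_mean = a_post / (a_post + b_post)
print(f"posterior: Beta({a_post:.0f}, {b_post:.0f}), mean = {posterior_mean:.3f}")
# prints: posterior: Beta(9, 5), mean = 0.643
```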
## Log partition function gives expected sufficient statistics

• Differentiating the log partition function yields the moments of the sufficient statistics:

∇A(θ) = E[φ(x)],  ∇²A(θ) = Cov[φ(x)]

• Since the covariance matrix is positive semi-definite, A(θ) is convex in θ.
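The identity ∇A(θ) = E[φ(x)] can be checked by finite differences. For the Bernoulli in minimal form, A(θ) = log(1 + e^θ), and its derivative should equal E[x] = p(x = 1) = sigmoid(θ). A quick sketch:

```python
import numpy as np

def A(theta):
    """Log partition function of the Bernoulli in minimal form."""
    return np.log1p(np.exp(theta))

theta = 0.7
eps = 1e-6
grad_A = (A(theta + eps) - A(theta - eps)) / (2 * eps)   # numerical dA/dtheta

# E[phi(x)] = E[x] = p(x = 1) = sigmoid(theta)
expected_phi = 1 / (1 + np.exp(-theta))

assert np.isclose(grad_A, expected_phi, atol=1e-5)
print("dA/dtheta matches the expected sufficient statistic")
```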
## MLE for exponential families

• The likelihood function for i.i.d. data D = {x₁, …, x_N} is:

p(D | θ) = ( Πᵢ h(xᵢ) ) exp( θᵀ Σᵢ φ(xᵢ) − N A(θ) )

• Take the log:

log p(D | θ) = Σᵢ log h(xᵢ) + θᵀ Σᵢ φ(xᵢ) − N A(θ)
## MLE for exponential families

• Take the derivative and set it to zero:

∇θ log p(D | θ) = Σᵢ φ(xᵢ) − N ∇A(θ) = 0

• Since ∇A(θ) = E[φ(x)], the MLE satisfies the moment-matching condition

E[φ(x)] = (1/N) Σᵢ φ(xᵢ)
## MLE for exponential families

• At the MLE, the expected sufficient statistics under the model match the average observed statistics

• If you were to simulate "fantasy data" under the model, on average the statistics would match those of the data

## MLE for exponential families

• The gradient is the difference between:
– the average sufficient statistics under the data
– the expected sufficient statistics under "fantasy data"
## Mean parameterization

• Many distributions are traditionally parameterized by their mean

• For minimal exponential families, there is a 1-1 mapping between natural parameters and the mean of the distribution. In this case, we can use the mean parameterization, μ = E[φ(x)] = ∇A(θ)

• The MLE is just the sample mean!
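Both claims, that the MLE is the sample mean and that "fantasy data" simulated at the MLE reproduces the observed statistics, can be illustrated with the Poisson, where φ(x) = x. A sketch with invented data (rate 3.5 and the sample sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.poisson(lam=3.5, size=100_000)

# For the Poisson, phi(x) = x, so the MLE in the mean parameterization
# is just the average sufficient statistic: lambda_hat = mean(x).
lam_hat = data.mean()

# "Fantasy data" simulated at the MLE matches the observed statistic on average
fantasy = rng.poisson(lam=lam_hat, size=100_000)
print(lam_hat, fantasy.mean())   # both close to 3.5
```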
## Generalized linear models

• Linear regression: y | x ~ N(wᵀx, σ²)

• Logistic regression: p(y = 1 | x) = σ(wᵀx) = 1 / (1 + e^(−wᵀx))

[figure: sigmoid curve rising from 0 to 1 as wᵀx goes from − through 0 to +]

• What about other types of data?
## Generalized linear models

• A GLM models the response with an exponential family distribution whose mean depends on a linear function of the inputs:

Linear predictor: η = wᵀx
Mean parameter: μ = g⁻¹(η)

• g is called the link function; it connects the mean parameter to the linear predictor via g(μ) = η.
## Example: Poisson regression

• For count data, model y | x ~ Poisson(λ(x)) with the log link:

λ(x) = exp(wᵀx)

so the rate is guaranteed to be positive for any w.
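A Poisson regression can be fit by ascending the exponential-family log-likelihood gradient, which (as on the earlier slides) is the difference between observed and expected sufficient statistics, here Xᵀ(y − exp(Xw)). This sketch uses plain gradient steps on synthetic data as a stand-in for the Newton/IRLS methods the slides mention; all names and constants are my own:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic count data whose log-rate is linear in x (canonical log link)
N = 2000
X = np.column_stack([np.ones(N), rng.normal(size=N)])   # intercept + one feature
w_true = np.array([0.5, 0.8])
y = rng.poisson(np.exp(X @ w_true))

# Gradient of the Poisson log-likelihood: X^T (y - exp(Xw)).
# A few hundred plain gradient steps stand in for Newton-Raphson / IRLS here.
w = np.zeros(2)
for _ in range(500):
    grad = X.T @ (y - np.exp(X @ w)) / N
    w += 0.1 * grad

print(w)   # close to w_true = [0.5, 0.8]
```

Because the log-likelihood is concave in w (the log partition function is convex), these gradient steps converge to the unique MLE; second-order methods just get there in far fewer iterations.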
## Estimation of GLMs

• MLE or MAP estimation:
– Second-order optimization methods are common: Newton-Raphson / iteratively reweighted least squares (IRLS)

• Bayesian inference:
– MCMC, or the Laplace approximation
## GLM example: Think-pair-share

• You are a data scientist working on a competitor to