
CSE291D Lecture 20

Nonparametric Bayesian models

Announcements

• HW 4 is back and available at the front of the room

• Solutions for HW 3 and 4 are also available at the front

• Submit HW5 to my office (under the door if I'm not there), or by email to me and the TA, by midnight 06/09.

• Submit the project report to me by the same time, preferably by email.

• Good luck on the exam!
  Tuesday 06/07, 7pm-10pm in this room (PETER 103).
How many clusters should we use
in our mixture model?

Choosing the dimensionality of your
latent space
• How many
  – clusters should we use in our mixture model?
  – dimensions in our factor analysis model?
  – topics in our topic model?

• With more latent variables, we can fit the training data better (e.g. a cluster for every data point!)

• However, we may "overfit" and generalize less well to new data. We also lose parsimony and interpretability.
Traditional model selection
• Marginal likelihood: p(D | M) = ∫ p(D | θ, M) p(θ | M) dθ

• Bayes factor: ratio of marginal likelihoods between two models, p(D | M1) / p(D | M2)

• Can also put a prior on models and pick the one with the highest posterior probability (or average over all possible models, a.k.a. model averaging)

• Pro: automatically penalizes complicated models, since it integrates over all parameter values, including bad ones

• Con: intractable to do this exactly
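To make the marginal-likelihood idea concrete, here is a minimal sketch in a toy Beta-Bernoulli setting where the integral is analytic. The two models and the data are made up for illustration, not from the lecture:

```python
# Model M1: fair coin, theta fixed at 0.5 (no free parameters).
# Model M2: theta ~ Beta(1, 1), integrated out analytically.
import numpy as np
from scipy.special import betaln

x = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 1])  # hypothetical coin flips
n, k = len(x), int(x.sum())

# log p(D | M1): nothing to integrate.
log_ml_m1 = n * np.log(0.5)

# log p(D | M2) = log [ B(k + 1, n - k + 1) / B(1, 1) ],
# integrating theta over its Beta(1, 1) prior.
log_ml_m2 = betaln(k + 1, n - k + 1) - betaln(1, 1)

# Bayes factor of M2 vs M1; log BF > 0 favors the flexible model.
print("log Bayes factor:", log_ml_m2 - log_ml_m1)
```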
Approximate heuristic methods:
Bayesian information criterion (BIC)

• Score(model) = model complexity − model fit

• Pick the model with the best (lowest) score

• BIC = k log n − 2 log p(D | θ̂)
  where n = # data points and k = # (free) parameters
  (approximates −2 × the log marginal likelihood for large n, exponential families)
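As a concrete illustration, a sketch of BIC-based selection of the number of mixture components, assuming scikit-learn's GaussianMixture and placeholder random data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.randn(200, 2)  # placeholder data, shape (n, d)

scores = {}
for K in range(1, 8):
    gm = GaussianMixture(n_components=K, random_state=0).fit(X)
    scores[K] = gm.bic(X)  # k*log(n) - 2*log-likelihood; lower is better

best_K = min(scores, key=scores.get)
print("BIC picks K =", best_K)
```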
Nonparametric models
• So-called "nonparametric" models typically do have parameters, however:

• Num parameters (model complexity) is not fixed, but grows with the amount of data

• E.g.
  – K-nearest neighbors classifier
  – Kernel density estimation
  – Decision trees

[Figure: model complexity growing with the number of data points N]
Bayesian nonparametric models
• Bayesian models whose complexity increases with the amount of data

• Typically, a prior over an infinite # of latent variables
  – The number that are actually used is finite, and depends on the data

[Figure: model complexity growing with the number of data points N]
Learning outcomes
By the end of the lesson, you should be able to:

• Simulate the Chinese restaurant process

• Perform nonparametric data modeling with CRP mixture models and Indian Buffet Process models
Chinese restaurant process

• A distribution over partitions (groupings) of objects, e.g. data points

• The number of groups is not specified in advance

• Useful as a prior for cluster assignments in a nonparametric Bayesian mixture model
Chinese restaurant process
• Overall metaphor:
  – Imagine a restaurant with an infinite number of tables, each serving a different dish
    • Tables = clusters
  – Customers enter the restaurant one at a time, and sit at a table
    • Customers = data points
Chinese restaurant process
• Overall metaphor:
  – Some dishes are more popular than others
    • Customers sit at an occupied table with probability proportional to the number of customers already at that table
    • Or at a new table with probability proportional to the concentration parameter α

• Basically a Pólya urn process!
  – we also have balls in the urn for the "new table" option
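A minimal simulation of the seating process described above (a sketch; the function name crp and the defaults are mine, not from the lecture):

```python
# alpha is the CRP concentration parameter; larger alpha -> more tables.
import numpy as np

def crp(n_customers, alpha, seed=0):
    rng = np.random.default_rng(seed)
    assignments = []   # table index chosen by each customer
    counts = []        # current number of customers at each table
    for i in range(n_customers):
        # Existing table k with prob counts[k] / (i + alpha),
        # a new table with prob alpha / (i + alpha).
        probs = np.array(counts + [alpha]) / (i + alpha)
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(1)       # open a new table
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments, counts

assignments, counts = crp(100, alpha=1.0)
print(len(counts), "tables with sizes", counts)
```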
Chinese restaurant process

[Slides 16–33: step-by-step animation of the seating process, with customers entering one at a time and the partition growing table by table]
CRP is exchangeable
• Joint distribution:

  P(c1, ..., cN) = ∏i P(ci | c1, ..., ci−1)

• Terms for customers in group k: the first customer at the table contributes α to the numerator, and each later one contributes the table's count at the time, so group k contributes α · (Nk − 1)! overall
  (Ik,: = their indices, Nk = num customers in k)
CRP is exchangeable

• Each index occurs in exactly one group. Simplify:

  P(c1, ..., cN) = α^K ∏k (Nk − 1)! / ∏i=1..N (α + i − 1)

• Depends on the num groups K and the group sizes Nk, but not on the ordering, so the CRP is exchangeable!
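A quick numeric check of this fact (a sketch, not lecture code): two arrival orders with the same group sizes get the same joint probability.

```python
import math

def crp_joint_prob(assignments, alpha):
    counts, log_p = {}, 0.0
    for i, c in enumerate(assignments):
        if c in counts:
            log_p += math.log(counts[c] / (i + alpha))  # join existing group
            counts[c] += 1
        else:
            log_p += math.log(alpha / (i + alpha))      # start a new group
            counts[c] = 1
    return math.exp(log_p)

# Two arrival orders with the same group sizes {3, 2}:
print(crp_joint_prob([0, 0, 1, 0, 1], alpha=1.0))  # 1/60
print(crp_joint_prob([0, 1, 1, 0, 0], alpha=1.0))  # 1/60 again
```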
CRP mixture models
(a.k.a. Dirichlet process mixture models)
• Generate cluster assignments (partition of data points) via CRP

• Draw parameters for each cluster from the prior

• Draw each data point from its cluster's distribution
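A sketch of this generative process for 2-D Gaussian clusters, reusing the crp() function sketched above; the Gaussian prior on cluster means and the fixed noise scale are my assumptions:

```python
import numpy as np

def draw_crp_mixture(n, alpha=1.0, prior_scale=5.0, noise=0.5, seed=0):
    rng = np.random.default_rng(seed)
    assignments, _ = crp(n, alpha, seed=seed)     # partition via CRP
    n_clusters = max(assignments) + 1
    means = rng.normal(0, prior_scale, size=(n_clusters, 2))  # cluster params
    X = np.array([rng.normal(means[c], noise) for c in assignments])
    return X, assignments

X, z = draw_crp_mixture(500)
print(X.shape, "points in", len(set(z)), "clusters")
```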
Draw from CRP mixture model, N = 50
[Figure: sample draw]

Draw from CRP mixture model, N = 500
[Figure: sample draw]

Draw from CRP mixture model, N = 1000
[Figure: sample draw]
Inference via collapsed Gibbs sampling
• For a collapsed Gibbs update, compute:

  p(ci = k | c−i, x) ∝ p(ci = k | c−i) · p(xi | {xj : cj = k, j ≠ i})

  The second factor is the posterior predictive.
  If we use a conjugate prior, we can compute this in closed form.

• Use exchangeability! Make ci the last customer:
  p(ci = k | c−i) = N−i,k / (N − 1 + α) for an existing cluster,
  α / (N − 1 + α) for a new cluster
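A sketch of one collapsed Gibbs sweep for a 1-D Gaussian CRP mixture with known noise variance; the conjugate Normal prior on cluster means gives a closed-form Gaussian posterior predictive. All modeling choices here (prior_var, noise_var, the test data) are illustrative assumptions, not the lecture's:

```python
import numpy as np
from scipy.stats import norm

def gibbs_scan(x, z, alpha=1.0, prior_var=10.0, noise_var=1.0, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    for i in range(len(x)):
        zi = z.copy()
        zi[i] = -1                                   # hold out point i
        labels = sorted(k for k in set(zi) if k >= 0)
        options, log_w = [], []
        for k in labels + [-2]:                      # -2 marks "new cluster"
            if k == -2:
                weight = alpha                       # CRP term for a new cluster
                mu, var = 0.0, prior_var + noise_var
            else:
                members = x[zi == k]
                weight = len(members)                # CRP term: cluster size
                # Normal-Normal conjugacy: posterior over the cluster mean...
                post_var = 1.0 / (1.0 / prior_var + len(members) / noise_var)
                post_mu = post_var * members.sum() / noise_var
                # ...and the Gaussian posterior predictive for x[i].
                mu, var = post_mu, post_var + noise_var
            options.append(k)
            log_w.append(np.log(weight) + norm.logpdf(x[i], mu, np.sqrt(var)))
        w = np.exp(np.array(log_w) - max(log_w))
        pick = options[rng.choice(len(options), p=w / w.sum())]
        z[i] = (max(labels) + 1 if labels else 0) if pick == -2 else pick
    return z

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-4, 1, 50), rng.normal(4, 1, 50)])
z = np.zeros(len(x), dtype=int)
for _ in range(25):
    z = gibbs_scan(x, z, rng=rng)
print("clusters used:", sorted(set(z)))
```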
Using a CRP mixture model to find the "true" num clusters is dangerous!

- Gershman and Blei (2012)


Alternative derivation:
Infinite limit of finite mixture model
• Consider a mixture model with a Dirichlet concentration parameter that does not depend on the num clusters K:

  π ~ Dirichlet(α/K, ..., α/K),  ci ~ Categorical(π)

• Marginalize out π:

  P(c1, ..., cN) = Γ(α)/Γ(N + α) · ∏k Γ(Nk + α/K)/Γ(α/K)
Alternative derivation:
Infinite limit of finite mixture model

• The probability of any particular labeled assignment vector goes to 0 as K goes to infinity. Instead, count equivalence classes of partitions: each partition with K+ occupied clusters corresponds to K!/(K − K+)! labeled assignment vectors
Alternative derivation:
Infinite limit of finite mixture model
• Take the limit as K goes to infinity:

  P(partition) → α^(K+) ∏k (Nk − 1)! / ∏i=1..N (α + i − 1)

  i.e. exactly the CRP partition distribution from before
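A worked version of that limit (a sketch of the standard argument, with K+ denoting the number of occupied clusters):

```latex
% Multiply the marginalized finite-mixture probability by the number of
% labeled vectors per partition, K!/(K - K_+)!, then let K -> infinity.
\begin{align*}
P(\text{partition})
  &= \frac{K!}{(K - K_+)!}
     \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)}
     \prod_{k=1}^{K_+} \frac{\Gamma(N_k + \alpha/K)}{\Gamma(\alpha/K)} \\
  &\xrightarrow{K \to \infty}
     \alpha^{K_+} \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)}
     \prod_{k=1}^{K_+} (N_k - 1)!
\end{align*}
```

This uses Γ(Nk + α/K)/Γ(α/K) → (α/K)(Nk − 1)! and K!/(K − K+)! ≈ K^(K+); the factor Γ(α)/Γ(N + α) equals 1/∏i=1..N (α + i − 1), matching the CRP formula above.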
Indian Buffet Process
• Distribution over binary matrices with an infinite number of columns (latent features)

[Figure: binary matrix Z; rows = data points (customers), columns = features (dishes)]
Indian Buffet Process
• Start with a finite Beta-Bernoulli model (each column/feature/dish has a coin-flip parameter)

• Define equivalence classes of matrices

• Take the infinite limit as K goes to infinity


Indian Buffet Process
• Each "customer" i eats each previously-sampled "dish" k with probability mk / i (mk = number of earlier customers who took dish k), then samples Poisson(α / i) new dishes
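A minimal sketch of simulating the IBP as just described (the function name and defaults are mine):

```python
import numpy as np

def ibp(n_customers, alpha, seed=0):
    rng = np.random.default_rng(seed)
    dish_counts = []          # m_k: how many previous customers took dish k
    rows = []
    for i in range(1, n_customers + 1):
        # Take existing dish k with probability m_k / i ...
        taken = [rng.random() < m / i for m in dish_counts]
        dish_counts = [m + t for m, t in zip(dish_counts, taken)]
        # ... then sample Poisson(alpha / i) brand-new dishes.
        n_new = rng.poisson(alpha / i)
        dish_counts.extend([1] * n_new)
        rows.append(taken + [True] * n_new)
    # Pad rows on the right to form the binary matrix Z.
    Z = np.zeros((n_customers, len(dish_counts)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

Z = ibp(10, alpha=2.0)
print(Z.shape, "matrix; dish popularity:", Z.sum(axis=0))
```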
Extending to non-binary case
• Elementwise multiply Z with a random real-valued or integer matrix

• Can use as a prior for factor analysis, etc.
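A sketch of this construction, reusing the ibp() draw above with an assumed Gaussian weight matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
Z = ibp(10, alpha=2.0)          # binary "who owns which feature"
V = rng.normal(size=Z.shape)    # real-valued feature weights (assumed Gaussian)
A = Z * V                       # sparse real-valued loadings for factor analysis
print(A.round(2))
```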
Exam study guide

• The learning outcomes in each lecture are your main guide on what to study

• Useful to study slides, homeworks, peer instruction questions. Readings are lower priority, but may be useful

• Format will be similar to homeworks, but will also include a multiple-choice component
  – Need to be comfortable with models, but don't need to memorize pdfs of distributions, proofs done in class, etc.

• Bring pens, scratch paper, pocket calculators (no phones!)
Think-pair-share

• Design a nonparametric Bayesian latent variable model for a social network, represented as a binary adjacency matrix Y

  – How will you specify a prior?
  – How will you specify a likelihood?
  – Does your model encode any sociological principles?
Evaluations

• Please be sure to submit evaluations for both your instructor and TA, if you have not done so already.

• This will help us a lot!

• (Thanks, if you have already done this)

• I understand that you have been emailed a link to do this.
