
ADVANCED MACHINE LEARNING

FALL 2017

VERSION: OCTOBER 12, 2017


COURSE ADMIN
TERM TIMELINE

[Timeline: first class Sep 5/6 — Python tutorial — midterm — TensorFlow tutorial — final project due — final class Dec 7/11. Instructors: Peter Orbanz, John Cunningham.]

Dates
Python tutorial 12/13 September
TensorFlow tutorial 24/25 October
Midterm exam 19/23 October
Final project due 11 December

Advanced Machine Learning 3 / 212


A SSISTANTS AND G RADING
Teaching Assistants
Ian Kinsella and Wenda Zhou

Office Hours Mon/Tue 5:30-7:30pm, Room 1025, Dept of Statistics, 10th floor SSW

Class Homepage
https://wendazhou.com/teaching/AdvancedMLFall17/

Homework
• Some homework problems and final project require coding
• Coding: Python
• Homework due: Tue/Wed at 4pm – no late submissions
• You can drop two homeworks from your final score

Grade
Homework + Midterm Exam + Final Project
20% 40% 40%

Advanced Machine Learning 4 / 212


H OUSE RULES

Email
All email to the TAs, please.

The instructors will not read your email unless it is forwarded by a TA.

Problems with exam/project


If you cannot take the exam or finish the project: You must let us know at least one week
before the midterm exam/project due date.

Advanced Machine Learning 5 / 212


R EADING

The relevant course material is the slides.

Books (optional)

See class homepage for references.

Advanced Machine Learning 6 / 212


OUTLINE (VERY TENTATIVE)

Part I (Orbanz)
• Neural networks (basic definitions and training)
• Graphical models (ditto)
• Sampling algorithms
• Variational inference
• Optimization for GMs and NNs

Part II (Cunningham)
• NN software
• Convolutional NNs and computer vision
• Recurrent NNs
• Reinforcement learning
• Dimension reduction and autoencoders

Advanced Machine Learning 7 / 212


I NTRODUCTION
AGAIN : M ACHINE L EARNING

Historical origins: Artificial intelligence and engineering


Machines need to...
• recognize patterns (e.g. vision, language)

• make decisions based on experience (= data)


• predict
• cope with uncertainty

Modern applications: (A few) Examples


• medical diagnosis
• face detection/recognition
• speech and handwriting recognition
• web search
• recommender systems
• bioinformatics
• natural language processing
• computer vision

Today
Machine learning and statistics have become hard to tell apart.

Advanced Machine Learning 9 / 212


L EARNING AND S TATISTICS

Task
Balance the pendulum upright by moving the sled left and right.
• The computer can control only the motion of the sled.

• Available data: Current state of system (measured 25 times/second).

Advanced Machine Learning 10 / 212


L EARNING AND S TATISTICS

Formalization
State = 4 variables (sled location, sled velocity, angle, angular velocity)
Actions = sled movements

The system can be described by a function


f : S × A → S
(state, action) ↦ state
Advanced Machine Learning 10 / 212
L EARNING AND S TATISTICS

Advanced Machine Learning 11 / 212


L EARNING AND S TATISTICS

After each run


Fit a function
f : S × A → S
(state, action) ↦ state
to the data obtained in previous runs.

Running the system involves:


1. The function f , which tells the system “how the world works”.
2. An optimization method that uses f to determine how to move towards the optimal state.

Note well
Learning how the world works is a regression problem.

Advanced Machine Learning 12 / 212


O UR MAIN TOPICS

Neural networks
• Define functions
• Represented by directed graph
• Each vertex represents a function
• Incoming edges: Function arguments
• Outgoing edges: Function values
• Learning: Differentiation/optimization

Graphical models
• Define distributions
• Represented by directed graph
• Each vertex represents a distribution
• Incoming edges: Conditions
• Outgoing edges: Draws from distribution
• Learning: Estimation/inference

Advanced Machine Learning 13 / 212


RECALL PREVIOUS TERM

[Illustration: linear classifier splitting the space into the regions sgn(⟨v_H, x⟩ − c) > 0 and sgn(⟨v_H, x⟩ − c) < 0]

            Supervised learning           Unsupervised learning
Problems    Classification, Regression    Clustering (mixture models), HMMs, Dimension reduction (PCA)
Solutions   Functions                     Distributions

Advanced Machine Learning 14 / 212


Neural networks | Graphical models

[Two diagrams: left, a layered neural network with inputs x1, x2, x3, weights v1, v2, v3 and
units y_j = φ(v_jᵀx), followed by further layers of units ψ; right, a layered directed graphical
model with the same layer structure.]
Advanced Machine Learning 15 / 212


NNS AND GRAPHICAL MODELS

Neural networks
• Representation of a function using a graph
• Layers: x → g → f
• Symbolizes: f(g(x))
  “f depends on x only through g”

[Diagram: a one-layer network with inputs x1, x2, x3, weights v_ij, and units y_j = φ(v_jᵀx)]

Graphical models
• Representation of a distribution using a graph
• Layers: X → Y → Z
• Symbolizes: p(x, z, y) = p(z|y) p(y|x) p(x)
  “Z is conditionally independent of X given Y”

Advanced Machine Learning 16 / 212


Grouping dependent variables into
layers is a good thing.

Advanced Machine Learning 17 / 212


HISTORICAL PERSPECTIVE: MCCULLOCH-PITTS NEURON MODEL (1943)

A neuron is modeled as a “thresholding device” that combines input signals:

[Diagram: input signals x1, x2, x3 feeding a single threshold unit]

McCulloch-Pitts model
• Collect the input signals x1, x2, x3 into a vector x = (x1, x2, x3) ∈ R³
• Choose a fixed vector v ∈ R³ and a constant c ∈ R.
• Compute:
      y = I{⟨v, x⟩ > c} .

Advanced Machine Learning 18 / 212


READING THE DIAGRAM

[Diagram: inputs x1, x2, x3 with weights v1, v2, v3 and a constant input −1 with weight c, feeding a threshold unit I{• > 0}]

    y = I{ v1·x1 + v2·x2 + v3·x3 + (−1)·c > 0 } = I{⟨v, x⟩ > c}

Recall: Linear classifier

    f(x) = sgn(⟨v, x⟩ − c)

Advanced Machine Learning 19 / 212


LINEAR CLASSIFICATION

[Illustration: affine hyperplane with normal vector v; the projection ⟨x, v⟩/‖v‖ of a point x onto v determines its side of the plane]

    f(x) = sgn(⟨v, x⟩ − c)

Advanced Machine Learning 20 / 212


REMARKS

[Diagram: inputs x1, x2, x3 with weights v1, v2, v3 feeding a threshold unit with output y = I{vᵀx > c}]

• The “neural network” represents a linear two-class classifier (on R³).


• It does not specify the training method.
• To train the classifier, we need a cost function and an optimization method.

Advanced Machine Learning 21 / 212


TRAINING

• For parameter estimation by optimization, we need an optimization target.
• Idea: Choose 0-1 loss as the simplest loss for classification.
• Minimize empirical risk (on training data) under this loss.

[Illustration: the piecewise constant 0-1 empirical risk J(a) and the piecewise linear perceptron criterion J_p(a) over the parameter space, with the solution region marked]
Advanced Machine Learning Illustration: R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley 2001 22 / 212
THE PERCEPTRON CRITERION

• A piecewise constant function is not suitable for numerical optimization.
• Approximate it by a piecewise linear function → perceptron cost function

[Illustration: the perceptron criterion J_p(a) and a squared variant J_r(a) over the parameter space, with the solution region marked]
Advanced Machine Learning Illustration: R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley 2001 23 / 212
PERCEPTRON

“Train” the McCulloch-Pitts model (that is: estimate (c, v)) by applying gradient descent to the
function

    C_p(c, v) := Σ_{i=1}^n I{ sgn(⟨v, x̃_i⟩ − c) ≠ ỹ_i } · |⟨v, x̃_i⟩ − c| ,
called the Perceptron cost function.
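
As a concrete illustration, here is a minimal NumPy sketch of training such a classifier by
gradient steps on the piecewise linear perceptron criterion; the function name, learning rate
and toy data are illustrative assumptions, not part of the slides.

```python
import numpy as np

def perceptron_train(X, y, lr=1.0, epochs=100):
    """Minimal sketch: gradient steps on the perceptron criterion.
    X: (n, d) data; y: labels in {-1, +1}. Returns (v, c) for the rule sgn(<v, x> - c)."""
    n, d = X.shape
    Xh = np.hstack([X, -np.ones((n, 1))])        # homogeneous coordinates, w = (v, c)
    w = np.zeros(d + 1)
    for _ in range(epochs):
        miscl = np.sign(Xh @ w) != y             # only misclassified points contribute to the cost
        if not miscl.any():
            break
        grad = -(y[miscl, None] * Xh[miscl]).sum(axis=0)   # (sub)gradient of the piecewise linear cost
        w -= lr * grad
    return w[:-1], w[-1]

# usage on a separable toy set
X = np.vstack([np.random.randn(50, 2) + 2, np.random.randn(50, 2) - 2])
y = np.hstack([np.ones(50), -np.ones(50)])
v, c = perceptron_train(X, y)
```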

Advanced Machine Learning 24 / 212


OUTLINE (VERY TENTATIVE)

Part I (Orbanz)
• Neural networks (basic definitions and training)
• Graphical models (ditto)
• Sampling algorithms
• Variational inference
• Optimization for GMs and NNs

Part II (Cunningham)
• NN software
• Convolutional NNs and computer vision
• Recurrent NNs
• Reinforcement learning
• Dimension reduction and autoencoders

Advanced Machine Learning 25 / 212


TOOLS: LOGISTIC REGRESSION
SIGMOIDS

Sigmoid function

    σ(x) = 1 / (1 + e^{−x})

Note

    1 − σ(x) = (1 + e^{−x} − 1)/(1 + e^{−x}) = 1/(e^{x} + 1) = σ(−x)

Derivative

    dσ/dx (x) = e^{−x}/(1 + e^{−x})² = σ(x)(1 − σ(x))

[Plot: the sigmoid (blue) and its derivative (red)]
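
A quick numerical check of these identities (a small sketch, not from the slides):

```python
import numpy as np

def sigmoid(x):
    # logistic sigmoid sigma(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-10, 10, 201)

# identity: 1 - sigma(x) = sigma(-x)
assert np.allclose(1 - sigmoid(x), sigmoid(-x))

# derivative: sigma'(x) = sigma(x)(1 - sigma(x)), checked against central differences
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
assert np.allclose(numeric, sigmoid(x) * (1 - sigmoid(x)), atol=1e-8)
```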

Advanced Machine Learning 27 / 212


APPROXIMATING DECISION BOUNDARIES

• In linear classification: The decision boundary is a discontinuity.
• The boundary is represented either by the indicator function I{• > c} or the sign function sign(• − c).
• These representations are equivalent: sign(• − c) = 2 · I{• > c} − 1.

[Plot: indicator step function and a sigmoid approximation of it]

The most important use of the sigmoid function in machine learning is as a smooth
approximation to the indicator function.

Given a sigmoid σ and a data point x, we decide which side of the approximated boundary we
are on by thresholding
    σ(x) ≥ 1/2

Advanced Machine Learning 28 / 212


SCALING

We can add a scale parameter by defining
    σ_θ(x) := σ(θx) = 1/(1 + e^{−θx})    for θ ∈ R

[Plot: σ_θ for increasing values of θ, approaching a step function]

Influence of θ
• As θ increases, σ_θ approximates I more closely.
• For θ → ∞, the sigmoid converges to I pointwise, that is: For every x ≠ 0, we have
      σ_θ(x) → I{x > 0}   as θ → +∞ .
• Note σ_θ(0) = 1/2 always, regardless of θ.

Advanced Machine Learning 29 / 212


APPROXIMATING A LINEAR CLASSIFIER

So far, we have considered R, but linear classifiers usually live in R^d.

The decision boundary of a linear classifier in R² is a discontinuous ridge:
• This is a linear classifier of the form I{⟨v, x⟩ − c > 0}.
• Here: v = (1, 1) and c = 0.

We can “stretch” σ into a ridge function on R²:
• This is the function x = (x1, x2) ↦ σ(x1).
• The ridge runs parallel to the x2-axis.
• If we use σ(x2) instead, we rotate by 90 degrees (still axis-parallel).

Advanced Machine Learning 30 / 212


S TEERING A S IGMOID

Just as for a linear classifier, we use a normal vector v ∈ Rd .

• The function σ(hv, xi − c) is a sigmoid ridge, where the ridge is orthogonal to the normal
vector v, and c is an offset that shifts the ridge “out of the origin”.
• The plot on the right shows the normal vector (here: v = (1, 1)) in black.
• The parameters v and c have the same meaning for I and σ, that is, σ(hv, xi − c)
approximates I{hv, xi ≥ c}.

Advanced Machine Learning 31 / 212


L OGISTIC R EGRESSION

Logistic regression is a classification method that approximates decision boundaries by


sigmoids.

Setup
• Two-class classification problem
• Observations x1 , . . . , xn ∈ Rd , class labels yi ∈ {0, 1}.

The logistic regression model


We model the conditional distribution of the class label given the data as


    P(y|x) := Bernoulli( σ(⟨v, x⟩ − c) ) .

• Recall σ(⟨v, x⟩ − c) takes values in [0, 1] for all x, and value 1/2 on the class boundary.
• The logistic regression model interprets this value as the probability of being in class y.

Advanced Machine Learning 32 / 212


L EARNING L OGISTIC R EGRESSION

Since the model is defined by a parametric distribution, we can apply maximum likelihood.

Notation
Recall from Statistical Machine Learning: We collect the parameters in a vector w by writing

    w := (v, −c)   and   x̃ := (x, 1)   so that   ⟨w, x̃⟩ = ⟨v, x⟩ − c .

Likelihood function of the logistic regression model

    ∏_{i=1}^n σ(⟨w, x̃_i⟩)^{y_i} (1 − σ(⟨w, x̃_i⟩))^{1−y_i}

Negative log-likelihood

    L(w) := − Σ_{i=1}^n [ y_i log σ(⟨w, x̃_i⟩) + (1 − y_i) log(1 − σ(⟨w, x̃_i⟩)) ]

Advanced Machine Learning 33 / 212


MAXIMUM LIKELIHOOD

    ∇L(w) = Σ_{i=1}^n ( σ(wᵀx̃_i) − y_i ) x̃_i

Note
• Each training data point x_i contributes to the sum proportionally to the approximation
  error σ(wᵀx̃_i) − y_i incurred at x_i by approximating the linear classifier by a sigmoid.

Maximum likelihood
• The ML estimator ŵ for w is the solution of
∇L(w) = 0 .
• For logistic regression, this equation has no solution in closed form.
• To find ŵ, we use numerical optimization.
• The function L is convex (= ∪-shaped).
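
The following sketch implements L(w), its gradient, and a plain gradient-descent fit; the step
size, iteration count and gradient averaging are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(w, X_tilde, y):
    """L(w) for logistic regression; X_tilde has rows x~_i = (x_i, 1), y in {0, 1}."""
    p = sigmoid(X_tilde @ w)
    eps = 1e-12                                   # avoid log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def gradient(w, X_tilde, y):
    # grad L(w) = sum_i (sigma(w^t x~_i) - y_i) x~_i
    return X_tilde.T @ (sigmoid(X_tilde @ w) - y)

def fit_gradient_descent(X, y, lr=0.1, steps=2000):
    X_tilde = np.hstack([X, np.ones((X.shape[0], 1))])   # append the constant 1
    w = np.zeros(X_tilde.shape[1])
    for _ in range(steps):
        w -= lr * gradient(w, X_tilde, y) / len(y)       # averaged gradient for a stable step size
    return w                                             # w = (v, -c)
```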

Advanced Machine Learning 34 / 212


R ECALL FROM S TATISTICAL M ACHINE L EARNING

• If f is differentiable, we can apply gradient descent:


x(k+1) := x(k) − ∇f (x(k) )
where x(k) is the candidate solution in step k of the algorithm.
• If the Hessian matrix Hf of partial second derivatives exists and is invertible, we can apply
Newton’s method, which converges faster:
x(k+1) := x(k) − Hf−1 (x(k) ) · ∇f (x(k) )

• Recall that the Hessian matrix of a (twice continuously differentiable) function


f : R^d → R is
    H_f(x) := ( ∂²f / ∂x_i ∂x_j )_{i,j ≤ d}
Since f is twice differentiable, each ∂ 2 f /∂xi ∂xj exists; since it is twice continuously
differentiable, ∂ 2 f /∂xi ∂xj = ∂ 2 f /∂xj ∂xi , so Hf is symmetric.
• The inverse of Hf (x) exists if and only if the matrix is positive definite (semidefinite does
not suffice), which in turn is true if and only if f is strictly convex.

Advanced Machine Learning 35 / 212


NEWTON'S METHOD FOR LOGISTIC REGRESSION

Applying Newton
    w^{(k+1)} := w^{(k)} − H_L^{−1}(w^{(k)}) · ∇L(w^{(k)})

Matrix notation

    X̃ := the n × (d+1) matrix whose i-th row is x̃_i (the data point x_i with a constant 1 appended)

    D_σ := diag( σ(wᵀx̃_1)(1 − σ(wᵀx̃_1)), …, σ(wᵀx̃_n)(1 − σ(wᵀx̃_n)) )

X̃ is the data matrix (or design matrix) you know from linear regression. X̃ has size n × (d + 1)
and D_σ is n × n.

Newton step

    w^{(k+1)} = (X̃ᵀ D_σ X̃)^{−1} X̃ᵀ D_σ ( X̃ w^{(k)} − D_σ^{−1}( σ(X̃ w^{(k)}) − y ) )

where σ(X̃ w^{(k)}) := ( σ(⟨w^{(k)}, x̃_1⟩), …, σ(⟨w^{(k)}, x̃_n⟩) )ᵀ and y := (y_1, …, y_n)ᵀ.

Advanced Machine Learning 36 / 212


NEWTON'S METHOD FOR LOGISTIC REGRESSION

    w^{(k+1)} = (X̃ᵀ D_σ X̃)^{−1} X̃ᵀ D_σ ( X̃ w^{(k)} − D_σ^{−1}( σ(X̃ w^{(k)}) − y ) )
             = (X̃ᵀ D_σ X̃)^{−1} X̃ᵀ D_σ u^{(k)}

where u^{(k)} := X̃ w^{(k)} − D_σ^{−1}( σ(X̃ w^{(k)}) − y ).

Compare this to the least squares solution of a linear regression problem:

    β̂ = (X̃ᵀ X̃)^{−1} X̃ᵀ y          versus          w^{(k+1)} = (X̃ᵀ D_σ X̃)^{−1} X̃ᵀ D_σ u^{(k)}

Differences:
• The vector y of regression responses is substituted by the vector u^{(k)} above.
• The matrix X̃ᵀX̃ is substituted by the matrix X̃ᵀ D_σ X̃.
• Note that matrices of product form X̃ᵀX̃ are positive semidefinite; since D_σ is diagonal
  with non-negative entries, so is X̃ᵀ D_σ X̃.

Iteratively Reweighted Least Squares


• At each step, the algorithm solves a least-squares problem “reweighted” by the matrix Dσ .
• Since this happens at each step of an iterative algorithm, Newton’s method applied to the
logistic regression log-likelihood is also known as Iteratively Reweighted Least Squares.
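
A compact IRLS sketch along these lines; the tiny ridge term is an assumption added here to
keep the weighted normal equations solvable on separable data, it is not part of the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def irls(X_tilde, y, steps=25, ridge=1e-8):
    """Newton / IRLS sketch for logistic regression.
    X_tilde: (n, d+1) design matrix with the constant column included; y in {0, 1}."""
    n, d1 = X_tilde.shape
    w = np.zeros(d1)
    for _ in range(steps):
        p = sigmoid(X_tilde @ w)                            # sigma(<w, x~_i>)
        d = p * (1 - p)                                     # diagonal of D_sigma
        u = X_tilde @ w - (p - y) / np.maximum(d, 1e-12)    # working response u^(k)
        A = X_tilde.T @ (d[:, None] * X_tilde) + ridge * np.eye(d1)
        b = X_tilde.T @ (d * u)
        w = np.linalg.solve(A, b)                           # reweighted least-squares step
    return w
```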

Advanced Machine Learning 37 / 212


OTHER O PTIMIZATION M ETHODS

Newton: Cost
• The size of the Hessian is (d + 1) × (d + 1).
• In high-dimensional problems, inverting HL can become problematic.

Other methods
Maximum likelihood only requires that we minimize the negative log-likelihood; we can choose
any numerical method, not just Newton. Alternatives include:
• Pseudo-Newton methods (only invert HL once, for w(1) , but do not guarantee quadratic
convergence).
• Gradient methods.
• Approximate gradient methods, like stochastic gradient.

Advanced Machine Learning 38 / 212


OVERFITTING

Recall from Statistical Machine Learning
• If we increase the length of v without changing its direction, the sign of ⟨v, x⟩ does not
  change, but the value changes.
• That means: If v is the normal vector of a classifier, and we scale v by some θ > 0, the
  decision boundary does not move, but ⟨θv, x⟩ = θ⟨v, x⟩.

[Illustration: hyperplane H with normal vector v and a longer vector θv; the boundary is unchanged]

Effect inside a sigmoid

    σ(⟨θv, x⟩) = σ(θ⟨v, x⟩) = σ_θ(⟨v, x⟩)

As the length of v increases, σ(⟨v, x⟩) becomes more similar to I{⟨v, x⟩ > 0}.

[Plot: sigmoids steepening as v gets longer]
Advanced Machine Learning 39 / 212


EFFECT ON ML FOR LOGISTIC REGRESSION

[Plot: the error term σ(wᵀx̃_i) − y_i at a training point x_i, shrinking as the sigmoid steepens]

• Recall each training data point x_i contributes an error term σ(wᵀx̃_i) − y_i to the
  log-likelihood.
• By increasing the length of w, we can make σ(wᵀx̃_i) − y_i arbitrarily small without
  moving the decision boundary.

Advanced Machine Learning 40 / 212


OVERFITTING

Consequence for linearly separable data


• Once the decision boundary is correctly located between the two classes, the maximization
algorithm can increase the log-likelihood arbitrarily by increasing the length of w.
• That does not move the decision boundary, but the logistic function looks more and more
like the indicator I.
• That may fit the training data more tightly, but can lead to bad generalization (e.g. for
similar reasons as for the perceptron, where the decision boundary may end up very close
to a training data point).
That is a form of overfitting.

Data that is not linearly separable


• If the data is not separable, sufficiently many points on the “wrong” side of the decision
boundary prevent overfitting (since making w larger increases error contributions of these
points).
• For large data sets, overfitting can still occur if the fraction of such points is small.

Solutions
• Overfitting can be addressed by including an additive penalty of the form L(w) + λ‖w‖.
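
Sketch of how such a penalty enters the gradient; for differentiability this sketch uses the
squared norm λ‖w‖²/2 rather than λ‖w‖, which is an assumption, not the slides' exact form.

```python
import numpy as np

def penalized_gradient(w, X_tilde, y, lam):
    """Gradient of L(w) + lam * ||w||^2 / 2 for logistic regression."""
    p = 1.0 / (1.0 + np.exp(-(X_tilde @ w)))
    return X_tilde.T @ (p - y) + lam * w      # the penalty keeps ||w|| from growing without bound
```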

Advanced Machine Learning 41 / 212


LOGISTIC REGRESSION FOR MULTIPLE CLASSES

Bernoulli and multinomial distributions
• The multinomial distribution of N draws from K categories with parameter vector
  (θ_1, …, θ_K) (where Σ_{k≤K} θ_k = 1) has probability mass function

      P(m_1, …, m_K | θ_1, …, θ_K) = N!/(m_1! ⋯ m_K!) ∏_{k=1}^K θ_k^{m_k}    where m_k = # draws in category k

• Note that Bernoulli(p) = Multinomial(p, 1 − p; N = 1).

Logistic regression
• Recall two-class logistic regression is defined by P(Y|x) = Bernoulli(σ(wᵀx)).
• Idea: To generalize logistic regression to K classes, choose a separate weight vector w_k
  for each class k, and define P(Y|x) by

      Multinomial( σ̃(w_1ᵀx), …, σ̃(w_Kᵀx) )    where σ̃(w_kᵀx) = σ(w_kᵀx) / Σ_{k'} σ(w_{k'}ᵀx) .

Advanced Machine Learning 42 / 212


LOGISTIC REGRESSION FOR MULTIPLE CLASSES

Logistic regression for K classes
The label y now takes values in {1, …, K}.

    P(y|x) = ∏_{k=1}^K σ̃(w_kᵀx̃)^{I{y=k}}

The negative log-likelihood becomes

    L(w_1, …, w_K) = − Σ_{i≤n, k≤K} I{y_i = k} log σ̃(w_kᵀx̃_i)

This can again be optimized numerically.

Comparison to two-class case
• Recall that 1 − σ(x) = σ(−x).
• That means
      Bernoulli( σ(⟨v, x⟩ − c) ) ≡ Multinomial( σ(wᵀx), σ((−1)wᵀx) )
• That is: Two-class logistic regression as above is equivalent to multiclass logistic
  regression with K = 2 provided we choose w_2 = −w_1.
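
A minimal sketch of the K-class model fitted by gradient descent; it uses the standard softmax
normalization in place of the σ̃ ratio above, an assumption made for numerical convenience.

```python
import numpy as np

def softmax(S):
    # row-wise normalized class probabilities from the scores S = X_tilde @ W
    S = S - S.max(axis=1, keepdims=True)        # shift for numerical stability
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def fit_multiclass(X_tilde, y, K, lr=0.1, steps=2000):
    """K-class logistic regression sketch. X_tilde: (n, d+1) design matrix;
    y: integer labels in {0, ..., K-1}."""
    n, d1 = X_tilde.shape
    W = np.zeros((d1, K))                       # one weight vector w_k per class (columns)
    Y = np.eye(K)[y]                            # one-hot encoding of I{y_i = k}
    for _ in range(steps):
        P = softmax(X_tilde @ W)
        W -= lr * X_tilde.T @ (P - Y) / n       # gradient of the negative log-likelihood
    return W
```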

Advanced Machine Learning 43 / 212


G RAPHICAL M ODELS
G RAPHICAL M ODELS

A graphical model represents the dependence structure within a set of random variables
as a graph.

Overview
Roughly speaking:
• Each random variable is represented by vertex.

• If Y depends on X, we draw an edge X → Y.


• For example:

      X ← Z → Y

  This says: “X depends on Z, and Y depends on Z”.


• We have to be careful: The above does not imply that X and Y are independent. We have
to make more precise what depends on means.

Advanced Machine Learning 45 / 212


We will use the notation:
L(X) = distribution of the random variable X
L(X|Y) = conditional distribution of X given Y
(L means “law”.)

Reason
• If X is discrete, L(X) is usually given by a mass function P(x).
• If it is continuous, L(X) is usually given by a density p(x).
• With the notation above, we do not have to distinguish between discrete and continuous
variables.

Advanced Machine Learning 46 / 212


D EPENDENCE AND I NDEPENDENCE
Dependence between random variables X1 , . . . , Xn is a property of their
joint distribution L(X1 , . . . , Xn ).

Recall
Two random variables are stochastically independent, or independent for short, if their joint
distribution factorizes:
L(X, Y) = L(X)L(Y)
For densities/mass functions:
P(x, y) = P(x)P(y) or p(x, y) = p(x)p(y)
Dependent means not independent.

Intuitively
X and Y are dependent if knowing the outcome of X provides any information about the
outcome of Y.

More precisely:
• If someone draws (X, Y) simultaneously, and only discloses X = x to you, does that
change your mind about the distribution of Y? (If so: Dependence.)


• Once X is given, the distribution of Y is the conditional L(Y|X = x).
• If that is still L(Y), as before X was drawn, the two are independent. If
L(Y|X = x) 6= L(Y), they are dependent.

Advanced Machine Learning 47 / 212


C ONDITIONAL INDEPENDENCE

Definition
Given random variables X, Y, Z, we say that X is conditionally independent of Y given Z if
L(X, Y|Z = z) = L(X|Z = z)L(Y|Z = z) .
That is equivalent to
L(X|Y = y, Z = z) = L(X|Z = z) .

Notation
    X ⊥⊥_Z Y

Intuitively
X and Y are dependent given Z = z if, although Z is known, knowing the outcome of X provides
additional information about the outcome of Y.

Advanced Machine Learning 48 / 212


G RAPHICAL M ODEL N OTATION

Factorizing a joint distribution


The joint probability of random variables X_1, …, X_n can always be factorized as
    L(X_1, …, X_n) = L(X_n | X_1, …, X_{n−1}) L(X_{n−1} | X_1, …, X_{n−2}) ⋯ L(X_1) .
Note that we can re-arrange the variables in any order.
If there are conditional independencies, we can remove some variables from the conditionals:
    L(X_1, …, X_n) = L(X_n | X̂_n) L(X_{n−1} | X̂_{n−1}) ⋯ L(X_1) ,
where X̂_i is the subset of X_1, …, X_n on which X_i depends.

Definition
Let X1 , . . . , Xn be random variables. A (directed) graphical model represents a factorization
of joint distribution L(X1 , . . . , Xn ) as follows:
• Factorize L(X1 , . . . , Xn ).

• Add one vertex for each variable Xi .


• For each variable X_i, add an edge from each variable X_j ∈ X̂_i to X_i.
That is: An edge X_j → X_i is added if L(X_1, …, X_n) contains a factor L(X_i | X̂_i) with X_j ∈ X̂_i.

Advanced Machine Learning 49 / 212


G RAPHICAL M ODEL N OTATION

Lack of uniqueness
The factorization is usually not unique, since e.g.
L(X, Y) = L(X|Y)L(Y) = L(Y|X)L(X) .
That means the direction of edges is not generally determined.

Remark
• If we use a graphical model to define a model or visualize a model, we decide on the
direction of the edges.
• Estimating the direction of edges from data is a very difficult (and very important)
problem. This is one of the main subjects of a research field called causal inference or
causality.

Advanced Machine Learning 50 / 212


A simple example

      X ← Z → Y          X ⊥⊥_Z Y

An example with layers

      [Diagram: two layers of vertices; every vertex in layer 2 receives edges from layer 1]

All variables in the (k + 1)st layer are conditionally independent given the
variables in the kth layer.

Advanced Machine Learning 51 / 212


WORDS OF CAUTION I

      X ← Z → Y          X ⊥⊥_Z Y

Important
• X and Y are not independent; independence holds only conditionally on Z.
• In other words: If we do not observe Z, X and Y are dependent, and we have to change the
  graph:

      X → Y    or    X ← Y

Advanced Machine Learning 52 / 212


WORDS OF CAUTION II

      X → Z ← Y

Conditioning on Z makes X and Y dependent.

Example
• Suppose we start with two independent normal variables X and Y.
• Set Z = X + Y.
If we know Z, and someone reveals the value of Y to us, we know everything about X.

This effect is known as explaining away. We will revisit it later.

Advanced Machine Learning 53 / 212


M ACHINE L EARNING E XAMPLES I

Hidden Markov models

Z1 Z2 ··· Zn−1 Zn

X1 X2 ··· Xn−1 Xn

Advanced Machine Learning 54 / 212


MACHINE LEARNING EXAMPLES II

Sigmoid belief networks

A graphical model in which each node is a binary random variable and each conditional
probability is a logistic regression model is called a sigmoid belief network.

Terminology: “Belief network” or “Bayes net” are alternative names for graphical models.

Deep belief networks

A deep belief network is a layered directed graphical model that looks like this:

[Diagram: input layer Θ_1, …, Θ_d at the top, several hidden layers, data layer X_1, …, X_d at the bottom]

Two tasks for deep belief nets are:
• Sampling: Draw X_{1:d} from L(X_{1:d} | Θ_{1:d}).
  Note: Variables in each layer are conditionally independent given the previous layer.
• Inference: Estimate L(Θ_{1:d} | X_{1:d} = x_{1:d}) when data x_{1:d} is observed.
  Problem: Conditioning a layer on the following one makes its variables dependent.
  (More details later.)

Advanced Machine Learning 55 / 212


M ARKOV R ANDOM F IELDS
U NDIRECTED G RAPHICAL M ODEL

• A graphical model is undirected when its dependency graph is undirected; equivalently, if


each edge in the graph is either absent, or present in both directions.

[Diagrams: a directed graph on X, Y, Z (left), the same graph with all edges made bidirectional (middle), and the corresponding undirected graph (right)]

      directed                                   undirected

• An undirected graphical model is more commonly known as a Markov random field.


• Markov random fields are special cases of (directed) graphical models, but have distinct
properties. We treat them separately.
• We will consider the undirected case first.

Advanced Machine Learning 57 / 212


OVERVIEW

We start with an undirected graph:

[Diagram: a 2-dimensional grid of vertices Θ_i with edge weights w_ij between neighboring vertices]

A random variable Θ_i is associated with each vertex. Two random variables interact if they are
neighbors in the graph.

Advanced Machine Learning 58 / 212


NEIGHBORHOOD GRAPH

• We define a neighborhood graph, which is a weighted, undirected graph

      N = (V_N, W_N)

  with vertex set V_N and set of edge weights W_N. The vertices v_i ∈ V_N are often referred to as sites.
• The edge weights are scalars w_ij ∈ R. Since the graph is undirected, the weights are
  symmetric (w_ij = w_ji).
• An edge weight w_ij = 0 means “no edge between v_i and v_j”.

Neighborhoods
The set of all neighbors of v_i in the graph,
      ∂(i) := { j | w_ij ≠ 0 } ,
is called the neighborhood of v_i.

[Illustration: a vertex v_i with its neighborhood ∂(i) marked in purple]

Advanced Machine Learning 59 / 212


M ARKOV R ANDOM F IELDS

Given a neighborhood graph N , associate with each site vi ∈ VN a RV Θi .

The Markov property


We say that the joint distribution P of (Θ1 , . . . , Θn ) satisfies the Markov property with
respect to N if
L(Θi |Θj , j 6= i) = L(Θi |Θj , j ∈ ∂ (i)) .
The set {Θj , j ∈ ∂ (i)} of random variables indexed by neighbors of vi is called the Markov
blanket of Θi .

In words
The Markov property says that each Θi is conditionally independent of the remaining variables
given its Markov blanket.

Definition
A distribution L(Θ1 , . . . , Θn ) which satisfies the Θi
Markov property for a given graph N is called a Markov
random field.

Markov blanket of Θi

Advanced Machine Learning 60 / 212


E NERGY F UNCTIONS

Probabilities and energies


A (strictly positive) density p(x) can always be written in the form
    p(x) = (1/Z) exp(−H(x))    where H : X → R_+
and Z is a normalization constant.
The function H is called an energy function, or cost function, or a potential.

MRF energy
In particular, we can write an MRF density for RVs Θ_{1:n} as
    p(θ_1, …, θ_n) = (1/Z) exp(−H(θ_1, …, θ_n))

Advanced Machine Learning 61 / 212


C LIQUES

Graphical models factorize over the graph. How does that work for MRFs?

A clique in a graph is a fully connected subgraph. In undirected graphs:

[Illustration: a graph on vertices 1–6; an example of a clique and of a subgraph that is not a clique]

The cliques in this graph are: i) the triangles (1, 2, 3) and (1, 3, 4);
ii) each pair of vertices connected by an edge (e.g. (2, 6)).

Advanced Machine Learning 62 / 212


T HE H AMMERSLEY-C LIFFORD T HEOREM

Theorem
Let N be a neighborhood graph with vertex set V_N. Suppose the random variables
{Θ_i, i ∈ V_N} take values in T, and their joint distribution has probability mass function P, so
there is an energy function H such that

    P(θ_1, …, θ_n) = e^{−H(θ_1, …, θ_n)} / Σ_{θ'_1, …, θ'_n ∈ T} e^{−H(θ'_1, …, θ'_n)} .

The joint distribution has the Markov property if and only if

    H(θ_1, θ_2, …) = Σ_{C ∈ C} H_C(θ_i, i ∈ C) ,

where C is the set of cliques in N, and each H_C is a non-negative function with |C| arguments.
Hence,

    P(θ_1, …, θ_n) ∝ ∏_{C ∈ C} e^{−H_C(θ_i, i ∈ C)} .

Markov random fields factorize over cliques.

Advanced Machine Learning 63 / 212


USE OF MRFS

In general
• Modeling systems of dependent RVs is one of the hardest problems in probability.
• MRFs model dependence, but break it down to a limited number of interactions to make
  the model tractable.

MRFs on grids
• By far the most widely used neighborhood graphs are 2-dimensional grids.
• MRFs on grids are used in spatial statistics to model spatial interactions between RVs.
• Hammersley-Clifford for grids: The only cliques are the edges!

[Diagram: 2-dimensional grid graph with 4-neighborhoods]

MRFs on grids factorize over edges.

Advanced Machine Learning 64 / 212


T HE P OTTS M ODEL

Definition
Suppose N = (V_N, W_N) is a neighborhood graph with n vertices and β > 0 a constant. Then

    p(θ_{1:n}) := (1/Z(β, W_N)) exp( β Σ_{i,j} w_ij I{θ_i = θ_j} )

defines a joint distribution of n random variables Θ_1, …, Θ_n. This distribution is called the
Potts model.

Interpretation
• If wij > 0: The overall probability increases if Θi = Θj .
• If wij < 0: The overall probability decreases if Θi = Θj .
• If wij = 0: No interaction between Θi and Θj .
Positive weights encourage smoothness.

Advanced Machine Learning 65 / 212


EXAMPLE

Ising model
The simplest choice is w_ij = 1 if (i, j) is an edge:

    p(θ_{1:n}) = (1/Z(β)) exp( β Σ_{(i,j) is an edge} I{θ_i = θ_j} )

If N is a d-dimensional grid, this model is called the Ising model.

Example
Samples from an Ising model on a 56 × 56 grid graph.

[Figure: samples for increasing β, becoming smoother as β grows]
Advanced Machine Learning 66 / 212


MRF S AS S MOOTHNESS P RIORS
We consider a spatial problem with observations Xi . Each i is a location on a grid.

Spatial model
Suppose we model each Xi by a distribution L(X|Θi ), i.e. each location i has its own parameter
variable Θi . This model is Bayesian (the parameter is a random variable). We use an MRF as
prior distribution.

[Diagram: observed variables X_i on a grid layer, each generated from an unobserved parameter
variable Θ_i via p(·|θ_i); the Θ_i form an MRF on the grid]

We can think of L(X|Θ_i) as an emission probability, similar to an HMM.

Spatial smoothing
• We can define the joint distribution (Θ1 , . . . , Θn ) as a MRF on the grid graph.
• For positive weights, the MRF will encourage the model to explain neighbors Xi and Xj by
the same parameter value. → Spatial smoothing.

Advanced Machine Learning 67 / 212


EXAMPLE: SEGMENTATION OF NOISY IMAGES

Mixture model
• A BMM can be used for image segmentation.
• The BMM prior on the component parameters is a natural conjugate prior q(θ).
• In the spatial setting, we index the parameter of each X_i separately as θ_i. For K mixture
  components, θ_{1:n} contains only K different values.
• The joint BMM prior on θ_{1:n} is

      q_BMM(θ_{1:n}) = ∏_{i=1}^n q(θ_i) .

Smoothing term
We multiply the BMM prior q_BMM(θ) with an MRF prior

      q_MRF(θ_{1:n}) = (1/Z(β)) exp( β Σ_{w_ij ≠ 0} I{θ_i = θ_j} )

This encourages spatial smoothness of the segmentation.

[Figures: a noisy SAR image and an MR image; segmentations without smoothing and with smoothing
(the smoothed solutions use fewer, more stable segments). Illustrations from the source paper on
MDP/MRF segmentation.]

Advanced Machine Learning 68 / 212
S AMPLING AND I NFERENCE

MRFs pose two main computational problems.

Problem 1: Sampling
Generate samples from the joint distribution of (Θ1 , . . . , Θn ).

Problem 2: Inference
If the MRF is used as a prior, we have to compute or approximate the posterior distribution.

Solution
• MRF distributions on grids are not analytically tractable. The only known exception is the
Ising model in 1 dimension.
• Both sampling and inference are based on Markov chain sampling algorithms.

Advanced Machine Learning 69 / 212


S AMPLING A LGORITHMS
S AMPLING A LGORITHMS

In general
• A sampling algorithm is an algorithm that outputs samples X1 , X2 , . . . from a given
distribution P or density p.
• Sampling algorithms can for example be used to approximate expectations:
    E_p[ f(X) ] ≈ (1/n) Σ_{i=1}^n f(X_i)

Inference in Bayesian models


Suppose we work with a Bayesian model whose posterior Q̂n := L(Θ|X1:n ) cannot be
computed analytically.
• We will see that it can still be possible to sample from Q̂n .
• Doing so, we obtain samples Θ1 , Θ2 , . . . distributed according to Q̂n .
• This reduces posterior estimation to a density estimation problem
(i.e. estimate Q̂n from Θ1 , Θ2 , . . .).

Advanced Machine Learning 71 / 212


P REDICTIVE D ISTRIBUTIONS

Posterior expectations
If we are only interested in some statistic of the posterior of the form E_{Q̂_n}[ f(Θ) ] (e.g. the
posterior mean), we can again approximate by

    E_{Q̂_n}[ f(Θ) ] ≈ (1/m) Σ_{i=1}^m f(Θ_i) .

Example: Predictive distribution
The posterior predictive distribution is our best guess of what the next data point x_{n+1} looks
like, given the posterior under previous observations. In terms of densities:

    p(x_{n+1} | x_{1:n}) := ∫_T p(x_{n+1} | θ) Q̂_n(dθ | X_{1:n} = x_{1:n}) .

This is one of the key quantities of interest in Bayesian statistics.

Computation from samples
The predictive is a posterior expectation, and can be approximated as a sample average:

    p(x_{n+1} | x_{1:n}) = E_{Q̂_n}[ p(x_{n+1} | Θ) ] ≈ (1/m) Σ_{i=1}^m p(x_{n+1} | Θ_i)

Advanced Machine Learning 72 / 212


BASIC SAMPLING: AREA UNDER CURVE

Say we are interested in a probability density p on the interval [a, b].

[Plot: density p on [a, b] with the area A under the curve shaded; a point (X_i, Y_i) inside A]

Key observation
Suppose we can define a uniform distribution U_A on the blue area A under the curve. If we
sample
    (X_1, Y_1), (X_2, Y_2), … ∼ iid U_A
and discard the vertical coordinates Y_i, the X_i are distributed according to p,
    X_1, X_2, … ∼ iid p .

Problem: Defining a uniform distribution is easy on a rectangular area, but difficult on an
arbitrarily shaped one.

Advanced Machine Learning 73 / 212


R EJECTION S AMPLING ON THE I NTERVAL

Solution: Rejection sampling


We can enclose p in a box, and sample uniformly from the box B.

[Plot: density p on [a, b] enclosed in a box of height c]

• We can sample (Xi , Yi ) uniformly on B by sampling


Xi ∼ Uniform[a, b] and Yi ∼ Uniform[0, c] .
• If (Xi , Yi ) ∈ A, keep the sample.
That is: If Yi ≤ p(Xi ).
• Otherwise: Discard it ("reject" it).
Result: The remaining (non-rejected) samples are uniformly distributed on A.

Advanced Machine Learning 74 / 212


S CALING
This strategy still works if we scale the curve vertically by some constant k > 0.

[Plot: the same density scaled to height k·c, enclosed in a correspondingly taller box B]

We simply draw Y_i ∼ Uniform[0, kc] instead of Y_i ∼ Uniform[0, c].

Consequence
For sampling, it is sufficient if p is known only up to normalization
(only the shape of p is known).

Advanced Machine Learning 75 / 212


D ISTRIBUTIONS K NOWN UP TO S CALING

Sampling methods usually assume that we can evaluate the target distribution p up to a constant.
That is:
    p(x) = (1/Z̃) p̃(x) ,
and we can compute p̃(x) for any given x, but we do not know Z̃.

We have to pause for a moment and convince ourselves that there are useful examples where
this assumption holds.

Example 1: Simple posterior


For an arbitrary posterior computed with Bayes' theorem, we could write

    Π(θ | x_{1:n}) = ∏_{i=1}^n p(x_i|θ) q(θ) / Z̃     with     Z̃ = ∫_T ∏_{i=1}^n p(x_i|θ) q(θ) dθ .

Provided that we can compute the numerator, we can sample without computing the
normalization integral Z̃.

Advanced Machine Learning 76 / 212


D ISTRIBUTIONS K NOWN UP TO S CALING

Example 2: Bayesian Mixture Model


Recall that the posterior of the BMM is (up to normalization):

    q̂_n(c_{1:K}, θ_{1:K} | x_{1:n}) ∝ ( ∏_{i=1}^n Σ_{k=1}^K c_k p(x_i|θ_k) ) ( ∏_{k=1}^K q(θ_k|λ, y) ) q_Dirichlet(c_{1:K})

We already know that we can discard the normalization constant, but can we evaluate the
non-normalized posterior q̃_n?
• The problem with computing q̃_n (as a function of unknowns) is that the term
  ∏_{i=1}^n ( Σ_{k=1}^K … ) blows up into K^n individual terms.
• If we evaluate q̃_n for specific values of c, x and θ, each sum Σ_{k=1}^K c_k p(x_i|θ_k) collapses
  to a single number for each x_i, and we just have to multiply those n numbers.

So: Computing q̃_n as a formula in terms of unknowns is difficult; evaluating it for specific
values of the arguments is easy.

Advanced Machine Learning 77 / 212


D ISTRIBUTIONS K NOWN UP TO S CALING

Example 3: Markov random field


In a MRF, the normalization function is the real problem.

For example, recall the Ising model:

    p(θ_{1:n}) = (1/Z(β)) exp( β Σ_{(i,j) is an edge} I{θ_i = θ_j} )

The normalization function is

    Z(β) = Σ_{θ_{1:n} ∈ {0,1}^n} exp( β Σ_{(i,j) is an edge} I{θ_i = θ_j} )

and hence a sum over 2^n terms. The general Potts model is even more difficult.

On the other hand, evaluating

    p̃(θ_{1:n}) = exp( β Σ_{(i,j) is an edge} I{θ_i = θ_j} )

for a given configuration θ1:n is straightforward.
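
For instance, a few lines of NumPy evaluate log p̃ for an Ising configuration on a grid with
4-neighborhoods (a sketch; the grid size and β are arbitrary choices):

```python
import numpy as np

def ising_unnormalized_log(theta, beta):
    """log p~(theta) = beta * (number of grid edges (i,j) with theta_i == theta_j).
    theta: 2-d array of labels on a grid with 4-neighborhoods."""
    agree_right = (theta[:, :-1] == theta[:, 1:]).sum()   # horizontal edges
    agree_down = (theta[:-1, :] == theta[1:, :]).sum()    # vertical edges
    return beta * (agree_right + agree_down)

theta = np.random.randint(0, 2, size=(56, 56))
print(ising_unnormalized_log(theta, beta=0.8))            # cheap, even though Z(beta) is intractable
```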

Advanced Machine Learning 78 / 212


R EJECTION S AMPLING ON Rd

If we are not on the interval, sampling uniformly from an enclosing box is not possible (since
there is no uniform distribution on all of R or Rd ).

Solution: Proposal density


Instead of a box, we use another distribution r to enclose p:

[Plot: target density p̃ enclosed by a scaled proposal density r]

To generate B under r, we apply similar logic as before backwards:
• Sample X_i ∼ r.
• Sample Y_i | X_i ∼ Uniform[0, r(X_i)].

r is always a simple distribution which we can sample and evaluate.

Advanced Machine Learning 79 / 212


R EJECTION S AMPLING ON Rd

[Plot: scaled target p̃ under the proposal density r]
• Choose a simple distribution r from which we know how to sample.


• Scale p̃ such that p̃(x) < r(x) everywhere.
• Sampling: For i = 1, 2, . . . ,:
1. Sample Xi ∼ r.
2. Sample Yi |Xi ∼ Uniform[0, r(Xi )].
3. If Yi < p̃(Xi ), keep Xi .
4. Else, discard Xi and start again at (1).
• The surviving samples X1 , X2 , . . . are distributed according to p.
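
A minimal sketch of this procedure on R; the Gaussian proposal, its scale, and the crude
grid-based estimate of the envelope constant k are illustrative assumptions, not part of the slides.

```python
import numpy as np

def rejection_sample(p_tilde, m, prop_sd=3.0, rng=None):
    """Rejection sampling sketch on R with a N(0, prop_sd^2) proposal r."""
    rng = np.random.default_rng(rng)
    r_pdf = lambda x: np.exp(-x ** 2 / (2 * prop_sd ** 2)) / np.sqrt(2 * np.pi * prop_sd ** 2)
    grid = np.linspace(-20, 20, 4001)
    k = 1.1 * np.max(p_tilde(grid) / r_pdf(grid))   # envelope constant so that p~(x) <= k * r(x)
    samples = []
    while len(samples) < m:
        x = rng.normal(0.0, prop_sd)                # X_i ~ r
        y = rng.uniform(0.0, k * r_pdf(x))          # Y_i | X_i ~ Uniform[0, k*r(X_i)]
        if y < p_tilde(x):                          # keep only points under the curve p~
            samples.append(x)
    return np.array(samples)

# usage: an unnormalized two-component Gaussian mixture as target
p_tilde = lambda x: np.exp(-0.5 * (x - 2) ** 2) + 0.5 * np.exp(-0.5 * (x + 2) ** 2)
draws = rejection_sample(p_tilde, m=2000)
```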

Advanced Machine Learning 80 / 212


FACTORIZATION P ERSPECTIVE

The rejection step can be interpreted in terms of probabilities and densities.

Factorization
We factorize the target distribution or density p as

    p(x) = r(x) · A(x)

where r is a distribution from which we know how to sample, and A(x) is a probability we can
evaluate once a specific value of x is given.

Sampling from the factorization

    X = X′ · Z     where X′ ∼ r and Z | X′ ∼ Bernoulli(A(X′))

Sampling Bernoulli variables with uniform variables

    Z | X′ ∼ Bernoulli(A(X′))   ⇔   Z = I{U < A(X′)} where U ∼ Uniform[0, 1] .

Advanced Machine Learning 81 / 212


I NDEPENDENCE

If we draw proposal samples Xi i.i.d. from r, the resulting sequence of accepted samples
produced by rejection sampling is again i.i.d. with distribution p. Hence:
Rejection samplers produce i.i.d. sequences of samples.

Important consequence
If samples X_1, X_2, … are drawn by a rejection sampler, the sample average

    (1/m) Σ_{i=1}^m f(X_i)

(for some function f) is an unbiased estimate of the expectation E_p[ f(X) ].

Advanced Machine Learning 82 / 212


E FFICIENCY

The fraction of accepted samples is the ratio |A|/|B| of the areas under the curves p̃ and r.

[Plot: target p̃ under proposal r; the gap between the curves corresponds to rejected samples]

If r is not a reasonably close approximation of p, we will end up rejecting a lot of proposal
samples.

Advanced Machine Learning 83 / 212


A N IMPORTANT BIT OF IMPRECISE INTUITION

[Two illustrations: example figures for sampling methods tend to look like the first; a
high-dimensional distribution of correlated RVs will look rather more like the second.]

Sampling is usually used in multiple dimensions. Reason, roughly speaking:


• Intractable posterior distributions arise when there are several interacting random

variables. The interactions make the joint distribution complicated.


• In one-dimensional problems (1 RV), we can usually compute the posterior analytically.
• Independent multi-dimensional distributions factorize and reduce to one-dimensional case.

Warning: Avoid sampling if you can solve analytically.

Advanced Machine Learning 84 / 212


W HY IS NOT EVERY SAMPLER A REJECTION SAMPLER ?

We can easily end up in situations where we accept only one in 106 (or 1010 , or 1020 ,. . . )
proposal samples. Especially in higher dimensions, we have to expect this to be not the
exception but the rule.

Advanced Machine Learning 85 / 212


I MPORTANCE S AMPLING

The rejection problem can be fixed easily if we are only interested in approximating an
expectation Ep [ f (X)].

Simple case: We can evaluate p

Suppose p is the target density and q a proposal density. An expectation under p can be
rewritten as

    E_p[ f(X) ] = ∫ f(x) p(x) dx = ∫ f(x) (p(x)/q(x)) q(x) dx = E_q[ f(X) p(X)/q(X) ]

Importance sampling
We can sample X_1, X_2, … from q and approximate E_p[ f(X) ] as

    E_p[ f(X) ] ≈ (1/m) Σ_{i=1}^m f(X_i) p(X_i)/q(X_i)

There is no rejection step; all samples are used.

This method is called importance sampling. The coefficients p(X_i)/q(X_i) are called importance
weights.

Advanced Machine Learning 86 / 212


IMPORTANCE SAMPLING

General case: We can only evaluate p̃
In the general case,
    p = (1/Z_p) p̃   and   q = (1/Z_q) q̃ ,
and Z_p (and possibly Z_q) are unknown. We can write Z_p/Z_q as

    Z_p/Z_q = ∫ p̃(x) dx / Z_q = ∫ (p̃(x)/q̃(x)) (q̃(x)/Z_q) dx = ∫ (p̃(x)/q̃(x)) q(x) dx = E_q[ p̃(X)/q̃(X) ]

Approximating the constants
The fraction Z_p/Z_q can be approximated using samples X_{1:m} from q:

    Z_p/Z_q = E_q[ p̃(X)/q̃(X) ] ≈ (1/m) Σ_{i=1}^m p̃(X_i)/q̃(X_i)

Approximating E_p[ f(X) ]

    E_p[ f(X) ] ≈ (1/m) Σ_{i=1}^m f(X_i) p(X_i)/q(X_i) = (1/m) Σ_{i=1}^m f(X_i) (Z_q/Z_p) p̃(X_i)/q̃(X_i)
               ≈ Σ_{i=1}^m f(X_i) (p̃(X_i)/q̃(X_i)) / Σ_{j=1}^m (p̃(X_j)/q̃(X_j))

Advanced Machine Learning 87 / 212


I MPORTANCE S AMPLING IN G ENERAL

Conditions
• Given are a target distribution p and a proposal distribution q.
1 1
• p= Zp
p̃ and q = Zq
q̃.
• We can evaluate p̃ and q̃, and we can sample q.
• The objective is to compute Ep [ f (X)] for a given function f .

Algorithm
1. Sample X_1, …, X_m from q.
2. Approximate E_p[ f(X) ] as

    E_p[ f(X) ] ≈ Σ_{i=1}^m f(X_i) (p̃(X_i)/q̃(X_i)) / Σ_{j=1}^m (p̃(X_j)/q̃(X_j))
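
A self-normalized importance sampling sketch of this algorithm; the Cauchy proposal and the
test expectation E_p[X²] are arbitrary illustrative choices.

```python
import numpy as np

def importance_expectation(f, p_tilde, q_sample, q_tilde, m=10000):
    """E_p[f(X)] ~= sum_i f(X_i) w_i / sum_j w_j with weights w_i = p~(X_i)/q~(X_i)."""
    X = q_sample(m)                       # X_1, ..., X_m ~ q
    w = p_tilde(X) / q_tilde(X)           # unnormalized importance weights
    return np.sum(f(X) * w) / np.sum(w)

# usage: E_p[X^2] for an unnormalized standard Gaussian target and a Cauchy proposal
p_tilde = lambda x: np.exp(-0.5 * x ** 2)
q_tilde = lambda x: 1.0 / (1.0 + x ** 2)                  # unnormalized Cauchy density
q_sample = lambda m: np.random.standard_cauchy(m)
print(importance_expectation(lambda x: x ** 2, p_tilde, q_sample, q_tilde))  # approx. 1
```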

Advanced Machine Learning 88 / 212


M ARKOV C HAIN M ONTE C ARLO
M OTIVATION

Suppose we rejection-sample a distribution like this:

region of interest

Once we have drawn a sample in the narrow region of interest, we would like to continue
drawing samples within the same region. That is only possible if each sample depends on the
location of the previous sample.

Proposals in rejection sampling are i.i.d. Hence, once we have found the region where p
concentrates, we forget about it for the next sample.

Advanced Machine Learning 90 / 212


MCMC: I DEA

Recall: Markov chain


• A sufficiently nice Markov chain (MC) has an invariant distribution Pinv .
• Once the MC has converged to Pinv , each sample Xi from the chain has marginal
distribution Pinv .

Markov chain Monte Carlo


We want to sample from a distribution with density p. Suppose we can define a MC with
invariant distribution Pinv ≡ p. If we sample X1 , X2 , . . . from the chain, then once it has
converged, we obtain samples
Xi ∼ p .
This sampling technique is called Markov chain Monte Carlo (MCMC).

Note: For a Markov chain, Xi+1 can depend on Xi , so at least in principle, it is possible for an
MCMC sampler to "remember" the previous step and remain in a high-probability location.

Advanced Machine Learning 91 / 212


C ONTINUOUS M ARKOV C HAIN

The Markov chains we discussed so far had a finite state space X. For MCMC, state space now
has to be the domain of p, so we often need to work with continuous state spaces.

Continuous Markov chain


A continuous Markov chain is defined by an initial distribution Pinit and conditional probability
t(y|x), the transition probability or transition kernel.

In the discrete case, t(y = i|x = j) is the entry pij of the transition matrix p.

Example: A Markov chain on R²

We can define a very simple Markov chain by sampling
    X_{i+1} | X_i = x_i ∼ g( · | x_i, σ²)
where g(x | μ, σ²) is a spherical Gaussian with fixed variance. In other words, the transition
distribution is
    t(x_{i+1} | x_i) := g(x_{i+1} | x_i, σ²) .

[Illustration: a Gaussian (gray contours) is placed around the current point x_i to sample X_{i+1}]

Advanced Machine Learning 92 / 212


I NVARIANT D ISTRIBUTION

Recall: Finite case


• The invariant distribution Pinv is a distribution on the finite state space X of the MC
(i.e. a vector of length |X|).
• "Invariant" means that, if Xi is distributed according to Pinv , and we execute a step
Xi+1 ∼ t( . |xi ) of the chain, then Xi+1 again has distribution Pinv .
• In terms of the transition matrix p:
p · Pinv = Pinv

Continuous case
• X is now uncountable (e.g. X = Rd ).
• The transition matrix p is substituted by the conditional probability t.
• A distribution P_inv with density p_inv is invariant if
      ∫_X t(y|x) p_inv(x) dx = p_inv(y)
  This is simply the continuous analogue of the equation Σ_i p_ij (P_inv)_i = (P_inv)_j .

Advanced Machine Learning 93 / 212


M ARKOV C HAIN S AMPLING

[Illustrations: (1) We run the Markov chain for n steps; each step moves from the current
location x_i to a new x_{i+1}. (2) We “forget” the order and regard the locations x_{1:n} as a
random set of points. (3) If p (red contours) is both the invariant and initial distribution,
each X_i is distributed as X_i ∼ p.]

Problems we need to solve


1. We have to construct a MC with invariant distribution p.
2. We cannot actually start sampling with X1 ∼ p; if we knew how to sample from p, all of
this would be pointless.
3. Each point Xi is marginally distributed as Xi ∼ p, but the points are not i.i.d.

Advanced Machine Learning 94 / 212


C ONSTRUCTING THE M ARKOV C HAIN
Given is a continuous target distribution with density p.

Metropolis-Hastings (MH) kernel


1. We start by defining a conditional probability q(y|x) on X.
q has nothing to do with p. We could e.g. choose q(y|x) = g(y|x, σ 2 ), as in the previous example.

2. We define a rejection kernel A as

      A(x_{i+1} | x_i) := min{ 1, q(x_i | x_{i+1}) p(x_{i+1}) / ( q(x_{i+1} | x_i) p(x_i) ) }

   The normalization of p cancels in the quotient, so knowing p̃ is again enough.

3. We define the transition probability of the chain as

      t(x_{i+1} | x_i) := q(x_{i+1} | x_i) A(x_{i+1} | x_i) + δ_{x_i}(x_{i+1}) c(x_i)

   where c(x_i) := ∫ q(y | x_i)(1 − A(y | x_i)) dy is the total probability that a proposal is
   sampled and then rejected.

Sampling from the MH chain


At each step i + 1, generate a proposal X ∗ ∼ q( . |xi ) and Ui ∼ Uniform[0, 1].
• If Ui ≤ A(x∗ |xi ), accept proposal: Set xi+1 := x∗ .

• If Ui > A(x∗ |xi ), reject proposal: Set xi+1 := xi .
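
A random-walk Metropolis sketch of this scheme; the Gaussian proposal is symmetric, so the
q-terms cancel in A, and working with log p̃ avoids under/overflow. The step size, chain length
and example target are illustrative assumptions.

```python
import numpy as np

def metropolis_hastings(log_p_tilde, x0, n_steps, step=1.0, rng=None):
    """Random-walk Metropolis sketch with a symmetric Gaussian proposal q(.|x)."""
    rng = np.random.default_rng(rng)
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    chain = np.empty((n_steps, x.size))
    log_px = log_p_tilde(x)
    for i in range(n_steps):
        x_prop = x + step * rng.normal(size=x.size)       # proposal X* ~ q(.|x_i)
        log_pprop = log_p_tilde(x_prop)
        if np.log(rng.uniform()) <= log_pprop - log_px:   # accept with probability A(x*|x_i)
            x, log_px = x_prop, log_pprop
        chain[i] = x                                      # on rejection, x_{i+1} := x_i
    return chain

# usage: sample a correlated 2-d Gaussian known only up to normalization
cov_inv = np.linalg.inv(np.array([[1.0, 0.8], [0.8, 1.0]]))
log_p = lambda x: -0.5 * x @ cov_inv @ x
samples = metropolis_hastings(log_p, x0=[0.0, 0.0], n_steps=20000)[5000:]   # discard burn-in
```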

Advanced Machine Learning 95 / 212


P ROBLEM 1: I NITIAL DISTRIBUTION

Recall: Fundamental theorem on Markov chains


Suppose we sample X1 ∼ Pinit and Xi+1 ∼ t( . |xi ). This defines a distribution Pi of Xi , which
can change from step to step. If the MC is nice (recall: recurrent and aperiodic), then
Pi → Pinv for i→∞.

Note: Making precise what aperiodic means in a continuous state space is a bit more technical than in the finite case, but the
theorem still holds. We will not worry about the details here.

Implication
• If we can show that Pinv ≡ p, we do not have to know how to sample from p.
• Instead, we can start with any Pinit , and will get arbitrarily close to p for sufficiently large i.

Advanced Machine Learning 96 / 212


B URN -I N AND M IXING T IME

The number m of steps required until Pm ≈ Pinv ≡ p is called the mixing time of the Markov
chain. (In probability theory, there is a range of definitions for what exactly Pm ≈ Pinv means.)

In MC samplers, the first m samples are also called the burn-in phase. The first m samples of
each run of the sampler are discarded:

    X_1, …, X_{m−1}   |   X_m, X_{m+1}, …
    burn-in; discard  |   samples from (approximately) p; keep

Convergence diagnostics
In practice, we do not know how large m is. There are a number of methods for assessing whether
the sampler has mixed. Such heuristics are often referred to as convergence diagnostics.

Advanced Machine Learning 97 / 212


PROBLEM 2: SEQUENTIAL DEPENDENCE

Even after burn-in, the samples from a MC are not i.i.d.

Strategy
• Estimate empirically how many steps L are needed for x_i and x_{i+L} to be approximately
  independent. The number L is called the lag.
• After burn-in, keep only every Lth sample; discard samples in between.

Estimating the lag
The most common method uses the autocorrelation function:

    Auto(x_i, x_j) := E[ (x_i − μ_i)(x_j − μ_j) ] / (σ_i σ_j)

We compute Auto(x_i, x_{i+L}) empirically from the sample for different values of L, and find
the smallest L for which the autocorrelation is close to zero.

[Autocorrelation plots: empirical autocorrelation against the lag L, decaying towards zero]
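
A sketch of this empirical computation; the threshold 0.1 for “close to zero” is an arbitrary choice.

```python
import numpy as np

def autocorrelation(x, max_lag=50):
    """Empirical autocorrelation Auto(x_i, x_{i+L}) of a 1-d chain for L = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    var = x.var()
    return np.array([np.mean(x[: len(x) - L] * x[L:]) / var for L in range(max_lag + 1)])

def estimate_lag(x, threshold=0.1, max_lag=50):
    """Smallest L whose autocorrelation magnitude falls below the threshold."""
    ac = np.abs(autocorrelation(x, max_lag))
    below = np.where(ac < threshold)[0]
    return int(below[0]) if below.size else max_lag

# usage with an MCMC chain `samples`: thin by the estimated lag
# L = estimate_lag(samples[:, 0]); thinned = samples[::L]
```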

Advanced Machine Learning 98 / 212


C ONVERGENCE D IAGNOSTICS

There are about half a dozen popular convergence criteria; the one below is an example.

Gelman-Rubin criterion
• Start several chains at random. For each chain k, sample Xik
has a marginal distribution Pki .
• The distributions of Pki will differ between chains in early
stages.
• Once the chains have converged, all Pi = Pinv are identical.
• Criterion: Use a hypothesis test to compare Pki for different k
(e.g. compare P2i against null hypothesis P1i ). Once the test
does not reject anymore, assume that the chains are past
burn-in.

Reference: A. Gelman and D. B. Rubin: "Inference from Iterative Simulation Using Multiple Sequences", Statistical Science, Vol. 7 (1992) 457-511.

Advanced Machine Learning 99 / 212


S TOCHASTIC H ILL -C LIMBING

The Metropolis-Hastings rejection kernel was defined as:


    A(x_{i+1} | x_i) = min{ 1, q(x_i | x_{i+1}) p(x_{i+1}) / ( q(x_{i+1} | x_i) p(x_i) ) } .

Hence, we certainly accept if the second term is larger than 1, i.e. if
    q(x_i | x_{i+1}) p(x_{i+1}) > q(x_{i+1} | x_i) p(x_i) .
That means:
• We always accept the proposal value xi+1 if it increases the probability under p.

• If it decreases the probability, we still accept with a probability which depends on the
difference to the current probability.

Hill-climbing interpretation
• The MH sampler somewhat resembles a gradient ascent algorithm on p, which tends to
move in the direction of increasing probability p.
• However:
• The actual steps are chosen at random.
• The sampler can move "downhill" with a certain probability.
• When it reaches a local maximum, it does not get stuck there.

Advanced Machine Learning 100 / 212


S ELECTING A P ROPOSAL D ISTRIBUTION
Everyone’s favorite example: Two Gaussians

• Var[q] too large:


Will overstep p; many rejections.
• Var[q] too small:
Many steps needed to achieve good
coverage of domain.
If p is unimodal and can be roughly approximated by a Gaussian, Var[q] should be
chosen as the smallest covariance component of p.

[Illustration: red = target distribution p, gray = proposal distribution q]

More generally
For complicated posteriors (recall: small regions of concentration, large low-probability regions
in between) choosing q is much more difficult. To choose q with good performance, we already
need to know something about the posterior.

There are many strategies, e.g. mixture proposals (with one component for large steps and one
for small steps).

Advanced Machine Learning 101 / 212


S UMMARY: MH S AMPLER

• MCMC samplers construct a MC with invariant distribution p.


• The MH kernel is one generic way to construct such a chain from p and a proposal
distribution q.
• Formally, q does not depend on p (but arbitrary choice of q usually means bad
performance).
• We have to discard an initial number m of samples as burn-in to obtain samples
(approximately) distributed according to p.
• After burn-in, we keep only every Lth sample (where L = lag) to make sure the xi are
(approximately) independent.

    X_1, …, X_{m−1}   |   X_m (keep), X_{m+1}, …, X_{m+L−1}   |   X_{m+L} (keep), X_{m+L+1}, …, X_{m+2L−1}   |   X_{m+2L} (keep), …
    burn-in; discard  |   correlated with X_m; discard        |   correlated with X_{m+L}; discard

Advanced Machine Learning 102 / 212


T HE G IBBS SAMPLER
G IBBS S AMPLING

By far the most widely used MCMC algorithm is the Gibbs sampler.

Full conditionals
Suppose L(X) is a distribution on RD , so X = (X1 , . . . , XD ). The conditional probability of the
entry Xd given all other entries,
L(Xd |X1 , . . . , Xd−1 , Xd+1 , . . . , XD )
is called the full conditional distribution of Xd .
On RD , that means we are interested in a density
p(xd |x1 , . . . , xd−1 , xd+1 , . . . , xD )

Gibbs sampling
The Gibbs sampler is the special case of the Metropolis-Hastings algorithm defined by
    proposal distribution for X_d = full conditional of X_d .

• Gibbs sampling is only applicable if we can compute the full conditionals for each
dimension d.
• If so, it provides us with a generic way to derive a proposal distribution.

Advanced Machine Learning 104 / 212


T HE G IBBS S AMPLER

Proposal distribution
Suppose p is a distribution on RD , so each sample is of the form Xi = (Xi,1 , . . . , Xi,D ). We
generate a proposal X_{i+1} coordinate-by-coordinate as follows:

    X_{i+1,1} ∼ p( · | x_{i,2}, …, x_{i,D} )
    ⋮
    X_{i+1,d} ∼ p( · | x_{i+1,1}, …, x_{i+1,d−1}, x_{i,d+1}, …, x_{i,D} )
    ⋮
    X_{i+1,D} ∼ p( · | x_{i+1,1}, …, x_{i+1,D−1} )

Note: Each new X_{i+1,d} is immediately used in the update of the next dimension d + 1.

A Metropolis-Hastings algorithm with proposals generated as above is called a Gibbs sampler.

No rejections
It is straightforward to show that the Metropolis-Hastings acceptance probability for each
xi+1,d+1 is 1, so proposals in Gibbs sampling are always accepted.

Advanced Machine Learning 105 / 212


E XAMPLE : MRF

In a MRF with D nodes, each dimension d corresponds to one vertex.

[Diagram: a vertex Θ_d on the grid with its four neighbors Θ_left, Θ_right, Θ_up, Θ_down]

Full conditionals
In a grid with 4-neighborhoods, for instance, the Markov property implies that
    p(θ_d | θ_1, …, θ_{d−1}, θ_{d+1}, …, θ_D) = p(θ_d | θ_left, θ_right, θ_up, θ_down)

Specifically: Potts model with binary weights
Recall that, for sampling, knowing only p̃ (unnormalized) is sufficient:
    p̃(θ_d | θ_1, …, θ_{d−1}, θ_{d+1}, …, θ_D) = exp( β( I{θ_d = θ_left} + I{θ_d = θ_right} + I{θ_d = θ_up} + I{θ_d = θ_down} ) )

This is clearly very efficiently computable.

Advanced Machine Learning 106 / 212


E XAMPLE : MRF

Sampling the Potts model


Each step of the sampler generates a sample
θi = (θi,1 , . . . , θi,D ) ,
where D is the number of vertices in the grid.

Gibbs sampler
Each step of the Gibbs sampler generates D updates according to

    θ_{i+1,d} ∼ p( · | θ_{i+1,1}, …, θ_{i+1,d−1}, θ_{i,d+1}, …, θ_{i,D} )
             ∝ exp( β( I{θ_{i+1,d} = θ_left} + I{θ_{i+1,d} = θ_right} + I{θ_{i+1,d} = θ_up} + I{θ_{i+1,d} = θ_down} ) )
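
A direct (and deliberately simple, hence slow) Gibbs sweep over such a grid might look as
follows; the grid size, number of states K, β and the number of sweeps are illustrative choices.

```python
import numpy as np

def gibbs_potts(shape=(56, 56), K=2, beta=0.8, sweeps=200, rng=None):
    """Gibbs sampler sketch for a Potts model with binary edge weights on a grid
    with 4-neighborhoods. One 'step' here is a full sweep over all vertices."""
    rng = np.random.default_rng(rng)
    theta = rng.integers(0, K, size=shape)
    H, W = shape
    for _ in range(sweeps):
        for i in range(H):
            for j in range(W):
                # for every candidate state, count how many of the 4 neighbors agree with it
                agree = np.zeros(K)
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < H and 0 <= nj < W:
                        agree[theta[ni, nj]] += 1
                p = np.exp(beta * agree)          # full conditional, up to normalization
                theta[i, j] = rng.choice(K, p=p / p.sum())
    return theta

sample = gibbs_potts(sweeps=50)                   # short run; real mixing can need far more sweeps
```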

Advanced Machine Learning 107 / 212


B URN -I N M ATTERS
This example is due to Erik Sudderth (UC Irvine).
MRFs as "segmentation" priors
[Figure: a sample from a 5-state Potts model on a grid, drawn by a Gibbs sampler after 200 iterations.]
• MRFs were introduced as tools for image smoothing and segmentation by D. and S.
Geman in 1984.
• They sampled from a Potts model with a Gibbs sampler, discarding 200 iterations as
burn-in.
• Such a sample (after 200 steps) is shown above, for a Potts model in which each variable
can take one out of 5 possible values.
• These patterns led computer vision researchers to conclude that MRFs are "natural" priors
for image segmentation, since samples from the MRF resemble a segmented image.

Advanced Machine Learning 108 / 212


E XAMPLE : B URN -I N M ATTERS

E. Sudderth ran a Gibbs sampler on the same model and sampled after 200 iterations
(as the Geman brothers did), and again after 10,000 iterations:

[Figure: samples from several independent chains (Chain 1, . . . , Chain 5), after 200 iterations
and after 10,000 iterations.]

• The "segmentation" patterns are not sampled from the MRF distribution p ≡ Pinv , but
rather from P200 ≠ Pinv .
• The patterns occur not because MRFs are "natural" priors for segmentations, but because
the sampler’s Markov chain has not mixed.
• MRFs are smoothness priors, not segmentation priors.

Advanced Machine Learning 109 / 212


VARIATIONAL I NFERENCE
VARIATIONAL I NFERENCE : I DEA

Problem
We have to solve an inference problem where the correct solution is an “intractable” distribution
with density p∗ (e.g. a complicated posterior in a Bayesian inference problem).

Variational approach
Approximate p∗ as
q∗ := arg min φ(q, p∗ )
q∈Q
where Q is a class of simple distributions and φ is a cost function (small φ means good fit).
That turns the inference problem into a constrained optimization problem
min φ(q, p∗ )
s.t. q ∈ Q

Variational inference approximates a complicated distribution by minimizing the distance


(or discrepancy) to a class of tractable distributions.

Advanced Machine Learning 111 / 212


BACKGROUND : VARIATIONAL M ETHODS

Recall: Optimization approach to problems


Formulate your problem such that the solution x∗ ∈ Rd is the minimum of some function f , and
solve
x∗ := arg min f (x)
x∈Rd
possibly under constraints.
Examples: Support vector machines, linear regression, logistic regression, . . .

Inference problem as above


q∗ := arg min φ(q, p∗ )
q∈Q

• q is now a function (a density), not a point in Rd .


• We have to optimize over a space of functions. Such spaces are in general
infinite-dimensional.
• Often: Q is a parametric model, with parameter space T ⊂ Rd
→ reduces to optimization over Rd .
• However: Optimization over infinite-dimensional spaces is in principle possible.

Advanced Machine Learning 112 / 212


O PTIMIZATION OF FUNCTIONALS

• Let F be a space of functions (e.g. all continuous functions on R).


• A function φ : F → R (a function whose arguments are functions) is called a functional.
• Examples: (1) The integral of a function. (2) The differential entropy of a density.

Recall: Derivatives (on R)


The differential of f : Rd → R at point x is
    δf(x) = lim_{ε→0} [ f(x + ε) − f(x) ] / ε    if d = 1,    or    δf(x) = lim_{‖x̃‖→0} [ f(x + x̃) − f(x) ] / ‖x̃‖    in general.
The d-dimensional case works by reducing to the 1-dimensional case using a norm.

Derivatives of functionals
If F is a function space and ‖ · ‖ a norm on F , we can apply the same idea to φ : F → R:
    δφ(f) := lim_{‖f̃‖→0} [ φ(f + f̃) − φ(f) ] / ‖f̃‖
δφ(f ) is called the Fréchet derivative of φ at f .
f is a minimum of a Fréchet-differentiable functional φ only if δφ(f ) = 0.

Advanced Machine Learning Not examinable. 113 / 212


O PTIMIZATION OF F UNCTIONALS
Optimization
We can in principle find a minimum of φ by gradient descent: Add increment functions ∆fk “in
the direction of” δφ(fk ) to the current solution candidate fk .
The maximum entropy problem is often cited as an example.

Problems
• We have to represent the infinite-dimensional quantities fk and ∆fk in some way.
• Many interesting functionals φ are not Fréchet-differentiable as functionals on F . They
only become differentiable when constrained to a much smaller subspace.
One solution is “variational calculus”, an analytic technique that addresses both problems. (We
will not need the details.)

Recall: Maximum entropy principle


• The maximum entropy principle chooses a distribution within some set P of candidates
by selecting the one with the largest entropy.
• That is: It solves the optimization problem
max H(p)
s.t. p ∈ P
• For example, if P are all those distributions under which some given statistic S takes a
given expected value, we obtain exponential family distributions with sufficient statistic S.

Advanced Machine Learning Not examinable. 114 / 212


O PTIMIZATION OF F UNCTIONALS
Maximum entropy as functional optimization
• The entropy H assigns a scalar to a distribution → functional!
• Problem: The entropy as a functional e.g. on all distributions on R is concave, but it is not
differentiable; it is not even continuous.
• The solution for exponential families can be determined using variational calculus.

Functional optimization in machine learning


• We will be interested in problems of the form
min φ(q)
q

s.t. q ∈ Q
where Q is a parametric family.
• That means each element of Q is of the form q( • |θ), for θ ∈ T ⊂ Rd .
• The problem then reduces back to optimization in Rd :
min φ(q( • |θ))
θ
s.t. θ ∈ T
• We can apply gradient descent, Newton, etc.

Advanced Machine Learning Not examinable. 115 / 212


K ULLBACK -L EIBLER D IVERGENCE
Recall
The information in observing X = x under a probability mass function P is
    JP (x) := log( 1/P(x) ) = − log P(x) .
Its expectation H(P) := EP [JP (X)] is the entropy of P.
The Kullback-Leibler divergence of P and Q is
    DKL (P ‖ Q) := EP [JQ (X)] − H(P) = Σ_x P(x) log( P(x)/Q(x) )

Entropy and KL divergence for densities


If p and q are probability densities, then
    H(p) := − ∫ p(x) log p(x) dx    and    DKL (p ‖ q) := ∫ p(x) log( p(x)/q(x) ) dx
are the differential entropy of p and the Kullback-Leibler divergence of p and q.

Be careful
• The differential entropy does not behave like the entropy (e.g. it can be negative).
• The KL divergence for densities has properties analogous to the mass function case.
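A small sketch of DKL for mass functions given as arrays (the example values are arbitrary); it also shows the asymmetry used on the next slides:

import numpy as np

def kl_divergence(p, q):
    """DKL(p||q) for two probability mass functions.
    Terms with p(x) = 0 contribute 0; q must be positive wherever p is."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(kl_divergence(p, q), kl_divergence(q, p))   # not symmetric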

Advanced Machine Learning 116 / 212


VARIATIONAL I NFERENCE WITH KL
Recall VI optimization problem
q∗ := arg min φ(q, p∗ )
q∈Q
We have to choose a cost function.
The term “variational inference” in machine learning typically implies φ is a KL divergence,
q∗ := arg min DKL (q, p∗ )
q∈Q

Order of the arguments


Recall that DKL is not symmetric, so
    DKL (q ‖ p∗ ) ≠ DKL (p∗ ‖ q)
Which order should we use?
• Recall DKL (p ‖ q) is an expectation with respect to p.
• DKL (p∗ ‖ q) emphasizes regions where the “true” model p∗ has high probability. That is
  what we should use if possible.
• We use VI because p∗ is intractable, so we can usually not compute expectations under it.
• We use the expectation DKL (q ‖ p∗ ) under the approximating, simpler model instead.
We have to understand the implications of this choice.

Advanced Machine Learning 117 / 212


E XAMPLE

Approximating a Gaussian by a spherical Gaussian


What VI would do if possible                              What VI does

[Figure: a correlated Gaussian p∗ (green) approximated by a spherical Gaussian q (red), for the
two orderings of the KL arguments.]

DKL (p∗ ‖ q) = DKL (green ‖ red)                          DKL (q ‖ p∗ ) = DKL (red ‖ green)

Advanced Machine Learning Illustration: Bishop (2006) 118 / 212


E XAMPLE

Approximating a Gaussian mixture by a single Gaussian

What VI would do if possible                              What VI does

DKL (p∗ ‖ q) = DKL (blue ‖ red)                           DKL (q ‖ p∗ ) = DKL (red ‖ blue)

Advanced Machine Learning Illustration: Bishop (2006) 119 / 212


VI FOR P OSTERIOR D ISTRIBUTIONS

Often: Target distribution is a posterior of a parameter or latent variable Z, given data x.

Basic approximation problem


If the posterior density is p∗ (z) = p(z|x) = p(x|z)p(z) / p(x), then
    q∗ ( • ) = arg min_{q∈Q} DKL ( q( • ) ‖ p( • |x) ) .

Transforming the objective function


    DKL ( q( • ) ‖ p( • |x) ) = E[ log( q(Z) / p(Z|x) ) ]
                              = E[log q(Z)] − E[log p(Z|x)]
                              = E[log q(Z)] − E[log p(Z, x)] + log p(x)

• The evidence p(x) is hard to compute (it is the integral of p(x|z)p(z) over z).


• It depends only on x, so it is an additive constant w.r.t. the optimization problem.
• Dropping it from the objective function does not change the location of the minimum.
F(q) := E[log q(Z)] − E[log p(Z, x)]
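Both expectations are taken under q, so F(q) can be estimated by simple Monte Carlo once we can sample from q. A minimal sketch (the function names sample_q, log_q, log_joint are placeholders for whatever model is being approximated):

import numpy as np

def free_energy_estimate(sample_q, log_q, log_joint, n_samples=1000):
    """Monte Carlo estimate of F(q) = E_q[log q(Z)] - E_q[log p(Z, x)].
    sample_q() draws Z ~ q; log_q(z), log_joint(z) evaluate log q(z) and log p(z, x)."""
    zs = [sample_q() for _ in range(n_samples)]
    return float(np.mean([log_q(z) - log_joint(z) for z in zs]))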

Advanced Machine Learning 120 / 212


VI FOR P OSTERIOR D ISTRIBUTIONS

Summary: VI approximation

min F(q)
where F(q) = E[log q(Z)] − E[log p(Z, x)]
s.t. q∈Q

Terminology
• The function F is called a free energy in statistical physics.
• Since there are different forms of free energies, various authors attach different adjectives
(variational free energy, Helmholtz free energy, etc).
• Parts of the machine learning literature have renamed F, by maximizing the objective
function −F and calling it an evidence lower bound, since
−F(q) + DKL (q ‖ p( • |x)) = log p(x) ,    hence    e^{−F(q)} ≤ p(x) ,
and p(x) is the “evidence” in the Bayes equation.

Advanced Machine Learning 121 / 212


M EAN F IELD A PPROXIMATION
Definition
A variational approximation of a probability distribution p on a d-dimensional space
q∗ := arg min DKL (q, p∗ )
q∈Q

is called mean field approximation if each element of Q is factorial,


q(z) = q1 (z1 ) . . . qd (zd ) .

In previous example
[Figure: the Gaussian approximations from the previous example.]

Mean field (spherical Gaussian)                           Not a mean field
Advanced Machine Learning 122 / 212
E XAMPLE : M EAN F IELD FOR THE P OTTS M ODEL

Model
We consider an MRF distribution for X1 , . . . , Xn with values in {−1, +1}, given by
    P(X1 , . . . , Xn ) = (1/Z(β)) exp(−βH(X1 , . . . , Xn ))    where    H = − Σ_{i,j=1}^n wij Xi Xj − Σ_{i=1}^n hi Xi
(the second sum is the “external field”).

wij and hi are constant weights.

Physicists call this a “Potts model with an external magnetic field”.

Variational approximation
We choose Q as the family
    Q := { ∏_{i=1}^n Q_{mi} | mi ∈ [−1, 1] }    where    Qm (X) := (1+m)/2 if X = 1,   (1−m)/2 if X = −1

Each factor is a Bernoulli( (1+m)/2 ), except that the range is {−1, 1} not {0, 1}.

Advanced Machine Learning 123 / 212


E XAMPLE : M EAN F IELD FOR THE P OTTS M ODEL

Optimization Problem
    min   DKL ( ∏_{i=1}^n Q_{mi} ‖ P )
    s.t.  mi ∈ [−1, 1] for i = 1, . . . , n

Mean field solution


The mean field approximation is given by the parameter values mi satisfying the equations
    mi = tanh( β ( Σ_{j=1}^n wij mj + hi ) ) .

That is: For given values of wij and hi , we have to solve for the values mi .
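One simple way to solve these coupled equations is fixed-point iteration: repeatedly substitute the current values of m into the right-hand side. A minimal sketch (the initialization, tolerance and iteration cap are arbitrary choices; below a critical β the fixed point reached can depend on the initialization):

import numpy as np

def mean_field_potts(W, h, beta, n_iter=500, tol=1e-8, seed=0):
    """Solve m_i = tanh(beta * (sum_j W_ij m_j + h_i)) by fixed-point iteration.
    W: symmetric (n, n) weight matrix, h: (n,) external field."""
    rng = np.random.default_rng(seed)
    m = rng.uniform(-0.1, 0.1, size=len(h))   # small random start to break symmetry
    for _ in range(n_iter):
        m_new = np.tanh(beta * (W @ m + h))
        if np.max(np.abs(m_new - m)) < tol:
            return m_new
        m = m_new
    return m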

Advanced Machine Learning 124 / 212


E XAMPLE : M EAN F IELD FOR THE P OTTS M ODEL
We have approximated the MRF
    P(X1 , . . . , Xn )    by    ∏_{i=1}^n Bernoulli(mi )    satisfying    mi = tanh( β ( Σ_{j=1}^n wij mj + hi ) )

(where we interpret a 0 generated by the Bernoulli as a −1).

Interpretation
• In the MRF P, the random variables Xi interact.
• There is no interaction in the approximation.
• Instead, the effect of interactions is approximated by encoding them in the parameters.
• This is somewhat like a single effect (“field”) acting on all variables simultaneously
(“mean field”).

How accurate is the approximation?


• In physics, P is used to model a ferromagnet with an external magnetic field.
• In this case, β is the inverse temperature.
• These systems exhibit a phenomenon called spontaneous magnetization at certain
temperatures. The mean field solution predicts spontaneous magnetization, but at the
wrong temperature values.

Advanced Machine Learning 125 / 212


D IRECTED G RAPHICAL M ODELS :
M IXTURES AND A DMIXTURES
N EXT: BAYESIAN M IXTURES AND A DMIXTURES

We will consider two variations on finite mixtures:


• Bayesian mixture models (mixtures with priors).

• Admixtures, in which the generation of each observation (e.g. document) can be


influenced by several components (e.g. topics).
• One particular admixture model, called latent Dirichlet allocation, is one of the most
successful machine learning models of the past ten years.

Advanced Machine Learning 127 / 212


F INITE M IXTURE AS A G RAPHICAL M ODELS
    π(x) = Σ_{k≤K} ck p(x|θk )

Sampling from this model


1. Fix ck and θk for k = 1, . . . , K.
2. Generate Zi ∼ Multinomial(c1 , . . . , cK ).
3. Generate Xi |Zi ∼ p( • |θZi ).
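A minimal generative sketch of these three steps, using 1-d Gaussian components as one possible choice of p(x|θ) (the weights and means below are arbitrary illustration values):

import numpy as np

def sample_finite_mixture(n, c, means, sigma=1.0, rng=None):
    """Draw n points from a mixture of 1-d Gaussians with weights c and the given means."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.choice(len(c), size=n, p=c)                      # Z_i ~ Multinomial(c_1, ..., c_K)
    x = rng.normal(loc=np.asarray(means)[z], scale=sigma)    # X_i | Z_i ~ p(.|theta_{Z_i})
    return x, z

x, z = sample_finite_mixture(1000, c=[0.3, 0.7], means=[-2.0, 3.0])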

As a graphical model
Box notation indicates
c c and θ are not random

Z1 ... Zn
θ

X1 ... Xn

Advanced Machine Learning 128 / 212


P LATE N OTATION
If variables are sampled repeatedly in a graphical model, we enclose these variables in “plate”.

This means: Draw n (conditionally) independent samples from X.

Finite mixture with plate notation


c c

Z1 ... Zn = Z
θ

X1 ... Xn θ X
n

Advanced Machine Learning 129 / 212


BAYESIAN M IXTURE M ODEL

Recall: Mixing distribution of a FMM


    π(x) = Σ_{k=1}^K ck p(x|θk ) = ∫_T p(x|θ) m(θ) dθ    with    m := Σ_{k=1}^K ck δθk
All parameters are summarized in the mixing distribution m.

Bayesian mixture model: Idea


In a Bayesian model, parameters are random variables. Here, that means a random mixing
distribution:
    M( . ) = Σ_{k=1}^K Ck δΘk ( . )

Advanced Machine Learning 130 / 212


R ANDOM M IXING D ISTRIBUTION

How can we define a random distribution?


Since M is discrete with finitely many terms, we only have to generate the random variables Ck
and Θk :
    M( . ) = Σ_{k=1}^K Ck δΘk ( . )

More precisely
Specifically, the term BMM implies that all priors are natural conjugate priors. That is:
• The mixture components p(x|θ) are an exponential family model.
• The prior on each Θk is a natural conjugate prior of p.
• The prior of the vector (C1 , . . . , CK ) is a Dirichlet distribution.

Explanation: Dirichlet distribution


• When we sample from a finite mixture, we choose a component k from a multinomial
distribution with parameter vector (c1 , . . . , ck ).
• The conjugate prior of the multinomial is the Dirichlet distribution.

Advanced Machine Learning 131 / 212


BAYESIAN M IXTURE M ODELS

Definition
A model of the form
    π(x) = Σ_{k=1}^K Ck p(x|Θk ) = ∫_T p(x|θ) M(θ) dθ

is called a Bayesian mixture model if p(x|θ) is an exponential family model and M a random
mixing distribution, where:
• Θ1 , . . . , ΘK ∼iid q( . |λ, y), where q is a natural conjugate prior for p.

• (C1 , . . . , CK ) is sampled from a K-dimensional Dirichlet distribution.

Advanced Machine Learning 132 / 212


BAYESIAN M IXTURE AS A G RAPHICAL M ODEL
Sampling from a Bayesian Mixture
1. Draw C = (C1 , . . . , Ck ) from a Dirichlet prior.
2. Draw Θ1 , . . . , ΘK ∼iid q, where q is the conjugate prior of p.
3. Draw Zi |C ∼ Multinomial(C).
4. Draw Xi |Zi , Θ ∼ p( • |ΘZi ).
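A generative sketch of these four steps for a Bayesian mixture of 1-d Gaussians with known variance (so that a Normal prior on the means plays the role of the conjugate prior q); the hyperparameter values are arbitrary illustration choices:

import numpy as np

def sample_bayesian_mixture(n, K, alpha=1.0, tau=3.0, sigma=1.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    C = rng.dirichlet(np.full(K, alpha))          # 1. C ~ Dirichlet
    Theta = rng.normal(0.0, tau, size=K)          # 2. Theta_k ~ conjugate prior (Normal here)
    Z = rng.choice(K, size=n, p=C)                # 3. Z_i | C ~ Multinomial(C)
    X = rng.normal(Theta[Z], sigma)               # 4. X_i | Z_i, Theta ~ p(.|Theta_{Z_i})
    return X, Z, C, Theta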

As a graphical model

Θ X
n

Advanced Machine Learning 133 / 212


BAYESIAN M IXTURE : I NFERENCE

Posterior distribution
The posterior density of a BMM under observations x1 , . . . , xn is (up to normalization):
    Π(c1:K , θ1:K | x1:n ) ∝ ( ∏_{i=1}^n Σ_{k=1}^K ck p(xi |θk ) ) ( ∏_{k=1}^K q(θk |λ, y) ) qDirichlet (c1:K )

The posterior is analytically intractable


• Thanks to conjugacy, we can evaluate each term of the posterior.
• However: Due to the ∏_{i=1}^n Σ_{k=1}^K ( . . . ) factor, multiplying out the posterior gives K^n terms!
• Even for 10 clusters and 100 observations, that is impossible to compute.

Advanced Machine Learning 134 / 212


G IBBS S AMPLER FOR THE BMM

This Gibbs sampler is a bit harder to derive, so we skip the derivation and only look at the algorithm.

Recall: Bayesian mixture model


• Exponential family likelihood p(x|θk ) for each cluster k = 1, . . . , K.
• Natural conjugate prior q for all θk .
• Dirichlet prior Dirichlet(α, g) for the mixture weights c1:K .

Assignment probabilities
Each step of the Gibbs sampler computes an n × K assignment matrix a = (aik ), with
    aik = Pr{ xi in cluster k } .
Entries are computed as they are in the EM algorithm:
    aik = Ck p(xi |Θk ) / Σ_{l=1}^K Cl p(xi |Θl )
In contrast to EM, the values Ck and Θk are random.

Advanced Machine Learning 135 / 212


G IBBS FOR BMM: A LGORITHM
In each iteration j, the algorithm cycles through these steps:

1. For each xi , sample an assignment (the aik are computed exactly as in EM):
       Zi^j ∼ Multinomial( ai1^j , . . . , aiK^j )    where    aik^j = Ck^{j−1} p(xi |Θk^{j−1}) / Σ_{l=1}^K Cl^{j−1} p(xi |Θl^{j−1})

2. For each cluster k, sample a new value for Θk^j from the conjugate posterior Π(Θk ) under
   the observations currently assigned to k:
       Θk^j ∼ Π( Θ | λ + Σ_{i=1}^n I{Zi^j = k} ,  y + Σ_{i=1}^n I{Zi^j = k} S(xi ) )
   (the first sum counts the points currently assigned to k; the second aggregates S(xi ) over cluster k)

3. Sample new cluster proportions C1:K^j from the Dirichlet posterior (under all xi ):
       C1:K^j ∼ Dirichlet( α + n, g1:K^j )    where    gk^j = ( α · gk + Σ_{i=1}^n I{Zi^j = k} ) / ( α + n )
   (α · gk is the prior concentration times the prior expectation of Ck ; the sum counts the points
   currently assigned to k; α + n normalizes)
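A compact sketch of this sampler for 1-d Gaussian clusters with known variance sigma2 and a Normal(0, tau2) prior on the means (so the conjugate posterior in step 2 is again Normal); all hyperparameter values are illustration choices, not prescribed by the slides:

import numpy as np

def gibbs_bmm(x, K, alpha=1.0, sigma2=1.0, tau2=10.0, n_iter=200, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    C = np.full(K, 1.0 / K)
    Theta = rng.normal(0.0, np.sqrt(tau2), size=K)
    for _ in range(n_iter):
        # 1. assignment probabilities a_ik and cluster labels Z_i
        logp = np.log(C)[None, :] - 0.5 * (x[:, None] - Theta[None, :]) ** 2 / sigma2
        a = np.exp(logp - logp.max(axis=1, keepdims=True))
        a /= a.sum(axis=1, keepdims=True)
        Z = np.array([rng.choice(K, p=row) for row in a])
        # 2. cluster means from the conjugate Normal posterior
        for k in range(K):
            xk = x[Z == k]
            prec = 1.0 / tau2 + len(xk) / sigma2
            Theta[k] = rng.normal(xk.sum() / (sigma2 * prec), np.sqrt(1.0 / prec))
        # 3. mixture weights from the Dirichlet posterior
        counts = np.bincount(Z, minlength=K)
        C = rng.dirichlet(alpha / K + counts)
    return Z, Theta, C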

Advanced Machine Learning 136 / 212


C OMPARISON : EM AND K- MEANS

The BMM Gibbs sampler looks very similar to the EM algorithm, with maximization steps (in
EM) substituted by posterior sampling steps:
Representation of assignments Parameters
EM Assignment probabilities ai,1:K aik -weighted MLE
K-means mi = arg maxk (ai,1:K ) MLE for each cluster
Gibbs for BMM mi ∼ Multinomial(ai,1:K ) Sample posterior for each cluster

Advanced Machine Learning 137 / 212


T OOLS : T HE D IRICHLET D ISTRIBUTION
T HE D IRICHLET D ISTRIBUTION
Recall: Probability simplex
The set of all probability distributions on K events is the simplex
    ∆K := { (c1 , . . . , cK ) ∈ R^K | ck ≥ 0 and Σ_k ck = 1 } .
[Figure: the simplex spanned by the unit vectors e1 , e2 , e3 ; a point on it has coordinates (c1 , c2 , c3 ).]

Dirichlet distribution
The Dirichlet distribution is the distribution on ∆K with density
K
1 X 
qDirichlet (c1:K |α, g1:K ) := exp (αgk − 1) log(ck )
Z(α, g1:K ) k=1

Parameters:
• g1:K ∈ ∆K : Mean parameter, i.e. E[c1:K ] = g1:K .

Note g1:K is a probability distribution on K categories.


• α ∈ R+ : Concentration.
Larger α → sharper concentration around g1:K .

Advanced Machine Learning 139 / 212


T HE D IRICHLET D ISTRIBUTION
In all plots, g1:K = (1/3, 1/3, 1/3). Light colors = large density values.

Density plots
[Figure: Dirichlet densities for α = 1.8 and α = 10.]

As heat maps
[Figure: heat maps for α = 0.8 (large density values at the extreme points), α = 1 (uniform
distribution on ∆K ), α = 1.8 (density peaks around its mean), α = 10 (peak sharpens with
increasing α).]

Advanced Machine Learning 140 / 212


M ULTINOMIAL -D IRICHLET M ODEL

Model
The Dirichlet is the natural conjugate prior on the multinomial parameters. If we observe hk
counts in category k, the posterior is
    Π(c1:K | h1 , . . . , hK ) = qDirichlet ( c1:K | α + n, ( (αg1 + h1 )/(α + n) , . . . , (αgK + hK )/(α + n) ) )
where n = Σ_k hk is the total number of observations.
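In the standard parametrization of a Dirichlet by a single parameter vector, the prior is Dirichlet(αg) and the posterior Dirichlet(αg + h). A small sketch of the update (the prior values and the single observation below are illustration choices):

import numpy as np

rng = np.random.default_rng(0)
alpha, g = 6.0, np.array([1/3, 1/3, 1/3])   # prior concentration and prior mean
h = np.array([0, 0, 1])                     # one observation in category 3

prior_samples = rng.dirichlet(alpha * g, size=5)
post_samples = rng.dirichlet(alpha * g + h, size=5)
print((alpha * g + h) / (alpha + h.sum()))  # posterior mean, shifted towards category 3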

Illustration: One observation


Suppose K = 3 and we obtain a single observation in category 3.

[Figure: prior and posterior densities on the simplex; the extreme point nearest the posterior
peak corresponds to k = 3.]

Prior: Mean at the center.        Posterior: Shifted mean, increased concentration.

Advanced Machine Learning 141 / 212


T HE D IRICHLET IS S PECIAL

• Rule of thumb: In a Bayesian model, stochastic dependence between dimensions in the


prior amplifies in the posterior.
• To keep inference feasible: Keep variables in the prior “as independent as possible”.

Defining random probability distributions on K categories


• If (Θ1 , . . . , ΘK ) is a random probability distribution, the Θk cannot be independent.
• How do we define Θ1:K so that components are as independent as possible?
• Idea: Start with independent variables X1 , . . . , XK in (0, ∞). If we define
      X̄k := Xk / Σ_{j=1}^K Xj     then     (X̄1 , . . . , X̄K ) ∈ ∆K

Dirichlet and gamma distributions


Suppose X1 , . . . , XK are independent random variables. If
    Xk ∼ Gamma(γk , 1)    then    (X̄1 , . . . , X̄K ) ∼ Dirichlet( Σ_k γk ;  γ1 / Σ_j γj , . . . , γK / Σ_j γj )
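A quick numerical check of this construction (the shape parameters γk below are arbitrary): normalizing independent Gamma(γk, 1) draws gives points on the simplex whose empirical mean matches the Dirichlet mean γk / Σj γj.

import numpy as np

rng = np.random.default_rng(0)
gamma_shape = np.array([2.0, 5.0, 3.0])              # gamma_k

X = rng.gamma(shape=gamma_shape, scale=1.0, size=(100000, 3))
Xbar = X / X.sum(axis=1, keepdims=True)              # normalize each row onto the simplex

print(Xbar.mean(axis=0), gamma_shape / gamma_shape.sum())   # the two should agree closely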

Advanced Machine Learning 142 / 212


T HE D IRICHLET IS S PECIAL

Dependence in the prior


In general: Even if X1 , . . . , XK are independent,
    Xk / Σ_j Xj    and    Σ_j Xj    are stochastically dependent.

So: Components of the prior couple (1) through normalization and (2) through the latent
variable Σ_j Xj .

Lukacs’ theorem: The gamma distribution is special


If X and Y are independent random variables in (0, ∞) (and not constant), then
    X/(X + Y) ⊥ X + Y    if and only if    X and Y are gamma-distributed with the same scale parameter.

Consequence: The Dirichlet is special


• In the Dirichlet, components couple only through the normalization constraint.
• Any other random probability defined by normalizing independent variables introduces
more dependence.

Advanced Machine Learning 143 / 212


T EXT M ODELS

Recall: Multinomial text clustering


We assume the corpus is generated by a multinomial mixture model of the form
    π(H) = Σ_{k=1}^K ck P(H|θk ) ,
where P(H|θk ) is multinomial.


• A document is represented by a histogram H.

• Topics θ1 , . . . , θK .
• θkj = Pr{ word j in topic k}.

Problem
Each document is generated by a single topic; that is a very restrictive assumption.

Advanced Machine Learning 144 / 212


S AMPLING D OCUMENTS

Parameters
Suppose we consider a corpus with K topics and a vocabulary of d words.
• φ ∈ ∆K topic proportions (φk = Pr{ topic k}).

• θ1 , . . . , θK ∈ ∆d topic parameter vectors (θkj = Pr{ word j in topic k}).


Note: For random generation of documents, we assume that φ and the topic parameters θk are given (they are properties of the
corpus). To train the model, they have to be learned from data.

Model 1: Multinomial mixture


To sample a document containing M words:
1. Sample topic Z ∼ Multinomial(φ).
2. For i = 1, . . . , M: Sample wordi |Z ∼ Multinomial(θZ ).
The entire document is sampled from a single topic Z.

Advanced Machine Learning 145 / 212


L ATENT D IRICHLET A LLOCATION
Mixtures of topics
Whether we sample words or entire documents makes a big difference.
• When we sample from the multinomial mixture, we choose a topic at random, then
sample the entire document from that topic.
• For several topics to be represented in the document, we have to sample each word
individually (i.e. choose a new topic for each word).
• Problem: If we do that in the mixture above, every document has the same topic
proportions.

Model 2: Admixture model


Each document explained as a mixture of topics, with mixture weights C1:K .
Fix a matrix θ of size #topics × #words, where
θkj := probability that word j occurs under topic k

1. Sample topic proportions c1:K ∼ Dirichlet(φ).


2. For i = 1, . . . , M:
2.1 Sample topic for word i as Zi |C1:K ∼ Multinomial(C1:K ).
2.2 Sample wordi |Zi ∼ Multinomial(θZi ).
This model is known as Latent Dirichlet Allocation (LDA).
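A generative sketch of one LDA document, following the two steps above (the topic matrix θ and the Dirichlet parameter φ must be supplied; they are assumed given here, as on the previous slide):

import numpy as np

def sample_lda_document(M, phi, theta, rng=None):
    """Generate one document of M words. phi: Dirichlet parameter for the topic
    proportions; theta: (#topics, #words) matrix, each row a distribution over words."""
    rng = np.random.default_rng() if rng is None else rng
    c = rng.dirichlet(phi)                              # 1. topic proportions for this document
    words = np.empty(M, dtype=int)
    for i in range(M):
        z = rng.choice(len(c), p=c)                     # 2.1 topic of word i
        words[i] = rng.choice(theta.shape[1], p=theta[z])   # 2.2 the word itself
    return words, c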

Advanced Machine Learning 146 / 212


C OMPARISON : LDA AND BMM

Observation
LDA is almost a Bayesian mixture model: Both use multinomial components and a Dirichlet
prior on the mixture weights. However, they are not identical.

Comparison
Bayesian MM                                        Admixture (LDA)
Sample c1:K ∼ Dirichlet(φ).                        Sample c1:K ∼ Dirichlet(φ).
Sample topic k ∼ Multinomial(c1:K ).               For i = 1, . . . , M:
For i = 1, . . . , M:                                  Sample topic ki ∼ Multinomial(c1:K ).
    Sample wordi ∼ Multinomial(θk ).                   Sample wordi ∼ Multinomial(θki ).

In admixtures:
• c1:K is generated at random, once for each document.

• Each word is sampled from its own topic.

What do we learn in LDA?


LDA explains each document by a separate parameter c1:K ∈ ∆K . That is, LDA models
documents as topic proportions.

Advanced Machine Learning 147 / 212


LDA AS A G RAPHICAL M ODEL
α α

topic proportions
C C ∼ Dirichlet(α)

topic of word
Z Z ∼ Multinomial(C)

observed word
Θ word Θ word ∼ Multinomial(row Z of Θ)

N N # words

M M # documents

(α: vector of length #topics.    Θ: #topics × |vocabulary| matrix; each row has sum 1.)

Bayesian mixture
(for M documents of N words each) LDA

Advanced Machine Learning 148 / 212


P RIOR ON T OPIC P ROBABILITIES

C
• The parameter matrix θ is of size
#topics × |vocabulary|.
• Meaning: θki = probability that term i is η Z
observed under topic k.
• Note entries of θ are non-negative and
each row sums to 1.
• To learn the parameters θ along with the
other parameters, we add a Dirichlet prior Θ word
with parameters η. The rows are drawn
N
i.i.d. from this prior.
M

Θ is now random

Advanced Machine Learning 149 / 212


VARIATIONAL I NFERENCE FOR LDA
Target distribution
Posterior L(Z, C, Θ | words, α, η)

Recall: Z|C ∼ Multinomial(C) and C, Θ1 , . . . , ΘK ∼ Dirichlet.

Variational approximation
    Q = { q(z, c, θ | λ, γ, φ) | λ, γ, φ }
where
    q(z, c, θ | λ, γ, φ) := ∏_{k=1}^K p(θk |λ) ∏_{m=1}^M ( q(cm |γm ) ∏_{n=1}^N p(zmn |φmn ) )
(the θ- and c-factors are Dirichlet, the z-factors are multinomial).

We have introduced a new set of “variational” parameters λ, γ, φ.

[The LDA graphical model is shown alongside.]

Variational inference problem
    min   DKL ( q(z, c, θ | λ, γ, φ) ‖ p(z, c, θ | words, α, η) )
    s.t.  q ∈ Q

Advanced Machine Learning 150 / 212


VARIATIONAL I NFERENCE FOR LDA
    q(z, c, θ | λ, γ, φ) := ∏_{k=1}^K p(θk |λ) ∏_{m=1}^M ( q(cm |γm ) ∏_{n=1}^N p(zmn |φmn ) )
        λ: “global” parameter                γm , φmn : “local” parameters

[Figure: the LDA graphical model (left) and the fully factorized variational approximation, with
parameters λ for Θ, γ for C and φ for Z (right).]

    LDA                                      Variational approximation

Advanced Machine Learning 151 / 212


I TERATIVE S OLUTION

Algorithmic solution
We solve the minimization problem
    min   DKL ( q(z, c, θ | λ, γ, φ) ‖ p(z, c, θ | words, α, η) )
    s.t.  q ∈ Q
by applying gradient descent to the resulting free energy.

VI algorithm
It can be shown that gradient descent amounts to the following algorithm:

repeat (using update equations on next page)

• update local parameters


• update global parameters

until free energy has converged

Advanced Machine Learning 152 / 212


U PDATE E QUATIONS

In iteration t of the algorithm:

Local updates
    φmn^{(t+1)} := φ̃mn / Σ_n φ̃mn
where
    φ̃mn = exp( Eq [ log(Cm1 ), . . . , log(Cmd ) | γ^{(t)} ] + Eq [ log(Θ1,wn ), . . . , log(Θd,wn ) | λ^{(t)} ] )

    γ^{(t+1)} := α + Σ_{n=1}^N φn^{(t+1)}

Global updates
    λk^{(t+1)} = η + Σ_m Σ_n wordmn φmn^{(t+1)}

Advanced Machine Learning 153 / 212


E XAMPLE : M IXTURE OF T OPICS

The William Randolph Hearst Foundation will give $1.25 million to Lincoln Center, Metropoli-
tan Opera Co., New York Philharmonic and Juilliard School. “Our board felt that we had a
real opportunity to make a mark on the future of the performing arts with these grants an act
every bit as important as our traditional areas of support in health, medical research, education
and the social services,” Hearst Foundation President Randolph A. Hearst said Monday in
announcing the grants. Lincoln Center’s share will be $200,000 for its new building, which
will house young artists and provide new public facilities. The Metropolitan Opera Co. and
New York Philharmonic will receive $400,000 each. The Juilliard School, where music and
the performing arts are taught, will get $250,000. The Hearst Foundation, a leading supporter
of the Lincoln Center Consolidated Corporate Fund, will make its usual annual $100,000
donation, too.

Figure 8: An example article from the AP corpus. Each color codes a different factor from which
Advanced Machine Learning the word is putatively generated. From Blei, Ng, Jordan, "Latent Dirichlet Allocation", 2003 154 / 212
R ESTRICTED B OLTZMANN M ACHINES
B OLTZMANN MACHINE

Definition
A Markov random field distribution of variables X1 , . . . , Xn with values in {0, 1} is called a
Boltzmann machine if its joint law is
    P(x1 , . . . , xn ) = (1/Z) exp( − Σ_{i<j≤n} wij xi xj − Σ_{i≤n} ci xi ) ,

where wij are the edge weights of the MRF neighborhood graph, and c1 , . . . , cn are scalar
parameters.

Remarks
• The Markov blanket of Xi are those Xj with wij 6= 0.
• For x ∈ {−1, 1}n instead: Potts model with external
magnetic field.
• This is an exponential family with sufficient statistics Xi Xj
and Xi .
• As an exponential family, it is also a maximum entropy
model.

Advanced Machine Learning 156 / 212


W EIGHT MATRIX

    P(x1 , . . . , xn ) = (1/Z) exp( − Σ_{i<j≤n} wij xi xj − Σ_{i≤n} ci xi )

Matrix representation
We collect the parameters in a matrix W := (wij ) and a vector c = (c1 , . . . , cn ), and write
equivalently:
    P(x1 , . . . , xn ) = (1/Z(W, c)) e^{−x^t W x − c^t x}

• The matrix W is the adjacency matrix of the MRF neighborhood graph.


• Because the MRF is undirected, the matrix is symmetric.

Advanced Machine Learning 157 / 212


R ESTRICTED B OLTZMANN MACHINE

With observations
If some vertices represent observation variables Yi :
    P(x1 , . . . , xn , y1 , . . . , ym ) = (1/Z(W, c, c̃)) e^{−(x,y)^t W (x,y) − c^t x − c̃^t y}
[Figure: a general Boltzmann machine over X- and Y-variables.]

Recall our hierarchical design approach
• Only permit layered structure.
• Obvious grouping: One layer for X, one for Y.
• As before: No connections within layers.
• Since the graph is undirected, that makes it bipartite.
[Figure: a bipartite graph with a Y-layer (Y1 , Y2 , Y3 ) above an X-layer (X1 , X2 , X3 ).]

Advanced Machine Learning 158 / 212


R ESTRICTED B OLTZMANN MACHINE

Y1 Y2 Y3

Bipartite graphs
A bipartite graph is a graph whose vertex set V can be
subdivided into two sets A and B = V \ A such that every edge has
one end in A and one end in B.
X1 X2 X3

Definition
A restricted Boltzmann machine (RBM) is a Boltzmann machine whose neighborhood graph
is bipartite.

That defines two layers. We usually think of one of these layers as observed, and one as
unobserved.

Advanced Machine Learning 159 / 212


B OLTZMANN M ACHINES AND S IGMOIDS

Consider a probability of the form
    P(x) = (1/Z) e^{−θx}    for x ∈ {0, 1} and a fixed θ ∈ R .
This is what the probability of a single variable in an RBM looks like, if the distribution
factorizes.
    P(x = 1) = e^{−θ} / (e^{−θ} + e^0) = 1 / (1 + e^{θ}) = σ(−θ)
where σ again denotes the sigmoid function.

Consequence
For random variables in {0, 1}:
    P(x) = (1/Z) e^{−θx}    ⇔    X ∼ Bernoulli(σ(−θ))

Advanced Machine Learning 160 / 212


G IBBS SAMPLING B OLTZMANN MACHINES

Full conditionals: General case
    P(X = x) = e^{x^t W x + c^t x} / Z(W, c)
    P(Xi = 1 | x(i) ) = σ(Wi^t x + ci )

Full conditionals: RBM
• Variables in the X-layer are conditionally
  independent given the Y-layer, and vice versa.
• Two groups of conditionals: X|Y and Y|X
• Blocked Gibbs sampler:
      P(X = x | Y = y) = σ(W^t y + c̃)
      P(Y = y | X = x) = σ(W^t x + c)
[Figure: the bipartite RBM graph.]
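A minimal sketch of such a blocked Gibbs sampler. It assumes the convention P(Xi = 1 | y) = σ((W y)_i + bias of Xi), with W holding one row per X-unit and one column per Y-unit; the exact sign and bias convention depends on how W, c, c̃ are defined in the energy:

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def blocked_gibbs_rbm(W, c, c_tilde, n_steps, rng=None):
    """Alternate between sampling the whole X-layer given Y and vice versa."""
    rng = np.random.default_rng() if rng is None else rng
    y = rng.integers(0, 2, size=W.shape[1])
    for _ in range(n_steps):
        x = (rng.random(W.shape[0]) < sigmoid(W @ y + c)).astype(int)        # X | Y
        y = (rng.random(W.shape[1]) < sigmoid(W.T @ x + c_tilde)).astype(int)  # Y | X
    return x, y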

Advanced Machine Learning 161 / 212


D EEP B ELIEF N ETWORKS
D IRECTED GRAPHICAL MODEL

Θ1 Θ2 ΘN
... Input layer

...

.. .. ..
. . .

...

... Data
X1 X2 XN

Advanced Machine Learning Not examinable. 163 / 212


BAYESIAN V IEW

Θ1 Θ2 ΘN
... L(Θ1:N ) = Prior

...

.. .. ..
. . .
L(X1:N |Θ1:N ) = Likelihood

...

...
X1 X2 XN

Advanced Machine Learning Not examinable. 164 / 212


I NFERENCE P ROBLEM

Θ1 Θ2 ΘN
• Task: Given X1 , . . . , XN , find L(Θ1:N |X1:N ). ...
• Problem: Recall “explaining away”.
...
X Y

.. .. ..
Z . . .

Conditioning on a common outcome makes


variables dependent. ...
• That means: Although each layer is conditionally
independent given the previous one, conditioning
on the subsequent one creates dependence within ...
the layer. X1 X2 XN

Advanced Machine Learning Not examinable. 165 / 212


C OMPLEMENTARY P RIOR : I DEA
The prior L(Θ1:N ) can itself be represented as Combining this prior with the likelihood stacks
a directed (layered) graphical model: the two networks on top of each other:
... ...

.. .. .. .. .. ..
. . . . . .

Θ1 Θ2 ... ΘN Θ1 Θ2 . . .. ΘN

...

.. .. ..
. . .

...

X1 X2 ... XN

Advanced Machine Learning Not examinable. 166 / 212


C OMPLEMENTARY P RIOR : I DEA

... Idea
• Invent a prior such that the dependencies
in the prior and likelihood “cancel out”.
.. .. ..
. . . • Such a prior is called a complementary
prior.
• We will see how to construct a
Θ1 Θ2 . . .. ΘN complementary prior below.

...

.. .. ..
. . .

...

X1 X2 ... XN

Advanced Machine Learning Not examinable. 167 / 212


C OMPLEMENTARY P RIOR

Consider a layered directed graphical model.


X (1) ...
Denote the vector of variables in the kth layer
X (k) , so
(k)
X (k) = (X1 , . . . , XN )
(k) X (2) ...

Observation
X (1) , X (2) , . . . , X (K) is a Markov chain. .. .. ..
. . .

Complementary prior idea


• Suppose Markov chain is reversible. ...
• Then all arrows can be reversed.
• Now: Inference easy. X (K) ...

Advanced Machine Learning Not examinable. See: G.E. Hinton, S. Osindero and Y.W. Teh. Neural Computation 18(7):1527-1554, 2006 168 / 212
B UILDING A C OMPLEMENTARY P RIOR

X (1) ...

• Find reversible Markov chain with


P(1) = P∞ X (2) ...

• Let pT be its transition kernel


• Choose
.. .. ..
P(k+1) (•|X (k) = x) = pT (•|x) . . .
• Then P(2) = . . . = P(k) = P∞
• Since chain is reversible,
...
P(k) (•|X (k+1) = x) = pT (•|x)
and edges flip.
X (K) ...

Advanced Machine Learning Not examinable. See: G.E. Hinton, S. Osindero and Y.W. Teh. Neural Computation 18(7):1527-1554, 2006 169 / 212
W HERE DO WE GET THE M ARKOV CHAIN ?

...
Start with an RBM X

Y ...

Blocked Gibbs sampling alternates between X and Y


X (1) → Y (1) → X (2) → Y (2) → . . . → X (K) → Y (K)

“Roll off” this chain into a graphical model


X (1) Y (1) X (K) Y (K)

..
.
..
.

... ... ... ...

..
.

The Gibbs sampler for the RBM becomes the model for the directed network.

Advanced Machine Learning Not examinable. See: G.E. Hinton, S. Osindero and Y.W. Teh. Neural Computation 18(7):1527-1554, 2006 170 / 212
W HERE DO WE GET THE M ARKOV CHAIN ?
..
.

... ... ... ...

X ...

.. Y ...
.

..
.

X (1) Y (1) X (K) Y (K)

For K → ∞, we have
    Y^(∞) =_d Y    (equality in distribution)
Now suppose we use L(Y (∞) ) as our prior distribution. Then:

Using an “infinitely deep” graphical model given by the Gibbs sampler for the RBM as a
prior is equivalent to using the Y-layer of the RBM as a prior.

Advanced Machine Learning Not examinable. 171 / 212


D EEP B ELIEF N ETWORKS

...
First two layers: RBM (undirected)
Θ1 Θ2 ... ΘN

...

.. .. ..
. . .
Remaining layers: Directed

...

X1 X2 ... XN

Advanced Machine Learning Not examinable. 172 / 212


D EEP B ELIEF N ETWORKS

Summary
...
• The RBM consisting of the first two layers is
equivalent to an infinitely deep directed network
representing the Gibbs samplers. (The rolled-off Θ1 Θ2 ... ΘN
Gibbs sampler on the previous slide.)
• That network is infinitely deep because a draw from
the actual RBM distribution corresponds to a Gibbs ...
sampler that has reached its invariant distribution.
• When we draw from the RBM, the second layer is
distributed according to the invariant distribution of
the Markov chain given by the RBM Gibbs .. .. ..
sampler. . . .
• If the transition from each layer to the next in the
directed part is given by the Markov chain’s
transition kernel pT , the RBM is a complementary ...
prior.
• We can then reverse all edges between the Θ-layer
and the X-layer. X1 X2 ... XN

Advanced Machine Learning Not examinable. 173 / 212


N EURAL N ETWORKS : BASICS
T HE M OST I MPORTANT B IT

A neural network represents a function f : Rd1 → Rd2 .

Advanced Machine Learning 175 / 212


R EADING N EURAL N ETWORKS

 
    f : R^3 → R^3    with input    x = (x1 , x2 , x3 )^t

[Diagram: three input nodes x1 , x2 , x3 , connected by weights wij to three output nodes with
activation functions φ1 , φ2 , φ3 .]

    f1 (x) = φ1 (w1^t x)        f2 (x) = φ2 (w2^t x)        f3 (x) = φ3 (w3^t x)

    f(x) = ( f1 (x), f2 (x), f3 (x) )^t    with    fi (x) = φi ( Σ_{j=1}^3 wij xj )

Advanced Machine Learning 176 / 212


x1 x2 xd

... = f (1)

... = f (2)

.. .. ..
. . .

...

... = f (K)

f (x) = f (K) (· · · f (2) (f (1) (x))) = f (K) ◦ . . . ◦ f (1) (x)
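The composition above is just repeated matrix multiplication followed by an elementwise nonlinearity. A minimal forward-pass sketch (the layer sizes, weights and activation below are arbitrary illustration values):

import numpy as np

def forward(x, weights, activations):
    """Evaluate f(x) = f^(K)( ... f^(1)(x) ) for a layered feed-forward network.
    weights: list of matrices W^(k); activations: list of elementwise functions phi^(k)."""
    z = x
    for W, phi in zip(weights, activations):
        z = phi(W @ z)          # f^(k)(z) = phi^(k)(W^(k) z), applied componentwise
    return z

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]    # a 3-4-2 network
print(forward(np.array([1.0, -0.5, 2.0]), weights, [sigmoid, sigmoid]))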

Advanced Machine Learning 177 / 212


N EURAL NETWORKS

Layers
• Each function f (k) is of the form
f (k) : Rdk → Rdk+1
• dk is the number of nodes in the kth layer. It is also called the width of the layer.
• We mostly assume for simplicity: d1 = . . . = dK =: d.

Layered feed-forward network


A feed-forward network is one organized into successive layers as on the previous slide,
f (x) = f (K) ◦ . . . ◦ f (1) (x)
Each layer represents a function f^(k). These functions are of the form:
    f^(k)( • ) = ( φ1^(k)( ⟨w1^(k), • ⟩ ), . . . , φd^(k)( ⟨wd^(k), • ⟩ ) )
typically with
    φ(x) = σ(x)          (sigmoid)
    φ(x) = I{±x > τ}     (threshold)
    φ(x) = c             (constant)
    φ(x) = x             (linear)

Advanced Machine Learning 178 / 212


H IDDEN L AYERS AND N ONLINEAR F UNCTIONS

Hidden units
• Any nodes (or “units”) in the network that are neither input nor output nodes are called
hidden.
• Every network has an input layer and an output layer.
• If there are any additional layers (which hence consist of hidden units), they are called hidden
layers.

Linear and nonlinear networks


• If a network has no hidden units, then
      fi (x) = φi ( ⟨wi , x⟩ )
  That means: f is a linear function, except perhaps for the final application of φ.
• For example: In a classification problem, a two layer network can only represent linear
decision boundaries.
• Networks with at least one hidden layer can represent nonlinear decision surfaces.

Advanced Machine Learning 179 / 212


T WO VS T HREE L AYERS

[Scanned figure from Duda, Hart & Stork. The accompanying text reads: "... these results on the
expressive power of networks give us confidence we are on the right track, but shed little
practical light on the problems of designing and training neural networks — their main benefit
for pattern recognition (Fig. 6.3)."

Figure 6.3: Whereas a two-layer network classifier can only implement a linear decision
boundary, given an adequate number of hidden units, three-, four- and higher-layer networks
can implement arbitrary decision boundaries. The decision regions need not be convex, nor
simply connected.]

Advanced Machine Learning        Illustration: R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley 2001        180 / 212
T HE XOR P ROBLEM

[Figure 6.1 from Duda, Hart & Stork: The two-bit parity or exclusive-OR problem can be solved
by a three-layer network. At the bottom is the two-dimensional feature space x1 − x2 with the
patterns to be classified; the three-layer network is shown in the middle; the hidden and output
units are linear threshold units.]

Solution
Neural network representation
• Two ridges at different locations are subtracted from each other.
• That generates a region bounded on both sides.
• A linear classifier cannot represent this decision region.
• Note this requires at least one hidden layer.

Advanced Machine Learning        Illustration: R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley 2001        181 / 212
181 / 212
short, this is not how neural networks “work” — we never find that through train-
ing (Sect. 6.3) simple networks build a Fourier-like representation, or learn to group
sigmoids to get component bumps.

z1 x2

x1
y2 z1 y3

y1
y1 y2 y3 y4
y4

x1 x2

Figure 6.2: A 2-4-1 network (with bias) along with the response functions at different
units; each hidden and output unit has sigmoidal transfer function f (·). In the case
shown, the hidden unit outputs are paired in opposition thereby producing a “bump”
at the output unit. Given a sufficiently large number of hidden units, any continuous
function from input to output can be approximated arbitrarily well by such a network.
Advanced Machine Learning Illustration: R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley 2001 182 / 212
F EATURE E XTRACTION

Features
• Raw measurement data is typically not used directly as input for a learning algorithm.
Some form of preprocessing is applied first.
• We can think of this preprocessing as a function, e.g.
F : raw data space −→ Rd
(Rd is only an example, but a very common one.)
• If the raw measurements are m1 , . . . , mN , the data points which are fed into the learning
algorithm are the images xn := F(mn ).

Terminology
• F is called a feature map.
• Its dimensions (the dimensions of its range space) are called features.
• The preprocessing step (= application of F to the raw data) is called feature extraction.

Advanced Machine Learning 183 / 212


E XAMPLE P ROCESSING P IPELINE
Raw data (measurements)

Feature extraction
(preprocessing)

Working data

Mark patterns

This is what a typical processing


Split
pipeline for a supervised learning
problem might look like.
Training data Test data
(patterns marked) (patterns marked)

Training Apply on
Trained model
(calibration) test data

Error estimate

Advanced Machine Learning 184 / 212


F EATURE E XTRACTION VS L EARNING
Where does learning start?
• It is often a matter of definition where feature extraction stops and learning starts.
• If we have a perfect feature extractor, learning is trivial.
• For example:
• Consider a classification problem with two classes.
• Suppose the feature extractor maps the raw data measurements of class 1 to a single
point, and all data points in class 2 to a single, distinct point.
• Then classification is trivial.
• That is of course what the classifier is supposed to do in the end (e.g. map to the
points 0 and 1).

Multi-layer networks and feature extraction


• An interesting aspect of multi-layer neural networks is that their early layers can be
interpreted as feature extraction.
• For certain types of problems (e.g. computer vision), features were long “hand-tuned” by
humans.
• Features extracted by neural networks give much better results.
• Several important problems, such as object recognition and face recognition, have
basically been solved in this way.

Advanced Machine Learning 185 / 212


D EEP N ETWORKS AS F EATURE E XTRACTORS

x1 x2 xd

• The network on the right is a classifier


f : Rd → {0, 1}. ... = f (1)
• Suppose we subdivide the network into
the first K − 1 layer and the final layer, by ... = f (2)
defining
F(x) := f (K−1) ◦ . . . ◦ f (1) (x)
.. .. ..
• The entire network is then . . .

f (x) = f (K) ◦ F(x)

• The function f (K) is a two-class logistic ...


regression classifier.
• We can hence think of f as a feature = f (K)
extraction F followed by linear
classification f (K) .
f (K) ( • ) = σ( w(K) , • )

Advanced Machine Learning 186 / 212


A S IMPLE E XAMPLE
[Figure: sample training patterns (the characters E, F, L as 8 × 8 pixel images) and the network,
from Duda, Hart & Stork.]

• Problem: Classify characters into three classes (E, F and L).
• Each character is given as an 8 × 8 = 64 pixel image.
• Neural network: 64 input units (= pixels) x1 , . . . , x64 .
• 2 hidden units h1 , h2 .
• 3 binary output units f1 , f2 , f3 , where fi (x) = 1 means the image is in class i.
• Each hidden unit has 64 input weights, one per pixel. The weight values can be plotted as
  8 × 8 images.

Advanced Machine Learning                                                                187 / 212
A S IMPLE E XAMPLE
[Figure: training data (with random noise) and the weight values of h1 and h2 plotted as 8 × 8
images (the learned input-to-hidden weights). From Duda, Hart & Stork, Figure 6.14.]

• Dark regions = large weight values.
• Note the learned input-to-hidden weights emphasize regions that distinguish the characters.
• We can think of each weight (= each pixel) as a feature.
• The features with large weights for h1 distinguish {E, F} from L.
• The features for h2 distinguish {E, L} from F.
Advanced Machine Learning Illustration: R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley 2001 188 / 212
W IDTH VS D EPTH

(k)
A neural network represents a (typically) complicated function f by simple functions φi .

What functions can be represented?


A well-known result in approximation theory says: Every continuous function f : [0, 1]^d → R
can be represented in the form
    f(x) = Σ_{j=1}^{2d+1} ξj ( Σ_{i=1}^d τij (xi ) )
where ξj and τij are functions R → R. A similar result shows one can approximate f to
arbitrary precision using specifically sigmoids, as
    f(x) ≈ Σ_{j=1}^M wj^{(2)} σ( Σ_{i=1}^d wij^{(1)} xi + cj )
for some finite M and constants cj .


Note the representations above can both be written as neural networks with three layers (i.e.
with one hidden layer).

Advanced Machine Learning 189 / 212


W IDTH VS D EPTH

Depth rather than width


• The representations above can achieve arbitrary precision with a single hidden layer
(roughly: a three-layer neural network can represent any continuous function).
• In the first representation, ξj and τij are “simpler” than f because they map R → R.
• In the second representation, the functions are more specific (sigmoids), and we typically
need more of them (M is large).
• That means: The price of precision are many hidden units, i.e. the network grows wide.
• The last years have shown: We can obtain very good results by limiting layer width, and
instead increasing depth (= number of layers).
• There is no theory to properly explain this yet.

Limiting width
• Limiting layer width means we limit the degrees of freedom of each function f (k) .
• That is a notion of parsimony.
• Again: There seem to be a lot of interesting questions to study here, but so far, we have no
answers.

Advanced Machine Learning 190 / 212


E XAMPLE : AUTOENCODERS

An example for the effect of layer width are autoencoders.


• An autoencoder is a neural network that is trained on its own input: If the network has
weights W and represents a function fW , training solves the optimization problem
min kx − fW (x)k2
W

or something similar for a different norm.


• That seems pointless at first glance: The network tries to approximate the identity
function using its (possibly nonlinear) component functions.
• However: If the layers in the middle have much fewer nodes than those at the top and
bottom, the network learns to compress the input.

Advanced Machine Learning 191 / 212


AUTOENCODERS
x x

f (1) f (1)

f (2) f (2)

f (3) f (3)

f (x) ≈ x f (x) ≈ x

Layers have same width: No effect Narrow middle layers: Compression effect

• Train network on many images.


• Once trained: Input an image x.
• Store x0 := f (2) (x). Note x0 has fewer dimensions than x → compression.
• To decompress x0 : Input it into f (3) and apply the remaining layers of the network
→ reconstruction f (x) ≈ x of x.
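A minimal sketch of this compress/decompress split. The weights below are random placeholders standing in for trained ones (with random weights the reconstruction will of course not resemble x); the layer sizes 64-16-4-16-64 are arbitrary illustration choices:

import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(16, 64)), rng.normal(size=(4, 16))   # encoder weights (f^(1), f^(2))
W3, W4 = rng.normal(size=(16, 4)), rng.normal(size=(64, 16))   # decoder weights (f^(3), output)

encode = lambda x: sigmoid(W2 @ sigmoid(W1 @ x))        # x' = f^(2)(f^(1)(x)), the 4-d code
decode = lambda code: sigmoid(W4 @ sigmoid(W3 @ code))  # reconstruction f(x)

x = rng.random(64)
x_hat = decode(encode(x))    # with trained weights, x_hat ≈ x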

Advanced Machine Learning 192 / 212


AUTOENCODERS
[Figure: a deep autoencoder built from stacked RBMs. Pretraining: a stack of RBMs with layer
sizes 2000–1000–500–30. Unrolling: the stack is unrolled into an encoder (weights W1 , . . . , W4 )
and a decoder (their transposes) around a 30-dimensional code layer. Fine-tuning: all weights
are then adjusted jointly.]

Advanced Machine Learning        Illustration: K. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press 2012        193 / 212
T RAINING N EURAL N ETWORKS

Error measure
• We assume each training data point x comes with a label y.
• We specify an error measure D that compares y to a prediction.

Typical error measures


• Classification problem (cross-entropy):
      D(ŷ, y) := −y log ŷ
• Regression problem (squared error):
      D(ŷ, y) := ‖y − ŷ‖²

Training as an optimization problem


• Given: Training data (x1 , y1 ), . . . , (xN , yN ) with labels yi .
• Training minimizes the total error of the network fw on the training data:
      J(w) := Σ_{i=1}^N D(fw (xi ), yi )

Advanced Machine Learning 194 / 212


BACKPROPAGATION

Neural network training optimization problem


min J(w)
w
The application of gradient descent to this problem is called backpropagation.

Backpropagation is gradient descent applied to J(w) in a feed-forward network.

Deriving backpropagation
• We have to evaluate the derivative ∇w J(w).
• Since J is additive over training points, J(w) = Σ_n Jn (w), it suffices to derive ∇w Jn (w).

Advanced Machine Learning 195 / 212


Recall from calculus: Chain rule
Consider a composition of functions f ◦ g(x) = f(g(x)):
    d(f ◦ g)/dx = (df/dg) · (dg/dx)
If the derivatives of f and g are f′ and g′, that means: d(f ◦ g)/dx (x) = f′(g(x)) g′(x).

Neural network
Let w^(k) denote the weights in layer k. The function represented by the network is
    fw (x) = f^(K)_{w^(K)} ◦ · · · ◦ f^(1)_{w^(1)} (x)
To solve the optimization problem, we have to compute derivatives of the form
    d/dw D(fw (xn ), yn ) = ( dD( • , yn )/dfw ) · ( dfw /dw )

Advanced Machine Learning 196 / 212


C ONSIDERING THE D ERIVATIVES

• We will compute the derivatives layer by layer.


• Suppose we are only interested in the weights of layer k, and keep all other weights fixed.
  The function f represented by the network is then
      f_{w^(k)} (x) = f^(K) ◦ · · · ◦ f^(k+1) ◦ f^(k)_{w^(k)} ◦ f^(k−1) ◦ · · · ◦ f^(1) (x)

• The first k − 1 layers enter only through the function value of x, so we define
      z^(k) := f^(k−1) ◦ · · · ◦ f^(1) (x)
  and get
      f_{w^(k)} (x) = f^(K) ◦ · · · ◦ f^(k+1) ◦ f^(k)_{w^(k)} (z^(k))

• If we differentiate with respect to w^(k), the chain rule gives
      d f_{w^(k)} (x) / dw^(k) = ( df^(K)/df^(K−1) ) · · · ( df^(k+1)/df^(k) ) · ( df^(k)_{w^(k)} / dw^(k) )

Advanced Machine Learning Not examinable. 197 / 212


W ITHIN A S INGLE L AYER

• Each f (k) is a vector-valued function f (k) : Rdk → Rdk+1 .


• It is parametrized by the weights w(k) of the kth layer and takes an input vector z ∈ Rdk .
• We write f (k) (z, w(k) ).

Layer-wise derivative
Since f^(k+1) and f^(k) are vector-valued, we get a Jacobian matrix

    df^(k+1)/df^(k) = ( ∂f_i^(k+1) / ∂f_j^(k) )_{i=1,...,d_{k+1}; j=1,...,d_k}  =:  ∆^(k)(z, w^(k+1))

• ∆^(k) is a matrix of size d_{k+1} × d_k .
• The derivatives in the matrix quantify how f^(k+1) reacts to changes in the argument of
  f^(k) if the weights w^(k+1) and w^(k) of both functions are fixed.
Advanced Machine Learning Not examinable. 198 / 212


BACKPROPAGATION ALGORITHM
Let w(1) , . . . , w(K) be the current settings of the layer weights. These have either been
computed in the previous iteration, or (in the first iteration) are initialized at random.

Step 1: Forward pass


We start with an input vector x and compute
z(k) := f (k) ◦ · · · ◦ f (1) (x)
for all layers k.

Step 2: Backward pass


• Start with the last layer. Update the weights w(K) by performing a gradient step on
D f (K) (z(K) , w(K) ), y


regarded as a function of w(K) (so z(K) and y are fixed). Denote the updated weights w̃(K) .
• Move backwards one layer at a time. At layer k, we have already computed updates
w̃(K) , . . . , w̃(k+1) . Update w(k) by a gradient step, where the derivative is computed as
      ∆^(K−1)(z^(K−1), w̃^(K)) · . . . · ∆^(k)(z^(k), w̃^(k+1)) · ( df^(k)/dw^(k) )(z^(k), w^(k))
On reaching level 1, go back to step 1 and recompute the z(k) using the updated weights.

Advanced Machine Learning Not examinable. 199 / 212


S UMMARY: BACKPROPAGATION

• Backpropagation is a gradient descent method for the optimization problem
      min_w J(w) = Σ_{i=1}^N D(fw (xi ), yi )
  D must be chosen such that it is additive over data points.


• It alternates between forward passes that update the layer-wise function values z(k) given
the current weights, and backward passes that update the weights using the current z(k) .
• The layered architecture means we can (1) compute each z(k) from z(k−1) and (2) we can
use the weight updates computed in layers K, . . . , k + 1 to update weights in layer k.
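A compact sketch of one forward/backward pass for the special case of a two-layer network with a sigmoid hidden layer, linear output and squared-error cost (this particular architecture and cost are illustration choices, not the general algorithm above):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def backprop_step(x, y, W1, W2, lr=0.1):
    """One gradient step on D = ||y - f(x)||^2 for f(x) = W2 @ sigmoid(W1 @ x)."""
    # forward pass: store the layer values z^(k)
    z1 = sigmoid(W1 @ x)
    z2 = W2 @ z1
    # backward pass: chain rule, starting from dD/df
    delta2 = 2.0 * (z2 - y)                      # dD/dz2
    grad_W2 = np.outer(delta2, z1)               # dD/dW2
    delta1 = (W2.T @ delta2) * z1 * (1.0 - z1)   # dD/d(W1 x), using the sigmoid derivative
    grad_W1 = np.outer(delta1, x)                # dD/dW1
    return W1 - lr * grad_W1, W2 - lr * grad_W2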

Advanced Machine Learning 200 / 212


M ACHINE A RITHMETIC AND
P SEUDORANDOM N UMBERS
B INARY R EPRESENTATION OF N UMBERS

Binary numbers
Our usual (decimal) representation of integers represents a number x ∈ N0 by digits
d0 , d1 , . . . ∈ {0, . . . , 9} as
    [x]10 = di · 10^i + di−1 · 10^{i−1} + . . . + d0 · 10^0
where i is the largest integer with 10^i ≤ x.
The binary representation of x is similarly
    [x]2 = bj · 2^j + bj−1 · 2^{j−1} + . . . + b0 · 2^0
where b0 , b1 , . . . ∈ {0, 1} and j is the largest integer with 2^j ≤ x.

Non-integer numbers
Binary numbers can have fractional digits just as decimal numbers. The post-radix digits
correspond to inverse powers of two:
    [10.125]10 = [1010.001]2 = 1 · 2^3 + 0 · 2^2 + 1 · 2^1 + 0 · 2^0 + 0 · (1/2) + 0 · (1/2^2) + 1 · (1/2^3)
Numbers that look “simple” may become more complicated in binary representation:
    [0.1]10 = [0.0001100110011 . . .]2    (the block 0011 repeats forever)
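This is easy to observe directly in Python, since floats are binary doubles: 0.1 must be rounded, while 10.125 is stored exactly.

from decimal import Decimal

print(Decimal(0.1))        # 0.1000000000000000055511151231257827021181583404541015625
print(0.1 + 0.2 == 0.3)    # False: both operands and the result are rounded
print((10.125).hex())      # 0x1.4400000000000p+3 -- 10.125 = [1010.001]_2 is exact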

Advanced Machine Learning Not examinable. 202 / 212


F LOATING P OINT A RITHMETIC

Basic FP representation

    x = (−1)^s · 2^e · m

where:
• s, the sign, is a single bit, s ∈ {0, 1}.

• e, the exponent, is a (binary) integer e ∈ {−(2^{n−1} − 1), . . . , 2^{n−1}}

• m, the mantissa, is a (binary) integer m ∈ {0, . . . , 2^k − 1}.

There are 2^n possible values for e, and 2^k possible values for m.

The IEEE 754 standard


• The standard floating point system is specified by an industry standard called IEEE 754.
• This standard is implemented by all (well-written) modern software.
• The standard was first published in 1985. There were no proper, widely implemented
rules for numerical computation before then.

Advanced Machine Learning Not examinable. 203 / 212


S INGLE AND D OUBLE P RECISION
Single precision numbers (in IEEE 754)
s e1 ··· en m1 ··· mk
0 1 8 9 31

• A single precision number is represented by 32 bits.


• It is customary to enumerate starting with 0, as 0, . . . , 31.
• 8 bits are invested in the exponent, 1 bit is needed for the sign; the remainder form the
mantissa.

Double precision numbers (in IEEE 754)


s e1 ··· en m1 ··· mk
0 1 11 12 63

• A double precision number is represented by 64 bits.


• 1 bit for the sign, 11 for the exponent, 52 for the mantissa.
• This is the FP number format used in most numerical computations.

Advanced Machine Learning Not examinable. 204 / 212


ROUNDING E RRORS

Rounding
• Choose a set of FP numbers M, (say double precision numbers).
• Start with a number x ∈ R. Unless x happens to be in the finite set M, we cannot represent
it. It gets rounded to the closest FP number,
x̂ := arg min |x − ŷ|
ŷ∈M

• Suppose we multiply two machine numbers x̂, ŷ ∈ M. The product z = x̂ · ŷ has twice as
many digits, and is typically not in M. It gets rounded to ẑ.

Rounding errors
We distinguish two types of errors:
• The absolute rounding error ẑ − z.
• The relative rounding error (ẑ − z)/z.
FP arithmetic is based on the premise that the relative rounding error is more important.
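
A small double-precision illustration (Fraction is used only to compute the exact product of the two machine numbers for comparison):

    from fractions import Fraction

    x, y = 1/3, 1/7                        # machine numbers (already rounded)
    z = Fraction(x) * Fraction(y)          # exact product of the machine numbers
    z_hat = Fraction(x * y)                # the product as computed and rounded in FP
    print(float(z_hat - z))                # absolute rounding error: tiny
    print(float((z_hat - z) / z))          # relative rounding error: bounded by machine precision (~1.1e-16)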

Advanced Machine Learning Not examinable. 205 / 212


ROUNDING E RRORS

Distribution of FP numbers
The distance between two neighboring FP numbers depends on the value of the exponent. That
means:

An interval of length 1 (or any fixed length) contains more FP numbers if it is close to 0
than if it is far away from 0.

In other words, the relative rounding error does not depend on the size of a number, but the
absolute rounding error is larger for large numbers.
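
This can be checked with math.ulp (Python 3.9+), which returns the gap between a float and the next representable one:

    import math

    for x in [1.0, 1000.0, 1e16]:
        print(x, math.ulp(x), math.ulp(x) / x)
    # The absolute gap grows from ~2.2e-16 at 1.0 to 2.0 at 1e16,
    # while the gap relative to x stays roughly constant (~1e-16).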

Normalized FP number
FP representations are not unique: Note that
    2 = (−1)^0 · 2^1 · 1 = (−1)^0 · 2^0 · 2 = . . .
A machine number is normalized if the mantissa represents a number with exactly one digit
before the radix point, and this digit is 1:
    (−1)^s · 2^e · [1.m_1 . . . m_k]_2
IEEE 754 only permits normalized numbers. That means the representation of every number is unique.

Advanced Machine Learning Not examinable. 206 / 212


M ACHINE P RECISION

Error guarantees
IEEE 754 guarantees that
    x ⊕̂ y = (x ⊕ y)(1 + r)    satisfies    |r| < eps
where
• r is the relative rounding error.
• eps is the machine precision.
• ⊕ is one of the operations +, −, × or /, and ⊕̂ is its machine version.

Machine precision
The machine precision, denoted eps, is the smallest machine number for which 1 +̂ eps is
distinguishable from 1:
    eps = min { e ∈ M : 1 +̂ e > 1 }
This is not the same as the smallest representable number. In double precision:
• The machine precision is eps = 2^{−52}.
• The smallest (normalized) FP number is 2^{−1022}.
That is so because the intervals between FP numbers around 1 are much larger than those around
2^{−1022}: If we evaluate 1 + 2^{−1022}, the result is 1.
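
These values can be verified directly in double precision (a small Python check):

    import sys

    eps = sys.float_info.epsilon
    print(eps == 2.0**-52)                   # True: machine precision of double precision
    print(1.0 + eps > 1.0)                   # True
    print(1.0 + eps / 2 == 1.0)              # True: anything smaller is rounded away when added to 1
    print(sys.float_info.min == 2.0**-1022)  # True: smallest normalized double
    print(1.0 + 2.0**-1022 == 1.0)           # True: far below the spacing of doubles around 1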

Advanced Machine Learning Not examinable. 207 / 212


C ANCELLATION E RRORS
Cancellation
• Suppose two machine numbers x̂ = 1.2345e0 and ŷ = 1.2346e0 are obtained by some
numerical computation.
• That means they have already been rounded, perhaps several times.
• Previous rounding means the smallest digits in the mantissa carry errors.
• If we subtract the numbers, x̂ − ŷ = −0.0001e0, only those small digits remain.
• Effectively, we have deleted the reliable information and only kept the errors.
The example illustrates a fundamental problem: The FP principle (keep the relative error under
control) does not work well with subtraction.
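
Two standard double-precision illustrations of this effect:

    print((0.1 + 0.2) - 0.3)      # 5.551115123125783e-17: the subtraction leaves only rounding error
    print((1e16 + 1.0) - 1e16)    # 0.0: the contribution of 1.0 was lost to rounding before subtracting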

Example: Linear equations


• One application where cancellation errors are very problematic is the solution of linear
equation systems.
• If a matrix is inverted by Gaussian elimination, many terms have to be scaled and
subtracted.
• If the input matrix is not well-conditioned, Gaussian elimination can produce results that
consist only of “noise”.
• The Householder and Givens algorithms that most numerical packages use were developed
to avoid this problem.

Advanced Machine Learning Not examinable. 208 / 212


R ANDOM N UMBERS
Pseudorandom numbers
• All implementations of sampling algorithms, random variables, randomized algorithms,
etc on a computer require the generation of random numbers.
• Computers are deterministic machines and have no access to “real” randomness.
• The “random” numbers we use on computers are not random; rather, they are generated in
a way that makes them look like random numbers, in the sense that they do not contain any
obvious deterministic “patterns”. Such numbers are called pseudorandom numbers.

Reproducibility
• Scientific experiments are required to be reproducible.
• Using actual random numbers in a computer experiment or simulation would make it
non-reproducible.
• In this sense, pseudorandomness is an advantage rather than disadvantage.

The so-called random numbers generated by a computer are not random.

If you restart the generation process of a pseudorandom sequence with the same initial
conditions, you reproduce exactly the same sequence.
Advanced Machine Learning Not examinable. 209 / 212
R ANDOM N UMBER G ENERATION

Recursive generation
Most PRN generators use a form of recursion: A sequence x1 , x2 , . . . of PRNs is generated as
xn+1 = f (xn )

Modulo operation
The mod operation outputs the “remainder after division”:
    m mod n = m − kn    for the largest k ∈ N_0 such that kn ≤ m
For example: 13 mod 5 = 3

Linear Congruence Generator


Examples are generators of the form
    x_{n+1} = C · x_n mod D    for fixed constants C, D ∈ N.
On a 32-bit machine, a simple generator of this kind is
    x_{n+1} = 16807 · x_n mod (2^{31} − 1)
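
A minimal Python sketch of such a generator (illustrative only; this is not the generator any standard library actually uses):

    def lcg(seed, C=16807, D=2**31 - 1):
        """Linear congruence generator: yields x_{n+1} = C * x_n mod D."""
        x = seed
        while True:
            x = (C * x) % D
            yield x

    gen = lcg(seed=1)
    print([next(gen) for _ in range(5)])   # the same seed always produces the same sequence
    # Dividing by D rescales the integers to pseudo-uniform values in (0, 1).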

Advanced Machine Learning Not examinable. 210 / 212


P ERIODICITY

Period length
• For a recursive generator xn+1 = f (xn ), each xn+1 depends only on xn .
• Once xn takes the same value as x1 , the sequence (x1 , . . . , xn−1 ) repeats:
(x1 , . . . , xn−1 ) = (xn , . . . , x2n−1 ) = . . .
• If so, n − 1 is called the period length.
• Note that the period need not start at x_1: Once a value reoccurs for the first time, the generator
has become periodic.
• Since there are finitely many possible numbers, that must happen sooner or later.
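
For a toy generator with a tiny modulus, the period can be found by brute force (a sketch; the constants 3 and 31 are arbitrary illustrative choices, and real generators have far too long a period for this to be feasible):

    def period_length(f, x0):
        # Iterate x_{n+1} = f(x_n) until some value reoccurs; the cycle length is the period.
        seen = {}
        x, n = x0, 0
        while x not in seen:
            seen[x] = n
            x = f(x)
            n += 1
        return n - seen[x]

    print(period_length(lambda x: (3 * x) % 31, x0=1))   # 30: every nonzero residue is visited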

Mersenne Twisters
• Almost all software tools we use (Python, Matlab etc) use a generator called a Mersenne
twister.
• This algorithm also uses a recursion, but the recursion is matrix-valued. That makes the
algorithm a little more complicated.
• Mersenne twisters have very long period lengths (e.g. 2^{19937} − 1 for the standard
implementation for 32-bit machines).

Advanced Machine Learning Not examinable. 211 / 212


R ANDOM SEEDS

Random seed of a generator


• A recursive generator x_{n+1} = f(x_n) can be started at any initial value x_0 (so the first
generated number is x_1 = f(x_0)).
• This initial value x0 is called the random seed.

In practice
• Every time a generator is started or reset (e.g. when you start Matlab), it starts with a
default seed.
• For example, Matlab’s default generator is a Mersenne twister with seed 0.
• The user can usually pass a different seed to the generator.
• When you run experiments or simulations, always re-run them several times with different
seeds.
• To ensure reproducibility, record the seeds you used along with the code.
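
For example, with NumPy (whose legacy global generator is a Mersenne twister):

    import numpy as np

    np.random.seed(0)            # reset the generator to seed 0
    a = np.random.rand(3)
    np.random.seed(0)            # resetting with the same seed ...
    b = np.random.rand(3)
    print(np.array_equal(a, b))  # True: ... reproduces exactly the same sequence

    np.random.seed(1)
    print(np.array_equal(a, np.random.rand(3)))  # False: a different seed gives a different sequence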

Advanced Machine Learning Not examinable. 212 / 212
