FALL 2017
Dates
Python tutorial 12/13 September
TensorFlow tutorial 24/25 October
Midterm exam 19/23 October
Final project due 11 December
Office Hours Mon/Tue 5:30-7:30pm, Room 1025, Dept of Statistics, 10th floor SSW
Class Homepage
https://wendazhou.com/teaching/AdvancedMLFall17/
Homework
• Some homework problems and final project require coding
• Coding: Python
• Homework due: Tue/Wed at 4pm – no late submissions
• You can drop two homeworks from your final score
Grade
Homework (20%) + Midterm Exam (40%) + Final Project (40%)
Email
All email to the TAs, please.
The instructors will not read your email unless it is forwarded by a TA.
Books (optional)
Today
Machine learning and statistics have become hard to tell apart.
Task
Balance the pendulum upright by moving the sled left and right.
• The computer can control only the motion of the sled.
Formalization
State = 4 variables (sled location, sled velocity, angle, angular velocity)
Actions = sled movements
Note well
Learning how the world works is a regression problem.
[Figure: a linear classifier. The hyperplane with normal vector vH splits the space into the half-spaces sgn(⟨vH, x⟩ − c) > 0 and sgn(⟨vH, x⟩ − c) < 0.]

[Figure: a layered network with inputs x1, x2, x3, weights v1, v2, v3, activation layers φ and ψ, units of the form y = φ(vᵗx), and outputs y1, y2, y3.]
Neural networks
• Representation of a function using a graph
• Layers: [Figure: a small graph with nodes x, g, f, and a layer with inputs x1, x2, x3 connected by weights v11, . . . , v33.]
• Symbolizes: f(g(x))
McCulloch-Pitts model
• Collect the input signals x1, x2, x3 into a vector x = (x1, x2, x3) ∈ R³.
• Choose a fixed vector v ∈ R³ and a constant c ∈ R.
• Compute:

y = I{⟨v, x⟩ > c} for some c ∈ R.

[Figure: a McCulloch-Pitts unit with inputs x1, x2, x3, weights v1, v2, v3 and a bias input −1 feeding a threshold unit I{• > 0}; geometrically, the projection ⟨x, v⟩/‖v‖ is compared against the threshold.]

f(x) = sgn(⟨v, x⟩ − c)

[Figure: the same unit drawn with output y = I{vᵗx > c}.]
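As a quick illustration, the unit can be written in a few lines of numpy; the weights v, threshold c and input x below are made-up values:

```python
import numpy as np

def mcculloch_pitts(x, v, c):
    """Threshold unit: returns 1 if <v, x> > c, else 0."""
    return int(np.dot(v, x) > c)

# Hypothetical parameters and input
v = np.array([0.5, -1.0, 2.0])
c = 0.25
x = np.array([1.0, 0.0, 1.0])
print(mcculloch_pitts(x, v, c))  # 1, since <v, x> = 2.5 > 0.25
```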
[Figure: criterion functions J(a), Jp(a), Jq(a), Jr(a) plotted over the weight space (a1, a2); each plot marks the "solution region" of weight vectors that classify all training points correctly. Illustration: R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley 2001]

The Perceptron Criterion

[Figure: the perceptron criterion Jp(a) and the variant Jr(a), again with the solution region marked. Illustration: R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley 2001]
Perceptron
"Train" the McCulloch-Pitts model (that is: estimate (c, v)) by applying gradient descent to the function

C_p(c, v) := \sum_{i=1}^{n} \mathbb{I}\Big\{\mathrm{sgn}\big(\langle v, \tilde{x}_i \rangle - c\big) \neq \tilde{y}_i\Big\} \, \Big\langle \begin{pmatrix} -c \\ v \end{pmatrix}, \begin{pmatrix} 1 \\ \tilde{x}_i \end{pmatrix} \Big\rangle

called the perceptron cost function.
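As a concrete illustration, here is a minimal numpy sketch of perceptron training in the homogeneous coordinates w = (v, −c), x̃ = (x, 1), with labels ỹ ∈ {−1, +1}; the synthetic data, learning rate, and epoch count are made up for the example:

```python
import numpy as np

def train_perceptron(X, y, lr=1.0, epochs=100):
    """Perceptron training: gradient steps on the perceptron cost.

    X: (n, d) data, y: labels in {-1, +1}.
    Works in homogeneous coordinates w = (v, -c), x_tilde = (x, 1).
    """
    Xt = np.hstack([X, np.ones((X.shape[0], 1))])  # append 1 to each point
    w = np.zeros(Xt.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xt, y):
            if np.sign(xi @ w) != yi:      # misclassified point
                w += lr * yi * xi          # gradient step on its cost term
    return w

# Synthetic, linearly separable data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) + np.outer(np.repeat([1, -1], 50), [2, 2])
y = np.repeat([1, -1], 50)
w = train_perceptron(X, y)
print(np.mean(np.sign(np.hstack([X, np.ones((100, 1))]) @ w) == y))
```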
\sigma(x) = \frac{1}{1 + e^{-x}}

Note

1 - \sigma(x) = \frac{1 + e^{-x} - 1}{1 + e^{-x}} = \frac{1}{e^{x} + 1} = \sigma(-x)

Derivative

\frac{d\sigma}{dx}(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = \sigma(x)\big(1 - \sigma(x)\big)

[Figure: the sigmoid (blue) and its derivative (red).]
• In linear classification: The decision boundary is a discontinuity.
• The boundary is represented either by the indicator I{⟨v, x⟩ > c} or by the sign function sgn(⟨v, x⟩ − c).

The most important use of the sigmoid function in machine learning is as a smooth approximation to the indicator function. Given a sigmoid σ and a data point x, we decide which side of the approximated boundary we are on by thresholding

σ(x) ≥ 1/2

[Figure: a sigmoid smoothly approximating the indicator of a decision boundary.]
Influence of θ
Here σθ denotes the scaled sigmoid σθ(x) := σ(θx) for θ > 0.
• As θ increases, σθ approximates I more closely.
• For θ → ∞, the sigmoid converges to I pointwise, that is: for every x ≠ 0, we have σθ(x) → I{x > 0} as θ → +∞.
• Note σθ(0) = 1/2 always, regardless of θ.
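To see the convergence numerically, here is a small check, assuming the parametrization σθ(x) := σ(θx) used above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 0.1  # any fixed x != 0
for theta in [1, 10, 100, 1000]:
    print(theta, sigmoid(theta * x))  # approaches I{x > 0} = 1
print(sigmoid(1000 * 0.0))            # always exactly 1/2 at x = 0
```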
The decision boundary of a linear classifier in R² is a discontinuous ridge. We can "stretch" σ into a ridge function on R²:
• The function σ(⟨v, x⟩ − c) is a sigmoid ridge, where the ridge is orthogonal to the normal vector v, and c is an offset that shifts the ridge "out of the origin".
• The plot on the right shows the normal vector (here: v = (1, 1)) in black.
• The parameters v and c have the same meaning for I and σ, that is, σ(⟨v, x⟩ − c) approximates I{⟨v, x⟩ ≥ c}.
Setup
• Two-class classification problem
• Observations x1 , . . . , xn ∈ Rd , class labels yi ∈ {0, 1}.
P(y | x) := Bernoulli(σ(⟨v, x⟩ − c)).
• Recall σ(⟨v, x⟩ − c) takes values in [0, 1] for all θ, and value 1/2 on the class boundary.
• The logistic regression model interprets this value as the probability of being in class y.
Since the model is defined by a parametric distribution, we can apply maximum likelihood.
Notation
Recall from Statistical Machine Learning: We collect the parameters in a vector w by writing
w := \begin{pmatrix} v \\ -c \end{pmatrix} \quad\text{and}\quad \tilde{x} := \begin{pmatrix} x \\ 1 \end{pmatrix} \quad\text{so that}\quad \langle w, \tilde{x} \rangle = \langle v, x \rangle - c\,.
Negative log-likelihood

L(w) := -\sum_{i=1}^{n} \Big( y_i \log \sigma(\langle w, \tilde{x}_i \rangle) + (1 - y_i) \log\big(1 - \sigma(\langle w, \tilde{x}_i \rangle)\big) \Big)

\nabla L(w) = \sum_{i=1}^{n} \big( \sigma(w^t \tilde{x}_i) - y_i \big)\, \tilde{x}_i
Note
• Each training data point xi contributes to the sum proportionally to the approximation
error σ(wt x̃i ) − yi incurred at xi by approximating the linear classifier by a sigmoid.
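Both quantities translate directly into numpy. A minimal sketch (the clipping epsilon inside the logarithms is an implementation detail added for numerical safety, not part of the model):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def nll(w, X_tilde, y, eps=1e-12):
    """Negative log-likelihood L(w) for logistic regression."""
    p = sigmoid(X_tilde @ w)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def grad_nll(w, X_tilde, y):
    """Gradient: sum_i (sigmoid(w^t x_i) - y_i) x_i."""
    return X_tilde.T @ (sigmoid(X_tilde @ w) - y)
```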
Maximum likelihood
• The ML estimator ŵ for w is the solution of
∇L(w) = 0 .
• For logistic regression, this equation has no solution in closed form.
• To find ŵ, we use numerical optimization.
• The function L is convex (= ∪-shaped).
Matrix notation
\tilde{X} := \begin{pmatrix} 1 & (\tilde{x}_1)_1 & \cdots & (\tilde{x}_1)_j & \cdots & (\tilde{x}_1)_d \\ \vdots & \vdots & & \vdots & & \vdots \\ 1 & (\tilde{x}_i)_1 & \cdots & (\tilde{x}_i)_j & \cdots & (\tilde{x}_i)_d \\ \vdots & \vdots & & \vdots & & \vdots \\ 1 & (\tilde{x}_n)_1 & \cdots & (\tilde{x}_n)_j & \cdots & (\tilde{x}_n)_d \end{pmatrix}
\qquad
D_\sigma := \begin{pmatrix} \sigma(w^t \tilde{x}_1)(1 - \sigma(w^t \tilde{x}_1)) & & 0 \\ & \ddots & \\ 0 & & \sigma(w^t \tilde{x}_n)(1 - \sigma(w^t \tilde{x}_n)) \end{pmatrix}
X̃ is the data matrix (or design matrix) you know from linear regression. X̃ has size n × (d + 1)
and Dσ is n × n.
Newton step
w^{(k+1)} = \big(\tilde{X}^t D_\sigma \tilde{X}\big)^{-1} \tilde{X}^t D_\sigma \underbrace{\Bigg( \tilde{X} w^{(k)} - D_\sigma^{-1} \Bigg( \begin{pmatrix} \sigma(\langle w^{(k)}, \tilde{x}_1 \rangle) \\ \vdots \\ \sigma(\langle w^{(k)}, \tilde{x}_n \rangle) \end{pmatrix} - \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} \Bigg) \Bigg)}_{=:\ u^{(k)}}
Differences:
• The vector y of regression responses is substituted by the vector u(k) above.
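The iteration can be written compactly in the equivalent form w ← w − H⁻¹∇L(w), where H = X̃ᵗDσX̃ is the Hessian. A sketch in numpy (the small jitter term added before solving is purely for numerical stability and is not part of the derivation):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def newton_logistic(X_tilde, y, steps=20, jitter=1e-8):
    """Newton / IRLS iteration for logistic regression."""
    n, d1 = X_tilde.shape
    w = np.zeros(d1)
    for _ in range(steps):
        p = sigmoid(X_tilde @ w)
        D = p * (1 - p)                         # diagonal of D_sigma
        H = X_tilde.T @ (D[:, None] * X_tilde)  # Hessian X~^t D X~
        g = X_tilde.T @ (p - y)                 # gradient of L
        w -= np.linalg.solve(H + jitter * np.eye(d1), g)  # Newton step
    return w
```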
Newton: Cost
• The size of the Hessian is (d + 1) × (d + 1).
• In high-dimensional problems, inverting H_L can become problematic.
Other methods
Maximum likelihood only requires that we minimize the negative log-likelihood; we can choose
any numerical method, not just Newton. Alternatives include:
• Pseudo-Newton methods (only invert H_L once, for w^(1), but do not guarantee quadratic convergence).
• Gradient methods.
• Approximate gradient methods, like stochastic gradient.
[Figure: as the vector v gets longer, the sigmoid ridge σ(wᵗx̃) gets steeper, and the error σ(wᵗx̃ᵢ) − yᵢ at each training point xᵢ shrinks.]

• Recall each training data point xᵢ contributes an error term σ(wᵗx̃ᵢ) − yᵢ to the log-likelihood.
• By increasing the length of w, we can make σ(wᵗx̃ᵢ) − yᵢ arbitrarily small without moving the decision boundary.
Solutions
• Overfitting can be addressed by including an additive penalty of the form L(w) + λ‖w‖.
Logistic regression
• Recall two-class logistic regression is defined by P(Y|x) = Bernoulli(σ(wt x)).
• Idea: To generalize logistic regression to K classes, choose a separate weight vector wk
for each class k, and define P(Y|x) by
\text{Multinomial}\big(\tilde\sigma(w_1^t x), \ldots, \tilde\sigma(w_K^t x)\big) \qquad\text{where}\qquad \tilde\sigma(w_k^t x) = \frac{\sigma(w_k^t x)}{\sum_l \sigma(w_l^t x)}\,.

P(y | x) = \prod_{k=1}^{K} \tilde\sigma(w_k^t \tilde{x})^{\mathbb{I}\{y = k\}}
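A minimal sketch of the resulting class probabilities; the weight matrix W (one row per class) and the input are made up, and the normalization follows the slide (dividing sigmoids of the per-class scores):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def class_probs(W, x):
    """P(y = k | x) with one weight vector per row of W."""
    s = sigmoid(W @ x)       # sigma(w_k^t x) for each class k
    return s / s.sum()       # sigma~: normalize over classes

W = np.array([[1.0, -0.5], [0.2, 0.3], [-1.0, 0.8]])  # hypothetical weights
x = np.array([0.5, 1.5])
print(class_probs(W, x))     # K = 3 probabilities summing to 1
```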
A graphical model represents the dependence structure within a set of random variables
as a graph.
Overview
Roughly speaking:
• Each random variable is represented by a vertex.
[Figure: a two-vertex graph on X and Y.]
Reason
• If X is discrete, L(X) is usually given by a mass function P(x).
• If it is continuous, L(X) is usually given by a density p(x).
• With the notation above, we do not have to distinguish between discrete and continuous
variables.
Recall
Two random variables are stochastically independent, or independent for short, if their joint
distribution factorizes:
L(X, Y) = L(X)L(Y)
For densities/mass functions:
P(x, y) = P(x)P(y) or p(x, y) = p(x)p(y)
Dependent means not independent.
Intuitively
X and Y are dependent if knowing the outcome of X provides any information about the
outcome of Y.
More precisely:
• If someone draws (X, Y) simultaneously, and only discloses X = x to you, does that provide information about the outcome of Y?
Definition
Given random variables X, Y, Z, we say that X is conditionally independent of Y given Z if
L(X, Y|Z = z) = L(X|Z = z)L(Y|Z = z) .
That is equivalent to
L(X|Y = y, Z = z) = L(X|Z = z) .
Notation
X ⊥⊥_Z Y
Intuitively
X and Y are dependent given Z = z if, although Z is known, knowing the outcome of X provides
additional information about the outcome of Y.
Definition
Let X1 , . . . , Xn be random variables. A (directed) graphical model represents a factorization
of joint distribution L(X1 , . . . , Xn ) as follows:
• Factorize L(X1 , . . . , Xn ).
Lack of uniqueness
The factorization is usually not unique, since e.g.
L(X, Y) = L(X|Y)L(Y) = L(Y|X)L(X) .
That means the direction of edges is not generally determined.
Remark
• If we use a graphical model to define a model or visualize a model, we decide on the
direction of the edges.
• Estimating the direction of edges from data is a very difficult (and very important)
problem. This is one of the main subjects of a research field called causal inference or
causality.
X ⊥⊥_Z Y

[Figure: a layered directed graph, Layer 1 above Layer 2, . . . ] All variables in the (k + 1)st layer are conditionally independent given the variables in the kth layer.

[Figure: a three-vertex graph on X, Y and Z in which X ⊥⊥_Z Y.]
Important
• X and Y are not independent; independence holds only conditionally on Z.
• In other words: If we do not observe Z, X and Y are dependent, and we have to change the graph:
[Figure: X and Y connected by a direct edge, in either direction.]

Example
• Suppose we start with two independent normal variables X and Y.
• Define Z = X + Y.
If we know Z, and someone reveals the value of Y to us, we know everything about X.
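The example is easy to check by simulation: marginally, X and Y are (nearly) uncorrelated, but restricting to samples where Z is close to a fixed value makes them strongly negatively correlated. A quick numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=200_000)
Y = rng.normal(size=200_000)
Z = X + Y

print(np.corrcoef(X, Y)[0, 1])       # ~0: marginally independent

mask = np.abs(Z - 1.0) < 0.05        # condition on Z close to 1
print(np.corrcoef(X[mask], Y[mask])[0, 1])  # ~ -1: given Z, Y determines X
```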
[Figure: a chain Z1 → Z2 → · · · → Zn−1 → Zn with an observed variable Xi attached to each Zi.]
Terminology: “Belief network” or “Bayes net” are alternative names for graphical models.
[Figure: three-vertex graphs on X, Y and Z, drawn directed and undirected.]
[Figure: a grid-shaped neighborhood graph with vertices . . . , Θi−1, Θi, Θi+1, . . . and edge weights such as wi−1,i and wi+1,j+1.]
A random variable Θi is associated with each vertex. Two random variables interact if they are
neighbors in the graph.
N = (VN , WN )
Neighborhoods
The set of all neighbors of vi in the graph,

∂(i) := { j | wij ≠ 0 },

is called the neighborhood of vi.
[Figure: a graph in which the vertices of ∂(i), shown in purple, surround vi.]
In words
The Markov property says that each Θi is conditionally independent of the remaining variables
given its Markov blanket.
Definition
A distribution L(Θ1, . . . , Θn) which satisfies the Markov property for a given graph N is called a Markov random field.
[Figure: a vertex Θi and its Markov blanket.]
MRF energy
In particular, we can write an MRF density for RVs Θ1:n as

p(\theta_1, \ldots, \theta_n) = \frac{1}{Z} \exp\big(-H(\theta_1, \ldots, \theta_n)\big)
Graphical models factorize over the graph. How does that work for MRFs?
[Figure: a graph on vertices 1, . . . , 6.]
The cliques in this graph are: (i) the triangles (1, 2, 3) and (1, 3, 4); (ii) each pair of vertices connected by an edge (e.g. (2, 6)).
Theorem
Let N be a neighborhood graph with vertex set V_N. Suppose the random variables {Θi, i ∈ V_N} take values in T, and their joint distribution has probability mass function P, so there is an energy function H such that

P(\theta_1, \ldots, \theta_n) = \frac{e^{-H(\theta_1, \ldots, \theta_n)}}{\sum_{\theta_1, \ldots, \theta_n \in T} e^{-H(\theta_1, \ldots, \theta_n)}}\,.

Then H decomposes into a sum over cliques, H = \sum_{C \in \mathcal{C}} H_C(\theta_i, i \in C), where \mathcal{C} is the set of cliques in N, and each H_C is a non-negative function with |C| arguments. Hence,

P(\theta_1, \ldots, \theta_n) = \prod_{C \in \mathcal{C}} \frac{e^{-H_C(\theta_i,\, i \in C)}}{\sum_{\theta_i \in T,\, i \in C} e^{-H_C(\theta_i,\, i \in C)}}
Definition
Suppose N = (V_N, W_N) is a neighborhood graph with n vertices and β > 0 a constant. Then

p(\theta_{1:n}) := \frac{1}{Z(\beta, W_N)} \exp\Big( \beta \sum_{i,j} w_{ij}\, \mathbb{I}\{\theta_i = \theta_j\} \Big)

defines a distribution, known as the Potts model.
Interpretation
• If wij > 0: The overall probability increases if Θi = Θj .
• If wij < 0: The overall probability decreases if Θi = Θj .
• If wij = 0: No interaction between Θi and Θj .
Positive weights encourage smoothness.
Ising model
The simplest choice is wij = 1 if (i, j) is an edge. Then

p(\theta_{1:n}) = \frac{1}{Z(\beta)} \exp\Big( \beta \sum_{(i,j)\ \text{is an edge}} \mathbb{I}\{\theta_i = \theta_j\} \Big)

[Figure: a grid graph with rows . . . , Θj−1, Θj, Θj+1, . . . and . . . , Θk−1, Θk, Θk+1, . . . ]
Example
Samples from an Ising model on a 56 × 56 grid graph.
Increasing β −→
Spatial model
Suppose we model each Xi by a distribution L(X|Θi ), i.e. each location i has its own parameter
variable Θi . This model is Bayesian (the parameter is a random variable). We use an MRF as
prior distribution.
[Figure: a grid of unobserved parameter variables Θi, Θi+1, Θj, Θj+1, . . . , each generating an observed variable Xi ∼ p( · | θi).]
Spatial smoothing
• We can define the joint distribution (Θ1 , . . . , Θn ) as a MRF on the grid graph.
• For positive weights, the MRF will encourage the model to explain neighbors Xi and Xj by
the same parameter value. → Spatial smoothing.
Problem 1: Sampling
Generate samples from the joint distribution of (Θ1 , . . . , Θn ).
Problem 2: Inference
If the MRF is used as a prior, we have to compute or approximate the posterior distribution.
Solution
• MRF distributions on grids are not analytically tractable. The only known exception is the
Ising model in 1 dimension.
• Both sampling and inference are based on Markov chain sampling algorithms.
In general
• A sampling algorithm is an algorithm that outputs samples X1 , X2 , . . . from a given
distribution P or density p.
• Sampling algorithms can for example be used to approximate expectations:
E_p[f(X)] \approx \frac{1}{n} \sum_{i=1}^{n} f(X_i)
Posterior expectations
If we are only interested in some statistic of the posterior of the form EQ̂n [ f (Θ)] (e.g. the
posterior mean), we can again approximate by
E_{\hat{Q}_n}[f(\Theta)] \approx \frac{1}{m} \sum_{i=1}^{m} f(\Theta_i)\,.
Example: Predictive distribution
The posterior predictive distribution is our best guess of what the next data point xn+1 looks
like, given the posterior under previous observations. In terms of densities:
p(x_{n+1} | x_{1:n}) := \int_T p(x_{n+1} | \theta)\, \hat{Q}_n(d\theta \,|\, X_{1:n} = x_{1:n})\,.
This is one of the key quantities of interest in Bayesian statistics.
[Figure: a density p on an interval [a, b]; points (Xi, Yi) fill the area A under the curve uniformly.]
Key observation
Suppose we can define a uniform distribution UA on the blue area A under the curve. If we
sample
(X1 , Y1 ), (X2 , Y2 ), . . . ∼iid UA
and discard the vertical coordinates Yi , the Xi are distributed according to p,
X1 , X2 , . . . ∼iid p .
[Figure: left, the density p on [a, b] bounded by c, with enclosing box B; right, the unnormalized curve k·p bounded by k·c, with enclosing box B.]
We simply draw Yi ∼ Uniform[0, kc] instead of Yi ∼ Uniform[0, c].
Consequence
For sampling, it is sufficient if p is known only up to normalization
(only the shape of p is known).
Sampling methods usually assume that we can evaluate the target distribution p up to a constant.
That is:
p(x) = \frac{1}{\tilde{Z}}\, \tilde{p}(x)\,,

and we can compute p̃(x) for any given x, but we do not know Z̃.
We have to pause for a moment and convince ourselves that there are useful examples where
this assumption holds.
Provided that we can compute the numerator, we can sample without computing the
normalization integral Z̃.
We already know that we can discard the normalization constant, but can we evaluate the
non-normalized posterior q̃n ?
• The problem with computing q̃n (as a function of unknowns) is that the product \prod_{i=1}^{n} \sum_{k=1}^{K} c_k\, p(x_i|\theta_k) blows up into K^n individual terms.
• If we evaluate q̃n for specific values of c, x and θ, each sum \sum_{k=1}^{K} c_k\, p(x_i|\theta_k) collapses to a single number.
(The normalization of an Ising model is similarly a sum over 2^n terms; the general Potts model is even more difficult.)
If p is not supported on a bounded interval, sampling uniformly from an enclosing box is not possible (since there is no uniform distribution on all of R or R^d).
Factorization
We factorize the target distribution or density p as

p(x) = r(x) · A(x)

where r is a distribution from which we know how to sample, and A is a probability function we can evaluate once a specific value of x is given.

If we draw proposal samples Xi i.i.d. from r, the resulting sequence of accepted samples produced by rejection sampling is again i.i.d. with distribution p. Hence:
Rejection samplers produce i.i.d. sequences of samples.
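A minimal rejection sampler along these lines; the unnormalized target p̃, the interval [0, 5], the uniform proposal, and the bound k = 1.6 are all made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(x):
    """Unnormalized target: the shape of p is known, Z is not."""
    return np.exp(-(x - 2.0) ** 2) + 0.5 * np.exp(-8 * (x - 4.0) ** 2)

def rejection_sample(n, lo=0.0, hi=5.0, k=1.6):
    """Proposals X ~ Uniform[lo, hi]; keep X if a uniform height falls under the curve."""
    samples = []
    while len(samples) < n:
        x = rng.uniform(lo, hi)
        if rng.uniform(0.0, k) < p_tilde(x):   # Y ~ Uniform[0, k]
            samples.append(x)
    return np.array(samples)

draws = rejection_sample(10_000)
print(draws.mean(), draws.std())
```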
Important consequence
If samples X1, X2, . . . are drawn by a rejection sampler, the sample average

\frac{1}{m} \sum_{i=1}^{m} f(X_i)

is an average of i.i.d. draws from p.

The fraction of accepted samples is the ratio |A|/|B| of the areas under the curves p̃ and r.
[Figure: two example densities p(x).] Example figures for sampling methods tend to look like the first. A high-dimensional distribution of correlated RVs will look rather more like the second.
We can easily end up in situations where we accept only one in 10^6 (or 10^10, or 10^20, . . . ) proposal samples. Especially in higher dimensions, we have to expect this to be not the exception but the rule.
The rejection problem can be fixed easily if we are only interested in approximating an
expectation Ep [ f (X)].
Importance sampling
We can sample X1, X2, . . . from q and approximate E_p[f(X)] as

E_p[f(X)] \approx \frac{1}{m} \sum_{i=1}^{m} f(X_i)\, \frac{p(X_i)}{q(X_i)}
Approximating E_p[f(X)]

E_p[f(X)] \approx \frac{1}{m} \sum_{i=1}^{m} f(X_i) \frac{p(X_i)}{q(X_i)} = \frac{1}{m} \sum_{i=1}^{m} f(X_i) \frac{Z_q\, \tilde{p}(X_i)}{Z_p\, \tilde{q}(X_i)} \approx \frac{\sum_{i=1}^{m} f(X_i)\, \tilde{p}(X_i)/\tilde{q}(X_i)}{\sum_{j=1}^{m} \tilde{p}(X_j)/\tilde{q}(X_j)}
Conditions
• Given are a target distribution p and a proposal distribution q.
• p = \frac{1}{Z_p}\tilde{p} and q = \frac{1}{Z_q}\tilde{q}.
• We can evaluate p̃ and q̃, and we can sample from q.
• The objective is to compute Ep [ f (X)] for a given function f .
Algorithm
1. Sample X1 , . . . , Xm from q.
2. Approximate Ep [ f (X)] as
E_p[f(X)] \approx \frac{\sum_{i=1}^{m} f(X_i)\, \tilde{p}(X_i)/\tilde{q}(X_i)}{\sum_{j=1}^{m} \tilde{p}(X_j)/\tilde{q}(X_j)}
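A direct transcription of the algorithm; the unnormalized target and proposal below are made up (a shifted vs. a standard normal), and f(x) = x, so the estimate should be close to the target mean 1:

```python
import numpy as np

rng = np.random.default_rng(0)

p_tilde = lambda x: np.exp(-0.5 * (x - 1.0) ** 2)   # unnormalized N(1, 1)
q_tilde = lambda x: np.exp(-0.5 * x ** 2)           # unnormalized N(0, 1)

m = 100_000
X = rng.normal(size=m)                   # step 1: sample from q
w = p_tilde(X) / q_tilde(X)              # importance weights p~/q~
f = lambda x: x                          # estimate E_p[X]
estimate = np.sum(f(X) * w) / np.sum(w)  # step 2: self-normalized average
print(estimate)                          # ~1.0
```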
[Figure: a density concentrated in a narrow region of interest.]
Once we have drawn a sample in the narrow region of interest, we would like to continue
drawing samples within the same region. That is only possible if each sample depends on the
location of the previous sample.
Proposals in rejection sampling are i.i.d. Hence, once we have found the region where p
concentrates, we forget about it for the next sample.
Note: For a Markov chain, Xi+1 can depend on Xi , so at least in principle, it is possible for an
MCMC sampler to "remember" the previous step and remain in a high-probability location.
The Markov chains we discussed so far had a finite state space X. For MCMC, state space now
has to be the domain of p, so we often need to work with continuous state spaces.
In the discrete case, t(y = i|x = j) is the entry pij of the transition matrix p.
Continuous case
• X is now uncountable (e.g. X = Rd ).
• The transition matrix p is substituted by the conditional probability t.
• A distribution P_inv with density p_inv is invariant if

\int_{\mathbf{X}} t(y|x)\, p_{\text{inv}}(x)\, dx = p_{\text{inv}}(y)

This is simply the continuous analogue of the equation \sum_i p_{ij} (P_{\text{inv}})_i = (P_{\text{inv}})_j.
[Figure captions: (1) We run the Markov chain for n steps; each step moves from the current location xi to a new location xi+1. (2) We "forget" the order and regard the locations x1:n as a random set of points. (3) If p (red contours) is both the invariant and the initial distribution, each Xi is distributed as Xi ∼ p.]
Note: Making precise what aperiodic means in a continuous state space is a bit more technical than in the finite case, but the
theorem still holds. We will not worry about the details here.
Implication
• If we can show that Pinv ≡ p, we do not have to know how to sample from p.
• Instead, we can start with any Pinit , and will get arbitrarily close to p for sufficiently large i.
The number m of steps required until Pm ≈ Pinv ≡ p is called the mixing time of the Markov
chain. (In probability theory, there is a range of definitions for what exactly Pm ≈ Pinv means.)
In MC samplers, the first m samples are also called the burn-in phase. The first m samples of
each run of the sampler are discarded:
X1, . . . , Xm−1 (burn-in; discard),  Xm, Xm+1, . . . (samples from (approximately) p; keep).
Convergence diagnostics
In practice, we do not know how large m is. There are a number of methods for assessing whether
the sampler has mixed. Such heuristics are often referred to as convergence diagnostics.
Estimating the lag
The most common method uses the autocorrelation function

\text{Auto}(x_i, x_j) := \frac{\mathbb{E}\big[(x_i - \mu_i)(x_j - \mu_j)\big]}{\sigma_i \sigma_j}

We compute Auto(x_i, x_{i+L}) empirically from the sample for different values of L, and find the smallest L for which the autocorrelation is close to zero.

[Figure: the empirical autocorrelation Auto(x_i, x_{i+L}) plotted against the lag L for two chains.]
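Empirically, Auto(x_i, x_{i+L}) can be estimated from a single chain by correlating the series with a shifted copy of itself; a sketch, using an AR(1) chain as a made-up stand-in for a slowly mixing sampler:

```python
import numpy as np

def autocorr(x, max_lag=25):
    """Empirical autocorrelation Auto(x_i, x_{i+L}) for L = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    return np.array([np.corrcoef(x[:-L or None], x[L:])[0, 1]
                     for L in range(max_lag + 1)])

# Example chain: AR(1), which mixes slowly for rho close to 1
rng = np.random.default_rng(0)
rho, n = 0.8, 50_000
x = np.zeros(n)
for i in range(1, n):
    x[i] = rho * x[i - 1] + rng.normal()

acf = autocorr(x)
print(np.argmax(acf < 0.05))   # smallest lag L with autocorrelation near zero
```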
There are about half a dozen popular convergence criteria; the one below is an example.
Gelman-Rubin criterion
• Start several chains at random. For each chain k, the sample X_i^k has a marginal distribution P_i^k.
• The distributions P_i^k will differ between chains in early stages.
• Once the chains have converged, all P_i^k = P_inv are identical.
• Criterion: Use a hypothesis test to compare P_i^k for different k (e.g. compare P_i^2 against the null hypothesis P_i^1). Once the test does not reject anymore, assume that the chains are past burn-in.
Reference: A. Gelman and D. B. Rubin: "Inference from Iterative Simulation Using Multiple Sequences", Statistical Science, Vol. 7 (1992) 457-511.
• If a proposed step increases the probability, we accept it.
• If it decreases the probability, we still accept it with a probability which depends on the difference to the current probability.
Hill-climbing interpretation
• The MH sampler somewhat resembles a gradient ascent algorithm on p, which tends to
move in the direction of increasing probability p.
• However:
• The actual steps are chosen at random.
• The sampler can move "downhill" with a certain probability.
• When it reaches a local maximum, it does not get stuck there.
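A minimal random-walk Metropolis sampler for a made-up unnormalized density with two modes; the acceptance step is exactly the "accept downhill moves with a certain probability" rule described above:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(x):
    """Unnormalized target density (two separated modes)."""
    return np.exp(-0.5 * (x - 3) ** 2) + np.exp(-0.5 * (x + 3) ** 2)

def metropolis(n, step=1.0, x0=0.0):
    x, chain = x0, []
    for _ in range(n):
        prop = x + step * rng.normal()   # symmetric random-walk proposal
        # Accept with prob min(1, p(prop)/p(x)); uphill moves are always
        # accepted, downhill moves with probability equal to the ratio.
        if rng.uniform() < p_tilde(prop) / p_tilde(x):
            x = prop
        chain.append(x)
    return np.array(chain)

chain = metropolis(100_000)
print(chain.mean())   # ~0 by symmetry, if both modes are visited
```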
More generally
For complicated posteriors (recall: small regions of concentration, large low-probability regions
in between) choosing q is much more difficult. To choose q with good performance, we already
need to know something about the posterior.
There are many strategies, e.g. mixture proposals (with one component for large steps and one
for small steps).
By far the most widely used MCMC algorithm is the Gibbs sampler.
Full conditionals
Suppose L(X) is a distribution on RD , so X = (X1 , . . . , XD ). The conditional probability of the
entry Xd given all other entries,
L(Xd |X1 , . . . , Xd−1 , Xd+1 , . . . , XD )
is called the full conditional distribution of Xd .
On RD , that means we are interested in a density
p(xd |x1 , . . . , xd−1 , xd+1 , . . . , xD )
Gibbs sampling
The Gibbs sampler is the special case of the Metropolis-Hastings algorithm defined by
proposal distribution for Xd = full conditional of Xd .
• Gibbs sampling is only applicable if we can compute the full conditionals for each
dimension d.
• If so, it provides us with a generic way to derive a proposal distribution.
Proposal distribution
Suppose p is a distribution on RD , so each sample is of the form Xi = (Xi,1 , . . . , Xi,D ). We
generate a proposal Xi+1 coordinate-by-coordinate as follows:
X_{i+1,1} \sim p(\,\cdot\,|x_{i,2}, \ldots, x_{i,D})
\vdots
X_{i+1,d} \sim p(\,\cdot\,|x_{i+1,1}, \ldots, x_{i+1,d-1}, x_{i,d+1}, \ldots, x_{i,D})
\vdots
X_{i+1,D} \sim p(\,\cdot\,|x_{i+1,1}, \ldots, x_{i+1,D-1})
Note: Each new Xi+1,d is immediately used in the update of the next dimension d + 1.
No rejections
It is straightforward to show that the Metropolis-Hastings acceptance probability for each
xi+1,d+1 is 1, so proposals in Gibbs sampling are always accepted.
Full conditionals
In a grid with 4-neighborhoods, for instance, the Markov property implies that

p(\theta_d | \theta_1, \ldots, \theta_{d-1}, \theta_{d+1}, \ldots, \theta_D) = p(\theta_d | \theta_{\text{left}}, \theta_{\text{right}}, \theta_{\text{up}}, \theta_{\text{down}})

[Figure: a grid vertex Θd with its four neighbors Θleft, Θright, Θup, Θdown.]
Gibbs sampler
Each step of the Gibbs sampler generates n updates according to

\theta_{i+1,d} \sim p(\,\cdot\,|\theta_{i+1,1}, \ldots, \theta_{i+1,d-1}, \theta_{i,d+1}, \ldots, \theta_{i,D}) \propto \exp\Big( \beta \big( \mathbb{I}\{\theta_{i+1,d} = \theta_{\text{left}}\} + \mathbb{I}\{\theta_{i+1,d} = \theta_{\text{right}}\} + \mathbb{I}\{\theta_{i+1,d} = \theta_{\text{up}}\} + \mathbb{I}\{\theta_{i+1,d} = \theta_{\text{down}}\} \big) \Big)
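A sketch of one such Gibbs sweep for a grid Potts model with K states and 4-neighborhoods (grid size, β, and number of sweeps are made up; the full conditional of each site depends only on its four neighbors):

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(theta, beta, K):
    """One Gibbs sweep over a grid Potts model with 4-neighborhoods."""
    H, W = theta.shape
    for i in range(H):
        for j in range(W):
            # count, for each state s, how many of the 4 neighbors equal s
            counts = np.zeros(K)
            for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
                ni, nj = i + di, j + dj
                if 0 <= ni < H and 0 <= nj < W:
                    counts[theta[ni, nj]] += 1
            p = np.exp(beta * counts)
            p /= p.sum()                   # full conditional of theta[i, j]
            theta[i, j] = rng.choice(K, p=p)
    return theta

theta = rng.integers(0, 5, size=(56, 56))  # K = 5 states on a 56 x 56 grid
for _ in range(200):                       # 200 burn-in sweeps
    gibbs_sweep(theta, beta=1.2, K=5)
```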
• MRFs were introduced as tools for image smoothing and segmentation by D. and S. Geman in 1984.
• They sampled from a Potts model with a Gibbs sampler, discarding 200 iterations as
burn-in.
• Such a sample (after 200 steps) is shown above, for a Potts model in which each variable
can take one out of 5 possible values.
• These patterns led computer vision researchers to conclude that MRFs are "natural" priors
for image segmentation, since samples from the MRF resemble a segmented image.
[Figures: the state of the sampler after 200 iterations and after 10,000 iterations, for several chains (Chain 1, . . . , Chain 5).]
• The "segmentation" patterns are not sampled from the MRF distribution p ≡ Pinv , but
rather from P200 6= Pinv .
• The patterns occur not because MRFs are "natural" priors for segmentations, but because
the sampler’s Markov chain has not mixed.
• MRFs are smoothness priors, not segmentation priors.
Problem
We have to solve an inference problem where the correct solution is an “intractable” distribution
with density p∗ (e.g. a complicated posterior in a Bayesian inference problem).
Variational approach
Approximate p∗ as
q^* := \arg\min_{q \in \mathcal{Q}} \phi(q, p^*)
where Q is a class of simple distributions and φ is a cost function (small φ means good fit).
That turns the inference problem into a constrained optimization problem
min φ(q, p∗ )
s.t. q ∈ Q
Derivatives of functionals
If F is a function space and k • k a norm on F , we can apply the same idea to φ : F → R:
\delta\phi(f) := \lim_{\|\tilde{f}\| \searrow 0} \frac{\phi(f + \tilde{f}) - \phi(f)}{\|\tilde{f}\|}
δφ(f ) is called the Fréchet derivative of φ at f .
f is a minimum of a Fréchet-differentiable functional φ only if δφ(f ) = 0.
Horseshoes
• We have to represent the infinite-dimensional quantities fk and ∆fk in some way.
• Many interesting functionals φ are not Fréchet-differentiable as functionals on F . They
only become differentiable when constrained to a much smaller subspace.
One solution is “variational calculus”, an analytic technique that addresses both problems. (We
will not need the details.)
In practice, we solve

min φ(q, p*)
s.t. q ∈ Q

where Q is a parametric family.
• That means each element of Q is of the form q( • |θ), for θ ∈ T ⊂ R^d.
• The problem then reduces back to optimization in R^d:

min φ(q( • |θ))
s.t. θ ∈ T
• We can apply gradient descent, Newton, etc.
Be careful
• The differential entropy does not behave like the entropy (e.g. it can be negative).
• The KL divergence for densities has properties analogous to the mass function case.
• D_KL(p*‖q) emphasizes regions where the "true" model p* has high probability. That is what we should use if possible.
• We use VI because p* is intractable, so we usually cannot compute expectations under it.
• We use the expectation D_KL(q‖p*) under the approximating simpler model instead.
We have to understand the implications of this choice.
[Figure: two approximations q (red) of a distribution p* (green) over (z1, z2). Panel (b): minimizing D_KL(p*‖q) = D_KL(green‖red); panel (a): minimizing D_KL(q‖p*) = D_KL(red‖green).]
Summary: VI approximation
min F(q)
where F(q) = E[log q(Z)] − E[log p(Z, x)]
s.t. q∈Q
Terminology
• The function F is called a free energy in statistical physics.
• Since there are different forms of free energies, various authors attach different adjectives
(variational free energy, Helmholtz free energy, etc).
• Parts of the machine learning literature have renamed F, by maximizing the objective
function −F and calling it an evidence lower bound, since
−F(q) + D_KL(q ‖ p( • |x)) = log p(x), hence e^{−F(q)} ≤ p(x),
and p(x) is the “evidence” in the Bayes equation.
In previous example
[Figure: over (z1, z2), the spherical Gaussian approximation is a mean field; the correlated approximation is not a mean field.]
Example: Mean Field for the Potts Model
Model
We consider an MRF distribution for X1, . . . , Xn with values in {−1, +1}, given by

P(X_1, \ldots, X_n) = \frac{1}{Z(\beta)} \exp\big(-\beta H(X_1, \ldots, X_n)\big) \quad\text{where}\quad H = -\sum_{i,j=1}^{n} w_{ij} X_i X_j - \sum_{i=1}^{n} h_i X_i

(the h-term is the "external field")
Variational approximation
We choose Q as the family
\mathcal{Q} := \Big\{ \prod_{i=1}^{n} Q_{m_i} \;\Big|\; m_i \in [-1, 1] \Big\} \quad\text{where}\quad Q_m(X) := \begin{cases} \frac{1+m}{2} & X = 1 \\ \frac{1-m}{2} & X = -1 \end{cases}

Each factor is a Bernoulli\big(\frac{1+m}{2}\big), except that the range is {−1, 1} not {0, 1}.
Optimization Problem
\min_{m_1, \ldots, m_n} D_{\text{KL}}\Big( \prod_{i=1}^{n} Q_{m_i} \,\Big\|\, P \Big)
\quad\text{s.t. } m_i \in [-1, 1] \text{ for } i = 1, \ldots, n
That is: For given values of wij and hi , we have to solve for the values mi .
Interpretation
• In the MRF P, the random variables Xi interact.
• There is no interaction in the approximation.
• Instead, the effect of interactions is approximated by encoding them in the parameters.
• This is somewhat like a single effect ("field") acting on all variables simultaneously ("mean field").
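Setting the derivative of the KL divergence to zero leads, for this model, to coupled fixed-point equations in the m_i, which are usually iterated. The sketch below assumes the standard mean-field form m_i = tanh(β(Σ_j w_ij m_j + h_i)); conventions for double-counting edges vary by a factor of 2, and the example graph and parameters are made up:

```python
import numpy as np

def mean_field_ising(W, h, beta, iters=200):
    """Iterate m_i = tanh(beta * (sum_j W_ij m_j + h_i)).

    Assumes a symmetric W with each pair counted once per row; other
    conventions differ by a factor of 2.
    """
    m = np.zeros(len(h))
    for _ in range(iters):
        m = np.tanh(beta * (W @ m + h))
    return m

# Hypothetical chain graph with positive (smoothing) weights
n = 10
W = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
h = np.zeros(n)
h[0] = 1.0                      # external field pinning the first spin
print(mean_field_ising(W, h, beta=0.8))
```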
As a graphical model
[Figure: plate notation. Box notation indicates c and θ are not random; the assignments Z1, . . . , Zn select components for the observations X1, . . . , Xn. Collapsed, this is a plate Θ → X repeated n times.]
More precisely
Specifically, the term BMM implies that all priors are natural conjugate priors. That is:
• The mixture components p(x|θ) are an exponential family model.
• The prior on each Θk is a natural conjugate prior of p.
• The prior of the vector (C1 , . . . , CK ) is a Dirichlet distribution.
Definition
A model of the form
\pi(x) = \sum_{k=1}^{K} C_k\, p(x|\Theta_k) = \int_T p(x|\theta)\, M(\theta)\, d\theta
is called a Bayesian mixture model if p(x|θ) is an exponential family model and M a random
mixing distribution, where:
• Θ1 , . . . , ΘK ∼iid q( . |λ, y), where q is a natural conjugate prior for p.
As a graphical model
[Figure: plate notation Θ → X, repeated n times.]
Posterior distribution
The posterior density of a BMM under observations x1 , . . . , xn is (up to normalization):
\Pi(c_{1:K}, \theta_{1:K} | x_{1:n}) \propto \Big( \prod_{i=1}^{n} \sum_{k=1}^{K} c_k\, p(x_i|\theta_k) \Big) \Big( \prod_{k=1}^{K} q(\theta_k|\lambda, y) \Big)\, q_{\text{Dirichlet}}(c_{1:K})
This Gibbs sampler is a bit harder to derive, so we skip the derivation and only look at the algorithm.
Assignment probabilities
Each step of the Gibbs sampler computes an assignment matrix:
a = \begin{pmatrix} a_{11} & \cdots & a_{1K} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nK} \end{pmatrix} \qquad\text{with}\qquad a_{ik} = \Pr\{x_i \text{ in cluster } k\}

Entries are computed as they are in the EM algorithm:

a_{ik} = \frac{C_k\, p(x_i|\Theta_k)}{\sum_{l=1}^{K} C_l\, p(x_i|\Theta_l)}
In contrast to EM, the values Ck and Θk are random.
2. For each cluster k, sample a new value for \Theta_k^j from the conjugate posterior \Pi(\Theta_k) under the observations currently assigned to k:

\Theta_k^j \sim \Pi\Big( \Theta \,\Big|\, \lambda + \sum_{i=1}^{n} \mathbb{I}\{Z_i^j = k\},\; y + \sum_{i=1}^{n} \mathbb{I}\{Z_i^j = k\}\, S(x_i) \Big)

3. Sample new cluster proportions C_{1:K} from the Dirichlet posterior (under all x_i):

C_{1:K}^j \sim \text{Dirichlet}(\alpha + n,\, g_{1:K}^j) \qquad\text{where}\qquad g_k^j = \frac{\alpha \cdot g_k + \sum_{i=1}^{n} \mathbb{I}\{Z_i^j = k\}}{\alpha + n}

(the denominator \alpha + n is the normalization)
The BMM Gibbs sampler looks very similar to the EM algorithm, with maximization steps (in
EM) substituted by posterior sampling steps:
                 Representation of assignments         Parameters
EM               assignment probabilities a_{i,1:K}    a_{ik}-weighted MLE
K-means          m_i = arg max_k (a_{i,1:K})           MLE for each cluster
Gibbs for BMM    m_i ∼ Multinomial(a_{i,1:K})          sample posterior for each cluster
Dirichlet distribution
The Dirichlet distribution is the distribution on ∆K with density
q_{\text{Dirichlet}}(c_{1:K} | \alpha, g_{1:K}) := \frac{1}{Z(\alpha, g_{1:K})} \exp\Big( \sum_{k=1}^{K} (\alpha g_k - 1) \log(c_k) \Big)
Parameters:
• g1:K ∈ ∆K : Mean parameter, i.e. E[c1:K ] = g1:K .
[Figure: density plots and heat maps of the Dirichlet for α = 1.8 and α = 10.]
Model
The Dirichlet is the natural conjugate prior on the multinomial parameters. If we observe h_k counts in category k, the posterior is

\Pi(c_{1:K} | h_1, \ldots, h_K) = q_{\text{Dirichlet}}\big(c_{1:K} \,\big|\, \alpha + n,\; \tfrac{1}{\alpha + n}(\alpha g_1 + h_1, \ldots, \alpha g_K + h_K)\big)

where n = \sum_k h_k is the total number of observations.
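In code, the conjugate update just adds the counts to the concentration parameters. A small numpy sketch, parametrizing the Dirichlet directly by the vector αg so the update is a one-liner (the prior and counts are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

alpha_g = np.array([2.0, 2.0, 2.0])   # prior: alpha * g_{1:K}
h = np.array([10, 3, 1])              # observed counts per category

posterior_param = alpha_g + h         # conjugate update
print(posterior_param / posterior_param.sum())  # posterior mean of c_{1:K}
print(rng.dirichlet(posterior_param, size=3))   # posterior samples
```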
So: Components of the prior couple (1) through normalization and (2) through the latent variable \sum_j X_j.
• Topics θ1 , . . . , θK .
• θkj = Pr{ word j in topic k}.
Problem
Each document is generated by a single topic; that is a very restrictive assumption.
Parameters
Suppose we consider a corpus with K topics and a vocabulary of d words.
• φ ∈ ∆K topic proportions (φk = Pr{ topic k}).
Observation
LDA is almost a Bayesian mixture model: Both use multinomial components and a Dirichlet
prior on the mixture weights. However, they are not identical.
Comparison
Bayesian MM:
1. Sample c1:K ∼ Dirichlet(φ).
2. Sample topic k ∼ Multinomial(c1:K).
3. For i = 1, . . . , M: sample wordᵢ ∼ Multinomial(θk).

Admixture (LDA):
1. Sample c1:K ∼ Dirichlet(φ).
2. For i = 1, . . . , M: sample topic kᵢ ∼ Multinomial(c1:K) and sample wordᵢ ∼ Multinomial(θkᵢ).
In admixtures:
• c1:K is generated at random, once for each document.
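The admixture side of the comparison is short enough to write out; a sketch with made-up sizes, hyperparameters and topic matrix (note that the topic is redrawn for every word, which is the only difference from the mixture):

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, M_words = 3, 8, 20            # topics, vocabulary size, words per document
alpha = np.ones(K)                  # Dirichlet parameter (hypothetical)
theta = rng.dirichlet(np.ones(V), size=K)   # topic-word matrix, rows sum to 1

def sample_document():
    c = rng.dirichlet(alpha)        # topic proportions, fresh per document
    words = []
    for _ in range(M_words):
        k = rng.choice(K, p=c)      # admixture: a new topic for every word
        words.append(rng.choice(V, p=theta[k]))
    return words                    # list of word indices

print(sample_document())
```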
[Figure: plate diagrams. Topic proportions C ∼ Dirichlet(α); topic of word Z ∼ Multinomial(C); observed word ∼ Multinomial(row Z of Θ); inner plate N = # words, outer plate M = # documents. Left: Bayesian mixture (for M documents of N words each); right: LDA.]
• The parameter matrix θ is of size #topics × |vocabulary|.
• Meaning: θki = probability that term i is observed under topic k.
• Note entries of θ are non-negative and each row sums to 1.
• To learn the parameters θ along with the other parameters, we add a Dirichlet prior with parameters η. The rows are drawn i.i.d. from this prior.
Θ is now random.
[Figure: the LDA plate diagram extended with the prior η on Θ.]
Variational approximation

\mathcal{Q} = \{ q(z, c, \theta | \lambda, \phi, \gamma) \mid \lambda, \phi, \gamma \}

where

q(z, c, \theta | \lambda, \phi, \gamma) := \prod_{k=1}^{K} \underbrace{p(\theta_k|\lambda)}_{\text{Dirichlet}} \prod_{m=1}^{M} \Big( q(c_m|\gamma_m) \prod_{n=1}^{N} \underbrace{p(z_{mn}|\phi_{mn})}_{\text{Multinomial}} \Big)

\min D_{\text{KL}}\big( q(z, c, \theta|\lambda, \phi, \gamma) \,\big\|\, p(z, c, \theta|\alpha, \eta) \big) \quad\text{s.t. } q \in \mathcal{Q}

[Figure: plate diagrams of the LDA model (left, with parameters α, η) and of the factorized variational family (right, with parameters λ, γ, φ).]
Algorithmic solution
We solve the minimization problem
\min D_{\text{KL}}\big( q(z, c, \theta|\lambda, \phi, \gamma) \,\big\|\, p(z, c, \theta|\alpha, \eta) \big) \quad\text{s.t. } q \in \mathcal{Q}
by applying gradient descent to the resulting free energy.
VI algorithm
It can be shown that gradient descent amounts to the following algorithm:
Local updates

\phi_{mn}^{(t+1)} := \frac{\tilde{\phi}_{mn}}{\sum_n \tilde{\phi}_{mn}}
\quad\text{where}\quad
\tilde{\phi}_{mn} = \exp\Big( \mathbb{E}_q\big[\log(C_{m1}), \ldots, \log(C_{md}) \,\big|\, \gamma^{(t)}\big] + \mathbb{E}_q\big[\log(\Theta_{1,w_n}), \ldots, \log(\Theta_{d,w_n}) \,\big|\, \lambda^{(t)}\big] \Big)

\gamma^{(t+1)} := \alpha + \sum_{n=1}^{N} \phi_n^{(t+1)}

Global updates

\lambda_k^{(t+1)} = \eta + \sum_m \sum_n \text{word}_{mn}\, \phi_{mn}^{(t+1)}
The William Randolph Hearst Foundation will give $1.25 million to Lincoln Center, Metropoli-
tan Opera Co., New York Philharmonic and Juilliard School. “Our board felt that we had a
real opportunity to make a mark on the future of the performing arts with these grants an act
every bit as important as our traditional areas of support in health, medical research, education
and the social services,” Hearst Foundation President Randolph A. Hearst said Monday in
announcing the grants. Lincoln Center’s share will be $200,000 for its new building, which
will house young artists and provide new public facilities. The Metropolitan Opera Co. and
New York Philharmonic will receive $400,000 each. The Juilliard School, where music and
the performing arts are taught, will get $250,000. The Hearst Foundation, a leading supporter
of the Lincoln Center Consolidated Corporate Fund, will make its usual annual $100,000
donation, too.
Figure 8: An example article from the AP corpus. Each color codes a different factor from which the word is putatively generated. From Blei, Ng, Jordan, "Latent Dirichlet Allocation", 2003.
Restricted Boltzmann Machines
Boltzmann Machine
Definition
A Markov random field distribution of variables X1 , . . . , Xn with values in {0, 1} is called a
Boltzmann machine if its joint law is
P(x_1, \ldots, x_n) = \frac{1}{Z} \exp\Big( -\sum_{i<j\leq n} w_{ij} x_i x_j - \sum_{i\leq n} c_i x_i \Big),
where wij are the edge weights of the MRF neighborhood graph, and c1 , . . . , cn are scalar
parameters.
Remarks
• The Markov blanket of Xi are those Xj with wij ≠ 0.
• For x ∈ {−1, 1}n instead: Potts model with external
magnetic field.
• This is an exponential family with sufficient statistics Xi Xj
and Xi .
• As an exponential family, it is also a maximum entropy
model.
P(x_1, \ldots, x_n) = \frac{1}{Z} \exp\Big( -\sum_{i<j\leq n} w_{ij} x_i x_j - \sum_{i\leq n} c_i x_i \Big)
Matrix representation
We collect the parameters in a matrix W := (wij ) and a vector c = (c1 , . . . , cn ), and write
equivalently:
P(x_1, \ldots, x_n) = \frac{1}{Z(W, c)}\, e^{-x^t W x - c^t x}
With observations
If some vertices represent observation variables Yi:

P(x_1, \ldots, x_n, y_1, \ldots, y_m) = \frac{e^{-(x,y)^t W (x,y) - c^t x - \tilde{c}^t y}}{Z(W, c, \tilde{c})}

[Figure: an MRF in which observation vertices Y1, . . . , Y4 are attached to the variables X1, . . . , X4.]
Recall our hierarchical design approach
• Only permit layered structure.
• Obvious grouping: One layer for X, one for Y.
• As before: No connections within layers.
• Since the graph is undirected, that makes it bipartite.
[Figure: a two-layer graph with X1, X2, X3 in one layer and Y1, Y2, Y3 in the other.]

Bipartite graphs
A bipartite graph is a graph whose vertex set V can be subdivided into two sets A and B = V \ A such that every edge has one end in A and one end in B.
Definition
A restricted Boltzmann machine (RBM) is a Boltzmann machine whose neighborhood graph
is bipartite.
That defines two layers. We usually think of one of these layers as observed, and one as
unobserved.
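Bipartiteness is what makes the RBM computationally convenient: given one layer, the units of the other layer are conditionally independent, so a Gibbs sampler can update a whole layer at once. A sketch under the commonly used conditionals P(y_j = 1 | x) = σ(Σ_i w_ij x_i + c̃_j); sign conventions vary with how the energy is written, and the network size and parameters here are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def gibbs_step(x, W, c, c_tilde):
    """One block Gibbs update of an RBM with visible layer x, hidden layer y."""
    y = (rng.uniform(size=c_tilde.shape) < sigmoid(x @ W + c_tilde)).astype(float)
    x = (rng.uniform(size=c.shape) < sigmoid(y @ W.T + c)).astype(float)
    return x, y

# Hypothetical small RBM: 6 visible units, 4 hidden units
W = rng.normal(scale=0.1, size=(6, 4))
c, c_tilde = np.zeros(6), np.zeros(4)
x = rng.integers(0, 2, size=6).astype(float)
for _ in range(1000):
    x, y = gibbs_step(x, W, c, c_tilde)
```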
Consequence
[Figure: a layered network. The top layers contain parameter variables Θ1, Θ2, . . . , ΘN (input layer), the bottom layer the data X1, X2, . . . , XN. L(Θ1:N) is the prior; L(X1:N | Θ1:N) is the likelihood.]
• Task: Given X1, . . . , XN, find L(Θ1:N | X1:N).
• Problem: Recall "explaining away".
[Figure: a collider X → Z ← Y: observing Z makes X and Y dependent.]
Idea
• Invent a prior such that the dependencies in the prior and likelihood "cancel out".
• Such a prior is called a complementary prior.
• We will see how to construct a complementary prior below.
[Figure: layers of parameter variables Θ1, Θ2, . . . , ΘN above the data layer X1, X2, . . . , XN.]
Observation
X^(1), X^(2), . . . , X^(K) is a Markov chain.

Building a Complementary Prior
[Figure: the chain X^(1) → X^(2) → · · · , drawn as a sequence of layers.]

Not examinable. See: G.E. Hinton, S. Osindero and Y.W. Teh, Neural Computation 18(7):1527–1554, 2006.
Where Do We Get the Markov Chain?
Start with an RBM with layers X and Y.
[Figure: the RBM's Gibbs sampler unrolled into alternating X and Y layers.]
The Gibbs sampler for the RBM becomes the model for the directed network.
Not examinable. See: G.E. Hinton, S. Osindero and Y.W. Teh, Neural Computation 18(7):1527–1554, 2006.
[Figure: the unrolled sampler continued, alternating X and Y layers.]
For K → ∞, we have Y^(∞) =_d Y (equality in distribution).
Now suppose we use L(Y^(∞)) as our prior distribution. Then:
Using an "infinitely deep" graphical model given by the Gibbs sampler for the RBM as a prior is equivalent to using the Y-layer of the RBM as a prior.
[Figure: a deep belief network. First two layers (the Θ-layer Θ1, . . . , ΘN and the layer above): an RBM (undirected). Remaining layers: directed, ending in the data X1, X2, . . . , XN.]
Summary
• The RBM consisting of the first two layers is equivalent to an infinitely deep directed network representing the Gibbs sampler (the rolled-off Gibbs sampler on the previous slide).
• That network is infinitely deep because a draw from the actual RBM distribution corresponds to a Gibbs sampler that has reached its invariant distribution.
• When we draw from the RBM, the second layer is distributed according to the invariant distribution of the Markov chain given by the RBM Gibbs sampler.
• If the transition from each layer to the next in the directed part is given by the Markov chain's transition kernel pT, the RBM is a complementary prior.
• We can then reverse all edges between the Θ-layer and the X-layer.
f : R³ → R³ with input x = (x1, x2, x3)

[Figure: a single layer with weights w11, . . . , w33 connecting inputs x1, x2, x3 to activation functions φ1, φ2, φ3.]

f(x) = \begin{pmatrix} f_1(x) \\ f_2(x) \\ f_3(x) \end{pmatrix} \quad\text{with}\quad f_i(x) = \phi_i\Big( \sum_{j=1}^{3} w_{ij} x_j \Big)
[Figure: a stack of layers computing f^(1), f^(2), . . . , f^(K).]
Layers
• Each function f^(k) is of the form f^(k) : R^{d_k} → R^{d_{k+1}}.
• d_k is the number of nodes in the kth layer. It is also called the width of the layer.
• We mostly assume for simplicity: d_1 = . . . = d_K =: d.
Hidden units
• Any nodes (or “units”) in the network that are neither input nor output nodes are called
hidden.
• Every network has an input layer and an output layer.
• If there are any additional layers (which hence consist of hidden units), they are called hidden layers.
[Figure: decision regions R1 and R2 in the (x1, x2)-plane implemented by two-layer and three-layer networks.]
Figure 6.3: Whereas a two-layer network classifier can only implement a linear decision boundary, given an adequate number of hidden units, three-, four- and higher-layer networks can implement arbitrary decision boundaries. The decision regions need not be convex, nor simply connected.
Illustration: R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley 2001
The XOR Problem
[Figure: a three-layer network solving XOR: input units x1, x2 and a bias, hidden units with weights wji, and an output unit z with weights wkj; the output z = ±1 splits the plane into regions R1 (z = 1) and R2 (z = −1).]
The parity or exclusive-OR problem can be solved by a three-layer network. Shown is the two-dimensional feature space x1–x2 and the regions we would like to represent.
Neural network representation
[Figure: a 2-4-1 network with inputs x1, x2, hidden units y1, y2, y3, y4, and output z1.]
Figure 6.2: A 2-4-1 network (with bias) along with the response functions at different units; each hidden and output unit has sigmoidal transfer function f(·). In the case shown, the hidden unit outputs are paired in opposition thereby producing a "bump" at the output unit. Given a sufficiently large number of hidden units, any continuous function from input to output can be approximated arbitrarily well by such a network.
Illustration: R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley 2001
Feature Extraction
Features
• Raw measurement data is typically not used directly as input for a learning algorithm.
Some form of preprocessing is applied first.
• We can think of this preprocessing as a function, e.g.
F : raw data space −→ Rd
(Rd is only an example, but a very common one.)
• If the raw measurements are m1 , . . . , mN , the data points which are fed into the learning
algorithm are the images xn := F(mn ).
Terminology
• F is called a feature map.
• Its dimensions (the dimensions of its range space) are called features.
• The preprocessing step (= application of F to the raw data) is called feature extraction.
[Figure: the learning pipeline with stages: feature extraction (preprocessing), working data, marked patterns, training (calibration), trained model, apply on test data, error estimate.]
[Figure: a 64-2-3 network: input units x1, . . . , x64 (one per pixel), hidden units h1, h2 with input-to-hidden weights w11, . . . , w64,2, and three output units f1, f2, f3.]
• 3 binary output units, where fi(x) = 1 means the image is in class i.
• Each hidden unit has 64 input weights, one per pixel. The weight values can be plotted as 8 × 8 images.
A Simple Example
[Figure: sample training patterns (with random noise), and the weight values of hidden units h1 and h2 plotted as images.]
Figure 6.14: The top images represent patterns from a large training set used to train a 64-2-3 sigmoidal network for classifying three characters. The bottom figures show the input-to-hidden weights (represented as patterns) at the two hidden units after training. Note that these learned weights indeed describe feature groupings useful for the classification task. In large networks, such patterns of learned weights may be difficult to interpret in this way.
• Dark regions = large weight values.
• Note the learned input-to-hidden weights emphasize regions that distinguish characters.
• We can think of each weight (= each pixel) as a feature.
• The features with large weights for h1 distinguish {E,F} from L.
• The features for h2 distinguish {E,L} from F.
Illustration: R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley 2001
Width vs Depth
A neural network represents a (typically) complicated function f by simple functions φᵢ^(k). A classical result represents f exactly in the form

f(x) = \sum_i \xi_i\Big( \sum_j \tau_{ij}(x_j) \Big)

where ξi and τij are functions R → R. A similar result shows one can approximate f to arbitrary precision using specifically sigmoids, as

f(x) \approx \sum_{j=1}^{M} w_j^{(2)}\, \sigma\Big( \sum_{i=1}^{d} w_{ij}^{(1)} x_i + c_i \Big)
Limiting width
• Limiting layer width means we limit the degrees of freedom of each function f (k) .
• That is a notion of parsimony.
• Again: There seem to be a lot of interesting questions to study here, but so far, we have no
answers.
[Figure: two deep networks computing f(x) ≈ x through layers f^(1), f^(2), f^(3). Layers of the same width: no effect. Narrow middle layers: compression effect.]
[Figure: deep autoencoder training: RBM pretraining, unrolling into an encoder, fine tuning. Illustration: K. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press 2012]
Training Neural Networks
Error measure
• We assume each training data point x comes with a label y.
• We specify an error measure D that compares y to a prediction.
Deriving backpropagation
• We have to evaluate the derivative ∇w J(w).
• Since J is additive over training points, J(w) = \sum_n J_n(w), it suffices to derive ∇_w J_n(w).
\frac{d(f \circ g)}{dx} = \frac{df}{dg} \frac{dg}{dx}

If the derivatives of f and g are f′ and g′, that means: \frac{d(f \circ g)}{dx}(x) = f'(g(x))\, g'(x)
Neural network
Let w^(k) denote the weights in layer k. The function represented by the network is

f_w(x) = f^{(K)}_{w^{(K)}} \circ \cdots \circ f^{(1)}_{w^{(1)}}(x)

To solve the optimization problem, we have to compute derivatives of the form

\frac{d}{dw} D(f_w(x_n), y_n) = \frac{dD(\,\cdot\,, y_n)}{df_w} \cdot \frac{df_w}{dw}
Layer-wise derivative
Since f^(k) and f^(k−1) are vector-valued, we get a Jacobian matrix

\frac{df^{(k+1)}}{df^{(k)}} = \begin{pmatrix} \frac{\partial f_1^{(k+1)}}{\partial f_1^{(k)}} & \cdots & \frac{\partial f_1^{(k+1)}}{\partial f_{d_k}^{(k)}} \\ \vdots & & \vdots \\ \frac{\partial f_{d_{k+1}}^{(k+1)}}{\partial f_1^{(k)}} & \cdots & \frac{\partial f_{d_{k+1}}^{(k+1)}}{\partial f_{d_k}^{(k)}} \end{pmatrix} =: \Delta^{(k)}(z, w^{(k+1)})
regarded as a function of w(K) (so z(K) and y are fixed). Denote the updated weights w̃(K) .
• Move backwards one layer at a time. At layer k, we have already computed updates
w̃(K) , . . . , w̃(k+1) . Update w(k) by a gradient step, where the derivative is computed as
\Delta^{(K-1)}(z^{(K-1)}, \tilde{w}^{(K)}) \cdots \Delta^{(k)}(z^{(k)}, \tilde{w}^{(k+1)})\, \frac{df^{(k)}}{dw^{(k)}}(z, w^{(k)})
On reaching level 1, go back to step 1 and recompute the z(k) using the updated weights.
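A numeric sketch of this backward pass for a tiny two-layer sigmoid network with squared-error cost D(f, y) = ½‖f − y‖²; the quantities delta2 and delta1 below play the role of the Jacobian products propagated backwards (sizes, learning rate and data are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Hypothetical tiny network: 3 inputs -> 4 hidden -> 2 outputs
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x, y = rng.normal(size=3), rng.uniform(size=2)

for _ in range(100):
    # forward pass, storing the layer outputs z^(k)
    z1 = sigmoid(W1 @ x)
    z2 = sigmoid(W2 @ z1)
    # backward pass: chain rule through D, f^(2), f^(1)
    delta2 = (z2 - y) * z2 * (1 - z2)         # dD/df^(2) times local derivative
    delta1 = (W2.T @ delta2) * z1 * (1 - z1)  # layer-2 Jacobian applied backwards
    # gradient steps, layer by layer
    W2 -= 0.5 * np.outer(delta2, z1)
    W1 -= 0.5 * np.outer(delta1, x)

print(0.5 * np.sum((sigmoid(W2 @ sigmoid(W1 @ x)) - y) ** 2))  # small error
```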
Binary numbers
Our usual (decimal) representation of integers represents a number x ∈ N₀ by digits d1, d2, . . . ∈ {0, . . . , 9} as

[x]_{10} = d_i \cdot 10^i + d_{i-1} \cdot 10^{i-1} + \ldots + d_1 \cdot 10^0

where i is the largest integer with 10^i ≤ x.
The binary representation of x is similarly

[x]_2 = b_j \cdot 2^j + b_{j-1} \cdot 2^{j-1} + \ldots + b_1 \cdot 2^0

where b1, b2, . . . ∈ {0, 1} and j is the largest integer with 2^j ≤ x.
Non-integer numbers
Binary numbers can have fractional digits just as decimal numbers. The post-radix digits correspond to inverse powers of two:

[10.125]_{10} = [1010.001]_2 = 1 \cdot 2^3 + 0 \cdot 2^2 + 1 \cdot 2^1 + 0 \cdot 2^0 + 0 \cdot \tfrac{1}{2} + 0 \cdot \tfrac{1}{2^2} + 1 \cdot \tfrac{1}{2^3}

Numbers that look "simple" may become more complicated in binary representation:

[0.1]_{10} = [0.000110011001100110011\ldots]_2

(the binary expansion of 0.1 repeats forever and never terminates).
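This non-terminating expansion is exactly why "simple" decimal numbers misbehave in floating point, which is easy to check in Python:

```python
from decimal import Decimal

print(0.1 + 0.2 == 0.3)   # False: neither side is exactly representable
print(Decimal(0.1))       # shows the exact binary value stored for 0.1
print((0.125).hex())      # 0x1.0000000000000p-3: 0.125 = 2^-3 is exact
```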
Basic FP representation
x = (−1)^s · 2^e · m
where:
• s, the sign, is a single bit, s ∈ {0, 1}.
Rounding
• Choose a set of FP numbers M, (say double precision numbers).
• Start with a number x ∈ R. Unless x happens to be in the finite set M, we cannot represent
it. It gets rounded to the closest FP number,
\hat{x} := \arg\min_{\hat{y} \in M} |x - \hat{y}|
Rounding errors
We distinguish two types of errors:
• The absolute rounding error ẑ − z.
• The relative rounding error (ẑ − z)/z.
FP arithmetic is based on the premise that the relative rounding error is more important.
Distribution of FP numbers
The distance between two neighboring FP numbers depends on the value of the exponent. That
means:
An interval of length 1 (or any fixed length) contains more FP numbers if it is close to 0
than if it is far away from 0.
In other words, the relative rounding error does not depend on the size of a number, but the
absolute rounding error is larger for large numbers.
Normalized FP number
FP numbers are not unique: Note that
2 = (−1)^0 · 2^1 · 1 = (−1)^0 · 2^0 · 2 = . . .

A machine number is normalized if the mantissa represents a number with exactly one digit before the radix, and this digit is 1:

(−1)^s · 2^e · [1.m1 . . . mk]_2
IEEE 754 only permits normalized numbers. That means every representable number has a unique representation.
Error guarantees
IEEE 754 guarantees that

x \,\hat{\oplus}\, y = (x \oplus y)(1 + r) \quad\text{satisfies}\quad |r| < \text{eps}

where
• r is the relative rounding error,
• ⊕ is an arithmetic operation and ⊕̂ its floating-point implementation.
Machine precision
The machine precision, denoted eps, is the smallest machine number for which 1 +̂ eps is distinguishable from 1:

\text{eps} = \min\{ e \in M \mid 1 \,\hat{+}\, e > \hat{1} \}

This is not the same as the smallest representable number. In double precision:
• Machine precision is eps = 2^−52.
• The smallest (normalized) FP number is 2^−1022.
That is so because the intervals between FP numbers around 1 are much larger than around 2^−1022: If we evaluate 1 + 2^−1022, the result is 1.
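These constants can be checked directly in Python:

```python
import sys

print(sys.float_info.epsilon == 2.0 ** -52)   # True: machine precision
print(sys.float_info.min == 2.0 ** -1022)     # True: smallest normalized number
print(1.0 + 2.0 ** -1022 == 1.0)              # True: swallowed by rounding
print(1.0 + 2.0 ** -52 == 1.0)                # False: eps is distinguishable
```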
Reproducibility
• Scientific experiments are required to be reproducible.
• Using actual random numbers in a computer experiment or simulation would make it
non-reproducible.
• In this sense, pseudorandomness is an advantage rather than disadvantage.
If you restart the generation process of a pseudorandom sequence with the same initial
conditions, you reproduce exactly the same sequence.
Not examinable.
Random Number Generation
Recursive generation
Most PRN generators use a form of recursion: A sequence x1 , x2 , . . . of PRNs is generated as
xn+1 = f (xn )
Modulo operation
The mod operation outputs the “remainder after division”:
m mod n = m − kn for the largest k ∈ N₀ such that kn ≤ m
For example: 13 mod 5 = 3
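A classic instance of such a recursion is the linear congruential generator x_{n+1} = (a · x_n + b) mod m; a minimal sketch using the well-known parameter choice a = 1664525, b = 1013904223, m = 2³²:

```python
def lcg(seed, n, a=1664525, b=1013904223, m=2 ** 32):
    """Linear congruential generator: x_{k+1} = (a * x_k + b) mod m."""
    x, out = seed, []
    for _ in range(n):
        x = (a * x + b) % m
        out.append(x / m)      # scale to [0, 1)
    return out

print(lcg(seed=0, n=5))        # same seed -> same sequence, reproducibly
```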
Period length
• For a recursive generator xn+1 = f (xn ), each xn+1 depends only on xn .
• Once xn takes the same value as x1 , the sequence (x1 , . . . , xn−1 ) repeats:
(x1 , . . . , xn−1 ) = (xn , . . . , x2n−1 ) = . . .
• If so, n − 1 is called the period length.
• Note a period need not start at 1: Once a value reoccurs for the first time, the generator has
become periodic.
• Since there are finitely many possible numbers, that must happen sooner or later.
Mersenne Twisters
• Almost all software tools we use (Python, Matlab etc) use a generator called a Mersenne
twister.
• This algorithm also uses a recursion, but the recursion is matrix-valued. That makes the
algorithm a little more complicated.
• Mersenne twisters have very long period lengths (e.g. 2^19937 − 1 for the standard implementation for 32-bit machines).
In practice
• Every time a generator is started or reset (e.g. when you start Matlab), it starts with a
default seed.
• For example, Matlab’s default generator is a Mersenne twister with seed 0.
• The user can usually pass a different seed to the generator.
• When you run experiments or simulations, always re-run them several times with different
seeds.
• To ensure reproducibility, record the seeds you used along with the code.