
Markov Chain Monte Carlo

Alberto Suárez
Escuela Politécnica Superior
Universidad Autónoma de Madrid

Expected Values

- Problem: Calculate expected values of random quantities in high dimensions:

  $E[g(X)] = \int dx\, \mathrm{pdf}(x)\, g(x)$

- Monte Carlo integration:

  $E[g(X)] \approx \frac{1}{M} \sum_{m=1}^{M} g(X_m)$, where the $X_m$ are i.i.d. random variables with distribution $\mathrm{pdf}(x)$.

- Markov Chain Monte Carlo: construct a Markov chain whose stationary distribution is $\mathrm{pdf}(x)$,

  $X_0, X_1, X_2, \ldots, X_t, \ldots, X_{T-1}, X_T; \qquad P^{(t)}(X_t \mid X_0) \xrightarrow{t \to \infty} \mathrm{pdf}(X_t)$

  and calculate the expected value as an ergodic average over the post-transient portion of the chain:

  $E[g(X)] \approx \frac{1}{T - T_{\mathrm{transient}}} \sum_{t = T_{\mathrm{transient}}+1}^{T} g(X_t)$
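A minimal sketch of plain Monte Carlo integration, under illustrative assumptions (the function names and the target are not from the slides): estimate $E[g(X)]$ for $g(x) = x^2$ with $X \sim N(0,1)$, whose exact value is 1.

    import numpy as np

    rng = np.random.default_rng(0)

    def mc_expectation(g, sampler, M):
        """Average g over M i.i.d. draws from the target distribution."""
        samples = sampler(M)
        return np.mean(g(samples))

    # g(x) = x^2, X ~ N(0, 1); the estimator error decreases as O(1/sqrt(M)).
    estimate = mc_expectation(lambda x: x**2, lambda M: rng.standard_normal(M), M=100_000)
    print(estimate)  # close to the exact value 1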

Discrete Markov Chains

- A Markov chain in which the space of states is countable (i.e., states can be labeled, not necessarily finite).
- Space of states: $S \equiv \{s_i\}_{i=1}^{|S|}$
- Markov chain: $X(t) = \{X_i(t)\}$, with

  $X_i(t) = \begin{cases} 1 & \text{if the system is in state } s_i \text{ at time } t \\ 0 & \text{otherwise} \end{cases}$

- Markov property:

  $P(X(t) = s_j \mid X(t-1) = s_{i_{t-1}}, X(t-2) = s_{i_{t-2}}, \ldots, X(1) = s_{i_1}, X(0) = s_{i_0}) = P(X(t) = s_j \mid X(t-1) = s_{i_{t-1}})$

- Transition matrix: $\mathbf{W}(t)$, with $W_{ji}(t) \equiv P(X(t) = s_j \mid X(t-1) = s_i)$
- State distribution: $\mathbf{P}(t)$, with $P_i(t) = P(X(t) = s_i)$ and $\sum_i P_i(t) = 1$
- Chapman-Kolmogorov equation:

  $\mathbf{P}(t) = \mathbf{W}(t)\, \mathbf{P}(t-1): \qquad P_j(t) = \sum_{i=1}^{|S|} W_{ji}(t)\, P_i(t-1)$

Types of Markov Chains

- A Markov chain can be:
  - Homogeneous: the transition probability matrix is independent of time, $\mathbf{W}(t) = \mathbf{W}$. $\mathbf{W}$ is a stochastic matrix: $W_{ji} \ge 0$ and $\sum_j W_{ji} = 1$.
  - Irreducible: all states are connected in a finite number of steps:

    $\forall\, s_i, s_j \in S \;\; \exists\, n \ge 1 \;/\; [\mathbf{W}^n]_{ji} > 0$

  - Aperiodic: the system is never forced into a cycle of fixed length between two states:

    $\forall\, s_i \in S: \quad D_i \equiv \{n \ge 1 \;/\; [\mathbf{W}^n]_{ii} > 0\}, \qquad \gcd(D_i) = 1$

- Lemma: An irreducible homogeneous Markov chain is aperiodic if $\exists\, j : W_{jj} > 0$.

  Proof: By irreducibility, $\exists\, m \ge 1$ with $[\mathbf{W}^m]_{ij} > 0$ and $\exists\, n \ge 1$ with $[\mathbf{W}^n]_{ji} > 0$. Then

  $[\mathbf{W}^{m+n}]_{ii} = \sum_k [\mathbf{W}^m]_{ik}\, [\mathbf{W}^n]_{ki} \ge [\mathbf{W}^m]_{ij}\, [\mathbf{W}^n]_{ji} > 0$

  $[\mathbf{W}^{m+n+1}]_{ii} = \sum_{k,l} [\mathbf{W}^m]_{ik}\, W_{kl}\, [\mathbf{W}^n]_{li} \ge [\mathbf{W}^m]_{ij}\, W_{jj}\, [\mathbf{W}^n]_{ji} > 0$

  so $D_i$ contains the consecutive integers $m+n$ and $m+n+1$, hence $\gcd(D_i) = 1$.
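A minimal sketch of the Chapman-Kolmogorov evolution above, on an illustrative 3-state chain (the matrix is an assumption, not from the slides); columns of W index the "from" state, matching $W_{ji} = P(j \mid i)$.

    import numpy as np

    # Illustrative 3-state transition matrix; each column sums to 1.
    W = np.array([[0.5, 0.2, 0.1],
                  [0.3, 0.6, 0.4],
                  [0.2, 0.2, 0.5]])
    assert np.allclose(W.sum(axis=0), 1.0)  # stochastic matrix

    P = np.array([1.0, 0.0, 0.0])  # start deterministically in state s_1
    for t in range(50):
        P = W @ P                  # P(t) = W P(t-1)
    print(P)                       # approaches the stationary distribution pi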

Stationary distribution of a Markov Chain

- Stationary distribution for a homogeneous Markov chain:

  $\mathbf{W} \boldsymbol{\pi} = \boldsymbol{\pi}$

  The stationary distribution is an eigenvector of $\mathbf{W}$ with eigenvalue 1. An arbitrary initial distribution converges to the stationary distribution provided that all other eigenvalues of $\mathbf{W}$ are smaller than 1 in absolute value:

  $\mathbf{W} \boldsymbol{\pi} = \boldsymbol{\pi}; \qquad \mathbf{W} \mathbf{v}^{(n)} = \lambda_n \mathbf{v}^{(n)}, \;\; n = 2, 3, \ldots, |S|; \qquad 1 > |\lambda_2| \ge |\lambda_3| \ge \cdots \ge |\lambda_{|S|}|$

  $\mathbf{P}(0) = \boldsymbol{\pi} + \sum_{n=2}^{|S|} \alpha_n(0)\, \mathbf{v}^{(n)}; \qquad \mathbf{P}(t) = \mathbf{W}^t \mathbf{P}(0) = \boldsymbol{\pi} + \sum_{n=2}^{|S|} \alpha_n(0)\, \lambda_n^t\, \mathbf{v}^{(n)}; \qquad \lim_{t \to \infty} \mathbf{P}(t) = \boldsymbol{\pi}$

- Conditions for convergence to the stationary distribution:
  - The chain is irreducible: the eigenvalue $\lambda = 1$ is not a multiple eigenvalue of $\mathbf{W}$.
  - The chain is aperiodic: the only eigenvalue with $|\lambda| = 1$ is $\lambda = 1$.

Detailed balance

- Theorem (Feller, 1950): Let $\mathbf{W}$ be the transition matrix of a finite Markov chain that is irreducible and aperiodic. The equation $\mathbf{W} \boldsymbol{\pi} = \boldsymbol{\pi}$ (global balance) has a unique solution, which is the stationary distribution.
- A sufficient condition for convergence to the stationary distribution of a homogeneous Markov chain is that its transition matrix $\mathbf{W}$ satisfies local detailed balance (reversibility):

  $W_{ji}\, \pi_i = W_{ij}\, \pi_j$

- Note that $\boldsymbol{\pi}$ is then the stationary distribution of the chain: summing detailed balance over $i$,

  $W_{ji}\, \pi_i = W_{ij}\, \pi_j \;\Rightarrow\; \sum_{i=1}^{|S|} W_{ji}\, \pi_i = \pi_j \sum_{i=1}^{|S|} W_{ij} = \pi_j$
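A minimal sketch, reusing the illustrative matrix from the previous sketch: recover $\boldsymbol{\pi}$ as the eigenvector of W with eigenvalue 1, and check detailed balance entrywise (this particular W is an assumption and happens not to be reversible).

    import numpy as np

    W = np.array([[0.5, 0.2, 0.1],
                  [0.3, 0.6, 0.4],
                  [0.2, 0.2, 0.5]])

    eigvals, eigvecs = np.linalg.eig(W)
    k = np.argmin(np.abs(eigvals - 1.0))      # eigenvalue closest to 1
    pi = np.real(eigvecs[:, k])
    pi /= pi.sum()                            # normalize to a probability vector
    print(pi, np.allclose(W @ pi, pi))        # W pi = pi

    # Detailed balance W_ji pi_i = W_ij pi_j need not hold for an arbitrary W:
    balance = W * pi[np.newaxis, :]           # balance[j, i] = W_ji pi_i
    print(np.allclose(balance, balance.T))    # False: this chain is not reversible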

Continuous Markov process

- The sequence of random variables $X_0, X_1, X_2, \ldots, X_t, \ldots$ is a Markov process if the Markov property holds:

  $P(X_t \mid X_{t-1}, \ldots, X_1, X_0; t) = P(X_t \mid X_{t-1}; t)$

- Chapman-Kolmogorov equation:

  $P(X_{t+1}) = \int dX_t\, P(X_{t+1} \mid X_t; t)\, P(X_t)$

- A Markov process is homogeneous if the transition kernel does not depend explicitly on time:

  $P(X_{t+1} \mid X_t; t) = P(X_{t+1} \mid X_t)$

- Under certain regularity conditions a homogeneous Markov process converges to a unique stationary distribution:

  $P^{(t)}(X_t \mid X_0) \xrightarrow{t \to \infty} P(X_t)$

Metropolis-Hastings algorithm

- Problem: How to construct a Markov chain whose stationary distribution is $\pi(\mathbf{x})$?
- Metropolis et al. (1953) + Hastings (1970):
  - Assume the system is in state $X_t$ at time $t$.
  - Choose a proposal distribution $q(\mathbf{y} \mid X_t)$. This distribution is arbitrary (e.g., a multivariate normal distribution with mean $X_t$ and a fixed covariance matrix).
  - Sample $Y$ from the proposal distribution $q(\mathbf{y} \mid X_t)$.
  - Accept $Y$ with probability

    $\alpha(X_t, Y) = \min\left\{1,\; \frac{\pi(Y)\, q(X_t \mid Y)}{\pi(X_t)\, q(Y \mid X_t)}\right\}$

  - If $Y$ is accepted, $X_{t+1} = Y$; otherwise $X_{t+1} = X_t$.

Metropolis-Hastings algorithm (pseudocode)

- Pseudocode:

      Initialize X0. Set t = 0;
      For t = 0:T-1
          Sample Y from q(y | Xt)
          Sample U from a uniform distribution U[0,1]
          If (U < alpha(Xt, Y)) then
              Set Xt+1 = Y
          otherwise
              Set Xt+1 = Xt
      End

Metropolis-Hastings algorithm (convergence)

- Metropolis-Hastings transition kernel:

  $P(X_{t+1} \mid X_t) = q(X_{t+1} \mid X_t)\, \alpha(X_t, X_{t+1}) + \delta(X_{t+1} - X_t) \left[ 1 - \int d\mathbf{y}\; q(\mathbf{y} \mid X_t)\, \alpha(X_t, \mathbf{y}) \right]$

- Chapman-Kolmogorov equation:

  $P(X_{t+1}) = \int dX_t\, P(X_{t+1} \mid X_t)\, P(X_t)$

- From the definition of $\alpha$:

  $\pi(X_t)\, q(X_{t+1} \mid X_t)\, \alpha(X_t, X_{t+1}) = \pi(X_{t+1})\, q(X_t \mid X_{t+1})\, \alpha(X_{t+1}, X_t)$

- Local detailed balance:

  $\pi(X_t)\, P(X_{t+1} \mid X_t) = \pi(X_{t+1})\, P(X_t \mid X_{t+1})$

- Integrating the detailed balance equation:

  $\int dX_t\, \pi(X_t)\, P(X_{t+1} \mid X_t) = \pi(X_{t+1}) \quad\Rightarrow\quad \pi(\mathbf{x}) \text{ is invariant}$

- If the Markov chain converges (which it does, under certain regularity conditions), $\pi(\mathbf{x})$ is the stationary distribution.
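A minimal sketch of the pseudocode above in Python, under illustrative assumptions (the target density, step size, and function names are not from the slides): random-walk Metropolis-Hastings for an unnormalized 1-D two-component Gaussian mixture.

    import numpy as np

    rng = np.random.default_rng(0)

    def log_target(x):
        """Unnormalized log-density: equal-weight Gaussian mixture at +/-2."""
        return np.logaddexp(-0.5 * (x - 2.0)**2, -0.5 * (x + 2.0)**2)

    def metropolis_hastings(log_pi, x0, T, step=1.0):
        x, chain = x0, np.empty(T)
        for t in range(T):
            y = x + step * rng.standard_normal()   # symmetric proposal q(y|x)
            # With a symmetric q, the Hastings ratio reduces to pi(y)/pi(x).
            if np.log(rng.uniform()) < log_pi(y) - log_pi(x):
                x = y                               # accept
            chain[t] = x                            # otherwise keep x
        return chain

    chain = metropolis_hastings(log_target, x0=0.0, T=50_000)
    print(chain[5_000:].mean())   # ergodic average after discarding burn-in; approx 0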

Metropolis-Hastings in practice (proposal distributions)

- Metropolis algorithm: symmetric proposal,

  $q(\mathbf{y} \mid \mathbf{x}) = q(\mathbf{x} \mid \mathbf{y}) \;\Rightarrow\; \alpha(\mathbf{x}, \mathbf{y}) = \min\left\{1,\; \frac{\pi(\mathbf{y})}{\pi(\mathbf{x})}\right\}$

- Random-walk Metropolis: $q(\mathbf{y} \mid \mathbf{x}) = q(\mathbf{x} \mid \mathbf{y}) = q(|\mathbf{x} - \mathbf{y}|)$.
  If the steps $|Y - X_t|$ generated by $q(|Y - X_t|)$ are either too small or too large, the chain may have poor mixing properties.

- Independence sampler:

  $q(\mathbf{y} \mid \mathbf{x}) = q(\mathbf{y}) \;\Rightarrow\; \alpha(\mathbf{x}, \mathbf{y}) = \min\left\{1,\; \frac{w(\mathbf{y})}{w(\mathbf{x})}\right\}, \qquad w(\mathbf{x}) = \frac{\pi(\mathbf{x})}{q(\mathbf{x})}$

  - Works best if $q(\mathbf{x})$ is a good approximation to $\pi(\mathbf{x})$.
  - $q(\mathbf{x})$ should be heavier-tailed than $\pi(\mathbf{x})$; otherwise the sampler may get stuck in the tails of the distribution.
  - If large-sample theory is valid, take $q(\mathbf{x})$ to be a multivariate normal with
    mean: the mode of $\pi(\mathbf{x})$;
    covariance matrix: $\left[ -\dfrac{d^2 \log \pi(\mathbf{x})}{d\mathbf{x}\, d\mathbf{x}^T} \right]^{-1}$ evaluated at the mode of $\pi(\mathbf{x})$.
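A minimal sketch of the independence sampler above, with an illustrative target and proposal (both are assumptions): target $\pi = N(0,1)$, proposal $q$ a Student-t with 3 degrees of freedom, which is heavier-tailed than $\pi$ as the slide recommends. Normalization constants cancel in the ratio $w(\mathbf{y})/w(\mathbf{x})$.

    import numpy as np

    rng = np.random.default_rng(0)

    def log_w(x):
        """log w(x) = log pi(x) - log q(x), up to an additive constant."""
        log_pi = -0.5 * x**2                     # N(0, 1) kernel
        log_q = -2.0 * np.log1p(x**2 / 3.0)      # Student-t(3) kernel
        return log_pi - log_q

    x, T = 0.0, 20_000
    chain = np.empty(T)
    for t in range(T):
        y = rng.standard_t(df=3)                 # proposal does not depend on x
        if np.log(rng.uniform()) < log_w(y) - log_w(x):
            x = y                                # accept with prob min{1, w(y)/w(x)}
        chain[t] = x
    print(chain.mean(), chain.var())             # approx 0 and 1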

Metropolis-Hastings algorithm (variants)

- Multiple chains: one can use multiple shorter chains instead of a single long one.
  - Single long chain: asymptotic convergence can be proved.
  - Multiple shorter chains: might reveal important differences if stationarity has not been reached. Convergence? Can be run in parallel.

- Single-component Metropolis-Hastings: only one component of $X$ is updated at a given iteration (see the sketch after this list). The proposal distribution can be different for different components.
  - Variations:
    - Update blocks of components.
    - Random updating order: if one component is modified, then update with larger probability the components that are highly correlated with it.
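A minimal sketch of single-component Metropolis-Hastings, under illustrative assumptions (the correlated 2-D Gaussian target, per-component step sizes, and names are not from the slides): sweep over the coordinates, updating one at a time with its own random-walk proposal.

    import numpy as np

    rng = np.random.default_rng(0)

    def log_target(x):
        """Unnormalized log-density of a 2-D Gaussian with correlation 0.8."""
        return -0.5 * (x[0]**2 - 1.6 * x[0] * x[1] + x[1]**2) / (1 - 0.8**2)

    def single_component_mh(log_pi, x0, T, steps=(1.0, 1.0)):
        x, chain = np.array(x0, dtype=float), np.empty((T, len(x0)))
        for t in range(T):
            for i in range(len(x)):                        # sweep over components
                y = x.copy()
                y[i] += steps[i] * rng.standard_normal()   # per-component proposal
                if np.log(rng.uniform()) < log_pi(y) - log_pi(x):
                    x = y
            chain[t] = x
        return chain

    chain = single_component_mh(log_target, x0=[0.0, 0.0], T=20_000)
    print(np.corrcoef(chain[2_000:].T))   # off-diagonal entries approx 0.8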

Metropolis-Hastings algorithm (implementation)

- Choice of starting values:
  - If the chain is irreducible, the choice of X0 does not affect the stationary distribution.
  - In a rapidly mixing chain, burn-in times are short and the choice of X0 is not very important.
  - In a slowly mixing chain, it is important to choose a value of X0 that avoids a lengthy burn-in.

- Length of burn-in:
  - Depends on the rate of convergence of $P^{(t)}(X_t \mid X_0)$ to $\pi(X_t)$ and on the desired accuracy.
  - Usually established by measures of convergence that monitor expected values of $f(X_t)$.
  - Comparison between the properties of multiple chains can be useful.

- Stopping time:
  - Estimate the variance of the expected value that is being calculated.
  - Variance estimates are easiest if multiple chains are run.

Gibbs Sampling (Geman & Geman, 1984)

- Assume $X^T = (X_I^T, X_{II}^T)$. The proposals are the full conditional distributions:

  $q_I(\mathbf{y}_I \mid \mathbf{x}_I, \mathbf{x}_{II}) = \pi(\mathbf{y}_I \mid \mathbf{x}_{II})$
  $q_{II}(\mathbf{y}_{II} \mid \mathbf{x}_I, \mathbf{x}_{II}) = \pi(\mathbf{y}_{II} \mid \mathbf{x}_I)$

  With these proposals the Metropolis-Hastings acceptance probability is identically 1, so every proposal is accepted. A runnable sketch follows the algorithm.

- Algorithm:

      Set initial value XI(t=0). Set t = 0.
      Generate XII(t=0) from pi(XII | XI(t=0))
      For t = 0:T-1
          Generate XI(t+1) from pi(XI | XII(t))
          Generate XII(t+1) from pi(XII | XI(t+1))
      End
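A minimal sketch of the Gibbs sampler above for an illustrative target (an assumption, not from the slides): a bivariate normal with correlation $\rho$, for which both full conditionals are Gaussian, $X_I \mid X_{II} = x \sim N(\rho x,\, 1 - \rho^2)$ and symmetrically for $X_{II}$.

    import numpy as np

    rng = np.random.default_rng(0)
    rho, T = 0.8, 20_000
    s = np.sqrt(1 - rho**2)

    x_I = 0.0                                        # initial value X_I(t=0)
    x_II = rho * x_I + s * rng.standard_normal()     # X_II(0) ~ pi(X_II | X_I(0))
    chain = np.empty((T, 2))
    for t in range(T):
        x_I = rho * x_II + s * rng.standard_normal()    # X_I(t+1) ~ pi(X_I | X_II(t))
        x_II = rho * x_I + s * rng.standard_normal()    # X_II(t+1) ~ pi(X_II | X_I(t+1))
        chain[t] = (x_I, x_II)
    print(np.corrcoef(chain[2_000:].T))   # off-diagonal entries approx rho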

Simulated annealing

- Global optimization in many dimensions.
- Physical annealing: minimize the free energy.
  - Heat up a solid until it melts.
  - Cool down slowly until a crystal is formed.

- Simulated annealing for combinatorial optimization:
  - Find $s^*$ such that the cost function $E(s^*)$ is minimal.
  - Construct an irreducible, aperiodic Markov chain whose stationary distribution is

    $\pi(s_i) = \frac{1}{Z(T)}\, e^{-\beta E_i}; \qquad \beta = \frac{1}{T}; \qquad E(s_i) \equiv E_i; \qquad Z(T) = \sum_i e^{-\beta E_i} \;\;\text{(partition fn.)}$

  - Establish an annealing schedule for $T$:

    $T(t),\;\; t = 0, 1, 2, \ldots \;/\; \lim_{t \to \infty} T(t) = 0 \;\;\text{(sufficiently slowly)}$

  - In the zero-temperature limit the stationary distribution concentrates uniformly on the set $S^*$ of global minima:

    $\lim_{T \to 0} \pi(s_i) = \frac{1}{|S^*|} \sum_l \delta(s_i, s_{l^*})$

Simulated annealing (pseudocode)

- Procedure simulatedAnnealing

      Initialize(i0, T0, L0);
      k := 0; i := i0;
      Repeat
          % epoch at fixed temperature Tk with Lk steps
          For L := 1 To Lk Do {
              Generate sj from the neighborhood of si;
              Generate u from U[0,1];
              If exp((Ei - Ej) / Tk) > u Then i := j;
          }
          k := k + 1;
          calculateLength(Lk); calculateTemperature(Tk);
      Until stopCriterion;
  End;
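A minimal runnable sketch of the pseudocode above, with an illustrative problem and schedule (both assumptions): minimize $E(s) = \sum_i |s_i|$ over states $s \in \{-K, \ldots, K\}^d$, proposing a random neighbor that changes one coordinate by $\pm 1$, with geometric cooling between epochs.

    import numpy as np

    rng = np.random.default_rng(0)

    def energy(s):
        return np.abs(s).sum()

    def simulated_annealing(d=10, K=20, T0=10.0, alpha=0.9, Lk=200, epochs=60):
        s = rng.integers(-K, K + 1, size=d)          # initial state i0
        T = T0
        for k in range(epochs):                      # epoch at fixed temperature T
            for _ in range(Lk):
                j = s.copy()
                idx = rng.integers(d)                # neighbor: one coordinate +/- 1
                j[idx] = np.clip(j[idx] + rng.choice([-1, 1]), -K, K)
                if np.exp((energy(s) - energy(j)) / T) > rng.uniform():
                    s = j                            # accept (always if E_j <= E_i)
            T *= alpha                               # geometric cooling schedule
        return s

    print(simulated_annealing())   # close to the global minimum s* = 0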

Simulated annealing (dynamics)

- Transition matrix $\mathbf{W}(T) = \mathbf{G} \cdot \mathbf{A}(T)$
  - $\mathbf{G}$: generation (proposal) probability
  - $\mathbf{A}(T)$: acceptance probability

  $W_{ji}(T) = \begin{cases} G_{ji}\, A_{ji}(T) & j \neq i \\ 1 - \sum_{l \neq i} G_{li}\, A_{li}(T) & j = i \end{cases}$

  $G_{ji} = \frac{1}{|\Theta_i|}\, \chi_{ji}; \qquad |\Theta_i| = \sum_j \chi_{ji}; \qquad \chi_{ji} = \begin{cases} 1 & \text{if } s_j \text{ is in the neighborhood of } s_i \\ 0 & \text{otherwise} \end{cases}$

  $A_{ji}(T) = \exp\left( -\frac{(E_j - E_i)^+}{T} \right), \qquad (x)^+ \equiv \max(x, 0)$

- Local detailed balance: assuming $G_{ji} = G_{ij}$ and defining $q_i(T) = \exp\left(-\frac{E_i}{T}\right)$,

  $A_{ji}(T)\, q_i(T) = A_{ij}(T)\, q_j(T) \;\Rightarrow\; W_{ji}(T)\, q_i(T) = W_{ij}(T)\, q_j(T)$

  so the Boltzmann distribution $\pi(s_i) \propto q_i(T)$ is stationary.
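A minimal sketch on an illustrative 4-state ring (energies and topology are assumptions): build $\mathbf{W}(T) = \mathbf{G} \cdot \mathbf{A}(T)$ entrywise as defined above and check numerically that the Boltzmann distribution satisfies $\mathbf{W}(T)\boldsymbol{\pi} = \boldsymbol{\pi}$.

    import numpy as np

    E = np.array([0.0, 1.0, 0.5, 2.0])            # energies E_i
    chi = np.array([[0, 1, 0, 1],                 # chi[j, i] = 1 if s_j neighbors s_i
                    [1, 0, 1, 0],                 # (ring topology, symmetric)
                    [0, 1, 0, 1],
                    [1, 0, 1, 0]])

    T = 0.7
    G = chi / chi.sum(axis=0, keepdims=True)      # G_ji = chi_ji / |Theta_i|
    A = np.exp(-np.maximum(E[:, None] - E[None, :], 0.0) / T)  # A_ji = e^{-(E_j-E_i)^+/T}
    W = G * A                                     # off-diagonal entries
    np.fill_diagonal(W, 0.0)
    np.fill_diagonal(W, 1.0 - W.sum(axis=0))      # W_ii = 1 - sum_{l != i} G_li A_li

    pi = np.exp(-E / T)
    pi /= pi.sum()                                # Boltzmann distribution
    print(np.allclose(W @ pi, pi))                # True: pi is stationary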

Simulated annealing (convergence I)

- Fixed-temperature dynamics (homogeneous Markov process).
- Theorem: The Markov process with transition matrix $\mathbf{W}(T)$ converges to a stationary distribution of the Boltzmann type if

  $\forall\, s_i, s_j \in S \;\; \exists\, p \ge 1 \;/\; \exists\, s_{l_0}, s_{l_1}, \ldots, s_{l_p} \in S;\;\; l_0 = i,\; l_p = j;\;\; G_{l_{k+1} l_k} > 0,\;\; k = 0, 1, \ldots, (p-1)$

- The process is irreducible: consider $s_i, s_j \in S$ and a chain $s_{l_0}, \ldots, s_{l_p}$ as above; then

  $[\mathbf{W}^p(T)]_{ji} = \sum_{k_1, k_2, \ldots, k_{p-1}} W_{j k_{p-1}}(T)\, W_{k_{p-1} k_{p-2}}(T) \cdots W_{k_2 k_1}(T)\, W_{k_1 i}(T) \ge W_{j l_{p-1}}(T)\, W_{l_{p-1} l_{p-2}}(T) \cdots W_{l_2 l_1}(T)\, W_{l_1 i}(T) > 0$

- The process is aperiodic: take $s_i, s_j \in S$ with $G_{ji} = G_{ij} > 0$ and $E_i < E_j$, so that $A_{ji}(T) < 1$ while $A_{li}(T) \le 1$ for $l \neq j$. Then

  $W_{ii}(T) = 1 - \sum_{k \neq i} W_{ki}(T) = 1 - G_{ji}\, A_{ji}(T) - \sum_{k \neq i, j} G_{ki}\, A_{ki}(T) > 1 - G_{ji} - \sum_{k \neq i, j} G_{ki} = 1 - \sum_{k \neq i} G_{ki} = G_{ii} \ge 0$

  so $W_{ii}(T) > 0$ and the chain is aperiodic.

Simulated annealing (convergence II)

- Annealing dynamics (inhomogeneous Markov process). Assume an annealing schedule in which, during epochs of length $L$, the temperature is held constant at $T_k$, with

  $T_k \ge \frac{(L+1)\, \Delta}{\log(k+2)} \qquad \text{[logarithmic cooling schedule (slow!)]}$

  where

  $L \ge \max\left( \text{minimum number of transitions to reach } s^* \text{ from any } s_j \right)$
  $\Delta \ge \max\left( E_j - E_i \mid s_j \text{ in the vicinity of } s_i \right)$

  Then the piecewise-homogeneous Markov process converges in distribution to

  $\pi_\infty(s_i) = \frac{1}{|S^*|} \sum_l \delta(s_i, s_{l^*})$

Annealing schedules

- Kirkpatrick, Gelatt & Vecchi (1982, 1983); a sketch follows this list.
  - Choose $T_0$ large enough so that most transitions are accepted:
    Start with a small value of $T_0$. Set $k := 0$ and choose $\gamma > 1$.
    Repeat $T_0 := \gamma\, T_0$ until the acceptance ratio is sufficiently close to 1.
  - Cooling: $T_{k+1} = \alpha\, T_k$, with $\alpha \in [0.8, 0.99]$.
  - $L_k$ is sufficiently large so that for each value of $T_k$ quasi-equilibrium obtains (but $L_k < L_{\max}$, so that long chains are avoided at low $T$).
  - Stop criterion: the value of the cost function does not change in a specified number of epochs.
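A minimal sketch of the Kirkpatrick-style schedule above (the function names and the stand-in acceptance-ratio function are assumptions): warm up $T_0$ until most proposals are accepted, then cool geometrically.

    import numpy as np

    def warm_up_T0(acceptance_ratio, T0=0.01, gamma=1.5, target=0.95):
        """Increase T0 by a factor gamma > 1 until the acceptance ratio is close to 1."""
        while acceptance_ratio(T0) < target:
            T0 *= gamma
        return T0

    def temperatures(T0, alpha=0.9, epochs=50):
        """Geometric cooling: T_{k+1} = alpha * T_k, with alpha in [0.8, 0.99]."""
        return T0 * alpha ** np.arange(epochs)

    # In practice acceptance_ratio(T) would run a short epoch of the annealer at
    # fixed T and return the fraction of accepted moves; a closed form stands in here.
    print(warm_up_T0(lambda T: np.exp(-1.0 / T)))   # T0 grows until exp(-1/T) >= 0.95
    print(temperatures(20.0)[:5])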

Genetic algorithms for optimization

- Evolve a population of individuals, each of which represents a candidate solution to the problem. A minimal sketch follows this list.
  - Individuals have a DNA composed of bit-strings.
  - The fitness of an individual is a function of its genotype or of its phenotype.
- Evolution:
  - Selection: only the best individuals, according to their fitness, survive.
  - Crossover: generate new individuals whose genetic material is some combination of the genetic material of other individuals.
  - Mutation: alter information at random.
- Important:
  - For sufficiently large numbers of individuals, the algorithm improves the average fitness of the population.
  - Convergence is not guaranteed (not even in principle).
  - The hardest part is the coding scheme.
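A minimal sketch of the scheme above, under illustrative assumptions (the "one-max" toy fitness, truncation selection, and all parameters are not from the slides): selection + one-point crossover + bit-flip mutation on bit-string genomes.

    import numpy as np

    rng = np.random.default_rng(0)

    def genetic_algorithm(n_bits=40, pop_size=60, generations=80, p_mut=0.01):
        pop = rng.integers(0, 2, size=(pop_size, n_bits))    # random bit-strings
        for _ in range(generations):
            fitness = pop.sum(axis=1)                        # one-max: count of ones
            # Selection: keep the better half of the population.
            survivors = pop[np.argsort(fitness)[pop_size // 2:]]
            # Crossover: pair random survivors, splice at a random point.
            children = []
            for _ in range(pop_size - len(survivors)):
                a, b = survivors[rng.integers(len(survivors), size=2)]
                cut = rng.integers(1, n_bits)
                children.append(np.concatenate([a[:cut], b[cut:]]))
            pop = np.vstack([survivors, children])
            # Mutation: flip each bit independently with probability p_mut.
            pop ^= (rng.random(pop.shape) < p_mut).astype(pop.dtype)
        return pop[np.argmax(pop.sum(axis=1))]

    best = genetic_algorithm()
    print(best.sum(), "/", best.size)   # average fitness rises toward n_bits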

Bibliography

- Markov Chain Monte Carlo in Practice.
  W. R. Gilks, S. Richardson and D. J. Spiegelhalter.
  Chapman & Hall, London, 1996.
- Simulated Annealing and Boltzmann Machines.
  E. Aarts and J. Korst.
  Wiley-Interscience, New York, 1990.
- Genetic Algorithms in Search, Optimization, and Machine Learning.
  David E. Goldberg.
  Addison-Wesley, Reading, MA, 1989.
