
Data Mining

Practical Machine Learning Tools and Techniques


Slides for Chapter 9, Probabilistic methods

of Data Mining by I. H. Witten, E. Frank,


M. A. Hall, and C. J. Pal
Random variables
In probabilistic approaches to machine learning it is
common to think of data as observations arising from
an underlying probability model for random variables
Given a discrete random variable A, P(A) is a function
that encodes the probabilities for each of the
categories, classes or states that A may be in
For a continuous random variable x, p(x) is a function
that assigns a probability density to all possible values
of x
In contrast, P(A=a) is the single probability of observing
the specific event A=a
Notation
The P(A=a) notation is often simplified to
simply P(a), but one must remember if a was
defined as a random variable or as an
observation
Similarly for the observation that continuous
random variable x has the value x1 it is
common to write this as p(x1)=p(x=x1), a
simplification of the longer but clearer
notation
The product rule
The product rule, sometimes referred to as the
fundamental rule of probability, states that
the joint probability of random variables A
and B can be written

P(A,B) = P(A | B)P(B)


The product rule also applies when A and B
are groups or subsets of events or random
variables.
The sum rule
The sum rule states that given the joint
probability of variables X1, X2, …, XN, the
marginal probability for a given variable can
be obtained by summing (or integrating) over
all the other variables.
For example, to obtain the marginal
probability of X1, sum over all the states of all
the other variables:

P(X_1) = \sum_{x_2} \cdots \sum_{x_N} P(X_1, X_2 = x_2, \ldots, X_N = x_N)
Marginalization
The previous notation can be simplified to
P(x_1) = \sum_{x_2} \cdots \sum_{x_N} P(x_1, x_2, \ldots, x_N)

The sum rule generalizes to continuous random
variables, ex. for x1, x2, …, xN we have
p(x_1) = \int_{x_2} \cdots \int_{x_N} p(x_1, x_2, \ldots, x_N)\, dx_2 \cdots dx_N

These procedures are known as marginalization


They give us marginal distributions of the
variables not included in the sums or integrals
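
A small sketch of these rules in code (an assumed Python/NumPy example, not from the slides): a three-variable discrete joint distribution stored as an array, marginalized with the sum rule and conditioned via the product rule.

import numpy as np

# Hypothetical joint distribution P(X1, X2, X3) over three binary variables,
# stored as a 2x2x2 array that sums to one.
joint = np.array([[[0.05, 0.10], [0.05, 0.20]],
                  [[0.10, 0.10], [0.15, 0.25]]])

# Sum rule: marginalize out X2 and X3 to obtain P(X1).
p_x1 = joint.sum(axis=(1, 2))

# Product rule in reverse: P(X2 | X1) = P(X1, X2) / P(X1).
p_x1_x2 = joint.sum(axis=2)                 # P(X1, X2)
p_x2_given_x1 = p_x1_x2 / p_x1[:, None]     # divide each row by P(X1)

print(p_x1)             # marginal of X1
print(p_x2_given_x1)    # each row sums to one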
Bayes Rule
Can be obtained by swapping A and B in the product rule
and observing P(B|A)P(A)=P(A|B)P(B) and therefore
P(B | A) = \frac{P(A | B)\, P(B)}{P(A)}
Suppose we have models for P(A|B) and P(B)
We observe that A=a, and
we want to compute P(B|A=a)
P(A=a|B) is referred to as the likelihood
P(B) is the prior distribution of B
P(B|A=a) is the posterior distribution; its denominator is obtained from:

P(A = a) = \sum_b P(A = a, B = b) = \sum_b P(A = a | B = b)\, P(B = b)
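
A small numeric sketch of this computation (an assumed example, not from the slides): B is a binary state with a prior, P(A=a | B) is the likelihood of an observation, and the denominator comes from the sum rule above.

# Prior P(B) and likelihood P(A=a | B) for a hypothetical binary B
p_b = {True: 0.3, False: 0.7}
p_a_given_b = {True: 0.8, False: 0.1}

# P(A=a) = sum_b P(A=a | B=b) P(B=b)
p_a = sum(p_a_given_b[b] * p_b[b] for b in p_b)

# Posterior P(B | A=a) = P(A=a | B) P(B) / P(A=a)
posterior = {b: p_a_given_b[b] * p_b[b] / p_a for b in p_b}
print(posterior)   # e.g. P(B=True | A=a) is roughly 0.774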
Maximum Likelihood
Our goal is to estimate a set of parameters of a
probabilistic model, given a set of observations
x1, x2, …, xn.
Maximum likelihood techniques assume that:
1) the examples have no dependence on one
another, the occurrence of one has no effect on
the others, and
2) each can be modeled in exactly the same way.
These assumptions are often summarized by
saying that events are independent and
identically distributed (i.i.d.).
Maximum Likelihood
The i.i.d. assumption corresponds to the use
of a joint probability density function for all
observations consisting of the product of the
same probability model p(xi; θ) applied to each
observation independently.
For n observations, this could be written as
p(x_1, x_2, \ldots, x_n; \theta) = p(x_1; \theta)\, p(x_2; \theta) \cdots p(x_n; \theta)
where each function p(xi; θ) has the same parameters θ.
Maximum Likelihood
The likelihood of our data can be written

L(\theta; x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} p(x_i; \theta)

The data is fixed, but we can adjust θ so as to
maximize the likelihood or log-likelihood

\theta_{ML} = \arg\max_\theta \sum_{i=1}^{n} \log p(x_i; \theta)

We use the log-likelihood as it is more
numerically stable
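
A minimal sketch of maximum likelihood in practice (an assumed example, not from the book): for a univariate Gaussian, the log-likelihood sum above is maximized by the sample mean and the (1/n) sample variance.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # synthetic i.i.d. observations

mu_ml = x.mean()
sigma2_ml = ((x - mu_ml) ** 2).mean()           # note 1/n, not 1/(n-1)

def log_likelihood(x, mu, sigma2):
    # sum of log N(x_i; mu, sigma2)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2))

print(mu_ml, sigma2_ml, log_likelihood(x, mu_ml, sigma2_ml))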
Maximum a posteriori (MAP)
parameter estimation
If we treat our parameters as random variables
we can compute the posterior
p(\theta | x_1, x_2, \ldots, x_n) = \frac{p(x_1, x_2, \ldots, x_n | \theta)\, p(\theta)}{p(x_1, x_2, \ldots, x_n)}
We have used |, or the given notation, in place
of ; to emphasize that θ is random, but
conditioned on a point estimate for the
posterior we have a conditionally i.i.d. model
MAP parameter estimation seeks

\theta_{MAP} = \arg\max_\theta \left[ \sum_{i=1}^{n} \log p(x_i; \theta) + \log p(\theta) \right]
The chain rule of probability
Results from applying the product rule
recursively between a single variable and the
rest of the variables
The chain rule states that the joint probability
of n attributes A_1, …, A_n can be decomposed into
the following product:

P(A_1, A_2, \ldots, A_n) = P(A_1) \prod_{i=1}^{n-1} P(A_{i+1} | A_i, A_{i-1}, \ldots, A_1)
Bayesian networks
The chain rule holds for any order of the A_i's
A Bayesian network is a directed acyclic graph,
therefore its nodes can be given an ordering
where ancestors of node A_i have indices < i
Thus a Bayesian network can be written

P(A_1, A_2, \ldots, A_n) = \prod_{i=1}^{n} P(A_i | Parents(A_i))

When a variable has no parents, we use the


unconditional probability of that variable
Bayesian network #1 for the weather data
Random Variables
Y: Play
O: Outlook
T: Temperature
H: Humidity
W: Windy

[Network graph: Y is the parent of each of W, O, T, and H]

The graph expresses the factorization below:

P(Y, O, T, H, W) = P(W | Y) P(O | Y) P(T | Y) P(H | Y) P(Y)

Bayesian network #2 for the weather data
Random Variables
Y: Play
O: Outlook
T: Temperature
H: Humidity
W: Windy

[Network graph: Y is a parent of W, O, T, and H; O is also a parent of W and T, and T is also a parent of H]

The graph expresses the factorization below:

P(Y, O, T, H, W) = P(W | O, Y) P(O | Y) P(T | O, Y) P(H | T, Y) P(Y)
Estimating Bayesian network
parameters
The log-likelihood of a Bayesian network with
V variables and N examples of complete
variable assignments to the network is

LL(\theta) = \sum_{i=1}^{N} \sum_{v=1}^{V} \log P(a_v^{(i)} | Parents(A_v)^{(i)}; \theta_v)

where the parameters of each conditional or
unconditional distribution are given by θ_v
We use the notation a_v^{(i)} to indicate the ith
observation of variable v
Estimating probabilities
in Bayesian networks
The estimation problem decouples into
separate estimation problems for each
conditional or unconditional probability
Unconditional probabilities can be written as

\hat{P}(A = a) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[a^{(i)} = a]

where \mathbb{1}[\cdot] is an indicator function
returning 1 when the ith observed value
a^{(i)} = a and 0 otherwise
Estimating conditional distributions
Estimating conditional distributions in Bayesian
networks is equally easy and amounts to simply
counting configurations and dividing, ex.

\hat{P}(A = a | B = b) = \frac{\mathrm{count}(A = a, B = b)}{\mathrm{count}(B = b)}
Zero counts cause problems and this motivates


the use of Bayesian priors
Estimating network structure
One possibility is to use cross-validation to estimate the
goodness of fit on held-out data (so as to avoid overfitting);
another is to penalize model complexity
Let K be the number of parameters, LL the log-likelihood, and
N the number of instances in the data.
Two popular measures for evaluating the quality of a network
are the Akaike Information Criterion (AIC):
AIC score= - LL + K
and the following MDL metric based on the MDL principle:
MDL score = - LL + (K/2) log N
In both cases the log-likelihood is negated, so the aim is to
minimize these scores.
A more Bayesian approach: use a prior over model structures
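
A sketch of the two scoring measures above (an assumed helper, not the book's code); both negate the log-likelihood, so lower scores are better.

import math

def aic_score(log_likelihood, num_params):
    return -log_likelihood + num_params

def mdl_score(log_likelihood, num_params, num_instances):
    return -log_likelihood + 0.5 * num_params * math.log(num_instances)

# Hypothetical comparison of two candidate structures on the same data:
print(aic_score(-1200.0, 25), mdl_score(-1200.0, 25, 1000))
print(aic_score(-1180.0, 60), mdl_score(-1180.0, 60, 1000))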
Network structure learning algorithm
K2: a simple and very fast learning algorithm,
Starts with a given ordering of the nodes
Processes each node in turn and greedily
considers adding edges from previously
processed nodes to the current one
In each step it adds the edge that maximizes the
network's score
When there is no further improvement, attention
turns to the next node
The number of parents for each node can be
restricted to a predefined maximum to mitigate
overfitting
Tree-augmented naïve Bayes (TAN)
Another good learning algorithm for Bayesian network classifiers
Takes the Naïve Bayes (NB) classifier and adds edges to it
The class attribute is the sole parent of each node in a NB model:
TAN considers adding a second parent to each node
If the class node and all corresponding edges are excluded from
consideration, and assuming that there is exactly one node to
which a second parent is not added, the resulting classifier has a
tree structure rooted at the parentless node, hence the name
For this restricted network type there is an efficient algorithm
based on computing a maximum weighted spanning tree
The algorithm is linear in the number of instances and quadratic in
the number of attributes
Data structures for fast Learning
Learning Bayesian networks involves a lot of counting
For each network structure considered in the search,
the data must be scanned afresh to obtain the counts
needed to fill out the conditional probability tables
Counts can be stored effectively in a structure called an
all-dimensions (AD) tree, which is analogous to the kD-
trees used for nearest neighbor search
Each node of the tree represents the occurrence of a
particular combination of attribute values
Straightforward to retrieve the count for a combination
that occurs in the tree
However, the tree does not explicitly represent many
nonzero counts because the most populous expansion
for each attribute is omitted
AD Tree example
Bibliographic Notes & Further Reading
Learning and Bayesian Networks

The K2 algorithm for learning Bayesian networks


was introduced by Cooper and Herskovits (1992).
Bayesian scoring metrics are covered by
Heckerman et al. (1995).
Friedman et al. (1997) introduced the tree
augmented Nave Bayes algorithm, and also
describe multinets.
AD trees were introduced and analyzed by Moore
and Lee (1998)
Komarek and Moore (2000) introduce AD trees for
incremental learning that are also more efficient
for datasets with many attributes.
Clustering and probability density
estimation
If we had data belonging to two known classes A and B, each
having a normal distribution with mean and standard
deviation μ_A and σ_A for class A, and μ_B and σ_B for class B
We could define a model whereby samples are taken from
these distributions, using cluster A with probability pA and
cluster B with probability pB (where pA + pB = 1)
Sampling, we might obtain the data in the next slide
Now, imagine being given the dataset without the classes (just the
numbers) and being asked to determine the five parameters
that characterize the model: μ_A, σ_A, μ_B, σ_B, and pA (the
parameter pB can be calculated directly from pA)
Clustering with a Gaussian Mixture
Given the data on the left without the labels A and B,
we wish to estimate a model for a two class Gaussian Mixture
Model (GMM) on the right
Estimating Gaussian parameters
If we knew which of the two distributions each
instance came from, finding the five parameters
would be easy: just estimate the mean and
standard deviation from the n = nA or n = nB samples x1,
x2, …, xn in each cluster, A and B

\mu = \frac{x_1 + x_2 + \cdots + x_n}{n}

\sigma^2 = \frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_n - \mu)^2}{n - 1}
To estimate the fifth parameter pA, just take the
proportion of the instances that are in the A
cluster, then pB=1-pA.
Motivating the EM algorithm
If you knew the five parameters, finding the
(posterior) probabilities that a given instance
comes from each distribution would be easy
Given an instance xi, the probability that it
belongs to cluster A is
P(A | x_i) = \frac{P(x_i | A)\, P(A)}{P(x_i)} = \frac{N(x_i; \mu_A, \sigma_A)\, p_A}{N(x_i; \mu_A, \sigma_A)\, p_A + N(x_i; \mu_B, \sigma_B)\, p_B}

where N(·) is the normal or Gaussian distribution

N(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
The EM algorithm for a GMM
Start with a random (but reasonable) assignment to
the parameters
Compute the posterior distribution for the cluster
assignments for each example
Update the parameters based on the expected class
assignments - where probabilities act like weights.
If wi is the probability that instance i belongs to
cluster A, the mean and std. dev. are
\mu_A = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n}

\sigma_A^2 = \frac{w_1 (x_1 - \mu_A)^2 + w_2 (x_2 - \mu_A)^2 + \cdots + w_n (x_n - \mu_A)^2}{w_1 + w_2 + \cdots + w_n}
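
A compact sketch of these EM updates for a two-component 1-D Gaussian mixture (an assumed implementation following the equations above, not the book's code; it uses scipy.stats.norm for the Gaussian density).

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1.5, 200)])

mu_a, mu_b, sd_a, sd_b, p_a = -1.0, 1.0, 1.0, 1.0, 0.5   # rough initial guesses
for _ in range(100):
    # E-step: posterior probability that each point belongs to cluster A
    dens_a = norm.pdf(x, mu_a, sd_a) * p_a
    dens_b = norm.pdf(x, mu_b, sd_b) * (1 - p_a)
    w = dens_a / (dens_a + dens_b)
    # M-step: weighted means, standard deviations and mixing proportion
    mu_a = np.sum(w * x) / np.sum(w)
    mu_b = np.sum((1 - w) * x) / np.sum(1 - w)
    sd_a = np.sqrt(np.sum(w * (x - mu_a) ** 2) / np.sum(w))
    sd_b = np.sqrt(np.sum((1 - w) * (x - mu_b) ** 2) / np.sum(1 - w))
    p_a = w.mean()

print(mu_a, sd_a, mu_b, sd_b, p_a)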
The EM algorithm
Optimizes the marginal likelihood, obtained by
marginalizing over the two components of the
Gaussian mixture
The marginal likelihood is a measure of the
goodness of the clustering and it increases
at each iteration of the EM algorithm.
In practice we use the log marginal likelihood
\sum_{i=1}^{n} \log P(x_i) = \sum_{i=1}^{n} \log \sum_{c_i} P(x_i | c_i)\, P(c_i)
= \sum_{i=1}^{n} \log \left[ N(x_i; \mu_A, \sigma_A)\, p_A + N(x_i; \mu_B, \sigma_B)\, p_B \right]
Extending the mixture Model
The Gaussian distribution generalizes to n-dimensions
Consider a two-dimensional model consisting of
independent Gaussian distributions for each dimension
We can transform from scalar to matrix notation for a two
dimensional Gaussian distribution as follows:

Σ is the covariance matrix, |Σ| is its determinant, the
vector x = [x1 x2]^T, and the mean vector μ = [μ1 μ2]^T
The multivariate Gaussian distribution
Can be written in the following general form

N(\mathbf{x}; \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)

The equation for estimating the covariance
matrix is

\Sigma = \frac{1}{N} \sum_{i=1}^{N} (\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T

The mean is simply

\boldsymbol{\mu} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i
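
A small sketch of these two estimators (an assumed NumPy example): the sample mean vector and the (1/N) outer-product estimate of the covariance matrix.

import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal(mean=[1.0, -2.0],
                            cov=[[2.0, 0.8], [0.8, 1.0]], size=500)   # N x d data

mu = X.mean(axis=0)
centered = X - mu
Sigma = centered.T @ centered / X.shape[0]    # matches the 1/N formula above

print(mu)
print(Sigma)                                   # compare with np.cov(X.T, bias=True)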
Clustering with correlated attributes
If all attributes are continuous one can simply use a full
covariance Gaussian mixture model
But one needs to estimate n(n + 1)/2 parameters per
mixture component for a full covariance matrix model
As we will see later, principal component analysis (PCA)
can be formulated as a probabilistic model, yielding
probabilistic principal component analysis (PPCA)
Approaches known as mixtures of principal component
analyzers or mixtures of factor analyzers provide ways
of using a much smaller number of parameters to
represent large covariance matrices
Mixtures of Factor Analyzers
The idea is to decompose the covariance matrix M
of each cluster into the form M = WW^T + D
W is typically a long and skinny matrix of size n x d,
with as many rows as there are dimensions n of
the input, and as many columns d as there are
dimensions in the reduced dimensional space
Standard PCA corresponds to setting D=0
PPCA corresponds to using the form D = σ²I, where
σ is a scalar parameter and I is the identity matrix
Factor analysis corresponds to using a diagonal
matrix for D
Clustering using prior distributions
It would be nice if somehow you could penalize the
model for introducing new parameters
One principled way of doing this is to adopt a fully
Bayesian approach in which every parameter has a
prior probability distribution
Then, whenever a new parameter is introduced, its
prior probability must be incorporated into the overall
likelihood figure
Because this will involve multiplying the overall
likelihood by a number less than one (the prior
probability), it will automatically penalize the addition
of new parameters
To improve the overall likelihood, the new parameters
will have to yield a benefit that outweighs the penalty
Autoclass
A comprehensive clustering scheme based on Bayesian
networks that uses the finite mixture model with prior
distributions on all the parameters
It allows both numeric and nominal attributes, and
uses the EM algorithm to estimate the parameters of
the probability distributions to best fit the data
Because there is no guarantee that the EM algorithm
converges to the global optimum, the procedure is
repeated for several different sets of initial values
Can perform clustering with correlated nominal and
correlated numeric attributes
DensiTrees
Rather than showing just the most likely clustering
to the user, it may be best to present all of them,
weighted by probability.
More fully Bayesian techniques for hierarchical
clustering have been developed that produce as
output a probability distribution over possible
hierarchical structures representing a dataset.
A DensiTree shows the set of all hierarchical trees
for a particular dataset
Such visualizations make it easy for people to grasp
the possible hierarchical clusterings
A DensiTree

Shows possible hierarchical clusterings of a given data set


Bibliographic Notes & Further Reading
AutoClass and DensiTrees

The AutoClass program is described by


Cheeseman and Stutz (1995).
Two implementations have been produced: the
original research implementation, in LISP, and
a follow-up public implementation in C that is 10
or 20 times faster but somewhat more restricted;
for example, only the normal-distribution model is
implemented for numeric attributes.
DensiTrees were developed by Bouckaert
(2010).
Kernel Density Estimation
The underlying true probability distribution p(x) of data
x1, x2, …, xn can be approximated using a kernel density
estimator, which can be written in the following
general form

p(x) = \frac{1}{n} \sum_{i=1}^{n} K_\sigma(x, x_i) = \frac{1}{n\sigma} \sum_{i=1}^{n} K\left( \frac{x - x_i}{\sigma} \right),

where K(·) is a non-negative kernel function that
integrates to one.
The parameter σ > 0 is the bandwidth of the kernel,
and serves as a form of smoothing parameter for the
approximation
Estimating densities using kernels is also known as
Parzen window density estimation.
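
A sketch of a Gaussian kernel density estimator following the formula above (an assumed implementation; libraries such as scipy.stats.gaussian_kde provide the same idea).

import numpy as np

def kde(query, data, bandwidth):
    """Evaluate the KDE at each query point: average one Gaussian kernel per datum."""
    diffs = (query[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * diffs ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(3)
data = np.concatenate([rng.normal(0, 1, 200), rng.normal(4, 0.5, 100)])
grid = np.linspace(-4, 7, 50)
print(kde(grid, data, bandwidth=0.4))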
Comparing parametric, semi-parametric
and nonparametric density models

[Figure: three density models fit to the same two-class data, panels (a), (b), and (c)]

We can create a classifier by modeling the density of each
class above & applying Bayes rule, ex. using:
a) A one Gaussian per class parametric model
b) A Gaussian mixture with 2 components per class
GMMs are sometimes called semi-parametric methods
c) A kernel density estimate (KDE) for each class
KDE is a classic form of nonparametric estimation
Nonparametric methods still have parameters;
they just grow in proportion to the volume of the data
Bibliographic Notes & Further Reading
Kernel Density Estimation (KDE)

Epanechnikov (1969) showed the optimality of


the Epanechnikov kernel for KDE under the
mean-squared error metric.
Jones et al. (1996) recommends using a so-called
plug-in estimate to select the kernel
bandwidth.
Duda and Hart (1973) and Bishop (2006) show
theoretically that kernel density estimation
converges to the true distribution as the amount
of data grows.
Hidden variable models
Given a model with observations given by x_i, a per-
observation hidden discrete random variable h_i and a
hidden continuous random variable z_i, the marginal
likelihood of such a model is given by

p(x_i; \theta) = \sum_{h_i} \int_{z_i} p(x_i, h_i, z_i; \theta)\, dz_i

where the sum is taken over all possible discrete values
of h_i and the integral is taken over the entire domain of z_i
To be extra clear about the different quantities, here p(x_i)
denotes the probability of the random variable x_i associated
with instance i taking the value represented by the observation
Learning hidden variable models
If a model has hidden variables we perform learning by
integrating out the uncertainty of the hidden variables,
i.e. we work with the marginal likelihood
Taking the derivative of the marginal likelihood, we see
that it corresponds to an expected gradient of the log joint
likelihood, with the expectation taken under the posterior
over the hidden variables
Learning with
expected gradient descent (EGD)
The marginal likelihood can be optimized using
gradient ascent by computing and following the
expected gradient of the log joint likelihood
This can be decomposed into three steps, a
P-step where we compute the posterior over
hidden variables; then an
E-step where we compute the expectation of the
gradient given the posterior; then a
G-step, where we use gradient-based
optimization to maximize the (marginal
likelihood) objective function with respect to the
model parameters.
The Expectation Maximization
(EM) Algorithm
A more well known method, similar to EGD
The EM algorithm consists of two steps, an
E-step that computes the expectations used in
the expected log-likelihood; and an
M-step in which the objective is maximized
typically using a closed-form parameter update.
The expected log joint likelihood is related to the
marginal likelihood in the following way
The EM algorithm in more detail
In a joint probability model with discrete hidden
variables H the probability of the observed data can be
maximized by initializing θ_old and repeating the steps
1. E-step: Compute expectations using P(H | X; θ_old)
2. M-step: Find \theta_{new} = \arg\max_\theta \sum_H P(H | X; \theta_{old}) \log P(X, H; \theta)
3. If the algorithm has not converged, set θ_old = θ_new
and return to step 1
The M step corresponds to maximizing the expected
log-likelihood, the overall procedure maximizes the
marginal likelihood of the data
Although discrete hidden variables are used above, the
approach generalizes to continuous ones
EM for Bayesian networks
Unconditional probability distributions are estimated in
the same way as they would be computed if the
variables A_i had been observed, but with each
observation replaced by its posterior marginal
probability (i.e. marginalizing over the other variables)

Conditional distributions are updated using ratios of


posterior marginals, ex.
EM in practice
For many latent variable models (Gaussian mixture
models, probabilistic principal component analysis, and
hidden Markov models) the required posterior
distributions can be computed exactly, which accounts for
their popularity.
However, for many probabilistic models it is simply not
possible to compute an exact posterior distribution.
This can easily happen with multiple hidden random
variables, because the posterior needed in the E-step is the
joint posterior of the hidden variables.
There is a vast literature on the subject of how to compute
approximations to the true posterior distribution over
hidden variables in more complex models.
Sampling methods and variational methods are popular in
machine learning
Bayesian estimation and prediction
In a more Bayesian modelling setting the joint
distribution of data and parameters under a
model can be written as
p(x_1, x_2, \ldots, x_n, \theta; \alpha) = \left[ \prod_{i=1}^{n} p(x_i | \theta) \right] p(\theta; \alpha),

where α is a hyperparameter
Bayesian predictions use a quantity known as the
posterior predictive distribution - the probability
model for a new observation marginalized over
the posterior probability inferred for the
parameters given the observations so far.
Empirical Bayesian methods
One approach is obtained by maximizing the
marginal likelihood with respect to the
model's hyperparameters, ex.

\alpha_{MML} = \arg\max_\alpha \log \int \left[ \prod_{i=1}^{n} p(x_i | \theta) \right] p(\theta; \alpha)\, d\theta
Probabilistic inference
With complex probability models (and even
with some seemingly simple ones)
computing quantities such as posterior
distributions, marginal distributions, and the
maximum probability configuration often
requires specialized methods to achieve results
efficiently, even approximate ones.
This is the field of probabilistic inference
Sampling
Sampling methods are popular in both statistics
and machine learning
Useful for implementing fully Bayesian methods
that use distributions on parameters, or for
inference in graphical models with cyclic
structures
Gibbs sampling is a popular special case of the
more general Metropolis-Hastings algorithm
Allows one to generate samples from a joint
distribution even when the true distribution is a
complex continuous function
Gibbs sampling
Gibbs sampling is conceptually very simple.
Assign an initial set of states to the random
variables of interest.
With n random variables, the initial assignments
or samples can be written as x_1 = x_1^{(0)}, \ldots, x_n = x_n^{(0)}
Then iteratively cycle through updates to each
variable by sampling from its conditional
distribution given the others:
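
A minimal Gibbs sampler sketch (an assumed example, not from the book): draw from a bivariate Gaussian with correlation rho by alternately sampling each coordinate from its conditional distribution given the other, as described above.

import numpy as np

rng = np.random.default_rng(4)
rho, n_samples = 0.8, 5000
x1, x2 = 0.0, 0.0                      # initial assignment x^(0)
samples = []
for _ in range(n_samples):
    # For this target, p(x1 | x2) and p(x2 | x1) are 1-D Gaussians
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho ** 2))
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho ** 2))
    samples.append((x1, x2))

samples = np.array(samples)
print(np.corrcoef(samples.T))          # empirical correlation approaches rho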
Simulated annealing
A procedure that seeks an approximate most
probable explanation (MPE) or configuration
We can obtain it by adapting Gibbs sampling to
include an iteration-dependent temperature
term ti, that decreases (usually slowly) over time
Starting with an initial assignment, x_1 = x_1^{(0)}, \ldots, x_n = x_n^{(0)},
subsequent samples take the form
Iterated conditional modes
Consists of iterations of the form
x_1^{(i+1)} = \arg\max_{x_1} p(x_1 | x_2 = x_2^{(i)}, \ldots, x_n = x_n^{(i)}),
\ldots
x_n^{(i+1)} = \arg\max_{x_n} p(x_n | x_1 = x_1^{(i)}, \ldots, x_{n-1} = x_{n-1}^{(i)}).
Often converges quickly to an approximate MPE,
but is prone to local minima
Zero temperature limit of a simulated annealer
Gibbs sampling is a popular special case of the
more general Metropolis Hastings algorithm
Gibbs sampling and simulated annealing are
special cases of the broader class of Markov
Chain Monte Carlo (MCMC) methods
Variational methods
Rather than sampling from a complex distribution
the distribution can be approximated by a simpler,
more tractable, function
Suppose we have a probability model with a set of
hidden variables H and observed variables X
Let p(H | X) be the exact posterior
Define q(H; Φ) as the variational approximation,
having its own variational parameters Φ
Variational methods make q close to p, yielding
principled EM algorithms with approximate posteriors
Bibliographic Notes & Further Reading
Sampling and MCMC

Markov chain Monte Carlo (MCMC) methods are popular in


Bayesian statistical modeling; see for example Gilks (2005)
Geman and Geman (1984) first described the Gibbs
sampling procedure, naming it after the physicist Josiah
Gibbs because of the analogy between sampling, the
underlying functional forms of random fields and statistical
physics
Hastings' (1970) generalization of the Metropolis et al.
(1953) simulated annealing algorithm has been influential
in laying the foundation for present-day methods
The iterated conditional modes approach for finding an
approximate most probable explanation was proposed by
Besag (1986)
Plate notation
A plate is simply a box around a Bayesian
network that denotes a certain number of
replications of it, one for each data instance
The plate below indicates i = 1…N networks, each
with an observed value for x_i and hidden variable C_i
Plate notation captures a model for the joint
probability of the entire data with a simple picture.
[Figure: (a) a single network with nodes C and x; (b) the unrolled networks C1…CN over x1…xN; (c) the plate notation version with Ci and xi inside a plate labeled i=1:N]


Probabilistic PCA
Probabilistic PCA (PPCA) can be written as a Bayes net
Let x be a d-dimensional random vector of observed data, and
h be a k-dimensional vector of hidden variables, with k < d typically
The joint probability model has this linear Gaussian form

[Figure: Bayes net with hidden nodes h1…hK each pointing to observed nodes x1…xP]

where p(h) is a Gaussian distributed random variable having
zero mean and identity covariance, while p(x|h) is Gaussian
with mean Wh + μ and a diagonal covariance matrix D = σ²I
The mean μ is included as a parameter, but it would be zero if
we first mean-centered the data; mixtures of PPCA use these
means to model more complex distributions
PPCA and Factor Analysis

[Figure: the PPCA Bayes net with hidden nodes h1…hK and observed nodes x1…xP]

Restricting the covariance matrix D to be diagonal


produces a model known as factor analysis
Because of the nice properties of Gaussian
distributions, the marginal distribution of x in
these models is also Gaussian

The posterior distribution for h can be obtained


from Bayes rule, and some Gaussian identities
PPCA, learning and the marginal likelihood

[Figure: the PPCA Bayes net with hidden nodes h1…hK and observed nodes x1…xP]

The log marginal likelihood of the parameters


given all the observed data can be maximized
using this objective function

where the model parameters are


Compare this to the joint probability
Gradient ascent for PPCA

[Figure: the PPCA Bayes net with hidden nodes h1…hK and observed nodes x1…xP]

The expected log-likelihood (ELL) of the data is

The gradient of the ELL under is

This has a natural interpretation as a difference between
two expectations: the first is akin to the model's
prediction or reconstruction of the input, forming a matrix
with the model's explanation of the input
EM for PPCA

[Figure: the PPCA Bayes net with hidden nodes h1…hK and observed nodes x1…xP]

The E and M steps of the PPCA EM algorithm


can be written as

where all expectations are taken with respect
to each example's posterior distribution
In the zero input noise limit these equations
can be written even more compactly
Bibliographic Notes & Further Reading
Expectation Maximization (EM) and Variational Methods

The expectation maximization algorithm originates in the


influential work of Dempster, Laird, and Rubin (1977)
The modern variational view provides a solid theoretical
justification for the use of approximate posterior
distributions, see Bishop (2006) for more details
The variational perspective on EM originated in the 1990s
with work by Neal and Hinton (1998), Jordan et al. (1998),
and others
Salakhutdinov, Roweis and Ghahramani (2003) explore the
EM approach and compare it with the expected gradient,
including the more sophisticated expected conjugate
gradient based optimization
Bibliographic Notes & Further Reading
Principal Component Analysis and Related Probabilistic Models

Roweis (1998) gives an early EM formulation for probabilistic principal


component analysis: he examines the zero input noise case and
provides the elegant mathematics for the simplified EM algorithm
Tipping and Bishop (1999) give further analysis, and show that after
optimizing the model and learning the variance of the observation
noise the columns of the matrix W are scaled and rotated principal
eigenvectors of the covariance matrix of the data.
Further generalizations, such as mixtures of principal component
analyzers are in (Dony and Haykin, 1995; Tipping and Bishop, 1999)
See Ghahramani and Hinton (1996) for mixtures of factor analyzers
Ilin and Raiko (2010) present PPCA with data that is missing at random
Edwards (2012) provides a nice introduction to graphical modeling,
including mixed models with discrete and continuous components and
graphical Gaussian or inverse covariance models
Latent semantic analysis (LSA) and the
singular value decomposition (SVD)
LSA factorizes a data matrix (ex. of word counts) using SVD

X \approx U_k S_k V_k^T, \quad S_k = \mathrm{diag}(s_1, s_2, \ldots, s_k)

(X is t×d, U_k is t×k, S_k is k×k, and V_k^T is k×d)
where U and V have orthogonal columns and S is a
diagonal matrix containing the singular values, usually
sorted in decreasing order (t = #terms, d = #docs, k = #topics)
For every k < d, if all but the k largest
singular values are discarded, the data matrix can be
reconstructed in a way that is optimal in a least squares sense
Probabilistic Latent Semantic Analysis (pLSA)

[Plate diagram: document variable di, topic variables zij, and word variables wij, with plates j=1:mi and i=1:n]

In the pLSA framework one considers the index of
each document as encoded using observations of
discrete random variables di for i = 1,…,n documents


Each variable di has n states, and over the document
corpus there is one observation of the variable for each
state.
Topics are represented with discrete variables zij, while
words are represented with random variables wij,
where mi words are associated with each document
and each word is associated with a topic.
P(W, D) = \prod_{i=1}^{n} P(d_i) \prod_{j=1}^{m_i} \sum_{z_{ij}} P(z_{ij} | d_i)\, P(w_{ij} | z_{ij}).
pLSA, LDA and smoothed LDA

[Plate diagrams: (a) pLSA with nodes di, zij, wij; (b) LDA with a per-document parameter, zij and wij; (c) smoothed LDA, which adds a prior on the topic parameters; each with plates j=1:mi and i=1:n]

Latent Dirichlet Allocation (LDA)

[Plate diagram: Dirichlet hyperparameter, per-document parameter θi, topic variables zij, and word variables wij, with plates j=1:mi and i=1:n]

Reformulates pLSA, replacing the document index
variables di with the random parameter θi
The distribution of θi is influenced by
a Dirichlet prior with a hyperparameter
The relationship between the discrete topic
variables zij and the words wij is also given an
explicit dependence on a parameter matrix B.
The probability model for all observed words W is
Inference for LDA
The marginal log-likelihood of the model can be
optimized using an empirical Bayesian method by
adjusting the hyperparameter and the parameter matrix B
To perform the E-step of EM, the posterior
distribution over the unobserved random
quantities is used

This posterior is intractable and is typically


computed using a variational approximation or by
sampling

Smoothed LDA

[Plate diagram: smoothed LDA with θi, zij, wij, the topic parameters B, and Dirichlet hyperparameters, with plates j=1:mi and i=1:n]

Reduces the effects of overfitting
Adds another Dirichlet prior, with its own
hyperparameters, on the topic parameters B

Collapsed Gibbs sampling can be performed by
integrating out the θs and B analytically, which
deals with these distributions exactly
The Gibbs sampler proceeds by simply iteratively
updating each zij conditioned on all the others to
compute the required approximate posterior
Topics from PNAS
Example: Griffiths and Steyvers (2004) applied
smoothed LDA to 28,154 abstracts of papers
published in the Proceedings of the National
Academy of Science (PNAS) from 1991 to 2001

User tags shown at the bottom were not used to


create the topics, but correlate very well with the
inferred topics
Bibliographic Notes & Further Reading
Latent Semantic Analysis (LSA), pLSA and Latent
Dirichlet Allocation (LDAb)

Latent semantic analysis (LSA) was introduced by


Deerwester et al. (1990)
Probabilistic LSA (pLSA) was proposed by
Hofmann (1999)
Latent Dirichlet allocation (LDAb) was proposed in
Blei, Ng and Jordan (2003).
Collapsed Gibbs sampling for LDAb was
proposed by Teh et al. (2006), who also extended
the concept to variational methods.
Bibliographic Notes & Further Reading
Latent Semantic Analysis (LSA), pLSA and Latent Dirichlet Allocation
(LDAb)

Rather than applying LDAb naively when looking for trends over
time, Blei and Lafferty's (2006) dynamic topic models treat the
temporal evolution of topics explicitly; they examined topical trends
in the journal Science.
Griffiths and Steyvers (2004) used Bayesian model selection to
determine the number of topics in their LDAb analysis of
Proceedings of the National Academy of Science abstracts.
Griffiths and Steyvers (2004) and Teh et al. (2006) give more details
on the collapsed Gibbs sampling and variational approaches to
LDAb.
Hierarchical Dirichlet processes (Teh et al., 2006) and related
techniques offer alternatives to the problem of determining the
number of topics or clusters in hierarchical Bayesian models.
Factor graphs
Represent functions by factoring them into the
product of local functions, each of which acts on
a subset of the full argument set
F(x_1, \ldots, x_n) = \prod_{j=1}^{S} f_j(X_j)

where X_j is a subset of the original set of
arguments {x_1, …, x_n}, f_j(X_j) is a function of X_j, and
j = 1…S enumerates the argument subsets.
A factor graph consists of variable nodes (circles)
for each variable x_k and factor nodes
(rectangles) for each function, with edges that
connect each factor node to its variables.
Factor graph example
A Bayesian network and its factor graph for
F(x_1, \ldots, x_5) = P(x_1) P(x_2) P(x_3 | x_1) P(x_4 | x_1, x_2) P(x_5 | x_2)
= f_A(x_1) f_B(x_2) f_C(x_3, x_1) f_D(x_4, x_1, x_2) f_E(x_5, x_2)

[Figure: the Bayesian network over x1…x5 (left) and the corresponding factor graph with factor nodes fA…fE (right)]
Factor graphs for naïve Bayes models
Naïve Bayes models have simple factor graphs

P(y, x_1, \ldots, x_n) = P(y) \prod_{i=1}^{n} P(x_i | y).

[Figure: the naïve Bayes network with y as the parent of x1…xn (left) and its factor graph (right)]

Consider the number of parameters needed for


each factor
An important observation
Reversing the arrows in a naïve Bayes model
leads to a variable that has many parents

P(y, x_1, \ldots, x_n) = P(y | x_1, \ldots, x_n) \prod_{i=1}^{n} P(x_i)

[Figure: the reversed network with x1…xn as parents of y (left) and its factor graph (right)]

Consider the # of parameters for p(y | x1,…,xn)


Logistic regression and factor graphs
Rather than using a large table for the
conditional distribution of a child y given
many xis, a logistic regression model can be
used to reduce the number of parameters
from exponential to linear
Let's assume that all variables are binary
Given a separate function fi(y,xi) for each
binary variable xi, the conditional distribution
defined by a logistic regression model can be
expressed as shown in the next slide
Logistic regression and factor graphs
Logistic regression can be written as

P(y | x_1, \ldots, x_n) = \frac{1}{Z(x_1, \ldots, x_n)} \exp\left( \sum_{i=1}^{n} w_i f_i(y, x_i) \right) = \frac{1}{Z(x_1, \ldots, x_n)} \prod_{i=1}^{n} f_i(x_i, y).

[Figure: factor graph with a factor node connecting y to each of x1…xn]

where we note that the denominator Z is a data-
dependent normalization term
This factor graph resembles the Naïve Bayes model,
but it is for a factorized conditional distribution
Note: we have used rectangles for the x_i's because they
are not being explicitly modeled as random variables
Markov random fields (MRFs)
A clique is a group of nodes in an undirected graph
where all nodes are connected to one another
MRFs define another factorized model for a set of
random variables X, where clique sets are given by
X_c and a factor ψ_c(X_c) is defined for each clique,
such that

P(X) = \frac{1}{Z} \prod_{c=1}^{C} \psi_c(X_c),

The partition function Z normalizes the result to
form a probability distribution

Z = \sum_{x \in X} \prod_{c=1}^{C} \psi_c(X_c).
Markov random field example
P(x_1, x_2, x_3, x_4) = \frac{1}{Z} \prod_{u=1}^{U} \phi_u(X_u) \prod_{v=1}^{V} \psi_v(X_v)
= \frac{1}{Z}\, f_A(x_1) f_B(x_2) f_C(x_3) f_D(x_4)\, f_E(x_1, x_2) f_F(x_2, x_3) f_G(x_3, x_4) f_H(x_4, x_1)
Note we used unary and pairwise potentials

[Figure: an MRF over x1…x4 and y1…y4 drawn using a traditional MRF-style undirected graph (left) and as a factor graph (right)]

MRFs and energy functions


An MRF lattice is often repeated over an image
MRFs can be expressed in terms of
an energy function F(X), where

F(X) = \sum_{u=1}^{U} U(X_u) + \sum_{v=1}^{V} V(X_v)

Since Z is constant for any assignment of the
variables X we have

-\log P(x_1, x_2, x_3, x_4) = \sum_{u=1}^{U} U(X_u) + \sum_{v=1}^{V} V(X_v) + \mathrm{const}

Minimizing MRF energies


A commonly used strategy for tasks such as
image segmentation and entity resolution in
text documents is to minimize an energy
function of the form in the previous slide
When such energy functions are submodular,
an exact minimum can be found
using algorithms based on graph-cuts;
otherwise methods such as tree-reweighted
message passing can be used.

Example: Image Segmentation


Given a labeled dataset of medical imagery, such as a
computed tomography or CT image
A well known approach is to learn a Markov Random
Field model to segment that image into classes of
interest, ex. anatomical structures or tumors
Image features are combined with spatial context

[Figure: original CT image, ground truth, and MRF segmentation, from Bhole et al. (2013)]
Computing marginal probabilities
The marginal for variable x_i is

P(x_i) = \sum_{x_j : j \neq i} P(x_1, \ldots, x_n),

where the sum is over the states of all variables x_j other than x_i
Consider the task of computing the marginal
conditional probability of variable x_3 given an
observation for x_4 from a model where

P(x_1, \ldots, x_5) = P(x_1) P(x_2) P(x_3 | x_1) P(x_4 | x_1, x_2) P(x_5 | x_2)

[Figure: the Bayesian network and factor graph for this model, with factors fA…fE]

Marginal probabilities

[Figure: the Bayesian network over x1…x5, repeated]

Since other variables in the graph have not been


observed, they should be integrated out of the
graphical model to obtain the desired result, ex.
where the key quantity is

However, this sum involves a large data structure


containing the joint probability, composed of the
products over the individual probabilities
Intuition for the sum-product algorithm
The sum-product algorithm refers to a much
better solution for computing marginals: simply
push the sums as far as possible to the right
before computing products of probabilities
In our example the required marginalization can
be computed by

[Figure: the Bayesian network and factor graph for the example, repeated]
The sum-product algorithm
Computes exact marginals in tree structured
factor graphs
Begin with variable or function nodes that have
only one connection (leaf nodes)
Function nodes send the message:
to the variable connected to them
Variable nodes send the message:
Other nodes wait until they have received a
message from all neighbors except the one they will
send a message to
Function to variable messages
When ready, function nodes send messages of the
following form to variable x:

where N(f)\x represents the set of the function node f's
neighbors, excluding the recipient variable x
Variables of the K other neighboring nodes are x_1, …, x_K
If a variable is observed, messages for functions
involving it no longer need a sum over states of the
variable; the function is evaluated with the observation
One could think of the associated variable node as
being transformed into the new modified function
Variable to function messages
Variable nodes send messages to functions of
form:

where the product is over the (K) messages
from all neighboring functions N(x) other than
the recipient function f, i.e. f_k \in N(x) \setminus f
The marginal for each node is obtained from
the product over all K+1 incoming messages
from all functions connected to a variable

P(x) = \mu_{f_1 \to x}(x) \cdots \mu_{f_K \to x}(x)\, \mu_{f_{K+1} \to x}(x) = \prod_{k=1}^{K+1} \mu_{f_k \to x}(x)
Numerical stability
Multiplying many probabilities quickly leads to
very small numbers
The sum-product algorithm is often
implemented with re-scaling
Alternatively or additionally the computations
can be performed in log space leading to
computations of the form c = log(exp(a)+ exp(b))
To help prevent loss of precision when
computing the exponents, use the equivalent
expression with the smaller exponent below

c = \log(e^a + e^b) = a + \log(1 + e^{b-a}) = b + \log(1 + e^{a-b})
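
A sketch of this log-space trick as a helper function (an assumed example, not from the slides): subtracting the larger exponent before exponentiating avoids overflow and underflow.

import numpy as np

def log_add(a, b):
    """Return log(exp(a) + exp(b)) in a numerically stable way."""
    hi, lo = (a, b) if a > b else (b, a)
    return hi + np.log1p(np.exp(lo - hi))

print(log_add(-1000.0, -1001.0))   # naive exp() would underflow to 0 here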
Sum-product messages

[Figure: the factor graph for the example redrawn as a chain rooted at x4, with the individual messages labeled 1a-6a, 1b-6b, 1c, 5c, 1d, and 5d to show the order in which they are passed]
The messages for our example
P(x_3, \tilde{x}_4) = \sum_{x_1} P(x_3 | x_1) P(x_1) \sum_{x_2} P(x_4 = \tilde{x}_4 | x_1, x_2) P(x_2) \sum_{x_5} P(x_5 | x_2)

[Figure: the expression above annotated with braces showing which sub-expression corresponds to each of the messages 1a-6a in the message-passing diagram]

The complete
algorithm can yield
all single-variable
marginals in the
graph using the
other messages
shown in the
diagram
Finding the most probable configuration
Finding the most probable configuration of all
other variables in our example given
involves searching for

for which
Pushing max to the right

[Figure: the Bayesian network and factor graph for the example, with factors fA…fE over x1…x5]

Because max behaves in a similar way to sum,


as in the sum-product algorithm we can push
the max operations as far to the right as
possible, noting that max(ab, ac) = a max(b, c) for a ≥ 0
For our example we have
The max-sum algorithm
A log-space version of max-product
As in the sum-product algorithm, variables or
factors that have only one connection in the
graph begin by sending either:
a function-to-variable message
or a variable-to-function message
Each function and variable node in the graph
waits until it has received a message from all
neighbors other than the node that will
receive its message
Function to variable messages
Each function and variable node in the graph
waits until it has received a message from all
neighbors other than the node that will
receive its message
Function nodes send messages of the
following form to variable x

where the notation N(f)\x is the same as for


the sum-product algorithm above
Variable to function messages
Variables send messages to functions of this form

where the sum is over the messages from all


functions other than the recipient function.
When the algorithm terminates, the probability of
the most probable configuration (MPC) can be
extracted from any node using
the MPC
itself is:
Bibliographic Notes & Further Reading
Graphical Probability Models (GPMs) and Inference

Plate notation has been widely used in artificial intelligence


(Buntine, 1994), machine learning (Blei et al., 2003) and
computational statistics (Lunn et al., 2000) to define
complex probabilistic graphical models,
GPMs form the basis of the BUGS (Bayesian inference Using
Gibbs Sampling) software project (Lunn et al., 2000)
Our presentation of factor graphs and the sum-product
algorithm follows their origins in Kschischang et al. (2001)
and Frey (1998)
Bayesian networks and other models that contain cycles
can be manipulated into a structure known as a junction
tree by clustering variables, and Lauritzen and Spiegelhalter's
(1988) junction tree algorithm permits exact inference
Bibliographic Notes & Further Reading
Graphical Probability Models and Inference

Ripley (1996) covers the junction tree algorithm,


with practical examples
Huang and Darwiche's (1996) procedural guide is
an excellent resource for those who need to
implement the algorithm
Probability propagation in a junction tree yields
exact results, but is sometimes infeasible because
the clusters become too large, in which case one
must resort to sampling or variational methods
Conditional continuous
probability models
Consider a model where the conditional
probability for y_i given x_i is a Gaussian with
mean given by a linear function of x:

p(y_i | x_i) = N(y_i; w^T x_i, \sigma^2)

The conditional distribution for N y_i's given the
corresponding x_i's can be defined as

p(y_1, \ldots, y_N | x_1, \ldots, x_N) = \prod_{i=1}^{N} p(y_i | x_i).
Linear regression as a probability model
The log-likelihood of our linear model is:

L_{y|x} = \log \prod_{i=1}^{N} p(y_i | x_i) = \sum_{i=1}^{N} \log p(y_i | x_i).

To maximize the log-likelihood it suffices to find the


parameters that minimize the squared error

i.e. this probability model is simply linear regression


Priors on parameters, Gaussians, ridge
regression and weight decay
Placing a zero mean Gaussian prior on the
parameters w leads to the method of ridge
regression, also called weight decay
Consider a linear regression that uses a D
dimensional vector x to make predictions
The regression's bias term can be represented by
appending a constant feature = 1 as an additional
final dimension of x for every example
The prior is frequently omitted from the bias
The underlying probability model can be written

\prod_{i=1}^{N} p(y_i | x_i; \theta)\, p(\theta; \tau) = \prod_{i=1}^{N} N(y_i; w^T x_i, \sigma^2) \prod_{d=1}^{D} N(w_d; 0, \tau^2)


Priors on parameters, L2 and L1
regularization
Maximum a posteriori parameter estimation
based on the log conditional likelihood with a
zero mean Gaussian prior on weights is
equivalent to minimizing a squared error loss
function plus an L2 based regularization term
F(w) = \sum_{i=1}^{N} \{ y_i - w^T x_i \}^2 + \lambda R_{L2}(w), \quad R_{L2}(w) = w^T w = \|w\|_2^2

Using a Laplace prior for the distribution over


weights and taking the log of the likelihood
function yields an L1 based regularization term
R_{L1}(w) = \|w\|_1
Laplace, L1 regularization the LASSO
and the Elastic Net approach
The Laplace distribution with parameters μ and b is

P(w; \mu, b) = L(w; \mu, b) = \frac{1}{2b} \exp\left( -\frac{|w - \mu|}{b} \right)

Modeling the log of the prior probability for each
weight by a Laplace distribution with μ = 0 yields

-\sum_{d=1}^{D} \log L(w_d; 0, b) = D \log(2b) + \frac{1}{b} \sum_{d=1}^{D} |w_d| = D \log(2b) + \frac{1}{b} \|w\|_1

The use of L1 regularization is also known as the LASSO,
the Least Absolute Shrinkage and Selection Operator.
An alternative (convex!) approach known as the elastic
net combines L1 and L2 regularization techniques using
Matrix vector form of regression
Linear regression can be written in matrix form as

F(w) = (y - Aw)^T (y - Aw)

where the rows of A hold the input vectors and y holds the targets
Taking the partial derivative with respect to w and
setting the result to zero yields a closed form
expression for the parameters:

\frac{\partial}{\partial w} (y - Aw)^T (y - Aw) = 0 \;\Rightarrow\; A^T A w = A^T y, \quad w = (A^T A)^{-1} A^T y

A^T A w = A^T y are the famous normal equations
A^+ = (A^T A)^{-1} A^T is known as the pseudoinverse
Linear regression with regularization
For ridge regression our objective function becomes

F(w) = (y - Aw)^T (y - Aw) + \lambda w^T w

We get a closed form solution for the result

\frac{\partial}{\partial w} F(w) = 0 \;\Rightarrow\; A^T A w + \lambda w = A^T y, \quad w = (A^T A + \lambda I)^{-1} A^T y

This modification to the pseudoinverse equation is very
useful, allowing solutions to be found when they would
otherwise not exist, often using a very small λ
The extension to multidimensional inputs x can still be
formulated using these matrix forms
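
A sketch of the closed-form ridge solution above (an assumed NumPy example). The rows of A are input vectors with a trailing 1 appended for the bias term.

import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-3, 3, size=100)
y = 2.0 * x - 1.0 + rng.normal(0, 0.5, size=100)

A = np.column_stack([x, np.ones_like(x)])      # design matrix with bias column
lam = 0.1
w = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)
print(w)    # close to the true slope 2.0 and intercept -1.0 for small lambda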
Polynomial regression and kernels
We can create a model for non-linear predictions
using polynomials of x
The estimation problem remains linear!

A similar trick for transforming a linear model


into a non-linear model is based on using basis
functions φ(x), where
Multinomial logistic regression
A simple linear probabilistic classifier can be
created using this parametric form

where y ∈ {1, …, N}, we use K feature functions
fk(y,x) and K weights wk for the parameters
We can perform learning using maximum
conditional likelihood and N observations for
both labels and features, {y_1, …, y_N, x_1, …, x_N}
Multiclass logistic regression and
gradient computations
The objective function
(has no closed form solution)
Solution: gradient descent - the derivative w.r.t. one weight is

we see that the derivative breaks apart into two key terms
The first term, the easy part, involves terms that disappear
when w_k ≠ w_j, leaving only the term shown on the left
Understanding gradients for
logistic regression
The second term, while seemingly daunting, has a cool
derivative that yields an intuitive and interpretable result:

which corresponds to the expectation of the feature


function under the probability distribution given by the
model with the current parameter settings
Using vectorized w and feature functions f we have
Reformulating multiclass logistic regression
Could alternatively formulate logistic regression as:
p(y = c | x) = \frac{\exp(w_c^T x)}{\sum_{y'} \exp(w_{y'}^T x)},

where y is an integer index, the weights are encoded


into a vector of length K, and
The features x have been re-defined as the result of
evaluating the feature functions fk(y,x) in such a way
that there is no difference between the features
given by fk(y=i,x) and fk(y=j,x).
This form is widely used for the last layer in neural
network models, where it is referred to as the
softmax function
Matrix vector formulation of multiclass
logistic regression
The information concerning class labels could
be encoded into a multinomial or one-hot
vector y, which is all zeros except for a single 1
in the dimension that represents the correct
class label; for example, y = [0\ 1\ 0\ 0]^T
for the second class.
The weights form a matrix W = [w_1\ w_2\ \cdots\ w_K]^T
and the biases form a vector b = [b_1\ b_2\ \cdots\ b_K]^T

The model can be formulated so as to yield
vectors of probabilities
Logistic regression in matrix vector form
With the previous definitions, we can now write
p(y | x) = \frac{\exp(y^T W^T x + y^T b)}{\sum_{y' \in Y} \exp(y'^T W^T x + y'^T b)},

where the denominator sums over each possible label,
y' ∈ Y, with Y = \{ [1\ 0\ 0\ 0]^T, \ldots, [0\ 0\ 0\ 1]^T \}
Re-defining x as x = [x^T\ 1]^T and the parameters as a
matrix of the form
Logistic regression in matrix vector form
We can now write logistic regression very compactly as

The gradient of the log-conditional-likelihood with


respect to the parameter matrix is

The second term transforms into a vector of


probabilities for the classes of y under the model,
multiplied by the observed x transpose
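
A sketch of this matrix-vector view in code (an assumed implementation, not the book's): class probabilities via a softmax over Wx, and the per-example gradient of the log conditional likelihood as observed minus expected, times x transpose.

import numpy as np

def softmax(scores):
    scores = scores - scores.max()          # stabilize before exponentiating
    e = np.exp(scores)
    return e / e.sum()

def per_example_gradient(W, x, y_onehot):
    """Gradient of log p(y | x) w.r.t. the K x D parameter matrix W."""
    probs = softmax(W @ x)                  # model's class probabilities
    return np.outer(y_onehot - probs, x)    # (observed - expected) times x^T

W = np.zeros((3, 4))                        # 3 classes, 4 features (incl. bias)
x = np.array([0.5, -1.2, 2.0, 1.0])
y = np.array([0.0, 1.0, 0.0])               # one-hot label for class 2
print(per_example_gradient(W, x, y))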
Gradient descent
Given a conditional probability model
Parameter vector θ, data
Prior on parameters, with hyperparameter λ
Gradient descent with learning rate η
can be written as:

For convex models the change in the loss or the


parameters is often monitored and the algorithm is
terminated when it stabilizes
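
A sketch of this basic loop (an assumed example): gradient descent on the convex ridge-regression objective F(w) = ||y − Aw||² + λ||w||², with learning rate η, stopping when the parameters stabilize.

import numpy as np

def gradient_descent(A, y, lam=0.1, eta=0.001, tol=1e-8, max_iters=10000):
    w = np.zeros(A.shape[1])
    for _ in range(max_iters):
        grad = 2 * (A.T @ (A @ w - y) + lam * w)
        w_new = w - eta * grad
        if np.linalg.norm(w_new - w) < tol:   # monitor the change in parameters
            return w_new
        w = w_new
    return w

A = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])  # bias column
y = np.array([-1.0, 1.1, 2.9, 5.0])
print(gradient_descent(A, y))   # approaches the closed-form ridge solution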
Second order methods
Alternatively, gradient descent can be based on
the second derivative by computing the
Hessian matrix H at each iteration and using
updates

To understand why this is possible and the


relationship to learning rates, see Appendix A.1
Generalized linear models
Linear regression and logistic regression are special cases of
a family of conditional probability models known in
statistics as generalized linear models (GLMs)
Idea: unify and generalize linear and logistic regression
Data can be thought of in terms of response variables yi
and explanatory variables organized as vectors xi, i=1,,n.
Response variables could be expressed in different ways,
ranging from binary to categorical or ordinal data
A model is then defined where μ, the expected value of
the distribution used for the response variable, consists of
an initial linear prediction which is then subjected to a
smooth, invertible and potentially non-linear
transformation using the mean function g^{-1}, i.e.

The mean function is the inverse of the link function, g


GLMs
In GLMs the entire set of explanatory variables for all the
observations is often arranged as an n×p matrix X, so that a
vector of linear predictions for the entire data set is Xw
The variance of the underlying distribution can also be
modeled; typically as a function of the mean
Different distributions, link functions and corresponding
mean functions give a great deal of flexibility in defining
probabilistic models, see examples below
Predictions for ordered classes
To define a model with M ordinal categories, M−1
cumulative probability models of the form P(Yi ≤ j) can be
used, where the random variable Yi represents the category
of a given instance i
Models for P(Yi = j) can then be obtained using differences
between the cumulative distribution models
Here we will use complementary cumulative probabilities,
known as survival functions, of the form P(Yi > j) = 1 − P(Yi ≤ j)
They sometimes simplify the interpretation of parameters
The class probabilities can be obtained from:
P(Yi = 1) = 1 − P(Yi > 1)
P(Yi = j) = P(Yi > j − 1) − P(Yi > j)
P(Yi = M) = P(Yi > M − 1).
For binary predictions the following model is popular

\mathrm{logit}(\gamma_{ij}) = \log \frac{\gamma_{ij}}{1 - \gamma_{ij}} = \beta_j + w^T x_i
Bibliographic Notes & Further Reading
Logistic Regression, GLMs and Regularized Regression

Logistic regression is sometimes referred to as the workhorse of


applied statistics; Hosmer and Lemeshow (2004) is a great reference
Nelder and Wedderburn's (1972) work led to the generalized linear
modeling framework
McCullagh (1980) developed proportional odds models for ordinal
regression, which are sometimes called ordered logit models because
they use the generalized logit function
Frank and Hall (2001) showed how to adapt arbitrary machine
learning techniques to ordered predictions
McCullagh and Nelder's (1989) widely cited monograph is another
good source for details on the framework of GLMs
Tibshirani (1996) developed the famous Least Absolute Shrinkage
and Selection Operator, also known as the LASSO
Zou and Hastie (2005) developed the elastic net regularization
approach which combines L1 and L2 regularization
Conditional probability models
based on kernels
Linear models can be transformed into non-linear
ones by applying the kernel trick
Suppose the features x are replaced
by the vector k(x) whose elements
are determined using a kernel
function k(x,xj) for every training
example:
A 1 has been appended to this vector to
implement the bias term in the parameter matrix
Kernelized regression: p(y | x) = N(y; w^T k(x), \sigma^2)

Kernelized classification:
Bibliographic Notes & Further Reading
Kernel Logistic Regression, and Various Vector Machines

Kernel logistic regression transforms a linear classifier into a non-linear


one, and probabilistic sparse kernel techniques are attractive alternatives
to support vector machines
Tipping (2001) proposed a relevance vector machine that manipulates
priors on parameters in a way that encourages kernel weights to become
zero during learning
Lawrence, Seeger, and Herbrich (2003) proposed an informative vector
machine, which treats the problem as a fast, sparse Gaussian process
method in the sense of Williams and Rasmussen (2006)
Zhu and Hastie (2005) formulated sparse kernel logistic regression as an
import vector machine that uses greedy search methods
None of these methods approach the popularity of Cortes and Vapnik's
(1995) support vector machines, perhaps because their objective
functions are not convex, in contrast to the underlying SVM (and L2
regularized kernel logistic regression)
When probabilities are needed from an SVM, Platt (1999) shows how to fit
a logistic regression to the classification scores.
Markov models
One simple and effective probabilistic model for discrete
sequential data is known as a Markov model
A first-order Markov model assumes that each symbol
in a sequence can be predicted using its conditional
probability given the preceding symbol.
An unconditional probability is used for the first symbol
Given observation variables O = {O_1, \ldots, O_T}, we have

P(O) = P(O_1) \prod_{t=1}^{T-1} P(O_{t+1} | O_t).

Usually, every conditional probability used in such


models is the same
The Bayesian network is a linear chain of variables with
directed edges between each successive pair
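
A sketch of such a first-order Markov model for symbol sequences (an assumed example): an unconditional distribution for the first symbol and one shared conditional distribution P(next | current) for every later position.

import numpy as np

initial = np.array([0.6, 0.3, 0.1])            # P(O1) over 3 symbols
transition = np.array([[0.7, 0.2, 0.1],        # rows: current symbol
                       [0.3, 0.5, 0.2],        # cols: next symbol
                       [0.2, 0.3, 0.5]])

def log_prob(sequence):
    lp = np.log(initial[sequence[0]])
    for prev, cur in zip(sequence[:-1], sequence[1:]):
        lp += np.log(transition[prev, cur])
    return lp

print(log_prob([0, 0, 1, 2, 2]))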
Extending Markov models
First order Markov model Second order Markov model

Hidden Markov model Markov random field



Hidden Markov models
A hidden Markov model is a joint probability model of
a set of discrete observed variables O = {O_1, …, O_T} and
discrete hidden variables H = {H_1, …, H_T} for T
observations that factors the joint distribution as
follows

P(O, H) = P(H_1) \prod_{t=1}^{T-1} P(H_{t+1} | H_t) \prod_{t=1}^{T} P(O_t | H_t),

Each Ot is a discrete random variable with N possible


values, and each Ht is a discrete random variable with
M possible values.
The previous figure illustrates a hidden Markov model
as a type of Bayesian network that is known as a
dynamic Bayesian network where variables are
replicated dynamically over the appropriate number of
time steps.
Hidden Markov models (HMMs)
Common to use time-homogeneous models where
the transition matrix P(Ht+1|Ht) is the same at each
time step
Define A to be a transition matrix whose elements
encode P(H_{t+1} = j | H_t = i), and B to be an emission matrix
whose elements b_ij correspond to P(O_t = j | H_t = i)
For t = 1, the initial state probability distribution is
encoded in a vector π with elements π_i = P(H_1 = i)
The complete set of parameters is θ = {A, B, π},
i.e. a set containing two matrices and one vector
We write a particular observation sequence as a set of
observations
HMMs: key problems
1. Compute P(O; θ), the probability of a sequence
under the model with parameters θ
2. Find the most probable explanation, the best
sequence of states H* = {H_1 = h_1, …, H_T = h_T} that
explains an observation
3. Find the best parameters for the model given
a data set of observed sequences
The first problem can be solved using the sum-
product algorithm, the second using the max-
product algorithm, and the third using the EM
algorithm in which the required expectations are
computed using the sum-product algorithm
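
A sketch of the forward algorithm for problem 1 (an assumed implementation, not the book's code): P(O; θ) is computed by summing over hidden states one time step at a time, which is what the sum-product algorithm does on this chain (a practical version would also rescale to avoid underflow).

import numpy as np

def forward(obs, pi, A, B):
    """obs: observed symbol indices; pi: initial probs; A: transitions; B: emissions."""
    alpha = pi * B[:, obs[0]]                    # alpha_1(i) = pi_i * b_i(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]            # propagate one step and re-weight
    return alpha.sum()                           # P(O; theta)

pi = np.array([0.5, 0.5])
A = np.array([[0.8, 0.2], [0.3, 0.7]])           # P(H_{t+1}=j | H_t=i)
B = np.array([[0.9, 0.1], [0.2, 0.8]])           # P(O_t=j | H_t=i)
print(forward([0, 0, 1, 0], pi, A, B))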
Example: HMMs for speech recognition
[Figure: spectrogram of the audio for the phrase "nineteenth century", from Wikipedia]

Conditional random fields (CRFs)
Recall, a Markov random field factorizes the joint
distribution for X using an exponentiated energy
function F(X):

P(X) = \frac{1}{Z} \exp(-F(X)), \quad Z = \sum_X \exp(-F(X))

Conditional random fields condition on some
observations X, yielding a conditional distribution

P(Y | X) = \frac{1}{Z(X)} \exp(-F(Y, X)), \quad Z(X) = \sum_Y \exp(-F(Y, X))
CRF Energy Function
Both Markov and conditional random fields can
be defined for general model structures, but the
energy functions usually include just one or two
variables: unary and pairwise potentials.
Conceptually, to create a conditional random field
for P(Y|X) based on U unary and V pairwise
functions of variables in Y, the energy function
takes the form
HMMs, MRFs, CRFs and Factor Graphs
Hidden Markov model Markov random field

Factor graph for an HMM Conditional random field


Linear chain CRFs
For a given sequence of length N with labels y_1, …, y_N, there are two types of features: a set of J single-variable (state) features u_j(y_i, X, i) that are a function of a single y_i, and which are computed for each y_i in the sequence i = 1, …, N; and a set of K pairwise (transition) features v_k(y_{i-1}, y_i, X, i) for i > 1.
Each type has its own associated unary weights θ_j^u and pairwise weights θ_k^v
The global feature vector view for CRFs
Features can be a function of the entire observed sequence
The global dependence on the input is a major advantage of
conditional random fields over HMMs
Can be useful to combine unary and pairwise features and their
parameters to work with all features for a given position i
Define f(y_i, y_{i+1}, X, i) as a length-L vector containing all single-variable and pairwise features, and define the global feature vector to be the sum over position-dependent feature vectors, so that
g(\mathbf{Y}, \mathbf{X}) = \sum_{i=1}^{N} f(y_i, y_{i+1}, \mathbf{X}, i)
The gradient has a similar form to that of logistic regression
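A minimal Python sketch of the global feature vector computation; the toy tag NAME, the two indicator features, and the boundary handling (the final position's unary contribution is dropped for brevity) are all illustrative choices, not the book's feature set:

```python
import numpy as np

def local_features(y_cur, y_next, X, i):
    """Toy length-2 local feature vector f(y_i, y_{i+1}, X, i):
    one unary (state) indicator and one pairwise (transition) indicator."""
    return np.array([
        1.0 if X[i].istitle() and y_cur == "NAME" else 0.0,
        1.0 if y_cur == "NAME" and y_next == "NAME" else 0.0,
    ])

def global_feature_vector(y, X):
    """g(Y, X) = sum_i f(y_i, y_{i+1}, X, i), summed over positions."""
    return sum(local_features(y[i], y[i + 1], X, i) for i in range(len(y) - 1))

X = "Dr Alan Turing met Grace Hopper".split()
y = ["O", "NAME", "NAME", "O", "NAME", "NAME"]
print(global_feature_vector(y, X))
```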
Computing gradients
Consider a linear chain CRF with L2 regularization
The contribution to the gradient of each example (a reconstruction is given after this slide) is the difference between the observed occurrence of each feature and its expectation under the current prediction of the model, taken with respect to P(y_i, y_{i+1} | X) (the marginal conditionals), minus the regularization terms
For unary functions the expectation is w.r.t. P(y_i | X)
These are the single- and pairwise-variable marginal
conditional distributions, efficiently obtained with the sum-
product algorithm
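Written in the global feature vector notation, the per-example gradient the slide describes can be reconstructed as follows; this is a sketch, and attributing the full L2 penalty λθ to each example is a simplifying assumption:

```latex
% The expectation decomposes over positions into the marginals
% P(y_i | X) and P(y_i, y_{i+1} | X), which the sum-product algorithm provides.
\frac{\partial}{\partial \boldsymbol{\theta}}
  \log P(\mathbf{Y}^{(n)} \mid \mathbf{X}^{(n)}; \boldsymbol{\theta})
  \;=\;
  \mathbf{g}(\mathbf{Y}^{(n)}, \mathbf{X}^{(n)})
  \;-\;
  \mathbb{E}_{P(\mathbf{Y} \mid \mathbf{X}^{(n)};\, \boldsymbol{\theta})}
    \big[\, \mathbf{g}(\mathbf{Y}, \mathbf{X}^{(n)}) \,\big]
  \;-\; \lambda \boldsymbol{\theta}
```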
Example: CRFs for information extraction
Idea: label each word with a named entity so as to
populate a database record with extracted information
Consider the following CRF for the line of text
How about we meet for 30min in room 3195 Tues. at noon ?
Each word is tagged with one of the named entities using features computed locally and globally from the sequence; the extracted meeting record is:
Date: Tuesday, March 1, 2016
Time: 12:00pm
Location: 3195
Duration: unknown
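As a sketch of the kind of local features such a tagger might use for this sentence; the feature names, the word-shape tests, and the chosen token index are illustrative, not those of any particular system:

```python
# Toy local (unary) features for one token; a real information-extraction CRF
# would use many more, including global features of the whole sequence.
def token_features(tokens, i):
    word = tokens[i]
    return {
        "lower": word.lower(),
        "is_title": word.istitle(),
        "is_digit": word.isdigit(),
        "looks_like_time": word.lower() in {"noon", "midnight"} or word.endswith("min"),
        "prev": tokens[i - 1].lower() if i > 0 else "<start>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "<end>",
    }

tokens = "How about we meet for 30min in room 3195 Tues. at noon ?".split()
print(token_features(tokens, 8))   # features for "3195", a Location candidate
```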
Bibliographic Notes & Further Reading
Markov Models and N-grams
Some techniques for smoothing n-grams arise from applying prior distributions to the parameters of the models' conditional probabilities.
Others come from different perspectives, such as interpolation techniques, where weighted combinations of lower-order n-grams are used.
Good-Turing discounting (Good, 1953) (co-invented by Alan
Turing, one of the fathers of computing), and Witten-Bell
smoothing (Witten and Bell, 1991) are based on these
ideas.
Brants and Franz (2006) discuss the massive Google n-grams, which are available as a 24 GB compressed text file from the Linguistic Data Consortium.
Bibliographic Notes & Further Reading
Hidden Markov Models (HMMs)
HMMs have been used extensively for decades in speech recognition systems, and are widely applicable to many other problems.
Rabiner and Juang (1986) and Rabiner (1989) give a classic introduction and tutorial, respectively, on HMMs.
The human genome sequencing project stretched from around 1990 to the early 2000s (International Human Genome Sequencing Consortium, 2001; Venter et al., 2001).
It spawned a surge of activity in recognizing and modeling genes in genomes using hidden Markov models (Kulp et al., 1996; Burge and Karlin, 1997), a particularly impressive and important application of data mining.
Murphy (2002) is an excellent source for details on how
dynamic Bayesian networks extend hidden Markov models.
Bibliographic Notes & Further Reading
Conditional Random Fields and Markov Logic
Lafferty, McCallum, and Pereira (2001) invented conditional random fields
The original application was to sequence labeling problems, but they have since become widely used for many sequence processing tasks in data mining
Sutton and McCallum (2006) is an excellent source of further details
Sha and Pereira (2003) present the global feature vector view of CRFs
Our presentation synthesizes the above CRF perspectives
Kristjansson et al. (2004) examined the specific problem of extracting
information from email text
The Stanford Named Entity Recognizer is based on conditional random
fields; Finkel et al. (2005) give details of the implementation
Markov logic networks (MLNs) (Richardson and Domingos, 2006) provide a
way to create dynamically instantiated (conditional and/or traditional)
Markov random fields from programs encoded using weighted clauses in
first-order logic
MLNs have been used for collective or structured classification, link or relationship prediction, and entity and identity disambiguation, among many other tasks, as described in Domingos and Lowd's (2009) textbook
Software for Probabilistic Approaches
to Machine Learning
Matlab's statistics toolbox
Has implementations of principal component analysis and its probabilistic variant based on the methods we have discussed, Gaussian mixture models, and hidden Markov models, among many other methods.
Scikit-learn (Pedregosa et al., 2011) is a rapidly growing
Python-based set of implementations of many machine
learning methods.
There are many probabilistic and statistical methods
currently implemented in scikit-learn for classification,
regression, clustering, dimensionality reduction (including
factor analysis and probabilistic principal component
analysis), model selection and preprocessing.
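For example, a Gaussian mixture model, one of the probabilistic methods mentioned above, can be fitted with a few lines of scikit-learn; the two-cluster synthetic data here is purely illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 2-D data drawn from two Gaussian clusters (illustrative only).
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(4, 1, size=(100, 2))])

# Fit a two-component Gaussian mixture with the EM algorithm.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
print(gmm.means_)                 # estimated component means
print(gmm.predict(X[:5]))         # hard cluster assignments
print(gmm.score_samples(X[:5]))   # per-example log-likelihoods
```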
Software: PMTK, Hugin & Netica
Kevin Murphy's Matlab-based Probabilistic Modeling Toolkit (PMTK) is a large, open source collection of Matlab functions and tools.
The software contains implementations for many of
the methods we have discussed here, including code
for Bayesian network manipulation and inference
methods.
The Hugin software package from Hugin Expert A/S and the Netica software from Norsys are well-known commercial packages for manipulating Bayesian networks.
They contain excellent graphical user interfaces for
interacting with these Bayesian networks.
Software: BUGS, VIBES & Infer.net
The BUGS (Bayesian inference Using Gibbs Sampling)
project has created a variety of software packages for the
Bayesian analysis of complex statistical models using
Markov chain Monte Carlo methods
WinBUGS (Lunn et al., 2000) is a stable version of the software
OpenBUGS is an open source version of the core BUGS
implementation (Lunn et al., 2009)
The VIBES software package (Bishop, Spiegelhalter, and
Winn, 2002) allows inference in graphical models using
variational methods
Microsoft Research has created a programming language known as infer.net, which allows one to define graphical models and perform inference in them using variational methods, Gibbs sampling, or another message passing method known as expectation propagation (Minka, 2001)
John Winn and Tom Minka at Microsoft Research have been
leading the infer.net project
R, S, and similar commercial packages
The R programming language and software
environment was created for statistical computing and
visualization (Ihaka and Gentleman, 1996)
It has its origins at the University of Auckland, and
provides an open source implementation of the S
programming language from Bell Labs
It is comparable to well known commercial packages
such as SAS, SPSS, and Stata, and contains
implementations of many classical statistical methods
such as generalized linear models and other regression
techniques
Since it is a general-purpose programming language, there are many extensions and implementations of the models discussed in this chapter available online
Software for LDAb, CRFs and MLNs
The MALLET Machine Learning for Language Toolkit
(McCallum, 2002) provides excellent Java
implementations of latent Dirichlet allocation (LDAb)
and conditional random fields (CRFs)
It also provides many other statistical natural language
processing methods ranging from document
classification and clustering to topic modeling,
information extraction, and other machine learning
techniques frequently used for text processing
The open source Alchemy software package is widely
used for Markov logic networks (Richardson and
Domingos, 2006)
Weka implementations
Weka has implementations of:
Bayesian networks
BayesNet (Bayesian networks without hidden variables for
classification)
A1DE and A2DE (in the AnDE package)
Conditional probability models
ElasticNet (in the elasticNet package)
KernelLogisticRegression (in the kernelLogisticRegression
package)
LatentSemanticAnalysis (in the latentSemanticAnalysis
package)
Clustering
EM (clustering and density estimation using the EM algorithm)