
Machine Learning – RIME 832

Bayesian Learning
Dr. Hasan Sajid
Bayes Theorem
• In machine learning we use a model to describe
the process that results in the data that are
observed.
• Formally, we are often interested in determining
the best hypothesis from some space H, given
the observed training data D.
• One way to specify what we mean by the best
hypothesis is to say that we demand the most
probable hypothesis, given the data D plus any
initial knowledge about the prior probabilities of
the various hypotheses in H.
Bayes Theorem
• Bayes theorem provides a way to calculate the probability of a hypothesis, P(h|D), from its prior probability P(h), the probability of observing the data given the hypothesis, P(D|h), and the prior probability of observing the data itself, P(D).
Bayes Theorem
• Important Observations:
– P(h|D) increases with P(h) and with P(D|h)
– P(h|D) decreases as P(D) increases, because the more
probable it is that D will be observed independent of h, the
less evidence D provides in support of h.
• Putting the terms together:

  P(h|D) = P(D|h) P(h) / P(D)

  – P(h|D): posterior probability of the hypothesis given the data
  – P(D|h): likelihood estimate, the probability of observing the data given a hypothesis
  – P(h): prior / our belief
  – P(D): evidence, the prior probability that the training data D will be observed given no knowledge about which hypothesis holds; acts as a normalization constant
Learning Parameters using Bayes Formulation
• In many learning scenarios, the learner considers
some set of candidate hypotheses H and is
interested in finding the most probable
hypothesis h ∈ H given the observed data D (or at
least one of the maximally probable if there are
several)
• Any such maximally probable hypothesis is called
a maximum a posteriori (MAP) hypothesis
Learning Parameters using Bayes Formulation

  h_MAP = argmax over h ∈ H of P(h|D)
        = argmax over h ∈ H of [ P(D|h) P(h) / P(D) ]
        = argmax over h ∈ H of P(D|h) P(h)

• In the final step we dropped the term P(D) because it is a constant independent of h. Does it make any difference in terms of deciding the best hypothesis h?
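It does not: since P(D) is the same positive constant for every hypothesis, dividing by it rescales all posteriors equally and cannot change which one is largest. A minimal sketch of this, with made-up numbers (not from the slides), is:

```python
# Unnormalized scores P(D|h) * P(h) for three hypothetical hypotheses (made-up numbers).
scores = {"h1": 0.2 * 0.5, "h2": 0.6 * 0.3, "h3": 0.1 * 0.2}

# If the hypotheses are mutually exclusive and exhaustive, P(D) is just the sum of the scores.
p_data = sum(scores.values())
posteriors = {h: s / p_data for h, s in scores.items()}

# Dividing by the constant P(D) does not change which hypothesis wins.
assert max(scores, key=scores.get) == max(posteriors, key=posteriors.get)
print(max(scores, key=scores.get))  # 'h2' either way
```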
Learning Parameters using Bayes Formulation

• In some cases, we do not have any background knowledge or belief about a particular application.
• In such cases, we make each hypothesis equally probable (a uniform distribution), i.e. an equal prior for all of the hypotheses.
Learning Parameters using Bayes Formulation

• With an equal prior, MAP parameter learning reduces to maximizing P(D|h) only; the resulting hypothesis is called the maximum likelihood (ML) hypothesis:

  h_ML = argmax over h ∈ H of P(D|h)
Example
• Consider a medical diagnosis problem in which there are two alternative hypotheses {cancer, no cancer}:
  – the patient has a particular form of cancer (positive)
  – the patient has NO cancer (negative)
• Prior: over the entire population, only a fraction 0.008 of people have this disease.
– P(h=cancer) = 0.008
– P(h=not cancer) = 0.992
Example
• Likelihood Estimate (P(D|h)): Based on the
statistics we determine that the test returns a
correct positive result in only 98% of the cases in
which the disease is actually present and a correct
negative result in only 97% of the cases in which
the disease is not present.
• P(positive | Cancer) = 0.98
• P(negative | Cancer) = 0.02
• P(positive | No Cancer) = 0.03
• P(negative | No Cancer) = 0.97
Example
• A new patient comes in and his lab test comes back positive. Should we diagnose him as positive (cancer) or negative (no cancer)?
• We solve it using our MAP formulation
Example

  P(h|D) ∝ P(D|h) P(h)

  P(cancer|positive) ∝ P(positive|cancer) P(cancer) = (0.98)(0.008) = 0.0078
  P(no cancer|positive) ∝ P(positive|no cancer) P(no cancer) = (0.03)(0.992) = 0.0298

  h_MAP = No Cancer

• Normalized probabilities:
  P(cancer|positive) = 0.0078 / (0.0078 + 0.0298) = 0.21
  P(no cancer|positive) = 0.0298 / (0.0078 + 0.0298) = 0.79
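For completeness, the same calculation written as a short script (a sketch, not part of the original slides; the variable names are chosen here for illustration):

```python
# Bayes rule for the cancer-test example: compare P(D|h) * P(h) for each hypothesis.
prior = {"cancer": 0.008, "no cancer": 0.992}
likelihood_positive = {"cancer": 0.98, "no cancer": 0.03}   # P(test = positive | h)

# Unnormalized posteriors P(positive|h) * P(h).
unnormalized = {h: likelihood_positive[h] * prior[h] for h in prior}

# MAP hypothesis: the one with the largest unnormalized posterior.
h_map = max(unnormalized, key=unnormalized.get)

# Normalized posteriors: divide by the evidence P(positive).
evidence = sum(unnormalized.values())
posterior = {h: unnormalized[h] / evidence for h in unnormalized}

print(h_map)      # 'no cancer'
print(posterior)  # {'cancer': ~0.21, 'no cancer': ~0.79}
```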
Bayes Theorem Revisited
• What are the probability terms?
  P(h|D) = P(D|h) P(h) / P(D)

  – P(h|D): posterior probability
  – P(D|h): likelihood estimate, the probability of observing the data given a hypothesis
  – P(h): prior / our belief
  – P(D): evidence, the prior probability that the training data D will be observed given no knowledge about which hypothesis holds; acts as a normalization constant
Bayes Theorem Revisited
• Let's connect Bayes with machine learning:
  – The data D are the training examples.
  – A hypothesis is our target function, such as a linear, polynomial, logistic, or SVM function.
Maximum Likelihood Estimation
• MLE is a method that determines values for the
parameters of a model. The parameter values are
found such that they maximize the likelihood that
the process described by the model produced
the data that were actually observed.
• Example: MLE for the parameters of a linear model for a house price prediction problem,

  y = m·x + c

  where x is a feature such as the area of the house, y is the price of the house, and m and c are the parameters.
Maximum Likelihood Estimation
• Different values of the parameters m and c result in different lines (hypotheses).

  [Figure: three linear models with different parameter values]
Maximum Likelihood Estimation
• Let's suppose we have 10 samples (data points) of some process, as shown in the figure below. [Figure: 10 observed data points]
Maximum Likelihood Estimation
• Based on our knowledge, we first decide what type of model this data could have come from: uniform, Gaussian, exponential, …?
• Visual inspection of the figure above suggests that a Gaussian distribution is plausible, because most of the 10 points are clustered in the middle with a few points scattered to the left and the right.
Maximum Likelihood Estimation
• Recall that the Gaussian distribution has 2 parameters.
– The mean, μ, and
– the standard deviation, σ.
• Different values of these parameters result in different curves (hypotheses).
Maximum Likelihood Estimation
• Goal: We want to know which curve was most likely responsible for creating the data points that we observed.
• Solution: MLE will find the values of μ and σ that result
in the curve that best fits the data.
Maximum Likelihood Estimation
• With an intuitive understanding of MLE in place, we will now learn how to work out the most likely parameter values. The values resulting from this process are called maximum likelihood estimates.
• To keep the mathematics simple, let's consider an example with just 3 data points: 9, 9.5 and 11.
• How do we calculate the maximum likelihood estimates of the parameter values of the Gaussian distribution, μ and σ?
• The quantity we want to maximize is the total probability of observing all of the data, i.e. the joint probability of all observed data points.
Maximum Likelihood Estimation
• Assumption: each data point is generated independently of the others. This simplifies the joint probability estimation.
• The probability density of observing a single data point x that is generated from a Gaussian distribution is given by:

  P(x; μ, σ) = (1 / (σ √(2π))) · exp( −(x − μ)² / (2σ²) )
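A quick check (a sketch, not from the slides; it assumes NumPy and SciPy are available) that this formula matches a standard library implementation of the Gaussian density:

```python
import numpy as np
from scipy.stats import norm

def gaussian_pdf(x, mu, sigma):
    # Density of a single point under a Gaussian with mean mu and standard deviation sigma,
    # exactly as in the formula above.
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

print(gaussian_pdf(9.5, 10.0, 2.0))        # computed from the formula
print(norm.pdf(9.5, loc=10.0, scale=2.0))  # SciPy gives the same value
```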
Maximum Likelihood Estimation
• Hence, by the independence assumption, the joint probability of observing all three points is:

  P(9, 9.5, 11; μ, σ) = P(9; μ, σ) · P(9.5; μ, σ) · P(11; μ, σ)

  where each factor is the Gaussian density above.
• We just have to figure out the values of μ and σ that give the maximum value of this expression; those values are our maximum likelihood estimates.
• To do this we use our old friend, differentiation, to find the maxima of the function.
Maximum Likelihood Estimation
• We take the derivative of the function and set it equal to zero to find the MLE estimates.
• However, differentiating a product of Gaussian densities directly is messy. To ease the mathematics, we maximize the log-likelihood instead of the likelihood.
Log-Likelihood
• The logarithm is a monotonically increasing function: if the value on the x-axis increases, the value on the y-axis also increases.
• This is important because it ensures that the maximum of the log of the probability occurs at the same point as the maximum of the original probability function.
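A quick numerical check of this claim (not from the slides; the values below are arbitrary positive numbers):

```python
import numpy as np

likelihoods = np.array([0.02, 0.15, 0.40, 0.31, 0.12])   # any positive values

# Because log is monotonically increasing, the position of the maximum is unchanged.
assert np.argmax(likelihoods) == np.argmax(np.log(likelihoods))
```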
Log-Likelihood
• By contrast, a non-monotonic function would not guarantee this. [Figure: a non-monotonic function]
Maximum Likelihood Estimation
• Taking the log of the joint density gives:

  ln P(9, 9.5, 11; μ, σ) = ln(1/(σ√(2π))) − (9 − μ)²/(2σ²)
                         + ln(1/(σ√(2π))) − (9.5 − μ)²/(2σ²)
                         + ln(1/(σ√(2π))) − (11 − μ)²/(2σ²)

• With further simplification:

  ln P(9, 9.5, 11; μ, σ) = −3 ln(σ) − (3/2) ln(2π) − (1/(2σ²)) [ (9 − μ)² + (9.5 − μ)² + (11 − μ)² ]
Maximum Likelihood Estimation
• This expression can be differentiated to find the maximum. In this example we'll find the MLE of the mean, μ. To do this we take the partial derivative of the function with respect to μ, giving:

  ∂/∂μ ln P(9, 9.5, 11; μ, σ) = (1/σ²) [ 9 + 9.5 + 11 − 3μ ]

• Setting the left hand side of the equation to zero and then rearranging for μ gives:

  μ = (9 + 9.5 + 11) / 3 = 9.833
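As a sanity check (not part of the slides), the result can be verified numerically by evaluating the log-likelihood on a grid of candidate μ values, with σ held fixed at an arbitrary positive value:

```python
import numpy as np

data = np.array([9.0, 9.5, 11.0])
sigma = 1.0   # any fixed positive value; the maximizing mu does not depend on it

def log_likelihood(mu, sigma, x):
    # Sum of log Gaussian densities, matching the simplified expression derived above.
    return np.sum(-np.log(sigma) - 0.5 * np.log(2 * np.pi)
                  - (x - mu) ** 2 / (2 * sigma ** 2))

mus = np.linspace(8, 12, 4001)   # candidate values of mu in steps of 0.001
best_mu = mus[np.argmax([log_likelihood(m, sigma, data) for m in mus])]
print(best_mu, data.mean())      # both are approximately 9.833
```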
Maximum Likelihood Estimation
• We can do the same thing for σ, which is your homework, to be submitted in the next class.
Maximum Likelihood Estimation
• Question: Can maximum likelihood estimation
always be solved in an exact manner like we just
did?
• Answer: No. It’s more likely that in a real world
scenario the derivative of the log-likelihood
function is still analytically intractable (i.e. it’s way
too hard/impossible to differentiate the function
by hand). Therefore, iterative methods like
Expectation-Maximization algorithms are used to
find numerical solutions for the parameter
estimates. The overall idea is still the same though.
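To illustrate the numerical route (a sketch, not from the slides; it assumes SciPy is available), the same three-point example can be handed to a general-purpose optimizer by minimizing the negative log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize

data = np.array([9.0, 9.5, 11.0])

def negative_log_likelihood(params, x):
    mu, sigma = params
    if sigma <= 0:               # keep the optimizer inside the valid parameter region
        return np.inf
    return np.sum(np.log(sigma) + 0.5 * np.log(2 * np.pi)
                  + (x - mu) ** 2 / (2 * sigma ** 2))

# Nelder-Mead is a derivative-free iterative method; the starting point is arbitrary.
result = minimize(negative_log_likelihood, x0=[5.0, 2.0], args=(data,), method="Nelder-Mead")
mu_hat, sigma_hat = result.x
print(mu_hat, sigma_hat)         # mu_hat is approximately 9.833; sigma_hat is the MLE of sigma
```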
Maximum Likelihood Estimation
• Question: Why "maximum likelihood" and not "maximum probability"?
• Answer: The two expressions are equal:

  L(μ, σ; data) = P(data; μ, σ)

• The equation above says that the probability density of the data given the parameters is equal to the likelihood of the parameters given the data.
• P(data; μ, σ) means “the probability density of observing the
data with model parameters μ and σ”.
• L(μ, σ; data) means “the likelihood of the parameters μ and
σ taking certain values given that we’ve observed a bunch of
data.”
Maximum Likelihood Estimation
• Question: When is MLE the same as least squares minimization?
• Least squares minimization is another common
method for estimating parameter values for a
model in machine learning. It turns out that when
the model is assumed to be Gaussian as in the
examples above, the MLE estimates are
equivalent to the least squares method.
Maximum Likelihood Estimation
• For least squares parameter estimation we want
to find the line that minimizes the total squared
distance between the data points and the
regression line
Maximum Likelihood Estimation
• In MLE we want to maximize the total
probability of the data.
• When a Gaussian distribution is assumed, the
maximum probability is found when the data
points get closer to the mean value. Since the
Gaussian distribution is symmetric, this is
equivalent to minimizing the distance
between the data points and the mean value.
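A small sketch of this equivalence (not from the slides; the data below are randomly generated for illustration, and SciPy is assumed): for a linear model with Gaussian noise, minimizing the sum of squared residuals and maximizing the log-likelihood give essentially the same slope and intercept.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.size)   # synthetic data from a noisy line

def sum_of_squares(params):
    m, c = params
    return np.sum((y - (m * x + c)) ** 2)

def negative_log_likelihood(params):
    m, c, sigma = params
    if sigma <= 0:
        return np.inf
    residuals = y - (m * x + c)
    return np.sum(np.log(sigma) + 0.5 * np.log(2 * np.pi) + residuals ** 2 / (2 * sigma ** 2))

ls = minimize(sum_of_squares, x0=[0.0, 0.0], method="Nelder-Mead")
mle = minimize(negative_log_likelihood, x0=[0.0, 0.0, 1.0], method="Nelder-Mead",
               options={"maxiter": 5000, "xatol": 1e-8, "fatol": 1e-8})

print(ls.x)        # least-squares estimates of m and c
print(mle.x[:2])   # maximum likelihood estimates of m and c: essentially the same line
```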
Prior Probability
• P(h) denotes the Prior probability (the initial
probability that hypothesis h holds, before we have
observed the training data)
• Prior reflects our belief and may reflect any
background knowledge we have about the chance
that h is a correct hypothesis
• If we have no such prior knowledge, then we might
simply assign the same prior probability to each
candidate hypothesis
• Example: Coin toss — if, say, the candidate hypotheses are {biased towards heads, fair, biased towards tails} and we know nothing about the coin, we could assign each of them the same prior probability of 1/3.
