
BITS Pilani
Machine Learning (IS ZC464)
Session 1: Introduction

What is learning?
• Learning (for humans) is gaining experience from the past.
• A machine can be programmed to gather experience in the form of facts, instances, rules etc.
• A machine with learning capability can predict about a new situation (seen or unseen) using its past experience.
• Examples:
  • As we humans can tell a person's name on seeing him/her a second or fifth time, a machine can also do that.
  • As we humans can recognize a person's voice even without seeing the person's face, a machine can also be made to learn to do the same.

Class Experiment: Training
• Let
  • AA denote 5
  • BB denote 6
  • AAA denote 50
  • BBB denote 60
  • AAAA denote 500
  • BBBB denote 600
• Can you find out the equivalent numerical value of AAAAA? 5000: yes/no?
• Or of AABB? Not yet trained…

Learning pronunciation (by a young kid)
• Training
  • Cat (ae sound)
  • Pot (aw sound)
  • Pat (ae sound)
  • Tap (ae sound)
  • Cot (aw sound)
• Testing
  • How do you pronounce "not"? My students know the answer.
  • How do you pronounce "check"? The kid is not trained yet, hence learning has not reached this level.

Learning example
• Training
  A coin is tossed 10 times and it is observed that it fell 7 times with head on top and 3 times with tail on top.
  [observe that you are learning as you read the above]
• Testing
  Will you get head next? (Hypothesis: get the head on top)
  Yes, most probably.
  What is the chance that the next toss will be head? (Hypothesis: next toss is head)
  P(next toss is head | previous 10 tosses had 7 heads)

Relate human learning with that of machine learning
• Human
  Gains experience from day to day activities and gains the ability to predict.
• Machine
  Gets trained with data (data can be text, image, sound, rules etc.) and becomes able to predict.
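A minimal sketch (not part of the slides) of the relative-frequency estimate the coin example points at; the count of 7 heads in 10 tosses is taken from the slide.

```python
# Minimal sketch: estimating P(head) from the observed tosses by relative
# frequency, which is the maximum-likelihood estimate for a Bernoulli coin.
heads, tails = 7, 3                    # training data from the slide
p_head = heads / (heads + tails)       # 7 / 10 = 0.7
print(f"P(next toss is head | 7 heads in 10 tosses) ~ {p_head:.2f}")
```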


Why Machine Learning?
• Humans have limitations in terms of accessibility and computational efficiency.
• Machine learning is required in
  – Navigation on Mars
  – Avalanche areas, to detect people buried under the snow
  – Speech recognition etc.
• Machine learning is not required in
  – General computations such as payroll
  – Computation of the sum of numbers
  – Counting etc.

Machine Learning and Artificial Intelligence
• Machine Learning is a branch of Artificial Intelligence (AI) in which the intelligent system learns from its environment.
• AI systems include intelligence of different types such as reasoning, planning, search and game playing, learning etc., of which learning is specific to Machine Learning systems.

What is Artificial Intelligence?
• It is the computational intelligence of computers that enables them to behave and act human-like.
• An artificially intelligent system possesses one or more of the human capabilities of reasoning, thinking, planning, learning, understanding, listening and responding.

Common attributes of the human mind
• Perception/Vision/Recognition,
• Reason,
• Imagination,
• Memory,
• Emotion,
• Attention, and
• A capacity for communication

Human brain

Understanding the human brain
• Thought is a mental activity which allows human beings to make sense of things in the world, and to represent and interpret them in ways that are significant.
• Thinking involves the symbolic or semantic mediation of ideas or data, as when we form concepts, engage in problem solving, reason and make decisions.


Understanding the human brain
• Memory is the ability to preserve, retain, and subsequently recall knowledge, information or experience.
• Imagination is the activity of generating or evoking novel situations, images, ideas etc. in the mind.

Artificial Intelligence: an intelligent car navigation system [an example]
• A system to navigate a car to the airport works on its vision, enabled using a camera mounted at the front of the car.
• The system sees the lane limits and the vehicles on the way and keeps the car from colliding. [Vision]
• It follows the road directions.
• It also follows the road rules.
• The system learns to handle unforeseen situations. For example, if the traffic flow is restricted on a portion of the road temporarily, the system takes an alternative path. [Learning]

More intelligence can be expected
• The system listens to the person sitting in the car asking to stop at a nearby hotel for tea, looks around to find a hotel, keeps travelling till it finds one and stops the car. [Speech Recognition, Vision]
• Understands the mood of the person and starts music to suit that mood. [Facial Expression]
• Can answer queries such as "how far is Pilani?", "what is the time?", "can I sleep for an hour?", "please wake me up at … in the morning". [Natural Language Processing]

Some of the existing intelligent systems
• Watson: a question answering machine
• Deep Blue: a chess program that defeated the world chess champion Garry Kasparov

Deep Blue: Chess Program
[Image of the Deep Blue match. Source: Google Images]

Other intelligent systems
• Smart home
  – Lights switch off if there is no one in the room
  – Curtains pulled open at sunrise
  – The dust bin is emptied before it overflows
  – Smart water taps, toilets etc.
• Smart office
  – Automatic meeting summary
  – Speaker recognition and summary generation
• Automatic answering machine
Other intelligent machines
• An airplane cockpit can have an intelligent system that takes automatic control when hijacked [context and speech understanding, NLP, vision]
• Medical diagnosis systems trained with expert guidance can diagnose a patient's disease based on X-ray and MRI images and other symptoms
• Automated theorem proving
• General problem solver

AI Techniques
• The general problem of simulating (or creating) intelligence has been broken down into a number of specific sub-problems
  – Reasoning and deduction
  – Knowledge Representation
  – Planning
  – Learning
  – Natural Language Processing
  – Motion
  – Perception

Intelligent Agent
• An intelligent agent is a system that perceives its environment and takes actions which maximize its chances of success.
• Artificial Intelligence aims to build intelligent agents or entities.

Intelligent agent
• An agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators.
• Human agent vs. machine agent
  • Differ in sensor technology
    • Ear, nose, eye, touch, smell (human)
    • Speaker, camera, infrared sensors, smoke sensors, etc. (machine)
  • Differ in their capacity to perceive the environment
  • Differ in acting upon the environment through actuators

Environment
• The parameters that are required for reasoning, thinking, perception and so on
• Example (for humans)
  • A one year old child's environment: home, family members, toys
  • A school-age child's environment: home, family members, school, teachers, books, playmates
• Example (for machines)
  • A washing machine intelligent agent's environment: dirt, clothes, detergent etc.
  • An intelligent automobile robot: parts of the automobile and their exact description

How does an intelligent agent work in a given environment?
• It perceives the environment.
• It acts based on experience and the query.
• It responds in terms of adding to the knowledge base.
• Thus it must learn from the history of percepts.


Machine Learning Applications
• Speech recognition
• Automatic news summary
• Spam email detection
• Credit card fraud detection
• Face recognition
• Function approximation
• Stock market prediction and analysis
• Etc.

Machine Learning
• A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. (Tom Mitchell)

Learning From Observations
• Learning Element:
  – responsible for making improvements
• Performance Element:
  – responsible for selecting external actions
• The learning element uses feedback from the critic on how the agent is doing and determines how the performance element should be modified to do better in the future.

Design of a Learning Element
• Affected by three major issues:
  – Which components of the performance element are to be learned
  – What feedback is available to learn these components
  – What representation is used for the components.

Types of feedback for learning
• Supervised
  – Inputs and outputs
• Unsupervised
  – Inputs available, but no specific output
• Reinforced
  – Reward or penalty

Learning Algorithms
• Decision Trees
• Neural Networks based learning algorithms
• Ensemble Learning
• Bayes classifier
• EM (Expectation Maximization) algorithm
• Support Vector Machines etc.


Inductive Learning using Decision Trees: an example of learning to identify an object

  fruit? = yes → color: red → apple, yellow → mango
  fruit? = no  → vegetable? = yes → taste: bitter → bittergourd, sour → lemon
                 vegetable? = no  → unknown

Decision Tree
• A decision tree takes as input an object or situation described by a set of attributes and returns a decision.
• This decision is the predicted output value for the input.
• The input attributes can be discrete or continuous.
• Classification learning:
  • Learning a discrete valued function is called classification learning.
• Regression:
  • Learning a continuous function is called regression.
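A hedged sketch of the fruit/vegetable tree above written as nested tests in Python; the attribute names (is_fruit, color, is_vegetable, taste) are illustrative, only the structure mirrors the slide's tree.

```python
# Sketch: the slide's decision tree as nested tests (attribute names assumed).
def classify(is_fruit, color=None, is_vegetable=None, taste=None):
    if is_fruit:                       # root test: fruit?
        if color == "red":
            return "apple"
        if color == "yellow":
            return "mango"
        return "unknown"
    if is_vegetable:                   # non-fruit branch: vegetable?
        if taste == "bitter":
            return "bittergourd"
        if taste == "sour":
            return "lemon"
        return "unknown"
    return "unknown"

print(classify(True, color="yellow"))                      # -> mango
print(classify(False, is_vegetable=True, taste="sour"))    # -> lemon
```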

Decision Tree
• A decision tree reaches its decision by performing a sequence of tests.
• All non-leaf nodes lead to partial decisions and assist in moving towards a leaf node.
• Leaf nodes are the decisions based on the properties satisfied at the non-leaf nodes on the path from the root node.

Decision tree
• Leaf nodes depict the decision about an object whose attribute values fall on the path from the root node.
• Each example that participates in the construction of the decision tree is called training data, and the complete set of training data is called the training set.

Limitations of Decision Tree Learning
• The tree memorizes the observations but does not extract any pattern from the examples.
• This limits the capability of the learning algorithm in that the observations do not extrapolate to examples it has not seen.

Attribute Creation/Selection in various problem domains (recognition)
[Image from Google Images]
• Obtain the most suitable features/attributes
  • Color
  • Shape
  • Number of wheels
  • Capacity
  • Rear mirrors
  • Number of headlights
• Availability of information
  • Images
  • Actual data
  • Attributes will differ
Fruit recognition
• T: fruit recognition
• P: recognition accuracy
• E: experience by training
• Attributes
  • Color
  • Texture
  • But not shape

Face Recognition
• First specify the problem clearly: do you want to discriminate amongst the faces shown below, or to put them all in one category?
• Training examples of a person, and test images
[Face images from AT&T Laboratories, Cambridge UK: http://www.uk.research.att.com/facedatabase.html]

Human face recognition
• The training set can be a set of face images with varying expressions, illumination, pose etc.
• T: face recognition
• P: recognition accuracy / rejection accuracy
• E: experience by training
• Humans are very quick in recognizing the face of a person.
• Analyze your brain's capacity for remembering the number of features of a person's face.
• An intelligent system will be said to have a capability of learning (human-like) if it recognizes unseen data.

Selection of attributes
• Number of eyes  ✗
• Hair?
• Spectacles
• Nose line
• Chin shape
• Number of ears
• Wrinkles
• Male?
• Ratio of lip length and eye length
• What else?
• Mathematical features
  • DCT coefficients
  • Pixel values
  • Average pixel intensity

Learning of a function from given sample data / Generalization in Function Approximation
• T: prediction of the y-value for a given x-value
• P: least error
• E: experience by training
• Candidate hypotheses:
  1. Straight line
  2. Sinusoidal curve
  3. Other higher order polynomial
• Generalization: the NN generalizes if it answers "What is f(-0.25)?" or "f(0.001)?" correctly.
• Sample data (points on the curve Y = ±√(1 − X²)):

  X     Y
  1     0
  0     1
  0    -1
  0.6   0.8
  0.6  -0.8
 -0.6   0.8
 -0.6  -0.8
 -1     0

[Plot of the sample points in the X-Y plane]


Generalization in Classification Problem
• Generalization: the classifier generalizes if a test feature vector can be correctly classified.
[Plot: labelled training points in the X-Y plane with a test vector to be classified]

Traditional vs. Machine Learning
• Traditional approach: Input Data + Program → Output
• Machine learning: Input Data + Output → Program

How is a program realized as an output?
• The program is characterized by its parameters.
• For example:
  – A neural network classifier is represented by its weights
  – Weights are obtained by analyzing input and output data
  – A decision tree is characterized by its attributes, obtained by training on input and output classes

Neural Networks
• Mathematical models representing massively parallel machines
• Models inspired by the working of the human nervous system
• Have a number of neurons performing a task similar to a human neuron
• Each neuron triggers on the received input according to the weight.
• A neural network captures the environment it has to learn in terms of the weights.

A Neuron
• A mathematical neuron is a processing unit capable of receiving inputs from single or multiple neurons and triggering a desired response.
• Each neuron has an associated activation function which takes as input the weighted sum of the inputs coming to the neuron and triggers a response depending on the associated threshold.
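A minimal sketch (assumed, not the course's code) of the mathematical neuron just described: a weighted sum of the inputs passed through a threshold activation.

```python
# Sketch of a threshold neuron: fire (1) if the weighted sum of the inputs
# crosses the threshold, else stay silent (0).
def neuron(inputs, weights, threshold):
    s = sum(x * w for x, w in zip(inputs, weights))   # weighted sum
    return 1 if s >= threshold else 0

print(neuron([1.0, 0.5], [0.8, 0.4], threshold=1.0))  # -> 1
print(neuron([0.2, 0.1], [0.8, 0.4], threshold=1.0))  # -> 0
```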
BITS Pilani
Machine Learning (IS ZC464)
Session 2: Training and Testing in Learning Systems
Learning of a function from given sample data: Prediction
• Recall learning: a machine with learning capability can predict about a new situation (seen or unseen) using its past experience.
• Training data (x, y) pairs:

  X   Y
  1   1
  5   5
  2   2
  4   4
  3   3

• Prediction: a straight line
  • Given values of x and y, predict the value of y for x = 71.
• Prediction is based on learning the relationship between x and y.
• Training data is the collection of (x, y) pairs.
• Testing data is simply the value of x for which the value of y is required to be predicted.

What did the system learn?
• y = x
• What is its generalization ability?
• Most accurate, or we can say 100%.
• What if the data used to train the system changes slightly? The machine can still be made to learn.

Learning of a function from given sample data: straight line learning
• Y = f(x)
• Which line fits the data best? The machine must learn on its own which is the best fit.
• A line is represented by two parameters: slope and intercept.
• How? Using the data, i.e. the (x, y) pairs, known as training data.
[Plot of the (x, y) points with several candidate straight lines]

Understanding ERROR
• Consider an example of using height and weight:

  Height (in cm)   Weight (in kg)
  145              48
  165              68
  155              62
  160              65
  170              75
  163              67
  171              76
  167              72
  159              65

[Scatter plot of weight (50 to 80 kg) against height (145 to 170 cm)]
Understanding ERROR
• Which line (hypothesis) fits the given data best?
[A sequence of scatter plots of the height-weight data, each with a different candidate straight line drawn through the points]
In 2D space the line parameters are two
• Slope and intercept
• They can be called w1 and w2.
• In order to find the line that best fits the given data, we must find w1 and w2 such that the sum of the squared errors is minimum.

A simple example to understand ERROR
• Which line (hypothesis) fits the given data best?
[Scatter plot of a small set of points with candidate lines]
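A sketch, assuming NumPy is available, of fitting the best straight line w1·x + w2 to the height-weight table above by minimising the sum of squared errors; np.polyfit solves exactly this least-squares problem.

```python
# Sketch: least-squares straight-line fit to the height-weight data above.
import numpy as np

height = np.array([145, 165, 155, 160, 170, 163, 171, 167, 159])  # cm
weight = np.array([48, 68, 62, 65, 75, 67, 76, 72, 65])           # kg

w1, w2 = np.polyfit(height, weight, deg=1)   # slope and intercept
pred = w1 * height + w2
sse = np.sum((weight - pred) ** 2)           # sum of squared errors

print(f"slope w1 = {w1:.3f}, intercept w2 = {w2:.3f}, SSE = {sse:.2f}")
print(f"predicted weight at height 158 cm: {w1 * 158 + w2:.1f} kg")
```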

Compute the Squared Mean Error (line is y = x)
• Sum of squares (S) = 1·1 + 0 + 1·1 + 1·1 + 2·2 + 1·1 + 2·2 + 1·1 = 13
• SME = sqrt(S) / total no. of observations = √13 / 8 ≈ 0.45
• The error will be different if the line's slope is different (the line passes through the origin).
[Plot of the data points and the line y = x]

Plotting error when y = f(x)
• Therefore we can say that the error is a function of the slope. If the slope is represented by w, then the error is a function of w: E(w).
• Hypothesis function (linear in one variable): h_w(x) = wx
• At some value of w, E(w) is minimum; that w corresponds to the minimum error.
[Plot of E(w) against w, with the w corresponding to the minimum error marked]

Understanding the error surface
• Consider m observations <x1, y1>, <x2, y2>, …, <xm, ym>.
• A hypothesis h_w(x) approximates the function that fits best to the given values of y.
• There is likely to be some error corresponding to each observation i.
• The magnitude of this error is yi − h_w(xi).
• The objective is to find the w that minimizes the sum of squared errors:
  E_min(w) = min_w Σ_i (yi − h_w(xi))²

Plotting error when y = f(x1, x2)
• Hypothesis function (linear in two variables): h_w(x) = w1·x1 + w2·x2
• NOTE: each pair <w1, w2> corresponds to a line given by the hypothesis equation, while only one such pair corresponds to the line that best approximates the given training data.
[Surface plot of E(w) over the (w1, w2) plane, with the <w1, w2> corresponding to the minimum error marked]
Another possible surface: plotting error when y = f(x1, x2)
• Hypothesis function (linear in two variables): h_w(x) = w1·x1 + w2·x2
• The error surface may have local minima in addition to the global minimum.
[Surface plot of E(w) over (w1, w2) showing a local minimum and the global minimum]

Difficult to visualize when y = f(x1, x2, x3)
• Hypothesis function (linear in three variables): h_w(x) = w1·x1 + w2·x2 + w3·x3
• E(w) is now a function of <w1, w2, w3> and cannot be plotted directly.
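An illustrative sketch (the data here is made up) of searching the error surface E(w) numerically with gradient descent for the one-parameter hypothesis h_w(x) = w·x.

```python
# Sketch: gradient descent on E(w) = sum_i (y_i - w*x_i)^2 for h_w(x) = w*x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1]         # roughly y = x with noise (illustrative)

w, lr = 0.0, 0.005                      # initial weight, learning rate
for _ in range(200):
    # dE/dw = -2 * sum_i x_i * (y_i - w*x_i)
    grad = -2 * sum(x * (y - w * x) for x, y in zip(xs, ys))
    w -= lr * grad                      # move downhill on E(w)

error = sum((y - w * x) ** 2 for x, y in zip(xs, ys))
print(f"w at (near) minimum error: {w:.3f}, E(w) = {error:.3f}")
```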

Learning of a function from given sample data: polynomial curve learning / Generalization in Function Approximation
• T: prediction of the y-value for a given x-value
• P: least error
• E: experience by training
• Candidate hypotheses: straight line, sinusoidal curve, other higher order polynomial
• Generalization: the NN generalizes if it answers "What is f(-0.25)?" or "f(0.001)?" correctly.
• Sample data (points on the curve Y = ±√(1 − X²)):

  X     Y
  1     0
  0     1
  0    -1
  0.6   0.8
  0.6  -0.8
 -0.6   0.8
 -0.6  -0.8
 -1     0

Generalization in Classification Problem
• Generalization: the classifier generalizes if a test feature vector can be correctly classified.
[Plot: labelled training points in the X-Y plane with a test vector to be classified]

Traditional vs. Machine Learning
• Traditional approach: Input Data + Program → Output
• Machine learning: Input Data + Output → Program


BITS Pilani
Machine Learning (IS ZC464)
Session 3: Uncertainty Handling in the Real World using Probability Theory

Uncertainty in the real world
• Uncertainty in reaching New Delhi airport in 5 hours from Pilani
  – The cab engine may or may not work at any moment
  – The route is diverted due to a procession on the way
  – The road condition is unexpectedly bad
  – A tire needs replacement
  Etc.
• A person having stomach ache can be told that he is suffering from an ulcer, while in actuality it may be gastritis or overeating.

Example
If a person has pneumonia, then the person
  has fever
  is pale
  has cough
  has a low white blood cell count
• Certainty exists in obtaining the symptoms if the disease is confirmed:
  Disease → symptoms
• For the converse, it is uncertain that if a person has fever and has cough then the person has pneumonia; but if all symptoms are known then the disease can be inferred:
  Fever(p) ∧ Pale(p) ∧ Cough(p) ∧ LowWBC(p) → Pneumonia(p)

Types of uncertainty
• Disease → symptoms
  pneumonia may have other symptoms too
• Symptoms → disease
  these symptoms may be common to other diseases as well
  [but if all possible symptoms can be observed and are the same for all patients, then more definiteness can be introduced]

Real world scenario
• It is impossible to list all relevant components of the real world.
• Many of the components behave with some uncertainty.
• Due to system hardware limitations, representing all components of any real world situation may not be possible.

Class assignment
• Analyze the weather on a day
  – It is cloudy (how much?)
    • Depends on individual belief
    • Belief can be based on experience
    • Experience may count on favorable situations
  – The day is humid
    • Is the humidity sufficient to cause rain?
  – Is it certain that the clouds will rain?
    • The clouds may rain if certain other parameters are favourable.


Conventional reasoning
• Based on three assumptions
  – Predicate descriptions must be sufficient with respect to the application domain
  – The information base is consistent
  – Through the inference rules, the known information grows monotonically
• Conventional methods follow closed world assumptions.

Closed World Assumptions
– The closed world assumptions are based on a minimal model of the world.
– Any predicate not known to hold is taken to be false.
  Example: whether two cities are connected by a plane flight.
  • Check the list; if there is no direct flight, then we may infer that the cities are not connected.
– Exactly those predicates that are necessary for a solution are created.
– The closed world assumption affects the semantics of negation in reasoning.

Example: conventional reasoning
• Human(p) → Mammal(p) ∧ Intelligent(p) ∧ Kind(p) ∧ Legs(p) ∧ Eyes(p) ∧ …
• Mammal(John) ∧ Legs(John) ∧ Kind(John) ∧ Eyes(John)
• What can be said about John's intelligence?
  • Is John intelligent?
  • Is he not?
  • Does lack of knowledge mean that we are not sure whether John is intelligent, or that we are sure John is not intelligent?

Uncertainty in First Order Logic
• ∀x Bird(x) → Fly(x)
• A penguin is a bird. Does it fly?
• The above rule does not hold good for all birds (minimal world assumption).
• How can we generalize the rule? There can be a large number of predicates constructed to represent a larger world.
• ∀x (Bird(x) ∧ ¬Abnormal(x) → Fly(x))
• The uncertainty lies in the predicate Abnormal.

Conventional reasoning
• Conventional logic is monotonic.
• A set of predicates constitutes the knowledge base (KB).
• The size of the KB keeps increasing as new knowledge is added.
• Pure methods of reasoning cannot handle a KB with incomplete or uncertain knowledge.

Nonmonotonic reasoning systems
• Address the problem of changing beliefs.
• Make the most reasonable assumptions in light of uncertain information.


Handling uncertain information using probability theory
• Probability theory deals with degrees of belief.
• It assigns a numerical degree of belief between 0 and 1.
• It handles the uncertainty that comes from laziness and ignorance.
• The belief could be derived from
  – Statistical data
  – General rules
  – Combination of evidence sources

Example:
• ∀p Symptom(p, toothache) → Disease(p, cavity)
• The above can, for example, be said to carry the belief that 8 out of 10 patients have a cavity when they have toothache.
• The probability associated with the above is 0.8.
• The belief may change if some more patients arrive with pain and have different diseases.
• A probability of 0.8 does not mean that the statement is 80% true, but that there is an 80% degree of belief in it.
• Degree of belief is different from degree of truth.

Evidences
• The probability that a patient has a cavity being 0.8 depends on the agent's belief and not on the world.
• These beliefs depend on the percepts the agent has received so far.
• These percepts constitute the evidence on which probability assertions are based.
• As new evidence is added, the probability changes. This is known as conditional probability.

Representing uncertain knowledge using probability
• Probability theory uses a language that is more expressive than propositional logic.
• The basic element of the language is the random variable.
• A random variable represents a part of the real world whose status is initially unknown.
• A proposition asserts that a random variable has a particular value drawn from its domain.

Types of random variables
• Boolean
  – domain is {true, false}
  – Example: Cavity = true
• Discrete
  – domain is a finite or countable set of values
  – Example: from the domain {sunny, cloudy, rainy, snow} the variable may take weather = snow
• Continuous
  – domain takes values from the real numbers

Atomic Events
• An atomic event is the complete specification of the state of the real world about which the agent is uncertain.
• Example:
  – Let the boolean random variables cavity and toothache constitute the real world; then there are 4 atomic events
    i.   (Cavity = true) ∧ (Toothache = true)
    ii.  (Cavity = true) ∧ (Toothache = false)
    iii. (Cavity = false) ∧ (Toothache = true)
    iv.  (Cavity = false) ∧ (Toothache = false)
Atomic events
• Mutually exclusive
• The set of all possible atomic events is exhaustive (their disjunction is true)
• Any proposition is logically equivalent to the disjunction of all atomic events that entail the truth of the proposition.

Prior probability
• The prior probability associated with a proposition is the degree of belief in the absence of any other information.
• Example
  – P(cavity = true) = 0.1
  – P(cavity) = 0.1 [this is estimated based on the available information]
  – As more information becomes available, the concept of conditional probability will be used to determine the probability value.

Computing Probability
• A bag contains 8 balls of which 6 are orange and 2 are green.
• A ball is chosen randomly from the bag.
• What is the probability that the ball is green? Answer = 2/8
• What is the probability that the ball is orange? Answer = 6/8

Probability Theory
• Apples and oranges kept in two bags of different colors.

Computing Probability
• If a ball is to be chosen randomly from a bag, and a bag is chosen randomly, then how likely is it that the red bag is selected? Computed through experiments and multiple trials, or known a priori.
• What is the probability that the ball selected from the red bag is green?
• What is the probability that the ball selected from the blue bag is orange?

Examples
1. Rolling a die: outcomes
   S = {1, 2, 3, 4, 5, 6}
   E = the event that an even number is rolled
     = {2, 4, 6}
Joint Probability
• This finds out how likely it is for two or more events to happen at the same time.
• Example
  – A patient has both a cavity and toothache.
  – The joint probability is represented as P(cavity ∧ toothache) or P(cavity, toothache).

Prior Probability Distribution
• Assume a discrete variable weather
  – P(weather = sunny) = 0.4
  – P(weather = rainy) = 0.1
  – P(weather = cloudy) = 0.1
  – P(weather = snow) = 0.2
• The distribution is
  – P(weather) = {0.4, 0.1, 0.1, 0.2}

Joint probability distribution
• P(weather, cavity) has 4×2 (= 8) atomic events
• P(cavity, toothache, weather) has 2×2×4 (= 16) atomic events
• Any probabilistic query can be answered using the joint probability distribution.

Conditional Probability
• The intelligent agent may get new information about the random variables that make up the domain.
• The probabilities are then recomputed.
• Example
  – A bag/urn has 12 red balls and 8 blue balls.
  – On the first trial, the probability of getting a red ball = 12/20.
  – On the second trial (after a red ball has been drawn), the probability of getting a red ball = 11/19.

Axioms of Probability (Kolmogorov's Axioms)
• For any proposition a
  – 0 <= P(a) <= 1
• True propositions have probability 1 and false propositions have probability 0
  – P(true) = 1, P(false) = 0
• P(a ∨ b) = P(a) + P(b) − P(a ∧ b)

Conditional Probability
• Represented as P(a|b)
• P(a|b) = P(a ∧ b) / P(b) for P(b) > 0
• Also
  P(a ∧ b) = P(a|b) P(b)   (Product Rule)


P(a) = 1 – P(a) Proposition
• Proof • The probability of a proposition is equal to the
a Λ a = false sum of the probabilities of the atomic events
a V a = true in which it holds.
– P(a) =  P(ei) over all atomic events

Using the third axiom of probability


P(a V a) = P(a) + P(a) – P(a Λ a)

==> P(true) = P(a) + P(a) – P(false)


==> P(a) = 1 – P(a)
January 21, 2018 IS ZC464 31 January 21, 2018 IS ZC464 32

Marginal Probability
• P(Y) = Σz P(Y, z)   (sum over all joint probabilities of Y with z)   [Marginalization Rule]
• P(Y) is the distribution over Y obtained by summing out all the other variables from any joint distribution containing Y.
• Example:
  – P(cavity) = P(cavity, toothache) + P(cavity, ¬toothache) = 0.25 + 0.15 = 0.4

Inference using Full Joint Distributions
• The joint distribution constitutes the complete knowledge base.
• Example
  – Let there be 2 boolean random variables representing the real world, say cavity and toothache:

              toothache   ¬toothache
   Cavity       0.25        0.15
   ¬Cavity      0.10        0.50

  P(cavity) = 0.25 + 0.15 = 0.4
  P(toothache) = 0.25 + 0.10 = 0.35

Conditioning
• P(Y) = Σz P(Y, z) = Σz P(Y|z) P(z)   (using the product rule)
• Marginalization and conditioning are useful rules for handling probability expressions.

Computing conditional probabilities (only 2 random variables)
• P(Cavity | Toothache)
  = P(cavity ∧ toothache) / P(toothache)
  = 0.25 / 0.35 = 0.7142
• P(¬Cavity | toothache)
  = P(¬Cavity ∧ Toothache) / P(toothache)
  = 0.10 / 0.35 = 0.2857
Normalization Constant
• The normalization constant ensures that the conditional probabilities of the events add up to 1.
• Example
  – P(cavity | toothache) + P(¬cavity | toothache) = 0.7142 + 0.2857 = 0.9999 ≈ 1
• Let α denote the normalization constant.
  – Then the conditional probability
    P(a|b) = P(a ∧ b) / P(b) for P(b) > 0
    becomes
    P(a|b) = α P(a ∧ b)

Inference using Full Joint Distributions
– Let there be 3 boolean random variables representing the real world, say cavity, toothache and catch.
– We may still represent the joint probabilities as a table, shown below, but if we have more random variables, we simply use the propositions and their probabilities.

More random variables
• Probability expressions
  • P(cavity, toothache, catch) = 0.06
  • P(cavity, toothache, ¬catch) = 0.19
  • P(cavity, ¬toothache, catch) = 0.05
  • P(cavity, ¬toothache, ¬catch) = 0.10
  • P(¬cavity, toothache, catch) = 0.09
  • P(¬cavity, toothache, ¬catch) = 0.01
  • P(¬cavity, ¬toothache, catch) = 0.22
  • P(¬cavity, ¬toothache, ¬catch) = 0.28
• The same information as a table:

              toothache            ¬toothache
              catch    ¬catch      catch    ¬catch
   Cavity     0.06     0.19        0.05     0.10
   ¬Cavity    0.09     0.01        0.22     0.28

• Exercise: compute P(cavity), P(cavity, toothache), P(toothache).
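A sketch of using the full joint distribution table above as the knowledge base in Python; marginalisation is just summing the atomic events in which a proposition holds.

```python
# Sketch: the joint distribution over (cavity, toothache, catch) as a dict,
# with values taken from the table above.
joint = {
    (True,  True,  True):  0.06, (True,  True,  False): 0.19,
    (True,  False, True):  0.05, (True,  False, False): 0.10,
    (False, True,  True):  0.09, (False, True,  False): 0.01,
    (False, False, True):  0.22, (False, False, False): 0.28,
}

def prob(pred):
    """Sum the atomic events in which the predicate holds (marginalisation)."""
    return sum(p for event, p in joint.items() if pred(*event))

p_cavity = prob(lambda cav, tooth, cat: cav)                   # 0.40
p_cav_and_tooth = prob(lambda cav, tooth, cat: cav and tooth)  # 0.25
p_tooth = prob(lambda cav, tooth, cat: tooth)                  # 0.35

print(p_cavity, p_cav_and_tooth, p_tooth)
print("P(cavity | toothache) =", p_cav_and_tooth / p_tooth)    # ~0.714
```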

Advantage of the normalization constant
• It can help in generalizing the inference procedure.
• P(X | e) = α P(X, e) = α (P(X, e, y1) + P(X, e, y2))
• Example:
  P(cavity | toothache) = P(cavity, toothache) / P(toothache)
                        = (P(cavity, toothache, catch) + P(cavity, toothache, ¬catch)) / P(toothache)

Probabilistic queries using the joint probability distribution
• These queries are answered using the joint probability distribution.
• The joint probability distribution is the knowledge base for inference about an uncertain real world.
• With n random variables, the size of the table becomes 2^n.
• Time to answer a query = O(2^n).
• When n is large, the method becomes almost impractical to work with.


Independent events
• Two events are statistically independent if the occurrence of one does not affect the other.
• P(A|B) = P(A)
• P(B|A) = P(B)
• P(A ∧ B) = P(A) P(B)
• Example
  – P(cavity | weather = sunny) = P(cavity)
  – P(weather = rainy | toothache, catch) = P(weather = rainy)

Bayes' Rule
• According to the product rule
  – P(A|B) = P(A ∧ B) / P(B)
  – P(B|A) = P(B ∧ A) / P(A)
  – Using commutativity of conjunction
    P(A|B) P(B) = P(B|A) P(A)
    ==> P(A|B) = P(B|A) P(A) / P(B)

Example
• 90% of students pass an examination
• 75% of students who study hard pass the exam
• 60% of students study hard
• Let S: event that a student passes the exam; H: the student studies hard
• P(S|H) = 0.75, P(S) = 0.9, P(H) = 0.6
• P(H|S) = ??
• Solution: use Bayes' theorem
  P(H|S) = P(S|H) P(H) / P(S) = 0.75 × 0.6 / 0.9 = 0.5

Review
• Axioms of probability
  • 0 <= P(a) <= 1
  • P(true) = 1, P(false) = 0
  • P(a ∨ b) = P(a) + P(b) − P(a ∧ b)
• Syntax of the language representing uncertainty
  • P(proposition)
  • P(proposition) = Σ P(ei) over all atomic events ei in which it holds
    • where the proposition is a conjunction of literals representing random variables.
    • The random variables represent the real world parameters and capture the uncertainty.
  • Example: P(cavity ∧ (weather = rainy))
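A one-line check of the exam example using Bayes' theorem, with the numbers taken from the slide.

```python
# Sketch: P(H|S) = P(S|H) * P(H) / P(S)
def bayes(p_b_given_a, p_a, p_b):
    return p_b_given_a * p_a / p_b

p_S_given_H, p_H, p_S = 0.75, 0.6, 0.9
print("P(H|S) =", bayes(p_S_given_H, p_H, p_S))   # 0.5
```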

toothache toothache
catch catch
Review Review
Cavity 0.06
Catch
0.19
Catch
0.05 0.10
Cavity 0.09 0.01 0.22 0.28
 Product Rule
 P(cavity Λ toothache) = P(cavity, toothache) P(toothache)
• Normalization Constant
 Marginal Probability
– P(a|b) = P(a Λ b) / P(b) =  P(a Λ b)
 P(cavity) =  P(ei)
over all atomic events containing cavity – Example
• P(cavity | toothache) =
 Marginalization P(cavity,toothache)/P(toothache)
 Conditioning = (0.06+0.19) /
 P(X) =  P(X,z) (0.06+0.19+0.09+0.01)
=  P(X|z) P(z) = 0.25/ 0.35 = 0.7142
Example: P(cavity) = P(cavity|toothache) P(toothache) Normalization constant is 1/0.35
+ P(cavity|catch) P(catch)

January 21, 2018 IS ZC464 47 January 21, 2018 IS ZC464 48


toothache toothache
catch catch
Class Cavity
Catch
0.06 0.19
Catch
0.05 0.10
Review
Assignment Cavity 0.09 0.01 0.22 0.28
 Bayes Rule:
1. What is the probability that a patient who has P(A|B) = P(B |A) P(A) / P(B)
toothache has a cavity in his/ her teeth? requires
2. What is the probability that the patient has a cavity? 1 conditional probability
3. What is the probability 2 unconditional probability
P(cavity|catchΛ toothache)

Solution (Problem 2)
Use Product Rule:
P(cavity|catchΛ toothache)
= P(cavity ΛcatchΛ toothache)/P(catchΛ toothache)
= 0.1/(0.1+0.28) = 0.1/ 0.38
January 21, 2018 IS ZC464 49 January 21, 2018 IS ZC464 50

Bayes' Theorem Example
• Suppose that Bob can decide to go to work by one of three modes of transportation: car, bus, or commuter train. Because of high traffic, if he decides to go by car, there is a 50% chance he will be late. If he goes by bus, which has special reserved lanes but is sometimes overcrowded, the probability of being late is only 20%. The commuter train is almost never late, with a probability of only 1%, but is more expensive than the bus.

Example contd.
• Suppose that Bob is late one day, and his boss wishes to estimate the probability that he drove to work that day by car. Since he does not know which mode of transportation Bob usually uses, he gives a prior probability of 1/3 to each of the three possibilities. What is the boss's estimate of the probability that Bob drove to work?
Example courtesy:
http://www.medicine.mcgill.ca/epidemiology/joseph/courses/EPIB-607/BayesEx.pdf
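A hedged worked version of the Bob example: the priors of 1/3 and the lateness probabilities come from the slide; the normalisation step is ordinary Bayes' rule.

```python
# Sketch: P(mode | late) = P(late | mode) * P(mode) / P(late)
priors = {"car": 1/3, "bus": 1/3, "train": 1/3}     # boss's prior guesses
p_late = {"car": 0.50, "bus": 0.20, "train": 0.01}  # P(late | mode), from the slide

# P(late) by total probability, then the posterior for each mode
p_late_total = sum(priors[m] * p_late[m] for m in priors)
posterior = {m: priors[m] * p_late[m] / p_late_total for m in priors}

print(posterior)    # car dominates: 0.5/0.71, roughly 0.70
```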

Solution
• Apply Bayes' rule with a prior of 1/3 for each mode, as in the sketch above.
Example courtesy:
http://www.medicine.mcgill.ca/epidemiology/joseph/courses/EPIB-607/BayesEx.pdf

Home Work
• A box contains 10 red and 15 blue balls. Two balls are selected at random and are discarded without their colors being seen. If a third ball is drawn randomly and observed to be red, what is the probability that both of the discarded balls were blue?

  P(BB | R) = P(R | BB) P(BB) / [ P(R | BB) P(BB) + P(R | BR) P(BR) + P(R | RR) P(RR) ]
Bayesian Network
• A suitable data structure implementing a mechanism to represent the dependent and independent relationships among the real world variables.
• Captures the uncertain knowledge in a natural and efficient way.
• A set of random variables makes up the nodes of the network.
• It also consists of links between nodes exploiting the dependence of one variable on another.

Bayesian network
• Example network: cloud → rain and humid → rain (cloud and humid are the parents of rain).

Example 2
• Nodes: Nutritious food (N), 8 hrs sleep (SL), Healthy lifestyle (H), Study hard (SH), Attend lectures (A), Good performance (GP).
• Priors: P(N) = 0.7, P(SL) = 0.3, P(SH) = 0.4, P(A) = 0.6
• H depends on N and SL; GP depends on SH, A and H.
• Associated Conditional Probability Tables (CPTs):

  N   SL   P(H)
  T   T    0.95
  T   F    0.78
  F   T    0.6
  F   F    0.001

  SH  A   H   P(GP)
  T   T   T   0.99
  T   T   F   0.45
  T   F   T   0.60
  T   F   F   0.30
  F   T   T   0.85
  F   T   F   0.45
  F   F   T   0.05
  F   F   F   0.00001

example
• Weather is an independent variable, while cavity affects both toothache and catch.
  (Network: weather stands alone; cavity → toothache, cavity → catch.)

Example 2
• N: nutritious food
• SL: 8 hours sleep
• H: healthy lifestyle
• SH: study hard
• A: attends lectures
• P(N, SL, H, SH, A)
  = P(N) P(SL) P(H | N ∧ SL) P(SH) P(A)
  = (0.3) × (0.3) × (0.6) × (0.4) × (0.4)
  = 0.00864
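A sketch, using the numbers printed on the slide, of the chain-rule factorisation of the joint probability in this network; which truth assignment of the five variables these numbers correspond to is not spelled out on the slide, so the values are used as given.

```python
# Sketch: joint probability via the Bayesian-network factorisation
# P(N, SL, H, SH, A) = P(N) * P(SL) * P(H | N, SL) * P(SH) * P(A)
p_N, p_SL, p_SH, p_A = 0.3, 0.3, 0.4, 0.4   # node probabilities used on the slide
p_H_given_parents = 0.6                      # CPT entry used on the slide

joint = p_N * p_SL * p_H_given_parents * p_SH * p_A
print(joint)   # 0.00864, as computed on the slide
```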


Bayesian Networks
• A representation for uncertain knowledge.
• These networks provide a concise way to represent conditional probabilities.
• A Bayesian network is represented as a graph.
• A Bayesian network is more efficient than the joint distribution tables.

Graphs
• A graph is a data structure to hold relevant information in memory efficiently.
• Each node is allocated memory dynamically (on the heap).
• Nodes are connected by edges.
• A directed edge depicts the parent-child relationship amongst nodes.
• The node pointed to by the arrow (edge) is the child of the parent node from which the edge comes.
• A graph with |V| vertices and |E| edges is traversed in O(|V| + |E|) time.

Conditional Probability Tables
• Conditional Probability Tables (CPTs) capture the strength of the conditional dependency among the variables of the environment.
• Each row represents an atomic event (corresponding to the set of variables causing an effect on the child variable).
• Each row must sum to 1 (P(False) = 1 − P(True)).
• The size of the CPT corresponding to a node is 2^k, where k is the number of parent variables.
• The CPTs are able to generate any conditional probability.

Bayesian Network
• Consider the domain as consisting of n variables, say X1, X2, X3, …, Xn.
• There are 2^n atomic events.
• The joint probability for an instance of these n variables is given by
  P(x1, x2, x3, …, xn) = ∏i P(xi | Parents(Xi))

Revisit Example 2: additional links if needed / all nodes are connected
• The same network can be drawn with additional links when further dependencies are needed; in the limit, all nodes are connected.
• With the link N → A added, the CPT for A becomes:

  N   P(A)
  T   0.95
  F   0.60

• The CPT for H given its parents N and SL, and the CPT for GP given SH, A and H, remain as in Example 2 above.

What is making the previous BN bad?
• Links capture dependency.
• More CPTs need to be constructed, which means more memory is needed.
• More statistical information is needed, which is not practical to collect.
• Worst case time to access each CPT (corresponding to the nodes) = n·2^n
• Recall the total time to process a probabilistic query using the JPDT (joint probability distribution table) = 2^n

Example
• If an environment is represented using 10 variables, the nodes having at most 3 parent variables, then the units of time to process a query
  = 10 × 8 = 80
  while the JPDT, if used as the KB, takes 1024 units of time.

Where is the human intelligence required?
• Humans understand the environment and can list out the variables that should represent the environment.
• The dependency of one variable on another should be listed out.
• The dependency and the effect of the parent variable, in terms of conditional probabilities, must be computed.
• In deciding the root cause(s).

How to construct a BN
• Start by constructing the nodes corresponding to the root causes.
• Then add the nodes which are affected by the previously generated nodes.
• Example: model uncertainty in the environment to find out what the status of global warming will be next year due to ongoing developmental activities.
  • Parameters affecting temperature on earth
    – Deforestation, pollution, reduction of green house gases
  • Root causes
    – Housing and fuel needs, factories, communication, and so on
[Construction of the Bayesian network, step by step: first the nodes deforestation, pollution and green house gases; then their root causes vehicles, fuel, factories and house are added as parents.]
• Notice that there is no cycle in the graph.
• All leaf nodes have parents, while the root causes do not have any parent node.
• If there is a need to model a more refined environment, the variables affecting the root causes should also be considered.

Review: Probability Theory
• Marginal Probability
• Joint Probability
• Conditional Probability

BITS Pilani
Machine Learning (IS ZC464)
Session 4: Bayes' Theorem and its applications in Machine Learning, MAP hypothesis, Information Theory and its application in the Minimum Description Length (MDL) principle
Bayesian learning: Bayes' theorem
• Bayes' theorem provides a way to calculate the probability of a hypothesis based on its prior probability, the probability of observing various data given the hypothesis, and the observed data itself.

Example 1: observation of sounds
• Training with observed data: {d1, d2, d3} = training data (say D)
  d1: "cat" sounds with 'ae'
  d2: "pot" sounds with 'aw'
  d3: "mat" sounds with 'ae'
• Sounds 'ae' and 'aw' are the observed targets that we know.
• Features such as 'a' and 'o' are obtained through preprocessing of the given words, by parsing.
• Prior probabilities
  P(sound = 'ae') = 0.5
  P(sound = 'aw') = 0.5
• Conditional probabilities are represented as
  P('ae' | feature = 'a') and P('aw' | feature = 'o')
  or
  P('ae' | d1, d2, d3) = 2/3 and P('aw' | d1, d2, d3) = 1/3
  i.e. P('ae' | D) = 2/3, P('aw' | D) = 1/3

Bayesian learning would enable answers to queries such as:
• Unknown words used for testing: "sat" and "not"
• Preprocessing gives
  Feature for word "sat" = 'a'
  Feature for word "not" = 'o'
• What is the likelihood that word "sat" sounds with 'ae'?  P("sat" | 'ae') = ?
• What is the likelihood that word "not" sounds with 'ae'?  P("not" | 'ae') = ?
• What is the likelihood that word "sat" sounds with 'aw'?  P("sat" | 'aw') = ?
• What is the likelihood that word "not" sounds with 'aw'?  P("not" | 'aw') = ?

Hypothesis
• In learning algorithms, the term hypothesis is used in contexts such as
  • Concept learning or classification: a class label or category
  • Function approximation: a curve, a line or a polynomial
  • Decision making: a decision tree
• Plural of hypothesis: hypotheses (multiple labels, multiple curves, multiple decision trees)
• Best hypothesis (always preferred): the most appropriate class, the best fit curve, the smallest decision tree

Observations of sounds example: continued
• There are two hypotheses in the given example
  Hypothesis h1: 'ae'
  Hypothesis h2: 'aw'
• Learning requires us to find the best hypothesis from the space of the two hypotheses h1 and h2 for a new observation.

Likelihood or probability?
• What is the likelihood that word "sat" sounds with 'ae'?
  P("sat" | 'ae'), which can equivalently be written as P(feature = 'a' | 'ae'), given the training data.
• Reverse: what is the likelihood that the 'ae' sound will represent a word of type "sat"?
  P('ae' | feature = 'a') = ?


Computation of P('ae' | feature = 'a') = ?
• Let us represent the above probabilistic query as a conditional probability using the following events
  A: sound is 'ae'
  B: feature is 'a'
• We want to compute P(A|B) (read as the probability of A given B) when P(B|A), P(B) and P(A) are available.
• Bayes' theorem provides a way to compute such probabilities.

Terminology for Bayes' theorem
• Prior Probability: the probability P(h) denotes the initial probability that hypothesis h holds before we have observed the training data.
  [Examples: P('aw'), P('ae'), P(feature = 'a'), P(feature = 'o') etc., based on some background knowledge]

Terminology for Bayes' theorem
• Posterior Probability: the probability P(h|D) denotes the probability that the hypothesis h holds given the observed training data D.
  [First recall the example of a sequence of coin tosses and the probability that changes as we keep observing D. Then, in the current example, consider P(feature = 'a' | 'ae'), and visualize the uncertainty if "talk" and "none" are also used for training and sound different for feature = 'a' and feature = 'o' respectively.]

Bayesian Learning
• Bayesian learning is a probabilistic approach to inference.
• Optimal decisions can be made by reasoning about these probabilities together with the observed data.
• Each observed training observation can incrementally decrease or increase the estimated probability that a hypothesis is correct.

Bayesian Learning
• Prior knowledge can be combined with observed data to determine the final probability of a hypothesis.
• Bayesian methods can accommodate hypotheses that make probabilistic predictions.
• New instances can be classified by combining the predictions of multiple hypotheses, weighted by their probabilities.

Example
  Observation   Word    Feature   Target hypothesis (h)
  d1            put     u         oo
  d2            pat     a         ae
  d3            none    o         a~
  d4            mat     a         ae
  d5            cut     u         a~
  d6            not     o         aw
  d7            nut     u         a~
  d8            talk    a         aw
  d9            pot     o         aw
  d10           sat     a         ae


4 hypotheses and three features (for the observations D in the table above)
• Prior probabilities of the hypotheses
  P(oo) = 0.1, P(ae) = 0.5, P(a~) = 0.1, P(aw) = 0.3
• Prior probabilities of the features
  P(u) = 0.2, P(a) = 0.5, P(o) = 0.3
• Conditional probabilities of the hypotheses given D
  P(oo | D) = 0.1, P(ae | D) = 0.3, P(a~ | D) = 0.3, P(aw | D) = 0.3
• Conditional probabilities of the features given the hypotheses
  P(u | oo) = 0.1, P(u | a~) = 0.2, P(a | ae) = 0.3, P(a | aw) = 0.1, P(o | a~) = 0.1, P(o | aw) = 0.2

Bayesian Learning
• Training: through the computation of the probabilities as in the previous two slides.
• Testing: of unknown words
  – Example testing: which sound does the word "cat" make?
    Preprocess "cat" to get feature 'a' and compute P(h | 'a'), where P(h | D) is known, D being the set of 10 observations used to train the system and h the hypothesis.
    Compute the probabilities P(ae | a), P(oo | a), P(a~ | a) and P(aw | a) to obtain the likelihood of the sound of "cat".

Posterior Probabilities
• P(ae | a) = P(a | ae) × P(ae) / P(a) = 0.3 × 0.5 / 0.5 = 0.3   (maximum)
• P(oo | a) = P(a | oo) × P(oo) / P(a) = 0.1 × 0.1 / 0.5 = 0.02
• P(a~ | a) = P(a | a~) × P(a~) / P(a) = 0 × 0.1 / 0.5 = 0
• P(aw | a) = P(a | aw) × P(aw) / P(a) = 0.1 × 0.3 / 0.5 = 0.06
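A sketch of the MAP computation above for the test feature 'a', using the priors and conditionals listed on the computation slide.

```python
# Sketch: posterior P(h | a) = P(a | h) * P(h) / P(a), then pick the argmax.
p_h = {"oo": 0.1, "ae": 0.5, "a~": 0.1, "aw": 0.3}          # priors P(h)
p_a_given_h = {"oo": 0.1, "ae": 0.3, "a~": 0.0, "aw": 0.1}  # P(feature='a' | h)
p_a = 0.5                                                    # P(feature='a')

posterior = {h: p_a_given_h[h] * p_h[h] / p_a for h in p_h}
print(posterior)                                 # ae: 0.3, oo: 0.02, a~: 0.0, aw: 0.06
print("MAP hypothesis:", max(posterior, key=posterior.get))  # 'ae'
```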

Maximum a Posteriori (MAP) hypothesis
• Consider a set of hypotheses H and the observed data used for training, D.
• Define
  h_MAP = argmax_{h ∈ H} P(h | D)
• The maximally probable hypothesis is called a maximum a posteriori (MAP) hypothesis.

MAP hypothesis
  h_MAP = argmax_{h ∈ H} P(h | D)
        = argmax_{h ∈ H} P(D | h) P(h) / P(D)    (using Bayes' theorem)
        = argmax_{h ∈ H} P(D | h) P(h)           (dropping P(D), as it is constant)


Equally probable hypotheses a priori
• If P(hi) = P(hj) for all hi and hj in H, then in finding the MAP hypothesis we can ignore the term P(h) in
  h_MAP = argmax_{h ∈ H} P(D | h) P(h)
  and get
  h_MAP = argmax_{h ∈ H} P(D | h)

Maximum Likelihood hypothesis
• P(D | h) is called the likelihood of the data given h, and any hypothesis that maximizes P(D | h) is called a Maximum Likelihood (ML) hypothesis:
  h_ML = argmax_{h ∈ H} P(D | h)

Home Work
• Read and solve the example given in section 6.2.1 of Mitchell's book.
• Question on Bayes' theorem:
  A doctor knows that the disease meningitis causes the patient to have a stiff neck, say, 50% of the time. The doctor also knows some unconditional facts: the prior probability that a patient has meningitis is 1/50,000, and the prior probability that any patient has a stiff neck is 1/20. What is the probability that a patient with a stiff neck has meningitis? (Verify your answer with 0.0002)

Minimum Description Length Principle
• This is the information theoretic approach to computing the MAP hypothesis.
• The MAP hypothesis is computed as the shortest-length hypothesis in the domain of encoding data.
• A problem consisting of transmitting random messages needs an encoding of messages.
• Messages arrive at random (uncertainty about the messages exists).
• Each message i is considered to be arriving with probability pi.

Minimum Description Length Principle
• We need to find the encoding scheme using the minimum number of bits.
• A fixed length coding scheme does not work well, as the less probable messages get encodings using the same number of bits.
• Example messages
  a1, a2, a3, a4 : 4 symbols
  Code them as 00, 01, 10, 11 using a 2-bit (costly) representation.
  Transmit the code sequence 1101100110001110.
  The client at the other end can decode, using the same encoding scheme, as a4a2a3a2a3a1a4a3.

Information Theory
• Information theory studies the quantification, storage, and communication of information.
• It was originally proposed by Claude E. Shannon in 1948 to find fundamental limits on signal processing and communication operations such as data compression.
• A key measure in information theory is "entropy".
• Entropy quantifies the amount of uncertainty involved in the value of a random variable or the outcome of a random process.
Reference: https://en.wikipedia.org/wiki/Information_theory
Entropy
• Based on the probability of each source symbol to be communicated, the Shannon entropy H, in units of bits (per symbol), is given by
  Entropy = − Σi pi log2(pi)
• where pi is the probability of occurrence of the i-th possible value of the source symbol.

Encoding
• Use fewer bits to represent frequent symbols.
• Use more bits to represent less frequent symbols.

  Symbol   Probability (pi)   Code (unoptimized)   Code (optimized)
  a1       0.4                00                   1
  a2       0.25               01                   010
  a3       0.3                10                   00
  a4       0.05               11                   011

Computation of entropy
• Entropy of the given data
  Entropy = − Σi pi log2(pi)
          = − 0.4·log2(0.4) − 0.25·log2(0.25) − 0.3·log2(0.3) − 0.05·log2(0.05)
          = 0.52877 + 0.5 + 0.521089 + 0.216096
          = 1.76 bits (information content)

Average number of bits
• Variable length encoding
  = 1×0.4 + 3×0.25 + 2×0.3 + 3×0.05 = 1.9 bits
• Fixed length encoding = 2 bits

  Symbol   Probability (pi)   Code (unoptimized)   Code (optimized)
  a1       0.4                00                   1
  a2       0.25               01                   010
  a3       0.3                10                   00
  a4       0.05               11                   011
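A sketch reproducing the two numbers above: the Shannon entropy of the four symbols and the average length of the optimised variable-length code.

```python
# Sketch: entropy and average code length for the symbol table above.
import math

probs = {"a1": 0.40, "a2": 0.25, "a3": 0.30, "a4": 0.05}
codes = {"a1": "1",  "a2": "010", "a3": "00", "a4": "011"}

entropy = -sum(p * math.log2(p) for p in probs.values())
avg_len = sum(probs[s] * len(codes[s]) for s in probs)

print(f"entropy = {entropy:.2f} bits")               # about 1.77 (slide rounds to 1.76)
print(f"average code length = {avg_len:.2f} bits")   # 1.90 bits
```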

MAP definition in logarithmic terms
  h_MAP = argmax_{h ∈ H} P(D | h) P(h)
        = argmax_{h ∈ H} [ log2 P(D | h) + log2 P(h) ]
        = argmin_{h ∈ H} [ − log2 P(D | h) − log2 P(h) ]

BITS Pilani
Machine Learning (IS ZC464)
Session 5: Bayes' Optimal Classifier, Gibbs Algorithm, Naïve Bayes' Classifier, Problem Solving on Bayesian Learning


Review
• Prior and posterior probabilities
• Conditional probabilities
• Bayes' theorem
• MAP hypothesis
• Minimum Description Length Principle
• Entropy
Slides adapted from:
https://fenix.tecnico.ulisboa.pt/downloadFile/3779571251548/bayes2.ppt
www.cs.bu.edu/fac/gkollios/ada01/LectNotes/Bayesian.ppt
www.doc.ic.ac.uk/~yg/course/ida2002/ida-2002-5.ppt
https://cse.sc.edu/~rose/587/PPT/NaiveBayes.ppt

Basic Approach
• Bayes Rule:
  P(h | D) = P(D | h) P(h) / P(D)
• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(h|D) = probability of h given D (posterior probability)
• P(D|h) = probability of D given h (likelihood of D given h)
• The goal of Bayesian learning: the most probable hypothesis given the training data (Maximum A Posteriori hypothesis h_map)
  h_map = argmax_{h ∈ H} P(h | D)
        = argmax_{h ∈ H} P(D | h) P(h) / P(D)
        = argmax_{h ∈ H} P(D | h) P(h)

MAP Learner
• For each hypothesis h in H, calculate the posterior probability
  P(h | D) = P(D | h) P(h) / P(D)
• Output the hypothesis h_map with the highest posterior probability
  h_map = argmax_{h ∈ H} P(h | D)
• Computationally intensive
• Provides a standard for judging the performance of learning algorithms
• Choosing P(h) reflects our prior knowledge about the learning task
• P(D|h) defines the probability of D given its hypothesis (class) and is used for training.

Classification tasks (Concept learning)
• Examples
  • Spam classification: given an email, predict whether it is spam or not
  • Medical diagnosis: given a list of symptoms, predict whether a patient has disease X or not
  • Weather: based on temperature, humidity, etc., predict whether it will rain tomorrow
• What is a classification task in machine learning?
  • Given training data and its class label C, features representing each training sample (say X = <x1, …, xk>) are extracted. This represents P(X|C).
  • An unseen (or seen) sample is required to be classified. Let this be represented by its features X = <x1, …, xk>.
  • The classification task requires the machine learning system to obtain the class label to which the sample might belong.
• How is probability used in classification?
  • To classify we compute P(C|X)

Bayesian classification
• The classification problem may be formalized using a posteriori probabilities:
• P(C|X) = probability that the sample tuple X = <x1, …, xk> is of class C.
• Idea: assign to sample X the class label C such that P(C|X) is maximal.


Examples

• Sound classification of words
 hi: ith sound class
 D: vowel attributes e.g. 'o', 'a' etc.
 P(h|Dtestdata) = MAXi P(hi|Dtestdata)
• Face recognition
 hi: ith Person's name
 D: attributes such as shapes of eyes, jaw line, nose etc., or mathematical features such as wavelets, Discrete Cosine Transform (DCT) etc.

Examples

• Diagnosis of a disease
 hi: ith Disease's name (Multi-class classification problem)
 hi: Disease diagnosis positive/negative (Binary classification problem)
 D: Symptoms
• Spam mail detection
 hi: spam positive/negative (Binary classification problem)
 D: features that describe a spam email (example words: dollar, free, pounds, million, etc.)
• Speaker recognition
 hi: ith Person's name
 D: Voice tone etc.

February 3, 2018 IS ZC464 8 February 3, 2018 IS ZC464 9

Techniques for classification based on posterior probabilities

• Maximum a Posteriori (MAP)
 – Maximum likelihood
 – Only one hypothesis contributes
• Bayes Optimal Classifier
 – Weighted Majority Classifier
 – All hypotheses contribute
 – Costly classifier
• Gibbs Algorithm
 – Any one randomly picked hypothesis
 – Less costly
• Naïve Bayes Classifier
 – uses the assumption that the attributes are conditionally independent

Bayes optimal Classifier: A weighted majority classifier

• What is the most probable classification of the new instance given the training data?
 – The most probable classification of the new instance is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities
• If the classification of the new example can take any value vj from some set V, then the probability P(vj|D) that the correct classification for the new instance is vj is just:

P(vj|D) = Σ_{hi ∈ H} P(vj|hi) P(hi|D)

February 3, 2018 IS ZC464 10 February 3, 2018 IS ZC464 11

Example

h1: travel by scooter, h2: travel by car, h3: travel by bus
v1: long journey (+), v2: city travel (-)
P(+|h1): probability of travelling on a long journey given travel by scooter.

• Let D = <f1, f2, f3> be the training data where the fi's are the attributes (features such as availability of the vehicle, money, road conditions etc.)
• Given observed conditional probabilities
 P(h1|D) = 0.4, P(h2|D) = 0.3, P(h3|D) = 0.3
• Given observed conditional probabilities
 P(+|h1) = 0.01, P(+|h2) = 0.9, P(+|h3) = 0.6
• Also given observed conditional probabilities
 P(-|h1) = 0.7, P(-|h2) = 0.5, P(-|h3) = 0.2

Example Calculation

• Weighted sum of P(+|D)
 = P(+|h1)P(h1|D) + P(+|h2)P(h2|D) + P(+|h3)P(h3|D)
 = 0.01*0.4 + 0.9*0.3 + 0.6*0.3
 = 0.004 + 0.27 + 0.18
 = 0.454
• Weighted sum of P(-|D)
 = P(-|h1)P(h1|D) + P(-|h2)P(h2|D) + P(-|h3)P(h3|D)
 = 0.7*0.4 + 0.5*0.3 + 0.2*0.3
 = 0.28 + 0.15 + 0.06
 = 0.49

Given: P(h1|D)=0.4, P(h2|D)=0.3, P(h3|D)=0.3; P(+|h1)=0.01, P(+|h2)=0.9, P(+|h3)=0.6; P(-|h1)=0.7, P(-|h2)=0.5, P(-|h3)=0.2
h1: travel by scooter, h2: travel by car, h3: travel by bus; v1: long journey (+), v2: city travel (-)
P(+|h1): probability of travelling on a long journey given travel by scooter.
February 3, 2018 IS ZC464 12 February 3, 2018 IS ZC464 13
Bayes Optimal Classifier

max_{vj ∈ V} Σ_{hi ∈ H} P(vj|hi) P(hi|D)

• Compute the maximum of the two probabilities P(+|D) and P(-|D), which is 0.49
• This means it is more likely that a person travels in a city given its data

Naïve Bayes Learner

Assume target function f: X -> V, where each instance x is described by attributes <a1, a2, …, an>. The most probable value of f is:

v = max_{vj ∈ V} P(vj | a1, a2, ..., an)
  = max_{vj ∈ V} P(a1, a2, ..., an | vj) P(vj) / P(a1, a2, ..., an)
  = max_{vj ∈ V} P(a1, a2, ..., an | vj) P(vj)

Naïve Bayes assumption:

P(a1, a2, ..., an | vj) = Π_i P(ai | vj)   (attributes are conditionally independent)

February 3, 2018 IS ZC464 14 February 3, 2018 IS ZC464 15
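As a small illustration (not part of the original slides), the following Python sketch reproduces the weighted-majority computation from the travel example above, using the probabilities given on those slides.

# Bayes optimal classification for the travel example (values from the slides).
# Hypotheses: h1 = scooter, h2 = car, h3 = bus; classes: '+' = long journey, '-' = city travel.
posterior_h = {"h1": 0.4, "h2": 0.3, "h3": 0.3}          # P(h_i | D)
p_class_given_h = {
    "+": {"h1": 0.01, "h2": 0.9, "h3": 0.6},             # P(+ | h_i)
    "-": {"h1": 0.7,  "h2": 0.5, "h3": 0.2},             # P(- | h_i)
}

# Weighted sum over hypotheses: P(v | D) = sum_i P(v | h_i) * P(h_i | D)
scores = {
    v: sum(p_class_given_h[v][h] * posterior_h[h] for h in posterior_h)
    for v in p_class_given_h
}
print(scores)                      # {'+': 0.454, '-': 0.49}
print(max(scores, key=scores.get)) # '-' : city travel is the most probable classification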

Naïve Bayes Classifier (I)

• A simplified assumption: attributes are conditionally independent:

P(Cj | D) = P(Cj) · Π_{i=1}^{n} P(di | Cj)

• Greatly reduces the computation cost: only count the class distribution.

Play-tennis example: estimating P(xi|C)

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N

P(p) = 9/14, P(n) = 5/14

outlook:     P(sunny|p) = 2/9     P(sunny|n) = 3/5
             P(overcast|p) = 4/9  P(overcast|n) = 0
             P(rain|p) = 3/9      P(rain|n) = 2/5
temperature: P(hot|p) = 2/9       P(hot|n) = 2/5
             P(mild|p) = 4/9      P(mild|n) = 2/5
             P(cool|p) = 3/9      P(cool|n) = 1/5
humidity:    P(high|p) = 3/9      P(high|n) = 4/5
             P(normal|p) = 6/9    P(normal|n) = 1/5
windy:       P(true|p) = 3/9      P(true|n) = 3/5
             P(false|p) = 6/9     P(false|n) = 2/5
February 3, 2018 IS ZC464 16 February 3, 2018 IS ZC464 17

Naive Bayesian Classifier (II)

• Given a training set, we can compute the probabilities

Outlook      P    N        Humidity   P    N
sunny        2/9  3/5      high       3/9  4/5
overcast     4/9  0        normal     6/9  1/5
rain         3/9  2/5
Temperature  P    N        Windy      P    N
hot          2/9  2/5      true       3/9  3/5
mild         4/9  2/5      false      6/9  2/5
cool         3/9  1/5

Play-tennis example: classifying X

• An unseen sample X = <rain, hot, high, false>

P(rain,hot,high,false|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p)
                              = 3/9·2/9·3/9·6/9·9/14 = 0.010582

P(rain,hot,high,false|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n)
                              = 2/5·2/5·4/5·2/5·5/14 = 0.018286

• Sample X is classified in class n (don't play)


February 3, 2018 IS ZC464 18 February 3, 2018 IS ZC464 19
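The following Python sketch (illustrative, not from the slides) runs the same naive Bayes computation on the play-tennis data above; the function and variable names are my own.

from collections import Counter, defaultdict

# Play-tennis training data from the slides: (outlook, temperature, humidity, windy) -> class
data = [
    ("sunny","hot","high","false","N"), ("sunny","hot","high","true","N"),
    ("overcast","hot","high","false","P"), ("rain","mild","high","false","P"),
    ("rain","cool","normal","false","P"), ("rain","cool","normal","true","N"),
    ("overcast","cool","normal","true","P"), ("sunny","mild","high","false","N"),
    ("sunny","cool","normal","false","P"), ("rain","mild","normal","false","P"),
    ("sunny","mild","normal","true","P"), ("overcast","mild","high","true","P"),
    ("overcast","hot","normal","false","P"), ("rain","mild","high","true","N"),
]

class_counts = Counter(row[-1] for row in data)
attr_counts = defaultdict(int)          # (attribute_index, value, class) -> count
for *features, label in data:
    for i, value in enumerate(features):
        attr_counts[(i, value, label)] += 1

def score(x, label):
    """Naive Bayes score: P(c) * prod_i P(x_i | c), using raw relative frequencies."""
    p = class_counts[label] / len(data)
    for i, value in enumerate(x):
        p *= attr_counts[(i, value, label)] / class_counts[label]
    return p

x = ("rain", "hot", "high", "false")
for label in class_counts:
    print(label, score(x, label))   # P: ~0.010582, N: ~0.018286 -> classify as N (don't play)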
The independence hypothesis…

• … makes computation possible
• … yields optimal classifiers when satisfied
• … but is seldom satisfied in practice, as attributes (variables) are often correlated.
• Attempts to overcome this limitation:
 – Bayesian networks, that combine Bayesian reasoning with causal relationships between attributes
• When the conditional independence assumption is satisfied, the naive Bayes classification is a MAP classification

Naïve Bayesian Classifier: Comments

• Advantages:
 – Easy to implement
 – Good results obtained in most of the cases
• Disadvantages
 – Assumption: class conditional independence, therefore loss of accuracy
 – Practically, dependencies exist among variables
 – E.g., hospitals: patients: Profile: age, family history etc.; Symptoms: fever, cough etc.; Disease: lung cancer, diabetes etc.
 – Dependencies among these cannot be modeled by the Naïve Bayesian Classifier
• How to deal with these dependencies?
 – Bayesian Belief Networks

February 3, 2018 IS ZC464 20 February 3, 2018 IS ZC464 21

Problem Solving on Bayesian Learning

Problem 1: Bayes Theorem

• 90% of students pass an examination
• 75% of students who study hard pass the exam
• 60% of students study hard
• Let S: event that a student passes the exam; H: studies hard
• P(S|H) = 0.75
• P(S) = 0.9
• P(H) = 0.6
• P(H|S) = ??
• Solution: Use Bayes Theorem
• P(H|S) = P(S|H) P(H) / P(S) = 0.75 x 0.6 / 0.9 = 0.5
February 3, 2018 IS ZC464 22 February 3, 2018 IS ZC464 23

Problem 2: Using Joint Probabilities

• P(cavity, toothache, catch) = 0.06
• P(cavity, toothache, ¬catch) = 0.19
• P(cavity, ¬toothache, catch) = 0.05
• P(cavity, ¬toothache, ¬catch) = 0.10
• P(¬cavity, toothache, catch) = 0.09
• P(¬cavity, toothache, ¬catch) = 0.01
• P(¬cavity, ¬toothache, catch) = 0.22
• P(¬cavity, ¬toothache, ¬catch) = 0.28

More random variables

Compute:
P(cavity)
P(cavity | toothache)
P(toothache)
P(Catch | cavity)

              toothache           ¬toothache
              catch    ¬catch     catch    ¬catch
Cavity        0.06     0.19       0.05     0.10
¬Cavity       0.09     0.01       0.22     0.28
February 3, 2018 IS ZC464 24 February 3, 2018 IS ZC464 25
toothache toothache toothache toothache
Catch catch Catch catch Catch catch Catch catch
Solution Cavity 0.06 0.19 0.05 0.10 Solution Cavity 0.06 0.19 0.05 0.10
Cavity 0.09 0.01 0.22 0.28 Cavity 0.09 0.01 0.22 0.28

• P(cavity) = P(cavity,toothache)+P(cavity,~toothache) • P(cavity | toothache) = P(cavity, toothache) . P(toothache) (using Bayes


(using Marginalization) theorem)
• P(cavity) = P(cavity,toothache, catch)+P(cavity,~toothache, catch)+ • Where
P(cavity,toothache, ~catch)+P(cavity,~toothache, ~catch) (using P(cavity, toothache) = P(cavity, toothache, catch) + P(cavity, toothache, ~catch)
Marginalization) (using Marginalization Rule)

= 0.06 + 0.05 + 0.19 + 0.10 = 0.4


= 0.06 + 0.19 = 0.25
• Similarly
P(toothache) = P(toothache, catch)+P(toothache, ~catch) • Therefore
= P(toothache, catch, cavity)+P(toothache, ~catch, cavity) P(cavity | toothache) = P(cavity, toothache) . P(toothache)
+ P(toothache, catch, ~cavity)+P(toothache, ~catch, ~cavity) = 0.25 * 0.35 = 0.0875
(using Marginalization)
= 0.06 + 0.19 + 0.09 + 0.01 = 0.35

February 3, 2018 IS ZC464 26 February 3, 2018 IS ZC464 27
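A short Python sketch (my own illustration, not from the slides) computes the same marginal and conditional probabilities directly from the joint distribution table above.

# Joint distribution over (Cavity, Toothache, Catch) from the slides,
# keyed by boolean tuples (cavity, toothache, catch).
joint = {
    (True,  True,  True):  0.06, (True,  True,  False): 0.19,
    (True,  False, True):  0.05, (True,  False, False): 0.10,
    (False, True,  True):  0.09, (False, True,  False): 0.01,
    (False, False, True):  0.22, (False, False, False): 0.28,
}

def prob(cavity=None, toothache=None, catch=None):
    """Marginal probability: sum the joint entries consistent with the fixed values."""
    total = 0.0
    for (c, t, k), p in joint.items():
        if (cavity is None or c == cavity) and \
           (toothache is None or t == toothache) and \
           (catch is None or k == catch):
            total += p
    return total

p_cavity = prob(cavity=True)                                          # 0.4
p_toothache = prob(toothache=True)                                    # 0.35
p_cavity_given_toothache = prob(cavity=True, toothache=True) / p_toothache
print(p_cavity, p_toothache, round(p_cavity_given_toothache, 3))      # ~0.714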

Problem 3: Bayes Theorem

• Suppose that Bob can decide to go to work by one of three modes of transport: car, bus, or commuter train. Because of high traffic, if he decides to go by car, there is a 50% chance he will be late. If he goes by bus, which has special reserved lanes but is sometimes overcrowded, the probability of being late is only 20%. The commuter train is almost never late, with a probability of only 1%, but is more expensive than the bus.

Example contd..

• Suppose that Bob is late one day, and his boss wishes to estimate the probability that he drove to work that day by car. Since he does not know which mode of transportation Bob usually uses, he gives a prior probability of 1/3 to each of the three possibilities. What is the boss's estimate of the probability that Bob drove to work?

Example source
http://www.medicine.mcgill.ca/epidemiology/joseph/courses/EPIB-607/BayesEx.pdf
February 3, 2018 IS ZC464 28 February 3, 2018 IS ZC464 29
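Before consulting the worked solution referenced on the next slide, a short Python sketch (not part of the slides) applies Bayes' theorem to the numbers given in the problem.

# Problem 3 check: prior 1/3 for each mode,
# P(late | car) = 0.5, P(late | bus) = 0.2, P(late | train) = 0.01.
priors = {"car": 1/3, "bus": 1/3, "train": 1/3}
p_late = {"car": 0.5, "bus": 0.2, "train": 0.01}

evidence = sum(p_late[m] * priors[m] for m in priors)        # P(late)
posterior_car = p_late["car"] * priors["car"] / evidence     # P(car | late)
print(round(posterior_car, 3))                               # ~0.704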

Solution

Worked solution available at the example source:
http://www.medicine.mcgill.ca/epidemiology/joseph/courses/EPIB-607/BayesEx.pdf

Problem 4: Minimum Description Length (MDL) Principle

• Compute the MDL encoding for the problem given below

symbol | pi
A      | 0.36
B      | 0.15
C      | 0.13
D      | 0.11
E      | 0.09
F      | 0.07
G      | 0.05
H      | 0.03
I      | 0.01

February 3, 2018 IS ZC464 30 February 3, 2018 IS ZC464 31


Solution

• Arrange the symbols in sorted order
• Pair them by adding their probabilities and reach the end
• Assign the smallest code to the symbol with the highest probability
• Assign incremental-length codes to the other symbols depending upon their probabilities
• Compute the total number of expected bits
• Compute the entropy of the given information

Probabilities and Codelengths

• Huffman coding

February 3, 2018 IS ZC464 32 February 3, 2018 IS ZC464 33


Source: http://star.itc.it/caprile/teaching/algebra-superiore-2001/
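A minimal Python sketch (my own, not from the slides) builds a Huffman-style prefix code for the nine symbols of Problem 4 and reports the expected code length next to the entropy.

import heapq, math

# Symbol probabilities from Problem 4.
probs = {"A": 0.36, "B": 0.15, "C": 0.13, "D": 0.11,
         "E": 0.09, "F": 0.07, "G": 0.05, "H": 0.03, "I": 0.01}

def huffman(p):
    """Return a prefix-free code built by repeatedly merging the two least probable nodes."""
    heap = [(prob, i, {sym: ""}) for i, (sym, prob) in enumerate(p.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

code = huffman(probs)
expected_bits = sum(probs[s] * len(code[s]) for s in probs)
entropy = -sum(p * math.log2(p) for p in probs.values())
print(code)
print(f"Expected code length = {expected_bits:.3f} bits, entropy = {entropy:.3f} bits")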

Problem 5: MAP classifier

Does the patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, .008 of the entire population have this cancer.

P(cancer) = .008, P(¬cancer) = .992
P(+ | cancer) = .98, P(- | cancer) = .02
P(+ | ¬cancer) = .03, P(- | ¬cancer) = .97

P(cancer | +) = P(+ | cancer) P(cancer) / P(+)
P(¬cancer | +) = P(+ | ¬cancer) P(¬cancer) / P(+)

Problem 6: Naïve Bayes classifier

Class:
C1: buys_computer = 'yes'
C2: buys_computer = 'no'

Data sample:
X = (age<=30, Income=medium, Student=yes, Credit_rating=Fair)

age    | income | student | credit_rating | buys_computer
<=30   | high   | no      | fair          | no
<=30   | high   | no      | excellent     | no
30…40  | high   | no      | fair          | yes
>40    | medium | no      | fair          | yes
>40    | low    | yes     | fair          | yes
>40    | low    | yes     | excellent     | no
31…40  | low    | yes     | excellent     | yes
<=30   | medium | no      | fair          | no
<=30   | low    | yes     | fair          | yes
>40    | medium | yes     | fair          | yes
<=30   | medium | yes     | excellent     | yes
31…40  | medium | no      | excellent     | yes
31…40  | high   | yes     | fair          | yes
>40    | medium | no      | excellent     | no
February 3, 2018 IS ZC464 34 February 3, 2018 IS ZC464 35
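As a quick check of Problem 5 (values from the slide), the small Python sketch below compares the two unnormalized posteriors used by the MAP rule.

# MAP check for Problem 5 (numbers from the slides).
p_cancer, p_not = 0.008, 0.992
p_pos_cancer, p_pos_not = 0.98, 0.03

# Unnormalized posteriors P(+|h) P(h); the MAP hypothesis maximizes these.
score_cancer = p_pos_cancer * p_cancer       # 0.00784
score_not    = p_pos_not * p_not             # 0.02976
print(score_cancer, score_not)               # MAP hypothesis: no cancer
# Normalized: P(cancer | +) = 0.00784 / (0.00784 + 0.02976) ≈ 0.21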

Naïve Bayesian Classifier: Example

• Compute P(X|Ci) for each class
 P(age<=30 | buys_computer=yes) = 2/9 = 0.222          P(buys_computer=yes) = 9/14
 P(age<=30 | buys_computer=no)  = 3/5 = 0.6            P(buys_computer=no)  = 5/14
 P(income=medium | buys_computer=yes) = 4/9 = 0.444
 P(income=medium | buys_computer=no)  = 2/5 = 0.4
 P(student=yes | buys_computer=yes) = 6/9 = 0.667
 P(student=yes | buys_computer=no)  = 1/5 = 0.2
 P(credit_rating=fair | buys_computer=yes) = 6/9 = 0.667
 P(credit_rating=fair | buys_computer=no)  = 2/5 = 0.4
• X = (age<=30, income=medium, student=yes, credit_rating=fair)
 P(X|Ci):       P(X | buys_computer=yes) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
                P(X | buys_computer=no)  = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
 P(X|Ci)*P(Ci): P(X | buys_computer=yes) * P(buys_computer=yes) = 0.028
                P(X | buys_computer=no)  * P(buys_computer=no)  = 0.007
 => X belongs to class buys_computer = yes

Home Work

• A box contains 10 red and 15 blue balls. Two balls are selected at random and are discarded without their colors being seen. If a third ball is drawn randomly and observed to be red, what is the probability that both of the discarded balls were blue?
• Solution Hint:

P(BB | R) = P(R | BB) P(BB) / [ P(R | BB) P(BB) + P(R | BR) P(BR) + P(R | RR) P(RR) ]

February 3, 2018 IS ZC464 36 February 3, 2018 IS ZC464 37


What is Regression?
BITS Pilani
• The goal of regression is to predict the value
of one or more continuous target variables t
given the value of a D-dimensional vector x of
input variables.
• Polynomial curve fitting is an example of
regression.
Machine Learning (IS ZC464) Session 6:
Linear models for Regression

February 4, 2018 IS ZC464 2

Housing Prices (Portland, OR)
[Scatter plot: Price (in 1000s of dollars) versus Size (feet2)]
Supervised Learning: Given the "right answer" for each example in the data.
Regression Problem: Predict real-valued output

Training set of housing prices

Size in feet2 (x) | Price ($) in 1000's (y)
2104              | 460
1416              | 232
1534              | 315
852               | 178
…                 | …

Notation:
m = Number of training examples
x's = input variable / features
y's = output variable / target variable

Slides numbers 3-6 and 11-43 adapted from Coursera Courseware on the Machine Learning course offered by Prof. Andrew Ng.
February 4, 2018 IS ZC464 3 February 4, 2018 IS ZC464 4

Training Set

Training Set -> Learning Algorithm -> h
Size of house -> h -> Estimated price

Linear regression with one variable.
Univariate linear regression.

How do we represent h?

Hypothesis: h_θ(x) = θ0 + θ1 x
θ's: Parameters
How to choose θ's?
[Plots: three example lines for different choices of θ0 and θ1]
February 4, 2018 IS ZC464 5 February 4, 2018 IS ZC464 6


Recall: example to understand ERROR

Which line (hypothesis) fits the given data best?
[Scatter plot of the training data with several candidate lines]

Terminology

• Use parameters θ0 and θ1 to represent the intercept and slope of the line
• Use J(θ0, θ1) to represent the Error.
• Instead of root Mean Squared (RMS) error, consider Squared error.
• The number of training examples = m
• The ith data is x(i)
• The ith target is y(i)

February 4, 2018 IS ZC464 7 February 4, 2018 IS ZC464 8

Hypothesis

• Equation (1):  h_θ(x(i)) = θ0 + θ1 x(i)
Note: The notations used in Bishop's book (Section 3.1) are as follows
1. In place of the parameters θ, the book uses the notation w (later referred to as weights)
2. In place of <x(1), x(2), x(3), …, x(m)>, the book uses the vector x
3. In place of <y(1), y(2), y(3), …, y(m)>, the book uses the vector y.
4. In place of h_θ(x(i)), the book uses y(x,w) given by y(x,w) = w0 + w1 x (which is equivalent to equation (1))

Objective

• To find θ0, θ1 that minimize J(θ0, θ1)
• J(θ0, θ1) is given by the expression

J(θ0, θ1) = (1/2m) Σ_{i=1}^{m} ( h_θ(x(i)) - y(i) )^2

• Objective Function

Minimize_{θ0, θ1} Σ_{i=1}^{m} ( h_θ(x(i)) - y(i) )^2
February 4, 2018 IS ZC464 9 February 4, 2018 IS ZC464 10
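A tiny Python sketch (illustrative only) implements the squared-error cost above and reproduces the J(θ1) values worked out on the next slides for the toy data points (1,1), (2,2), (3,3).

# Squared-error cost J(theta0, theta1) for univariate linear regression,
# matching the slide's definition J = (1/2m) * sum (h(x_i) - y_i)^2.
def cost(theta0, theta1, xs, ys):
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs, ys = [1, 2, 3], [1, 2, 3]
print(cost(0, 1.0, xs, ys))   # 0.0   (perfect fit)
print(cost(0, 0.5, xs, ys))   # ~0.58
print(cost(0, 0.0, xs, ys))   # ~2.33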

Hypothesis:    h_θ(x) = θ0 + θ1 x
Parameters:    θ0, θ1
Cost Function: J(θ0, θ1) = (1/2m) Σ_{i=1}^{m} ( h_θ(x(i)) - y(i) )^2
Goal:          minimize_{θ0, θ1} J(θ0, θ1)

Simplified (θ0 = 0)
[Left plot: h_θ(x) for a fixed θ1, as a function of x; Right plot: J(θ1) as a function of the parameter θ1]

February 4, 2018 IS ZC464 11 February 4, 2018 IS ZC464 12


(for fixed θ1, this is a function of x)    (function of the parameter θ1)
[Data points (1,1), (2,2), (3,3) with the line for θ1 = 0.5 on the left; J(θ1) on the right]
Compute J(θ1) = (1/(2*3)) * { (0.5-1)^2 + (1-2)^2 + (1.5-3)^2 }
             = (1/6)*(0.25 + 1 + 2.25)
             = (1/6)*3.5 = 0.58

(for fixed θ1, this is a function of x)    (function of the parameter θ1)
[The same data with the line for θ1 = 0 on the left; J(θ1) on the right]
Compute J(θ1) = (1/(2*3)) * { (0-1)^2 + (0-2)^2 + (0-3)^2 }
             = (1/6)*(1 + 4 + 9)
             = (1/6)*14 = 2.3
February 4, 2018 IS ZC464 13 February 4, 2018 IS ZC464 14

(function of the parameter θ1)
The error curve J(θ1) is plotted for varying values of the parameter θ1.
[Plot: J(θ1) versus θ1]

Hypothesis:    h_θ(x) = θ0 + θ1 x
Parameters:    θ0, θ1
Cost Function: J(θ0, θ1) = (1/2m) Σ_{i=1}^{m} ( h_θ(x(i)) - y(i) )^2
Goal:          minimize_{θ0, θ1} J(θ0, θ1)

February 4, 2018 IS ZC464 15 February 4, 2018 IS ZC464 16

Surface and Corresponding contour plot
[3D surface of J(θ0, θ1) and its contour plot over (θ0, θ1)]

(for fixed θ0, θ1, this is a function of x)    (function of the parameters θ0, θ1)
[Left: Price ($) in 1000's versus Size in feet2 (x) with a candidate line; Right: contour plot of J(θ0, θ1)]

February 4, 2018 IS ZC464 17 February 4, 2018 IS ZC464 18


(for fixed θ0, θ1, this is a function of x)    (function of the parameters θ0, θ1)

February 4, 2018 IS ZC464 19 February 4, 2018 IS ZC464 20

(for fixed θ0, θ1, this is a function of x)    (function of the parameters θ0, θ1)

February 4, 2018 IS ZC464 21 February 4, 2018 IS ZC464 22

Have some function J(θ0, θ1)
Want min over θ0, θ1 of J(θ0, θ1)

Outline:
• Start with some θ0, θ1
• Keep changing θ0, θ1 to reduce J(θ0, θ1) until we hopefully end up at a minimum
[3D surface plot of J(θ0, θ1)]
February 4, 2018 IS ZC464 23 February 4, 2018 IS ZC464 24


Gradient descent algorithm

repeat until convergence {
    θj := θj - α · ∂/∂θj J(θ0, θ1)    (for j = 0 and j = 1)
}
Learning Rate: α

Correct: Simultaneous update
    temp0 := θ0 - α · ∂/∂θ0 J(θ0, θ1)
    temp1 := θ1 - α · ∂/∂θ1 J(θ0, θ1)
    θ0 := temp0
    θ1 := temp1
Incorrect: updating θ0 before computing the update for θ1
[3D surface plot of J(θ0, θ1)]

February 4, 2018 IS ZC464 25 February 4, 2018 IS ZC464 26

Gradient descent algorithm

If α is too small, gradient descent can be slow.
If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.

February 4, 2018 IS ZC464 27 February 4, 2018 IS ZC464 28

Gradient descent can converge to a local minimum, even with the learning rate α fixed.
[Plot: the derivative is zero at local optima; the current value of θ is marked on the curve]

As we approach a local minimum, gradient descent will automatically take smaller steps. So, there is no need to decrease α over time.

February 4, 2018 IS ZC464 29 February 4, 2018 IS ZC464 30


Gradient descent algorithm          Linear Regression Model
repeat until convergence {          h_θ(x) = θ0 + θ1 x
  θj := θj - α ∂/∂θj J(θ0, θ1)      J(θ0, θ1) = (1/2m) Σ ( h_θ(x(i)) - y(i) )^2
}

Gradient descent algorithm (for linear regression)
repeat until convergence {
  θ0 := θ0 - α (1/m) Σ_{i=1}^{m} ( h_θ(x(i)) - y(i) )
  θ1 := θ1 - α (1/m) Σ_{i=1}^{m} ( h_θ(x(i)) - y(i) ) · x(i)
}
update θ0 and θ1 simultaneously

February 4, 2018 IS ZC464 31 February 4, 2018 IS ZC464 32
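The following Python sketch (my own, not from the slides) runs batch gradient descent with the simultaneous update above on the toy data (1,1), (2,2), (3,3); the parameters converge towards θ0 ≈ 0, θ1 ≈ 1.

# Batch gradient descent for univariate linear regression,
# following the update rules above (simultaneous update of theta0 and theta1).
def gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        predictions = [theta0 + theta1 * x for x in xs]
        grad0 = sum(p - y for p, y in zip(predictions, ys)) / m
        grad1 = sum((p - y) * x for p, y, x in zip(predictions, ys, xs)) / m
        # simultaneous update of both parameters
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

xs, ys = [1, 2, 3], [1, 2, 3]
print(gradient_descent(xs, ys))   # approximately (0.0, 1.0)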

(for fixed θ0, θ1, this is a function of x)    (function of the parameters θ0, θ1)
[Slides 33-43: successive gradient descent iterations, showing the current fitted line h_θ(x) on the housing data (left) and the corresponding point moving across the contour plot of J(θ0, θ1) (right)]
February 4, 2018 IS ZC464 33-43


Single feature (variable): x

Size (feet2) | Price ($1000)
2104         | 460
1416         | 232
1534         | 315
852          | 178
…            | …

Multiple features (variables).

Size (feet2) | Number of bedrooms | Number of floors | Age of home (years) | Price ($1000)
2104         | 5                  | 1                | 45                  | 460
1416         | 3                  | 2                | 40                  | 232
1534         | 3                  | 2                | 30                  | 315
852          | 2                  | 1                | 36                  | 178
…            | …                  | …                | …                   | …

x(1) = [2104, 5, 1, 45]^T

Notation:
n = number of features
x(i) = input (features) of the ith training example.
xj(i) = value of feature j in the ith training example.

February 4, 2018 IS ZC464 44 February 4, 2018 IS ZC464 46

Hypothesis:
Previously: h_θ(x) = θ0 + θ1 x
Now with multiple variables or features:
h_θ(x) = θ0 + θ1 x1 + θ2 x2 + θ3 x3 + θ4 x4
(x1 = Size (feet2), x2 = Number of bedrooms, etc.)
Or, as per Bishop's book notations (page 38, section 3.1):
y(x,w) = w0 + w1 x1 + w2 x2 + w3 x3 + w4 x4

Hypothesis: h_θ(x) = θ0 + θ1 x1 + … + θn xn
Parameters: θ0, θ1, …, θn
Cost function: J(θ0, …, θn) = (1/2m) Σ_{i=1}^{m} ( h_θ(x(i)) - y(i) )^2
Gradient descent:
Repeat {
  θj := θj - α ∂/∂θj J(θ0, …, θn)
}
(simultaneously update for every j = 0, …, n)

February 4, 2018 IS ZC464 47 February 4, 2018 IS ZC464 48

Gradient Descent

Previously (n = 1):
Repeat {
  θ0 := θ0 - α (1/m) Σ ( h_θ(x(i)) - y(i) )
  θ1 := θ1 - α (1/m) Σ ( h_θ(x(i)) - y(i) ) x(i)
}
(simultaneously update θ0, θ1)

New algorithm (n ≥ 1):
Repeat {
  θj := θj - α (1/m) Σ ( h_θ(x(i)) - y(i) ) xj(i)
}
(simultaneously update θj for j = 0, …, n)

Linear Regression

y(x,w) = w0 + w1 x1 + w2 x2 + w3 x3 + ….. + wD xD

Key Properties of Linear Regression
• y is a linear function of the parameters w0, w1, w2, …, wD
• y is a linear function of the input variables (features) x0, x1, x2, …, xD

February 4, 2018 IS ZC464 49 February 4, 2018 IS ZC464 50


Generalized Form of Linear Regression

• A notion of a class of functions φi(x) is used to represent the regression function
• y(x,w) = w0 + w1 x1 + w2 x2 + w3 x3 + ….. + wD xD is represented as
  y(x,w) = w0 + w1 φ1(x) + w2 φ2(x) + w3 φ3(x) + ….. + wD φD(x)
  where φi(x) = xi
• The φi(x) are called basis functions

BITS Pilani
Machine Learning (IS ZC464) Session 7:
Linear models for Classification

February 4, 2018 IS ZC464 51

Classification

• The goal of classification is to take an input vector x and to assign it to one of K discrete classes Ck where k = 1, 2, 3, …, K
• Examples
 – Email: Spam / Not Spam?
 – Online Transactions: Fraudulent (Yes / No)?
 – Tumor: Malignant / Benign?

Decision Regions

• Training data is viewed to be plotted in a d-dimensional space where d is the number of features used.
• A test data point is also viewed to be mapped in the same space.
• The similarity (or closeness) of the test data to the clusters of the training classes is obtained.
• The nearest class is assigned to the test data

February 10, 2018 IS ZC464 2 February 10, 2018 IS ZC464 3

Binary Classification

• Only two classes
 – y = 0: Negative Class (e.g., benign tumor)
 – y = 1: Positive Class (e.g., malignant tumor)

Example of a Decision Boundary

[Plot: Malignant? (Yes = 1 / No = 0) versus Tumor Size, with a test data point and the decision boundary marked]
Threshold classifier output h_θ(x) at 0.5:
If h_θ(x) ≥ 0.5, predict "y = 1"
If h_θ(x) < 0.5, predict "y = 0"

February 10, 2018 IS ZC464 4 February 10, 2018 IS ZC464 5


Solving Classification Problems

• Require the decision boundaries (or surfaces in hyper-dimensional space) to be identified based on the training data.
• The decision boundary may be a line, a polynomial curve or a surface.
• The decision boundary can be represented as a hypothesis h(x)

Linearly Separable Non-Face Data
[Figure]

February 10, 2018 IS ZC464 6 February 10, 2018 IS ZC464 7

Each face is a point in the n-dimensional space. (ORL face data for three persons)
[Figure]

The points in the n-dimensional space cannot be clustered (colorwise) by hyperplanes.
[Figure]

February 10, 2018 IS ZC464 8 February 10, 2018 IS ZC464 9

Discriminant Functions

• Represent the decision boundary
• Discriminant functions are obtained by taking a linear function of the input vector (feature vector).
• Define y(x) = w0 + w1 x1 + w2 x2 + … + wD xD
• Take a simple case y(x) = w0 + w1 x
• This is the equation of a line.
• How does this behave as a decision boundary?

Example

• Consider the following training data
• Class 1: <1,2>, <1,1>, <2,1>
• Class 2: <3,3>, <3,4>, <4,3>
• Can view a decision boundary as a line separating the two classes
• The equation of the line is x2 = -x1 + 1 (not using y deliberately, as y is used for the target)

February 10, 2018 IS ZC464 10 February 10, 2018 IS ZC464 11


Example

• Test vector <4,4>
• Compute h(x) = x1 + x2 - 1 as 4 + 4 - 1 = 7
• Since h(x) > 4, the test data belongs to class 2
• Test vector <2,1.5>
• h(x) = 2 + 1.5 - 1 = 2.5 < 4
• Then it belongs to class 1

Define the hypothesis in terms of a vector product

W^T = [ w0  w1  ..  wD ],    X = [ x0  x1  ..  xD ]^T

Since y(x) = w0 + w1 x1 + w2 x2 + … + wD xD,

y = W^T X
February 10, 2018 IS ZC464 12 February 10, 2018 IS ZC464 13

Classification

• If W^T X ≥ 0, then the vector x belongs to class 2
• If W^T X < 0, then the test vector belongs to class 1
• Class Assignment
• Classify
 – <5,1>
 – <4,1>
 – <3,1>

How to get the best decision boundary?

• Based on experience using the training data
• We try to optimize the fitting of the decision boundary.
• If the training data is inappropriate, the classifier is likely to misclassify data.
February 10, 2018 IS ZC464 14 February 10, 2018 IS ZC464 15

Linear versus circular boundaries

Binary versus Multi-class classification
Binary classification: [plot in (x1, x2) space]    Multi-class classification: [plot in (x1, x2) space]
February 10, 2018 IS ZC464 16 February 10, 2018 IS ZC464 17

Nearest Neighbor Classification
[Figure: clusters with a test feature marked +; a "Do not know" condition when the test point is far from all clusters]
Nearest neighbor: Shortest distance to the mean of the cluster

Face data is nonlinearly separable (Hyper-Surfaces can create boundaries between clusters)

February 10, 2018 IS ZC464 18 February 10, 2018 IS ZC464 19

Classification Problem

Given Training Data
The closest cluster to the n-dimensional test feature vector is computed
Possible Decision Boundaries
• Hyper Plane
• Hyper Sphere
• Gaussian Surface
• Support Vectors
Challenge: Design of Decision Boundary

What to optimize?

• Given y(x) = w0 + w1 x1 + w2 x2 + … + wD xD
• Objective 1: Obtain W that gives the minimum error of classification, OR
• Objective 2: Obtain W that maximizes the separation of the classes
• Visualize the error surface discussed earlier with respect to classification error and find the parameters W that give the least error.

February 10, 2018 IS ZC464 20 February 10, 2018 IS ZC464 21

Decision Tree
BITS Pilani
 A decision tree takes as input an object or situation
described by a set of attributes and returns a decision.
 This decision is the predicted output value for the
input.
 The input attributes can be discrete or continuous.
 Classification Learning:
 Learning a discrete valued function is called classification
learning
 Regression:
 Learning a continuous function is called Regression.

Machine Learning (IS ZC464) Session 8: Decision Trees and Review Session

February 11, 2018 IS ZC464 2


Decision Tree

• A decision tree reaches its decision by performing a sequence of tests.
• All non-leaf nodes lead to partial decisions and assist in moving towards the leaf node.
• Leaf nodes are the decisions based on the properties satisfied at the non-leaf nodes on the path from the root node.

Decision tree

• Leaf nodes depict the decision about a character having attributes falling on the path from the root node
• Each example that participates in the construction of the decision tree is called a training data point, and the complete set of training data is called the training set.

February 11, 2018 IS ZC464 3 February 11, 2018 IS ZC464 4

Limitations of Decision Tree Learning

• The tree memorizes the observations but does not extract any pattern from the examples.
• This limits the capability of the learning algorithm in that the observations do not extrapolate to examples it has not seen.
Imagine the size of a decision tree with 1000 attributes capable of discriminating between persons!!!

How can we construct a decision tree for the face recognition problem

• Define attributes
• Collect the attribute data from training samples
• Associate the output (to be used as a leaf)

February 11, 2018 IS ZC464 5 February 11, 2018 IS ZC464 6

Decision trees

• The attributes aid in taking decisions.
• The most appropriate attribute is selected for testing in the beginning, else the size of the tree becomes large, resulting in large computational time.
• Leaf nodes represent the decisions.
• The attributes falling in the path from the root to a leaf represent the attributes fully able to define the decision at that leaf.

Goal Predicate: WillWait()

Problem: decide whether to wait for a table at a restaurant, based on the following attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)

This slide is adapted from the text book and from the set of slides available at
February 11, 2018 IS ZC464 7 aima.eecs.berkeley.edu/slides-ppt/m18-learning.ppt
February 11, 2018 IS ZC464 8
Attributes
[Table of the restaurant examples and their attribute values]
This slide is adapted from the text book and from the set of slides available at aima.eecs.berkeley.edu/slides-ppt/m18-learning.ppt

Decision Tree
[The induced restaurant decision tree]
This slide is adapted from the text book and from the set of slides available at aima.eecs.berkeley.edu/slides-ppt/m18-learning.ppt
February 11, 2018 IS ZC464 9 February 11, 2018 IS ZC464 10

Size of the decision tree

• The size of the decision tree depends on the choice of the attributes and the order in which they are used to test the examples.
• Selection of attributes must be fairly good, and really useless attributes such as 'type' should be avoided
• The quality of an attribute can be measured.
• One measure can be the amount of information the attribute carries.

Information content

• If the vi are the different possible answers and the P(vi) are the probabilities that the answer could be vi, then the information content I of the actual answer is given by
 – I(P(v1), P(v2), …, P(vn)) = - Σ P(vi) log2 P(vi)
• Assume that the training set contains p positive examples and n negative examples; then an estimate of the information contained in a correct answer is
 I(p/(p+n), n/(p+n)) = - (p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))
February 11, 2018 IS ZC464 11 February 11, 2018 IS ZC464 12

Refer the given table of Attributes

Information content


• Since
I(p/(p+n), n/(p+n)) = - (p/(p+n) ) log2(p/(p+n))
- (n/(p+n) ) log2(n/(p+n))
– information = -(6/12) log2 (1/2) – (6/12) log2 (1/2)
– = - log2 (1/2)
= log2 ((1/2)^-1)
= log2 (2)
= 1 bit

This slide is adapted from the text book and from the set of slides available at
aima.eecs.berkeley.edu/slides-ppt/m18-learning.ppt
February 11, 2018 IS ZC464 13 February 11, 2018 IS ZC464 14
Generalize the splitting

• Let the attribute A divide the entire training set into sets E1, E2, …, Ev, where v is the total number of values A can be tested on.
• Assume that each set Ei contains pi positive examples and ni negative examples
• Remainder(A) = Σ_{i=1}^{v} (pi+ni)/(p+n) · I(pi/(pi+ni), ni/(pi+ni))

Gain(A)

Gain(A) = I(p/(p+n), n/(p+n)) – Remainder(A)
The heuristic to choose attribute A from a set of all attributes is the maximum gain

Compute
1. Gain(Patrons)
2. Gain(Type)

February 11, 2018 IS ZC464 15 February 11, 2018 IS ZC464 16

Selecting the Patrons attribute

Selecting Type as the attribute

February 11, 2018 IS ZC464 17 February 11, 2018 IS ZC464 18

Refer the given table of Attributes and compute the Gain

Gain(Patrons)
• 1 – ( (2/12) I(0,1) + (4/12) I(1,0) + (6/12) I(2/6, 4/6) )
• Approximately equal to 0.541 bits

This slide is adapted from the text book and from the set of slides available at
February 11, 2018 IS ZC464 19 aima.eecs.berkeley.edu/slides-ppt/m18-learning.ppt
February 11, 2018 IS ZC464 20
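As a hedged check of this number (my own sketch, not part of the slides), the Python function below computes Gain(A) from the positive/negative counts of each split, reproducing Gain(Patrons) ≈ 0.541 and Gain(Type) = 0 for the restaurant example.

import math

def binary_info(p, n):
    """Information content I(p/(p+n), n/(p+n)) in bits; 0 when one class is empty."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:
            q = count / total
            result -= q * math.log2(q)
    return result

def gain(splits, p, n):
    """Gain(A) = I(p/(p+n), n/(p+n)) - sum_i (pi+ni)/(p+n) * I(pi, ni)."""
    remainder = sum((pi + ni) / (p + n) * binary_info(pi, ni) for pi, ni in splits)
    return binary_info(p, n) - remainder

# Restaurant example: 6 positive and 6 negative examples overall.
# Patrons splits into None (0+, 2-), Some (4+, 0-), Full (2+, 4-).
print(round(gain([(0, 2), (4, 0), (2, 4)], 6, 6), 3))          # ~0.541 bits
# Type splits into French (1+,1-), Italian (1+,1-), Thai (2+,2-), Burger (2+,2-).
print(round(gain([(1, 1), (1, 1), (2, 2), (2, 2)], 6, 6), 3))  # 0.0 bits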
Decision Trees

• Learning is through a series of decisions taken with respect to the attribute at each non-leaf node.
• There can be many trees possible for the given training data.
• Finding the smallest DT is an NP-complete problem.
• Greedy selection of the attribute with the largest gain to split the training data into two or more sub-classes may lead to approximately the smallest tree

Decision Trees

• If the decisions are binary, then in the best case each decision eliminates almost half of the regions (leaves).
• If there are b regions, then the correct region can be found in log2(b) decisions in the best case.
• The height of the decision tree depends on the order of the attributes selected to split the training examples at each step.
February 11, 2018 IS ZC464 21 February 11, 2018 IS ZC464 22

Expressiveness of the DT

• A decision tree can represent a disjunction of conjunctions of constraints on the attribute values of instances.
 – Each path corresponds to a conjunction
 – The tree itself corresponds to a disjunction

Example

If (O=Sunny AND H=Normal) OR (O=Overcast) OR (O=Rain AND W=Weak) then YES
 – A disjunction of conjunctions of constraints on attribute values

February 11, 2018 IS ZC464 23 February 11, 2018 IS ZC464 24

Entropy
• It is the measure of the information content
and is given by
– I = - Σ P(vi) log2 P(vi)
– Where v1,v2,..,vk are the values of the attribute on
which the decisions bifurcate.

February 11, 2018 IS ZC464 25 February 11, 2018 IS ZC464 26


Class Work

Remainder(A) = Σ_{i=1}^{v} (pi+ni)/(p+n) I(pi/(pi+ni), ni/(pi+ni))

• Identify the examples belonging to the two sets constructed after the data is split on the basis of the attribute 'student'.
• Compute the total information content of the training data.
• Compute the information gain if the training data is split on the basis of the attribute 'student'.
• Draw the decision tree, which may or may not be optimal.

Understand the examples

• Decisions are binary – yes / no
• Training data as <example, decision> pairs
• <r1,no>, <r2,no>, <r3,yes>, <r4,yes> and so on
• Positive examples: r3, r4, r5, r7, r9, r10, r11, r12, r13
• Negative examples: r1, r2, r6, r8, r14
• Is the given training set sufficient to take any decision?
• Is the generalization capability of the given training set sufficient?
February 11, 2018 IS ZC464 27 February 11, 2018 IS ZC464 28

Information content of the given training data

• Here v1 = yes, v2 = no
• Positive examples: r3, r4, r5, r7, r9, r10, r11, r12, r13
• Negative examples: r1, r2, r6, r8, r14
• Total number of examples = 14
• P(v1) = 9/14, P(v2) = 5/14
• The information content is represented by the notation I(9/14, 5/14)
• Entropy = - ( P(v1) log2(P(v1)) + P(v2) log2(P(v2)) )
          = - ( (9/14) log2(9/14) + (5/14) log2(5/14) )
          ≈ 0.940 bits

Compute the significance of attribute 'income'

(YES) r3, r4, r5, r7, r9, r10, r11, r12, r13
(NO)  r1, r2, r6, r8, r14

Split on income:
Low:    (YES) r5, r7, r9         (NO) r6
Medium: (YES) r4, r10, r11, r12  (NO) r8, r14
High:   (YES) r3, r13            (NO) r1, r2
February 11, 2018 IS ZC464 29 February 11, 2018 IS ZC464 30

Compute the significance of attribute 'income'

(YES) r3, r4, r5, r7, r9, r10, r11, r12, r13
(NO)  r1, r2, r6, r8, r14

Low:    (YES) r5, r7, r9         (NO) r6
Medium: (YES) r4, r10, r11, r12  (NO) r8, r14
High:   (YES) r3, r13            (NO) r1, r2

Observe that the split regions of examples possess mixed decisions; this shows the poor quality of the attribute 'income'.

Recall: Generalize the splitting

• Let the attribute A divide the entire training set into sets E1, E2, …, Ev, where v is the total number of values A can be tested on.
• Assume that each set Ei contains pi positive examples and ni negative examples
• Remainder(A) = Σ_{i=1}^{v} (pi+ni)/(p+n) I(pi/(pi+ni), ni/(pi+ni))

February 11, 2018 IS ZC464 31 February 11, 2018 IS ZC464 32


Compute the significance of attribute 'income'

Remainder(A) = Σ_{i=1}^{v} (pi+ni)/(p+n) I(pi/(pi+ni), ni/(pi+ni))

(YES) r3, r4, r5, r7, r9, r10, r11, r12, r13
(NO)  r1, r2, r6, r8, r14

Low:    (YES) r5, r7, r9         (NO) r6          p1 = 3, n1 = 1
Medium: (YES) r4, r10, r11, r12  (NO) r8, r14     p2 = 4, n2 = 2
High:   (YES) r3, r13            (NO) r1, r2      p3 = 2, n3 = 2

Compute the Remainder information if attribute 'income' is used for splitting

• Remainder = (4/14)*I(3/4, 1/4) + (6/14)*I(4/6, 2/6) + (4/14)*I(2/4, 2/4)
  = (4/14) { -(3/4) log2(3/4) – (1/4) log2(1/4) }
  + (6/14) { -(4/6) log2(4/6) – (2/6) log2(2/6) }
  + (4/14) { -(2/4) log2(2/4) – (2/4) log2(2/4) }
  [Home Work: Remaining computation]

February 11, 2018 IS ZC464 33 February 11, 2018 IS ZC464 34
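A quick way to check the home-work computation is the short Python sketch below (my own illustration, using the same I(p,n) definition as the slides).

import math

def I(p, n):
    """Information content of a (p, n) split, in bits."""
    t = p + n
    return -sum((c / t) * math.log2(c / t) for c in (p, n) if c)

# 'income' splits the 14 examples into Low (3+,1-), Medium (4+,2-), High (2+,2-); overall (9+, 5-).
remainder = (4/14) * I(3, 1) + (6/14) * I(4, 2) + (4/14) * I(2, 2)
print(round(remainder, 3), round(I(9, 5) - remainder, 3))   # remainder ≈ 0.911, Gain(income) ≈ 0.029 bits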

Review Session Review Session

Review Session

• Mid Semester Syllabus
 – All topics and details discussed in Sessions 1-8 [Refer Slides and video contents]
• Not included
 – Decision Theory [Handout S. No. 2.2]
 – Expectation Maximization (EM) Algorithm [Handout S. No. 3.3]
 – Bias-variance decomposition [Handout S. No. 3.4]

Review Session: What is learning?

 - Learning (for humans) is experience from the past.
 - A machine can be programmed to gather experience in the form of facts, instances, rules etc.
 - A machine with learning capability can predict about the new situation (seen or unseen) using its past experience.
 - Examples:
   - As we humans can tell a person's name on seeing him/her a second or fifth time, a machine can also do that.
   - As we humans can recognize a person's voice even without seeing the person's face, a machine can also be made to learn to do the same.

February 11, 2018 IS ZC464 35 February 11, 2018 IS ZC464 36

Review Session Review Session


Review Session: Class Experiment: Training

 - Let AA denote 5, BB denote 6, AAA denote 50, BBB denote 60, AAAA denote 500, BBBB denote 600
 - Can you find out the equivalent numerical value of AAAAA? 5000: yes/no?
 - Or of AABB? Not yet trained………

Review Session: Artificial Intelligence: An intelligent car navigation system [An Example]

 - A system to navigate a car to the airport works on its vision, enabled using a camera mounted at the front of the car.
 - The system sees the lane limits and the vehicles on the way, and controls the car from colliding. [Vision]
 - It follows the road directions.
 - It also follows the road rules.
 - The system learns to handle unforeseen situations. For example, if the traffic flow is restricted on a portion of the road temporarily, the system takes the alternative path. [Learning]

February 11, 2018 IS ZC464 37 February 11, 2018 IS ZC464 38


Review Session: More intelligence can be expected

• The system listens to the person sitting in the car asking to stop at a nearby hotel for a tea, sees around to find a hotel, keeps travelling till it finds one and stops the car. [Speech Recognition, Vision]
• Understands the mood of the person and starts music to suit the mood of the person. [Facial Expression]
• Can answer queries such as "how far is Pilani?", "What is the time?", "can I sleep for an hour?", "Please wake me up when it is … in the morning?" [Natural Language Processing]

Review Session: Other intelligent systems

• Smart home
 – Lights switch off if there is no one in the room
 – Curtains pull off at sunrise
 – Dust bin is emptied before it is overflowing
 – Smart water taps, toilets etc.
• Smart office
 – Automatic meeting summary
 – Speaker recognition and summary generation
• Automatic answering machine
February 11, 2018 IS ZC464 39 February 11, 2018 IS ZC464 40

Review Session Review Session

Review Session: Other intelligent machines

• An airplane cockpit can have an intelligent system that takes automatic control when hijacked [context and speech understanding, NLP, vision]
• Medical diagnosis systems trained with expert guidance can diagnose the patient's disease based on the X-ray, MRI images and other symptoms
• Automated theorem proving
• General problem solver

Review Session: Intelligent Agent

• An intelligent agent is a system that perceives its environment and takes actions which maximize its chances of success.
• Artificial Intelligence aims to build intelligent agents or entities.
February 11, 2018 IS ZC464 41 February 11, 2018 IS ZC464 42

Review Session Review Session

Review Session: Machine Learning Applications

• Speech recognition
• Automatic news summary
• Spam email detection
• Credit card fraud detection
• Face recognition
• Function approximation
• Stock market prediction and analysis
• Etc.

Review Session: Machine Learning

• A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. (Tom Mitchell)
February 11, 2018 IS ZC464 43 February 11, 2018 IS ZC464 44
Review Session Review Session

Review Session: Learning From Observations

• Learning Element:
 – responsible for making improvements
• Performance Element:
 – responsible for selecting external actions
• The learning element uses feedback from the critic on how the agent is doing and determines how the performance element should be modified to do better in the future

Review Session: Design of a learning Element

• Affected by three major issues:
 – Which components of the performance element are to be learned
 – What feedback is available to learn these components
 – What representation is used for the components.

February 11, 2018 IS ZC464 45 February 11, 2018 IS ZC464 46

Review Session: Traditional Vs. Machine Learning

Traditional Approach: Input Data + Program -> Output
Machine Learning:     Input Data + Output  -> Program

Review Session: Training and testing: Prediction

X  Y
1  1
5  5
2  2
4  4
3  3

• Recall Learning: A machine with learning capability can predict about the new situation (seen or unseen) using its past experience.
• Prediction: Given values of x and y, predict the value of y for x = 71
• Prediction is based on learning the relationship between x and y
• Training data is the collection of (x,y) pairs
• Testing data is simply the value of x for which the value of y is required to be predicted.

February 11, 2018 IS ZC464 47 February 11, 2018 IS ZC464 48

Review Session: What did the system learn?

X  Y
1  1
5  5
2  2
4  4
3  3

Y = f(x)
• Y = x
• What is its generalization ability?
• Most accurate, or we can say 100%
• What if the data to train the system changes slightly? The machine can still be made to learn.

Review Session: Learning of a function from given sample data
[Plot: straight-line fit through the sample points]

February 11, 2018 IS ZC464 49 February 11, 2018 IS ZC464 50


Review Session: Learning of a function from given sample data - straight line learning

Straight Line
Which line fits the best?
• The line is represented by the parameters slope and intercept
• The machine must learn on its own which line is the best fit
• How? Using the data – known as training data, i.e. (x,y) pairs

Review Session: Understanding ERROR

Which line (hypothesis) fits the given data best?
[Scatter plot of the training data with several candidate lines]

February 11, 2018 IS ZC464 51 February 11, 2018 IS ZC464 52

Review Session: Plotting error when y = f(x)

Hypothesis function: hw(x) = wx (linear in one variable)
[Plot: E(w) versus w, with the w corresponding to the minimum error marked]

Review Session: Plotting error when y = f(x1, x2)

Hypothesis function: hw(x) = w1 x1 + w2 x2 ……. (linear in two variables)
[Another possible surface: E(w) plotted over (w1, w2), with local minima and the global minimum marked]

February 11, 2018 IS ZC464 53 February 11, 2018 IS ZC464 54

Review Session Review Session

Review Session: Uncertainty in the real world

• Uncertainty in reaching New Delhi Airport in 5 hours from Pilani
 – The cab engine may or may not work at any moment
 – The route is diverted due to a procession on the way
 – The road condition is bad unexpectedly
 – The tire needs replacement
 Etc.
• A person having stomach ache can be told that he is suffering from an ulcer, while in actuality it may be gastritis or overeating

Review Session: Recall

• Knowledge representation using Probability
• Random variables
• Atomic events
• Conditional probability
• Prior probability
• Marginalization
• Bayes theorem and its application in problem solving
• Joint probability distribution (JPD) table and probabilistic inference
gastritis or overeating probabilistic inference

February 11, 2018 IS ZC464 55 February 11, 2018 IS ZC464 56


Review Session: Bayes theorem

• Bayes theorem provides a way to calculate the probability of a hypothesis based on its prior probability, the probability of observing various data given the hypothesis, and the observed data itself.

Review Session: Bayesian learning, Example 1: observation of sounds

Training with observed data: {d1, d2, d3} = training data (say D)
d1: 'cat' sounds with 'ae'
d2: 'pot' sounds with 'aw'
d3: '_at' sounds with 'ae'
Features such as 'a' and 'o' are obtained through preprocessing of the given words – by parsing.

Sounds 'ae' and 'aw' are the observed targets that we know.
Prior probabilities: P(sound = 'ae') = 0.5, P(sound = 'aw') = 0.5
Conditional probabilities are represented as
P('ae' | feature = 'a') = 2/3, P('aw' | feature = 'o') = 1/3
OR P('ae' | d1,d2,d3) = 2/3, P('aw' | d1,d2,d3) = 1/3
OR P('ae' | D) = 2/3, P('aw' | D) = 1/3
February 11, 2018 IS ZC464 57 February 11, 2018 IS ZC464 58

Review Session Review Session

Review Session: Hypothesis

• In learning algorithms, the term hypothesis is used in contexts such as
 Concept learning or classification: a class label or category
 Function approximation: a curve, a line or a polynomial
 Decision making: a decision tree
• Plural of hypothesis: Hypotheses (multiple labels, multiple curves, multiple decision trees)
• Best Hypothesis (always preferred): most appropriate class, best-fit curve, smallest decision tree

Review Session: Bayesian Learning

• Training: Through the computation of the probabilities as in the previous two slides
• Testing: of unknown words
 – Example testing: Which sound does the word 'cat' make?
 Preprocess 'cat' to get feature 'a' and compute P(h|a), where P(h|D) is known, D is the set of 10 observations used to train the system, and h is the hypothesis.
 Compute the probabilities P(ae|a), P(oo|a), P(a~|a) and P(aw|a) to obtain the likelihood of the sound of cat.

February 11, 2018 IS ZC464 59 February 11, 2018 IS ZC464 60

Review Session: Maximum a Posteriori (MAP) hypothesis

• Consider a set of hypotheses H and the observed data used for training, D
• Define

h_MAP = Argmax_{h ∈ H} P(h | D)

• The maximally probable hypothesis is called a maximum a posteriori (MAP) hypothesis.

Review Session: Recall

• MAP algorithm
• Gibbs Algorithm
• Minimum Description Length Principle
• Information theory – entropy
• Bayes Optimal Classifier
• Naïve Bayes Classifier

February 11, 2018 IS ZC464 61 February 11, 2018 IS ZC464 62


Review Session Review Session

What is Regression? Training Set How do we represent h ?


Hypothesis:
• The goal of regression is to predict the value
of o e o o e o ti uous ta get a ia les t
given the value of a D-dimensional vector x of Learning Algorithm
‘s: Parameters
input variables.
• Polynomial curve fitting is an example of
regression. Size of
h
Estimate How to choose ‘s ?
house d price

Linear regression with one variable.


Univariate linear regression.

February 11, 2018 IS ZC464 63 February 11, 2018 IS ZC464 64

Review Session Review Session


(for fixed θ0, θ1, this is a function of x)    (function of the parameters θ0, θ1)
[Left: Price ($) in 1000's versus Size in feet2 (x) with the fitted line; Right: contour plot of J(θ0, θ1)]

Gradient descent algorithm
Learning Rate: α
Correct: Simultaneous update of θ0 and θ1    Incorrect: sequential (non-simultaneous) update

February 11, 2018 IS ZC464 65 February 11, 2018 IS ZC464 66

Review Session Review Session

Linear Regression

y(x,w) = w0 + w1 x1 + w2 x2 + w3 x3 + ….. + wD xD

Key Properties of Linear Regression
• y is a linear function of the parameters w0, w1, w2, …, wD
• y is a linear function of the input variables (features) x0, x1, x2, …, xD

Generalized Form of Linear Regression

• A notion of a class of functions φi(x) is used to represent the regression function
• y(x,w) = w0 + w1 x1 + w2 x2 + w3 x3 + ….. + wD xD is represented as
  y(x,w) = w0 + w1 φ1(x) + w2 φ2(x) + w3 φ3(x) + ….. + wD φD(x)
  where φi(x) = xi
• The φi(x) are called basis functions for i = 1, 2, 3, …, D

February 11, 2018 IS ZC464 67 February 11, 2018 IS ZC464 68


Review Session
Basis functions

• Linear basis functions: φi(x) = x (linear in x)
• Nonlinear basis functions:
 φi(x) = x^2 (quadratic in x)
 φi(x) = x^3 (cubic in x)

What is linear in linear regression?

• The following expression is linear in W:
 y(x,w) = w0 + w1 φ1(x) + w2 φ2(x) + w3 φ3(x) + ….. + wD φD(x)
• The basis functions may be linear or nonlinear in x

February 11, 2018 IS ZC464 69 February 11, 2018 IS ZC464 70
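To make this concrete, here is a minimal Python/NumPy sketch (my own, assuming a synthetic quadratic target) that fits a model which is linear in the weights w while using basis functions that are nonlinear (polynomial) in x.

import numpy as np

# Design matrix with polynomial basis functions phi_i(x) = x**i, i = 0..degree.
def design_matrix(x, degree):
    return np.vstack([x**i for i in range(degree + 1)]).T

# Noisy samples of a quadratic target (synthetic data for illustration).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
t = 1.0 + 2.0 * x - 3.0 * x**2 + 0.05 * rng.standard_normal(x.size)

# Least-squares solution for w in y(x, w) = sum_i w_i * phi_i(x):
Phi = design_matrix(x, degree=2)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print(w)   # approximately [1.0, 2.0, -3.0]

The key design point mirrored here is that the model stays linear in w even though the φi(x) are nonlinear in x, so ordinary least squares still applies.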

Review Session Review Session

Review Session: Classification

• The goal of classification is to take an input vector x and to assign it to one of K discrete classes Ck where k = 1, 2, 3, …, K
• Examples
 – Email: Spam / Not Spam?
 – Online Transactions: Fraudulent (Yes / No)?
 – Tumor: Malignant / Benign?

Review Session: Example of a Decision Boundary

[Plot: Malignant? (Yes = 1 / No = 0) versus Tumor Size, with a test data point and the decision boundary marked]
Threshold classifier output h_θ(x) at 0.5:
If h_θ(x) ≥ 0.5, predict "y = 1"
If h_θ(x) < 0.5, predict "y = 0"

February 11, 2018 IS ZC464 71 February 11, 2018 IS ZC464 72

Review Session Review Session

Review Session: Solving Classification Problems

• Require the decision boundaries (or surfaces in hyper-dimensional space) to be identified based on the training data.
• The decision boundary may be a line, a polynomial curve or a surface.
• The decision boundary can be represented as a hypothesis h(x)

Review Session: Example

• Test vector <4,4>
• Compute h(x) = x1 + x2 - 1 as 4 + 4 - 1 = 7
• Since h(x) > 4, the test data belongs to class 2
• Test vector <2,1.5>
• h(x) = 2 + 1.5 - 1 = 2.5 < 4
• Then it belongs to class 1

February 11, 2018 IS ZC464 73 February 11, 2018 IS ZC464 74


Review Session Review Session

Recall Bayes Theo e ased p o le


• Decision boundaries • A box contains 10 red and 15 blue balls. Two balls are
selected at random and are discarded without their
• Binary and multi class classification colors being seen. If a third ball is drawn randomly and
observed to be red, what is the probability that both of
• Decision trees the discarded balls were blue?
• Gain and remainder • Atomic events for the selection of two balls
RR: both balls were Red
• Information content etc. RB: One is red ball and the other is Blue
BB: Both are blue balls
To find P(R | BB) : Probability that the third ball is red
given that both the discarded balls were blue.

February 11, 2018 IS ZC464 75 February 11, 2018 IS ZC464 76

Review Session
Review Session: Probability of the third ball being red

• Number of ways to select two balls = 25C2 = 300
• Number of ways to select two red balls = 10C2 = 45
• Number of ways to select two blue balls = 15C2 = 105
• Number of ways to select one red and one blue ball = 10C1 * 15C1 = 10*15 = 150
• P(RR) = 45/300 = 0.15
• P(BR) = 150/300 = 0.50
• P(BB) = 105/300 = 0.35
• Therefore the probability that the third ball is red is

P(R) = P(R, RR) + P(R, BB) + P(R, BR)
     = P(R|RR)*P(RR) + P(R|BB)*P(BB) + P(R|BR)*P(BR)
     = (8/23)*(45/300) + (10/23)*(105/300) + (9/23)*(150/300)
     = 0.0522 + 0.1522 + 0.1957
     = 0.4
(After two balls are discarded, 23 balls remain; e.g. if both discarded balls were blue, 10 of the remaining 23 are red, so P(R|BB) = 10/23.)

February 11, 2018 IS ZC464 77 February 11, 2018 IS ZC464 78

Review Session: Probability that the two discarded balls were blue given that the third ball is red

• The expression is P(BB|R) and, using Bayes Theorem, is given by

P(BB|R) = P(R|BB) P(BB) / [ P(R|BB) P(BB) + P(R|BR) P(BR) + P(R|RR) P(RR) ]

P(BB|R) = 0.1522 / 0.4 ≈ 0.38 (Answer)

BITS Pilani
Machine Learning (IS ZC464) Session 9:
Problem solving and doubt clearing session

February 11, 2018 IS ZC464 79
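As a hedged sanity check of this answer (my own sketch, not part of the slides), the Monte Carlo simulation below estimates P(both discarded blue | third ball red) directly.

import random

# Monte Carlo check of the discarded-balls problem (10 red, 15 blue).
trials, both_blue_and_red_third, red_third = 1_000_000, 0, 0
for _ in range(trials):
    box = ["R"] * 10 + ["B"] * 15
    random.shuffle(box)
    discarded, third = box[:2], box[2]
    if third == "R":
        red_third += 1
        if discarded == ["B", "B"]:
            both_blue_and_red_third += 1

print(both_blue_and_red_third / red_third)   # ≈ 0.38, i.e. (10/23 * 105/300) / 0.4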


Problem 1

A doctor knows that the disease meningitis causes the patient to have a stiff neck (S), say 50% of the time. The doctor also knows some unconditional facts: the prior probability that a patient has meningitis (M) is 1/50,000, and the prior probability that any patient has a stiff neck is 1/20. What is the probability that a patient with a stiff neck has meningitis?

Solution

• P(S|M) = 0.5
• P(M) = 1/50,000 = 0.00002
• P(S) = 1/20 = 0.05
• To find P(M|S)
• P(M|S) = P(M,S)/P(S) = P(S|M)*P(M)/P(S) = 0.5 * 0.00002 / 0.05 = 0.0002 (Ans)
February 17, 2018 IS ZC464 2 February 17, 2018 IS ZC464 3

Problem 2

• A science competition had students from three schools A, B and C. The numbers of students who participated from schools A, B and C respectively are 50, 80 and 70. The probability of a student qualifying (Q) the competition given his/her school is P(Q|A) = 0.6, P(Q|B) = 0.25 and P(Q|C) = 0.45.
Q1. Compute the probability of qualifying, P(Q).
Q2. Compute the probability of a student belonging to school B given that he/she qualified, i.e. compute P(B|Q)

Given

• Prior probabilities
 P(A) = 50/200 = 0.25
 P(B) = 80/200 = 0.4
 P(C) = 70/200 = 0.35
• Conditional probabilities
 P(Q|A) = 0.6,
 P(Q|B) = 0.25 and
 P(Q|C) = 0.45
February 17, 2018 IS ZC464 4 February 17, 2018 IS ZC464 5

Solution (for computing P(Q))

• Joint occurrences of Q:
 – Student qualifies and is from school A
 – Student qualifies and is from school B
 – Student qualifies and is from school C
• P(Q) = P(Q,A) + P(Q,B) + P(Q,C)   {Marginalization rule}
• P(Q) = P(Q|A)*P(A) + P(Q|B)*P(B) + P(Q|C)*P(C)   {using the product rule for each term}
• P(Q) = 0.6*0.25 + 0.25*0.4 + 0.45*0.35 = 0.150 + 0.100 + 0.1575 = 0.4075

Solution (for computing P(B|Q))

• Since P(B|Q) = P(Q|B)*P(B)/P(Q)   {Bayes' rule - refer slide 44, session 3}
 P(B|Q) = 0.25*0.4/0.4075 = 0.2454
February 17, 2018 IS ZC464 6 February 17, 2018 IS ZC464 7
Problem 3

• Consider four 2-dimensional feature vectors belonging to class 1: <1,5>, <1,3>, <2,3>, <3,4>, and five feature vectors belonging to class 2: <1,1>, <2,1>, <3,0>, <3,2>, <4,2>
• Classify the test feature vector <4,3>.

Example

• Consider the following training data
• Class 1: <1,5>, <1,3>, <2,3>, <3,4>
• Class 2: <1,1>, <2,1>, <3,0>, <3,2>, <4,2>
• Can view a decision boundary as a line separating the two classes
• Slope m = 3/5
• Intercept = 1
• The equation of the decision boundary (which is a line) is x2 = (3/5)x1 + 1
• Hypothesis h(x) = x2 - (3/5)x1 - 1

February 17, 2018 IS ZC464 8 February 17, 2018 IS ZC464 9

Define the classifier

• h(x) = x2 - (3/5)x1 - 1
• Define the threshold T appropriately such that
 If h(x) > T then the test data x belongs to class 1
 If h(x) < T then the test data x belongs to class 2
 Else report misclassification.
• Let T = 0

Example

• Test vector x = <x1,x2> = <4,3>
• Compute the hypothesis h(x) = x2 - (3/5)x1 - 1
• Since h(x) = 3 - ((3/5)*4) - 1 = 3 - 2.4 - 1 = -0.4 < 0 (threshold)
• Therefore the test data belongs to class 2
• Can also verify visually in the figure.
February 17, 2018 IS ZC464 10 February 17, 2018 IS ZC464 11
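A minimal Python sketch of this decision rule (my own illustration, using the h(x) and threshold from the slides) is given below; the second call uses the class-work test vector from the next slide.

# Linear decision rule from Problem 3: h(x) = x2 - (3/5) x1 - 1, threshold T = 0.
def classify(x1, x2, threshold=0.0):
    h = x2 - (3 / 5) * x1 - 1
    if h > threshold:
        return 1   # class 1 lies above the line x2 = (3/5) x1 + 1
    if h < threshold:
        return 2   # class 2 lies below the line
    return None    # on the boundary: report misclassification / undecided

print(classify(4, 3))    # 2  (h = -0.4)
print(classify(2, 5))    # 1  (h = 2.8)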

Class work

• Hypothesis h(x) = x2 - (3/5)x1 - 1
• Threshold (T) = 0
• Classify test data <2,5>

Problem 4

• Training data for a two-class classification problem (say vehicle recognition among bus and car) is defined using two attributes: size and engine type.
• Q1. The training data observes 8 times out of 20 that the vehicle is a bus and 12 times that it was a car. Compute the total information content of the training data.

February 17, 2018 IS ZC464 12 February 17, 2018 IS ZC464 13


Solution (Q2)

• If the attribute size splits the training data into two parts on its values 'big' and 'small' as shown in the figure, then compute its Gain.
(bus) r3, r4, r5, r7, r11, r12, r13, r20
(car) r1, r2, r6, r8, r14, r9, r10, r15, r16, r17, r18, r19

size
big:   (bus) r5, r7, r11, r12, r13, r20    (car) r8, r17, r9, r10
small: (bus) r3, r4                        (car) r1, r2, r6, r15, r16, r14, r18, r19

Recall: Generalize the splitting

• Let the attribute A divide the entire training set into sets E1, E2, …, Ev, where v is the total number of values A can be tested on.
• Assume that each set Ei contains pi positive examples and ni negative examples
• Remainder(A) = Σ_{i=1}^{v} (pi+ni)/(p+n) I(pi/(pi+ni), ni/(pi+ni))

February 17, 2018 IS ZC464 14 February 17, 2018 IS ZC464 15

Gain(A)

Gain(A) = I(p/(p+n), n/(p+n)) – Remainder(A)
The heuristic to choose attribute A from a set of all attributes is the maximum gain
Compute
1. Gain(Patrons)
2. Gain(Type)

Solution (Q1)

• The training data observes 8 times out of 20 that the vehicle is a bus and 12 times that it was a car. Compute the total information content of the training data.
• p = 8 and n = 12
I(p/(p+n), n/(p+n)) = - (p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))
I(8/20, 12/20) = I(0.4, 0.6)
= -0.4 log2(0.4) – 0.6 log2(0.6)
= (1/log10 2) { -0.4 log10(0.4) – 0.6 log10(0.6) }
= (1/0.3010) { (-0.4)*(-0.3979) + (-0.6)*(-0.2218) }
= (3.3222) { 0.15916 + 0.13308 }
= 3.3222 * 0.29224
= 0.97087 bits

Solution (Q2)

• Now we have
• p1 = 6, n1 = 4
• p2 = 2, n2 = 8
(bus) r3, r4, r5, r7, r11, r12, r13, r20
(car) r1, r2, r6, r8, r14, r9, r10, r15, r16, r17, r18, r19

size
big:   (bus) r5, r7, r11, r12, r13, r20    (car) r8, r17, r9, r10
small: (bus) r3, r4                        (car) r1, r2, r6, r15, r16, r14, r18, r19

Remainder computation

• p1 = 6, n1 = 4; p2 = 2, n2 = 8
• Remainder(A) = Σ_{i=1}^{v} (pi+ni)/(p+n) I(pi/(pi+ni), ni/(pi+ni))
 = (p1+n1)/(p+n) I(p1/(p1+n1), n1/(p1+n1)) + (p2+n2)/(p+n) I(p2/(p2+n2), n2/(p2+n2))
 = (6+4)/(8+12) I(6/(6+4), 4/(6+4)) + (2+8)/(8+12) I(2/(2+8), 8/(2+8))
 = (10/20) I(0.6, 0.4) + (10/20) I(0.2, 0.8)
 = 0.5 * (-0.6 log2(0.6) – 0.4 log2(0.4)) + 0.5 * (-0.2 log2(0.2) – 0.8 log2(0.8))
(Computation left as home work)
February 17, 2018 IS ZC464 18 February 17, 2018 IS ZC464 19
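A short Python check of this home-work computation (my own sketch, same definitions as the slides) is given below.

import math

def I(p, n):
    """Information content of a (p, n) split, in bits."""
    t = p + n
    return -sum((c / t) * math.log2(c / t) for c in (p, n) if c)

# Attribute 'size': big -> (6 bus, 4 car), small -> (2 bus, 8 car); overall (8 bus, 12 car).
remainder = (10 / 20) * I(6, 4) + (10 / 20) * I(2, 8)
print(round(remainder, 3), round(I(8, 12) - remainder, 3))   # remainder ≈ 0.846, Gain(size) ≈ 0.124 bits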


Problem 5

Compute P(cavity), P(cavity, toothache), P(toothache)

              toothache           ¬toothache
              catch    ¬catch     catch    ¬catch
Cavity        0.06     0.19       0.05     0.10
¬Cavity       0.09     0.01       0.22     0.28

Solution

• Class work

February 17, 2018 IS ZC464 20 February 17, 2018 IS ZC464 21

Questions?

Best Wishes for your Mid Semester test

February 17, 2018 IS ZC464 22 February 17, 2018 IS ZC464 23
