
BITS Pilani
Machine Learning (IS ZC464)
Session 1: Introduction

What is learning?
• Learning (for humans) is gaining experience from the past.
• A machine can be programmed to gather experience in the form of facts, instances, rules etc.
• A machine with learning capability can predict about a new situation (seen or unseen) using its past experience.
• Examples:
  • As we humans can tell a person's name on seeing him/her a second or fifth time, a machine can also do that.
  • As we humans can recognize a person's voice even without seeing the person's face, a machine can also be made to learn to do the same.

Class Experiment: Training
• Let
  • AA denote 5
  • BB denote 6
  • AAA denote 50
  • BBB denote 60
  • AAAA denote 500
  • BBBB denote 600
• Can you find out the equivalent numerical value of AAAAA? 5000: yes/no?
• Or of AABB? Not yet trained…

Learning pronunciation (by a young kid)
• Training
  • Cat (ae sound)
  • Pot (aw sound)
  • Pat (ae sound)
  • Tap (ae sound)
  • Cot (aw sound)
• Testing
  • How do you pronounce "not"? My students know the answer.
  • How do you pronounce "check"? The kid is not trained yet, hence learning has not reached this level.

Learning example
• Training
  A coin is tossed 10 times and it is observed that it fell 7 times with head on top and 3 times with tail on top.
  [observe that you are learning as you read the above]
• Testing
  Will you get head next? (Hypothesis: get the head on top)
  Yes, most probably.
  What is the chance that the next toss will be head? (Hypothesis: next toss is head)
  P(next toss is head | previous 10 tosses had 7 heads)

Relate human learning with that of machine learning
• Human
  Gains experience from day to day activities and gains the ability to predict.
• Machine
  Gets trained with data (data can be text, image, sound, rules etc.) and becomes able to predict.
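A minimal sketch (not part of the slides) of the relative-frequency estimate the coin example points at; the count of 7 heads in 10 tosses is taken from the slide.

```python
# Minimal sketch: estimating P(head) from the observed tosses by relative
# frequency, which is the maximum-likelihood estimate for a Bernoulli coin.
heads, tails = 7, 3                    # training data from the slide
p_head = heads / (heads + tails)       # 7 / 10 = 0.7
print(f"P(next toss is head | 7 heads in 10 tosses) ~ {p_head:.2f}")
```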


Why Machine Learning?
• Humans have limitations in terms of accessibility and computational efficiency.
• Machine learning is required in
  – Navigation on Mars
  – Avalanche areas, to detect people buried under the snow
  – Speech recognition etc.
• Machine learning is not required in
  – General computations such as payroll
  – Computation of the sum of numbers
  – Counting etc.

Machine Learning and Artificial Intelligence
• Machine Learning is a branch of Artificial Intelligence (AI) in which the intelligent system learns from its environment.
• AI systems include intelligence of different types such as reasoning, planning, search and game playing, learning etc., of which learning is specific to Machine Learning systems.

What is Artificial Intelligence?
• It is the computational intelligence of computers that enables them to behave and act human-like.
• An artificially intelligent system possesses one or more of the human capabilities of reasoning, thinking, planning, learning, understanding, listening and responding.

Common attributes of the human mind
• Perception/Vision/Recognition,
• Reason,
• Imagination,
• Memory,
• Emotion,
• Attention, and
• A capacity for communication

Human brain

Understanding the human brain
• Thought is a mental activity which allows human beings to make sense of things in the world, and to represent and interpret them in ways that are significant.
• Thinking involves the symbolic or semantic mediation of ideas or data, as when we form concepts, engage in problem solving, reason and make decisions.


Understanding the human brain
• Memory is the ability to preserve, retain, and subsequently recall knowledge, information or experience.
• Imagination is the activity of generating or evoking novel situations, images, ideas etc. in the mind.

Artificial Intelligence: an intelligent car navigation system [an example]
• A system to navigate a car to the airport works on its vision, enabled using a camera mounted at the front of the car.
• The system sees the lane limits and the vehicles on the way and keeps the car from colliding. [Vision]
• It follows the road directions.
• It also follows the road rules.
• The system learns to handle unforeseen situations. For example, if the traffic flow is restricted on a portion of the road temporarily, the system takes an alternative path. [Learning]

More intelligence can be expected
• The system listens to the person sitting in the car asking to stop at a nearby hotel for tea, looks around to find a hotel, keeps travelling till it finds one and stops the car. [Speech Recognition, Vision]
• Understands the mood of the person and starts music to suit that mood. [Facial Expression]
• Can answer queries such as "how far is Pilani?", "what is the time?", "can I sleep for an hour?", "please wake me up at … in the morning". [Natural Language Processing]

Some of the existing intelligent systems
• Watson: a question answering machine
• Deep Blue: a chess program that defeated the world chess champion Garry Kasparov

Deep Blue: Chess Program
[Image of the Deep Blue match. Source: Google Images]

Other intelligent systems
• Smart home
  – Lights switch off if there is no one in the room
  – Curtains pulled open at sunrise
  – The dust bin is emptied before it overflows
  – Smart water taps, toilets etc.
• Smart office
  – Automatic meeting summary
  – Speaker recognition and summary generation
• Automatic answering machine
Other intelligent machines
• An airplane cockpit can have an intelligent system that takes automatic control when hijacked [context and speech understanding, NLP, vision]
• Medical diagnosis systems trained with expert guidance can diagnose a patient's disease based on X-ray and MRI images and other symptoms
• Automated theorem proving
• General problem solver

AI Techniques
• The general problem of simulating (or creating) intelligence has been broken down into a number of specific sub-problems
  – Reasoning and deduction
  – Knowledge Representation
  – Planning
  – Learning
  – Natural Language Processing
  – Motion
  – Perception

Intelligent Agent
• An intelligent agent is a system that perceives its environment and takes actions which maximize its chances of success.
• Artificial Intelligence aims to build intelligent agents or entities.

Intelligent agent
• An agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators.
• Human agent vs. machine agent
  • Differ in sensor technology
    • Ear, nose, eye, touch, smell (human)
    • Speaker, camera, infrared sensors, smoke sensors, etc. (machine)
  • Differ in their capacity to perceive the environment
  • Differ in acting upon the environment through actuators

Environment
• The parameters that are required for reasoning, thinking, perception and so on
• Example (for humans)
  • A one year old child's environment: home, family members, toys
  • A school-age child's environment: home, family members, school, teachers, books, playmates
• Example (for machines)
  • A washing machine intelligent agent's environment: dirt, clothes, detergent etc.
  • An intelligent automobile robot: parts of the automobile and their exact description

How does an intelligent agent work in a given environment?
• It perceives the environment.
• It acts based on experience and the query.
• It responds in terms of adding to the knowledge base.
• Thus it must learn from the history of percepts.


Machine Learning Applications
• Speech recognition
• Automatic news summary
• Spam email detection
• Credit card fraud detection
• Face recognition
• Function approximation
• Stock market prediction and analysis
• Etc.

Machine Learning
• A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. (Tom Mitchell)

Learning From Observations
• Learning Element:
  – responsible for making improvements
• Performance Element:
  – responsible for selecting external actions
• The learning element uses feedback from the critic on how the agent is doing and determines how the performance element should be modified to do better in the future.

Design of a Learning Element
• Affected by three major issues:
  – Which components of the performance element are to be learned
  – What feedback is available to learn these components
  – What representation is used for the components.

Types of feedback for learning
• Supervised
  – Inputs and outputs
• Unsupervised
  – Inputs available, but no specific output
• Reinforced
  – Reward or penalty

Learning Algorithms
• Decision Trees
• Neural Networks based learning algorithms
• Ensemble Learning
• Bayes classifier
• EM (Expectation Maximization) algorithm
• Support Vector Machines etc.


Inductive Learning using Decision Trees: an example of learning to identify an object

  fruit? = yes → color: red → apple, yellow → mango
  fruit? = no  → vegetable? = yes → taste: bitter → bittergourd, sour → lemon
                 vegetable? = no  → unknown

Decision Tree
• A decision tree takes as input an object or situation described by a set of attributes and returns a decision.
• This decision is the predicted output value for the input.
• The input attributes can be discrete or continuous.
• Classification learning:
  • Learning a discrete valued function is called classification learning.
• Regression:
  • Learning a continuous function is called regression.
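A hedged sketch of the fruit/vegetable tree above written as nested tests in Python; the attribute names (is_fruit, color, is_vegetable, taste) are illustrative, only the structure mirrors the slide's tree.

```python
# Sketch: the slide's decision tree as nested tests (attribute names assumed).
def classify(is_fruit, color=None, is_vegetable=None, taste=None):
    if is_fruit:                       # root test: fruit?
        if color == "red":
            return "apple"
        if color == "yellow":
            return "mango"
        return "unknown"
    if is_vegetable:                   # non-fruit branch: vegetable?
        if taste == "bitter":
            return "bittergourd"
        if taste == "sour":
            return "lemon"
        return "unknown"
    return "unknown"

print(classify(True, color="yellow"))                      # -> mango
print(classify(False, is_vegetable=True, taste="sour"))    # -> lemon
```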

Decision Tree
• A decision tree reaches its decision by performing a sequence of tests.
• All non-leaf nodes lead to partial decisions and assist in moving towards a leaf node.
• Leaf nodes are the decisions based on the properties satisfied at the non-leaf nodes on the path from the root node.

Decision tree
• Leaf nodes depict the decision about an object whose attribute values fall on the path from the root node.
• Each example that participates in the construction of the decision tree is called training data, and the complete set of training data is called the training set.

Limitations of Decision Tree Learning
• The tree memorizes the observations but does not extract any pattern from the examples.
• This limits the capability of the learning algorithm in that the observations do not extrapolate to examples it has not seen.

Attribute Creation/Selection in various problem domains (recognition)
[Image from Google Images]
• Obtain the most suitable features/attributes
  • Color
  • Shape
  • Number of wheels
  • Capacity
  • Rear mirrors
  • Number of headlights
• Availability of information
  • Images
  • Actual data
  • Attributes will differ
Fruit recognition
• T: fruit recognition
• P: recognition accuracy
• E: experience by training
• Attributes
  • Color
  • Texture
  • But not shape

Face Recognition
• First specify the problem clearly: do you want to discriminate amongst the faces shown below, or to put them all in one category?
• Training examples of a person, and test images
[Face images from AT&T Laboratories, Cambridge UK: http://www.uk.research.att.com/facedatabase.html]

Human face recognition
• The training set can be a set of face images with varying expressions, illumination, pose etc.
• T: face recognition
• P: recognition accuracy / rejection accuracy
• E: experience by training
• Humans are very quick in recognizing the face of a person.
• Analyze your brain's capacity for remembering the number of features of a person's face.
• An intelligent system will be said to have a capability of learning (human-like) if it recognizes unseen data.

Selection of attributes
• Number of eyes  ✗
• Hair?
• Spectacles
• Nose line
• Chin shape
• Number of ears
• Wrinkles
• Male?
• Ratio of lip length and eye length
• What else?
• Mathematical features
  • DCT coefficients
  • Pixel values
  • Average pixel intensity

Learning of a function from given sample data / Generalization in Function Approximation
• T: prediction of the y-value for a given x-value
• P: least error
• E: experience by training
• Candidate hypotheses:
  1. Straight line
  2. Sinusoidal curve
  3. Other higher order polynomial
• Generalization: the NN generalizes if it answers "What is f(-0.25)?" or "f(0.001)?" correctly.
• Sample data (points on the curve Y = ±√(1 − X²)):

  X     Y
  1     0
  0     1
  0    -1
  0.6   0.8
  0.6  -0.8
 -0.6   0.8
 -0.6  -0.8
 -1     0

[Plot of the sample points in the X-Y plane]


Generalization in Classification Problem
• Generalization: the classifier generalizes if a test feature vector can be correctly classified.
[Plot: labelled training points in the X-Y plane with a test vector to be classified]

Traditional vs. Machine Learning
• Traditional approach: Input Data + Program → Output
• Machine learning: Input Data + Output → Program

How is a program realized as an output?
• The program is characterized by its parameters.
• For example:
  – A neural network classifier is represented by its weights
  – Weights are obtained by analyzing input and output data
  – A decision tree is characterized by its attributes, obtained by training on input and output classes

Neural Networks
• Mathematical models representing massively parallel machines
• Models inspired by the working of the human nervous system
• Have a number of neurons performing a task similar to a human neuron
• Each neuron triggers on the received input according to the weight.
• A neural network captures the environment it has to learn in terms of the weights.

A Neuron
• A mathematical neuron is a processing unit capable of receiving inputs from single or multiple neurons and triggering a desired response.
• Each neuron has an associated activation function which takes as input the weighted sum of the inputs coming to the neuron and triggers a response depending on the associated threshold.
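A minimal sketch (assumed, not the course's code) of the mathematical neuron just described: a weighted sum of the inputs passed through a threshold activation.

```python
# Sketch of a threshold neuron: fire (1) if the weighted sum of the inputs
# crosses the threshold, else stay silent (0).
def neuron(inputs, weights, threshold):
    s = sum(x * w for x, w in zip(inputs, weights))   # weighted sum
    return 1 if s >= threshold else 0

print(neuron([1.0, 0.5], [0.8, 0.4], threshold=1.0))  # -> 1
print(neuron([0.2, 0.1], [0.8, 0.4], threshold=1.0))  # -> 0
```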
BITS Pilani
Machine Learning (IS ZC464)
Session 2: Training and Testing in Learning Systems
Learning of a function from given sample data: Prediction
• Recall learning: a machine with learning capability can predict about a new situation (seen or unseen) using its past experience.
• Training data (x, y) pairs:

  X   Y
  1   1
  5   5
  2   2
  4   4
  3   3

• Prediction: a straight line
  • Given values of x and y, predict the value of y for x = 71.
• Prediction is based on learning the relationship between x and y.
• Training data is the collection of (x, y) pairs.
• Testing data is simply the value of x for which the value of y is required to be predicted.

What did the system learn?
• y = x
• What is its generalization ability?
• Most accurate, or we can say 100%.
• What if the data used to train the system changes slightly? The machine can still be made to learn.

Learning of a function from given sample data: straight line learning
• Y = f(x)
• Which line fits the data best? The machine must learn on its own which is the best fit.
• A line is represented by two parameters: slope and intercept.
• How? Using the data, i.e. the (x, y) pairs, known as training data.
[Plot of the (x, y) points with several candidate straight lines]

Understanding ERROR
• Consider an example of using height and weight:

  Height (in cm)   Weight (in kg)
  145              48
  165              68
  155              62
  160              65
  170              75
  163              67
  171              76
  167              72
  159              65

[Scatter plot of weight (50 to 80 kg) against height (145 to 170 cm)]
Understanding ERROR
• Which line (hypothesis) fits the given data best?
[A sequence of scatter plots of the height-weight data, each with a different candidate straight line drawn through the points]
In 2D space the line parameters are two
• Slope and intercept
• They can be called w1 and w2.
• In order to find the line that best fits the given data, we must find w1 and w2 such that the sum of the squared errors is minimum.

A simple example to understand ERROR
• Which line (hypothesis) fits the given data best?
[Scatter plot of a small set of points with candidate lines]
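A sketch, assuming NumPy is available, of fitting the best straight line w1·x + w2 to the height-weight table above by minimising the sum of squared errors; np.polyfit solves exactly this least-squares problem.

```python
# Sketch: least-squares straight-line fit to the height-weight data above.
import numpy as np

height = np.array([145, 165, 155, 160, 170, 163, 171, 167, 159])  # cm
weight = np.array([48, 68, 62, 65, 75, 67, 76, 72, 65])           # kg

w1, w2 = np.polyfit(height, weight, deg=1)   # slope and intercept
pred = w1 * height + w2
sse = np.sum((weight - pred) ** 2)           # sum of squared errors

print(f"slope w1 = {w1:.3f}, intercept w2 = {w2:.3f}, SSE = {sse:.2f}")
print(f"predicted weight at height 158 cm: {w1 * 158 + w2:.1f} kg")
```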

Compute the Squared Mean Error (line is y = x)
• Sum of squares (S) = 1·1 + 0 + 1·1 + 1·1 + 2·2 + 1·1 + 2·2 + 1·1 = 13
• SME = sqrt(S) / total no. of observations = √13 / 8 ≈ 0.45
• The error will be different if the line's slope is different (the line passes through the origin).
[Plot of the data points and the line y = x]

Plotting error when y = f(x)
• Therefore we can say that the error is a function of the slope. If the slope is represented by w, then the error is a function of w: E(w).
• Hypothesis function (linear in one variable): h_w(x) = wx
• At some value of w, E(w) is minimum; that w corresponds to the minimum error.
[Plot of E(w) against w, with the w corresponding to the minimum error marked]

Understanding the error surface
• Consider m observations <x1, y1>, <x2, y2>, …, <xm, ym>.
• A hypothesis h_w(x) approximates the function that fits best to the given values of y.
• There is likely to be some error corresponding to each observation i.
• The magnitude of this error is yi − h_w(xi).
• The objective is to find the w that minimizes the sum of squared errors:
  E_min(w) = min_w Σ_i (yi − h_w(xi))²

Plotting error when y = f(x1, x2)
• Hypothesis function (linear in two variables): h_w(x) = w1·x1 + w2·x2
• NOTE: each pair <w1, w2> corresponds to a line given by the hypothesis equation, while only one such pair corresponds to the line that best approximates the given training data.
[Surface plot of E(w) over the (w1, w2) plane, with the <w1, w2> corresponding to the minimum error marked]
Another possible surface: plotting error when y = f(x1, x2)
• Hypothesis function (linear in two variables): h_w(x) = w1·x1 + w2·x2
• The error surface may have local minima in addition to the global minimum.
[Surface plot of E(w) over (w1, w2) showing a local minimum and the global minimum]

Difficult to visualize when y = f(x1, x2, x3)
• Hypothesis function (linear in three variables): h_w(x) = w1·x1 + w2·x2 + w3·x3
• E(w) is now a function of <w1, w2, w3> and cannot be plotted directly.
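An illustrative sketch (the data here is made up) of searching the error surface E(w) numerically with gradient descent for the one-parameter hypothesis h_w(x) = w·x.

```python
# Sketch: gradient descent on E(w) = sum_i (y_i - w*x_i)^2 for h_w(x) = w*x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1]         # roughly y = x with noise (illustrative)

w, lr = 0.0, 0.005                      # initial weight, learning rate
for _ in range(200):
    # dE/dw = -2 * sum_i x_i * (y_i - w*x_i)
    grad = -2 * sum(x * (y - w * x) for x, y in zip(xs, ys))
    w -= lr * grad                      # move downhill on E(w)

error = sum((y - w * x) ** 2 for x, y in zip(xs, ys))
print(f"w at (near) minimum error: {w:.3f}, E(w) = {error:.3f}")
```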

Learning of a function from given sample data: polynomial curve learning / Generalization in Function Approximation
• T: prediction of the y-value for a given x-value
• P: least error
• E: experience by training
• Candidate hypotheses: straight line, sinusoidal curve, other higher order polynomial
• Generalization: the NN generalizes if it answers "What is f(-0.25)?" or "f(0.001)?" correctly.
• Sample data (points on the curve Y = ±√(1 − X²)):

  X     Y
  1     0
  0     1
  0    -1
  0.6   0.8
  0.6  -0.8
 -0.6   0.8
 -0.6  -0.8
 -1     0

Generalization in Classification Problem
• Generalization: the classifier generalizes if a test feature vector can be correctly classified.
[Plot: labelled training points in the X-Y plane with a test vector to be classified]

Traditional vs. Machine Learning
• Traditional approach: Input Data + Program → Output
• Machine learning: Input Data + Output → Program


BITS Pilani
Machine Learning (IS ZC464)
Session 3: Uncertainty Handling in the Real World using Probability Theory

Uncertainty in the real world
• Uncertainty in reaching New Delhi airport in 5 hours from Pilani
  – The cab engine may or may not work at any moment
  – The route is diverted due to a procession on the way
  – The road condition is unexpectedly bad
  – A tire needs replacement
  Etc.
• A person having stomach ache can be told that he is suffering from an ulcer, while in actuality it may be gastritis or overeating.

Example
If a person has pneumonia, then the person
  has fever
  is pale
  has cough
  has a low white blood cell count
• Certainty exists in obtaining the symptoms if the disease is confirmed:
  Disease → symptoms
• For the converse, it is uncertain that if a person has fever and has cough then the person has pneumonia; but if all symptoms are known then the disease can be inferred:
  Fever(p) ∧ Pale(p) ∧ Cough(p) ∧ LowWBC(p) → Pneumonia(p)

Types of uncertainty
• Disease → symptoms
  pneumonia may have other symptoms too
• Symptoms → disease
  these symptoms may be common to other diseases as well
  [but if all possible symptoms can be observed and are the same for all patients, then more definiteness can be introduced]

Real world scenario
• It is impossible to list all relevant components of the real world.
• Many of the components behave with some uncertainty.
• Due to system hardware limitations, representing all components of any real world situation may not be possible.

Class assignment
• Analyze the weather on a day
  – It is cloudy (how much?)
    • Depends on individual belief
    • Belief can be based on experience
    • Experience may count on favorable situations
  – The day is humid
    • Is the humidity sufficient to cause rain?
  – Is it certain that the clouds will rain?
    • The clouds may rain if certain other parameters are favourable.


Conventional reasoning
• Based on three assumptions
  – Predicate descriptions must be sufficient with respect to the application domain
  – The information base is consistent
  – Through the inference rules, the known information grows monotonically
• Conventional methods follow closed world assumptions.

Closed World Assumptions
– The closed world assumptions are based on a minimal model of the world.
– Any predicate not known to hold is taken to be false.
  Example: whether two cities are connected by a plane flight.
  • Check the list; if there is no direct flight, then we may infer that the cities are not connected.
– Exactly those predicates that are necessary for a solution are created.
– The closed world assumption affects the semantics of negation in reasoning.

Example: conventional reasoning
• Human(p) → Mammal(p) ∧ Intelligent(p) ∧ Kind(p) ∧ Legs(p) ∧ Eyes(p) ∧ …
• Mammal(John) ∧ Legs(John) ∧ Kind(John) ∧ Eyes(John)
• What can be said about John's intelligence?
  • Is John intelligent?
  • Is he not?
  • Does lack of knowledge mean that we are not sure whether John is intelligent, or that we are sure John is not intelligent?

Uncertainty in First Order Logic
• ∀x Bird(x) → Fly(x)
• A penguin is a bird. Does it fly?
• The above rule does not hold good for all birds (minimal world assumption).
• How can we generalize the rule? There can be a large number of predicates constructed to represent a larger world.
• ∀x (Bird(x) ∧ ¬Abnormal(x) → Fly(x))
• The uncertainty lies in the predicate Abnormal.

Conventional reasoning
• Conventional logic is monotonic.
• A set of predicates constitutes the knowledge base (KB).
• The size of the KB keeps increasing as new knowledge is added.
• Pure methods of reasoning cannot handle a KB with incomplete or uncertain knowledge.

Nonmonotonic reasoning systems
• Address the problem of changing beliefs.
• Make the most reasonable assumptions in light of uncertain information.


Handling uncertain information using probability theory
• Probability theory deals with degrees of belief.
• It assigns a numerical degree of belief between 0 and 1.
• It handles the uncertainty that comes from laziness and ignorance.
• The belief could be derived from
  – Statistical data
  – General rules
  – Combination of evidence sources

Example:
• ∀p Symptom(p, toothache) → Disease(p, cavity)
• The above can, for example, be said to carry the belief that 8 out of 10 patients have a cavity when they have toothache.
• The probability associated with the above is 0.8.
• The belief may change if some more patients arrive with pain and have different diseases.
• A probability of 0.8 does not mean that the statement is 80% true, but that there is an 80% degree of belief in it.
• Degree of belief is different from degree of truth.

Evidences
• The probability that a patient has a cavity being 0.8 depends on the agent's belief and not on the world.
• These beliefs depend on the percepts the agent has received so far.
• These percepts constitute the evidence on which probability assertions are based.
• As new evidence is added, the probability changes. This is known as conditional probability.

Representing uncertain knowledge using probability
• Probability theory uses a language that is more expressive than propositional logic.
• The basic element of the language is the random variable.
• A random variable represents a part of the real world whose status is initially unknown.
• A proposition asserts that a random variable has a particular value drawn from its domain.

Types of random variables
• Boolean
  – domain is {true, false}
  – Example: Cavity = true
• Discrete
  – domain is a finite or countable set of values
  – Example: from the domain {sunny, cloudy, rainy, snow} the variable may take weather = snow
• Continuous
  – domain takes values from the real numbers

Atomic Events
• An atomic event is the complete specification of the state of the real world about which the agent is uncertain.
• Example:
  – Let the boolean random variables cavity and toothache constitute the real world; then there are 4 atomic events
    i.   (Cavity = true) ∧ (Toothache = true)
    ii.  (Cavity = true) ∧ (Toothache = false)
    iii. (Cavity = false) ∧ (Toothache = true)
    iv.  (Cavity = false) ∧ (Toothache = false)
Atomic events
• Mutually exclusive
• The set of all possible atomic events is exhaustive (their disjunction is true)
• Any proposition is logically equivalent to the disjunction of all atomic events that entail the truth of the proposition.

Prior probability
• The prior probability associated with a proposition is the degree of belief in the absence of any other information.
• Example
  – P(cavity = true) = 0.1
  – P(cavity) = 0.1 [this is estimated based on the available information]
  – As more information becomes available, the concept of conditional probability will be used to determine the probability value.

Computing Probability
• A bag contains 8 balls of which 6 are orange and 2 are green.
• A ball is chosen randomly from the bag.
• What is the probability that the ball is green? Answer = 2/8
• What is the probability that the ball is orange? Answer = 6/8

Probability Theory
• Apples and oranges kept in two bags of different colors.

Computing Probability
• If a ball is to be chosen randomly from a bag, and a bag is chosen randomly, then how likely is it that the red bag is selected? Computed through experiments and multiple trials, or known a priori.
• What is the probability that the ball selected from the red bag is green?
• What is the probability that the ball selected from the blue bag is orange?

Examples
1. Rolling a die: outcomes
   S = {1, 2, 3, 4, 5, 6}
   E = the event that an even number is rolled
     = {2, 4, 6}
Joint Probability
• This finds out how likely it is for two or more events to happen at the same time.
• Example
  – A patient has both a cavity and toothache.
  – The joint probability is represented as P(cavity ∧ toothache) or P(cavity, toothache).

Prior Probability Distribution
• Assume a discrete variable weather
  – P(weather = sunny) = 0.4
  – P(weather = rainy) = 0.1
  – P(weather = cloudy) = 0.1
  – P(weather = snow) = 0.2
• The distribution is
  – P(weather) = {0.4, 0.1, 0.1, 0.2}

Joint probability distribution
• P(weather, cavity) has 4×2 (= 8) atomic events
• P(cavity, toothache, weather) has 2×2×4 (= 16) atomic events
• Any probabilistic query can be answered using the joint probability distribution.

Conditional Probability
• The intelligent agent may get new information about the random variables that make up the domain.
• The probabilities are then recomputed.
• Example
  – A bag/urn has 12 red balls and 8 blue balls.
  – On the first trial, the probability of getting a red ball = 12/20.
  – On the second trial (after a red ball has been drawn), the probability of getting a red ball = 11/19.

Axioms of Probability (Kolmogorov's Axioms)
• For any proposition a
  – 0 <= P(a) <= 1
• True propositions have probability 1 and false propositions have probability 0
  – P(true) = 1, P(false) = 0
• P(a ∨ b) = P(a) + P(b) − P(a ∧ b)

Conditional Probability
• Represented as P(a|b)
• P(a|b) = P(a ∧ b) / P(b) for P(b) > 0
• Also
  P(a ∧ b) = P(a|b) P(b)   (Product Rule)


P(a) = 1 – P(a) Proposition
• Proof • The probability of a proposition is equal to the
a Λ a = false sum of the probabilities of the atomic events
a V a = true in which it holds.
– P(a) =  P(ei) over all atomic events

Using the third axiom of probability


P(a V a) = P(a) + P(a) – P(a Λ a)

==> P(true) = P(a) + P(a) – P(false)


==> P(a) = 1 – P(a)
January 21, 2018 IS ZC464 31 January 21, 2018 IS ZC464 32

Marginal Probability
• P(Y) = Σz P(Y, z)   (sum over all joint probabilities of Y with z)   [Marginalization Rule]
• P(Y) is the distribution over Y obtained by summing out all the other variables from any joint distribution containing Y.
• Example:
  – P(cavity) = P(cavity, toothache) + P(cavity, ¬toothache) = 0.25 + 0.15 = 0.4

Inference using Full Joint Distributions
• The joint distribution constitutes the complete knowledge base.
• Example
  – Let there be 2 boolean random variables representing the real world, say cavity and toothache:

              toothache   ¬toothache
   Cavity       0.25        0.15
   ¬Cavity      0.10        0.50

  P(cavity) = 0.25 + 0.15 = 0.4
  P(toothache) = 0.25 + 0.10 = 0.35

Conditioning
• P(Y) = Σz P(Y, z) = Σz P(Y|z) P(z)   (using the product rule)
• Marginalization and conditioning are useful rules for handling probability expressions.

Computing conditional probabilities (only 2 random variables)
• P(Cavity | Toothache)
  = P(cavity ∧ toothache) / P(toothache)
  = 0.25 / 0.35 = 0.7142
• P(¬Cavity | toothache)
  = P(¬Cavity ∧ Toothache) / P(toothache)
  = 0.10 / 0.35 = 0.2857
Normalization Constant
• The normalization constant ensures that the conditional probabilities of the events add up to 1.
• Example
  – P(cavity | toothache) + P(¬cavity | toothache) = 0.7142 + 0.2857 = 0.9999 ≈ 1
• Let α denote the normalization constant.
  – Then the conditional probability
    P(a|b) = P(a ∧ b) / P(b) for P(b) > 0
    becomes
    P(a|b) = α P(a ∧ b)

Inference using Full Joint Distributions
– Let there be 3 boolean random variables representing the real world, say cavity, toothache and catch.
– We may still represent the joint probabilities as a table, shown below, but if we have more random variables, we simply use the propositions and their probabilities.

More random variables
• Probability expressions
  • P(cavity, toothache, catch) = 0.06
  • P(cavity, toothache, ¬catch) = 0.19
  • P(cavity, ¬toothache, catch) = 0.05
  • P(cavity, ¬toothache, ¬catch) = 0.10
  • P(¬cavity, toothache, catch) = 0.09
  • P(¬cavity, toothache, ¬catch) = 0.01
  • P(¬cavity, ¬toothache, catch) = 0.22
  • P(¬cavity, ¬toothache, ¬catch) = 0.28
• The same information as a table:

              toothache            ¬toothache
              catch    ¬catch      catch    ¬catch
   Cavity     0.06     0.19        0.05     0.10
   ¬Cavity    0.09     0.01        0.22     0.28

• Exercise: compute P(cavity), P(cavity, toothache), P(toothache).
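A sketch of using the full joint distribution table above as the knowledge base in Python; marginalisation is just summing the atomic events in which a proposition holds.

```python
# Sketch: the joint distribution over (cavity, toothache, catch) as a dict,
# with values taken from the table above.
joint = {
    (True,  True,  True):  0.06, (True,  True,  False): 0.19,
    (True,  False, True):  0.05, (True,  False, False): 0.10,
    (False, True,  True):  0.09, (False, True,  False): 0.01,
    (False, False, True):  0.22, (False, False, False): 0.28,
}

def prob(pred):
    """Sum the atomic events in which the predicate holds (marginalisation)."""
    return sum(p for event, p in joint.items() if pred(*event))

p_cavity = prob(lambda cav, tooth, cat: cav)                   # 0.40
p_cav_and_tooth = prob(lambda cav, tooth, cat: cav and tooth)  # 0.25
p_tooth = prob(lambda cav, tooth, cat: tooth)                  # 0.35

print(p_cavity, p_cav_and_tooth, p_tooth)
print("P(cavity | toothache) =", p_cav_and_tooth / p_tooth)    # ~0.714
```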

Advantage of the normalization constant
• It can help in generalizing the inference procedure.
• P(X | e) = α P(X, e) = α (P(X, e, y1) + P(X, e, y2))
• Example:
  P(cavity | toothache) = P(cavity, toothache) / P(toothache)
                        = (P(cavity, toothache, catch) + P(cavity, toothache, ¬catch)) / P(toothache)

Probabilistic queries using the joint probability distribution
• These queries are answered using the joint probability distribution.
• The joint probability distribution is the knowledge base for inference about an uncertain real world.
• With n random variables, the size of the table becomes 2^n.
• Time to answer a query = O(2^n).
• When n is large, the method becomes almost impractical to work with.


Independent events
• Two events are statistically independent if the occurrence of one does not affect the other.
• P(A|B) = P(A)
• P(B|A) = P(B)
• P(A ∧ B) = P(A) P(B)
• Example
  – P(cavity | weather = sunny) = P(cavity)
  – P(weather = rainy | toothache, catch) = P(weather = rainy)

Bayes' Rule
• According to the product rule
  – P(A|B) = P(A ∧ B) / P(B)
  – P(B|A) = P(B ∧ A) / P(A)
  – Using commutativity of conjunction
    P(A|B) P(B) = P(B|A) P(A)
    ==> P(A|B) = P(B|A) P(A) / P(B)

Example
• 90% of students pass an examination
• 75% of students who study hard pass the exam
• 60% of students study hard
• Let S: event that a student passes the exam; H: the student studies hard
• P(S|H) = 0.75, P(S) = 0.9, P(H) = 0.6
• P(H|S) = ??
• Solution: use Bayes' theorem
  P(H|S) = P(S|H) P(H) / P(S) = 0.75 × 0.6 / 0.9 = 0.5

Review
• Axioms of probability
  • 0 <= P(a) <= 1
  • P(true) = 1, P(false) = 0
  • P(a ∨ b) = P(a) + P(b) − P(a ∧ b)
• Syntax of the language representing uncertainty
  • P(proposition)
  • P(proposition) = Σ P(ei) over all atomic events ei in which it holds
    • where the proposition is a conjunction of literals representing random variables.
    • The random variables represent the real world parameters and capture the uncertainty.
  • Example: P(cavity ∧ (weather = rainy))
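A one-line check of the exam example using Bayes' theorem, with the numbers taken from the slide.

```python
# Sketch: P(H|S) = P(S|H) * P(H) / P(S)
def bayes(p_b_given_a, p_a, p_b):
    return p_b_given_a * p_a / p_b

p_S_given_H, p_H, p_S = 0.75, 0.6, 0.9
print("P(H|S) =", bayes(p_S_given_H, p_H, p_S))   # 0.5
```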

toothache toothache
catch catch
Review Review
Cavity 0.06
Catch
0.19
Catch
0.05 0.10
Cavity 0.09 0.01 0.22 0.28
 Product Rule
 P(cavity Λ toothache) = P(cavity, toothache) P(toothache)
• Normalization Constant
 Marginal Probability
– P(a|b) = P(a Λ b) / P(b) =  P(a Λ b)
 P(cavity) =  P(ei)
over all atomic events containing cavity – Example
• P(cavity | toothache) =
 Marginalization P(cavity,toothache)/P(toothache)
 Conditioning = (0.06+0.19) /
 P(X) =  P(X,z) (0.06+0.19+0.09+0.01)
=  P(X|z) P(z) = 0.25/ 0.35 = 0.7142
Example: P(cavity) = P(cavity|toothache) P(toothache) Normalization constant is 1/0.35
+ P(cavity|catch) P(catch)

January 21, 2018 IS ZC464 47 January 21, 2018 IS ZC464 48


toothache toothache
catch catch
Class Cavity
Catch
0.06 0.19
Catch
0.05 0.10
Review
Assignment Cavity 0.09 0.01 0.22 0.28
 Bayes Rule:
1. What is the probability that a patient who has P(A|B) = P(B |A) P(A) / P(B)
toothache has a cavity in his/ her teeth? requires
2. What is the probability that the patient has a cavity? 1 conditional probability
3. What is the probability 2 unconditional probability
P(cavity|catchΛ toothache)

Solution (Problem 2)
Use Product Rule:
P(cavity|catchΛ toothache)
= P(cavity ΛcatchΛ toothache)/P(catchΛ toothache)
= 0.1/(0.1+0.28) = 0.1/ 0.38
January 21, 2018 IS ZC464 49 January 21, 2018 IS ZC464 50

Bayes' Theorem Example
• Suppose that Bob can decide to go to work by one of three modes of transportation: car, bus, or commuter train. Because of high traffic, if he decides to go by car, there is a 50% chance he will be late. If he goes by bus, which has special reserved lanes but is sometimes overcrowded, the probability of being late is only 20%. The commuter train is almost never late, with a probability of only 1%, but is more expensive than the bus.

Example contd.
• Suppose that Bob is late one day, and his boss wishes to estimate the probability that he drove to work that day by car. Since he does not know which mode of transportation Bob usually uses, he gives a prior probability of 1/3 to each of the three possibilities. What is the boss's estimate of the probability that Bob drove to work?
Example courtesy:
http://www.medicine.mcgill.ca/epidemiology/joseph/courses/EPIB-607/BayesEx.pdf
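A hedged worked version of the Bob example: the priors of 1/3 and the lateness probabilities come from the slide; the normalisation step is ordinary Bayes' rule.

```python
# Sketch: P(mode | late) = P(late | mode) * P(mode) / P(late)
priors = {"car": 1/3, "bus": 1/3, "train": 1/3}     # boss's prior guesses
p_late = {"car": 0.50, "bus": 0.20, "train": 0.01}  # P(late | mode), from the slide

# P(late) by total probability, then the posterior for each mode
p_late_total = sum(priors[m] * p_late[m] for m in priors)
posterior = {m: priors[m] * p_late[m] / p_late_total for m in priors}

print(posterior)    # car dominates: 0.5/0.71, roughly 0.70
```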

Solution
• Apply Bayes' rule with a prior of 1/3 for each mode, as in the sketch above.
Example courtesy:
http://www.medicine.mcgill.ca/epidemiology/joseph/courses/EPIB-607/BayesEx.pdf

Home Work
• A box contains 10 red and 15 blue balls. Two balls are selected at random and are discarded without their colors being seen. If a third ball is drawn randomly and observed to be red, what is the probability that both of the discarded balls were blue?

  P(BB | R) = P(R | BB) P(BB) / [ P(R | BB) P(BB) + P(R | BR) P(BR) + P(R | RR) P(RR) ]
Bayesian Network
• A suitable data structure implementing a mechanism to represent the dependent and independent relationships among the real world variables.
• Captures the uncertain knowledge in a natural and efficient way.
• A set of random variables makes up the nodes of the network.
• It also consists of links between nodes exploiting the dependence of one variable on another.

Bayesian network
• Example network: cloud → rain and humid → rain (cloud and humid are the parents of rain).

Example 2
• Nodes: Nutritious food (N), 8 hrs sleep (SL), Healthy lifestyle (H), Study hard (SH), Attend lectures (A), Good performance (GP).
• Priors: P(N) = 0.7, P(SL) = 0.3, P(SH) = 0.4, P(A) = 0.6
• H depends on N and SL; GP depends on SH, A and H.
• Associated Conditional Probability Tables (CPTs):

  N   SL   P(H)
  T   T    0.95
  T   F    0.78
  F   T    0.6
  F   F    0.001

  SH  A   H   P(GP)
  T   T   T   0.99
  T   T   F   0.45
  T   F   T   0.60
  T   F   F   0.30
  F   T   T   0.85
  F   T   F   0.45
  F   F   T   0.05
  F   F   F   0.00001

example
• Weather is an independent variable, while cavity affects both toothache and catch.
  (Network: weather stands alone; cavity → toothache, cavity → catch.)

Example 2
• N: nutritious food
• SL: 8 hours sleep
• H: healthy lifestyle
• SH: study hard
• A: attends lectures
• P(N, SL, H, SH, A)
  = P(N) P(SL) P(H | N ∧ SL) P(SH) P(A)
  = (0.3) × (0.3) × (0.6) × (0.4) × (0.4)
  = 0.00864
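A sketch, using the numbers printed on the slide, of the chain-rule factorisation of the joint probability in this network; which truth assignment of the five variables these numbers correspond to is not spelled out on the slide, so the values are used as given.

```python
# Sketch: joint probability via the Bayesian-network factorisation
# P(N, SL, H, SH, A) = P(N) * P(SL) * P(H | N, SL) * P(SH) * P(A)
p_N, p_SL, p_SH, p_A = 0.3, 0.3, 0.4, 0.4   # node probabilities used on the slide
p_H_given_parents = 0.6                      # CPT entry used on the slide

joint = p_N * p_SL * p_H_given_parents * p_SH * p_A
print(joint)   # 0.00864, as computed on the slide
```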


Bayesian Networks
• A representation for uncertain knowledge.
• These networks provide a concise way to represent conditional probabilities.
• A Bayesian network is represented as a graph.
• A Bayesian network is more efficient than the joint distribution tables.

Graphs
• A graph is a data structure to hold relevant information in memory efficiently.
• Each node is allocated memory dynamically (on the heap).
• Nodes are connected by edges.
• A directed edge depicts the parent-child relationship amongst nodes.
• The node pointed to by the arrow (edge) is the child of the parent node from which the edge comes.
• A graph with |V| vertices and |E| edges is traversed in O(|V| + |E|) time.

Conditional Probability Tables
• Conditional Probability Tables (CPTs) capture the strength of the conditional dependency among the variables of the environment.
• Each row represents an atomic event (corresponding to the set of variables causing an effect on the child variable).
• Each row must sum to 1 (P(False) = 1 − P(True)).
• The size of the CPT corresponding to a node is 2^k, where k is the number of parent variables.
• The CPTs are able to generate any conditional probability.

Bayesian Network
• Consider the domain as consisting of n variables, say X1, X2, X3, …, Xn.
• There are 2^n atomic events.
• The joint probability for an instance of these n variables is given by
  P(x1, x2, x3, …, xn) = ∏i P(xi | Parents(Xi))

Revisit Example 2: additional links if needed / all nodes are connected
• The same network can be drawn with additional links when further dependencies are needed; in the limit, all nodes are connected.
• With the link N → A added, the CPT for A becomes:

  N   P(A)
  T   0.95
  F   0.60

• The CPT for H given its parents N and SL, and the CPT for GP given SH, A and H, remain as in Example 2 above.

What is making the previous BN bad?
• Links capture dependency.
• More CPTs need to be constructed, which means more memory is needed.
• More statistical information is needed, which is not practical to collect.
• Worst case time to access each CPT (corresponding to the nodes) = n·2^n
• Recall the total time to process a probabilistic query using the JPDT (joint probability distribution table) = 2^n

Example
• If an environment is represented using 10 variables, the nodes having at most 3 parent variables, then the units of time to process a query
  = 10 × 8 = 80
  while the JPDT, if used as the KB, takes 1024 units of time.

Where is the human intelligence required?
• Humans understand the environment and can list out the variables that should represent the environment.
• The dependency of one variable on another should be listed out.
• The dependency and the effect of the parent variable, in terms of conditional probabilities, must be computed.
• In deciding the root cause(s).

How to construct a BN
• Start by constructing the nodes corresponding to the root causes.
• Then add the nodes which are affected by the previously generated nodes.
• Example: model uncertainty in the environment to find out what the status of global warming will be next year due to ongoing developmental activities.
  • Parameters affecting temperature on earth
    – Deforestation, pollution, reduction of green house gases
  • Root causes
    – Housing and fuel needs, factories, communication, and so on
[Construction of the Bayesian network, step by step: first the nodes deforestation, pollution and green house gases; then their root causes vehicles, fuel, factories and house are added as parents.]
• Notice that there is no cycle in the graph.
• All leaf nodes have parents, while the root causes do not have any parent node.
• If there is a need to model a more refined environment, the variables affecting the root causes should also be considered.

Review: Probability Theory
• Marginal Probability
• Joint Probability
• Conditional Probability

BITS Pilani
Machine Learning (IS ZC464)
Session 4: Bayes' Theorem and its applications in Machine Learning, MAP hypothesis, Information Theory and its application in the Minimum Description Length (MDL) principle
Bayesian learning: Bayes' theorem
• Bayes' theorem provides a way to calculate the probability of a hypothesis based on its prior probability, the probability of observing various data given the hypothesis, and the observed data itself.

Example 1: observation of sounds
• Training with observed data: {d1, d2, d3} = training data (say D)
  d1: "cat" sounds with 'ae'
  d2: "pot" sounds with 'aw'
  d3: "mat" sounds with 'ae'
• Sounds 'ae' and 'aw' are the observed targets that we know.
• Features such as 'a' and 'o' are obtained through preprocessing of the given words, by parsing.
• Prior probabilities
  P(sound = 'ae') = 0.5
  P(sound = 'aw') = 0.5
• Conditional probabilities are represented as
  P('ae' | feature = 'a') and P('aw' | feature = 'o')
  or
  P('ae' | d1, d2, d3) = 2/3 and P('aw' | d1, d2, d3) = 1/3
  i.e. P('ae' | D) = 2/3, P('aw' | D) = 1/3

Bayesian learning would enable answers to queries such as:
• Unknown words used for testing: "sat" and "not"
• Preprocessing gives
  Feature for word "sat" = 'a'
  Feature for word "not" = 'o'
• What is the likelihood that word "sat" sounds with 'ae'?  P("sat" | 'ae') = ?
• What is the likelihood that word "not" sounds with 'ae'?  P("not" | 'ae') = ?
• What is the likelihood that word "sat" sounds with 'aw'?  P("sat" | 'aw') = ?
• What is the likelihood that word "not" sounds with 'aw'?  P("not" | 'aw') = ?

Hypothesis
• In learning algorithms, the term hypothesis is used in contexts such as
  • Concept learning or classification: a class label or category
  • Function approximation: a curve, a line or a polynomial
  • Decision making: a decision tree
• Plural of hypothesis: hypotheses (multiple labels, multiple curves, multiple decision trees)
• Best hypothesis (always preferred): the most appropriate class, the best fit curve, the smallest decision tree

Observations of sounds example: continued
• There are two hypotheses in the given example
  Hypothesis h1: 'ae'
  Hypothesis h2: 'aw'
• Learning requires us to find the best hypothesis from the space of the two hypotheses h1 and h2 for a new observation.

Likelihood or probability?
• What is the likelihood that word "sat" sounds with 'ae'?
  P("sat" | 'ae'), which can equivalently be written as P(feature = 'a' | 'ae'), given the training data.
• Reverse: what is the likelihood that the 'ae' sound will represent a word of type "sat"?
  P('ae' | feature = 'a') = ?


Computation of P('ae' | feature = 'a') = ?
• Let us represent the above probabilistic query as a conditional probability using the following events
  A: sound is 'ae'
  B: feature is 'a'
• We want to compute P(A|B) (read as the probability of A given B) when P(B|A), P(B) and P(A) are available.
• Bayes' theorem provides a way to compute such probabilities.

Terminology for Bayes' theorem
• Prior Probability: the probability P(h) denotes the initial probability that hypothesis h holds before we have observed the training data.
  [Examples: P('aw'), P('ae'), P(feature = 'a'), P(feature = 'o') etc., based on some background knowledge]

Terminology for Bayes' theorem
• Posterior Probability: the probability P(h|D) denotes the probability that the hypothesis h holds given the observed training data D.
  [First recall the example of a sequence of coin tosses and the probability that changes as we keep observing D. Then, in the current example, consider P(feature = 'a' | 'ae'), and visualize the uncertainty if "talk" and "none" are also used for training and sound different for feature = 'a' and feature = 'o' respectively.]

Bayesian Learning
• Bayesian learning is a probabilistic approach to inference.
• Optimal decisions can be made by reasoning about these probabilities together with the observed data.
• Each observed training observation can incrementally decrease or increase the estimated probability that a hypothesis is correct.

Bayesian Learning
• Prior knowledge can be combined with observed data to determine the final probability of a hypothesis.
• Bayesian methods can accommodate hypotheses that make probabilistic predictions.
• New instances can be classified by combining the predictions of multiple hypotheses, weighted by their probabilities.

Example
  Observation   Word    Feature   Target hypothesis (h)
  d1            put     u         oo
  d2            pat     a         ae
  d3            none    o         a~
  d4            mat     a         ae
  d5            cut     u         a~
  d6            not     o         aw
  d7            nut     u         a~
  d8            talk    a         aw
  d9            pot     o         aw
  d10           sat     a         ae


4 hypotheses and three features (for the observations D in the table above)
• Prior probabilities of the hypotheses
  P(oo) = 0.1, P(ae) = 0.5, P(a~) = 0.1, P(aw) = 0.3
• Prior probabilities of the features
  P(u) = 0.2, P(a) = 0.5, P(o) = 0.3
• Conditional probabilities of the hypotheses given D
  P(oo | D) = 0.1, P(ae | D) = 0.3, P(a~ | D) = 0.3, P(aw | D) = 0.3
• Conditional probabilities of the features given the hypotheses
  P(u | oo) = 0.1, P(u | a~) = 0.2, P(a | ae) = 0.3, P(a | aw) = 0.1, P(o | a~) = 0.1, P(o | aw) = 0.2

Bayesian Learning
• Training: through the computation of the probabilities as in the previous two slides.
• Testing: of unknown words
  – Example testing: which sound does the word "cat" make?
    Preprocess "cat" to get feature 'a' and compute P(h | 'a'), where P(h | D) is known, D being the set of 10 observations used to train the system and h the hypothesis.
    Compute the probabilities P(ae | a), P(oo | a), P(a~ | a) and P(aw | a) to obtain the likelihood of the sound of "cat".

Posterior Probabilities
• P(ae | a) = P(a | ae) × P(ae) / P(a) = 0.3 × 0.5 / 0.5 = 0.3   (maximum)
• P(oo | a) = P(a | oo) × P(oo) / P(a) = 0.1 × 0.1 / 0.5 = 0.02
• P(a~ | a) = P(a | a~) × P(a~) / P(a) = 0 × 0.1 / 0.5 = 0
• P(aw | a) = P(a | aw) × P(aw) / P(a) = 0.1 × 0.3 / 0.5 = 0.06
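A sketch of the MAP computation above for the test feature 'a', using the priors and conditionals listed on the computation slide.

```python
# Sketch: posterior P(h | a) = P(a | h) * P(h) / P(a), then pick the argmax.
p_h = {"oo": 0.1, "ae": 0.5, "a~": 0.1, "aw": 0.3}          # priors P(h)
p_a_given_h = {"oo": 0.1, "ae": 0.3, "a~": 0.0, "aw": 0.1}  # P(feature='a' | h)
p_a = 0.5                                                    # P(feature='a')

posterior = {h: p_a_given_h[h] * p_h[h] / p_a for h in p_h}
print(posterior)                                 # ae: 0.3, oo: 0.02, a~: 0.0, aw: 0.06
print("MAP hypothesis:", max(posterior, key=posterior.get))  # 'ae'
```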

Maximum a Posteriori (MAP) hypothesis
• Consider a set of hypotheses H and the observed data used for training, D.
• Define
  h_MAP = argmax_{h ∈ H} P(h | D)
• The maximally probable hypothesis is called a maximum a posteriori (MAP) hypothesis.

MAP hypothesis
  h_MAP = argmax_{h ∈ H} P(h | D)
        = argmax_{h ∈ H} P(D | h) P(h) / P(D)    (using Bayes' theorem)
        = argmax_{h ∈ H} P(D | h) P(h)           (dropping P(D), as it is constant)


Equally probable hypotheses a priori
• If P(hi) = P(hj) for all hi and hj in H, then in finding the MAP hypothesis we can ignore the term P(h) in
  h_MAP = argmax_{h ∈ H} P(D | h) P(h)
  and get
  h_MAP = argmax_{h ∈ H} P(D | h)

Maximum Likelihood hypothesis
• P(D | h) is called the likelihood of the data given h, and any hypothesis that maximizes P(D | h) is called a Maximum Likelihood (ML) hypothesis:
  h_ML = argmax_{h ∈ H} P(D | h)

Home Work
• Read and solve the example given in section 6.2.1 of Mitchell's book.
• Question on Bayes' theorem:
  A doctor knows that the disease meningitis causes the patient to have a stiff neck, say, 50% of the time. The doctor also knows some unconditional facts: the prior probability that a patient has meningitis is 1/50,000, and the prior probability that any patient has a stiff neck is 1/20. What is the probability that a patient with a stiff neck has meningitis? (Verify your answer with 0.0002)

Minimum Description Length Principle
• This is the information theoretic approach to computing the MAP hypothesis.
• The MAP hypothesis is computed as the shortest-length hypothesis in the domain of encoding data.
• A problem consisting of transmitting random messages needs an encoding of messages.
• Messages arrive at random (uncertainty about the messages exists).
• Each message i is considered to be arriving with probability pi.

Minimum Description Length Principle
• We need to find the encoding scheme using the minimum number of bits.
• A fixed length coding scheme does not work well, as the less probable messages get encodings using the same number of bits.
• Example messages
  a1, a2, a3, a4 : 4 symbols
  Code them as 00, 01, 10, 11 using a 2-bit (costly) representation.
  Transmit the code sequence 1101100110001110.
  The client at the other end can decode, using the same encoding scheme, as a4a2a3a2a3a1a4a3.

Information Theory
• Information theory studies the quantification, storage, and communication of information.
• It was originally proposed by Claude E. Shannon in 1948 to find fundamental limits on signal processing and communication operations such as data compression.
• A key measure in information theory is "entropy".
• Entropy quantifies the amount of uncertainty involved in the value of a random variable or the outcome of a random process.
Reference: https://en.wikipedia.org/wiki/Information_theory
Entropy
• Based on the probability of each source symbol to be communicated, the Shannon entropy H, in units of bits (per symbol), is given by
  Entropy = − Σi pi log2(pi)
• where pi is the probability of occurrence of the i-th possible value of the source symbol.

Encoding
• Use fewer bits to represent frequent symbols.
• Use more bits to represent less frequent symbols.

  Symbol   Probability (pi)   Code (unoptimized)   Code (optimized)
  a1       0.4                00                   1
  a2       0.25               01                   010
  a3       0.3                10                   00
  a4       0.05               11                   011

Computation of entropy
• Entropy of the given data
  Entropy = − Σi pi log2(pi)
          = − 0.4·log2(0.4) − 0.25·log2(0.25) − 0.3·log2(0.3) − 0.05·log2(0.05)
          = 0.52877 + 0.5 + 0.521089 + 0.216096
          = 1.76 bits (information content)

Average number of bits
• Variable length encoding
  = 1×0.4 + 3×0.25 + 2×0.3 + 3×0.05 = 1.9 bits
• Fixed length encoding = 2 bits

  Symbol   Probability (pi)   Code (unoptimized)   Code (optimized)
  a1       0.4                00                   1
  a2       0.25               01                   010
  a3       0.3                10                   00
  a4       0.05               11                   011
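A sketch reproducing the two numbers above: the Shannon entropy of the four symbols and the average length of the optimised variable-length code.

```python
# Sketch: entropy and average code length for the symbol table above.
import math

probs = {"a1": 0.40, "a2": 0.25, "a3": 0.30, "a4": 0.05}
codes = {"a1": "1",  "a2": "010", "a3": "00", "a4": "011"}

entropy = -sum(p * math.log2(p) for p in probs.values())
avg_len = sum(probs[s] * len(codes[s]) for s in probs)

print(f"entropy = {entropy:.2f} bits")               # about 1.77 (slide rounds to 1.76)
print(f"average code length = {avg_len:.2f} bits")   # 1.90 bits
```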

MAP definition in logarithmic terms
  h_MAP = argmax_{h ∈ H} P(D | h) P(h)
        = argmax_{h ∈ H} [ log2 P(D | h) + log2 P(h) ]
        = argmin_{h ∈ H} [ − log2 P(D | h) − log2 P(h) ]

BITS Pilani
Machine Learning (IS ZC464)
Session 5: Bayes' Optimal Classifier, Gibbs Algorithm, Naïve Bayes' Classifier, Problem Solving on Bayesian Learning


Review
• Prior and posterior probabilities
• Conditional probabilities
• Bayes' theorem
• MAP hypothesis
• Minimum Description Length Principle
• Entropy
Slides adapted from:
https://fenix.tecnico.ulisboa.pt/downloadFile/3779571251548/bayes2.ppt
www.cs.bu.edu/fac/gkollios/ada01/LectNotes/Bayesian.ppt
www.doc.ic.ac.uk/~yg/course/ida2002/ida-2002-5.ppt
https://cse.sc.edu/~rose/587/PPT/NaiveBayes.ppt

Basic Approach
• Bayes Rule:
  P(h | D) = P(D | h) P(h) / P(D)
• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(h|D) = probability of h given D (posterior probability)
• P(D|h) = probability of D given h (likelihood of D given h)
• The goal of Bayesian learning: the most probable hypothesis given the training data (Maximum A Posteriori hypothesis h_map)
  h_map = argmax_{h ∈ H} P(h | D)
        = argmax_{h ∈ H} P(D | h) P(h) / P(D)
        = argmax_{h ∈ H} P(D | h) P(h)

MAP Learner
• For each hypothesis h in H, calculate the posterior probability
  P(h | D) = P(D | h) P(h) / P(D)
• Output the hypothesis h_map with the highest posterior probability
  h_map = argmax_{h ∈ H} P(h | D)
• Computationally intensive
• Provides a standard for judging the performance of learning algorithms
• Choosing P(h) reflects our prior knowledge about the learning task
• P(D|h) defines the probability of D given its hypothesis (class) and is used for training.

Classification tasks (Concept learning)
• Examples
  • Spam classification: given an email, predict whether it is spam or not
  • Medical diagnosis: given a list of symptoms, predict whether a patient has disease X or not
  • Weather: based on temperature, humidity, etc., predict whether it will rain tomorrow
• What is a classification task in machine learning?
  • Given training data and its class label C, features representing each training sample (say X = <x1, …, xk>) are extracted. This represents P(X|C).
  • An unseen (or seen) sample is required to be classified. Let this be represented by its features X = <x1, …, xk>.
  • The classification task requires the machine learning system to obtain the class label to which the sample might belong.
• How is probability used in classification?
  • To classify we compute P(C|X)

Bayesian classification
• The classification problem may be formalized using a posteriori probabilities:
• P(C|X) = probability that the sample tuple X = <x1, …, xk> is of class C.
• Idea: assign to sample X the class label C such that P(C|X) is maximal.


Examples

• Sound classification of words
 hi: ith sound class
 D: vowel attributes e.g. 'o', 'a' etc.
 P(h|Dtestdata) = MAXi P(hi|Dtestdata)
• Face recognition
 hi: ith Person's name
 D: attributes such as shapes of eyes, jaw line, nose etc., or mathematical features such as wavelets, Discrete Cosine Transform (DCT) etc.

Examples

• Diagnosis of a disease
 hi: ith Disease's name (Multi-class classification problem)
 hi: Disease diagnosis positive/negative (Binary classification problem)
 D: Symptoms
• Spam mail detection
 hi: spam positive/negative (Binary classification problem)
 D: features that describe a spam email (example words: dollar, free, pounds, million, etc.)
• Speaker recognition
 hi: ith Person's name
 D: Voice tone etc.

February 3, 2018 IS ZC464 8 February 3, 2018 IS ZC464 9

Techniques for classification based on posterior probabilities

• Maximum a Posteriori (MAP)
 – Maximum likelihood
 – Only one hypothesis contributes
• Bayes Optimal Classifier
 – Weighted Majority Classifier
 – All hypotheses contribute
 – Costly classifier
• Gibbs Algorithm
 – Any one randomly picked hypothesis
 – Less costly
• Naïve Bayes Classifier
 – uses the assumption that the attributes are conditionally independent

Bayes optimal Classifier: A weighted majority classifier

• What is the most probable classification of the new instance given the training data?
 – The most probable classification of the new instance is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities
• If the classification of the new example can take any value vj from some set V, then the probability P(vj|D) that the correct classification for the new instance is vj is just:

P(vj|D) = Σ_{hi ∈ H} P(vj|hi) P(hi|D)

February 3, 2018 IS ZC464 10 February 3, 2018 IS ZC464 11

Example

h1: travel by scooter, h2: travel by car, h3: travel by bus
v1: long journey (+), v2: city travel (-)
P(+|h1): probability of travelling on a long journey given travel by scooter.

• Let D = <f1, f2, f3> be the training data where the fi's are the attributes (features such as availability of the vehicle, money, road conditions etc.)
• Given observed conditional probabilities
 P(h1|D) = 0.4, P(h2|D) = 0.3, P(h3|D) = 0.3
• Given observed conditional probabilities
 P(+|h1) = 0.01, P(+|h2) = 0.9, P(+|h3) = 0.6
• Also given observed conditional probabilities
 P(-|h1) = 0.7, P(-|h2) = 0.5, P(-|h3) = 0.2

Example Calculation

• Weighted sum of P(+|D)
 = P(+|h1)P(h1|D) + P(+|h2)P(h2|D) + P(+|h3)P(h3|D)
 = 0.01*0.4 + 0.9*0.3 + 0.6*0.3
 = 0.004 + 0.27 + 0.18
 = 0.454
• Weighted sum of P(-|D)
 = P(-|h1)P(h1|D) + P(-|h2)P(h2|D) + P(-|h3)P(h3|D)
 = 0.7*0.4 + 0.5*0.3 + 0.2*0.3
 = 0.28 + 0.15 + 0.06
 = 0.49

Given: P(h1|D)=0.4, P(h2|D)=0.3, P(h3|D)=0.3; P(+|h1)=0.01, P(+|h2)=0.9, P(+|h3)=0.6; P(-|h1)=0.7, P(-|h2)=0.5, P(-|h3)=0.2
h1: travel by scooter, h2: travel by car, h3: travel by bus; v1: long journey (+), v2: city travel (-)
P(+|h1): probability of travelling on a long journey given travel by scooter.
February 3, 2018 IS ZC464 12 February 3, 2018 IS ZC464 13
Bayes Optimal Classifier

max_{vj ∈ V} Σ_{hi ∈ H} P(vj|hi) P(hi|D)

• Compute the maximum of the two probabilities P(+|D) and P(-|D), which is 0.49
• This means it is more likely that a person travels in a city given its data

Naïve Bayes Learner

Assume target function f: X -> V, where each instance x is described by attributes <a1, a2, …, an>. The most probable value of f is:

v = max_{vj ∈ V} P(vj | a1, a2, ..., an)
  = max_{vj ∈ V} P(a1, a2, ..., an | vj) P(vj) / P(a1, a2, ..., an)
  = max_{vj ∈ V} P(a1, a2, ..., an | vj) P(vj)

Naïve Bayes assumption:

P(a1, a2, ..., an | vj) = Π_i P(ai | vj)   (attributes are conditionally independent)

February 3, 2018 IS ZC464 14 February 3, 2018 IS ZC464 15
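As a small illustration (not part of the original slides), the following Python sketch reproduces the weighted-majority computation from the travel example above, using the probabilities given on those slides.

# Bayes optimal classification for the travel example (values from the slides).
# Hypotheses: h1 = scooter, h2 = car, h3 = bus; classes: '+' = long journey, '-' = city travel.
posterior_h = {"h1": 0.4, "h2": 0.3, "h3": 0.3}          # P(h_i | D)
p_class_given_h = {
    "+": {"h1": 0.01, "h2": 0.9, "h3": 0.6},             # P(+ | h_i)
    "-": {"h1": 0.7,  "h2": 0.5, "h3": 0.2},             # P(- | h_i)
}

# Weighted sum over hypotheses: P(v | D) = sum_i P(v | h_i) * P(h_i | D)
scores = {
    v: sum(p_class_given_h[v][h] * posterior_h[h] for h in posterior_h)
    for v in p_class_given_h
}
print(scores)                      # {'+': 0.454, '-': 0.49}
print(max(scores, key=scores.get)) # '-' : city travel is the most probable classification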

Naïve Bayes Classifier (I)

• A simplified assumption: attributes are conditionally independent:

P(Cj | D) = P(Cj) · Π_{i=1}^{n} P(di | Cj)

• Greatly reduces the computation cost: only count the class distribution.

Play-tennis example: estimating P(xi|C)

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N

P(p) = 9/14, P(n) = 5/14

outlook:     P(sunny|p) = 2/9     P(sunny|n) = 3/5
             P(overcast|p) = 4/9  P(overcast|n) = 0
             P(rain|p) = 3/9      P(rain|n) = 2/5
temperature: P(hot|p) = 2/9       P(hot|n) = 2/5
             P(mild|p) = 4/9      P(mild|n) = 2/5
             P(cool|p) = 3/9      P(cool|n) = 1/5
humidity:    P(high|p) = 3/9      P(high|n) = 4/5
             P(normal|p) = 6/9    P(normal|n) = 1/5
windy:       P(true|p) = 3/9      P(true|n) = 3/5
             P(false|p) = 6/9     P(false|n) = 2/5
February 3, 2018 IS ZC464 16 February 3, 2018 IS ZC464 17

Naive Bayesian Classifier (II)

• Given a training set, we can compute the probabilities

Outlook      P    N        Humidity   P    N
sunny        2/9  3/5      high       3/9  4/5
overcast     4/9  0        normal     6/9  1/5
rain         3/9  2/5
Temperature  P    N        Windy      P    N
hot          2/9  2/5      true       3/9  3/5
mild         4/9  2/5      false      6/9  2/5
cool         3/9  1/5

Play-tennis example: classifying X

• An unseen sample X = <rain, hot, high, false>

P(rain,hot,high,false|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p)
                              = 3/9·2/9·3/9·6/9·9/14 = 0.010582

P(rain,hot,high,false|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n)
                              = 2/5·2/5·4/5·2/5·5/14 = 0.018286

• Sample X is classified in class n (don't play)


February 3, 2018 IS ZC464 18 February 3, 2018 IS ZC464 19
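The following Python sketch (illustrative, not from the slides) runs the same naive Bayes computation on the play-tennis data above; the function and variable names are my own.

from collections import Counter, defaultdict

# Play-tennis training data from the slides: (outlook, temperature, humidity, windy) -> class
data = [
    ("sunny","hot","high","false","N"), ("sunny","hot","high","true","N"),
    ("overcast","hot","high","false","P"), ("rain","mild","high","false","P"),
    ("rain","cool","normal","false","P"), ("rain","cool","normal","true","N"),
    ("overcast","cool","normal","true","P"), ("sunny","mild","high","false","N"),
    ("sunny","cool","normal","false","P"), ("rain","mild","normal","false","P"),
    ("sunny","mild","normal","true","P"), ("overcast","mild","high","true","P"),
    ("overcast","hot","normal","false","P"), ("rain","mild","high","true","N"),
]

class_counts = Counter(row[-1] for row in data)
attr_counts = defaultdict(int)          # (attribute_index, value, class) -> count
for *features, label in data:
    for i, value in enumerate(features):
        attr_counts[(i, value, label)] += 1

def score(x, label):
    """Naive Bayes score: P(c) * prod_i P(x_i | c), using raw relative frequencies."""
    p = class_counts[label] / len(data)
    for i, value in enumerate(x):
        p *= attr_counts[(i, value, label)] / class_counts[label]
    return p

x = ("rain", "hot", "high", "false")
for label in class_counts:
    print(label, score(x, label))   # P: ~0.010582, N: ~0.018286 -> classify as N (don't play)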
The independence hypothesis…

• … makes computation possible
• … yields optimal classifiers when satisfied
• … but is seldom satisfied in practice, as attributes (variables) are often correlated.
• Attempts to overcome this limitation:
 – Bayesian networks, that combine Bayesian reasoning with causal relationships between attributes
• When the conditional independence assumption is satisfied, the naive Bayes classification is a MAP classification

Naïve Bayesian Classifier: Comments

• Advantages:
 – Easy to implement
 – Good results obtained in most of the cases
• Disadvantages
 – Assumption: class conditional independence, therefore loss of accuracy
 – Practically, dependencies exist among variables
 – E.g., hospitals: patients: Profile: age, family history etc.; Symptoms: fever, cough etc.; Disease: lung cancer, diabetes etc.
 – Dependencies among these cannot be modeled by the Naïve Bayesian Classifier
• How to deal with these dependencies?
 – Bayesian Belief Networks

February 3, 2018 IS ZC464 20 February 3, 2018 IS ZC464 21

Problem Solving on Bayesian Learning

Problem 1: Bayes Theorem

• 90% of students pass an examination
• 75% of students who study hard pass the exam
• 60% of students study hard
• Let S: event that a student passes the exam; H: studies hard
• P(S|H) = 0.75
• P(S) = 0.9
• P(H) = 0.6
• P(H|S) = ??
• Solution: Use Bayes Theorem
• P(H|S) = P(S|H) P(H) / P(S) = 0.75 x 0.6 / 0.9 = 0.5
February 3, 2018 IS ZC464 22 February 3, 2018 IS ZC464 23

Problem 2: Using Joint Probabilities

• P(cavity, toothache, catch) = 0.06
• P(cavity, toothache, ¬catch) = 0.19
• P(cavity, ¬toothache, catch) = 0.05
• P(cavity, ¬toothache, ¬catch) = 0.10
• P(¬cavity, toothache, catch) = 0.09
• P(¬cavity, toothache, ¬catch) = 0.01
• P(¬cavity, ¬toothache, catch) = 0.22
• P(¬cavity, ¬toothache, ¬catch) = 0.28

More random variables

Compute:
P(cavity)
P(cavity | toothache)
P(toothache)
P(Catch | cavity)

              toothache           ¬toothache
              catch    ¬catch     catch    ¬catch
Cavity        0.06     0.19       0.05     0.10
¬Cavity       0.09     0.01       0.22     0.28
February 3, 2018 IS ZC464 24 February 3, 2018 IS ZC464 25
toothache toothache toothache toothache
Catch catch Catch catch Catch catch Catch catch
Solution Cavity 0.06 0.19 0.05 0.10 Solution Cavity 0.06 0.19 0.05 0.10
Cavity 0.09 0.01 0.22 0.28 Cavity 0.09 0.01 0.22 0.28

• P(cavity) = P(cavity,toothache)+P(cavity,~toothache) • P(cavity | toothache) = P(cavity, toothache) . P(toothache) (using Bayes


(using Marginalization) theorem)
• P(cavity) = P(cavity,toothache, catch)+P(cavity,~toothache, catch)+ • Where
P(cavity,toothache, ~catch)+P(cavity,~toothache, ~catch) (using P(cavity, toothache) = P(cavity, toothache, catch) + P(cavity, toothache, ~catch)
Marginalization) (using Marginalization Rule)

= 0.06 + 0.05 + 0.19 + 0.10 = 0.4


= 0.06 + 0.19 = 0.25
• Similarly
P(toothache) = P(toothache, catch)+P(toothache, ~catch) • Therefore
= P(toothache, catch, cavity)+P(toothache, ~catch, cavity) P(cavity | toothache) = P(cavity, toothache) . P(toothache)
+ P(toothache, catch, ~cavity)+P(toothache, ~catch, ~cavity) = 0.25 * 0.35 = 0.0875
(using Marginalization)
= 0.06 + 0.19 + 0.09 + 0.01 = 0.35

February 3, 2018 IS ZC464 26 February 3, 2018 IS ZC464 27
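A short Python sketch (my own illustration, not from the slides) computes the same marginal and conditional probabilities directly from the joint distribution table above.

# Joint distribution over (Cavity, Toothache, Catch) from the slides,
# keyed by boolean tuples (cavity, toothache, catch).
joint = {
    (True,  True,  True):  0.06, (True,  True,  False): 0.19,
    (True,  False, True):  0.05, (True,  False, False): 0.10,
    (False, True,  True):  0.09, (False, True,  False): 0.01,
    (False, False, True):  0.22, (False, False, False): 0.28,
}

def prob(cavity=None, toothache=None, catch=None):
    """Marginal probability: sum the joint entries consistent with the fixed values."""
    total = 0.0
    for (c, t, k), p in joint.items():
        if (cavity is None or c == cavity) and \
           (toothache is None or t == toothache) and \
           (catch is None or k == catch):
            total += p
    return total

p_cavity = prob(cavity=True)                                          # 0.4
p_toothache = prob(toothache=True)                                    # 0.35
p_cavity_given_toothache = prob(cavity=True, toothache=True) / p_toothache
print(p_cavity, p_toothache, round(p_cavity_given_toothache, 3))      # ~0.714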

Problem 3: Bayes Theorem

• Suppose that Bob can decide to go to work by one of three modes of transport: car, bus, or commuter train. Because of high traffic, if he decides to go by car, there is a 50% chance he will be late. If he goes by bus, which has special reserved lanes but is sometimes overcrowded, the probability of being late is only 20%. The commuter train is almost never late, with a probability of only 1%, but is more expensive than the bus.

Example contd..

• Suppose that Bob is late one day, and his boss wishes to estimate the probability that he drove to work that day by car. Since he does not know which mode of transportation Bob usually uses, he gives a prior probability of 1/3 to each of the three possibilities. What is the boss's estimate of the probability that Bob drove to work?

Example source
http://www.medicine.mcgill.ca/epidemiology/joseph/courses/EPIB-607/BayesEx.pdf
February 3, 2018 IS ZC464 28 February 3, 2018 IS ZC464 29
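Before consulting the worked solution referenced on the next slide, a short Python sketch (not part of the slides) applies Bayes' theorem to the numbers given in the problem.

# Problem 3 check: prior 1/3 for each mode,
# P(late | car) = 0.5, P(late | bus) = 0.2, P(late | train) = 0.01.
priors = {"car": 1/3, "bus": 1/3, "train": 1/3}
p_late = {"car": 0.5, "bus": 0.2, "train": 0.01}

evidence = sum(p_late[m] * priors[m] for m in priors)        # P(late)
posterior_car = p_late["car"] * priors["car"] / evidence     # P(car | late)
print(round(posterior_car, 3))                               # ~0.704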

Solution

Worked solution available at the example source:
http://www.medicine.mcgill.ca/epidemiology/joseph/courses/EPIB-607/BayesEx.pdf

Problem 4: Minimum Description Length (MDL) Principle

• Compute the MDL encoding for the problem given below

symbol | pi
A      | 0.36
B      | 0.15
C      | 0.13
D      | 0.11
E      | 0.09
F      | 0.07
G      | 0.05
H      | 0.03
I      | 0.01

February 3, 2018 IS ZC464 30 February 3, 2018 IS ZC464 31


Solution

• Arrange the symbols in sorted order
• Pair them by adding their probabilities and reach the end
• Assign the smallest code to the symbol with the highest probability
• Assign incremental-length codes to the other symbols depending upon their probabilities
• Compute the total number of expected bits
• Compute the entropy of the given information

Probabilities and Codelengths

• Huffman coding

February 3, 2018 IS ZC464 32 February 3, 2018 IS ZC464 33


Source: http://star.itc.it/caprile/teaching/algebra-superiore-2001/
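A minimal Python sketch (my own, not from the slides) builds a Huffman-style prefix code for the nine symbols of Problem 4 and reports the expected code length next to the entropy.

import heapq, math

# Symbol probabilities from Problem 4.
probs = {"A": 0.36, "B": 0.15, "C": 0.13, "D": 0.11,
         "E": 0.09, "F": 0.07, "G": 0.05, "H": 0.03, "I": 0.01}

def huffman(p):
    """Return a prefix-free code built by repeatedly merging the two least probable nodes."""
    heap = [(prob, i, {sym: ""}) for i, (sym, prob) in enumerate(p.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

code = huffman(probs)
expected_bits = sum(probs[s] * len(code[s]) for s in probs)
entropy = -sum(p * math.log2(p) for p in probs.values())
print(code)
print(f"Expected code length = {expected_bits:.3f} bits, entropy = {entropy:.3f} bits")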

Problem 5: MAP classifier

Does the patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, .008 of the entire population have this cancer.

P(cancer) = .008, P(¬cancer) = .992
P(+ | cancer) = .98, P(- | cancer) = .02
P(+ | ¬cancer) = .03, P(- | ¬cancer) = .97

P(cancer | +) = P(+ | cancer) P(cancer) / P(+)
P(¬cancer | +) = P(+ | ¬cancer) P(¬cancer) / P(+)

Problem 6: Naïve Bayes classifier

Class:
C1: buys_computer = 'yes'
C2: buys_computer = 'no'

Data sample:
X = (age<=30, Income=medium, Student=yes, Credit_rating=Fair)

age    | income | student | credit_rating | buys_computer
<=30   | high   | no      | fair          | no
<=30   | high   | no      | excellent     | no
30…40  | high   | no      | fair          | yes
>40    | medium | no      | fair          | yes
>40    | low    | yes     | fair          | yes
>40    | low    | yes     | excellent     | no
31…40  | low    | yes     | excellent     | yes
<=30   | medium | no      | fair          | no
<=30   | low    | yes     | fair          | yes
>40    | medium | yes     | fair          | yes
<=30   | medium | yes     | excellent     | yes
31…40  | medium | no      | excellent     | yes
31…40  | high   | yes     | fair          | yes
>40    | medium | no      | excellent     | no
February 3, 2018 IS ZC464 34 February 3, 2018 IS ZC464 35
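As a quick check of Problem 5 (values from the slide), the small Python sketch below compares the two unnormalized posteriors used by the MAP rule.

# MAP check for Problem 5 (numbers from the slides).
p_cancer, p_not = 0.008, 0.992
p_pos_cancer, p_pos_not = 0.98, 0.03

# Unnormalized posteriors P(+|h) P(h); the MAP hypothesis maximizes these.
score_cancer = p_pos_cancer * p_cancer       # 0.00784
score_not    = p_pos_not * p_not             # 0.02976
print(score_cancer, score_not)               # MAP hypothesis: no cancer
# Normalized: P(cancer | +) = 0.00784 / (0.00784 + 0.02976) ≈ 0.21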

Naïve Bayesian Classifier: Example

• Compute P(X|Ci) for each class
 P(age<=30 | buys_computer=yes) = 2/9 = 0.222          P(buys_computer=yes) = 9/14
 P(age<=30 | buys_computer=no)  = 3/5 = 0.6            P(buys_computer=no)  = 5/14
 P(income=medium | buys_computer=yes) = 4/9 = 0.444
 P(income=medium | buys_computer=no)  = 2/5 = 0.4
 P(student=yes | buys_computer=yes) = 6/9 = 0.667
 P(student=yes | buys_computer=no)  = 1/5 = 0.2
 P(credit_rating=fair | buys_computer=yes) = 6/9 = 0.667
 P(credit_rating=fair | buys_computer=no)  = 2/5 = 0.4
• X = (age<=30, income=medium, student=yes, credit_rating=fair)
 P(X|Ci):       P(X | buys_computer=yes) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
                P(X | buys_computer=no)  = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
 P(X|Ci)*P(Ci): P(X | buys_computer=yes) * P(buys_computer=yes) = 0.028
                P(X | buys_computer=no)  * P(buys_computer=no)  = 0.007
 => X belongs to class buys_computer = yes

Home Work

• A box contains 10 red and 15 blue balls. Two balls are selected at random and are discarded without their colors being seen. If a third ball is drawn randomly and observed to be red, what is the probability that both of the discarded balls were blue?
• Solution Hint:

P(BB | R) = P(R | BB) P(BB) / [ P(R | BB) P(BB) + P(R | BR) P(BR) + P(R | RR) P(RR) ]

February 3, 2018 IS ZC464 36 February 3, 2018 IS ZC464 37


What is Regression?
BITS Pilani
• The goal of regression is to predict the value
of one or more continuous target variables t
given the value of a D-dimensional vector x of
input variables.
• Polynomial curve fitting is an example of
regression.
Machine Learning (IS ZC464) Session 6:
Linear models for Regression

February 4, 2018 IS ZC464 2

Housing Prices (Portland, OR)
[Scatter plot: Price (in 1000s of dollars) versus Size (feet2)]
Supervised Learning: Given the "right answer" for each example in the data.
Regression Problem: Predict real-valued output

Training set of housing prices

Size in feet2 (x) | Price ($) in 1000's (y)
2104              | 460
1416              | 232
1534              | 315
852               | 178
…                 | …

Notation:
m = Number of training examples
x's = input variable / features
y's = output variable / target variable

Slides numbers 3-6 and 11-43 adapted from Coursera Courseware on the Machine Learning course offered by Prof. Andrew Ng.
February 4, 2018 IS ZC464 3 February 4, 2018 IS ZC464 4

Training Set

Training Set -> Learning Algorithm -> h
Size of house -> h -> Estimated price

Linear regression with one variable.
Univariate linear regression.

How do we represent h?

Hypothesis: h_θ(x) = θ0 + θ1 x
θ's: Parameters
How to choose θ's?
[Plots: three example lines for different choices of θ0 and θ1]
February 4, 2018 IS ZC464 5 February 4, 2018 IS ZC464 6


Recall: example to understand ERROR

Which line (hypothesis) fits the given data best?
[Scatter plot of the training data with several candidate lines]

Terminology

• Use parameters θ0 and θ1 to represent the intercept and slope of the line
• Use J(θ0, θ1) to represent the Error.
• Instead of root Mean Squared (RMS) error, consider Squared error.
• The number of training examples = m
• The ith data is x(i)
• The ith target is y(i)

February 4, 2018 IS ZC464 7 February 4, 2018 IS ZC464 8

Hypothesis

• Equation (1):  h_θ(x(i)) = θ0 + θ1 x(i)
Note: The notations used in Bishop's book (Section 3.1) are as follows
1. In place of the parameters θ, the book uses the notation w (later referred to as weights)
2. In place of <x(1), x(2), x(3), …, x(m)>, the book uses the vector x
3. In place of <y(1), y(2), y(3), …, y(m)>, the book uses the vector y.
4. In place of h_θ(x(i)), the book uses y(x,w) given by y(x,w) = w0 + w1 x (which is equivalent to equation (1))

Objective

• To find θ0, θ1 that minimize J(θ0, θ1)
• J(θ0, θ1) is given by the expression

J(θ0, θ1) = (1/2m) Σ_{i=1}^{m} ( h_θ(x(i)) - y(i) )^2

• Objective Function

Minimize_{θ0, θ1} Σ_{i=1}^{m} ( h_θ(x(i)) - y(i) )^2
February 4, 2018 IS ZC464 9 February 4, 2018 IS ZC464 10
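A tiny Python sketch (illustrative only) implements the squared-error cost above and reproduces the J(θ1) values worked out on the next slides for the toy data points (1,1), (2,2), (3,3).

# Squared-error cost J(theta0, theta1) for univariate linear regression,
# matching the slide's definition J = (1/2m) * sum (h(x_i) - y_i)^2.
def cost(theta0, theta1, xs, ys):
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs, ys = [1, 2, 3], [1, 2, 3]
print(cost(0, 1.0, xs, ys))   # 0.0   (perfect fit)
print(cost(0, 0.5, xs, ys))   # ~0.58
print(cost(0, 0.0, xs, ys))   # ~2.33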

Hypothesis:    h_θ(x) = θ0 + θ1 x
Parameters:    θ0, θ1
Cost Function: J(θ0, θ1) = (1/2m) Σ_{i=1}^{m} ( h_θ(x(i)) - y(i) )^2
Goal:          minimize_{θ0, θ1} J(θ0, θ1)

Simplified (θ0 = 0)
[Left plot: h_θ(x) for a fixed θ1, as a function of x; Right plot: J(θ1) as a function of the parameter θ1]

February 4, 2018 IS ZC464 11 February 4, 2018 IS ZC464 12


(for fixed θ1, this is a function of x)    (function of the parameter θ1)
[Data points (1,1), (2,2), (3,3) with the line for θ1 = 0.5 on the left; J(θ1) on the right]
Compute J(θ1) = (1/(2*3)) * { (0.5-1)^2 + (1-2)^2 + (1.5-3)^2 }
             = (1/6)*(0.25 + 1 + 2.25)
             = (1/6)*3.5 = 0.58

(for fixed θ1, this is a function of x)    (function of the parameter θ1)
[The same data with the line for θ1 = 0 on the left; J(θ1) on the right]
Compute J(θ1) = (1/(2*3)) * { (0-1)^2 + (0-2)^2 + (0-3)^2 }
             = (1/6)*(1 + 4 + 9)
             = (1/6)*14 = 2.3
February 4, 2018 IS ZC464 13 February 4, 2018 IS ZC464 14

(function of the parameter θ1)
The error curve J(θ1) is plotted for varying values of the parameter θ1.
[Plot: J(θ1) versus θ1]

Hypothesis:    h_θ(x) = θ0 + θ1 x
Parameters:    θ0, θ1
Cost Function: J(θ0, θ1) = (1/2m) Σ_{i=1}^{m} ( h_θ(x(i)) - y(i) )^2
Goal:          minimize_{θ0, θ1} J(θ0, θ1)

February 4, 2018 IS ZC464 15 February 4, 2018 IS ZC464 16

Surface and Corresponding contour plot
[3D surface of J(θ0, θ1) and its contour plot over (θ0, θ1)]

(for fixed θ0, θ1, this is a function of x)    (function of the parameters θ0, θ1)
[Left: Price ($) in 1000's versus Size in feet2 (x) with a candidate line; Right: contour plot of J(θ0, θ1)]

February 4, 2018 IS ZC464 17 February 4, 2018 IS ZC464 18


(for fixed θ0, θ1, this is a function of x)    (function of the parameters θ0, θ1)

February 4, 2018 IS ZC464 19 February 4, 2018 IS ZC464 20

(for fixed θ0, θ1, this is a function of x)    (function of the parameters θ0, θ1)

February 4, 2018 IS ZC464 21 February 4, 2018 IS ZC464 22

Have some function J(θ0, θ1)
Want min over θ0, θ1 of J(θ0, θ1)

Outline:
• Start with some θ0, θ1
• Keep changing θ0, θ1 to reduce J(θ0, θ1) until we hopefully end up at a minimum
[3D surface plot of J(θ0, θ1)]
February 4, 2018 IS ZC464 23 February 4, 2018 IS ZC464 24


Gradient descent algorithm

repeat until convergence {
    θj := θj - α · ∂/∂θj J(θ0, θ1)    (for j = 0 and j = 1)
}
Learning Rate: α

Correct: Simultaneous update
    temp0 := θ0 - α · ∂/∂θ0 J(θ0, θ1)
    temp1 := θ1 - α · ∂/∂θ1 J(θ0, θ1)
    θ0 := temp0
    θ1 := temp1
Incorrect: updating θ0 before computing the update for θ1
[3D surface plot of J(θ0, θ1)]

February 4, 2018 IS ZC464 25 February 4, 2018 IS ZC464 26

Gradient descent algorithm

If α is too small, gradient descent can be slow.
If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.

February 4, 2018 IS ZC464 27 February 4, 2018 IS ZC464 28

Gradient descent can converge to a local minimum, even with the learning rate α fixed.
[Plot: the derivative is zero at local optima; the current value of θ is marked on the curve]

As we approach a local minimum, gradient descent will automatically take smaller steps. So, there is no need to decrease α over time.

February 4, 2018 IS ZC464 29 February 4, 2018 IS ZC464 30


Gradient descent algorithm          Linear Regression Model
repeat until convergence {          h_θ(x) = θ0 + θ1 x
  θj := θj - α ∂/∂θj J(θ0, θ1)      J(θ0, θ1) = (1/2m) Σ ( h_θ(x(i)) - y(i) )^2
}

Gradient descent algorithm (for linear regression)
repeat until convergence {
  θ0 := θ0 - α (1/m) Σ_{i=1}^{m} ( h_θ(x(i)) - y(i) )
  θ1 := θ1 - α (1/m) Σ_{i=1}^{m} ( h_θ(x(i)) - y(i) ) · x(i)
}
update θ0 and θ1 simultaneously

February 4, 2018 IS ZC464 31 February 4, 2018 IS ZC464 32
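The following Python sketch (my own, not from the slides) runs batch gradient descent with the simultaneous update above on the toy data (1,1), (2,2), (3,3); the parameters converge towards θ0 ≈ 0, θ1 ≈ 1.

# Batch gradient descent for univariate linear regression,
# following the update rules above (simultaneous update of theta0 and theta1).
def gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        predictions = [theta0 + theta1 * x for x in xs]
        grad0 = sum(p - y for p, y in zip(predictions, ys)) / m
        grad1 = sum((p - y) * x for p, y, x in zip(predictions, ys, xs)) / m
        # simultaneous update of both parameters
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

xs, ys = [1, 2, 3], [1, 2, 3]
print(gradient_descent(xs, ys))   # approximately (0.0, 1.0)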

(for fixed θ0, θ1, this is a function of x)    (function of the parameters θ0, θ1)
[Slides 33-43: successive gradient descent iterations, showing the current fitted line h_θ(x) on the housing data (left) and the corresponding point moving across the contour plot of J(θ0, θ1) (right)]
February 4, 2018 IS ZC464 33-43


Single feature (variable): x

Size (feet2) | Price ($1000)
2104         | 460
1416         | 232
1534         | 315
852          | 178
…            | …

Multiple features (variables).

Size (feet2) | Number of bedrooms | Number of floors | Age of home (years) | Price ($1000)
2104         | 5                  | 1                | 45                  | 460
1416         | 3                  | 2                | 40                  | 232
1534         | 3                  | 2                | 30                  | 315
852          | 2                  | 1                | 36                  | 178
…            | …                  | …                | …                   | …

x(1) = [2104, 5, 1, 45]^T

Notation:
n = number of features
x(i) = input (features) of the ith training example.
xj(i) = value of feature j in the ith training example.

February 4, 2018 IS ZC464 44 February 4, 2018 IS ZC464 46

Hypothesis:
Previously: h_θ(x) = θ0 + θ1 x
Now with multiple variables or features:
h_θ(x) = θ0 + θ1 x1 + θ2 x2 + θ3 x3 + θ4 x4
(x1 = Size (feet2), x2 = Number of bedrooms, etc.)
Or, as per Bishop's book notations (page 38, section 3.1):
y(x,w) = w0 + w1 x1 + w2 x2 + w3 x3 + w4 x4

Hypothesis: h_θ(x) = θ0 + θ1 x1 + … + θn xn
Parameters: θ0, θ1, …, θn
Cost function: J(θ0, …, θn) = (1/2m) Σ_{i=1}^{m} ( h_θ(x(i)) - y(i) )^2
Gradient descent:
Repeat {
  θj := θj - α ∂/∂θj J(θ0, …, θn)
}
(simultaneously update for every j = 0, …, n)

February 4, 2018 IS ZC464 47 February 4, 2018 IS ZC464 48

Gradient Descent

Previously (n = 1):
Repeat {
  θ0 := θ0 - α (1/m) Σ ( h_θ(x(i)) - y(i) )
  θ1 := θ1 - α (1/m) Σ ( h_θ(x(i)) - y(i) ) x(i)
}
(simultaneously update θ0, θ1)

New algorithm (n ≥ 1):
Repeat {
  θj := θj - α (1/m) Σ ( h_θ(x(i)) - y(i) ) xj(i)
}
(simultaneously update θj for j = 0, …, n)

Linear Regression

y(x,w) = w0 + w1 x1 + w2 x2 + w3 x3 + ….. + wD xD

Key Properties of Linear Regression
• y is a linear function of the parameters w0, w1, w2, …, wD
• y is a linear function of the input variables (features) x0, x1, x2, …, xD

February 4, 2018 IS ZC464 49 February 4, 2018 IS ZC464 50


Generalized Form of Linear Regression

• A notion of a class of functions φi(x) is used to represent the regression function
• y(x,w) = w0 + w1 x1 + w2 x2 + w3 x3 + ….. + wD xD is represented as
  y(x,w) = w0 + w1 φ1(x) + w2 φ2(x) + w3 φ3(x) + ….. + wD φD(x)
  where φi(x) = xi
• The φi(x) are called basis functions

BITS Pilani
Machine Learning (IS ZC464) Session 7:
Linear models for Classification

February 4, 2018 IS ZC464 51

Classification

• The goal of classification is to take an input vector x and to assign it to one of K discrete classes Ck where k = 1, 2, 3, …, K
• Examples
 – Email: Spam / Not Spam?
 – Online Transactions: Fraudulent (Yes / No)?
 – Tumor: Malignant / Benign?

Decision Regions

• Training data is viewed to be plotted in a d-dimensional space where d is the number of features used.
• A test data point is also viewed to be mapped in the same space.
• The similarity (or closeness) of the test data to the clusters of the training classes is obtained.
• The nearest class is assigned to the test data

February 10, 2018 IS ZC464 2 February 10, 2018 IS ZC464 3

Binary Classification

• Only two classes
 – y = 0: Negative Class (e.g., benign tumor)
 – y = 1: Positive Class (e.g., malignant tumor)

Example of a Decision Boundary

[Plot: Malignant? (Yes = 1 / No = 0) versus Tumor Size, with a test data point and the decision boundary marked]
Threshold classifier output h_θ(x) at 0.5:
If h_θ(x) ≥ 0.5, predict "y = 1"
If h_θ(x) < 0.5, predict "y = 0"

February 10, 2018 IS ZC464 4 February 10, 2018 IS ZC464 5


Solving Classification Problems

• Require the decision boundaries (or surfaces in hyper-dimensional space) to be identified based on the training data.
• The decision boundary may be a line, a polynomial curve or a surface.
• The decision boundary can be represented as a hypothesis h(x)

Linearly Separable Non-Face Data
[Figure]

February 10, 2018 IS ZC464 6 February 10, 2018 IS ZC464 7

Each face is a point in the n-dimensional space. (ORL face data for three persons)
[Figure]

The points in the n-dimensional space cannot be clustered (colorwise) by hyperplanes.
[Figure]

February 10, 2018 IS ZC464 8 February 10, 2018 IS ZC464 9

Discriminant Functions

• Represent the decision boundary
• Discriminant functions are obtained by taking a linear function of the input vector (feature vector).
• Define y(x) = w0 + w1 x1 + w2 x2 + … + wD xD
• Take a simple case y(x) = w0 + w1 x
• This is the equation of a line.
• How does this behave as a decision boundary?

Example

• Consider the following training data
• Class 1: <1,2>, <1,1>, <2,1>
• Class 2: <3,3>, <3,4>, <4,3>
• Can view a decision boundary as a line separating the two classes
• The equation of the line is x2 = -x1 + 1 (not using y deliberately, as y is used for the target)

February 10, 2018 IS ZC464 10 February 10, 2018 IS ZC464 11


Example

• Test vector <4,4>
• Compute h(x) = x1 + x2 - 1 as 4 + 4 - 1 = 7
• Since h(x) > 4, the test data belongs to class 2
• Test vector <2,1.5>
• h(x) = 2 + 1.5 - 1 = 2.5 < 4
• Then it belongs to class 1

Define the hypothesis in terms of a vector product

W^T = [ w0  w1  ..  wD ],    X = [ x0  x1  ..  xD ]^T

Since y(x) = w0 + w1 x1 + w2 x2 + … + wD xD,

y = W^T X
February 10, 2018 IS ZC464 12 February 10, 2018 IS ZC464 13

Classification

• If W^T X ≥ 0, then the vector x belongs to class 2
• If W^T X < 0, then the test vector belongs to class 1
• Class Assignment
• Classify
 – <5,1>
 – <4,1>
 – <3,1>

How to get the best decision boundary?

• Based on experience using the training data
• We try to optimize the fitting of the decision boundary.
• If the training data is inappropriate, the classifier is likely to misclassify data.
February 10, 2018 IS ZC464 14 February 10, 2018 IS ZC464 15

Linear versus circular boundaries

Binary versus Multi-class classification
Binary classification: [plot in (x1, x2) space]    Multi-class classification: [plot in (x1, x2) space]
February 10, 2018 IS ZC464 16 February 10, 2018 IS ZC464 17

Nearest Neighbor Classification
[Figure: clusters with a test feature marked +; a "Do not know" condition when the test point is far from all clusters]
Nearest neighbor: Shortest distance to the mean of the cluster

Face data is nonlinearly separable (Hyper-Surfaces can create boundaries between clusters)

February 10, 2018 IS ZC464 18 February 10, 2018 IS ZC464 19

Classification Problem

Given Training Data
The closest cluster to the n-dimensional test feature vector is computed
Possible Decision Boundaries
• Hyper Plane
• Hyper Sphere
• Gaussian Surface
• Support Vectors
Challenge: Design of Decision Boundary

What to optimize?

• Given y(x) = w0 + w1 x1 + w2 x2 + … + wD xD
• Objective 1: Obtain W that gives the minimum error of classification, OR
• Objective 2: Obtain W that maximizes the separation of the classes
• Visualize the error surface discussed earlier with respect to classification error and find the parameters W that give the least error.

February 10, 2018 IS ZC464 20 February 10, 2018 IS ZC464 21

Decision Tree
BITS Pilani
 A decision tree takes as input an object or situation
described by a set of attributes and returns a decision.
 This decision is the predicted output value for the
input.
 The input attributes can be discrete or continuous.
 Classification Learning:
 Learning a discrete valued function is called classification
learning
 Regression:
 Learning a continuous function is called Regression.

Machine Learning (IS ZC464) Session 8: Decision Trees and Review Session

February 11, 2018 IS ZC464 2


Decision Tree

• A decision tree reaches its decision by performing a sequence of tests.
• All non-leaf nodes lead to partial decisions and assist in moving towards the leaf node.
• Leaf nodes are the decisions based on the properties satisfied at the non-leaf nodes on the path from the root node.

Decision tree

• Leaf nodes depict the decision about a character having attributes falling on the path from the root node
• Each example that participates in the construction of the decision tree is called a training data point, and the complete set of training data is called the training set.

February 11, 2018 IS ZC464 3 February 11, 2018 IS ZC464 4

Limitations of Decision Tree Learning

• The tree memorizes the observations but does not extract any pattern from the examples.
• This limits the capability of the learning algorithm in that the observations do not extrapolate to examples it has not seen.
Imagine the size of a decision tree with 1000 attributes capable of discriminating between persons!!!

How can we construct a decision tree for the face recognition problem

• Define attributes
• Collect the attribute data from training samples
• Associate the output (to be used as a leaf)

February 11, 2018 IS ZC464 5 February 11, 2018 IS ZC464 6

Decision trees

• The attributes aid in taking decisions.
• The most appropriate attribute is selected for testing in the beginning, else the size of the tree becomes large, resulting in large computational time.
• Leaf nodes represent the decisions.
• The attributes falling in the path from the root to a leaf represent the attributes fully able to define the decision at that leaf.

Goal Predicate: WillWait()

Problem: decide whether to wait for a table at a restaurant, based on the following attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)

This slide is adapted from the text book and from the set of slides available at
February 11, 2018 IS ZC464 7 aima.eecs.berkeley.edu/slides-ppt/m18-learning.ppt
February 11, 2018 IS ZC464 8
Attributes
[Table of the restaurant examples and their attribute values]
This slide is adapted from the text book and from the set of slides available at aima.eecs.berkeley.edu/slides-ppt/m18-learning.ppt

Decision Tree
[The induced restaurant decision tree]
This slide is adapted from the text book and from the set of slides available at aima.eecs.berkeley.edu/slides-ppt/m18-learning.ppt
February 11, 2018 IS ZC464 9 February 11, 2018 IS ZC464 10

Size of the decision tree

• The size of the decision tree depends on the choice of the attributes and the order in which they are used to test the examples.
• Selection of attributes must be fairly good, and really useless attributes such as 'type' should be avoided
• The quality of an attribute can be measured.
• One measure can be the amount of information the attribute carries.

Information content

• If the vi are the different possible answers and the P(vi) are the probabilities that the answer could be vi, then the information content I of the actual answer is given by
 – I(P(v1), P(v2), …, P(vn)) = - Σ P(vi) log2 P(vi)
• Assume that the training set contains p positive examples and n negative examples; then an estimate of the information contained in a correct answer is
 I(p/(p+n), n/(p+n)) = - (p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))
February 11, 2018 IS ZC464 11 February 11, 2018 IS ZC464 12

Refer the given table of Attributes

Information content


• Since
I(p/(p+n), n/(p+n)) = - (p/(p+n) ) log2(p/(p+n))
- (n/(p+n) ) log2(n/(p+n))
– information = -(6/12) log2 (1/2) – (6/12) log2 (1/2)
– = - log2 (1/2)
= log2 ((1/2)^-1)
= log2 (2)
= 1 bit

This slide is adapted from the text book and from the set of slides available at
aima.eecs.berkeley.edu/slides-ppt/m18-learning.ppt
February 11, 2018 IS ZC464 13 February 11, 2018 IS ZC464 14
Generalize the splitting

• Let the attribute A divide the entire training set into sets E1, E2, …, Ev, where v is the total number of values A can be tested on.
• Assume that each set Ei contains pi positive examples and ni negative examples
• Remainder(A) = Σ_{i=1}^{v} (pi+ni)/(p+n) · I(pi/(pi+ni), ni/(pi+ni))

Gain(A)

Gain(A) = I(p/(p+n), n/(p+n)) – Remainder(A)
The heuristic to choose attribute A from a set of all attributes is the maximum gain

Compute
1. Gain(Patrons)
2. Gain(Type)

February 11, 2018 IS ZC464 15 February 11, 2018 IS ZC464 16

Selecting the Patrons attribute

Selecting Type as the attribute

February 11, 2018 IS ZC464 17 February 11, 2018 IS ZC464 18

Refer the given table of Attributes and compute the Gain

Gain(Patrons)
• 1 – ( (2/12) I(0,1) + (4/12) I(1,0) + (6/12) I(2/6, 4/6) )
• Approximately equal to 0.541 bits

This slide is adapted from the text book and from the set of slides available at
February 11, 2018 IS ZC464 19 aima.eecs.berkeley.edu/slides-ppt/m18-learning.ppt
February 11, 2018 IS ZC464 20
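As a hedged check of this number (my own sketch, not part of the slides), the Python function below computes Gain(A) from the positive/negative counts of each split, reproducing Gain(Patrons) ≈ 0.541 and Gain(Type) = 0 for the restaurant example.

import math

def binary_info(p, n):
    """Information content I(p/(p+n), n/(p+n)) in bits; 0 when one class is empty."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:
            q = count / total
            result -= q * math.log2(q)
    return result

def gain(splits, p, n):
    """Gain(A) = I(p/(p+n), n/(p+n)) - sum_i (pi+ni)/(p+n) * I(pi, ni)."""
    remainder = sum((pi + ni) / (p + n) * binary_info(pi, ni) for pi, ni in splits)
    return binary_info(p, n) - remainder

# Restaurant example: 6 positive and 6 negative examples overall.
# Patrons splits into None (0+, 2-), Some (4+, 0-), Full (2+, 4-).
print(round(gain([(0, 2), (4, 0), (2, 4)], 6, 6), 3))          # ~0.541 bits
# Type splits into French (1+,1-), Italian (1+,1-), Thai (2+,2-), Burger (2+,2-).
print(round(gain([(1, 1), (1, 1), (2, 2), (2, 2)], 6, 6), 3))  # 0.0 bits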
Decision Trees

• Learning is through a series of decisions taken with respect to the attribute at each non-leaf node.
• There can be many trees possible for the given training data.
• Finding the smallest DT is an NP-complete problem.
• Greedy selection of the attribute with the largest gain to split the training data into two or more sub-classes may lead to approximately the smallest tree

Decision Trees

• If the decisions are binary, then in the best case each decision eliminates almost half of the regions (leaves).
• If there are b regions, then the correct region can be found in log2(b) decisions in the best case.
• The height of the decision tree depends on the order of the attributes selected to split the training examples at each step.
February 11, 2018 IS ZC464 21 February 11, 2018 IS ZC464 22

Expressiveness of the DT

• A decision tree can represent a disjunction of conjunctions of constraints on the attribute values of instances.
 – Each path corresponds to a conjunction
 – The tree itself corresponds to a disjunction

Example

If (O=Sunny AND H=Normal) OR (O=Overcast) OR (O=Rain AND W=Weak) then YES
 – A disjunction of conjunctions of constraints on attribute values

February 11, 2018 IS ZC464 23 February 11, 2018 IS ZC464 24

Entropy
• It is the measure of the information content
and is given by
– I = - Σ P(vi) log2 P(vi)
– Where v1,v2,..,vk are the values of the attribute on
which the decisions bifurcate.

February 11, 2018 IS ZC464 25 February 11, 2018 IS ZC464 26


Class Work

Remainder(A) = Σ_{i=1}^{v} (pi+ni)/(p+n) I(pi/(pi+ni), ni/(pi+ni))

• Identify the examples belonging to the two sets constructed after the data is split on the basis of the attribute 'student'.
• Compute the total information content of the training data.
• Compute the information gain if the training data is split on the basis of the attribute 'student'.
• Draw the decision tree, which may or may not be optimal.

Understand the examples

• Decisions are binary – yes / no
• Training data as <example, decision> pairs
• <r1,no>, <r2,no>, <r3,yes>, <r4,yes> and so on
• Positive examples: r3, r4, r5, r7, r9, r10, r11, r12, r13
• Negative examples: r1, r2, r6, r8, r14
• Is the given training set sufficient to take any decision?
• Is the generalization capability of the given training set sufficient?
February 11, 2018 IS ZC464 27 February 11, 2018 IS ZC464 28

Information content of the given training data

• Here v1 = yes, v2 = no
• Positive examples: r3, r4, r5, r7, r9, r10, r11, r12, r13
• Negative examples: r1, r2, r6, r8, r14
• Total number of examples = 14
• P(v1) = 9/14, P(v2) = 5/14
• The information content is represented by the notation I(9/14, 5/14)
• Entropy = - ( P(v1) log2(P(v1)) + P(v2) log2(P(v2)) )
          = - ( (9/14) log2(9/14) + (5/14) log2(5/14) )
          ≈ 0.940 bits

Compute the significance of attribute 'income'

(YES) r3, r4, r5, r7, r9, r10, r11, r12, r13
(NO)  r1, r2, r6, r8, r14

Split on income:
Low:    (YES) r5, r7, r9         (NO) r6
Medium: (YES) r4, r10, r11, r12  (NO) r8, r14
High:   (YES) r3, r13            (NO) r1, r2
February 11, 2018 IS ZC464 29 February 11, 2018 IS ZC464 30

Compute the significance of attribute 'income'

(YES) r3, r4, r5, r7, r9, r10, r11, r12, r13
(NO)  r1, r2, r6, r8, r14

Low:    (YES) r5, r7, r9         (NO) r6
Medium: (YES) r4, r10, r11, r12  (NO) r8, r14
High:   (YES) r3, r13            (NO) r1, r2

Observe that the split regions of examples possess mixed decisions; this shows the poor quality of the attribute 'income'.

Recall: Generalize the splitting

• Let the attribute A divide the entire training set into sets E1, E2, …, Ev, where v is the total number of values A can be tested on.
• Assume that each set Ei contains pi positive examples and ni negative examples
• Remainder(A) = Σ_{i=1}^{v} (pi+ni)/(p+n) I(pi/(pi+ni), ni/(pi+ni))

February 11, 2018 IS ZC464 31 February 11, 2018 IS ZC464 32


Compute the significance of attribute 'income'

Remainder(A) = Σ_{i=1}^{v} (pi+ni)/(p+n) I(pi/(pi+ni), ni/(pi+ni))

(YES) r3, r4, r5, r7, r9, r10, r11, r12, r13
(NO)  r1, r2, r6, r8, r14

Low:    (YES) r5, r7, r9         (NO) r6          p1 = 3, n1 = 1
Medium: (YES) r4, r10, r11, r12  (NO) r8, r14     p2 = 4, n2 = 2
High:   (YES) r3, r13            (NO) r1, r2      p3 = 2, n3 = 2

Compute the Remainder information if attribute 'income' is used for splitting

• Remainder = (4/14)*I(3/4, 1/4) + (6/14)*I(4/6, 2/6) + (4/14)*I(2/4, 2/4)
  = (4/14) { -(3/4) log2(3/4) – (1/4) log2(1/4) }
  + (6/14) { -(4/6) log2(4/6) – (2/6) log2(2/6) }
  + (4/14) { -(2/4) log2(2/4) – (2/4) log2(2/4) }
  [Home Work: Remaining computation]

February 11, 2018 IS ZC464 33 February 11, 2018 IS ZC464 34
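A quick way to check the home-work computation is the short Python sketch below (my own illustration, using the same I(p,n) definition as the slides).

import math

def I(p, n):
    """Information content of a (p, n) split, in bits."""
    t = p + n
    return -sum((c / t) * math.log2(c / t) for c in (p, n) if c)

# 'income' splits the 14 examples into Low (3+,1-), Medium (4+,2-), High (2+,2-); overall (9+, 5-).
remainder = (4/14) * I(3, 1) + (6/14) * I(4, 2) + (4/14) * I(2, 2)
print(round(remainder, 3), round(I(9, 5) - remainder, 3))   # remainder ≈ 0.911, Gain(income) ≈ 0.029 bits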

Review Session Review Session

Review Session

• Mid Semester Syllabus
 – All topics and details discussed in Sessions 1-8 [Refer Slides and video contents]
• Not included
 – Decision Theory [Handout S. No. 2.2]
 – Expectation Maximization (EM) Algorithm [Handout S. No. 3.3]
 – Bias-variance decomposition [Handout S. No. 3.4]

Review Session: What is learning?

 - Learning (for humans) is experience from the past.
 - A machine can be programmed to gather experience in the form of facts, instances, rules etc.
 - A machine with learning capability can predict about the new situation (seen or unseen) using its past experience.
 - Examples:
   - As we humans can tell a person's name on seeing him/her a second or fifth time, a machine can also do that.
   - As we humans can recognize a person's voice even without seeing the person's face, a machine can also be made to learn to do the same.

February 11, 2018 IS ZC464 35 February 11, 2018 IS ZC464 36

Review Session Review Session


Review Session: Class Experiment: Training

 - Let AA denote 5, BB denote 6, AAA denote 50, BBB denote 60, AAAA denote 500, BBBB denote 600
 - Can you find out the equivalent numerical value of AAAAA? 5000: yes/no?
 - Or of AABB? Not yet trained………

Review Session: Artificial Intelligence: An intelligent car navigation system [An Example]

 - A system to navigate a car to the airport works on its vision, enabled using a camera mounted at the front of the car.
 - The system sees the lane limits and the vehicles on the way, and controls the car from colliding. [Vision]
 - It follows the road directions.
 - It also follows the road rules.
 - The system learns to handle unforeseen situations. For example, if the traffic flow is restricted on a portion of the road temporarily, the system takes the alternative path. [Learning]

February 11, 2018 IS ZC464 37 February 11, 2018 IS ZC464 38


Review Session: More intelligence can be expected

• The system listens to the person sitting in the car asking to stop at a nearby hotel for a tea, sees around to find a hotel, keeps travelling till it finds one and stops the car. [Speech Recognition, Vision]
• Understands the mood of the person and starts music to suit the mood of the person. [Facial Expression]
• Can answer queries such as "how far is Pilani?", "What is the time?", "can I sleep for an hour?", "Please wake me up when it is … in the morning?" [Natural Language Processing]

Review Session: Other intelligent systems

• Smart home
 – Lights switch off if there is no one in the room
 – Curtains pull off at sunrise
 – Dust bin is emptied before it is overflowing
 – Smart water taps, toilets etc.
• Smart office
 – Automatic meeting summary
 – Speaker recognition and summary generation
• Automatic answering machine
February 11, 2018 IS ZC464 39 February 11, 2018 IS ZC464 40

Review Session Review Session

Review Session: Other intelligent machines

• An airplane cockpit can have an intelligent system that takes automatic control when hijacked [context and speech understanding, NLP, vision]
• Medical diagnosis systems trained with expert guidance can diagnose the patient's disease based on the X-ray, MRI images and other symptoms
• Automated theorem proving
• General problem solver

Review Session: Intelligent Agent

• An intelligent agent is a system that perceives its environment and takes actions which maximize its chances of success.
• Artificial Intelligence aims to build intelligent agents or entities.
February 11, 2018 IS ZC464 41 February 11, 2018 IS ZC464 42

Review Session Review Session

Review Session: Machine Learning Applications

• Speech recognition
• Automatic news summary
• Spam email detection
• Credit card fraud detection
• Face recognition
• Function approximation
• Stock market prediction and analysis
• Etc.

Review Session: Machine Learning

• A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. (Tom Mitchell)
February 11, 2018 IS ZC464 43 February 11, 2018 IS ZC464 44
Review Session Review Session

Review Session: Learning From Observations

• Learning Element:
 – responsible for making improvements
• Performance Element:
 – responsible for selecting external actions
• The learning element uses feedback from the critic on how the agent is doing and determines how the performance element should be modified to do better in the future

Review Session: Design of a learning Element

• Affected by three major issues:
 – Which components of the performance element are to be learned
 – What feedback is available to learn these components
 – What representation is used for the components.

February 11, 2018 IS ZC464 45 February 11, 2018 IS ZC464 46

Review Session: Traditional Vs. Machine Learning

Traditional Approach: Input Data + Program -> Output
Machine Learning:     Input Data + Output  -> Program

Review Session: Training and testing: Prediction

X  Y
1  1
5  5
2  2
4  4
3  3

• Recall Learning: A machine with learning capability can predict about the new situation (seen or unseen) using its past experience.
• Prediction: Given values of x and y, predict the value of y for x = 71
• Prediction is based on learning the relationship between x and y
• Training data is the collection of (x,y) pairs
• Testing data is simply the value of x for which the value of y is required to be predicted.

February 11, 2018 IS ZC464 47 February 11, 2018 IS ZC464 48

Review Session: What did the system learn?

X  Y
1  1
5  5
2  2
4  4
3  3

Y = f(x)
• Y = x
• What is its generalization ability?
• Most accurate, or we can say 100%
• What if the data to train the system changes slightly? The machine can still be made to learn.

Review Session: Learning of a function from given sample data
[Plot: straight-line fit through the sample points]

February 11, 2018 IS ZC464 49 February 11, 2018 IS ZC464 50


Review Session: Learning of a function from given sample data - straight line learning

Straight Line
Which line fits the best?
• The line is represented by the parameters slope and intercept
• The machine must learn on its own which line is the best fit
• How? Using the data – known as training data, i.e. (x,y) pairs

Review Session: Understanding ERROR

Which line (hypothesis) fits the given data best?
[Scatter plot of the training data with several candidate lines]

February 11, 2018 IS ZC464 51 February 11, 2018 IS ZC464 52

Review Session: Plotting error when y = f(x)

Hypothesis function: hw(x) = wx (linear in one variable)
[Plot: E(w) versus w, with the w corresponding to the minimum error marked]

Review Session: Plotting error when y = f(x1, x2)

Hypothesis function: hw(x) = w1 x1 + w2 x2 ……. (linear in two variables)
[Another possible surface: E(w) plotted over (w1, w2), with local minima and the global minimum marked]

February 11, 2018 IS ZC464 53 February 11, 2018 IS ZC464 54

Review Session Review Session

Review Session: Uncertainty in the real world

• Uncertainty in reaching New Delhi Airport in 5 hours from Pilani
 – The cab engine may or may not work at any moment
 – The route is diverted due to a procession on the way
 – The road condition is bad unexpectedly
 – The tire needs replacement
 Etc.
• A person having stomach ache can be told that he is suffering from an ulcer, while in actuality it may be gastritis or overeating

Review Session: Recall

• Knowledge representation using Probability
• Random variables
• Atomic events
• Conditional probability
• Prior probability
• Marginalization
• Bayes theorem and its application in problem solving
• Joint probability distribution (JPD) table and probabilistic inference
gastritis or overeating probabilistic inference

February 11, 2018 IS ZC464 55 February 11, 2018 IS ZC464 56


Review Session: Bayes theorem

• Bayes theorem provides a way to calculate the probability of a hypothesis based on its prior probability, the probability of observing various data given the hypothesis, and the observed data itself.

Review Session: Bayesian learning, Example 1: observation of sounds

Training with observed data: {d1, d2, d3} = training data (say D)
d1: 'cat' sounds with 'ae'
d2: 'pot' sounds with 'aw'
d3: '_at' sounds with 'ae'
Features such as 'a' and 'o' are obtained through preprocessing of the given words – by parsing.

Sounds 'ae' and 'aw' are the observed targets that we know.
Prior probabilities: P(sound = 'ae') = 0.5, P(sound = 'aw') = 0.5
Conditional probabilities are represented as
P('ae' | feature = 'a') = 2/3, P('aw' | feature = 'o') = 1/3
OR P('ae' | d1,d2,d3) = 2/3, P('aw' | d1,d2,d3) = 1/3
OR P('ae' | D) = 2/3, P('aw' | D) = 1/3
February 11, 2018 IS ZC464 57 February 11, 2018 IS ZC464 58

Review Session Review Session

Review Session: Hypothesis

• In learning algorithms, the term hypothesis is used in contexts such as
 Concept learning or classification: a class label or category
 Function approximation: a curve, a line or a polynomial
 Decision making: a decision tree
• Plural of hypothesis: Hypotheses (multiple labels, multiple curves, multiple decision trees)
• Best Hypothesis (always preferred): most appropriate class, best-fit curve, smallest decision tree

Review Session: Bayesian Learning

• Training: Through the computation of the probabilities as in the previous two slides
• Testing: of unknown words
 – Example testing: Which sound does the word 'cat' make?
 Preprocess 'cat' to get feature 'a' and compute P(h|a), where P(h|D) is known, D is the set of 10 observations used to train the system, and h is the hypothesis.
 Compute the probabilities P(ae|a), P(oo|a), P(a~|a) and P(aw|a) to obtain the likelihood of the sound of cat.

February 11, 2018 IS ZC464 59 February 11, 2018 IS ZC464 60

Review Session: Maximum a Posteriori (MAP) hypothesis

• Consider a set of hypotheses H and the observed data used for training, D
• Define

h_MAP = Argmax_{h ∈ H} P(h | D)

• The maximally probable hypothesis is called a maximum a posteriori (MAP) hypothesis.

Review Session: Recall

• MAP algorithm
• Gibbs Algorithm
• Minimum Description Length Principle
• Information theory – entropy
• Bayes Optimal Classifier
• Naïve Bayes Classifier

February 11, 2018 IS ZC464 61 February 11, 2018 IS ZC464 62


Review Session Review Session

What is Regression? Training Set How do we represent h ?


Hypothesis:
• The goal of regression is to predict the value
of o e o o e o ti uous ta get a ia les t
given the value of a D-dimensional vector x of Learning Algorithm
‘s: Parameters
input variables.
• Polynomial curve fitting is an example of
regression. Size of
h
Estimate How to choose ‘s ?
house d price

Linear regression with one variable.


Univariate linear regression.

February 11, 2018 IS ZC464 63 February 11, 2018 IS ZC464 64

Review Session Review Session


(for fixed θ0, θ1, this is a function of x)    (function of the parameters θ0, θ1)
[Left: Price ($) in 1000's versus Size in feet2 (x) with the fitted line; Right: contour plot of J(θ0, θ1)]

Gradient descent algorithm
Learning Rate: α
Correct: Simultaneous update of θ0 and θ1    Incorrect: sequential (non-simultaneous) update

February 11, 2018 IS ZC464 65 February 11, 2018 IS ZC464 66

Review Session Review Session

Linear Regression

y(x,w) = w0 + w1 x1 + w2 x2 + w3 x3 + ….. + wD xD

Key Properties of Linear Regression
• y is a linear function of the parameters w0, w1, w2, …, wD
• y is a linear function of the input variables (features) x0, x1, x2, …, xD

Generalized Form of Linear Regression

• A notion of a class of functions φi(x) is used to represent the regression function
• y(x,w) = w0 + w1 x1 + w2 x2 + w3 x3 + ….. + wD xD is represented as
  y(x,w) = w0 + w1 φ1(x) + w2 φ2(x) + w3 φ3(x) + ….. + wD φD(x)
  where φi(x) = xi
• The φi(x) are called basis functions for i = 1, 2, 3, …, D

February 11, 2018 IS ZC464 67 February 11, 2018 IS ZC464 68


Review Session
Basis functions

• Linear basis functions: φi(x) = x (linear in x)
• Nonlinear basis functions:
 φi(x) = x^2 (quadratic in x)
 φi(x) = x^3 (cubic in x)

What is linear in linear regression?

• The following expression is linear in W:
 y(x,w) = w0 + w1 φ1(x) + w2 φ2(x) + w3 φ3(x) + ….. + wD φD(x)
• The basis functions may be linear or nonlinear in x

February 11, 2018 IS ZC464 69 February 11, 2018 IS ZC464 70
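To make this concrete, here is a minimal Python/NumPy sketch (my own, assuming a synthetic quadratic target) that fits a model which is linear in the weights w while using basis functions that are nonlinear (polynomial) in x.

import numpy as np

# Design matrix with polynomial basis functions phi_i(x) = x**i, i = 0..degree.
def design_matrix(x, degree):
    return np.vstack([x**i for i in range(degree + 1)]).T

# Noisy samples of a quadratic target (synthetic data for illustration).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
t = 1.0 + 2.0 * x - 3.0 * x**2 + 0.05 * rng.standard_normal(x.size)

# Least-squares solution for w in y(x, w) = sum_i w_i * phi_i(x):
Phi = design_matrix(x, degree=2)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print(w)   # approximately [1.0, 2.0, -3.0]

The key design point mirrored here is that the model stays linear in w even though the φi(x) are nonlinear in x, so ordinary least squares still applies.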

Review Session Review Session

Review Session: Classification

• The goal of classification is to take an input vector x and to assign it to one of K discrete classes Ck where k = 1, 2, 3, …, K
• Examples
 – Email: Spam / Not Spam?
 – Online Transactions: Fraudulent (Yes / No)?
 – Tumor: Malignant / Benign?

Review Session: Example of a Decision Boundary

[Plot: Malignant? (Yes = 1 / No = 0) versus Tumor Size, with a test data point and the decision boundary marked]
Threshold classifier output h_θ(x) at 0.5:
If h_θ(x) ≥ 0.5, predict "y = 1"
If h_θ(x) < 0.5, predict "y = 0"

February 11, 2018 IS ZC464 71 February 11, 2018 IS ZC464 72

Review Session Review Session

Review Session: Solving Classification Problems

• Require the decision boundaries (or surfaces in hyper-dimensional space) to be identified based on the training data.
• The decision boundary may be a line, a polynomial curve or a surface.
• The decision boundary can be represented as a hypothesis h(x)

Review Session: Example

• Test vector <4,4>
• Compute h(x) = x1 + x2 - 1 as 4 + 4 - 1 = 7
• Since h(x) > 4, the test data belongs to class 2
• Test vector <2,1.5>
• h(x) = 2 + 1.5 - 1 = 2.5 < 4
• Then it belongs to class 1

February 11, 2018 IS ZC464 73 February 11, 2018 IS ZC464 74


Review Session Review Session

Recall Bayes Theo e ased p o le


• Decision boundaries • A box contains 10 red and 15 blue balls. Two balls are
selected at random and are discarded without their
• Binary and multi class classification colors being seen. If a third ball is drawn randomly and
observed to be red, what is the probability that both of
• Decision trees the discarded balls were blue?
• Gain and remainder • Atomic events for the selection of two balls
RR: both balls were Red
• Information content etc. RB: One is red ball and the other is Blue
BB: Both are blue balls
To find P(R | BB) : Probability that the third ball is red
given that both the discarded balls were blue.

February 11, 2018 IS ZC464 75 February 11, 2018 IS ZC464 76

Review Session
Review Session: Probability of the third ball being red

• Number of ways to select two balls = 25C2 = 300
• Number of ways to select two red balls = 10C2 = 45
• Number of ways to select two blue balls = 15C2 = 105
• Number of ways to select one red and one blue ball = 10C1 * 15C1 = 10*15 = 150
• P(RR) = 45/300 = 0.15
• P(BR) = 150/300 = 0.50
• P(BB) = 105/300 = 0.35
• Therefore the probability that the third ball is red is

P(R) = P(R, RR) + P(R, BB) + P(R, BR)
     = P(R|RR)*P(RR) + P(R|BB)*P(BB) + P(R|BR)*P(BR)
     = (8/23)*(45/300) + (10/23)*(105/300) + (9/23)*(150/300)
     = 0.0522 + 0.1522 + 0.1957
     = 0.4
(After two balls are discarded, 23 balls remain; e.g. if both discarded balls were blue, 10 of the remaining 23 are red, so P(R|BB) = 10/23.)

February 11, 2018 IS ZC464 77 February 11, 2018 IS ZC464 78

Review Session: Probability that the two discarded balls were blue given that the third ball is red

• The expression is P(BB|R) and, using Bayes Theorem, is given by

P(BB|R) = P(R|BB) P(BB) / [ P(R|BB) P(BB) + P(R|BR) P(BR) + P(R|RR) P(RR) ]

P(BB|R) = 0.1522 / 0.4 ≈ 0.38 (Answer)

BITS Pilani
Machine Learning (IS ZC464) Session 9:
Problem solving and doubt clearing session

February 11, 2018 IS ZC464 79
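As a hedged sanity check of this answer (my own sketch, not part of the slides), the Monte Carlo simulation below estimates P(both discarded blue | third ball red) directly.

import random

# Monte Carlo check of the discarded-balls problem (10 red, 15 blue).
trials, both_blue_and_red_third, red_third = 1_000_000, 0, 0
for _ in range(trials):
    box = ["R"] * 10 + ["B"] * 15
    random.shuffle(box)
    discarded, third = box[:2], box[2]
    if third == "R":
        red_third += 1
        if discarded == ["B", "B"]:
            both_blue_and_red_third += 1

print(both_blue_and_red_third / red_third)   # ≈ 0.38, i.e. (10/23 * 105/300) / 0.4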


Problem 1

A doctor knows that the disease meningitis causes the patient to have a stiff neck (S), say 50% of the time. The doctor also knows some unconditional facts: the prior probability that a patient has meningitis (M) is 1/50,000, and the prior probability that any patient has a stiff neck is 1/20. What is the probability that a patient with a stiff neck has meningitis?

Solution

• P(S|M) = 0.5
• P(M) = 1/50,000 = 0.00002
• P(S) = 1/20 = 0.05
• To find P(M|S)
• P(M|S) = P(M,S)/P(S) = P(S|M)*P(M)/P(S) = 0.5 * 0.00002 / 0.05 = 0.0002 (Ans)
February 17, 2018 IS ZC464 2 February 17, 2018 IS ZC464 3

Problem 2

• A science competition had students from three schools A, B and C. The numbers of students who participated from schools A, B and C respectively are 50, 80 and 70. The probability of a student qualifying (Q) the competition given his/her school is P(Q|A) = 0.6, P(Q|B) = 0.25 and P(Q|C) = 0.45.
Q1. Compute the probability of qualifying, P(Q).
Q2. Compute the probability of a student belonging to school B given that he/she qualified, i.e. compute P(B|Q)

Given

• Prior probabilities
 P(A) = 50/200 = 0.25
 P(B) = 80/200 = 0.4
 P(C) = 70/200 = 0.35
• Conditional probabilities
 P(Q|A) = 0.6,
 P(Q|B) = 0.25 and
 P(Q|C) = 0.45
February 17, 2018 IS ZC464 4 February 17, 2018 IS ZC464 5

Solution (for computing P(Q))

• Joint occurrences of Q:
 – Student qualifies and is from school A
 – Student qualifies and is from school B
 – Student qualifies and is from school C
• P(Q) = P(Q,A) + P(Q,B) + P(Q,C)   {Marginalization rule}
• P(Q) = P(Q|A)*P(A) + P(Q|B)*P(B) + P(Q|C)*P(C)   {using the product rule for each term}
• P(Q) = 0.6*0.25 + 0.25*0.4 + 0.45*0.35 = 0.150 + 0.100 + 0.1575 = 0.4075

Solution (for computing P(B|Q))

• Since P(B|Q) = P(Q|B)*P(B)/P(Q)   {Bayes' rule - refer slide 44, session 3}
 P(B|Q) = 0.25*0.4/0.4075 = 0.2454
February 17, 2018 IS ZC464 6 February 17, 2018 IS ZC464 7
Problem 3

• Consider four 2-dimensional feature vectors belonging to class 1: <1,5>, <1,3>, <2,3>, <3,4>, and five feature vectors belonging to class 2: <1,1>, <2,1>, <3,0>, <3,2>, <4,2>
• Classify the test feature vector <4,3>.

Example

• Consider the following training data
• Class 1: <1,5>, <1,3>, <2,3>, <3,4>
• Class 2: <1,1>, <2,1>, <3,0>, <3,2>, <4,2>
• Can view a decision boundary as a line separating the two classes
• Slope m = 3/5
• Intercept = 1
• The equation of the decision boundary (which is a line) is x2 = (3/5)x1 + 1
• Hypothesis h(x) = x2 - (3/5)x1 - 1

February 17, 2018 IS ZC464 8 February 17, 2018 IS ZC464 9

Define the classifier

• h(x) = x2 - (3/5)x1 - 1
• Define the threshold T appropriately such that
 If h(x) > T then the test data x belongs to class 1
 If h(x) < T then the test data x belongs to class 2
 Else report misclassification.
• Let T = 0

Example

• Test vector x = <x1,x2> = <4,3>
• Compute the hypothesis h(x) = x2 - (3/5)x1 - 1
• Since h(x) = 3 - ((3/5)*4) - 1 = 3 - 2.4 - 1 = -0.4 < 0 (threshold)
• Therefore the test data belongs to class 2
• Can also verify visually in the figure.
February 17, 2018 IS ZC464 10 February 17, 2018 IS ZC464 11
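A minimal Python sketch of this decision rule (my own illustration, using the h(x) and threshold from the slides) is given below; the second call uses the class-work test vector from the next slide.

# Linear decision rule from Problem 3: h(x) = x2 - (3/5) x1 - 1, threshold T = 0.
def classify(x1, x2, threshold=0.0):
    h = x2 - (3 / 5) * x1 - 1
    if h > threshold:
        return 1   # class 1 lies above the line x2 = (3/5) x1 + 1
    if h < threshold:
        return 2   # class 2 lies below the line
    return None    # on the boundary: report misclassification / undecided

print(classify(4, 3))    # 2  (h = -0.4)
print(classify(2, 5))    # 1  (h = 2.8)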

Class work

• Hypothesis h(x) = x2 - (3/5)x1 - 1
• Threshold (T) = 0
• Classify test data <2,5>

Problem 4

• Training data for a two-class classification problem (say vehicle recognition among bus and car) is defined using two attributes: size and engine type.
• Q1. The training data observes 8 times out of 20 that the vehicle is a bus and 12 times that it was a car. Compute the total information content of the training data.

February 17, 2018 IS ZC464 12 February 17, 2018 IS ZC464 13


Solution (Q2)

• If the attribute size splits the training data into two parts on its values 'big' and 'small' as shown in the figure, then compute its Gain.
(bus) r3, r4, r5, r7, r11, r12, r13, r20
(car) r1, r2, r6, r8, r14, r9, r10, r15, r16, r17, r18, r19

size
big:   (bus) r5, r7, r11, r12, r13, r20    (car) r8, r17, r9, r10
small: (bus) r3, r4                        (car) r1, r2, r6, r15, r16, r14, r18, r19

Recall: Generalize the splitting

• Let the attribute A divide the entire training set into sets E1, E2, …, Ev, where v is the total number of values A can be tested on.
• Assume that each set Ei contains pi positive examples and ni negative examples
• Remainder(A) = Σ_{i=1}^{v} (pi+ni)/(p+n) I(pi/(pi+ni), ni/(pi+ni))

February 17, 2018 IS ZC464 14 February 17, 2018 IS ZC464 15

Gain(A)

Gain(A) = I(p/(p+n), n/(p+n)) – Remainder(A)
The heuristic to choose attribute A from a set of all attributes is the maximum gain
Compute
1. Gain(Patrons)
2. Gain(Type)

Solution (Q1)

• The training data observes 8 times out of 20 that the vehicle is a bus and 12 times that it was a car. Compute the total information content of the training data.
• p = 8 and n = 12
I(p/(p+n), n/(p+n)) = - (p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))
I(8/20, 12/20) = I(0.4, 0.6)
= -0.4 log2(0.4) – 0.6 log2(0.6)
= (1/log10 2) { -0.4 log10(0.4) – 0.6 log10(0.6) }
= (1/0.3010) { (-0.4)*(-0.3979) + (-0.6)*(-0.2218) }
= (3.3222) { 0.15916 + 0.13308 }
= 3.3222 * 0.29224
= 0.97087 bits

Solution (Q2)

• Now we have
• p1 = 6, n1 = 4
• p2 = 2, n2 = 8
(bus) r3, r4, r5, r7, r11, r12, r13, r20
(car) r1, r2, r6, r8, r14, r9, r10, r15, r16, r17, r18, r19

size
big:   (bus) r5, r7, r11, r12, r13, r20    (car) r8, r17, r9, r10
small: (bus) r3, r4                        (car) r1, r2, r6, r15, r16, r14, r18, r19

Remainder computation

• p1 = 6, n1 = 4; p2 = 2, n2 = 8
• Remainder(A) = Σ_{i=1}^{v} (pi+ni)/(p+n) I(pi/(pi+ni), ni/(pi+ni))
 = (p1+n1)/(p+n) I(p1/(p1+n1), n1/(p1+n1)) + (p2+n2)/(p+n) I(p2/(p2+n2), n2/(p2+n2))
 = (6+4)/(8+12) I(6/(6+4), 4/(6+4)) + (2+8)/(8+12) I(2/(2+8), 8/(2+8))
 = (10/20) I(0.6, 0.4) + (10/20) I(0.2, 0.8)
 = 0.5 * (-0.6 log2(0.6) – 0.4 log2(0.4)) + 0.5 * (-0.2 log2(0.2) – 0.8 log2(0.8))
(Computation left as home work)
February 17, 2018 IS ZC464 18 February 17, 2018 IS ZC464 19
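A short Python check of this home-work computation (my own sketch, same definitions as the slides) is given below.

import math

def I(p, n):
    """Information content of a (p, n) split, in bits."""
    t = p + n
    return -sum((c / t) * math.log2(c / t) for c in (p, n) if c)

# Attribute 'size': big -> (6 bus, 4 car), small -> (2 bus, 8 car); overall (8 bus, 12 car).
remainder = (10 / 20) * I(6, 4) + (10 / 20) * I(2, 8)
print(round(remainder, 3), round(I(8, 12) - remainder, 3))   # remainder ≈ 0.846, Gain(size) ≈ 0.124 bits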


Problem 5

Compute P(cavity), P(cavity, toothache), P(toothache)

              toothache           ¬toothache
              catch    ¬catch     catch    ¬catch
Cavity        0.06     0.19       0.05     0.10
¬Cavity       0.09     0.01       0.22     0.28

Solution

• Class work

February 17, 2018 IS ZC464 20 February 17, 2018 IS ZC464 21

Questions?

Best Wishes for your Mid Semester test

February 17, 2018 IS ZC464 22 February 17, 2018 IS ZC464 23
