Ch.6 Machine Learning

Hantao Zhang
http://www.cs.uiowa.edu/~hzhang/c145

Department of Computer Science
The University of Iowa

Machine Learning

A distinct feature of intelligent agents in nature is their ability to learn from experience.

Using its experience and its internal knowledge, a learning agent is able to produce new knowledge.

Learning elements:
what type of performance element is used
which functional component is to be learned
how that functional component is represented
what kind of feedback is available

Inductive Learning

Simplest form: learn a function from examples

f is the target function

An example is a pair ⟨x, f(x)⟩, e.g., ⟨..., +1⟩

Problem: find hypothesis h such that h ≈ f, given a training set of examples

This is a highly simplified model of real learning.

Basic Procedures

1. Collect randomly a large set of examples.
2. Choose randomly a subset of the examples as training set.
3. Apply the learning algorithm to the training set, generating a hypothesis h.
4. Measure the percentage of examples in the whole set that are correctly classified by h.
5. Repeat steps 1-4 for different sizes of training sets if the performance is not satisfactory.

In general, examples are divided into two sets:
training set: construct h, and
test set: check if h is good
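As a minimal illustration (not from the slides), here are steps 2-4 in Python, assuming examples is a list of (x, f(x)) pairs and learn is some learning algorithm that returns a hypothesis h; all names are placeholders:

import random

def evaluate(examples, learn, train_fraction=0.5):
    # Step 2: choose randomly a subset of the examples as training set.
    sample = examples[:]
    random.shuffle(sample)
    cut = int(len(sample) * train_fraction)
    training_set = sample[:cut]
    # Step 3: apply the learning algorithm, generating a hypothesis h.
    h = learn(training_set)
    # Step 4: percentage of examples in the WHOLE set correctly classified by h.
    correct = sum(1 for x, y in examples if h(x) == y)
    return correct / len(examples)

Calling evaluate repeatedly with increasing train_fraction (step 5) traces out the learning curve discussed later.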

Inductive learning method

Construct/adjust h to agree with f on training set
(h is consistent if it agrees with f on all examples)

E.g., curve fitting:

[Figure, repeated over four slides: data points (x, f(x)) with candidate hypotheses h of increasing complexity fitted to them.]
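A hedged sketch of the curve-fitting example using NumPy polynomials; the data points are made up for illustration:

import numpy as np

# Hypothetical training set of pairs (x, f(x)).
xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ys = np.array([0.0, 0.8, 0.9, 0.1, -0.8])

for degree in (1, 2, 4):
    h = np.poly1d(np.polyfit(xs, ys, degree))  # construct/adjust h
    consistent = np.allclose(h(xs), ys)        # agrees with f on all examples?
    print(degree, consistent)

# The degree-4 polynomial interpolates all five points (consistent);
# lower degrees disagree on some examples but are simpler hypotheses.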

Ockham's razor: maximize a combination of consistency and simplicity.

Performance Measurement

How do we know that h ≈ f? (Hume's Problem of Induction)

1. Use theorems of computational/statistical learning theory
2. Try h on a new test set of examples (use same distribution over example space as training set)

Learning curve = % correct on test set as a function of training set size

Why Learning Works?

There is a theoretic foundation: Computational Learning Theory.

The underlying principle: Any hypothesis that is consistent with a sufficiently large set of training examples is unlikely to be seriously wrong: it must be Probably Approximately Correct (PAC).

The Stationarity Assumption: The training and test sets are drawn randomly from the same population of examples using the same probability distribution.

Suppose H is the set of all hypotheses. To make sure that the error probability on the test set is less than ε, and that the probability that a good hypothesis escapes the learning algorithm is less than δ, we need a training set of size m, where

m ≥ (1/ε) (ln(1/δ) + ln |H|)
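To get a feel for the bound, a small sketch; the choice k = 6 and the identity |H| = 2^(2^k) for the class of all Boolean functions of k attributes are illustrative assumptions, not from the slides:

from math import ceil, log

def pac_sample_size(eps, delta, ln_h):
    # m >= (1/eps) * (ln(1/delta) + ln|H|)
    return ceil((1.0 / eps) * (log(1.0 / delta) + ln_h))

# All Boolean functions of k = 6 attributes: |H| = 2^(2^6),
# so ln|H| = 2^6 * ln 2.
print(pac_sample_size(0.1, 0.05, 2**6 * log(2)))  # 474 examples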


Decision Trees for Classification

A decision tree is a compact representation (or generalization) of some decision rules.

Trees have decision nodes and outputs. In general, one variable in a decision node. Leaf nodes are outputs.

From Examples to Decision Tree

Examples described by attribute values (Boolean, discrete, continuous, etc.)
E.g., situations where I will/won't wait for a table:

Ex.  | Alt Bar Fri Hun Pat   Pri  Rai Res Type    Est   | Wait
X1   |  T   F   F   T  Some  $$$   F   T  French  0-10  |  T
X2   |  T   F   F   T  Full  $     F   F  Thai    30-60 |  F
X3   |  F   T   F   F  Some  $     F   F  Burger  0-10  |  T
X4   |  T   F   T   T  Full  $     T   F  Thai    10-30 |  T
X5   |  T   F   T   F  Full  $$$   F   T  French  >60   |  F
X6   |  F   T   F   T  Some  $$    T   T  Italian 0-10  |  T
X7   |  F   T   F   F  None  $     T   F  Burger  0-10  |  F
X8   |  F   F   F   T  Some  $$    T   T  Thai    0-10  |  T
X9   |  F   T   T   F  Full  $     T   F  Burger  >60   |  F
X10  |  T   T   T   T  Full  $$$   F   T  Italian 10-30 |  F
X11  |  F   F   F   F  None  $     F   F  Thai    0-10  |  F
X12  |  T   T   T   T  Full  $     F   F  Burger  30-60 |  T

(Columns Alt(ernate), Bar, Fri(/Sat), Hun(gry), Pat(rons), Pri(ce), Rai(ning), Res(ervation), Type, and Est(imated wait, in minutes) are the attributes; Wait is the target.)

Decision trees

One possible representation for hypotheses.
E.g., here is the true tree for deciding whether to wait:

Patrons?
  None: No
  Some: Yes
  Full: WaitEstimate?
    >60: No
    30-60: Alternate?
      No: Reservation?
        No: Bar? (No: No, Yes: Yes)
        Yes: Yes
      Yes: Fri/Sat? (No: No, Yes: Yes)
    10-30: Hungry?
      No: Yes
      Yes: Alternate?
        No: Yes
        Yes: Raining? (No: No, Yes: Yes)
    0-10: Yes
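One way to encode this tree in Python, as a sketch: nested tuples/dicts are an illustrative representation (not from the slides), and classify follows decision nodes until it reaches a leaf output:

wait_tree = ("Patrons?", {
    "None": "No",
    "Some": "Yes",
    "Full": ("WaitEstimate?", {
        ">60": "No",
        "30-60": ("Alternate?", {
            "No": ("Reservation?", {
                "No": ("Bar?", {"No": "No", "Yes": "Yes"}),
                "Yes": "Yes",
            }),
            "Yes": ("Fri/Sat?", {"No": "No", "Yes": "Yes"}),
        }),
        "10-30": ("Hungry?", {
            "No": "Yes",
            "Yes": ("Alternate?", {
                "No": "Yes",
                "Yes": ("Raining?", {"No": "No", "Yes": "Yes"}),
            }),
        }),
        "0-10": "Yes",
    }),
})

def classify(tree, example):
    # Follow decision nodes (one variable each) down to a leaf output.
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[example[attribute]]
    return tree

print(classify(wait_tree, {"Patrons?": "Full", "WaitEstimate?": "10-30",
                           "Hungry?": "No"}))  # Yes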

Expressiveness

Decision trees can express any function of the input attributes. E.g., for Boolean functions, each truth table row is a path from root to leaf:

A | B | A xor B
F | F |    F
F | T |    T
T | F |    T
T | T |    F

Trivially, there exists a consistent decision tree for any training set w/ one path to leaf for each example (unless f is nondeterministic in x), but it probably won't generalize to new examples.

Prefer to find more compact decision trees.

Decision tree learning

Aim: find a small tree consistent with the training examples

Idea: (recursively) choose most significant attribute as root of (sub)tree

[Figure: two candidate splits of the 12 examples, by Patrons? (None, Some, Full) and by Type? (French, Italian, Thai, Burger).]

Reason: A good attribute splits the examples into subsets that are (ideally) all positive or all negative.

Patrons? is a better choice: it gives information about the classification.

Information

Information answers questions.

The more clueless I am about the answer initially, the more information is contained in the answer.

Scale: 1 = answer to Boolean question with probability (0.5, 0.5);
0 = answer to Boolean question with known result.

Information in an answer when the probability distribution is (P1, ..., Pn) is

E(P1, ..., Pn) = − Σ_{i=1}^{n} Pi log2 Pi

(also called the entropy of the probability distribution)

Entropy of a binary variable

E(p) = E(p, 1 − p) = −p lg p − (1 − p) lg(1 − p)

E(1) = E(0) = 0; E(1/2) = 1;
E(2/3) = E(1/3) ≈ 0.918; E(3/4) = E(1/4) ≈ 0.811
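A small sketch that reproduces these values directly from the definition:

from math import log2

def entropy(probs):
    # E(P1, ..., Pn) = -sum Pi * log2 Pi, taking 0 * log2 0 = 0.
    return -sum(p * log2(p) for p in probs if p > 0)

def E(p):
    # Entropy of a binary variable: E(p) = E(p, 1 - p).
    return entropy((p, 1 - p))

print(E(1/2))            # 1.0
print(round(E(2/3), 3))  # 0.918
print(round(E(3/4), 3))  # 0.811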

Information (cont'd)

Suppose we have p positive and n negative examples in S at the root. Hence E(S) = E(p/(p+n), n/(p+n)) bits are needed to classify a new example. E.g., for the 12 restaurant examples with p = n = 6, we need 1 bit.

Suppose an attribute A splits the examples S into subsets Si, each of which (we hope) needs less information to complete the classification.

Let Si have pi positive and ni negative examples. Then E(pi/(pi+ni), ni/(pi+ni)) bits are needed to classify Si.

The remaining bits after choosing A will be

Remain(A) = Σ_i ((pi + ni)/(p + n)) E(pi/(pi + ni), ni/(pi + ni))

Idea: Choose A such that Remain(A) is minimal, or Gain(S, A) = E(S) − Remain(A) is maximal.

Example contd.

Remain(Patrons?) = (2/12) E(0, 1) + (4/12) E(1, 0) + (6/12) E(2/6, 4/6) = 0.459

Remain(Type?) = (2/12) E(1/2, 1/2) + (2/12) E(1/2, 1/2) + (4/12) E(2/4, 2/4) + (4/12) E(2/4, 2/4) = 1

So we choose the attribute Patrons?, which minimizes the remaining information.

Decision Tree Algorithm

Let S be the training set, E(S) be its entropy, and A be a set of attributes.

procedure DecisionTreeLearning(S, A)
1. If S = ∅, return the default value;
2. If all examples in S have the same result, return the result;
3. If A = {A} contains a single attribute, return a tree of height one with A as the root;
4. Pick A ∈ A such that Gain(S, A) = E(S) − Remain(A) is maximal, as the root of the tree;
5. For each value i of attribute A, let Si be the set of examples with A = i; call DecisionTreeLearning(Si, A − {A}) recursively to obtain a tree Ti, and let Ti be the child of the root A, with label i on the edge connecting to Ti.

Decision tree learned from the 12 examples:

Patrons?
  None: No
  Some: Yes
  Full: Hungry?
    No: No
    Yes: Type?
      French: Yes
      Italian: No
      Thai: Fri/Sat? (No: No, Yes: Yes)
      Burger: Yes

Substantially simpler than the true tree: a more complex hypothesis isn't justified by the small amount of data.
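As a check on the attribute-selection step (step 4), a sketch where each candidate split is summarized by its (pi, ni) counts; the counts below are read off the 12 restaurant examples, and the helper names are illustrative:

from math import log2

def E2(p, n):
    # Binary entropy of a subset with p positive and n negative examples.
    total = p + n
    return -sum((k / total) * log2(k / total) for k in (p, n) if k > 0)

def remain(splits):
    # Remain(A) = sum_i (pi + ni)/(p + n) * E(pi/(pi+ni), ni/(pi+ni))
    total = sum(p + n for p, n in splits)
    return sum(((p + n) / total) * E2(p, n) for p, n in splits)

# (pi, ni) per attribute value, from the 12 restaurant examples:
patrons = [(0, 2), (4, 0), (2, 4)]          # None, Some, Full
type_   = [(1, 1), (1, 1), (2, 2), (2, 2)]  # French, Italian, Thai, Burger

print(round(remain(patrons), 3))  # 0.459
print(round(remain(type_), 3))    # 1.0
# With E(S) = 1 bit here, Gain(S, Patrons?) = 0.541 is maximal,
# so Patrons? is chosen as the root.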

Another Example

[Figure-only slides.]

Summary

Learning needed for unknown environments, lazy designers

Learning method depends on type of performance element, available feedback, type of component to be improved, and its representation

For supervised learning, the aim is to find a simple hypothesis approximately consistent with training examples

Learning performance = prediction accuracy measured on test set