Ch.6 Machine Learning

Hantao Zhang
http://www.cs.uiowa.edu/~hzhang/c145

Department of Computer Science
The University of Iowa

Machine Learning

A distinct feature of intelligent agents in nature is their ability to learn from experience.

Using its experience and its internal knowledge, a learning agent is able to produce new knowledge.

Learning elements:
what type of performance element is used
which functional component is to be learned
how that functional component is represented
what kind of feedback is available

Inductive Learning

Simplest form: learn a function from examples

f is the target function

An example is a pair ⟨x, f(x)⟩, e.g., ⟨..., +1⟩

Problem: find hypothesis h such that h ≈ f, given a training set of examples

This is a highly simplified model of real learning.

Basic Procedures

1. Collect randomly a large set of examples.
2. Choose randomly a subset of the examples as training set.
3. Apply the learning algorithm to the training set, generating a hypothesis h.
4. Measure the percentage of examples in the whole set that are correctly classified by h.
5. Repeat steps 1-4 for different sizes of training sets if the performance is not satisfactory.

In general, examples are divided into two sets:
training set: construct h, and
test set: check if h is good
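As a minimal illustration (not from the slides), here are steps 2-4 in Python, assuming examples is a list of (x, f(x)) pairs and learn is some learning algorithm that returns a hypothesis h; all names are placeholders:

import random

def evaluate(examples, learn, train_fraction=0.5):
    # Step 2: choose randomly a subset of the examples as training set.
    sample = examples[:]
    random.shuffle(sample)
    cut = int(len(sample) * train_fraction)
    training_set = sample[:cut]
    # Step 3: apply the learning algorithm, generating a hypothesis h.
    h = learn(training_set)
    # Step 4: percentage of examples in the WHOLE set correctly classified by h.
    correct = sum(1 for x, y in examples if h(x) == y)
    return correct / len(examples)

Calling evaluate repeatedly with increasing train_fraction (step 5) traces out the learning curve discussed later.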

Inductive learning method

Construct/adjust h to agree with f on training set
(h is consistent if it agrees with f on all examples)

E.g., curve fitting:

[Figure, repeated over four slides: data points (x, f(x)) with candidate hypotheses h of increasing complexity fitted to them.]
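A hedged sketch of the curve-fitting example using NumPy polynomials; the data points are made up for illustration:

import numpy as np

# Hypothetical training set of pairs (x, f(x)).
xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ys = np.array([0.0, 0.8, 0.9, 0.1, -0.8])

for degree in (1, 2, 4):
    h = np.poly1d(np.polyfit(xs, ys, degree))  # construct/adjust h
    consistent = np.allclose(h(xs), ys)        # agrees with f on all examples?
    print(degree, consistent)

# The degree-4 polynomial interpolates all five points (consistent);
# lower degrees disagree on some examples but are simpler hypotheses.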

Ockham's razor: maximize a combination of consistency and simplicity.

Performance Measurement

How do we know that h ≈ f? (Hume's Problem of Induction)

1. Use theorems of computational/statistical learning theory
2. Try h on a new test set of examples (use same distribution over example space as training set)

Learning curve = % correct on test set as a function of training set size

Why Learning Works?

There is a theoretic foundation: Computational Learning Theory.

The underlying principle: Any hypothesis that is consistent with a sufficiently large set of training examples is unlikely to be seriously wrong: it must be Probably Approximately Correct (PAC).

The Stationarity Assumption: The training and test sets are drawn randomly from the same population of examples using the same probability distribution.

Suppose H is the set of all hypotheses. To make sure that the error probability on the test set is less than ε, and that the probability that a good hypothesis escapes the learning algorithm is less than δ, we need a training set of size m, where

m ≥ (1/ε) (ln(1/δ) + ln |H|)
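To get a feel for the bound, a small sketch; the choice k = 6 and the identity |H| = 2^(2^k) for the class of all Boolean functions of k attributes are illustrative assumptions, not from the slides:

from math import ceil, log

def pac_sample_size(eps, delta, ln_h):
    # m >= (1/eps) * (ln(1/delta) + ln|H|)
    return ceil((1.0 / eps) * (log(1.0 / delta) + ln_h))

# All Boolean functions of k = 6 attributes: |H| = 2^(2^6),
# so ln|H| = 2^6 * ln 2.
print(pac_sample_size(0.1, 0.05, 2**6 * log(2)))  # 474 examples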


Decision Trees for Classification

A decision tree is a compact representation (or generalization) of some decision rules.

Trees have decision nodes and outputs. In general, one variable in a decision node. Leaf nodes are outputs.

From Examples to Decision Tree

Examples described by attribute values (Boolean, discrete, continuous, etc.)
E.g., situations where I will/won't wait for a table:

Ex.  | Alt Bar Fri Hun Pat   Pri  Rai Res Type    Est   | Wait
X1   |  T   F   F   T  Some  $$$   F   T  French  0-10  |  T
X2   |  T   F   F   T  Full  $     F   F  Thai    30-60 |  F
X3   |  F   T   F   F  Some  $     F   F  Burger  0-10  |  T
X4   |  T   F   T   T  Full  $     T   F  Thai    10-30 |  T
X5   |  T   F   T   F  Full  $$$   F   T  French  >60   |  F
X6   |  F   T   F   T  Some  $$    T   T  Italian 0-10  |  T
X7   |  F   T   F   F  None  $     T   F  Burger  0-10  |  F
X8   |  F   F   F   T  Some  $$    T   T  Thai    0-10  |  T
X9   |  F   T   T   F  Full  $     T   F  Burger  >60   |  F
X10  |  T   T   T   T  Full  $$$   F   T  Italian 10-30 |  F
X11  |  F   F   F   F  None  $     F   F  Thai    0-10  |  F
X12  |  T   T   T   T  Full  $     F   F  Burger  30-60 |  T

(Columns Alt(ernate), Bar, Fri(/Sat), Hun(gry), Pat(rons), Pri(ce), Rai(ning), Res(ervation), Type, and Est(imated wait, in minutes) are the attributes; Wait is the target.)

Decision trees

One possible representation for hypotheses.
E.g., here is the true tree for deciding whether to wait:

Patrons?
  None: No
  Some: Yes
  Full: WaitEstimate?
    >60: No
    30-60: Alternate?
      No: Reservation?
        No: Bar? (No: No, Yes: Yes)
        Yes: Yes
      Yes: Fri/Sat? (No: No, Yes: Yes)
    10-30: Hungry?
      No: Yes
      Yes: Alternate?
        No: Yes
        Yes: Raining? (No: No, Yes: Yes)
    0-10: Yes
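One way to encode this tree in Python, as a sketch: nested tuples/dicts are an illustrative representation (not from the slides), and classify follows decision nodes until it reaches a leaf output:

wait_tree = ("Patrons?", {
    "None": "No",
    "Some": "Yes",
    "Full": ("WaitEstimate?", {
        ">60": "No",
        "30-60": ("Alternate?", {
            "No": ("Reservation?", {
                "No": ("Bar?", {"No": "No", "Yes": "Yes"}),
                "Yes": "Yes",
            }),
            "Yes": ("Fri/Sat?", {"No": "No", "Yes": "Yes"}),
        }),
        "10-30": ("Hungry?", {
            "No": "Yes",
            "Yes": ("Alternate?", {
                "No": "Yes",
                "Yes": ("Raining?", {"No": "No", "Yes": "Yes"}),
            }),
        }),
        "0-10": "Yes",
    }),
})

def classify(tree, example):
    # Follow decision nodes (one variable each) down to a leaf output.
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[example[attribute]]
    return tree

print(classify(wait_tree, {"Patrons?": "Full", "WaitEstimate?": "10-30",
                           "Hungry?": "No"}))  # Yes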

Expressiveness

Decision trees can express any function of the input attributes. E.g., for Boolean functions, each truth table row is a path from root to leaf:

A | B | A xor B
F | F |    F
F | T |    T
T | F |    T
T | T |    F

Trivially, there exists a consistent decision tree for any training set w/ one path to leaf for each example (unless f is nondeterministic in x), but it probably won't generalize to new examples.

Prefer to find more compact decision trees.

Decision tree learning

Aim: find a small tree consistent with the training examples

Idea: (recursively) choose most significant attribute as root of (sub)tree

[Figure: two candidate splits of the 12 examples, by Patrons? (None, Some, Full) and by Type? (French, Italian, Thai, Burger).]

Reason: A good attribute splits the examples into subsets that are (ideally) all positive or all negative.

Patrons? is a better choice: it gives information about the classification.

Information

Information answers questions.

The more clueless I am about the answer initially, the more information is contained in the answer.

Scale: 1 = answer to Boolean question with probability (0.5, 0.5);
0 = answer to Boolean question with known result.

Information in an answer when the probability distribution is (P1, ..., Pn) is

E(P1, ..., Pn) = − Σ_{i=1}^{n} Pi log2 Pi

(also called the entropy of the probability distribution)

Entropy of a binary variable

E(p) = E(p, 1 − p) = −p lg p − (1 − p) lg(1 − p)

E(1) = E(0) = 0; E(1/2) = 1;
E(2/3) = E(1/3) ≈ 0.918; E(3/4) = E(1/4) ≈ 0.811
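A small sketch that reproduces these values directly from the definition:

from math import log2

def entropy(probs):
    # E(P1, ..., Pn) = -sum Pi * log2 Pi, taking 0 * log2 0 = 0.
    return -sum(p * log2(p) for p in probs if p > 0)

def E(p):
    # Entropy of a binary variable: E(p) = E(p, 1 - p).
    return entropy((p, 1 - p))

print(E(1/2))            # 1.0
print(round(E(2/3), 3))  # 0.918
print(round(E(3/4), 3))  # 0.811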

Information (cont'd)

Suppose we have p positive and n negative examples in S at the root. Hence E(S) = E(p/(p+n), n/(p+n)) bits are needed to classify a new example. E.g., for the 12 restaurant examples with p = n = 6, we need 1 bit.

Suppose an attribute A splits the examples S into subsets Si, each of which (we hope) needs less information to complete the classification.

Let Si have pi positive and ni negative examples. Then E(pi/(pi+ni), ni/(pi+ni)) bits are needed to classify Si.

The remaining bits after choosing A will be

Remain(A) = Σ_i ((pi + ni)/(p + n)) E(pi/(pi + ni), ni/(pi + ni))

Idea: Choose A such that Remain(A) is minimal, or Gain(S, A) = E(S) − Remain(A) is maximal.

Example contd.

Remain(Patrons?) = (2/12) E(0, 1) + (4/12) E(1, 0) + (6/12) E(2/6, 4/6) = 0.459

Remain(Type?) = (2/12) E(1/2, 1/2) + (2/12) E(1/2, 1/2) + (4/12) E(2/4, 2/4) + (4/12) E(2/4, 2/4) = 1

So we choose the attribute Patrons?, which minimizes the remaining information.

Decision Tree Algorithm

Let S be the training set, E(S) be its entropy, and A be a set of attributes.

procedure DecisionTreeLearning(S, A)
1. If S = ∅, return the default value;
2. If all examples in S have the same result, return the result;
3. If A = {A} contains a single attribute, return a tree of height one with A as the root;
4. Pick A ∈ A such that Gain(S, A) = E(S) − Remain(A) is maximal, as the root of the tree;
5. For each value i of attribute A, let Si be the set of examples with A = i; call DecisionTreeLearning(Si, A − {A}) recursively to obtain a tree Ti, and let Ti be the child of the root A, with label i on the edge connecting to Ti.

Decision tree learned from the 12 examples:

Patrons?
  None: No
  Some: Yes
  Full: Hungry?
    No: No
    Yes: Type?
      French: Yes
      Italian: No
      Thai: Fri/Sat? (No: No, Yes: Yes)
      Burger: Yes

Substantially simpler than the true tree: a more complex hypothesis isn't justified by the small amount of data.
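As a check on the attribute-selection step (step 4), a sketch where each candidate split is summarized by its (pi, ni) counts; the counts below are read off the 12 restaurant examples, and the helper names are illustrative:

from math import log2

def E2(p, n):
    # Binary entropy of a subset with p positive and n negative examples.
    total = p + n
    return -sum((k / total) * log2(k / total) for k in (p, n) if k > 0)

def remain(splits):
    # Remain(A) = sum_i (pi + ni)/(p + n) * E(pi/(pi+ni), ni/(pi+ni))
    total = sum(p + n for p, n in splits)
    return sum(((p + n) / total) * E2(p, n) for p, n in splits)

# (pi, ni) per attribute value, from the 12 restaurant examples:
patrons = [(0, 2), (4, 0), (2, 4)]          # None, Some, Full
type_   = [(1, 1), (1, 1), (2, 2), (2, 2)]  # French, Italian, Thai, Burger

print(round(remain(patrons), 3))  # 0.459
print(round(remain(type_), 3))    # 1.0
# With E(S) = 1 bit here, Gain(S, Patrons?) = 0.541 is maximal,
# so Patrons? is chosen as the root.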

Another Example

[Figure-only slides.]

Summary

Learning needed for unknown environments, lazy designers

Learning method depends on type of performance element, available feedback, type of component to be improved, and its representation

For supervised learning, the aim is to find a simple hypothesis approximately consistent with training examples

Learning performance = prediction accuracy measured on test set