
Decision Tree Examples 2

Lecture Notes Professor Anita Wasilewska

Training data
rec   Age     Income  Student  Credit_rating  Buys_computer
r1    <=30    High    No       Fair           No
r2    <=30    High    No       Excellent      No
r3    31..40  High    No       Fair           Yes
r4    >40     Medium  No       Fair           Yes
r5    >40     Low     Yes      Fair           Yes
r6    >40     Low     Yes      Excellent      No
r7    31..40  Low     Yes      Excellent      Yes
r8    <=30    Medium  No       Fair           No
r9    <=30    Low     Yes      Fair           Yes
r10   >40     Medium  Yes      Fair           Yes
r11   <=30    Medium  Yes      Excellent      Yes
r12   31..40  Medium  No       Excellent      Yes
r13   31..40  High    Yes      Fair           Yes
r14   >40     Medium  No       Excellent      No
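For experimenting with the calculations in these notes, the training table can be encoded directly in Python (a minimal sketch; the field names and string encoding of the values are chosen here for illustration):

```python
# The 14 training records, one tuple per row of the table above.
training_data = [
    ("r1",  "<=30",   "High",   "No",  "Fair",      "No"),
    ("r2",  "<=30",   "High",   "No",  "Excellent", "No"),
    ("r3",  "31..40", "High",   "No",  "Fair",      "Yes"),
    ("r4",  ">40",    "Medium", "No",  "Fair",      "Yes"),
    ("r5",  ">40",    "Low",    "Yes", "Fair",      "Yes"),
    ("r6",  ">40",    "Low",    "Yes", "Excellent", "No"),
    ("r7",  "31..40", "Low",    "Yes", "Excellent", "Yes"),
    ("r8",  "<=30",   "Medium", "No",  "Fair",      "No"),
    ("r9",  "<=30",   "Low",    "Yes", "Fair",      "Yes"),
    ("r10", ">40",    "Medium", "Yes", "Fair",      "Yes"),
    ("r11", "<=30",   "Medium", "Yes", "Excellent", "Yes"),
    ("r12", "31..40", "Medium", "No",  "Excellent", "Yes"),
    ("r13", "31..40", "High",   "Yes", "Fair",      "Yes"),
    ("r14", ">40",    "Medium", "No",  "Excellent", "No"),
]
fields = ("rec", "Age", "Income", "Student", "Credit_rating", "Buys_computer")
records = [dict(zip(fields, row)) for row in training_data]

# 9 positive (Yes) and 5 negative (No) examples, the P and N used in the
# information-gain calculations later in these notes.
n_yes = sum(1 for r in records if r["Buys_computer"] == "Yes")
n_no = len(records) - n_yes
```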

Decision Tree 1, Root: student

student = No (7 records):

  Age     <=30  <=30       31..40  >40     <=30    31..40     >40
  Income  High  High       High    Medium  Medium  Medium     Medium
  CR      Fair  Excellent  Fair    Fair    Fair    Excellent  Excellent
  Class   No    No         Yes     Yes     No      Yes        No

student = Yes (7 records):

  Age     >40   >40        31..40     <=30  >40     <=30       31..40
  Income  Low   Low        Low        Low   Medium  Medium     High
  CR      Fair  Excellent  Excellent  Fair  Fair    Excellent  Fair
  Class   Yes   No         Yes        Yes   Yes     Yes        Yes

Decision Tree 1: income (L) and income (R)


student = No branch, split on income:

  income = High:
    Age    <=30  <=30       31..40
    CR     Fair  Excellent  Fair
    Class  No    No         Yes

  income = Medium:
    Age    >40   <=30  31..40     >40
    CR     Fair  Fair  Excellent  Excellent
    Class  Yes   No    Yes        No

student = Yes branch, split on income:

  income = Low:
    Age    >40   >40        31..40     <=30
    CR     Fair  Excellent  Excellent  Fair
    Class  Yes   No         Yes        Yes

  income = Medium:
    Age    >40   <=30
    CR     Fair  Excellent
    Class  Yes   Yes

  income = High:
    Age    31..40
    CR     Fair
    Class  Yes

Decision Tree 1: next step


student = No branch, split on income:

  income = High:
    Age    <=30  <=30       31..40
    CR     Fair  Excellent  Fair
    Class  No    No         Yes

  income = Medium:
    Age    >40   <=30  31..40     >40
    CR     Fair  Fair  Excellent  Excellent
    Class  Yes   No    Yes        No

student = Yes branch, split on income:

  income = Low:
    Age    >40   >40        31..40     <=30
    CR     Fair  Excellent  Excellent  Fair
    Class  Yes   No         Yes        Yes

  income = Medium: all records are Yes, so this branch becomes a leaf: Yes
  income = High: the single record is Yes, so this branch becomes a leaf: Yes

Decision Tree 1 : next step


student = No:
  income = High, split on age:
    age <=30:   CR Fair, Excellent; Class No, No
    age 31..40: CR Fair; Class Yes
  income = Medium, split on CR:
    CR = Fair:      Age >40, <=30; Class Yes, No
    CR = Excellent: Age >40, 31..40; Class No, Yes
student = Yes:
  income = Low, split on CR:
    CR = Fair:      Age >40, <=30; Class Yes, Yes
    CR = Excellent: Age 31..40, >40; Class Yes, No
  income = Medium: leaf Yes
  income = High: leaf Yes

Decision Tree 1: next step


student = No:
  income = High, split on age: <=30 → No; 31..40 → Yes
  income = Medium, split on CR:
    CR = Fair:      Age >40, <=30; Class Yes, No
    CR = Excellent: Age >40, 31..40; Class No, Yes
student = Yes:
  income = Low, split on CR:
    CR = Fair: leaf Yes
    CR = Excellent: Age 31..40, >40; Class Yes, No
  income = Medium: leaf Yes
  income = High: leaf Yes

Decision Tree 1 : last step


student = No → income
  income = High → age: <=30 → No; 31..40 → Yes
  income = Medium → CR
    CR = Fair → age: >40 → Yes; <=30 → No
    CR = Excellent → age: >40 → No; 31..40 → Yes
student = Yes → income
  income = Low → CR
    CR = Fair → Yes
    CR = Excellent → age: 31..40 → Yes; >40 → No
  income = Medium → Yes
  income = High → Yes

Tree 1 Classification rules

1. student(no)^income(high)^age(<=30) => buys_computer(no)
2. student(no)^income(high)^age(31..40) => buys_computer(yes)
3. student(no)^income(medium)^CR(fair)^age(>40) => buys_computer(yes)
4. student(no)^income(medium)^CR(fair)^age(<=30) => buys_computer(no)
5. student(no)^income(medium)^CR(excellent)^age(>40) => buys_computer(no)
6. student(no)^income(medium)^CR(excellent)^age(31..40) => buys_computer(yes)
7. student(yes)^income(low)^CR(fair) => buys_computer(yes)
8. student(yes)^income(low)^CR(excellent)^age(31..40) => buys_computer(yes)
9. student(yes)^income(low)^CR(excellent)^age(>40) => buys_computer(no)
10. student(yes)^income(medium) => buys_computer(yes)
11. student(yes)^income(high) => buys_computer(yes)
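As a cross-check, the eleven rules can be encoded as a Python function (a sketch; the string encoding of the attribute values is illustrative). Applied to the 14 training records, the rules reproduce every class label:

```python
def classify_tree1(age, income, student, cr):
    """Tree 1 rules 1-11; returns 'Yes'/'No' for the attribute
    combinations that occur in the training data."""
    if student == "No":
        if income == "High":
            return "No" if age == "<=30" else "Yes"      # rules 1-2
        if income == "Medium":
            if cr == "Fair":
                return "Yes" if age == ">40" else "No"   # rules 3-4
            return "No" if age == ">40" else "Yes"       # rules 5-6
    else:
        if income == "Low":
            if cr == "Fair":
                return "Yes"                             # rule 7
            return "Yes" if age == "31..40" else "No"    # rules 8-9
        return "Yes"                                     # rules 10-11

# (age, income, student, CR, class) for records r1..r14
train = [
    ("<=30", "High", "No", "Fair", "No"), ("<=30", "High", "No", "Excellent", "No"),
    ("31..40", "High", "No", "Fair", "Yes"), (">40", "Medium", "No", "Fair", "Yes"),
    (">40", "Low", "Yes", "Fair", "Yes"), (">40", "Low", "Yes", "Excellent", "No"),
    ("31..40", "Low", "Yes", "Excellent", "Yes"), ("<=30", "Medium", "No", "Fair", "No"),
    ("<=30", "Low", "Yes", "Fair", "Yes"), (">40", "Medium", "Yes", "Fair", "Yes"),
    ("<=30", "Medium", "Yes", "Excellent", "Yes"), ("31..40", "Medium", "No", "Excellent", "Yes"),
    ("31..40", "High", "Yes", "Fair", "Yes"), (">40", "Medium", "No", "Excellent", "No"),
]
correct = sum(classify_tree1(a, i, s, c) == y for a, i, s, c, y in train)
accuracy = correct / len(train)
```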

Decision Tree 2: Root Income


income = High (4 records):
  Age      <=30  <=30       31..40  31..40
  Student  No    No         No      Yes
  CR       Fair  Excellent  Fair    Fair
  Class    No    No         Yes     Yes

income = Medium (6 records):
  Age      >40   <=30  >40   <=30       31..40     >40
  Student  No    No    Yes   Yes        No         No
  CR       Fair  Fair  Fair  Excellent  Excellent  Excellent
  Class    Yes   No    Yes   Yes        Yes        No

income = Low (4 records):
  Age      >40   >40        31..40     <=30
  Student  Yes   Yes        Yes        Yes
  CR       Fair  Excellent  Excellent  Fair
  Class    Yes   No         Yes        Yes

Second split (High on age, Medium on student, Low on CR):

income = High, split on age:
  age <=30:   CR Fair, Excellent; Class No, No
  age 31..40: CR Fair, Fair; Class Yes, Yes
income = Medium, split on student:
  student = No:  Age >40, <=30, 31..40, >40; CR Fair, Fair, Excellent, Excellent; Class Yes, No, Yes, No
  student = Yes: Age >40, <=30; CR Fair, Excellent; Class Yes, Yes
income = Low, split on CR:
  CR = Fair:      Age >40, <=30; Class Yes, Yes
  CR = Excellent: Age >40, 31..40; Class No, Yes

Decision Tree 2 : next step

income = High, split on age: <=30 → No; 31..40 → Yes
income = Medium, split on student:
  student = No:
    Age    >40   <=30  31..40     >40
    CR     Fair  Fair  Excellent  Excellent
    Class  Yes   No    Yes        No
  student = Yes: leaf Yes
income = Low, split on CR:
  CR = Fair: leaf Yes
  CR = Excellent: Age >40, 31..40; Class No, Yes

Decision Tree 2 : next step

income = High, split on age: <=30 → No; 31..40 → Yes
income = Medium, split on student:
  student = No, split on age:
    age <=30:   CR Fair; Class No
    age 31..40: CR Excellent; Class Yes
    age >40:    CR Fair, Excellent; Class Yes, No
  student = Yes: leaf Yes
income = Low, split on CR:
  CR = Fair: leaf Yes
  CR = Excellent: Age >40, 31..40; Class No, Yes

Decision Tree 2 : next step

income = High, split on age: <=30 → No; 31..40 → Yes
income = Medium, split on student:
  student = No, split on age:
    age <=30 → No;  age 31..40 → Yes
    age >40: CR Fair, Excellent; Class Yes, No
  student = Yes: leaf Yes
income = Low, split on CR:
  CR = Fair: leaf Yes
  CR = Excellent: Age >40, 31..40; Class No, Yes

Decision Tree 2 : next step

income = High, split on age: <=30 → No; 31..40 → Yes
income = Medium, split on student:
  student = No, split on age: <=30 → No; 31..40 → Yes; >40 → CR: Fair → Yes; Excellent → No
  student = Yes: leaf Yes
income = Low, split on CR:
  CR = Fair: leaf Yes
  CR = Excellent, split on age: 31..40 → Yes; >40 → No

Decision Tree 2 : last step

income = High → age: <=30 → No; 31..40 → Yes
income = Medium → student
  student = No → age: <=30 → No; 31..40 → Yes; >40 → CR: Fair → Yes; Excellent → No
  student = Yes → Yes
income = Low → CR
  CR = Fair → Yes
  CR = Excellent → age: 31..40 → Yes; >40 → No

Tree 2 Classification rules

1. income(high)^age(<=30) => buys_computer(no)
2. income(high)^age(31..40) => buys_computer(yes)
3. income(medium)^student(no)^age(<=30) => buys_computer(no)
4. income(medium)^student(no)^age(31..40) => buys_computer(yes)
5. income(medium)^student(no)^age(>40)^CR(fair) => buys_computer(yes)
6. income(medium)^student(no)^age(>40)^CR(excellent) => buys_computer(no)
7. income(medium)^student(yes) => buys_computer(yes)
8. income(low)^CR(fair) => buys_computer(yes)
9. income(low)^CR(excellent)^age(>40) => buys_computer(no)
10. income(low)^CR(excellent)^age(31..40) => buys_computer(yes)
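Tree 2 can be cross-checked the same way (a sketch; rules 8-10 evidently describe the income(low) branch, since rules 3-7 already cover income(medium), so the function encodes them that way). On the 14 training records every prediction agrees with the class label:

```python
def classify_tree2(age, income, student, cr):
    """Tree 2 rules 1-10 (rules 8-10 read as the income-low branch)."""
    if income == "High":
        return "No" if age == "<=30" else "Yes"          # rules 1-2
    if income == "Medium":
        if student == "Yes":
            return "Yes"                                 # rule 7
        if age == "<=30":
            return "No"                                  # rule 3
        if age == "31..40":
            return "Yes"                                 # rule 4
        return "Yes" if cr == "Fair" else "No"           # rules 5-6
    # income == "Low"
    if cr == "Fair":
        return "Yes"                                     # rule 8
    return "No" if age == ">40" else "Yes"               # rules 9-10

# (age, income, student, CR, class) for records r1..r14
train = [
    ("<=30", "High", "No", "Fair", "No"), ("<=30", "High", "No", "Excellent", "No"),
    ("31..40", "High", "No", "Fair", "Yes"), (">40", "Medium", "No", "Fair", "Yes"),
    (">40", "Low", "Yes", "Fair", "Yes"), (">40", "Low", "Yes", "Excellent", "No"),
    ("31..40", "Low", "Yes", "Excellent", "Yes"), ("<=30", "Medium", "No", "Fair", "No"),
    ("<=30", "Low", "Yes", "Fair", "Yes"), (">40", "Medium", "Yes", "Fair", "Yes"),
    ("<=30", "Medium", "Yes", "Excellent", "Yes"), ("31..40", "Medium", "No", "Excellent", "Yes"),
    ("31..40", "High", "Yes", "Fair", "Yes"), (">40", "Medium", "No", "Excellent", "No"),
]
train_hits = sum(classify_tree2(a, i, s, c) == y for a, i, s, c, y in train)
```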

Formulas for information gain

For a node with P positive and N negative examples, the expected information is

I(P,N) = -(P/(P+N)) log2(P/(P+N)) - (N/(P+N)) log2(N/(P+N))

For an attribute A that partitions the node into subsets with Pi positive and Ni negative examples each,

E(A) = SUM over i of ((Pi+Ni)/(P+N)) * I(Pi,Ni)
Gain(A) = I(P,N) - E(A)

Calculations of information gain for Tree 1, Root: Student


I(P,N) = -(9/14) log2(9/14) - (5/14) log2(5/14) = -(.643)(-0.64) - (.357)(-1.49) = .944
I(P1,N1) = -(6/7) log2(6/7) - (1/7) log2(1/7) = -(.857)(-.22) - (.143)(-2.81) = .591
I(P2,N2) = -(3/7) log2(3/7) - (4/7) log2(4/7) = -(.429)(-1.24) - (.571)(-0.81) = .987

Student   Yes   No
P         6     3
N         1     4
I(Pi,Ni)  .591  .987

E(Student) = (7/14)(.591) + (7/14)(.987) = .296 + .493 = .789
Gain(Student) = .944 - .789 = .155
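These calculations can be checked with a few lines of Python (a sketch; note that at full floating-point precision the values come out slightly lower than the slide figures, e.g. I(9,5) ≈ .940 and Gain(Student) ≈ .152, because the slides round the logarithms to two decimals):

```python
from math import log2

def info(p, n):
    """Expected information I(p, n) for p positive and n negative examples."""
    total = p + n
    # 0 * log2(0) is taken as 0, as in the slides' "(0)(infinity) = 0"
    terms = [c / total * log2(c / total) for c in (p, n) if c > 0]
    return -sum(terms)

def gain(p, n, subsets):
    """Gain(A) = I(p, n) - E(A), where subsets is a list of (p_i, n_i)."""
    e = sum((pi + ni) / (p + n) * info(pi, ni) for pi, ni in subsets)
    return info(p, n) - e

# Root split of Tree 1 on Student: Yes -> (6 yes, 1 no), No -> (3 yes, 4 no)
gain_student = gain(9, 5, [(6, 1), (3, 4)])
```

The same two functions reproduce every node calculation on the following slides by substituting that node's (P, N) and its per-value (Pi, Ni) counts.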

Calculations of information gain for Tree 1, Income(Left) node


I(P,N) = -(3/7) log2(3/7) - (4/7) log2(4/7) = .987
I(P1,N1) = -(1/3) log2(1/3) - (2/3) log2(2/3) = .916
I(P2,N2) = -(2/4) log2(2/4) - (2/4) log2(2/4) = 1

Income    High  Medium
P         1     2
N         2     2
I(Pi,Ni)  .916  1

E(Income(L)) = (3/7)(.916) + (4/7)(1) = .393 + .570 = .963
Gain(Income(L)) = .987 - .963 = .024

Calculations of information gain for Tree 1, Income(Right) node


I(P,N) = -(6/7) log2(6/7) - (1/7) log2(1/7) = .591
I(P1,N1) = -(3/4) log2(3/4) - (1/4) log2(1/4) = .815
I(P2,N2) = -(2/2) log2(2/2) - (0/2) log2(0/2) = 0   (taking 0 * log2(0) = 0)
I(P3,N3) = -(1/1) log2(1/1) - (0/1) log2(0/1) = 0

Income    Low   Medium  High
P         3     2       1
N         1     0       0
I(Pi,Ni)  .815  0       0

E(Income(R)) = (4/7)(.815) + (2/7)(0) + (1/7)(0) = .465
Gain(Income(R)) = .987 - .465 = .522

Calculations of information gain for Tree 1, age(1) node


I(P,N) = -(1/3) log2(1/3) - (2/3) log2(2/3) = .916

age       <=30  31..40
P         0     1
N         2     0
I(Pi,Ni)  0     0

E(Age(1)) = (2/3)(0) + (1/3)(0) = 0
Gain(Age(1)) = .916 - 0 = .916

Calculations of information gain for Tree 1, CR(Left) node


I(P,N) = -(2/4) log2(2/4) - (2/4) log2(2/4) = 1

CR        Fair  Excellent
P         1     1
N         1     1
I(Pi,Ni)  1     1

E(CR(L)) = (2/4)(1) + (2/4)(1) = 1
Gain(CR(L)) = 1 - 1 = 0

Calculations of information gain for Tree 1, CR(right) node


I(P,N) = -(3/4) log2(3/4) - (1/4) log2(1/4) = .815

CR        Fair  Excellent
P         2     1
N         0     1
I(Pi,Ni)  0     1

E(CR(R)) = (2/4)(0) + (2/4)(1) = .5
Gain(CR(R)) = .815 - .5 = .315

Calculations of information gain for Tree 1, age(2) node


I(P,N) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1

age       >40  <=30
P         1    0
N         0    1
I(Pi,Ni)  0    0

E(Age(2)) = (1/2)(0) + (1/2)(0) = 0
Gain(Age(2)) = 1 - 0 = 1

Calculations of information gain for Tree 1, age(3) node


I(P,N) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1

age       >40  31..40
P         0    1
N         1    0
I(Pi,Ni)  0    0

E(Age(3)) = (1/2)(0) + (1/2)(0) = 0
Gain(Age(3)) = 1 - 0 = 1

Calculations of information gain for Tree 1, age(4) node


I(P,N) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1

age       31..40  >40
P         1       0
N         0       1
I(Pi,Ni)  0       0

E(Age(4)) = (1/2)(0) + (1/2)(0) = 0
Gain(Age(4)) = 1 - 0 = 1

Information gain measure


Gain(student) = .155
Gain(income(L)) = .024
Gain(income(R)) = .522
Gain(age(1)) = .916
Gain(CR(L)) = 0
Gain(CR(R)) = .315
Gain(age(2)) = 1
Gain(age(3)) = 1
Gain(age(4)) = 1

Information Gain for each node


student  [Gain = .155]
  No → income  [Gain = .024]
    High → age  [Gain = .916]: <=30 → No; 31..40 → Yes
    Medium → CR  [Gain = 0]
      Fair → age  [Gain = 1]: >40 → Yes; <=30 → No
      Excellent → age  [Gain = 1]: >40 → No; 31..40 → Yes
  Yes → income  [Gain = .522]
    Low → CR  [Gain = .315]
      Fair → Yes
      Excellent → age  [Gain = 1]: 31..40 → Yes; >40 → No
    Medium → Yes
    High → Yes

Tree 3: root student plus majority voting

student = No (7 records):

  Age     <=30  <=30       31..40  >40     <=30    31..40     >40
  Income  High  High       High    Medium  Medium  Medium     Medium
  CR      Fair  Excellent  Fair    Fair    Fair    Excellent  Excellent
  Class   No    No         Yes     Yes     No      Yes        No

student = Yes (7 records):

  Age     >40   >40        31..40     <=30  >40     <=30       31..40
  Income  Low   Low        Low        Low   Medium  Medium     High
  CR      Fair  Excellent  Excellent  Fair  Fair    Excellent  Fair
  Class   Yes   No         Yes        Yes   Yes     Yes        Yes

Tree 3: Student root plus majority voting

student = No → No (majority: 4 of the 7 records are No)
student = Yes → Yes (majority: 6 of the 7 records are Yes)

Tree 3 (majority voting) rules and their accuracy


RULES:
Student(no) => buys_computer(no)
Student(yes) => buys_computer(yes)

Since 10 out of 14 records match the rules:
Rules accuracy: 10/14 = 0.714 = 71.4%
Error rate: 4/14 = 0.286 = 28.6%
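The majority-vote counts behind these two rules can be reproduced in a short Python sketch (the pairing of Student value with class label follows the training table; the encoding is illustrative):

```python
from collections import Counter

# (Student, Buys_computer) for records r1..r14
pairs = [
    ("No", "No"), ("No", "No"), ("No", "Yes"), ("No", "Yes"),
    ("Yes", "Yes"), ("Yes", "No"), ("Yes", "Yes"), ("No", "No"),
    ("Yes", "Yes"), ("Yes", "Yes"), ("Yes", "Yes"), ("No", "Yes"),
    ("Yes", "Yes"), ("No", "No"),
]

# Majority class within each student branch
majority = {
    value: Counter(c for s, c in pairs if s == value).most_common(1)[0][0]
    for value in ("No", "Yes")
}

# Records correctly classified by the two majority-vote rules
correct = sum(1 for s, c in pairs if majority[s] == c)
accuracy = correct / len(pairs)
```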

student = No branch, split on income:

  income = High:
    Age    <=30  <=30       31..40
    CR     Fair  Excellent  Fair
    Class  No    No         Yes

  income = Medium:
    Age    >40   <=30  31..40     >40
    CR     Fair  Fair  Excellent  Excellent
    Class  Yes   No    Yes        No

student = Yes branch, split on income:

  income = Low:
    Age    >40   >40        31..40     <=30
    CR     Fair  Excellent  Excellent  Fair
    Class  Yes   No         Yes        Yes

  income = Medium:
    Age    >40   <=30
    CR     Fair  Excellent
    Class  Yes   Yes

  income = High:
    Age    31..40
    CR     Fair
    Class  Yes

Tree 4: root student with majority voting on branch income-high

student = No:
  income = High → No (majority vote: 2 of the 3 records are No)
  income = Medium:
    Age    >40   <=30  31..40     >40
    CR     Fair  Fair  Excellent  Excellent
    Class  Yes   No    Yes        No
student = Yes:
  income = Low → Yes (majority vote: 3 of the 4 records are Yes)
  income = Medium → Yes
  income = High → Yes

Tree 4: root student with majority voting

student = No:
  income = High → No
  income = Medium, split on CR:
    CR = Fair:      Age >40, <=30; Class Yes, No
    CR = Excellent: Age >40, 31..40; Class No, Yes
student = Yes:
  income = Low → Yes
  income = Medium → Yes
  income = High → Yes

Tree 4: root student with majority voting

student = No:
  income = High → No
  income = Medium, split on CR:
    CR = Fair → age: >40 → Yes; <=30 → No
    CR = Excellent → age: >40 → No; 31..40 → Yes
student = Yes:
  income = Low → Yes
  income = Medium → Yes
  income = High → Yes


Tree 4: Classification rules and their accuracy

1. Student(no)^income(high) => buys_computer(no)
2. Student(no)^income(medium)^CR(fair)^age(>40) => buys_computer(yes)
3. Student(no)^income(medium)^CR(fair)^age(<=30) => buys_computer(no)
4. Student(no)^income(medium)^CR(excellent)^age(>40) => buys_computer(no)
5. Student(no)^income(medium)^CR(excellent)^age(31..40) => buys_computer(yes)
6. Student(yes)^income(low) => buys_computer(yes)
7. Student(yes)^income(medium) => buys_computer(yes)
8. Student(yes)^income(high) => buys_computer(yes)

Since 11 out of 14 records match the rules, the accuracy of the rules = 11/14 = 0.786 = 78.6%

Training data (r1-r14) plus test data (r15-r20, shown in red on the original slides)

REC  Age     Income  Student  Credit_rating  Buys_computer
r1   <=30    High    No       Fair           No
r2   <=30    High    No       Excellent      No
r3   31..40  High    No       Fair           Yes
r4   >40     Medium  No       Fair           Yes
r5   >40     Low     Yes      Fair           Yes
r6   >40     Low     Yes      Excellent      No
r7   31..40  Low     Yes      Excellent      Yes
r8   <=30    Medium  No       Fair           No
r9   <=30    Low     Yes      Fair           Yes
r10  >40     Medium  Yes      Fair           Yes
r11  <=30    Medium  Yes      Excellent      Yes
r12  31..40  Medium  No       Excellent      Yes
r13  31..40  High    Yes      Fair           Yes
r14  >40     Medium  No       Excellent      No
r15  <=30    Medium  No       Excellent      No   (test)
r16  <=30    Low     No       Fair           No   (test)
r17  <=30    Low     No       Excellent      No   (test)
r18  31..40  Low     Yes      Fair           Yes  (test)
r19  >40     Medium  Yes      Excellent      Yes  (test)
r20  31..40  High    No       Excellent      Yes  (test)

Tree 1 classification rules

1. student(no)^income(high)^age(<=30) => buys_computer(no)
2. student(no)^income(high)^age(31..40) => buys_computer(yes)
3. student(no)^income(medium)^CR(fair)^age(>40) => buys_computer(yes)
4. student(no)^income(medium)^CR(fair)^age(<=30) => buys_computer(no)
5. student(no)^income(medium)^CR(excellent)^age(>40) => buys_computer(no)
6. student(no)^income(medium)^CR(excellent)^age(31..40) => buys_computer(yes)
7. student(yes)^income(low)^CR(fair) => buys_computer(yes)
8. student(yes)^income(low)^CR(excellent)^age(31..40) => buys_computer(yes)
9. student(yes)^income(low)^CR(excellent)^age(>40) => buys_computer(no)
10. student(yes)^income(medium) => buys_computer(yes)
11. student(yes)^income(high) => buys_computer(yes)

Book classification rules


Age(<=30)^student(no) => buys_computer(no)
Age(<=30)^student(yes) => buys_computer(yes)
Age(31..40) => buys_computer(yes)
Age(>40)^credit_rating(excellent) => buys_computer(no)
Age(>40)^credit_rating(fair) => buys_computer(yes)

Book Rules accuracy

For the book rules, 14 of 14 records from the training data match, so the accuracy of the rules is 14/14 = 100%.
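The book rules can be checked mechanically (a Python sketch; the tuple encoding of the records is illustrative). On the 14 training records every prediction agrees with the class label, and on the six test records only r19 (>40, excellent, actual Yes) is misclassified:

```python
def book_rule(age, student, cr):
    """The five book classification rules, as one decision function."""
    if age == "<=30":
        return "Yes" if student == "Yes" else "No"
    if age == "31..40":
        return "Yes"
    # age == ">40": decided by credit rating
    return "No" if cr == "Excellent" else "Yes"

# (age, student, credit_rating, buys_computer) for r1..r14
train = [
    ("<=30", "No", "Fair", "No"), ("<=30", "No", "Excellent", "No"),
    ("31..40", "No", "Fair", "Yes"), (">40", "No", "Fair", "Yes"),
    (">40", "Yes", "Fair", "Yes"), (">40", "Yes", "Excellent", "No"),
    ("31..40", "Yes", "Excellent", "Yes"), ("<=30", "No", "Fair", "No"),
    ("<=30", "Yes", "Fair", "Yes"), (">40", "Yes", "Fair", "Yes"),
    ("<=30", "Yes", "Excellent", "Yes"), ("31..40", "No", "Excellent", "Yes"),
    ("31..40", "Yes", "Fair", "Yes"), (">40", "No", "Excellent", "No"),
]
# The six red test records r15..r20
test = [
    ("<=30", "No", "Excellent", "No"), ("<=30", "No", "Fair", "No"),
    ("<=30", "No", "Excellent", "No"), ("31..40", "Yes", "Fair", "Yes"),
    (">40", "Yes", "Excellent", "Yes"), ("31..40", "No", "Excellent", "Yes"),
]

train_hits = sum(book_rule(a, s, c) == y for a, s, c, y in train)
test_hits = sum(book_rule(a, s, c) == y for a, s, c, y in test)
```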

Tree 2 classification rules

1. income(high)^age(<=30) => buys_computer(no)
2. income(high)^age(31..40) => buys_computer(yes)
3. income(medium)^student(no)^age(<=30) => buys_computer(no)
4. income(medium)^student(no)^age(31..40) => buys_computer(yes)
5. income(medium)^student(no)^age(>40)^CR(fair) => buys_computer(yes)
6. income(medium)^student(no)^age(>40)^CR(excellent) => buys_computer(no)
7. income(medium)^student(yes) => buys_computer(yes)
8. income(low)^CR(fair) => buys_computer(yes)
9. income(low)^CR(excellent)^age(>40) => buys_computer(no)
10. income(low)^CR(excellent)^age(31..40) => buys_computer(yes)

Predictive accuracy for (red) test data


For the book rules: 5 of 6 records from the red test data match, so the predictive accuracy of the book rules is 5/6 = 83.33%.
For the Tree 2 rules: 2 out of 6 records from the test data match, so the predictive accuracy of the Tree 2 rules is 2/6 = 33.33%.

Tree 3 (majority voting) rules and their accuracy


RULES:
Student(no) => buys_computer(no)
Student(yes) => buys_computer(yes)

Rules accuracy: 10/14 = 0.714 = 71.4%
Predictive accuracy (with respect to the red test data): 5/6 = 83.33%
Observe that the 100% accurate book rules also had predictive accuracy 83.33%.
