Data Mining: Practical Machine Learning Tools and Techniques
Slides for Chapter 4 of Data Mining by I. H. Witten, E. Frank and M. A. Hall

Algorithms: The basic methods

  Inferring rudimentary rules
  Statistical modeling
  Constructing decision trees
  Constructing rules
  Association rule learning
  Linear models
  Instance-based learning
  Clustering
Simplicity first

  Simple algorithms often work very well!
  There are many kinds of simple structure, e.g.:
    One attribute does all the work
    All attributes contribute equally and independently
    A weighted linear combination might do
    Instance-based: use a few prototypes
    Use simple logical rules
  Success of method depends on the domain
Inferring rudimentary rules

  1R: learns a 1-level decision tree
    I.e., rules that all test one particular attribute
  Basic version:
    One branch for each value
    Each branch assigns most frequent class
    Error rate: proportion of instances that don't belong to the majority class of their corresponding branch
    Choose attribute with lowest error rate
    (assumes nominal attributes)
Pseudocode for 1R
For each attribute,
For each value of the attribute, make a rule as follows:
count how often each class appears
find the most frequent class
make the rule assign that class to this attribute-value
Calculate the error rate of the rules
Choose the rules with the smallest error rate
  Note: "missing" is treated as a separate attribute value
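The pseudocode above translates almost directly into Python. Below is a minimal sketch (our illustration, not the book's implementation): it assumes nominal attributes, instances given as lists of values with the class at a fixed index, and it treats "missing" simply as one more value, as the note says.

    from collections import Counter

    def one_r(instances, attribute_indices, class_index):
        # For each attribute, build one rule per value (predict the most
        # frequent class) and keep the attribute with the lowest error rate.
        best = None
        for a in attribute_indices:
            counts = {}  # attribute value -> Counter of class frequencies
            for row in instances:
                counts.setdefault(row[a], Counter())[row[class_index]] += 1
            rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
            # errors: instances not in the majority class of their branch
            errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
            if best is None or errors < best[0]:
                best = (errors, a, rules)
        return best  # (total errors, attribute index, value -> class rules)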
Evaluating the weather attributes

  Outlook   Temp  Humidity  Windy  Play
  Sunny     Hot   High      False  No
  Sunny     Hot   High      True   No
  Overcast  Hot   High      False  Yes
  Rainy     Mild  High      False  Yes
  Rainy     Cool  Normal    False  Yes
  Rainy     Cool  Normal    True   No
  Overcast  Cool  Normal    True   Yes
  Sunny     Mild  High      False  No
  Sunny     Cool  Normal    False  Yes
  Rainy     Mild  Normal    False  Yes
  Sunny     Mild  Normal    True   Yes
  Overcast  Mild  High      True   Yes
  Overcast  Hot   Normal    False  Yes
  Rainy     Mild  High      True   No

  Attribute  Rules            Errors  Total errors
  Outlook    Sunny → No       2/5     4/14
             Overcast → Yes   0/4
             Rainy → Yes      2/5
  Temp       Hot → No*        2/4     5/14
             Mild → Yes       2/6
             Cool → Yes       1/4
  Humidity   High → No        3/7     4/14
             Normal → Yes     1/7
  Windy      False → Yes      2/8     5/14
             True → No*       3/6

  * indicates a tie
Dealing with numeric attributes

  Discretize numeric attributes
  Divide each attribute's range into intervals:
    Sort instances according to attribute's values
    Place breakpoints where class changes (majority class)
    This minimizes the total error
  Example: temperature from weather data

    64   65   68   69   70   71   72   72   75   75   80   81   83   85
    Yes | No | Yes  Yes  Yes | No   No   Yes | Yes  Yes | No | Yes  Yes | No

  Outlook   Temperature  Humidity  Windy  Play
  Sunny     85           85        False  No
  Sunny     80           90        True   No
  Overcast  83           86        False  Yes
  Rainy     75           80        False  Yes
  ...       ...          ...       ...    ...
The problem of overfitting

  This procedure is very sensitive to noise
    One instance with an incorrect class label will probably produce a separate interval
  Also: time stamp attribute will have zero errors
  Simple solution: enforce minimum number of instances in majority class per interval
  Example (with min = 3):

    64   65   68   69   70   71   72   72   75   75   80   81   83   85
    Yes | No | Yes  Yes  Yes | No   No   Yes | Yes  Yes | No | Yes  Yes | No

  becomes

    64   65   68   69   70   71   72   72   75   75   80   81   83   85
    Yes  No   Yes  Yes  Yes | No   No   Yes  Yes  Yes | No   Yes  Yes  No
With overfitting avoidance

  Resulting rule set:

  Attribute    Rules                   Errors  Total errors
  Outlook      Sunny → No              2/5     4/14
               Overcast → Yes          0/4
               Rainy → Yes             2/5
  Temperature  ≤ 77.5 → Yes            3/10    5/14
               > 77.5 → No*            2/4
  Humidity     ≤ 82.5 → Yes            1/7     3/14
               > 82.5 and ≤ 95.5 → No  2/6
               > 95.5 → Yes            0/1
  Windy        False → Yes             2/8     5/14
               True → No*              3/6
Discussion of 1R

  1R was described in a paper by Holte (1993)
    Contains an experimental evaluation on 16 datasets (using cross-validation so that results were representative of performance on future data)
    Minimum number of instances was set to 6 after some experimentation
    1R's simple rules performed not much worse than much more complex decision trees
  Simplicity first pays off!

  Very Simple Classification Rules Perform Well on Most Commonly Used Datasets
  Robert C. Holte, Computer Science Department, University of Ottawa
Discussion of 1R: Hyperpipes

  Another simple technique: build one rule for each class
    Each rule is a conjunction of tests, one for each attribute
    For numeric attributes: test checks whether instance's value is inside an interval
      Interval given by minimum and maximum observed in training data
    For nominal attributes: test checks whether value is one of a subset of attribute values
      Subset given by all possible values observed in training data
    Class with most matching tests is predicted
Statistical modeling

  "Opposite" of 1R: use all the attributes
  Two assumptions: attributes are
    equally important
    statistically independent (given the class value)
      I.e., knowing the value of one attribute says nothing about the value of another (if the class is known)
  Independence assumption is never correct!
  But this scheme works well in practice
Probabilities for weather data

  Outlook   Yes  No     Temperature  Yes  No
  Sunny     2/9  3/5    Hot          2/9  2/5
  Overcast  4/9  0/5    Mild         4/9  2/5
  Rainy     3/9  2/5    Cool         3/9  1/5

  Humidity  Yes  No     Windy  Yes  No     Play  Yes   No
  High      3/9  4/5    False  6/9  2/5          9/14  5/14
  Normal    6/9  1/5    True   3/9  3/5

  These counts are derived from the weather data:

  Outlook   Temp  Humidity  Windy  Play
  Sunny     Hot   High      False  No
  Sunny     Hot   High      True   No
  Overcast  Hot   High      False  Yes
  Rainy     Mild  High      False  Yes
  Rainy     Cool  Normal    False  Yes
  Rainy     Cool  Normal    True   No
  Overcast  Cool  Normal    True   Yes
  Sunny     Mild  High      False  No
  Sunny     Cool  Normal    False  Yes
  Rainy     Mild  Normal    False  Yes
  Sunny     Mild  Normal    True   Yes
  Overcast  Mild  High      True   Yes
  Overcast  Hot   Normal    False  Yes
  Rainy     Mild  High      True   No
Probabilities for weather data

  (count and probability tables as on the previous slide)

  A new day:

    Outlook  Temp.  Humidity  Windy  Play
    Sunny    Cool   High      True   ?

  Likelihood of the two classes (read off the tables):
    For "yes" = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
    For "no"  = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206
  Conversion into probabilities by normalization:
    P("yes") = 0.0053 / (0.0053 + 0.0206) = 0.205
    P("no")  = 0.0206 / (0.0053 + 0.0206) = 0.795
Bayes's rule

  Probability of event H given evidence E:

    Pr[H | E] = Pr[E | H] × Pr[H] / Pr[E]

  A priori probability of H: Pr[H]
    Probability of event before evidence is seen
  A posteriori probability of H: Pr[H | E]
    Probability of event after evidence is seen

  Thomas Bayes
  Born: 1702 in London, England
  Died: 1761 in Tunbridge Wells, Kent, England
Naive Bayes for classification

  Classification learning: what's the probability of the class given an instance?
    Evidence E = instance
    Event H = class value for instance
  Naive assumption: evidence splits into parts (i.e. attributes) that are independent
Weather data example

  Evidence E:

    Outlook  Temp.  Humidity  Windy  Play
    Sunny    Cool   High      True   ?

  Probability of class "yes":

    Pr[yes | E] = Pr[Outlook = Sunny | yes]
                  × Pr[Temperature = Cool | yes]
                  × Pr[Humidity = High | yes]
                  × Pr[Windy = True | yes]
                  × Pr[yes] / Pr[E]
                = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / Pr[E]
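These numbers are easy to check; a tiny Python sketch that reproduces the likelihoods and normalized probabilities from the previous slides (dividing by the sum of the two scores stands in for the 1/Pr[E] normalization):

    # Unnormalized scores for the new day (Sunny, Cool, High, True)
    score_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # ≈ 0.0053
    score_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # ≈ 0.0206
    total = score_yes + score_no                          # plays the role of Pr[E]
    print(score_yes / total, score_no / total)            # ≈ 0.205, ≈ 0.795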
The "zero-frequency problem"

  What if an attribute value doesn't occur with every class value?
    (e.g. "Humidity = High" for class "yes")
    Probability will be zero! Pr[Humidity = High | yes] = 0
    A posteriori probability will also be zero: Pr[yes | E] = 0
    (No matter how likely the other values are!)
  Remedy: add 1 to the count for every attribute value-class combination (Laplace estimator)
  Result: probabilities will never be zero! (also: stabilizes probability estimates)
Modified probability estimates

  In some cases adding a constant different from 1 might be more appropriate
  Example: attribute outlook for class yes

    Sunny: (2 + μ/3) / (9 + μ)    Overcast: (4 + μ/3) / (9 + μ)    Rainy: (3 + μ/3) / (9 + μ)

  Weights don't need to be equal (but they must sum to 1):

    Sunny: (2 + μp1) / (9 + μ)    Overcast: (4 + μp2) / (9 + μ)    Rainy: (3 + μp3) / (9 + μ)
Missing values

  Training: instance is not included in frequency count for attribute value-class combination
  Classification: attribute will be omitted from calculation
  Example:

    Outlook  Temp.  Humidity  Windy  Play
    ?        Cool   High      True   ?

    Likelihood of "yes" = 3/9 × 3/9 × 3/9 × 9/14 = 0.0238
    Likelihood of "no"  = 1/5 × 4/5 × 3/5 × 5/14 = 0.0343
    P("yes") = 0.0238 / (0.0238 + 0.0343) = 41%
    P("no")  = 0.0343 / (0.0238 + 0.0343) = 59%
Numeric attributes

  Usual assumption: attributes have a normal or Gaussian probability distribution (given the class)
  The probability density function for the normal distribution is defined by two parameters:

    Sample mean:        μ = (1/n) Σ_{i=1..n} xi
    Standard deviation: σ = √( (1/(n−1)) Σ_{i=1..n} (xi − μ)² )

  Then the density function f(x) is

    f(x) = (1 / (√(2π) σ)) × e^(−(x − μ)² / (2σ²))
Statistics for weather data

  Outlook   Yes  No     Windy  Yes  No     Play  Yes   No
  Sunny     2/9  3/5    False  6/9  2/5          9/14  5/14
  Overcast  4/9  0/5    True   3/9  3/5
  Rainy     3/9  2/5

  Temperature   Yes            No            Humidity     Yes            No
                64, 68, 69,    65, 71, 72,                65, 70, 70,    70, 85, 90,
                70, 72, ...    80, 85, ...                75, 80, ...    91, 95, ...
  mean μ        73             75            mean μ       79             86
  std. dev. σ   6.2            7.9           std. dev. σ  10.2           9.7

  Example density value:

    f(temperature = 66 | yes) = (1 / (√(2π) × 6.2)) × e^(−(66 − 73)² / (2 × 6.2²)) = 0.0340
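A minimal Python sketch of the density computation above; gaussian() is just the normal density, and the call reproduces the example value:

    import math

    def gaussian(x, mu, sigma):
        # Normal probability density with mean mu and standard deviation sigma
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

    print(gaussian(66, 73, 6.2))   # ≈ 0.0340 = f(temperature = 66 | yes)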
Classifying a new day

  A new day:

    Outlook  Temp.  Humidity  Windy  Play
    Sunny    66     90        true   ?

  Missing values during training are not included in calculation of mean and standard deviation
Probability densities

  Relationship between probability and density:

    Pr[c − ε/2 ≤ x ≤ c + ε/2] ≈ ε × f(c)

  But: this doesn't change calculation of a posteriori probabilities because ε cancels out
  Exact relationship:

    Pr[a ≤ x ≤ b] = ∫_a^b f(t) dt
Multinomial naive Bayes I

  Version of naive Bayes used for document classification using the bag of words model
    n1, n2, ..., nk: number of times word i occurs in the document
    P1, P2, ..., Pk: probability of obtaining word i when sampling from documents in class H
  Probability of observing document E given class H (based on the multinomial distribution, with N the total number of words in E):

    Pr[E | H] ≈ N! × Π_{i=1..k} (Pi^ni / ni!)

  Ignores probability of generating a document of the right length (prob. assumed constant for each class)
Multinomial naive Bayes II

  Suppose dictionary has two words, yellow and blue
  Suppose Pr[yellow | H] = 75% and Pr[blue | H] = 25%
  Suppose E is the document "blue yellow blue"
  Probability of observing document:

    Pr[{blue yellow blue} | H] ≈ 3! × (0.75¹/1!) × (0.25²/2!) = 9/64 ≈ 0.14

  Suppose there is another class H' that has Pr[yellow | H'] = 10% and Pr[blue | H'] = 90%:

    Pr[{blue yellow blue} | H'] ≈ 3! × (0.1¹/1!) × (0.9²/2!) = 0.24

  Need to take prior probability of class into account to make final classification
  Factorials don't actually need to be computed
  Underflows can be prevented by using logarithms
Naive Bayes: discussion

  Naive Bayes works surprisingly well (even if independence assumption is clearly violated)
  Why? Because classification doesn't require accurate probability estimates as long as maximum probability is assigned to correct class
  However: adding too many redundant attributes will cause problems (e.g. identical attributes)
  Note also: many numeric attributes are not normally distributed (→ kernel density estimators)
Constructing decision trees

  Strategy: top down
  Recursive divide-and-conquer fashion:
    First: select attribute for root node; create branch for each possible attribute value
    Then: split instances into subsets, one for each branch extending from the node
    Finally: repeat recursively for each branch, using only instances that reach the branch
  Stop if all instances have the same class
Which attribute to select?

  (figures omitted: candidate splits on each of the four weather attributes)
Criterion for attribute selection

  Which is the best attribute?
    Want to get the smallest tree
    Heuristic: choose the attribute that produces the "purest" nodes
  Popular impurity criterion: information gain
    Information gain increases with the average purity of the subsets
  Strategy: choose attribute that gives greatest information gain
Computing information

  Measure information in bits
    Given a probability distribution, the info required to predict an event is the distribution's entropy
    Entropy gives the information required in bits (can involve fractions of bits!)
  Formula for computing the entropy:

    entropy(p1, p2, ..., pn) = −p1 log p1 − p2 log p2 − ... − pn log pn
Example: attribute Outlook

  Outlook = Sunny:
    info([2,3]) = entropy(2/5, 3/5) = −2/5 log(2/5) − 3/5 log(3/5) = 0.971 bits
  Outlook = Overcast:
    info([4,0]) = entropy(1, 0) = −1 log(1) − 0 log(0) = 0 bits
    (Note: 0 log(0) is normally undefined; it is taken to be 0 here.)
  Outlook = Rainy:
    info([3,2]) = entropy(3/5, 2/5) = −3/5 log(3/5) − 2/5 log(2/5) = 0.971 bits
  Expected information for attribute:
    info([3,2], [4,0], [3,2]) = 5/14 × 0.971 + 4/14 × 0 + 5/14 × 0.971 = 0.693 bits
Computing information gain

  Information gain: information before splitting − information after splitting

    gain(Outlook) = info([9,5]) − info([2,3], [4,0], [3,2])
                  = 0.940 − 0.693
                  = 0.247 bits

  Information gain for attributes from weather data:

    gain(Outlook)     = 0.247 bits
    gain(Temperature) = 0.029 bits
    gain(Humidity)    = 0.152 bits
    gain(Windy)       = 0.048 bits
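The entropy and gain computations are easy to reproduce; here is a small Python sketch (function names are ours, not the book's) that recovers gain(Outlook) = 0.247 bits:

    import math

    def entropy(counts):
        # Entropy of a class distribution given as a list of counts, in bits
        n = sum(counts)
        return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

    def info_gain(before, subsets):
        # before: class counts before the split, e.g. [9, 5] for the weather data
        # subsets: class counts in each branch, e.g. [[2,3], [4,0], [3,2]] for Outlook
        n = sum(before)
        return entropy(before) - sum(sum(s) / n * entropy(s) for s in subsets)

    print(info_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))   # ≈ 0.247 bits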
Continuing to split

  Gains for the remaining attributes in the Outlook = Sunny branch:

    gain(Temperature) = 0.571 bits
    gain(Humidity)    = 0.971 bits
    gain(Windy)       = 0.020 bits
Final decision tree

  Note: not all leaves need to be pure; sometimes identical instances have different classes
    → Splitting stops when data can't be split any further
Wishlist for a purity measure

  Properties we require from a purity measure:
    When node is pure, measure should be zero
    When impurity is maximal (i.e. all classes equally likely), measure should be maximal
    Measure should obey multistage property (i.e. decisions can be made in several stages):

      measure([2,3,4]) = measure([2,7]) + (7/9) × measure([3,4])

  Entropy is the only function that satisfies all three properties!
Properties of the entropy

  The multistage property:

    entropy(p, q, r) = entropy(p, q + r) + (q + r) × entropy(q/(q+r), r/(q+r))

  Simplification of computation:

    info([2,3,4]) = −2/9 log(2/9) − 3/9 log(3/9) − 4/9 log(4/9)
                  = [−2 log 2 − 3 log 3 − 4 log 4 + 9 log 9] / 9

  Note: instead of maximizing info gain we could just minimize information
Highly-branching attributes

  Problematic: attributes with a large number of values (extreme case: ID code)
  Subsets are more likely to be pure if there is a large number of values
    Information gain is biased towards choosing attributes with a large number of values
    This may result in overfitting (selection of an attribute that is non-optimal for prediction)
  Another problem: fragmentation
Weather data with ID code

  ID code  Outlook   Temp.  Humidity  Windy  Play
  A        Sunny     Hot    High      False  No
  B        Sunny     Hot    High      True   No
  C        Overcast  Hot    High      False  Yes
  D        Rainy     Mild   High      False  Yes
  E        Rainy     Cool   Normal    False  Yes
  F        Rainy     Cool   Normal    True   No
  G        Overcast  Cool   Normal    True   Yes
  H        Sunny     Mild   High      False  No
  I        Sunny     Cool   Normal    False  Yes
  J        Rainy     Mild   Normal    False  Yes
  K        Sunny     Mild   Normal    True   Yes
  L        Overcast  Mild   High      True   Yes
  M        Overcast  Hot    Normal    False  Yes
  N        Rainy     Mild   High      True   No
Tree stump for ID code attribute

  Entropy of split:

    info(ID code) = info([0,1]) + info([0,1]) + ... + info([0,1]) = 0 bits

  Information gain is maximal for ID code (namely 0.940 bits)
Gain ratio

  Gain ratio: a modification of the information gain that reduces its bias
  Gain ratio takes number and size of branches into account when choosing an attribute
    It corrects the information gain by taking the intrinsic information of a split into account
  Intrinsic information: entropy of distribution of instances into branches (i.e. how much info do we need to tell which branch an instance belongs to)
Computing the gain ratio

  Example: intrinsic information for ID code

    info([1,1,...,1]) = 14 × (−1/14 × log(1/14)) = 3.807 bits

  Value of attribute decreases as intrinsic information gets larger
  Definition of gain ratio:

    gain_ratio(attribute) = gain(attribute) / intrinsic_info(attribute)

  Example:

    gain_ratio(ID code) = 0.940 bits / 3.807 bits = 0.246
Gain ratios for weather data

  Attribute    Info   Gain                   Split info  Gain ratio
  Outlook      0.693  0.940 − 0.693 = 0.247  1.577       0.247/1.577 = 0.157
  Temperature  0.911  0.940 − 0.911 = 0.029  1.557       0.029/1.557 = 0.019
  Humidity     0.788  0.940 − 0.788 = 0.152  1.000       0.152/1.000 = 0.152
  Windy        0.892  0.940 − 0.892 = 0.048  0.985       0.048/0.985 = 0.049
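Reusing entropy() and info_gain() from the earlier sketch, the gain ratio is one extra line; the intrinsic information ("split info") is just the entropy of the branch sizes:

    def gain_ratio(before, subsets):
        intrinsic = entropy([sum(s) for s in subsets])   # split info
        return info_gain(before, subsets) / intrinsic

    print(gain_ratio([9, 5], [[2, 3], [4, 0], [3, 2]]))      # Outlook: ≈ 0.157
    print(gain_ratio([9, 5], [[1, 0]] * 9 + [[0, 1]] * 5))   # ID code: ≈ 0.247 (0.246 on the slide)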
More on the gain ratio

  "Outlook" still comes out top
  However: "ID code" has greater gain ratio
    Standard fix: ad hoc test to prevent splitting on that type of attribute
  Problem with gain ratio: it may overcompensate
    May choose an attribute just because its intrinsic information is very low
    Standard fix: only consider attributes with greater than average information gain
Discussion

  Top-down induction of decision trees: ID3, algorithm developed by Ross Quinlan
    Gain ratio just one modification of this basic algorithm
    C4.5: deals with numeric attributes, missing values, noisy data
  Similar approach: CART
  There are many other attribute selection criteria! (But little difference in accuracy of result)
Covering algorithms

  Convert decision tree into a rule set
    Straightforward, but rule set overly complex
    More effective conversions are not trivial
  Instead, can generate rule set directly
    For each class in turn, find rule set that covers all instances in it (excluding instances not in the class)
  Called a covering approach:
    At each stage a rule is identified that "covers" some of the instances
Example: generating a rule

  If true
  then class = a

  If x > 1.2
  then class = a

  If x > 1.2 and y > 2.6
  then class = a

  Possible rule set for class b:

  If x ≤ 1.2 then class = b
  If x > 1.2 and y ≤ 2.6 then class = b

  Could add more rules, get "perfect" rule set
Rules vs. trees

  Corresponding decision tree produces exactly the same predictions
  But: rule sets can be more perspicuous when decision trees suffer from replicated subtrees
  Also: in multiclass situations, covering algorithm concentrates on one class at a time whereas decision tree learner takes all classes into account
Simple covering algorithm

  Generates a rule by adding tests that maximize rule's accuracy
  Similar to situation in decision trees: problem of selecting an attribute to split on
    But: decision tree inducer maximizes overall purity
  Each new test reduces rule's coverage
Selecting a test

  Goal: maximize accuracy
    t      total number of instances covered by rule
    p      positive examples of the class covered by rule
    t − p  number of errors made by rule
  Select test that maximizes the ratio p/t
  We are finished when p/t = 1 or the set of instances can't be split any further
Example: contact lens data

  Rule we seek:

    If ?
    then recommendation = hard

  Possible tests:

    Age = Young                            2/8
    Age = Pre-presbyopic                   1/8
    Age = Presbyopic                       1/8
    Spectacle prescription = Myope         3/12
    Spectacle prescription = Hypermetrope  1/12
    Astigmatism = no                       0/12
    Astigmatism = yes                      4/12
    Tear production rate = Reduced         0/12
    Tear production rate = Normal          4/12
Modified rule and resulting data

  Rule with best test added:

    If astigmatism = yes
    then recommendation = hard

  Instances covered by modified rule:

  Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
  Young           Myope                   Yes          Reduced               None
  Young           Myope                   Yes          Normal                Hard
  Young           Hypermetrope            Yes          Reduced               None
  Young           Hypermetrope            Yes          Normal                Hard
  Pre-presbyopic  Myope                   Yes          Reduced               None
  Pre-presbyopic  Myope                   Yes          Normal                Hard
  Pre-presbyopic  Hypermetrope            Yes          Reduced               None
  Pre-presbyopic  Hypermetrope            Yes          Normal                None
  Presbyopic      Myope                   Yes          Reduced               None
  Presbyopic      Myope                   Yes          Normal                Hard
  Presbyopic      Hypermetrope            Yes          Reduced               None
  Presbyopic      Hypermetrope            Yes          Normal                None
Further refinement

  Current state:

    If astigmatism = yes and ?
    then recommendation = hard

  Possible tests:

    Age = Young                            2/4
    Age = Pre-presbyopic                   1/4
    Age = Presbyopic                       1/4
    Spectacle prescription = Myope         3/6
    Spectacle prescription = Hypermetrope  1/6
    Tear production rate = Reduced         0/6
    Tear production rate = Normal          4/6
Modified rule and resulting data

  Rule with best test added:

    If astigmatism = yes and tear production rate = normal
    then recommendation = hard

  Instances covered by modified rule:

  Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
  Young           Myope                   Yes          Normal                Hard
  Young           Hypermetrope            Yes          Normal                Hard
  Pre-presbyopic  Myope                   Yes          Normal                Hard
  Pre-presbyopic  Hypermetrope            Yes          Normal                None
  Presbyopic      Myope                   Yes          Normal                Hard
  Presbyopic      Hypermetrope            Yes          Normal                None
Further refinement

  Current state:

    If astigmatism = yes and tear production rate = normal and ?
    then recommendation = hard

  Possible tests:

    Age = Young                            2/2
    Age = Pre-presbyopic                   1/2
    Age = Presbyopic                       1/2
    Spectacle prescription = Myope         3/3
    Spectacle prescription = Hypermetrope  1/3

  Tie between the first and the fourth test
    We choose the one with greater coverage
The result

  Final rule:

    If astigmatism = yes
    and tear production rate = normal
    and spectacle prescription = myope
    then recommendation = hard

  Second rule for recommending "hard lenses" (built from instances not covered by first rule):

    If age = young and astigmatism = yes
    and tear production rate = normal
    then recommendation = hard

  These two rules cover all "hard lenses"
  Process is repeated with other two classes
Pseudocode for PRISM
For each class C
Initialize E to the instance set
While E contains instances in class C
Create a rule R with an empty left-hand side that predicts class C
Until R is perfect (or there are no more attributes to use) do
For each attribute A not mentioned in R, and each value v,
Consider adding the condition A = v to the left-hand side of R
Select A and v to maximize the accuracy p/t
(break ties by choosing the condition with the largest p)
Add A = v to R
Remove the instances covered by R from E
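A compact Python rendering of the pseudocode above, as a sketch (not Weka's implementation): instances are assumed to be dicts mapping attribute names to nominal values, plus a 'class' key.

    def prism(instances, attributes, cls):
        E, rules = list(instances), []
        while any(x['class'] == cls for x in E):
            covered, conds = list(E), []
            # grow one rule: add conditions until it is perfect (or no attributes left)
            while any(x['class'] != cls for x in covered):
                best = None
                for a in attributes:
                    if any(c[0] == a for c in conds):
                        continue  # attribute already mentioned in the rule
                    for v in {x[a] for x in covered}:
                        sub = [x for x in covered if x[a] == v]
                        p = sum(x['class'] == cls for x in sub)
                        key = (p / len(sub), p)  # accuracy p/t; ties broken by larger p
                        if best is None or key > best[0]:
                            best = (key, a, v, sub)
                if best is None:
                    break  # there are no more attributes to use
                _, a, v, covered = best
                conds.append((a, v))
            rules.append(conds)
            # remove the instances covered by the rule from E
            E = [x for x in E if not all(x[a] == v for a, v in conds)]
        return rules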
Rules vs. decision lists

  PRISM with outer loop removed generates a decision list for one class
    Subsequent rules are designed for instances that are not covered by previous rules
    But: order doesn't matter because all rules predict the same class
  Outer loop considers all classes separately
    No order dependence implied
  Problems: overlapping rules, default rule required
Separate and conquer

  Methods like PRISM (for dealing with one class) are separate-and-conquer algorithms:
    First, identify a useful rule
    Then, separate out all the instances it covers
    Finally, "conquer" the remaining instances
  Difference to divide-and-conquer methods:
    Subset covered by rule doesn't need to be explored any further
Mining association rules

  Naive method for finding association rules:
    Use separate-and-conquer method
    Treat every possible combination of attribute values as a separate class
  Two problems:
    Computational complexity
    Resulting number of rules (which would have to be pruned on the basis of support and confidence)
  But: we can look for high-support rules directly!
Item sets

  Support: number of instances correctly covered by association rule
    The same as the number of instances covered by all tests in the rule (LHS and RHS!)
  Item: one test/attribute-value pair
  Item set: all items occurring in a rule
  Goal: only rules that exceed pre-defined support
    Do it by finding all item sets with the given minimum support and generating rules from them!
Weather data

  Outlook   Temp  Humidity  Windy  Play
  Sunny     Hot   High      False  No
  Sunny     Hot   High      True   No
  Overcast  Hot   High      False  Yes
  Rainy     Mild  High      False  Yes
  Rainy     Cool  Normal    False  Yes
  Rainy     Cool  Normal    True   No
  Overcast  Cool  Normal    True   Yes
  Sunny     Mild  High      False  No
  Sunny     Cool  Normal    False  Yes
  Rainy     Mild  Normal    False  Yes
  Sunny     Mild  Normal    True   Yes
  Overcast  Mild  High      True   Yes
  Overcast  Hot   Normal    False  Yes
  Rainy     Mild  High      True   No
Item sets for weather data

  One-item sets, e.g.:
    Outlook = Sunny (5)
    Temperature = Cool (4)
  Two-item sets, e.g.:
    Outlook = Sunny, Temperature = Hot (2)
    Outlook = Sunny, Humidity = High (3)
  Three-item sets, e.g.:
    Outlook = Sunny, Temperature = Hot, Humidity = High (2)
    Outlook = Sunny, Humidity = High, Windy = False (2)
  Four-item sets, e.g.:
    Outlook = Sunny, Temperature = Hot, Humidity = High, Play = No (2)
    Outlook = Rainy, Temperature = Mild, Windy = False, Play = Yes (2)

  In total: 12 one-item sets, 47 two-item sets, 39 three-item sets, 6 four-item sets and 0 five-item sets (with minimum support of two)
Generating rules from an item set

  Once all item sets with minimum support have been generated, we can turn them into rules
  Example:

    Humidity = Normal, Windy = False, Play = Yes (4)

  Seven (2^N − 1, with N = 3 items) potential rules:

    If Humidity = Normal and Windy = False then Play = Yes           4/4
    If Humidity = Normal and Play = Yes then Windy = False           4/6
    If Windy = False and Play = Yes then Humidity = Normal           4/6
    If Humidity = Normal then Windy = False and Play = Yes           4/7
    If Windy = False then Humidity = Normal and Play = Yes           4/8
    If Play = Yes then Humidity = Normal and Windy = False           4/9
    If True then Humidity = Normal and Windy = False and Play = Yes  4/12
Rules for weather data

  Rules with support > 1 and confidence = 100%:

    Association rule                                       Sup.  Conf.
    1   Humidity = Normal, Windy = False ⇒ Play = Yes       4     100%
    2   Temperature = Cool ⇒ Humidity = Normal              4     100%
    3   Outlook = Overcast ⇒ Play = Yes                     4     100%
    4   Temperature = Cool, Play = Yes ⇒ Humidity = Normal  3     100%
    ...                                                     ...   ...
    58  ...                                                 2     100%

  In total:
    3 rules with support four
    5 with support three
    50 with support two
Example rules from the same set

  Item set:

    Temperature = Cool, Humidity = Normal, Windy = False, Play = Yes (2)

  Resulting rules (all with 100% confidence):

    Temperature = Cool, Windy = False ⇒ Humidity = Normal, Play = Yes
    Temperature = Cool, Windy = False, Humidity = Normal ⇒ Play = Yes
    Temperature = Cool, Windy = False, Play = Yes ⇒ Humidity = Normal

  due to the following frequent item sets:

    Temperature = Cool, Windy = False                     (2)
    Temperature = Cool, Humidity = Normal, Windy = False  (2)
    Temperature = Cool, Windy = False, Play = Yes         (2)
Generating item sets efficiently

  How can we efficiently find all frequent item sets?
  Finding one-item sets: easy
  Idea: use one-item sets to generate two-item sets, two-item sets to generate three-item sets, ...
    If (A B) is a frequent item set, then (A) and (B) have to be frequent item sets as well!
    In general: if X is a frequent k-item set, then all (k−1)-item subsets of X are also frequent
    Compute k-item sets by merging (k−1)-item sets
Example

  Given: five three-item sets

    (A B C), (A B D), (A C D), (A C E), (B C D)

  Lexicographically ordered!
  Candidate four-item sets:

    (A B C D)  OK because of (A C D), (B C D)
    (A C D E)  Not OK because of (C D E)

  Final check by counting instances in dataset!
  (k−1)-item sets are stored in hash table
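The merge-and-prune step is simple to sketch in Python; item sets are sorted tuples, and two (k−1)-item sets are merged only when they share their first k−2 items (the lexicographic trick above):

    def candidates(frequent, k):
        # frequent: lexicographically sorted list of frequent (k-1)-item sets
        freq = set(frequent)
        out = []
        for i, a in enumerate(frequent):
            for b in frequent[i + 1:]:
                if a[:-1] != b[:-1]:
                    continue  # the first k-2 items must agree
                cand = a + b[-1:]
                # prune unless every (k-1)-subset is itself frequent
                if all(cand[:j] + cand[j + 1:] in freq for j in range(k)):
                    out.append(cand)
        return out

    three = [('A','B','C'), ('A','B','D'), ('A','C','D'), ('A','C','E'), ('B','C','D')]
    print(candidates(three, 4))  # [('A','B','C','D')]; (A C D E) fails on (C D E)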
Generating rules efficiently

  We are looking for all high-confidence rules
    Support of antecedent obtained from hash table
    But: brute-force method is (2^N − 1)
  Better way: building (c + 1)-consequent rules from c-consequent ones
    Observation: a (c + 1)-consequent rule can only hold if all corresponding c-consequent rules also hold
  Resulting algorithm similar to procedure for large item sets
Example

  1-consequent rules:

    If Outlook = Sunny and Windy = False and Play = No
    then Humidity = High (2/2)

    If Humidity = High and Windy = False and Play = No
    then Outlook = Sunny (2/2)

  Corresponding 2-consequent rule:

    If Windy = False and Play = No
    then Outlook = Sunny and Humidity = High (2/2)

  Final check of antecedent against hash table!
Association rules: discussion

  Above method makes one pass through the data for each different size item set
    Other possibility: generate (k+2)-item sets just after (k+1)-item sets have been generated
    Result: more (k+2)-item sets than necessary will be considered but fewer passes through the data
    Makes sense if data too large for main memory
  Practical issue: generating a certain number of rules (e.g. by incrementally reducing min. support)
Other issues

  Standard ARFF format very inefficient for typical market basket data
    Attributes represent items in a basket and most items are usually missing
    Data should be represented in sparse format
  Instances are also called transactions
  Confidence is not necessarily the best measure
    Example: milk occurs in almost every supermarket transaction
    Other measures have been devised (e.g. lift)
Linear models: linear regression

  Work most naturally with numeric attributes
  Standard technique for numeric prediction
    Outcome is linear combination of attributes:

      x = w0 + w1 a1 + w2 a2 + ... + wk ak

  Weights are calculated from the training data
  Predicted value for first training instance a(1):

    w0 a0(1) + w1 a1(1) + w2 a2(1) + ... + wk ak(1) = Σ_{j=0..k} wj aj(1)

  (assuming each instance is extended with a constant attribute with value 1)
Minimizing the squared error

  Choose k + 1 coefficients to minimize the squared error on the training data
  Squared error:

    Σ_{i=1..n} ( x(i) − Σ_{j=0..k} wj aj(i) )²

  Derive coefficients using standard matrix operations
  Can be done if there are more instances than attributes (roughly speaking)
  Minimizing the absolute error is more difficult
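In practice the coefficients come straight out of a least-squares solver; a minimal NumPy sketch with made-up numbers (the first column of A is the constant attribute):

    import numpy as np

    A = np.array([[1., 2., 3.],
                  [1., 4., 1.],
                  [1., 0., 2.],
                  [1., 1., 1.]])    # n = 4 instances, k = 2 attributes plus bias
    x = np.array([5., 7., 1., 3.])  # target values
    w, *_ = np.linalg.lstsq(A, x, rcond=None)  # minimizes the squared error
    print(A @ w)                    # predictions for the training instances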
Classification

  Any regression technique can be used for classification
    Training: perform a regression for each class, setting the output to 1 for training instances that belong to the class, and 0 for those that don't
    Prediction: predict class corresponding to model with largest output value (membership value)
  For linear regression this is known as multi-response linear regression
  Problem: membership values are not in the [0,1] range, so they aren't proper probability estimates
Linear models: logistic regression

  Builds a linear model for a transformed target variable
  Assume we have two classes
  Logistic regression replaces the target

    P[1 | a1, a2, ..., ak]

  by this target

    log( P[1 | a1, a2, ..., ak] / (1 − P[1 | a1, a2, ..., ak]) )

  Logit transformation maps [0,1] to (−∞, +∞)
Logit transformation

  Resulting model:

    Pr[1 | a1, a2, ..., ak] = 1 / (1 + e^(−w0 − w1 a1 − ... − wk ak))
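The resulting model is a one-liner; a sketch with illustrative weights (a0 = 1 is the constant attribute):

    import math

    def logistic(w, a):
        # Pr[1 | a] = 1 / (1 + exp(-(w0*a0 + w1*a1 + ... + wk*ak)))
        z = sum(wi * ai for wi, ai in zip(w, a))
        return 1 / (1 + math.exp(-z))

    print(logistic([0.5, 1.0], [1.0, 2.0]))  # ≈ 0.92 with w0 = 0.5, w1 = 1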
Example logistic regression model

  Model with w0 = 0.5 and w1 = 1:

  (figure omitted)

  Parameters are found from training data using maximum likelihood
Maximum likelihood

  Aim: maximize probability of training data with respect to parameters
  Can use logarithms of probabilities and maximize log-likelihood of model:

    Σ_{i=1..n} [ (1 − x(i)) log(1 − Pr[1 | a1(i), a2(i), ..., ak(i)])
                 + x(i) log Pr[1 | a1(i), a2(i), ..., ak(i)] ]

  where the x(i) are either 0 or 1
  Weights wi need to be chosen to maximize log-likelihood (relatively simple method: iteratively re-weighted least squares)
Multiple classes

  Can perform logistic regression independently for each class (like multi-response linear regression)
  Problem: probability estimates for different classes won't sum to one
  Better: train coupled models by maximizing likelihood over all classes
  Alternative that often works well in practice: pairwise classification
Pairwise classification

  Idea: build model for each pair of classes, using only training data from those classes
  Problem? Have to solve k(k−1)/2 classification problems for a k-class problem
  Turns out not to be a problem in many cases because training sets become small:
    Assume data evenly distributed, i.e. 2n/k per learning problem for n instances in total
    Suppose learning algorithm is linear in n
    Then runtime of pairwise classification is proportional to (k(k−1)/2) × (2n/k) = (k−1)n
Linear models are hyperplanes

  Decision boundary for two-class logistic regression is where probability equals 0.5:

    Pr[1 | a1, a2, ..., ak] = 1 / (1 + exp(−w0 − w1 a1 − ... − wk ak)) = 0.5

  which occurs when

    w0 + w1 a1 + ... + wk ak = 0

  Thus logistic regression can only separate data that can be separated by a hyperplane
  Multi-response linear regression has the same problem. Class 1 is assigned if:

    w0(1) + w1(1) a1 + ... + wk(1) ak > w0(2) + w1(2) a1 + ... + wk(2) ak

  i.e. if

    (w0(1) − w0(2)) + (w1(1) − w1(2)) a1 + ... + (wk(1) − wk(2)) ak > 0
Linear models: the perceptron

  Don't actually need probability estimates if all we want to do is classification
  Different approach: learn separating hyperplane
  Assumption: data is linearly separable
  Algorithm for learning separating hyperplane: perceptron learning rule
  Hyperplane:

    0 = w0 a0 + w1 a1 + w2 a2 + ... + wk ak

  where we again assume that there is a constant attribute with value 1 (bias)
  If sum is greater than zero we predict the first class, otherwise the second class
The algorithm

  Set all weights to zero
  Until all instances in the training data are classified correctly
    For each instance I in the training data
      If I is classified incorrectly by the perceptron
        If I belongs to the first class add it to the weight vector
        else subtract it from the weight vector

  Why does this work?
  Consider situation where instance a pertaining to the first class has been added:

    (w0 + a0) a0 + (w1 + a1) a1 + (w2 + a2) a2 + ... + (wk + ak) ak

  This means output for a has increased by:

    a0 a0 + a1 a1 + a2 a2 + ... + ak ak

  This number is always positive, thus the hyperplane has moved into the correct direction (and we can show output decreases for instances of other class)
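The rule above in NumPy, as a sketch: classes are encoded +1/−1 so that "add or subtract the instance" is one line, and an epoch cap (our addition, not in the slide) guards against non-separable data:

    import numpy as np

    def perceptron(X, y, max_epochs=100):
        # X: instances with a leading bias column of 1s; y: labels in {+1, -1}
        w = np.zeros(X.shape[1])        # set all weights to zero
        for _ in range(max_epochs):
            mistakes = 0
            for a, cls in zip(X, y):
                if cls * (w @ a) <= 0:  # instance classified incorrectly
                    w += cls * a        # add it to (or subtract it from) w
                    mistakes += 1
            if mistakes == 0:           # all instances classified correctly
                break
        return w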
Perceptron as a neural network

  (figure omitted: network with an input layer and an output layer)
Linear models: Winnow

  Another mistake-driven algorithm for finding a separating hyperplane
    Assumes binary data (i.e. attribute values are either zero or one)
  Difference: multiplicative updates instead of additive updates
    Weights are multiplied by a user-specified parameter α (or its inverse)
  Another difference: user-specified threshold parameter θ
    Predict first class if

      w0 a0 + w1 a1 + w2 a2 + ... + wk ak > θ
The algorithm
while some instances are misclassified
for each instance a in the training data
classify a using the current weights
if the predicted class is incorrect
if a belongs to the first class
for each ai that is 1, multiply wi by alpha
(if ai is 0, leave wi unchanged)
otherwise
for each ai that is 1, divide wi by alpha
(if ai is 0, leave wi unchanged)
  Winnow is very effective in homing in on relevant features (it is attribute efficient)
  Can also be used in an on-line setting in which new instances arrive continuously (like the perceptron algorithm)
Balanced Winnow

  Winnow doesn't allow negative weights and this can be a drawback in some applications
  Balanced Winnow maintains two weight vectors, one for each class:

  while some instances are misclassified
    for each instance a in the training data
      classify a using the current weights
      if the predicted class is incorrect
        if a belongs to the first class
          for each ai that is 1, multiply wi+ by alpha and divide wi- by alpha
          (if ai is 0, leave wi+ and wi- unchanged)
        otherwise
          for each ai that is 1, multiply wi- by alpha and divide wi+ by alpha
          (if ai is 0, leave wi+ and wi- unchanged)

  Instance is classified as belonging to the first class (of two classes) if:

    (w0+ − w0−) a0 + (w1+ − w1−) a1 + ... + (wk+ − wk−) ak > θ
Instance-based learning

  Distance function defines what's learned
  Most instance-based schemes use Euclidean distance:

    √( (a1(1) − a1(2))² + (a2(1) − a2(2))² + ... + (ak(1) − ak(2))² )

  a(1) and a(2): two instances with k attributes
  Taking the square root is not required when comparing distances
  Other popular metric: city-block metric
    Adds differences without squaring them
Normalization and other issues

  Different attributes are measured on different scales → need to be normalized:

    ai = (vi − min vi) / (max vi − min vi)

  vi: the actual value of attribute i
  Nominal attributes: distance either 0 or 1
  Common policy for missing values: assumed to be maximally distant (given normalized attributes)
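Both steps in a short NumPy sketch (it assumes every attribute actually varies, so max > min):

    import numpy as np

    def normalize(X):
        # rescale each attribute to [0, 1]: (v - min) / (max - min)
        lo, hi = X.min(axis=0), X.max(axis=0)
        return (X - lo) / (hi - lo)

    def euclidean(a, b):
        # the square root could be skipped when only comparing distances
        return np.sqrt(((a - b) ** 2).sum())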
Finding nearest neighbors efficiently

  Simplest way of finding nearest neighbour: linear scan of the data
    Classification takes time proportional to the product of the number of instances in training and test sets
  Nearest-neighbor search can be done more efficiently using appropriate data structures
  We will discuss two methods that represent training data in a tree structure: kD-trees and ball trees
kD-tree example

  (figure omitted)
Using kD-trees: example

  (figure omitted)
More on kD-trees

  Complexity depends on depth of tree, given by logarithm of number of nodes
  Amount of backtracking required depends on quality of tree ("square" vs. "skinny" nodes)
  How to build a good tree? Need to find good split point and split direction
    Split direction: direction with greatest variance
    Split point: median value along that direction
  Using value closest to mean (rather than median) can be better if data is skewed
  Can apply this recursively
Building trees incrementally

  Big advantage of instance-based learning: classifier can be updated incrementally
    Just add new training instance!
  Can we do the same with kD-trees?
  Heuristic strategy:
    Find leaf node containing new instance
    Place instance into leaf if leaf is empty
    Otherwise, split leaf according to the longest dimension (to preserve squareness)
  Tree should be re-built occasionally (i.e. if depth grows to twice the optimum depth)
Ball trees

  Problem in kD-trees: corners
  Observation: no need to make sure that regions don't overlap
  Can use balls (hyperspheres) instead of hyperrectangles
    A ball tree organizes the data into a tree of k-dimensional hyperspheres
    Normally allows for a better fit to the data and thus more efficient search
Ball tree example

  (figure omitted)
Using ball trees

  Nearest-neighbor search is done using the same backtracking strategy as in kD-trees
  Ball can be ruled out from consideration if: distance from target to ball's center exceeds ball's radius plus current upper bound
Building ball trees

  Ball trees are built top down (like kD-trees)
  Don't have to continue until leaf balls contain just two points: can enforce minimum occupancy (same in kD-trees)
  Basic problem: splitting a ball into two
  Simple (linear-time) split selection strategy:
    Choose point farthest from ball's center
    Choose second point farthest from first one
    Assign each point to these two points
    Compute cluster centers and radii based on the two subsets to get two balls
Discussion of nearest-neighbor learning

  Often very accurate
  Assumes all attributes are equally important
    Remedy: attribute selection or weights
  Possible remedies against noisy instances:
    Take a majority vote over the k nearest neighbors
    Removing noisy instances from dataset (difficult!)
  Statisticians have used k-NN since early 1950s
    If n → ∞ and k/n → 0, error approaches minimum
  kD-trees become inefficient when number of attributes is too large (approximately > 10)
  Ball trees (which are instances of metric trees) work well in higher-dimensional spaces
More discussion

  Instead of storing all training instances, compress them into regions
  Example: hyperpipes (from discussion of 1R)
  Another simple technique (Voting Feature Intervals):
    Construct intervals for each attribute
      Discretize numeric attributes
      Treat each value of a nominal attribute as an "interval"
    Count number of times class occurs in interval
    Prediction is generated by letting intervals vote (those that contain the test instance)
Clustering

  Clustering techniques apply when there is no class to be predicted
  Aim: divide instances into "natural" groups
  As we've seen, clusters can be:
    disjoint vs. overlapping
    deterministic vs. probabilistic
    flat vs. hierarchical
  We'll look at a classic clustering algorithm called k-means
    k-means clusters are disjoint, deterministic, and flat
The k-means algorithm

  To cluster data into k groups (k is predefined):
    0. Choose k cluster centers, e.g. at random
    1. Assign instances to clusters based on distance to cluster centers
    2. Compute centroids of clusters
    3. Go to step 1, until convergence
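A minimal NumPy sketch of these steps (it does not handle the corner case of a cluster losing all its instances):

    import numpy as np

    def kmeans(X, k, rng=None):
        rng = rng or np.random.default_rng(0)
        centers = X[rng.choice(len(X), size=k, replace=False)]  # step 0: random centers
        while True:
            # step 1: assign each instance to the closest cluster center
            d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            # step 2: recompute centroids; step 3: repeat until they stop moving
            new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new, centers):
                return centers, labels
            centers = new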
Discussion

  Algorithm minimizes squared distance to cluster centers
  Result can vary significantly based on initial choice of seeds
    Can get trapped in local minimum
    Example: (figure omitted: instances and initial cluster centres)
  To increase chance of finding global optimum: restart with different random seeds
  Can be applied recursively with k = 2
Faster distance calculations

  Can we use kD-trees or ball trees to speed up the process? Yes:
    First, build tree, which remains static, for all the data points
    At each node, store number of instances and sum of all instances
    In each iteration, descend tree and find out which cluster each node belongs to
      Can stop descending as soon as we find out that a node belongs entirely to a particular cluster
      Use statistics stored at the nodes to compute new cluster centers
Example

  (figure omitted)
Multi-instance learning

  Simplicity-first methodology can be applied to multi-instance learning with surprisingly good results
  Two simple approaches, both using standard single-instance learners:
    Manipulate the input to learning
    Manipulate the output of learning
Aggregating the input

  Convert multi-instance problem into single-instance one
    Summarize the instances in a bag by computing mean, mode, minimum and maximum as new attributes
    "Summary" instance retains the class label of its bag
    To classify a new bag the same process is used
  Results using summary instances with minimum and maximum + support vector machine classifier are comparable to special purpose multi-instance learners on original drug discovery problem
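The input-aggregation step is tiny; a sketch where a bag is a 2-D array with one instance per row, summarized by per-attribute minima and maxima as suggested above:

    import numpy as np

    def summarize_bag(bag):
        # one summary instance per bag: [min of each attribute, max of each attribute]
        return np.concatenate([bag.min(axis=0), bag.max(axis=0)])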
Aggregating the output

  Learn a single-instance classifier directly from the original instances in each bag
    Each instance is given the class of the bag it originates from
  To classify a new bag:
    Produce a prediction for each instance in the bag
    Aggregate the predictions to produce a prediction for the bag as a whole
    One approach: treat predictions as votes for the various class labels
    A problem: bags can contain differing numbers of instances → give each instance a weight inversely proportional to the bag's size
Comments on basic methods

  Bayes' rule stems from his "Essay towards solving a problem in the doctrine of chances" (1763)
    Difficult bit in general: estimating prior probabilities (easy in the case of naive Bayes)
  Extension of naive Bayes: Bayesian networks (which we'll discuss later)
  Algorithm for association rules is called APRIORI
  Minsky and Papert (1969) showed that linear classifiers have limitations, e.g. can't learn XOR
    But: combinations of them can (→ multi-layer neural nets, which we'll discuss later)