Data Mining: Practical Machine Learning Tools and Techniques
Slides for Chapter 4 of Data Mining by I. H. Witten, E. Frank and M. A. Hall

Algorithms: The basic methods

  Inferring rudimentary rules
  Statistical modeling
  Constructing decision trees
  Constructing rules
  Association rule learning
  Linear models
  Instance-based learning
  Clustering
Simplicity first

  Simple algorithms often work very well!
  There are many kinds of simple structure, e.g.:
    One attribute does all the work
    All attributes contribute equally and independently
    A weighted linear combination might do
    Instance-based: use a few prototypes
    Use simple logical rules
  Success of method depends on the domain
Inferring rudimentary rules

  1R: learns a 1-level decision tree
    I.e., rules that all test one particular attribute
  Basic version:
    One branch for each value
    Each branch assigns most frequent class
    Error rate: proportion of instances that don't belong to the majority class of their corresponding branch
    Choose attribute with lowest error rate
    (assumes nominal attributes)
Pseudocode for 1R
For each attribute,
For each value of the attribute, make a rule as follows:
count how often each class appears
find the most frequent class
make the rule assign that class to this attribute-value
Calculate the error rate of the rules
Choose the rules with the smallest error rate
  Note: "missing" is treated as a separate attribute value
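The pseudocode above translates almost directly into Python. Below is a minimal sketch (our illustration, not the book's implementation): it assumes nominal attributes, instances given as lists of values with the class at a fixed index, and it treats "missing" simply as one more value, as the note says.

    from collections import Counter

    def one_r(instances, attribute_indices, class_index):
        # For each attribute, build one rule per value (predict the most
        # frequent class) and keep the attribute with the lowest error rate.
        best = None
        for a in attribute_indices:
            counts = {}  # attribute value -> Counter of class frequencies
            for row in instances:
                counts.setdefault(row[a], Counter())[row[class_index]] += 1
            rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
            # errors: instances not in the majority class of their branch
            errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
            if best is None or errors < best[0]:
                best = (errors, a, rules)
        return best  # (total errors, attribute index, value -> class rules)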
Evaluating the weather attributes

  Outlook   Temp  Humidity  Windy  Play
  Sunny     Hot   High      False  No
  Sunny     Hot   High      True   No
  Overcast  Hot   High      False  Yes
  Rainy     Mild  High      False  Yes
  Rainy     Cool  Normal    False  Yes
  Rainy     Cool  Normal    True   No
  Overcast  Cool  Normal    True   Yes
  Sunny     Mild  High      False  No
  Sunny     Cool  Normal    False  Yes
  Rainy     Mild  Normal    False  Yes
  Sunny     Mild  Normal    True   Yes
  Overcast  Mild  High      True   Yes
  Overcast  Hot   Normal    False  Yes
  Rainy     Mild  High      True   No

  Attribute  Rules            Errors  Total errors
  Outlook    Sunny → No       2/5     4/14
             Overcast → Yes   0/4
             Rainy → Yes      2/5
  Temp       Hot → No*        2/4     5/14
             Mild → Yes       2/6
             Cool → Yes       1/4
  Humidity   High → No        3/7     4/14
             Normal → Yes     1/7
  Windy      False → Yes      2/8     5/14
             True → No*       3/6

  * indicates a tie
Dealing with numeric attributes

  Discretize numeric attributes
  Divide each attribute's range into intervals:
    Sort instances according to attribute's values
    Place breakpoints where class changes (majority class)
    This minimizes the total error
  Example: temperature from weather data

    64   65   68   69   70   71   72   72   75   75   80   81   83   85
    Yes | No | Yes  Yes  Yes | No   No   Yes | Yes  Yes | No | Yes  Yes | No

  Outlook   Temperature  Humidity  Windy  Play
  Sunny     85           85        False  No
  Sunny     80           90        True   No
  Overcast  83           86        False  Yes
  Rainy     75           80        False  Yes
  ...       ...          ...       ...    ...
The problem of overfitting

  This procedure is very sensitive to noise
    One instance with an incorrect class label will probably produce a separate interval
  Also: time stamp attribute will have zero errors
  Simple solution: enforce minimum number of instances in majority class per interval
  Example (with min = 3):

    64   65   68   69   70   71   72   72   75   75   80   81   83   85
    Yes | No | Yes  Yes  Yes | No   No   Yes | Yes  Yes | No | Yes  Yes | No

  becomes

    64   65   68   69   70   71   72   72   75   75   80   81   83   85
    Yes  No   Yes  Yes  Yes | No   No   Yes  Yes  Yes | No   Yes  Yes  No
With overfitting avoidance

  Resulting rule set:

  Attribute    Rules                   Errors  Total errors
  Outlook      Sunny → No              2/5     4/14
               Overcast → Yes          0/4
               Rainy → Yes             2/5
  Temperature  ≤ 77.5 → Yes            3/10    5/14
               > 77.5 → No*            2/4
  Humidity     ≤ 82.5 → Yes            1/7     3/14
               > 82.5 and ≤ 95.5 → No  2/6
               > 95.5 → Yes            0/1
  Windy        False → Yes             2/8     5/14
               True → No*              3/6
Discussion of 1R

  1R was described in a paper by Holte (1993)
    Contains an experimental evaluation on 16 datasets (using cross-validation so that results were representative of performance on future data)
    Minimum number of instances was set to 6 after some experimentation
    1R's simple rules performed not much worse than much more complex decision trees
  Simplicity first pays off!

  Very Simple Classification Rules Perform Well on Most Commonly Used Datasets
  Robert C. Holte, Computer Science Department, University of Ottawa
Discussion of 1R: Hyperpipes

  Another simple technique: build one rule for each class
    Each rule is a conjunction of tests, one for each attribute
    For numeric attributes: test checks whether instance's value is inside an interval
      Interval given by minimum and maximum observed in training data
    For nominal attributes: test checks whether value is one of a subset of attribute values
      Subset given by all possible values observed in training data
    Class with most matching tests is predicted
Statistical modeling

  "Opposite" of 1R: use all the attributes
  Two assumptions: attributes are
    equally important
    statistically independent (given the class value)
      I.e., knowing the value of one attribute says nothing about the value of another (if the class is known)
  Independence assumption is never correct!
  But this scheme works well in practice
Probabilities for weather data

  Outlook   Yes  No     Temperature  Yes  No
  Sunny     2/9  3/5    Hot          2/9  2/5
  Overcast  4/9  0/5    Mild         4/9  2/5
  Rainy     3/9  2/5    Cool         3/9  1/5

  Humidity  Yes  No     Windy  Yes  No     Play  Yes   No
  High      3/9  4/5    False  6/9  2/5          9/14  5/14
  Normal    6/9  1/5    True   3/9  3/5

  These counts are derived from the weather data:

  Outlook   Temp  Humidity  Windy  Play
  Sunny     Hot   High      False  No
  Sunny     Hot   High      True   No
  Overcast  Hot   High      False  Yes
  Rainy     Mild  High      False  Yes
  Rainy     Cool  Normal    False  Yes
  Rainy     Cool  Normal    True   No
  Overcast  Cool  Normal    True   Yes
  Sunny     Mild  High      False  No
  Sunny     Cool  Normal    False  Yes
  Rainy     Mild  Normal    False  Yes
  Sunny     Mild  Normal    True   Yes
  Overcast  Mild  High      True   Yes
  Overcast  Hot   Normal    False  Yes
  Rainy     Mild  High      True   No
Probabilities for weather data

  (count and probability tables as on the previous slide)

  A new day:

    Outlook  Temp.  Humidity  Windy  Play
    Sunny    Cool   High      True   ?

  Likelihood of the two classes (read off the tables):
    For "yes" = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
    For "no"  = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206
  Conversion into probabilities by normalization:
    P("yes") = 0.0053 / (0.0053 + 0.0206) = 0.205
    P("no")  = 0.0206 / (0.0053 + 0.0206) = 0.795
Bayes's rule

  Probability of event H given evidence E:

    Pr[H | E] = Pr[E | H] × Pr[H] / Pr[E]

  A priori probability of H: Pr[H]
    Probability of event before evidence is seen
  A posteriori probability of H: Pr[H | E]
    Probability of event after evidence is seen

  Thomas Bayes
  Born: 1702 in London, England
  Died: 1761 in Tunbridge Wells, Kent, England
Naive Bayes for classification

  Classification learning: what's the probability of the class given an instance?
    Evidence E = instance
    Event H = class value for instance
  Naive assumption: evidence splits into parts (i.e. attributes) that are independent
Weather data example

  Evidence E:

    Outlook  Temp.  Humidity  Windy  Play
    Sunny    Cool   High      True   ?

  Probability of class "yes":

    Pr[yes | E] = Pr[Outlook = Sunny | yes]
                  × Pr[Temperature = Cool | yes]
                  × Pr[Humidity = High | yes]
                  × Pr[Windy = True | yes]
                  × Pr[yes] / Pr[E]
                = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / Pr[E]
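These numbers are easy to check; a tiny Python sketch that reproduces the likelihoods and normalized probabilities from the previous slides (dividing by the sum of the two scores stands in for the 1/Pr[E] normalization):

    # Unnormalized scores for the new day (Sunny, Cool, High, True)
    score_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # ≈ 0.0053
    score_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # ≈ 0.0206
    total = score_yes + score_no                          # plays the role of Pr[E]
    print(score_yes / total, score_no / total)            # ≈ 0.205, ≈ 0.795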
The "zero-frequency problem"

  What if an attribute value doesn't occur with every class value?
    (e.g. "Humidity = High" for class "yes")
    Probability will be zero! Pr[Humidity = High | yes] = 0
    A posteriori probability will also be zero: Pr[yes | E] = 0
    (No matter how likely the other values are!)
  Remedy: add 1 to the count for every attribute value-class combination (Laplace estimator)
  Result: probabilities will never be zero! (also: stabilizes probability estimates)
Modified probability estimates

  In some cases adding a constant different from 1 might be more appropriate
  Example: attribute outlook for class yes

    Sunny: (2 + μ/3) / (9 + μ)    Overcast: (4 + μ/3) / (9 + μ)    Rainy: (3 + μ/3) / (9 + μ)

  Weights don't need to be equal (but they must sum to 1):

    Sunny: (2 + μp1) / (9 + μ)    Overcast: (4 + μp2) / (9 + μ)    Rainy: (3 + μp3) / (9 + μ)
Missing values

  Training: instance is not included in frequency count for attribute value-class combination
  Classification: attribute will be omitted from calculation
  Example:

    Outlook  Temp.  Humidity  Windy  Play
    ?        Cool   High      True   ?

    Likelihood of "yes" = 3/9 × 3/9 × 3/9 × 9/14 = 0.0238
    Likelihood of "no"  = 1/5 × 4/5 × 3/5 × 5/14 = 0.0343
    P("yes") = 0.0238 / (0.0238 + 0.0343) = 41%
    P("no")  = 0.0343 / (0.0238 + 0.0343) = 59%
Numeric attributes

  Usual assumption: attributes have a normal or Gaussian probability distribution (given the class)
  The probability density function for the normal distribution is defined by two parameters:

    Sample mean:        μ = (1/n) Σ_{i=1..n} xi
    Standard deviation: σ = √( (1/(n−1)) Σ_{i=1..n} (xi − μ)² )

  Then the density function f(x) is

    f(x) = (1 / (√(2π) σ)) × e^(−(x − μ)² / (2σ²))
Statistics for weather data

  Outlook   Yes  No     Windy  Yes  No     Play  Yes   No
  Sunny     2/9  3/5    False  6/9  2/5          9/14  5/14
  Overcast  4/9  0/5    True   3/9  3/5
  Rainy     3/9  2/5

  Temperature   Yes            No            Humidity     Yes            No
                64, 68, 69,    65, 71, 72,                65, 70, 70,    70, 85, 90,
                70, 72, ...    80, 85, ...                75, 80, ...    91, 95, ...
  mean μ        73             75            mean μ       79             86
  std. dev. σ   6.2            7.9           std. dev. σ  10.2           9.7

  Example density value:

    f(temperature = 66 | yes) = (1 / (√(2π) × 6.2)) × e^(−(66 − 73)² / (2 × 6.2²)) = 0.0340
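A minimal Python sketch of the density computation above; gaussian() is just the normal density, and the call reproduces the example value:

    import math

    def gaussian(x, mu, sigma):
        # Normal probability density with mean mu and standard deviation sigma
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

    print(gaussian(66, 73, 6.2))   # ≈ 0.0340 = f(temperature = 66 | yes)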
Classifying a new day

  A new day:

    Outlook  Temp.  Humidity  Windy  Play
    Sunny    66     90        true   ?

  Missing values during training are not included in calculation of mean and standard deviation
Probability densities

  Relationship between probability and density:

    Pr[c − ε/2 ≤ x ≤ c + ε/2] ≈ ε × f(c)

  But: this doesn't change calculation of a posteriori probabilities because ε cancels out
  Exact relationship:

    Pr[a ≤ x ≤ b] = ∫_a^b f(t) dt
Multinomial naive Bayes I

  Version of naive Bayes used for document classification using the bag of words model
    n1, n2, ..., nk: number of times word i occurs in the document
    P1, P2, ..., Pk: probability of obtaining word i when sampling from documents in class H
  Probability of observing document E given class H (based on the multinomial distribution, with N the total number of words in E):

    Pr[E | H] ≈ N! × Π_{i=1..k} (Pi^ni / ni!)

  Ignores probability of generating a document of the right length (prob. assumed constant for each class)
Multinomial naive Bayes II

  Suppose dictionary has two words, yellow and blue
  Suppose Pr[yellow | H] = 75% and Pr[blue | H] = 25%
  Suppose E is the document "blue yellow blue"
  Probability of observing document:

    Pr[{blue yellow blue} | H] ≈ 3! × (0.75¹/1!) × (0.25²/2!) = 9/64 ≈ 0.14

  Suppose there is another class H' that has Pr[yellow | H'] = 10% and Pr[blue | H'] = 90%:

    Pr[{blue yellow blue} | H'] ≈ 3! × (0.1¹/1!) × (0.9²/2!) = 0.24

  Need to take prior probability of class into account to make final classification
  Factorials don't actually need to be computed
  Underflows can be prevented by using logarithms
Naive Bayes: discussion

  Naive Bayes works surprisingly well (even if independence assumption is clearly violated)
  Why? Because classification doesn't require accurate probability estimates as long as maximum probability is assigned to correct class
  However: adding too many redundant attributes will cause problems (e.g. identical attributes)
  Note also: many numeric attributes are not normally distributed (→ kernel density estimators)
Constructing decision trees

  Strategy: top down
  Recursive divide-and-conquer fashion:
    First: select attribute for root node; create branch for each possible attribute value
    Then: split instances into subsets, one for each branch extending from the node
    Finally: repeat recursively for each branch, using only instances that reach the branch
  Stop if all instances have the same class
Which attribute to select?

  (figures omitted: candidate splits on each of the four weather attributes)
Criterion for attribute selection

  Which is the best attribute?
    Want to get the smallest tree
    Heuristic: choose the attribute that produces the "purest" nodes
  Popular impurity criterion: information gain
    Information gain increases with the average purity of the subsets
  Strategy: choose attribute that gives greatest information gain
Computing information

  Measure information in bits
    Given a probability distribution, the info required to predict an event is the distribution's entropy
    Entropy gives the information required in bits (can involve fractions of bits!)
  Formula for computing the entropy:

    entropy(p1, p2, ..., pn) = −p1 log p1 − p2 log p2 − ... − pn log pn
Example: attribute Outlook

  Outlook = Sunny:
    info([2,3]) = entropy(2/5, 3/5) = −2/5 log(2/5) − 3/5 log(3/5) = 0.971 bits
  Outlook = Overcast:
    info([4,0]) = entropy(1, 0) = −1 log(1) − 0 log(0) = 0 bits
    (Note: 0 log(0) is normally undefined; it is taken to be 0 here.)
  Outlook = Rainy:
    info([3,2]) = entropy(3/5, 2/5) = −3/5 log(3/5) − 2/5 log(2/5) = 0.971 bits
  Expected information for attribute:
    info([3,2], [4,0], [3,2]) = 5/14 × 0.971 + 4/14 × 0 + 5/14 × 0.971 = 0.693 bits
Computing information gain

  Information gain: information before splitting − information after splitting

    gain(Outlook) = info([9,5]) − info([2,3], [4,0], [3,2])
                  = 0.940 − 0.693
                  = 0.247 bits

  Information gain for attributes from weather data:

    gain(Outlook)     = 0.247 bits
    gain(Temperature) = 0.029 bits
    gain(Humidity)    = 0.152 bits
    gain(Windy)       = 0.048 bits
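The entropy and gain computations are easy to reproduce; here is a small Python sketch (function names are ours, not the book's) that recovers gain(Outlook) = 0.247 bits:

    import math

    def entropy(counts):
        # Entropy of a class distribution given as a list of counts, in bits
        n = sum(counts)
        return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

    def info_gain(before, subsets):
        # before: class counts before the split, e.g. [9, 5] for the weather data
        # subsets: class counts in each branch, e.g. [[2,3], [4,0], [3,2]] for Outlook
        n = sum(before)
        return entropy(before) - sum(sum(s) / n * entropy(s) for s in subsets)

    print(info_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))   # ≈ 0.247 bits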
Continuing to split

  Gains for the remaining attributes in the Outlook = Sunny branch:

    gain(Temperature) = 0.571 bits
    gain(Humidity)    = 0.971 bits
    gain(Windy)       = 0.020 bits
Final decision tree

  Note: not all leaves need to be pure; sometimes identical instances have different classes
    → Splitting stops when data can't be split any further
Wishlist for a purity measure

  Properties we require from a purity measure:
    When node is pure, measure should be zero
    When impurity is maximal (i.e. all classes equally likely), measure should be maximal
    Measure should obey multistage property (i.e. decisions can be made in several stages):

      measure([2,3,4]) = measure([2,7]) + (7/9) × measure([3,4])

  Entropy is the only function that satisfies all three properties!
Properties of the entropy

  The multistage property:

    entropy(p, q, r) = entropy(p, q + r) + (q + r) × entropy(q/(q+r), r/(q+r))

  Simplification of computation:

    info([2,3,4]) = −2/9 log(2/9) − 3/9 log(3/9) − 4/9 log(4/9)
                  = [−2 log 2 − 3 log 3 − 4 log 4 + 9 log 9] / 9

  Note: instead of maximizing info gain we could just minimize information
Highly-branching attributes

  Problematic: attributes with a large number of values (extreme case: ID code)
  Subsets are more likely to be pure if there is a large number of values
    Information gain is biased towards choosing attributes with a large number of values
    This may result in overfitting (selection of an attribute that is non-optimal for prediction)
  Another problem: fragmentation
Weather data with ID code

  ID code  Outlook   Temp.  Humidity  Windy  Play
  A        Sunny     Hot    High      False  No
  B        Sunny     Hot    High      True   No
  C        Overcast  Hot    High      False  Yes
  D        Rainy     Mild   High      False  Yes
  E        Rainy     Cool   Normal    False  Yes
  F        Rainy     Cool   Normal    True   No
  G        Overcast  Cool   Normal    True   Yes
  H        Sunny     Mild   High      False  No
  I        Sunny     Cool   Normal    False  Yes
  J        Rainy     Mild   Normal    False  Yes
  K        Sunny     Mild   Normal    True   Yes
  L        Overcast  Mild   High      True   Yes
  M        Overcast  Hot    Normal    False  Yes
  N        Rainy     Mild   High      True   No
Tree stump for ID code attribute

  Entropy of split:

    info(ID code) = info([0,1]) + info([0,1]) + ... + info([0,1]) = 0 bits

  Information gain is maximal for ID code (namely 0.940 bits)
Gain ratio

  Gain ratio: a modification of the information gain that reduces its bias
  Gain ratio takes number and size of branches into account when choosing an attribute
    It corrects the information gain by taking the intrinsic information of a split into account
  Intrinsic information: entropy of distribution of instances into branches (i.e. how much info do we need to tell which branch an instance belongs to)
Computing the gain ratio

  Example: intrinsic information for ID code

    info([1,1,...,1]) = 14 × (−1/14 × log(1/14)) = 3.807 bits

  Value of attribute decreases as intrinsic information gets larger
  Definition of gain ratio:

    gain_ratio(attribute) = gain(attribute) / intrinsic_info(attribute)

  Example:

    gain_ratio(ID code) = 0.940 bits / 3.807 bits = 0.246
Gain ratios for weather data

  Attribute    Info   Gain                   Split info  Gain ratio
  Outlook      0.693  0.940 − 0.693 = 0.247  1.577       0.247/1.577 = 0.157
  Temperature  0.911  0.940 − 0.911 = 0.029  1.557       0.029/1.557 = 0.019
  Humidity     0.788  0.940 − 0.788 = 0.152  1.000       0.152/1.000 = 0.152
  Windy        0.892  0.940 − 0.892 = 0.048  0.985       0.048/0.985 = 0.049
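Reusing entropy() and info_gain() from the earlier sketch, the gain ratio is one extra line; the intrinsic information ("split info") is just the entropy of the branch sizes:

    def gain_ratio(before, subsets):
        intrinsic = entropy([sum(s) for s in subsets])   # split info
        return info_gain(before, subsets) / intrinsic

    print(gain_ratio([9, 5], [[2, 3], [4, 0], [3, 2]]))      # Outlook: ≈ 0.157
    print(gain_ratio([9, 5], [[1, 0]] * 9 + [[0, 1]] * 5))   # ID code: ≈ 0.247 (0.246 on the slide)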
More on the gain ratio

  "Outlook" still comes out top
  However: "ID code" has greater gain ratio
    Standard fix: ad hoc test to prevent splitting on that type of attribute
  Problem with gain ratio: it may overcompensate
    May choose an attribute just because its intrinsic information is very low
    Standard fix: only consider attributes with greater than average information gain
Discussion

  Top-down induction of decision trees: ID3, algorithm developed by Ross Quinlan
    Gain ratio just one modification of this basic algorithm
    C4.5: deals with numeric attributes, missing values, noisy data
  Similar approach: CART
  There are many other attribute selection criteria! (But little difference in accuracy of result)
Covering algorithms

  Convert decision tree into a rule set
    Straightforward, but rule set overly complex
    More effective conversions are not trivial
  Instead, can generate rule set directly
    For each class in turn, find rule set that covers all instances in it (excluding instances not in the class)
  Called a covering approach:
    At each stage a rule is identified that "covers" some of the instances
Example: generating a rule

  If true
  then class = a

  If x > 1.2
  then class = a

  If x > 1.2 and y > 2.6
  then class = a

  Possible rule set for class b:

  If x ≤ 1.2 then class = b
  If x > 1.2 and y ≤ 2.6 then class = b

  Could add more rules, get "perfect" rule set
Rules vs. trees

  Corresponding decision tree produces exactly the same predictions
  But: rule sets can be more perspicuous when decision trees suffer from replicated subtrees
  Also: in multiclass situations, covering algorithm concentrates on one class at a time whereas decision tree learner takes all classes into account
Simple covering algorithm

  Generates a rule by adding tests that maximize rule's accuracy
  Similar to situation in decision trees: problem of selecting an attribute to split on
    But: decision tree inducer maximizes overall purity
  Each new test reduces rule's coverage
Selecting a test

  Goal: maximize accuracy
    t      total number of instances covered by rule
    p      positive examples of the class covered by rule
    t − p  number of errors made by rule
  Select test that maximizes the ratio p/t
  We are finished when p/t = 1 or the set of instances can't be split any further
Example: contact lens data

  Rule we seek:

    If ?
    then recommendation = hard

  Possible tests:

    Age = Young                            2/8
    Age = Pre-presbyopic                   1/8
    Age = Presbyopic                       1/8
    Spectacle prescription = Myope         3/12
    Spectacle prescription = Hypermetrope  1/12
    Astigmatism = no                       0/12
    Astigmatism = yes                      4/12
    Tear production rate = Reduced         0/12
    Tear production rate = Normal          4/12
Modified rule and resulting data

  Rule with best test added:

    If astigmatism = yes
    then recommendation = hard

  Instances covered by modified rule:

  Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
  Young           Myope                   Yes          Reduced               None
  Young           Myope                   Yes          Normal                Hard
  Young           Hypermetrope            Yes          Reduced               None
  Young           Hypermetrope            Yes          Normal                Hard
  Pre-presbyopic  Myope                   Yes          Reduced               None
  Pre-presbyopic  Myope                   Yes          Normal                Hard
  Pre-presbyopic  Hypermetrope            Yes          Reduced               None
  Pre-presbyopic  Hypermetrope            Yes          Normal                None
  Presbyopic      Myope                   Yes          Reduced               None
  Presbyopic      Myope                   Yes          Normal                Hard
  Presbyopic      Hypermetrope            Yes          Reduced               None
  Presbyopic      Hypermetrope            Yes          Normal                None
Further refinement

  Current state:

    If astigmatism = yes and ?
    then recommendation = hard

  Possible tests:

    Age = Young                            2/4
    Age = Pre-presbyopic                   1/4
    Age = Presbyopic                       1/4
    Spectacle prescription = Myope         3/6
    Spectacle prescription = Hypermetrope  1/6
    Tear production rate = Reduced         0/6
    Tear production rate = Normal          4/6
Modified rule and resulting data

  Rule with best test added:

    If astigmatism = yes and tear production rate = normal
    then recommendation = hard

  Instances covered by modified rule:

  Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
  Young           Myope                   Yes          Normal                Hard
  Young           Hypermetrope            Yes          Normal                Hard
  Pre-presbyopic  Myope                   Yes          Normal                Hard
  Pre-presbyopic  Hypermetrope            Yes          Normal                None
  Presbyopic      Myope                   Yes          Normal                Hard
  Presbyopic      Hypermetrope            Yes          Normal                None
Further refinement

  Current state:

    If astigmatism = yes and tear production rate = normal and ?
    then recommendation = hard

  Possible tests:

    Age = Young                            2/2
    Age = Pre-presbyopic                   1/2
    Age = Presbyopic                       1/2
    Spectacle prescription = Myope         3/3
    Spectacle prescription = Hypermetrope  1/3

  Tie between the first and the fourth test
    We choose the one with greater coverage
The result

  Final rule:

    If astigmatism = yes
    and tear production rate = normal
    and spectacle prescription = myope
    then recommendation = hard

  Second rule for recommending "hard lenses" (built from instances not covered by first rule):

    If age = young and astigmatism = yes
    and tear production rate = normal
    then recommendation = hard

  These two rules cover all "hard lenses"
  Process is repeated with other two classes
Pseudocode for PRISM
For each class C
Initialize E to the instance set
While E contains instances in class C
Create a rule R with an empty left-hand side that predicts class C
Until R is perfect (or there are no more attributes to use) do
For each attribute A not mentioned in R, and each value v,
Consider adding the condition A = v to the left-hand side of R
Select A and v to maximize the accuracy p/t
(break ties by choosing the condition with the largest p)
Add A = v to R
Remove the instances covered by R from E
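A compact Python rendering of the pseudocode above, as a sketch (not Weka's implementation): instances are assumed to be dicts mapping attribute names to nominal values, plus a 'class' key.

    def prism(instances, attributes, cls):
        E, rules = list(instances), []
        while any(x['class'] == cls for x in E):
            covered, conds = list(E), []
            # grow one rule: add conditions until it is perfect (or no attributes left)
            while any(x['class'] != cls for x in covered):
                best = None
                for a in attributes:
                    if any(c[0] == a for c in conds):
                        continue  # attribute already mentioned in the rule
                    for v in {x[a] for x in covered}:
                        sub = [x for x in covered if x[a] == v]
                        p = sum(x['class'] == cls for x in sub)
                        key = (p / len(sub), p)  # accuracy p/t; ties broken by larger p
                        if best is None or key > best[0]:
                            best = (key, a, v, sub)
                if best is None:
                    break  # there are no more attributes to use
                _, a, v, covered = best
                conds.append((a, v))
            rules.append(conds)
            # remove the instances covered by the rule from E
            E = [x for x in E if not all(x[a] == v for a, v in conds)]
        return rules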
Rules vs. decision lists

  PRISM with outer loop removed generates a decision list for one class
    Subsequent rules are designed for instances that are not covered by previous rules
    But: order doesn't matter because all rules predict the same class
  Outer loop considers all classes separately
    No order dependence implied
  Problems: overlapping rules, default rule required
Separate and conquer

  Methods like PRISM (for dealing with one class) are separate-and-conquer algorithms:
    First, identify a useful rule
    Then, separate out all the instances it covers
    Finally, "conquer" the remaining instances
  Difference to divide-and-conquer methods:
    Subset covered by rule doesn't need to be explored any further
Mining association rules

  Naive method for finding association rules:
    Use separate-and-conquer method
    Treat every possible combination of attribute values as a separate class
  Two problems:
    Computational complexity
    Resulting number of rules (which would have to be pruned on the basis of support and confidence)
  But: we can look for high-support rules directly!
Item sets

  Support: number of instances correctly covered by association rule
    The same as the number of instances covered by all tests in the rule (LHS and RHS!)
  Item: one test/attribute-value pair
  Item set: all items occurring in a rule
  Goal: only rules that exceed pre-defined support
    Do it by finding all item sets with the given minimum support and generating rules from them!
Weather data

  Outlook   Temp  Humidity  Windy  Play
  Sunny     Hot   High      False  No
  Sunny     Hot   High      True   No
  Overcast  Hot   High      False  Yes
  Rainy     Mild  High      False  Yes
  Rainy     Cool  Normal    False  Yes
  Rainy     Cool  Normal    True   No
  Overcast  Cool  Normal    True   Yes
  Sunny     Mild  High      False  No
  Sunny     Cool  Normal    False  Yes
  Rainy     Mild  Normal    False  Yes
  Sunny     Mild  Normal    True   Yes
  Overcast  Mild  High      True   Yes
  Overcast  Hot   Normal    False  Yes
  Rainy     Mild  High      True   No
Item sets for weather data

  One-item sets, e.g.:
    Outlook = Sunny (5)
    Temperature = Cool (4)
  Two-item sets, e.g.:
    Outlook = Sunny, Temperature = Hot (2)
    Outlook = Sunny, Humidity = High (3)
  Three-item sets, e.g.:
    Outlook = Sunny, Temperature = Hot, Humidity = High (2)
    Outlook = Sunny, Humidity = High, Windy = False (2)
  Four-item sets, e.g.:
    Outlook = Sunny, Temperature = Hot, Humidity = High, Play = No (2)
    Outlook = Rainy, Temperature = Mild, Windy = False, Play = Yes (2)

  In total: 12 one-item sets, 47 two-item sets, 39 three-item sets, 6 four-item sets and 0 five-item sets (with minimum support of two)
Generating rules from an item set

  Once all item sets with minimum support have been generated, we can turn them into rules
  Example:

    Humidity = Normal, Windy = False, Play = Yes (4)

  Seven (2^N − 1, with N = 3 items) potential rules:

    If Humidity = Normal and Windy = False then Play = Yes           4/4
    If Humidity = Normal and Play = Yes then Windy = False           4/6
    If Windy = False and Play = Yes then Humidity = Normal           4/6
    If Humidity = Normal then Windy = False and Play = Yes           4/7
    If Windy = False then Humidity = Normal and Play = Yes           4/8
    If Play = Yes then Humidity = Normal and Windy = False           4/9
    If True then Humidity = Normal and Windy = False and Play = Yes  4/12
Rules for weather data

  Rules with support > 1 and confidence = 100%:

    Association rule                                       Sup.  Conf.
    1   Humidity = Normal, Windy = False ⇒ Play = Yes       4     100%
    2   Temperature = Cool ⇒ Humidity = Normal              4     100%
    3   Outlook = Overcast ⇒ Play = Yes                     4     100%
    4   Temperature = Cool, Play = Yes ⇒ Humidity = Normal  3     100%
    ...                                                     ...   ...
    58  ...                                                 2     100%

  In total:
    3 rules with support four
    5 with support three
    50 with support two
Example rules from the same set

  Item set:

    Temperature = Cool, Humidity = Normal, Windy = False, Play = Yes (2)

  Resulting rules (all with 100% confidence):

    Temperature = Cool, Windy = False ⇒ Humidity = Normal, Play = Yes
    Temperature = Cool, Windy = False, Humidity = Normal ⇒ Play = Yes
    Temperature = Cool, Windy = False, Play = Yes ⇒ Humidity = Normal

  due to the following frequent item sets:

    Temperature = Cool, Windy = False                     (2)
    Temperature = Cool, Humidity = Normal, Windy = False  (2)
    Temperature = Cool, Windy = False, Play = Yes         (2)
Generating item sets efficiently

  How can we efficiently find all frequent item sets?
  Finding one-item sets: easy
  Idea: use one-item sets to generate two-item sets, two-item sets to generate three-item sets, ...
    If (A B) is a frequent item set, then (A) and (B) have to be frequent item sets as well!
    In general: if X is a frequent k-item set, then all (k−1)-item subsets of X are also frequent
    Compute k-item sets by merging (k−1)-item sets
Example

  Given: five three-item sets

    (A B C), (A B D), (A C D), (A C E), (B C D)

  Lexicographically ordered!
  Candidate four-item sets:

    (A B C D)  OK because of (A C D), (B C D)
    (A C D E)  Not OK because of (C D E)

  Final check by counting instances in dataset!
  (k−1)-item sets are stored in hash table
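The merge-and-prune step is simple to sketch in Python; item sets are sorted tuples, and two (k−1)-item sets are merged only when they share their first k−2 items (the lexicographic trick above):

    def candidates(frequent, k):
        # frequent: lexicographically sorted list of frequent (k-1)-item sets
        freq = set(frequent)
        out = []
        for i, a in enumerate(frequent):
            for b in frequent[i + 1:]:
                if a[:-1] != b[:-1]:
                    continue  # the first k-2 items must agree
                cand = a + b[-1:]
                # prune unless every (k-1)-subset is itself frequent
                if all(cand[:j] + cand[j + 1:] in freq for j in range(k)):
                    out.append(cand)
        return out

    three = [('A','B','C'), ('A','B','D'), ('A','C','D'), ('A','C','E'), ('B','C','D')]
    print(candidates(three, 4))  # [('A','B','C','D')]; (A C D E) fails on (C D E)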
Generating rules efficiently

  We are looking for all high-confidence rules
    Support of antecedent obtained from hash table
    But: brute-force method is (2^N − 1)
  Better way: building (c + 1)-consequent rules from c-consequent ones
    Observation: a (c + 1)-consequent rule can only hold if all corresponding c-consequent rules also hold
  Resulting algorithm similar to procedure for large item sets
Example

  1-consequent rules:

    If Outlook = Sunny and Windy = False and Play = No
    then Humidity = High (2/2)

    If Humidity = High and Windy = False and Play = No
    then Outlook = Sunny (2/2)

  Corresponding 2-consequent rule:

    If Windy = False and Play = No
    then Outlook = Sunny and Humidity = High (2/2)

  Final check of antecedent against hash table!
Association rules: discussion

  Above method makes one pass through the data for each different size item set
    Other possibility: generate (k+2)-item sets just after (k+1)-item sets have been generated
    Result: more (k+2)-item sets than necessary will be considered but fewer passes through the data
    Makes sense if data too large for main memory
  Practical issue: generating a certain number of rules (e.g. by incrementally reducing min. support)
Other issues

  Standard ARFF format very inefficient for typical market basket data
    Attributes represent items in a basket and most items are usually missing
    Data should be represented in sparse format
  Instances are also called transactions
  Confidence is not necessarily the best measure
    Example: milk occurs in almost every supermarket transaction
    Other measures have been devised (e.g. lift)
Linear models: linear regression

  Work most naturally with numeric attributes
  Standard technique for numeric prediction
    Outcome is linear combination of attributes:

      x = w0 + w1 a1 + w2 a2 + ... + wk ak

  Weights are calculated from the training data
  Predicted value for first training instance a(1):

    w0 a0(1) + w1 a1(1) + w2 a2(1) + ... + wk ak(1) = Σ_{j=0..k} wj aj(1)

  (assuming each instance is extended with a constant attribute with value 1)
Minimizing the squared error

  Choose k + 1 coefficients to minimize the squared error on the training data
  Squared error:

    Σ_{i=1..n} ( x(i) − Σ_{j=0..k} wj aj(i) )²

  Derive coefficients using standard matrix operations
  Can be done if there are more instances than attributes (roughly speaking)
  Minimizing the absolute error is more difficult
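In practice the coefficients come straight out of a least-squares solver; a minimal NumPy sketch with made-up numbers (the first column of A is the constant attribute):

    import numpy as np

    A = np.array([[1., 2., 3.],
                  [1., 4., 1.],
                  [1., 0., 2.],
                  [1., 1., 1.]])    # n = 4 instances, k = 2 attributes plus bias
    x = np.array([5., 7., 1., 3.])  # target values
    w, *_ = np.linalg.lstsq(A, x, rcond=None)  # minimizes the squared error
    print(A @ w)                    # predictions for the training instances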
Classification

  Any regression technique can be used for classification
    Training: perform a regression for each class, setting the output to 1 for training instances that belong to the class, and 0 for those that don't
    Prediction: predict class corresponding to model with largest output value (membership value)
  For linear regression this is known as multi-response linear regression
  Problem: membership values are not in the [0,1] range, so they aren't proper probability estimates
Linear models: logistic regression

  Builds a linear model for a transformed target variable
  Assume we have two classes
  Logistic regression replaces the target

    P[1 | a1, a2, ..., ak]

  by this target

    log( P[1 | a1, a2, ..., ak] / (1 − P[1 | a1, a2, ..., ak]) )

  Logit transformation maps [0,1] to (−∞, +∞)
Logit transformation

  Resulting model:

    Pr[1 | a1, a2, ..., ak] = 1 / (1 + e^(−w0 − w1 a1 − ... − wk ak))
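The resulting model is a one-liner; a sketch with illustrative weights (a0 = 1 is the constant attribute):

    import math

    def logistic(w, a):
        # Pr[1 | a] = 1 / (1 + exp(-(w0*a0 + w1*a1 + ... + wk*ak)))
        z = sum(wi * ai for wi, ai in zip(w, a))
        return 1 / (1 + math.exp(-z))

    print(logistic([0.5, 1.0], [1.0, 2.0]))  # ≈ 0.92 with w0 = 0.5, w1 = 1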
Example logistic regression model

  Model with w0 = 0.5 and w1 = 1:

  (figure omitted)

  Parameters are found from training data using maximum likelihood
Maximum likelihood

  Aim: maximize probability of training data with respect to parameters
  Can use logarithms of probabilities and maximize log-likelihood of model:

    Σ_{i=1..n} [ (1 − x(i)) log(1 − Pr[1 | a1(i), a2(i), ..., ak(i)])
                 + x(i) log Pr[1 | a1(i), a2(i), ..., ak(i)] ]

  where the x(i) are either 0 or 1
  Weights wi need to be chosen to maximize log-likelihood (relatively simple method: iteratively re-weighted least squares)
Multiple classes

  Can perform logistic regression independently for each class (like multi-response linear regression)
  Problem: probability estimates for different classes won't sum to one
  Better: train coupled models by maximizing likelihood over all classes
  Alternative that often works well in practice: pairwise classification
Pairwise classification

  Idea: build model for each pair of classes, using only training data from those classes
  Problem? Have to solve k(k−1)/2 classification problems for a k-class problem
  Turns out not to be a problem in many cases because training sets become small:
    Assume data evenly distributed, i.e. 2n/k per learning problem for n instances in total
    Suppose learning algorithm is linear in n
    Then runtime of pairwise classification is proportional to (k(k−1)/2) × (2n/k) = (k−1)n
Linear models are hyperplanes

  Decision boundary for two-class logistic regression is where probability equals 0.5:

    Pr[1 | a1, a2, ..., ak] = 1 / (1 + exp(−w0 − w1 a1 − ... − wk ak)) = 0.5

  which occurs when

    w0 + w1 a1 + ... + wk ak = 0

  Thus logistic regression can only separate data that can be separated by a hyperplane
  Multi-response linear regression has the same problem. Class 1 is assigned if:

    w0(1) + w1(1) a1 + ... + wk(1) ak > w0(2) + w1(2) a1 + ... + wk(2) ak

  i.e. if

    (w0(1) − w0(2)) + (w1(1) − w1(2)) a1 + ... + (wk(1) − wk(2)) ak > 0
Linear models: the perceptron

  Don't actually need probability estimates if all we want to do is classification
  Different approach: learn separating hyperplane
  Assumption: data is linearly separable
  Algorithm for learning separating hyperplane: perceptron learning rule
  Hyperplane:

    0 = w0 a0 + w1 a1 + w2 a2 + ... + wk ak

  where we again assume that there is a constant attribute with value 1 (bias)
  If sum is greater than zero we predict the first class, otherwise the second class
The algorithm

  Set all weights to zero
  Until all instances in the training data are classified correctly
    For each instance I in the training data
      If I is classified incorrectly by the perceptron
        If I belongs to the first class add it to the weight vector
        else subtract it from the weight vector

  Why does this work?
  Consider situation where instance a pertaining to the first class has been added:

    (w0 + a0) a0 + (w1 + a1) a1 + (w2 + a2) a2 + ... + (wk + ak) ak

  This means output for a has increased by:

    a0 a0 + a1 a1 + a2 a2 + ... + ak ak

  This number is always positive, thus the hyperplane has moved into the correct direction (and we can show output decreases for instances of other class)
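The rule above in NumPy, as a sketch: classes are encoded +1/−1 so that "add or subtract the instance" is one line, and an epoch cap (our addition, not in the slide) guards against non-separable data:

    import numpy as np

    def perceptron(X, y, max_epochs=100):
        # X: instances with a leading bias column of 1s; y: labels in {+1, -1}
        w = np.zeros(X.shape[1])        # set all weights to zero
        for _ in range(max_epochs):
            mistakes = 0
            for a, cls in zip(X, y):
                if cls * (w @ a) <= 0:  # instance classified incorrectly
                    w += cls * a        # add it to (or subtract it from) w
                    mistakes += 1
            if mistakes == 0:           # all instances classified correctly
                break
        return w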
Perceptron as a neural network

  (figure omitted: network with an input layer and an output layer)
Linear models: Winnow

  Another mistake-driven algorithm for finding a separating hyperplane
    Assumes binary data (i.e. attribute values are either zero or one)
  Difference: multiplicative updates instead of additive updates
    Weights are multiplied by a user-specified parameter α (or its inverse)
  Another difference: user-specified threshold parameter θ
    Predict first class if

      w0 a0 + w1 a1 + w2 a2 + ... + wk ak > θ
The algorithm
while some instances are misclassified
for each instance a in the training data
classify a using the current weights
if the predicted class is incorrect
if a belongs to the first class
for each ai that is 1, multiply wi by alpha
(if ai is 0, leave wi unchanged)
otherwise
for each ai that is 1, divide wi by alpha
(if ai is 0, leave wi unchanged)
  Winnow is very effective in homing in on relevant features (it is attribute efficient)
  Can also be used in an on-line setting in which new instances arrive continuously (like the perceptron algorithm)
Balanced Winnow

  Winnow doesn't allow negative weights and this can be a drawback in some applications
  Balanced Winnow maintains two weight vectors, one for each class:

  while some instances are misclassified
    for each instance a in the training data
      classify a using the current weights
      if the predicted class is incorrect
        if a belongs to the first class
          for each ai that is 1, multiply wi+ by alpha and divide wi- by alpha
          (if ai is 0, leave wi+ and wi- unchanged)
        otherwise
          for each ai that is 1, multiply wi- by alpha and divide wi+ by alpha
          (if ai is 0, leave wi+ and wi- unchanged)

  Instance is classified as belonging to the first class (of two classes) if:

    (w0+ − w0−) a0 + (w1+ − w1−) a1 + ... + (wk+ − wk−) ak > θ
Instance-based learning

  Distance function defines what's learned
  Most instance-based schemes use Euclidean distance:

    √( (a1(1) − a1(2))² + (a2(1) − a2(2))² + ... + (ak(1) − ak(2))² )

  a(1) and a(2): two instances with k attributes
  Taking the square root is not required when comparing distances
  Other popular metric: city-block metric
    Adds differences without squaring them
Normalization and other issues

  Different attributes are measured on different scales → need to be normalized:

    ai = (vi − min vi) / (max vi − min vi)

  vi: the actual value of attribute i
  Nominal attributes: distance either 0 or 1
  Common policy for missing values: assumed to be maximally distant (given normalized attributes)
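Both steps in a short NumPy sketch (it assumes every attribute actually varies, so max > min):

    import numpy as np

    def normalize(X):
        # rescale each attribute to [0, 1]: (v - min) / (max - min)
        lo, hi = X.min(axis=0), X.max(axis=0)
        return (X - lo) / (hi - lo)

    def euclidean(a, b):
        # the square root could be skipped when only comparing distances
        return np.sqrt(((a - b) ** 2).sum())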
Finding nearest neighbors efficiently

  Simplest way of finding nearest neighbour: linear scan of the data
    Classification takes time proportional to the product of the number of instances in training and test sets
  Nearest-neighbor search can be done more efficiently using appropriate data structures
  We will discuss two methods that represent training data in a tree structure: kD-trees and ball trees
kD-tree example

  (figure omitted)
Using kD-trees: example

  (figure omitted)
More on kD-trees

  Complexity depends on depth of tree, given by logarithm of number of nodes
  Amount of backtracking required depends on quality of tree ("square" vs. "skinny" nodes)
  How to build a good tree? Need to find good split point and split direction
    Split direction: direction with greatest variance
    Split point: median value along that direction
  Using value closest to mean (rather than median) can be better if data is skewed
  Can apply this recursively
Building trees incrementally

  Big advantage of instance-based learning: classifier can be updated incrementally
    Just add new training instance!
  Can we do the same with kD-trees?
  Heuristic strategy:
    Find leaf node containing new instance
    Place instance into leaf if leaf is empty
    Otherwise, split leaf according to the longest dimension (to preserve squareness)
  Tree should be re-built occasionally (i.e. if depth grows to twice the optimum depth)
Ball trees

  Problem in kD-trees: corners
  Observation: no need to make sure that regions don't overlap
  Can use balls (hyperspheres) instead of hyperrectangles
    A ball tree organizes the data into a tree of k-dimensional hyperspheres
    Normally allows for a better fit to the data and thus more efficient search
Ball tree example

  (figure omitted)
Using ball trees

  Nearest-neighbor search is done using the same backtracking strategy as in kD-trees
  Ball can be ruled out from consideration if: distance from target to ball's center exceeds ball's radius plus current upper bound
Building ball trees

  Ball trees are built top down (like kD-trees)
  Don't have to continue until leaf balls contain just two points: can enforce minimum occupancy (same in kD-trees)
  Basic problem: splitting a ball into two
  Simple (linear-time) split selection strategy:
    Choose point farthest from ball's center
    Choose second point farthest from first one
    Assign each point to these two points
    Compute cluster centers and radii based on the two subsets to get two balls
Discussion of nearest-neighbor learning

  Often very accurate
  Assumes all attributes are equally important
    Remedy: attribute selection or weights
  Possible remedies against noisy instances:
    Take a majority vote over the k nearest neighbors
    Removing noisy instances from dataset (difficult!)
  Statisticians have used k-NN since early 1950s
    If n → ∞ and k/n → 0, error approaches minimum
  kD-trees become inefficient when number of attributes is too large (approximately > 10)
  Ball trees (which are instances of metric trees) work well in higher-dimensional spaces
More discussion

  Instead of storing all training instances, compress them into regions
  Example: hyperpipes (from discussion of 1R)
  Another simple technique (Voting Feature Intervals):
    Construct intervals for each attribute
      Discretize numeric attributes
      Treat each value of a nominal attribute as an "interval"
    Count number of times class occurs in interval
    Prediction is generated by letting intervals vote (those that contain the test instance)
Clustering

  Clustering techniques apply when there is no class to be predicted
  Aim: divide instances into "natural" groups
  As we've seen, clusters can be:
    disjoint vs. overlapping
    deterministic vs. probabilistic
    flat vs. hierarchical
  We'll look at a classic clustering algorithm called k-means
    k-means clusters are disjoint, deterministic, and flat
The k-means algorithm

  To cluster data into k groups (k is predefined):
    0. Choose k cluster centers, e.g. at random
    1. Assign instances to clusters based on distance to cluster centers
    2. Compute centroids of clusters
    3. Go to step 1, until convergence
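A minimal NumPy sketch of these steps (it does not handle the corner case of a cluster losing all its instances):

    import numpy as np

    def kmeans(X, k, rng=None):
        rng = rng or np.random.default_rng(0)
        centers = X[rng.choice(len(X), size=k, replace=False)]  # step 0: random centers
        while True:
            # step 1: assign each instance to the closest cluster center
            d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            # step 2: recompute centroids; step 3: repeat until they stop moving
            new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new, centers):
                return centers, labels
            centers = new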
Discussion

  Algorithm minimizes squared distance to cluster centers
  Result can vary significantly based on initial choice of seeds
    Can get trapped in local minimum
    Example: (figure omitted: instances and initial cluster centres)
  To increase chance of finding global optimum: restart with different random seeds
  Can be applied recursively with k = 2
Faster distance calculations

  Can we use kD-trees or ball trees to speed up the process? Yes:
    First, build tree, which remains static, for all the data points
    At each node, store number of instances and sum of all instances
    In each iteration, descend tree and find out which cluster each node belongs to
      Can stop descending as soon as we find out that a node belongs entirely to a particular cluster
      Use statistics stored at the nodes to compute new cluster centers
Example

  (figure omitted)
Multi-instance learning

  Simplicity-first methodology can be applied to multi-instance learning with surprisingly good results
  Two simple approaches, both using standard single-instance learners:
    Manipulate the input to learning
    Manipulate the output of learning
Aggregating the input

  Convert multi-instance problem into single-instance one
    Summarize the instances in a bag by computing mean, mode, minimum and maximum as new attributes
    "Summary" instance retains the class label of its bag
    To classify a new bag the same process is used
  Results using summary instances with minimum and maximum + support vector machine classifier are comparable to special purpose multi-instance learners on original drug discovery problem
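The input-aggregation step is tiny; a sketch where a bag is a 2-D array with one instance per row, summarized by per-attribute minima and maxima as suggested above:

    import numpy as np

    def summarize_bag(bag):
        # one summary instance per bag: [min of each attribute, max of each attribute]
        return np.concatenate([bag.min(axis=0), bag.max(axis=0)])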
Aggregating the output

  Learn a single-instance classifier directly from the original instances in each bag
    Each instance is given the class of the bag it originates from
  To classify a new bag:
    Produce a prediction for each instance in the bag
    Aggregate the predictions to produce a prediction for the bag as a whole
    One approach: treat predictions as votes for the various class labels
    A problem: bags can contain differing numbers of instances → give each instance a weight inversely proportional to the bag's size
Comments on basic methods

  Bayes' rule stems from his "Essay towards solving a problem in the doctrine of chances" (1763)
    Difficult bit in general: estimating prior probabilities (easy in the case of naive Bayes)
  Extension of naive Bayes: Bayesian networks (which we'll discuss later)
  Algorithm for association rules is called APRIORI
  Minsky and Papert (1969) showed that linear classifiers have limitations, e.g. can't learn XOR
    But: combinations of them can (→ multi-layer neural nets, which we'll discuss later)