Slides for Chapter 5 of Data Mining: Practical Machine Learning Tools and Techniques by I. H. Witten, E. Frank and M. A. Hall

Credibility: Evaluating what's been learned

Issues: training, testing, tuning
Predicting performance: confidence limits
Holdout, cross-validation, bootstrap
Comparing schemes: the t-test
Predicting probabilities: loss functions
Cost-sensitive measures
Evaluating numeric prediction
The Minimum Description Length principle

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 5)
Evaluation: the key to success

How predictive is the model we learned?
Error on the training data is not a good indicator of performance on future data
  Otherwise 1-NN would be the optimum classifier!
Simple solution that can be used if lots of (labeled) data is available:
  Split data into training and test set
However: (labeled) data is usually limited
  More sophisticated techniques need to be used
Issues in evaluation

Statistical reliability of estimated differences in performance (significance tests)
Choice of performance measure:
  Number of correct classifications
  Accuracy of probability estimates
  Error in numeric predictions
Costs assigned to different types of errors
  Many practical applications involve costs
Training and testing I

Natural performance measure for classification problems: error rate
  Success: instance's class is predicted correctly
  Error: instance's class is predicted incorrectly
  Error rate: proportion of errors made over the whole set of instances
Resubstitution error: error rate obtained from training data
  Resubstitution error is (hopelessly) optimistic!
Training and testing II

Test set: independent instances that have played no part in formation of classifier
  Assumption: both training data and test data are representative samples of the underlying problem
Test and training data may differ in nature
  Example: classifiers built using customer data from two different towns A and B
  To estimate performance of classifier from town A in completely new town, test it on data from B
Note on parameter tuning

It is important that the test data is not used in any way to create the classifier
Some learning schemes operate in two stages:
  Stage 1: build the basic structure
  Stage 2: optimize parameter settings
The test data can't be used for parameter tuning!
Proper procedure uses three sets: training data, validation data, and test data
  Validation data is used to optimize parameters
Making the most of the data

Once evaluation is complete, all the data can be used to build the final classifier
Generally, the larger the training data the better the classifier (but returns diminish)
The larger the test data the more accurate the error estimate
Holdout procedure: method of splitting original data into training and test set
  Dilemma: ideally both training set and test set should be large!
Predicting performance

Assume the estimated error rate is 25%. How close is this to the true error rate?
  Depends on the amount of test data
Prediction is just like tossing a (biased!) coin
  "Head" is a "success", "tail" is an error
In statistics, a succession of independent events like this is called a Bernoulli process
  Statistical theory provides us with confidence intervals for the true underlying proportion
Confidence intervals

We can say: p lies within a certain specified interval with a certain specified confidence
Example: S = 750 successes in N = 1000 trials
  Estimated success rate: 75%
  How close is this to true success rate p?
  Answer: with 80% confidence p is in [73.2%, 76.7%]
Another example: S = 75 and N = 100
  Estimated success rate: 75%
  With 80% confidence p is in [69.1%, 80.1%]
Mean and variance

Mean and variance for a Bernoulli trial: p, p(1−p)
Expected success rate f = S/N
Mean and variance for f: p, p(1−p)/N
For large enough N, f follows a normal distribution
c% confidence interval [−z ≤ X ≤ z] for random variable with 0 mean is given by:
  Pr[−z ≤ X ≤ z] = c
With a symmetric distribution:
  Pr[−z ≤ X ≤ z] = 1 − 2·Pr[X ≥ z]
Confidence limits

Confidence limits for the normal distribution with 0 mean and a variance of 1:

  Pr[X ≥ z]   z
  0.1%        3.09
  0.5%        2.58
  1%          2.33
  5%          1.65
  10%         1.28
  20%         0.84
  40%         0.25

Thus: Pr[−1.65 ≤ X ≤ 1.65] = 90%
To use this we have to reduce our random variable f to have 0 mean and unit variance
Transforming f

Transformed value for f: (f − p) / √(p(1−p)/N)
(i.e. subtract the mean and divide by the standard deviation)
Resulting equation:
  Pr[−z ≤ (f − p) / √(p(1−p)/N) ≤ z] = c
Solving for p:
  p = ( f + z²/2N ± z·√(f/N − f²/N + z²/4N²) ) / ( 1 + z²/N )
Examples

f = 75%, N = 1000, c = 80% (so that z = 1.28): p ∈ [0.732, 0.767]
f = 75%, N = 100, c = 80% (so that z = 1.28): p ∈ [0.691, 0.801]
Note that the normal distribution assumption is only valid for large N (i.e. N > 100)
f = 75%, N = 10, c = 80% (so that z = 1.28): p ∈ [0.549, 0.881]
(should be taken with a grain of salt)
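The expression for p above (the Wilson score interval) is easy to evaluate directly. A minimal Python sketch, reproducing the first example on this slide:

```python
import math

def success_rate_interval(f, n, z):
    """Confidence interval for the true success rate p, given observed
    success rate f over n trials and the z-value for the chosen confidence."""
    center = f + z * z / (2 * n)
    spread = z * math.sqrt(f / n - f * f / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (center - spread) / denom, (center + spread) / denom

lo, hi = success_rate_interval(0.75, 1000, 1.28)  # c = 80% -> z = 1.28
print(round(lo, 3), round(hi, 3))  # 0.732 0.767
```

Plugging in N = 100 instead reproduces the wider interval [0.691, 0.801].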
Holdout estimation

What to do if the amount of data is limited?
The holdout method reserves a certain amount for testing and uses the remainder for training
  Usually: one third for testing, the rest for training
Problem: the samples might not be representative
  Example: class might be missing in the test data
Advanced version uses stratification
  Ensures that each class is represented with approximately equal proportions in both subsets
Repeated holdout method

Holdout estimate can be made more reliable by repeating the process with different subsamples
  In each iteration, a certain proportion is randomly selected for training (possibly with stratification)
  The error rates on the different iterations are averaged to yield an overall error rate
This is called the repeated holdout method
Still not optimum: the different test sets overlap
  Can we prevent overlapping?
Cross-validation

Cross-validation avoids overlapping test sets
  First step: split data into k subsets of equal size
  Second step: use each subset in turn for testing, the remainder for training
Called k-fold cross-validation
Often the subsets are stratified before the cross-validation is performed
The error estimates are averaged to yield an overall error estimate
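The two steps can be sketched in Python. This is an illustrative implementation (not WEKA's); the round-robin dealing per class is one simple way to achieve stratification:

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Split instance indices into k folds, keeping class proportions
    roughly equal in every fold (stratification)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        for j, i in enumerate(members):
            folds[j % k].append(i)  # deal each class round-robin across folds
    return folds

labels = ["yes"] * 6 + ["no"] * 4
folds = stratified_folds(labels, k=2)
# each fold serves once as the test set; the remaining folds form the training set
for t in range(2):
    test = folds[t]
    train = [i for j in range(2) if j != t for i in folds[j]]
```

With 6 "yes" and 4 "no" instances and k = 2, each fold receives 3 "yes" and 2 "no" indices.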
More on cross-validation

Standard method for evaluation: stratified ten-fold cross-validation
Why ten?
  Extensive experiments have shown that this is the best choice to get an accurate estimate
  There is also some theoretical evidence for this
Stratification reduces the estimate's variance
Even better: repeated stratified cross-validation
  E.g. ten-fold cross-validation is repeated ten times and results are averaged (reduces the variance)
Leave-One-Out cross-validation

Leave-One-Out: a particular form of cross-validation:
  Set number of folds to number of training instances
  I.e., for n training instances, build classifier n times
Makes best use of the data
Involves no random subsampling
Very computationally expensive (exception: NN)
Leave-One-Out CV and stratification

Disadvantage of Leave-One-Out CV: stratification is not possible
  It guarantees a non-stratified sample because there is only one instance in the test set!
Extreme example: random dataset split equally into two classes
  Best inducer predicts majority class
  50% accuracy on fresh data
  Leave-One-Out CV estimate is 100% error!
The bootstrap

CV uses sampling without replacement
  The same instance, once selected, cannot be selected again for a particular training/test set
The bootstrap uses sampling with replacement to form the training set
  Sample a dataset of n instances n times with replacement to form a new dataset of n instances
  Use this data as the training set
  Use the instances from the original dataset that don't occur in the new training set for testing
The 0.632 bootstrap

Also called the 0.632 bootstrap
  A particular instance has a probability of 1 − 1/n of not being picked
  Thus its probability of ending up in the test data is:
    (1 − 1/n)^n ≈ e⁻¹ ≈ 0.368
  This means the training data will contain approximately 63.2% of the instances
Estimating error with the bootstrap

The error estimate on the test data will be very pessimistic
  Trained on just ~63% of the instances
Therefore, combine it with the resubstitution error:
  err = 0.632 · e_test instances + 0.368 · e_training instances
The resubstitution error gets less weight than the error on the test data
Repeat process several times with different replacement samples; average the results
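One bootstrap iteration can be sketched as follows (the function names are illustrative, not from WEKA): sample n indices with replacement for training, test on the instances that were never picked, and combine the two error rates as in the formula above:

```python
import random

def bootstrap_split(n, rng):
    """Sample n instances with replacement for training;
    instances never picked form the test set."""
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    test = [i for i in range(n) if i not in chosen]
    return train, test

def err_0632(e_test, e_train):
    """Combine the test-set error with the resubstitution error."""
    return 0.632 * e_test + 0.368 * e_train

rng = random.Random(1)
train, test = bootstrap_split(1000, rng)
print(len(set(train)) / 1000)  # close to 1 - 1/e = 0.632
print(err_0632(0.25, 0.05))    # hypothetical error rates, combined
```

In practice this split-train-test cycle is repeated several times and the combined estimates are averaged.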
More on the bootstrap

Probably the best way of estimating performance for very small datasets
However, it has some problems
  Consider the random dataset from above
  A perfect memorizer will achieve 0% resubstitution error and ~50% error on test data
  Bootstrap estimate for this classifier:
    err = 0.632 · 50% + 0.368 · 0% = 31.6%
  True expected error: 50%
Comparing data mining schemes

Frequent question: which of two learning schemes performs better?
Note: this is domain dependent!
Obvious way: compare 10-fold CV estimates
  Generally sufficient in applications (we don't lose if the chosen method is not truly better)
However, what about machine learning research?
  Need to show convincingly that a particular method works better
Comparing schemes II

Want to show that scheme A is better than scheme B in a particular domain
  For a given amount of training data
  On average, across all possible training sets
Let's assume we have an infinite amount of data from the domain:
  Sample infinitely many datasets of specified size
  Obtain cross-validation estimate on each dataset for each scheme
  Check if mean accuracy for scheme A is better than mean accuracy for scheme B
Paired t-test

In practice we have limited data and a limited number of estimates for computing the mean
Student's t-test tells whether the means of two samples are significantly different
In our case the samples are cross-validation estimates for different datasets from the domain
Use a paired t-test because the individual samples are paired
  The same CV is applied twice

William Gosset
Born: 1876 in Canterbury; Died: 1937 in Beaconsfield, England
Obtained a post as a chemist in the Guinness brewery in Dublin in 1899. Invented the t-test to handle small samples for quality control in brewing. Wrote under the name "Student".
Distribution of the means

x1, x2, …, xk and y1, y2, …, yk are the 2k samples for the k different datasets
mx and my are the means
With enough samples, the mean of a set of independent samples is normally distributed
Estimated variances of the means are σx²/k and σy²/k
If μx and μy are the true means then
  (mx − μx) / √(σx²/k)   and   (my − μy) / √(σy²/k)
are approximately normally distributed with mean 0, variance 1
Student's distribution

With small samples (k < 100) the mean follows Student's distribution with k−1 degrees of freedom
Confidence limits (assuming we have 10 estimates, i.e. 9 degrees of freedom), compared with the normal distribution:

  9 degrees of freedom      normal distribution
  Pr[X ≥ z]   z             Pr[X ≥ z]   z
  0.1%        4.30          0.1%        3.09
  0.5%        3.25          0.5%        2.58
  1%          2.82          1%          2.33
  5%          1.83          5%          1.65
  10%         1.38          10%         1.28
  20%         0.88          20%         0.84
Distribution of the differences

Let md = mx − my
The difference of the means (md) also has a Student's distribution with k−1 degrees of freedom
Let σd² be the variance of the difference
The standardized version of md is called the t-statistic:
  t = md / √(σd²/k)
We use t to perform the t-test
Performing the test

Fix a significance level α
  If a difference is significant at the α% level, there is a (100−α)% chance that the true means differ
Divide the significance level by two because the test is two-tailed
  I.e. the true difference can be +ve or −ve
Look up the value for z that corresponds to α/2
If t ≥ z or t ≤ −z then the difference is significant
  I.e. the null hypothesis (that the difference is zero) can be rejected
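The whole procedure fits in a few lines of Python. The two lists of accuracy estimates below are invented for illustration; the critical value 1.83 is the 9-degrees-of-freedom entry from the slide's Student table (two-tailed 10% level):

```python
import math

def paired_t_statistic(xs, ys):
    """t = m_d / sqrt(var_d / k) for paired samples."""
    k = len(xs)
    d = [x - y for x, y in zip(xs, ys)]
    m_d = sum(d) / k
    var_d = sum((v - m_d) ** 2 for v in d) / (k - 1)
    return m_d / math.sqrt(var_d / k)

# hypothetical cross-validation accuracy estimates for schemes A and B
a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.82, 0.78, 0.85, 0.80, 0.82]
b = [0.78, 0.76, 0.83, 0.79, 0.80, 0.80, 0.75, 0.82, 0.79, 0.80]
t = paired_t_statistic(a, b)
# here t is about 7.6, well beyond z = 1.83, so the difference is significant
print(t)
```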
Unpaired observations

If the CV estimates are from different datasets, they are no longer paired
(or maybe we have k estimates for one scheme, and j estimates for the other one)
Then we have to use an unpaired t-test with min(k, j) − 1 degrees of freedom
The estimate of the variance of the difference of the means becomes:
  σx²/k + σy²/j
Dependent estimates

We assumed that we have enough data to create several datasets of the desired size
Need to reuse data if that's not the case
  E.g. running cross-validations with different randomizations on the same data
Samples become dependent → insignificant differences can become significant
A heuristic test is the corrected resampled t-test:
  Assume we use the repeated holdout method, with n1 instances for training and n2 for testing
  New test statistic is:
    t = md / √( (1/k + n2/n1) · σd² )
Predicting probabilities

Performance measure so far: success rate
Also called 0–1 loss function:
  Σi { 0 if prediction is correct, 1 if prediction is incorrect }
Most classifiers produce class probabilities
Depending on the application, we might want to check the accuracy of the probability estimates
  0–1 loss is not the right thing to use in those cases
Quadratic loss function

p1, …, pk are probability estimates for an instance
c is the index of the instance's actual class
a1, …, ak = 0, except for ac, which is 1
Quadratic loss is: Σj (pj − aj)²
Want to minimize E[ Σj (pj − aj)² ]
Can show that this is minimized when pj = pj*, the true probabilities
Informational loss function

The informational loss function is −log2(pc), where c is the index of the instance's actual class
  Number of bits required to communicate the actual class
Let p1*, …, pk* be the true class probabilities
Then the expected value for the loss function is:
  −p1* log2 p1 − … − pk* log2 pk
Justification: minimized when pj = pj*
Difficulty: zero-frequency problem
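Both loss functions are one-liners; a sketch, with a made-up probability vector for a three-class instance:

```python
import math

def quadratic_loss(p, c):
    """Sum over classes j of (p_j - a_j)^2, where a_c = 1 and all other a_j = 0."""
    return sum((pj - (1 if j == c else 0)) ** 2 for j, pj in enumerate(p))

def informational_loss(p, c):
    """-log2 of the probability assigned to the actual class c."""
    return -math.log2(p[c])

p = [0.7, 0.2, 0.1]  # probability estimates for a three-class instance
print(round(quadratic_loss(p, 0), 3))      # 0.14
print(round(informational_loss(p, 0), 3))  # 0.515
```

Note that informational_loss blows up when the estimate for the actual class is 0, which is exactly the zero-frequency problem.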
Discussion

Which loss function to choose?
  Both encourage honesty
Quadratic loss function takes into account all class probability estimates for an instance
Informational loss focuses only on the probability estimate for the actual class
Quadratic loss is bounded by 1 + Σj pj²: it can never exceed 2
Informational loss can be infinite
Informational loss is related to MDL principle [later]
Counting the cost

In practice, different types of classification errors often incur different costs
Examples:
  Terrorist profiling
    "Not a terrorist" correct 99.99% of the time
  Loan decisions
  Oil slick detection
  Fault diagnosis
  Promotional mailing
Counting the cost

The confusion matrix:

                      Predicted class
                      Yes              No
  Actual class  Yes   True positive    False negative
                No    False positive   True negative

There are many other types of cost!
  E.g.: cost of collecting training data
Aside: the kappa statistic

Two confusion matrices for a 3-class problem: actual predictor (left) vs. random predictor (right)
Number of successes: sum of entries in diagonal (D)
Kappa statistic:
  κ = (D_observed − D_random) / (D_perfect − D_random)
Measures relative improvement over random predictor
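The computation can be sketched as follows. The 3-class confusion matrix below is hypothetical (standing in for the slide's figure), and the "random predictor" is taken to have the same row and column totals as the actual one:

```python
def kappa(confusion):
    """Kappa statistic from a confusion matrix (rows = actual, columns = predicted)."""
    total = sum(sum(row) for row in confusion)
    d_observed = sum(confusion[i][i] for i in range(len(confusion)))
    # expected diagonal of a random predictor with the same row/column totals
    d_random = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion) / total
        for i in range(len(confusion))
    )
    return (d_observed - d_random) / (total - d_random)

m = [[88, 10, 2],
     [14, 40, 6],
     [18, 10, 12]]  # hypothetical 3-class confusion matrix
print(round(kappa(m), 3))  # 0.492
```

A perfect predictor gets κ = 1; a predictor no better than chance gets κ = 0.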
Classification with costs

Two cost matrices:
Success rate is replaced by average cost per prediction
  Cost is given by appropriate entry in the cost matrix
Cost-sensitive classification

Can take costs into account when making predictions
  Basic idea: only predict high-cost class when very confident about prediction
Given: predicted class probabilities
  Normally we just predict the most likely class
  Here, we should make the prediction that minimizes the expected cost
    Expected cost: dot product of vector of class probabilities and appropriate column in cost matrix
    Choose column (class) that minimizes expected cost
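The expected-cost rule can be sketched directly; the two-class cost matrix below is invented for illustration (false negatives ten times as costly as false positives):

```python
def min_expected_cost_class(probs, cost_matrix):
    """Pick the class whose prediction minimizes expected cost.
    cost_matrix[i][j] = cost of predicting class j when the actual class is i."""
    n = len(probs)
    # expected cost of predicting class j: dot product of probs with column j
    expected = [sum(probs[i] * cost_matrix[i][j] for i in range(n)) for j in range(n)]
    return min(range(n), key=lambda j: expected[j])

cost = [[0, 1],
        [10, 0]]  # hypothetical: missing class 1 costs 10, a false alarm costs 1
print(min_expected_cost_class([0.8, 0.2], cost))  # 1
```

With these costs, class 1 is predicted even though its probability is only 0.2 — exactly the "only predict the high-cost class when very confident" intuition, inverted for the cheap class.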
Cost-sensitive learning

So far we haven't taken costs into account at training time
Most learning schemes do not perform cost-sensitive learning
  They generate the same classifier no matter what costs are assigned to the different classes
  Example: standard decision tree learner
Simple methods for cost-sensitive learning:
  Resampling of instances according to costs
  Weighting of instances according to costs
Some schemes can take costs into account by varying a parameter, e.g. naïve Bayes
Lift charts

In practice, costs are rarely known
Decisions are usually made by comparing possible scenarios
Example: promotional mailout to 1,000,000 households
  Mail to all; 0.1% respond (1000)
  Data mining tool identifies subset of 100,000 most promising, 0.4% of these respond (400)
    40% of responses for 10% of cost may pay off
  Identify subset of 400,000 most promising, 0.2% respond (800)
A lift chart allows a visual comparison
Generating a lift chart

Sort instances according to predicted probability of being positive:

  Predicted probability   Actual class
  0.95                    Yes
  0.93                    Yes
  0.93                    No
  0.88                    Yes

x axis is sample size
y axis is number of true positives
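A sketch that turns the sorted table above into lift-chart points:

```python
def lift_points(predictions):
    """predictions: (predicted probability of 'yes', actual class) pairs.
    Returns (sample size, cumulative true positives) after sorting
    by predicted probability, highest first."""
    ranked = sorted(predictions, key=lambda pair: pair[0], reverse=True)
    points, tp = [], 0
    for k, (_, actual) in enumerate(ranked, start=1):
        tp += actual == "yes"
        points.append((k, tp))
    return points

data = [(0.95, "yes"), (0.93, "yes"), (0.93, "no"), (0.88, "yes")]
print(lift_points(data))  # [(1, 1), (2, 2), (3, 2), (4, 3)]
```

Plotting sample size on the x axis against cumulative true positives on the y axis gives the lift chart.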
A hypothetical lift chart

40% of responses for 10% of cost
80% of responses for 40% of cost
ROC curves

ROC curves are similar to lift charts
  Stands for "receiver operating characteristic"
  Used in signal detection to show tradeoff between hit rate and false alarm rate over noisy channel
Differences to lift chart:
  y axis shows percentage of true positives in sample, rather than absolute number
  x axis shows percentage of false positives in sample, rather than sample size
A sample ROC curve

Jagged curve: one set of test data
Smooth curve: use cross-validation
Cross-validation and ROC curves

Simple method of getting a ROC curve using cross-validation:
  Collect probabilities for instances in test folds
  Sort instances according to probabilities
This method is implemented in WEKA
However, this is just one possibility
  Another possibility is to generate an ROC curve for each fold and average them
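A sketch of the pooled approach: collect (probability, actual) pairs from the test folds, sort by probability, and sweep a threshold down the ranking (the data list is invented):

```python
def roc_points(predictions):
    """predictions: (predicted probability of positive, actual class) pairs,
    e.g. pooled from the test folds of a cross-validation.
    Returns (FP rate, TP rate) points obtained by sweeping the threshold."""
    pos = sum(1 for _, actual in predictions if actual)
    neg = len(predictions) - pos
    ranked = sorted(predictions, key=lambda pair: pair[0], reverse=True)
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for _, actual in ranked:
        if actual:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

data = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.4, False)]
print(roc_points(data))
```

The curve starts at (0, 0), ends at (1, 1), and each true positive moves it up while each false positive moves it right.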
ROC curves for two schemes

For a small, focused sample, use method A
For a larger one, use method B
In between, choose between A and B with appropriate probabilities
The convex hull

Given two learning schemes we can achieve any point on the convex hull!
TP and FP rates for scheme 1: t1 and f1
TP and FP rates for scheme 2: t2 and f2
If scheme 1 is used to predict 100·q % of the cases and scheme 2 for the rest, then
  TP rate for combined scheme: q·t1 + (1−q)·t2
  FP rate for combined scheme: q·f1 + (1−q)·f2
More measures...

Percentage of retrieved documents that are relevant: precision = TP / (TP + FP)
Percentage of relevant documents that are returned: recall = TP / (TP + FN)
Precision/recall curves have hyperbolic shape
Summary measures: average precision at 20%, 50% and 80% recall (three-point average recall)
F-measure = (2 × recall × precision) / (recall + precision)
sensitivity × specificity = (TP / (TP + FN)) × (TN / (FP + TN))
Area under the ROC curve (AUC):
  probability that randomly chosen positive instance is ranked above randomly chosen negative one
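These count-based measures can be sketched directly (the confusion-matrix counts below are invented):

```python
def retrieval_measures(tp, fp, fn, tn):
    """Precision, recall, and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * recall * precision / (recall + precision)
    return precision, recall, f_measure

p, r, f = retrieval_measures(tp=40, fp=10, fn=20, tn=30)  # invented counts
print(p, round(r, 3), round(f, 3))  # 0.8 0.667 0.727
```

The F-measure is the harmonic mean of precision and recall, so it sits between the two and is dragged toward the smaller one.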
Summary of some measures

                          Domain                  Plot                   Explanation
  Lift chart              Marketing               TP vs. subset size     TP
                                                                         subset size = (TP+FP) / (TP+FP+TN+FN)
  ROC curve               Communications          TP rate vs. FP rate    TP rate = TP / (TP+FN)
                                                                         FP rate = FP / (FP+TN)
  Recall-precision curve  Information retrieval   Recall vs. precision   Recall = TP / (TP+FN)
                                                                         Precision = TP / (TP+FP)
Cost curves

Cost curves plot expected costs directly
Example for case with uniform costs (i.e. error):
Cost curves: example with costs

Probability cost function:
  pc[+] = p[+]·C[+|−] / ( p[+]·C[+|−] + p[−]·C[−|+] )
Evaluating numeric prediction

Same strategies: independent test set, cross-validation, significance tests, etc.
Difference: error measures
Actual target values: a1, a2, …, an
Predicted target values: p1, p2, …, pn
Most popular measure: mean squared error
  ( (p1 − a1)² + … + (pn − an)² ) / n
Easy to manipulate mathematically
Other measures

The root mean squared error:
  √( ( (p1 − a1)² + … + (pn − an)² ) / n )
The mean absolute error is less sensitive to outliers than the mean squared error:
  ( |p1 − a1| + … + |pn − an| ) / n
Sometimes relative error values are more appropriate (e.g. 10% for an error of 50 when predicting 500)
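The three absolute measures can be sketched together (the value lists are invented):

```python
import math

def error_measures(p, a):
    """Mean squared, root mean squared, and mean absolute error."""
    n = len(p)
    mse = sum((pi - ai) ** 2 for pi, ai in zip(p, a)) / n
    mae = sum(abs(pi - ai) for pi, ai in zip(p, a)) / n
    return mse, math.sqrt(mse), mae

preds = [510.0, 480.0, 530.0]   # hypothetical predictions
actual = [500.0, 500.0, 500.0]
mse, rmse, mae = error_measures(preds, actual)
print(round(mse, 1), round(rmse, 1), round(mae, 1))
```

The single large error (30) pulls the RMSE above the MAE, illustrating the outlier sensitivity noted above.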
Improvement on the mean

How much does the scheme improve on simply predicting the average?
The relative squared error is:
  ( (p1 − a1)² + … + (pn − an)² ) / ( (ā − a1)² + … + (ā − an)² )
The relative absolute error is:
  ( |p1 − a1| + … + |pn − an| ) / ( |ā − a1| + … + |ā − an| )
Correlation coefficient

Measures the statistical correlation between the predicted values and the actual values:
  S_PA / √(S_P · S_A)
where
  S_PA = Σi (pi − p̄)(ai − ā) / (n − 1)
  S_P = Σi (pi − p̄)² / (n − 1)
  S_A = Σi (ai − ā)² / (n − 1)
Scale independent, between −1 and +1
Good performance leads to large values!
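A sketch of the coefficient from the sample (co)variances S_PA, S_P, S_A, checked on a perfectly linear (and therefore perfectly correlated) pair of invented value lists:

```python
import math

def correlation(p, a):
    """Sample correlation between predicted and actual values."""
    n = len(p)
    pm, am = sum(p) / n, sum(a) / n
    spa = sum((pi - pm) * (ai - am) for pi, ai in zip(p, a)) / (n - 1)
    sp = sum((pi - pm) ** 2 for pi in p) / (n - 1)
    sa = sum((ai - am) ** 2 for ai in a) / (n - 1)
    return spa / math.sqrt(sp * sa)

print(round(correlation([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]), 3))  # 1.0
```

Because numerator and denominator scale together, multiplying all predictions by a constant leaves the coefficient unchanged — the scale independence noted on the slide.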
Which measure?

Best to look at all of them
Often it doesn't matter
Example:

                                 A       B       C       D
  Root mean squared error        67.8    91.7    63.3    57.4
  Mean absolute error            41.3    38.5    33.4    29.2
  Root relative squared error    42.2%   57.2%   39.4%   35.8%
  Relative absolute error        43.1%   40.1%   34.8%   30.4%
  Correlation coefficient        0.88    0.89    0.91    0.88

D best
C second-best
A, B
The MDL principle

MDL stands for minimum description length
The description length is defined as:
  space required to describe a theory
  +
  space required to describe the theory's mistakes
In our case the theory is the classifier and the mistakes are the errors on the training data
Aim: we seek a classifier with minimal DL
MDL principle is a model selection criterion
Model selection criteria

Model selection criteria attempt to find a good compromise between:
  The complexity of a model
  Its prediction accuracy on the training data
Reasoning: a good model is a simple model that achieves high accuracy on the given data
Also known as Occam's Razor: the best theory is the smallest one that describes all the facts

William of Ockham, born in the village of Ockham in Surrey (England) about 1285, was the most influential philosopher of the 14th century and a controversial theologian.
Elegance vs. errors

Theory 1: very simple, elegant theory that explains the data almost perfectly
Theory 2: significantly more complex theory that reproduces the data without mistakes
Theory 1 is probably preferable
Classical example: Kepler's three laws on planetary motion
  Less accurate than Copernicus's latest refinement of the Ptolemaic theory of epicycles
MDL and compression

MDL principle relates to data compression:
  The best theory is the one that compresses the data the most
  I.e. to compress a dataset we generate a model and then store the model and its mistakes
We need to compute (a) size of the model, and (b) space needed to encode the errors
  (b) easy: use the informational loss function
  (a) need a method to encode the model
MDL and Bayes's theorem

L[T] = length of the theory
L[E|T] = training set encoded w.r.t. the theory
Description length = L[T] + L[E|T]
Bayes's theorem gives a posteriori probability of a theory given the data:
  Pr[T|E] = Pr[E|T] · Pr[T] / Pr[E]
Equivalent to:
  −log Pr[T|E] = −log Pr[E|T] − log Pr[T] + log Pr[E]
where the last term is a constant that does not depend on the theory
MDL and MAP

MAP stands for maximum a posteriori probability
Finding the MAP theory corresponds to finding the MDL theory
Difficult bit in applying the MAP principle: determining the prior probability Pr[T] of the theory
Corresponds to difficult part in applying the MDL principle: coding scheme for the theory
  I.e. if we know a priori that a particular theory is more likely we need fewer bits to encode it
Discussion of MDL principle

Advantage: makes full use of the training data when selecting a model
Disadvantage 1: appropriate coding scheme / prior probabilities for theories are crucial
Disadvantage 2: no guarantee that the MDL theory is the one which minimizes the expected error
Note: Occam's Razor is an axiom!
Epicurus's principle of multiple explanations: keep all theories that are consistent with the data
MDL and clustering

Description length of theory: bits needed to encode the clusters
  e.g. cluster centers
Description length of data given theory: encode cluster membership and position relative to cluster
  e.g. distance to cluster center
Works if coding scheme uses less code space for small numbers than for large ones
With nominal attributes, must communicate probability distributions for each cluster