
Data Mining
Practical Machine Learning Tools and Techniques

Slides for Chapter 5 of Data Mining by I. H. Witten, E. Frank and M. A. Hall

Credibility: Evaluating what's been learned

Issues: training, testing, tuning
Predicting performance: confidence limits
Holdout, cross-validation, bootstrap
Comparing schemes: the t-test
Predicting probabilities: loss functions
Cost-sensitive measures
Evaluating numeric prediction
The Minimum Description Length principle


Evaluation: the key to success

How predictive is the model we learned?
Error on the training data is not a good indicator of performance on future data
  Otherwise 1-NN would be the optimum classifier!
Simple solution that can be used if lots of (labeled) data is available:
  Split data into training and test set
However: (labeled) data is usually limited
More sophisticated techniques need to be used


Issues in evaluation

Statistical reliability of estimated differences in performance (significance tests)
Choice of performance measure:
  Number of correct classifications
  Accuracy of probability estimates
  Error in numeric predictions
Costs assigned to different types of errors
  Many practical applications involve costs


Training and testing I

Natural performance measure for classification problems: error rate
  Success: instance's class is predicted correctly
  Error: instance's class is predicted incorrectly
  Error rate: proportion of errors made over the whole set of instances
Resubstitution error: error rate obtained from training data
Resubstitution error is (hopelessly) optimistic!


Training and testing II

Test set: independent instances that have played no part in formation of classifier
  Assumption: both training data and test data are representative samples of the underlying problem
Test and training data may differ in nature
  Example: classifiers built using customer data from two different towns A and B
    To estimate performance of classifier from town A in a completely new town, test it on data from B


Note on parameter tuning

It is important that the test data is not used in any way to create the classifier
Some learning schemes operate in two stages:
  Stage 1: build the basic structure
  Stage 2: optimize parameter settings
The test data can't be used for parameter tuning!
Proper procedure uses three sets: training data, validation data, and test data
  Validation data is used to optimize parameters


Making the most of the data

Once evaluation is complete, all the data can be used to build the final classifier
Generally, the larger the training data the better the classifier (but returns diminish)
The larger the test data the more accurate the error estimate
Holdout procedure: method of splitting original data into training and test set
  Dilemma: ideally both training set and test set should be large!


Predicting performance

Assume the estimated error rate is 25%. How close is this to the true error rate?
  Depends on the amount of test data
Prediction is just like tossing a (biased!) coin
  "Head" is a success, "tail" is an error
In statistics, a succession of independent events like this is called a Bernoulli process
  Statistical theory provides us with confidence intervals for the true underlying proportion


Confidence intervals

We can say: p lies within a certain specified interval with a certain specified confidence
Example: S = 750 successes in N = 1000 trials
  Estimated success rate: 75%
  How close is this to true success rate p?
  Answer: with 80% confidence p is in [73.2%, 76.7%]
Another example: S = 75 and N = 100
  Estimated success rate: 75%
  With 80% confidence p is in [69.1%, 80.1%]


Mean and variance

Mean and variance for a Bernoulli trial: p, p(1-p)
Expected success rate f = S/N
Mean and variance for f: p, p(1-p)/N
For large enough N, f follows a Normal distribution
c% confidence interval $[-z \le X \le z]$ for a random variable with 0 mean is given by:
  $\Pr[-z \le X \le z] = c$
With a symmetric distribution:
  $\Pr[-z \le X \le z] = 1 - 2\Pr[X \ge z]$

Confidence limits

Confidence limits for the normal distribution with 0 mean and a variance of 1:

  Pr[X >= z]    z
  0.1%          3.09
  0.5%          2.58
  1%            2.33
  5%            1.65
  10%           1.28
  20%           0.84
  40%           0.25

Thus:
  $\Pr[-1.65 \le X \le 1.65] = 90\%$

To use this we have to reduce our random variable f to have 0 mean and unit variance

Transforming f

Transformed value for f:
  $\frac{f - p}{\sqrt{p(1-p)/N}}$
  (i.e. subtract the mean and divide by the standard deviation)

Resulting equation:
  $\Pr\left[-z \le \frac{f - p}{\sqrt{p(1-p)/N}} \le z\right] = c$

Solving for p:
  $p = \left(f + \frac{z^2}{2N} \pm z\sqrt{\frac{f}{N} - \frac{f^2}{N} + \frac{z^2}{4N^2}}\right) \Big/ \left(1 + \frac{z^2}{N}\right)$


Examples

f = 75%, N = 1000, c = 80% (so that z = 1.28):
  p in [0.732, 0.767]
f = 75%, N = 100, c = 80% (so that z = 1.28):
  p in [0.691, 0.801]
Note that the normal distribution assumption is only valid for large N (i.e. N > 100)
f = 75%, N = 10, c = 80% (so that z = 1.28):
  p in [0.549, 0.881]
  (should be taken with a grain of salt)
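
A small Python sketch, not part of the original slides, that evaluates the confidence-interval formula from the previous slide; the function name wilson_interval and the loop over N are purely illustrative:

```python
import math

def wilson_interval(f, n, z):
    """Confidence interval for the true success rate p, given the observed
    success rate f on n test instances and the z-value for the chosen
    confidence level (e.g. z = 1.28 for 80% confidence)."""
    centre = f + z * z / (2 * n)
    spread = z * math.sqrt(f / n - f * f / n + z * z / (4 * n * n))
    denominator = 1 + z * z / n
    return (centre - spread) / denominator, (centre + spread) / denominator

# Reproduces the three examples above (up to rounding):
for n in (1000, 100, 10):
    low, high = wilson_interval(0.75, n, 1.28)
    print(f"N = {n:4d}: p in [{low:.3f}, {high:.3f}]")
```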

Holdout estimation

What to do if the amount of data is limited?
The holdout method reserves a certain amount for testing and uses the remainder for training
  Usually: one third for testing, the rest for training
Problem: the samples might not be representative
  Example: a class might be missing in the test data
Advanced version uses stratification
  Ensures that each class is represented with approximately equal proportions in both subsets
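
A minimal Python sketch (not from the slides) of a stratified holdout split; the function name and the toy labels are assumptions made here for illustration:

```python
import random
from collections import defaultdict

def stratified_holdout(labels, test_fraction=1/3, seed=0):
    """Split instance indices into training and test sets so that each class
    appears in roughly the same proportion in both subsets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = int(round(len(indices) * test_fraction))
        test_idx.extend(indices[:cut])
        train_idx.extend(indices[cut:])
    return train_idx, test_idx

labels = ["yes"] * 60 + ["no"] * 30
train, test = stratified_holdout(labels)
print(len(train), len(test))   # 60 training / 30 test, 2:1 class ratio in both
```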


Repeated holdout method

Holdout estimate can be made more reliable by repeating the process with different subsamples
  In each iteration, a certain proportion is randomly selected for training (possibly with stratification)
  The error rates on the different iterations are averaged to yield an overall error rate
This is called the repeated holdout method
Still not optimum: the different test sets overlap
  Can we prevent overlapping?


Cross-validation

Cross-validation avoids overlapping test sets
  First step: split data into k subsets of equal size
  Second step: use each subset in turn for testing, the remainder for training
Called k-fold cross-validation
Often the subsets are stratified before the cross-validation is performed
The error estimates are averaged to yield an overall error estimate
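
A minimal Python sketch (not from the slides) of plain k-fold cross-validation; the callables `learn` and `classify` stand in for whatever learning scheme is being evaluated and are assumptions of this sketch (no stratification is performed here):

```python
import random

def cross_validation_error(instances, labels, learn, classify, k=10, seed=0):
    """Estimate the error rate by k-fold cross-validation: each fold is used
    once for testing while the remaining folds are used for training."""
    rng = random.Random(seed)
    order = list(range(len(instances)))
    rng.shuffle(order)
    folds = [order[i::k] for i in range(k)]        # k roughly equal-sized subsets
    fold_errors = []
    for test_fold in folds:
        train_idx = [i for fold in folds if fold is not test_fold for i in fold]
        model = learn([instances[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        wrong = sum(classify(model, instances[i]) != labels[i] for i in test_fold)
        fold_errors.append(wrong / len(test_fold))
    return sum(fold_errors) / k                    # averaged overall error estimate

# Toy usage with a majority-class "learner":
learn = lambda X, y: max(set(y), key=y.count)
classify = lambda model, x: model
data = list(range(100))
labels = ["yes"] * 60 + ["no"] * 40
print(cross_validation_error(data, labels, learn, classify))
```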


More on cross-validation

Standard method for evaluation: stratified ten-fold cross-validation
Why ten?
  Extensive experiments have shown that this is the best choice to get an accurate estimate
  There is also some theoretical evidence for this
Stratification reduces the estimate's variance
Even better: repeated stratified cross-validation
  E.g. ten-fold cross-validation is repeated ten times and results are averaged (reduces the variance)


Leave-One-Out cross-validation

Leave-One-Out: a particular form of cross-validation:
  Set number of folds to number of training instances
  I.e., for n training instances, build classifier n times
Makes best use of the data
Involves no random subsampling
Very computationally expensive
  (exception: NN)


Leave-One-Out CV and stratification

Disadvantage of Leave-One-Out CV: stratification is not possible
  It guarantees a non-stratified sample because there is only one instance in the test set!
Extreme example: random dataset split equally into two classes
  Best inducer predicts majority class
  50% accuracy on fresh data
  Leave-One-Out CV estimate is 100% error!


The bootstrap

CV uses sampling without replacement
  The same instance, once selected, cannot be selected again for a particular training/test set
The bootstrap uses sampling with replacement to form the training set
  Sample a dataset of n instances n times with replacement to form a new dataset of n instances
  Use this data as the training set
  Use the instances from the original dataset that don't occur in the new training set for testing


The 0.632 bootstrap

Also called the 0.632 bootstrap
  A particular instance has a probability of 1 - 1/n of not being picked
  Thus its probability of ending up in the test data is:
    $\left(1 - \frac{1}{n}\right)^n \approx e^{-1} \approx 0.368$
  This means the training data will contain approximately 63.2% of the instances


Estimating error with the bootstrap

The error estimate on the test data will be very pessimistic
  Trained on just ~63% of the instances
Therefore, combine it with the resubstitution error:
  $err = 0.632 \cdot e_{\text{test instances}} + 0.368 \cdot e_{\text{training instances}}$
The resubstitution error gets less weight than the error on the test data
Repeat process several times with different replacement samples; average the results
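
A Python sketch (not from the slides) of the 0.632 bootstrap estimate above; `learn` and `classify` again stand in for the scheme being evaluated and are assumptions of this sketch:

```python
import random

def bootstrap_632_error(instances, labels, learn, classify, repetitions=10, seed=0):
    """Average of 0.632 * test-set error + 0.368 * resubstitution error over
    several bootstrap samples drawn with replacement."""
    rng = random.Random(seed)
    n = len(instances)
    estimates = []
    for _ in range(repetitions):
        train_idx = [rng.randrange(n) for _ in range(n)]     # sample n times with replacement
        chosen = set(train_idx)
        test_idx = [i for i in range(n) if i not in chosen]  # left-out instances (~36.8%)
        model = learn([instances[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        e_test = sum(classify(model, instances[i]) != labels[i]
                     for i in test_idx) / len(test_idx)
        e_train = sum(classify(model, instances[i]) != labels[i]
                      for i in train_idx) / n
        estimates.append(0.632 * e_test + 0.368 * e_train)
    return sum(estimates) / repetitions
```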


More on the bootstrap

Probably the best way of estimating performance for very small datasets
However, it has some problems
  Consider the random dataset from above
  A perfect memorizer will achieve 0% resubstitution error and ~50% error on test data
  Bootstrap estimate for this classifier:
    $err = 0.632 \cdot 50\% + 0.368 \cdot 0\% = 31.6\%$
  True expected error: 50%

Comparing data mining schemes

Frequent question: which of two learning schemes performs better?
Note: this is domain dependent!
Obvious way: compare 10-fold CV estimates
  Generally sufficient in applications (we don't lose if the chosen method is not truly better)
However, what about machine learning research?
  Need to show convincingly that a particular method works better


Comparing schemes II

Want to show that scheme A is better than scheme B in a particular domain
  For a given amount of training data
  On average, across all possible training sets
Let's assume we have an infinite amount of data from the domain:
  Sample infinitely many datasets of specified size
  Obtain cross-validation estimate on each dataset for each scheme
  Check if mean accuracy for scheme A is better than mean accuracy for scheme B


Paired t-test

In practice we have limited data and a limited number of estimates for computing the mean
Student's t-test tells whether the means of two samples are significantly different
In our case the samples are cross-validation estimates for different datasets from the domain
Use a paired t-test because the individual samples are paired
  The same CV is applied twice

William Gosset
Born: 1876 in Canterbury; Died: 1937 in Beaconsfield, England
Obtained a post as a chemist in the Guinness brewery in Dublin in 1899.
Invented the t-test to handle small samples for quality control in brewing.
Wrote under the name "Student".

Distribution of the means

x1, x2, ..., xk and y1, y2, ..., yk are the 2k samples for the k different datasets
mx and my are the means
With enough samples, the mean of a set of independent samples is normally distributed
Estimated variances of the means are $\sigma_x^2/k$ and $\sigma_y^2/k$
If $\mu_x$ and $\mu_y$ are the true means then
  $\frac{m_x - \mu_x}{\sqrt{\sigma_x^2/k}}$ and $\frac{m_y - \mu_y}{\sqrt{\sigma_y^2/k}}$
are approximately normally distributed with mean 0, variance 1

Student's distribution

With small samples (k < 100) the mean follows Student's distribution with k-1 degrees of freedom
Confidence limits (assuming we have 10 estimates, i.e. 9 degrees of freedom) compared with the normal distribution:

  Pr[X >= z]    z (9 degrees of freedom)    z (normal distribution)
  0.1%          4.30                        3.09
  0.5%          3.25                        2.58
  1%            2.82                        2.33
  5%            1.83                        1.65
  10%           1.38                        1.28
  20%           0.88                        0.84


Distribution of the differences

Let $m_d = m_x - m_y$
The difference of the means ($m_d$) also has a Student's distribution with k-1 degrees of freedom
Let $\sigma_d^2$ be the variance of the difference
The standardized version of $m_d$ is called the t-statistic:
  $t = \frac{m_d}{\sqrt{\sigma_d^2/k}}$
We use t to perform the t-test


Performing the test

Fix a significance level α
  If a difference is significant at the α% level, there is a (100-α)% chance that the true means differ
Divide the significance level by two because the test is two-tailed
  I.e. the true difference can be +ve or -ve
Look up the value for z that corresponds to α/2
If t >= z or t <= -z then the difference is significant
  I.e. the null hypothesis (that the difference is zero) can be rejected
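
A Python sketch (not from the slides) of the paired t-test described above; the example accuracy values are made up for illustration:

```python
import math

def paired_t_statistic(x, y):
    """t statistic for k paired estimates x and y (one pair per dataset):
    t = m_d / sqrt(sigma_d^2 / k), with k-1 degrees of freedom."""
    k = len(x)
    d = [xi - yi for xi, yi in zip(x, y)]
    m_d = sum(d) / k                                    # mean difference
    var_d = sum((di - m_d) ** 2 for di in d) / (k - 1)  # sample variance of differences
    return m_d / math.sqrt(var_d / k)

# Ten paired cross-validation accuracy estimates for schemes A and B (made up):
a = [0.79, 0.81, 0.80, 0.78, 0.82, 0.80, 0.79, 0.81, 0.83, 0.80]
b = [0.76, 0.78, 0.79, 0.75, 0.80, 0.77, 0.78, 0.80, 0.81, 0.78]
t = paired_t_statistic(a, b)
# Compare |t| with the table value for 9 degrees of freedom,
# e.g. 1.83 for a two-tailed test at the 10% level (alpha/2 = 5%).
print(t)
```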


Unpaired observations

If the CV estimates are from different datasets, they are no longer paired
(or maybe we have k estimates for one scheme, and j estimates for the other one)
Then we have to use an unpaired t-test with min(k, j) - 1 degrees of freedom
The estimate of the variance of the difference of the means becomes:
  $\frac{\sigma_x^2}{k} + \frac{\sigma_y^2}{j}$


Dependent estimates

We assumed that we have enough data to create several datasets of the desired size
Need to reuse data if that's not the case
  E.g. running cross-validations with different randomizations on the same data
Samples become dependent => insignificant differences can become significant
A heuristic test is the corrected resampled t-test:
  Assume we use the repeated holdout method, with n1 instances for training and n2 for testing
  New test statistic is:
    $t = \frac{m_d}{\sqrt{\left(\frac{1}{k} + \frac{n_2}{n_1}\right)\sigma_d^2}}$
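
A Python sketch (not from the slides) of the corrected resampled statistic; the argument names n_train and n_test correspond to n1 and n2 above:

```python
import math

def corrected_resampled_t(x, y, n_train, n_test):
    """Heuristic correction for the dependence between repeated holdout runs:
    t = m_d / sqrt((1/k + n2/n1) * sigma_d^2)."""
    k = len(x)
    d = [xi - yi for xi, yi in zip(x, y)]
    m_d = sum(d) / k
    var_d = sum((di - m_d) ** 2 for di in d) / (k - 1)
    return m_d / math.sqrt((1.0 / k + n_test / n_train) * var_d)
```

It is used exactly like the paired statistic above; the wider denominator compensates for the overlap between the reused training and test sets.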


Predicting probabilities

Performance measure so far: success rate
Also called 0-1 loss function:
  $\sum_i \begin{cases} 0 & \text{if prediction is correct} \\ 1 & \text{if prediction is incorrect} \end{cases}$
Most classifiers produce class probabilities
Depending on the application, we might want to check the accuracy of the probability estimates
0-1 loss is not the right thing to use in those cases


Quadratic loss function

p1, ..., pk are probability estimates for an instance
c is the index of the instance's actual class
a1, ..., ak = 0, except for ac, which is 1
Quadratic loss is:
  $\sum_j (p_j - a_j)^2 = \sum_{j \ne c} p_j^2 + (1 - p_c)^2$
Want to minimize
  $E\left[\sum_j (p_j - a_j)^2\right]$
Can show that this is minimized when $p_j = p_j^*$, the true probabilities

Informational loss function

The informational loss function is $-\log_2 p_c$, where c is the index of the instance's actual class
  Number of bits required to communicate the actual class
Let $p_1^*, \ldots, p_k^*$ be the true class probabilities
Then the expected value for the loss function is:
  $-p_1^* \log_2 p_1 - p_2^* \log_2 p_2 - \ldots - p_k^* \log_2 p_k$
Justification: minimized when $p_j = p_j^*$
Difficulty: zero-frequency problem
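
A Python sketch (not from the slides) of both loss functions; the small epsilon guard against the zero-frequency problem is an assumption of this sketch:

```python
import math

def quadratic_loss(probs, actual):
    """Sum over classes j of (p_j - a_j)^2, where a_j is 1 for the actual class
    and 0 for all others."""
    return sum((p - (1.0 if j == actual else 0.0)) ** 2 for j, p in enumerate(probs))

def informational_loss(probs, actual, eps=1e-10):
    """-log2 of the probability assigned to the actual class; eps avoids an
    infinite loss when that probability is zero (the zero-frequency problem)."""
    return -math.log2(max(probs[actual], eps))

probs = [0.7, 0.2, 0.1]                 # predicted class probabilities
print(quadratic_loss(probs, 0))         # 0.3^2 + 0.2^2 + 0.1^2 = 0.14
print(informational_loss(probs, 0))     # -log2(0.7) ~ 0.515 bits
```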

Discussion

Which loss function to choose?
  Both encourage honesty
  Quadratic loss function takes into account all class probability estimates for an instance
  Informational loss focuses only on the probability estimate for the actual class
  Quadratic loss is bounded by $1 + \sum_j p_j^2$: it can never exceed 2
  Informational loss can be infinite
Informational loss is related to the MDL principle [later]


Counting the cost

In practice, different types of classification errors often incur different costs
Examples:
  Terrorist profiling
    "Not a terrorist" correct 99.99% of the time
  Loan decisions
  Oil-slick detection
  Fault diagnosis
  Promotional mailing


Counting the cost

The confusion matrix:

                       Predicted class
                       Yes               No
  Actual class   Yes   True positive     False negative
                 No    False positive    True negative

There are many other types of cost!
  E.g.: cost of collecting training data


Aside: the kappa statistic

Two confusion matrices for a 3-class problem: actual predictor (left) vs. random predictor (right)
Number of successes: sum of entries in diagonal (D)
Kappa statistic:
  $\kappa = \frac{D_{\text{observed}} - D_{\text{random}}}{D_{\text{perfect}} - D_{\text{random}}}$
Measures relative improvement over random predictor
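
Since the two confusion matrices are in a figure that is not reproduced here, a Python sketch (not from the slides) with a hypothetical 3-class matrix illustrates the computation:

```python
def kappa(confusion):
    """Kappa statistic from a confusion matrix (rows = actual, columns = predicted).
    The random predictor distributes its predictions with the same column totals
    as the actual predictor."""
    total = sum(sum(row) for row in confusion)
    d_observed = sum(confusion[i][i] for i in range(len(confusion)))
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    d_random = sum(r * c / total for r, c in zip(row_totals, col_totals))
    d_perfect = total
    return (d_observed - d_random) / (d_perfect - d_random)

matrix = [[88, 10, 2],     # hypothetical counts: rows = actual classes a, b, c
          [14, 40, 6],
          [18, 10, 12]]
print(kappa(matrix))       # relative improvement over the random predictor
```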

Classification with costs

Two cost matrices:
Success rate is replaced by average cost per prediction
  Cost is given by appropriate entry in the cost matrix

Cost-sensitive classification

Can take costs into account when making predictions
  Basic idea: only predict high-cost class when very confident about prediction
Given: predicted class probabilities
  Normally we just predict the most likely class
  Here, we should make the prediction that minimizes the expected cost
    Expected cost: dot product of vector of class probabilities and appropriate column in cost matrix
    Choose column (class) that minimizes expected cost
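
A Python sketch (not from the slides) of this rule; the 2 x 2 cost matrix is a made-up example in which one kind of error is ten times as expensive as the other:

```python
def minimum_expected_cost_class(probs, cost_matrix):
    """Return the class whose prediction minimizes the expected cost.
    cost_matrix[i][j] is the cost of predicting class j when the actual class
    is i, so the expected cost of predicting j is the dot product of the class
    probability vector with column j of the cost matrix."""
    n = len(probs)
    expected = [sum(probs[i] * cost_matrix[i][j] for i in range(n)) for j in range(n)]
    return min(range(n), key=expected.__getitem__), expected

# Classes: 0 = "yes", 1 = "no".  Missing a "yes" costs 10, a false "yes" costs 1.
cost = [[0, 10],
        [1, 0]]
print(minimum_expected_cost_class([0.4, 0.6], cost))
# Although "no" is more likely, predicting "yes" has expected cost 0.6 vs. 4.0.
```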


Cost-sensitive learning

So far we haven't taken costs into account at training time
Most learning schemes do not perform cost-sensitive learning
  They generate the same classifier no matter what costs are assigned to the different classes
  Example: standard decision tree learner
Simple methods for cost-sensitive learning:
  Resampling of instances according to costs
  Weighting of instances according to costs
Some schemes can take costs into account by varying a parameter, e.g. naïve Bayes

Lift charts

In practice, costs are rarely known
Decisions are usually made by comparing possible scenarios
Example: promotional mailout to 1,000,000 households
  Mail to all; 0.1% respond (1000)
  Data mining tool identifies subset of 100,000 most promising, 0.4% of these respond (400)
    40% of responses for 10% of cost may pay off
  Identify subset of 400,000 most promising, 0.2% respond (800)
A lift chart allows a visual comparison

Generating a lift chart

Sort instances according to predicted probability of being positive:

  Predicted probability    Actual class
  0.95                     Yes
  0.93                     Yes
  0.93                     No
  0.88                     Yes
  ...                      ...

x axis is sample size
y axis is number of true positives
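
A Python sketch (not from the slides) that turns a ranked list like the table above into lift-chart points; the example data extends the table with one made-up instance:

```python
def lift_chart_points(predictions):
    """Sort instances by predicted probability of being positive (descending);
    each point is (number of instances included, true positives among them)."""
    ranked = sorted(predictions, key=lambda pair: pair[0], reverse=True)
    points, true_positives = [(0, 0)], 0
    for sample_size, (prob, actual) in enumerate(ranked, start=1):
        true_positives += (actual == "Yes")
        points.append((sample_size, true_positives))
    return points

data = [(0.95, "Yes"), (0.93, "Yes"), (0.93, "No"), (0.88, "Yes"), (0.80, "No")]
print(lift_chart_points(data))
# [(0, 0), (1, 1), (2, 2), (3, 2), (4, 3), (5, 3)]
```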

A hypothetical lift chart

[Figure: lift chart showing 40% of responses for 10% of cost, and 80% of responses for 40% of cost]


ROC curves

ROC curves are similar to lift charts
  Stands for "receiver operating characteristic"
  Used in signal detection to show trade-off between hit rate and false alarm rate over noisy channel
Differences to lift chart:
  y axis shows percentage of true positives in sample rather than absolute number
  x axis shows percentage of false positives in sample rather than sample size


A sample ROC curve

Jagged curve: one set of test data
Smooth curve: use cross-validation

Cross-validation and ROC curves

Simple method of getting an ROC curve using cross-validation:
  Collect probabilities for instances in test folds
  Sort instances according to probabilities
This method is implemented in WEKA
However, this is just one possibility
  Another possibility is to generate an ROC curve for each fold and average them
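
A Python sketch (not from the slides) of this method, applied to probabilities that would have been collected from the test folds; tied probabilities are handled naively here:

```python
def roc_points(predictions):
    """(FP rate, TP rate) points obtained by sorting instances by predicted
    probability of being positive and sweeping a threshold down the ranking."""
    positives = sum(1 for _, actual in predictions if actual == "Yes")
    negatives = len(predictions) - positives
    ranked = sorted(predictions, key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for prob, actual in ranked:
        if actual == "Yes":
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

data = [(0.95, "Yes"), (0.93, "Yes"), (0.93, "No"), (0.88, "Yes"), (0.80, "No")]
print(roc_points(data))
```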


ROC curves for two schemes

For a small, focused sample, use method A
For a larger one, use method B
In between, choose between A and B with appropriate probabilities

The convex hull

Given two learning schemes we can achieve any point on the convex hull!
TP and FP rates for scheme 1: t1 and f1
TP and FP rates for scheme 2: t2 and f2
If scheme 1 is used to predict 100q% of the cases and scheme 2 for the rest, then:
  TP rate for combined scheme: q t1 + (1-q) t2
  FP rate for combined scheme: q f1 + (1-q) f2


More measures...

Percentage of retrieved documents that are relevant: precision = TP/(TP+FP)
Percentage of relevant documents that are returned: recall = TP/(TP+FN)
Precision/recall curves have hyperbolic shape
Summary measures: average precision at 20%, 50% and 80% recall (three-point average recall)
F-measure = (2 × recall × precision)/(recall + precision)
sensitivity × specificity = (TP/(TP+FN)) × (TN/(FP+TN))
Area under the ROC curve (AUC): probability that a randomly chosen positive instance is ranked above a randomly chosen negative one
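
A Python sketch (not from the slides) of the point measures above; the counts and the ranked example data are made up for illustration:

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * recall * precision / (recall + precision)
    return precision, recall, f_measure

def auc(predictions):
    """Area under the ROC curve, computed directly as the probability that a
    randomly chosen positive is ranked above a randomly chosen negative
    (ties count one half)."""
    pos = [p for p, actual in predictions if actual == "Yes"]
    neg = [p for p, actual in predictions if actual == "No"]
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

print(precision_recall_f(tp=40, fp=10, fn=20))   # roughly (0.80, 0.67, 0.73)
data = [(0.95, "Yes"), (0.93, "Yes"), (0.93, "No"), (0.88, "Yes"), (0.80, "No")]
print(auc(data))
```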

Summary of some measures

                           Domain                  Plot                   Explanation
  Lift chart               Marketing               TP vs. subset size     TP
                                                                          (TP+FP)/(TP+FP+TN+FN)
  ROC curve                Communications          TP rate vs. FP rate    TP/(TP+FN)
                                                                          FP/(FP+TN)
  Recall-precision curve   Information retrieval   Recall vs. precision   TP/(TP+FN)
                                                                          TP/(TP+FP)


Cost curves

Cost curves plot expected costs directly
Example for case with uniform costs (i.e. error):


Cost curves: example with costs

Probability cost function:
  $p_c[+] = \frac{p[+]\,C[+|-]}{p[+]\,C[+|-] + p[-]\,C[-|+]}$
Normalized expected cost:
  $fn \cdot p_c[+] + fp \cdot (1 - p_c[+])$



Evaluating numeric prediction

Same strategies: independent test set, cross-validation, significance tests, etc.
Difference: error measures
Actual target values: a1, a2, ..., an
Predicted target values: p1, p2, ..., pn
Most popular measure: mean-squared error
  $\frac{(p_1 - a_1)^2 + \ldots + (p_n - a_n)^2}{n}$
  Easy to manipulate mathematically


Other measures

The root mean-squared error:
  $\sqrt{\frac{(p_1 - a_1)^2 + \ldots + (p_n - a_n)^2}{n}}$
The mean absolute error is less sensitive to outliers than the mean-squared error:
  $\frac{|p_1 - a_1| + \ldots + |p_n - a_n|}{n}$
Sometimes relative error values are more appropriate (e.g. 10% for an error of 50 when predicting 500)

Improvement on the mean

How much does the scheme improve on simply predicting the average?
The relative squared error is:
  $\frac{(p_1 - a_1)^2 + \ldots + (p_n - a_n)^2}{(\bar{a} - a_1)^2 + \ldots + (\bar{a} - a_n)^2}$
The relative absolute error is:
  $\frac{|p_1 - a_1| + \ldots + |p_n - a_n|}{|\bar{a} - a_1| + \ldots + |\bar{a} - a_n|}$

Correlation coefficient

Measures the statistical correlation between the predicted values and the actual values:
  $\frac{S_{PA}}{\sqrt{S_P S_A}}$
where
  $S_{PA} = \frac{\sum_i (p_i - \bar{p})(a_i - \bar{a})}{n-1}$,
  $S_P = \frac{\sum_i (p_i - \bar{p})^2}{n-1}$,
  $S_A = \frac{\sum_i (a_i - \bar{a})^2}{n-1}$
Scale independent, between -1 and +1
Good performance leads to large values!
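
A Python sketch (not from the slides) computing the numeric-prediction measures from this and the preceding slides on a small made-up example:

```python
import math

def numeric_prediction_measures(p, a):
    """Root mean-squared error, mean absolute error, relative errors and the
    correlation coefficient for predicted values p and actual values a."""
    n = len(a)
    mean_a = sum(a) / n
    mean_p = sum(p) / n
    mse = sum((pi - ai) ** 2 for pi, ai in zip(p, a)) / n
    mae = sum(abs(pi - ai) for pi, ai in zip(p, a)) / n
    rel_sq = n * mse / sum((mean_a - ai) ** 2 for ai in a)      # relative squared error
    rel_abs = n * mae / sum(abs(mean_a - ai) for ai in a)       # relative absolute error
    s_pa = sum((pi - mean_p) * (ai - mean_a) for pi, ai in zip(p, a)) / (n - 1)
    s_p = sum((pi - mean_p) ** 2 for pi in p) / (n - 1)
    s_a = sum((ai - mean_a) ** 2 for ai in a) / (n - 1)
    return {"RMSE": math.sqrt(mse),
            "MAE": mae,
            "relative squared error": rel_sq,
            "relative absolute error": rel_abs,
            "correlation": s_pa / math.sqrt(s_p * s_a)}

print(numeric_prediction_measures([2.5, 0.0, 2.1, 7.8], [3.0, -0.5, 2.0, 7.0]))
```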

Which measure?

Best to look at all of them
Often it doesn't matter
Example:

                                 A        B        C        D
  Root mean-squared error        67.8     91.7     63.3     57.4
  Mean absolute error            41.3     38.5     33.4     29.2
  Root relative squared error    42.2%    57.2%    39.4%    35.8%
  Relative absolute error        43.1%    40.1%    34.8%    30.4%
  Correlation coefficient        0.88     0.89     0.91     0.88

D best
C second-best
A, B arguable

The MDL principle

MDL stands for minimum description length
The description length is defined as:
  space required to describe a theory
  +
  space required to describe the theory's mistakes
In our case the theory is the classifier and the mistakes are the errors on the training data
Aim: we seek a classifier with minimal DL
MDL principle is a model selection criterion


Model selection criteria

Model selection criteria attempt to find a good compromise between:
  The complexity of a model
  Its prediction accuracy on the training data
Reasoning: a good model is a simple model that achieves high accuracy on the given data
Also known as Occam's Razor: the best theory is the smallest one that describes all the facts

William of Ockham, born in the village of Ockham in Surrey (England) about 1285, was the most influential philosopher of the 14th century and a controversial theologian.

Elegance vs. errors

Theory 1: very simple, elegant theory that explains the data almost perfectly
Theory 2: significantly more complex theory that reproduces the data without mistakes
Theory 1 is probably preferable
Classical example: Kepler's three laws on planetary motion
  Less accurate than Copernicus's latest refinement of the Ptolemaic theory of epicycles


MDL and compression

MDL principle relates to data compression:
  The best theory is the one that compresses the data the most
  I.e. to compress a dataset we generate a model and then store the model and its mistakes
We need to compute
  (a) size of the model, and
  (b) space needed to encode the errors
(b) easy: use the informational loss function
(a) need a method to encode the model


MDL and Bayes's theorem

L[T] = "length" of the theory
L[E|T] = training set encoded with respect to the theory
Description length = L[T] + L[E|T]
Bayes's theorem gives the a posteriori probability of a theory given the data:
  $\Pr[T|E] = \frac{\Pr[E|T]\,\Pr[T]}{\Pr[E]}$
Equivalent to:
  $-\log \Pr[T|E] = -\log \Pr[E|T] - \log \Pr[T] + \log \Pr[E]$
where $\log \Pr[E]$ is constant

MDL and MAP

MAP stands for maximum a posteriori probability
Finding the MAP theory corresponds to finding the MDL theory
Difficult bit in applying the MAP principle: determining the prior probability Pr[T] of the theory
Corresponds to difficult part in applying the MDL principle: coding scheme for the theory
I.e. if we know a priori that a particular theory is more likely we need fewer bits to encode it

Discussion of MDL principle

Advantage: makes full use of the training data when selecting a model
Disadvantage 1: appropriate coding scheme / prior probabilities for theories are crucial
Disadvantage 2: no guarantee that the MDL theory is the one which minimizes the expected error
Note: Occam's Razor is an axiom!
Epicurus's principle of multiple explanations: keep all theories that are consistent with the data


MDL and clustering

Description length of theory: bits needed to encode the clusters
  e.g. cluster centers
Description length of data given theory: encode cluster membership and position relative to cluster
  e.g. distance to cluster center
Works if coding scheme uses less code space for small numbers than for large ones
With nominal attributes, must communicate probability distributions for each cluster

