
Data Mining: Practical Machine Learning Tools and Techniques
Slides for Chapter 4 of Data Mining by I. H. Witten, E. Frank and M. A. Hall

Algorithms: The basic methods

Inferring rudimentary rules
Statistical modeling
Constructing decision trees
Constructing rules
Association rule learning
Linear models
Instance-based learning
Clustering

Simplicity first

Simple algorithms often work very well!
There are many kinds of simple structure, e.g.:
  One attribute does all the work
  All attributes contribute equally and independently
  A weighted linear combination might do
  Instance-based: use a few prototypes
  Use simple logical rules

Success of method depends on the domain

Inferring rudimentary rules

1R: learns a 1-level decision tree
  I.e., rules that all test one particular attribute

Basic version:
  One branch for each value
  Each branch assigns most frequent class
  Error rate: proportion of instances that don't belong to the majority class of their corresponding branch
  Choose attribute with lowest error rate

(assumes nominal attributes)

Pseudocode for 1R

For each attribute,
    For each value of the attribute, make a rule as follows:
        count how often each class appears
        find the most frequent class
        make the rule assign that class to this attribute-value
    Calculate the error rate of the rules
Choose the rules with the smallest error rate

Note: "missing" is treated as a separate attribute value
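
To make the procedure concrete, here is a minimal Python sketch of basic 1R for nominal attributes. The data layout (instances as dicts, the class stored under a key passed as class_attr) is an illustrative assumption, not the book's Weka implementation:

from collections import Counter, defaultdict

def one_r(instances, attributes, class_attr):
    """Basic 1R: for each attribute build one rule per value (predicting the
    majority class), then keep the attribute whose rules make fewest errors."""
    best_attr, best_rules, best_errors = None, None, None
    for attr in attributes:
        counts = defaultdict(Counter)            # attribute value -> class frequencies
        for inst in instances:
            counts[inst[attr]][inst[class_attr]] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best_errors is None or errors < best_errors:
            best_attr, best_rules, best_errors = attr, rules, errors
    return best_attr, best_rules, best_errors

On the nominal weather data below, both Outlook and Humidity make 4/14 errors, so either may be returned depending on iteration order.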

Evaluating the weather attributes

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

Attribute  Rules            Errors  Total errors
Outlook    Sunny → No       2/5     4/14
           Overcast → Yes   0/4
           Rainy → Yes      2/5
Temp       Hot → No*        2/4     5/14
           Mild → Yes       2/6
           Cool → Yes       1/4
Humidity   High → No        3/7     4/14
           Normal → Yes     1/7
Windy      False → Yes      2/8     5/14
           True → No*       3/6

* indicates a tie

Dealing with numeric attributes

Discretize numeric attributes
Divide each attribute's range into intervals:
  Sort instances according to attribute's values
  Place breakpoints where class changes (majority class)
  This minimizes the total error

Example: temperature from weather data

Outlook   Temperature  Humidity  Windy  Play
Sunny     85           85        False  No
Sunny     80           90        True   No
Overcast  83           86        False  Yes
Rainy     75           80        False  Yes
...       ...          ...       ...    ...

64    65    68    69    70    71    72    72    75    75    80    81    83    85
Yes | No  | Yes   Yes   Yes | No    No    Yes | Yes   Yes | No  | Yes   Yes | No

The problem of overfitting

This procedure is very sensitive to noise
  One instance with an incorrect class label will probably produce a separate interval
Also: a time stamp attribute will have zero errors
Simple solution: enforce a minimum number of instances in the majority class per interval

Example (with min = 3); breakpoints at each class change:

64    65    68    69    70    71    72    72    75    75    80    81    83    85
Yes | No  | Yes   Yes   Yes | No    No    Yes | Yes   Yes | No  | Yes   Yes | No

After enforcing the minimum of three per majority class:

64    65    68    69    70    71    72    72    75    75    80    81    83    85
Yes   No    Yes   Yes   Yes | No    No    Yes   Yes   Yes | No    Yes   Yes   No

With overfitting avoidance

Resulting rule set:

Attribute    Rules                    Errors  Total errors
Outlook      Sunny → No               2/5     4/14
             Overcast → Yes           0/4
             Rainy → Yes              2/5
Temperature  ≤ 77.5 → Yes             3/10    5/14
             > 77.5 → No*             2/4
Humidity     ≤ 82.5 → Yes             1/7     3/14
             > 82.5 and ≤ 95.5 → No   2/6
             > 95.5 → Yes             0/1
Windy        False → Yes              2/8     5/14
             True → No*               3/6

Discussion of 1R

1R was described in a paper by Holte (1993):
  Contains an experimental evaluation on 16 datasets (using cross-validation so that results were representative of performance on future data)
  Minimum number of instances was set to 6 after some experimentation
  1R's simple rules performed not much worse than much more complex decision trees

Simplicity first pays off!

"Very Simple Classification Rules Perform Well on Most Commonly Used Datasets"
Robert C. Holte, Computer Science Department, University of Ottawa

Discussion of 1R: Hyperpipes

Another simple technique: build one rule for each class
  Each rule is a conjunction of tests, one for each attribute
  For numeric attributes: test checks whether instance's value is inside an interval
    Interval given by minimum and maximum observed in training data
  For nominal attributes: test checks whether value is one of a subset of attribute values
    Subset given by all possible values observed in training data
Class with most matching tests is predicted

Statistical modeling

"Opposite" of 1R: use all the attributes
Two assumptions: attributes are
  equally important
  statistically independent (given the class value)
I.e., knowing the value of one attribute says nothing about the value of another (if the class is known)

Independence assumption is never correct!
But this scheme works well in practice

Probabilities for weather data

Outlook    Yes  No    Temperature  Yes  No    Humidity  Yes  No    Windy  Yes  No    Play  Yes   No
Sunny      2/9  3/5   Hot          2/9  2/5   High      3/9  4/5   False  6/9  2/5         9/14  5/14
Overcast   4/9  0/5   Mild         4/9  2/5   Normal    6/9  1/5   True   3/9  3/5
Rainy      3/9  2/5   Cool         3/9  1/5

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No
Probabilities for weather data

Outlook    Yes  No    Temperature  Yes  No    Humidity  Yes  No    Windy  Yes  No    Play  Yes   No
Sunny      2/9  3/5   Hot          2/9  2/5   High      3/9  4/5   False  6/9  2/5         9/14  5/14
Overcast   4/9  0/5   Mild         4/9  2/5   Normal    6/9  1/5   True   3/9  3/5
Rainy      3/9  2/5   Cool         3/9  1/5

A new day:

Outlook  Temp.  Humidity  Windy  Play
Sunny    Cool   High      True   ?

Likelihood of the two classes:
For "yes" = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
For "no"  = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206
Conversion into a probability by normalization:
P(yes) = 0.0053 / (0.0053 + 0.0206) = 0.205
P(no)  = 0.0206 / (0.0053 + 0.0206) = 0.795
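
The likelihood computation on this slide is just a product of tabulated fractions; a small Python check (counts hard-coded from the table above, no smoothing, to match the slide's numbers exactly):

# Conditional probabilities read off the table for the new day
# Outlook = Sunny, Temperature = Cool, Humidity = High, Windy = True.
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)            # = 0.0053
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)            # = 0.0206

total = p_yes + p_no
print(round(p_yes, 4), round(p_no, 4))                     # 0.0053 0.0206
print(round(p_yes / total, 3), round(p_no / total, 3))     # 0.205 0.795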

Bayes's rule

Probability of event H given evidence E:

Pr[H | E] = Pr[E | H] × Pr[H] / Pr[E]

A priori probability of H: Pr[H]
  Probability of event before evidence is seen
A posteriori probability of H: Pr[H | E]
  Probability of event after evidence is seen

Thomas Bayes
Born: 1702 in London, England
Died: 1761 in Tunbridge Wells, Kent, England

Naive Bayes for classification

Classification learning: what's the probability of the class given an instance?
  Evidence E = instance
  Event H = class value for instance

Naive assumption: evidence splits into parts (i.e. attributes) that are independent

Pr[H | E] = Pr[E1 | H] × Pr[E2 | H] × ... × Pr[En | H] × Pr[H] / Pr[E]

Weather data example

Evidence E:

Outlook  Temp.  Humidity  Windy  Play
Sunny    Cool   High      True   ?

Probability of class "yes":

Pr[yes | E] = Pr[Outlook = Sunny | yes]
              × Pr[Temperature = Cool | yes]
              × Pr[Humidity = High | yes]
              × Pr[Windy = True | yes]
              × Pr[yes] / Pr[E]
            = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / Pr[E]

The zero-frequency problem

What if an attribute value doesn't occur with every class value?
(e.g. "Humidity = High" for class "yes")
  Probability will be zero: Pr[Humidity = High | yes] = 0
  A posteriori probability will also be zero: Pr[yes | E] = 0
  (No matter how likely the other values are!)

Remedy: add 1 to the count for every attribute value-class combination (Laplace estimator)
Result: probabilities will never be zero!
(also: stabilizes probability estimates)
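
A minimal sketch of this correction in Python; the function name and the mu parameter (which anticipates the generalization on the next slide) are illustrative choices, not from the book:

def smoothed_prob(count, class_total, n_values, mu=1.0):
    """Estimate Pr[attribute = value | class] with mu added per possible value,
    so the estimate is never zero (mu = 1 gives the Laplace estimator)."""
    return (count + mu) / (class_total + mu * n_values)

# Outlook = Overcast given class "no": 0/5 without smoothing,
# (0 + 1) / (5 + 3) = 0.125 with the Laplace estimator (3 possible Outlook values).
print(smoothed_prob(0, 5, 3))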

Modified probability estimates

In some cases adding a constant different from 1 might be more appropriate
Example: attribute Outlook for class "yes", with constant μ spread evenly over the three values:

Sunny: (2 + μ/3) / (9 + μ)    Overcast: (4 + μ/3) / (9 + μ)    Rainy: (3 + μ/3) / (9 + μ)

Weights don't need to be equal (but they must sum to 1):

Sunny: (2 + μ p1) / (9 + μ)   Overcast: (4 + μ p2) / (9 + μ)   Rainy: (3 + μ p3) / (9 + μ)

Missing values

Training: instance is not included in frequency count for attribute value-class combination
Classification: attribute will be omitted from calculation
Example:

Outlook  Temp.  Humidity  Windy  Play
?        Cool   High      True   ?

Likelihood of "yes" = 3/9 × 3/9 × 3/9 × 9/14 = 0.0238
Likelihood of "no"  = 1/5 × 4/5 × 3/5 × 5/14 = 0.0343
P(yes) = 0.0238 / (0.0238 + 0.0343) = 41%
P(no)  = 0.0343 / (0.0238 + 0.0343) = 59%

Numeric attributes

Usual assumption: attributes have a normal or Gaussian probability distribution (given the class)
The probability density function for the normal distribution is defined by two parameters:

Sample mean:        μ = (1/n) Σ_{i=1..n} x_i
Standard deviation: σ² = (1/(n−1)) Σ_{i=1..n} (x_i − μ)²

Then the density function f(x) is:

f(x) = (1 / (√(2π) σ)) e^(−(x−μ)² / (2σ²))

Statistics for weather data

Outlook    Yes  No    Windy  Yes  No    Play  Yes   No
Sunny      2/9  3/5   False  6/9  2/5         9/14  5/14
Overcast   4/9  0/5   True   3/9  3/5
Rainy      3/9  2/5

               Temperature (yes)        Temperature (no)     Humidity (yes)           Humidity (no)
values         64, 68, 69, 70, 72, ...  65, 71, 72, 80, 85   65, 70, 70, 75, 80, ...  70, 85, 90, 91, 95
mean μ         73                       75                   79                       86
std. dev. σ    6.2                      7.9                  10.2                     9.7

Example density value:

f(temperature = 66 | yes) = (1 / (√(2π) × 6.2)) e^(−(66−73)² / (2 × 6.2²)) = 0.0340

Classifying a new day

A new day:

Outlook  Temp.  Humidity  Windy  Play
Sunny    66     90        true   ?

Likelihood of "yes" = 2/9 × 0.0340 × 0.0221 × 3/9 × 9/14 = 0.000036
Likelihood of "no"  = 3/5 × 0.0221 × 0.0381 × 3/5 × 5/14 = 0.000108
P(yes) = 0.000036 / (0.000036 + 0.000108) = 25%
P(no)  = 0.000108 / (0.000036 + 0.000108) = 75%

Missing values during training are not included in calculation of mean and standard deviation
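
A sketch of the density-based estimate in Python, using the means and standard deviations from the statistics table (function name illustrative; small differences from the slide come from rounding μ and σ):

import math

def gaussian_density(x, mean, std):
    """Normal probability density f(x) for the given mean and standard deviation."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

# f(temperature = 66 | yes) with mean 73 and standard deviation 6.2:
print(round(gaussian_density(66, 73, 6.2), 4))   # 0.034 (the slide's 0.0340)

# Likelihood of "yes" for the new day (Sunny, temperature 66, humidity 90, windy):
like_yes = (2/9) * gaussian_density(66, 73, 6.2) * gaussian_density(90, 79, 10.2) * (3/9) * (9/14)
print(like_yes)                                  # about 0.000035 (slide: 0.000036)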

Probability densities

Relationship between probability and density:

Pr[c − ε/2 ≤ x ≤ c + ε/2] ≈ ε × f(c)

But: this doesn't change calculation of a posteriori probabilities because ε cancels out
Exact relationship:

Pr[a ≤ x ≤ b] = ∫_a^b f(t) dt

Multinomial naive Bayes I

Version of naive Bayes used for document classification using the bag of words model
n1, n2, ..., nk: number of times word i occurs in the document
P1, P2, ..., Pk: probability of obtaining word i when sampling from documents in class H
Probability of observing document E given class H (based on the multinomial distribution):

Pr[E | H] ≈ N! × Π_{i=1..k} (Pi^ni / ni!)

Ignores probability of generating a document of the right length (probability assumed constant for each class)

Multinomial naive Bayes II

Suppose the dictionary has two words, "yellow" and "blue"
Suppose Pr[yellow | H] = 75% and Pr[blue | H] = 25%
Suppose E is the document "blue yellow blue"
Probability of observing the document:

Pr[{blue yellow blue} | H] ≈ 3! × (0.75¹ / 1!) × (0.25² / 2!) = 9/64 ≈ 0.14

Suppose there is another class H' that has Pr[yellow | H'] = 10% and Pr[blue | H'] = 90%:

Pr[{blue yellow blue} | H'] ≈ 3! × (0.1¹ / 1!) × (0.9² / 2!) ≈ 0.24

Need to take prior probability of class into account to make the final classification
Factorials don't actually need to be computed
Underflows can be prevented by using logarithms
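
A sketch of the multinomial computation done in log space, as the slide suggests (word probabilities taken from the example above; log-factorials via math.lgamma):

import math

def log_multinomial_likelihood(word_counts, word_probs):
    """log Pr[document | class] under the multinomial model:
    log N! + sum_i (n_i log P_i - log n_i!)."""
    n = sum(word_counts)
    result = math.lgamma(n + 1)                      # log N!
    for n_i, p_i in zip(word_counts, word_probs):
        result += n_i * math.log(p_i) - math.lgamma(n_i + 1)
    return result

# Document "blue yellow blue": counts are (yellow = 1, blue = 2)
print(math.exp(log_multinomial_likelihood([1, 2], [0.75, 0.25])))   # ~0.141 (= 9/64)
print(math.exp(log_multinomial_likelihood([1, 2], [0.10, 0.90])))   # ~0.243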

Naive Bayes: discussion

Naive Bayes works surprisingly well (even if the independence assumption is clearly violated)
Why? Because classification doesn't require accurate probability estimates as long as maximum probability is assigned to the correct class
However: adding too many redundant attributes will cause problems (e.g. identical attributes)
Note also: many numeric attributes are not normally distributed (→ kernel density estimators)

Constructing decision trees

Strategy: top down
Recursive divide-and-conquer fashion:
  First: select attribute for root node; create branch for each possible attribute value
  Then: split instances into subsets, one for each branch extending from the node
  Finally: repeat recursively for each branch, using only instances that reach the branch

Stop if all instances have the same class

Which attribute to select?

(Figures: tree stumps splitting the weather data on each of the four attributes)

Criterion for attribute selection

Which is the best attribute?
  Want to get the smallest tree
  Heuristic: choose the attribute that produces the "purest" nodes

Popular impurity criterion: information gain
  Information gain increases with the average purity of the subsets

Strategy: choose attribute that gives greatest information gain

Computing information

Measure information in bits
  Given a probability distribution, the info required to predict an event is the distribution's entropy
  Entropy gives the information required in bits (can involve fractions of bits!)
Formula for computing the entropy:

entropy(p1, p2, ..., pn) = −p1 log p1 − p2 log p2 ... − pn log pn

Example: attribute Outlook

Outlook = Sunny:
info([2,3]) = entropy(2/5, 3/5) = −2/5 log(2/5) − 3/5 log(3/5) = 0.971 bits

Outlook = Overcast:
info([4,0]) = entropy(1, 0) = −1 log(1) − 0 log(0) = 0 bits
(Note: 0 log(0) is normally undefined; it is taken to be zero here.)

Outlook = Rainy:
info([3,2]) = entropy(3/5, 2/5) = −3/5 log(3/5) − 2/5 log(2/5) = 0.971 bits

Expected information for attribute:
info([3,2],[4,0],[3,2]) = 5/14 × 0.971 + 4/14 × 0 + 5/14 × 0.971 = 0.693 bits

Computing information gain

Information gain: information before splitting − information after splitting

gain(Outlook) = info([9,5]) − info([2,3],[4,0],[3,2])
              = 0.940 − 0.693
              = 0.247 bits

Information gain for attributes from weather data:

gain(Outlook)     = 0.247 bits
gain(Temperature) = 0.029 bits
gain(Humidity)    = 0.152 bits
gain(Windy)       = 0.048 bits
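
These numbers are easy to reproduce; a short Python sketch (base-2 logarithms; class counts per branch hard-coded from the weather data):

import math

def entropy(counts):
    """Entropy in bits of a list of class counts, with 0 log 0 taken as 0."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def info(subsets):
    """Expected information after a split: weighted average entropy of the subsets."""
    total = sum(sum(s) for s in subsets)
    return sum(sum(s) / total * entropy(s) for s in subsets)

before = entropy([9, 5])                                    # 0.940 bits
print(round(before - info([[2, 3], [4, 0], [3, 2]]), 3))    # gain(Outlook)     = 0.247
print(round(before - info([[2, 2], [4, 2], [3, 1]]), 3))    # gain(Temperature) = 0.029
print(round(before - info([[3, 4], [6, 1]]), 3))            # gain(Humidity)    = 0.152
print(round(before - info([[6, 2], [3, 3]]), 3))            # gain(Windy)       = 0.048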

Continuing to split

(Figure: further splits below the Outlook = Sunny branch)

gain(Temperature) = 0.571 bits
gain(Humidity)    = 0.971 bits
gain(Windy)       = 0.020 bits

Final decision tree

(Figure: the final decision tree for the weather data)

Note: not all leaves need to be pure; sometimes identical instances have different classes
Splitting stops when data can't be split any further

Wishlist for a purity measure

Properties we require from a purity measure:
  When node is pure, measure should be zero
  When impurity is maximal (i.e. all classes equally likely), measure should be maximal
  Measure should obey multistage property (i.e. decisions can be made in several stages):

measure([2,3,4]) = measure([2,7]) + (7/9) × measure([3,4])

Entropy is the only function that satisfies all three properties!

Properties of the entropy

The multistage property:

entropy(p, q, r) = entropy(p, q + r) + (q + r) × entropy(q/(q+r), r/(q+r))

Simplification of computation:

info([2,3,4]) = −2/9 log(2/9) − 3/9 log(3/9) − 4/9 log(4/9)
              = [−2 log 2 − 3 log 3 − 4 log 4 + 9 log 9] / 9

Note: instead of maximizing info gain we could just minimize information

Highly-branching attributes

Problematic: attributes with a large number of values (extreme case: ID code)
Subsets are more likely to be pure if there is a large number of values
  Information gain is biased towards choosing attributes with a large number of values
  This may result in overfitting (selection of an attribute that is non-optimal for prediction)

Another problem: fragmentation

Weather data with ID code

ID code  Outlook   Temp  Humidity  Windy  Play
A        Sunny     Hot   High      False  No
B        Sunny     Hot   High      True   No
C        Overcast  Hot   High      False  Yes
D        Rainy     Mild  High      False  Yes
E        Rainy     Cool  Normal    False  Yes
F        Rainy     Cool  Normal    True   No
G        Overcast  Cool  Normal    True   Yes
H        Sunny     Mild  High      False  No
I        Sunny     Cool  Normal    False  Yes
J        Rainy     Mild  Normal    False  Yes
K        Sunny     Mild  Normal    True   Yes
L        Overcast  Mild  High      True   Yes
M        Overcast  Hot   Normal    False  Yes
N        Rainy     Mild  High      True   No

Tree stump for ID code attribute

Entropy of split:

info(ID code) = info([0,1]) + info([0,1]) + ... + info([0,1]) = 0 bits

Information gain is maximal for ID code (namely 0.940 bits)

Gain ratio

Gain ratio: a modification of the information gain that reduces its bias
Gain ratio takes number and size of branches into account when choosing an attribute
  It corrects the information gain by taking the intrinsic information of a split into account
Intrinsic information: entropy of the distribution of instances into branches (i.e. how much info do we need to tell which branch an instance belongs to)

Computing the gain ratio

Example: intrinsic information for ID code

info([1,1,...,1]) = 14 × (−1/14 × log(1/14)) = 3.807 bits

Value of attribute decreases as intrinsic information gets larger
Definition of gain ratio:

gain_ratio(attribute) = gain(attribute) / intrinsic_info(attribute)

Example:

gain_ratio(ID code) = 0.940 bits / 3.807 bits = 0.246

Gain ratios for weather data

Attribute    Info   Gain: 0.940 − info   Split info               Gain ratio
Outlook      0.693  0.247                info([5,4,5]) = 1.577    0.247/1.577 = 0.157
Temperature  0.911  0.029                info([4,6,4]) = 1.557    0.029/1.557 = 0.019
Humidity     0.788  0.152                info([7,7])   = 1.000    0.152/1.000 = 0.152
Windy        0.892  0.048                info([8,6])   = 0.985    0.048/0.985 = 0.049
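
The gain-ratio figures in this table can be checked with a few lines of Python (entropy is redefined here so the snippet is self-contained; the gains and branch sizes are taken from the slides above):

import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def gain_ratio(gain, branch_sizes):
    """Divide the information gain by the split ("intrinsic") information,
    i.e. the entropy of the distribution of instances over the branches."""
    return gain / entropy(branch_sizes)

print(round(gain_ratio(0.247, [5, 4, 5]), 3))   # Outlook:     0.247 / 1.577 = 0.157
print(round(gain_ratio(0.029, [4, 6, 4]), 3))   # Temperature: 0.029 / 1.557 = 0.019
print(round(gain_ratio(0.152, [7, 7]), 3))      # Humidity:    0.152 / 1.000 = 0.152
print(round(gain_ratio(0.048, [8, 6]), 3))      # Windy:       0.048 / 0.985 = 0.049
print(round(gain_ratio(0.940, [1] * 14), 3))    # ID code: 0.247 (the slide's 0.246, up to rounding)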

More on the gain ratio

"Outlook" still comes out top
However: "ID code" has greater gain ratio
  Standard fix: ad hoc test to prevent splitting on that type of attribute

Problem with gain ratio: it may overcompensate
  May choose an attribute just because its intrinsic information is very low
  Standard fix: only consider attributes with greater than average information gain

Discussion

Top-down induction of decision trees: ID3, algorithm developed by Ross Quinlan
  Gain ratio just one modification of this basic algorithm
  C4.5: deals with numeric attributes, missing values, noisy data

Similar approach: CART
There are many other attribute selection criteria!
(But little difference in accuracy of result)

Covering algorithms

Can convert decision tree into a rule set
  Straightforward, but rule set overly complex
  More effective conversions are not trivial

Instead, can generate rule set directly:
  For each class in turn, find a rule set that covers all instances in it (excluding instances not in the class)

Called a covering approach:
  At each stage a rule is identified that covers some of the instances

Example: generating a rule

If true
then class = a

If x > 1.2
then class = a

If x > 1.2 and y > 2.6
then class = a

Possible rule set for class b:
If x ≤ 1.2 then class = b
If x > 1.2 and y ≤ 2.6 then class = b

Could add more rules, get "perfect" rule set

Rules vs. trees

Corresponding decision tree:
(produces exactly the same predictions)

But: rule sets can be more perspicuous when decision trees suffer from replicated subtrees
Also: in multiclass situations, the covering algorithm concentrates on one class at a time whereas a decision tree learner takes all classes into account

Simple covering algorithm

Generates a rule by adding tests that maximize the rule's accuracy
Similar to situation in decision trees: problem of selecting an attribute to split on
  But: decision tree inducer maximizes overall purity

Each new test reduces the rule's coverage.

Selecting a test

Goal: maximize accuracy
  t: total number of instances covered by rule
  p: positive examples of the class covered by rule
  t − p: number of errors made by rule
Select test that maximizes the ratio p/t

We are finished when p/t = 1 or the set of instances can't be split any further

Example: contact lens data

Rule we seek:
If ?
then recommendation = hard

Possible tests:
Age = Young                             2/8
Age = Pre-presbyopic                    1/8
Age = Presbyopic                        1/8
Spectacle prescription = Myope          3/12
Spectacle prescription = Hypermetrope   1/12
Astigmatism = no                        0/12
Astigmatism = yes                       4/12
Tear production rate = Reduced          0/12
Tear production rate = Normal           4/12

Modified rule and resulting data

Rule with best test added:
If astigmatism = yes
then recommendation = hard

Instances covered by modified rule:

Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
Young           Myope                   Yes          Reduced               None
Young           Myope                   Yes          Normal                Hard
Young           Hypermetrope            Yes          Reduced               None
Young           Hypermetrope            Yes          Normal                hard
Pre-presbyopic  Myope                   Yes          Reduced               None
Pre-presbyopic  Myope                   Yes          Normal                Hard
Pre-presbyopic  Hypermetrope            Yes          Reduced               None
Pre-presbyopic  Hypermetrope            Yes          Normal                None
Presbyopic      Myope                   Yes          Reduced               None
Presbyopic      Myope                   Yes          Normal                Hard
Presbyopic      Hypermetrope            Yes          Reduced               None
Presbyopic      Hypermetrope            Yes          Normal                None

Further refinement

Current state:
If astigmatism = yes
and ?
then recommendation = hard

Possible tests:
Age = Young                             2/4
Age = Pre-presbyopic                    1/4
Age = Presbyopic                        1/4
Spectacle prescription = Myope          3/6
Spectacle prescription = Hypermetrope   1/6
Tear production rate = Reduced          0/6
Tear production rate = Normal           4/6

Modified rule and resulting data

Rule with best test added:
If astigmatism = yes
and tear production rate = normal
then recommendation = hard

Instances covered by modified rule:

Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
Young           Myope                   Yes          Normal                Hard
Young           Hypermetrope            Yes          Normal                hard
Pre-presbyopic  Myope                   Yes          Normal                Hard
Pre-presbyopic  Hypermetrope            Yes          Normal                None
Presbyopic      Myope                   Yes          Normal                Hard
Presbyopic      Hypermetrope            Yes          Normal                None

Further refinement

Current state:
If astigmatism = yes
and tear production rate = normal
and ?
then recommendation = hard

Possible tests:
Age = Young                             2/2
Age = Pre-presbyopic                    1/2
Age = Presbyopic                        1/2
Spectacle prescription = Myope          3/3
Spectacle prescription = Hypermetrope   1/3

Tie between the first and the fourth test
  We choose the one with greater coverage

The result

Final rule:
If astigmatism = yes
and tear production rate = normal
and spectacle prescription = myope
then recommendation = hard

Second rule for recommending "hard lenses" (built from instances not covered by the first rule):
If age = young and astigmatism = yes
and tear production rate = normal
then recommendation = hard

These two rules cover all "hard lenses":
  Process is repeated with the other two classes

Pseudocode for PRISM

For each class C
    Initialize E to the instance set
    While E contains instances in class C
        Create a rule R with an empty left-hand side that predicts class C
        Until R is perfect (or there are no more attributes to use) do
            For each attribute A not mentioned in R, and each value v,
                Consider adding the condition A = v to the left-hand side of R
            Select A and v to maximize the accuracy p/t
            (break ties by choosing the condition with the largest p)
            Add A = v to R
        Remove the instances covered by R from E
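
A compact Python sketch of this pseudocode (instances as dicts with the class under class_attr; purely illustrative, not the Weka implementation):

def prism(instances, attributes, class_attr):
    """Separate-and-conquer rule learner: for each class, repeatedly grow a
    rule that is as accurate as possible (maximizing p/t, ties broken on p)
    and remove the instances it covers."""
    rules = []
    for cls in set(inst[class_attr] for inst in instances):
        remaining = list(instances)
        while any(inst[class_attr] == cls for inst in remaining):
            conditions, covered = {}, list(remaining)
            while any(inst[class_attr] != cls for inst in covered) and len(conditions) < len(attributes):
                best = None
                for attr in (a for a in attributes if a not in conditions):
                    for value in set(inst[attr] for inst in covered):
                        subset = [inst for inst in covered if inst[attr] == value]
                        p = sum(inst[class_attr] == cls for inst in subset)
                        t = len(subset)
                        if best is None or (p / t, p) > (best[0] / best[1], best[0]):
                            best = (p, t, attr, value)
                conditions[best[2]] = best[3]
                covered = [inst for inst in covered if inst[best[2]] == best[3]]
            rules.append((cls, dict(conditions)))
            remaining = [inst for inst in remaining
                         if not all(inst[a] == v for a, v in conditions.items())]
    return rules

Run on the contact lens data, the rules grown for the "hard" class follow the refinement steps shown on the preceding slides.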

Rules vs. decision lists

PRISM with outer loop removed generates a decision list for one class
  Subsequent rules are designed for instances that are not covered by previous rules
  But: order doesn't matter because all rules predict the same class

Outer loop considers all classes separately
  No order dependence implied

Problems: overlapping rules, default rule required

Separate and conquer

Methods like PRISM (for dealing with one class) are separate-and-conquer algorithms:
  First, identify a useful rule
  Then, separate out all the instances it covers
  Finally, "conquer" the remaining instances

Difference to divide-and-conquer methods:
  Subset covered by a rule doesn't need to be explored any further

Mining association rules

Naive method for finding association rules:
  Use separate-and-conquer method
  Treat every possible combination of attribute values as a separate class

Two problems:
  Computational complexity
  Resulting number of rules (which would have to be pruned on the basis of support and confidence)

But: we can look for high-support rules directly!

Itemsets

Support: number of instances correctly covered by association rule
  The same as the number of instances covered by all tests in the rule (LHS and RHS!)
Item: one test/attribute-value pair
Itemset: all items occurring in a rule
Goal: only rules that exceed pre-defined support
  Do it by finding all itemsets with the given minimum support and generating rules from them!

Weather data

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

Itemsets for weather data

One-item sets:    Outlook = Sunny (5);  Temperature = Cool (4);  ...
Two-item sets:    Outlook = Sunny, Temperature = Hot (2);  Outlook = Sunny, Humidity = High (3);  ...
Three-item sets:  Outlook = Sunny, Temperature = Hot, Humidity = High (2);
                  Outlook = Sunny, Humidity = High, Windy = False (2);  ...
Four-item sets:   Outlook = Sunny, Temperature = Hot, Humidity = High, Play = No (2);
                  Outlook = Rainy, Temperature = Mild, Windy = False, Play = Yes (2);  ...

In total: 12 one-item sets, 47 two-item sets, 39 three-item sets, 6 four-item sets and 0 five-item sets (with minimum support of two)

Generating rules from an itemset

Once all itemsets with minimum support have been generated, we can turn them into rules
Example:
Humidity = Normal, Windy = False, Play = Yes (4)

Seven (2^N − 1) potential rules:

If Humidity = Normal and Windy = False then Play = Yes                    4/4
If Humidity = Normal and Play = Yes then Windy = False                    4/6
If Windy = False and Play = Yes then Humidity = Normal                    4/6
If Humidity = Normal then Windy = False and Play = Yes                    4/7
If Windy = False then Humidity = Normal and Play = Yes                    4/8
If Play = Yes then Humidity = Normal and Windy = False                    4/9
If True then Humidity = Normal and Windy = False and Play = Yes           4/12

Rules for weather data

Rules with support > 1 and confidence = 100%:

     Association rule                                            Sup.  Conf.
1    Humidity = Normal, Windy = False  ⇒  Play = Yes             4     100%
2    Temperature = Cool  ⇒  Humidity = Normal                    4     100%
3    Outlook = Overcast  ⇒  Play = Yes                           4     100%
4    Temperature = Cool, Play = Yes  ⇒  Humidity = Normal        3     100%
...  ...                                                         ...   ...
58   Outlook = Sunny, Temperature = Hot  ⇒  Humidity = High      2     100%

In total:
3 rules with support four
5 with support three
50 with support two

Example rules from the same set

Itemset:
Temperature = Cool, Humidity = Normal, Windy = False, Play = Yes (2)

Resulting rules (all with 100% confidence):

Temperature = Cool, Windy = False  ⇒  Humidity = Normal, Play = Yes
Temperature = Cool, Windy = False, Humidity = Normal  ⇒  Play = Yes
Temperature = Cool, Windy = False, Play = Yes  ⇒  Humidity = Normal

due to the following "frequent" itemsets:

Temperature = Cool, Windy = False                      (2)
Temperature = Cool, Humidity = Normal, Windy = False   (2)
Temperature = Cool, Windy = False, Play = Yes          (2)

Generating itemsets efficiently

How can we efficiently find all frequent itemsets?
Finding one-item sets: easy
Idea: use one-item sets to generate two-item sets, two-item sets to generate three-item sets, ...
  If (A B) is a frequent itemset, then (A) and (B) have to be frequent itemsets as well!
  In general: if X is a frequent k-item set, then all (k−1)-item subsets of X are also frequent
Compute k-item sets by merging (k−1)-item sets

Example

Given: five three-item sets
(A B C), (A B D), (A C D), (A C E), (B C D)

Lexicographically ordered!

Candidate four-item sets:
(A B C D)    OK because of (A C D), (B C D)
(A C D E)    Not OK because of (C D E)

Final check by counting instances in the dataset!
(k−1)-item sets are stored in a hash table
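
A sketch of this candidate-generation step in Python, using sorted tuples of items and a set in place of the hash table (names illustrative):

from itertools import combinations

def candidate_itemsets(frequent_k_minus_1):
    """Merge (k-1)-item sets that share their first k-2 items (lexicographic order),
    then prune candidates that have an infrequent (k-1)-item subset."""
    frequent = set(frequent_k_minus_1)           # plays the role of the hash table
    candidates = []
    for a, b in combinations(sorted(frequent), 2):
        if a[:-1] == b[:-1]:                     # same prefix: join
            candidate = a + (b[-1],)
            if all(subset in frequent
                   for subset in combinations(candidate, len(candidate) - 1)):
                candidates.append(candidate)
    return candidates

three_sets = [('A','B','C'), ('A','B','D'), ('A','C','D'), ('A','C','E'), ('B','C','D')]
print(candidate_itemsets(three_sets))   # [('A', 'B', 'C', 'D')] -- (A C D E) is pruned

The surviving candidates would then still be checked against the dataset to confirm their actual support, as the slide notes.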

Generating rules efficiently

We are looking for all high-confidence rules
  Support of antecedent obtained from hash table
  But: the brute-force method is (2^N − 1)

Better way: building (c + 1)-consequent rules from c-consequent ones
  Observation: a (c + 1)-consequent rule can only hold if all corresponding c-consequent rules also hold

Resulting algorithm similar to the procedure for large itemsets

Example

1-consequent rules:

If Outlook = Sunny and Windy = False and Play = No
then Humidity = High (2/2)
If Humidity = High and Windy = False and Play = No
then Outlook = Sunny (2/2)

Corresponding 2-consequent rule:

If Windy = False and Play = No
then Outlook = Sunny and Humidity = High (2/2)

Final check of antecedent against hash table!

Association rules: discussion

Above method makes one pass through the data for each different size of itemset
  Other possibility: generate (k+2)-item sets just after (k+1)-item sets have been generated
  Result: more (k+2)-item sets than necessary will be considered, but fewer passes through the data
  Makes sense if data too large for main memory

Practical issue: generating a certain number of rules (e.g. by incrementally reducing min. support)

Other issues

Standard ARFF format very inefficient for typical market basket data
  Attributes represent items in a basket and most items are usually missing
  Data should be represented in sparse format

Instances are also called transactions
Confidence is not necessarily the best measure
  Example: milk occurs in almost every supermarket transaction
  Other measures have been devised (e.g. lift)

Linear models: linear regression

Work most naturally with numeric attributes
Standard technique for numeric prediction
  Outcome is linear combination of attributes:

x = w0 + w1 a1 + w2 a2 + ... + wk ak

Weights are calculated from the training data
Predicted value for the first training instance a^(1):

w0 a0^(1) + w1 a1^(1) + w2 a2^(1) + ... + wk ak^(1) = Σ_{j=0..k} wj aj^(1)

(assuming each instance is extended with a constant attribute with value 1)

Minimizing the squared error

Choose k + 1 coefficients to minimize the squared error on the training data
Squared error:

Σ_{i=1..n} ( x^(i) − Σ_{j=0..k} wj aj^(i) )²

Derive coefficients using standard matrix operations
Can be done if there are more instances than attributes (roughly speaking)
Minimizing the absolute error is more difficult
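
In practice the "standard matrix operations" amount to solving a least-squares problem; a minimal NumPy sketch on synthetic data (illustrative only):

import numpy as np

def linear_regression(A, x):
    """Least-squares weights for x ~ A w, with a bias column of ones prepended."""
    A = np.column_stack([np.ones(len(A)), A])       # constant attribute a0 = 1
    w, *_ = np.linalg.lstsq(A, x, rcond=None)       # minimizes the squared error
    return w

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 2))                        # 20 instances, 2 attributes
x = 1.0 + 2.0 * A[:, 0] - 3.0 * A[:, 1] + 0.1 * rng.normal(size=20)
print(linear_regression(A, x))                      # roughly [1.0, 2.0, -3.0]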

Classification

Any regression technique can be used for classification
  Training: perform a regression for each class, setting the output to 1 for training instances that belong to the class, and 0 for those that don't
  Prediction: predict class corresponding to the model with largest output value (membership value)

For linear regression this is known as multi-response linear regression
Problem: membership values are not in the [0,1] range, so they aren't proper probability estimates

Linear models: logistic regression

Builds a linear model for a transformed target variable
Assume we have two classes
Logistic regression replaces the target

P[1 | a1, a2, ..., ak]

by this target

log( P[1 | a1, a2, ..., ak] / (1 − P[1 | a1, a2, ..., ak]) )

The logit transformation maps [0,1] to (−∞, +∞)

Logit transformation

Resulting model:

Pr[1 | a1, a2, ..., ak] = 1 / (1 + e^(−(w0 + w1 a1 + ... + wk ak)))

Example logistic regression model

Model with w0 = 0.5 and w1 = 1:

(Figure: the resulting sigmoid-shaped class probability curve)

Parameters are found from training data using maximum likelihood

Maximum likelihood

Aim: maximize probability of training data with respect to parameters
Can use logarithms of probabilities and maximize the log-likelihood of the model:

Σ_{i=1..n} [ (1 − x^(i)) log(1 − Pr[1 | a1^(i), ..., ak^(i)]) + x^(i) log Pr[1 | a1^(i), ..., ak^(i)] ]

where the x^(i) are either 0 or 1
Weights wi need to be chosen to maximize the log-likelihood (relatively simple method: iteratively re-weighted least squares)
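
The slide's method of choice is iteratively re-weighted least squares; as a simpler stand-in, here is a sketch that maximizes the same log-likelihood by plain gradient ascent (synthetic data, illustrative only):

import numpy as np

def fit_logistic(A, x, lr=0.1, iterations=2000):
    """Maximize the two-class log-likelihood; A has a leading column of ones."""
    w = np.zeros(A.shape[1])
    for _ in range(iterations):
        p = 1.0 / (1.0 + np.exp(-A @ w))    # Pr[1 | a] under the current weights
        w += lr * A.T @ (x - p) / len(x)    # gradient of the log-likelihood
    return w

rng = np.random.default_rng(1)
a1 = rng.normal(size=100)
A = np.column_stack([np.ones(100), a1])
x = (a1 + 0.5 + 0.5 * rng.normal(size=100) > 0).astype(float)   # noisy labels
print(fit_logistic(A, x))   # weights placing the decision boundary roughly at a1 = -0.5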

Multiple classes

Can perform logistic regression independently for each class (like multi-response linear regression)
Problem: probability estimates for different classes won't sum to one
Better: train coupled models by maximizing likelihood over all classes
Alternative that often works well in practice: pairwise classification

Pairwise classification

Idea: build a model for each pair of classes, using only training data from those classes
Problem? Have to solve k(k−1)/2 classification problems for a k-class problem
Turns out not to be a problem in many cases because training sets become small:
  Assume data evenly distributed, i.e. 2n/k per learning problem for n instances in total
  Suppose learning algorithm is linear in n
  Then runtime of pairwise classification is proportional to (k(k−1)/2) × (2n/k) = (k−1)n

Linear models are hyperplanes

Decision boundary for two-class logistic regression is where probability equals 0.5:

Pr[1 | a1, a2, ..., ak] = 1 / (1 + exp(−(w0 + w1 a1 + ... + wk ak))) = 0.5

which occurs when

w0 + w1 a1 + ... + wk ak = 0

Thus logistic regression can only separate data that can be separated by a hyperplane
Multi-response linear regression has the same problem. Class 1 is assigned if:

w0^(1) + w1^(1) a1 + ... + wk^(1) ak > w0^(2) + w1^(2) a1 + ... + wk^(2) ak

i.e. (w0^(1) − w0^(2)) + (w1^(1) − w1^(2)) a1 + ... + (wk^(1) − wk^(2)) ak > 0

Linear models: the perceptron

Don't actually need probability estimates if all we want to do is classification
Different approach: learn a separating hyperplane
Assumption: data is linearly separable
Algorithm for learning a separating hyperplane: perceptron learning rule

Hyperplane: 0 = w0 a0 + w1 a1 + w2 a2 + ... + wk ak
where we again assume that there is a constant attribute with value 1 (bias)

If the sum is greater than zero we predict the first class, otherwise the second class

The algorithm

Set all weights to zero
Until all instances in the training data are classified correctly
    For each instance I in the training data
        If I is classified incorrectly by the perceptron
            If I belongs to the first class add it to the weight vector
            else subtract it from the weight vector

Why does this work?
Consider a situation where an instance a pertaining to the first class has been added:

(w0 + a0) a0 + (w1 + a1) a1 + (w2 + a2) a2 + ... + (wk + ak) ak

This means the output for a has increased by:

a0 a0 + a1 a1 + a2 a2 + ... + ak ak

This number is always positive, thus the hyperplane has moved in the correct direction (and we can show the output decreases for instances of the other class)
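
The same update in a few lines of NumPy (classes coded +1/−1, a leading bias attribute of 1, and a max_epochs cap for non-separable data are illustrative assumptions):

import numpy as np

def perceptron(A, y, max_epochs=100):
    """Perceptron learning rule: add misclassified first-class instances to the
    weight vector, subtract misclassified second-class ones."""
    w = np.zeros(A.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for a, label in zip(A, y):            # label is +1 (first class) or -1
            if label * (w @ a) <= 0:          # classified incorrectly (or on the plane)
                w += label * a
                mistakes += 1
        if mistakes == 0:
            break
    return w

If the data really is linearly separable, the loop stops after a finite number of mistakes; otherwise max_epochs bounds the run time.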

The perceptron as a neural network

(Figure: an input layer, one node per attribute, connected directly to a single output node)

Linear models: Winnow

Another mistake-driven algorithm for finding a separating hyperplane
  Assumes binary data (i.e. attribute values are either zero or one)
Difference: multiplicative updates instead of additive updates
  Weights are multiplied by a user-specified parameter α (or its inverse)
Another difference: user-specified threshold parameter θ
  Predict first class if

w0 a0 + w1 a1 + w2 a2 + ... + wk ak > θ

The algorithm

while some instances are misclassified
    for each instance a in the training data
        classify a using the current weights
        if the predicted class is incorrect
            if a belongs to the first class
                for each ai that is 1, multiply wi by alpha
                (if ai is 0, leave wi unchanged)
            otherwise
                for each ai that is 1, divide wi by alpha
                (if ai is 0, leave wi unchanged)

Winnow is very effective in homing in on relevant features (it is "attribute efficient")
Can also be used in an online setting in which new instances arrive continuously (like the perceptron algorithm)
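
The same algorithm as Python (binary 0/1 attributes; the initial weights of 1 and the default threshold theta = k/2 are conventional choices, not specified on the slide):

def winnow(instances, labels, alpha=2.0, theta=None, max_epochs=100):
    """Multiplicative, mistake-driven updates: promote the weights of active
    attributes on false negatives, demote them on false positives."""
    k = len(instances[0])
    theta = theta if theta is not None else k / 2
    w = [1.0] * k                                    # start from uniform weights
    for _ in range(max_epochs):
        mistakes = 0
        for a, label in zip(instances, labels):      # label 1 = first class, 0 = second
            predicted = 1 if sum(wi * ai for wi, ai in zip(w, a)) > theta else 0
            if predicted != label:
                mistakes += 1
                for i, ai in enumerate(a):
                    if ai == 1:
                        w[i] = w[i] * alpha if label == 1 else w[i] / alpha
        if mistakes == 0:
            break
    return w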

Balanced Winnow

Winnow doesn't allow negative weights and this can be a drawback in some applications
Balanced Winnow maintains two weight vectors, one for each class:

while some instances are misclassified
    for each instance a in the training data
        classify a using the current weights
        if the predicted class is incorrect
            if a belongs to the first class
                for each ai that is 1, multiply wi+ by alpha and divide wi- by alpha
                (if ai is 0, leave wi+ and wi- unchanged)
            otherwise
                for each ai that is 1, multiply wi- by alpha and divide wi+ by alpha
                (if ai is 0, leave wi+ and wi- unchanged)

Instance is classified as belonging to the first class (of two classes) if:

(w0+ − w0−) a0 + (w1+ − w1−) a1 + ... + (wk+ − wk−) ak > θ

Instance-based learning

Distance function defines what's learned
Most instance-based schemes use Euclidean distance:

√( (a1^(1) − a1^(2))² + (a2^(1) − a2^(2))² + ... + (ak^(1) − ak^(2))² )

a^(1) and a^(2): two instances with k attributes
Taking the square root is not required when comparing distances
Other popular metric: city-block metric
  Adds differences without squaring them

Normalization and other issues

Different attributes are measured on different scales → need to be normalized:

ai = (vi − min vi) / (max vi − min vi)

vi: the actual value of attribute i

Nominal attributes: distance either 0 or 1
Common policy for missing values: assumed to be maximally distant (given normalized attributes)
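
A sketch of 1-nearest-neighbour classification with this normalization (numeric attributes only, plain linear scan; a query must be rescaled with the same minima and maxima as the training data):

def normalize(data):
    """Rescale every numeric attribute to [0, 1]: ai = (vi - min vi) / (max vi - min vi)."""
    lo = [min(col) for col in zip(*data)]
    hi = [max(col) for col in zip(*data)]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in data]

def nearest_neighbor(train, labels, query):
    """Label of the training instance at smallest squared Euclidean distance
    (the square root is omitted because it does not change the ranking)."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(zip(train, labels), key=lambda pair: dist(pair[0], query))[1]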

Finding nearest neighbors efficiently

Simplest way of finding the nearest neighbour: linear scan of the data
  Classification takes time proportional to the product of the number of instances in training and test sets

Nearest-neighbor search can be done more efficiently using appropriate data structures
We will discuss two methods that represent training data in a tree structure:
kD-trees and ball trees

kD-tree example

(Figure)

Using kD-trees: example

(Figure)

More on kD-trees

Complexity depends on depth of tree, given by logarithm of number of nodes
Amount of backtracking required depends on quality of tree ("square" vs. "skinny" nodes)
How to build a good tree? Need to find a good split point and split direction
  Split direction: direction with greatest variance
  Split point: median value along that direction
  Using the value closest to the mean (rather than the median) can be better if data is skewed
Can apply this recursively

Building trees incrementally

Big advantage of instance-based learning: classifier can be updated incrementally
  Just add new training instance!
Can we do the same with kD-trees?
Heuristic strategy:
  Find leaf node containing new instance
  Place instance into leaf if leaf is empty
  Otherwise, split leaf according to the longest dimension (to preserve squareness)
Tree should be rebuilt occasionally (e.g. if depth grows to twice the optimum depth)

Ball trees

Problem in kD-trees: corners
Observation: no need to make sure that regions don't overlap
Can use balls (hyperspheres) instead of hyperrectangles
  A ball tree organizes the data into a tree of k-dimensional hyperspheres
  Normally allows for a better fit to the data and thus more efficient search

Ball tree example

(Figure)

Using ball trees

Nearest-neighbor search is done using the same backtracking strategy as in kD-trees
Ball can be ruled out from consideration if: distance from target to ball's center exceeds ball's radius plus current upper bound

Building ball trees

Ball trees are built top down (like kD-trees)
Don't have to continue until leaf balls contain just two points: can enforce minimum occupancy (same in kD-trees)
Basic problem: splitting a ball into two
Simple (linear-time) split selection strategy:
  Choose point farthest from ball's center
  Choose second point farthest from the first one
  Assign each point to these two points
  Compute cluster centers and radii based on the two subsets to get two balls

Discussion of nearest-neighbor learning

Often very accurate
Assumes all attributes are equally important
  Remedy: attribute selection or weights

Possible remedies against noisy instances:
  Take a majority vote over the k nearest neighbors
  Removing noisy instances from dataset (difficult!)

Statisticians have used k-NN since the early 1950s
  If n → ∞ and k/n → 0, error approaches minimum

kD-trees become inefficient when number of attributes is too large (approximately > 10)
Ball trees (which are instances of metric trees) work well in higher-dimensional spaces

More discussion

Instead of storing all training instances, compress them into regions
  Example: hyperpipes (from discussion of 1R)
Another simple technique (Voting Feature Intervals):
  Construct intervals for each attribute
    Discretize numeric attributes
    Treat each value of a nominal attribute as an "interval"
  Count number of times class occurs in interval
  Prediction is generated by letting intervals vote (those that contain the test instance)

Clustering

Clustering techniques apply when there is no class to be predicted
Aim: divide instances into "natural" groups
As we've seen, clusters can be:
  disjoint vs. overlapping
  deterministic vs. probabilistic
  flat vs. hierarchical
We'll look at a classic clustering algorithm called k-means
  k-means clusters are disjoint, deterministic, and flat

The k-means algorithm

To cluster data into k groups (k is predefined):

0. Choose k cluster centers, e.g. at random
1. Assign instances to clusters, based on distance to cluster centers
2. Compute centroids of clusters
3. Go to step 1, until convergence
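
A minimal NumPy sketch of these steps (random initial centers drawn from the data, Euclidean distance; illustrative only):

import numpy as np

def k_means(X, k, max_iterations=100, seed=0):
    """Assign points to the nearest center, recompute centroids,
    repeat until the assignment stops changing."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]        # step 0
    assignment = None
    for _ in range(max_iterations):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)                     # step 1
        if assignment is not None and np.array_equal(new_assignment, assignment):
            break                                                 # convergence
        assignment = new_assignment
        centers = np.array([X[assignment == j].mean(axis=0) if np.any(assignment == j)
                            else centers[j] for j in range(k)])   # step 2
    return centers, assignment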

Discussion

Algorithm minimizes squared distance to cluster centers
Result can vary significantly based on initial choice of seeds
  Can get trapped in a local minimum

Example: (Figure: instances and initial cluster centres leading to a local minimum)

To increase chance of finding global optimum: restart with different random seeds
Can be applied recursively with k = 2

Faster distance calculations

Can we use kD-trees or ball trees to speed up the process? Yes:
  First, build tree, which remains static, for all the data points
  At each node, store number of instances and sum of all instances
  In each iteration, descend tree and find out which cluster each node belongs to
    Can stop descending as soon as we find out that a node belongs entirely to a particular cluster
    Use statistics stored at the nodes to compute new cluster centers

Example

(Figure)

Multi-instance learning

Simplicity-first methodology can be applied to multi-instance learning with surprisingly good results
Two simple approaches, both using standard single-instance learners:
  Manipulate the input to learning
  Manipulate the output of learning

Aggregating the input

Convert multi-instance problem into single-instance one
  Summarize the instances in a bag by computing mean, mode, minimum and maximum as new attributes
  "Summary" instance retains the class label of its bag
  To classify a new bag the same process is used

Results using summary instances with minimum and maximum + support vector machine classifier are comparable to special-purpose multi-instance learners on the original drug discovery problem

Aggregating the output

Learn a single-instance classifier directly from the original instances in each bag
  Each instance is given the class of the bag it originates from

To classify a new bag:
  Produce a prediction for each instance in the bag
  Aggregate the predictions to produce a prediction for the bag as a whole
  One approach: treat predictions as votes for the various class labels
  A problem: bags can contain differing numbers of instances → give each instance a weight inversely proportional to the bag's size

Comments on basic methods

Bayes' rule stems from his "Essay towards solving a problem in the doctrine of chances" (1763)
  Difficult bit in general: estimating prior probabilities (easy in the case of naive Bayes)
Extension of naive Bayes: Bayesian networks (which we'll discuss later)

Algorithm for association rules is called APRIORI
Minsky and Papert (1969) showed that linear classifiers have limitations, e.g. can't learn XOR
  But: combinations of them can (→ multi-layer neural nets, which we'll discuss later)
