
Understanding XGBoost Model on Otto Dataset
Michaël Benesty

Introduction
XGBoost is an implementation of the famous gradient boosting algorithm. This model is often described as a black box, meaning it works well but it is not trivial to understand how. Indeed, the model is made of hundreds (thousands?) of decision trees. You may wonder how a human could possibly have a general view of such a model.

While XGBoost is known for its speed and predictive accuracy, it also comes with various functions to help you understand the model. The purpose of this RMarkdown document is to demonstrate how we can leverage the functions already implemented in the XGBoost R package for that purpose. Of course, everything shown below can be applied to any dataset you may have to manipulate at work or elsewhere!

First we will train a model on the OTTO dataset, then we will generate two visualisations to get a clue of what is important to the model, and finally we will see how we can leverage this information.

Preparation of the data
This part is based on the tutorial example by Tong He (https://github.com/dmlc/xgboost/blob/master/demo/kaggle-otto/otto_train_pred.R)

First, let's load the packages and the dataset.

require(xgboost)

## Loading required package: xgboost

require(methods)

## Loading required package: methods

require(data.table)

## Loading required package: data.table

require(magrittr)

## Loading required package: magrittr

train <- fread('../input/train.csv', header = T, stringsAsFactors = F)
test <- fread('../input/test.csv', header = TRUE, stringsAsFactors = F)

magrittr and data.table are here to make the code cleaner and faster.
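
As a tiny illustration of the "cleaner" part (a minimal sketch, not part of the original tutorial): the magrittr pipe passes the left-hand value as the first argument of the right-hand call, so nested calls read left to right.

# Classic nesting: read from the inside out
round(sqrt(10), 2)

# Same computation with the magrittr pipe (already loaded above): read left to right
10 %>% sqrt %>% round(2)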

Let's see what is in this dataset.

# Train dataset dimensions
dim(train)

## [1] 61878    95

# Training content
train[1:6, 1:5, with = F]

##    id feat_1 feat_2 feat_3 feat_4
## 1:  1      1      0      0      0
## 2:  2      0      0      0      0
## 3:  3      0      0      0      0
## 4:  4      1      0      0      1
## 5:  5      0      0      0      0
## 6:  6      2      1      0      0

# Test dataset dimensions
dim(train)

## [1] 61878    95

# Test content
test[1:6, 1:5, with = F]

##    id feat_1 feat_2 feat_3 feat_4
## 1:  1      0      0      0      0
## 2:  2      2      2     14     16
## 3:  3      0      1      1     21
## 4:  4      0      0      0      1
## 5:  5      1      0      0      1
## 6:  6      0      0      0      0

We only display the first 6 rows and first 5 columns for convenience.

Each column represents a feature measured by an integer. Each row is a product.

Obviously the first column (ID) doesn't contain any useful information. To let the algorithm focus on real stuff, we will delete the column.

# Delete ID column in training dataset
train[, id := NULL]

# Delete ID column in testing dataset
test[, id := NULL]

According to the OTTO challenge description, we have here a multiclass classification challenge. We need to extract the labels (here the names of the different classes) from the dataset. We only have two files (test and training); it seems logical that the training file contains the classes we are looking for. Usually the labels are in the first or the last column. Let's check the content of the last column.

# Check the content of the last column
train[1:6, ncol(train), with = F]

##     target
## 1: Class_1
## 2: Class_1
## 3: Class_1
## 4: Class_1
## 5: Class_1
## 6: Class_1

# Save the name of the last column
nameLastCol <- names(train)[ncol(train)]

The classes are provided as character strings in the ncol(train)th column, called nameLastCol. As you may know, XGBoost doesn't support anything else than numbers. So we will convert the classes to integers. Moreover, according to the documentation, they should start at 0.

For that purpose, we will:

extract the target column
remove Class_ from each class name
convert to integers
subtract 1 from the new value

# Convert classes to numbers
y <- train[, nameLastCol, with = F][[1]] %>% gsub('Class_', '', .) %>% {as.integer(.) - 1}
# Display the first 5 levels
y[1:5]

## [1] 0 0 0 0 0

We remove the label column from the training dataset, otherwise XGBoost would use it to guess the labels!!!

train[, nameLastCol := NULL, with = F]

data.table is an awesome implementation of data.frame, unfortunately it is not a format supported natively by XGBoost. We need to convert both datasets (training and test) into numeric Matrix format.

trainMatrix <- train[, lapply(.SD, as.numeric)] %>% as.matrix
testMatrix <- test[, lapply(.SD, as.numeric)] %>% as.matrix

Model training
Before training, we will use cross-validation to evaluate our error rate.

Basically XGBoost will divide the training data into nfold parts, then XGBoost will retain the first part and use it as the test data. Then it will reintegrate the first part into the training dataset and retain the second part, do a training, and so on.

Look at the function documentation for more information.

numberOfClasses <- max(y) + 1

param <- list("objective" = "multi:softprob",
              "eval_metric" = "mlogloss",
              "num_class" = numberOfClasses)

cv.nround <- 5
cv.nfold <- 3

bst.cv = xgb.cv(param = param, data = trainMatrix, label = y,
                nfold = cv.nfold, nrounds = cv.nround)

## [0] train-mlogloss:1.539950+0.003540 test-mlogloss:1.555506+0.001696
## [1] train-mlogloss:1.280621+0.002441 test-mlogloss:1.304418+0.001335
## [2] train-mlogloss:1.111787+0.003201 test-mlogloss:1.142924+0.002505
## [3] train-mlogloss:0.991269+0.003233 test-mlogloss:1.029022+0.002207
## [4] train-mlogloss:0.899486+0.003829 test-mlogloss:0.942855+0.002007

As we can see, the error rate is low on the test dataset (for a model trained in only a few minutes).

Finally, we are ready to train the real model!!!

nround = 50
bst = xgboost(param = param, data = trainMatrix, label = y, nrounds = nround)

## [0] train-mlogloss:1.539929
## [1] train-mlogloss:1.284352
## [2] train-mlogloss:1.116242
## [3] train-mlogloss:0.997410
## [4] train-mlogloss:0.908786
## [5] train-mlogloss:0.837502
## [6] train-mlogloss:0.780620
## [7] train-mlogloss:0.735472
## [8] train-mlogloss:0.696930
## [9] train-mlogloss:0.666730
## [10] train-mlogloss:0.641023
## [11] train-mlogloss:0.618734
## [12] train-mlogloss:0.599407
## [13] train-mlogloss:0.583202
## [14] train-mlogloss:0.568400
## [15] train-mlogloss:0.555463
## [16] train-mlogloss:0.543348
## [17] train-mlogloss:0.532382
## [18] train-mlogloss:0.522701
## [19] train-mlogloss:0.513794
## [20] train-mlogloss:0.506249
## [21] train-mlogloss:0.497970
## [22] train-mlogloss:0.491400
## [23] train-mlogloss:0.484099
## [24] train-mlogloss:0.477010
## [25] train-mlogloss:0.470935
## [26] train-mlogloss:0.466101
## [27] train-mlogloss:0.461392
## [28] train-mlogloss:0.456607
## [29] train-mlogloss:0.450932
## [30] train-mlogloss:0.446368
## [31] train-mlogloss:0.442488
## [32] train-mlogloss:0.437648
## [33] train-mlogloss:0.433682
## [34] train-mlogloss:0.428969
## [35] train-mlogloss:0.424687
## [36] train-mlogloss:0.421398
## [37] train-mlogloss:0.418917
## [38] train-mlogloss:0.415504
## [39] train-mlogloss:0.411823
## [40] train-mlogloss:0.407470
## [41] train-mlogloss:0.404227
## [42] train-mlogloss:0.401174
## [43] train-mlogloss:0.397705
## [44] train-mlogloss:0.394443
## [45] train-mlogloss:0.392279
## [46] train-mlogloss:0.389940
## [47] train-mlogloss:0.387887
## [48] train-mlogloss:0.385097
## [49] train-mlogloss:0.382814

Model understanding
Feature importance
So far, we have built a model made of nround trees.

To build a tree, the dataset is divided recursively several times. At the end of the process you get groups of observations (here, these observations are properties regarding OTTO products).

Each division operation is called a split.

Each group at each division level is called a branch and the deepest level is called a leaf.

In the final model, these leaves are supposed to be as pure as possible for each tree, meaning in our case that each leaf should be made of one class of OTTO product only (of course it is not true, but that's what we try to achieve in a minimum of splits).

Not all splits are equally important. Basically the first split of a tree will have more impact on the purity than, for instance, the deepest split. Intuitively, we understand that the first split does most of the work, and the following splits focus on smaller parts of the dataset which have been misclassified by the first split.

In the same way, in Boosting we try to optimize the misclassification at each round (it is called the loss). So the first tree will do the big work and the following trees will focus on the remaining, on the parts not correctly learned by the previous trees.

The improvement brought by each split can be measured; it is the gain.

Each split is done on one feature only, at one value.
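
For reference, the gain of a split has a closed form (this is the standard formula from the XGBoost paper, not something derived in this article): writing $G_L, H_L$ and $G_R, H_R$ for the sums of the loss gradients and hessians falling into the left and right branches, $\lambda$ for the L2 regularisation term and $\gamma$ for the complexity penalty,

$$\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$$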

Let's see what the model looks like.

model <- xgb.dump(bst, with.stats = T)
model[1:10]

##[1]"booster[0]"
##[2]"0:[f16<1.5]yes=1,no=2,missing=1,gain=309.719,cover=12222.8"
##[3]"1:[f29<26.5]yes=3,no=4,missing=3,gain=161.964,cover=11424"
##[4]"3:[f77<2.5]yes=7,no=8,missing=7,gain=106.092,cover=11416.3"
##[5]"7:[f52<12.5]yes=13,no=14,missing=13,gain=43.1389,cover=11211.9"
##[6]"13:[f76<1.5]yes=25,no=26,missing=25,gain=37.407,cover=11143.5"
##[7]"25:[f16<2.00001]yes=49,no=50,missing=50,gain=36.3329,cover=10952.1"
##[8]"49:leaf=0.0905567,cover=1090.77"
##[9]"50:leaf=0.148413,cover=9861.33"
##[10]"26:[f83<26]yes=51,no=52,missing=52,gain=167.766,cover=191.407"

For convenience, we are displaying the first 10 lines of the model only.

Clearly, it is not easy to understand what it means.

Basically each line represents a branch: there is the tree ID, the feature ID, the point where it splits, and information regarding the next branches (left, right, and which to take when the value for this feature is N/A).
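
If you prefer to explore this structure as a table rather than raw text, the package also ships a helper that parses the trees into a data.table. A minimal sketch (assuming xgb.model.dt.tree accepts the feature names and the fitted model, which may vary slightly across package versions):

# Parse the fitted model into a table with one row per node/branch
featureNames <- dimnames(trainMatrix)[[2]]
treeTable <- xgb.model.dt.tree(feature_names = featureNames, model = bst)

# Inspect the first nodes of the first tree: feature, split value, gain, cover
treeTable[Tree == 0][1:6]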

Hopefully, XGBoost offers a better representation: feature importance.

Feature importance is about averaging the gain of each feature for all splits and all trees.

Then we can use the function xgb.plot.importance.

# Get the feature real names
names <- dimnames(trainMatrix)[[2]]

# Compute feature importance matrix
importance_matrix <- xgb.importance(names, model = bst)

# Nice graph
xgb.plot.importance(importance_matrix[1:10,])

> To make it understandable we first extract the column names from the Matrix.

Interpretation
In the feature importance plot above, we can see the first 10 most important features.

This function gives a color to each bar. Basically a K-means clustering is applied to group each feature by importance.

From here you can take several actions. For instance you can remove the less important features (a feature selection process, sketched below), or go deeper into the interaction between the most important features and the labels.

Or you can just reason about why these features are so important (in the OTTO challenge we can't go this way because there is not enough information).
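
A minimal sketch of the feature-selection idea: keep only the most important columns according to the importance matrix and retrain on the reduced matrix. This is an illustration only; the 20-feature cutoff is an arbitrary choice, not a value from the original study.

# Keep the names of the 20 features with the highest gain (arbitrary cutoff)
topFeatures <- importance_matrix$Feature[1:20]

# Retrain on the reduced training matrix only
smallMatrix <- trainMatrix[, topFeatures]
bst.small <- xgboost(param = param, data = smallMatrix, label = y, nrounds = nround)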

Tree graph
Feature importance gives you feature weight information but not the interaction between features.

The XGBoost R package has another useful function for that.

xgb.plot.tree(feature_names = names, model = bst, n_first_tree = 2)

(Tree graph output: the rendered plot shows the first two boosted trees. Each internal node is labeled with the feature used for the split, the split value, its Cover and its Gain; terminal nodes are marked Leaf.)
We are just displaying the first two trees here.

On simple models, the first two trees may be enough. Here, it might not be the case. We can see from the size of the trees that the interaction between features is complicated. Besides, XGBoost generates k trees at each round for a k-class classification problem. Therefore the two trees illustrated here are trying to classify data into different classes.
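
As a last illustration of the k-trees-per-round point, here is a minimal sketch of how predictions come back (this step is not shown in the original document; the reshaping pattern is the usual one for the multi:softprob objective, which returns one probability per class per row):

# Predict class probabilities on the test matrix
pred <- predict(bst, testMatrix)

# One probability per class per product: reshape into numberOfClasses columns
predMatrix <- matrix(pred, ncol = numberOfClasses, byrow = TRUE)

# Most likely class of the first products (0-based, as used for training)
max.col(predMatrix[1:6, ]) - 1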
