
Lecture 20: Bagging, Random Forests, Boosting

Reading: Chapter 8

STATS 202: Data mining and analysis

November 13, 2017
Classification and Regression trees, in a nutshell

- Grow the tree by recursively splitting the samples in the leaf R_i according to X_j > s, such that the choice of (R_i, X_j, s) maximizes the drop in RSS.
  → Greedy algorithm.
- Create a sequence of subtrees T_0, T_1, . . . , T_m using a pruning algorithm.
- Select the best tree T_i (or the best α) by cross-validation.
  → Why might it be better to choose α instead of the tree T_i by cross-validation?
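To make the recipe concrete, here is a minimal sketch in R, assuming the rpart package (the lecture does not prescribe a particular implementation) and a placeholder data frame df with response y. rpart grows the tree greedily, stores the cost-complexity pruning sequence together with cross-validated errors in cptable, and prune() returns the selected subtree.

library(rpart)

# Grow a large tree (a small cp allows many greedy splits).
big_tree <- rpart(y ~ ., data = df, method = "anova",
                  control = rpart.control(cp = 0.001, minsplit = 5))

# The cptable holds the pruning sequence T_0, T_1, ..., T_m together
# with the cross-validated error (xerror) for each value of alpha (cp).
print(big_tree$cptable)

# Pick the alpha (cp) with the smallest cross-validated error ...
best_cp <- big_tree$cptable[which.min(big_tree$cptable[, "xerror"]), "CP"]

# ... and prune back to the corresponding subtree.
final_tree <- prune(big_tree, cp = best_cp)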
Example. Heart dataset.

How do we deal with categorical predictors?

[Figure: classification tree fit to the Heart data. The first split is on Thal; deeper splits use Ca, Slope, Oldpeak, MaxHR, ChestPain, Age, RestECG, RestBP, Chol, and Sex, and the leaves predict Yes or No.]
Categorical predictors

- If there are only 2 categories, then the split is obvious. We don't have to choose the splitting point s, as for a numerical variable.
- If there are more than 2 categories:
  - Order the categories according to the average of the response:

      ChestPain:a > ChestPain:c > ChestPain:b

  - Treat the predictor as a numerical variable with this ordering, and choose a splitting point s.
  - One can show that this is the optimal way of partitioning.
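A small R sketch of the ordering trick (assuming a data frame df with a numeric response y and a factor ChestPain; the names are placeholders):

level_means <- tapply(df$y, df$ChestPain, mean)   # average response per category
ordered_levels <- names(sort(level_means))        # e.g. b < c < a

# Recode the factor by its position in this ordering ...
x_ord <- as.integer(factor(df$ChestPain, levels = ordered_levels))

# ... so candidate splits are thresholds s on x_ord, exactly as for a
# numerical predictor.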
Missing data

- Suppose we can assign every sample to a leaf R_i despite the missing data.
- When choosing a new split with variable X_j (growing the tree):
  - Only consider the samples which have the variable X_j.
  - In addition to choosing the best split, choose a second best split using a different variable, and a third best, ...
- To propagate a sample down the tree, if it is missing the variable needed to make a decision, try the second best decision, or the third best, ...
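In R, rpart's surrogate splits implement a closely related (though not identical) idea: at each node the fit stores backup splits on other variables that mimic the primary split, and uses them when the primary variable is missing. A hedged sketch, with df a placeholder data frame containing missing predictor values:

library(rpart)

fit <- rpart(y ~ ., data = df, method = "anova",
             control = rpart.control(usesurrogate = 2,   # fall back on surrogate splits
                                     maxsurrogate = 5))  # keep up to 5 backups per node

# summary(fit) lists, for each node, the primary split and its surrogates.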
Bagging

- Bagging = Bootstrap Aggregating.
- In the Bootstrap, we replicate our dataset by sampling with replacement:
  - Original dataset: x = c(x1, x2, . . . , x100)
  - Bootstrap samples:
    boot1 = sample(x, 100, replace = TRUE), ...,
    bootB = sample(x, 100, replace = TRUE).
- We used these samples to get the Standard Error of a parameter estimate:

  SE(\hat{\beta}_1) \approx \sqrt{ \frac{1}{B-1} \sum_{b=1}^{B} \Big( \hat{\beta}_1^{(b)} - \frac{1}{B} \sum_{k=1}^{B} \hat{\beta}_1^{(k)} \Big)^2 }
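For example, the bootstrap SE of a regression slope takes a few lines of R (df is a placeholder data frame with columns y and x):

set.seed(1)
B <- 1000
n <- nrow(df)

beta_hat <- replicate(B, {
  idx <- sample(n, n, replace = TRUE)        # one bootstrap sample
  coef(lm(y ~ x, data = df[idx, ]))["x"]     # refit and keep the slope
})

se_boot <- sd(beta_hat)   # sd() uses the 1/(B-1) normalization, as in the formula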
Bagging

- In Bagging, we average the predictions of a model fit to many Bootstrap samples.

Example. Bagging the Lasso

- Let ŷ^{L,b} be the prediction of the Lasso applied to the b-th bootstrap sample.
- Bagging prediction:

  \hat{y}^{boot} = \frac{1}{B} \sum_{b=1}^{B} \hat{y}^{L,b}.
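A sketch with the glmnet package (an assumption, as is the fixed penalty lambda chosen in advance); x is an n × p predictor matrix, y the response, and x_new the points to predict:

library(glmnet)
set.seed(1)
B <- 100
n <- nrow(x)

preds <- sapply(1:B, function(b) {
  idx <- sample(n, n, replace = TRUE)                     # bootstrap sample b
  fit <- glmnet(x[idx, ], y[idx], alpha = 1, lambda = lambda)
  as.numeric(predict(fit, newx = x_new))                  # ŷ^{L,b} at the new points
})

y_hat_boot <- rowMeans(preds)   # average the B Lasso predictions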
When does Bagging make sense?

When a regression method or a classifier has a tendency to overfit, Bagging reduces the variance of the prediction.

- When n is large, the empirical distribution is similar to the true distribution of the samples.
- Bootstrap samples are like independent realizations of the data.
- Bagging amounts to averaging the fits from many independent datasets, which would reduce the variance by a factor of 1/B.
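The 1/B claim can be made precise with a standard variance calculation (not on the original slide): if each bootstrap fit \hat{f}^b(x) has variance σ² and the fits have pairwise correlation ρ, then

  \operatorname{Var}\Big( \frac{1}{B} \sum_{b=1}^{B} \hat{f}^b(x) \Big) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2 .

With truly independent fits (ρ = 0) the variance is σ²/B, but bootstrap fits are correlated, so the first term does not shrink as B grows. This is the motivation for decorrelating the trees, which is what random forests do below.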
Bagging decision trees

- Disadvantage: Every time we fit a decision tree to a Bootstrap sample, we get a different tree T^b.
  → Loss of interpretability.
- For each predictor, add up the total amount by which the RSS (or Gini index) decreases every time we use the predictor in T^b.
- Average this total over the Bootstrap estimates T^1, . . . , T^B.

[Figure: variable importance for the Heart data, on a 0–100 scale. From least to most important: Fbs, RestECG, ExAng, Sex, Slope, Chol, Age, RestBP, MaxHR, Oldpeak, ChestPain, Ca, Thal.]
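In R, this importance measure is available from the randomForest package (an assumption; setting mtry = p, i.e. considering all predictors at every split, makes a random forest equivalent to bagging). Heart is a placeholder for the data frame, with the response AHD coded as a factor:

library(randomForest)
set.seed(1)

p   <- ncol(Heart) - 1
bag <- randomForest(AHD ~ ., data = Heart,
                    mtry = p, ntree = 300, importance = TRUE)

importance(bag)    # per-predictor decrease in Gini index / accuracy
varImpPlot(bag)    # a plot like the one above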
Out-of-bag (OOB) error

- To estimate the test error of a bagging estimate, we could use cross-validation.
- Each time we draw a Bootstrap sample, we only use about 63% of the observations.
- Idea: use the rest of the observations as a test set.
- OOB error:
  - For each sample x_i, find the prediction ŷ_i^b for all bootstrap samples b which do not contain x_i. There should be around 0.37B of them. Average these predictions to obtain ŷ_i^oob.
  - Compute the error (y_i − ŷ_i^oob)².
  - Average the errors over all observations i = 1, . . . , n.
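These steps can be written out directly for bagged regression trees (a sketch, with rpart assumed and df a placeholder data frame with response y):

library(rpart)
set.seed(1)
B <- 200
n <- nrow(df)
pred_sum <- numeric(n)   # running sum of OOB predictions per observation
pred_cnt <- numeric(n)   # number of trees for which the observation was out of bag

for (b in 1:B) {
  idx <- sample(n, n, replace = TRUE)
  oob <- setdiff(1:n, idx)                     # the ~37% left out of sample b
  fit <- rpart(y ~ ., data = df[idx, ], method = "anova")
  pred_sum[oob] <- pred_sum[oob] + predict(fit, newdata = df[oob, ])
  pred_cnt[oob] <- pred_cnt[oob] + 1
}

y_hat_oob <- pred_sum / pred_cnt               # average over ~0.37 B trees
oob_error <- mean((df$y - y_hat_oob)^2)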
Out-of-bag (OOB) error

[Figure: test error and OOB error as a function of the number of trees B (0 to 300), for Bagging and for Random Forests.]

The test error decreases as we increase B (the dashed line is the error for a plain decision tree).
Random Forests

Bagging has a problem:
→ The trees produced by different Bootstrap samples can be very similar.

Random Forests:
- We fit a decision tree to different Bootstrap samples.
- When growing the tree, we select a random sample of m < p predictors to consider at each step.
- This will lead to very different (or “uncorrelated”) trees from each sample.
- Finally, average the predictions of the trees.
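A minimal sketch with the randomForest package (assumed), using the common choice m ≈ √p for classification; Heart and AHD are placeholders as before:

library(randomForest)
set.seed(1)

p  <- ncol(Heart) - 1
rf <- randomForest(AHD ~ ., data = Heart,
                   mtry = floor(sqrt(p)),   # m < p predictors considered at each split
                   ntree = 300)
rf   # printing the fit reports the OOB error estimate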
Random Forests vs. Bagging

[Figure: test error and OOB error versus the number of trees (0 to 300), for Bagging and for Random Forests.]
Random Forests, choosing m

[Figure: test classification error versus the number of trees (0 to 500) for m = p, m = p/2, and m = √p.]

The optimal m is usually around √p, but m can be used as a tuning parameter.
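For instance, the candidate values of m can be compared directly through the OOB error (randomForest assumed; Heart and AHD are placeholders):

library(randomForest)
set.seed(1)
p <- ncol(Heart) - 1

for (m in c(floor(sqrt(p)), floor(p / 2), p)) {
  rf <- randomForest(AHD ~ ., data = Heart, mtry = m, ntree = 500)
  cat("mtry =", m, " OOB error =", rf$err.rate[rf$ntree, "OOB"], "\n")
}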
Boosting

1. Set \hat{f}(x) = 0, and r_i = y_i for i = 1, . . . , n.
2. For b = 1, . . . , B, iterate:
   2.1 Fit a decision tree \hat{f}^b with d splits to the response r_1, . . . , r_n.
   2.2 Update the prediction to:

       \hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x).

   2.3 Update the residuals:

       r_i \leftarrow r_i - \lambda \hat{f}^b(x_i).

3. Output the final model:

   \hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}^b(x).
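The loop can be written out directly in R (a sketch: rpart trees with a maxdepth limit stand in for the d-split trees, and df is a placeholder data frame with response y):

library(rpart)
set.seed(1)
lambda <- 0.01; d <- 1; B <- 1000
n <- nrow(df)

r     <- df$y              # 1. residuals start at the raw response,
f_hat <- numeric(n)        #    and the fitted values start at 0
trees <- vector("list", B)

for (b in 1:B) {
  dat   <- df
  dat$y <- r                                    # 2.1 fit a small tree to the residuals
  trees[[b]] <- rpart(y ~ ., data = dat, method = "anova",
                      control = rpart.control(maxdepth = d, cp = 0))
  step  <- predict(trees[[b]], newdata = dat)
  f_hat <- f_hat + lambda * step                # 2.2 update the fit
  r     <- r     - lambda * step                # 2.3 update the residuals
}

# 3. To predict at new data, sum lambda * predict(trees[[b]], newdata) over b.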
Boosting, intuitively

Boosting learns slowly:

We first fit the samples that are easiest to predict, then slowly down-weight these cases and move on to the harder samples.
Boosting vs. random forests

[Figure: test classification error versus the number of trees (0 to 5000) for boosting with depth d = 1, boosting with depth d = 2, and a random forest with m = √p.]

The parameter λ = 0.01 in each case.

We can tune the model by cross-validation over λ, d, and B.
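In R, the gbm package (an assumption) exposes exactly these knobs, with shrinkage = λ, interaction.depth = d, and n.trees = B, and can select B by built-in cross-validation. Here df and df_test are placeholder training and test data frames with a numeric response y:

library(gbm)
set.seed(1)

boost <- gbm(y ~ ., data = df, distribution = "gaussian",
             n.trees = 5000,          # B
             interaction.depth = 1,   # d
             shrinkage = 0.01,        # lambda
             cv.folds = 5)

best_B <- gbm.perf(boost, method = "cv")               # CV-selected number of trees
y_hat  <- predict(boost, newdata = df_test, n.trees = best_B)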
