
Bootstrap

Pulak Ghosh
IIMB
What is Bootstrap?
Assessing the Quality of Inference

Observe data X1, ..., Xn.

Form a parameter estimate θn = θ(X1, ..., Xn).

Want to compute an assessment ξ of the quality of our estimate θn
(e.g., a confidence region).
The Basic Idea
Theoretical Picture
Any actual sample of data was drawn from the unknown true distribution ("the true
distribution in the sky").

We use the actual data to make inferences about the true parameters.

Each green oval below is a sample that might have been drawn:

[Diagram: Sample 1 (Y1,1, ..., Y1,k), Sample 2 (Y2,1, ..., Y2,k), ..., Sample N
(YN,1, ..., YN,k), each yielding its own estimate Y-bar_1, Y-bar_2, ..., Y-bar_N]

The distribution of our estimator (Y-bar) depends on both the true distribution
and the size (k) of our sample.
The Unachievable Frequentist Ideal
Ideally, we would:
Observe many independent datasets of size n.
Compute θn on each.
Compute ξ based on these multiple realizations of θn.

But, we only observe one dataset of size n.


Testing and estimation require information about the population distribution.

What if the usual assumptions about the population cannot be made
(normality assumptions, small samples)? Then the traditional approach does not work.

Bootstrap: estimate the population distribution using the information in a number
of resamples drawn from the sample itself.

Given a sample of size n:

Procedure (see the R sketch below):
Treat the sample as the population.
Draw B samples of size n with replacement from your sample --- these are the
bootstrap samples.
Compute the statistic of interest (for example, the mean) for each bootstrap sample.
Estimate the sampling distribution of the statistic by the bootstrap sample
distribution.
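A minimal R sketch of this procedure, here using the sample median as the statistic
of interest (the data, the variable names, and the choice of B = 1000 are
illustrative, not from the slides):

# A minimal sketch of the bootstrap procedure for the sample median.
set.seed(1)
x <- rexp(50, rate = 0.5)                        # any observed sample of size n = 50
B <- 1000                                        # number of bootstrap resamples

boot.medians <- replicate(B, {
  xstar <- sample(x, length(x), replace = TRUE)  # resample n points with replacement
  median(xstar)                                  # statistic of interest
})

sd(boot.medians)                          # bootstrap estimate of the standard error
quantile(boot.medians, c(0.025, 0.975))   # bootstrap percentile interval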
Pretend The Sample Is The Population
The Basic Idea
The Bootstrapping Process
Treat the actual distribution of the data (Y1, Y2, ..., Yk) as a proxy for the true
distribution.

Sample with replacement from your actual data N times.

Compute the statistic of interest on each re-sample.

[Diagram: Re-sample 1 (Y*1,1, ..., Y*1,k), Re-sample 2 (Y*2,1, ..., Y*2,k), ...,
Re-sample N (Y*N,1, ..., Y*N,k), each yielding its own estimate Y*-bar_1, ..., Y*-bar_N]

{Y*-bar} constitutes an estimate of the distribution of Y-bar.
The Bootstrap
(Efron, 1979)

Use the observed data to simulate multiple datasets of size n:

Repeatedly resample n points with replacement from the original dataset of size n.
Compute θ*n on each resample.
Compute ξ based on these multiple realizations of θ*n as our estimate of ξ for θn.
Sampling With Replacement

In fact, there is a chance of

(1 - 1/500)^500 ≈ 1/e ≈ .368

that any one of the original data points won't appear at all if we sample
with replacement 500 times.
So any given data point is included with probability ≈ .632.
Intuitively, we treat the original sample as the true population in the sky.
Each resample simulates the process of taking a sample from the true
distribution.
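This inclusion probability is easy to check by simulation; a quick sketch, using
the same sample size of 500 as in the example above:

# Probability that a given data point appears in a resample of size n = 500.
set.seed(2)
n <- 500
appears <- replicate(10000, 1 %in% sample(1:n, n, replace = TRUE))
mean(appears)          # close to 1 - 1/e = 0.632
1 - (1 - 1/n)^n        # exact inclusion probability, also about 0.632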
Theoretical vs. Empirical
Graph on left: Y-bar calculated from a number of samples from the true distribution.
Graph on right: {Y*-bar} calculated in each of 1000 re-samples from the empirical
distribution.
Analogy: μ : Y-bar :: Y-bar : Y*-bar

[Figure: left panel, "true distribution (Y-bar)", histogram of ybar;
right panel, "bootstrap distribution (Y*-bar)", histogram of y.star.bar]
An example

Original sample X = (3.12, 0, 1.57, 19.67, 0.22, 2.20), Mean = 4.46

Bootstrap resamples drawn with replacement from X:
X*1 = (1.57, 0.22, 19.67, 0, 0, 2.2, 3.12), Mean = 4.13
X*2 = (0, 2.20, 2.20, 2.20, 19.67, 1.57), Mean = 4.64
X*3 = (0.22, 3.12, 1.57, 3.12, 2.20, 0.22), Mean = 1.74
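This toy example can be reproduced in R (a sketch; the resamples you obtain will
differ from the slide's because resampling is random):

# The original sample from the slide.
x <- c(3.12, 0, 1.57, 19.67, 0.22, 2.20)
mean(x)                                           # 4.46

# Draw three bootstrap resamples and compute their means.
set.seed(3)
for (i in 1:3) {
  xstar <- sample(x, length(x), replace = TRUE)   # one bootstrap resample
  print(round(mean(xstar), 2))                    # its mean
}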
Motivating Example
Let's look at a simple case where we all know the answer in advance.

Pull 500 draws from the n(5000, 100) distribution.

The sample mean is a point estimate of the true mean μ = 5000.
But how sure are we of this estimate?

Raw data summary:
statistic    value
#obs         500
mean         4995.79
sd           98.78
2.5%ile      4812.30
97.5%ile     5195.58

From theory, we know that:
s.d.(X-bar) = σ / √N = 100 / √500 ≈ 4.47
Visualizing the Raw Data
500 draws from n(5000, 100).

Look at summary statistics, histogram, probability density estimate, QQ-plot.
It looks pretty normal.

statistic    value
#obs         500
mean         4995.79
sd           98.78
2.5%ile      4812.30
97.5%ile     5195.58

[Figure: left panel, "n(5000,100) data", density estimate; right panel, "Normal Q-Q Plot"]


Sampling With Replacement

Now let's use resampling to estimate the s.d. of the sample mean (4.47).

Draw a data point at random from the data set.
Then throw it back in.
Draw a second data point.
Then throw it back in.
Keep going until we've got 500 data points.

You might call this a "pseudo data set".
This is not merely re-sorting the data: some of the original data points will
appear more than once; others won't appear at all.
Resampling

Sample with replacement 500 data points from the original dataset S.
Call this S*1.

Now do this 999 more times!
S*1, S*2, ..., S*1000

Compute X-bar on each of these 1000 samples.

[Diagram: the resamples S*1, S*2, ..., S*10, ... arranged around the original dataset S]
R Code

# Simulate the raw data: 500 draws from n(5000, 100).
norm.data <- rnorm(500, mean=5000, sd=100)

# Bootstrap R resamples; store the resample means and sds in the
# global vectors b.avg and b.sd (note the <<- assignment).
boots <- function(data, R){
  b.avg <<- c(); b.sd <<- c()
  for(b in 1:R) {
    # Resample n points with replacement from the original data.
    ystar <- sample(data, length(data), replace=TRUE)
    b.avg <<- c(b.avg, mean(ystar))
    b.sd  <<- c(b.sd,  sd(ystar))
  }
}

# Run 1000 bootstrap resamples.
boots(norm.data, 1000)
Results
From theory we know that X-bar ~ n(5000, 4.47).
Bootstrapping estimates this pretty well!
And we get an estimate of the whole distribution, not just a confidence interval.

statistic    raw data    X-bar (theory)    X-bar (bootstrap)
#obs         500         1,000             1,000
mean         4995.79     5000.00           4995.98
sd           98.78       4.47              4.43
2.5%ile      4705.08     4991.23           4987.60
97.5%ile     5259.27     5008.77           5004.82

[Figure: left panel, "bootstrap X-bar data", density estimate; right panel, "Normal Q-Q Plot"]


Two Ways of Looking at a Confidence Interval

Approximate normality assumption:
X-bar ± 2 * (bootstrap dist s.d.)

Percentile method:
Just take the desired percentiles of the bootstrap histogram.
More reliable in cases of asymmetric bootstrap histograms (computed in R below).

mean(norm.data) - 2 * sd(b.avg)
[1] 4986.926
mean(norm.data) + 2 * sd(b.avg)
[1] 5004.661
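The percentile method uses the quantiles of the bootstrap means directly; a one-line
sketch, using the b.avg vector produced by boots() above:

# Percentile-method 95% confidence interval for the mean:
# the 2.5th and 97.5th percentiles of the bootstrap means.
quantile(b.avg, c(0.025, 0.975))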
Example
Population: standard normal distribution
Sample of size 30
1000 bootstrap samples

Results from the sample (traditional analysis):
Sample mean: 0.1698
95% C.I.: (-0.12, 0.46)

Bootstrap results:
Sample mean: 0.1689
95% C.I.: (-0.10, 0.45)
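A sketch reproducing this comparison in R (the seed and exact numbers are
illustrative; the slide's values come from one particular sample of size 30):

# Traditional vs. bootstrap 95% CI for the mean of a standard normal sample.
set.seed(4)
y <- rnorm(30)                          # sample of size 30 from N(0, 1)

mean(y)                                 # traditional point estimate
t.test(y)$conf.int                      # traditional t-based 95% C.I.

# Bootstrap percentile analysis with B = 1000 resamples.
boot.means <- replicate(1000, mean(sample(y, length(y), replace = TRUE)))
mean(boot.means)                        # bootstrap point estimate
quantile(boot.means, c(0.025, 0.975))   # bootstrap percentile 95% C.I.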
Why does bootstrapping work?
Basic idea:
If the sample is a good approximation of the population, bootstrapping will
provide a good approximation of the sampling distribution.

Justification:
1. If the sample is representative of the population, the sample (empirical)
distribution approaches the population (theoretical) distribution as n increases.
2. As the number of resamples (B) from the original sample increases, the
bootstrap distribution approaches the sampling distribution of the statistic under
the empirical distribution.
Example
The bootstrap does not replace or add to the original data.

We use the bootstrap distribution as a way to estimate the variation in a
statistic based on the original data.

Bootstrap distributions usually approximate the shape, spread, and bias of the
actual sampling distribution.

Bootstrap distributions are centered at the value of the statistic from the
original data (plus any bias), while the sampling distribution is centered at the
value of the parameter in the population (plus any bias).
More Interesting Examples

We've seen that bootstrapping replicates a result we know to be true from theory.

Often in the real world we either don't know the true distributional properties
of a random variable, or are too busy to find out.

This is when bootstrapping really comes in handy.
Severity Data

2700 size-of-loss data points.
Mean = 3052, Median = 1136

Quantiles:
0%       25%      50%       75%       100%
51.84    482.42   1136.10   3094.09   48346.82

Let's estimate the distributions of the sample mean & 75th %ile (a sketch in R follows).
Gamma? Lognormal? Don't need to know.

[Figure: "severity distribution", density estimate of the loss sizes from 0 to 50000]
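The actual 2700-point severity dataset is not included with the slides, so in the
sketch below a right-skewed lognormal sample stands in for it; only the resampling
pattern is the point:

# Stand-in for the severity data: 2700 right-skewed loss amounts.
set.seed(5)
sev <- rlnorm(2700, meanlog = 7, sdlog = 1.3)

# Bootstrap the sample mean and the 75th percentile.
B <- 1000
boot.mean <- numeric(B); boot.p75 <- numeric(B)
for (b in 1:B) {
  s <- sample(sev, length(sev), replace = TRUE)   # bootstrap resample of the losses
  boot.mean[b] <- mean(s)                         # resample mean
  boot.p75[b]  <- quantile(s, 0.75)               # resample 75th percentile
}

# Spread of each statistic, without assuming gamma, lognormal, etc.
sd(boot.mean); quantile(boot.mean, c(0.025, 0.975))
sd(boot.p75);  quantile(boot.p75,  c(0.025, 0.975))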


Bootstrapping Sample Avg, 75th %ile

[Figure: "bootstrap dist of severity sample avg" with its Normal Q-Q Plot;
"bootstrap dist of severity 75th %ile" with its Normal Q-Q Plot]


What about the 90th %ile?
So far so good: bootstrapping shows that many of our sample statistics (even
average severity!) are approximately normally distributed.
But this breaks down if our statistic is not a smooth function of the data.
Often in loss reserving we want to focus our attention way out in the tail;
the 90th %ile is an example.

[Figure: "bootstrap dist of severity 90th %ile" with its Normal Q-Q Plot]


Cases where bootstrap does not apply
Small data sets: the original sample is not a good approximation of the population.

Dirty data: outliers add variability to our estimates.

Dependence structures (e.g., time series, spatial problems): the bootstrap is based
on the assumption of independence.
How many bootstrap samples are needed?
Choice of B depends on:
Computing resources available
Type of problem: standard errors, confidence intervals, ...
Complexity of the problem


The Bootstrap:
Computational Issues

Seemingly a wonderful match to modern parallel and distributed computing platforms.

But the expected number of distinct points in a bootstrap resample is ~ 0.632n:
e.g., if the original dataset has size 1 TB, then we expect a resample to have
size ~ 632 GB.

Can't feasibly send resampled datasets of this size to distributed servers.
Even if one could, can't compute the estimate locally on datasets this large.
Subsampling
(Politis, Romano & Wolf, 1999)

[Diagram: a subsample of size b drawn from the original dataset of size n]

Subsampling

There are many subsets of size b < n.

Choose some sample of them and apply the estimator to each.
This yields fluctuations of the estimate, and thus error bars.
But a key issue arises: the fact that b < n means that the error bars will be on
the wrong scale (they'll be too large).
Need to analytically correct the error bars.
Subsampling
Summary of algorithm:
Repeatedly subsample b < n points without replacement from the original dataset of size n.
Compute θ*b on each subsample.
Compute ξ based on these multiple realizations of θ*b.
Analytically correct to produce the final estimate of ξ for θn.

The need for analytical correction makes subsampling less automatic than the bootstrap.
Still, it has a much more favorable computational profile than the bootstrap.
Let's try it out in practice; a minimal R sketch follows.
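A minimal R sketch of subsampling for the standard error of the mean, assuming the
usual √n convergence rate so that the analytic correction is the factor √(b/n)
(the sizes n and b and the number of subsamples here are illustrative):

# Subsampling estimate of the standard error of the sample mean.
set.seed(6)
x <- rexp(10000)                 # original dataset of size n
n <- length(x)
b <- 500                         # subsample size, b < n
S <- 200                         # number of subsamples

sub.means <- replicate(S, mean(sample(x, b, replace = FALSE)))

sd(sub.means)                    # spread on the sqrt(b) scale: too large
sd(sub.means) * sqrt(b / n)      # analytic correction down to the sqrt(n) scale
sd(x) / sqrt(n)                  # compare: plug-in standard error of the mean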
The Bag of Little Bootstraps (BLB)
Computational Considerations

A key point:
Resources required to compute θ generally scale with the number of distinct data points.
This is true of many commonly used estimation algorithms (e.g., SVM, logistic
regression, linear regression, kernel methods, general M-estimators, etc.).
Use a weighted representation of resampled datasets to avoid physical data
replication (see the sketch below).

Example: if the original dataset has size 1 TB with each data point 1 MB, and we
take b(n) = n^0.6, then we expect:
subsampled datasets to have size ~ 4 GB
resampled datasets to have size ~ 4 GB
(in contrast, bootstrap resamples have size ~ 632 GB)