Pulak Ghosh
IIMB
What is Bootstrap?
Assessing the Quality of Inference
Given a sample Y1, Y2, Y3, ..., YN:
The distribution of our estimator (the sample mean, Ȳ) depends on both the true distribution and the size (n) of our sample.
The Unachievable Frequentist Ideal
Ideally, we would:
Observe many independent datasets of size n.
Compute θ̂n on each.
Compute the sampling distribution based on these multiple realizations of θ̂n.
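The ideal above can be sketched in a few lines of numpy. This is only possible in simulation, where we control the true distribution; the N(5000, 100) population and n = 500 are taken from the motivating example below, and the number of datasets and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# The "unachievable ideal": pretend we can draw many independent
# datasets of size n from the TRUE distribution, here N(5000, 100).
n, n_datasets = 500, 2000
estimates = np.array([
    rng.normal(loc=5000, scale=100, size=n).mean()
    for _ in range(n_datasets)
])

# The spread of these estimates is the true sampling variability
# of the sample mean: close to 100 / sqrt(500) ~ 4.47.
print(estimates.std(ddof=1))
```

In practice we observe only one dataset, which is exactly why this ideal is unachievable and resampling is needed.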
An example
Original sample: X = (3.12, 0, 1.57, 19.67, 0.22, 2.20), Mean = 4.46
Bootstrap resamples (drawn with replacement from X):
X1 = (1.57, 0.22, 19.67, 0, 0, 2.20, 3.12), Mean = 4.13
X2 = (0, 2.20, 2.20, 2.20, 19.67, 1.57), Mean = 4.64
X3 = (0.22, 3.12, 1.57, 3.12, 2.20, 0.22), Mean = 1.74
Motivating Example
Let's look at a simple case where we all know the answer in advance.
Pull 500 draws from the N(5000, 100) distribution.
The sample mean is a point estimate of the true mean (5000).
But how sure are we of this estimate?

Summary of the raw data:
statistic   value
#obs        500
mean        4995.79
sd          98.78
2.5%ile     4812.30
97.5%ile    5195.58
From theory, we know that the standard error of the sample mean is σ/√n = 100/√500 ≈ 4.47.
[Figure: density plot over the range 4700–5100]
Now let's use resampling to estimate the s.d. of the sample mean (theoretical value: 4.47).
Draw a data point at random from the data set, then throw it back in.
Draw a second data point, then throw it back in.
Keep going until we've got 500 data points.
You might call this a pseudo data set.
This is not merely re-sorting the data: some of the original data points will appear more than once; others won't appear at all.
Resampling
[Figure: distribution of the resampled means over the range 4985–5005, used to form a confidence interval.]
Bootstrap results:
Sample mean: 0.1689
95% C.I.: (-0.10, 0.45)
Why does bootstrapping work?
Basic idea:
If the sample is a good approximation of the population, bootstrapping will provide a good approximation of the sampling distribution of the estimator.
Justification:
[Figures omitted: density plots illustrating the bootstrap approximation to the sampling distribution.]
Subsampling
Draw subsamples of size b < n without replacement, compute the estimator on each, and rescale to approximate its sampling distribution.
The need for an analytical correction makes subsampling less automatic than the bootstrap.
Still, it has a much more favorable computational profile than the bootstrap.
Let's try it out in practice.
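A minimal sketch of subsampling for the running sample-mean example. The rate b(n) = n^0.6 is an assumption borrowed from the sizing example later in the deck, and the √n rescaling is the standard analytical correction for a √n-consistent estimator like the mean.

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(loc=5000, scale=100, size=500)

n = data.size
b = int(n ** 0.6)        # subsample size b(n) = n^0.6 (assumed rate)
n_sub = 2000

# Draw subsamples WITHOUT replacement; compute the estimator on each.
sub_means = np.array([
    rng.choice(data, size=b, replace=False).mean()
    for _ in range(n_sub)
])

# Analytical correction: the mean converges at rate sqrt(n), so
# rescale the size-b spread by sqrt(b / n) to get the size-n s.e.
se_n = sub_means.std(ddof=1) * np.sqrt(b / n)
print(se_n)   # should be near 100 / sqrt(500) ~ 4.47
```

Note the correction step: without the sqrt(b / n) factor, the spread of the size-b estimates would badly overstate the size-n uncertainty.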
The Bag of Little Bootstraps (BLB)
Computational Considerations
A key point:
Resources required to compute θ̂ generally scale with the number of distinct data points.
This is true of many commonly used estimation algorithms (e.g., SVM, logistic regression, linear regression, kernel methods, general M-estimators, etc.).
Use a weighted representation of resampled datasets to avoid physical data replication.
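The weighted representation can be sketched for the sample mean. The multinomial-counts view of a bootstrap resample is standard; the data and seed here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(loc=5000, scale=100, size=500)
n = data.size

# A bootstrap resample of size n is equivalent to a Multinomial(n, 1/n)
# count vector over the DISTINCT original points -- no copying needed.
counts = rng.multinomial(n, np.full(n, 1.0 / n))

# Weighted estimator: same answer as physically replicating each
# point counts[i] times, but the data is never duplicated.
weighted_mean = np.average(data, weights=counts)

# Check equivalence against the materialized resample.
materialized = np.repeat(data, counts)
print(np.isclose(weighted_mean, materialized.mean()))  # True
```

For estimators such as logistic regression, the same idea applies via a per-point `sample_weight` argument instead of replicating rows.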
Example: if the original dataset has size 1 TB with each data point 1 MB (n = 10^6), and we take b(n) = n^0.6 ≈ 3,981, then we expect
subsampled datasets to have size ~ 4 GB
resampled datasets to have ~ 4 GB of distinct data
(in contrast, bootstrap resamples contain ~ 0.632n distinct points, i.e. ~ 632 GB)
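The sizes above can be checked with quick arithmetic, assuming 1 TB = 10^6 points of 1 MB each as stated:

```python
import math

n = 10**6  # data points, 1 MB each

# Subsample size b(n) = n^0.6, in points and in GB (1 MB per point).
subsample_points = n ** 0.6
print(subsample_points / 1000)          # ~3.98 GB, i.e. ~4 GB

# A bootstrap resample of size n contains ~ (1 - 1/e) * n ~ 0.632n
# distinct points, since each point is missed with prob (1 - 1/n)^n.
distinct_points = (1 - math.exp(-1)) * n
print(distinct_points / 1000)           # ~632 GB
```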