better than another?
– Design an experiment to measure the accuracy of the two algorithms.
– Run multiple trials.
– Compare the samples, not just their means: perform a statistically sound test on the two samples.
– Is any observed difference significant? Is it due to a true difference between the algorithms, or to natural variation in the measurements?
Statistical Hypothesis Testing
Statistical Hypothesis: a statement about the parameters of one or more populations.
Hypothesis Testing: a procedure for deciding whether to accept or reject the hypothesis:
– Identify the parameter of interest
– State a null hypothesis, H0
– Specify an alternate hypothesis, H1
– Choose a significance level α
– State an appropriate test statistic
Statistical Hypothesis Testing (cont.)
Null Hypothesis (H0): a statement presumed to be true until statistical evidence shows otherwise. It usually specifies an exact value for a parameter, e.g. H0: µ = 30 kg.
Alternate Hypothesis (H1): the hypothesis accepted if the null hypothesis is rejected.
Test Statistic: a particular statistic calculated from the measurements of a random sample / experiment.
– A test statistic is assumed to follow a particular distribution (normal, t, chi-square, etc.)
– That distribution is used to assess the significance of the calculated test statistic.
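A minimal sketch (not part of the original slides) of this procedure for the example H0: µ = 30 kg against H1: µ ≠ 30 kg, using made-up sample data and scipy:

```python
import numpy as np
from scipy import stats

# Hypothetical sample of weights (kg); the parameter of interest is the mean µ.
sample = np.array([29.1, 31.2, 30.4, 28.7, 32.0, 29.8, 30.9, 31.5])

# H0: µ = 30, H1: µ ≠ 30; the test statistic follows a t-distribution, n-1 df.
t0, p_value = stats.ttest_1samp(sample, popmean=30.0)

alpha = 0.05  # chosen significance level
print(f"t0 = {t0:.3f}, p = {p_value:.3f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```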
Error in Hypothesis Testing
Type I error: H0 is rejected although it is in fact true.
– P(Type I error) = α, the significance level
Type II error: we fail to reject H0 although it is in fact false.
– P(Type II error) = β
– Power = 1 − β = probability of correctly rejecting H0
– Power is the ability to distinguish between the two populations.
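A simulation sketch (my illustration, assuming normal populations) that estimates both error rates empirically: the Type I rate should land near the chosen α, and power depends on how far apart the two populations really are.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, trials = 0.05, 30, 2000

# Type I error: both samples come from the same population, so H0 is true.
# Power: the populations genuinely differ, so H0 is false.
rej_h0_true, rej_h0_false = 0, 0
for _ in range(trials):
    a = rng.normal(0.0, 1.0, n)
    b_same = rng.normal(0.0, 1.0, n)   # same mean: H0 true
    b_diff = rng.normal(0.5, 1.0, n)   # shifted mean: H0 false
    rej_h0_true += stats.ttest_ind(a, b_same).pvalue < alpha
    rej_h0_false += stats.ttest_ind(a, b_diff).pvalue < alpha

print(f"Estimated Type I error: {rej_h0_true / trials:.3f}  (should be near {alpha})")
print(f"Estimated power:        {rej_h0_false / trials:.3f}  (1 - beta)")
```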
Paired t-Test
Collect the data in pairs:
– Example: given a training set DTrain and a test set DTest, train both learning algorithms on DTrain and then test their accuracies on DTest.
Suppose n paired measurements have been made. Assume:
– The measurements are independent
– The measurements for each algorithm follow a normal distribution
Then the test statistic T0 follows a t-distribution with n − 1 degrees of freedom.
Paired t-Test (cont.)
Assume X1 follows N(µ1, σ1) and X2 follows N(µ2, σ2); let µD = µ1 − µ2. Compute the paired differences and the test statistic:

  D_i = X_{1i} - X_{2i}, \quad i = 1, 2, \ldots, n

  \bar{D} = \frac{1}{n} \sum_{i=1}^{n} D_i, \qquad
  S_D = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (D_i - \bar{D})^2}, \qquad
  T_0 = \frac{\bar{D} - \Delta_0}{S_D / \sqrt{n}}

Rejection criteria:
– H1: µD ≠ Δ0: reject H0 if |T_0| > t_{α/2, n−1}
– H1: µD > Δ0: reject H0 if T_0 > t_{α, n−1}
– H1: µD < Δ0: reject H0 if T_0 < −t_{α, n−1}
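A sketch (not from the slides) that computes T_0 directly from the formulas above, with made-up paired accuracies and Δ0 = 0, and checks the result against scipy's paired t-test:

```python
import numpy as np
from scipy import stats

# Hypothetical paired accuracies of two algorithms on the same n conditions.
acc1 = np.array([0.81, 0.79, 0.84, 0.78, 0.82, 0.80, 0.83, 0.77])
acc2 = np.array([0.78, 0.77, 0.80, 0.79, 0.78, 0.76, 0.81, 0.75])

d = acc1 - acc2                     # paired differences D_i
n = len(d)
d_bar = d.mean()                    # \bar{D}
s_d = d.std(ddof=1)                 # S_D, sample standard deviation
t0 = d_bar / (s_d / np.sqrt(n))     # T_0 with Delta_0 = 0

p = 2 * stats.t.sf(abs(t0), df=n - 1)  # two-sided p-value, n-1 df
print(f"manual: t0 = {t0:.3f}, p = {p:.3f}")
print("scipy :", stats.ttest_rel(acc1, acc2))  # should agree
```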
Cross Validated t-test
Paired t-test on the 10 paired accuracies obtained from 10-fold cross validation.
Advantages:
– Large train set size
– Most powerful (Dietterich, 1998)
Disadvantages:
– Accuracy results are not independent (the training sets overlap)
– Somewhat elevated probability of Type I error (Dietterich, 1998)
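A sketch of this test with scikit-learn; the two classifiers and the dataset are stand-ins, not from the slides. The key point is that both algorithms are scored on the identical 10 folds, so the accuracies are paired:

```python
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = KFold(n_splits=10, shuffle=True, random_state=0)  # same folds for both

# Ten paired accuracies, one pair per fold.
acc_a = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=cv)
acc_b = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

t0, p = stats.ttest_rel(acc_a, acc_b)  # paired t-test, 9 degrees of freedom
print(f"t0 = {t0:.3f}, p = {p:.3f}")
```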
5x2 Cross Validated t-test
Run 2-fold cross validation 5 times.
Use the difference observed on the first fold of the first replication to estimate the mean difference.
Use the results from all folds to estimate the variance.
Advantage:
– Lowest Type I error (Dietterich, 1998)
Disadvantage:
– Not as powerful as the 10-fold cross validated t-test (Dietterich, 1998)
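A sketch of Dietterich's 5x2cv statistic under the same stand-in classifiers and dataset as above; the resulting t0 is compared against a t-distribution with 5 degrees of freedom:

```python
import numpy as np
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

def five_by_two_cv_ttest(clf_a, clf_b, X, y, seed=0):
    """5x2cv paired t-test (sketch of Dietterich, 1998)."""
    p = np.empty((5, 2))  # p[i, j]: accuracy difference in replication i, fold j
    for i in range(5):
        cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=seed + i)
        for j, (train, test) in enumerate(cv.split(X, y)):
            a = clf_a.fit(X[train], y[train]).score(X[test], y[test])
            b = clf_b.fit(X[train], y[train]).score(X[test], y[test])
            p[i, j] = a - b
    p_bar = p.mean(axis=1)                          # per-replication mean difference
    s2 = ((p - p_bar[:, None]) ** 2).sum(axis=1)    # per-replication variance estimate
    t0 = p[0, 0] / np.sqrt(s2.mean())               # numerator: first fold, first replication
    return t0, 2 * stats.t.sf(abs(t0), df=5)        # t-distribution with 5 df

X, y = load_breast_cancer(return_X_y=True)
t0, pv = five_by_two_cv_ttest(LogisticRegression(max_iter=5000),
                              DecisionTreeClassifier(random_state=0), X, y)
print(f"t0 = {t0:.3f}, p = {pv:.3f}")
```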
Re-sampled t-test
Randomly divide the data into train / test sets (usually 2/3 – 1/3).
Run multiple trials (usually 30).
Perform a paired t-test on the per-trial accuracies.
This test has a very high probability of Type I error and should never be used.
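For completeness, a sketch of the procedure the slide warns against, with the same stand-in models and data; it is shown only to make the flawed recipe concrete:

```python
import numpy as np
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# WARNING: illustrative only -- per the slide, this test has a very high
# Type I error rate because the 30 trials reuse heavily overlapping data.
X, y = load_breast_cancer(return_X_y=True)
diffs = []
for trial in range(30):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3,
                                              random_state=trial)
    a = LogisticRegression(max_iter=5000).fit(X_tr, y_tr).score(X_te, y_te)
    b = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
    diffs.append(a - b)

t0, p = stats.ttest_1samp(diffs, popmean=0.0)  # same as the paired t-test on pairs
print(f"t0 = {t0:.3f}, p = {p:.3f}")
```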
Calibrated Tests
Bouckaert (ICML 2003):
– It is very difficult to estimate the true degrees of freedom because the independence assumptions are violated.
– Instead of correcting the mean difference, calibrate the degrees of freedom.
– Recommendation: use 10 times repeated 10-fold cross validation with 10 degrees of freedom.
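A sketch of that recommendation as I read it: pool the 100 fold differences from 10 repetitions of 10-fold CV, form the usual t statistic, but calibrate by evaluating it with 10 degrees of freedom. The pooled statistic is my assumption; consult Bouckaert (2003) for the exact calibrated procedure.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
diffs = []
for rep in range(10):  # 10 times repeated 10-fold cross validation
    cv = KFold(n_splits=10, shuffle=True, random_state=rep)
    acc_a = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=cv)
    acc_b = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
    diffs.extend(acc_a - acc_b)

d = np.asarray(diffs)  # 100 paired fold differences
t0 = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
p = 2 * stats.t.sf(abs(t0), df=10)  # calibrated: 10 degrees of freedom
print(f"t0 = {t0:.3f}, p = {p:.3f}")
```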
References
R. R. Bouckaert. Choosing between two learning algorithms based on calibrated tests. In ICML'03, pp. 51–58, 2003.
T. G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10:1895–1924, 1998.
D. C. Montgomery et al. Engineering Statistics. 2nd edition. Wiley, 2001.