better than another?
– Design an experiment to measure the accuracy of the two algorithms.
– Run multiple trials.
– Compare the samples, not just their means: perform a statistically sound test on the two samples.
– Is any observed difference significant? Is it due to a true difference between the algorithms, or to natural variation in the measurements?
Statistical Hypothesis Testing
Statistical Hypothesis: a statement about the parameters of one or more populations.
Hypothesis Testing: a procedure for deciding whether to accept or reject the hypothesis:
– Identify the parameter of interest
– State a null hypothesis, H0
– Specify an alternate hypothesis, H1
– Choose a significance level α
– State an appropriate test statistic
Statistical Hypothesis Testing (cont.)
Null Hypothesis (H0): a statement presumed to be true until statistical evidence shows otherwise. It usually specifies an exact value for a parameter, e.g. H0: µ = 30 kg.
Alternate Hypothesis (H1): the hypothesis accepted if the null hypothesis is rejected.
Test Statistic: a particular statistic calculated from the measurements of a random sample / experiment.
– A test statistic is assumed to follow a particular distribution (normal, t, chi-square, etc.)
– That distribution is used to assess the significance of the calculated test statistic.
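A minimal sketch (not part of the original slides) of this procedure for the example H0: µ = 30 kg against H1: µ ≠ 30 kg, using made-up sample data and scipy:

```python
import numpy as np
from scipy import stats

# Hypothetical sample of weights (kg); the parameter of interest is the mean µ.
sample = np.array([29.1, 31.2, 30.4, 28.7, 32.0, 29.8, 30.9, 31.5])

# H0: µ = 30, H1: µ ≠ 30; the test statistic follows a t-distribution, n-1 df.
t0, p_value = stats.ttest_1samp(sample, popmean=30.0)

alpha = 0.05  # chosen significance level
print(f"t0 = {t0:.3f}, p = {p_value:.3f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```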
Error in Hypothesis Testing
Type I error: H0 is rejected although it is in fact true.
– P(Type I error) = α, the significance level
Type II error: we fail to reject H0 although it is in fact false.
– P(Type II error) = β
– Power = 1 − β = probability of correctly rejecting H0
– Power is the ability to distinguish between the two populations.
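A simulation sketch (my illustration, assuming normal populations) that estimates both error rates empirically: the Type I rate should land near the chosen α, and power depends on how far apart the two populations really are.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, trials = 0.05, 30, 2000

# Type I error: both samples come from the same population, so H0 is true.
# Power: the populations genuinely differ, so H0 is false.
rej_h0_true, rej_h0_false = 0, 0
for _ in range(trials):
    a = rng.normal(0.0, 1.0, n)
    b_same = rng.normal(0.0, 1.0, n)   # same mean: H0 true
    b_diff = rng.normal(0.5, 1.0, n)   # shifted mean: H0 false
    rej_h0_true += stats.ttest_ind(a, b_same).pvalue < alpha
    rej_h0_false += stats.ttest_ind(a, b_diff).pvalue < alpha

print(f"Estimated Type I error: {rej_h0_true / trials:.3f}  (should be near {alpha})")
print(f"Estimated power:        {rej_h0_false / trials:.3f}  (1 - beta)")
```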
Paired t-Test
Collect the data in pairs:
– Example: given a training set DTrain and a test set DTest, train both learning algorithms on DTrain and then test their accuracies on DTest.
Suppose n paired measurements have been made. Assume:
– The measurements are independent
– The measurements for each algorithm follow a normal distribution
Then the test statistic T0 follows a t-distribution with n − 1 degrees of freedom.
Paired t-Test (cont.)
Assume X1 follows N(µ1, σ1) and X2 follows N(µ2, σ2); let µD = µ1 − µ2. Compute the paired differences and the test statistic:

  D_i = X_{1i} - X_{2i}, \quad i = 1, 2, \ldots, n

  \bar{D} = \frac{1}{n} \sum_{i=1}^{n} D_i, \qquad
  S_D = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (D_i - \bar{D})^2}, \qquad
  T_0 = \frac{\bar{D} - \Delta_0}{S_D / \sqrt{n}}

Rejection criteria:
– H1: µD ≠ Δ0: reject H0 if |T_0| > t_{α/2, n−1}
– H1: µD > Δ0: reject H0 if T_0 > t_{α, n−1}
– H1: µD < Δ0: reject H0 if T_0 < −t_{α, n−1}
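A sketch (not from the slides) that computes T_0 directly from the formulas above, with made-up paired accuracies and Δ0 = 0, and checks the result against scipy's paired t-test:

```python
import numpy as np
from scipy import stats

# Hypothetical paired accuracies of two algorithms on the same n conditions.
acc1 = np.array([0.81, 0.79, 0.84, 0.78, 0.82, 0.80, 0.83, 0.77])
acc2 = np.array([0.78, 0.77, 0.80, 0.79, 0.78, 0.76, 0.81, 0.75])

d = acc1 - acc2                     # paired differences D_i
n = len(d)
d_bar = d.mean()                    # \bar{D}
s_d = d.std(ddof=1)                 # S_D, sample standard deviation
t0 = d_bar / (s_d / np.sqrt(n))     # T_0 with Delta_0 = 0

p = 2 * stats.t.sf(abs(t0), df=n - 1)  # two-sided p-value, n-1 df
print(f"manual: t0 = {t0:.3f}, p = {p:.3f}")
print("scipy :", stats.ttest_rel(acc1, acc2))  # should agree
```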
Cross Validated t-test
Paired t-test on the 10 paired accuracies obtained from 10-fold cross validation.
Advantages:
– Large train set size
– Most powerful (Dietterich, 1998)
Disadvantages:
– Accuracy results are not independent (the training sets overlap)
– Somewhat elevated probability of Type I error (Dietterich, 1998)
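A sketch of this test with scikit-learn; the two classifiers and the dataset are stand-ins, not from the slides. The key point is that both algorithms are scored on the identical 10 folds, so the accuracies are paired:

```python
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = KFold(n_splits=10, shuffle=True, random_state=0)  # same folds for both

# Ten paired accuracies, one pair per fold.
acc_a = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=cv)
acc_b = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

t0, p = stats.ttest_rel(acc_a, acc_b)  # paired t-test, 9 degrees of freedom
print(f"t0 = {t0:.3f}, p = {p:.3f}")
```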
5x2 Cross Validated t-test
Run 2-fold cross validation 5 times.
Use the difference observed on the first fold of the first replication to estimate the mean difference.
Use the results from all folds to estimate the variance.
Advantage:
– Lowest Type I error (Dietterich, 1998)
Disadvantage:
– Not as powerful as the 10-fold cross validated t-test (Dietterich, 1998)
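A sketch of Dietterich's 5x2cv statistic under the same stand-in classifiers and dataset as above; the resulting t0 is compared against a t-distribution with 5 degrees of freedom:

```python
import numpy as np
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

def five_by_two_cv_ttest(clf_a, clf_b, X, y, seed=0):
    """5x2cv paired t-test (sketch of Dietterich, 1998)."""
    p = np.empty((5, 2))  # p[i, j]: accuracy difference in replication i, fold j
    for i in range(5):
        cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=seed + i)
        for j, (train, test) in enumerate(cv.split(X, y)):
            a = clf_a.fit(X[train], y[train]).score(X[test], y[test])
            b = clf_b.fit(X[train], y[train]).score(X[test], y[test])
            p[i, j] = a - b
    p_bar = p.mean(axis=1)                          # per-replication mean difference
    s2 = ((p - p_bar[:, None]) ** 2).sum(axis=1)    # per-replication variance estimate
    t0 = p[0, 0] / np.sqrt(s2.mean())               # numerator: first fold, first replication
    return t0, 2 * stats.t.sf(abs(t0), df=5)        # t-distribution with 5 df

X, y = load_breast_cancer(return_X_y=True)
t0, pv = five_by_two_cv_ttest(LogisticRegression(max_iter=5000),
                              DecisionTreeClassifier(random_state=0), X, y)
print(f"t0 = {t0:.3f}, p = {pv:.3f}")
```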
Re-sampled t-test
Randomly divide the data into train / test sets (usually 2/3 – 1/3).
Run multiple trials (usually 30).
Perform a paired t-test on the per-trial accuracies.
This test has a very high probability of Type I error and should never be used.
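For completeness, a sketch of the procedure the slide warns against, with the same stand-in models and data; it is shown only to make the flawed recipe concrete:

```python
import numpy as np
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# WARNING: illustrative only -- per the slide, this test has a very high
# Type I error rate because the 30 trials reuse heavily overlapping data.
X, y = load_breast_cancer(return_X_y=True)
diffs = []
for trial in range(30):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3,
                                              random_state=trial)
    a = LogisticRegression(max_iter=5000).fit(X_tr, y_tr).score(X_te, y_te)
    b = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
    diffs.append(a - b)

t0, p = stats.ttest_1samp(diffs, popmean=0.0)  # same as the paired t-test on pairs
print(f"t0 = {t0:.3f}, p = {p:.3f}")
```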
Calibrated Tests
Bouckaert (ICML 2003):
– It is very difficult to estimate the true degrees of freedom because the independence assumptions are violated.
– Instead of correcting the mean difference, calibrate the degrees of freedom.
– Recommendation: use 10 times repeated 10-fold cross validation with 10 degrees of freedom.
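A sketch of that recommendation as I read it: pool the 100 fold differences from 10 repetitions of 10-fold CV, form the usual t statistic, but calibrate by evaluating it with 10 degrees of freedom. The pooled statistic is my assumption; consult Bouckaert (2003) for the exact calibrated procedure.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
diffs = []
for rep in range(10):  # 10 times repeated 10-fold cross validation
    cv = KFold(n_splits=10, shuffle=True, random_state=rep)
    acc_a = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=cv)
    acc_b = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
    diffs.extend(acc_a - acc_b)

d = np.asarray(diffs)  # 100 paired fold differences
t0 = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
p = 2 * stats.t.sf(abs(t0), df=10)  # calibrated: 10 degrees of freedom
print(f"t0 = {t0:.3f}, p = {p:.3f}")
```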
References
R. R. Bouckaert. Choosing between two learning algorithms based on calibrated tests. In ICML'03, pp. 51–58, 2003.
T. G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10:1895–1924, 1998.
D. C. Montgomery et al. Engineering Statistics. 2nd edition. Wiley, 2001.