
Appendix D

Statistics for Performance Evaluation

D.1 Single-Valued Summary Statistics


The mean, or average value, is computed by summing the data and dividing the sum by the number of data items. Specifically,

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad (D.1)

where
x̄ is the mean value
n is the number of data items
xi is the ith data item

The formula for computing variance is shown in Equation D.2.

\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2 \qquad (D.2)

where
σ² is the variance
μ is the population mean
n is the number of data items
xi is the ith data item

When calculating the variance for a sampling of data, a better result is obtained by dividing the sum of squares by n - 1 rather than by n. The proof of this is beyond the scope of this book; however, it suffices to say that when the division is by n - 1, the sample variance is an unbiased estimator of the population variance. An unbiased estimator has the characteristic that the average value of the estimator taken over all possible samples is equal to the parameter being estimated.
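As a concrete illustration of Equations D.1 and D.2, the short Python sketch below computes the mean, the population variance (division by n), and the unbiased sample variance (division by n - 1) for a small list of made-up data values.

```python
# A minimal sketch of Equations D.1 and D.2 using made-up data values.
data = [4.0, 7.0, 5.0, 9.0, 6.0]          # hypothetical data items x_i
n = len(data)

mean = sum(data) / n                                              # Equation D.1
pop_variance = sum((x - mean) ** 2 for x in data) / n             # Equation D.2 (divide by n)
sample_variance = sum((x - mean) ** 2 for x in data) / (n - 1)    # unbiased estimator (divide by n - 1)

print(f"mean = {mean:.4f}")
print(f"population variance = {pop_variance:.4f}")
print(f"sample variance (n - 1) = {sample_variance:.4f}")
```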

D.2 The Normal Distribution


The equation for the normal, or bell-shaped, curve is shown in Equation D.3.
f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-(x - \mu)^2 / (2\sigma^2)} \qquad (D.3)

where
f(x) is the height of the curve corresponding to values of x
e is the base of natural logarithms, approximately 2.718282
μ is the arithmetic mean for the data
σ is the standard deviation
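To make Equation D.3 concrete, the sketch below evaluates the curve at a few points; the choice of μ = 0 and σ = 1 (a standard normal curve) is only for illustration.

```python
import math

def normal_density(x, mu, sigma):
    """Height of the normal curve at x (Equation D.3)."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

# Illustrative values only: a standard normal curve (mu = 0, sigma = 1).
for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"f({x:+.1f}) = {normal_density(x, 0.0, 1.0):.4f}")
```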

D.3 Comparing Supervised Learner Models


In Chapter 7 we described a general technique for comparing two supervised learner models using the same test dataset. Here we provide two additional techniques for comparing supervised models. In both cases, model test set error rate is treated as a sample mean.

Comparing Models with Independent Test Data


With two independent test sets, we simply compute the variance for each model and apply the classical hypothesis testing procedure. An outline of the technique follows.

1. Given
   - Two models, M1 and M2, built with the same training data
   - Two independent test sets, set A containing n1 elements and set B containing n2 elements
   - Error rate E1 and variance v1 for model M1 on test set A
   - Error rate E2 and variance v2 for model M2 on test set B

2. Compute

   P = \frac{|E_1 - E_2|}{\sqrt{v_1 / n_1 + v_2 / n_2}} \qquad (D.4)

3. Conclude

   If P >= 2, the difference in the test set performance of model M1 and model M2 is significant.
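The following Python sketch implements this three-step procedure as a single function; the variance of each error rate is taken as E(1 - E), as in the worked example that follows.

```python
import math

def compare_independent(e1, n1, e2, n2):
    """Equation D.4: significance score for two models tested on independent test sets."""
    v1 = e1 * (1.0 - e1)          # variance associated with error rate E1
    v2 = e2 * (1.0 - e2)          # variance associated with error rate E2
    p = abs(e1 - e2) / math.sqrt(v1 / n1 + v2 / n2)
    return p                      # P >= 2 indicates a significant difference
```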

Let's look at an example. Suppose we wish to compare the test set performance of learner models M1 and M2. We test M1 on test set A and M2 on test set B. Each test set contains 100 instances. M1 achieves an 80% classification accuracy with set A, and M2 obtains a 70% accuracy with test set B. We wish to know if model M1 has performed significantly better than model M2.

For model M1: E1 = 0.20, v1 = 0.2(1 - 0.2) = 0.16


For model M2: E2 = 0.30, v2 = 0.3(1 - 0.3) = 0.21

The computation for P is:


P = \frac{|0.20 - 0.30|}{\sqrt{0.16/100 + 0.21/100}} \approx 1.64

As P ≈ 1.64 is less than 2, the difference in model performance is not considered to be significant. We can increase our confidence in the result by switching the two test sets and repeating the experiment. This is especially important if a significant difference is seen with the initial test set selection. The average of the two values of P is then used for the significance test.
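The same computation can be checked directly in Python; this is just the worked example re-expressed, with nothing assumed beyond the numbers already given.

```python
import math

# Plugging the worked example into Equation D.4.
e1, n1 = 0.20, 100                      # model M1: 20% error on test set A
e2, n2 = 0.30, 100                      # model M2: 30% error on test set B
v1, v2 = e1 * (1 - e1), e2 * (1 - e2)   # 0.16 and 0.21
p = abs(e1 - e2) / math.sqrt(v1 / n1 + v2 / n2)
print(f"P = {p:.2f}")                   # roughly 1.64, below 2: not significant
```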

Pairwise Comparison with a Single Test Set


When the same test set is applied to both models, one option is to perform an instance-by-instance pairwise matching of the test set results. With an instance-based comparison, a single variance score based on pairwise differences is computed. The formula for calculating the joint variance is shown in Equation D.5.
V_{12} = \frac{1}{n - 1} \sum_{i=1}^{n} \left[ (e_{1i} - e_{2i}) - (E_1 - E_2) \right]^2 \qquad (D.5)

where
V12 is the joint variance
e1i is the classifier error on the ith instance for learner model M1
e2i is the classifier error on the ith instance for learner model M2
E1 - E2 is the overall classifier error rate for model M1 minus the classifier error rate for model M2
n is the total number of test set instances

When test set error rate is the measure by which two models are compared, the output attribute is categorical. Therefore, for any instance i contained in class j, eij is 0 if the classification is correct and 1 if the classification is in error. When the output attribute is numeric, eij represents the absolute difference between the computed and actual output value. With the revised formula for computing joint variance, the equation to test for a significant difference in model performance becomes
P = \frac{|E_1 - E_2|}{\sqrt{V_{12} / n}} \qquad (D.6)

Once again, a 95% confidence level for a significant difference in model test set performance is seen if P >= 2. This technique is appropriate only if an instance-based pairwise comparison of model performance is possible. In the next section we address the case where an instance-based comparison is not possible.
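A minimal sketch of the pairwise test for categorical output follows; the 0/1 error vectors are fabricated solely to show how Equations D.5 and D.6 combine.

```python
import math

def pairwise_significance(errors1, errors2):
    """Equations D.5 and D.6: joint variance and significance score for a paired comparison."""
    n = len(errors1)
    e1_rate = sum(errors1) / n                      # overall error rate E1
    e2_rate = sum(errors2) / n                      # overall error rate E2
    diff = e1_rate - e2_rate
    joint_variance = sum(((a - b) - diff) ** 2
                         for a, b in zip(errors1, errors2)) / (n - 1)   # Equation D.5
    return abs(diff) / math.sqrt(joint_variance / n)                    # Equation D.6

# Hypothetical 0/1 error vectors (0 = correct, 1 = error) for ten test instances.
m1_errors = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
m2_errors = [1, 1, 0, 1, 1, 0, 1, 0, 1, 0]
print(f"P = {pairwise_significance(m1_errors, m2_errors):.2f}")
```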

D.4 Confidence Intervals for Numeric Output


Just as with categorical output, we are interested in computing confidence intervals for one or more numeric measures. For purposes of illustration, we use mean absolute error (mae). As with classifier error rate, mean absolute error is treated as a sample mean. The sample variance is given by the formula:
variance(mae) = \frac{1}{n - 1} \sum_{i=1}^{n} (e_i - mae)^2 \qquad (D.7)

where
ei is the absolute error for the ith instance
n is the number of instances

Let's look at an example using the data in Table 7.2. To determine a confidence interval for the mae computed for that data, we first calculate the variance. Specifically,

variance(0.0604) = \frac{1}{14} \sum_{i=1}^{15} (e_i - 0.0604)^2
= \frac{(0.024 - 0.0604)^2 + (0.002 - 0.0604)^2 + \cdots + (0.001 - 0.0604)^2}{14}
\approx 0.0092

Next, as with the classifier error rate, we compute the standard error for the mae as the square root of the variance divided by the number of sample instances.
SE = \sqrt{0.0092 / 15} \approx 0.0248

Finally, we calculate the 95% confidence interval by respectively subtracting and adding two standard errors to the computed mae. This tells us that we can be 95% confident that the actual mae falls somewhere between 0.0108 and 0.1100.
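The whole calculation (mae, sample variance, standard error, and the two-standard-error interval) is captured by the sketch below. The absolute-error list is a stand-in invented for illustration, not the actual Table 7.2 values.

```python
import math

def mae_confidence_interval(abs_errors):
    """mae, its sample variance (Equation D.7), and a 95% confidence interval."""
    n = len(abs_errors)
    mae = sum(abs_errors) / n
    variance = sum((e - mae) ** 2 for e in abs_errors) / (n - 1)
    standard_error = math.sqrt(variance / n)
    return mae, mae - 2 * standard_error, mae + 2 * standard_error

# Hypothetical absolute errors for 15 test set instances.
errors = [0.024, 0.002, 0.091, 0.065, 0.001, 0.080, 0.012, 0.055,
          0.130, 0.038, 0.077, 0.110, 0.049, 0.093, 0.079]
mae, lower, upper = mae_confidence_interval(errors)
print(f"mae = {mae:.4f}, 95% interval = [{lower:.4f}, {upper:.4f}]")
```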


D.5 Comparing Models with Numeric Output


The procedure for comparing models giving numeric output is identical to that for models with categorical output. In the case where two independent test sets are available and mae measures model performance, the classical hypothesis testing model takes the form:

P = \frac{|mae_1 - mae_2|}{\sqrt{v_1 / n_1 + v_2 / n_2}} \qquad (D.8)

where
mae1 is the mean absolute error for model M1
mae2 is the mean absolute error for model M2
v1 and v2 are variance scores associated with M1 and M2
n1 and n2 are the number of instances within each respective test set

When the models are tested on the same data and a pairwise comparison is possible, we use the formula:

P = \frac{|mae_1 - mae_2|}{\sqrt{v_{12} / n}} \qquad (D.9)

where
mae1 is the mean absolute error for model M1
mae2 is the mean absolute error for model M2
v12 is the joint variance computed with the formula defined in Equation D.5
n is the number of test set instances

When the same test data is applied but a pairwise comparison is not possible, the most straightforward approach is to compute the variance associated with the mae for each model using the equation:

variance(mae_j) = \frac{1}{n - 1} \sum_{i=1}^{n} (e_i - mae_j)^2 \qquad (D.10)

where
maej is the mean absolute error for model j
ei is the absolute value of the computed value minus the actual value for instance i
n is the number of test set instances

The hypothesis of no significant difference is then tested with Equation D.11.


P = \frac{|mae_1 - mae_2|}{\sqrt{v(2/n)}} \qquad (D.11)

where
v is either the average or the larger of the variance scores for each model
n is the total number of test set instances

As is the case when the output attribute is categorical, using the larger of the two variance scores is the stronger test.
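A short sketch of this non-paired case follows: each model's variance comes from Equation D.10, and the larger of the two is plugged into Equation D.11. The absolute-error lists are invented for illustration only.

```python
import math

def mae_and_variance(abs_errors):
    """Mean absolute error and its sample variance (Equation D.10)."""
    n = len(abs_errors)
    mae = sum(abs_errors) / n
    variance = sum((e - mae) ** 2 for e in abs_errors) / (n - 1)
    return mae, variance

def compare_numeric_models(errors1, errors2):
    """Equation D.11 using the larger variance score (the stronger test)."""
    mae1, v1 = mae_and_variance(errors1)
    mae2, v2 = mae_and_variance(errors2)
    n = len(errors1)                      # both models see the same n test instances
    v = max(v1, v2)
    return abs(mae1 - mae2) / math.sqrt(v * (2.0 / n))

# Invented absolute errors for two models on the same ten test instances.
m1_errors = [0.05, 0.02, 0.09, 0.04, 0.07, 0.03, 0.06, 0.08, 0.02, 0.05]
m2_errors = [0.09, 0.07, 0.12, 0.08, 0.11, 0.06, 0.10, 0.13, 0.07, 0.09]
p = compare_numeric_models(m1_errors, m2_errors)
print(f"P = {p:.2f}  ({'significant' if p >= 2 else 'not significant'})")
```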
