
Evaluating Classifiers

Lecture 2
Instructor: Max Welling
Evaluation of Results
How do you report classification error?

How certain are you about the error you claim?

How do you compare two algorithms?

How certain are you if you state that one algorithm performs better than another?
Evaluation
Given:
A hypothesis h(x): X → C, in hypothesis space H, mapping attributes x to classes c ∈ {1, 2, 3, ..., C}.
A data-sample S(n) of size n.

Questions:
What is the error of h on unseen data?
If we have two competing hypotheses, which one is better on unseen data?
How do we compare two learning algorithms in the face of limited data?

How certain are we about our answers?


Sample and True Error
We can define two errors:

1) error(h|S) is the error on the sample S:

$$\mathrm{error}(h \mid S) = \frac{1}{n}\sum_{i=1}^{n}\delta\big[h(x_i)\neq y_i\big]$$

2) error(h|P) is the true error on the unseen data sampled from the distribution P(x):

$$\mathrm{error}(h \mid P) = \int dx\, P(x)\,\delta\big[h(x)\neq f(x)\big]$$

where f(x) is the true hypothesis and δ[·] equals 1 when its argument is true and 0 otherwise.
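For concreteness, a minimal sketch of the sample error in Python (NumPy assumed; the hypothesis and data below are toy placeholders):

```python
# Sketch: sample error of a hypothesis h on a labeled sample S = (X, y).
import numpy as np

def sample_error(h, X, y):
    """error(h|S): fraction of sample points where h's prediction differs from the label."""
    predictions = np.array([h(x) for x in X])
    return np.mean(predictions != y)

# Toy usage: a trivial hypothesis that always predicts class 0.
X = np.array([[0.1], [0.7], [0.4]])
y = np.array([0, 1, 0])
print(sample_error(lambda x: 0, X, y))   # 1/3: one of three labels disagrees
```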


Binomial Distributions
Assume you toss a coin n times, and it has probability p of coming up heads (which we will call success).

What is the probability distribution governing the number of heads in n trials?

Answer: the Binomial distribution.

$$P(\#\mathrm{heads}=r \mid p, n) = \frac{n!}{r!(n-r)!}\,p^r(1-p)^{n-r}$$
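To get a feel for this distribution, here is a small sketch using SciPy (n and p are arbitrary example values):

```python
# Sketch: the Binomial distribution of the number of heads in n tosses.
import numpy as np
from scipy.stats import binom

n, p = 20, 0.3                       # 20 tosses, probability 0.3 of heads

r = np.arange(n + 1)
pmf = binom.pmf(r, n, p)             # P(#heads = r | p, n) for r = 0..n
print(pmf.round(3))

# Simulated check: repeat the n tosses 10000 times.
samples = binom.rvs(n, p, size=10000)
print(samples.mean(), n * p)         # empirical mean should be close to np
```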
Distribution over Errors
Consider some hypothesis h(x)

Draw n samples Xk~P(X).

Do this k times.

Compute e1=n*error(h|X1), e2=n*error(h|X2),...,ek=n*error(h|Xk).

{e1,...,ek} are samples from a Binomial distribution !

Why? Imagine a magic coin, where God secretly determines the probability of heads by the following procedure. First He takes some random hypothesis h. Then He draws x~P(x) and observes whether h(x) predicts the label correctly. If it does, He makes sure the coin lands heads up...

You have a single sample S, for which you observe e(S) errors. What do you think would be a reasonable estimate for error(h|P)?
Binomial Moments

$$\mathrm{mean}(r) \equiv E[r \mid n, p] = np$$

$$\mathrm{var}(r) \equiv E\big[(r - E[r])^2\big] = np(1-p)$$

If we match the mean, np, with the observed value n*error(h|S) we find:

$$E[\mathrm{error}(h \mid P)] = E[r/n] = p \approx \mathrm{error}(h \mid S)$$

If we match the variance we can obtain an estimate of the width:


$$\mathrm{var}[\mathrm{error}(h \mid P)] = \mathrm{var}[r/n] \approx \frac{\mathrm{error}(h \mid S)\,\big(1-\mathrm{error}(h \mid S)\big)}{n}$$
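A small worked example (numbers invented for illustration): if h makes r = 18 errors on a test sample of n = 100 points, then error(h|S) = 18/100 = 0.18, and the estimated variance is 0.18 · 0.82 / 100 ≈ 0.0015, i.e. a standard deviation of about √0.0015 ≈ 0.038.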
Confidence Intervals
We would like to state:

With N% confidence we believe that error(h|P) is contained in the interval:

$$\mathrm{error}(h \mid P) \in \mathrm{error}(h \mid S) \pm z_N \sqrt{\frac{\mathrm{error}(h \mid S)\big(1-\mathrm{error}(h \mid S)\big)}{n}}$$

For example, at N = 80% confidence, z_{0.8} = 1.28.

[Figure: standard Normal(0,1) density with the z-value 1.28 marked.]

In principle z_N is hard to compute exactly, but for np(1-p) > 5 or n > 30 it is safe to approximate the Binomial by a Gaussian, for which we can easily compute z-values.
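A minimal sketch of this interval in Python (SciPy assumed; the two-sided z-value comes from the inverse CDF of the standard Normal):

```python
# Sketch: N% confidence interval for error(h|P) from an observed sample error.
import numpy as np
from scipy.stats import norm

def error_confidence_interval(err_s, n, confidence=0.80):
    """Normal-approximation interval around the sample error err_s = error(h|S)."""
    z = norm.ppf(0.5 + confidence / 2.0)   # two-sided z-value, e.g. 1.28 at 80%
    width = z * np.sqrt(err_s * (1.0 - err_s) / n)
    return err_s - width, err_s + width

# Example: 18 errors on 100 test points.
print(error_confidence_interval(0.18, 100, confidence=0.80))
```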
Bias-Variance
The estimator is unbiased if $E[\mathrm{error}(h \mid X)] = p$.

Imagine again you have infinitely many sample sets X1,X2,.. of size n.

Use these to compute estimates E1,E2,... of p where Ei=error(h|Xi)

If the average of E1,E2,.. converges to p, then error(h|X) is an unbiased estimator.

Two unbiased estimators can still differ in their variance (efficiency). Which one do you prefer?

[Figure: sampling distributions of two unbiased estimators, both centered on the true value p (average estimate E_av).]
Flow of Thought
Determine the property you want to know about the future data (e.g. error(h|P))

Find an unbiased estimator E for this quantity based on observing data X (e.g. error(h|X))

Determine the distribution P(E) of E under the assumption you have infinitely
many sample sets X1,X2,...of some size n. (e.g. p(E)=Binomial(p,n), p=error(h|P))

Estimate the parameters of P(E) from an actual data sample S (e.g. p=error(h|S))

Compute the mean and variance of P(E) and pray that P(E) is close to a Normal distribution.
(sums of random variables converge to normal distributions: the central limit theorem)

State your confidence interval as: with confidence N%, error(h|P) is contained in the interval

$$\mathrm{mean} \pm z_N \sqrt{\mathrm{var}}$$
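As a sanity check on this flow of thought, a small simulation (a sketch; the true error p is known here only because we chose it ourselves):

```python
# Sketch: verify that the N% interval covers the true error about N% of the time.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
p_true, n, confidence = 0.18, 100, 0.80
z = norm.ppf(0.5 + confidence / 2.0)

hits, trials = 0, 10000
for _ in range(trials):
    err_s = rng.binomial(n, p_true) / n            # sample error on one test set
    width = z * np.sqrt(err_s * (1 - err_s) / n)
    hits += (err_s - width <= p_true <= err_s + width)

print(hits / trials)   # should come out close to 0.80
```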
Assumptions

We only consider discrete-valued hypotheses (i.e. classification).

Training data and test data are drawn IID from the same distribution P(x).
(IID: independently & identically distributed)

The hypothesis must be chosen independently from the data sample S!

When you obtain a hypothesis from a learning algorithm, split the data
into a training set and a testing set. Find the hypothesis using the training set
and estimate error on the testing set.
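For instance, with scikit-learn (assuming it is available; the classifier and synthetic data are arbitrary illustrations):

```python
# Sketch: estimate the error on data the hypothesis never saw during training.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

h = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
test_error = 1.0 - h.score(X_test, y_test)   # error(h|S) on the held-out test set
print(test_error)
```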
Comparing Hypotheses
Assume we want to compare two hypotheses h1 and h2, which we have tested on two independent samples S1 and S2 of size n1 and n2.

I.e. we are interested in the quantity: $d \equiv \mathrm{error}(h_1 \mid P) - \mathrm{error}(h_2 \mid P)$

Define an estimator for d: $\hat{d} = \mathrm{error}(h_1 \mid X_1) - \mathrm{error}(h_2 \mid X_2)$, with X1, X2 sample sets of size n1, n2.

Since error(h1|S1) and error(h2|S2) are both approximately Normal, their difference is approximately Normal with:

$$\mathrm{mean} = \hat{d} = \mathrm{error}(h_1 \mid S_1) - \mathrm{error}(h_2 \mid S_2)$$

$$\mathrm{var} \approx \frac{\mathrm{error}(h_1 \mid S_1)\big(1-\mathrm{error}(h_1 \mid S_1)\big)}{n_1} + \frac{\mathrm{error}(h_2 \mid S_2)\big(1-\mathrm{error}(h_2 \mid S_2)\big)}{n_2}$$

Hence, with N% confidence we believe that d is contained in the interval:

$$d \in \mathrm{mean} \pm z_N \sqrt{\mathrm{var}}$$
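A sketch of this interval in Python (the error rates and sample sizes below are invented):

```python
# Sketch: N% confidence interval for d = error(h1|P) - error(h2|P),
# with h1 and h2 tested on independent samples of size n1 and n2.
import numpy as np
from scipy.stats import norm

def difference_interval(err1, n1, err2, n2, confidence=0.95):
    d_hat = err1 - err2
    var = err1 * (1 - err1) / n1 + err2 * (1 - err2) / n2
    z = norm.ppf(0.5 + confidence / 2.0)
    return d_hat - z * np.sqrt(var), d_hat + z * np.sqrt(var)

# Example: h1 errs 15% on 500 test points, h2 errs 19% on 400.
print(difference_interval(0.15, 500, 0.19, 400))
```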
Paired Tests
Consider the following data:

error(h1|s1)=0.1 error(h2|s1)=0.11
error(h1|s2)=0.2 error(h2|s2)=0.21
error(h1|s3)=0.66 error(h2|s3)=0.67
error(h1|s4)=0.45 error(h2|s4)=0.46
and so on.

We have var(error(h1)) = large and var(error(h2)) = large. Treated as independent, the total variance of error(h1) - error(h2) is their sum.
However, h1 is consistently better than h2.

We ignored the fact that we compare on the same data.


We want a different estimator that compares data one by one.

You can use a paired t-test (e.g. in MATLAB) to see if the two errors are significantly different, or if one error is significantly larger than the other.
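In Python the analogue is scipy.stats.ttest_rel (the error values below follow this slide, with one value perturbed so the paired differences are not all identical, which would make the t-statistic degenerate):

```python
# Sketch: paired t-test on per-sample errors of two hypotheses.
from scipy.stats import ttest_rel

err_h1 = [0.10, 0.20, 0.66, 0.45]
err_h2 = [0.11, 0.22, 0.67, 0.46]   # slightly perturbed from the slide

t_stat, p_value = ttest_rel(err_h1, err_h2)
print(t_stat, p_value)   # small p-value: h1 is consistently better than h2
```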
Paired t-test
Chunk the data up into subsets T1,...,Tk with |Ti| > 30.

On each subset compute the error difference: $\delta_i \equiv \mathrm{error}(h_1 \mid T_i) - \mathrm{error}(h_2 \mid T_i)$

Now compute:

$$\bar{\delta} = \frac{1}{k}\sum_{i=1}^{k}\delta_i$$

$$s_{\bar{\delta}} = \sqrt{\frac{1}{k(k-1)}\sum_{i=1}^{k}\big(\delta_i - \bar{\delta}\big)^2}$$

State: with N% confidence the difference in error between h1 and h2 is:

$$\bar{\delta} \pm t_{N,k-1}\, s_{\bar{\delta}}$$
t_{N,k-1} is the critical value of the Student-t distribution with k-1 degrees of freedom (Table 5.6).
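The same computation by hand in Python (SciPy assumed; the δ_i values are invented, matching the perturbed numbers from the previous slide):

```python
# Sketch: N% confidence interval for the paired error difference.
import numpy as np
from scipy.stats import t

deltas = np.array([-0.01, -0.02, -0.01, -0.01])  # delta_i = error(h1|Ti) - error(h2|Ti)
k = len(deltas)

delta_bar = deltas.mean()
s_delta = np.sqrt(np.sum((deltas - delta_bar) ** 2) / (k * (k - 1)))

confidence = 0.95
t_crit = t.ppf(0.5 + confidence / 2.0, df=k - 1)  # t_{N,k-1}
print(delta_bar - t_crit * s_delta, delta_bar + t_crit * s_delta)
```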
Comparing Learning Algorithms
In general it is a really bad idea to estimate error rates on the same data
on which a learning algorithm is trained. WHY?

So, just as in cross-validation, we split the data into k subsets: S → {T1, T2, ..., Tk}.

Train both learning algorithm 1 (L1) and learning algorithm 2 (L2) on the complement of each subset, {S-T1, S-T2, ...}, to produce hypotheses {L1(S-Ti), L2(S-Ti)} for all i.

Compute for all i: $\delta_i \equiv \mathrm{error}\big(L_1(S-T_i) \mid T_i\big) - \mathrm{error}\big(L_2(S-T_i) \mid T_i\big)$

Note: we train on S-Ti, but test on Ti.

As in the last slide, perform a paired t-test on these differences to compute an estimate and a confidence interval for the relative error of the hypotheses produced by L1 and L2.
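A sketch of the whole procedure with scikit-learn (the two algorithms and the synthetic data are arbitrary stand-ins for L1 and L2):

```python
# Sketch: paired comparison of two learning algorithms via k-fold splits.
from scipy.stats import ttest_1samp
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)

deltas = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    # Train both algorithms on S - Ti, test both on Ti (|Ti| = 100 > 30 here).
    h1 = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    h2 = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    err1 = 1.0 - h1.score(X[test_idx], y[test_idx])
    err2 = 1.0 - h2.score(X[test_idx], y[test_idx])
    deltas.append(err1 - err2)

# One-sample t-test of the paired differences against zero.
print(ttest_1samp(deltas, 0.0))
```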
Evaluation: ROC curves
[Figure: score distributions of class 0 (negatives) and class 1 (positives), with a moving decision threshold between them.]

Identify a threshold in your classifier that you can shift, and plot the ROC curve while you shift that parameter.

TP = true positive rate = # positives classified as positive, divided by # positives

FP = false positive rate = # negatives classified as positive, divided by # negatives

TN = true negative rate = # negatives classified as negative, divided by # negatives

FN = false negative rate = # positives classified as negative, divided by # positives
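A sketch of the threshold sweep with scikit-learn (classifier and data are arbitrary; any score with a shiftable threshold works):

```python
# Sketch: ROC curve by sweeping the decision threshold of a scoring classifier.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

h = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = h.predict_proba(X_te)[:, 1]            # the shiftable threshold acts on these

fpr, tpr, thresholds = roc_curve(y_te, scores)  # FP rate vs. TP rate per threshold
print(list(zip(fpr.round(2), tpr.round(2)))[:10])
```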
Conclusion

Never (ever) draw error-curves without confidence intervals

(The second most important sentence of this course)
