

Advanced Machine Learning & Perception

Instructor: Tony Jebara



(Daniel Zhu's class is devoted entirely to this topic.)

Boosting
Combining Multiple Classifiers
Voting
Boosting
(idea: take many weak learners and combine them to get a strong learner)
AdaBoost

Based on material by Y. Freund, P. Long & R. Schapire

(Notes: Jebara co-taught with these authors. Freund and Schapire had the big breakthrough algorithm; a better way to derive it is through an online learning framework.)


Combining Multiple Learners

Have many simple decision learners, also called base learners or weak learners, which have a classification error of < 0.5.
(A stump is a weak learner that looks at one feature only; maybe a little better than random.)
Combine or vote them to get a higher accuracy.
No free lunch: there is no guaranteed best approach here.
Different approaches:
Voting: combine learners with a fixed weight.
Mixture of Experts: adjust the learners and a variable weight/gate fn.
Boosting: actively search for the next base learners and vote.
Cascading, Stacking, Bagging, etc.

Notes:
Not just voting on a fixed bunch of experts: you have billions or trillions of experts, all with weight zero, and you pick one and add it.
Q: will you be able to do gradient descent like with kernel learning with switches?
A newer method is "regret minimization", related to multi-armed bandits (Daniel Zhu's course) and online learning.
Face finding: the Viola-Jones boosted cascade catches faces without having to scan the entire image exhaustively.
Instead of voting (a linear combination of decisions), you could do gating (trust doctor 1 on children; doctor 2 is best on elders). The big thing is that you have to make assumptions to do well.


Voting
Have T classifiers h_t(x).
Average their prediction with weights:

f(x) = \sum_{t=1}^{T} \alpha_t h_t(x)   where \alpha_t \ge 0 and \sum_{t=1}^{T} \alpha_t = 1

Like mixture of experts, but the weight is constant with respect to the input.

[Figure: T experts h_t(x) whose outputs are combined with fixed weights into the prediction y.]

A slightly smarter thing to do: change alpha (your confidence in each classifier), i.e., change the allocation of the alphas depending on the input (child vs. elder). This is pretty much the same as a neural network (which is why the figures look the same); the formula is the same as before, but now alpha depends on x.
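As a concrete illustration (my own sketch, not from the slides), fixed-weight voting over a few hypothetical stump classifiers; the stumps and weights below are made up purely for illustration:

import numpy as np

def vote(classifiers, alphas, X):
    # Fixed-weight vote: f(x) = sum_t alpha_t * h_t(x); prediction = sign(f(x)).
    # classifiers: functions mapping an (N, d) array to labels in {-1, +1}
    # alphas: nonnegative weights summing to 1 (the convex combination above)
    f = sum(a * h(X) for a, h in zip(alphas, classifiers))
    return np.sign(f)

# Hypothetical stumps on 2-D data, with arbitrarily chosen weights
h1 = lambda X: np.where(X[:, 0] > 0.0, 1, -1)
h2 = lambda X: np.where(X[:, 1] > 0.5, 1, -1)
h3 = lambda X: np.where(X[:, 0] + X[:, 1] > 1.0, 1, -1)
X = np.random.randn(5, 2)
print(vote([h1, h2, h3], [0.4, 0.35, 0.25], X))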


Mixture of Experts
Have T classifiers or experts h_t(x) and a gating fn \alpha_t(x).
Average their prediction with variable weights.
But adapt the parameters of the gating function and the experts (fixed total number T of experts):

f(x) = \sum_{t=1}^{T} \alpha_t(x) h_t(x)   where \alpha_t(x) \ge 0 and \sum_{t=1}^{T} \alpha_t(x) = 1

[Figure: a gating network reads the input and weights the outputs of T experts to produce the output.]

In boosting, by contrast, we search over a massive pool of learners. There is no gating function here: find someone who is good at what you're not good at. Find the next learner by looking at which examples we messed up on, and pick and train an expert for the previously mistaken labels. The idea is that eventually this converges. We greedily add the next best expert (there is some nice theory that you'll find the equivalent of the best possible large-margin solution), which lets you throw in anything as a weak learner.
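A minimal sketch (my own, not from the slides) of an input-dependent gate: a softmax gating function producing \alpha_t(x) that varies with x. The gate parameters here are untrained and illustrative only:

import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def mixture_predict(experts, gate_W, X):
    # alpha_t(x): a softmax over a linear gating network, so the weights are
    # nonnegative and sum to 1 for every input x, but now they vary with x.
    alphas = softmax(X @ gate_W)                   # shape (N, T)
    H = np.column_stack([h(X) for h in experts])   # shape (N, T), entries in {-1, +1}
    return np.sign((alphas * H).sum(axis=1))

# Hypothetical experts and an untrained gate, purely for illustration
experts = [lambda X: np.where(X[:, 0] > 0, 1, -1),
           lambda X: np.where(X[:, 1] > 0, 1, -1)]
gate_W = np.random.randn(2, 2)    # maps a 2-D input to 2 gate scores
X = np.random.randn(4, 2)
print(mixture_predict(experts, gate_W, X))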


Boosting
Actively find complementary or synergistic weak learners
Train next learner based on mistakes of previous ones.
Average prediction with fixed weights
Find next learner by training on weighted versions of data.
The boosting loop (diagram):

training data: (x_1, y_1), ..., (x_N, y_N)
weights on the data: {w_1, ..., w_N}
weak learner: trained on the weighted data, returns a weak rule h_t(x) whose weighted error satisfies

\epsilon_t = \frac{\sum_{i=1}^{N} w_i \, \mathrm{step}(-h_t(x_i) y_i)}{\sum_{i=1}^{N} w_i} < \frac{1}{2}

ensemble of learners: (\alpha_1, h_1), ..., (\alpha_T, h_T)
prediction: \sum_{t=1}^{T} \alpha_t h_t(x)

Q: at the end, do we have similar results to SVMs, where a few support vectors have large Lagrange multipliers and the rest do not?

The error measures how well the weak learner performs; its edge gamma is non-negative. It is an average loss, an empirical risk: when the prediction is correct, the step function sees a negative number and maps it to zero; otherwise you pay 1 dollar. An error of 1/2 is no better than random guessing; something at 1/2 is not even a learner.

We have a lot of data. All examples initially have equal weights. After one round, we weight down the examples we answered correctly and increase the weights on the incorrect ones.

Related methods: LogitBoost, BrownBoost. AdaBoost focuses on the margin. The margin is a quantity of risk: how strongly the committee votes for the answer. We want to minimize the error, i.e., use a classifier h that minimizes error. Instead of working with the ugly step function, we work with an upper bound on the step function and minimize that; that smoothing is what AdaBoost adds. You could imagine a hinge loss like the SVM's, and other boosting methods introduce other ways to smooth the step function. (Another benefit: each exponential is convex, so the sum of exponentials is also convex.)
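To make the "upper bound on the step function" concrete, here is the surrogate AdaBoost minimizes, written out (consistent with the R_emp bound derived later in these notes):

\mathrm{step}(-y f(x)) \le \exp(-y f(x)),
\qquad
\frac{1}{N} \sum_{i=1}^{N} \mathrm{step}\big(-y_i f(x_i)\big) \le \frac{1}{N} \sum_{i=1}^{N} \exp\big(-y_i f(x_i)\big),
\qquad
f(x) = \sum_{t=1}^{T} \alpha_t h_t(x)

The two sides agree at y f(x) = 0, and each exponential term is convex in the \alpha_t, so the surrogate is convex.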


AdaBoost
Most popular weighting scheme.
Define the margin for point i as  y_i \sum_{t=1}^{T} \alpha_t h_t(x_i).
Find an h_t and a weight \alpha_t to minimize the cost function, the sum of exp-margins:

\sum_{i=1}^{N} \exp\Big(-y_i \sum_{t} \alpha_t h_t(x_i)\Big)

Note: this is the margin; we want h_t(x_i) to have the right sign and be large.

[Same boosting-loop diagram as on the previous slide: weighted training data feed the weak learner, which returns a weak rule h_t(x) with weighted error below 1/2; the ensemble (\alpha_1, h_1), ..., (\alpha_T, h_T) makes the prediction \sum_t \alpha_t h_t(x).]

Q: are we still getting a linear classifier (like a one-layer neural network), at least in each dimension we work with?
Q: is each weak learner always just a binary classifier?
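As a small illustration (my own, not from the slides), computing the margins and the exponential cost of a given ensemble of (alpha_t, h_t) pairs:

import numpy as np

def margins(ensemble, X, y):
    # margin_i = y_i * sum_t alpha_t * h_t(x_i)
    F = sum(alpha * h(X) for alpha, h in ensemble)
    return y * F

def exp_cost(ensemble, X, y):
    # AdaBoost's cost function: the sum of exponentiated negative margins
    return np.exp(-margins(ensemble, X, y)).sum()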


AdaBoost
Choose the base learner and \alpha_t to

\min_{\alpha, h} \sum_{i=1}^{N} \exp\big(-y_i \alpha_t h_t(x_i)\big)

(Q: is a w_i missing here, in the cost being minimized?)

Recall: the error of the base classifier h_t must be

\epsilon_t = \frac{\sum_{i=1}^{N} w_i \, \mathrm{step}(-h_t(x_i) y_i)}{\sum_{i=1}^{N} w_i} < \frac{1}{2}

For binary h, AdaBoost puts this weight on weak learners (instead of the more general rule):

\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}

AdaBoost picks the following weights on the data for the next round (here Z_t is the normalizer so the weights sum to 1):

w_i^{t+1} = \frac{w_i^t \exp\big(-\alpha_t y_i h_t(x_i)\big)}{Z_t}

This is minimizing the exponentiated margins over all classifiers.

A weak learner called a stump is a classifier that cuts along one feature only: the simplest possible classifier, an axis-aligned cut.
Q: Does it matter what order the stumps are added in? I think so... so you don't have a deterministic way to create a decision tree?
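To tie the updates above together, a minimal sketch of AdaBoost with decision stumps as the weak learners (my own illustrative code, not code from the course):

import numpy as np

def train_stump(X, y, w):
    # Exhaustive search for the axis-aligned stump (feature, threshold, polarity)
    # with the smallest *weighted* error under the current data weights w.
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for thresh in np.unique(X[:, j]):
            for polarity in (1, -1):
                pred = np.where(X[:, j] > thresh, polarity, -polarity)
                err = w[pred != y].sum()
                if err < best_err:
                    best_err, best = err, (j, thresh, polarity)
    j, thresh, polarity = best
    h = lambda Z, j=j, t=thresh, p=polarity: np.where(Z[:, j] > t, p, -p)
    return h, best_err

def adaboost(X, y, T=50):
    # y must be in {-1, +1}; weights start uniform and always sum to 1.
    n = len(y)
    w = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(T):
        h, eps = train_stump(X, y, w)
        if eps >= 0.5:                                      # no better than random: stop
            break
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))   # alpha_t = (1/2) ln((1-eps_t)/eps_t)
        w = w * np.exp(-alpha * y * h(X))                   # reweight: up-weight mistakes
        w = w / w.sum()                                     # divide by Z_t so weights sum to 1
        ensemble.append((alpha, h))
    return ensemble

def predict(ensemble, X):
    # Final prediction is the sign of the weighted vote sum_t alpha_t h_t(x).
    return np.sign(sum(alpha * h(X) for alpha, h in ensemble))

Calling adaboost(X, y) on a dataset with labels in {-1, +1} returns the ensemble (alpha_1, h_1), ..., (alpha_T, h_T); predict then applies sign(sum_t alpha_t h_t(x)), the prediction rule in the loop diagram above.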


Decision Trees

[Figure: a decision tree over features X and Y, splitting on X > 3 and then Y > 5; the (X, Y) plane is cut at X = 3 and Y = 5 into rectangular regions labeled +1 and -1.]


Decision tree as a sum

[Figure: the same tree rewritten as a sum of scores passed through a sign. A root prediction node contributes -0.2 to every datapoint (it does no cutting); the X > 3 stump adds +0.1 or -0.1; the Y > 5 stump adds +0.2 or -0.3. Each contribution adds to the previous running total (the -0.2, etc.) rather than replacing it, and the regions of the plane carry the accumulated scores.]

Find the best stump and then add it: that is boosting.
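A minimal sketch of reading such a tree as a sum of scores; the branch-to-score assignment below is illustrative, not read off the lecture figure:

def adtree_score(x, y):
    # Illustrative scores only: which branch gets which score is made up here.
    score = -0.2                      # root prediction node, applied to every point
    score += 0.1 if x > 3 else -0.1   # first stump: X > 3
    score += 0.2 if y > 5 else -0.3   # second stump: Y > 5
    return score

def adtree_predict(x, y):
    # Final label is the sign of the accumulated score.
    return 1 if adtree_score(x, y) > 0 else -1

print(adtree_predict(4.0, 6.0))  # score = -0.2 + 0.1 + 0.2 = +0.1  ->  +1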


An alternating decision tree

[Figure: the sum-of-stumps tree extended with an additional split, Y < 1, contributing scores 0.0 and +0.7; the regions of the plane carry the accumulated scores, and the prediction is again the sign of the sum.]


Example: Medical Diagnostics


Cleve dataset from UC Irvine database.
Heart disease diagnostics (+1=healthy,-1=sick)
13 features from tests (real valued and discrete).
303 instances.

Q: would it be crazy to build a GM off of this tree from boosting?


Ad-Tree Example


Cross-validated accuracy

Learning algorithm   Number of splits   Average test error   Test error variance
ADtree                                  17.0%                0.6%
C5.0                 27                 27.2%                0.5%
C5.0 + boosting      446                20.2%                0.5%
Boost Stumps         16                 16.5%                0.8%

Notes:
Don't worry about a hierarchy of applying stumps; just apply all stumps to all datapoints.
From the previous slide: boosting is helping. Why? We are bounding the error rate; the step function is always less than or equal to the exponential (they are equal at the origin).


AdaBoost Convergence

Rationale? Consider a bound on the training error:

R_{emp} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{step}\big(-y_i f(x_i)\big)
        \le \frac{1}{N} \sum_{i=1}^{N} \exp\big(-y_i f(x_i)\big)
        = \frac{1}{N} \sum_{i=1}^{N} \exp\Big(-y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)\Big)
        = \prod_{t=1}^{T} Z_t

[Figure: loss versus margin y f(x). The 0-1 (step) loss jumps from 1 (mistakes) to 0 (correct); the exponential loss upper-bounds it; LogitBoost and BrownBoost use alternative smooth upper bounds. The exponential is a penalty I'm paying even if I get the classification correct.]

Steps in the chain: the exp bound on the step, then the definition of f(x), then the recursive use of Z. Z_t was the normalizer at a particular round; through the recursive definition, the average of exponentials is actually equal to the product of all the Z_t's ever. This is an exponentiated margin loss.

AdaBoost is essentially doing gradient descent on this.

Convergence? Greedy, one-at-a-time decrease.
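To spell out the "recursive use of Z" step (a short derivation using the weight update from the AdaBoost slide):

Since w_i^1 = 1/N and w_i^{t+1} = w_i^t \exp(-\alpha_t y_i h_t(x_i)) / Z_t, unrolling the recursion gives

w_i^{T+1} = \frac{\frac{1}{N} \exp\big(-y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)\big)}{\prod_{t=1}^{T} Z_t}

and because the w_i^{T+1} sum to one over i,

\frac{1}{N} \sum_{i=1}^{N} \exp\big(-y_i f(x_i)\big) = \prod_{t=1}^{T} Z_t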


AdaBoost Convergence
Convergence? Consider the binary h_t case.

R_{emp} \le \prod_{t=1}^{T} Z_t = \prod_{t=1}^{T} \sum_{i=1}^{N} w_i^t \exp\big(-\alpha_t y_i h_t(x_i)\big)

Plug in the alpha update from slide 8, \alpha_t = \frac{1}{2} \ln \frac{1-\epsilon_t}{\epsilon_t}:

= \prod_{t=1}^{T} \sum_{i=1}^{N} w_i^t \exp\Big( \ln \Big(\frac{\epsilon_t}{1-\epsilon_t}\Big)^{\frac{1}{2} y_i h_t(x_i)} \Big)
= \prod_{t=1}^{T} \sum_{i=1}^{N} w_i^t \Big(\frac{\epsilon_t}{1-\epsilon_t}\Big)^{\frac{1}{2} y_i h_t(x_i)}
= \prod_{t=1}^{T} \Big( \sum_{\text{correct}} w_i^t \sqrt{\tfrac{\epsilon_t}{1-\epsilon_t}} + \sum_{\text{incorrect}} w_i^t \sqrt{\tfrac{1-\epsilon_t}{\epsilon_t}} \Big)
= \prod_{t=1}^{T} 2 \sqrt{\epsilon_t (1-\epsilon_t)}
= \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^2}
\le \prod_{t=1}^{T} \exp(-2\gamma_t^2)

writing \epsilon_t = \frac{1}{2} - \gamma_t, so that

R_{emp} \le \exp\Big(-2 \sum_{t=1}^{T} \gamma_t^2\Big) \le \exp(-2 T \gamma^2)

Gamma means each learner is stronger than a coin flip: all weighted errors are less than 1/2. T increases as I add learners, so with more iterations the bound on the error shrinks exponentially, as long as we guarantee that each weak learner is better than random (has an edge of at least gamma over random).

So, the final learner converges exponentially fast in T if each weak learner is at least better than random by gamma!
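A quick numeric sanity check (my own, not from the slides) of how fast the bound exp(-2 T gamma^2) shrinks for a modest edge:

import math

gamma = 0.1   # assume every weak learner beats a coin flip by an edge of 0.1
for T in (10, 50, 100, 500):
    print(f"T = {T:4d}   training-error bound exp(-2*T*gamma^2) = {math.exp(-2 * T * gamma**2):.4f}")
# e.g. for N = 303 examples (the Cleve dataset), the bound falls below 1/N once
# T > ln(303) / (2 * gamma**2) ~= 286, forcing zero training error.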


Curious phenomenon
Boosting decision trees

Using <10,000 training examples we fit >2,000,000 parameters


A larger VC dimension, we would think, and therefore overfitting. But we never overfit, because of the margin.


Explanation using margins

[Figure: 0-1 loss versus margin.]

Zero training error, but we have also maximized the margin: we keep doing additional iterations after reaching zero training error.


Explanation using margins

[Figure: 0-1 loss versus margin.]

No examples with small margins!! (All margins exceed the threshold theta.)


Experimental Evidence

(The plot shown is a CDF of the margins.)


AdaBoost Generalization
Also, a VC analysis gives a generalization bound:

R \le R_{emp} + O\Big( \sqrt{\frac{T d}{N}} \Big)   (where d is the VC dimension of the base classifier)

This is a generalization guarantee: N is the number of examples and T*d is my total VC dimension, so the true risk is bounded by the training risk plus a complexity term. But the VC dimension isn't giving us reassurance here: more iterations means overfitting!

A margin analysis is possible; redefine the margin as:

mar_f(x, y) = \frac{y \sum_t \alpha_t h_t(x)}{\sum_t \alpha_t}

Then we have

R \le \frac{1}{N} \sum_{i=1}^{N} \mathrm{step}\big(\theta - mar_f(x_i, y_i)\big) + O\Big( \sqrt{\frac{d}{N \theta^2}} \Big)

A margin analysis instead, with a slightly redefined margin: the numerator is the original margin, now normalized by the total weight, and then we have another guarantee. The first term looks like an empirical risk, but not quite: theta is a margin threshold (if I don't beat the margin threshold, I count it as an error). So boost forever and you can turn anything you want into a max-margin classifier (no need to stick in an SVM).


AdaBoost Generalization
Suggests this optimization problem:

[Figure: an optimization over the margin.]


AdaBoost Generalization
Proof Sketch

A stability-type proof: you can wiggle the margin within the support and still classify properly. The proof is not through VC arguments. VC tells you to stop boosting as soon as possible, but the margin analysis says keep boosting, which is what you see in practice.


UCI Results
% test error rates

Database    Other              Boosting   Error reduction
Cleveland   27.2 (DT)          16.5       39%
Promoters   22.0 (DT)          11.8       46%
Letter      13.8 (DT)          3.5        74%
Reuters 4   5.8, 6.0, 9.8      2.95       ~60%
Reuters 8   11.3, 12.1, 13.4   7.4        ~40%

Boosted cascade (Viola-Jones): it turns out that if you have a million classifiers, who wants to use a million? It's a pain; instead, use a few. The task is to classify an image patch as face / not face. Learn a stump that puts a bunch of strips on faces and compares the white-strip pixels against the black-strip pixels. It's a very weak classifier, so use a bunch of them. The problem is that a 24x24 pixel region gives about 160k candidate stumps (basically perceptrons) to train. We can do it, and we see the error rate dropping; eventually you have a bunch of boosted stumps. Next, scan an image to locate the face. The problem is that you scan very many regions, and each classification of a single region means querying a very large committee of stumps (the problem with boosting is that you get your answers from a large committee). The smart thing is to ask the first stump, and if it says not-a-face, you stop asking; continue asking until the first rejection. We know how to use the committee by only querying the first few members. What's nice is that it's much faster (even in 2003).
Boosted Bayesian networks, boosted whatever: there is a whole industry of consistently seeing improvements in performance by boosting anything.
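A minimal sketch of the early-rejection idea (illustrative, not the actual Viola-Jones implementation): query the stages in order and stop as soon as one says not-a-face:

import numpy as np

def cascade_classify(stages, patch):
    # stages: ordered list of (score_fn, threshold) pairs, cheapest / most
    # selective first. Early rejection: as soon as any stage's score falls
    # below its threshold, declare "not a face" and skip the remaining
    # (larger, more expensive) committees.
    for score_fn, threshold in stages:
        if score_fn(patch) < threshold:
            return False    # rejected early
    return True             # survived every stage: report a face

# Hypothetical stages for illustration only; real Viola-Jones stages are small
# boosted committees of rectangle (strip-difference) stumps.
stages = [
    (lambda p: p.mean(), 0.1),                        # crude brightness check
    (lambda p: p[8:16, :].mean() - p.mean(), -0.05),  # one strip-difference feature
]
patch = np.random.rand(24, 24)
print(cascade_classify(stages, patch))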