Tony Jebara, Advanced Machine Learning: Boosting methods (lecture notes, Aug 31, 2015)


Advanced Machine Learning & Perception

(Daniel Zhu's class is only on this.)

Boosting

Combining Multiple Classifiers:

- Voting
- Boosting
- AdaBoost

Freund and Schapire had the big breakthrough algorithm.

Have many simple learners, e.g. a decision stump: a weak learner that looks at one feature and is maybe a little better than random. These are also called base learners or weak learners, and have a classification error of < 0.5. Combine or vote them to get a higher accuracy. No free lunch: there is no guaranteed best approach here.

Different approaches:

- Voting: combine learners with fixed weights.
- Mixture of Experts: adjust the learners and a variable weight/gate fn. Instead of voting (a linear combination of decisions), you could do gating (trust doctor 1 on children; doctor 2 is best on elders). The big thing is that you have to make assumptions to do well.
- Boosting: actively search for the next base learners and vote. Not just voting on a fixed bunch of experts: you have billions or trillions of experts, all with weight zero, and you pick one and add it. (Q: will you be able to do gradient descent like with kernel learning with switches?)
- Cascading, Stacking, Bagging, etc.

A newer method is "regret minimization", related to multi-armed bandits (Daniel Zhu's course) and online learning. Face finding: the Viola-Jones boosted cascade catches faces without having to scan the entire image.

Voting

Have T classifiers h_t(x). Average their prediction with weights:

f(x) = \sum_{t=1}^{T} \alpha_t h_t(x)

[Figure: experts h_1, ..., h_T each see the input; their decisions are combined by a weighted vote into the output y.]

A slightly smarter thing to do: change alpha (the confidence in each classifier), i.e. change the allocation of alpha depending on the input (child vs. elder). Pretty much the same as a neural network (which is why the figures are the same): the same formula as before, but now alpha depends on x.
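As a minimal sketch of fixed-weight voting (the classifier interface and the toy thresholds below are invented for illustration, not from the lecture):

```python
def vote(classifiers, alphas, x):
    """Fixed-weight vote: f(x) = sum_t alpha_t * h_t(x); predict sign(f(x))."""
    f = sum(a * h(x) for a, h in zip(alphas, classifiers))
    return 1 if f >= 0 else -1

# Three toy weak learners on a scalar input (illustrative thresholds only):
h1 = lambda x: 1 if x > 0 else -1
h2 = lambda x: 1 if x > 1 else -1
h3 = lambda x: 1 if x > -1 else -1

print(vote([h1, h2, h3], [0.5, 0.3, 0.2], 0.5))  # h1 and h3 outvote h2: +1
```

Note the alphas are fixed up front; the next slide makes them depend on x.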

Mixture of Experts

Have T classifiers or experts h_t(x) and a gating fn \alpha_t(x). Average their prediction with variable weights, but adapt the parameters of the gating function and the experts (fixed total number T of experts):

f(x) = \sum_{t=1}^{T} \alpha_t(x) h_t(x) \quad \text{where} \quad \alpha_t(x) \geq 0 \ \text{and} \ \sum_{t=1}^{T} \alpha_t(x) = 1

[Figure: the input feeds both a gating network and the experts h_1, ..., h_T; the gating network weights each expert's output to form the final output.]
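A minimal sketch of the gating idea, assuming a softmax gating network over linear scores (the parameterization and the "two doctors" setup are assumptions for illustration):

```python
import math

def gate(x, params):
    """Softmax gating network alpha_t(x): nonnegative weights summing to 1.
    `params` (one (w, b) score per expert) is a hypothetical parameterization."""
    scores = [w * x + b for (w, b) in params]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def mixture_predict(x, experts, params):
    """f(x) = sum_t alpha_t(x) * h_t(x); the weights vary with the input."""
    alphas = gate(x, params)
    return sum(a * h(x) for a, h in zip(alphas, experts))

# Two "doctors": expert 0 is trusted on negative x, expert 1 on positive x.
experts = [lambda x: 1.0, lambda x: -1.0]
params = [(-2.0, 0.0), (2.0, 0.0)]
print(mixture_predict(-3.0, experts, params))  # close to +1: gate trusts expert 0
```

In a real mixture of experts both the gate parameters and the experts would be trained jointly; here they are fixed just to show the variable-weight combination.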

We are searching over a massive pool of learners. There is no gating function here; find someone who's good at what you're not good at. Find the next learner by looking at which examples we messed up on, and pick and train an expert for previously mistaken labels. The idea is that eventually this will converge: greedily adding the next best expert (there is some nice theory that you'll find the equivalent of the best possible large-margin solution, which lets you throw in anything as a weak learner).

Boosting

Actively find complementary or synergistic weak learners

Train next learner based on mistakes of previous ones.

Average prediction with fixed weights

Find next learner by training on weighted versions of data.

The setup:

- training data: (x_1, y_1), ..., (x_N, y_N)
- weights: {w_1, ..., w_N}
- weak learner h_t(x) with weighted error

  \epsilon_t = \frac{\sum_{i=1}^{N} w_i \, \mathrm{step}(-h_t(x_i) y_i)}{\sum_{i=1}^{N} w_i} < \frac{1}{2}

- ensemble of learners: (\alpha_1, h_1), ..., (\alpha_T, h_T)
- prediction: f(x) = \sum_{t=1}^{T} \alpha_t h_t(x)

SVMs as weak learners? A few support vectors have large Lagrange multipliers; the rest do not.

The error measures how well the weak learner performs; it beats random by some non-negative gamma. It is an average loss, an empirical risk: a correct prediction makes the step function's argument negative, and the step function maps it to zero; otherwise you pay 1 dollar. An error just below 1/2 is only a little bit better than random guessing; 1/2 itself is not even a learner.

We have a lot of data. All examples initially have equal weights. After one round, we will weight down the correctly classified examples and increase the weights on the incorrect ones.

LogitBoost, BrownBoost. AdaBoost focuses on the margin. The margin is a quantity of risk: how much the committee is voting for the answer.

We want to minimize the error, i.e. use a classifier h that minimizes the error. Instead of working with ugly step functions, let's work with an upper bound on the step function and minimize that. That is what AdaBoost adds (the smoothing). You can imagine a hinge loss like the SVM's; there are other boosting methods that introduce other ways to smooth the step function. Another benefit here: each exponential is convex, so the sum of exponentials is also convex.
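The claim that the exponential upper-bounds the step function (with equality at the origin) can be checked directly; the margin values below are arbitrary test points:

```python
import math

def step(z):
    """0-1 loss building block: 1 if z >= 0 else 0 (pay 1 dollar on a mistake)."""
    return 1.0 if z >= 0 else 0.0

# For any margin m = y*f(x), exp(-m) upper-bounds the 0-1 loss step(-m),
# with equality at m = 0 where both equal 1.
for m in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    assert step(-m) <= math.exp(-m)
assert step(0.0) == math.exp(0.0) == 1.0
print("exp(-margin) upper-bounds the 0-1 loss at every tested margin")
```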

AdaBoost

The most popular weighting scheme. Define the margin for point i as

y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)

We want the margin to have the correct sign and be large. Find an h_t and a weight \alpha_t to minimize the cost function, the sum of exponentiated negative margins:

\sum_{i=1}^{N} \exp\left(-y_i \sum_{t} \alpha_t h_t(x_i)\right)


Q: are you still getting a linear classifier (like a 1-layer neural network), at least for each dimension that you work with?

Q: is each weak learner always just a binary classifier?

AdaBoost

Choose the base learner h and weight \alpha_t to minimize

\min_{\alpha, h} \sum_{i=1}^{N} w_i^t \exp(-y_i \alpha_t h_t(x_i))

Recall the error of the base classifier h_t must be

\epsilon_t = \frac{\sum_{i=1}^{N} w_i \, \mathrm{step}(-h_t(x_i) y_i)}{\sum_{i=1}^{N} w_i} < \frac{1}{2}

For binary h_t this gives the closed form (instead of the more general rule)

\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}

(missing w here?) The weights for the next round are (here Z_t is the normalizer so they sum to 1):

w_i^{t+1} = \frac{w_i^t \exp(-\alpha_t y_i h_t(x_i))}{Z_t}
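A minimal sketch of these update rules, assuming ±1 labels and a finite pool of candidate weak learners to search over (the pool-based interface is an assumption for illustration, not the lecture's prescription):

```python
import math

def adaboost(X, y, pool, T):
    """AdaBoost sketch: weighted error, alpha_t = 0.5*ln((1-eps)/eps),
    and the update w_i <- w_i * exp(-alpha_t * y_i * h_t(x_i)) / Z_t.
    `pool` is a hypothetical finite set of candidate learners h(x) -> +/-1."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(T):
        # weak-learner step: pick the pool member with smallest weighted error
        h, eps = min(
            ((h, sum(wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi))
             for h in pool),
            key=lambda pair: pair[1])
        if eps >= 0.5:       # not better than random: stop boosting
            break
        if eps == 0.0:       # perfect weak learner: any positive weight works
            ensemble.append((1.0, h))
            break
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, h))
        # reweight the data, then normalize by Z_t so the weights sum to 1
        w = [wi * math.exp(-alpha * yi * h(xi)) for wi, xi, yi in zip(w, X, y)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def predict(ensemble, x):
    return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

# A single stump already separates this toy set, so boosting stops after one round:
stump = lambda x: 1 if x > 0 else -1
ens = adaboost([-2.0, -1.0, 1.0, 2.0], [-1, -1, 1, 1], [stump], T=10)
print([predict(ens, x) for x in [-2.0, -1.0, 1.0, 2.0]])  # matches the labels
```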

The weak learner called a stump is a classifier that cuts along one feature only: the simplest possible classifier, axis-aligned cuts.

Q: Does it matter what order the stumps are in? I think so... so you don't have a deterministic way to create a decision tree?
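A weighted stump fit can be sketched as an exhaustive search over features, midpoint thresholds, and polarities (this interface is invented for illustration):

```python
def fit_stump(X, y, w):
    """Pick the feature, threshold, and polarity minimizing weighted error.
    X: list of feature vectors, y: +/-1 labels, w: example weights."""
    n_features = len(X[0])
    best = None  # (error, feature, threshold, polarity)
    for j in range(n_features):
        values = sorted({x[j] for x in X})
        # candidate cuts: midpoints between consecutive observed values
        cuts = [(a + b) / 2 for a, b in zip(values, values[1:])] or [values[0]]
        for t in cuts:
            for polarity in (1, -1):
                err = sum(
                    wi for wi, xi, yi in zip(w, X, y)
                    if (polarity if xi[j] > t else -polarity) != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, polarity)
    err, j, t, polarity = best
    predict = lambda x: polarity if x[j] > t else -polarity
    return predict, err

h, err = fit_stump([[0.0, 5.0], [1.0, 5.0], [2.0, 5.0], [3.0, 5.0]],
                   [-1, -1, 1, 1], [0.25, 0.25, 0.25, 0.25])
print(err)  # the cut on feature 0 separates the toy data with zero error
```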

Decision Trees

[Figure: a decision tree over features X and Y. The root splits on X > 3; one branch then splits on Y > 5; leaves assign +1 or -1 to each region of the plane.]

[Figure: the same data fit by boosted stumps, drawn as an alternating decision tree. The root does no cutting; it just gives -0.2 to every datapoint. Each added stump (X > 3, Y > 5, Y < 1) contributes scores such as +0.2, -0.1, +0.1, -0.3, +0.7; each is adding to the previous -0.2 weight, not replacing it, and the final prediction is the sign of the accumulated total.]

Cleve dataset from the UC Irvine database. Heart disease diagnostics (+1 = healthy, -1 = sick). 13 features from tests (real-valued and discrete). 303 instances.

Ad-Tree Example

Cross-validated accuracy (test error, variance):

Learning algorithm | Number of splits | Test error | Variance
ADtree             |                  | 17.0%      | 0.6%
C5.0               | 27               | 27.2%      | 0.5%
C5.0 + boosting    | 446              | 20.2%      | 0.5%
Boost Stumps       | 16               | 16.5%      | 0.8%

Applying stumps: just apply all stumps to all datapoints.

From the previous slide: boosting is helping. Why? We are bounding the error rate. The step function is always less than the exponential (but equal at the origin).

AdaBoost Convergence

Rationale? Consider a bound on the training error: the 0-1 loss is upper-bounded by the exponential,

\mathrm{step}(-y f(x)) \leq \exp(-y f(x))

(Variants such as LogitBoost and BrownBoost smooth the step function differently.)

R_{emp} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{step}(-y_i f(x_i)) \leq \frac{1}{N} \sum_{i=1}^{N} \exp(-y_i f(x_i)) = \frac{1}{N} \sum_{i=1}^{N} \exp\left(-y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)\right) = \prod_{t=1}^{T} Z_t

[Figure: loss versus margin. Mistakes sit at negative margin, correct predictions at positive margin; the exponentiated margin loss lies above the 0-1 loss, so there is a penalty I'm paying even if I get a correct classification.]

The second-to-last step is the definition of f(x); the last step is the recursive use of Z. Z_t was the normalizer at a particular time; this is the product of all Z_t's ever.

Convergence? Greedy, one-at-a-time decrease.

AdaBoost Convergence

Convergence? Consider the binary h_t case, with \alpha_t = \frac{1}{2} \ln \frac{1-\epsilon_t}{\epsilon_t}:

\prod_{t=1}^{T} Z_t
= \prod_{t=1}^{T} \sum_{i=1}^{N} w_i^t \exp\left(-\tfrac{1}{2} y_i h_t(x_i) \ln \tfrac{1-\epsilon_t}{\epsilon_t}\right)
= \prod_{t=1}^{T} \sum_{i=1}^{N} w_i^t \left(\tfrac{\epsilon_t}{1-\epsilon_t}\right)^{\tfrac{1}{2} y_i h_t(x_i)}
= \prod_{t=1}^{T} \left( \sum_{\mathrm{correct}} w_i^t \sqrt{\tfrac{\epsilon_t}{1-\epsilon_t}} + \sum_{\mathrm{incorrect}} w_i^t \sqrt{\tfrac{1-\epsilon_t}{\epsilon_t}} \right)
= \prod_{t=1}^{T} 2 \sqrt{\epsilon_t (1 - \epsilon_t)}
= \prod_{t=1}^{T} \sqrt{1 - 4 \gamma_t^2}
\leq \exp\left(-2 \sum_{t=1}^{T} \gamma_t^2\right)

using \epsilon_t = \frac{1}{2} - \gamma_t, since the weighted errors are less than 1/2.

T increases as I add learners; as T increases with iterations, we get exponentially less error, as long as we guarantee that each weak learner is better than random (has an error rate gamma better than random). Each weak learner is at least gamma better than random!
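The chain of equalities and the final inequality can be sanity-checked numerically; the per-round edges below are made-up values for illustration:

```python
import math

gammas = [0.1, 0.05, 0.2, 0.15]  # hypothetical per-round edges over random
product = 1.0
for g in gammas:
    eps = 0.5 - g
    zt = 2 * math.sqrt(eps * (1 - eps))  # Z_t under the optimal alpha_t
    # identity: 2*sqrt(eps*(1-eps)) = sqrt(1 - 4*gamma^2)
    assert abs(zt - math.sqrt(1 - 4 * g * g)) < 1e-12
    product *= zt
# the product of Z_t's (which bounds the training error) shrinks exponentially
assert product <= math.exp(-2 * sum(g * g for g in gammas))
print(product)
```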

Curious Phenomenon

Boosting decision trees: a larger VC dimension, we would think, and overfitting. But we never overfit, because of the margin.

[Figure: 0-1 loss versus margin. We reach 0 training error but have also maximized the margin; we do additional iterations after getting 0 training error, until there are no examples with small margins (below some margin theta)!]

Experimental Evidence

[Figure: the distribution of training-set margins; this is a CDF.]

AdaBoost Generalization

Also, a VC analysis gives a generalization bound (where d is the VC dimension of the base classifier):

R \leq R_{emp} + O\left(\sqrt{\tfrac{Td}{N}}\right)

Td is my total VC dimension over N examples; since T grows with iterations, the VC dimension isn't giving us reassurance that the true risk will be bounded.

A margin analysis is possible; redefine the margin as

\mathrm{mar}_f(x, y) = \frac{y \sum_t \alpha_t h_t(x)}{\sum_t \alpha_t}

The numerator is the original margin; now normalize, and then we have another guarantee:

R \leq \frac{1}{N} \sum_{i=1}^{N} \mathrm{step}\left(\theta - \mathrm{mar}_f(x_i, y_i)\right) + O\left(\sqrt{\tfrac{d}{N \theta^2}}\right)

The first term looks like an empirical risk, but not quite: theta is a margin threshold (if I don't beat some margin threshold, then I'm counted as an error).

Boost forever; turn anything you want into a max-margin classifier (so don't stick in an SVM).

AdaBoost Generalization

Suggests this optimization problem: [Figure: maximize the margin directly.]

Proof Sketch

A stability-type proof: you can wiggle the margin within the support and still properly classify. The proof is not through VC arguments. VC tells you to stop boosting as soon as possible, but the margin analysis says keep boosting, which is the thing you see in practice.

UCI Results

% test error rates:

Database   | Other     | Boosting | Error reduction
Cleveland  | 27.2 (DT) | 16.5     | 39%
Promoters  | 22.0 (DT) | 11.8     | 46%
Letter     | 13.8 (DT) | 3.5      | 74%
Reuters 4  |           | 2.95     | ~60%
           |           | 7.4      | ~40%

Boosted cascade: it turns out that if you have a million classifiers, who wants to use a million? It's a pain; instead, use a few. Classify an image patch as face/not-face. Learn a stump that puts a bunch of strips on faces and adds white-strip pixels with black-strip pixels. A very weak classifier, so let's do a bunch of these. Problem: a 24x24 pixel region means 160k stumps to train (perceptrons, basically). We can do it, and we see the error rate dropping. Eventually you have a bunch of boosted stumps. Next, scan an image and locate the face. The problem will be scanning very many regions in the image, and each classification in a single region means you use a very large committee of stumps (the problem with boosting is that you have a large committee that you get answers from). The smart thing is to ask the first stump, and if he says not-a-face, then you stop asking. Continue asking until the first rejection. We know how to use the committee: only query the first guys. What's nice is that it's much faster (even in 2003).
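The early-exit idea can be sketched as a chain of cheap stages where a patch is rejected at the first stage that says not-a-face; the stages below are hypothetical placeholders, not trained Haar-feature classifiers:

```python
def cascade_classify(patch, stages):
    """Viola-Jones-style early exit: ask stages in order; stop at first rejection.
    `stages` is a list of functions patch -> bool (True = might be a face)."""
    for asks, stage in enumerate(stages, start=1):
        if not stage(patch):
            return False, asks   # rejected: most patches exit here, cheaply
    return True, len(stages)     # survived every stage: report a face

# Toy stages keyed on a fake scalar "feature" of the patch:
stages = [lambda p: p > 0.2, lambda p: p > 0.5, lambda p: p > 0.8]
print(cascade_classify(0.1, stages))  # (False, 1): rejected after one question
print(cascade_classify(0.9, stages))  # (True, 3): passed all stages
```

The speedup comes from the first return path: the vast majority of scanned regions are not faces and only ever pay for the first stage.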

Boosted Bayesian networks, boosted whatever: there's an industry of consistently seeing improvement in performance with anything.
