Hilary 2015
A. Zisserman

Wide margin
Cost function
Slack variables
Loss functions revisited
Optimization
Binary Classification

Given training data (x_i, y_i) for i = 1 ... N, with x_i ∈ R^d and y_i ∈ {−1, 1}, learn a classifier f(x) such that

    f(x_i) ≥ 0  if y_i = +1
    f(x_i) < 0  if y_i = −1
Linear separability

[Figure: two 2D point sets, one linearly separable and one not linearly separable]
Linear classifiers

A linear classifier has the form

    f(x) = w^T x + b

[Figure: 2D feature space (x_1, x_2) split by the line f(x) = 0 into the regions f(x) > 0 and f(x) < 0]
Linear classifiers (continued)

A linear classifier has the form f(x) = w^T x + b, where w is the normal to the hyperplane f(x) = 0 and b is the bias. The hyperplane separates the categories if sign(f(x_i)) = sign(w^T x_i + b) = y_i for i = 1, ..., N. How can we find this separating hyperplane?
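For instance (a tiny NumPy sketch with made-up w, b and test points, not taken from the lecture), the classifier simply thresholds w^T x + b at zero:

```python
import numpy as np

w = np.array([1.0, -2.0])   # hypothetical weight vector (normal to the hyperplane)
b = 0.5                     # hypothetical bias

def classify(x):
    """Linear classifier: predict +1 if f(x) = w^T x + b >= 0, else -1."""
    return 1 if w @ x + b >= 0 else -1

print(classify(np.array([2.0, 0.0])))   # f = 2.5  -> +1
print(classify(np.array([0.0, 2.0])))   # f = -3.5 -> -1
```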
The Perceptron algorithm

Initialize w = 0
Cycle through the data points {x_i, y_i}:
    if x_i is misclassified, update w ← w + α y_i x_i
Repeat until all the data points are correctly classified.

(Equivalently, w ← w − α sign(f(x_i)) x_i, since a misclassified point has sign(f(x_i)) = −y_i; the update moves w towards correctly classifying x_i.)

For example, in 2D:
[Figure: the weight vector w and its decision boundary in the (x_1, x_2) plane, before and after the update on a misclassified point x_i]

NB: after convergence w = Σ_i α_i x_i, i.e. the weight vector is a linear combination of the training points.
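A minimal NumPy sketch of this procedure (an illustration, not code from the lecture; the bias is assumed folded into w by appending a constant 1 to each x_i, and the learning rate and epoch cap are arbitrary):

```python
import numpy as np

def perceptron_train(X, y, alpha=1.0, max_epochs=100):
    """Perceptron training. X is N x d (with a constant-1 column appended
    so the bias is part of w); y has entries in {-1, +1}."""
    N, d = X.shape
    w = np.zeros(d)                      # initialize w = 0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:       # x_i is misclassified (or exactly on the boundary)
                w = w + alpha * yi * xi  # w <- w + alpha * y_i * x_i
                mistakes += 1
        if mistakes == 0:                # converged: every point is correctly classified
            break
    return w
```

For linearly separable data this terminates with a separating w; otherwise it simply stops after max_epochs passes.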
[Figure: Perceptron example — decision boundary learned by the perceptron on a 2D point set]
Wide margin classification

The classifier can be written as a linear combination of the training points:

    f(x) = Σ_i α_i y_i (x_i^T x) + b

where only the support vectors (the points closest to the decision boundary) contribute.

Margin width: for support vectors x_+ and x_− on either side of the boundary, w^T x_+ + b = 1 and w^T x_− + b = −1, so the distance between the margin hyperplanes is

    (w / ||w||)^T (x_+ − x_−) = w^T (x_+ − x_−) / ||w|| = 2 / ||w||

[Figure: the separating hyperplane w^T x + b = 0 with margin hyperplanes w^T x + b = 1 and w^T x + b = −1; the support vectors lie on the margins, which are 2/||w|| apart]
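As a quick check of the dual form (a sketch using scikit-learn rather than the lecture's MATLAB demo, on made-up 2D data): for a linear kernel, summing α_i y_i (x_i^T x) over the support vectors and adding b reproduces w^T x + b.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, well-separated 2D data.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 0.8, (30, 2)), rng.normal(2.0, 0.8, (30, 2))])
y = np.hstack([-np.ones(30), np.ones(30)])

clf = SVC(kernel='linear', C=1.0).fit(X, y)

# sklearn stores alpha_i * y_i (for the support vectors only) in dual_coef_.
x = np.array([0.5, -0.3])
f_dual = clf.dual_coef_[0] @ (clf.support_vectors_ @ x) + clf.intercept_[0]
f_primal = clf.decision_function(x[None, :])[0]          # w^T x + b
print(f_dual, f_primal)                                   # the two values agree
```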
SVM Optimization

Learning the SVM can be formulated as an optimization:

    max_w 2/||w||  subject to  w^T x_i + b ≥ 1  if y_i = +1
                               w^T x_i + b ≤ −1 if y_i = −1      for i = 1 ... N

Or equivalently

    min_w ||w||^2  subject to  y_i (w^T x_i + b) ≥ 1   for i = 1 ... N
In general there is a trade-off between the margin and the number of mistakes on the training data.

[Figure: soft margin — the hyperplanes w^T x + b = 1, 0, −1 with margin 2/||w||; slack variables ξ_i ≥ 0 measure how far points violate their margin (ξ_i = 0 for points on the correct side of their margin, and a misclassified point has ξ_i > 1)]
This gives the soft-margin optimization:

    min_{w ∈ R^d, ξ_i ∈ R^+}  ||w||^2 + C Σ_{i=1}^{N} ξ_i
    subject to  y_i (w^T x_i + b) ≥ 1 − ξ_i   for i = 1 ... N
[Figure: SVM decision boundaries on a 2D dataset (feature x vs feature y) for C = ∞ (hard margin) and C = 10 (soft margin)]
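A rough sketch of the same comparison (scikit-learn rather than the lecture's MATLAB demo; the 2D data is made up, and C = 10^6 stands in for C = ∞): a very large C approximates the hard margin, while a smaller C tolerates margin violations and typically gives a wider margin with more support vectors.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2D data (placeholder): two slightly overlapping Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.7, (50, 2)), rng.normal(1.0, 0.7, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

for C in [1e6, 10]:
    clf = SVC(kernel='linear', C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:g}: margin 2/||w|| = {2 / np.linalg.norm(w):.3f}, "
          f"support vectors = {len(clf.support_)}")
```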
Application: binary classification with HOG features

[Figure: the HOG descriptor — each cell's histogram of gradient orientations (frequency vs orientation) captures the dominant edge direction]

Algorithm

Training (Learning): represent each example window by a HOG feature vector x_i ∈ R^d, with d = 1024, and learn the model f(x) = w^T x + b.

Testing (Detection): apply the learned model f(x) = w^T x + b as a sliding-window classifier.
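A schematic sketch of the two stages (hypothetical throughout: the HOG extractor is stubbed out with a fixed-length placeholder, the windows and image are random arrays, and scikit-learn's LinearSVC stands in for whatever solver the lecture uses):

```python
import numpy as np
from sklearn.svm import LinearSVC

d = 1024  # HOG descriptor length used in the lecture

def hog_feature(window):
    """Stand-in for a real HOG extractor (e.g. skimage.feature.hog);
    here it only guarantees a d-dimensional vector of the right shape."""
    return np.resize(np.asarray(window, dtype=float).ravel(), d)

# --- Training (learning): one HOG vector per example window, labels in {-1, +1} ---
rng = np.random.default_rng(0)
windows = rng.random((200, 128, 64))            # placeholder positive/negative windows
labels = np.where(np.arange(200) < 100, 1, -1)
X = np.stack([hog_feature(w) for w in windows])
clf = LinearSVC(C=1.0).fit(X, labels)           # learns the w and b of f(x) = w^T x + b

# --- Testing (detection): sliding-window scoring of a larger image ---
image = rng.random((256, 256))
H, W, stride = 128, 64, 32
for r in range(0, image.shape[0] - H + 1, stride):
    for c in range(0, image.shape[1] - W + 1, stride):
        x = hog_feature(image[r:r + H, c:c + W])
        score = clf.decision_function(x[None, :])[0]   # f(x) = w^T x + b
        if score > 0:
            print("detection at", (r, c), "score", round(score, 2))
```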
Optimization

Learning an SVM has been formulated as a constrained optimization problem over w and ξ:

    min_{w ∈ R^d, ξ_i ∈ R^+}  ||w||^2 + C Σ_{i=1}^{N} ξ_i
    subject to  y_i (w^T x_i + b) ≥ 1 − ξ_i   for i = 1 ... N

The constraint y_i (w^T x_i + b) ≥ 1 − ξ_i can be written more concisely as

    y_i f(x_i) ≥ 1 − ξ_i

which, together with ξ_i ≥ 0, is equivalent to ξ_i = max(0, 1 − y_i f(x_i)). Hence the learning problem is equivalent to the unconstrained optimization over w:

    min_{w ∈ R^d}  ||w||^2  +  C Σ_{i=1}^{N} max(0, 1 − y_i f(x_i))
                (regularization)        (loss function)
Loss function

    min_{w ∈ R^d}  ||w||^2 + C Σ_{i=1}^{N} max(0, 1 − y_i f(x_i))

[Figure: the boundary w^T x + b = 0, its margins, and the support vectors, with the loss incurred by each point]

Points are in three categories:
    y_i f(x_i) > 1: the point is outside the margin and contributes nothing to the loss
    y_i f(x_i) = 1: the point is on the margin (a support vector, as in the hard-margin case) and contributes nothing to the loss
    y_i f(x_i) < 1: the point violates the margin constraint and contributes ξ_i = 1 − y_i f(x_i) to the loss
Loss functions

[Figure: loss plotted as a function of y_i f(x_i) for the 0-1 loss and the hinge loss]

The SVM uses the hinge loss max(0, 1 − y_i f(x_i)), an approximation to the 0-1 loss.
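A small numeric illustration of the two losses (NumPy, margin values chosen arbitrarily): the hinge loss is zero once the margin y_i f(x_i) reaches 1 and grows linearly as the margin shrinks, whereas the 0-1 loss only switches on when the point is actually misclassified.

```python
import numpy as np

def hinge_loss(y, f):
    """Hinge loss max(0, 1 - y*f(x)); y in {-1, +1}, f = w^T x + b."""
    return np.maximum(0.0, 1.0 - y * f)

def zero_one_loss(y, f):
    """0-1 loss: 1 if the sign of f disagrees with y, else 0."""
    return (np.sign(f) != y).astype(float)

yf = np.array([2.0, 1.0, 0.5, 0.0, -1.0])      # example margins y_i * f(x_i), with y_i = +1
print(hinge_loss(np.ones(5), yf))              # [0.  0.  0.5 1.  2. ]
print(zero_one_loss(np.ones(5), yf))           # [0.  0.  0.  1.  1. ]
```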
Optimization continued

    min_{w ∈ R^d}  C Σ_{i=1}^{N} max(0, 1 − y_i f(x_i)) + ||w||^2

[Figure: a cost surface with a local minimum and a global minimum]

If the cost function is convex, then a locally optimal point is globally optimal (provided the optimization is over a convex set, which it is in our case).
Convex functions

[Figure: examples of convex and non-convex functions]
SVM

    min_{w ∈ R^d}  C Σ_{i=1}^{N} max(0, 1 − y_i f(x_i)) + ||w||^2

The hinge loss and the quadratic regularizer are both convex, so this cost function is convex and gradient descent will reach the global minimum. To minimize C(w), use the iterative update

    w_{t+1} ← w_t − η_t ∇_w C(w_t)

where η is the learning rate.
First, rewrite the optimization problem as an average:

    min_w C(w) = λ/2 ||w||^2 + (1/N) Σ_{i=1}^{N} max(0, 1 − y_i f(x_i))
               = (1/N) Σ_{i=1}^{N} ( λ/2 ||w||^2 + max(0, 1 − y_i f(x_i)) )

with λ = 2/(NC) (up to an overall scale of the problem) and f(x_i) = w^T x_i + b.

The hinge loss L(x_i, y_i; w) = max(0, 1 − y_i f(x_i)) is not differentiable at y_i f(x_i) = 1, so a sub-gradient is used:

    ∂L/∂w = −y_i x_i   if y_i f(x_i) < 1
    ∂L/∂w = 0          otherwise

Gradient descent then uses the iterative update

    w_{t+1} ← w_t − η ∇_w C(w_t)
            = w_t − η (1/N) Σ_{i=1}^{N} ( λ w_t + ∂L(x_i, y_i; w_t)/∂w )

where η is the learning rate. Then each iteration t involves cycling through the training data with the single-example updates:

    w_{t+1} ← w_t − η_t ( λ w_t + ∂L(x_i, y_i; w_t)/∂w )

In the Pegasos algorithm the learning rate is set at η_t = 1/(λ t).
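A minimal Pegasos-style sketch of these updates (assumptions: the bias is folded into w by appending a constant 1 to each x_i, the optional projection step of the published algorithm is omitted, and λ and the epoch count are arbitrary):

```python
import numpy as np

def pegasos_train(X, y, lam=0.01, epochs=20, seed=0):
    """Stochastic sub-gradient descent for the linear SVM
    min_w  lam/2 ||w||^2 + 1/N sum_i max(0, 1 - y_i w^T x_i),
    with the Pegasos learning rate eta_t = 1/(lam * t)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(N):          # cycle through the data in random order
            t += 1
            eta = 1.0 / (lam * t)             # eta_t = 1/(lam * t)
            if y[i] * (w @ X[i]) < 1:         # hinge loss active: sub-gradient is lam*w - y_i x_i
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
            else:                             # sub-gradient is lam*w only
                w = (1 - eta * lam) * w
    return w
```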
[Figure: energy (log scale) vs. iteration number for the sub-gradient descent optimization]
    f(x) = Σ_i α_i y_i (x_i^T x) + b     (the sum runs over the support vectors)
On web page:
http://www.robots.ox.ac.uk/~az/lectures/ml
links to SVM tutorials and video lectures
MATLAB SVM demo