
https://www.learnbay.co/data-science-course/
Classifiers
x → f → y^est, where f(x,w,b) = sign(w·x − b)
(Figure: a scatter of datapoints; one marker denotes +1, the other denotes −1.)
How would you classify this data?

Support Vector Machines: Slide 2
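The decision rule on this slide, f(x,w,b) = sign(w·x − b), takes only a few lines to sketch in Python. The weight vector and bias below are illustrative assumptions, not values from the slides.

```python
import numpy as np

def linear_classify(x, w, b):
    """The slide's rule f(x, w, b) = sign(w . x - b); ties at 0 go to +1."""
    return 1 if float(np.dot(w, x)) - b >= 0.0 else -1

# Illustrative weights and bias (assumptions, not from the slides).
w = np.array([2.0, 1.0])
b = 1.0
print(linear_classify(np.array([1.0, 1.0]), w, b))    # w.x - b = 2.0  -> prints 1
print(linear_classify(np.array([-1.0, 0.0]), w, b))   # w.x - b = -3.0 -> prints -1
```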


Classifiers
x → f → y^est, where f(x,w,b) = sign(w·x − b)
How would you classify this data?
(Slides 3–5 repeat the same question; each figure draws a different candidate separating line.)

Support Vector Machines: Slides 3–5


Classifiers
x → f → y^est, where f(x,w,b) = sign(w·x − b)
Any of these would be fine…
…but which is best?

Support Vector Machines: Slide 6


Margin
x → f → y^est, where f(x,w,b) = sign(w·x − b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.

Support Vector Machines: Slide 7


Margin
x → f → y^est, where f(x,w,b) = sign(w·x − b)
The maximum margin linear classifier is the linear classifier with the, um, maximum margin.
This is the simplest kind of SVM (called an LSVM: Linear SVM).

Support Vector Machines: Slide 8
Margin
x → f → y^est, where f(x,w,b) = sign(w·x − b)
The maximum margin linear classifier is the linear classifier with the, um, maximum margin.
Support Vectors are those datapoints that the margin pushes up against.
This is the simplest kind of SVM (called an LSVM: Linear SVM).

Support Vector Machines: Slide 9
Why Maximum Margin?
1. Intuitively this feels safest.
2. If we've made a small error in the location of the boundary (it's been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification.
3. Easy, since the model is immune to removal of any non-support-vector datapoints.
4. There's some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
5. Empirically it works very very well.

Support Vector Machines: Slide 10
Specifying a line and margin
(Figure: the Plus-Plane, the Minus-Plane, and the classifier boundary between them.)

• How do we represent this mathematically?
• …in m input dimensions?

Support Vector Machines: Slide 11


Computing the margin width
M = Margin Width
How do we compute M in terms of w and b?
• Plus-plane = { x : w·x + b = +1 }
• Minus-plane = { x : w·x + b = −1 }
Claim: The vector w is perpendicular to the Plus-Plane. Why?

Support Vector Machines: Slide 12


Computing the margin width
M = Margin Width
How do we compute M in terms of w and b?
• Plus-plane = { x : w·x + b = +1 }
• Minus-plane = { x : w·x + b = −1 }
Claim: The vector w is perpendicular to the Plus-Plane.
Why? Let u and v be two vectors on the Plus-Plane. What is w·(u − v)? It is (w·u + b) − (w·v + b) = 1 − 1 = 0, so w is perpendicular to every direction inside the plane.
And so of course the vector w is also perpendicular to the Minus-Plane.

Support Vector Machines: Slide 13
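The perpendicularity claim and the margin width are easy to check numerically. In the sketch below, w and b are illustrative assumptions, and u and v are two points chosen by hand to lie on the plus-plane w·x + b = 1.

```python
import numpy as np

# Illustrative w and b (assumptions, not from the slides).
w = np.array([3.0, 4.0])
b = -2.0

# Two points on the plus-plane { x : w.x + b = +1 }.
u = np.array([1.0, 0.0])    # 3*1 + 4*0 - 2 = 1
v = np.array([-1.0, 1.5])   # 3*(-1) + 4*1.5 - 2 = 1

# u - v lies inside the plane, and w.(u - v) = (w.u + b) - (w.v + b) = 1 - 1 = 0,
# so w is perpendicular to the plus-plane.
print(np.isclose(np.dot(w, u - v), 0.0))   # True

# The margin width between the two planes is M = 2 / ||w||.
M = 2.0 / np.linalg.norm(w)
print(M)                                   # 0.4, since ||w|| = 5
```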
Classifier
M = Margin Width
Take any x⁻ on the Minus-plane and let x⁺ = x⁻ + λw be the closest point on the Plus-plane; then M = |x⁺ − x⁻| = 2 / √(w·w).

Given a guess of w and b we can
• Compute whether all data points are in the correct half-planes
• Compute the width of the margin

So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the datapoints. How?
Gradient descent? Simulated Annealing? Matrix Inversion? EM? Newton's Method?

Support Vector Machines: Slide 14
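One deliberately naive answer to "how?": just search. The sketch below brute-forces a grid of (w, b) pairs on an assumed toy 1-D dataset and keeps the separating pair with the widest geometric margin. Real SVM solvers use quadratic programming and the decomposition methods mentioned at the end of these slides; this is only the search idea made literal.

```python
# A deliberately naive "search the space of w's and b's" on an assumed toy
# 1-D dataset: class -1 at x <= 1, class +1 at x >= 3.
points = [(0.0, -1), (1.0, -1), (3.0, 1), (4.0, 1)]

def margin(w, b):
    """Geometric margin of w*x + b = 0 if it separates the data, else None."""
    if all(y * (w * x + b) > 0 for x, y in points):
        return min(abs(w * x + b) for x, _ in points) / abs(w)
    return None

# Brute-force grid search for the widest-margin separator.
best, best_m = None, -1.0
for wi in range(-20, 21):
    if wi == 0:
        continue
    w = wi / 10.0
    for bi in range(-100, 101):
        b = bi / 10.0
        m = margin(w, b)
        if m is not None and m > best_m:
            best, best_m = (w, b), m

print(best_m)  # ~1.0: the best boundary sits at x = 2, one unit from each class
```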


Non-linear SVMs: Feature spaces
General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:
Φ: x → φ(x)

Support Vector Machines: Slide 15
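The idea Φ: x → φ(x) shows up even on a tiny assumed example: 1-D points with +1 labels on the outside and −1 in the middle cannot be separated by any threshold, but the (illustrative) map x → (x, x²) makes them linearly separable in 2-D.

```python
# Assumed 1-D data: +1 on the outside, -1 in the middle (not from the slides).
data = [(-3.0, 1), (-2.5, 1), (-0.5, -1), (0.5, -1), (2.5, 1), (3.0, 1)]

def phi(x):
    """Illustrative feature map x -> (x, x^2)."""
    return (x, x * x)

# No single threshold on x classifies all points correctly...
separable_1d = any(
    all((1 if x > t else -1) == y for x, y in data) or
    all((1 if x < t else -1) == y for x, y in data)
    for t in [x for x, _ in data]
)
print(separable_1d)  # False

# ...but in the mapped space, the horizontal line x^2 = 2 separates the classes:
separable_2d = all((1 if phi(x)[1] > 2.0 else -1) == y for x, y in data)
print(separable_2d)  # True
```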


The “Kernel Trick”
The linear classifier relies on an inner product between vectors, K(xi,xj) = xiTxj.
If every datapoint is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes:
K(xi,xj) = φ(xi)Tφ(xj)
A kernel function is a function that is equivalent to an inner product in some feature space.

Example:
2-dimensional vectors x = [x1 x2]; let K(xi,xj) = (1 + xiTxj)².
Need to show that K(xi,xj) = φ(xi)Tφ(xj):
K(xi,xj) = (1 + xiTxj)²
= 1 + xi1²xj1² + 2xi1xj1xi2xj2 + xi2²xj2² + 2xi1xj1 + 2xi2xj2
= [1  xi1²  √2xi1xi2  xi2²  √2xi1  √2xi2]T [1  xj1²  √2xj1xj2  xj2²  √2xj1  √2xj2]
= φ(xi)Tφ(xj), where φ(x) = [1  x1²  √2x1x2  x2²  √2x1  √2x2]

Thus, a kernel function implicitly maps data to a high-dimensional space (without the need to compute each φ(x) explicitly).
Support Vector Machines: Slide 16
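The algebra on this slide is easy to spot-check numerically; x and z below are arbitrary assumed vectors.

```python
import math

# Check the slide's identity K(x, z) = (1 + x.z)^2 = phi(x).phi(z) numerically.
def K(x, z):
    return (1.0 + x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """The slide's phi(x) = [1, x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2]."""
    r2 = math.sqrt(2.0)
    return [1.0, x[0] ** 2, r2 * x[0] * x[1], x[1] ** 2, r2 * x[0], r2 * x[1]]

x, z = [1.0, 2.0], [3.0, -1.0]
lhs = K(x, z)
rhs = sum(a * b for a, b in zip(phi(x), phi(z)))
print(lhs, rhs)  # both ~4.0, since x.z = 1 and (1 + 1)^2 = 4
```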
What Functions are Kernels?
For some functions K(xi,xj), checking that K(xi,xj) = φ(xi)Tφ(xj) can be cumbersome.
Mercer's theorem: Every semi-positive definite symmetric function is a kernel.
Semi-positive definite symmetric functions correspond to a semi-positive definite symmetric Gram matrix:

      K(x1,x1)  K(x1,x2)  K(x1,x3)  …  K(x1,xn)
      K(x2,x1)  K(x2,x2)  K(x2,x3)  …  K(x2,xn)
K =   …         …         …         …  …
      K(xn,x1)  K(xn,x2)  K(xn,x3)  …  K(xn,xn)

Support Vector Machines: Slide 17
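Mercer's condition can be illustrated in miniature: build the Gram matrix for a known-valid kernel on a few assumed points and confirm it is symmetric with non-negative eigenvalues (up to rounding).

```python
import numpy as np

# A few assumed 2-D points (not from the slides).
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [-1.0, 3.0]])

def kernel(a, b):
    """Polynomial kernel from the earlier slide: (1 + a.b)^2."""
    return (1.0 + a @ b) ** 2

n = len(X)
K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

print(np.allclose(K, K.T))              # True: the Gram matrix is symmetric
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-9)           # True: all eigenvalues >= 0 (up to rounding)
```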


Examples of Kernel Functions
• Linear: K(xi,xj) = xiTxj. Mapping Φ: x → φ(x), where φ(x) is x itself.
• Polynomial of power p: K(xi,xj) = (1 + xiTxj)^p. Mapping Φ: x → φ(x), where φ(x) has C(d+p, p) dimensions.
• Gaussian (radial-basis function): K(xi,xj) = exp(−‖xi − xj‖² / (2σ²)). Mapping Φ: x → φ(x), where φ(x) is infinite-dimensional: every point is mapped to a function (a Gaussian); the combination of functions for the support vectors is the separator.
• The higher-dimensional space still has intrinsic dimensionality d (the mapping is not onto), but linear separators in it correspond to non-linear separators in the original space.

Support Vector Machines: Slide 18
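The three kernels on this slide, written out directly; the sample vectors and σ = 1 are assumptions for illustration.

```python
import math

# The three kernels from the slide, for plain Python list vectors.
def linear_k(x, z):
    return sum(a * b for a, b in zip(x, z))

def poly_k(x, z, p=2):
    return (1.0 + linear_k(x, z)) ** p

def rbf_k(x, z, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

x, z = [1.0, 0.0], [0.0, 1.0]
print(linear_k(x, z))        # 0.0
print(poly_k(x, z))          # 1.0
print(rbf_k(x, z))           # exp(-1) ~ 0.3679
```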
SVM applications
• SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained increasing popularity in the late 1990s.
• SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data.
• SVMs can be applied to complex data types beyond feature vectors (e.g. graphs, sequences, relational data) by designing kernel functions for such data.
• SVM techniques have been extended to a number of tasks such as regression [Vapnik et al. '97], principal component analysis [Schölkopf et al. '99], etc.
• Most popular optimization algorithms for SVMs use decomposition to hill-climb over a subset of the αi's at a time, e.g. SMO [Platt '99] and [Joachims '99].
• Tuning SVMs remains a black art: selecting a specific kernel and its parameters is usually done in a try-and-see manner.

Support Vector Machines: Slide 19