What is a Hyperplane?
• A hyperplane in p dimensions is a flat affine subspace of dimension p − 1. In general its equation has the form
β0 + β1 X1 + β2 X2 + . . . + βp Xp = 0
• In p = 2 dimensions a hyperplane is a line.
• If β0 = 0, the hyperplane goes through the origin; otherwise not.
• The vector β = (β1, β2, . . . , βp) is called the normal vector — it points in a direction orthogonal to the surface of a hyperplane.
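As a small illustration (not from the slides; the numbers anticipate the 2D example on the next slide), the sign of β0 + β·x tells us which side of the hyperplane a point x falls on:

```python
# Minimal sketch: using the hyperplane equation to decide which side of the
# hyperplane a point falls on (beta0 and beta are taken from the 2D example).
import numpy as np

beta0 = -6.0                       # intercept
beta = np.array([0.8, 0.6])        # normal vector (beta1, beta2)

def side(x):
    """Sign of beta0 + beta . x: >0 on one side, <0 on the other, 0 on the hyperplane."""
    return np.sign(beta0 + beta @ x)

print(side(np.array([10.0, 10.0])))   #  1.0, above the line 0.8 X1 + 0.6 X2 - 6 = 0
print(side(np.array([1.0, 1.0])))     # -1.0, below it
```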
Hyperplane in 2 Dimensions
[Figure: a hyperplane in two dimensions with β1 = 0.8 and β2 = 0.6. The line β1 X1 + β2 X2 − 6 = 0 is drawn, along with the regions where β1 X1 + β2 X2 − 6 = 1.6 and where β1 X1 + β2 X2 − 6 = −4; the normal vector β = (β1, β2) points orthogonally to the line.]
Separating Hyperplanes
[Figure: two panels of two-class data in (X1, X2), each showing hyperplanes that separate the two classes.]
The maximal margin classifier chooses the separating hyperplane for which the margin (the smallest distance from the observations to the hyperplane) is largest. It solves the constrained optimization problem

maximize M over β0, β1, . . . , βp
subject to Σ_{j=1}^p βj² = 1,
yi (β0 + β1 xi1 + . . . + βp xip) ≥ M for all i = 1, . . . , n.

[Figure: two-class data in (X1, X2) with the maximal margin hyperplane and the margin on either side of the decision boundary.]
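A rough sketch of this idea in scikit-learn (my own illustration, not part of the slides): with a very large penalty parameter C, a linear SVC allows essentially no margin violations and so approximates the maximal margin classifier on separable data. Note that scikit-learn fixes the margin constraint at 1 instead of normalizing ‖β‖ = 1, so the margin width comes out as 2/‖w‖.

```python
# Minimal sketch on synthetic, linearly separable data; a huge C approximates
# the hard-margin (maximal margin) classifier.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2.0, 2.0], 0.3, size=(20, 2)),
               rng.normal([0.0, 0.0], 0.3, size=(20, 2))])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)     # huge C: essentially no slack
w, b = clf.coef_[0], clf.intercept_[0]
print("normal vector:", w, " intercept:", b)
print("margin width 2/||w||:", 2 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```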
Noisy Data
[Figure: two panels of separable but noisy two-class data in (X1, X2) with their maximal margin hyperplanes.]
Sometimes the data are separable, but noisy. This can lead to a
poor solution for the maximal-margin classifier.
Support Vector Classifier
[Figure: two panels of numbered observations in (X1, X2) with a support vector classifier fit; some observations lie on the wrong side of the margin, and in the right panel some also lie on the wrong side of the hyperplane.]
maximize M over β0, β1, . . . , βp, ε1, . . . , εn
subject to Σ_{j=1}^p βj² = 1,
yi (β0 + β1 xi1 + β2 xi2 + . . . + βp xip) ≥ M (1 − εi),
εi ≥ 0,  Σ_{i=1}^n εi ≤ C.
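As an aside (my sketch, not from the slides): scikit-learn fits the equivalent formulation in which the constraint reads yi f(xi) ≥ 1 − εi and C penalizes the total slack, so its C is not the budget C above. After fitting, the slack of each observation can be read off from the decision function:

```python
# Minimal sketch on synthetic stand-in data: recover the slacks
# eps_i = max(0, 1 - y_i f(x_i)) of a fitted soft-margin linear classifier.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, cluster_std=1.8, random_state=0)
y = 2 * y - 1                                   # recode labels to -1 / +1

clf = SVC(kernel="linear", C=1.0).fit(X, y)
slack = np.maximum(0.0, 1.0 - y * clf.decision_function(X))
print("observations violating the margin:", int((slack > 1e-8).sum()))
print("observations on the wrong side of the hyperplane:", int((slack > 1.0).sum()))
```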
C is a regularization parameter
[Figure: four panels showing the support vector classifier fit to the same two-class data in (X1, X2) for four different values of the regularization parameter C; the width of the margin changes with C.]
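To see the effect concretely, here is a small sketch (not from the slides). One caveat: scikit-learn's C multiplies the margin violations in its objective, so it plays roughly the inverse role of the budget C in the formulation above; a small scikit-learn C corresponds to a large budget.

```python
# Minimal sketch: the number of support vectors typically shrinks as the
# violation penalty C grows (i.e. as the budget in the slide's notation shrinks).
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=1)
for c in [0.01, 0.1, 1.0, 10.0]:
    clf = SVC(kernel="linear", C=c).fit(X, y)
    print(f"C = {c:>5}: {len(clf.support_)} support vectors")
```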
Linear boundary can fail
Sometimes a linear boundary simply won't work, no matter what value of C.

[Figure: two-class data in (X1, X2) for which no linear boundary separates the classes well.]

What to do?
Feature Expansion
Enlarge the feature space by including transformations of the predictors, for example quadratic terms, and fit a support vector classifier in the enlarged space. The resulting decision boundary in the original space is nonlinear:

β0 + β1 X1 + β2 X2 + β3 X1² + β4 X2² + β5 X1 X2 = 0
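A minimal sketch of this idea (my illustration, assuming scikit-learn): expand the features explicitly, then fit a linear support vector classifier in the enlarged space; the boundary is linear there but quadratic in the original (X1, X2).

```python
# Minimal sketch: explicit degree-2 feature expansion followed by a linear
# support vector classifier, giving a quadratic boundary in (X1, X2).
from sklearn.datasets import make_circles
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),  # X1, X2, X1^2, X1*X2, X2^2
    StandardScaler(),
    LinearSVC(C=1.0),
)
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```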
Cubic Polynomials
Here we use a basis expansion of cubic polynomials: from 2 variables to 9. The support-vector classifier in the enlarged space solves the problem in the lower-dimensional space.

[Figure: two-class data in (X1, X2) with the resulting nonlinear decision boundary.]

β0 + β1 X1 + β2 X2 + β3 X1² + β4 X2² + β5 X1 X2 + β6 X1³ + β7 X2³ + β8 X1 X2² + β9 X1² X2 = 0
Nonlinearities and Kernels
Inner products and support vectors
⟨xi, xi′⟩ = Σ_{j=1}^p xij xi′j — inner product between vectors

The linear support vector classifier can be represented as f(x) = β0 + Σ_{i=1}^n αi ⟨x, xi⟩, and to estimate the n parameters α1, . . . , αn and β0 we only need the inner products ⟨xi, xi′⟩ between all pairs of training observations.

It turns out that most of the α̂i can be zero:

f(x) = β0 + Σ_{i∈S} α̂i ⟨x, xi⟩

S is the support set of indices i such that α̂i > 0. [see slide 8]
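As a sketch of how this looks in practice (not from the slides): a fitted scikit-learn SVC exposes exactly these ingredients. Its dual_coef_ stores yi · α̂i for the support vectors (so the signs differ from the slide's convention), support_vectors_ holds the xi with i ∈ S, and intercept_ is β0.

```python
# Minimal sketch: rebuild the decision function f(x) = beta0 + sum_{i in S}
# alpha_hat_i <x, x_i> from the attributes of a fitted linear SVC.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=80, centers=2, random_state=2)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

x_new = X[:5]
f_manual = clf.intercept_ + x_new @ clf.support_vectors_.T @ clf.dual_coef_.ravel()
print(np.allclose(f_manual, clf.decision_function(x_new)))   # True
```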
Kernels and Support Vector Machines
• If we can compute inner products between observations, we can fit an SV classifier. Can be quite abstract!
• Some special kernel functions can do this for us. E.g. the polynomial kernel of degree d,

K(xi, xi′) = (1 + Σ_{j=1}^p xij xi′j)^d,

computes the inner products needed for a polynomial basis expansion of degree d.
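For instance (a sketch, assuming scikit-learn): SVC(kernel="poly") with gamma=1 and coef0=1 computes exactly this kernel, (1 + ⟨x, x′⟩)^d.

```python
# Minimal sketch: an SVM with the degree-3 polynomial kernel (1 + <x, x'>)^3.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)
svm_poly = SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0, C=1.0).fit(X, y)
print("training accuracy:", svm_poly.score(X, y))
```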
Radial Kernel
K(xi, xi′) = exp(−γ Σ_{j=1}^p (xij − xi′j)²)

f(x) = β0 + Σ_{i∈S} α̂i K(x, xi)
Controls variance by squashing down most dimensions of the (implicit, very high-dimensional) feature space.

[Figure: two-class data in (X1, X2) with the nonlinear decision boundary of an SVM fit with the radial kernel.]
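A minimal sketch (assuming scikit-learn): kernel="rbf" implements exactly this kernel, and its gamma parameter is the γ above; larger γ gives a more flexible, higher-variance fit.

```python
# Minimal sketch: radial-kernel SVMs of increasing flexibility (gamma).
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
for gamma in [0.1, 1.0, 10.0]:
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)
    print(f"gamma = {gamma:>4}: training accuracy {clf.score(X, y):.3f}")
```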
Example: Heart Data
[Figure: ROC curves (true positive rate versus false positive rate) on the Heart training data. Left panel: support vector classifier versus LDA. Right panel: support vector classifier versus SVMs with radial kernel and γ = 10⁻³, 10⁻², 10⁻¹.]
Example continued: Heart Test Data
[Figure: the same comparison on the Heart test data: support vector classifier versus LDA (left) and versus radial-kernel SVMs with γ = 10⁻³, 10⁻², 10⁻¹ (right).]
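A sketch of how such a comparison could be produced (my illustration; a synthetic two-class dataset stands in for the Heart data, which is not loaded here):

```python
# Minimal sketch: compare test-set ROC performance (summarized as AUC) of a
# linear SV classifier, radial-kernel SVMs, and LDA on stand-in data.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Support Vector Classifier": SVC(kernel="linear"),
    "SVM: gamma=1e-3": SVC(kernel="rbf", gamma=1e-3),
    "SVM: gamma=1e-2": SVC(kernel="rbf", gamma=1e-2),
    "SVM: gamma=1e-1": SVC(kernel="rbf", gamma=1e-1),
    "LDA": LinearDiscriminantAnalysis(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores = model.decision_function(X_test)   # real-valued scores for the ROC
    print(f"{name}: test AUC = {roc_auc_score(y_test, scores):.3f}")
```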
SVMs: more than 2 classes?
The SVM as defined handles two classes. With K > 2 classes, the usual approaches are one-versus-all (fit K two-class SVMs and classify to the class whose fk(x) is largest) and one-versus-one (fit all pairwise SVMs and classify by majority vote).
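As a sketch (not from the slides): scikit-learn's SVC accepts more than two classes directly, fitting one-versus-one classifiers internally.

```python
# Minimal sketch: a three-class SVM fit in one call.
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)               # 3 classes
clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)
print(clf.predict(X[:5]), "training accuracy:", clf.score(X, y))
```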
Support Vector versus Logistic Regression?
With f(X) = β0 + β1 X1 + . . . + βp Xp, we can rephrase the support-vector classifier optimization as

minimize over β0, β1, . . . , βp:  Σ_{i=1}^n max[0, 1 − yi f(xi)] + λ Σ_{j=1}^p βj²
This has the form loss plus penalty. The loss is known as the hinge loss.

[Figure: the SVM loss (hinge loss) and the logistic regression loss, plotted as functions of yi f(xi); the two are very similar.]
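A small numerical sketch of the two losses (my illustration):

```python
# Minimal sketch: hinge loss max(0, 1 - y*f) versus the logistic regression
# loss log(1 + exp(-y*f)), both as functions of the margin y*f.
import numpy as np

yf = np.linspace(-6, 2, 9)                  # values of y_i * f(x_i)
hinge = np.maximum(0.0, 1.0 - yf)           # SVM (hinge) loss
logistic = np.log1p(np.exp(-yf))            # logistic regression loss
for m, h, l in zip(yf, hinge, logistic):
    print(f"y*f = {m:5.1f}   hinge = {h:5.2f}   logistic = {l:5.2f}")
```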