SVMs
• Geometric
– Maximizing Margin
• Kernel Methods
– Making nonlinear decision boundaries linear
– Efficiently!
• Capacity
– Structural Risk Minimization
SVM History
• SVM is a classifier derived from statistical learning theory
by Vapnik and Chervonenkis
• SVM was first introduced by Boser, Guyon and Vapnik in
COLT-92
• SVM became famous when, using pixel maps as input, it
gave accuracy comparable to NNs with hand-designed
features in a handwriting recognition task
• SVM is closely related to:
– Kernel machines (a generalization of SVMs), large margin
classifiers, reproducing kernel Hilbert space, Gaussian process,
Boosting
Linear Classifiers
f(x, w, b) = sign(w . x - b), mapping an input x to an estimated label y.
(Figure: linearly separable data; one symbol denotes +1, the other denotes -1, with several candidate separating lines drawn.)
Any of these would be fine.. ..but which is best?
Classifier Margin
f(x, w, b) = sign(w . x - b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
(Figure: the separating line with the margin band drawn around it; one symbol denotes +1, the other denotes -1.)
Maximum Margin
f(x, w, b) = sign(w . x - b)
The maximum margin linear classifier is the linear classifier with the maximum margin.
This is the simplest kind of SVM (called a Linear SVM, or LSVM).
(Figure: the maximum-margin separating line between the +1 and -1 points.)
Maximum Margin
f(x, w, b) = sign(w . x - b)
The maximum margin linear classifier is the linear classifier with the, um, maximum margin.
This is the simplest kind of SVM (called a Linear SVM, or LSVM).
Support Vectors are those datapoints that the margin pushes up against.
(Figure: the maximum-margin separator with the support vectors highlighted.)
Why Maximum Margin?
f(x, w, b) = sign(w . x - b); Support Vectors are those datapoints that the margin pushes up against.
1. Intuitively this feels safest.
2. If we’ve made a small error in the location of the boundary (it’s been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification.
3. There’s some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
4. Empirically it works very, very well.
A “Good” Separator
(Figure: X and O data points separated by a candidate boundary.)
Noise in the Observations
(Figure: the same X and O data, allowing for noise in the observations.)
Ruling Out Some Separators
(Figure: the same data; some candidate separators are ruled out.)
Lots of Noise
(Figure: the same data with lots of noise.)
Maximizing the Margin
(Figure: the same data with the maximum-margin separator.)
Specifying a line and margin
(Figure: the Plus-Plane, the Classifier Boundary, and the Minus-Plane drawn as three parallel lines.)
• Plus-plane  = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
Classify as..  +1 if w . x + b >= 1
               -1 if w . x + b <= -1
               Universe explodes if -1 < w . x + b < 1
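A minimal sketch of this decision rule in NumPy (illustrative only; the names `classify`, `w`, and `b` are not from the slides):

```python
import numpy as np

def classify(x, w, b):
    """Linear SVM decision rule from the slide (illustrative sketch)."""
    s = np.dot(w, x) + b
    if s >= 1:
        return +1          # on or beyond the plus-plane
    if s <= -1:
        return -1          # on or beyond the minus-plane
    return None            # inside the margin: "universe explodes"

w = np.array([2.0, 1.0])
b = -1.0
print(classify(np.array([1.0, 0.5]), w, b))    # +1  (w.x + b = 1.5)
print(classify(np.array([0.25, 0.25]), w, b))  # None (w.x + b = -0.25, inside the margin)
```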
Computing the margin width
M = Margin Width. How do we compute M in terms of w and b?
• Plus-plane  = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
Claim: The vector w is perpendicular to the Plus Plane. Why?
Computing the margin width
M = Margin Width. How do we compute M in terms of w and b?
• Plus-plane  = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
Claim: The vector w is perpendicular to the Plus Plane. Why? Let u and v be two vectors on the Plus Plane. What is w . (u – v)? Since w . u + b = 1 and w . v + b = 1, we get w . (u – v) = 0, so w is orthogonal to every direction lying in the plane.
Computing the margin width
How do we compute M in terms of w and b?
• Plus-plane  = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
• The vector w is perpendicular to the Plus Plane
• Let x- be any point on the minus plane (any location in R^m; it need not be a datapoint)
• Let x+ be the closest plus-plane point to x-
• Claim: x+ = x- + λ w for some value of λ. Why?
Computing the margin width
The line from x- to x+ is perpendicular to the planes. So to get from x- to x+, travel some distance in the direction of w.
• Plus-plane  = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
• The vector w is perpendicular to the Plus Plane
• Let x- be any point on the minus plane
• Let x+ be the closest plus-plane point to x-
• Claim: x+ = x- + λ w for some value of λ.
Computing the margin width
What we know:
• w . x+ + b = +1
• w . x- + b = -1
• x+ = x- + λ w
• |x+ - x-| = M
It’s now easy to get M in terms of w and b.
Computing the margin width
    w . (x- + λ w) + b = 1
=>  w . x- + b + λ w.w = 1
=>  -1 + λ w.w = 1
=>  λ = 2 / (w.w)
Computing the margin width
    M = |x+ - x-| = |λ w| = λ √(w.w) = 2 √(w.w) / (w.w) = 2 / √(w.w)
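A quick numerical check of this derivation (illustrative values, not from the slides):

```python
import numpy as np

# Arbitrary illustrative w and b.
w = np.array([3.0, 4.0])
b = 0.5

x_minus = -(1 + b) * w / w.dot(w)        # a point on the minus-plane: w.x + b = -1
lam = 2.0 / w.dot(w)                     # lambda from the derivation above
x_plus = x_minus + lam * w               # closest plus-plane point

print(w.dot(x_plus) + b)                 # -> 1.0, so x_plus lies on the plus-plane
print(np.linalg.norm(x_plus - x_minus))  # margin width M
print(2.0 / np.sqrt(w.dot(w)))           # -> same value: 2 / sqrt(w.w)
```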
Recall linear programming: find w that maximizes c . w
subject to
    w . ai <= bi, for i = 1, …, m
    wj >= 0,      for j = 1, …, n
There are fast algorithms for solving linear programs, including the simplex algorithm and Karmarkar’s algorithm.
Learning via Quadratic Programming
The maximum-margin classifier is found with a quadratic program rather than a linear program.
Quadratic Programming
Find  arg max_u  c + dᵀ u + uᵀ R u / 2        (quadratic criterion)
subject to e additional linear equality constraints:
    a(n+1)1 u1 + a(n+1)2 u2 + … + a(n+1)m um = b(n+1)
    a(n+2)1 u1 + a(n+2)2 u2 + … + a(n+2)m um = b(n+2)
    ⋮
    a(n+e)1 u1 + a(n+e)2 u2 + … + a(n+e)m um = b(n+e)
Learning the Maximum Margin Classifier
M = 2 / √(w.w). Given a guess of w, b we can
• compute whether all data points are in the correct half-planes
• compute the margin width.
Assume R datapoints, each (xk, yk) where yk = +/- 1.
What should our quadratic optimization criterion be?
    Minimize w.w
How many constraints will we have? R. What should they be?
    w . xk + b >= 1   if yk = 1
    w . xk + b <= -1  if yk = -1
This is going to be a problem!
What should we do?
(Figure: +1 and -1 points that are not linearly separable.)
Idea 1: Find minimum w.w, while minimizing number of training set errors.
Problem: Two things to minimize makes for an ill-defined optimization.
This is going to be a problem!
What should we do?
Idea 1.1: Minimize  w.w + C (#train errors),  where C is a tradeoff parameter.
(Figure: Class 1 and Class 2 separated by the hyperplane WᵀX + b = 0, with margin planes WᵀX + b = 1 and WᵀX + b = -1; the margin width is m = 2 / ‖W‖.)
Finding the Decision Boundary
• Let {x1, ..., xn} be our data set and let yi ∈ {1, -1} be the class label of xi
• The decision boundary should classify all points correctly:
      yi (Wᵀ Xi + b) >= 1,  for all i
• The decision boundary can be found by solving the following constrained optimization problem:
      Minimize  (1/2) ‖W‖²
      subject to  yi (Wᵀ Xi + b) >= 1,  for all i
• This is a constrained optimization problem. Solving it requires some new tools
  – Feel free to ignore the following several slides; what is important is the constrained optimization problem above
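Not part of the slides: a minimal sketch of solving this constrained problem numerically with a general-purpose solver (scipy's SLSQP here); the tiny dataset and all names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

# Minimize (1/2)||W||^2  s.t.  y_i (W.x_i + b) >= 1, on a toy separable dataset.
X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([1, 1, -1, -1])

def objective(p):                     # p = [w1, w2, b]
    w = p[:2]
    return 0.5 * w.dot(w)

constraints = [{"type": "ineq",
                "fun": lambda p, i=i: y[i] * (p[:2].dot(X[i]) + p[2]) - 1}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), constraints=constraints, method="SLSQP")
w, b = res.x[:2], res.x[2]
print(w, b)                           # the maximum-margin W and b
print(np.sign(X.dot(w) + b))          # all points classified correctly
print(2.0 / np.linalg.norm(w))        # margin width 2 / ||W||
```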
Back to the Original Problem
    Minimize  (1/2) ‖W‖²
    subject to  1 - yi (Wᵀ Xi + b) <= 0,  for i = 1, …, n
• The Lagrangian is
    L = (1/2) Wᵀ W + Σ_{i=1}^{n} αi (1 - yi (Wᵀ Xi + b)),   with αi >= 0
• The Karush-Kuhn-Tucker conditions give
    αi ( yi (Wᵀ Xi + b) - 1 ) = 0,  for all i.
The Dual Problem
• If we substitute  W = Σ_{i=1}^{n} αi yi Xi  into L, we have
    L = (1/2) Σ_i Σ_j αi αj yi yj Xiᵀ Xj + Σ_i αi (1 - yi (Σ_j αj yj Xjᵀ Xi + b))
      = Σ_i αi - (1/2) Σ_i Σ_j αi αj yi yj Xiᵀ Xj - b Σ_i αi yi
      = Σ_i αi - (1/2) Σ_i Σ_j αi αj yi yj Xiᵀ Xj
• Note that  Σ_i αi yi = 0
• This is a function of αi only
The Dual Problem
• The new objective function is in terms of αi only
• It is known as the dual problem: if we know W, we know all αi; if we know all αi, we know W
• The original problem is known as the primal problem
• The objective function of the dual problem needs to be maximized!
• The dual problem is therefore:
    max  W(α) = Σ_{i=1}^{n} αi - (1/2) Σ_{i,j=1}^{n} αi αj yi yj Xiᵀ Xj
    subject to  αi >= 0   (a property of the αi when we introduce the Lagrange multipliers)
    and  Σ_{i=1}^{n} αi yi = 0   (the result when we differentiate the original Lagrangian w.r.t. b)
The Dual Problem
    max  W(α) = Σ_{i=1}^{n} αi - (1/2) Σ_{i,j=1}^{n} αi αj yi yj Xiᵀ Xj
    subject to  αi >= 0,   Σ_{i=1}^{n} αi yi = 0
• w can be recovered by  W = Σ_{i=1}^{n} αi yi Xi
Characteristics of the Solution
• Many of the αi are zero
  – w is a linear combination of a small number of data points
  – This “sparse” representation can be viewed as data compression, as in the construction of the kNN classifier
• xi with non-zero αi are called support vectors (SV)
  – The decision boundary is determined only by the SV
  – Let tj (j = 1, ..., s) be the indices of the s support vectors. We can write
        W = Σ_{j=1}^{s} α_tj y_tj X_tj
• For testing with a new data point z
  – Compute  Wᵀ z + b = Σ_{j=1}^{s} α_tj y_tj (X_tjᵀ z) + b  and classify z as class 1 if the sum is positive, and class 2 otherwise
  – Note: w need not be formed explicitly
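A minimal sketch of this test-time rule (illustrative; `alpha`, `X`, `y`, and `b` are assumed to come from solving the dual):

```python
import numpy as np

def predict(z, alpha, X, y, b):
    """Classify a new point z using only the support vectors."""
    sv = alpha > 1e-8                          # support vectors: non-zero multipliers
    s = np.sum(alpha[sv] * y[sv] * (X[sv] @ z)) + b
    return 1 if s > 0 else -1                  # class 1 if positive, class 2 otherwise
```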
A Geometrical Interpretation
(Figure: Class 1 and Class 2 points with the planes WᵀX + b = -1, 0, +1. Most multipliers are zero: α2 = α3 = α4 = α5 = α7 = α9 = α10 = 0. The support vectors on the margins carry the non-zero multipliers α1 = 0.8, α6 = 1.4, α8 = 0.6.)
Non-linearly Separable Problems
• We allow “errors” ξi in classification; each is based on the output of the discriminant function wᵀx + b
• The sum of the ξi approximates the number of misclassified samples
(Figure: Class 1 and Class 2 with the planes WᵀX + b = -1, 0, +1; points on the wrong side of their margin plane incur a slack ξi.)
Learning Maximum Margin with Noise
M = 2 / √(w.w). Given a guess of w, b we can
• compute the sum of distances of points to their correct zones
• compute the margin width.
(Figure: points ε2, ε7 and ε11 lie inside the margin or on the wrong side; each incurs a slack εk.)
Assume R datapoints, each (xk, yk) where yk = +/- 1.
What should our quadratic optimization criterion be?
    Minimize  (1/2) w.w + C Σ_{k=1}^{R} εk
How many constraints will we have? R. What should they be?
    w . xk + b >= 1 - εk   if yk = 1
    w . xk + b <= -1 + εk  if yk = -1
Learning Maximum Margin with Noise
(m = # input dimensions, R = # records.)
Our original (noiseless data) QP had m+1 variables: w1, w2, …, wm, and b.
Our new (noisy data) QP has m+1+R variables: w1, w2, …, wm, b, ε1, …, εR.
What should our quadratic optimization criterion be?
    Minimize  (1/2) w.w + C Σ_{k=1}^{R} εk
How many constraints will we have? R. What should they be?
    w . xk + b >= 1 - εk   if yk = 1
    w . xk + b <= -1 + εk  if yk = -1
Learning Maximum Margin with Noise
What should our quadratic optimization criterion be?
    Minimize  (1/2) w.w + C Σ_{k=1}^{R} εk
How many constraints will we have? R. What should they be?
    w . xk + b >= 1 - εk   if yk = 1
    w . xk + b <= -1 + εk  if yk = -1
There’s a bug in this QP. Can you spot it?
Learning Maximum Margin with Noise
(The bug: nothing so far stops the εk from going negative, so we must also constrain εk >= 0.)
What should our quadratic optimization criterion be?
    Minimize  (1/2) w.w + C Σ_{k=1}^{R} εk
How many constraints will we have? 2R. What should they be?
    w . xk + b >= 1 - εk   if yk = 1
    w . xk + b <= -1 + εk  if yk = -1
    εk >= 0  for all k
An Equivalent Dual QP
Primal:
    Minimize  (1/2) w.w + C Σ_{k=1}^{R} εk
    subject to  w . xk + b >= 1 - εk   if yk = 1
                w . xk + b <= -1 + εk  if yk = -1
                εk >= 0, for all k
Dual:
    Maximize  Σ_{k=1}^{R} αk - (1/2) Σ_{k=1}^{R} Σ_{l=1}^{R} αk αl Qkl    where Qkl = yk yl (xk . xl)
    Subject to these constraints:   0 <= αk <= C for all k,    Σ_{k=1}^{R} αk yk = 0
An Equivalent Dual QP
    Maximize  Σ_{k=1}^{R} αk - (1/2) Σ_{k=1}^{R} Σ_{l=1}^{R} αk αl Qkl    where Qkl = yk yl (xk . xl)
    Subject to these constraints:   0 <= αk <= C for all k,    Σ_{k=1}^{R} αk yk = 0
Then define:
    w = Σ_{k=1}^{R} αk yk xk
    b = yK (1 - εK) - xK . w    where K = arg max_k αk
Then classify with:
    f(x, w, b) = sign(w . x - b)
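A hedged sketch of solving this dual QP numerically, assuming the cvxopt package is available (any QP solver would do); the toy dataset and names are illustrative, and b follows the slide's recipe with εK = 0 for separable data.

```python
import numpy as np
from cvxopt import matrix, solvers

X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
R, C = len(y), 10.0

Q = np.outer(y, y) * (X @ X.T)                        # Q_kl = y_k y_l (x_k . x_l)
P = matrix(Q + 1e-8 * np.eye(R))                      # tiny ridge for numerical stability
q = matrix(-np.ones(R))                               # cvxopt minimizes, so negate the objective
G = matrix(np.vstack([-np.eye(R), np.eye(R)]))
h = matrix(np.hstack([np.zeros(R), C * np.ones(R)]))  # 0 <= alpha_k <= C
A = matrix(y.reshape(1, -1))
b0 = matrix(0.0)                                      # sum_k alpha_k y_k = 0

solvers.options["show_progress"] = False
alpha = np.array(solvers.qp(P, q, G, h, A, b0)["x"]).flatten()

w = (alpha * y) @ X                   # w = sum_k alpha_k y_k x_k
K = np.argmax(alpha)                  # the slide's choice: the point with the largest alpha
b = y[K] - X[K] @ w                   # eps_K = 0 here; note the rule below uses sign(w.x + b)
print(np.sign(X @ w + b))             # recovers the training labels
```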
Example: the XOR problem revisited:
    x1 = (-1,-1),  d1 = -1
    x2 = (-1, 1),  d2 =  1
    x3 = ( 1,-1),  d3 =  1
    x4 = ( 1, 1),  d4 = -1
Q(a) = Σ ai – ½ Σ Σ ai aj di dj φ(xi)ᵀ φ(xj)
     = a1 + a2 + a3 + a4 – ½ (9a1² - 2a1a2 - 2a1a3 + 2a1a4
       + 9a2² + 2a2a3 - 2a2a4 + 9a3² - 2a3a4 + 9a4²)
To optimize Q, we only need to set
    ∂Q(a)/∂ai = 0,  i = 1, ..., 4
(due to the optimality conditions), which gives
    1 = 9a1 - a2 - a3 + a4
    1 = -a1 + 9a2 + a3 - a4
    1 = -a1 + a2 + 9a3 - a4
    1 = a1 - a2 - a3 + 9a4
The solution of which gives the optimal values:
    a0,1 = a0,2 = a0,3 = a0,4 = 1/8
w0 = Σ a0,i di φ(xi) = 1/8 [ -φ(x1) + φ(x2) + φ(x3) - φ(x4) ]
With  φ(x) = [1, x1², √2 x1x2, x2², √2 x1, √2 x2]ᵀ  this evaluates to
    w0 = [0, 0, -1/√2, 0, 0, 0]ᵀ
so the optimal hyperplane (decision boundary) is
    w0ᵀ φ(x) = -x1 x2 = 0
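A quick numerical check of this example (an illustrative sketch, not part of the slides):

```python
import numpy as np

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
d = np.array([-1, 1, 1, -1], dtype=float)

K = (X @ X.T + 1) ** 2              # kernel matrix K(x_i, x_j) = (x_i . x_j + 1)^2
Q = np.outer(d, d) * K              # Q_ij = d_i d_j K(x_i, x_j)

a = np.linalg.solve(Q, np.ones(4))  # solve dQ/da_i = 0, i.e. Q a = 1
print(a)                            # -> [0.125 0.125 0.125 0.125], i.e. all 1/8

# Decision function via the kernel expansion: sum_i a_i d_i K(x, x_i)
x = np.array([0.7, -0.3])
print(np.sum(a * d * (X @ x + 1) ** 2))   # -> 0.21
print(-x[0] * x[1])                        # same value: -x1 * x2
```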
For a non-linearly separable problem we have to first map the data onto a feature space so that they become linearly separable:
    xi → φ(xi)
Given the training data sample {(xi, yi), i = 1, …, N}, find the optimum values of the weight vector w and bias b:
    w = Σ a0,i yi φ(xi)
where a0,i are the optimal Lagrange multipliers determined by maximizing the following objective function
    Q(a) = Σ_{i=1}^{N} ai - ½ Σ_{i=1}^{N} Σ_{j=1}^{N} ai aj yi yj φ(xi)ᵀ φ(xj)
Look at the equation of the boundary in the feature space and use the optimality conditions derived from the Lagrangian formulation.
The hyperplane is defined by
    Σ_{j=1}^{m1} wj φj(x) + b = 0
or
    Σ_{j=0}^{m1} wj φj(x) = 0,   where φ0(x) = 1 (so w0 plays the role of b)
Writing  φ(x) = [φ0(x), φ1(x), ..., φ_{m1}(x)]ᵀ  we get:
    wᵀ φ(x) = 0
From the optimality conditions:  w = Σ_{i=1}^{N} ai di φ(xi)
Thus:  Σ_{i=1}^{N} ai di φ(xi)ᵀ φ(x) = 0
and so the boundary is:  Σ_{i=1}^{N} ai di K(x, xi) = 0
and the output is:  wᵀ φ(x) = Σ_{i=1}^{N} ai di K(x, xi)
where  K(x, xi) = Σ_{j=0}^{m1} φj(x) φj(xi)
In the XOR problem, we chose to use the kernel function:
    K(x, xi) = (xᵀ xi + 1)²
             = 1 + x1² xi1² + 2 x1 x2 xi1 xi2 + x2² xi2² + 2 x1 xi1 + 2 x2 xi2
However, we did not need to calculate φ at all and could simply have used the kernel to calculate:
    Q(a) = Σ ai – ½ Σ Σ ai aj di dj K(xi, xj)
and the boundary  Σ_{i=1}^{N} ai di K(x, xi) = 0.
We therefore only need a suitable choice of kernel function, cf. Mercer’s Theorem:
Then define:
    w = Σ_{k=1}^{R} αk yk xk    (..so this sum only needs to be over the support vectors)
    b = yK (1 - εK) - xK . w    where K = arg max_k αk
Then classify with:
    f(x, w, b) = sign(w . x - b)
Quadratic Basis Functions
    Φ(x) = [ 1,                                   (constant term)
             √2 x1, √2 x2, …, √2 xm,              (linear terms)
             x1², x2², …, xm²,                     (pure quadratic terms)
             √2 x1x2, √2 x1x3, …, √2 x(m-1)xm ]ᵀ   (quadratic cross-terms)
Number of terms (assuming m input dimensions) = (m+2)-choose-2 = (m+2)(m+1)/2
Then define:
    w = Σ_{k s.t. αk > 0} αk yk Φ(xk)
Then classify with:
    f(x, w, b) = sign(w . Φ(x) - b)
…or does it?
Quadratic Dot Products
Φ(a) . Φ(b) =
      1                                            (constant term)
    + Σ_{i=1}^{m} 2 ai bi                           (linear terms)
    + Σ_{i=1}^{m} ai² bi²                           (pure quadratic terms)
    + Σ_{i=1}^{m} Σ_{j=i+1}^{m} 2 ai aj bi bj       (quadratic cross-terms)
Quadratic Dot Products
Just out of casual, innocent interest, let’s look at another function of a and b:
    (a.b + 1)² = (a.b)² + 2 a.b + 1
               = (Σ_i ai bi)² + 2 Σ_i ai bi + 1
               = Σ_i Σ_j ai bi aj bj + 2 Σ_i ai bi + 1
               = Σ_i (ai bi)² + 2 Σ_i Σ_{j=i+1} ai bi aj bj + 2 Σ_i ai bi + 1
               = Φ(a) . Φ(b)
So the quadratic dot product Φ(a) . Φ(b) can be computed as (a.b + 1)², at a cost of only O(m) operations.
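A small numerical sanity check of this identity (illustrative, not from the slides):

```python
import numpy as np
from itertools import combinations

def phi(x):
    """Quadratic basis expansion: constant, linear, pure quadratic, cross-terms."""
    m = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(m), 2)]
    return np.concatenate([[1.0], np.sqrt(2) * x, x ** 2, cross])

a = np.random.randn(5)
b = np.random.randn(5)
print(phi(a) @ phi(b))        # explicit feature-space dot product
print((a @ b + 1) ** 2)       # kernel: same number, O(m) work
```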
Subject to these constraints:   0 <= αk <= C for all k,    Σ_{k=1}^{R} αk yk = 0
Then define:
    w = Σ_{k s.t. αk > 0} αk yk Φ(xk)
    b = yK (1 - εK) - xK . w    where K = arg max_k αk
Then classify with:
    f(x, w, b) = sign(w . Φ(x) - b)
QP with Quintic basis functions
    Maximize  Σ_{k=1}^{R} αk - (1/2) Σ_{k=1}^{R} Σ_{l=1}^{R} αk αl Qkl    where Qkl = yk yl (Φ(xk) . Φ(xl))
    Subject to these constraints:   0 <= αk <= C for all k,    Σ_{k=1}^{R} αk yk = 0
Then define:
    w = Σ_{k s.t. αk > 0} αk yk Φ(xk)
    b = yK (1 - εK) - xK . w    where K = arg max_k αk
Then classify with:
    f(x, w, b) = sign(w . Φ(x) - b)
We must do R²/2 dot products to get this matrix ready. In 100-d, each dot product now needs 103 operations instead of 75 million.
But there are still worrying things lurking away. What are they?
• The fear of overfitting with this enormous number of terms
• The evaluation phase (doing a set of predictions on a test set) will be very expensive (why?)
QP with Quintic basis functions
    Maximize  Σ_{k=1}^{R} αk - (1/2) Σ_{k=1}^{R} Σ_{l=1}^{R} αk αl Qkl    where Qkl = yk yl (Φ(xk) . Φ(xl))
    Subject to these constraints:   0 <= αk <= C for all k,    Σ_{k=1}^{R} αk yk = 0
Then define:
    w = Σ_{k s.t. αk > 0} αk yk Φ(xk)
    b = yK (1 - εK) - xK . w    where K = arg max_k αk
Then classify with:
    f(x, w, b) = sign(w . Φ(x) - b)
We must do R²/2 dot products to get this matrix ready. In 100-d, each dot product now needs 103 operations instead of 75 million.
But there are still worrying things lurking away. What are they?
• The fear of overfitting with this enormous number of terms
  – The use of Maximum Margin magically makes this not a problem.
• The evaluation phase (doing a set of predictions on a test set) will be very expensive (why?)
  – Because each w . Φ(x) (see below) needs 75 million operations. What can be done?
QP with Quintic basis functions
    Maximize  Σ_{k=1}^{R} αk - (1/2) Σ_{k=1}^{R} Σ_{l=1}^{R} αk αl Qkl    where Qkl = yk yl (Φ(xk) . Φ(xl))
    Subject to these constraints:   0 <= αk <= C for all k,    Σ_{k=1}^{R} αk yk = 0
Then define:
    w = Σ_{k s.t. αk > 0} αk yk Φ(xk)
    b = yK (1 - εK) - xK . w    where K = arg max_k αk
Then classify with:
    f(x, w, b) = sign(w . Φ(x) - b)
We must do R²/2 dot products to get this matrix ready. In 100-d, each dot product now needs 103 operations instead of 75 million.
But there are still worrying things lurking away. What are they?
• The fear of overfitting with this enormous number of terms
  – The use of Maximum Margin magically makes this not a problem.
• The evaluation phase (doing a set of predictions on a test set) will be very expensive (why?)
  – Because each w . Φ(x) (see below) needs 75 million operations. What can be done?
  – Use the kernel:
        w . Φ(x) = Σ_{k s.t. αk > 0} αk yk Φ(xk) . Φ(x) = Σ_{k s.t. αk > 0} αk yk (xk . x + 1)⁵
    Only S·m operations (S = #support vectors).
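A minimal sketch of this kernelized decision rule (names are illustrative, not from the slides):

```python
import numpy as np

def predict(x, sv_x, sv_y, sv_alpha, b, degree=5):
    """Classify x using only the support vectors; Phi(x) and w are never formed."""
    k = (sv_x @ x + 1.0) ** degree               # (x_k . x + 1)^5 for each support vector
    return np.sign(np.sum(sv_alpha * sv_y * k) - b)
```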
QP with Quintic basis functions
    Maximize  Σ_{k=1}^{R} αk - (1/2) Σ_{k=1}^{R} Σ_{l=1}^{R} αk αl Qkl    where Qkl = yk yl (Φ(xk) . Φ(xl))
    Subject to these constraints:   0 <= αk <= C for all k,    Σ_{k=1}^{R} αk yk = 0
Then define:
    w = Σ_{k s.t. αk > 0} αk yk Φ(xk)
    b = yK (1 - εK) - xK . w    where K = arg max_k αk
Then classify with:
    f(x, w, b) = sign(w . Φ(x) - b),  computed as  w . Φ(x) = Σ_{k s.t. αk > 0} αk yk (xk . x + 1)⁵
    Only S·m operations (S = #support vectors).
Why SVMs don’t overfit as much as you’d think:
No matter what the basis function, there are really only up to R parameters (α1, α2, …, αR), and usually most are set to zero by the Maximum Margin.
Asking for small w.w is like “weight decay” in Neural Nets, like the Ridge Regression parameters in Linear Regression, and like the use of Priors in Bayesian Regression: all designed to smooth the function and reduce overfitting.
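As a practical aside (not part of the slides), an off-the-shelf SVM such as scikit-learn's SVC implements exactly this pipeline (soft margin, dual solve, kernel, sparse support-vector solution); a minimal sketch on the XOR data from the earlier example, assuming scikit-learn is installed:

```python
import numpy as np
from sklearn.svm import SVC

# XOR data from the example above; labels in {-1, +1}.
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
y = np.array([-1, 1, 1, -1])

# Polynomial kernel (x.x' + 1)^2: degree=2, coef0=1, gamma=1.
# C plays the role of the tradeoff parameter from the "Maximum Margin with Noise" slides.
clf = SVC(kernel="poly", degree=2, coef0=1.0, gamma=1.0, C=10.0).fit(X, y)
print(clf.predict(X))            # recovers the training labels [-1, 1, 1, -1]
print(clf.dual_coef_)            # alpha_i * y_i for the support vectors
print(clf.support_vectors_)      # the support vectors found
```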
SVM Kernel Functions