
A Quick Introduction to Statistical Learning Theory and Support

Vector Machines
Rahul B. Warrier
Abstract: This article gives a brief introduction to Statistical Learning Theory (SLT) and the relatively new class of
supervised learning algorithms called Support Vector Machines
(SVM). The primary objective of this article is to understand
the chosen topic using the concepts of linear systems theory
discussed in class. To that effect, a thorough mathematical analysis of the topic is done along with some examples that vividly
illustrate the concepts being discussed. Finally, application of
SVM to a simple binary classification problem is discussed.

I. INTRODUCTION
Statistical Learning Theory (SLT) forms the basic framework for machine learning. Of the several categories of
learning, SLT deals with supervised learning, which involves
learning from a training set of data (examples). The training
data consists of input-output pairs, where each input maps to an output. The learning problem then consists of finding
the function that maps input to output in a predictive fashion, such that the learned function can be used to predict
the output for a new set of input data with minimal mistakes
(generalization) [6].
The Support Vector Machine (SVM) is a class of supervised
learning machines for two-group classification/regression
problems. Conceptually, the SVM implements the following idea: input vectors are non-linearly mapped to a high-dimensional feature space wherein a linear decision surface
is constructed [4].
The goal of this article is to briefly describe the mathematical formulation of SLT and more specifically the SVM
(applied to the binary classification problem) through the
concepts of systems theory. The major concepts that will
be highlighted during the course of this discussion are:
inner product spaces and normed spaces, Gram matrices,
linear/nonlinear mappings, convex optimization, projection
theorem on Hilbert spaces and the minimax problem.
This subject has gained immense popularity in the last
few years and ample literature is available on
various aspects of this topic. This article mainly follows the
seminal work of Vladimir N. Vapnik [1] and his colleagues
at AT&T Bell Laboratories [2],[3]. SLT concepts have also
been discussed succinctly in [6], and a detailed mathematical
formulation of SLT and SVM is available in the dissertation
by Schölkopf [4]. A geometric interpretation of the SVM
algorithm is presented in [8],[9].
This article is organized as follows: Section 2 deals with
the formulation of the general learning problem. The Empirical Risk Minimization Induction (ERM) principle is discussed, leading to the Structural Risk Minimization Induction
(SRM) principle. Section 3 discusses the problem of binary
classification for linearly separable data and the concept of
margin (and the essence of SVM: margin maximization).
Section 4 introduces the Support Vector Learning algorithm
for linearly separable data. The methodology of the SVM
is then extended to data that is not fully linearly separable.
We also look at the kernel trick, which allows us to use the
SVM to classify nonlinear data. Finally, in Section 5 we look
at an application of SVMs to a simple binary classification
problem.

Rahul B. Warrier is a Graduate Student in Mechanical Engineering, University of Washington, Seattle, WA-98105. warrirerr@uw.edu
II. STATISTICAL LEARNING THEORY
Statistical Learning Theory mainly deals with the problem
of supervised learning. This model of learning involves
learning from examples and can be described as follows [1]:
1) (Training data): An independent, identically distributed set of data {xi}, i = 1, . . . , n, drawn from a fixed but unknown distribution P(x);
2) (Labeling of data by a supervisor): The supervisor labels each input vector x with an output y according to a conditional distribution function P(y|x), also fixed but unknown;
3) (Learning machine): A learning machine that can implement a set of functions f(x, α), α ∈ F (α is a vector of scalar parameters). The goal of supervised learning is to select from among these functions (defined by the specific parameter vector) the one which predicts the supervisor's response in the best possible way. This selection is based on the training data set

(x1, y1), . . . , (xn, yn).    (1)

A. General setting of the Learning Problem


In order to choose the best approximation to the supervisor's response, we consider the loss or discrepancy
L(y, f(x, α)), α ∈ F, between the supervisor's response y
and the learning machine's prediction f(x, α) for input x.
The goal then is to minimize the expected value of the loss,
the risk functional

R(α) = ∫ L(y, f(x, α)) dP(x, y),    (2)

given the training data (1).


B. Empirical Risk Minimization Induction (ERM) Principle
The problem of minimizing the risk functional (2), given
the training data (1), is usually solved using the following induction principle¹.
The expected risk functional R(α) given by (2) is replaced
by the empirical risk functional [1],

Remp(α) = (1/n) Σ_{i=1}^n L(yi, f(xi, α)),    (3)

which is constructed using the training data (1).


The Empirical Risk Minimization Induction (ERM) principle
approximates the function L(y, f(x, α0)), which minimizes
the risk (2), by the function L(y, f(x, αn)), which minimizes
the empirical risk (3).
Now, SLT provides probabilistic bounds on the discrepancy between the empirical and expected risk of any function
[4]: if h < n is the VC dimension² of the class of functions
that the learning machine can implement, then for all functions of that class, with a probability of at least 1 − η, the
bound

R(α) < Remp(α) + ε(h/n, log(η)/n)    (4)

holds, where the confidence term ε is defined as

ε(h/n, log(η)/n) = √( [h(log(2n/h) + 1) − log(η/4)] / n ).    (5)

Here h is the capacity of the class of functions f(x, α) and n is the number of training data.
The ERM principle cannot, in practice, be applied directly
[6]. Firstly, there can be infinitely many functions from
the class of functions f(x, α) that minimize the empirical
risk. Secondly, it can lead to overfitting: even if we get the
empirical risk Remp(α) → 0 using a large VC-dimension h,
the expected risk R(α) can be very large due to the
monotonically increasing confidence term. This implies that
generalization to new data is not guaranteed.
Further, the space of functions f(x, α) is in practice very
large³, so one generally considers smaller hypothesis spaces
H [6]. And, looking at (4), we find that to minimize R(α),
instead of simply minimizing Remp(α), good generalization
is achieved by trying to find the best trade-off between the
empirical risk and the complexity of the function space, given
by the second term in the inequality (4) [1]. This leads to the
Structural Risk Minimization Induction Principle.
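
As a rough numerical illustration of this trade-off, the following sketch evaluates the right-hand side of (4) using the confidence term (5). The sample size, the VC dimensions, and the empirical-risk values below are made up purely for illustration: richer function classes fit the training data better, yet the bound need not improve because the confidence term grows with h.

import numpy as np

def vc_confidence(h, n, eta=0.05):
    # confidence term (5): sqrt((h*(log(2n/h) + 1) - log(eta/4)) / n)
    return np.sqrt((h * (np.log(2.0 * n / h) + 1.0) - np.log(eta / 4.0)) / n)

def risk_bound(r_emp, h, n, eta=0.05):
    # right-hand side of the bound (4): empirical risk plus confidence term
    return r_emp + vc_confidence(h, n, eta)

n = 1000                                                  # number of training samples (illustrative)
for h, r_emp in [(5, 0.20), (50, 0.10), (500, 0.02)]:     # hypothetical (h, Remp) pairs
    print(f"h={h:4d}  Remp={r_emp:.2f}  bound={risk_bound(r_emp, h, n):.3f}")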
C. Structural Risk Minimization Induction (SRM) Principle
The SRM principle involves defining a nested sequence
of hypothesis spaces H1 ⊂ H2 ⊂ · · · ⊂ HN such that their
corresponding capacities are finite and are ordered in increasing fashion as h1 ≤ h2 ≤ · · · ≤ hN < ∞. The idea then is to
choose the function f(x, α0) that minimizes the empirical
risk (3) in the hypothesis space Hn for which the bound on
the structural risk (4) is minimized.

Fig. 1. Pictorial representation of the structure of hypothesis spaces [4]

D. Summary of SLT
The problem of learning from examples is solved in three steps:
1) Define a loss function L(y, f(x, α)) that measures the error of the prediction f(x, α) for input x when the actual output is y;
2) Define a nested sequence of hypothesis spaces Hn, n = 1, . . . , N, whose capacity hn increases with n;
3) Minimize the empirical risk Remp(α) in each Hn and choose among the solutions the one that minimizes the right-hand side of the inequality (4).

Fig. 2. Graphical representation of (4) for fixed n [4]

The bound given by (4) forms part of the theoretical basis
[4] for the Support Vector Learning algorithm described in
Section 4.

¹The transduction principle is also used in many instances to reduce the number of labeled training data required. This scheme is called active learning [10].
²Numerous definitions of capacity exist, the most popular being the VC-dimension [1]. For the binary classification problem it is defined as the largest number h of points that can be separated into two classes in all possible 2^h ways using the functions of the learning machine.
³We can have f(x, α) ∈ L2, the space of square-integrable continuous functions.
III. THE BINARY CLASSIFICATION PROBLEM
A. Problem Setup
Given n input vectors xi ∈ X = R^d, i = 1, . . . , n, and their
corresponding labels yi ∈ Y = {+1, −1}, i = 1, . . . , n, all of
which are independently and identically distributed according to some probability distribution P(x, y) = P(x) P(y|x),
the goal is to find a decision function f(x, α0), α0 ∈ F:
X → {+1, −1} that will predict the correct label

yt = arg max_{y ∈ {−1,+1}} P(y|xt)

for a test example xt.


Fig. 3. Binary Classification with a decision function (potentially nonlinear in the input space)

The search for this optimal prediction function f(x, α0) is
carried out in a structured hypothesis space using the SRM
and ERM principles discussed in the previous section.
The loss function used for binary classification problems
is the zero-one loss function:

L(y, f(x, α)) = |y − f(x, α)|    (6)

in which case the expected risk (2) is simply the probability
of misclassification,

R(α) = ∫ (1/2)|y − f(x, α)| dP(x, y),

and the empirical risk is given by

Remp(α) = (1/n) Σ_{i=1}^n (1/2)|yi − f(xi, α)|.

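As a minimal sketch of these definitions (with a toy decision function and a hand-made labelled sample, neither of which appears in the article), the empirical risk under the zero-one loss is simply the fraction of misclassified training points:

import numpy as np

def empirical_risk_01(f, X, y):
    # Remp = (1/n) * sum_i (1/2)|y_i - f(x_i)|, i.e. the misclassification rate
    predictions = np.array([f(x) for x in X])
    return np.mean(0.5 * np.abs(y - predictions))

# toy decision function: classify by the sign of the first coordinate
f = lambda x: 1.0 if x[0] >= 0 else -1.0
X = np.array([[1.0, 2.0], [-0.5, 1.0], [2.0, -1.0], [-1.5, 0.3]])
y = np.array([+1.0, -1.0, -1.0, -1.0])      # the third point is misclassified by f
print(empirical_risk_01(f, X, y))           # 0.25
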

B. Linearly Separable Classification
In this section, we assume that X is a Euclidean Hilbert
space with an inner product defined⁴.
The data given in (1) with yi ∈ {+1, −1} is said to be
linearly separable if there exist a vector w and a scalar b
such that the following inequalities hold for all the elements
of the training set X:

⟨w, xi⟩ + b ≥ +1   if yi = +1,    (7a)
⟨w, xi⟩ + b ≤ −1   if yi = −1,    (7b)

or equivalently,

yi(⟨w, xi⟩ + b) ≥ 1.    (8)

A hyperplane is given by the following equation:

⟨w, x⟩ + b = 0.    (9)

A hyperplane that satisfies (8) is called a decision boundary,
and the decision function f: X → Y is simply

f(x, w) = sign(⟨w, x⟩ + b).    (10)

Hence, we notice that a decision boundary is an affine
subspace, with w the normal vector to the decision
boundary and b the bias, and from the example shown in
Fig. 5(a) we see that there can be infinitely many possible
decision boundaries.
In the case that the training data is non-separable, as shown
in Fig. 5(b), it is possible to map the data to a higher
dimensional Hilbert space called a feature space P through
a non-linear mapping φ: X → P such that we can find a
linear decision function, as shown in Fig. 5(c). The decision
function is then

f(x, w) = sign(⟨w, φ(x)⟩ + b).    (11)

Fig. 5. (a) 2D separable training space with possible decision boundaries. (b) Non-separable training space. (c) 2D feature space with non-linear transformation φ(x) = (x, x²), with possible decision boundaries.

The distance of a vector to a hyperplane is given by

d(xi) = |⟨w, xi⟩ + b| / ||w||.

The margin between the two classes is defined as the distance
of the hyperplane to the closest vectors from each of the two
classes. These vectors are called the support vectors. Thus,
we can express the margin as the minimum of this distance
over the xi ∈ X:

M = min_{xi ∈ X} d(xi) = min_{xi ∈ X} |⟨w, xi⟩ + b| / ||w||.    (12)

⁴In Section 4 we see that even if X is not a Hilbert space, we can replace the inner product with a kernel function according to Mercer's Theorem.

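A short sketch of (12), with an arbitrary hyperplane (w, b) and a handful of points chosen here only for illustration, computes the distance of each vector to the hyperplane and the resulting margin:

import numpy as np

def distance_to_hyperplane(w, b, x):
    # d(x) = |<w, x> + b| / ||w||
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

def margin(w, b, X):
    # M = smallest distance of a training vector to the hyperplane, as in (12)
    return min(distance_to_hyperplane(w, b, x) for x in X)

w, b = np.array([1.0, 1.0]), -1.0           # illustrative hyperplane <w, x> + b = 0
X = np.array([[2.0, 2.0], [0.0, 0.0], [3.0, 1.0], [-1.0, 0.0]])
print([round(distance_to_hyperplane(w, b, x), 3) for x in X])   # [2.121, 0.707, 2.121, 1.414]
print(round(margin(w, b, X), 3))                                # 0.707
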
IV. SUPPORT VECTOR MACHINES

A. Primal Form
Of all the possible decision boundaries (hyperplanes), the
optimal hyperplane is the one that maximizes the margin
between the vectors of the two classes. Thus, the search
for the maximal margin hyperplane leads to a minimax
optimization problem:

max_{w,b} min_{xi ∈ X} |⟨w, xi⟩ + b| / ||w||    (13)

subject to yi(⟨w, xi⟩ + b) ≥ 1, ∀(xi, yi) ∈ Z = X × Y.


We see from Fig. 4 that the optimal margin is given by
2/||w||. Thus, (13) can be reformulated as

max_{w,b} 2/||w||    (14)

subject to yi(⟨w, xi⟩ + b) ≥ 1, ∀(xi, yi) ∈ X × Y,

which is equivalent to the problem

min_{w,b} (1/2)||w||²    (15)

subject to yi(⟨w, xi⟩ + b) ≥ 1, ∀(xi, yi) ∈ X × Y.

This is a quadratic programming (QP) problem and is called
the Primal form. The solution of (15) gives us the optimal
parameters of the hyperplane, which in turn give us the optimal
decision function:

f(x, w*) = sign(⟨w*, x⟩ + b*)    (16)

Fig. 4. Maximum margin hyperplane using the support vectors (encircled) to define the optimal margin.

For non-separable data, if a map φ is known that transforms the non-separable training space to a separable feature
space, then the QP problem (15) becomes

min_{w,b} (1/2)||w||²    (17)

subject to yi(⟨w, φ(xi)⟩ + b) ≥ 1, ∀(xi, yi) ∈ X × Y,

and the decision function (16) becomes

f(x, w*) = sign(⟨w*, φ(x)⟩ + b*).    (18)

This QP problem is known as the Primal form and can
be solved efficiently for linearly separable training data or
training spaces with a known map φ. An example of a
solution on a training dataset of 40 random linearly separable
vectors is illustrated in Fig. 4.

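For small separable datasets the primal problem (15) can be handed directly to a general-purpose constrained optimizer. The sketch below is only an illustration on a four-point toy set (not the 40-vector example of Fig. 4) and assumes SciPy's SLSQP solver; it minimizes (1/2)||w||² subject to yi(⟨w, xi⟩ + b) ≥ 1:

import numpy as np
from scipy.optimize import minimize

# toy linearly separable data, two points per class
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])

def objective(theta):
    w = theta[:-1]                  # theta stacks [w_1, ..., w_d, b]
    return 0.5 * np.dot(w, w)       # (1/2)||w||^2, the cost in (15)

constraints = [
    # y_i * (<w, x_i> + b) - 1 >= 0 for every training vector
    {"type": "ineq",
     "fun": lambda theta, xi=xi, yi=yi: yi * (np.dot(theta[:-1], xi) + theta[-1]) - 1.0}
    for xi, yi in zip(X, y)
]

res = minimize(objective, x0=np.zeros(X.shape[1] + 1), method="SLSQP", constraints=constraints)
w_opt, b_opt = res.x[:-1], res.x[-1]
print(w_opt.round(3), round(b_opt, 3))      # roughly w = [0.25, 0.25], b = 0
print(np.sign(X @ w_opt + b_opt))           # reproduces the labels

For larger problems one would use a dedicated QP solver rather than SLSQP; the point here is only that (15) is an ordinary convex QP.
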
B. Dual Form
The primal form discussed previously is sufficient for
linearly separable training data or for data in which a map φ
is known to obtain a separable feature space. For the case of
non-separable data for which such a map φ is unknown,
we need to formulate the Dual form, for which the cost
functional does not depend on the hyperplane parameters.
Also, in general, the dual form of the QP problem is easier
to solve numerically.
The dual form is obtained by constructing the Lagrangian
of (15) using non-negative Lagrange multipliers αi ≥ 0:

L(w, b, α) = (1/2)||w||² − Σ_{i=1}^n αi (yi(⟨w, xi⟩ + b) − 1).    (19)

The QP problem (15) is solved using the saddle point of the
Lagrangian: maximizing L(w, b, α) with respect to α and
minimizing L(w, b, α) with respect to (w, b). The minimum
is given by

∂L(w, b, α)/∂w |_{w=w*} = w* − Σ_{i=1}^n αi yi xi = 0,    (20a)

∂L(w, b, α)/∂b |_{b=b*} = Σ_{i=1}^n αi yi = 0.    (20b)

Thus, from (20a) we get the solution for the optimal
hyperplane as a linear combination of the training vectors:

w* = Σ_{i=1}^n αi yi xi.    (21)

Substituting (20b) and (21) into (19) and maximizing the
Lagrangian, we get the dual form

max_α Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj yi yj ⟨xi, xj⟩    (22)

subject to Σ_{i=1}^n αi yi = 0, αi ≥ 0.

According to the Kuhn-Tucker theorem of optimization
[4], at the saddle point only those Lagrange multipliers αi
are non-zero which satisfy the constraint in (15) with equality,
i.e.,

αi*(yi(⟨w*, xi⟩ + b*) − 1) = 0, i = 1, . . . , n,    (23)

so αi* ≠ 0 only for yi(⟨w*, xi⟩ + b*) = 1. The vectors
corresponding to these αi* > 0 are the support vectors, and
it can be seen from (23) that they lie exactly on the margin.
Since the optimal hyperplane lies at equal distances
from the support vectors of the two classes, we have

(⟨w*, xA⟩ + b*)/||w*|| = −(⟨w*, xB⟩ + b*)/||w*||,

which gives the optimal bias

b* = −(1/2) Σ_{i=1}^n αi yi [⟨xi, xA⟩ + ⟨xi, xB⟩],    (24)

where xA and xB are support vectors from each of the two
classes respectively. Thus, the decision function is

f(x, w*) = sign(⟨w*, x⟩ + b*)    (25)

with w* given by (21) and b* given by (24).

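A corresponding sketch for the dual (22), again on a toy dataset and again assuming SciPy's SLSQP solver, maximizes the dual objective under Σ αi yi = 0 and αi ≥ 0, then recovers w* from (21); the bias here is taken from the margin conditions (23), which for separable data agrees with (24):

import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
n = len(y)
G = (X @ X.T) * np.outer(y, y)              # G_ij = y_i y_j <x_i, x_j>

def neg_dual(alpha):
    # negative of the dual objective (22); SciPy minimizes, so the sign is flipped
    return -(alpha.sum() - 0.5 * alpha @ G @ alpha)

res = minimize(neg_dual, x0=np.zeros(n), method="SLSQP",
               bounds=[(0.0, None)] * n,                              # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum_i alpha_i y_i = 0
alpha = res.x
sv = alpha > 1e-6                           # support vectors: alpha_i > 0, cf. (23)
w = (alpha * y) @ X                         # w* from (21)
b = np.mean(y[sv] - X[sv] @ w)              # margin conditions y_i(<w*, x_i> + b*) = 1
print(alpha.round(4), w.round(3), round(b, 3))
print(np.sign(X @ w + b))                   # reproduces the labels
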
C. The Kernel Trick
We now try to map non-separable data into separable
high-dimensional (possibly infinite-dimensional) feature spaces without
explicitly relying on a nonlinear map φ.
Define a functional K: X × X → R such that it represents
an inner product in some arbitrary feature space H:

K(xi, xj) = ⟨φ(xi), φ(xj)⟩_H.    (26)

We can then represent the dual form (22) in the feature
space H by replacing all the inner products with the kernel
function:

max_α Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj yi yj K(xi, xj)    (27)

subject to Σ_{i=1}^n αi yi = 0, αi ≥ 0,

or equivalently, in matrix form,

max_α Σ_{i=1}^n αi − (1/2) βᵀΛβ    (28)

subject to αᵀy = 0, αi ≥ 0, where Λ is the Gram matrix defined as

[Λ]ij = K(xi, xj) = ⟨φ(xi), φ(xj)⟩_H    (29)

and β = [α1 y1, . . . , αn yn]ᵀ.
By Mercer's Theorem⁵ we get the necessary and sufficient
conditions to represent the kernel K: X × X → R as an
inner product in some feature space H with a map φ as in
(26).
This procedure is termed the kernel trick, which lets us
use the SVM methodology on non-separable data as well.
Even so, the resulting decision function
is not always robust, and we may have to settle for
a soft margin hyperplane [4], wherein classification of new
data is not error-proof.
Thus, using the kernel trick, the optimal decision function
(25) becomes

f(x, w) = sign( Σ_{i=1}^n αi yi K(xi, x) + b* ).

Recall from (23) that αi ≠ 0 only for the support vectors.
Hence, we can simplify the decision function by considering
only xi ∈ SV, the set of support vectors:

f(x, w) = sign( Σ_{xi ∈ SV} αi yi K(xi, x) + b* )    (30)

where

b* = −(1/2) Σ_{i=1}^n αi yi [K(xi, xA) + K(xi, xB)]

and xA and xB are support vectors from each of the two
classes respectively.

⁵Mercer's Theorem says that a symmetric function K(x1, x2) can be expressed as an inner product K(xi, xj) = ⟨φ(xi), φ(xj)⟩ for some φ if and only if K(x1, x2) is positive semi-definite [2].

D. Types of Kernel Functions
The main types of kernel functions used in practice are
the following:
1) Linear: ⟨x, x'⟩
2) Polynomial: (⟨x, x'⟩ + r)^d
3) RBF: exp(−γ||x − x'||²)
4) Sigmoid: tanh(⟨x, x'⟩ + r)

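The kernels listed above are straightforward to write down; the sketch below (with parameter values γ, r and d picked arbitrarily for illustration) evaluates them, builds the Gram matrix (29) for a small sample, and checks that it is positive semi-definite, as Mercer's theorem requires of a valid kernel:

import numpy as np

def linear(x, z):
    return x @ z

def polynomial(x, z, gamma=1.0, r=1.0, d=3):
    return (gamma * (x @ z) + r) ** d

def rbf(x, z, gamma=0.5):
    return np.exp(-gamma * np.linalg.norm(x - z) ** 2)

def sigmoid(x, z, gamma=0.1, r=0.0):
    return np.tanh(gamma * (x @ z) + r)

def gram_matrix(K, X):
    # [Lambda]_ij = K(x_i, x_j), as in (29)
    n = len(X)
    return np.array([[K(X[i], X[j]) for j in range(n)] for i in range(n)])

X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.5]])
Lam = gram_matrix(rbf, X)
print(Lam.round(3))
print(np.all(np.linalg.eigvalsh(Lam) >= -1e-10))    # Mercer: Gram matrix is PSD
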
V. APPLICATION OF SVM - AN EXAMPLE
A two-dimensional binary classification problem for a
training dataset containing 200 linearly separable vectors has
been considered. Following the methodology described in the
previous sections, an optimal hyperplane is constructed using
a linear kernel function after finding the support vectors. The
result of the zero-error classification is shown in Fig. 6 with
the support vectors encircled. The scikit-learn library [5] is used
to perform this simulation; the source code is listed in Appendix A.

Fig. 6. 2D binary classification using a linear kernel SVM
VI. CONCLUSION
In the course of this discussion on Statistical Learning
Theory (SLT) and the derived learning algorithm, Support
Vector Machines (SVM), we have covered a variety of
concepts discussed in class. The main concepts that were
used in the mathematical formulation of SLT and SVM were
inner products, affine spaces/linear varieties, transformation
mappings, convex optimization/quadratic programming, the projection theorem on Hilbert spaces (to find the distance of the support
vectors to the hyperplane) and the primal/dual formulation of
optimization problems, among many others. In totality, this
exercise has cemented the concepts discussed in class and
has also provided an opportunity to understand a popular
topic of interest, namely, Support Vector Classification.

REFERENCES
[1] Vapnik, Vladimir N. "An overview of statistical learning theory." IEEE Transactions on Neural Networks 10.5 (1999): 988-999.
[2] Boser, Bernhard E., Isabelle M. Guyon, and Vladimir N. Vapnik. "A training algorithm for optimal margin classifiers." Proceedings of the Fifth Annual Workshop on Computational Learning Theory. ACM, 1992.
[3] Cortes, Corinna, and Vladimir Vapnik. "Support-vector networks." Machine Learning 20.3 (1995): 273-297.
[4] Schölkopf, Bernhard. "Support vector learning." (1997).
[5] Pedregosa, Fabian, et al. "Scikit-learn: Machine learning in Python." The Journal of Machine Learning Research 12 (2011): 2825-2830.
[6] Evgeniou, Theodoros, Massimiliano Pontil, and Tomaso Poggio. "Statistical learning theory: A primer." International Journal of Computer Vision 38.1 (2000): 9-13.
[7] Schölkopf, Bernhard. "Statistical learning and kernel methods." (2000).
[8] Bennett, Kristin P., and Erin J. Bredensteiner. "Duality and geometry in SVM classifiers." ICML, 2000.
[9] Zhou, Dengyong, et al. "Global geometry of SVM classifiers." Institute of Automation, Chinese Academy of Sciences, Tech. Rep., AI Lab, 2002.
[10] Tong, Simon, and Daphne Koller. "Support vector machine active learning with applications to text classification." The Journal of Machine Learning Research 2 (2002): 45-66.

APPENDIX - A

Python Source Code


File: /media/udrive/Acad/Fall 2013/ME510/Project/svm.py
import numpy as np
import pylab as pl
from sklearn import svm
# we create 200 separable points
np.random.seed(0)
X = np.r_[np.random.randn(100, 2) + [2, 2], np.random.randn(100, 2) - [2, 2]]
Y = [0] * 100 + [1] * 100
# fit the model
clf = svm.SVC(kernel='linear')
clf.fit(X, Y)
# get the optimal hyperplane
w = clf.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(-5, 5)
yy = a * xx - (clf.intercept_[0]) / w[1]
# plot the margins that pass through the support vectors
b = clf.support_vectors_[0]
yy_down = a * xx + (b[1] - a * b[0])
b = clf.support_vectors_[-1]
yy_up = a * xx + (b[1] - a * b[0])
# plot the separating line, the data points, and the support vectors
pl.hot()
pl.plot(xx, yy, 'k-')
pl.plot(xx, yy_down, 'k--')
pl.plot(xx, yy_up, 'k--')
pl.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
s=80, facecolors='none')
pl.scatter(X[:, 0], X[:, 1], c=Y)
pl.axis('tight')
pl.xlabel('x_1')
pl.ylabel('x_2')
pl.show()

REF. [1]
988

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 5, SEPTEMBER 1999

An Overview of Statistical Learning Theory


Vladimir N. Vapnik

OF THE

In order to choose the best available approximation to the


supervisors response, one measures the loss or discrepancy
between the response
of the supervisor to
and the response
provided by the
a given input
learning machine. Consider the expected value of the loss,
given by the risk functional
(2)
which miniThe goal is to find the function
(over the class of functions
mizes the risk functional
in the situation where the joint probabilis unknown and the only available
ity distribution
information is contained in the training set (1).
C. Three Main Learning Problems

LEARNING PROBLEM

This formulation of the learning problem is rather general.


It encompasses many specific problems. Below we consider
the main ones: the problems of pattern recognition, regression
estimation, and density estimation.
The Problem of Pattern Recognition: Let the supervisors
take on only two values
and let
output
be a set of indicator functions (functions
which take on only two values zero and one). Consider the
following loss-function:

Re

A. Function Estimation Model

fe

re

N this section we consider a model of the learning and show


that analysis of this model can be conducted in the general
statistical framework of minimizing expected loss using observed data. We show that practical problems such as pattern
recognition, regression estimation, and density estimation are
particular case of this general model.

nc

I. SETTING

B. Problem of Risk Minimization

es

AbstractStatistical learning theory was introduced in the late


1960s. Until the 1990s it was a purely theoretical analysis of the
problem of function estimation from a given collection of data.
In the middle of the 1990s new types of learning algorithms
(called support vector machines) based on the developed theory
were proposed. This made statistical learning theory not only
a tool for the theoretical analysis but also a tool for creating
practical algorithms for estimating multidimensional functions.
This article presents a very general overview of statistical learning
theory including both theoretical and algorithmic aspects of the
theory. The goal of this overview is to demonstrate how the
abstract learning theory established conditions for generalization
which are more general than those discussed in classical statistical paradigms and how the understanding of these conditions
inspired new algorithmic approaches to function estimation problems. A more detailed overview of the theory (without proofs) can
be found in Vapnik (1995). In Vapnik (1998) one can find detailed
description of the theory (including proofs).

The model of learning from examples can be described


using three components:
1) a generator of random vectors , drawn independently
;
from a fixed but unknown distribution
2) a supervisor that returns an output vector for every
input vector , according to a conditional distribution
, also fixed but unknown;
function1
3) a learning machine capable of implementing a set of
.
functions
The problem of learning is that of choosing from the given
the one which predicts the
set of functions
supervisors response in the best possible way. The selection
is based on a training set of random independent identically
distributed (i.i.d.) observations drawn according to

(1)

if
if

For this loss function, the functional (2) provides the probability of classification error (i.e., when the answers given
by supervisor and the answers given by indicator function
differ). The problem, therefore, is to find the function
which minimizes the probability of classification errors when
is unknown, but the data (1) are
probability measure
given.
The Problem of Regression Estimation: Let the supervibe a
sors answer be a real value, and let
set of real functions which contains the regression function

It is known that if
then the regression function
is the one which minimizes the functional (2) with the the
following loss-function:

Manuscript received January 11, 1999; revised May 20, 1999.


The author is with AT&T Labs-Research, Red Bank, NJ 07701 USA.
Publisher Item Identifier S 1045-9227(99)07267-7.
1 This is the general case which includes a case where the supervisor uses
a function y f (x):

(3)

(4)
Thus the problem of regression estimation is the problem
of minimizing the risk functional (2) with the loss function

10459227/99$10.00 1999 IEEE

VAPNIK: OVERVIEW OF STATISTICAL LEARNING THEORY

989

It is known that desired density minimizes the risk functional


(2) with the loss-function (5). Thus, again, to estimate the
density from the data one has to minimize the risk-functional
under the condition where the corresponding probability meais unknown but i.i.d. data
sure

are given.
The General Setting of the Learning Problem: The general
setting of the learning problem can be described as follows.
be defined on the space
Let the probability measure
Consider the set of functions
The goal is: to
minimize the risk functional
(6)
is unknown but an i.i.d. sample

which one needs to minimize in order to find the regression


estimate (i.e., the least square method).
In order to estimate a density function from a given set of
one uses the loss function (5). Putting this
functions
loss function into (8) one obtains the maximum likelihood
method: the functional

which one needs to minimize in order to find the approximation to the density.
Since the ERM principle is a general formulation of these
classical estimation problems, any theory concerning the ERM
principle applies to the classical methods as well.

re

if probability measure

The ERM principle is quite general. The classical methods


for solving a specific learning problem, such as the least
squares method in the problem of regression estimation or
the maximum likelihood method in the problem of density
estimation are realizations of the ERM principle for the
specific loss functions considered above.
Indeed, in order to specify the regression problem one
introduces an
-dimensional variable
and uses loss function (4). Using this loss
function in the functional (8) yelds the functional

es

(5)

E. Empirical Risk Minimization Principle


and the Classical Methods

nc

(4) in the situation where the probability measure


is
unknown but the data (1) are given.
The Problem of Density Estimation: Finally, consider the
problem of density estimation from the set of densities
For this problem we consider the following
loss-function:

fe

(7)

Re

is given.
The learning problems considered above are particular cases
of this general problem of minimizing the risk functional (6) on
the basis of empirical data (7), where describes a pair
and
is the specific loss function [for example, one
of (3), (4), or (5)]. Below we will describe results obtained
for the general statement of the problem. To apply it for
specific problems one has to substitute the corresponding lossfunctions in the formulas obtained.

F. Four Parts of Learning Theory


Learning theory has to address the following four questions.
1) What are the conditions for consistency of the ERM
principle?
To answer this question one has to specify the necessary and sufficient conditions for convergence in probability2 of the following sequences of the random values.
a)

D. Empirical Risk Minimization Induction Principle


In order to minimize the risk functional (6), for an unknown
the following induction principle is
probability measure
usually used.
is replaced by the
The expected risk functional
empirical risk functional

The values of risks


converging to the
[where
minimal possible value of the risk
are the expected risks for
each minimizing the empirical
functions
risk
(9)

b)

The

values

of

obtained empirical risks


converging to the
minimal possible value of the risk

(8)

constructed on the basis of the training set (7).


which
The principle is to approximate the function
which minimizes
minimizes risk (6) by the function
empirical risk (8). This principle is called the empirical risk
minimization induction principle (ERM principle).

(10)
2 Convergence in probability of values R( ) means that for any " > 0
`
and for any  > 0 there exists a number `0 = `0 (";  ) such, that for any
` > `0 with probability at least 1  the inequality

0
R ( ` ) 0 R ( 0 ) < "

holds true.

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 5, SEPTEMBER 1999

II. THE THEORY OF CONSISTENCY OF LEARNING PROCESSES

converges in probability to zero.


From this theorem it follows that the analysis of the ERM
principle requires an analysis of the properties of uniform
convergence of the expectations to their probabilities over the
given set of functions.
B. The Necessary and Sufficient Conditions
for Uniform Convergence
To describe the necessary and sufficient condition for uniform convergence (11), we introduce a concept called the
on the sample
entropy of the set of functions
of size
We introduce this concept in two steps: first for sets of
indicator functions and then for sets of real-valued functions.
Entropy of the Set of Indicator Functions: Let
be a set of indicator functions, that is the functions
which take on only the values zero or one. Consider a sample

Re

fe

re

The theory of consistency is an asymptotic theory. It describes the necessary and sufficient conditions for convergence
of the solutions obtained using the proposed method to the
best possible as the number of observations is increased. The
question arises:
Why do we need a theory of consistency if our goal is to
construct algorithms for a small (finite) sample size?
The answer is:
We need a theory of consistency because it provides not
only sufficient but necessary conditions for convergence of
the empirical risk minimization inductive principle. Therefore
any theory of the empirical risk minimization principle must
satisfy the necessary and sufficient conditions.
In this section, we introduce the main capacity concept (the
so-called VapnikCervonenkis (VC) entropy which defines
the generalization ability of the ERM principle. In the next
sections we show that the nonasymptotic theory of learning is
based on different types of bounds that evaluate this concept
for a fixed amount of observations.

This type of convergence is called uniform one-sided convergence.


In other words, according to the Key theorem the conditions
for consistency of the ERM principle are equivalent to the
conditions for existence of uniform one-sided convergence
(11).
This theorem is called the Key theorem because it asserts
that any analysis of the convergence properties of the ERM
principle must be a worst case analysis. The necessary condition for consistency (not only the sufficient condition) depends
on whether or not the deviation for the worst function over
the given set of of functions

es

Equation (9) shows that solutions found using ERM


converge to the best possible one. Equation (10) shows
that values of empirical risk converge to the value of
the smallest risk.
2) How fast does the sequence of smallest empirical risk
values converge to the smallest actual risk? In other
words what is the rate of generalization of a learning
machine that implements the empirical risk minimization
principle?
3) How can one control the rate of convergence (the rate of
generalization) of the learning machine?
4) How can one construct algorithms that can control the
rate of generalization?
The answers to these questions form the four parts of
learning theory:
1) the theory of consistency of learning processes;
2) the nonasymptotic theory of the rate of convergence of
learning processes;
3) the theory of controlling the generalization of learning
processes;
4) the theory of constructing learning algorithms.

nc

990

A. The Key Theorem of the Learning Theory


The key theorem of the theory concerning the ERM-based
learning processesis the following [27].
be a set of functions
The Key Theorem: Let
that has a bounded loss for probability measure

Then for the ERM principle to be consistent it is necessary and


converge uniformly
sufficient that the empirical risk
over the set
as follows:
to the actual risk
(11)

(12)
Let us characterize the diversity of this set of functions
on the given sample by a quantity
that represents the number of different
separations of this sample that can be obtained using functions
from the given set of indicator functions.
Let us write this in another form. Consider the set of
-dimensional binary vectors

that one obtains when takes various values from


Then
is the number of difgeometrically speaking
ferent vertices of the -dimensional cube that can be obtained
and the set of functions
on the basis of the sample
Let us call the value

the random entropy. The random entropy describes the diversity of the set of functions on the given data.
is a random variable since it was constructed using random
i.i.d. data. Now we consider the expectation of the random
entropy over the joint distribution function

VAPNIK: OVERVIEW OF STATISTICAL LEARNING THEORY

991

We call this quantity the entropy of the set of indicator


,
on samples of size It depends on
functions
,
, the probability measure
the set of functions
, and the number of observations The entropy describes
the expected diversity of the given set of indicator functions
on the sample of size
The main result of the theory of consistency for the pattern recognition problem (the consistency for indicator loss
function) is the following theorem [24].
Theorem: For uniform two-sided convergence of the frequencies to their probabilities3

is taken with respect to product-measure


The main results of the theory of uniform convergence of the
empirical risk to actual risk for bounded loss function includes
the following theorem [24].
Theorem: For uniform two-sided convergence of the empirical risks to the actual risks
(16)
it is necessary and sufficient that the equality

(13)

C. Three Milestones in Learning Theory


In this section, for simplicity, we consider a set of indicator
(i.e., we consider the problem of
functions
pattern recognition). The results obtained for sets of indicator
functions can be generalized for sets of real-valued functions.
In the previous section we introduced the entropy for sets
of indicator functions

re

hold.
Slightly modifying the condition (14) one can obtain the
necessary and sufficient condition for one-sided uniform convergence (11).
Entropy of the Set of Real Functions: Now we generalize
the concept of entropy for sets of real-valued functions. Let
be a set of bounded loss functions.
Using this set of functions and the training set (12) one
can construct the following set of -dimensional real-valued
vectors

es

(14)

be valid.
Slightly modifying the condition (17) one can obtain the
necessary and sufficient condition for one-sided uniform convergence (11).
According to the key assertion this implies the necessary and
sufficient conditions for consistency of the ERM principle.

nc

it is necessary and sufficient that the equality

(17)

(15)

Re

fe

This set of vectors belongs to the -dimensional cube with


and has a finite -net4 in the metric
Let
the edge
be the number of elements of the
minimal -net of the set of vectors
The logarithm of the (random) value

Now, we consider two new functions that are constructed


the annealed VCon the basis of the values
entropy

and the growth function

is called the random VC-entropy of the set of functions


on the sample
The expectation
of the random VC-entropy

is called the VC-entropy of the set of functions


on the sample of the size
Here expectation
sets of indicator functions R( ) defines probability and Remp ( )
defines frequency.
4 The set of vectors q ( ); 2 3 has minimal "-net q ( );
; q( N )
1
if: 1. There exist N = N 3 ("; z1 ;
; z` ) vectors q( 1 ); ; q( N ); such
that for any vector q ( 3 ); 3 3 one can find among these N vectors one
q( r ) which is "-close to this vector (in a given metric). For a C metric that
means
3 The

111

111

111

(q( 3 ); q( r )) = max jQ(zi 3 ) 0 Q(zi ; r )j  ":


1i`

is minimal number of vectors which possess this property.


that VC-entropy is different from classical metrical "-entropy

5 Note

N 3 (") is
Q(z; ); 2 3:
where

Hcl3 (") = ln N 3 (")

cardinality of the minimal "-net of the set of functions

These functions are determined in such a way that for any


the inequalities

are valid. On the basis of these functions, the three main


milestones in statistical learning theory are constructed.
In the previous section, we introduced the equation

describing the necessary and sufficient condition for consistency of the ERM principle. This equation is the first milestone
in learning theory: any machine minimizing empirical risk
should satisfy it.
However, this equation says nothing about the rate of
to the minimal one
convergence of obtained risks
It is possible that the ERM principle is consistent but
has arbitrary slow asymptotic rate of convergence.

992

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 5, SEPTEMBER 1999

The question is:


Under what conditions is the asymptotic rate of convergence
fast?
We say that the asymptotic rate of convergence is fast if for
the exponential bound
any

is some constant.

Theorem: Any growth function either satisfies the equality

or is bounded by the inequality

where

es

In other words the growth function will be either a linear


function or will be bounded by a logarithmic function. (For
example, it cannot be of the form
We say that the VC dimension of the set of indicator
is infinite if the Growth function
functions
for this set of functions is linear.
We say that the VC dimension of the set of indicator
is finite and equals if the growth
functions
function is bounded by a logarithmic function with coefficient
The finiteness of the VC-dimension of the set of indicator
functions implemented by the learning machine forms the
necessary and sufficient condition for consistency of the ERM
method independent of probability measure. Finiteness of VCdimension also implies fast convergence.

Re

fe

re

describes the sufficient condition for fast convergence.6 It is the


second milestone in statistical learning theory: it guarantees a
fast asymptotic rate of convergence.
Note that both the equation describing the necessary and
sufficient condition for consistency and the one that describes
the sufficient condition for fast convergence of the ERM
(both
method are valid for a given probability measure
and VC-annealed entropy
are conVC-entropy
structed using this measure). However our goal is to construct
a learning machine for solving many different problems (i.e.,
for many different probability measures).
The question is:
Under what conditions is the ERM principle consistent and
rapidly converging, independently of the probability measure?
The following equation describes the necessary and sufficient conditions for consistency of ERM for any probability
measure

is an integer for which

nc

holds true, where


The equation

A. The Structure of the Growth Function

This condition is also sufficient for fast convergence.


This equation is the third milestone in statistical learning
theory. It describes the conditions under which the learning
machine implementing ERM principle has an asymptotic high
rate of convergence independently of the problem to be solved.
These milestones form a foundation for constructing both
distribution independent bounds and rigorous distribution dependent bounds for the rate of convergence of learning machines.
III. BOUNDS ON THE RATE OF CONVERGENCE
OF THE LEARNING PROCESSES
In order to estimate the quality of the ERM method for
a given sample size it is necessary to obtain nonasymptotic
bounds on the rate of uniform convergence.
A nonasymptotic bound of the rate of convergence can
be obtained using a new capacity concept, called the VC
dimension, which allows us to obtain a constructive bound
for the growth function.
The concept of VC-dimension is based on a remarkable
.
property of the growth-function
6 The

necessity of this condition for fast convergence is open question.

B. Equivalent Definition of the VC Dimension

In this section, we give an equivalent definition of the VC


dimension of sets of indicator functions and then we generalize
this definition for sets of real-valued functions.
The VC Dimension of a Set of Indicator Functions: The
VC-dimension of a set of indicator functions
is the maximum number of vectors
which can
possible ways using functions of this
be separated in all
there
set7 (shattered by this set of functions). If for any
exists a set of vectors which can be shattered by the set
then the VC-dimension is equal to infinity.
The VC Dimension of a Set of Real-Valued Functions: Let
be a set of real-valued functions
can approach
and
bounded by constants and
can approach
Let us consider along with the set of real-valued functions
the set of indicator functions
(18)
where

is some constant,

is the step function

if
if
The VC dimension of the set of real valued functions
is defined to be the VC-dimension of the
set of indicator functions (18).
7 Any indicator function separates a set of vectors into two subsets: the
subset of vectors for which this function takes value zero and the subset of
vectors for which it takes value one.

VAPNIK: OVERVIEW OF STATISTICAL LEARNING THEORY

993

C. Two Important Examples


Example 1:
1) The VC-dimension of the set of linear indicator functions

Case 1The Set of Totally Bounded Functions: Without


restriction in generality, we assume that
(19)
The main result in the theory of bounds for sets of totally
bounded functions is the following [20][22].
, the inequality
Theorem: With probability at least

in -dimensional coordinate space


is
, since using functions of this set one
equal to
vectors. Here
is the step
can shatter at most
function, which takes value one, if the expression in the
brackets is positive and takes value zero otherwise.
2) The VC-dimension of the set of linear functions

(20)
holds true simultaneously for all functions of the set (19),
where

(21)

fe

the -margin separating hyperplane if it classifies vectors


as follows:
if
if

nc

es

For the set of indicator functions,


This theorem provides bounds for the risks of all funcwhich
tions of the set (18) [including the function
minimizes empirical risk (8)]. The bounds follow from the
bound on uniform convergence (13) for sets of totally bounded
functions that have finite VC dimension.
Case 2The Set of Unbounded Functions: Consider the
set of (nonnegative) unbounded functions
It is easy to show (by constructing an example) that,
without additional information about the set of unbounded
functions and/or probability measures, it is impossible to
obtain an inequality of type (20). Below we use the following
information:

re

in -dimensional coordinate space


is
because the VC-dimension of
also equal to
corresponding linear indicator functions is equal to
(using
instead of
does not changes the set of
indicator functions).
Example 2: We call a hyperplane

Re

(classifications of vectors that fall into the margin


are undefined).
belong to a sphere of radius
Theorem: Let vectors
. Then the set of -margin separating hyperplanes has the
VC dimension bounded by the inequality

These examples show that in general the VC dimension


, where
is
of the set of hyperplanes is equal to
dimensionality of input space. However, the VC dimension
of the set of -margin separating hyperplanes (with a large
can be less than
This fact will play
value of margin
an important role for constructing new function estimation
methods.
D. Distribution Independent Bounds for the Rate of
Convergence of Learning Processes
Consider sets of functions which possess a finite VCWe distinguish between two cases:
dimension
1) the case where the set of loss functions
is a set of totally bounded functions;
2) the case where the set of loss functions
is not necessarily a set of totally bounded functions.

(22)

is some fixed constant.8


where
The main result for the case of unbounded sets of loss
functions is the following [20][22].
the inequality
Theorem: With probability at least

(23)
holds true simultaneously for all functions of the set, where
is determined by (22),
The theorem bounds the risks for all functions of the set
(including the function
8 This inequality describes some general properties of distribution functions
of the random variables  Q(z; ), generated by the P (z ): It describes the
tails of distributions (the probability of big values for the random variables
 ): If the inequality (22) with p > 2 holds, then the distributions have socalled light tails (large values do not occurs very often). In this case rapid
convergence is possible. If, however, (22) holds only for p < 2 (large values
of the random variables  occur rather often) then the rate of convergence
will be small (it will be arbitrarily small if p is sufficiently close to one).

994

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 5, SEPTEMBER 1999

sample size).9 The goal is to specify methods which are


appropriate for a given sample size.

E. Problem of Constructing Rigorous


(Distribution Dependent) Bounds
To construct rigorous bounds for the rate of convergence
one has to take into account information about probability
be a set of all probability measures and let
measure. Let
be a subset of the set
We say that one has prior
if
information about an unknown probability measure
one knows the set of measures that contains
Consider the following generalization of the growth function:

the following inequality:

fe

re

nc

Then for sufficiently large

The ERM principle is intended for dealing with a large


sample size. Indeed, the ERM principle can be justified by
is large, the second
considering the inequalities (20). When
summand on the right hand side of inequality (20) becomes
small. The actual risk is then close to the value of the empirical
risk. In this case, a small value of the empirical risk provides
a small value of (expected) risk.
is small, then even a small
However, if
does not guarantee a small value of risk. In this case the
requires a new principle, based on
minimization for
the simultaneous minimization of two terms in (20) one of
which depends on the value of the empirical risk while the
second depends on the VC-dimension of the set of functions.
To minimize risk in this case it is necessary to find a method
which, along with minimizing the value of empirical risk,
controls the VC-dimension of the learning machine.
The following principle, which is called the principle of
structural risk minimization (SRM), is intended to minimize
the risk functional with respect to both empirical risk and
VC-dimension of the set of functions.
be provided with
Let the set of functions
is composed of the nested subsets of
a structure: so that
such that
functions

es

For indicator functions


and for the extreme
the generalized growth function
case where
coincides with the growth function
For another extreme
the generalized
case where contains only one function
growth function coincides with the annealed VC-entropy.
The following assertion is true [20], [26].
Theorem: Suppose that a set of loss-functions is bounded

A. Structural Risk Minimization Induction Principle

Re

holds true.
From this bound it follows that for sufficiently large with
simultaneously for all
(including the
probability
one that minimizes the empirical risk) the following inequality
is valid:

However, this bound is nonconstructive because theory


does not specify a method to evaluate the generalized growth
function. To make this bound constructive and rigorous one
has to estimate the generalized growth function for a given
set of loss-functions and a given set of probability measures.
This is one of the main subjects of the current learning theory
research.
IV. THEORY FOR CONTROLLING THE
GENERALIZATION OF LEARNING MACHINES
The theory for controlling the generalization of a learning
machine is devoted to constructing an induction principle for
minimizing the risk functional which takes into account the
size of the training set (an induction principle for a small

(24)

and
An admissible structure is one satisfying the following three
properties.
is everywhere dense in
1) The set
2) The VC-dimension
of each set
of functions is
finite.
of the structure contains totally bounded
3) Any element
functions
The SRM principle suggests that for a given set of obserchoose the element of structure
, where
vations
and choose the particular function from
for which
the guaranteed risk (20) is minimal.
The SRM principle actually suggests a tradeoff between
the quality of the approximation and the complexity of the
approximating function. (As increases, the minima of empirical risk are decreased; however, the term responsible for
the confidence interval [summand in (20)] is increased. The
SRM principle takes both factors into account.)
The main results of the theory of SRM are the following
[9], [22].
Theorem: For any distribution function the SRM method
provides convergence to the best possible solution with probability one.
In other words SRM method is universally strongly consistent.
9 The

sample size ` is considered to be small if `=h is small, say `=h < 20:

VAPNIK: OVERVIEW OF STATISTICAL LEARNING THEORY

995

Theorem: For admissible structures the method of structural


for
risk minimization provides approximations
converge to the best
which the sequence of risks
with asymptotic rate of convergence10
one
(25)
if the law

is such that
(26)

is the bound for functions from


In (25)
the rate of approximation

and

is

can find the exact solution while when the minimum of this
functional is nonzero one can find an approximate solution.
Therefore by constructing a separating hyperplane one can
control the value of empirical risk.
Unfortunately the set of separating hyperplanes is not flexible enough to provide low empirical risk for many real-life
problems [13].
Two opportunities were considered to increase the flexibility
of the sets of functions:
1) to use a richer set of indicator functions which are
superpositions of linear indicator functions;
2) to map the input vectors in high dimensional feature
space and construct in this space a -margin separating
hyperplane (see Example 2 in Section III-C)
The first idea corresponds to the neural network. The second
idea leads to support vector machines.

V. THEORY OF CONSTRUCTING LEARNING ALGORITHMS


B. Sigmoid Approximation of Indicator
Functions and Neural Nets

nc

es

To describe the idea behind the NN let us consider the


method of minimizing the functional (28). It is impossible
to use regular gradient-based methods of optimization to minimize this functional. (The gradient of the indicator function
is either equal to zero or is undefined.) The solution
is to approximate the set of indicator functions (27) by socalled sigmoid functions

Re

A. Methods of Separating Hyperplanes and


Their Generalization

fe

re

To implement the SRM induction principle in learning


algorithms one has to control two factors that exist in the
bound (20) which has to be minimized:
1) the value of empirical risk;
with the
2) the capacity factor (to choose the element
appropriate value of VC dimension).
Below we restrict ourselves to the pattern recognition case.
We consider two type of learning machines:
1) Neural networks (NNs) that were inspired by the biological analogy to the brain;
2) the support vector machines that were inspired by statistical learning theory.
We will discuss how each corresponding machine can
control these factors.

Consider first the problem of minimizing empirical risk on


the set of linear indicator functions
(27)

where

(29)
is a smooth monotonic function such that
For example, the functions

are sigmoid functions.


For the set of sigmoid function, the empirical risk functional
(30)
It has a gradient grad
and therefore
is smooth in
can be minimized using gradient-based methods. For example,
the gradient descent method uses the following update rule:

Let

be a training set, where

is a vector,

To minimize the empirical risk one has to find the parameters (weights) w which minimize the
empirical risk functional

R_emp(w) = (1/ℓ) Σ_{j=1}^{ℓ} ( y_j − S{(w · x_j)} )²,

where S(u) denotes the sigmoid approximation of the indicator function. Because this functional is smooth in w, its gradient can be computed and one can apply a gradient-based procedure

w_new = w_old − γ(n) ∇_w R_emp(w_old),

where the value γ(n) ≥ 0 depends on the iteration number n. For convergence of the gradient descent method to a local minimum, it is enough that the step values satisfy the conditions

Σ_n γ(n) = ∞,   Σ_n γ²(n) < ∞.   (28)

There are several methods for minimizing this functional; gradient descent is the simplest.^10

Thus, the idea is to use the sigmoid approximation at the stage of estimating the coefficients, and to use the indicator functions with these coefficients at the stage of recognition.

The generalization of this idea leads to feedforward NNs. In order to increase the flexibility of the set of decision rules of the learning machine, one considers a set of functions which are the superposition of several linear indicator functions (networks of neurons) [13] instead of the set of linear indicator functions (a single neuron). All indicator functions in this superposition are replaced by sigmoid functions.

A method for calculating the gradient of the empirical risk for the sigmoid approximation of NNs, called the backpropagation method, was found [15], [12]. Using this gradient descent method, one can determine the corresponding coefficient values (weights) of all elements of the NN.

In the 1990s, it was proven that the VC dimension of NNs depends on the type of sigmoid functions and the number of weights in the NN. Under some general conditions the VC dimension of the NN is bounded (although it is sufficiently large). Suppose that the VC dimension does not change during the NN training procedure; then the generalization ability of the NN depends on how well the NN minimizes the empirical risk using sufficiently large training data.

The three main problems encountered when minimizing the empirical risk using the backpropagation method are as follows.

1) The empirical risk functional has many local minima. Optimization procedures guarantee convergence to some local minimum. In general the function which is found using the gradient-based procedure can be far from the best one. The quality of the obtained approximation depends on many factors, in particular on the initial parameter values of the algorithm.

2) Convergence to a local minimum can be rather slow (due to the high dimensionality of the weight space).

3) The sigmoid function has a scaling factor which affects the quality of the approximation. To choose the scaling factor one has to make a tradeoff between quality of approximation and the rate of convergence.

Therefore, a good minimization of the empirical risk depends in many respects on the art of the researcher.

^10 We say that the random variables ξ_ℓ, ℓ = 1, 2, ... converge to the value ξ_0 with asymptotic rate V(ℓ) if there exists a constant C such that P-lim_{ℓ→∞} V^{-1}(ℓ) |ξ_ℓ − ξ_0| ≤ C.
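As a concrete illustration of the scheme just described, the following minimal sketch trains a single sigmoid neuron by gradient descent on the smoothed empirical risk and then switches back to the indicator (threshold) function at the recognition stage. It is only a sketch under assumptions: the toy data, the step schedule γ(n) = 1/n (which satisfies the conditions in (28)), and all variable names are choices made for this example, not part of the text above.

import numpy as np

rng = np.random.default_rng(0)

# Toy linearly separable data with labels y in {0, 1}.
X = np.vstack([rng.normal(-1.5, 0.5, (50, 2)), rng.normal(+1.5, 0.5, (50, 2))])
y = np.hstack([np.zeros(50), np.ones(50)])

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

w = np.zeros(2)
b = 0.0
for n in range(1, 2001):
    gamma = 1.0 / n                               # sum(gamma) = inf, sum(gamma^2) < inf
    p = sigmoid(X @ w + b)                        # sigmoid approximation of the indicator
    grad_w = ((p - y) * p * (1 - p)) @ X / len(y)  # gradient of the squared empirical risk
    grad_b = np.mean((p - y) * p * (1 - p))
    w -= gamma * grad_w
    b -= gamma * grad_b

# Recognition stage: use the indicator function with the estimated coefficients.
predictions = (X @ w + b > 0).astype(int)
print("training errors:", int(np.sum(predictions != y)))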

C. The Optimal Separating Hyperplanes

To introduce the method which is an alternative to the NN, let us consider the optimal separating hyperplanes [25].

Suppose the training data

(x_1, y_1), ..., (x_ℓ, y_ℓ),   x ∈ R^n,  y ∈ {+1, −1},

can be separated by a hyperplane

(w · x) + b = 0.   (31)

We say that this set of vectors is separated by the optimal hyperplane (or the maximal margin hyperplane) if it is separated without error and the distance between the closest vector and the hyperplane is maximal.

To describe the separating hyperplane let us use the following form:

(w · x_i) + b ≥ 1   if y_i = 1,
(w · x_i) + b ≤ −1  if y_i = −1.

In the following we use a compact notation for these inequalities:

y_i [ (w · x_i) + b ] ≥ 1,   i = 1, ..., ℓ.   (32)

It is easy to check that the optimal hyperplane is the one that satisfies the conditions (32) and minimizes the functional

Φ(w) = ||w||².   (33)

(The minimization is taken with respect to both the vector w and the scalar b.)

The solution to this optimization problem is given by the saddle point of the Lagrange functional (Lagrangian)

L(w, b, α) = ½ (w · w) − Σ_{i=1}^{ℓ} α_i { y_i [ (w · x_i) + b ] − 1 },   (34)

where the α_i are Lagrange multipliers. The Lagrangian has to be minimized with respect to w and b and maximized with respect to α_i ≥ 0.

In the saddle point, the solutions w_0, b_0, and α^0 should satisfy the conditions

∂L(w_0, b_0, α^0)/∂b = 0,   ∂L(w_0, b_0, α^0)/∂w = 0.

Rewriting these equations in explicit form one obtains the following properties of the optimal hyperplane.

1) The coefficients α_i^0 for the optimal hyperplane should satisfy the constraints

Σ_{i=1}^{ℓ} α_i^0 y_i = 0,   α_i^0 ≥ 0,   i = 1, ..., ℓ.   (35)

2) The parameters of the optimal hyperplane (vector w_0) are a linear combination of the vectors of the training set:

w_0 = Σ_{i=1}^{ℓ} y_i α_i^0 x_i,   α_i^0 ≥ 0.   (36)

3) The solution must satisfy the following Kuhn-Tucker conditions:

α_i^0 { y_i [ (w_0 · x_i) + b_0 ] − 1 } = 0,   i = 1, ..., ℓ.   (37)

From these conditions it follows that only some training vectors in expansion (36), the support vectors, can have nonzero coefficients α_i^0 in the expansion of w_0. The support vectors are the vectors for which, in (32), the equality is achieved. Therefore we obtain

w_0 = Σ_{support vectors} y_i α_i^0 x_i.   (38)

Substituting the expression for w_0 back into the Lagrangian and taking into account the Kuhn-Tucker conditions, one obtains the functional

W(α) = Σ_{i=1}^{ℓ} α_i − ½ Σ_{i,j=1}^{ℓ} α_i α_j y_i y_j (x_i · x_j).   (39)

It remains to maximize this functional in the nonnegative quadrant

α_i ≥ 0,   i = 1, ..., ℓ,

under the constraint

Σ_{i=1}^{ℓ} α_i y_i = 0.   (40)
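The quadratic programme (39)-(40) can be handed to any general-purpose constrained optimizer. The sketch below does so on synthetic two-dimensional data; SciPy's SLSQP solver, the support-vector threshold, and the toy data are all assumptions of this illustration rather than anything prescribed above, but the recovered expansion (38) and the Kuhn-Tucker conditions can be checked directly on the result.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Two separable point clouds with labels y in {-1, +1}.
X = np.vstack([rng.normal(-2.0, 0.6, (20, 2)), rng.normal(+2.0, 0.6, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])
l = len(y)

# W(alpha) = sum(alpha) - 1/2 sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)   -- functional (39)
G = (y[:, None] * X) @ (y[:, None] * X).T

def neg_W(alpha):
    return -(alpha.sum() - 0.5 * alpha @ G @ alpha)

res = minimize(
    neg_W,
    x0=np.zeros(l),
    method="SLSQP",
    bounds=[(0.0, None)] * l,                               # alpha_i >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],   # sum_i alpha_i y_i = 0, constraint (40)
)
alpha = res.x

# Expansion (36)/(38): w_0 is a combination of the support vectors only.
support = alpha > 1e-6
w0 = ((alpha * y)[:, None] * X).sum(axis=0)
# b_0 from a Kuhn-Tucker (support) vector: y_i [(w_0 . x_i) + b_0] = 1.
b0 = y[support][0] - X[support][0] @ w0
print("support vectors:", int(support.sum()), "of", l)
print("distance between the two classes = 2/||w0|| =", 2 / np.linalg.norm(w0))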


Putting the expression (38) for w_0 into (31), we obtain the separating hyperplane as an expansion on support vectors

f(x) = sign( Σ_{support vectors} y_i α_i^0 (x_i · x) + b_0 ).   (41)

To construct the optimal hyperplane in the case when the data are linearly nonseparable, we introduce nonnegative variables ξ_i ≥ 0 and the functional

Φ(ξ) = Σ_{i=1}^{ℓ} ξ_i,

which we will minimize subject to the constraints

y_i [ (w · x_i) + b ] ≥ 1 − ξ_i,   ξ_i ≥ 0,   i = 1, ..., ℓ.   (42)

Using the same formalism with Lagrange multipliers, one can show that the optimal hyperplane also has an expansion (41) on support vectors. The coefficients α_i can be found by maximizing the same quadratic form as in the separable case (39) under slightly different constraints

0 ≤ α_i ≤ C,   i = 1, ..., ℓ,   Σ_{i=1}^{ℓ} α_i y_i = 0,

where C is a given trade-off constant.

D. The Support Vector Network

The support-vector network implements the following idea [21]: map the input vectors into a very high-dimensional feature space through some nonlinear mapping chosen a priori. In this space construct an optimal separating hyperplane. The goal is to create the situation described in Example 2 of Section III-C, where for Δ-margin separating hyperplanes the VC dimension is defined by the ratio R²/Δ². To generalize well, we control (decrease) the VC dimension by constructing an optimal separating hyperplane (that maximizes the margin). To increase the margin we use very high-dimensional spaces.

Example: Consider a mapping that allows us to construct decision polynomials in the input space. To construct a polynomial of degree two, one can create a feature space which has N = n(n+3)/2 coordinates of the form

z_1 = x_1, ..., z_n = x_n   (n coordinates),
z_{n+1} = x_1², ..., z_{2n} = x_n²   (n coordinates),
z_{2n+1} = x_1 x_2, ..., z_N = x_{n−1} x_n   (n(n−1)/2 coordinates),

where x = (x_1, ..., x_n). The separating hyperplane constructed in this space is a separating second-degree polynomial in the input space.

To construct a polynomial of degree d in an n-dimensional input space one has to construct a feature space whose dimension grows like n^d, where one then constructs the optimal hyperplane.

The problem then arises of how to computationally deal with such high-dimensional spaces: to construct a polynomial of degree 4 or 5 in a 200-dimensional space it is necessary to construct hyperplanes in a billion-dimensional feature space.

In 1992, it was noted [5] that for both describing the optimal separating hyperplane in the feature space (41) and estimating the corresponding coefficients of expansion of the separating hyperplane (39), one uses only the inner product of two vectors z(x_i) and z(x_j), which are the images in the feature space of the input vectors x_i and x_j. Therefore, if one can express the inner product of two vectors in the feature space,

( z(x_i) · z(x_j) ) = K(x_i, x_j),

as a function of two variables in input space, then it will be possible to construct solutions which are equivalent to the optimal hyperplane in the feature space. To get this solution one only needs to replace the inner product in (39) and (41) with the function K. In other words, one constructs nonlinear decision functions in the input space

f(x) = sign( Σ_{support vectors} y_i α_i K(x_i, x) + b )   (43)

that are equivalent to the linear decision functions in the feature space. The coefficients α_i in (43) are defined by solving the equation

W(α) = Σ_{i=1}^{ℓ} α_i − ½ Σ_{i,j=1}^{ℓ} α_i α_j y_i y_j K(x_i, x_j) → max   (44)

under constraints (42).

In 1909 Mercer proved a theorem which defines the general form of inner products in Hilbert spaces.

Theorem: The general form of the inner product in Hilbert space is defined by the symmetric positive definite function K(u, v) that satisfies the condition

∫∫ K(u, v) g(u) g(v) du dv ≥ 0

for all functions g satisfying the inequality

∫ g²(u) du < ∞.

Therefore any function K(u, v) satisfying Mercer's condition can be used for constructing rule (43), which is equivalent to constructing an optimal separating hyperplane in some feature space.

The learning machines which construct decision functions of the type (43) are called support vector networks or support vector machines (SVMs).^11

Using different expressions for the inner product K(x_i, x) one can construct different learning machines with arbitrary types of (nonlinear in input space) decision surfaces. For example, to specify polynomials of any fixed order d one can use the following function for the inner product in the corresponding feature space:

K(x_i, x) = [ (x_i · x) + 1 ]^d.

Radial basis function machines with decision functions of the form

f(x) = sign( Σ_i α_i K_γ(|x − x_i|) − b )

can be implemented by using a function of the type

K_γ(|x − x_i|) = exp{ −γ |x − x_i|² }.

In this case the SVM machine will find both the centers x_i and the corresponding weights α_i.

The SVM possesses some useful properties:

- The optimization problem for constructing an SVM has a unique solution.
- The learning process for constructing an SVM is rather fast.
- Simultaneously with constructing the decision rule, one obtains the set of support vectors.
- Implementation of a new set of decision functions can be done by changing only one function (the kernel K(x_i, x)), which defines the dot product in z-space.

^11 This name stresses that for constructing this type of machine, the idea of expanding the solution on support vectors is crucial. In the SVM the complexity of the construction depends on the number of support vectors rather than on the dimensionality of the feature space.
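The kernel substitution above can be checked numerically: the degree-two polynomial kernel equals the ordinary inner product of an explicit quadratic feature map. The map below uses one standard choice of scaling constants; those constants and the random test vectors are assumptions of this sketch, not taken from the text.

import numpy as np

def phi(x):
    # Explicit degree-2 feature map whose inner product reproduces ((x . x') + 1)^2.
    # Coordinates: 1, sqrt(2) x_i, x_i^2, and sqrt(2) x_i x_j for i < j.
    n = len(x)
    feats = [1.0]
    feats += list(np.sqrt(2.0) * x)
    feats += [x[i] * x[i] for i in range(n)]
    feats += [np.sqrt(2.0) * x[i] * x[j] for i in range(n) for j in range(i + 1, n)]
    return np.array(feats)

rng = np.random.default_rng(2)
u, v = rng.normal(size=3), rng.normal(size=3)

kernel_value = (u @ v + 1.0) ** 2      # K(u, v) computed in input space
feature_value = phi(u) @ phi(v)        # inner product computed in feature space
print(kernel_value, feature_value)     # the two numbers agree

The same computation in input space costs a handful of operations regardless of the feature-space dimension, which is precisely why the kernel form (43)-(44) remains tractable when the feature space is huge.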

VI. CONCLUSION

This article presents a very general overview of statistical learning theory. It demonstrates how an abstract analysis allows us to discover a general model of generalization.

According to this model, the generalization ability of learning machines depends on capacity concepts which are more sophisticated than merely the dimensionality of the space or the number of free parameters of the loss function (these concepts are the basis for the classical paradigm of generalization).

The new understanding of the mechanisms behind generalization not only changes the theoretical foundation of generalization (for example, from the new point of view the Occam's razor principle is not always correct), but also changes the algorithmic approaches to function estimation problems.

The approach described is rather general. It can be applied to various function estimation problems, including regression, density estimation, solving inverse equations, and so on.

Statistical learning theory started more than 30 years ago. The development of this theory did not involve many researchers. After the success of the SVM in solving real-life problems, the interest in statistical learning theory significantly increased. For the first time, abstract mathematical results in statistical learning theory have a direct impact on algorithmic tools of data analysis. In the last three years a lot of articles have appeared that analyze the theory of inference and the SVM method from different perspectives. These include:

1) obtaining better constructive bounds than the classical one described in this article (which are closer in spirit to the nonconstructive bound based on the growth function than to bounds based on the VC dimension concept). Success in this direction could lead, in particular, to creating machines that generalize better than the SVM based on the concept of optimal hyperplane;

2) extending the SVM ideology to many different problems of function and data analysis;

3) developing a theory that allows us to create kernels that possess desirable properties (for example that can enforce desirable invariants);

4) developing a new type of inductive inference that is based on direct generalization from the training set to the test set, avoiding the intermediate problem of estimating a function (the transductive type of inference).

The hope is that this very fast growing area of research will significantly boost all branches of data analysis.

E. Why Can Neural Networks and Support Vector Networks Generalize?

The generalization ability of both the NNs and the support vector networks is based on the factors described in the theory for controlling the generalization ability of learning processes. According to this theory, to guarantee a high rate of generalization of the learning machine one has to construct a structure

S_1 ⊂ S_2 ⊂ ... ⊂ S

on the set of decision functions S = { f(x, α), α ∈ Λ }, and then choose both an appropriate element S_k of the structure and a function within this element that minimizes bound (20). The bound (16) can be rewritten in the simple form

R(α) ≤ R_emp(α) + Φ(ℓ/h),   (45)

where the first term is an estimate of the risk and the second is the confidence interval for this estimate.

In designing an NN, one determines a set of admissible functions with some VC dimension h*. For a given amount ℓ of training data the value h* determines the confidence interval Φ(ℓ/h*) for the network. Choosing the appropriate element of a structure is therefore a problem of designing the network for a given training set.

During the learning process this network minimizes the first term in the bound (45) (the number of errors on the training set).

If it happens that at the stage of designing the network one constructs a network that is too complex (for the given amount of training data), the confidence interval Φ(ℓ/h*) will be large. In this case, even if one could minimize the empirical risk down to zero, the number of errors on the test set could still be large. This case is called overfitting.

To avoid overfitting (to get a small confidence interval) one has to construct networks with small VC dimension.

Therefore, to generalize well using an NN one must first suggest an appropriate architecture of the NN and second find in this network the function that minimizes the number of errors on the training data. For NNs both of these problems are solved using heuristics (see the remarks on the backpropagation method).

In support vector methods one can control both parameters: in the separable case one obtains the unique solution which minimizes the empirical risk (down to zero) using a Δ-margin separating hyperplane with the maximal margin (i.e., the subset with the smallest VC dimension). In the general case one obtains the unique solution when one chooses the value of the trade-off parameter C.
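The trade-off in (45) is easy to visualize numerically. The snippet below evaluates one commonly quoted textbook form of the confidence term, sqrt((h(ln(2ℓ/h)+1) − ln(η/4))/ℓ); that exact expression is an assumption of this sketch (it is not spelled out above), but it illustrates how the second term of (45) grows with the VC dimension h while the first term typically shrinks.

import numpy as np

def confidence_interval(l, h, eta=0.05):
    # Capacity term of a bound of the form (45): larger h widens it, larger l narrows it.
    # The formula is one standard form, assumed here for illustration only.
    return np.sqrt((h * (np.log(2.0 * l / h) + 1.0) - np.log(eta / 4.0)) / l)

l = 10_000
for h in (10, 100, 1_000, 5_000):
    print(f"h = {h:5d}   confidence term = {confidence_interval(l, h):.3f}")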

ACKNOWLEDGMENT

The author wishes to thank F. Mulier for discussions and helping to make this article more clear and readable.

REFERENCES

[1] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler, "Scale-sensitive dimensions, uniform convergence, and learnability," J. ACM, vol. 44, no. 4, pp. 617-631, 1997.
[2] P. L. Bartlett, P. Long, and R. C. Williamson, "Fat-shattering and the learnability of real-valued functions," J. Comput. Syst. Sci., vol. 52, no. 3, pp. 434-452, 1996.
[3] P. L. Bartlett and J. Shawe-Taylor, "Generalization performance on support vector machines and other pattern classifiers," in B. Scholkopf, C. Burges, and A. Smola, Eds., Advances in Kernel Methods: Support Vector Learning. Cambridge, MA: MIT Press, 1999.
[4] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth, "Learnability and the Vapnik-Chervonenkis dimension," J. ACM, vol. 36, no. 4, pp. 929-965, 1989.
[5] B. Boser, I. Guyon, and V. N. Vapnik, "A training algorithm for optimal margin classifiers," in Proc. 5th Annu. Workshop Comput. Learning Theory. Pittsburgh, PA: ACM, 1992, pp. 144-152.
[6] C. J. C. Burges, "Simplified support vector decision rules," in Proc. 13th Int. Conf. Machine Learning, San Mateo, CA, 1996, pp. 71-77.
[7] C. J. C. Burges, "Geometry and invariance in kernel-based methods," in B. Scholkopf, C. Burges, and A. Smola, Eds., Advances in Kernel Methods: Support Vector Learning. Cambridge, MA: MIT Press, 1999.
[8] C. Cortes and V. Vapnik, "Support vector networks," Machine Learning, vol. 20, pp. 273-297, 1995.
[9] L. Devroye, L. Gyorfi, and G. Lugosi, A Probability Theory of Pattern Recognition. New York: Springer-Verlag, 1996.
[10] F. Girosi, "An equivalence between sparse approximation and support vector machines," Neural Comput., vol. 10, no. 6, pp. 1455-1480, 1998.
[11] F. Girosi, M. Jones, and T. Poggio, "Regularization theory and neural networks architectures," Neural Comput., vol. 7, no. 2, pp. 219-269, 1995.
[12] Y. Le Cun, "Learning processes in an asymmetric threshold network," in E. Bienenstock, F. Fogelman-Soulie, and G. Weisbuch, Eds., Disordered Systems and Biological Organizations. Les Houches, France: Springer-Verlag, 1986, pp. 233-240.
[13] M. L. Minsky and S. A. Papert, Perceptrons. Cambridge, MA: MIT Press, 1969.
[14] M. Opper, "On the annealed VC entropy for margin classifiers: A statistical mechanics study," in B. Scholkopf, C. Burges, and A. Smola, Eds., Advances in Kernel Methods: Support Vector Learning. Cambridge, MA: MIT Press, 1999.
[15] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. I. Cambridge, MA: Bradford, 1986, pp. 318-362.
[16] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony, "Structural risk minimization," IEEE Trans. Inform. Theory, 1998.
[17] B. Scholkopf, A. Smola, and K.-R. Muller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Comput., vol. 10, pp. 1299-1319, 1998.
[18] B. Scholkopf, A. Smola, and K.-R. Muller, "The connection between regularization operators and support vector kernels," Neural Networks, vol. 11, pp. 637-649, 1998.
[19] M. Talagrand, "The Glivenko-Cantelli problem, ten years later," J. Theoretical Probability, vol. 9, no. 2, pp. 371-384, 1996.
[20] V. N. Vapnik, Estimation of Dependencies Based on Empirical Data. Moscow, Russia: Nauka, 1979 (in Russian). English translation: New York: Springer-Verlag, 1982.
[21] V. N. Vapnik, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.
[22] V. N. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[23] V. N. Vapnik and A. Ja. Chervonenkis, "On the uniform convergence of relative frequencies of events to their probabilities," Rep. Academy Sci. USSR, vol. 181, no. 4, 1968.
[24] V. N. Vapnik and A. Ja. Chervonenkis, "On the uniform convergence of relative frequencies of events to their probabilities," Theory Probab. Appl., vol. 16, pp. 264-280, 1971.
[25] V. N. Vapnik and A. Ja. Chervonenkis, Theory of Pattern Recognition. Moscow, Russia: Nauka, 1974 (in Russian). German translation: W. N. Wapnik and A. Ja. Tscherwonenkis, Theorie der Zeichenerkennung. Berlin, Germany: Akademie-Verlag, 1979.
[26] V. N. Vapnik and A. Ja. Chervonenkis, "Necessary and sufficient conditions for the uniform convergence of the means to their expectations," Theory Probab. Appl., vol. 26, pp. 532-553, 1981.
[27] V. N. Vapnik and A. Ja. Chervonenkis, "The necessary and sufficient conditions for consistency of the method of empirical risk minimization," Yearbook of the Academy of Sciences of the USSR on Recognition, Classification, and Forecasting, vol. 2, pp. 217-249. Moscow: Nauka, 1989 (in Russian). English translation: Pattern Recogn. and Image Analysis, vol. 1, no. 3, pp. 284-305, 1991.
[28] M. Vidyasagar, A Theory of Learning and Generalization. New York: Springer, 1997.
[29] G. Wahba, Spline Models for Observational Data, vol. 59. Philadelphia, PA: SIAM, 1990.
[30] R. C. Williamson, A. Smola, and B. Scholkopf, "Entropy numbers, operators, and support vector kernels," in B. Scholkopf, C. Burges, and A. Smola, Eds., Advances in Kernel Methods: Support Vector Learning. Cambridge, MA: MIT Press, 1999.


Vladimir N. Vapnik was born in Russia and received the Ph.D. degree in statistics from the Institute of Control Sciences, Academy of Science of the
USSR, Moscow, Russia, in 1964.
Since 1991, he has been working for AT&T Bell
Laboratories (since 1996, AT&T Labs Research),
Red Bank, NJ. His research interests include statistical learning theory, theoretical and applied statistics,
theory and methods for solving stochastic ill-posed
problems, and methods of multidimensional function approximation. His main results in the last three
years are related to the development of the support vector method. He is author
of many publications, including seven monographs on various problems of
statistical learning theory.

REF [2]

A Training Algorithm for Optimal Margin Classifiers

Bernhard E. Boser
EECS Department, University of California, Berkeley, CA 94720
boser@eecs.berkeley.edu

Isabelle M. Guyon
AT&T Bell Laboratories, 50 Fremont Street, 6th Floor, San Francisco, CA 94105
isabelle@neural.att.com

Vladimir N. Vapnik
AT&T Bell Laboratories, Crawford Corner Road, Holmdel, NJ 07733
vlad@neural.att.com

Abstract

A training algorithm that maximizes the margin between the training patterns and the decision boundary is presented. The technique is applicable to a wide variety of classification functions, including Perceptrons, polynomials, and Radial Basis Functions. The effective number of parameters is adjusted automatically to match the complexity of the problem. The solution is expressed as a linear combination of supporting patterns. These are the subset of training patterns that are closest to the decision boundary. Bounds on the generalization performance based on the leave-one-out method and the VC-dimension are given. Experimental results on optical character recognition problems demonstrate the good generalization obtained when compared with other learning algorithms.


INTRODUCTION

Good generalization performance of pattern classifiers is achieved when the capacity of the classification function is matched to the size of the training set. Classifiers with a large number of adjustable parameters, and therefore large capacity, are likely to learn the training set without error, but exhibit poor generalization. Conversely, a classifier with insufficient capacity might not be able to learn the task at all. In between, there is an optimal capacity of the classifier which minimizes the expected generalization error for a given amount of training data. Both experimental evidence and theoretical studies [GBD92,
*Part of this work was performed while B. Boser was
with AT&T Bell Laboratories. He is now at the University
of California, Berkeley.

Moo92, GVB+92, Vap82, BH89, TLS89, Mac92] link the generalization of a classifier to the error on the training examples and to the complexity of the classifier. Methods such as structural risk minimization [Vap82] vary the complexity of the classification function in order to optimize the generalization.


The proposed algorithm operates with a large class of decision functions that are linear in their parameters but not restricted to linear dependence on the input components. Perceptrons [Ros62], polynomial classifiers, neural networks with one hidden layer, and Radial Basis Function (RBF) or potential function classifiers [ABR64, BL88, MD89] fall into this class. As pointed out by several authors [ABR64, DH73, PG90], Perceptrons have a dual kernel representation implementing the same decision function. The optimal margin algorithm exploits this duality both for improved efficiency and for flexibility. In the dual space the decision function is expressed as a linear combination of basis functions parametrized by the supporting patterns. The supporting patterns correspond to the class centers of RBF classifiers and are chosen automatically by the maximum margin training procedure. In the case of polynomial classifiers, the Perceptron representation involves an intractable number of parameters. This problem is overcome in the dual space representation, where the classification rule is a weighted sum of a kernel function [Pog75] for each supporting pattern. High order polynomial classifiers with very large training sets can therefore be handled efficiently with the proposed algorithm.

In this paper we describe a training algorithm that automatically tunes the capacity of the classification function by maximizing the margin between training examples and class boundary [KM87], optionally after removing some atypical or meaningless examples from the training data. The resulting classification function depends only on so-called supporting patterns [Vap82]. These are those training examples that are closest to the decision boundary and are usually a small subset of the training data.

It will be demonstrated that maximizing the margin amounts to minimizing the maximum loss, as opposed to some average quantity such as the mean squared error. This has several desirable consequences. The resulting classification rule achieves an errorless separation of the training data if possible. Outliers or meaningless patterns are identified by the algorithm and can therefore be eliminated easily with or without supervision. This contrasts with classifiers based on minimizing the mean squared error, which quietly ignore atypical patterns. Another advantage of maximum margin classifiers is that the sensitivity of the classifier to limited computational accuracy is minimal compared to other separations with smaller margin. In analogy to [Vap82, HLW88] a bound on the generalization performance is obtained with the leave-one-out method. For the maximum margin classifier it is the ratio of the number of linearly independent supporting patterns to the number of training examples. This bound is tighter than a bound based on the capacity of the classifier family.

The training algorithm is described in Section 2. Section 3 summarizes important properties of optimal margin classifiers. Experimental results are reported in Section 4.

2 MAXIMUM MARGIN TRAINING ALGORITHM

The maximum margin training algorithm finds a decision function for pattern vectors x of dimension n belonging to either of two classes A and B. The input to the training algorithm is a set of p examples x_k with labels y_k:

(x_1, y_1), (x_2, y_2), (x_3, y_3), ..., (x_p, y_p),   (1)

where y_k = 1 if x_k ∈ class A and y_k = −1 if x_k ∈ class B.

From these training examples the algorithm finds the parameters of the decision function D(x) during a learning phase. After training, the classification of unknown patterns is predicted according to the following rule:

x ∈ A if D(x) > 0,   x ∈ B otherwise.   (2)

The decision functions must be linear in their parameters but are not restricted to linear dependence on x. These functions can be expressed either in direct or in dual space. The direct space notation is identical to the Perceptron decision function [Ros62]:

D(x) = Σ_{i=1}^{N} w_i φ_i(x) + b.   (3)

In this equation the φ_i are predefined functions of x, and the w_i and b are the adjustable parameters of the decision function. Polynomial classifiers are a special case of Perceptrons for which the φ_i(x) are products of components of x.

In the dual space, the decision functions are of the form

D(x) = Σ_{k=1}^{p} α_k K(x_k, x) + b.   (4)

The coefficients α_k are the parameters to be adjusted and the x_k are the training patterns. The function K is a predefined kernel, for example a potential function [ABR64] or any Radial Basis Function [BL88, MD89]. Under certain conditions [CH53], symmetric kernels possess finite or infinite series expansions of the form

K(x, x') = Σ_i φ_i(x) φ_i(x').   (5)

In particular, the kernel K(x, x') = (x · x' + 1)^q corresponds to a polynomial expansion φ(x) of order q [Pog75].

Provided that the expansion stated in equation 5 exists, equations 3 and 4 are dual representations of the same decision function, and

w_i = Σ_{k=1}^{p} α_k φ_i(x_k).   (6)

The parameters w_i are called direct parameters, and the α_k are referred to as dual parameters.

The proposed training algorithm is based on the generalized portrait method described in [Vap82] that constructs separating hyperplanes with maximum margin. Here this algorithm is extended to train classifiers linear in their parameters. First, the margin between the class boundary and the training patterns is formulated in the direct space. This problem description is then transformed into the dual space by means of the Lagrangian. The resulting problem is that of maximizing a quadratic form with constraints and is amenable to efficient numeric optimization algorithms [Lue84].

2.1 MAXIMIZING THE MARGIN IN THE DIRECT SPACE

In the direct space the decision function is

D(x) = w · φ(x) + b,   (7)

where w and φ(x) are N-dimensional vectors and b is a bias. It defines a separating hyperplane in φ-space. The distance between this hyperplane and pattern x is D(x)/||w|| (Figure 1). Assuming that a separation of the training set with margin M between the class boundary and the training patterns exists, all training patterns fulfill the following inequality:

y_k D(x_k) / ||w|| ≥ M.   (8)

The objective of the training algorithm is to find the parameter vector w that maximizes M:

M* = max_{w, ||w||=1} M   (9)
subject to  y_k D(x_k) ≥ M,   k = 1, 2, ..., p.

The bound M* is attained for those patterns satisfying

min_k y_k D(x_k) = M*.   (10)

These patterns are called the supporting patterns of the decision boundary.

Figure 1: Maximum margin linear decision function D(x) = w · x + b (φ(x) = x). The gray levels encode the absolute value of the decision function (solid black corresponds to D(x) = 0). The numbers indicate the supporting patterns.

A decision function with maximum margin is illustrated in Figure 1. The problem of finding a hyperplane in φ-space with maximum margin is therefore a minimax problem:

max_{w, ||w||=1} min_k y_k D(x_k).   (11)

The norm of the parameter vector in equations 9 and 11 is fixed to pick one of an infinite number of possible solutions that differ only in scaling. Instead of fixing the norm of w to take care of the scaling problem, the product of the margin M and the norm of the weight vector w can be fixed:

M ||w|| = 1.   (12)

Thus, maximizing the margin M is equivalent to minimizing the norm ||w||.^1 Then the problem of finding a maximum margin separating hyperplane w* stated in 9 reduces to solving the following quadratic problem:

min_w ||w||²   (13)
under conditions  y_k D(x_k) ≥ 1,   k = 1, 2, ..., p.

The maximum margin is M* = 1/||w*||.

In principle the problem stated in 13 can be solved directly with numerical techniques. However, this approach is impractical when the dimensionality of the φ-space is large or infinite. Moreover, no information is gained about the supporting patterns.

^1 If the training data is not linearly separable the maximum margin may be negative. In this case, M ||w|| = −1 is imposed. Maximizing the margin is then equivalent to maximizing ||w||.

2.2 MAXIMIZING THE MARGIN IN THE DUAL SPACE

Problem 13 can be transformed into the dual space by means of the Lagrangian [Lue84]

L(w, b, α) = ½ ||w||² − Σ_{k=1}^{p} α_k [ y_k D(x_k) − 1 ]   (14)
subject to  α_k ≥ 0,   k = 1, 2, ..., p.

The factors α_k are called Lagrange multipliers or Kuhn-Tucker coefficients and satisfy the conditions

α_k ( y_k D(x_k) − 1 ) = 0,   k = 1, 2, ..., p.   (15)

The factor one half has been included for cosmetic reasons; it does not change the solution.

The optimization problem 13 is equivalent to searching for a saddle point of the function L(w, b, α). This saddle point is the minimum of L(w, b, α) with respect to w, and a maximum with respect to α (α_k ≥ 0). At the solution, the following necessary condition is met:

∂L/∂w = w − Σ_{k=1}^{p} α_k y_k φ(x_k) = 0,

hence

w* = Σ_{k=1}^{p} α_k y_k φ(x_k).   (16)

The patterns which satisfy y_k D(x_k) = 1 are the supporting patterns. According to equation 16, the vector w* that specifies the hyperplane with maximum margin is a linear combination of only the supporting patterns, which are those patterns for which α_k ≠ 0. Usually the number of supporting patterns is much smaller than the number p of patterns in the training set.
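The expansion in equation 16 can be verified with an off-the-shelf maximum margin trainer. The sketch below uses scikit-learn's SVC with a large value of C to approximate the separable, hard-margin case; the library, the constant C = 1e6, and the toy data are convenience choices for this illustration, not anything prescribed by the text.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-2.0, 0.5, (30, 2)), rng.normal(+2.0, 0.5, (30, 2))])
y = np.hstack([-np.ones(30), np.ones(30)])

clf = SVC(kernel="linear", C=1e6).fit(X, y)        # large C approximates the hard margin

# dual_coef_ holds y_k * alpha_k for the supporting patterns only (equation 16).
w_from_expansion = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_expansion, clf.coef_))    # True: w* = sum_k alpha_k y_k x_k
print("supporting patterns:", len(clf.support_), "of", len(y))
print("margin M* = 1/||w*|| =", 1.0 / np.linalg.norm(clf.coef_))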

The dependence of the Lagrangian L(w, b, α) on the weight vector w is removed by substituting the expansion of w* given by equation 16 for w. Further transformations using 3 and 5 result in a Lagrangian which is a function of the parameters α and the bias b only:

J(α, b) = Σ_{k=1}^{p} α_k (1 − b y_k) − ½ α · H · α   (17)
subject to  α_k ≥ 0,   k = 1, 2, ..., p.

Here H is a square matrix of size p × p with elements

H_kl = y_k y_l K(x_k, x_l).

In order for a unique solution to exist, H must be positive definite. For fixed bias b, the solution α* is obtained by maximizing J(α, b) under the conditions α_k ≥ 0. Based on equations 7 and 16, the resulting decision function is of the form

D(x) = w* · φ(x) + b = Σ_k y_k α_k* K(x_k, x) + b,   α_k* ≥ 0,   (18)

where only the supporting patterns appear in the sum with nonzero weight.

The choice of the bias b gives rise to several variants of the algorithm. The two considered here are:

1. The bias can be fixed a priori and not subjected to training. This corresponds to the Generalized Portrait Technique described in [Vap82].

2. The cost function 17 can be optimized with respect to w and b. This approach gives the largest possible margin M* in φ-space [VC74].

In both cases the solution is found with standard nonlinear optimization algorithms for quadratic forms with linear constraints [Lue84, Loo72]. The second approach gives the largest possible margin. There is no guarantee, however, that this solution also exhibits the best generalization performance.

A strategy to optimize the margin with respect to both w and b is described in [Vap82]. It solves problem 17 for differences of pattern vectors to obtain α* independent of the bias, which is computed subsequently. The margin in φ-space is maximized when the decision boundary is halfway between the two classes. Hence the bias b* is obtained by applying 18 to two arbitrary supporting patterns x_A ∈ class A and x_B ∈ class B, taking into account that D(x_A) = 1 and D(x_B) = −1:

b* = −½ ( w* · φ(x_A) + w* · φ(x_B) ) = −½ Σ_k y_k α_k* [ K(x_A, x_k) + K(x_B, x_k) ].   (19)

The dimension of problem 17 equals the size of the training set, p. To avoid the need to solve a dual problem of exceedingly large dimensionality, the training data is divided into chunks that are processed iteratively [Vap82]. The maximum margin hypersurface is constructed for the first chunk and a new training set is formed consisting of the supporting patterns from the solution and those patterns x_k in the second chunk of the training set for which y_k D(x_k) < 1 − ε. A new classifier is trained and used to construct a training set consisting of supporting patterns and examples from the first three chunks which satisfy y_k D(x_k) < 1 − ε. This process is repeated until the entire training set is separated.

3 PROPERTIES OF THE ALGORITHM

In this Section, we highlight some important aspects of the optimal margin training algorithm. The description is split into a discussion of the qualities of the resulting classifier and of computational considerations. Classification performance advantages over other techniques will be illustrated in the Section on experimental results.

3.1 PROPERTIES OF THE SOLUTION

Since maximizing the margin between the decision boundary and the training patterns is equivalent to maximizing a quadratic form in the positive quadrant, there are no local minima and the solution is always unique if H has full rank. At the optimum

J(α*) = ½ ||w*||² = 1 / (2 (M*)²) = ½ Σ_{k=1}^{p} α_k*.   (20)

The uniqueness of the solution is a consequence of the maximum margin cost function and represents an important advantage over other algorithms for which the solution depends on the initial conditions or on other parameters that are difficult to control.

Another benefit of the maximum margin objective is its insensitivity to small changes of the parameters w or α. Since the decision function D(x) is a linear function of w in the direct, and of α in the dual space, the probability of misclassifications due to parameter variations of the components of these vectors is minimized for maximum margin. The robustness of the solution, and potentially its generalization performance, can be increased further by omitting some supporting patterns from the solution. Equation 20 indicates that the largest increase in the maximum margin M* occurs when the supporting patterns with largest α_k are eliminated. The elimination can be performed automatically or with assistance from a supervisor. This feature gives rise to other important uses of the optimum margin algorithm in database cleaning applications [MGB+92].

Figure 2: Linear decision boundary for MSE (left) and maximum margin cost functions (middle, right) in the presence of an outlier. In the rightmost picture the outlier has been removed. The numbers reflect the ranking of supporting patterns according to the magnitude of their Lagrange coefficient α_k for each class individually.

Figure 2 compares the decision boundaries for maximum margin and mean squared error (MSE) cost functions. Unlike the MSE based decision function, which simply ignores the outlier, optimal margin classifiers are very sensitive to atypical patterns that are close to the decision boundary. These examples are readily identified as those with the largest α_k and can be eliminated either automatically or with supervision. Hence, optimal margin classifiers give complete control over the handling of outliers, as opposed to quietly ignoring them.
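The matrix H of problem 17 is straightforward to build once a kernel is chosen, and for any Mercer kernel it is positive semidefinite by construction (H = diag(y) K diag(y) with K a Gram matrix). The short sketch below builds H for the polynomial kernel used throughout this paper and checks its smallest eigenvalue; the toy patterns and the degree q = 2 are assumptions of this sketch.

import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(8, 3))                    # p = 8 toy patterns in 3 dimensions
y = np.where(rng.random(8) < 0.5, -1.0, 1.0)

def K(u, v, q=2):
    # Polynomial kernel K(x, x') = (x . x' + 1)^q.
    return (u @ v + 1.0) ** q

# H is the p x p matrix of problem 17 with elements H_kl = y_k y_l K(x_k, x_l).
p = len(y)
H = np.array([[y[k] * y[l] * K(X[k], X[l]) for l in range(p)] for k in range(p)])

# All eigenvalues are >= 0 (up to round-off), as required for the quadratic form
# to have no spurious local maxima.
print(np.linalg.eigvalsh(H).min())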

The optimum margin algorithm performs automatic capacity tuning of the decision function to achieve good generalization. An estimate for an upper bound of the generalization error is obtained with the leave-one-out method: a pattern x_k is removed from the training set. A classifier is then trained on the remaining patterns and tested on x_k. This process is repeated for all p training patterns. The generalization error is estimated by the ratio of misclassified patterns over p. For a maximum margin classifier, two cases arise: if x_k is not a supporting pattern, the decision boundary is unchanged and x_k will be classified correctly. If x_k is a supporting pattern, two cases are possible:

1. The pattern x_k is linearly dependent on the other supporting patterns. In this case it will be classified correctly.

2. x_k is linearly independent from the other supporting patterns. In this case the outcome is uncertain. In the worst case the m linearly independent supporting patterns are misclassified when they are omitted from the training data.

Hence the frequency of errors obtained by this method is at most m/p, and it has no direct relationship with the number of adjustable parameters. The number of linearly independent supporting patterns m itself is bounded by min(N, p). This suggests that the number of supporting patterns is related to an effective capacity of the classifier that is usually much smaller than the VC-dimension, N + 1 [Vap82, HLW88].

In polynomial classifiers, for example, N ≈ n^q, where n is the dimension of x-space and q is the order of the polynomial. In practice, m < p << N, i.e. the number of supporting patterns is much smaller than the dimension of the φ-space. The capacity tuning realized by the maximum margin algorithm is essential to get generalization with high-order polynomial classifiers.

3.2 COMPUTATIONAL CONSIDERATIONS

Speed and convergence are important practical considerations of classification algorithms. The benefit of the dual space representation in reducing the number of computations required, for example for polynomial classifiers, has been pointed out already. In the dual space, each evaluation of the decision function D(x) requires m evaluations of the kernel function K(x_k, x) and forming the weighted sum of the results. This number can be further reduced through the use of appropriate search techniques which omit evaluations of K that yield negligible contributions to D(x) [Omo91].

Typically, the training time for a separating surface from a database with several thousand examples is a few minutes on a workstation, when an efficient optimization algorithm is used. All experiments reported in the next section on a database with 7300 training examples took less than five minutes of CPU time per separating surface. The optimization was performed with an algorithm due to Powell that is described in [Lue84] and available from public numerical libraries.

Quadratic optimization problems of the form stated in 17 can be solved in polynomial time with the Ellipsoid method [NY83]. This technique first finds a hyperspace that is guaranteed to contain the optimum; then the volume of this space is reduced iteratively by a constant fraction. The algorithm is polynomial in the number of free parameters p and the encoding size (i.e. the accuracy of the problem and solution). In practice, however, algorithms without guaranteed polynomial convergence are more efficient.
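The leave-one-out estimate and the m/p quantity discussed above are cheap to reproduce on a small example. The sketch below uses scikit-learn as a stand-in trainer on separable toy data; the data, the large C value, and the use of the raw support-vector count m (rather than the count of linearly independent supporting patterns) are simplifying assumptions of this sketch.

import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(-2.0, 0.6, (25, 2)), rng.normal(+2.0, 0.6, (25, 2))])
y = np.hstack([-np.ones(25), np.ones(25)])
p = len(y)

full = SVC(kernel="linear", C=1e6).fit(X, y)
m = len(full.support_)                      # number of supporting patterns

errors = 0
for train, test in LeaveOneOut().split(X):
    clf = SVC(kernel="linear", C=1e6).fit(X[train], y[train])
    errors += clf.predict(X[test])[0] != y[test][0]

print(f"leave-one-out errors = {errors}/{p},  m/p = {m}/{p}")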

Figure 3: Supporting patterns from database DB2 for class 2 before cleaning. The patterns are ranked according to α_k.

4 EXPERIMENTAL RESULTS

The maximum margin training algorithm has been tested on two databases with images of handwritten digits. The first database (DB1) consists of 1200 clean images recorded from ten subjects. Half of this data is used for training, and the other half is used to evaluate the generalization performance. A comparative analysis of the performance of various classification methods on DB1 can be found in [GVB+92, GPP+89, GBD92]. The other database (DB2) used in the experiment consists of 7300 images for training and 2000 for testing and has been recorded from actual mail pieces. Results for this data have been reported in several publications, see e.g. [CBD+90]. The resolution of the images in both databases is 16 by 16 pixels.

In all experiments, the margin is maximized with respect to w and b. Ten hypersurfaces, one per class, are used to separate the digits. Regardless of the difficulty of the problem, measured for example by the number of supporting patterns found by the algorithm, the same similarity function K(x, x') and the same preprocessing are used for all hypersurfaces of one experiment. The results obtained with different choices of K corresponding to linear hyperplanes, polynomial classifiers, and basis functions are summarized below. The effect of smoothing is investigated as a simple form of preprocessing.

For linear hyperplane classifiers, corresponding to the similarity function K(x, x') = x · x', the algorithm finds an errorless separation for database DB1. The percentage of errors on the test set is 3.2%. This result compares favorably to hyperplane classifiers which minimize the mean squared error (backpropagation or pseudo-inverse), for which the error on the test set is 12.7%.

Database DB2 is also linearly separable but contains several meaningless patterns. Figure 3 shows the supporting patterns with large Lagrange multipliers α_k for the hyperplane for class 2. The percentage of misclassifications on the test set of DB2 drops from 15.2% without cleaning to 10.5% after removing meaningless and ambiguous patterns.

Better performance has been achieved with both databases using multilayer neural networks or other classification functions with higher capacity than linear subdividing planes. Tests with polynomial classifiers of order q, for which K(x, x') = (x · x' + 1)^q, give the error rates and the average number of supporting patterns per hypersurface, <m>, summarized in the experiments. This average is computed as the total number of supporting patterns divided by the number of decision functions; patterns that support more than one hypersurface are counted only once in the total. For comparison, the dimension N of φ-space is also considered: it grows from 256 for the linear classifier to roughly 3 · 10^4, 8 · 10^7, 4 · 10^9, and 1 · 10^12 for polynomials of order two through five.

The results obtained for DB2 show a strong decrease of the number of supporting patterns from a linear to a third order polynomial classification function and an equivalently significant decrease of the error rate. Further increase of the order of the polynomial has little effect on either the number of supporting patterns or the performance, unlike the dimension of φ-space, N, which increases exponentially. The lowest error rate, 4.9%, is obtained with a fourth order polynomial and is slightly better than the 5.1% reported for a five layer neural network with a sophisticated architecture [CBD+90], which has been trained and tested on the same data.

In the above experiment, the performance changes drastically between first and second order polynomials. This may be a consequence of the fact that the maximum VC-dimension of a q-th order polynomial classifier is equal to the dimension n of the patterns to the q-th power and thus much larger than n. A more gradual change of the VC-dimension is possible when the function K is chosen to be a power series, for example

K(x, x') = exp(γ x · x') − 1.   (21)

In this equation the parameter γ is used to vary the VC-dimension gradually. For small values of γ, equation 21 approaches a linear classifier with VC-dimension at most equal to the dimension n of the patterns plus one. Experiments with database DB1 lead to a slightly better performance than the 1.5% error obtained with a second order polynomial classifier: for γ = 0.25, 0.50, 0.75, and 1.00 the error rates on DB1 are 2.3%, 2.2%, 1.3%, and 1.5%, respectively.

When K(x, x') is chosen to be the hyperbolic tangent, the resulting classifier can be interpreted as a neural network with one hidden layer with m hidden units. The supporting patterns are the weights in the first layer, and the coefficients α_k the weights of the second, linear layer. The number of hidden units is chosen by the training algorithm to maximize the margin between the classes A and B. Substituting the hyperbolic tangent for the exponential function did not lead to better results in our experiments.

Better performance might be achieved with other similarity functions K(x, x'). Figure 4 shows the decision boundary obtained with a second order polynomial and a radial basis function (RBF) maximum margin classifier with K(x, x') = exp(−||x − x'||/2). The decision boundary of the polynomial classifier is much closer to one of the two classes. This is a consequence of the nonlinear transform from φ-space to x-space of polynomials, which realizes a position dependent scaling of distance. Radial Basis Functions do not exhibit this problem. The decision boundary of a two layer neural network trained with backpropagation is shown for comparison.

Figure 4: Decision boundaries for maximum margin classifiers with second order polynomial decision rule K(x, x') = (x · x' + 1)² (left) and an exponential RBF K(x, x') = exp(−||x − x'||/2) (middle). The rightmost picture shows the decision boundary of a two layer neural network with two hidden units trained with backpropagation.

The importance of a suitable preprocessing to incorporate knowledge about the task at hand has been pointed out by many researchers. In optical character recognition, preprocessings that introduce some invariance to scaling, rotation, and other distortions are particularly important [SLD92]. As in [GVB+92], smoothing is used to achieve insensitivity to small distortions. The table below lists the error on the test set for different amounts of smoothing. A second order polynomial classifier was used for database DB1, and a fourth order polynomial for DB2. The smoothing kernel is Gaussian with standard deviation σ.

σ             | DB1 error | DB1 <m> | DB2 error | DB2 <m>
no smoothing  | 1.5%      | 44      | 4.9%      | 72
0.5           | 1.3%      | 41      | 4.6%      | 73
0.8           | 0.8%      | 36      | 5.0%      | 79
1.0           | 0.3%      | 31      | 6.0%      | 83
1.2           | 0.8%      | 31      |           |

The performance improved considerably for DB1. For DB2 the improvement is less significant, and the optimum was obtained for less smoothing than for DB1. This is expected since the number of training patterns in DB2 is much larger than in DB1 (7000 versus 600). A higher performance gain can be expected for more selective hints than smoothing, such as invariance to small rotations or scaling of the digits [SLD92].
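The qualitative comparison of Figure 4 is easy to rehearse on synthetic data. The sketch below trains a second order polynomial and an RBF maximum margin classifier using scikit-learn as a stand-in trainer; the data, the regularization constant, and the RBF width are assumptions of this sketch and are not the settings used in the experiments above.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(-1.5, 0.7, (40, 2)), rng.normal(+1.5, 0.7, (40, 2))])
y = np.hstack([-np.ones(40), np.ones(40)])

# Second order polynomial rule K(x, x') = (x . x' + 1)^2 versus an RBF rule.
for name, clf in [
    ("poly, degree 2", SVC(kernel="poly", degree=2, coef0=1.0, gamma=1.0, C=10.0)),
    ("RBF", SVC(kernel="rbf", gamma=0.5, C=10.0)),
]:
    clf.fit(X, y)
    print(f"{name:15s} supporting patterns: {len(clf.support_):2d}  "
          f"training errors: {int(np.sum(clf.predict(X) != y))}")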

CONCLUSIONS

Maximizing the margin between the class boundary and the training patterns is an alternative to other training methods which optimize cost functions such as the mean squared error. This principle is equivalent to minimizing the maximum loss and has a number of important features. These include automatic capacity tuning of the classification function, extraction of a small number of supporting patterns from the training data that are relevant for the classification, and uniqueness of the solution. They are exploited in an efficient learning algorithm for classifiers linear in their parameters with very large capacity, such as high order polynomial or RBF classifiers. Key is the representation of the decision function in a dual space which is of much lower dimensionality than the feature space.

The efficiency and performance of the algorithm have been demonstrated on handwritten digit recognition problems. The achieved performance matches that of sophisticated classifiers, even though no task specific knowledge has been used. The training algorithm is polynomial in the number of training patterns, even in cases when the dimension of the solution space (φ-space) is exponential or infinite. The training time in all experiments was less than an hour on a workstation.

Acknowledgements

We wish to thank our colleagues at UC Berkeley and AT&T Bell Laboratories for many suggestions and stimulating discussions. Comments by L. Bottou, C. Cortes, S. Sanders, S. Solla, A. Zakhor, and the reviewers are gratefully acknowledged. We are especially indebted to R. Baldick and D. Hochbaum for investigating the polynomial convergence property, S. Hein for providing the code for constrained nonlinear optimization, and D. Haussler and M. Warmuth for help and advice regarding performance bounds.

References

[ABR64] M. A. Aizerman, E. M. Braverman, and L. I. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821-837, 1964.

[BH89] E. B. Baum and D. Haussler. What size net gives valid generalization? Neural Computation, 1(1):151-160, 1989.

[BL88] D. S. Broomhead and D. Lowe. Multivariate functional interpolation and adaptive networks. Complex Systems, 2:321-355, 1988.

[CBD+90] Y. Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Handwritten digit recognition with a back-propagation network. In D. S. Touretzky, editor, Neural Information Processing Systems, volume 2, pages 396-404. Morgan Kaufmann Publishers, San Mateo, CA, 1990.

[CH53] R. Courant and D. Hilbert. Methods of Mathematical Physics. Interscience, New York, 1953.

[DH73] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley and Son, 1973.

[GBD92] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1-58, 1992.

[GPP+89] I. Guyon, I. Poujaud, L. Personnaz, G. Dreyfus, J. Denker, and Y. LeCun. Comparing different neural network architectures for classifying handwritten digits. In Proc. Int. Joint Conf. Neural Networks, 1989.

[GVB+92] I. Guyon, V. Vapnik, B. Boser, L. Bottou, and S. Solla. Structural risk minimization for character recognition. In D. S. Touretzky, editor, Neural Information Processing Systems, volume 4. Morgan Kaufmann Publishers, San Mateo, CA, 1992. To appear.

[HLW88] D. Haussler, N. Littlestone, and M. Warmuth. Predicting {0,1}-functions on randomly drawn points. In Proceedings of the 29th Annual Symposium on the Foundations of Computer Science, pages 100-109. IEEE, 1988.

[KM87] W. Krauth and M. Mezard. Learning algorithms with optimal stability in neural networks. J. Phys. A: Math. Gen., 20:L745, 1987.

[Loo72] F. A. Lootsma, editor. Numerical Methods for Non-linear Optimization. Academic Press, London, 1972.

[Lue84] D. Luenberger. Linear and Nonlinear Programming. Addison-Wesley, 1984.

[Mac92] D. MacKay. A practical Bayesian framework for backprop networks. In D. S. Touretzky, editor, Neural Information Processing Systems, volume 4. Morgan Kaufmann Publishers, San Mateo, CA, 1992. To appear.

[MD89] J. Moody and C. Darken. Fast learning in networks of locally tuned processing units. Neural Computation, 1(2):281-294, 1989.

[MGB+92] N. Matic, I. Guyon, L. Bottou, J. Denker, and V. Vapnik. Computer-aided cleaning of large databases for character recognition. In Digest ICPR. ICPR, Amsterdam, August 1992.

[Moo92] J. Moody. Generalization, weight decay, and architecture selection for nonlinear learning systems. In D. S. Touretzky, editor, Neural Information Processing Systems, volume 4. Morgan Kaufmann Publishers, San Mateo, CA, 1992. To appear.

[NY83] A. S. Nemirovsky and D. D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, New York, 1983.

[Omo91] S. M. Omohundro. Bumptrees for efficient function, constraint and classification learning. In R. P. Lippmann et al., editors, NIPS-90, San Mateo, CA, 1991. IEEE, Morgan Kaufmann.

[PG90] T. Poggio and F. Girosi. Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247:978-982, February 1990.

[Pog75] T. Poggio. On optimal nonlinear associative recall. Biol. Cybernetics, 19:201-209, 1975.

[Ros62] F. Rosenblatt. Principles of Neurodynamics. Spartan Books, New York, 1962.

[SLD92] P. Simard, Y. LeCun, and J. Denker. Tangent prop: a formalism for specifying selected invariances in an adaptive network. In D. S. Touretzky, editor, Neural Information Processing Systems, volume 4. Morgan Kaufmann Publishers, San Mateo, CA, 1992. To appear.

[TLS89] N. Tishby, E. Levin, and S. A. Solla. Consistent inference of probabilities in layered networks: Predictions and generalization. In Proceedings of the International Joint Conference on Neural Networks, Washington DC, 1989.

[Vap82] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York, 1982.

[VC74] V. N. Vapnik and A. Ya. Chervonenkis. The Theory of Pattern Recognition. Nauka, Moscow, 1974.

REF [3]

 
  
 


 

es

!""#$%&('*),+-#.%/10,2!#/"!34!506#87"%:9<;
=6>@?>AB#.C%)DFEG'*)('H#.I*JLKNM,O%=
PRQTS*UHVXWZYU.[ >@J%']\*^*_._"`.acbedFfHgchHbi`8a1jZgXbikl`.anmop)q#r%'*st2p'H#.(%!%uv3$#I*J%!%'1wix&sDyuz"7
I*2!#)()(|{ZIH#&n}%o7"%C%2}'H3~)H6>5J%'-3#IJ%!%'5I*%%I*'H7%&z"#2p2po!3$7%2p'H34'H8&()6&cJ%'w82}2ps5!%u4p/"'H#"T:"7"z%&
'*I*&(%)#.'%8Dy2p!%'H#.2p13$#87"7"'*/R&(4# 'H$J%}u%J8Dy/":3~'H%)(pw'H#&cz"'-)c7"#I*'.ly&J%p)w'H#&z"'
)7"#I*'#~2}!%'H#.@/"'*I*p)(pR)cz"w#I*'}),I*%)(&c(z%I*&('*/BlO"7"'*I*!#2l7"7'H&(p'*)-w&J%'/"'*I*p)(p<)z"wy#I*'
'H%)z"'*)lJ%puJu8'H%'H(#2ppH#&(p4#.C%p2p}&xw&cJ%'l2}'H#8(%!%u3$#I*J%:%'.l>5J%'p/"'H#-C'HJ%:%/4&J%')z"7"7"%&D
'*I*&(%%'*&Fsn9Rs5#)7"' pz%)(2p]!3$7%2p'H34'H8&('*/]wi&J%'$'*)(&pI*&('*/IH#)('4s-J%'H'$&J%'$&n#!%:%u
/Z#&#rIH#.C"')('H7"#.(#&('*/s5p&J%%z%&R'H()X]']J%'H''*&('H%/&J%p)R'*)z%2p&R&(v%8Dy)('H7"#.n#.C%2p'
&(#:%!%u/Z#&#"
puJvu8'H%'H(#2ppH#&(p#.C%p2pp&w)z"7"7&D '*I*&(%'*&Fsn9.)$z%&(p2pp*:%u7.2p%3~:#2:"7"z%&
&(#8%)w%(3$#&(p%)4p)$/"'H34%%)(&(#&n'*/Bv]'1#2p)(]I*%3$7"#.'R&J%'17'Hwi(3$#.%I*'<w-&J%'<)z"7"7"%&D
'*I*&(%%'*&s(9,&( #8pz%)6I*2!#)()n}IH#2"2}'H#8(%!%u#2pu.}&cJ"34)l&J"#&l#2p2"&(%97"#.&T!$#-C'H%IJ"3#.(9
)(&z%/"w67%&(pIH#2l5J"#.(#I*&n'HEG'*I*.u%p&(pL

5BTN(

re

nc

LVS86 #&(&('Hno'*I*8u%p&(pLKZ'cI*}'H8&,2}'H#8(%!%uq#2pu.%p&J"34)HKZ%'Hz"n#2%'*&s(98)HK"(#XD
/"!#2C"#)n})@wyz"%I*&(pRI*2!#)()n{Z'H)HKN7.2p%%34!#2I*2!#)n)(|{Z'H)H

Re

fe

 '5&cJ"#.o88'H#.)#u.E%=4%l})cJ%'H-|*B)z%u8u.'*)(&('*/&J%'@{N)(&#2}u8p&J"3w%7"#&(&('H($'*I*.uD
%p&(pL '4I*%)(p/"'H'*/#R34/"'*2Tw&Fso%%(3$#2T/"p)(&!C"z%&('*/ 77"z%2!#&(p%)HKTv +**+F #.%/
v ; * ; w]/"!34'H%)n}%"#2 '*I*&()-s5p&J34'H#. '*I*&()x + #.%/ ; #.%/I*D #.:#8%I*'
3$#&cpI*'*)
#8%/ K#8%/])J%s'*/]&J"#&&J%'7%&(!3$#2i#H8'*)(!#. )(.2!z%&(prp)~#z"#/Z(#&(pI
/"'*I*p)(pwz" %I+ *&n}%L ;
)( l )(pu1X q +T$+ + eRv +(  eRv ;cT$; + eRv ;B 2! ;
+ 
yr&J%'IH#)('$s,J%'H' + ; &J%'o"z"#/Z(#&(pI$/"'*I*p)(prwz"%I*&n}% /"'*u.'H%'H(#&('*)4&(1 #
2p!%'H#.wyz"%I*&(pL

2}! l )(pu + ;* + 1 + + + ; + ;c

>-'*)(&n:3#&('6&cJ%'6"z"#/Z(#&(pI/"'*I*p)(pwz"%I*&n}%%%'GJ"#)&(@/"'*&('H(34!%'$8}; Zc w'*'7"#.(#.3~'*&('H)H
>R'*)(&(!3$#&('4&J%'42p!%'H#.,wyz"%I*&(p%%2}]rwy'*'$7"#.(#.3~'*&('H)J"# '&nqC'$/"'*&('H(34!%'*/B$&J%'
i(*eHncT!n*ec(e*(XB"en!LFniHyinyi i 
(*eHncT(*ce e B"en! c yFinyi% ci 

perceptron output
weights of the output unit,
1, ... , 5

dotproduct

output from the 5 hidden units: z1 , ... , z 5


weights of the 5 hidden units
dotproducts
output from the 4 hidden units
weights of the 4 hidden units
dotproducts
input vector, x

Re

fe

re

nc

es

lpuz"' B= )(!3$7%2p',w'*'*/%Diwis5#./7"'HI*'H7%&$s@}&cJ:"7"z%&,z"%}&n)HK 2!#X.'H)6wJ%p/"/"'H<z"%}&n)HK


#.%/ %z%&7"z%&-z"%p&H>5J%'u(#X8Dy)J"#/"!%u$wl&J%' '*I*&(%'H.&cp'*)'Z'*I*&n)5&J%'*!-%z"34'HpI #2!z%'.
IH#)('s-J%'H'$&J%'$%z"34C"'HwC%)('H #&(p%)xp))c3$#2p2-e)#X12p'*)()x&J"#. ; '*)n&(!3$#&(!%u"i ;
7"#.(#834'*&('H)})5%.&5'*2p!#.C%2p'.l})cJ%'H&J%'H'cwi'-'*I*3$34'H%/"'*/BK"' 'H$!&J%',IH#)('5w +
 ;K
&(oz%)n'&J%'2p:%'H#8,/"})nIH!34!"#&(Gwz"%I*&n}% s5p&J wl&J%',wi(3R



 + "c ;
s-J%'H'  })R)n34'1I*%%)(&#.8& lp)J%'HR#2p)(v'*I*3$3~'H%/"'*/#]2p!%'H#./"'*I*p)(pwyz"%I*&(pwi
&J%'5IH#)('5s-J%'H'@&J%'5&s-/"p)(&:C"z%&n}%%)-#.'5%.&6%n3$#2il=,2}u8p&J"34)wl7"#&(&('H(4'*I*.u%%}&n}%
s'H'l&J%'H'cwi'Twy3&J%' 'HC"'*u.!"%!%u#)()(%I*:#&('*/s5p&J&cJ%'lI*%)(&(z%I*&n}%w"2p:%'H#8B/"'*I*p)(p
)z"wy#I*'*)H
E.)('H%C%2!#&(& T'*7%2p'*/]#4/" B'H'H8&-98:%/wl2p'H#.(%!%uR3$#I*J%:%'*)X7'HI*'H7%&%)
$%'Hz" n# 2-%'*&s(98)H >5J%8' 7"'HI*'H7%&cvI*%%)(p)(&()$wI*"%'*I*&('*/%'Hz"%)HKs-J%'H'<'H#I*Jv%'Hz8D
r!3$7%2p'H34'H8&()o#<)('H7"#.(#&(!%u J8"7"'H(7%2!#.%'.K)(<&J%'17"'HI*'H7%&#)$#<s-J%.2p'R!3$7%2p'H34'H.&n)o#
7%p'*I*'*s5p)('42p:%'H#8,)('H7"#.(#&n:%u)z"wy#I*'.6O%'*'lpuz"' 
>@J%'o7"C%2p'H3 w{N%/"!%u #.r#2pu.p&J"3 &J"#&3~:%!34p*'*)&J%''H(]#R)n'*&w '*I*&()
C8 #/ (z%)(&(!%u #2p2&J%'Rs'*puJ.&n)wG&cJ%'1%'*&Fsn91s5#)$%.&xwiz"%/:vEG.)('H%C%2:#&( & )4&(!34'.K#.%/
E.)n'HC%2!#&(&@)z%u.u.'*)(&n'*/q#4)nIJ%'H34's,J%'H'%2po&cJ%'s'*puJ8&()w&cJ%'z%&7"z%&5z"%p&#.'#/Z#87%&( '.
=,I*I*/"!%u&(1&J%'~{Z'*/)('*&(&(!%u<w5&J%'$.&J%'Hs'*}u%J.&()x&J%'$!"7"z%& '*I*&()#.'%8Dy2p:%'H#82p
&(#8%)w%(34'*/1!8&(&J%'xw'H#&cz"')7"#I*'.K KLw6&cJ%'2!#)n&-2!#X.'H,wz"%p&()H&J%p))7"#I*'4#2}!%'H#.
/"'*I*p)(pwz"%I*&n}%Rp)-I*%%)(&(z%I*&('*/B

 
#"
 e n) }u% !

C8q#/(z%)(&(!%u$&cJ%'s'*puJ8&()$ w%3 &cJ%'&e% Dy&JRJ%p/"/"'Hz"%p&5&($&J%'%z%&7"z%&5z"%}&,)(o#)&n~3~:%|D
34p*'4)(34''Hn,34'H#)z"'x 'H&J%'x&(#!%!%uo/Z#&#"=,)-#o'*)z%2p&-wEG.)('H%C%2:#&(&  )#.7"7"#IJLK
'!( lnenZF**)5H+!n-,/L. c0!+ 2X1 @ele4*3 e65 87

I*%)(&c(z%I*&(p<w/"'*I*p)(p(z%2p'*)s5#)#u#:#)n)(I*!#&n'*/]s5p&J<&J%'4I*%)(&nz%I*&(pw2}!%'H#.xJ.8D
7"'Hn7%2:#8%'*):)(34')c7"#I*'.
=#2pu.p&J"3 &cJ"#&$#2p2}s5)~wx#2p25s'*puJ.&n)w-&J%'1%'Hz"(#25%'*&s(9&n #/Z#.7%&4:%/"'H
2pIH#2}2p&( 3~:%!34p*'&J%'1'Hn4#])n'*&ow '*I*&(%)qC'*2p%u.!%uv&(v# 7"#&n&('H(v'*I*.u%%}&n}%
7"C%2p'H3 s5#)xwiz"%/v! .r K
%K "K X5s-J%'H&cJ%'qC"#I*9DF7"7"#u#&(p #2}u8p&J"3 s5#)
/"p)(I* 'H'*/B1>@J%'o)(82:z%&n}%  : . 2 '*)$ #)( 2p}u%J.&$34%/"|{ZIH#&(pw5&J%'o3#&J%'H3$#&n}IH#234%/"'*26w
%'Hz"%)X5>5J%'H'cwi'.KL%'Hz"(#2l%'*&Fsn9.)!3$7%2p'H34'H8&7%p'*I*'cDys5p)('42p!%'H#.Dy&7' 1/"'*I*p)(pwyz"%IcD
&(p%)H
In this article we construct a new type of learning machine, the so-called support-vector network. The support-vector network implements the following idea: it maps the input vectors into some high-dimensional feature space $Z$ through some non-linear mapping chosen a priori. In this space a linear decision surface is constructed with special properties that ensure high generalization ability of the network.

EXAMPLE. To obtain a decision surface corresponding to a polynomial of degree two, one can create a feature space, $Z$, which has $N = n(n+3)/2$ coordinates of the form
$$z_1 = x_1, \ldots, z_n = x_n \quad (n \text{ coordinates}),$$
$$z_{n+1} = x_1^2, \ldots, z_{2n} = x_n^2 \quad (n \text{ coordinates}),$$
$$z_{2n+1} = x_1 x_2, \ldots, z_N = x_{n-1} x_n \quad (n(n-1)/2 \text{ coordinates}),$$
where $\mathbf{x} = (x_1, \ldots, x_n)$. The hyperplane is then constructed in this space.

Two problems arise in the above approach: one conceptual and one technical. The conceptual problem is how to find a separating hyperplane that will generalize well: the dimensionality of the feature space will be huge, and not all hyperplanes that separate the training data will necessarily generalize well. The technical problem is how computationally to treat such high-dimensional spaces: to construct a polynomial of degree 4 or 5 in a 200-dimensional space it may be necessary to construct hyperplanes in a billion-dimensional feature space.

The conceptual part of this problem was solved in 1965 for the case of optimal hyperplanes for separable classes. An optimal hyperplane is here defined as the linear decision function with maximal margin between the vectors of the two classes, see Figure 2. It was observed that to construct such optimal hyperplanes one only has to take into account a small amount of the training data, the so-called support vectors, which determine this margin. It was shown that if the training vectors are separated without errors by an optimal hyperplane, the expectation value of the probability of committing an error on a test example is bounded by the ratio between the expectation value of the number of support vectors and the number of training vectors:
$$E[\Pr(\text{error})] \le \frac{E[\text{number of support vectors}]}{\text{number of training vectors}}.$$

[Figure 2: An example of a separable problem in a 2-dimensional space. The support vectors, marked with grey squares, define the margin of largest separation between the two classes (the optimal margin and the optimal hyperplane are indicated).]

Note that this bound does not explicitly contain the dimensionality of the space of separation. It follows from this bound that if the optimal hyperplane can be constructed from a small number of support vectors relative to the training set size, the generalization ability will be high, even in an infinite-dimensional space. In the experimental section we will demonstrate that this ratio can be very small for real-life problems and that the optimal hyperplane generalizes well in a billion-dimensional feature space.
Let
$$\mathbf{w}_0 \cdot \mathbf{z} + b_0 = 0$$
be the optimal hyperplane in feature space. We will show that the weights $\mathbf{w}_0$ of the optimal hyperplane in the feature space can be written as some linear combination of support vectors,
$$\mathbf{w}_0 = \sum_{\text{support vectors}} \alpha_i \, \mathbf{z}_i .$$
The linear decision function $I(\mathbf{z})$ in the feature space will accordingly be of the form
$$I(\mathbf{z}) = \mathrm{sign}\Big( \sum_{\text{support vectors}} \alpha_i \, \mathbf{z}_i \cdot \mathbf{z} + b_0 \Big),$$
where $\mathbf{z}_i \cdot \mathbf{z}$ is the dot-product between the support vectors $\mathbf{z}_i$ and the vector $\mathbf{z}$ in feature space. The decision function can therefore be described as a two-layer network (Figure 3).

[Figure 3: Classification by a support-vector network of an unknown pattern is conceptually done by first transforming the pattern (input vector $x$) into some high-dimensional feature space through a non-linear transformation. An optimal hyperplane, with weights $w_1, \ldots, w_i, \ldots$ attached to the support vectors $z_i$ in feature space, determines the classification output.]

An optimal hyperplane constructed in this feature space determines the output. The similarity to a two-layer perceptron can be seen by comparison to Figure 1.

[Figure 4: Classification of an unknown pattern by a support-vector network. The pattern (input vector $x$) is in input space compared to the support vectors $x_k$, producing values $u_k = K(x_k, x)$. The resulting values are non-linearly transformed, and a linear function of these transformed values, weighted by the Lagrange multipliers, determines the classification output.]

However, even if the optimal hyperplane generalizes well, the technical problem of how to treat the high-dimensional feature space remains. In 1992 it was shown (Boser, Guyon & Vapnik) that the order of operations for constructing a decision function can be interchanged: instead of making a non-linear transformation of the input vectors followed by dot-products with support vectors in feature space, one can first compare two vectors in input space (e.g. by taking their dot-product or some distance measure), and then make a non-linear transformation of the value of the result; see Figure 4. This enables the construction of rich classes of decision surfaces, for example polynomial decision surfaces of arbitrary degree. We will call this type of learning machine a support-vector network.

The technique of support-vector networks was first developed for the restricted case of separating training data without errors. In this article we extend the approach of support-vector networks to cover the case when separation without error on the training vectors is impossible. With this extension we consider the support-vector networks as a new class of learning machine, as powerful and universal as neural networks. In the experimental section we will demonstrate how well it generalizes for high-degree polynomial decision surfaces (up to order 7) in a high-dimensional space (dimension 256). The performance of the algorithm is compared to that of classical learning machines, e.g. linear classifiers, k-nearest-neighbors classifiers, and neural networks. The following sections are devoted to the major points of the derivation of the algorithm and a discussion of some of its properties. Details of the derivation are relegated to an appendix.

2 OPTIMAL HYPERPLANES

In this section we review the method of optimal hyperplanes for separation of training data without errors. In the next section we introduce a notion of soft margins, which will allow for an analytic treatment of learning with errors on the training set.

2.1 The Optimal Hyperplane Algorithm

The set of labeled training patterns
$$(y_1, \mathbf{x}_1), \ldots, (y_\ell, \mathbf{x}_\ell), \qquad y_i \in \{-1, 1\}, \quad (1)$$
is said to be linearly separable if there exist a vector $\mathbf{w}$ and a scalar $b$ such that the inequalities
$$\mathbf{w} \cdot \mathbf{x}_i + b \ge 1 \ \text{ if } y_i = 1, \qquad \mathbf{w} \cdot \mathbf{x}_i + b \le -1 \ \text{ if } y_i = -1 \quad (2)$$
are valid for all elements of the training set (1). Below we write the inequalities (2) in the form
$$y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1, \qquad i = 1, \ldots, \ell. \quad (3)$$

The optimal hyperplane
$$\mathbf{w}_0 \cdot \mathbf{x} + b_0 = 0 \quad (4)$$
is the unique one which separates the training data with a maximal margin: it determines the direction $\mathbf{w}/|\mathbf{w}|$ along which the distance between the projections of the training vectors of the two different classes is maximal; recall Figure 2. This distance $\rho(\mathbf{w}, b)$ is given by
$$\rho(\mathbf{w}, b) = \min_{\{\mathbf{x}:\, y = 1\}} \frac{\mathbf{x} \cdot \mathbf{w}}{|\mathbf{w}|} \; - \; \max_{\{\mathbf{x}:\, y = -1\}} \frac{\mathbf{x} \cdot \mathbf{w}}{|\mathbf{w}|}. \quad (5)$$
The optimal hyperplane $(\mathbf{w}_0, b_0)$ is given by the arguments that maximize the distance (5). It follows from (3) and (5) that
$$\rho(\mathbf{w}_0, b_0) = \frac{2}{|\mathbf{w}_0|}. \quad (6)$$
This means that the optimal hyperplane is the unique one that minimizes $\mathbf{w} \cdot \mathbf{w}$ under the constraints (3). Constructing an optimal hyperplane is therefore a quadratic programming problem.

Vectors $\mathbf{x}_i$ for which $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) = 1$ will be termed support vectors. In Appendix A we show that the vector $\mathbf{w}_0$ that determines the optimal hyperplane can be written as a linear combination of training vectors,
$$\mathbf{w}_0 = \sum_{i=1}^{\ell} y_i \alpha_i^0 \, \mathbf{x}_i, \qquad \alpha_i^0 \ge 0. \quad (7)$$
Since $\alpha_i^0 > 0$ only for support vectors (see Appendix A), the expression (7) represents a compact form of writing $\mathbf{w}_0$. We also show that to find the vector of parameters
$$\Lambda^T = (\alpha_1, \ldots, \alpha_\ell)$$
one has to solve the following quadratic programming problem:

maximize
$$W(\Lambda) = \Lambda^T \mathbf{1} - \frac{1}{2} \Lambda^T D \Lambda \quad (8)$$
with respect to $\Lambda^T = (\alpha_1, \ldots, \alpha_\ell)$, subject to the constraints
$$\Lambda \ge 0, \quad (9)$$
$$\Lambda^T Y = 0, \quad (10)$$
where $\mathbf{1}^T = (1, \ldots, 1)$ is an $\ell$-dimensional unit vector, $Y^T = (y_1, \ldots, y_\ell)$ is the $\ell$-dimensional vector of labels, and $D$ is a symmetric $\ell \times \ell$ matrix with elements
$$D_{ij} = y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j, \qquad i, j = 1, \ldots, \ell. \quad (11)$$
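As a concrete illustration of the quadratic program (8)-(10), the sketch below solves the dual for a small, linearly separable toy set with an off-the-shelf constrained optimizer. It is only a minimal sketch, not the solution scheme used in the paper; the toy data, the use of scipy.optimize.minimize (SLSQP), and all variable names are assumptions introduced for this example.

    import numpy as np
    from scipy.optimize import minimize

    # Toy, linearly separable training set (assumed for illustration only).
    X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    l = len(y)

    # Matrix D from (11): D_ij = y_i y_j x_i . x_j
    D = (y[:, None] * X) @ (y[:, None] * X).T

    # Maximize W(Lambda) = Lambda.1 - 0.5 Lambda D Lambda  <=>  minimize its negative.
    objective = lambda a: -(a.sum() - 0.5 * a @ D @ a)
    constraints = [{"type": "eq", "fun": lambda a: a @ y}]   # (10): Lambda^T Y = 0
    bounds = [(0.0, None)] * l                               # (9): Lambda >= 0

    res = minimize(objective, np.zeros(l), bounds=bounds, constraints=constraints)
    alpha = res.x

    # Recover the hyperplane from (7); b from any support vector (alpha_i > 0).
    w = (alpha * y) @ X
    sv = int(np.argmax(alpha))
    b = y[sv] - w @ X[sv]
    print("alpha =", np.round(alpha, 3), " w =", np.round(w, 3), " b =", round(float(b), 3))

Taking the sign of $\mathbf{w} \cdot \mathbf{x} + b$ then classifies a new point, exactly as in the decision function (4).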

The inequality (9) describes the nonnegative quadrant. We therefore have to maximize the quadratic form (8) in the nonnegative quadrant, subject to the constraints (10).

When the training data (1) can be separated without errors, we also show in Appendix A the following relationship between the maximum of the functional (8), the pair $(\Lambda_0, b_0)$, and the maximal margin $\rho_0$ from (6):
$$W(\Lambda_0) = \frac{2}{\rho_0^2}. \quad (12)$$
If for some $\Lambda^*$ and (large) constant $W_0$ the inequality
$$W(\Lambda^*) > W_0 \quad (13)$$
is valid, one can accordingly assert that all hyperplanes that separate the training data (1) have a margin
$$\rho < \sqrt{\frac{2}{W_0}}.$$

If the training set (1) cannot be separated by a hyperplane, the margin between patterns of the two classes becomes arbitrarily small, resulting in the value of the functional $W(\Lambda)$ turning arbitrarily large. Maximizing the functional (8) under constraints (9) and (10) one therefore either reaches a maximum (in this case one has constructed the hyperplane with the maximal margin $\rho_0$), or one finds that the maximum exceeds some given (large) constant $W_0$ (in which case a separation of the training data with a margin larger than $\sqrt{2/W_0}$ is impossible).

The problem of maximizing functional (8) under constraints (9) and (10) can be solved very efficiently using the following scheme. Divide the training data into a number of portions with a reasonably small number of training vectors in each portion. Start out by solving the quadratic programming problem determined by the first portion of training data. For this problem there are two possible outcomes: either this portion of the data cannot be separated by a hyperplane (in which case the full set of data cannot be separated either), or the optimal hyperplane for separating the first portion of the training data is found.

Let the vector that maximizes functional (8) in the case of separation of the first portion be $\Lambda_1$. Among the coordinates of the vector $\Lambda_1$ some are equal to zero; they correspond to non-support training vectors of this portion. Make a new set of training data containing the support vectors from the first portion of training data and the vectors of the second portion that do not satisfy constraint (3), where $\mathbf{w}$ is determined by $\Lambda_1$. For this set a new functional $W_2(\Lambda)$ is constructed and maximized at $\Lambda_2$. Continuing this process of incrementally constructing a solution vector $\Lambda^*$ covering all the portions of the training data, one either finds that it is impossible to separate the training set without error, or one constructs the optimal separating hyperplane for the full data set, $\Lambda^* = \Lambda_0$. Note that during this process the value of the functional $W(\Lambda)$ is monotonically increasing, since more and more training vectors are considered in the optimization, leading to a smaller and smaller separation between the two classes.
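The portion-by-portion scheme described above can be sketched as follows for separable data. This is only an illustrative outline, not the authors' implementation: the chunk size, the tolerance on $\alpha_i$, and the helper names are assumptions, and the inner solver repeats the toy dual QP shown earlier.

    import numpy as np
    from scipy.optimize import minimize

    def solve_dual(X, y):
        # Dual QP (8)-(10) for one working set; returns (alpha, w, b).
        D = (y[:, None] * X) @ (y[:, None] * X).T
        res = minimize(lambda a: -(a.sum() - 0.5 * a @ D @ a), np.zeros(len(y)),
                       bounds=[(0.0, None)] * len(y),
                       constraints=[{"type": "eq", "fun": lambda a: a @ y}])
        alpha = res.x
        w = (alpha * y) @ X
        sv = int(np.argmax(alpha))
        return alpha, w, y[sv] - w @ X[sv]

    def chunking(X, y, chunk=50):
        # Portion-by-portion construction of the full solution.
        work = list(range(min(chunk, len(y))))
        rest = list(range(len(work), len(y)))
        while True:
            alpha, w, b = solve_dual(X[work], y[work])
            keep = [i for i, a in zip(work, alpha) if a > 1e-8]      # support vectors so far
            bad = [i for i in rest if y[i] * (w @ X[i] + b) < 1.0]   # violate constraint (3)
            if not bad:
                return w, b
            work, rest = keep + bad[:chunk], bad[chunk:]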

3 THE SOFT MARGIN HYPERPLANE

Consider the case where the training data cannot be separated without error. In this case one may want to separate the training set with a minimal number of errors. To express this formally let us introduce some non-negative variables $\xi_i \ge 0$, $i = 1, \ldots, \ell$.

We can now minimize the functional
$$\Phi(\xi) = \sum_{i=1}^{\ell} \xi_i^{\sigma} \quad (14)$$
for small $\sigma > 0$, subject to the constraints
$$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i, \qquad i = 1, \ldots, \ell, \quad (15)$$
$$\xi_i \ge 0, \qquad i = 1, \ldots, \ell. \quad (16)$$
For sufficiently small $\sigma$ the functional (14) describes the number of the training errors. Minimizing (14) one finds some minimal subset of training errors,
$$(y_{i_1}, \mathbf{x}_{i_1}), \ldots, (y_{i_k}, \mathbf{x}_{i_k}).$$
If these data are excluded from the training set one can separate the remaining part of the training set without errors. To separate the remaining part of the training data one can then construct an optimal separating hyperplane.

This idea can be expressed formally as: minimize the functional
$$\frac{1}{2}\, \mathbf{w} \cdot \mathbf{w} + C\, F\Big( \sum_{i=1}^{\ell} \xi_i^{\sigma} \Big) \quad (17)$$
subject to constraints (15) and (16), where $F(u)$ is a monotonic convex function and $C$ is a constant. For sufficiently large $C$ and sufficiently small $\sigma$, the vector $\mathbf{w}_0$ and constant $b_0$ that minimize the functional (17) under constraints (15) and (16) determine the hyperplane that minimizes the number of errors on the training set and separates the rest of the elements with maximal margin.

Note, however, that the problem of constructing a hyperplane which minimizes the number of errors on the training set is in general NP-complete. To avoid NP-completeness of our problem we will consider the case of $\sigma = 1$ (the smallest value of $\sigma$ for which the optimization problem (17) has a unique solution). In this case the functional (17) describes (for sufficiently large $C$) the problem of constructing a separating hyperplane which minimizes the sum of deviations, $\xi$, of training errors and maximizes the margin for the correctly classified vectors. If the training data can be separated without errors the constructed hyperplane coincides with the optimal margin hyperplane.

In contrast to the case $\sigma < 1$, there exists an efficient method for finding the solution of (17) in the case of $\sigma = 1$. Let us call this solution the soft margin hyperplane.

In Appendix A we consider the problem of minimizing the functional
$$\frac{1}{2}\, \mathbf{w} \cdot \mathbf{w} + C\, F\Big( \sum_{i=1}^{\ell} \xi_i \Big) \quad (18)$$
subject to the constraints (15) and (16), where $F(u)$ is a monotonic convex function with $F(0) = 0$. To simplify the formulas we only describe the case of $F(u) = u^2$ in this section. For this function the optimization problem remains a quadratic programming problem.
In Appendix A we show that the vector $\mathbf{w}$, as for the optimal hyperplane algorithm, can be written as a linear combination of support vectors $\mathbf{x}_i$:
$$\mathbf{w} = \sum_{i=1}^{\ell} y_i \alpha_i \, \mathbf{x}_i.$$
To find the vector of parameters $\Lambda^T = (\alpha_1, \ldots, \alpha_\ell)$ one has to solve the dual quadratic programming problem of maximizing
$$W(\Lambda, \delta) = \Lambda^T \mathbf{1} - \frac{1}{2} \Lambda^T D \Lambda - \frac{\delta^2}{4C} \quad (19)$$
subject to the constraints
$$\Lambda^T Y = 0, \quad (20)$$
$$\delta \ge 0, \quad (21)$$
$$0 \le \Lambda \le \delta \mathbf{1}, \quad (22)$$
where $\mathbf{1}$, $\Lambda$, $Y$, and $D$ are the same elements as used in the optimization problem for constructing an optimal hyperplane, $\delta$ is a scalar, and (22) describes coordinate-wise inequalities.

Note that (22) implies that the smallest admissible value of $\delta$ in the functional (19) is
$$\delta = \max(\alpha_1, \ldots, \alpha_\ell) = \alpha_{\max}. \quad (23)$$
Therefore, to find a soft margin classifier one has to find a vector $\Lambda$ that maximizes
$$W(\Lambda) = \Lambda^T \mathbf{1} - \frac{1}{2} \Lambda^T D \Lambda - \frac{\alpha_{\max}^2}{4C} \quad (24)$$



under the constraints $\Lambda \ge 0$ and (20). This problem differs from the problem of constructing an optimal margin classifier only by the additional term with $\alpha_{\max}$ in the functional (24). Due to this term the solution to the problem of constructing the soft margin classifier is unique and exists for any data set.

The functional (24) is not quadratic because of the term with $\alpha_{\max}$. Maximizing (24) subject to the constraints $\Lambda \ge 0$ and (20) belongs to the group of so-called convex programming problems. Therefore, to construct a soft margin classifier one can either solve the convex programming problem in the $\ell$-dimensional space of the parameters $\Lambda$, or one can solve the quadratic programming problem in the dual $(\ell + 1)$-dimensional space of the parameters $\Lambda$ and $\delta$. In our experiments we construct the soft margin hyperplanes by solving the dual quadratic programming problem.
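For comparison with the formulation above, the snippet below fits a soft margin classifier with scikit-learn's SVC, which implements the now-standard variant of the soft margin with a penalty $C \sum_i \xi_i$ (i.e. $F(u) = u$) rather than the $F(u) = u^2$ form analysed in this paper; the data and parameter values are assumptions chosen only for illustration.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    # Two overlapping Gaussian clouds, so separation without error is impossible.
    X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
    y = np.array([-1] * 50 + [1] * 50)

    clf = SVC(kernel="linear", C=10.0)   # C controls the error/margin trade-off
    clf.fit(X, y)
    print("training error:", np.mean(clf.predict(X) != y))
    print("support vectors per class:", clf.n_support_)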





   
4 THE METHOD OF CONVOLUTION OF THE DOT-PRODUCT IN FEATURE SPACE

The algorithms described in the previous sections construct hyperplanes in the input space. To construct a hyperplane in a feature space one first has to transform the $n$-dimensional input vector $\mathbf{x}$ into an $N$-dimensional feature vector through a choice of an $N$-dimensional vector function $\phi$:
$$\phi : \mathbb{R}^n \to \mathbb{R}^N.$$
An $N$-dimensional linear separator $\mathbf{w}$ and a bias $b$ are then constructed for the set of transformed vectors
$$\phi(\mathbf{x}_i) = \big(\phi_1(\mathbf{x}_i), \phi_2(\mathbf{x}_i), \ldots, \phi_N(\mathbf{x}_i)\big), \qquad i = 1, \ldots, \ell.$$
Classification of an unknown vector $\mathbf{x}$ is done by first transforming the vector to the separating space ($\mathbf{x} \mapsto \phi(\mathbf{x})$) and then taking the sign of the function
$$f(\mathbf{x}) = \mathbf{w} \cdot \phi(\mathbf{x}) + b. \quad (25)$$

According to the properties of the soft margin classifier method, the vector $\mathbf{w}$ can be written as a linear combination of support vectors (in the feature space). That means
$$\mathbf{w} = \sum_{i=1}^{\ell} y_i \alpha_i \, \phi(\mathbf{x}_i). \quad (26)$$
The linearity of the dot-product implies that the classification function $f$ in (25) for an unknown vector $\mathbf{x}$ only depends on dot-products:
$$f(\mathbf{x}) = \phi(\mathbf{x}) \cdot \mathbf{w} + b = \sum_{i=1}^{\ell} y_i \alpha_i \, \phi(\mathbf{x}) \cdot \phi(\mathbf{x}_i) + b. \quad (27)$$

The idea of constructing support-vector networks comes from considering general forms of the dot-product in a Hilbert space:
$$\mathbf{u} \cdot \mathbf{v} \equiv K(\mathbf{u}, \mathbf{v}). \quad (28)$$

According to Hilbert-Schmidt theory, any symmetric function $K(\mathbf{u}, \mathbf{v})$ with $K \in L_2$ can be expanded in the form
$$K(\mathbf{u}, \mathbf{v}) = \sum_{i=1}^{\infty} \lambda_i \, \phi_i(\mathbf{u}) \, \phi_i(\mathbf{v}), \quad (29)$$
where $\lambda_i \in \mathbb{R}$ and $\phi_i$ are the eigenvalues and eigenfunctions,
$$\int K(\mathbf{u}, \mathbf{v}) \, \phi_i(\mathbf{u}) \, d\mathbf{u} = \lambda_i \, \phi_i(\mathbf{v}),$$
of the integral operator defined by the kernel $K(\mathbf{u}, \mathbf{v})$. A sufficient condition to ensure that (28) defines a dot-product in a feature space is that all the eigenvalues in the expansion (29) are positive. To guarantee that these coefficients are positive, it is necessary and sufficient (Mercer's theorem) that the condition
$$\iint K(\mathbf{u}, \mathbf{v}) \, g(\mathbf{u}) \, g(\mathbf{v}) \, d\mathbf{u} \, d\mathbf{v} > 0$$
is satisfied for all $g$ such that
$$\int g^2(\mathbf{u}) \, d\mathbf{u} < \infty.$$
Functions that satisfy Mercer's theorem can therefore be used as dot-products. Aizerman, Braverman and Rozonoer considered a convolution of the dot-product in the feature space given by functions of the form
$$K(\mathbf{u}, \mathbf{v}) = \exp\Big( -\frac{|\mathbf{u} - \mathbf{v}|}{\sigma} \Big), \quad (30)$$
which they call Potential Functions.

However, the convolution of the dot-product in feature space can be given by any function satisfying Mercer's condition; in particular, to construct a polynomial classifier of degree $d$ in $n$-dimensional input space one can use the function
$$K(\mathbf{u}, \mathbf{v}) = (\mathbf{u} \cdot \mathbf{v} + 1)^d. \quad (31)$$

Using different dot-products $K(\mathbf{u}, \mathbf{v})$ one can construct different learning machines with arbitrary types of decision surfaces. The decision surface of these machines has the form
$$f(\mathbf{x}) = \sum_{i=1}^{\ell} y_i \alpha_i \, K(\mathbf{x}, \mathbf{x}_i) + b,$$
where $\mathbf{x}_i$ is the image of a support vector in input space and $\alpha_i$ is the weight of a support vector in the feature space.
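For intuition, here is a small worked case (not spelled out at this point in the paper): for $n = 2$ the polynomial kernel (31) with $d = 2$ is exactly the ordinary dot-product of explicit degree-two feature vectors,
$$\phi(\mathbf{x}) = \big(1,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2\big), \qquad \phi(\mathbf{u}) \cdot \phi(\mathbf{v}) = (\mathbf{u} \cdot \mathbf{v} + 1)^2,$$
which can be checked by expanding $(u_1 v_1 + u_2 v_2 + 1)^2$ term by term. Evaluating $K$ thus replaces an explicit construction of $\phi$.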

To find the vectors $\mathbf{x}_i$ and weights $\alpha_i$ one follows the same solution scheme as for the original optimal margin classifier or soft margin classifier. The only difference is that instead of the matrix $D$ (determined by (11)) one uses the matrix
$$D_{ij} = y_i y_j \, K(\mathbf{x}_i, \mathbf{x}_j), \qquad i, j = 1, \ldots, \ell. \quad (32)$$
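The practical consequence is that the dual solver carries over unchanged once the linear Gram matrix is replaced by a kernel matrix, and classification uses the kernel expansion above. The sketch below shows this substitution; the kernel choices, default parameters and helper names are assumptions introduced for illustration, not the authors' code.

    import numpy as np

    def poly_kernel(u, v, d=2):
        # Polynomial convolution of the dot-product, K(u, v) = (u.v + 1)^d, cf. (31).
        return (u @ v + 1.0) ** d

    def rbf_kernel(u, v, sigma=1.0):
        # Radial-basis convolution, K(u, v) = exp(-|u - v|^2 / sigma^2).
        return np.exp(-np.sum((u - v) ** 2) / sigma ** 2)

    def build_D(X, y, kernel):
        # Kernelized matrix D_ij = y_i y_j K(x_i, x_j), used in place of (11), cf. (32).
        K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
        return (y[:, None] * y[None, :]) * K

    def decision_function(x, X_sv, y_sv, alpha_sv, b, kernel):
        # f(x) = sum_i y_i alpha_i K(x, x_i) + b over the support vectors.
        return sum(a * yi * kernel(x, xi) for a, yi, xi in zip(alpha_sv, y_sv, X_sv)) + b

Plugging build_D into the dual QP of the earlier sketch in place of the linear $D$ gives a polynomial (or RBF) support-vector machine; taking the sign of decision_function classifies a new point.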

"

5 GENERAL FEATURES OF SUPPORT-VECTOR NETWORKS

5.1 Constructing the Decision Rules by Support-Vector Networks is Efficient

To construct a support-vector network decision rule one has to solve a quadratic optimization problem,
$$W(\Lambda) = \Lambda^T \mathbf{1} - \frac{1}{2} \Lambda^T D \Lambda,$$
under the simple constraints
$$\Lambda \ge 0, \qquad \Lambda^T Y = 0,$$
where the matrix
$$D_{ij} = y_i y_j \, K(\mathbf{x}_i, \mathbf{x}_j)$$
is determined by the elements of the training set, $K(\mathbf{u}, \mathbf{v})$ is the function determining the convolution of the dot-products, and in the soft margin case the additional term with $\alpha_{\max}$ of Section 3 is included.

The solution to the optimization problem can be found efficiently by solving intermediate optimization problems determined by the portions of the training data that currently constitute the support vectors; this technique was described in Section 2. The obtained optimal decision function is unique. Each optimization problem can be solved using any standard technique.

5.2 The Support-Vector Network is a Universal Machine

By changing the function $K(\mathbf{u}, \mathbf{v})$ for the convolution of the dot-product one can implement different networks.

In the next section we consider support-vector network machines that use polynomial decision surfaces. To specify polynomials of different order $d$ one can use the following functions for convolution of the dot-product:
$$K(\mathbf{u}, \mathbf{v}) = (\mathbf{u} \cdot \mathbf{v} + 1)^d.$$
Radial Basis Function machines, with decision functions of the form
$$f(\mathbf{x}) = \mathrm{sign}\Big( \sum_i \alpha_i \exp\Big\{ -\frac{|\mathbf{x} - \mathbf{x}_i|^2}{\sigma^2} \Big\} \Big),$$
can be implemented by using convolutions of the type
$$K(\mathbf{u}, \mathbf{v}) = \exp\Big\{ -\frac{|\mathbf{u} - \mathbf{v}|^2}{\sigma^2} \Big\}.$$
In this case the support-vector network machine will construct both the centres $\mathbf{x}_i$ of the approximating function and the weights $\alpha_i$.

One can also incorporate a priori knowledge of the problem at hand by constructing special convolution functions. Support-vector networks are therefore a rather general class of learning machines which changes its set of decision functions simply by changing the form of the dot-product.




5.3 Support-Vector Networks and Control of Generalization Ability

To control the generalization ability of a learning machine one has to control two different factors: the error rate on the training data and the capacity of the learning machine as measured by its VC-dimension. There exists a bound for the probability of errors on the test set of the following form: with probability $1 - \eta$ the inequality
$$\Pr(\text{test error}) \le \text{Frequency(training error)} + \text{Confidence Interval} \quad (33)$$
is valid. In the bound (33) the confidence interval depends on the VC-dimension of the learning machine, the number of elements in the training set, and the value of $\eta$.

The two factors in (33) form a trade-off: the smaller the VC-dimension of the set of functions of the learning machine, the smaller the confidence interval, but the larger the value of the error frequency.

A general way of resolving this trade-off was proposed as the principle of structural risk minimization: for the given data set one has to find a solution that minimizes their sum. A particular case of the structural risk minimization principle is the Occam-Razor principle: keep the first term equal to zero and minimize the second one.

It is known that the VC-dimension of the set of linear indicator functions
$$I(\mathbf{x}) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x} + b)$$
with fixed threshold $b$ is equal to the dimensionality of the input space. However, the VC-dimension of the subset
$$I(\mathbf{x}) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x} + b), \qquad |\mathbf{w}| \le C_w$$
(the set of functions with bounded norm of the weights) can be less than the dimensionality of the input space and will depend on $C_w$.

From this point of view the optimal margin classifier method executes an Occam-Razor principle: it keeps the first term of (33) equal to zero (by satisfying the inequality (3)) and it minimizes the second term (by minimizing the functional $\mathbf{w} \cdot \mathbf{w}$). This minimization prevents an over-fitting problem.

However, even in the case where the training data are separable, one may obtain a better generalization ability by minimizing the confidence term in (33) even further, at the expense of errors on the training set. In the soft margin classifier method this can be done by choosing appropriate values of the parameter $C$. In the support-vector network algorithm one can control the trade-off between complexity of the decision rule and frequency of error by changing the parameter $C$, even in the more general case where there exists no solution with zero error on the training set. Therefore the support-vector network can control both factors affecting the generalization ability of the learning machine.
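The role of $C$ as the knob for this trade-off is easy to observe empirically. The sweep below, with assumed data and an assumed parameter grid, is only an illustration of the effect just described, not an experiment from the paper.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(-1, 1.2, (100, 2)), rng.normal(1, 1.2, (100, 2))])
    y = np.array([-1] * 100 + [1] * 100)

    for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
        clf = SVC(kernel="poly", degree=2, C=C).fit(X, y)
        train_err = np.mean(clf.predict(X) != y)
        # Larger C: fewer training errors but a more complex rule (smaller margin).
        print(f"C={C:<6} training error={train_err:.3f} "
              f"support vectors={clf.support_vectors_.shape[0]}")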

[Figure 5: Examples of decision surfaces obtained with the dot-product (31) with $d = 2$. Support patterns are indicated with double circles, errors with a cross.]

6 EXPERIMENTAL ANALYSIS

To demonstrate the support-vector network method we conduct two types of experiments. We construct artificial sets of patterns in the plane and experiment with 2nd degree polynomial decision surfaces, and we conduct experiments with the real-life problem of digit recognition.

6.1 Experiments in the Plane

Using dot-products of the form (31) with $d = 2$ we construct decision rules for different sets of patterns in the plane. Results of these experiments can be visualized and provide nice illustrations of the power of the algorithm. Examples are shown in Figure 5. The classes are represented by black and white bullets. In the figure we indicate support patterns with a double circle, and errors with a cross. The solutions are optimal in the sense that no 2nd degree polynomials exist that make fewer errors. Notice that the numbers of support patterns relative to the number of training patterns are small.
6.2 Experiments with Digit Recognition

Our experiments for constructing support-vector networks make use of two different databases for bit-mapped digit recognition, a small and a large database. The small one is a US Postal Service database that contains 7,300 training patterns and 2,000 test patterns. The resolution of the database is 16 x 16 pixels, and some typical examples are shown in Figure 6. On this database we report experimental research with polynomials of various degree.

[Figure 6: Examples of patterns with labels from the US Postal Service digit database.]

The large database consists of 60,000 training and 10,000 test patterns, and is a 50-50 mixture of the NIST training and test sets. On this database we have only constructed a 4th degree polynomial classifier. The performance of this classifier is compared to other types of learning machines that took part in a benchmark study.

In all our experiments ten separators, one for each class, are constructed. Each hyper-surface makes use of the same dot-product and pre-processing of the data. Classification of an unknown pattern is done according to the maximum output of these ten classifiers.
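The ten-classifier scheme (one separator per digit, prediction by maximum output) can be sketched as follows. The kernel, parameter values, data shapes and helper names are assumptions; the snippet only illustrates the decision rule, it does not reproduce the experiments.

    import numpy as np
    from sklearn.svm import SVC

    def train_digit_classifiers(X, digits, degree=4, C=10.0):
        # One polynomial support-vector classifier per class (digits 0..9).
        machines = []
        for d in range(10):
            y = np.where(digits == d, 1, -1)
            machines.append(SVC(kernel="poly", degree=degree, C=C).fit(X, y))
        return machines

    def classify(machines, x):
        # An unknown pattern gets the label of the classifier with maximum output.
        outputs = [m.decision_function(x.reshape(1, -1))[0] for m in machines]
        return int(np.argmax(outputs))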

6.2.1 Experiments with the US Postal Service Database

The US Postal Service database has been recorded from actual mail pieces, and results from this database have been reported by several researchers. In Table 1 we list the performance of various classifiers collected from publications and own experiments. The result on human performance was reported by Bromley and Sackinger. The result with CART was obtained by Daryl Pregibon and Michael Riley at Bell Labs, Murray Hill, NJ. The results with C4.5 and with the best 2-layer neural network (with the optimal number of hidden units) were obtained specially for this paper, the latter by Bernhard Scholkopf. The result with a special purpose neural network architecture with 5 layers, LeNet1, was obtained by Y. LeCun et al.

In the experiments with the US Postal Service database we used pre-processing (centering, de-slanting and smoothing) to incorporate knowledge about the invariances of the problem at hand.
[Table 1: Performance (raw error, %) of various classifiers on the US Postal Service database, collected from publications and own experiments: human performance; decision tree, CART; decision tree, C4.5; best 2-layer neural network; special architecture 5-layer network (LeNet1). References see text.]

[Table 2: Results obtained for dot-products of polynomials of various degree: raw error (%), number of support vectors, and dimensionality of the corresponding feature space. The number of support vectors is a mean value per classifier.]

The effect of smoothing of this database as a pre-processing for support-vector networks was investigated earlier; in our experiments we chose the smoothing kernel as a Gaussian with a small standard deviation, in agreement with that work.

In the experiments with this database we constructed polynomial indicator functions based on dot-products of the form (31). The input dimensionality was 256, and the order of the polynomial ranged from 1 to 7. Table 2 describes the results of the experiments. The training data are not linearly separable.

Notice that the number of support vectors increases very slowly. The degree-7 polynomial has only about 30% more support vectors than the 3rd degree polynomial, and even fewer than the first degree polynomial. The dimensionality of the feature space for a degree-7 polynomial is however about $10^{10}$ times larger than the dimensionality of the feature space for a 3rd degree polynomial classifier. Note that performance almost does not change with increasing dimensionality of the space, indicating no over-fitting problems.

[Figure 7: Labeled examples of errors on the training set of the 2nd degree polynomial support-vector classifier.]

The relatively high number of support vectors for the linear separator is due to non-separability: the number counted includes both support vectors and training vectors with a non-zero $\xi$-value. For the linear case the number of mis-classifications on the training set is considerable per classifier; for a 2nd degree classifier it drops to a few patterns, which are shown in Figure 7.

It is remarkable that in all our experiments the bound on generalization ability holds when we consider the number of obtained support vectors instead of the expectation value of this number: in all cases the upper bound on the error probability for the single classifier remains small, and the actual error on the test data for the single classifier is smaller still.

The training time for construction of polynomial classifiers does not depend on the degree of the polynomial, only on the number of support vectors. Even in the worst case it is faster than the best performing neural network, constructed specially for the task, LeNet1, whose performance is listed in Table 1. Polynomials with degree 2 or higher outperform LeNet1.
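The bound just invoked (expected test error bounded by the expected number of support vectors divided by the number of training examples) is easy to monitor in practice. The two-line check below assumes the fitted soft margin classifier clf and labels y from the earlier, purely illustrative sklearn sketch.

    bound = clf.support_vectors_.shape[0] / len(y)   # ratio #SV / #training examples
    print(f"support-vector bound on expected test error: {bound:.3f}")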
6.2.2 Experiments with the NIST Database

The NIST database was used for benchmark studies conducted over just two weeks. The limited time frame enabled only the construction of one type of classifier, for which we chose a 4th degree polynomial with no pre-processing. Our choice was based on our experience with the US Postal Service database.

Table 3 lists the number of support vectors for each of the ten classifiers and gives the performance of the classifiers on the training and test sets. Notice that even polynomials of degree 4, which have a very large number of free parameters, commit errors on this training set; the average frequency of training errors is nevertheless small per class. The misclassified test patterns of classifier 1 are shown in Figure 8. Notice again how the upper bound on generalization ability holds for the obtained number of support vectors.

The combined performance of the ten classifiers on the test set is 1.1% error. This result should be compared to that of the other classifiers participating in the benchmark study.

[Table 3: Results obtained for a 4th degree polynomial classifier on the NIST database: for each of the ten classifiers (Cl. 0 to Cl. 9), the number of support patterns and the errors on the training and test sets. The size of the training set is 60,000, and the size of the test set is 10,000 patterns.]

[Figure 8: The misclassified test patterns with labels for classifier 1. Patterns with label 1 are false negative; patterns with other labels are false positive.]

[Figure 9: Results from the benchmark study. Test error rates: linear classifier 8.4%; k=3-nearest neighbor 2.4%; LeNet1 1.7%; LeNet4 1.1%; SVN 1.1%.]

These other classifiers include a linear classifier, a k=3-nearest neighbor classifier with 60,000 prototypes, and two neural networks specially constructed for digit recognition (LeNet1 and LeNet4). The authors only contributed with results for support-vector networks. The results of the benchmark are given in Figure 9.
We conclude this section by citing the paper describing the results of the benchmark:

"For quite a long time LeNet1 was considered state of the art. ... Through a series of experiments in architecture, combined with an analysis of the characteristics of recognition errors, LeNet4 was crafted."

"The support-vector network has excellent accuracy, which is most remarkable, because unlike the other high performance classifiers, it does not include knowledge about the geometry of the problem. In fact the classifier would do as well if the image pixels were encrypted, e.g. by a fixed, random permutation."

The last remark suggests that further improvement of the performance of the support-vector network can be expected from the construction of functions for the dot-product $K(\mathbf{u}, \mathbf{v})$ that reflect a priori information about the problem at hand.
7 CONCLUSION

This paper introduces the support-vector network as a new learning machine for two-group classification problems.

The support-vector network combines three ideas: the solution technique from optimal hyperplanes (which allows for an expansion of the solution vector on support vectors), the idea of convolution of the dot-product (which extends the solution surfaces from linear to non-linear), and the notion of soft margins (to allow for errors on the training set).

The algorithm has been tested and compared to the performance of other classical algorithms. Despite the simplicity of the design of its decision surface, the new algorithm exhibits a very fine performance in the comparison study.

Other characteristics, like capacity control and ease of changing the implemented decision surface, render the support-vector network an extremely powerful and universal learning machine.

"

A APPENDIX

In this appendix we derive both the method for constructing optimal hyperplanes and soft margin hyperplanes.

A.1 The Optimal Hyperplane Algorithm

It was shown in Section 2 that to construct the optimal hyperplane
$$\mathbf{w}_0 \cdot \mathbf{x} + b_0 = 0, \quad (A1)$$
which separates a set of training data
$$(y_1, \mathbf{x}_1), \ldots, (y_\ell, \mathbf{x}_\ell), \quad (A2)$$
one has to minimize a functional
$$\Phi(\mathbf{w}) = \frac{1}{2}\, \mathbf{w} \cdot \mathbf{w}$$
subject to the constraints
$$y_i(\mathbf{x}_i \cdot \mathbf{w} + b) \ge 1, \qquad i = 1, \ldots, \ell. \quad (A3)$$

To do this we use a standard optimization technique. We construct a Lagrangian
$$L(\mathbf{w}, b, \Lambda) = \frac{1}{2}\, \mathbf{w} \cdot \mathbf{w} - \sum_{i=1}^{\ell} \alpha_i \big[ y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \big], \quad (A4)$$
where $\Lambda^T = (\alpha_1, \ldots, \alpha_\ell)$ is the vector of non-negative Lagrange multipliers corresponding to the constraints (A3).

It is known that the solution to the optimization problem is determined by the saddle point of this Lagrangian in the $(2\ell + 1)$-dimensional space of $\mathbf{w}$, $\Lambda$, and $b$, where the minimum should be taken with respect to the parameters $\mathbf{w}$ and $b$, and the maximum should be taken with respect to the Lagrange multipliers $\Lambda$.

At the point of the minimum (with respect to $\mathbf{w}$ and $b$) one obtains
$$\frac{\partial L}{\partial \mathbf{w}}\Big|_{\mathbf{w} = \mathbf{w}_0} = \mathbf{w}_0 - \sum_{i=1}^{\ell} \alpha_i y_i \mathbf{x}_i = 0, \quad (A5)$$
$$\frac{\partial L}{\partial b}\Big|_{b = b_0} = \sum_{i=1}^{\ell} \alpha_i y_i = 0. \quad (A6)$$
From equality (A5) we derive
$$\mathbf{w}_0 = \sum_{i=1}^{\ell} \alpha_i y_i \mathbf{x}_i, \quad (A7)$$
which expresses that the optimal hyperplane solution can be written as a linear combination of training vectors. Note that only training vectors $\mathbf{x}_i$ with $\alpha_i > 0$ have an effective contribution to the sum (A7).

Substituting (A5) and (A6) into (A4) we obtain
$$W(\Lambda) = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i=1}^{\ell} \sum_{j=1}^{\ell} \alpha_i \alpha_j \, y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j. \quad (A8)$$
In vector notation this can be rewritten as
$$W(\Lambda) = \Lambda^T \mathbf{1} - \frac{1}{2} \Lambda^T D \Lambda, \quad (A9)$$
where $\mathbf{1}$ is an $\ell$-dimensional unit vector and $D$ is a symmetric $\ell \times \ell$ matrix with elements $D_{ij} = y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j$.

To find the desired saddle point it remains to locate the maximum of (A9) under the constraints (A6), that is $\Lambda^T Y = 0$ with $Y^T = (y_1, \ldots, y_\ell)$, and
$$\Lambda \ge 0. \quad (A10)$$
The Kuhn-Tucker theorem plays an important part in the theory of optimization. According to this theorem, at our saddle point in $(\mathbf{w}_0, b_0, \Lambda_0)$, any Lagrange multiplier $\alpha_i^0$ and its corresponding constraint are connected by an equality
$$\alpha_i^0 \big[ y_i(\mathbf{x}_i \cdot \mathbf{w}_0 + b_0) - 1 \big] = 0, \qquad i = 1, \ldots, \ell. \quad (A11)$$
From this equality it follows that non-zero values $\alpha_i^0$ are only achieved in the cases where
$$y_i(\mathbf{x}_i \cdot \mathbf{w}_0 + b_0) = 1,$$
i.e. where the inequality (A3) is met as an equality. We call the vectors $\mathbf{x}_i$ for which this holds support vectors. Note that in this terminology the equation (A7) states that the solution vector $\mathbf{w}_0$ can be expanded on support vectors.

Another observation, based on the Kuhn-Tucker equations (A11) and (A7) of the optimal solution, is the relationship between the maximal value $W(\Lambda_0)$ and the separation distance $\rho_0$:
$$\mathbf{w}_0 \cdot \mathbf{w}_0 = \sum_{i=1}^{\ell} \alpha_i^0 \, y_i \, \mathbf{x}_i \cdot \mathbf{w}_0 = \sum_{i=1}^{\ell} \alpha_i^0 \,(1 - y_i b_0) = \sum_{i=1}^{\ell} \alpha_i^0 .$$
Substituting this equality into the expression (A9) for $W(\Lambda_0)$ we obtain
$$W(\Lambda_0) = \sum_{i=1}^{\ell} \alpha_i^0 - \frac{1}{2}\, \mathbf{w}_0 \cdot \mathbf{w}_0 = \frac{\mathbf{w}_0 \cdot \mathbf{w}_0}{2}.$$
Taking into account the expression (6) from Section 2 we obtain
$$W(\Lambda_0) = \frac{2}{\rho_0^2},$$
where $\rho_0$ is the margin of the optimal hyperplane.
A.2 The Soft Margin Hyperplane

Below we first consider the case $F(u) = u^k$; we then describe the general result for a monotonic convex function $F(u)$.

To construct a soft margin separating hyperplane we minimize the functional
$$\frac{1}{2}\, \mathbf{w} \cdot \mathbf{w} + C \Big( \sum_{i=1}^{\ell} \xi_i \Big)^{k}$$
under the constraints
$$y_i(\mathbf{x}_i \cdot \mathbf{w} + b) \ge 1 - \xi_i, \qquad i = 1, \ldots, \ell, \quad (A12)$$
$$\xi_i \ge 0, \qquad i = 1, \ldots, \ell. \quad (A13)$$
The Lagrange functional for this problem is
$$L(\mathbf{w}, \xi, b, \Lambda, R) = \frac{1}{2}\, \mathbf{w} \cdot \mathbf{w} + C \Big( \sum_{i=1}^{\ell} \xi_i \Big)^{k} - \sum_{i=1}^{\ell} \alpha_i \big[ y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 + \xi_i \big] - \sum_{i=1}^{\ell} r_i \xi_i, \quad (A14)$$
where the non-negative multipliers $\Lambda^T = (\alpha_1, \ldots, \alpha_\ell)$ arise from the constraints (A12), and the non-negative multipliers $R^T = (r_1, \ldots, r_\ell)$ enforce the constraints (A13). We have to find the saddle point of this functional: the minimum with respect to the variables $\mathbf{w}$, $b$, and $\xi_i$, and the maximum with respect to the variables $\Lambda$ and $R$.

Let us use the conditions of the minimum of this functional at the extremum point:
$$\frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{\ell} \alpha_i y_i \mathbf{x}_i = 0, \quad (A15)$$
$$\frac{\partial L}{\partial b} = \sum_{i=1}^{\ell} \alpha_i y_i = 0, \quad (A16)$$
$$\frac{\partial L}{\partial \xi_i} = kC \Big( \sum_{j=1}^{\ell} \xi_j \Big)^{k-1} - \alpha_i - r_i = 0. \quad (A17)$$
If we denote
$$\delta = kC \Big( \sum_{j=1}^{\ell} \xi_j \Big)^{k-1}, \quad (A18)$$
we can rewrite equation (A17) as
$$\alpha_i + r_i = \delta, \qquad i = 1, \ldots, \ell. \quad (A19)$$
Substituting the expressions that follow from (A15)-(A19) into the Lagrange functional (A14), we obtain
$$W(\Lambda, \delta) = \Lambda^T \mathbf{1} - \frac{1}{2} \Lambda^T D \Lambda - \frac{k-1}{k}\, \frac{\delta^{k/(k-1)}}{(kC)^{1/(k-1)}}, \quad (A20)$$
where $\Lambda$, $\mathbf{1}$, $Y$, and $D$ are as defined in A.1. To find the desired saddle point one therefore has to find the maximum of (A20) under the constraints
$$\Lambda^T Y = 0, \quad (A21)$$
$$\Lambda + R = \delta \mathbf{1}, \quad (A22)$$
$$\Lambda \ge 0, \qquad R \ge 0, \qquad \delta \ge 0. \quad (A23)$$
From (A22) and (A23) one obtains that the vector $\Lambda$ should satisfy the conditions
$$0 \le \Lambda \le \delta \mathbf{1}. \quad (A24)$$
From conditions (A22)-(A24) one can also conclude that, to maximize (A20), the smallest admissible value of $\delta$ should be taken,
$$\delta = \max(\alpha_1, \ldots, \alpha_\ell). \quad (A25)$$
Substituting this value of $\delta$ into (A20) we obtain
$$W(\Lambda) = \Lambda^T \mathbf{1} - \frac{1}{2} \Lambda^T D \Lambda - \frac{k-1}{k}\, \frac{\big(\max_i \alpha_i\big)^{k/(k-1)}}{(kC)^{1/(k-1)}}. \quad (A26)$$
To find the soft margin hyperplane one can therefore either find the maximum of the quadratic form (A20) under the constraints (A21) and (A24), or one has to find the maximum of the convex function (A26) under the constraints (A21) and $\Lambda \ge 0$. In the experiments reported in this paper we used $F(u) = u^2$ (i.e. $k = 2$) and solved the quadratic programming problem (A20).

The general solution for the case of a monotone convex function $F(u)$ with $F(0) = 0$ can also be obtained with this technique. The soft margin hyperplane again has the form
$$\mathbf{w} = \sum_{i=1}^{\ell} \alpha_i y_i \mathbf{x}_i,$$
where $\Lambda$ is the solution of the dual convex programming problem of maximizing
$$W(\Lambda, \delta) = \Lambda^T \mathbf{1} - \frac{1}{2} \Lambda^T D \Lambda - \Big[ \delta\, F'^{-1}\!\Big(\frac{\delta}{C}\Big) - C\, F\Big( F'^{-1}\Big(\frac{\delta}{C}\Big) \Big) \Big]$$
under the constraints (A21) and (A24), where $F'^{-1}$ denotes the inverse of the derivative of $F$. For convex monotone functions $F(u)$ with $F(0) = 0$ the term in square brackets is positive and goes to infinity when $\delta$ goes to infinity.

Finally, we can consider the hyperplane that minimizes the form
$$\frac{1}{2}\, \mathbf{w} \cdot \mathbf{w} + C \sum_{i=1}^{\ell} \xi_i^2$$
subject to the constraints (A12)-(A13), where the second term minimizes the least square value of the errors. The same technique leads to the quadratic programming problem of maximizing the functional
$$W(\Lambda) = \Lambda^T \mathbf{1} - \frac{1}{2} \Big[ \Lambda^T D \Lambda + \frac{\Lambda^T \Lambda}{2C} \Big]$$
in the non-negative quadrant $\Lambda \ge 0$, subject to the constraint $\Lambda^T Y = 0$.
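A compact way to read the last quadratic program (a reformulation, not a formula stated by the paper): the squared-error term only perturbs the diagonal of $D$,
$$\Lambda^T D \Lambda + \frac{1}{2C}\, \Lambda^T \Lambda \;=\; \Lambda^T \Big( D + \frac{1}{2C} I \Big) \Lambda,$$
so this soft margin variant is simply the hard margin dual of A.1 with $D$ replaced by $D + I/(2C)$.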

References

Aizerman, M., Braverman, E., & Rozonoer, L. (1964). Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821-837.

Anderson, T. W., & Bahadur, R. R. (1962). Classification into two multivariate normal distributions with different covariance matrices. Annals of Mathematical Statistics, 33:420-431.

Boser, B. E., Guyon, I., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144-152, Pittsburgh. ACM.

Bottou, L., Cortes, C., Denker, J. S., Drucker, H., Guyon, I., Jackel, L. D., LeCun, Y., Muller, U. A., Sackinger, E., Simard, P., & Vapnik, V. (1994). Comparison of classifier methods: a case study in handwritten digit recognition. Proceedings of the 12th International Conference on Pattern Recognition and Neural Networks.

Bromley, J., & Sackinger, E. (1991). Neural-network and k-nearest-neighbor classifiers. Technical report, AT&T.

Courant, R., & Hilbert, D. (1953). Methods of Mathematical Physics. Interscience, New York.

Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:111-132.

LeCun, Y. (1985). Une procedure d'apprentissage pour reseau a seuil assymetrique. Cognitiva 85: A la Frontiere de l'Intelligence Artificielle, des Sciences de la Connaissance, des Neurosciences, pages 599-604, Paris.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1990). Handwritten digit recognition with a back-propagation network. Advances in Neural Information Processing Systems, volume 2, pages 396-404. Morgan Kaufmann.

Parker, D. B. (1985). Learning logic. Technical Report TR-47, Center for Computational Research in Economics and Management Science, Massachusetts Institute of Technology, Cambridge, MA.

Rosenblatt, F. (1962). Principles of Neurodynamics. Spartan Books, New York.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by back-propagating errors. Nature, 323:533-536.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In J. L. McClelland and D. E. Rumelhart, editors, Parallel Distributed Processing, Volume 1, pages 318-362. MIT Press.

Vapnik, V. N. (1982). Estimation of Dependences Based on Empirical Data. Addendum 1. Springer-Verlag, New York.

REF [4]

Support Vector Learning


submitted by
Diplom-Physiker, M.Sc. (Mathematics)
Bernhard Schölkopf
from Stuttgart

Accepted by Department 13 (Computer Science)
of the Technische Universität Berlin
in fulfilment of the requirements for the academic degree

Doktor der Naturwissenschaften
(Dr. rer. nat.)

approved dissertation

Doctoral committee:
Chair: Prof. Dr. K. Obermayer
Reviewer: Prof. Dr. S. Jähnichen
Reviewer: Prof. Dr. V. Vapnik
Date of the scientific defense: 30 September 1997
Berlin 1997
D 83


The thesis was published by: Oldenbourg Verlag, Munich, 1997.

Support Vector Learning

Bernhard Schölkopf
Dissertation for the degree of Dr. rer. nat. -- summary of the main results


The subject of this thesis is learning pattern recognition as a statistical problem. A learning machine extracts, from a set of training patterns, structures that allow it to classify new examples. The thesis addresses the following questions:

- Which "features" should be extracted from the individual training patterns? To study this question, a new form of nonlinear principal component analysis ("Kernel PCA") was developed. By using integral operator kernels, a linear principal component analysis can be carried out in feature spaces of very high dimensionality (e.g. in the 10^10-dimensional space of all products of 5 pixels in 16 x 16 images). Viewed in the original input space, this yields nonlinear feature extractors. The algorithm consists in solving an eigenvalue problem, in which the choice of different kernels permits the use of a large class of different nonlinearities.

- Which of the training patterns contain the most information about the decision function to be constructed? This question, like the following one, was studied using the "Support Vector algorithm" proposed a few years ago by Vapnik, within the statistical paradigm of learning from examples developed by Vapnik and Chervonenkis. Through the choice of different integral operator kernels, this algorithm allows the construction of a class of decision rules that contains neural networks, polynomial classifiers, and radial basis function networks as special cases. For images of 3-D object models and handwritten digits it could be shown that the different decision rules match the best previously known methods in classification accuracy, and that their construction uses only a small subset of the training set (1%-10% in the examples considered), largely independent of the particular choice of kernel.

- How can one best use "a priori" information that is available in addition to the training patterns (for example, information about the invariance of a class of images under translations)? The thesis proposes three methods, all of which lead to clear improvements in classification accuracy. Two of the methods consist in the construction of special integral operator kernels adapted to the problem. The third method uses invariance transformations to generate additional artificial training examples from the above-mentioned subset (the "Support Vector set") of all training patterns.

Approved: Prof. Jähnichen


Foreword


The Support Vector Machine has recently been introduced as a new technique for solving various function estimation problems, including the pattern recognition problem.
To develop such a technique, it was necessary to first extract factors responsible for
future generalization, to obtain bounds on generalization that depend on these factors,
and lastly to develop a technique that constructively minimizes these bounds.
The subject of this book is methods based on combining advanced branches of
statistics and functional analysis, developing these theories into practical algorithms
that perform better than existing heuristic approaches. The book provides a comprehensive analysis of what can be done using Support Vector Machines, achieving record
results in real-life pattern recognition problems. In addition, it proposes a new form
of nonlinear Principal Component Analysis using Support Vector kernel techniques,
which I consider as the most natural and elegant way for generalization of classical
Principal Component Analysis.
In many ways the Support Vector machine became so popular thanks to works
of Bernhard Scholkopf. The work, submitted for the title of Doktor der Naturwissenschaften, appears as excellent. It is a substantial contribution to Machine Learning
technology.


Vladimir N. Vapnik, Member of Technical Staff, AT&T Labs Research


Professor, Royal Holloway and Bedford College, London

Preface


What makes Mr Schölkopf's work interesting are not only its technical aspects, but also the varied and very intensive contacts with international research institutions. They show that the author is able both to present and place his results at the scientific leading edge, and to develop his results out of the work of the community. From this point of view the technical quality of the work is also apparent.

Mr Schölkopf investigates two fundamental problems in the classification of large data sets. One is the extraction of a few but relevant strong features to reduce the flood of information; the other is the description of data examples that are characteristic of a given classification problem. Both problems are investigated by Mr Schölkopf extensively and exhaustively, theoretically as well as in experiments. Both the very elegant method of nonlinear feature extraction developed in the thesis (kernel PCA) and the proposed extensions of the Support Vector machine use weak features, and thereby depart conceptually from the philosophy of strong features described above. In this sense the work reflects something of a paradigm shift in classification and feature extraction.

Mr Schölkopf was a welcome guest at GMD FIRST Berlin during his dissertation, and it was a pleasure to read and supervise his work. I am particularly pleased that Mr Schölkopf will continue his research in his new position at GMD FIRST.

Stefan Jähnichen, Director, GMD FIRST
Professor, Technische Universität Berlin

Contents

Summary 11
1 Introduction and Preliminaries 15
  1.1 Learning Pattern Recognition 15
  1.2 Statistical Learning Theory 21
  1.3 Feature Space Mathematics 24
2 Support Vector Machines 33
  2.1 The Support Vector Algorithm 33
  2.2 Object Recognition Results 46
  2.3 Digit Recognition Using Different Kernels 56
  2.4 Universality of the Support Vector Set 57
  2.5 Comparison to Classical RBF Networks 61
  2.6 Model Selection 69
  2.7 Why Do SV Machines Work Well? 76
3 Kernel Principal Component Analysis 79
  3.1 Introduction 79
  3.2 Principal Component Analysis in Feature Spaces 80
  3.3 Kernel Principal Component Analysis 83
  3.4 Feature Extraction Experiments 89
  3.5 Discussion 96
4 Prior Knowledge in Support Vector Machines 99
  4.1 Introduction 99
  4.2 Incorporating Transformation Invariances 100
  4.3 Image Locality and Local Feature Extractors 109
  4.4 Experimental Results 110
  4.5 Discussion 120
5 Conclusion 125
A Object Databases 127
B Object Recognition Results 137
C Handwritten Character Databases 149
D Technical Addenda 153
  D.1 Feature Space and Kernels 153
  D.2 Kernel Principal Component Analysis 158
  D.3 On the Tangent Covariance Matrix 161
Bibliography 165
Acknowledgements


First of all, I would like to express my gratitude to Prof. H. Bülthoff, Prof. S. Jähnichen,
and Prof. V. Vapnik for supervising the present dissertation, and to Prof. K. Obermayer for chairing the committee in the "Wissenschaftliche Aussprache". I am grateful
to Vladimir Vapnik for introducing me to the world of statistical learning theory during numerous extended discussions in his office at AT&T Bell Laboratories. I have
deep respect for the completeness and depth of the body of theoretical work that he
and his co-workers have created over the last 30 years. To Heinrich Bülthoff, I am
grateful for introducing me to the world of biological information processing, during
my work on the Diplom and the doctoral dissertation. He created a unique research
atmosphere in his group at the Max-Planck-Institut fur biologische Kybernetik, and
provided excellent facilities without which the present work would not have been possible. I would like to thank Stefan Jähnichen for his advice, and for hosting me at the
GMD Berlin during several research visits. A significant amount of the reported work
was influenced and carried out during these stays, where I closely collaborated with
A. Smola and K.-R. Müller.
Thanks for financial support in the form of grants go to the Studienstiftung des
deutschen Volkes and the Max-Planck-Gesellschaft. In addition, it was the Studienstiftung that made it possible in the first place that I got to know Vladimir Vapnik at
AT&T in 1994, and that helped in getting A. Smola to join the team in 1995.
A number of people contributed to this dissertation in one way or another. Let me
start with V. Blanz, C. Burges, M. Franz, D. Herrmann, K.-R. Muller, and A. Smola,
who helped at the very end, in proofreading the manuscript, leading to many improvements of the exposition.
The work for this thesis was done at several places, and each of the groups that I
was working in deserves substantial credit. More than half of the time was spent at the
Max-Planck-Institut fur biologische Kybernetik, and I would like to thank all members
of the group for providing a stimulating interdisciplinary research atmosphere, and for
bearing with me when I maltreated their computers at night with my simulations.
Special thanks go to the people in the Object Recognition group for a number of lively
discussions, and to the small group of theoreticians at the MPI, who helped me in
various ways over the years.
Almost one year of the time of my thesis work was spent in the adaptive systems
group at AT&T and Bell Laboratories. I learnt a lot about machine learning as applied
to real-world problems from all members of this excellent group. In particular, I would

like to express my thanks to C. Burges, L. Bottou, C. Cortes, and I. Guyon for helping me understand Support Vectors, and to L. Jackel, Y. LeCun, and C. Nohl for making my stays possible in the first place. In addition to their scientific advice, E. Cosatto, P. Haffner, E. Säckinger, P. Simard and C. Watkins have helped me through their friendship. Finally, I want to express my gratitude for the possibility to use code and databases developed and assembled by these people and their co-workers. A substantial part of this thesis would not have been possible without this.
During my time in the USA, I also had the opportunity to spend a month at the Center for Biological and Computational Learning (Massachusetts Institute of Technology), hosted by T. Poggio. I would like to thank him, as well as G. Geiger, F. Girosi, P. Niyogi, P. Sinha, and K. Sung, for hospitality and fruitful discussions.
At the GMD, I had the possibility to interact with the local connectionists group, which (in addition to those mentioned already) included J. Kohlmorgen, N. Murata, and G. Rätsch. The present work profited a great deal from my stays in Berlin.
When starting to do research on one's own, one cannot help noticing that the more specialized the field of work is, the more international and widespread the group of people interested in it seems to be. Out of the scientists working on machine learning and perception, I want to thank J. Buhmann, S. Canu, A. Gammerman, J. Held, D. Kersten, J. Lemm, D. Leopold, P. Mamassian, G. Roth, S. Solla, F. Wichmann, and A. Yuille for stimulating discussions and advice.
Without first studying science, it is hard to become a scientist. Studying science predominantly means arguing about scientific problems. During my education, I was in the favourable position to have enough people for scientific discussions. With many of these friends and teachers, there is still contact and exchange of ideas. I would like to thank all of them, and in particular G. Alli, C. Becker, V. Blanz, D. Corfield, H. Fischer, M. Franz, D. Henke, D. Herrmann, D. Janzing, U. Kappler, D. Kopf, F. Lutz, A. Rieckers, M. Schramm, and G. Sewell.
Finally, without my parents, I would not even have studied anything in the first place. Many thanks to them.

Summary


Learning how to recognize patterns from examples gives rise to challenging theoretical problems: given a set of observations,
- which of the observations should be used to construct the decision boundary?
- which features should be extracted from each observation?
- how can additional information about the decision function be incorporated in the learning process?
The present work is devoted to the above issues, studying Support Vectors in high-dimensional feature spaces, and Kernel PCA feature extraction.
The material is organized as follows. We start with an introduction to the problem of pattern recognition, to concepts of statistical learning theory, and to feature spaces nonlinearly related to input space (Chapter 1). The paradigm for learning from examples which is studied in this thesis, the Support Vector algorithm, is described in Chapter 2, including empirical results obtained on realistic pattern recognition problems. The latter in particular includes the finding that the set of Support Vectors extracted, i.e. those examples crucial for solving a given task, is largely independent of the type of Support Vector machine used. One specific topic in the development of Support Vector learning, the incorporation of prior knowledge, is studied in some detail in Chapter 4: we describe three methods for improving classifier accuracies by making use of transformation invariances and the local structure of images. Intertwined between these two chapters, we propose a novel method for nonlinear feature extraction (Chapter 3), which works in the same types of feature spaces as Support Vector machines, and which forms the basis of some developments of Chapter 4. Finally, Chapter 5 gives a conclusion. As such, it partly reiterates what has just been said, and the reader who still remembers the present summary when arriving at Chapter 5 may find it amusing to contemplate whether the conclusion coincides with what had been evoked in their mind by the summary that they have just finished reading.


Disclaimer. This thesis was written in an interdisciplinary research environment, and it was supervised by a statistician, a biologist, and a computer scientist. Accordingly, it attempts to be of interest for rather different audiences. If your interests fall into one of these categories exclusively, please bear with me: whenever you encounter a section which you find utterly useless, boring, or incomprehensible, there is the theoretical possibility that it is of interest to somebody else. Accordingly, please feel free to ignore all these parts.


Copyright Notice. Sections 2.4 and 2.6.1 are based on Schölkopf, Burges, and Vapnik (1995), AAAI Press. Section 2.5 is based on Schölkopf, Sung, Burges, Girosi, Niyogi, Poggio, and Vapnik (1996c), IEEE. Chapter 3 is based on Schölkopf, Smola, and Müller (1997b), MIT Press. Section 4.2.1 and figures 2.5 and 4.1 are based on Schölkopf, Burges, and Vapnik (1996a), Springer Verlag. The author reserves for himself the non-exclusive right to republish all other material.


"To see a thing one has to comprehend it. An armchair presupposes the human body, its joints and limbs; a pair of scissors, the act of cutting. What can be said of a lamp or a car? The savage cannot comprehend the missionary's Bible; the passenger does not see the same rigging as the sailors. If we really saw the world, maybe we would understand it."
J. L. Borges, There are more things. In: The Book of Sand, 1979, Penguin, London.


Chapter 1

Introduction and Preliminaries


The present work studies visual recognition problems from the point of view of learning theory. This first chapter sets the scene for the main part of the thesis. It gives a brief introduction to the problem of Learning Pattern Recognition from examples. The two main contributions of this thesis are motivated in the conceptual part of the chapter.
Section 1.1 discusses prior knowledge that might be available in addition to the set of training examples, and introduces the problem of extracting useful features from individual examples. The technical part of the chapter, Sec. 1.2, gives a concise description of some mathematical concepts of Statistical Learning Theory (Vapnik, 1995b). This theory describes learning from examples as a problem of limited sample size statistics and provides the basis for the Support Vector algorithm. Finally, Sec. 1.3 introduces mathematical concepts of feature spaces, which will be of central importance to all following chapters.

1.1 Learning Pattern Recognition

Let us think of a pattern as an abstraction, defined by a collection of possible instances such as sample images. When trying to learn how to recognize a pattern, we face the problem that we will often be unable to see all instances during learning, yet we want to be able to recognize as many as possible. The extensive notion of a pattern that we just introduced already suggests a specific approach to the problem of pattern recognition: a statistician tries to collect a large number of instances, and use inductive methods to learn how to recognize them.
For an alternative point of view, consider a pattern as something observable which is generated by an underlying physical entity, as for instance the 2-D views of a 3-D object. To recognize a pattern of this nature, a physicist would try to understand the laws governing the entity, and the mechanisms by which the pattern is brought about.
In this process, it may turn out that different observables, or functions thereof, contain different amounts of information for understanding the underlying entity, i.e. it may be the case that from the initial raw observations, we have to extract useful features ourselves.
The current work is located in the intersection of the aspects sketched in the

above three paragraphs. It studies an inductive learning algorithm which has been
developed in the framework of statistical learning theory, and it tries to enhance it by
incorporating prior knowledge about a recognition task at hand. Finally, it studies
the extraction of features for the purpose of recognition.
Even though pattern recognition is not limited to the visual domain, we shall focus
on visual recognition. Much of what is said in this thesis, however, would equally
apply to the recognition of acoustic patterns, say.
In the remainder of this section, we introduce the terminology which is used in
discussing different aspects of visual recognition problems: these are, in turn, the
data, the tasks, and the methods for recognition.

Data. Different types of pattern recognition problems make different types of assumptions about the underlying causes generating the patterns. Nevertheless, it is possible to discuss them in a common framework which we try to describe presently. It draws from machine learning terminology; as such, it will differ from psychological usage of the relevant terms in some respects.1
Observers visually perceive views. Sets of views constitute classes. Sometimes, classes have a structure that goes beyond being mere collections of views. For instance, the class of all views of rainbows has the property that if a specific view belongs to it, then so do all views which are generated by translating it, parallel to the horizon. Objects are specific classes, with a rich class structure, containing for instance all view transformations corresponding to rigid 3-D transformations of a specific underlying physical entity. Some of these transformations are shared by all objects, for instance translations; others, like deformations, are object-specific.
More radically, and fundamentally view-based, we could give up the notion of priority of the underlying physical entities, and think of an object only as a collection of views, with a specific class structure. On a practical level, this is the approach pursued in the current work. The distinction between objects and other classes then becomes a distinction between different types of transformation invariances. For instance, a rainbow would not be an object, as we cannot possibly see it from above, not even with a spacecraft. The class of handwritten digits '6' would not be an object for similar reasons; in fact, an image plane rotation by 180° would even take us into the class '9'. As an aside, we note that mathematics and physics have already undergone a paradigm shift away from the notion of objects as "things in the world", towards studying their transformation properties. In mathematics, this is exemplified in Felix Klein's Erlanger Programm (Klein, 1872) which shifts geometry away from points and lines towards transformation groups; in physics, an example is the modern definition of elementary particles as transformation group representations (e.g. Primas, 1983).
Kac and Ulam (1968) refer to this as
"[...] the immensely powerful and fruitful idea that much can be learned about the structure of certain objects by merely studying their behaviour under the action of certain groups."

1 The ideas put forward in the following were influenced by discussions with people in the MPI's object recognition group, in particular with V. Blanz.
Later in the thesis, the reader will encounter methods for improving visual recognition
systems by taking into account transformation properties of handwritten characters
and 3-D objects (Sec. 4.2).

Prior Knowledge. The statistical approach of learning from examples in its pure form neglects the additional knowledge of class structure described above. However, the latter, referred to as Prior Knowledge, can be of great practical relevance in recognition problems.
Suppose we were given temporal sequences of detailed observations (including spectra) of double star systems, and we would like to predict whether, eventually, one of the stars will collapse into a black hole. Given a small set of observations of different double star systems, including target values indicating the eventual outcome (supposing these were available), a purely statistical approach of learning from examples would probably have difficulties extracting the desired dependency. A physicist, on the other hand, would infer the star masses from the spectra's periodicity and Doppler shifts, and use the theory of general relativity to predict the eventual fate of the stars.
Of course, one could argue that the physicist's model of the situation is based on a huge body of prior examples of situations and phenomena which are related to the above in one way or another. This, however, is exactly how the term prior knowledge should be understood in the present context. It does not refer to a Kantian a priori, as prior to all experience, but to what is prior to a given problem of learning from examples.
What do we do, however, if we do not have a dynamical model of what is happening behind the scenes? In this case, which for instance applies whenever the underlying dynamics is too complicated, the strengths of the purely statistical approach become apparent. Let us consider the case of handwritten character recognition. When a human writer decides to write the letter 'A', the actual outcome is the result of a series of complicated processes, which in their entirety cannot be modelled comprehensively. The intensity of the lines depends on chemical properties of ink and paper, their shape on the friction between pencil and paper, on the dynamics of the writer's joints, and on motor programmes initiated in the brain; these in turn are based on what the writer has learnt at school, and the chain could be continued ad infinitum. Accordingly, nobody tries to recognize characters by completely modelling their generation.
However, the lack of a complete dynamical model does not mean that there is no prior knowledge in handwritten digit recognition. For instance, we know that handwritten digits do not change their class membership if they are translated on a page, or if their line thickness is slightly changed. This type of knowledge can be used to augment the purely statistical approach (Sec. 4.2). More abstract prior knowledge in many visual recognition tasks includes the fact that the correlations between nearby image locations are often more reliable features for recognition than those with larger distances (Sec. 4.3).

Features. Before we proceed to the tasks that can be performed, depending on the available data, we need to introduce a concept widely used in both statistics and in the analysis of human perception. In its general form, a feature detector or feature extractor is a function which assigns a (typically scalar) value to each raw observation. Often, a number of different such functions are applied to the observations in a feature extraction process, leading to a preprocessed vector representation of the data. The goal of extracting features is to improve subsequent stages of processing, be it by improving accuracies in a recognition task, or by reducing storage requirements or processing time.
The feature functions serving this purpose can either be specified in advance, for instance in a way such that they incorporate prior knowledge about a problem at hand, or computed from the set of observations. Both approaches, as well as combinations thereof, shall be addressed in this thesis (Chapter 3, Sec. 4.2.2, Sec. 4.3).
The actual term feature is used with different meanings. In vision research and psychophysics, it is mainly used for the optimal stimulus of a corresponding feature detector. However, note that given a nonlinear feature detector, it may be practically impossible to determine this optimal stimulus. In statistics, on the other hand, the term feature mostly refers to the feature values, i.e. the outputs of feature detectors, or to the feature detector itself. Possibly, this ambiguity arose from the fact that, in some cases, the different meanings coincide: in the case where the feature detector consists of a dot product with a weight vector, as in linear neural network model receptive fields, the optimal stimulus is aligned with the weight vector, and thus the two can be identified.

Tasks. Suppose we are only given solitary views, and neither nontrivial classes nor objects (which were structured collections of views). Then out of the tasks of discrimination, classification, and identification, only discrimination can be carried out on these views. This does, however, not prevent the term discrimination from being used also in the context of classes and objects. Discrimination, the mere detection of a difference, can be preceded by feature extraction processes; in these cases, results will depend on the extraction process used.
Classification consists of attributing views to classes, and thus requires the existence of classes. These can be specified abstractly, by describing features or Gibsonian affordances ("something to sit on"), for example, or provided (approximately) by a sample of training views. This definition of classes by training sets is widespread in machine learning; it will also be the paradigm that we are going to use in this thesis. One talks about yes-no and old-new classification tasks (one specified class) or naming tasks (several classes). Pattern recognition problems like Handwritten Digit Recognition are examples of classification.
Similarly, identification consists in determining to which object a presented view belongs. As objects are special types of classes, we again have the possibility for the above tasks. Identification makes sense only for objects: for instance, it is meaningless to ask whether the rainbow we see today is a view of the same (object as a) rainbow we saw last year.


In this thesis, we study classification and identification. Often, both of these tasks are referred to as recognition, the term which we shall mostly employ. Indeed, when classes are given by training sets, the question whether there is an underlying object producing the observed views becomes secondary. It is then only of relevance insofar as it determines the type of prior knowledge available.

Human Object Recognition. The position that object recognition is not about recovering physical 3-D entities, but about learning their views, and potentially also their transformation properties, can be supported by biological and psychological evidence. Bülthoff and Edelman (1992) have shown that when recognizing unfamiliar objects, observers exhibit viewpoint effects which suggest that they do not recover the 3-D shape of objects, but rather build a representation based on the actual training views (cf. also Logothetis, Pauls, and Poggio, 1995). They thought of this representation as an interpolation mechanism (cf. Poggio and Girosi, 1990), but one could of course conceive of more sophisticated mechanisms for combining information contained in the training views. In the above terminology, one might argue that due to their unfamiliarity, the wire frame objects of Bülthoff and Edelman (1992) make it very hard to use the transformations which form the structure of the underlying class of views. Ullman (1996) has put forward a multiple-view variant of his theory of "recognition by alignment", where objects are recognized by aligning them with stored view templates. The alignment process can make use of certain transformations specific to the object in question. The results of Troje and Bülthoff (1996) have shown that these transformations in some cases directly operate on 2-D views, and that they are much simpler than transformations using an underlying 3-D model: in experiments probing face recognition under varying poses, observers performed better on views which were obtained by simply applying a mirror reversal transformation to a previously seen view, rather than by rotating the head in depth to generate the true view of the other side. Rao and Ballard (1997) recently proposed a model in which the "what" and the "where" pathway (Mishkin, Ungerleider, and Macko, 1983) in the visual system are conceived of as estimating object identity and transformations, respectively. Using a collection of patches taken from natural images, they construct a generative model for the data which learns, transforms and linearly combines simple basis functions. Their model, however, does not directly make use of the valuable information contained in the temporal stream of visual data: comparing subsequent images, e.g. by optic flow techniques, would give a more direct means of constructing processing elements encoding transformations. Indeed, in the dorsal stream (the "where" pathway), neurons have been found coding for various types of large-field transformations (Duffy and Wurtz, 1991). Of somewhat related interest are the large-field neurons in the fly's visual system, coding for specific flow fields which are generated by the fly's movement in the environment (Krapp and Hengstenberg, 1996).


Representations and Processes. The above illustrates that the question of how, given a recognition problem, the actual processing can be performed, is intimately related to underlying representations (of classes or objects), computed by some feature extraction process. A representation should satisfy certain constraints in terms of storage cost, computational cost and accuracy.
General classes without structure are not compressible (except for a separate compression of the individual images). Classes with some internal structure can be compressed, hence a smaller representation is possible, which in turn makes generalization to novel views possible (cf. Kolmogorov, 1965; Rissanen, 1978). This is the underlying computational reason for the constructive nature of perception.
If we can generate a class (e.g. an object) from some prototype views using a specified set of transformations, we can represent it as the set of prototypes plus transformations. The more prototypes we store, the less complex are the transformations that we need to remember. In this sense, there is a continuum of different view-based approaches. In principle, further compression is conceivable if we allow for the construction of a suitable underlying representation. E.g., Ullman's approach of storing a 3-D model plus the set of 3-D transformations (Ullman, 1989) is cheap in terms of storage: storing these transformations is almost for free, and storing one 3-D model is reasonably cheap. Constructing this representation, however, may be computationally quite expensive. Reading out and matching 2-D views, on the other hand, is computationally rather cheap if done in a parallel neural architecture. The type of representation to be used should thus depend on the task, e.g. on speed constraints. Indeed, proponents of view-based object recognition theories are mainly concerned with fast recognition tasks (Bülthoff and Edelman, 1992). Moreover, the storage cost strongly depends on the task and the type of feature extraction applied to the raw data.
In some cases, we can extract features from views which allow reasonably high recognition accuracies while enabling us to work with much simpler sets of transformations. For instance, if there exists a diagnostic object feature which is visible from all viewpoints, we only need to store the feature (e.g. the colour), the extraction process (which can be thought of as a specific image transformation which needs to be stored), and the fact that it may occur anywhere in the view (i.e. the set of all image plane translations).
Clearly, the set of features which are extracted from views influences all further processing. Applied to our setting, constructing a feature representation consists of two parts: the features have to be extracted from a possibly large set of views, and the transformations which connect features belonging to views of the same class have to be computed. This may require a trade-off: for some feature representations, the extraction process is difficult (e.g. using correspondence methods, Beymer and Poggio, 1996; Vetter and Troje, 1997), whereas the computation of transformations might be simple. A similar trade-off exists in utilizing such a representation: for a recognition task done by matching, e.g., we would have to extract features from test views, and transform them to match stored ones. Put in machine learning language, features should be used which allow solving a given task within specified limits on training time, testing speed, error rate, and memory requirements.

Implementations. So far, not much has been said about actual implementations of recognition systems. The present work focuses on algorithmic questions rather than on questions of implementation, both with respect to the computational side and with respect to the biological side of the recognition problem. The former normally need not be justified: in statistics, scientific studies of mere algorithms, without discussion of implementation details, are abundant. In biology, which is the main focus of interest in the group where much of the present work was carried out, the type of abstraction presented here is much less common. Indeed, the relevance of this thesis to biological pattern recognition is on the level of statistical properties of problems and algorithms, not more and not less. In our hope that this type of theoretical work should be of interest to people studying the brain, we concur with Barlow (1995):
"If artificial neural nets, designed to imitate cognitive functions of the brain, are truly performing tasks that are best formulated in statistical terms, then is this not likely also to be true of cognitive function in general? The idea that the brain is an accomplished statistical decision-making organ agrees well with notions to be sketched in the last section of this [Barlow's, the author] article."
To study object recognition from a statistical point of view, we shall in the following section briefly review some of the basic concepts and results of statistical learning theory.

1.2 Statistical Learning Theory

Out of the considerable body of theory that has been developed in statistical learning theory by Vapnik and others (e.g. Vapnik and Chervonenkis, 1968, 1974; Vapnik, 1979, 1995a,b), we briefly review a few concepts and results which are necessary in order to be able to appreciate the Support Vector learning algorithm, which will be used in a substantial part of the thesis.2
For the case of two-class pattern recognition, the task of learning from examples can be formulated in the following way: we are given a set of functions

    \{ f_\alpha : \alpha \in \Lambda \}, \qquad f_\alpha : \mathbb{R}^N \to \{\pm 1\},    (1.1)

and a set of examples, i.e. pairs of patterns \mathbf{x}_i and labels y_i,

    (\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_\ell, y_\ell) \in \mathbb{R}^N \times \{\pm 1\},    (1.2)

2 A high-level summary is given in (Schölkopf, 1996).


each one of them generated from an unknown probability distribution P(\mathbf{x}, y) containing the underlying dependency. (Here and below, bold face characters denote vectors.) We want to learn a function f_{\alpha^*} which provides the smallest possible value for the average error committed on independent examples randomly drawn from the same distribution P, called the risk

    R(\alpha) = \int \tfrac{1}{2}\, |f_\alpha(\mathbf{x}) - y| \; dP(\mathbf{x}, y).    (1.3)

The problem is that R(\alpha) is unknown, since P(\mathbf{x}, y) is unknown. Therefore an induction principle for risk minimization is necessary.
The straightforward approach to minimize the empirical risk

    R_{emp}(\alpha) = \frac{1}{\ell} \sum_{i=1}^{\ell} \tfrac{1}{2}\, |f_\alpha(\mathbf{x}_i) - y_i|    (1.4)


turns out not to guarantee a small actual risk, if the number \ell of training examples is limited. In other words: a small error on the training set does not necessarily imply a high generalization ability (i.e. a small error on an independent test set). This phenomenon is often referred to as overfitting (e.g. Bishop, 1995). To make the most out of a limited amount of data, novel statistical techniques have been developed during the last 30 years. The Structural Risk Minimization principle (Vapnik, 1979) is based on the fact that for the above learning problem, for any \alpha \in \Lambda and \ell > h, with a probability of at least 1 - \eta, the bound

    R(\alpha) \le R_{emp}(\alpha) + \Phi\!\left( \frac{h}{\ell}, \frac{\log(\eta)}{\ell} \right)    (1.5)

holds, where the confidence term \Phi is defined as

    \Phi\!\left( \frac{h}{\ell}, \frac{\log(\eta)}{\ell} \right) = \sqrt{ \frac{ h \left( \log\frac{2\ell}{h} + 1 \right) - \log(\eta/4) }{ \ell } }.    (1.6)

The parameter h is called the VC (Vapnik-Chervonenkis) dimension of a set of functions. It describes the capacity of a set of functions. For binary classification, h is the maximal number of points which can be separated into two classes in all possible 2^h ways by using functions of the learning machine; i.e. for each possible separation there exists a function which takes the value 1 on one class and -1 on the other class.
A learning machine can be thought of as a set of functions (that the machine has at its disposal), an induction principle, and an algorithmic procedure for implementing the induction principle on the given set of functions. Often, the term learning machine is used to refer to its set of functions; in this sense, we talk about the capacity or VC-dimension of learning machines.
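To make the interplay of training error, capacity and sample size concrete, the following short Python sketch evaluates the confidence term (1.6) and the resulting risk bound (1.5) for a few VC-dimensions. The particular values of h, \ell, \eta and the empirical risk are illustrative assumptions, not figures from the thesis.

    import math

    def confidence_term(h, ell, eta):
        """Confidence term Phi(h/ell, log(eta)/ell) of the VC bound (1.6)."""
        return math.sqrt((h * (math.log(2 * ell / h) + 1) - math.log(eta / 4)) / ell)

    def risk_bound(r_emp, h, ell, eta=0.05):
        """Right-hand side of (1.5): guaranteed risk with probability >= 1 - eta."""
        return r_emp + confidence_term(h, ell, eta)

    # Illustrative numbers: a fixed training-set size, increasing capacity h.
    ell, r_emp = 1000, 0.05
    for h in (10, 100, 500):
        print(h, round(confidence_term(h, ell, 0.05), 3), round(risk_bound(r_emp, h, ell), 3))

As the printed values show, the confidence term grows with h/\ell, so a small training error alone does not guarantee a small bound.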


The bound (1.5), which forms part of the theoretical basis for Support Vector learning, deserves some further explanatory remarks.
Suppose we wanted to learn a "dependency" where P(\mathbf{x}, y) = P(\mathbf{x}) \cdot P(y), i.e. where the pattern \mathbf{x} contains no information about the label y, with uniform P(y). Given a training sample of fixed size, we can then surely come up with a learning machine which achieves zero training error. However, in order to reproduce the random labellings, this machine will necessarily require a VC-dimension which is large compared to the sample size. Thus, the confidence term (1.6), increasing monotonically with h/\ell, will be large, and the bound (1.5) will not support possible hopes that due to the small training error, we should expect a small test error. This makes it understandable how (1.5) can hold independent of assumptions about the underlying distribution P(\mathbf{x}, y): it always holds, but it does not always make a nontrivial prediction; a bound on an error rate becomes void if it is larger than the maximum error rate. In order to get nontrivial predictions from (1.5), the function space must be restricted such that the VC-dimension is small enough (in relation to the available amount of data).3
According to (1.5), given a fixed number \ell of training examples, one can control the risk by controlling two quantities: R_{emp}(\alpha) and h(\{f_\alpha : \alpha \in \Lambda'\}), \Lambda' denoting some subset of the index set \Lambda. The empirical risk depends on the function chosen by the learning machine (i.e. on \alpha), and it can be controlled by picking the right \alpha. The VC-dimension h depends on the set of functions \{f_\alpha : \alpha \in \Lambda'\} which the learning machine can implement. To control h, one introduces a structure of nested subsets S_n := \{f_\alpha : \alpha \in \Lambda_n\} of \{f_\alpha : \alpha \in \Lambda\},

    S_1 \subset S_2 \subset \ldots \subset S_n \subset \ldots,    (1.8)

whose VC-dimensions, as a result, satisfy

    h_1 \le h_2 \le \ldots \le h_n \le \ldots    (1.9)

For a given set of observations (\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_\ell, y_\ell) the Structural Risk Minimization principle chooses the function f_{\alpha_n^\ell} in the subset \{f_\alpha : \alpha \in \Lambda_n\} for which the guaranteed risk bound (the right hand side of (1.5)) is minimal (cf. Fig. 1.1). The procedure of selecting the right subset for a given amount of observations is referred to as capacity control.

FIGURE 1.1: Graphical depiction of (1.5), for fixed \ell. [The figure plots the bound on the test error, the training error, and the confidence term as functions of h, for a nested structure S_{n-1} \subset S_n \subset S_{n+1}.] A learning machine with larger complexity, i.e. a larger set of functions S_n, allows for a smaller training error; a less complex learning machine, with a smaller S_i, has a smaller VC-dimension and thus provides a smaller confidence term \Phi (cf. (1.6)). Structural Risk Minimization picks a trade-off in between these two cases by choosing the function of the learning machine f_{\alpha_n^\ell} such that the risk bound (1.5) is minimal.

3 The bound (1.5), formulated in terms of the VC-dimension, is only the last element of a series of tighter bounds which are formulated in terms of other concepts. This is due to the inequalities

    H^\Lambda(\ell) \le H^\Lambda_{ann}(\ell) \le G^\Lambda(\ell) \le h \left( \log\frac{2\ell}{h} + 1 \right), \quad (\ell > h).    (1.7)

The VC-dimension h is probably the most-used and best-known concept in this row. However, the other ones lead to tighter bounds, and also play important roles in the conceptual part of statistical learning theory: the VC-entropy H^\Lambda and the Annealed VC-entropy H^\Lambda_{ann} are used to formulate conditions for the consistency of the empirical risk minimization principle, and for a fast rate of convergence, respectively. The Growth function G^\Lambda provides both of the above, independently of the underlying probability measure P, i.e. independently of the data. The VC-dimension h, finally, provides a constructive upper bound on the Growth function, which can be used to design learning machines (for details, see Vapnik, 1995b).
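The capacity-control step can be mimicked in a few lines of Python. The sketch below assumes a purely hypothetical nested structure, described by pairs of VC-dimension and (made-up) training error, and picks the element whose empirical risk plus confidence term (1.6) is smallest; it is an illustration of the selection rule, not an experiment from the thesis.

    import math

    def confidence_term(h, ell, eta=0.05):
        # Confidence term of the VC bound (1.6).
        return math.sqrt((h * (math.log(2 * ell / h) + 1) - math.log(eta / 4)) / ell)

    # Hypothetical nested structure S_1 < S_2 < ...: (VC-dimension, empirical risk).
    structure = [(5, 0.20), (20, 0.10), (80, 0.04), (320, 0.01)]
    ell = 2000

    # Structural Risk Minimization: minimize the guaranteed risk bound (1.5).
    bounds = [(r_emp + confidence_term(h, ell), h) for h, r_emp in structure]
    best_bound, best_h = min(bounds)
    print("chosen VC-dimension:", best_h, "guaranteed risk bound:", round(best_bound, 3))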
We conclude this section by noting that analyses in other branches of learning theory have led to similar insights in the trade-off between reducing the training error and limiting model complexity, for instance as described by regularization theory (Tikhonov and Arsenin, 1977), Minimum Description Length (Rissanen, 1978; Kolmogorov, 1965), or the Bias-Variance Dilemma (Geman, Bienenstock, and Doursat, 1992). Haykin (1994) and Ripley (1996) give overviews in the context of Neural Networks.

1.3 Feature Space Mathematics


The present section summarizes some mathematical preliminaries which are essential for both Support Vector machines (Chapter 2) and nonlinear Kernel Principal Component Analysis (Chapter 3).

1.3.1 Product Features

Suppose we are given patterns \mathbf{x} \in \mathbb{R}^N where most information is contained in the d-th order products (monomials) of entries x_j of \mathbf{x},

    x_{j_1} \cdot \ldots \cdot x_{j_d},    (1.10)

where j_1, \ldots, j_d \in \{1, \ldots, N\}. In that case, we might prefer to extract these product features first, and work in the feature space F of all products of d entries. In visual recognition problems, e.g., this would amount to extracting features which are products of individual pixels.
For instance, in \mathbb{R}^2, we can collect all monomial feature extractors of degree 2 in the nonlinear map

    \Phi : \mathbb{R}^2 \to F = \mathbb{R}^3    (1.11)
    (x_1, x_2) \mapsto (x_1^2, x_2^2, x_1 x_2).    (1.12)

This approach works fine for small toy examples, but it fails for realistically sized problems: for N-dimensional input patterns, there exist

    N_F = \frac{(N + d - 1)!}{d!\,(N - 1)!}    (1.13)

different monomials (1.10), comprising a feature space F of dimensionality N_F. Already 16 x 16 pixel input images and a monomial degree d = 5 yield a dimensionality of 10^{10}.
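As a quick illustration of the combinatorial explosion in (1.13), the following Python lines (a sketch, not code from the thesis) count the monomials of degree d over N inputs and reproduce the order of magnitude quoted above.

    from math import comb

    def num_monomials(N, d):
        # Number of (unordered) monomials of degree d in N variables, eq. (1.13):
        # N_F = (N + d - 1)! / (d! (N - 1)!) = C(N + d - 1, d)
        return comb(N + d - 1, d)

    print(num_monomials(2, 2))        # 3, matching the map (1.12)
    print(num_monomials(16 * 16, 5))  # roughly 9.5e9, i.e. on the order of 10^10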
In certain cases described below, there exists, however, a way of computing dot products in these high-dimensional feature spaces without explicitly mapping into them: by means of nonlinear kernels in input space \mathbb{R}^N. Thus, if the subsequent processing can be carried out using dot products exclusively, we are able to deal with the high dimensionality.
The following section describes how dot products in polynomial feature spaces can be computed efficiently, followed by a section which discusses more general feature spaces.

1.3.2 Polynomial Feature Spaces Induced by Kernels

In order to compute dot products of the form (\Phi(\mathbf{x}) \cdot \Phi(\mathbf{y})), we employ kernel representations of the form

    k(\mathbf{x}, \mathbf{y}) = (\Phi(\mathbf{x}) \cdot \Phi(\mathbf{y})),    (1.14)

which allow us to compute the value of the dot product in F without having to carry out the map \Phi. This method was used by Boser, Guyon, and Vapnik (1992) to extend the Generalized Portrait hyperplane classifier of Vapnik and Chervonenkis (1974) to nonlinear Support Vector machines (Sec. 2.1). Aizerman, Braverman, and Rozonoer (1964) call F the linearization space, and use it in the context of the potential function classification method to express the dot product between elements of F in terms of elements of the input space. They also consider the possibility of choosing k a priori, without being directly concerned with the corresponding mapping \Phi into F. A specific choice of k might then correspond to a dot product between patterns mapped with a suitable \Phi.
What does k look like for the case of polynomial features? We start by giving an example (Vapnik, 1995b) for N = d = 2. For the map

    C_2 : (x_1, x_2) \mapsto (x_1^2, x_2^2, x_1 x_2, x_2 x_1),    (1.15)

dot products in F take the form

    (C_2(\mathbf{x}) \cdot C_2(\mathbf{y})) = x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 x_2 y_1 y_2 = (\mathbf{x} \cdot \mathbf{y})^2,    (1.16)

i.e. the desired kernel k is simply the square of the dot product in input space. Boser, Guyon, and Vapnik (1992) note that the same works for arbitrary N, d \in \mathbb{N}: as a straightforward generalization of a result proved in the context of polynomial approximation (Poggio, 1975, Lemma 2.1), we have:
Proposition 1.3.1 Define C_d to map \mathbf{x} \in \mathbb{R}^N to the vector C_d(\mathbf{x}) whose entries are all possible d-th degree ordered products of the entries of \mathbf{x}. Then the corresponding kernel computing the dot product of vectors mapped by C_d is

    k(\mathbf{x}, \mathbf{y}) = (C_d(\mathbf{x}) \cdot C_d(\mathbf{y})) = (\mathbf{x} \cdot \mathbf{y})^d.    (1.17)

Proof. We directly compute

    (C_d(\mathbf{x}) \cdot C_d(\mathbf{y})) = \sum_{j_1, \ldots, j_d = 1}^{N} x_{j_1} \cdot \ldots \cdot x_{j_d} \cdot y_{j_1} \cdot \ldots \cdot y_{j_d}    (1.18)
                                           = \left( \sum_{j=1}^{N} x_j \, y_j \right)^{d} = (\mathbf{x} \cdot \mathbf{y})^d.    (1.19)
□
Instead of ordered products, we can use unordered ones to obtain a map \Phi_d which yields the same value of the dot product. To this end, we have to compensate for the multiple occurrence of certain monomials in C_d by scaling the respective monomial entries of \Phi_d with the square roots of their numbers of occurrence. Then, by this definition of \Phi_d, and (1.17),

    (\Phi_d(\mathbf{x}) \cdot \Phi_d(\mathbf{y})) = (C_d(\mathbf{x}) \cdot C_d(\mathbf{y})) = (\mathbf{x} \cdot \mathbf{y})^d.    (1.20)

For instance, if n of the j_i in (1.10) are equal, and the remaining ones are different, then the coefficient in the corresponding component of \Phi_d is \sqrt{(d - n + 1)!} (for the general case, cf. Smola, Schölkopf, and Müller, 1997). For \Phi_2, this simply means that (Vapnik, 1995b)

    \Phi_2(\mathbf{x}) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2).    (1.21)

If \mathbf{x} represents an image with the entries being pixel values, we can use the kernel (\mathbf{x} \cdot \mathbf{y})^d to work in the space spanned by products of any d pixels, provided that we are able to do our work solely in terms of dot products, without any explicit usage of a mapped pattern \Phi_d(\mathbf{x}). Using kernels of the form (1.17), we take into account higher-order statistics without the combinatorial explosion (cf. (1.13)) of time and memory complexity which goes along already with moderately high N and d.
To conclude this section, note that it is possible to modify (1.17) such that it maps into the space of all monomials up to degree d, defining (Vapnik, 1995b)

    k(\mathbf{x}, \mathbf{y}) = (\mathbf{x} \cdot \mathbf{y} + 1)^d.    (1.22)
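The kernel identity (1.20)/(1.21) is easy to check numerically. The short Python sketch below (purely illustrative toy values) compares the explicit degree-2 feature map \Phi_2 with the kernel (\mathbf{x} \cdot \mathbf{y})^2, and also evaluates the inhomogeneous kernel (1.22).

    import numpy as np

    def phi2(x):
        # Explicit degree-2 monomial map (1.21) for 2-D inputs.
        return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

    def poly_kernel(x, y, d, c=0.0):
        # (x . y + c)^d; c = 0 gives (1.17), c = 1 gives (1.22).
        return (np.dot(x, y) + c) ** d

    x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
    print(np.dot(phi2(x), phi2(y)))     # dot product computed in feature space
    print(poly_kernel(x, y, d=2))       # same value, computed via the kernel
    print(poly_kernel(x, y, d=2, c=1))  # kernel (1.22): all monomials up to degree 2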


1.3.3 Feature Spaces Induced by Mercer Kernels


The question of which functions k correspond to a dot product in some space F has been discussed by Boser, Guyon, and Vapnik (1992) and Vapnik (1995b). To construct a map \Phi induced by a kernel k, i.e. a map \Phi such that k computes the dot product in the space that \Phi maps to, they use Mercer's theorem of functional analysis (Courant and Hilbert, 1953):
Proposition 1.3.2 If k is a continuous symmetric kernel of a positive4 integral operator K, i.e.

    (Kf)(\mathbf{y}) = \int_C k(\mathbf{x}, \mathbf{y}) f(\mathbf{x}) \, d\mathbf{x}    (1.23)

with

    \int_{C \times C} k(\mathbf{x}, \mathbf{y}) f(\mathbf{x}) f(\mathbf{y}) \, d\mathbf{x}\, d\mathbf{y} \ge 0    (1.24)

for all f \in L_2(C) (C being a compact subset of \mathbb{R}^N), it can be expanded in a uniformly convergent series (on C \times C) in terms of Eigenfunctions \psi_j and positive Eigenvalues \lambda_j,

    k(\mathbf{x}, \mathbf{y}) = \sum_{j=1}^{N_F} \lambda_j \, \psi_j(\mathbf{x}) \, \psi_j(\mathbf{y}),    (1.25)

where N_F \le \infty.

4 When referring to operators, the term positive is always meant in the sense stated here. If we talk about positive definite operators, we will express this explicitly.
Note that although originally proven for the case where C = [a, b], this Proposition also holds true for general compact spaces (Dunford and Schwartz, 1963).
For the converse of Proposition 1.3.2, cf. Appendix D.1.
From (1.25), it is straightforward to construct a map \Phi, mapping into a potentially infinite-dimensional l_2 space, which does the job. For instance, we may use

    \Phi : \mathbf{x} \mapsto \left( \sqrt{\lambda_1}\, \psi_1(\mathbf{x}), \sqrt{\lambda_2}\, \psi_2(\mathbf{x}), \ldots \right).    (1.26)

We thus have the following result (Boser, Guyon, and Vapnik, 1992):5
Proposition 1.3.3 If k is a continuous kernel of a positive integral operator (conditions as in Proposition 1.3.2), one can construct a mapping \Phi into a space where k acts as a dot product,

    (\Phi(\mathbf{x}) \cdot \Phi(\mathbf{y})) = k(\mathbf{x}, \mathbf{y}).    (1.27)

Besides (1.17), Boser, Guyon, and Vapnik (1992) and Vapnik (1995b) suggest the usage of Gaussian radial basis function kernels (Aizerman, Braverman, and Rozonoer, 1964)

    k(\mathbf{x}, \mathbf{y}) = \exp\!\left( - \frac{\|\mathbf{x} - \mathbf{y}\|^2}{2\,\sigma^2} \right)    (1.28)

and sigmoid kernels

    k(\mathbf{x}, \mathbf{y}) = \tanh(\kappa (\mathbf{x} \cdot \mathbf{y}) + \Theta).    (1.29)

Note that all these kernels have the convenient property of unitary invariance, i.e. k(\mathbf{x}, \mathbf{y}) = k(U\mathbf{x}, U\mathbf{y}) if U^\top = U^{-1} (if we consider complex numbers, then U^* instead of U^\top has to be used). The radial basis function kernel additionally is translation invariant.

5 In order to identify k with a dot product in another space, it would be sufficient to have pointwise convergence of (1.25). Uniform convergence lets us make an assertion which goes further: given an accuracy level \epsilon > 0, there exists an n \in \mathbb{N} such that even if the range of \Phi is infinite-dimensional, k can be approximated within accuracy \epsilon as a dot product in \mathbb{R}^n, between images of \Phi_n : \mathbf{x} \mapsto (\sqrt{\lambda_1}\, \psi_1(\mathbf{x}), \ldots, \sqrt{\lambda_n}\, \psi_n(\mathbf{x})).
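To connect these kernels with the positivity conditions above, the following Python sketch (arbitrary toy data, chosen only for illustration) builds Gaussian RBF and sigmoid kernel matrices and inspects their smallest eigenvalues; the RBF matrix is positive semidefinite up to round-off, whereas the sigmoid kernel is not guaranteed to be positive in general (cf. the remark on the hyperbolic tangent kernel at the end of Sec. 1.3.5).

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 2))  # toy 2-D data

    def rbf_kernel_matrix(X, sigma=1.0):
        # Gaussian RBF kernel (1.28): exp(-||x - y||^2 / (2 sigma^2)).
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * sigma**2))

    def sigmoid_kernel_matrix(X, kappa=1.0, theta=1.0):
        # Sigmoid kernel (1.29): tanh(kappa (x . y) + theta); not Mercer in general.
        return np.tanh(kappa * X @ X.T + theta)

    for K in (rbf_kernel_matrix(X), sigmoid_kernel_matrix(X)):
        print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())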


1.3.4 The Connection to Reproducing Kernel Hilbert Spaces


The feature space that \Phi maps into is a reproducing kernel Hilbert space (RKHS). To see this, we follow Wahba (1973) and recall that a RKHS is a Hilbert space of functions f on some set C such that all evaluation functionals f \mapsto f(\mathbf{y}) (\mathbf{y} \in C) are continuous. In that case, by the Riesz representation theorem (e.g. Reed and Simon, 1980), for each \mathbf{y} \in C there exists a unique function of \mathbf{x}, call it k(\mathbf{x}, \mathbf{y}), such that

    f(\mathbf{y}) = \langle f, k(\cdot, \mathbf{y}) \rangle    (1.30)

(here, k(\cdot, \mathbf{y}) is the function on C obtained by fixing the second argument of k to \mathbf{y}, and \langle \cdot, \cdot \rangle is the dot product of the RKHS). In view of this property, k is called a reproducing kernel.
Note that by (1.30), \langle f, k(\cdot, \mathbf{y}) \rangle = 0 for all \mathbf{y} implies that f is identically zero. Hence the set of functions \{ k(\cdot, \mathbf{y}) : \mathbf{y} \in C \} spans the whole RKHS. The dot product on the RKHS thus only needs to be defined on \{ k(\cdot, \mathbf{y}) : \mathbf{y} \in C \} and can then be extended to the whole RKHS by linearity and continuity. From (1.30), it follows that in particular

    \langle k(\cdot, \mathbf{x}), k(\cdot, \mathbf{y}) \rangle = k(\mathbf{y}, \mathbf{x})    (1.31)

for all \mathbf{x}, \mathbf{y} \in C (this implies that k is symmetric). Note that this means that any reproducing kernel k corresponds to a dot product in another space.
To establish a connection to the dot product in a feature space F, we next assume that k is a Mercer kernel (cf. Proposition 1.3.2). First note that it is possible to construct a dot product such that k becomes a reproducing kernel for a Hilbert space of functions

    f(\mathbf{x}) = \sum_{i=1}^{\infty} a_i k(\mathbf{x}, \mathbf{x}_i) = \sum_{i=1}^{\infty} a_i \sum_{j=1}^{N_F} \lambda_j \, \psi_j(\mathbf{x}) \, \psi_j(\mathbf{x}_i).    (1.32)

Using only linearity, which holds for any dot product \langle \cdot, \cdot \rangle, we have

    \langle f, k(\cdot, \mathbf{y}) \rangle = \sum_{i=1}^{\infty} a_i \sum_{j,n=1}^{N_F} \lambda_j \, \psi_j(\mathbf{x}_i) \, \langle \psi_j, \psi_n \rangle \, \lambda_n \, \psi_n(\mathbf{y}).    (1.33)

Since k is a symmetric kernel, the \psi_i (i = 1, \ldots, N_F) can be chosen to be orthogonal with respect to the dot product in L_2(C). Hence it is straightforward to construct a dot product \langle \cdot, \cdot \rangle such that

    \langle \psi_j, \psi_n \rangle = \delta_{jn} / \lambda_j    (1.34)

(using the Kronecker symbol \delta_{jn}), in which case (1.33) reduces to the reproducing kernel property (1.30) (using (1.32)).

To write \langle \cdot, \cdot \rangle as a dot product of coordinate vectors, we thus only need to express the functions of the RKHS in the basis (\sqrt{\lambda_n}\, \psi_n)_{n=1,\ldots,N_F}, which is orthonormal with respect to \langle \cdot, \cdot \rangle, i.e.

    f(\mathbf{x}) = \sum_{n=1}^{N_F} \gamma_n \sqrt{\lambda_n}\, \psi_n(\mathbf{x}).    (1.35)

To obtain the coordinates \gamma_n, we compute, using (1.34),

    \gamma_n = \langle f, \sqrt{\lambda_n}\, \psi_n \rangle = \Big\langle \sum_{i=1}^{\infty} a_i \sum_{j=1}^{N_F} \lambda_j \, \psi_j(\mathbf{x}_i)\, \psi_j, \; \sqrt{\lambda_n}\, \psi_n \Big\rangle = \sqrt{\lambda_n} \sum_{i=1}^{\infty} a_i \, \psi_n(\mathbf{x}_i).    (1.36)

Comparing (1.35) and (1.26), we see that F has the structure of a RKHS in the sense that for f and g given by (1.35) and

    g(\mathbf{x}) = \sum_{j=1}^{N_F} \delta_j \sqrt{\lambda_j}\, \psi_j(\mathbf{x}),    (1.37)

we have

    (\gamma \cdot \delta) = \langle f, g \rangle.    (1.38)

Note, moreover, that due to (1.35), we have f(\mathbf{x}) = (\gamma \cdot \Phi(\mathbf{x})) in F. Comparing to (1.30), this shows that \Phi(\mathbf{x}) is nothing but the coordinate representation of the kernel as a function of one argument (cf. also (1.27)).
To conclude the brief detour into RKHS theory, note that in (1.30), k does not have to be linear in its arguments; however, its action as an evaluation functional in Hilbert space is linear. This is the underlying reason why Mercer kernels compute bilinear dot products in Hilbert spaces: the dot product is obtained by combining two evaluations of a possibly nonlinear function in a suitable Hilbert space.
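The reproducing property (1.30) for functions of the form (1.32) can be illustrated numerically. The Python sketch below uses toy expansion points and an RBF kernel chosen purely for illustration: it represents a function by its kernel expansion coefficients a_i and evaluates it at a new point through kernel evaluations alone.

    import numpy as np

    def rbf(x, y, sigma=1.0):
        # Gaussian RBF kernel (1.28).
        return np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))

    # A function in the RKHS, f(x) = sum_i a_i k(x, x_i)  (cf. (1.32)).
    centers = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]])
    a = np.array([0.5, -1.0, 2.0])

    def f(x):
        return sum(a_i * rbf(x, x_i) for a_i, x_i in zip(a, centers))

    # Evaluating f at a point y amounts to <f, k(., y)> by the reproducing property.
    y = np.array([0.2, -0.3])
    print(f(y))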

1.3.5 Kernel Values as Pairwise Similarities

In practice, we are given a finite amount of data \mathbf{x}_1, \ldots, \mathbf{x}_\ell. The following simple observation shows that even if we do not want to (or are unable to) analyse a given kernel k analytically, we can still compute a map \Phi such that k corresponds to a dot product in the linear span of the \Phi(\mathbf{x}_i):

Proposition 1.3.4 Suppose the data \mathbf{x}_1, \ldots, \mathbf{x}_\ell and the kernel k are such that the matrix

    K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)    (1.39)

is positive. Then it is possible to construct a map \Phi into a feature space F such that

    k(\mathbf{x}_i, \mathbf{x}_j) = (\Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)).    (1.40)

Conversely, for a map \Phi into some feature space F, the matrix K_{ij} = (\Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)) is positive.

Proof. Being positive, K can be diagonalized as

    K = S D S^\top    (1.41)

with an orthogonal matrix S and a diagonal matrix D with nonnegative entries. Then

    k(\mathbf{x}_i, \mathbf{x}_j) = (S D S^\top)_{ij}    (1.42)
                                  = \sum_{k=1}^{\ell} S_{ik} D_{kk} (S^\top)_{kj}    (1.43)
                                  = \sum_{k=1}^{\ell} S_{ik} D_{kk} S_{jk}    (1.44)
                                  = (\mathbf{s}_i \cdot D \mathbf{s}_j),    (1.45)

where we have defined the \mathbf{s}_i as the rows of S (note that the columns of S would be K's Eigenvectors). Therefore, K is the dot product matrix (or Gram matrix) of the vectors \sqrt{D_{kk}}\, \mathbf{s}_i.6 Hence the map \Phi, defined on the \mathbf{x}_i by

    \Phi : \mathbf{x}_i \mapsto \sqrt{D_{kk}}\, \mathbf{s}_i,    (1.46)

does the job (cf. (1.40)).
Note that if the \mathbf{x}_i are linearly dependent, it will typically not be the case that \Phi can be extended to a linear map.
For the converse, assume an arbitrary \alpha \in \mathbb{R}^\ell, and compute

    \sum_{i,j=1}^{\ell} \alpha_i \alpha_j K_{ij} = \left( \sum_{i=1}^{\ell} \alpha_i \Phi(\mathbf{x}_i) \cdot \sum_{j=1}^{\ell} \alpha_j \Phi(\mathbf{x}_j) \right) \ge 0.    (1.47)
□

In particular, this result implies that given data \mathbf{x}_1, \ldots, \mathbf{x}_\ell, and a kernel k which gives rise to a positive matrix K, it is always possible to construct a feature space F of dimensionality \le \ell that we are implicitly working in when using kernels.
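The construction in the proof of Proposition 1.3.4 can be carried out directly on a kernel matrix. The following Python sketch (with an arbitrary toy kernel matrix) diagonalizes K and builds the finite-dimensional map (1.46), then verifies that the pairwise dot products of the mapped points reproduce K.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(6, 3))
    K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # an RBF Gram matrix

    # Diagonalize K = S D S^T (eq. (1.41)); eigh returns eigenvectors as columns of S.
    D, S = np.linalg.eigh(K)
    D = np.clip(D, 0.0, None)          # guard against tiny negative round-off

    # Map (1.46): Phi(x_i) is the i-th row of S, scaled componentwise by sqrt(D_kk).
    Phi = S * np.sqrt(D)               # row i is Phi(x_i)

    print(np.allclose(Phi @ Phi.T, K))  # True: dot products reproduce the kernel matrix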
If we perform an algorithm which requires k to correspond to a dot product in some other space (as for instance the Support Vector algorithm to be described below), it could happen that even though k does not satisfy Mercer's conditions in general, it still gives rise to a positive matrix K for the given training data. In that case, Proposition 1.3.4 tells us that nothing will go wrong during training when we work with these data. Moreover, if a kernel leads to a matrix K with some small negative Eigenvalues, we can add a small multiple of some positive definite kernel to obtain a positive matrix.7
Note, finally, that Proposition 1.3.4 does not require the \mathbf{x}_1, \ldots, \mathbf{x}_\ell to be elements of a vector space. They could be any set of objects which, for some function k (which could be thought of as a similarity measure for the objects), gives rise to a positive matrix (k(\mathbf{x}_i, \mathbf{x}_j))_{ij}. Methods based on pairwise distances or similarities have recently attracted attention (Hofmann and Buhmann, 1997). They have the advantage of being applicable also in cases where it is hard to come up with a sensible vector representation of the data (e.g. in text clustering).

6 The fact that every positive matrix is the Gram matrix of some set of vectors is well-known in linear algebra (see e.g. Bhatia, 1997, Exercise I.5.10).

7 For instance, for the hyperbolic tangent kernel (1.29), Mercer's conditions have not been verified. It does not satisfy them in general: in a series of experiments with 2-D toy data, we noticed that the dot product matrix K had some negative Eigenvalues, for most choices of \Theta that we investigated (except for large negative values). Nevertheless, this kernel has successfully been used in Support Vector learning (cf. Sec. 2.3). To understand the latter, note that by shifting the kernel (i.e. choosing different values of \Theta), one can approximate the shape of the polynomial kernel (which is known to be positive), as a function of (\mathbf{x} \cdot \mathbf{y}) (within a certain range), up to a vertical offset. This offset is irrelevant in SV learning: due to (2.15), adding a constant to all elements of the dot product matrix does not change the solution.

Chapter 2

Support Vector Machines


This chapter discusses theoretical and empirical issues related to the Support Vector (SV) algorithm. This algorithm, reviewed in Sec. 2.1, is based on the results of learning theory outlined in Sec. 1.2. Via the use of kernel functions (Sec. 1.3), it gives rise to a number of different types of pattern classifiers (Vapnik and Chervonenkis, 1974; Boser, Guyon, and Vapnik, 1992; Cortes and Vapnik, 1995; Vapnik, 1995b).
The original contribution of the present chapter is largely empirical. Using object and digit recognition tasks, we show that the algorithm allows us to construct high-accuracy polynomial classifiers, radial basis function classifiers, and perceptrons (Sections 2.2 and 2.3), relying on almost identical subsets of the training set, their Support Vector sets (Sec. 2.4). These Support Vector sets are shown to contain all the information necessary to solve a given classification task. To understand the relationship between SV methods and classical techniques, we then describe a study comparing SV machines with Gaussian kernels to classical radial basis function networks, with results favouring the SV approach. Following this, Sec. 2.6 shows that one can utilize the error bounds of learning theory to select values for free parameters in the SV algorithm, as for instance the degree of the polynomial kernel which will perform best on a test set (Schölkopf, Burges, and Vapnik, 1995; Blanz, Schölkopf, Bülthoff, Burges, Vapnik, and Vetter, 1996; Schölkopf, Sung, Burges, Girosi, Niyogi, Poggio, and Vapnik, 1996c). Finally, at the end of the chapter, we summarize various ways of understanding and interpreting the high generalization performance of SV machines (Sec. 2.7).

2.1 The Support Vector Algorithm


As a basis for the material in the following section, we first need to describe the SV algorithm in some detail. The original treatments are due to Vapnik and Chervonenkis (1974), Boser, Guyon, and Vapnik (1992), Guyon, Boser, and Vapnik (1993), Cortes and Vapnik (1995), and Vapnik (1995b).
We describe the SV algorithm in four steps. In Sec. 2.1.1, a structure of decision functions is described which is sufficiently simple to admit the formulation of a bound on their VC-dimension. Based on this result, the optimal margin algorithm minimizes the VC-dimension for this class of decision functions (Sec. 2.1.2). This algorithm is then generalized in two steps in order to obtain SV machines: nonseparable classification problems are dealt with in Sec. 2.1.3, and nonlinear decision functions, retaining the VC-dimension bound, are described in Sec. 2.1.4.
To be able to utilize the results of Sec. 1.3, we shall formulate the algorithm in terms of dot products in some space F. Initially, we think of F as the input space. In Sec. 2.1.4, we will substitute kernels for dot products, in which case F becomes a feature space nonlinearly related to input space.

FIGURE 2.1: A separating hyperplane, written in terms of a weight vector \mathbf{w} and a threshold b. [The figure shows the hyperplane \{\mathbf{z} \mid (\mathbf{w} \cdot \mathbf{z}) + b = 0\}, with (\mathbf{w} \cdot \mathbf{z}) + b > 0 on one side and (\mathbf{w} \cdot \mathbf{z}) + b < 0 on the other, and notes that \{\mathbf{z} \mid (\mathbf{w} \cdot \mathbf{z}) + b = 0\} = \{\mathbf{z} \mid (c\mathbf{w} \cdot \mathbf{z}) + cb = 0\} for c \ne 0.] Note that by multiplying both \mathbf{w} and b with the same nonzero constant, we obtain the same hyperplane, represented in terms of different parameters. Fig. 2.2 shows how to eliminate this scaling freedom.

2.1.1 A Structure on the Set of Hyperplanes


Each particular choice of a structure (1.8) gives rise to a learning algorithm, consisting of performing Structural Risk Minimization in the given structure of sets of functions. The SV algorithm is based on a structure on the set of separating hyperplanes.
To describe it, first note that given a dot product space F and a set of pattern vectors \mathbf{z}_1, \ldots, \mathbf{z}_r \in F, any hyperplane can be written as

    \{ \mathbf{z} \in F : (\mathbf{w} \cdot \mathbf{z}) + b = 0 \}.    (2.1)

In this formulation, we still have the freedom to multiply \mathbf{w} and b with the same nonzero constant (Fig. 2.1). However, the hyperplane corresponds to a canonical pair (\mathbf{w}, b) \in F \times \mathbb{R} if we additionally require

    \min_{i=1,\ldots,r} |(\mathbf{w} \cdot \mathbf{z}_i) + b| = 1,    (2.2)

i.e. that the scaling of \mathbf{w} and b be such that the point closest to the hyperplane has a distance of 1/\|\mathbf{w}\| (Fig. 2.2).1 Thus, the margin between the two classes, measured perpendicular to the hyperplane, is at least 2/\|\mathbf{w}\|.

FIGURE 2.2: By requiring the scaling of \mathbf{w} and b to be such that the point(s) closest to the hyperplane satisfy |(\mathbf{w} \cdot \mathbf{z}_i) + b| = 1, we obtain a canonical form (\mathbf{w}, b) of a hyperplane (cf. Fig. 2.1). [The figure shows the hyperplanes \{\mathbf{z} \mid (\mathbf{w} \cdot \mathbf{z}) + b = +1\}, \{\mathbf{z} \mid (\mathbf{w} \cdot \mathbf{z}) + b = 0\} and \{\mathbf{z} \mid (\mathbf{w} \cdot \mathbf{z}) + b = -1\}: for points \mathbf{z}_1, \mathbf{z}_2 with (\mathbf{w} \cdot \mathbf{z}_1) + b = +1 and (\mathbf{w} \cdot \mathbf{z}_2) + b = -1, we get (\mathbf{w} \cdot (\mathbf{z}_1 - \mathbf{z}_2)) = 2 and hence (\frac{\mathbf{w}}{\|\mathbf{w}\|} \cdot (\mathbf{z}_1 - \mathbf{z}_2)) = \frac{2}{\|\mathbf{w}\|}.] Note that in this case, the margin, measured perpendicularly to the hyperplane, equals 2/\|\mathbf{w}\|, which can be seen by considering two opposite points which precisely satisfy |(\mathbf{w} \cdot \mathbf{z}_i) + b| = 1.

The possibility of introducing a structure on the set of hyperplanes is based on the following result (Vapnik, 1995b):
structure on the set of hyperplanes is based on the following result (Vapnik, 1995b):
Proposition 2.1.1 Let R be the radius of the smallest ball B_R(\mathbf{a}) = \{\mathbf{z} \in F : \|\mathbf{z} - \mathbf{a}\| < R\} (\mathbf{a} \in F) containing the points \mathbf{z}_1, \ldots, \mathbf{z}_r, and let

    f_{\mathbf{w},b} = \mathrm{sgn}\,((\mathbf{w} \cdot \mathbf{z}) + b)    (2.3)

be canonical hyperplane decision functions defined on these points. Then the set \{ f_{\mathbf{w},b} : \|\mathbf{w}\| \le A \} has a VC-dimension h satisfying

    h < R^2 A^2 + 1.    (2.4)

1 The condition (2.2) still allows two such pairs: given a canonical hyperplane (\mathbf{w}, b), another one satisfying (2.2) is given by (-\mathbf{w}, -b). However, we do not mind this remaining ambiguity: first, the following Proposition only makes use of \|\mathbf{w}\|, which coincides in both cases, and second, these two hyperplanes correspond to different decision functions \mathrm{sgn}((\mathbf{w} \cdot \mathbf{z}) + b).


Note. Dropping the condition \|\mathbf{w}\| \le A leads to a set of functions whose VC-dimension equals N_F + 1, where N_F is the dimensionality of F. Due to \|\mathbf{w}\| \le A, we can get VC-dimensions which are much smaller than N_F, enabling us to work in very high dimensional spaces; remember that the risk bound (1.5) does not explicitly depend upon N_F, but on the VC-dimension.
To make Proposition 2.1.1 intuitively plausible, note that due to the inverse proportionality of margin and \|\mathbf{w}\|, (2.4) essentially states that by requiring a large lower bound on the margin (i.e. a small A), we obtain a small VC-dimension. Conversely, by allowing for separations with small margin, we can potentially separate a much larger class of problems (i.e. a larger class of possible labellings of the training data, cf. the definition of the VC-dimension, following (1.6)).
Recalling that (1.5) tells us to keep both the training error and the VC-dimension small in order to achieve high generalization ability, we conclude that hyperplane decision functions should be constructed such that they maximize the margin, and at the same time separate the training data with as few exceptions as possible. Sections 2.1.2 and 2.1.3 will deal with these two issues, respectively.
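As a small numerical illustration of the canonical form (2.2) and of the margin 2/\|\mathbf{w}\| (Fig. 2.2), the Python sketch below (toy numbers, purely illustrative) rescales an arbitrary separating hyperplane so that the closest point satisfies |(\mathbf{w} \cdot \mathbf{z}) + b| = 1, and then reports the resulting margin.

    import numpy as np

    # Toy points and an arbitrary (non-canonical) separating hyperplane.
    Z = np.array([[2.0, 2.0], [3.0, 3.5], [-1.0, -1.0], [-2.0, -0.5]])
    w, b = np.array([1.0, 1.0]), 0.0

    # Canonicalize: scale (w, b) so that min_i |(w . z_i) + b| = 1  (eq. (2.2)).
    scale = np.abs(Z @ w + b).min()
    w_c, b_c = w / scale, b / scale

    print("min_i |(w . z_i) + b| =", np.abs(Z @ w_c + b_c).min())  # 1.0
    print("margin 2/||w|| =", 2.0 / np.linalg.norm(w_c))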

2.1.2 Optimal Margin Hyperplanes

Suppose we are given a set of examples (\mathbf{z}_1, y_1), \ldots, (\mathbf{z}_\ell, y_\ell), \mathbf{z}_i \in F, y_i \in \{\pm 1\}, and we want to find a decision function f_{\mathbf{w},b} = \mathrm{sgn}\,((\mathbf{w} \cdot \mathbf{z}) + b) with the property

    f_{\mathbf{w},b}(\mathbf{z}_i) = y_i, \quad i = 1, \ldots, \ell.    (2.5)

fe

If this function exists (the nonseparable case shall be dealt with in the next section),
canonicality (2.2) implies

    y_i \cdot ((z_i \cdot w) + b) \ge 1, \quad i = 1, \dots, \ell.                (2.6)

As an aside, note that out of the two canonical forms of the same hyperplane, (w, b) and
(−w, −b), only one will satisfy equations (2.5) and (2.6). The existence of class labels
thus allows us to distinguish two orientations of a hyperplane.
Following Proposition 2.1.1, a separating hyperplane which generalizes well can
thus be found by minimizing

    \tau(w) = \tfrac{1}{2} \|w\|^2                                                (2.7)

subject to (2.6). To solve this convex optimization problem, one introduces a Lagrangian

    L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{\ell} \alpha_i \big( y_i((z_i \cdot w) + b) - 1 \big)        (2.8)

with multipliers α_i ≥ 0. The Lagrangian L has to be maximized with respect to the α_i
and minimized with respect to w and b. The condition that at the saddle point, the
derivatives of L with respect to the primal variables must vanish,

    \frac{\partial}{\partial b} L(w, b, \alpha) = 0, \qquad \frac{\partial}{\partial w} L(w, b, \alpha) = 0,        (2.9)

leads to

    \sum_{i=1}^{\ell} \alpha_i y_i = 0                                            (2.10)

and

    w = \sum_{i=1}^{\ell} \alpha_i y_i z_i.                                       (2.11)

The solution vector thus has an expansion in terms of training examples. Note that although the solution w is unique (due to the strict convexity of (2.7), and the convexity
of (2.6)), the coefficients α_i need not be.
According to the Kuhn-Tucker theorem of optimization theory (e.g. Bertsekas,
1995), at the saddle point only those Lagrange multipliers α_i can be nonzero which
correspond to constraints (2.6) which are precisely met, i.e.

    \alpha_i \cdot \big[ y_i((z_i \cdot w) + b) - 1 \big] = 0, \quad i = 1, \dots, \ell.        (2.12)

The patterns z_i for which α_i > 0 are called Support Vectors.² According to (2.12), they
lie exactly at the margin.³ All remaining examples of the training set are irrelevant: their
constraint (2.6) is satisfied automatically, and they do not appear in the expansion (2.11).⁴

² This terminology is related to corresponding terms in the theory of convex sets, relevant to convex
optimization (e.g. Luenberger, 1973; Bertsekas, 1995). Given any boundary point of a convex set C,
there always exists a hyperplane separating the point from the interior of the set. This is called a
supporting hyperplane. SVs do lie on the boundary of the convex hulls of the two classes, thus they
possess supporting hyperplanes. The SV optimal hyperplane is the hyperplane which lies in the middle
of the two parallel supporting hyperplanes (of the two classes) with maximum distance. Vice versa,
from the optimal hyperplane one can obtain supporting hyperplanes for all SVs of both classes by
shifting it by 1/‖w‖ in both directions.
³ Note that this implies that the solution (w, b), where b is computed using the fact that
y_i((w · z_i) + b) = 1 for SVs, is in canonical form with respect to the training data. (This makes use
of the reasonable assumption that the training set contains both positive and negative examples.)
⁴ In a statistical mechanics framework, Anlauf and Biehl (1989) have put forward a similar argument
for the optimal stability perceptron, also computed by constrained optimization.

This leads directly to an upper bound on the generalization ability of optimal margin hyperplanes: suppose we use the leave-one-out method to estimate the expected
test error (e.g. Vapnik, 1979). If we leave out a pattern z_i and construct the solution
from the remaining patterns, there are several possibilities (cf. (2.6)):


1. y_i · ((z_i · w) + b) > 1, i.e. the pattern is classified correctly and does not lie on
the margin. These are patterns that would not have become Support Vectors
anyway.

2. y_i · ((z_i · w) + b) = 1, i.e. z_i exactly meets the constraint (2.6). In that case,
the solution w does not change, even though the coefficients α_i in the dual
formulation of the optimization problem might change: namely, z_i might have
become a Support Vector (i.e. α_i > 0) had it been kept in the training set. In
that case, the fact that the solution is the same no matter whether z_i is in the
training set or not means that z_i can be written as Σ_{SVs} λ_i y_i z_i with λ_i ≥ 0.
Note that this is not equivalent to saying that z_i can be written as some linear
combination of the remaining Support Vectors: since the sign of the coefficients
in the linear combination is determined by the class of the respective pattern,
not any linear combination will do. Strictly speaking, z_i must lie in the cone
spanned by the y_i z_i, where the z_i are all Support Vectors.⁵

3. 1 > y_i · ((z_i · w) + b) > 0, i.e. z_i lies within the margin, but still on the correct
side of the decision boundary. In that case, the solution looks different from the
one obtained with z_i in the training set (for, in that case, z_i would satisfy (2.6)
after training), but classification is nevertheless correct.

4. 0 > y_i · ((z_i · w) + b). In that case, z_i will be classified incorrectly.
Note that the cases 3 and 4 necessarily correspond to examples which would have
become SVs if kept in the training set; case 2 potentially includes such cases. However,
only case 4 leads to an error in the leave-one-out procedure. Consequently, we have
the following result on the generalization error of optimal margin classifiers (Vapnik
and Chervonenkis, 1974):⁶

Proposition 2.1.2 The expectation of the number of Support Vectors obtained during
training on a training set of size ℓ, divided by ℓ − 1, is an upper bound on the expected
probability of test error.

A sharper bound can be formulated by making a further distinction in case 2, between
SVs that must occur in the solution, and those that can be expressed in terms of the
other SVs (Vapnik and Chervonenkis, 1974).
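To make Proposition 2.1.2 concrete, here is a hedged sketch (scikit-learn on synthetic data, not the original experiments): it counts the Support Vectors of an approximately hard-margin classifier and compares #SVs/(ℓ − 1) with an explicit leave-one-out error estimate. The proposition is a statement about expectations, so a single run only illustrates the flavour of the bound.

    # Sketch of the leave-one-out bound of Proposition 2.1.2 on synthetic data.
    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.svm import SVC

    X, y = make_blobs(n_samples=60, centers=2, cluster_std=1.2, random_state=1)
    clf = SVC(kernel="linear", C=1e6)

    n_sv = clf.fit(X, y).support_.shape[0]    # number of Support Vectors on the full set
    bound = n_sv / (len(y) - 1)               # #SVs / (l - 1), cf. Proposition 2.1.2

    loo_err = 1.0 - cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
    print("leave-one-out error estimate:", loo_err)
    print("Support Vector bound        :", bound)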
Substituting the conditions for the extremum, (2.10) and (2.11), into the Lagrangian (2.8), one derives the dual form of the optimization problem: maximize

    W(\alpha) = \sum_{i=1}^{\ell} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{\ell} \alpha_i \alpha_j y_i y_j (z_i \cdot z_j)        (2.13)

subject to the constraints

    \alpha_i \ge 0, \quad i = 1, \dots, \ell,                                     (2.14)
    \sum_{i=1}^{\ell} \alpha_i y_i = 0.                                           (2.15)

⁵ Possible non-uniquenesses of the solution's expansion in terms of SVs are related to zero eigenvalues
of K_{ij} = y_i y_j k(x_i, x_j), cf. Proposition 1.3.4. Note, however, the above caveat on the distinction
between linear combinations and linear combinations with coefficients of fixed sign.
⁶ It also holds for the generalized versions of optimal margin classifiers explained in the following
sections.

On substitution of the expansion (2.11) into the decision function (2.3), we obtain an
expression which can be evaluated in terms of dot products between the pattern to be
classified and the Support Vectors,

    f(z) = \mathrm{sgn}\Big( \sum_{i=1}^{\ell} \alpha_i y_i (z \cdot z_i) + b \Big).        (2.16)

It is interesting to note that the solution has a simple physical interpretation
(Burges and Scholkopf, 1997). If we assume that each Support Vector z_j exerts a
perpendicular force of size α_j and sign y_j on a solid plane sheet lying along the hyperplane w · z + b = 0, then the solution satisfies the requirements of mechanical stability.
The constraint (2.15) translates into the forces on the sheet summing to zero; and
(2.11) implies that the torques z_i × α_i y_i w/‖w‖ also sum to zero. This mechanical
analogy illustrates the physical meaning of the term Support Vector.
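The mechanical picture can be checked numerically. The following sketch (scikit-learn's SVC used as a generic solver for the dual (2.13)–(2.15); data synthetic) recovers the products α_i y_i from a fitted model and verifies that they sum to zero, cf. (2.10)/(2.15), and that w is reproduced by the expansion (2.11).

    # Verify (2.10)/(2.15) and the expansion (2.11) on a linearly separable toy problem.
    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    X, y = make_blobs(n_samples=40, centers=2, cluster_std=0.8, random_state=2)
    y = 2 * y - 1                              # relabel to {-1, +1}

    clf = SVC(kernel="linear", C=1e6).fit(X, y)
    alpha_y = clf.dual_coef_[0]                # equals alpha_i * y_i for the Support Vectors
    sv = clf.support_vectors_

    print("sum_i alpha_i y_i        :", alpha_y.sum())   # ~0, cf. (2.10)
    print("w from expansion (2.11)  :", alpha_y @ sv)
    print("w reported by the solver :", clf.coef_[0])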

2.1.3 Soft Margin Hyperplanes

In practice, a separating hyperplane often does not exist. To allow for the possibility
of examples violating (2.6), Cortes and Vapnik (1995) introduce slack variables

    \xi_i \ge 0, \quad i = 1, \dots, \ell,                                        (2.17)

and use relaxed separation constraints (cf. (2.6))

    y_i((z_i \cdot w) + b) \ge 1 - \xi_i, \quad i = 1, \dots, \ell.               (2.18)

The SV approach to minimizing the guaranteed risk bound (1.5) consists of the following: minimize

    \tau(w, \xi) = \tfrac{1}{2}\|w\|^2 + \gamma \sum_{i=1}^{\ell} \xi_i           (2.19)

subject to the constraints (2.17) and (2.18) (cf. (2.7)). Due to (2.4), minimizing the
first term is related to minimizing the VC-dimension of the considered class of learning
machines, thereby minimizing the second term of the bound (1.5) (it also amounts to
maximizing the separation margin, cf. the remark following (2.2), and Fig. 2.2). The
term Σ_{i=1}^{ℓ} ξ_i, on the other hand, is an upper bound on the number of misclassifications
on the training set (cf. (2.18)); this controls the empirical risk term in (1.5). For a
suitable positive constant γ, this approach therefore constitutes a practical implementation of Structural Risk Minimization on the given set of functions.⁷ Note, however,
that Σ_{i=1}^{ℓ} ξ_i is significantly larger than the number of errors if many of the ξ_i attain
large values, i.e. if the classes to be separated strongly overlap, for instance due to
noise. In these cases, there is no guarantee that the hyperplane will generalize well.
As in the separable case (2.11), the solution can be shown to have an expansion

    w = \sum_{i=1}^{\ell} \alpha_i y_i z_i,                                       (2.20)

where nonzero coefficients α_i can only occur if the corresponding example (z_i, y_i) precisely meets the constraint (2.18). The coefficients α_i are found by solving the following
quadratic programming problem: maximize

    W(\alpha) = \sum_{i=1}^{\ell} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{\ell} \alpha_i \alpha_j y_i y_j (z_i \cdot z_j)        (2.21)
subject to the constraints

    0 \le \alpha_i \le \gamma, \quad i = 1, \dots, \ell,                          (2.22)
    \sum_{i=1}^{\ell} \alpha_i y_i = 0.                                           (2.23)
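A brief sketch of the soft-margin formulation on overlapping synthetic data follows; the identification of the constant γ of (2.19) with the parameter C of scikit-learn's SVC is an assumption about naming only. The sketch computes the slacks ξ_i = max(0, 1 − y_i f(z_i)) from the fitted decision function and checks that their sum upper-bounds the number of training misclassifications, as stated above.

    # Soft-margin sketch: slack variables and the error bound of Sec. 2.1.3 (synthetic data).
    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=3)
    y = 2 * y - 1

    gamma_const = 1.0                          # plays the role of the constant in (2.19)
    clf = SVC(kernel="linear", C=gamma_const).fit(X, y)

    f = clf.decision_function(X)
    xi = np.maximum(0.0, 1.0 - y * f)          # slack variables, cf. (2.18)
    train_errors = np.sum(y * f < 0)

    print("sum of slacks (upper bound):", xi.sum())
    print("training misclassifications:", train_errors)
    assert xi.sum() >= train_errors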

2.1.4 Nonlinear Support Vector Machines


Although we have already introduced the concept of Support Vectors, one crucial
ingredient of SV machines in their full generality is still missing: to allow for much
more general decision surfaces, one can first nonlinearly transform a set of input vectors
x_1, …, x_ℓ into a high-dimensional feature space by a map Φ : x_i ↦ z_i and then do a
linear separation there.
Note that in all of the above, we made no assumptions on the dimensionality of
F. We only required F to be equipped with a dot product. The patterns z_i that we
talked about in the previous sections thus need not coincide with the input patterns.
They can equally well be the results of mapping the original input patterns x_i into a
high-dimensional feature space.
Maximizing the target function (2.21) and evaluating the decision function (2.16)
then requires the computation of dot products (Φ(x) · Φ(x_i)) in a high-dimensional
space. Under Mercer's conditions, given in Proposition 1.3.2, these expensive calculations can be reduced significantly by using a suitable function k such that

    (\Phi(x) \cdot \Phi(x_i)) = k(x, x_i),                                        (2.24)
leading to decision functions of the form

    f(x) = \mathrm{sgn}\Big( \sum_{i=1}^{\ell} y_i \alpha_i \, k(x, x_i) + b \Big).        (2.25)

⁷ It slightly deviates from the Structural Risk Minimization (SRM) Principle in that (a) it does
not use the bound (1.5), but a related quantity (2.19) which can be minimized efficiently, and (b) the
SRM Principle strictly speaking requires the structure of sets of functions to be fixed a priori. For
more details, cf. Vapnik (1995b); Shawe-Taylor, Bartlett, Williamson, and Anthony (1996).

FIGURE 2.3: By mapping the input data (top left) nonlinearly (via Φ) into a higher-dimensional feature space F (here: R³), and constructing a separating hyperplane there
(bottom left), an SV machine (top right) corresponds to a nonlinear decision surface in
input space (here: R², bottom right). (Figure annotations: the map Φ : R² → R³,
(x_1, x_2) ↦ (x_1², x_2², √2 x_1 x_2), feature-space weights w_1, w_2, w_3, and decision function
f(x) = sgn(w_1 x_1² + w_2 x_2² + w_3 √2 x_1 x_2 + b).)

Consequently, everything that has been said about the linear case also applies
to nonlinear cases obtained by using a suitable kernel k instead of the Euclidean
dot product (Fig. 2.3). By using different kernel functions, the SV algorithm can
construct a variety of learning machines (Fig. 2.4), some of which coincide with classical
architectures:


Polynomial classifiers of degree d:

    k(x, x_i) = (x \cdot x_i)^d                                                   (2.26)

Radial basis function classifiers:

    k(x, x_i) = \exp\big( -\|x - x_i\|^2 / c \big)                                (2.27)

Neural networks:

    k(x, x_i) = \tanh\big( \kappa (x \cdot x_i) + \Theta \big)                    (2.28)

FIGURE 2.4: Architecture of SV machines. The kernel function k is chosen a priori; it
determines the type of classifier (e.g. polynomial classifier, radial basis function classifier,
or neural network). All other parameters (number of hidden units, weights, threshold b)
are found during training by solving a quadratic programming problem. The first layer
weights x_i are a subset of the training set (the Support Vectors); the second layer weights
λ_i = y_i α_i are computed from the Lagrange multipliers (cf. (2.25)). (Figure annotations:
input vector x, support vectors x_1 … x_4, comparison via the kernels (2.26)–(2.28), weights,
and classification f(x) = sgn(Σ_i λ_i k(x, x_i) + b).)
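The map of Figure 2.3 can be used to check (2.24) directly: for Φ(x) = (x_1², x_2², √2 x_1 x_2), the ordinary dot product in R³ coincides with the degree-2 polynomial kernel (2.26). A small NumPy-only sketch (the test points are arbitrary):

    # Check (Phi(x) . Phi(x')) = (x . x')^2 for the feature map of Fig. 2.3.
    import numpy as np

    def phi(x):
        # Explicit map R^2 -> R^3 used in Fig. 2.3.
        return np.array([x[0]**2, x[1]**2, np.sqrt(2.0) * x[0] * x[1]])

    def k_poly2(x, xp):
        # Degree-2 polynomial kernel, cf. (2.26) with d = 2.
        return (x @ xp) ** 2

    rng = np.random.default_rng(4)
    for _ in range(5):
        x, xp = rng.normal(size=2), rng.normal(size=2)
        assert np.allclose(phi(x) @ phi(xp), k_poly2(x, xp))
    print("dot products in feature space match the kernel values")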

To find the decision function (2.25), we maximize (cf. (2.21))

    W(\alpha) = \sum_{i=1}^{\ell} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{\ell} \alpha_i \alpha_j y_i y_j k(x_i, x_j)        (2.29)

subject to the constraints (2.22) and (2.23). Since k is required to satisfy Mercer's
conditions, it corresponds to a dot product in another space (2.24); thus K_{ij} :=
(y_i y_j k(x_i, x_j))_{ij} is a positive matrix, providing us with a problem that can be solved
efficiently. To see this, note that (cf. Proposition 1.3.4)

    \sum_{i,j=1}^{\ell} \alpha_i \alpha_j y_i y_j k(x_i, x_j) = \Big( \sum_{i=1}^{\ell} \alpha_i y_i \Phi(x_i) \cdot \sum_{j=1}^{\ell} \alpha_j y_j \Phi(x_j) \Big) \ge 0        (2.30)

for all α ∈ R^ℓ.
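Equation (2.30) is what makes (2.29) a tractable quadratic program: the matrix K_{ij} = y_i y_j k(x_i, x_j) has no negative eigenvalues whenever k satisfies Mercer's conditions. A quick numerical sketch with a Gaussian kernel on random data (NumPy only; eigenvalues may dip slightly below zero through round-off):

    # Numerical check that K_ij = y_i y_j k(x_i, x_j) has no negative eigenvalues.
    import numpy as np

    rng = np.random.default_rng(5)
    X = rng.normal(size=(30, 2))
    y = np.where(rng.random(30) > 0.5, 1.0, -1.0)

    # Gaussian (RBF) kernel matrix, cf. (2.27) with c = 1.
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.outer(y, y) * np.exp(-sq_dists)

    eigvals = np.linalg.eigvalsh(K)
    print("smallest eigenvalue:", eigvals.min())   # >= 0 up to numerical precision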


To compute the threshold b, one takes into account that due to (2.18), for Support
Vectors x_j for which ξ_j = 0, we have

    \sum_{i=1}^{\ell} y_i \alpha_i \, k(x_j, x_i) + b = y_j.                      (2.31)

Thus, the threshold can for instance be obtained by averaging

    b = y_j - \sum_{i=1}^{\ell} y_i \alpha_i \, k(x_j, x_i)                       (2.32)

over all Support Vectors x_j (i.e. 0 < α_j) with α_j < γ.

Figure 2.5 shows how a simple binary toy problem is solved by a Support Vector
machine with a radial basis function kernel (2.27).
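The threshold computation (2.31)/(2.32) can be sketched as follows: fit a Gaussian-kernel SV machine on a toy problem (in the spirit of Figure 2.5, but on made-up data), then recompute b by averaging (2.32) over the unbounded Support Vectors and compare with the intercept reported by the solver. Identifying the upper bound on α_j with scikit-learn's C is again only a naming assumption, and the `gamma` argument below is the kernel width, unrelated to the constant of (2.19).

    # Recompute b by averaging (2.32) over Support Vectors with 0 < alpha_j < C.
    import numpy as np
    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=80, noise=0.15, random_state=6)
    y = 2 * y - 1

    C = 10.0
    clf = SVC(kernel="rbf", gamma=1.0, C=C).fit(X, y)   # gamma = kernel width parameter

    sv = clf.support_vectors_
    alpha_y = clf.dual_coef_[0]                # alpha_i * y_i for the Support Vectors
    y_sv = y[clf.support_]

    def k(a, B):
        # Gaussian kernel values k(a, b_j) with width 1, matching gamma=1.0 above.
        return np.exp(-((B - a) ** 2).sum(axis=1))

    # Unbounded SVs: alpha_j strictly below the box constraint.
    unbounded = np.abs(alpha_y) < C - 1e-8
    b_values = [y_sv[j] - alpha_y @ k(sv[j], sv) for j in np.where(unbounded)[0]]
    print("b averaged via (2.32):", np.mean(b_values))
    print("b from the solver    :", clf.intercept_[0])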

2.1.5 SV Regression Estimation

This thesis is primarily concerned with pattern recognition. Nevertheless, we briefly
mention the case of SV regression (Vapnik, 1995b; Smola, 1996; Vapnik, Golowich,
and Smola, 1997). To estimate a linear regression (Fig. 2.6)

    f(z) = (w \cdot z) + b                                                        (2.33)

with precision ε, one minimizes

    \tau(w, \xi, \xi^*) = \tfrac{1}{2}\|w\|^2 + \gamma \sum_{i=1}^{\ell} (\xi_i + \xi_i^*)        (2.34)

subject to

    ((w \cdot z_i) + b) - y_i \le \varepsilon + \xi_i                             (2.35)
    y_i - ((w \cdot z_i) + b) \le \varepsilon + \xi_i^*                           (2.36)
    \xi_i, \xi_i^* \ge 0                                                          (2.37)

for all i = 1, …, ℓ.

FIGURE 2.5: Example of a Support Vector classifier found by using a radial basis function
kernel k(x, y) = exp(−‖x − y‖²). Both coordinate axes range from −1 to +1. Circles
and disks are two classes of training examples; the middle line is the decision surface; the
outer lines precisely meet the constraint (2.6). Note that the Support Vectors found by
the algorithm (marked by extra circles) are not centers of clusters, but examples which are
critical for the given classification task (cf. Sec. 2.5). Grey values code the modulus of
the argument Σ_{i=1}^{ℓ} y_i α_i · k(x, x_i) + b of the decision function (2.25). (From Scholkopf,
Burges, and Vapnik (1996a).)

Generalization to nonlinear regression estimation is carried out using kernel functions, in complete analogy to the case of pattern recognition. A suitable choice of
the kernel function then allows the construction of multi-dimensional splines (Vapnik,
Golowich, and Smola, 1997).
Different types of loss functions can be utilized to cope with different types of noise
in the data (Müller, Smola, Rätsch, Schölkopf, Kohlmorgen, and Vapnik, 1997; Smola
and Schölkopf, 1997b).
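A minimal sketch of ε-insensitive SV regression in the sense of (2.33)–(2.37), here via scikit-learn's SVR on synthetic one-dimensional data (the data and parameter values are invented); it simply reports how many training points end up outside the ε-tube of Figure 2.6.

    # epsilon-insensitive SV regression sketch, cf. (2.33)-(2.37).
    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(7)
    Z = np.sort(rng.uniform(-3, 3, size=(60, 1)), axis=0)
    y = 0.8 * Z.ravel() + 0.3 * rng.normal(size=60)   # noisy linear target

    eps = 0.2
    reg = SVR(kernel="linear", C=10.0, epsilon=eps).fit(Z, y)

    residuals = np.abs(reg.predict(Z) - y)
    outside_tube = np.sum(residuals > eps + 1e-12)
    print("points outside the epsilon-tube:", outside_tube)
    print("number of Support Vectors      :", reg.support_.shape[0])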

2.1.6 Multi-Class Classification


To get k-class classifiers, we construct a set of binary classifiers f^1, …, f^k, each trained
to separate one class from the rest, and combine them by doing the multi-class classification according to the maximal output before applying the sgn function, i.e. by taking

    \mathrm{argmax}_{j=1,\dots,k} \; g^j(x), \qquad \text{where} \quad g^j(x) = \sum_{i=1}^{\ell} y_i \alpha_i^j \, k(x, x_i) + b^j        (2.38)

(note that f^j(x) = sgn(g^j(x)), cf. (2.25)). The values g^j(x) can also be used for
reject decisions (e.g. Bottou et al., 1994), for instance by considering the difference
between the maximum and the second highest value as a measure of confidence in the
classification.

FIGURE 2.6: In SV regression, a desired accuracy ε is specified a priori. It is then attempted
to fit a tube with radius ε to the data. The trade-off between model complexity and points
lying outside of the tube (with positive slack variables ξ) is determined by minimizing
(2.34).
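A sketch of the one-against-the-rest combination (2.38): train k binary machines, evaluate their real-valued outputs g^j(x), and classify by the argmax. The binary machines below are scikit-learn SVCs on synthetic data; kernel and parameter choices are illustrative only.

    # Multi-class classification by maximal real-valued output, cf. (2.38).
    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    X, y = make_blobs(n_samples=150, centers=4, cluster_std=1.0, random_state=8)
    classes = np.unique(y)

    # One binary recognizer per class (class j against the rest).
    machines = [SVC(kernel="rbf", gamma=0.5, C=10.0).fit(X, np.where(y == j, 1, -1))
                for j in classes]

    def predict(Xnew):
        G = np.column_stack([m.decision_function(Xnew) for m in machines])  # g^j(x)
        return classes[np.argmax(G, axis=1)]

    print("training accuracy of the argmax combination:",
          np.mean(predict(X) == y))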
In the following sections, we shall report experimental results obtained with the SV
algorithm. We used the Support Vector algorithm with standard quadratic programming techniques⁸ to construct polynomial, radial basis function and neural network
classifiers. This was done by choosing the kernels (2.26), (2.27), (2.28) in the decision
function (2.25) and in the function (2.29) to be maximized under the constraints (2.22)
and (2.23). We shall start with object recognition experiments (Sec. 2.2), and then
move to handwritten digit recognition (Sec. 2.3).

⁸ An existing implementation at AT&T Bell Labs was used, largely programmed by L. Bottou,
C. Burges, and C. Cortes.

2.2 Object Recognition Results

2.2.1 Entry-Level and Animal Recognition

FIGURE 2.7: Examples from the entry level (top) and animal (bottom) databases. Left:
rendered views of two 3-D models; right: 16 × 16 downsampled images, and four 16 × 16
downsampled edge detection patterns.

For purposes of psychophysical and computational studies, the object recognition
group at the Max-Planck-Institut für biologische Kybernetik has compiled three databases of rendered 3-D CAD models. The entry level database (see Appendix A for
snapshots and further description) comprises views of 25 3-D object models, which
in psychophysical experiments were found to belong to different entry level categories
(Liter et al., 1997). Objects tend to get identified by humans first at a particular level
of abstraction which is neither the most general nor the most specific, e.g. an object
might be identified first as an apple, rather than as a piece of fruit or as a Cox Orange.
For a discussion of this concept, referred to as entry (or basic) level, see (Jolicoeur,
Gluck, and Kosslyn, 1984; Rosch, Mervis, Gray, Johnson, and Boyes-Braem, 1976).
In subordinate level recognition, on the other hand, finer distinctions between objects
sharing the same entry level become relevant, as for instance those between different
types of birds contained in the second database, the animal database (Appendix A).
It should be noted, however, that the animal database does not pose a purely subordinate level recognition task, since many of its animals are also distinct on the entry
level. The third MPI database, containing 25 chairs, however, can be considered a
subordinate level database. We will use this one in Sec. 2.2.2.
In order to recognize the objects from all orientations of the upper viewing hemisphere, a fairly complex decision surface in high-dimensional space must be learnt.
The objects were realistically rendered and then downsampled. Compared to many
real-world databases, the databases should be considered as containing relatively little noise; in particular, they do not contain wrongly labeled patterns.

Under these circumstances, we reasoned that it should be possible to separate the data with zero
training error even with moderate classifier complexity, and we decided to determine
the value of the constant γ (cf. (2.19)) by the following heuristic: out of all values 10^n,
with integer n, we chose the smallest one which made the problem separable. On the
entry level databases, this led to γ = 1000, on the animal databases, to γ = 100. Of
both databases, we used 12 variants, obtained by

- choosing one of three database sizes: 25, 89 (regularly spaced), or 100 (random,
uniformly distributed) views per object;

- choosing either grey-scale images or binarized silhouette images (both in downsampled versions); and

- using just 16 × 16 resolution images, obtained from the original images by downsampling, or additionally four more 16 × 16 patterns, containing downsampled
versions of edge detection results obtained from the original images. Note that
in the latter case, the resulting 1280-dimensional vectors contain information
which is not contained in the 16 × 16 images, since the edge detection, involving
a (nonlinear) modulus operation, is done before downsampling (cf. Blanz,
Schölkopf, Bülthoff, Burges, Vapnik, and Vetter, 1996).
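The heuristic described before the list (choose the smallest value 10^n that makes the problem separable) amounts to a simple loop. The sketch below illustrates it on synthetic data with scikit-learn; identifying the constant of (2.19) with the C parameter is an assumption about naming, and "separable" is operationalized as zero training error, as in the text.

    # Heuristic of Sec. 2.2.1: smallest value 10^n that separates the training data.
    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=9)

    chosen = None
    for n in range(-2, 6):                         # candidate values 10^n
        clf = SVC(kernel="poly", degree=3, C=10.0 ** n).fit(X, y)
        if np.mean(clf.predict(X) == y) == 1.0:    # zero training error => "separable"
            chosen = 10.0 ** n
            break
    print("smallest separating constant:", chosen)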

For more details on the databases, see (Liter et al., 1997), and Appendix A. Example
images of the original models, and of the downsampled images and edge detection
patterns for the entry level and the animal databases are given in Fig. 2.7.
We trained polynomial SV machines on these 25-class recognition tasks, and obtained accuracies which in some cases exceeded 99% (see Table 2.1). A few aspects of
the results deserve to be pointed out:

Performance. The highest recognition rates were obtained using polynomial SV classifiers of degrees around 20; however, we found no pronounced minimum. Generally,
all of the higher degrees afforded high accuracies. The regularly spaced 89-view-per-object set led to higher accuracies than the random 100-view-per-object set. This
suggests that regular spacing of the views on the viewing sphere corresponds to a
useful spacing of the knots (or centers) of the approximating functions in R^N. Edge
detection information significantly cuts error rates, in many cases by a factor of two
or more. Generally, accuracies were higher for grey-scale images than they were for silhouettes. The differences, however, were not large: high accuracies were also obtained
for silhouettes. To understand this, we have to note that the thresholding operation
used to produce silhouettes was applied to the original high-resolution images, and
not to the downsampled versions. After downsampling, this yields grey-scale images
whose grey values do not code grey values in the original image; however, they do still
code useful information on the high-resolution object silhouettes.

TABLE 2.1: Object recognition test error rates on different databases of 25 objects, using
polynomial SV classifiers of varying degrees. The training sets containing 25 and 89 views
per object were regularly spaced; those with 100 views were distributed uniformly. Testing
was done on an independent test set of 100 random views per object. All views were taken
from the upper viewing hemisphere. For further discussion, see Sec. 2.2.1.

degree:                1     …     …     …    12    15    20    25

entry level:
  25 grey scale      26.0  17.7  15.4  13.9  13.1  13.0  13.0  14.6
  89 grey scale      14.5   3.4   2.4   2.0   1.8   1.8   1.8   2.1
  100 grey scale     17.1   5.6   4.2   3.5   3.2   2.8   2.4   2.8
  25 silhouettes     27.1  19.6  17.9  16.7  16.2  15.6  15.4  16.3
  89 silhouettes     17.2   4.3   3.3   2.7   2.5   2.2   2.2   2.8
  100 silhouettes    18.2   6.9   5.4   4.8   4.2   4.0   4.0   4.7

entry level with edge detection:
  25 grey scale       9.0   8.0   6.7   5.8   5.5   5.3   4.9   5.6
  89 grey scale       1.9   1.2   0.8   0.7   0.6   0.5   0.4   0.4
  100 grey scale      3.5   2.3   1.8   1.5   1.3   1.1   1.1   1.0
  25 silhouettes      9.4   8.2   7.6   7.0   6.6   6.5   6.1   6.0
  89 silhouettes      2.4   1.7   1.2   0.8   0.6   0.5   0.5   0.4
  100 silhouettes     3.8   3.0   2.6   2.5   2.5   2.4   2.3   2.2

animals:
  25 grey scale      31.6  20.4  15.9  14.8  13.8  13.4  13.0  13.8
  89 grey scale      21.8   5.6   3.2   2.5   2.0   1.7   1.7   2.0
  100 grey scale     24.5   8.8   5.8   5.2   5.0   4.7   4.8   4.4
  25 silhouettes     34.4  22.4  18.2  17.0  16.4  15.6  15.8  16.4
  89 silhouettes     27.0   7.4   3.8   2.8   2.5   2.5   2.2   2.8
  100 silhouettes    29.1  11.0   7.4   6.3   5.8   5.4   5.2   5.7

animals with edge detection:
  25 grey scale      11.8   9.0   7.9   7.2   6.9   6.8   6.4   6.4
  89 grey scale       3.2   1.5   1.1   1.0   0.9   0.9   0.8   0.8
  100 grey scale      4.7   3.3   2.7   2.2   2.2   2.0   2.0   2.0
  25 silhouettes     12.1   9.9   8.8   8.0   7.6   7.5   7.0   7.1
  89 silhouettes      3.7   2.0   1.3   1.2   1.1   1.2   1.1   1.1
  100 silhouettes     5.4   4.0   3.2   3.1   3.0   2.9   2.7   2.6

TABLE 2.2: Numbers of SVs for the object recognition systems of Table 2.1, on different
databases of 25 objects, using polynomial SV classifiers of varying degrees. The training
sets containing 25 and 89 views per object were regularly spaced; the ones with 100 views
were distributed uniformly. The numbers of SVs are averages over all 25 binary classifiers
separating one object from the rest; they should be seen in relation to the size of the
training set, which for the above numbers of views per object was 625, 2225, and 2500,
respectively. The given numbers of SVs thus amount to roughly 10% of the database sizes.
For the silhouette databases, the numbers (not shown here) are very similar, only slightly
bigger.

degree:                1    …    …    …   12   15   20   25

entry level:
  25 grey scale       86   74   71   70   72   74   79   92
  89 grey scale      219  148  132  128  128  133  144  165
  100 grey scale     206  139  121  117  119  122  135  158

entry level with edge detection:
  25 grey scale       73   74   77   79   84   87   91   99
  89 grey scale      126  119  125  130  137  145  151  161
  100 grey scale     123  115  120  125  129  133  143  153

animals:
  25 grey scale      108   96   89   90   91   95  100  112
  89 grey scale      231  196  180  177  178  183  193  208
  100 grey scale     235  196  176  169  169  174  185  199

animals with edge detection:
  25 grey scale      101   92   93   99  103  107  117  128
  89 grey scale      183  170  172  180  188  198  212  227
  100 grey scale     187  171  172  177  182  191  201  215

Support Vectors. The numbers of SVs (Table 2.2) of the individual recognizers for
each object make up about 5% – 15% of the whole databases. The fraction decreases
with increasing database size.
For polynomial machines of degree 1 (i.e. separating hyperplanes in input space),
the problem is not separable. In that case, all training errors show up as SVs (cf.
(2.18)), causing a fairly large number of SVs. For degrees higher than 1, the number
of SVs slightly increases with increasing polynomial degree. However, the increase is
rather moderate, compared with the increase of the dimensionality of the feature space
that we are implicitly working in (cf. Sec. 1.3). Interestingly, the number of SVs does
not change much if we add edge detection information, even though this increases the
input dimensionality by a factor of 5.

FIGURE 2.8: Angular distribution of the viewing angles of those training views which
became SVs, for a polynomial SV machine of degree 20 on the animal (left) and entry
level (right) databases (100 grey level views per object, without edge detection). The
plotted distributions for azimuth (top) and elevation (bottom) have been normalized by
the corresponding distributions in the training set (see Fig. A.1). It can be seen that
SVs tend to occur more often for top, front and back views. In this and the following
plots, views which become SVs for more than one of the 25 binary recognizers are counted
according to their frequency of occurrence. Consequently, there is no contradiction in the
overall number of SVs n exceeding the database size (2500). (The two plots report overall
SV counts of n = 4632 and n = 3385.)
As each of the training examples is associated with two viewing angles (azimuth and
elevation, cf. Appendix A), we can look at the angular distribution of SVs and errors.
It is shown in figures 2.8 – 2.10, and, in more detail, in figures B.2 – B.9 in the appendix
(there, we also give an example of a full SV set of one of the binary recognizers, in
Fig. B.1). The density of SVs is increased at high polar angles, i.e. for viewing the
objects from the top. Also, SVs tend to be found more often for frontal and back views
than for views closer to the side. Top, frontal and back views typically are harder to
classify than views from more generic points of view (Blanz, 1995). We can thus
interpret our finding as an indication that the density of SVs is related to the local
complexity of the classification surface, i.e. the local difficulty of the classification task.
Indeed, the same qualitative behaviour is found for the distribution of recognition
errors (figures 2.9 and 2.10).
There are several factors contributing to the difficulty in classifying top, frontal
and back views. First note that since most objects in our databases are bilaterally
symmetric, top, frontal and back views contain a large amount of redundancy. In
contrast, side views of symmetric objects contain a maximal amount of information.
Moreover, many objects in the databases have their main axis of elongation roughly
aligned with the direction of the zero view (azimuth = 0, elevation = 0). Consequently,
frontal and back views suffer the drawback of showing a projection of a comparably
small area.
As an aside, note that although the SVs live in a high-dimensional space, the
particular setup of the presented object recognition experiments made it possible to
discuss the relationship between the difficulty of the task, the distribution of SVs, and
the distribution of errors. This is due to the low-dimensional parametrization of the
examples, arising from the procedure of generating the examples by taking snapshots
at well-defined viewing positions.
The hope that Support Vectors are a useful means of analysing recognition tasks
will receive further support in Sec. 2.4, where we shall present results which show
that different types of SV machines, obtained using different kernel functions, lead to
largely the same Support Vectors if trained on the same task.

Comparison with Neural Networks. To evaluate the performance of SV classifiers
on this task, benchmark comparisons with other classifiers need to be carried out.
We conducted a set of experiments using perceptrons with one hidden layer, endowed
with 400 hidden neurons, and hyperbolic tangent activation functions in hidden and
output neurons. The networks were trained by back-propagation of mean squared error
(Rumelhart, Hinton, and Williams, 1986; LeCun, 1985). We used on-line (stochastic)
gradient descent, i.e. the weights were updated after each pattern; training was stopped
when the training error dropped below 0.1%, or after 600 learning epochs, whichever
occurred earlier. Neither this procedure nor the network design was carefully optimized
for the task at hand, thus the results reported in the following should be seen as baseline
comparisons solely intended to facilitate assessing the reported SV results.⁹

⁹ By observing the dependency of the test error on the number of learning epochs, we were able to
see that the networks were not overtrained. In addition, experiments with smaller numbers of hidden
units gave worse performance (larger networks were not used, for reasons of excessive training times),
hence the network capacities did not seem too large.
A full-fledged comparison between SV machines and perceptrons would take into account the following issues in order to obtain optimized network designs: instead of one fully connected hidden
layer, more sophisticated architectures use several layers with shared weights, extracting features
of increasing complexity and invariance, while still limiting the number of free parameters. Other
regularization techniques useful for improving generalization include weight decay and pruning. Similarly, early stopping can be used to deal with issues of overtraining. The training procedure can be
optimized by using different error functions and output functions (e.g. softmax). Finally, for small
databases with little redundancy it is sometimes advantageous to use batch updates with conjugate
gradient descent, or using higher order derivatives of the error function. For details, see LeCun,
Boser, Denker, Henderson, Howard, Hubbard, and Jackel (1989); Bishop (1995); Amari, Murata,
Müller, Finke, and Yang (1997).

For the following two reasons, we chose the small training set, with 25 views per
object. First, the error rates reported above for the large sets were already very low,
and differences in performance are thus more likely to be significant for the smaller
training sets. Second, training times of the neural networks were very long (in the
cases reported in the following, they were longer than for SV machines by more than
an order of magnitude).
On the 25 view-per-object training sets, we obtained error rates of 17.3% and
21.4% on the entry level and animal databases, respectively. Adding edge detection
information, the error rates dropped to 6.8% and 11.2%, respectively. Comparing
with the results in Table 2.1, we note that SV machines in almost all cases performed
better. Further performance comparisons between SV machines and other classifiers
are reported in the following section.

2.2.2 Chair Recognition Benchmark

FIGURE 2.11: Left: rendered view of a 3-D model from the chair database; right: 16 × 16
downsampled image, and four 16 × 16 downsampled edge detection patterns.

In a set of experiments using the MPI chair database (Fig. 2.11, Appendix A), different view-based recognition algorithms were compared (Blanz et al., 1996). The SV
analysis for this case is less detailed than the one given in Sec. 2.2.1; however, we
decided also to report these experiments, since they include further benchmark results
obtained with other classifiers. The first one used oriented filters to extract features
which are robust with respect to small rigid transformations of the underlying 3-D
objects, followed by a decision stage based on comparisons with stored templates (for
details, see Blanz, 1995; Vetter, 1994; Blanz et al., 1996). The second one, run as a
baseline benchmark, was a perceptron with one hidden layer, trained by error backpropagation to minimize the mean squared error (for further details, see Sec. 2.2.1).
The third system was a polynomial Support Vector machine (cf. (2.26)) with degree
d = 15 and γ = 100.¹⁰ In addition, we report results of Kressel (1996), who utilized a
fully quadratic polynomial classifier (Schürmann, 1996) trained on the first 50 principal
components of the images.

¹⁰ The latter was chosen as in Sec. 2.2.1. Note that these values differ from those used in (Blanz
et al., 1996), which in some cases leads to different results.

TABLE 2.3: Recognition test error (in %) for the MPI chair database (Appendix A) on
25 × 100 random test views from the upper viewing hemisphere, for different training
sets (viewing angles either regularly spaced, or uniformly distributed, on the upper viewing
hemisphere; views were either just 16 × 16 images, or images plus edge detection data), and
different classifiers: SV: Support Vector machine; MLP: fully connected perceptron with
one hidden layer of 400 neurons; OF: oriented filter invariant feature extraction, see text;
PC: quadratic polynomial classifier trained on the first 50 principal components (Kressel,
1996). Where marked with '–', results are not available.

                     training set                       classifier
  input              distribution    views per obj.    SV    MLP    OF    PC
  images+e.d.        regul. spaced        25           5.0   8.8   5.4    –
  images+e.d.        regul. spaced        89           1.0   1.3   4.7   1.7
  images+e.d.        random              100           1.4   2.6    –     –
  images+e.d.        random              400           0.3    –     –    0.8
  images             regul. spaced        25          13.2  25.4  26.0    –
  images             regul. spaced        89           2.0   7.2  21.0    –
  images             random              100           4.5   7.5    –     –
  images             random              400           0.6    –     –     –


In all experiments, the Support Vector machine exhibits the highest generalization
ability (Table 2.3). Considering that the images of a single object can change drastically with viewpoint (cf. Appendix A), it seems that the Support Vector machine is
best in constructing a decision surface sufficiently complex to separate the 25 classes of
chairs. This, in turn, can be related to the fact that SV machines use kernel functions
to construct hyperplanes in very high-dimensional feature spaces without overfitting.
Note, moreover, that this was achieved with an SV machine which does not utilize
prior information about the problem at hand. The oriented filter approach, in contrast, does use prior information about the process by which the images arose from
underlying 3-D objects. This knowledge was used to handcraft the robust features
used for recognition. The SV machine has to extract all information from the given
training data, making it understandable that its advantage over the oriented filter
system gets smaller for smaller training set sizes (Table 2.3). In Chapter 4, we try to
deal with this shortcoming by proposing methods to incorporate prior knowledge into
SV machines.

2.2.3 Discussion

Realistically rendered computer graphics images of objects provide a useful basis for
evaluating object recognition algorithms. This setup enabled us to study shape recognition under controlled conditions.


Real-world recognition systems, however, face additional problems. For instance,
segmentation of objects in cluttered scenes is a problem not addressed in the above
experiments. Partly, these additional problems can be outweighed by additional sources
of information. Objects with different albedo and color would facilitate segmentation
and recognition significantly.
The impact of noise, characteristic of many real-life problems, should not be too
big, at least in the case where we trained our systems on the image data only: in that
case, all the processing is done in the low spatial frequency domain.
On all three databases, high recognition accuracies were reported. The highest
accuracies were obtained using the regularly spaced 89-view-per-object training sets,
and the edge detection data.
As the number of classes was 25 in all cases, we can compare the performance
of the SV systems across tasks. It correlates with the intuitive difficulty of the tasks:
accuracies are highest for the entry level database, where the objects have the largest
differences, followed by the animal database, and by the subordinate level chair database.

2.3 Digit Recognition Using Different Kernels

Handwritten digit recognition has long served as a test bed for evaluating and benchmarking classifiers (e.g. LeCun et al., 1989; Bottou et al., 1994; LeCun et al., 1995).
Thus, it is imperative to evaluate the SV method on some widely used digit recognition task. In the present chapter, we use the US Postal Service (USPS) database for
this purpose (Appendix C). We put particular emphasis on comparing different types
of SV classifiers obtained by choosing different kernels. We report results for polynomial kernels (2.26), radial basis function kernels (2.27), and sigmoid kernels (2.28);
all of them were obtained with γ = 10 (our default choice, used wherever not stated
otherwise; cf. (2.19)).
Results for the three different kernels are summarized in Table 2.4. In all three
cases, error rates around 4% can be achieved. They should be compared with values
achieved on the same database with a five-layer neural net (LeNet1, LeCun, Boser,
Denker, Henderson, Howard, Hubbard, and Jackel, 1989), 5.0%, a neural net with one
hidden layer, 5.9%, and the human performance, 2.5% (Bromley and Säckinger, 1991).
Results of classical RBF machines, along with further reference results, are quoted in
Sec. 2.5.3.
The results show that the Support Vector algorithm allows the construction of
various learning machines, all of which perform well. The similar performance
for the three different functions k suggests that among these cases, the choice of the
set of decision functions is less important than capacity control in the chosen type of
structure.
structure. This phenomenon is well-known for the Parzen density estimator in RN ,
` 1 x ? x 
X
1
p(x) = ` !N k ! i :
(2.39)
i=1
There, it is of great importance to choose an appropriate value of the bandwidth

2.4. UNIVERSALITY OF THE SUPPORT VECTOR SET

57

TABLE 2.4: Performance on the USPS set, for three di erent types of classi ers, constructed with the Support Vector algorithm by choosing di erent functions k in (2.25) and
(2.29). Given are raw errors (i.e. no rejections allowed) on the test set. The normalization
factor c = 1:04 in the sigmoid case is chosen such that c  tanh(2) = 1. For each of
the ten-class-classi ers, we also show the average number of Support Vectors of the ten
two-class-classi ers. The normalization factors of 256 are tailored to the dimensionality of
the data, which is 16  16.

polynomial: k(x; y) = ((x  y)=256)d


d
1 2 3 4 5 6 7
raw error/% 8.9 4.7 4.0 4.2 4.5 4.5 4.7
av. # of SVs 282 237 274 321 374 422 491
RBF: k(x; y) = exp (?kx ? yk2 =(256 c))
c
4.0 2.0 1.2 0.8 0.5 0.2 0.1
raw error/% 5.3 5.0 4.9 4.3 4.4 4.4 4.5
av. # of SVs 266 240 233 235 251 366 722

re

nc

es

sigmoid: k(x; y) = 1:04 tanh(2(x  y)=256 + )


?
0.8 0.9 1.0 1.1 1.2 1.3 1.4
raw error/% 6.3 4.8 4.1 4.3 4.3 4.4 4.8
av. # of SVs 206 242 254 267 278 289 296

Re

fe

parameter ! for a given amount of data (e.g. Hardle, 1990; Bishop, 1995). Similar
parallels can be drawn to the solution of ill-posed problems (for a complete discussion,
see Vapnik, 1995b).
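For reference, the three kernels compared in Table 2.4 can be written down directly. The sketch below (NumPy only) is a transcription of the table's kernel definitions with the 256 = 16 × 16 normalization stated in the caption, not the original code; the default parameter values are taken from the best-performing columns of the reconstructed table and should be treated as assumptions.

    # The three USPS kernels of Table 2.4 (input dimensionality 16 x 16 = 256).
    import numpy as np

    def k_poly(x, y, d=3):
        return ((x @ y) / 256.0) ** d

    def k_rbf(x, y, c=0.8):
        return np.exp(-np.sum((x - y) ** 2) / (256.0 * c))

    def k_sigmoid(x, y, theta=-1.0):
        return 1.04 * np.tanh(2.0 * (x @ y) / 256.0 + theta)

    x, y = np.random.default_rng(10).normal(size=(2, 256))
    print(k_poly(x, y), k_rbf(x, y), k_sigmoid(x, y))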

2.4 Universality of the Support Vector Set¹¹

In the present section, we report empirical evidence that the SV set provides a novel
possibility for extracting a small subset of a database which contains all the information
necessary to solve a given classification task: using the Support Vector algorithm to
train three different types of handwritten digit classifiers, we observed that these types
of classifiers construct their decision surface from strongly overlapping yet small subsets
of the database.

Overlap of SV Sets. To study the Support Vector sets for three different types of
SV classifiers, we used the optimal parameters on the USPS set according to Table 2.4.

¹¹ Copyright notice: the material in this section is based on the article "Extracting support data
for a given task" by B. Scholkopf, C. Burges and V. Vapnik, which appeared in: Proceedings, First
International Conference on Knowledge Discovery & Data Mining, pp. 252 – 257, 1995. AAAI Press.

TABLE 2.5: First row: total number of different Support Vectors of three different ten-class
classifiers (i.e. number of elements of the union of the ten two-class-classifier Support
Vector sets) obtained by choosing different functions k in (2.25) and (2.29); second row:
average number of Support Vectors per two-class classifier (USPS database size: 7291).

                        Polynomial    RBF    Sigmoid
  total # of SVs           1677      1498      1611
  average # of SVs          274       235       254

TABLE 2.6: Percentage of the Support Vector set of [column] contained in the support
set of [row]; for ten-class classifiers (top) and binary recognizers for digit class 7 (bottom)
(USPS set).

  ten-class classifiers:
                 Polynomial    RBF    Sigmoid
  Polynomial        100         93       94
  RBF                83        100       87
  Sigmoid            90         93      100

  binary recognizers for digit class 7:
                 Polynomial    RBF    Sigmoid
  Polynomial        100         84       93
  RBF                89        100       92
  Sigmoid            93         86      100

TABLE 2.7: Comparison of all three Support Vector sets at a time (USPS set). For each
of the (ten-class) classifiers, "% intersection" gives the fraction of its Support Vector set
shared with both the other two classifiers. Out of a total of 1834 different Support Vectors,
1355 are shared by all three classifiers; an additional 242 are common to two of the classifiers.

                     Poly    RBF    tanh    intersection    shared by 2    union
  no. of SVs         1677   1498    1611        1355            242         1834
  % intersection       81     90      84         100              –            –

Table 2.5 shows that all three classifiers use around 250 Support Vectors per two-class
classifier (less than 4% of the training set). The total number of different Support
Vectors of the ten-class classifiers is around 1600. The reason why it is less than 2500
(ten times the above 250) is the following: a particular vector that has been used as a
positive SV (i.e. y_i = +1 in (2.25)) for digit 7 might at the same time be a negative
SV (y_i = −1) for digit 1, say.
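The overlap statistics of Tables 2.5–2.7 are easy to reproduce in spirit on a small synthetic problem: train one machine per kernel, collect the indices of their Support Vectors, and measure pairwise and three-way intersections. The sketch below uses scikit-learn; the numbers will of course differ from the USPS ones, and the kernel parameters are invented.

    # Sketch of the SV-set overlap measurements (synthetic stand-in for the USPS task).
    import numpy as np
    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=300, noise=0.25, random_state=11)

    kernels = {
        "poly":    SVC(kernel="poly", degree=3, C=10.0),
        "rbf":     SVC(kernel="rbf", gamma=1.0, C=10.0),
        "sigmoid": SVC(kernel="sigmoid", coef0=-1.0, C=10.0),
    }
    sv_sets = {name: set(clf.fit(X, y).support_) for name, clf in kernels.items()}

    for a in sv_sets:
        for b in sv_sets:
            if a != b:
                frac = len(sv_sets[a] & sv_sets[b]) / len(sv_sets[b])
                print(f"{100 * frac:5.1f}% of the {b} SV set is contained in the {a} SV set")

    shared_by_all = set.intersection(*sv_sets.values())
    print("SVs shared by all three machines:", len(shared_by_all))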
Tables 2.6 and 2.7 show that the Support Vector sets of the different classifiers have
about 90% overlap.

TABLE 2.8: SV set overlap experiments on the MNIST set (Fig. C.2), using the binary
recognizer for digit 0. Top three tables: performances (on the 60000 element test set) and
numbers of SVs for three different kernels and various parameter choices. The numbers
of SVs, which should be compared to the database size, 60000, were used to select the
parameters for the SV set comparison: to get a balanced comparison of the different
SV sets, we decided to select parameter values such that the respective SV sets have
approximately equal size (polynomial degree d = 4, radial basis function width c = 0.6,
and sigmoid threshold Θ = −1.5). Bottom: SV set comparison. For each of the binary
classifiers, "% intersection" gives the fraction of its Support Vector set shared with both the
other two classifiers. The scaling factor 784 in the kernels stems from the dimensionality
of the data; it ensures that the values of the kernels lie in similar ranges for different
polynomial degrees.

polynomial: k(x, y) = ((x · y)/784)^d
  d                    2     3     4     5     6     7
  # of test errors    163   147   135   131   127   127
  # of SVs            994  1083  1187  1292  1401  1537

RBF: k(x, y) = exp(−‖x − y‖² / (784 c))
  c                    1   0.75   0.6   0.5   0.4   0.3
  # of test errors    147   145   145   141   137   134
  # of SVs           1061  1118  1179  1264  1308  1460

sigmoid: k(x, y) = 1.04 tanh(2(x · y)/784 + Θ)
  −Θ                 1.3   1.4   1.5   1.6   1.7   1.8
  # of test errors   139   138   138   141   145   144
  # of SVs          1137  1162  1194  1211  1223  1217

                     Polyn    RBF    tanh    intersection    shared by 2    union
  no. of SVs          1187   1179    1194        1054            124         1328
  % intersection        89     89      88         100              –            –
This surprising result, first published in (Schölkopf, Burges, and
Vapnik, 1995), has meanwhile been reproduced on the MNIST character recognition set
(Table 2.8), with SV sets which amounted to just 2% of the whole database. Together
with K. Sung at MIT, we have reproduced this result also on a face detection task
(binary classification, faces vs. non-faces).
As mentioned previously, the Support Vector expansion (2.11) need not be unique.
Depending on the way the quadratic programming problem is solved, one can potentially get different expansions and therefore different Support Vector sets. It is possible
to conceive of problems where all patterns do lie on the decision boundary, yet only
a few of them are necessary at a time for expressing the decision function. In such a
case, the actual SV set extracted could strongly depend on the ordering of the training
set, especially if the quadratic programming algorithm processes the data in chunks.
In our experiments, we did use the same ordering of the training set in all three cases.
To exclude the possibility that it is this ordering that causes the reported overlaps, we
ran a control experiment where two classifiers with the same kernel were trained twice,
on the original training set, and on a permuted version of it, respectively. We found
that the two cases produced highly overlapping (to around 90%) SV sets, which means
that the training set ordering hardly has an effect on the SV sets extracted; it only
changes around 10% of the SV sets. In addition, repeating the experiments of
Table 2.6 on permuted training sets gave results consistent with this finding: Table 2.9
shows that the overlap between SV sets of different classifiers is hardly changed when
one of the training sets is permuted. We may also add that the overlap is not due to
SVs corresponding to errors on the training set (cf. (2.18), with ξ_i > 1): the considered
classifiers had very few training errors.

TABLE 2.9: Percentage of the Support Vector set of [column] contained in the support set
of [row]; for the binary recognizers for digit class 7 (USPS set). The training sets
for the classifiers in [row] and [column] were permuted with respect to each other (control
experiment for Table 2.6); still, the overlap between the SV sets persists.

                 Polynomial    RBF    Sigmoid
  Polynomial         92         82       90
  RBF                88         92       84
  Sigmoid            91         84       93
Using a leave-one-out procedure similar to Proposition 2.1.2, Vapnik and Watkins
have subsequently put forward a theoretical argument for shared SVs. We state it
in the following form: If the SV sets of three SV classifiers had no overlap, we could
obtain a fourth classifier which has zero test error.
To see why this is the case, note that if a pattern is left out of the training set,
it will always be classified correctly by voting between the three SV classifiers trained
on the remaining examples: otherwise, it would have been an SV of at least two
of them, if kept in the training set. The expectation of the number of patterns which
are SVs of at least two of the three classifiers, divided by the training set size, thus
forms an upper bound on the expected test error of the voting system.

Training on SV Sets. As described in Sec. 2.1.2, the Support Vector set contains
all the information a given classifier needs for constructing the decision function. Due
to the overlap in the Support Vector sets of different classifiers, one can even train
classifiers on the Support Vector sets of another classifier. Table 2.10 shows that this
leads to results comparable to those after training on the whole database.

TABLE 2.10: Training classifiers on the Support Vector sets of other classifiers leads
to performances on the test set which are as good as the results for training on the full
database (shown are numbers of errors on the 2007-element test set, for two-class classifiers
separating digit 7 from the rest). Additionally, the results for training on a random subset
of the database of size 200 are displayed.

  trained on:    poly-SVs   rbf-SVs   tanh-SVs   full db   rnd. subs.
  size:             178       189        177       7291       200
  kernel
  Poly               13        13         12         13         23
  RBF                17        13         17         15         27
  tanh               15        13         13         15         25
In Sec. 4.2.1, we will use this finding as a motivation for a method to make SV machines
transformation invariant.
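The experiment of Table 2.10 can be sketched as follows: train a first machine, keep only its Support Vectors, and train a second machine with a different kernel on that reduced set, comparing test errors. The code below uses scikit-learn on synthetic data; the point is the procedure, not the particular numbers.

    # Sketch of "training on SV sets": reduced training set = SVs of another classifier.
    import numpy as np
    from sklearn.datasets import make_moons
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=600, noise=0.25, random_state=12)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=12)

    donor = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X_tr, y_tr)
    sv_idx = donor.support_                    # indices of the donor's Support Vectors

    full  = SVC(kernel="poly", degree=3, C=10.0).fit(X_tr, y_tr)
    on_sv = SVC(kernel="poly", degree=3, C=10.0).fit(X_tr[sv_idx], y_tr[sv_idx])

    print("polynomial machine, full training set :", 1 - full.score(X_te, y_te))
    print("polynomial machine, rbf-SV subset only:", 1 - on_sv.score(X_te, y_te))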

Discussion. Learning can be viewed as inferring regularities from a set of training
examples. Much research has been devoted to the study of various learning algorithms
which allow the extraction of these underlying regularities. No matter how different the
outward appearance of these algorithms is, they all must rely on intrinsic regularities
of the data. If the learning has been successful, these intrinsic regularities are captured
in the values of some parameters of a learning machine; for a polynomial classifier,
these parameters are the coefficients of a polynomial, for a neural net they are weights
and biases, and for a radial basis function classifier they are weights and centers. This
variety of different representations of the intrinsic regularities, however, conceals the
fact that they all stem from a common root.
The Support Vector algorithm enables us to view these algorithms in a unified
theoretical framework. The presented empirical results show that different types of
SV classifiers construct their decision functions from highly overlapping subsets of the
training set, and thus extract a very similar structure from the observations, which
can in this sense be viewed as a characteristic of the data: the set of Support Vectors.
This finding may lead to methods for compressing databases significantly by disposing
of the data which is not important for the solution of a given task (cf. also Guyon,
Matic, and Vapnik, 1996).
In the next section, we will take a closer look at one of the types of learning
machines implementable by the SV algorithm.

2.5 Comparison to Classical RBF Networks


By using Gaussian kernels (2.27), the SV algorithm can construct learning machines
with a Radial Basis Function (RBF) architecture. In contrast to classical approaches
for training RBF networks, the SV algorithm automatically determines centers, weights

62

CHAPTER 2. SUPPORT VECTOR MACHINES

and threshold that minimize an upper bound on the expected test error. The present
section is devoted to an experimental comparison of these machines with a classical
approach, where the centers are determined by k-means clustering and the weights are
computed using error backpropagation. We consider three machines, namely a classical
RBF machine, an SV machine with Gaussian kernel, and a hybrid system with the centers determined by the SV method and the weights trained by error backpropagation.
Our results show that on the US postal service database of handwritten digits, the SV
machine achieves the highest recognition accuracy, followed by the hybrid system.
Copyright notice: the material in this section is based on the article \Comparing
support vector machines with Gaussian kernels to radial basis function classi ers" by
B. Scholkopf, K. Sung, C. Burges, F. Girosi, P. Niyogi, T. Poggio and V. Vapnik, which
appeared in IEEE Transactions on Signal Processing; 45(11): 2758 { 2765, November
1997. IEEE.

2.5.1 Different Ways of Training RBF Classifiers

Consider Fig. 2.12. Suppose we want to construct a radial basis function classifier

    f(x) = \mathrm{sgn}\Big( \sum_{i=1}^{\ell} w_i \exp\big( -\|x - x_i\|^2 / c_i \big) + b \Big)        (2.40)

(b and c_i being constants, the latter positive) separating balls from circles, i.e. taking
different values on balls and circles. How do we choose the centers x_i? Two extreme
cases are conceivable:
The first approach consists of choosing the centers for the two classes separately,
irrespective of the classification task to be solved. The classical way of finding the
centers by some clustering technique (before tackling the classification problem) is such
an approach. The weights w_i are then usually found by either error backpropagation
(Rumelhart, Hinton, and Williams, 1986) or the pseudo-inverse method (Poggio and
Girosi, 1990).
An alternative approach (Fig. 2.13) consists of choosing as centers points which are
critical for the classification task at hand. The Support Vector algorithm implements
the latter idea. By simply choosing a suitable kernel function (2.27), it allows the
construction of radial basis function classifiers. The algorithm automatically computes
the number and location of the above centers, the weights w_i, and the threshold b.
By the kernel function, the patterns are mapped nonlinearly into a high-dimensional
space. There, an optimal separating hyperplane is constructed, expressed in terms of
those examples which are closest to the decision boundary. These are the Support
Vectors which correspond to the centers in input space.
The goal of the present section is to compare real-world results obtained with
k-means clustering and classical RBF training to those obtained when the centers,
weights and threshold are automatically chosen by the Support Vector algorithm. To
this end, we decided to undertake a performance study by combining expertise on the

Support Vector algorithm (AT&T Bell Laboratories) and on the classical radial basis
function networks (Massachusetts Institute of Technology).

FIGURE 2.12: A simple 2-dimensional classification problem: find a decision function
separating balls from circles. The box, as in all following pictures, depicts the region
[−1, 1]².

Three different RBF systems took part in the performance comparison:
- SV system. A standard SV machine with Gaussian kernel function was constructed (cf. (2.27)).

- Classical RBF system. The MIT side of the performance comparison constructed networks of the form

      g(x) = \mathrm{sgn}\Big( \sum_{i=1}^{K} w_i G_i(x) + b \Big)
           = \mathrm{sgn}\Big( \sum_{i=1}^{K} \frac{w_i}{(2\pi)^{N/2} \sigma_i^N} \exp\Big( -\frac{\|x - c_i\|^2}{2 \sigma_i^2} \Big) + b \Big),

  with the number of centers K identical to the one automatically found by the SV
  algorithm. The centers c_i were computed by k-means clustering (e.g. Duda and
  Hart, 1973), and the weights w_i are trained by on-line mean squared error back
  propagation.
  The training procedure constructs ten binary recognizers for the digit classes,
  with RBF hidden units and logistic outputs, trained to produce the target values

  1 and 0 for positive and negative examples, respectively. The networks were
  trained without weight decay; however, a bootstrap procedure was used to limit
  their complexity. The final RBF network for each class contains every Gaussian
  kernel from its target class, but only several kernels from the other 9 classes,
  selected such that no false positive mistakes are made. For further details, see
  (Sung, 1996; Moody and Darken, 1989).

FIGURE 2.13: RBF centers automatically computed by the Support Vector algorithm
(indicated by extra circles), using c_i = 1 for all i (cf. (2.27), (2.40)). The number of SV
centers accidentally coincides with the number of identifiable clusters (indicated by crosses
found by k-means clustering with k = 2 and k = 3 for balls and circles, respectively) but
the naive correspondence between clusters and centers is lost; indeed, 3 of the SV centers
are circles, and only 2 of them are balls. Note that the SV centers are chosen with respect
to the classification task to be solved.

• Hybrid system. To assess the relative influence of the automatic SV center choice and the SV weight optimization, respectively, another RBF system was built, constructed with centers that are simply the Support Vectors arising from the SV optimization, and with the weights trained separately using mean squared error backpropagation.

Computational Complexity. By construction, the resulting classifiers after training will have the same architecture and comparable sizes. Thus the three machines are comparable in classification speed and memory requirements.
Differences were, however, noticeable in training. Regarding training time, the SV


FIGURE 2.14: Two-class classification problem solved by the Support Vector algorithm (c_i = 1 for all i; cf. Eq. 2.40).


machine was faster than the RBF system by about an order of magnitude.¹² The optimization, however, requires working with potentially large matrices. In the implementation that we used, the training data is processed in chunks, and matrix sizes were of the order 500 × 500. For problems with very large numbers of SVs, a modified training algorithm has recently been proposed by Osuna, Freund, and Girosi (1997).


Error Functions. Due to the constraints (2.18) and the target function (2.19), the SV algorithm puts emphasis on correctly separating the training data. In this respect, it is different from the classical RBF approach of training in the least-squares metric, which is more concerned with the general problem of estimating posterior probabilities than with directly solving a classification task at hand. There exist, however, studies investigating the question of how to select RBF centers or exemplars to minimize the number of misclassifications, see for instance (Chang and Lippmann, 1993; Duda and Hart, 1973; Reilly, Cooper, and Elbaum, 1982; Barron, 1984). A classical RBF system could also be made more discriminant by using moving centers (e.g. Poggio and Girosi, 1990), or a different cost function, such as the classification figure of merit (Hampshire and Waibel, 1990). In fact, it can be shown that Gaussian RBF regularization networks are equivalent to SV machines if the regularization operator and the cost function are chosen appropriately (Smola and Scholkopf, 1997b).
¹² For noisy regression problems, on the other hand, Support Vector machines can be slower (Muller et al., 1997).


FIGURE 2.15: A simple two-class classification problem as solved by the SV algorithm (c_i = 1 for all i; cf. Eq. 2.40). Note that the RBF centers (indicated by extra circles) are closest to the decision boundary. Interestingly, the decision boundary is a straight line, even though a nonlinear Gaussian RBF kernel was used. This is due to the fact that only two SVs are required to solve the problem. The translational and unitary invariance of the RBF kernel then renders the situation completely symmetric.


It is important to stress that the SV machine does not minimize the empirical risk (misclassification error on the training set) alone. Instead it minimizes the sum of an upper bound on the empirical risk and a penalty term that depends on the complexity of the classifier used.

2.5.2 Toy Examples: What are the Support Vectors?

Support Vectors are elements of the data set that are "important" in separating the two classes from each other. In general, the SVs with zero slack variables (2.17) lie on the boundary of the decision surface, as they precisely satisfy the inequality (2.18) in the high-dimensional space. Figures 2.15 and 2.14 illustrate that for the used Gaussian kernel, this is also the case in input space. This raises an interesting question from the point of view of interpreting the structure of trained RBF networks. The traditional view of RBF networks has been one where the centers were regarded as "templates" or stereotypical patterns. It is this point of view that leads to the clustering heuristic for training RBF networks. In contrast, the SV machine posits an alternate point of view, with the centers being those examples which are critical for a given classification task.
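The contrast between the two views can be made concrete with a small sketch. Assuming scikit-learn is available, the fragment below trains a Gaussian-kernel SV machine on a synthetic two-class data set in [-1, 1]² (the data, the kernel width and the cluster counts are illustrative stand-ins, not the data of Figs. 2.12-2.15) and reads off the two candidate sets of RBF centers: the Support Vectors on the one hand, class-wise k-means centers on the other.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(40, 2))                    # toy points in [-1, 1]^2
y = np.where(X[:, 0] + 0.3 * np.sin(3.0 * X[:, 1]) > 0, 1, -1)

# Gaussian kernel k(x, y) = exp(-gamma ||x - y||^2); gamma = 1.0 mimics c_i = 1 in (2.27)
svm = SVC(kernel="rbf", gamma=1.0).fit(X, y)
sv_centers = svm.support_vectors_                           # centers chosen by the SV algorithm

# classical alternative: cluster each class separately (k = 2 and k = 3, as in Fig. 2.13)
km_pos = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X[y == 1]).cluster_centers_
km_neg = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X[y == -1]).cluster_centers_
```

Plotting sv_centers against the cluster centers reproduces the qualitative picture of Fig. 2.13: the SV centers concentrate near the decision boundary rather than at the class "prototypes".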


TABLE 2.11: Numbers of centers (Support Vectors) automatically extracted by the Support Vector machine. The first row gives the total number for each binary classifier, including both positive and negative examples; in the second row, we only counted the positive SVs. The latter number was used in the initialization of the k-means algorithm, cf. Sec. 2.5.

digit class      0    1    2    3    4    5    6    7    8    9
# of SVs       274  104  377  361  334  388  236  235  342  263
# of pos. SVs  172   77  217  179  211  231  147  133  194  166
TABLE 2.12: Two-class classification: numbers of test errors (out of 2007 test patterns) for the three systems described in Sec. 2.5.

digit class            0   1   2   3   4   5   6   7   8   9
classical RBF         20  16  43  38  46  31  15  18  37  26
RBF with SV centers    9  12  27  24  32  24  19  16  26  16
full SV machine       16   8  25  19  29  23  14  12  25  16

2.5.3 Handwritten Digit Recognition
We used the USPS database of handwritten digits (Appendix C). The SV machine results reported in the following were obtained with our default choice γ = 10 (cf. (2.19), Sec. 2.3), and c = 0.3 · N (cf. (2.27)), where N = 256 is the dimensionality of input space.¹³
Two-class classification. Table 2.11 shows the numbers of Support Vectors, i.e. RBF centers, extracted by the SV algorithm. Table 2.12 gives the results of binary classifiers separating single digits from the rest, for the systems described in Sec. 2.5.
Ten-class classification. For each test pattern, the arbitration procedure in all three systems simply returns the digit class whose recognizer gives the strongest response (cf. (2.38)). Table 2.13 shows the 10-class digit recognition error rates for our original system and the two RBF-based systems.
The fully automatic SV machine exhibits the highest test accuracy of the three systems.¹⁴ Using the SV algorithm to choose the centers for the RBF network is also better than the baseline procedure of choosing the centers by a clustering heuristic as described above. It can be seen that in contrast to the k-means cluster centers, the centers chosen by the SV algorithm allow zero training error rates.
¹³ The SV machine is rather insensitive to different choices of c. For all values in 0.1, 0.2, …, 1.0, the performance is about the same (in the area of 4% – 4.5%).
¹⁴ An analysis of the errors showed that about 85% of the errors committed by the SV machine were also made by the other systems. This makes the differences in error rates very reliable.


TABLE 2.13: 10-class digit recognition error rates for three RBF classifiers constructed with different algorithms. The first system is a classical one, choosing its centers by k-means clustering. In the second system, the Support Vectors were used as centers, and in the third one, the entire network was trained using the Support Vector algorithm.

Classification Error Rate
USPS DB     classical RBF   RBF with SV centers   full SV machine
Training         1.7%              0.0%                 0.0%
Test             6.7%              4.9%                 4.2%

The considered recognition task is known to be rather hard: the human error rate is 2.5% (Bromley and Sackinger, 1991), almost matched by a memory-based Tangent-distance classifier (2.6%, Simard, LeCun, and Denker, 1993). Other results on this database include a Euclidean distance nearest neighbour classifier (5.9%, Simard, LeCun, and Denker, 1993), a perceptron with one hidden layer (5.9%), and a convolutional neural network (5.0%, LeCun et al., 1989). By incorporating translational and rotational invariance using the Virtual SV technique (see below, Sec. 4.2.1), we were able to improve the performance of the considered Gaussian kernel SV machine (same values of γ and c) from 4.2% to 3.2% error.

2.5.4 Summary and Discussion


The Support Vector algorithm provides a principled way of choosing the number and the locations of RBF centers. Our experiments on a real-world pattern recognition problem have shown that in contrast to a corresponding number of centers chosen by k-means, the centers chosen by the Support Vector algorithm allowed a training error of zero, even if the weights were trained by classical RBF methods. The interpretation of this finding is that the Support Vector centers are specifically chosen for the classification task at hand, whereas k-means does not care about picking those centers which will make a problem separable.
In addition, the SV centers yielded lower test error rates than k-means. It is interesting to note that using SV centers, while sticking to the classical procedure for training the weights, improved training and test error rates by approximately the same amount (2%). In view of the guaranteed risk bound (1.5), this can be understood in the following way: the improvement in test error (risk) was solely due to the lower value of the training error (empirical risk); the confidence term (the second term on the right hand side of (1.5)), depending on the VC-dimension and thus on the norm of the weight vector, did not change, as we stuck to the classical weight training procedure. However, when we also trained the weights with the Support Vector algorithm, we minimized the norm of the weight vector (see (2.19), (2.4)) in feature space, and thus the confidence term, while still keeping the training error zero. Thus, consistent with (1.5), the Support Vector machine achieved the highest test accuracy of the three systems.¹⁵


2.6 Model Selection


2.6.1 Choosing Polynomial Degrees¹⁶


In the case where the available amount of training data is limited, it is important to have a means for achieving the best possible generalization by controlling characteristics of the learning machine, without having to put aside parts of the training set for validation purposes. One of the strengths of SV machines consists in the automatic capacity tuning, which was related to the fact that the minimization of (2.19) is connected to structural risk minimization, based on the bound (1.5). This capacity tuning takes place within a set of functions specified a priori by the choice of a kernel function. In the following, we go one step further and use the bound (1.5) also to predict the kernel degree which yields the best generalization for polynomial classifiers (Scholkopf, Burges, and Vapnik, 1995).
Since for SV machines we have an upper bound on the VC-dimension (Proposition 2.1.1), we can use (1.5) to get an upper bound on the expected error on an independent test set in terms of the training error and the value of ||w|| (or, equivalently, the margin 2/||w||). This bound can then be used to try to choose parameters of the learning machines such that the test error becomes minimal, without actually looking at the test set.
We consider polynomial classifiers with the kernel (2.26), varying their degree d, and make the assumption that the bound (2.4) gives a reliable indication of the actual VC-dimension, i.e. that the VC-dimension can be estimated by
h ≈ c1 h_est := R²||w||²    (2.41)

with some c1 < 1 which is independent of the polynomial degree.


For the USPS digit recognition problem, training errors are very small. In that
case, the right hand side of the bound (1.5) is dominated by the con dence term, which
is minimized when the VC-dimension is minimized. For the latter, we use (2.41), with
¹⁵ Two remarks on the interpretation of our findings are in order. The first result, comparing the error rates of the classical and the hybrid system, does not necessarily rule out the possibility of reducing the training error also for k-means centers by using different cost functions or codings of the output units. It should be considered as a statement comparing two sets of centers, using the same weight training algorithm to build RBF networks from them. Along similar lines, the second result, indicating the superior performance of the full SV RBF system, refers to the systems as described in this study. It does not rule out the possibility of improving classical RBF systems by suitable methods of complexity control. Indeed, the results for the SV RBF system do show that using the same architecture, but different weight training procedures, can significantly improve results.
¹⁶ Copyright notice: the material in this section is based on the article "Extracting support data for a given task" by B. Scholkopf, C. Burges and V. Vapnik, which appeared in: Proceedings, First International Conference on Knowledge Discovery & Data Mining, pp. 252 – 257, 1995. AAAI Press.



FIGURE 2.16: Average VC-dimension (solid) and total number of test errors of the ten two-class-classifiers (dotted) for polynomial degrees 2 through 7 (for degree 1, R_emp is comparably big, so the VC-dimension alone is not sufficient for predicting R, cf. (1.5)). The baseline on the error scale, 174, corresponds to the total number of test errors of the ten best binary classifiers out of the degrees 2 through 7. The graph shows that the VC-dimension allows us to predict that degree 4 yields the best overall performance of the two-class-classifiers on the test set. This is not necessarily the case for the performances of the ten-class classifiers, which are built from the two-class-classifier outputs before applying the sgn functions (cf. Sec. 2.1.6).

||w|| determined by the Support Vector algorithm (note that ||w|| is computed in feature space, using the kernel). Thus, in order to compute h_est, we need to compute R, the radius of the smallest sphere enclosing the training data in feature space. This can be reduced to a quadratic programming problem similar to the one used in constructing the optimal hyperplane:¹⁷
We formulate the problem as follows:

Minimize R² subject to ||z_i − z*||² ≤ R² for all i,    (2.42)

where z* is the (to be determined) center of the sphere. We use the Lagrangian

R² − Σ_i λ_i (R² − (z_i − z*)²),    (2.43)

and compute the derivatives with respect to z* and R to get

z* = Σ_i λ_i z_i    (2.44)

and the Wolfe dual problem: maximize

Σ_i λ_i (z_i · z_i) − Σ_{i,j} λ_i λ_j (z_i · z_j)    (2.45)

¹⁷ The following derivation is due to Chris Burges.


subject to

Σ_i λ_i = 1,  λ_i ≥ 0,    (2.46)

where the λ_i are Lagrange multipliers. As in the Support Vector algorithm, this problem has the property that the z_i appear only in dot products, so as before one can compute the dot products in feature space, replacing (z_i · z_j) by k(x_i, x_j) (where the x_i live in input space, and the z_i in feature space).
In this way, we compute the radius of the minimal enclosing sphere for all the USPS training data for polynomial classifiers of degrees 1 through 7. For the same degrees, we then train a binary polynomial classifier for each digit. Using the obtained values for h_est, we can predict, for each digit, which degree polynomial will give the best generalization performance. Clearly, this procedure is contingent upon the validity of the assumption that c1 is approximately the same for all degrees. We can then compare this prediction with the actual polynomial degree which gives the best performance on the test set. The results are shown in Table 2.14; cf. also Fig. 2.16.


TABLE 2.14: Performance of the classifiers with degree predicted by the VC-bound. Each row describes one two-class-classifier separating one digit (stated in the first column) from the rest. The remaining columns contain: deg: the degree of the best polynomial as predicted by the described procedure, param.: the dimensionality of the high dimensional space, which is also the VC-dimension for the set of all separating hyperplanes in that space, h_est: the VC-dimension estimate for the actual classifiers, which is much smaller than the number of free parameters of linear classifiers in that space, 1 – 7: the numbers of errors on the test set for polynomial classifiers of degrees 1 through 7. The table shows that the described procedure chooses polynomial degrees which are optimal or close to optimal.

        chosen classifier            errors on the test set for degrees 1 – 7
digit   deg   param.     h_est      1    2    3    4    5    6    7
0        3    2.8·10⁶      547     36   14   11   11   11   12   17
1        7    1.5·10¹³      95     17   15   14   11   10    9   10
2        3    2.8·10⁶      832     53   32   28   26   28   27   32
3        3    2.8·10⁶     1130     57   25   22   22   22   22   23
4        4    1.8·10⁸      977     50   32   32   30   30   29   33
5        3    2.8·10⁶     1117     37   20   22   24   24   26   28
6        4    1.8·10⁸      615     23   12   12   15   17   17   19
7        5    9.5·10⁹      526     25   15   12   10   11   13   14
8        4    1.8·10⁸     1466     71   33   28   24   28   32   34
9        5    9.5·10⁹     1210     51   18   15   11   11   12   15


The above method for predicting the optimal classifier functions gives good results. In four cases, the theory predicted the correct degree; in the other cases, the predicted degree gave performance close to the best possible one.

2.6.2 The Choice of the Regularization Parameter

In addition to kernel parameters such as the polynomial degree, there is another parameter whose value needs to be set for SV training: the regularization constant γ, determining the trade-off between minimizing the training error and controlling complexity (cf. (2.19)). The optimal value of γ should depend both on characteristics of the problem at hand and on the sample size. Although our experience suggests that for problems with little noise, the results are reasonably insensitive with respect to changes of γ, it would still be desirable to have a principled method for choosing γ. The remainder of this section is an attempt at developing such a method.
As in the last section, the starting point is the risk bound (1.5). The idea is to adjust γ such that minimization of the SV objective function (2.19) amounts to minimizing (1.5). As the solution w depends on the value of γ chosen (in (2.19)), we cannot use (1.5) and (2.4) to determine the value of γ a priori. Instead, we will resort to an iterative strategy.
Following (2.4), we write

h = c1 R²||w||²    (2.47)

with some c1 < 1. Substituting this into the bound (1.5), (1.6), we obtain

√[ ( c1 R²||w||² ( log(2ℓ/(c1R²||w||²)) + 1 ) − log(δ/4) ) / ℓ ] + R_emp(γ)    (2.48)

as an upper bound on the risk that we want to minimize.
Remembering that Σ_i ξ_i is an upper bound on the number of training errors, we additionally write

Σ_i ξ_i = c2 ℓ R_emp(w),    (2.49)

with some c2 ≥ 1, where ℓ is the number of training examples. Minimizing the objective function (2.19) then amounts to minimizing

(1/(2 γ c2 ℓ)) ||w||² + R_emp(w).    (2.50)
Identifying w with the function index in (2.48), we now have a formulation where the second terms of (2.50) and (2.48) are identical. The first terms cannot coincide in general: unlike the first term of (2.48), ||w||²/(2 γ c2 ℓ) is proportional to ||w||². However, a necessary condition to ensure that the minimum of the function that we


are minimizing, (2.50), is close to that of the one that we would like to minimize, (2.48), is that the gradients of the first terms with respect to w coincide at the minimum. Hence, we have to choose γ such that

(1/(γ c2 ℓ)) w = [ c1 R² ( log(2ℓ/(c1R²||w||²)) − 1 ) / √( ℓ ( c1R²||w||² log(2ℓ/(c1R²||w||²)) − log(δ/4) ) ) ] w.    (2.51)

For w ≠ 0, we thus obtain

γ = √( ℓ ( c1R²||w||² log(2ℓ/(c1R²||w||²)) − log(δ/4) ) ) / ( c1 c2 ℓ R² ( log(2ℓ/(c1R²||w||²)) − 1 ) ).    (2.52)


Eq. (2.52) establishes a relationship between γ and w at the point of the solution. If we start with a non-optimal value of γ, however, we will obtain a non-optimal w and thus (2.52) will not tell us exactly how to adjust γ. For instance, suppose we start with a γ which is too big. Then too much weight will be put on minimizing the empirical risk (cf. (2.19)), and the margin will become too small, i.e. w will become too big. We will resort to the following method: we use (2.52) to determine a new value γ′, and iterate the procedure.
The value of ||w||² is obtained by solving the SV quadratic programming problem (in feature space, we have ||w||² = Σ_{i,j} y_i y_j α_i α_j k(x_i, x_j)); R² is computed as in Sec. 2.6.1, and c2 is obtained from Σ_i ξ_i and the training error using (2.49). The values of c1 and δ must be chosen by the user. The constant c1 characterizes the tightness of the VC-dimension bound (cf. (2.47)), and 1 − δ is a lower bound on the probability with which the risk bound (2.48) holds. As long as δ is not too close to 0, it hardly affects our procedure. The value of c1 is more difficult to choose correctly; however, reasonable results can already be obtained with the default choice c1 = 1, as we shall see below.
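As an illustration of this recipe, the sketch below iterates the update rule (2.52). It reuses the radius_squared helper from the previous sketch, treats γ as the C parameter of scikit-learn's SVC (which plays the corresponding role in the soft-margin objective), and sets c2 = 1 whenever the training error is zero, as in Table 2.15; all of these identifications are assumptions of the sketch, not part of the original implementation.

```python
import numpy as np
from sklearn.svm import SVC

def next_gamma(w_sq, R2, c2, ell, c1=1.0, delta=0.2):
    """One application of the update rule (2.52); assumes log(2*ell/(c1*R2*w_sq)) > 1."""
    log_term = np.log(2.0 * ell / (c1 * R2 * w_sq))
    num = np.sqrt(ell * (c1 * R2 * w_sq * log_term - np.log(delta / 4.0)))
    return num / (c1 * c2 * ell * R2 * (log_term - 1.0))

def iterate_gamma(K, y, gamma=1e10, steps=5, c1=1.0, delta=0.2):
    """Iterative choice of the regularization constant on a precomputed kernel matrix K."""
    ell = len(y)
    R2 = radius_squared(K)                                   # smallest enclosing sphere, cf. Sec. 2.6.1
    for _ in range(steps):
        svc = SVC(kernel="precomputed", C=gamma).fit(K, y)
        coef = np.zeros(ell)
        coef[svc.support_] = svc.dual_coef_.ravel()          # y_i * alpha_i
        w_sq = coef @ K @ coef                               # ||w||^2 in feature space
        f = svc.decision_function(K)                         # decision values on the training set
        xi = np.maximum(0.0, 1.0 - y * f)                    # slack variables
        r_emp = np.mean(y * f <= 0)                          # training error rate
        c2 = xi.sum() / (ell * r_emp) if r_emp > 0 else 1.0  # (2.49); c2 set to 1 if R_emp = 0
        gamma = next_gamma(w_sq, R2, c2, ell, c1, delta)
    return gamma
```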
Statements on the convergence of this procedure are hard to obtain: to compute the mapping from γ to γ′, we have to train an SV machine and then evaluate (2.52); thus one cannot compute its derivative in closed form. In an empirical study to be described presently, the procedure exhibited well-behaved convergence behaviour. In the experiment, we used the small MNIST database (Appendix C). We found that the iteration converged no matter whether we started with a very large or with a tiny γ. In the following, we report results obtained when starting with a huge value of γ, which effectively leads to the construction of an optimal margin classifier (i.e. with zero training error; cf. Sec. 2.1.2: for γ = ∞, (2.22) reduces to (2.14)).
Table 2.15 shows partly encouraging results. For all 10 binary digit recognizers, the iteration converges very fast (about two steps were required to reach a value very close to the final one). In seven cases, the number of test errors decreased; in only two cases did it increase. By combining the resulting binary classifiers, we obtained a 10-class classifier with an error rate (on the 10000 element small MNIST test set) of 3.9%, slightly better than the error rate obtained both with the starting value used in


the iteration, γ = 10¹⁰, and with our default choice γ = 10: in these cases, we obtained 4.0% error (cf. below, in Table 4.6).
Clearly, further experiments are necessary to validate or improve this method. In particular, it would be interesting to study a noisy classification problem, where the choice of γ should potentially have a greater impact on the quality of the solution.
We conclude this section with a note on the relationship of the model selection methods described in Sections 2.6.1 and 2.6.2. Both of the proposed methods are based on the bound (1.5). In principle, we could also apply the method of Sec. 2.6.1 for choosing γ. In that case, we would try out a series of values of γ, and pick the one which minimizes (1.5). The advantage of the present method, however, is that it does not require scanning a whole range of γ values. Instead, γ is chosen such that, with the help of a few iterations, the SV optimization automatically minimizes (1.5) over γ, in addition to the built-in minimization over the weights of the SV machine (cf. the remarks at the beginning of Sec. 2.6.1).¹⁸

TABLE 2.15: Iterative choice of the regularization constant γ (cf. Sec. 2.6.2) for all ten digit recognizers on the small MNIST database. Each block shows SV machine parameters and results for the starting point (γ = 10¹⁰) and for five subsequent iteration steps. In all cases, we used c1 = 1, δ = 0.2 (corresponding to a risk bound holding with probability of at least 0.8), and a polynomial classifier of degree 5. The constant c2 is undefined before the first run of the algorithm. After each run, it is computed using (2.49); if R_emp is 0, we set c2 = 1. For γ = 10¹⁰, we are effectively computing an optimal separating hyperplane, with zero training errors. The iteration converges very fast; moreover, in seven of the ten cases, it reduced the number of test errors (in two cases, the opposite happened).

digit 0
γ           c2     train. errors   test errors   # of SVs   ||w||
10¹⁰         –          0              38           177     36.156
0.723763     1          3              32           187     29.460
0.052130   10.0         3              32           194     29.288
0.050580   10.3         3              31           194     29.312
0.047947   10.8         3              30           192     29.416
0.051618   10.1         3              31           188     29.316

digit 1
10¹⁰         –          0              33           141     48.998
1.248717     1          3              30           153     34.286
0.047532   14.0         3              31           160     34.063
0.045677   14.4         3              30           161     33.988
0.042151   15.5         3              29           157     34.054
0.046176   14.2         3              31           154     34.038

digit 2
10¹⁰         –          0             104           340     58.816
1.853855     1          4              88           355     49.055
0.100126   12.5         4              87           354     48.810
0.094182   13.2         4              87           352     48.757
0.092732   13.3         4              87           352     48.833
0.095662   13.0         4              88           351     48.906

digit 3
10¹⁰         –          0              96           377     70.455
3.047037     1          5              93           392     57.387
0.139778   12.5         5              93           397     56.860
0.122975   13.9         5              94           387     57.143
0.142765   12.1         5              96           385     56.868
0.123462   13.9         5              93           383     57.272

digit 4
10¹⁰         –          0              74           282     66.786
2.586497     1          5              79           312     52.781
0.134502   10.8         6              77           313     52.409
0.150066    9.6         6              77           312     52.374
0.152267    9.4         6              77           313     52.324
0.147880    9.7         6              77           311     52.338

digit 5
10¹⁰         –          0              87           339     65.051
2.400509     1          4             101           358     53.578
0.116545   12.9         5              99           353     53.182
0.126663   11.7         5              99           362     53.263
0.129597   11.5         5              98           358     53.250
0.126985   11.7         5             101           363     53.272

digit 6
10¹⁰         –          0              80           231     46.857
1.144632     1          2              80           260     37.923
0.037990   20.6         2              80           264     37.623
0.037135   20.8         2              78           256     37.773
0.039888   19.5         2              78           253     37.834
0.041169   19.0         2              79           258     37.771

digit 7
10¹⁰         –          0             122           253     69.716
2.945887     1          9             109           272     48.730
0.187035    6.6        10             109           271     48.263
0.196960    6.2        10             108           278     48.104
0.189135    6.4        10             108           270     48.241
0.197877    6.1        10             108           272     48.258

digit 8
10¹⁰         –          0             127           440     77.167
4.246777     1          3             126           473     63.982
0.102221   22.4         5             126           463     63.590
0.156304   14.4         5             122           464     63.703
0.165320   13.7         5             126           457     63.605
0.166110   13.6         5             127           462     63.492

digit 9
10¹⁰         –          0             146           412     94.997
21.34817     1          2             146           446     74.464
0.103062   35.8        11             140           437     71.822
0.417387    7.8        10             137           434     71.893
0.383747    8.5        11             141           435     71.857
0.439163    7.4        11             139           440     71.900

¹⁸ We performed another set of experiments to find out whether the leave-one-out generalization bound (Proposition 2.1.2) could be used for selecting γ. On the small MNIST database, the results were negative, leading to values of γ which were too large (in the range of 10⁵ to 10¹⁰).

2.7 Why Do SV Machines Work Well?


The presented experimental results show that Support Vector machines obtain high accuracies which are competitive with state-of-the-art techniques. This was true for several visual recognition tasks. Care should be exercised, however, when generalizing this statement to other types of pattern recognition tasks. There, empirical studies have yet to be carried out, in particular since the tasks that we considered were all characterized by relatively low overlap of the different classes (for instance, in the USPS task, the human error rate is around 2.5%). In any case, the results obtained here are encouraging, in particular when taking into account that the SV algorithm was developed only recently. Below, we summarize different aspects providing insight into why SV machines generalize well:

Capacity Control. The kernel method allows us to reduce a large class of learning machines to separating hyperplanes in some space. For those, an upper bound on the VC-dimension can be given (Proposition 2.1.1). As argued in Sec. 2.1.3, minimizing the SV objective function (2.19) corresponds to trying to separate the data with a classifier of low VC-dimension, thereby approximately performing structural risk minimization. The problem of constructing the decision function requires minimizing a positive quadratic form subject to box constraints and can thus be solved efficiently.
As we saw, low VC-dimension is related to a large separation margin. Thus, analyses of the generalization performance in terms of separation margins and fat shattering dimension also bear relevance to SV machines (e.g. Schapire, Freund, Bartlett, and Lee, 1997; Shawe-Taylor, Bartlett, Williamson, and Anthony, 1996; Bartlett, 1997).

Compression. The leave-one-out bound (Proposition 2.1.2) relates SV generalization ability to the fact that the decision function is expressed in terms of a (possibly small) subset of the data. This can be viewed in the context of Algorithmic Complexity and Minimum Description Length (Vapnik, 1995b, Chapter 5, footnote 6).

Regularization. In (Smola and Scholkopf, 1997b), a regularization framework is proposed which contains the SV algorithm as a special case. For kernel-based function expansions, it is shown that given a regularization operator P (Tikhonov and Arsenin, 1977) mapping the functions of the learning machine into some dot product space D, the problem of minimizing the regularized risk

R_reg[f] = R_emp[f] + λ ||Pf||²    (2.53)

(with a regularization parameter λ ≥ 0) can be written as a constrained optimization problem. For particular choices of the cost function, it further reduces to a SV type quadratic programming problem. The latter thus is not specific to SV machines, but is common to a much wider class of approaches. What gets lost in this case, however, is the fact that the solution can usually be expressed in terms of a small number of SVs. This specific feature of SV machines is due to the fact that the type of regularization and the class of functions which are considered as admissible solutions are intimately related (cf. Poggio and Girosi, 1990; Girosi, Jones, and Poggio, 1993; Smola and Scholkopf, 1997a; Smola, Scholkopf, and Muller, 1997): the SV algorithm is equivalent to minimizing the regularized risk on the set of functions

f(x) = Σ_i α_i k(x_i, x) + b,    (2.54)

provided that k and P are interrelated by

k(x_i, x_j) = ((Pk)(x_i, ·) · (Pk)(x_j, ·)).    (2.55)

(Here, (Pk)(x_i, ·) denotes the result of applying P to the function obtained by fixing k's first argument to x_i.) To this end, k is chosen as the Green's function of P*P. For instance, an RBF kernel thus corresponds to regularization with a functional containing a specific differential operator (Yuille and Grzywacz, 1988).


Chapter 3

Kernel Principal Component Analysis

3.1 Introduction


In the last chapter, we tried to show that the idea of implicitly mapping the data into a high-dimensional feature space has been a very fruitful one in the context of SV machines. Indeed, it is mainly this feature which distinguishes them from the Generalized Portrait algorithm which has been known for more than 20 years (Vapnik and Chervonenkis, 1974), and which makes them applicable to complex real-world problems which are not linearly separable. Thus, it was natural to ask the question whether the same idea could prove fruitful in other domains of learning.
The present chapter proposes a new method for performing a nonlinear form of Principal Component Analysis. By the use of Mercer kernels, one can efficiently compute principal components in high-dimensional feature spaces, related to input space by some nonlinear map. We give the derivation of the method and present experimental results on polynomial feature extraction for pattern recognition (Scholkopf, Smola, and Muller, 1996b, 1997b).
Copyright notice: the material in this chapter is based on the article "Nonlinear component analysis as a kernel Eigenvalue problem" by B. Scholkopf, A. Smola and K.-R. Muller, which will appear in: Neural Computation, 1997. MIT Press.

Principal Component Analysis (PCA) is a powerful technique for extracting structure from possibly high-dimensional data sets. It is readily performed by solving an Eigenvalue problem, or by using iterative algorithms which estimate principal components. For reviews of the existing literature, see Jolliffe (1986) and Diamantaras & Kung (1996); some of the classical papers are due to Pearson (1901); Hotelling (1933); Karhunen (1946). PCA is an orthogonal transformation of the coordinate system in which we describe our data. The new coordinate values by which we represent the data are called principal components. It is often the case that a small number of principal components is sufficient to account for most of the structure in the data. These are sometimes called factors or latent variables of the data.
The present work studies PCA in the case where we are not interested in principal components in input space, but rather in principal components of variables, or

features, which are nonlinearly related to the input variables. Among these are for instance variables obtained by taking arbitrary higher-order correlations between input variables. In the case of image analysis, this amounts to finding principal components in the space of products of input pixels.
To this end, we are computing dot products in feature space by means of kernel functions in input space (cf. Sec. 1.3). Given any algorithm which can be expressed solely in terms of dot products, i.e. without explicit usage of the variables themselves, this kernel method enables us to construct different nonlinear versions of it. Even though this general fact was known (Burges, private communication), the machine learning community has made little use of it, the exception being Vapnik's Support Vector machines (Chapter 2). In this chapter, we give an example of applying this method in the domain of unsupervised learning, to obtain a nonlinear form of PCA.
In the next section, we will first review the standard PCA algorithm. In order to be able to generalize it to the nonlinear case, we formulate it in a way which uses exclusively dot products. Using kernel representations of dot products (Sec. 1.3), Sec. 3.3 presents a kernel-based algorithm for nonlinear PCA and explains some of the differences to previous generalizations of PCA. First experimental results on kernel-based feature extraction for pattern recognition are given in Sec. 3.4. We conclude with a discussion (Sec. 3.5). Some technical material which is not essential for the main thread of the argument has been relegated to Appendix D.2.

3.2 Principal Component Analysis in Feature Spaces

Given a set of centered observations x_k ∈ R^N, k = 1, …, M, Σ_{k=1}^M x_k = 0, PCA diagonalizes the covariance matrix¹

C = (1/M) Σ_{j=1}^M x_j x_j^T.    (3.1)

To do this, one has to solve the Eigenvalue equation

λv = Cv    (3.2)

for Eigenvalues λ ≥ 0 and Eigenvectors v ∈ R^N \ {0}. As

λv = Cv = (1/M) Σ_{j=1}^M (x_j · v) x_j,    (3.3)

all solutions v with λ ≠ 0 must lie in the span of x_1, …, x_M; hence (3.2) is in that case equivalent to

λ(x_k · v) = (x_k · Cv) for all k = 1, …, M.    (3.4)

¹ More precisely, the covariance matrix is defined as the expectation of xx^T; for convenience, we shall use the same term to refer to the estimate (3.1) of the covariance matrix from a finite sample.

In the remainder of this section, we describe the same computation in another dot product space F, which is related to the input space by a possibly nonlinear map

Φ: R^N → F,  x ↦ X.    (3.5)

Note that the feature space F could have an arbitrarily large, possibly infinite, dimensionality. Here and in the following, upper case characters are used for elements of F, while lower case characters denote elements of R^N.
Again, we assume that we are dealing with centered data, Σ_{k=1}^M Φ(x_k) = 0 (we shall return to this point later). In F, the covariance matrix takes the form

C̄ = (1/M) Σ_{j=1}^M Φ(x_j) Φ(x_j)^T.    (3.6)

Note that if F is infinite-dimensional, we think of Φ(x_j)Φ(x_j)^T as a linear operator on F, mapping

X ↦ Φ(x_j) (Φ(x_j) · X).    (3.7)

We now have to find Eigenvalues λ ≥ 0 and Eigenvectors V ∈ F \ {0} satisfying

λV = C̄V.    (3.8)

Again, all solutions V with λ ≠ 0 lie in the span of Φ(x_1), …, Φ(x_M). For us, this has two useful consequences: first, we may instead consider the set of equations

λ(Φ(x_k) · V) = (Φ(x_k) · C̄V) for all k = 1, …, M,    (3.9)

and second, there exist coefficients α_i (i = 1, …, M) such that

V = Σ_{i=1}^M α_i Φ(x_i).    (3.10)

Combining (3.9) and (3.10), we get

λ Σ_{i=1}^M α_i (Φ(x_k) · Φ(x_i)) = (1/M) Σ_{i=1}^M α_i ( Φ(x_k) · Σ_{j=1}^M Φ(x_j) (Φ(x_j) · Φ(x_i)) )  for all k = 1, …, M.    (3.11)

Defining an M × M matrix K by

K_ij := (Φ(x_i) · Φ(x_j)),    (3.12)

this reads

M λ K α = K² α,    (3.13)


where α denotes the column vector with entries α_1, …, α_M. To find solutions of (3.13), we solve the Eigenvalue problem

M λ α = K α    (3.14)

for nonzero Eigenvalues. In Appendix D.2.1, we show that this gives us all solutions of (3.13) which are of interest for us.
Let λ_1 ≤ λ_2 ≤ … ≤ λ_M denote the Eigenvalues of K (i.e. the solutions Mλ of (3.14)), and α^1, …, α^M the corresponding complete set of Eigenvectors, with λ_p being the first nonzero Eigenvalue (assuming that Φ is not identically 0). We normalize α^p, …, α^M by requiring that the corresponding vectors in F be normalized, i.e.

(V^k · V^k) = 1 for all k = p, …, M.    (3.15)
By virtue of (3.10) and (3.14), this translates into a normalization condition for α^p, …, α^M:

1 = Σ_{i,j=1}^M α_i^k α_j^k (Φ(x_i) · Φ(x_j)) = Σ_{i,j=1}^M α_i^k α_j^k K_ij = (α^k · K α^k) = λ_k (α^k · α^k).    (3.16)

For the purpose of principal component extraction, we need to compute projections onto the Eigenvectors V^k in F (k = p, …, M). Let x be a test point, with an image Φ(x) in F; then

(V^k · Φ(x)) = Σ_{i=1}^M α_i^k (Φ(x_i) · Φ(x))    (3.17)

may be called its nonlinear principal components corresponding to Φ.


In summary, the following steps were necessary to compute the principal components: first, compute the matrix K; second, compute its Eigenvectors and normalize them in F; third, compute projections of a test point onto the Eigenvectors.²
For the sake of simplicity, we have above made the assumption that the observations are centered. This is easy to achieve in input space, but more difficult in F, as we cannot explicitly compute the mean of the mapped observations in F. There is, however, a way to do it, and this leads to slightly modified equations for kernel-based PCA (see Appendix D.2.2).
To conclude this section, note that Φ can be an arbitrary nonlinear map into the possibly high-dimensional space F, e.g. the space of all d-th order monomials in the entries of an input vector. In that case, we need to compute dot products of input vectors mapped by Φ, with a possibly prohibitive computational cost. The solution to this problem, however, is to use kernel functions (1.14): we exclusively need
² Note that in our derivation we could have used the known result (e.g. Kirby & Sirovich, 1990) that PCA can be carried out on the dot product matrix (x_i · x_j)_{ij} instead of (3.1); however, for the sake of clarity and extendability (in Appendix D.2.2, we shall consider the question how to center the data in F), we gave a detailed derivation.


FIGURE 3.1: The basic idea of kernel PCA. In some high-dimensional feature space F (bottom right), we are performing linear PCA, just as a PCA in input space (top). Since F is nonlinearly related to input space (via Φ), the contour lines of constant projections onto the principal Eigenvector (drawn as an arrow) become nonlinear in input space. Note that we cannot draw a pre-image of the Eigenvector in input space, as it may not even exist. Crucial to kernel PCA is the fact that there is no need to perform the map into F: all necessary computations are carried out by the use of a kernel function k in input space (here: R²). (Panel labels: linear PCA with k(x, y) = (x · y); kernel PCA with e.g. k(x, y) = (x · y)^d.)

to compute dot products between mapped patterns (in (3.12) and (3.17)); we never need the mapped patterns explicitly. Therefore, we can use the kernels described in Sec. 1.3. The particular kernel used then implicitly determines the space F of all possible features. The proposed algorithm, on the other hand, is a mechanism for selecting features in F.

3.3 Kernel Principal Component Analysis


The Algorithm. To perform kernel-based PCA (Fig. 3.1), from now on referred to as kernel PCA, the following steps have to be carried out: first, we compute the matrix K_ij = (k(x_i, x_j))_{ij}. Next, we solve (3.14) by diagonalizing K, and normalize the Eigenvector expansion coefficients α^n by requiring λ_n (α^n · α^n) = 1. To extract the principal components (corresponding to the kernel k) of a test point x, we then


FIGURE 3.2: Feature extractor constructed by kernel PCA (cf. (3.18)). In the first layer, the input vector is compared to the sample via a kernel function, chosen a priori (e.g. polynomial, Gaussian, or sigmoid). The outputs are then linearly combined using weights which are found by solving an Eigenvector problem. As shown in the text, the depicted network's function can be thought of as the projection onto an Eigenvector of a covariance matrix in a high-dimensional feature space. As a function on input space, it is nonlinear.

compute projections onto the Eigenvectors by (cf. Eq. (3.17), Fig. 3.2)

(V^n · Φ(x)) = Σ_{i=1}^M α_i^n k(x_i, x).    (3.18)

If we use a kernel satisfying Mercer's conditions (Proposition 1.3.2), we know that this procedure exactly corresponds to standard PCA in some high-dimensional feature space, except that we do not need to perform expensive computations in that space.
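To make the three steps concrete, here is a compact sketch of the procedure, with the centering of the Gram matrix carried out as in Appendix D.2.2; everything else follows (3.14), (3.16) and (3.18). The function and variable names are illustrative, not part of the original implementation.

```python
import numpy as np

def kernel_pca_fit(X, kernel, n_components):
    """Compute K, center it (Appendix D.2.2), diagonalize, and rescale the Eigenvectors
    so that lambda_n (alpha^n . alpha^n) = 1, cf. (3.16)."""
    M = len(X)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    one = np.full((M, M), 1.0 / M)
    Kc = K - one @ K - K @ one + one @ K @ one               # centered Gram matrix
    lam, alpha = np.linalg.eigh(Kc)                          # eigenvalues in ascending order
    lam, alpha = lam[::-1], alpha[:, ::-1]                   # re-sort by decreasing eigenvalue
    keep = lam > 1e-12                                       # drop numerically zero eigenvalues
    lam, alpha = lam[keep][:n_components], alpha[:, keep][:, :n_components]
    return K, alpha / np.sqrt(lam)

def kernel_pca_transform(x, X, K, alpha, kernel):
    """Nonlinear principal components of a test point x, cf. (3.18), with test-point centering."""
    k_x = np.array([kernel(xi, x) for xi in X])
    k_c = k_x - k_x.mean() - K.mean(axis=1) + K.mean()
    return k_c @ alpha
```

With kernel = lambda x, y: (x @ y) ** d this yields a polynomial variant; with lambda x, y: x @ y it recovers ordinary linear PCA.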

Properties of Kernel PCA. For Mercer kernels, we know that we are in fact doing a standard PCA in F. Consequently, all mathematical and statistical properties of PCA (see for instance Jolliffe, 1986; Diamantaras & Kung, 1996) carry over to kernel-based PCA, with the modifications that they become statements about a set of points Φ(x_i), i = 1, …, M, in F rather than in R^N. In F, we can thus assert that PCA is the orthogonal basis transformation with the following properties (assuming that the Eigenvectors are sorted in descending order of the Eigenvalue size):
• the first q (q ∈ {1, …, M}) principal components, i.e. projections on Eigenvectors, carry more variance than any other q orthogonal directions
• the mean-squared approximation error in representing the observations by the first q principal components is minimal³

• the principal components are uncorrelated
• the first q principal components have maximal mutual information with respect to the inputs (this holds under Gaussianity assumptions, and thus depends on the particular kernel chosen and on the data)

To translate these properties of PCA in F into statements about the data in input space, they need to be investigated for specific choices of kernels. We conclude this section by noting one general property of kernel PCA in input space: for kernels which depend only on dot products or distances in input space (as all the examples that we have given so far do), kernel PCA has the property of unitary invariance. This follows directly from the fact that both the Eigenvalue problem and the feature extraction only depend on kernel values. This ensures that the features extracted do not depend on which orthonormal coordinate system we use for representing our input data.


Computational Complexity. As mentioned in Sec. 1.3, a fifth order polynomial kernel on a 256-dimensional input space yields a 10¹⁰-dimensional feature space. For two reasons, kernel PCA can deal with this huge dimensionality. First, as pointed out in Sect. 3.2, we do not need to look for Eigenvectors in the full space F, but just in the subspace spanned by the images of our observations x_k in F. Second, we do not need to compute dot products explicitly between vectors in F (which can be impossible in practice, even if the vectors live in a lower-dimensional subspace), as we are using kernel functions. Kernel PCA thus is computationally comparable to a linear PCA on ℓ observations with an ℓ × ℓ dot product matrix. If k is easy to compute, as for polynomial kernels, e.g., the computational complexity is hardly changed by the fact that we need to evaluate kernel functions rather than just dot products. Furthermore, in the case where we need to use a large number ℓ of observations, we may want to work with an algorithm for computing only the largest Eigenvalues, as for instance the power method with deflation (for a discussion, see Diamantaras & Kung, 1996). In addition, we can consider using an estimate of the matrix K, computed from a subset of M < ℓ examples, while still extracting principal components from all ℓ examples (this approach was chosen in some of our experiments described below).
The situation can be different for principal component extraction. There, we have to evaluate the kernel function M times for each extracted principal component (3.18),

³ To see this, in the simple case where the data z_1, …, z_ℓ are centered, we consider an orthogonal basis transformation W, and use the notation P_q for the projector on the first q canonical basis vectors {e_1, …, e_q}. Then the mean squared reconstruction error using q vectors is

(1/ℓ) Σ_i ||z_i − W^T P_q W z_i||² = (1/ℓ) Σ_i ||W z_i − P_q W z_i||² = (1/ℓ) Σ_i Σ_{j>q} (W z_i · e_j)²
= (1/ℓ) Σ_i Σ_{j>q} (z_i · W^T e_j)² = (1/ℓ) Σ_i Σ_{j>q} (W^T e_j · z_i)(z_i · W^T e_j) = Σ_{j>q} (W^T e_j · C W^T e_j).

It can easily be seen that the values of this quadratic form (which gives the variances in the directions W^T e_j) are minimal if the W^T e_j are chosen as its (orthogonal) Eigenvectors with smallest Eigenvalues.


rather than just evaluating one dot product as for a linear PCA. Of course, if the dimensionality of F is 10¹⁰, this is still vastly faster than linear principal component extraction in F. Still, in some cases, e.g. if we were to extract principal components as a preprocessing step for classification, we might want to speed up things. This can be carried out by the reduced set technique of Burges (1996) (cf. Appendix D.1.1), proposed in the context of Support Vector machines. In the present setting, we approximate each Eigenvector

V = Σ_{i=1}^ℓ α_i Φ(x_i)    (3.19)

(Eq. (3.10)) by another vector

Ṽ = Σ_{j=1}^m β_j Φ(z_j),    (3.20)

where m < ℓ is chosen a priori according to the desired speed-up, and z_j ∈ R^N, j = 1, …, m. This is done by minimizing the squared difference

ρ = ||V − Ṽ||².    (3.21)

This can be carried out without explicitly dealing with the possibly high-dimensional space F. Since

ρ = ||V||² + Σ_{i,j=1}^m β_i β_j k(z_i, z_j) − 2 Σ_{i=1}^ℓ Σ_{j=1}^m α_i β_j k(x_i, z_j),    (3.22)

the gradient of ρ with respect to the β_j and the z_j is readily expressed in terms of the kernel function, thus ρ can be minimized by standard gradient methods. For the task of handwritten character recognition, this technique led to a speed-up by a factor of 50 at almost no loss in accuracy (Burges & Scholkopf, 1996; cf. Sec. 4.4.1).
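A rough sketch of the approximation step may help. The function below evaluates the objective (3.22) purely through kernel calls and hands it to a generic optimizer with numerical gradients; Burges (1996) derives the analytic gradients, and the kernel, the initialization and the choice of m here are placeholders.

```python
import numpy as np
from scipy.optimize import minimize

def reduced_set(X, alpha, m, kernel, seed=0):
    """Approximate V = sum_i alpha_i Phi(x_i) by m points z_j with coefficients beta_j."""
    n_dim = X.shape[1]
    Kxx = np.array([[kernel(a, b) for b in X] for a in X])
    v_norm_sq = alpha @ Kxx @ alpha                          # ||V||^2, constant during optimization

    def rho(params):                                         # squared difference, cf. (3.22)
        beta, Z = params[:m], params[m:].reshape(m, n_dim)
        Kzz = np.array([[kernel(a, b) for b in Z] for a in Z])
        Kxz = np.array([[kernel(a, b) for b in Z] for a in X])
        return v_norm_sq + beta @ Kzz @ beta - 2.0 * alpha @ Kxz @ beta

    rng = np.random.default_rng(seed)
    x0 = np.concatenate([rng.normal(size=m), X[rng.choice(len(X), m)].ravel()])
    res = minimize(rho, x0, method="L-BFGS-B")               # numerical gradients for simplicity
    return res.x[:m], res.x[m:].reshape(m, n_dim)
```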
Finally, we add that although kernel principal component extraction is computationally more expensive than its linear counterpart, this additional investment can pay back afterwards. In experiments on classification based on the extracted principal components, we found that when we trained on nonlinear features, it was sufficient to use a linear Support Vector machine to construct the decision boundary. Linear Support Vector machines, however, are much faster in classification speed than nonlinear ones. This is due to the fact that for k(x, y) = (x · y), the Support Vector decision function (2.25) can be expressed with a single weight vector w = Σ_{i=1}^ℓ y_i α_i x_i as f(x) = sgn((x · w) + b). Thus the final stage of classification can be done extremely fast; the speed of the principal component extraction phase, on the other hand, and thus the accuracy-speed trade-off of the whole classifier, can be controlled by the number of components which we extract, or by the number m (cf. Eq. (3.20)).
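The collapse of the linear SV decision function into a single weight vector can be written down directly. The sketch below assumes a matrix Z of extracted (nonlinear) principal components and labels y in {-1, +1}, and is only meant to illustrate the fast final classification stage.

```python
import numpy as np
from sklearn.svm import SVC

def collapse_linear_svm(Z, y):
    """Train a linear SVM on extracted features and return (w, b), w = sum_i y_i alpha_i z_i."""
    svm = SVC(kernel="linear").fit(Z, y)
    w = svm.dual_coef_.ravel() @ svm.support_vectors_        # single weight vector
    return w, svm.intercept_[0]

def fast_predict(Z_test, w, b):
    return np.sign(Z_test @ w + b)                           # f(z) = sgn((z . w) + b)
```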


Interpretability and Variable Selection. In PCA, it is sometimes desirable to be able to select specific axes which span the subspace into which one projects in doing principal component extraction. This way, it may for instance be possible to choose variables which are more accessible to interpretation. In the nonlinear case, there is an additional problem: some elements of F do not have pre-images in input space. To make this plausible, note that the linear span of the training examples mapped into feature space can have dimensionality up to M (the number of examples). If this exceeds the dimensionality of input space, it is rather unlikely that each vector of the form (3.10) has a pre-image (cf. Appendix D.1.2). To get interpretability, we thus need to find directions in input space (i.e. input variables) whose images under Φ span the PCA subspace in F. This can be done with an approach akin to the one described above: we could parametrize our set of desired input variables and run the minimization of (3.22) only over those parameters. The parameters can be e.g. group parameters which determine the amount of translation, say, starting from a set of images.

Dimensionality Reduction, Feature Extraction, and Reconstruction. Unlike linear PCA, the proposed method allows the extraction of a number of principal components which can exceed the input dimensionality. Suppose that the number of observations M exceeds the input dimensionality N. Linear PCA, even when it is based on the M × M dot product matrix, can find at most N nonzero Eigenvalues; they are identical to the nonzero Eigenvalues of the N × N covariance matrix. In contrast, kernel PCA can find up to M nonzero Eigenvalues, a fact that illustrates that it is impossible to perform kernel PCA directly on an N × N covariance matrix. Even more features could be extracted by using several kernels.
Being just a basis transformation, standard PCA allows the reconstruction of the original patterns x_i, i = 1, …, ℓ, from a complete set of extracted principal components (x_i · v_j), j = 1, …, ℓ, by expansion in the Eigenvector basis. Even from an incomplete set of components, good reconstruction is often possible. In kernel PCA, this is more difficult: we can reconstruct the image of a pattern in F from its nonlinear components; however, if we only have an approximate reconstruction, there is no guarantee that we can find an exact pre-image of the reconstruction in input space. In that case, we would have to resort to an approximation method (cf. (3.22)). Alternatively, we could use a suitable regression method for estimating the reconstruction mapping from the kernel-based principal components to the inputs.

Comparison to Other Methods for Nonlinear PCA. Starting from some of the properties characterizing PCA (see above), it is possible to develop a number of possible generalizations of linear PCA to the nonlinear case. Alternatively, one may choose an iterative algorithm which adaptively estimates principal components, and make some of its parts nonlinear to extract nonlinear features. Rather than giving a full review of this field here, we briefly describe just three approaches, and refer the reader to Diamantaras & Kung (1996) for more details.


Hebbian Networks. Initiated by the pioneering work of Oja (1982), a number of unsupervised neural-network type algorithms computing principal components have been proposed (e.g. Sanger, 1989). Compared to the standard approach of diagonalizing the covariance matrix, they have advantages for instance in cases where the data are nonstationary. Nonlinear variants of these algorithms are obtained by adding nonlinear activation functions. The algorithms then extract features that the authors have referred to as nonlinear principal components. These approaches, however, do not have the geometrical interpretation of kernel PCA as a standard PCA in a feature space nonlinearly related to input space, and it is thus more difficult to understand what exactly they are extracting. For a discussion of some approaches, see (Karhunen and Joutsensalo, 1995).
Autoassociative Multi-Layer Perceptrons. Consider a linear perceptron with one hidden layer, which is smaller than the input. If we train it to reproduce the input values as outputs (i.e. use it in autoassociative mode), then the hidden unit activations form a lower-dimensional representation of the data, closely related to PCA (see for instance Diamantaras & Kung, 1996). To generalize to a nonlinear setting, one uses nonlinear activation functions and additional layers.⁴ While this of course can be considered a form of nonlinear PCA, it should be stressed that the resulting network training consists in solving a hard nonlinear optimization problem, with the possibility of getting trapped in local minima, and thus with a dependence of the outcome on the starting point of the training. Moreover, in neural network implementations there is often a risk of overfitting. Another drawback of neural approaches to nonlinear PCA is that the number of components to be extracted has to be specified in advance. As an aside, note that hyperbolic tangent kernels can be used to extract neural network type nonlinear features using kernel PCA (Fig. 3.6). The principal components of a test point x in that case take the form (Fig. 3.2) Σ_i α_i^n tanh(κ(x_i · x) + Θ).
Principal Curves. An approach with a clear geometric interpretation in input space is the method of principal curves (Hastie & Stuetzle, 1989), which iteratively estimates a curve (or surface) capturing the structure of the data. The data are projected onto (i.e. mapped to the closest point on) a curve, and the algorithm tries to find a curve with the property that each point on the curve is the average of all data points projecting onto it. It can be shown that the only straight lines satisfying the latter are principal components, so principal curves are indeed a generalization of the latter. To compute principal curves, a nonlinear optimization problem has to be solved. The dimensionality of the surface, and thus the number of features to extract, is specified in advance. Some authors (e.g. Ritter, Martinetz, and Schulten, 1990) have discussed parallels between the Principal Curve algorithm and self-organizing feature maps (Kohonen, 1982) for dimensionality reduction.
⁴ Simply using nonlinear activation functions in the hidden layer would not suffice: already the linear activation functions lead to the best approximation of the data (given the number of hidden nodes), so for the nonlinearities to have an effect on the components, the architecture needs to be changed (see e.g. Diamantaras & Kung, 1996).

Kernel PCA. Kernel PCA is a nonlinear generalization of PCA in the sense that (a) it is performing PCA in feature spaces of arbitrarily large (possibly infinite) dimensionality, and (b) if we use the kernel k(x, y) = (x · y), we recover original PCA. Compared to the above approaches, kernel PCA has the main advantage that no nonlinear optimization is involved; it is essentially linear algebra, as simple as standard PCA. In addition, we need not specify the number of components that we want to extract in advance. Compared to neural approaches, kernel PCA could be disadvantageous if we need to process a very large number of observations, as this results in a large matrix K. Compared to principal curves, kernel PCA is so far harder to interpret in input space; however, at least for polynomial kernels, it has a very clear interpretation in terms of higher-order features.

3.4 Feature Extraction Experiments

In this section, we present a set of experiments where we used kernel PCA (in the form given in Appendix D.2.2) to extract principal components. First, we shall take a look at a simple toy example; following that, we describe real-world experiments where we assess the utility of the extracted principal components by classification tasks.

Toy Examples. To provide some intuition on how PCA in F behaves in input space,

Re

fe

re

nc

we show a set of experiments with an artificial 2-D data set, using polynomial kernels
(cf. Eq. (2.26)) of degree 1 through 4 (see Fig. 3.3). Linear PCA (on the left) only
leads to 2 nonzero Eigenvalues, as the input dimensionality is 2. In contrast, nonlinear
PCA allows the extraction of further components. In the figure, note that nonlinear
PCA produces contour lines of constant feature value which reflect the structure in
the data better than in linear PCA. In all cases, the first principal component varies
monotonically along the parabola which underlies the data. In the nonlinear cases,
the second and the third components also show behaviour which is similar across the
different polynomial degrees. The third component, which comes with small Eigenvalues
(rescaled to sum to 1), seems to pick up the variance caused by the noise, as can be
nicely seen in the case of degree 2. Dropping this component would thus amount to
noise reduction.
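Continuing the sketch above, the toy setting of Fig. 3.3 can be reproduced along the following lines; the data-generating parameters are those stated in the figure caption, while the random seed and the sample size of 100 are illustrative choices.

```python
import numpy as np   # kernel_pca as sketched above

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = x**2 + rng.normal(0.0, 0.2, size=100)     # parabola plus Gaussian noise
X = np.column_stack([x, y])

poly2 = lambda a, b: (a @ b) ** 2             # degree-2 polynomial kernel, cf. (2.26)
eigvals, transform = kernel_pca(X, poly2, n_components=3)
features = transform(X)                       # first three nonlinear components
```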
In Fig. 3.3, it can be observed that for larger polynomial degrees, the principal
component extraction functions become increasingly flat around the origin. Thus,
different data points not too far from the origin would only differ slightly in the value
of their principal components. To understand this, consider the following example:
suppose we have two data points
\[
x_1 = (1, 0)^\top, \qquad x_2 = (2, 0)^\top,
\tag{3.23}
\]
and a kernel $k(x, y) := (x \cdot y)^2$.

[Figure 3.3: a 3 x 4 array of contour plots; each panel is titled with the Eigenvalue of the displayed component.]

FIGURE 3.3: Two-dimensional toy example, with data generated in the following way: x-values have uniform distribution in [-1, 1], y-values are generated from $y_i = x_i^2 + \xi$, where $\xi$ is normal noise with standard deviation 0.2. From left to right, the polynomial degree in the kernel (2.26) increases from 1 to 4; from top to bottom, the first 3 Eigenvectors are shown, in order of decreasing Eigenvalue size. The figures contain lines of constant principal component value (contour lines); in the linear case, these are orthogonal to the Eigenvectors. We did not draw the Eigenvectors, as in the general case, they live in a higher-dimensional feature space.

Then the differences between the entries of $x_1$ and $x_2$ get scaled up by the kernel, namely $k(x_1, x_1) = 1$, but $k(x_2, x_2) = 16$. We can compensate for this by rescaling the individual entries of each vector $x_i$ by
\[
(x_i)_k \mapsto \operatorname{sign}\!\big((x_i)_k\big)\, \big|(x_i)_k\big|^{1/2}.
\tag{3.24}
\]
Indeed, Fig. 3.4 shows that when the data are preprocessed according to (3.24) (where
higher degrees are treated correspondingly), the first principal component extractors
hardly depend on the degree any more, as long as it is larger than 1.


FIGURE 3.4: PCA with kernel (2.26), degrees d = 1, ..., 5. 100 points $((x_i)_1, (x_i)_2)$ were
generated from $(x_i)_2 = (x_i)_1^2 +$ noise (Gaussian, with standard deviation 0.2); all $(x_i)_j$
were rescaled according to $(x_i)_j \mapsto \mathrm{sgn}((x_i)_j)\, |(x_i)_j|^{1/d}$. Displayed are contour lines of
constant value of the first principal component. Nonlinear kernels (d > 1) extract features
which nicely increase along the direction of main variance in the data; linear PCA (d = 1)
does its best in that respect, too, but it is limited to straight directions.

FIGURE 3.5: Two-dimensional toy example with three data clusters (Gaussians with standard deviation 0.1, depicted region: [-1, 1] × [-0.5, 1]): first 8 nonlinear principal components extracted with $k(x, y) = \exp(-\|x - y\|^2 / 0.1)$. Note that the first 2 principal
components (top left) nicely separate the three clusters. The components 3 - 5 split up the
clusters into halves. Similarly, the components 6 - 8 split them again, in a way orthogonal
to the above splits. Thus, the first 8 components divide the data into 12 regions.

If necessary, we can thus use (3.24) to preprocess our data. Note, however, that the above scaling
problem is irrelevant for the character and object databases to be considered below:
there, most entries of the patterns are 1.
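The componentwise rescaling (3.24), and its degree-d analogue used for Fig. 3.4, is a one-line preprocessing step; the short sketch below also reproduces the $k(x_1, x_1) = 1$ versus $k(x_2, x_2) = 16$ effect of example (3.23). Names and the printed checks are illustrative.

```python
import numpy as np

def rescale(X, d):
    """Preprocessing (3.24), treated correspondingly for degree d:
    (x)_k -> sign((x)_k) * |(x)_k|**(1/d)."""
    return np.sign(X) * np.abs(X) ** (1.0 / d)

x1, x2 = np.array([1.0, 0.0]), np.array([2.0, 0.0])
k = lambda a, b: (a @ b) ** 2                    # kernel of example (3.23)
print(k(x1, x1), k(x2, x2))                      # 1.0, 16.0: entries scaled up
z1, z2 = rescale(x1, 2), rescale(x2, 2)
print(k(z1, z1), k(z2, z2))                      # 1.0, 4.0 after preprocessing
```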
Further toy examples, using radial basis function kernels (1.28) and neural network
type sigmoid kernels (1.29), are shown in figures 3.5 - 3.8.

Object Recognition. In this set of experiments, we used the MPI chair database
with 89 training views per object (Appendix A).

FIGURE 3.6: Two-dimensional toy example with three data clusters (Gaussians with standard deviation 0.1, depicted region: [-1, 1] × [-0.5, 1]): first 6 nonlinear principal components extracted with $k(x, y) = \tanh(2 (x \cdot y) - 1)$ (the gain and threshold values were
chosen according to the values used in SV machines, cf. Table 2.4). Note that the first 2
principal components are sufficient to separate the three clusters, and the third and fourth
component simultaneously split all clusters into halves.


We computed the matrix K from all 2225 training examples, and used polynomial kernel PCA to extract nonlinear principal
components from the training and test set. To assess the utility of the components, we
trained a soft margin hyperplane classifier (Sec. 2.1.3) on the classification task. This is
a special case of Support Vector machines, using the standard dot product as a kernel
function. Table 3.1 summarizes our findings: in all cases, nonlinear components as
extracted by polynomial kernels (Eq. (2.26) with d > 1) led to classification accuracies
superior to standard PCA. Specifically, the nonlinear components afforded top test
performances between 2% and 4% error, whereas in the linear case we obtained 17%.
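The experimental pipeline just described (polynomial kernel PCA features followed by a linear soft margin SV classifier) can be sketched with off-the-shelf components; scikit-learn's KernelPCA and LinearSVC are stand-ins for the original implementation, and the parameter values are placeholders.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.svm import LinearSVC

def kpca_linear_svm_error(X_train, y_train, X_test, y_test, degree, n_components):
    """Extract nonlinear principal components with a polynomial kernel,
    then train a linear soft margin classifier on them."""
    kpca = KernelPCA(n_components=n_components, kernel="poly",
                     degree=degree, gamma=1.0, coef0=0.0)   # (x . y)^degree
    F_train = kpca.fit_transform(X_train)
    F_test = kpca.transform(X_test)
    clf = LinearSVC(C=10.0)                                  # placeholder constant
    clf.fit(F_train, y_train)
    return 1.0 - clf.score(F_test, y_test)                   # test error rate
```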

Character Recognition. To validate the above results on a widely used pattern recognition benchmark database, we repeated the same experiments on the US postal
service database of handwritten digits (Appendix C). This database contains 9298
examples of dimensionality 256; 2007 of them make up the test set. For computational reasons, we decided to use a subset of 3000 training examples for the matrix K.
Table 3.2 illustrates two advantages of using nonlinear kernels: first, performance of a
linear classifier trained on nonlinear principal components is better than for the same
number of linear components; second, the performance for nonlinear components can
be further improved by using more components than possible in the linear case. The
latter is related to the fact that of course there are many more higher-order features
than there are pixels in an image.

93

Re

fe

re

nc

es

3.4. FEATURE EXTRACTION EXPERIMENTS

FIGURE 3.7: For different threshold values Θ (from top to bottom: Θ = -4, -2, -1, 0, 2),
kernel PCA with hyperbolic tangent kernels $k(x, y) = \tanh(2 (x \cdot y) + \Theta)$ exhibits qualitatively similar behaviour (data as in the previous figures). In all cases, the first two
components capture the main structure of the data, whereas the third component splits
the clusters.


FIGURE 3.8: A smooth transition from linear PCA to nonlinear PCA is obtained by using
hyperbolic tangent kernels $k(x, y) = \tanh(\kappa (x \cdot y) + 1)$ with varying gain κ: from top
to bottom, κ = 0.1, 1, 5, 10 (data as in the previous figures). For κ = 0.1, the first two
features look like linear PCA features. For large κ, the nonlinear region of the tanh function
becomes effective. In that case, kernel PCA can exploit this nonlinearity to allocate the
highest feature gradients to regions where there are data points, as can be seen nicely in
the case κ = 10.

                         Test Error Rate for degree
  # of components      1      2      3      4      5      6      7
         64          23.0   21.0   17.6   16.8   16.5   16.7   16.6
        128          17.6    9.9    7.9    7.1    6.2    6.0    5.8
        256          16.8    6.0    4.4    3.8    3.4    3.2    3.3
        512          n.a.    4.4    3.6    3.9    2.8    2.8    2.6
       1024          n.a.    4.1    3.0    2.8    2.6    2.6    2.4
       2048          n.a.    4.1    2.9    2.6    2.5    2.4    2.2

TABLE 3.1: Test error rates on the MPI chair database for linear Support Vector machines
trained on nonlinear principal components extracted by PCA with polynomial kernel (2.26),
for degrees 1 through 7. In the case of degree 1, we are doing standard PCA, with the
number of nonzero Eigenvalues being at most the dimensionality of the space, 256; thus,
we can extract at most 256 principal components. The performance for the nonlinear cases
(degree > 1) is significantly better than for the linear case, illustrating the utility of the
extracted nonlinear components for classification.


                         Test Error Rate for degree
  # of components      1      2      3      4      5      6      7
         32           9.6    8.8    8.1    8.5    9.1    9.3   10.8
         64           8.8    7.3    6.8    6.7    6.7    7.2    7.5
        128           8.6    5.8    5.9    6.1    5.8    6.0    6.8
        256           8.7    5.5    5.3    5.2    5.2    5.4    5.4
        512          n.a.    4.9    4.6    4.4    5.1    4.6    4.9
       1024          n.a.    4.9    4.3    4.4    4.6    4.8    4.6
       2048          n.a.    4.9    4.2    4.1    4.0    4.3    4.4

TABLE 3.2: Test error rates on the USPS handwritten digit database for linear Support
Vector machines trained on nonlinear principal components extracted by PCA with kernel
(2.26), for degrees 1 through 7. In the case of degree 1, we are doing standard PCA, with
the number of nonzero Eigenvalues being at most the dimensionality of the space, 256.
Clearly, nonlinear principal components afford test error rates which are superior to the
linear case (degree 1).

Regarding the first point, note that extracting a certain number of features in a $10^{10}$-dimensional space constitutes a much higher reduction of dimensionality than extracting the same number of features in 256-dimensional input space.
For all numbers of features, the optimal degree of kernels to use is around 4, which
is compatible with Support Vector machine results on the same data set (cf. Sec. 2.3
and Fig. 2.16). Moreover, with only one exception, the nonlinear features are superior
to their linear counterparts. The resulting error rate for the best of our classifiers
(4.0%) is competitive with convolutional 5-layer neural networks (5.0% were reported
by LeCun et al., 1989) and nonlinear Support Vector classifiers (4.0%, Table 2.4);
96

CHAPTER 3. KERNEL PRINCIPAL COMPONENT ANALYSIS

it is much better than linear classifiers operating directly on the image data (a linear
Support Vector machine achieves 8.9%; Table 2.4). Our results were obtained without
using any prior knowledge about symmetries of the problem at hand, explaining why
the performance is inferior to Virtual Support Vector classifiers (3.2%, Table 4.1),
and Tangent Distance Nearest Neighbour classifiers (2.6%, Simard, LeCun, & Denker,
1993). We believe that adding e.g. local translation invariance, be it by generating
"virtual" translated examples (cf. Sec. 4.2.1) or by choosing a suitable kernel (e.g. as
the ones that we shall describe in Sec. 4.3), could further improve the results.

3.5 Discussion
Feature Extraction for Classification. This chapter was devoted to the presentation of a new technique for nonlinear PCA. To develop this technique, we made use of a
kernel method so far only used in supervised learning (Vapnik, 1995; Sec. 1.3). Kernel
PCA constitutes a mere first step towards exploiting this technique for a large class of
algorithms.
In experiments comparing the utility of kernel PCA features for pattern recognition
using a linear classifier, we found two advantages of nonlinear kernels: first, nonlinear
principal components afforded better recognition rates than corresponding numbers of
linear principal components; and second, the performance for nonlinear components
can be further improved by using more components than possible in the linear case.
We have not yet compared kernel PCA to other techniques for nonlinear feature extraction and dimensionality reduction. We can, however, compare results to other feature
extraction methods which have been used in the past by researchers working on the
USPS classification problem (cf. Sec. 3.4). Our system of kernel PCA feature extraction plus linear support vector machine for instance performed better than LeNet1
(LeCun et al., 1989). Even though the latter result has been obtained a number of
years ago, it should be stressed that LeNet1 provides an architecture which contains a
great deal of prior information about the handwritten character classification problem.
It uses shared weights to improve transformation invariance, and a hierarchy of feature
detectors resembling parts of the human visual system. These feature detectors were
for instance used by Bottou and Vapnik (1992) as a preprocessing stage in their experiments in local learning. Note that, in addition, our features were extracted without
taking into account that we want to do classification. Clearly, in supervised learning,
where we are given a set of labelled observations $(x_1, y_1), \ldots, (x_\ell, y_\ell)$, it would seem
advisable to make use of the labels not only during the training of the final classifier,
but already in the stage of feature extraction.
To conclude this paragraph on feature extraction for classification, we note that a
similar approach could be taken in the case of regression estimation.

Feature Space and the Curse of Dimensionality. We are doing PCA in $10^{10}$-dimensional feature spaces, yet getting results in finite time which are comparable
to state-of-the-art techniques.
97

3.5. DISCUSSION

In fact, of course, we are not working in the full feature space, but just in a comparably small linear subspace of it, whose dimension equals
at most the number of observations. The method automatically chooses this subspace
and provides a means of taking advantage of the lower dimensionality. An approach
which consisted in explicitly mapping into feature space and then performing PCA
would have severe difficulties at this point: even if PCA was done based on an M × M
dot product matrix (M being the sample size) whose diagonalization is tractable, it
would still be necessary to evaluate dot products in a $10^{10}$-dimensional feature space
to compute the entries of the matrix in the first place. Kernel-based methods avoid
this problem: they do not explicitly compute all dimensions of F (loosely speaking,
all possible features), but only work in a relevant subspace of F.
Note, moreover, that we did not get overfitting problems when training the linear
SV classifier on the extracted features. The basic idea behind this two-step approach
is very similar in spirit to nonlinear SV machines: one maps into a very complex space
to be able to approximate a large class of possible decision functions, and then uses a
low VC-dimension classifier in that space to control generalization.

Conclusion. Compared to other techniques for nonlinear feature extraction, kernel
PCA has the advantages that it does not require nonlinear optimization, but only the
solution of an Eigenvalue problem, and that, by allowing the use of different kernels, it
comprises a fairly general class of nonlinearities. Clearly, the last point has yet to be
evaluated in practice; however, for the support vector machine, the utility of different
kernels has already been established: different kernels (polynomial, sigmoid, Gaussian)
led to fine classification performances (Table 2.4). The general question of how to
select the ideal kernel for a given task (i.e. the appropriate feature space), however,
is an open problem.
We conclude this chapter with a twofold outlook. The scene has been set for
using the kernel method to construct a wide variety of rather general and still feasible
nonlinear variants of classical algorithms. It is beyond the scope of the present work
to explore all the possibilities, including many distance-based algorithms, in detail.
Some of them are currently being investigated, for instance nonlinear forms of k-means
clustering and kernel-based independent component analysis (Scholkopf, Smola,
& Muller, 1996). Other domains where researchers have recently started to investigate
the use of Mercer kernels include Gaussian Processes (Williams, 1997).
Linear PCA is being used in numerous technical and scientific applications, including noise reduction, density estimation, image indexing and retrieval systems, and the
analysis of natural image statistics. Kernel PCA can be applied to all domains where
traditional PCA has so far been used for feature extraction, and where a nonlinear
extension would make sense.


Chapter 4

Prior Knowledge in Support Vector Machines


In 1995, LeCun et al. published a pattern recognition performance comparison noting
the following:
"The optimal margin classifier [i.e. SV machine, the author] has excellent
accuracy, which is most remarkable, because unlike the other high performance classifiers, it does not include a priori knowledge about the
problem. In fact, this classifier would do just as well if the image pixels were permuted by a fixed mapping. [...] However, improvements are
expected as the technique is relatively new."
One of the key points in developing SV technology is thus the incorporation of prior
knowledge about given tasks. Moreover, it is also a key point if we want to learn
anything general about the processing of visual information in animals from SV machines: having been exposed to the world for all their life, animals extensively exploit
any available knowledge on regularities and invariances of the world.
Two years after the above statement was published, we are now in the position to be
able to devote the present chapter to three techniques for incorporating task-specific
prior knowledge in SV machines (Scholkopf, Burges, and Vapnik, 1996a; Scholkopf,
Simard, Smola, and Vapnik, 1997a).

4.1 Introduction
When we are trying to extract regularities from data, we often have additional knowledge about the functions that we estimate (for a review, see Abu-Mostafa, 1995). For
instance, in image classification tasks, there exist transformations which leave class
membership invariant (e.g. translations); moreover, it is usually the case that images
have a local structure in the sense that not all correlations between image regions carry
equal amounts of information. We presently investigate the question of how to make use
of these two sources of knowledge.
We first present the Virtual SV method of incorporating prior knowledge about
transformation invariances by applying transformations to Support Vectors, the training examples most critical for determining the classification boundary (Sec. 4.2.1).

In Sec. 4.2.2, we design kernel functions which lead to invariant classification hyperplanes. This method is applicable to invariances under the action of differentiable local
1-parameter groups of local transformations, e.g. translational invariance in pattern
recognition; the Virtual SV method is applicable to any type of invariance. In the
third method proposed in this chapter, we also modify the kernel functions; however,
this time not to incorporate transformation invariance, but to take into account image locality by using localized receptive fields (Sec. 4.3). Following that, Sec. 4.4 and
Sec. 4.5 give experimental results and a discussion, respectively.

4.2 Incorporating Transformation Invariances


In many applications of learning procedures, certain transformations of the input are
known to leave function values unchanged. At least three different ways of exploiting
this knowledge have been used (illustrated in Fig. 4.1):
In the first case, the knowledge is used to generate artificial training examples ("virtual examples", Poggio and Vetter, 1992; Baird, 1990) by transforming the training
examples accordingly. It is then hoped that given sufficient time, the learning machine
will extract the invariances from the artificially enlarged training data.
In the second case, the learning algorithm itself is modified. This is typically done
by using a modified error function which forces a learning machine to construct a
function with the desired invariances (Simard et al., 1992).
Finally, in the third case, the invariance is achieved by changing the representation
of the data by first mapping them into a more suitable space; an approach pursued for
instance by Segman, Rubinstein, and Zeevi (1992), or Vetter and Poggio (1997). The
data representation can also be changed by using a modified distance metric, rather
than actually changing the patterns (e.g. Simard, LeCun, and Denker, 1993).
Simard et al. (1992) compare the first two techniques and find that for the considered problem (learning a function with three plateaus where function values are
locally invariant), training on the artificially enlarged data set is significantly slower,
due to both correlations in the artificial data and the increase in training set size.
Moving to real-world applications, the latter factor becomes even more important. If
the size of a training set is multiplied by a number of desired invariances (by generating
a corresponding number of artificial examples for each training pattern), the resulting
training sets can get rather large (as the ones used by Drucker, Schapire, and Simard,
1993). However, the method of generating virtual examples has the advantage of being
readily implemented for all kinds of learning machines and symmetries. If instead of
Lie groups of symmetry transformations one is dealing with discrete symmetries, as the
bilateral symmetries of Vetter, Poggio, and Bulthoff (1994); Vetter and Poggio (1994),
derivative-based methods such as the ones of Simard et al. (1992) are not applicable.
It would thus be desirable to have an intermediate method which has the advantages
of the virtual examples approach without its computational cost.
The two methods described in the following try to combine merits of all the approaches mentioned above.
FIGURE 4.1: Different ways of incorporating invariances in a decision function. The dashed
line marks the "true" boundary, disks and circle are the training examples. We assume
that prior information tells us that the classification function only depends on the norm of
the input vector (the origin being in the center of each picture). Lines crossing an example
indicate the type of information conveyed by the different methods of incorporating prior
information. Left: generating virtual examples in a localized region around each training
example; middle: incorporating a regularizer to learn tangent values (cf. Simard, Victorri,
LeCun, and Denker, 1992); right: changing the representation of the data by first mapping
each example to its norm. If feasible, the latter method yields the most information.
However, if the necessary nonlinear transformation cannot be found, or if the desired
invariances are of localized nature, one has to resort to one of the former techniques.
Finally, the reader may note that examples close to the boundary allow us to exploit
prior knowledge very effectively: given a method to get a first approximation of the true
boundary, the examples closest to it would allow good estimation of the true boundary. A
similar two-step approach is pursued in Sec. 4.2.1. (From Scholkopf, Burges, and Vapnik
(1996a).)

The Virtual SV method (Sec. 4.2.1) retains the flexibility and simplicity of virtual examples approaches, while cutting down on their computational cost significantly. The Invariant Hyperplane method (Sec. 4.2.2), on the other
hand, is comparable to the method of Simard et al. (1992) in that it is applicable to all
differentiable local 1-parameter groups of local symmetry transformations, comprising
a fairly general class of invariances. In addition, it has an equivalent interpretation
as a preprocessing operation applied to the data before learning. In this sense, it can
also be viewed as changing the representation of the data to a more invariant one, in
a task-dependent way.

4.2.1 The Virtual SV Method


In Sec. 2.4, it has been argued that the SV set contains all information necessary
to solve a given classification task.

In particular, it was possible to train any one of three different types of SV machines solely on the Support Vector set extracted
by another machine, with a test performance not worse than after training on the
full database. Using this finding as a starting point, we now investigate the question
whether it might be sufficient to generate virtual examples from the Support Vectors
only. After all, one might hope that it does not add much information to generate
virtual examples of patterns which are not close to the boundary. In high-dimensional
cases, however, care has to be exercised regarding the validity of this intuitive picture.
Thus, an experimental test on a high-dimensional real-world problem is imperative.
In our experiments, we proceeded as follows (cf. Fig. 4.2):
1. Train a Support Vector machine to extract the Support Vector set.
2. Generate artificial examples by applying the desired invariance transformations
to the Support Vectors. In the following, we will refer to these examples as
Virtual Support Vectors (VSVs).
3. Train another Support Vector machine on the generated examples.¹


If the desired invariances are incorporated, the curves obtained by applying Lie symmetry transformations to points on the decision surface should have tangents parallel
to the latter (cf. Simard et al., 1992). If we use small Lie group transformations
to generate the virtual examples, this implies that the Virtual Support Vectors should
be approximately as close to the decision surface as the original Support Vectors.
Hence, they are fairly likely to become Support Vectors after the second training run.
Vice versa, if a substantial fraction of the Virtual Support Vectors turn out to become
Support Vectors in the second run, we have reason to expect that the decision surface
does have the desired shape.
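The three-step procedure can be sketched as follows for image data, with scikit-learn's SVC standing in for the SV machines and 1-pixel translations as the invariance transformation; the helper names and parameter values are illustrative only.

```python
import numpy as np
from sklearn.svm import SVC
from scipy.ndimage import shift

def virtual_sv(X, y, side, C=10.0, degree=5):
    """Virtual SV method: train, transform the SVs, retrain (cf. Fig. 4.2)."""
    # 1. Train a Support Vector machine and extract the Support Vector set.
    svm1 = SVC(kernel="poly", degree=degree, gamma=1.0, coef0=0.0, C=C)
    svm1.fit(X, y)
    sv_X, sv_y = svm1.support_vectors_, y[svm1.support_]
    # 2. Generate Virtual Support Vectors by applying the desired invariance
    #    transformations (here: translations by one pixel in four directions).
    vsv_X, vsv_y = [sv_X], [sv_y]
    for dy, dx in [(1, 0), (-1, 0), (0, 1), (0, -1)]:
        shifted = np.array([shift(v.reshape(side, side), (dy, dx), order=0).ravel()
                            for v in sv_X])
        vsv_X.append(shifted)
        vsv_y.append(sv_y)
    # 3. Train another Support Vector machine on the SVs and the generated VSVs.
    svm2 = SVC(kernel="poly", degree=degree, gamma=1.0, coef0=0.0, C=C)
    svm2.fit(np.vstack(vsv_X), np.concatenate(vsv_y))
    return svm2
```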


4.2.2 Constructing Invariance Kernels


Invariance by a Self-Consistency Argument. We face the following problem: to
express the condition of invariance of the decision function, we already need to know
its coefficients, which are found only during the optimization, which in turn should
already take into account the desired invariances. As a way out of this circle, we use
the following ansatz: consider decision functions $f = \mathrm{sgn} \circ g$, where $g$ is defined as
\[
g(x_j) := \sum_{i=1}^{\ell} \alpha_i y_i\, (B x_j \cdot B x_i) + b,
\tag{4.1}
\]
with a matrix $B$ to be determined below. This follows Vapnik (1995b), who suggested
to incorporate invariances by modifying the dot product used: any nonsingular $B$
defines a dot product, which can equivalently be written in the form $(x_j \cdot A x_i)$, with
a positive definite matrix $A = B^\top B$.
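Evaluating (4.1) costs no more than evaluating a standard SV decision function, because the matrix $A = B^\top B$ can be folded into the Support Vectors once; a minimal sketch follows (coefficients, labels, SVs, offset b and the matrix B are assumed given, and all names are illustrative).

```python
import numpy as np

def g(x, alphas, ys, svs, b, B):
    """Decision value (4.1): sum_i alpha_i y_i (B x . B x_i) + b.
    Since (B x . B x_i) = (x . B^T B x_i), the vectors B^T B x_i can be
    precomputed, so evaluation is as fast as for a standard SV machine."""
    A = B.T @ B
    preprocessed_svs = svs @ A          # row i equals (B^T B x_i)^T
    return float(np.dot(alphas * ys, preprocessed_svs @ x) + b)

def f(x, *args):
    return np.sign(g(x, *args))         # f = sgn o g
```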
¹ Clearly, the scheme can be iterated; however, care has to be exercised, since the iteration of local
invariances would lead to global ones which are not always desirable: cf. the example of a '6'
rotating into a '9' (Simard, LeCun, and Denker, 1993).

[Figure 4.2 panels: problem; separating hyperplanes; SV hyperplane; VSV hyperplane.]
FIGURE 4.2: Suppose we have prior knowledge indicating that the decision function should
be invariant with respect to horizontal translations. The true decision boundary is drawn
as a dotted line (top left); however, as we are just given a limited training sample, different
separating hyperplanes are conceivable (top right). The SV algorithm finds the unique
separating hyperplane with maximal margin (bottom left), which in this case is quite different from the true boundary. For instance, it would lead to wrong classification of the
ambiguous point indicated by the question mark. Making use of the prior knowledge by
generating Virtual Support Vectors from the Support Vectors found in a first training run,
and retraining on these, yields a more accurate decision boundary (bottom right). Note,
moreover, that for the considered example, it is sufficient to train the SV machine only on
virtual examples generated from the Support Vectors.

Clearly, invariance of $g$ under local transformations of all $x_j$ is a sufficient condition
for the same invariance to hold for $f = \mathrm{sgn} \circ g$, which is what we are aiming for. Strictly
speaking, however, invariance of $g$ is not necessary at points which are not Support
Vectors, since these lie in a region where $(\mathrm{sgn} \circ g)$ is constant.
The above notion of invariance refers to invariance when evaluating the decision
function.

A different notion could ask whether the separating hyperplane, including its margin, would change if the training examples were transformed. It
turns out that when discussing the invariance of $g$ rather than $f$, these two concepts
are closely related. In the following argument, we restrict ourselves to the optimal
margin case ($\xi_i = 0$ for all $i = 1, \ldots, \ell$), where the margin is well-defined. As the
separating hyperplane and its margin are expressed in terms of Support Vectors, locally
transforming a Support Vector $x_i$ will change the hyperplane or the margin if $g(x_i)$
changes: if $|g|$ gets smaller than 1, the transformed pattern will lie in the margin, and
the recomputed margin will be smaller; if $|g|$ gets larger than 1, the margin might
become bigger, depending on whether the pattern can be expressed in terms of the
other SVs (cf. the remark in point 2 of the enumeration preceding Proposition 2.1.2).
In terms of the mechanical analogy of Sec. 2.1.2: moving Support Vectors changes
the mechanical equilibrium for the sheet separating the classes. Conversely, a local
transformation of a non-Support Vector will never change $f$, even if the value of $g$
changes, as the solution of the programming problem is expressed in terms of the
Support Vectors only.
In this sense, invariance of $f$ under local transformations of the given data corresponds to invariance of (4.1) for the Support Vectors. Note, however, that this criterion
is not readily applicable: before finding the Support Vectors in the optimization, we
already need to know how to enforce invariance. Thus the above observation cannot
be used directly; however, it could serve as a starting point for constructing heuristics
or iterative solutions. In the Virtual SV method (Sec. 4.2.1), a first run of the standard SV algorithm is carried out to obtain an initial SV set; similar heuristics could
be applied in the present case.
Local invariance of $g$ for each pattern $x_j$ under transformations of a differentiable
local 1-parameter group of local transformations $\mathcal{L}_t$,
\[
\frac{\partial}{\partial t}\Big|_{t=0}\, g(\mathcal{L}_t x_j) = 0,
\tag{4.2}
\]
can be approximately enforced by minimizing the regularizer
\[
\frac{1}{\ell} \sum_{j=1}^{\ell} \left( \frac{\partial}{\partial t}\Big|_{t=0}\, g(\mathcal{L}_t x_j) \right)^{2}.
\tag{4.3}
\]

Note that the sum may run over labelled as well as unlabelled data, so in principle one
could also require the decision function to be invariant with respect to transformations
of elements of a test set. Moreover, we could use different transformations for different
patterns.
For (4.1), the local invariance term (4.2) becomes
\[
\frac{\partial}{\partial t}\Big|_{t=0}\, g(\mathcal{L}_t x_j)
= \frac{\partial}{\partial t}\Big|_{t=0} \left( \sum_{i=1}^{\ell} \alpha_i y_i\, (B \mathcal{L}_t x_j \cdot B x_i) + b \right)
= \sum_{i=1}^{\ell} \alpha_i y_i\, \frac{\partial}{\partial t}\Big|_{t=0} (B \mathcal{L}_t x_j \cdot B x_i)
= \sum_{i=1}^{\ell} \alpha_i y_i\, \partial_1 (B \mathcal{L}_0 x_j \cdot B x_i) \cdot B \frac{\partial}{\partial t}\Big|_{t=0} \mathcal{L}_t x_j,
\tag{4.4}
\]
using the chain rule. Here, $\partial_1 (B \mathcal{L}_0 x_j \cdot B x_i)$ denotes the gradient of $(x \cdot y)$ with respect
to $x$, evaluated at the point $(x \cdot y) = (B \mathcal{L}_0 x_j \cdot B x_i)$.
As a side remark, note that a sufficient, albeit rather strict, condition for invariance
is thus that $\frac{\partial}{\partial t}\big|_{t=0} (B \mathcal{L}_t x_j \cdot B x_i)$ vanish for all $i, j$; however, we will proceed in our
derivation, with the goal of imposing weaker conditions, which apply for one specific
decision function rather than simultaneously for all decision functions expressible by
different choices of the coefficients $\alpha_i y_i$.
Substituting (4.4) into (4.3), using the facts that $\mathcal{L}_0 = I$ and $\partial_1 (x \cdot y) = y^\top$, yields
the regularizer
\[
\frac{1}{\ell} \sum_{j=1}^{\ell} \left( \sum_{i=1}^{\ell} \alpha_i y_i\, (B x_i)^\top B \frac{\partial}{\partial t}\Big|_{t=0} \mathcal{L}_t x_j \right)^{2}
= \frac{1}{\ell} \sum_{j=1}^{\ell} \left( \sum_{i=1}^{\ell} \alpha_i y_i\, (B x_i)^\top B \frac{\partial}{\partial t}\Big|_{t=0} \mathcal{L}_t x_j \right) \left( \sum_{k=1}^{\ell} \alpha_k y_k\, \Big(B \frac{\partial}{\partial t}\Big|_{t=0} \mathcal{L}_t x_j\Big)^{\!\top} (B x_k) \right)
= \sum_{i,k=1}^{\ell} \alpha_i y_i\, \alpha_k y_k\, (B x_i \cdot B C B^\top B x_k),
\tag{4.5}
\]
where
\[
C := \frac{1}{\ell} \sum_{j=1}^{\ell} \left( \frac{\partial}{\partial t}\Big|_{t=0} \mathcal{L}_t x_j \right) \left( \frac{\partial}{\partial t}\Big|_{t=0} \mathcal{L}_t x_j \right)^{\!\top}.
\tag{4.6}
\]
We now choose $B$ such that (4.5) reduces to the standard SV target function (2.7)
in the form obtained by substituting (2.11) into it (cf. the quadratic term of (2.13)),
utilizing the dot product chosen in (4.1), i.e. such that

\[
(B x_i \cdot B C B^\top B x_k) = (B x_i \cdot B x_k).
\tag{4.7}
\]
Assuming that the $x_i$ span the whole space, this condition becomes
\[
B^\top B C B^\top B = B^\top B,
\tag{4.8}
\]
or, by requiring $B$ to be nonsingular, i.e. that no information get lost during the
preprocessing, $B C B^\top = I$. This can be satisfied by a preprocessing matrix
\[
B = C^{-\frac{1}{2}},
\tag{4.9}
\]
the nonnegative square root of the inverse of the nonnegative matrix $C$ defined in
(4.6). In practice, we use a matrix
\[
C_\lambda := (1 - \lambda) C + \lambda I,
\tag{4.10}
\]

106

CHAPTER 4. PRIOR KNOWLEDGE IN SUPPORT VECTOR MACHINES

with $0 < \lambda \le 1$, instead of $C$. As $C$ is nonnegative, $C_\lambda$ is invertible. For $\lambda = 1$, we
recover the standard SV optimal hyperplane algorithm; other values of $\lambda$ determine the
trade-off between invariance and model complexity control. It can be shown that using
$C_\lambda$ corresponds to using an objective function $(1 - \lambda) \sum_i \big(w \cdot \frac{\partial}{\partial t}\big|_{t=0} \mathcal{L}_t x_i\big)^2 + \lambda \|w\|^2$ (see Appendix D.3).
By choosing the preprocessing matrix $B$ according to (4.9), we have obtained a
formulation of the problem where the standard SV quadratic optimization technique
does in effect minimize the tangent regularizer (4.3): the maximum of (2.13) subject
to (2.14) and (2.15), using the modified dot product as in (4.1), coincides with the
minimum of (4.3) subject to the separation conditions $y_i \cdot g(x_i) \ge 1$, where $g$ is defined
as in (4.1).
Note that preprocessing with $B$ does not affect classification speed: since $(B x_j \cdot B x_i) = (x_j \cdot B^\top B x_i)$, we can precompute $B^\top B x_i$ for all SVs $x_i$ and thus obtain a
machine (with modified SVs) which is as fast as a standard SV machine (cf. (4.1)).
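In the linear case, the whole recipe, (4.6), (4.10) and (4.9), reduces to a few lines of linear algebra; in the sketch below the derivatives are approximated by finite differences, the transformation L_t is passed in as a function, and the names and step size are illustrative.

```python
import numpy as np

def tangent_covariance(X, L, t=1.0):
    """Tangent covariance matrix (4.6), with the derivative approximated by
    the finite difference (L(x, t) - x) / t."""
    T = np.array([(L(x, t) - x) / t for x in X])
    return T.T @ T / len(X)

def preprocessing_matrix(C, lam):
    """B = C_lambda^{-1/2} with C_lambda = (1 - lambda) C + lambda I,
    cf. (4.10) and (4.9), computed via the Eigendecomposition (4.13)."""
    C_lam = (1.0 - lam) * C + lam * np.eye(C.shape[0])
    d, S = np.linalg.eigh(C_lam)
    return S @ np.diag(1.0 / np.sqrt(d)) @ S.T

# Usage sketch: B = preprocessing_matrix(tangent_covariance(X, L), lam=0.5),
# then train a standard SV machine on the preprocessed patterns B x_i.
```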
In the nonlinear case, where kernel functions $k(x, y)$ are substituted for every
occurrence of a dot product, the above analysis of transformation invariance leads to
the regularizer
\[
\frac{1}{\ell} \sum_{j=1}^{\ell} \left( \sum_{i=1}^{\ell} \alpha_i y_i\, \partial_1 k(B x_j, B x_i) \cdot B \frac{\partial}{\partial t}\Big|_{t=0} \mathcal{L}_t x_j \right)^{2}.
\tag{4.11}
\]
The derivative of $k$ must be evaluated for specific kernels, e.g. for $k(x, y) = (x \cdot y)^d$,
$\partial_1 k(x, y) = d \cdot (x \cdot y)^{d-1} \cdot y^\top$. To obtain a kernel-specific constraint on the matrix $B$,
one would need to equate the result with the quadratic term in the nonlinear objective
function,
\[
\sum_{i,k} \alpha_i y_i\, \alpha_k y_k\, k(B x_i, B x_k).
\tag{4.12}
\]

Relationship to Principal Component Analysis. Let us now provide some interpretation of (4.9) and (4.6). The tangent vectors $\pm \frac{\partial}{\partial t}\big|_{t=0} \mathcal{L}_t x_j$ have zero mean, thus $C$ is a
sample estimate of the covariance matrix of the random vector $s \cdot \frac{\partial}{\partial t}\big|_{t=0} \mathcal{L}_t x$, $s \in \{\pm 1\}$
being a random sign. Based on this observation, we call $C$ (4.6) the Tangent Covariance Matrix of the data set $\{x_i : i = 1, \ldots, \ell\}$ with respect to the transformations
$\mathcal{L}_t$.

Being positive definite (it is understood that we use $C_\lambda$ if $C$ is not definite, cf. (4.10)), $C$ can be diagonalized, $C = S D S^\top$, with an orthogonal matrix $S$ consisting of $C$'s Eigenvectors and a diagonal matrix $D$ containing the
corresponding positive Eigenvalues. Then we can compute
\[
B = C^{-\frac{1}{2}} = S D^{-\frac{1}{2}} S^\top,
\tag{4.13}
\]
where $D^{-\frac{1}{2}}$ is the diagonal matrix obtained from $D$ by taking the inverse square
roots of the diagonal elements. Since the dot product is invariant under orthogonal
transformations, we may drop the leading $S$ and (4.1) becomes

\[
g(x_j) = \sum_{i=1}^{\ell} \alpha_i y_i\, \big( D^{-\frac{1}{2}} S^\top x_j \cdot D^{-\frac{1}{2}} S^\top x_i \big) + b.
\tag{4.14}
\]


A given pattern $x$ is thus first transformed by projecting it onto the Eigenvectors of
the tangent covariance matrix $C$, which are the rows of $S^\top$. The resulting feature
vector is then rescaled by dividing by the square roots of $C$'s Eigenvalues.³ In other
words, the directions of main variance of the random vector $\frac{\partial}{\partial t}\big|_{t=0} \mathcal{L}_t x$ are scaled back,
thus more emphasis is put on features which are less variant under $\mathcal{L}_t$. For example, in
image analysis, if the $\mathcal{L}_t$ represent translations, more emphasis is put on the relative
proportions of ink in the image rather than the positions of lines. The PCA interpretation of our preprocessing matrix suggests the possibility to regularize and reduce
dimensionality by discarding part of the features, as is common usage when doing
PCA. As an aside, note that the resulting matrix will still satisfy (4.8).⁴
Combining the PCA interpretation with the considerations following (4.1) leads
to an interesting observation: by computing the tangent covariance matrix from the
SVs only, rather than from the full data set, it can be rendered a task-dependent
covariance matrix. Although the summation in (4.6) does not take into account class
labels $y_i$, it then implicitly depends on the task to be solved via the SV set, which
is computed for the given task. Thus, it allows the extraction of features which are
invariant in a task-dependent way: it does not matter whether features for "easy"
patterns change with transformations; it is more important that the "hard" patterns,
close to the decision boundary, lead to invariant features.

The Nonlinear Tangent Covariance Matrix. We are now in a position to describe
a feasible way to generalize to the nonlinear case. To this end, we use kernel
principal component analysis (Chapter 3). This technique allows us to compute principal components in a space $F$ nonlinearly related to input space. The kernel function
$k$ plays the role of the dot product in $F$, i.e. $k(x, y) = (\Phi(x) \cdot \Phi(y))$. To generalize
(4.14) to the nonlinear case, we compute the tangent covariance matrix $C$ (Eq. 4.6)
in feature space $F$, and its projection onto the subspace of $F$ which is given by the
linear span of the tangent vectors in $F$.
³ As an aside, note that our goal to build invariant SV machines has thus serendipitously provided
us with an approach for an open problem in SV learning, namely the one of scaling: in SV machines,
there has so far been no way of automatically assigning different weights to different directions in input
space; in a trained SV machine, the weights of the first layer (the SVs) form a subset of the training
set. Choosing these Support Vectors from the training set only gives rather limited possibilities for
appropriately dealing with different scales in different directions of input space.
⁴ To see this, first note that if $B$ solves $B^\top B C B^\top B = B^\top B$, and $B$'s polar decomposition is
$B = U B_s$, with $U U^\top = 1$ and $B_s = B_s^\top$, then $B_s$ also solves it. Thus, we may restrict ourselves to
symmetrical solutions. For our choice $B = C^{-\frac{1}{2}}$, $B$ commutes with $C$, hence they can be diagonalized
simultaneously. In this case, $B^2 C B^2 = B^2$ clearly can also be satisfied by any matrix which is obtained
from $B$ by setting an arbitrary selection of Eigenvalues to 0 (in the diagonal representation).


There, the considerations of the linear case apply. The whole procedure reduces to computing dot products in $F$, which can be
done using $k$, without explicitly mapping into $F$.
In rewriting (4.6) for the nonlinear case, we substitute finite differences, with $t > 0$,
for derivatives:
\[
C := \frac{1}{\ell t^2} \sum_{j=1}^{\ell} \big( \Phi(\mathcal{L}_t x_j) - \Phi(x_j) \big) \big( \Phi(\mathcal{L}_t x_j) - \Phi(x_j) \big)^{\!\top}.
\tag{4.15}
\]
For the sake of brevity, we have omitted the summands corresponding to derivatives
in the opposite direction, which ensure that the data set is centered. For the final
tangent covariance matrix $C$, they do not make a difference, as the two negative signs
cancel out.
In high-dimensional feature spaces, $C$ cannot be computed explicitly. In complete
analogy to Chapter 3, we compute another matrix whose Eigenvalues and Eigenvectors
will allow us to extract features corresponding to Eigenvectors and Eigenvalues of
$C$. This is done by taking dot products from both sides with $\Phi(\mathcal{L}_t x_i) - \Phi(x_i)$ (the
Eigenvectors in $F$ can be expanded in terms of the latter, by the same argument as
the one leading to (3.10)). Defining
\[
K_{ij} = k(x_i, x_j),
\tag{4.16}
\]
\[
K^{t}_{ij} = k(x_i, \mathcal{L}_t x_j) + k(\mathcal{L}_t x_i, x_j),
\tag{4.17}
\]
and
\[
K^{tt}_{ij} = k(\mathcal{L}_t x_i, \mathcal{L}_t x_j),
\tag{4.18}
\]

we get
\[
\big( \Phi(\mathcal{L}_t x_i) - \Phi(x_i) \big)^{\!\top} C\, \big( \Phi(\mathcal{L}_t x_k) - \Phi(x_k) \big)
= \frac{1}{\ell t^2} \sum_{j=1}^{\ell} \big( K^{tt}_{ij} - K^{t}_{ij} + K_{ij} \big)\big( K^{tt}_{jk} - K^{t}_{jk} + K_{jk} \big)
= \frac{1}{\ell t^2} \big( (K^{tt} - K^{t} + K)^2 \big)_{ik}.
\tag{4.19}
\]
Using (4.19), and the Eigenvector expansions
\[
V = \sum_{k=1}^{\ell} \beta_k \big( \Phi(\mathcal{L}_t x_k) - \Phi(x_k) \big),
\tag{4.20}
\]

the Eigenvalue problem that we need to solve (cf. (3.9)),

\[
\lambda\, \big( \Phi(\mathcal{L}_t x_i) - \Phi(x_i) \big)^{\!\top} \sum_{k=1}^{\ell} \beta_k \big( \Phi(\mathcal{L}_t x_k) - \Phi(x_k) \big)
= \big( \Phi(\mathcal{L}_t x_i) - \Phi(x_i) \big)^{\!\top} C \sum_{k=1}^{\ell} \beta_k \big( \Phi(\mathcal{L}_t x_k) - \Phi(x_k) \big),
\tag{4.21}
\]
then takes the form
\[
\lambda\, (K^{tt} - K^{t} + K)\, \beta = \frac{1}{\ell t^2}\, (K^{tt} - K^{t} + K)^2\, \beta.
\tag{4.22}
\]
To find solutions of (4.22), we solve the Eigenvalue problem (cf. (3.14))⁵
\[
\lambda\, \beta = \frac{1}{\ell t^2}\, (K^{tt} - K^{t} + K)\, \beta.
\tag{4.23}
\]

Normalization of each Eigenvector (4.20) is carried out by requiring $(V \cdot V) = 1$, which,
as in (3.16), translates into
\[
\lambda\, (\beta \cdot \beta) = 1,
\tag{4.24}
\]
using the corresponding Eigenvalue $\lambda$.
Feature extraction for a test point $x$ is done by computing the projection of $\Phi(x)$
onto the Eigenvectors $V$,
\[
V^{\top} \Phi(x) = \sum_{k=1}^{\ell} \beta_k \big( \Phi(\mathcal{L}_t x_k) - \Phi(x_k) \big)^{\!\top} \Phi(x)
= \sum_{k=1}^{\ell} \beta_k \big( k(\mathcal{L}_t x_k, x) - k(x_k, x) \big).
\tag{4.25}
\]

In Appendix D.3, we give an alternative justification of this procedure, which
naturally arises from requiring invariance in feature space, without the need for a
PCA interpretation.
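Putting (4.16)-(4.18) and (4.23)-(4.25) together, the kernelized procedure is again plain linear algebra on ℓ x ℓ matrices. The following sketch assumes a kernel function k and a transformation function L; all names are illustrative.

```python
import numpy as np

def invariant_kernel_features(X, L, k, t, n_components):
    """Sketch of the nonlinear tangent covariance construction: build the
    Gram matrices (4.16)-(4.18), solve (4.23), normalize via (4.24), and
    return a projection function implementing (4.25)."""
    ell = len(X)
    LX = np.array([L(x, t) for x in X])
    K   = np.array([[k(xi, xj) for xj in X] for xi in X])
    Kt  = np.array([[k(xi, Lxj) + k(Lxi, xj) for xj, Lxj in zip(X, LX)]
                    for xi, Lxi in zip(X, LX)])
    Ktt = np.array([[k(Lxi, Lxj) for Lxj in LX] for Lxi in LX])
    M = (Ktt - Kt + K) / (ell * t**2)
    lam, beta = np.linalg.eigh(M)
    order = np.argsort(lam)[::-1][:n_components]
    lam, beta = lam[order], beta[:, order]
    beta = beta / np.sqrt(np.clip(lam, 1e-12, None))       # lambda (beta.beta) = 1

    def transform(x):
        # (4.25): sum_k beta_k (k(L_t x_k, x) - k(x_k, x))
        diff = np.array([k(Lxk, x) - k(xk, x) for xk, Lxk in zip(X, LX)])
        return diff @ beta

    return lam, transform
```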


4.3 Image Locality and Local Feature Extractors


By using a kernel $k(x, y) = (x \cdot y)^d$, one implicitly constructs a decision boundary
in the space of all possible products of $d$ pixels. This may not be desirable, since in
natural images, correlations over short distances are much more reliable as features
than long-range correlations are. To take this into account, we define a kernel $k_p^{d_1, d_2}$
as follows (cf. Fig. 4.3):
1. compute a third image $z$, defined as the pixel-wise product of $x$ and $y$
2. sample $z$ with pyramidal receptive fields of diameter $p$, centered at all locations
$(i, j)$, to obtain the values $z_{ij}$
3. raise each $z_{ij}$ to the power $d_1$, to take into account local correlations within the
range of the pyramid
4. sum $z_{ij}^{d_1}$ over the whole image, and raise the result to the power $d_2$ to allow for
long-range correlations of order $d_2$
⁵ If we expand $V$ in a different set of vectors, we instead arrive at a problem of simultaneous
diagonalization of two matrices.



FIGURE 4.3: Kernel utilizing local correlations in images. To compute $k(x, y)$ for two
images $x$ and $y$, we sum over products between corresponding pixels of the two images
in localized regions (in the figure, this is indicated by dot products $(\cdot\,\cdot\,\cdot)$), weighted by
pyramidal receptive fields. To the outputs, a first nonlinearity in form of an exponent $d_1$
is applied. The resulting numbers for all patches (only two are displayed) are summed,
and the $d_2$-th power of the result is taken as the value $k(x, y)$. This kernel corresponds
to a dot product in a polynomial space which is spanned mainly by localized correlations
between pixels (see Sec. 4.3).


The resulting kernel will be of order $d_1 \cdot d_2$; however, it will not contain all possible
correlations of $d_1 \cdot d_2$ pixels.
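A direct transcription of the four steps might look as follows; a tent-shaped weighting stands in for the pyramidal receptive fields, the patch is restricted to positions with full support, and p is assumed odd. Everything here is an illustrative sketch rather than the kernel used in the experiments.

```python
import numpy as np

def local_kernel(x, y, side, p, d1, d2):
    """Locality kernel k_p^{d1,d2} of Sec. 4.3 for flattened side x side images."""
    z = (x * y).reshape(side, side)            # 1. pixel-wise product image
    r = p // 2                                 # p assumed odd: p = 2r + 1
    w1 = 1.0 - np.abs(np.arange(-r, r + 1)) / (r + 1)
    w = np.outer(w1, w1)                       # pyramidal (tent) receptive field
    total = 0.0
    for i in range(r, side - r):               # 2. sample z at all valid locations
        for j in range(r, side - r):
            zij = np.sum(w * z[i - r:i + r + 1, j - r:j + r + 1])
            total += zij ** d1                 # 3. local correlations of order d1
    return total ** d2                         # 4. long-range correlations, order d2
```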

4.4 Experimental Results


4.4.1 Virtual Support Vectors
USPS Digit Recognition. The first set of experiments was conducted on the USPS
database of handwritten digits (Appendix C). This database has been used extensively
in the literature, with a LeNet1 Convolutional Network achieving a test error rate of
5.0% (LeCun et al., 1989). As in Sec. 2.3, we used a value of 10 for the soft margin
constant.
Virtual Support Vectors were generated for the set of all different Support Vectors
of the ten classifiers. Alternatively, one can carry out the procedure separately for
the ten binary classifiers, thus dealing with smaller training sets during the training of
the second machine.
112

CHAPTER 4. PRIOR KNOWLEDGE IN SUPPORT VECTOR MACHINES

Table 4.1 shows that incorporating only translational invariance
already improves performance significantly, from 4.0% to 3.2% error rate. For other
types of invariances (Fig. 4.4), we also found improvements, albeit smaller ones: generating Virtual Support Vectors by rotation or by the line thickness transformation
of Drucker, Schapire, and Simard (1993), we constructed polynomial classifiers with
3.7% error rate (in both cases).
Note, moreover, that generating Virtual examples from the full database rather
than just from the SV sets did not improve the accuracy, nor did it enlarge the SV set of
the final classifier substantially. This finding was reproduced for the Virtual SV system
mentioned in Sec. 2.5.3: in that case, similar to Table 4.1, generating Virtual examples
from the full database led to identical performance, and only slightly increased SV set
size (861 instead of 806). From this, we conclude that for the considered recognition
task, it is sufficient to generate Virtual examples only from the SVs; Virtual examples
generated from the other patterns do not add much useful information.

MNIST Digit Recognition. The larger a database, the more information about
invariances of the decision function is already contained in the differences between
patterns of the same class.


TABLE 4.2: Application of the Virtual SV method to the MNIST database. Virtual SVs
were generated by translating the original SVs in all four principal directions (by 1 pixel).
Results are given for the original SV machine, and two VSV systems utilizing different
kernel degrees; in all cases, we used a value of 10 (cf. (2.19)). SV: degree 5 polynomial SV
classifier; VSV1: VSV machine with degree 5 polynomial kernel; VSV2: same with degree
9 kernel. The first table gives the performance: for the ten binary recognizers, as numbers
of errors; for multi-class classification (T1), in terms of error rates (in %), both on the
60000 element test set. The second multi-class error rate (T2) was computed by testing
only on a 10000 element subset of the full 60000 element test set. These results are given
for reference purposes; they are the ones usually reported in MNIST performance studies.
The second table gives numbers of SVs for all ten binary digit recognizers.

                        Errors (binary recognizers)                      10-class
system      0     1     2     3     4     5     6     7     8     9     T1    T2
SV        131    97   243   240   212   241   195   259   343   409    1.6   1.4
VSV1       95    84   186   176   173   171   127   217   233   289    1.1   1.0
VSV2       81    66   164   146   141   147   119   179   196   254    1.0   0.8

                              Support Vectors
system      0     1     2     3     4     5     6     7     8     9
SV       1206   757  2183  2506  1784  2255  1347  1712  3053  2720
VSV1     2938  1887  5015  4764  3983  5235  3328  3968  6978  6348
VSV2     3941  2136  6598  7380  5127  6466  4128  5014  8701  7720


To show that it is nevertheless possible to improve classification accuracies with our technique, we applied the method to the MNIST database
(Appendix C) of 60000 handwritten digits. This database has become the standard
for performance comparisons at AT&T Bell Labs; the error rate record of 0.7% is held
by a boosted LeNet4 (Bottou et al., 1994; LeCun et al., 1995), i.e. by an ensemble
of learning machines. The best single machine in the performance comparisons so far
was a LeNet5 convolutional neural network (0.9%); other high performance systems
include Tangent Distance nearest neighbour classifiers (1.1%), and LeNet4 with a last
layer using methods of local learning (1.1%, cf. Bottou and Vapnik, 1992).
Using Virtual Support Vectors generated by 1-pixel translations, we improved a
degree 5 polynomial SV classifier from 1.4% to 1.0% error rate on the 10000 element
test set (Table 4.2). In this case, we applied our technique separately for all ten Support
Vector sets of the binary classifiers (rather than for their union) in order to avoid
having to deal with large training sets in the retraining stage. Note, moreover, that
for the MNIST database, we did not compare results of the VSV technique to those for
generating Virtual examples from the whole database: the latter is computationally
exceedingly expensive, as it entails training on a very large training set. We did,
however, make a comparison for the small MNIST database (Appendix C). There, a
degree 5 polynomial classifier was improved from 3.8% to 2.5% error by the Virtual SV
method, with an increase of the average SV set sizes from 324 to 823. By generating
Virtual examples from the full training set, and retraining on these, we obtained a
system which had slightly more SVs (939), but an unchanged error rate.
After retraining, the number of SVs more than doubled (Table 4.2). Thus, although
the training sets for the second set of binary classifiers were substantially smaller than
the original database (for four Virtual SVs per SV, four times the size of the original
SV sets, in our case amounting to around $10^4$), we concluded that the amount of
data in the region of interest, close to the decision boundary, had more than doubled.
Therefore, we reasoned that it should be possible to use a more complex decision
function in the second stage (note that the risk bound (1.5) depends on the ratio of
VC-dimension and training set size). Indeed, using a degree 9 polynomial led to an
error rate of 0.8%, very close to the record performance of 0.7%.
Another interesting performance measure is the rejection error rate, defined as the
percentage of patterns that would have to be rejected to attain a specified error rate (in
the benchmark studies of Bottou et al. (1994) and LeCun et al. (1995), 0.5%). Note
that this percentage is computed on the test set. In our case, using the confidence
measure of Sec. 2.1.6, it was measured to be 0.9%, realizing a large improvement
compared to the original SV system (2.4%). In the above benchmark studies, only the
boosted LeNet4 ensemble performed better (0.5%).
Further improvements can possibly be achieved by combining different types of
invariances. Another intriguing extension of the scheme would be to use techniques
based on image correspondence (e.g. Vetter and Poggio, 1997) to extract invariance
transformations from the training set. Those transformations can then be used to
generate Virtual Support Vectors.⁶
⁶ Together with Thomas Vetter, we have recently started working on this approach.


FIGURE 4.5: Virtual SVs in gender classification. A: 2-D image of a 3-D head model (from
the MPI head database (Troje and Bulthoff, 1996; Vetter and Troje, 1997)); B: 2-D image
of the rotated 3-D head; C: artificial image, generated from A using the assumption that it
belongs to a cylinder-shaped 3-D object (rotation by the same angle as B).


TABLE 4.3: Numbers of test errors for gender classification in novel pose, using Virtual
SVs (qualitatively similar to Fig. 4.5). The training set contained 100 views of male and
female heads (divided 49:51), taken at an azimuth of 24°, downsampled to 32 × 32. The
test set contained 100 frontal views of the same heads. We used polynomial SV classifiers
of different degrees, generating one virtual SV per original SV. Clearly, training and test
views are differently distributed; however, the amount of rotation (24°) was known to the
classifier in the sense that it was used for generating the Virtual SVs (Fig. 4.5): first, a
simplified head model was inferred by averaging over in-depth revolutions of all the 2-D
views. VSVs were generated by projecting the original SVs onto the head model, then
rotating the head to the frontal view, and computing the new 2-D view.


                                      degree
prior knowledge                1     2     3     4     5
no virtual SVs                25    24    23    21    19
virtual SVs from 3D model     11    10    10     9    10

Face Classification. Certain types of transformations, such as the translations and
rotations used above, apply equally well to object recognition as they do to character recognition. There are, however, types of transformations which are specific to the class of
images considered (cf. Sec. 1.1). For instance, line thickness transformations (Fig. 4.4)
are specific to character recognition. To provide an example of Virtual SVs which are
specific to object recognition, we generated virtual SVs corresponding to object rotations in depth, by making assumptions about the 3-D shape of objects. Clearly, such an
approach would have a hard time if applied to complex objects such as chairs (Appendix A).
For human heads, however, it is possible to formulate 2-D image transformations which
can be applied to generate approximate novel views of heads (Fig. 4.5). Using these
views improved accuracies in a small gender classification experiment. Table 4.3 gives
details and results of the experiment.

TABLE 4.4: Test error rates for two object recognition databases, for views of resolution
16 × 16, using different types of approximate invariance transformations to generate Virtual
SVs, and polynomial kernels of degree 20 (cf. Table 2.1). The second training run in the
Virtual SV systems was done on the original SVs and the generated Virtual SVs. The
training sets with 25 and 89 views per object are regularly spaced; for them, mirroring does
not provide additional information. The interesting case is the one where we trained on the
100-view-per-object sets. Here, a combination of virtual SVs from mirroring and rotation
substantially improves accuracies on both databases.

database:                          entry level             animal
training set: views per object    25    89   100      25    89   100
Virtual SVs
none (orig. system)             13.0   1.7   4.8    13.0   1.8   2.4
mirroring                       13.6   1.8   4.8    14.2   2.8   3.2
translations                    16.4   1.6   4.3    17.1  11.1   4.8
rotations                        9.0   0.7   3.0    10.3   1.8   2.5
rotations & mirroring            9.0   0.7   1.7     9.6   0.9   1.7

Discrete Symmetries in Object Recognition. As mentioned above, rigid transformations of 3-D objects do not in general correspond to simple transformations of the produced 2-D images (cf. Sec. 4.4.1). For the MPI object databases
(Appendix A), however, there exists a type of invariance transformation which can
easily be computed from the images: as all the objects used are (approximately) bilaterally symmetric, we can easily produce another valid view of the same object, with
a different viewing angle, by performing a mirror operation with respect to a vertical
axis in the center of the images, say (Vetter, Poggio, and Bulthoff, 1994). If the objects were exactly symmetric, we would not expect any additional information to be
gained in the case of the regularly spaced object sets (25 and 89 views per object),
as in these the snapshots are already sampled symmetrically around the zero view
direction, which in most cases coincided with the symmetry plane. The slight decrease
in performance in that case (Table 4.4) indicates that for some objects, the symmetry
only holds approximately (for snapshots, see Appendix A).
To get more robust results, we tried combining this type of invariance transformation
with other types. As in the case of character recognition, we simply used translations
(by 1 pixel in all four directions) and image-plane rotations (by 10 degrees in both
directions). Even though these transformations are but very crude approximations of
transformations which occur when a 3-D object is rotated in space, they did in some
cases yield significant performance improvements.⁷
⁷ The following may serve as a partial explanation why rotations were more useful than translations.
First, different snapshots at large elevations can be transformed into each other by an approximate
image plane rotation; second, image plane rotations retain the centering which was applied to the
original images. Both points suggest that virtual examples generated by rotations should be more
"realistic" than those generated by translations.

116

CHAPTER 4. PRIOR KNOWLEDGE IN SUPPORT VECTOR MACHINES

To examine the effect of the mirror symmetry Virtual SVs, we need to focus on the
non-regularly spaced training set with 100 views per object. There, by far the best
performance for both the entry level and the animal database was obtained by using
both mirroring and rotations (Table 4.4).
TABLE 4.5: Speed improvement using the Reduced Set method. The second through
fourth columns give numbers of errors on the 10000 element MNIST test set for the
original system, the Virtual Support Vector system, and the reduced set system (for the
10-class classifiers, the error is given in %). The last three columns give, for each system,
the number of vectors whose dot product must be computed in the test phase.

Digit       SV err   VSV1 err   RS err      SV #   VSV1 #   RS #
0               17         15       18      1206     2938     59
1               15         13       12       757     1887     38
2               34         23       30      2183     5015    100
3               32         21       27      2506     4764     95
4               30         30       35      1784     3983     80
5               29         23       27      2255     5235    105
6               30         18       24      1347     3328     67
7               43         39       57      1712     3968     79
8               47         35       40      3053     6978    140
9               56         40       40      2720     6348    127
10-class      1.4%       1.0%     1.1%

Virtual SV Combined with Reduced Set. Apart from the increase in overall training
time (by a factor of two, in our experiments), the VSV technique has the computational
disadvantage that many of the Virtual Support Vectors become Support Vectors for
the second machine, increasing the cost of evaluating the decision function (2.25).
However, the latter problem can be solved with the Reduced Set (RS) method (Burges,
1996, see Appendix D.1.1), which reduces the complexity of the decision function
representation by approximating it in terms of fewer vectors. In a study combining
the VSV and RS methods, we achieved a factor of fifty speedup in the test phase over the
Virtual Support Vector machine, with only a small decrease in performance (Burges
and Scholkopf, 1997). We next briefly report the results of this study. The RS results
reported were obtained by Chris Burges.
As a starting point for the RS computation, we used the VSV1 machine (Table 4.2), which achieved 1.0% error rate on the 10000 element MNIST test set.⁸

the original images. Both points suggest that virtual examples generated by rotations should be more
\realistic" than those generated by translations.
8 At the time when the described study was carried out, VSV1 was our best system; VSV2 was
not available yet.

117

4.4. EXPERIMENTAL RESULTS

improvement in accuracy compared to the SV machine (Table 4.2) comes at a cost in


classi cation speed of approximately a factor of 2. Furthermore, the speed of SV was
comparatively slow to start with (cf. LeCun et al., 1995), requiring approximately 14
million multiply adds for one classi cation. In order to become competitive with systems with comparable accuracy (LeCun et al., 1995), we need approximately a factor
of fty improvement in speed. We therefore approximated VSV1 with a reduced set
system RS with a factor of fty fewer vectors than the number of Support Vectors in
VSV1.
Table 4.5 compares results on the 10000 element test set for the three systems.
Overall, the SV machine performance of 1.4% error is improved to 1.1%, with a machine requiring a factor of 22 fewer multiply adds (RS). For details on the computation
of the RS solution, see (Burges and Scholkopf, 1997).

4.4.2 Invariant Hyperplane Method

Re

fe

re

nc

es

In the experiments exploring the invariant hyperplane method (Sec. 4.2.2), we used the
small MNIST database (Appendix C). We start by giving some baseline classi cation
results.
Using a standard linear SV machine (i.e. a separating hyperplane, Sec. 2.1.3), we
obtain a test error rate of 9:8%; by using a polynomial kernel of degree 4, this drops
to 4:0%. In all of the following experiments, we use degree 4 kernels of various types.
The number 4 was chosen as it can be written as a product of two integers, thus we
could compare results to a kernel kpd1 ;d2 with d1 = d2 = 2 (cf. sections 4.3 and 4.4.3).
For the considered classi cation task, results for higher polynomial degrees are very
similar.
In a series of experiments with a homogeneous polynomial kernel k(x; y) = (x  y)4,
using preprocessing with Gaussian smoothing kernels of standard deviation in the
range 0:1; 0:2; : : :; 1:0, we obtained error rates which gradually increased from 4:0%
to 4:3%. We concluded that no improvement of the original 4:0% performance was
possible by a simple smoothing operation.

Invariant Hyperplanes Results. Table 4.6 reports results obtained by preprocessing


all patterns with B (cf. (4.9)), choosing di erent values of  (cf. Eq. (4.10)). In the
TABLE 4.6: Classi cation error rates for modifying the kernel k(x; y) = (x  y)4 with the
1
invariant hyperplane preprocessing matrix B = C? 2 ; cf. Eqs. (4.9) { (4.10). Enforcing
invariance with  = 0:2; 0:3; : : :; 0:9 leads to improvements over the original performance
( = 1).


0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
error rate in % 4.2 3.8 3.6 3.6 3.7 3.8 3.8 3.9 3.9 4.0

118

CHAPTER 4. PRIOR KNOWLEDGE IN SUPPORT VECTOR MACHINES

FIGURE
4.6: The rst pattern in the small MNIST database, preprocessed with B =
? 21
C (cf. equations (4.9) { (4.10)), enforcing various amounts of invariance. Top row:
 = 0:1; 0:2; 0:3; 0:4; bottom row:  = 0:5; 0:6; 0:7; 0:8. For some values of , the
preprocessing resembles a smoothing operation, however, it leads to higher classi cation
accuracies (see Sec. 4.4.2) than the latter.

re

nc

es

experiments, the patterns were rst rescaled to have entries in [0; 1], then B was
computed, using horizontal and vertical translations, and preprocessing was carried
out; nally, the resulting patterns were scaled back again (for snapshots of the resulting
patterns, see Fig. 4.6). The scaling was done to ensure that patterns and derivatives
lie in comparable regions of RN (note that if the pattern background level is a constant
?1, then its derivative is 0). The results show that even though (4.6) was derived for
the linear case, it leads to improvements in the nonlinear case (here, for a degree 4
polynomial).

fe

Dimensionality Reduction. The above [0; 1] scaling operation is ane rather than

Re

linear, hence the argument leading to (4.14) does not hold for this case. We thus only
report results on dimensionality reduction for the case where the data is kept in [0; 1]
scaling during the whole procedure. Dropping principal components which are less
important leads to substantial improvements (Table 4.7); cf. the explanation following
(4.14)).
The results in Table 4.7 are somewhat distorted by the fact that the polynomial
kernel is not translation invariant, and performs poorly when none of the principal
TABLE 4.7: Dropping directions corresponding to small Eigenvalues of C , i.e. dropping
less important principal components (cf. (4.14)), leads to substantial improvements. All
results given are for the case  = 0:4 (cf. Table 4.6); degree 4 homogeneous polynomial
kernel.

PCs discarded 0 50 100 150 200 250 300 350


error rate in % 8.7 5.4 4.9 4.4 4.2 3.9 3.7 3.9

119

4.4. EXPERIMENTAL RESULTS

components are discarded. Hence this result should not be compared to the performance of the polynomial kernel on the data in [?1; 1] scaling. (Recall that we obtained
3.6% in that case, for  = 0:4.) In practice, of course, we may choose the scaling of the
data as we like, in which case it would seem pointless to use a method which is only
applicable for a rather disadvantageous representation of the data. However, nothing
prevents us from using a translation invariant kernel. We opted for a radial basis function kernel (2.27) with c = 0:5. On the [?1; 1] data, for  = 0:4, this leads to the same
performance as the degree 4 polynomial, 3.6% (without invariance preprocessing, i.e.
for  = 1, the performance is 3.9%). To get the identical system on [0; 1] data, the
RBF width was rescaled accordingly, to c = 0:125. Table 4.8 shows that discarding
principal components can further improve performance, up to 3.3%.
TABLE 4.8: Dropping directions corresponding to small Eigenvalues of C , i.e. dropping
less important principal components (cf. (4.14)), for the translation invariant RBF kernel
(see text). All results given are for the case  = 0:4 (cf. Table 4.6).

es

PCs discarded 0 50 100 150 200 250 300 350


error rate in % 3.6 3.6 3.6 3.5 3.5 3.4 3.3 3.6

nc

4.4.3 Kernels Using Local Correlations


Character Recognition. As in Sec. 4.4.2, the present results were obtained on the

Re

fe

re

small MNIST database (Appendix C). As a reference result, we use the degree 4
polynomial SV machine, performing at 4:0% error (Sec. 4.4.2). To exploit locality in
images, we used pyramidal receptive eld kernel kpd1 ;d2 with diameter p = 9 (cf. Sec. 4.3)
and d1  d2 = 4, i.e. degree 4 polynomials kernels which do not use all products of 4
pixels (Sec. 4.2.2). For d1 = d2 = 2, we obtained an improved error rate of 3:1%,
another degree 4 kernel with only local correlations (d1 = 4; d2 = 1) led to 3:4%
(Table 4.9).
Albeit better than the 4:0% for the degree 4 homogeneous polynomial, this is
still worse than the Virtual SV result: generating Virtual SVs by image translations,
the latter led to 2:8%. As the two methods, however, exploit di erent types of prior
knowledge, it could be expected that combining them leads to still better performance;
and indeed, this yielded the best performance of all (2:0%), halving the error rate of
the original system.
For the purpose of benchmarking, we also ran our system on the USPS database.
In that case, we obtained the following test error rates: SV with degree 4 polynomial
kernel 4:2% (Table 2.4), Virtual SV (same kernel) 3:5%, SV with k72;2 3.6% (for the
smaller USPS images, we used a k7 kernel rather than k9), Virtual SV with k72;2 3:0%.
The latter compares favourably to almost all known results on that database, and is
second only to a memory-based tangent-distance nearest neighbour classi er at 2:6%
(Simard, LeCun, and Denker, 1993).

120

CHAPTER 4. PRIOR KNOWLEDGE IN SUPPORT VECTOR MACHINES

TABLE 4.9: Summary: error rates for various methods of incorporating prior knowledge,
on the small MNIST database (Appendix C). In all cases, degree 4 polynomial kernels were
used, either of the local type (Sec. 4.3), or (by default) of the complete polynomial type
(2.26). In all cases, we used = 10 (cf. (2.19)).

Classi er
Test Error / %
SV
4.0
Virtual SV (Sec. 4.2.1), with translations
2.8
Invariant hyperplane (Sec. 4.2.2),  = 0:4
3.6
same, on rst 100 principal components (Table 4.7)
3.7
2;2
semi-local kernel k9 (Sec. 4.4.3)
3.1
4;1
purely local kernel k9 (Sec. 4.4.3)
3.4
2;2
Virtual SV with k9
2.0

Object Recognition. The above results have been con rmed on the two object re-

Re

fe

re

nc

es

cognition databases used in Sec. 2.2.1 (cf. Appendix A). As in the case of the small
MNIST database, we used k9d1 ;d2 . In the present case, we chose d1 = d2 = 3, which
yields a degree 9 (= 3  3) polynomial classi er which di ers from a standard polynomial (2.26) in that it does not utilize all products of 9 pixels, but mainly local
ones. Comparing the results to those obtained with standard polynomials of equal degree shows that this pre-selection of useful features signi cantly improves recognition
results (Table 4.10).
As in the case of digit recognition, we combined this method with the Virtual SV
method (Sec. 4.2.1). Based on the fact that prior knowledge about image locality
is di erent from prior knowledge on invariances, we expected the possibility to get
further improvements. We used the same types of Virtual SVs as in Sec. 4.4.1. The
results (Table 4.11) further improve upon Table 4.10, con rming the digit recognition
ndings reported above. In 4 of 6 cases, the resulting classi ers are better than those
of Table 4.4.9

4.5 Discussion
For Support Vector learning machines, invariances can readily be incorporated by generating virtual examples from the Support Vectors, rather than from the whole training
set. The method yields a signi cant gain in classi cation accuracy at a moderate cost
in time: it requires two training runs (rather than one), and it constructs classi cation
rules utilizing more Support Vectors, thus slowing down classi cation speed (cf. (2.25))
| in our case, both points amounted to a factor of about 2. Given that Support Vector
9 Note

that in Table 4.4, the VSV method was used for degree 20 kernels, which on the object
recognition tasks does far better than degree 9, cf. Table 2.1.

121

4.5. DISCUSSION

TABLE 4.10: Test error rates for two object recognition tasks, comparing kernels local in
the image to complete polynomial kernels. Local kernels of degree 9 outperform complete
polynomial kernels of corresponding degree. Moreover, they performed at least as well as
the best polynomial classi er out of all degrees in f1; 3; 6; 9; 12; 15; 20; 25g (cf. Table 2.1).

kernel: degree 9 polyn. best polynomial k93;3 (cf. Sec. 4.3)


13.9
2.0
3.5
16.7
2.7
4.8

13.0
1.8
2.4
15.4
2.2
4.0

12.0
1.8
2.0
15.0
2.1
3.9

animals:
25 grey scale
89 grey scale
100 grey scale
25 silhouettes
89 silhouettes
100 silhouettes

14.8
2.5
5.2
17.0
2.8
6.3

13.0
1.7
4.4
15.6
2.2
5.2

12.0
1.6
4.0
15.2
2.0
4.9

re

nc

es

entry level:
25 grey scale
89 grey scale
100 grey scale
25 silhouettes
89 silhouettes
100 silhouettes

Re

fe

TABLE 4.11: Test error rates for two object recognition databases, using di erent types of
approximate invariance transformations to generate Virtual SVs (as in Table 4.4), and local
polynomial kernels k93;3 of degree 9 (cf. Sec. 4.3, Table 4.10, Table 4.4, and Table 2.1).
The second training run in the Virtual SV systems was done on the original SVs and the
generated Virtual SVs. The training sets with 25 and 89 views per object are regularly
spaced; for them, mirroring does not provide additional information. For the non-regularly
spaced 100-view-per-object sets, a combination of virtual SVs from mirroring and rotation
substantially improves accuracies on both databases.

database:

animal
entry level
training set: views per object
Virtual SVs
25 89 100 25 89 100
none (orig. system) 12.0 1.6 4.0 12.0 1.8 2.0
mirroring
12.5 1.7 4.6 13.1 2.9 3.3
rotations & mirroring 8.8 1.0 1.4 8.5 1.2 1.6

122

CHAPTER 4. PRIOR KNOWLEDGE IN SUPPORT VECTOR MACHINES

Re

fe

re

nc

es

machines are known to allow for short training times (Bottou et al., 1994), the rst
point is usually not critical. Certainly, training on virtual examples generated from
the whole database would be signi cantly slower. To compensate for the second point,
we used the reduced set method of Burges (1996) for increasing speed. This way, we
obtained a system which was both fast and accurate.
As an alternative approach, we have built in known invariances directly into the
SVM objective function via the choice of a kernel. With its rather general class of
admissible kernel functions, the SV algorithm provides ample possibilities for constructing task-speci c kernels. We have considered two forms of domain knowledge:
rst, pattern classes were required to be locally translationally invariant, and second,
local correlations in the images were assumed to be more reliable than long-range correlations. The second requirement can be seen as a more general form of prior knowledge
| it can be thought of as arising partially from the fact that patterns possess a whole
variety of transformations; in object recognition, for instance, we have object rotations
and deformations. Mostly, these transformations are continuous, which implies that
local relationships in an image are fairly stable, whereas global relationships are less
reliable.
Both types of domain knowledge led to improvements on the considered pattern
recognition tasks.
The method for constructing kernels for transformation invariant SV machines
(invariant hyperplanes), put forward to deal with the rst type of domain knowledge,
so far has only been applied in the linear case, which probably explains why it only led
to moderate improvements, especially when compared with the large gains achieved
by the Virtual SV method. It is applicable for di erentiable transformations | other
types, e.g. for mirror symmetry, have to be dealt with using other techniques, as the
Virtual Support Vector method. Its main advantages compared to the latter technique
is that it does not slow down testing speed, and that using more invariances leaves
training time almost unchanged. In addition, it is more attractive from a theoretical
point of view, establishing a surprising connection to invariant feature extraction,
preprocessing, and principal component analysis.
The proposed kernels respecting locality in images, on the other hand, led to large
improvements; they are applicable not only in image classi cation but to all cases where
the relative importance of subsets of products features can be speci ed appropriately.
They do, however, slow down both training and testing by a constant factor which
depends on the cost of evaluating the speci c kernel used.
Clearly, SV machines have not yet been developed to their full potential, which
could explain the fact that our highest accuracies are still slightly worse that the
record on the MNIST database. However, SVMs present clear opportunities for further
improvement. More invariances (for example, for the pattern recognition case, small
rotations, or varying ink thickness) could be incorporated, possibly combined with
techniques for dealing with optimization problems involving very large numbers of
SVs (Osuna, Freund, and Girosi, 1997). Further, one might use only those Virtual
Support Vectors which provide new information about the decision boundary, or use a

123

4.5. DISCUSSION

Re

fe

re

nc

es

measure of such information to keep only the most important vectors. Finally, if local
kernels (Sec. 4.3) will prove to be as useful on the full MNIST database as they were
on the small version of it, accuracies could be substantially increased | at a cost in
classi cation speed, though.
We conclude this chapter by noting that all three described techniques should be
directly applicable to other kernel-based methods as SV regression (Vapnik, 1995b) and
kernel PCA (Chapter 3). Future work will include the nonlinear Tangent Covariance
Matrix (cf. our considerations in Sec. 4.2.2), the incorporation of invariances other
than translation, and the construction of kernels incorporating local feature extractors
(e.g. edge detectors) di erent from the pyramids described in Sec. 4.3.

fe

re

nc

es

CHAPTER 4. PRIOR KNOWLEDGE IN SUPPORT VECTOR MACHINES

Re

124

Chapter 5

Conclusion

Re

fe

re

nc

es

We believe that Support Vector machines and Kernel Principal Component Analysis
are only the rst examples of a series of potential applications of Mercer-kernel-based
methods in learning theory. Any algorithm which can be formulated solely in terms
of dot products can be made nonlinear by carrying it out in feature spaces induced by
Mercer kernels. However, already the above two elds are large enough to render an
exhaustive discussion in this thesis infeasible. Thus, we have tried to focus on some
aspects of SV learning and Kernel PCA, hoping that we have succeeded in illustrating
how nonlinear feature spaces can bene cially be used in complex learning tasks.
On the Support Vector side, we presented two chapters. Apart from a tutorial
introduction to the theory of SV learning, the rst one focused on empirical results
related to the accuracy and the Support vector sets of di erent SV classi ers. Considering three well-known classi er types which are included in the SV approach as
special cases, we showed that they lead to similarly high accuracies and construct
their decision surface from almost the same Support Vectors. Our rst question raised
in the Preface was which of the observations should be used to construct the decision
boundary? Against the backdrop of our empirical ndings, we can now take the position that the Support Vectors, if constructed in an appropriate nonlinear feature
space, constitute such a subset of observations. The second SV chapter focused on
algorithms and empirical results for the incorporation of prior knowledge in SV machines. We showed that this can be done both by modifying kernels and by generating
Virtual examples from the set of Support Vectors. In view of the high performances
obtained, we can reinforce and generalize the above answer, to include also Virtual
Support Vectors, and specialize it, saying that the appropriate feature space should be
constructed using prior knowledge of the task at hand. Our best performing systems
used both methods simultaneously, Virtual Support Vectors and kernels incorporating
prior knowledge about the local structure of images.
On Kernel Principal Component Analysis, we presented one chapter, describing the algorithm and giving rst experimental results on feature extraction for
pattern recognition. We saw that features extracted in nonlinear feature spaces led to
recognition performances much higher than those extracted in input space (i.e. with
traditional PCA). This lends itself to an answer of the second question raised in the
125

126

CHAPTER 5. CONCLUSION

Re

fe

re

nc

es

Preface, which features should be extracted from each observation? From our present
point of view, these should be nonlinear Kernel PCA features. As Kernel PCA operates in the same types of feature spaces as Support Vector machines, the choice of
the kernel, and the design of kernels to incorporate prior knowledge, should also be of
importance here. As the Kernel PCA method is very recent, however, these questions
have not been thoroughly investigated yet. We hope that given a few years time, we
will be in a position to specialize our answer to the second question exactly as it was
done for the rst one.
We conclude with an outlook, revisiting the question of visual processing in biological systems. If the Support Vector set should prove to be a characteristic of the data
largely independent of the type of learning machine used (which we have shown for
three types of learning machines), one would hope that it could also be of relevance
in biological learning. If a subset of observations characterizes a task rather than a
particular algorithm's favourite examples, there is reason to hope that every system
trying to solve this task | in particular animals | should make use of this subset
in one way or another. Regarding Kernel PCA, it would be interesting to study the
types of feature extractors that Kernel PCA constructs when performed on collections
of images resembling those that animals are naturally exposed to. Comparing those
with the ones found in neurophysiological studies could potentially assist us in trying to understand natural visual systems. If applied on the same data, and similar
tasks, optimal machine learning algorithms could be as fruitful to biological thinking
as biological solutions can be to engineering.
Support Vector Learning!

Appendix A

Object Databases

es

In this section, we brie y describe three object recognition databases (chairs, entry
level objects, and animals) generated at the Max-Planck-Institut fur biologische Kybernetik (Liter et al., 1997). We start by describing the procedure for creating the
databases, and then show some images of the resulting patterns.
The training and test data was generated according to the following procedure
(Blanz et al., 1996; Liter et al., 1997):

nc

Database Generation
Snapshot Sampling. 25 di erent object models with uniform grey surface were ren-

Re

fe

re

dered in perspective projection in front of a white background on a Silicon Graphics


workstation using Inventor software. The initial images had a resolution of 256  256
pixels. In all viewing directions, the image plane orientation was such that the vertical
axis of the object was projected in an upright orientation. Thus, each view of an
object is fully characterised by two camera position angles, the elevation  (0 at the
horizon, and 90 from the top) and the azimuth  2 [0 ; 360) (increasing clockwise
when viewed from the top). Only views on the upper half of the viewing sphere were
used, i.e.  2 [0; 90]. The directions of lighting and camera were chosen to coincide.
For each database, we generated di erent training sets: two of them consisted of 25
and 89 equally spaced views of each object, respectively; the other one contained 100
random views per object (cf. Fig. A.1).1 Thus, we obtained training sets of sizes 625,
2225 and 2500, respectively. The test set of size 2500 comprised 100 random views of
each object, independent from the above sets.

Centering. The resulting grey level pictures were centered with respect to the center
of mass of the binarized image. As the objects were shown on a white background,
the binarized image separates gure from ground.
Edge Detection. Four one-dimensional di erential operators (vertical, horizontal,
and two diagonal ones) were applied to the images, followed by taking the modulus.
1 In

one case, we also generated a set with 400 random views per object.

127

128

APPENDIX A. OBJECT DATABASES

Downsampling. In all ve resulting images, the resolution was reduced to 16  16,


leading to ve images r0 : : : r4. In this representation, each view requires 5  16  16 =

1280 pixels.
Containing edge detection data, the parts r1 : : : r4 already provide useful features
for recognition algorithms. To study the ability of an algorithm to extract features by
itself, one can alternatively use only the actual image part r0 of the data, and thus
train on the 256-dimensional downsampled images rather than on the 1280-dimensional
inputs. In our experiments, we used both variants of the databases.

Standardization. On the chair database, the standard deviation of the 1616 images
with pixel values in [0; 1] was around 30 (measured on the training sets). We rescaled
all databases, separately for each part r0 : : : r4, such that each part separately gives rise
to training sets with standard deviation 30. This hardly a ects the r0 part, however,
it does change the edge detection parts r1 ; : : : ; r4. In the resulting 5  256-dimensional
representation, the di erent parts arising from edge detection, or just downsampling,
have comparable scaling.

nc

es

Pixel Rescaling. Before we ran the algorithms on the databases, each pixel value
x was rescaled according to x 7! 2x ? 1. Thus, the background level was ?1, and
maximal intensities were about 1.
Databases

re

Using the above procedure, three object recognition databases were generated.

MPI Chair Database. The rst object recognition database contains 25 di erent

Re

fe

chairs ( gures A.2, A.3, A.4). For benchmarking purposes, the downsampled views are
available via ftp://ftp.mpik-tueb.mpg.de/pub/chair dataset/. As all 25 objects belong
to the same object category, recognition of chairs in the database is a subordinate level
task.

MPI Entry Level Database. The entry level databases contains 25 objects ( gures

A.5, A.6, A.7), for which psychophysical evidence suggests that they belong to di erent
entry levels in object recognition (cf. Sec. 2.2.1).

MPI Animal Database. The animal database contains 25 di erent animals ( gures

A.8, A.9, A.10). Note that some of these animals are also contained in the entry level
database (Fig. A.5).

Appendix C

Handwritten Character Databases


US Postal Service Database. The US Postal Service (USPS) database (see Fig. C.1)

re

nc

es

contains 9298 handwritten digits (7291 for training, 2007 for testing), collected from
mail envelopes in Bu alo (cf. LeCun et al., 1989). Each digit is a 16  16 image,
represented as a 256-dimensional vector with entries between ?1 and 1. Preprocessing
consisted of smoothing with a Gaussian kernel of width  = 0:75.
It is known that the USPS test set is rather dicult | the human error rate is 2.5%
(Bromley and Sackinger, 1991). For a discussion, see (Simard, LeCun, and Denker,
1993). Note, moreover, that some of the results reported in the literature for the
USPS set have been obtained with an enhanced training set: for instance, Drucker,
Schapire, and Simard (1993) used an enlarged training set of size 9709, containing
some additional machine-printed digits, and note that this improves accuracies. In
our experiments, only 7291 training examples were used.

fe

MNIST Database. The MNIST database (Fig. C.2) contains 120000 handwritten

Re

digits, equally divided into training and test set. The database is a modi ed version
of NIST Special Database 3 and NIST Test Data 1. Training and test set consist of
patterns generated by di erent writers. The images were rst size normalized to t
into a 20  20 pixel box, and then centered in a 28  28 image (Bottou et al., 1994).
Test results on the MNIST database which are given in the literature (e.g. Bottou
et al., 1994; LeCun et al., 1995) for some reason do not use the full MNIST test set
of 60000 characters. Instead, a subset of 10000 characters is used, consisting of the
test set patterns from 24476 to 34475. To obtain results which can be compared to
the literature, we also use this test set, although the larger one is preferable from the
point of view of obtaining more reliable test error estimates.

Small MNIST Database. The USPS database has been criticised (Burges, LeCun,

private communication; Bottou et al. (1994)) as not providing the most adequate
classi er benchmark. First, it only comes with a small test set, and second, the test set
contains a number of corrupted patterns which not even humans can classify correctly.
The MNIST database, which is the currently used classi er benchmark in the AT&T
and Bell Labs learning research groups, does not have these drawbacks; moreover, its
149

Appendix D

Technical Addenda
D.1 Feature Space and Kernels

es

In this section, we collect some material related to Mercer kernels and the correspondig
feature spaces. If not stated otherwise, we assume that k is a Mercer kernel (cf.
Proposition 1.3.2), and  is the corresponding map into a feature space F such that
k(x; y) = ((x)  (y)).

re

nc

D.1.1 The Reduced Set Method


Given a vector 2 F , written in terms of images of input patterns,
`
X
= i(xi);
i=1

(D.1)

fe

with i 2 R, one can try to approximate it by

X
0 = (z );

Re

Nz

i=1

(D.2)

with Nz << `, i 2 R. To this end, we have to minimize

 = k ? 0 k2:

(D.3)

The crucial point is that even if  is not given explicitely,  can be computed (and
minimized) in terms of kernels, using ((x)  (y)) = k(x; y) (Burges, 1996).
In Sec. 4.4.1, this method is used to approximate Support Vector decision boundaries in order to speed up classi cation.

D.1.2 Inverting the Map 


If  is nonlinear, the dimension of the linear span of the -images of a set of input
vectors fx1 ; : : : ; x`g can exceed the dimension of their span in input space. Thus,
153

154

APPENDIX D. TECHNICAL ADDENDA

we need not expect that there is a pre-image under  for each vector that can be
expressed as a linear combination of the vectors (x1 ); : : : ; (x`). Nevertheless, it
might be desirable to have a means of constructing the pre-image in the case where it
does exist.
To this end, suppose we have a vector in F given in terms of an expansion of images
of input data, with an unknown pre-image x0 under  in input space RN , i.e.
(x0) =
Then, for any x 2 RN ,

k(x0; x) =

`
X
j =1
`
X
j =1

j (xj ):

(D.4)

j k(xj ; x):

(D.5)

Assume moreover that the kernel k(x; y) is an invertible function fk of (x  y),

k(x; y) = fk ((x  y));

(D.6)

N
X

(x0  ei)ei

i=1
N
X

re

x0 =

nc

es

e.g. k(x; y) = (x  y)d with odd d, or k(x; y) = ((x  y) + ) with a strictly monotonic
sigmoid function  and a threshold . Given any a priori chosen basis of input space
fe1; : : : ; eN g, we can then expand x0 as

fk?1(k(x0 ; ei))ei
i=1
1
0`
N
X
X
fk?1 @ j k(xj ; ei)A ei:
=

Re

fe

i=1

j =1

(D.7)

By using (D.5), we thus reconstructed x0 from the values of dot products between
images (in F ) of training examples and basis elements.
Clearly, a crucial assumption in this construction was the existence of the pre-image
x0. If this does not hold, then the discrepancy
`
X
j =1

j (xj ) ? (x0 )

(D.8)

will be nonzero. There is a number of things that we could do to make the discrepancy
small:
(a) We can try to nd a suitable basis in which we expand the pre-images.
(b) We can repeat the scheme, by trying to nd a pre-image for the discrepancy
vector. This problem has precisely the same structure as the original one (D.4), with

155

D.1. FEATURE SPACE AND KERNELS

one more term in the summation on the right hand side. Iterating this method gives
an expansion of the vector in F in terms of reconstructed approximate pre-images.
(c) We have the freedom to choose the scaling of the vector in F . To see this, note
that for any nonzero , we have, similar to (D.7),

x0 =

N
X
i=1

 (x0  ei = )ei =

N
X
i=1

0`
1
X
fk?1 @ j k(xj ; ei= )A ei:
j =1

(D.9)

(d) Related to this scaling issue, we could also have started with

(x0) =

`
X

j =1

j (xj );

(D.10)

obtaining a reconstruction (cf. (D.7))

0`
1
X

j
x0 = fk?1 @  k(xj ; ei)A ei
i=1
j =1
N
X

(D.11)

nc

es

with the property that (D.10) holds if such an x0 exists.


The success of using di erent values of or could be monitored by computing
the squared norm of the discrepancy,
j =1

re

`
2
X
j (xj ) ? (x0) ;

(D.12)

Re

fe

which can be evaluated in terms of the kernel function.


Finally, we note that same approach can also be applied for more general kernel
functions which cannot be written as an invertible function of (x  y). All we need
is a kernel which allows the reconstruction of (x  y) | and nothing prevents us
from requiring the evaluation of the kernel on several pairs of points for this purpose.
Consider the following example: assume that


(D.13)
k(x; y) = fk kx ? yk2
with an invertible fk (e.g., if k is a Gaussian RBF function, cf. (1.28)). Then, by the
polarization identity, we have
 


(x0  ei) = 41 kx0 + eik2 ? kx0 ? ei k2 = 41 fk?1(k(x0; ?ei )) ? fk?1(k(x0; ei)) :
(D.14)
The same also works if k(x; y) = fk (kx ? yk), e.g.: we just have to raise the results of
fk?1 to the power of 2.
Similar methods can be applied to deal with other kernels.

156

APPENDIX D. TECHNICAL ADDENDA

D.1.3 Mercer Kernels


In this section, we give some further material related to Sec. 1.3.
First, we mention that if a nite number of Eigenvalues is negative, the expansion
(1.25) is still valid. In that case, k corresponds to a Lorentzian symmetric bilinear
form in a space with inde nite signature. For the SV algorithm, this would entail
problems, as the optimization problem would become inde nite. The diagonalization
required for kernel PCA, however, can still be performed, and (3.16) can be modi ed
such that it allows for negative Eigenvalues. The main di erence is that we can no
longer interpret the method as PCA in some feature space. Nevertheless, it could still
be viewed as a type of nonlinear factor analysis.
Next, we note that the polynomial kernels given in (1.17) satisfy Mercer's conditions
of Proposition 1.3.2. As compositions of continuous functions, they are continuous,
thus we only need to show positivity, which follows immediately if we consider their
dot product representation
NF
X
i=1

(d (x))i (d (y))i :

(D.15)

es

(x  y)d =

1
X

i(x)i(y);

(D.16)

re

i=1

nc

Namely, more generally, if an integral operator kernel k admits a uniformly convergent


dot product representation on some compact set C  C ,

fe

it is necessarily positive: for f 2 L2 (C ), we have


1
X

Re

C C i=1

1Z
X

i(x)i(y) f (x)f (y) dx dy

i(x)f (x)i(y)f (y) dx dy


2
1 Z
X
=
i(x)f (x) dx  0;

i=1 C C
i=1

(D.17)
(D.18)

establishing the converse of Proposition 1.3.2.


We conclude this section with some considerations on Proposition 1.3.3. Is it
possible to give a more general class of kernels, such that the expansion (1.25) is no
longer valid, but the mapping of Proposition 1.3.3 can still be constructed? One would
expect that if k does not correspond to a compact operator (as it did in the case of
Mercer kernels, cf. Dunford and Schwartz (1963); in fact, in the Mercer case, we even
have trace class operators, cf. Nashed and Wahba (1974)), with a discrete spectrum,
then the mapping (1.26) should no longer map into an l2 space, but into some separable
Hilbert space of functions on a non-discrete measure space.

157

D.1. FEATURE SPACE AND KERNELS

To this end, let  be a map from input space into some Hilbert space H ,

 : x 7! fx;

(D.19)

and T  0 be a positive bounded operator on H . Moreover, de ne a kernel

kT (x; y) := (fx  Tfy):


Then

(D.20)

 : x 7! T fx

(D.21)

kT (x; y) = ((x)  (y)):

(D.22)

clearly is a map such that

nc

es

As an aside, to see the connection to Mercer's theorem, we may formally set fx to be


x, and assume that T is an integral operator with kernel k. In this case, the right
hand side of (D.20) would equal k(x; y).
The connection to (1.26) becomes clearer if we use the spectral representation of
T , and construct a  di erent from the one in (D.21): T can be written as

re

T = U  Mv U;

(D.23)

Re

fe

where v is a continuous function with corresponding multiplication operator Mv , U is


a unitary operator
U : H ! L2(R; );
(D.24)
and  is a probability measure (the spectral measure of T ) (e.g. Reed and Simon,
1980). Since T  0, we have Mv  0 and v  0. Then, for all x and y,

kT (x; y) =
=
=
=
=

(fx  U  Mv Ufy )
(Ufx  Mv Ufy )
q
q
( Mv Ufx  Mv Ufy )
(Mpv Ufx  Mpv Ufy )
((x)  (y));

(D.25)
(D.26)
(D.27)
(D.28)
(D.29)

de ning
 : RN ! L2 (R; )
(x) = Mpv Ufx :

(D.30)
(D.31)

158

APPENDIX D. TECHNICAL ADDENDA

To see the relationship to (1.26), it should be noted that the spectrum of T coincides
with the essential range of v.
For simplicity, we have above made the assumption that T is bounded. The same
argument, however, also works in the case of unbounded T (e.g. Reed and Simon,
1980).
For the purpose of practical applications, we are interested in maps  and operators
T  0 such that the kernel k de ned by (D.20) can be computed analytically.
Without going into detail, we brie y mention an example of a map . De ne

 : x 7! k(x; :);

(D.32)

where k is some a priori speci ed kernel, and T = P P , with a regularization operator


P (Tikhonov and Arsenin, 1977). Then

kT (x; y) = ((Pk)(x; :)  (Pk)(y; :))

(D.33)

nc

es

coincides with a dot product matrix arising in a kernel-based regularization framework


for learning problems (Smola and Scholkopf, 1997b). If k is chosen as Green's function
of P P , then kT and k can be shown to coincide, and the regularization approach is
equivalent to the SV approach (Smola, Scholkopf, and Muller, 1997).

re

D.1.4 Polynomial Kernels and Higher Order Correlations

Re

fe

Consider the mappings corresponding to kernels of the form (1.20): suppose the monomialspxi1 xi2 : : : xid are written such that i1  i2  : : :  id . Then the coecients (as
the 2 in (1.21)), arising from the fact that di erent combinations of indices occur
with di erent frequencies, are largest for i1 < i2 < : : : < id (let us assume here that
the input dimensionality
q d): in that case, we
p is not smaller than the polynomial degree
have a coecient of d!. If i1 = i2, say, the coecient will be (d ? 1)!. In general, if
n of the xi are equal, and the remaining
ones are di erent, then the coecient in the
q
corresponding component of  is (d ? n + 1)!. Thus, thepterms belonging to the d-th
order correlations will be weighted with an extra factor d! compared to the terms
xdi , and compared to the terms
p where only d ? 1 di erent components occur, they are
still weighted stronger by d. Consequently, kernel PCA with polynomial kernels will
tend to pick up variance in the d-th order correlations mainly.

D.2 Kernel Principal Component Analysis


D.2.1 The Eigenvalue Problem in the Space of Expansion Coecients
We presently give a justi cation for solving (3.14) rather than (3.13) in computing the
Eigensystem of the covariance matrix in F (cf. Sec. 3.2).

159

D.2. KERNEL PRINCIPAL COMPONENT ANALYSIS

Being symmetric, K has an orthonormal basis of Eigenvectors ( i)i with corresponding Eigenvalues i, thus for all i, we have K i = i i (i = 1; : : : ; M ). To
understand the relation between (3.13) and (3.14), we proceed as follows: rst suppose ; satisfy (3.13). We may expand in K 's Eigenvector basis as

=
Equation (3.13) then reads

M

X
i

M
X
i=1

ai i:

ai i i =

X
i

(D.34)

ai2i i ;

(D.35)

i.e. for all i = 1; : : : ; M ,

es

or, equivalently, for all i = 1; : : : ; M ,


Mai i = ai2i :
(D.36)
This in turn means that for all i = 1; : : : ; M ,
M = i or ai = 0 or i = 0:
(D.37)
Note that the above are not exclusive or-s. We next assume that ; satisfy (3.14).
In that case, we nd that (3.14) is equivalent to
X
X
(D.38)
M ai i = aii i ;

Re

fe

re

nc

M = i or ai = 0:
(D.39)
Comparing (D.37) and (D.39), we see that all solutions of the latter satisfy the former.
However, they do not give its full set of solutions: given a solution of (3.14), we may
always add multiples of Eigenvectors of K with Eigenvalue i = 0 and still satisfy
(3.13), with the same Eigenvalue.1 Note that this means that there exist solutions
of (3.13) which belong to di erent Eigenvalues yet are not orthogonal in the space of
the k (for instance, take any two Eigenvectors with di erent Eigenvalues, and add a
multiple of the same Eigenvector with Eigenvalue 0 to both of them). This, however,
does not mean that the Eigenvectors of C in F are not orthogonal. Indeed, note
that if
P
is an Eigenvector of K with Eigenvalue 0, then the corresponding vector i i(xi )
is orthogonal to all vectors in the span of the (xj ) in F , since
!
X
(D.40)
(xj )  i(xi ) = (K )j = 0 for all j;
i

which means that Pi i(xi) = 0. Thus, the above di erence between the solutions
of (3.13) and (3.14) is not relevant, since we are interested in vectors in F rather than
vectors in the space of the expansion coecients of (3.10). We therefore only need to
diagonalize K in order to nd all relevant solutions of (3.13).
Note, nally, that the rank of K determines the dimensionality of the span of the
(xj ) in F , i.e. of the subspace that we are working in.

observation could be used to change the vectors of the solution, e.g. to make them
maximally sparse, without changing the solution.
1 This

160

APPENDIX D. TECHNICAL ADDENDA

D.2.2 Centering in Feature Space

In Sec. 3.2, we made the assumption that our mapped data is centered in F , i.e.
M
X

n=1

(xn) = 0:

(D.41)

es

We shall now drop this assumption. First note that given any  and any set of
observations x1; : : : ; xM , the points
M
X
~ (xi) := (xi) ? M1 (xi)
(D.42)
i=1
are centered. Thus, the assumptions of Sec. 3.2 now hold, and we go on to de ne
covariance matrix and K~ ij = (~ (xi )  ~ (xj )) in F . We arrive at our already familiar
Eigenvalue problem
~ ~ = K~ ~ ;
(D.43)
with ~ being the expansion coecients of an Eigenvector (in F ) in terms of the points
(D.42),
M
X
(D.44)
V~ = ~i ~ (xi):
i=1

Re

fe

re

nc

We cannot compute K~ directly; however, we can express it in terms of its non-centered


counterpart K . In the following, we shall use Kij = ((xi)  (xj )), in addition, we
shall make use of the notation 1ij = 1 for all i; j .
K~ ij = (~ (xi)  ~ (xj ))
(D.45)
!
M
M
X
X
(xm ))  ((xj ) ? M1 (xn))
= ((xi ) ? M1
m=1
n=1
M
X
= ((xi)  (xj )) ? M1
((xm)  (xj ))
m=1
M
M
X
X
? M1 ((xi)  (xn)) + M1 2
((xm )  (xn))
m;n=1
n=1
M
M
M
X
X
X
= Kij ? 1
1imKmj ? 1 Kin1nj + 1 2
M m=1
M n=1
M m;n=1 1imKmn1nj
Using the matrix (1M )ij := 1=M , we get the more compact expression
K~ ij = K ? 1M K ? K 1M + 1M K 1M :
(D.46)
We thus can compute K~ from K , and then solve the Eigenvalue problem (D.43). As
in (3.16), the solutions ~ k are normalized by normalizing the corresponding vectors
V~ k in F , which translates into
~k ( ~ k  ~ k ) = 1:
(D.47)

161

D.3. ON THE TANGENT COVARIANCE MATRIX

For feature extraction, we compute projections of centered -images of test patterns


t onto the Eigenvectors of the covariance matrix of the centered points,

X
(V~ k  (t)) = ~ik (~ (xk )  ~ (t)):
M

i=1

Consider a set of test points t1; : : : ; tL, and de ne two L  M matrices by


Kijtest = ((ti)  (xj ))
and
!
M
M
X
X
1
1
K~ ijtest = (((ti) ? M (xm ))  ((xj ) ? M (xn))) :
m=1

n=1

(D.48)
(D.49)
(D.50)

es

Similar to (D.45), we can express K~ test in terms of K test , and arrive at


K~ test = K test ? 10M K ? K test 1M + 10M K 1M ;
(D.51)
where 10M is the L  M matrix with all entries equal to 1=M . As the test points can
be chosen arbitrarily, we have thus in e ect computed a centered version not only of
the dot product matrix, but also of the kernel itself.

nc

D.3 On the Tangent Covariance Matrix

Re

fe

re

In this section, we give an alternative derivation of (4.10), obtained by modifying


the analysis of Sec. 2.1.2 (Vapnik, 1998). There, we had to maximize (2.7) subject
to (2.6). When we want to construct invariant hyperplanes, the situation is slightly
di erent. We do not only want to separate the training data, but we want to separate
it in a way such that submitting a pattern to a transformation of an a priori speci ed
Lie group will not alter its class assignment. This can be achieved by enforcing that
the classi cation boundary be such that group actions move patterns parallel to the
decision boundary, rather than across it. A local statement of this property is the
requirement that the Lie derivatives should be orthogonal to the normal w which
determines the separating hyperplane. Thus we modify (2.7) by adding a second term
enforcing invariance:
0
1
!2
`

X
1
@
1
 (w) = 2 @(1 ? ) `
w  @t t=0Lt zi + kwk2A
(D.52)
i=1
For  = 1, we recover the original objective function; for values 1 >   0, di erent
amounts of importance are assigned to invariance with respect to the Lie group of
transformations Lt.
The above sum can be rewritten as
!2
!
!
`
`


X
@
@
1X
@
1

` i=1 w  @t t=0 Lt zi = ` i=1 w  @t t=0 Ltzi @t t=0 Ltzi  w
= (w  C w);
(D.53)

162

APPENDIX D. TECHNICAL ADDENDA

where the matrix C is de ned as in (4.6),


`
X
@ L z
1
C := `
t i
i=1 @t t=0

@ L z
@t t=0 t i

!>

(D.54)

(if we want to use more than one derivative operator, we also sum over these; in that
case, we may want to orthonormalize the derivatives for each observation zi). To solve
the optimization problem, one introduces a Lagrangian
`
 X

L(w; b; ) = 21 (1 ? )(w  C w) + kwk2 ? i (yi((zi  w) + b) ? 1)
i=1

(D.55)

with Lagrange multipliers i. At the point of the solution, the gradient of L with
respect to w must vanish:
(1 ? )C w + w ?

`
X
i=1

i yizi = 0

(D.56)

es

As the left hand side of (D.53) is non-negative for any w, C is a positive (not necessarily
de nite) matrix. It follows that for
(D.57)

re

nc

C := (1 ? )C + I

fe

to be invertible (I denoting the identity),  > 0 is a sucient condition. In that case,


we get the following expansion for the solution vector:

Re

w=

`
X
i=1

i yiC?1 zi

(D.58)

Together with (2.3), (D.58) yields the decision function

f (z) = sgn

`
X
i=1

y (z  C ?1 z ) + b
i i

(D.59)

Substituting (D.58), and the fact thatPat the point of the solution, the partial derivative
of L with respect to b must vanish ( `i=1 iyi = 0), into the Lagrangian (D.55), we get
`
`
 > 
X
X
W ( ) = 21 iyiz>i C?1 CC?1 j yj zj
i=1

`
X
i=1

iyiziC?1

j =1

`
X
j =1

j yj zj +

`
X
i=1

i:

(D.60)

163

D.3. ON THE TANGENT COVARIANCE MATRIX

By virtue of the fact that C and thus also C?1 is symmetric, the dual form of the
optimization problem takes the following form: maximize

W ( ) =

`
X
i=1

`
X
i ? 21 i j yiyj (zi  C?1zj )
i;j =1

(D.61)

subject to (2.14) and (2.15).


The same derivation can be carried out for the nonseparable case, leading to the
corresponding result with modi ed constraints (2.22) and (2.23) (cf. Sec. 2.1.3).
We conclude by generalizing to the nonlinear case. As in Sec. 2.1.4, we now think
of the patterns zi no longer as living in input space, but as patterns in some feature
space F related to input space by a nonlinear map
 : RN ! F

(D.62)
(D.63)

xi 7! zi = (xi):

nc

es

Unfortunately, (D.59) and (D.61) are not simply written in terms of dot products
between images of input patterns under . Hence, substituting kernel functions for
dot products will not do. Note, moreover, that C now is an operator in a possibly
in nite-dimensional space, with C being de ned as in (4.15). We cannot compute it
explicitely, but we can nevertheless compute (D.59) and (D.61), which is all we need.
First note that for all x; y 2 RN ,
((x)  C?1(y)) = (C? 2 (x)  C? 2 (y));
1

re

(D.64)

with C? 2 being the positive square root of C?1. At this point, methods similar to
kernel PCA come to our rescue. As C is symmetric, we may diagonalize it as

hence

Re

fe

C = SDS >;

(D.65)

C? 2 = SD? 21 S >:

(D.66)

Substituting (D.66) into (D.64), and using the fact that S is unitary, we obtain
((x)  C?1(y)) = (SD? 2 S >(x)  SD? 2 S >(y))
= (D? 21 S >(x)  D? 21 S >(y)):
1

(D.67)
(D.68)

This, however, is simply a dot product between kernel PCA feature vectors:
S >(x)
1
computes projections onto Eigenvectors of C (i.e. features), and D? 2 rescales them.
Note that we have thus again arrived at the nonlinear tangent covariance matrix of
Sec. 4.2.2; this time, however, the approach was motivated solely by constructing

164

APPENDIX D. TECHNICAL ADDENDA

invariant hyperplanes in feature space, and the nonlinear feature extraction by the
tangent covariance matrix is a mere by-product.
To carry out kernel PCA on C, we essentially have to go through the analysis
of kernel PCA using C instead of the covariance matrix of the mapped data in F .
The modi cations arising from the fact that we are dealing with tangent vectors were
already described in Sec. 4.2.2, hence, we shall presently only sketch the additional
modi cations for  > 0: here, we are looking for solutions of the Eigenvalue equation
V = C V with  >  (let us assume that  < 1, otherwise all Eigenvalues are
identical to , the minimal Eigenvalue).2 These lie in the span of the tangent vectors.
In complete analogy to (3.14), we then arrive at

` = ((1 ? )K + I ) ;

(D.69)

and the normalization condition for the coecients k of the k-th Eigenvector reads
1 = k ?  (  );
(D.70)
1?

Re

fe

re

nc

es

where the k >  are the Eigenvalues of (1 ? )K + I . Feature extraction is carried
out as in (4.25).
We conclude by noting an essential di erence to the approach of (4.11), which we
believe is an advantage of the present method: in (4.11), the pattern preprocessing
was assumed to be linear. In the present method, the goal to get invariant hyperplanes
in feature space naturally led to a nonlinear preprocessing operation.

2 If we want I

also to have an e ect outside of the span of the tangent vectors, we have to modify
the set in which we expand our solutions.

Bibliography
Y. S. Abu-Mostafa. Hints. Neural Computation, 7(4):639{671, 1995.
M. Aizerman, E. Braverman, and L. Rozonoer. Theoretical foundations of the potential
function method in pattern recognition learning. Automation and Remote Control,
25:821 { 837, 1964.
S. Amari, N. Murata, K.-R. Muller, M. Finke, and H. Yang. Aymptotic statistical
theory of overtraining and cross-validation. IEEE Trans. on Neural Networks, 8(5),
1997.

es

J. K. Anlauf and M. Biehl. The adatron: an adaptive perceptron algorithm. Europhys.


Letters, 10:687 { 692, 1989.

re

nc

H. Baird. Document image defect models. In Proceddings, IAPR Workshop on Syntactic and Structural Pattern Recognition, pages 38 { 46, Murray Hill, NJ, 1990.

fe

H. Barlow. The neuron doctrine in perception. In M. Gazzaniga, editor, The Cognitive


Neurosciences, pages 415 { 435. MIT Press, Cambridge, MA, 1995.

Re

A. Barron. Predicted squared error: a criterion for automatic model selection. In


S. Farlow, editor, Self-organizing Methods in Modeling. Marcel Dekker, New York,
1984.
Peter L. Bartlett. For valid generalization the size of the weights is more important
than the size of the network. In Michael C. Mozer, Michael I. Jordan, and Thomas
Petsche, editors, Advances in Neural Information Processing Systems, volume 9,
page 134, Cambridge, MA, 1997. MIT Press.
D. P. Bertsekas. Nonlinear Programming. Athena Scienti c, Belmont, MA, 1995.
D. Beymer and T. Poggio. Image representations for visual learning. Science, 272
(5270):1905{1909, 1996.
R. Bhatia. Matrix Analysis. Springer Verlag, New York, 1997.
C. M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford,
1995.
165

166

BIBLIOGRAPHY

V. Blanz. Bildbasierte Objekterkennung und die Bestimmung optimaler Ansichten.


Diplomarbeit in Physik, Universitat Tubingen, 1995.
V. Blanz, B. Scholkopf, H. Bultho , C. Burges, V. Vapnik, and T. Vetter. Comparison
of view-based object recognition algorithms using realistic 3D models. In C. von der
Malsburg, W. von Seelen, J. C. Vorbruggen, and B. Sendho , editors, Arti cial
Neural Networks | ICANN'96, pages 251 { 256, Berlin, 1996. Springer Lecture
Notes in Computer Science, Vol. 1112.
B. E. Boser, I .M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin
classi ers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on
Computational Learning Theory, pages 144{152, Pittsburgh, PA, 1992. ACM Press.
L. Bottou, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, L. D. Jackel, Y. LeCun,
U. A. Muller, E. Sackinger, P. Simard, and V. Vapnik. Comparison of classi er
methods: a case study in handwritten digit recognition. In Proceedings of the 12th
International Conference on Pattern Recognition and Neural Networks, Jerusalem,
pages 77 { 87. IEEE Computer Society Press, 1994.

nc

es

L. Bottou and V. N. Vapnik. Local learning algorithms. Neural Computation, 4(6):


888{900, 1992.

re

J. Bromley and E. Sackinger. Neural-network and k-nearest-neighbor classi ers. Technical Report 11359{910819{16TM, AT&T, 1991.

fe

H. H. Bultho and S. Edelman. Psychophysical support for a 2-D view interpolation


theory of object recognition. Proceedings of the National Academy of Science, 89:
60 { 64, 1992.

Re

C. J. C. Burges. Simpli ed support vector decision rules. In L. Saitta, editor, Proceedings, 13th Intl. Conf. on Machine Learning, pages 71{77, San Mateo, CA, 1996.
Morgan Kaufmann.
C. J. C. Burges and B. Scholkopf. Improving the accuracy and speed of support vector
learning machines. In M. Mozer, M. Jordan, and T. Petsche, editors, Advances in
Neural Information Processing Systems 9, pages 375{381, Cambridge, MA, 1997.
MIT Press.
E. I. Chang and R. L. Lippmann. A boundary hunting radial basis function classi er

Journal of Machine Learning Research 12 (2011) 2825-2830

Submitted 3/11; Revised 8/11; Published 10/11

REF [5]
Scikit-learn: Machine Learning in Python
Fabian Pedregosa (fabian.pedregosa@inria.fr)
Gael Varoquaux (gael.varoquaux@normalesup.org)
Alexandre Gramfort (alexandre.gramfort@inria.fr)
Vincent Michel (vincent.michel@logilab.fr)
Bertrand Thirion (bertrand.thirion@inria.fr)
Parietal, INRIA Saclay, Neurospin, Bat 145, CEA Saclay, 91191 Gif sur Yvette, France

Olivier Grisel (olivier.grisel@ensta.fr)
Nuxeo, 20 rue Soleillet, 75020 Paris, France

Mathieu Blondel (mblondel@ai.cs.kobe-u.ac.jp)
Kobe University, 1-1 Rokkodai, Nada, Kobe 657-8501, Japan

Peter Prettenhofer (peter.prettenhofer@gmail.com)
Bauhaus-Universitat Weimar, Bauhausstr. 11, 99421 Weimar, Germany

Ron Weiss (ronweiss@gmail.com)
Google Inc, 76 Ninth Avenue, New York, NY 10011, USA

Vincent Dubourg (vincent.dubourg@gmail.com)
Clermont Universite, IFMA, EA 3867, LaMI, BP 10448, 63000 Clermont-Ferrand, France

Jake Vanderplas (vanderplas@astro.washington.edu)
Astronomy Department, University of Washington, Box 351580, Seattle, WA 98195, USA

Alexandre Passos (alexandre.tp@gmail.com)
IESL Lab, UMass Amherst, Amherst, MA 01002, USA

David Cournapeau (cournape@gmail.com)
Enthought, 21 J.J. Thompson Avenue, Cambridge, CB3 0FA, UK

Matthieu Brucher (matthieu.brucher@gmail.com)
Total SA, CSTJF, avenue Larribau, 64000 Pau, France

Matthieu Perrot (matthieu.perrot@cea.fr)
Edouard Duchesnay (edouard.duchesnay@cea.fr)
LNAO, Neurospin, Bat 145, CEA Saclay, 91191 Gif sur Yvette, France

(c) 2011 Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot and Edouard Duchesnay.

Editor: Mikio Braun

Abstract

Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.

Keywords: Python, supervised learning, unsupervised learning, model selection

1. Introduction

The Python programming language is establishing itself as one of the most popular languages for
scientific computing. Thanks to its high-level interactive nature and its maturing ecosystem of scientific libraries, it is an appealing choice for algorithmic development and exploratory data analysis
(Dubois, 2007; Milmann and Avaizis, 2011). Yet, as a general-purpose language, it is increasingly
used not only in academic settings but also in industry.
Scikit-learn harnesses this rich environment to provide state-of-the-art implementations of many
well known machine learning algorithms, while maintaining an easy-to-use interface tightly integrated with the Python language. This answers the growing need for statistical data analysis by
non-specialists in the software and web industries, as well as in fields outside of computer science,
such as biology or physics. Scikit-learn differs from other machine learning toolboxes in Python
for various reasons: i) it is distributed under the BSD license, ii) it incorporates compiled code for
efficiency, unlike MDP (Zito et al., 2008) and pybrain (Schaul et al., 2010), iii) it depends only on
numpy and scipy to facilitate easy distribution, unlike pymvpa (Hanke et al., 2009) that has optional
dependencies such as R and shogun, and iv) it focuses on imperative programming, unlike pybrain
which uses a data-flow framework. While the package is mostly written in Python, it incorporates
the C++ libraries LibSVM (Chang and Lin, 2001) and LibLinear (Fan et al., 2008) that provide reference implementations of SVMs and generalized linear models with compatible licenses. Binary
packages are available on a rich set of platforms including Windows and any POSIX platforms.
Furthermore, thanks to its liberal license, it has been widely distributed as part of major free software distributions such as Ubuntu, Debian, Mandriva, NetBSD and Macports and in commercial
distributions such as the Enthought Python Distribution.

2. Project Vision

Code quality. Rather than providing as many features as possible, the project's goal has been to
provide solid implementations. Code quality is ensured with unit tests (as of release 0.8, test
coverage is 81%) and the use of static analysis tools such as pyflakes and pep8. Finally, we
strive to use consistent naming for the functions and parameters used throughout, a strict adherence
to the Python coding guidelines and numpy style documentation.
BSD licensing. Most of the Python ecosystem is licensed with non-copyleft licenses. While such
policy is beneficial for adoption of these tools by commercial projects, it does impose some restrictions: we are unable to use some existing scientific code, such as the GSL.
Bare-bone design and API. To lower the barrier of entry, we avoid framework code and keep the
number of different objects to a minimum, relying on numpy arrays for data containers.
Community-driven development. We base our development on collaborative tools such as git, github
and public mailing lists. External contributions are welcome and encouraged.
Documentation. Scikit-learn provides a 300 page user guide including narrative documentation,
class references, a tutorial, installation instructions, as well as more than 60 examples, some featuring real-world applications. We try to minimize the use of machine-learning jargon, while maintaining precision with regards to the algorithms employed.

3. Underlying Technologies

Numpy: the base data structure used for data and model parameters. Input data is presented as
numpy arrays, thus integrating seamlessly with other scientific Python libraries. Numpy's view-based memory model limits copies, even when binding with compiled code (Van der Walt et al.,
2011). It also provides basic arithmetic operations.
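As a small illustration of this view-based memory model (our sketch, not part of the paper), slicing a numpy array returns a view that shares the original buffer instead of copying it:

```python
import numpy as np

X = np.arange(12, dtype=np.float64).reshape(3, 4)  # a small 3x4 array
row_view = X[1]        # basic slicing returns a view, not a copy
col_view = X[:, 2]     # same for column slices

print(np.shares_memory(X, row_view))   # True: no data were copied
print(np.shares_memory(X, col_view))   # True

row_view[0] = -1.0                      # writing through the view...
print(X[1, 0])                          # ...modifies the original array: -1.0
```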
Scipy: efficient algorithms for linear algebra, sparse matrix representation, special functions and
basic statistical functions. Scipy has bindings for many Fortran-based standard numerical packages,
such as LAPACK. This is important for ease of installation and portability, as providing libraries
around Fortran code can prove challenging on various platforms.
Cython: a language for combining C in Python. Cython makes it easy to reach the performance
of compiled languages with Python-like syntax and high-level operations. It is also used to bind
compiled libraries, eliminating the boilerplate code of Python/C extensions.

4. Code Design
Objects specified by interface, not by inheritance. To facilitate the use of external objects with
scikit-learn, inheritance is not enforced; instead, code conventions provide a consistent interface.
The central object is an estimator, which implements a fit method, accepting as arguments an input
data array and, optionally, an array of labels for supervised problems. Supervised estimators, such as
SVM classifiers, can implement a predict method. Some estimators, that we call transformers,
for example PCA, implement a transform method, returning modified input data. Estimators
may also provide a score method, which is an increasing evaluation of goodness of fit: a log-likelihood, or a negated loss function. The other important object is the cross-validation iterator,
which provides pairs of train and test indices to split input data, for example K-fold, leave-one-out,
or stratified cross-validation.

                               scikit-learn    mlpy   pybrain   pymvpa     mdp   shogun
Support Vector Classification           5.2    9.47      17.5    11.52   40.48     5.63
Lasso (LARS)                           1.17   105.3         -    37.35       -        -
Elastic Net                            0.52    73.7         -     1.44       -        -
k-Nearest Neighbors                    0.57    1.41         -     0.56    0.58     1.36
PCA (9 components)                     0.18       -         -     8.93    0.47     0.33
k-Means (9 clusters)                   1.34    0.79         *    35.75    0.68        *
License                                 BSD     GPL       BSD      BSD     BSD      GPL

-: Not implemented.   *: Does not converge within 1 hour.

Table 1: Time in seconds on the Madelon data set for various machine learning libraries exposed
in Python: MLPy (Albanese et al., 2008), PyBrain (Schaul et al., 2010), pymvpa (Hanke
et al., 2009), MDP (Zito et al., 2008) and Shogun (Sonnenburg et al., 2010). For more
benchmarks see http://github.com/scikit-learn.
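A minimal usage sketch of the estimator interface described above, written against a current scikit-learn release (the synthetic data and parameter values are arbitrary illustrations, not taken from the paper):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(100, 10)                      # 100 samples, 10 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # a simple binary label

# A transformer: fit, then transform the input data.
pca = PCA(n_components=3).fit(X)
X_reduced = pca.transform(X)

# A supervised estimator: fit, then predict and score.
clf = SVC(kernel="rbf").fit(X_reduced, y)
print(clf.predict(X_reduced[:5]))
print(clf.score(X_reduced, y))              # mean accuracy on the training data
```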

Model selection. Scikit-learn can evaluate an estimator's performance or select parameters using
cross-validation, optionally distributing the computation to several cores. This is accomplished by
wrapping an estimator in a GridSearchCV object, where the CV stands for cross-validated.
During the call to fit, it selects the parameters on a specified parameter grid, maximizing a score
(the score method of the underlying estimator). predict, score, or transform are then delegated
to the tuned estimator. This object can therefore be used transparently as any other estimator. Cross-validation can be made more efficient for certain estimators by exploiting specific properties, such
as warm restarts or regularization paths (Friedman et al., 2010). This is supported through special
objects, such as the LassoCV. Finally, a Pipeline object can combine several transformers and
an estimator to create a combined estimator to, for example, apply dimension reduction before
fitting. It behaves as a standard estimator, and GridSearchCV therefore tunes the parameters of all
steps.
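A hedged sketch of the objects just described, again against a current scikit-learn release (module paths have changed since the 0.8-era layout, and the parameter grid below is purely illustrative): a Pipeline combining PCA and an SVM classifier is tuned by GridSearchCV and then used like any other estimator.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Dimension reduction followed by a classifier, exposed as a single estimator.
pipe = Pipeline([("pca", PCA()), ("svc", SVC())])

# GridSearchCV tunes the parameters of all steps via cross-validation.
param_grid = {"pca__n_components": [5, 10], "svc__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(search.score(X, y))   # delegated to the tuned pipeline
```

The double-underscore naming (step name, then parameter) is what lets a single parameter grid address every step of the pipeline.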

5. High-level yet Efficient: Some Trade-Offs

While scikit-learn focuses on ease of use, and is mostly written in a high level language, care has
been taken to maximize computational efficiency. In Table 1, we compare computation time for a
few algorithms implemented in the major machine learning toolkits accessible in Python. We use
the Madelon data set (Guyon et al., 2004), with 4400 instances and 500 attributes. The data set is quite
large, but small enough for most algorithms to run.
SVM. While all of the packages compared call libsvm in the background, the performance of scikit-learn can be explained by two factors. First, our bindings avoid memory copies and have up to
40% less overhead than the original libsvm Python bindings. Second, we patch libsvm to improve
efficiency on dense data, use a smaller memory footprint, and better use memory alignment and
pipelining capabilities of modern processors. This patched version also provides unique features,
such as setting weights for individual samples.
LARS. Iteratively refining the residuals instead of recomputing them gives performance gains of
2-10 times over the reference R implementation (Hastie and Efron, 2004). Pymvpa uses this implementation via the Rpy R bindings and pays a heavy price to memory copies.
Elastic Net. We benchmarked the scikit-learn coordinate descent implementations of Elastic Net. It
achieves the same order of performance as the highly optimized Fortran version glmnet (Friedman
et al., 2010) on medium-scale problems, but performance on very large problems is limited since
we do not use the KKT conditions to define an active set.
kNN. The k-nearest neighbors classifier implementation constructs a ball tree (Omohundro, 1989)
of the samples, but uses a more efficient brute force search in large dimensions.
PCA. For medium to large data sets, scikit-learn provides an implementation of a truncated PCA
based on random projections (Rokhlin et al., 2009).
k-means. scikit-learn's k-means algorithm is implemented in pure Python. Its performance is limited by the fact that numpy's array operations take multiple passes over data.
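For illustration only, a sketch of truncated PCA with a randomized solver and k-means in a current scikit-learn release (the class and argument names below follow today's API, which differs from the version benchmarked in this paper):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(1000, 50)

# Truncated PCA computed with a randomized solver.
pca = PCA(n_components=9, svd_solver="randomized", random_state=0)
X_9d = pca.fit_transform(X)

# k-means clustering on the reduced data.
km = KMeans(n_clusters=9, n_init=10, random_state=0).fit(X_9d)
print(km.cluster_centers_.shape)   # (9, 9)
print(np.bincount(km.labels_))     # cluster sizes
```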

6. Conclusion

Scikit-learn exposes a wide variety of machine learning algorithms, both supervised and unsupervised, using a consistent, task-oriented interface, thus enabling easy comparison of methods for a
given application. Since it relies on the scientific Python ecosystem, it can easily be integrated into
applications outside the traditional range of statistical data analysis. Importantly, the algorithms,
implemented in a high-level language, can be used as building blocks for approaches specific to
a use case, for example, in medical imaging (Michel et al., 2011). Future work includes online
learning, to scale to large data sets.

References

D. Albanese, S. Merler, G. Jurman, and R. Visintainer. MLPy: high-performance Python package for predictive modeling. In NIPS, MLOSS Workshop, 2008.
C.C. Chang and C.J. Lin. LIBSVM: a library for support vector machines. http://www.csie.ntu.edu.tw/cjlin/libsvm, 2001.
P.F. Dubois, editor. Python: Batteries Included, volume 9 of Computing in Science & Engineering. IEEE/AIP, May 2007.
R.E. Fan, K.W. Chang, C.J. Hsieh, X.R. Wang, and C.J. Lin. LIBLINEAR: a library for large linear classification. The Journal of Machine Learning Research, 9:1871-1874, 2008.
J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.
I. Guyon, S. R. Gunn, A. Ben-Hur, and G. Dror. Result analysis of the NIPS 2003 feature selection challenge, 2004.
M. Hanke, Y.O. Halchenko, P.B. Sederberg, S.J. Hanson, J.V. Haxby, and S. Pollmann. PyMVPA: A Python toolbox for multivariate pattern analysis of fMRI data. Neuroinformatics, 7(1):37-53, 2009.
T. Hastie and B. Efron. Least Angle Regression, Lasso and Forward Stagewise. http://cran.r-project.org/web/packages/lars/lars.pdf, 2004.
V. Michel, A. Gramfort, G. Varoquaux, E. Eger, C. Keribin, and B. Thirion. A supervised clustering approach for fMRI-based inference of brain states. Patt Rec, page epub ahead of print, April 2011. doi: 10.1016/j.patcog.2011.04.006.
K.J. Milmann and M. Avaizis, editors. Scientific Python, volume 11 of Computing in Science & Engineering. IEEE/AIP, March 2011.
S.M. Omohundro. Five balltree construction algorithms. ICSI Technical Report TR-89-063, 1989.
V. Rokhlin, A. Szlam, and M. Tygert. A randomized algorithm for principal component analysis. SIAM Journal on Matrix Analysis and Applications, 31(3):1100-1124, 2009.
T. Schaul, J. Bayer, D. Wierstra, Y. Sun, M. Felder, F. Sehnke, T. Rückstieß, and J. Schmidhuber. PyBrain. The Journal of Machine Learning Research, 11:743-746, 2010.
S. Sonnenburg, G. Ratsch, S. Henschel, C. Widmer, J. Behr, A. Zien, F. de Bona, A. Binder, C. Gehl, and V. Franc. The SHOGUN machine learning toolbox. Journal of Machine Learning Research, 11:1799-1802, 2010.
S. Van der Walt, S.C. Colbert, and G. Varoquaux. The NumPy array: A structure for efficient numerical computation. Computing in Science and Engineering, 11, 2011.
T. Zito, N. Wilbert, L. Wiskott, and P. Berkes. Modular toolkit for data processing (MDP): A Python data processing framework. Frontiers in Neuroinformatics, 2, 2008.


REF [6]
International Journal of Computer Vision 38(1), 9-13, 2000
(c) 2000 Kluwer Academic Publishers. Manufactured in The Netherlands.

Statistical Learning Theory: A Primer


THEODOROS EVGENIOU, MASSIMILIANO PONTIL AND TOMASO POGGIO
Center for Biological and Computational Learning, Artificial Intelligence Laboratory, MIT,
Cambridge, MA, USA
theos@ai.mit.edu
pontil@ai.mit.edu
tp@ai.mit.edu

Abstract. In this paper we first overview the main concepts of Statistical Learning Theory, a framework in which
learning from examples can be studied in a principled way. We then briefly discuss well known as well as emerging
learning techniques such as Regularization Networks and Support Vector Machines which can be justified in terms
of the same induction principle.

Keywords: VC-dimension, structural risk minimization, regularization networks, support vector machines

1. Introduction

The goal of this paper is to provide a short introduction to Statistical Learning Theory (SLT) which studies
problems and techniques of supervised learning. For a more detailed review of SLT see Evgeniou et al. (1999).
In supervised learning (or learning-from-examples) a machine is trained, instead of programmed, to perform a given task on a number of input-output pairs.
According to this paradigm, training means choosing a function which best describes the relation between
the inputs and the outputs. The central question of SLT is how well the chosen function generalizes, or
how well it estimates the output for previously unseen inputs.
We will consider techniques which lead to solutions of the form

f(x) = \sum_{i=1}^{\ell} c_i K(x, x_i),   (1)

where the x_i, i = 1, \ldots, \ell are the input examples, K a certain symmetric positive definite function named
kernel, and c_i a set of parameters to be determined from the examples. This function is found by minimizing
functionals of the type

H[f] = \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i)) + \lambda \|f\|_K^2,

where V is a loss function which measures the goodness of the predicted output f(x_i) with respect to the
given output y_i, \|f\|_K^2 a smoothness term which can be thought of as a norm in the Reproducing Kernel
Hilbert Space defined by the kernel K, and λ a positive parameter which controls the relative weight between
the data and the smoothness term. The choice of the loss function determines different learning techniques,
each leading to a different learning algorithm for computing the coefficients c_i.
The rest of the paper is organized as follows. Section 2 presents the main idea and concepts in the theory. Section 3 discusses Regularization Networks and Support Vector Machines, two important techniques
which produce outputs of the form of Eq. (1).

2. Statistical Learning Theory

We consider two sets of random variables x ∈ X ⊆ R^d and y ∈ Y ⊆ R related by a probabilistic relationship.

The relationship is probabilistic because generally an element of X does not determine uniquely an element of
Y, but rather a probability distribution on Y. This can be formalized assuming that an unknown probability distribution P(x, y) is defined over the set X × Y. We are
provided with examples of this probabilistic relationship, that is with a data set D_ℓ ≡ {(x_i, y_i) ∈ X × Y}_{i=1}^ℓ
called training set, obtained by sampling ℓ times the set X × Y according to P(x, y). The problem of learning
consists in, given the data set D_ℓ, providing an estimator, that is a function f : X → Y, that can be used,
given any value of x ∈ X, to predict a value y. For example X could be the set of all possible images, Y the
set {−1, 1}, and f(x) an indicator function which specifies whether image x contains a certain object (y = 1),
or not (y = −1) (see for example Papageorgiou et al. (1998)). Another example is the case where x is a set
of parameters, such as pose or facial expressions, y is a motion field relative to a particular reference image of a
face, and f(x) is a regression function which maps parameters to motion (see for example Ezzat and Poggio
(1996)).
In SLT, the standard way to solve the learning problem consists in defining a risk functional, which measures the average amount of error or risk associated
with an estimator, and then looking for the estimator with the lowest risk. If V(y, f(x)) is the loss function measuring the error we make when we predict y
by f(x), then the average error, the so-called expected risk, is:

I[f] \equiv \int_{X,Y} V(y, f(x)) \, P(x, y) \, dx \, dy.

We assume that the expected risk is defined on a large class of functions F and we will denote by f_0 the function which minimizes the expected risk in F. The function f_0 is our ideal estimator, and it is often called
the target function. This function cannot be found in practice, because the probability distribution P(x, y)
that defines the expected risk is unknown, and only a sample of it, the data set D_ℓ, is available. To overcome this shortcoming we need an induction principle that we can use to learn from the limited number of training data we have. SLT, as developed by
Vapnik (Vapnik, 1998), builds on the so-called empirical risk minimization (ERM) induction principle.
The ERM method consists in using the data set D_ℓ to build a stochastic approximation of the expected
risk, which is usually called the empirical risk, defined as

I_{emp}[f; \ell] = \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i)).

Straight minimization of the empirical risk in F can be problematic. First, it is usually an ill-posed problem (Tikhonov and Arsenin, 1977), in the sense that
there might be many, possibly infinitely many, functions minimizing the empirical risk. Second, it can lead
to overfitting, meaning that although the minimum of the empirical risk can be very close to zero, the expected
risk (which is what we are really interested in) can be very large.
SLT provides probabilistic bounds on the distance between the empirical and expected risk of any function
(therefore including the minimizer of the empirical risk in a function space that can be used to control overfitting). The bounds involve the number of examples ℓ
and the capacity h of the function space, a quantity measuring the complexity of the space. Appropriate capacity quantities are defined in the theory, the
most popular one being the VC-dimension (Vapnik and Chervonenkis, 1971) or scale sensitive versions
of it (Kearns and Shapire, 1994; Alon et al., 1993). The bounds have the following general form: with probability at least 1 − η:

I[f] < I_{emp}[f] + \Phi\left(\frac{h}{\ell}, \eta\right).   (2)

where h is the capacity, and Φ an increasing function of h/ℓ and η. For more information and for exact
forms of function Φ we refer the reader to (Vapnik and Chervonenkis, 1971; Vapnik, 1998; Alon et al., 1993).
Intuitively, if the capacity of the function space in which we perform empirical risk minimization is very large
and the number of examples is small, then the distance between the empirical and expected risk can be large
and overfitting is very likely to occur.
Since the space F is usually very large (e.g. F could be the space of square integrable functions), one typically considers smaller hypothesis spaces H. Moreover, inequality (2) suggests an alternative method for
achieving good generalization: instead of minimizing the empirical risk, find the best trade off between the
empirical risk and the complexity of the hypothesis space measured by the second term in the r.h.s. of inequality (2). This observation leads to the method of
Structural Risk Minimization (SRM).
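The gap between empirical and expected risk can be made concrete with a small numerical sketch (ours, not the authors'; the distribution and the two estimators are arbitrary choices). A rule that memorizes the training set drives the empirical risk to zero, while its expected risk, estimated on a large independent sample, remains clearly larger:

```python
import numpy as np

rng = np.random.RandomState(0)

def sample(n):
    """Two overlapping Gaussian classes in R^2 with labels in {-1, +1}."""
    y = rng.choice([-1, 1], size=n)
    X = rng.randn(n, 2) + 1.2 * y[:, None]
    return X, y

def emp_risk(f, X, y):
    """Empirical risk with the 0-1 loss: (1/l) sum V(y_i, f(x_i))."""
    return np.mean(f(X) != y)

X_tr, y_tr = sample(50)         # training set D_l
X_te, y_te = sample(20000)      # large sample approximating the expected risk

# A memorizing estimator: 1-nearest-neighbour on the training set.
def f_memorize(X):
    d = ((X[:, None, :] - X_tr[None, :, :]) ** 2).sum(-1)
    return y_tr[np.argmin(d, axis=1)]

# A much simpler estimator: sign of the projection on the class-mean direction.
w = X_tr[y_tr == 1].mean(0) - X_tr[y_tr == -1].mean(0)
c = 0.5 * (X_tr[y_tr == 1].mean(0) + X_tr[y_tr == -1].mean(0))
def f_simple(X):
    return np.sign((X - c) @ w)

for name, f in [("memorizing", f_memorize), ("simple", f_simple)]:
    print(name, "empirical risk:", emp_risk(f, X_tr, y_tr),
          " estimated expected risk:", emp_risk(f, X_te, y_te))
```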

The idea of SRM is to define a nested sequence of hypothesis spaces H_1 ⊂ H_2 ⊂ ... ⊂ H_M, where each
hypothesis space H_m has finite capacity h_m and larger than that of all previous sets, that is: h_1 ≤ h_2 ≤ ... ≤
h_M. For example H_m could be the set of polynomials of degree m, or a set of splines with m nodes, or some more
complicated nonlinear parameterization. Using such a nested sequence of more and more complex hypothesis
spaces, SRM consists of choosing the minimizer of the empirical risk in the space H_m for which the bound on
the structural risk, as measured by the right hand side of inequality (2), is minimized. Further information
about the statistical properties of SRM can be found in Devroye et al. (1996), Vapnik (1998).
To summarize, in SLT the problem of learning from examples is solved in three steps: (a) we define a loss
function V(y, f(x)) measuring the error of predicting the output of input x with f(x) when the actual output is
y; (b) we define a nested sequence of hypothesis spaces H_m, m = 1, \ldots, M, whose capacity is an increasing
function of m; (c) we minimize the empirical risk in each of H_m and choose, among the solutions found,
the one with the best trade off between the empirical risk and the capacity as given by the right hand side of
inequality (2).

3. Learning as Functional Minimization

3.1. Learning Machines

We now consider hypothesis spaces which are subsets of a Reproducing Kernel Hilbert Space (RKHS)
(Wahba, 1990). A RKHS is a Hilbert space of functions f of the form f(x) = \sum_{n=1}^{N} a_n \phi_n(x), where \{\phi_n(x)\}_{n=1}^{N}
is a set of given, linearly independent basis functions and N can be possibly infinite. A RKHS is equipped
with a norm which is defined as:

\|f\|_K^2 = \sum_{n=1}^{N} \frac{a_n^2}{\lambda_n},

where \{\lambda_n\}_{n=1}^{N} is a decreasing, positive sequence of real values whose sum is finite. The constants \lambda_n and the
basis functions \{\phi_n\}_{n=1}^{N} define the symmetric positive definite kernel function:

K(x, y) = \sum_{n=1}^{N} \lambda_n \phi_n(x) \phi_n(y).

A nested sequence of spaces of functions in the RKHS can be constructed by bounding the RKHS norm of
functions in the space. This can be done by defining a set of constants A_1 < A_2 < \cdots < A_M and considering
spaces of the form:

H_m = \{ f \in \text{RKHS} : \|f\|_K \le A_m \}.

It can be shown that the capacity of the hypothesis spaces H_m is an increasing function of A_m (see for example Evgeniou et al. (1999)). According to the scheme
given at the end of Section 2, the solution of the learning problem is found by solving, for each A_m, the following
optimization problem:

\min_f \sum_{i=1}^{\ell} V(y_i, f(x_i)) \quad \text{subject to} \quad \|f\|_K \le A_m,

and choosing, among the solutions found for each A_m, the one with the best trade off between empirical risk
and capacity, i.e. the one which minimizes the bound on the structural risk as given by inequality (2).
The implementation of the SRM method described above is not practical because it requires to look for the
solution of a large number of constrained optimization problems. This difficulty is overcome by searching for
the minimum of:

H[f] = \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i)) + \lambda \|f\|_K^2.   (3)

The functional H[f] contains both the empirical risk and the norm (complexity or smoothness) of f in the
RKHS, similarly to functionals considered in regularization theory (Tikhonov and Arsenin, 1977). The regularization parameter λ penalizes functions with high
capacity: the larger λ, the smaller the RKHS norm of the solution will be.
When implementing SRM, the key issue is the choice of the hypothesis space, i.e. the parameter H_m
where the structural risk is minimized. In the case of the functional of Eq. (3), the key issue becomes the
choice of the regularization parameter λ. These two problems, as discussed in Evgeniou et al. (1999), are
related, and the SRM method can in principle be used to choose λ (Vapnik, 1998). In practice, instead of using
SRM other methods are used such as cross-validation (Wahba, 1990), Generalized Cross Validation, Finite
Prediction Error and the MDL criteria (see Vapnik (1998) for a review and comparison).
An important feature of the minimizer of H[f] is that, independently of the loss function V, the

minimizer has the same general form (Wahba, 1990)

f(x) = \sum_{i=1}^{\ell} c_i K(x, x_i).   (4)

Notice that Eq. (4) establishes a representation of the function f as a linear combination of kernels centered in each data point. Using different kernels we
get functions such as Gaussian radial basis functions (K(x, y) = \exp(-\|x - y\|^2)), or polynomials of degree d (K(x, y) = (1 + x \cdot y)^d) (Girosi et al., 1995;
Vapnik, 1998).
We now turn to discuss a few learning techniques based on the minimization of functionals of the form
(3) by specifying the loss function V. In particular, we will consider Regularization Networks and Support
Vector Machines (SVM), a learning technique which has recently been proposed for both classification and
regression problems (see Vapnik (1998) and references therein):

Regularization Networks

V(y_i, f(x_i)) = (y_i - f(x_i))^2,   (5)

SVM Classification

V(y_i, f(x_i)) = |1 - y_i f(x_i)|_+,   (6)

where |x|_+ = x if x > 0 and zero otherwise.

SVM Regression

V(y_i, f(x_i)) = |y_i - f(x_i)|_\epsilon,   (7)

where the function |\cdot|_\epsilon, called ε-insensitive loss, is defined as:

|x|_\epsilon = \begin{cases} 0 & \text{if } |x| < \epsilon \\ |x| - \epsilon & \text{otherwise.} \end{cases}   (8)

We now briefly discuss each of these three techniques.

3.2. Regularization Networks

The approximation scheme that arises from the minimization of the quadratic functional

\frac{1}{\ell} \sum_{i=1}^{\ell} (y_i - f(x_i))^2 + \lambda \|f\|_K^2   (9)

for a fixed λ is a special form of regularization. It is possible to show (see for example Girosi et al. (1995))
that the coefficients c_i of the minimizer of (9) in Eq. (4) satisfy the following linear system of equations:

(G + \lambda \ell I) c = y,   (10)

where I is the identity matrix, and we have defined

(y)_i = y_i, \quad (c)_i = c_i, \quad (G)_{ij} = K(x_i, x_j).

Since the coefficients c_i satisfy a linear system, Eq. (4) can be rewritten as:

f(x) = \sum_{i=1}^{\ell} y_i b_i(x),   (11)

with b_i(x) = \sum_{j=1}^{\ell} (G + \lambda \ell I)^{-1}_{ij} K(x_j, x). Equation (11) gives the dual representation of RN. Notice the difference between Eqs. (4) and (11): in the first one the
coefficients c_i are learned from the data while in the second one the basis functions b_i are learned, the coefficient of the expansion being equal to the output of
the examples. We refer to (Girosi et al., 1995) for more information on the dual representation.

3.3. Support Vector Machines

We now discuss Support Vector Machines (SVM) (Cortes and Vapnik, 1995; Vapnik, 1998). We distinguish between real output (regression) and binary output (classification) problems. The method of SVM regression corresponds to the following minimization:

\min_f \; \frac{1}{\ell} \sum_{i=1}^{\ell} |y_i - f(x_i)|_\epsilon + \lambda \|f\|_K^2,   (12)

while the method of SVM classification corresponds to:

\min_f \; \frac{1}{\ell} \sum_{i=1}^{\ell} |1 - y_i f(x_i)|_+ + \lambda \|f\|_K^2.   (13)
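Before turning to the QP formulation of the SVM problems, a minimal numerical sketch of the Regularization Network case may help: for the square loss the coefficients follow from the linear system (10). The Gaussian kernel, the synthetic data and the value of λ below are our own illustrative choices.

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic 1-D regression data: y = sin(x) plus noise.
ell = 30
x = np.sort(rng.uniform(-3, 3, ell))
y = np.sin(x) + 0.1 * rng.randn(ell)

def gaussian_kernel(a, b, sigma=1.0):
    """K(a, b) = exp(-(a - b)^2 / (2 sigma^2)) for 1-D inputs."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * sigma ** 2))

lam = 1e-2
G = gaussian_kernel(x, x)                              # Gram matrix (G)_ij = K(x_i, x_j)
c = np.linalg.solve(G + lam * ell * np.eye(ell), y)    # linear system (10)

# Evaluate f(x) = sum_i c_i K(x, x_i) on a test grid.
x_test = np.linspace(-3, 3, 7)
f_test = gaussian_kernel(x_test, x) @ c
print(np.round(f_test, 3))
print(np.round(np.sin(x_test), 3))                     # target values for comparison
```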

It turns out that for both problems (12) and (13) the coefficients c_i in Eq. (4) can be found by solving a
Quadratic Programming (QP) problem with linear constraints. The regularization parameter λ appears only
in the linear constraints: the absolute values of the coefficients c_i are bounded by \frac{1}{2\lambda\ell}. The QP problem is nontrivial since the size of the matrix of the quadratic form is
equal to ℓ × ℓ and the matrix is dense. A number of algorithms for training SVM have been proposed: some
are based on a decomposition approach where the QP problem is attacked by solving a sequence of smaller
QP problems (Osuna et al., 1997), others on sequential updates of the solution (Platt, 1998).
A remarkable property of SVMs is that loss functions (7) and (6) lead to sparse solutions. This means that,
unlike in the case of Regularization Networks, typically only a small fraction of the coefficients c_i in Eq. (4) are
nonzero. The data points x_i associated with the nonzero c_i are called support vectors. If all data points which
are not support vectors were to be discarded from the training set the same solution would be found. In this
context, an interesting perspective on SVM is to consider its information compression properties. The support vectors represent the most informative data points
and compress the information contained in the training set: for the purpose of, say, classification only the support vectors need to be stored, while all other training
examples can be discarded. This, along with some geometric properties of SVMs such as the interpretation of
the RKHS norm of their solution as the inverse of the margin (Vapnik, 1998), is a key property of SVM and
might explain why this technique works well in many practical applications.

3.4. Kernels and Data Representations

We conclude this short review with a discussion on kernels and data representations. A key issue when using
the learning techniques discussed above is the choice of the kernel K in Eq. (4). The kernel K(x_i, x_j) defines a dot product between the projections of the two
inputs x_i and x_j, in the feature space (the features being \{\phi_1(x), \phi_2(x), \ldots, \phi_N(x)\} with N the dimensionality
of the RKHS). Therefore its choice is closely related to the choice of the effective representation of the data,
i.e. the image representation in a vision application. The problem of choosing the kernel for the machines
discussed here, and more generally the issue of finding appropriate data representations for learning, is an
important and open one. The theory does not provide a general method for finding good data representations, but suggests representations that lead to simple solutions. Although there is not a general solution
to this problem, a number of recent experimental and theoretical works provide insights for specific applications (Evgeniou et al., 2000; Jaakkola and Haussler,
1998; Mohan, 1999; Vapnik, 1998).

References

Alon, N., Ben-David, S., Cesa-Bianchi, N., and Haussler, D. 1993. Scale-sensitive dimensions, uniform convergence, and learnability. Symposium on Foundations of Computer Science.
Cortes, C. and Vapnik, V. 1995. Support vector networks. Machine Learning, 20:273-297.
Devroye, L., Gyorfi, L., and Lugosi, G. 1996. A Probabilistic Theory of Pattern Recognition, No. 31 in Applications of Mathematics. Springer: New York.
Evgeniou, T., Pontil, M., Papageorgiou, C., and Poggio, T. 2000. Image representations for object detection using kernel classifiers. In Proceedings ACCV, Taiwan. To appear.
Evgeniou, T., Pontil, M., and Poggio, T. 1999. A unified framework for Regularization Networks and Support Vector Machines. A.I. Memo No. 1654, Artificial Intelligence Laboratory, Massachusetts Institute of Technology.
Ezzat, T. and Poggio, T. 1996. Facial analysis and synthesis using image-based models. In Face and Gesture Recognition, pp. 116-121.
Girosi, F., Jones, M., and Poggio, T. 1995. Regularization theory and neural networks architectures. Neural Computation, 7:219-269.
Jaakkola, T. and Haussler, D. 1998. Probabilistic kernel regression models. In Proc. of Neural Information Processing Conference.
Kearns, M. and Shapire, R. 1994. Efficient distribution-free learning of probabilistic concepts. Journal of Computer and Systems Sciences, 48(3):464-497.
Mohan, A. 1999. Robust object detection in images by components. Masters Thesis, Massachusetts Institute of Technology.
Osuna, E., Freund, R., and Girosi, F. 1997. An improved training algorithm for support vector machines. In IEEE Workshop on Neural Networks and Signal Processing, Amelia Island, FL.
Papageorgiou, C., Oren, M., and Poggio, T. 1998. A general framework for object detection. In Proceedings of the International Conference on Computer Vision, Bombay, India.
Platt, J.C. 1998. Sequential minimal optimization: A fast algorithm for training support vector machines. Technical Report MSR-TR-98-14, Microsoft Research.
Tikhonov, A.N. and Arsenin, V.Y. 1977. Solutions of Ill-posed Problems. Washington, D.C.: W.H. Winston.
Vapnik, V.N. 1998. Statistical Learning Theory. Wiley: New York.
Vapnik, V.N. and Chervonenkis, A.Y. 1971. On the uniform convergence of relative frequencies of events to their probabilities. Th. Prob. and its Applications, 17(2):264-280.
Wahba, G. 1990. Spline Models for Observational Data. Vol. 59, Series in Applied Mathematics, SIAM: Philadelphia.

REF [7]

Statistical Learning and Kernel Methods

Bernhard Scholkopf
Microsoft Research Limited,
1 Guildhall Street, Cambridge CB2 3NH, UK
bsc@microsoft.com
http://research.microsoft.com/bsc
February 29, 2000

Technical Report
MSR-TR-2000-23

Microsoft Research
Microsoft Corporation
One Microsoft Way
Redmond, WA 98052

Lecture notes for a course to be taught at the Interdisciplinary College 2000,
Gunne, Germany, March 2000.

Abstract

We briefly describe the main ideas of statistical learning theory, support vector machines, and kernel feature spaces.

Contents

1 An Introductory Example
2 Learning Pattern Recognition from Examples
3 Hyperplane Classifiers
4 Support Vector Classifiers
5 Support Vector Regression
6 Further Developments
7 Kernels
8 Representing Similarities in Linear Spaces
9 Examples of Kernels
10 Representing Dissimilarities in Linear Spaces

1 An Introductory Example

Suppose we are given empirical data

(x_1, y_1), \ldots, (x_m, y_m) \in X \times \{\pm 1\}.   (1)

Here, the domain X is some nonempty set that the patterns x_i are taken from; the y_i are called labels or targets.
Unless stated otherwise, indices i and j will always be understood to run over the training set, i.e. i, j = 1, \ldots, m.
Note that we have not made any assumptions on the domain X other than it being a set. In order to study the problem of learning, we need additional
structure. In learning, we want to be able to generalize to unseen data points. In the case of pattern recognition, this means that given some new pattern
x ∈ X, we want to predict the corresponding y ∈ {±1}. By this we mean, loosely speaking, that we choose y such that (x, y) is in some sense similar to
the training examples. To this end, we need similarity measures in X and in {±1}. The latter is easy, as two target values can only be identical or different.
For the former, we require a similarity measure

k : X \times X \to \mathbb{R}, \quad (x, x') \mapsto k(x, x'),   (2)

i.e., a function that, given two examples x and x', returns a real number characterizing their similarity. For reasons that will become clear later, the function
k is called a kernel [13, 1, 8].

A type of similarity measure that is of particular mathematical appeal are dot products. For instance, given two vectors x, x' ∈ R^N, the canonical dot
product is defined as

(x \cdot x') := \sum_{i=1}^{N} (x)_i (x')_i.   (3)

Here, (x)_i denotes the i-th entry of x.
The geometrical interpretation of this dot product is that it computes the cosine of the angle between the vectors x and x', provided they are normalized
to length 1. Moreover, it allows computation of the length of a vector x as \sqrt{x \cdot x}, and of the distance between two vectors as the length of the difference
vector. Therefore, being able to compute dot products amounts to being able to carry out all geometrical constructions that can be formulated in terms of
angles, lengths and distances.
Note, however, that we have not made the assumption that the patterns live in a dot product space. In order to be able to use a dot product as a similarity
measure, we therefore first need to embed them into some dot product space F, which need not be identical to R^N. To this end, we use a map

\Phi : X \to F, \quad x \mapsto \Phi(x).   (4)

The space F is called a feature space. To summarize, embedding the data into F has three benefits.
1. It lets us define a similarity measure from the dot product in F,

k(x, x') := (\Phi(x) \cdot \Phi(x')).   (5)

2. It allows us to deal with the patterns geometrically, and thus lets us study learning algorithms using linear algebra and analytic geometry.
3. The freedom to choose the mapping Φ will enable us to design a large variety of learning algorithms. For instance, consider a situation where the
inputs already live in a dot product space. In that case, we could directly define a similarity measure as the dot product. However, we might still
choose to first apply a nonlinear map Φ to change the representation into one that is more suitable for a given problem and learning algorithm.
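As a small numerical check of the statement that a kernel computes a dot product in a feature space (our example, not taken from these notes): for the homogeneous polynomial kernel k(x, x') = (x · x')^2 on R^2, one explicit feature map is Φ(x) = (x_1^2, x_2^2, √2 x_1 x_2).

```python
import numpy as np

def k_poly2(x, xp):
    """Homogeneous polynomial kernel of degree 2: (x . x')^2."""
    return float(np.dot(x, xp)) ** 2

def phi(x):
    """An explicit feature map for k_poly2 on R^2."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2.0) * x[0] * x[1]])

rng = np.random.RandomState(0)
x, xp = rng.randn(2), rng.randn(2)

print(k_poly2(x, xp))                  # kernel evaluated in the input domain
print(float(np.dot(phi(x), phi(xp))))  # dot product in the feature space F: identical
```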
We are now in the position to describe a pattern recognition learning algorithm that is arguably one of the simplest possible. The basic idea is to compute
the means of the two classes in feature space,

c_1 = \frac{1}{m_1} \sum_{\{i: y_i = +1\}} \Phi(x_i),   (6)

c_2 = \frac{1}{m_2} \sum_{\{i: y_i = -1\}} \Phi(x_i),   (7)

where m_1 and m_2 are the number of examples with positive and negative labels, respectively. We then assign a new point x to the class whose mean is closer to
it. This geometrical construction can be formulated in terms of dot products. Half-way in between c_1 and c_2 lies the point c := (c_1 + c_2)/2. We compute the
class of x by checking whether the vector connecting c and x encloses an angle smaller than π/2 with the vector w := c_1 - c_2 connecting the class means, in
other words

y = \mathrm{sgn}((\Phi(x) - c) \cdot w)
  = \mathrm{sgn}((\Phi(x) - (c_1 + c_2)/2) \cdot (c_1 - c_2))
  = \mathrm{sgn}((\Phi(x) \cdot c_1) - (\Phi(x) \cdot c_2) + b).   (8)

Here, we have defined the offset

b := \frac{1}{2}\left( \|c_2\|^2 - \|c_1\|^2 \right).   (9)

It will prove instructive to rewrite this expression in terms of the patterns x_i in the input domain X. To this end, note that we do not have a dot product
in X, all we have is the similarity measure k (cf. (5)). Therefore, we need to

rewrite everything in terms of the kernel k evaluated on input patterns. To this end, substitute (6) and (7) into (8) to get the decision function

y = \mathrm{sgn}\left( \frac{1}{m_1} \sum_{\{i: y_i = +1\}} (\Phi(x) \cdot \Phi(x_i)) - \frac{1}{m_2} \sum_{\{i: y_i = -1\}} (\Phi(x) \cdot \Phi(x_i)) + b \right)
  = \mathrm{sgn}\left( \frac{1}{m_1} \sum_{\{i: y_i = +1\}} k(x, x_i) - \frac{1}{m_2} \sum_{\{i: y_i = -1\}} k(x, x_i) + b \right).   (10)

Similarly, the offset becomes

b := \frac{1}{2}\left( \frac{1}{m_2^2} \sum_{\{(i,j): y_i = y_j = -1\}} k(x_i, x_j) - \frac{1}{m_1^2} \sum_{\{(i,j): y_i = y_j = +1\}} k(x_i, x_j) \right).   (11)
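A compact numpy sketch of the decision function (10) with the offset (11); the Gaussian kernel and the synthetic data are our own choices, since the notes leave the kernel unspecified at this point.

```python
import numpy as np

rng = np.random.RandomState(0)

# Two Gaussian classes in R^2, labels y in {-1, +1}.
m1, m2 = 40, 40
X = np.vstack([rng.randn(m1, 2) + [2, 0], rng.randn(m2, 2) - [2, 0]])
y = np.hstack([np.ones(m1), -np.ones(m2)])

def k(A, B, sigma=1.0):
    """Gaussian kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

pos, neg = X[y == 1], X[y == -1]

# Offset (11): half the difference of the mean kernel values within each class.
b = 0.5 * (k(neg, neg).mean() - k(pos, pos).mean())

def predict(X_new):
    """Decision function (10): compare the kernel means of the two classes."""
    s = k(X_new, pos).mean(axis=1) - k(X_new, neg).mean(axis=1) + b
    return np.sign(s)

print(np.mean(predict(X) == y))                        # training accuracy
print(predict(np.array([[2.0, 0.0], [-2.0, 0.0]])))    # [ 1. -1.]
```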
Let us consider one well-known special case of this type of classifier. Assume that the class means have the same distance to the origin (hence b = 0), and
that k can be viewed as a density, i.e. it is positive and has integral 1,

\int_X k(x, x') \, dx = 1 \quad \text{for all } x' \in X.   (12)

In order to state this assumption, we have to require that we can define an integral on X.
If the above holds true, then (10) corresponds to the so-called Bayes decision boundary separating the two classes, subject to the assumption that the
two classes were generated from two probability distributions that are correctly estimated by the Parzen windows estimators of the two classes,

p_1(x) := \frac{1}{m_1} \sum_{\{i: y_i = +1\}} k(x, x_i),   (13)

p_2(x) := \frac{1}{m_2} \sum_{\{i: y_i = -1\}} k(x, x_i).   (14)

Given some point x, the label is then simply computed by checking which of the two, p_1(x) or p_2(x), is larger, which directly leads to (10). Note that this decision
is the best we can do if we have no prior information about the probabilities of the two classes.
The classifier (10) is quite close to the types of learning machines that we will be interested in. It is linear in the feature space, while in the input domain,
it is represented by a kernel expansion. It is example-based in the sense that the kernels are centered on the training examples, i.e. one of the two arguments
of the kernels is always a training example. The main point where the more sophisticated techniques to be discussed later will deviate from (10) is in the
selection of the examples that the kernels are centered on, and in the weight that is put on the individual kernels in the decision function. Namely, it will no

longer be the case that all training examples appear in the kernel expansion,
and the weights of the kernels in the expansion will no longer be uniform. In
the feature space representation, this statement corresponds to saying that we
will study all normal vectors w of decision hyperplanes that can be represented
as linear combinations of the training examples. For instance, we might want
to remove the influence of patterns that are very far away from the decision
boundary, either since we expect that they will not improve the generalization
error of the decision function, or since we would like to reduce the computational
cost of evaluating the decision function (cf. (10)). The hyperplane will then only
depend on a subset of training examples, called support vectors.
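The following sketch illustrates this with scikit-learn's SVC, a soft-margin implementation rather than the exact construction developed later in these notes; the attributes shown expose the support vectors and their non-uniform weights in the kernel expansion.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + [2, 2], rng.randn(50, 2) - [2, 2]])
y = np.hstack([np.ones(50), -np.ones(50)])

clf = SVC(kernel="rbf", C=1.0).fit(X, y)

print(len(X), "training examples,", len(clf.support_), "support vectors")
print(clf.support_[:5])          # indices of (some of) the support vectors
print(clf.dual_coef_[0, :5])     # their (non-uniform) weights in the kernel expansion
```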

2 Learning Pattern Recognition from Examples

With the above example in mind, let us now consider the problem of pattern recognition in a more formal setting [27, 28], following the introduction of [19].
In two-class pattern recognition, we seek to estimate a function

f : X \to \{\pm 1\}   (15)

based on input-output training data (1). We assume that the data were generated independently from some unknown (but fixed) probability distribution
P(x, y). Our goal is to learn a function that will correctly classify unseen examples (x, y), i.e. we want f(x) = y for examples (x, y) that were also generated
from P(x, y).
If we put no restriction on the class of functions that we choose our estimate f from, however, even a function which does well on the training data,
e.g. by satisfying f(x_i) = y_i for all i = 1, \ldots, m, need not generalize well to unseen examples. To see this, note that for each function f and any test set
(\bar{x}_1, \bar{y}_1), \ldots, (\bar{x}_{\bar{m}}, \bar{y}_{\bar{m}}) \in \mathbb{R}^N \times \{\pm 1\}, satisfying \{\bar{x}_1, \ldots, \bar{x}_{\bar{m}}\} \cap \{x_1, \ldots, x_m\} = \emptyset,
there exists another function f^* such that f^*(x_i) = f(x_i) for all i = 1, \ldots, m, yet f^*(\bar{x}_i) \neq f(\bar{x}_i) for all i = 1, \ldots, \bar{m}. As we are only given the training data,
we have no means of selecting which of the two functions (and hence which of the completely different sets of test label predictions) is preferable. Hence, only
minimizing the training error (or empirical risk),

R_{emp}[f] = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} |f(x_i) - y_i|,   (16)

does not imply a small test error (called risk), averaged over test examples drawn from the underlying distribution P(x, y),

R[f] = \int \frac{1}{2} |f(x) - y| \, dP(x, y).   (17)

Statistical learning theory [31, 27, 28, 29], or VC (Vapnik-Chervonenkis) theory, shows that it is imperative to restrict the class of functions that f is chosen

from to one which has a capacity that is suitable for the amount of available
training data. VC theory provides bounds on the test error. The minimization
of these bounds, which depend on both the empirical risk and the capacity of
the function class, leads to the principle of structural risk minimization [27].
The best-known capacity concept of VC theory is the VC dimension, defined as the largest number h of points that can be separated in all possible ways using functions of the given class. An example of a VC bound is the following: if h < m is the VC dimension of the class of functions that the learning machine can implement, then for all functions of that class, with a probability of at least 1 − δ, the bound

R(\alpha) \le R_{emp}(\alpha) + \phi\!\left(\frac{h}{m}, \frac{\log(\delta)}{m}\right)    (18)

holds, where the confidence term φ is defined as

\phi\!\left(\frac{h}{m}, \frac{\log(\delta)}{m}\right) = \sqrt{\frac{h\left(\log\frac{2m}{h} + 1\right) - \log(\delta/4)}{m}}.    (19)
Tighter bounds can be formulated in terms of other concepts, such as the annealed VC entropy or the Growth function. These are usually considered to be
harder to evaluate, but they play a fundamental role in the conceptual part of
VC theory [28]. Alternative capacity concepts that can be used to formulate
bounds include the fat shattering dimension [2].
The bound (18) deserves some further explanatory remarks. Suppose we wanted to learn a "dependency" where P(x, y) = P(x) · P(y), i.e. where the pattern x contains no information about the label y, with uniform P(y). Given a training sample of fixed size, we can then surely come up with a learning machine which achieves zero training error (provided we have no examples contradicting each other). However, in order to reproduce the random labellings, this machine will necessarily require a large VC dimension h. Thus, the confidence term (19), increasing monotonically with h, will be large, and the bound (18) will not support possible hopes that due to the small training error, we should expect a small test error. This makes it understandable how (18) can hold independent of assumptions about the underlying distribution P(x, y): it always holds (provided that h < m), but it does not always make a nontrivial prediction; a bound on an error rate becomes void if it is larger than the maximum error rate. In order to get nontrivial predictions from (18), the function space must be restricted such that the capacity (e.g. VC dimension) is small enough (in relation to the available amount of data).

3 Hyperplane Classifiers


In the present section, we shall describe a hyperplane learning algorithm that
can be performed in a dot product space (such as the feature space that we
introduced previously). As described in the previous section, to design learning algorithms, one needs to come up with a class of functions whose capacity can be computed.
[32] and [30] considered the class of hyperplanes

(w \cdot x) + b = 0, \qquad w \in \mathbb{R}^N, \; b \in \mathbb{R},    (20)

corresponding to decision functions

f(x) = \operatorname{sgn}((w \cdot x) + b),    (21)

and proposed a learning algorithm for separable problems, termed the Generalized Portrait, for constructing f from empirical data. It is based on two facts. First, among all hyperplanes separating the data, there exists a unique one yielding the maximum margin of separation between the classes,

\max_{w, b} \; \min\{\|x - x_i\| : x \in \mathbb{R}^N, \; (w \cdot x) + b = 0, \; i = 1, \ldots, m\}.    (22)

nc

es

Second, the capacity decreases with increasing margin.


To construct this Optimal Hyperplane (cf. Figure 1), one solves the following optimization problem:

minimize    \tau(w) = \frac{1}{2}\|w\|^2    (23)
subject to    y_i \cdot ((w \cdot x_i) + b) \ge 1, \quad i = 1, \ldots, m.    (24)

This constrained optimization problem is dealt with by introducing Lagrange multipliers α_i ≥ 0 and a Lagrangian

L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \left( y_i \cdot ((x_i \cdot w) + b) - 1 \right).    (25)

The Lagrangian L has to be minimized with respect to the primal variables w and b and maximized with respect to the dual variables α_i (i.e. a saddle point has to be found). Let us try to get some intuition for this. If a constraint (24) is violated, then y_i · ((w · x_i) + b) − 1 < 0, in which case L can be increased by increasing the corresponding α_i. At the same time, w and b will have to change such that L decreases. To prevent −α_i (y_i · ((w · x_i) + b) − 1) from becoming arbitrarily large, the change in w and b will ensure that, provided the problem is separable, the constraint will eventually be satisfied. Similarly, one can understand that for all constraints which are not precisely met as equalities, i.e. for which y_i · ((w · x_i) + b) − 1 > 0, the corresponding α_i must be 0: this is the value of α_i that maximizes L. The latter is the statement of the Karush-Kuhn-Tucker complementarity conditions of optimization theory [6].
The condition that at the saddle point, the derivatives of L with respect to the primal variables must vanish,

\frac{\partial}{\partial b} L(w, b, \alpha) = 0, \qquad \frac{\partial}{\partial w} L(w, b, \alpha) = 0,    (26)

leads to

\sum_{i=1}^{m} \alpha_i y_i = 0    (27)

and

w = \sum_{i=1}^{m} \alpha_i y_i x_i.    (28)

Figure 1: A binary classification toy problem: separate balls from diamonds. The optimal hyperplane is orthogonal to the shortest line connecting the convex hulls of the two classes (dotted), and intersects it half-way between the two classes. The problem being separable, there exists a weight vector w and a threshold b such that y_i · ((w · x_i) + b) > 0 (i = 1, ..., m). Rescaling w and b such that the point(s) closest to the hyperplane satisfy |(w · x_i) + b| = 1, we obtain a canonical form (w, b) of the hyperplane, satisfying y_i · ((w · x_i) + b) ≥ 1. Note that in this case, the margin, measured perpendicularly to the hyperplane, equals 2/||w||. This can be seen by considering two points x_1, x_2 on opposite sides of the margin, i.e. (w · x_1) + b = 1, (w · x_2) + b = −1, and projecting them onto the hyperplane normal vector w/||w||.
The solution vector thus has an expansion in terms of a subset of the training
patterns, namely those patterns whose i is non-zero, called Support Vectors.
By the Karush-Kuhn-Tucker complementarity conditions

\alpha_i \cdot \left[ y_i ((x_i \cdot w) + b) - 1 \right] = 0, \qquad i = 1, \ldots, m,    (29)

the Support Vectors lie on the margin (cf. Figure 1). All remaining examples of
the training set are irrelevant: their constraint (24) does not play a role in the
optimization, and they do not appear in the expansion (28). This nicely captures
our intuition of the problem: as the hyperplane (cf. Figure 1) is completely
determined by the patterns closest to it, the solution should not depend on the
other examples.
By substituting (27) and (28) into L, one eliminates the primal variables and arrives at the Wolfe dual of the optimization problem [e.g. 6]: find multipliers α_i which

maximize    W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)    (30)
subject to    \alpha_i \ge 0, \; i = 1, \ldots, m, \quad \text{and} \quad \sum_{i=1}^{m} \alpha_i y_i = 0.    (31)

The hyperplane decision function can thus be written as

f(x) = \operatorname{sgn}\!\left( \sum_{i=1}^{m} y_i \alpha_i \cdot (x \cdot x_i) + b \right)    (32)

where b is computed using (29).
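To make the preceding derivation concrete, here is a minimal numerical sketch (not part of the original text) that solves the dual (30)-(31) for a tiny, linearly separable toy set and then recovers w via (28), b via (29), and the decision function (32). The toy data and the use of SciPy's general-purpose solver in place of a dedicated quadratic programming routine are assumptions made purely for illustration.

# Minimal sketch: hard-margin SVM via the dual (30)-(31).
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy data (illustrative assumption).
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],    # class +1
              [0.0, 0.5], [0.5, 0.0], [-0.5, 0.5]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)
m = len(y)

G = (y[:, None] * y[None, :]) * (X @ X.T)    # G_ij = y_i y_j (x_i . x_j)

def neg_dual(a):                             # minimize -W(alpha), cf. (30)
    return -(a.sum() - 0.5 * a @ G @ a)

cons = [{'type': 'eq', 'fun': lambda a: a @ y}]   # (31): sum_i alpha_i y_i = 0
bounds = [(0.0, None)] * m                        # (31): alpha_i >= 0
alpha = minimize(neg_dual, np.zeros(m), bounds=bounds, constraints=cons).x

sv = alpha > 1e-6                            # support vectors: nonzero multipliers
w = (alpha * y) @ X                          # expansion (28)
b = np.mean(y[sv] - X[sv] @ w)               # from the KKT conditions (29)
f = lambda x: np.sign(x @ w + b)             # decision function (32)
print(alpha.round(3), w.round(3), round(b, 3), f(X))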


The structure of the optimization problem closely resembles those that typically arise in Lagrange's formulation of mechanics. Also there, often only a
subset of the constraints become active. For instance, if we keep a ball in a box,
then it will typically roll into one of the corners. The constraints corresponding
to the walls which are not touched by the ball are irrelevant, the walls could
just as well be removed.
Seen in this light, it is not too surprising that it is possible to give a mechanical interpretation of optimal margin hyperplanes [9]: if we assume that each support vector x_i exerts a perpendicular force of size α_i and sign y_i on a solid plane sheet lying along the hyperplane, then the solution satisfies the requirements of mechanical stability. The constraint (27) states that the forces on the sheet sum to zero; and (28) implies that the torques also sum to zero, via \sum_i x_i \times y_i \alpha_i \cdot w/\|w\| = w \times w/\|w\| = 0.
There are theoretical arguments supporting the good generalization performance of the optimal hyperplane ([31, 27, 35, 4]). In addition, it is computationally attractive, since it can be constructed by solving a quadratic programming
problem.

4 Support Vector Classifiers


We now have all the tools to describe support vector machines [28, 19, 26].
Everything in the last section was formulated in a dot product space. We think
of this space as the feature space F described in Section 1. To express the
formulas in terms of the input patterns living in X , we thus need to employ (5),
which expresses the dot product of the mapped feature vectors Φ(x), Φ(x′) in terms of the kernel k evaluated on input patterns x, x′,

k(x, x') = (\Phi(x) \cdot \Phi(x')).    (33)

Figure 2: The idea of SV machines: map the training data into a higher-dimensional feature space via Φ, and construct a separating hyperplane with maximum margin there. This yields a nonlinear decision boundary in input space. By the use of a kernel function (2), it is possible to compute the separating hyperplane without explicitly carrying out the map into the feature space.

This can be done since all feature vectors only occurred in dot products. The weight vector (cf. (28)) then becomes an expansion in feature space, and will thus typically no more correspond to the image of a single vector from input space. We thus obtain decision functions of the more general form (cf. (32))

f(x) = \operatorname{sgn}\!\left( \sum_{i=1}^{m} y_i \alpha_i \cdot (\Phi(x) \cdot \Phi(x_i)) + b \right) = \operatorname{sgn}\!\left( \sum_{i=1}^{m} y_i \alpha_i \cdot k(x, x_i) + b \right),    (34)

and the following quadratic program (cf. (30)):

maximize    W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j k(x_i, x_j)    (35)
subject to    \alpha_i \ge 0, \; i = 1, \ldots, m, \quad \text{and} \quad \sum_{i=1}^{m} \alpha_i y_i = 0.    (36)
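As a brief illustration of what changes relative to the linear case, the sketch below (a toy example with assumed data and an assumed Gaussian kernel, not from the paper) builds the Gram matrix and evaluates the decision function (34) directly from the multipliers, without ever forming w or the map Φ.

# Sketch of the kernel trick: only k(x_i, x_j) and the expansion (34) are needed.
import numpy as np

def rbf(a, b, sigma=1.0):                       # Gaussian RBF kernel, cf. (89)
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def gram(X, kernel):                            # Gram matrix K_ij = k(x_i, x_j)
    m = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])

def decision(x, X, y, alpha, b, kernel):        # evaluates f(x) of (34)
    s = sum(y[i] * alpha[i] * kernel(x, X[i]) for i in range(len(X)))
    return np.sign(s + b)

X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])  # XOR-like toy data
y = np.array([1.0, 1.0, -1.0, -1.0])
K = gram(X, rbf)
print(K.round(3))
# The multipliers alpha would come from the quadratic program (35)-(36), solved
# exactly as in the earlier linear sketch but with K replacing the dot products;
# the point here is only that training and prediction touch the data through k alone.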

In practice, a separating hyperplane may not exist, e.g. if a high noise level
causes a large overlap of the classes. To allow for the possibility of examples
violating (24), one introduces slack variables [10, 28, 22]

\xi_i \ge 0, \qquad i = 1, \ldots, m    (37)

in order to relax the constraints to

y_i \cdot ((w \cdot x_i) + b) \ge 1 - \xi_i, \qquad i = 1, \ldots, m.    (38)

A classifier which generalizes well is then found by controlling both the classifier capacity (via ||w||) and the sum of the slacks Σ_i ξ_i. The latter is done as it can
be shown to provide an upper bound on the number of training errors which leads to a convex optimization problem.

Figure 3: Example of a Support Vector classifier found by using a radial basis function kernel k(x, x′) = exp(−||x − x′||²). Both coordinate axes range from −1 to +1. Circles and disks are two classes of training examples; the middle line is the decision surface; the outer lines precisely meet the constraint (24). Note that the Support Vectors found by the algorithm (marked by extra circles) are not centers of clusters, but examples which are critical for the given classification task. Grey values code the modulus of the argument Σ_{i=1}^m y_i α_i · k(x, x_i) + b of the decision function (34).
One possible realization of a soft margin classifier is minimizing the objective function

\tau(w, \xi) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i    (39)

subject to the constraints (37) and (38), for some value of the constant C > 0 determining the trade-off. Here and below, we use boldface Greek letters as a shorthand for corresponding vectors ξ = (ξ_1, ..., ξ_m). Incorporating kernels, and rewriting it in terms of Lagrange multipliers, this again leads to the problem of maximizing (35), subject to the constraints

0 \le \alpha_i \le C, \; i = 1, \ldots, m, \quad \text{and} \quad \sum_{i=1}^{m} \alpha_i y_i = 0.    (40)

The only difference from the separable case is the upper bound C on the Lagrange multipliers α_i. This way, the influence of the individual patterns (which could be outliers) gets limited. As above, the solution takes the form (34). The threshold b can be computed by exploiting the fact that for all SVs x_i with α_i < C, the slack variable ξ_i is zero (this again follows from the Karush-Kuhn-Tucker complementarity conditions), and hence

\sum_{j=1}^{m} y_j \alpha_j \cdot k(x_i, x_j) + b = y_i.    (41)

Figure 4: In SV regression, a tube with radius ε is fitted to the data. The trade-off between model complexity and points lying outside of the tube (with positive slack variables ξ^(*)) is determined by minimizing (46).
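A hedged sketch of the only change the soft margin introduces on the dual side: the multipliers acquire the upper bound C from (40), and b follows from (41) by averaging over support vectors with α_i < C. The toy data, the tolerance values, and the choice of SciPy's solver are illustrative assumptions.

# Sketch: soft-margin modifications to the dual, cf. (39)-(41).
import numpy as np
from scipy.optimize import minimize

def soft_margin_dual(K, y, C):
    """Maximize W(alpha) of (35) subject to 0 <= alpha_i <= C, sum alpha_i y_i = 0, cf. (40)."""
    m = len(y)
    G = (y[:, None] * y[None, :]) * K
    obj = lambda a: -(a.sum() - 0.5 * a @ G @ a)
    cons = [{'type': 'eq', 'fun': lambda a: a @ y}]
    bounds = [(0.0, C)] * m                        # box constraint 0 <= alpha_i <= C
    alpha = minimize(obj, np.zeros(m), bounds=bounds, constraints=cons).x
    # Threshold from (41), averaged over SVs with alpha_i < C (zero slack there).
    idx = np.where((alpha > 1e-6) & (alpha < C - 1e-6))[0]
    if idx.size == 0:                              # fall back to all SVs if none lie strictly inside the box
        idx = np.where(alpha > 1e-6)[0]
    b = np.mean([y[i] - (alpha * y) @ K[i] for i in idx])
    return alpha, b

# Toy overlapping data (illustrative); linear kernel K = X X^T.
X = np.array([[2.0, 2.0], [2.5, 3.0], [0.2, 1.8],     # last +1 point sits in the overlap
              [0.0, 0.5], [0.5, 0.0], [2.2, 1.9]])    # last -1 point sits in the overlap
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)
alpha, b = soft_margin_dual(X @ X.T, y, C=1.0)
print(alpha.round(3), round(b, 3))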

Another possible realization of a soft margin variant of the optimal hyperplane uses the ν-parametrization [22]. In it, the parameter C is replaced by a parameter ν ∈ [0, 1] which can be shown to lower and upper bound the number of examples that will be SVs and that will come to lie on the wrong side of the hyperplane, respectively. It uses a primal objective function with the error term \frac{1}{\nu m} \sum_i \xi_i - \rho, and separation constraints

y_i \cdot ((w \cdot x_i) + b) \ge \rho - \xi_i, \qquad i = 1, \ldots, m.    (42)

The margin parameter ρ is a variable of the optimization problem. The dual can be shown to consist of maximizing the quadratic part of (35), subject to 0 ≤ α_i ≤ 1/(νm), Σ_i α_i y_i = 0 and the additional constraint Σ_i α_i = 1.
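For completeness, scikit-learn ships a ν-parametrized classifier; the short snippet below is only meant to show how the single parameter ν is exposed in practice (the toy data and the parameter values are assumptions), not to reproduce the exact formulation above.

# Sketch: the nu-parametrization as exposed by scikit-learn's NuSVC.
import numpy as np
from sklearn.svm import NuSVC

X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.5], [0.5, 0.0], [-0.5, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = NuSVC(nu=0.3, kernel='rbf', gamma=1.0)   # nu bounds the fractions of margin errors / SVs
clf.fit(X, y)
print(clf.support_, clf.predict(X))            # indices of the support vectors and predictions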

5 Support Vector Regression


The concept of the margin is specific to pattern recognition. To generalize the SV algorithm to regression estimation [28], an analogue of the margin is constructed in the space of the target values y (note that in regression, we have y ∈ R) by using Vapnik's ε-insensitive loss function (Figure 4)

|y - f(x)|_\varepsilon := \max\{0, |y - f(x)| - \varepsilon\}.    (43)
To estimate a linear regression

f(x) = (w \cdot x) + b    (44)

with precision ε, one minimizes

\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} |y_i - f(x_i)|_\varepsilon.    (45)

Written as a constrained optimization problem, this reads:

minimize    \tau(w, \xi, \xi^*) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} (\xi_i + \xi_i^*)    (46)
subject to    ((w \cdot x_i) + b) - y_i \le \varepsilon + \xi_i    (47)
              y_i - ((w \cdot x_i) + b) \le \varepsilon + \xi_i^*    (48)
              \xi_i, \xi_i^* \ge 0    (49)

for all i = 1, ..., m. Note that according to (47) and (48), any error smaller than ε does not require a nonzero ξ_i or ξ_i*, and hence does not enter the objective

function (46).
Generalization to kernel-based regression estimation is carried out in complete analogy to the case of pattern recognition. Introducing Lagrange multipliers, one thus arrives at the following optimization problem: for C > 0, ε ≥ 0 chosen a priori,

maximize    W(\alpha, \alpha^*) = -\varepsilon \sum_{i=1}^{m} (\alpha_i^* + \alpha_i) + \sum_{i=1}^{m} (\alpha_i^* - \alpha_i) y_i - \frac{1}{2} \sum_{i,j=1}^{m} (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j) k(x_i, x_j)    (50)
subject to    0 \le \alpha_i, \alpha_i^* \le C, \; i = 1, \ldots, m, \quad \text{and} \quad \sum_{i=1}^{m} (\alpha_i^* - \alpha_i) = 0.    (51)

The regression estimate takes the form

f(x) = \sum_{i=1}^{m} (\alpha_i^* - \alpha_i) k(x_i, x) + b,    (52)

where b is computed using the fact that (47) becomes an equality with ξ_i = 0 if 0 < α_i < C, and (48) becomes an equality with ξ_i* = 0 if 0 < α_i* < C.
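The following toy sketch (assumptions: synthetic one-dimensional data and scikit-learn's SVR as the quadratic-program solver) illustrates the role of the two constants chosen a priori in (50)-(51): ε sets the width of the insensitive tube and C the trade-off with flatness.

# Sketch: epsilon-insensitive SV regression, cf. (43)-(52).
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 60).reshape(-1, 1)
y = np.sinc(X).ravel() + 0.05 * rng.standard_normal(60)    # noisy 1-D target

svr = SVR(kernel='rbf', C=10.0, epsilon=0.1, gamma=0.5)    # C and epsilon chosen a priori
svr.fit(X, y)
print("support vectors:", len(svr.support_), "of", len(X))
print("max |training error|:", np.abs(svr.predict(X) - y).max().round(3))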
Several extensions of this algorithm are possible. From an abstract point of view, we just need some target function which depends on the vector (w, ξ) (cf. (46)). There are multiple degrees of freedom for constructing it, including some freedom how to penalize, or regularize, different parts of the vector, and some freedom how to use the kernel trick. For instance, more general loss

Figure 5: Architecture of SV machines. The input x and the Support Vectors x_i are nonlinearly mapped (by Φ) into a feature space F, where dot products are computed. By the use of the kernel k, these two layers are in practice computed in one single step. The results are linearly combined by weights υ_i, found by solving a quadratic program (in pattern recognition, υ_i = y_i α_i; in regression estimation, υ_i = α_i* − α_i). The linear combination is fed into the function σ (in pattern recognition, σ(x) = sgn(x + b); in regression estimation, σ(x) = x + b).
functions can be used for ξ, leading to problems that can still be solved efficiently [24]. Moreover, norms other than the 2-norm ||·|| can be used to regularize the solution. Yet another example is that polynomial kernels can be incorporated which consist of multiple layers, such that the first layer only computes products within certain specified subsets of the entries of w [17].
Finally, the algorithm can be modified such that ε need not be specified a priori. Instead, one specifies an upper bound 0 ≤ ν ≤ 1 on the fraction of points allowed to lie outside the tube (asymptotically, the number of SVs) and the corresponding ε is computed automatically. This is achieved by using as primal objective function

\frac{1}{2}\|w\|^2 + C\left( \nu m \varepsilon + \sum_{i=1}^{m} |y_i - f(x_i)|_\varepsilon \right)    (53)

instead of (45), and treating ε ≥ 0 as a parameter that we minimize over [22].

6 Further Developments

Having described the basics of SV machines, we now summarize some empirical findings and theoretical developments which were to follow.
By the use of kernels, the optimal margin classifier was turned into a classifier which became a serious competitor of high-performance classifiers. Surprisingly, it was noticed that when different kernel functions are used in SV machines, they empirically lead to very similar classification accuracies and SV sets [18]. In this sense, the SV set seems to characterize (or compress) the given task in a manner which up to a certain degree is independent of the type of kernel (i.e. the type of classifier) used.
Initial work at AT&T Bell Labs focused on OCR (optical character recognition), a problem where the two main issues are classification accuracy and classification speed. Consequently, some effort went into the improvement of SV machines on these issues, leading to the Virtual SV method for incorporating prior knowledge about transformation invariances by transforming SVs, and the Reduced Set method for speeding up classification. This way, SV machines became competitive with the best available classifiers on both OCR and object recognition tasks [7, 9, 17].
Another initial weakness of SV machines, less apparent in OCR applications which are characterized by low noise levels, was that the size of the quadratic programming problem scaled with the number of Support Vectors. This was due to the fact that in (35), the quadratic part contained at least all SVs; the common practice was to extract the SVs by going through the training data in chunks while regularly testing for the possibility that some of the patterns that were initially not identified as SVs turn out to become SVs at a later stage (note that without chunking, the size of the matrix would be m × m, where m is the number of all training examples). What happens if we have a high-noise problem? In this case, many of the slack variables ξ_i will become nonzero, and all the corresponding examples will become SVs. For this case, a decomposition algorithm was proposed [14], which is based on the observation that not only can we leave out the non-SV examples (i.e. the x_i with α_i = 0) from the current chunk, but also some of the SVs, especially those that hit the upper boundary (i.e. α_i = C). In fact, one can use chunks which do not even contain all SVs, and maximize over the corresponding sub-problems. SMO [15, 25, 20] explores an extreme case, where the sub-problems are chosen so small that one can solve them analytically. Several public domain SV packages and optimizers are listed on the web page http://www.kernel-machines.org. For more details on the optimization problem, see [19].
On the theoretical side, the least understood part of the SV algorithm initially was the precise role of the kernel, and how a certain kernel choice would influence the generalization ability. In that respect, the connection to regularization theory provided some insight. For kernel-based function expansions, one can show that given a regularization operator P mapping the functions of the learning machine into some dot product space, the problem of minimizing the regularized risk

R_{reg}[f] = R_{emp}[f] + \frac{\lambda}{2}\|Pf\|^2    (54)

(with a regularization parameter λ ≥ 0) can be written as a constrained optimization problem. For particular choices of the loss function, it further reduces to a SV type quadratic programming problem. The latter thus is not specific to SV machines, but is common to a much wider class of approaches. What gets lost in the general case, however, is the fact that the solution can usually be expressed in terms of a small number of SVs. This specific feature of SV machines is due to the fact that the type of regularization and the class of functions that the estimate is chosen from are intimately related [11, 23]: the SV algorithm is equivalent to minimizing the regularized risk on the set of functions

f(x) = \sum_{i} \alpha_i k(x_i, x) + b,    (55)

provided that k and P are interrelated by

k(x_i, x_j) = ((Pk)(x_i, \cdot) \cdot (Pk)(x_j, \cdot)).    (56)

To this end, k is chosen as a Green's function of P^*P, for in that case, the right hand side of (56) equals

(k(x_i, \cdot) \cdot (P^*Pk)(x_j, \cdot)) = (k(x_i, \cdot) \cdot \delta_{x_j}(\cdot)) = k(x_i, x_j).    (57)

For instance, a Gaussian RBF kernel thus corresponds to regularization with a functional containing a specific differential operator.
In SV machines, the kernel thus plays a dual role: firstly, it determines the class of functions (55) that the solution is taken from; secondly, via (56), the kernel determines the type of regularization that is used.
We conclude this section by noticing that the kernel method for computing
dot products in feature spaces is not restricted to SV machines. Indeed, it has
been pointed out that it can be used to develop nonlinear generalizations of any
algorithm that can be cast in terms of dot products, such as principal component
analysis [21], and a number of developments have followed this example.

7 Kernels

We now take a closer look at the issue of the similarity measure, or kernel, k.
In this section, we think of X as a subset of the vector space RN , (N 2 N ),
endowed with the canonical dot product (3).

7.1 Product Features

Suppose we are given patterns x ∈ R^N where most information is contained in the d-th order products (monomials) of entries [x]_j of x,

[x]_{j_1} \cdot \ldots \cdot [x]_{j_d},    (58)
where j_1, ..., j_d ∈ {1, ..., N}. In that case, we might prefer to extract these product features, and work in the feature space F of all products of d entries. In visual recognition problems, where images are often represented as vectors, this would amount to extracting features which are products of individual pixels.
For instance, in R², we can collect all monomial feature extractors of degree 2 in the nonlinear map

\Phi : \mathbb{R}^2 \to F = \mathbb{R}^3    (59)
([x]_1, [x]_2) \mapsto ([x]_1^2, [x]_2^2, [x]_1 [x]_2).    (60)


This approach works fine for small toy examples, but it fails for realistically sized problems: for N-dimensional input patterns, there exist

N_F = \frac{(N + d - 1)!}{d!\,(N - 1)!}    (61)

different monomials (58), comprising a feature space F of dimensionality N_F. For instance, already 16 × 16 pixel input images and a monomial degree d = 5 yield a dimensionality of 10^10.
In certain cases described below, there exists, however, a way of computing dot products in these high-dimensional feature spaces without explicitly mapping into them: by means of kernels nonlinear in the input space R^N. Thus, if the subsequent processing can be carried out using dot products exclusively, we are able to deal with the high dimensionality.
The following section describes how dot products in polynomial feature spaces can be computed efficiently.


7.2 Polynomial Feature Spaces Induced by Kernels


In order to compute dot products of the form (Φ(x) · Φ(x′)), we employ kernel representations of the form

k(x, x') = (\Phi(x) \cdot \Phi(x')),    (62)


which allow us to compute the value of the dot product in F without having to carry out the map Φ. This method was used by [8] to extend the Generalized Portrait hyperplane classifier of [31] to nonlinear Support Vector machines. In [1], F is termed the linearization space, and used in the context of the potential function classification method to express the dot product between elements of F in terms of elements of the input space.
What does k look like for the case of polynomial features? We start by
giving an example [28] for N = d = 2. For the map

C_2 : ([x]_1, [x]_2) \mapsto ([x]_1^2, [x]_2^2, [x]_1 [x]_2, [x]_2 [x]_1),    (63)

dot products in F take the form

(C_2(x) \cdot C_2(x')) = [x]_1^2 [x']_1^2 + [x]_2^2 [x']_2^2 + 2[x]_1 [x]_2 [x']_1 [x']_2 = (x \cdot x')^2,    (64)

i.e. the desired kernel k is simply the square of the dot product in input space.
The same works for arbitrary N, d ∈ N [8]: as a straightforward generalization of a result proved in the context of polynomial approximation [16, Lemma 2.1], we have:

Proposition 1 Define C_d to map x ∈ R^N to the vector C_d(x) whose entries are all possible d-th degree ordered products of the entries of x. Then the corresponding kernel computing the dot product of vectors mapped by C_d is

k(x, x') = (C_d(x) \cdot C_d(x')) = (x \cdot x')^d.    (65)

Proof. We directly compute

(C_d(x) \cdot C_d(x')) = \sum_{j_1, \ldots, j_d = 1}^{N} [x]_{j_1} \cdot \ldots \cdot [x]_{j_d} \cdot [x']_{j_1} \cdot \ldots \cdot [x']_{j_d}    (66)
= \left( \sum_{j=1}^{N} [x]_j \cdot [x']_j \right)^{d} = (x \cdot x')^d.    (67)
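Proposition 1 is easy to check numerically. The sketch below (a toy verification, not from the text; the particular vectors and degree are arbitrary assumptions) enumerates all ordered d-tuples of indices to build C_d(x) explicitly and compares the resulting dot product with (x · x′)^d.

# Numerical check of Proposition 1: (C_d(x) . C_d(x')) == (x . x')^d.
import itertools
import numpy as np

def C_d(x, d):
    # entries of C_d(x): all ordered d-th degree products of entries of x, cf. (58)
    N = len(x)
    return np.array([np.prod([x[j] for j in idx])
                     for idx in itertools.product(range(N), repeat=d)])

x  = np.array([1.0, 2.0, -0.5])
xp = np.array([0.3, -1.0, 2.0])
d = 3
lhs = C_d(x, d) @ C_d(xp, d)
rhs = (x @ xp) ** d
print(lhs, rhs)          # both equal (x . x')^d, as in (65)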


Instead of ordered products, we can use unordered ones to obtain a map Φ_d which yields the same value of the dot product. To this end, we have to compensate for the multiple occurrence of certain monomials in C_d by scaling the respective entries of Φ_d with the square roots of their numbers of occurrence. Then, by this definition of Φ_d, and (65),

(\Phi_d(x) \cdot \Phi_d(x')) = (C_d(x) \cdot C_d(x')) = (x \cdot x')^d.    (68)

For instance, if n of the j_i in (58) are equal, and the remaining ones are different, then the coefficient in the corresponding component of Φ_d is \sqrt{(d - n + 1)!} [for the general case, cf. 23]. For Φ_2, this simply means that [28]

\Phi_2(x) = ([x]_1^2, [x]_2^2, \sqrt{2}\,[x]_1 [x]_2).    (69)

If x represents an image with the entries being pixel values, we can use the kernel (x · x′)^d to work in the space spanned by products of any d pixels, provided that we are able to do our work solely in terms of dot products, without any explicit usage of a mapped pattern Φ_d(x). Using kernels of the form (65), we take into account higher-order statistics without the combinatorial explosion (cf. (61)) of time and memory complexity which goes along already with moderately high N and d.
To conclude this section, note that it is possible to modify (65) such that it maps into the space of all monomials up to degree d, defining [28]

k(x, x') = ((x \cdot x') + 1)^d.    (70)

8 Representing Similarities in Linear Spaces


In what follows, we will look at things the other way round, and start with the
kernel. Given some kernel function, can we construct a feature space such that
the kernel computes the dot product in that feature space? This question has
been brought to the attention of the machine learning community by [1, 8, 28].
In functional analysis, the same problem has been studied under the heading of
Hilbert space representations of kernels. A good monograph on the functional
analytic theory of kernels is [5]; indeed, a large part of the material in the present
section is based on that work.
There is one more aspect in which this section differs from the previous one: the latter dealt with vectorial data. The results in the current section, in contrast, hold for data drawn from domains which need no additional structure other than them being nonempty sets X. This generalizes kernel learning algorithms to a large number of situations where a vectorial representation is not readily available [17, 12, 34].
We start with some basic definitions and results.

Definition 2 (Gram matrix) Given a kernel k and patterns x_1, ..., x_m ∈ X, the m × m matrix

K := (k(x_i, x_j))_{ij}    (71)

is called the Gram matrix (or kernel matrix) of k with respect to x_1, ..., x_m.

Definition 3 (Positive matrix) An m × m matrix K_{ij} satisfying

\sum_{i,j} c_i \bar{c}_j K_{ij} \ge 0    (72)

for all c_i ∈ C is called positive. (The bar in \bar{c}_j denotes complex conjugation.)

Definition 4 ((Positive definite) kernel) Let X be a nonempty set. A function k : X × X → C which for all m ∈ N, x_i ∈ X gives rise to a positive Gram matrix is called a positive definite kernel. Often, we shall refer to it simply as a kernel.

The term kernel stems from the first use of this type of function in the study of integral operators. A function k which gives rise to an operator T via

(Tf)(x) = \int k(x, x') f(x') \, dx'    (73)

is called the kernel of T. One might argue that the term positive definite kernel is slightly misleading. In matrix theory, the term definite is usually used to denote the case where equality in (72) only occurs if c_1 = ... = c_m = 0. Simply using the term positive kernel, on the other hand, could be confused with a kernel whose values are positive. In the literature, a number of different terms

are used for positive definite kernels, such as reproducing kernel, Mercer kernel, or support vector kernel.
The definitions for (positive definite) kernels and positive matrices differ only in the fact that in the former case, we are free to choose the points on which the kernel is evaluated.
Positive definiteness implies positivity on the diagonal,

k(x_1, x_1) \ge 0 \quad \text{for all } x_1 \in X    (74)

(use m = 1 in (72)), and symmetry, i.e.

k(x_i, x_j) = \overline{k(x_j, x_i)}.    (75)

Note that in the complex-valued case, our definition of symmetry includes complex conjugation, depicted by the bar. The definition of symmetry of matrices is analogous, i.e. K_{ij} = \overline{K_{ji}}.
Obviously, real-valued kernels, which are what we will mainly be concerned with, are contained in the above definition as a special case, since we did not require that the kernel take values in C \setminus R. However, it is not sufficient to require that (72) hold for real coefficients c_i. If we want to get away with real coefficients only, we additionally have to require that the kernel be symmetric,

k(x_i, x_j) = k(x_j, x_i).    (76)

It can be shown that whenever k is a (complex-valued) positive definite kernel, its real part is a (real-valued) positive definite kernel.


Kernels can be regarded as generalized dot products. Indeed, any dot product can be shown to be a kernel; however, linearity does not carry over from
dot products to general kernels. Another property of dot products, the CauchySchwarz inequality, does have a natural generalization to kernels:

Re

fe

Proposition 5 If k is a positive de nite kernel, and x1 ; x2 2 X , then


jk(x1 ; x2 )j2  k(x1 ; x1 )  k(x2 ; x2 ):
(77)
Proof. For sake of brevity, we give a non-elementary proof using some basic
facts of linear algebra. The 2  2 Gram matrix with entries Kij = k(xi ; xj ) is

positive. Hence both its eigenvalues are nonnegative, and so is their product,
K 's determinant, i.e.
(78)
0  K11 K22 ? K12K21 = K11 K22 ? K12 K 12 = K11K22 ? jK12 j2 :
Substituting k(xi ; xj ) for Kij , we get the desired inequality.
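As a small numeric companion to Definitions 2-4 and Proposition 5 (a toy check under the assumption of a Gaussian kernel on random points in R^3), one can verify that the Gram matrix is symmetric with nonnegative eigenvalues, and that (77) holds pairwise.

# Sketch: check positivity of a Gram matrix (71)-(72) and the inequality (77).
import numpy as np

def k(a, b, sigma=1.0):                              # Gaussian kernel: positive definite
    return np.exp(-np.linalg.norm(a - b) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 3))                      # eight random patterns in R^3
K = np.array([[k(xi, xj) for xj in X] for xi in X])  # Gram matrix (71)

print(np.allclose(K, K.T))                           # symmetry, cf. (75)
print(np.linalg.eigvalsh(K).min() >= -1e-10)         # positivity (72): eigenvalues >= 0
print(all(K[i, j] ** 2 <= K[i, i] * K[j, j] + 1e-12  # Cauchy-Schwarz (77)
          for i in range(8) for j in range(8)))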
We are now in a position to construct the feature space associated with a
kernel k.
We define a map from X into the space of functions mapping X into C, denoted as C^X, via

\Phi : X \to \mathbb{C}^X, \qquad x \mapsto k(\cdot, x).    (79)

Here, Φ(x) = k(·, x) denotes the function that assigns the value k(x′, x) to x′ ∈ X.
We have thus turned each pattern into a function on the domain X. In a sense, a pattern is now represented by its similarity to all other points in the input domain X. This seems a very rich representation, but it will turn out that the kernel allows the computation of the dot product in that representation.
We shall now construct a dot product space containing the images of the input patterns under Φ. To this end, we first need to endow it with the linear structure of a vector space. This is done by forming linear combinations of the form

f(\cdot) = \sum_{i=1}^{m} \alpha_i k(\cdot, x_i).    (80)

Here, m ∈ N, α_i ∈ C and x_i ∈ X are arbitrary.


Next, we define a dot product between f and another function

g(\cdot) = \sum_{j=1}^{m'} \beta_j k(\cdot, x'_j)    (81)

(m' ∈ N, β_j ∈ C and x'_j ∈ X) as

\langle f, g \rangle := \sum_{i=1}^{m} \sum_{j=1}^{m'} \bar{\alpha}_i \beta_j k(x_i, x'_j).    (82)

To see that this is well-defined, although it explicitly contains the expansion coefficients (which need not be unique), note that

\langle f, g \rangle = \sum_{j=1}^{m'} \beta_j \overline{f(x'_j)},    (83)

using k(x'_j, x_i) = \overline{k(x_i, x'_j)}. The latter, however, does not depend on the particular expansion of f. Similarly, for g, note that

\langle f, g \rangle = \sum_{i=1}^{m} \bar{\alpha}_i g(x_i).    (84)

The last two equations also show that ⟨·,·⟩ is antilinear in the first argument and linear in the second one. It is symmetric, as \langle f, g \rangle = \overline{\langle g, f \rangle}. Moreover, given functions f_1, ..., f_n, and coefficients γ_1, ..., γ_n ∈ C, we have

\sum_{i,j=1}^{n} \bar{\gamma}_i \gamma_j \langle f_i, f_j \rangle = \left\langle \sum_i \gamma_i f_i, \; \sum_i \gamma_i f_i \right\rangle \ge 0,    (85)

hence ⟨·,·⟩ is actually a positive definite kernel on our function space.
For the last step in proving that it even is a dot product, we will use the following interesting property of Φ, which follows directly from the definition: for all functions (80), we have

\langle k(\cdot, x), f \rangle = f(x);    (86)

k is the representer of evaluation. In particular,

\langle k(\cdot, x), k(\cdot, x') \rangle = k(x, x').    (87)

By virtue of these properties, positive kernels k are also called reproducing kernels [3, 5, 33, 17].
By (86) and Proposition 5, we have

|f(x)|^2 = |\langle k(\cdot, x), f \rangle|^2 \le k(x, x) \cdot \langle f, f \rangle.    (88)

Therefore, ⟨f, f⟩ = 0 directly implies f = 0, which is the last property that was left to prove in order to establish that ⟨·,·⟩ is a dot product.
One can complete the space of functions (80) in the norm corresponding to the dot product, i.e. add the limit points of sequences that are convergent in that norm, and thus gets a Hilbert space H, usually called a reproducing kernel Hilbert space. (A Hilbert space H is defined as a complete dot product space. Completeness means that all sequences in H which are convergent in the norm corresponding to the dot product will actually have their limits in H, too.)
The case of real-valued kernels is included in the above; in that case, H can be chosen as a real Hilbert space.
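The reproducing property (86)-(87) can also be checked numerically. In the sketch below (illustrative assumptions: a Gaussian kernel, random expansion points and coefficients), a function f of the form (80) is represented by its coefficients, the dot product (82) is implemented literally for the real-valued case, and ⟨k(·, x), f⟩ is compared against f(x).

# Sketch: the RKHS dot product (82) and the reproducing property (86), real-valued case.
import numpy as np

def k(a, b, sigma=1.0):
    return np.exp(-np.linalg.norm(a - b) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(2)
Xf, alpha = rng.standard_normal((5, 2)), rng.standard_normal(5)   # f = sum_i alpha_i k(., x_i), cf. (80)

def f(x):
    return sum(a * k(x, xi) for a, xi in zip(alpha, Xf))

def dot(alpha_f, Xf_, beta_g, Xg_):                               # <f, g> of (82)
    return sum(a * b * k(xi, xj)
               for a, xi in zip(alpha_f, Xf_) for b, xj in zip(beta_g, Xg_))

x = rng.standard_normal(2)
lhs = dot(np.array([1.0]), np.array([x]), alpha, Xf)              # <k(., x), f>
print(np.isclose(lhs, f(x)))                                      # reproducing property (86): True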

9 Examples of Kernels

Besides (65), [8] and [28] suggest the usage of Gaussian radial basis function kernels [1]

k(x, x') = \exp\!\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right)    (89)

and sigmoid kernels

k(x, x') = \tanh(\kappa (x \cdot x') + \Theta).    (90)
Note that all these kernels have the convenient property of unitary invariance, i.e. k(x, x') = k(Ux, Ux') if U^\top = U^{-1} (if we consider complex numbers, then U^* instead of U^\top has to be used).
The radial basis function kernel additionally is translation invariant. Moreover, as it satisfies k(x, x) = 1 for all x ∈ X, each mapped example has unit length, ||Φ(x)|| = 1. In addition, as k(x, x') > 0 for all x, x' ∈ X, all points lie inside the same orthant in feature space. To see this, recall that for unit length vectors, the dot product (3) equals the cosine of the enclosed angle. Hence

\cos(\angle(\Phi(x), \Phi(x'))) = (\Phi(x) \cdot \Phi(x')) = k(x, x') > 0,    (91)

which amounts to saying that the enclosed angle between any two mapped examples is smaller than π/2.
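A quick numeric illustration of these two facts (unit length of the mapped examples and pairwise angles below π/2), under the assumption of a Gaussian kernel on arbitrary random vectors:

# Sketch: RBF-mapped points have unit length and all pairwise angles are acute, cf. (91).
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((6, 4))
sigma = 1.5
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / (2 * sigma ** 2))

print(np.allclose(np.diag(K), 1.0))                  # k(x, x) = 1: unit length in feature space
angles = np.degrees(np.arccos(np.clip(K, -1, 1)))    # enclosed angles between mapped examples
print(angles.max() < 90.0)                           # all strictly smaller than pi/2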
The examples given so far apply to the case of vectorial data. Let us at least
give one example where X is not a vector space.

Example 6 (Similarity of probabilistic events) If A is a σ-algebra, and P a probability measure on A, then

k(A, B) = P(A \cap B) - P(A) P(B)    (92)

is a positive definite kernel.
Further examples include kernels for string matching, as proposed by [34, 12].

10 Representing Dissimilarities in Linear Spaces

We now move on to a larger class of kernels. It is interesting in several regards.


First, it will turn out that some kernel algorithms work with this larger class of kernels, rather than only with positive definite kernels. Second, their relationship to positive definite kernels is a rather interesting one, and a number of connections between the two classes provide understanding of kernels in general. Third, they are intimately related to a question which is a variation on the central aspect of positive definite kernels: the latter can be thought of as dot products in feature spaces; the former, on the other hand, can be embedded as distance measures arising from norms in feature spaces.
The following definition differs only in the additional constraint on the sum of the c_i from Definition 3.

Definition 7 (Conditionally positive matrix) A symmetric m × m matrix K_{ij} (m ≥ 2) satisfying

\sum_{i,j=1}^{m} c_i \bar{c}_j K_{ij} \ge 0    (93)

for all c_i ∈ C with

\sum_{i=1}^{m} c_i = 0    (94)

is called conditionally positive.

Definition 8 (Conditionally positive definite kernel) Let X be a nonempty set. A function k : X × X → C which for all m ≥ 2, x_i ∈ X gives rise to a conditionally positive Gram matrix is called a conditionally positive definite kernel.
The definitions for the real-valued case look exactly the same. Note that symmetry is required, also in the complex case. Due to the additional constraint on the coefficients c_i, it does not follow automatically anymore.

It is trivially true that whenever k is positive definite, it is also conditionally positive definite. However, the latter is strictly weaker: if k is conditionally positive definite, and b ∈ C, then k + b is also conditionally positive definite. To see this, simply apply the definition to get, for Σ_i c_i = 0,

\sum_{i,j} c_i \bar{c}_j (k(x_i, x_j) + b) = \sum_{i,j} c_i \bar{c}_j k(x_i, x_j) + b \left| \sum_i c_i \right|^2 = \sum_{i,j} c_i \bar{c}_j k(x_i, x_j) \ge 0.    (95)
A standard example of a conditionally positive definite kernel which is not positive definite is

k(x, x') = -\|x - x'\|^2,    (96)

where x, x' ∈ X, and X is a dot product space.
To see this, simply compute, for some pattern set x_1, ..., x_m,

\sum_{i,j} c_i c_j k(x_i, x_j) = -\sum_{i,j} c_i c_j \|x_i - x_j\|^2    (97)
= -\sum_{i,j} c_i c_j \left( \|x_i\|^2 + \|x_j\|^2 - 2(x_i \cdot x_j) \right)
= -\sum_i c_i \sum_j c_j \|x_j\|^2 - \sum_j c_j \sum_i c_i \|x_i\|^2 + 2 \sum_{i,j} c_i c_j (x_i \cdot x_j) = 2 \sum_{i,j} c_i c_j (x_i \cdot x_j) \ge 0,    (98)

where the last line follows from (94) and the fact that k(x, x') = (x · x') is a positive definite kernel. Note that without (94), (97) can also be negative (e.g., put c_1 = ... = c_m = 1), hence the kernel is not a positive definite one.
Without proof, we add that in fact,

k(x, x') = -\|x - x'\|^{\beta}    (99)

is conditionally positive definite for 0 < β ≤ 2.
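The computation (97)-(98) can be mirrored numerically: under the sum-to-zero constraint (94), the quadratic form built from k(x, x′) = −||x − x′||² is nonnegative, while an unconstrained coefficient vector (e.g. all ones) can make it negative. The random points and coefficients below are illustrative assumptions.

# Sketch: -||x - x'||^2 is conditionally positive definite (96)-(98) but not positive definite.
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((7, 3))
K = -((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # K_ij = -||x_i - x_j||^2

c = rng.standard_normal(7)
c -= c.mean()                                         # enforce sum_i c_i = 0, cf. (94)
print(c @ K @ c >= -1e-10)                            # conditionally positive: True

ones = np.ones(7)                                     # without (94) the form can be negative
print(ones @ K @ ones)                                # strictly negative for distinct points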
Let us consider the kernel (96), which can be considered the canonical conditionally positive kernel on a dot product space, and see how it is related to the dot product. Clearly, the distance induced by the norm is invariant under translations, i.e.

\|x - x'\| = \|(x - x_0) - (x' - x_0)\|    (100)

for all x, x', x_0 ∈ X. In other words, even complete knowledge of ||x − x′|| for all x, x′ ∈ X would not completely determine the underlying dot product, the reason being that the dot product is not invariant under translations. Therefore, one needs to fix an origin x_0 when going from the distance measure to the dot product. To this end, we need to write the dot product of x − x_0 and x′ − x_0 in terms of distances:

((x - x_0) \cdot (x' - x_0)) = (x \cdot x') + \|x_0\|^2 - (x \cdot x_0) - (x' \cdot x_0)
= \tfrac{1}{2} \left( -\|x - x'\|^2 + \|x - x_0\|^2 + \|x' - x_0\|^2 \right).    (101)
By construction, this will always result in a positive definite kernel: the dot product is a positive definite kernel, and we have only translated the inputs. We have thus established the connection between (96) and a class of positive definite kernels corresponding to the dot product in different coordinate systems, related to each other by translations. In fact, a similar connection holds for a wide class of kernels:

Proposition 9 Let x_0 ∈ X, and let k be a symmetric kernel on X × X, satisfying k(x_0, x_0) ≥ 0. Then

\tilde{k}(x, x') := k(x, x') - k(x, x_0) - k(x_0, x')    (102)

is positive definite if and only if k is conditionally positive definite.


This result can be generalized to k(x0 ; x0 ) < 0. In this case, we simply need
to add k(x0 ; x0 ) on the right hand side of (102). This is necessary, for otherwise,
we would have k~(x0 ; x0 ) < 0, contradicting (74). Without proof, we state that
it is also sucient.
Using this result, one can prove another interesting connection between the
two classes of kernels:
Proposition 10 A kernel k is conditionally positive de nite if and only if
exp(tk) is positive de nite for all t > 0.
Positive de nite kernels of the form exp(tk) (t > 0) have the interesting
property that their n-th root (n 2 N ) is again a positive de nite kernel. Such
kernels are called in nitely divisible. One can show that, disregarding some
technicalities, the logarithm of an in nitely divisible positive de nite kernel
mapping into R+0 is a conditionally positive de nite kernel.
Conditionally positive de nite kernels are a natural choice whenever we are
dealing with a translation invariant problem, such as the support vector machine: maximization of the margin of separarion between two classes of data is
independent of the origin's position.
can be seen from the dual optimizaP This

y
tion problem (36): the constraint m
i=1 i i = 0 projects out the same subspace
as (94) in the de nition of conditionally positive matrices [17, 23].
We have seen that positive definite kernels and conditionally positive definite kernels are closely related to each other. The former can be represented as dot products in Hilbert spaces. The latter, it turns out, essentially correspond to distance measures associated with norms in Hilbert spaces:

Proposition 11 Let k be a real-valued conditionally positive definite kernel on X, satisfying k(x, x) = 0 for all x ∈ X. Then there exists a Hilbert space H of real-valued functions on X, and a mapping Φ : X → H, such that

k(x, x') = -\|\Phi(x) - \Phi(x')\|^2.    (103)

There exist generalizations to the case where k(x, x) ≠ 0 and where k maps into C. In these cases, the representation looks slightly more complicated.

The significance of this proposition is that using conditionally positive definite kernels, we can thus generalize all algorithms based on distances to corresponding algorithms operating in feature spaces. This is an analogue of the kernel trick for distances rather than dot products, i.e. dissimilarities rather than similarities.

Acknowledgements. Thanks to A. Smola and R. Williamson for discussions, and to C. Watkins for pointing out, in his NIPS'99 SVM workshop talk, that distances and dot products differ in the way they deal with the origin.

References

[1] M. A. Aizerman, E. M. Braverman, and L. I. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821-837, 1964.
[2] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44(4):615-631, 1997.
[3] N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68:337-404, 1950.
[4] P. L. Bartlett and J. Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. In B. Scholkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 43-54, Cambridge, MA, 1999. MIT Press.
[5] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer-Verlag, New York, 1984.
[6] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995.
[7] V. Blanz, B. Scholkopf, H. Bulthoff, C. Burges, V. Vapnik, and T. Vetter. Comparison of view-based object recognition algorithms using realistic 3D models. In C. von der Malsburg, W. von Seelen, J. C. Vorbruggen, and B. Sendhoff, editors, Artificial Neural Networks - ICANN'96, pages 251-256, Berlin, 1996. Springer Lecture Notes in Computer Science, Vol. 1112.
[8] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144-152, Pittsburgh, PA, July 1992. ACM Press.
[9] C. J. C. Burges and B. Scholkopf. Improving the accuracy and speed of support vector learning machines. In M. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 375-381, Cambridge, MA, 1997. MIT Press.
[10] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995.
[11] F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks architectures. Neural Computation, 7(2):219-269, 1995.
[12] D. Haussler. Convolutional kernels on discrete structures. Technical Report UCSC-CRL-99-10, Computer Science Department, University of California at Santa Cruz, 1999.
[13] J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. Roy. Soc. London, A 209:415-446, 1909.
[14] E. Osuna, R. Freund, and F. Girosi. An improved training algorithm for support vector machines. In J. Principe, L. Gile, N. Morgan, and E. Wilson, editors, Neural Networks for Signal Processing VII - Proceedings of the 1997 IEEE Workshop, pages 276-285, New York, 1997. IEEE.
[15] J. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Scholkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 185-208, Cambridge, MA, 1999. MIT Press.
[16] T. Poggio. On optimal nonlinear associative recall. Biological Cybernetics, 19:201-209, 1975.
[17] B. Scholkopf. Support Vector Learning. R. Oldenbourg Verlag, Munchen, 1997. Doktorarbeit, TU Berlin.
[18] B. Scholkopf, C. Burges, and V. Vapnik. Extracting support data for a given task. In U. M. Fayyad and R. Uthurusamy, editors, Proceedings, First International Conference on Knowledge Discovery & Data Mining, Menlo Park, 1995. AAAI Press.
[19] B. Scholkopf, C. J. C. Burges, and A. J. Smola. Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge, MA, 1999.
[20] B. Scholkopf, J. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. TR MSR 99-87, Microsoft Research, Redmond, WA, 1999.
[21] B. Scholkopf, A. Smola, and K.-R. Muller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998.
[22] B. Scholkopf, A. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, 12:1083-1121, 2000.
[23] A. Smola, B. Scholkopf, and K.-R. Muller. The connection between regularization operators and support vector kernels. Neural Networks, 11:637-649, 1998.
[24] A. J. Smola and B. Scholkopf. On a kernel-based method for pattern recognition, regression, approximation and operator inversion. Algorithmica, 22:211-231, 1998.
[25] A. J. Smola and B. Scholkopf. A tutorial on support vector regression. NeuroCOLT Technical Report NC-TR-98-030, Royal Holloway College, University of London, UK, 1998.
[26] A. J. Smola, P. L. Bartlett, B. Scholkopf, and D. Schuurmans. Advances in Large Margin Classifiers. MIT Press, Cambridge, MA, 2000.
[27] V. Vapnik. Estimation of Dependences Based on Empirical Data [in Russian]. Nauka, Moscow, 1979. (English translation: Springer Verlag, New York, 1982).
[28] V. Vapnik. The Nature of Statistical Learning Theory. Springer, N.Y., 1995.
[29] V. Vapnik. Statistical Learning Theory. Wiley, N.Y., 1998.
[30] V. Vapnik and A. Chervonenkis. A note on one class of perceptrons. Automation and Remote Control, 25, 1964.

[31] V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition [in Russian]. Nauka, Moscow, 1974. (German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979).
[32] V. Vapnik and A. Lerner. Pattern recognition using generalized portrait method. Automation and Remote Control, 24, 1963.
[33] G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia, 1990.
[34] C. Watkins. Dynamic alignment kernels. In A. J. Smola, P. L. Bartlett, B. Scholkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 39-50, Cambridge, MA, 2000. MIT Press.
[35] R. C. Williamson, A. J. Smola, and B. Scholkopf. Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators. Technical Report 19, NeuroCOLT, http://www.neurocolt.com, 1998. Accepted for publication in IEEE Transactions on Information Theory.

REF [8]

Duality and Geometry in SVM Classifiers

Kristin P. Bennett
bennek@rpi.edu
Department of Mathematical Sciences, Rensselaer Polytechnic Institute, Troy, NY 12180 USA
Erin J. Bredensteiner
eb6@evansville.edu
Department of Mathematics, University of Evansville, Evansville, IN 47722 USA

Abstract

We develop an intuitive geometric interpretation of the standard support vector machine (SVM) for classification of both linearly separable and inseparable data and provide a rigorous derivation of the concepts behind the geometry. For the separable case finding the maximum margin between the two sets is equivalent to finding the closest points in the smallest convex sets that contain each class (the convex hulls). We now extend this argument to the inseparable case by using a reduced convex hull reduced away from outliers. We prove that solving the reduced convex hull formulation is exactly equivalent to solving the standard inseparable SVM for appropriate choices of parameters. Some additional advantages of the new formulation are that the effect of the choice of parameters becomes geometrically clear and that the formulation may be solved by fast nearest point algorithms. By changing norms these arguments hold for both the standard 2-norm and 1-norm SVM.

1. Introduction

Support vector machines (SVM) are a very robust methodology for inference with minimal parameter choices. This should translate into the popular adaptation of SVM in many application domains by non-SVM experts. The popular success of prior methodologies like neural networks, genetic algorithms, and decision trees was enhanced by the intuitive motivation of these approaches, that in some sense enhanced the end users ability to develop applications independently and have a sense of confidence in the results. How do you sell a SVM to a consulting client, manager, etc? What quick description would allow an end user to grasp the fundamentals of SVM necessary for a successful application? There are three key ideas needed to understand SVM: maximizing margins, the dual formulation, and kernels. Most people intuitively grasp the idea that maximizing margins should help improve generalization. But changing from the primal to dual formulation is typically black magic for those uninitiated in duality theory. Duality is really the key concept frequently missing in the understanding of SVM.

In this paper we provide an intuitive geometric explanation of SVM for classification from the dual perspective along with a mathematically rigorous derivation of the ideas behind the geometry. We begin with an explanation of the geometry of SVM based on the idea of convex hulls. For the separable case, this geometric explanation has existed in various forms (Vapnik, 1996; Mangasarian, 1965; Keerthi et al., 1999; Bennett & Bredensteiner, in press). The new contribution is the adaptation of the convex hull argument for the inseparable case to the most commonly used 2-norm and 1-norm soft margin SVM. The primal form resulting from this argument can be regarded as an especially elegant minor variant of the ν-SVM formulation (Scholkopf et al., 2000) or a soft margin form of the MSM method (Mangasarian, 1965). Related geometric ideas for the ν-SVM formulation were developed independently by Crisp and Burges (1999).
The primary contributions of this paper are:

- A simple intuitive explanation of SVM based on (reduced) convex hulls that allows nonexperts to grasp geometrically the main concepts of SVM.
- A new primal maximum (soft) margin SVM formulation that has as its dual the problem of finding the nearest neighbors in the (reduced) convex hulls. Major benefits of this formulation are that the effects of the misclassification parameter choice are very clear and that it is amenable to solution with very fast closest points in polytope algorithms (Keerthi et al., 1999) and a minor variant of sequential minimal optimization (SMO) (Platt, 1998).
- Proof of the equivalence, for appropriate choices of parameters, between the primal and dual forms of the reduced-convex-hull SVM to the primal and dual forms of the classic SVM.
- Extensions of the reduced convex hull arguments to the sparse 1-norm SVM formulation and a new infinity-norm SVM.

For compactness, we adopt matrix notation instead of the more typical summation notation. In particular, for a column vector x in the n-dimensional real space R^n, x_i denotes the ith component of x. The notation A ∈ R^{m×n} will signify a real m × n matrix. For such a matrix, A_i will denote the ith row. The transpose of x and A are denoted x' and A' respectively. The dot product of two vectors x and w is denoted by x'w. A vector of ones in a space of arbitrary dimension is denoted by e. The scalar 0 and a vector of zeros are both represented by 0. Thus, for x ∈ R^m, x > 0 implies that x_i > 0 for i = 1, ..., m. In general, for x, y ∈ R^m, x > y implies that x_i > y_i for i = 1, ..., m. Similarly, x ≥ y implies that x_i ≥ y_i for i = 1, ..., m. Several norms are used. The 1-norm of x, Σ_{i=1}^m |x_i|, is denoted by ||x||_1. The 2-norm or Euclidean norm of x, \sqrt{\sum_{i=1}^{m} x_i^2} = \sqrt{x'x}, is denoted by ||x|| and ||x||² = x'x. The infinity-norm of x, max_{i=1,...,m} |x_i|, is denoted by ||x||_∞.

Figure 1. Which plane is best?

Figure 2. The two closest points of the convex hulls determine the separating plane.

2. Geometric Intuition: Separable Case

Assume that we are trying to construct a linear discriminant to separate two separable sets A and B. Specifically, this linear discriminant is the plane x'w = γ, where w is the normal of the plane and |γ|/||w|| is the Euclidean distance of the plane from the origin. Let the coordinates of the points in A be given by the m rows of the m × n matrix A. Let the coordinates of the points in B be given by the k rows of the k × n matrix B. We say that the sets are linearly separable if w and γ exist such that: Aw > γe and Bw < γe where e is a vector of ones of appropriate dimension. Figure 1 shows two such separable sets and two of the infinitely many possible planes that separate the sets with 100% accuracy. Which separable plane is preferable? With no other knowledge of the data, most people will prefer the solid line because it is further
from each of the sets. In the case of the dotted line,


small changes in the data will produce misclassification errors. So an intuitive idea would be to construct
the plane that maximizes the minimum distance from
the plane to each set. In fact we know this intuition
coincides with the results in statistical learning theory (Vapnik, 1996) and is substantiated by results in
Shawe-Taylor et al. (1998).
One way to construct the plane as far as possible from
both sets is to construct the smallest convex sets that
contain all the data in each class (i.e. the convex hull)
and find the closest points in those sets. Then, construct the line segment between the two points. The
plane, orthogonal to the line segment, that bisects the
line segment is chosen to be the separating plane. See,
for example, Figure 2. The smallest convex set containing a set of points is called a convex hull. The
convex hulls of A and B are shown with dashed lines.
The convex hull consists of all points that can be written as a convex combination of the points in the orig-

inal set. A convex combination of points is a positive weighted combination where the weights sum to one, e.g. a convex combination c of points in A is defined by c = u_1 A_1 + u_2 A_2 + ... + u_m A_m = u'A where u ∈ R^m, u ≥ 0, and Σ_{i=1}^m u_i = e'u = 1, and a convex combination d of points in B is defined by d = v_1 B_1 + v_2 B_2 + ... + v_k B_k = v'B where v ∈ R^k, v ≥ 0, and e'v = 1.

The problem of finding the two closest points in the convex hulls can be written as an optimization problem (C-Hull):

\min_{u, v} \; \tfrac{1}{2}\|A'u - B'v\|^2
\text{s.t.} \; e'u = 1, \; e'v = 1, \; u \ge 0, \; v \ge 0    (1)

The linear discriminant, x'w = θ, is constructed from the results of C-Hull (1). The normal w is exactly the vector between the two closest points in the convex hulls. Let ū and v̄ be an optimal solution of (1). The normal of the plane is the difference between the closest points, c = A'ū and d = B'v̄. Thus w = c − d = A'ū − B'v̄. The threshold, θ, is the distance from the origin to the point halfway between the two closest points along the normal w: θ = ((c + d)/2)'w = (ū'Aw + v̄'Bw)/2.

There is an alternative approach to finding the best separating plane. Consider a set of parallel supporting planes as in Figure 3. These planes are positioned so that all the points in A satisfy x'w ≥ α and at least one point in A lies on the plane x'w = α. Similarly, all points in B satisfy x'w ≤ β and at least one point in B lies on the plane x'w = β. The optimal separating plane can be found by maximizing the distance between these two supporting hyperplanes. The distance between the two parallel supporting hyperplanes is (α − β)/||w||. Therefore the distance between the two planes can be maximized by minimizing ||w|| and maximizing (α − β).

Figure 3. The primal problem maximizes the distance between two parallel supporting planes.

The problem of maximizing the distance between the two supporting hyperplanes can be written as the following optimization problem (C-Margin):

\min_{w, \alpha, \beta} \; \tfrac{1}{2}\|w\|^2 - (\alpha - \beta)
\text{s.t.} \; Aw - \alpha e \ge 0, \; -Bw + \beta e \ge 0    (2)

The final separating plane is the plane halfway between the two parallel planes: x'w = (α + β)/2. Note that the maximum distance between the supporting planes yields the distance between the two convex hulls. The two closest points for each convex hull must then lie on the supporting planes. The line segment between the two closest points in the convex hulls must be orthogonal to the supporting planes, otherwise a contradiction exists. Such a contradiction could be that either the two supporting planes are not as far apart as possible or these two points are not the closest points in the convex hulls. Therefore the solutions of both approaches are exactly the same. This is an example of duality. As stated later in Theorem 4.1, the dual of C-Margin (2) is C-Hull (1). See Bennett and Bredensteiner (in press) for the derivation. We can formulate and solve the problem in either space as is convenient for us. If there is no degeneracy, we will always get the same plane.

The primal C-Margin (2) and dual C-Hull (1) formulations provide a unifying framework for explaining
other SVM formulations. By transforming C-Margin
(2) into mathematically equivalent optimization problems, different SVM formulations are produced. If we
set = 2 by defining = + 1 and = 1,
then Problem (2) becomes the standard primal SVM
2-norm formulation (Vapnik, 1996)
min
w,

s.t.

1
2

kwk

Aw ( + 1)e 0 Bw + ( 1)e 0
(3)

In fact, as stated in Theorem 4.2, the classic 2-norm


SVM (3) and C-Margin (2) are mathematically equivalent on separable problems. They will produce the
exact same separating plane or an equally good plane
if multiple solutions exist (see Burges & Crisp, 1999).
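To make the C-Hull construction concrete, the following Python sketch solves problem (1) for a small made-up two-dimensional data set and recovers the plane normal w and threshold θ exactly as described above. The data, variable names, and the use of SciPy's general-purpose solver are illustrative choices of ours, not part of the original formulation.

import numpy as np
from scipy.optimize import minimize

# Toy separable data: rows of A are class-A points, rows of B are class-B points.
A = np.array([[1.0, 1.0], [2.0, 0.5], [1.5, 2.0]])
B = np.array([[-1.0, -1.0], [-2.0, 0.0], [-1.0, -2.0]])
m, k = len(A), len(B)

# C-Hull (1): minimize 0.5*||A'u - B'v||^2  s.t.  e'u = 1, e'v = 1, u >= 0, v >= 0.
def objective(z):
    u, v = z[:m], z[m:]
    diff = A.T @ u - B.T @ v
    return 0.5 * diff @ diff

constraints = [{'type': 'eq', 'fun': lambda z: np.sum(z[:m]) - 1.0},
               {'type': 'eq', 'fun': lambda z: np.sum(z[m:]) - 1.0}]
z0 = np.r_[np.full(m, 1.0 / m), np.full(k, 1.0 / k)]
res = minimize(objective, z0, bounds=[(0.0, None)] * (m + k), constraints=constraints)

u, v = res.x[:m], res.x[m:]
c, d = A.T @ u, B.T @ v      # closest points of the two convex hulls
w = c - d                    # normal of the separating plane
theta = 0.5 * (c + d) @ w    # threshold: the plane bisects the segment between c and d
print("w =", w, " theta =", theta)
print("class A side:", A @ w - theta)   # expected positive
print("class B side:", B @ w - theta)   # expected negative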

Figure 5. Convex hull and reduced convex hull with K = 2.

3. Geometric Intuition: Inseparable Case

Figure 4. The convex hulls of inseparable sets intersect.

For inseparable problems, the convex hulls of the two sets will intersect. Consider Figure 4. The difficult-to-classify points of one set will be in the convex hull of the other set. In a problem amenable to linear classification, most points of one class will not be in the convex hull of the other. If we could restrict the influence of outlying points then we could return to the usual convex hull problem. It is undesirable to let one point, particularly a difficult point, excessively influence the solution. Therefore, we want the solution to be based on a lot of points, not just a few bad ones. Say we want the solution to depend on at least K points. This can be done by contracting or reducing the convex hull by putting an upper bound on the multiplier in the convex combination for each point. The reduced convex hull is defined as follows.

Definition 3.1 (Reduced Convex Hull). The set of all convex combinations c = A'u of points in A where e'u = 1, 0 ≤ u ≤ De, D < 1.

Typically we choose D = 1/K with K > 1. Note that the reduced convex hull is nonempty as long as K ≤ m where m is the number of points in set A.

We reduce our feasible set away from the boundaries of the convex hulls so that no extreme point or noisy point can excessively influence the solution. In Figure 5, the reduced convex hulls with K = 2 are given. Note that the reduced sets no longer intersect. Further examples of reduced convex hulls can be seen in Crisp and Burges (1999), who refer to our reduced convex hulls as "soft convex hulls". We believe that this is a misnomer because "softening" implies that the convex hulls are expanding but in fact they are being reduced. As we will see later, the concept of reducing the convex hulls to avoid error is the dual concept to enlarging margins by softening them to allow error. For sets with lots of redundant points, reducing the convex hull has little effect. But for a set with a single outlier the effect is quite marked. Note that for small D the reduced convex hulls no longer intersect. In general, we will need to choose K sufficiently large to ensure that the convex hulls do not intersect. We can now proceed as in the separable case using the reduced convex hulls instead. We will minimize the distance between the reduced convex hulls so that a few bad points will not dominate the solution.

The problem of finding two closest points in the reduced convex hulls can be written as an optimization problem (RC-Hull):

min_{u,v}  (1/2)||A'u − B'v||²
s.t.       e'u = 1, e'v = 1, 0 ≤ u ≤ De, 0 ≤ v ≤ De                         (4)

Immediately we can see the effect of our choice of parameter D = 1/K. Note that each point can contribute no more than 1/K to the optimal solution. So the solution will be robust in some sense since it depends on at least 2K points. If K is too large or conversely D is too small the problem will be infeasible. So K must be smaller than the number of points in each set. Increasing D larger than 1 will produce no advantage over the solution where D = 1. If we have varying confidence in the points or if our classes are skewed in size we can choose different values of D for each point or class. The reader should consult (Schölkopf et al., 2000) for a more formal derivation of these and additional properties for the ν-SVM formulation, which also has been shown to solve the closest points between the reduced convex hulls problem (Crisp & Burges, 1999). RC-Hull (4) is suitable for solution by nearest point in convex polytope algorithms; see (Keerthi et al., 1999).
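As an illustration of RC-Hull (4), the sketch below repeats the closest-points computation but caps every weight at D = 1/K so that a single bad point cannot dominate the solution. The data (including the deliberately placed outlier in class A), the variable names, and the use of SciPy are our own choices for the example, not a prescribed implementation.

import numpy as np
from scipy.optimize import minimize

# RC-Hull (4): same objective as C-Hull, but every weight is bounded above by D = 1/K.
A = np.array([[1.0, 1.0], [2.0, 0.5], [1.5, 2.0], [-1.4, -1.1]])  # last row: an outlier
B = np.array([[-1.0, -1.0], [-2.0, 0.0], [-1.0, -2.0]])
m, k = len(A), len(B)
D = 0.5                                   # K = 2: each point contributes at most 1/2

def objective(z):
    diff = A.T @ z[:m] - B.T @ z[m:]
    return 0.5 * diff @ diff

constraints = [{'type': 'eq', 'fun': lambda z: np.sum(z[:m]) - 1.0},
               {'type': 'eq', 'fun': lambda z: np.sum(z[m:]) - 1.0}]
z0 = np.r_[np.full(m, 1.0 / m), np.full(k, 1.0 / k)]
res = minimize(objective, z0, bounds=[(0.0, D)] * (m + k), constraints=constraints)

u, v = res.x[:m], res.x[m:]
w = A.T @ u - B.T @ v        # normal from the closest points of the reduced hulls
# With the plain convex hulls this outlier makes the hulls intersect; the bound D
# keeps its weight at most 1/2, so the remaining points still determine the plane.
print("weights on class A:", u)
print("normal w:", w)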
If we add a soft margin error term to the separable C-Margin Problem (2), we get the following problem for the inseparable case (RC-Margin):

min_{w,α,β,ξ,η}  (1/2)||w||² − (α − β) + D(e'ξ + e'η)
s.t.             Aw − αe + ξ ≥ 0, ξ ≥ 0
                 −Bw + βe + η ≥ 0, η ≥ 0                                    (5)

with D = 1/K > 0. As we will prove in Theorem 4.3, the dual of RC-Margin (5) is exactly RC-Hull (4) which finds the closest points in the reduced convex hulls.

As in the linearly separable case, one transformation of this problem is to fix α − β by setting α = γ + 1 and β = γ − 1. This results in the classic support vector machine approach (Vapnik, 1996):

min_{w,γ,ξ,η}  C(e'ξ + e'η) + (1/2)||w||²
s.t.           Aw − (γ + 1)e + ξ ≥ 0
               −Bw + (γ − 1)e + η ≥ 0
               ξ ≥ 0, η ≥ 0                                                 (6)

where C > 0 is a fixed constant. Note that the constant C is now different due to an implicit rescaling of the problem. As we will show in Theorem 4.4 the RC-Margin (5) and classic inseparable SVM (6) are equivalent for appropriate choices of C and D.
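The classic inseparable SVM (6) is what off-the-shelf packages solve. The sketch below fits it with scikit-learn on made-up data; scikit-learn's linear SVC minimizes (1/2)||w||² + C times the sum of the slacks, which corresponds to (6) with its intercept playing the role of −γ. By the equivalence discussed here, the resulting normal should agree with the reduced-convex-hull normal up to a positive scale factor for the matching choice of D. The data and names are ours.

import numpy as np
from sklearn.svm import SVC

# Classic inseparable SVM (6) via scikit-learn (illustrative data and names).
A = np.array([[1.0, 1.0], [2.0, 0.5], [1.5, 2.0]])
B = np.array([[-1.0, -1.0], [-2.0, 0.0], [-1.0, -2.0]])
X = np.vstack([A, B])
y = np.array([1] * len(A) + [-1] * len(B))

clf = SVC(kernel="linear", C=10.0).fit(X, y)
w_bar = clf.coef_.ravel()          # normal of the maximum (soft) margin plane
gamma = -clf.intercept_[0]         # sklearn's intercept b corresponds to -gamma here
print("w =", w_bar, " gamma =", gamma)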

4. Equivalence to Classic Formulation

We now rigorously examine the claims of the previous section. We begin with the separable case. For both the separable and inseparable cases, the theorems establish that the dual of our SVM (soft) maximum margin formulation is exactly the (reduced) convex hull formulation and that our (reduced) convex hull based SVM formulations are equivalent to the classic SVM form for appropriate choices of parameters. The first theorem states that the problem of finding the two closest points in the convex hulls of two separable sets is the Wolfe dual (or equivalently Lagrangian dual) of the problem of finding the best separating plane.

Theorem 4.1 (Convex Hulls is Dual). The Wolfe dual of C-Margin SVM (2) is the closest points of the convex hull problem C-Hull (1) or:

max_{u,v}  −(1/2)||A'u − B'v||²
s.t.       e'u = e'v = 1, u ≥ 0, v ≥ 0                                      (7)

Proof of this theorem can be found in full detail in (Bennett & Bredensteiner, in press) or can easily be derived as a variant of the corresponding theorem for the inseparable case.

Problem C-Margin (2), the primal form of the dual C-Hull of finding the closest two points in the convex hulls, is equivalent to the classic separable 2-norm SVM (3) in Vapnik (1996). Specifically, every solution to one problem can be used to construct a corresponding solution to the other by simple scaling. The theorem assumes that the degenerate solution w = 0 is not optimal. This is equivalent to saying that the convex hulls do not intersect. For convex quadratic programs with linear constraints, a solution is optimal if and only if it (along with the corresponding Lagrangian multipliers) satisfies the Karush-Kuhn-Tucker (KKT) optimality conditions of primal feasibility, dual feasibility, and complementary slackness. We call a set of primal C-Margin and dual C-Hull solutions a KKT point. We can establish the equivalence of the C-Margin/C-Hull formulations with the classic SVM formulation by showing that a KKT point of one can be used to derive a KKT point of the other. The optimal separating plane of one solution will also be optimal for the other form, but the weights and threshold are scaled by a constant.

Theorem 4.2 (Equivalence of Separable Forms). Assume C-Margin (2) has a solution with ||ŵ|| > 0. Then (w̄, γ̄, ū, v̄) is a KKT point of the classic separable SVM (3) if and only if (ŵ, α̂, β̂, û, v̂) is a KKT point of C-Margin (2), where δ = e'ū = e'v̄ = 2/(α̂ − β̂), ŵ = w̄/δ, α̂ = (γ̄ + 1)/δ, β̂ = (γ̄ − 1)/δ, û = ū/δ, and v̂ = v̄/δ.

Proof. Each KKT point of the classic separable SVM (3) satisfies:

Aw̄ − (γ̄ + 1)e ≥ 0            −Bw̄ + (γ̄ − 1)e ≥ 0
ū'(Aw̄ − (γ̄ + 1)e) = 0         v̄'(−Bw̄ + (γ̄ − 1)e) = 0
w̄ = A'ū − B'v̄                 e'ū = e'v̄
ū ≥ 0                          v̄ ≥ 0                                       (8)

Dividing each constraint by δ or δ² as appropriate yields a KKT point of the C-Margin SVM (2) satisfying:

Aŵ − α̂e ≥ 0                   −Bŵ + β̂e ≥ 0
û'(Aŵ − α̂e) = 0                v̂'(−Bŵ + β̂e) = 0
ŵ = A'û − B'v̂                  1 = e'û = e'v̂
û ≥ 0                          v̂ ≥ 0                                       (9)

Similarly, multiplying the KKT conditions (9) of C-Margin (2) by δ = 2/(α̂ − β̂) or δ² yields the KKT conditions (8) of the standard separable SVM (3). We know α̂ − β̂ > 0 because by strong duality the primal and dual objectives will be equal, thus

(1/2)||ŵ||² − α̂ + β̂ = −(1/2)||A'û − B'v̂||² < 0.
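The strong-duality identity used in this proof is easy to check numerically. The sketch below, with invented data and SciPy standing in for a purpose-built QP solver, solves C-Hull (1), recovers the supporting-plane values α and β for the resulting w, and confirms that the primal C-Margin objective equals the negated C-Hull objective.

import numpy as np
from scipy.optimize import minimize

# Numerical check of (1/2)||w||^2 - (alpha - beta) = -(1/2)||A'u - B'v||^2 at the optimum.
A = np.array([[1.0, 1.0], [2.0, 0.5], [1.5, 2.0]])
B = np.array([[-1.0, -1.0], [-2.0, 0.0], [-1.0, -2.0]])
m, k = len(A), len(B)

objective = lambda z: 0.5 * np.sum((A.T @ z[:m] - B.T @ z[m:]) ** 2)
constraints = [{'type': 'eq', 'fun': lambda z: np.sum(z[:m]) - 1.0},
               {'type': 'eq', 'fun': lambda z: np.sum(z[m:]) - 1.0}]
z0 = np.r_[np.full(m, 1.0 / m), np.full(k, 1.0 / k)]
z = minimize(objective, z0, bounds=[(0.0, None)] * (m + k), constraints=constraints).x

u, v = z[:m], z[m:]
w = A.T @ u - B.T @ v
alpha, beta = np.min(A @ w), np.max(B @ w)   # supporting-plane values attained by the data
primal = 0.5 * w @ w - (alpha - beta)
dual = -0.5 * w @ w                          # = -(1/2)||A'u - B'v||^2 since w = A'u - B'v
print(primal, dual)                          # agree up to solver tolerance, and both are < 0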

The theorems can be directly generalized to the inseparable case based on reduced convex hulls. The
Wolfe dual (for example, see Mangasarian, 1969) of
RC-Margin (5) is precisely the closest points in the
reduced convex hull problem, RC-Hull (4).
Theorem 4.3 (Reduced Convex Hulls is Dual). The Wolfe dual of the RC-Margin (5) is RC-Hull (4) or equivalently:

max_{u,v}  −(1/2)||A'u − B'v||²
s.t.       e'u = e'v = 1, De ≥ u ≥ 0, De ≥ v ≥ 0                            (10)

Proof. The dual problem maximizes the Lagrangian function of (5), L(w, α, β, ξ, η, u, v, r, s), subject to the constraints that the partial derivatives of the Lagrangian with respect to the primal variables are equal to zero (Mangasarian, 1969). Specifically, the dual of (5) is:

max_{w,α,β,ξ,η,u,v,r,s}  L(w, α, β, ξ, η, u, v, r, s) =
    (1/2)||w||² − (α − β) + De'ξ + De'η − u'(Aw − αe + ξ) − v'(−Bw + βe + η) − r'ξ − s'η
s.t.  ∂L/∂w = w − A'u + B'v = 0
      ∂L/∂α = −1 + e'u = 0,    u ≥ 0
      ∂L/∂β = 1 − e'v = 0,     v ≥ 0
      ∂L/∂ξ = De − u − r = 0,  r ≥ 0
      ∂L/∂η = De − v − s = 0,  s ≥ 0                                        (11)

where α, β ∈ R, w ∈ R^n, ξ, u, r ∈ R^m, and η, v, s ∈ R^k. To simplify the problem, substitute in w = (A'u − B'v), r = De − u and s = De − v:

max_{α,β,u,v}  (1/2)||A'u − B'v||² − (α − β) + De'ξ + De'η − u'A(A'u − B'v) + v'B(A'u − B'v)
               + αe'u − βe'v − u'ξ − v'η − De'ξ − De'η + u'ξ + v'η
s.t.           e'u = e'v = 1, De ≥ u ≥ 0, De ≥ v ≥ 0                        (12)

and then simplify to yield RC-Hull (10).

Theorem 4.4 (Equivalence of Inseparable Forms). Assume RC-Margin (5) has a solution with ||ŵ|| > 0. Then (w̄, γ̄, ξ̄, η̄, ū, v̄) is a KKT point of the classic inseparable SVM (6) with parameter C if and only if (ŵ, α̂, β̂, ξ̂, η̂, û, v̂) is a KKT point of RC-Margin (5) with parameter D, where δ = e'ū = e'v̄ = 2/(α̂ − β̂), ŵ = w̄/δ, α̂ = (γ̄ + 1)/δ, β̂ = (γ̄ − 1)/δ, ξ̂ = ξ̄/δ, η̂ = η̄/δ, û = ū/δ, v̂ = v̄/δ, and D = C/δ.

Proof. Each KKT point of the classic SVM (6) with parameter C satisfies:

Aw̄ − (γ̄ + 1)e + ξ̄ ≥ 0        −Bw̄ + (γ̄ − 1)e + η̄ ≥ 0
ξ̄ ≥ 0                          η̄ ≥ 0
ū'(Aw̄ − (γ̄ + 1)e + ξ̄) = 0     v̄'(−Bw̄ + (γ̄ − 1)e + η̄) = 0
w̄ = A'ū − B'v̄                  e'ū = e'v̄
Ce ≥ ū ≥ 0                      Ce ≥ v̄ ≥ 0
ξ̄'(Ce − ū) = 0                  η̄'(Ce − v̄) = 0                             (13)

Dividing each constraint by δ or δ² as appropriate yields a KKT point of the RC-Margin (5) with parameter D satisfying:

Aŵ − α̂e + ξ̂ ≥ 0               −Bŵ + β̂e + η̂ ≥ 0
ξ̂ ≥ 0                          η̂ ≥ 0
û'(Aŵ − α̂e + ξ̂) = 0            v̂'(−Bŵ + β̂e + η̂) = 0
ŵ = A'û − B'v̂                   1 = e'û = e'v̂
De ≥ û ≥ 0                       De ≥ v̂ ≥ 0
ξ̂'(De − û) = 0                   η̂'(De − v̂) = 0                            (14)

Similarly, multiplying the KKT conditions (14) of the RC-Margin SVM (5) with parameter D by δ = 2/(α̂ − β̂) or δ² yields the KKT conditions (13) of the standard SVM (6) with parameter C. We know α̂ − β̂ > 0 by equality of the primal and dual objectives:

(1/2)||ŵ||² − α̂ + β̂ + De'ξ̂ + De'η̂ = −(1/2)||A'û − B'v̂||² < 0.

Optimizing the reduced-convex-hull form of SVM with parameter D is equivalent to optimizing the classic 2-norm SVM (6) with parameter C. The parameters D and C are related by multiplication of a constant factor based on the size of the optimal margin. If the appropriate values of D and C are chosen, then once again a KKT point of one will be a KKT point of the other. A similar result for the ν-SVM formulation is given in Proposition 13 in Schölkopf et al. (2000).

This theorem proves that for appropriate parameter choice, the solution set of optimal parallel max-margin planes produced by the classic SVM with parameter C (x'w = γ + 1 and x'w = γ − 1) will also be optimal for the reduced-convex-hull problem with parameter D (x'w = α and x'w = β) using the relationship defined above, and vice versa. But it is not true that the sets of final single separating planes produced by the two methods are identical. The plane bisecting the closest points in the reduced convex hulls, i.e. x'w = w'(A'û + B'v̂)/2, is parallel to but not identical to the plane x'w = (α̂ + β̂)/2 that would also be a solution of the original SVM problem once scaled. The thresholds differ. This is illustrated by Figures 6 and 7. Figure 6 gives the solution found by the reduced-convex-hull SVM formulation, which finds the two closest points in the reduced convex hull and as a heuristic selects the threshold halfway between the points. But there is nothing explicit about the choice of threshold in the reduced-convex-hull formulation RC-Hull. In Figure 6, the closest points in the reduced convex hull are represented by an open square and open circle. The solution found by the classic SVM is given in Figure 7. The classic SVM formulation assumes that the best plane bisects the two parallel margin planes. Note that the plane that bisects the closest points is nearer to Class A. In some sense the plane is shifted toward the class in which we have more confidence. It is not a priori evident which assumption for the choice of threshold is best. This property was also noted with the ν-SVM formulation (Crisp & Burges, 1999).

Figure 6. Optimal plane bisecting the closest points in the reduced convex hulls.

Figure 7. Optimal plane bisecting parallel maximum soft margin planes.

Our reduced-convex-hull SVM formulation differs from the ν-SVM formulation in that there are distinct margin thresholds α and β for each class instead of a single variable for both. Extensions of the ν-SVM formulation using parametric models for the margins are suggested in Schölkopf et al. (2000). Similar analysis to the above can be performed for the ν-SVM. We refer readers to Crisp and Burges (1999), which uses a related but different argument for establishing the correspondence of the ν-SVM with the reduced-convex-hull formulation. Assuming there exists a unique nonzero solution to the closest points in the reduced convex hull problem and appropriate parameter choices are made, the reduced-convex-hull, classic, and ν-SVM will all yield a plane with the same orientation, i.e. w is the same modulo a positive scaling factor. But they do not produce the exact same final planes because the assumptions used to construct the thresholds differ.

5. Alternative Norm Variations

We have shown that the classical 2-norm SVM formulation is equivalent to finding the closest points in the reduced convex hulls. This same explanation works for versions of SVM based on alternative norms. For example, consider the case of finding the closest points in the reduced convex hulls as measured by the infinity-norm:

min_{u,v}  ||A'u − B'v||_∞
s.t.       e'u = e'v = 1, De ≥ u ≥ 0, De ≥ v ≥ 0                            (15)

One method for converting the problem into a linear program (LP) produces:

min_{u,v,σ}  σ
s.t.         −σe ≤ A'u − B'v ≤ σe
             e'u = e'v = 1, De ≥ u ≥ 0, De ≥ v ≥ 0                          (16)

The dual is

max_{w,α,β,ξ,η}  α − β − De'ξ − De'η
s.t.             Aw − αe + ξ ≥ 0, ξ ≥ 0
                 −Bw + βe + η ≥ 0, η ≥ 0
                 ||w||_1 = 1                                                (17)

For an appropriate choice of C, this is equivalent to solving the typical 1-norm SVM

min_{w,γ,ξ,η}  Ce'ξ + Ce'η + ||w||_1
s.t.           Aw − (γ + 1)e + ξ ≥ 0, ξ ≥ 0
               −Bw + (γ − 1)e + η ≥ 0, η ≥ 0                                (18)

Similarly, finding the closest points of the reduced convex hulls using the 1-norm is equivalent to constructing a SVM regularized using an infinity-norm on w. Specifically, solving the problem

min_{u,v}  ||A'u − B'v||_1
s.t.       e'u = e'v = 1, De ≥ u ≥ 0, De ≥ v ≥ 0                            (19)

is equivalent to solving (for appropriate choices of D and C)

min_{w,γ,ξ,η}  Ce'ξ + Ce'η + ||w||_∞
s.t.           Aw − (γ + 1)e + ξ ≥ 0, ξ ≥ 0
               −Bw + (γ − 1)e + η ≥ 0, η ≥ 0                                (20)

Limited space does not allow a full development of this argument.
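The infinity-norm variant is attractive precisely because (16) is a linear program. The following sketch sets up (16) directly for scipy.optimize.linprog on made-up data; the variable stacking [u; v; σ] and all names are our own illustrative choices.

import numpy as np
from scipy.optimize import linprog

# LP (16): closest points of the reduced convex hulls in the infinity-norm.
A = np.array([[1.0, 1.0], [2.0, 0.5], [1.5, 2.0]])
B = np.array([[-1.0, -1.0], [-2.0, 0.0], [-1.0, -2.0]])
m, k, n = A.shape[0], B.shape[0], A.shape[1]
D = 0.5                                            # D = 1/K with K = 2

c = np.r_[np.zeros(m + k), 1.0]                    # minimize sigma
M = np.c_[A.T, -B.T]                               # n x (m+k):  M @ [u; v] = A'u - B'v
A_ub = np.r_[np.c_[ M, -np.ones((n, 1))],          # (A'u - B'v)_j - sigma <= 0
             np.c_[-M, -np.ones((n, 1))]]          # -(A'u - B'v)_j - sigma <= 0
b_ub = np.zeros(2 * n)
A_eq = np.zeros((2, m + k + 1))
A_eq[0, :m] = 1.0                                  # e'u = 1
A_eq[1, m:m + k] = 1.0                             # e'v = 1
b_eq = np.array([1.0, 1.0])
bounds = [(0.0, D)] * (m + k) + [(0.0, None)]

sol = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
u, v = sol.x[:m], sol.x[m:m + k]
print("closest points:", A.T @ u, B.T @ v, " infinity-norm distance:", sol.x[-1])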

6. Conclusion

The simple geometric argument of finding the closest points in the convex hulls or reduced convex hulls of two classes can be used to derive an intuitive geometric SVM formulation. Users can grasp visually the primary notions of SVM necessary for successful implementation without getting hung up on notions of duality. The reduced-convex-hull formulation forces the optimal solution to depend on more points depending on the parameter D ∈ (0, 1). If D is too large, the reduced convex hulls intersect, and the meaningless solution w = 0 results. If D is too small, the dual problem will be infeasible. We rigorously showed this formulation is exactly equivalent to the classic SVM formulation for appropriate choices of parameters. Assuming the parameters are well-defined, the solution sets of the problems are the same modulo a scaling factor dependent on the size of the margin. But the final choice of threshold will vary depending on the assumptions of the user. From an optimization perspective the reduced-convex-hull formulations may be preferable due to the interpretability of the misclassification parameter and the availability of fast nearest point in polytope algorithms (Keerthi et al., 1999). If the 1-norm or infinity-norm is used to measure the closest points in the reduced convex hull, the analogous analysis can be performed showing that the primal problem corresponds to the SVM regularized with the infinity-norm or 1-norm of w respectively. Thus the reduced convex hull argument holds for 1-norm SVM linear programming approaches.

Acknowledgements

This material is based on research supported by Microsoft Research and NSF Grants 949427 and IIS-9979860.

References

Bennett, K. P., & Bredensteiner, E. J. (in press). Geometry in learning. In C. Gorini et al. (Eds.), Geometry at work. MAA Press. Also available as http://www.rpi.edu/bennek/geometry2.ps.

Burges, C. J. C., & Crisp, D. J. (1999). Uniqueness of the SVM solution. Proceedings of Neural Information Processing 12. Cambridge, MA: MIT Press.

Crisp, D. J., & Burges, C. J. C. (1999). A geometric interpretation of ν-SVM classifiers. Proceedings of Neural Information Processing 12. Cambridge, MA: MIT Press.

Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (1999). A fast iterative nearest point algorithm for support vector machine classifier design (Technical Report TR-ISL-99-03). Intelligent Systems Lab, Department of Computer Science and Automation, Indian Institute of Science, Bangalore, India.

Mangasarian, O. L. (1965). Linear and nonlinear separation of patterns by linear programming. Operations Research, 13, 444-452.

Mangasarian, O. L. (1969). Nonlinear programming. New York: McGraw-Hill.

Platt, J. (1998). Fast training of support vector machines using sequential minimal optimization. Advances in kernel methods - Support vector learning. Cambridge, MA: MIT Press.

Schölkopf, B., Smola, A., Williamson, R., & Bartlett, P. (2000). New support vector algorithms. Neural Computation, 12, 1083-1121.

Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C., & Anthony, M. (1998). Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44, 1926-1940.

Vapnik, V. N. (1996). The nature of statistical learning theory. New York: Wiley.

REF [9]
REF [10]

Journal of Machine Learning Research (2001) 45-66        Submitted 10/01; Published 11/01

Support Vector Machine Active Learning with Applications to Text Classification

Simon Tong        simon.tong@cs.stanford.edu
Daphne Koller     koller@cs.stanford.edu
Computer Science Department
Stanford University
Stanford CA 94305-9010, USA

Editor: Leslie Pack Kaelbling

1. Introduction

Re

fe

re

nc

es

Support vector machines have met with signicant success in numerous real-world learning
tasks. However, like most machine learning algorithms, they are generally applied using
a randomly selected training set classied in advance. In many settings, we also have the
option of using pool-based active learning. Instead of using a randomly selected training
set, the learner has access to a pool of unlabeled instances and can request the labels for
some number of them. We introduce a new algorithm for performing active learning with
support vector machines, i.e., an algorithm for choosing which instances to request next.
We provide a theoretical motivation for the algorithm using the notion of a version space.
We present experimental results showing that employing our active learning method can
signicantly reduce the need for labeled training instances in both the standard inductive
and transductive settings.
Keywords: Active Learning, Selective Sampling, Support Vector Machines, Classication, Relevance Feedback

In many supervised learning tasks, labeling instances to create a training set is time-consuming and costly; thus, finding ways to minimize the number of labeled instances is beneficial. Usually, the training set is chosen to be a random sampling of instances. However, in many cases active learning can be employed. Here, the learner can actively choose the training data. It is hoped that allowing the learner this extra flexibility will reduce the learner's need for large quantities of labeled data.

Pool-based active learning for classification was introduced by Lewis and Gale (1994). The learner has access to a pool of unlabeled data and can request the true class label for a certain number of instances in the pool. In many domains this is a reasonable approach since a large quantity of unlabeled data is readily available. The main issue with active learning is finding a way to choose good requests or queries from the pool.

Examples of situations in which pool-based active learning can be employed are:

Web searching. A Web-based company wishes to search the web for particular types of pages (e.g., pages containing lists of journal publications). It employs a number of people to hand-label some web pages so as to create a training set for an automatic
classifier that will eventually be used to classify the rest of the web. Since human expertise is a limited resource, the company wishes to reduce the number of pages the employees have to label. Rather than labeling pages randomly drawn from the web, the computer requests targeted pages that it believes will be most informative to label.

Email filtering. The user wishes to create a personalized automatic junk email filter. In the learning phase the automatic learner has access to the user's past email files. It interactively brings up past email and asks the user whether the displayed email is junk mail or not. Based on the user's answer it brings up another email and queries the user. The process is repeated some number of times and the result is an email filter tailored to that specific person.

Relevance feedback. The user wishes to sort through a database or website for items (images, articles, etc.) that are of personal interest; an "I'll know it when I see it" type of search. The computer displays an item and the user tells the learner whether the item is interesting or not. Based on the user's answer, the learner brings up another item from the database. After some number of queries the learner then returns a number of items in the database that it believes will be of interest to the user.

The first two examples involve induction. The goal is to create a classifier that works well on unseen future instances. The third example is an example of transduction (Vapnik, 1998). The learner's performance is assessed on the remaining instances in the database rather than a totally independent test set.

We present a new algorithm that performs pool-based active learning with support vector machines (SVMs). We provide theoretical motivations for our approach to choosing the queries, together with experimental results showing that active learning with SVMs can significantly reduce the need for labeled training instances.

We shall use text classification as a running example throughout this paper. This is the task of determining to which pre-defined topic a given text document belongs. Text classification has an important role to play, especially with the recent explosion of readily available text data. There have been many approaches to achieve this goal (Rocchio, 1971, Dumais et al., 1998, Sebastiani, 2001). Furthermore, it is also a domain in which SVMs have shown notable success (Joachims, 1998, Dumais et al., 1998) and it is of interest to see whether active learning can offer further improvement over this already highly effective method.

The remainder of the paper is structured as follows. Section 2 discusses the use of SVMs both in terms of induction and transduction. Section 3 then introduces the notion of a version space and Section 4 provides theoretical motivation for three methods for performing active learning with SVMs. In Section 5 we present experimental results for two real-world text domains that indicate that active learning can significantly reduce the need for labeled instances in practice. We conclude in Section 7 with some discussion of the potential significance of our results and some directions for future work.

Figure 1: (a) A simple linear support vector machine. (b) A SVM (dotted line) and a transductive SVM (solid line). Solid circles represent unlabeled instances.

2. Support Vector Machines

Support vector machines (Vapnik, 1982) have strong theoretical foundations and excellent empirical successes. They have been applied to tasks such as handwritten digit recognition, object recognition, and text classification.

2.1 SVMs for Induction

We shall consider SVMs in the binary classification setting. We are given training data {x_1 ... x_n} that are vectors in some space X ⊆ R^d. We are also given their labels {y_1 ... y_n} where y_i ∈ {−1, 1}. In their simplest form, SVMs are hyperplanes that separate the training data by a maximal margin (see Fig. 1a). All vectors lying on one side of the hyperplane are labeled as −1, and all vectors lying on the other side are labeled as 1. The training instances that lie closest to the hyperplane are called support vectors. More generally, SVMs allow one to project the original training data in space X to a higher dimensional feature space F via a Mercer kernel operator K. In other words, we consider the set of classifiers of the form:

f(x) = Σ_{i=1}^{n} α_i K(x_i, x).                                           (1)

When K satisfies Mercer's condition (Burges, 1998) we can write: K(u, v) = Φ(u) · Φ(v) where Φ : X → F and "·" denotes an inner product. We can then rewrite f as:

f(x) = w · Φ(x),  where  w = Σ_{i=1}^{n} α_i Φ(x_i).                        (2)

Thus, by using K we are implicitly projecting the training data into a different (often higher dimensional) feature space F. The SVM then computes the α_i's that correspond to the maximal margin hyperplane in F. By choosing different kernel functions we can
implicitly project the training data from X into spaces F for which hyperplanes in F correspond to more complex decision boundaries in the original space X.

Two commonly used kernels are the polynomial kernel given by K(u, v) = (u · v + 1)^p, which induces polynomial boundaries of degree p in the original space X (see footnote 1), and the radial basis function kernel K(u, v) = e^{−(u−v)·(u−v)}, which induces boundaries by placing weighted Gaussians upon key training instances. For the majority of this paper we will assume that the modulus of the training data feature vectors is constant, i.e., for all training instances x_i, ||Φ(x_i)|| = λ for some fixed λ. The quantity ||Φ(x_i)|| is always constant for radial basis function kernels, and so the assumption has no effect for this kernel. For ||Φ(x_i)|| to be constant with the polynomial kernels we require that ||x_i|| be constant. It is possible to relax this constraint on ||Φ(x_i)|| and we shall discuss this at the end of Section 4.
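As a small illustration of the kernels just described, the sketch below implements them directly, together with the kernel expansion of Eq. (1). The helper names and the parameters p and gamma are ours; the radial basis function form quoted in the text corresponds to gamma = 1.

import numpy as np

def polynomial_kernel(u, v, p=2):
    # K(u, v) = (u . v + 1)^p : degree-p polynomial boundaries in the input space
    return (np.dot(u, v) + 1.0) ** p

def rbf_kernel(u, v, gamma=1.0):
    # K(u, v) = exp(-gamma * (u - v).(u - v)); K(x, x) = 1, so ||Phi(x)|| is constant
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

def decision_value(x, train_x, alphas, kernel):
    # f(x) = sum_i alpha_i K(x_i, x) as in Eq. (1); the alphas come from SVM training
    return sum(a * kernel(xi, x) for a, xi in zip(alphas, train_x))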
2.2 SVMs for Transduction

The previous subsection worked within the framework of induction. There was a labeled training set of data and the task was to create a classifier that would have good performance on unseen test data. In addition to regular induction, SVMs can also be used for transduction. Here we are first given a set of both labeled and unlabeled data. The learning task is to assign labels to the unlabeled data as accurately as possible. SVMs can perform transduction by finding the hyperplane that maximizes the margin relative to both the labeled and unlabeled data. See Figure 1b for an example. Recently, transductive SVMs (TSVMs) have been used for text classification (Joachims, 1999b), attaining some improvements in precision/recall breakeven performance over regular inductive SVMs.

3. Version Space

Given a set of labeled training data and a Mercer kernel K, there is a set of hyperplanes that separate the data in the induced feature space F. We call this set of consistent hypotheses the version space (Mitchell, 1982). In other words, hypothesis f is in version space if for every training instance x_i with label y_i we have that f(x_i) > 0 if y_i = 1 and f(x_i) < 0 if y_i = −1. More formally:

Definition 1 Our set of possible hypotheses is given as:

H = { f | f(x) = (w · Φ(x)) / ||w||,  where w ∈ W },

where our parameter space W is simply equal to F. The version space, V, is then defined as:

V = { f ∈ H | ∀i ∈ {1 ... n},  y_i f(x_i) > 0 }.

Notice that since H is a set of hyperplanes, there is a bijection between unit vectors w and hypotheses f in H. Thus we will redefine V as:

V = { w ∈ W | ||w|| = 1,  y_i (w · Φ(x_i)) > 0,  i = 1 ... n }.

1. We have not introduced a bias weight in Eq. (2). Thus, the simple Euclidean inner product will produce hyperplanes that pass through the origin. However, a polynomial kernel of degree one induces hyperplanes that do not need to pass through the origin.
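The definition of V translates directly into a membership test. The sketch below checks it for the linear case Φ(x) = x (so W = F = R^d); the function name and arguments are ours, and the inputs are assumed to be NumPy arrays.

import numpy as np

def in_version_space(w, X_train, y_train):
    # V = { w : ||w|| = 1 and y_i (w . x_i) > 0 for all i }, with Phi(x) = x here.
    w = np.asarray(w, dtype=float)
    w = w / np.linalg.norm(w)                     # V contains only unit vectors
    return bool(np.all(y_train * (X_train @ w) > 0.0))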


Figure 2: (a) Version space duality. The surface of the hypersphere represents unit weight vectors. Each of the two hyperplanes corresponds to a labeled training instance. Each hyperplane restricts the area on the hypersphere in which consistent hypotheses can lie. Here, the version space is the surface segment of the hypersphere closest to the camera. (b) An SVM classifier in a version space. The dark embedded sphere is the largest radius sphere whose center lies in the version space and whose surface does not intersect with the hyperplanes. The center of the embedded sphere corresponds to the SVM, its radius is proportional to the margin of the SVM in F, and the training points corresponding to the hyperplanes that it touches are the support vectors.

Note that a version space only exists if the training data are linearly separable in the feature space. Thus, we require linear separability of the training data in the feature space. This restriction is much less harsh than it might at first seem. First, the feature space often has a very high dimension and so in many cases it results in the data set being linearly separable. Second, as noted by Shawe-Taylor and Cristianini (1999), it is possible to modify any kernel so that the data in the new induced feature space is linearly separable (see footnote 2).

2. This is done by redefining, for all training instances x_i, K(x_i, x_i) to be K(x_i, x_i) plus a positive regularization constant. This essentially achieves the same effect as the soft margin error function (Cortes and Vapnik, 1995) commonly used in SVMs. It permits the training data to be linearly non-separable in the original feature space.

There exists a duality between the feature space F and the parameter space W (Vapnik, 1998, Herbrich et al., 2001) which we shall take advantage of in the next section: points in F correspond to hyperplanes in W and vice versa.

By definition, points in W correspond to hyperplanes in F. The intuition behind the converse is that observing a training instance x_i in the feature space restricts the set of separating hyperplanes to ones that classify x_i correctly. In fact, we can show that the set
of allowable points w in W is restricted to lie on one side of a hyperplane in W. More formally, to show that points in F correspond to hyperplanes in W, suppose we are given a new training instance x_i with label y_i. Then any separating hyperplane must satisfy y_i (w · Φ(x_i)) > 0. Now, instead of viewing w as the normal vector of a hyperplane in F, think of Φ(x_i) as being the normal vector of a hyperplane in W. Thus y_i (w · Φ(x_i)) > 0 defines a half space in W. Furthermore w · Φ(x_i) = 0 defines a hyperplane in W that acts as one of the boundaries to version space V. Notice that the version space is a connected region on the surface of a hypersphere in parameter space. See Figure 2a for an example.

SVMs find the hyperplane that maximizes the margin in the feature space F. One way to pose this optimization task is as follows:

maximize_{w ∈ F}   min_i { y_i (w · Φ(x_i)) }
subject to:        ||w|| = 1
                   y_i (w · Φ(x_i)) > 0,  i = 1 ... n.

By having the conditions ||w|| = 1 and y_i (w · Φ(x_i)) > 0 we cause the solution to lie in the version space. Now, we can view the above problem as finding the point w in the version space that maximizes the distance min_i { y_i (w · Φ(x_i)) }. From the duality between feature and parameter space, and since ||Φ(x_i)|| = λ, each Φ(x_i)/λ is a unit normal vector of a hyperplane in parameter space. Because of the constraints y_i (w · Φ(x_i)) > 0, i = 1 ... n, each of these hyperplanes delimits the version space. The expression y_i (w · Φ(x_i)) can be regarded as:

λ × the distance between the point w and the hyperplane with normal vector Φ(x_i).

Thus, we want to find the point w in the version space that maximizes the minimum distance to any of the delineating hyperplanes. That is, SVMs find the center of the largest radius hypersphere whose center can be placed in the version space and whose surface does not intersect with the hyperplanes corresponding to the labeled instances, as in Figure 2b.

The normals of the hyperplanes that are touched by the maximal radius hypersphere are the Φ(x_i) for which the distance y_i (w · Φ(x_i)) is minimal. Now, taking the original rather than the dual view, and regarding w as the unit normal vector of the SVM and Φ(x_i) as points in feature space, we see that the hyperplanes that are touched by the maximal radius hypersphere correspond to the support vectors (i.e., the labeled points that are closest to the SVM hyperplane boundary).

The radius of the sphere is the distance from the center of the sphere to one of the touching hyperplanes and is given by y_i (w · Φ(x_i)/λ) where Φ(x_i) is a support vector. Now, viewing w as a unit normal vector of the SVM and Φ(x_i) as points in feature space, we have that the distance y_i (w · Φ(x_i)/λ) is:

(1/λ) × the distance between support vector Φ(x_i) and the hyperplane with normal vector w,

which is the margin of the SVM divided by λ. Thus, the radius of the sphere is proportional to the margin of the SVM.

4. Active Learning

In pool-based active learning we have a pool of unlabeled instances. It is assumed that the instances x are independently and identically distributed according to some underlying distribution F(x) and the labels are distributed according to some conditional distribution P(y | x).

Given an unlabeled pool U, an active learner ℓ has three components: (f, q, X). The first component is a classifier, f : X → {−1, 1}, trained on the current set of labeled data X (and possibly unlabeled instances in U too). The second component q(X) is the querying function that, given a current labeled set X, decides which instance in U to query next. The active learner can return a classifier f after each query (online learning) or after some fixed number of queries.
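The (f, q, X) description above corresponds to a simple loop. The sketch below is a generic pool-based active learning skeleton under our own naming: X_pool is assumed to be a NumPy array, 'oracle' stands for whatever supplies the true label of a queried instance, and 'query_fn' is a placeholder for a querying rule such as the ones sketched later in this section.

import numpy as np

def pool_based_active_learning(X_pool, oracle, query_fn, n_queries, X_init, y_init):
    X_labeled = np.asarray(X_init, dtype=float)
    y_labeled = np.asarray(y_init)
    remaining = list(range(len(X_pool)))                       # indices still unlabeled
    for _ in range(n_queries):
        j = query_fn(X_labeled, y_labeled, X_pool[remaining])  # position within the pool
        idx = remaining.pop(j)
        X_labeled = np.vstack([X_labeled, X_pool[idx]])
        y_labeled = np.append(y_labeled, oracle(idx))          # ask for the true label (+1/-1)
    return X_labeled, y_labeled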

The main difference between an active learner and a passive learner is the querying component q. This brings us to the issue of how to choose the next unlabeled instance to query. Similar to Seung et al. (1992), we use an approach that queries points so as to attempt to reduce the size of the version space as much as possible. We take a myopic approach that greedily chooses the next query based on this criterion. We also note that myopia is a standard approximation used in sequential decision making problems (Horvitz and Rutledge, 1991, Latombe, 1991, Heckerman et al., 1994). We need two more definitions before we can proceed:

Definition 2 Area(V) is the surface area that the version space V occupies on the hypersphere ||w|| = 1.

Definition 3 Given an active learner ℓ, let V_i denote the version space of ℓ after i queries have been made. Now, given the (i + 1)th query x_{i+1}, define:

V_i⁻ = V_i ∩ { w ∈ W | −(w · Φ(x_{i+1})) > 0 },
V_i⁺ = V_i ∩ { w ∈ W | +(w · Φ(x_{i+1})) > 0 }.

So V_i⁻ and V_i⁺ denote the resulting version spaces when the next query x_{i+1} is labeled as −1 and 1 respectively.

We wish to reduce the version space as fast as possible. Intuitively, one good way of doing this is to choose a query that halves the version space. The following lemma says that, for any given number of queries, the learner that chooses successive queries that halve the version spaces is the learner that minimizes the maximum expected size of the version space, where the maximum is taken over all conditional distributions of y given x:

Lemma 4 Suppose we have an input space X, finite dimensional feature space F (induced via a kernel K), and parameter space W. Suppose active learner ℓ* always queries instances whose corresponding hyperplanes in parameter space W halve the area of the current version space. Let ℓ be any other active learner. Denote the version spaces of ℓ* and ℓ after i queries as V_i* and V_i respectively. Let P denote the set of all conditional distributions of y given x. Then,

∀i ∈ N⁺    sup_{P ∈ P} E_P[Area(V_i*)]  ≤  sup_{P ∈ P} E_P[Area(V_i)],

with strict inequality whenever there exists a query j ∈ {1 ... i} by ℓ that does not halve version space V_{j−1}.

Proof. The proof is straightforward. The learner ℓ* always chooses to query instances that halve the version space. Thus Area(V_{i+1}*) = (1/2) Area(V_i*) no matter what the labeling of the query points are. Let r denote the dimension of feature space F. Then r is also the dimension of the parameter space W. Let S_r denote the surface area of the unit hypersphere of dimension r. Then, under any conditional distribution P, Area(V_i*) = S_r / 2^i.

Now, suppose ℓ does not always query an instance that halves the area of the version space. Then after some number, k, of queries ℓ first chooses to query a point x_{k+1} that does not halve the current version space V_k. Let y_{k+1} ∈ {−1, 1} correspond to the labeling of x_{k+1} that will cause the larger half of the version space to be chosen.

Without loss of generality assume Area(V_k⁻) > Area(V_k⁺) and so y_{k+1} = −1. Note that Area(V_k⁻) + Area(V_k⁺) = S_r / 2^k, so we have that Area(V_k⁻) > S_r / 2^{k+1}.

Now consider the conditional distribution P_0:

P_0(−1 | x) = 1/2  if x ≠ x_{k+1},
P_0(−1 | x) = 1    if x = x_{k+1}.

Then under this distribution, ∀i > k,

E_{P_0}[Area(V_i)] = (1 / 2^{i−k−1}) Area(V_k⁻) > S_r / 2^i.

Hence, ∀i > k,

sup_{P ∈ P} E_P[Area(V_i)] > sup_{P ∈ P} E_P[Area(V_i*)].

Now, suppose w* ∈ W is the unit parameter vector corresponding to the SVM that we would have obtained had we known the actual labels of all of the data in the pool. We know that w* must lie in each of the version spaces V_1 ⊇ V_2 ⊇ V_3 ..., where V_i denotes the version space after i queries. Thus, by shrinking the size of the version space as much as possible with each query, we are reducing as fast as possible the space in which w* can lie. Hence, the SVM that we learn from our limited number of queries will lie close to w*.

If one is willing to assume that there is a hypothesis lying within H that generates the data and that the generating hypothesis is deterministic and that the data are noise free, then strong generalization performance properties of an algorithm that halves version space can also be shown (Freund et al., 1997). For example one can show that the generalization error decreases exponentially with the number of queries.

Figure 3: (a) Simple Margin will query b. (b) Simple Margin will query a.

Figure 4: (a) MaxMin Margin will query b. The two SVMs with margins m⁻ and m⁺ for b are shown. (b) Ratio Margin will query e. The two SVMs with margins m⁻ and m⁺ for e are shown.

This discussion provides motivation for an approach where we query instances that split the current version space into two equal parts as much as possible. Given an unlabeled instance x from the pool, it is not practical to explicitly compute the sizes of the new version spaces V⁻ and V⁺ (i.e., the version spaces obtained when x is labeled as −1 and +1 respectively). We next present three ways of approximating this procedure.

Simple Margin. Recall from Section 3 that, given some data {x_1 ... x_i} and labels {y_1 ... y_i}, the SVM unit vector w_i obtained from this data is the center of the largest hypersphere that can fit inside the current version space V_i. The position of w_i in the version space V_i clearly depends on the shape of the region V_i, however it is often approximately in the center of the version space. Now, we can test each of the unlabeled instances x in the pool to see how close their corresponding hyperplanes in W come to the centrally placed w_i. The closer a hyperplane in W is to the point w_i, the more centrally it is placed in the version space, and the more it bisects the version space. Thus we can pick the unlabeled instance in the pool whose hyperplane
in W comes closest to the vector w_i. For each unlabeled instance x, the shortest distance between its hyperplane in W and the vector w_i is simply the distance between the feature vector Φ(x) and the hyperplane w_i in F, which is easily computed by |w_i · Φ(x)|. This results in the natural rule: learn an SVM on the existing labeled data and choose as the next instance to query the instance that comes closest to the hyperplane in F.

Figure 3a presents an illustration. In the stylized picture we have flattened out the surface of the unit weight vector hypersphere that appears in Figure 2a. The white area is version space V_i which is bounded by solid lines corresponding to labeled instances. The five dotted lines represent unlabeled instances in the pool. The circle represents the largest radius hypersphere that can fit in the version space. Note that the edges of the circle do not touch the solid lines, just as the dark sphere in Figure 2b does not meet the hyperplanes on the surface of the larger hypersphere (they meet somewhere under the surface). The instance b is closest to the SVM w_i and so we will choose to query b.

MaxMin Margin. The Simple Margin method can be a rather rough approximation. It relies on the assumption that the version space is fairly symmetric and that w_i is centrally placed. It has been demonstrated, both in theory and practice, that these assumptions can fail significantly (Herbrich et al., 2001). Indeed, if we are not careful we may actually query an instance whose hyperplane does not even intersect the version space. The MaxMin approximation is designed to overcome these problems to some degree. Given some data {x_1 ... x_i} and labels {y_1 ... y_i}, the SVM unit vector w_i is the center of the largest hypersphere that can fit inside the current version space V_i and the radius m_i of the hypersphere is proportional (see footnote 3) to the size of the margin of w_i. We can use the radius m_i as an indication of the size of the version space (Vapnik, 1998). Suppose we have a candidate unlabeled instance x in the pool. We can estimate the relative size of the resulting version space V⁻ by labeling x as −1, finding the SVM obtained from adding x to our labeled training data and looking at the size of its margin m⁻. We can perform a similar calculation for V⁺ by relabeling x as class +1 and finding the resulting SVM to obtain margin m⁺.

Since we want an equal split of the version space, we wish Area(V⁻) and Area(V⁺) to be similar. Now, consider min(Area(V⁻), Area(V⁺)). It will be small if Area(V⁻) and Area(V⁺) are very different. Thus we will consider min(m⁻, m⁺) as an approximation and we will choose to query the x for which this quantity is largest. Hence, the MaxMin query algorithm is as follows: for each unlabeled instance x compute the margins m⁻ and m⁺ of the SVMs obtained when we label x as −1 and +1 respectively; then choose to query the unlabeled instance for which the quantity min(m⁻, m⁺) is greatest.

Figures 3b and 4a show an example comparing the Simple Margin and MaxMin Margin methods.

Ratio Margin. This method is similar in spirit to the MaxMin Margin method. We use m⁻ and m⁺ as indications of the sizes of V⁻ and V⁺. However, we shall try to take into account the fact that the current version space V_i may be quite elongated and for some x in the pool both m⁻ and m⁺ may be small simply because of the shape of the version space. Thus we will instead look at the relative sizes of m⁻ and m⁺ and choose to query the x for which min(m⁻/m⁺, m⁺/m⁻) is largest (see Figure 4b).

3. To ease notation, without loss of generality we shall assume the constant of proportionality is 1, i.e., the radius is equal to the margin.
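The three querying rules above can be prototyped in a few lines each. The sketch below uses scikit-learn and our own function names; it is an illustration of the rules, not the authors' implementation. Note that decision_function includes scikit-learn's bias term, which the formulation in the text omits, and the MaxMin/Ratio margins are computed for a linear kernel, where the margin of a trained SVC is 1/||w||.

import numpy as np
from sklearn.svm import SVC

def simple_margin_query(X_labeled, y_labeled, X_pool):
    # Simple Margin: query the pool instance closest to the current hyperplane.
    clf = SVC(kernel="linear").fit(X_labeled, y_labeled)
    return int(np.argmin(np.abs(clf.decision_function(X_pool))))

def _margins(X_labeled, y_labeled, x, C=1e6):
    # Retrain once with each tentative label of x and return the two margins (m-, m+).
    out = {}
    for label in (-1, +1):
        X = np.vstack([X_labeled, x])
        y = np.append(y_labeled, label)
        clf = SVC(kernel="linear", C=C).fit(X, y)
        out[label] = 1.0 / np.linalg.norm(clf.coef_)
    return out[-1], out[+1]

def maxmin_margin_query(X_labeled, y_labeled, X_pool):
    scores = [min(_margins(X_labeled, y_labeled, x)) for x in X_pool]
    return int(np.argmax(scores))

def ratio_margin_query(X_labeled, y_labeled, X_pool):
    scores = []
    for x in X_pool:
        m_minus, m_plus = _margins(X_labeled, y_labeled, x)
        scores.append(min(m_minus / m_plus, m_plus / m_minus))
    return int(np.argmax(scores))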

re

nc

es

The above three methods are approximations to the querying component that always
halves version space. After performing some number of queries we then return a classifier
by learning a SVM with the labeled instances.
The margin can be used as an indication of the version space size irrespective of whether
the feature vectors have constant modulus. Thus the explanation for the MaxMin and Ratio
methods still holds even without the constraint on the modulus of the training feature
vectors. The Simple method can still be used when the training feature vectors do not
have constant modulus, but the motivating explanation no longer holds since the maximal
margin hyperplane can no longer be viewed as the center of the largest allowable sphere.
However, for the Simple method, alternative motivations have recently been proposed by
Campbell et al. (2000) that do not require the constraint on the modulus.
For inductive learning, after performing some number of queries we then return a classifier by learning a SVM with the labeled instances. For transductive learning, after querying
some number of instances we then return a classifier by learning a transductive SVM with
the labeled and unlabeled instances.
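To make the querying component concrete, here is a minimal sketch of the pool-based loop implied above for the inductive case; query_fn stands for any of the Simple, MaxMin or Ratio rules and oracle is a stand-in for the human labeler, both names being ours rather than the paper's.

import numpy as np
from sklearn.svm import SVC

def active_learn(X_seed, y_seed, pool, oracle, query_fn, n_queries=8):
    # Start from the seed labels, repeatedly query, then return the
    # classifier learned on the labeled instances (inductive setting).
    X, y = np.asarray(X_seed), np.asarray(y_seed)
    pool = list(pool)
    for _ in range(n_queries):
        idx = query_fn(X, y, pool)        # choose the next instance to label
        x = pool.pop(idx)
        X = np.vstack([X, x])
        y = np.append(y, oracle(x))       # ask for its label
    return SVC(kernel="linear").fit(X, y)

In the transductive case one would instead return a transductive SVM trained on the labeled instances together with the remaining pool.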

5. Experiments


For our empirical evaluation of the above methods we used two real-world text classification
domains: the Reuters-21578 data set and the Newsgroups data set.


5.1 Reuters Data Collection Experiments


The Reuters-21578 data set (see footnote 4) is a commonly used collection of newswire stories categorized
into hand-labeled topics. Each news story has been hand-labeled with some number of topic
labels such as "corn", "wheat" and "corporate acquisitions". Note that some of the topics
overlap and so some articles belong to more than one category. We used the 12902 articles
from the ModApte split of the data (see footnote 5) and, to stay comparable with previous studies, we
considered the top ten most frequently occurring topics. We learned ten different binary
classifiers, one to distinguish each topic. Each document was represented as a stemmed,
TFIDF-weighted word frequency vector (see footnote 6). Each vector had unit modulus. A stop list of
common words was used and words occurring in fewer than three documents were also
ignored. Using this representation, the document vectors had about 10000 dimensions.
We first compared the three querying methods in the inductive learning setting. Our
test set consisted of the 3299 documents present in the ModApte test set.
4. Obtained from www.research.att.com/lewis.
5. The Reuters-21578 collection comes with a set of predefined training and test set splits. The commonly
used ModApte split filters out duplicate articles and those without a labeled topic, and then uses earlier
articles as the training set and later articles as the test set.
6. We used Rainbow (McCallum, 1996) for text processing.
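For readers who want to build a comparable representation, the sketch below uses scikit-learn's TfidfVectorizer rather than the Rainbow toolkit the authors used; stemming is omitted for brevity, and reuters_docs is a placeholder name for the raw article strings.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    stop_words="english",  # stop list of common words
    min_df=3,              # ignore words occurring in fewer than three documents
    norm="l2",             # unit-modulus (length-one) document vectors
)
# X = vectorizer.fit_transform(reuters_docs)  # sparse matrix, roughly 10000 columns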


[Figure 5: two panels plotting (a) test set accuracy and (b) precision/recall breakeven point against labeled training set size (20 to 100), with curves for Full, Random, Simple, MaxMin and Ratio; see the caption below.]

Figure 5: (a) Average test set accuracy over the ten most frequently occurring topics when
using a pool size of 1000. (b) Average test set precision/recall breakeven point
over the ten most frequently occurring topics when using a pool size of 1000.

Topic      Simple        MaxMin        Ratio         Equivalent Random size
Earn       86.39 ± 1.65  87.75 ± 1.40  90.24 ± 2.31  34
Acq        77.04 ± 1.17  77.08 ± 2.00  80.42 ± 1.50  > 100
Money-fx   93.82 ± 0.35  94.80 ± 0.14  94.83 ± 0.13  50
Grain      95.53 ± 0.09  95.29 ± 0.38  95.55 ± 1.22  13
Crude      95.26 ± 0.38  95.26 ± 0.15  95.35 ± 0.21  > 100
Trade      96.31 ± 0.28  96.64 ± 0.10  96.60 ± 0.15  > 100
Interest   96.15 ± 0.21  96.55 ± 0.09  96.43 ± 0.09  > 100
Ship       97.75 ± 0.11  97.81 ± 0.09  97.66 ± 0.12  > 100
Wheat      98.10 ± 0.24  98.48 ± 0.09  98.13 ± 0.20  > 100
Corn       98.31 ± 0.19  98.56 ± 0.05  98.30 ± 0.19  15

Table 1: Average test set accuracy over the top ten most frequently occurring topics (most
frequent topic first) when trained with ten labeled documents. Boldface indicates
statistical significance.

For each of the ten topics we performed the following steps. We created a pool of
unlabeled data by sampling 1000 documents from the remaining data and removing their
labels. We then randomly selected two documents in the pool to give as the initial labeled
training set. One document was about the desired topic, and the other document was
not about the topic. Thus we gave each learner 998 unlabeled documents and 2 labeled
documents. After a fixed number of queries we asked each learner to return a classifier (an
SVM with a polynomial kernel of degree one, learned on the labeled training documents; see footnote 7). We then tested the classifier on the independent test set.
7. For SVM and transductive SVM learning we used SVMlight (Joachims, 1999a).
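A compact sketch of this per-topic protocol is given below; it reuses the hypothetical active_learn helper sketched earlier, assumes dense feature vectors with labels in {-1, +1}, and is meant only to mirror the steps just described.

import numpy as np

def run_topic(X_all, y_all, X_test, y_test, query_fn, n_queries=8, seed=0):
    rng = np.random.default_rng(seed)
    pool_idx = list(rng.choice(len(X_all), size=1000, replace=False))
    pos = next(i for i in pool_idx if y_all[i] == +1)   # one on-topic seed
    neg = next(i for i in pool_idx if y_all[i] == -1)   # one off-topic seed
    pool_idx.remove(pos)
    pool_idx.remove(neg)                                # 998 unlabeled remain
    labels = {tuple(X_all[i]): y_all[i] for i in pool_idx}
    oracle = lambda x: labels[tuple(x)]                 # reveal a label when queried
    clf = active_learn(X_all[[pos, neg]], y_all[[pos, neg]],
                       [X_all[i] for i in pool_idx], oracle, query_fn, n_queries)
    return clf.score(X_test, y_test)                    # accuracy on the test set

Repeating and averaging such runs per topic gives numbers comparable in form to those reported in the tables.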

Topic      Simple        MaxMin        Ratio         Equivalent Random size
Earn       86.05 ± 0.61  89.03 ± 0.53  88.95 ± 0.74  12
Acq        54.14 ± 1.31  56.43 ± 1.40  57.25 ± 1.61  12
Money-fx   35.62 ± 2.34  38.83 ± 2.78  38.27 ± 2.44  52
Grain      50.25 ± 2.72  58.19 ± 2.04  60.34 ± 1.61  51
Crude      58.22 ± 3.15  55.52 ± 2.42  58.41 ± 2.39  55
Trade      50.71 ± 2.61  48.78 ± 2.61  50.57 ± 1.95  85
Interest   40.61 ± 2.42  45.95 ± 2.61  43.71 ± 2.07  60
Ship       53.93 ± 2.63  52.73 ± 2.95  53.75 ± 2.85  > 100
Wheat      64.13 ± 2.10  66.71 ± 1.65  66.57 ± 1.37  > 100
Corn       49.52 ± 2.12  48.04 ± 2.01  46.25 ± 2.18  > 100

Table 2: Average test set precision/recall breakeven point over the top ten most frequently
occurring topics (most frequent topic first) when trained with ten labeled documents. Boldface indicates statistical significance.


The above procedure was repeated thirty times for each topic and the results were
averaged. We considered the Simple Margin, MaxMin Margin and Ratio Margin querying
methods as well as a Random Sample method. The Random Sample method simply randomly chooses the next query point from the unlabeled pool. This last method reflects what
happens in the regular passive learning setting: the training set is a random sampling of
the data.
To measure performance we used two metrics: test set classification error and, to
stay compatible with previous Reuters corpus results, the precision/recall breakeven point
(Joachims, 1998). Precision is the percentage of documents a classifier labels as relevant
that are really relevant. Recall is the percentage of relevant documents that are labeled as
relevant by the classifier. By altering the decision threshold on the SVM we can trade precision for recall and can obtain a precision/recall curve for the test set. The precision/recall
breakeven point is a one-number summary of this graph: it is the point at which precision
equals recall.
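One common way to compute this summary, sketched below under the assumption that the scores are the SVM decision values on the test set, is to set the threshold so that the number of predicted-relevant documents equals the number of truly relevant ones; at that cutoff precision and recall coincide by construction.

import numpy as np

def breakeven_point(scores, y_true):
    # y_true is 1 for relevant documents and 0 (or -1) otherwise.
    y_true = np.asarray(y_true)
    n_pos = int(np.sum(y_true == 1))
    top = np.argsort(scores)[::-1][:n_pos]   # predict the n_pos highest-scoring docs
    tp = int(np.sum(y_true[top] == 1))
    return tp / n_pos                        # equals both precision and recall here

# e.g. breakeven_point(clf.decision_function(X_test), y_test)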
Figures 5a and 5b present the average test set accuracy and precision/recall breakeven
points over the ten topics as we vary the number of queries permitted. The horizontal line
is the performance level achieved when the SVM is trained on all 1000 labeled documents
comprising the pool. Over the Reuters corpus, the three active learning methods perform
almost identically with little notable difference to distinguish between them. Each method
also appreciably outperforms random sampling. Tables 1 and 2 show the test set accuracy
and breakeven performance of the active methods after they have asked for just eight labeled
instances (so, together with the initial two random instances, they have seen ten labeled
instances). They demonstrate that the three active methods perform similarly on this
Reuters data set after eight queries, with the MaxMin and Ratio showing a very slight edge in
performance. The last columns in each table are of more interest. They show approximately
how many instances would be needed if we were to use Random to achieve the same level
of performance as the Ratio active learning method. In this instance, passive learning on
average requires over six times as much data to achieve comparable levels of performance as
the active learning methods. The tables indicate that active learning provides more benefit
with the infrequent classes, particularly when measuring performance by the precision/recall
breakeven point. This last observation has also been noted before in previous empirical
tests (McCallum and Nigam, 1998).


[Figure 6: two panels plotting (a) test set accuracy and (b) precision/recall breakeven point against labeled training set size, comparing Full, Ratio, BalancedRandom and Random; see the caption below.]

Figure 6: (a) Average test set accuracy over the ten most frequently occurring topics when
using a pool size of 1000. (b) Average test set precision/recall breakeven point
over the ten most frequently occurring topics when using a pool size of 1000.


We noticed that approximately half of the queries that the active learning methods
asked tended to turn out to be positively labeled, regardless of the true overall proportion
of positive instances in the domain. We investigated whether the gains that the active
learning methods had over regular Random sampling were due to this biased sampling. We
created a new querying method called BalancedRandom which would randomly sample an
equal number of positive and negative instances from the pool. Obviously in practice the
ability to randomly sample an equal number of positive and negative instances without
having to label an entire pool of instances first may or may not be reasonable depending
upon the domain in question. Figures 6a and 6b show the average accuracy and breakeven
point of the BalancedRandom method compared with the Ratio active method and regular
Random method on the Reuters dataset with a pool of 1000 unlabeled instances. The Ratio
and Random curves are the same as those shown in Figures 5a and 5b. The MaxMin and
Simple curves are omitted to ease legibility. The BalancedRandom method has a much better precision/recall breakeven performance than the regular Random method, although it is
still matched and then outperformed by the active method. For classification accuracy, the
BalancedRandom method initially has extremely poor performance (less than 50%, which is
even worse than pure random guessing) and is always consistently and significantly outperformed by the active method. This indicates that the performance gains of the active
methods are not merely due to their ability to bias the class of the instances they query.
The active methods are choosing special targeted instances and approximately half of these
instances happen to have positive labels.
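A sketch of this baseline, under the artificial assumption that the pool labels can be peeked at for sampling, might look as follows; the function name is ours.

import numpy as np

def balanced_random_sample(pool_y, n_queries, seed=0):
    # Draw an equal number of positive and negative pool indices at random.
    rng = np.random.default_rng(seed)
    pos = rng.permutation(np.flatnonzero(pool_y == +1))[: n_queries // 2]
    neg = rng.permutation(np.flatnonzero(pool_y == -1))[: n_queries // 2]
    return np.concatenate([pos, neg])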


Figure 7: (a) Average test set accuracy over the ten most frequently occurring topics when
using pool sizes of 500 and 1000. (b) Average breakeven point over the ten
most frequently occurring topics when using pool sizes of 500 and 1000.




Figures 7a and 7b show the average accuracy and breakeven point of the Ratio method
with two different pool sizes. Clearly the Random sampling method's performance will not be
affected by the pool size. However, the graphs indicate that increasing the pool of unlabeled
data will improve both the accuracy and breakeven performance of active learning. This is
quite intuitive since a good active method should be able to take advantage of a larger pool
of potential queries and ask more targeted questions.
We also investigated active learning in a transductive setting. Here we queried the
points as usual except now each method (Simple and Random) returned a transductive
SVM trained on both the labeled and remaining unlabeled data in the pool. As described
by Joachims (1998) the breakeven point for a TSVM was computed by gradually altering
the number of unlabeled instances that we wished the TSVM to label as positive. This
involves re-learning the TSVM multiple times and was computationally intensive. Since
our setting was transduction, the performance of each classifier was measured on the pool
of data rather than a separate test set. This reflects the relevance feedback transductive
inference example presented in the introduction.
Figure 8 shows that using a TSVM provides a slight advantage over a regular SVM in
both querying methods (Random and Simple) when comparing breakeven points. However,
the graph also shows that active learning provides notably more benefit than transduction:
indeed, using a TSVM with a Random querying method needs over 100 queries to achieve
the same breakeven performance as a regular SVM with a Simple method that has only seen
20 labeled instances.

[Figure 8 plots pool set precision/recall breakeven point against labeled training set size, comparing transductive and inductive learning under active (Simple) and passive (Random) querying.]

Figure 8: Average pool set precision/recall breakeven point over the ten most frequently
occurring topics when using a pool size of 1000.

[Figure 9: two panels plotting test set accuracy against labeled training set size, with curves for Full, Random, Simple, MaxMin and Ratio; see the caption below.]

Figure 9: (a) Average test set accuracy over the five comp. topics when using a pool size
of 500. (b) Average test set accuracy for comp.sys.ibm.pc.hardware with a 500
pool size.

5.2 Newsgroups Experiments
Our second data collection was K. Lang's Newsgroups collection (Lang, 1995). We used the
five comp. groups, discarding the Usenet headers and subject lines. We processed the text
documents exactly as before, resulting in vectors of about 10000 dimensions.


Figure 10: (a) A simple example of querying unlabeled clusters. (b) Macro-average test
set accuracy for comp.os.ms-windows.misc and comp.sys.ibm.pc.hardware where
Hybrid uses the Ratio method for the first ten queries and Simple for the rest.


We placed half of the 5000 documents aside to use as an independent test set, and
repeatedly, randomly chose a pool of 500 documents from the remaining instances. We
performed twenty runs for each of the five topics and averaged the results. We used test
set accuracy to measure performance. Figure 9a contains the learning curve (averaged
over all of the results for the five comp. topics) for the three active learning methods
and Random sampling. Again, the horizontal line indicates the performance of an SVM
that has been trained on the entire pool. There is no appreciable difference between the
MaxMin and Ratio methods but, in two of the five newsgroups (comp.sys.ibm.pc.hardware
and comp.os.ms-windows.misc) the Simple active learning method performs notably worse
than the MaxMin and Ratio methods. Figure 9b shows the average learning curve for the
comp.sys.ibm.pc.hardware topic. In around ten to fifteen per cent of the runs for both of
the two newsgroups the Simple method was misled and performed extremely poorly (for
instance, achieving only 25% accuracy even with fifty training instances, which is worse
than just randomly guessing a label!). This indicates that the Simple querying method may
be more unstable than the other two methods.
One reason for this could be that the Simple method tends not to explore the feature
space as aggressively as the other active methods, and can end up ignoring entire clusters
of unlabeled instances. In Figure 10a, the Simple method takes several queries before it
even considers an instance in the unlabeled cluster while both the MaxMin and Ratio query
a point in the unlabeled cluster immediately.
While MaxMin and Ratio appear more stable they are much more computationally intensive. With a large pool of s instances, they require about 2s SVMs to be learned for each
query. Most of the computational cost is incurred when the number of queries that have
already been asked is large. The reason is that the cost of training an SVM grows polynomially with the size of the labeled training set and so now training each SVM is costly (taking
over 20 seconds to generate the 50th query on a Sun Ultra 60 450MHz workstation with a
pool of 500 documents). However, when the quantity of labeled data is small, even with
a large pool size, MaxMin and Ratio are fairly fast (taking a few seconds per query) since
now training each SVM is fairly cheap. Interestingly, it is in the first ten queries that the
Simple method seems to suffer the most through its lack of aggressive exploration. This motivates
a Hybrid method. We can use MaxMin or Ratio for the first few queries and then use the
Simple method for the rest. Experiments with the Hybrid method show that it maintains
the stability of the MaxMin and Ratio methods while allowing the scalability of the Simple
method. Figure 10b compares the Hybrid method with the Ratio and Simple methods on
the two newsgroups for which the Simple method performed poorly. The test set accuracy
of the Hybrid method is virtually identical to that of the Ratio method while the Hybrid
method's run time was about the same as the Simple method, as indicated by Table 3.

Query   Simple   MaxMin   Ratio   Hybrid
1       0.008    3.7      3.7     3.7
5       0.018    4.1      5.2     5.2
10      0.025    12.5     8.5     8.5
20      0.045    13.6     19.9    0.045
30      0.068    22.5     23.9    0.073
50      0.110    23.2     23.3    0.115
100     0.188    42.8     43.2    0.2

Table 3: Typical run times in seconds for the Active methods on the Newsgroups dataset
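A sketch of the Hybrid switch is shown below; ratio_query is the hypothetical selection function sketched earlier, the Simple rule is written out inline as the instance nearest the current hyperplane, and the switch point of ten queries follows the description above.

import numpy as np
from sklearn.svm import SVC

def simple_query(X, y, pool):
    # Simple rule: query the pool instance closest to the current hyperplane.
    clf = SVC(kernel="linear").fit(X, y)
    return int(np.argmin(np.abs(clf.decision_function(np.asarray(pool)))))

def hybrid_query(X, y, pool, switch_after=10):
    # Use the costlier Ratio rule while the labeled set is small, then Simple.
    if len(y) < switch_after + 2:   # two seed labels plus the first ten queries
        return ratio_query(X, y, pool)
    return simple_query(X, y, pool)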

6. Related Work




There have been several studies of active learning for classification. The Query by Committee algorithm (Seung et al., 1992, Freund et al., 1997) uses a prior distribution over
hypotheses. This general algorithm has been applied in domains and with classifiers for
which specifying and sampling from a prior distribution is natural. They have been used
with probabilistic models (Dagan and Engelson, 1995) and specifically with the Naive Bayes
model for text classification in a Bayesian learning setting (McCallum and Nigam, 1998).
The Naive Bayes classifier provides an interpretable model and principled ways to incorporate prior knowledge and data with missing values. However, it typically does not perform
as well as discriminative methods such as SVMs, particularly in the text classification domain (Joachims, 1998, Dumais et al., 1998).
We re-created McCallum and Nigam's (1998) experimental setup on the Reuters-21578
corpus and compared the reported results from their algorithm (which we shall call the MN-algorithm hereafter) with ours. In line with their experimental setup, queries were asked
five at a time, and this was achieved by picking the five instances closest to the current
hyperplane. Figure 11a compares McCallum and Nigam's reported results with ours. The
graph indicates that the Active SVM performance is significantly better than that of the
MN-algorithm.
An alternative committee approach to query by committee was explored by Liere and
Tadepalli (1997, 2000). Although their algorithm (LT-algorithm hereafter) lacks the theoretical justifications of the Query by Committee algorithm, they successfully used their
committee based active learning method with Winnow classifiers in the text categorization
domain. Figure 11b was produced by emulating their experimental setup on the Reuters-21578 data set and it compares their reported results with ours. Their algorithm does
not require a positive and negative instance to seed their classifier. Rather than seeding
our Active SVM with a positive and negative instance (which would give the Active SVM
an unfair advantage) the Active SVM randomly sampled 150 documents for its first 150
queries. This process virtually guaranteed that the training set contained at least one positive instance. The Active SVM then proceeded to query instances actively using the Simple
method. Despite the very naive initialization policy for the Active SVM, the graph shows
that the Active SVM accuracy is significantly better than that of the LT-algorithm.

[Figure 11: two panels; (a) plots precision/recall breakeven point against labeled training set size, comparing SVM Simple Active with the MN-algorithm, and (b) plots test set accuracy, comparing SVM Simple Active and SVM Passive with the LT-algorithm Winnow Active and Passive variants; see the caption below.]

Figure 11: (a) Average breakeven point performance over the Corn, Trade and Acq Reuters21578 categories. (b) Average test set accuracy over the top ten Reuters-21578
categories.


Lewis and Gale (1994) introduced uncertainty sampling and applied it to a text domain
using logistic regression and, in a companion paper, using decision trees (Lewis and Catlett,
1994). The Simple querying method for SVM active learning is essentially the same as their
uncertainty sampling method (choose the instance that our current classifier is most uncertain about), however they provided substantially less justification as to why the algorithm
should be effective. They also noted that the performance of the uncertainty sampling
method can be variable, performing quite poorly on occasions.
Two other studies (Campbell et al., 2000, Schohn and Cohn, 2000) independently developed our Simple method for active learning with support vector machines and provided
different formal analyses. Campbell, Cristianini and Smola extend their analysis for the
Simple method to cover the use of soft margin SVMs (Cortes and Vapnik, 1995) with linearly non-separable data. Schohn and Cohn note interesting behaviors of the active learning
curves in the presence of outliers.

7. Conclusions and Future Work


We have introduced a new algorithm for performing active learning with SVMs. By taking
advantage of the duality between parameter space and feature space, we arrived at three
algorithms that attempt to reduce version space as much as possible at each query. We
have shown empirically that these techniques can provide considerable gains in both the
inductive and transductive settings: in some cases shrinking the need for labeled instances
by over an order of magnitude, and in almost all cases reaching the performance achievable
on the entire pool having seen only a fraction of the data. Furthermore, larger pools of
unlabeled data improve the quality of the resulting classifier.
Of the three main methods presented, the Simple method is computationally the fastest.
However, the Simple method seems to be a rougher and more unstable approximation, as
we witnessed when it performed poorly on two of the five Newsgroup topics. If asking each
query is expensive relative to computing time then using either the MaxMin or Ratio may be
preferable. However, if the cost of asking each query is relatively cheap and more emphasis
is placed upon fast feedback then the Simple method may be more suitable. In either case,
we have shown that the use of these methods for learning can substantially outperform
standard passive learning. Furthermore, experiments with the Hybrid method indicate that
it is possible to combine the benefits of the Ratio and Simple methods.
The work presented here leads us to many directions of interest. Several studies have
noted that gains in computational speed can be obtained at the expense of generalization
performance by querying multiple instances at a time (Lewis and Gale, 1994, McCallum
and Nigam, 1998). Viewing SVMs in terms of the version space gives an insight as to where
the approximations are being made, and this may provide a guide as to which multiple
instances are better to query. For instance, it is suboptimal to query two instances whose
version space hyperplanes are fairly parallel to each other. So, with the Simple method,
instead of blindly choosing to query the two instances that are the closest to the current
SVM, it may be better to query two instances that are close to the current SVM and whose
hyperplanes in the version space are fairly perpendicular. Similar tradeoffs can be made for
the Ratio and MaxMin methods.
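Purely as an illustration of this idea in the linear-kernel case (where the version-space hyperplane induced by an instance has the instance's own feature vector as its normal), one might shortlist the instances nearest the current SVM and then prefer a second query whose feature vector is as close to perpendicular to the first as possible:

import numpy as np

def query_pair(clf, pool, shortlist=10):
    # Pick one instance close to the hyperplane, then a second, also close,
    # whose induced version-space hyperplane is as perpendicular as possible.
    P = np.asarray(pool)
    order = np.argsort(np.abs(clf.decision_function(P)))[:shortlist]
    first = order[0]
    u = P[first] / np.linalg.norm(P[first])
    cosines = [abs(float(P[j] @ u)) / np.linalg.norm(P[j]) for j in order[1:]]
    second = order[1:][int(np.argmin(cosines))]
    return int(first), int(second)

The same shortlist-then-diversify step could be wrapped around the Ratio or MaxMin scores instead of the distance to the hyperplane.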
Bayes Point Machines (Herbrich et al., 2001) approximately find the center of mass of
the version space. Using the Simple method with this point rather than the SVM point in
the version space may produce an improvement in performance and stability. The use of
Monte Carlo methods to estimate version space areas may also give improvements.
One way of viewing the strategy of always choosing to halve the version space is that we
have essentially placed a uniform distribution over the current space of consistent hypotheses
and we wish to reduce the expected size of the version space as fast as possible. Rather
than maintaining a uniform distribution over consistent hypotheses, it is plausible that
the addition of prior knowledge over our hypothesis space may allow us to modify our
query algorithm and provide us with an even better strategy. Furthermore, the PAC-Bayesian framework introduced by McAllester (1999) considers the effect of prior knowledge
on generalization bounds and this approach may lead to theoretical guarantees for the
modified querying algorithms.
Finally, the Ratio and MaxMin methods are computationally expensive since they have
to step through each of the unlabeled data instances and learn an SVM for each possible
labeling. However, the temporarily modified data sets will only differ by one instance from
the original labeled data set and so one can envisage learning an SVM on the original data
set and then computing the incremental updates to obtain the new SVMs (Cauwenberghs
and Poggio, 2001) for each of the possible labelings of each of the unlabeled instances. Thus,
one would hopefully obtain a much more efficient implementation of the Ratio and MaxMin
methods and hence allow these active learning algorithms to scale up to larger problems.

Acknowledgments
This work was supported by DARPA's Information Assurance program under subcontract
to SRI International, and by ARO grant DAAH04-96-1-0341 under the MURI program
"Integrated Approach to Intelligent Systems".

References

C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2:121-167, 1998.

C. Campbell, N. Cristianini, and A. Smola. Query learning with large margin classifiers. In Proceedings of the Seventeenth International Conference on Machine Learning, 2000.

G. Cauwenberghs and T. Poggio. Incremental and decremental support vector machine learning. In Advances in Neural Information Processing Systems, volume 13, 2001.

C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:1-25, 1995.

I. Dagan and S. Engelson. Committee-based sampling for training probabilistic classifiers. In Proceedings of the Twelfth International Conference on Machine Learning, pages 150-157. Morgan Kaufmann, 1995.

S. T. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In Proceedings of the Seventh International Conference on Information and Knowledge Management. ACM Press, 1998.

Y. Freund, H. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28:133-168, 1997.

D. Heckerman, J. Breese, and K. Rommelse. Troubleshooting Under Uncertainty. Technical Report MSR-TR-94-07, Microsoft Research, 1994.

R. Herbrich, T. Graepel, and C. Campbell. Bayes point machines. Journal of Machine Learning Research, pages 245-279, 2001.

E. Horvitz and G. Rutledge. Time dependent utility and action under uncertainty. In Proceedings of the Seventh Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, 1991.

T. Joachims. Text categorization with support vector machines. In Proceedings of the European Conference on Machine Learning. Springer-Verlag, 1998.

T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1999a.

T. Joachims. Transductive inference for text classification using support vector machines. In Proceedings of the Sixteenth International Conference on Machine Learning, pages 200-209. Morgan Kaufmann, 1999b.

K. Lang. Newsweeder: Learning to filter netnews. In International Conference on Machine Learning, pages 331-339, 1995.

Jean-Claude Latombe. Robot Motion Planning. Kluwer Academic Publishers, 1991.

D. Lewis and J. Catlett. Heterogeneous uncertainty sampling for supervised learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 148-156. Morgan Kaufmann, 1994.

D. Lewis and W. Gale. A sequential algorithm for training text classifiers. In Proceedings of the Seventeenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 3-12. Springer-Verlag, 1994.

D. McAllester. PAC-Bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, 1999.

A. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. www.cs.cmu.edu/mccallum/bow, 1996.

A. McCallum and K. Nigam. Employing EM in pool-based active learning for text classification. In Proceedings of the Fifteenth International Conference on Machine Learning. Morgan Kaufmann, 1998.

T. Mitchell. Generalization as search. Artificial Intelligence, 28:203-226, 1982.

J. Rocchio. Relevance feedback in information retrieval. In G. Salton, editor, The SMART retrieval system: Experiments in automatic document processing. Prentice-Hall, 1971.

G. Schohn and D. Cohn. Less is more: Active learning with support vector machines. In Proceedings of the Seventeenth International Conference on Machine Learning, 2000.

Fabrizio Sebastiani. Machine learning in automated text categorisation. Technical Report IEI-B4-31-1999, Istituto di Elaborazione dell'Informazione, 2001.

H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of Computational Learning Theory, pages 287-294, 1992.

J. Shawe-Taylor and N. Cristianini. Further results on the margin distribution. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pages 278-285, 1999.

V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer Verlag, 1982.

V. Vapnik. Statistical Learning Theory. Wiley, 1998.
