
Data Mining

Statistical Learning

Prof. Stefan Van Aelst 2010


Contents

1 Introduction

I Classification

2 Problem setting

3 Discriminant Analysis
  3.1 Introduction
  3.2 Linear Discriminant Analysis
  3.3 Quadratic Discriminant Analysis
  3.4 Regularized Discriminant Analysis
  3.5 Reduced-Rank Linear Discriminant Analysis
  3.6 Logistic Regression
    3.6.1 Logistic regression vs linear discriminant analysis
  3.7 Mixture Discriminant Analysis
  3.8 Kernel Density Discriminant Analysis
    3.8.1 Kernel density estimation
    3.8.2 Kernel density discriminant analysis
    3.8.3 Naive Bayes classifier

4 Basis Expansions
  4.1 Introduction
  4.2 Piecewise Polynomials
  4.3 Splines
    4.3.1 Cubic splines
    4.3.2 Natural cubic splines
  4.4 Smoothing Splines
    4.4.1 Splines as linear operators and effective degrees of freedom
  4.5 Multidimensional Splines
    4.5.1 Multivariate regression splines
    4.5.2 Multivariate smoothing splines
    4.5.3 Hybrid splines
  4.6 Regularization with Kernels
    4.6.1 Polynomial kernels
    4.6.2 Radial basis function kernels

5 Flexible Discriminant Analysis Methods
  5.1 Introduction
  5.2 Flexible Discriminant Analysis
  5.3 Penalized Discriminant Analysis
  5.4 Flexible Logistic Regression
  5.5 Generalized Additive Models

6 Support Vector Machines
  6.1 Introduction
  6.2 Separating Hyperplanes
  6.3 Support Vector Classifiers
  6.4 Support Vector Machines

7 Classification Trees
  7.1 Introduction
  7.2 Growing Trees
  7.3 Pruning Trees
  7.4 Bagging
  7.5 Bumping
  7.6 Boosting

8 Nearest-Neighbor Classification
  8.1 Introduction
  8.2 K-Nearest-Neighbors
    8.2.1 Nearest-Neighbors in High Dimensions
  8.3 Adaptive Nearest-Neighbors Methods

9 Model Performance and Model Selection
  9.1 Introduction
  9.2 Measuring Error
  9.3 In-Sample Error
  9.4 Estimating Optimism
  9.5 Cross-validation
  9.6 Bootstrap methods

II Regression

10 Problem setting

11 Linear Regression Methods
  11.1 Introduction
  11.2 Least squares regression
  11.3 Ridge Regression
  11.4 The Lasso
  11.5 Principal Components regression
  11.6 Partial Least Squares

12 Nonparametric Regression
  12.1 Introduction
  12.2 Kernel Based Nonparametric Regression
    12.2.1 Local averages
    12.2.2 Local linear regression
    12.2.3 Local polynomial regression
  12.3 Nearest Neighbors Regression
  12.4 Local Regression With More Predictors

13 Structured Regression Methods
  13.1 Introduction
  13.2 Additive models
  13.3 Multivariate Adaptive Regression Splines

14 Regression Trees
  14.1 Introduction
  14.2 Growing Trees
  14.3 Pruning Trees
  14.4 Bagging
  14.5 Boosting

15 Regression Support Vector Machines
  15.1 Introduction
  15.2 Support Vector Regression
  15.3 Regression Support Vector Machines

Preface

In the early days of data collection, data were scarce and the goal of statistics
was to extract as much information as possible from these data. Since the
inception of electronic data recording, and due to increasing and cheaper
storage capacities, data collections have grown considerably in size and are
still growing. Nowadays, huge data collections are frequently available. As
a result, statisticians have been confronted with a new challenge: how can
valuable information be retrieved from these huge databases while filtering
out the (high percentage of) noise? Instead of a lack of data, there now is an
abundance of data, which causes computational as well as methodological
problems. The high percentage of noise causes standard statistical
methods to produce very imprecise and unstable results. Hence, important
trends are no longer detected. The instability of the models makes it
difficult to generalize them to, e.g., reliably predict new outcomes. Statis-
ticians and computer scientists from machine learning/pattern recognition
have developed a collection of methods that are intended to obtain stable
and accurate information from large data collections in a reasonably fast
way. A selection of these techniques is discussed in this course text.

Chapter 1

Introduction

In this course we will consider supervised learning problems, that is, there
is an outcome variable present in the data. Hence, we have a dataset with
an outcome (response/dependent variable) and a set of features (predictors
or input/explanatory/independent variables) from which we need to build a
good model that explains/fits the outcome well and can be used to predict
the outcome of new objects for which the features are known. Such a
model is sometimes called a (statistical) learner. The dataset at hand is the
training dataset that contains for a set of objects both the outcome mea-
surement(s) and the feature measurements. This training dataset is used
to guide the learning process. If the outcome is qualitative (categorical) we
call the problem a classification problem. Classification methods will be
discussed in the first part of the course. If the outcome is quantitative we
call the problem a regression problem. Regression methods will be discussed
in the second part of the course.
We will not discuss unsupervised learning problems in this course. In
this case there is a dataset with a set of features for each object, but there
is no (clear) outcome measurement. The task here is to find interesting
groups (clusters) or patterns that give insight into how the data are organized.
Most of the unsupervised learning techniques focus on clustering the objects
in groups. Such techniques are discussed in the course Multivariate Data
Analysis.

Part I

Classification

Chapter 2

Problem setting

In classification we consider problems with a qualitative or categorical outcome
variable. Consider for example the famous IRIS example of Fisher.
This dataset contains the sepal length and width and the petal length and
width for three different species of iris: Iris setosa, Iris versicolor, and Iris
virginica. For each of the species 50 flowers have been measured. Part
of the dataset is shown in Table 2.1. The dataset is available in R as
the object iris. Type help(iris) in R to get more information about
the data. The outcome variable, the species of iris, is qualitative and can
take values in the finite set G = {setosa, versicolor, virginica}. Note that
there exists no ordering between these possible outcome values. Even if we
were to use numerical class labels instead of descriptive labels, for instance
1 := setosa, 2 := versicolor, 3 := virginica, then there still is no explicit
ordering between the numbers in the new outcome set G ′ = {1, 2, 3}. In
fact, most software programs will represent categorical variables by numer-
ical codes. When there are only two classes, the class label is represented
by a single binary variable which takes the values 0 or 1, or alternatively, -1
and 1. When there are more than two categories, a single discrete variable
with k > 2 levels can be used, as illustrated above. Alternatively, each of
the classes can be represented by a single binary variable, leading to a vector
of k binary variables, called dummy variables. For each object, only one of
these k dummy variables will be ’on’ (that is, take the value 1).
We will denote the vector of input variables by X, and its components
are denoted by Xj . When X contains only 1 component, the subscript is
dropped. The qualitative outcome variable will be denoted by G which takes
values in the finite set G = {g1 , . . . , gk }, so the number of classes equals k.
To avoid heavy notation, we will recode the class labels as 1, . . . , k, hence
G = {1, . . . , k}. As usual, uppercase letters refer to variables, while lowercase
letters refer to the observed values for a single object. For example, the


Object   Sepal Length   Sepal Width   Petal Length   Petal Width   Species
     1            5.1           3.5            1.4           0.2   setosa
     2            4.9           3.0            1.4           0.2   setosa
     3            4.7           3.2            1.3           0.2   setosa
   ...            ...           ...            ...           ...   ...
    51            7.0           3.2            4.7           1.4   versicolor
    52            6.4           3.2            4.5           1.5   versicolor
    53            6.9           3.1            4.9           1.5   versicolor
   ...            ...           ...            ...           ...   ...
   101            6.3           3.3            6.0           2.5   virginica
   102            5.8           2.7            5.1           1.9   virginica
   103            7.1           3.0            5.9           2.1   virginica

Table 2.1: Classification example: IRIS data.

observed value of X for the ith object is denoted by xi (which is a vector
or scalar depending on the length of the vector X). The sample size of the
training data will be denoted by n and the dimension of the data by d + 1,
hence d is the dimension of the feature space. The n-dimensional vector of
observed values of a single component Xj will be denoted by xj ; it should
not be confused with the d-dimensional vector xi containing the observed
values of the ith object for all
components of X. All vectors are assumed to be column vectors. Matrices
are written in bold uppercase letters. For example, the n × d matrix X is
the matrix of observed values of all n objects on all d features. The rows of
X represent different objects, hence the ith row of X is xti , the transpose of
xi . The columns of X represent different features, so the jth column of X
is xj .
In a classification problem the goal is to find good estimates Ĝ of the
class labels that take values in the same set G as G. Ideally, the method
works well on the whole distribution of outcomes and features. In practice,
these predictions can be considered both for the training data, expressing
the goodness-of-fit of the method, and for new objects, expressing the
prediction capacity of the method. Furthermore, the purpose often also is
to find a representation that separates or discriminates the classes as much
as possible, to get more insight into the differences between the classes and to
determine the features that are associated with these differences.
Given any vector of features x, we would like to find the classification rule
that makes the least possible mistakes, or more generally, gives the smallest
loss. In the general case, we need to define a loss function L(r, s) that
represents the price paid for classifying an object from class r to class s, for
all classes in G. The standard setting of least possible mistakes corresponds


to setting L(r, s) = 1 if r ̸= s and zero otherwise, which is called the zero-one
loss function. We now would like to find the classifier Ĝ that minimizes the
expected prediction error or expected misclassification loss given by

EML(Ĝ(x)) = ∑_{r=1}^k L(r, Ĝ(x)) P (r|X = x).    (2.1)

This leads to the classifier

Ĝ(x) = argmin_{g∈G} ∑_{r=1}^k L(r, g) P (r|X = x).    (2.2)

The classifier is thus defined pointwise. For each feature vector x, the class
estimate is given by the solution of (2.2). For the standard zero-one loss,
the classifier simplifies to

Ĝ(x) = argmin_{g∈G} ∑_{r≠g} P (r|X = x).    (2.3)

Since ∑_{r∈G} P (r|X = x) = 1 = P (g|X = x) + ∑_{r≠g} P (r|X = x), we can
rewrite (2.3) as

Ĝ(x) = argmin_{g∈G} [1 − P (g|X = x)],

or equivalently

Ĝ(x) = argmax_{g∈G} P (g|X = x).    (2.4)
This solution is called the Bayes classifier. It says that we should classify
an object with features x to the most probable class as determined by the
conditional distribution P (G|X = x), that is

Ĝ(x) = r   if   P (r|X = x) = max_{g∈G} P (g|X = x).    (2.5)

Figure 2.1 shows a mixture of observations from three different bivariate
normal distributions together with the optimal Bayesian decision boundaries,
which are quadratic functions in this case. With real data, of course,
the true conditional distribution of the classes is unknown and the Bayes
classifier (2.4) can not be derived. Many classification techniques thus
are concerned with modeling and estimating the conditional distributions
P (r|X = x) for all classes.
The error rate of the Bayes classifier, obtained as

EX [EML(Ĝ(X))] (2.6)

with Ĝ(x) the Bayes classifier, is called the optimal Bayes error rate. This is
the error rate we would obtain if the conditional distribution P (G|X) were known.
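To make the rule (2.4) concrete, the following small R sketch (not part of the original text) classifies points to the most probable class in a hypothetical two-class model with known univariate Gaussian class densities; the means, standard deviations and priors are arbitrary illustrative choices.

# Bayes classifier (2.4) under zero-one loss for an assumed two-class model
# with known univariate Gaussian class densities (illustrative parameters).
bayes_classify <- function(x, means = c(0, 3), sds = c(1, 1), priors = c(0.5, 0.5)) {
  # unnormalized posteriors f_r(x) * pi_r; the common denominator cancels
  post <- sapply(seq_along(means), function(r) dnorm(x, means[r], sds[r]) * priors[r])
  max.col(matrix(post, nrow = length(x)))   # index of the most probable class
}
bayes_classify(c(-1, 1.4, 5))               # classifies each point to its most probable class

With real data the known class densities in this sketch have to be replaced by estimates, which is exactly what the methods discussed in the following chapters do.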
Figure 2.1: Mixture of three bivariate normal distributions with optimal Bayesian decision boundaries.
Chapter 3

Discriminant Analysis

3.1 Introduction
Suppose we have k classes and let us denote by fr (x) the conditional density
of X in class G = r. Moreover, let P (G = r) = πr be the prior probability
that an observation belongs to class r, with ∑_{r=1}^k πr = 1. By applying Bayes'
theorem we obtain

P (r|X = x) = fr (x)πr / ∑_{s=1}^k fs (x)πs .    (3.1)
Since the denominator does not change for any of the classes, it suffices to
estimate the conditional class densities fr (x) for each of the classes. In the
next sections we will consider different ways to model these densities:

• Linear and Quadratic discriminant analysis.

• Regularized discriminant analysis.

• Reduced rank discriminant analysis.

• Mixture discriminant analysis.

• Kernel density discriminant analysis.

• Naive Bayes classifier

Other techniques model the conditional probabilities P (r|X = x) directly.
We will consider the most widely used such technique in this class, namely
logistic regression.

3.2 Linear Discriminant Analysis


Each of the class densities is now modeled as a multivariate Gaussian density:

ϕr (x) = (2π)^{−d/2} |Σr |^{−1/2} exp(−(1/2)(x − µr )t (Σr )−1 (x − µr )).    (3.2)


The function dΣr (x, µr ) = √((x − µr )t (Σr )−1 (x − µr )) is the Mahalanobis
distance between x and µr . This distance takes the variances of the different
components into account as well as the correlations between them, as given by
the covariance matrix Σr . In general, the Mahalanobis distance between any
two points x and y in Rd is given by dΣr (x, y) = √((x − y)t (Σr )−1 (x − y)).
We also call this the distance between x and y determined by Σr . This
distance function satisfies the properties of a metric:

• d(x, x) = 0 for all x ∈ Rd

• d(x, y) = d(y, x) for all x, y ∈ Rd (symmetry)

• d(x, z) ≤ d(x, y) + d(y, z) for all x, y, z ∈ Rd (triangle inequality).

In Linear Discriminant Analysis (LDA) we assume that the classes all share
a common covariance matrix Σr = Σ for all classes r. From (2.4) and (3.1)
it follows that we classify x in class r if ϕr (x)πr is maximal. Using (3.2) and
taking the logarithm to simplify things, we obtain for each class

log(ϕr (x)πr ) = log ϕr (x) + log πr
              ∼ −(1/2) log(|Σ|) − (1/2)(x − µr )t Σ−1 (x − µr ) + log πr .
Removing the terms that are constant for all classes yields the class scores
lr (x) given by

lr (x) = xt Σ−1 µr − (1/2)(µr )t Σ−1 µr + log πr ,    (3.3)

and x is classified into the class whose score is maximal. Note that the score lr (x) is a linear
function of x. Indeed, when determining the decision boundary between
any two classes, we look at the difference in scores:

lr (x) − ls (x) = xt Σ−1 (µr − µs ) − (1/2)(µr + µs )t Σ−1 (µr − µs ) + log(πr /πs ).    (3.4)
The decision boundary is the collection of points x for which lr (x) = ls (x)
or lr (x) − ls (x) = 0 which according to (3.4) is an equation linear in x. In d
dimensions the boundary between the two classes thus is a hyperplane (e.g.
a line in two dimensions, a plane in three dimensions). Since the decision
boundary between any two classes will be a hyperplane, LDA will divide the
space into classification regions separated by hyperplanes. Figure 3.1 shows
a sample from a mixture of three bivariate Gaussian distributions with dif-
ferent centers (indicated by ’+’) but the same covariance matrix. In the left
panel, the three lines are the decision boundaries between each pair of classes
while the thick solid line segments are the Bayesian decision boundary for the
three class classification problem. The linear scores (3.4) have not yet solved
Figure 3.1: Mixture of three bivariate normal distributions with a common covariance matrix. On the left the linear decision boundaries between each two classes are shown as well as the Bayesian decision boundaries for the three class problem. On the right the estimated LDA decision boundaries are shown.

the problem of the unknown class densities. We have modeled the class
densities as Gaussian distributions with a common covariance matrix to ob-
tain these scores. However, these scores depend on the unknown parameters
of the Gaussian distributions. Hence, to use the linear discriminant rule in
practice, we still need to estimate the centers and the common covariance
matrix of the Gaussian distributions from the training data. Often, there
is no relevant information about the prior class probabilities available, and
in that case these probabilities can also be estimated from the data. The
standard estimates are

π̂r = nr /n, where nr is the number of observations of class r,    (3.5)

µ̂r = (1/nr ) ∑_{gi =r} xi ,    (3.6)

Σ̂ = 1/(n − k) ∑_{r=1}^k ∑_{gi =r} (xi − µ̂r )(xi − µ̂r )t .    (3.7)

Hence, the estimated centers µ̂r are the sample means of the observations
in each class and the covariance matrix estimate Σ̂ is the pooled covariance
matrix of the observations in all classes. The right panel of Figure 3.1 shows
the estimated decision boundaries based on a sample of size 75 from each of
the Gaussian distributions. In this example the estimated decision bound-
aries are close to the optimal Bayes boundaries, as could be expected. Of
course, in general the optimal Bayes boundaries are not necessarily linear.
Let us consider the example in Figure 2.1. These data were a sample from
a mixture of three bivariate normal distributions with different covariance
matrices. Figure 2.1 already showed that the optimal boundaries are not
linear in this case. Figure 3.2 shows the estimated decision boundaries by
linear discriminant analysis. Although the optimal boundaries are nonlin-
ear we see that the linear boundaries yield a good approximation of these
boundaries.
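As a brief illustration (not taken from the course text), LDA as described above can be carried out in R with the lda() function from the MASS package, applied to the iris data of Chapter 2.

# LDA on the iris data using MASS::lda; sketch only, not the course's own code.
library(MASS)
fit <- lda(Species ~ ., data = iris)     # estimates the class centers, pooled covariance and priors
pred <- predict(fit, iris)               # posterior probabilities and predicted classes
mean(pred$class != iris$Species)         # apparent (training) error rate

The fitted object stores the estimated priors, group means and the scaling matrix containing the discriminant directions discussed in Section 3.5.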

Figure 3.2: Mixture of three bivariate normal distributions with estimated LDA decision boundaries.

3.3 Quadratic Discriminant Analysis


If we still model the class densities by multivariate Gaussian densities, but
drop the assumption that the classes all share a common covariance matrix,
then we classify x in class r if ϕr (x)πr is maximal, with ϕr (x) given in (3.2).
Similarly as before, taking the logarithm and removing the constants yields
the class scores qr (x) given by

qr (x) = −(1/2) log(|Σr |) − (1/2)(x − µr )t (Σr )−1 (x − µr ) + log πr .    (3.8)
These score functions are quadratic functions, hence also the decision bound-
aries between any two classes, qr (x) = qs (x), are quadratic. Therefore, this
approach is called Quadratic Discriminant Analysis (QDA). The unknown


center and covariance matrix of the Gaussian distribution for each class are
now estimated by the sample mean and sample covariance matrix of the
training observations in that class, that is

µ̂r = (1/nr ) ∑_{gi =r} xi ,

Σ̂r = 1/(nr − 1) ∑_{gi =r} (xi − µ̂r )(xi − µ̂r )t .

Figure 3.3 shows the estimated decision boundaries by quadratic discrim-


inant analysis. For the data with a common covariance matrix (left
panel) the quadratic boundaries almost coincide with the linear boundaries
in Figure 3.1. For the data with different group covariance matrices, the
estimated quadratic boundaries are very close to the optimal (quadratic)
Bayes decision boundaries shown in Figure 2.1, as expected.

Figure 3.3: Quadratic decision boundaries for mixtures of three bivariate normal distributions. On the left the distributions have a common covariance matrix. On the right the group covariance matrices are different.

Note that there is a considerable difference in the number of parameters


that need to be estimated in linear and quadratic discriminant analysis. A
naive comparison would state that both LDA and QDA require kd parame-
ters for the group centers. However, LDA uses d(d + 1)/2 parameters for the
common covariance matrix while QDA requires kd(d + 1)/2 parameters for
the k group covariance matrices. This seems to indicate that the difference
in number of parameters increases linearly with the number of groups and in
quadratic order with the dimension d. A detailed analysis of both methods
has revealed that LDA requires at least (k − 1)(d + 1) parameters and QDA
requires at least (k − 1)(d(d + 3)/2 + 1) parameters. This means that the
difference in number of parameters still increases linearly with the number
of groups and quadratically with the dimension d. Estimating more parameters
means a loss of precision for the estimates, hence QDA can be expected to
have a larger variability than LDA. On the other hand, LDA can be expected
to have a larger bias if the true decision boundaries are nonlinear. Many
studies have reported very good performance of LDA and QDA in a wide
range of settings (both simulated and real data). In several of these settings
the data are far from Gaussian. A possible explanation is that the data
can only support simple linear or quadratic boundaries obtained by stable
estimates. More sophisticated techniques that produce complex boundaries
have too low a precision to behave well. For QDA the stability will only be
guaranteed in low dimensions.
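For comparison with the LDA sketch above, QDA is available in R through qda() from the MASS package, which estimates a separate covariance matrix per class (again an illustration of mine, not code from the text).

# QDA on the iris data using MASS::qda; sketch only.
library(MASS)
fit_qda <- qda(Species ~ ., data = iris)             # separate covariance matrix per class
mean(predict(fit_qda, iris)$class != iris$Species)   # apparent (training) error rate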

3.4 Regularized Discriminant Analysis


Regularized Discriminant Analysis (RDA) is an extension of LDA and QDA
that aims to determine from the data whether separate covariance matrices
or a common covariance matrix should be used or a combination of both.
The regularized group covariance matrices are given by

Σ̂r (α) = αΣ̂r + (1 − α)Σ̂, (3.9)

where Σ̂r is the group covariance matrix as used in QDA and Σ̂ is the pooled
covariance matrix of all groups as used in LDA. The value of α ∈ [0, 1] needs
to be selected based on the performance of the model on a validation sample
or by using cross-validation. In practice, the model is constructed for a range
of α values (e.g. by taking a step size of 0.1) and the performance of these
models is compared. The model with the best performance is then selected.
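The regularized covariance matrices (3.9) are easy to form directly. The sketch below is my own illustration, not code from the text; it assumes the data are given as a list of per-class matrices and computes Σ̂r (α) for one α value. In practice this would be repeated over a grid of α values and the resulting classifiers compared on a validation sample or by cross-validation.

# Regularized group covariance matrices (3.9); 'groups' is assumed to be a
# list with one numeric matrix of observations per class.
pooled_cov <- function(groups) {
  n <- sum(sapply(groups, nrow)); k <- length(groups)
  Reduce(`+`, lapply(groups, function(g) (nrow(g) - 1) * cov(g))) / (n - k)
}
rda_cov <- function(groups, alpha) {
  Sigma <- pooled_cov(groups)                        # pooled covariance as in LDA
  lapply(groups, function(g) alpha * cov(g) + (1 - alpha) * Sigma)
}
groups <- lapply(split(iris[, 1:4], iris$Species), as.matrix)   # iris split by class
str(rda_cov(groups, alpha = 0.5))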

3.5 Reduced-Rank Linear Discriminant Analysis


Fisher’s approach to linear discriminant analysis illustrates further its wide
applicability beyond Gaussian distributions. Fisher considered the following
problem. Find the linear combination Z = at X of the features such that the
between-class variance in this direction is maximized relative to the within-
class variance. The between-class variance is the variance of the group
means of Z, that is the variance of the projected group means resulting
from projecting on the direction determined by a. The within-class variance
is the pooled variance of the observations of Z around their group means.
The idea is to find the one-dimensional direction in which the groups are
best separated.
Note that it is easier to find the direction that best separates the group
means, because for this problem only the variance between the group means
(between-class variance) needs to be maximized without considering the
within-class variance. However, this direction does not always separate the
groups well because the correlations within the groups are ignored. This is
illustrated in the left panels of Figure 3.4. This dataset consists of two
groups with different centers (indicated by ’+’) but the same covariance
matrix. In a two-class problem the direction maximizing the between-class
variance is the direction that connects the two centers, which in this case is
just the first component. Projecting the data on the first component (lower
left panel) shows that there is considerable overlap between the two groups
in this direction. However, if we take the within covariance into account we
get the projection shown in the right panels. Although the group centers
are closer together in this direction, the two groups are almost completely
separated.

Figure 3.4: Mixture of two bivariate normals with different centers but a common covariance matrix. The left panels show the direction maximizing the variance between the group centers and the separation of the groups in that direction. The right panels show the discriminant coordinate direction and the group separation in that direction.

Let us denote by B the between-class covariance matrix of X, that is, B


is the covariance matrix of the k × d matrix M which consists of the k group
centers. Similarly, we denote by W the within-class covariance matrix of
X, that is W = Σ. Then Fisher’s problem can be written as

max_a (at Ba)/(at W a).

Using matrix algebra results, this can equivalently be stated as

max_a at Ba subject to at W a = 1.

This is a generalized eigenvalue problem, which is solved by taking a = a1 ,
the eigenvector of W −1 B corresponding to its largest eigenvalue.
Similarly, we can find the next direction a2 , orthogonal to a1 , that maxi-
mizes (a2t Ba2 )/(a2t W a2 ). Not surprisingly, this direction is determined by
the eigenvector a2 of W −1 B corresponding to the second largest eigenvalue, and so on for the next direc-
tions. These directions a1 , a2 , . . . are called the discriminant coordinates or
canonical variates.
The discriminant coordinates are in the first place useful to visualize
the different groups. They also reveal that there is an inherent dimension
reduction when using LDA. The k group centers lie in an affine subspace of
dimension at most equal to k − 1. If the dimension d of the feature space
is much larger than k, then this means an important dimension reduction.
In LDA, we need to determine for each object x the closest group center.
However, since we are using a common covariance matrix, the distance to
all centers is measured in the same metric determined by Σ. We can use
this metric to sphere the data, that is, X ⋆ = Σ−1/2 X. For the sphered
data, distances are measured using the standard Euclidean distance. To
classify an object x, we can first determine x⋆ = Σ−1/2 x and determine the
distance of x⋆ to each of the (sphered) group centers. These distances can
be decomposed into the distance from x⋆ to the affine subspace containing the
group centers and the distance from the projected object x̃⋆ onto the affine
subspace to each of the group centers. Since the first distance (from x⋆ to
the affine subspace) is the same for all group centers, we can ignore this
distance and compare the distances within the affine subspace. Hence, we
can project the data onto the affine subspace spanned by the group centers,
which reduces the original dimension d to at most k − 1 and perform LDA
classification in this reduced space without loss of information for LDA. The
affine subspace containing the sphered group centers is exactly determined
by the eigenvectors of W −1 B corresponding to nonzero eigenvalues.
If the number of classes k is small, then the LDA classification can be
visualized easily by a scatterplot matrix of the discriminant coordinates, but
if the number of classes becomes larger, this visualization becomes harder,
making it more difficult to get insight into the structure of the data. Consider-
ing only the first coordinate directions implies a loss of information, but due
to the order of the discriminant coordinates, this loss is often small. In fact,
sometimes it is even beneficial to use less than all discriminant coordinates
to classify objects. Suppose we only use the first l < k − 1 discriminant
coordinates, then this is equivalent to assuming a mixture of Gaussian dis-
tributions with common covariance matrix and additionally restricting the
group centers to lie in an l-dimensional affine subspace of the d dimensional
feature space. This method is called Reduced Rank Linear Discriminant
Analysis.
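In R, the discriminant coordinates are returned by predict() applied to an lda() fit from the MASS package; plotting the first two gives the reduced-rank view discussed above. This is a sketch for illustration, not code from the text.

# Discriminant coordinates (canonical variates) of the iris data; sketch only.
library(MASS)
fit <- lda(Species ~ ., data = iris)
Z <- predict(fit)$x                      # n x (k-1) matrix of canonical variates
plot(Z, col = as.integer(iris$Species), pch = 20,
     xlab = "First discriminant coordinate", ylab = "Second discriminant coordinate")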

3.6 Logistic Regression

We can demand more explicitly that the decision boundaries are linear. If we
determine score functions δr (x) for all classes r = 1, . . . , k, then the decision
boundary between two classes is given by the set of points which satisfies
δr (x) = δs (x). The decision boundaries are linear, that is two classes are
separated by a hyperplane if {x; δr (x) = δs (x)} = {x; α0 + α1t x = 0} for
some coefficients α0 and α1 . This can be achieved by requiring that some
monotone transformation of the score functions δr (x) is a linear function.
For example, if we model the posterior probabilities P (r|X = x) for a two
class problem, then a popular choice is

P (G = 1|X = x) = exp(β0 + β1t x) / (1 + exp(β0 + β1t x)),
P (G = 2|X = x) = 1 / (1 + exp(β0 + β1t x)).    (3.10)

The monotone transformation here is the logit transformation logit(p) =


log(p/(1−p)). Note that the decision boundary δ1 (x) = δ2 (x) can be rewrit-
ten as log(δ1 (x)/δ2 (x)) = 0 which yields

log( P (G = 1|X = x) / P (G = 2|X = x) ) = β0 + β1t x = 0.    (3.11)

Therefore, the decision boundary (the set of points for which the log-odds
are zero) is linear.
Note that requiring the decision boundary to be linear is not necessarily
a very restrictive condition since the predictor space can be expanded by
including e.g. squares and cross-products of the basic predictors. Linear
functions in this augmented space then become quadratic functions in the
original (basic) space.
In the general k class problem the logistic regression model (3.10) ex-
pands to

P (G = r|X = x) = exp(βr0 + βr1t x) / (1 + ∑_{s=1}^{k−1} exp(βs0 + βs1t x)),   r = 1, . . . , k − 1,
P (G = k|X = x) = 1 / (1 + ∑_{s=1}^{k−1} exp(βs0 + βs1t x)).    (3.12)
This model ensures that the probabilities are all between 0 and 1 and sum
to 1. Using the logit transformation we obtain

log( P (G = r|X = x) / P (G = k|X = x) ) = βr0 + βr1t x,   r = 1, . . . , k − 1.
Note that the model uses the last class (k) as the reference class (in the
denominator) to which each other class is compared. However, the choice of
denominator is arbitrary with different choices leading to the same classifica-
tion. The logistic regression coefficients are usually estimated by maximum
likelihood. This leads to a set of estimating equations that can be solved
numerically by using iteratively reweighted least squares.
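In R, a two-class logistic regression model of the form (3.10) can be fitted by maximum likelihood with glm() and the binomial family. The sketch below is my own illustration (not from the text) and uses an arbitrary two-class subset of the iris data and two of its features.

# Two-class logistic regression fitted by iteratively reweighted least squares (glm); sketch only.
two <- droplevels(subset(iris, Species != "setosa"))
fit <- glm(Species ~ Sepal.Length + Sepal.Width, data = two, family = binomial)
summary(fit)$coefficients                # estimated beta_0 and beta_1
head(predict(fit, type = "response"))    # fitted P(G = second factor level | X = x)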

3.6.1 Logistic regression vs linear discriminant analysis


From (3.4) we obtain that also for linear discriminant analysis the log-odds
are linear functions of x. Although both models assume the same function
for the log-odds, the difference lies in the way the coefficients of this function
are estimated. Let us consider the joint density of X and G

P (X, G = r) = P (X)P (G = r|X) (3.13)

with P (X) the marginal density of X. Both models assume the same form
for the conditional density P (G = r|X). LDA specifies the marginal density
through (3.2). That is, P (X, G = r) = ϕr (x)πr with ϕr the Gaussian
density. This means that the marginal density P (X) is a normal mixture
density

P (X) = ∑_{r=1}^k ϕr (x)πr ,    (3.14)
and thus has a fully parametric form. This parametric model allows us to
estimate the parameters efficiently, but if the model is incorrect we can get
unreliable results.
In logistic regression the marginal distribution P (X) is left unspecified
and only the conditional distribution P (G = r|X) is modeled. Hence, this
is a semiparametric approach. There will be a loss in efficiency compared to
LDA if the marginal density is indeed a mixture of normals, but on the other
hand this approach is more robust because it mainly relies on observations
close to the boundary to estimate the boundaries.
3.7 Mixture Discriminant Analysis

In LDA and QDA we assign observations to the closest class, where the dis-
tance is measured between the observation and the center of each class, using
an appropriate metric. Hence, each class is represented by a single point,
called prototype. If classes are inhomogeneous, then a single prototype for
each class might not be sufficient. In such cases we can use a mixture model
for the density of X in each class

P (X = x|r) = fr (x) = ∑_{l=1}^{Mr} πlr ϕlr (x),    (3.15)

where ϕlr is the Gaussian density (3.2) with mean µlr and covariance matrix
Σlr . The mixing proportions πlr sum to one, that is ∑_{l=1}^{Mr} πlr = 1. Each
class r is now represented by Mr prototypes. Using (3.1) we obtain the class
posterior probabilities

P (r|X = x) = πr ∑_{l=1}^{Mr} πlr ϕlr (x) / ( ∑_{s=1}^k πs ∑_{l=1}^{Ms} πls ϕls (x) ).    (3.16)

Allowing different covariance matrices Σls in each mixture component of


each class leads to an explosion of parameters that need to be estimated if
the number of classes or mixtures increases (cfr. QDA versus LDA) and thus
a high variability. To avoid this, the covariance matrix of all mixtures in all
classes is often assumed to be equal, such that only one covariance matrix Σ
needs to be estimated. Estimation of the parameters µls and Σ seems quite
complicated but can be done efficiently by using maximum likelihood and
the EM-algorithm. The number of prototypes Mr within each class needs
to be selected. This can be done by measuring performance on a validation
sample or by cross-validation.
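One readily available implementation (an addition of mine; it is not referenced in the course text) is the mda() function in the mda package, where the number of prototypes per class is set through the subclasses argument.

# Mixture discriminant analysis with 3 Gaussian prototypes per class;
# sketch assuming the mda package is installed.
library(mda)
fit <- mda(Species ~ ., data = iris, subclasses = 3)
mean(predict(fit, iris) != iris$Species)   # apparent (training) error rate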

3.8 Kernel Density Discriminant Analysis

By using a nonparametric estimate of the distribution of X for each class


instead of Gaussian densities, a more flexible discriminant analysis method
is obtained. The resulting nonparametric discriminant method will be less
biased when the distribution is non-Gaussian, but less efficient when the
distribution is indeed Gaussian. Nonparametric discriminant analysis tries
to find a compromise between the efficiency of (parametric) discriminant
analysis and the robustness of logistic regression that does not use the dis-
tribution of X for each class.
3.8.1 Kernel density estimation

Consider a random sample {x1 , . . . , xn } ∈ R from a random variable X with


unknown distribution FX and corresponding density fX (x). We wish to
estimate the density fX (x0 ) at a point x0 . Instead of imposing a parametric
form for the density, we can try to use the information in the neighborhood
of x0 to estimate its density. One possible way is to use kernel functions
Kλ (x0 , x) that give weights to all observations. These weights decrease with
distance from the target point x0 . This leads to the nonparametric kernel
density estimate

f̂_X^k (x0 ) = (1/n) ∑_{i=1}^n Kλ (x0 , xi ).    (3.17)

This is sometimes also called the Parzen estimate. A popular choice for
the kernel function is the Gaussian kernel Kλ (x0 , x) = ϕ(|x − x0 |/λ). The
standard deviation λ determines the size of the window around x0 in which
observations get a large weight. The choice of λ corresponds again to a bias-
variance trade-off. A small window implies that the density estimate f̂_X^k (x0 )
is based on a small number of observations xi (close to x0 ). Consequently,
the variance of the estimate will be large but the bias will be small because
the estimator can accommodate well to the information about the density
given by the observations close to x0 . On the other hand, if the window is
wide, the density estimate f̂_X^k (x0 ) is a weighted average over a large number
of observations leading to a smaller variance. However, the estimator now
also uses information from observations further from x0 . Such observations
will more likely be in regions where the density is different, introducing bias
in the estimate. Note that if λ is very large, we just fit a normal density to
the data, so we end up with a parametric density estimate.
If we denote by ϕ(x; λ) the Gaussian density with mean 0 and standard
deviation λ, then the nonparametric density estimate (3.17) with Gaussian
kernel can be written as

f̂_X^k (x0 ) = (1/n) ∑_{i=1}^n ϕ(x0 − xi ; λ).    (3.18)

For a random sample in Rd the Gaussian kernel can be replaced by a Gaussian
product kernel, which means that we use the spherical Gaussian density

ϕ(x; λ) = (2πλ^2)^{−d/2} exp(−(1/2)(∥x∥/λ)^2)    (3.19)

in the kernel density estimator (3.18).
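A hand-rolled version of the Parzen estimate (3.18) takes only a line of R; the built-in density() function does the same on a grid of points, with bw playing the role of λ. This is a sketch I add for illustration; the bandwidth value is an arbitrary choice.

# Gaussian kernel density estimate (3.18); lambda = 0.3 is an arbitrary choice.
parzen <- function(x0, x, lambda) mean(dnorm((x0 - x) / lambda)) / lambda
parzen(6, iris$Sepal.Length, lambda = 0.3)
# equivalent built-in estimate evaluated on a grid of points:
d <- density(iris$Sepal.Length, kernel = "gaussian", bw = 0.3)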


3.8.2 Kernel density discriminant analysis


Instead of using a Gaussian density in (3.1) as in LDA or QDA, we can
now fit kernel density estimates f̂_r^k (x) for each of the classes based on the
training data and insert these in (3.1):

P̂ (r|X = x) = f̂_r^k (x)π̂r / ∑_{s=1}^k f̂_s^k (x)π̂s .    (3.20)

Note that kernel density discriminant analysis can be seen as the limit case
of mixture discriminant analysis with n prototypes in each class.
Analogous to the regularization parameter α in regularized discriminant
analysis, the parameter λ can be selected by comparing the performance
(error rate) for different λ values on an independent validation sample. Al-
ternatively, if a validation sample is not available, cross-validation can be
used.
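A minimal one-dimensional version of the classifier (3.20) can be written directly with the Parzen estimate from the previous section. The sketch below is my own illustration under the assumption of a single feature and an arbitrary λ.

# Kernel density discriminant analysis (3.20) for a single feature; sketch only.
kdda <- function(x0, x, g, lambda = 0.3) {
  classes <- levels(g)
  post <- sapply(classes, function(r) {
    xr <- x[g == r]
    mean(dnorm((x0 - xr) / lambda)) / lambda * mean(g == r)   # f_r(x0) * pi_r
  })
  classes[which.max(post)]
}
kdda(6.1, iris$Petal.Length, iris$Species)   # classifies one new feature value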

3.8.3 Naive Bayes classifier


Despite its pejorative name (the technique is sometimes even called ’idiot’s
Bayes’), the naive Bayes method is very useful and popular if the dimension
d of the feature space is large. In high dimensions, density estimation (even
parametric) becomes unattractive or even impossible if the sample size is
small (n smaller than d). The naive Bayes model assumes that in each class
r the features X1 , . . . , Xd are independent such that the joint density of X
equals the product of the marginal densities


d
fr (X) = fr,j (Xj ). (3.21)
j=1

These density estimates are then used in (3.1) to obtain discriminant rules.
The problem thus reduces to estimating the univariate marginal densities
which simplifies computation drastically. The original naive Bayes method
uses univariate Gaussian distributions to estimate the marginal densities.
A straightforward extension uses nonparametric kernel density estimates
of the univariate densities fr,j (Xj ). The technique can even incorporate
discrete explanatory variables Xj . The marginal probabilities can then easily
be estimated by the proportions in a bar chart or frequency table. Hence, the
naive Bayes method is a computationally efficient technique that is widely
applicable.
Of course the assumption of independent feature components within each
class will generally not be true. Despite this strong assumption, the naive
Bayes technique often performs very well compared to more sophisticated
techniques, or even outperforms such techniques. This can again be explained
by the bias-variance trade-off. The naive Bayes class density estimates
according to (3.21) will be biased, but if this bias does not distort the
posterior probabilities (3.1) much, especially in the region of the decision
boundaries, then the negative effect of the bias can be much smaller than the gain
in efficiency obtained from the simple univariate density estimation.
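A standard implementation with Gaussian marginal densities is naiveBayes() in the e1071 package (an illustration I add here; the package is not mentioned in the text).

# Naive Bayes with univariate Gaussian marginals per class; sketch only.
library(e1071)
fit <- naiveBayes(Species ~ ., data = iris)
mean(predict(fit, iris) != iris$Species)   # apparent (training) error rate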
Chapter 4

Basis Expansions

4.1 Introduction

Many techniques will impose a (simple) structure for e.g. the decision bound-
ary in classification or the regression function (next part). For instance, LDA
and many of its extensions assume linear decision boundaries while many
regression techniques consider linear regression. Such a relatively simple
structure is often a convenient or even necessary restriction, but will in gen-
eral not be the correct structure but only an approximation. As already
mentioned, the restriction can be relaxed by increasing the predictor space
by adding e.g. squares and cross-products of the measured predictors and
using the technique in the enlarged/transformed predictor space.
In general, consider transformations hm (X) : Rd → R of X, m =
1, . . . , M . The new transformed predictor space is then given by h(X) =
(h1 (X), . . . , hM (X)). A linear technique or linear expansion in X, T (X) =
∑_{j=1}^d βj Xj , can consequently be extended to

T̃ (X) = T (h(X)) = ∑_{m=1}^M βm hm (X),

a linear basis expansion in X.


Some standard transformations are

• hm (X) = Xm , m = 1 . . . , d which uses the original measured predic-


tors.

• Adding transformations hm (X) = Xj2 , hm (X) = Xs Xt , . . . allows to


augment the feature space with polynomial terms as mentioned above.
Note that the number of variables grows exponentially in the degree
of polynomials considered.

• hm (X) = log(Xj ), hm (X) = √Xj , hm (X) = ∥X∥, . . . are standard
nonlinear transformations of measured predictors.

The choice of transformations can be driven by the problem at hand or


transformations can be used to achieve more flexibility. If the goal is flexi-
bility, then instead of global transformations such as polynomials that can
have strange effects in remote parts of the data space, it is more convenient
to use local polynomial representations that fit well in some region of the
space without adverse effects elsewhere. Useful local families are piecewise-
polynomials and splines. The process of replacing an original set of measured
features X by a transformed set of new features X ⋆ is often called filtering
or feature extraction; it is part of the preprocessing phase
of the data analysis.

4.2 Piecewise Polynomials


We assume that X is one-dimensional (and thus can be any variable in the
feature space). To construct a piecewise polynomial function T̃ (X), the
domain of X is divided in contiguous intervals. In each of these intervals,
the function T̃ is represented by a separate polynomial. A basic example is
obtained by using piecewise constant functions. Suppose that we split the
domain of X into three intervals X < ξ1 , ξ1 ≤ X < ξ2 , ξ2 ≤ X. Consider
the following piecewise constant basis functions

h1 (X) = I(X < ξ1 ), h2 (X) = I(ξ1 ≤ X < ξ2 ), h3 (X) = I(ξ2 ≤ X),

where I(.) is the indicator function that takes the value 1 within the specified
interval and zero elsewhere. The piecewise polynomial function T̃ (X) is
now given by T̃ (X) = ∑_{m=1}^3 βm hm (X), a linear function of the transformed
predictors. Since only one predictor is different from zero in each of the
intervals, the coefficients βm can be estimated by β̂m = Ȳm with Ȳm the mean
of the Y values in the mth region. In discriminant analysis the Y variables
will be dummy variables corresponding to class membership for each of the
classes. In regression Y will be the quantitative response variable.
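The following small R sketch (my own illustration; the knots, the simulated data and the underlying function are arbitrary choices, not the text's) builds the three piecewise-constant basis functions and verifies that the least squares coefficients are simply the interval means of Y.

# Piecewise-constant basis with knots xi1, xi2 (assumed values) fitted by least squares.
xi1 <- 1/3; xi2 <- 2/3
h <- function(x) cbind(x < xi1, x >= xi1 & x < xi2, x >= xi2) * 1   # indicator basis
set.seed(1)
x <- runif(100); y <- sin(3 * x) + rnorm(100, sd = 0.2)
fit <- lm(y ~ h(x) - 1)          # no global intercept: one constant per interval
coef(fit)                        # equals the interval means of y
tapply(y, cut(x, c(-Inf, xi1, xi2, Inf)), mean)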
A piecewise linear fit is obtained by adding three additional
basis functions
hm+3 (X) = hm (X)X, m = 1, 2, 3.
These functions thus equal X within the specified interval but are zero outside
that interval. This leads to the approximation T̃ (X) = ∑_{m=1}^6 βm hm (X).
Both of the above approximations have the drawback of not being con-
tinuous at the knots, that is the endpoints of the intervals. To introduce con-
tinuity at the knots we need two restrictions T̃ (ξ1− ) = T̃ (ξ1+ ) which implies
β1 +ξ1 β4 = β2 +ξ1 β5 and T̃ (ξ2− ) = T̃ (ξ2+ ) which implies β2 +ξ2 β5 = β3 +ξ2 β6 .


These two equations fix two parameter values and hence there are four free
parameters that need to be estimated from the data.

4.3 Splines
A more direct approach to impose continuity and other desirable properties
is to use basis functions that incorporate the necessary constraints. For the
above example of a piecewise linear continuous fit we can use the following
basis functions

h1 (X) = 1, h2 (X) = X, h3 (X) = (X − ξ1 )+ , h4 (X) = (X − ξ2 )+ ,

where (.)+ denotes the positive part: the resulting function equals the expression between
brackets when that expression is positive and is zero otherwise. Note that the number of
basis functions now corresponds to the number of free parameters.

4.3.1 Cubic splines


We often prefer to use approximations that are smoother than the piecewise
linear approximation above. This can be achieved by using local higher
order polynomials. A continuous approximation that has continuous first
and second order derivatives also at the knots is obtained by using the
following basis functions, which define the cubic spline with knots at ξ1
and ξ2 .

h1 (X) = 1,   h3 (X) = X^2,   h5 (X) = (X − ξ1 )^3_+ ,
h2 (X) = X,   h4 (X) = X^3,   h6 (X) = (X − ξ2 )^3_+ .    (4.1)

Hence, we fit local cubic functions with constraints at the knots that ensure
that at the knots the values of the left and right parts have the same function
value and the same values of the first and second derivatives.
In theory splines can be defined with constraints beyond second deriva-
tives but in practice there is seldom any need to go beyond cubic splines.
Splines with fixed knots introduced here are often called regression splines.
In practice, one needs to select the order of the spline, the number of knots,
and their location. Often the observations are used as the knots.

4.3.2 Natural cubic splines


Higher order polynomial fits tend to be very sensitive, which can lead to
erratic results near the boundaries and beyond. This makes extrapolation
extremely dangerous. Cubic regression splines tend to behave even more


wildly than global polynomial fits around and beyond the boundary. To
reduce this adverse effect, natural cubic splines require that the resulting
function is linear beyond the boundary knots. As usual, there will be a
price in bias near the boundaries of the domain, but assuming that the
function is linear near the boundary (where we have the least information)
is often reasonable. A natural cubic spline can be represented by the cubic
spline basis (4.1) together with the linearity constraints at the boundary. If
there are q knots ξ1 , . . . , ξq , this leads to the basis functions

h^N_1 (X) = 1,   h^N_2 (X) = X,   h^N_j (X) = d_{j−2}(X) − d_{q−1}(X),   j = 3, . . . , q,

with d_l (X) = [ (X − ξl )^3_+ − (X − ξq )^3_+ ] / (ξq − ξl ).
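In R, cubic regression spline and natural cubic spline bases are provided by bs() and ns() in the splines package. The sketch below is my own illustration, reusing simulated data as above; the bases returned by bs() and ns() use a different internal parametrization than (4.1) but span the same function spaces.

# Cubic spline and natural cubic spline fits via the splines package; sketch only.
library(splines)
set.seed(1)
x <- runif(100); y <- sin(3 * x) + rnorm(100, sd = 0.2)
knots <- c(1/3, 2/3)
fit_cs  <- lm(y ~ bs(x, knots = knots, degree = 3))  # cubic regression spline
fit_ncs <- lm(y ~ ns(x, knots = knots))              # natural cubic spline, linear beyond the boundary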

4.4 Smoothing Splines


Smoothing splines produce a spline basis with automatic knot selection and
thus avoid the knot selection problem. Smoothing splines are obtained as
the solution of the following optimization problem. Among all functions
with second order continuous derivatives, find the function that minimizes
the penalized residual sum of squares

RSS(f, λ) = ∑_{i=1}^n (yi − f (xi ))^2 + λ ∫ (f ′′ (t))^2 dt.    (4.2)

The first term of (4.2) measures the goodness-of-fit, that is the closeness of
the fit to the data. The second term of (4.2) is the penalty term that penal-
izes curvature in the function. Functions with curvature have second order
derivatives different from zero and thus get penalized. Hence, the criterion
shows a preference for simpler (linear) fits if they fit the data reasonably
well. The parameter λ is the smoothing parameter which controls the trade-
off between goodness-of-fit and curvature. A larger value of λ means a
bigger preference for simple linear functions, a smaller λ value means less
penalization for more complex functions with curvature. The two limit cases
are

λ = 0: No penalty. f can be any function that fits the data exactly leading
to a RSS equal to zero.

λ = ∞: No curvature is allowed. f will be the best fitting linear fit. Since


the first term in (4.2) is a least squares criterion, the solution will be
the least squares regression line.
The solution thus varies from very rough (λ = 0) to very smooth (λ = ∞)


and the idea is to select a value of λ that leads to a good compromise.
It can be shown that for any fixed λ, the minimization of (4.2) leads to
a unique solution which is a natural cubic spline with knots at the xi values.
Without penalty (λ = 0) we would thus find a solution that exactly fits the
data. However, the penalty term forces some of the coefficients βm in
the fit T̃ (X) = ∑_{m=1}^n βm h^N_m (X) to be shrunk towards zero, such that the
resulting fit is pushed towards a linear fit.

4.4.1 Splines as linear operators and effective degrees of free-


dom
Let the n × n matrix N denote the transformed feature variable, that is,
N ij = h^N_j (xi ). Criterion (4.2) can now be rewritten as

RSS(β, λ) = (y − N β)t (y − N β) + λβ t Ωβ,    (4.3)

where Ωij = ∫ Ni′′ (t)Nj′′ (t) dt with N (t) = (h^N_1 (t), . . . , h^N_n (t))t . The solution
can be shown to be

β̂ = (N t N + λΩ)−1 N t y.
Let f̂ = (f̂ (x1 ), . . . , f̂ (xn ))t denote the vector of fitted values at the training
data, then

f̂ = N β̂ = N (N t N + λΩ)−1 N t y
= S λ y. (4.4)

This means that the fitted values are a linear function of the responses
y. The linear operator S λ is known as the smoother matrix. Note that
this is similar to least squares regression where we have f̂LS = Hy with
H = X(X t X)−1 X t the hat matrix. As for least squares, the linear operator
S λ does not depend on y but only on the xi observations and λ.
When using regression splines (e.g. cubic splines), we first transform
the predictor variable X to an M -dimensional (M < n) spline basis space
leading to an n × M matrix B ξ for the transformed feature space where ξ
is the knot sequence used by the regression spline. Applying least squares
to find the regression coefficients yields the fitted values

f̂ = B ξ β̂ = B ξ (B tξ B ξ )−1 B tξ y
= H ξ y.

The hat matrix H ξ is now a projection matrix that projects the response
vector y onto the M -dimensional space spanned by the columns of the new
feature matrix B ξ .
In both cases (smoothing splines or regression splines) the fitted values


are obtained as a linear combination of the observed response vector y.
There are some important similarities between the smoother matrix S λ and
the projection matrix H ξ , but also some important differences.

• Both matrices are symmetric, positive semidefinite matrices, but S λ


is not idempotent.

• H ξ has rank M < n, while S λ has rank n. That is, while H ξ projects
y on the M -dimensional space spanned by the columns of B ξ , the
smoother matrix S λ does not project y onto a space of lower di-
mension, but instead finds an approximation in the full n-dimensional
space that satisfies the conditions imposed by the penalty term.

• trace(H ξ ) = M , the dimension of the projection space, which corresponds
with the number of (free) parameters in the fit. Note that the
trace of a square matrix is defined as the sum of its diagonal elements.
Analogously, the effective degrees of freedom of a smoothing spline fit
is defined as
dfλ = trace(S λ ). (4.5)

This definition of the effective degrees of freedom can be used to select the
value of the parameter λ. A smaller effective degrees of freedom means a
larger λ and vice versa. This is explained below in more detail.
The smoother matrix S λ can be rewritten as

S λ = N (N t N + λΩ)−1 N t
= N [N t (N + λ(N t )−1 Ω)]−1 N t
= N [N t (I + λ(N t )−1 ΩN −1 )N ]−1 N t
= N N −1 [I + λ(N t )−1 ΩN −1 ]−1 (N t )−1 N t
= [I + λ(N t )−1 ΩN −1 ]−1
= [I + λK]−1 (4.6)

with K = (N t )−1 ΩN −1 . Note that K does not depend on λ. Using


f = N β, criterion (4.3) can be rewritten as

min_f (y − f )t (y − f ) + λf t Kf .    (4.7)
For this reason K is called the penalty matrix. Since S λ is a symmetric
positive semidefinite matrix, it has a real eigen-decomposition

S λ = ∑_{j=1}^n ρj (λ)vj vjt    (4.8)

with ρj (λ) = 1/(1 + λdj )    (4.9)
where dj are the corresponding eigenvalues of K. From this representation


we see that the eigenvectors do not change by changes in λ. This could
intuitively be expected because S λ is not a projection matrix but looks for
an approximation in the full n-dimensional space, as explained before. The
eigenvectors v1 , . . . , vn are a basis for the full n-dimensional space. The
smoother matrix S λ transforms y into

S λ y = ∑_{j=1}^n ρj (λ)(vjt y)vj .

The decomposition of y is y = ∑_{j=1}^n (vjt y)vj . Hence the linear smoother
first decomposes the vector y w.r.t. the basis v1 , . . . , vn and then applies
the factor ρj (λ) to the weight (vjt y) of each basis vector in y. Since S λ has
rank n, the factors ρj (λ) are all strictly positive and at most 1, as
can be seen from (4.9). The size of a factor ρj (λ) determines the amount of
shrinkage that is applied to the contribution of each of the basis vectors.
In contrast, a least squares regression spline method is based on a pro-
jection matrix H ξ of rank M . The property that a projection matrix is
idempotent implies that the eigenvalues ρj (H ξ ) of H ξ can only take the
values zero or one. M eigenvalues will be one and the remaining n − M
eigenvalues are zero. This means that the contribution of a basis vector to
y is either fully used, or ignored (shrunk to zero). This contrast is often
reflected by calling smoothing splines shrinking smoothers while calling re-
gression splines projection smoothers.
It follows from (4.9) that the eigenvalues ρj (λ) of S λ are an inverse func-
tion of the eigenvalues dj of K. Hence, small eigenvalues of S λ correspond
to large eigenvalues of K and vice versa. The smoothing parameter λ con-
trols the rate by which the eigenvalues ρj (λ) of S λ decrease to zero. The
penalty matrix K depends on Ω, the matrix of second derivatives of the ba-
sis functions. Since the first two basis functions h1^N (X) = 1 and h2^N (X) = X
have second derivatives equal to zero, the corresponding eigenvalues of K
will be zero. From (4.9) it then follows that the corresponding eigenvalues
of S λ are always equal to 1. Hence, the linear contributions to y are never
shrunk.
Since the trace of a matrix also equals the sum of its eigenvalues, we
have that

    dfλ = trace(S λ ) = Σ_{j=1}^n ρj(λ).    (4.10)

Larger values of λ lead to more severe penalization, which yields smaller
eigenvalues (according to (4.9)) and thus also a smaller effective degrees of freedom.
On the other hand, smaller λ values correspond to larger effective degrees of freedom.

For the limiting case λ = 0 it follows from (4.6) that S λ = I which implies
that ρj (λ) = 1 for j = 1, . . . , n and thus dfλ = n. On the other hand, for
λ → ∞, all eigenvalues ρj (λ) = 0 except the first two for which dj = 0 and
thus ρ1 (λ) = ρ2 (λ) = 1 as always. Hence, dfλ = 2 and in fact, since the
eigenvalues of S λ are all zero or one, S λ now equals the projection matrix
H, the hat matrix for linear regression of y on X.
Since dfλ is a monotone function of the smoothing parameter λ, we
can specify the degrees of freedom and derive λ from it. For instance, in
R, the function smooth.spline takes the effective degrees of freedom as a
possible input parameter to specify the desired amount of smoothing. Using
the effective degrees of freedom allows model selection for spline methods
to parallel model selection for traditional parametric methods. Spline
based fits corresponding to different effective degrees of freedom can be
compared and investigated using residual plots and other criteria, to select
an optimal fit.
As explained before, the selection of the smoothing parameter λ and thus
the selection of the effective degrees of freedom dfλ is a trade-off between
bias and variance. We illustrate this trade-off with the following example.

    Y = f (X) + ϵ  with  f (X) = cos(10(X + 1/4)) / (X + 1/4),

with X ∼ U [0, 1] and ϵ ∼ N (0, 1). The training sample consists of n = 100
samples (xi , yi ) generated independently from this model. We fit smoothing
splines for three different effective degrees of freedom 3,6, and 12 to these
data.
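For concreteness, the following R code is a minimal sketch of this simulation (it is not the code used to produce Figure 4.1): it generates the training sample and fits smoothing splines with smooth.spline, specifying the amount of smoothing directly through the df argument.

# Sketch of the smoothing spline simulation: n = 100 observations,
# fits with 3, 6 and 12 effective degrees of freedom.
set.seed(1)
n <- 100
f <- function(x) cos(10 * (x + 1/4)) / (x + 1/4)
x <- runif(n)
y <- f(x) + rnorm(n)

fit3  <- smooth.spline(x, y, df = 3)    # underfits: large bias
fit6  <- smooth.spline(x, y, df = 6)    # close to the true function
fit12 <- smooth.spline(x, y, df = 12)   # wiggly: larger variance

xg <- seq(0, 1, length.out = 200)
plot(x, y, col = "grey")
curve(f, add = TRUE)                    # true function
lines(predict(fit3,  xg), col = "red", lty = 3)
lines(predict(fit6,  xg), col = "red", lty = 2)
lines(predict(fit12, xg), col = "red", lty = 1)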
Note that from (4.4) it immediately follows that

Cov(f̂ ) = Cov(S λ y) = S λ Cov(y)S tλ .

Since Cov(y) = Cov(ϵ) = I in this example, we obtain that Cov(f̂ ) = S λ S tλ


and the diagonal elements of this matrix thus contain the variances of the
fitted values.
Figure 4.1 shows the smoothing splines (red curves) with 3,6, and 12
effective degrees of freedom. The variance of the fits is represented by the
pointwise confidence bands fˆ(xi ) ± 2se(fˆ(xi )). The smoothing spline with
dfλ = 3 clearly underfits. The bias is most drastic in regions with high
curvature. The confidence band is very narrow, so the biased curve is es-
timated with high precision. The smoothing spline with dfλ = 6 fits much
closer to the true function although there still is a small amount of bias.
The variance of the fit is still small. The smoothing spline with dfλ = 12 fits
close to the true function but the fit is wiggly. This wiggliness is also visible

[Figure 4.1 here: three panels (Effective df = 3, Effective df = 6, Effective df = 12), each plotting y against x.]
Figure 4.1: Nonlinear function f (black curve) with smoothing spline fits
for three different values of the effective degrees of freedom.

in the confidence band that reveals the increased variance which now is the
main cause for the difference between the fitted and true curves.

4.5 Multidimensional Splines


Here we briefly discuss some multivariate analogs of the univariate spline
models introduced in the previous sections. Hence, instead of a one-dimensional
predictor X, we now consider X = (X1 , . . . , Xd ) ∈ Rd .

4.5.1 Multivariate regression splines


Consider X = (X1 , . . . , Xd ) ∈ Rd and suppose we have for each component
Xj a (spline) basis of functions hjm (Xj ); m = 1, . . . , Mj . For simplicity we

show results for d = 2, but the generalization is straightforward.


The M1 + M2 dimensional additive spline basis is defined by

gjm (X) = hjm (Xj ); j = 1, 2; m = 1, . . . , Mj .

The function will be approximated by

    f (X) = Σ_{j=1}^2 Σ_{m=1}^{Mj} βjm gjm (X)

where the coefficients βjm need to be estimated from the training data using
e.g. least squares.
The M1 × M2 dimensional tensor product spline basis is defined by

gjm (X) = h1j (X1 )h2m (X2 ); j = 1, . . . , M1 ; m = 1, . . . , M2 .

The function will now be approximated by

    f (X) = Σ_{j=1}^{M1} Σ_{m=1}^{M2} βjm gjm (X).

Again, the coefficients βjm need to be estimated from the training data using
e.g. least squares. Note that the dimension of this basis grows exponentially
with the dimension d contrary to the additive basis. Hence, some regular-
ization or selection will be necessary when using the tensor basis in higher
dimensions.
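As an illustration, the R sketch below builds both bases for d = 2 from cubic spline basis functions in the splines package; the data, the use of bs() and the least squares fit with lm() are merely one possible choice.

library(splines)
# Illustrative data: response y and two predictors x1, x2
set.seed(2)
x1 <- runif(200); x2 <- runif(200)
y  <- sin(2 * pi * x1) * cos(2 * pi * x2) + rnorm(200, sd = 0.3)

B1 <- bs(x1, df = 5)    # M1 = 5 basis functions for X1
B2 <- bs(x2, df = 5)    # M2 = 5 basis functions for X2

# Additive basis: M1 + M2 columns
X.add  <- cbind(B1, B2)
fit.add <- lm(y ~ X.add)

# Tensor product basis: M1 * M2 columns g_{jm}(X) = h_{1j}(X1) h_{2m}(X2)
X.tensor <- matrix(NA, nrow = length(y), ncol = ncol(B1) * ncol(B2))
k <- 0
for (j in 1:ncol(B1)) for (m in 1:ncol(B2)) {
  k <- k + 1
  X.tensor[, k] <- B1[, j] * B2[, m]
}
fit.tensor <- lm(y ~ X.tensor)

ncol(X.add)     # 10 basis functions
ncol(X.tensor)  # 25 basis functions: grows much faster with d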

4.5.2 Multivariate smoothing splines


Suppose we have observations (yi , xi ) with xi ∈ R2 , then we look for the
two-dimensional function f (x) that minimizes


    RSS(f, λ) = Σ_{i=1}^n (yi − f (xi ))² + λ J(f )    (4.11)

where J(f ) is an appropriate penalty term that penalizes curvature in the


bivariate surface. A straightforward generalization of the one-dimensional
penalty in (4.2) is
    J(f ) = ∫∫ [ (∂²f (x)/∂x1²)² + 2 (∂²f (x)/∂x1 ∂x2)² + (∂²f (x)/∂x2²)² ] dx1 dx2

The solution of (4.11) has the form


    f (x) = β0 + β t x + Σ_{j=1}^n αj hj (x),

where hj (x) = h(∥x − xj ∥) with h(t) = t2 log(t2 ). This is a smooth bivariate


surface, which is often called the thin-plate spline. Note that the solution
has the same form as in the one-dimensional case. The linear part is always
present and the value of λ determines the amount of shrinking for the higher
order terms. Higher values of λ imply more shrinkage. As for the one-
dimensional smoothing spline it holds that when λ → ∞, then the solution
becomes the least squares plane. On the other hand, when λ → 0, then
the penalty term disappears and the solution is a function that fits the data
exactly.
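In practice such a surface is rarely coded from scratch. For example, the mgcv package in R fits thin plate regression splines (a low-rank approximation of the thin-plate spline) through gam(); the sketch below uses illustrative data and settings, not a prescribed recipe.

library(mgcv)
set.seed(3)
x1 <- runif(300); x2 <- runif(300)
y  <- exp(-8 * ((x1 - 0.5)^2 + (x2 - 0.5)^2)) + rnorm(300, sd = 0.1)

# s(x1, x2) uses a thin plate regression spline basis by default;
# the amount of penalization is chosen automatically (here by GCV).
fit <- gam(y ~ s(x1, x2), method = "GCV.Cp")
summary(fit)
vis.gam(fit, plot.type = "contour")   # visualize the fitted surface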

4.5.3 Hybrid splines


Some popular hybrid spline bases exist that aim to allow sufficient flexibility
for the fitted function while at the same time keeping the dimension of the
basis reasonable.
Additive models assume that the true function can be approximated well
by a function f (X) of the form f (X) = β0 + f1 (X1 ) + · · · + fd (Xd ) where
each of the functions fj are univariate smoothing splines. A componentwise
penalty is also imposed on the fit:
    J(f ) = Σ_{j=1}^d ∫ fj′′(tj)² dtj    (4.12)

A natural extension is to consider an approximation of the form


    f (X) = β0 + Σ_{j=1}^d fj (Xj ) + Σ_{j<m} fjm (Xj , Xm )

where each component is a spline of the required dimension. Such approxi-


mations are called ANOVA spline decompositions. The splines used can be
regression splines with the use of tensor products for the interactions, or
smoothing splines with an appropriate penalty term as regularizer to keep
the number of nonzero coefficients limited.
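As an illustration of such an ANOVA spline decomposition, mgcv allows main effect smooths plus a tensor product interaction smooth, each with its own penalty; the specification below is just one possible way to do this, on hypothetical data.

library(mgcv)
set.seed(4)
x1 <- runif(300); x2 <- runif(300)
y  <- sin(2 * pi * x1) + cos(2 * pi * x2) + x1 * x2 + rnorm(300, sd = 0.2)

# Main effects f1(x1) + f2(x2) plus a tensor product interaction f12(x1, x2);
# ti() excludes the main effects from the interaction term.
fit <- gam(y ~ s(x1) + s(x2) + ti(x1, x2))
summary(fit)   # separate effective degrees of freedom for each component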

4.6 Regularization with Kernels


A general formulation of regularization problems is
    min_{f ∈H} [ Σ_{i=1}^n L(yi , f (xi )) + λ J(f ) ]    (4.13)

where L(y, f (x)) is a loss function, J(f ) is a penalty term, and H is an


(infinite-dimensional) space of functions on which the penalty term J(f ) is

well-defined. Under general assumptions, it can be shown that this problem


has a finite-dimensional solution.
We will consider the important subclass of problem (4.13) where the
space of functions H is a reproducing kernel Hilbert space (RKHS) generated
by a positive definite kernel K(x, y). Such a space will be denoted by HK .
Moreover, we assume that the penalty term J(f ) can be written in terms of
the kernel as well.
A positive definite kernel K(x, y) has an eigen-expansion

    K(x, y) = Σ_{i=1}^∞ γi ϕi (x) ϕi (y)    (4.14)

with γi ≥ 0 and Σ_{i=1}^∞ γi² < ∞. The basis functions ϕi (x) form a basis for
the infinite-dimensional (Hilbert) space HK spanned by the kernel function


K. Functions in HK thus have an expansion of the form


    f (x) = Σ_{i=1}^∞ ai ϕi (x)    (4.15)

The kernel K(x, y) allows us to define an inner product on the Hilbert space.
For any two functions f (x) = Σ_{i=1}^∞ ai ϕi (x) and g(x) = Σ_{i=1}^∞ bi ϕi (x) in HK ,
the inner product is defined as

    < f, g > = Σ_{i=1}^∞ ai bi / γi .    (4.16)

This inner product leads to the definition of a norm in the Hilbert space HK
given by
    ∥f ∥²HK = < f, f > = Σ_{i=1}^∞ ai² / γi .    (4.17)

Functions f in HK satisfy the constraint

    ∥f ∥²HK < ∞.    (4.18)



Note that for any z ∈ Rd the function g(x) = K(x, z) = Σ_{i=1}^∞ [γi ϕi (z)] ϕi (x)
satisfies (4.15) with ∥g∥²HK = Σ_{i=1}^∞ γi² ϕi (z)² / γi = Σ_{i=1}^∞ γi ϕi (z)² = K(z, z) < ∞.

Hence, g(x) = K(x, z) ∈ HK . For a fixed z ∈ Rd , the function g(x) =


K(x, z) will be denoted by K(., z) to indicate that we consider the kernel as
a function of its first argument with the second argument fixed.
The penalty term J(f ) in (4.13) is taken equal to J(f ) = ∥f ∥2HK . This
penalty term has a similar interpretation as the smoothing spline penalty.
The contributions to f of basis functions ϕ(x) with large eigenvalue γi will
get penalized less while contributions of basis functions with small eigenvalue
γi will get penalized more.

The minimization problem (4.13) now becomes


    min_{f ∈HK} [ Σ_{i=1}^n L(yi , f (xi )) + λ ∥f ∥²HK ].    (4.19)

Using (4.15) and (4.17), this can be rewritten as


 
    min_{aj ; j=1,...,∞} [ Σ_{i=1}^n L(yi , Σ_{j=1}^∞ aj ϕj (xi )) + λ Σ_{j=1}^∞ aj²/γj ].    (4.20)

It has been shown that this problem has a finite-dimensional solution in HK


that can be written in the form

    f (x) = Σ_{i=1}^n αi K(x, xi ).    (4.21)
For the kernel function K it holds that

    < K(., xi ), K(., xj ) > = Σ_{m=1}^∞ γm ϕm (xi ) γm ϕm (xj ) / γm
                            = Σ_{m=1}^∞ γm ϕm (xi ) ϕm (xj )
                            = K(xi , xj ).    (4.22)

For any g ∈ HK it holds that



    < g, K(., xi ) > = Σ_{m=1}^∞ bm γm ϕm (xi ) / γm = Σ_{m=1}^∞ bm ϕm (xi ) = g(xi ).    (4.23)
This property is known as the reproducing property of the kernel K. It
implies that any function g ∈ HK can be approximated arbitrarily well by a
function f of the form (4.21).
For a function f of the form (4.21) it follows further that

    < f, K(., xi ) > = < Σ_{j=1}^n αj K(., xj ), K(., xi ) >
                     = Σ_{j=1}^n αj < K(., xj ), K(., xi ) >
                     = Σ_{j=1}^n αj K(xi , xj ).    (4.24)

Using (4.21) and (4.22) the penalty term J(f ) = ∥f ∥²HK can be rewritten as

    ∥f ∥²HK = < f, f > = Σ_{i=1}^n Σ_{j=1}^n αi αj < K(., xi ), K(., xj ) >
            = Σ_{i=1}^n Σ_{j=1}^n αi αj K(xi , xj )
            = α t Kα    (4.25)

where α = (α1 , . . . , αn )t and K is an n by n symmetric matrix with elements


K ij = K(xi , xj ), called the kernel matrix.
Using (4.23)-(4.25) the minimization problem (4.19) can be rewritten as

    min_α L(y, Kα) + λα t Kα    (4.26)

which now is a finite-dimensional criterion that can be solved numerically.


The fact that the infinite-dimensional problem (4.19) can be reduced to a
finite-dimensional problem as in (4.26) is often called the kernel property.
In general there are two important choices in regularization using kernels:
the choice of the kernel function K and the choice of the loss function L. If we
consider squared error loss, then (4.19) becomes a penalized least squares
problem

    min_α (y − Kα)t (y − Kα) + λα t Kα    (4.27)
with solution
α̂ = (K + λI)−1 y. (4.28)

This yields

    fˆ(x) = Σ_{i=1}^n α̂i K(x, xi )

or

    f̂ = K α̂ = K(K + λI)−1 y = (I + λK −1 )−1 y.

This strongly resembles the smoothing spline solution (4.6) and shows that
the fitted values again are a linear function of the responses y.
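The following R code is a minimal sketch of this penalized least squares solution, using a Gaussian (radial basis function) kernel; the kernel, its width σ and the value of λ are arbitrary illustrative choices.

# Kernel ridge regression: alpha-hat = (K + lambda I)^{-1} y,
# f-hat(x) = sum_i alpha-hat_i K(x, x_i), here with a Gaussian kernel.
gauss.kernel <- function(X, Z, sigma = 0.2) {
  # matrix of K(x_i, z_j) = exp(-0.5 * ||x_i - z_j||^2 / sigma^2)
  D2 <- outer(rowSums(X^2), rowSums(Z^2), "+") - 2 * X %*% t(Z)
  exp(-0.5 * D2 / sigma^2)
}

set.seed(5)
n <- 100
x <- matrix(runif(n), ncol = 1)
y <- cos(10 * (x + 0.25)) / (x + 0.25) + rnorm(n)

lambda <- 0.01
K <- gauss.kernel(x, x)
alpha <- solve(K + lambda * diag(n), y)    # (K + lambda I)^{-1} y

xg <- matrix(seq(0, 1, length.out = 200), ncol = 1)
f.hat <- gauss.kernel(xg, x) %*% alpha     # sum_i alpha_i K(x, x_i)

plot(x, y, col = "grey"); lines(xg, f.hat, col = "red")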
Two possible choices for the kernel function K(x, y) are introduced be-
low.

4.6.1 Polynomial kernels


For x, y ∈ Rd , the polynomial kernel K(x, y) = (x t y + 1)^m has M = (d+m choose m)
eigen-functions that span the space of polynomials in Rd with maximal degree
equal to m. For example, for degree 2 polynomials in R2 the kernel becomes

    K(x, y) = (x t y + 1)² = 1 + 2x1 y1 + 2x2 y2 + x1²y1² + x2²y2² + 2x1 x2 y1 y2

and M = 6 in this case. Let us denote h(x) = (1, √2 x1 , √2 x2 , x1², x2², √2 x1 x2 )t .
Note that K(x, y) = h(x)t h(y). A penalized polynomial fit of degree m is

obtained through
 2

n ∑
M ∑
M
min  yi − βj hj (xi ) + λ βj2
β1 ,...,βM
i=1 j=1 j=1

Using the polynomial kernel, this problem can be rewritten as in (4.27) and
solved easily. Note that the number of eigen-functions soon becomes very
large (even larger than n); however, from (4.28) it follows that the solution
can be computed in the same order of computation time, O(n³), independently
of the degree m of the polynomial.
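The identity K(x, y) = h(x)t h(y) for the degree 2 kernel in R2 is easy to check numerically, as in this small R sketch with two arbitrary points.

# Verify K(x, y) = (x'y + 1)^2 = h(x)' h(y) for two points in R^2
h <- function(x) c(1, sqrt(2) * x[1], sqrt(2) * x[2],
                   x[1]^2, x[2]^2, sqrt(2) * x[1] * x[2])
x <- c(0.3, -1.2)
y <- c(2.0,  0.7)
(sum(x * y) + 1)^2    # kernel evaluation:       0.5776
sum(h(x) * h(y))      # inner product of h(x), h(y): same value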

4.6.2 Radial basis function kernels


The Gaussian kernel K(x, y) = exp(−(1/2)(∥x−y∥/σ)²) leads to a solution that
is an expansion in Gaussian radial basis functions

    hj (x) = exp(−(1/2)(∥x − xj ∥/σ)²);  j = 1, . . . , n.
The radial kernel K(x, y) = ∥x−y∥2 log(∥x−y∥) leads to a solution using
the same basis functions hj (x) = ∥x − xj ∥2 log(∥x − xj ∥) as the thin-plate
spline.
Chapter 5

Flexible Discriminant
Analysis Methods

5.1 Introduction
In Chapter 3 we introduced the basic discriminant analysis techniques LDA
and QDA and already discussed several extensions with the purpose of in-
troducing more flexibility to obtain a better classification. Using the basis
expansions introduced in the previous chapter, we now continue the discus-
sion on discriminant analysis and show how basis expansions can be used as
an alternative approach to develop flexible discriminant analysis techniques.
Similarly, logistic regression, also introduced in Chapter 3, can be made
more flexible.

5.2 Flexible Discriminant Analysis


A flexible discriminant analysis method will be obtained by applying dis-
criminant analysis in a transformed feature space as can be obtained by
using splines or kernels. From a computational viewpoint this is most eas-
ily done by reformulating the discriminant analysis problem as a regression
problem.
This reformulation is obtained as follows. Consider functions θj : G →
R : j = 1, . . . , l < k that assign a score to each class such that these scores
(transformed class labels) can be optimally predicted by linear regression of
the scores on X. The scores are assumed to be mutually orthogonal and
normalized. Since both the scores θ1 , . . . , θl and the regression coefficients
β1 , . . . , βl are unknown, we have to solve the general problem
    min_{θj ,βj ; j=1,...,l} (1/n) Σ_{j=1}^l [ Σ_{i=1}^n (θj (gi ) − xi^t βj )² ].    (5.1)


It can be shown that the (normalized) vectors of regression coefficients β̂j


obtained as solutions of the above problem are equal to the linear discrimi-
nant coordinates that were derived in subsection 3.5.
For j = 1, . . . , l let us denote η̂j (x) = x t β̂j and for each class r = 1, . . . , k
denote η̄jr = (1/nr ) Σ_{gi =r} η̂j (xi ), the average of the fitted values for
score function j in class r. Then, a point x is assigned to the class with
minimal discriminant score

    fr (x) = Σ_{j=1}^l wj [η̂j (x) − η̄jr ]²

where

    wj = 1 / (rj² (1 − rj²)),

with rj² = (1/(n−d)) Σ_{i=1}^n (θ̂j (gi ) − η̂j (xi ))², the mean squared residual of the

jth optimally scored fit. The discriminant score fr (x) is equivalent to the
Mahalanobis distance of x to µ̂r , the center of class r.
Linear discriminant analysis can now be generalized by using more flex-
ible regression fits in (5.1) instead of the linear regression fits ηj (x) = xt βj .
In light of Chapter 4 we consider basis expansions of the predictor space to
obtain flexible regression fits. This leads to the minimization problem
    min_{θj ,βj ; j=1,...,l} (1/n) Σ_{j=1}^l [ Σ_{i=1}^n (θj (gi ) − h(xi )t βj )² ]    (5.2)

where h(X) = (h1 (X), . . . , hM (X)) is the transformed predictor space. As


usual, the advantage of the transformation is that simple linear decision
boundaries in the enlarged, transformed predictor space, map down to non-
linear boundaries in the reduced original predictor space. This method is
called Flexible Discriminant analysis (FDA).
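In R, FDA is implemented, for example, in the mda package; the sketch below assumes that package is available and uses its default regression method and MARS as a more flexible alternative, purely as an illustration on the iris data.

library(mda)
data(iris)

# FDA with the default regression fit of the optimal scores
fit.lin <- fda(Species ~ ., data = iris)

# FDA with a flexible (MARS) regression fit, giving nonlinear boundaries
fit.mars <- fda(Species ~ ., data = iris, method = mars)

table(predict(fit.mars, iris), iris$Species)   # resubstitution confusion matrix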

5.3 Penalized Discriminant Analysis


If the predictor space is very large or becomes very large after transforma-
tion, then regularization will be needed to guarantee a stable discriminant
analysis solution. Without regularization the many predictors and the cor-
relation between them, would lead to a solution that overfits the data and
thus is unreliable. Regularization can be imposed by adding a penalty term
in the minimization problem (5.1) that penalizes fits that are rougher (more
curvature) such that these fits are only selected if they give a considerable
improvement over smoother (more stable) fits. As seen in the previous chap-
ter, regularization comes quite naturally when using a transformation of the

predictor space based on (smoothing) splines or kernels. Hence, when using


FDA, regularization through a penalty term will often be needed. Penalized
Discriminant Analysis (PDA) is the solution of the minimization problem
    min_{θj ,βj ; j=1,...,l} (1/n) Σ_{j=1}^l [ Σ_{i=1}^n (θj (gi ) − h(xi )t βj )² + λ βj^t Ωβj ].    (5.3)

The matrix Ω is chosen to penalize roughness. When using basis expansions


based on smoothing splines or kernels, the penalty term is determined by
the expansion. If the original predictor space is already large with many
correlated predictors, then the penalty term needs to be chosen in a
problem-dependent way, e.g. by penalizing the use of strongly correlated predictors,
to guarantee a smooth and stable solution.

5.4 Flexible Logistic Regression


Similarly to LDA, also logistic regression can be made more flexible by using
transformed predictor spaces. For simplicity, we will discuss the two-class
problem with x ∈ R. Standard logistic regression uses a linear regression
model for the log-odds as shown in (3.11). Using a basis expansion, this
generalizes to
    log[ P (Y = 1|X = x) / P (Y = 0|X = x) ] = Σ_{j=1}^M hj (x)βj = h(x)t β = f (x),    (5.4)

which leads to

    P (Y = 1|X = x) = exp(f (x)) / (1 + exp(f (x)))    (5.5)
    P (Y = 0|X = x) = 1 / (1 + exp(f (x))).    (5.6)
These probabilities can then be used to classify the object with feature value
x.
Also smoothing splines can be formulated in the logistic regression con-
text. They are obtained as the solution of the following problem. Among all
functions f (x) with two continuous derivatives, find the function that max-
imizes the penalized log-likelihood. Let us denote p(x) = P (Y = 1|X = x)
modeled as in (5.5), then the penalized log-likelihood equals


    l(f ; λ) = Σ_{i=1}^n [yi log(p(xi )) + (1 − yi ) log(1 − p(xi ))] − (1/2) λ ∫ (f ′′ (t))² dt
             = Σ_{i=1}^n [yi f (xi ) − log(1 + exp(f (xi )))] − (1/2) λ ∫ (f ′′ (t))² dt    (5.7)

Note that the penalty term now has a negative sign because the log-likelihood
needs to be maximized. As usual, the penalty term penalizes functions with
more curvature. As for problem (4.2) it can be shown that the optimal so-
lution is a natural cubic spline with knots at the distinct xi values. Hence,

the optimal function can be written as f (x) = Σ_{j=1}^n hj^N (x)βj . By inserting
this expression into (5.7), the penalized log-likelihood can be considered as
a function of β (for fixed value of λ), l(β; λ), and taking derivatives w.r.t. β
yields
    ∂l(β; λ)/∂β = N t (y − p) − λΩβ    (5.8)
    ∂²l(β; λ)/∂β∂β t = −N t W N − λΩ    (5.9)
where p is the n-dimensional vector with elements p(xi ) and W is a diagonal
matrix whose diagonal elements are weights W ii = p(xi )(1 − p(xi )). As in
subsection 4.4.1, we have that N ij = hj^N (xi ) and Ωij = ∫ Ni′′(t)Nj′′(t) dt.
The optimal solution β̂ should have first derivative (5.8) equal to zero.
Unfortunately, this first derivative is a nonlinear function of β, such that
the solution can not be expressed analytically (as opposed to the linear
least squares solution). However, an iterative procedure can be executed
until convergence. For a current estimate β current , let us denote

z = N β current + W −1 (y − p) = f current + W −1 (y − p)

with f current = N β current the vector of fitted values corresponding to the


current estimate β current . The vector z is often called the working response
vector. The new estimate is now obtained by

β new = (N t W N + λΩ)−1 N t W z

and this equation is iterated until convergence. Convergence of the algorithm


depends on the choice of the starting values. Usually, taking β = 0 is a good
choice, but convergence is never guaranteed. The update formula can also
be written in terms of the fitted values

f new = N (N t W N + λΩ)−1 N t W z
= S λ,w z (5.10)

Comparing (5.10) with (4.4), it can be seen that the update equation each
time fits a weighted smoothing spline to the working response z.
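A minimal R sketch of this iteration for a single predictor is given below; it uses smooth.spline with observation weights as the weighted smoothing spline step in (5.10), and fixes the amount of smoothing through the effective degrees of freedom rather than through λ directly. The data and the choice df = 6 are illustrative.

# Iteratively reweighted fitting of a smoothing spline logistic regression
set.seed(8)
n <- 200
x <- runif(n)
p.true <- plogis(3 * sin(2 * pi * x))
y <- rbinom(n, 1, p.true)

f <- rep(0, n)                          # starting values: f = 0
for (it in 1:20) {
  p <- plogis(f)                        # current probabilities
  w <- pmax(p * (1 - p), 1e-4)          # diagonal of W (bounded away from 0)
  z <- f + (y - p) / w                  # working response
  fit <- smooth.spline(x, z, w = w, df = 6)   # weighted smoothing spline
  f.new <- predict(fit, x)$y
  if (max(abs(f.new - f)) < 1e-6) break
  f <- f.new
}
p.hat <- plogis(f)                      # estimated P(Y = 1 | X = x)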

5.5 Generalized Additive Models


When using flexible logistic regression (based on smoothing splines) with
several predictor variables X1 , . . . , Xd , additive models as introduced in
subsection 4.5.3 can be used to control the dimension of the model and to keep the model
interpretation easy. The additive model replaces the linear function in (3.11)
by a more flexible additive function
    log[ P (Y = 1|X = x) / P (Y = 0|X = x) ] = log[ p(x) / (1 − p(x)) ] = β0 + f1 (X1 ) + · · · + fd (Xd ).    (5.11)
The smooth functions fj (Xj ) introduce flexibility in the logistic regression
model, while the additivity still allows us to interpret the model because
the contributions of the different predictors are kept separate in the model.
If interactions between predictors are of interest, the additive model can be
extended to ANOVA spline decompositions as introduced in subsection 4.5.3.
To guarantee a unique solution, the standard convention is to impose the
condition Σ_{i=1}^n fj (xij ) = 0 for j = 1, . . . , d.
The structure of the algorithm for the additive logistic regression model
is the same as for the (flexible) logistic regression algorithm in the previous
section.
• Given a current fit f̂ = β̂0 + Σ_{j=1}^d f̂j , that is fˆ(xi ) = β̂0 + Σ_{j=1}^d fˆj (xij ),
and corresponding probabilities p̂i = exp(fˆ(xi ))/[1 + exp(fˆ(xi ))], construct
the working response vector

    z = f̂ + W −1 (y − p̂)

where W is the diagonal weight matrix as defined in the previous


section.

• With the current working response vector z and corresponding weight


matrix W , fit an additive model. This yields a new fit f̂ .

• iterate the two previous steps until convergence.

As starting values, fˆj (xij ) = 0 is taken for all i = 1, . . . , n and j = 1, . . . , d.
It then follows that β̂0 = log[ȳ/(1 − ȳ)], where ȳ = ave(yi ) is the proportion of
ones in the sample.
The weighted additive smoothing splines fit in step two of the algorithm
uses the componentwise penalty (4.12). Hence, the additive fit is obtained
as the solution of minimizing


    RSS(β0 , f1 , . . . , fd ) = Σ_{i=1}^n wi [zi − β0 − Σ_{j=1}^d fj (xij )]² + Σ_{j=1}^d λj ∫ (fj′′ (tj ))² dtj

which yields an additive cubic smoothing spline model as optimal solution,


which implies that the optimal functions fj are cubic splines of predictor Xj
with knots at the distinct values of xij ; i = 1, . . . , n. From the convention
Σ_{i=1}^n fj (xij ) = 0 for j = 1, . . . , d it immediately follows that β̂0 = ave(zi ),
the average response. To estimate the components fˆj , the fact that each
of these components are cubic smoothing splines is exploited as follows.
The algorithm is an iterative procedure, starting from fˆj (xij ) = 0 for all
i = 1, . . . , n and j = 1, . . . , d. To obtain a new estimate fˆl , a weighted
smoothing spline is applied to the target vector

    z − β̂0 − Σ_{j̸=l} f̂j

versus xl . Here, f̂j ; j ̸= l are the currently available fits for all other com-
ponents. Hence, the target vector is the residual vector that remains after
removing from the response the effects that are already explained by the
other components. The smoothing spline fit then estimates which part of
this residual vector can be explained by predictor Xl . This update procedure
is applied in turn to each of the component functions fj , associated with
the different predictors. The process is continued until convergence, that
is, until each of the estimates fˆl are stabilized. This type of algorithm is
called a backfitting algorithm. The additive model and backfitting algorithm
can also be used with other componentwise regression fitting methods than
smoothing splines such as local polynomial fits, regression splines, kernel
methods, etc.
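The sketch below illustrates the backfitting idea for a Gaussian additive model with two predictors, using smooth.spline as the componentwise smoother; the weights of the logistic case are omitted to keep the sketch short, and the data and df values are illustrative.

# Backfitting for an additive model y = beta0 + f1(x1) + f2(x2) + error
set.seed(9)
n <- 300
x <- cbind(runif(n), runif(n))
y <- 2 + sin(2 * pi * x[, 1]) + 8 * (x[, 2] - 0.5)^2 + rnorm(n, sd = 0.3)

d <- ncol(x)
beta0 <- mean(y)
fhat <- matrix(0, n, d)                 # start from f_j = 0

for (it in 1:20) {
  for (j in 1:d) {
    # partial residual: remove the intercept and all other components
    r <- y - beta0 - rowSums(fhat[, -j, drop = FALSE])
    fit <- smooth.spline(x[, j], r, df = 5)
    fj <- predict(fit, x[, j])$y
    fhat[, j] <- fj - mean(fj)          # center: sum_i f_j(x_ij) = 0
  }
}
fitted.values <- beta0 + rowSums(fhat)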
If the componentwise regression fitting method is a linear operator, such
as smoothing splines, then the effective degrees of freedom for each compo-
nent is computed as dfj = trace(S j ) − 1 where S j is the smoother matrix
for the jth component. The −1 in the degrees of freedom is caused by the
fact that the constant term does not need to be estimated, because only one
constant term is needed in the additive fit, and this constant β̂0 is estimated
beforehand. The effective degrees of freedom is again a convenient tool to
determine the value of the smoothing spline parameters λj .
The additive model and backfitting algorithm can be formulated more
generally by introducing a link function g which is the transformation needed
to relate the conditional mean of the response variable Y to the predictors
through an additive function

g(E[Y |X]) = β0 + f1 (X1 ) + · · · + fd (Xd ). (5.12)

Standard choices of link functions are the following

• g(E[Y |X]) = E[Y |X], that is g is the identity. This link function
is used for continuous, symmetric (Gaussian) response variables and
leads to the basic additive models.

• g(E[Y |X]) = logit(E[Y |X]) = log[E[Y |X]/(1 − E[Y |X])] as in (5.11)


above. This link function is used for binary response variables mod-
eled with binomial probabilities, which implies that E[Y |X] = P (Y =
1|X) = p(X). Alternatively a probit link function can be used, probit(E[Y |X]) =
Φ−1 (E[Y |X]).

• g(E[Y |X]) = log(E[Y |X]) for log-additive models useful for e.g. Pois-
son distributed count data as response variable.

For link functions g different from the identity, these models are called gener-
alized additive models. Note that not all predictors need to be fit additively.
Linear and nonlinear contributions can be mixed easily by first estimating
the linear part and estimating the additive part from the resulting residuals.
A drawback of additive models is that every predictor entered additively is
automatically kept in the model; there is no built-in variable selection. Hence, if
the number of predictors is large, model selection techniques are needed to
select the best subset of predictors.
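In R, the mgcv package fits such generalized additive models directly; the sketch below illustrates the logit link on simulated data (the variables, the smooth terms and the linear term x3 are arbitrary choices), and the other links are obtained by changing the family argument.

library(mgcv)
set.seed(10)
dat <- data.frame(x1 = runif(400), x2 = runif(400), x3 = rnorm(400))
eta <- sin(2 * pi * dat$x1) + 2 * dat$x2^2 + 0.5 * dat$x3
dat$y <- rbinom(400, 1, plogis(eta))

# Logit link (binary response); x3 enters linearly, x1 and x2 additively
fit <- gam(y ~ s(x1) + s(x2) + x3, family = binomial, data = dat)
summary(fit)
plot(fit, pages = 1)   # estimated smooth components f1 and f2

# Other links are selected through family, e.g. family = gaussian
# (identity link) or family = poisson (log link).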
Chapter 6

Support Vector Machines

6.1 Introduction
In this chapter we mainly discuss the use of kernel-based methods for clas-
sification. We start by considering methods that explicitly try to find linear
boundaries that separate the different classes as much as possible. We then
discuss the construction of optimal separating hyperplanes when the classes
are not completely separable. This technique is then extended by using ker-
nels to (implicitly) transform and enlarge the predictor space. This results
in more flexible nonlinear boundaries in the original predictor space.

6.2 Separating Hyperplanes


LDA and logistic regression both estimate linear decision boundaries that
separate classes by modeling the conditional distribution P (G = r|X) for
each class r = 1, . . . , k. Separating hyperplane classifiers are procedures that
explicitly try to construct linear decision boundaries that separate the data
into different classes as well as possible. We consider the two-class problem
and the two classes are coded by the class variable Y which takes the values
−1 and +1. A linear decision boundary is a hyperplane Hβ represented by
an equation
f (x) = β0 + β t x = 0. (6.1)

If two points x1 and x2 belong to the same hyperplane Hβ , then f (x1 ) =


f (x2 ) = 0 and thus also f (x1 ) − f (x2 ) = β t (x1 − x2 ) = 0. Hence,

    β ⋆ = β / ∥β∥

is the normalized vector of the direction orthogonal to the hyperplane Hβ .


The signed distance of any point x to a hyperplane Hβ is the signed distance


between the point x and its orthogonal projection x0 on the hyperplane Hβ

dsign (x, Hβ ) = (β ⋆ )t (x − x0 ).

The function is called signed distance because its absolute value is the or-
thogonal distance from the point to the hyperplane but the function is pos-
itive for points on the positive side (f (x) > 0) and negative for points on
the negative side (f (x) < 0) of the hyperplane. Note that for any point x0
in Hβ , it holds that f (x0 ) = 0, which yields β t x0 = −β0 . Therefore, the
signed distance can be rewritten as

    dsign (x, Hβ ) = (β ⋆ )t (x − x0 )
                   = (1/∥β∥)(β t x − β t x0 )
                   = (1/∥β∥)(β t x + β0 )
                   = f (x) / ∥f ′ (x)∥,

since f ′ (x) = β in this case. Hence, the signed distance of a point x to the
hyperplane defined by f (x) = 0 is proportional to the value f (x). Separating
hyperplane classifiers try to find a linear function f (x) as in (6.1) such that
f (x) > 0 for objects in class Y = 1 and f (x) < 0 for objects in class
Y = −1. Procedures that classify observations by computing the sign of a
linear combination of the input features were originally called perceptrons
when introduced in the engineering literature.
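The geometry is easy to verify numerically; the small R sketch below computes the signed distance for an arbitrary hyperplane (the values of β0 and β are illustrative) and classifies points by the sign of f(x).

# Signed distance of points to the hyperplane f(x) = beta0 + beta' x = 0
beta0 <- -1
beta  <- c(2, 1)
X <- rbind(c(1, 0), c(0, 0), c(1, 2))   # three points in R^2

f <- beta0 + X %*% beta                 # values of f(x)
d.signed <- f / sqrt(sum(beta^2))       # signed distances f(x) / ||beta||
class <- sign(f)                        # perceptron-type classification
cbind(f, d.signed, class)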
Let us assume that the data are separable, that is, there is at least
one hyperplane that completely separates the two classes. In this case, the
solution of the separating hyperplane problem will often not be unique, but
several hyperplanes completely separate the classes. To make the solution
unique, an additional condition can be imposed to find the best possible
separating hyperplane among all choices. The optimal separating hyperplane
separates the two classes and maximizes the distance from the hyperplane to
the closest points from both classes. This extra condition defines a unique
solution to the separating hyperplane problem and maximizes the margin
between the two classes on the training data. It can be expected that it
therefore also leads to better performance on test data.
If Hβ is a separating hyperplane, then Y = −1 for points on the negative
side (f (x) < 0) of Hβ and Y = 1 for points on the positive side (f (x) > 0).
It follows that the (unsigned) distance of a point xi to the hyperplane Hβ is
obtained by d(xi , Hβ ) = yi (β t xi + β0 )/∥β∥. Hence, a separating hyperplane
satisfies d(xi , Hβ ) = yi (β t xi + β0 )/∥β∥ > 0 for all training points.

The optimal separating hyperplane is the solution to the problem

    max_{β,β0} C    (6.2)

    subject to  yi (β t xi + β0 )/∥β∥ ≥ C;  i = 1, . . . , n.    (6.3)
The condition (6.3) can equivalently be written as

yi (β t xi + β0 ) ≥ C∥β∥; i = 1, . . . , n. (6.4)

Now suppose that for some value of C, there are values β and β0 that satisfy
condition (6.4). Let us consider a positive value γ, and define β̃ = γβ and
β̃0 = γβ0 . It then follows that ∥β̃∥ = γ∥β∥ and condition (6.4) is equivalent
to

γyi (β t xi + β0 ) ≥ γC∥β∥; i = 1, . . . , n
or yi (β̃ t xi + β̃0 ) ≥ C∥β̃∥; i = 1, . . . , n.

This means (not surprisingly) that if values β and β0 satisfy the condition,
then also any multiple by a positive constant satisfies the constraint. To
make the solution unique, the value of ∥β∥ can be taken equal to an arbitrarily
chosen constant. A convenient choice is to take

    ∥β∥ = 1/C.    (6.5)
The maximization problem (6.2) then becomes the minimization problem

    min_{β,β0} ∥β∥.

Minimizing the norm is equivalent to minimizing the squared norm which


is analytically more tractable. Hence, the maximization problem (6.2)-(6.3)
can be reformulated as
    min_{β,β0} (1/2)∥β∥²    (6.6)

    subject to  yi (β t xi + β0 ) ≥ 1;  i = 1, . . . , n.    (6.7)
Note that condition (6.3) together with (6.5) implies that around the optimal
separating boundary there is a margin of width 1/∥β∥ in both directions,
that does not contain any data points.
The optimal separating hyperplane problem as formulated in (6.6)-(6.7)
consists of a quadratic criterion that needs to be minimized under linear
inequality constraints, which is a standard convex optimization problem of
operations research. That is, minimizing (6.6)-(6.7) is equivalent to mini-
mizing w.r.t. β and β0 the Lagrange (primal) function

    LP = (1/2)∥β∥² − Σ_{i=1}^n αi [yi (β t xi + β0 ) − 1]    (6.8)

where αi ≥ 0 are the Lagrange multipliers. Setting the derivatives w.r.t. β


and β0 equal to zero yields the equations


    β = Σ_{i=1}^n αi yi xi    (6.9)
    0 = Σ_{i=1}^n αi yi    (6.10)

These first-order conditions thus define the parameter β as function of the


Lagrange multipliers αi . Thus, it remains to determine the optimal values
of the Lagrange multipliers. Substituting the above first-order conditions in
the Lagrange function (6.8) yields the Wolfe dual function


    LD = Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj yi yj xi^t xj    (6.11)

which needs to be maximized w.r.t. α1 , . . . , αn under the constraints


    Σ_{i=1}^n αi yi = 0
    αi ≥ 0.    (6.12)

This is a standard convex optimization problem. The final solution must sat-
isfy the Karush-Kuhn-Tucker conditions, well-known in operations research.
These conditions are given by (6.7),(6.9)-(6.10),(6.12) and the condition

αi [yi (β t xi + β0 ) − 1] = 0; i = 1, . . . , n. (6.13)

This condition implies that

• If αi > 0, then yi (β t xi + β0 ) = 1 which means that the point xi is on


the boundary of the margin around the separating hyperplane.

• If yi (β t xi + β0 ) > 1, then xi is not on the boundary of the margin and


αi = 0.

From (6.9) it follows that the optimal solution (direction) β is defined as


a linear combination of the points xi with αi > 0. These points are the
points on the boundary of the margin around the separating hyperplane,
called the support points. Hence, optimal separating hyperplanes focus
mainly on points close to the decision boundary and give zero weight to all
other points (although all observations are needed to determine the support
points). This is similar to logistic regression that also gives larger weights
to observations closer to the boundary. In fact, if the classes are completely

separable as we assumed here, then it can be shown that the logistic regres-
sion boundary is a separating hyperplane (not necessarily optimal). Hence,
both methods have the advantage of being more robust to model misspec-
ifications and deviations in the data if they occur away from the decision
boundary, but are less stable for well-behaved Gaussian data. On the other
hand, as mentioned before, LDA gives equal weight to all observations and
thus is more efficient if the data are Gaussian, but can be more sensitive to
model deviations.
Once the optimal separating hyperplane Hβ̂ has been found as solution
of the constrained optimization problem, new observations can be classified
as
    Ĝ(x) = sign(fˆ(x)) = sign(β̂ t x + β̂0 )

In practice, it will often occur that there does not exist a separating
hyperplane and thus there is no feasible solution to problem (6.6)-(6.7).
Therefore, we will extend the separating hyperplane problem in the next
section to the case of nonseparable classes.

6.3 Support Vector Classifiers


We now modify the optimal separating hyperplane problem (6.6)-(6.7) such
that it can find an optimal separating hyperplane even when the two classes
are not completely separable by a linear boundary. To allow for overlap of the
classes, we still try to find a hyperplane that maximizes the margin between
the two classes, but we now allow some data points to fall on the wrong side
of their margin. Formally this can be accomplished by introducing slack
variables ξ = (ξ1 , . . . , ξn ) associated with the data points where ξi ≥ 0; i =
1, . . . , n. With these slack variables condition (6.7) is modified to

yi (β t xi + β0 ) ≥ 1 − ξi ; i = 1, . . . , n.

Hence, the value ξi in the above constraint is proportional to the amount
by which each point xi falls on the wrong side of its margin. Therefore, to
obtain a reasonable solution, a bound needs to be set on the sum Σ_{i=1}^n ξi ,
the total amount by which points are allowed to lie on the wrong side of
their margin. Let Q denote this bound, hence we require Σ_{i=1}^n ξi ≤ Q.
Since the area of the margin between the two classes is no longer required to be
empty, it is often called a soft margin. Note that a training point
(xi , yi ) is misclassified if ξi > 1, because then yi and β t xi + β0 have a different
sign; hence the constant Q determines the maximal number of misclassified
training points that is allowed.

A convenient formulation of the soft margin problem or support vector


classifier problem is

    min_{β,β0} [ (1/2)∥β∥² + γ Σ_{i=1}^n ξi ]    (6.14)

    subject to  yi (β t xi + β0 ) ≥ 1 − ξi ;  i = 1, . . . , n
                ξi ≥ 0;  i = 1, . . . , n    (6.15)
ξi ≥ 0; i = 1, . . . , n (6.15)

where the tuning parameter γ > 0 now takes over the role of Q. A large
value of γ puts a strong constraint on Σ_{i=1}^n ξi and thus allows few
misclassifications, while a small value of γ puts little constraint on Σ_{i=1}^n ξi . Note
that the limit case γ = ∞ corresponds to the separable case since all slack
variables ξi should equal zero in this case. The support vector classifier prob-
lem (6.14)-(6.15) still consists of a quadratic criterion with linear constraints,
hence a standard convex optimization problem. The minimization problem
is equivalent to minimizing w.r.t. β, β0 and ξi the Lagrange function

    LP = (1/2)∥β∥² + γ Σ_{i=1}^n ξi − Σ_{i=1}^n αi [yi (β t xi + β0 ) − (1 − ξi )] − Σ_{i=1}^n µi ξi    (6.16)

where αi ≥ 0 and µi ≥ 0 are the Lagrange multipliers. Setting the derivatives
w.r.t. β, β0 and ξi equal to zero yields the equations

    β = Σ_{i=1}^n αi yi xi    (6.17)
    0 = Σ_{i=1}^n αi yi    (6.18)
    αi = γ − µi ;  i = 1, . . . , n    (6.19)

Note that equations (6.17) and (6.18) are the same as in the separable case
discussed in the previous section (equations (6.9) and (6.10)). The difference
is that the last equation relates the Lagrange multipliers αi to the constant
γ. Substituting these first order conditions in (6.16) yields the Wolfe dual
objective function


    LD = Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj yi yj xi^t xj    (6.20)

Note that this is exactly the same dual function as in the separable case
(expression (6.11)), but now it needs to be maximized w.r.t. α1 , . . . , αn under the
constraints

    Σ_{i=1}^n αi yi = 0
    0 ≤ αi ≤ γ.    (6.21)
The Karush-Kuhn-Tucker conditions satisfied by the solution include the
constraints

αi [yi (β t xi + β0 ) − (1 − ξi )] = 0; i = 1, . . . , n (6.22)
µi ξi = 0; i = 1, . . . , n (6.23)
yi (β t xi + β0 ) − (1 − ξi ) ≥ 0; i = 1, . . . , n. (6.24)

From (6.17) it still follows that the solution for β is a linear combination of
the observations with αi > 0. From condition (6.22) it follows that these
points are observations for which constraint (6.24) is exactly met. These
observations are again called the support vectors. Some of these support
vectors will lie on the margin (ξi = 0) which according to (6.23) and (6.19)
implies that 0 < αi < γ while others lie on the wrong side of their margin
(ξi > 0) and are characterized by αi = γ.
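In R, the support vector classifier is available, for example, through the e1071 package; in svm() the cost argument plays the role of γ above (a larger cost allows fewer margin violations). The sketch below uses simulated two-class data and arbitrary settings.

library(e1071)
set.seed(12)
n <- 100
x <- matrix(rnorm(2 * n), ncol = 2)
y <- factor(ifelse(x[, 1] + x[, 2] + rnorm(n, sd = 0.5) > 0, 1, -1))
dat <- data.frame(x1 = x[, 1], x2 = x[, 2], y = y)

# Linear support vector classifier; cost corresponds to gamma in (6.14)
fit <- svm(y ~ x1 + x2, data = dat, kernel = "linear", cost = 1, scale = FALSE)
summary(fit)                   # reports the number of support vectors
table(predict(fit, dat), dat$y)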

6.4 Support Vector Machines


As with other classification methods, the support vector classifier can be
made more flexible by enlarging the feature space using basis expansions as
discussed in Chapter 4. The linear boundaries obtained in the transformed
space will generally achieve a better class separation on the training data
and lead to nonlinear boundaries in the original predictor space. Consider
a transformed feature space h(X) = (h1 (X), . . . , hM (X))t which leads to
a transformed set of feature vectors h(xi ) for the training data. Applying
the support vector classifier in this enlarged space produces a boundary
fˆ(x) = h(x)t β̂ + β̂0 . This boundary is linear in the enlarged space h(X) but
nonlinear in the original feature space X. The support vector classifier is
Ĝ(x) = sign(fˆ(x)) as before. By allowing the dimension of the transformed
feature space to become very large, it becomes possible to find a boundary
that completely separates the classes. However, the nonlinear boundary in
the original space can be very wiggly indicating overfitting of the training
data. In such cases the optimistic zero training error will be countered by
a high test error. To avoid overfitting, regularization is needed to penalize
roughness of the boundary such that rougher boundaries are only allowed if
they significantly improve the training classification error.
We first show that the support vector classifier in the transformed pre-
dictor space only uses products of the basis functions which makes it an ideal
problem to use basis expansions with kernels as explained in Section 4.6. In
the previous section we showed that the support vector optimization prob-
lem (6.14)-(6.15) could be rewritten as maximizing the dual function (6.20)

subject to the constraints (6.21). For a transformed predictor space h(X)


this becomes: maximize the dual function

    LD = Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj yi yj h(xi )t h(xj )    (6.25)

under the constraints



    Σ_{i=1}^n αi yi = 0
    0 ≤ αi ≤ γ.    (6.26)
Using (6.17) it follows that the support vector boundary can be written as

    f (x) = β t h(x) + β0 = Σ_{i=1}^n αi yi h(x)t h(xi ) + β0 .    (6.27)

Note that for given αi and β, the constant β0 can be determined as the
solution of yi f (xi ) = 1 for support points xi on the margin (0 < αi < γ).
From (6.25) and (6.27) it follows that both the optimization problem (6.25)
and its solution (6.27) in the transformed feature space only depend on prod-
ucts of basis functions h(x)t h(x′ ). This makes it convenient to use transfor-
mations based on kernel functions. Consider a kernel function K(x, x′ ) =
Σ_{m=1}^∞ γm ϕm (x)ϕm (x′ ) as in (4.14) and define the associated basis functions
as hm (x) = √γm ϕm (x). Then for any points x and x′ in Rd the product
h(x)t h(x′ ) = Σ_{m=1}^∞ γm ϕm (x)ϕm (x′ ) = K(x, x′ ). Hence, if the transforma-
tion of the feature space is induced by a kernel function, then (6.25) can be
rewritten as
    LD = Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj yi yj K(xi , xj )    (6.28)

which needs to be maximized under constraints (6.26) and the solution (6.27)
becomes

    f (x) = Σ_{i=1}^n αi yi K(x, xi ) + β0    (6.29)
which shows that knowledge of the kernel function is sufficient to compute
the support vector classifier, without the need to explicitly specify the trans-
formation h(X). The above dual problem still can easily be solved using the
same techniques as for the support vector classifier (in the original space) in
the previous section. The resulting classification method is called a support
vector machine. Popular choices for the kernel function K are the polyno-
mial kernels introduced in subsection 4.6.1, the radial basis functions kernel
of subsection 4.6.2, and the neural network kernel defined by

K(x, x′ ) = tanh(κ1 (xt x′ ) + κ2 ).
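With e1071, the same svm() call gives a support vector machine once a nonlinear kernel is selected; note that the gamma argument below is the radial kernel width, not the cost parameter γ of the support vector classifier. The kernel parameters are illustrative values that would normally be tuned, e.g. by cross-validation.

library(e1071)
set.seed(13)
n <- 200
x <- matrix(runif(2 * n, -1, 1), ncol = 2)
y <- factor(ifelse(x[, 1]^2 + x[, 2]^2 < 0.5, 1, -1))   # circular boundary
dat <- data.frame(x1 = x[, 1], x2 = x[, 2], y = y)

fit.rbf  <- svm(y ~ ., data = dat, kernel = "radial", gamma = 1, cost = 1)
fit.poly <- svm(y ~ ., data = dat, kernel = "polynomial",
                degree = 2, coef0 = 1, cost = 1)

table(predict(fit.rbf, dat), dat$y)
plot(fit.rbf, dat)    # decision boundary in the original (x1, x2) space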



We introduced support vector machines by using the dual formulation


of the support vector classifier which immediately yields a computationally
attractive representation of support vector machines. To better understand
support vector machines, we now also consider the original (primal) formu-
lation of the support vector classifier.
For non-support vectors (αi = 0), condition (6.19) implies that µi = γ.
It then follows from condition (6.23) that ξi = 0 for non-support vectors.
Moreover, condition (6.24) becomes yi (β t xi + β0 ) ≥ 1 or equivalently

1 − yi (β t xi + β0 ) = 1 − yi f (xi ) ≤ 0.

For support vectors (αi > 0), condition (6.22) implies that

yi (β t xi + β0 ) − (1 − ξi ) = 0

or equivalently

0 ≤ ξi = 1 − yi (β t xi + β0 )
= 1 − yi f (xi ).
Using these results, Σ_{i=1}^n ξi can be rewritten as

    Σ_{i=1}^n [1 − yi f (xi )]+    (6.30)

where the function [x]+ is defined as [x]+ = x if x > 0, and [x]+ = 0 if x ≤ 0.

By inserting (6.30) into (6.14) and denoting λ = 1/(2γ) the support vector
classifier solves the optimization problem
    min_{β,β0} [ Σ_{i=1}^n [1 − yi f (xi )]+ + λ∥β∥² ].

In a transformed predictor space h(X) this becomes


    min_{β,β0} [ Σ_{i=1}^n [1 − yi β t h(xi )]+ + λ∥β∥² ].    (6.31)

Now, consider again a kernel function K(x, x′ ) = Σ_{m=1}^∞ γm ϕm (x)ϕm (x′ ) as
in (4.14) with associated basis functions hm (x) = √γm ϕm (x). Then, f (x)
can be rewritten as

    f (x) = β t h(x) = Σ_{m=1}^∞ βm hm (x) = Σ_{m=1}^∞ βm √γm ϕm (x) = Σ_{m=1}^∞ θm ϕm (x)

where θm = √γm βm . Since ∥β∥² = Σ_{m=1}^∞ βm² = Σ_{m=1}^∞ θm²/γm , it follows
that (6.31) can be rewritten as

    min_{θ,β0} [ Σ_{i=1}^n [ 1 − yi Σ_{m=1}^∞ θm ϕm (xi ) ]+ + λ Σ_{m=1}^∞ θm²/γm ]    (6.32)

which corresponds to the formulation of the general kernel minimization


problem in (4.20). As explained in Chapter 4 the problem has a finite-
dimensional solution

n
f (x) = β0 + α̃j K(x, xj )
j=1

(where α̃i = yi αi in (6.29)) and (6.32) can be rewritten as


    min_{α̃,β0} [ Σ_{i=1}^n [1 − yi f (xi )]+ + λ α̃ t Kα̃ ]    (6.33)

where K is the kernel matrix. This is a minimization problem of the


form (4.26) and is a special case of the more general problem
    min_{f ∈H} [ Σ_{i=1}^n L(yi , f (xi )) + λ J(f ) ]

which corresponds to (4.13) for the choice L(yi , f (xi )) = [1 − yi f (xi )]+ . A
natural generalization is to use different loss functions than the standard
SVM loss function in (6.33) such as the logistic regression loss function

L(yi , f (xi )) = log(1 + exp(−yi f (xi ))).


Chapter 7

Classification Trees

7.1 Introduction
Tree-based methods partition the feature space into rectangular regions and
then fit a simple model (often a constant) in each of these regions. Trees are
conceptually simple, yet very powerful techniques. The rectangular regions
in feature space are obtained by consecutive binary partitions of the data,
based on the value of one of the predictors. Hence, the key component of
tree methods is the way that the binary partitions are determined.

7.2 Growing Trees


Trees determine consecutive binary partitions of the data. The first split
separates the feature space into two regions and then models the response
in each region using a simple model, e.g. all observations in the same region
are assigned to the same class. The split is based on the value of one of the
predictors. That is, a predictor Xj and accompanying value t are determined
such that the feature space is split at Xj = t, leading to the regions Xj ≤ t
and Xj > t. The value t is called the split-point. Then the method proceeds
by splitting one or both of these regions into two smaller subregions and
then models the response in these subregions. For each of these splits again
the split variable and accompanying split-point need to be determined.
The successive binary splits can be displayed with a decision tree. The
tree starts with the complete dataset at the top. At each branch point,
observations that satisfy the condition (which is of the form Xj ≤ t) go to
the left branch, while the others go to the right branch. The tree ends at the
bottom in terminal nodes or leaves which correspond to the final regions in
which the feature space is divided.
The goal is to determine the best possible splitting variable and split-


point for each split. In the classification context, this means that the result-
ing tree should minimize the misclassification error. Suppose that the tree
procedure splits the feature space into M leaves R1 , . . . , RM . Each of these
regions Rm ; m = 1, . . . , M contains a number of observations denoted as
nm . All observations in a terminal node m (corresponding to region Rm )
will be assigned to the class which has the highest proportion among the
training data that belong to region Rm . For any class r = 1, . . . , k, the
proportion of training observations of class r in region Rm is given by
    p̂mr = (1/nm ) Σ_{xi ∈Rm} I(yi = r),

where I(yi = r) is the indicator function that takes value 1 if the condition
yi = r is satisfied and zero else. The observations in terminal node m are
thus assigned to class

    r(m) = arg max_r p̂mr ,

the majority class in leaf m. The goal is to minimize the misclassification
rate, given by

    (1/n) Σ_{m=1}^M Σ_{xi ∈Rm} I(yi ̸= r(m)) = Σ_{m=1}^M (nm /n) [1 − p̂m r(m) ].

Finding the best binary partition (consisting of a set of successive splits)
is computationally infeasible. Therefore, a stepwise procedure is adopted.
In the first step, starting from the full data set, for each possible predictor
Xj and accompanying split-point t, the quality of the split is assessed by
measuring how homogeneous the resulting regions are. A measure of the
inhomogeneity of a region is called an impurity measure. Since we aim to
minimize the overall misclassification rate, a straightforward choice for the
impurity measure is the misclassification error of the region which for any
region Rm is given by
    (1/nm ) Σ_{xi ∈Rm} I(yi ̸= r(m)) = 1 − p̂m r(m) .

However, this function is not differentiable which makes it less attractive


for efficient numerical optimization. Alternative impurity measures are the
Gini index and the cross-entropy or deviance. The Gini index is given by

    Σ_{r̸=s} p̂mr p̂ms = Σ_{r=1}^k p̂mr (1 − p̂mr ).

The Gini index can be seen as the training misclassification error rate that
is obtained if the observations of Rm are not uniquely assigned to one class,

but are given probabilities to belong to each of the classes, with p̂mr the
probability of belonging to class r; r = 1, . . . , k. Alternatively, the Gini
index can be interpreted as follows. For any class r, code the observations in
Rm belonging to class r as 1 and give code 0 to all other observations. Then,
the success probability of this Bernoulli variable is p̂mr and its variance
is given by p̂mr (1 − p̂mr ). Summing the variance over all possible classes
r; r = 1, . . . , k again leads to the Gini index. The cross-entropy or deviance
is given by

    − Σ_{r=1}^k p̂mr log(p̂mr ).

Figure 7.1 compares the shape of the three impurity measures for a two-class
problem as a function of p, the proportion of observations in class 1 (or class
2). Clearly, all three measures are similar.
[Figure 7.1 here: curves of the misclassification error, Gini index, and entropy as a function of p ∈ [0, 1].]

Figure 7.1: Shape of the three impurity measures for a two-class problem as
a function of p, the proportion of observations in class 1.

Let us denote by QRm (T ) an impurity measure for the tree T , evaluated


at a region Rm . This impurity measure can thus be the misclassification error,
the Gini index, the cross-entropy, etc. The best possible split of the data is
the split that minimizes the inhomogeneity of the resulting regions. For each
possible predictor Xj and accompanying split-point t, denote the resulting
regions by

R1 (j, t) = {X|Xj ≤ t} and R2 (j, t) = {X|Xj > t}.

The best split is then the variable Xj and split-point t that solves

    min_{j,t} [nR1 QR1 (T ) + nR2 QR2 (T )],    (7.1)

where nR1 and nR2 are the numbers of training observations in R1 and R2 .
Once the optimal split has been found, the data are partitioned according
to the two resulting regions R1 and R2 and the splitting process is repeated
for each of the regions.
Note that both cross-entropy and the Gini index are more sensitive to
changes in class probabilities within a region, even if the majority class
remains the same, which is another reason why they are preferable when
growing a tree. To illustrate this, consider a two-class problem with a train-
ing dataset of size n = 800 and the same number of observations in each
class, n1 = n2 = 400. Suppose a split creates regions R1 with n1 = 300 and
n2 = 100, and R2 with n1 = 100 and n2 = 300, while another split creates
regions R1 with n1 = 200 and n2 = 400, and R2 with n1 = 200 and n2 = 0.
Both splits create a misclassification rate of 0.25, meaning that they are
equally good when the splitting criterion (7.1) is based on misclassification
error. However, the second split produces a ’pure’ leaf where the misclassi-
fication error is zero. Therefore, this split is preferable. When the splitting
criterion (7.1) is based on the Gini index or cross-entropy, it yields a lower
value for the second split as is desirable.
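The comparison can be verified directly; the short R sketch below computes the weighted misclassification error and Gini index for the two candidate splits of this example.

# Weighted impurity of a split into two regions, given per-region class counts
impurity <- function(counts1, counts2, measure = c("misclass", "gini")) {
  measure <- match.arg(measure)
  region <- function(counts) {
    p <- counts / sum(counts)
    if (measure == "misclass") 1 - max(p) else sum(p * (1 - p))
  }
  n <- sum(counts1) + sum(counts2)
  (sum(counts1) * region(counts1) + sum(counts2) * region(counts2)) / n
}

# Split 1: R1 = (300, 100), R2 = (100, 300); split 2: R1 = (200, 400), R2 = (200, 0)
impurity(c(300, 100), c(100, 300), "misclass")  # 0.25
impurity(c(200, 400), c(200, 0),   "misclass")  # 0.25: equally good
impurity(c(300, 100), c(100, 300), "gini")      # 0.375
impurity(c(200, 400), c(200, 0),   "gini")      # 0.333: favors the second split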
It has to be decided when to stop splitting regions into further subregions
or equivalently, it has to be decided how large a tree can grow. One strategy
would be to evaluate each split by comparing the misclassification rate of
the tree before the split with the misclassification rate of the tree obtained
after the split. The split can then be retained if the gain in misclassification
rate exceeds some threshold. However, this strategy does not work out well,
because seemingly worthless splits in the beginning of the process might lead
to very good splits below it. Therefore, the preferred strategy is to grow a
large tree T0 where splitting of regions only stops when some minimal region
size (e.g. 5) has been reached. This large tree has the danger of overfitting
the training data. Therefore, the initial tree is then pruned to find a subtree
that fits the data well but is more stable and is expected to perform better
on validation data.
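In R such trees can be grown with the rpart package; the sketch below grows a deliberately large tree by setting cp = 0 and a small minimum leaf size, using the iris data purely as a convenient example.

library(rpart)
data(iris)

# Large classification tree: Gini splitting, minimum leaf size 5,
# and cp = 0 so that splitting is not stopped early.
big.tree <- rpart(Species ~ ., data = iris, method = "class",
                  parms = list(split = "gini"),
                  control = rpart.control(minbucket = 5, cp = 0))
plot(big.tree); text(big.tree)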

7.3 Pruning Trees

The initial large tree T0 is pruned using cost-complexity pruning. A subtree


T ⊂ T0 is defined to be a tree that can be obtained from T0 by removing
any number of its branch points. This process is called pruning. Let M
be the number of leaves in tree T , that is M is the final number of regions
the feature space is partitioned in. Then, the cost-complexity criterion is

defined as

    Cα (T ) = Σ_{m=1}^M nm QRm (T ) + αM,    (7.2)

where α ≥ 0 is a tuning parameter that creates a trade-off between tree


size (that is, the number of leaves) and goodness of fit to the training data.
Clearly, α = 0 means that no pruning is done and the initial large tree is
retained. Larger values of α lead to smaller trees (more pruning). The im-
purity measure QRm (T ) in the cost-complexity criterion can be any of the
impurity measures introduced before. However, typically misclassification
error rate is used to ensure that the final tree has good classification perfor-
mance. For each α, we need to find the subtree Tα ⊂ T0 that minimizes the
cost complexity Cα (T ) given by (7.2). It can be shown that for each α ≥ 0
there exists a unique smallest subtree Tα that minimizes Cα (T ). To find
the tree Tα , weakest link pruning can be used. This procedure in each step
removes the branch point that produces the smallest increase of the total
impurity. For a tree T the total impurity is defined as


    Σ_{m=1}^{M (T )} nm QRm (T )    (7.3)

where M (T ) is the number of leaves of the tree, denoted as R1 , . . . , RM (T ) .


This procedure yields a finite sequence of trees, starting with the initial
tree T0 and ending with the root tree, that is, the tree with only one leaf
containing the whole feature space. It can be shown that for any α ≥ 0,
the optimal subtree Tα belongs to this sequence. Hence, by computing the
cost-complexity criterion Cα (T ) for each tree in this sequence, the final tree
can be found. To select the optimal value of the tuning parameter α, the
value with optimal performance on a validation sample can be determined
or cross-validation can be used.
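Continuing the rpart illustration above, cost-complexity pruning is available through printcp()/plotcp() and prune(); rpart's cp parameter is a rescaled version of α, and a value is typically chosen from the cross-validation results stored in the cptable.

library(rpart)
data(iris)
big.tree <- rpart(Species ~ ., data = iris, method = "class",
                  control = rpart.control(minbucket = 5, cp = 0, xval = 10))

printcp(big.tree)      # sequence of subtrees with cross-validated error
plotcp(big.tree)

# choose the cp value with the smallest cross-validated error ("xerror")
best.cp <- big.tree$cptable[which.min(big.tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(big.tree, cp = best.cp)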

7.4 Bagging
Bagging uses the bootstrap methodology to improve the estimates or predic-
tions of a classification (or regression) procedure such as trees. The boot-
strap creates new training datasets by randomly drawing observations with
replacement from the original training dataset. Usually, each of these boot-
strap samples has the same size as the original dataset. This yields a number
of bootstrap datasets denoted by B. The same model is then refit to each
of these bootstrap datasets. For any point x, the model fit to each of the
bootstrap samples, yields a prediction Ĝb (x); b = 1, . . . , B. For any class
r; r = 1, . . . , k a prediction Ĝ(x) = r can also be written as a vector fˆ(x) of

length k which is zero everywhere, except for the rth component that equals
1. Thus,

    Ĝ(x) = arg max_{1≤s≤k} [fˆ(x)]s .

Then, the bagging estimate derived from the B bootstrap samples is given
by
    fˆbag (x) = (1/B) Σ_{b=1}^B fˆb (x),    (7.4)

where fˆb (x) is the prediction of the bth bootstrap sample (b = 1, . . . , B).
Since each of these predictions is a vector with zeros and 1 at the predicted
class, the bagged estimate fˆbag (x) = (p̂1 , . . . , p̂k ) where p̂r is the proportion
of bootstrap training samples (e.g. the proportion of trees) that predict class
r (r = 1, . . . , k) at x. These proportions can be interpreted as estimated class
probabilities and thus the predicted class at x becomes

Ĝ_bag(x) = arg max_{1≤s≤k} f̂_{bag,s}(x).

This procedure is often called majority voting because the predicted class
is the class that gets the most votes from the B bootstrap samples. Note
that for classifiers that already yield class probability estimates at x for a
single training dataset, an alternative bagging strategy is to average these
probabilities over bootstrap samples. This approach tends to give a lower
variance for the predictions.
The idea behind the bootstrap is that we randomly draw samples of
size n from the distribution P̂ , where P̂ is the empirical distribution of the
original dataset. This empirical distribution is discrete and puts mass 1/n
at each observation in the dataset. Therefore, to generate bootstrap sam-
ples, we successively draw observations from the original dataset where at
each stage, all observations have the same probability 1/n of being selected.
The bagging estimate in (7.4) is then only a Monte-Carlo approximation
of the true bagging estimate EP̂ [fˆ⋆ (x)] where fˆ⋆ (x) is the prediction at x
derived from the classification rule obtained from a random training sample
Z⋆ = {(x⋆i , yi⋆ ); i = 1, . . . , n} generated from P̂ . The Monte-Carlo estimate
approaches the true bagging estimate as B → ∞. The bagging estimate
EP̂ [fˆ⋆ (x)] will only differ from the estimate on the original training data
fˆ(x) if the method that is being applied to each of the bootstrap sam-
ples is a nonlinear or adaptive function of the training data. Otherwise,
EP̂ [fˆ⋆ (x)] = fˆP̂ (x) = fˆ(x), the prediction derived from the original training
data. Therefore, bagging is interesting for highly data dependent methods
such as trees, where the number of leaves and the variables determining the
successive splits will be different for each bootstrap tree.
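A minimal sketch of bagging trees by hand, assuming Python with numpy and scikit-learn and synthetic data generated only for illustration; in practice a ready-made implementation such as sklearn.ensemble.BaggingClassifier can be used instead.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=1)   # toy data
n, B, k = len(y), 100, 2
rng = np.random.default_rng(1)

votes = np.zeros((n, k))                       # accumulates the indicator vectors f_b(x)
for b in range(B):
    idx = rng.integers(0, n, size=n)           # bootstrap sample (drawn with replacement)
    tree = DecisionTreeClassifier(random_state=b).fit(X[idx], y[idx])
    pred = tree.predict(X)                     # predictions G_b(x) on the original data
    votes[np.arange(n), pred] += 1

f_bag = votes / B                              # estimated class proportions, as in (7.4)
G_bag = f_bag.argmax(axis=1)                   # majority vote
print("training error of the bagged classifier:", np.mean(G_bag != y))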

Note that bagging can make a good classifier better and more stable,
but bagging a bad classifier does not make it better. In fact, by making
the classifier more stable, it can become even worse. This is illustrated by
the following simple artificial example. Suppose that Y = 1 for all x (and
thus there actually is only 1 class), and the classifier Ĝ(x) randomly predicts
Y = 1 with probability 0.4 and Y = 0 with probability 0.6 (for all x). Then,
the misclassification error of this classifier is 0.6, clearly a bad classifier.
However, when bagging this classifier, at each x, approximately 60% of the
bootstrap samples will predict class Y = 0 and only 40% will predict Y = 1.
Hence, if the number of bootstrap samples, B, is large enough, the bagging
estimate will predict Y = 0 at each x, leading to a misclassification error
of 100%, which is much worse than the initial classifier on the original
training data.
By averaging models such as trees over bootstrap samples, bagging in-
creases the space of models in which an optimal solution is sought. Indeed,
the bagging classifier does not belong to the class of the initial base classi-
fier anymore, e.g. the bagging classifier is not a classification tree anymore.
However, a drawback of bagging is that if the base classifier comes from a
model with a simple structure such as trees, then this simple structure is
lost for the bagging classifier. For interpretation of the model this can be a
drawback.

7.5 Bumping
Bumping uses bootstrap samples to select a best fitting model from a random
set of model fits in model space. Hence, instead of averaging or combining
models as in bagging, bumping is a technique to find a better single model
than the one obtained from the original training data. For each of the B
bootstrap samples, we fit the model which leads to predictions fˆb (x); b =
1, . . . , B at x. We then choose the model that produces the smallest average
prediction error on the original training dataset. By convention, the original
training data set is included in the set of bootstrap samples, so that the
method is allowed to pick the solution on the original data if it has the
lowest training error. Bumping is mainly helpful for problems where the
fitting method can get stuck in local minima. The bootstrap samples are
a kind of perturbed data that can yield improved solutions if the method
on the original training data gets stuck in a poor solution (local minimum).
Note that since bumping compares the classification error of the different
fitted models on the training data, one must ensure that these models have
(roughly) the same complexity. More complex models usually fit the training

data better, so allowing for different complexity among the fitted models
would just lead to the selection of the most complex model. For example, in
the case of trees, this means that the trees grown on each of the bootstrap
samples should have (approximately) the same number of leaves.
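A sketch of bumping with trees of fixed complexity (Python with scikit-learn assumed, synthetic data for illustration): each bootstrap fit is evaluated on the original training data and the best fit is kept, with the original sample included as the first candidate.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=2)   # toy data
n, B = len(y), 50
rng = np.random.default_rng(2)

best_fit, best_err = None, np.inf
for b in range(B + 1):
    # b = 0 uses the original training data; b >= 1 uses bootstrap samples.
    idx = np.arange(n) if b == 0 else rng.integers(0, n, size=n)
    # Fix the complexity (number of leaves) so the comparison is fair.
    tree = DecisionTreeClassifier(max_leaf_nodes=8, random_state=0).fit(X[idx], y[idx])
    err = np.mean(tree.predict(X) != y)        # error on the ORIGINAL training data
    if err < best_err:
        best_fit, best_err = tree, err

print("training error of the bumped tree:", best_err)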

7.6 Boosting
Boosting is a very powerful way of combining the output of many weak
classifiers to produce a high performance committee. Consider a two-class
problem with the class variable coded as Y ∈ {−1, 1}. A classifier G pro-
duces the training misclassification error

err̄ = (1/n) ∑_{i=1}^{n} I(y_i ≠ G(x_i)).

The expected error rate on future predictions is given by EXY I(Y ̸= G(X)).
A weak classifier is a classifier whose error rate is only slightly better
than the 50% error rate obtained by random guessing. Boosting sequentially
applies the weak classification method to repeatedly modified versions of the
data which leads to a sequence of weak classifiers, Gm (x); m = 1, . . . , M .
The predictions of these M classifiers are then combined through a weighted
majority vote to produce the final prediction
G(x) = sign( ∑_{m=1}^{M} a_m G_m(x) ).        (7.5)

The weights a1 , . . . , aM of the respective predictions are computed by the


boosting algorithm. Their goal is to give higher influence (weight) to the
more accurate classifiers in the sequence. The data modifications applied in
each step of the sequence consist of giving weights w_1, . . . , w_n to each of the
training observations (xi , yi ); i = 1, . . . , n. In the first step, all weights just
equal 1/n, such that the classifier is trained on the training data as usual.
At all successive steps m (m > 1), the training observations that are mis-
classified by the current classifier Gm−1 (x), that is the classifier that resulted
from the previous step, have their weights increased while correctly classi-
fied observations get their weight decreased. Hence, as iterations proceed,
observations that are difficult to classify correctly, receive an ever-increasing
weight and thus an ever-increasing influence on the classification procedure.
Subsequent classifiers are thus forced to concentrate on those training ob-
servations that were misclassified by the previous classifiers in the sequence.
In detail, the standard boosting algorithm, called AdaBoost, works as fol-
lows. First, initialize the observation weights wi = 1/n; i = 1, . . . , n. Then,

iterate the following procedure for m going from 1 to M . Apply the classi-
fication method to the training data using the weights wi , which yields the
classifier Gm (x). Compute the training error of this classifier, given by
err̄_m = ( ∑_{i=1}^{n} w_i I(y_i ≠ G_m(x_i)) ) / ( ∑_{i=1}^{n} w_i ).        (7.6)
Compute the classifier weight a_m = log[(1 − err̄_m)/err̄_m]. Define new observation weights as

w_i = w_i exp[a_m I(y_i ≠ G_m(x_i))],   i = 1, . . . , n.        (7.7)

When the iterations are finished and the M classifiers have been obtained,
construct the final classifier through weighted majority voting as defined
in (7.5).
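The algorithm is short enough to write out; below is a sketch of AdaBoost with decision stumps (two-leaf trees) as weak classifiers, assuming Python with scikit-learn, synthetic data, and the classes coded as −1/+1.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=400, n_features=10, random_state=3)
y = 2 * y01 - 1                              # code the two classes as -1 / +1
n, M = len(y), 100

w = np.full(n, 1.0 / n)                      # step 1: initial observation weights
stumps, a = [], []
for m in range(M):
    G_m = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    miss = (G_m.predict(X) != y).astype(float)
    err_m = np.sum(w * miss) / np.sum(w)     # weighted training error (7.6)
    err_m = np.clip(err_m, 1e-10, 1 - 1e-10) # numerical safeguard
    a_m = np.log((1 - err_m) / err_m)        # classifier weight
    w = w * np.exp(a_m * miss)               # updated observation weights (7.7)
    stumps.append(G_m)
    a.append(a_m)

# Final classifier: weighted majority vote as in (7.5).
F = sum(a_m * G_m.predict(X) for a_m, G_m in zip(a, stumps))
print("training error of the boosted classifier:", np.mean(np.sign(F) != y))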
To explain why boosting is successful in improving the performance of
the base classifier on the original data, we reconsider (7.5). This expression
can be interpreted as fitting an additive expansion using a set of elementary
basis functions G1 (x), . . . , GM (x). As seen in Chapter 4, in general a linear
basis function expansion has the form

f(x) = ∑_{m=1}^{M} β_m b(x; γ_m),

where β_m; m = 1, . . . , M are the expansion coefficients and b(x; γ) are real-valued
functions of x which depend on a set of parameters γ. Usually, these
models are fit by minimizing the average loss over the training data where
the loss is measured by using some loss function L(y, f (x)) such as squared
error. This leads to the general minimization problem
min_{β_m, γ_m; m=1,...,M} ∑_{i=1}^{n} L( y_i, ∑_{m=1}^{M} β_m b(x_i; γ_m) ).        (7.8)

This joint minimization problem is often a complex, computationally inten-


sive optimization problem. However, a simple alternative can be used when
it is feasible to solve the subproblem of fitting just a single basis function,
that is,
min_{β,γ} ∑_{i=1}^{n} L( y_i, β b(x_i; γ) ).        (7.9)
Forward stagewise modeling aims to approximate the global solution
of (7.8) by sequentially adding new basis functions to the already obtained
expansion, without adjusting the parameters and coefficients of the terms
already in the expansion. Hence, the algorithm starts from the initial ex-
pansion f0 (x) = 0. For m = 1, . . . , M it then solves the problem

min_{β_m, γ_m} ∑_{i=1}^{n} L( y_i, f_{m−1}(x_i) + β_m b(x_i; γ_m) ).        (7.10)

and updates the expansion to

fm (x) = fm−1 (x) + β̂m b(x; γ̂m ) (7.11)

Consider for example squared error loss L(y, f (x)) = (y − f (x))2 , then

L(y_i, f_{m−1}(x_i) + β_m b(x_i; γ_m)) = [y_i − f_{m−1}(x_i) − β_m b(x_i; γ_m)]²
                                       = [r_{m−1,i} − β_m b(x_i; γ_m)]²,

where rm−1,i is the residual on the ith observation that remains after the
already fitted expansion. Thus for squared error loss the term βm b(x; γm )
that best fits the current residuals is added to the expansion. The param-
eters β̂m , γ̂m in each step are obtained by solving the simpler optimization
problem
min_{β_m, γ_m} ∑_{i=1}^{n} L( r_{m−1,i}, β_m b(x_i; γ_m) ),
which is of form (7.9).
For AdaBoost, the basis functions in the expansion are the sequence of
classifiers Gm (x). Using the exponential loss function

L(y, f (x)) = exp(−yf (x)) (7.12)

the minimization problem (7.10) in forward stagewise additive modeling


becomes
min_{β_m, G_m} ∑_{i=1}^{n} exp[−y_i (f_{m−1}(x_i) + β_m G_m(x_i))].        (7.13)
Because the first factor w_i^{(m)} = exp[−y_i f_{m−1}(x_i)] does not depend on the
quantities β_m and G_m over which we minimize, it can be regarded as
a weight that is applied to each observation. Since this weight depends
on fm−1 (xi ) the weight changes with each iteration of m. Hence, in each
iteration the classifier is applied to a modified, that is, weighted version of
the training data, as explained before. Note that (7.13) can be rewritten as

min_{β_m, G_m} [ exp(−β_m) ∑_{y_i = G_m(x_i)} w_i^{(m)} + exp(β_m) ∑_{y_i ≠ G_m(x_i)} w_i^{(m)} ],        (7.14)

or equivalently,

min_{β_m, G_m} [ (exp(β_m) − exp(−β_m)) ∑_{i=1}^{n} w_i^{(m)} I(y_i ≠ G_m(x_i)) + exp(−β_m) ∑_{i=1}^{n} w_i^{(m)} ].        (7.15)
The minimization problem (7.15) can be solved in two steps. First, for any
value βm > 0, the solution to (7.15) for Gm (x) is clearly given by

Ĝ_m = arg min_{G_m} ∑_{i=1}^{n} w_i^{(m)} I(y_i ≠ G_m(x_i)),        (7.16)

which is the classifier that minimizes the weighted training error. The op-
timal solution Ĝm is thus independent of the actual value of βm . Plugging
the optimal solution Ĝm into (7.14) and solving for βm yields
β̂_m = (1/2) log( (1 − err̄_m) / err̄_m ),        (7.17)

where err̄_m is the training error of classifier Ĝ_m as defined in (7.6). Using the
solutions β̂m and Ĝm given by (7.17) and (7.16), the expansion is updated
as
fm (x) = fm−1 (x) + β̂m Ĝm (x).

The weights for the next iteration then become


w_i^{(m+1)} = exp[−y_i f_m(x_i)]
            = exp[−y_i f_{m−1}(x_i)] exp[−y_i β̂_m Ĝ_m(x_i)]
            = w_i^{(m)} exp[−β̂_m y_i Ĝ_m(x_i)].

Note that −yi Ĝm (xi ) can be rewritten as

−y_i Ĝ_m(x_i) = 2 I(y_i ≠ Ĝ_m(x_i)) − 1,

which leads to
w_i^{(m+1)} = w_i^{(m)} exp[2 β̂_m I(y_i ≠ Ĝ_m(x_i))] exp(−β̂_m).        (7.18)

The factor exp(−β̂m ) is the same for all observations and can therefore be
ignored. Hence, by defining a_m = 2β̂_m the weights in (7.18) are equivalent to
the weights (7.7) used in AdaBoost. We can therefore conclude that AdaBoost
approximately solves the general minimization problem (7.8) with the exponential
loss function (7.12) by forward stagewise additive modeling.
The exponential loss function (7.12) used by AdaBoost is clearly attractive
from the computational viewpoint, as it allows the solution of the complex
minimization problem (7.8) to be approximated by solving a sequence of
simpler minimization problems (7.13) in a forward stagewise additive mod-
eling strategy. However, the exponential loss function is also a reasonable
choice from a statistical viewpoint. At the population level, the general
minimization problem (7.8) using exponential loss becomes

min E[exp(−Y f (X))|X = x],


f (x)

and it can be shown that the optimal solution at the population level is
f(x) = (1/2) log( P(Y = 1|X = x) / P(Y = −1|X = x) ).        (7.19)

Thus, the optimal solution is one-half the log-odds and AdaBoost estimates
this optimal solution. This justifies using the sign of the classifier as the
classification rule as in (7.5).
Another loss function that can be used is the binomial negative loglike-
lihood. Let
p(x) = P(Y = 1|X = x) = 1 / (1 + exp(−2f(x))),        (7.20)

which implies that f (x) is one-half the log-odds as in (7.19). Consider the
response variable Y ′ = (Y + 1)/2, that is Y ′ codes the classes as 0 and 1,
then the binomial log-likelihood is

l(Y, f (x)) = Y ′ log(p(x)) + (1 − Y ′ ) log(1 − p(x)). (7.21)

The corresponding loss function is

L(y, f (x)) = −l(y, f (x)) = log[1 + exp(−2 y f (x))]. (7.22)

Since the maximum of the log-likelihood (7.21) at the population level is ob-
tained at the true probabilities p(x) = P (Y = 1|X = x), it follows that the
population minimizer of the general problem (7.8) using the binomial nega-
tive loglikelihood loss function, that is, the minimum of E[−l(Y, f (x))|X =
x] is also one-half the log-odds. Hence, both loss functions yield the same op-
timal solution at the population level, but this is not generally true anymore
at finite-sample training datasets.
Contrary to the exponential loss function, the binomial loss function
extends easily to multiclass (k > 2) classification problems with classes
Y = 1, . . . , k. Let
p_r(x) = exp(f_r(x)) / ∑_{l=1}^{k} exp(f_l(x)),

with the constraint ∑_{l=1}^{k} f_l(x) = 0 to ensure uniqueness of the functions
fl (x); l = 1, . . . , k. The binomial negative loglikelihood loss function then
extends naturally to the k-class multinomial negative log-likelihood loss func-
tion:

L(y, f(x)) = − ∑_{r=1}^{k} I(y = r) log(p_r(x))
           = − ∑_{r=1}^{k} I(y = r) f_r(x) + log( ∑_{l=1}^{k} exp(f_l(x)) ),

where f (x) = (f1 (x), . . . , fk (x)).


When boosting trees, some tuning parameters need to be selected. A first
choice is the size of the tree that is built in each step. One strategy would be

to grow and prune each tree as explained before. However, boosting assumes
that the basic classifier is a weak classifier. Therefore, a more efficient choice
that leads to a better performance is to fix the size of the trees that are grown
in each step. The size J of a tree is its number of leaves, or equivalently
the number of regions the feature space is partitioned in. Experience has
indicated that a choice of J in the range 4 ≤ J ≤ 8 usually works well
for boosting and generally the results are fairly insensitive to the choice of
J in this range. A second choice is the value of M , the number of trees
in the additive expansion. Each additional iteration usually decreases the
training error, which implies that overfitting occurs when M is too large.
A convenient way to select the optimal value M ⋆ is to monitor prediction
error on a validation sample as a function of M when running the boosting
algorithm. The (smallest) value of M that minimizes this test error is the
optimal choice M ⋆ .
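As a sketch of this selection strategy (Python with scikit-learn assumed, synthetic data for illustration), the iterations of a gradient boosting classifier, a closely related forward stagewise boosting procedure, can be monitored on a validation sample via staged_predict, and the number of iterations with the smallest validation error is retained.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=15, random_state=4)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=4)

# Boost small trees (here at most J = 6 leaves) for a generous number of steps.
gb = GradientBoostingClassifier(n_estimators=500, max_leaf_nodes=6,
                                learning_rate=0.1, random_state=0).fit(X_tr, y_tr)

# Validation error after m = 1, ..., 500 boosting iterations.
val_err = [np.mean(pred != y_val) for pred in gb.staged_predict(X_val)]
M_star = int(np.argmin(val_err)) + 1
print("selected number of boosting iterations M*:", M_star,
      "with validation error", val_err[M_star - 1])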
Chapter 8

Nearest-Neighbor Classification

8.1 Introduction
Nearest-neighbors is a simple and model-free method. The technique is es-
sentially data driven and can be very effective. However, its lack of structure
makes it of little use for understanding the relationship between the
features and the outcome (class membership).

8.2 K-Nearest-Neighbors
Consider a training dataset {(xi , gi ); i = 1, . . . , n} where gi is the class label
of the ith observation and takes values in the set {1, . . . , k}. For a target
point x0 that needs to be classified, K-nearest-neighbors first determines
the K training points closest to x0 and then assigns x0 to the class that
is best represented among these K training points, that is, majority voting
among the K nearest neighbors of x0 is used to predict the class where x0
belongs to. If ties occur, that is, two or more classes have the same maximal
number of votes among the K nearest neighbors, then a class is assigned at
random from these candidate classes. To determine the K nearest neighbors
of x0 , that is, the K training points closest to x0 , a distance measure needs
to be specified. If all features are quantitative, then the Euclidean distance
in feature space is used:

d(x0 , xi ) = ∥xi − x0 ∥; i = 1, . . . , n.

Since it is possible (even likely) that the features are measured in different
units with different scales, it is advisable to first standardize the features
such that they all have mean zero and variance 1. Otherwise, (if features


have very different scales) the choice of the nearest neighbors would be
dominated by the features with the largest scales. If the features
are qualitative or of mixed type, the Euclidean distance can be replaced by
a proximity measure which expresses for each pair of observations how alike
they are. A proximity measure can either produce similarities that measure
resemblance or dissimilarities that measure difference between objects.
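A minimal sketch (Python with scikit-learn assumed, synthetic data for illustration): standardize the features and classify test points by majority voting among their K nearest training points.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=6, random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=5)

# Standardize features (mean 0, variance 1) before computing Euclidean distances,
# then classify by majority vote among the K = 15 nearest training points.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15))
knn.fit(X_tr, y_tr)
print("test error:", 1 - knn.score(X_te, y_te))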
K-nearest-neighbors is a very simple technique but its local nature allows
it to adapt well to local characteristics of the data which makes it well-suited
for classification problems with a very irregular decision boundary. A limiting case is
K = 1, the 1 nearest-neighbor classifier, which classifies each target point x0
in the same class as the closest training point. If the training sample is large
and fills the predictor space well, then the bias of the 1-nearest-neighbor
classifier is low, but on the other hand, its variance is high. In fact, for
a two-class classification problem, the error rate of the 1-nearest-neighbor
classifier asymptotically never exceeds twice the optimal Bayes error rate.
This can be seen as follows. Consider an arbitrary point x and let r⋆ be the
dominant class at x. Denote pr (x) the true conditional probability for class
r (r = 1, 2) at x. The Bayes rule classifies x in class r⋆ with misclassification
probability given by
Bayes error = 1 − pr⋆ (x).
Now, suppose that the training dataset fills up the predictor space in a dense
fashion. This means that for any target point x and arbitrary small distance
ϵ > 0, there exists a training sample size nϵ such that there is a training
point at distance closer than ϵ of x for training samples of size n ≥ nϵ . The
1-nearest-neighbor classifier will classify x to the same class as the nearest
training point. Asymptotically, this training point will be arbitrarily close to
x. Therefore, with probability p1 (x) this training point belongs to class 1,
and with probability p2 (x) = 1 − p1 (x) the training point belongs to class 2.
Hence, asymptotically the misclassification error of the 1-nearest-neighbor
classifier becomes

1-nearest-neighbor error = p1 (x)[1 − p1 (x)] + p2 (x)[1 − p2 (x)]


= p1 (x)[1 − p1 (x)] + (1 − p1 (x))[1 − (1 − p1 (x))]
= 2p1 (x)[1 − p1 (x)]
= 2[1 − p2 (x)]p2 (x).

Since r⋆ equals 1 or 2, we can write

1-nearest-neighbor error = 2pr⋆ (x)[1−pr⋆ (x)] ≤ 2[1−pr⋆ (x)] = 2 Bayes error.

The key issue in this result is the assumption of no bias. In practice, the
training sample is not dense in predictor space, and thus the 1-nearest-

neighbor classifier can show substantial bias in some areas of predictor space
which deteriorates its performance.
The high variance of the 1-nearest-neighbor classifier makes it often not
a good choice even when its bias is low. The number of nearest neighbors K
used by the nearest-neighbors classifier is a tuning parameter that needs to
be selected to obtain a compromise in the
bias-variance trade-off. A larger value of K will decrease the variance of the
estimator. If this decrease in variance can be combined with no or little in-
crease in bias, then clearly the larger value of K is a better choice. As usual,
the optimal value of K can be determined by comparing performance on a
validation set for different values of K. If a validation set is not available,
cross-validation can be used.
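A sketch of selecting K by 10-fold cross-validation over a grid of candidate values (Python with scikit-learn assumed, synthetic data for illustration):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=6, random_state=6)

pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
grid = GridSearchCV(pipe, {"knn__n_neighbors": [1, 3, 5, 9, 15, 25, 51]}, cv=10)
grid.fit(X, y)
print("selected K:", grid.best_params_["knn__n_neighbors"],
      "cross-validated accuracy:", grid.best_score_)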

8.2.1 Nearest-Neighbors in High Dimensions


Nearest-neighbors is based on the idea that for every target point there are
some observations in the training dataset that are close to this target point
and these points can be used to determine what value of the response (which
class) is most likely for the target point. However, this intuition and thus the
nearest-neighbors approach fails in high dimensions. This well-known
phenomenon is the curse of dimensionality.
Consider for instance, the nearest-neighbor procedure for a set of input
variables that are independently uniformly distributed on [0, 1]. The inputs
are then uniformly distributed in a d-dimensional unit hypercube which has
edges of length 1. For simplicity, let us consider hypercubical neighborhoods
(using L1 distances instead of the L2 (Euclidean) distance) around a target
point x. To capture a fraction r of the training observations in this neigh-
borhood, the volume of this neighborhood needs to be a fraction r of the
unit volume of the feature space. Since the volume r of the hypercubical
neighborhood equals ld where l is the edge length of the hypercube, it follows
that the (expected) edge length is l(r) = r^{1/d}. For example, neighborhoods
containing only 1% of the training data in 10 dimensions have an expected
length of l(0.01) = 0.63 and to capture 10% of the training data, an edge
length of l(0.1) = 0.80 is needed. Since the entire range of each variable
is only l = 1 and we must cover 63% or 80% of this range to obtain local
neighborhoods that cover 1% or 10% of the training data, it is not reasonable
anymore to consider these neighborhoods as ’local’. The training points in
such neighborhoods can not be considered to be ’close to the target point’
and thus do not indicate well what would be a reasonable value of the re-
sponse variable at the target point. Reducing the edge length l dramatically
is not a solution either because the neighborhood becomes almost empty,

leading to a high variability of the nearest-neighbor procedure.


The example illustrates the curse of dimensionality which means that all
samples are sparse in high dimensions. This implies that it is not reasonable
to assume that a sample in high dimensions is even remotely dense which
was a key assumption to obtain a low bias for the nearest-neighbor classifier.
Another illustration of the sparsity of samples in high dimensions is the
following. Consider a sample of n observations uniformly distributed in a
d-dimensional unit ball centered at the origin. Now consider a neighborhood
around the origin which is the center of the distribution. It can be shown
that the median distance from the origin to the closest observation is given
by
med distance(n, d) = ( 1 − (1/2)^{1/n} )^{1/d}.
For example, for a sample of size n = 500 in d = 10 dimensions we get
med distance(500, 10) = 0.52 which is more than half the distance from the
origin to the boundary. Hence, most data points are closer to the boundary
of the sample space than to the center of the space. Similarly, it can be
shown that points are closer to the boundary of the space than to any
other data point if the dimension is large. This indicates that prediction in
high dimension becomes very difficult because prediction near the edges of
the training data is more difficult as it requires extrapolation rather than
interpolation from surrounding training points.
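Both quantities used in these illustrations are easy to compute; a small sketch in Python:

# Expected edge length of a hypercube capturing a fraction r of uniform data in
# d dimensions, and median distance from the origin to the closest of n points
# uniformly distributed in the d-dimensional unit ball.
def edge_length(r, d):
    return r ** (1.0 / d)

def median_closest_distance(n, d):
    return (1 - 0.5 ** (1.0 / n)) ** (1.0 / d)

print(edge_length(0.01, 10), edge_length(0.10, 10))   # approx. 0.63 and 0.80
print(median_closest_distance(500, 10))               # approx. 0.52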
The sparsity of samples in high dimensions is also reflected by the em-
pirical distribution of the sample. For a sample of size n in d dimensions,
the sampling density is proportional to n^{1/d}. This implies that if, for in-
stance, a sample of size 100 is dense in 1 dimension, then a sample size of
n = 100^{10} is needed to obtain the same sampling density in 10 dimensions.
Such a sample size is not feasible. Hence, all feasible training samples in
high dimensions are sparse in the predictor space.

8.3 Adaptive Nearest-Neighbors Methods


Adaptive nearest-neighbor methods try to handle the curse of dimensional-
ity. Nearest neighbors classification is based on the assumption that the class
probabilities remain approximately constant in the neighborhood around a
target point, such that the class proportions in the neighborhood are good
estimates of the class probabilities at the target point. In standard nearest
neighbors, the neighborhoods are balls and in higher dimensions the balls
become very large making the assumption of approximately constant class
probabilities in the neighborhood not plausible. To solve this problem, one

can try to use an adaptive metric in nearest-neighbor classification. The


adaptive metric modifies the shape of the neighborhoods. The goal is to
stretch each neighborhood in directions where the class probabilities don’t
change much and to keep the neighborhood small in other directions.
The discriminant adaptive nearest-neighbor classifier (DANN) is such
an adaptive nearest-neighbor method. At each target point first a stan-
dard Euclidean distance neighborhood of size K is considered (K should be
large enough, e.g. K = 50). It is then determined how this neighborhood
is best deformed to obtain a better neighborhood where the class proba-
bilities vary less than in the original neighborhood. One way to determine
directions in which locally the class probabilities remain approximately the
same is by using the discriminant coordinates (introduced in Section 3.5)
based on the training points in the neighborhood. The first discriminant
coordinates are the directions in which the classes are best separated, and
thus the directions in which the class probabilities change the most. Hence,
the neighborhood should be stretched in directions orthogonal to the most
important discriminant coordinates, which are the directions of the least
important discriminant coordinates.
Let W be the within-class covariance matrix and B the between-class co-
variance matrix of the training points in the Euclidean neighborhood around
the target point x0 . The discriminant adaptive nearest-neighbor metric for
the neighborhood of target point x0 is then defined by

D(x, x0 )2 = (x − x0 )t Σϵ (x − x0 ), (8.1)

where

Σϵ = W −1/2 [W −1/2 BW −1/2 + ϵI]W −1/2


= W −1/2 [B ⋆ + ϵI]W −1/2 .

The covariance matrix Σϵ determining the adaptive metric seems quite com-
plicated, but can be explained as follows. First the data are sphered using
the within covariance matrix W . Note that B ⋆ is the between covari-
ance matrix in this sphered feature space. The neighborhood around the
(sphered) target point in the sphered feature space is then stretched in the
directions of B ⋆ that have small or zero eigenvalues. Indeed, the directions
with smallest eigenvalues contribute little to the distance (8.1) and thus the
neighborhood can stretch far in these directions. On the other hand, direc-
tions with large eigenvalues of B ⋆ contribute largely to the distance (8.1)
and thus the neighborhood stretches little in these directions. The factor ϵ
is added to avoid that the neighborhood can stretch infinitely in a direction

with zero eigenvalue of B ⋆ . As a result, the local neighborhood is trans-


formed from a ball into an ellipsoid. The choice of ϵ is arbitrary but should
not be too small because it would make the neighborhood too large in some
directions and thus not a local neighborhood anymore. Nor should ϵ be
too large, because then the neighborhood becomes almost a ball again. In
practice, experience has shown that the choice ϵ = 1 seems to work well.
Note that neighborhoods containing only observations from one class are
not changed. In this case, the between covariance matrix B = 0, so that
Σϵ = ϵI = I (for ϵ = 1), and the neighborhood remains the ball based
on the Euclidean distance.
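A rough sketch of the DANN idea (Python with numpy assumed; integer class labels 0, . . . , k−1 are assumed, and the neighborhood size K used to build the metric as well as the number of neighbors used in the final vote are hypothetical choices):

import numpy as np

def dann_metric(X_nb, y_nb, eps=1.0):
    """Sigma_eps of (8.1) computed from the points (X_nb, y_nb) in a local neighborhood."""
    classes, counts = np.unique(y_nb, return_counts=True)
    p = X_nb.shape[1]
    xbar = X_nb.mean(axis=0)
    W = np.zeros((p, p))                     # within-class covariance
    B = np.zeros((p, p))                     # between-class covariance
    for c, n_c in zip(classes, counts):
        Xc = X_nb[y_nb == c]
        mc = Xc.mean(axis=0)
        W += (Xc - mc).T @ (Xc - mc)
        B += n_c * np.outer(mc - xbar, mc - xbar)
    W /= len(y_nb)
    B /= len(y_nb)
    # W^{-1/2} via the eigendecomposition of W (small ridge added for stability).
    vals, vecs = np.linalg.eigh(W + 1e-8 * np.eye(p))
    W_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    B_star = W_inv_sqrt @ B @ W_inv_sqrt
    return W_inv_sqrt @ (B_star + eps * np.eye(p)) @ W_inv_sqrt

def dann_predict(x0, X, y, K=50, K_vote=5, eps=1.0):
    """Classify x0: build Sigma_eps from the K Euclidean neighbors, then take a
    majority vote among the K_vote nearest points in the adaptive metric."""
    d_euc = np.sum((X - x0) ** 2, axis=1)
    nb = np.argsort(d_euc)[:K]
    Sigma = dann_metric(X[nb], y[nb], eps)
    diff = X - x0
    d_adapt = np.einsum("ij,jk,ik->i", diff, Sigma, diff)   # (x - x0)^t Sigma (x - x0)
    votes = y[np.argsort(d_adapt)[:K_vote]]
    return np.bincount(votes).argmax()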
To avoid the curse of dimensionality when using local methods such as
nearest-neighbors it is of interest to reduce the dimension of the feature space
whenever it is possible. The performance of the method will improve a lot
if it can be applied in a well chosen subspace of the original feature space.
Such an optimal subspace can be found by computing at each training point
the between-class covariance matrix B i of its neighborhood. These matrices
are then averaged
B̄ = (1/n) ∑_{i=1}^{n} B_i.

The eigenvectors e1 , . . . , ed of this matrix B̄ corresponding to ordered eigen-


values λ1 , . . . , λd then determine the optimal subspace. The data can be
projected onto the subspace determined by the eigenvectors ej correspond-
ing to large eigenvalues λj , and the nearest neighbor classification is then
carried out in this reduced space.
Chapter 9

Model Performance and Model Selection

9.1 Introduction
A good classification method for a practical problem is a classifier that
performs well on independent test data. Assessment of this performance is
therefore extremely important in practice. Test performance can be used
to guide the choice of learning method or model. It thus can drive model
selection problems such as the choice of the tuning parameter(s). Moreover,
it is a measure of the quality of the ultimately chosen model.
In this chapter we discuss methods for the assessment of the performance
of models on test data. The focus is on estimating this performance when
an independent test set is not available.

9.2 Measuring Error


The generalization error or test error is the expected prediction error over an
independent test sample from the same population as the training sample.
Error is measured by some loss function L(G, Ĝ(X)). Hence, the test error
is given by
Test err = E[L(G, Ĝ(X))] (9.1)

where the expectation takes (population) averages over all quantities that
are random. This includes G and X but also the randomness in the training
sample that produced Ĝ. The corresponding training error is the average
loss over the training sample, given by

err̄ = (1/n) ∑_{i=1}^{n} L(g_i, Ĝ(x_i)).        (9.2)


As explained before, selecting a model is always a trade-off between bias


and variance. More complex models are able to adapt better to more com-
plex underlying structures (decreasing bias), but the estimation imprecision
increases (increasing variance). Trading-off bias and variance leads to an
optimal model complexity that gives minimum test error. Clearly, training
error is not a good estimate of the test error. More complex models adapt
better to the training data, thus training error decreases with increasing
model complexity. However, complex models with a low training error are
likely to overfit the training data and will typically generalize poorly. Hence,
the need to estimate test error.
In classification, standard choices for the loss function L(G, Ĝ(X)) are

L(G, Ĝ(X)) = I(G ≠ Ĝ(X))                                    (0–1 loss)

L(G, p̂(X)) = −2 ∑_{r=1}^{k} I(G = r) log(p̂_r(X))
           = −2 log(p̂_G(X))                                 (log-likelihood)

The 0 − 1 loss function can always be used while the log-likelihood loss
function can only be used for classification methods that produce estimates
of the class probabilities. The log-likelihood loss function can be used for
general response variables Y (quantitative or qualitative) with a density
Pθ(X) (Y ) that depends on the predictors X through some parameter θ(X).
The loss function then is

L(Y, θ(X)) = −2 log[Pθ(X) (Y )].

The 2 is added in the definition of the loss function to make it equal to


squared error loss when Y has a Gaussian distribution.
If we are working in a problem setting with a huge amount of data, then
the best approach to building, validating and testing models, is to randomly
divide the full dataset into three parts: a training set to fit the models; a
validation set to estimate prediction error and select the optimal model based
on these prediction errors; a test set to measure the generalization error of
the final selected model. It is important to have a separate test set that is
used to measure the performance of the final selected model. If there is no
test set and the test error of the final model on the validation set is taken as
performance measure, then this validation error will be an underestimation
of the true test error. This underestimation can be substantial when a large
number of models are compared on the validation set with the final model
being the one that minimizes validation error. There are no general rules
for the choice of the relative sample sizes of the training, validation and test

part of the full dataset. A typical choice is to take 50% of the data for the
training part and 25% for both the validation and test parts.
However, in many cases the full dataset is not large enough to split it
into three independent parts. We discuss in the next sections how model
selection and estimation of the test error can be performed in such cases. The
methods that approximate the validation step fall into two classes: those
that use an analytical approximation and those that re-use the training data.

9.3 In-Sample Error


As explained in the previous section, the training error rate given in (9.2)
underestimates the true error (9.1) because the error is measured on the
same (training) data that were used to fit the model. Hence, the training
error is an overly optimistic estimate of the generalization error.
Let us now focus on the in-sample error, which is the error of a prediction
method if the feature part of the training points is kept fixed and only the
response values are considered random, i.e.
Error_in = (1/n) ∑_{i=1}^{n} E_y[ E_{Y^new}[L(Y_i^new, Ĝ(x_i))] ].        (9.3)

Note that the first expectation is with respect to the randomness of the
responses in the training data while the second expectation averages over
all possible vectors Y new of n new responses. The optimism is now defined as
the expected difference between the in-sample error and the training error:

op = Error_in − E_y(err̄),        (9.4)

which will (usually) be positive because the training error underestimates


the prediction error and thus has a negative bias.
For two-class classification, it can be shown quite generally that
op = (2/n) ∑_{i=1}^{n} Cov(ŷ_i, y_i),        (9.5)

where yi and ŷi are the true and predicted classes at xi for the 0 − 1 loss
function while for log-likelihood loss yi and ŷi are the true and predicted
class 1 probabilities at x_i. The optimism, that is, the amount by which the
training error underestimates the true prediction error, depends on the
covariance between y_i and its prediction ŷ_i. Hence, the more strongly y_i affects
its own prediction, the larger the optimism.
Using (9.5) in (9.4) yields the relation

Error_in = E_y(err̄) + (2/n) ∑_{i=1}^{n} Cov(ŷ_i, y_i)        (9.6)

for two-class classification. In the next section we discuss some methods that
(analytically) estimate the optimism, which is then added to the training
error to obtain an estimate of the in-sample error. In-sample error usually
is not of direct interest because future observations in feature space are not
likely to coincide with training observations. However, for comparison of
models, in-sample error is sufficient because only the relative size of the
prediction error is needed.

9.4 Estimating Optimism

The Akaike information criterion (AIC) is an estimate of the in-sample error


when a log-likelihood loss function is used. It can be shown that for linear
models it holds asymptotically (as n → ∞) that
−2 E[log(P_θ̂(Y))] ≈ −(2/n) E[ ∑_{i=1}^{n} log(P_θ̂(y_i)) ] + 2 d/n,

where d is the number of parameters that are fit and θ̂ is the maximum-
likelihood estimate of θ. The first term on the right side can be estimated
by the training error err̄ (that is, the expectation is dropped). The second
term on the right is the estimate of the optimism. Given a set of models
f(x; α), where the index α is e.g. a tuning parameter, denote by err̄(α) and
d(α) the training error and number of parameters for each model. Then AIC
is given by

AIC(α) = err̄(α) + 2 d(α)/n.
We now choose the model giving the smallest value of AIC among the set of
models. For nonlinear and other adaptive, complex models, the number of
parameters d needs to be replaced by some measure of model complexity
such as the effective degrees of freedom.
The Bayesian information criterion (BIC) or Schwarz criterion re-
places the constant 2 in the second term of AIC by log(n), that is,

BIC = err̄ + log(n) d/n.

BIC can be motivated from a Bayesian approach to model selection which


is beyond the scope of this course. The effect of BIC is that more com-
plex models are penalized more heavily than with AIC, which leads to the
selection of simpler models.
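As a sketch (Python with scikit-learn assumed, synthetic data for illustration): for a family of essentially unpenalized logistic regression models indexed by the polynomial degree of the features, err̄(α) is the average of −2 log p̂_G over the training data and d(α) the number of fitted parameters.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = make_classification(n_samples=300, n_features=4, random_state=7)
n = len(y)

for degree in (1, 2, 3):
    model = make_pipeline(PolynomialFeatures(degree, include_bias=False),
                          StandardScaler(),
                          LogisticRegression(C=1e6, max_iter=5000))  # essentially unpenalized
    model.fit(X, y)
    err_bar = 2 * log_loss(y, model.predict_proba(X))   # average of -2 log p_hat_G
    d = model[-1].coef_.size + 1                        # fitted coefficients plus intercept
    aic = err_bar + 2 * d / n
    bic = err_bar + np.log(n) * d / n
    print(f"degree {degree}: d = {d}, AIC = {aic:.3f}, BIC = {bic:.3f}")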

9.5 Cross-validation
Cross-validation is the most widely used method to estimate prediction er-
ror. It directly estimates the generalization error (9.1), that is the prediction
error on independent test data. K-fold cross-validation splits the data at
random into K roughly equal parts. For each part j (j = 1, . . . , K), of the
data, the model is fit to the other K − 1 parts of the data, which leads to
a fit fˆ−j (x) where the −j indicates that part j of the data was removed
during the fitting process. We then calculate the prediction error of this
model fˆ−j (x) when predicting the outcome for the observations in the left-
out part (part j) of the data. The prediction errors obtained from the K
parts are then combined. To summarize, we can define an indexing function
κ : {1, . . . , n} → {1, . . . , K} that indicates for each training observation to
which of the K parts it belongs. Then the K-fold cross-validation estimate
of the test error is

CV-error = (1/n) ∑_{i=1}^{n} L(y_i, f̂^{−κ(i)}(x_i)).

Standard choices of K are 5 or 10. The case K = n is known as leave-one-out


cross-validation. In this case the training dataset is split into n parts, each
containing only 1 observation. Hence, the fit used to predict the outcome of
any observation i is based on all training observations except the ith.
Model selection based on K-fold cross-validation is quite straightforward.
Given a set of models f (x; α) where α is a tuning parameter as before, for
each of these models we compute the K-fold cross-validation error estimate

CV-error(α) = (1/n) ∑_{i=1}^{n} L(y_i, f̂^{−κ(i)}(x_i; α)).        (9.7)

We then select the optimal value α̂ that minimizes CV-error(α). However,


since we have a prediction error estimate for each of the K parts, these
individual estimates can be used to calculate a standard error for the mean
prediction error (that is, the overall prediction error (9.7)). Often, a ’one-
standard-error’ rule is advocated for cross-validation: choose the most
parsimonious model whose cross-validation prediction error is at most
one standard error above that of the model with the lowest prediction
error. The chosen model f(x; α̂) is then fit to all the data.
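A sketch of this selection procedure (Python with scikit-learn assumed, synthetic data for illustration), using classification trees indexed by their maximum number of leaves as the candidate models:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=8)
alphas = [2, 4, 8, 16, 32, 64]          # candidate models: maximum number of leaves
K = 10
cv = StratifiedKFold(n_splits=K, shuffle=True, random_state=8)

fold_err = np.zeros((len(alphas), K))   # misclassification error per model and fold
for j, (tr, te) in enumerate(cv.split(X, y)):
    for a_idx, a in enumerate(alphas):
        tree = DecisionTreeClassifier(max_leaf_nodes=a, random_state=0).fit(X[tr], y[tr])
        fold_err[a_idx, j] = np.mean(tree.predict(X[te]) != y[te])

cv_err = fold_err.mean(axis=1)
se = fold_err.std(axis=1, ddof=1) / np.sqrt(K)   # standard error of each CV error

best = int(np.argmin(cv_err))
# One-standard-error rule: most parsimonious model within one SE of the minimum.
threshold = cv_err[best] + se[best]
chosen = min(i for i in range(len(alphas)) if cv_err[i] <= threshold)
print("minimum-CV model:", alphas[best], " chosen by the 1-SE rule:", alphas[chosen])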
The choice of K in K-fold cross-validation is again a trade-off between
bias and variance. A low value of K will make the size of the training
datasets used in the cross-validation procedure much smaller than the size of
the complete training dataset. Hence, the cross-validation training datasets

are more sparse in predictor space. As a result the predictions in the cross-
validation procedure can be biased depending on how the performance of
the classifier changes with sample size. This bias leads to less accurate pre-
dictions and hence an over-estimation of the prediction error. On the other
hand, a large value of K (e.g. K = n) makes the cross-validation procedure
approximately unbiased for the true prediction error but the procedure be-
comes highly variable due to its stronger dependence on the training data
set (the n training sets in the cross-validation are very similar to one an-
other). Also, the computational burden can be heavy if no update rules can
be used. Overall, 5 or 10 fold cross-validation can be recommended as a
good compromise.

9.6 Bootstrap methods


Using bootstrap to estimate the prediction error on independent test samples
is not as obvious as it may seem at first sight. The standard approach would
be to fit the model on a set of bootstrap samples and keep track of how well
each of these fitted models predict the original training dataset. If we denote
by fˆb (xi ) the predicted value at xi according to the model fitted to the bth
bootstrap sample, then based on B bootstrap samples the estimate of the
prediction error becomes

Error_boot = (1/B)(1/n) ∑_{b=1}^{B} ∑_{i=1}^{n} L(y_i, f̂_b(x_i)).

However, in general Errorboot does not provide a good estimate of the true
prediction error. The reason is that each bootstrap sample has observations
in common with the original training sample that is being used as the test
set. If a method is overfitting, then it will also overfit in the bootstrap
samples, leading to unrealistically good predictions for all observations in the
training sample (acting as the test sample) that also belong to the bootstrap
sample. Cross-validation explicitly avoided this problem by splitting the
data in non-overlapping training and test samples.
As an illustration of the overly optimistic character of this bootstrap
estimate, consider a 1-nearest-neighbor classifier that is being applied to a
two-class classification problem. If both classes contain the same number
of observations and the class labels are independent of the features, then
the true error rate is 0.5 (50%). Now consider the predictions for the ob-
servations in the original training dataset based on the 1-nearest-neighbor
classifier trained on a bootstrap sample. For training observations that be-
long to the bootstrap sample, the error will be zero because the nearest

neighbor in the bootstrap sample is the observation itself and hence its class
is predicted correctly. If the training observation does not belong to the
bootstrap sample, then the predicted class will be the class of the near-
est neighbor in the bootstrap sample. Due to the randomness in assigning
class labels, averaging over the bootstrap samples, the probability that this
nearest neighbor belongs to either class is the same, that is, 50%. Since boot-
strap samples are independently drawn with replacement from the original
training sample and each observation has the same probability 1/n of being
selected, we can calculate the probability that an observation i belongs to a
bootstrap sample b:
P(observation i ∈ bootstrap sample b) = 1 − (1 − 1/n)^n
                                      → 1 − 1/e
                                      ≈ 0.632.        (9.8)

Therefore, the expectation of Errorboot becomes

E[Errorboot ] = 0(0.632) + 0.5(1 − 0.632) = 0.184,

which is far below the true prediction error of 0.5.


To overcome the underestimation, one can restrict the average to pre-
dictions for training observations that do not belong to the bootstrap sample
(the out-of-bag observations). The out-of-bag bootstrap estimate of the pre-
diction error becomes
Error_out-of-bag = (1/n) ∑_{i=1}^{n} (1/|C^{−i}|) ∑_{b∈C^{−i}} L(y_i, f̂_b(x_i)).

Here, for every training observation i, the set C −i ⊂ {1, . . . , B} is the set
with indices of bootstrap samples that do not contain the ith training ob-
servation. |C −i | is the number of bootstrap samples not containing the ith
observation. If we take the number of bootstrap samples B large enough,
then all C −i will be nonempty. If empty sets C −i occur, we can leave these
terms out of the summation.
The out-of-bag bootstrap estimate of the prediction error solves the over-
fitting problem with Error_boot, but it suffers from a bias similar to that of K-fold
cross-validation with a low value of K. From (9.8) it follows that the av-
erage number of distinct observations in each bootstrap sample is 0.632 n,
hence the effective sample size of the bootstrap samples is about 60% of
the original sample size n. Therefore, the behavior of the bootstrap esti-
mate of the prediction error is similar to the behavior of 2-fold/3-fold cross-
validation. When such a bias occurs, the underestimation of prediction error
by Errorboot turns into an overestimation of this error by Errorout-of-bag .

The .632 bootstrap estimator of prediction error tries to avoid the prob-
lems of the above proposals. It corrects for the possible underestimation of
prediction error by Errorboot due to overfitting, but at the same time ad-
dresses the possible bias of the Errorout-of-bag estimator. The .632 bootstrap
estimator is defined by

Error_.632 = 0.368 err̄ + 0.632 Error_out-of-bag.

Intuitively, it pulls the Error_out-of-bag estimate down toward the (underestimating)
training error rate.
The .632 bootstrap estimator can still seriously underestimate the predic-
tion error rate in heavily overfitting situations. Consider again the two-class
problem with equal class sizes and labels assigned at random. The training
error rate of the 1-nearest-neighbor classifier is err̄ = 0 and the out-of-bag
bootstrap estimate of the prediction error is Error_out-of-bag = 0.5 (which is thus
unbiased in this case), so

Error.632 = 0.368(0) + 0.632(0.5) = 0.316,

which is much lower than the true prediction error of 0.5.


To handle this problem, the .632 bootstrap estimator can be improved
further by taking the amount of overfitting into account. First, we need to
estimate the no-information error rate γ which is the training error rate
of the classifier if the class labels were independent of the features. An
estimate of the no-information error rate is obtained by considering all com-
binations of outcomes yi and predictors xi′ and measuring the error rate
when evaluating the prediction rule on all these combinations:

γ̂ = (1/n²) ∑_{i=1}^{n} ∑_{i′=1}^{n} L(y_i, Ĝ(x_{i′})).

Based on this estimate of the no-information error rate, the relative overfit-
ting rate is defined as
R̂ = (Error_out-of-bag − err̄) / (γ̂ − err̄).
Clearly, R̂ ∈ [0, 1] with R̂ = 0 when there is no overfitting (err̄ = Error_out-of-bag)
and R̂ = 1 if the overfitting equals γ̂ − err̄, that is, if the overfitting is so large
that the performance of the classifier is as bad as the random assignment
rule. The .632+ bootstrap estimator is now defined by

Error_.632+ = (1 − ŵ) err̄ + ŵ Error_out-of-bag,

where

ŵ = 0.632 / (1 − 0.368 R̂).

The weight ŵ ranges between 0.632 and 1. The weight equals 0.632 if R̂ =
0, that is, when there is no overfitting, and in this case Error.632+ equals
Error.632 since no improvement was necessary. The weight becomes 1 if
R̂ = 1, that is, when there is large overfitting, and in this case Error.632+
equals Errorout-of-bag , so no bias correction is made.
To illustrate that the .632+ bootstrap estimator solves the underesti-
mation of the prediction error rate in heavily overfitting situations, we look
again at the 1-nearest-neighbor classifier for the two-class problem with equal
class sizes and labels assigned at random. In this case we clearly have that
γ̂ = 0.5, and since also Errorout-of-bag = 0.5 as shown above, we obtain that
R̂ = 1, and hence, ŵ = 1 which implies that Error.632+ = Errorout-of-bag =
0.5.
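The estimators of this section can be computed directly from their definitions; a sketch (Python with numpy and scikit-learn assumed, synthetic data for illustration, using a fully grown tree as the classifier):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=9)
n, B = len(y), 200
rng = np.random.default_rng(9)

def fit_tree(Xb, yb):
    return DecisionTreeClassifier(random_state=0).fit(Xb, yb)

# Training error of the classifier fit to the full training sample.
err_bar = np.mean(fit_tree(X, y).predict(X) != y)

# Out-of-bag bootstrap estimate of the prediction error.
loss_sum, oob_count = np.zeros(n), np.zeros(n)
for b in range(B):
    idx = rng.integers(0, n, size=n)                  # bootstrap sample
    oob = np.setdiff1d(np.arange(n), idx)             # observations left out of sample b
    fb = fit_tree(X[idx], y[idx])
    loss_sum[oob] += (fb.predict(X[oob]) != y[oob])
    oob_count[oob] += 1
keep = oob_count > 0                                  # with B large enough, all True
err_oob = np.mean(loss_sum[keep] / oob_count[keep])

# No-information error rate: all combinations of y_i and x_{i'}.
pred_full = fit_tree(X, y).predict(X)
gamma_hat = np.mean(y[:, None] != pred_full[None, :])

# Relative overfitting rate (clipped to [0, 1] for numerical safety) and .632+.
R_hat = np.clip((err_oob - err_bar) / (gamma_hat - err_bar), 0.0, 1.0)
w_hat = 0.632 / (1 - 0.368 * R_hat)
err_632plus = (1 - w_hat) * err_bar + w_hat * err_oob
print("err_bar:", err_bar, " out-of-bag:", err_oob, " .632+:", err_632plus)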
Part II

Regression

Chapter 10

Problem setting

As explained before, in regression we consider problems with a quantitative


outcome variable. The outcome variable is denoted by Y and can take any
value in a certain range. In a regression problem the goal is to find good
estimates Ŷ of the output Y. The predictions Ŷ should take values in the
same range as Y . Ideally, the method works well on the whole distribution of
outcomes and features. In practice, these predictions can be considered both
for the training data, expressing the goodness-of-fit quality of the method,
and for new objects, expressing the prediction capacity of the method.
Given any vector of features x, we would like to find the prediction rule
that is as accurate as possible, or more generally, gives the smallest loss. As
for classification, we need to define a loss function L(Y, f (X)) that measures
the price paid for errors in predicting Y by f (X). The most common and
convenient choice for the loss function is squared error loss

L(Y, f (X)) = (Y − f (X))2 (10.1)

When a loss function is chosen, this leads immediately to a criterion for


choosing the optimal solution for f , which is found by minimizing the ex-
pected prediction error (EPE):

EPE(f ) = E[L(Y, f (X))],

where the expectation is w.r.t. the joint distribution of (X, Y ). For squared
error loss this becomes

EPE(f ) = E[(Y − f (X))2 ].

By conditioning on X, this can be rewritten as


EPE(f) = E_X[ E_{Y|X}[(Y − f(X))² | X] ].


It follows that (as in the classification case) it suffices to minimize EPE


pointwise, that is

f(x) = arg min_c E_{Y|X}[(Y − c)² | X = x].        (10.2)

By taking the derivative w.r.t. c, it can be seen that the solution to (10.2)
is
fˆ(x) = EY |X [Y |X = x], (10.3)

that is, the conditional expectation of Y at X = x. Hence, when we use av-


erage squared error as criterion to find the best prediction of Y , the solution
is the conditional mean of Y at each X = x.
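For completeness, the differentiation step behind (10.3):

d/dc E_{Y|X}[(Y − c)² | X = x] = −2 ( E_{Y|X}[Y | X = x] − c ),

which equals zero exactly when c = E_{Y|X}[Y | X = x]; since the second derivative equals 2 > 0, this stationary point is indeed the minimum.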
Alternatives to the squared error loss function in (10.1) are possible. For
example, the L1 loss function

L(Y, f (X)) = |Y − f (X)|

can be used. In this case it can be shown that the optimal solution becomes

fˆ(x) = median(Y |X = x).

This is a different measure for the location of the conditional distribution


Y |X = x which is more robust than the conditional mean. However, the
L1 loss function and other more robust loss functions have discontinuities
and often also local minima, which makes them less attractive. Hence, we
usually use the popular squared error loss which is analytically the most
convenient choice.
Our goal is to find a good approximation fˆ(x) to the true function f (x)
that determines the predictive relationship between the inputs and the out-
puts. When using squared error loss, we assume that the regression function
is given by f (x) = EY |X [Y |X = x]. Usually we will select the optimal ap-
proximation fˆ(x) from a class of models.
A general statistical model is the additive error model which assumes
that the data follow the model

Y = f (X) + ϵ (10.4)

where the random error ϵ has E(ϵ) = 0 and is independent of X. For


this model f (x) = EY |X [Y |X = x] and if the errors are independent and
identically distributed then the conditional distribution of Y |X depends
on X only through the conditional mean f (x). The additive error model
is a useful approximation to the truth. It assumes that we can capture all
departures from the (deterministic) relationship Y = f (X) via the (additive)
error ϵ.

Using squared error loss in the general additive error model leads to
minimizing

RSS(f) = ∑_{i=1}^{n} (y_i − f(x_i))²        (10.5)

over all possible functions f , which yields infinitely many solutions. Any
solution fˆ(x) that passes through the n training data points can be chosen.
However, most of these solutions will perform poorly on independent test
data, leading to poor predictions at the test points. Hence, in order to
obtain useful solutions we must restrict the minimization in (10.5) to a
smaller class of functions. The choice of restrictions is based on the nature
of the problem or on theoretical considerations, but not driven by the data.
Any set of appropriate restrictions imposed on the class from which fˆ can be
selected will lead to a (unique) solution. However, different restrictions may
(and often will) lead to different solutions. The constraints imposed by most
learning methods can be interpreted as complexity restrictions that require
some kind of local regular (simple) behavior. By comparing the solutions
obtained from different learning methods, an optimal prediction procedure
can be selected.
Chapter 11

Linear Regression Methods

11.1 Introduction
Linear regression models assume that the regression function E(Y |X) de-
pends linearly on the predictors X1 , . . . , Xd . Such models are simple and in
many cases provide a reasonable and well-interpretable description of the ef-
fect of the features on the outcome. In some situations, simple linear models
can even outperform fancier nonlinear models, such as in cases with a small
number of training data, cases with very noisy data such that the trend (the
signal) is hard to detect (low signal to noise ratio) or in cases with sparse
data (high-dimensional data problems). We will start the discussion of lin-
ear models with standard least squares estimation and then continue with
shrinkage methods that allow for bias if this leads to a substantial decrease
in variance and thus yields a more stable model. As an alternative to shrink-
age methods, linear least squares regression can be applied to a (limited) set
of derived input directions. This approach is discussed in the last sections.

11.2 Least squares regression


The standard linear regression model has the form


f(X) = β_0 + ∑_{j=1}^{d} β_j X_j.        (11.1)

Hence, the linear model assumes that the regression function E(Y |X) can be
approximated reasonably well by the linear form (11.1). Note that the model
is linear in the unknown parameters β0 , . . . , βd . The predictor variables
X1 , . . . , Xd can be measured features, transformations of features, polyno-
mial representations or other basis expansions, dummy variables, interac-
tions, etc.


The unknown parameter vector β = (β0 , . . . , βd ) needs to be estimated


from the training sample {(xi , yi ); i = 1, . . . , n}. The standard method is
the least squares estimator which minimizes


RSS(β) = ∑_{i=1}^{n} (y_i − f(x_i))²
       = ∑_{i=1}^{n} ( y_i − β_0 − ∑_{j=1}^{d} β_j x_{ij} )²
       = (y − Xβ)^t (y − Xβ),        (11.2)

where X is the feature data matrix which starts with a column of ones to
include the intercept in the model. Differentiating (11.2) with respect to β
and setting the result equal to zero yields the estimating equation

Xt (y − Xβ) = 0 (11.3)

with solution
β̂ = (Xt X)−1 Xt y (11.4)

The fitted values for the training data are given by

ŷ = Xβ̂
= X(Xt X)−1 Xt y
= Hy,

with H the hat matrix. The predicted value at a new input vector x0 is
obtained by ŷ0 = fˆ(x0 ) = (1, xt0 )β̂.
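A minimal numerical sketch (Python with numpy assumed, synthetic data for illustration): build the design matrix with a leading column of ones, solve the normal equations (11.3), and compute fitted values and a prediction at a new input. In practice a numerically more stable routine such as numpy.linalg.lstsq (SVD-based) is preferable to forming X^t X explicitly.

import numpy as np

rng = np.random.default_rng(10)
n, d = 100, 3
X_feat = rng.normal(size=(n, d))                        # hypothetical features
y = 2 + X_feat @ np.array([1.0, -0.5, 0.0]) + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), X_feat])               # design matrix with intercept column
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)            # solution (11.4) of the normal equations
y_hat = X @ beta_hat                                    # fitted values
sigma2_hat = np.sum((y - y_hat) ** 2) / (n - (d + 1))   # unbiased estimate of the error variance

x0 = np.array([1.0, 0.2, -1.0, 0.5])                    # new input vector (with leading 1)
y0_hat = x0 @ beta_hat
print(beta_hat, sigma2_hat, y0_hat)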
Least squares minimizes RSS(β) = ∥y − Xβ∥2 which implies that β̂ is
determined such that ŷ is the orthogonal projection of n-dimensional vector
y onto the subspace spanned by the columns x0 , . . . , xd of the matrix X,
called the column space of X. Hence, the residual vector y − ŷ is orthogonal
to this subspace as can be seen in (11.3). The hat matrix H computes
the orthogonal projection of y onto the column space of X and thus is a
projection matrix as explained before.
It can easily be shown that the variance-covariance matrix of β̂ is given
by
Var(β̂) = (Xt X)−1 σ 2 ,

where the error variance σ 2 is usually estimated by

σ̂² = (1/(n − (d + 1))) ∑_{i=1}^{n} (y_i − ŷ_i)²

which is an unbiased estimator of σ 2 . To obtain inference about the regres-


sion model and its parameters, we need to make additional assumptions.
We need to assume that the linear model (11.1) is the correct model for
the conditional mean E(Y |X) and that the additive errors in (10.4) have
a Gaussian distribution with mean zero and constant variance σ 2 , that is,
ϵ ∼ N (0, σ 2 ). Using this assumption, hypothesis tests and confidence inter-
vals for the regression parameters can be constructed, as well as F-tests to
check for significant differences between nested models.
The Gauss-Markov theorem is an important result for least squares re-
gression. It shows that the least squares estimator has the smallest variance
among all unbiased estimators that can be written as a linear transformation
of the response y in the following sense. Consider any linear combination
θ = at β with a ∈ Rd+1 , then the least squares estimator of θ is

θ̂ = at β̂ = [at (Xt X)−1 Xt ]y,

which is a linear transformation of y. Clearly, at β̂ is an unbiased estimator


of θ. Now consider any other linear, unbiased estimator of θ, that is, an
estimator of θ in the form θ̃ = ct y with c ∈ Rn such that E[θ̃] = θ. Then it
holds that,

Var(θ̂) ≤ Var(θ̃).

When using the squared error loss function (10.1) to measure prediction
accuracy, we focus on the mean squared error of estimators. Consider the
prediction of the response at a point x0 ,

Y0 = f (x0 ) + ϵ0 = xt0 β + ϵ0 .

Then the expected prediction error of an estimate f˜(x0 ) = xt0 β̃ is

E[(Y0 − f˜(x0 ))2 ] = E[((Y0 − xt0 β) + (xt0 β − xt0 β̃))2 ]


= E[(Y0 − xt0 β)2 ] + E[[xt0 (β − β̃)]2 ]
= E[ϵ20 ] + MSE(f˜(x0 ))
= σ 2 + MSE(f˜(x0 )),

where MSE(f˜(x0 )) = E[[xt0 (β − β̃)]2 ] is the mean squared error of the pre-
diction at x0 by the estimator f˜. Therefore, expected prediction error and
mean squared error of a prediction differ only by the constant σ 2 , represent-
ing the irreducible variance of the new observation y0 around its expected

value. Now consider the mean squared error of the prediction:

MSE(f˜(x0 )) = E[[xt0 (β − β̃)]2 ]


= E[[xt0 (β − E(β̃) + E(β̃) − β̃)]2 ]
= [xt0 β − E(xt0 β̃)]2 + E[(E(xt0 β̃) − xt0 β̃)2 ]
= bias(f˜(x0 ))2 + Var(f˜(x0 ))

The unbiasedness of the least squares estimator makes the first term of
the mean squared error zero. Since the prediction f˜(x0 ) = xt0 β̃ is a lin-
ear combination of β̃, the Gauss-Markov theorem ensures that the least
squares estimator is optimal w.r.t. variance within an important class of
unbiased estimators. That is, the least squares estimator has the smallest
mean squared error among all linear unbiased estimators. However, this
optimal unbiased estimator can still be highly unstable. This can happen in
situations with (many) candidate predictors among which only a few have
a significant relation with the outcome. The large fraction of noise vari-
ables then causes a large variability of the least squares estimates. Also
the presence of highly correlated predictors (multicollinearity) will make an
unbiased estimator such as least squares highly variable.
To obtain a more stable, better performing solution, we can look for
biased estimators that lead to a smaller mean squared error, and thus to
better predictions. These biased estimators thus trade a little bias for a
(much) larger reduction in variance. From a broader viewpoint, almost all
models are only an approximation of the true mechanism that relates the
features to the outcome, and hence these models are biased. Therefore,
allowing for a little extra bias when selecting the optimal model from the
class of models considered is a small price to achieve a more stable, better
performing model. As always, picking the right model means trying to find
the right balance between bias and variance.
A low prediction accuracy due to high variability is a first reason why
the least squares estimator may not be satisfactory. A second reason is the
possible loss of interpretability. With a large number of predictors it is
often of interest to determine a small subset of predictors that exhibit the
strongest influence on the outcome. The high variability of least squares
complicates the selection of these predictors. A biased estimator obtained
by shrinkage will be much more successful. With least squares, we can
take the subset selection approach to determine important variables or to
find stable submodels. Note that after subset selection we usually end up
with a biased regression model, a fact that is often ignored in regression
analysis. Several selection strategies exist such as all-subsets, backward
elimination, or forward selection. Different selection criteria can be used

to select the optimal model in these procedures. For instance, minimizing


prediction error on a test set or through cross-validation can be used. Note
that these selection methods are discrete in the sense that predictors are
either completely in or completely out of the model. This discreteness results
in a high variance of the selection process. Shrinkage methods discussed next
are a more continuous alternative to subset selection which usually have a
lower variance.

11.3 Ridge Regression


Ridge regression is one of the oldest shrinkage methods. It shrinks the
regression coefficients (slopes) by imposing a penalty on their overall size.
The ridge regression coefficients are obtained by minimizing a penalized
residual sum of squares

β̂ridge = arg min_β { ∑_{i=1}^{n} (yi − β0 − ∑_{j=1}^{d} βj xij)² + λ ∑_{j=1}^{d} βj² }

       = arg min_β { (y − Xβ)t (y − Xβ) + λ βt β }.    (11.5)

λ is a tuning parameter controlling the size of the penalty. The larger


λ, the larger the weight of the penalty term and thus the more shrinkage
that is imposed. The limit case λ = 0 means no shrinkage and just yields
the (unbiased) least squares solution. Increasing λ increases the bias by
shrinking the coefficients towards zero.
An equivalent formulation of the ridge regression problem is

β̂ridge = arg min_β ∑_{i=1}^{n} (yi − β0 − ∑_{j=1}^{d} βj xij)²    (11.6)

subject to ∥β∥₂² = ∑_{j=1}^{d} βj² ≤ s,    (11.7)

which is a formulation as a constrained optimization problem similar to sup-


port vector machines in Chapter 6. This formulation more explicitly shows
the size constraint on the regression coefficients. When the unbiased least
squares coefficients have a high variability as in the case of correlated pre-
dictors, then a large positive coefficient on one predictor can be canceled by
a large negative coefficient on another highly correlated predictor. The size
constraint (11.7) in ridge regression prevents this effect from occurring. As
with all penalized methods, ridge regression is not equivariant under scaling
and centering of the predictors. Therefore, it is advised to standardize the

predictors (to mean zero and variance 1) when fitting the ridge regression
model.
From (11.5) it can easily be seen that the ridge regression solution is
given by
β̂ridge = (Xt X + λI)−1 Xt y. (11.8)

Hence, the ridge regression solution is a linear function of the observed


outcome y. The solution is similar to the least squares solution (11.4) but
adds a constant to the diagonal of Xt X before inversion. This constant
makes the matrix nonsingular even if Xt X itself is singular. This means
that ridge regression can be used even when there are more predictors than
observations or when there exists linear dependence between the predictor
variables. From (11.8) it follows that the fitted responses are obtained by

ŷ = Xβ̂ridge = [X(Xt X + λI)−1 Xt ]y = S_λ^ridge y,

where S_λ^ridge is the ridge regression linear operator. The tuning parameter λ can therefore be determined by fixing the effective degrees of freedom of the ridge regression solution, where the effective degrees of freedom is given by trace(S_λ^ridge).
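As an illustration, a small numpy sketch of these formulas on standardized, simulated data (the data and the value of λ are arbitrary); it also checks that the effective degrees of freedom can equivalently be computed from the singular values γj as ∑ γj²/(γj² + λ), which follows from the derivation below.

    import numpy as np

    rng = np.random.default_rng(1)
    n, d = 60, 5
    X = rng.normal(size=(n, d))
    X = (X - X.mean(axis=0)) / X.std(axis=0)            # standardized predictors
    y = X @ rng.normal(size=d) + rng.normal(size=n)
    y = y - y.mean()                                     # centered response

    lam = 10.0
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)   # (11.8)

    S_ridge = X @ np.linalg.inv(X.T @ X + lam * np.eye(d)) @ X.T       # linear operator
    edf = np.trace(S_ridge)                              # effective degrees of freedom

    gamma = np.linalg.svd(X, compute_uv=False)           # singular values of X
    print(np.isclose(edf, np.sum(gamma**2 / (gamma**2 + lam))))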
To further explain the behavior of ridge regression we now make some
derivations similar to those in Section 4.4.1 for smoothing splines. We as-
sume that both the response and the predictors have been centered (and
scaled) such that the n × d matrix X is the centered feature matrix. Note
that the centering causes the intercept to be zero, so there is one parameter
less. The singular value decomposition (SVD) of the centered feature matrix
X is
X = U DV t (11.9)

where D is a d × d diagonal matrix with diagonal entries γ1 ≥ γ2 ≥ · · · ≥


γd ≥ 0 which are called the singular values of X. The number of singular
values strictly larger than zero corresponds to the rank of X. The matrix U
is an n×d orthogonal matrix and V is a d×d orthogonal matrix. Note that
an m × d matrix P is called orthogonal if P t P = I with I the d-dimensional
identity matrix. This means that the d columns of the orthogonal matrix
are an orthonormal basis for a d-dimensional space. In the singular value
decomposition the columns of U (corresponding to non-zero singular values)
are a basis for the space spanned by the columns of X. Similarly, the columns
of V (corresponding to non-zero singular values) are a basis for the subspace

in Rd spanned by the rows of X. For the special case of a d × d matrix V ,


orthogonality of the matrix implies that V t = V −1 .
Using the singular value decomposition the fitted values by ridge regres-
sion can be rewritten as

ŷridge = Xβ̂ridge
= [X(Xt X + λI)−1 Xt ]y
= [U DV t (V DU t U DV t + λI)−1 V DU t ]y
= [U DV t (V D 2 V t + λV IV t )−1 V DU t ]y
= [U DV t (V (D 2 + λI)V t )−1 V DU t ]y
= [U DV t V (D 2 + λI)−1 V t V DU t ]y
= [U D(D 2 + λI)−1 DU t ]y (11.10)

Note that D(D² + λI)−1 D is still a diagonal matrix with diagonal elements γj²/(γj² + λ), j = 1, . . . , d.

Let us denote by u1 , . . . , ud the columns of U which are an orthonormal


basis for the d-dimensional subspace of Rn determined by the columns of X.
Then, expression (11.10) can be written as

ŷridge = ∑_{j=1}^{d} [γj²/(γj² + λ)] (utj y) uj    (11.11)

For least squares regression we obtain

ŷLS = Xβ̂LS
= [X(Xt X)−1 Xt ]y
= [U DV t (V DU t U DV t )−1 V DU t ]y
= [U DV t (V D 2 V t )−1 V DU t ]y
= [U DV t V (D 2 )−1 V t V DU t ]y
= [U D(D 2 )−1 DU t ]y
= [U U t ]y

= ∑_{j=1}^{d} (utj y) uj ,    (11.12)

which corresponds to setting λ = 0 in (11.11) as expected. Both least squares


and ridge regression thus look for an approximation of the response vector y
in the subspace spanned by the columns u1 , . . . , ud of U . Least squares finds
the approximation by projecting y onto this subspace. Ridge regression then

shrinks the least squares coordinate weights (utj y) by the factor γj2 /(γj2 + λ).
Hence, the smaller the squared singular value γj2 , the larger the amount
of shrinking in the direction uj . Note that ridge regression thus looks for
an optimal solution in the same d-dimensional subspace as least squares,
contrary to smoothing splines in Section 4.4 that look for a solution in the
full n dimensional space.
To understand which directions correspond with a larger amount of shrinking, we consider the d × d matrix of squares and cross-products:

Xt X = V DU t U DV t
= V D2 V t (11.13)

which is the eigen-decomposition of the d-dimensional square matrix Xt X.


Hence, γj2 j = 1, . . . , d are the eigenvalues of the matrix and the columns
v1 , . . . , vd of V are the corresponding eigenvectors which are an orthonormal
basis of Rd . Now let us express the input observations and thus the feature
data matrix X in this new basis. The coordinates of the input observations
for each of the basis vectors vj , j = 1, . . . , d, are given by zj = Xvj . The
sample variance in the direction vj is thus

Var(zj ) = Var(Xvj ) = (vjt Xt Xvj )/n = (vjt V D 2 V t vj )/n = γj2 /n,

since V t vj = ej where ej is a d-dimensional vector with zero everywhere


except for component j which equals 1. Moreover, we also have that

zj = Xvj = U DV t vj = U Dej = γj U ej = γj uj .

Hence, in the directions uj , the variance of the feature observations equals


γj2 /n. Therefore, directions with small squared singular values γj2 correspond
to directions with small variance of the inputs, and ridge regression shrinks
these directions the most. Note that the sample covariance matrix S of the
observations is given by Xt X/n. Therefore, the eigen-decomposition of Xt X
in (11.13) corresponds to the eigen-decomposition of the sample covariance
matrix up to the factor n. Consequently, the eigenvectors v1 , . . . , vd are the
principal component directions of the feature data (in d-dimensional space), that is, di-
rections with maximal variance subject to being orthogonal to directions of
earlier principal components. Hence, the directions u1 , . . . , ud are the prin-
cipal component directions of the input observations in the n-dimensional
space.
Least squares regression becomes highly variable if there are directions in the predictor space along which the data have small variance. The small amount of informa-
tion in these directions makes it very difficult to accurately estimate in these

directions the slope of the linear surface fitting the data. This makes the
least squares solution very unstable. Ridge regression protects against a po-
tentially high variance in these directions (with small variance) by shrinking
the regression coefficients in these directions. Hence, ridge regression does
not allow a large slope in these directions and thus assumes (implicitly) that
the response will vary most in directions with high variance of the predic-
tors. This is often a reasonable assumption and even if it does not hold, the
data do not contain sufficient information to reliably estimate the slope of
the surface in directions with small variance of the predictors.

11.4 The Lasso


The lasso is a shrinkage method like ridge regression, but instead of using the (squared) L2-norm ∑_{j=1}^{d} βj² for the penalty term, it uses the L1-norm ∑_{j=1}^{d} |βj|:

β̂lasso = arg min_β ∑_{i=1}^{n} (yi − β0 − ∑_{j=1}^{d} βj xij)²

subject to ∥β∥1 = ∑_{j=1}^{d} |βj| ≤ t.    (11.14)

The constant t is a tuning parameter. A value t ≥ t0 = ∥β̂LS ∥1 implies


that no shrinkage is applied. Values 0 < t < t0 will shrink the least squares
coefficients towards zero. As usual, the tuning parameter t can be selected
to optimize prediction error.
Replacing the L2 ridge penalty by the L1 lasso penalty, makes the opti-
mal solution a nonlinear function of the observed responses y. Therefore, a
more complicated optimization problem needs to be solved, but the solution
can still be found in a very efficient way.
The lasso penalty and ridge penalty have a different effect on the regres-
sion parameters. To explain the difference, let us consider an orthonormal
feature matrix X, that is the predictors are uncorrelated and have variance
1. In this case the principal components are just the coordinate directions.
Since all predictors have the same variance, ridge regression imposes an
equal amount of shrinking on each of them. The ridge regression estimates
are

(β̂ridge)j = (β̂LS)j / (1 + λ).

The lasso shrinks the coefficients such that the constraint (11.14) is satisfied. That is,

(β̂lasso)j = sign((β̂LS)j) (|(β̂LS)j| − γ)+ ,

where the threshold parameter γ is related to the tuning parameter t in (11.14).


For t ≥ t0 , γ = 0. For t < t0 , γ = (t0 − t)/d until one of the parameters
becomes zero after which both t0 and d need to be adjusted. The effect of
the lasso is often called soft thresholding. Regular thresholding puts coeffi-
cients below the cutoff equal to zero and only keeps the coefficients above
the cutoff. This is the way subset selection for least squares actually pro-
ceeds in the orthonormal case: fixing a size M < d for the optimal model, best subset selection chooses the model with the M predictors corresponding to the largest coefficients |(β̂LS)j|. The soft thresholding of the lasso can be seen as a continuous form of subset selection. The tuning parameter t specifies the threshold: if ∥β̂LS∥1 exceeds t, the coefficients are shrunk. As t decreases the shrinkage increases, and for sufficiently small values of t some coefficients become exactly zero, so the corresponding predictors are removed from the model.
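In this orthonormal setting both estimators can be written in closed form; the sketch below (with arbitrary illustrative coefficient values) contrasts the proportional shrinkage of ridge regression with the soft thresholding of the lasso.

    import numpy as np

    def ridge_orthonormal(beta_ls, lam):
        # proportional shrinkage: every coefficient is divided by 1 + lambda
        return beta_ls / (1.0 + lam)

    def lasso_orthonormal(beta_ls, gamma):
        # soft thresholding: shrink towards zero by gamma, truncating at zero
        return np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - gamma, 0.0)

    beta_ls = np.array([3.0, -1.2, 0.4, -0.1])
    print(ridge_orthonormal(beta_ls, lam=1.0))    # all coefficients shrunk, none zero
    print(lasso_orthonormal(beta_ls, gamma=0.5))  # last coefficient set exactly to zero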

Let us now consider the general nonorthonormal case. Ridge regression


minimizes the residual sum of squares subject to the ridge constraint (11.7)
while the lasso minimizes the same residual sum of squares subject to the
lasso constraint (11.14). Let us put s = t2 in the ridge constraint (11.7) such
that both the ridge and lasso constraints put a bound t on the norm (L2 or
L1 ) of β. Hence, in the parameter space, the ridge constraint (11.7) means
that the solution should fall in a ball of radius t around the origin. The
lasso constraint (11.14) means that the solution should fall in a diamond around the origin with vertices at distance t from the origin on the coordinate axes. Now consider the residual sum of
squares (11.6) in parameter space. The optimum is at β̂LS and we assume
that it does not belong to the ridge region nor the lasso region. Therefore,
we have to allow a larger value of residual sum of squares to obtain a feasible
solution. The residual sum of squares is a quadratic function of β, and thus
for values of the residual sum of squares above the optimum, the set of solu-
tions becomes an ellipsoid around the center β̂LS . The value of the residual
sum of squares thus needs to be increased until the corresponding ellipsoid of
solutions hits the ridge/lasso constraint region around the origin. In either
case, the intersection point with the ellipsoid is the optimal solution. Hence,
the shape of the feasible region around the origin determines the solution
that is found. The ridge regression ball has no corners and is equally likely to be hit anywhere. The diamond (and certainly its higher-dimensional analogue) has corners on the coordinate axes, where all but one of the parameter values equal zero. These corners are more likely to be hit by the ellipsoid because they 'stick out' of the region. In practice this means that ridge regression usually shrinks all regression coefficients but leaves them nonzero, so the size of the model is not reduced. On the other hand, the lasso will shrink some coefficients to exactly zero and thus reduces the size of the model. Lasso shrinkage

automatically incorporates variable selection.

11.5 Principal Components Regression


We now start the discussion of methods that apply linear regression on a
small set of derived inputs. These inputs are well-chosen linear combinations
Z1 , . . . , ZM of the original features X1 , . . . , Xd .
In principal components regression (PCR) the derived inputs are the
principal components zm = Xvm m = 1, . . . , M as introduced above. We
still assume that the response and predictors are centered and scaled. Since
the principal components z1 , . . . , zM are orthogonal, the linear regression of
y on these principal components is just the sum of the univariate regressions,
so

ŷPCR = ∑_{m=1}^{M} θ̂m^PCR zm ,

where θ̂m^PCR is the least squares solution when regressing y on zm. Using zm = Xvm, this can be rewritten as

ŷPCR = ∑_{m=1}^{M} θ̂m^PCR X vm = X ∑_{m=1}^{M} θ̂m^PCR vm = Xβ̂PCR .

Expression (11.12) gave a decomposition of the least squares fit in the


principal component directions u1 , . . . , ud . Since PCR only uses the first
M of these directions we obtain that

ŷPCR = ∑_{j=1}^{M} (utj y) uj .    (11.15)

Note that if M = d we would just obtain the least squares solution. For
M < d PCR thus selects the optimal solution in an M -dimensional subspace
of the d-dimensional subspace spanned by the columns of X. Both ridge
regression and principal components regression use the principal components
of the feature data to find an alternative for the least squares fit. While ridge
regression shrinks the fit the most in principal components directions with
small variance, PCR completely ignores these directions and thus makes
the slope of the linear fit zero in these directions. Hence, PCR makes the
stronger assumption that the response does not vary in these directions.
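A compact numpy sketch of PCR via the singular value decomposition (the simulated data, the scaling and the choice M = 3 are illustrative); it returns both the coefficient vector β̂PCR and the fitted values of (11.15).

    import numpy as np

    def pcr_fit(X, y, M):
        # X is assumed centered (and scaled), y is assumed centered
        U, gamma, Vt = np.linalg.svd(X, full_matrices=False)
        U_M, V_M = U[:, :M], Vt[:M, :].T
        theta = (U_M.T @ y) / gamma[:M]       # least squares of y on z_m = gamma_m u_m
        beta_pcr = V_M @ theta                # coefficients in the original coordinates
        y_hat = U_M @ (U_M.T @ y)             # fitted values, equation (11.15)
        return beta_pcr, y_hat

    rng = np.random.default_rng(2)
    X = rng.normal(size=(40, 6))
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    y = X[:, 0] + rng.normal(size=40)
    y = y - y.mean()
    beta_pcr, y_hat = pcr_fit(X, y, M=3)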

11.6 Partial Least Squares


Contrary to PCR, partial least squares (PLS) does not only use the feature
matrix to construct the linear combinations that form the derived inputs,
but it also uses the response vector y in this construction. PLS starts
by regressing y on each of the predictors xj , which yields the univariate
regression coefficients ϕ̂1j j = 1, . . . , d. Note that the centering and scaling
of the variables implies that ϕ̂1j = Cor(xj , y). With these coefficients, the
first PLS derived input becomes

z1 = ∑_{j=1}^{d} ϕ̂1j xj .

Hence, in the construction of z1 the inputs are weighted by the strength of


their univariate effect (their correlation) on y. We then orthogonalize the
inputs x1 , . . . , xd w.r.t. the direction z1 , that is, each xj is replaced by x̃j
which is the residual obtained after regression of xj on z1 . This residual
vector is indeed orthogonal to the subspace spanned by z1 according to
the least squares properties. With these orthogonalized inputs we look for a
second direction z2 in exactly the same way as we found z1 . By construction,
this second direction will be orthogonal to z1 . With this process we can
construct M ≤ d orthogonal directions z1 , . . . , zM . Since these directions
are orthogonal, the regression coefficients of the PLS fit

ŷPLS = ∑_{m=1}^{M} θ̂m^PLS zm ,

are found by the univariate regressions of y on each of the derived inputs


zm m = 1, . . . , M . As for PCR, if we use all d directions z1 , . . . , zd , then
the solution would be the least squares solution (but expressed in a different
d-dimensional basis).
By involving the response y in the construction of the derived inputs,
PLS tries to find directions that not only have high variance in predictor
space (as in PCR) but also are highly correlated with the response. In detail,
the principal component directions in predictor space are the solutions

vm = argmax_{∥v∥=1, vlt S v = 0, l=1,...,m−1} Var(Xv),   m = 1, . . . , M,

where S is the sample covariance matrix of the input observations. The


principal component directions are thus independent of the actual response
values y. On the other hand, the partial least squares directions in predictor
space are the solutions

ϕm = argmax_{∥ϕ∥=1, ϕlt S ϕ = 0, l=1,...,m−1} Cor²(y, Xϕ) Var(Xϕ),   m = 1, . . . , M,

which yields directions that have sufficiently high variance in predictor space,
but also have a high correlation with the response. However, further investi-
gation of PLS has revealed that the variance component tends to dominate
and thus PLS behaves much like PCR and hence, also similar to ridge re-
gression.
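The construction described above can be sketched as follows; this is a minimal, unoptimized implementation for centered and scaled data (the simulated data and M = 2 are illustrative), in which the weights are only proportional to the univariate regression coefficients, which leaves the derived directions unchanged.

    import numpy as np

    def pls_fit(X, y, M):
        # X centered and scaled, y centered; returns the fitted values using M directions
        Xr = X.copy()                          # working inputs, orthogonalized as we go
        y_hat = np.zeros_like(y)
        for m in range(M):
            phi = Xr.T @ y                     # proportional to the univariate coefficients
            z = Xr @ phi                       # derived input z_{m+1}
            theta = (z @ y) / (z @ z)          # univariate regression of y on z
            y_hat += theta * z
            Xr -= np.outer(z, (z @ Xr) / (z @ z))   # orthogonalize the inputs w.r.t. z
        return y_hat

    rng = np.random.default_rng(3)
    X = rng.normal(size=(50, 4))
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    y = X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(size=50)
    y = y - y.mean()
    y_hat = pls_fit(X, y, M=2)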
Chapter 12

Nonparametric Regression

12.1 Introduction

In this chapter we consider regression techniques that allow us to fit more flexible regression functions f(X). One way to achieve this flexibility is by using basis expansions such as splines. These techniques were already discussed
in Chapter 4. In this chapter we will focus on flexible regression methods
that do not expand the predictor space but that fit a different simple lo-
cal regression model at each query point x0 . The local fit is based on the
observations close to the target point x0 and the weights given to the ob-
servations are chosen such that the resulting estimating function fˆ(X) is a
smooth function. To determine the weights, a weighting or kernel function
Kλ (x0 , x) is used as in Section 3.8. Hence, the weight of an observation i
depends on the distance of xi to the target point x0 . As such, kernel based
nonparametric regression can be seen as a smooth version of nearest neigh-
bor regression, which gives weight one to all neighbors of the target point
x0 . Nearest neighbors for regression is also discussed in this chapter.

12.2 Kernel Based Nonparametric Regression

12.2.1 Local averages

We will start by considering regression with only one predictor variable.


The idea here is to estimate the response at the target point x0 by a local weighted average of the responses of training points at a small distance from x0.
The weight function Kλ (x0 , x) decreases smoothly with the distance from
the target point to ensure a smooth fit fˆ. The fit is given by
fˆ(x0) = [∑_{i=1}^{n} Kλ(x0, xi) yi] / [∑_{i=1}^{n} Kλ(x0, xi)],    (12.1)


which is called the Nadaraya-Watson kernel-weighted average. The kernel


function Kλ(x0, x) often can be re-expressed as

Kλ(x0, x) = D(|x − x0| / λ).    (12.2)
Popular choices of D are

D(t) = (3/4)(1 − t²) if |t| ≤ 1, and 0 otherwise    (Epanechnikov quadratic kernel)    (12.3)

D(t) = (1 − |t|³)³ if |t| ≤ 1, and 0 otherwise    (tri-cube kernel)    (12.4)

D(t) = ϕ(t)    (Gaussian kernel).    (12.5)

Figure 12.1 shows the three kernel functions. The Epanechnikov and tri-
cube kernels have a compact support (positive on a finite interval) while the
Gaussian kernel has a noncompact support. The tri-cube function is flatter
on the top than the Epanechnikov kernel which yields more efficient results
but can lead to more bias. The tri-cube function has the extra advantage of
being differentiable at the boundary of the support.

[Figure 12.1: Epanechnikov, tri-cube, and Gaussian kernel functions. The tri-cube kernel function has been rescaled to integrate to 1.]

The tuning parameter λ controls the window size, that is the minimal
distance from the target point at which the weight becomes zero for the
compact support kernels. For the noncompact Gaussian kernel, λ is the
standard deviation as discussed in section 3.8.1. The choice of λ leads to
the usual trade-off between bias (large λ) and variance (small λ).
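A short sketch of the Nadaraya-Watson estimator (12.1) with the Epanechnikov kernel (12.3); the simulated curve, the evaluation grid and the bandwidth λ = 0.2 are illustrative choices, and every window is assumed to contain at least one training point.

    import numpy as np

    def epanechnikov(t):
        return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

    def nadaraya_watson(x0, x, y, lam):
        w = epanechnikov((x - x0) / lam)       # kernel weights K_lambda(x0, x_i)
        return np.sum(w * y) / np.sum(w)       # locally weighted average (12.1)

    rng = np.random.default_rng(4)
    x = np.sort(rng.uniform(0, 1, 100))
    y = np.sin(4 * x) + rng.normal(scale=0.3, size=100)
    grid = np.linspace(0.05, 0.95, 50)
    fit = np.array([nadaraya_watson(x0, x, y, lam=0.2) for x0 in grid])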
Locally kernel-weighted averages can have problems at the boundary.
At the boundary, the window tends to contain less points and the kernel

window becomes asymmetric which can lead to severe bias. Bias can also
occur in the interior if the training points are not approximately equally
spaced. Fitting higher order local regression models can alleviate the bias
effect.

12.2.2 Local linear regression


The bias that can occur with local weighted averages can be reduced by
fitting straight lines locally in the window instead of weighted averages. It
can be shown that fitting straight lines removes the dominant term of the
bias (a first-order correction) and thus reduces the bias substantially.
The local linear fits are obtained by solving a weighted least squares
problem at each target point x0:

min_{β0, β1} ∑_{i=1}^{n} Kλ(x0, xi) (yi − β0 − β1 xi)²,    (12.6)

which yields estimates β̂0 (x0 ) and β̂1 (x0 ) at each target point x0 . With these
estimates, the fit at x0 becomes fˆ(x0 ) = β̂0 (x0 ) + β̂1 (x0 )x0 . Note that at
each target point x0 , the fit fˆ(x0 ) is obtained from a different linear model.
Introduce the n × n diagonal weight matrix W(x0) with diagonal elements Kλ(x0, xi), i = 1, . . . , n. Then, using the weighted least squares solution of (12.6), the fitted response can be written as

fˆ(x0 ) = x̃t0 (Xt W (x0 )X)−1 Xt W (x0 )y


= l(x0 )t y,

with x̃0 = (1, x0 )t and X an n × 2 matrix where the first column is a column
of ones and the second column is x, the observed feature vector. Hence,
again the fit is obtained by a linear combination of the observed response y.
If we consider f̂ , the vector of fitted responses at the training points, then
we obtain
f̂ = S_λ^kernel y,

where the n × n matrix S_λ^kernel = (l(x1), . . . , l(xn))t is the linear operator. The effective degrees of freedom, trace(S_λ^kernel), can thus be used to determine the tuning parameter λ.
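A sketch of the local linear fit (12.6) with a tri-cube kernel (the simulated data and the bandwidth are illustrative; the weighted least squares system is assumed to be nonsingular, i.e. enough points receive positive weight).

    import numpy as np

    def tricube(t):
        return np.where(np.abs(t) <= 1, (1 - np.abs(t)**3)**3, 0.0)

    def local_linear(x0, x, y, lam):
        w = tricube((x - x0) / lam)                    # diagonal of W(x0)
        X = np.column_stack([np.ones_like(x), x])      # n x 2 design matrix
        WX = X * w[:, None]
        beta = np.linalg.solve(X.T @ WX, WX.T @ y)     # weighted least squares
        return beta[0] + beta[1] * x0                  # fit at the target point x0

    rng = np.random.default_rng(5)
    x = np.sort(rng.uniform(0, 1, 100))
    y = np.sin(4 * x) + rng.normal(scale=0.3, size=100)
    fit = np.array([local_linear(x0, x, y, lam=0.25) for x0 in x])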

12.2.3 Local polynomial regression


A natural extension is to replace the local linear fit by a local polynomial
fit of degree M, which leads to

min_{β0, β1, ..., βM} ∑_{i=1}^{n} Kλ(x0, xi) (yi − β0 − ∑_{m=1}^{M} βm xi^m)²,    (12.7)

which yields estimates β̂0(x0), . . . , β̂M(x0) and corresponding fit fˆ(x0) = β̂0(x0) + ∑_{m=1}^{M} β̂m(x0) x0^m at each target point x0. With local degree M (M >
1) polynomial fits, the bias will be reduced further. Especially in regions of
high curvature, local polynomial fits will have much lower bias than local
linear fits, that have a bias effect described by trimming the hills and filling
the valleys. Of course the bias reduction does not come for free. A higher
order model needs to be fit locally (using the same number of observations
as the linear model) which yields a decrease of precision and thus an increase
in variance.

12.3 Nearest Neighbors Regression


As in classification, K-nearest-neighbors regression (K-NNR) first deter-
mines the K training points closest to a target x0 in Euclidean distance.
The K-nearest-neighbors regression fit at x0 is then the average response of
these K nearest training points. Nearest neighbor regression thus is also a
local regression method, but it uses an adaptive neighborhood size instead
of the fixed width (by the choice of λ) neighborhood in the previous section.
As each fit is a local average, it is a linear combination of the observed re-
sponse y. The vector of fitted responses at the training points f̂ can thus
be written as
f̂ = S_K^NNR y,

where S_K^NNR is the linear operator. The effective degrees of freedom, trace(S_K^NNR), can be used to determine the tuning parameter K.
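A minimal sketch of K-nearest-neighbor regression in one dimension (where the Euclidean distance reduces to the absolute difference; the data and the choice K = 10 are illustrative).

    import numpy as np

    def knn_regress(x0, x, y, K):
        idx = np.argsort(np.abs(x - x0))[:K]   # indices of the K closest training points
        return y[idx].mean()                   # average response of these neighbors

    rng = np.random.default_rng(6)
    x = np.sort(rng.uniform(0, 1, 100))
    y = np.sin(4 * x) + rng.normal(scale=0.3, size=100)
    fit = np.array([knn_regress(x0, x, y, K=10) for x0 in x])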
Note that the kernel based local regression methods in the previous sec-
tion can be extended by replacing the constant window width λ by a window
width function λ(x0 ) that determines the width of the neighborhood at each
target point x0 . For example, we can use the function

λ(x0 ) = |x0 − xi |K:n (12.8)

which is the distance between x0 and its Kth closest training point. If we
construct an adaptive kernel by using (12.8) in (12.2) with D(t) = I(|t| ≤
1), then the local average (12.1) just equals K-nearest neighbor regres-
sion. Straightforward extensions are obtained by using this kernel in (12.6)
or (12.7) which yields nearest neighbor methods that locally fit linear or
polynomial models instead of simple averages. Adaptive kernel based local
regression methods are obtained by using the width function (12.8) in (12.2)
and applying the Epanechnikov (12.3), tri-cube (12.4), or Gaussian (12.5)
function.

12.4 Local Regression With More Predictors


When there is more than one predictor, linear models can still be fit locally.
The weights to determine the local neighborhood are now determined by a d-
dimensional kernel function. As in the one-dimensional case, local linear fits
have better performance (lower bias) at the boundaries than local constant
fits. Note that the boundary problem increases with dimension because the
curse of dimensionality implies that a larger fraction of the points lies close
to the boundary if the dimension increases. As in the one predictor case,
local polynomial fits have smaller bias, especially in high curvature regions,
but the variance starts to increase. Since higher order polynomials (M > 1)
in d dimensions require the inclusion of interaction terms if no restrictions
are imposed, the number of parameters, and thus the variance, grows fast.
The d-dimensional kernel functions are typically radial functions. That
is, the absolute value |x − x0 | in the right hand side of (12.2) is replaced by
the Euclidean norm ∥x − x0∥,

Kλ(x0, x) = D(∥x − x0∥ / λ),    (12.9)

which, depending on the choice of D leads to e.g. a radial Epanechnikov ker-


nel or a radial tri-cube kernel. Note that since the Euclidean norm depends
on the scaling of the variables, it is advisable to standardize the predictors.
The curse of dimensionality implies that local methods as these become less
useful in higher dimensions (d > 3) since the increasing sparsity of training
samples implies that there are no local neighborhoods anymore.
Chapter 13

Structured Regression
Methods

13.1 Introduction
To handle the curse of dimensionality which affects the local regression meth-
ods in the previous chapters, more structure will need to be imposed on the
regression function. However, we do not want to impose too rigid a structure, such as a global linear regression structure, that reduces flexibility too much. In this chapter we discuss some techniques that impose a certain structure on the regression function but still allow for flexibility.

13.2 Additive models


Additive models were introduced in Section 5.5 as part of the general class
given by (5.12) with link function g equal to the identity. Let us consider
a general regression function E(Y |X) = f (X1 , . . . , Xd ) where f can be any
function allowing any level of interaction between the predictors. It is con-
venient to consider the analysis-of-variance (ANOVA) decomposition of f:

f(X) = f(X1, . . . , Xd) = β0 + ∑_{j=1}^{d} fj(Xj) + ∑_{k<l} fkl(Xk, Xl) + · · · .    (13.1)

Structure on the regression function f (X) can now be imposed by elimi-


nating higher order terms in its ANOVA decomposition (13.1). Standard
additive models approximate f(X) by only considering main effects, that is, it is assumed that f(X) = β0 + ∑_{j=1}^{d} fj(Xj). By expanding the feature
space with the first order products Xk Xl (k < l), a second-order additive
model can be fit, and so on. In Section 5.5 we introduced the backfitting
algorithm to fit additive models. An important aspect of this algorithm is


that a regression method only needs to be applied to one-dimensional data. Hence,


local regression methods as introduced in the previous chapter can be used.
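As an illustration, a sketch of the backfitting algorithm for a main-effects additive model, using a one-dimensional Nadaraya-Watson smoother with a Gaussian kernel as the inner regression method; the smoother, the bandwidth and the simulated data are illustrative choices, not the only possibility.

    import numpy as np

    def nw_smooth(x, y, lam=0.3):
        # one-dimensional Nadaraya-Watson smoother evaluated at the training points
        w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / lam) ** 2)
        return (w @ y) / w.sum(axis=1)

    def backfit(X, y, n_iter=20, lam=0.3):
        n, d = X.shape
        beta0 = y.mean()
        f = np.zeros((n, d))                              # fitted component functions
        for _ in range(n_iter):
            for j in range(d):
                resid = y - beta0 - f.sum(axis=1) + f[:, j]   # partial residuals
                f[:, j] = nw_smooth(X[:, j], resid, lam)
                f[:, j] -= f[:, j].mean()                 # keep each component centered
        return beta0, f

    rng = np.random.default_rng(7)
    X = rng.uniform(-1, 1, size=(200, 2))
    y = np.sin(3 * X[:, 0]) + X[:, 1]**2 + rng.normal(scale=0.2, size=200)
    beta0, f = backfit(X, y)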

13.3 Multivariate Adaptive Regression Splines


Multivariate Adaptive Regression Splines (MARS) is an adaptive regres-
sion procedure that is well-suited for high-dimensional regression problems.
MARS uses expansions of the features in piecewise linear basis functions of
the form (x − t)+ and (t − x)+ as introduced in Section 4.3. That is,

(x − t)+ = x − t if x > t, and 0 otherwise,     (t − x)+ = t − x if x < t, and 0 otherwise,

which are linear splines with their knot at the value t. The two functions
together are called a reflected pair. The idea of MARS is to consider for each
input variable Xj an expansion in reflected pairs with knots at the observed
values xij i = 1, . . . , n of that input variable. This yields the collection of
functions

C = {(Xj − t)+ , (t − Xj )+ ; t ∈ {x1j , . . . , xnj }, j = 1, . . . , d}.

The model building strategy of MARS is similar to a forward stepwise


linear regression procedure, but we are allowed to select as predictors, func-
tions from the set C as well as products of these functions with already se-
lected basis functions. In detail, we start with the basis function h0 (X) = 1
as only predictor, that is a constant fit. We then select from C the reflected
pair, such that the resulting least squares fit

fˆ(x) = β̂0 + β̂1 (Xj − xij )+ + β̂2 (xij − Xj )+



yields the largest decrease in residual sum of squares ∑_{i=1}^{n} [yi − fˆ(xi)]². The
optimal reflected pair yields two new basis functions h1 (X) = (Xj − xij )+
and h2 (X) = (xij − Xj )+ . At the next stage we consider all possible pairs
of functions

hm(X) · (Xl − xil)+ and hm(X) · (xil − Xl)+ ,   l = 1, . . . , d; i = 1, . . . , n.

The function hm (X) is selected from the current set of basis functions
{h0 (X), h1 (X), h2 (X)}. There is one restriction on the products considered.
The resulting basis function should be an interaction of different features,
but higher order powers of a feature variable are not allowed. The pair of
functions that yields the largest decrease in residual sum of squares is added

to the set of basis functions and the process is iterated until the basis ex-
pansion contains a maximum number of terms, M . Note that each basis
function in this expansion is a product of a number of linear splines.
The resulting model with all M basis functions usually overfits the train-
ing data. Therefore, a backward elimination procedure is applied (similar
to pruning trees). At each step the basis function whose removal causes the
smallest increase in residual squared error, is removed from the model. This
yields a sequence fˆm m = 1, . . . , M of best models of size going from 1 to
M . To select the optimal model from this sequence, MARS uses generalized
cross-validation (GCV) defined as
GCV(m) = [∑_{i=1}^{n} (yi − fˆm(xi))²] / (1 − df(m)/n)².    (13.2)
Here, the value df(m) is the effective number of parameters used to build the
model. This effective number of parameters accounts both for the number
of (independent) regression coefficients in the model and the number of ’pa-
rameters’ used in selecting the positions of the knots. It can be argued that
with linear splines it requires 3 parameters to select each knot. Therefore,
df(m) = r + 3K where K is the number of knots being used and r is the
rank of the n × m matrix with the basis functions evaluated at each of the
training observations. Usually, the matrix has full rank such that r = m.
The generalized cross-validation criterion is a computationally efficient
approximation of leave-one-out cross-validation that can be used for linear
fitting methods ŷ = Sy. Leave-one-out cross-validation prediction error is
given by
CV-error = (1/n) ∑_{i=1}^{n} [yi − fˆ−i(xi)]².
For many linear fitting methods, it can be shown that

yi − fˆ−i(xi) = (yi − fˆ(xi)) / (1 − Sii),

where Sii is the ith diagonal element of S. Hence, in this case leave-one-out
cross-validation can be obtained without refitting the model each time:

CV-error = (1/n) ∑_{i=1}^{n} [(yi − fˆ(xi)) / (1 − Sii)]².

The GCV approximation is obtained by replacing S ii in each term by its


average over the training data. That is, S ii is replaced by the average
trace(S)/n which is the effective degrees of freedom (divided by n):

GCV-error = (1/n) ∑_{i=1}^{n} [(yi − fˆ(xi)) / (1 − trace(S)/n)]².
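For any linear fitting method with a known smoother matrix S these two quantities are easy to compute; the sketch below uses the ridge operator of Chapter 11 purely as an example of such an S (the simulated data and λ are arbitrary).

    import numpy as np

    def loocv_and_gcv(S, y):
        resid = y - S @ y
        loo = np.mean((resid / (1 - np.diag(S)))**2)              # leave-one-out shortcut
        gcv = np.mean((resid / (1 - np.trace(S) / len(y)))**2)    # GCV approximation
        return loo, gcv

    rng = np.random.default_rng(8)
    X = rng.normal(size=(40, 5))
    X = X - X.mean(axis=0)
    y = X[:, 0] + rng.normal(size=40)
    y = y - y.mean()
    lam = 5.0
    S = X @ np.linalg.inv(X.T @ X + lam * np.eye(5)) @ X.T        # an example smoother matrix
    print(loocv_and_gcv(S, y))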

The use of piecewise linear basis functions in MARS has the advantage
that each basis function is only nonzero in a small part of the predictor space,
and hence operates locally. This means that the regression surface is built
up parsimoniously, using nonzero components locally where they are needed
to improve the fit instead of globally. In this way, parameters are spend
carefully, using only one or a few nonzero parameters for the fit in each
region of the predictor space. Another important advantage of piecewise
linear functions is computational. It can be shown that the piecewise linear
functions can be fit efficiently and the resulting fit can easily be updated
when a knot is shifted.
The forward modeling strategy in MARS allows higher order interac-
tions to enter only if the lower order interactions are already in the model.
Although this is not necessarily always the case, it is a useful working as-
sumption that is often being made. It avoids that the optimal basis functions
need to be selected from a set of functions that grows exponentially with
dimension. Another useful option in MARS is to put an upper limit on the
order of interactions that are allowed. Without such a restriction, the maximal order is M .
However, it is often useful to set a limit of two or three. This yields simpler
models that can be interpreted more easily.
Chapter 14

Regression Trees

14.1 Introduction
Similarly to classification trees, regression trees partition the feature space
into rectangular regions and then fit a constant in each of these regions.
Regression trees usually try to minimize squared error loss when determining
the binary partitions, but other loss functions (e.g. L1 loss) are possible too.
In this chapter we discuss regression trees using squared error loss and their
extensions obtained by bagging or boosting.

14.2 Growing Trees


The goal is again to determine the best possible split variable and split-point
for each split. In the regression context, this means that the resulting tree
should minimize the sum of squared errors. Suppose that the tree procedure
splits the feature space into M leaves R1 , . . . , RM containing a number of
observations n1 , . . . , nM . Since we model the response as a constant in each
leaf, we obtain the fit

fˆ(x) = ∑_{m=1}^{M} cm I(x ∈ Rm).

The sum of squared errors is given by

(1/n) ∑_{m=1}^{M} ∑_{xi ∈ Rm} (yi − cm)².

Clearly, for a given partition, the optimal values of the constants c1 , . . . , cM


are given by the averages

ĉm = (1/nm) ∑_{xi ∈ Rm} yi ,   m = 1, . . . , M.


However, finding the binary partition (consisting of a set of successive splits)


is again computationally infeasible, and hence a stepwise procedure is adopted.
The stepwise procedure at each step seeks the splitting variable Xj and
split-point t that solve

min_{j,t} [ ∑_{xi ∈ R1(j,t)} (yi − ĉ1)² + ∑_{xi ∈ R2(j,t)} (yi − ĉ2)² ],

where
R1 (j, t) = {X|Xj ≤ t} and R2 (j, t) = {X|Xj > t}

as before. Hence, squared error loss is used as the impurity measure within
each region.
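A brute-force sketch of this split search for a single node, trying every feature and every observed split-point and scoring each candidate by the resulting residual sum of squares (the simulated data are illustrative; real implementations sort each feature once and update the sums incrementally).

    import numpy as np

    def best_split(X, y):
        best_j, best_t, best_rss = None, None, np.inf
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j])[:-1]:    # candidate split-points, keeping both sides nonempty
                left, right = y[X[:, j] <= t], y[X[:, j] > t]
                rss = np.sum((left - left.mean())**2) + np.sum((right - right.mean())**2)
                if rss < best_rss:
                    best_j, best_t, best_rss = j, t, rss
        return best_j, best_t, best_rss

    rng = np.random.default_rng(9)
    X = rng.uniform(size=(100, 3))
    y = np.where(X[:, 1] > 0.6, 2.0, 0.0) + rng.normal(scale=0.3, size=100)
    print(best_split(X, y))       # should select feature 1 with a split-point near 0.6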
As for classification, we grow a large tree T0 where splitting of regions
only stops when some minimal region size (e.g. 5) has been reached. This
large tree has the danger of overfitting the training data. Therefore, the
initial tree is then pruned to find a subtree that fits the data well but is
more stable and is expected to perform better on validation data.

14.3 Pruning Trees


The initial regression tree is pruned to optimize the cost-complexity crite-
rion (7.2) where the impurity measure QRm(T) now is given by

QRm(T) = (1/nm) ∑_{xi ∈ Rm} (yi − ĉm)².

The optimal tree for each α ≥ 0 is again found by weakest link pruning
which in each step removes the branch point that produces the smallest
increase of the total impurity (7.3).
One drawback of trees is the lack of smoothness of the resulting regression function: within each leaf a constant is fit, so the resulting regression function shows many jumps. Another drawback is the difficulty of regression
trees to capture additive structures. Consider a simple additive structure
Y = c1 I(X1 < t1 ) + c2 I(X2 < t2 ) + ϵ where ϵ is zero mean random error.
A regression tree might make its first split on X1 near t1 . At the next level,
it would have to split both regions on X2 near t2 to reflect the additive
structure. This might happen with sufficiently rich data, but the additive
model is given no preference. If there were several additive effects, then it is
highly unlikely that a regression tree will find this additive structure. Both
these drawbacks are overcome by using MARS as an alternative to regression trees.

14.4 Bagging
The instability of regression trees due to their high variability can be ad-
dressed by using bagging as introduced in Section 7.4. Using a number B
of bootstrap samples, the bagging estimate fˆbag (x) is given by (7.4). For
regression trees using squared error loss, bagging is especially useful as it re-
duces variance and leaves bias (approximately) unchanged. Hence, bagging
reduces the mean squared error of predictions. To explain this, consider a
random sample {(xi , yi ); i = 1, . . . , n} from a distribution P and the ideal
aggregate estimator fag (x) = EP [fˆ⋆ (x)] where fˆ⋆ (x) is the fit at x obtained
from a set of observations {(x⋆i , yi⋆ ); i = 1, . . . , n} randomly sampled from
P . We can now write

EP [(Y − fˆ(x))2 ] = EP [((Y − fag (x)) + (fag (x) − fˆ(x)))2 ]


= EP [(Y − fag (x))2 ] + EP [(fag (x) − fˆ(x))2 ]
≥ EP [(Y − fag (x))2 ]

The mean squared error of the estimator fˆ(x) thus consists of the mean
squared error of the aggregate estimator fag (x) and the variance of fˆ(x)
around its mean fag (x). Therefore, the true population level aggregate esti-
mator never increases mean squared error. This suggests that the bagging
estimate fˆbag (x) obtained by drawing bootstrap samples from the training
data often decreases the mean squared error.
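A sketch of bagged regression trees; scikit-learn's DecisionTreeRegressor is used here only as a convenient base learner, and the number of bootstrap samples B, the simulated data and the default tree settings are illustrative.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def bagged_trees(X, y, B, rng):
        n = len(y)
        trees = []
        for _ in range(B):
            idx = rng.integers(0, n, size=n)                 # bootstrap sample (with replacement)
            trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
        return trees

    def bagging_predict(trees, X):
        return np.mean([t.predict(X) for t in trees], axis=0)   # average over the B trees

    rng = np.random.default_rng(10)
    X = rng.uniform(size=(200, 3))
    y = np.sin(4 * X[:, 0]) + rng.normal(scale=0.3, size=200)
    trees = bagged_trees(X, y, B=50, rng=rng)
    f_bag = bagging_predict(trees, X)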

14.5 Boosting
Boosting regression trees using squared error loss can be obtained by suc-
cessively fitting regression trees to the residuals of the previous tree. Boost-
ing can be seen as fitting an additive expansion using forward stagewise
modeling as explained in Section 7.6. In detail, the algorithm starts from
the regression tree fitted to the data which yields the initial fit f0 (x) and
then iterates the following steps for m = 1, . . . , M . First calculate the
residuals of the training points, rm−1,i = yi − fm−1(xi), then fit a regression tree Tm to the targets rm−1,1, . . . , rm−1,n and update the additive fit as fm(x) = fm−1(x) + Tm(x). Finally, the boosting algorithm
outputs the fit fˆ(x) = fM (x). This boosting procedure for regression trees
is called multiple additive regression trees (MART). As for classification, the
performance is best by growing fixed size trees of size 4 ≤ J ≤ 8. The
optimal number of iterations M ⋆ can be selected by monitoring prediction
error in the process.
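A sketch of this boosting procedure with squared error loss; for simplicity the initial fit is taken to be the mean response rather than an initial tree, small trees are grown with scikit-learn's DecisionTreeRegressor (max_leaf_nodes plays the role of the tree size J), and M, J and the simulated data are illustrative choices.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def boost_trees(X, y, M=50, J=6):
        f = np.full(len(y), y.mean())                         # initial fit f_0
        trees = []
        for _ in range(M):
            resid = y - f                                     # current residuals as targets
            tree = DecisionTreeRegressor(max_leaf_nodes=J).fit(X, resid)
            trees.append(tree)
            f = f + tree.predict(X)                           # update the additive fit
        return y.mean(), trees

    def boost_predict(c0, trees, X):
        return c0 + np.sum([t.predict(X) for t in trees], axis=0)

    rng = np.random.default_rng(11)
    X = rng.uniform(size=(300, 4))
    y = 2 * X[:, 0] + np.sin(6 * X[:, 1]) + rng.normal(scale=0.3, size=300)
    c0, trees = boost_trees(X, y, M=50)
    f_hat = boost_predict(c0, trees, X)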
Chapter 15

Regression Support Vector Machines

15.1 Introduction
In this chapter we discuss the use of kernel-based methods for regression.
Regression support vector machines using squared error loss have already
been discussed in Chapter 4 (see (4.27)). Here, we discuss regression support
vector machines using a different loss function that has an effect similar to
the loss function used in support vector machines for classification.

15.2 Support Vector Regression


We consider the linear regression model

f (x) = β0 + xt β.

For the support vector classifier in Section 6.3 we determined the boundary
with the largest margin that separated the two classes. The separation was not required to be complete, in the sense that we allowed a number of observations to fall on the wrong side of the boundary. Similarly, we now look for the
hyperplane such that all observations lie within a band of width ϵ around
the hyperplane. Again, we do not expect that such a hyperplane exists for
sufficiently small values of ϵ, thus we allow that a number of observations
falls outside the band. This leads to the ϵ-insensitive loss function

L(y, f (x)) = max(0, |y − f (x)| − ϵ).

Hence, the loss is zero for observations whose response yi lies within distance ϵ of the fitted hyperplane f(xi). For observations farther than ϵ from the hyperplane, the loss equals |y − f(x)| − ϵ.
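The ϵ-insensitive loss is easy to write down explicitly; a tiny sketch (the responses, fits and the value of ϵ are arbitrary illustration values):

    import numpy as np

    def eps_insensitive_loss(y, f_x, eps):
        # zero inside the band of half-width eps around the fit, linear outside it
        return np.maximum(0.0, np.abs(y - f_x) - eps)

    y = np.array([1.0, 2.0, 4.0])
    f_x = np.array([1.2, 3.0, 3.8])
    print(eps_insensitive_loss(y, f_x, eps=0.5))   # losses 0, 0.5 and 0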


With this loss function, support vector regression solves the problem

min_{β0, β} [ ∑_{i=1}^{n} L(yi, f(xi)) + (λ/2) ∥β∥² ],    (15.1)

where the penalty is a regularization that favors more parsimonious, stable


models. This problem can be rewritten as

min_{β0, β} [ (1/2) ∥β∥² + (1/λ) ∑_{i=1}^{n} (ξi + ξi⋆) ]

subject to yi − f(xi) ≤ ϵ + ξi
           f(xi) − yi ≤ ϵ + ξi⋆
           ξi ≥ 0, ξi⋆ ≥ 0,   i = 1, . . . , n

which leads to the Lagrange primal function

LP = (1/2) ∥β∥² + (1/λ) ∑_{i=1}^{n} (ξi + ξi⋆) + ∑_{i=1}^{n} αi (yi − β0 − xti β − ϵ − ξi)

     + ∑_{i=1}^{n} αi⋆ (β0 + xti β − yi − ϵ − ξi⋆) − ∑_{i=1}^{n} µi ξi − ∑_{i=1}^{n} µ⋆i ξi⋆    (15.2)

where αi , αi⋆ ≥ 0 and µi , µ⋆i ≥ 0 are the Lagrange multipliers. Setting the
derivatives w.r.t. β, β0, ξi and ξi⋆ equal to zero yields the equations

β = ∑_{i=1}^{n} (αi − αi⋆) xi    (15.3)

0 = ∑_{i=1}^{n} (αi − αi⋆)    (15.4)

αi = 1/λ − µi ,   i = 1, . . . , n    (15.5)

αi⋆ = 1/λ − µ⋆i ,   i = 1, . . . , n    (15.6)
Substituting the first order conditions (15.3)-(15.6) in (15.2) yields the dual
objective function

LD = ∑_{i=1}^{n} (αi − αi⋆) yi − ϵ ∑_{i=1}^{n} (αi + αi⋆) − (1/2) ∑_{i=1}^{n} ∑_{j=1}^{n} (αi − αi⋆)(αj − αj⋆) xti xj    (15.7)

which needs to be maximized w.r.t. αi, αi⋆, i = 1, . . . , n, under the constraints

∑_{i=1}^{n} (αi − αi⋆) = 0

0 ≤ αi ≤ 1/λ

0 ≤ αi⋆ ≤ 1/λ.

The Karush-Kuhn-Tucker conditions satisfied by the solution include the


constraints

αi (yi − β0 − xti β − ϵ − ξi) = 0,   i = 1, . . . , n
αi⋆ (β0 + xti β − yi − ϵ − ξi⋆) = 0,   i = 1, . . . , n
αi αi⋆ = 0,   i = 1, . . . , n
ξi ξi⋆ = 0,   i = 1, . . . , n
(αi − 1/λ) ξi = 0,   i = 1, . . . , n
(αi⋆ − 1/λ) ξi⋆ = 0,   i = 1, . . . , n
These constraints imply that only a subset of the values αi − αi⋆ are different
from zero and the associated observations are called the support vectors.

The dual function and the fitted values f(x) = xt β = ∑_{i=1}^{n} (αi − αi⋆) xt xi
both are determined only through the products xt xi . Thus, we can generalize
support vector regression by using kernels.

15.3 Regression Support Vector Machines


We now consider the regression model

f(x) = β0 + h(x)t β = β0 + ∑_{m=1}^{M} βm hm(x)

where h(x) = (h1 (x) . . . , hM (x))t is an expansion in basis functions. To esti-


mate β and β0 we need to minimize (15.1) or equivalently, we need to maxi-
mize the dual function (15.7) with the products xti xj replaced by h(xi )t h(xj ).

This yields the solution f(x) = h(x)t β = ∑_{i=1}^{n} (αi − αi⋆) h(x)t h(xi).
As in Section 6.4 we can now introduce a kernel function K(x, x′) = ∑_{m=1}^{∞} γm ϕm(x) ϕm(x′) as in (4.14) and define the associated basis functions as hm(x) = √γm ϕm(x). Then for any points x and x′ in Rd the product h(x)t h(x′) = ∑_{m=1}^{∞} γm ϕm(x) ϕm(x′) = K(x, x′). Hence, if the transformation of the feature space is induced by a kernel function, the products in the dual function (15.7) are given by K(xi, xj) and the corresponding solution becomes f(x) = ∑_{i=1}^{n} (αi − αi⋆) K(x, xi).
Further Reading

Books

Duda, R.O., Hart, P.E., and Stork, D.G. (2001). ”Pattern Classifica-
tion,” John Wiley and Sons.
Green, P.J., and Silverman, B.W. (1994). ”Nonparametric Regression
and Generalized Linear Models,” Chapman and Hall.

Hastie, T., and Tibshirani, R. (1990). ”Generalized Additive Models,”


Chapman and Hall.

Hastie, T., Tibshirani, R., and Friedman, J. (2001). ”The Elements of


Statistical Learning: Data Mining, Inference and Prediction,” Springer-
Verlag.

McLachlan G. (1992). ”Discriminant Analysis and Statistical Pattern


Recognition,” John Wiley and Sons.

Schölkopf B., and Smola A. (2002). ”Learning with Kernels: Sup-


port Vector Machines, Regularization, Optimization, and Beyond,”
The MIT Press.

Papers

Breiman, L. (1996). ”Bagging Predictors,” Machine Learning, 24, 123-


140.
Breiman, L. (2001). ”Random Forests,” Machine Learning, 45, 5-32.

Bühlmann, P., and Hothorn, T. (2007). ”Boosting Algorithms: Reg-


ularization, Prediction and Model Fitting,” Statistical Science, 22,
477-505.

Bühlmann, P., and Yu, B. (2003). ”Boosting with the L2 Loss: Re-
gression and Classification,” Journal of the American Statistical As-
sociation, 98, 324-339.


Buja, A., Hastie, T., and Tibshirani, R. (1989). ”Linear Smoothers


and Additive Models,” The Annals of Statistics, 17, 453-510.

Efron, B. (1975). ”The Efficiency of Logistic Regression Compared to


Normal Discriminant Analysis,” Journal of the American Statistical
Association, 70, 892-898.

Efron, B. (1983). ”Estimating the Error Rate of a Prediction Rule:


Improvement on Cross-Validation,” Journal of the American Statisti-
cal Association, 78, 316-331.

Efron, B., and Tibshirani, R. (1997). ”Improvements on Cross-Validation:


The .632+ Bootstrap Method,” Journal of the American Statistical
Association, 92, 548-560.

Eilers, P. H. C., and Marx, B.D. (1996). ”Flexible Smoothing with


B-splines and Penalties,” Statistical Science, 11, 89-102.

Frank, I.E., and Friedman, J. (1993). ”A Statistical View of Some


Chemometrics Regression Tools,” Technometrics, 35, 109-135.

Friedman, J. (1989). ”Regularized Discriminant Analysis,” Journal of


the American Statistical Association, 84, 165-175.

Friedman, J. (1991). ”Multivariate Adaptive Regression Splines,” The


Annals of Statistics, 19, 1-67.

Friedman, J. (2001). ”Greedy Function Approximation: A Gradient


Boosting Machine,” The Annals of Statistics, 29, 1189-1232.

Friedman, J., Hastie, T., and Tibshirani, R. (2000). ”Additive Logistic


Regression: A Statistical View of Boosting,” The Annals of Statistics,
28, 337-374.

Freund Y., and Schapire, R.E. (1997). ”A Decision-Theoretic General-


ization of On-Line Learning and an Application to Boosting,” Journal
of Computer and System Sciences, 55, 119-139.

Hastie, T., and Tibshirani, R. (1986). ”Generalized Additive Models,”


Statistical Science, 1, 297-310.

Hastie, T., and Tibshirani, R. (1987). ”Generalized Additive Models:


Some Applications,” Journal of the American Statistical Association,
82, 371-386.

Hastie, T., and Tibshirani, R. (1996). ”Discriminant Analysis by


Gaussian Mixtures,” Journal of the Royal Statistical Society. Series
B, 58, 155-176.

Hastie, T., Buja, A., and Tibshirani, R. (1995). ”Penalized Discrimi-


nant Analysis,” The Annals of Statistics, 23, 73-102.

Hastie, T., Tibshirani, R., and Buja, A. (1994). ”Flexible Discrimi-


nant Analysis by Optimal Scoring,” Journal of the American Statistical
Association, 89, 1255-1270.

Hoerl, A.E., and Kennard, R.W. (1970). ”Ridge Regression: Biased


Estimation for Nonorthogonal Problems,” Technometrics, 12, 55-67.

Hoerl, A.E., and Kennard, R.W. (1970). ”Ridge Regression: Applica-


tions to Nonorthogonal Problems,” Technometrics, 12, 69-82.

Stone M. (1974). ”Cross-Validatory Choice and Assessment of Statis-


tical Predictions,” Journal of the Royal Statistical Society. Series B,
36, 111-147.

Tibshirani, R. (1996). ”Regression Shrinkage and Selection via the


Lasso,” Journal of the Royal Statistical Society. Series B, 58, 267-288.

Wood, S. N. (2008). ”Fast stable direct fitting and smoothness selec-


tion for generalized additive models,” Journal of the Royal Statistical
Society. Series B, 70, 495-518.

Yee, T. W., and Wild, C. J. (1996). ”Vector Generalized Additive


Models,” Journal of the Royal Statistical Society. Series B, 58, 481-
493.
