
Machine Learning

Linear Models for Classification

Marcello Restelli
March 15, 2017
Outline

1 Linear Classification

2 Discriminant Functions
Least Squares
The Perceptron Algorithm

3 Probabilistic Discriminative Models



Linear Classification

Classification Problems

The goal of classification is to assign an input x to one of K discrete classes Ck, where k = 1, . . . , K
Typically, each input is assigned to only one class
Example: the input vector x is the set of pixel intensities, and the output variable t represents the presence of cancer (class C1) or absence of cancer (class C2)


Linear Classification

In linear classification, the input space is divided into decision regions whose boundaries are called decision boundaries or decision surfaces
We will consider linear models for classification
In the linear regression case, the model is linear in the parameters:

y(x, w) = w0 + Σ_{j=1}^{D−1} wj xj = xT w + w0

For classification, we need to predict discrete class labels, or posterior probabilities that lie in the range (0, 1), so we use a nonlinear function:

y(x, w) = f(xT w + w0)
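As an illustration, here is a minimal sketch of such a model (not from the slides; the weight values and the choice of a simple threshold activation are illustrative assumptions): a linear score xT w + w0 is passed through a nonlinear function f to produce a class label.

```python
import numpy as np

def linear_score(x, w, w0):
    """Compute the linear score x^T w + w0."""
    return x @ w + w0

def predict_class(x, w, w0):
    """Threshold activation: assign class C1 (label 1) if the score is >= 0, else C2 (label 0)."""
    return 1 if linear_score(x, w, w0) >= 0 else 0

# Hypothetical weights for a 2D input
w = np.array([1.0, -2.0])
w0 = 0.5
print(predict_class(np.array([3.0, 1.0]), w, w0))  # score = 3 - 2 + 0.5 = 1.5, so class C1
```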


Generalized Linear Models

y(x, w) = f (xT w + w0 )

The decision surfaces correspond to y(x, w) = const
It follows that xT w + w0 = const, and hence the decision surfaces are linear functions of x, even if the activation function f is nonlinear
This class of models is called generalized linear models
Because of the nonlinear function f, they are no longer linear in the parameters
This leads to more complex analytical and computational properties than in regression
As we did for regression, we can consider fixed nonlinear basis functions


Notation

In two-class problems, we have a binary target value t ∈ {0, 1}, such that
t = 1 is the positive class and t = 0 is the negative class
We can interpret the value of t as the probability of the positive class
The output of the model can be represented as the probability that the
model assigns to the positive class
If there are K classes, we can use a 1-of-K encoding scheme
t is a vector of length K and contains a single 1 for the correct class and 0
elsewhere
Example: if K = 5, then an input that belongs to class 2 corresponds to
target vector:
t = (0, 1, 0, 0, 0)T
We can interpret t as a vector of class probabilities
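A small sketch of this 1-of-K encoding (the helper name and the 1-based class index are my own choices, not from the slides):

```python
import numpy as np

def one_hot(k, K):
    """Return the 1-of-K target vector for class k (1-based, as in the slides)."""
    t = np.zeros(K)
    t[k - 1] = 1.0
    return t

print(one_hot(2, 5))  # [0. 1. 0. 0. 0.], the target vector of the example above
```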


Three Approaches to Classification

Discriminant function: build a function that directly maps each input to a specific class
Probabilistic approach: model the conditional probability distribution
p(Ck |x) and use it to make optimal decisions. Two alternatives:
Probabilistic discriminative approach: Model p(Ck |x) directly, for
instance using parametric models (e.g., logistic regression)
Probabilistic generative approach: Model class conditional densities
p(x|Ck ) together with prior probabilities p(Ck ) for the classes. Infer the
posterior using Bayes’ rule:

p(Ck|x) = p(x|Ck) p(Ck) / p(x)

For example, we could fit a multivariate Gaussian to the input vectors of each class. Given a test vector, we see under which Gaussian the vector is most probable
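As a sketch of this generative approach (the helper functions below are my own illustration, not code from the slides), one could fit a Gaussian and a prior per class and then apply Bayes' rule:

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density N(x | mu, Sigma), computed with plain numpy."""
    Sigma = np.atleast_2d(Sigma)
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

def fit_class_gaussians(X, t):
    """Fit a mean, covariance and prior p(C_k) per class from labelled data (X: N x D, t: N)."""
    params = {}
    for k in np.unique(t):
        Xk = X[t == k]
        params[k] = (Xk.mean(axis=0), np.cov(Xk, rowvar=False), len(Xk) / len(X))
    return params

def class_posteriors(x, params):
    """p(C_k | x) obtained via Bayes' rule from p(x | C_k) and p(C_k)."""
    joint = {k: prior * gaussian_pdf(x, mu, Sigma) for k, (mu, Sigma, prior) in params.items()}
    evidence = sum(joint.values())   # p(x) = sum_k p(x | C_k) p(C_k)
    return {k: v / evidence for k, v in joint.items()}
```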



Discriminant Functions

Two classes
y(x) = xT w + w0
Assign x to class C1 if y(x) ≥ 0, and to class C2 otherwise
Decision boundary: y(x) = 0
Given two points on the decision
surface:

y(xA ) = y(xB ) = 0
wT (xA − xB ) = 0

w is orthogonal to the decision surface
If x is on the decision surface, then xT w / ‖w‖2 = −w0 / ‖w‖2
Hence w0 determines the location of the decision surface
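A tiny numerical check of these two facts, with hypothetical values of w and w0:

```python
import numpy as np

w = np.array([2.0, 1.0])    # hypothetical weight vector
w0 = -3.0                   # hypothetical bias

# Two points on the decision surface, i.e. with y(x) = x^T w + w0 = 0
xA = np.array([1.5, 0.0])
xB = np.array([0.0, 3.0])

print(w @ (xA - xB))              # 0.0: w is orthogonal to the decision surface
print(-w0 / np.linalg.norm(w))    # signed distance of the decision surface from the origin
```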

Multiple classes

Consider the extension to K > 2 classes
One-versus-the-rest: K − 1 classifiers, each of which solves a two-class problem:
  Separate points in class Ck from points not in that class
  There are regions in input space that are ambiguously classified
One-versus-one: K(K − 1)/2 binary discriminant functions
  Each function discriminates between two particular classes
  Similar problems of ambiguity


Multiple classes: Simple Solution


Use K linear discriminant functions of the form yk(x) = xT wk + wk0, where k = 1, . . . , K
Assign x to class Ck if yk(x) > yj(x) ∀j ≠ k
The resulting decision regions are singly connected and convex
For any two points xA and xB that lie inside the region Rk:

yk(xA) > yj(xA) and yk(xB) > yj(xB)

implies that, for 0 ≤ α ≤ 1,

yk(αxA + (1 − α)xB) > yj(αxA + (1 − α)xB)

due to the linearity of the discriminant functions
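A minimal sketch of this K-class discriminant (assign x to the class whose yk(x) is largest); the weight values below are hypothetical:

```python
import numpy as np

def classify(x, W, w0):
    """Assign x to the class with the largest linear discriminant y_k(x) = x^T w_k + w_k0.

    W is a D x K matrix whose columns are the w_k, w0 a length-K vector of biases (my own shapes).
    """
    scores = x @ W + w0
    return np.argmax(scores)

# Hypothetical 3-class example in 2D
W = np.array([[1.0, -1.0, 0.0],
              [0.0,  1.0, 1.0]])
w0 = np.array([0.0, 0.0, -0.5])
print(classify(np.array([2.0, 1.0]), W, w0))  # picks the class with the highest score
```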
Least Squares

Least Squares for Classification

Consider a general classification problem with K classes, using a 1-of-K encoding scheme for the target vector t
Recall: least squares approximates the conditional expectation E[t|x]
Each class is described by its own linear model:

yk(x) = wkT x + wk0, where k = 1, . . . , K

Using vector notation:

y(x) = W̃T x̃

where W̃ is a (D + 1) × K matrix whose k-th column is w̃k = (wk0, wkT)T and x̃ = (1, xT)T


Least Squares for Classification

Given a dataset D = {xi, ti}, where i = 1, . . . , N
We have already seen how to minimize the sum-of-squares error:

W̃ = (X̃T X̃)−1 X̃T T

X̃ is an N × (D + 1) matrix whose i-th row is x̃iT
T is an N × K matrix whose i-th row is tiT
A new input x is assigned to the class for which tk = x̃T w̃k is largest
There are problems in using least squares for classification
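A short sketch of least-squares classification under these definitions (the function names are mine; `pinv` is used instead of an explicit inverse purely for numerical robustness):

```python
import numpy as np

def fit_least_squares(X, T):
    """W_tilde = (X_tilde^T X_tilde)^{-1} X_tilde^T T, with X: N x D and one-hot targets T: N x K."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the bias component, x_tilde = (1, x^T)^T
    return np.linalg.pinv(X_tilde.T @ X_tilde) @ X_tilde.T @ T

def predict_classes(X, W_tilde):
    """Assign each input to the class whose linear model output is largest."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(X_tilde @ W_tilde, axis=1)
```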


Problems using Least Squares: Outliers

Least squares is highly sensitive to outliers, unlike logistic regression


Problems using Least Squares: Non-Gaussian Distributions

The reason for the failure is the assumption of a Gaussian conditional distribution, which is not satisfied by binary target vectors


Fixed Basis Functions

So far, we have considered classification models that work directly in the input space
All the considered algorithms are equally applicable if we first make a fixed nonlinear transformation of the input space using a vector of basis functions φ(x)
Decision boundaries will be linear in the feature space, but will correspond to nonlinear boundaries in the original input space
Classes that are linearly separable in the feature space need not be linearly separable in the original input space


Linear Basis Function Models


[Figure: original input space (left) and feature space (right)]

We consider two Gaussian basis functions, with centers shown by the green crosses and contours shown by the green circles
The linear decision boundary in the feature space (right) is obtained using logistic regression and corresponds to a nonlinear decision boundary in the original input space (left)
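A sketch of such a fixed nonlinear transformation (two Gaussian basis functions; the centers and width below are illustrative assumptions):

```python
import numpy as np

def gaussian_basis(X, centers, s=1.0):
    """Map inputs to features phi_j(x) = exp(-||x - mu_j||^2 / (2 s^2)), one feature per center."""
    # X: N x D inputs, centers: M x D basis centers  ->  Phi: N x M feature matrix
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * s ** 2))

# Two hypothetical Gaussian basis functions in a 2D input space
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
X = np.array([[0.1, 0.0], [2.0, 2.0]])
Phi = gaussian_basis(X, centers)
# Any classifier that is linear in Phi gives a nonlinear decision boundary in the original x space
```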
The Perceptron Algorithm

Perceptron

The perceptron (Rosenblatt, 1958) is another example of a linear discriminant model
It is an online linear classification algorithm
It corresponds to a two-class model:

y(x) = f(wT φ(x)), where f(a) = +1 if a ≥ 0, −1 if a < 0

Target values are +1 for C1 and −1 for C2
The algorithm finds the separating hyperplane by minimizing the distance of misclassified points to the decision boundary
Using the number of misclassified points as a loss function is not effective, since it is a piecewise constant function of w


Perceptron Criterion
We are seeking a vector w such that wT φ(xn ) > 0 when xn ∈ C1 and
wT φ(xn ) < 0 otherwise
The perceptron criterion assigns
  zero error to correctly classified patterns
  −wT φ(xn)tn to misclassified patterns xn (a positive quantity, proportional to the distance from the decision boundary)
The loss function to be minimized is

LP(w) = − Σ_{n∈M} wT φ(xn)tn

where M is the set of misclassified patterns

Minimization is performed using stochastic gradient descent:

w(k+1) = w(k) − α∇LP (w) = w(k) + αφ(xn )tn

Since the perceptron function does not change if w is multiplied by a positive constant, the learning rate α can be set to 1

Algorithm

Input: data set xn ∈ RD, tn ∈ {−1, +1}, for n = 1 : N
Initialize w0
k ← 0
repeat
  k ← k + 1
  n ← 1 + (k − 1) mod N
  t̂n ← f(wkT φ(xn))
  if t̂n ≠ tn then
    wk+1 ← wk + φ(xn)tn
  else
    wk+1 ← wk
  end if
until convergence
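A runnable sketch of the algorithm above (the zero initialization and the `max_epochs` cap replacing "until convergence" are my own illustrative choices):

```python
import numpy as np

def perceptron(Phi, t, max_epochs=100):
    """Perceptron learning on precomputed features Phi (N x M) with targets t in {-1, +1}."""
    N, M = Phi.shape
    w = np.zeros(M)                            # initialize w
    for _ in range(max_epochs):                # "until convergence", capped for safety
        n_mistakes = 0
        for n in range(N):                     # cycle through the data
            t_hat = 1 if w @ Phi[n] >= 0 else -1
            if t_hat != t[n]:                  # update only on misclassified patterns
                w = w + Phi[n] * t[n]          # w_{k+1} <- w_k + phi(x_n) t_n
                n_mistakes += 1
        if n_mistakes == 0:                    # converged: every pattern is classified correctly
            break
    return w
```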


Perceptron Convergence Theorem


The effect of a single update is to reduce the error due to the misclassified pattern:

−w(k+1)T φ(xn)tn = −w(k)T φ(xn)tn − (φ(xn)tn)T φ(xn)tn < −w(k)T φ(xn)tn
This does not imply that the loss is reduced at each stage
Theorem (Perceptron Convergence Theorem)
If the training data set is linearly separable in the feature space Φ, then the
perceptron learning algorithm is guaranteed to find an exact solution in a
finite number of steps

The number of steps before convergence may be substantial
We are not able to distinguish between nonseparable problems and slowly converging ones
If multiple solutions exist, the one found depends on the initialization of the parameters and on the order of presentation of the data points
Probabilistic Discriminative Models

Logistic Regression

Consider the problem of two-class classification


The posterior probability of class C1 can be written as a logistic sigmoid function:

p(C1|φ) = 1 / (1 + exp(−wT φ)) = σ(wT φ)

where p(C2|φ) = 1 − p(C1|φ) and the bias term is omitted for clarity
This model is known as logistic regression (even if it is for classification!)
Differently from generative models, here we model p(Ck|φ) directly


Maximum Likelihood for Logistic Regression


Given a dataset D = {xn , tn }, n = 1, . . . , N; tn ∈ {0, 1}
Maximize the probability of getting the right label:
p(t|X, w) = ∏_{n=1}^{N} yn^tn (1 − yn)^(1−tn),  where yn = σ(wT φn)

Taking the negative log of the likelihood, we can define the cross-entropy error function (to be minimized):

L(w) = − ln p(t|X, w) = − Σ_{n=1}^{N} (tn ln yn + (1 − tn) ln(1 − yn)) = Σ_{n=1}^{N} Ln

Differentiating and using the chain rule:

∂Ln/∂yn = (yn − tn) / (yn (1 − yn)),   ∂yn/∂w = yn (1 − yn) φn,   ∂Ln/∂w = (∂Ln/∂yn)(∂yn/∂w) = (yn − tn) φn


Maximum Likelihood for Logistic Regression

The gradient of the loss function is

∇L(w) = Σ_{n=1}^{N} (yn − tn) φn

It has the same form as the gradient of the sum-of-squares error function for linear regression
There is no closed-form solution, due to the nonlinearity of the logistic sigmoid function
The error function is convex and can be optimized by standard gradient-based optimization techniques
Easy to adapt to the online learning setting
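A minimal sketch of batch gradient descent on this loss (the step size, the number of iterations, and the gradient averaging are illustrative choices, not prescriptions from the slides):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_regression(Phi, t, alpha=0.1, n_iters=1000):
    """Minimize the cross-entropy loss by gradient descent; Phi: N x M features, t in {0, 1}."""
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iters):
        y = sigmoid(Phi @ w)            # y_n = sigma(w^T phi_n)
        grad = Phi.T @ (y - t)          # sum_n (y_n - t_n) phi_n
        w -= alpha * grad / N           # averaged gradient step, less sensitive to the data set size
    return w
```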


Multiclass Logistic Regression

For the multiclass case, we represent the posterior probabilities by a softmax transformation of linear functions of the feature variables:

p(Ck|φ) = yk(φ) = exp(wkT φ) / Σ_j exp(wjT φ)

Differently from generative models, here we will use maximum likelihood to determine the parameters of this discriminative model directly:

p(T|Φ, w1, . . . , wK) = ∏_{n=1}^{N} ∏_{k=1}^{K} p(Ck|φn)^tnk = ∏_{n=1}^{N} ∏_{k=1}^{K} ynk^tnk

(for each n, only the term corresponding to the correct class contributes)

where ynk = p(Ck|φn) = exp(wkT φn) / Σ_j exp(wjT φn)


Multiclass Logistic Regression

Taking the negative logarithm gives the cross-entropy function for the multiclass classification problem:

L(w1, . . . , wK) = − ln p(T|Φ, w1, . . . , wK) = − Σ_{n=1}^{N} Σ_{k=1}^{K} tnk ln ynk

Taking the gradient with respect to wj:

∇wj L(w1, . . . , wK) = Σ_{n=1}^{N} (ynj − tnj) φn
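A sketch of the softmax posterior and of this gradient computed for all classes at once (the matrix shapes are my own convention):

```python
import numpy as np

def softmax(A):
    """Row-wise softmax, with the usual max subtraction for numerical stability."""
    A = A - A.max(axis=1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy_gradients(Phi, T, W):
    """Gradients of the multiclass cross-entropy, one column per w_j.

    Phi: N x M features, T: N x K one-hot targets, W: M x K weights (columns are the w_j).
    """
    Y = softmax(Phi @ W)                # y_nk = p(C_k | phi_n)
    return Phi.T @ (Y - T)              # column j equals sum_n (y_nj - t_nj) phi_n
```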


Connection Between Logistic Regression and Perceptron Algorithm
If we replace the logistic function with a step function

Logistic: y(x, w) = 1 / (1 + exp(−w · φ(x)))

Step: y(x, w) = 1 if w · φ(x) > 0, 0 otherwise
Both algorithms use the same updating rule:
w ← w + α (tn − y(xn, w)) φn
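A compact sketch of this shared sequential rule (the function names and 0/1 targets are illustrative):

```python
import numpy as np

def sequential_update(w, phi_n, t_n, activation, alpha=1.0):
    """One update w <- w + alpha (t_n - y_n) phi_n, with y_n = activation(w . phi_n)."""
    y_n = activation(w @ phi_n)
    return w + alpha * (t_n - y_n) * phi_n

logistic = lambda a: 1.0 / (1.0 + np.exp(-a))   # gives stochastic gradient descent for logistic regression
step = lambda a: 1.0 if a > 0 else 0.0          # gives the perceptron rule (with 0/1 targets)
```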