
Machine Learning

Linear Models for Classification

Marcello Restelli
March 15, 2017
Outline

1 Linear Classification

2 Discriminant Functions
Least Squares
The Perceptron Algorithm

3 Probabilistic Discriminative Models



Linear Classification

Classification Problems

The goal of classification is to assign an input x to one of K discrete classes Ck, where k = 1, . . . , K
Typically, each input is assigned to only one class
Example: the input vector x is the set of pixel intensities, and the output variable t represents the presence of cancer (class C1) or absence of cancer (class C2)


Linear Classification

In linear classification, the input space is divided into decision regions whose boundaries are called decision boundaries or decision surfaces
We will consider linear models for classification
In the linear regression case, the model is linear in the parameters:

y(x, w) = w0 + Σ_{j=1}^{D−1} wj xj = xT w + w0

For classification, we need to predict discrete class labels, or posterior probabilities that lie in the range (0, 1), so we use a nonlinear function:

y(x, w) = f(xT w + w0)
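As an illustration, here is a minimal sketch of such a model (not from the slides; the weight values and the choice of a simple threshold activation are illustrative assumptions): a linear score xT w + w0 is passed through a nonlinear function f to produce a class label.

```python
import numpy as np

def linear_score(x, w, w0):
    """Compute the linear score x^T w + w0."""
    return x @ w + w0

def predict_class(x, w, w0):
    """Threshold activation: assign class C1 (label 1) if the score is >= 0, else C2 (label 0)."""
    return 1 if linear_score(x, w, w0) >= 0 else 0

# Hypothetical weights for a 2D input
w = np.array([1.0, -2.0])
w0 = 0.5
print(predict_class(np.array([3.0, 1.0]), w, w0))  # score = 3 - 2 + 0.5 = 1.5, so class C1
```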


Generalized Linear Models

y(x, w) = f (xT w + w0 )

The decision surfaces correspond to y(x, w) = const
It follows that xT w + w0 = const, and hence the decision surfaces are linear functions of x, even if the activation function f is nonlinear
This class of models is called generalized linear models
Because of the nonlinear function f, they are no longer linear in the parameters
This leads to more complex analytical and computational properties than in regression
As we did for regression, we can consider fixed nonlinear basis functions


Notation

In two-class problems, we have a binary target value t ∈ {0, 1}, such that
t = 1 is the positive class and t = 0 is the negative class
We can interpret the value of t as the probability of the positive class
The output of the model can be represented as the probability that the
model assigns to the positive class
If there are K classes, we can use a 1-of-K encoding scheme
t is a vector of length K and contains a single 1 for the correct class and 0
elsewhere
Example: if K = 5, then an input that belongs to class 2 corresponds to
target vector:
t = (0, 1, 0, 0, 0)T
We can interpret t as a vector of class probabilities
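A small sketch of this 1-of-K encoding (the helper name and the 1-based class index are my own choices, not from the slides):

```python
import numpy as np

def one_hot(k, K):
    """Return the 1-of-K target vector for class k (1-based, as in the slides)."""
    t = np.zeros(K)
    t[k - 1] = 1.0
    return t

print(one_hot(2, 5))  # [0. 1. 0. 0. 0.], the target vector of the example above
```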


Three Approaches to Classification

Discriminant function: build a function that directly maps each input to a specific class
Probabilistic approach: model the conditional probability distribution
p(Ck |x) and use it to make optimal decisions. Two alternatives:
Probabilistic discriminative approach: Model p(Ck |x) directly, for
instance using parametric models (e.g., logistic regression)
Probabilistic generative approach: Model class conditional densities
p(x|Ck ) together with prior probabilities p(Ck ) for the classes. Infer the
posterior using Bayes’ rule:

p(Ck|x) = p(x|Ck) p(Ck) / p(x)

For example, we could fit a multivariate Gaussian to the input vectors of each class. Given a test vector, we see under which Gaussian the vector is most probable
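As a sketch of this generative approach (the helper functions below are my own illustration, not code from the slides), one could fit a Gaussian and a prior per class and then apply Bayes' rule:

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density N(x | mu, Sigma), computed with plain numpy."""
    Sigma = np.atleast_2d(Sigma)
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

def fit_class_gaussians(X, t):
    """Fit a mean, covariance and prior p(C_k) per class from labelled data (X: N x D, t: N)."""
    params = {}
    for k in np.unique(t):
        Xk = X[t == k]
        params[k] = (Xk.mean(axis=0), np.cov(Xk, rowvar=False), len(Xk) / len(X))
    return params

def class_posteriors(x, params):
    """p(C_k | x) obtained via Bayes' rule from p(x | C_k) and p(C_k)."""
    joint = {k: prior * gaussian_pdf(x, mu, Sigma) for k, (mu, Sigma, prior) in params.items()}
    evidence = sum(joint.values())   # p(x) = sum_k p(x | C_k) p(C_k)
    return {k: v / evidence for k, v in joint.items()}
```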



Discriminant Functions

Two classes
y(x) = xT w + w0
Assign x to class C1 if y(x) ≥ 0, and to class C2 otherwise
Decision boundary: y(x) = 0
Given two points on the decision
surface:

y(xA ) = y(xB ) = 0
wT (xA − xB ) = 0

w is orthogonal to the decision surface
If x is on the decision surface, then xT w / ‖w‖2 = −w0 / ‖w‖2
Hence w0 determines the location of the decision surface
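A tiny numerical check of these two facts, with hypothetical values of w and w0:

```python
import numpy as np

w = np.array([2.0, 1.0])    # hypothetical weight vector
w0 = -3.0                   # hypothetical bias

# Two points on the decision surface, i.e. with y(x) = x^T w + w0 = 0
xA = np.array([1.5, 0.0])
xB = np.array([0.0, 3.0])

print(w @ (xA - xB))              # 0.0: w is orthogonal to the decision surface
print(-w0 / np.linalg.norm(w))    # signed distance of the decision surface from the origin
```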

Multiple classes

Consider the extension to K > 2 classes
One-versus-the-rest: K − 1 classifiers, each of which solves a two-class problem:
  Separate points in class Ck from points not in that class
  There are regions in input space that are ambiguously classified
One-versus-one: K(K − 1)/2 binary discriminant functions
  Each function discriminates between two particular classes
  Similar problems of ambiguity


Multiple classes: Simple Solution


Use K linear discriminant functions of the form yk(x) = xT wk + wk0, where k = 1, . . . , K
Assign x to class Ck if yk(x) > yj(x) ∀j ≠ k
The resulting decision regions are singly connected and convex
For any two points xA and xB that lie inside the region Rk:

yk(xA) > yj(xA) and yk(xB) > yj(xB)

implies that, for 0 ≤ α ≤ 1,

yk(αxA + (1 − α)xB) > yj(αxA + (1 − α)xB)

due to the linearity of the discriminant functions
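A minimal sketch of this K-class discriminant (assign x to the class whose yk(x) is largest); the weight values below are hypothetical:

```python
import numpy as np

def classify(x, W, w0):
    """Assign x to the class with the largest linear discriminant y_k(x) = x^T w_k + w_k0.

    W is a D x K matrix whose columns are the w_k, w0 a length-K vector of biases (my own shapes).
    """
    scores = x @ W + w0
    return np.argmax(scores)

# Hypothetical 3-class example in 2D
W = np.array([[1.0, -1.0, 0.0],
              [0.0,  1.0, 1.0]])
w0 = np.array([0.0, 0.0, -0.5])
print(classify(np.array([2.0, 1.0]), W, w0))  # picks the class with the highest score
```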
Least Squares

Least Squares for Classification

Consider a general classification problem with K classes, using a 1-of-K encoding scheme for the target vector t
Recall: least squares approximates the conditional expectation E[t|x]
Each class is described by its own linear model:

yk(x) = wkT x + wk0, where k = 1, . . . , K

Using vector notation:

y(x) = W̃T x̃

where W̃ is a (D + 1) × K matrix whose k-th column is w̃k = (wk0, wkT)T and x̃ = (1, xT)T


Least Squares for Classification

Given a dataset D = {xi, ti}, where i = 1, . . . , N
We have already seen how to minimize the sum-of-squares error:

W̃ = (X̃T X̃)−1 X̃T T

X̃ is an N × (D + 1) matrix whose i-th row is x̃iT
T is an N × K matrix whose i-th row is tiT
A new input x is assigned to the class for which tk = x̃T w̃k is largest
There are problems in using least squares for classification
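A short sketch of least-squares classification under these definitions (the function names are mine; `pinv` is used instead of an explicit inverse purely for numerical robustness):

```python
import numpy as np

def fit_least_squares(X, T):
    """W_tilde = (X_tilde^T X_tilde)^{-1} X_tilde^T T, with X: N x D and one-hot targets T: N x K."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the bias component, x_tilde = (1, x^T)^T
    return np.linalg.pinv(X_tilde.T @ X_tilde) @ X_tilde.T @ T

def predict_classes(X, W_tilde):
    """Assign each input to the class whose linear model output is largest."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(X_tilde @ W_tilde, axis=1)
```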


Problems using Least Squares: Outliers

Least squares is highly sensitive to outliers, unlike logistic regression


Problems using Least Squares: Non-Gaussian Distributions

The reason for the failure is the assumption of a Gaussian conditional distribution, which is not satisfied by binary target vectors


Fixed Basis Functions

So far, we have considered classification models that work directly in the input space
All the considered algorithms are equally applicable if we first make a fixed nonlinear transformation of the input space using a vector of basis functions φ(x)
Decision boundaries will be linear in the feature space, but will correspond to nonlinear boundaries in the original input space
Classes that are linearly separable in the feature space need not be linearly separable in the original input space


Linear Basis Function Models


[Figure: original input space (left) and feature space (right)]

We consider two Gaussian basis functions, with centers shown by the green crosses and contours shown by the green circles
The linear decision boundary in the feature space (right) is obtained using logistic regression and corresponds to a nonlinear decision boundary in the original input space (left)
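A sketch of such a fixed nonlinear transformation (two Gaussian basis functions; the centers and width below are illustrative assumptions):

```python
import numpy as np

def gaussian_basis(X, centers, s=1.0):
    """Map inputs to features phi_j(x) = exp(-||x - mu_j||^2 / (2 s^2)), one feature per center."""
    # X: N x D inputs, centers: M x D basis centers  ->  Phi: N x M feature matrix
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * s ** 2))

# Two hypothetical Gaussian basis functions in a 2D input space
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
X = np.array([[0.1, 0.0], [2.0, 2.0]])
Phi = gaussian_basis(X, centers)
# Any classifier that is linear in Phi gives a nonlinear decision boundary in the original x space
```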
The Perceptron Algorithm

Perceptron

The perceptron (Rosenblatt, 1958) is another example of a linear discriminant model
It is an online linear classification algorithm
It corresponds to a two-class model:

y(x) = f(wT φ(x)), where f(a) = +1 if a ≥ 0, −1 if a < 0

Target values are +1 for C1 and −1 for C2
The algorithm finds the separating hyperplane by minimizing the distance of misclassified points to the decision boundary
Using the number of misclassified points as a loss function is not effective, since it is a piecewise constant function of w


Perceptron Criterion
We are seeking a vector w such that wT φ(xn ) > 0 when xn ∈ C1 and
wT φ(xn ) < 0 otherwise
The perceptron criterion assigns
  zero error to correctly classified patterns
  −wT φ(xn)tn to misclassified patterns xn (a positive quantity, proportional to the distance from the decision boundary)
The loss function to be minimized is

LP(w) = − Σ_{n∈M} wT φ(xn)tn

where M is the set of misclassified patterns

Minimization is performed using stochastic gradient descent:

w(k+1) = w(k) − α∇LP (w) = w(k) + αφ(xn )tn

Since the perceptron function does not change if w is multiplied by a positive constant, the learning rate α can be set to 1

Algorithm

Input: data set xn ∈ RD, tn ∈ {−1, +1}, for n = 1 : N
Initialize w0
k ← 0
repeat
  k ← k + 1
  n ← 1 + (k − 1) mod N
  t̂n ← f(wkT φ(xn))
  if t̂n ≠ tn then
    wk+1 ← wk + φ(xn)tn
  else
    wk+1 ← wk
  end if
until convergence
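A runnable sketch of the algorithm above (the zero initialization and the `max_epochs` cap replacing "until convergence" are my own illustrative choices):

```python
import numpy as np

def perceptron(Phi, t, max_epochs=100):
    """Perceptron learning on precomputed features Phi (N x M) with targets t in {-1, +1}."""
    N, M = Phi.shape
    w = np.zeros(M)                            # initialize w
    for _ in range(max_epochs):                # "until convergence", capped for safety
        n_mistakes = 0
        for n in range(N):                     # cycle through the data
            t_hat = 1 if w @ Phi[n] >= 0 else -1
            if t_hat != t[n]:                  # update only on misclassified patterns
                w = w + Phi[n] * t[n]          # w_{k+1} <- w_k + phi(x_n) t_n
                n_mistakes += 1
        if n_mistakes == 0:                    # converged: every pattern is classified correctly
            break
    return w
```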


Perceptron Convergence Theorem


The effect of a single update is to reduce the error due to the misclassified pattern:

−w(k+1)T φ(xn)tn = −w(k)T φ(xn)tn − (φ(xn)tn)T φ(xn)tn < −w(k)T φ(xn)tn
This does not imply that the loss is reduced at each stage
Theorem (Perceptron Convergence Theorem)
If the training data set is linearly separable in the feature space Φ, then the
perceptron learning algorithm is guaranteed to find an exact solution in a
finite number of steps

The number of steps before convergence may be substantial
We are not able to distinguish between nonseparable problems and slowly converging ones
If multiple solutions exist, the one found depends on the initialization of the parameters and on the order of presentation of the data points
Probabilistic Discriminative Models

Logistic Regression

Consider the problem of two-class classification


The posterior probability of class C1 can be written as a logistic sigmoid function:

p(C1|φ) = 1 / (1 + exp(−wT φ)) = σ(wT φ)

where p(C2|φ) = 1 − p(C1|φ) and the bias term is omitted for clarity
This model is known as logistic regression (even if it is for classification!)
Differently from generative models, here we model p(Ck|φ) directly


Maximum Likelihood for Logistic Regression


Given a dataset D = {xn , tn }, n = 1, . . . , N; tn ∈ {0, 1}
Maximize the probability of getting the right label:
p(t|X, w) = ∏_{n=1}^{N} yn^tn (1 − yn)^(1−tn),  where yn = σ(wT φn)

Taking the negative log of the likelihood, we can define the cross-entropy error function (to be minimized):

L(w) = − ln p(t|X, w) = − Σ_{n=1}^{N} (tn ln yn + (1 − tn) ln(1 − yn)) = Σ_{n=1}^{N} Ln

Differentiating and using the chain rule:

∂Ln/∂yn = (yn − tn) / (yn (1 − yn)),   ∂yn/∂w = yn (1 − yn) φn,   ∂Ln/∂w = (∂Ln/∂yn)(∂yn/∂w) = (yn − tn) φn


Maximum Likelihood for Logistic Regression

The gradient of the loss function is

∇L(w) = Σ_{n=1}^{N} (yn − tn) φn

It has the same form as the gradient of the sum-of-squares error function for linear regression
There is no closed-form solution, due to the nonlinearity of the logistic sigmoid function
The error function is convex and can be optimized by standard gradient-based optimization techniques
Easy to adapt to the online learning setting
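A minimal sketch of batch gradient descent on this loss (the step size, the number of iterations, and the gradient averaging are illustrative choices, not prescriptions from the slides):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_regression(Phi, t, alpha=0.1, n_iters=1000):
    """Minimize the cross-entropy loss by gradient descent; Phi: N x M features, t in {0, 1}."""
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iters):
        y = sigmoid(Phi @ w)            # y_n = sigma(w^T phi_n)
        grad = Phi.T @ (y - t)          # sum_n (y_n - t_n) phi_n
        w -= alpha * grad / N           # averaged gradient step, less sensitive to the data set size
    return w
```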


Multiclass Logistic Regression

For the multiclass case, we represent the posterior probabilities by a softmax transformation of linear functions of the feature variables:

p(Ck|φ) = yk(φ) = exp(wkT φ) / Σ_j exp(wjT φ)

Differently from generative models, here we will use maximum likelihood to determine the parameters of this discriminative model directly:

p(T|Φ, w1, . . . , wK) = ∏_{n=1}^{N} ∏_{k=1}^{K} p(Ck|φn)^tnk = ∏_{n=1}^{N} ∏_{k=1}^{K} ynk^tnk

(for each n, only the term corresponding to the correct class contributes)

where ynk = p(Ck|φn) = exp(wkT φn) / Σ_j exp(wjT φn)


Multiclass Logistic Regression

Taking the negative logarithm gives the cross-entropy function for the multiclass classification problem:

L(w1, . . . , wK) = − ln p(T|Φ, w1, . . . , wK) = − Σ_{n=1}^{N} Σ_{k=1}^{K} tnk ln ynk

Taking the gradient with respect to wj:

∇wj L(w1, . . . , wK) = Σ_{n=1}^{N} (ynj − tnj) φn
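A sketch of the softmax posterior and of this gradient computed for all classes at once (the matrix shapes are my own convention):

```python
import numpy as np

def softmax(A):
    """Row-wise softmax, with the usual max subtraction for numerical stability."""
    A = A - A.max(axis=1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy_gradients(Phi, T, W):
    """Gradients of the multiclass cross-entropy, one column per w_j.

    Phi: N x M features, T: N x K one-hot targets, W: M x K weights (columns are the w_j).
    """
    Y = softmax(Phi @ W)                # y_nk = p(C_k | phi_n)
    return Phi.T @ (Y - T)              # column j equals sum_n (y_nj - t_nj) phi_n
```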


Connection Between Logistic Regression and Perceptron Algorithm
If we replace the logistic function with a step function

Logistic: y(x, w) = 1 / (1 + exp(−w · φ(x)))

Step: y(x, w) = 1 if w · φ(x) > 0, 0 otherwise
Both algorithms use the same updating rule:
w ← w + α (tn − y(xn, w)) φn
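A compact sketch of this shared sequential rule (the function names and 0/1 targets are illustrative):

```python
import numpy as np

def sequential_update(w, phi_n, t_n, activation, alpha=1.0):
    """One update w <- w + alpha (t_n - y_n) phi_n, with y_n = activation(w . phi_n)."""
    y_n = activation(w @ phi_n)
    return w + alpha * (t_n - y_n) * phi_n

logistic = lambda a: 1.0 / (1.0 + np.exp(-a))   # gives stochastic gradient descent for logistic regression
step = lambda a: 1.0 if a > 0 else 0.0          # gives the perceptron rule (with 0/1 targets)
```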