
Introduction to Machine Learning 2019

Tutorial 2

Harun Mustafa
harun.mustafa@inf.ethz.ch

D-INFK

1 Notation

Vectors are indicated by bold lower-case symbols x, while matrices are indicated by bold upper-case
symbols X. The notation x𝑗 is used to denote the 𝑗-th component of x, and x𝑖 denotes the 𝑖-th data point,
so x𝑗𝑖 denotes the 𝑗-th component of the 𝑖-th data point.

2 Linear regression

2.1 Model

Linear regression is a simple linear model which is often used in practice as a baseline model to which
more complex models are compared. Given training data (x𝑖 , 𝑦𝑖 ) ∈ R𝑑 × R, the response variable 𝑦 is
modeled as
𝑦 = w⊤ x + 𝑤0 + 𝜀, (1)

where w ∈ R𝑑 and 𝜀 is a random variable which accounts for variation/noise in the measured value of 𝑦.
For simplicity of notation, one can transform w and x into homogeneous coordinates as follows:
w̃ = [w⊤ , 𝑤0 ]⊤ ,    x̃ = [x⊤ , 1]⊤ .    (2)

Equation 1 can then be rewritten as


𝑦 = w̃⊤ x̃ + 𝜀. (3)

Let us assume, without loss of generality, that in the rest of these notes, the training points x𝑖 in R𝑑
have already been transformed in this fashion and originate from data points in R𝑑−1 (i.e., assume that
x𝑑𝑖 = 1 for each x𝑖 ∈ R𝑑 ).
Given 𝑛 data points {x1 , . . . , x𝑛 } ⊂ R𝑑 with response variables {𝑦1 , . . . , 𝑦𝑛 } ⊂ R, the training data
can be mapped onto a matrix X ∈ R𝑛×𝑑 and a response vector y ∈ R𝑛 , where

X = [x1 · · · x𝑛 ]⊤ ,    y = [𝑦1 · · · 𝑦𝑛 ]⊤    (4)

(i.e., the column vectors x𝑖 become the rows of X). The model in Equation 3 can then be written jointly
for all data as
y = Xw + 𝜀 (5)
(note that 𝜀⊤ = [𝜀1 · · · 𝜀𝑛 ]).
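
A minimal NumPy sketch of this data setup; the array names, shapes, and random data are illustrative assumptions, not part of the notes:

import numpy as np

rng = np.random.default_rng(0)

n, d_raw = 100, 3                       # n data points of raw dimension d - 1
X_raw = rng.normal(size=(n, d_raw))     # each row is one original data point
y = rng.normal(size=n)                  # response vector y in R^n

# Append the homogeneous coordinate: each row becomes [x_i^T, 1],
# so that X in R^{n x d} absorbs the offset w_0 into w.
X = np.hstack([X_raw, np.ones((n, 1))])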

2.2 Training by least-squares

Suppose we have a trained weight vector w. We can define the residuals with respect to (X, y) as

r = y − Xw. (6)

We can then define the least squares loss as

𝑅̂(w) = ‖r‖₂² ,    (7)

which for linear regression becomes

𝑅̂(w) = ‖y − Xw‖₂²
      = (y − Xw)⊤ (y − Xw)
      = y⊤ y − y⊤ Xw − w⊤ X⊤ y + w⊤ X⊤ Xw
      = y⊤ y − 2y⊤ Xw + w⊤ X⊤ Xw        since y⊤ Xw = w⊤ X⊤ y ∈ R

To find the optimal solution w* , we then solve

0 := ∇𝑅̂(w)
   = 2w*⊤ X⊤ X − 2y⊤ X        since ∂(x⊤ Ax)/∂x = x⊤ (A + A⊤ ) and (X⊤ X)⊤ = X⊤ X

which, if X⊤ X is invertible, can be rearranged to get the closed-form solution

w* = (X⊤ X)−1 X⊤ y (8)
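
A minimal NumPy sketch of this closed-form solution (the function name is an assumption for illustration); solving the normal equations X⊤ Xw = X⊤ y with a linear solver is usually preferable to forming the inverse explicitly:

import numpy as np

def least_squares(X, y):
    """Solve the normal equations X^T X w = X^T y for w (Equation 8)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# For rank-deficient X, np.linalg.lstsq(X, y, rcond=None)[0] can be used instead.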

2.3 Ridge regression and 𝐿𝑝 norms

This optimization problem can be made strongly convex by the addition of a regularization term to
impose restrictions on the complexity of the final trained model

𝑅̂ridge (w) = ‖y − Xw‖₂² + 𝜆‖w‖₂² .    (9)

As an effect of this, the model is less able to fit noise in the training data (see Python demo). Note that
this term should only be included during training and not when evaluating error on an independent data
set.
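
As a sketch (not derived in these notes), setting the gradient of Equation 9 to zero analogously to Section 2.2 gives the standard closed form w* = (X⊤ X + 𝜆I)⁻¹ X⊤ y; note that with the homogeneous-coordinate convention used here, the offset 𝑤0 is regularized along with the rest of w:

import numpy as np

def ridge_regression(X, y, lam):
    """Closed-form ridge solution w = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# lam = 0 recovers ordinary least squares; larger lam shrinks w towards zero.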

2.3.1 𝐿𝑝 norms

This norm belongs to the family of 𝐿𝑝 norms, defined as

‖w‖𝑝 = ( ∑_{𝑖=1}^{𝑑} |w𝑖 |^𝑝 )^{1/𝑝} .    (10)

For 𝑝 = ∞, we have the definition


‖w‖∞ = max_{𝑖=1,…,𝑑} |w𝑖 |.    (11)

Computing the partial derivatives of ‖w‖𝑝^𝑝 , we get

∂‖w‖𝑝^𝑝 / ∂w𝑗 = 𝑝 |w𝑗 |^{𝑝−1} sign(w𝑗 ).    (12)
Fig. 1. Plots of the function 𝑦 = |𝑥|^𝑝 for different values of 𝑝 (𝑝 = 0.5, 1, 2, 4).

If 𝑝 = 1, then

∂‖w‖1 / ∂w𝑗 = sign(w𝑗 ),    (13)

whereas when 𝑝 > 1, ∂‖w‖𝑝^𝑝 / ∂w𝑗 → 0 as |w𝑗 | → 0. So when 𝑝 > 1, small non-zero weights with
0 < |w𝑗 | < 1 tend to approach, but never exactly reach, zero, because the shrinking partial derivative
weakens the regularization pressure. When 𝑝 = 1, the derivative keeps magnitude 1, so weights are more
likely to be driven exactly to zero.
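
A small numerical sketch (illustrative values only) of the partial derivative in Equation 12 for 𝑝 = 1 versus 𝑝 = 2, showing how the shrinkage signal vanishes near zero only when 𝑝 > 1:

import numpy as np

def reg_grad(w_j, p):
    """Partial derivative of |w_j|^p with respect to w_j (Equation 12)."""
    return p * np.abs(w_j) ** (p - 1) * np.sign(w_j)

for w_j in [0.5, 0.1, 0.01]:
    # p = 1: constant magnitude 1; p = 2: proportional to |w_j|, fading to 0.
    print(w_j, reg_grad(w_j, p=1), reg_grad(w_j, p=2))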

2.4 Interpretability

An interesting property of linear regression is that the components w𝑗 of the weight vector w can provide a
measure of the importance of the features of x𝑖 (their components x𝑗𝑖 , 𝑗 ∈ {1, . . . , 𝑑}). In particular, a
large |w𝑗 | indicates that the feature x𝑗𝑖 contributes strongly to the prediction and may be considered
important, and vice versa.
In general, however, a small value of |w𝑗 | can also arise when many features are correlated (only one of
them contributes to the solution, and the rest are effectively ignored), so a small weight does not
necessarily indicate an unimportant feature.

2.5 Feature transformations

Non-linear relations between X and y may be learned with linear regression by applying a feature map
𝜑 : R𝑑′ → R𝑑 to the data prior to training. Then, given data z𝑖 ∈ R𝑑′ , the model can be represented
as

𝑦𝑖 = w⊤ 𝜑(z𝑖 ).    (14)

Usually, feature maps can be applied if there is some knowledge of the distributions or structure of
the x𝑗𝑖 . For example, given a periodic function


𝑦(𝑡) = 𝑎0 + ∑_{𝑛=1}^{∞} ( 𝑎𝑛 cos(𝑛𝑡) + 𝑏𝑛 sin(𝑛𝑡) ) + 𝜀,    (15)

a feature map
𝜑(𝑡) = [1, cos(𝑡), sin(𝑡), cos(2𝑡), sin(2𝑡), . . .] (16)

may be defined (see Python demo).
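
A minimal sketch of such a Fourier feature map combined with the least-squares solution from Section 2.2; the truncation order N and the synthetic signal are assumptions chosen for illustration:

import numpy as np

def fourier_features(t, N):
    """Map scalar inputs t to [1, cos(t), sin(t), ..., cos(Nt), sin(Nt)]."""
    cols = [np.ones_like(t)]
    for n in range(1, N + 1):
        cols += [np.cos(n * t), np.sin(n * t)]
    return np.column_stack(cols)

rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, size=200)
y = 1.0 + 2.0 * np.sin(t) - 0.5 * np.cos(3 * t) + 0.1 * rng.normal(size=200)

Phi = fourier_features(t, N=5)                  # design matrix in feature space
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)     # ordinary least squares on Phi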

3 Gradient descent
In cases where a closed-form solution of ∇𝑅̂(w) = 0
– cannot be computed,
– is computationally expensive to compute, or
– is ill-conditioned (i.e., small changes in X lead to large changes in w),
gradient descent can be used to compute a local optimum. It is an iterative meta-algorithm structured
as follows:
1. Initialize w0 (e.g., randomly, with small random values, etc.).
2. Iteratively update
   w𝑡+1 = w𝑡 − 𝜂𝑡 ∇𝑅̂(w𝑡 ).    (17)
3. Terminate when, for some choice of 𝜀,
   𝑅̂(w𝑡 ) − 𝑅̂(w𝑡+1 ) ≤ 𝜀,    (18)

where specific algorithms differ in their methods for picking 𝜂𝑡 at each step.
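
A minimal sketch of this meta-algorithm for the least-squares loss of Section 2.2; the constant step size, initialization, and stopping tolerance are illustrative choices rather than prescriptions from the notes:

import numpy as np

def gradient_descent(X, y, eta=1e-3, eps=1e-8, max_iter=10_000):
    """Minimize R(w) = ||y - Xw||_2^2 with constant-step gradient descent."""
    w = np.zeros(X.shape[1])                  # step 1: initialize w_0
    loss = np.sum((y - X @ w) ** 2)
    for _ in range(max_iter):
        grad = 2 * X.T @ (X @ w - y)          # gradient of the squared loss
        w = w - eta * grad                    # step 2: update (Equation 17)
        new_loss = np.sum((y - X @ w) ** 2)
        if loss - new_loss <= eps:            # step 3: terminate (Equation 18)
            break
        loss = new_loss
    return w

# eta must be small enough (relative to the largest eigenvalue of X^T X)
# for the iteration to converge; see Section 3.1.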

3.1 Step size


For sufficiently small 𝜂𝑡 , the algorithm converges to some local optimum of 𝑅̂. In the case of linear
regression, this convergence happens in a linear number of steps for 𝜂𝑡 = 0.5. Setting a small 𝜂𝑡 increases
the amount of computation time, while a large 𝜂𝑡 may lead to the algorithm not converging (see Figure 2).
In most cases, however, a variable step size is a good way to achieve convergence in an acceptable
number of steps.
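
As a quick illustration of these trade-offs, reusing the hypothetical gradient_descent sketch from Section 3 on synthetic data: a very small 𝜂 makes little progress within the iteration budget, a moderate 𝜂 converges, and a large 𝜂 fails to make progress at all:

import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(100, 3)), np.ones((100, 1))])
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.normal(size=100)

for eta in [1e-6, 1e-3, 1e-1]:                # too small, reasonable, too large
    w = gradient_descent(X, y, eta=eta)       # sketch from Section 3
    print(eta, np.sum((y - X @ w) ** 2))      # residual loss after termination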

Fig. 2. Effect of the learning rate (large η vs. small η). The choices of vectors w𝑡 are plotted in red over
the function 𝑦 = 𝑅̂(w). Adapted from: A Look at Gradient Descent and RMSprop Optimizers.

3.2 Convexity of 𝑅̂

When the loss function 𝑅̂ is convex, we know that all local minima are global,

∇𝑅̂(w* ) = 0 =⇒ 𝑅̂(w* ) = min_{w∈R𝑑 } 𝑅̂(w).    (19)

However, this does not guarantee uniqueness of a particular w* . In fact, there could be infinitely many
solutions, or none at all.
For example, consider the functions
– 𝑓 (x1 , x2 ) = (x1 )² (convex, infinitely many minimizers)
– 𝑓 (𝑥) = 𝑥 (convex, no minimizers)
– 𝑓 (𝑥) = 𝑒^𝑥 (strictly convex, no minimizers)
