Tutorial 2
Harun Mustafa
harun.mustafa@inf.ethz.ch
D-INFK
1 Notation
Vectors are indicated by bold lower-case symbols x, while matrices are indicated by bold upper-case symbols X. The notation x𝑗 is used to denote the 𝑗th component of x; x𝑖 denotes the 𝑖th data point, so x𝑗𝑖 denotes the 𝑗th component of the 𝑖th data point.
2 Linear regression
2.1 Model
Linear regression is a simple linear model which is often used in practice as a baseline model to which
more complex models are compared. Given training data (x𝑖 , 𝑦𝑖 ) ∈ R𝑑 × R, the response variable 𝑦 is
modeled as
𝑦 = w⊤ x + 𝑤0 + 𝜀, (1)
where w ∈ R𝑑 is a weight vector, 𝑤0 ∈ R is a bias (intercept) term, and 𝜀 is a random variable which accounts for variation/noise in the measured value of 𝑦.
For simplicity of notation, one can transform w and x into homogeneous coordinates as follows:
w̃ = [w⊤ , 𝑤0 ]⊤    x̃ = [x⊤ , 1]⊤ . (2)
Let us assume, without loss of generality, that in the rest of these notes, the training points x𝑖 ∈ R𝑑 have already been transformed in this fashion and originate from data points in R𝑑−1 (i.e., assume that x𝑑𝑖 = 1 for each x𝑖 ∈ R𝑑 ), so that the model simplifies to
𝑦 = w⊤ x + 𝜀. (3)
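In code, this transformation amounts to appending a column of ones to the raw data matrix so that the bias 𝑤0 is absorbed into w; a minimal NumPy sketch (the array names are illustrative):

```python
import numpy as np

# Raw data points in R^(d-1) (here d - 1 = 2), one point per row.
X_raw = np.array([[1.0, 2.0],
                  [3.0, 4.0]])

# Append a constant 1 to each point: the bias w_0 becomes the last weight.
X_hom = np.hstack([X_raw, np.ones((X_raw.shape[0], 1))])

print(X_hom)  # each row now ends in 1, i.e. x_i^d = 1
```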
Given 𝑛 data points {x1 , . . . , x𝑛 } ⊂ R𝑑 with response variables {𝑦1 , . . . , 𝑦𝑛 } ⊂ R, the training data can be mapped onto a matrix X ∈ R𝑛×𝑑 and a response vector y ∈ R𝑛 , where
X = [x⊤1 ; · · · ; x⊤𝑛 ] ∈ R𝑛×𝑑    y = [𝑦1 , . . . , 𝑦𝑛 ]⊤ ∈ R𝑛 (4)
(i.e., the column vectors x𝑖 become the rows of X). The model in Equation 3 can then be written jointly
for all data as
y = Xw + 𝜀 (5)
(note that 𝜀 = [𝜀1 , . . . , 𝜀𝑛 ]⊤ ).
2.2 Training by least-squares
Suppose we have a trained weight vector w. We can define the residuals with respect to (X, y) as
r = y − Xw. (6)
The least-squares approach trains w by minimizing the squared 𝐿2 norm of the residuals,

R̂(w) = ‖r‖22 . (7)

Expanding,

R̂(w) = ‖y − Xw‖22
     = (y − Xw)⊤ (y − Xw)
     = y⊤ y − y⊤ Xw − w⊤ X⊤ y + w⊤ X⊤ Xw
     = y⊤ y − 2y⊤ Xw + w⊤ X⊤ Xw    since y⊤ Xw = w⊤ X⊤ y ∈ R.

Since R̂ is convex, a minimizer w* can be found by setting the gradient to zero:

0 := ∇R̂(w* )
   = 2w* ⊤ X⊤ X − 2y⊤ X    since 𝜕(x⊤ Ax)/𝜕x = x⊤ (A + A⊤ ) and (X⊤ X)⊤ = X⊤ X.

Solving yields the normal equations X⊤ Xw* = X⊤ y, so w* = (X⊤ X)−1 X⊤ y when X⊤ X is invertible.
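Setting this gradient to zero yields the normal equations X⊤ Xw* = X⊤ y, which can be checked numerically; a minimal sketch with synthetic data (the true weights are chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: n = 50 points in R^3, last column of ones for the bias.
n, d = 50, 3
X = np.hstack([rng.normal(size=(n, d - 1)), np.ones((n, 1))])
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=n)

# Normal equations: solve (X^T X) w = X^T y directly.
# np.linalg.solve is preferred over forming the inverse explicitly.
w_star = np.linalg.solve(X.T @ X, X.T @ y)

print(w_star)  # close to w_true, up to the injected noise
```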
2.3 Regularization

This optimization problem can be made strongly convex by adding a regularization term that restricts the complexity of the final trained model. As a result, the model is less able to fit noise in the training data (see Python demo). Note that this term should only be included during training, and not when evaluating error on an independent data set.
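One common choice is the squared 𝐿2 penalty (ridge regression), which adds 𝜆‖w‖22 to R̂(w) and yields the closed form w* = (X⊤ X + 𝜆I)−1 X⊤ y; a minimal sketch (the value of 𝜆 is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 30, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

lam = 1.0  # regularization strength, chosen arbitrarily here

# Ridge solution: X^T X + lam * I is positive definite for lam > 0,
# so the objective is strongly convex and the minimizer is unique.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# The penalty shrinks the weights toward zero.
print(np.linalg.norm(w_ridge), np.linalg.norm(w_ols))
```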
2.3.1 𝐿𝑝 norms
The partial derivative of the regularizer ‖w‖𝑝𝑝 with respect to w𝑗 is

𝜕‖w‖𝑝𝑝 /𝜕w𝑗 = 𝑝|w𝑗 |𝑝−1 sign(w𝑗 ). (12)
Fig. 1. 𝐿𝑝 norm contours for 𝑝 = 0.5, 1, 2, and 4.
If 𝑝 = 1, then

𝜕‖w‖1 /𝜕w𝑗 = sign(w𝑗 ), (13)

whereas when 𝑝 > 1, 𝜕‖w‖𝑝𝑝 /𝜕w𝑗 → 0 as |w𝑗 | → 0. So when 𝑝 > 1, small values 0 < |w𝑗 | < 1 tend to approach, but not reach, zero due to the decreasing partial derivative. When 𝑝 = 1, this does not happen, and thus weights are more likely to reach exactly zero.
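This behavior of Equation 12 near zero can be checked numerically: for 𝑝 = 1 the derivative magnitude stays at 1, while for 𝑝 = 2 it vanishes as w𝑗 → 0 (a small illustrative script):

```python
import numpy as np

def lp_grad(w_j: float, p: float) -> float:
    """Partial derivative of ||w||_p^p with respect to w_j (Equation 12)."""
    return p * abs(w_j) ** (p - 1) * np.sign(w_j)

for w_j in [0.5, 0.1, 0.01]:
    # p = 1: constant-magnitude pull toward zero -> weights can reach 0.
    # p = 2: pull shrinks with |w_j| -> weights approach but rarely hit 0.
    print(w_j, lp_grad(w_j, 1.0), lp_grad(w_j, 2.0))
```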
2.4 Interpretability
An interesting property of linear regression is that the components w𝑗 of the weight vector w can provide a measure of the importance of the features of x𝑖 (their components x𝑗𝑖 , 𝑗 ∈ {1, . . . , 𝑑}). In particular, a large |w𝑗 | indicates that x𝑗𝑖 contributes strongly to the solution and may be considered important, and vice versa.
In general, however, a small value of |w𝑗 | can also be achieved when many features are correlated (only one of them contributes to the solution, and the rest are ignored).
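One way this caveat manifests can be illustrated with a perfectly duplicated column: X⊤ X is then singular, and the minimum-norm least-squares solution (via `np.linalg.lstsq`) splits the weight evenly between the two copies, making each |w𝑗 | individually small even though the underlying feature matters:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
f = rng.normal(size=n)

# Two perfectly correlated features: both columns are copies of f.
X = np.column_stack([f, f])
y = 2.0 * f  # the signal uses the feature with weight 2

# X^T X is singular, so take the minimum-norm least-squares solution.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # the weight 2 is split evenly: [1.0, 1.0]
```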
Non-linear relations between X and y may be learned with linear regression by applying a feature map 𝜑 : R𝑑 → R𝑑′ to the x𝑖 data prior to training. Then, given data z𝑖 ∈ R𝑑 , the model can be represented as

𝑦𝑖 = w⊤ 𝜑(z𝑖 ), (14)

where now w ∈ R𝑑′ .
Usually, feature maps can be applied if there is some knowledge of the distributions or structure of
the x𝑗𝑖 . For example, given a periodic function
𝑦(𝑡) = 𝑎0 + ∑𝑛=1..∞ (𝑎𝑛 cos(𝑛𝑡) + 𝑏𝑛 sin(𝑛𝑡)) + 𝜀, (15)

a feature map

𝜑(𝑡) = [1, cos(𝑡), sin(𝑡), cos(2𝑡), sin(2𝑡), . . .] (16)

allows the coefficients 𝑎𝑛 and 𝑏𝑛 to be learned as the components of w.
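Truncating the map in Equation 16 at a finite order gives an ordinary least-squares problem; a sketch recovering the coefficients of the (assumed, illustrative) signal 𝑦(𝑡) = 1 + 2 cos(𝑡) − sin(2𝑡):

```python
import numpy as np

def fourier_features(t: np.ndarray, order: int) -> np.ndarray:
    """phi(t) = [1, cos(t), sin(t), ..., cos(order*t), sin(order*t)]."""
    cols = [np.ones_like(t)]
    for n in range(1, order + 1):
        cols.append(np.cos(n * t))
        cols.append(np.sin(n * t))
    return np.column_stack(cols)

t = np.linspace(0, 2 * np.pi, 100)
y = 1.0 + 2.0 * np.cos(t) - 1.0 * np.sin(2 * t)

Phi = fourier_features(t, order=2)          # columns: 1, cos t, sin t, cos 2t, sin 2t
w = np.linalg.lstsq(Phi, y, rcond=None)[0]  # least squares on the mapped data
print(np.round(w, 3))  # recovers [1, 2, 0, 0, -1]
```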
3 Gradient descent
In cases where a closed-form solution of ∇R̂(w) = 0
– cannot be computed,
– is computationally expensive to compute, or
– is ill-conditioned (i.e., small changes in X lead to large changes in w),
gradient descent can be used to compute a local optimum. It is an iterative meta-algorithm structured as follows:
1. Initialize w0 (e.g., randomly, random small values, etc.)
2. Iteratively update
w𝑡+1 = w𝑡 − 𝜂𝑡 ∇R̂(w𝑡 ), (17)
where specific algorithms differ in their methods for picking 𝜂𝑡 at each step.
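For least squares, ∇R̂(w) = 2(X⊤ Xw − X⊤ y), so the update in Equation 17 can be sketched as follows (the constant step size 𝜂 is an illustrative choice, small enough to converge on this data):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5])

w = np.zeros(d)          # step 1: initialize w_0
eta = 0.005              # constant learning rate

for _ in range(2000):    # step 2: iterate w_{t+1} = w_t - eta * grad R(w_t)
    grad = 2 * (X.T @ X @ w - X.T @ y)
    w = w - eta * grad

print(np.round(w, 4))  # approaches the true weights [1, -2, 0.5]
```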
Fig. 2. Effect of learning rate (large 𝜂 vs. small 𝜂). The choices of vectors w𝑡 are plotted in red over the function 𝑦 = R̂(w). Adapted from: A Look at Gradient Descent and RMSprop Optimizers.
3.2 Convexity of R̂

When the loss function R̂ is convex, we know that all local minima are global:

∇R̂(w* ) = 0 =⇒ R̂(w* ) = min_{w ∈ R𝑑} R̂(w). (19)
However, this does not guarantee uniqueness of a particular w* . In fact, there could be infinitely many solutions, or none at all.
For example, consider the functions
– 𝑓 (x1 , x2 ) = (x1 )2 (convex, infinite number of minimizers)
– 𝑓 (𝑥) = 𝑥 (convex, no minimizers)
– 𝑓 (𝑥) = 𝑒𝑥 (strictly convex, no minimizers)
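The first example can be checked directly: 𝑓 (x1 , x2 ) = (x1 )2 attains its minimum value 0 at every point with x1 = 0, so any (0, x2 ) is a minimizer (a small illustration):

```python
def f(x1: float, x2: float) -> float:
    """Convex but not strictly convex: the value ignores x2 entirely."""
    return x1 ** 2

# Every point on the line x1 = 0 attains the minimum value 0.
minimizers = [(0.0, x2) for x2 in (-10.0, 0.0, 3.5)]
print([f(x1, x2) for x1, x2 in minimizers])  # [0.0, 0.0, 0.0]
```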