Tutorial 2
Harun Mustafa
harun.mustafa@inf.ethz.ch
D-INFK
1 Notation
Vectors are indicated by bold lower-case symbols x, while matrices are indicated by bold upper-case symbols X. The notation x𝑗 is used to denote the 𝑗th component of x; x𝑖 denotes the 𝑖th data point, so x𝑗𝑖 denotes the 𝑗th component of the 𝑖th data point.
2 Linear regression
2.1 Model
Linear regression is a simple linear model which is often used in practice as a baseline model to which
more complex models are compared. Given training data (x𝑖 , 𝑦𝑖 ) ∈ R𝑑 × R, the response variable 𝑦 is
modeled as
𝑦 = w⊤ x + 𝑤0 + 𝜀, (1)
where w ∈ R𝑑 is a weight vector, 𝑤0 ∈ R is a bias (intercept) term, and 𝜀 is a random variable which accounts for variation/noise in the measured value of 𝑦.
For simplicity of notation, one can transform w and x into homogeneous coordinates as follows:
w̃ = [w⊤ , 𝑤0 ]⊤    x̃ = [x⊤ , 1]⊤ . (2)
Let us assume, without loss of generality, that in the rest of these notes, the training points x𝑖 ∈ R𝑑 have already been transformed in this fashion and originate from data points in R𝑑−1 (i.e., assume that x𝑑𝑖 = 1 for each x𝑖 ∈ R𝑑 ), so that the model simplifies to
𝑦 = w⊤ x + 𝜀. (3)
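In code, this transformation amounts to appending a column of ones to the raw data matrix so that the bias 𝑤0 is absorbed into w; a minimal NumPy sketch (the array names are illustrative):

```python
import numpy as np

# Raw data points in R^(d-1) (here d - 1 = 2), one point per row.
X_raw = np.array([[1.0, 2.0],
                  [3.0, 4.0]])

# Append a constant 1 to each point: the bias w_0 becomes the last weight.
X_hom = np.hstack([X_raw, np.ones((X_raw.shape[0], 1))])

print(X_hom)  # each row now ends in 1, i.e. x_i^d = 1
```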
Given 𝑛 data points {x1 , . . . , x𝑛 } ⊂ R𝑑 with response variables {𝑦1 , . . . , 𝑦𝑛 } ⊂ R, the training data can be mapped onto a matrix X ∈ R𝑛×𝑑 and a response vector y ∈ R𝑛 , where
X = [x⊤1 ; · · · ; x⊤𝑛 ] ∈ R𝑛×𝑑    y = [𝑦1 , . . . , 𝑦𝑛 ]⊤ ∈ R𝑛 (4)
(i.e., the column vectors x𝑖 become the rows of X). The model in Equation 3 can then be written jointly
for all data as
y = Xw + 𝜀 (5)
(note that 𝜀 = [𝜀1 , . . . , 𝜀𝑛 ]⊤ ).
2.2 Training by least-squares
Suppose we have a trained weight vector w. We can define the residuals with respect to (X, y) as
r = y − Xw. (6)
The least-squares approach trains w by minimizing the squared 𝐿2 norm of the residuals,

R̂(w) = ‖r‖22 . (7)

Expanding,

R̂(w) = ‖y − Xw‖22
     = (y − Xw)⊤ (y − Xw)
     = y⊤ y − y⊤ Xw − w⊤ X⊤ y + w⊤ X⊤ Xw
     = y⊤ y − 2y⊤ Xw + w⊤ X⊤ Xw    since y⊤ Xw = w⊤ X⊤ y ∈ R.

Since R̂ is convex, a minimizer w* can be found by setting the gradient to zero:

0 := ∇R̂(w* )
   = 2w* ⊤ X⊤ X − 2y⊤ X    since 𝜕(x⊤ Ax)/𝜕x = x⊤ (A + A⊤ ) and (X⊤ X)⊤ = X⊤ X.

Solving yields the normal equations X⊤ Xw* = X⊤ y, so w* = (X⊤ X)−1 X⊤ y when X⊤ X is invertible.
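Setting this gradient to zero yields the normal equations X⊤ Xw* = X⊤ y, which can be checked numerically; a minimal sketch with synthetic data (the true weights are chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: n = 50 points in R^3, last column of ones for the bias.
n, d = 50, 3
X = np.hstack([rng.normal(size=(n, d - 1)), np.ones((n, 1))])
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=n)

# Normal equations: solve (X^T X) w = X^T y directly.
# np.linalg.solve is preferred over forming the inverse explicitly.
w_star = np.linalg.solve(X.T @ X, X.T @ y)

print(w_star)  # close to w_true, up to the injected noise
```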
2.3 Regularization

This optimization problem can be made strongly convex by adding a regularization term that restricts the complexity of the final trained model. As a result, the model is less able to fit noise in the training data (see Python demo). Note that this term should only be included during training, and not when evaluating error on an independent data set.
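One common choice is the squared 𝐿2 penalty (ridge regression), which adds 𝜆‖w‖22 to R̂(w) and yields the closed form w* = (X⊤ X + 𝜆I)−1 X⊤ y; a minimal sketch (the value of 𝜆 is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 30, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

lam = 1.0  # regularization strength, chosen arbitrarily here

# Ridge solution: X^T X + lam * I is positive definite for lam > 0,
# so the objective is strongly convex and the minimizer is unique.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# The penalty shrinks the weights toward zero.
print(np.linalg.norm(w_ridge), np.linalg.norm(w_ols))
```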
2.3.1 𝐿𝑝 norms
The partial derivative of the regularizer ‖w‖𝑝𝑝 with respect to w𝑗 is

𝜕‖w‖𝑝𝑝 /𝜕w𝑗 = 𝑝|w𝑗 |𝑝−1 sign(w𝑗 ). (12)
Fig. 1. 𝐿𝑝 norm contours for 𝑝 = 0.5, 1, 2, and 4.
If 𝑝 = 1, then

𝜕‖w‖1 /𝜕w𝑗 = sign(w𝑗 ), (13)

whereas when 𝑝 > 1, 𝜕‖w‖𝑝𝑝 /𝜕w𝑗 → 0 as |w𝑗 | → 0. So when 𝑝 > 1, small values 0 < |w𝑗 | < 1 tend to approach, but not reach, zero due to the decreasing partial derivative. When 𝑝 = 1, this does not happen, and thus weights are more likely to reach exactly zero.
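This behavior of Equation 12 near zero can be checked numerically: for 𝑝 = 1 the derivative magnitude stays at 1, while for 𝑝 = 2 it vanishes as w𝑗 → 0 (a small illustrative script):

```python
import numpy as np

def lp_grad(w_j: float, p: float) -> float:
    """Partial derivative of ||w||_p^p with respect to w_j (Equation 12)."""
    return p * abs(w_j) ** (p - 1) * np.sign(w_j)

for w_j in [0.5, 0.1, 0.01]:
    # p = 1: constant-magnitude pull toward zero -> weights can reach 0.
    # p = 2: pull shrinks with |w_j| -> weights approach but rarely hit 0.
    print(w_j, lp_grad(w_j, 1.0), lp_grad(w_j, 2.0))
```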
2.4 Interpretability
An interesting property of linear regression is that the components w𝑗 of the weight vector w can provide a measure of the importance of the features of x𝑖 (their components x𝑗𝑖 , 𝑗 ∈ {1, . . . , 𝑑}). In particular, a large |w𝑗 | indicates that x𝑗𝑖 contributes strongly to the solution and may be considered important, and vice versa.
In general, however, a small value of |w𝑗 | can also be achieved when many features are correlated (only one of them contributes to the solution, and the rest are ignored).
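One way this caveat manifests can be illustrated with a perfectly duplicated column: X⊤ X is then singular, and the minimum-norm least-squares solution (via `np.linalg.lstsq`) splits the weight evenly between the two copies, making each |w𝑗 | individually small even though the underlying feature matters:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
f = rng.normal(size=n)

# Two perfectly correlated features: both columns are copies of f.
X = np.column_stack([f, f])
y = 2.0 * f  # the signal uses the feature with weight 2

# X^T X is singular, so take the minimum-norm least-squares solution.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # the weight 2 is split evenly: [1.0, 1.0]
```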
Non-linear relations between X and y may be learned with linear regression by applying a feature map 𝜑 : R𝑑 → R𝑑′ to the x𝑖 data prior to training. Then, given data z𝑖 ∈ R𝑑 , the model can be represented as

𝑦𝑖 = w⊤ 𝜑(z𝑖 ), (14)

where now w ∈ R𝑑′ .
Usually, feature maps can be applied if there is some knowledge of the distributions or structure of
the x𝑗𝑖 . For example, given a periodic function
𝑦(𝑡) = 𝑎0 + ∑𝑛=1..∞ (𝑎𝑛 cos(𝑛𝑡) + 𝑏𝑛 sin(𝑛𝑡)) + 𝜀, (15)

a feature map

𝜑(𝑡) = [1, cos(𝑡), sin(𝑡), cos(2𝑡), sin(2𝑡), . . .] (16)

allows the coefficients 𝑎𝑛 and 𝑏𝑛 to be learned as the components of w.
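Truncating the map in Equation 16 at a finite order gives an ordinary least-squares problem; a sketch recovering the coefficients of the (assumed, illustrative) signal 𝑦(𝑡) = 1 + 2 cos(𝑡) − sin(2𝑡):

```python
import numpy as np

def fourier_features(t: np.ndarray, order: int) -> np.ndarray:
    """phi(t) = [1, cos(t), sin(t), ..., cos(order*t), sin(order*t)]."""
    cols = [np.ones_like(t)]
    for n in range(1, order + 1):
        cols.append(np.cos(n * t))
        cols.append(np.sin(n * t))
    return np.column_stack(cols)

t = np.linspace(0, 2 * np.pi, 100)
y = 1.0 + 2.0 * np.cos(t) - 1.0 * np.sin(2 * t)

Phi = fourier_features(t, order=2)          # columns: 1, cos t, sin t, cos 2t, sin 2t
w = np.linalg.lstsq(Phi, y, rcond=None)[0]  # least squares on the mapped data
print(np.round(w, 3))  # recovers [1, 2, 0, 0, -1]
```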
3 Gradient descent
In cases where a closed-form solution of ∇R̂(w) = 0
– cannot be computed,
– is computationally expensive to compute, or
– is ill-conditioned (i.e., small changes in X lead to large changes in w),
gradient descent can be used to compute a local optimum. It is an iterative meta-algorithm structured as follows:
1. Initialize w0 (e.g., randomly, random small values, etc.)
2. Iteratively update
w𝑡+1 = w𝑡 − 𝜂𝑡 ∇R̂(w𝑡 ), (17)
where specific algorithms differ in their methods for picking 𝜂𝑡 at each step.
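For least squares, ∇R̂(w) = 2(X⊤ Xw − X⊤ y), so the update in Equation 17 can be sketched as follows (the constant step size 𝜂 is an illustrative choice, small enough to converge on this data):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5])

w = np.zeros(d)          # step 1: initialize w_0
eta = 0.005              # constant learning rate

for _ in range(2000):    # step 2: iterate w_{t+1} = w_t - eta * grad R(w_t)
    grad = 2 * (X.T @ X @ w - X.T @ y)
    w = w - eta * grad

print(np.round(w, 4))  # approaches the true weights [1, -2, 0.5]
```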
Fig. 2. Effect of learning rate (large 𝜂 vs. small 𝜂). The choices of vectors w𝑡 are plotted in red over the function 𝑦 = R̂(w). Adapted from: A Look at Gradient Descent and RMSprop Optimizers.
3.2 Convexity of R̂

When the loss function R̂ is convex, we know that all local minima are global:

∇R̂(w* ) = 0 =⇒ R̂(w* ) = min_{w ∈ R𝑑} R̂(w). (19)
However, this does not guarantee uniqueness of a particular w* . In fact, there could be infinitely many solutions, or none at all.
For example, consider the functions
– 𝑓 (x1 , x2 ) = (x1 )2 (convex, infinite number of minimizers)
– 𝑓 (𝑥) = 𝑥 (convex, no minimizers)
– 𝑓 (𝑥) = 𝑒𝑥 (strictly convex, no minimizers)
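The first example can be checked directly: 𝑓 (x1 , x2 ) = (x1 )2 attains its minimum value 0 at every point with x1 = 0, so any (0, x2 ) is a minimizer (a small illustration):

```python
def f(x1: float, x2: float) -> float:
    """Convex but not strictly convex: the value ignores x2 entirely."""
    return x1 ** 2

# Every point on the line x1 = 0 attains the minimum value 0.
minimizers = [(0.0, x2) for x2 in (-10.0, 0.0, 3.5)]
print([f(x1, x2) for x1, x2 in minimizers])  # [0.0, 0.0, 0.0]
```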