e-Chapter 9
Pierre Paquay
Problem 9.1
(a) We begin by implementing the nearest neighbor method on the raw data.
data <- data.frame(x1 = c(0, 0, 5), x2 = c(0, 1, 5))
class <- as.factor(c(1, 1, -1))
[Figure: scatter plot of the three raw data points in the (x1, x2) plane, colored by class (-1, 1).]
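The 1-NN fit itself is not shown; below is a minimal sketch using class::knn, where the names grid and knn_mod are assumptions chosen to match the plotting code that follows.

library(class)   # for knn()
library(ggplot2) # for the plots below

# Evaluate the 1-NN rule on a grid of points to visualize its decision regions
# ('grid' and 'knn_mod' are assumed names used by the plotting code below).
grid <- expand.grid(x1 = seq(-1, 6, by = 0.1), x2 = seq(-1, 6, by = 0.1))
knn_mod <- knn(train = data, test = grid, cl = class, k = 1)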
ggplot() + geom_point(data = grid, aes(x = x1, y = x2, col = knn_mod)) +
  geom_point(data = data, aes(x = x1, y = x2, fill = class),
             size = 3, shape = 21) +
  guides(fill = guide_legend(title = "Class")) +
  guides(col = guide_legend(title = "Class"))
[Figure: 1-NN decision regions on the raw data, with the three training points overlaid; axes x1 and x2.]
(b) Here, we transform the data to whitened coordinates and run the nearest neighbor rule.
library(expm) # for sqrtm(), the matrix square root

data_centered <- apply(data, 2, function(y) y - mean(y))   # center each column
sigma <- t(data_centered) %*% as.matrix(data_centered) / 2 # sample covariance (N - 1 = 2)
sigma_sqr <- sqrtm(sigma)                                  # Sigma^{1/2}
sigma_sqr_inv <- solve(sigma_sqr)                          # Sigma^{-1/2}
data_whitened <- as.matrix(data_centered) %*% sigma_sqr_inv
data_whitened <- as.data.frame(data_whitened)
colnames(data_whitened) <- c("z1", "z2")
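As a quick sanity check, the whitened data should have identity sample covariance (cov() uses the same N - 1 = 2 normalization as sigma above):

cov(data_whitened) # ~ 2 x 2 identity matrix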
[Figure: the whitened data in the (z1, z2) plane, colored by class.]
We show the decision region of the final hypothesis in the original space as well.
ggplot() + geom_point(data = grid_centered, aes(x = x1, y = x2, col = knn_mod_whitened)) +
geom_point(data = as.data.frame(data_centered), aes(x = x1, y = x2, fill = class),
size = 3, shape = 21) +
guides(fill = guide_legend(title = "Class")) +
guides(col = guide_legend(title = "Class"))
[Figure: decision regions of the 1-NN rule trained on the whitened data, shown in the original (centered) space.]
(c) Finally, we use principal component analysis to reduce the data to 1 dimension for our nearest neighbor
classifier.
SVD_decomp <- svd(data_centered)  # SVD of the centered data
V1 <- SVD_decomp$v[, 1]           # top principal direction
Z <- data_centered %*% V1         # 1-dimensional projections
data_pca <- Z %*% t(V1)           # reconstruction in the original 2-D coordinates
data_pca <- as.data.frame(data_pca)
colnames(data_pca) <- c("z1", "z2")
[Figure: the data projected onto the first principal component, in the (z1, z2) plane, colored by class.]
Once again, we show the decision regions of the final hypothesis in the original space.
ggplot() + geom_point(data = grid_centered, aes(x = x1, y = x2, col = knn_mod_pca)) +
geom_point(data = as.data.frame(data_centered), aes(x = x1, y = x2, fill = class),
size = 3, shape = 21) +
guides(fill = guide_legend(title = "Class")) +
guides(col = guide_legend(title = "Class"))
[Figure: decision regions of the 1-NN rule trained on the PCA-reduced data, shown in the original space.]
Problem 9.2
(a) Here, we have $x'_n = x_n + b$. Let us compute $\bar{x}'$, $\sigma_i'^2$, and $\Sigma'$. We have
$$\bar{x}' = \frac{1}{N}\sum_{n=1}^N x'_n = \frac{1}{N}\sum_{n=1}^N (x_n + b) = \bar{x} + b;$$
moreover, we have
$$\sigma_i'^2 = \frac{1}{N}\sum_{n=1}^N (x'_{ni} - \bar{x}'_i)^2 = \frac{1}{N}\sum_{n=1}^N (x_{ni} + b_i - \bar{x}_i - b_i)^2 = \sigma_i^2;$$
and finally
$$\Sigma' = \frac{1}{N}\sum_{n=1}^N (x'_n - \bar{x}')(x'_n - \bar{x}')^T = \frac{1}{N}\sum_{n=1}^N (x_n + b - \bar{x} - b)(x_n + b - \bar{x} - b)^T = \Sigma.$$
• Whitening. We have
$$z'_n = \Sigma'^{-1/2}(x'_n - \bar{x}') = \Sigma^{-1/2}(x_n + b - \bar{x} - b) = z_n.$$
(b) Here, we have $x'_n = \alpha x_n$ with $\alpha > 0$. Let us compute $\bar{x}'$, $\sigma_i'^2$, and $\Sigma'$. We have
$$\bar{x}' = \frac{1}{N}\sum_{n=1}^N x'_n = \frac{1}{N}\sum_{n=1}^N \alpha x_n = \alpha\bar{x};$$
moreover, we have
$$\sigma_i'^2 = \frac{1}{N}\sum_{n=1}^N (x'_{ni} - \bar{x}'_i)^2 = \frac{1}{N}\sum_{n=1}^N (\alpha x_{ni} - \alpha\bar{x}_i)^2 = \alpha^2\sigma_i^2;$$
and finally
$$\Sigma' = \frac{1}{N}\sum_{n=1}^N (x'_n - \bar{x}')(x'_n - \bar{x}')^T = \frac{1}{N}\sum_{n=1}^N (\alpha x_n - \alpha\bar{x})(\alpha x_n - \alpha\bar{x})^T = \alpha^2\Sigma.$$
• Centering. We may write that
$$x'_n - \bar{x}' = \alpha x_n - \alpha\bar{x} = \alpha(x_n - \bar{x}),$$
so the centered data is simply scaled by $\alpha$.
• Whitening. We have
$$z'_n = \Sigma'^{-1/2}(x'_n - \bar{x}') = \frac{1}{\alpha}\Sigma^{-1/2}(\alpha x_n - \alpha\bar{x}) = z_n.$$
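This invariance is easy to confirm numerically; a small sketch (using expm::sqrtm for the matrix square root):

library(expm) # for sqrtm()
set.seed(1)
X <- matrix(rnorm(50 * 2), 50, 2)
whiten <- function(X) {
  Xc <- scale(X, center = TRUE, scale = FALSE)  # center the data
  Xc %*% solve(sqrtm(t(Xc) %*% Xc / nrow(Xc)))  # z_n = Sigma^{-1/2} (x_n - xbar)
}
max(abs(whiten(3 * X) - whiten(X)))             # ~ 0: whitening is unchanged by scaling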
(c) Here, we have $x'_n = Ax_n$ with $A = \mathrm{diag}(a_1, \cdots, a_d)$ and $a_i \neq 0$. Let us compute $\bar{x}'$, $\sigma_i'^2$, and $\Sigma'$. We have
$$\bar{x}' = \frac{1}{N}\sum_{n=1}^N x'_n = \frac{1}{N}\sum_{n=1}^N Ax_n = A\bar{x};$$
moreover, we have
$$\sigma_i'^2 = \frac{1}{N}\sum_{n=1}^N (x'_{ni} - \bar{x}'_i)^2 = \frac{1}{N}\sum_{n=1}^N (a_ix_{ni} - a_i\bar{x}_i)^2 = a_i^2\sigma_i^2;$$
and finally
$$\Sigma' = \frac{1}{N}\sum_{n=1}^N (x'_n - \bar{x}')(x'_n - \bar{x}')^T = \frac{1}{N}\sum_{n=1}^N (Ax_n - A\bar{x})(Ax_n - A\bar{x})^T = A\Sigma A^T.$$
• Centering. We may write that
$$x'_n - \bar{x}' = Ax_n - A\bar{x} = A(x_n - \bar{x}).$$
• Normalization. Here, we have that
$$z'_n = D'(x'_n - \bar{x}')$$
where
$$D' = \mathrm{diag}(1/\sigma'_1, \cdots, 1/\sigma'_d) = \mathrm{diag}(1/(a_1\sigma_1), \cdots, 1/(a_d\sigma_d)),$$
which means that $D'A = D$. In that case, we have
$$z'_n = D'A(x_n - \bar{x}) = D(x_n - \bar{x}) = z_n.$$
• Whitening. We have
(d) Here, we have $x'_n = Ax_n$ with $\det(A) \neq 0$. Let us compute $\bar{x}'$, $\sigma_i'^2$, and $\Sigma'$. We have
$$\bar{x}' = \frac{1}{N}\sum_{n=1}^N x'_n = \frac{1}{N}\sum_{n=1}^N Ax_n = A\bar{x};$$
moreover, we have
$$\sigma_i'^2 = \frac{1}{N}\sum_{n=1}^N (x'_{ni} - \bar{x}'_i)^2 = \frac{1}{N}\sum_{n=1}^N (l_i^Tx_n - l_i^T\bar{x})^2 = l_i^T\Big[\frac{1}{N}\sum_{n=1}^N (x_n - \bar{x})(x_n - \bar{x})^T\Big]l_i = l_i^T\Sigma l_i,$$
where $l_i^T$ denotes the $i$-th row of $A$.
• Whitening. We have
Problem 9.3
We may write
$$\Gamma = \mathrm{diag}(\lambda_1, \cdots, \lambda_d)$$
where $\lambda_i > 0$. With this in mind, it is easy to describe $\Gamma^{1/2}$ and $\Gamma^{-1/2}$; we have
$$\Gamma^{1/2} = \mathrm{diag}(\sqrt{\lambda_1}, \cdots, \sqrt{\lambda_d}) \quad\text{and}\quad \Gamma^{-1/2} = \mathrm{diag}(1/\sqrt{\lambda_1}, \cdots, 1/\sqrt{\lambda_d}).$$
To prove the first fact, we simply need to compute $\Gamma^{1/2}\Gamma^{1/2}$; we have
$$\Gamma^{1/2}\Gamma^{1/2} = \mathrm{diag}(\sqrt{\lambda_1}, \cdots, \sqrt{\lambda_d})\,\mathrm{diag}(\sqrt{\lambda_1}, \cdots, \sqrt{\lambda_d}) = \mathrm{diag}(\lambda_1, \cdots, \lambda_d) = \Gamma;$$
to prove the second fact, we need to compute $\Gamma^{1/2}\Gamma^{-1/2}$ (the same reasoning can be applied to $\Gamma^{-1/2}\Gamma^{1/2}$); we have
$$\Gamma^{1/2}\Gamma^{-1/2} = \mathrm{diag}(\sqrt{\lambda_1}, \cdots, \sqrt{\lambda_d})\,\mathrm{diag}(1/\sqrt{\lambda_1}, \cdots, 1/\sqrt{\lambda_d}) = I.$$
Now, to prove that $\Sigma^{1/2} = U\Gamma^{1/2}U^T$, we write that
$$\Sigma^{1/2}\Sigma^{1/2} = U\Gamma^{1/2}\underbrace{U^TU}_{=I}\,\Gamma^{1/2}U^T = U\Gamma U^T = \Sigma,$$
and similarly
$$\Sigma^{1/2}\Sigma^{-1/2} = U\Gamma^{1/2}\underbrace{U^TU}_{=I}\,\Gamma^{-1/2}U^T = UU^T = I.$$
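This identity is easy to verify numerically; a small sketch with a random SPD matrix:

set.seed(1)
A <- matrix(rnorm(9), 3, 3)
Sigma <- A %*% t(A)                         # a random SPD matrix
eig <- eigen(Sigma)                         # Sigma = U Gamma U^T
Sigma_sqrt <- eig$vectors %*% diag(sqrt(eig$values)) %*% t(eig$vectors)
max(abs(Sigma_sqrt %*% Sigma_sqrt - Sigma)) # ~ 0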
Problem 9.4
We have
$$V^TA = \underbrace{V^TV}_{=I}\,\psi = \psi$$
and
$$\psi^T\psi = A^T\underbrace{VV^T}_{=I}\,A = A^TA = I,$$
since $A$ is an orthogonal matrix as well; this means that $\psi$ is also an orthogonal matrix.
Problem 9.5
For a square diagonal matrix $A$ with non-negative entries, the SVD is immediate: we may take $U = V = I_N$ and $\Gamma = A$. For a matrix $A$ with non-zero orthogonal columns, we have
$$A^TA = \mathrm{diag}(a_1, \cdots, a_d);$$
taking $D = \mathrm{diag}(1/\sqrt{a_1}, \cdots, 1/\sqrt{a_d})$ and $U = AD$, we get
$$U^TU = D^TA^TAD = I_d,$$
so $A = U\Gamma V^T$ with $\Gamma = \mathrm{diag}(\sqrt{a_1}, \cdots, \sqrt{a_d})$ and $V = I_d$.
(d) If $A$ has SVD $U\Gamma V^T$ and $Q^TQ = I$, we may write that
$$QA = (QU)\Gamma V^T$$
with
$$(QU)^T(QU) = U^T\underbrace{Q^TQ}_{=I}\,U = I_d.$$
(e) If $A$ has blocks $A_i$ along the diagonal such that $A_i$ has SVD $U_i\Gamma_iV_i^T$, we simply have to define $U$, $\Gamma$, and $V$ as follows:
$$U = \begin{pmatrix} U_1 & & \\ & \ddots & \\ & & U_m \end{pmatrix}, \quad \Gamma = \begin{pmatrix} \Gamma_1 & & \\ & \ddots & \\ & & \Gamma_m \end{pmatrix} \quad\text{and}\quad V = \begin{pmatrix} V_1 & & \\ & \ddots & \\ & & V_m \end{pmatrix}.$$
In that case, we immediately get that
$$A = U\Gamma V^T$$
with
$$U^TU = \begin{pmatrix} U_1^TU_1 & & \\ & \ddots & \\ & & U_m^TU_m \end{pmatrix} = I, \qquad V^TV = \begin{pmatrix} V_1^TV_1 & & \\ & \ddots & \\ & & V_m^TV_m \end{pmatrix} = I,$$
and $\Gamma$ diagonal.
Problem 9.6
Here, we suppose that the digits data were not centered and we perform PCA to obtain a 2-dimensional
feature vector.
digits <- read.delim("zip.train", header = FALSE, sep = " ") # USPS digits data
digits$V258 <- NULL                                          # drop the trailing empty column
colnames(digits) <- c("digit", 1:256)                        # label followed by 256 pixel intensities
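The projection step itself can be done directly with svd(); a minimal sketch, where the name digits_pca_proj is an assumption matching the covariance computation below (and the data is deliberately not centered):

X <- as.matrix(digits[, -1])                          # 256 raw pixel features, not centered
SVD <- svd(X)                                         # X = U Gamma V^T
digits_pca_proj <- as.data.frame(X %*% SVD$v[, 1:2])  # project onto the top 2 directions
colnames(digits_pca_proj) <- c("V1", "V2")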
[Figure: the digits data projected onto the top two principal directions V1 and V2, colored by digit (0-9).]
The transformed data is not whitened since its covariance matrix is not the identity matrix.
cov(digits_pca_proj)
## V1 V2
## V1 6.842064 -8.121955
## V2 -8.121955 17.058353
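For comparison, explicitly whitening these two features does produce an identity covariance matrix; a small sketch using expm::sqrtm:

library(expm) # for sqrtm()
W <- solve(sqrtm(cov(as.matrix(digits_pca_proj)))) # whitening transform Sigma^{-1/2}
cov(as.matrix(digits_pca_proj) %*% W)              # ~ 2 x 2 identity matrix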
Problem 9.7
As we have
$$z_n = \sqrt{N}\,\Gamma_k^{-1}V_k^Tx_n,$$
we may let $Z$ be
$$Z = \sqrt{N}\,X(\Gamma_k^{-1}V_k^T)^T = \sqrt{N}\,XV_k\Gamma_k^{-1}.$$
To prove that the transformed data is actually whitened, we have to compute $\frac{1}{N}Z^TZ$; if we use the SVD of $X = U\Gamma V^T$, we get
$$\begin{aligned}
\frac{1}{N}Z^TZ &= \frac{1}{N}(\sqrt{N}XV_k\Gamma_k^{-1})^T(\sqrt{N}XV_k\Gamma_k^{-1}) \\
&= \Gamma_k^{-1}V_k^TX^TXV_k\Gamma_k^{-1} \\
&= \Gamma_k^{-1}V_k^TV\Gamma\underbrace{U^TU}_{=I_d}\,\Gamma V^TV_k\Gamma_k^{-1} \\
&= \Gamma_k^{-1}V_k^TV\underbrace{\Gamma^2}_{=\mathrm{diag}(\gamma_1^2,\cdots,\gamma_d^2)}V^TV_k\Gamma_k^{-1}.
\end{aligned}$$
Moreover, we have
$$V_k^TV = (I_k\,|\,0) \quad\text{and}\quad V^TV_k = \begin{pmatrix} I_k \\ 0 \end{pmatrix}.$$
This means that
$$\Gamma_k^{-1}(I_k\,|\,0) = (\Gamma_k^{-1}\,|\,0),$$
and in the same way
$$\begin{pmatrix} I_k \\ 0 \end{pmatrix}\Gamma_k^{-1} = \begin{pmatrix} \Gamma_k^{-1} \\ 0 \end{pmatrix}.$$
Consequently, we get that
$$\frac{1}{N}Z^TZ = (\Gamma_k^{-1}\,|\,0)\,\Gamma^2\begin{pmatrix} \Gamma_k^{-1} \\ 0 \end{pmatrix} = (\Gamma_k^{-1}\,|\,0)\begin{pmatrix} \Gamma_k^2 & 0 \\ 0 & * \end{pmatrix}\begin{pmatrix} \Gamma_k^{-1} \\ 0 \end{pmatrix} = I_k,$$
which means that Z is whitened.
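This whitening property is easy to verify numerically; a small sketch with random Gaussian data:

set.seed(1)
N <- 100; d <- 5; k <- 2
X <- matrix(rnorm(N * d), N, d)
SVD <- svd(X)
Z <- sqrt(N) * X %*% SVD$v[, 1:k] %*% diag(1 / SVD$d[1:k]) # Z = sqrt(N) X V_k Gamma_k^{-1}
round(t(Z) %*% Z / N, 10)                                  # ~ identity matrix I_k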
Problem 9.8
We begin by constructing a two-dimensional feature as described in the book, and we apply the algorithm to the digits data, giving us the features z1 and z2.
digits <- read.delim("zip.train", header = FALSE, sep = " ")
digits$V258 <- NULL
colnames(digits) <- c("digit", 1:256)
(a) We give below a scatter plot of our resulting two features.
z1 <- X_centered %*% v1       # first feature (v1, v2 are the directions from the book's algorithm)
z2 <- X_centered %*% v2       # second feature
y_1 <- ifelse(y == 1, +1, -1) # +1 for digit "1", -1 for all other digits
z <- data.frame(z1, z2, y_1)
[Figure: scatter plot of the features z1 and z2, colored by the ±1 label for digit 1.]
(c) Since $\hat{V} = [v_1, v_2]$ has full column rank, we know that
$$\hat{V}^+ = (\hat{V}^T\hat{V})^{-1}\hat{V}^T.$$
In this case, we have
$$\hat{X} = X\hat{V}(\hat{V}^T\hat{V})^{-1}\hat{V}^T.$$
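In R, this reconstruction can be computed directly from the formula above; a minimal sketch, assuming the centered data matrix X_centered and the directions v1 and v2 from part (a):

V_hat <- cbind(v1, v2) # 256 x 2 matrix of feature directions
X_hat <- X_centered %*% V_hat %*% solve(t(V_hat) %*% V_hat) %*% t(V_hat)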
(d) This method of constructing features is supervised since we need the target values to apply it.
Problem 9.9
(a) Since $\hat{X}$ is a matrix whose rows are projected onto $k$ basis vectors, it is clear that $\mathrm{rank}(\hat{X}) = k$.
(b) We may write that
$$\|U^T(X - \hat{X})\|_F^2 \le \underbrace{\|U^T\|_F^2}_{=\mathrm{trace}(U^TU) = d}\,\|X - \hat{X}\|_F^2 = d\,\|X - \hat{X}\|_F^2.$$
(d) So, the optimal choice for $\hat{\Gamma}$ must have all of its off-diagonal elements equal to zero.
(e) Since we know that rank(Γ̂) ≤ k, we may conclude that the optimal Γ̂ can have at most k non-zero
diagonal elements.
(f) We may write for our optimal $\hat{\Gamma}$ that
$$\|\Gamma - \hat{\Gamma}\|_F^2 = \sum_{i=1}^d (\gamma_{ii} - \hat{\gamma}_{ii})^2 = \sum_{i=1}^k (\gamma_{ii} - \hat{\gamma}_{ii})^2 + \underbrace{\sum_{i=k+1}^d (\gamma_{ii} - \hat{\gamma}_{ii})^2}_{=\sum_{i=k+1}^d \gamma_{ii}^2} \ge \sum_{i=k+1}^d \gamma_{ii}^2.$$
Consequently, the minimum reconstruction error is equal to $\sum_{i=k+1}^d \gamma_{ii}^2$.
(g) If $\hat{X} = XV_kV_k^T$, we choose our optimal $\hat{\Gamma}$ such that
$$\hat{\Gamma} = U^T\hat{X}V = U^TXV_kV_k^TV = \underbrace{U^TU}_{=I}\,\Gamma V^TV_kV_k^TV = \Gamma(V^TV_k)(V_k^TV),$$
where
$$V_k^TV = (I_k\,|\,0) \quad\text{and}\quad V^TV_k = \begin{pmatrix} I_k \\ 0 \end{pmatrix}.$$
This means that
$$\hat{\Gamma} = \Gamma\begin{pmatrix} I_k & 0 \\ 0 & 0 \end{pmatrix} = \mathrm{diag}(\gamma_1, \cdots, \gamma_k, 0, \cdots, 0).$$
Thus, we finally have
$$\Gamma - \hat{\Gamma} = \mathrm{diag}(0, \cdots, 0, \gamma_{k+1}, \cdots, \gamma_d),$$
and consequently
$$\|\Gamma - \hat{\Gamma}\|_F^2 = \sum_{i=k+1}^d \gamma_i^2.$$
So, for this choice of $\hat{\Gamma}$, we actually check each of the conditions obtained in points (d), (e), and (f) that characterize the optimal choice of $\hat{\Gamma}$; so $\hat{X} = XV_kV_k^T$ is the optimal choice for the reconstruction. In this case, we know from point (a) that $\mathrm{rank}(\hat{X}) = k$.
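This optimality is easy to check numerically; a small sketch comparing the rank-k reconstruction error with the tail singular values:

set.seed(1)
X <- matrix(rnorm(100 * 6), 100, 6)
SVD <- svd(X)
k <- 2
X_hat <- X %*% SVD$v[, 1:k] %*% t(SVD$v[, 1:k]) # X_hat = X V_k V_k^T
c(sum((X - X_hat)^2), sum(SVD$d[-(1:k)]^2))     # both equal the tail sum of gamma_i^2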
Problem 9.10
(a) The main difference between Algorithm 1 and Algorithm 2 is that Algorithm 1 first performs SVD and
then splits the data, and Algorithm 2 first splits the data and then performs SVD.
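A hypothetical sketch of the two orderings (the helper names E1_fun and E2_fun and the single train/validation split are assumptions, not necessarily the exact implementation):

# Algorithm 1: SVD on ALL the data before splitting -- validation points leak into V.
E1_fun <- function(X, y, k, val_idx) {
  V <- svd(X)$v[, 1:k]                           # top-k directions from the full data
  Z <- X %*% V
  w <- lm.fit(Z[-val_idx, , drop = FALSE], y[-val_idx])$coefficients
  mean((Z[val_idx, , drop = FALSE] %*% w - y[val_idx])^2)
}

# Algorithm 2: split first, then SVD on the training part only -- no leakage.
E2_fun <- function(X, y, k, val_idx) {
  V <- svd(X[-val_idx, , drop = FALSE])$v[, 1:k] # directions from training data only
  w <- lm.fit(X[-val_idx, , drop = FALSE] %*% V, y[-val_idx])$coefficients
  mean((X[val_idx, , drop = FALSE] %*% V %*% w - y[val_idx])^2)
}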
(b) Here, we generate $N$ random normally distributed $d$-dimensional inputs $x_1, \cdots, x_N$ with their respective targets $y_n = w_f^Tx_n + \epsilon_n$, where $w_f$ is normally distributed and $\epsilon_n$ is independent Gaussian noise with variance 0.5.
set.seed(10)
d <- 5
N <- 40
k <- 3
# Noisy linear target y = wf^T x + eps with Var(eps) = 0.5 (helper assumed)
target <- function(x, w) {
  sum(w * x) + rnorm(1, sd = sqrt(0.5))
}

data_gen <- function(d, N) {
  wf <- rnorm(d) # random target weights
  X <- data.frame(x1 = rnorm(N), x2 = rnorm(N), x3 = rnorm(N),
                  x4 = rnorm(N), x5 = rnorm(N))
  y <- apply(X, 1, target, w = wf) # noisy targets
  return(list(X = X, y = y, wf = wf))
}
# Hypothetical completion: out-of-sample error of the fitted reduced model
# (arguments V and w -- the feature directions and fitted weights -- are assumptions).
Eout_fun <- function(wf, V, w, Ntest = 10000) {
  X_test <- data.frame(x1 = rnorm(Ntest), x2 = rnorm(Ntest), x3 = rnorm(Ntest),
                       x4 = rnorm(Ntest), x5 = rnorm(Ntest))
  y_test <- apply(X_test, 1, target, w = wf)
  E_out <- mean((as.matrix(X_test) %*% V %*% w - y_test)^2)
  return(E_out)
}
Finally, we repeat this process $10^5$ times and we report the averages $\bar{E}_1$, $\bar{E}_2$, and $\bar{E}_{out}$.
iter <- 100000
E1 <- numeric(iter)
E2 <- numeric(iter)
Eout <- numeric(iter)
for (i in 1:iter) {
data <- data_gen(d, N)
X <- data$X
y <- data$y
wf <- data$wf
[Figure: histograms of E1, E2, and Eout over the 10^5 repetitions.]
Problem 9.12
(a) We have
$$\begin{aligned}
E^\pi_{out}(h) &= \frac{1}{4}E_{x,\pi}\big[(h(x) - f_\pi(x))^2\big] \\
&= \frac{1}{4}E_\pi\big[E_{x|\pi}[(h(x) - f_\pi(x))^2\,|\,\pi]\big] \\
&= \frac{1}{4}E_\pi\Big[\sum_{n=1}^N (h(x_n) - f_\pi(x_n))^2\,\underbrace{P[x = x_n\,|\,\pi]}_{=1/N}\Big] \\
&= \frac{1}{4N}\sum_{n=1}^N E_\pi\big[(h(x_n) - f_\pi(x_n))^2\big].
\end{aligned}$$
Expanding $h(x_n) - f_\pi(x_n) = (h(x_n) - \bar{y}) + (\bar{y} - f_\pi(x_n))$ and using $E_\pi[f_\pi(x_n)] = \bar{y}$ (so the cross term vanishes), we get
$$E_\pi[(h(x_n) - f_\pi(x_n))^2] = \underbrace{(h(x_n) - \bar{y})^2}_{(1)} + \underbrace{E_\pi[(\bar{y} - f_\pi(x_n))^2]}_{(2)}.$$
We also get
$$(2) = E_\pi[(\bar{y} - f_\pi(x_n))^2] = E_\pi[(\bar{y} - y_{\pi_n})^2] = \sum_{i=1}^N (\bar{y} - y_i)^2\,\underbrace{P[\pi_n = i]}_{=1/N} = \frac{1}{N}\sum_{i=1}^N (\bar{y} - y_i)^2.$$
In conclusion, we get that
$$\begin{aligned}
E^\pi_{out}(h) &= \frac{1}{4N}\sum_{n=1}^N (h(x_n) - \bar{y})^2 + \frac{1}{4N^2}\sum_{n=1}^N\sum_{i=1}^N (\bar{y} - y_i)^2 \\
&= \frac{1}{4N}\sum_{n=1}^N (h(x_n) - \bar{y})^2 + \underbrace{\frac{1}{4N}\sum_{i=1}^N (\bar{y} - y_i)^2}_{=\frac{1}{4}s_y^2} \\
&= \frac{s_y^2}{4} + \frac{1}{4N}\sum_{n=1}^N (h(x_n) - \bar{y})^2.
\end{aligned}$$
(d) We are now able to compute the permutation optimism penalty; we have
$$\begin{aligned}
E^\pi_{out}(g_\pi) - E^\pi_{in}(g_\pi) &= -\frac{1}{2N}\sum_{n=1}^N (g_\pi(x_n) - \bar{y})(\bar{y} - y_{\pi_n}) \\
&= -\frac{1}{2N}\sum_{n=1}^N g_\pi(x_n)(\bar{y} - y_{\pi_n}) + \bar{y}\,\underbrace{\frac{1}{2N}\sum_{n=1}^N (\bar{y} - y_{\pi_n})}_{=0} \\
&= \frac{1}{2N}\sum_{n=1}^N (y_{\pi_n} - \bar{y})\,g_\pi(x_n).
\end{aligned}$$
(e) It is easy to see that the permutation optimism penalty is proportional to the covariance between $y_{\pi_n}$ and $g_\pi(x_n)$.
Problem 9.13
Problem 9.14
set.seed(10)

# Decision stump h(x) = sign(x - w0) (reconstructed from the surviving fragment)
h <- function(D, w0) {
  return(sign(D$x - w0))
}

# Random Rademacher labels r; uniform inputs on [-1, 1] are an assumption
data_gen <- function(N) {
  D_prime <- data.frame(x = runif(N, -1, 1), r = sample(c(-1, 1), N, replace = TRUE))
  return(D_prime)
}
[Figure: Rademacher penalty versus N for a single experiment.]
(b) Now, we repeat point (a) 1000 times to compute the average Rademacher penalty.
library(foreach)     # for the foreach() parallel loop
library(doParallel)  # parallel backend
registerDoParallel() # register a parallel backend (cores chosen automatically)

penalty_matrix <- foreach(row = 1:1000, .combine = "rbind") %dopar% {
penalty <- numeric()
for (N in 1:100) {
iter <- 1000
D_prime <- data_gen(N)
w0 <- 0
w0_prime <- w0
Ein <- mean(D_prime$r != h(D_prime, w0))
Ein_prime <- Ein
for (i in 1:iter) {
D_mis <- subset(D_prime, r != h(D_prime, w0))
if (nrow(D_mis) == 0)
break
xt <- D_mis[sample(nrow(D_mis), 1), ]
w0 <- w0 + xt$x * xt$r
Ein <- mean(D_prime$r != h(D_prime, w0))
if (Ein < Ein_prime) {
w0_prime <- w0
}
Ein_prime <- mean(D_prime$r != h(D_prime, w0_prime))
}
penalty <- c(penalty, 0.5 - Ein_prime)
}
penalty
}
Below, we give a plot of the average penalty versus N .
penalty_avg <- apply(penalty_matrix, 2, FUN = mean)
[Figure: average Rademacher penalty versus N.]
(c) We may see that the Rademacher penalty is $O(1/\sqrt{N})$, which is similar to the order of the VC penalty in this case.
# Reference curve: the 1 / sqrt(N) decay we compare the average penalty against
one_square <- function(N) {
  return(1 / sqrt(N))
}
[Figure: the average Rademacher penalty compared with the reference curve 1/sqrt(N), as a function of N.]
Problem 9.15
If we let $\delta$ be
$$\delta = 4\,m_{\mathcal{H}}(N)^2\,e^{-\frac{N\epsilon^2}{8}},$$
we get
$$\epsilon = \sqrt{\frac{16}{N}\ln\frac{2m_{\mathcal{H}}(N)}{\delta}}.$$
In this case, we may conclude that, with probability at least $1 - \delta$, we have
$$\frac{1}{2} - E'_{in}(g_r) = E'_{out}(g_r) - E'_{in}(g_r) \le 4\sqrt{\frac{1}{N}\ln\frac{2m_{\mathcal{H}}(N)}{\delta}}.$$