
Problem Solutions

e-Chapter 9
Pierre Paquay

Problem 9.1

(a) We begin by implementing the nearest neighbor method on the raw data.
library(ggplot2)  # plotting
library(class)    # provides knn()
data <- data.frame(x1 = c(0, 0, 5), x2 = c(0, 1, 5))
class <- as.factor(c(1, 1, -1))

ggplot(data, aes(x = x1, y = x2, fill = class)) + geom_point(size = 3, shape = 21) +
  guides(fill = guide_legend(title = "Class"))

(Figure: the three data points in the (x1, x2) plane, colored by class.)

grid <- expand.grid(x1 = seq(min(data[, 1] - 1), max(data[, 1] + 1), by = 0.1),
                    x2 = seq(min(data[, 2] - 1), max(data[, 2] + 1), by = 0.1))

knn_mod <- knn(data, grid, class, k = 1, prob = TRUE)

Below, we show the decision regions of the final hypothesis.


ggplot() + geom_point(data = grid, aes(x = x1, y = x2, col = knn_mod)) +
  geom_point(data = data, aes(x = x1, y = x2, fill = class), size = 3, shape = 21) +
  guides(fill = guide_legend(title = "Class")) +
  guides(col = guide_legend(title = "Class"))

(Figure: 1-NN decision regions on the raw data, with the training points overlaid; axes x1 and x2.)

(b) Here, we transform the data to whitened coordinates and run the nearest neighbor rule.
library(expm)  # provides sqrtm()

data_centered <- apply(data, 2, function(y) y - mean(y))
sigma <- t(data_centered) %*% as.matrix(data_centered) / 2
sigma_sqr <- sqrtm(sigma)
sigma_sqr_inv <- solve(sigma_sqr)
data_whitened <- as.matrix(data_centered) %*% sigma_sqr_inv
data_whitened <- as.data.frame(data_whitened)
colnames(data_whitened) <- c("z1", "z2")

ggplot(data_whitened, aes(x = z1, y = z2, fill = class)) + geom_point(size = 3, shape = 21) +
  guides(fill = guide_legend(title = "Class"))

(Figure: the data points in the whitened coordinates (z1, z2), colored by class.)

grid_centered <- expand.grid(x1 = seq(min(data_centered[, 1] - 1),
                                      max(data_centered[, 1] + 1), by = 0.1),
                             x2 = seq(min(data_centered[, 2] - 1),
                                      max(data_centered[, 2] + 1), by = 0.1))
grid_whitened <- as.matrix(grid_centered) %*% sigma_sqr_inv

knn_mod_whitened <- knn(data_whitened, grid_whitened, class, k = 1, prob = TRUE)

We show the decision regions of the final hypothesis in the (centered) original space as well.
ggplot() + geom_point(data = grid_centered, aes(x = x1, y = x2, col = knn_mod_whitened)) +
geom_point(data = as.data.frame(data_centered), aes(x = x1, y = x2, fill = class),
size = 3, shape = 21) +
guides(fill = guide_legend(title = "Class")) +
guides(col = guide_legend(title = "Class"))

(Figure: 1-NN decision regions obtained with whitened coordinates, shown in the centered original space; axes x1 and x2.)

(c) Finally, we use principal component analysis to reduce the data to 1 dimension for our nearest neighbor
classifier.
SVD_decomp <- svd(data_centered)
V1 <- SVD_decomp$v[, 1]
Z <- data_centered %*% V1
data_pca <- Z %*% t(V1)
data_pca <- as.data.frame(data_pca)
colnames(data_pca) <- c("z1", "z2")

ggplot(data_pca, aes(x = z1, y = z2, fill = class)) + geom_point(size = 3, shape = 21) +
  guides(fill = guide_legend(title = "Class"))

(Figure: the data points projected onto the first principal direction, shown in the (z1, z2) plane and colored by class.)

grid_pca <- (as.matrix(grid_centered) %*% V1) %*% t(V1)

knn_mod_pca <- knn(data_pca, grid_pca, class, k = 1, prob = TRUE)

Once again, we show the decision regions of the final hypothesis in the original space.
ggplot() + geom_point(data = grid_centered, aes(x = x1, y = x2, col = knn_mod_pca)) +
geom_point(data = as.data.frame(data_centered), aes(x = x1, y = x2, fill = class),
size = 3, shape = 21) +
guides(fill = guide_legend(title = "Class")) +
guides(col = guide_legend(title = "Class"))

(Figure: 1-NN decision regions obtained with the 1-dimensional PCA feature, shown in the centered original space; axes x1 and x2.)

Problem 9.2

(a) Here, we have $x'_n = x_n + b$. Let us compute $\bar{x}'$, $\sigma_i'^2$ and $\Sigma'$. We have
$$\bar{x}' = \frac{1}{N}\sum_{n=1}^N x'_n = \frac{1}{N}\sum_{n=1}^N (x_n + b) = \bar{x} + b;$$
moreover, we have
$$\sigma_i'^2 = \frac{1}{N}\sum_{n=1}^N (x'_{ni} - \bar{x}'_i)^2 = \frac{1}{N}\sum_{n=1}^N (x_{ni} + b_i - \bar{x}_i - b_i)^2 = \sigma_i^2;$$
and finally
$$\Sigma' = \frac{1}{N}\sum_{n=1}^N (x'_n - \bar{x}')(x'_n - \bar{x}')^T = \frac{1}{N}\sum_{n=1}^N (x_n + b - \bar{x} - b)(x_n + b - \bar{x} - b)^T = \Sigma.$$

• Centering. We may write that
$$z'_n = x'_n - \bar{x}' = x_n + b - \bar{x} - b = z_n.$$

• Normalization. Here, we have that
$$z'_n = D'(x'_n - \bar{x}')$$
where
$$D' = \mathrm{diag}(1/\sigma'_1, \cdots, 1/\sigma'_d) = \mathrm{diag}(1/\sigma_1, \cdots, 1/\sigma_d) = D.$$
In that case, we have
$$z'_n = D(x'_n - \bar{x}') = D(x_n + b - \bar{x} - b) = z_n.$$

• Whitening. We have
$$z'_n = \Sigma'^{-1/2}(x'_n - \bar{x}') = \Sigma^{-1/2}(x_n + b - \bar{x} - b) = z_n.$$

(b) Here, we have $x'_n = \alpha x_n$ with $\alpha > 0$. Let us compute $\bar{x}'$, $\sigma_i'^2$ and $\Sigma'$. We have
$$\bar{x}' = \frac{1}{N}\sum_{n=1}^N x'_n = \frac{1}{N}\sum_{n=1}^N \alpha x_n = \alpha\bar{x};$$
moreover, we have
$$\sigma_i'^2 = \frac{1}{N}\sum_{n=1}^N (x'_{ni} - \bar{x}'_i)^2 = \frac{1}{N}\sum_{n=1}^N (\alpha x_{ni} - \alpha\bar{x}_i)^2 = \alpha^2\sigma_i^2;$$
and finally
$$\Sigma' = \frac{1}{N}\sum_{n=1}^N (x'_n - \bar{x}')(x'_n - \bar{x}')^T = \frac{1}{N}\sum_{n=1}^N (\alpha x_n - \alpha\bar{x})(\alpha x_n - \alpha\bar{x})^T = \alpha^2\Sigma.$$

• Centering. We may write that
$$z'_n = x'_n - \bar{x}' = \alpha x_n - \alpha\bar{x} = \alpha z_n \neq z_n.$$

• Normalization. Here, we have that
$$z'_n = D'(x'_n - \bar{x}')$$
where
$$D' = \mathrm{diag}(1/\sigma'_1, \cdots, 1/\sigma'_d) = \mathrm{diag}(1/(\alpha\sigma_1), \cdots, 1/(\alpha\sigma_d)) = \frac{1}{\alpha}D.$$
In that case, we have
$$z'_n = \frac{1}{\alpha}D(x'_n - \bar{x}') = \frac{1}{\alpha}D(\alpha x_n - \alpha\bar{x}) = D(x_n - \bar{x}) = z_n.$$

• Whitening. We have
$$z'_n = \Sigma'^{-1/2}(x'_n - \bar{x}') = \frac{1}{\alpha}\Sigma^{-1/2}(\alpha x_n - \alpha\bar{x}) = z_n.$$

(c) Here, we have $x'_n = Ax_n$ with $A = \mathrm{diag}(a_1, \cdots, a_d)$ ($a_i \neq 0$). Let us compute $\bar{x}'$, $\sigma_i'^2$ and $\Sigma'$. We have
$$\bar{x}' = \frac{1}{N}\sum_{n=1}^N x'_n = \frac{1}{N}\sum_{n=1}^N Ax_n = A\bar{x};$$
moreover, we have
$$\sigma_i'^2 = \frac{1}{N}\sum_{n=1}^N (x'_{ni} - \bar{x}'_i)^2 = \frac{1}{N}\sum_{n=1}^N (a_ix_{ni} - a_i\bar{x}_i)^2 = a_i^2\sigma_i^2;$$
and finally
$$\Sigma' = \frac{1}{N}\sum_{n=1}^N (x'_n - \bar{x}')(x'_n - \bar{x}')^T = \frac{1}{N}\sum_{n=1}^N (Ax_n - A\bar{x})(Ax_n - A\bar{x})^T = A\Sigma A^T.$$

• Centering. We may write that
$$z'_n = x'_n - \bar{x}' = Ax_n - A\bar{x} = Az_n \neq z_n.$$

• Normalization. Here, we have that
$$z'_n = D'(x'_n - \bar{x}')$$
where
$$D' = \mathrm{diag}(1/\sigma'_1, \cdots, 1/\sigma'_d) = \mathrm{diag}(1/(a_1\sigma_1), \cdots, 1/(a_d\sigma_d)),$$
which means that $D'A = D$. In that case, we have
$$z'_n = D'(x'_n - \bar{x}') = D'A(x_n - \bar{x}) = D(x_n - \bar{x}) = z_n.$$

• Whitening. We have
$$z'_n = \Sigma'^{-1/2}(x'_n - \bar{x}') = (A\Sigma A^T)^{-1/2}A(x_n - \bar{x}) \neq z_n.$$

(d) Here, we have $x'_n = Ax_n$ with $\det(A) \neq 0$. Let us compute $\bar{x}'$, $\sigma_i'^2$ and $\Sigma'$. We have
$$\bar{x}' = \frac{1}{N}\sum_{n=1}^N x'_n = \frac{1}{N}\sum_{n=1}^N Ax_n = A\bar{x};$$
moreover, we have
$$\sigma_i'^2 = \frac{1}{N}\sum_{n=1}^N (x'_{ni} - \bar{x}'_i)^2 = \frac{1}{N}\sum_{n=1}^N (l_i^Tx_n - l_i^T\bar{x})^2 = l_i^T\left[\frac{1}{N}\sum_{n=1}^N (x_n - \bar{x})(x_n - \bar{x})^T\right]l_i = l_i^T\Sigma l_i,$$
where $l_i$ is the $i$th row of $A$; and finally
$$\Sigma' = \frac{1}{N}\sum_{n=1}^N (x'_n - \bar{x}')(x'_n - \bar{x}')^T = \frac{1}{N}\sum_{n=1}^N (Ax_n - A\bar{x})(Ax_n - A\bar{x})^T = A\Sigma A^T.$$

• Centering. We may write that
$$z'_n = x'_n - \bar{x}' = Ax_n - A\bar{x} = Az_n \neq z_n.$$

• Normalization. Here, we have that
$$z'_n = D'(x'_n - \bar{x}')$$
where
$$D' = \mathrm{diag}(1/\sigma'_1, \cdots, 1/\sigma'_d) = \mathrm{diag}\left(1/\sqrt{l_1^T\Sigma l_1}, \cdots, 1/\sqrt{l_d^T\Sigma l_d}\right).$$
In that case, we have
$$z'_n = D'(x'_n - \bar{x}') = D'A(x_n - \bar{x}) = D'Az_n \neq z_n.$$

• Whitening. We have
$$z'_n = \Sigma'^{-1/2}(x'_n - \bar{x}') = (A\Sigma A^T)^{-1/2}A(x_n - \bar{x}) \neq z_n.$$
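
As a quick numerical sanity check of these results (a sketch that is not part of the original solution), the few lines below verify that whitening is unchanged by a translation $x'_n = x_n + b$ but not by a generic invertible map $x'_n = Ax_n$; they reuse expm::sqrtm() from Problem 9.1(b), and the data, b and A are arbitrary.

library(expm)  # provides sqrtm()

set.seed(1)
N <- 50; d <- 3
X <- matrix(rnorm(N * d), N, d)

whiten <- function(X) {
  Xc <- scale(X, center = TRUE, scale = FALSE)  # center the data
  Sigma <- t(Xc) %*% Xc / nrow(Xc)              # covariance matrix (1/N convention)
  Xc %*% solve(sqrtm(Sigma))                    # z_n = Sigma^{-1/2} (x_n - xbar)
}

b <- rnorm(d)                    # translation vector
A <- matrix(rnorm(d * d), d, d)  # generic invertible map
max(abs(whiten(X) - whiten(sweep(X, 2, -b))))  # ~ 0: whitening is translation invariant
max(abs(whiten(X) - whiten(X %*% t(A))))       # generally far from 0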

Problem 9.3

We may write
$$\Gamma = \mathrm{diag}(\lambda_1, \cdots, \lambda_d)$$
where $\lambda_i > 0$. With this in mind, it is easy to describe $\Gamma^{1/2}$ and $\Gamma^{-1/2}$; we have
$$\Gamma^{1/2} = \mathrm{diag}(\sqrt{\lambda_1}, \cdots, \sqrt{\lambda_d}) \quad\text{and}\quad \Gamma^{-1/2} = \mathrm{diag}(1/\sqrt{\lambda_1}, \cdots, 1/\sqrt{\lambda_d}).$$
To prove the first fact, we simply need to compute $\Gamma^{1/2}\Gamma^{1/2}$; we have
$$\Gamma^{1/2}\Gamma^{1/2} = \mathrm{diag}(\sqrt{\lambda_1}, \cdots, \sqrt{\lambda_d})\,\mathrm{diag}(\sqrt{\lambda_1}, \cdots, \sqrt{\lambda_d}) = \mathrm{diag}(\lambda_1, \cdots, \lambda_d) = \Gamma;$$
to prove the second fact, we need to compute $\Gamma^{1/2}\Gamma^{-1/2}$ (the same reasoning can be applied to $\Gamma^{-1/2}\Gamma^{1/2}$); we have
$$\Gamma^{1/2}\Gamma^{-1/2} = \mathrm{diag}(\sqrt{\lambda_1}, \cdots, \sqrt{\lambda_d})\,\mathrm{diag}(1/\sqrt{\lambda_1}, \cdots, 1/\sqrt{\lambda_d}) = I.$$
Now, to prove that $\Sigma^{1/2} = U\Gamma^{1/2}U^T$, we write that
$$\Sigma^{1/2}\Sigma^{1/2} = U\Gamma^{1/2}\underbrace{U^TU}_{=I}\Gamma^{1/2}U^T = U\Gamma U^T = \Sigma.$$
Then, to prove that $\Sigma^{-1/2} = U\Gamma^{-1/2}U^T$, we write that
$$\Sigma^{-1/2}\Sigma^{1/2} = U\Gamma^{-1/2}\underbrace{U^TU}_{=I}\Gamma^{1/2}U^T = UIU^T = I.$$
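
The identity $\Sigma^{1/2} = U\Gamma^{1/2}U^T$ is also easy to verify numerically; the sketch below (an illustration, not part of the original solution) builds $\Sigma^{1/2}$ and $\Sigma^{-1/2}$ from the eigendecomposition of an arbitrary covariance matrix and checks both facts.

set.seed(2)
B <- matrix(rnorm(16), 4, 4)
Sigma <- t(B) %*% B                    # a symmetric positive definite matrix
eig <- eigen(Sigma, symmetric = TRUE)  # Sigma = U Gamma U^T
U <- eig$vectors
Sigma_half <- U %*% diag(sqrt(eig$values)) %*% t(U)
Sigma_half_inv <- U %*% diag(1 / sqrt(eig$values)) %*% t(U)
max(abs(Sigma_half %*% Sigma_half - Sigma))        # ~ 0, so Sigma_half squares to Sigma
max(abs(Sigma_half_inv %*% Sigma_half - diag(4)))  # ~ 0, so Sigma_half_inv is its inverse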

Problem 9.4

We know that $A = V\psi$; consequently we get
$$V^TA = \underbrace{V^TV}_{=I}\psi = \psi$$
because $V$ is an orthogonal matrix. Moreover, we also have that
$$\psi^T\psi = A^T\underbrace{VV^T}_{=I}A = A^TA = I$$
since $A$ is an orthogonal matrix as well; this means that $\psi$ is also an orthogonal matrix.

Problem 9.5

(a) If $A$ is a diagonal matrix ($N = d$), it suffices to define $U$, $\Gamma$ and $V$ as follows:
$$U = V = I_N \quad\text{and}\quad \Gamma = A.$$
In this case, we have
$$A = U\Gamma V^T.$$

(c) If $A$ is a matrix with pairwise orthogonal columns, we may write that
$$A^TA = \mathrm{diag}(a_1, \cdots, a_d)$$
where $a_i > 0$. Now, if we define $D$ as follows:
$$D = \mathrm{diag}(1/\sqrt{a_1}, \cdots, 1/\sqrt{a_d}),$$
it is easy to see that $U = AD$ is actually a matrix with orthonormal columns since
$$U^TU = D^TA^TAD = I_d.$$
In this case, we have
$$A = U\Gamma V^T$$
with $\Gamma = \mathrm{diag}(\sqrt{a_1}, \cdots, \sqrt{a_d})$ and $V = I_d$.

(d) If $A$ has SVD $U\Gamma V^T$ and $Q^TQ = I$, we may write that
$$QA = (QU)\Gamma V^T$$
with $QU$ a matrix with orthonormal columns since
$$(QU)^T(QU) = U^T\underbrace{Q^TQ}_{=I}U = I_d.$$

(e) If $A$ has blocks $A_i$ along the diagonal such that $A_i$ has SVD $U_i\Gamma_iV_i^T$, we simply have to define $U$, $\Gamma$, and $V$ as follows:
$$U = \begin{pmatrix} U_1 & & \\ & \ddots & \\ & & U_m \end{pmatrix}, \quad \Gamma = \begin{pmatrix} \Gamma_1 & & \\ & \ddots & \\ & & \Gamma_m \end{pmatrix} \quad\text{and}\quad V = \begin{pmatrix} V_1 & & \\ & \ddots & \\ & & V_m \end{pmatrix}.$$
In that case, we immediately get that
$$A = U\Gamma V^T$$
with
$$U^TU = \begin{pmatrix} U_1^TU_1 & & \\ & \ddots & \\ & & U_m^TU_m \end{pmatrix} = I, \qquad V^TV = \begin{pmatrix} V_1^TV_1 & & \\ & \ddots & \\ & & V_m^TV_m \end{pmatrix} = I,$$
and $\Gamma$ diagonal.
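
As a quick numerical illustration of the construction in (e) (not part of the original solution), the sketch below stacks the SVDs of two arbitrary blocks and checks that the resulting $U$, $\Gamma$, $V$ reproduce the block-diagonal matrix; the only difference with a standard SVD is that the singular values are not sorted.

set.seed(30)
A1 <- matrix(rnorm(6), 3, 2)
A2 <- matrix(rnorm(4), 2, 2)
A <- rbind(cbind(A1, matrix(0, 3, 2)), cbind(matrix(0, 2, 2), A2))  # block-diagonal matrix
s1 <- svd(A1)
s2 <- svd(A2)
U <- rbind(cbind(s1$u, matrix(0, 3, 2)), cbind(matrix(0, 2, 2), s2$u))
V <- rbind(cbind(s1$v, matrix(0, 2, 2)), cbind(matrix(0, 2, 2), s2$v))
G <- diag(c(s1$d, s2$d))
max(abs(A - U %*% G %*% t(V)))  # ~ 0, so U G V^T reproduces A
max(abs(t(U) %*% U - diag(4)))  # ~ 0, so U has orthonormal columns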

Problem 9.6

Here, we suppose that the digits data were not centered and we perform PCA to obtain a 2-dimensional
feature vector.
digits <- read.delim("zip.train", header = FALSE, sep = " ")
digits$V258 <- NULL
colnames(digits) <- c("digit", 1:256)

X <- as.matrix(digits[, -1])
y <- digits[, 1]
SVD_decomp <- svd(X)
V2 <- SVD_decomp$v[, 1:2]
Z <- X %*% V2
digits_pca <- Z %*% t(V2)
digits_pca_proj <- digits_pca %*% V2
digits_pca_proj <- as.data.frame(digits_pca_proj)

ggplot(digits_pca_proj, aes(x = V1, y = V2, col = as.factor(y))) + geom_point() +
  guides(col = guide_legend(title = "Digit"))

(Figure: the uncentered digits data projected onto the top two principal directions (V1, V2), colored by digit 0-9.)
The transformed data is not whitened since its covariance matrix is not the identity matrix.
cov(digits_pca_proj)

## V1 V2
## V1 6.842064 -8.121955
## V2 -8.121955 17.058353
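
For comparison (a sketch that is not part of the original solution), the two features could be whitened as in Problem 9.7 by rescaling with the corresponding singular values; since the data was deliberately left uncentered here, the resulting $Z$ is not mean-zero, but it does satisfy $\frac{1}{N}Z^TZ = I_2$.

Ndig <- nrow(X)
Gamma2 <- diag(SVD_decomp$d[1:2])         # top-2 singular values
Z_white <- sqrt(Ndig) * X %*% V2 %*% solve(Gamma2)
round(t(Z_white) %*% Z_white / Ndig, 10)  # the 2 x 2 identity matrix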

Problem 9.7

As we have that
$$z_n = \sqrt{N}\,\Gamma_k^{-1}V_k^Tx_n,$$
we may let $Z$ be
$$Z = \sqrt{N}\,X(\Gamma_k^{-1}V_k^T)^T = \sqrt{N}\,XV_k\Gamma_k^{-1}.$$
Moreover, since $X$ is centered, it is easy to see that $Z$ is centered as well because
$$\bar{z} = \frac{1}{N}Z^T\mathbf{1} = \frac{\sqrt{N}}{N}\Gamma_k^{-1}V_k^T\underbrace{X^T\mathbf{1}}_{=0} = 0.$$
To prove that the transformed data is actually whitened, we have to compute $\frac{1}{N}Z^TZ$; if we use the SVD of $X = U\Gamma V^T$, we get
$$\begin{aligned}
\frac{1}{N}Z^TZ &= \frac{1}{N}(\sqrt{N}XV_k\Gamma_k^{-1})^T(\sqrt{N}XV_k\Gamma_k^{-1}) \\
&= \Gamma_k^{-1}V_k^TX^TXV_k\Gamma_k^{-1} \\
&= \Gamma_k^{-1}V_k^TV\Gamma\underbrace{U^TU}_{=I_d}\Gamma V^TV_k\Gamma_k^{-1} \\
&= \Gamma_k^{-1}V_k^TV\underbrace{\Gamma^2}_{=\mathrm{diag}(\gamma_1^2, \cdots, \gamma_d^2)}V^TV_k\Gamma_k^{-1}.
\end{aligned}$$
We also have that
$$V_k^TV = \begin{pmatrix} v_1^T \\ \vdots \\ v_k^T \end{pmatrix}(v_1, \cdots, v_k \mid v_{k+1}, \cdots, v_d) = (I_k \mid 0)
\quad\text{and}\quad
V^TV_k = \begin{pmatrix} I_k \\ 0 \end{pmatrix}.$$
This means that
$$\Gamma_k^{-1}(I_k \mid 0) = (\Gamma_k^{-1} \mid 0),$$
and in the same way
$$\begin{pmatrix} I_k \\ 0 \end{pmatrix}\Gamma_k^{-1} = \begin{pmatrix} \Gamma_k^{-1} \\ 0 \end{pmatrix}.$$
Consequently, we get that
$$\frac{1}{N}Z^TZ = (\Gamma_k^{-1} \mid 0)\,\Gamma^2\begin{pmatrix} \Gamma_k^{-1} \\ 0 \end{pmatrix} = (\Gamma_k^{-1} \mid 0)\begin{pmatrix} \Gamma_k^2 & 0 \\ 0 & * \end{pmatrix}\begin{pmatrix} \Gamma_k^{-1} \\ 0 \end{pmatrix} = I_k,$$
which means that $Z$ is whitened.
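
As a quick numerical check of this derivation (not part of the original solution), the sketch below verifies on arbitrary centered data that $Z = \sqrt{N}XV_k\Gamma_k^{-1}$ is centered and satisfies $\frac{1}{N}Z^TZ = I_k$.

set.seed(31)
N0 <- 200; d0 <- 6; k0 <- 3
X0 <- scale(matrix(rnorm(N0 * d0), N0, d0), center = TRUE, scale = FALSE)  # centered data
SVD0 <- svd(X0)
Z0 <- sqrt(N0) * X0 %*% SVD0$v[, 1:k0] %*% diag(1 / SVD0$d[1:k0])
round(t(Z0) %*% Z0 / N0, 10)  # the k x k identity matrix
round(colMeans(Z0), 10)       # ~ 0, so Z is centered as well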

Problem 9.8

We begin by constructing a two dimensional feature as described in the book and we apply the algorithm to
the digits data giving us the features z1 and z2 .
digits <- read.delim("zip.train", header = FALSE, sep = " ")
digits$V258 <- NULL
colnames(digits) <- c("digit", 1:256)

X <- as.matrix(digits[, -1])
y <- digits[, 1]
X_centered <- apply(X, 2, function(x) x - mean(x))
X_centered_1 <- X_centered[y == 1, ]
X_centered_not1 <- X_centered[y != 1, ]

SVD_decomp_1 <- svd(X_centered_1)
v1 <- SVD_decomp_1$v[, 1]

SVD_decomp_not1 <- svd(X_centered_not1)
v2 <- SVD_decomp_not1$v[, 1]

(a) We give below a scatter plot of our resulting two features.
z1 <- X_centered %*% v1
z2 <- X_centered %*% v2
y_1 <- ifelse(y == 1, +1, -1)
z <- data.frame(z1, z2, y_1)

ggplot(z, aes(x = z1, y = z2, col = as.factor(y_1))) + geom_point() +
  guides(col = guide_legend(title = "Digit"))

(Figure: scatter plot of the two features z1 and z2, colored by whether the digit is a 1 (+1) or not (-1).)

(b) No, the directions of v1 and v2 are not necessarily orthogonal.


v1 %*% v2 == 0

## [,1]
## [1,] FALSE
(c) Since $\hat{V} = [v_1, v_2]$ has full rank, we know that
$$\hat{V}^+ = (\hat{V}^T\hat{V})^{-1}\hat{V}^T.$$
In this case, we have
$$\hat{X} = X\hat{V}(\hat{V}^T\hat{V})^{-1}\hat{V}^T.$$

(d) This method of constructing features is supervised since we need the target values to apply it.
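
Returning to point (c), here is a minimal sketch of that reconstruction (not part of the original solution), applied to the centered digits data and the directions v1 and v2 computed above.

V_hat <- cbind(v1, v2)
V_hat_pinv <- solve(t(V_hat) %*% V_hat) %*% t(V_hat)  # pseudo-inverse of V_hat
X_hat <- X_centered %*% V_hat %*% V_hat_pinv          # reconstruction X V_hat V_hat^+
dim(X_hat)                                            # same dimensions as X_centered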

Problem 9.9

(a) Since X̂ is a matrix whose rows are projected onto k basis vectors, it is pretty obvious that rank(X̂) = k.

(b) We may write that
$$\|\Gamma - U^T\hat{X}V\|_F^2 = \|U^TXV - U^T\hat{X}V\|_F^2 = \|U^T(X - \hat{X})V\|_F^2.$$
Moreover, we may also write (see Exercise 9.9) that
$$\begin{aligned}
\|U^T(X - \hat{X})V\|_F^2 &= \|U(U^T(X - \hat{X})V)V^T\|_F^2 \\
&= \|UU^T(X - \hat{X})\underbrace{VV^T}_{=I}\|_F^2 \\
&= \mathrm{trace}[(UU^T(X - \hat{X}))^T(UU^T(X - \hat{X}))] \\
&= \mathrm{trace}[(X - \hat{X})^TU\underbrace{U^TU}_{=I}U^T(X - \hat{X})] \\
&= \|U^T(X - \hat{X})\|_F^2 \\
&\le \underbrace{\|U^T\|_F^2}_{=\mathrm{trace}(U^TU) = d}\|X - \hat{X}\|_F^2 \\
&\le d\,\|X - \hat{X}\|_F^2,
\end{aligned}$$
where we used the fact that $\|AB\|_F^2 \le \|A\|_F^2\|B\|_F^2$.


(c) Since the rank of a matrix product is at most the rank of each factor, we have that
$$\mathrm{rank}(\hat{\Gamma}) = \mathrm{rank}(U^T\hat{X}V) \le \mathrm{rank}(\hat{X}) = k.$$

(d) It is easy to see that
$$\|\Gamma - \hat{\Gamma}\|_F^2 = \sum_{i,j=1}^d (\gamma_{ij} - \hat{\gamma}_{ij})^2 = \sum_{i=1}^d (\gamma_{ii} - \hat{\gamma}_{ii})^2 + \sum_{i \neq j}(\gamma_{ij} - \hat{\gamma}_{ij})^2 \ge \sum_{i=1}^d (\gamma_{ii} - \hat{\gamma}_{ii})^2.$$
So, the optimal choice for $\hat{\Gamma}$ must have all off-diagonal elements equal to zero.
(e) Since we know that rank(Γ̂) ≤ k, we may conclude that the optimal Γ̂ can have at most k non-zero
diagonal elements.
(f) We may write for our optimal $\hat{\Gamma}$ that
$$\|\Gamma - \hat{\Gamma}\|_F^2 = \sum_{i=1}^d (\gamma_{ii} - \hat{\gamma}_{ii})^2 = \sum_{i=1}^k (\gamma_{ii} - \hat{\gamma}_{ii})^2 + \underbrace{\sum_{i=k+1}^d (\gamma_{ii} - \hat{\gamma}_{ii})^2}_{=\sum_{i=k+1}^d \gamma_{ii}^2} \ge \sum_{i=k+1}^d \gamma_{ii}^2.$$
Consequently, the minimum reconstruction error is equal to $\sum_{i=k+1}^d \gamma_{ii}^2$.

(g) If $\hat{X} = XV_kV_k^T$, we choose our optimal $\hat{\Gamma}$ such that
$$\hat{\Gamma} = U^T\hat{X}V = U^TXV_kV_k^TV = \underbrace{U^TU}_{=I}\Gamma V^TV_kV_k^TV = \Gamma(V^TV_k)(V_k^TV)$$
where
$$V_k^TV = (I_k \mid 0) \quad\text{and}\quad V^TV_k = \begin{pmatrix} I_k \\ 0 \end{pmatrix}.$$
This means that
$$\hat{\Gamma} = \Gamma\begin{pmatrix} I_k & 0 \\ 0 & 0 \end{pmatrix} = \mathrm{diag}(\gamma_1, \cdots, \gamma_k, 0, \cdots, 0).$$
Thus, we finally have
$$\Gamma - \hat{\Gamma} = \mathrm{diag}(0, \cdots, 0, \gamma_{k+1}, \cdots, \gamma_d),$$
and consequently
$$\|\Gamma - \hat{\Gamma}\|_F^2 = \sum_{i=k+1}^d \gamma_i^2.$$
So, for this choice of $\hat{\Gamma}$, we actually satisfy each of the conditions obtained in points (d), (e), and (f) that characterize the optimal choice of $\hat{\Gamma}$; so $\hat{X} = XV_kV_k^T$ is the optimal choice for the reconstructed matrix. In this case, we know from point (b) that
$$\|X - \hat{X}\|_F^2 \ge \|\Gamma - \hat{\Gamma}\|_F^2,$$
so for our optimal choice of $\hat{X}$, the minimum reconstruction error is $\sum_{i=k+1}^d \gamma_i^2$.
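
As a quick numerical check of this result (not part of the original solution), the sketch below verifies on an arbitrary data matrix that the reconstruction error of $\hat{X} = XV_kV_k^T$ equals the sum of the squared discarded singular values.

set.seed(4)
N1 <- 30; d1 <- 8; k1 <- 3
X1 <- matrix(rnorm(N1 * d1), N1, d1)
SVD1 <- svd(X1)
Vk1 <- SVD1$v[, 1:k1]
X1_hat <- X1 %*% Vk1 %*% t(Vk1)
sum((X1 - X1_hat)^2)        # reconstruction error ||X - X_hat||_F^2
sum(SVD1$d[(k1 + 1):d1]^2)  # sum of the squared discarded singular values (identical)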

Problem 9.10

(a) The main difference between Algorithm 1 and Algorithm 2 is that Algorithm 1 performs the SVD on all the data before splitting it for cross-validation, whereas Algorithm 2 first splits the data and performs the SVD on the training portion only.
(b) Here, we generate $N$ random normally distributed $d$-dimensional inputs $x_1, \cdots, x_N$ with their respective targets $y_n = w_f^Tx_n + \epsilon_n$, where $w_f$ is normally distributed and $\epsilon_n$ is independent Gaussian noise with variance 0.5.
set.seed(10)
d <- 5
N <- 40
k <- 3

target <- function(w, x) {
  return(sum(w * as.matrix(x)) + rnorm(1, sd = sqrt(0.5)))
}

data_gen <- function(d, N) {
wf <- rnorm(d)
X <- data.frame(x1 = rnorm(N), x2 = rnorm(N), x3 = rnorm(N),
x4 = rnorm(N), x5 = rnorm(N))
y <- apply(X, 1, target, w = wf)

return(list(X = X, y = y, wf = wf))
}

Now, we use Algorithms 1 and 2 to compute estimates of Eout .


E_cross_1 <- function(X, y) {
SVD <- svd(X)
Vk <- SVD$v[, 1:k]
Z <- as.matrix(X) %*% Vk
en <- numeric(N)
for (n in 1:N) {
Zn <- Z[-n, ]
yn <- y[-n]
wn_minus <- solve(t(Zn) %*% Zn) %*% t(Zn) %*% yn
en[n] <- (sum(Z[n, ] * wn_minus) - y[n])^2
}
E1 <- mean(en)

return(E1)
}

E_cross_2 <- function(X, y) {


en <- numeric(N)
for (n in 1:N) {
Xn <- X[-n, ]
yn <- y[-n]
SVD_minus <- svd(Xn)
Vk_minus <- SVD_minus$v[, 1:k]
Zn <- as.matrix(Xn) %*% Vk_minus
wn_minus <- solve(t(Zn) %*% Zn) %*% t(Zn) %*% yn
wn_minus2 <- Vk_minus %*% wn_minus
en[n] <- (sum(X[n, ] * wn_minus2) - y[n] )^2
}
E2 <- mean(en)

return(E2)
}

We compute Eout below.


E_out <- function(X, y, wf) {
SVD <- svd(X)
Vk <- SVD$v[, 1:k]
Z <- as.matrix(X) %*% Vk
w <- solve(t(Z) %*% Z) %*% t(Z) %*% y
w2 <- Vk %*% w

Ntest <- 1000

X_test <- data.frame(x1 = rnorm(Ntest), x2 = rnorm(Ntest), x3 = rnorm(Ntest),
                     x4 = rnorm(Ntest), x5 = rnorm(Ntest))
y_test <- apply(X_test, 1, target, w = wf)

E_out <- mean((as.matrix(X_test) %*% w2 - y_test)^2)

return(E_out)
}

Finally, we repeat this process $10^5$ times and we report the averages $\bar{E}_1$, $\bar{E}_2$, and $\bar{E}_{out}$.
iter <- 100000
E1 <- numeric(iter)
E2 <- numeric(iter)
Eout <- numeric(iter)
for (i in 1:iter) {
data <- data_gen(d, N)
X <- data$X
y <- data$y
wf <- data$wf

E1[i] <- E_cross_1(X, y)


E2[i] <- E_cross_2(X, y)
Eout[i] <- E_out(X, y, wf)
}
E1_avg <- mean(E1)
E2_avg <- mean(E2)
Eout_avg <- mean(Eout)

We plot below the histograms of E1 , E2 and Eout .


library(gridExtra)  # provides grid.arrange()

results <- data.frame(E1 = E1, E2 = E2, Eout = Eout)
plot1 <- ggplot(results, aes(x = E1)) + geom_histogram(bins = 20)
plot2 <- ggplot(results, aes(x = E2)) + geom_histogram(bins = 20)
plot3 <- ggplot(results, aes(x = Eout)) + geom_histogram(bins = 20)
grid.arrange(plot1, plot2, plot3, nrow = 1)

(Figure: side-by-side histograms of E1, E2 and Eout over the 10^5 repetitions.)

We get $\bar{E}_1 = 2.0445914$, $\bar{E}_2 = 2.5364857$, and $\bar{E}_{out} = 2.5386052$.


(c) As we may see, the average $\bar{E}_1$ is clearly not as close to $\bar{E}_{out}$ as $\bar{E}_2$ is. This was to be expected: in cross-validation, every part of the learning pipeline (including the SVD used to build the features) must be computed after the data split is done, not before.
(d) As stated in point (c), the correct estimate of $E_{out}$ is $E_2$.

Problem 9.12

(a) We have
$$\begin{aligned}
E_{out}^{\pi}(h) &= \frac{1}{4}\mathbb{E}_{x,\pi}[(h(x) - f_{\pi}(x))^2] \\
&= \frac{1}{4}\mathbb{E}_{\pi}\big[\mathbb{E}_{x|\pi}[(h(x) - f_{\pi}(x))^2 \mid \pi]\big] \\
&= \frac{1}{4}\mathbb{E}_{\pi}\Big[\sum_{n=1}^N (h(x_n) - f_{\pi}(x_n))^2\underbrace{\mathbb{P}[x = x_n \mid \pi]}_{=1/N}\Big] \\
&= \frac{1}{4N}\sum_{n=1}^N \mathbb{E}_{\pi}[(h(x_n) - f_{\pi}(x_n))^2].
\end{aligned}$$

(b) We know from point (a) that
$$E_{out}^{\pi}(h) = \frac{1}{4N}\sum_{n=1}^N \mathbb{E}_{\pi}[(h(x_n) - f_{\pi}(x_n))^2];$$
so, we may write that
$$E_{out}^{\pi}(h) = \frac{1}{4N}\sum_{n=1}^N \mathbb{E}_{\pi}[(h(x_n) - \bar{y} + \bar{y} - f_{\pi}(x_n))^2]
= \frac{1}{4N}\sum_{n=1}^N \Big(\underbrace{\mathbb{E}_{\pi}[(h(x_n) - \bar{y})^2]}_{(1)} + \underbrace{\mathbb{E}_{\pi}[(\bar{y} - f_{\pi}(x_n))^2]}_{(2)} + 2\underbrace{\mathbb{E}_{\pi}[(h(x_n) - \bar{y})(\bar{y} - f_{\pi}(x_n))]}_{(3)}\Big).$$
Let us treat each term separately. We immediately get
$$(1) = \mathbb{E}_{\pi}[(h(x_n) - \bar{y})^2] = (h(x_n) - \bar{y})^2.$$
We also get
$$(2) = \mathbb{E}_{\pi}[(\bar{y} - f_{\pi}(x_n))^2] = \mathbb{E}_{\pi}[(\bar{y} - y_{\pi_n})^2] = \sum_{i=1}^N (\bar{y} - y_i)^2\underbrace{\mathbb{P}[\pi_n = i]}_{=1/N} = \frac{1}{N}\sum_{i=1}^N (\bar{y} - y_i)^2.$$
And finally, we get
$$(3) = \mathbb{E}_{\pi}[(h(x_n) - \bar{y})(\bar{y} - f_{\pi}(x_n))] = (h(x_n) - \bar{y})\,\mathbb{E}_{\pi}[\bar{y} - y_{\pi_n}] = (h(x_n) - \bar{y})\sum_{i=1}^N (\bar{y} - y_i)\underbrace{\mathbb{P}[\pi_n = i]}_{=1/N} = \frac{1}{N}(h(x_n) - \bar{y})\underbrace{\sum_{i=1}^N (\bar{y} - y_i)}_{=0} = 0.$$
In conclusion, we get that
$$\begin{aligned}
E_{out}^{\pi}(h) &= \frac{1}{4N}\sum_{n=1}^N (h(x_n) - \bar{y})^2 + \frac{1}{4N^2}\sum_{n=1}^N\sum_{i=1}^N (\bar{y} - y_i)^2 \\
&= \frac{1}{4N}\sum_{n=1}^N (h(x_n) - \bar{y})^2 + \underbrace{\frac{1}{4N}\sum_{i=1}^N (\bar{y} - y_i)^2}_{=\frac{1}{4}s_y^2} \\
&= \frac{s_y^2}{4} + \frac{1}{4N}\sum_{n=1}^N (h(x_n) - \bar{y})^2.
\end{aligned}$$

(c) Similarly to $E_{out}^{\pi}$, we find that
$$\begin{aligned}
E_{in}^{\pi}(h) &= \frac{1}{4N}\sum_{n=1}^N (h(x_n) - y_{\pi_n})^2 = \frac{1}{4N}\sum_{n=1}^N (h(x_n) - \bar{y} + \bar{y} - y_{\pi_n})^2 \\
&= \frac{1}{4N}\sum_{n=1}^N \big[(h(x_n) - \bar{y})^2 + (\bar{y} - y_{\pi_n})^2 + 2(h(x_n) - \bar{y})(\bar{y} - y_{\pi_n})\big] \\
&= \frac{1}{4N}\sum_{n=1}^N (h(x_n) - \bar{y})^2 + \frac{1}{4N}\sum_{n=1}^N (\bar{y} - y_{\pi_n})^2 + \frac{1}{2N}\sum_{n=1}^N (h(x_n) - \bar{y})(\bar{y} - y_{\pi_n}) \\
&= \frac{s_y^2}{4} + \frac{1}{4N}\sum_{n=1}^N (h(x_n) - \bar{y})^2 + \frac{1}{2N}\sum_{n=1}^N (h(x_n) - \bar{y})(\bar{y} - y_{\pi_n}).
\end{aligned}$$

(d) We are now able to compute the permutation optimism penalty; we have
$$\begin{aligned}
E_{out}^{\pi}(g_{\pi}) - E_{in}^{\pi}(g_{\pi}) &= -\frac{1}{2N}\sum_{n=1}^N (g_{\pi}(x_n) - \bar{y})(\bar{y} - y_{\pi_n}) \\
&= -\frac{1}{2N}\sum_{n=1}^N g_{\pi}(x_n)(\bar{y} - y_{\pi_n}) + \frac{1}{2N}\bar{y}\underbrace{\sum_{n=1}^N (\bar{y} - y_{\pi_n})}_{=0} \\
&= \frac{1}{2N}\sum_{n=1}^N (y_{\pi_n} - \bar{y})\,g_{\pi}(x_n).
\end{aligned}$$

(e) It is easy to see that the permutation optimism penalty is proportional to the covariance between $y_{\pi_n}$ and $g_{\pi}(x_n)$.
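
A small simulation (not part of the original solution) can illustrate the estimate of point (d); here the learner $g_\pi$ is taken to be ordinary linear regression on $\pm 1$ targets, and the data, sample size and number of permutations are arbitrary choices.

set.seed(5)
N <- 50; d <- 3
X <- cbind(1, matrix(rnorm(N * d), N, d))           # inputs with a bias coordinate
y <- sign(X %*% rnorm(d + 1) + rnorm(N, sd = 0.5))  # +/-1 targets

n_perm <- 2000
penalty <- numeric(n_perm)
for (p in 1:n_perm) {
  y_perm <- sample(y)                              # permuted targets y_{pi_n}
  w_perm <- solve(t(X) %*% X) %*% t(X) %*% y_perm  # fit g_pi on the permuted data
  g_perm <- X %*% w_perm
  penalty[p] <- sum((y_perm - mean(y)) * g_perm) / (2 * N)
}
mean(penalty)  # estimated permutation optimism penalty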

Problem 9.13

For the Bootstrap, we also have that
$$\mathbb{P}[x = x_n] = \frac{1}{N},$$
consequently we may proceed in the same fashion as in Problem 9.12 to obtain the conclusions.

Problem 9.14

(a) Below, we estimate the Rademacher optimism penalty for N = 1, 2, · · · , 100.

set.seed(10)

h <- function(D, w0) {

return(sign(D$x - w0))
}

data_gen <- function(N) {


x <- runif(N, min = -1, max = 1)
r <- sample(c(-1, 1), size = N, replace = TRUE)

D_prime <- data.frame(x = x, r = r)

return(D_prime)
}

penalty <- numeric()


for (N in 1:100) {
iter <- 10000
D_prime <- data_gen(N)
w0 <- 0
w0_prime <- w0
Ein <- mean(D_prime$r != h(D_prime, w0))
Ein_prime <- Ein
for (i in 1:iter) {
D_mis <- subset(D_prime, r != h(D_prime, w0))
if (nrow(D_mis) == 0)
break
xt <- D_mis[sample(nrow(D_mis), 1), ]
w0 <- w0 + xt$x * xt$r
Ein <- mean(D_prime$r != h(D_prime, w0))
if (Ein < Ein_prime) {
w0_prime <- w0
}
Ein_prime <- mean(D_prime$r != h(D_prime, w0_prime))
}
penalty <- c(penalty, 0.5 - Ein_prime)
}

Below, we plot the penalty versus N .


ggplot(data.frame(N = 1:100, Penalty = penalty), aes(x = N, y = Penalty)) +
geom_line(col = "red") +
coord_cartesian(xlim = c(1, 100))

(Figure: the estimated Rademacher optimism penalty plotted against N for a single run.)

(b) Now, we repeat point (a) 1000 times to compute the average Rademacher penalty.
library(foreach)
library(doParallel)
registerDoParallel(cores = 2)  # register a parallel backend so %dopar% can run

penalty_matrix <- foreach(row = 1:1000, .combine = "rbind") %dopar% {
penalty <- numeric()
for (N in 1:100) {
iter <- 1000
D_prime <- data_gen(N)
w0 <- 0
w0_prime <- w0
Ein <- mean(D_prime$r != h(D_prime, w0))
Ein_prime <- Ein
for (i in 1:iter) {
D_mis <- subset(D_prime, r != h(D_prime, w0))
if (nrow(D_mis) == 0)
break
xt <- D_mis[sample(nrow(D_mis), 1), ]
w0 <- w0 + xt$x * xt$r
Ein <- mean(D_prime$r != h(D_prime, w0))
if (Ein < Ein_prime) {
w0_prime <- w0
}
Ein_prime <- mean(D_prime$r != h(D_prime, w0_prime))
}
penalty <- c(penalty, 0.5 - Ein_prime)
}
penalty
}

Below, we give a plot of the average penalty versus N .
penalty_avg <- apply(penalty_matrix, 2, FUN = mean)

ggplot(data.frame(N = 1:100, Penalty = penalty_avg), aes(x = N, y = Penalty)) +
  geom_line(col = "red") +
  coord_cartesian(xlim = c(1, 100))

(Figure: the average Rademacher optimism penalty over 1000 runs plotted against N.)

(c) We may see that the Rademacher penalty is $O(1/\sqrt{N})$, which is similar to the order of the VC penalty in this case.
one_square <- function(N) {
  return(1 / sqrt(N))
}

ggplot(data.frame(N = 1:100, Penalty = penalty_avg), aes(x = N, y = Penalty)) +
  geom_line(aes(colour = "Average Rademacher penalty")) +
  stat_function(fun = one_square, geom = "line", aes(colour = "1 / sqrt(N)")) +
  guides(colour = guide_legend(title = "Function")) +
  coord_cartesian(xlim = c(1, 100))

(Figure: the average Rademacher penalty and the reference curve 1/sqrt(N) plotted against N.)

Problem 9.15

We know from Chapter 2 that
$$\mathbb{P}[|E'_{out}(g_r) - E'_{in}(g_r)| > \epsilon] \le 4m_{\mathcal{H}}(2N)e^{-\frac{N\epsilon^2}{8}} \le 4m_{\mathcal{H}}(N)^2e^{-\frac{N\epsilon^2}{8}}.$$
If we let $\delta$ be
$$\delta = 4m_{\mathcal{H}}(N)^2e^{-\frac{N\epsilon^2}{8}},$$
we get
$$\epsilon = \sqrt{\frac{8}{N}\ln\left(\frac{4m_{\mathcal{H}}(N)^2}{\delta}\right)} \le \sqrt{\frac{16}{N}\ln\left(\frac{2m_{\mathcal{H}}(N)}{\delta}\right)},$$
where the last inequality uses $\delta \le 1$. In this case, we may conclude that, with probability at least $1 - \delta$, we have
$$\frac{1}{2} - E'_{in}(g_r) = E'_{out}(g_r) - E'_{in}(g_r) \le 4\sqrt{\frac{1}{N}\ln\left(\frac{2m_{\mathcal{H}}(N)}{\delta}\right)}.$$
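
For concreteness (an illustration, not part of the original solution), the small helper below evaluates this bound; it assumes the positive-ray hypothesis set of Problem 9.14, for which $m_{\mathcal{H}}(N) = N + 1$.

vc_style_penalty <- function(N, delta = 0.05, mH = function(N) N + 1) {
  4 * sqrt(log(2 * mH(N) / delta) / N)
}
vc_style_penalty(c(10, 100, 1000))  # the bound decreases roughly like 1/sqrt(N)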
