Eugenio Beltrami (1873) and Camille Jordan (1874). It has been rediscovered many
times. In the statistical literature, it was introduced by Pearson (1901). Besides PCA
and SVD, several other names refer to the same idea in different scientific
communities. A few examples are the Eckart-Young the-
orem (after Carl Eckart and Gale Young who analyzed the method in 1936), the
Schmidt-Mirsky theorem, factor analysis, and the Hotelling transform.
Compressed sensing was introduced in Donoho (2006) and in Candes and Tao (2005). See also Candes (2006).
23.7 EXERCISES
23.1 In this exercise we show that in the general case, exact recovery of a linear
compression scheme is impossible.
1. Let A ∈ R^{n,d} be an arbitrary compression matrix where n ≤ d − 1. Show that there exist u, v ∈ R^d, u ≠ v, such that Au = Av.
2. Conclude that exact recovery of a linear compression scheme is impossible.
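As a numerical illustration of part 1 (a sketch using NumPy, not part of the original exercise): whenever n ≤ d − 1, the matrix A has a nontrivial null space, so adding a null-space vector to any u yields a different v with Av = Au. The dimensions and random data below are arbitrary choices.

```python
import numpy as np

# Illustration only: a random "compression" matrix with n <= d - 1.
rng = np.random.default_rng(0)
n, d = 3, 5
A = rng.standard_normal((n, d))

# The null space of A is nontrivial because rank(A) <= n < d.
# The trailing right-singular vectors (zero singular value) span it.
_, _, Vt = np.linalg.svd(A)
z = Vt[-1]                      # a vector with A @ z ~ 0

u = rng.standard_normal(d)
v = u + z                       # u != v, yet both compress to the same point

print(np.allclose(A @ u, A @ v))   # True: exact recovery is impossible
```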
23.2 Let α ∈ R^d such that α_1 ≥ α_2 ≥ ··· ≥ α_d ≥ 0. Show that
\[
\max_{\beta \in [0,1]^d : \|\beta\|_1 \le n} \; \sum_{j=1}^{d} \alpha_j \beta_j \;=\; \sum_{j=1}^{n} \alpha_j .
\]
Hint: Take every vector β ∈ [0,1]^d such that ‖β‖_1 ≤ n. Let i be the minimal index for which β_i < 1. If i = n + 1 we are done. Otherwise, show that we can increase β_i, while possibly decreasing β_j for some j > i, and obtain a better solution. This will imply that the optimal solution is to set β_i = 1 for i ≤ n and β_i = 0 for i > n.
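The identity can also be checked numerically by solving the left-hand side as a linear program. The sketch below assumes SciPy's linprog is available; the dimensions and the random α are arbitrary choices, not part of the exercise.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
d, n = 8, 3
alpha = np.sort(rng.random(d))[::-1]        # alpha_1 >= ... >= alpha_d >= 0

# Maximize sum_j alpha_j * beta_j subject to 0 <= beta_j <= 1 and sum_j beta_j <= n.
# linprog minimizes, so the objective is negated.
res = linprog(c=-alpha,
              A_ub=np.ones((1, d)), b_ub=[n],
              bounds=[(0, 1)] * d)

print(np.isclose(-res.fun, alpha[:n].sum()))   # True: the maximum equals the top-n sum
```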
23.3 Kernel PCA: In this exercise we show how PCA can be used for construct-
ing nonlinear dimensionality reduction on the basis of the kernel trick (see
Chapter 16).
Let X be some instance space and let S = {x_1, ..., x_m} be a set of points in X. Consider a feature mapping ψ : X → V, where V is some Hilbert space (possibly of infinite dimension). Let K : X × X → R be a kernel function, that is, K(x, x') = ⟨ψ(x), ψ(x')⟩. Kernel PCA is the process of mapping the elements in S into V using ψ, and then applying PCA over {ψ(x_1), ..., ψ(x_m)} into R^n. The output of this process is the set of reduced elements.
Show how this process can be done in polynomial time in terms of m and n,
assuming that each evaluation of K(·,·) can be calculated in a constant time. In
particular, if your implementation requires multiplication of two matrices A and
B, verify that their product can be computed. Similarly, if an eigenvalue decom-
position of some matrix C is required, verify that this decomposition can be
computed.
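One way to see that polynomial time suffices is to work only with the m × m Gram matrix of kernel evaluations. The NumPy sketch below follows this route; the Gaussian kernel and the synthetic data are illustrative assumptions, not part of the exercise.

```python
import numpy as np

def kernel_pca(X, kernel, n):
    """Reduce the m points in X to n dimensions using only kernel evaluations.

    Sketch: all work is done on the m x m Gram matrix, so the running time is
    polynomial in m and n (assuming each kernel call takes constant time).
    """
    m = len(X)
    # Gram matrix: m^2 kernel evaluations.
    G = np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])
    # Eigendecomposition of a symmetric m x m matrix: polynomial time.
    eigvals, eigvecs = np.linalg.eigh(G)
    # Keep the n leading eigenpairs (eigh returns them in ascending order).
    idx = np.argsort(eigvals)[::-1][:n]
    lam, U = eigvals[idx], eigvecs[:, idx]
    # Row i is the n-dimensional representation of x_i.
    return U * np.sqrt(np.maximum(lam, 0.0))

# Example with a Gaussian kernel (an arbitrary choice for the demo).
rng = np.random.default_rng(2)
X = rng.standard_normal((50, 10))
gauss = lambda x, y: np.exp(-np.linalg.norm(x - y) ** 2)
reduced = kernel_pca(X, gauss, n=3)
print(reduced.shape)   # (50, 3)
```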
23.4 An Interpretation of PCA as Variance Maximization:
Let x_1, ..., x_m be m vectors in R^d, and let x be a random vector distributed according to the uniform distribution over x_1, ..., x_m. Assume that E[x] = 0.
1. Consider the problem of finding a unit vector, w ∈ R^d, such that the random variable ⟨w, x⟩ has maximal variance. That is, we would like to solve the problem
\[
\operatorname*{argmax}_{w : \|w\| = 1} \operatorname{Var}[\langle w, x \rangle] \;=\; \operatorname*{argmax}_{w : \|w\| = 1} \frac{1}{m} \sum_{i=1}^{m} \left( \langle w, x_i \rangle \right)^2 .
\]
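As a numerical sanity check (a NumPy sketch with synthetic data, not part of the exercise): since E[x] = 0, Var[⟨w, x⟩] = w^T C w for the empirical covariance C = (1/m) Σ_i x_i x_i^T, so the maximizing unit vector is a leading eigenvector of C.

```python
import numpy as np

rng = np.random.default_rng(3)
m, d = 200, 5
x = rng.standard_normal((m, d)) @ rng.standard_normal((d, d))
x -= x.mean(axis=0)                      # enforce the E[x] = 0 assumption

# Empirical covariance (1/m) * sum_i x_i x_i^T.
C = x.T @ x / m

# For a unit vector w, the variance of <w, x> is w^T C w, which the
# Rayleigh-quotient argument shows is maximized by the top eigenvector of C.
eigvals, eigvecs = np.linalg.eigh(C)
w_star = eigvecs[:, -1]                  # eigh returns ascending eigenvalues

def variance(w):
    return np.mean((x @ w) ** 2)

# w_star attains at least the variance of any other unit direction.
u = rng.standard_normal(d); u /= np.linalg.norm(u)
print(variance(w_star) >= variance(u))            # True
print(np.isclose(variance(w_star), eigvals[-1]))  # variance equals the top eigenvalue
```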