We conclude by underscoring the underlying “prior assumptions” behind PCA
and compressed sensing, which can help us understand the merits and pitfalls of the
two methods.
23.1 PRINCIPAL COMPONENT ANALYSIS (PCA)
Let $x_1, \ldots, x_m$ be $m$ vectors in $\mathbb{R}^d$. We would like to reduce the dimensionality of these vectors using a linear transformation. A matrix $W \in \mathbb{R}^{n,d}$, where $n < d$, induces a mapping $x \mapsto Wx$, where $Wx \in \mathbb{R}^n$ is the lower dimensionality representation of $x$. Then, a second matrix $U \in \mathbb{R}^{d,n}$ can be used to (approximately) recover each original vector $x$ from its compressed version. That is, for a compressed vector $y = Wx$, where $y$ is in the low dimensional space $\mathbb{R}^n$, we can construct $\tilde{x} = Uy$, so that $\tilde{x}$ is the recovered version of $x$ and resides in the original high dimensional space $\mathbb{R}^d$.
In PCA, we find the compression matrix $W$ and the recovering matrix $U$ so that the total squared distance between the original and recovered vectors is minimal; namely, we aim at solving the problem
\[
\operatorname*{argmin}_{W \in \mathbb{R}^{n,d},\, U \in \mathbb{R}^{d,n}} \; \sum_{i=1}^{m} \| x_i - U W x_i \|_2^2 . \tag{23.1}
\]
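As a concrete illustration of the objective in Equation (23.1), the following sketch evaluates the summed squared reconstruction error for an arbitrary pair $(W, U)$. It assumes NumPy; the data matrix `X`, the chosen dimensions, and the helper `reconstruction_error` are illustrative names, not anything defined in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, n = 100, 10, 3                 # m examples in R^d, compressed to R^n
X = rng.normal(size=(m, d))          # rows play the role of x_1, ..., x_m

def reconstruction_error(W, U, X):
    """Sum over i of ||x_i - U W x_i||_2^2 for W in R^{n,d} and U in R^{d,n}."""
    X_rec = (U @ (W @ X.T)).T        # recovered vectors U W x_i, one per row
    return np.sum((X - X_rec) ** 2)

# Any pair (W, U) can be plugged in; PCA looks for the pair minimizing this value.
W = rng.normal(size=(n, d))
U = rng.normal(size=(d, n))
print(reconstruction_error(W, U, X))
```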
To solve this problem we first show that the optimal solution takes a specific
form.
Lemma 23.1. Let $(U, W)$ be a solution to Equation (23.1). Then the columns of $U$ are orthonormal (namely, $U^\top U$ is the identity matrix of $\mathbb{R}^n$) and $W = U^\top$.
Proof. Fix any $U, W$ and consider the mapping $x \mapsto UWx$. The range of this mapping, $R = \{UWx : x \in \mathbb{R}^d\}$, is an $n$ dimensional linear subspace of $\mathbb{R}^d$. Let $V \in \mathbb{R}^{d,n}$ be a matrix whose columns form an orthonormal basis of this subspace, namely, the range of $V$ is $R$ and $V^\top V = I$. Therefore, each vector in $R$ can be written as $Vy$, where $y \in \mathbb{R}^n$. For every $x \in \mathbb{R}^d$ and $y \in \mathbb{R}^n$ we have
\[
\| x - Vy \|_2^2 = \| x \|^2 + y^\top V^\top V y - 2 y^\top V^\top x = \| x \|^2 + \| y \|^2 - 2 y^\top (V^\top x),
\]
where we used the fact that $V^\top V$ is the identity matrix of $\mathbb{R}^n$. Minimizing the preceding expression with respect to $y$ by comparing its gradient with respect to $y$, namely $2y - 2V^\top x$, to zero gives that $y = V^\top x$. Therefore, for each $x$ we have that
\[
V V^\top x = \operatorname*{argmin}_{\tilde{x} \in R} \| x - \tilde{x} \|_2^2 .
\]
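The identity $y = V^\top x$ can also be checked numerically. The following sketch, assuming NumPy and using illustrative names only, compares the closed-form minimizer from the proof with a generic least-squares solution of $\min_y \| x - Vy \|_2^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 8, 3
V, _ = np.linalg.qr(rng.normal(size=(d, n)))     # orthonormal columns: V^T V = I
x = rng.normal(size=d)

y_closed_form = V.T @ x                          # the minimizer y = V^T x from the proof
y_lstsq, *_ = np.linalg.lstsq(V, x, rcond=None)  # generic minimizer of ||x - V y||_2^2

print(np.allclose(y_closed_form, y_lstsq))       # True: the two solutions coincide
```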
In particular this holds for $x_1, \ldots, x_m$, and therefore we can replace $U, W$ by $V, V^\top$ without increasing the objective:
\[
\sum_{i=1}^{m} \| x_i - U W x_i \|_2^2 \;\ge\; \sum_{i=1}^{m} \| x_i - V V^\top x_i \|_2^2 .
\]
Since this holds for every $U, W$, the proof of the lemma follows.
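To make the lemma concrete, the following sketch (again assuming NumPy, with all names illustrative) builds an orthonormal basis $V$ for the range of an arbitrary $UW$ and checks that projecting with $VV^\top$ does not increase the objective.

```python
import numpy as np

rng = np.random.default_rng(2)
m, d, n = 50, 10, 3
X = rng.normal(size=(m, d))          # rows play the role of x_1, ..., x_m
U = rng.normal(size=(d, n))          # an arbitrary recovering matrix
W = rng.normal(size=(n, d))          # an arbitrary compression matrix

def objective(X, A):
    """Sum over i of ||x_i - A x_i||_2^2 for a d-by-d linear map A."""
    return np.sum((X - X @ A.T) ** 2)

# Orthonormal basis V for the range of x -> U W x, obtained from an SVD.
Q, s, _ = np.linalg.svd(U @ W)
r = int(np.sum(s > 1e-10))           # numerical rank of U W
V = Q[:, :r]                         # columns of V form a basis of the range

# Replacing (U, W) by (V, V^T) never increases the objective, as in the lemma.
print(objective(X, V @ V.T) <= objective(X, U @ W) + 1e-9)
```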