
Dimensionality Reduction

On the basis of the preceding lemma, we can rewrite the optimization problem given in Equation (23.1) as follows:
\[
\operatorname*{argmin}_{U \in \mathbb{R}^{d,n}:\, U^\top U = I} \;\sum_{i=1}^{m} \left\| x_i - U U^\top x_i \right\|_2^2 . \tag{23.2}
\]

We further simplify the optimization problem by using the following elementary algebraic manipulations. For every $x \in \mathbb{R}^d$ and a matrix $U \in \mathbb{R}^{d,n}$ such that $U^\top U = I$ we have
\[
\begin{aligned}
\|x - U U^\top x\|^2 &= \|x\|^2 - 2\, x^\top U U^\top x + x^\top U U^\top U U^\top x \\
&= \|x\|^2 - x^\top U U^\top x \\
&= \|x\|^2 - \operatorname{trace}\!\left(U^\top x x^\top U\right),
\end{aligned} \tag{23.3}
\]
                 where the trace of a matrix is the sum of its diagonal entries. Since the trace is a
                 linear operator, this allows us to rewrite Equation (23.2) as follows:

\[
\operatorname*{argmax}_{U \in \mathbb{R}^{d,n}:\, U^\top U = I} \;\operatorname{trace}\!\left( U^\top \sum_{i=1}^{m} x_i x_i^\top \, U \right). \tag{23.4}
\]

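The algebraic identity behind this rewriting is easy to check numerically. The following is a minimal sketch (not from the text; it assumes NumPy and an orthonormal $U$ obtained from a QR factorization of a random matrix) verifying that the reconstruction error $\|x - UU^\top x\|^2$ equals $\|x\|^2 - \operatorname{trace}(U^\top x x^\top U)$, which is exactly what turns the minimization in (23.2) into the maximization in (23.4):

```python
import numpy as np

# Numerical check of Equation (23.3): for an orthonormal U,
# ||x - U U^T x||^2 = ||x||^2 - trace(U^T x x^T U).
rng = np.random.default_rng(0)
d, n = 5, 2

x = rng.normal(size=d)
# Build an orthonormal U in R^{d,n} via QR decomposition of a random matrix.
U, _ = np.linalg.qr(rng.normal(size=(d, n)))

lhs = np.linalg.norm(x - U @ U.T @ x) ** 2
rhs = np.linalg.norm(x) ** 2 - np.trace(U.T @ np.outer(x, x) @ U)
assert np.isclose(lhs, rhs)
```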
Let $A = \sum_{i=1}^{m} x_i x_i^\top$. The matrix $A$ is symmetric and therefore it can be written using its spectral decomposition as $A = V D V^\top$, where $D$ is diagonal and $V^\top V = V V^\top = I$. Here, the elements on the diagonal of $D$ are the eigenvalues of $A$ and the columns of $V$ are the corresponding eigenvectors. We assume without loss of generality that $D_{1,1} \ge D_{2,2} \ge \cdots \ge D_{d,d}$. Since $A$ is positive semidefinite it also holds that $D_{d,d} \ge 0$. We claim that the solution to Equation (23.4) is the matrix $U$ whose columns are the $n$ eigenvectors of $A$ corresponding to the largest $n$ eigenvalues.
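These quantities are straightforward to compute. A short NumPy sketch (illustrative only; the data matrix X and its dimensions are invented here) forms $A$, takes its spectral decomposition, and orders the eigenvalues nonincreasingly as assumed above:

```python
import numpy as np

# Form A = sum_i x_i x_i^T and compute its spectral decomposition
# A = V D V^T with D_{1,1} >= D_{2,2} >= ... >= D_{d,d} >= 0.
rng = np.random.default_rng(0)
m, d = 100, 5
X = rng.normal(size=(m, d))          # rows are the examples x_1, ..., x_m

A = X.T @ X                          # equals sum_i x_i x_i^T
eigvals, V = np.linalg.eigh(A)       # eigh: A is symmetric (and PSD here)
order = np.argsort(eigvals)[::-1]    # reorder eigenvalues nonincreasingly
eigvals, V = eigvals[order], V[:, order]

assert np.all(eigvals >= -1e-10)             # A is positive semidefinite
assert np.allclose(V @ np.diag(eigvals) @ V.T, A)
```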
Theorem 23.2. Let $x_1, \ldots, x_m$ be arbitrary vectors in $\mathbb{R}^d$, let $A = \sum_{i=1}^{m} x_i x_i^\top$, and let $u_1, \ldots, u_n$ be $n$ eigenvectors of the matrix $A$ corresponding to the largest $n$ eigenvalues of $A$. Then, the solution to the PCA optimization problem given in Equation (23.1) is to set $U$ to be the matrix whose columns are $u_1, \ldots, u_n$ and to set $W = U^\top$.


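Before turning to the proof, here is a minimal sketch of what the theorem asserts (again assuming NumPy and invented data X and dimensions m, d, n): $U$ is built from the $n$ leading eigenvectors of $A$, $W = U^\top$, and the resulting value of the objective in Equation (23.2) is compared against that of a random orthonormal $U$.

```python
import numpy as np

# Illustrative sketch of the solution described by Theorem 23.2.
rng = np.random.default_rng(1)
m, d, n = 100, 5, 2
X = rng.normal(size=(m, d))          # rows are x_1, ..., x_m

A = X.T @ X
eigvals, V = np.linalg.eigh(A)
U = V[:, np.argsort(eigvals)[::-1][:n]]   # eigenvectors of the n largest eigenvalues
W = U.T                                   # compression matrix, as in the theorem

def reconstruction_error(Umat):
    # Objective of Equation (23.2): sum_i ||x_i - U U^T x_i||^2.
    R = X - X @ Umat @ Umat.T
    return np.sum(R ** 2)

# The PCA choice of U should do at least as well as a random orthonormal U.
U_rand, _ = np.linalg.qr(rng.normal(size=(d, n)))
assert reconstruction_error(U) <= reconstruction_error(U_rand) + 1e-9
```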
Proof. Let $V D V^\top$ be the spectral decomposition of $A$. Fix some matrix $U \in \mathbb{R}^{d,n}$ with orthonormal columns and let $B = V^\top U$. Then, $V B = V V^\top U = U$. It follows that





\[
U^\top A U = B^\top V^\top V D V^\top V B = B^\top D B,
\]
and therefore
\[
\operatorname{trace}(U^\top A U) = \sum_{j=1}^{d} D_{j,j} \sum_{i=1}^{n} B_{j,i}^2 .
\]

Note that $B^\top B = U^\top V V^\top U = U^\top U = I$. Therefore, the columns of $B$ are also orthonormal, which implies that $\sum_{j=1}^{d} \sum_{i=1}^{n} B_{j,i}^2 = n$. In addition, let $\tilde{B} \in \mathbb{R}^{d,d}$ be a matrix such that its first $n$ columns are the columns of $B$ and in addition $\tilde{B}^\top \tilde{B} = I$. Then, for every $j$ we have $\sum_{i=1}^{d} \tilde{B}_{j,i}^2 = 1$, which implies that $\sum_{i=1}^{n} B_{j,i}^2 \le 1$. It