Page 298 - Understanding Machine Learning
P. 298
Dimensionality Reduction
280
On the basis of the preceding lemma, we can rewrite the optimization problem
given in Equation (23.1)asfollows:
m
2
argmin x i − UU x i . (23.2)
2
U∈R d,n :U U=I i=1
We further simplify the optimization problem by using the following elementary
d
algebraic manipulations. For every x ∈ R and a matrix U ∈ R d,n such that U U = I
we have
2
2
x − UU x = x − 2x UU x + x UU UU x
2
= x − x UU x
2
= x − trace(U xx U), (23.3)
where the trace of a matrix is the sum of its diagonal entries. Since the trace is a
linear operator, this allows us to rewrite Equation (23.2) as follows:
m
argmax trace U
x i x U . (23.4)
i
U∈R d,n :U U=I i=1
m
x
Let A = i=1 i x . The matrix A is symmetric and therefore it can be written
i
using its spectral decomposition as A = VDV ,where D is diagonal and V V =
VV = I. Here, the elements on the diagonal of D are the eigenvalues of A and
the columns of V are the corresponding eigenvectors. We assume without loss of
generality that D 1,1 ≥ D 2,2 ≥ ··· ≥ D d,d .Since A is positive semidefinite it also holds
that D d,d ≥ 0. We claim that the solution to Equation (23.4)isthe matrix U whose
columns are the n eigenvectors of A corresponding to the largest n eigenvalues.
d m
x
Theorem 23.2. Let x 1 ,...,x m be arbitrary vectors in R ,let A = i=1 i x ,and let
i
u 1 ,...,u n be n eigenvectors of the matrix A corresponding to the largest n eigenvalues
of A. Then, the solution to the PCA optimization problem given in Equation (23.1)is
to set U to be the matrix whose columns are u 1 ,...,u n andtoset W = U .
Proof. Let VDV be the spectral decomposition of A. Fix some matrix U ∈ R d,n
with orthonormal columns and let B = V U. Then, VB = VV U = U. It follows
that
U AU = B V VDV VB = B DB,
and therefore
d n
2
trace(U AU) = D j, j B .
j,i
j=1 i=1
Note that B B = U VV U = U U = I. Therefore, the columns of B are also
d n 2 d,d
˜
orthonormal, which implies that B = n. In addition, let B ∈ R be
j=1 i=1 j,i
a matrix such that its first n columns are the columns of B and in addition B B = I.
˜
˜
d n 2
Then, for every j we have B ˜ 2 = 1, which implies that B ≤ 1. It
i=1 j,i i=1 j,i