
Dimensionality Reduction

On the basis of the preceding lemma, we can rewrite the optimization problem given in Equation (23.1) as follows:
\[
\operatorname*{argmin}_{U \in \mathbb{R}^{d,n}:\, U^\top U = I} \;\sum_{i=1}^{m} \left\| x_i - U U^\top x_i \right\|_2^2 . \tag{23.2}
\]

We further simplify the optimization problem by using the following elementary algebraic manipulations. For every $x \in \mathbb{R}^d$ and a matrix $U \in \mathbb{R}^{d,n}$ such that $U^\top U = I$ we have
\[
\begin{aligned}
\|x - U U^\top x\|^2 &= \|x\|^2 - 2\, x^\top U U^\top x + x^\top U U^\top U U^\top x \\
&= \|x\|^2 - x^\top U U^\top x \\
&= \|x\|^2 - \operatorname{trace}\!\left(U^\top x x^\top U\right),
\end{aligned} \tag{23.3}
\]
                 where the trace of a matrix is the sum of its diagonal entries. Since the trace is a
                 linear operator, this allows us to rewrite Equation (23.2) as follows:

\[
\operatorname*{argmax}_{U \in \mathbb{R}^{d,n}:\, U^\top U = I} \;\operatorname{trace}\!\left( U^\top \sum_{i=1}^{m} x_i x_i^\top \, U \right). \tag{23.4}
\]

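The algebraic identity behind this rewriting is easy to check numerically. The following is a minimal sketch (not from the text; it assumes NumPy and an orthonormal $U$ obtained from a QR factorization of a random matrix) verifying that the reconstruction error $\|x - UU^\top x\|^2$ equals $\|x\|^2 - \operatorname{trace}(U^\top x x^\top U)$, which is exactly what turns the minimization in (23.2) into the maximization in (23.4):

```python
import numpy as np

# Numerical check of Equation (23.3): for an orthonormal U,
# ||x - U U^T x||^2 = ||x||^2 - trace(U^T x x^T U).
rng = np.random.default_rng(0)
d, n = 5, 2

x = rng.normal(size=d)
# Build an orthonormal U in R^{d,n} via QR decomposition of a random matrix.
U, _ = np.linalg.qr(rng.normal(size=(d, n)))

lhs = np.linalg.norm(x - U @ U.T @ x) ** 2
rhs = np.linalg.norm(x) ** 2 - np.trace(U.T @ np.outer(x, x) @ U)
assert np.isclose(lhs, rhs)
```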
Let $A = \sum_{i=1}^{m} x_i x_i^\top$. The matrix $A$ is symmetric and therefore it can be written using its spectral decomposition as $A = V D V^\top$, where $D$ is diagonal and $V^\top V = V V^\top = I$. Here, the elements on the diagonal of $D$ are the eigenvalues of $A$ and the columns of $V$ are the corresponding eigenvectors. We assume without loss of generality that $D_{1,1} \ge D_{2,2} \ge \cdots \ge D_{d,d}$. Since $A$ is positive semidefinite it also holds that $D_{d,d} \ge 0$. We claim that the solution to Equation (23.4) is the matrix $U$ whose columns are the $n$ eigenvectors of $A$ corresponding to the largest $n$ eigenvalues.
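These quantities are straightforward to compute. A short NumPy sketch (illustrative only; the data matrix X and its dimensions are invented here) forms $A$, takes its spectral decomposition, and orders the eigenvalues nonincreasingly as assumed above:

```python
import numpy as np

# Form A = sum_i x_i x_i^T and compute its spectral decomposition
# A = V D V^T with D_{1,1} >= D_{2,2} >= ... >= D_{d,d} >= 0.
rng = np.random.default_rng(0)
m, d = 100, 5
X = rng.normal(size=(m, d))          # rows are the examples x_1, ..., x_m

A = X.T @ X                          # equals sum_i x_i x_i^T
eigvals, V = np.linalg.eigh(A)       # eigh: A is symmetric (and PSD here)
order = np.argsort(eigvals)[::-1]    # reorder eigenvalues nonincreasingly
eigvals, V = eigvals[order], V[:, order]

assert np.all(eigvals >= -1e-10)             # A is positive semidefinite
assert np.allclose(V @ np.diag(eigvals) @ V.T, A)
```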
Theorem 23.2. Let $x_1, \ldots, x_m$ be arbitrary vectors in $\mathbb{R}^d$, let $A = \sum_{i=1}^{m} x_i x_i^\top$, and let $u_1, \ldots, u_n$ be $n$ eigenvectors of the matrix $A$ corresponding to the largest $n$ eigenvalues of $A$. Then, the solution to the PCA optimization problem given in Equation (23.1) is to set $U$ to be the matrix whose columns are $u_1, \ldots, u_n$ and to set $W = U^\top$.


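Before turning to the proof, here is a minimal sketch of what the theorem asserts (again assuming NumPy and invented data X and dimensions m, d, n): $U$ is built from the $n$ leading eigenvectors of $A$, $W = U^\top$, and the resulting value of the objective in Equation (23.2) is compared against that of a random orthonormal $U$.

```python
import numpy as np

# Illustrative sketch of the solution described by Theorem 23.2.
rng = np.random.default_rng(1)
m, d, n = 100, 5, 2
X = rng.normal(size=(m, d))          # rows are x_1, ..., x_m

A = X.T @ X
eigvals, V = np.linalg.eigh(A)
U = V[:, np.argsort(eigvals)[::-1][:n]]   # eigenvectors of the n largest eigenvalues
W = U.T                                   # compression matrix, as in the theorem

def reconstruction_error(Umat):
    # Objective of Equation (23.2): sum_i ||x_i - U U^T x_i||^2.
    R = X - X @ Umat @ Umat.T
    return np.sum(R ** 2)

# The PCA choice of U should do at least as well as a random orthonormal U.
U_rand, _ = np.linalg.qr(rng.normal(size=(d, n)))
assert reconstruction_error(U) <= reconstruction_error(U_rand) + 1e-9
```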
Proof. Let $V D V^\top$ be the spectral decomposition of $A$. Fix some matrix $U \in \mathbb{R}^{d,n}$ with orthonormal columns and let $B = V^\top U$. Then, $V B = V V^\top U = U$. It follows that





\[
U^\top A U = B^\top V^\top V D V^\top V B = B^\top D B,
\]
and therefore
\[
\operatorname{trace}(U^\top A U) = \sum_{j=1}^{d} D_{j,j} \sum_{i=1}^{n} B_{j,i}^2 .
\]

Note that $B^\top B = U^\top V V^\top U = U^\top U = I$. Therefore, the columns of $B$ are also orthonormal, which implies that $\sum_{j=1}^{d} \sum_{i=1}^{n} B_{j,i}^2 = n$. In addition, let $\tilde{B} \in \mathbb{R}^{d,d}$ be a matrix such that its first $n$ columns are the columns of $B$ and in addition $\tilde{B}^\top \tilde{B} = I$. Then, for every $j$ we have $\sum_{i=1}^{d} \tilde{B}_{j,i}^2 = 1$, which implies that $\sum_{i=1}^{n} B_{j,i}^2 \le 1$. It