a few definitions from vector calculus. Each element of the gradient is the partial
derivative with respect to the variable in $w$ corresponding to one of the edges of the
network. Recall the definition of a partial derivative. Given a function $f : \mathbb{R}^n \to \mathbb{R}$,
the partial derivative with respect to the $i$th variable at $w$ is obtained by fixing the
values of $w_1,\ldots,w_{i-1},w_{i+1},\ldots,w_n$, which yields the scalar function $g : \mathbb{R} \to \mathbb{R}$ defined
by $g(a) = f((w_1,\ldots,w_{i-1},w_i+a,w_{i+1},\ldots,w_n))$, and then taking the derivative of $g$
at $0$. For a function with multiple outputs, $f : \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian of $f$ at $w \in \mathbb{R}^n$,
denoted $J_w(f)$, is the $m \times n$ matrix whose $i,j$ element is the partial derivative of $f_i :
\mathbb{R}^n \to \mathbb{R}$ w.r.t. its $j$th variable at $w$. Note that if $m = 1$ then the Jacobian matrix is the
gradient of the function (represented as a row vector). Two examples of Jacobian
calculations, which we will later use, are as follows.
•  Let $f(w) = Aw$ for $A \in \mathbb{R}^{m,n}$. Then $J_w(f) = A$.
•  For every $n$, we use the notation $\sigma$ to denote the function from $\mathbb{R}^n$ to $\mathbb{R}^n$ which
   applies the sigmoid function element-wise. That is, $\alpha = \sigma(\theta)$ means that for
   every $i$ we have $\alpha_i = \sigma(\theta_i) = \frac{1}{1+\exp(-\theta_i)}$. It is easy to verify that $J_\theta(\sigma)$ is a diagonal
   matrix whose $(i,i)$ entry is $\sigma'(\theta_i)$, where $\sigma'$ is the derivative function of the
   (scalar) sigmoid function, namely, $\sigma'(\theta_i) = \frac{1}{(1+\exp(\theta_i))(1+\exp(-\theta_i))}$. We also use the
   notation $\operatorname{diag}(\sigma'(\theta))$ to denote this matrix.
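
As a quick numerical sanity check (our own illustration, not part of the text), the following NumPy sketch compares the claimed Jacobian $\operatorname{diag}(\sigma'(\theta))$ with a finite-difference approximation of $J_\theta(\sigma)$; the helper names `sigmoid`, `sigmoid_jacobian`, and `numerical_jacobian` are ours.

```python
import numpy as np

def sigmoid(theta):
    """Element-wise sigmoid: sigma(theta_i) = 1 / (1 + exp(-theta_i))."""
    return 1.0 / (1.0 + np.exp(-theta))

def sigmoid_jacobian(theta):
    """Jacobian of the element-wise sigmoid: diag(sigma'(theta)), with
    sigma'(theta_i) = 1 / ((1 + exp(theta_i)) * (1 + exp(-theta_i)))."""
    deriv = 1.0 / ((1.0 + np.exp(theta)) * (1.0 + np.exp(-theta)))
    return np.diag(deriv)

def numerical_jacobian(f, w, eps=1e-6):
    """Central finite-difference approximation of J_w(f)."""
    w = np.asarray(w, dtype=float)
    J = np.zeros((f(w).shape[0], w.size))
    for j in range(w.size):
        e = np.zeros_like(w)
        e[j] = eps
        J[:, j] = (f(w + e) - f(w - e)) / (2 * eps)
    return J

theta = np.array([-1.0, 0.0, 2.0])
print(np.allclose(sigmoid_jacobian(theta),
                  numerical_jacobian(sigmoid, theta)))  # True
```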

The chain rule for taking the derivative of a composition of functions can be
written in terms of the Jacobian as follows. Given two functions $f : \mathbb{R}^n \to \mathbb{R}^m$ and
$g : \mathbb{R}^k \to \mathbb{R}^n$, we have that the Jacobian of the composition function, $(f \circ g) : \mathbb{R}^k \to \mathbb{R}^m$,
at $w$, is
\[
J_w(f \circ g) = J_{g(w)}(f)\, J_w(g).
\]
For example, for $g(w) = Aw$, where $A \in \mathbb{R}^{n,k}$, we have that
\[
J_w(\sigma \circ g) = \operatorname{diag}(\sigma'(Aw))\, A.
\]
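
The chain-rule example $J_w(\sigma \circ g) = \operatorname{diag}(\sigma'(Aw))\,A$ can be checked numerically in the same way. The following is a minimal sketch under our own naming choices, not code from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))   # A in R^{n,k} with n = 4, k = 3
w = rng.normal(size=3)

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
# sigma'(x) = sigma(x) * (1 - sigma(x)), equal to 1 / ((1 + e^x)(1 + e^{-x}))
sigmoid_deriv = lambda x: sigmoid(x) * (1.0 - sigmoid(x))

# Closed form from the chain rule: J_w(sigma o g) = diag(sigma'(Aw)) A
J_closed = np.diag(sigmoid_deriv(A @ w)) @ A

# Central finite differences, one column per coordinate of w
eps = 1e-6
J_fd = np.column_stack([
    (sigmoid(A @ (w + eps * e)) - sigmoid(A @ (w - eps * e))) / (2 * eps)
    for e in np.eye(3)
])

print(np.allclose(J_closed, J_fd))  # True
```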
To describe the backpropagation algorithm, let us first decompose $V$ into the
layers of the graph, $V = \dot{\cup}_{t=0}^{T} V_t$. For every $t$, let us write $V_t = \{v_{t,1},\ldots,v_{t,k_t}\}$, where
$k_t = |V_t|$. In addition, for every $t$ let $W_t \in \mathbb{R}^{k_{t+1},k_t}$ denote a matrix which gives a weight to
every potential edge between $V_t$ and $V_{t+1}$. If the edge exists in $E$ then we set $W_{t,i,j}$ to
be the weight, according to $w$, of the edge $(v_{t,j},v_{t+1,i})$. Otherwise, we add a "phantom"
edge and set its weight to be zero, $W_{t,i,j} = 0$. Since when calculating the partial
derivative with respect to the weight of some edge we fix all other weights, these
additional "phantom" edges have no effect on the partial derivative with respect
to existing edges. It follows that we can assume, without loss of generality, that all
edges exist, that is, $E = \cup_t (V_t \times V_{t+1})$.
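
To make the "phantom edge" convention concrete, here is a small sketch (our own illustration; the function name and edge encoding are assumptions) that assembles the dense matrix $W_t$ from a sparse list of the existing weighted edges, leaving absent edges at weight zero:

```python
import numpy as np

def layer_weight_matrix(k_t, k_t_plus_1, weighted_edges):
    """Build W_t in R^{k_{t+1} x k_t} from a dict {(j, i): weight} listing only
    the edges (v_{t,j}, v_{t+1,i}) that are present in E; missing ("phantom")
    edges keep weight zero, so they do not affect any partial derivative."""
    W = np.zeros((k_t_plus_1, k_t))
    for (j, i), weight in weighted_edges.items():
        W[i, j] = weight
    return W

# Example: layer V_t has 3 neurons, layer V_{t+1} has 2, and only three edges exist.
W_t = layer_weight_matrix(3, 2, {(0, 0): 0.5, (2, 0): -1.2, (1, 1): 0.8})
print(W_t)
```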
Next, we discuss how to calculate the partial derivatives with respect to the edges
from $V_{t-1}$ to $V_t$, namely, with respect to the elements in $W_{t-1}$. Since we fix all other
weights of the network, it follows that the outputs of all the neurons in $V_{t-1}$ are fixed
numbers which do not depend on the weights in $W_{t-1}$. Denote the corresponding
vector by $o_{t-1}$. In addition, let us denote by $\ell_t : \mathbb{R}^{k_t} \to \mathbb{R}$ the loss function of the
subnetwork defined by layers $V_t,\ldots,V_T$ as a function of the outputs of the neurons
in $V_t$. The input to the neurons of $V_t$ can be written as $a_t = W_{t-1} o_{t-1}$ and the output
of the neurons of $V_t$ is $o_t = \sigma(a_t)$. That is, for every $j$ we have $o_{t,j} = \sigma(a_{t,j})$.
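
The quantities just introduced, $a_t = W_{t-1} o_{t-1}$ and $o_t = \sigma(a_t)$, describe the forward computation of the network layer by layer. The following is a minimal sketch of that pass (our own illustration, assuming the element-wise sigmoid at every layer; the names are ours):

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def forward_pass(weights, o_0):
    """Layer-by-layer forward computation: given [W_0, ..., W_{T-1}] and the
    input layer's output o_0, return the layer inputs a_t = W_{t-1} o_{t-1}
    and layer outputs o_t = sigmoid(a_t)."""
    outputs, inputs = [o_0], []
    for W in weights:
        a_t = W @ outputs[-1]         # a_t = W_{t-1} o_{t-1}
        inputs.append(a_t)
        outputs.append(sigmoid(a_t))  # o_t = sigma(a_t)
    return inputs, outputs

# Tiny example: a 3-2-1 network with arbitrary weights.
rng = np.random.default_rng(1)
weights = [rng.normal(size=(2, 3)), rng.normal(size=(1, 2))]
inputs, outputs = forward_pass(weights, np.array([1.0, -0.5, 2.0]))
print(outputs[-1])  # output of the last layer
```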