a few definitions from vector calculus. Each element of the gradient is the partial derivative with respect to the variable in $w$ corresponding to one of the edges of the network. Recall the definition of a partial derivative. Given a function $f : \mathbb{R}^n \to \mathbb{R}$, the partial derivative with respect to the $i$th variable at $w$ is obtained by fixing the values of $w_1,\ldots,w_{i-1},w_{i+1},\ldots,w_n$, which yields the scalar function $g : \mathbb{R} \to \mathbb{R}$ defined by $g(a) = f((w_1,\ldots,w_{i-1},w_i+a,w_{i+1},\ldots,w_n))$, and then taking the derivative of $g$ at $0$. For a function with multiple outputs, $f : \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian of $f$ at $w \in \mathbb{R}^n$, denoted $J_w(f)$, is the $m \times n$ matrix whose $(i,j)$ element is the partial derivative of $f_i : \mathbb{R}^n \to \mathbb{R}$ w.r.t. its $j$th variable at $w$. Note that if $m = 1$ then the Jacobian matrix is the gradient of the function (represented as a row vector). Two examples of Jacobian calculations, which we will later use, are as follows.

Let $f(w) = Aw$ for $A \in \mathbb{R}^{m,n}$. Then $J_w(f) = A$.
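As a numerical aside (not part of the original text), this identity can be checked by approximating the Jacobian with central finite differences; the helper name numeric_jacobian and the particular matrix below are only an illustrative sketch.

```python
import numpy as np

def numeric_jacobian(f, w, eps=1e-6):
    """Approximate J_w(f) column by column with central finite differences."""
    w = np.asarray(w, dtype=float)
    m, n = f(w).shape[0], w.shape[0]
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        J[:, j] = (f(w + e) - f(w - e)) / (2 * eps)
    return J

# For the linear map f(w) = Aw the Jacobian should equal A.
A = np.array([[1.0, 2.0, 0.0],
              [0.5, -1.0, 3.0]])   # A in R^{2,3}
w = np.array([0.3, -0.7, 1.2])
print(np.allclose(numeric_jacobian(lambda v: A @ v, w), A))  # True
```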
For every $n$, we use the notation $\sigma$ to denote the function from $\mathbb{R}^n$ to $\mathbb{R}^n$ which applies the sigmoid function element-wise. That is, $\alpha = \sigma(\theta)$ means that for every $i$ we have $\alpha_i = \sigma(\theta_i) = \frac{1}{1+\exp(-\theta_i)}$. It is easy to verify that $J_\theta(\sigma)$ is a diagonal matrix whose $(i,i)$ entry is $\sigma'(\theta_i)$, where $\sigma'$ is the derivative function of the (scalar) sigmoid function, namely, $\sigma'(\theta_i) = \frac{1}{(1+\exp(\theta_i))(1+\exp(-\theta_i))}$. We also use the notation $\operatorname{diag}(\sigma'(\theta))$ to denote this matrix.
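A similar illustrative check (again not from the text, and reusing a hypothetical finite-difference helper) confirms that the Jacobian of the element-wise sigmoid is the diagonal matrix $\operatorname{diag}(\sigma'(\theta))$:

```python
import numpy as np

def sigmoid(theta):
    return 1.0 / (1.0 + np.exp(-theta))   # element-wise sigmoid

def sigmoid_deriv(theta):
    # sigma'(theta_i) = 1 / ((1 + exp(theta_i)) * (1 + exp(-theta_i)))
    return 1.0 / ((1.0 + np.exp(theta)) * (1.0 + np.exp(-theta)))

def numeric_jacobian(f, w, eps=1e-6):
    J = np.zeros((f(w).shape[0], w.shape[0]))
    for j in range(w.shape[0]):
        e = np.zeros_like(w)
        e[j] = eps
        J[:, j] = (f(w + e) - f(w - e)) / (2 * eps)
    return J

theta = np.array([0.1, -2.0, 1.5])
J = numeric_jacobian(sigmoid, theta)
# Off-diagonal entries vanish and the diagonal matches sigma'(theta).
print(np.allclose(J, np.diag(sigmoid_deriv(theta))))  # True
```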
The chain rule for taking the derivative of a composition of functions can be written in terms of the Jacobian as follows. Given two functions $f : \mathbb{R}^n \to \mathbb{R}^m$ and $g : \mathbb{R}^k \to \mathbb{R}^n$, we have that the Jacobian of the composition function, $(f \circ g) : \mathbb{R}^k \to \mathbb{R}^m$, at $w$, is
$$J_w(f \circ g) = J_{g(w)}(f)\, J_w(g).$$
For example, for $g(w) = Aw$, where $A \in \mathbb{R}^{n,k}$, we have that
$$J_w(\sigma \circ g) = \operatorname{diag}\!\big(\sigma'(Aw)\big)\, A.$$
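The same kind of finite-difference sketch can be used to verify this chain-rule identity numerically; the code below is illustrative only, with an arbitrary random $A$:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sigmoid_deriv(t):
    return 1.0 / ((1.0 + np.exp(t)) * (1.0 + np.exp(-t)))

def numeric_jacobian(f, w, eps=1e-6):
    J = np.zeros((f(w).shape[0], w.shape[0]))
    for j in range(w.shape[0]):
        e = np.zeros_like(w)
        e[j] = eps
        J[:, j] = (f(w + e) - f(w - e)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))               # A in R^{n,k} with n = 4, k = 3
w = rng.standard_normal(3)
J_numeric = numeric_jacobian(lambda v: sigmoid(A @ v), w)
J_chain = np.diag(sigmoid_deriv(A @ w)) @ A   # diag(sigma'(Aw)) A
print(np.allclose(J_numeric, J_chain))        # True
```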
To describe the backpropagation algorithm, let us first decompose $V$ into the layers of the graph, $V = \dot\cup_{t=0}^{T} V_t$. For every $t$, let us write $V_t = \{v_{t,1},\ldots,v_{t,k_t}\}$, where $k_t = |V_t|$. In addition, for every $t$ denote by $W_t \in \mathbb{R}^{k_{t+1},k_t}$ a matrix which gives a weight to every potential edge between $V_t$ and $V_{t+1}$. If the edge exists in $E$ then we set $W_{t,i,j}$ to be the weight, according to $w$, of the edge $(v_{t,j}, v_{t+1,i})$. Otherwise, we add a "phantom" edge and set its weight to be zero, $W_{t,i,j} = 0$. Since when calculating the partial derivative with respect to the weight of some edge we fix all other weights, these additional "phantom" edges have no effect on the partial derivative with respect to existing edges. It follows that we can assume, without loss of generality, that all edges exist, that is, $E = \cup_t (V_t \times V_{t+1})$.
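As an illustrative sketch (the function layer_weight_matrix and its edge encoding are not from the text), the "phantom" edge convention amounts to filling a dense $k_{t+1} \times k_t$ matrix with zeros wherever no edge exists:

```python
import numpy as np

def layer_weight_matrix(k_next, k_t, edge_weights):
    """Build W_t in R^{k_{t+1}, k_t}: entry (i, j) holds the weight of the edge
    (v_{t,j}, v_{t+1,i}) when it exists in E, and 0 for a "phantom" edge."""
    W = np.zeros((k_next, k_t))
    for (j, i), weight in edge_weights.items():   # j indexes V_t, i indexes V_{t+1}
        W[i, j] = weight
    return W

# Hypothetical layer with k_t = 3 and k_{t+1} = 2 neurons and only three real edges.
W_t = layer_weight_matrix(2, 3, {(0, 0): 0.5, (2, 0): -1.0, (1, 1): 2.0})
print(W_t)
# [[ 0.5  0.  -1. ]
#  [ 0.   2.   0. ]]
```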
Next, we discuss how to calculate the partial derivatives with respect to the edges from $V_{t-1}$ to $V_t$, namely, with respect to the elements in $W_{t-1}$. Since we fix all other weights of the network, it follows that the outputs of all the neurons in $V_{t-1}$ are fixed numbers which do not depend on the weights in $W_{t-1}$. Denote the corresponding vector by $o_{t-1}$. In addition, let us denote by $\ell_t : \mathbb{R}^{k_t} \to \mathbb{R}$ the loss function of the subnetwork defined by layers $V_t,\ldots,V_T$ as a function of the outputs of the neurons in $V_t$. The input to the neurons of $V_t$ can be written as $a_t = W_{t-1} o_{t-1}$ and the output of the neurons of $V_t$ is $o_t = \sigma(a_t)$. That is, for every $j$ we have $o_{t,j} = \sigma(a_{t,j})$.
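To make this notation concrete, here is a minimal sketch (with hypothetical layer sizes and weights, not taken from the text) of the forward computation $a_t = W_{t-1} o_{t-1}$, $o_t = \sigma(a_t)$ carried out layer by layer:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def forward(weights, x):
    """Given W_0, ..., W_{T-1} and the network input o_0 = x, return the layer
    inputs a_t = W_{t-1} o_{t-1} and outputs o_t = sigmoid(a_t) for t = 1, ..., T."""
    a_list, o_list = [], [np.asarray(x, dtype=float)]
    for W in weights:
        a = W @ o_list[-1]           # a_t = W_{t-1} o_{t-1}
        o_list.append(sigmoid(a))    # o_t = sigma(a_t), applied element-wise
        a_list.append(a)
    return a_list, o_list

# Hypothetical layer sizes 3 -> 4 -> 2 with arbitrary weights.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
a_list, o_list = forward(weights, rng.standard_normal(3))
print([o.shape for o in o_list])   # [(3,), (4,), (2,)]
```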