
20.6 SGD and Backpropagation


                        SGD for Neural Networks

parameters:
  number of iterations τ
  step size sequence η_1, η_2, ..., η_τ
  regularization parameter λ > 0
input:
  layered graph (V, E)
  differentiable activation function σ : R → R
initialize:
  choose w^(1) ∈ R^|E| at random
    (from a distribution s.t. w^(1) is close enough to 0)
for i = 1, 2, ..., τ
  sample (x, y) ∼ D
  calculate gradient v_i = backpropagation(x, y, w^(i), (V, E), σ)
  update w^(i+1) = w^(i) − η_i (v_i + λ w^(i))
output:
  w̄ is the best performing w^(i) on a validation set
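
To make the loop concrete, here is a minimal NumPy sketch of the SGD box above. The function names (sgd_for_network, sample_example, backprop_gradient, validation_loss) and the choice to store w as a list of per-layer weight matrices, with W[t][i, j] playing the role of W_{t,i,j}, are assumptions of the sketch rather than part of the pseudocode; a matching gradient routine is sketched after the Backpropagation box below.

    import numpy as np

    def sgd_for_network(layer_sizes, sample_example, backprop_gradient,
                        num_iters, step_sizes, lam, sigma, validation_loss):
        """SGD with l2 regularization over a layered network's weights.

        layer_sizes:       [k_0, ..., k_T], number of neurons per layer
        sample_example:    callable returning a fresh (x, y) drawn from D
        backprop_gradient: callable (x, y, W, sigma) -> gradient, same shapes as W
        step_sizes:        sequence eta_1, ..., eta_tau
        lam:               regularization parameter lambda > 0
        validation_loss:   callable W -> estimated loss, used to pick the output
        """
        rng = np.random.default_rng(0)
        # initialize: small random weights close to 0;
        # W[t][i, j] is the weight of the edge (v_{t,j}, v_{t+1,i})
        W = [0.01 * rng.standard_normal((layer_sizes[t + 1], layer_sizes[t]))
             for t in range(len(layer_sizes) - 1)]

        best_W, best_loss = [w.copy() for w in W], np.inf
        for i in range(num_iters):
            x, y = sample_example()                   # sample (x, y) ~ D
            grad = backprop_gradient(x, y, W, sigma)  # v_i via backpropagation
            eta = step_sizes[i]
            # update: w^(i+1) = w^(i) - eta_i (v_i + lambda * w^(i))
            W = [w - eta * (g + lam * w) for w, g in zip(W, grad)]
            loss = validation_loss(W)
            if loss < best_loss:                      # keep the best iterate seen
                best_loss, best_W = loss, [w.copy() for w in W]
        return best_W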


                           Backpropagation

input:
  example (x, y), weight vector w, layered graph (V, E),
  activation function σ : R → R
initialize:
  denote the layers of the graph V_0, ..., V_T where V_t = {v_{t,1}, ..., v_{t,k_t}}
  define W_{t,i,j} as the weight of (v_{t,j}, v_{t+1,i})
    (where we set W_{t,i,j} = 0 if (v_{t,j}, v_{t+1,i}) ∉ E)
forward:
  set o_0 = x
  for t = 1, ..., T
    for i = 1, ..., k_t
      set a_{t,i} = Σ_{j=1}^{k_{t−1}} W_{t−1,i,j} o_{t−1,j}
      set o_{t,i} = σ(a_{t,i})
backward:
  set δ_T = o_T − y
  for t = T − 1, T − 2, ..., 1
    for i = 1, ..., k_t
      set δ_{t,i} = Σ_{j=1}^{k_{t+1}} W_{t,j,i} δ_{t+1,j} σ′(a_{t+1,j})
output:
  for each edge (v_{t−1,j}, v_{t,i}) ∈ E
    set the partial derivative to δ_{t,i} σ′(a_{t,i}) o_{t−1,j}
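
The forward and backward passes translate almost line by line into NumPy for a fully connected layered network. The sketch below assumes the squared loss ½‖o_T − y‖², which is what makes δ_T = o_T − y as in the pseudocode; the function name and the explicit sigma_prime argument for σ′ are assumptions of the sketch.

    import numpy as np

    def backpropagation(x, y, W, sigma, sigma_prime):
        """Gradient of (1/2)||o_T - y||^2 with respect to every weight W[t][i, j].

        W[t] has shape (k_{t+1}, k_t), i.e. W[t][i, j] = W_{t,i,j},
        the weight of the edge (v_{t,j}, v_{t+1,i}).
        """
        T = len(W)                      # number of non-input layers
        # forward pass: o[0] = x, then a[t] and o[t] for t = 1, ..., T
        o = [np.asarray(x, dtype=float)]
        a = [None]                      # a[0] unused; keeps indices aligned with the text
        for t in range(1, T + 1):
            a_t = W[t - 1] @ o[t - 1]   # a_{t,i} = sum_j W_{t-1,i,j} o_{t-1,j}
            a.append(a_t)
            o.append(sigma(a_t))
        # backward pass: delta_T = o_T - y, then propagate towards the input
        delta = [None] * (T + 1)
        delta[T] = o[T] - y
        for t in range(T - 1, 0, -1):
            # delta_{t,i} = sum_j W_{t,j,i} delta_{t+1,j} sigma'(a_{t+1,j})
            delta[t] = W[t].T @ (delta[t + 1] * sigma_prime(a[t + 1]))
        # output: d loss / d W_{t-1,i,j} = delta_{t,i} sigma'(a_{t,i}) o_{t-1,j}
        grad = [np.outer(delta[t] * sigma_prime(a[t]), o[t - 1])
                for t in range(1, T + 1)]
        return grad

For example, with sigma = np.tanh one would pass sigma_prime = lambda a: 1.0 - np.tanh(a) ** 2. The returned list grad has the same shapes as W, so wrapping it as backprop_gradient = lambda x, y, W, s: backpropagation(x, y, W, s, sigma_prime) plugs it directly into the SGD sketch above.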
              Explaining How Backpropagation Calculates the Gradient:
              We next explain how the backpropagation algorithm calculates the gradient of the
              loss function on an example (x,y) with respect to the vector w. Let us first recall