20.6 SGD and Backpropagation
SGD for Neural Networks

parameters:
  number of iterations τ
  step size sequence η_1, η_2, ..., η_τ
  regularization parameter λ > 0
input:
  layered graph (V, E)
  differentiable activation function σ : R → R
initialize:
  choose w^(1) ∈ R^|E| at random
  (from a distribution s.t. w^(1) is close enough to 0)
for i = 1, 2, ..., τ
  sample (x, y) ~ D
  calculate gradient v_i = backpropagation(x, y, w^(i), (V, E), σ)
  update w^(i+1) = w^(i) − η_i (v_i + λ w^(i))
output:
  w̄ is the best performing w^(i) on a validation set
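To make the loop concrete, the following is a minimal Python sketch of the update rule above. It is not code from the book: the callables sample_fn, backprop_fn, and validation_loss, and all variable names, are illustrative placeholders standing in for sampling from D, the backpropagation subroutine, and evaluation on a validation set.

```python
import numpy as np

def sgd_for_network(sample_fn, backprop_fn, num_edges, num_iters, step_sizes, lam,
                    validation_loss):
    """Sketch of SGD with l2 regularization for a network with |E| = num_edges weights.

    sample_fn()          -- draws one labelled example (x, y) from the distribution D
    backprop_fn(x, y, w) -- returns the gradient of the loss at w (e.g. via backpropagation)
    validation_loss(w)   -- loss of w on a held-out validation set
    """
    rng = np.random.default_rng(0)
    # choose w^(1) at random, from a distribution concentrated near zero
    w = 0.01 * rng.standard_normal(num_edges)
    best_w, best_loss = w.copy(), validation_loss(w)
    for i in range(num_iters):
        x, y = sample_fn()
        v = backprop_fn(x, y, w)                  # gradient v_i at w^(i)
        w = w - step_sizes[i] * (v + lam * w)     # w^(i+1) = w^(i) - eta_i (v_i + lambda w^(i))
        loss = validation_loss(w)
        if loss < best_loss:                      # keep the best-performing iterate seen so far
            best_w, best_loss = w.copy(), loss
    return best_w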
Backpropagation

input:
  example (x, y), weight vector w, layered graph (V, E),
  activation function σ : R → R
initialize:
  denote the layers of the graph V_0, ..., V_T where V_t = {v_{t,1}, ..., v_{t,k_t}}
  define W_{t,i,j} as the weight of (v_{t,j}, v_{t+1,i})
  (where we set W_{t,i,j} = 0 if (v_{t,j}, v_{t+1,i}) ∉ E)
forward:
  set o_0 = x
  for t = 1, ..., T
    for i = 1, ..., k_t
      set a_{t,i} = Σ_{j=1}^{k_{t-1}} W_{t-1,i,j} o_{t-1,j}
      set o_{t,i} = σ(a_{t,i})
backward:
  set δ_T = o_T − y
  for t = T − 1, T − 2, ..., 1
    for i = 1, ..., k_t
      set δ_{t,i} = Σ_{j=1}^{k_{t+1}} W_{t,j,i} δ_{t+1,j} σ'(a_{t+1,j})
output:
  foreach edge (v_{t−1,j}, v_{t,i}) ∈ E
    set the partial derivative to δ_{t,i} σ'(a_{t,i}) o_{t−1,j}
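The forward and backward passes translate directly into a few lines of matrix arithmetic. The sketch below is not the book's code: it assumes each W_t is stored as a dense matrix (with zero entries for edges not in E), uses the sigmoid as an illustrative activation, and the function and variable names are ours.

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))       # example activation: the sigmoid

def sigma_prime(z):
    s = sigma(z)
    return s * (1.0 - s)                   # its derivative sigma'

def backpropagation(x, y, weights):
    """weights[t] is the matrix W_t, with W_t[i, j] the weight of edge (v_{t,j}, v_{t+1,i}).

    Returns a list of gradient matrices, grads[t][i, j] = d(loss)/d(W_t[i, j]).
    """
    T = len(weights)
    # forward: o_0 = x, a_t = W_{t-1} o_{t-1}, o_t = sigma(a_t)
    o = [np.asarray(x, dtype=float)]
    a = [None]                              # a_0 is unused
    for t in range(1, T + 1):
        a.append(weights[t - 1] @ o[t - 1])
        o.append(sigma(a[t]))
    # backward: delta_T = o_T - y, delta_{t,i} = sum_j W_{t,j,i} delta_{t+1,j} sigma'(a_{t+1,j})
    delta = [None] * (T + 1)
    delta[T] = o[T] - np.asarray(y, dtype=float)
    for t in range(T - 1, 0, -1):
        delta[t] = weights[t].T @ (delta[t + 1] * sigma_prime(a[t + 1]))
    # output: partial derivative for edge (v_{t-1,j}, v_{t,i}) is delta_{t,i} sigma'(a_{t,i}) o_{t-1,j}
    grads = [np.outer(delta[t] * sigma_prime(a[t]), o[t - 1]) for t in range(1, T + 1)]
    return grads
```

The returned list of matrices can be flattened into the gradient vector v_i used by the SGD loop above.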
Explaining How Backpropagation Calculates the Gradient:
We next explain how the backpropagation algorithm calculates the gradient of the
loss function on an example (x,y) with respect to the vector w. Let us first recall