
20.6 SGD and Backpropagation


                        SGD for Neural Networks

parameters:
  number of iterations τ
  step size sequence η_1, η_2, ..., η_τ
  regularization parameter λ > 0
input:
  layered graph (V, E)
  differentiable activation function σ : R → R
initialize:
  choose w^(1) ∈ R^|E| at random
    (from a distribution s.t. w^(1) is close enough to 0)
for i = 1, 2, ..., τ
  sample (x, y) ∼ D
  calculate gradient v_i = backpropagation(x, y, w^(i), (V, E), σ)
  update w^(i+1) = w^(i) − η_i (v_i + λ w^(i))
output:
  w̄ is the best performing w^(i) on a validation set
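
To make the loop concrete, here is a minimal NumPy sketch of the SGD box above. The function names (sgd_for_network, sample_example, backprop_gradient, validation_loss) and the choice to store w as a list of per-layer weight matrices, with W[t][i, j] playing the role of W_{t,i,j}, are assumptions of the sketch rather than part of the pseudocode; a matching gradient routine is sketched after the Backpropagation box below.

    import numpy as np

    def sgd_for_network(layer_sizes, sample_example, backprop_gradient,
                        num_iters, step_sizes, lam, sigma, validation_loss):
        """SGD with l2 regularization over a layered network's weights.

        layer_sizes:       [k_0, ..., k_T], number of neurons per layer
        sample_example:    callable returning a fresh (x, y) drawn from D
        backprop_gradient: callable (x, y, W, sigma) -> gradient, same shapes as W
        step_sizes:        sequence eta_1, ..., eta_tau
        lam:               regularization parameter lambda > 0
        validation_loss:   callable W -> estimated loss, used to pick the output
        """
        rng = np.random.default_rng(0)
        # initialize: small random weights close to 0;
        # W[t][i, j] is the weight of the edge (v_{t,j}, v_{t+1,i})
        W = [0.01 * rng.standard_normal((layer_sizes[t + 1], layer_sizes[t]))
             for t in range(len(layer_sizes) - 1)]

        best_W, best_loss = [w.copy() for w in W], np.inf
        for i in range(num_iters):
            x, y = sample_example()                   # sample (x, y) ~ D
            grad = backprop_gradient(x, y, W, sigma)  # v_i via backpropagation
            eta = step_sizes[i]
            # update: w^(i+1) = w^(i) - eta_i (v_i + lambda * w^(i))
            W = [w - eta * (g + lam * w) for w, g in zip(W, grad)]
            loss = validation_loss(W)
            if loss < best_loss:                      # keep the best iterate seen
                best_loss, best_W = loss, [w.copy() for w in W]
        return best_W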


                           Backpropagation

input:
  example (x, y), weight vector w, layered graph (V, E),
  activation function σ : R → R
initialize:
  denote the layers of the graph V_0, ..., V_T where V_t = {v_{t,1}, ..., v_{t,k_t}}
  define W_{t,i,j} as the weight of (v_{t,j}, v_{t+1,i})
    (where we set W_{t,i,j} = 0 if (v_{t,j}, v_{t+1,i}) ∉ E)
forward:
  set o_0 = x
  for t = 1, ..., T
    for i = 1, ..., k_t
      set a_{t,i} = Σ_{j=1}^{k_{t−1}} W_{t−1,i,j} o_{t−1,j}
      set o_{t,i} = σ(a_{t,i})
backward:
  set δ_T = o_T − y
  for t = T − 1, T − 2, ..., 1
    for i = 1, ..., k_t
      set δ_{t,i} = Σ_{j=1}^{k_{t+1}} W_{t,j,i} δ_{t+1,j} σ′(a_{t+1,j})
output:
  for each edge (v_{t−1,j}, v_{t,i}) ∈ E
    set the partial derivative to δ_{t,i} σ′(a_{t,i}) o_{t−1,j}
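
The forward and backward passes translate almost line by line into NumPy for a fully connected layered network. The sketch below assumes the squared loss ½‖o_T − y‖², which is what makes δ_T = o_T − y as in the pseudocode; the function name and the explicit sigma_prime argument for σ′ are assumptions of the sketch.

    import numpy as np

    def backpropagation(x, y, W, sigma, sigma_prime):
        """Gradient of (1/2)||o_T - y||^2 with respect to every weight W[t][i, j].

        W[t] has shape (k_{t+1}, k_t), i.e. W[t][i, j] = W_{t,i,j},
        the weight of the edge (v_{t,j}, v_{t+1,i}).
        """
        T = len(W)                      # number of non-input layers
        # forward pass: o[0] = x, then a[t] and o[t] for t = 1, ..., T
        o = [np.asarray(x, dtype=float)]
        a = [None]                      # a[0] unused; keeps indices aligned with the text
        for t in range(1, T + 1):
            a_t = W[t - 1] @ o[t - 1]   # a_{t,i} = sum_j W_{t-1,i,j} o_{t-1,j}
            a.append(a_t)
            o.append(sigma(a_t))
        # backward pass: delta_T = o_T - y, then propagate towards the input
        delta = [None] * (T + 1)
        delta[T] = o[T] - y
        for t in range(T - 1, 0, -1):
            # delta_{t,i} = sum_j W_{t,j,i} delta_{t+1,j} sigma'(a_{t+1,j})
            delta[t] = W[t].T @ (delta[t + 1] * sigma_prime(a[t + 1]))
        # output: d loss / d W_{t-1,i,j} = delta_{t,i} sigma'(a_{t,i}) o_{t-1,j}
        grad = [np.outer(delta[t] * sigma_prime(a[t]), o[t - 1])
                for t in range(1, T + 1)]
        return grad

For example, with sigma = np.tanh one would pass sigma_prime = lambda a: 1.0 - np.tanh(a) ** 2. The returned list grad has the same shapes as W, so wrapping it as backprop_gradient = lambda x, y, W, s: backpropagation(x, y, W, s, sigma_prime) plugs it directly into the SGD sketch above.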
              Explaining How Backpropagation Calculates the Gradient:
              We next explain how the backpropagation algorithm calculates the gradient of the
              loss function on an example (x,y) with respect to the vector w. Let us first recall