20.7 SUMMARY
Neural networks over graphs of size s(n) can be used to describe hypothesis classes of all predictors that can be implemented in runtime of O(√s(n)). We have also shown that their sample complexity depends polynomially on s(n) (specifically, it depends on the number of edges in the network). Therefore, classes of neural network hypotheses seem to be an excellent choice. Regrettably, the problem of training the network on the basis of training data is computationally hard. We have presented the SGD framework as a heuristic approach for training neural networks and described the backpropagation algorithm, which efficiently calculates the gradient of the loss function with respect to the weights over the edges.
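To make the last point concrete, the following is a minimal sketch (not taken from the book) of SGD with backpropagation for a one-hidden-layer network with sigmoid activations, a linear output, and squared loss; the architecture, step size, and toy data are illustrative assumptions:

    # Minimal SGD-with-backpropagation sketch for a one-hidden-layer network
    # with sigmoid hidden units, a linear output, and squared loss.
    # All architectural and hyperparameter choices here are illustrative.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x, W1, b1, W2, b2):
        """Forward pass; intermediate values are kept for the backward pass."""
        a1 = W1 @ x + b1              # hidden pre-activations
        h = sigmoid(a1)               # hidden activations
        y_hat = W2 @ h + b2           # linear output layer
        return h, y_hat

    def backward(x, y, h, y_hat, W2):
        """Backpropagation: gradient of the loss 0.5 * (y_hat - y)^2 with
        respect to every weight, computed layer by layer via the chain rule."""
        delta2 = y_hat - y                        # dLoss / d(y_hat)
        grad_W2 = np.outer(delta2, h)
        grad_b2 = delta2
        delta1 = (W2.T @ delta2) * h * (1.0 - h)  # chain rule through the sigmoid
        grad_W1 = np.outer(delta1, x)
        grad_b1 = delta1
        return grad_W1, grad_b1, grad_W2, grad_b2

    def sgd(samples, d, k, eta=0.1, epochs=200, seed=0):
        """Plain SGD: one gradient step per (randomly ordered) example."""
        rng = np.random.default_rng(seed)
        W1 = rng.normal(scale=0.5, size=(k, d)); b1 = np.zeros(k)
        W2 = rng.normal(scale=0.5, size=(1, k)); b2 = np.zeros(1)
        for _ in range(epochs):
            for i in rng.permutation(len(samples)):
                x, y = samples[i]
                h, y_hat = forward(x, W1, b1, W2, b2)
                gW1, gb1, gW2, gb2 = backward(x, y, h, y_hat, W2)
                W1 -= eta * gW1; b1 -= eta * gb1
                W2 -= eta * gW2; b2 -= eta * gb2
        return W1, b1, W2, b2

    # Toy usage: a small 2-dimensional regression problem (XOR-like targets).
    data = [(np.array([0.0, 0.0]), np.array([0.0])),
            (np.array([0.0, 1.0]), np.array([1.0])),
            (np.array([1.0, 0.0]), np.array([1.0])),
            (np.array([1.0, 1.0]), np.array([0.0]))]
    W1, b1, W2, b2 = sgd(data, d=2, k=4)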
20.8 BIBLIOGRAPHIC REMARKS
Neural networks were extensively studied in the 1980s and early 1990s, but with mixed empirical success. In recent years, a combination of algorithmic advancements, as well as increasing computational power and data size, has led to a breakthrough in the effectiveness of neural networks. In particular, “deep networks” (i.e., networks of more than 2 layers) have shown very impressive practical performance on a variety of domains. A few examples include convolutional networks (LeCun & Bengio 1995), restricted Boltzmann machines (Hinton, Osindero & Teh 2006), auto-encoders (Ranzato et al. 2007, Bengio & LeCun 2007, Collobert & Weston 2008, Lee et al. 2009, Le et al. 2012), and sum-product networks (Livni, Shalev-Shwartz & Shamir 2013, Poon & Domingos 2011). See also (Bengio 2009) and the references therein.
The expressive power of neural networks and the relation to circuit complexity have been extensively studied in (Parberry 1994). For the analysis of the sample complexity of neural networks we refer the reader to (Anthony & Bartlett 1999). Our proof technique of Theorem 20.6 is taken from lecture notes by Kakade and Tewari.
Klivans and Sherstov (2006) have shown that for any c > 0, intersections of n^c halfspaces over {±1}^n are not efficiently PAC learnable, even if we allow representation independent learning. This hardness result relies on the cryptographic assumption that there is no polynomial time solution to the unique-shortest-vector problem. As we have argued, this implies that there cannot be an efficient algorithm for training neural networks, even if we allow larger networks or other activation functions that can be implemented efficiently.
The backpropagation algorithm was introduced by Rumelhart, Hinton, and Williams (1986).
20.9 EXERCISES
20.1 Neural Networks are universal approximators: Let f : [−1,1]^n → [−1,1] be a ρ-Lipschitz function. Fix some ε > 0. Construct a neural network N : [−1,1]^n → [−1,1], with the sigmoid activation function, such that for every x ∈ [−1,1]^n it holds that |f(x) − N(x)| ≤ ε.
Hint: Similarly to the proof of Theorem 19.3, partition [−1,1]^n into small boxes. Use the Lipschitzness of f to show that it is approximately constant at each box.
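The exercise asks for a construction in n dimensions; as a rough numerical illustration of the hint only (not a solution), the following sketch handles the one-dimensional case by partitioning [−1,1] into boxes of width ε/ρ and summing steep sigmoids that act as approximate step functions between neighboring boxes. The target function, the steepness constant k, and the evaluation grid are illustrative assumptions:

    # Rough 1-D illustration of the hint: approximate a rho-Lipschitz f on [-1, 1]
    # by a function that is (approximately) constant on boxes of width eps / rho,
    # built from a sum of steep sigmoids acting as approximate step functions.
    # The target f, the steepness k, and the test grid are illustrative choices.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def build_network(f, rho, eps, k=500.0):
        """One-hidden-layer sigmoid network
        N(x) = c_0 + sum_i (c_{i+1} - c_i) * sigmoid(k * (x - b_i)),
        where b_i are the box boundaries and c_i = f(center of box i)."""
        width = eps / rho                           # f varies by at most eps per box
        m = int(np.ceil(2.0 / width))               # number of boxes covering [-1, 1]
        edges = -1.0 + width * np.arange(m + 1)     # box edges
        boundaries = edges[1:-1]                    # interior boundaries
        centers = (edges[:-1] + edges[1:]) / 2.0
        values = np.array([f(c) for c in centers])  # constant used on each box
        jumps = np.diff(values)                     # change between neighboring boxes

        def N(x):
            return values[0] + np.sum(jumps * sigmoid(k * (x - boundaries)))
        return N

    # Example: a 1-Lipschitz target on [-1, 1].
    f = lambda x: 0.5 * np.sin(2.0 * x)
    N = build_network(f, rho=1.0, eps=0.1)
    xs = np.linspace(-1.0, 1.0, 201)
    print("max |f(x) - N(x)| on the grid:",
          max(abs(f(x) - N(x)) for x in xs))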