                 20.7 SUMMARY
                 Neural networks over graphs of size s(n) can be used to describe hypothesis classes

of all predictors that can be implemented in runtime of O(√s(n)). We have also
                 shown that their sample complexity depends polynomially on s(n) (specifically,
                 it depends on the number of edges in the network). Therefore, classes of neu-
                 ral network hypotheses seem to be an excellent choice. Regrettably, the problem
                 of training the network on the basis of training data is computationally hard. We
                 have presented the SGD framework as a heuristic approach for training neural net-
                 works and described the backpropagation algorithm which efficiently calculates the
                 gradient of the loss function with respect to the weights over the edges.
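To make the last point concrete, the following is a minimal sketch (our own, not the book's pseudocode) of SGD with backpropagation for a one-hidden-layer sigmoid network under the squared loss; the function names, architecture, and hyperparameters are illustrative assumptions.

```python
# Sketch only: SGD training of a one-hidden-layer sigmoid network with
# gradients computed by backpropagation. Names and settings are illustrative.
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(x, W1, b1, W2, b2):
    """Forward pass: returns the network output and the hidden activation."""
    h = sigmoid(W1 @ x + b1)      # hidden layer
    y_hat = sigmoid(W2 @ h + b2)  # output layer (shape (1,))
    return y_hat, h


def backprop(x, y, W1, b1, W2, b2):
    """Gradients of the loss 0.5 * (y_hat - y)^2 w.r.t. all parameters."""
    y_hat, h = forward(x, W1, b1, W2, b2)
    # Error signal at the output unit's pre-activation.
    delta_out = (y_hat - y) * y_hat * (1.0 - y_hat)
    # Error signal at the hidden layer, propagated back through W2.
    delta_hid = (W2.T @ delta_out) * h * (1.0 - h)
    return (np.outer(delta_hid, x), delta_hid,   # grads for W1, b1
            np.outer(delta_out, h), delta_out)   # grads for W2, b2


def sgd_train(data, n_hidden=8, lr=0.5, epochs=2000, seed=0):
    """Plain SGD: pick an example, backpropagate, take a gradient step."""
    rng = np.random.default_rng(seed)
    d = data[0][0].shape[0]
    W1 = rng.normal(scale=0.5, size=(n_hidden, d)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.5, size=(1, n_hidden)); b2 = np.zeros(1)
    for _ in range(epochs):
        for i in rng.permutation(len(data)):
            x, y = data[i]
            gW1, gb1, gW2, gb2 = backprop(x, y, W1, b1, W2, b2)
            W1 -= lr * gW1; b1 -= lr * gb1
            W2 -= lr * gW2; b2 -= lr * gb2
    return W1, b1, W2, b2


# Example usage on the XOR data set, a function that requires a hidden layer.
xor = [(np.array([0., 0.]), 0.), (np.array([0., 1.]), 1.),
       (np.array([1., 0.]), 1.), (np.array([1., 1.]), 0.)]
params = sgd_train(xor)
```

Note that each backpropagation pass costs time linear in the number of edges of the network, which is what makes the gradient computation efficient.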



                 20.8 BIBLIOGRAPHIC REMARKS
                 Neural networks were extensively studied in the 1980s and early 1990s, but with
                 mixed empirical success. In recent years, a combination of algorithmic advance-
                 ments, as well as increasing computational power and data size, has led to a
                 breakthrough in the effectiveness of neural networks. In particular, “deep net-
                 works” (i.e., networks of more than 2 layers) have shown very impressive practical
                 performance on a variety of domains. A few examples include convolutional net-
                 works (LeCun & Bengio 1995), restricted Boltzmann machines (Hinton, Osindero
                 & Teh 2006), auto-encoders (Ranzato et al. 2007, Bengio & LeCun 2007, Collobert
                 & Weston 2008, Lee et al. 2009, Le et al. 2012), and sum-product networks (Livni,
                 Shalev-Shwartz & Shamir 2013, Poon & Domingos 2011). See also (Bengio 2009)
                 and the references therein.
                    The expressive power of neural networks and the relation to circuit complexity
                 have been extensively studied in (Parberry 1994).  For the analysis of the sample
complexity of neural networks we refer the reader to (Anthony & Bartlett 1999).
Our proof technique for Theorem 20.6 is adapted from lecture notes by Kakade and Tewari.
Klivans and Sherstov (2006) have shown that for any c > 0, intersections of
n^c halfspaces over {±1}^n are not efficiently PAC learnable, even if we allow rep-
                 resentation independent learning. This hardness result relies on the cryptographic
                 assumption that there is no polynomial time solution to the unique-shortest-vector
                 problem. As we have argued, this implies that there cannot be an efficient algorithm
                 for training neural networks, even if we allow larger networks or other activation
                 functions that can be implemented efficiently.
The backpropagation algorithm was introduced in Rumelhart, Hinton, and
Williams (1986).



                 20.9 EXERCISES
20.1 Neural Networks are universal approximators: Let f : [−1,1]^n → [−1,1] be a
     ρ-Lipschitz function. Fix some ε > 0. Construct a neural network N : [−1,1]^n →
     [−1,1], with the sigmoid activation function, such that for every x ∈ [−1,1]^n it
     holds that |f(x) − N(x)| ≤ ε.
     Hint: Similarly to the proof of Theorem 19.3, partition [−1,1]^n into small boxes.
     Use the Lipschitzness of f to show that it is approximately constant on each box.
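     As a rough quantitative version of the hint (this bound is our own sketch and assumes
     f is ρ-Lipschitz with respect to the Euclidean norm): if every box has side length s,
     then any two points x, x′ in the same box satisfy
     \[
       \|x - x'\| \le s\sqrt{n}
       \quad\Longrightarrow\quad
       |f(x) - f(x')| \le \rho\, s\sqrt{n} \le \varepsilon/2
       \quad\text{whenever } s \le \frac{\varepsilon}{2\rho\sqrt{n}},
     \]
     so f is within ε/2 of a constant on each box; it then remains to construct a sigmoid
     network whose output is within ε/2 of this piecewise-constant approximation.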