
That is, the input to $v_{t+1,j}$ is a weighted sum of the outputs of the neurons in $V_t$ that are connected to $v_{t+1,j}$, where weighting is according to $w$, and the output of $v_{t+1,j}$ is simply the application of the activation function $\sigma$ on its input.
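In code, this layer-by-layer computation can be sketched as follows. The representation of the graph as a list of layers and a dictionary of edge weights, and the names forward_pass and sigmoid, are illustrative assumptions made here, not notation from the text.

```python
import math

def sigmoid(a):
    """One possible choice of activation function sigma."""
    return 1.0 / (1.0 + math.exp(-a))

def forward_pass(layers, w, x, sigma=sigmoid):
    """Compute the output of every neuron, layer by layer.

    layers : list of lists of neuron names; layers[0] is the input layer V_0.
    w      : dict mapping an edge (u, v) to its real-valued weight.
    x      : dict mapping each neuron in V_0 to its value
             (the constant neuron should be mapped to 1).
    """
    outputs = dict(x)  # outputs of the input layer V_0
    for t in range(len(layers) - 1):
        for v in layers[t + 1]:
            # Input to v: weighted sum of the outputs of the neurons in V_t
            # that are connected to v, weighted according to w.
            a = sum(w[(u, v)] * outputs[u] for u in layers[t] if (u, v) in w)
            # Output of v: the activation function applied to its input.
            outputs[v] = sigma(a)
    return outputs
```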
Layers $V_1,\ldots,V_{T-1}$ are often called hidden layers. The top layer, $V_T$, is called the output layer. In simple prediction problems the output layer contains a single neuron whose output is the output of the network.
We refer to $T$ as the number of layers in the network (excluding $V_0$), or the "depth" of the network. The size of the network is $|V|$. The "width" of the network is $\max_t |V_t|$. An illustration of a layered feedforward neural network of depth 2, size 10, and width 5 is given in the following. Note that there is a neuron in the hidden layer that has no incoming edges. This neuron will output the constant $\sigma(0)$.

[Figure: a layered feedforward network with input layer $V_0$ (neurons $v_{0,1},\ldots,v_{0,4}$, fed by the inputs $x_1,x_2,x_3$ and a constant), hidden layer $V_1$ (neurons $v_{1,1},\ldots,v_{1,5}$), and output layer $V_2$ containing the single output neuron $v_{2,1}$.]

                 20.2 LEARNING NEURAL NETWORKS

Once we have specified a neural network by $(V,E,\sigma,w)$, we obtain a function $h_{V,E,\sigma,w}: \mathbb{R}^{|V_0|-1} \to \mathbb{R}^{|V_T|}$. Any set of such functions can serve as a hypothesis class for learning. Usually, we define a hypothesis class of neural network predictors by fixing the graph $(V,E)$ as well as the activation function $\sigma$ and letting the hypothesis class be all functions of the form $h_{V,E,\sigma,w}$ for some $w: E \to \mathbb{R}$. The triplet $(V,E,\sigma)$ is often called the architecture of the network. We denote the hypothesis class by
$$\mathcal{H}_{V,E,\sigma} = \{ h_{V,E,\sigma,w} : w \text{ is a mapping from } E \text{ to } \mathbb{R} \}. \tag{20.1}$$

                 That is, the parameters specifying a hypothesis in the hypothesis class are the
                 weights over the edges of the network.
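The following sketch illustrates this parameterization on a deliberately tiny architecture: the graph $(V,E)$ and the activation $\sigma$ are held fixed, and each weight mapping $w$ selects a different hypothesis $h_{V,E,\sigma,w}$. The architecture, the helper name hypothesis, and the concrete weight values are assumptions chosen for illustration only.

```python
# A fixed architecture (V, E, sigma): two input neurons (one constant),
# one output neuron, and the sign activation function.
layers = [["x1", "const"], ["out"]]          # V_0 and V_1
edges = [("x1", "out"), ("const", "out")]    # the fixed edge set E
sign = lambda a: 1.0 if a >= 0 else -1.0     # the fixed activation sigma

def hypothesis(w):
    """Return the predictor h_{V,E,sigma,w} selected by the weight mapping w."""
    def h(x1):
        inputs = {"x1": x1, "const": 1.0}    # the constant neuron outputs 1
        a = sum(w[e] * inputs[e[0]] for e in edges)
        return sign(a)
    return h

# Two different weight mappings over the same edges give two different hypotheses.
h1 = hypothesis({("x1", "out"): 1.0, ("const", "out"): -0.5})   # sign(x1 - 0.5)
h2 = hypothesis({("x1", "out"): -1.0, ("const", "out"): 0.5})   # sign(-x1 + 0.5)
print(h1(0.0), h2(0.0))   # -1.0 1.0 : the same point is labeled differently
```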
We can now study the approximation error, estimation error, and optimization error of such hypothesis classes. In Section 20.3 we study the approximation error of $\mathcal{H}_{V,E,\sigma}$ by studying what type of functions hypotheses in $\mathcal{H}_{V,E,\sigma}$ can implement, in terms of the size of the underlying graph. In Section 20.4 we study the estimation error of $\mathcal{H}_{V,E,\sigma}$, for the case of binary classification (i.e., $|V_T| = 1$ and $\sigma$ is the sign function), by analyzing its VC dimension. Finally, in Section 20.5 we show that it is computationally hard to learn the class $\mathcal{H}_{V,E,\sigma}$, even if the underlying graph is small, and in Section 20.6 we present the most commonly used heuristic for training $\mathcal{H}_{V,E,\sigma}$.