20.4 THE SAMPLE COMPLEXITY OF NEURAL NETWORKS
Next we discuss the sample complexity of learning the class $\mathcal{H}_{V,E,\sigma}$. Recall that the fundamental theorem of learning tells us that the sample complexity of learning a hypothesis class of binary classifiers depends on its VC dimension. Therefore, we focus on calculating the VC dimension of hypothesis classes of the form $\mathcal{H}_{V,E,\sigma}$, where the output layer of the graph contains a single neuron.
We start with the sign activation function, namely, with $\mathcal{H}_{V,E,\mathrm{sign}}$. What is the VC dimension of this class? Intuitively, since we learn $|E|$ parameters, the VC dimension should be on the order of $|E|$. This is indeed the case, as formalized by the following theorem.
Theorem 20.6. The VC dimension of $\mathcal{H}_{V,E,\mathrm{sign}}$ is $O(|E|\log(|E|))$.
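As a quick sanity check (our own hedged example, not part of the original text), consider a fully connected network with layer sizes $n_0$, $n_1$, and a single output neuron. Then
$$|E| = n_0 n_1 + n_1,$$
so the theorem bounds the VC dimension by $O\bigl((n_0 n_1 + n_1)\log(n_0 n_1 + n_1)\bigr)$, i.e., roughly the number of weights up to a logarithmic factor.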
Proof. To simplify the notation throughout the proof, let us denote the hypothesis class by $\mathcal{H}$. Recall the definition of the growth function, $\tau_{\mathcal{H}}(m)$, from Section 6.5.1. This function measures $\max_{C\subset\mathcal{X}:|C|=m}|\mathcal{H}_C|$, where $\mathcal{H}_C$ is the restriction of $\mathcal{H}$ to functions from $C$ to $\{0,1\}$. We can naturally extend the definition to a set of functions from $\mathcal{X}$ to some finite set $Y$, by letting $\mathcal{H}_C$ be the restriction of $\mathcal{H}$ to functions from $C$ to $Y$, and keeping the definition of $\tau_{\mathcal{H}}(m)$ intact.
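To illustrate the extended definition (a hedged aside of ours): if $Y = \{\pm1\}^k$, then $\mathcal{H}_C$ is a set of functions from the $m$ points of $C$ into $\{\pm1\}^k$, so trivially
$$\tau_{\mathcal{H}}(m) \le |Y|^m = 2^{km};$$
in particular, the extended growth function is no longer capped at $2^m$ as in the binary case.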
Our neural network is defined by a layered graph. Let $V_0,\ldots,V_T$ be the layers of the graph. Fix some $t\in[T]$. By assigning different weights on the edges between $V_{t-1}$ and $V_t$, we obtain different functions from $\mathbb{R}^{|V_{t-1}|}$ to $\{\pm1\}^{|V_t|}$. Let $\mathcal{H}^{(t)}$ be the class of all possible such mappings from $\mathbb{R}^{|V_{t-1}|}$ to $\{\pm1\}^{|V_t|}$. Then, $\mathcal{H}$ can be written as a composition, $\mathcal{H} = \mathcal{H}^{(T)} \circ \cdots \circ \mathcal{H}^{(1)}$. In Exercise 20.4 we show that the growth function of a composition of hypothesis classes is bounded by the product of the growth functions of the individual classes. Therefore,
$$\tau_{\mathcal{H}}(m) \;\le\; \prod_{t=1}^{T} \tau_{\mathcal{H}^{(t)}}(m).$$
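To unpack the composition bound in the simplest case (our own illustration), a network with a single hidden layer has $T=2$, and the bound reads
$$\tau_{\mathcal{H}}(m) \;\le\; \tau_{\mathcal{H}^{(1)}}(m)\,\tau_{\mathcal{H}^{(2)}}(m),$$
where $\mathcal{H}^{(1)}$ maps inputs in $\mathbb{R}^{|V_0|}$ to hidden activations in $\{\pm1\}^{|V_1|}$, and $\mathcal{H}^{(2)}$ maps those activations to the output.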
In addition, each $\mathcal{H}^{(t)}$ can be written as a product of function classes, $\mathcal{H}^{(t)} = \mathcal{H}^{(t,1)} \times \cdots \times \mathcal{H}^{(t,|V_t|)}$, where each $\mathcal{H}^{(t,j)}$ is the class of all functions from layer $t-1$ to $\{\pm1\}$ that the $j$th neuron of layer $t$ can implement. In Exercise 20.3 we bound the growth function of product classes, and this yields
$$\tau_{\mathcal{H}^{(t)}}(m) \;\le\; \prod_{i=1}^{|V_t|} \tau_{\mathcal{H}^{(t,i)}}(m).$$
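Combining the last two displayed inequalities (a bridging step we spell out; it is implicit in the original) gives a single bound over all individual neurons,
$$\tau_{\mathcal{H}}(m) \;\le\; \prod_{t=1}^{T}\prod_{i=1}^{|V_t|} \tau_{\mathcal{H}^{(t,i)}}(m),$$
which is the form used in the final step below.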
Let $d_{t,i}$ be the number of edges that are headed to the $i$th neuron of layer $t$. Since the neuron is a homogeneous halfspace hypothesis, and the VC dimension of homogeneous halfspaces is the dimension of their input, we have by Sauer's lemma that
$$\tau_{\mathcal{H}^{(t,i)}}(m) \;\le\; \left(\frac{em}{d_{t,i}}\right)^{d_{t,i}} \;\le\; (em)^{d_{t,i}}.$$
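As a numeric sanity check (our own example), a neuron with $d_{t,i}=3$ incoming edges evaluated on $m=10$ points satisfies
$$\tau_{\mathcal{H}^{(t,i)}}(10) \;\le\; \left(\frac{10e}{3}\right)^{3} \approx 744 \;\le\; (10e)^{3} \approx 2.0\times 10^{4}.$$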
Overall, we obtained that
$$\tau_{\mathcal{H}}(m) \;\le\; (em)^{\sum_{t,i} d_{t,i}} = (em)^{|E|}.$$
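The proof continues beyond this excerpt, but the bound just derived already shows numerically why the VC dimension is $O(|E|\log(|E|))$: shattering $m$ points requires $\tau_{\mathcal{H}}(m) = 2^m$, which becomes impossible once $2^m > (em)^{|E|}$. The following minimal Python sketch (our own illustration; the helper name and the edge counts are hypothetical) finds that crossover point for a few values of $|E|$:

import math

def shattering_threshold(num_edges):
    # Smallest m with 2^m > (e*m)^{|E|}: beyond this point the bound
    # tau_H(m) <= (e*m)^{|E|} rules out shattering m points.
    # Compare in log-space to avoid overflow: m*ln(2) vs |E|*ln(e*m).
    m = 1
    while m * math.log(2) <= num_edges * math.log(math.e * m):
        m += 1
    return m

for E in [10, 100, 1000]:
    t = shattering_threshold(E)
    print(E, t, round(t / (E * math.log(E)), 2))  # threshold grows roughly like |E|*log(|E|)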