20.4 THE SAMPLE COMPLEXITY OF NEURAL NETWORKS
Next we discuss the sample complexity of learning the class $\mathcal{H}_{V,E,\sigma}$. Recall that the fundamental theorem of learning tells us that the sample complexity of learning a hypothesis class of binary classifiers depends on its VC dimension. Therefore, we focus on calculating the VC dimension of hypothesis classes of the form $\mathcal{H}_{V,E,\sigma}$, where the output layer of the graph contains a single neuron.
We start with the sign activation function, namely, with $\mathcal{H}_{V,E,\mathrm{sign}}$. What is the VC dimension of this class? Intuitively, since we learn $|E|$ parameters, the VC dimension should be on the order of $|E|$. This is indeed the case, as formalized by the following theorem.
Theorem 20.6. The VC dimension of $\mathcal{H}_{V,E,\mathrm{sign}}$ is $O(|E|\log(|E|))$.
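As a quick sanity check (our own hedged example, not part of the original text), consider a fully connected network with layer sizes $n_0$, $n_1$, and a single output neuron. Then
$$|E| = n_0 n_1 + n_1,$$
so the theorem bounds the VC dimension by $O\bigl((n_0 n_1 + n_1)\log(n_0 n_1 + n_1)\bigr)$, i.e., roughly the number of weights up to a logarithmic factor.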
Proof. To simplify the notation throughout the proof, let us denote the hypothesis class by $\mathcal{H}$. Recall the definition of the growth function, $\tau_{\mathcal{H}}(m)$, from Section 6.5.1. This function measures $\max_{C\subset\mathcal{X}:|C|=m}|\mathcal{H}_C|$, where $\mathcal{H}_C$ is the restriction of $\mathcal{H}$ to functions from $C$ to $\{0,1\}$. We can naturally extend the definition to a set of functions from $\mathcal{X}$ to some finite set $Y$, by letting $\mathcal{H}_C$ be the restriction of $\mathcal{H}$ to functions from $C$ to $Y$, and keeping the definition of $\tau_{\mathcal{H}}(m)$ intact.
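To illustrate the extended definition (a hedged aside of ours): if $Y = \{\pm1\}^k$, then $\mathcal{H}_C$ is a set of functions from the $m$ points of $C$ into $\{\pm1\}^k$, so trivially
$$\tau_{\mathcal{H}}(m) \le |Y|^m = 2^{km};$$
in particular, the extended growth function is no longer capped at $2^m$ as in the binary case.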
Our neural network is defined by a layered graph. Let $V_0,\ldots,V_T$ be the layers of the graph. Fix some $t\in[T]$. By assigning different weights on the edges between $V_{t-1}$ and $V_t$, we obtain different functions from $\mathbb{R}^{|V_{t-1}|}$ to $\{\pm1\}^{|V_t|}$. Let $\mathcal{H}^{(t)}$ be the class of all possible such mappings from $\mathbb{R}^{|V_{t-1}|}$ to $\{\pm1\}^{|V_t|}$. Then, $\mathcal{H}$ can be written as a composition, $\mathcal{H} = \mathcal{H}^{(T)} \circ \cdots \circ \mathcal{H}^{(1)}$. In Exercise 20.4 we show that the growth function of a composition of hypothesis classes is bounded by the product of the growth functions of the individual classes. Therefore,
$$\tau_{\mathcal{H}}(m) \;\le\; \prod_{t=1}^{T} \tau_{\mathcal{H}^{(t)}}(m).$$
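To unpack the composition bound in the simplest case (our own illustration), a network with a single hidden layer has $T=2$, and the bound reads
$$\tau_{\mathcal{H}}(m) \;\le\; \tau_{\mathcal{H}^{(1)}}(m)\,\tau_{\mathcal{H}^{(2)}}(m),$$
where $\mathcal{H}^{(1)}$ maps inputs in $\mathbb{R}^{|V_0|}$ to hidden activations in $\{\pm1\}^{|V_1|}$, and $\mathcal{H}^{(2)}$ maps those activations to the output.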
In addition, each $\mathcal{H}^{(t)}$ can be written as a product of function classes, $\mathcal{H}^{(t)} = \mathcal{H}^{(t,1)} \times \cdots \times \mathcal{H}^{(t,|V_t|)}$, where each $\mathcal{H}^{(t,j)}$ is the class of all functions from layer $t-1$ to $\{\pm1\}$ that the $j$th neuron of layer $t$ can implement. In Exercise 20.3 we bound the growth function of product classes, and this yields
$$\tau_{\mathcal{H}^{(t)}}(m) \;\le\; \prod_{i=1}^{|V_t|} \tau_{\mathcal{H}^{(t,i)}}(m).$$
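Combining the last two displayed inequalities (a bridging step we spell out; it is implicit in the original) gives a single bound over all individual neurons,
$$\tau_{\mathcal{H}}(m) \;\le\; \prod_{t=1}^{T}\prod_{i=1}^{|V_t|} \tau_{\mathcal{H}^{(t,i)}}(m),$$
which is the form used in the final step below.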
Let $d_{t,i}$ be the number of edges that are headed to the $i$th neuron of layer $t$. Since the neuron is a homogeneous halfspace hypothesis, and the VC dimension of homogeneous halfspaces is the dimension of their input, we have by Sauer's lemma that
$$\tau_{\mathcal{H}^{(t,i)}}(m) \;\le\; \left(\frac{em}{d_{t,i}}\right)^{d_{t,i}} \;\le\; (em)^{d_{t,i}}.$$
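As a numeric sanity check (our own example), a neuron with $d_{t,i}=3$ incoming edges evaluated on $m=10$ points satisfies
$$\tau_{\mathcal{H}^{(t,i)}}(10) \;\le\; \left(\frac{10e}{3}\right)^{3} \approx 744 \;\le\; (10e)^{3} \approx 2.0\times 10^{4}.$$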
Overall, we obtained that
$$\tau_{\mathcal{H}}(m) \;\le\; (em)^{\sum_{t,i} d_{t,i}} = (em)^{|E|}.$$
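The proof continues beyond this excerpt, but the bound just derived already shows numerically why the VC dimension is $O(|E|\log(|E|))$: shattering $m$ points requires $\tau_{\mathcal{H}}(m) = 2^m$, which becomes impossible once $2^m > (em)^{|E|}$. The following minimal Python sketch (our own illustration; the helper name and the edge counts are hypothetical) finds that crossover point for a few values of $|E|$:

import math

def shattering_threshold(num_edges):
    # Smallest m with 2^m > (e*m)^{|E|}: beyond this point the bound
    # tau_H(m) <= (e*m)^{|E|} rules out shattering m points.
    # Compare in log-space to avoid overflow: m*ln(2) vs |E|*ln(e*m).
    m = 1
    while m * math.log(2) <= num_edges * math.log(math.e * m):
        m += 1
    return m

for E in [10, 100, 1000]:
    t = shattering_threshold(E)
    print(E, t, round(t / (E * math.log(E)), 2))  # threshold grows roughly like |E|*log(|E|)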