Page 347 - Understanding Machine Learning

26.1 The Rademacher Complexity


3. For any $h^\star \in H$, with probability of at least $1 - \delta$,
$$ L_D(\mathrm{ERM}_H(S)) - L_D(h^\star) \;\le\; 2R(\ell \circ H \circ S) + 5c\sqrt{\frac{2\ln(8/\delta)}{m}}. $$

Proof. First note that the random variable $\mathrm{Rep}_D(F,S) = \sup_{h \in H} \left( L_D(h) - L_S(h) \right)$
satisfies the bounded differences condition of Lemma 26.4 with a constant $2c/m$.
Combining the bounds in Lemma 26.4 with Lemma 26.2 we obtain that with
probability of at least $1 - \delta$,
$$ \mathrm{Rep}_D(F,S) \;\le\; \mathbb{E}\,\mathrm{Rep}_D(F,S) + c\sqrt{\frac{2\ln(2/\delta)}{m}} \;\le\; 2\,\mathbb{E}_S\, R(\ell \circ H \circ S) + c\sqrt{\frac{2\ln(2/\delta)}{m}}. $$
The first inequality of the theorem follows from the definition of $\mathrm{Rep}_D(F,S)$. For
the second inequality we note that the random variable $R(\ell \circ H \circ S)$ also satisfies
the bounded differences condition of Lemma 26.4 with a constant $2c/m$. Therefore,
the second inequality follows from the first inequality, Lemma 26.4, and the union
bound. Finally, for the last inequality, denote $h_S = \mathrm{ERM}_H(S)$ and note that

$$
\begin{aligned}
L_D(h_S) - L_D(h^\star)
&= L_D(h_S) - L_S(h_S) + L_S(h_S) - L_S(h^\star) + L_S(h^\star) - L_D(h^\star) \\
&\le \left( L_D(h_S) - L_S(h_S) \right) + \left( L_S(h^\star) - L_D(h^\star) \right), \qquad (26.10)
\end{aligned}
$$
where the inequality holds because $L_S(h_S) \le L_S(h^\star)$, since $h_S$ minimizes the empirical risk over $H$.
The first summand on the right-hand side is bounded by the second inequality of
the theorem. For the second summand, we use the fact that $h^\star$ does not depend on
$S$; hence, by Hoeffding's inequality, with probability of at least $1 - \delta/2$,
$$ L_S(h^\star) - L_D(h^\star) \;\le\; c\sqrt{\frac{\ln(4/\delta)}{2m}}. \qquad (26.11) $$
              Combining this with the union bound we conclude our proof.
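The Hoeffding step in (26.11) is easy to check numerically. The sketch below is my own illustration, not part of the book's text; the uniform loss distribution and the values of $m$, $c$, and $\delta$ are arbitrary choices. It draws many training samples for a single fixed hypothesis and verifies that the gap $L_S(h^\star) - L_D(h^\star)$ exceeds the bound on at most a $\delta/2$ fraction of them.

```python
import numpy as np

# Simulation of the Hoeffding step (26.11) for one FIXED hypothesis:
# P( L_S - L_D > c * sqrt(ln(4/delta) / (2m)) ) <= delta / 2.
# The uniform loss distribution and all constants are illustrative choices.
rng = np.random.default_rng(0)
m, c, delta, trials = 100, 1.0, 0.1, 20000

L_D = 0.5                                      # true risk of the fixed hypothesis
bound = c * np.sqrt(np.log(4 / delta) / (2 * m))

losses = rng.uniform(0, c, size=(trials, m))   # one row = losses on one sample S
gaps = losses.mean(axis=1) - L_D               # L_S(h*) - L_D(h*), per sample
violation_rate = np.mean(gaps > bound)
print(violation_rate, "<=", delta / 2)
```

Note that Hoeffding's inequality applies here only because the hypothesis is fixed before the sample is drawn; for the data-dependent hypothesis $h_S$ the Rademacher bound is needed instead.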
The preceding theorem tells us that if the quantity $R(\ell \circ H \circ S)$ is small then it
              is possible to learn the class H using the ERM rule. It is important to emphasize
              that the last two bounds given in the theorem depend on the specific training set S.
              That is, we use S both for learning a hypothesis from H as well as for estimating the
              quality of it. This type of bound is called a data-dependent bound.
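To make the data-dependent quantity concrete, the empirical Rademacher complexity of a finite set can be estimated by Monte Carlo. The sketch below is my own illustration, not from the text: each row of the array `A` plays the role of the loss vector of one hypothesis on the sample $S$, so for a three-hypothesis class the estimate approximates $R(\ell \circ H \circ S)$.

```python
import numpy as np

def empirical_rademacher(A, n_draws=20000, seed=0):
    """Monte Carlo estimate of R(A) = E_sigma sup_{a in A} (1/m) <sigma, a>,
    where sigma has i.i.d. uniform +/-1 entries and rows of A are the vectors."""
    rng = np.random.default_rng(seed)
    k, m = A.shape
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, m))  # Rademacher sign vectors
    sups = (sigma @ A.T / m).max(axis=1)                # sup over the k vectors
    return sups.mean()

# Loss vectors of 3 hypotheses on a sample of size m = 50 (arbitrary data).
rng = np.random.default_rng(1)
A = rng.uniform(0.0, 1.0, size=(3, 50))
print(empirical_rademacher(A))
```

Since the estimate is computed from the same sample $S$ that would be used for learning, it is exactly the kind of data-dependent quantity the bounds above refer to.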


              26.1.1 Rademacher Calculus

              Let us now discuss some properties of the Rademacher complexity measure. These
properties will help us in deriving some simple bounds on $R(\ell \circ H \circ S)$ for specific
              cases of interest.
                 The following lemma is immediate from the definition.

Lemma 26.6. For any $A \subset \mathbb{R}^m$, scalar $c \in \mathbb{R}$, and vector $\mathbf{a}_0 \in \mathbb{R}^m$, we have
$$ R(\{ c\,\mathbf{a} + \mathbf{a}_0 : \mathbf{a} \in A \}) \;\le\; |c|\, R(A). $$
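Lemma 26.6 can also be checked by simulation. In the sketch below (my own illustration; the set $A$, the scalar $c$, and the shift $\mathbf{a}_0$ are arbitrary choices) both sides of the inequality are estimated with a shared batch of sign vectors.

```python
import numpy as np

# Monte Carlo check of Lemma 26.6: R({c*a + a0 : a in A}) <= |c| * R(A).
# The set A, scalar c, and shift a0 are arbitrary illustrative choices.
rng = np.random.default_rng(0)
k, m, n_draws = 5, 40, 50000
A = rng.normal(size=(k, m))
c = -3.0
a0 = rng.normal(size=m)

def rademacher_estimate(B, sigma):
    # Average over sign draws of sup_{b in B} (1/m) <sigma, b>.
    return (sigma @ B.T / B.shape[1]).max(axis=1).mean()

sigma = rng.choice([-1.0, 1.0], size=(n_draws, m))
lhs = rademacher_estimate(c * A + a0, sigma)   # R of the transformed set
rhs = abs(c) * rademacher_estimate(A, sigma)   # |c| * R(A)
print(lhs, rhs)
```

For Rademacher signs the shift term $\langle \sigma, \mathbf{a}_0 \rangle$ averages to zero, so the two estimates come out nearly equal in this experiment; the lemma states the inequality that holds in general.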
                 The following lemma tells us that the convex hull of A has the same complexity
              as A.