

6.4 THE FUNDAMENTAL THEOREM OF PAC LEARNING
                 We have already shown that a class of infinite VC-dimension is not learnable. The
                 converse statement is also true, leading to the fundamental theorem of statistical
                 learning theory:





Theorem 6.7 (The Fundamental Theorem of Statistical Learning). Let H be a hypothesis class of functions from a domain X to {0,1} and let the loss function be the 0−1 loss. Then, the following are equivalent:
                    1.  H has the uniform convergence property.

                    2.  Any ERM rule is a successful agnostic PAC learner for H.
                    3.  H is agnostic PAC learnable.

                    4.  H is PAC learnable.

                    5.  Any ERM rule is a successful PAC learner for H.
                    6.  H has a finite VC-dimension.

                    The proof of the theorem is given in the next section.
                    Not only does the VC-dimension characterize PAC learnability; it even deter-
                 mines the sample complexity.
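
The equivalence with item 6 rests on the notions of shattering and VC-dimension introduced earlier. As a concrete illustration (not part of the book's text), the following sketch brute-forces the VC-dimension of a hypothesis class given as explicit functions over a small finite domain; the threshold class and all names in it are hypothetical choices made only for this example.

```python
from itertools import combinations

def is_shattered(points, hypotheses):
    """True iff `hypotheses` realizes every one of the 2^|points| labelings."""
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realized) == 2 ** len(points)

def vc_dimension(domain, hypotheses):
    """Largest k such that some k-point subset of `domain` is shattered."""
    d = 0
    for k in range(1, len(domain) + 1):
        if any(is_shattered(c, hypotheses) for c in combinations(domain, k)):
            d = k
        else:
            break  # if no k-point set is shattered, no larger set can be
    return d

# Hypothetical example: threshold functions h_t(x) = 1[x >= t] on a small grid.
domain = [1, 2, 3, 4, 5]
thresholds = [lambda x, t=t: int(x >= t) for t in range(0, 7)]
print(vc_dimension(domain, thresholds))  # prints 1
```

Such exhaustive checking is feasible only for tiny domains, but it mirrors the definitions directly: a set is shattered when every labeling of it is realized by some hypothesis in the class.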





Theorem 6.8 (The Fundamental Theorem of Statistical Learning – Quantitative Version). Let H be a hypothesis class of functions from a domain X to {0,1} and let the loss function be the 0−1 loss. Assume that VCdim(H) = d < ∞. Then, there are absolute constants C_1, C_2 such that
   1. H has the uniform convergence property with sample complexity

          C_1 \frac{d + \log(1/\delta)}{\epsilon^2} \le m_{\mathcal{H}}^{\mathrm{UC}}(\epsilon, \delta) \le C_2 \frac{d + \log(1/\delta)}{\epsilon^2}
   2. H is agnostic PAC learnable with sample complexity

          C_1 \frac{d + \log(1/\delta)}{\epsilon^2} \le m_{\mathcal{H}}(\epsilon, \delta) \le C_2 \frac{d + \log(1/\delta)}{\epsilon^2}

   3. H is PAC learnable with sample complexity

          C_1 \frac{d + \log(1/\delta)}{\epsilon} \le m_{\mathcal{H}}(\epsilon, \delta) \le C_2 \frac{d \log(1/\epsilon) + \log(1/\delta)}{\epsilon}

                    The proof of this theorem is given in Chapter 28.
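
To get a feel for how these bounds scale, the following sketch (not from the book) evaluates the upper bounds in items 2 and 3 for concrete values of d, ε, and δ. The theorem only asserts that absolute constants C_1, C_2 exist, so the choice C_2 = 1 below is an arbitrary placeholder; only the growth rates in ε, δ, and d are meaningful.

```python
import math

def agnostic_upper_bound(d, eps, delta, C2=1.0):
    # item 2: m_H(eps, delta) <= C2 * (d + log(1/delta)) / eps^2
    return C2 * (d + math.log(1 / delta)) / eps ** 2

def realizable_upper_bound(d, eps, delta, C2=1.0):
    # item 3: m_H(eps, delta) <= C2 * (d * log(1/eps) + log(1/delta)) / eps
    return C2 * (d * math.log(1 / eps) + math.log(1 / delta)) / eps

d, eps, delta = 10, 0.1, 0.05
print(agnostic_upper_bound(d, eps, delta))    # ~1300: scales as 1/eps^2
print(realizable_upper_bound(d, eps, delta))  # ~260: scales as (1/eps) log(1/eps)
```

The comparison makes visible the gap between the agnostic rate of order 1/ε² and the realizable (PAC) rate of order (1/ε) log(1/ε).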
                 Remark 6.3.  We stated the fundamental theorem for binary classification tasks. A
                 similar result holds for some other learning problems such as regression with the
                 absolute loss or the squared loss. However, the theorem does not hold for all learn-
                 ing tasks. In particular, learnability is sometimes possible even though the uniform
                 convergence property does not hold (we will see an example in Chapter 13, Exercise
                 6.2). Furthermore, in some situations, the ERM rule fails but learnability is possible
                 with other learning rules.