Page 57 - Understanding Machine Learning

5.1 The No-Free-Lunch Theorem


{0,1} and every i we have
\[
L_{\mathcal{D}_i}(h) = \frac{1}{2m}\sum_{x \in C} \mathbb{1}_{[h(x) \neq f_i(x)]}
\;\geq\; \frac{1}{2m}\sum_{r=1}^{p} \mathbb{1}_{[h(v_r) \neq f_i(v_r)]}
\;\geq\; \frac{1}{2p}\sum_{r=1}^{p} \mathbb{1}_{[h(v_r) \neq f_i(v_r)]}. \tag{5.5}
\]
              Hence,
\[
\frac{1}{T}\sum_{i=1}^{T} L_{\mathcal{D}_i}\bigl(A(S_j^i)\bigr)
\;\geq\; \frac{1}{T}\sum_{i=1}^{T} \frac{1}{2p}\sum_{r=1}^{p} \mathbb{1}_{[A(S_j^i)(v_r) \neq f_i(v_r)]}
\;=\; \frac{1}{2p}\sum_{r=1}^{p} \frac{1}{T}\sum_{i=1}^{T} \mathbb{1}_{[A(S_j^i)(v_r) \neq f_i(v_r)]}
\;\geq\; \frac{1}{2}\cdot \min_{r \in [p]} \frac{1}{T}\sum_{i=1}^{T} \mathbb{1}_{[A(S_j^i)(v_r) \neq f_i(v_r)]}. \tag{5.6}
\]
Next, fix some r ∈ [p]. We can partition all the functions in f_1, ..., f_T into T/2 disjoint pairs, where for a pair (f_i, f_{i'}) we have that for every c ∈ C, f_i(c) ≠ f_{i'}(c) if and only if c = v_r. Since for such a pair we must have S_j^i = S_j^{i'}, it follows that
\[
\mathbb{1}_{[A(S_j^i)(v_r) \neq f_i(v_r)]} + \mathbb{1}_{[A(S_j^{i'})(v_r) \neq f_{i'}(v_r)]} = 1,
\]
              which yields
\[
\frac{1}{T}\sum_{i=1}^{T} \mathbb{1}_{[A(S_j^i)(v_r) \neq f_i(v_r)]} = \frac{1}{2}.
\]
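This pairing identity can be checked numerically on a tiny instance. The sketch below is my own construction (the learner `A`, the domain size, and the point `v_r` are illustrative choices, not from the text): enumerate every function f : C → {0,1} on a domain of 2m points, train one fixed deterministic learner on the first m points, and average the disagreement indicator at an unseen point. Paired functions induce the same sample, so the learner outputs the same hypothesis for both, and exactly one member of each pair disagrees with it at v_r.

```python
from itertools import product

# Domain C has 2m points; the learner sees labels only on the first m.
m = 2
C = list(range(2 * m))
all_functions = list(product([0, 1], repeat=len(C)))  # every f: C -> {0,1}

def A(sample):
    """A hypothetical deterministic learner: memorize the sample,
    predict 0 on unseen points. Any deterministic A would do here."""
    seen = dict(sample)
    return lambda x: seen.get(x, 0)

v_r = m  # a point outside the training sample
errs = []
for f in all_functions:
    sample = [(x, f[x]) for x in C[:m]]   # labels on the seen points only
    h = A(sample)
    errs.append(int(h(v_r) != f[v_r]))

# Paired functions differ only at v_r, so A outputs the same h for both
# and exactly one of the pair is wrong at v_r: the average is exactly 1/2.
print(sum(errs) / len(errs))  # -> 0.5
```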
              Combining this with Equation (5.6), Equation (5.4), and Equation (5.3), we obtain
              that Equation (5.1) holds, which concludes our proof.
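The combined bound can also be brute-forced end to end on a small instance. The following sketch (again my own setup, assuming the memorizing learner from before as A and a uniform distribution over C) enumerates all labelings f and all samples of size m from a domain of 2m points, and confirms that the worst-case expected loss is at least 1/4, as the theorem guarantees:

```python
from itertools import product

# Check: for a fixed learner A, max over f of E_S[L_{D,f}(A(S))] >= 1/4,
# where D is uniform over a domain C of 2m points.
m = 2
C = list(range(2 * m))

def A(sample):
    """Hypothetical memorizing learner: recall seen labels, guess 0 otherwise."""
    seen = dict(sample)
    return lambda x: seen.get(x, 0)

worst = 0.0
for f in product([0, 1], repeat=len(C)):           # every labeling f
    total = 0.0
    for idx in product(range(len(C)), repeat=m):   # every sample S_j of size m
        h = A([(j, f[j]) for j in idx])
        total += sum(h(x) != f[x] for x in C) / len(C)   # L_{D,f}(h)
    worst = max(worst, total / len(C) ** m)        # expected loss for this f
print(worst)  # -> 0.5625, comfortably above the 1/4 bound
```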
              5.1.1 No-Free-Lunch and Prior Knowledge
              How does the No-Free-Lunch result relate to the need for prior knowledge? Let us
              consider an ERM predictor over the hypothesis class H of all the functions f from
              X to {0,1}. This class represents lack of prior knowledge: Every possible function
              from the domain to the label set is considered a good candidate. According to the
              No-Free-Lunch theorem, any algorithm that chooses its output from hypotheses in
              H, and in particular the ERM predictor, will fail on some learning task. Therefore,
              this class is not PAC learnable, as formalized in the following corollary:
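The failure mode is easy to exhibit concretely. In the sketch below (my own illustration; the domain size, sample size, and memorizing predictor are arbitrary choices), an ERM hypothesis over the class of all functions simply memorizes the sample: empirical risk is zero, yet on unseen points the predictor is at chance level.

```python
import random

random.seed(0)
X = range(1000)
f = {x: random.randint(0, 1) for x in X}   # true labeling, unknown to the learner
S = random.sample(list(X), 100)            # training points

# ERM over the class of ALL functions: memorizing the sample attains
# zero empirical risk; predictions off-sample are an arbitrary guess.
erm = {x: f[x] for x in S}
predict = lambda x: erm.get(x, 0)

train_err = sum(predict(x) != f[x] for x in S) / len(S)
unseen = [x for x in X if x not in erm]
test_err = sum(predict(x) != f[x] for x in unseen) / len(unseen)
print(train_err, round(test_err, 2))  # train error 0.0; test error near 0.5
```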
              Corollary 5.2. Let X be an infinite domain set and let H be the set of all functions
              from X to {0,1}. Then, H is not PAC learnable.
Proof. Assume, by way of contradiction, that the class is learnable. Choose some
ε < 1/8 and δ < 1/7. By the definition of PAC learnability, there must be some