prefer h′ over h. But these are the same h and h′ for which we argued two sentences ago that h should be preferable. Where is the catch here?
   Indeed, there is no inherent generalizability difference between hypotheses.
                    Indeed, there is no inherent generalizability difference between hypotheses.
                 The crucial aspect here is the dependency order between the initial choice of lan-
                 guage (or, preference over hypotheses) and the training set. As we know from
                 the basic Hoeffding’s bound (Equation (4.2)), if we commit to any hypothesis
                 before seeing the data, then we are guaranteed a rather small estimation error term

$L_{\mathcal{D}}(h) \le L_S(h) + \sqrt{\frac{\ln(2/\delta)}{2m}}$. Choosing a description language (or, equivalently, some
                 weighting of hypotheses) is a weak form of committing to a hypothesis. Rather than
                 committing to a single hypothesis, we spread out our commitment among many. As
                 long as it is done independently of the training sample, our generalization bound
                 holds. Just as the choice of a single hypothesis to be evaluated by a sample can be
                 arbitrary, so is the choice of description language.
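To make these two levels of commitment concrete, the following small Python sketch (our own illustration; the function names and the specific numbers are not from the text) evaluates the estimation error term for a single hypothesis fixed before seeing the data, and the corresponding term when the commitment is spread over many hypotheses via prior weights w(h) with sum at most 1, obtained by applying the same Hoeffding bound with confidence parameter w(h)·δ and taking a union bound.

import math

def hoeffding_term(m, delta):
    # Estimation error for one hypothesis committed to before seeing the data:
    # sqrt(ln(2/delta) / (2m)).
    return math.sqrt(math.log(2.0 / delta) / (2.0 * m))

def weighted_term(m, delta, w_h):
    # Commitment spread over many hypotheses with weights w(h), sum_h w(h) <= 1:
    # apply Hoeffding with confidence w(h)*delta and a union bound, giving
    # sqrt((ln(1/w(h)) + ln(2/delta)) / (2m)).
    return math.sqrt((math.log(1.0 / w_h) + math.log(2.0 / delta)) / (2.0 * m))

# Illustrative numbers (assumptions of this sketch): m = 10,000, delta = 0.01.
print(hoeffding_term(10_000, 0.01))           # ~0.016, single pre-committed hypothesis
print(weighted_term(10_000, 0.01, 2.0 ** -20))  # ~0.031, hypothesis with weight 2^-20

The weighted term is only moderately larger, which is exactly why spreading the commitment over a whole description language, as long as it is chosen independently of the sample, still yields a meaningful bound.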


                 7.4 OTHER NOTIONS OF LEARNABILITY – CONSISTENCY
                 The notion of learnability can be further relaxed by allowing the needed sample
sizes to depend not only on ε, δ, and h but also on the underlying data-generating probability distribution D (that is used to generate the training sample and to determine the risk). This type of performance guarantee is captured by the notion of consistency¹ of a learning rule.
Definition 7.8 (Consistency). Let Z be a domain set, let P be a set of probability distributions over Z, and let H be a hypothesis class. A learning rule A is consistent with respect to H and P if there exists a function $m^{\mathrm{CON}}_{\mathcal{H}} : (0,1)^2 \times \mathcal{H} \times \mathcal{P} \to \mathbb{N}$ such that, for every ε, δ ∈ (0,1), every h ∈ H, and every D ∈ P, if $m \ge m^{\mathrm{CON}}_{\mathcal{H}}(\epsilon,\delta,h,D)$ then with probability of at least 1 − δ over the choice of $S \sim \mathcal{D}^m$ it holds that

$$L_{\mathcal{D}}(A(S)) \le L_{\mathcal{D}}(h) + \epsilon.$$

If P is the set of all distributions,² we say that A is universally consistent with respect to H.
                    The notion of consistency is, of course, a relaxation of our previous notion of
nonuniform learnability. Clearly, if an algorithm nonuniformly learns a class H, it is also universally consistent for that class. The relaxation is strict in the sense that
                 there are consistent learning rules that are not successful nonuniform learners. For
                 example, the algorithm Memorize defined in Example 7.4 later is universally consis-
                 tent for the class of all binary classifiers over N. However, as we have argued before,
                 this class is not nonuniformly learnable.
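To fix ideas before the formal example, here is a minimal Python sketch of one natural way such a memorizing rule over N could be implemented; the majority-vote prediction on previously seen points, the tie-breaking, and the fixed default label on unseen points are assumptions of this sketch, not details taken from the text.

from collections import Counter, defaultdict

class Memorize:
    # Stores the training sample; on a point seen in training it predicts the
    # majority label among the stored copies of that point, and on an unseen
    # point it falls back to a fixed default label (an assumption of this sketch).

    def __init__(self, default_label=0):
        self.default_label = default_label
        self.counts = defaultdict(Counter)  # x -> Counter of labels observed at x

    def fit(self, sample):
        for x, y in sample:  # sample: iterable of (point, label) pairs
            self.counts[x][y] += 1
        return self

    def predict(self, x):
        if x in self.counts:
            return self.counts[x].most_common(1)[0][0]
        return self.default_label

# Usage: majority vote on seen points, default label elsewhere.
rule = Memorize().fit([(3, 1), (3, 1), (7, 0), (3, 0)])
print(rule.predict(3), rule.predict(7), rule.predict(42))  # 1 0 0

Because the number of samples such a rule needs before it competes with a given classifier inevitably depends on how the distribution spreads its mass over the domain, the distribution-dependent sample complexity permitted by Definition 7.8 is exactly the kind of guarantee it can satisfy.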
                 Example 7.4. Consider the classification prediction algorithm Memorize defined as
follows. The algorithm memorizes the training examples, and, given a test point x, it

1  In the literature, consistency is often defined using the notion of either convergence in probability (corresponding to weak consistency) or almost sure convergence (corresponding to strong consistency).
2  Formally, we assume that Z is endowed with some sigma algebra of subsets, and by “all distributions” we mean all probability distributions that have this sigma algebra contained in their associated family of measurable subsets.