prefer h′ over h. But these are the same h and h′ for which we argued two sentences
ago that h should be preferable. Where is the catch here?
Indeed, there is no inherent generalizability difference between hypotheses.
The crucial aspect here is the dependency order between the initial choice of language
(or, preference over hypotheses) and the training set. As we know from
the basic Hoeffding’s bound (Equation (4.2)), if we commit to any hypothesis
before seeing the data, then we are guaranteed a rather small estimation error term
L_D(h) ≤ L_S(h) + √(ln(2/δ)/(2m)). Choosing a description language (or, equivalently, some
weighting of hypotheses) is a weak form of committing to a hypothesis. Rather than
committing to a single hypothesis, we spread out our commitment among many. As
long as it is done independently of the training sample, our generalization bound
holds. Just as the choice of a single hypothesis to be evaluated by a sample can be
arbitrary, so is the choice of description language.
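To make the estimation error term concrete, here is a minimal Python sketch that evaluates √(ln(2/δ)/(2m)) for a few sample sizes; the particular values of m and δ are illustrative assumptions, not taken from the text.

    import math

    def estimation_error_term(m, delta):
        """The Hoeffding term sqrt(ln(2/delta) / (2m)) from Equation (4.2)."""
        return math.sqrt(math.log(2.0 / delta) / (2.0 * m))

    # Illustrative sample sizes and confidence parameter (chosen arbitrarily).
    for m in (100, 1000, 10000):
        print(m, round(estimation_error_term(m, delta=0.05), 4))

The term shrinks at a rate of 1/√m, and the guarantee holds as long as the hypothesis (or the weighting over hypotheses) is fixed before the sample is drawn.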
7.4 OTHER NOTIONS OF LEARNABILITY – CONSISTENCY
The notion of learnability can be further relaxed by allowing the needed sample
sizes to depend not only on ε, δ, and h but also on the underlying data-generating
probability distribution D (that is used to generate the training sample and to determine
the risk). This type of performance guarantee is captured by the notion of
consistency¹ of a learning rule.
Definition 7.8 (Consistency). Let Z be a domain set, let P be a set of probability
distributions over Z, and let H be a hypothesis class. A learning rule A is consistent
with respect to H and P if there exists a function m_H^CON : (0,1)² × H × P → N such
that, for every ε, δ ∈ (0,1), every h ∈ H, and every D ∈ P, if m ≥ m_H^CON(ε, δ, h, D) then
with probability of at least 1 − δ over the choice of S ∼ D^m it holds that
L_D(A(S)) ≤ L_D(h) + ε.
If P is the set of all distributions,² we say that A is universally consistent with respect
to H.
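To illustrate the form of the guarantee in Definition 7.8, the following Python sketch estimates by simulation how often L_D(A(S)) exceeds min over h in H of L_D(h) + ε for an ERM learner on a toy finite domain; the domain, distribution, labeling function, tie-breaking rule, and parameter values are all assumptions made for this example.

    import random

    # Toy setup (illustrative assumptions): domain X = {0,...,4}, D uniform over X,
    # deterministic labels f(x) = x mod 2, hypothesis class H = all functions X -> {0,1},
    # and A = an ERM rule that copies the observed label on seen points and outputs 0
    # on unseen points (a valid ERM here because the labels are deterministic).
    X = list(range(5))
    f = lambda x: x % 2

    def erm_predictor(sample):
        seen = {x: y for x, y in sample}
        return lambda x: seen.get(x, 0)

    def true_risk(h):
        # L_D(h): probability of error under the uniform distribution over X.
        return sum(h(x) != f(x) for x in X) / len(X)

    def failure_rate(m, eps, trials=2000):
        # Estimate Pr[ L_D(A(S)) > min_h L_D(h) + eps ] over S ~ D^m.
        # Since f itself belongs to H, min_h L_D(h) = 0.
        failures = 0
        for _ in range(trials):
            sample = [(x, f(x)) for x in random.choices(X, k=m)]
            if true_risk(erm_predictor(sample)) > eps:
                failures += 1
        return failures / trials

    for m in (5, 20, 50):
        print(m, failure_rate(m, eps=0.1))

As m grows, the estimated failure rate drops below any desired δ, which is the role played by m_H^CON(ε, δ, h, D) in the definition; how fast it drops depends on D (here, on how the distribution spreads its mass over the domain points).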
The notion of consistency is, of course, a relaxation of our previous notion of
nonuniform learnability. Clearly, if an algorithm nonuniformly learns a class H, it is
also universally consistent for that class. The relaxation is strict in the sense that
there are consistent learning rules that are not successful nonuniform learners. For
example, the algorithm Memorize defined in Example 7.4 below is universally consistent
for the class of all binary classifiers over N. However, as we have argued before,
this class is not nonuniformly learnable.
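For concreteness, a minimal Python sketch of the Memorize rule described in Example 7.4 below might look as follows; the majority-vote handling of repeated points and the fixed default label for unseen points are assumptions of this sketch, since the example's full description continues beyond this excerpt.

    from collections import Counter, defaultdict

    class Memorize:
        """Memorizes the training sample and predicts from memory at test time."""

        def __init__(self, default_label=0):
            # Label returned for points never seen in training (an assumed convention).
            self.default_label = default_label
            self.labels_seen = defaultdict(Counter)

        def fit(self, sample):
            # sample: iterable of (x, y) pairs with x in N and y a binary label.
            for x, y in sample:
                self.labels_seen[x][y] += 1
            return self

        def predict(self, x):
            # Majority label observed for x, or the default label if x is unseen.
            if x in self.labels_seen:
                return self.labels_seen[x].most_common(1)[0][0]
            return self.default_label

    # Usage: memorize a small binary-labeled sample over N.
    clf = Memorize().fit([(3, 1), (7, 0), (3, 1)])
    print(clf.predict(3), clf.predict(10))  # -> 1 0

Intuitively, any fixed distribution D over N puts most of its mass on finitely many points, so a large enough sample, whose required size depends on D, covers them; this is why the sample complexity may depend on D and why Memorize can be consistent without being a successful nonuniform learner.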
Example 7.4. Consider the classification prediction algorithm Memorize defined as
follows. The algorithm memorizes the training examples, and, given a test point x, it
¹ In the literature, consistency is often defined using the notion of either convergence in probability
(corresponding to weak consistency) or almost sure convergence (corresponding to strong consistency).
² Formally, we assume that Z is endowed with some sigma algebra of subsets, and by “all distributions”
we mean all probability distributions whose associated family of measurable subsets contains that
sigma algebra.