Efficiently Learnable in the Realizable Case
Consider implementing the ERM rule in the realizable case. That is, we are given a training set $S = ((\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_m, y_m))$ of examples, such that there exists an axis aligned rectangle $h \in \mathcal{H}_n$ for which $h(\mathbf{x}_i) = y_i$ for all $i$. Our goal is to find such an axis aligned rectangle with zero training error, namely, a rectangle that is consistent with all the labels in $S$.
We now show that this can be done in time $O(nm)$. Indeed, for each $i \in [n]$, set $a_i = \min\{x_i : (\mathbf{x}, 1) \in S\}$ and $b_i = \max\{x_i : (\mathbf{x}, 1) \in S\}$. In words, we take $a_i$ to be the minimal value of the $i$'th coordinate of a positive example in $S$ and $b_i$ to be the maximal value of the $i$'th coordinate of a positive example in $S$. It is easy to verify that the resulting rectangle has zero training error and that the runtime of finding each $a_i$ and $b_i$ is $O(m)$. Hence, the total runtime of this procedure is $O(nm)$.
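The following is a minimal Python sketch of this procedure, assuming each example is a pair (x, y) with x an indexable sequence of $n$ numbers and y in $\{0, 1\}$; the function names are illustrative, not from the book.

```python
def erm_rectangle_realizable(S, n):
    """Return bound vectors (a, b) of the tightest axis aligned
    rectangle containing all positive examples in S."""
    positives = [x for (x, y) in S if y == 1]
    if not positives:
        return None  # no positives: the "empty" rectangle suffices
    # For each coordinate i, scan the positives once: O(m) work per
    # coordinate, O(nm) in total.
    a = [min(x[i] for x in positives) for i in range(n)]
    b = [max(x[i] for x in positives) for i in range(n)]
    return a, b

def rectangle_predict(rect, x):
    """Label x as 1 iff it lies inside the rectangle."""
    if rect is None:
        return 0
    a, b = rect
    return int(all(a[i] <= x[i] <= b[i] for i in range(len(a))))
```

For example, `erm_rectangle_realizable([((1, 2), 1), ((3, 1), 1), ((5, 5), 0)], 2)` returns `([1, 1], [3, 2])`. Because any rectangle consistent with $S$ must contain all the positive examples, the tightest such rectangle is contained in it, so it excludes every negative example as well; this is why the training error is zero in the realizable case.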
Not Efficiently Learnable in the Agnostic Case
In the agnostic case, we do not assume that some hypothesis $h$ perfectly predicts the labels of all the examples in the training set. Our goal is therefore to find $h$ that minimizes the number of examples for which $y_i \neq h(\mathbf{x}_i)$. It turns out that for many common hypothesis classes, including the class of axis aligned rectangles we consider here, solving the ERM problem in the agnostic setting is NP-hard (and, in most cases, it is even NP-hard to find some $h \in \mathcal{H}$ whose error is no more than some constant $c > 1$ times that of the empirical risk minimizer in $\mathcal{H}$). That is, unless P = NP, there is no algorithm whose running time is polynomial in $m$ and $n$ that is guaranteed to find an ERM hypothesis for these problems (Ben-David, Eiron & Long 2003).
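In symbols, and using the empirical risk $L_S(h)$ defined earlier in the book, the agnostic ERM problem is to find

$$h_S \in \operatorname*{argmin}_{h \in \mathcal{H}} L_S(h), \qquad L_S(h) = \frac{\left|\{ i \in [m] : h(\mathbf{x}_i) \neq y_i \}\right|}{m},$$

and the hardness result above concerns computing (or even approximating) such an $h_S$.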
On the other hand, it is worthwhile noticing that, if we fix one specific hypothesis class, say, axis aligned rectangles in some fixed dimension $n$, then there exist efficient learning algorithms for this class. In other words, there are successful agnostic PAC learners that run in time polynomial in $1/\epsilon$ and $1/\delta$ (but whose dependence on the dimension $n$ is not polynomial).
To see this, recall the implementation of the ERM rule we presented for the realizable case, from which it follows that an axis aligned rectangle is determined by at most $2n$ examples (one touching each of its $2n$ faces). Therefore, given a training set of size $m$, we can perform an exhaustive search over all subsets of the training set of size at most $2n$, construct a rectangle from each such subset, and then pick the rectangle with the minimal training error. This procedure is guaranteed to find an ERM hypothesis, and its runtime is $m^{O(n)}$. It follows that if $n$ is fixed, the runtime is polynomial in the sample size. This does not contradict the aforementioned hardness result, since there we argued that, unless P = NP, one cannot have an algorithm whose dependence on the dimension $n$ is polynomial as well.
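A self-contained sketch of this exhaustive search, under the same input assumptions as the previous snippet (the function name `erm_rectangle_agnostic` is again illustrative):

```python
from itertools import combinations

def erm_rectangle_agnostic(S, n):
    """Exhaustive-search ERM: try the tightest rectangle spanned by
    every subset of at most 2n training points, keep the best."""
    m = len(S)
    best_rect, best_err = None, m + 1
    for k in range(2 * n + 1):           # subset sizes 0, 1, ..., 2n
        for idx in combinations(range(m), k):
            pts = [S[i][0] for i in idx]
            if pts:
                rect = ([min(x[i] for x in pts) for i in range(n)],
                        [max(x[i] for x in pts) for i in range(n)])
            else:
                rect = None              # empty rectangle: always predicts 0
            # Training error of this candidate over all of S.
            err = sum(
                (rect is not None
                 and all(rect[0][i] <= x[i] <= rect[1][i] for i in range(n)))
                != bool(y)
                for (x, y) in S)
            if err < best_err:
                best_rect, best_err = rect, err
    return best_rect, best_err
```

There are $\sum_{k=0}^{2n} \binom{m}{k} = m^{O(n)}$ candidate subsets, and evaluating each candidate rectangle on $S$ takes $O(nm)$ time, so the total runtime is $m^{O(n)}$: polynomial in $m$ whenever $n$ is fixed.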
8.2.3 Boolean Conjunctions
A Boolean conjunction is a mapping from $X = \{0,1\}^n$ to $Y = \{0,1\}$ that can be expressed as a propositional formula of the form $x_{i_1} \wedge \cdots \wedge x_{i_k} \wedge \neg x_{j_1} \wedge \cdots \wedge \neg x_{j_r}$, for some indices $i_1, \ldots, i_k, j_1, \ldots, j_r \in [n]$. The function that such a propositional formula