to introduce stability into learning algorithms, in particular the bagging technique introduced by Breiman (1996).
Over the last decade, stability was studied as a generic condition for learnability. See (Kearns & Ron 1999, Bousquet & Elisseeff 2002, Kutin & Niyogi 2002, Rakhlin, Mukherjee & Poggio 2005, Mukherjee, Niyogi, Poggio & Rifkin 2006). Our presentation follows the work of Shalev-Shwartz, Shamir, Srebro, and Sridharan (2010), who showed that stability is sufficient and necessary for learning. They have also shown that all convex-Lipschitz-bounded learning problems are learnable using RLM, even though for some convex-Lipschitz-bounded learning problems uniform convergence does not hold in a strong sense.
13.7 EXERCISES
13.1 From Bounded Expected Risk to Agnostic PAC Learning: Let A be an algorithm that guarantees the following: If m ≥ m_H(ε), then for every distribution D it holds that

    E_{S∼D^m}[L_D(A(S))] ≤ min_{h∈H} L_D(h) + ε.
Show that for every δ ∈ (0,1), if m ≥ m_H(εδ), then with probability of at least 1 − δ it holds that L_D(A(S)) ≤ min_{h∈H} L_D(h) + ε.
Hint: Observe that the random variable L_D(A(S)) − min_{h∈H} L_D(h) is nonnegative and rely on Markov’s inequality.
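As a sanity check on the hint (a sketch, not a full solution): if m ≥ m_H(εδ), the guarantee above gives E_{S∼D^m}[L_D(A(S)) − min_{h∈H} L_D(h)] ≤ εδ, and since this random variable is nonnegative, Markov’s inequality yields

    Pr_{S∼D^m}[L_D(A(S)) − min_{h∈H} L_D(h) > ε] ≤ εδ/ε = δ.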
For every δ ∈ (0,1) let

    m_H(ε, δ) = m_H(ε/2) ⌈log₂(1/δ)⌉ + ⌈(log(4/δ) + log(⌈log₂(1/δ)⌉)) / ε²⌉.

Suggest a procedure that agnostic PAC learns the problem with sample complexity of m_H(ε, δ), assuming that the loss function is bounded by 1.
Hint: Let k = ⌈log₂(1/δ)⌉. Divide the data into k + 1 chunks, where each of the first k chunks is of size m_H(ε/2) examples. Train the first k chunks using A. On the basis of the previous question, argue that the probability that for all of these chunks we have L_D(A(S)) > min_{h∈H} L_D(h) + ε is at most 2^{−k} ≤ δ/2. Finally, use the last chunk as a validation set.
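For concreteness, here is a minimal Python sketch of the chunk-and-validate procedure described by the hint. The base learner A, the per-example loss, and the sample S are placeholders standing in for the objects the exercise posits; k and m_chunk should be set to ⌈log₂(1/δ)⌉ and m_H(ε/2), as in the hint.

import numpy as np

def boost_confidence(A, loss, S, k, m_chunk):
    """Train A on k disjoint chunks, then select the trained hypothesis with
    the smallest empirical risk on the remaining (validation) examples."""
    # Run A on each of the first k chunks of size m_chunk.
    candidates = [A(S[i * m_chunk:(i + 1) * m_chunk]) for i in range(k)]
    # The leftover examples form the validation chunk.
    validation = S[k * m_chunk:]
    # Empirical risk of each candidate on the validation chunk.
    val_risk = [np.mean([loss(h, z) for z in validation]) for h in candidates]
    return candidates[int(np.argmin(val_risk))]

The intended analysis is the one in the hint: with probability at least 1 − δ/2 at least one candidate is good, and the size of the validation chunk (the second term of m_H(ε, δ)) lets the selection step preserve this guarantee.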
13.2 Learnability without Uniform Convergence: Let B be the unit ball of ℝ^d, let H = B, let Z = B × {0,1}^d, and let ℓ : Z × H → ℝ be defined as follows:

    ℓ(w, (x, α)) = Σ_{i=1}^{d} α_i (x_i − w_i)².
This problem corresponds to an unsupervised learning task, meaning that we do
not try to predict the label of x. Instead, what we try to do is to find the “center of
mass” of the distribution over B. However, there is a twist, modeled by the vectors
α. Each example is a pair (x, α), where x is the instance and α indicates which
features of x are “active” and which are “turned off.” A hypothesis is a vector
w representing the center of mass of the distribution, and the loss function is the
squared Euclidean distance between x and w, but only with respect to the “active”
elements of x.
Show that this problem is learnable using the RLM rule with a sample
complexity that does not depend on d.
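As an illustration of the RLM rule on this problem (my own sketch, not part of the exercise): for the unconstrained regularized objective L_S(w) + λ‖w‖², the loss above separates over coordinates, so the minimizer has a closed form. The names X and Alpha (the m × d arrays of instances and activity indicators) and the helper function are assumptions made for the sketch.

import numpy as np

def rlm_center_of_mass(X, Alpha, lam):
    """Minimize (1/m) * sum_j sum_i Alpha[j,i] * (X[j,i] - w[i])**2 + lam * ||w||^2.
    Setting the per-coordinate derivative to zero gives the closed form
        w[i] = sum_j Alpha[j,i] * X[j,i] / (sum_j Alpha[j,i] + lam * m)."""
    m = X.shape[0]
    numer = (Alpha * X).sum(axis=0)       # per coordinate: sum_j alpha_{j,i} x_{j,i}
    denom = Alpha.sum(axis=0) + lam * m   # per coordinate: sum_j alpha_{j,i} + lambda*m
    return numer / denom

The point of the exercise is that the stability of this rule, driven by the strong convexity of the regularizer as discussed in this chapter, yields a sample complexity independent of d even though uniform convergence may fail to give a dimension-free bound.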