

to introduce stability into learning algorithms, in particular the Bagging technique introduced by Breiman (1996).

   Over the last decade, stability was studied as a generic condition for learnability. See (Kearns & Ron 1999, Bousquet & Elisseeff 2002, Kutin & Niyogi 2002, Rakhlin, Mukherjee & Poggio 2005, Mukherjee, Niyogi, Poggio & Rifkin 2006). Our presentation follows the work of Shalev-Shwartz, Shamir, Srebro, and Sridharan (2010), who showed that stability is sufficient and necessary for learning. They have also shown that all convex-Lipschitz-bounded learning problems are learnable using RLM, even though for some convex-Lipschitz-bounded learning problems uniform convergence does not hold in a strong sense.


              13.7 EXERCISES
13.1 From Bounded Expected Risk to Agnostic PAC Learning: Let A be an algorithm
     that guarantees the following: If m ≥ m_H(ε), then for every distribution D it holds
     that

          E_{S∼D^m}[L_D(A(S))] ≤ min_{h∈H} L_D(h) + ε.

     • Show that for every δ ∈ (0,1), if m ≥ m_H(ε · δ), then with probability of at least
       1 − δ it holds that L_D(A(S)) ≤ min_{h∈H} L_D(h) + ε.
       Hint: Observe that the random variable L_D(A(S)) − min_{h∈H} L_D(h) is nonnegative
       and rely on Markov's inequality.
     • For every δ ∈ (0,1) let

          m_H(ε, δ) = m_H(ε/2) ⌈log₂(1/δ)⌉ + (log(4/δ) + log(⌈log₂(1/δ)⌉)) / ε².

       Suggest a procedure that agnostic PAC learns the problem with sample complexity
       of m_H(ε, δ), assuming that the loss function is bounded by 1.
       Hint: Let k = ⌈log₂(1/δ)⌉. Divide the data into k + 1 chunks, where each of the
       first k chunks is of size m_H(ε/2) examples. Train the first k chunks using A. On
       the basis of the previous question, argue that the probability that for all of these
       chunks we have L_D(A(S)) > min_{h∈H} L_D(h) + ε is at most 2^{−k} ≤ δ/2. Finally,
       use the last chunk as a validation set.
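
     The hinted construction is a standard "boost the confidence" scheme: run A independently on k chunks and use the held-out chunk for model selection. The following is a minimal sketch of that procedure, not code from the book; the names boost_confidence, A, loss, and train_chunk_size are placeholders, and choosing k and the chunk sizes according to m_H(ε, δ) is left to the exercise.

```python
def boost_confidence(A, loss, samples, k, train_chunk_size):
    """Sketch of the hinted procedure (assumed interface, not the book's code).

    A     : learner with the bounded-expected-risk guarantee; takes a list of
            examples and returns a hypothesis.
    loss  : per-example loss, loss(h, z) in [0, 1].
    """
    # First k chunks of size train_chunk_size (= m_H(epsilon/2) in the exercise).
    train_chunks = [samples[i * train_chunk_size:(i + 1) * train_chunk_size]
                    for i in range(k)]
    # Everything left over serves as the validation chunk.
    validation = samples[k * train_chunk_size:]

    # By the previous item (Markov with delta = 1/2), each candidate fails to be
    # epsilon-good with probability at most 1/2, so all k candidates fail together
    # with probability at most 2 ** (-k).
    candidates = [A(chunk) for chunk in train_chunks]

    # Model selection on the held-out chunk: since the loss is bounded by 1,
    # Hoeffding's inequality plus a union bound over the k candidates controls
    # the error of these validation estimates.
    def validation_loss(h):
        return sum(loss(h, z) for z in validation) / len(validation)

    return min(candidates, key=validation_loss)
```

     Selecting among k candidates needs only logarithmically many extra examples, which is where the log(⌈log₂(1/δ)⌉) term in m_H(ε, δ) comes from.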
13.2 Learnability without Uniform Convergence: Let B be the unit ball of ℝ^d, let H = B,
     let Z = B × {0,1}^d, and let ℓ : Z × H → ℝ be defined as follows:

          ℓ(w, (x, α)) = Σ_{i=1}^{d} α_i (x_i − w_i)².
                  This problem corresponds to an unsupervised learning task, meaning that we do
                  not try to predict the label of x. Instead, what we try to do is to find the “center of
                  mass” of the distribution over B. However, there is a twist, modeled by the vectors
                   α. Each example is a pair (x, α), where x is the instance and α indicates which
                  features of x are “active” and which are “turned off.” A hypothesis is a vector
                  w representing the center of mass of the distribution, and the loss function is the
                  squared Euclidean distance between x and w, but only with respect to the “active”
                  elements of x.
     • Show that this problem is learnable using the RLM rule with a sample
       complexity that does not depend on d.
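
     As a small illustration (not from the book), the loss above and an unconstrained Tikhonov-regularized empirical risk can be written out directly: the objective separates over coordinates, so its unconstrained minimizer has a closed form. The names loss, rlm, and lam below are placeholders, and the choice of λ (and whether to project the minimizer back onto B) is part of the analysis the exercise asks for.

```python
import numpy as np

def loss(w, x, alpha):
    """Per-example loss of Exercise 13.2: squared distance between w and x,
    restricted to the coordinates that alpha marks as active."""
    return float(np.sum(alpha * (x - w) ** 2))

def rlm(X, Alpha, lam):
    """Unconstrained Tikhonov-regularized empirical risk minimizer for this loss:
        argmin_w  lam * ||w||^2 + (1/m) * sum_j loss(w, x_j, alpha_j).
    Setting the derivative in each coordinate to zero gives
        w_i = sum_j alpha_{j,i} * x_{j,i} / (m * lam + sum_j alpha_{j,i}).
    """
    m = X.shape[0]
    return (Alpha * X).sum(axis=0) / (m * lam + Alpha.sum(axis=0))

# Toy usage: m examples from the unit ball of R^d with random "active" masks.
rng = np.random.default_rng(0)
m, d = 100, 5
X = rng.normal(size=(m, d))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))  # keep each x in B
Alpha = rng.integers(0, 2, size=(m, d)).astype(float)
w_hat = rlm(X, Alpha, lam=0.1)
```

     This mirrors the chapter's theme: it is the stability of the RLM rule, rather than uniform convergence, that controls the sample complexity here.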