
Feature Selection and Generation

on understanding the relationship between the $\ell_1$-norm and sparsity. It is also
closely related to compressed sensing (see Chapter 23). The ability to sparsify low
$\ell_1$-norm predictors dates back to Maurey (Pisier 1980–1981). In Section 26.4 we also
show that low $\ell_1$ norm can be used to bound the estimation error of our predictor.
                    Feature learning and dictionary learning have been extensively studied recently
                 in the context of deep neural networks. See, for example, (LeCun & Bengio 1995,
                 Hinton et al. 2006, Ranzato et al. 2007, Collobert & Weston 2008, Lee et al. 2009, Le
                 et al. 2012, Bengio 2009) and the references therein.


                 25.6 EXERCISES
25.1 Prove the equality given in Equation (25.1). Hint: Let $a^*, b^*$ be minimizers of the
     left-hand side. Find $a, b$ such that the objective value of the right-hand side is
     smaller than that of the left-hand side. Do the same for the other direction.
                 25.2 Show that Equation (25.7) is the solution of Equation (25.6).
25.3 AdaBoost as a Forward Greedy Selection Algorithm: Recall the AdaBoost algorithm
     from Chapter 10. In this exercise we give another interpretation of AdaBoost as a
     forward greedy selection algorithm.
     • Given a set of $m$ instances $x_1, \ldots, x_m$, and a hypothesis class $\mathcal{H}$ of finite VC
       dimension, show that there exist $d$ and $h_1, \ldots, h_d$ such that for every $h \in \mathcal{H}$
       there exists $i \in [d]$ with $h_i(x_j) = h(x_j)$ for every $j \in [m]$.
     • Let $R(w)$ be as defined in Equation (25.3). Given some $w$, define $f_w$ to be the
       function
       $$ f_w(\cdot) = \sum_{i=1}^{d} w_i h_i(\cdot). $$
       Let $D$ be the distribution over $[m]$ defined by
       $$ D_i = \frac{\exp(-y_i f_w(x_i))}{Z}, $$
       where $Z$ is a normalization factor that ensures that $D$ is a probability vector.
       Show that
       $$ \frac{\partial R(w)}{\partial w_j} = -\sum_{i=1}^{m} D_i y_i h_j(x_i). $$
       Furthermore, denoting $\epsilon_j = \sum_{i=1}^{m} D_i \mathbb{1}_{[h_j(x_i) \neq y_i]}$, show that
       $$ \frac{\partial R(w)}{\partial w_j} = 2\epsilon_j - 1. $$
       Conclude that if $\epsilon_j \le 1/2 - \gamma$ then
       $$ \left| \frac{\partial R(w)}{\partial w_j} \right| \ge \gamma/2. $$
       (A short numerical check of these gradient identities is sketched after this exercise.)
     • Show that the update of AdaBoost guarantees
       $$ R(w^{(t+1)}) - R(w^{(t)}) \le \log\left(\sqrt{1 - 4\gamma^2}\right). $$
       Hint: Use the proof of Theorem 10.2.
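Equation (25.3) is not reproduced on this page, so the sketch below assumes $R(w) = \log\left(\sum_{i=1}^{m} \exp(-y_i f_w(x_i))\right)$, the form consistent with the distribution $D$ defined in Exercise 25.3 (a multiplicative constant inside the logarithm only shifts $R$ and does not change its gradient). It is a minimal numerical check of the two gradient identities, not part of the exercise; the arrays `H`, `y`, and `w` are randomly generated stand-ins for base hypotheses, labels, and weights.

```python
# Minimal numerical check of the gradient identities in Exercise 25.3.
# Assumption (Equation (25.3) is not reproduced on this page):
#   R(w) = log( sum_i exp(-y_i * f_w(x_i)) ),  f_w(x_i) = sum_j w_j * h_j(x_i),
# which is the form consistent with the distribution D defined in the exercise.
import numpy as np

rng = np.random.default_rng(0)
m, d = 8, 5                                   # number of examples / base hypotheses
H = rng.choice([-1.0, 1.0], size=(m, d))      # H[i, j] = h_j(x_i), illustrative stand-in
y = rng.choice([-1.0, 1.0], size=m)           # labels y_i
w = rng.normal(size=d)                        # current weight vector

def R(w):
    return np.log(np.sum(np.exp(-y * (H @ w))))

# D_i = exp(-y_i f_w(x_i)) / Z, with Z the normalization factor
Z = np.sum(np.exp(-y * (H @ w)))
D = np.exp(-y * (H @ w)) / Z

# dR/dw_j computed three ways: central finite differences,
# -sum_i D_i y_i h_j(x_i), and 2*eps_j - 1.
delta = 1e-6
numeric = np.array([(R(w + delta * e) - R(w - delta * e)) / (2 * delta)
                    for e in np.eye(d)])
via_D = np.array([-np.sum(D * y * H[:, j]) for j in range(d)])
eps_j = np.array([np.sum(D * (H[:, j] != y)) for j in range(d)])
via_eps = 2 * eps_j - 1

print(np.allclose(numeric, via_D, atol=1e-5))  # expected: True
print(np.allclose(via_D, via_eps))             # expected: True
```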