on understanding the relationship between the $\ell_1$ norm and sparsity. It is also
closely related to compressed sensing (see Chapter 23). The ability to sparsify low
$\ell_1$ norm predictors dates back to Maurey (Pisier 1980–1981). In Section 26.4 we also
show that low $\ell_1$ norm can be used to bound the estimation error of our predictor.
Feature learning and dictionary learning have been extensively studied recently
in the context of deep neural networks. See, for example, (LeCun & Bengio 1995,
Hinton et al. 2006, Ranzato et al. 2007, Collobert & Weston 2008, Lee et al. 2009, Le
et al. 2012, Bengio 2009) and the references therein.
25.6 EXERCISES
25.1 Prove the equality given in Equation (25.1). Hint: Let $a^*, b^*$ be minimizers of the
left-hand side. Find $a, b$ such that the objective value of the right-hand side is
smaller than that of the left-hand side. Do the same for the other direction.
25.2 Show that Equation (25.7) is the solution of Equation (25.6).
25.3 AdaBoost as a Forward Greedy Selection Algorithm: Recall the AdaBoost algorithm
from Chapter 10. In this section we give another interpretation of AdaBoost as a
forward greedy selection algorithm.
Given a set of $m$ instances $x_1, \ldots, x_m$, and a hypothesis class $\mathcal{H}$ of finite VC
dimension, show that there exist $d$ and $h_1, \ldots, h_d$ such that for every $h \in \mathcal{H}$
there exists $i \in [d]$ with $h_i(x_j) = h(x_j)$ for every $j \in [m]$.
Let $R(\mathbf{w})$ be as defined in Equation (25.3). Given some $\mathbf{w}$, define $f_{\mathbf{w}}$ to be the
function
$$f_{\mathbf{w}}(\cdot) = \sum_{i=1}^{d} w_i h_i(\cdot).$$
Let $D$ be the distribution over $[m]$ defined by
$$D_i = \frac{\exp(-y_i f_{\mathbf{w}}(x_i))}{Z},$$
where $Z$ is a normalization factor that ensures that $D$ is a probability vector.
Show that
$$\frac{\partial R(\mathbf{w})}{\partial w_j} = -\sum_{i=1}^{m} D_i y_i h_j(x_i).$$
Furthermore, denoting $\epsilon_j = \sum_{i=1}^{m} D_i \mathbb{1}_{[h_j(x_i) \neq y_i]}$, show that
$$\frac{\partial R(\mathbf{w})}{\partial w_j} = 2\epsilon_j - 1.$$
Conclude that if $\epsilon_j \leq 1/2 - \gamma$ then
$$\left|\frac{\partial R(\mathbf{w})}{\partial w_j}\right| \geq \gamma/2.$$
Show that the update of AdaBoost guarantees
$$R(\mathbf{w}^{(t+1)}) - R(\mathbf{w}^{(t)}) \leq \log\left(\sqrt{1 - 4\gamma^2}\right).$$
Hint: Use the proof of Theorem 10.2.
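The following is a minimal numerical sketch (not part of the original exercise) of the gradient identities in Exercise 25.3. It assumes that Equation (25.3) has the log-of-average-exponential-loss form $R(\mathbf{w}) = \log\bigl(\frac{1}{m}\sum_{i=1}^{m}\exp(-y_i f_{\mathbf{w}}(x_i))\bigr)$; the random hypothesis outputs and the names used below (H, D, eps) are purely illustrative and not taken from the text.

import numpy as np

rng = np.random.default_rng(0)
m, d = 50, 5                               # number of examples and base hypotheses
H = rng.choice([-1.0, 1.0], size=(m, d))   # H[i, j] = h_j(x_i), illustrative random values
y = rng.choice([-1.0, 1.0], size=m)        # labels y_1, ..., y_m
w = rng.normal(size=d)                     # an arbitrary weight vector

def R(w):
    # Assumed form of Equation (25.3): log of the average exponential loss.
    return np.log(np.mean(np.exp(-y * (H @ w))))

# The distribution D over [m]: D_i proportional to exp(-y_i f_w(x_i)).
p = np.exp(-y * (H @ w))
D = p / p.sum()

# Claimed gradient: dR/dw_j = -sum_i D_i y_i h_j(x_i), which should equal 2*eps_j - 1.
grad_claimed = -(D * y) @ H
eps = D @ (H != y[:, None]).astype(float)  # eps_j = sum_i D_i 1[h_j(x_i) != y_i]
print(np.allclose(grad_claimed, 2 * eps - 1))              # expected: True

# Central finite differences as an independent check of dR/dw_j.
delta = 1e-6
grad_numeric = np.array(
    [(R(w + delta * e) - R(w - delta * e)) / (2 * delta) for e in np.eye(d)]
)
print(np.allclose(grad_claimed, grad_numeric, atol=1e-5))  # expected: True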