

              statistical approach for finding the parameters that maximize the joint probability of
              a given data set assuming a specific parametric probability function. We will study
              the Maximum Likelihood approach in Chapter 24.
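               As a brief reminder of the idea (the notation here, a generic parametric family
               $\{P_\theta\}$ over i.i.d. samples $x_1, \ldots, x_m$, is illustrative and is not
               taken from this chapter), the maximum likelihood estimator can be written as
               \[
                   \hat{\theta} \in \operatorname*{argmax}_{\theta} \prod_{i=1}^{m} P_\theta(x_i)
                              = \operatorname*{argmax}_{\theta} \sum_{i=1}^{m} \log P_\theta(x_i).
               \]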


              9.4 SUMMARY
               The family of linear predictors is one of the most useful families of hypothesis
               classes, and many learning algorithms that are widely used in practice rely on
               linear predictors. We have shown efficient algorithms for learning linear predictors
               with respect to the zero-one loss in the separable case and with respect to the
               squared and logistic losses in the unrealizable case. In later chapters we will
               present the properties of the loss function that enable efficient learning.
                 Naturally, linear predictors are effective whenever we assume, as prior knowl-
              edge, that some linear predictor attains low risk with respect to the underlying
              distribution. In the next chapter we show how to construct nonlinear predictors by
              composing linear predictors on top of simple classes. This will enable us to employ
              linear predictors for a variety of prior knowledge assumptions.


              9.5 BIBLIOGRAPHIC REMARKS
               The Perceptron algorithm dates back to Rosenblatt (1958). The proof of its
               convergence rate is due to Agmon (1954) and Novikoff (1962). Least Squares
               regression goes back to Gauss (1795), Legendre (1805), and Adrain (1808).


              9.6 EXERCISES

               9.1 Show how to cast the ERM problem of linear regression with respect to the
                  absolute value loss function, $\ell(h,(x, y)) = |h(x) - y|$, as a linear program;
                  namely, show how to write the problem
                  \[
                      \min_{w} \sum_{i=1}^{m} |\langle w, x_i \rangle - y_i|
                  \]
                  as a linear program.
                  Hint: Start with proving that for any $c \in \mathbb{R}$,
                  \[
                      |c| = \min_{a \ge 0} a \quad \text{s.t.} \quad c \le a \ \text{and} \ c \ge -a.
                  \]
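                  For readers who want to experiment, the sketch below encodes this reduction
                  numerically. It assumes NumPy and SciPy are available; the function name and the
                  layout of the decision variables are illustrative choices, not part of the exercise.

                     import numpy as np
                     from scipy.optimize import linprog

                     def absolute_loss_erm(X, y):
                         # Sketch: solve  min_w sum_i |<w, x_i> - y_i|  as a linear program.
                         # Decision variables z = (w_1, ..., w_d, a_1, ..., a_m); each slack a_i
                         # plays the role of the variable a in the hint above.
                         m, d = X.shape
                         c = np.concatenate([np.zeros(d), np.ones(m)])   # objective: sum_i a_i
                         # Encode  <w, x_i> - y_i <= a_i  and  y_i - <w, x_i> <= a_i.
                         A_ub = np.block([[X, -np.eye(m)],
                                          [-X, -np.eye(m)]])
                         b_ub = np.concatenate([y, -y])
                         bounds = [(None, None)] * d + [(0, None)] * m   # w free, a_i >= 0
                         res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
                         return res.x[:d]                                # the ERM weight vector

                  On a small synthetic data set, the returned vector can be compared against the
                  least squares solution; the point of the exercise, of course, is the reduction itself.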

               9.2 Show that the matrix $A$ defined in Equation (9.6) is invertible if and only if
                  $x_1, \ldots, x_m$ span $\mathbb{R}^d$.
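                  As an illustrative numerical sanity check (assuming, as in the least squares
                  derivation, that $A = \sum_{i=1}^{m} x_i x_i^\top$, i.e. $A = X^\top X$ for the
                  $m \times d$ matrix $X$ whose rows are the $x_i$; this reading of Equation (9.6)
                  is an assumption of the sketch):

                     import numpy as np

                     def invertibility_claim_holds(X):
                         # Assumes A = X^T X, i.e. A = sum_i x_i x_i^T (rows of X are the x_i).
                         m, d = X.shape
                         A = X.T @ X
                         spans_rd = np.linalg.matrix_rank(X) == d    # x_1, ..., x_m span R^d
                         invertible = np.linalg.matrix_rank(A) == d  # A is invertible
                         return spans_rd == invertible               # expected to hold always

                  For example, a random Gaussian $5 \times 3$ matrix and the rank-deficient matrix
                  np.ones((5, 3)) both make the function return True.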
               9.3 Show that Theorem 9.1 is tight in the following sense: For any positive integer
                  $m$, there exist a vector $w^* \in \mathbb{R}^d$ (for some appropriate $d$) and a
                  sequence of examples $\{(x_1, y_1), \ldots, (x_m, y_m)\}$ such that the following hold:
                  • $R = \max_i \|x_i\| \le 1$.
                  • $\|w^*\|^2 = m$, and for all $i \le m$, $y_i \langle x_i, w^* \rangle \ge 1$. Note
                    that, using the notation in Theorem 9.1, we therefore get
                    \[
                        B = \min\{\|w\| : \forall i \in [m],\; y_i \langle w, x_i \rangle \ge 1\} \le \sqrt{m}.
                    \]
                    Thus, $(BR)^2 \le m$.
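                  For intuition, one construction that satisfies the three listed conditions (a hint
                  only, deliberately not a full solution) is to take $d = m$ and
                  \[
                      x_i = e_i \in \mathbb{R}^m, \qquad y_i = 1, \qquad w^* = (1, 1, \ldots, 1),
                  \]
                  where $e_i$ denotes the $i$th standard basis vector; then $\|x_i\| = 1$,
                  $y_i \langle x_i, w^* \rangle = 1$, and $\|w^*\|^2 = m$.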