              will describe the logistic regression approach, which can be implemented efficiently
              even in the nonseparable case. We will study surrogate loss functions in more detail
              later on, in Chapter 12.


              9.1.1 Linear Programming for the Class of Halfspaces
              Linear programs (LP) are problems that can be expressed as maximizing a linear
              function subject to linear inequalities. That is,

              $$\max_{w \in \mathbb{R}^d} \ \langle u, w \rangle \quad \text{subject to} \quad Aw \ge v,$$

              where $w \in \mathbb{R}^d$ is the vector of variables we wish to determine, $A$ is an $m \times d$ matrix,
              and $v \in \mathbb{R}^m$, $u \in \mathbb{R}^d$ are vectors. Linear programs can be solved efficiently,$^1$ and
              furthermore, there are publicly available implementations of LP solvers.
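              As a concrete illustration (not part of the original text), here is a minimal sketch that passes an LP of this form to SciPy's linprog, one publicly available solver. Since linprog minimizes and uses $\le$ constraints, the problem $\max \langle u, w \rangle$ s.t. $Aw \ge v$ is handed over as $\min \langle -u, w \rangle$ s.t. $-Aw \le -v$; the matrices and vectors below are made-up toy values.

```python
import numpy as np
from scipy.optimize import linprog

# Toy problem data (illustrative only): max <u, w> subject to A w >= v.
A = np.array([[1.0, 2.0],
              [3.0, 1.0]])              # m x d constraint matrix
v = np.array([1.0, 2.0])                # right-hand side, length m
u = np.array([-1.0, -1.0])              # linear objective, length d

# linprog minimizes c^T w subject to A_ub @ w <= b_ub, so we negate both.
res = linprog(c=-u, A_ub=-A, b_ub=-v,
              bounds=[(None, None)] * A.shape[1],  # w is a free variable
              method="highs")
print(res.x)                            # a maximizer of <u, w> over {w : A w >= v}
```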
                 We will show that the ERM problem for halfspaces in the realizable case can be
              expressed as a linear program. For simplicity, we assume the homogenous case. Let
              $S = \{(\mathbf{x}_i, y_i)\}_{i=1}^{m}$ be a training set of size $m$. Since we assume the realizable case, an
              ERM predictor should have zero errors on the training set. That is, we are looking
              for some vector $w \in \mathbb{R}^d$ for which

              $$\operatorname{sign}(\langle w, \mathbf{x}_i \rangle) = y_i, \quad \forall i = 1,\ldots,m.$$

              Equivalently, we are looking for some vector $w$ for which

              $$y_i \langle w, \mathbf{x}_i \rangle > 0, \quad \forall i = 1,\ldots,m.$$

              Let $w^*$ be a vector that satisfies this condition (it must exist since we assume real-
              izability). Define $\gamma = \min_i \, (y_i \langle w^*, \mathbf{x}_i \rangle)$ and let $\bar{w} = \frac{w^*}{\gamma}$. Therefore, for all $i$ we
              have

              $$y_i \langle \bar{w}, \mathbf{x}_i \rangle = \frac{1}{\gamma}\, y_i \langle w^*, \mathbf{x}_i \rangle \ge 1.$$

              We have thus shown that there exists a vector that satisfies

              $$y_i \langle w, \mathbf{x}_i \rangle \ge 1, \quad \forall i = 1,\ldots,m. \tag{9.1}$$

              And clearly, such a vector is an ERM predictor.
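              As a quick numerical illustration of this rescaling (a made-up example, not from the text), the snippet below takes a small separable sample and some separating $w^*$, computes $\gamma$, and checks that $\bar{w} = w^*/\gamma$ attains margin at least 1 on every example.

```python
import numpy as np

# Made-up separable data and a separating vector w*.
X = np.array([[ 2.0,  1.0],
              [ 1.0,  3.0],
              [-1.0, -1.0],
              [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w_star = np.array([1.0, 1.0])     # satisfies y_i <w*, x_i> > 0 for all i
margins = y * (X @ w_star)        # the values y_i <w*, x_i>
gamma = margins.min()             # gamma = min_i y_i <w*, x_i>
w_bar = w_star / gamma            # the rescaled vector

print(y * (X @ w_bar) >= 1)       # all True: y_i <w_bar, x_i> >= 1
```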
                 To find a vector that satisfies Equation (9.1) we can rely on an LP solver as
              follows. Set $A$ to be the $m \times d$ matrix whose rows are the instances multiplied by $y_i$.
              That is, $A_{i,j} = y_i\, x_{i,j}$, where $x_{i,j}$ is the $j$'th element of the vector $\mathbf{x}_i$. Let $v$ be the
              vector $(1,\ldots,1) \in \mathbb{R}^m$. Then, Equation (9.1) can be rewritten as

              $$Aw \ge v.$$

                 The LP form requires a maximization objective, yet all the $w$ that satisfy the
              constraints are equally good candidates as output hypotheses. Thus, we set a "dummy"
              objective, $u = (0,\ldots,0) \in \mathbb{R}^d$.
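              Putting the pieces together, a minimal sketch of this reduction (using SciPy's linprog as one possible LP solver, on made-up toy data) could look as follows. As before, $Aw \ge v$ is passed to the solver as $-Aw \le -v$, and the all-zeros vector plays the role of the dummy objective $u$.

```python
import numpy as np
from scipy.optimize import linprog

# A small, made-up linearly separable training set S = {(x_i, y_i)}.
X = np.array([[ 2.0,  1.0],
              [ 1.0,  3.0],
              [-1.0, -1.0],
              [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, d = X.shape

A = y[:, None] * X               # A_{i,j} = y_i * x_{i,j}
v = np.ones(m)                   # the vector (1,...,1)
u = np.zeros(d)                  # the "dummy" objective

# linprog minimizes c^T w subject to A_ub @ w <= b_ub,
# so the constraint A w >= v is passed as -A w <= -v.
res = linprog(c=u, A_ub=-A, b_ub=-v,
              bounds=[(None, None)] * d,   # each coordinate of w is free
              method="highs")
w = res.x

print(np.sign(X @ w) == y)       # all True: the returned halfspace is an ERM predictor
```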
              $^1$ Namely, in time polynomial in $m$, $d$, and in the representation size of real numbers.