
Support Vector Machines
                 Now suppose that we flip the order of min and max in the equation. This can only
                 decrease the objective value (see Exercise 15.4), and we have
$$
\min_{w}\ \max_{\alpha\in\mathbb{R}^m:\,\alpha\ge 0}\ \left(\frac{1}{2}\|w\|^{2}+\sum_{i=1}^{m}\alpha_i\bigl(1-y_i\langle w,x_i\rangle\bigr)\right)
\;\ge\;
\max_{\alpha\in\mathbb{R}^m:\,\alpha\ge 0}\ \min_{w}\ \left(\frac{1}{2}\|w\|^{2}+\sum_{i=1}^{m}\alpha_i\bigl(1-y_i\langle w,x_i\rangle\bigr)\right).
$$
                 The preceding inequality is called weak duality. It turns out that in our case, strong
                 duality also holds; namely, the inequality holds with equality. Therefore, the dual
                 problem is
$$
\max_{\alpha\in\mathbb{R}^m:\,\alpha\ge 0}\ \min_{w}\ \left(\frac{1}{2}\|w\|^{2}+\sum_{i=1}^{m}\alpha_i\bigl(1-y_i\langle w,x_i\rangle\bigr)\right). \tag{15.9}
$$
                 We can simplify the dual problem by noting that once α is fixed, the optimization
                 problem with respect to w is unconstrained and the objective is differentiable; thus,
                 at the optimum, the gradient equals zero:
$$
w-\sum_{i=1}^{m}\alpha_i y_i x_i=0 \;\Longrightarrow\; w=\sum_{i=1}^{m}\alpha_i y_i x_i .
$$
                 This shows us that the solution must be in the linear span of the examples, a
                 fact we will use later to derive SVM with kernels. Plugging the preceding into
                 Equation (15.9) we obtain that the dual problem can be rewritten as
                                                                        
$$
\max_{\alpha\in\mathbb{R}^m:\,\alpha\ge 0}\ \left(\frac{1}{2}\Bigl\|\sum_{i=1}^{m}\alpha_i y_i x_i\Bigr\|^{2}+\sum_{i=1}^{m}\alpha_i\Bigl(1-y_i\Bigl\langle\sum_{j}\alpha_j y_j x_j,\,x_i\Bigr\rangle\Bigr)\right). \tag{15.10}
$$
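To see the rearrangement explicitly, expand the squared norm into a double sum:

$$
\Bigl\|\sum_{i=1}^{m}\alpha_i y_i x_i\Bigr\|^{2}=\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j\langle x_j,x_i\rangle .
$$

The inner-product term in (15.10) produces the same double sum with a factor of $-1$, so the two quadratic contributions combine with coefficient $\tfrac{1}{2}-1=-\tfrac{1}{2}$.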
                 Rearranging yields the dual problem
                                                                      
$$
\max_{\alpha\in\mathbb{R}^m:\,\alpha\ge 0}\ \left(\sum_{i=1}^{m}\alpha_i-\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j\langle x_j,x_i\rangle\right). \tag{15.11}
$$
                 Note that the dual problem only involves inner products between instances and
                 does not require direct access to specific elements within an instance. This prop-
                 erty is important when implementing SVM with kernels, as we will discuss in the
                 next chapter.
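To make this property concrete, the following is a small sketch in Python with NumPy (the names `dual_objective` and `primal_from_dual` are ours, not the book's): it evaluates the objective of (15.11) using only the Gram matrix of pairwise inner products, and recovers w from α via the span representation derived above.

```python
import numpy as np

def dual_objective(alpha, X, y):
    """Objective of (15.11):
    sum_i alpha_i - (1/2) sum_{i,j} alpha_i alpha_j y_i y_j <x_j, x_i>.

    alpha : (m,) nonnegative dual variables
    X     : (m, d) matrix whose rows are the instances x_i
    y     : (m,) labels in {+1, -1}
    """
    G = X @ X.T                 # Gram matrix: G[i, j] = <x_i, x_j>
    a = alpha * y
    return alpha.sum() - 0.5 * (a @ G @ a)

def primal_from_dual(alpha, X, y):
    """Recover w = sum_i alpha_i y_i x_i (the span representation)."""
    return (alpha * y) @ X
```

Because `dual_objective` touches the data only through `G`, replacing `X @ X.T` with any other matrix of pairwise inner products leaves the rest of the computation unchanged; this is exactly the property exploited by kernels in the next chapter.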
                 15.5 IMPLEMENTING SOFT-SVM USING SGD
                 In this section we describe a very simple algorithm for solving the optimization
                 problem of Soft-SVM, namely,
$$
\min_{w}\ \left(\frac{\lambda}{2}\|w\|^{2}+\frac{1}{m}\sum_{i=1}^{m}\max\bigl\{0,\;1-y_i\langle w,x_i\rangle\bigr\}\right). \tag{15.12}
$$
                 We rely on the SGD framework for solving regularized loss minimization problems,
                 as described in Section 14.5.3.
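As a preview, here is a minimal sketch of such a solver in Python with NumPy. The step size $\eta_t = 1/(\lambda t)$ and the averaging of the iterates are the standard choices from that framework; the function name `sgd_soft_svm` and the bookkeeping details are ours, and the pseudocode given below may organize the update differently.

```python
import numpy as np

def sgd_soft_svm(X, y, lam, T, seed=None):
    """SGD for the Soft-SVM objective (15.12).

    Each round samples one example, takes a subgradient of
    (lam/2)*||w||^2 + max{0, 1 - y_i <w, x_i>} at the current
    iterate, and steps with eta_t = 1/(lam * t).  The average
    of the iterates is returned, as in Section 14.5.3.
    """
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(m)            # sample an example uniformly
        eta = 1.0 / (lam * t)          # step size from the SGD analysis
        margin = y[i] * (w @ X[i])     # evaluated at the current iterate
        w = (1.0 - eta * lam) * w      # step along the regularizer gradient
        if margin < 1:                 # hinge term active: add its subgradient part
            w += eta * y[i] * X[i]
        w_sum += w
    return w_sum / T                   # averaged iterate
```

When the constraint $y_i\langle w,x_i\rangle \ge 1$ already holds, the hinge term contributes nothing and the update only shrinks w toward zero; otherwise the sampled example is folded in with weight $\eta_t y_i$. Note that the returned w is therefore, like the dual solution above, a linear combination of the examples.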