                        SGD for Solving Soft-SVM with Kernels

                        Goal: Solve Equation (16.5)
                        parameter: T
                        Initialize: β^(1) = 0
                        for t = 1, ..., T
                            Let α^(t) = (1/(λt)) β^(t)
                            Choose i uniformly at random from [m]
                            For all j ≠ i set β_j^(t+1) = β_j^(t)
                            If (y_i Σ_{j=1}^m α_j^(t) K(x_j, x_i) < 1)
                                Set β_i^(t+1) = β_i^(t) + y_i
                            Else
                                Set β_i^(t+1) = β_i^(t)
                        Output: w̄ = Σ_{j=1}^m ᾱ_j ψ(x_j),  where  ᾱ = (1/T) Σ_{t=1}^T α^(t)
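
               As a concrete illustration, here is a minimal NumPy sketch of the pseudocode
               above. It is only a sketch: the function names kernel_sgd_soft_svm and
               kernel_predict, the precomputed Gram matrix, and the seed argument are
               implementation choices that do not appear in the text.

               import numpy as np

               def kernel_sgd_soft_svm(X, y, kernel, lam, T, seed=None):
                   """Kernelized SGD for Soft-SVM, following the pseudocode above.

                   X: (m, d) array of training points; y: (m,) labels in {-1, +1};
                   kernel: callable K(x, x') -> float; lam: the regularization
                   parameter λ of Equation (16.5); T: number of iterations.
                   Returns the averaged coefficients ᾱ; the learned predictor is
                   the implicit w̄ = Σ_j ᾱ_j ψ(x_j), evaluated via the kernel.
                   """
                   rng = np.random.default_rng(seed)
                   m = len(X)
                   # Precompute the Gram matrix G[j, i] = K(x_j, x_i); the feature
                   # map ψ is accessed only through these inner products.
                   G = np.array([[kernel(xj, xi) for xi in X] for xj in X])

                   beta = np.zeros(m)
                   alpha_sum = np.zeros(m)
                   for t in range(1, T + 1):
                       alpha = beta / (lam * t)   # α^(t) = (1/(λt)) β^(t)
                       alpha_sum += alpha
                       i = rng.integers(m)        # i drawn uniformly from [m]
                       if y[i] * (alpha @ G[:, i]) < 1:
                           beta[i] += y[i]        # β_i^(t+1) = β_i^(t) + y_i
                       # all other coordinates of β are left unchanged
                   return alpha_sum / T           # ᾱ = (1/T) Σ_t α^(t)

               def kernel_predict(alpha_bar, X_train, kernel, x):
                   """Sign of ⟨w̄, ψ(x)⟩ = Σ_j ᾱ_j K(x_j, x)."""
                   return np.sign(sum(a * kernel(xj, x)
                                      for a, xj in zip(alpha_bar, X_train)))

               Because the algorithm touches the data only through K, the same code runs
               unchanged with any valid kernel, for example
               kernel = lambda a, b: np.exp(-np.linalg.norm(a - b) ** 2)
               for a Gaussian kernel of unit width.
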
                 The following lemma shows that the preceding implementation is equivalent to
              running the SGD procedure described in Section 15.5 on the feature space.
               Lemma 16.3. Let ŵ be the output of the SGD procedure described in Section 15.5,
               when applied on the feature space, and let w̄ = Σ_{j=1}^m ᾱ_j ψ(x_j) be the output of
               applying SGD with kernels. Then w̄ = ŵ.
               Proof. We will show that for every t Equation (16.6) holds (namely, that
               θ^(t) = Σ_{j=1}^m β_j^(t) ψ(x_j)), where θ^(t) is the result of running the SGD procedure
               described in Section 15.5 in the feature space. By the definition of α^(t) = (1/(λt)) β^(t)
               and w^(t) = (1/(λt)) θ^(t), this claim implies that Equation (16.7) (namely, that
               w^(t) = Σ_{j=1}^m α_j^(t) ψ(x_j)) also holds, and the proof of our lemma will follow. To prove
               that Equation (16.6) holds we use a simple inductive argument. For t = 1 the claim trivially
               holds. Assume it holds for t ≥ 1. Then,
                       y_i ⟨w^(t), ψ(x_i)⟩ = y_i ⟨Σ_j α_j^(t) ψ(x_j), ψ(x_i)⟩ = y_i Σ_{j=1}^m α_j^(t) K(x_j, x_i).
               Hence, the conditions in the two algorithms are equivalent, and if we update θ we have
                       θ^(t+1) = θ^(t) + y_i ψ(x_i) = Σ_{j=1}^m β_j^(t) ψ(x_j) + y_i ψ(x_i) = Σ_{j=1}^m β_j^(t+1) ψ(x_j),
              which concludes our proof.
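
               The equivalence can also be checked numerically. The sketch below is an
               illustrative experiment, not part of the text: it runs the feature-space SGD of
               Section 15.5 and the kernelized version in lockstep, sharing the random index i
               at every step, and verifies the invariants above. The degree-2 polynomial
               feature map, the random data, and the parameters m, lam, T are arbitrary choices.

               import numpy as np

               # Check of Lemma 16.3 with the explicit map ψ(x) = (x_1², √2 x_1 x_2, x_2²),
               # whose kernel is K(x, x') = ⟨x, x'⟩².  At every step we assert that
               # w^(t) = Σ_j α_j^(t) ψ(x_j) and that both margin tests agree.
               rng = np.random.default_rng(0)
               m, lam, T = 20, 0.1, 300
               X = rng.normal(size=(m, 2))
               y = np.where(rng.normal(size=m) > 0, 1.0, -1.0)

               Psi = np.column_stack(
                   [X[:, 0] ** 2, np.sqrt(2) * X[:, 0] * X[:, 1], X[:, 1] ** 2])
               G = (X @ X.T) ** 2              # Gram matrix of the polynomial kernel

               theta = np.zeros(Psi.shape[1])  # θ^(t), the feature-space iterate
               beta = np.zeros(m)              # β^(t), the kernel coefficients
               for t in range(1, T + 1):
                   w = theta / (lam * t)       # w^(t) = (1/(λt)) θ^(t)
                   alpha = beta / (lam * t)    # α^(t) = (1/(λt)) β^(t)
                   assert np.allclose(w, Psi.T @ alpha)   # Equation (16.7)
                   i = rng.integers(m)         # one shared random index per step
                   assert np.isclose(y[i] * (w @ Psi[i]), y[i] * (alpha @ G[:, i]))
                   if y[i] * (w @ Psi[i]) < 1:  # identical condition in both algorithms
                       theta += y[i] * Psi[i]
                       beta[i] += y[i]
               print("Lemma 16.3 invariants held for all", T, "iterations")
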
              16.4 SUMMARY
              Mappings from the given domain to some higher dimensional space, on which a
              halfspace predictor is used, can be highly powerful. We benefit from a rich and
               complex hypothesis class, yet need to solve the problems of high sample and
               computational complexities. In Chapter 10, we discussed the AdaBoost algorithm, which
              faces these challenges by using a weak learner: Even though we’re in a very high
              dimensional space, we have an “oracle” that bestows on us a single good coordinate
              to work with on each iteration. In this chapter we introduced a different approach,