9.1.2 Perceptron for Halfspaces

A different implementation of the ERM rule is the Perceptron algorithm of Rosenblatt (Rosenblatt 1958). The Perceptron is an iterative algorithm that constructs a sequence of vectors $w^{(1)}, w^{(2)}, \ldots$. Initially, $w^{(1)}$ is set to be the all-zeros vector. At iteration $t$, the Perceptron finds an example $i$ that is mislabeled by $w^{(t)}$, namely, an example for which $\operatorname{sign}(\langle w^{(t)}, x_i \rangle) \neq y_i$. Then, the Perceptron updates $w^{(t)}$ by adding to it the instance $x_i$ scaled by the label $y_i$. That is, $w^{(t+1)} = w^{(t)} + y_i x_i$. Recall that our goal is to have $y_i \langle w, x_i \rangle > 0$ for all $i$, and note that

$$y_i \langle w^{(t+1)}, x_i \rangle = y_i \langle w^{(t)} + y_i x_i, x_i \rangle = y_i \langle w^{(t)}, x_i \rangle + \|x_i\|^2 .$$

Hence, the update of the Perceptron guides the solution to be “more correct” on the $i$'th example.
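
As a concrete check of this identity, the following NumPy snippet (a minimal illustration with arbitrary values, not part of the text) verifies that a single update $w \leftarrow w + y_i x_i$ raises the margin $y_i \langle w, x_i \rangle$ by exactly $\|x_i\|^2$:

    import numpy as np

    # Illustrative values only: one Perceptron update on a single example.
    rng = np.random.default_rng(0)
    w = rng.normal(size=3)
    x_i, y_i = rng.normal(size=3), -1.0

    margin_before = y_i * np.dot(w, x_i)
    w_next = w + y_i * x_i                      # w^(t+1) = w^(t) + y_i x_i
    margin_after = y_i * np.dot(w_next, x_i)

    # Since y_i^2 = 1, the margin grows by exactly ||x_i||^2.
    assert np.isclose(margin_after, margin_before + np.dot(x_i, x_i))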

                       Batch Perceptron

    input: A training set (x_1, y_1), ..., (x_m, y_m)
    initialize: w^(1) = (0, ..., 0)
    for t = 1, 2, ...
      if (∃ i s.t. y_i ⟨w^(t), x_i⟩ ≤ 0) then
        w^(t+1) = w^(t) + y_i x_i
      else
        output w^(t)
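
A direct Python rendering of the pseudocode might look as follows; this is a minimal sketch (the function name batch_perceptron and the NumPy encoding of the sample are illustrative choices, not the book's), assuming labels in {-1, +1} and a separable training set:

    import numpy as np

    def batch_perceptron(X, y):
        """Batch Perceptron: X is an (m, d) array of instances,
        y an (m,) array of labels in {-1, +1}. If the sample is not
        separable, the loop never terminates."""
        w = np.zeros(X.shape[1])                  # w^(1) = (0, ..., 0)
        while True:
            margins = y * (X @ w)                 # y_i <w^(t), x_i> for every i
            mistakes = np.flatnonzero(margins <= 0)
            if mistakes.size == 0:
                return w                          # no violated example: output w^(t)
            i = mistakes[0]                       # any mislabeled example will do
            w = w + y[i] * X[i]                   # w^(t+1) = w^(t) + y_i x_i

    # Usage on a small separable sample in R^2 (homogeneous halfspace).
    X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    w_hat = batch_perceptron(X, y)
    assert np.all(y * (X @ w_hat) > 0)            # every example correctly classified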

The following theorem guarantees that in the realizable case, the algorithm stops with all sample points correctly classified.

Theorem 9.1. Assume that $(x_1, y_1), \ldots, (x_m, y_m)$ is separable, let $B = \min\{\|w\| : \forall i \in [m],\ y_i \langle w, x_i \rangle \geq 1\}$, and let $R = \max_i \|x_i\|$. Then, the Perceptron algorithm stops after at most $(RB)^2$ iterations, and when it stops it holds that $\forall i \in [m],\ y_i \langle w^{(t)}, x_i \rangle > 0$.
Proof. By the definition of the stopping condition, if the Perceptron stops it must have separated all the examples. We will show that if the Perceptron runs for $T$ iterations, then we must have $T \leq (RB)^2$, which implies the Perceptron must stop after at most $(RB)^2$ iterations.

Let $w^{\star}$ be a vector that achieves the minimum in the definition of $B$. That is, $y_i \langle w^{\star}, x_i \rangle \geq 1$ for all $i$, and among all vectors that satisfy these constraints, $w^{\star}$ is of minimal norm.
The idea of the proof is to show that after performing $T$ iterations, the cosine of the angle between $w^{\star}$ and $w^{(T+1)}$ is at least $\frac{\sqrt{T}}{RB}$. That is, we will show that

$$\frac{\langle w^{\star}, w^{(T+1)} \rangle}{\|w^{\star}\|\,\|w^{(T+1)}\|} \;\geq\; \frac{\sqrt{T}}{RB}. \qquad (9.2)$$
By the Cauchy-Schwarz inequality, the left-hand side of Equation (9.2) is at most 1. Therefore, Equation (9.2) would imply that

$$1 \geq \frac{\sqrt{T}}{RB} \;\Rightarrow\; T \leq (RB)^2,$$

which will conclude our proof.
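
To make the quantities in Theorem 9.1 and Equation (9.2) concrete, here is a small numerical illustration on a toy data set of my own (obtaining $w^{\star}$ with scipy.optimize.minimize is just one possible way to solve the hard-margin program defining $B$): it runs the Perceptron, checks inequality (9.2) after every update, and checks the final bound $T \leq (RB)^2$.

    import numpy as np
    from scipy.optimize import minimize

    # Toy separable sample (illustrative only), labels in {-1, +1}.
    X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
    y = np.array([1.0, 1.0, -1.0, -1.0])

    # w_star achieves B = min{ ||w|| : y_i <w, x_i> >= 1 for all i }.
    cons = [{"type": "ineq", "fun": (lambda w, i=i: y[i] * (X[i] @ w) - 1.0)}
            for i in range(len(y))]
    w_star = minimize(lambda w: w @ w, x0=np.ones(X.shape[1]), constraints=cons).x
    B = np.linalg.norm(w_star)
    R = np.linalg.norm(X, axis=1).max()

    # Run the Perceptron, checking (9.2) after each of the T updates.
    w, T = np.zeros(X.shape[1]), 0
    while np.any(y * (X @ w) <= 0):
        i = np.flatnonzero(y * (X @ w) <= 0)[0]
        w, T = w + y[i] * X[i], T + 1
        cosine = (w_star @ w) / (np.linalg.norm(w_star) * np.linalg.norm(w))
        assert cosine >= np.sqrt(T) / (R * B) - 1e-9      # inequality (9.2)

    assert T <= (R * B) ** 2                              # bound of Theorem 9.1
    print(f"stopped after T = {T} updates; (R*B)^2 = {(R * B) ** 2:.2f}")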