9.1.2 Perceptron for Halfspaces
A different implementation of the ERM rule is the Perceptron algorithm of Rosenblatt (Rosenblatt 1958). The Perceptron is an iterative algorithm that constructs a sequence of vectors $w^{(1)}, w^{(2)}, \ldots$. Initially, $w^{(1)}$ is set to be the all-zeros vector. At iteration $t$, the Perceptron finds an example $i$ that is mislabeled by $w^{(t)}$, namely, an example for which $\mathrm{sign}(\langle w^{(t)}, x_i \rangle) \neq y_i$. Then, the Perceptron updates $w^{(t)}$ by adding to it the instance $x_i$ scaled by the label $y_i$. That is, $w^{(t+1)} = w^{(t)} + y_i x_i$. Recall that our goal is to have $y_i \langle w, x_i \rangle > 0$ for all $i$, and note that
$$y_i \langle w^{(t+1)}, x_i \rangle = y_i \langle w^{(t)} + y_i x_i,\, x_i \rangle = y_i \langle w^{(t)}, x_i \rangle + \|x_i\|^2.$$
Hence, the update of the Perceptron guides the solution to be “more correct” on the $i$'th example.
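For concreteness, here is a one-step numerical illustration (the numbers are my own, not from the text). Take $x_i = (1, 2)$ with $y_i = 1$, and suppose $w^{(t)} = (0, 0)$, so that $y_i \langle w^{(t)}, x_i \rangle = 0 \le 0$ and example $i$ is mislabeled. The update gives
$$w^{(t+1)} = w^{(t)} + y_i x_i = (1, 2), \qquad y_i \langle w^{(t+1)}, x_i \rangle = 5 = 0 + \|x_i\|^2,$$
so the margin on example $i$ grows by exactly $\|x_i\|^2$, as the identity above predicts.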
Batch Perceptron
input: a training set $(x_1, y_1), \ldots, (x_m, y_m)$
initialize: $w^{(1)} = (0, \ldots, 0)$
for $t = 1, 2, \ldots$
  if $(\exists\, i \text{ s.t. } y_i \langle w^{(t)}, x_i \rangle \le 0)$ then
    $w^{(t+1)} = w^{(t)} + y_i x_i$
  else
    output $w^{(t)}$
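A direct translation of the pseudocode into NumPy might look as follows. This is a minimal sketch, assuming the data is given as an $m \times d$ matrix `X` and a label vector `y` with entries in $\{\pm 1\}$; the function name and the `max_iter` safeguard are my additions, not part of the book's algorithm (which loops until it finds a separator).

```python
import numpy as np

def batch_perceptron(X, y, max_iter=100_000):
    """Batch Perceptron: find w with y_i * <w, x_i> > 0 for all i,
    assuming the training set is linearly separable."""
    w = np.zeros(X.shape[1])            # w^(1) = (0, ..., 0)
    for _ in range(max_iter):
        margins = y * (X @ w)           # y_i * <w^(t), x_i> for every i
        mistakes = np.flatnonzero(margins <= 0)
        if mistakes.size == 0:          # no mislabeled example: stop
            return w
        i = mistakes[0]                 # any mislabeled example will do
        w = w + y[i] * X[i]             # w^(t+1) = w^(t) + y_i * x_i
    raise RuntimeError("did not converge; is the data separable?")

# Toy usage: two separable points on the x-axis.
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
print(batch_perceptron(X, y))           # prints [1. 0.]
```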
The following theorem guarantees that in the realizable case, the algorithm stops
with all sample points correctly classified.
Theorem 9.1. Assume that $(x_1, y_1), \ldots, (x_m, y_m)$ is separable, let $B = \min\{\|w\| : \forall i \in [m],\; y_i \langle w, x_i \rangle \ge 1\}$, and let $R = \max_i \|x_i\|$. Then, the Perceptron algorithm stops after at most $(RB)^2$ iterations, and when it stops it holds that $\forall i \in [m]$, $y_i \langle w^{(t)}, x_i \rangle > 0$.
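As a quick sanity check of the bound (a toy example of my own, not from the text): take $m = 2$ with $x_1 = (1, 0)$, $y_1 = 1$ and $x_2 = (-1, 0)$, $y_2 = -1$. The vector $w = (1, 0)$ satisfies $y_i \langle w, x_i \rangle = 1 \ge 1$ for both examples and has minimal norm among all such vectors, so $B = 1$; also $R = \max_i \|x_i\| = 1$. The theorem therefore permits at most $(RB)^2 = 1$ iteration, and indeed the Perceptron, starting from $w^{(1)} = (0, 0)$, makes a single update to $w^{(2)} = (1, 0)$ and stops.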
Proof. By the definition of the stopping condition, if the Perceptron stops it must have separated all the examples. We will show that if the Perceptron runs for $T$ iterations, then we must have $T \le (RB)^2$, which implies that the Perceptron must stop after at most $(RB)^2$ iterations.
Let $w^\star$ be a vector that achieves the minimum in the definition of $B$. That is, $y_i \langle w^\star, x_i \rangle \ge 1$ for all $i$, and among all vectors that satisfy these constraints, $w^\star$ is of minimal norm.
The idea of the proof is to show that after performing $T$ iterations, the cosine of the angle between $w^\star$ and $w^{(T+1)}$ is at least $\frac{\sqrt{T}}{RB}$. That is, we will show that
$$\frac{\langle w^\star, w^{(T+1)} \rangle}{\|w^\star\|\, \|w^{(T+1)}\|} \;\ge\; \frac{\sqrt{T}}{RB}. \tag{9.2}$$
By the Cauchy-Schwarz inequality, the left-hand side of Equation (9.2) is at most 1. Therefore, Equation (9.2) would imply that
$$1 \;\ge\; \frac{\sqrt{T}}{RB} \quad\Longrightarrow\quad T \le (RB)^2,$$
which will conclude our proof.