
17.1 One-versus-All and All-Pairs


to learn a function h : X → Y. Without loss of generality let us denote
Y = {1, ..., k}.

In the One-versus-All method (a.k.a. One-versus-Rest) we train k binary
classifiers, each of which discriminates between one class and the rest of the
classes. That is, given a training set S = (x_1, y_1), ..., (x_m, y_m), where
every y_i is in Y, we construct k binary training sets, S_1, ..., S_k, where
S_i = (x_1, (−1)^{𝟙[y_1 ≠ i]}), ..., (x_m, (−1)^{𝟙[y_m ≠ i]}). In words, S_i is
the set of instances labeled 1 if their label in S was i, and −1 otherwise. For
every i ∈ [k] we train a binary predictor h_i : X → {±1} based on S_i, hoping
that h_i(x) should equal 1 if and only if x belongs to class i. Then, given
h_1, ..., h_k, we construct a multiclass predictor using the rule


                     h(x) ∈ argmax_{i ∈ [k]} h_i(x).                  (17.1)

When more than one binary hypothesis predicts “1” we should somehow decide
which class to predict (e.g., we can arbitrarily decide to break ties by taking
the minimal index in argmax_i h_i(x)). A better approach can be applied whenever
each h_i hides additional information, which can be interpreted as the
confidence in the prediction y = i. For example, this is the case for
halfspaces, where the actual prediction is sign(⟨w, x⟩), but we can interpret
⟨w, x⟩ as the confidence in the prediction. In such cases, we can apply the
multiclass rule given in Equation (17.1) to the real-valued predictions. A
pseudocode of the One-versus-All approach is given in the following.



                              One-versus-All

         input:
           training set S = (x_1, y_1), ..., (x_m, y_m)
           algorithm for binary classification A
         foreach i ∈ Y
           let S_i = (x_1, (−1)^{𝟙[y_1 ≠ i]}), ..., (x_m, (−1)^{𝟙[y_m ≠ i]})
           let h_i = A(S_i)
         output:
           the multiclass hypothesis defined by h(x) ∈ argmax_{i ∈ Y} h_i(x)
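
To make the reduction concrete, here is a minimal Python sketch of
One-versus-All (the function names are ours, not the book's). It assumes
scikit-learn's LogisticRegression as the binary learning algorithm A; any
binary learner exposing a real-valued score would serve equally well. Following
the discussion above, prediction applies the rule of Equation (17.1) to the
real-valued confidences ⟨w, x⟩ rather than to the hard ±1 outputs.

import numpy as np
from sklearn.linear_model import LogisticRegression

def one_versus_all_train(X, y, A=LogisticRegression):
    # Train one binary predictor h_i per class on the relabeled set S_i,
    # whose binary label is +1 iff the multiclass label equals i,
    # i.e. (-1)^{1[y_j != i]}.
    classes = np.unique(y)
    predictors = {}
    for i in classes:
        y_i = np.where(y == i, 1, -1)    # build S_i
        predictors[i] = A().fit(X, y_i)  # h_i = A(S_i)
    return classes, predictors

def one_versus_all_predict(X, classes, predictors):
    # h(x) in argmax_i h_i(x), computed over real-valued confidences.
    scores = np.column_stack(
        [predictors[i].decision_function(X) for i in classes])
    return classes[np.argmax(scores, axis=1)]

Note that np.argmax returns the smallest index among the maximizers, which
implements exactly the arbitrary tie-breaking rule mentioned above.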


Another popular reduction is the All-Pairs approach, in which all pairs of
classes are compared to each other. Formally, given a training set
S = (x_1, y_1), ..., (x_m, y_m), where every y_i is in [k], for every
1 ≤ i < j ≤ k we construct a binary training sequence, S_{i,j}, containing all
examples from S whose label is either i or j. For each such example, we set the
binary label in S_{i,j} to be +1 if the multiclass label in S is i and −1 if
the multiclass label in S is j. Next, we train a binary classification
algorithm based on every S_{i,j} to get h_{i,j}. Finally, we construct a
multiclass classifier by predicting the class that had the highest number of
“wins.” A pseudocode of the All-Pairs approach is given in the following.
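
In the same spirit, a hedged Python sketch of the All-Pairs reduction (again
with our own helper names and scikit-learn's LogisticRegression standing in
for the binary learner) might look as follows.

import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

def all_pairs_train(X, y, A=LogisticRegression):
    # For every pair i < j, train h_{i,j} on S_{i,j}: the examples whose
    # label is i (relabeled +1) or j (relabeled -1).
    classes = np.unique(y)
    predictors = {}
    for i, j in combinations(classes, 2):
        mask = (y == i) | (y == j)
        y_ij = np.where(y[mask] == i, 1, -1)
        predictors[(i, j)] = A().fit(X[mask], y_ij)
    return classes, predictors

def all_pairs_predict(X, classes, predictors):
    # Predict the class with the highest number of pairwise "wins."
    index = {c: t for t, c in enumerate(classes)}
    wins = np.zeros((X.shape[0], len(classes)))
    for (i, j), h in predictors.items():
        pred = h.predict(X)              # +1: i wins, -1: j wins
        wins[:, index[i]] += (pred == 1)
        wins[:, index[j]] += (pred == -1)
    return classes[np.argmax(wins, axis=1)]

Observe that All-Pairs trains k(k−1)/2 predictors rather than k, but each one
sees only the examples of its two classes, so every binary problem is smaller
than the corresponding One-versus-All problem.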