17.1 One-versus-All and All-Pairs
to learn a function h : X → Y. Without loss of generality let us denote Y = {1, ..., k}.
In the One-versus-All method (a.k.a. One-versus-Rest) we train k binary classifiers, each of which discriminates between one class and the rest of the classes. That is, given a training set S = (x_1, y_1), ..., (x_m, y_m), where every y_i is in Y, we construct k binary training sets, S_1, ..., S_k, where S_i = (x_1, (−1)^{1[y_1 ≠ i]}), ..., (x_m, (−1)^{1[y_m ≠ i]}). In words, S_i is the set of instances labeled 1 if their label in S was i, and −1 otherwise. For every i ∈ [k] we train a binary predictor h_i : X → {±1} based on S_i, hoping that h_i(x) should equal 1 if and only if x belongs to class i. Then, given h_1, ..., h_k, we construct a multiclass predictor using the rule

h(x) ∈ argmax_{i∈[k]} h_i(x).    (17.1)
When more than one binary hypothesis predicts “1” we should somehow decide which class to predict (e.g., we can arbitrarily decide to break ties by taking the minimal index in argmax_i h_i(x)). A better approach can be applied whenever each h_i hides additional information, which can be interpreted as the confidence in the prediction y = i. For example, this is the case in halfspaces, where the actual prediction is sign(⟨w, x⟩), but we can interpret ⟨w, x⟩ as the confidence in the prediction. In such cases, we can apply the multiclass rule given in Equation (17.1) to the real-valued predictions. A pseudocode of the One-versus-All approach is given in the following.
One-versus-All

input:
  training set S = (x_1, y_1), ..., (x_m, y_m)
  algorithm for binary classification A
foreach i ∈ Y
  let S_i = (x_1, (−1)^{1[y_1 ≠ i]}), ..., (x_m, (−1)^{1[y_m ≠ i]})
  let h_i = A(S_i)
output:
  the multiclass hypothesis defined by h(x) ∈ argmax_{i∈Y} h_i(x)
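
To make this concrete, the following is a minimal Python sketch of One-versus-All. It assumes scikit-learn's LinearSVC as the binary learner A (an illustrative choice, not one made by the text) and uses the real-valued margins ⟨w, x⟩ as confidences, as discussed above.

import numpy as np
from sklearn.svm import LinearSVC

def one_versus_all_train(X, y, classes):
    # For each class i, relabel examples as +1 if y_j = i and -1 otherwise,
    # then train one binary predictor h_i on the relabeled set S_i.
    predictors = {}
    for i in classes:
        y_i = np.where(y == i, 1, -1)
        predictors[i] = LinearSVC().fit(X, y_i)
    return predictors

def one_versus_all_predict(predictors, X):
    # Predict the class whose binary predictor is most confident, i.e.,
    # apply h(x) in argmax_i h_i(x) to the real-valued margins.
    classes = list(predictors)
    scores = np.column_stack([predictors[i].decision_function(X) for i in classes])
    return np.asarray(classes)[np.argmax(scores, axis=1)]

Using decision_function here implements the confidence-based variant of Equation (17.1); replacing it with predict recovers the plain ±1 voting rule, together with the tie-breaking issue noted above.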
Another popular reduction is the All-Pairs approach, in which all pairs of classes
are compared to each other. Formally, given a training set S = (x_1, y_1), ..., (x_m, y_m), where every y_i is in [k], for every 1 ≤ i < j ≤ k we construct a binary training sequence, S_{i,j}, containing all examples from S whose label is either i or j. For each such example, we set the binary label in S_{i,j} to be +1 if the multiclass label in S is i and −1 if the multiclass label in S is j. Next, we train a binary classification algorithm based on every S_{i,j} to get h_{i,j}. Finally, we construct a multiclass classifier by predicting the class that had the highest number of “wins.” A pseudocode of the All-Pairs approach is given in the following.
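
As a companion to that pseudocode, here is a minimal Python sketch of All-Pairs in the same style, again assuming scikit-learn's LinearSVC as the binary learner (an illustrative choice); np.argmax breaks ties toward the smaller class index, one arbitrary but consistent convention.

import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

def all_pairs_train(X, y, classes):
    # For every pair i < j, keep only the examples labeled i or j,
    # relabel them +1 (for i) or -1 (for j), and train h_{i,j} on S_{i,j}.
    predictors = {}
    for i, j in combinations(sorted(classes), 2):
        mask = (y == i) | (y == j)
        y_ij = np.where(y[mask] == i, 1, -1)
        predictors[(i, j)] = LinearSVC().fit(X[mask], y_ij)
    return predictors

def all_pairs_predict(predictors, X, classes):
    # Each h_{i,j} casts a "win" for class i (output +1) or class j (output -1);
    # the multiclass prediction is the class with the highest number of wins.
    classes = sorted(classes)
    col = {c: t for t, c in enumerate(classes)}
    wins = np.zeros((len(X), len(classes)))
    for (i, j), h in predictors.items():
        pred = h.predict(X)
        wins[pred == 1, col[i]] += 1
        wins[pred == -1, col[j]] += 1
    return np.asarray(classes)[np.argmax(wins, axis=1)]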