The basic idea is to reduce the problem to the case of binary features as follows.
Let x_1, ..., x_m be the instances of the training set. For each real-valued feature i,
sort the instances so that x_{1,i} ≤ ··· ≤ x_{m,i}. Define a set of thresholds θ_{0,i}, ..., θ_{m+1,i}
such that θ_{j,i} ∈ (x_{j,i}, x_{j+1,i}) (where we use the convention x_{0,i} = −∞ and x_{m+1,i} = ∞).
Finally, for each i and j we define the binary feature 1_[x_i < θ_{j,i}]. Once we have
constructed these binary features, we can run the ID3 procedure described in the
previous section. It is easy to verify that for any decision tree with threshold-based
splitting rules over the original real-valued features there exists a decision tree over
the constructed binary features with the same training error and the same number
of nodes.
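To make the reduction concrete, the following Python sketch (not from the book; the helper names threshold_candidates and binarize are illustrative) places one threshold strictly between every pair of consecutive sorted values, plus sentinels below the minimum and above the maximum, and then materializes the binary features 1_[x_i < θ_{j,i}]. Ties between equal feature values are not treated specially here.

import numpy as np

def threshold_candidates(X):
    """For each real-valued feature, return candidate thresholds theta_{j,i}
    placed between consecutive sorted values, plus two sentinel thresholds."""
    m, d = X.shape
    thresholds = []
    for i in range(d):
        v = np.sort(X[:, i])
        mids = (v[:-1] + v[1:]) / 2.0          # points in (x_{j,i}, x_{j+1,i})
        thresholds.append(np.concatenate(([v[0] - 1.0], mids, [v[-1] + 1.0])))
    return thresholds

def binarize(X, thresholds):
    """Construct the binary features 1[x_i < theta_{j,i}] for every i and j."""
    cols = []
    for i, th in enumerate(thresholds):
        for theta in th:
            cols.append((X[:, i] < theta).astype(int))
    return np.column_stack(cols)               # roughly d*m binary columns

Any ID3 implementation for binary features can then be run on the output of binarize.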
If the original number of real-valued features is d and the number of examples
is m, then the number of constructed binary features becomes dm. Calculating the
Gain of each feature might therefore take O(dm²) operations. However, using a
more clever implementation, the runtime can be reduced to O(dm log(m)). The
idea is similar to the implementation of ERM for decision stumps as described in
Section 10.1.1.
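The following sketch (in Python, using information gain based on binary entropy as the Gain measure; the function name best_gain_for_feature is illustrative, not the book's) shows the single-sort-and-sweep idea: after sorting one feature once, the label counts on either side of the threshold are updated in constant time per candidate, giving O(m log(m)) per feature and O(dm log(m)) overall.

import numpy as np

def entropy(p):
    # Binary entropy, with the convention that it equals 0 at p in {0, 1}.
    if p == 0.0 or p == 1.0:
        return 0.0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def best_gain_for_feature(x, y):
    """Best information gain over all thresholds of one real-valued feature.
    x: feature values, shape (m,); y: labels in {0, 1}, shape (m,)."""
    order = np.argsort(x)                      # O(m log m)
    x, y = x[order], y[order]
    m = len(y)
    total_pos = int(y.sum())
    root_entropy = entropy(total_pos / m)
    left_pos, best = 0, 0.0
    for j in range(1, m):                      # threshold between x[j-1] and x[j]
        left_pos += int(y[j - 1])              # O(1) update of the left-side counts
        if x[j - 1] == x[j]:                   # no valid threshold between equal values
            continue
        right_pos = total_pos - left_pos
        gain = (root_entropy
                - (j / m) * entropy(left_pos / j)
                - ((m - j) / m) * entropy(right_pos / (m - j)))
        best = max(best, gain)
    return best

Repeating this scan for each of the d features gives the O(dm log(m)) total runtime.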
18.3 RANDOM FORESTS
As mentioned before, the class of decision trees of arbitrary size has infinite VC
dimension. We therefore restricted the size of the decision tree. Another way to
reduce the danger of overfitting is by constructing an ensemble of trees. In par-
ticular, in the following we describe the method of random forests, introduced by
Breiman (2001).
A random forest is a classifier consisting of a collection of decision trees, where
each tree is constructed by applying an algorithm A on the training set S and an
additional random vector, θ, where θ is sampled i.i.d. from some distribution. The
prediction of the random forest is obtained by a majority vote over the predictions
of the individual trees.
To specify a particular random forest, we need to define the algorithm A and
the distribution over θ. There are many ways to do this and here we describe one
particular option. We generate θ as follows. First, we take a random subsample
from S with replacement; namely, we sample a new training set S′ of size m using
the uniform distribution over S. Second, we construct a sequence I_1, I_2, ..., where
each I_t is a subset of [d] of size k, which is generated by sampling uniformly at
random elements from [d]. All these random variables form the vector θ. Then,
the algorithm A grows a decision tree (e.g., using the ID3 algorithm) based on the
sample S′, where at each splitting stage of the algorithm, the algorithm is restricted
to choosing a feature that maximizes Gain from the set I_t. Intuitively, if k is small,
this restriction may prevent overfitting.
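A minimal Python sketch of this construction is given below. The tree learner grow_tree, an ID3-style procedure that at each split restricts the Gain maximization to a fresh random subset I_t of k feature indices, is assumed rather than implemented, and each learned tree is treated as a callable returning a label in {0, 1}; all names here are illustrative, not the book's.

import numpy as np

def train_random_forest(X, y, num_trees, k, grow_tree, seed=None):
    """Grow num_trees trees, each on a bootstrap sample of (X, y), with the
    size-k feature-subset restriction applied inside grow_tree (assumed given)."""
    rng = np.random.default_rng(seed)
    m = len(y)
    forest = []
    for _ in range(num_trees):
        idx = rng.integers(0, m, size=m)       # sample m indices from S with replacement
        forest.append(grow_tree(X[idx], y[idx], k, rng))
    return forest

def predict(forest, x):
    # Majority vote over the individual trees' predictions (labels in {0, 1}).
    votes = sum(tree(x) for tree in forest)
    return 1 if 2 * votes > len(forest) else 0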
18.4 SUMMARY
Decision trees are very intuitive predictors. Typically, if a human programmer
creates a predictor, it will look like a decision tree. We have shown that the VC
dimension of decision trees with k leaves is k and proposed the MDL paradigm for