   The basic idea is to reduce the problem to the case of binary features as follows.
Let x_1, ..., x_m be the instances of the training set. For each real-valued feature i,
sort the instances so that x_{1,i} ≤ ··· ≤ x_{m,i}. Define a set of thresholds θ_{0,i}, ..., θ_{m+1,i}
such that θ_{j,i} ∈ (x_{j,i}, x_{j+1,i}) (where we use the convention x_{0,i} = −∞ and x_{m+1,i} = ∞).
Finally, for each i and j we define the binary feature 1_[x_i < θ_{j,i}]. Once we have
constructed these binary features, we can run the ID3 procedure described in the
previous section. It is easy to verify that for any decision tree with threshold-based
splitting rules over the original real-valued features there exists a decision tree over
the constructed binary features with the same training error and the same number
of nodes.
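
   To make the reduction concrete, here is a minimal sketch in Python (an illustration,
not the book's code, and assuming NumPy). It builds the binary threshold features for a
training matrix X of shape (m, d), placing one threshold between every pair of consecutive
distinct values of each feature; the trivial thresholds at ±∞ are omitted since they give
constant features.

import numpy as np

def binary_threshold_features(X):
    """Build the indicator features 1_[x_i < θ_{j,i}] from real-valued features.

    X has shape (m, d). The result is a boolean matrix with one column per
    (feature, threshold) pair, thresholds lying strictly between consecutive
    distinct values of each feature.
    """
    m, d = X.shape
    columns = []
    for i in range(d):
        values = np.unique(X[:, i])                  # sorted distinct values
        thresholds = (values[:-1] + values[1:]) / 2  # midpoints between them
        for theta in thresholds:
            columns.append(X[:, i] < theta)
    return np.column_stack(columns) if columns else np.empty((m, 0), dtype=bool)

Any threshold split x_i < θ over the original features partitions the training set in the
same way as one of these boolean columns, which is why the training error and the number
of nodes are preserved.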
   If the original number of real-valued features is d and the number of examples
is m, then the number of constructed binary features becomes dm. Calculating the
Gain of each feature might therefore take O(dm^2) operations. However, using a
more clever implementation, the runtime can be reduced to O(dm log(m)). The
idea is similar to the implementation of ERM for decision stumps as described in
Section 10.1.1.
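
   The following sketch (again an illustration, not the book's implementation) shows the
idea for a single real-valued feature with binary labels: after one O(m log m) sort, a
linear sweep over the candidate thresholds maintains the label counts on each side of the
split, so the information gain of every threshold of that feature is computed in
O(m log m) time; repeating this for all d features gives O(dm log(m)).

import numpy as np

def _entropy(pos, total):
    # binary entropy of a node containing `total` examples, `pos` of them positive
    if total == 0 or pos == 0 or pos == total:
        return 0.0
    p = pos / total
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def best_threshold_gain(x, y):
    """x: real-valued feature values of m examples, y: labels in {0, 1}.
    Returns the largest information gain over all thresholds and that threshold."""
    order = np.argsort(x)                      # O(m log m) sort, done once
    x, y = x[order], y[order]
    m = len(y)
    total_pos = int(y.sum())
    root_entropy = _entropy(total_pos, m)
    left_pos = 0
    best_gain, best_theta = 0.0, None
    for j in range(1, m):                      # O(m) sweep over split positions
        left_pos += int(y[j - 1])              # example j-1 moves to the left side
        if x[j] == x[j - 1]:
            continue                           # no threshold fits between equal values
        split_entropy = (j / m) * _entropy(left_pos, j) \
            + ((m - j) / m) * _entropy(total_pos - left_pos, m - j)
        gain = root_entropy - split_entropy
        if gain > best_gain:
            best_gain, best_theta = gain, (x[j - 1] + x[j]) / 2.0
    return best_gain, best_theta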



              18.3 RANDOM FORESTS
              As mentioned before, the class of decision trees of arbitrary size has infinite VC
              dimension. We therefore restricted the size of the decision tree. Another way to
              reduce the danger of overfitting is by constructing an ensemble of trees. In par-
              ticular, in the following we describe the method of random forests, introduced by
              Breiman (2001).
                 A random forest is a classifier consisting of a collection of decision trees, where
              each tree is constructed by applying an algorithm A on the training set S and an
              additional random vector, θ, where θ is sampled i.i.d. from some distribution. The
              prediction of the random forest is obtained by a majority vote over the predictions
              of the individual trees.
   To specify a particular random forest, we need to define the algorithm A and
the distribution over θ. There are many ways to do this and here we describe one
particular option. We generate θ as follows. First, we take a random subsample
from S with replacement; namely, we sample a new training set S′ of size m using
the uniform distribution over S. Second, we construct a sequence I_1, I_2, ..., where
each I_t is a subset of [d] of size k, which is generated by sampling uniformly at
random elements from [d]. All these random variables form the vector θ. Then,
the algorithm A grows a decision tree (e.g., using the ID3 algorithm) based on the
sample S′, where at each splitting stage of the algorithm, the algorithm is restricted
to choosing a feature that maximizes Gain from the set I_t. Intuitively, if k is small,
this restriction may prevent overfitting.
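
   A compact sketch of this construction is given below. It is an illustration under
stated assumptions rather than the book's algorithm: scikit-learn's DecisionTreeClassifier
(a CART-style grower) stands in for A in place of ID3, its max_features=k option plays the
role of restricting each split to a random feature subset I_t of size k, and the function
names train_random_forest and predict_random_forest are made up for this example.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_random_forest(X, y, num_trees=100, k=None, seed=0):
    rng = np.random.default_rng(seed)
    m, d = X.shape
    if k is None:
        k = max(1, int(np.sqrt(d)))        # a common default for the subset size
    trees = []
    for _ in range(num_trees):
        idx = rng.integers(0, m, size=m)   # bootstrap sample S′ of size m, with replacement
        tree = DecisionTreeClassifier(
            max_features=k,                # each split considers a random subset of k features
            random_state=int(rng.integers(2**31 - 1)),
        )
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_random_forest(trees, X):
    # majority vote over the individual trees (assumes integer class labels 0..C-1)
    votes = np.stack([t.predict(X) for t in trees]).astype(int)
    counts = np.apply_along_axis(np.bincount, 0, votes, minlength=votes.max() + 1)
    return counts.argmax(axis=0)

For example, predict_random_forest(train_random_forest(X, y), X_test) returns the
majority-vote labels for a test matrix X_test.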



              18.4 SUMMARY
              Decision trees are very intuitive predictors. Typically, if a human programmer
              creates a predictor, it will look like a decision tree. We have shown that the VC
              dimension of decision trees with k leaves is k and proposed the MDL paradigm for