
Random Forest


                    │ └──[Wind=Strong]
                    │   └── [Play=No]
                    ├── [Season=Spring]
                    │ ├── [Temperature=Cold]
                    │ │ └── [Play=Yes]
                    │ └── [Temperature=Warm]
                    │   └── [Play=Yes]
                    ├── [Season=Winter]
                    │ └── [Play=No]
                    └── [Season=Summer]
                      └── [Play=Yes]
                The total number of trees in the random forest=4.
                The maximum number of the variables considered at the node is m=4.
                Classification
                Feature: ['Warm', 'Strong', 'Spring', '?']
                Tree 0 votes for the class: No
                Tree 1 votes for the class: Yes
                Tree 2 votes for the class: Yes
                Tree 3 votes for the class: Yes
                The class with the maximum number of votes is 'Yes'. Thus the constructed
                random forest classifies the feature ['Warm', 'Strong', 'Spring', '?'] into
                the class 'Yes'.
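
                The final decision above is simply a majority vote over the individual
                tree votes. The following is a minimal sketch of that step, assuming the
                votes are collected in a plain Python list (the variable names are
                illustrative, not the book's implementation):

                    from collections import Counter

                    # Votes cast by the four random decision trees for the feature
                    # ['Warm', 'Strong', 'Spring', '?'], taken from the output above.
                    tree_votes = ['No', 'Yes', 'Yes', 'Yes']

                    # Count how many trees voted for each class and pick the most common one.
                    vote_counts = Counter(tree_votes)
                    winning_class, number_of_votes = vote_counts.most_common(1)[0]

                    print("The class with the maximum number of votes is '%s' (%d out of %d trees)."
                          % (winning_class, number_of_votes, len(tree_votes)))
                    # Prints: The class with the maximum number of votes is 'Yes' (3 out of 4 trees).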

                   2.  When we construct a tree in a random forest, we train it on only a random
                       sample of the data, drawn with replacement (a bootstrap sample). This
                       reduces the bias of the classifier towards particular features. A single
                       tree, however, may still happen to be biased towards the features in its
                       sample and may miss features that are important for an accurate
                       classification, so a random forest classifier with only one decision tree
                       would likely classify very poorly. We should therefore construct more
                       decision trees in the forest to benefit from the reduction of both bias
                       and variance in the classification (see the sampling sketch after answer 3).
                   3.  During cross-validation, we divide the data into training data and test
                       data. The training data is used to train the classifier, and the test data
                       is used to evaluate which parameters or methods best improve the
                       classification. Another advantage of cross-validation is the reduction of
                       bias: because the classifier is trained on only part of the data, the
                       chance of overfitting to that specific dataset decreases.

                          A random forest, however, addresses the problems that cross-validation
                          addresses in an alternative way. Each random decision tree is constructed
                          on only a subset of the data, which reduces the chance of overfitting. In
                          the end, the classification combines the results of all of these trees:
                          the best decision is made not by tuning parameters on a test dataset, but
                          by taking the majority vote of the trees, whose individual biases tend to
                          cancel out. Minimal sketches of both approaches follow.
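
                To illustrate the sampling with replacement from answer 2, here is a
                minimal sketch of how each tree in a random forest could receive its own
                bootstrap sample of the data. The toy data set and the helper name
                bootstrap_sample are illustrative, not the book's implementation:

                    import random

                    # A toy data set: [Temperature, Wind, Season, Play].
                    data = [
                        ['Warm', 'Strong', 'Spring', 'No'],
                        ['Cold', 'Weak',   'Winter', 'No'],
                        ['Warm', 'Weak',   'Summer', 'Yes'],
                        ['Cold', 'Strong', 'Spring', 'Yes'],
                        ['Warm', 'Weak',   'Spring', 'Yes'],
                    ]

                    def bootstrap_sample(data):
                        # Draw len(data) items with replacement: some items repeat and
                        # others are left out, which is what de-correlates the trees.
                        return [random.choice(data) for _ in range(len(data))]

                    number_of_trees = 4
                    for tree_index in range(number_of_trees):
                        sample = bootstrap_sample(data)
                        # A real random forest would now grow a decision tree on this sample,
                        # also restricting each node to a random subset of the variables.
                        print("Tree", tree_index, "is trained on:", sample)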
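
                For contrast with answer 3, the following is a minimal sketch of the plain
                training/testing split that cross-validation relies on, again with
                illustrative names rather than the book's code:

                    import random

                    def train_test_split(data, test_fraction=0.25):
                        # Shuffle a copy of the data and cut off the last part as the test set.
                        shuffled = data[:]
                        random.shuffle(shuffled)
                        split_point = int(len(shuffled) * (1 - test_fraction))
                        return shuffled[:split_point], shuffled[split_point:]

                    data = list(range(20))  # stand-in for real feature rows
                    training_data, test_data = train_test_split(data)

                    # The classifier would be trained on training_data and its parameters
                    # evaluated on test_data; a random forest skips this tuning step and
                    # relies instead on the majority vote of trees grown on bootstrap samples.
                    print(len(training_data), "training items,", len(test_data), "test items")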


