Swim preference - analysis with random forest
We will use the example from the previous chapter about the swim preference. We have the
same data table:
Swimming suit   Water temperature   Swim preference
None            Cold                No
None            Warm                No
Small           Cold                No
Small           Warm                No
Good            Cold                No
Good            Warm                Yes
We would like to construct a random forest from this data and use it to classify an item
(Good, Cold, ?).
Analysis:
We are given M=3 variables according to which a feature can be classified. In a random
forest algorithm, we usually do not use all three variables to form the tree branches at each
node; we use only m out of the M variables, choosing m so that it is less than or equal to
M. The greater m is, the stronger the classifier in each constructed tree. However, as
mentioned earlier, considering more variables at each node also leads to more bias. Because
we use multiple trees (each with a smaller m), even if every constructed tree is a weak
classifier, their combined classification accuracy is strong. Since we want to reduce the bias
in a random forest, we may want to choose the parameter m to be slightly less than M.
Thus we choose the maximum number of variables considered at a node to be
m = min(M, math.ceil(2*math.sqrt(M))) = min(3, math.ceil(2*math.sqrt(3))) = min(3, 4) = 3.
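This computation can be transcribed directly into Python; the following is a minimal
sketch of the formula above, not code taken from the implementation:

import math

M = 3  # number of variables in the data, as given above

# Maximum number of variables considered at each node:
# min(3, ceil(2 * sqrt(3))) = min(3, 4) = 3
m = min(M, math.ceil(2 * math.sqrt(M)))
print(m)  # 3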
We are given the following features:
[['None', 'Cold', 'No'], ['None', 'Warm', 'No'], ['Small', 'Cold', 'No'],
['Small', 'Warm', 'No'], ['Good', 'Cold', 'No'], ['Good', 'Warm', 'Yes']]
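To make the two sources of randomness in the construction concrete, here is a minimal
Python sketch. The helper names bootstrap_sample and variables_at_node are hypothetical
and not part of the implementation in this chapter: one tree's data set is drawn from the
features with replacement, and at most m of the still-available variables are considered
when branching at a node.

import math
import random

# The swim preference data from the table above; the last column is the class.
features = [['None', 'Cold', 'No'], ['None', 'Warm', 'No'],
            ['Small', 'Cold', 'No'], ['Small', 'Warm', 'No'],
            ['Good', 'Cold', 'No'], ['Good', 'Warm', 'Yes']]

M = 3                                    # as in the analysis above
m = min(M, math.ceil(2 * math.sqrt(M)))  # = 3

def bootstrap_sample(data):
    # Draw len(data) items with replacement: the data set for one tree.
    return [random.choice(data) for _ in data]

def variables_at_node(available_indices):
    # Consider at most m randomly chosen variables when branching at a node.
    return random.sample(available_indices, min(m, len(available_indices)))

random.seed(0)  # arbitrary seed so the sketch is reproducible
print(bootstrap_sample(features))
print(variables_at_node([0, 1]))  # 0 = swimming suit, 1 = water temperature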