Decision Trees
In the previous sections, we calculated the information gains for both of the non-classifying
attributes, swimming suit and water temperature:
IG(S,swimming suit)=0.3166890883
IG(S,water temperature)=0.19087450461
Hence, we choose the attribute swimming suit, as it has the higher information gain.
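These gains follow directly from the entropy formula. As a quick check, here is a minimal Python sketch that recomputes both values; the dataset S is the six-sample swim preference table used earlier in the chapter, and the helper names entropy and information_gain are ours, not part of the book's source code.

import math

def entropy(labels):
    # Information entropy, in bits, of a list of class labels.
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(samples, attr):
    # IG(S, A): entropy of S minus the weighted average entropy of
    # the partitions of S by the values of attribute A.
    partitions = {}
    for s in samples:
        partitions.setdefault(s[attr], []).append(s[-1])
    weighted = sum(len(p) / len(samples) * entropy(p)
                   for p in partitions.values())
    return entropy([s[-1] for s in samples]) - weighted

# (swimming suit, water temperature, swimming preference)
S = [('none', 'cold', 'no'), ('small', 'cold', 'no'),
     ('good', 'cold', 'no'), ('none', 'warm', 'no'),
     ('small', 'warm', 'no'), ('good', 'warm', 'yes')]

print(information_gain(S, 0))  # swimming suit     -> 0.3166890883...
print(information_gain(S, 1))  # water temperature -> 0.1908745046...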
There is no tree drawn yet, so we start from the root node. As the attribute swimming suit
has three possible values, {none, small, good}, we draw three branches out of the root, one
for each value. Each branch carries one partition of the partitioned set S: S_none, S_small,
and S_good. We add nodes to the ends of the branches. The data samples in S_none all have
the same class, swimming preference = no, so we do not need to branch that node by a
further attribute and partition the set. Thus, the node with the data S_none is already a leaf
node. The same is true for the node with the data S_small.
But the node with the data S_good has two possible classes of swimming preference.
Therefore, we will branch that node further. There is only one non-classifying attribute left,
water temperature, so there is no need to calculate the information gain for that attribute
with the data S_good. From the node S_good, we will have two branches, each with a
partition of the set S_good. One branch will have the set of data samples
S_good,cold = {(good, cold, no)}; the other branch will have the partition
S_good,warm = {(good, warm, yes)}. Each of these two branches will end with a node. Each
node will be a leaf node, because the data samples at each node have the same value of the
classifying attribute, swimming preference.
The resulting decision tree has four leaf nodes and is shown in Figure 3.1, Decision tree for
the swim preference example.
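Before turning to the full implementation, the branching rule just described can be condensed into a short recursive sketch. It reuses S and information_gain from the sketch above; grow and ATTR_NAMES are illustrative names of our own, and the anytree package (also used by the book's code below) does the printing. The sketch assumes, as in this example, that every branch eventually reaches a pure partition.

from anytree import Node, RenderTree

ATTR_NAMES = ['swimming suit', 'water temperature']

def grow(samples, attrs, parent=None, edge=''):
    # A node becomes a leaf as soon as all of its samples share one
    # class; otherwise we split on the attribute with the highest
    # information gain and recurse into each partition.
    prefix = edge + ' -> ' if edge else ''
    classes = {s[-1] for s in samples}
    if len(classes) == 1:
        return Node(prefix + classes.pop(), parent=parent)
    best = max(attrs, key=lambda a: information_gain(samples, a))
    node = Node(prefix + ATTR_NAMES[best], parent=parent)
    for value in sorted({s[best] for s in samples}):
        subset = [s for s in samples if s[best] == value]
        grow(subset, [a for a in attrs if a != best], node, value)
    return node

root = grow(S, [0, 1])
for pre, _, node in RenderTree(root):
    print(pre + node.name)

Running it prints the root split on swimming suit, the further split of S_good on water temperature, and the four leaf nodes from Figure 3.1.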
Implementation
We implement the ID3 algorithm, which constructs a decision tree for the data given in a
CSV file. All the sources are in the chapter directory. The most important parts of the
source code are given here:
# source_code/3/construct_decision_tree.py
# Constructs a decision tree from data specified in a CSV file.
# Format of a CSV file:
# Each data item is written on one line, with its variables separated
# by a comma. The last variable is used as a decision variable to
# branch a node and construct the decision tree.
import math
# The anytree module is used to visualize the decision tree
# constructed by this ID3 algorithm.
from anytree import Node, RenderTree
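Following that format, the swim preference data from this chapter would be stored one sample per line, with the decision variable, swimming preference, written last. A hypothetical swim.csv (the actual file name and layout in the chapter directory may differ) could look like this:

none,cold,no
small,cold,no
good,cold,no
none,warm,no
small,warm,no
good,warm,yes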