In the previous sections, we calculated the information gains for the only two non-classifying attributes, swimming suit and water temperature:

    IG(S, swimming suit) = 0.3166890883
    IG(S, water temperature) = 0.19087450461
Hence, we choose the attribute swimming suit, as it has the higher information gain. No tree has been drawn yet, so we start from the root node. Since the attribute swimming suit has three possible values {none, small, good}, we draw three branches out of the root node, one for each value. Each branch carries one partition of the partitioned set S: S_none, S_small, and S_good, and we add a node at the end of each branch. The data samples in S_none all have the same class, swimming preference = no, so there is no need to branch that node by a further attribute and partition its set; the node with the data S_none is already a leaf node. The same is true for the node with the data S_small.
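
The choice of the root attribute and the numbers above can be double-checked with a short script. The following is a minimal sketch, not the book's source code; it assumes the six data samples of the swim preference example from the earlier sections, reconstructed here as (swimming suit, water temperature, swimming preference) triples.

    # Minimal check of the information gains quoted above (not the book's listing).
    import math

    # The swim preference data, reconstructed from the earlier sections:
    # (swimming suit, water temperature, swimming preference).
    S = [('none', 'cold', 'no'), ('none', 'warm', 'no'),
         ('small', 'cold', 'no'), ('small', 'warm', 'no'),
         ('good', 'cold', 'no'), ('good', 'warm', 'yes')]

    def entropy(samples):
        # Entropy of the classifying attribute (the last variable of each sample).
        total = len(samples)
        counts = {}
        for item in samples:
            counts[item[-1]] = counts.get(item[-1], 0) + 1
        return -sum((c / total) * math.log(c / total, 2) for c in counts.values())

    def information_gain(samples, attribute_index):
        # IG(S, A) = E(S) - sum over the values v of A of |S_v|/|S| * E(S_v).
        total = len(samples)
        gain = entropy(samples)
        for value in set(item[attribute_index] for item in samples):
            subset = [item for item in samples if item[attribute_index] == value]
            gain -= (len(subset) / total) * entropy(subset)
        return gain

    print(information_gain(S, 0))  # swimming suit     -> 0.3166890...
    print(information_gain(S, 1))  # water temperature -> 0.1908745...

The two printed values agree with the information gains above, confirming that swimming suit is the better attribute for the root node.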

The node with the data S_good, however, contains samples of both classes of swimming preference, so we branch it further. There is only one non-classifying attribute left, water temperature, so there is no need to calculate the information gain for that attribute on the data S_good. From the node S_good we draw two branches, each carrying one partition of the set S_good: one branch carries the data sample set S_good,cold = {(good, cold, no)}, the other the partition S_good,warm = {(good, warm, yes)}. Each of these branches ends in a node, and each of those nodes is a leaf node, because it holds data samples with a single value of the classifying attribute swimming preference.

The resulting decision tree has four leaf nodes and is the tree shown in Figure 3.1, Decision tree for the swim preference example.
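
Since the structure of this small tree is now fixed, it can also be written down by hand. The sketch below is only an illustration, not the program described in the next section; it uses the anytree module (the same module the implementation relies on for visualization), and the node labels are chosen here purely for readability.

    # Building the four-leaf swim preference tree by hand with anytree
    # (illustrative labels only; not the output of the book's program).
    from anytree import Node, RenderTree

    root = Node('swimming suit')                           # root branches on swimming suit
    Node('none -> swimming preference: no', parent=root)   # leaf for S_none
    Node('small -> swimming preference: no', parent=root)  # leaf for S_small
    good = Node('good -> water temperature', parent=root)  # S_good branches further
    Node('cold -> swimming preference: no', parent=good)   # leaf for S_good,cold
    Node('warm -> swimming preference: yes', parent=good)  # leaf for S_good,warm

    for prefix, _, node in RenderTree(root):
        print(prefix + node.name)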



            Implementation

We implement the ID3 algorithm, which constructs a decision tree for the data given in a CSV file. All the sources are in the chapter directory; the most important parts of the source code are given here:

                # source_code/3/construct_decision_tree.py
                # Constructs a decision tree from data specified in a CSV file.
                # Format of a CSV file:
                # Each data item is written on one line, with its variables separated
                # by a comma. The last variable is used as a decision variable to
                # branch a node and construct the decision tree.
                import math
                # anytree module is used to visualize the decision tree constructed by
                # this ID3 algorithm.
