
               where ∆i(s,t) is the decrease of impurity at node t under split s, pL (pR) is the probability of sending a case to the left (right)
          child node tL (tR), and i(tL) (i(tR)) is the Gini impurity measure of the left (right) child node. To enhance the generalization of the
          decision tree, we used pruning combined with cross-validation error-rate estimation. The pruning algorithm works as follows (an
          illustrative sketch is given after the list):

               1. Randomly split the training data into 5 folds.
               2. Select a pruning level for the tree (level 0 corresponds to the full decision tree).
               3. Use 4 folds to create 4 new pruned trees and estimate the error on the remaining 5th fold.
               4. Repeat from Step 2 until all pruning levels have been used.
               5. Find the smallest error and use the pruning level assigned to it.
               6. Until that pruning level is reached, remove all terminal nodes at the lowest tree level and assign a decision class to the
                   parent node. The decision value is the class with the higher number of cases covered by the node.
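
               For illustration, a minimal Python sketch of this pruning-level selection is given below. It is not the implementation used in
          this study: build_tree, prune_to_level, and error_rate are hypothetical helpers standing in for the CART routines, and the 5-fold
          loop is read as a standard rotating hold-out.

import numpy as np

def select_pruning_level(X, y, max_level, n_folds=5, seed=0):
    # Step 1: randomly split the training data into n_folds folds.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    cv_error = {}
    # Steps 2-4: evaluate every pruning level (level 0 = full decision tree).
    for level in range(max_level + 1):
        errors = []
        for k in range(n_folds):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            tree = build_tree(X[train], y[train])                 # hypothetical CART builder
            pruned = prune_to_level(tree, level)                  # hypothetical level-based pruning
            errors.append(error_rate(pruned, X[test], y[test]))   # hypothetical error estimate
        cv_error[level] = np.mean(errors)
    # Step 5: keep the pruning level with the smallest cross-validation error.
    return min(cv_error, key=cv_error.get)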

                 3.1.2. C4.5

               C4.5 [20] is an evolution of ID3 [21], presented by the same author. The C4.5 algorithm generates a decision tree for the
          given data by recursively splitting that data. The decision tree is grown using a depth-first strategy. The C4.5 algorithm considers
          all the possible tests that can split the data and selects the test that gives the highest gain ratio. Using the gain ratio rather than the
          raw information gain removes ID3's bias in favour of wide decision trees. For each discrete attribute, one test is used that produces
          as many outcomes as there are distinct values of the attribute. For each continuous attribute, the data is sorted and the entropy gain
          is calculated for a binary cut at each distinct value in one scan of the sorted data. This process is repeated for all continuous
          attributes. The C4.5 algorithm allows pruning of the resulting decision trees. Pruning increases the error rate on the training data
          but, importantly, decreases the error rate on unseen test data. The C4.5 algorithm can also deal with numeric attributes, missing
          values, and noisy data.
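
               The single-scan search over binary cuts for a continuous attribute can be illustrated with the following Python sketch. The
          function and variable names are ours, and the midpoint between neighbouring distinct values is used as the cut point purely for
          illustration.

import numpy as np

def entropy(labels):
    # Shannon entropy of the class distribution at a node.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_binary_cut(values, labels):
    # Sort once, then evaluate a binary cut between every pair of distinct values.
    order = np.argsort(values)
    v = np.asarray(values)[order]
    y = np.asarray(labels)[order]
    parent = entropy(y)
    best_gain, best_cut = -np.inf, None
    for i in range(1, len(v)):
        if v[i] == v[i - 1]:
            continue                      # only distinct values define candidate cuts
        left, right = y[:i], y[i:]
        child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        gain = parent - child             # entropy gain of this binary cut
        if gain > best_gain:
            best_gain, best_cut = gain, (v[i - 1] + v[i]) / 2
    return best_cut, best_gain
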
               As the splitting criterion we used the gain ratio, obtained through the following equation:


\[
\text{Gain Ratio}(s,t) = \frac{\text{Information Gain}(s,t)}{\text{Split Information}(s,t)} \tag{9}
\]

          where information gain is an impurity-based criterion that uses the entropy measure as the impurity measure [22]. It is the
          difference between the entropy of the node before splitting (the parent node) and after splitting (the child nodes).

\[
\text{Information Gain}(s,t) = \text{Entropy}(\text{parent node}) - \text{Entropy}(\text{child nodes}) \tag{10}
\]
               The entropy in the above relations can be calculated from the following equation [20]:
\[
\text{Entropy}(t) = -\sum_{i} p(i \mid t)\, \log_2 p(i \mid t) \tag{11}
\]

          where p(i|t) denotes the fraction of records belonging to class i at a given node t.
               Impurity measures such as entropy and the Gini index tend to favor attributes that have a large number of distinct values.
          Therefore, the gain ratio is used to determine the goodness of a split. Every splitting criterion has its own significance and usage
          according to its characteristics and the attribute types.
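
               A compact Python sketch of equations (9)-(11) is given below for illustration. It reuses the entropy helper from the previous
          sketch; the split information in the denominator is taken to be the entropy of the split proportions, which penalizes splits with
          many small child nodes, and the function names are ours.

import numpy as np

def entropy(labels):
    # Eq. (11): Shannon entropy of the class distribution at a node.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def gain_ratio(parent_labels, child_label_groups):
    # child_label_groups: list of label arrays, one per (non-empty) child node.
    n = len(parent_labels)
    weights = np.array([len(g) for g in child_label_groups]) / n
    child_entropy = sum(w * entropy(g) for w, g in zip(weights, child_label_groups))
    info_gain = entropy(parent_labels) - child_entropy          # Eq. (10)
    split_info = -(weights * np.log2(weights)).sum()            # entropy of the split proportions
    return info_gain / split_info if split_info > 0 else 0.0    # Eq. (9)

# Example: a binary split that separates the two classes perfectly.
labels = np.array(["no", "no", "yes", "yes", "yes"])
print(gain_ratio(labels, [labels[:2], labels[2:]]))             # -> 1.0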

          4.  Experimental results
               To evaluate the models, the data were divided into two groups: training data and test data. The training data are the main
          tool for modelling and method learning. The test data do not interfere with the training process and are used only to evaluate the
          trained models. For both methods, 70% of the data were placed in the training group and 30% in the test group, selected using the
          cross-validation method [17]. After the training process, the overall accuracy and Kappa index were estimated separately for each
          method. Table 4 shows the overall accuracy and Kappa index calculated on the training and test data for each model.
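
               For reference, both evaluation measures can be computed from a confusion matrix as in the following Python sketch (the
          matrix shown is hypothetical and is not taken from Table 4):

import numpy as np

def accuracy_and_kappa(confusion):
    # confusion: square matrix, rows = reference classes, columns = predicted classes.
    cm = np.asarray(confusion, dtype=float)
    n = cm.sum()
    observed = np.trace(cm) / n                                   # overall accuracy
    expected = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2   # chance agreement
    kappa = (observed - expected) / (1 - expected)                # Kappa index
    return observed, kappa

# Example with a hypothetical 3-class confusion matrix.
oa, kappa = accuracy_and_kappa([[50, 3, 2],
                                [4, 45, 6],
                                [1, 5, 40]])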


