where ∆i(s,t) is the decrease in impurity at node t for split s, pL (pR) is the probability of sending a case to the left (right) child node tL (tR), and i(tL) (i(tR)) is the Gini impurity of the left (right) child node. To enhance the generalization of the decision tree, we used pruning combined with cross-validation estimation of the error rate. The pruning algorithm works as follows (an illustrative code sketch is given after the steps):
1. Randomly split the training data into 5 folds.
2. Select a pruning level of the tree (level 0 corresponds to the full decision tree).
3. Use 4 folds to build 4 new pruned trees and estimate the error on the remaining 5th fold.
4. Repeat from Step 2 until all pruning levels have been used.
5. Find the smallest error and use the pruning level associated with it.
6. Until the selected pruning level is reached, remove all terminal nodes at the lowest tree level and assign a decision class to the parent node. The decision value is the class with the larger number of cases covered by the node.
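The selection of the pruning level by 5-fold cross-validation can be sketched in Python as below. This is a minimal illustration, not the implementation used in this work: the helpers build_tree, prune_to_level, and error_rate are assumed placeholders for the corresponding CART operations, and Steps 2-4 are read here as a standard rotation in which each fold is held out once.

import numpy as np

def select_pruning_level(X, y, max_level, build_tree, prune_to_level, error_rate,
                         n_folds=5, seed=0):
    """Choose the pruning level with the smallest cross-validated error.

    build_tree(X, y)            -> fitted decision tree    (assumed helper)
    prune_to_level(tree, level) -> tree pruned to a level  (assumed helper)
    error_rate(tree, X, y)      -> misclassification rate  (assumed helper)
    """
    rng = np.random.default_rng(seed)
    # Step 1: randomly split the training data into n_folds folds.
    indices = rng.permutation(len(y))
    folds = np.array_split(indices, n_folds)

    mean_errors = []
    # Steps 2-4: evaluate every pruning level (level 0 = full tree).
    for level in range(max_level + 1):
        fold_errors = []
        for k in range(n_folds):
            held_out = folds[k]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            tree = build_tree(X[train], y[train])
            pruned = prune_to_level(tree, level)
            # Step 3: estimate the error on the held-out fold.
            fold_errors.append(error_rate(pruned, X[held_out], y[held_out]))
        mean_errors.append(np.mean(fold_errors))

    # Step 5: keep the level with the smallest cross-validated error.
    return int(np.argmin(mean_errors))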
3.1.2. C4.5
C4.5 [20] is an evolution of ID3 [21], presented by the same author. The C4.5 algorithm generates a decision tree for the given data by recursively splitting that data; the tree is grown with a depth-first strategy. The algorithm considers all possible tests that can split the data and selects the test that gives the best result, i.e. the highest gain ratio. This criterion removes ID3's bias in favour of wide decision trees. For each discrete attribute, a single test is used that produces as many outcomes as the attribute has distinct values. For each continuous attribute, the data are sorted and the entropy gain is calculated for a binary cut at each distinct value in one scan of the sorted data. This process is repeated for all continuous attributes.
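The one-pass evaluation of binary cuts on a sorted continuous attribute can be sketched as follows. This is a simplified illustration under our own naming, not the C4.5 source code; in particular, best_binary_cut operates on a plain (values, labels) pair.

import numpy as np

def entropy(labels):
    """Entropy of a set of class labels: -sum_i p(i) * log2 p(i)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_binary_cut(values, labels):
    """Scan the sorted attribute once and return (threshold, gain) of the best cut."""
    order = np.argsort(values)
    v = np.asarray(values)[order]
    y = np.asarray(labels)[order]
    parent_entropy = entropy(y)
    n = len(y)

    best_gain, best_threshold = -np.inf, None
    for i in range(1, n):
        if v[i] == v[i - 1]:
            continue                         # only cut between distinct values
        left, right = y[:i], y[i:]
        children_entropy = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        gain = parent_entropy - children_entropy   # entropy gain of this binary cut
        if gain > best_gain:
            best_gain = gain
            best_threshold = (v[i - 1] + v[i]) / 2.0
    return best_threshold, best_gain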
The C4.5 algorithm allows pruning of the resulting decision trees. Pruning increases the error rate on the training data but, importantly, decreases the error rate on unseen test data. The C4.5 algorithm can also deal with numeric attributes, missing values, and noisy data.
As the splitting criterion we used the gain ratio, obtained through the following equation:

Gain Ratio = Information Gain / Split Information    (9)
where the information gain is an impurity-based criterion that uses entropy as the impurity measure [22]. It is the difference between the entropy of the node before splitting (parent node) and after splitting (child nodes):

Information Gain = Entropy(parent) − Entropy(children)    (10)
The entropy in the above relations can be calculated from the following equation [20]:

Entropy(t) = −∑i p(i|t) log2 p(i|t)    (11)

where p(i|t) denotes the fraction of records belonging to class i at a given node t.
Impurity measures such as entropy and the Gini index tend to favour attributes that have a large number of distinct values. Therefore, the gain ratio is used to determine the goodness of a split. Every splitting criterion has its own significance and usage according to its characteristics and the attribute type.
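As a concrete illustration of equations (9)-(11), the sketch below computes the entropy, information gain, split information, and gain ratio for a split on a discrete attribute. The function and variable names are ours, chosen for the example; it follows the usual C4.5 definitions rather than any code from this study, and the toy data at the end are invented.

import numpy as np

def entropy(labels):
    """Equation (11): Entropy(t) = -sum_i p(i|t) * log2 p(i|t)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(labels, attribute):
    """Equations (9) and (10) for a split on a discrete attribute.

    labels    : class label of each record at the parent node
    attribute : value of the splitting attribute for each record
    """
    labels = np.asarray(labels)
    attribute = np.asarray(attribute)
    n = len(labels)

    parent_entropy = entropy(labels)

    children_entropy = 0.0     # weighted entropy of the child nodes
    split_information = 0.0    # penalises attributes with many distinct values
    for value in np.unique(attribute):
        child = labels[attribute == value]
        weight = len(child) / n
        children_entropy += weight * entropy(child)
        split_information -= weight * np.log2(weight)

    information_gain = parent_entropy - children_entropy                           # equation (10)
    return information_gain / split_information if split_information > 0 else 0.0  # equation (9)

# Invented toy example: class labels and a three-valued attribute.
y = ["yes", "yes", "no", "no", "yes", "no"]
a = ["sunny", "rain", "sunny", "rain", "overcast", "sunny"]
print(gain_ratio(y, a))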
4. Experimental results
To evaluate the models, the data were divided into two groups: training data and test data. The training data are used to fit the models and to drive the learning of each method. The test data do not take part in the training process and are used only to assess the trained models. For both methods, 70% of the data were placed in the training group and 30% in the test group, selected using a cross-validation method [17]. After the training process, the overall accuracy and Kappa index were estimated separately for each method. Table 4 shows the overall accuracy and Kappa index calculated on the training and test data for each model.
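For reference, the overall accuracy and Kappa index reported in Table 4 can be computed from a confusion matrix as in the sketch below. This is a generic illustration of the two measures, not the evaluation code used for this study; the example matrix is invented.

import numpy as np

def overall_accuracy(cm):
    """Overall accuracy: correctly classified cases / total cases."""
    cm = np.asarray(cm, dtype=float)
    return np.trace(cm) / cm.sum()

def kappa_index(cm):
    """Cohen's Kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    observed = np.trace(cm) / n                               # observed agreement
    chance = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / n**2   # agreement expected by chance
    return (observed - chance) / (1.0 - chance)

# Hypothetical 2-class confusion matrix (rows: reference, columns: predicted).
cm = [[50, 5],
      [8, 37]]
print(overall_accuracy(cm), kappa_index(cm))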