
Decision Trees


            Coin flipping

Imagine we flip an unbiased coin. We would like to know whether the result is head or tail. How much information do we need to represent the result? Both words, head and tail, consist of four characters, and if we represent one character with one byte (8 bits), as is standard in the ASCII table, then we would need four bytes, or 32 bits, to represent the result.

But the information entropy is the least amount of data necessary to represent the result. We know that there are only two possible results: head or tail. If we agree to represent head with 0 and tail with 1, then one bit is sufficient to communicate the result efficiently. Here the data is the space of the possibilities of the result of the coin throw; it is the set {head, tail}, which can be represented as the set {0,1}. The actual result is a data item from this set. It turns out that the entropy of this set is 1 bit. This is because the probabilities of head and tail are both 50%.

Now imagine that the coin is biased and throws head 25% of the time and tail 75% of the time. What would be the entropy of the probability space {0,1} this time? We could certainly represent the result with one bit of information. But can we do better? One bit is, of course, indivisible, but maybe we could generalize the concept of information to non-discrete, fractional amounts.

In the previous example, we know nothing about the result of the coin flip until we look at the coin. But in the example with the biased coin, we know in advance that the result tail is more likely to happen. If we recorded n results of coin flips in a file, representing heads with 0 and tails with 1, then about 75% of the bits there would have the value 1 and 25% of them would have the value 0. The size of such a file would be n bits. But since the file is more regular (the pattern of 1s prevails in it), a good compression tool should be able to compress it to fewer than n bits.
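To make this concrete, here is a minimal sketch (assuming Python with its standard random and zlib modules; the variable names are only illustrative) that simulates n flips of such a biased coin, packs them into exactly n bits, and runs a general-purpose compressor over them. The compressed output typically ends up noticeably smaller than n bits:

import random
import zlib

random.seed(0)
n = 100_000  # number of simulated coin flips

# Simulate the biased coin: 0 (head) with probability 0.25, 1 (tail) with 0.75.
flips = [0 if random.random() < 0.25 else 1 for _ in range(n)]

# Pack the bits into bytes so that the raw data is exactly n bits long.
raw = bytearray()
for i in range(0, n, 8):
    byte = 0
    for bit in flips[i:i + 8]:
        byte = (byte << 1) | bit
    raw.append(byte)

compressed = zlib.compress(bytes(raw), 9)

print("raw size:       ", len(raw) * 8, "bits")
print("compressed size:", len(compressed) * 8, "bits")
print("bits per flip:  ", len(compressed) * 8 / n)

For an unbiased coin, the bits carry no such exploitable regularity, so the same procedure would not shrink the file below n bits on average.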
To learn the theoretical bound on this compression and the amount of information necessary to represent a data item, we define information entropy precisely.



            Definition of information entropy
Suppose that we are given a probability space S with the elements 1, 2, ..., n. The probability that an element i would be chosen from the probability space is p_i. Then the information entropy of the probability space is defined as:

E(S) = -p_1 * log_2(p_1) - ... - p_n * log_2(p_n)

where log_2 is the binary logarithm.
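As a small sketch of this definition (the entropy helper below is only illustrative, not code from the book), we can compute E(S) for both coins from the earlier examples:

import math

def entropy(probabilities):
    """E(S) = -p_1*log_2(p_1) - ... - p_n*log_2(p_n), measured in bits."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Unbiased coin: head and tail each with probability 0.5 -> 1.0 bit.
print(entropy([0.5, 0.5]))

# Biased coin: head 25% of the time, tail 75% of the time -> about 0.811 bits.
print(entropy([0.25, 0.75]))

The value of about 0.811 bits per flip for the biased coin matches the intuition from the compression example: on average, the result can be communicated with less than one bit.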