Page 132 - Data Science Algorithms in a Week

Clustering into K Clusters


             9           Paradise Regained, by John Milton                 0.02        0.45
             10          Imitation of Christ, by Thomas A Kempis           0.01        0.69
             11          The Koran as translated by Rodwell                0.01        1.72
             12          The Adventures of Tom Sawyer, Complete by         0.05        0.01
                         Mark Twain (Samuel Clemens)
             13          Adventures of Huckleberry Finn, Complete          0.08        0
                         by Mark Twain (Samuel Clemens)
             14          Great Expectations, by Charles Dickens            0.04        0.01
             15          The Picture of Dorian Gray, by Oscar Wilde        0.03        0.03
             16          The Adventures of Sherlock Holmes, by Arthur      0.04        0.03
                         Conan Doyle
             17          Metamorphosis, by Franz Kafka                     0.06        0.03
                         Translated by David Wyllie
            We would like to cluster the books in this dataset into groups by their semantic
            context, based on the frequency counts of the chosen words.
            Analysis:

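            First, we rescale the data, since the highest frequency count of the word money is 0.08%,
            whereas the highest frequency count of the word god(s) is 1.72%. Without rescaling, the
            god(s) counts would dominate any distance computed between two books. So we divide the
            frequency counts of money by 0.08 and the frequency counts of god(s) by 1.72, which puts
            both features in the range from 0 to 1: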

             Book number   Money scaled   God(s) scaled
             1             0              0.0406976744
             2             0              0.0988372093
             3             0.125          0.0581395349
             4             0              0.1860465116
             5             0              0.0348837209
             6             0              0.1569767442
             7             0              0.0348837209
             8             0.25           0.3430232558
             9             0.25           0.261627907
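
            The rescaling step can be sketched in a few lines of Python. This is only an
            illustrative sketch, not the book's own listing: the names raw, rescale, and scaled
            are made up for this example, the raw values are copied from the rows for books 9 to
            17 shown earlier on this page, and the two maxima (0.08 and 1.72) are taken from the
            text rather than recomputed from the whole dataset.

```python
# Raw frequency counts in percent: book number -> (money, god(s)).
# Only books 9-17 are listed here, as they appear on this page.
raw = {
    9: (0.02, 0.45),
    10: (0.01, 0.69),
    11: (0.01, 1.72),
    12: (0.05, 0.01),
    13: (0.08, 0.0),
    14: (0.04, 0.01),
    15: (0.03, 0.03),
    16: (0.04, 0.03),
    17: (0.06, 0.03),
}

# Highest frequency counts over the whole dataset, as stated in the text.
MAX_MONEY = 0.08
MAX_GOD = 1.72


def rescale(money, god, max_money=MAX_MONEY, max_god=MAX_GOD):
    """Divide each count by the maximum of its column."""
    return money / max_money, god / max_god


scaled = {book: rescale(money, god) for book, (money, god) in raw.items()}

for book in sorted(scaled):
    money_scaled, god_scaled = scaled[book]
    print(f"{book:<14d}{money_scaled:<15.4f}{god_scaled:.10f}")
```

            For book 9 this gives 0.02 / 0.08 = 0.25 and 0.45 / 1.72, which is approximately
            0.2616, in agreement with the last row of the scaled table.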


