Page 132 - Data Science Algorithms in a Week

Clustering into K Clusters

 9  Paradise Regained, by John Milton                                        0.02  0.45
10  Imitation of Christ, by Thomas A Kempis                                  0.01  0.69
11  The Koran as translated by Rodwell                                       0.01  1.72
12  The Adventures of Tom Sawyer, Complete by Mark Twain (Samuel Clemens)    0.05  0.01
13  Adventures of Huckleberry Finn, Complete by Mark Twain (Samuel Clemens)  0.08  0
14  Great Expectations, by Charles Dickens                                   0.04  0.01
15  The Picture of Dorian Gray, by Oscar Wilde                               0.03  0.03
16  The Adventures of Sherlock Holmes, by Arthur Conan Doyle                 0.04  0.03
17  Metamorphosis, by Franz Kafka Translated by David Wyllie                 0.06  0.03

We would like to cluster this dataset, based on the chosen word frequency counts, into
groups that reflect the semantic context of the books.
Analysis:

First, we rescale the data, since the highest frequency count of the word money is 0.08%,
whereas the highest frequency count of the word god(s) is 1.72%. Without rescaling, the
god(s) counts would dominate the distances between books. We therefore divide the
frequency counts of money by 0.08 and the frequency counts of god(s) by 1.72:

Book number  Money scaled  God(s) scaled

1            0             0.0406976744
2            0             0.0988372093
3            0.125         0.0581395349
4            0             0.1860465116
5            0             0.0348837209
6            0             0.1569767442
7            0             0.0348837209
8            0.25          0.3430232558
9            0.25          0.261627907
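
The rescaling step described above can be sketched in Python. This is a minimal illustration, not the book's own code: the names frequencies, MONEY_MAX, GODS_MAX and rescale are hypothetical. It uses only the raw counts for books 9 to 17 from the first table on this page, together with the two maxima stated in the text.

```python
# Raw frequency counts (in %) for books 9-17, copied from the first table.
# Each entry maps a book number to (money count, god(s) count).
frequencies = {
    9: (0.02, 0.45),
    10: (0.01, 0.69),
    11: (0.01, 1.72),
    12: (0.05, 0.01),
    13: (0.08, 0.0),
    14: (0.04, 0.01),
    15: (0.03, 0.03),
    16: (0.04, 0.03),
    17: (0.06, 0.03),
}

MONEY_MAX = 0.08  # highest count of "money", as stated in the text
GODS_MAX = 1.72   # highest count of "god(s)", as stated in the text


def rescale(freqs, money_max=MONEY_MAX, gods_max=GODS_MAX):
    """Divide each feature by its maximum so both lie in the range [0, 1]."""
    return {
        book: (money / money_max, gods / gods_max)
        for book, (money, gods) in freqs.items()
    }


scaled = rescale(frequencies)
for book, (money, gods) in sorted(scaled.items()):
    print(f"{book:2d}  {money:.4f}  {gods:.10f}")
```

For book 9, this gives a scaled money value of 0.25 and a scaled god(s) value of about 0.2616, matching the corresponding row of the rescaled table.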
