Page 131 - Data Science Algorithms in a Week

Clustering into K Clusters

Now the red cluster contains only Peter and a non-owner. This clustering suggests that
Peter is more likely a non-owner as well. However, according to the previous clustering, Peter
was more likely to be a house owner. It is therefore unclear whether Peter
owns a house. Collecting more data would improve our analysis and should be
done before making a definite classification in this problem.

Our analysis shows that a different number of clusters can lead to a different
classification, since the membership of an individual cluster can change. After
collecting more data, we should perform cross-validation to determine the number of
clusters that classifies the data with the highest accuracy.
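
One way to picture this selection step is the minimal sketch below. It is not the book's implementation: it assumes a leave-one-out form of cross-validation, in which each labeled person is classified by the majority label of the other members of their cluster, and it uses a plain k-means seeded with the first k points. The data points, the labels and the function names kmeans and loo_accuracy are all hypothetical and serve only as an illustration.

```python
from collections import Counter


def kmeans(points, k, iterations=100):
    """Plain k-means. Assumption: the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    assignment = None
    for _ in range(iterations):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the clustering is stable
        assignment = new_assignment
        # Move each centroid to the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment


def loo_accuracy(points, labels, k):
    """Leave-one-out accuracy of the 'majority label in my cluster' rule."""
    assignment = kmeans(points, k)
    correct = 0
    for i, label in enumerate(labels):
        # Labels of the other members of the cluster that contains point i.
        peers = [labels[j] for j in range(len(points))
                 if j != i and assignment[j] == assignment[i]]
        # A point that is alone in its cluster has no peers and counts as a miss.
        if peers and Counter(peers).most_common(1)[0][0] == label:
            correct += 1
    return correct / len(points)


# Hypothetical data: (age, income) already rescaled to the interval [0, 1].
points = [(0.10, 0.10), (0.12, 0.25), (0.20, 0.10),
          (0.80, 0.90), (0.90, 0.80), (0.85, 0.85)]
labels = ["non-owner"] * 3 + ["owner"] * 3

# Score each candidate number of clusters and keep the most accurate one.
scores = {k: loo_accuracy(points, labels, k) for k in range(1, 4)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Because k-means never looks at the labels, hiding one label at a time does not change the clustering, so a single run per value of k is enough here. With real data, a random initialization repeated several times would usually be preferable to seeding with the first k points.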

Document clustering – understanding the number of clusters k in a semantic context

We are given the relative frequencies, as percentages, of the words money
and god(s) in the following 17 books from Project Gutenberg:

Book    Book name                                           Money  God(s)
number                                                      in %   in %

1       The Vedanta-Sutras with the Commentary by           0      0.07
        Ramanuja, by Trans. George Thibaut
2       The Mahabharata of Krishna-Dwaipayana Vyasa         0      0.17
        - Adi Parva, by Kisari Mohan Ganguli
3       The Mahabharata of Krishna-Dwaipayana               0.01   0.10
        Vyasa, Part 2, by Krishna-Dwaipayana Vyasa
4       Mahabharata of Krishna-Dwaipayana Vyasa Bk.         0      0.32
        3 Pt. 1, by Krishna-Dwaipayana Vyasa
5       The Mahabharata of Krishna-Dwaipayana Vyasa         0      0.06
        Bk. 4, by Kisari Mohan Ganguli
6       The Mahabharata of Krishna-Dwaipayana Vyasa         0      0.27
        Bk. 3 Pt. 2, by Translated by Kisari Mohan Ganguli
7       The Vedanta-Sutras with the Commentary by           0      0.06
        Sankaracarya
8       The King James Bible                                0.02   0.59
