Page 117 - Data Science Algorithms in a Week


Clustering into K Clusters

The centroid of the second cluster is (1/3)*(115k+130k+135k)=(1/3)*380k~126.66k.

Using the new centroids we reclassify the features as follows:
The first cluster with the centroid 66.25k will contain the features 40k, 55k, 70k.
The second cluster with the centroid 126.66k will contain the features 100k, 115k,
130k, 135k.

We notice that the feature 100k moved from the first cluster into the second since now it is
closer to the centroid of the second cluster (distance |100k-126.66k|=26.66k) than to the
centroid of the first cluster (distance |100k-66.25k|=33.75k). Since the features in the clusters
changed, we have to recompute the centroids again.
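The reclassification step above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `assign_to_nearest` and the variable names are hypothetical, and the values are written in thousands (so 40 stands for 40k).

```python
def assign_to_nearest(points, centroids):
    """Group 1-D points by the index of their nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        # In one dimension the distance is the absolute difference.
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    return clusters


# Feature values from the text, in thousands (hypothetical variable names).
features = [40, 55, 70, 100, 115, 130, 135]
# Centroids before this reclassification: 66.25k and 380k/3 (about 126.66k).
centroids = [66.25, 380 / 3]

clusters = assign_to_nearest(features, centroids)
print(clusters)  # [[40, 55, 70], [100, 115, 130, 135]]
```

The feature 100 lands in the second list because |100 - 126.66| is smaller than |100 - 66.25|, matching the distances computed in the text.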

The centroid of the first cluster is (1/3)*(40k+55k+70k)=(1/3)*165k=55k. The centroid of the
second cluster is (1/4)*(100k+115k+130k+135k)=(1/4)*480k=120k.
Using these centroids, we reclassify the items into the clusters. The first cluster, with the
centroid 55k, will contain the features 40k, 55k, 70k. The second cluster, with the centroid
120k, will contain the features 100k, 115k, 130k, 135k. Thus, after the centroids were updated,
the clusters did not change, so recomputing the centroids would give the same values again.

Therefore the algorithm terminates with the two clusters: the first cluster having the
features 40k, 55k, 70k; the second cluster having the features 100k, 115k, 130k, 135k.
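The whole iteration (reclassify, recompute the centroids, stop when the clusters no longer change) could be sketched as follows. This is a minimal one-dimensional illustration, not the book's implementation: the function name `k_means_1d` is hypothetical, values are in thousands, and the starting centroids are the ones from the step where 100k changes cluster.

```python
def k_means_1d(points, centroids):
    """Run 1-D k-means from the given centroids until the clusters stop changing."""
    centroids = list(centroids)
    previous = None
    while True:
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Termination: the clusters are the same as in the previous pass.
        if clusters == previous:
            return clusters, centroids
        # Update step: each centroid becomes the mean of its cluster.
        # An empty cluster keeps its old centroid (an assumption of this sketch).
        centroids = [sum(c) / len(c) if c else old
                     for c, old in zip(clusters, centroids)]
        previous = clusters


features = [40, 55, 70, 100, 115, 130, 135]
clusters, centroids = k_means_1d(features, [66.25, 380 / 3])
print(clusters)   # [[40, 55, 70], [100, 115, 130, 135]]
print(centroids)  # [55.0, 120.0]
```

Stopping when the cluster memberships repeat mirrors the argument in the text: if the clusters do not change, their centroids cannot change either.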

Gender classification - clustering to classify

We take the data from the gender classification problem in Chapter 2, Naive Bayes,
Analysis point 6:

Height in cm   Weight in kg   Hair length   Gender
180            75             Short         Male
174            71             Short         Male
184            83             Short         Male
168            63             Short         Male
178            70             Long          Male
170            59             Long          Female
164            53             Short         Female
