Page 138 - Data Science Algorithms in a Week
Clustering into K Clusters
We can use clustering this way to group items with similar properties, and then to find items similar to a given example quickly. The clustering parameter k controls the granularity of the clusters and thus how similar we can expect the items in a group to be: the higher the parameter, the more similar the items in a cluster, but the fewer of them there are.
Summary
Clustering data is very efficient and can be used to speed up the classification of new features: a new feature is assigned the class represented in the cluster it falls into. An appropriate number of clusters can be determined by cross-validation, choosing the number that results in the most accurate classification.
Clustering orders data by similarity. The more clusters there are, the greater the similarity between the features within a cluster, but the fewer features each cluster contains.
The k-means clustering algorithm tries to cluster features in such a way that the mutual distance of the features within a cluster is minimized. To do this, the algorithm computes the centroid of each cluster, and a feature belongs to the cluster whose centroid is closest to it. The algorithm finishes computing the clusters as soon as the clusters, or their centroids, no longer change.
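The assignment and update steps described above can be sketched in plain Python. This is a minimal illustration, not the book's implementation; the function names `kmeans`, `centroid`, and `squared_distance` are our own, and the initial centroids are simply sampled from the data.

```python
# Minimal k-means sketch (illustrative, assuming points are numeric tuples).
import random

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100):
    # Start from k distinct data points as the initial centroids.
    centroids = random.sample(points, k)
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: recompute each centroid; keep the old one if a cluster emptied.
        new_centroids = [centroid(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:  # centroids no longer change: done
            break
        centroids = new_centroids
    return clusters, centroids
```

On two well-separated pairs such as (0,0), (0,1) and (10,10), (10,11) with k=2, the algorithm settles on those pairs as the clusters regardless of which points it samples as the initial centroids.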
Problems
1. Compute the centroid of the following clusters:
a) 2, 3, 4
b) 100$, 400$, 1000$
c) (10,20), (40, 60), (0, 40)
d) (200$, 40km), (300$, 60km), (500$, 100km), (250$, 200km)
e) (1,2,4), (0,0,3), (10,20,5), (4,8,2), (5,0,1)
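For problem 1, recall that a centroid is the component-wise mean of the points in the cluster. A small helper (our own illustration, not the book's code) makes this concrete on data not taken from the exercises:

```python
# Centroid as the component-wise mean of a cluster's points (illustrative helper).
def centroid(points):
    """Mean of each coordinate over a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

print(centroid([(2, 2), (4, 6)]))    # → (3.0, 4.0)
print(centroid([(1,), (2,), (6,)]))  # → (3.0,) for one-dimensional data
```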
2. Cluster the following datasets into 2, 3, and 4 clusters using the k-means
clustering algorithm:
a) 0, 2, 5, 4, 8, 10, 12, 11.
b) (2,2), (2,5), (10,4), (3,5), (7,3), (5,9), (2,8), (4,10), (7,4), (4,4), (5,8), (9,3).