Clustering into K Clusters
This is because 40k and 135k are furthest away from each other, and we require two
clusters, so they have to be in different clusters. 55k is closer to 40k than to 135k, so 40k
and 55k will be in the same cluster. Similarly, 130k and 135k will be in the same cluster. 70k
is closer to 40k and 55k than to 130k and 135k, so 70k should be in the cluster with 40k and
55k. 115k is closer to 130k and 135k than to the first cluster with 40k, 55k, and 70k, so it will
be in the second cluster. Finally, 100k is closer to the second cluster with 115k, 130k, and
135k, so it will be there. Therefore, the first cluster will contain the 40k, 55k, and 70k
households, and the second cluster will contain the 100k, 115k, 130k, and 135k households.
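As a quick check of this reasoning, the assignment can be reproduced in a few lines of
Python by attaching each income to whichever of the two most distant incomes, 40k or
135k, is closer (this short sketch, its variable names, and the choice of seeds are ours for
illustration, not from the book's code samples):

# Attach each income (in thousands of USD) to the closer of the two
# most distant incomes, 40k and 135k.
incomes = [40, 55, 70, 100, 115, 130, 135]
seeds = (40, 135)

clusters = {seed: [] for seed in seeds}
for income in incomes:
    nearest = min(seeds, key=lambda seed: abs(income - seed))
    clusters[nearest].append(income)

print(clusters)  # {40: [40, 55, 70], 135: [100, 115, 130, 135]}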
Clustering groups features with similar properties, and assigning a cluster to a feature is a
form of classification. It is up to a data scientist to interpret the result of the clustering and
what classification it induces. Here, the cluster with the households with the annual
incomes of 40k, 55k, and 70k USD represents a class of households with a low income. The
second cluster, with the households with the annual incomes of 100k, 115k, 130k, and 135k
USD, represents a class of households with a high income.
We clustered the households into the two clusters informally, based on intuition and
common sense. There are clustering algorithms that cluster the data according to precise
rules. These algorithms include the fuzzy c-means clustering algorithm, the hierarchical
clustering algorithm, the Gaussian (EM) clustering algorithm, the quality threshold
clustering algorithm, and the k-means clustering algorithm, which is the focus of this
chapter.
K-means clustering algorithm
The k-means clustering algorithm classifies the given points into k groups in such a way
that the distances between the members of the same group are minimized.
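More precisely, with clusters S_1, ..., S_k and the centroid of each cluster being the average
of its points, k-means seeks to minimize the within-cluster sum of squared distances. This is
the standard formulation of the objective, which the sentence above states only informally:

\min_{S_1, \dots, S_k} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2,
\qquad \mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x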
The k-means clustering algorithm determines the initial k centroids (points to be at the
cluster centers), one for each cluster. Then each feature is classified into the cluster whose
centroid is closest to that feature. After classifying all the features, we have formed the
initial k clusters. For each cluster, we recompute the centroid to be the average of the points
in that cluster. After we have moved the centroids, we reclassify the features; some features
may change classes, so we have to recompute the centroids again. When the centroids no
longer move, the k-means clustering algorithm terminates.
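The following short Python sketch follows these steps for one-dimensional data such as the
household incomes. The algorithm itself does not prescribe the initialization, so using the
two most distant incomes as the initial centroids is our assumption for this example:

incomes = [40, 55, 70, 100, 115, 130, 135]

def k_means(points, centroids):
    while True:
        # Classify each point into the cluster whose centroid is closest.
        clusters = [[] for _ in centroids]
        for point in points:
            index = min(range(len(centroids)),
                        key=lambda i: abs(point - centroids[i]))
            clusters[index].append(point)
        # Recompute each centroid as the average of the points in its cluster
        # (for simplicity, this sketch assumes no cluster ever becomes empty).
        new_centroids = [sum(cluster) / len(cluster) for cluster in clusters]
        # If the centroids did not move, the algorithm terminates.
        if new_centroids == centroids:
            return clusters, centroids
        centroids = new_centroids

clusters, centroids = k_means(incomes, centroids=[40.0, 135.0])
print(clusters)   # [[40, 55, 70], [100, 115, 130, 135]]
print(centroids)  # [55.0, 120.0]

The result agrees with the informal clustering above: the centroids settle at 55k and 120k,
the averages of the low-income and high-income groups.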