Page 116 - Data Science Algorithms in a Week
P. 116
Clustering into K Clusters
Picking the initial k-centroids
We could pick up the initial k-centroids to be any of the k features in the data to be
classified. But ideally, we would like to pick up the points that belong to the different
clusters already in the beginning. Therefore we may want to aim to maximize their mutual
distance in a certain way. Simplifying the process we could pick the first centroid to be any
point from the features. The second could be the one which is furthest from the first. The
third could be the one that is furthest from both first and second, and so on.
Computing a centroid of a given cluster
A centroid of a cluster is just an average of the points in a cluster. If a cluster contains 1
dimensional points with the coordinates x , x , …, x , then the centroid of that cluster would
n
2
1
be (1/n)*(x +x +...+x ). If a cluster contains 2 dimensional points with the coordinates
2
n
1
(x ,y ),(x ,y ),…,(x ,y ), then the x coordinate of the centroid of the cluster would have value
2
2
1
1
n
n
(1/n)*(x +x +...+x ), the y coordinate would have the value (1/n)*(y +y +...+y ).
1
n
1
2
n
2
This computation generalizes easily to higher dimensions. If the value of the higher
dimensional features for the x-coordinate are x , x , …, x , then the value at the x-coordinate
2
1
n
for the centroid is (1/n)*(x +x +...+x ).
1
2
n
k-means clustering algorithm on household
income example
We will apply k-clustering algorithm on the household income example. In the beginning
we have households with the incomes 40k, 55k, 70k, 100k, 115k, 130k and 135k in USD
dollars.
The first centroid to be picked up can be any feature, example 70k. The second centroid
should be the feature that is furthest from the first one, that is 135k since 135k-70k is 65k
which is the greatest difference between any other feature and 70k. Thus 70k is the centroid
of the first cluster, 135k is the centroid of the second cluster.
Now 40k, 55k, 70k, 100k are closer to 70k by taking the difference than to 135k, so they will
be in the first cluster. The features 115k, 130k and 135k are closer to 135k than to 70k, so
they will be in the second cluster.
After we have classified the features according to the initial centroids, we recompute the
centroids. The centroid of the first cluster is (1/4)*( 40k+55k+70k+100k)=(1/4)*265k=66.25k.
[ 104 ]