Page 115 - Data Science Algorithms in a Week

Clustering into K Clusters


            This is because 40k and 135k are furthest away from each other, and since we require two
            clusters, they must belong to different clusters. 55k is closer to 40k than to 135k, so 40k
            and 55k will be in the same cluster. Similarly, 130k and 135k will be in the same cluster. 70k
            is closer to 40k and 55k than to 130k and 135k, so 70k belongs in the cluster with 40k and
            55k. 115k is closer to 130k and 135k than to the first cluster with 40k, 55k, and 70k, so it will
            be in the second cluster. Finally, 100k is closer to the second cluster with 115k, 130k, and
            135k, so it goes there too. Therefore the first cluster will contain the 40k, 55k, and 70k
            households, and the second cluster will contain the 100k, 115k, 130k, and 135k households.
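            The informal reasoning above can be checked with a short sketch: seed the two clusters
            with the two incomes that are furthest apart (40k and 135k) and assign every other income
            to the nearer seed. This is only an illustration of the intuition, not a clustering algorithm
            from this chapter:

```python
incomes = [40, 55, 70, 100, 115, 130, 135]  # annual incomes in thousands of USD

# The two incomes furthest apart seed the two clusters.
low_seed, high_seed = min(incomes), max(incomes)

# Assign each income to the cluster whose seed is nearer.
low_cluster = [x for x in incomes if abs(x - low_seed) <= abs(x - high_seed)]
high_cluster = [x for x in incomes if abs(x - low_seed) > abs(x - high_seed)]

print(low_cluster)   # [40, 55, 70]
print(high_cluster)  # [100, 115, 130, 135]
```

            The nearest-seed assignment reproduces exactly the two clusters derived by hand.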
            Clustering groups features with similar properties, and assigning a cluster to a feature is a
            form of classification. It is up to the data scientist to interpret the result of the clustering and
            the classification it induces. Here the cluster containing the households with annual incomes
            of 40k, 55k, and 70k USD represents a class of low-income households, while the cluster
            containing the households with annual incomes of 100k, 115k, 130k, and 135k USD
            represents a class of high-income households.

            We clustered the households into two clusters informally, based on intuition and common
            sense. There are clustering algorithms that cluster data according to precise rules. They
            include the fuzzy c-means clustering algorithm, the hierarchical clustering algorithm, the
            Gaussian (EM) clustering algorithm, the quality threshold clustering algorithm, and the
            k-means clustering algorithm, which is the focus of this chapter.



            K-means clustering algorithm

            The k-means clustering algorithm classifies the given points into k groups in such a way
            that the distance between members of the same group is minimized.
            The algorithm determines the initial k centroids (points chosen to be cluster centers), one
            for each cluster. Then each feature is classified into the cluster whose centroid is closest to
            it. After classifying all the features, we have formed the initial k clusters. For each cluster,
            we recompute the centroid as the average of the points in that cluster. After the centroids
            have moved, we reclassify the features; features may change clusters, in which case the
            centroids have to be recomputed again. When the centroids no longer move, the k-means
            clustering algorithm terminates.
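            The assign-then-update loop described above can be sketched for one-dimensional points
            such as the incomes. The function name and the choice of initial centroids are illustrative
            assumptions, not the book's implementation:

```python
def k_means_1d(points, centroids):
    """A minimal sketch of k-means for 1-D points.

    `centroids` holds the initial guesses for the cluster centers."""
    while True:
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the average of its cluster's points.
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        # Termination: the centroids did not move, so the clusters are stable.
        if new_centroids == centroids:
            return clusters, centroids
        centroids = new_centroids

# Hypothetical run on the incomes, seeded with the two extreme values.
incomes = [40, 55, 70, 100, 115, 130, 135]
clusters, centroids = k_means_1d(incomes, [40, 135])
print(clusters)   # [[40, 55, 70], [100, 115, 130, 135]]
print(centroids)  # [55.0, 120.0]
```

            With these seeds the loop converges after one update: the centroids move to the cluster
            means 55 and 120, the reassignment leaves every point where it was, and the algorithm
            terminates.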











