Page 138 - Data Science Algorithms in a Week

Clustering into K Clusters


            We can use clustering in this way to group items with similar properties, which then lets
            us find similar items quickly from a given example. The clustering parameter k sets the
            granularity: it determines how similar we can expect the items in a group to be. The higher
            the parameter, the more similar the items in each cluster will be, but the fewer items each
            cluster will contain.
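
            As a rough illustration of looking up similar items through their cluster, the
            sketch below compares an example against the cluster centroids only and returns
            the members of the nearest cluster. The function name similar_items, the
            (centroid, members) layout, the sample points, and the use of Euclidean distance
            are all assumptions made for this illustration.

```python
import math

def similar_items(example, clusters):
    """Return the members of the cluster whose centroid is nearest to example.

    clusters is a list of (centroid, members) pairs; this layout is an
    assumption of the sketch, not something prescribed by the text.
    """
    _, members = min(clusters, key=lambda c: math.dist(example, c[0]))
    return members

# Hypothetical, already-clustered data: two clusters in the plane.
clusters = [
    ((1.0, 1.5), [(1.0, 1.0), (1.0, 2.0)]),
    ((8.5, 8.0), [(8.0, 8.0), (9.0, 8.0)]),
]

matches = similar_items((8.0, 7.0), clusters)
print(matches)
```

            Only k distance computations are needed to pick the cluster, which is why a larger
            k yields smaller, more homogeneous groups at the cost of more centroids to compare.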



            Summary

            Clustering data is efficient and can speed up the classification of new features: a new
            feature is assigned the class represented by the cluster it falls into. An appropriate
            number of clusters can be determined by cross-validation, choosing the number that results
            in the most accurate classification.
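
            The idea of classifying a feature by its cluster, and of scoring a clustering on
            held-out data, might be sketched as follows. The names classify and accuracy, the
            class labels, and the sample data are hypothetical, and Euclidean distance is
            assumed; a full cross-validation would repeat the scoring for each candidate
            number of clusters and keep the best one.

```python
import math

def classify(feature, centroids, cluster_classes):
    """Assign feature the class of the cluster with the nearest centroid."""
    nearest = min(range(len(centroids)),
                  key=lambda i: math.dist(feature, centroids[i]))
    return cluster_classes[nearest]

def accuracy(labelled_features, centroids, cluster_classes):
    """Fraction of held-out (feature, class) pairs that are classified correctly."""
    correct = sum(classify(f, centroids, cluster_classes) == c
                  for f, c in labelled_features)
    return correct / len(labelled_features)

# Hypothetical clustering: one centroid and one represented class per cluster.
centroids = [(1.0, 1.5), (8.5, 8.0)]
cluster_classes = ["low", "high"]

label = classify((7.0, 9.0), centroids, cluster_classes)

# Hypothetical held-out data used to score this clustering.
validation = [((0.0, 1.0), "low"), ((9.0, 9.0), "high"), ((2.0, 2.0), "high")]
score = accuracy(validation, centroids, cluster_classes)
print(label, score)
```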

            Clustering organizes data by similarity. The more clusters there are, the greater the
            similarity between the features in a cluster, but the fewer features each cluster contains.

            The k-means clustering algorithm tries to cluster features in such a way that the mutual
            distance between the features in a cluster is minimized. To do this, the algorithm computes
            the centroid of each cluster and assigns each feature to the cluster whose centroid is
            closest to it. It then recomputes the centroids and repeats, finishing as soon as the
            clusters or their centroids no longer change.
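
            The loop described above can be sketched in a few lines of Python. This is a minimal
            illustration rather than a reference implementation: the names centroid and k_means,
            the max_iterations guard, the caller-supplied initial centroids, the choice of
            Euclidean distance, and the rule that an empty cluster keeps its old centroid are all
            assumptions made here. The sample points are hypothetical.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, initial_centroids, max_iterations=100):
    """Return (assignment, centroids) once the assignment stops changing."""
    centroids = list(initial_centroids)
    assignment = None
    for _ in range(max_iterations):
        # Assign every point to the cluster with the closest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # clusters no longer change, so the computation is finished
        assignment = new_assignment
        # Recompute each centroid from the points now assigned to it.
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[i] = centroid(members)
    return assignment, centroids

# Hypothetical data: two visibly separate groups in the plane.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
assignment, centroids = k_means(points, initial_centroids=[(1.0, 1.0), (9.0, 8.0)])
print(assignment, centroids)
```

            Note that the result can depend on the initial centroids, so in practice the algorithm
            is often run from several starting points.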



            Problems


                   1.  Compute the centroid of the following clusters:


                              a) 2, 3, 4
                              b) $100, $400, $1000
                              c) (10,20), (40,60), (0,40)
                              d) ($200, 40km), ($300, 60km), ($500, 100km), ($250, 200km)
                              e) (1,2,4), (0,0,3), (10,20,5), (4,8,2), (5,0,1)
                   2.  Cluster the following datasets into 2, 3, and 4 clusters using the k-means
                      clustering algorithm:

                              a) 0, 2, 5, 4, 8, 10, 12, 11.
                              b) (2,2), (2,5), (10,4), (3,5), (7,3), (5,9), (2,8), (4,10), (7,4), (4,4), (5,8), (9,3).




