Page 146 - Data Science Algorithms in a Week

Clustering into K Clusters


            Using cross-validation to determine the outcome:

            We used 14 couples as data for the estimation and 3 other couples for cross-validation to
            find the best number of clusters k among the values 2, 3, 4, and 5. We could try more
            clusters, but since we have relatively little data, clustering into at most 5 clusters
            should be sufficient. Let us summarize the errors of the estimation.
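
            The procedure just described can be sketched in Python. This is only a minimal
            illustration under stated assumptions, not the code used to produce the numbers
            in this chapter: the data values and feature columns are made up, the k-means
            routine starts from the first k points (an arbitrary choice), and the error is
            assumed to be the sum of absolute differences between the predicted and the
            actual number of children for the held-out couples.

```python
import math

def nearest(centroids, point):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(centroids[i], point))

def kmeans(points, k, iterations=100):
    """Plain k-means. Starting from the first k points is an arbitrary choice."""
    centroids = [list(map(float, p)) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(centroids, p)].append(p)
        # An empty cluster keeps its previous centroid.
        updated = [
            [sum(axis) / len(g) for axis in zip(*g)] if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids

def validation_error(train, holdout, k):
    """Assumed error measure: sum of |predicted - actual| over the held-out couples."""
    centroids = kmeans([features for features, _ in train], k)
    totals = [[0.0, 0] for _ in centroids]  # [sum of children, couples] per cluster
    for features, children in train:
        i = nearest(centroids, features)
        totals[i][0] += children
        totals[i][1] += 1
    overall = sum(children for _, children in train) / len(train)
    # The prediction for a cluster is the mean number of children of its couples;
    # a cluster with no couples falls back to the overall mean.
    means = [total / count if count else overall for total, count in totals]
    return sum(abs(means[nearest(centroids, f)] - c) for f, c in holdout)

# Hypothetical data, NOT the book's couples: ((feature_1, feature_2), children).
train = [((20, 25), 0), ((22, 30), 0), ((35, 60), 2), ((38, 55), 3),
         ((50, 90), 1), ((52, 95), 2), ((28, 40), 1), ((45, 80), 2)]
holdout = [((24, 28), 0), ((48, 85), 2)]

cv_errors = {k: validation_error(train, holdout, k) for k in (2, 3, 4, 5)}
for k, err in cv_errors.items():
    print(k, round(err, 2))
```

            With so few held-out couples, the errors computed this way can change noticeably
            when a different split or a different starting point for k-means is used.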

             Number of clusters   Error rate

             2                    3.3
             3                    2.17
             4                    2.7
             5                    2.13

            The error rate is lowest for 3 and 5 clusters. The fact that the error rate goes up for 4
            clusters and then down again for 5 clusters may indicate that we do not have enough data
            to make a good estimate. A natural expectation would be that there are no local maxima of
            the error for values of k greater than 2. Moreover, the difference between the errors for
            clustering with 3 and 5 clusters is very small, and a cluster out of 5 is smaller than a
            cluster out of 3, so its estimate rests on fewer couples. For this reason, we choose 3
            clusters over 5 to estimate the number of children for the 18th couple.

            When clustering into 3 clusters, the 18th couple is in cluster 2. Therefore, the estimated
            number of children for the 18th couple is 1.25.
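
            The reasoning about the error table can be replayed in a few lines of Python. The
            error rates are copied from the table above; the tolerance of 0.05 and the rule of
            preferring the smaller k among near-ties are assumptions made here to mirror the
            argument, not a rule stated in the text.

```python
# Error rates copied from the table above (number of clusters -> error rate).
errors = {2: 3.3, 3: 2.17, 4: 2.7, 5: 2.13}

ks = sorted(errors)
best_k = min(ks, key=lambda k: errors[k])

# Interior values of k whose error exceeds both neighbours (local maxima).
local_maxima = [k for prev, k, nxt in zip(ks, ks[1:], ks[2:])
                if errors[k] > errors[prev] and errors[k] > errors[nxt]]

# Assumed tolerance: errors within 0.05 of the minimum count as ties,
# and among ties the smaller k (larger clusters) is preferred.
tolerance = 0.05
near_best = [k for k in ks if errors[k] - errors[best_k] <= tolerance]
chosen_k = min(near_best)

print(best_k, local_maxima, chosen_k)
```

            Here the strict minimum is at 5 clusters, the local maximum is at 4 clusters, and the
            tie-break selects 3 clusters, which matches the choice made above.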



























