Page 146 - Data Science Algorithms in a Week

Clustering into K Clusters

Using cross-validation to determine the outcome:

We used 14 couples as data for the estimation and 3 other couples for cross-validation to
find the best number of clusters k among the values 2, 3, 4, and 5. We could try clustering
into more clusters, but since we have relatively little data, clustering into at most
5 clusters should be sufficient. Let us summarize the errors of the estimation.

Number of clusters    Error rate
2                     3.3
3                     2.17
4                     2.7
5                     2.13

The error rate is lowest for 3 and 5 clusters. The fact that the error rate goes up for 4
clusters and then down again for 5 clusters may indicate that we do not have enough data
to make a good estimate. A natural expectation would be that there are no local maxima of
errors for values of k greater than 2. Moreover, the difference between the errors for
clustering with 3 and 5 clusters is very small, and each of 5 clusters contains on average
fewer couples than each of 3 clusters. For this reason, we choose 3 clusters over 5 to
estimate the number of children for the 18th couple.
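
The reasoning about the error table can be checked with a short Python sketch. It takes the four error values from the table, reports any value of k at which the error is a local maximum, and then picks the smallest k whose error is close to the minimum. The function names and the `tolerance` of 0.05 are illustrative assumptions, not part of the original analysis, which argues from cluster sizes rather than from a fixed threshold.

```python
# Cross-validation errors from the table, keyed by the number of clusters k.
errors = {2: 3.3, 3: 2.17, 4: 2.7, 5: 2.13}


def local_maxima(errors):
    """Return the interior values of k whose error exceeds both neighbours."""
    ks = sorted(errors)
    return [
        k
        for prev, k, nxt in zip(ks, ks[1:], ks[2:])
        if errors[k] > errors[prev] and errors[k] > errors[nxt]
    ]


def choose_k(errors, tolerance):
    """Pick the smallest k whose error is within `tolerance` of the minimum.

    `tolerance` is a hypothetical threshold chosen for illustration only.
    """
    best = min(errors.values())
    return min(k for k, e in errors.items() if e - best <= tolerance)


bumps = local_maxima(errors)               # k values where the error bumps up
chosen = choose_k(errors, tolerance=0.05)  # prefer fewer, larger clusters
print(bumps, chosen)  # prints: [4] 3
```

With a tolerance of 0, the same function would return 5, the strict minimum; preferring the smaller k on a near tie reflects the argument that fewer clusters leave more couples in each cluster.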

When clustering into 3 clusters, the 18th couple falls into cluster 2. Therefore, the
estimated number of children for the 18th couple is 1.25.
