Page 143 - Data Science Algorithms in a Week

Clustering into K Clusters


                   3.  We are given 17 couples and their number of children and would like to find out
                       how many children the 18th couple has. We will use the first 14 couples as data
                       and the next 3 couples for cross-validation to determine the number of
                       clusters k that we will use to estimate how many children the 18th couple is
                       expected to have.
                           After clustering, we will say that a couple is likely to have about the
                           average number of children of the couples in its cluster. Using
                           cross-validation, we will choose the number of clusters that minimizes
                           the difference between the actual and the predicted number of children.
                           We will capture this difference cumulatively, over all the items in the
                           cluster, as the square root of the sum of the squares of the differences
                           in the number of children for each couple. This will minimize the variance
                           of the random variable for the predicted number of children of the 18th
                           couple.
                           We will perform the clustering into 2, 3, 4, and 5 clusters.
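
The error measure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `predict_children` and `cumulative_error` are hypothetical, and the child counts are made-up toy values, since the input listing does not include the number of children.

```python
import math


def predict_children(cluster_children):
    """Predict the number of children as the cluster average."""
    return sum(cluster_children) / len(cluster_children)


def cumulative_error(actual, predicted):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)))


# Hypothetical toy values, NOT taken from the book's data.
training_children = [1, 2, 3]   # children of the training couples in one cluster
validation_actual = [2, 4]      # children of the validation couples in that cluster

prediction = predict_children(training_children)
error = cumulative_error(validation_actual,
                         [prediction] * len(validation_actual))
print(prediction, error)
```

With a measure like this, one would repeat the computation for each candidate k (2, 3, 4, and 5) and keep the k with the smallest error on the three validation couples.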

                          Input:

                             # source_code/5/couples_children.csv
                             48,49
                             40,43
                             24,28
                             49,42
                             32,34
                             24,27
                             29,32
                             35,35
                             33,36
                             42,47
                             22,27
                             41,45
                             39,43
                             36,38
                             30,32
                             36,38
                             36,39
                             37,38
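
As a rough illustration of how the age pairs above could be grouped, here is a minimal k-means sketch over the first 14 couples. It is a sketch under assumptions, not the book's program: the function name `kmeans` is hypothetical, the centroids are seeded with the first k points, and a different initialization may produce different clusters. The output listed in the text comes from the book's own code.

```python
import math

# First 14 (wife_age, husband_age) pairs from the input listing above.
couples = [(48, 49), (40, 43), (24, 28), (49, 42), (32, 34), (24, 27),
           (29, 32), (35, 35), (33, 36), (42, 47), (22, 27), (41, 45),
           (39, 43), (36, 38)]


def kmeans(points, k, max_iterations=100):
    # Assumption: seed the centroids with the first k points.
    centroids = [tuple(float(v) for v in p) for p in points[:k]]
    assignment = None
    for _ in range(max_iterations):
        # Assign every point to its nearest centroid (Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments stopped changing, so we have converged
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = (sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members))
    # Report each couple as (couple_number, (wife_age, husband_age)).
    clusters = [
        [(i + 1, p) for i, (p, a) in enumerate(zip(points, assignment))
         if a == c]
        for c in range(k)
    ]
    return clusters, centroids


clusters, centroids = kmeans(couples, 2)
for index, cluster in enumerate(clusters):
    print("Cluster %d: %s" % (index, cluster))
```

Couple numbers start at 1 to follow the (couple_number,(wife_age,husband_age)) form used in the output listing.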
                     Output for 2 clusters:

                     A couple listed for a cluster is of the form
                     (couple_number,(wife_age,husband_age)).

                             Cluster 0: [(1, (48.0, 49.0)), (2, (40.0, 43.0)), (4, (49.0,
                             42.0)), (10, (42.0, 47.0)), (12, (41.0, 45.0)), (13, (39.0,
