Page 143 - Data Science Algorithms in a Week

Clustering into K Clusters

3. We are given 17 couples and their number of children, and we would like to
find out how many children the 18th couple has. We will use the first 14
couples as data and the next 3 couples for cross-validation to determine the
number of clusters k that we will use to find out how many children the 18th
couple is expected to have.
After clustering, we will say that a couple is likely to have about the
average number of children of the couples in its cluster. Using
cross-validation, we will choose the number of clusters that minimizes the
difference between the actual and the predicted number of children. We will
capture this difference cumulatively over all the items in the cluster as the
square root of the sum of the squares of the differences in the number of
children for each couple. This will minimize the variance of the random
variable for the predicted number of children for the 18th couple.
We will perform the clustering into 2, 3, 4, and 5 clusters.
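
The selection procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation: it takes the error to be the square root of the sum of the squared differences (an assumed reading of the measure described above), the function names `cluster_means`, `cv_error`, and `best_k` are hypothetical, and the toy values are made up for the example rather than taken from the input file.

```python
import math


def cluster_means(labels, targets):
    """Average target value (e.g. number of children) per cluster label."""
    sums, counts = {}, {}
    for label, target in zip(labels, targets):
        sums[label] = sums.get(label, 0.0) + target
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def cv_error(predicted, actual):
    """Square root of the sum of squared prediction differences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)))


def best_k(errors_by_k):
    """Return the number of clusters with the smallest validation error."""
    return min(errors_by_k, key=errors_by_k.get)


# Hypothetical toy values, NOT taken from couples_children.csv.
train_labels = [0, 0, 1, 1]      # cluster assigned to each training couple
train_children = [1, 3, 4, 6]    # known number of children
means = cluster_means(train_labels, train_children)

val_labels = [0, 1]              # cluster assigned to each validation couple
val_children = [2, 7]            # actual number of children
predicted = [means[label] for label in val_labels]
error = cv_error(predicted, val_children)
```

Repeating the last step for each candidate k and passing the resulting errors to `best_k` as a dictionary keyed by k would give the number of clusters to use for the final prediction.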

Input:

# source_code/5/couples_children.csv
48,49
40,43
24,28
49,42
32,34
24,27
29,32
35,35
33,36
42,47
22,27
41,45
39,43
36,38
30,32
36,38
36,39
37,38

Output for 2 clusters:

A couple listed for a cluster is of the form
(couple_number,(wife_age,husband_age)).

Cluster 0: [(1, (48.0, 49.0)), (2, (40.0, 43.0)), (4, (49.0,
42.0)), (10, (42.0, 47.0)), (12, (41.0, 45.0)), (13, (39.0,
