Page 124 - Data Science Algorithms in a Week

Clustering into K Clusters

Input data from gender classification

We save the data from the gender classification example in a CSV file:

# source_code/5/persons_by_height_weight.csv
180,75
174,71
184,83
168,63
178,70
170,59
164,53
155,46
162,52
166,55
172,60
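
As a minimal sketch of how such a file could be read, the snippet below parses each row into a pair of floats, which is the form the points take in the program output further down. The helper name read_points is hypothetical and is not taken from the book's source code; the rows are inlined so the sketch runs without the file on disk, and it assumes Python 3 and that the first column is height and the second is weight.

```python
import csv
import io

# Inline copy of the rows listed above, so the sketch runs without the file.
CSV_TEXT = """180,75
174,71
184,83
168,63
178,70
170,59
164,53
155,46
162,52
166,55
172,60
"""


def read_points(stream):
    """Parse rows of 'height,weight' into a list of (float, float) tuples.

    Hypothetical helper for illustration; blank rows are skipped.
    """
    return [tuple(float(value) for value in row)
            for row in csv.reader(stream) if row]


points = read_points(io.StringIO(CSV_TEXT))
print(len(points))
print(points[0])
```

To read from disk instead, the same function could be given an open file handle, for example the result of open(path, newline='').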

Program output for gender classification data

We run the program implementing the k-means clustering algorithm on the data from the
gender classification example. The numerical argument 2 means that we would like to
cluster the data into two clusters:

$ python k-means_clustering.py persons_by_height_weight.csv 2 last
The total number of steps: 2
The history of the algorithm:
Step number 0: point_groups = [((180.0, 75.0), 0), ((174.0, 71.0), 0),
((184.0, 83.0), 0), ((168.0, 63.0), 0), ((178.0, 70.0), 0), ((170.0, 59.0),
0), ((164.0, 53.0), 1), ((155.0, 46.0), 1), ((162.0, 52.0), 1), ((166.0,
55.0), 1), ((172.0, 60.0), 0)]
centroids = [(180.0, 75.0), (155.0, 46.0)]
Step number 1: point_groups = [((180.0, 75.0), 0), ((174.0, 71.0), 0),
((184.0, 83.0), 0), ((168.0, 63.0), 0), ((178.0, 70.0), 0), ((170.0, 59.0),
0), ((164.0, 53.0), 1), ((155.0, 46.0), 1), ((162.0, 52.0), 1), ((166.0,
55.0), 1), ((172.0, 60.0), 0)]
centroids = [(175.14285714285714, 68.71428571428571), (161.75, 51.5)]

The program also outputs a graph, shown in Image 5.2. The parameter last tells the
program to run the clustering until the final step. To display only the initial step
(step 0), we replace last with 0:

$ python k-means_clustering.py persons_by_height_weight.csv 2 0

After running the program, we get the graph of the clusters and their centroids at the
initial step, as in Image 5.1.
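
The two steps in the program output above can be checked with a short sketch of the assign-and-update loop. This is not the book's k-means_clustering.py; it is a minimal illustration that assumes Python 3.8 or later (for math.dist), Euclidean distance, and the initial centroids copied from step 0 of the output. The function names assign and update are hypothetical.

```python
import math

# The eleven (height, weight) points from the CSV file.
points = [(180.0, 75.0), (174.0, 71.0), (184.0, 83.0), (168.0, 63.0),
          (178.0, 70.0), (170.0, 59.0), (164.0, 53.0), (155.0, 46.0),
          (162.0, 52.0), (166.0, 55.0), (172.0, 60.0)]


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]))
            for p in points]


def update(points, labels, k):
    """Recompute each centroid as the mean of the points assigned to it.

    Assumes every cluster has at least one member, which holds for this data.
    """
    centroids = []
    for i in range(k):
        members = [p for p, label in zip(points, labels) if label == i]
        centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return centroids


# Initial centroids copied from step 0 of the program output (an assumption
# of this sketch; how the program chooses them is not shown here).
centroids = [(180.0, 75.0), (155.0, 46.0)]
labels = assign(points, centroids)

# Alternate update and assignment until the labels stop changing.
while True:
    centroids = update(points, labels, 2)
    new_labels = assign(points, centroids)
    if new_labels == labels:
        break
    labels = new_labels

print(labels)
print(centroids)
```

With these starting centroids the labels do not change after the first update, which is consistent with the identical point_groups reported at step 0 and step 1 and with the centroids listed at step 1.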
