Page 124 - Data Science Algorithms in a Week
Clustering into K Clusters
Input data from gender classification
We save the data from the gender classification example in a CSV file, one person per row, with the height in the first column and the weight in the second:
# source_code/5/persons_by_height_weight.csv
180,75
174,71
184,83
168,63
178,70
170,59
164,53
155,46
162,52
166,55
172,60
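As a minimal sketch of how such a file could be read into a list of points, the following uses Python's standard csv module. The helper name load_points is hypothetical and is not taken from the book's program; the data is inlined through io.StringIO so that the example runs without the file on disk.

```python
import csv
import io

# The rows of the CSV file shown above, inlined for a self-contained example.
CSV_TEXT = """\
180,75
174,71
184,83
168,63
178,70
170,59
164,53
155,46
162,52
166,55
172,60
"""


def load_points(lines):
    """Parse 'height,weight' rows into a list of (float, float) tuples.

    `lines` can be any iterable of text lines, such as an open file.
    Empty rows are skipped.
    """
    points = []
    for row in csv.reader(lines):
        if not row:
            continue
        height, weight = row
        points.append((float(height), float(weight)))
    return points


points = load_points(io.StringIO(CSV_TEXT))
print(len(points), "points loaded; first point:", points[0])
```

To read the actual file instead, an open file object can be passed in place of the StringIO object, for example `with open(path, newline="") as f: points = load_points(f)`.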
Program output for gender classification data
We run the program implementing the k-means clustering algorithm on the data from the
gender classification example. The numerical argument 2 tells the program to cluster
the data into 2 clusters:
$ python k-means_clustering.py persons_by_height_weight.csv 2 last
The total number of steps: 2
The history of the algorithm:
Step number 0: point_groups = [((180.0, 75.0), 0), ((174.0, 71.0), 0),
((184.0, 83.0), 0), ((168.0, 63.0), 0), ((178.0, 70.0), 0), ((170.0, 59.0),
0), ((164.0, 53.0), 1), ((155.0, 46.0), 1), ((162.0, 52.0), 1), ((166.0,
55.0), 1), ((172.0, 60.0), 0)]
centroids = [(180.0, 75.0), (155.0, 46.0)]
Step number 1: point_groups = [((180.0, 75.0), 0), ((174.0, 71.0), 0),
((184.0, 83.0), 0), ((168.0, 63.0), 0), ((178.0, 70.0), 0), ((170.0, 59.0),
0), ((164.0, 53.0), 1), ((155.0, 46.0), 1), ((162.0, 52.0), 1), ((166.0,
55.0), 1), ((172.0, 60.0), 0)]
centroids = [(175.14285714285714, 68.71428571428571), (161.75, 51.5)]
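The centroids printed at step 1 can be checked by hand: each one is the mean of the points assigned to its group at step 0. The following is a minimal sketch of that check, not the code of k-means_clustering.py. The helper names assign and update are hypothetical, the initial centroids are copied from step 0 of the output above, and Euclidean distance is assumed. The sketch also assumes that no group ends up empty, which holds for this data.

```python
import math

# Points from the CSV file, as (height, weight).
points = [(180, 75), (174, 71), (184, 83), (168, 63), (178, 70), (170, 59),
          (164, 53), (155, 46), (162, 52), (166, 55), (172, 60)]

# Initial centroids copied from step 0 of the program output.
initial_centroids = [(180.0, 75.0), (155.0, 46.0)]


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in points
    ]


def update(points, groups, k):
    """Return the mean point of each of the k groups."""
    centroids = []
    for g in range(k):
        members = [p for p, grp in zip(points, groups) if grp == g]
        # Average each coordinate over the members of the group.
        centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return centroids


groups0 = assign(points, initial_centroids)   # assignment at step 0
centroids1 = update(points, groups0, 2)       # centroids reported at step 1
groups1 = assign(points, centroids1)          # assignment at step 1

print("groups at step 0:", groups0)
print("centroids at step 1:", centroids1)
print("assignment unchanged:", groups0 == groups1)
```

Group 0 contains seven points whose heights sum to 1226 and whose weights sum to 481, giving the centroid (1226/7, 481/7). Group 1 contains four points with sums 647 and 206, giving (161.75, 51.5). Reassigning the points to these new centroids leaves every group unchanged, which is consistent with the algorithm stopping after the reported 2 steps.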
The program also outputs a graph, shown in Image 5.2. The parameter last tells the
program to run the clustering until the final step. To display only the first step
(step 0), we replace last with 0:
$ python k-means_clustering.py persons_by_height_weight.csv 2 0
After the program runs, we get the graph of the clusters and their
centroids at the initial step, as in Image 5.1.
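One way to picture the last argument is as a selector into the recorded history of steps. The sketch below is only an illustration under that assumption; the helper resolve_step is hypothetical, and how k-means_clustering.py actually interprets the argument is not shown in this section. The history holds the centroids copied from the output above.

```python
# Centroids per step, copied from the program output above.
history = [
    [(180.0, 75.0), (155.0, 46.0)],                          # step 0
    [(175.14285714285714, 68.71428571428571), (161.75, 51.5)],  # step 1
]


def resolve_step(step_arg, total_steps):
    """Map the command-line value 'last' or a step number to a history index.

    Hypothetical helper: 'last' selects the final recorded step, any other
    value is parsed as an integer step number and range-checked.
    """
    if step_arg == "last":
        return total_steps - 1
    step = int(step_arg)
    if not 0 <= step < total_steps:
        raise ValueError(f"step must be between 0 and {total_steps - 1}")
    return step


for arg in ("last", "0"):
    index = resolve_step(arg, len(history))
    print(f"argument {arg!r} selects step {index}: centroids = {history[index]}")
```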