Page 121 - Data Science Algorithms in a Week

P. 121

Clustering into K Clusters

Implementation of the k-means clustering

algorithm

We implement the k-means clustering algorithm. It takes as an input a CSV file with one
data item per line. A data item is converted to a point. The algorithms classifies these points
into the specified number of clusters. In the end the clusters are visualized on the graph
using the library matplotlib:

# source_code/5/k-means_clustering.py
import math
import imp
import sys
import matplotlib.pyplot as plt
import matplotlib
import sys
sys.path.append('../common')
import common # noqa
matplotlib.style.use('ggplot')

# Returns k initial centroids for the given points.
def choose_init_centroids(points, k):
centroids = []
centroids.append(points[0])
while len(centroids) < k:
# Find the centroid that with the greatest possible distance
# to the closest already chosen centroid.
candidate = points[0]
candidate_dist = min_dist(points[0], centroids)
for point in points:
dist = min_dist(point, centroids)
if dist > candidate_dist:
candidate = point
candidate_dist = dist
centroids.append(candidate)
return centroids

# Returns the distance of a point from the closest point in points.
def min_dist(point, points):
min_dist = euclidean_dist(point, points[0])
for point2 in points:
dist = euclidean_dist(point, point2)
if dist < min_dist:
min_dist = dist
return min_dist

# Returns an Euclidean distance of two 2-dimensional points.

[ 109 ]

116 117 118 119 120 121 122 123 124 125 126