Page 121 - Data Science Algorithms in a Week
P. 121

Clustering into K Clusters


            Implementation of the k-means clustering

            algorithm

            We implement the k-means clustering algorithm. It takes as an input a CSV file with one
            data item per line. A data item is converted to a point. The algorithms classifies these points
            into the specified number of clusters. In the end the clusters are visualized on the graph
            using the library matplotlib:

                # source_code/5/k-means_clustering.py
                import math
                import imp
                import sys
                import matplotlib.pyplot as plt
                import matplotlib
                import sys
                sys.path.append('../common')
                import common # noqa
                matplotlib.style.use('ggplot')

                # Returns k initial centroids for the given points.
                def choose_init_centroids(points, k):
                    centroids = []
                    centroids.append(points[0])
                    while len(centroids) < k:
                        # Find the centroid that with the greatest possible distance
                        # to the closest already chosen centroid.
                        candidate = points[0]
                        candidate_dist = min_dist(points[0], centroids)
                        for point in points:
                            dist = min_dist(point, centroids)
                            if dist > candidate_dist:
                                candidate = point
                                candidate_dist = dist
                        centroids.append(candidate)
                    return centroids

                # Returns the distance of a point from the closest point in points.
                def min_dist(point, points):
                    min_dist = euclidean_dist(point, points[0])
                    for point2 in points:
                        dist = euclidean_dist(point, point2)
                        if dist < min_dist:
                            min_dist = dist
                    return min_dist

                # Returns an Euclidean distance of two 2-dimensional points.

                                                    [ 109 ]
   116   117   118   119   120   121   122   123   124   125   126