Page 288 - Understanding Machine Learning
P. 288
Clustering
270
and the MinCut objective that we shall discuss in Section 22.3 are not center-based
objectives.
22.2.1 The k-Means Algorithm
The k-means objective function is quite popular in practical applications of clus-
tering. However, it turns out that finding the optimal k-means solution is often
computationally infeasible (the problem is NP-hard, and even NP-hard to approx-
imate to within some constant). As an alternative, the following simple iterative
algorithm is often used, so often that, in many cases, the term k-means Clustering
refers to the outcome of this algorithm rather than to the clustering that minimizes
the k-means objective cost. We describe the algorithm with respect to the Euclidean
distance function d(x,y) = x − y .
k-Means
n
input: X ⊂ R ; Number of clusters k
initialize: Randomly choose initial centroids µ ,...,µ k
1
repeat until convergence
∀i ∈ [k]set C i ={x ∈ X : i = argmin x − µ }
j
j
(break ties in some arbitrary manner)
1
∀i ∈ [k] update µ = |C i | x∈C i x
i
Lemma 22.1. Each iteration of the k-means algorithm does not increase the k-means
objective function (as given in Equation (22.1)).
Proof. To simplify the notation, let us use the shorthand G(C 1 ,...,C k )for the
k-means objective, namely,
k
2
G(C 1 ,...,C k ) = min x − µ . (22.2)
i
µ 1 ,...,µ k ∈R n
i=1 x∈C i
1
It is convenient to define µ(C i ) = x and note that µ(C i ) =
|C i | x∈C i
2
argmin x − µ . Therefore, we can rewrite the k-means objective as
µ∈R n x∈C i
k
2
G(C 1 ,...,C k ) = x − µ(C i ) . (22.3)
i=1 x∈C i
(t−1) (t−1)
Consider the update at iteration t of the k-means algorithm. Let C ,...,C
1 k
(t−1) (t−1) (t) (t)
be the previous partition, let µ i = µ(C i ), and let C ,...,C k be the new
1
partition assigned at iteration t. Using the definition of the objective as given in
Equation (22.2) we clearly have that
k
(t) (t) (t−1) 2
G(C ,...,C ) ≤ x − µ . (22.4)
1 k i
i=1 (t)
x∈C
i
(t) (t)
In addition, the definition of the new partition (C ,...,C )impliesthatit
1 k
k (t−1) 2
minimizes the expression x − µ over all possible partitions
i=1 x∈C i i