Page 288 - Understanding Machine Learning
P. 288

Clustering
           270

                 and the MinCut objective that we shall discuss in Section 22.3 are not center-based
                 objectives.


                 22.2.1 The k-Means Algorithm
                 The k-means objective function is quite popular in practical applications of clus-
                 tering. However, it turns out that finding the optimal k-means solution is often
                 computationally infeasible (the problem is NP-hard, and even NP-hard to approx-
                 imate to within some constant). As an alternative, the following simple iterative
                 algorithm is often used, so often that, in many cases, the term k-means Clustering
                 refers to the outcome of this algorithm rather than to the clustering that minimizes
                 the k-means objective cost. We describe the algorithm with respect to the Euclidean
                 distance function d(x,y) = x − y .

                                                 k-Means
                                      n
                           input: X ⊂ R ; Number of clusters k
                           initialize: Randomly choose initial centroids µ ,...,µ k
                                                                   1
                           repeat until convergence
                             ∀i ∈ [k]set C i ={x ∈ X : i = argmin  x − µ  }
                                                                  j
                                                           j
                             (break ties in some arbitrary manner)
                                               1
                             ∀i ∈ [k] update µ =  |C i |  x∈C i  x
                                           i
                 Lemma 22.1. Each iteration of the k-means algorithm does not increase the k-means
                 objective function (as given in Equation (22.1)).
                 Proof. To simplify the notation, let us use the shorthand G(C 1 ,...,C k )for the
                 k-means objective, namely,
                                                          k

                                                                       2
                                   G(C 1 ,...,C k ) =  min       x − µ   .          (22.2)
                                                                     i
                                                 µ 1 ,...,µ k ∈R n
                                                         i=1 x∈C i
                                                       1
                 It is convenient to define µ(C i ) =           x and note that µ(C i ) =
                                                       |C i |  x∈C i
                                        2
                 argmin           x − µ  . Therefore, we can rewrite the k-means objective as
                        µ∈R n  x∈C i
                                                    k

                                                                    2
                                     G(C 1 ,...,C k ) =    x − µ(C i )  .           (22.3)
                                                    i=1 x∈C i
                                                                            (t−1)     (t−1)
                 Consider the update at iteration t of the k-means algorithm. Let C  ,...,C
                                                                            1        k
                                             (t−1)     (t−1)         (t)     (t)
                 be the previous partition, let µ i  = µ(C i  ), and let C ,...,C k  be the new
                                                                     1
                 partition assigned at iteration t. Using the definition of the objective as given in
                 Equation (22.2) we clearly have that
                                                     k
                                        (t)    (t)               (t−1) 2
                                    G(C ,...,C ) ≤          x − µ      .            (22.4)
                                       1       k                 i
                                                    i=1   (t)
                                                       x∈C
                                                          i
                                                                 (t)    (t)
                 In addition, the definition of the new partition (C ,...,C )impliesthatit
                                                                 1      k
                                            k             (t−1) 2
                 minimizes the expression            x − µ      over all possible partitions
                                            i=1  x∈C i    i
   283   284   285   286   287   288   289   290   291   292   293