$(C_1, \ldots, C_k)$. Hence,

$$\sum_{i=1}^{k} \sum_{x \in C_i^{(t)}} \left\| x - \mu_i^{(t-1)} \right\|^2 \;\le\; \sum_{i=1}^{k} \sum_{x \in C_i^{(t-1)}} \left\| x - \mu_i^{(t-1)} \right\|^2 . \tag{22.5}$$

Using Equation (22.3) we have that the right-hand side of Equation (22.5) equals $G(C_1^{(t-1)}, \ldots, C_k^{(t-1)})$. Combining this with Equation (22.4) and Equation (22.5), we obtain that $G(C_1^{(t)}, \ldots, C_k^{(t)}) \le G(C_1^{(t-1)}, \ldots, C_k^{(t-1)})$, which concludes our proof.
                 While the preceding lemma tells us that the k-means objective is monotonically
              nonincreasing, there is no guarantee on the number of iterations the k-means algo-
rithm needs in order to reach convergence. Furthermore, there is no nontrivial upper
bound on the gap between the value of the k-means objective at the algorithm's
output and the minimum possible value of that objective function; the output may
be arbitrarily far from optimal. In fact, k-means
              might converge to a point which is not even a local minimum (see Exercise 22.2).
              To improve the results of k-means it is often recommended to repeat the procedure
              several times with different randomly chosen initial centroids (e.g., we can choose
              the initial centroids to be random points from the data).
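A minimal NumPy sketch of this restart heuristic follows, assuming the Euclidean distance and a data matrix X of shape (m, d); the function names (kmeans_once, kmeans_restarts) and the default parameters are illustrative choices, not part of the book's algorithm statement.

```python
import numpy as np

def kmeans_once(X, k, n_iters=100, rng=None):
    """One run of the k-means iterations, initialized with k random data points."""
    rng = np.random.default_rng() if rng is None else rng
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign every point to its closest centroid (squared Euclidean distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of the points assigned to it.
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Sum of squared distances of each point to its assigned centroid.
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    labels = dists.argmin(axis=1)
    objective = dists[np.arange(len(X)), labels].sum()
    return labels, centroids, objective

def kmeans_restarts(X, k, n_restarts=10, seed=0):
    """Run k-means from several random initializations; keep the smallest objective."""
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(X, k, rng=rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[2])
```

Keeping the run with the smallest objective value does not remove the lack of worst-case guarantees discussed above, but it often improves the returned clustering in practice.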


              22.3 SPECTRAL CLUSTERING

Often, a convenient way to represent the relationships between points in a data set
$X = \{x_1, \ldots, x_m\}$ is by a similarity graph; each vertex represents a data point $x_i$, and
every two vertices are connected by an edge whose weight is their similarity, $W_{i,j} =
s(x_i, x_j)$, where $W \in \mathbb{R}^{m,m}$. For example, we can set $W_{i,j} = \exp\!\left(- d(x_i, x_j)^2 / \sigma^2\right)$,
where $d(\cdot,\cdot)$ is a distance function and $\sigma$ is a parameter. The clustering problem can
now be formulated as follows: We want to find a partition of the graph such that the
edges between different groups have low weights and the edges within a group have
high weights.
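As a concrete illustration, the following minimal NumPy sketch builds such a Gaussian similarity matrix, assuming $d$ is the Euclidean distance; the function name and the choice $\sigma = 1$ are purely illustrative.

```python
import numpy as np

def gaussian_similarity(X, sigma=1.0):
    """W[i, j] = exp(-||x_i - x_j||^2 / sigma^2) for a data matrix X of shape (m, d)."""
    # Pairwise squared Euclidean distances, shape (m, m).
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / sigma ** 2)

# Two nearby points and one far-away point: the similarity graph has a strong
# edge between the first two vertices and only weak edges to the third.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
W = gaussian_similarity(X, sigma=1.0)
print(np.round(W, 3))
```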
                 In the clustering objectives described previously, the focus was on one side of
              our intuitive definition of clustering – making sure that points in the same cluster
              are similar. We now present objectives that focus on the other requirement – points
              separated into different clusters should be nonsimilar.


              22.3.1 Graph Cut
              Given a graph represented by a similarity matrix W, the simplest and most direct
              way to construct a partition of the graph is to solve the mincut problem, which
chooses a partition $C_1, \ldots, C_k$ that minimizes the objective

$$\operatorname{cut}(C_1, \ldots, C_k) = \sum_{i=1}^{k} \sum_{r \in C_i,\, s \notin C_i} W_{r,s} .$$
              For k = 2, the mincut problem can be solved efficiently. However, in practice it
              often does not lead to satisfactory partitions. The problem is that in many cases, the
              solution of mincut simply separates one individual vertex from the rest of the graph.
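To make this last point concrete, here is a minimal NumPy sketch that evaluates the cut objective for two partitions of a small graph; the similarity matrix, the helper name cut_value, and the chosen partitions are toy values for illustration only.

```python
import numpy as np

def cut_value(W, clusters):
    """cut(C_1, ..., C_k): total weight of edges leaving each cluster."""
    vertices = np.arange(W.shape[0])
    total = 0.0
    for C in clusters:
        outside = np.setdiff1d(vertices, C)
        total += W[np.ix_(C, outside)].sum()  # sum over r in C_i, s not in C_i
    return total

# Toy graph: two tightly connected pairs {0, 1} and {2, 3}, joined by weaker
# edges, plus a vertex 4 that is only weakly connected to everything.
W = np.array([[0.0, 0.9, 0.2, 0.2, 0.05],
              [0.9, 0.0, 0.2, 0.2, 0.05],
              [0.2, 0.2, 0.0, 0.9, 0.05],
              [0.2, 0.2, 0.9, 0.0, 0.05],
              [0.05, 0.05, 0.05, 0.05, 0.0]])

# The "meaningful" partition separates the two tight pairs...
print(cut_value(W, [np.array([0, 1, 4]), np.array([2, 3])]))   # ~1.8
# ...but mincut prefers to split off the weakly connected vertex 4.
print(cut_value(W, [np.array([0, 1, 2, 3]), np.array([4])]))   # ~0.4
```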