

The analysis of consensus clustering is summarized under the title of modern clustering methods in (Xu & Tian, 2015) as follows:

•  The time complexity of this kind of algorithm depends on the algorithm chosen to combine the individual results.
•  Consensus clustering can produce robust, scalable, and consistent partitions and can take advantage of the individual algorithms used.
•  Its existing deficiencies lie in the design of the function used to combine the results of the individual algorithms.


                                     BACKGROUND OF CONSENSUS CLUSTERING

As touched upon before, clustering consists of identifying groups of samples with similar properties, and it is one of the most common preliminary exploratory analyses for revealing "hidden" patterns, in particular for datasets where label information is unknown (Ester, Kriegel, Sander, & Xu, 1996). With the rise of big data, efficient and robust algorithms able to handle massive amounts of data in a reasonable amount of time have become necessary (Abello, Pardalos, & Resende, 2013; Leskovec, Rajaraman, & Ullman, 2014). Some of the most common clustering schemes include, but are not limited to, k-means (MacQueen, 1967), hierarchical clustering (McQuitty, 1957), spectral clustering (Shi & Malik, 2000), and density-based clustering approaches (Ester et al., 1996). A detailed taxonomy of clustering methods is given in Table 1. Given the diverse objectives and methodological foundations of these methods, clustering solutions can differ significantly across algorithms (Haghtalab et al., 2015). Even multiple runs of the same algorithm on the same dataset are not guaranteed to produce the same solution; this well-known phenomenon is attributed to the local optimality of clustering algorithms such as k-means (Xanthopoulos, 2014). In addition to local optimality, the algorithmic choice or even the dataset itself may be responsible for utterly unreliable and unusable results. Therefore, when two different clustering algorithms are applied to the same dataset and yield entirely different results, it is not easy to say which one is correct. To handle this problem, consensus clustering can minimize this variability through an ensemble procedure that combines the "good" characteristics from a diverse pool of clusterings (A. L. Fred & Jain, 2005; Liu, Cheng, & Wu, 2015; Vega-Pons & Ruiz-Shulcloper, 2011). It has emerged as a powerful technique for producing an optimal and useful partition of a dataset.
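As a concrete illustration of such an ensemble procedure, the following Python sketch implements one widely used consensus scheme, evidence accumulation (A. L. Fred & Jain, 2005): several k-means runs with different random seeds, each of which may reach a different local optimum, vote into a co-association matrix, which is then cut by hierarchical clustering to obtain a consensus partition. The toy dataset, the number of runs, and the choice of average linkage are illustrative assumptions rather than prescriptions from the text.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical unlabeled dataset; any feature matrix X would do.
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
n_samples = X.shape[0]

# Ensemble generation: independent k-means runs from different random
# seeds. Because k-means only reaches a local optimum, the resulting
# partitions can differ from run to run.
n_runs = 30
co_assoc = np.zeros((n_samples, n_samples))
for seed in range(n_runs):
    labels = KMeans(n_clusters=3, n_init=1, random_state=seed).fit_predict(X)
    # Evidence accumulation: count how often each pair of samples
    # falls into the same cluster.
    co_assoc += (labels[:, None] == labels[None, :])
co_assoc /= n_runs  # pairwise co-occurrence frequencies in [0, 1]

# Consensus function: treat 1 - co_assoc as a distance matrix and cut
# an average-linkage dendrogram into the desired number of clusters.
distance = 1.0 - co_assoc
np.fill_diagonal(distance, 0.0)
tree = linkage(squareform(distance, checks=False), method="average")
consensus_labels = fcluster(tree, t=3, criterion="maxclust")

print(np.bincount(consensus_labels)[1:])  # consensus cluster sizes

Average linkage on the co-association matrix is only one possible consensus function; graph-based formulations such as those of Strehl and Ghosh (2002) combine the same evidence differently, which is precisely the design choice that the deficiencies noted above refer to.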
Several studies (A. L. Fred & Jain, 2005; Strehl & Ghosh, 2002; Topchy, Jain, & Punch, 2004) defined various properties that endorse the use of consensus clustering. Some of them are described as follows: