Page 26 - Data Science Algorithms in a Week
P. 26

Unsupervised Ensemble Learning                       11

                              these  methods,  each  produced  partition  should  have  the  same  number  of  the
                              cluster with final partition (Topchy et al., 2005; Vega-Pons & Ruiz-Shulcloper,
                              2011). On the other hand, the strength of this method is easy to understand and
                              employ.  Plurality  Voting  (PV)  (Fischer  &  Buhmann,  2003),  Voting-Mergin
                              (VM) (Weingessel, Dimitriadou, & Hornik, 2003), Voting for fuzzy clustering
                              (Dimitriadou,  Weingessel,  &  Hornik,  2002),  Voting  Active  Cluster  (VAC)
                              (Tumer  &  Agogino,  2008).  and  Cumulative  Voting  (CV)  (Ayad  &  Kamel,
                              2008)can be given as examples.
                            Based on co-association matrix: Algorithms based on the co-association matrix
                              is  used  to  avoid  the  labeling  correspondence  problem.  The  main  idea  of  this
                              approach is to create a co-association matrix in which each element is computed
                              based  on  how  many  times  two  particular  samples  are  in  the  same  cluster.  A
                              clustering  algorithm  is  necessary  to  produce  the  final  partition.  One  of  the
                              deficiencies of this kind of algorithm is that the computational complexity of the
                              methods is quadratic in the number of samples. Therefore, it is not suitable for
                              large datasets. On the other hand, they are very easy to understand and employ.
                              Evidence accumulation in conjunction with Single Link (EA-CL) or Complete
                              Link algorithms (EA-CL) (A. Fred, 2001) can be given as examples.
                            Based on graph partition: This kind of methods transform the combination of
                              multiple partitions into graph or hypergraph partitioning problem (Vega-Pons &
                              Ruiz-Shulcloper, 2011). All partitions in ensemble procedure can be represented
                              by a hyperedge, and final partition is obtained by implementing a graph-based
                              clustering  algorithm.  Three  graph  partitioning  algorithms,  Cluster-based
                              Similarity  Partitioning  Algorithm  (CSPA),  Hypergraph  Partitioning  Algorithm
                              (HGPA), and Meta-CLustering Algorithm (MCLA), are proposed by (Strehl &
                              Ghosh, 2002). In CSPA, a similarity matrix is created from a hypergraph. Each
                              element  of  this  matrix  shows  how  many  times  two  points  are  assigned  to  the
                              same cluster. Final partition can be obtained by applying a graph similarity-based
                              algorithm  such  as  spectral  clustering  or  METIS.  In  HGPA,  the  hypergraph  is
                              directly clustered by removing the minimum number of hyperedges. To get the
                              final  partition  from  the  hypergraph,  an  algorithm  which  is  suitable  to  cluster
                              hypergraph such as HMETIS (Karypis, Aggarwal, Kumar, & Shekhar, 1999) is
                              used.  In  MCLA,  the  similarity  between  two  clusters  is  defined  based  on  the
                              number  of  common  samples  by  using  Jaccard  index.  The  similarity  matrix
                              between the clusters is the adjacency matrix of the graph whose nodes are the
                              clusters and edge is the similarity between the clusters. METIS algorithm is used
                              to  recluster  that  graph.  Computational  and  storage  complexity  of  CSPA  is
                              quadratic  in  the  number  of  sample  n,  while  HGPA  and  MCLA  are  linear.
                              Another graph based method is Hybrid Bipartite Graph Formulation (HBGF) is
                              proposed by (Fern & Brodley, 2004). As different from the previous methods,
   21   22   23   24   25   26   27   28   29   30   31