Page 23 - Data Science Algorithms in a Week


8 Ramazan Ünlü

• Robustness: Consensus clustering may achieve better overall performance than the
majority of the individual clustering methods.
• Consistency: The consensus result is similar to all of the individual clusterings
that were combined.
• Stability: The consensus clustering shows less variability across iterations than
the individual algorithms being combined.

With respect to properties like these, consensus clustering can produce better partitions
than most individual clustering methods. It cannot be expected to give the best result
in every case, as there may be exceptions. The most that can be claimed is that
consensus clustering outperforms most of the single algorithms being combined with
respect to some of these properties, on the assumption that a combination of the good
characteristics of various partitions is more reliable than any single algorithm.
Over the past years, many different algorithms have been proposed for consensus
clustering (Al-Razgan & Domeniconi, 2006; Ana & Jain, 2003; Azimi & Fern, 2009; de
Souto, de Araujo, & da Silva, 2006; Hadjitodorov, Kuncheva, & Todorova, 2006; Hu,
Yoo, Zhang, Nanavati, & Das, 2005; Huang, Lai, & Wang, 2016; Li & Ding, 2008; Li,
Ding, & Jordan, 2007; Naldi, Carvalho, & Campello, 2013; Ren, Domeniconi, Zhang, &
Yu, 2016). As mentioned earlier, the literature shows that the consensus clustering
framework can enhance the robustness and stability of clustering analysis. As a result,
consensus clustering has found many real-world applications, such as gene classification,
image segmentation (Hong, Kwong, Chang, & Ren, 2008), video retrieval, and so on (Azimi,
Mohammadi, & Analoui, 2006; Fischer & Buhmann, 2003; A. K. Jain et al., 1999). From a
combinatorial optimization point of view, the task of combining different partitions has
been formulated as a median partition problem, which is known to be NP-complete
(Křivánek & Morávek, 1986). Even with recent breakthroughs, this approach cannot handle
datasets larger than several hundred samples (Sukegawa, Yamamoto, & Zhang, 2013). For a
comprehensive review of 0-1 linear programming formulations of the consensus clustering
problem, readers can refer to Xanthopoulos (2014).
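To make the median partition formulation concrete, the following is a minimal Python
sketch that finds a median partition by brute force on a toy input. It assumes the
distance between two partitions is the number of item pairs on which they disagree
(grouped together in one, apart in the other); the chapter does not fix a particular
distance, so this choice is an assumption, as are the function names set_partitions,
pair_disagreements, and median_partition.

```python
from itertools import combinations


def set_partitions(n):
    """Yield every partition of items 0..n-1 as a canonical label list."""
    def extend(labels, next_label):
        if len(labels) == n:
            yield list(labels)
            return
        # An item may join any existing cluster or open exactly one new cluster.
        for lab in range(next_label + 1):
            labels.append(lab)
            yield from extend(labels, max(next_label, lab + 1))
            labels.pop()
    yield from extend([], 0)


def pair_disagreements(a, b):
    """Assumed distance: pairs grouped together in one partition but not the other."""
    return sum(
        (a[i] == a[j]) != (b[i] == b[j])
        for i, j in combinations(range(len(a)), 2)
    )


def median_partition(partitions):
    """Exhaustively search for the partition with the smallest total distance."""
    n = len(partitions[0])
    return min(
        set_partitions(n),
        key=lambda cand: sum(pair_disagreements(cand, p) for p in partitions),
    )


# Three input partitions of four items; the first two differ only in label names.
partitions = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 2, 2]]
print(median_partition(partitions))  # [0, 0, 1, 1]
```

The search space is the set of all partitions of n items, whose size is the Bell
number of n: 15 for four items, but already 115,975 for ten. This sketch is only
meant to illustrate why exhaustive search does not scale; it is not one of the exact
methods cited above.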
The consensus clustering problem can be stated informally as follows: given multiple
partitions of a dataset, find a combined clustering model (a final partition) that is of
better quality with respect to some of the aspects pointed out above. Every consensus
clustering method therefore consists, in general, of two steps: (1) generation of multiple
partitions and (2) a consensus function, as shown in Figure 6 (Topchy, Jain, & Punch, 2003;
Topchy et al., 2004; D. Xu & Tian, 2015).
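The two steps can be sketched in a few lines of Python. This is an illustrative toy
under stated assumptions, not the method of any cited paper: step 1 runs a small
one-dimensional k-means with different random starts and different numbers of
clusters, and step 2 uses a co-association consensus function that links two items
when they share a cluster in more than half of the partitions. The helper names
(kmeans_1d, generate_partitions, consensus), the 0.5 threshold, and the sample data
are all hypothetical.

```python
import random
from itertools import combinations


def kmeans_1d(values, k, rng, iters=20):
    """Toy Lloyd's k-means on 1-D data; returns one cluster label per value."""
    centers = rng.sample(values, k)
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = sum(members) / len(members)
    return labels


def generate_partitions(values, ks, runs_per_k, seed=0):
    """Step 1: produce several partitions by varying k and the random start."""
    rng = random.Random(seed)
    return [kmeans_1d(values, k, rng) for k in ks for _ in range(runs_per_k)]


def consensus(partitions, threshold=0.5):
    """Step 2: merge items that co-occur in more than `threshold` of the partitions."""
    n = len(partitions[0])
    parent = list(range(n))  # union-find forest over the items

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        together = sum(p[i] == p[j] for p in partitions) / len(partitions)
        if together > threshold:
            parent[find(i)] = find(j)

    # Relabel the connected components as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.0, 9.2, 8.8]
partitions = generate_partitions(values, ks=[2, 3, 4], runs_per_k=5)
final_partition = consensus(partitions)
print(final_partition)
```

Because the consensus function only looks at which items share a cluster, it does not
depend on how each run happens to name its clusters, and the input partitions may have
different numbers of clusters.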
Generation of multiple partitions is the first step of consensus clustering. It aims to
create the multiple partitions that will be combined. This step can be critical for some
problems because the final partition depends on the partitions it produces. Several
methods have been proposed in the literature for creating multiple partitions, as follows:
