Page 22 - Data Science Algorithms in a Week
Unsupervised Ensemble Learning
The analysis of consensus clustering is summarized under the title of modern
clustering methods by Xu and Tian (2015) as follows:
The time complexity of these algorithms depends on the algorithm chosen to
combine the individual results.
Consensus clustering can produce robust, scalable, and consistent partitions and
can exploit the advantages of the individual algorithms used.
Their main deficiency lies in the design of the function used to combine the
results of the individual algorithms.
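One common way to realize such a combination function is evidence accumulation: run a base clusterer several times, record in a co-association matrix how often each pair of samples lands in the same cluster, and extract a consensus partition from that matrix. The sketch below is illustrative only; the toy data, number of runs, use of k-means as the base algorithm, and average-linkage extraction step are assumptions for demonstration, not the chapter's prescribed method.

```python
# Illustrative sketch of an evidence-accumulation consensus function.
# Toy data, run count, and the linkage-based extraction step are
# assumptions chosen for this example.
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
# Two well-separated 2-D blobs of 30 points each
X = np.vstack([rng.normal(0.0, 0.3, (30, 2)),
               rng.normal(3.0, 0.3, (30, 2))])

n = len(X)
runs = 10
co = np.zeros((n, n))
for seed in range(runs):
    # One k-means run per seed; each run contributes one "vote" per pair
    labels = KMeans(n_clusters=2, n_init=1, random_state=seed).fit_predict(X)
    co += (labels[:, None] == labels[None, :])
co /= runs  # co[i, j] = fraction of runs placing i and j in the same cluster

# Treat 1 - co as a pairwise distance and cut an average-linkage
# dendrogram into two consensus clusters
dist = squareform(1.0 - co)
consensus = fcluster(linkage(dist, method="average"), t=2, criterion="maxclust")
```

Because each base run may permute its cluster labels, the co-association matrix is what makes the runs comparable: it depends only on whether two points are grouped together, not on which label the group received.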
BACKGROUND OF CONSENSUS CLUSTERING
As touched upon before, clustering consists of identifying groups of samples with
similar properties, and it is one of the most common preliminary exploratory analyses for
revealing "hidden" patterns, in particular for datasets where label information is
unknown (Ester, Kriegel, Sander, & Xu, 1996). With the rise of big data, efficient and
robust algorithms able to handle massive amounts of data in a reasonable amount of
time are necessary (Abello, Pardalos, & Resende, 2013; Leskovec, Rajaraman, &
Ullman, 2014). Some of the most common clustering schemes include, but are not limited
to k-means (MacQueen, 1967), hierarchical clustering (McQuitty, 1957), spectral
clustering (Shi & Malik, 2000), and density-based clustering approaches (Ester et al.,
1996). A detailed taxonomy of clustering methods is given in Table 1. Given the
diverse objectives and methodological foundations of these methods, clustering
solutions can differ significantly across algorithms (Haghtalab et al.,
2015). Even multiple runs of the same algorithm on the same dataset are not
guaranteed to produce the same solution. This well-known phenomenon is attributed to the
local optimality of clustering algorithms such as k-means (Xanthopoulos, 2014). In
addition to local optimality, algorithmic choice or even the dataset itself might be
responsible for utterly unreliable and unusable results. Therefore, when two different
clustering algorithms are applied to the same dataset and produce entirely different results, it
is not easy to determine which one is correct. To address this problem, consensus clustering can
help minimize this variability through an ensemble procedure that combines the
"good" characteristics of a diverse pool of clusterings (A. L. Fred & Jain, 2005; Liu,
Cheng, & Wu, 2015; Vega-Pons & Ruiz-Shulcloper, 2011). It has emerged as a powerful
technique for producing an optimal and useful partition of a dataset. Studies such as
(A. L. Fred & Jain, 2005; Strehl & Ghosh, 2002; Topchy, Jain, & Punch, 2004) defined
various properties that endorse the use of consensus clustering. Some of them are
described as follows: