Page 25 - Data Science Algorithms in a Week
P. 25
10 Ramazan Ünlü
produce multiple partitions. They aimed to reduce variability in the partitioning
based algorithm result by averaging. And, they successfully produced more
accurate clusters than an application of a single algorithm.
The consensus function is the crucial and leading step of any consensus clustering
algorithm. These functions are used to combine a set of labels produced by individual
clustering algorithms in the previous step. The combined labels - or final partition- can be
considered as a result of another clustering algorithm. Foundation or definition of a
consensus function can profoundly impact the goodness of final partition which is the
product of any consensus clustering. However, the way of the combination of multiple
partitions is not the same in all cases. A sharp -but well-accepted- division of consensus
functions are (1) objects co-occurrence and (2) median partition approaches.
The idea of objects co-occurrence methods works based on similar and dissimilar
objects. If two data points are in the same cluster, those can be considered as similar,
otherwise they are dissimilar. Therefore, in objects co-occurrence methods it should be
analyzed that how many times data samples belong to one cluster. In median partition
approach, the final partition is obtained by solving an optimization problem which is the
problem of finding the median partition concerning cluster ensemble. Now the formal
version of the median partition problem can be defined. Given a set of partitions and a
∗
similarity measure such as distance (, ) between two partitions, a set of partition can
be found such that:
∗
= ∑ ( , )
=1
It can be found the detailed review of consensus functions, and taxonomy of principal
consensus functions in different studies such as (Ghaemi, Sulaiman, Ibrahim, &
Mustapha, 2009; Topchy et al., 2004; Vega-Pons & Ruiz-Shulcloper, 2011; D. Xu &
Tian, 2015). Also, relations among different consensus functions can be found in (Li,
Ogihara, & Ma, 2010). some of the main functions are summarized as follows:
Based on relabeling and voting: These methods are based on two important
steps. At the first step, the labeling correspondence problem needs to be solved.
The label of each sample is symbolic; a set of the label given by an algorithm
might be different than labels given by another algorithm. However, both sets of
labels correspond to the same partition. Solving this problem makes the partitions
ready for the combination process. If the labeling correspondence problem is
solved, then at the second step voting procedure can be applied. The voting
process finds how many times a sample is labeled with the same label. To apply