Page 18 - Data Science Algorithms in a Week
Unsupervised Ensemble Learning
general, an appropriate DM method needs to be selected based on available labeled or
unlabeled data. Therefore, DM methods can be roughly categorized as supervised or
unsupervised learning, depending on whether the data are labeled or unlabeled. Supervised
learning methods are reserved for labeled datasets, whereas unsupervised learning methods
are designed for unlabeled datasets. Selecting a suitable algorithm is crucial because a
method developed for labeled data may not be effective for mining unlabeled data.
Throughout the chapter, the focus will be on unsupervised learning.
UNSUPERVISED LEARNING
Clustering, one of the most widely used DM methods, finds applications in
numerous domains including information retrieval and text mining (A. Jain, 1999),
spatial database applications (Sander, Ester, Kriegel, & Xu, 1998), sequence and
heterogeneous data analysis (Cades, Smyth, & Mannila, 2001), web data analysis
(Srivastava, Cooley, Deshpande, & Tan, 2000), bioinformatics (de Hoon, Imoto, Nolan,
& Miyano, 2004), text mining (A. K. Jain, Murty, & Flynn, 1999) and many others. As
pointed out above, no labeled data are available in clustering problems. Therefore, the
goal of clustering is to divide unlabeled data into groups of similar objects (Berkhin, 2006).
Objects in the same group are considered similar to each other and dissimilar to
objects in other groups. An example of clustering is illustrated in Figure 2, where points
belonging to the same cluster are shown with the same symbol.
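The idea of grouping unlabeled points by similarity can be made concrete with a small sketch. The code below is only an illustration, not a method prescribed by the chapter: it assumes Euclidean distance as the similarity measure, uses a plain k-means-style loop as one possible clustering procedure, and picks the first k points as initial centers purely to keep the run deterministic. The function name `kmeans` and the toy data are hypothetical.

```python
import math

def kmeans(points, k, iters=20):
    """Illustrative k-means-style clustering of unlabeled points (tuples of floats)."""
    # Assumption: the first k points serve as initial centers (deterministic, for illustration).
    centers = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its assigned points.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center unchanged
        if new_centers == centers:
            break  # centers stopped moving, so assignments are stable
        centers = new_centers
    return labels, centers

# Six unlabeled 2-D points forming two visually separate groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (0.9, 1.1), (5.1, 4.9), (4.8, 5.2)]
labels, centers = kmeans(points, k=2)
print(labels)
```

No labels are supplied to the procedure; the group indices it returns are arbitrary identifiers, analogous to the symbols used to distinguish clusters in a scatter plot.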
More formally, for a given data set $X = (x_i)_{i=1}^{N}$, where $x_i \in \mathbb{R}^{n}$ and $N$ and $n$ are the
number of samples and features respectively, clustering methods try to find $k$ clusters of
$X$, $p = \{p_1, p_2, \cdots, p_k\}$ where $k < N$, such that:
Figure 2. An example of clustering.