Page 18 - Data Science Algorithms in a Week

Unsupervised Ensemble Learning                        3

                       general, an appropriate DM method needs to be selected according to whether the
                       available data are labeled or unlabeled. DM methods can therefore be roughly categorized
                       as supervised or unsupervised learning, depending on whether the data are labeled.
                       Supervised learning methods are reserved for labeled datasets, whereas unsupervised
                       learning methods are designed for unlabeled datasets. Selecting a suitable algorithm is
                       crucial, because a method developed for labeled data may not be effective for mining
                       unlabeled data. Throughout the chapter, the focus will be on unsupervised learning.
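
                       The distinction can be made concrete with a small sketch. The toy points, the label
                       names, and the helper `class_centroids` below are all made up for illustration and do
                       not come from the chapter; the sketch only shows that a supervised step needs a label
                       for every sample, while the unlabeled case has nothing to supply in its place.

```python
from collections import defaultdict


def class_centroids(points, labels):
    """Supervised step: average the points that share a known class label."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    return {
        label: tuple(sum(coord) / len(members) for coord in zip(*members))
        for label, members in groups.items()
    }


# Toy data (hypothetical): four samples with two features each.
points = [(1.0, 1.0), (1.0, 3.0), (8.0, 8.0), (10.0, 8.0)]

# Labeled case: every sample comes with a target, so a supervised method applies.
labels = ["a", "a", "b", "b"]
centroids = class_centroids(points, labels)

# Unlabeled case: only `points` exists. There is nothing to pass as `labels`,
# so the grouping itself has to be discovered, which is the clustering problem.
```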


                                               UNSUPERVISED LEARNING


                          Clustering, one of the most widely used DM methods, finds applications in
                       numerous domains, including information retrieval and text mining (A. Jain, 1999),
                       spatial database applications (Sander, Ester, Kriegel, & Xu, 1998), sequence and
                       heterogeneous data analysis (Cades, Smyth, & Mannila, 2001), web data analysis
                       (Srivastava, Cooley, Deshpande, & Tan, 2000), bioinformatics (de Hoon, Imoto, Nolan,
                       & Miyano, 2004), text mining (A. K. Jain, Murty, & Flynn, 1999), and many others. As
                       pointed out, no labeled data are available in clustering problems. Therefore, the goal
                       of clustering is to divide unlabeled data into groups of similar objects (Berkhin, 2006).
                       Objects in the same group are considered similar to each other and dissimilar to
                       objects in other groups. An example of clustering is illustrated in Figure 2, where points
                       belonging to the same cluster are shown with the same symbol.
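
                       As one possible illustration of grouping similar objects, the sketch below runs plain
                       k-means (Lloyd's algorithm) on a handful of made-up 2-D points. The text does not name
                       a particular algorithm here, so k-means is an assumption chosen for brevity, as are
                       the function name `kmeans`, the Euclidean distance, and the choice of initial centers.

```python
import math


def kmeans(points, initial_centers, max_iter=100):
    """Plain k-means (Lloyd's algorithm) on tuples of floats.

    Returns (assignments, centers), where assignments[i] is the index of the
    cluster that points[i] was assigned to.
    """
    centers = [tuple(c) for c in initial_centers]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest center.
        new_assignments = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the centers are stable
        assignments = new_assignments
        # Update step: move each center to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignments, centers


# Hypothetical toy data: two visually separated groups of three points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
assignments, centers = kmeans(points, initial_centers=[points[0], points[3]])
```

                       Points that receive the same index in `assignments` play the role of points drawn
                       with the same symbol in a cluster plot. The result of k-means generally depends on
                       the initial centers, so a different starting choice may give a different grouping.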
                                                                                         
                          More formally, for a given data set $X = (x_i)_{i=1}^{N}$, where $x_i \in \mathbb{R}^{n}$ and $N$ and $n$ are the
                       number of samples and features respectively, clustering methods try to find $k$ clusters of
                       $X$, $C = \{C_1, C_2, \cdots, C_k\}$ where $k < N$, such that:
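
                       A common way to state the requirements on such a clustering is the set of
                       hard-partition conditions sketched below. They are given here as an assumption about
                       the intended definition, not as the chapter's own wording. The symbols are chosen for
                       illustration: $X$ denotes the data set and $C_1, \dots, C_k$ denote the $k$ clusters.
                       Every cluster is non-empty, the clusters together cover the data set, and no sample
                       belongs to two clusters.

```latex
\begin{align}
  C_j &\neq \emptyset, & j &= 1, \dots, k, \\
  \bigcup_{j=1}^{k} C_j &= X, \\
  C_i \cap C_j &= \emptyset, & i &\neq j .
\end{align}
```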
                                          





















                       Figure 2. An example of clustering.