Page 126 - Data Science Algorithms in a Week

110             Loris Nanni, Sheryl Brahnam and Alessandra Lumini

                              technique to transfer data in one domain to another where hidden information can
                              be extracted. Wavelets offer local description and separation of signal
                              characteristics and provide a tool for the simultaneous analysis of both
                              time and frequency. A wavelet basis is a set of orthonormal functions generated
                              by dilation and translation of a single scaling function, or father wavelet (φ), and
                              a mother wavelet (ψ). In this work we use the Haar wavelet family, which is a
                              sequence of rescaled "square-shaped" functions that together form a wavelet
                              basis. The extracted descriptor is obtained as the average energy of the horizontal,
                              vertical, or diagonal detail coefficients calculated up to the tenth decomposition
                              level.
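
The Haar energy descriptor described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names (`haar_step`, `mean_energy`, `haar_energy_descriptor`) are hypothetical, the labelling of the three detail bands follows one common convention, and the sketch assumes that the energies of all three orientations are concatenated per level and that decomposition stops early when the matrix can no longer be halved.

```python
def haar_step(m):
    """One level of the orthonormal 2D Haar transform.

    Expects a matrix (list of rows) with an even number of rows and columns.
    Returns the approximation band and the three detail bands.
    """
    approx, horiz, vert, diag = [], [], [], []
    for i in range(0, len(m), 2):
        a_row, h_row, v_row, d_row = [], [], [], []
        for j in range(0, len(m[0]), 2):
            a, b = m[i][j], m[i][j + 1]
            c, d = m[i + 1][j], m[i + 1][j + 1]
            a_row.append((a + b + c + d) / 2.0)  # low-pass in both directions
            h_row.append((a + b - c - d) / 2.0)  # change between rows (assumed "horizontal")
            v_row.append((a - b + c - d) / 2.0)  # change between columns (assumed "vertical")
            d_row.append((a - b - c + d) / 2.0)  # diagonal detail
        approx.append(a_row)
        horiz.append(h_row)
        vert.append(v_row)
        diag.append(d_row)
    return approx, horiz, vert, diag


def mean_energy(band):
    """Average of the squared coefficients of one sub-band."""
    values = [x * x for row in band for x in row]
    return sum(values) / len(values)


def haar_energy_descriptor(matrix, max_level=10):
    """Concatenate the mean detail energies (H, V, D) for each level."""
    features = []
    current = matrix
    for _ in range(max_level):
        rows, cols = len(current), len(current[0])
        if rows < 2 or cols < 2 or rows % 2 or cols % 2:
            break  # assumption: stop when the matrix cannot be halved again
        current, h, v, d = haar_step(current)
        features.extend([mean_energy(h), mean_energy(v), mean_energy(d)])
    return features
```

For example, `haar_energy_descriptor([[1, 1], [0, 0]])` yields one level with all of the energy in the band that measures change between rows. A matrix needs sides of at least 1024 for all ten levels to be computed.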

                          According to several studies in the literature, pattern perturbation is a good way
                       to improve the performance of an ensemble approach. Here an ensemble is obtained
                       using 50 reshapes for each pattern: for each reshape
                       the original features of the pattern are randomly sorted. In this way 50 SVMs are trained
                       for each approach, and these SVMs are combined by sum rule. In the next section only
                       the performance of the ensemble of SVMs is reported, since in (Loris Nanni et al.,
                       2012) it is shown that such an ensemble improves on the stand-alone version.
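
The perturbation and fusion steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the helper names are hypothetical, the row-major reshape with zero padding stands in for whichever reshaping method is used, and each random feature order is assumed to be fixed once and shared by all training and test patterns of the corresponding SVM. Training the SVMs themselves is left out; `sum_rule` only shows how the per-classifier scores would be fused.

```python
import math
import random


def make_permutations(n_features, n_reshapes=50, seed=0):
    """Draw one random feature order per ensemble member."""
    rng = random.Random(seed)
    perms = []
    for _ in range(n_reshapes):
        order = list(range(n_features))
        rng.shuffle(order)
        perms.append(order)
    return perms


def reshape_to_matrix(vector, n_cols):
    """Row-major reshape; the last row is zero-padded (padding is an assumption)."""
    n_rows = math.ceil(len(vector) / n_cols)
    padded = list(vector) + [0.0] * (n_rows * n_cols - len(vector))
    return [padded[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]


def perturbed_matrices(vector, perms, n_cols):
    """Apply every feature order to one pattern, then reshape each result."""
    return [reshape_to_matrix([vector[i] for i in p], n_cols) for p in perms]


def sum_rule(score_lists):
    """Fuse classifiers by summing their per-class scores.

    score_lists[k][c] is the score that classifier k assigns to class c.
    """
    n_classes = len(score_lists[0])
    return [sum(scores[c] for scores in score_lists) for c in range(n_classes)]
```

In use, classifier k would be trained on descriptors extracted from the k-th matrix of every training pattern, and the predicted class of a test pattern would be the index of the largest fused score.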


                       Experimental Results

                          To assess their versatility, the methods described above for reshaping a vector into a
                       matrix were tested on several datasets (see Table 1). All the tested data mining
                       datasets are taken from the well-known UCI datasets repository (Lichman, 2013),
                       except for the Tornado dataset (Trafalis, Ince, & Richman, 2003). In addition, two
                       datasets related to the image classification problem are included:

                          1.  BREAST: a dataset intended to classify samples of benign and malignant tissues
                              (for details see Junior, Cardoso de Paiva, Silva, & Muniz de Oliveira, 2009). To
                              obtain the features for each image, we select the 100 rotation invariant LTP
                              bins, with P = 16 and R = 2, that have the highest variance (considering only the
                              training data);
                          2.  PAP: a dataset intended to classify each cell extracted from a pap test as either
                              normal or abnormal (for details see Jantzen, Norup, Dounias, & Bjerregaard,
                              2005). A linear descriptor of size 100 is extracted using the same procedure
                              described above.
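
The variance-based selection used for both image datasets can be sketched as follows. This is a minimal illustration, assuming the LTP histograms have already been computed and are stored as one row of bin values per image; the function names are hypothetical. The key point is that the bin ranking is computed on the training rows only and then reused unchanged for the test rows.

```python
def column_variances(rows):
    """Population variance of every feature column."""
    n = len(rows)
    variances = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        mean = sum(column) / n
        variances.append(sum((x - mean) ** 2 for x in column) / n)
    return variances


def top_variance_indices(train_rows, k=100):
    """Indices of the k columns with the highest training-set variance."""
    variances = column_variances(train_rows)
    ranked = sorted(range(len(variances)), key=lambda j: variances[j], reverse=True)
    return sorted(ranked[:k])  # keep the original bin order


def select_columns(rows, indices):
    """Keep only the chosen columns, for training or test data alike."""
    return [[row[j] for j in indices] for row in rows]


# Toy usage with 3 bins and k = 2; the real setting keeps k = 100 bins.
train = [[0, 1, 10], [0, 2, -10], [0, 3, 10]]
test = [[7, 8, 9]]
kept = top_variance_indices(train, k=2)
train_selected = select_columns(train, kept)
test_selected = select_columns(test, kept)
```

Ranking on the training data alone avoids leaking information from the test set into the descriptor.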

                          A summary of the tested datasets, including the number of patterns and
                       the dimension of the original feature vector, is reported in Table 1. All the considered
                       datasets are two-class classification problems.