Page 126 - Data Science Algorithms in a Week

110 Loris Nanni, Sheryl Brahnam and Alessandra Lumini

technique to transfer data in one domain to another where hidden information can
be extracted. Wavelets offer local description and separation of signal
characteristics and provide a tool for the simultaneous analysis of both
time and frequency. A wavelet family is a set of orthonormal basis functions generated
by dilation and translation of a single scaling function, or father wavelet (φ), and
a mother wavelet (ψ). In this work we use the Haar wavelet family, a
sequence of rescaled "square-shaped" functions that together form a wavelet
basis. The extracted descriptor is the average energy of the horizontal,
vertical, or diagonal detail coefficients, calculated up to the tenth
decomposition level.
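
As an illustration of the energy descriptor described above, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes the pattern has already been reshaped into a matrix with even side lengths, and it stops early when a further level is impossible, so small matrices yield fewer than ten levels. The function names (`haar_step`, `mean_energy`, `haar_energy_descriptor`) and the labeling of the three detail bands are assumptions made for this example.

```python
def haar_step(m):
    """One level of the orthonormal 2D Haar transform on an even-sized matrix."""
    rows, cols = len(m) // 2, len(m[0]) // 2
    approx = [[0.0] * cols for _ in range(rows)]
    horiz = [[0.0] * cols for _ in range(rows)]
    vert = [[0.0] * cols for _ in range(rows)]
    diag = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            # Each 2x2 block [[a, b], [c, d]] maps to one coefficient per band.
            a, b = m[2 * i][2 * j], m[2 * i][2 * j + 1]
            c, d = m[2 * i + 1][2 * j], m[2 * i + 1][2 * j + 1]
            approx[i][j] = (a + b + c + d) / 2
            horiz[i][j] = (a + b - c - d) / 2  # difference between rows (assumed label)
            vert[i][j] = (a - b + c - d) / 2   # difference between columns (assumed label)
            diag[i][j] = (a - b - c + d) / 2
    return approx, horiz, vert, diag


def mean_energy(band):
    """Average of the squared coefficients in one detail band."""
    squares = [v * v for row in band for v in row]
    return sum(squares) / len(squares)


def haar_energy_descriptor(matrix, max_level=10):
    """Concatenate the average detail energies (H, V, D) of each level."""
    features = []
    approx = matrix
    for _ in range(max_level):
        n_rows, n_cols = len(approx), len(approx[0])
        # Stop when the approximation can no longer be halved.
        if n_rows < 2 or n_cols < 2 or n_rows % 2 or n_cols % 2:
            break
        approx, h, v, d = haar_step(approx)
        features.extend([mean_energy(h), mean_energy(v), mean_energy(d)])
    return features
```

For example, a 4 × 4 matrix admits two levels and therefore produces six energy values, three per level.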

According to several studies in the literature, pattern perturbation is a good way to
improve the performance of an ensemble approach. Here an ensemble is built using
50 reshapes for each pattern: for each reshape, the original features of the pattern are
randomly reordered. In this way 50 SVMs are trained for each approach, and these SVMs
are combined by the sum rule. In the next section only the performance of the ensemble
of SVMs is reported, since Nanni et al. (2012) show that such an ensemble improves on
the stand-alone version.
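
The perturbation scheme above can be sketched as follows. This is a hedged illustration rather than the authors' code: `train_fn` is a hypothetical callable that takes training patterns and labels and returns a scoring function (in the chapter's setting it would reshape each permuted vector into a matrix, extract a descriptor, and train an SVM). The helper names and the fixed seed are likewise assumptions.

```python
import random


def make_permutations(n_features, n_reshapes=50, seed=0):
    """Draw one random feature ordering per ensemble member."""
    rng = random.Random(seed)
    perms = []
    for _ in range(n_reshapes):
        perm = list(range(n_features))
        rng.shuffle(perm)
        perms.append(perm)
    return perms


def permute(pattern, perm):
    """Reorder the features of a single pattern."""
    return [pattern[k] for k in perm]


def train_ensemble(train_x, train_y, train_fn, perms):
    """Train one model per permutation; train_fn is a hypothetical trainer."""
    return [train_fn([permute(x, p) for x in train_x], train_y) for p in perms]


def sum_rule_score(pattern, models, perms):
    """Sum rule: add the scores of all members, each on its own ordering."""
    return sum(model(permute(pattern, p)) for model, p in zip(models, perms))
```

Each member must see the same feature ordering at test time as it did during training, which is why the permutations are stored alongside the models.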

Experimental Results

To assess their versatility, the methods described above for reshaping a vector into a
matrix were tested on several datasets (see Table 1). All the data mining
datasets are taken from the well-known UCI repository (Lichman, 2013),
except for the Tornado dataset (Trafalis, Ince, & Richman, 2003). In addition, two
datasets related to image classification are included:

1. BREAST: a dataset intended to classify samples of benign and malignant tissues
(for details see Junior, Cardoso de Paiva, Silva, & Muniz de Oliveira, 2009). To
build the feature vector for each image, we keep the 100 rotation-invariant LTP
bins (P = 16, R = 2) with the highest variance, computed on the training
data only;
2. PAP: a dataset intended to classify each cell extracted from a Pap test as either
normal or abnormal (for details see Jantzen, Norup, Dounias, & Bjerregaard,
2005). A linear descriptor of size 100 is extracted using the same procedure
described above.
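
The variance-based selection used for both image datasets can be sketched as below. This is a minimal example under stated assumptions: the LTP histograms are taken as already computed (one list of bin values per image), and the names `top_variance_indices` and `select` are introduced here for illustration only.

```python
from statistics import pvariance


def top_variance_indices(train_features, k=100):
    """Indices of the k bins with the highest variance over the training set."""
    n_bins = len(train_features[0])
    variances = [pvariance([row[j] for row in train_features]) for j in range(n_bins)]
    ranked = sorted(range(n_bins), key=lambda j: variances[j], reverse=True)
    return sorted(ranked[:k])  # keep the original bin order


def select(row, indices):
    """Apply the training-set selection to any pattern (training or test)."""
    return [row[j] for j in indices]
```

Because the indices are computed from the training data alone, the test patterns play no part in choosing which bins are kept.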

A summary of the tested datasets, including the number of patterns and
the dimension of the original feature vector, is reported in Table 1. All the
datasets considered are two-class classification problems.
