Page 311 - Data Science Algorithms in a Week

292                              Jose M. Prieto

                       being fixed in the network weights. The peaking effect is experienced when an excessive
                       number of hidden neurons minimizes the error in training but increases the error in testing.
                       Finally, network paralysis appears when excessive adjustment of the neuron weights drives
                       them to very high negative or positive values, which saturates the sigmoid activation
                       functions so that their derivative, and with it any further weight update, is close to zero
                       (Kröse & van der Smagt, 1996). These limitations must be taken into account
                       and minimized with an adequate choice of the network topology and a careful selection
                       of neuron parameters (function, weights, threshold, etc.).
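
The saturation behind network paralysis can be illustrated numerically. The following is a minimal sketch, assuming the standard logistic sigmoid; the function names and the sample net inputs are ours and are chosen only for illustration. It shows that the derivative y(1 - y), to which the weight adjustments are proportional, collapses once the net input to a neuron becomes large in either direction.

```python
import math

def sigmoid(x):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid, y * (1 - y), which scales the weight update."""
    y = sigmoid(x)
    return y * (1.0 - y)

def saturation_table(net_inputs):
    """Return (net input, output, derivative) for each illustrative net input."""
    return [(x, sigmoid(x), sigmoid_grad(x)) for x in net_inputs]

# Illustrative net inputs only: moderate values keep a usable derivative,
# while large positive or negative values drive it towards zero.
for x, y, dy in saturation_table([0.0, 2.0, 10.0, -10.0]):
    print(f"net input {x:6.1f}  output {y:.5f}  derivative {dy:.6f}")
```

The derivative is largest (0.25) at a net input of zero, which is one reason why small initial weights and scaled inputs are commonly recommended.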


                       External Factors


                          From our experience, the most problematic factors influencing the accuracy of the
                       predictions when dealing with data mining are noise (inaccurate data), normalisation of
                       the output to acceptable ranges (0-1 for better results) and topology complexity (too
                       many inputs).
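
As an illustration of the normalisation step, the sketch below rescales a set of output values to the 0-1 range with a plain min-max transformation. This is one common choice and is an assumption on our part, not necessarily the procedure used in the cited studies; the function names and the example values are hypothetical.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: there is no range to scale, so map them to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def min_max_unscale(scaled, lo, hi):
    """Map network outputs in the 0-1 range back to the original units."""
    return [lo + s * (hi - lo) for s in scaled]

# Hypothetical bioactivity values, used only to show the transformation.
activities = [12.0, 30.0, 48.0]
scaled = min_max_scale(activities)
restored = min_max_unscale(scaled, min(activities), max(activities))
```

The minimum and maximum must be taken from the training data and stored, so that predictions can be converted back to the original units with the same constants.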
                          In the case of very complex chemical entities, such as natural products, noise
                       reduction needs to be achieved by carefully selecting the data sets from papers with
                       similar values for the reference drugs. Bioassays are far from being performed in the
                       same way (i.e., with the same protocol) around the world. Even within the same institution
                       or laboratory, differences will arise between users, each modifying the protocol slightly
                       to adapt it to their needs. In this regard, it is of the utmost importance that all use the
                       same reference drug (antioxidant, antimicrobial, anti-inflammatory, etc.). However, the
                       reference drug varies widely across papers and is absent from some. The reduced number of
                       valid data available to train and validate the ANNs forces the use of small sets, which may
                       in turn induce bias (Bucinski, Zielinski & Kozlowska, 2004; Cortes-Cabrera & Prieto,
                       2010; Daynac, Cortes-Cabrera & Prieto, 2016). It would be tempting to discuss also the
                       physicochemical incompatibility of many synthetic drugs and natural products with most
                       of the milieu in which the bioassays are run (solvent polarity, microbiological/cell culture
                       media, etc.), due mostly to their volatility and poor solubility, but this would be beyond
                       the scope of this chapter.
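
The selection of comparable data sets described above could be sketched as follows. This is only an illustration under stated assumptions: the record layout, the field name `reference_value`, the use of the median as the centre and the 25% tolerance are all hypothetical choices of ours, not the criteria applied in the cited studies, and the records are invented placeholders.

```python
from statistics import median

def select_comparable(records, tolerance=0.25):
    """Keep records whose reference-drug value lies within a relative
    tolerance of the median reference value (assumed positive);
    records that report no reference drug are discarded."""
    reported = [r for r in records if r.get("reference_value") is not None]
    if not reported:
        return []
    centre = median(r["reference_value"] for r in reported)
    return [
        r for r in reported
        if abs(r["reference_value"] - centre) <= tolerance * centre
    ]

# Invented placeholder records: each stands for one paper's reported
# activity of the reference drug in a given bioassay.
records = [
    {"source": "paper_A", "reference_value": 10.0},
    {"source": "paper_B", "reference_value": 11.0},
    {"source": "paper_C", "reference_value": 25.0},   # protocol clearly differs
    {"source": "paper_D", "reference_value": None},   # no reference drug reported
    {"source": "paper_E", "reference_value": 9.5},
]
kept = [r["source"] for r in select_comparable(records)]
```

A tighter tolerance reduces noise but also shrinks the already small training and validation sets, so the threshold is a trade-off rather than a fixed rule.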
                          The challenge in modeling the activity of essential oils is mainly the selection of
                       inputs and the topology. Ideally, the data set would include all variables
                       influencing the bioactivity to be modelled (of the vector). In practice, more than 30 such
                       inputs add tremendous complexity to the network, and the number of inputs
                       used in other ANNs is generally far lower than the number of variables in the data sets we
                       are able to generate. On the other hand, restricting the input data set inevitably leads to
                       bias, but it is the only way to overcome this problem. Also, the restricted number of
                       comparable data present in the literature results in a low number of learning and validating
                       sets. These factors do not invalidate the use of ANNs but limit any generalization of the results