Page 311 - Data Science Algorithms in a Week
292 Jose M. Prieto
being fixed in the network weights. The peaking effect is experienced when an excessive
number of hidden neurons minimizes error in training but increases error in testing. Finally,
network paralysis appears when excessive adjustment of the neuron weights drives them to
large negative or positive values, pushing sigmoid activation functions into saturation,
where weight updates become close to zero (Kröse & van der Smagt, 1996). These limitations
must be taken into account and minimized through an adequate choice of network topology
and a careful selection of neuron parameters (function, weights, threshold, etc.).
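The paralysis effect can be illustrated numerically. The sketch below is only an illustration, not code from the chapter: it assumes a standard logistic sigmoid, and the function names and net input values are hypothetical. It shows that for large positive or negative net inputs the derivative of the sigmoid, to which backpropagation weight updates are proportional, becomes vanishingly small.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic activation, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sigmoid_gradient(x: float) -> float:
    """Derivative of the sigmoid, y * (1 - y), evaluated at net input x."""
    y = sigmoid(x)
    return y * (1.0 - y)


# Hypothetical net inputs: one moderate value and two extreme ones,
# as might result from very large positive or negative weights.
for net_input in (0.0, 20.0, -20.0):
    print(net_input, sigmoid(net_input), sigmoid_gradient(net_input))
```

At a net input of 0 the gradient takes its maximum value of 0.25, whereas at ±20 it is many orders of magnitude smaller, so learning effectively stalls.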
External Factors
In our experience, the most problematic factors influencing the accuracy of the
predictions when dealing with data mining are noise (inaccurate data), normalisation of
the output to acceptable ranges (0-1 for better results) and topology complexity (too
many inputs).
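As an illustration of normalising outputs to the 0-1 range, the following minimal sketch uses min-max scaling. The chapter does not state which normalisation method was used, so min-max scaling is an assumption here, and the function names and activity values are hypothetical.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so that min -> 0.0 and max -> 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to scale, map everything to 0.0.
        return [0.0 for _ in values]
    span = hi - lo
    return [(v - lo) / span for v in values]


def min_max_unscale(scaled, lo, hi):
    """Map values in the 0-1 range back to the original units."""
    return [s * (hi - lo) + lo for s in scaled]


# Hypothetical bioactivity readings (arbitrary units), for illustration only.
raw_activity = [12.5, 40.0, 95.0, 67.5]
scaled = min_max_scale(raw_activity)
restored = min_max_unscale(scaled, min(raw_activity), max(raw_activity))
print(scaled)
print(restored)
```

Keeping the original minimum and maximum makes it possible to convert network predictions back to the units of the bioassay.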
In the case of very complex chemical entities, such as natural products, noise
reduction needs to be achieved by carefully selecting data sets from papers with
similar values for the reference drugs. Bioassays are far from being performed in the
same way (i.e., with the same protocol) around the world. Even within the same institution
or laboratory, differences will arise between users, each modifying the protocol slightly
to adapt it to their needs. In this regard, it is of utmost importance that all studies use
the same reference drug (antioxidant, antimicrobial, anti-inflammatory, etc.). However, the
reference drug varies widely across papers and is absent from some. The small amount of
valid data available to train and validate the ANNs forces the use of small sets, which may
in turn induce bias (Bucinski, Zielinski & Kozlowska, 2004; Cortes-Cabrera & Prieto,
2010; Daynac, Cortes-Cabrera & Prieto, 2016). It would also be tempting to discuss the
physicochemical incompatibility of many synthetic drugs and natural products with most
of the environments in which the bioassays are run (solvent polarity, microbiological/cell
culture media, etc.), due mostly to their volatility and poor solubility, but this is beyond
the scope of this chapter.
The challenge in modeling the activity of essential oils lies mainly in the selection of
inputs and the topology. Ideally, the data set would include all variables (components of
the input vector) influencing the bioactivity to be modelled. In practice, more than 30 such
inputs add tremendous complexity to the network, and the number of inputs used in other
ANNs is generally far lower than the number we are able to generate. On the other
hand, restricting the input data set inevitably leads to bias, but it is the only way
forward to overcome this problem. Also, the restricted amount of comparable
data present in the literature results in a low number of learning and validation sets. These
factors do not invalidate the use of ANNs but limit any generalization of the results