
MONOCLE QUARTERLY JOURNAL | DEEP LEARNING
Integral to this process of refining the weightings and biases between neurons to improve their predictions is the activation function. Think of an activation function as being at the heart of what the neuron does to transform the inputs it receives into an output. The signals that the neuron receives are first combined into a single value: the weighted sum of all the inputs received from neurons in the previous layer, plus a bias term. This number is then processed by the activation function that sits at the heart of the neuron. The function itself could be simply linear, a hyperbolic tangent, a threshold function or, most commonly, a sigmoid function. What is important is that it converts the weighted sum into a new value, which then becomes an input for the next layer of neurons. The input to the next neuron is itself taken through this process again, until finally the neural network’s last layer produces a single output value, which can then be compared with the expected result.
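To make the arithmetic behind that description concrete, here is a minimal sketch in Python (not from the article, with illustrative weights and a sigmoid activation assumed) of what a single neuron computes:

```python
import math

def sigmoid(z):
    # Squashes any number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias):
    # Weighted sum of all inputs from the previous layer, plus the bias term...
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # ...then the activation function turns that sum into the neuron's output
    return sigmoid(z)

# Illustrative values only: three signals arriving from the previous layer
inputs = [0.9, 0.1, 0.4]
weights = [0.5, -1.2, 0.8]
bias = -0.3
print(neuron_output(inputs, weights, bias))  # a single value between 0 and 1
```

A full network simply chains thousands of these small calculations together, layer after layer.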
When first training a network, the weightings and biases that are meant to distinguish important information from less important information are set at random. And naturally, because these weightings are random at first, the network will initially be very bad at predicting correct outcomes. To improve these predictions, the network must be trained through backpropagation, often using well-established mathematical techniques, specifically the calculation of partial derivatives for each input. Through backpropagation, the system repeatedly re-weights the inputs into each and every neuron at every level of the network, working backwards from the last layer to the first, in order to achieve what is now commonly known as deep learning.

Backpropagation relies on optimisation techniques, such as “gradient descent”, which make use of a cost function to evaluate the outcomes of the network and to steer it in the right direction as it refines its weightings. This cost function, in simple terms, determines how far off the network is with its predictions.
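As a rough sketch of how those partial derivatives are used (my own illustration, not the article’s, assuming a single sigmoid neuron and a squared-error cost), one gradient-descent step looks like this:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(inputs, weights, bias, target, learning_rate=0.5):
    # Forward pass: weighted sum plus bias, then the activation function
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    prediction = sigmoid(z)

    # Cost for this one example: how far off the prediction is
    cost = (prediction - target) ** 2

    # Backward pass: partial derivative of the cost with respect to z,
    # via the chain rule (the derivative of the sigmoid is a * (1 - a))
    dcost_dz = 2 * (prediction - target) * prediction * (1 - prediction)

    # Partial derivatives for each weight and the bias, then a small
    # step "downhill" against the slope
    weights = [w - learning_rate * dcost_dz * x for w, x in zip(weights, inputs)]
    bias = bias - learning_rate * dcost_dz
    return weights, bias, cost

# Illustrative numbers only: the cost shrinks as the steps are repeated
weights, bias = [0.5, -1.2, 0.8], -0.3
for _ in range(5):
    weights, bias, cost = train_step([0.9, 0.1, 0.4], weights, bias, target=1.0)
    print(round(cost, 4))
```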
Let us think back to the example of the image recognition network for handwritten numbers. Initially, when using random weightings, the network may light up, or activate, entirely the wrong neurons in the last layer, whose ten neurons each represent a number from zero to nine. When fed the handwritten number “3”, the network may at first light up the neurons for “8”, “6”, “5” and “3”, for example. To train a network using supervised learning, as is the case here, the cost function penalises the incorrect outputs using training data that is labelled with the correct output.
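Continuing that example with made-up numbers, a simple squared-error cost could compare the ten output activations against a label saying the image is a “3”; every wrongly activated neuron adds to the penalty:

```python
# Hypothetical activations of the ten output neurons (digits 0-9) after an
# untrained network is shown a handwritten "3"
outputs = [0.1, 0.0, 0.2, 0.7, 0.1, 0.6, 0.8, 0.0, 0.9, 0.1]

# The labelled training data says the correct answer is "3", so only that
# neuron should be fully active
target = [1.0 if digit == 3 else 0.0 for digit in range(10)]

# A simple squared-error cost: every wrongly lit-up neuron adds to the penalty
cost = sum((o - t) ** 2 for o, t in zip(outputs, target))
print(round(cost, 2))  # a large cost for a wildly wrong prediction
```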
To improve the network’s prediction accuracy, this cost function must be minimised. This is where the “gradient descent” method comes into play. The best way to visualise it is to imagine standing in a U-shaped valley. Your goal is to find the lowest point of the valley, representing the local minimum of the cost function. To do this, you calculate the slope at your current position in order to determine in which direction you must travel to reach the bottom. Starting from a random position on the hillside (representing the random initial weightings), you calculate the slope where you stand, take a small step downhill, and repeat.
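The valley analogy translates almost directly into code. The toy sketch below (my own, with an arbitrary U-shaped function standing in for a real cost function) repeatedly measures the slope and steps downhill:

```python
def cost(x):
    # A toy U-shaped "valley"; a real network's cost function is far more
    # complicated, but the idea is the same
    return (x - 2.0) ** 2 + 1.0

def slope(x, h=1e-6):
    # Numerical estimate of the slope at the current position
    return (cost(x + h) - cost(x - h)) / (2 * h)

position = 9.0        # a random starting point on the hillside
learning_rate = 0.1   # how big each downhill step should be

for _ in range(25):
    position -= learning_rate * slope(position)  # step against the slope

print(round(position, 3))  # settles near 2.0, the bottom of the valley
```

In a real network the same step is taken simultaneously across thousands of weightings and biases, not a single position.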