Friday, July 8, 2016

My Learning Notes on Artificial Neural Network

This post aims to summarize some key ideas for understanding the intuitions/theories of artificial neural network and its implementation, based on my own learning experiences. My approach is to ask questions and then explore answers for further understanding. It turns out that almost every question in the artificial neural network deserves much more efforts to dig deeper.

Note that this post DOES NOT aim to introduce the artificial neural network algorithm systematically. Many excellent comprehensive tutorials of deep learning and neural network can be found online. Some are listed in the post. Feel free to correct me if there is anything incorrect in this post. After all, I am also on the way of learning.

In March 2016, AlphaGo won the ancient Chinese board game "Go" against a human champion. A key to the winning of AlphaGo is the deep learning algorithm developed in it. Besides that, deep learning's achievements in many applications have made it very attractive, such as computer vision, speech recognition, and natural language processing. As known, deep learning is derived from artificial neural network with some additional techniques included. To learn deep learning, I guess that it is reasonable to learn artificial neural network first. 

As I make efforts to grasp the intuitions/theories and implementation of artificial neural network, I have found that the artificial neural network will make sense if we can understand the following key ideas.
  1. Activation Function
  2. Cost Function
  3. Gradient Descent Algorithm
  4. Backpropagation Algorithm
  5. Artificial Neural Network Architecture
Activation Function

The ultimate goal of artificial neural network or any kind of supervised machine learning algorithm is to find a functional relationship that maps the inputs to the outputs as accurate as possible. An advantage of artificial neural network over other machine learning algorithms is that it is capable of representing any kind of relationship, especially nonlinear relationship. Here is a visual proof that neural nets can compute any function. The more accurately an algorithm models the functional relationship between inputs and outputs, the more accurately the model generated by that algorithm can predict the unknown outputs with corresponding known inputs.

The technique that makes an artificial neural network able to represent any function is the activation function. Commonly used activation functions are log-sigmoid function, tan-sigmoid function, and softmax function.

To further explain why the activation function makes the neural network capable of representing any kind of relationship, let's take a look at the visualization of sigmoid function.


As shown in the above figure, roughly speaking, we can notice the following properties of the function .

For -1 <= z <= 1, the function maps a linear relationship.
For -5 < z < -1 and 1 < z < 5, the function maps a nonlinear relationship.
For z <= -5 and z >= 5, the function maps a constant relationship.

Back to the artificial neural network, it is made up of neurons and links between neurons. On each neuron, there are two steps to be done.
  1. Sum up the weighted inputs. 
  2. Calculate the activation function with the sum of the weighted inputs if the threshold value is reached.
In some tutorials, these two steps are not talked about explicitly. But I feel it is easy to understand what a neuron does in two steps.

For more details about the activation function, refer to artificial neural networks/activation function in wikibooks.

At this moment, it is natural to ask “Why do we use the above functions as activation functions in neural network?” We already have some idea about how they can map different kinds of relationship. Another benefit is that they all constrain the output in the range from 0 to 1. And then why do we want to constrain the output in the range from 0 to 1? Based on the section “Sigmoid Neurons” in Chapter 1 Using neural nets to recognize handwritten digits of Michael Nielsen’s book Neural Networks and Deep Learning, one reason is that small changes on the weights and bias of inputs can only cause small changes on the output. We can also notice that it is easy to interpret the output as the probability.

Cost Function

In the parametric models, such as regression and neural network, when we use different parameters in the model, we will get different predictions/outputs. We want to find a set of best parameters which minimize the difference between actual values and predictions.

How do we measure the difference between actual values and prediction (i.e. error)?

In general, there are two ways. , which is the

1. Sum of squared difference between predicted output and actual value of each observation in the training set.
2. Sum of negative log-likelihood between predicted output and actual value of each observation in the training set.

The function that describes the squared error or negative log-likelihood is called the cost function. And we want to minimize it.

Gradient Descent Algorithm

By this moment, we want to minimize some cost function, which is complicated and has no closed form in such situation like the artificial neural network.

How can we perform the optimization process?

The gradient descent algorithm can make the magic! It is an optimization technique used to reach a local minimum of a complex function gradually. Here is a beautiful and commonly used analogy about what the gradient descent algorithm does. A complex function can be visualized as adjacent mountains and valleys. There are tops and bottoms. Imagine we are currently standing at some point of a valley and want to climb down to the bottom. But the fog is very heavy and our sight is limited. We cannot find a whole path directly down to the bottom and walk down along that path. What we can only do is to choose the next step that can bring us down a little bit. There are many directions we can walk our next step toward. But the direction that brings us down the most is favored. Gradient descent algorithm is to find that direction and then put us down by one step.

Mathematically speaking, the gradient descent algorithm finds that direction by calculating the partial derivative of the function. What the function means here is the cost function. This brings another advantage of commonly used activation functions mentioned earlier – their partial derivatives are easy to compute.

There are two questions related to the gradient descent algorithm.
  1. What step size should we choose? If our step size is too small, it takes long to reach the bottom. But if our step size is too big, we may step over the bottom and miss it forever.
  2. What initial position (i.e. weights, biases) should we choose? We must start from somewhere. A set of parameters of weights and biases in the artificial neural network gives an initial position.
The rough answer I have for these two questions at the moment is to do some experiments. I guess as our experiences grow, we can make a better judgment on these two choices easier. Moreover, some scientific approach may be found in the literature.

There is one question related to the cost function.
  1. Why do we favor the quadratic format of the difference between predicted output and actual value as the cost function?
The first reason must be that it can help measure the prediction accuracy. For other reasons, when Michael Nielsen talks about the issue why we introduce the quadratic cost in this book, he mentions that the quadratic function is smooth and easy to figure out how to make small changes on weights and biases in order to improve the output.

Backpropagation Algorithm

The backpropagation algorithm is a set of rules to update the weights and biases of an artificial neural network by partitioning the prediction error into all neurons with the aid of gradient descent on each neuron.

Chapter 7 Neural Networks in the book Discovering Knowledge in Data: An Introduction to Data Mining illustrates a simple example how to perform backpropagation manually in the neural network by hand.

Artificial Neural Network Architecture

In the construction process of an artificial neural network, there are two hard questions related to hidden layers, compared to the input layer and output layer. Note that the input layer can be decided based on the feature engineering and the output layer is pretty straightforward.
  1. How many hidden layers should we include in order to achieve the best?
  2. How many neurons should we include in each hidden layer?

More sigmoid hidden layers and neurons will add the capacity of the neural network but will also tend to cause overfitting.

Many tutorials mention that some heuristics help. I guess it really takes lots of experiments and domain knowledge to make a quite good choice.