- Activation Function
- Cost Function
- Gradient Descent Algorithm
- Backpropagation Algorithm
- Artificial Neural Network Architecture
To further explain why the activation function makes the neural network capable of representing any kind of relationship, let's take a look at a plot of the sigmoid function.
As shown in the figure above, roughly speaking, we can notice the following properties of the function σ(z) = 1 / (1 + e^(-z)).
- For -1 <= z <= 1, the function is approximately linear.
- For -5 < z < -1 and 1 < z < 5, the function is clearly nonlinear.
- For z <= -5 or z >= 5, the function is approximately constant (it saturates near 0 or 1).
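To make these three regimes concrete, here is a minimal sketch (in Python with NumPy; the helper name `sigmoid` is ours, not from a library) that evaluates the function at a sample point in each regime.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Sample points from each regime described above.
for z in [-8.0, -3.0, -0.5, 0.5, 3.0, 8.0]:
    print(f"sigmoid({z:+.1f}) = {sigmoid(z):.4f}")
```

The printed values sit near 0 for z <= -5, near 1 for z >= 5, and change almost linearly around z = 0.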
Back to the artificial neural network: it is made up of neurons and the links between them. On each neuron, two steps are performed.
- Sum up the weighted inputs.
- Apply the activation function to the sum of the weighted inputs if the threshold value is reached (see the sketch below).
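Here is a minimal sketch of these two steps for a single neuron; the input, weight, and bias values are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(inputs, weights, bias):
    # Step 1: sum up the weighted inputs (plus a bias term).
    z = np.dot(weights, inputs) + bias
    # Step 2: pass the weighted sum through the activation function.
    return sigmoid(z)

x = np.array([0.5, -1.2, 3.0])   # inputs from the previous layer (made up)
w = np.array([0.4, 0.7, -0.2])   # one weight per input (made up)
b = 0.1                          # bias, which shifts the threshold
print(neuron_output(x, w, b))    # a single output in (0, 1)
```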
Cost Function
In parametric models, such as regression and neural networks, different parameter values produce different predictions/outputs. We want to find the set of parameters that minimizes the difference between the actual values and the predictions.
How do we measure the difference between the actual values and the predictions (i.e., the error)?
In general, there are two common ways:
1. Sum of the squared differences between the predicted output and the actual value over each observation in the training set.
2. Sum of the negative log-likelihoods of the actual values given the predicted outputs over each observation in the training set.
The function that describes the squared error or the negative log-likelihood is called the cost function, and we want to minimize it.
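As a sketch, both cost functions can be written in a few lines; `y_true` and `y_pred` are hypothetical arrays holding the actual values and the predicted outputs.

```python
import numpy as np

def squared_error_cost(y_true, y_pred):
    """Sum of squared differences over the training set."""
    return np.sum((y_true - y_pred) ** 2)

def negative_log_likelihood_cost(y_true, y_pred, eps=1e-12):
    """Sum of negative log-likelihoods for binary actual values,
    where y_pred holds predicted probabilities in (0, 1)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # guard against log(0)
    return -np.sum(y_true * np.log(y_pred)
                   + (1 - y_true) * np.log(1 - y_pred))

# Made-up example values.
y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.7])
print(squared_error_cost(y_true, y_pred))            # 0.14
print(negative_log_likelihood_cost(y_true, y_pred))  # ~0.69
```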
Gradient Descent Algorithm
How can we perform the optimization process?
The gradient descent algorithm works this magic! It is an optimization technique used to gradually reach a local minimum of a complex function. Here is a beautiful and commonly used analogy for what the gradient descent algorithm does. A complex function can be visualized as adjacent mountains and valleys, with tops and bottoms. Imagine we are currently standing somewhere on the slope of a valley and want to walk down to the bottom. But the fog is very heavy and our sight is limited, so we cannot see a whole path down to the bottom and simply follow it. All we can do is choose a next step that brings us down a little bit. There are many directions in which we could take that step, but the direction that brings us down the most is favored. The gradient descent algorithm finds that direction and then moves us down by one step.
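As a minimal sketch of the idea, here is gradient descent applied to a simple one-dimensional "valley", f(x) = (x - 3)^2; the starting point and step size are arbitrary choices.

```python
def f(x):
    """A simple valley whose bottom sits at x = 3."""
    return (x - 3) ** 2

def grad_f(x):
    """Derivative of f: the slope at our current position."""
    return 2 * (x - 3)

x = 10.0          # initial position, chosen arbitrarily
step_size = 0.1   # too small -> slow descent; too big -> overshoot
for _ in range(50):
    x -= step_size * grad_f(x)  # step in the direction of steepest descent
print(x)  # very close to 3, the bottom of the valley
```

The analogy also raises some practical questions: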
- What step size should we choose? If our step size is too small, it takes a long time to reach the bottom. But if it is too big, we may step over the bottom and miss it forever.
- What initial position (i.e., weights and biases) should we choose? We must start from somewhere. A set of weight and bias parameters in the artificial neural network gives an initial position.
- Why do we favor the quadratic form of the difference between the predicted output and the actual value as the cost function?
- How many hidden layers should we include to achieve the best performance?
- How many neurons should we include in each hidden layer?
Many tutorials mention that some heuristics help. I guess it really takes lots of experimentation and domain knowledge to make a reasonably good choice.