Stochastic Gradient Descent (SGD) is the algorithm many Deep Learning models use to optimize their parameters (the weights of each layer). Here is how it works:

At each step in the training process, the goal is to update the weights towards the optimal value. For this, SGD uses the equation:

$$new\;estimate = current\;estimate - (\nabla \times learning\;rate)$$

In this equation, the gradient indicates the direction of the solution (whether we are above or below it) and how far we are from it. The learning rate is a parameter that decides how quickly we change the estimate. It should not be set too high, otherwise the update may overshoot the right solution.
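The update rule above can be sketched on a toy problem. The quadratic error function, its gradient, and the constants below are illustrative choices, not from the article:

```python
# Minimize the toy error (w - 3)^2; its gradient with respect to w is 2 * (w - 3).
def gradient(w):
    return 2.0 * (w - 3.0)

learning_rate = 0.1  # illustrative value
w = 0.0              # current estimate

for _ in range(100):
    # new estimate = current estimate - (gradient * learning rate)
    w = w - gradient(w) * learning_rate

print(round(w, 4))  # → 3.0, the optimal value
```

Each step moves the estimate against the gradient, so it settles at the minimum of the error.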

How to compute the gradient?

There are several ways to calculate the gradients. The most intuitive one might be to see how the error changes when we slightly modify our guess (that's called a finite difference). For example, we increase the parameter by 0.01 and see whether the error goes up or down. We apply this simple formula:

$$\nabla = \frac{changed\;error - current\;error}{x}$$

Where changed error is the error after we change the parameter by x, current error is, well, the current error, and x is a small number.
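A minimal sketch of this finite-difference formula, using the same kind of toy quadratic error (an illustrative choice, not from the article):

```python
def error(w):
    # toy error function with its minimum at w = 3
    return (w - 3.0) ** 2

def finite_difference_gradient(w, x=0.01):
    # gradient ≈ (changed error - current error) / x
    changed_error = error(w + x)  # error after nudging the parameter by x
    current_error = error(w)
    return (changed_error - current_error) / x

g = finite_difference_gradient(0.0)
# the true gradient at w = 0 is 2 * (0 - 3) = -6; the estimate lands close to it
```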

Once the gradient is computed, it’s quite easy to update the parameters. But there are two issues with this method:

  1. If we reach the optimal parameter and don’t stop the algorithm, the estimate keeps bouncing around the optimum; with too high a learning rate these oscillations can grow until the estimate diverges
  2. It’s really slow

Several improvements have been made to the method described above to solve these problems.


The goal of the Momentum variant is to speed up the learning process by taking into account the previous gradients. The formula used to adjust each parameter becomes:

$$v_t = \alpha \times v_{t-1} + \nabla_t$$

$$new\;estimate = current\;estimate - (v_t \times learning\;rate)$$

With this formula, the velocity $v_t$ keeps track of where the gradient is going overall, to make sure the estimate moves there faster. $\alpha$ is the momentum: it sets how important the previous gradients are when computing the new update.
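A minimal sketch of a momentum update, assuming the standard formulation (a velocity term that accumulates past gradients); the toy objective and constants are illustrative:

```python
def gradient(w):
    return 2.0 * (w - 3.0)  # gradient of the toy error (w - 3)^2

learning_rate = 0.1
alpha = 0.9  # momentum: how much the previous gradients count
w, velocity = 0.0, 0.0

for _ in range(200):
    velocity = alpha * velocity + gradient(w)  # accumulate past gradients
    w = w - velocity * learning_rate

# w ends up close to the optimum at 3
```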

But a major issue remains unsolved. The parameters of a Deep Learning model do not all have the same order of magnitude: one can be around 0.5 and another around 20. Yet if they share the same learning rate, they are updated at the same pace. If the learning rate is low, the parameters of a high order of magnitude will never reach their optimal values because they move too slowly. This is where dynamic learning rates come in. This class of algorithms assigns a different learning rate to each parameter, so each of them evolves at its own pace.


AdaGrad (for Adaptive Gradient) takes into account how much the gradient of each parameter tends to change.

$$parameter\;learning\;rate = \frac{global\;learning\;rate}{\sqrt{ \sum_{n=0}^{t-1} (\nabla_n)^2}}$$

Here $\nabla_n$ is the gradient of a parameter at step n. We add up the squares of all the previous gradients of a parameter and take the square root of the sum. This formula is applied at each epoch, per parameter.

The result is that the learning rate keeps decreasing with each epoch. It therefore makes sense to start with a pretty high global learning rate, like 0.01.
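A sketch of AdaGrad on the same toy problem. The epsilon term (to avoid dividing by zero) and the global learning rate of 1.0, chosen so the toy converges quickly, are assumptions, not values from the article:

```python
import math

def gradient(w):
    return 2.0 * (w - 3.0)  # gradient of the toy error (w - 3)^2

global_learning_rate = 1.0   # illustrative; larger than typical, for this toy
w = 0.0
sum_squared_gradients = 0.0  # running sum of squared gradients
epsilon = 1e-8               # assumption: avoids division by zero

for _ in range(500):
    g = gradient(w)
    sum_squared_gradients += g ** 2
    # the per-parameter learning rate shrinks as squared gradients accumulate
    parameter_lr = global_learning_rate / (math.sqrt(sum_squared_gradients) + epsilon)
    w = w - parameter_lr * g
```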


At each mini-batch, RMSProp (Root Mean Square Propagation) combines a moving average of the previous squared gradients with the current gradient, squared.

$$moving\;average_t = \alpha \times moving\;average_{t-1} + \beta \times (\nabla_t)^2$$

$$parameter\;learning\;rate = \frac{global\;learning\;rate}{\sqrt{moving\;average_t}}$$

$\alpha$ and $\beta$ are both parameters and must sum to 1. Because the moving average quickly forgets old gradients, the learning rate adapts fast: RMSProp jumps around the optimal parameter values, and the denominator, unlike AdaGrad’s ever-growing sum, does not go to infinity, so the learning rate does not shrink toward zero.
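A sketch of RMSProp under the same assumptions (an exponential moving average of squared gradients with alpha + beta = 1, plus a small epsilon against division by zero, which is an assumption, not from the article):

```python
import math

def gradient(w):
    return 2.0 * (w - 3.0)  # gradient of the toy error (w - 3)^2

global_learning_rate = 0.1
alpha, beta = 0.9, 0.1  # must sum to 1
w = 0.0
moving_average = 0.0
epsilon = 1e-8  # assumption: avoids division by zero

for _ in range(500):
    g = gradient(w)
    # moving average of squared gradients: old average weighted by alpha,
    # current squared gradient weighted by beta
    moving_average = alpha * moving_average + beta * g ** 2
    parameter_lr = global_learning_rate / (math.sqrt(moving_average) + epsilon)
    w = w - parameter_lr * g
```

On this toy problem the estimate ends up jumping around the optimum at 3 rather than settling exactly on it, which illustrates the behaviour described above.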


Adam (Adaptive Moment Estimation) is really simple: it combines Momentum (a moving average of the gradient, the first moment) and RMSProp (a moving average of the squared gradient, the second moment) into a single algorithm.
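A sketch of Adam, assuming the standard formulation: moving averages of the gradient (first moment) and of the squared gradient (second moment), with bias correction for their zero initialization. The beta values and epsilon are the commonly used defaults, not values from the article:

```python
import math

def gradient(w):
    return 2.0 * (w - 3.0)  # gradient of the toy error (w - 3)^2

learning_rate = 0.1
beta1, beta2 = 0.9, 0.999  # decay rates for the two moments (common defaults)
epsilon = 1e-8             # assumption: avoids division by zero
w, m, v = 0.0, 0.0, 0.0

for t in range(1, 1001):
    g = gradient(w)
    m = beta1 * m + (1 - beta1) * g       # first moment: the Momentum part
    v = beta2 * v + (1 - beta2) * g ** 2  # second moment: the RMSProp part
    m_hat = m / (1 - beta1 ** t)          # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    w = w - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
```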