Quick Notes on How to Choose an Optimizer in Keras



TL;DR: Adam works well in practice and often outperforms other adaptive techniques.

Use SGD+Nesterov for shallow networks, and either Adam or RMSprop for deep nets.

I was taking Course 2, Improving Deep Neural Networks, on Coursera.

Week 2 of this course is about optimization algorithms. I find it helpful to develop better intuition about how different optimization algorithms work, even if we are only interested in applying deep learning to real-life problems.

Here are some takeaways from the course, along with things I learned from some additional research.


Adam: Adaptive moment estimation

Adam = RMSprop + Momentum
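
To make the "RMSprop + Momentum" idea concrete, here is a minimal sketch of a single Adam update for one scalar parameter. The function name `adam_step` and the toy objective f(w) = w**2 are assumptions made for illustration; this is not how Keras implements it internally.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter w (illustrative sketch)."""
    # Momentum part: exponentially weighted average of past gradients.
    m = beta1 * m + (1 - beta1) * grad
    # RMSprop part: exponentially weighted average of past squared gradients.
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction compensates for m and v being initialized at zero.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Step along the smoothed gradient, scaled by its recent magnitude.
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Toy problem (an assumption for illustration): minimize f(w) = w**2.
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2 * w  # derivative of w**2
    w, m, v = adam_step(w, grad, m, v, t, lr=0.01)
```

The two running averages `m` and `v` are the extra state Adam keeps per parameter, which is where its additional memory cost comes from.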

Some advantages of Adam include:

  • Relatively low memory requirements (though higher than gradient descent and gradient descent with momentum)
  • Usually works well even with little tuning of hyperparameters.

In Keras, we can define it like this.


What is Momentum?

Momentum takes past gradients into account to smooth out the steps of gradient descent. It can be applied with batch gradient descent, mini-batch gradient descent or stochastic gradient descent.
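
As a rough illustration of the smoothing effect, the sketch below keeps an exponentially weighted average of gradients and feeds it a made-up, oscillating gradient sequence. The function name `momentum_step` and the gradient values are assumptions chosen for the example.

```python
def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    """One gradient descent step with momentum (illustrative sketch)."""
    # velocity is an exponentially weighted average of past gradients.
    velocity = beta * velocity + (1 - beta) * grad
    # The parameter moves along the smoothed direction, not the raw gradient.
    w = w - lr * velocity
    return w, velocity

# Made-up noisy gradients that jump between 4.0 and -2.0 (their mean is 1.0).
w, velocity = 0.0, 0.0
for t in range(200):
    grad = 4.0 if t % 2 == 0 else -2.0
    w, velocity = momentum_step(w, grad, velocity)

# velocity now stays close to the mean gradient instead of swinging
# between 4.0 and -2.0 like the raw gradients do.
```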

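Stochastic gradient descent (SGD)

In Keras, we can enable SGD with Nesterov momentum like this. It works well for shallow networks. Note that nesterov=True only has an effect when momentum is greater than 0; 0.9 is a common choice. Older Keras versions name the learning rate argument lr instead of learning_rate.

keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)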

What is Nesterov momentum?

Nesterov accelerated gradient (NAG)

Here is the intuition for how it accelerates gradient descent.

We’d like to have a smarter ball, a ball that has a notion of where it is going so that it knows to slow down before the hill slopes up again.

Here is an animation of gradient descent with multiple optimizers.


Image Credit: CS231n

Notice that the two momentum-based optimizers (green: Momentum, purple: NAG) overshoot, similar to a ball rolling down a hill.

Nesterov momentum overshoots slightly less than standard momentum because it takes a "gamble, then correct" approach: it first jumps in the direction of the accumulated velocity, then measures the gradient at that look-ahead point and corrects the step.
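
The difference can be sketched in a few lines. This is a toy comparison on f(w) = w**2 using the common velocity formulation v = mu * v - lr * grad; the function names, step size and starting point are assumptions chosen to make the overshoot visible.

```python
def grad_f(w):
    # Gradient of the toy objective f(w) = w**2 (an illustrative assumption).
    return 2 * w

def momentum_step(w, v, lr=0.1, mu=0.9):
    # Standard momentum: gradient is measured at the current position.
    v = mu * v - lr * grad_f(w)
    return w + v, v

def nesterov_step(w, v, lr=0.1, mu=0.9):
    # The "gamble": jump ahead along the previous velocity.
    lookahead = w + mu * v
    # The "correction": gradient is measured at the look-ahead point.
    v = mu * v - lr * grad_f(lookahead)
    return w + v, v

def run(step, w=5.0, v=0.0, n_steps=50):
    # Record every position visited so the overshoot can be inspected.
    path = [w]
    for _ in range(n_steps):
        w, v = step(w, v)
        path.append(w)
    return path

momentum_path = run(momentum_step)
nesterov_path = run(nesterov_step)

# The minimum of f is at w = 0, so the most negative value on each path
# shows how far that optimizer rolled past the minimum.
momentum_overshoot = -min(momentum_path)
nesterov_overshoot = -min(nesterov_path)
```

On this toy problem both runs roll past w = 0, but the Nesterov run turns back sooner.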



Adagrad

Adagrad makes big updates for infrequent parameters and small updates for frequent parameters. For this reason, it is well-suited for dealing with sparse data.

The main benefit of Adagrad is that we don’t need to tune the learning rate manually. Most implementations use a default value of 0.01 and leave it at that.


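Its main weakness is that its effective learning rate only ever decreases: squared gradients keep accumulating in the denominator, so the updates can eventually become too small for the network to keep learning.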
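
Both properties show up in a small sketch of the Adagrad update. The function name `adagrad_step` and the two made-up parameters (one that receives a gradient on every step, one that receives a gradient only every tenth step) are assumptions for illustration.

```python
import math

def adagrad_step(w, grad, cache, lr=0.01, eps=1e-8):
    """One Adagrad update over a list of parameters (illustrative sketch)."""
    # Accumulate squared gradients per parameter; this sum never shrinks.
    cache = [c + g ** 2 for c, g in zip(cache, grad)]
    # Each parameter gets its own step size: lr / sqrt(accumulated squares).
    w = [wi - lr * g / (math.sqrt(c) + eps) for wi, g, c in zip(w, grad, cache)]
    return w, cache

# Parameter 0 sees a gradient on every step (frequent feature);
# parameter 1 sees one only every 10th step (infrequent feature).
w, cache = [0.0, 0.0], [0.0, 0.0]
for t in range(100):
    grad = [1.0, 1.0 if t % 10 == 0 else 0.0]
    w, cache = adagrad_step(w, grad, cache)

# Effective per-parameter learning rate after 100 steps.
effective_lr = [0.01 / (math.sqrt(c) + 1e-8) for c in cache]
```

The infrequent parameter ends up with the larger effective learning rate, and because `cache` only grows, neither rate can recover once it has shrunk.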



AdaDelta

AdaDelta is an extension of Adagrad that aims to remove its decaying learning rate problem.

Another thing with AdaDelta is that we don’t even need to set a default learning rate.
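
A minimal sketch of the AdaDelta update for one scalar parameter shows why no learning rate appears: the step size comes from the ratio of two running averages. The function name `adadelta_step` and the toy objective f(w) = w**2 are assumptions for illustration; note that the Keras `Adadelta` class still accepts a learning rate argument that scales the step.

```python
import math

def adadelta_step(w, grad, avg_sq_grad, avg_sq_delta, rho=0.95, eps=1e-6):
    """One AdaDelta update for a single scalar parameter (illustrative sketch)."""
    # Decaying average of squared gradients (replaces Adagrad's growing sum).
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad ** 2
    # Step size is the ratio of RMS(previous updates) to RMS(gradients).
    delta = -math.sqrt(avg_sq_delta + eps) / math.sqrt(avg_sq_grad + eps) * grad
    # Decaying average of squared updates, used by the next step.
    avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta ** 2
    return w + delta, avg_sq_grad, avg_sq_delta

# Toy problem (an assumption for illustration): minimize f(w) = w**2.
w, avg_sq_grad, avg_sq_delta = 1.0, 0.0, 0.0
for _ in range(500):
    grad = 2 * w  # derivative of w**2
    w, avg_sq_grad, avg_sq_delta = adadelta_step(w, grad, avg_sq_grad, avg_sq_delta)
```

Because the squared-gradient average decays instead of accumulating forever, the denominator can shrink again, which is how AdaDelta avoids Adagrad's ever-decreasing step size.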

Further Reading
