TL;DR: Adam works well in practice and outperforms other adaptive techniques.
Use SGD+Nesterov for shallow networks, and either Adam or RMSprop for deep networks.
I was taking Course 2, Improving Deep Neural Networks, on Coursera.
Week #2 of this course is about optimization algorithms. I find it helpful to develop better intuition about how different optimization algorithms work, even if we are only interested in applying deep learning to real-life problems.
Here are some takeaways and things I learned after doing some extra research.
Adam: Adaptive moment estimation
Adam = RMSprop + Momentum
Some advantages of Adam include relatively low memory requirements and the fact that it usually works well with little tuning of its hyperparameters (other than the learning rate).
In Keras, we can define it like this.
keras.optimizers.Adam(lr=0.001)
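To make the "RMSprop + Momentum" idea concrete, here is a minimal NumPy sketch of a single Adam update step. This is my own illustration with the usual beta1/beta2/epsilon names, not code from the course or from Keras.

import numpy as np

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g           # Momentum part: moving average of gradients
    v = beta2 * v + (1 - beta2) * g ** 2      # RMSprop part: moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # momentum direction, scaled per parameter
    return w, m, v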
Momentum takes past gradients into account to smooth out the steps of gradient descent. It can be applied with batch gradient descent, mini-batch gradient descent or stochastic gradient descent.
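As a rough sketch (my own illustration, not course code), the momentum update keeps a running "velocity" of past gradients and steps along it instead of the raw gradient:

def momentum_step(w, g, v, lr=0.01, beta=0.9):
    v = beta * v + g      # velocity accumulates past gradients, decayed by beta
    w = w - lr * v        # step along the smoothed direction
    return w, v

In Keras, momentum is just the momentum argument of SGD, e.g. keras.optimizers.SGD(lr=0.01, momentum=0.9).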
In Keras, we can enable SGD + Nesterov like this (Nesterov only has an effect when momentum is non-zero); it works well for shallow networks.
keras.optimizers.SGD(lr=0.01, momentum=0.9, nesterov=True)
Nesterov accelerated gradient (NAG)
Intuition for how it accelerates gradient descent:
We’d like to have a smarter ball, a ball that has a notion of where it is going so that it knows to slow down before the hill slopes up again.
Here is an animation of gradient descent with multiple optimizers.
Image Credit: CS231n
Notice that the two momentum-based optimizers (Green: Momentum, Purple: NAG) have overshooting behavior, similar to a ball rolling down a hill.
Nesterov momentum has slightly less overshooting compared to standard momentum, since it takes the "gamble -> correction" approach, as shown below.
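Here is a minimal sketch of that difference (again my own illustration): classical momentum measures the gradient at the current position, while NAG first "gambles" a step along the accumulated velocity and then corrects with the gradient taken at that lookahead point.

def nag_step(w, grad_fn, v, lr=0.01, beta=0.9):
    w_lookahead = w - lr * beta * v       # gamble: jump ahead along the current velocity
    v = beta * v + grad_fn(w_lookahead)   # correction: use the gradient at the lookahead point
    w = w - lr * v
    return w, v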
Adagrad
It makes big updates for infrequent parameters and small updates for frequent parameters. For this reason, it is well-suited for dealing with sparse data.
The main benefit of Adagrad is that we don’t need to tune the learning rate manually. Most implementations use a default value of 0.01 and leave it at that.
Disadvantage:
Its main weakness is that its learning rate is always decreasing and decaying, so training can eventually slow to a near halt.
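A minimal sketch of the Adagrad update (my own illustration) shows both the per-parameter scaling and why the learning rate keeps decaying: the cache of squared gradients only ever grows, so the effective step size only shrinks.

def adagrad_step(w, g, cache, lr=0.01, eps=1e-8):
    cache = cache + g ** 2                   # sum of all squared gradients so far; it never shrinks
    w = w - lr * g / (cache ** 0.5 + eps)    # rarely-updated parameters get larger effective steps
    return w, cache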
AdaDelta
AdaDelta is an extension of AdaGrad that tends to remove its decaying learning rate problem.
Another nice thing about AdaDelta is that we don't even need to set a default learning rate.
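A rough sketch of the idea (my own illustration): AdaDelta replaces Adagrad's ever-growing sum with a decaying average of squared gradients, and sizes each step from a decaying average of past updates, so no global learning rate is needed.

def adadelta_step(w, g, avg_sq_grad, avg_sq_delta, rho=0.95, eps=1e-6):
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g ** 2                       # decaying average, not a running sum
    delta = ((avg_sq_delta + eps) ** 0.5) / ((avg_sq_grad + eps) ** 0.5) * g   # step sized from past update magnitudes
    avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta ** 2
    w = w - delta
    return w, avg_sq_grad, avg_sq_delta

In Keras, we can simply use keras.optimizers.Adadelta() and leave its parameters at their defaults.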