
TL;DR: **Adam** *works well in practice and outperforms other adaptive techniques.*

Use SGD+Nesterov for shallow networks, and either Adam or RMSprop for deeper ones.

I was taking Course 2, Improving Deep Neural Networks, on Coursera.

Week #2 of this course covers optimization algorithms. I find it helpful to develop better intuition about how different optimization algorithms work, even if we are only interested in applying deep learning to real-life problems.

Here are some takeaways and things I have learned with some research.

**Adam: Adaptive moment estimation**

*Adam = RMSprop + Momentum*

Some advantages of Adam include:

- Relatively low memory requirements (though higher than gradient descent and gradient descent with momentum)
- Usually works well even with little tuning of hyperparameters.

In Keras, we can define it like this.

```
keras.optimizers.Adam(lr=0.001)
```
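To see what "RMSprop + Momentum" means concretely, here is a minimal sketch of the Adam update rule; the function and variable names are illustrative, not from any library:

```python
import math

# Sketch of the Adam update (illustrative names): a momentum-style
# first moment m plus an RMSprop-style second moment v, each with
# bias correction. This is why Adam = RMSprop + Momentum.
def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad       # momentum: moving average of gradients
    v = b2 * v + (1 - b2) * grad ** 2  # RMSprop: moving average of squared gradients
    m_hat = m / (1 - b1 ** t)          # bias correction for the zero initialization
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Usage: minimize f(w) = w^2, whose gradient is 2w.
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 1001):
    w, m, v = adam_step(w, 2 * w, m, v, t)
```

Note that the per-step update magnitude is roughly bounded by the learning rate, which is part of why Adam is stable with little tuning.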

Momentum takes past gradients into account to smooth out the steps of gradient descent. It can be applied with batch gradient descent, mini-batch gradient descent or stochastic gradient descent.
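The smoothing can be sketched in a few lines; this is one common formulation, with illustrative names:

```python
# Minimal sketch of the momentum update (illustrative names).
# v keeps an exponential moving average of past gradients, which
# smooths out the steps of gradient descent.
def momentum_step(w, grad, v, lr=0.01, beta=0.9):
    v = beta * v + grad  # accumulate past gradients
    w = w - lr * v       # step along the smoothed direction
    return w, v

# Usage: minimize f(w) = w^2, whose gradient is 2w.
w, v = 5.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, 2 * w, v)
```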

In Keras, we can do this to enable SGD + Nesterov; it works well for shallow networks.

```
keras.optimizers.SGD(lr=0.01, momentum=0.9, nesterov=True)
```

Note that `momentum` must be set to a nonzero value; with the default of 0, `nesterov=True` has no effect.

**Nesterov accelerated gradient (NAG)**

Here is the intuition for how it accelerates gradient descent.

We’d like to have a smarter **ball**, a ball that has a notion of where it is going so that it knows to slow down before the hill slopes up again.

Here is an animation of gradient descent with multiple optimizers.

*Image Credit: CS231n*

Notice that the two momentum-based optimizers (**Green: Momentum**, **Purple: NAG**) have overshooting behavior, similar to a **ball** rolling down the hill.

Nesterov momentum has slightly less overshooting compared to standard momentum, since it takes the "**gamble->correction**" approach, as shown below.
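The "gamble->correction" idea can be sketched as follows; the names are illustrative, not from any library:

```python
# Sketch of Nesterov accelerated gradient (illustrative names).
# Instead of using the gradient at w, NAG "gambles" by looking ahead
# to where momentum is about to take it, then "corrects" with the
# gradient evaluated at that lookahead point.
def nag_step(w, grad_fn, v, lr=0.01, beta=0.9):
    lookahead = w - lr * beta * v      # the gamble: where momentum is heading
    v = beta * v + grad_fn(lookahead)  # the correction: gradient at lookahead
    w = w - lr * v
    return w, v

grad = lambda w: 2 * w  # gradient of f(w) = w^2
w, v = 5.0, 0.0
for _ in range(200):
    w, v = nag_step(w, grad, v)
```

Because the gradient is evaluated at the lookahead point, the ball "knows" the slope is about to change and slows down earlier than plain momentum.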

**AdaGrad**

AdaGrad makes big updates for infrequent parameters and small updates for frequent parameters. For this reason, it is **well-suited for dealing with sparse data**.

The main benefit of AdaGrad is that **we don’t need to tune the learning rate manually**. Most implementations use a default value of 0.01 and leave it at that.

**Disadvantage**: its main weakness is that its learning rate is monotonically decreasing, which can eventually make the updates vanishingly small and stall training.
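Both the per-parameter adaptation and the decaying learning rate fall out of the same mechanism, sketched below with illustrative names:

```python
import numpy as np

# Sketch of the AdaGrad update (illustrative names). cache accumulates
# squared gradients per parameter, so the effective learning rate
# lr / sqrt(cache) shrinks as training goes on, while parameters that
# receive gradients rarely keep a larger effective learning rate.
def adagrad_step(w, grad, cache, lr=0.01, eps=1e-8):
    cache = cache + grad ** 2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache

# Usage: two parameters of f(w) = w^2; the second one only receives
# a gradient every 10th step, mimicking a sparse feature.
w, cache = np.array([5.0, 5.0]), np.zeros(2)
for t in range(100):
    g = 2 * w
    if t % 10 != 0:
        g[1] = 0.0  # sparse parameter: usually no gradient
    w, cache = adagrad_step(w, g, cache)
```

The frequently updated parameter accumulates a much larger `cache`, so each of its steps is smaller, which is exactly the decaying-learning-rate weakness described above.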

**AdaDelta**

AdaDelta is an extension of **AdaGrad** that removes its decaying learning rate problem.

Another nice thing about AdaDelta is that **we don’t even need to set a default learning rate**.
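A minimal sketch of the AdaDelta update shows why no learning rate is needed; the names are illustrative, not from any library:

```python
import math

# Sketch of the AdaDelta update (illustrative names). Note there is no
# learning-rate hyperparameter: the step size is the ratio of two
# running averages, one of squared updates and one of squared gradients.
def adadelta_step(w, grad, eg2, ed2, rho=0.95, eps=1e-6):
    eg2 = rho * eg2 + (1 - rho) * grad ** 2        # running avg of squared gradients
    delta = -math.sqrt(ed2 + eps) / math.sqrt(eg2 + eps) * grad
    ed2 = rho * ed2 + (1 - rho) * delta ** 2       # running avg of squared updates
    return w + delta, eg2, ed2

# Usage: minimize f(w) = w^2, whose gradient is 2w.
w, eg2, ed2 = 5.0, 0.0, 0.0
for _ in range(500):
    w, eg2, ed2 = adadelta_step(w, 2 * w, eg2, ed2)
```

Because the numerator tracks recent update sizes rather than a fixed constant, the step size adapts on its own instead of shrinking monotonically as in AdaGrad.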

Reference: Course 2, Improving Deep Neural Networks, Coursera.
