TL;DR **Adam** *works well in practice and outperforms other adaptive techniques.*

Use SGD+Nesterov for shallow networks, and either Adam or RMSprop for

I was taking Course 2, Improving Deep Neural Networks, on Coursera.

Week 2 of this course is about optimization algorithms. I find it helpful to develop a better intuition for how the different optimization algorithms work, even if we are only interested in applying deep learning to real-life problems.

Here are some takeaways, along with things I learned from some extra research.

**Adam: Adaptive moment estimation**

*Adam = RMSprop + Momentum*
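To make the "RMSprop + Momentum" idea concrete, here is a minimal sketch of one Adam update for a single scalar parameter. The function name `adam_step`, the toy loss `f(w) = w**2`, and the learning rate of 0.05 are my own assumptions for illustration; this is not the Keras implementation.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter w at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad        # momentum part: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # RMSprop part: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction, since m and v start at zero
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Toy problem (an assumption for illustration): minimise f(w) = w**2, whose gradient is 2*w.
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.05)
print(w)  # ends up close to the minimum at 0
```

The two running averages `m` and `v` are the only extra state Adam keeps per parameter, which is where its memory cost comes from.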

Some advantages of Adam include:

- Relatively low memory requirements (though higher than gradient descent and gradient descent with momentum)
- Usually works well even with little tuning of hyperparameters.

In Keras, we can define it like this.

```
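from tensorflow import keras  # or: import keras
keras.optimizers.Adam(learning_rate=0.001)  # older Keras releases spell this argument `lr`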
```

Momentum takes past gradients into account to smooth out the steps of gradient descent. It can be applied with batch gradient descent, mini-batch gradient descent or stochastic gradient descent.
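Here is a small sketch of the smoothing effect, using the exponentially weighted average form of momentum. The helper name `momentum_velocity` and the two made-up gradient sequences are assumptions for illustration. In a real optimizer the parameter update would then be `w = w - lr * v`.

```python
def momentum_velocity(grads, beta=0.9):
    """Return the running velocity after each gradient in `grads`."""
    v, history = 0.0, []
    for g in grads:
        v = beta * v + (1 - beta) * g  # exponentially weighted average of past gradients
        history.append(v)
    return history

zigzag = momentum_velocity([1.0, -1.0] * 50)  # gradient flips sign on every step
steady = momentum_velocity([1.0] * 100)       # gradient always points the same way

print(max(abs(v) for v in zigzag))  # stays small: the sign flips mostly cancel out
print(steady[-1])                   # approaches the raw gradient of 1.0
```

Directions that keep flipping get damped, while a direction that stays consistent keeps almost its full size. That is the "smoothing out the steps" behavior.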

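In Keras, we can enable SGD with Nesterov momentum like this. It works well for shallow networks. Note that `nesterov=True` only has an effect when `momentum` is greater than zero; the value 0.9 here is a common choice rather than one prescribed by the course.

```
from tensorflow import keras  # or: import keras
keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
```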

**Nesterov accelerated gradient (NAG)**

The intuition for how it accelerates gradient descent:

We’d like to have a smarter **ball**, a ball that has a notion of where it is going so that it knows to slow down before the hill slopes up again.

Here is an animation of gradient descent with multiple optimizers.

*Image Credit: CS231n*

Notice that the two momentum-based optimizers (**Green-Momentum**, **Purple-NAG**) show overshooting behavior, similar to a **ball** rolling down a hill.

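Nesterov momentum overshoots slightly less than standard momentum because it takes a "**gamble->correction**" approach: it first makes a jump in the direction of the accumulated gradient (the gamble), then measures the gradient at that look-ahead point and corrects.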
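The difference in overshooting can be checked on a toy problem. This is a minimal sketch, assuming the loss `f(w) = w**2` and hyperparameters I picked for illustration (`lr=0.05`, `mu=0.9`); the function names are hypothetical as well.

```python
def grad(w):
    return 2.0 * w  # gradient of the toy loss f(w) = w**2

def run(nesterov, steps=60, w=5.0, lr=0.05, mu=0.9):
    """Run (Nesterov) momentum from w and return every position visited."""
    v, path = 0.0, [w]
    for _ in range(steps):
        if nesterov:
            lookahead = w + mu * v             # gamble: jump along the accumulated velocity
            v = mu * v - lr * grad(lookahead)  # correction: use the gradient at the look-ahead point
        else:
            v = mu * v - lr * grad(w)          # standard momentum: gradient at the current point
        w = w + v
        path.append(w)
    return path

momentum_path = run(nesterov=False)
nag_path = run(nesterov=True)

# The minimum is at w = 0, so the most negative value reached is the overshoot.
print(min(momentum_path), min(nag_path))
```

Both runs roll past the minimum, but the Nesterov run goes less far past it before turning back.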

Adagrad makes big updates for infrequent parameters and small updates for frequent parameters. For this reason, it is **well-suited for dealing with sparse data**.

The main benefit of Adagrad is that **we don’t need to tune the learning rate manually**. Most implementations use a default value of 0.01 and leave it at that.

**Disadvantage**

Its main weakness is that its learning rate is always decreasing: the squared gradients accumulate in the denominator, so the effective step size keeps shrinking.
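Both properties of Adagrad, bigger steps for rarely updated parameters and a step size that keeps shrinking, show up in a small sketch. The setup is an assumption for illustration: one "frequent" parameter gets a gradient of 1.0 on every step, and one "rare" parameter gets the same gradient only on every 10th step.

```python
import math

def adagrad_step(w, grad, cache, lr=0.01, eps=1e-8):
    """One Adagrad update for a scalar parameter; returns the step size too."""
    cache = cache + grad ** 2                    # sum of all squared gradients so far; it never shrinks
    step = lr * grad / (math.sqrt(cache) + eps)
    return w - step, cache, abs(step)

freq_w = rare_w = 0.0
freq_cache = rare_cache = 0.0
freq_steps, rare_steps = [], []
for t in range(1, 101):
    freq_w, freq_cache, s = adagrad_step(freq_w, 1.0, freq_cache)
    freq_steps.append(s)
    if t % 10 == 0:  # the rare parameter only sees a gradient occasionally
        rare_w, rare_cache, s = adagrad_step(rare_w, 1.0, rare_cache)
        rare_steps.append(s)

print(freq_steps[0], freq_steps[-1])  # the step size keeps decaying
print(rare_steps[-1])                 # still larger than the frequent parameter's latest step
```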

AdaDelta is an extension of **AdaGrad** that aims to remove its decaying learning rate problem. It does this by using a decaying average of past squared gradients instead of their full sum.

Another thing with AdaDelta is that **we don’t even need to set a default learning rate**.
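Here is a rough sketch of the AdaDelta update rule for a scalar parameter, to show that no learning rate appears in it. The function name, the toy loss `f(w) = w**2`, and the values `rho=0.95` and `eps=1e-6` are assumptions for illustration, and library implementations may differ in details.

```python
import math

def adadelta_step(w, grad, avg_sq_grad, avg_sq_delta, rho=0.95, eps=1e-6):
    """One AdaDelta update; note there is no learning rate argument."""
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad ** 2     # decaying average, not a growing sum
    delta = -math.sqrt(avg_sq_delta + eps) / math.sqrt(avg_sq_grad + eps) * grad
    avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta ** 2  # decaying average of past updates
    return w + delta, avg_sq_grad, avg_sq_delta

# Toy problem (an assumption for illustration): minimise f(w) = w**2, whose gradient is 2*w.
w, avg_sq_grad, avg_sq_delta = 1.0, 0.0, 0.0
start = w
for _ in range(500):
    w, avg_sq_grad, avg_sq_delta = adadelta_step(w, 2 * w, avg_sq_grad, avg_sq_delta)
print(start, w)  # w has moved from its starting point toward the minimum at 0
```

The step size is set by the ratio of the two running averages, which takes the place of a hand-picked learning rate. On this toy problem the first steps are small because the update average starts at zero.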

Course 2 Improving Deep Neural Networks from Coursera
