Overfitting can be a serious problem, especially with a small training dataset. The model might achieve great training accuracy, but when it meets new data in the real world that it has never seen, it does not generalize well.
The first and most intuitive solution is, of course, to train the model on a larger and more comprehensive dataset, or to apply data augmentation to the existing dataset, especially for images. But what if that's all the data we have?
In this post, we will explore two simple techniques for dealing with this issue using regularization in deep learning models.
Suppose you have just been hired as an AI expert by the French Football Corporation. They would like you to recommend positions where France's goalkeeper should kick the ball so that the French team's players can then hit it with their head.
Image credit: Coursera - Improving Deep Neural Networks
We have the following 2D dataset from France's past 10 games.
Your goal is to build a deep learning model to find the positions on the field where the goalkeeper should kick the ball.
This dataset is a little noisy, but it looks like a diagonal line separating the upper left half (blue) from the lower right half (red) would work well.
Let's build a simple Keras model with 3 hidden layers.
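A minimal sketch of such a model might look like the following. The layer widths (20, 40, 40), the Adam optimizer, and the helper name `build_baseline_model` are assumptions made for illustration; the post itself only tells us that the input is 2D, that there are three hidden layers, and (later on) that the dense_3 layer has a 40x40 weight matrix.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_baseline_model():
    """Binary classifier with three hidden layers and no regularization."""
    model = keras.Sequential([
        keras.Input(shape=(2,)),                               # (x, y) position on the field
        layers.Dense(20, activation="relu", name="dense_1"),   # width is an assumption
        layers.Dense(40, activation="relu", name="dense_2"),   # width is an assumption
        layers.Dense(40, activation="relu", name="dense_3"),   # 40x40 weight matrix
        layers.Dense(1, activation="sigmoid"),                 # probability of the positive class
    ])
    model.compile(optimizer="adam",                            # optimizer is an assumption
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model


model = build_baseline_model()
model.summary()
```

With these assumed widths, dense_3 is the layer with the most weights, which matches the description of it later in the post.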
Let's train the model on the training set and validate it on the test set for 1000 epochs.
It achieved the final training and validation accuracy shown below. The model performs better on the training data than on the validation data.
train acc:0.967, val acc: 0.930
Let's plot the train/validation accuracy and loss.
As we can see, after around 600 epochs the validation loss stops decreasing and begins to increase instead. This is a typical sign that the model is beginning to overfit the training dataset.
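The turning point can also be read off programmatically rather than by eye. The helper below is a small sketch: `best_epoch` is a hypothetical name, and the loss values are toy numbers for illustration, not results from this post.

```python
def best_epoch(val_loss):
    """Return the 1-based epoch at which the validation loss was lowest."""
    lowest = min(range(len(val_loss)), key=val_loss.__getitem__)
    return lowest + 1


# Toy numbers for illustration only, not results from this post.
toy_val_loss = [0.70, 0.52, 0.41, 0.38, 0.40, 0.45]
print(best_epoch(toy_val_loss))  # 4
```

In Keras, `model.fit(...)` returns a History object whose `history` dict holds a "val_loss" list when validation data is passed, so the same helper could be called as `best_epoch(history.history["val_loss"])`.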
To get a clear idea of what the final trained model is "thinking", let's plot its decision boundary.
As shown in the graph, the boundary is not clean, and the model is trying too hard to fit the outlier samples. This can become very pronounced for deep neural networks: they are capable of learning complex relationships between data points, but when trained on only a small dataset they tend to overfit the noisy outliers.
A standard way to reduce overfitting is L2 regularization. It penalizes large layer weights, and the penalties are added to the loss function.
The final regularized loss function therefore contains both the cross-entropy cost and the L2 regularization cost.
For example, the L2 regularization cost for layer "2" is the sum of the squared entries of its weight matrix "W2". We compute this for W2 and W3, sum the results, and multiply by the regularization factor, which controls how strong the regularization is.
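As a rough illustration of that calculation, here is a small NumPy sketch of the standard L2 penalty: the regularization factor times the sum of squared weights. The function name `l2_cost` and the tiny matrices are made up for the example; they are not the trained weights from this post.

```python
import numpy as np


def l2_cost(weight_matrices, factor):
    """Regularization factor times the sum of squared entries of every matrix."""
    return factor * sum(np.sum(np.square(W)) for W in weight_matrices)


# Small stand-in matrices so the arithmetic is easy to follow (not the real weights).
W2 = np.array([[1.0, -2.0], [0.5, 0.0]])
W3 = np.array([[3.0], [-1.0]])

# Squared entries sum to 5.25 for W2 and 10.0 for W3, so the cost is 0.003 * 15.25.
print(l2_cost([W2, W3], factor=0.003))
```

This value is added to the cross-entropy cost, so larger weights make the total loss larger and the optimizer is pushed toward smaller ones.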
In Keras, it is very easy to apply the L2 regularization to kernel weights.
Dense(40, activation='relu', kernel_regularizer=regularizers.l2(0.003))
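Put into a full model, this might look like the sketch below. Which layers carry the regularizer, the width of the first hidden layer, and the optimizer are assumptions; only the `Dense(40, ...)` line and the 0.003 factor come from the post.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

L2_FACTOR = 0.003  # the factor used in this post

reg_model = keras.Sequential([
    keras.Input(shape=(2,)),
    layers.Dense(20, activation="relu"),  # width is an assumption
    layers.Dense(40, activation="relu",
                 kernel_regularizer=regularizers.l2(L2_FACTOR)),
    layers.Dense(40, activation="relu",
                 kernel_regularizer=regularizers.l2(L2_FACTOR)),
    layers.Dense(1, activation="sigmoid"),
])
reg_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# One penalty term per regularized layer; Keras adds these to the loss during fit().
print(len(reg_model.losses))
```

Nothing else in the training code needs to change: the penalty terms are collected in `reg_model.losses` and included in the training loss automatically.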
We chose a factor of 0.003 for our Keras model, which achieved a final training and validation accuracy of
train acc:0.943, val acc: 0.940
Note that the final validation accuracy is very close to the training accuracy. This is a good sign that our model is not overfitting the training data.
The decision boundary is also much cleaner than that of the previous model without regularization.
Dropout is a widely used regularization technique in deep learning. It randomly shuts down some neurons in each iteration.
The probability of any neuron being shut down is set by the dropout rate parameter.
One thing to keep in mind: we only apply dropout at training time, since at test or inference time we want to use all of the neurons' learned weights. Don't worry, Keras handles this automatically: dropout layers are active only during training.
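To make the training-versus-inference behavior concrete, here is a small NumPy sketch of "inverted" dropout, the scheme in which the surviving activations are scaled up by 1 / (1 - rate) during training (the Keras Dropout layer documents the same scaling). The function below is a hand-written illustration, not Keras code, and its name and signature are made up for the example.

```python
import numpy as np


def dropout(activations, rate, training, rng):
    """Inverted dropout: zero units with probability `rate`, rescale the rest."""
    if not training:
        return activations                         # inference: every neuron is used
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)  # keep the expected value unchanged


rng = np.random.default_rng(0)
a = np.ones((2, 5))
print(dropout(a, rate=0.5, training=True, rng=rng))   # entries are 0.0 or 2.0
print(dropout(a, rate=0.5, training=False, rng=rng))  # unchanged
```

Because the kept activations are rescaled during training, nothing needs to be adjusted at inference time; the layer simply passes its input through.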
But how do we choose the dropout rate parameter?
Short answer: if you are not sure, 0.5 is a good starting point, since it provides the maximum amount of regularization.
It is also feasible to use a different dropout rate for each layer. If the preceding layer has a larger weight matrix, we can apply a larger dropout rate after it. In our example, we apply a larger dropout rate after the dense_3 layer, since it has the largest weight matrix (40x40), and a smaller dropout rate after the dense_2 layer.
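A per-layer setup along those lines could be sketched as follows. Both dropout rates here (0.3 and 0.5) are placeholder values chosen for illustration, as are the width of dense_1 and the optimizer; the post only says that the rate after dense_3 is larger than the rate after dense_2.

```python
from tensorflow import keras
from tensorflow.keras import layers

dropout_model = keras.Sequential([
    keras.Input(shape=(2,)),
    layers.Dense(20, activation="relu", name="dense_1"),  # width is an assumption
    layers.Dense(40, activation="relu", name="dense_2"),
    layers.Dropout(0.3),  # smaller rate after dense_2 (placeholder value)
    layers.Dense(40, activation="relu", name="dense_3"),
    layers.Dropout(0.5),  # larger rate after dense_3, the 40x40 layer (placeholder value)
    layers.Dense(1, activation="sigmoid"),
])
dropout_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

The rates are ordinary hyperparameters, so they are worth tuning against validation accuracy rather than taken from this sketch.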
Let's take a look at the result after applying the dropout to our model.
train acc:0.938, val acc: 0.935
The decision boundary is also quite smooth.
We explored two simple regularization recipes for deep learning models that overfit when trained on small datasets. Regularization drives weights toward lower values. L2 regularization and dropout are two effective regularization techniques. Check out the full source code for this post in my GitHub repo. Enjoy!