If you have trained your deep learning model for a while and its accuracy is still quite low, you might want to check whether it is suffering from vanishing or exploding gradients.
Backprop has difficulty changing weights in the earlier layers of a very deep neural network. During gradient descent, as the error gradient is backpropagated toward the input, it is repeatedly multiplied by each layer's weights and activation derivatives; across many layers these products can shrink exponentially toward zero (vanishing gradients) or grow exponentially out of control (exploding gradients).
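The mechanics can be sketched in a few lines of NumPy: the gradient reaching an early layer is roughly a product of one factor per layer (weight times activation derivative), and a long product of factors below or above 1 shrinks or grows exponentially with depth. The numbers below are purely illustrative, not from any real network.

```python
import numpy as np

depth = 50            # a toy 50-layer chain
small_factor = 0.5    # e.g. saturated sigmoid derivatives
large_factor = 1.5    # e.g. weights a bit too large

# The gradient signal reaching layer 1 scales like factor ** depth.
vanishing = small_factor ** depth
exploding = large_factor ** depth

print(vanishing)  # ~8.9e-16: early layers barely receive any signal
print(exploding)  # ~6.4e+08: updates blow up
```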
A very deep network can represent very complex functions and learn features at many different levels of abstraction, from edges (at the lower layers) to very complex features (at the deeper layers). For example, earlier ImageNet models like VGG16 and VGG19 strove for higher image classification accuracy by adding more layers. But the deeper the network becomes, the harder it is to update the earlier layers' parameters.
Vanishing gradients also appear in sequential models built on recurrent neural networks, making them ineffective at capturing long-range dependencies.
Take this sentence as an example, where the model is trained to generate text:
'Those cats caught a fish, ....., they were very happy.' The RNN needs to remember that the word 'cats' is plural in order to generate the word 'they' later in the sentence.
Here is an unrolled recurrent network showing the idea.
layer = Dense(32)
layer.get_weights()
With an understanding of how vanishing/exploding gradients might happen, here are some simple solutions you can apply in the Keras framework.
The vanilla recurrent neural network doesn't have a sophisticated mechanism to 'trap' long-term dependencies. In contrast, modern RNNs like the LSTM and GRU introduce the concept of "gates" to deliberately retain those long-term memories.
To put it simply, the GRU (Gated Recurrent Unit) has two "gates".
One gate, called the update gate, decides whether to update the current memory cell with the candidate value. The candidate value is computed from the previous memory cell output and the current input. Compare this to the vanilla RNN, where the candidate value directly replaces the memory cell value.
The second gate, the relevance gate, tells how relevant the previous memory cell output is for computing the current candidate value.
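The two gates can be sketched as a single GRU time step in plain NumPy. This is a simplified illustration, not Keras' internal implementation; the parameter names (Wz, Wr, Wh and their biases) are hypothetical, and each weight matrix acts on the previous hidden state concatenated with the current input.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, params):
    # Hypothetical parameter names, not Keras internals.
    Wz, Wr, Wh, bz, br, bh = params
    hx = np.concatenate([h_prev, x])
    z = sigmoid(Wz @ hx + bz)   # update gate
    r = sigmoid(Wr @ hx + br)   # relevance (reset) gate
    # Candidate value, computed from the (gated) previous memory and input:
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x]) + bh)
    # The update gate blends candidate and previous memory; z near 0
    # keeps the old memory intact -- the "trap" for long-term information.
    return z * h_tilde + (1.0 - z) * h_prev
```

With all-zero parameters, both gates output 0.5 and the candidate is 0, so the new state is simply half the previous state: the old memory is partially carried forward rather than overwritten, unlike a vanilla RNN.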
In Keras, it is trivial to add an LSTM/GRU layer to your network.
Here is a minimal model containing an LSTM layer, which can be applied to sentiment analysis.
from keras.layers import Dense, Dropout, Embedding, LSTM
from keras.models import Sequential

model = Sequential()
model.add(Embedding(input_dim=1000, output_dim=128, input_length=10))
model.add(LSTM(units=64))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
Another remedy for very deep convolutional networks is the residual "shortcut" connection popularized by ResNet: by adding a block's input back to its output, gradients can flow directly to earlier layers. Here is an implementation of the ResNet identity block.

from keras.initializers import glorot_uniform
from keras.layers import Add, Activation, BatchNormalization, Conv2D

def identity_block(X, f, filters, stage, block):
    """
    Implementation of the identity block

    Arguments:
    X -- input tensor of shape (m, n_H_prev, n_W_prev, n_C_prev)
    f -- integer, specifying the shape of the middle CONV's window for the main path
    filters -- python list of integers, defining the number of filters in the CONV layers of the main path
    stage -- integer, used to name the layers, depending on their position in the network
    block -- string/character, used to name the layers, depending on their position in the network

    Returns:
    X -- output of the identity block, tensor of shape (n_H, n_W, n_C)
    """
    # defining name basis
    conv_name_base = 'res' + str(stage) + block + '_branch'
    bn_name_base = 'bn' + str(stage) + block + '_branch'

    # Retrieve Filters
    F1, F2 = filters

    # Save the input value. You'll need this later to add back to the main path.
    X_shortcut = X

    # First component of main path
    X = Conv2D(filters=F1, kernel_size=(1, 1), strides=(1, 1), padding='valid',
               name=conv_name_base + '2a',
               kernel_initializer=glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis=3, name=bn_name_base + '2a')(X)
    X = Activation('relu')(X)

    # Second component of main path
    X = Conv2D(filters=F2, kernel_size=(1, 1), strides=(1, 1), padding='valid',
               name=conv_name_base + '2b',
               kernel_initializer=glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis=3, name=bn_name_base + '2b')(X)

    # Final step: Add shortcut value to main path, and pass it through a RELU activation
    X = Add()([X, X_shortcut])
    X = Activation('relu')(X)

    return X
Keras also ships pre-trained ResNet models that you can load in one line:

import keras

model = keras.applications.resnet50.ResNet50(weights='imagenet', include_top=True)
Another simple fix is choosing the ReLU activation, whose gradient is 1 for positive inputs, so it does not saturate the way sigmoid or tanh do:

from keras.layers import Activation, Dense

model.add(Dense(64))
model.add(Activation('relu'))
Keras' default weight initializer is glorot_uniform, a.k.a. the Xavier uniform initializer, and the default bias initializer is "zeros", so you should be good to go by default.
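For reference, glorot_uniform draws each weight from a uniform distribution U(-limit, limit) where the limit depends on the layer's fan-in and fan-out; this keeps activation and gradient variance roughly constant across layers. A small sketch of the limit formula:

```python
import numpy as np

def glorot_uniform_limit(fan_in, fan_out):
    # Xavier/Glorot uniform: weights ~ U(-limit, limit)
    return np.sqrt(6.0 / (fan_in + fan_out))

# For a Dense(32) layer fed 64 inputs:
print(glorot_uniform_limit(64, 32))  # 0.25
```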
As the name suggests, gradient clipping clips parameters' gradients during backpropagation, either to a maximum value or to a maximum norm, so that a single huge gradient cannot blow up the weights.
Both ways are supported by Keras.
from keras import optimizers

# All parameter gradients will be clipped to
# a maximum value of 0.5 and
# a minimum value of -0.5.
sgd = optimizers.SGD(lr=0.01, clipvalue=0.5)

# All parameter gradients will be clipped to
# a maximum norm of 1.
sgd = optimizers.SGD(lr=0.01, clipnorm=1.)
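What these two options do can be sketched in plain NumPy (a simplification of what the optimizer applies to each gradient tensor): clipvalue bounds every element independently, while clipnorm rescales the whole gradient, preserving its direction.

```python
import numpy as np

def clip_by_value(grad, value):
    # clipvalue: force each element into [-value, value]
    return np.clip(grad, -value, value)

def clip_by_norm(grad, max_norm):
    # clipnorm: rescale the whole gradient if its L2 norm
    # exceeds max_norm, keeping the gradient's direction
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad

g = np.array([3.0, -4.0])     # L2 norm = 5
print(clip_by_value(g, 0.5))  # [ 0.5 -0.5]
print(clip_by_norm(g, 1.0))   # [ 0.6 -0.8], norm = 1
```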
Regularization applies penalties to layer parameters (weights and biases) during optimization, which keeps the weights small.
Take the tanh activation function as an example: when its input value is small, tanh is almost linear and its derivative is close to 1, so gradients pass through the layer without shrinking much.
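You can verify this near-linearity numerically: for small inputs tanh(x) is essentially x and its derivative 1 - tanh(x)^2 is close to 1, while for large inputs tanh saturates and the gradient is nearly killed.

```python
import numpy as np

x = 0.01                     # small input: almost linear regime
print(np.tanh(x))            # ~0.0099997, essentially x itself
print(1 - np.tanh(x) ** 2)   # ~0.9999, derivative close to 1

x = 3.0                      # large input: tanh saturates
print(np.tanh(x))            # ~0.995
print(1 - np.tanh(x) ** 2)   # ~0.0099, gradient nearly killed
```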
In Keras, using regularizers is as easy as this:
from keras import regularizers

model.add(Dense(64, input_dim=64,
                kernel_regularizer=regularizers.l2(0.01),
                activity_regularizer=regularizers.l2(0.01)))
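Concretely, regularizers.l2(0.01) adds 0.01 times the sum of squared entries of the regularized tensor to the loss, which a quick NumPy sketch makes clear:

```python
import numpy as np

def l2_penalty(weights, l2=0.01):
    # The penalty an l2 regularizer adds to the loss for one tensor
    return l2 * np.sum(np.square(weights))

W = np.array([[1.0, -2.0],
              [0.5,  0.0]])
print(l2_penalty(W))  # 0.01 * (1 + 4 + 0.25 + 0) = 0.0525
```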
Keras usage of regularizers: https://keras.io/regularizers/
LSTM/GRU in Keras: https://keras.io/layers/recurrent/
Other Keras weight initializers to take a look at: https://keras.io/initializers/