How to do Hyper-parameters search with Bayesian optimization for Keras model

(Comments)

search

Compared to more simpler hyperparameter search methods like grid search and random search, Bayesian optimization is built upon Bayesian inference and Gaussian process with an attempts to find the maximum value of an unknown function as few iterations as possible. It is particularly suited for optimization of high-cost functions like hyperparameter search for deep learning model, or other situations where the balance between exploration and exploitation is important.

The Bayesian Optimization package we are going to use is BayesianOptimization, which can be installed with the following command,

pip install bayesian-optimization

Firstly, we will specify the function to be optimized, in our case, hyperparameters search, the function takes a set of hyperparameters values as inputs, and output the evaluation accuracy for the Bayesian optimizer. Inside the function, a new model will be constructed with the specified hyperparameters, train for a number of epochs and evaluated against a set metrics. Every new evaluated accuracy will become a new observation for the Bayesian optimizer, which contributes to the next search hyperparameters' values. 

Let's create a helper function first which builds the model with various parameters.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, Dropout, BatchNormalization, MaxPooling2D, Flatten, Activation
from tensorflow.python.keras.optimizer_v2 import rmsprop


def get_model(input_shape, dropout2_rate=0.5):
    """Builds a Sequential CNN model to recognize MNIST.

    Args:
      input_shape: Shape of the input depending on the `image_data_format`.
      dropout2_rate: float between 0 and 1. Fraction of the input units to drop for `dropout_2` layer.

    Returns:
      a Keras model

    """
    # Reset the tensorflow backend session.
    # tf.keras.backend.clear_session()
    # Define a CNN model to recognize MNIST.
    model = Sequential()
    model.add(Conv2D(32, kernel_size=(3, 3),
                     activation='relu',
                     input_shape=input_shape,
                     name="conv2d_1"))
    model.add(Conv2D(64, (3, 3), activation='relu', name="conv2d_2"))
    model.add(MaxPooling2D(pool_size=(2, 2), name="maxpool2d_1"))
    model.add(Dropout(0.25, name="dropout_1"))
    model.add(Flatten(name="flatten"))
    model.add(Dense(128, activation='relu', name="dense_1"))
    model.add(Dropout(dropout2_rate, name="dropout_2"))
    model.add(Dense(NUM_CLASSES, activation='softmax', name="dense_2"))
    return model

Then, here is the function to be optimized with Bayesian optimizer, the partial function takes care of two arguments - input_shape and verbose in fit_with which have fixed values during the runtime.

The function takes two hyperparameters to search, the dropout rate for the "dropout_2" layer and learning rate value, it trains the model for 1 epoch and outputs the evaluation accuracy for the Bayesian optimizer.

def fit_with(input_shape, verbose, dropout2_rate, lr):

    # Create the model using a specified hyperparameters.
    model = get_model(input_shape, dropout2_rate)

    # Train the model for a specified number of epochs.
    optimizer = rmsprop.RMSProp(learning_rate=lr)
    model.compile(loss=tf.keras.losses.categorical_crossentropy,
                  optimizer=optimizer,
                  metrics=['accuracy'])

    # Train the model with the train dataset.
    model.fit(x=train_ds, epochs=1, steps_per_epoch=468,
              batch_size=64, verbose=verbose)

    # Evaluate the model with the eval dataset.
    score = model.evaluate(eval_ds, steps=10, verbose=0)
    print('Test loss:', score[0])
    print('Test accuracy:', score[1])

    # Return the accuracy.

    return score[1]

from functools import partial

verbose = 1
fit_with_partial = partial(fit_with, input_shape, verbose)

The BayesianOptimization object will work out of the box without much tuning needed. The constructor takes the function to be optimized as well as the boundaries of hyperparameters to search. The main method you should be aware of is maximize, which does exactly what you think it does, maximizing the evaluation accuracy given the hyperparameters.

from bayes_opt import BayesianOptimization

# Bounded region of parameter space
pbounds = {'dropout2_rate': (0.1, 0.5), 'lr': (1e-4, 1e-2)}

optimizer = BayesianOptimization(
    f=fit_with_partial,
    pbounds=pbounds,
    verbose=2,  # verbose = 1 prints only when a maximum is observed, verbose = 0 is silent
    random_state=1,
)

optimizer.maximize(init_points=10, n_iter=10,)


for i, res in enumerate(optimizer.res):
    print("Iteration {}: \n\t{}".format(i, res))

print(optimizer.max)

Here are many parameters you can pass to maximize, nonetheless, the most important ones are:

  • n_iter: How many steps of Bayesian optimization you want to perform. The more steps the more likely to find a good maximum you are.
  • init_points: How many steps of random exploration you want to perform. Random exploration can help by diversifying the exploration space.
|   iter    |  target   | dropou... |    lr     |
-------------------------------------------------
468/468 [==============================] - 4s 8ms/step - loss: 0.2575 - acc: 0.9246
Test loss: 0.061651699058711526
Test accuracy: 0.9828125
|  1        |  0.9828   |  0.2668   |  0.007231 |
468/468 [==============================] - 4s 8ms/step - loss: 0.2065 - acc: 0.9363
Test loss: 0.04886047407053411
Test accuracy: 0.9828125
|  2        |  0.9828   |  0.1      |  0.003093 |
468/468 [==============================] - 4s 8ms/step - loss: 0.2199 - acc: 0.9336
Test loss: 0.05553104653954506
Test accuracy: 0.98125
|  3        |  0.9812   |  0.1587   |  0.001014 |
468/468 [==============================] - 4s 9ms/step - loss: 0.2075 - acc: 0.9390
Test loss: 0.04128134781494737
Test accuracy: 0.9890625
|  4        |  0.9891   |  0.1745   |  0.003521 |

After searching for 4 times, the model build with the found hyperparameters achieves an evaluation accuracy of 98.9% with just one epoch of training.

Comparing to other search methods

Unlike grid search which does search in a finite number of discrete hyperparameters combinations, the nature of Bayesian optimization with Gaussian processes doesn't allow for an easy/intuitive way of dealing with discrete parameters.

For example, we want to search for the number of the neuron of a dense layer from a list of options. To apply Bayesian optimization, it is necessary to explicitly convert the input parameters to discrete ones before constructing the model.

You can do something like this.

pbounds = {'dropout2_rate': (0.1, 0.5), 'lr': (1e-4, 1e-2), "dense_1_neurons_x128": (0.9, 3.1)}

def fit_with(input_shape, verbose, dropout2_rate, dense_1_neurons_x128, lr):

    # Create the model using a specified hyperparameters.
    dense_1_neurons = max(int(dense_1_neurons_x128 * 128), 128)
    model = get_model(input_shape, dropout2_rate, dense_1_neurons)
    # ...

The dense layers neurons will be mapped to 3 unique discrete values, 128, 256 and 384 before constructing to the model.

In Bayesian optimization, every next search values depend on previous observations(previous evaluation accuracies), the whole optimization process can be hard to be distributed or parallelized like the grid or random search methods.

Conclusion and further reading

This quick tutorial introduces how to do hyperparameter search with Bayesian optimization, it can be more efficient compared to other methods like the grid or random since every search are "guided" from previous search results.

Some material you might find helpful

BayesianOptimization - The Python implementation of global optimization with Gaussian processes used in this tutorial.

How to perform Keras hyperparameter optimization x3 faster on TPU for free - My previous tutorial on performing grid hyperparameter search with Colab's free TPU.

Check out the full source code on my GitHub.

Currently unrated

Comments