Teach Old Dog New Tricks - Train Facial Identification Model to Understand Facial Emotion



Last week, we explored using the pre-trained VGG-Face2 model to identify a person by face. The model was trained to identify 8631 celebrities and athletes.

Because the original dataset is large enough, the spatial feature hierarchy learned by the pre-trained network can effectively act as a generic model for our facial identification task, even though our new task might involve identifying completely different people's faces.

There are two ways to leverage a pre-trained network: feature extraction and fine-tuning.

What we did last week is an example of feature extraction, since we used the extracted features of a face to calculate the distance to another face. If the distance is smaller than a threshold, we conclude that the two faces belong to the same person.
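
As a rough illustration of that distance check, the sketch below compares two feature vectors with cosine distance. The tiny 4-dimensional vectors and the 0.5 threshold are made-up values for demonstration only; real face features are much longer and the threshold has to be tuned on your own data. Cosine distance is an assumption here, and another metric such as Euclidean distance would be used the same way.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def is_same_person(features_a, features_b, threshold=0.5):
    """Treat two faces as the same person when their distance is below the threshold."""
    return cosine_distance(features_a, features_b) < threshold

# Hypothetical feature vectors, not real model output.
face_1 = [0.9, 0.1, 0.3, 0.5]
face_2 = [0.8, 0.2, 0.3, 0.6]    # points in nearly the same direction as face_1
face_3 = [-0.5, 0.9, -0.2, 0.1]  # points in a very different direction

print(is_same_person(face_1, face_2))
print(is_same_person(face_1, face_3))
```

The first pair falls under the threshold and the second does not, which is all the identification step needs to decide.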

This week, we are going to leverage the same pre-trained model to identify the emotion shown on a face. There are 7 emotions we will classify.

{0:'angry',1:'disgust',2:'fear',3:'happy', 4:'sad',5:'surprise',6:'neutral'}

You can download the fer2013 dataset from Kaggle. Each picture is a 48x48 pixel grayscale image of a face.

First Attempt with Feature Extraction layer

The easiest way I can think of is to take the output of the CNN feature extraction layer and stack a classifier on top of it.

The classifier will be two fully connected Dense layers, with the last Dense layer having an output shape of 7, denoting the probabilities of the 7 emotions.
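
To make the role of that last layer concrete, here is a small pure-Python sketch of what a softmax output does: it turns 7 raw scores into probabilities that sum to 1, and the largest one is taken as the predicted emotion. The scores are invented for illustration, and the label mapping is the one listed earlier in this post.

```python
import math

EMOTIONS = {0: 'angry', 1: 'disgust', 2: 'fear', 3: 'happy',
            4: 'sad', 5: 'surprise', 6: 'neutral'}

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Made-up raw scores for one face, one score per emotion.
scores = [0.2, -1.0, 0.1, 2.5, 0.3, 0.8, 1.1]
probs = softmax(scores)

# The predicted class is the index with the highest probability.
predicted = max(range(len(probs)), key=lambda i: probs[i])
print(EMOTIONS[predicted])
```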

We also freeze the weights of the CNN layers, making them non-trainable, since we want to keep the representations that were previously learned by the convolutional base.

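from keras_vggface.vggface import VGGFace
# models and layers are the Keras modules, e.g. from keras import models, layers
conv_base = VGGFace(model='resnet50', include_top=False, input_shape=(224, 224, 3),
                    pooling='avg')  # pooling: None, avg or max
model = models.Sequential()
model.add(conv_base)  # stack the classifier on top of the pre-trained base
model.add(layers.Dense(256, activation='relu'))
model.add(layers.Dense(7, activation='softmax'))
conv_base.trainable = False  # freeze the pre-trained weights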

It turns out the model cannot generalize well and always predicts a "happy" face. Why a "happy" face? Does the model like happy faces more than other faces, like we do? Not really.

If we take a closer look at the fer2013 training dataset, there are more happy faces for the model to train on than faces of any other emotion. So if the model can only pick one emotion to predict, of course it picks the happy face.
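
The effect of an unbalanced training set can be sketched with a toy example. The label list below is made up and is not the real fer2013 distribution; the point is only that a model which always answers with the most frequent class already reaches an accuracy equal to that class's share of the data.

```python
from collections import Counter

# Hypothetical label list: 'happy' is over-represented on purpose.
labels = ['happy'] * 5 + ['sad'] * 2 + ['angry'] * 2 + ['fear'] * 1

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of a "model" that ignores the image and always predicts the majority class.
baseline_accuracy = majority_count / len(labels)

print(majority_class, baseline_accuracy)
```

Running the same count on the real training labels is a quick way to check how lopsided the classes are before training.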


But why does the model always pick only one emotion? It all comes down to the pre-trained model. It was trained to identify a person no matter what emotion he or she is wearing, so the final feature extraction layer contains abstract information for telling different persons apart. That abstract information contains the size and location of a person's eyes, skin color and so on. But one thing is for sure: the pre-trained model doesn't care what emotion the person has.

So what can we do? Can we still leverage the pre-trained model's weights to do something for us?

The answer is YES. We have to "fine-tune" the model.

Fine Tune pre-trained Model


Fine-tuning consists of unfreezing a few of the top layers of a frozen model base used for feature extraction, and jointly training both the newly added part of the model (in our case, the fully-connected classifier) and these top layers. This is called "fine-tuning" because it slightly adjusts the more abstract representations of the model being reused, in order to make them more relevant to our task at hand.

A big convolutional network like the pre-trained resnet50 face model here has many "conv blocks" stacked one on top of another.

An example residual block is shown in the figure below.


Image Credit: http://torch.ch/blog/2016/02/04/resnets.html

Earlier layers in the convolutional base encode more generic, reusable features. In our case, the first conv block can extract edges and some image patterns like curves and simple shapes. Those are very generic features shared by different image processing tasks. Going a little deeper, a conv block can extract a person's eyes and mouth, which are more abstract features and less generic across different image processing tasks.


The last conv block represents the most abstract information, tied to the pre-trained model's task of identifying a person. With this in mind, let's unfreeze its weights and update them as we train with our face emotion images.

We unfreeze the model weights by locating the layer to start with, which is the layer named "activation_46" in our case, as shown below.


# set 'activation_46' and following layers trainable
conv_base.trainable = True

set_trainable = False
for layer in conv_base.layers:
    if layer.name == 'activation_46':
        set_trainable = True
    if set_trainable:
        layer.trainable = True
    else:
        layer.trainable = False
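
To see what this flag-based loop does without loading the real network, here is a minimal sketch with stand-in layer objects. The FakeLayer class, the unfreeze_from helper and every layer name except 'activation_46' are hypothetical; a real Keras layer exposes the same name and trainable attributes.

```python
class FakeLayer:
    """Stand-in for a Keras layer; it only has the two attributes the loop touches."""
    def __init__(self, name):
        self.name = name
        self.trainable = True

def unfreeze_from(layers, first_trainable_name):
    """Freeze every layer before first_trainable_name; unfreeze it and all layers after it."""
    set_trainable = False
    for layer in layers:
        if layer.name == first_trainable_name:
            set_trainable = True
        layer.trainable = set_trainable

# Hypothetical layer names, listed in network order.
layers = [FakeLayer(n) for n in
          ['block_a', 'block_b', 'activation_46', 'block_c', 'block_d']]
unfreeze_from(layers, 'activation_46')

for layer in layers:
    print(layer.name, layer.trainable)
```

Everything before 'activation_46' ends up frozen, while that layer and the ones after it stay trainable.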

Prepare and train the model

We are given 35887 grayscale face images of 48x48 pixels, while our pre-trained model expects 224x224 color input images.

Converting all 35887 images to 224x224 and storing them in RAM would take a significant amount of space. My solution is to convert and store one image at a time in a TFRecord file, which we can load later with TensorFlow with little headache.
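
A quick back-of-the-envelope calculation shows why. The sketch assumes each pixel value is stored as one byte (uint8); if the images were kept as 32-bit floats, the figures would be four times larger.

```python
NUM_IMAGES = 35887

def dataset_bytes(height, width, channels, bytes_per_value=1):
    """Size in bytes of NUM_IMAGES images held in memory at once."""
    return NUM_IMAGES * height * width * channels * bytes_per_value

original = dataset_bytes(48, 48, 1)     # 48x48 grayscale, as shipped
converted = dataset_bytes(224, 224, 3)  # 224x224 color, as the model expects

print(f"original:  {original / 1e6:.1f} MB")
print(f"converted: {converted / 1e9:.1f} GB")
```

That is roughly 83 MB for the original images versus about 5.4 GB after conversion, which is why writing one image at a time to disk is attractive.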

Using TFRecord as the training dataset format also makes training faster. You can check out my previous experiment.

And here is the code to make the conversion happen.

import cv2
import tensorflow as tf

# Helper-function for wrapping an integer so it can be saved to the TFRecord file.
def wrap_int64(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

# Helper-function for wrapping a list of integers so it can be saved to the TFRecord file.
def wrap_int64_list(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))

# Helper-function for wrapping raw bytes so they can be saved to the TFRecord file.
def wrap_bytes(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

# Function for resizing image arrays and writing them along with the class-labels to a TFRecord file.
def convert(image_arrays, labels, out_path, size=(224,224)):
    # Args:
    # image_arrays  List of numpy image arrays.
    # labels        Class-labels for the images.
    # out_path      File-path for the TFRecords output file.
    print("Converting: " + out_path)
    # Number of images. Used when printing the progress.
    num_images = len(image_arrays)    
    # Open a TFRecordWriter for the output-file.
    with tf.python_io.TFRecordWriter(out_path) as writer:
        # Iterate over all the image arrays and class-labels.
        for i, (img, label) in enumerate(zip(image_arrays, labels)):
            # Print the percentage-progress.
            print_progress(count=i, total=num_images-1)
            # resize the image array to desired size
            img = cv2.resize(img.astype('uint8'), size)    
            # Turn gray to color.
            img = cv2.cvtColor(img,cv2.COLOR_GRAY2RGB)
            # Convert the image to raw bytes (tobytes replaces the deprecated tostring).
            img_bytes = img.tobytes()
            # Create a dict with the data we want to save in the
            # TFRecords file. You can add more relevant data here.
            data = {
                'image': wrap_bytes(img_bytes),
                'label': wrap_int64_list(label)
            }
            # Wrap the data as TensorFlow Features.
            feature = tf.train.Features(feature=data)
            # Wrap again as a TensorFlow Example.
            example = tf.train.Example(features=feature)
            # Serialize the data.
            serialized = example.SerializeToString()        
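            # Write the serialized data to the TFRecords file.
            writer.write(serialized)

# Convert the training images; train_data holds the image arrays and their one-hot labels.
convert(image_arrays=train_data[0], labels=train_data[1], out_path=path_tfrecords_train)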

In the above code, train_data[0] contains the list of face image arrays, each with shape (48, 48), and train_data[1] is the list of actual emotion labels in one-hot format.

For example, one emotion is encoded as 

[0, 0, 1, 0, 0, 0, 0]

The 1 is at index 2, and index 2 in our mapping is the emotion "fear".
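
Going back and forth between a class index and its one-hot vector takes only a few lines. This is a pure-Python sketch; the helper names to_one_hot and from_one_hot are made up for illustration and are not part of this post's code.

```python
EMOTIONS = {0: 'angry', 1: 'disgust', 2: 'fear', 3: 'happy',
            4: 'sad', 5: 'surprise', 6: 'neutral'}

def to_one_hot(index, num_classes=7):
    """Return a list of zeros with a single 1 at position index."""
    return [1 if i == index else 0 for i in range(num_classes)]

def from_one_hot(vector):
    """Return the position of the 1 in a one-hot vector."""
    return vector.index(1)

encoded = to_one_hot(2)
print(encoded)
print(EMOTIONS[from_one_hot(encoded)])
```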

In order to train our Keras model with the TFRecord dataset, we first need to turn it into a TF Estimator with the tf.keras.estimator.model_to_estimator method.

est_emotion = tf.keras.estimator.model_to_estimator(keras_model=model,

In my previous post, I introduced how to write an image input function for a TF Estimator for a binary classification task. Here we have 7 categories of emotions to classify, so the input function looks a little different.

The following snippet shows the major difference.

        features = {
            'image': tf.FixedLenFeature([], tf.string),
            'label': tf.FixedLenFeature([7], tf.int64)
        }
        # Parse the serialized data so we get a dict with our data.
        parsed_example = tf.parse_single_example(serialized=serialized,
                                                 features=features)
        # Get the image as raw bytes.
        image_shape = tf.stack([224, 224, 3])

Now we are ready to train and evaluate our model by calling train_and_evaluate.

train_spec = tf.estimator.TrainSpec(input_fn=lambda: imgs_input_fn(path_tfrecords_train,
eval_spec = tf.estimator.EvalSpec(input_fn=lambda: imgs_input_fn(path_tfrecords_test,

import time
start_time = time.time()
tf.estimator.train_and_evaluate(est_emotion, train_spec, eval_spec)
print("--- %s seconds ---" % (time.time() - start_time))

It took 2 minutes in total and achieved a validation accuracy of 0.55, which is not bad considering that one face might actually show several emotions at the same time, being both surprised and happy for example.

Summary and Further Thought

In this post, we tried two different approaches to using the pre-trained VGG-Face2 model for the emotion classification task: feature extraction and fine-tuning.

The feature extraction approach is unable to generalize to facial emotions, since the original model was trained to identify different persons rather than different emotions.

Fine-tuning the last conv block achieved the desired result.

You might be wondering: if we fine-tune more conv blocks, could that improve the model performance?

The more parameters we train, the more we are at risk of overfitting, so it would be risky to attempt this on our small dataset.

The source code is available on my GitHub repo.

I am also looking into making this a live demo. So stay tuned.
