"If I were a girl" - Magic Mirror by StarGAN



Ever wonder what you would look like if you were a girl?

Imagine this: I jump out of bed and look in the mirror. I am a blonde!

You ask: "That is what you would look like as a girl?"

I say: "YES OMG YES YES YES! This is what I've always wanted!"

The magic mirror is powered by StarGAN, a unified generative adversarial network for multi-domain image-to-image translation. This post will show you how the model works and how you can build the magic mirror.


Enjoy the YouTube demo here.

Complete source code available on my GitHub page.

StarGAN intro

Image-to-image translation changes a particular aspect of a given image, e.g., changing the gender of a person from male to female. The task has improved significantly since the introduction of generative adversarial networks (GANs), with results ranging from generating photos from edge maps to changing the seasons of scenery images and reconstructing photos from Monet's paintings.

Given training data from two different domains, these models learn to translate images from one domain to the other in a single direction. For example, one generative model is trained to translate a person with black hair to blond hair. Such a model cannot translate "backward", in this example from blond hair to black hair. Nor can a single model handle flexible multi-domain translation, such as a configurable change of both gender and hair color. That is where StarGAN stands out: it is a generative adversarial network that learns the mappings among multiple domains using only a single generator and a single discriminator, training effectively on images from all domains. Instead of learning a fixed translation (e.g., black-to-blond hair), StarGAN takes both the image and the domain information as inputs and learns to translate the input image flexibly into the corresponding domain.

The pre-trained StarGAN model consists of two networks, like other GAN models: a generative network and a discriminative network. Although only the generative network is needed to build the magic mirror, it is still useful to understand where the complete model comes from.

The generative network takes two inputs, the original RGB image at 256 x 256 resolution and the target labels, and generates a fake image at the same resolution. The discriminative network learns to distinguish real images from fake ones and to classify real images into their corresponding domains.
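
To make the two inputs concrete, here is a minimal NumPy sketch of one way an image and a label vector can be combined before the first convolution: each label value is replicated into a constant plane with the same height and width as the image, and the planes are stacked onto the RGB channels. This follows my reading of the StarGAN paper. The helper name concat_image_and_labels and the tiny 4 x 4 stand-in image are assumptions for illustration, not code from the official repository.

```python
import numpy as np

def concat_image_and_labels(image_chw, labels):
    """Stack one constant plane per label onto a (C, H, W) image.

    Hypothetical helper for illustration only.
    """
    image_chw = np.asarray(image_chw, dtype=np.float32)
    labels = np.asarray(labels, dtype=np.float32)
    _, height, width = image_chw.shape
    # Replicate each label value over the spatial grid: (n_labels, H, W).
    label_planes = np.broadcast_to(
        labels[:, None, None], (labels.size, height, width)
    )
    # Channel-wise concatenation: (C + n_labels, H, W).
    return np.concatenate([image_chw, label_planes], axis=0)

# Stand-in for a 3 x 256 x 256 face crop, kept tiny for readability.
image = np.zeros((3, 4, 4), dtype=np.float32)
conditioned = concat_image_and_labels(image, [0, 1, 0, 0, 1])
print(conditioned.shape)  # (8, 4, 4)
```

The generator then sees an 8-channel input: three color channels plus one plane per target label.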

The pre-trained model we are going to use was trained on the CelebA dataset, which contains 202,599 face images of celebrities, each annotated with 40 binary attributes. The researchers selected seven domains using the following attributes: hair color (black, blond, brown), gender (male/female), and age (young/old).


Building the magic mirror

The StarGAN researchers have published their code on GitHub, and our magic mirror project is based on it. This was also my first time working with the PyTorch framework, and so far it is going well. If you are new to PyTorch like me, you will find it quite easy to get started, especially if you have experience with another deep learning framework such as Keras or TensorFlow.

Only the most basic PyTorch knowledge is required for this project, such as working with PyTorch tensors and loading pre-trained model weights.

Let's start by installing the framework. In my case, that is on Windows 10, which is officially supported by the latest PyTorch.

To let the magic mirror run in real time with minimal perceivable lag, accelerate model execution with your gaming PC's Nvidia graphics card if you have one.

Install CUDA 9 from this link on the Nvidia Developer website.


After that, install PyTorch with CUDA 9.0 support by following the instructions on its official website.


When PyTorch and other Python dependencies are installed, we are ready for the code.
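
Before moving on, a quick sanity check can confirm that PyTorch is importable and whether it can see the GPU. This is a minimal sketch; the helper name pick_device_name is my own, and the fallback to "cpu" when PyTorch is missing is only there so the snippet runs anywhere.

```python
try:
    import torch
except ImportError:  # PyTorch is not installed in this environment
    torch = None

def pick_device_name():
    """Return "cuda" when PyTorch can see a CUDA GPU, otherwise "cpu"."""
    if torch is not None and torch.cuda.is_available():
        return "cuda"
    return "cpu"

device_name = pick_device_name()
print("PyTorch version:", torch.__version__ if torch is not None else "not installed")
print("Running on:", device_name)
```

With PyTorch installed, torch.device(device_name) gives the device object that tensors and the model can be moved to with .to(device).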

To implement a simple real-time face tracking and cropping effect, we are going to use the lightweight CascadeClassifier module from the OpenCV library for Python. This module takes a grayscale image converted from a webcam frame and returns the bounding boxes of the detected faces. If multiple faces are detected in a given frame, we take the "main" face, the one with the largest bounding box area.
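
The "main face" selection can be sketched in a few lines. This assumes each detection is an (x, y, width, height) rectangle, which is the form OpenCV's detectMultiScale() returns; the function name largest_face and the sample boxes are made up for illustration.

```python
def largest_face(faces):
    """Return the (x, y, w, h) box with the largest area, or None if empty."""
    if len(faces) == 0:
        return None
    # Area of a box is width * height.
    return max(faces, key=lambda box: box[2] * box[3])

# Three hypothetical detections from a single frame.
detected = [(10, 20, 50, 50), (200, 80, 120, 110), (5, 5, 30, 40)]
print(largest_face(detected))  # (200, 80, 120, 110)
```

The face crop is then frame[y:y+h, x:x+w] for the selected box.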

Since the StarGAN generative network expects pixel values in the range -1 to 1 instead of 0 to 255, we are going to use torchvision's built-in image transform utilities to handle the image preprocessing.

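from torchvision import transforms as T

# Build the preprocessing pipeline for the cropped face image.
# Assumption: face_img is a PIL image or an H x W x 3 uint8 array;
# drop T.ToTensor() if it is already a float tensor in the range 0..1.
transform = T.Compose([
    T.ToTensor(),  # 0..255 image -> 3 x H x W float tensor in 0..1
    T.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),  # 0..1 -> -1..1
])
# Pre-process the image
preprocessed_image = transform(face_img)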
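
The arithmetic behind this normalization is simple enough to check by hand. The sketch below assumes the face crop is first scaled from 0..255 to 0..1 (for example by torchvision's T.ToTensor()) and then normalized with mean 0.5 and std 0.5. The two function names are hypothetical and exist only for this illustration.

```python
def to_generator_range(pixel_value):
    """Map an 8-bit pixel value (0..255) to the -1..1 range."""
    scaled = pixel_value / 255.0   # scale to 0..1
    return (scaled - 0.5) / 0.5    # subtract the mean, divide by the std

def to_display_range(generated_value):
    """Invert the normalization: map -1..1 back to 0..1 for display."""
    return (generated_value + 1.0) / 2.0

for value in (0, 255):
    print(value, "->", to_generator_range(value))
# 0 -> -1.0
# 255 -> 1.0
```

The inverse mapping, (x + 1) / 2, is the same step used later to turn the generator output back into a displayable image.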

The generative network subclasses PyTorch's nn.Module, which means you can call it directly, passing in the input tensors as arguments.

# Run the generator to generate the desired image with labels.
generated = G(preprocessed_image.unsqueeze(0).to(device), labels.unsqueeze(0).to(device))

The labels variable is a PyTorch tensor with 5 values, each set to either 0 or 1, one for each of the 5 target labels.

['Black_Hair', 'Blond_Hair', 'Brown_Hair', 'Male', 'Young']

For example, to transform a portrait into a blond-haired young female, set the value of labels to [0, 1, 0, 0, 1].
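
Building that vector from human-readable options might look like the sketch below. The helper build_labels and its parameter names are assumptions for illustration; only the label order comes from the list above.

```python
LABEL_NAMES = ['Black_Hair', 'Blond_Hair', 'Brown_Hair', 'Male', 'Young']

def build_labels(hair_color, male, young):
    """Return a list of five 0/1 flags in LABEL_NAMES order."""
    hair_index = {'black': 0, 'blond': 1, 'brown': 2}
    if hair_color not in hair_index:
        raise ValueError(f"unknown hair color: {hair_color!r}")
    labels = [0, 0, 0, int(male), int(young)]
    labels[hair_index[hair_color]] = 1  # switch on exactly one hair color
    return labels

print(build_labels('blond', male=False, young=True))  # [0, 1, 0, 0, 1]
```

Before passing it to the generator, the list would be converted to a float tensor, for example with torch.tensor(build_labels('blond', False, True), dtype=torch.float32).
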

Converting the generated image tensor into a frame that cv2's imshow() function can display takes a single line of code:

generated_frame = ((np.moveaxis(generated.cpu().detach().numpy()[0],[0], [2])+1)/2)[:, ::-1, ::-1]

Here is the breakdown:

  1. First move the image data from GPU to CPU by calling cpu().
  2. Call detach() to detach the tensor from the computation graph.
  3. The numpy() call returns the tensor value as a NumPy array.
  4. The first [0] takes the first image out of the generated batch (even though the batch size is one).
  5. Move the channel axis to the end to turn a (3, 256, 256) shaped array into (256, 256, 3).
  6. Recover the pixel values from the range -1~1 to 0~1.
  7. Flip the generated image horizontally with the first ::-1 operation, so it behaves like a mirror.
  8. Change the channel order from RGB to BGR with the last ::-1 operation, since the cv2.imshow() function expects an image in BGR channel order.

We wrap the code into a single function, MagicMirror(), which takes several optional arguments.

  • videoFile: leave the default value 0 to use the first web camera, or pass in a video file path.
  • setHairColor: one of the three, "black", "blond", "brown".
  • setMale: transform into a male? Set to True or False.
  • setYoung: transform into a young person? Set to True or False.
  • showZoom: defaults to 4; the factor by which to scale up the generated image before showing it on the screen.

Conclusion and further thoughts

This tutorial shows you how easy and fun it can be to pick up a new framework like PyTorch and build something interesting with a pre-trained StarGAN network.

The generated images might not look super realistic yet. The StarGAN paper shows that a model trained jointly on the CelebA and RaFD datasets can generate images with fewer artifacts, since it leverages both datasets to improve shared low-level tasks such as facial keypoint detection and segmentation. You can follow their official GitHub repository to download both datasets and train such a model, as long as you have a beefy machine and an extra week to run the training.
