How to run Object Detection and Segmentation on a Video Fast for Free


Mask R-CNN

TL;DR. After reading this post, you will know how to run state-of-the-art object detection and segmentation on a video file, fast, even on an old laptop with an integrated graphics card, an old CPU, and only 2 GB of RAM.

So here is the catch. This will only work if you have an internet connection and a Google (Gmail) account. Since you are reading this, you very likely already qualify.

All the code in the post runs entirely in the cloud. Thanks to Google's Colaboratory a.k.a. Google Colab! 

I am going to show you how to run our code on Colab with a server-grade CPU, more than 10 GB of RAM, and a powerful GPU for FREE! Yes, you heard me right.

Using Google Colab with GPU enabled

Colab was built to help machine learning professionals collaborate with each other more seamlessly. I have shared my Python notebook for this post; click to open it.

Log in to your Google account in the upper right corner if you haven't done so. You will be asked at the top of the screen to open the notebook with Colab. Then make a copy so you can edit it.

colab copy

Now you should be able to click on the "Runtime" menu button to choose the version of Python and whether to use GPU/CPU to accelerate the computation.


The environment is all set. Easy, isn't it? No hassles like installing CUDA and cuDNN to get the GPU working on a local machine.

Run this code to confirm TensorFlow can see the GPU.

import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

It outputs:

Found GPU at: /device:GPU:0

Great, we are good to go!

If you are curious about which GPU model you are using, it is an Nvidia Tesla K80 with 24 GB of memory. Quite powerful.

Run this code to find out yourself.

from tensorflow.python.client import device_lib
# List every device TensorFlow can see, including its memory limit
device_lib.list_local_devices()

You will see that the 24 GB of graphics memory helps later. It makes it possible to process more frames at a time, which speeds up the video processing.

Mask R-CNN Demo

The demo is based on the Mask R-CNN GitHub repo. It is an implementation of Mask R-CNN on Keras+TensorFlow. It not only generates the bounding box for a detected object but also generates a mask over the object area.


Install Dependencies and run Demo

Mask R-CNN has some dependencies to install before we can run the demo. Colab allows you to install Python packages through pip, and general Linux packages and libraries through apt-get.

In case you don't know yet, your current instance of Google Colab is running on an Ubuntu virtual machine. You can run almost every command you would usually run on a Linux machine.

Mask R-CNN depends on pycocotools, which we install with the following cell.

!pip install Cython
!git clone
!pip install -U setuptools
!pip install -U wheel
!make install -C coco/PythonAPI

It clones the coco repository from GitHub, installs the build dependencies, and finally builds and installs the coco API library.

All this happens in the cloud virtual machine, and quite fast.

We are now ready to clone the Mask_RCNN repo from GitHub and cd into the directory.

!git clone
# cd to the code directory and optionally download the weights file
import os
os.chdir('./Mask_RCNN')

Notice how we change directory with a Python call instead of running a shell 'cd' command, since we are running Python in the current notebook.
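To see why the change has to happen in Python, here is a minimal sketch that runs outside Colab on any Unix-like system. It uses only the standard library, and the temporary directory is a hypothetical stand-in for the cloned repo. A 'cd' executed in a child shell (which is how a '!' command runs) does not move the Python process, whereas os.chdir does.

```python
import os
import shlex
import subprocess
import tempfile

start_dir = os.getcwd()

# Stand-in for the cloned repo directory (hypothetical, for illustration only).
target_dir = os.path.realpath(tempfile.mkdtemp())

# A `cd` inside a child shell only affects that shell, which then exits.
subprocess.run("cd {}".format(shlex.quote(target_dir)), shell=True, check=True)
cwd_after_shell_cd = os.getcwd()

# os.chdir changes the working directory of the Python process itself.
os.chdir(target_dir)
cwd_after_chdir = os.path.realpath(os.getcwd())

print(cwd_after_shell_cd == start_dir)  # the shell `cd` had no lasting effect
print(cwd_after_chdir == target_dir)    # os.chdir did

# Go back to where we started so later code is unaffected.
os.chdir(start_dir)
```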

Now you should be able to run the Mask R-CNN demo on Colab like you would on a local machine. So go ahead and run it in your Colab notebook.

So far those sample images came from the GitHub repo. But how do you predict with custom images?

Predict with Custom Images

To upload an image to a Colab notebook, there are three options that I can think of.

1. Use a free image hosting provider like imgbb.

2. Create a GitHub repo, then download the image from Colab using its link.

After uploading images with either of those two options, you will get a link to the image, which can be downloaded to your Colab VM with the Linux wget command. The command below downloads one image to the ./images folder.

!wget -P ./images

The first two options are ideal if you just want to upload one or two images and don't mind that other people on the internet can see them if they have the link.

3. Use Google Drive

This option is ideal if you have private images, videos, or other files to upload to Colab.

Run this block to authenticate the VM to connect to your Google Drive.

!apt-get install -y -qq software-properties-common python-software-properties module-init-tools
!add-apt-repository -y ppa:alessandro-strada/ppa 2>&1 > /dev/null
!apt-get update -qq 2>&1 > /dev/null
!apt-get -y install -qq google-drive-ocamlfuse fuse
from google.colab import auth
# Authenticate this VM with your Google account (first verification code)
auth.authenticate_user()
from oauth2client.client import GoogleCredentials
creds = GoogleCredentials.get_application_default()
import getpass
!google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1 | grep URL
vcode = getpass.getpass()
!echo {vcode} | google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret}

It will ask for two verification codes during the run.

Then execute this cell to mount the Drive to the directory 'drive'.

!mkdir -p drive
!google-drive-ocamlfuse drive

You can now access your Google Drive content in the directory ./drive.

!ls drive/

Hope you are having fun so far. Why not try this on a video file?

Processing Videos

Processing a video file will take three steps.

1. Extract the video into image frames.

2. Process the images.

3. Turn the processed images into an output video.

In our previous demo, we asked the model to process just one image at a time, as configured by IMAGES_PER_GPU.

class InferenceConfig(coco.CocoConfig):
    # Set batch size to 1 since we'll be running inference on
    # one image at a time. Batch size = GPU_COUNT * IMAGES_PER_GPU
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1
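The comment in the config says the batch size is the product of two settings. The sketch below illustrates that arithmetic with a stand-in class. StandInConfig, its subclasses, and the batch_size property are hypothetical names used only for illustration and are not the classes from the Mask R-CNN repo; the value 3 is just an example.

```python
class StandInConfig:
    """Hypothetical stand-in for a detection config (not the repo's class)."""
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1

    @property
    def batch_size(self):
        # Number of images the model expects in a single call.
        return self.GPU_COUNT * self.IMAGES_PER_GPU


class SingleImageConfig(StandInConfig):
    IMAGES_PER_GPU = 1


class ThreeImageConfig(StandInConfig):
    IMAGES_PER_GPU = 3


single_batch = SingleImageConfig().batch_size
triple_batch = ThreeImageConfig().batch_size
print(single_batch)  # 1
print(triple_batch)  # 3
```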

If we process the whole video one frame at a time, it will take a long time. So instead we are going to leverage the GPU to process multiple frames in parallel.

The Mask R-CNN pipeline is quite computationally intensive and takes a lot of GPU memory. I find that the Tesla K80 GPU on Colab with 24 GB of memory can safely process 3 images at a time. If you go beyond that, the notebook might crash in the middle of processing the video.

So in the code below, we set the batch_size to 3 and use cv2 library to stage 3 images at a time before processing them with the model.

import cv2

batch_size = 3
frames = []
frame_count = 0

capture = cv2.VideoCapture(os.path.join(VIDEO_DIR, 'trailer1.mp4'))
while True:
    ret, frame = capture.read()
    # Bail out when the video file ends
    if not ret:
        break
    # Save each frame of the video to a list
    frame_count += 1
    frames.append(frame)
    if len(frames) == batch_size:
        results = model.detect(frames, verbose=0)
        for i, item in enumerate(zip(frames, results)):
            frame = item[0]
            r = item[1]
            frame = display_instances(
                frame, r['rois'], r['masks'], r['class_ids'], class_names, r['scores']
            )
            # Name each file by its zero-based frame index
            name = '{0}.jpg'.format(frame_count + i - batch_size)
            name = os.path.join(VIDEO_SAVE_DIR, name)
            cv2.imwrite(name, frame)
        # Clear the frames array to start the next batch
        frames = []

# Release the video file when done
capture.release()
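One thing to watch for with fixed-size batches: if the number of frames is not a multiple of the batch size, the last few frames never fill a batch and are not processed. The sketch below shows one possible way to handle that, by padding the final batch with copies of its last frame and remembering how many entries are real. iter_batches is a hypothetical helper that is not part of the original notebook, and the padding strategy is an assumption. Plain integers stand in for image frames so the sketch runs without OpenCV.

```python
def iter_batches(frames, batch_size):
    """Yield (start_index, batch, n_real); every batch has exactly batch_size items."""
    batch = []
    start_index = 0
    for frame in frames:
        batch.append(frame)
        if len(batch) == batch_size:
            yield start_index, batch, batch_size
            start_index += batch_size
            batch = []
    if batch:
        n_real = len(batch)
        # Pad with copies of the last real frame so the batch has the expected size.
        padded = batch + [batch[-1]] * (batch_size - n_real)
        yield start_index, padded, n_real


# Seven stand-in "frames" with a batch size of 3.
batches = list(iter_batches(range(7), 3))
for start_index, batch, n_real in batches:
    # Only the first n_real results of a batch would be saved,
    # under the zero-based frame index start_index + i.
    print(start_index, batch, n_real)
```

With this approach, the results for the padded entries are simply discarded instead of being written to disk.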

After running this code, you should now have all processed image files in one folder ./videos/save.

The next step is easy: we need to generate the new video from those images. We are going to use cv2's VideoWriter to accomplish this.

But there are two things you want to make sure of:

1. The frames need to be ordered in the same way as they are extracted from the original video. (Or backward if you prefer to watch the video that way)

# Get all image file paths to a list.
images = list(glob.iglob(os.path.join(VIDEO_SAVE_DIR, '*.*')))
# Sort the images by name index.
images = sorted(images, key=lambda x: int(os.path.splitext(os.path.basename(x))[0]))

2. The frame rate matches the original video. You can use the following code to check the frame rate of a video, or just open the file properties.

video = cv2.VideoCapture(os.path.join(VIDEO_DIR, 'trailer1.mp4'))

# Find the OpenCV major version
major_ver = cv2.__version__.split('.')[0]

if int(major_ver) < 3:
    # OpenCV 2.x exposes the property under the cv2.cv namespace
    fps = video.get(cv2.cv.CV_CAP_PROP_FPS)
    print("Frames per second using video.get(cv2.cv.CV_CAP_PROP_FPS): {0}".format(fps))
else:
    fps = video.get(cv2.CAP_PROP_FPS)
    print("Frames per second using video.get(cv2.CAP_PROP_FPS) : {0}".format(fps))

video.release()
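On the first of the two points above, the reason for sorting by the numeric index rather than by the raw file name is that string sorting puts '10.jpg' before '2.jpg'. Here is a small sketch using made-up file names and only the standard library; frame_index is a hypothetical helper name.

```python
import os

# Hypothetical frame file names, in the arbitrary order a directory listing may return.
names = ["10.jpg", "2.jpg", "0.jpg", "1.jpg", "11.jpg"]


def frame_index(path):
    """Return the integer frame index encoded in a name like 'save/12.jpg'."""
    stem, _ext = os.path.splitext(os.path.basename(path))
    return int(stem)


by_name = sorted(names)
by_index = sorted(names, key=frame_index)

print(by_name)   # ['0.jpg', '1.jpg', '10.jpg', '11.jpg', '2.jpg']
print(by_index)  # ['0.jpg', '1.jpg', '2.jpg', '10.jpg', '11.jpg']
```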


Finally, here is the code to generate the video from the processed image frames.

def make_video(outvid, images=None, fps=30, size=None,
               is_color=True, format="FMP4"):
    """
    Create a video from a list of images.
    @param      outvid      output video path
    @param      images      list of image paths to use in the video
    @param      fps         frames per second
    @param      size        (width, height) of each frame; defaults to the first image's size
    @param      is_color    whether the frames are color
    @param      format      four-character codec code (FourCC)
    @return                 the released VideoWriter, or None if no images were given
    """
    from cv2 import VideoWriter, VideoWriter_fourcc, imread, resize
    fourcc = VideoWriter_fourcc(*format)
    vid = None
    for image in images or []:
        if not os.path.exists(image):
            raise FileNotFoundError(image)
        img = imread(image)
        if vid is None:
            if size is None:
                size = img.shape[1], img.shape[0]
            vid = VideoWriter(outvid, fourcc, float(fps), size, is_color)
        # Resize any frame whose width or height differs from the target size
        if size[0] != img.shape[1] or size[1] != img.shape[0]:
            img = resize(img, size)
        vid.write(img)
    # Finalize the file so it can be played back
    if vid is not None:
        vid.release()
    return vid

import glob
import os

# Directory of images to run detection on
ROOT_DIR = os.getcwd()
VIDEO_DIR = os.path.join(ROOT_DIR, "videos")
VIDEO_SAVE_DIR = os.path.join(VIDEO_DIR, "save")
images = list(glob.iglob(os.path.join(VIDEO_SAVE_DIR, '*.*')))
# Sort the images by integer index
images = sorted(images, key=lambda x: int(os.path.splitext(os.path.basename(x))[0]))

outvid = os.path.join(VIDEO_DIR, "out.mp4")
make_video(outvid, images, fps=30)

If you have gone this far, the processed video should now be ready to be downloaded to your local machine.

from google.colab import files
files.download('videos/out.mp4')

video clip

Feel free to try your favorite video clip. You could also intentionally decrease the frame rate when reconstructing the video to watch it in slow motion.
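The slow-motion trick is simple arithmetic: the same frames written at a lower frame rate play for longer. Here is a small sketch, where playback_seconds and slow_motion_fps are hypothetical helper names and the numbers are made up.

```python
def playback_seconds(n_frames, fps):
    """Duration of a clip with n_frames written at fps frames per second."""
    return n_frames / float(fps)


def slow_motion_fps(original_fps, slowdown):
    """Frame rate to write at so that playback is `slowdown` times slower."""
    return original_fps / float(slowdown)


n_frames = 900                    # e.g. 30 seconds of video at 30 fps
fps_out = slow_motion_fps(30, 2)  # half speed

print(playback_seconds(n_frames, 30))       # 30.0
print(fps_out)                              # 15.0
print(playback_seconds(n_frames, fps_out))  # 60.0
```

The resulting value can then be passed as the fps argument of make_video, for example make_video(outvid, images, fps=fps_out).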

Summary and Further reading

In this post, we walked through how to run your model on Google Colab with GPU acceleration.

You have learned how to do object detection and segmentation on a video. Thanks to the powerful GPU on Colab, it is possible to process multiple frames in parallel to speed up the process.

Further reading

If you want to learn more about the technology behind the object detection and segmentation algorithm, the original Mask R-CNN paper goes through the details of the model.

Or if you are just getting started with object detection, check out my object detection/localization guide series, which goes through the essential basics shared between many models.

Here again are the Python notebook for this post and the GitHub repo, for your convenience.
