Blog | DLology

Accelerated Deep Learning inference from your browser

2020-06-29T12:01:32+00:00

Data scientists and ML engineers can now speedup their deep learning applications using the power of FPGA accelerators from their browser.

FPGAs are adaptable hardware platforms that can offer great performance, low-latency and reduced OpEx for applications like machine learning, video processing, quantitative finance, etc. However, the easy and efficient deployment from users with no prior knowledge on FPGA was challenging.

InAccel, a pioneer on application acceleration, makes accessible the power of FPGA acceleration from your browser. Data scientists and ML engineers can now easily deploy and manage FPGAs, speeding up compute-intense workloads and reduce total cost of ownership with zero code changes.

InAccel provides an FPGA resource manager that allows the instant deployment, scaling and resource management of FPGAs making easier than ever the utilization of FPGAs for applications like machine learning, data processing, data analytics and many more applications. Users can deploy their application from Python, Spark, Jupyter notebooks or even terminals.

Through the JupyterHub integration, users can now enjoy all the benefits that JupyterHub provide such as easy access to computational environment for instant execution of Jupyter notebooks. At the same time, users can now enjoy the benefits of FPGAs such as lower-latency, lower execution time and much higher performance without any prior-knowledge of FPGAs. InAccel’s framework allows the use of Xilinx’s Vitis Open-Source optimized libraries or 3^rd party IP cores (for deep learning, machine learning, data analytics, genomics, compression, encryption and computer vision applications.)

The Accelerated Machine Learning Platform provided by InAccel’s FPGA orchestrator can be used either on-prem or on cloud. That way, users can enjoy the simplicity of the Jupyter notebooks and at the same time experience significant speedups on their applications.

Users can test for free the available libraries on the InAccel cluster on the following link:

https://inaccel.com/accelerated-data-science/

Accelerated Inference – A use case on ResNet50

Any user can now enjoy the speedup of the FPGA accelerators from their browser. In the DL example we show how users can enjoy much faster ResNet50 inference from the same Keras python notebook with zero code changes.

Users can login on the InAccel portal using their Google account at https://labs.inaccel.com:8000

They can found ready to use example for Keras on Resnet50.

The user can see that the python code is exactly as the one that would be running on any CPU/processor. However in this example users can experience up to 2,000 FPS inference on ResNet50 with zero code changes.

The user can test the accelerated Keras ResNet50 inference example either with the available dataset (22,000 images) or they can download their own dataset.

They can also confirm that the results are correct using the validation code as it is shown below.

Note: The platform is available for demonstration purposes. Multiple users may access the available cluster with the 2 Alveo cards and it may affect the performance of the platform. If you are interested to deploy your own data center with multiple FPGA cards or run your applications on the cloud exclusively, contact us at info@inaccel.com.

Figure 1. Acceleration of ML, vision, finance, data analytics from your browser using Jupyter

You can also check the online video here: https://www.youtube.com/watch?v=42bsjdXVmFg

About InAccel, Inc.

InAccel helps enterprises to speedup their applications by using adaptive hardware accelerators. It provides a unique framework for seamless utilization of hardware accelerators from high-level framework like Spark and Jupyter. InAccel also develops high-performance accelerators for applications like machine learning, compression and data analytics. For more information, visit https://inaccel.com

Code available:

# Download and unzip test data.
import os
import urllib.request
import zipfile

url = "https://github.com/Tony607/blog_statics/releases/download/v1.0/mini_test.zip"

fname = os.path.split(url)[-1]
urllib.request.urlretrieve(url, fname)

with zipfile.ZipFile(fname, 'r') as zip_ref:
    zip_ref.extractall('.')

# Run the inference test.
import numpy as np
import time

from inaccel.keras.applications.resnet50 import decode_predictions, ResNet50
from inaccel.keras.preprocessing.image import ImageDataGenerator, load_img

model = ResNet50(weights='imagenet')

data = ImageDataGenerator(dtype='int8')
images = data.flow_from_directory('mini_test/', target_size=(224, 224), class_mode=None, batch_size=64)

begin = time.monotonic()
preds = model.predict(images, workers=10)
end = time.monotonic()

print('Duration for', len(preds), 'images: %.3f sec' % (end - begin))
print('FPS: %.3f' % (len(preds) / (end - begin)))

How to run SSD Mobilenet V2 object detection on Jetson Nano at 20+ FPS

2019-12-30T10:08:18+00:00

TL: DR

First, make sure you have flashed the latest JetPack 4.3 on your Jetson Nano development SD card.

# Run the docker
docker run --runtime nvidia --network host --privileged -it docker.io/zcw607/trt_ssd_r32.3.1:0.1.0
# Then run this command to benchmark the inference speed.
python3 trt_ssd_benchmark.py

Then you will see the results similar to this.

Now for a slightly longer description.

I posted How to run TensorFlow Object Detection model on Jetson Nano about 8 months ago, realizing that just running the SSD MobileNet V1 on Jetson Nano at a speed at around 10FPS might not be enough for some applications. Besides, that approach just consumes too much memory, make no room for other memory-intensive application running alongside.

This time, the bigger SSD MobileNet V2 object detection model runs at 20+FPS. Twice as fast, also cutting down the memory consumption down to only 32.5% of the total 4GB memory on Jetson Nano(i.e. around 1.3GB). Plenty of memory left for running other fancy stuff. You have also noticed the CPU usage is also quite low, only around 10% over the quad-core.

As of my knowledge, there are a bag of tricks contributes to the performance boost.

What comes with JetPack 4.3, the TensorRT version 6.0.1 vs previous TensorRT Version 5.
The TensorFlow object detection graph is optimized and converted right on the hardware, I mean the Jetson Nano development kit I am using right now. This is because TensorRT optimizes the graph by using the available GPUs and thus the optimized graph may not perform well on a different GPU.
The model is now converted to a more hardware-specific format, the TensorRT engine file. But the downside is it's less flexible, restrained by the hardware and software stack it is running on. More on that later.
Some tricks to save memory and boost speed.

How does it work?

The command lines you just ran started a docker container. If you are new to Docker, think of it as a supercharged Anaconda or Python virtual environment containerized everything necessary to reproduce mine results. If you take a closer look at the Dockerfile on my GitHub repo which describes how the container image was built, you can see how all the dependencies are set up, including all the apt and Python packages.

The docker image is built upon the latest JetPack 4.3 - L4T R32.3.1 base image. To make an inference with TensorRT engine file, the two important Python packages are required, TensorRT and Pycuda. Building Pycuda Python package from source on Jetson Nano might take some time, so I decided to pack the pre-build package into a wheel file and make the Docker build process much smoother. Notice that Pycuda prebuilt with JetPack 4.3 is not compatible with older versions of Jetpack and vers visa. As of the TensorRT python package, it came from the Jetson Nano at directory /usr/lib/python3.6/dist-packages/tensorrt/. All I did is to zip that directory into a tensorrt.tar.gz file. Guess what, no TensorFlow GPU Python package is required at the inference time. Consider how many memory we can save by just skipping importing the TensorFlow GPU Python package.

You can find the TensorRT engine file build with JetPack 4.3 named TRT_ssd_mobilenet_v2_coco.bin at my GitHub repository. Sometimes, you might also see the TensorRT engine file named with the *.engine extension like in the JetBot system image. If you want to convert the file yourself, take a look at JK Jung's build_engine.py script.

Now, for the limitation of the TensorRT engine file approach. It simply won't work across different JetPack version. The reason came from how the engine file is built by searching through CUDA kernels for the fastest implementation available, and thus it is necessary to use the same GPU and software stack(CUDA, CuDnn, TensorRT, etc.) for building like that on which the optimized engine will run. TensorRT engine file is like a dress tailored exclusively for the setup, but its performance is amazing when fitted on the right person/dev board.

Another limitation came with the boost of speed and lower memory footprint is the loss of precision, take the following prediction result as an example, a dog is mistakenly predicted as a bear. This might be a result of the quantization of model weights from FP32 to FP16 or other optimization trade-offs.

Some tricks to save memory and boost speed

Shut down the GUI and run in command-line mode. If you are already inside the GUI desktop environment, simply press "Ctrl+Alt+F2" to enter the non-GUI mode, log in your account from there and type "service gdm stop", that will stop the Ubuntu GUI environment and save you around 8% of the 4GB memory.

Force to the CPU and GPU in maximum clock speed by typing "jetson_clocks" in the command line. If you have a PWM fan attached to the board and bothered with the FAN's noise, you can tune it down by creating a new settings file like this.

cd ~
jetson_clocks
jetson_clocks --store
sed -i 's/target_pwm:255/target_pwm:30/g' l4t_dfs.conf
jetson_clocks --restore l4t_dfs.conf

Conclusion and further reading

This guide has shown you the easiest way to reproduce my results to run SSD Mobilenet V2 object detection on Jetson Nano at 20+ FPS. Explaining how it works and the limitation to be aware of before applying this to a real application.

Don't forget to grab the source code for this post on my GitHub.

Automatic Defect Inspection with End-to-End Deep Learning

2019-11-02T11:10:24+00:00

In this tutorial, I will show you how to build a deep learning model to find defects on a surface, a popular application in many industrial inspection scenarios.

Courtesy of Nvidia

Build the model

We will apply U-Net as a DL model for 2D industrial defect inspection. When there is a shortage of labeled data and fast performance is needed, U-net is a great choice. The basic architecture is an encoder-decoder pair with skip connections to combine low-level feature maps with higher-level ones. To verify the effectiveness of our model, we will use the DAGM dataset. The benefit of using U-Net is that it doesn't contain any dense layer, so the trained DL models are typically scaled invariant, meaning they need not be retrained across image sizes to be effective for multiple input sizes.

Here is the model structure.

As you can see we use four 2x2 max pool operations for downsampling, which reduces that resolutions by half for 4 times. On the right side, 2x2 Conv2DTranspose(called Deconvolution) upsamples the image back to its original resolution. In order for the downsampling and upsampling to work, the image resolution must be divisible by 16(or 2⁴), that is why we resized our input image and mask to 512x512 resolution from the original DAGM dataset of size 500x500.

These skip connections from earlier layers in the network (prior to a downsampling operation) should provide the necessary detail in order to reconstruct accurate shapes for segmentation boundaries. Indeed, we can recover more fine-grain detail with the addition of these skip connections.

It is simple to compose such a model in Keras functional API.

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D, Lambda, Conv2DTranspose, concatenate

def get_small_unet():
    inputs = Input((img_rows, img_cols, 1))
    inputs_norm = Lambda(lambda x: x/127.5 - 1.)
    conv1 = Conv2D(16, (3, 3), activation='relu', padding='same')(inputs)
    conv1 = Conv2D(16, (3, 3), activation='relu', padding='same')(conv1)
    pool1 = MaxPooling2D(pool_size=(2, 2))(conv1)

    conv2 = Conv2D(32, (3, 3), activation='relu', padding='same')(pool1)
    conv2 = Conv2D(32, (3, 3), activation='relu', padding='same')(conv2)
    pool2 = MaxPooling2D(pool_size=(2, 2))(conv2)

    conv3 = Conv2D(64, (3, 3), activation='relu', padding='same')(pool2)
    conv3 = Conv2D(64, (3, 3), activation='relu', padding='same')(conv3)
    pool3 = MaxPooling2D(pool_size=(2, 2))(conv3)

    conv4 = Conv2D(128, (3, 3), activation='relu', padding='same')(pool3)
    conv4 = Conv2D(128, (3, 3), activation='relu', padding='same')(conv4)
    pool4 = MaxPooling2D(pool_size=(2, 2))(conv4)

    conv5 = Conv2D(256, (3, 3), activation='relu', padding='same')(pool4)
    conv5 = Conv2D(256, (3, 3), activation='relu', padding='same')(conv5)

    up6 = concatenate([Conv2DTranspose(64, kernel_size=(
        2, 2), strides=(2, 2), padding='same')(conv5), conv4], axis=3)
    conv6 = Conv2D(128, (3, 3), activation='relu', padding='same')(up6)
    conv6 = Conv2D(128, (3, 3), activation='relu', padding='same')(conv6)

    up7 = concatenate([Conv2DTranspose(32, kernel_size=(
        2, 2), strides=(2, 2), padding='same')(conv6), conv3], axis=3)
    conv7 = Conv2D(64, (3, 3), activation='relu', padding='same')(up7)
    conv7 = Conv2D(64, (3, 3), activation='relu', padding='same')(conv7)

    up8 = concatenate([Conv2DTranspose(16, kernel_size=(
        2, 2), strides=(2, 2), padding='same')(conv7), conv2], axis=3)
    conv8 = Conv2D(32, (3, 3), activation='relu', padding='same')(up8)
    conv8 = Conv2D(32, (3, 3), activation='relu', padding='same')(conv8)

    up9 = concatenate([Conv2DTranspose(8, kernel_size=(
        2, 2), strides=(2, 2), padding='same')(conv8), conv1], axis=3)
    conv9 = Conv2D(16, (3, 3), activation='relu', padding='same')(up9)
    conv9 = Conv2D(16, (3, 3), activation='relu', padding='same')(conv9)

    conv10 = Conv2D(1, (1, 1), activation='sigmoid')(conv9)

    model = Model(inputs=inputs, outputs=conv10)

    return model

model = get_small_unet()

Loss and metrics

The most commonly used loss function for the task of image segmentation is a pixel-wise cross-entropy loss. This loss examines each pixel individually, comparing the class predictions (depth-wise pixel vector) to our one-hot encoded target vector.

Because the cross-entropy loss evaluates the class predictions for each pixel vector individually and then averages over all pixels, we're essentially asserting equal learning to each pixel in the image. This can be a problem if your various classes have unbalanced representation in the image, as training can be dominated by the most prevalent class. In our case, it is the foreground-to-background imbalance.

Another popular loss function for image segmentation tasks is based on the Dice coefficient, which is essentially a measure of overlap between two samples. This measure ranges from 0 to 1 where a Dice coefficient of 1 denotes perfect and complete overlap. The Dice coefficient was originally developed for binary data, and can be calculated as:

where |A∩B| represents the common elements between sets A and B, and |A| represents the number of elements in set A (and likewise for set B).

With respect to the neural network output, the numerator is concerned with the common activations between our prediction and target mask, whereas the denominator is concerned with the number of activations in each mask separately. This has the effect of normalizing our loss according to the size of the target mask such that the soft Dice loss does not struggle learning from classes with lesser spatial representation in an image.

Here we use add-one or Laplace smoothing, which simply adds one to each count. Add-one smoothing can be interpreted as a uniform prior which reduces overfitting and make the model easier to converge.

To implement a smoothed Dice coefficient loss.

from tensorflow.keras import backend as K


def smooth_dice_coeff(smooth=1.):

    smooth = float(smooth)

    # IOU or dice coeff calculation
    def IOU_calc(y_true, y_pred):
            y_true_f = K.flatten(y_true)
            y_pred_f = K.flatten(y_pred)
            intersection = K.sum(y_true_f * y_pred_f)

            return 2*(intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)

    def IOU_calc_loss(y_true, y_pred):
        return -IOU_calc(y_true, y_pred)
    return IOU_calc, IOU_calc_loss

IOU_calc, IOU_calc_loss = smooth_dice_coeff(1)

Here we compared the performance for both the binary cross-entropy loss and smoothed Dice coefficient loss.

As you can see, the model trained with Dice coefficient loss converged faster and achieved a better final IOU accuracy. Regarding the final test prediction result, the model trained with Dice coefficient loss delivered sharper segmentation edges that outperformed model trained with cross-entropy loss.

Conclusion and further reading

In this quick tutorial, you have learned how to build a deep learning model which can be trained end to end and detect defect for industrial applications. The DAGM dataset used in the post is relatively simple which makes it easy for fast prototyping and verification. However, in the real-world, image data might contain much richer contexts which require a deeper and more complex model to comprehend, one simple way to accomplish this is by experimenting with an increased number of kernels for CNN layers. While there are other options like this paper, the author proposed to replace each CNN block with a dense block that can be more capable when learning complex contextual features.

You can reproduce the results of this post by running this notebook on Google Colab with free GPU.

Source code available on my GitHub.

How to train Detectron2 with Custom COCO Datasets

2019-10-13T14:28:42+00:00

Along with the latest PyTorch 1.3 release came with the next generation ground-up rewrite of its previous object detection framework, now called Detectron2. This tutorial will help you get started with this framework by training an instance segmentation model with your custom COCO datasets. If you want to know how to create COCO datasets, please read my previous post - How to create custom COCO data set for instance segmentation.

For a quick start, we will do our experiment in a Colab Notebook so you don't need to worry about setting up the development environment on your own machine before getting comfortable with Pytorch 1.3 and Detectron2.

Install Detectron2

In the Colab notebook, just run those 4 lines to install the latest Pytorch 1.3 and Detectron2.

!pip install -U torch torchvision
!pip install git+https://github.com/facebookresearch/fvcore.git
!git clone https://github.com/facebookresearch/detectron2 detectron2_repo
!pip install -e detectron2_repo

Click "RESTART RUNTIME" in the cell's output to let your installation take effect.

Register a COCO dataset

To tell Detectron2 how to obtain your dataset, we are going to "register" it.

To demonstrate this process, we use the fruits nuts segmentation dataset which only has 3 classes: data, fig, and hazelnut. We'll train a segmentation model from an existing model pre-trained on the COCO dataset, available in detectron2's model zoo.

You can download the dataset like this.

# download, decompress the data
!wget https://github.com/Tony607/detectron2_instance_segmentation_demo/releases/download/V0.1/data.zip
!unzip data.zip > /dev/null

Or you can upload your own dataset from here.

from detectron2.data.datasets import register_coco_instances
register_coco_instances("fruits_nuts", {}, "./data/trainval.json", "./data/images")

Each dataset is associated with some metadata. In our case, it is accessible by calling fruits_nuts_metadata = MetadataCatalog.get("fruits_nuts"), you will get

Metadata(evaluator_type='coco', image_root='./data/images', json_file='./data/trainval.json', name='fruits_nuts',
         thing_classes=['date', 'fig', 'hazelnut'], thing_dataset_id_to_contiguous_id={1: 0, 2: 1, 3: 2})

To get the actual internal representation of the catalog stores information about the datasets and how to obtain them, you can call dataset_dicts = DatasetCatalog.get("fruits_nuts"). The internal format uses one dict to represent the annotations of one image.

To verify the data loading is correct, let's visualize the annotations of randomly selected samples in the dataset:

import random
from detectron2.utils.visualizer import Visualizer

for d in random.sample(dataset_dicts, 3):
    img = cv2.imread(d["file_name"])
    visualizer = Visualizer(img[:, :, ::-1], metadata=fruits_nuts_metadata, scale=0.5)
    vis = visualizer.draw_dataset_dict(d)
    cv2_imshow(vis.get_image()[:, :, ::-1])

One of the images might show this.

Train the model

Now, let's fine-tune a coco-pretrained R50-FPN Mask R-CNN model on the fruits_nuts dataset. It takes ~6 minutes to train 300 iterations on Colab's K80 GPU.

from detectron2.engine import DefaultTrainer
from detectron2.config import get_cfg
import os

cfg = get_cfg()
cfg.merge_from_file(
    "./detectron2_repo/configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"
)
cfg.DATASETS.TRAIN = ("fruits_nuts",)
cfg.DATASETS.TEST = ()  # no metrics implemented for this dataset
cfg.DATALOADER.NUM_WORKERS = 2
cfg.MODEL.WEIGHTS = "detectron2://COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x/137849600/model_final_f10217.pkl"  # initialize from model zoo
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.SOLVER.BASE_LR = 0.02
cfg.SOLVER.MAX_ITER = (
    300
)  # 300 iterations seems good enough, but you can certainly train longer
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = (
    128
)  # faster, and good enough for this toy dataset
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 3  # 3 classes (data, fig, hazelnut)

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()

In case you switch to your own datasets, change the number of classes, learning rate, or max iterations accordingly.

Make a prediction

Now, we perform inference with the trained model on the fruits_nuts dataset. First, let's create a predictor using the model we just trained:

cfg.MODEL.WEIGHTS = os.path.join(cfg.OUTPUT_DIR, "model_final.pth")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5   # set the testing threshold for this model
cfg.DATASETS.TEST = ("fruits_nuts", )
predictor = DefaultPredictor(cfg)

Then, we randomly select several samples to visualize the prediction results.

from detectron2.utils.visualizer import ColorMode

for d in random.sample(dataset_dicts, 3):    
    im = cv2.imread(d["file_name"])
    outputs = predictor(im)
    v = Visualizer(im[:, :, ::-1],
                   metadata=fruits_nuts_metadata, 
                   scale=0.8, 
                   instance_mode=ColorMode.IMAGE_BW   # remove the colors of unsegmented pixels
    )
    v = v.draw_instance_predictions(outputs["instances"].to("cpu"))
    cv2_imshow(v.get_image()[:, :, ::-1])

Here is what you get with a sample image with prediction overlayed.

Conclusion and further thought

You might have read my previous tutorial on a similar object detection framework named MMdetection also built upon PyTorch. So how is Detectron2 compared with it? Here are my few thoughts.

Both frameworks are easy to config with a config file that describes how you want to train a model. Detectron2's YAML config files are more efficient for two reasons. First, You can reuse configs by making a "base" config first and build final training config files upon this base config file which reduces duplicated code. Second, the config file can be loaded first and allows any further modification as necessary in Python code which makes it more flexible.

What about the inference speed? Simply put, Detectron2 is slightly faster than MMdetection for the same Mask RCNN Resnet50 FPN model. MMdetection gets 2.45 FPS while Detectron2 achieves 2.59 FPS, or a 5.7% speed boost on inferencing a single image. Benchmark based on the following code.

import time
times = []
for i in range(20):
    start_time = time.time()
    outputs = predictor(im)
    delta = time.time() - start_time
    times.append(delta)
mean_delta = np.array(times).mean()
fps = 1 / mean_delta
print("Average(sec):{:.2f},fps:{:.2f}".format(mean_delta, fps))

So, you have it, Detectron2 make it super simple for you to train a custom instance segmentation model with custom datasets. You might find the following resources helpful.

My previous post - How to create custom COCO data set for instance segmentation.

My previous post - How to train an object detection model with mmdetection.

Detectron2 GitHub repository.

The runnable Colab Notebook for this post.

Getting started with VS CODE remote development

2019-09-22T05:02:08+00:00

Let's say you have a GPU virtual instance on the cloud or a physical machine which is headless, there are several options like remote desktop or Jupyter Notebook which can provide you with desktop-like development experience, however, VS CODE remote development extension can be more flexible than Jupyter notebook and more responsive than remote desktop. I will show you step by step how to set up it up on Windows.

Start OpenSSH service

First, let's make sure you have set up SSH on your server, most likely your online server instance will have OpenSSH server preconfigured, the command below can check whether it is running.

service sshd status

If you see something like this, you are good to go, otherwise, install or start the OpenSSH server

● ssh.service - OpenBSD Secure Shell server
 Loaded: loaded (/lib/systemd/system/ssh.service; enabled; vendor preset: enabled)
 Active: active (running) since Tue 2019-09-17 19:58:43 CST; 4 days ago
 Main PID: 600 (sshd)
 Tasks: 1 (limit: 1109)
 CGroup: /system.slice/ssh.service
 └─600 /usr/sbin/sshd -D

For the Ubuntu system, you can install OpenSSH server and optionally change the default 22 port like this

sudo apt-get install openssh-server
# Optionally change the SSH port inside this file.
sudo vi /etc/ssh/sshd_config
sudo systemctl restart ssh

Once you have set it up, ssh to this server from your development machine with IP address, user name and password just to verify there are no glitches.

OpenSSH client on Windows

This step is painless, for Windows 10 users, it is just enabling a feature in the setting page, it might be enabled already. Anyway, here is the step to verify this feature is enabled.

In the Settings page, go to Apps, then click "Manage optional features", scroll down and check "OpenSSH Client" is installed.

Setup SSH keys

You don't want to type your user name and password every time when you log in to the server, do you?

In Windows(your development machine)

Here we will generate an SSH key like this in a command prompt,

ssh-keygen -t rsa

Accept the defaults, you can leave the key phase empty when following along the prompt.

Copy the output of this command,

cat ~/.ssh/id_rsa.pub

Then ssh to the server with user name and password if you haven't already, then run those following command lines to open up append the content you just copied to ~/.ssh/authorized_keys on the server.

mkdir -p ~/.ssh
vi ~/.ssh/authorized_keys

In case you are not familiar with vi, "Shift+END" goes to the end, type "a" to enter append mode, right-click to paste the content of the clipboard. Once you are done, press "Shift + ;" then type "wq" to write and quite. Hopefully, we don't need to edit our code the same way in vi anymore after this.

To verify the SSH is set up, on your Windows machine start a new command line prompt and type ssh <username>@<server ip>, it should log in automatically without asking for the password.

Install Remote Development VS CODE Extension

Open the VSCOD, click the Extension tab, then search for "remote development" and install it.

Once it is installed, you will see a new tab named "Remote Explorer", click on it and the gear button.

Choose the first entry, in my case, it is like C:\Users\hasee\.ssh\config, Once you have it open, fill in the alias, hostname, and user. the alias can be any text which helps you remember, the hostname is likely the IP address of the remote machine.

Once you have this done, just click on the "Connect to Host in New Window" button.

One last step, in the new window click "Open Folder" in the sidebar to select a folder path on your remote machine and you are good to go, type "Ctrl + `" to open the terminal on the remote machine just like doing it locally.

Conclusion and further reading

Now you have it, a quick tutorial showing you how to setup VS CODE remote development from scratch allowing you to enjoy a desktop development experience on a headless remote server.

For the official VS Code Remote Development page, please refer to the website.

Recent Advances in Deep Learning for Object Detection - Part 2

2019-09-03T12:14:39+00:00

In the second part of the Recent Advances in Deep Learning for Object Detection series, we will summarize three aspects of object detection, proposal generation, feature representation learning, and learning strategy.

Proposal Generation

A proposal generator generates a set of rectangle bounding boxes, which are potential objects.

Feature Representation Learning

Three categories: multi-scale feature learning, contextual reasoning, and deformable feature learning.

Learning Strategy

To tackle imbalance sampling, localization, acceleration, etc.

Conclusion and further thoughts

This series gives you an overview of several critical parts you might find in deep learning for object detection as well as how they build upon each other. Finally, let's conclude the series with the network structure of Faster RCNN with FPN.

Recent Advances in Deep Learning for Object Detection - Part 1

2019-09-01T03:45:07+00:00

When comes to training a custom object detection model, TensorFlow object detection API and MMdetection(PyTorch) are two readily available options, I have shown you how to do that even on Google Colab's free GPU resources.

Those two frameworks are easy to use with simple configuration interface and let the framework source code does the heavy lifting. But do you ever wonder how the deep learning object detection algorithms are evolved over the years, their pros and cons?

I find the paper - Recent Advances in Deep Learning for Object Detection a really good answer to this quest. Let me summarize what I have learned, hopefully, elaborate in a more intuitive way.

Text colors: pro/cons

Detection Paradigms

Two-stage Detectors

One-stage Detectors

Backbone Architecture

Conclusion and further reading

This quick post summarized recent advance in deep learning object detection in three aspects, two-stage detector, one-stage detector and backbone architectures. Next time you are training a custom object detection with a third-party open-source framework, you will feel more confident to select an optimal option for your application by examing their pros and cons.

In the next post, I will finish up what we have left in the paper, namely the proposal generation, feature representation learning, and learning strategy. If you are interested, strongly suggested to give the paper a read, it will be well worth your time.

How to run Keras model on Jetson Nano in Nvidia Docker container

2019-08-10T09:21:00+00:00

I wrote, "How to run Keras model on Jetson Nano" a while back, where the model runs on the host OS. In this tutorial, I will show you how to start fresh and get the model running on Jetson Nano inside an Nvidia docker container.

You might wonder why bother with docker on Jetson Nano? I came up with several reasons.

1. It's much easier to reproduce the results with a docker container compared with installing the dependencies/libraries all by yourself. Since the docker image you pull from Docker Hub has all dependencies preinstalled which save you tons of time building from source.

2. It's less likely to mess up the Jetson Nano host OS since your code and dependencies are isolated from it. Even when you get into trouble, solving the issue is just restarting a new container away.

3. You can build your applications based on my base image with TensorFlow preinstalled in a much more controllable way by creating a new docker file.

4. You can cross-compile the Docker image with a much power computer such as an X86 based server, saves valuable time.

5. Finally, you guessed it, running code in Docker container is almost as speedy as running on the host OS with GPU acceleration available.

Hope you are convinced, here is a brief overview of how to make it happen.

Install new JetPack 4.2.1 on Jetson Nano.
Cross-compiling Docker build setup on an X86 machine.
Build a Jetson Nano docker with TensorFlow GPU.
Build an overlay Docker image(Optional).
Run the frozen Keras TensorRT model in a Docker container.

Install new JetPack 4.2.1 on Jetson Nano

Download the JetPack 4.2.1 SD card image from Nvidia. Extract the sd-blob-b01.img file from the zip. Flash it to a class 10 32GB minimal SD card with Rufus. The SD card I have is a SanDisk class10 U1 64GB model.

You can try another flasher like Etcher, but I the SD card I flashed with Etcher cannot boot on Jetson Nano. I also tried installing the JetPack with SDK manager but running into an issue with the "System configuration wizard". There is the thread I opened in the Nvidia Developer forum, their technical support is quite responsive.

Insert the SD card, plug in an HDMI monitor cable, USB keyboard, and mouse, then power up the board. Follow the system configuration wizard to finish the system configuration.

Cross-compiling Docker build setup on an X86 machine

Even though the Nvidia Docker runtime is pre-installed on the OS which allows you to build a Docker container right on the hardware. However, cross-compiling Docker on an X86 based machine can save a significant amount of building time considering larger processing power and network speed. So the one time set up for a cross-compiling environment is well worth the time. A docker container will be built on the server, pushed to a Docker registry such as the Docker Hub, then pulled from Jetson Nano.

On your X86 machine, it could be your laptop or a Linux server, install Docker first following the official instruction.

Then install qemu from the command line, qemu will emulate Jetson Nano CPU architecture(which is aarch64) on your X86 machine when building Docker containers.

sudo apt-get install -y qemu binfmt-support qemu-user-static
wget http://archive.ubuntu.com/ubuntu/pool/main/b/binfmt-support/binfmt-support_2.1.8-2_amd64.deb
sudo apt install ./binfmt-support_2.1.8-2_amd64.deb
rm binfmt-support_2.1.8-2_amd64.deb

Finally, install podman. We will use that to build containers instead of the default docker container command-line interface.

sudo apt update
sudo apt -y install software-properties-common
sudo add-apt-repository -y ppa:projectatomic/ppa
sudo apt update
sudo apt -y install podman

Build a Jetson Nano Docker with TensorFlow GPU

We build our TensorFlow GPU Docker image based on the official nvcr.io/nvidia/l4t-base:r32.2 image.

Here is the content of Dockerfile.

FROM nvcr.io/nvidia/l4t-base:r32.2
WORKDIR /
RUN apt update && apt install -y --fix-missing make g++
RUN apt update && apt install -y --fix-missing python3-pip libhdf5-serial-dev hdf5-tools
RUN apt update && apt install -y python3-h5py
RUN pip3 install --pre --no-cache-dir --extra-index-url https://developer.download.nvidia.com/compute/redist/jp/v42 tensorflow-gpu
RUN pip3 install -U numpy
CMD [ "bash" ]

Then you can pull the base image, build and push the container image to Docker Hub like this.

podman pull nvcr.io/nvidia/l4t-base:r32.2
podman build -v /usr/bin/qemu-aarch64-static:/usr/bin/qemu-aarch64-static -t docker.io/zcw607/jetson:0.1.0 . -f ./Dockerfile
podman push docker.io/zcw607/jetson:0.1.0

Change zcw607 to your own Docker Hub account name as necessary, you might have to do docker login docker.io first before you can push to the registry.

Build an overlay Docker image(Optional)

By building an overlay Docker image, you can add your code dependencies/libraries based on a previous Docker image.

For example, you want to install the Python pillow library and set up some other stuff, you can create a new Dockerfile like this.

FROM zcw607/jetson:0.1.0
WORKDIR /home
ENV TZ=Asia/Hong_Kong
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone \
 apt update && apt install -y python3-pil
CMD [ "bash" ]

Then run those two lines to build and push the new container.

podman build -v /usr/bin/qemu-aarch64-static:/usr/bin/qemu-aarch64-static -t docker.io/zcw607/jetson:r1.0.1 . -f ./Dockerfile
podman push docker.io/zcw607/jetson:r1.0.1

Now your two Docker containers reside in Docker Hub, let's sync up on Jetson Nano.

Run TensorRT model in a Docker container

In Jetson Nano command line, pull the Docker container from Docker Hub like this.

docker pull docker.io/zcw607/jetson:r1.0.1

Then start the container with the following command.

docker run --runtime nvidia --network host -it -e DISPLAY=$DISPLAY -v /tmp/.X11-unix/:/tmp/.X11-unix zcw607/jetson:r1.0.1

Check TensorFlow GPU is installed, type "python3" in the command then,

from tensorflow.python.client import device_lib
device_lib.list_local_devices()

If everything works, it should print

To run the TensorRT model inference benchmark, use my Python script. The model is converted from the Keras MobilNet V2 model for image classification. It achieves 30 FPS with 244 by 244 color image input. That is running in a Docker container, and it is even slightly faster compared with 27.18FPS running without a Docker container.

Read my previous blog to learn more about how to create your TensorRT model from Keras.

Conclusion and further reading

This tutorial shows the complete process to get a Keras model running on Jetson Nano inside an Nvidia Docker container. You can also learn how to build a Docker container on an X86 machine, push to Docker Hub and pulled from Jetson Nano. Check out my GitHub repo for updated Dockerfile, build script and inference benchmark script.

How to create custom COCO data set for instance segmentation

2019-07-27T09:00:37+00:00

In this post, I will show you how simple it is to create your custom COCO dataset and train an instance segmentation model quick for free with Google Colab's GPU.

If you just want to know how to create custom COCO data set for object detection, check out my previous tutorial.

Instance segmentation is different from object detection annotation since it requires polygonal annotations instead of bound boxes. There are many tools freely available, such as labelme and coco-annotator. labelme is easy to install and runs on all major OS, however, it lacks native support to export COCO data format annotations which are required for many model training frameworks/pipelines. coco-annotator, on the other hand, is a web-based application which requires additional efforts to get it up and running on your machine. So way takes the least effort?

Here is an overview of how you can make your own COCO dataset for instance segmentation.

Download labelme, run the application and annotate polygons on your images.
Run my script to convert the labelme annotation files to COCO dataset JSON file.

Annotate data with labelme

labelme is quite similar to labelimg in bounding annotation. So anyone familiar with labelimg, start annotating with labelme should take no time.

You can install labelme like below or find prebuild executables in the release sections, or download the latest Windows 64bit executable I built earlier.

# python3
conda create --name=labelme python=3.6
source activate labelme
# or "activate labelme" on Windows
# conda install -c conda-forge pyside2
# conda install pyqt
pip install pyqt5  # pyqt5 can be installed via pip on python3
pip install labelme

When you open the tool, click the "Open Dir" button and navigate to your images folder where all image files are located then you can start drawing polygons. To finish drawing a polygon, press "Enter" key, the tool should connect the first and last dot automatically. When done annotating an image, press shortcut key "D" on the keyboard will take you to the next image. I annotated 18 images, each image containing multiple objects, it took me about 30 minutes.

Once you have all images annotated, you can find a list of JSON file in your images directory with the same base file name. Those are labelimg annotation files, we will convert them into a single COCO dataset annotation JSON file in the next step.(Or two JSON files for train/test split.)

Convert labelme annotation files to COCO dataset format

You can find the labelme2coco.py file on my GitHub. To apply the conversion, it is only necessary to pass in one argument which is the images directory path.

python labelme2coco.py images

The script depends on three pip packages: labelme, numpy, and pillow. Go ahead and install them with pip if you are missing any of them. After executing the script, you will find a file named trainval.json located in the current directory, that is the COCO dataset annotation JSON file.

Then optionally, you can verify the annotation by opening the COCO_Image_Viewer.ipynb jupyter notebook.

If everything works, it should show something like below.

Train an instance segmentation model with mmdetection framework

If you are unfamiliar with the mmdetection framework, it is suggested to give my previous post a try - "How to train an object detection model with mmdetection". The framework allows you to train many object detection and instance segmentation models with configurable backbone networks through the same pipeline, the only thing necessary to modify is the model config python file where you define the model type, training epochs, type and path to the dataset and so on. For instance segmentation models, several options are available, you can do transfer learning with mask RCNN or cascade mask RCNN with the pre-trained backbone networks. To make it even beginner-friendly, just run the Google Colab notebook online with free GPU resource and download the final trained model. The notebook is quite similar to the previous object detection demo, so I will let you run it and play with it.

Here is the final prediction result after training a mask RCNN model for 20 epochs, which took less than 10 minutes during training.

Feel free to try with other model config files or tweak the existing one by increasing the training epochs, change the batch size and see how it might improve the results. Also notice that for the simplicity and the small size of the demo dataset, we skipped the train/test split, where you can accomplish that by manually split the labelme JSON files into two directories and run the labelme2coco.py script for each directory to generate two COCO annotation JSON files.

Conclusion and further reading

Training an instance segmentation might look daunting since doing so might require a significant amount of computing and storage resources. But that's not keeping us away from creating one with around 20 annotated images and Colab's free GPU.

Resources you might find useful

My GitHub repo for the labelme2coco script, COCO image viewer notebook, and my demo dataset files.

labelme Github repo where you can find more information about the annotation tool.

The notebook you can run to train a mmdetection instance segmentation model on Google Colab.

Go to the mmdetection GitHub repo and know more about the framework.

My previous post - How to train an object detection model with mmdetection

How to create custom COCO data set for object detection

2019-07-16T13:27:36+00:00

Previously, we have trained a mmdetection model with custom annotated dataset in Pascal VOC data format. You are out of luck if your object detection training pipeline require COCO data format since the labelImg tool we use does not support COCO annotation format. If you still want to stick with the tool for annotation and later convert your annotation to COCO format, this post is for you.

We will start with a brief introduction to the two annotation formats, followed with an introduction to the conversion script to convert VOC to COCO format, finally, we will validate the converted result by plotting the bounding boxes and class labels.

Pascal VOC and COCO annotations

Pascal VOC annotations are saved as XML files, one XML file per image. For an XML file generated by the labelImg tool. It contains the path to the image in the <path> element. Each bounding box is stored in an <object> element, an example can look like below.

<object>
	<name>fig</name>
	<pose>Unspecified</pose>
	<truncated>0</truncated>
	<difficult>0</difficult>
	<bndbox>
		<xmin>256</xmin>
		<ymin>27</ymin>
		<xmax>381</xmax>
		<ymax>192</ymax>
	</bndbox>
</object>

As you can see the bounding box is defined by two points, the upper left and bottom right corners.

For the COCO data format, first of all, there is only a single JSON file for all the annotation in a dataset or one for each split of datasets(Train/Val/Test).

The bounding box is express as the upper left starting coordinate and the box width and height, like "bbox" :[x,y,width,height].

Here is an example for the COCO data format JSON file which just contains one image as seen the top-level "images" element, 3 unique categories/classes in total seen in top-level "categories" element and 2 annotated bounding boxes for the image seen in top-level "annotations" element.

{
  "type": "instances",
  "images": [
    {
      "file_name": "0.jpg",
      "height": 600,
      "width": 800,
      "id": 0
    }
  ],
  "categories": [
    {
      "supercategory": "none",
      "name": "date",
      "id": 0
    },
    {
      "supercategory": "none",
      "name": "hazelnut",
      "id": 2
    },
    {
      "supercategory": "none",
      "name": "fig",
      "id": 1
    }
  ],
  "annotations": [
    {
      "id": 1,
      "bbox": [
        100,
        116,
        140,
        170
      ],
      "image_id": 0,
      "segmentation": [],
      "ignore": 0,
      "area": 23800,
      "iscrowd": 0,
      "category_id": 0
    },
    {
      "id": 2,
      "bbox": [
        321,
        320,
        142,
        102
      ],
      "image_id": 0,
      "segmentation": [],
      "ignore": 0,
      "area": 14484,
      "iscrowd": 0,
      "category_id": 0
    }
  ]
}

Convert Pascal VOC to COCO annotation

Once you have some annotated XML and images files, put them in the following folder structures similar the one below,

data
 └── VOC2007
 ├── Annotations
 │ ├── 0.xml
 │ ├── ...
 │ └── 9.xml
 └── JPEGImages
 ├── 0.jpg
 ├── ...
 └── 9.jpg

Then you can run the voc2coco.py script from my GitHub like this which will generate a COCO data formatted JSON file for you.

python voc2coco.py ./data/VOC/Annotations ./data/coco/output.json

Once we have the JSON file, we can visualize the COCO annotation by drawing bounding box and class labels as an overlay over the image. Open the COCO_Image_Viewer.ipynb in Jupyter notebook. Find the following cell inside the notebook which calls the display_image method to generate an SVG graph right inside the notebook.

html = coco_dataset.display_image(0, use_url=False)
IPython.display.HTML(html)

The first argument is the image id, for our demo datasets, there are totally 18 images, so you can try setting it from 0 to 17.

Conclusion and further reading

In this quick tutorial, you have learned how you can stick with the popular labelImg for custom dataset annotation and later convert the Pascal VOC to COCO dataset to train an object detection model pipeline requires COCO format datasets.

You might find the following links useful,

How to train an object detection model with mmdetection - my previous post about creating custom Pascal VOC annotation files and train an object detection model with PyTorch mmdetection framework.

COCO data format

Pascal VOC documentation

Download labelImg for the bounding box annotation.

Get the source code for this post, check out my GitHub repo.

How to train an object detection model with mmdetection

2019-06-23T13:37:01+00:00

A while back you have learned how to train an object detection model with TensorFlow object detection API, and Google Colab's free GPU, if you haven't, check it out in the post. The models in TensorFlow object detection are quite dated and missing updates for the state of the art models like Cascade RCNN and RetinaNet. While there is a counterpart for Pytorch similar to that called mmdetection which include more pre-trained state of the art object detection models for us to train custom data with, however setting it up requires a nontrivial amount of time spent on installing the environment, setting up the config file, and dataset in the right format. The good news is you can skip those boring stuff and jump directly into the fun part to train your model.

Here is an overview of how to make it happen,

1. Annotate some images, and make train/test split.

2. Run the Colab notebook to train your model.

Step 1: Annotate some images and make train/test split

It is only necessary if you want to use your images instead of ones comes with my repository. Start by forking my repository and delete the data folder in the project directory so you can start fresh with your custom data.

If you took your images by your phone, the image resolution might be 2K or 4K depends on your phone's setting. In that case, we will scale down the image for reduced overall dataset size, and faster training speed.

You can use the resize_images.py script in the repository to resize your images.

First, save all your photos to one folder outside of the project directory so they won't get accidentally uploaded to GitHub later. Ideally, all photo came with jpg extension. Then run this script to resize all photos, and save them to the project directory.

python resize_images.py --raw-dir <photo_directory> --save-dir ./data/VOCdevkit/VOC2007/ImageSets --ext jpg --target-size "(800, 600)"

You might wonder why "VOC" in the path, that is because of the annotation tool we use generates Pascal VOC formatted annotation XML files. It is not necessary to dig into the actual format of the XML file since the annotation tool handles all of that. You guessed it, that is the same tool we use previously, LabelImg, works both on Windows and Linux.

Download LabelImg and open it up,

1. Verify "PascalVOC" is selected, that is the default annotation format.

2. Open your resized image folder "./data/VOCdevkit/VOC2007/ImageSets" for annotation.

3. Change save directory for the XML annotation files to "./data/VOCdevkit/VOC2007/Annotations".

As usual, use shortcuts (w: draw box, d: next file, a: previous file, etc.) to accelerate the annotation.

Once it is done, you will find those XML files located in "./data/VOCdevkit/VOC2007/Annotations" folder with the same file base names as your image files.

For the train/test split, you are going to create two files, each one containing a list of file base names, one name per line. Those two text files will be located in the folder "data/VOC2007/ImageSets/Main" named trainval.txt and test.txt respectively. If you don't want to type all the file name by hand, try cd into the "Annotations" directory and run the shell,

ls -1 | sed -e 's/\.xml$//' | sort -n

That will give you a list of nicely sorted file base names, just split them into two parts, and paste into those two text files.

Now you have the data directory structure similar to this one below.

data
   └── VOC2007
       ├── Annotations
       │   ├── 0.xml
       │   ├── ...
       │   └── 9.xml
       ├── ImageSets
       │   └── Main
       │       ├── test.txt
       │       └── trainval.txt
       └── JPEGImages
           ├── 0.jpg
           ├── ...
           └── 9.jpg

Update your fork of the GitHub repository with your labeled datasets so you can clone it with Colab.

git add --al
git commit -m "Update datasets"
git push

Train the model on Colab Notebook

We are ready to launch the Colab notebook and fire up the training. Similar to TensorFlow object detection API, instead of training the model from scratch, we will do transfer learning from a pre-trained backbone such as resnet50 specified in the model config file.

The notebook allows you to select the model config and set the number of training epochs.

Right now, I only tested with two model configs, faster_rcnn_r50_fpn_1x, and cascade_rcnn_r50_fpn_1x, while other configs can be incorporated as demonstrated in the notebook.

The notebook handles several things before training the model,

Installing mmdetection and its dependencies.
Replacing "CLASSES" in voc.py file with your custom dataset class labels.
Modifying your selected model config file. Things like updating the number of classes to match with your dataset, changing dataset type to VOCDataset, setting the total training epoch number and more.

After that, it will re-run the mmdetection package installing script so the changes to the voc.py file will be updated to the system python packages.

%cd {mmdetection_dir}
!python setup.py install

Since your data directory resides outside of the mmdetection directory, we have the following cell in the notebook which creates a symbolic link into the project data directory.

os.makedirs("data/VOCdevkit", exist_ok=True)
voc2007_dir = os.path.join(project_name, "data/VOC2007")
os.system("ln -s {} data/VOCdevkit".format(voc2007_dir))

Then start the training.

!python tools/train.py {config_fname}

The training time depends on the size of your datasets and number of training epochs, my demo takes several minutes to complete with Colab's Tesla T4 GPU.

After training, you can test drive the model with an image in the test set like so.

%cd {mmdetection_dir}
from mmcv.runner import load_checkpoint
from mmdet.apis import inference_detector, show_result, init_detector

checkpoint_file = os.path.join(mmdetection_dir, work_dir, "latest.pth")
score_thr = 0.8

# build the model from a config file and a checkpoint file
model = init_detector(config_fname, checkpoint_file)

# test a single image and show the results
img = 'data/VOCdevkit/VOC2007/JPEGImages/15.jpg'
result = inference_detector(model, img)
show_result(img, result, model.CLASSES, score_thr=score_thr, out_file="result.jpg")

# Show the image with bbox overlays.
from IPython.display import Image
Image(filename='result.jpg')

And here is the result as you expected,

Conclusion and further reading

This tutorial shows you how to train a Pytorch mmdetection object detection model with your custom dataset, and minimal effort on Google Colab Notebook.

If you are using my GitHub repo, you probably noticed that mmdetection is included as a submodule, to update that in the future run this command.

git submodule update --recursive

Considering training with another model config? You can find a list of config files here as well as their specs such as the complexity(Mem(GB)), and accuracy(box AP). Then start by adding the config file to MODELS_CONFIG at the start of the notebook.

Resources you might find helpful,

mmdetection - GitHub repository.
LabelImg - The Annotation tool used in this tutorial.
my repository for this tutorial.

In future posts, we will look into benchmarking those custom trained model as well as their deployment to edge computing devices, stay tuned and happy coding!

How to do Transfer learning with Efficientnet

2019-06-09T03:16:06+00:00

In this tutorial, you will learn how to create an image classification neural network to classify your custom images. The network will be based on the latest EfficientNet, which has achieved state of the art accuracy on ImageNet while being 8.4x smaller and 6.1x faster.

Why EfficientNet?

Compared to other models achieving similar ImageNet accuracy, EfficientNet is much smaller. For example, the ResNet50 model as you can see in Keras application has 23,534,592 parameters in total, and even though, it still underperforms the smallest EfficientNet, which only takes 5,330,564 parameters in total.

Why is it so efficient? To answer the question, we will dive into its base model and building block. You might have heard of the building block for the classical ResNet model is identity and convolution block.

For EfficientNet, its main building block is mobile inverted bottleneck MBConv, which was first introduced in MobileNetV2. By using shortcuts directly between the bottlenecks which connects a much fewer number of channels compared to expansion layers, combined with depthwise separable convolution which effectively reduces computation by almost a factor of k², compared to traditional layers. Where k stands for the kernel size, specifying the height and width of the 2D convolution window.

The authors also add squeeze-and-excitation(SE) optimization, which contributes to further performance improvements.
The second benefit of EfficientNet, it scales more efficiently by carefully balancing network depth, width, and resolution, which lead to better performance.

As you can see, starting from the smallest EfficientNet configuration B0 to the largest B7, accuracies are steady increasing while maintaining a relatively small size.

Transfer Learning with EfficientNet

It is fine if you are not entirely sure what I am talking about in the previous section. Transfer learning for image classification is more or less model agnostic. You can pick any other pre-trained ImageNet model such as MobileNetV2 or ResNet50 as a drop-in replacement if you want.

A pre-trained network is simply a saved network previously trained on a large dataset such as ImageNet. The learned features can prove useful for many different computer vision problems, even though these new problems might involve completely different classes from those of the original task. For instance, one might train a network on ImageNet (where classes are mostly animals and everyday objects) and then re-purpose this trained network for something as remote as identifying the car models in images. For this tutorial, we expect the model to perform well on our cat vs. dog classification problem with a relatively small number of samples.

The easiest way to get started is by opening this notebook in Colab, while I will explain more detail here in this post.

First clone my repository which contains the Tensorflow Keras implementation of the EfficientNet, then cd into the directory.

!git clone https://github.com/Tony607/efficientnet_keras_transfer_learning
%cd efficientnet_keras_transfer_learning/

The EfficientNet is built for ImageNet classification contains 1000 classes labels. For our dataset, we only have 2. Which means the last few layers for classification is not useful for us. They can be excluded while loading the model by specifying the include_top argument to False, and this applies to other ImageNet models made available in Keras applications as well.

# Options: EfficientNetB0, EfficientNetB1, EfficientNetB2, EfficientNetB3
# Higher the number, the more complex the model is.
from efficientnet import EfficientNetB0 as Net
from efficientnet import center_crop_and_resize, preprocess_input

# loading pretrained conv base model
conv_base = Net(weights="imagenet", include_top=False, input_shape=input_shape)

To create our own classification layers stack on top of the EfficientNet convolutional base model. We adapt GlobalMaxPooling2D to convert 4D the (batch_size, rows, cols, channels) tensor into 2D tensor with shape (batch_size, channels). GlobalMaxPooling2D results in a much smaller number of features compared to the Flatten layer, which effectively reduces the number of parameters.

from tensorflow.keras import models
from tensorflow.keras import layers

dropout_rate = 0.2
model = models.Sequential()
model.add(conv_base)
model.add(layers.GlobalMaxPooling2D(name="gap"))
# model.add(layers.Flatten(name="flatten"))
if dropout_rate > 0:
    model.add(layers.Dropout(dropout_rate, name="dropout_out"))
# model.add(layers.Dense(256, activation='relu', name="fc1"))
model.add(layers.Dense(2, activation="softmax", name="fc_out"))

To keep the convolutional base's weight untouched, we will freeze it, otherwise, the representations previously learned from the ImageNet dataset will be destroyed.

conv_base.trainable = False

Then you can download and unzip the dog_vs_cat data from Microsoft.

!wget https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip
!unzip -qq kagglecatsanddogs_3367a.zip -d dog_vs_cat

There are several blocks of data in the Notebook dedicated to sample a subset of images from the original dataset to form train/validation/test sets after which you will see.

total training cat images: 1000
total training dog images: 1000
total validation cat images: 500
total validation dog images: 500
total test cat images: 500
total test dog images: 500

Then you can compile and train the model with Keras's ImageDataGenerator, which adds various data augmentation options during the training to reduce the chance of overfitting.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode="nearest",
)

# Note that the validation data should not be augmented!
test_datagen = ImageDataGenerator(rescale=1.0 / 255)

train_generator = train_datagen.flow_from_directory(
    # This is the target directory
    train_dir,
    # All images will be resized to target height and width.
    target_size=(height, width),
    batch_size=batch_size,
    # Since we use categorical_crossentropy loss, we need categorical labels
    class_mode="categorical",
)

validation_generator = test_datagen.flow_from_directory(
    validation_dir,
    target_size=(height, width),
    batch_size=batch_size,
    class_mode="categorical",
)
model.compile(
    loss="categorical_crossentropy",
    optimizer=optimizers.RMSprop(lr=2e-5),
    metrics=["acc"],
)
history = model.fit_generator(
    train_generator,
    steps_per_epoch=NUM_TRAIN // batch_size,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=NUM_TEST // batch_size,
    verbose=1,
    use_multiprocessing=True,
    workers=4,
)

Another technique to make the model representation more relevant for the problem at hand is called fine-tuning. That is based on the following intuition.

Earlier layers in the convolutional base encode more generic, reusable features, while layers higher up encode more specialized features.

The steps for fine-tuning a network are as follow:

1) Add your custom network on top of an already trained base network.
2) Freeze the base network.
3) Train the part you added.
4) Unfreeze some layers in the base network.
5) Jointly train both these layers and the part you added.

We have already done the first three steps, to find out which layers to unfreeze, it is helpful to plot the Keras model.

from tensorflow.keras.utils import plot_model
plot_model(conv_base, to_file='conv_base.png', show_shapes=True)
from IPython.display import Image
Image(filename='conv_base.png')

Here is the zoom in view of the last several layers in the convolutional base model.

To set 'multiply_16' and successive layers trainable.

conv_base.trainable = True

set_trainable = False
for layer in conv_base.layers:
    if layer.name == 'multiply_16':
        set_trainable = True
    if set_trainable:
        layer.trainable = True
    else:
        layer.trainable = False

Then you can compile and train the model again for some more epochs. Finally, you will have a fine-tuned model with a 9% increase in validation accuracy.

Conclusion and Further reading

This post starts with a brief introduction to EfficientNet and why its more efficient compare to classical ResNet model. An example is made runnable on Colab Notebook showing you how to build a model reusing the convolutional base of EfficientNet and fine-tuning last several layers on the custom dataset.

The full source code is available on my GitHub repo.

You might find the following resources helpful.

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

MobileNetV2: Inverted Residuals and Linear Bottlenecks

Squeeze-and-Excitation Networks

TensorFlow implementation of EfficientNet

How to compress your Keras model x5 smaller with TensorFlow model optimization

2019-05-19T12:13:11+00:00

This tutorial will demonstrate how you can reduce the size of your Keras model by 5 times with TensorFlow model optimization, which can be particularly important for deployment in resource-constraint environments.

From the official TensorFlow model optimization documentation. Weight pruning means eliminating unnecessary values in weight tensors. We set the neural network parameters' values to zero to remove what we estimate are unnecessary connections between the layers of a neural network. This is done during the training process to allow the neural network to adapt to the changes.

Here is a breakdown of how you can adopt this technique.

Train Keras model to reach an acceptable accuracy as always.
Make Keras layers or model ready to be pruned.
Create a pruning schedule and train the model for more epochs.
Export the pruned model by striping pruning wrappers from the model.
Convert Keras model to TensorFlow Lite with optional quantization.

Prune your pre-trained Keras model

Your pre-trained model has already achieved desirable accuracy, you want to cut down its size while maintaining the performance. The pruning API can help you make it happen.

To use the pruning API, install the tensorflow-model-optimization and tf-nightly packages.

pip uninstall -yq tensorflow
pip uninstall -yq tf-nightly
pip install -Uq tf-nightly-gpu
pip install -q tensorflow-model-optimization

Then you can load your previous trained model and make it "prunable". The Keras-based API can be applied at the level of individual layers, or the entire model. Since you have the entire model pre-trained, it is easier to apply the pruning to the entire model. The algorithm will be applied to all layers capable of weight pruning.

For the pruning schedule, we start at the sparsity level 50% and gradually train the model to reach 90% sparsity. X% sparsity means that X% of the weight tensor is going to be pruned away.

Furthermore, we give the model some time to recover after each pruning step, so pruning does not happen on every step. We set the pruning frequency to 100. Similar to pruning a bonsai, we are trimming it gradually so that the tree can adequately heal the wound created during pruning instead of cutting 90% of its branches in one day.

Given the model already reached a satisfactory accuracy, we can start pruning immediately. As a result, we set the begin_step to 0 here, and only train for another four epochs.

The end step is calculated given the number of train example, batch size, and the total epochs to train.

import numpy as np
import tensorflow as tf
from tensorflow_model_optimization.sparsity import keras as sparsity

# Backend agnostic way to save/restore models
# _, keras_file = tempfile.mkstemp('.h5')
# print('Saving model to: ', keras_file)
# tf.keras.models.save_model(model, keras_file, include_optimizer=False)

# Load the serialized model
loaded_model = tf.keras.models.load_model(keras_file)

epochs = 4
end_step = np.ceil(1.0 * num_train_samples / batch_size).astype(np.int32) * epochs
print(end_step)

new_pruning_params = {
      'pruning_schedule': sparsity.PolynomialDecay(initial_sparsity=0.50,
                                                   final_sparsity=0.90,
                                                   begin_step=0,
                                                   end_step=end_step,
                                                   frequency=100)
}

new_pruned_model = sparsity.prune_low_magnitude(loaded_model, **new_pruning_params)
new_pruned_model.summary()

new_pruned_model.compile(
    loss=tf.keras.losses.categorical_crossentropy,
    optimizer='adam',
    metrics=['accuracy'])

Don't panic if you find more trainable parameters in the new_pruned_model summary, those came from the pruning wrappers which we will remove later.

Now let's start the training and pruning model.

# Add a pruning step callback to peg the pruning step to the optimizer's
# step. Also add a callback to add pruning summaries to tensorboard
callbacks = [
    sparsity.UpdatePruningStep(),
    sparsity.PruningSummaries(log_dir=logdir, profile_batch=0)
]

new_pruned_model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          callbacks=callbacks,
          validation_data=(x_test, y_test))

score = new_pruned_model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

The test loss and accuracy of the pruned model should look similar to your original Keras model.

Export the pruned model

Those pruning wrappers can be removed easily like this, after which the total number of parameters should be the same as your original model.

final_model = sparsity.strip_pruning(pruned_model)
final_model.summary()

Now you can check the percentage of weights were pruned by comparing them to zero.

from tensorflow.keras.models import load_model

model = load_model(final_model)
import numpy as np

for i, w in enumerate(model.get_weights()):
    print(
        "{} -- Total:{}, Zeros: {:.2f}%".format(
            model.weights[i].name, w.size, np.sum(w == 0) / w.size * 100
        )
    )

Here is the results, as you can see, 90% of convolution, dense and batch norm layers' weights are pruned.

name	Total para	Pruned%
conv2d_2/kernel:0	800	89.12%
conv2d_2/bias:0	32	0.00%
batch_normalization_1/gamma:0	32	0.00%
batch_normalization_1/beta:0	32	0.00%
conv2d_3/kernel:0	32	0.00%
conv2d_3/bias:0	32	0.00%
dense_2/kernel:0	51200	89.09%
dense_2/bias:0	64	0.00%
dense_3/kernel:0	3211264	89.09%
dense_3/bias:0	1024	0.00%
batch_normalization_1/moving_mean:0	10240	89.09%
batch_normalization_1/moving_variance:0	10	0.00%

Now, simply using a generic file compression algorithm (e.g. zip), the Keras model will be reduced by x5 times.

import tempfile
import zipfile

_, new_pruned_keras_file = tempfile.mkstemp(".h5")
print("Saving pruned model to: ", new_pruned_keras_file)
tf.keras.models.save_model(final_model, new_pruned_keras_file, include_optimizer=False)

# Zip the .h5 model file
_, zip3 = tempfile.mkstemp(".zip")
with zipfile.ZipFile(zip3, "w", compression=zipfile.ZIP_DEFLATED) as f:
    f.write(new_pruned_keras_file)
print(
    "Size of the pruned model before compression: %.2f Mb"
    % (os.path.getsize(new_pruned_keras_file) / float(2 ** 20))
)
print(
    "Size of the pruned model after compression: %.2f Mb"
    % (os.path.getsize(zip3) / float(2 ** 20))
)

Here is what you get, x5 times smaller model.

Size of the pruned model before compression: 12.52 Mb
Size of the pruned model after compression: 2.51 Mb

Convert Keras model to TensorFlow Lite

Tensorflow Lite is an example format you can use to deploy to mobile devices. To convert to a Tensorflow Lite graph, it is necessary to use the TFLiteConverter as below:

# Create the .tflite file
tflite_model_file = "/tmp/sparse_mnist.tflite"
converter = tf.lite.TFLiteConverter.from_keras_model_file(pruned_keras_file)
tflite_model = converter.convert()
with open(tflite_model_file, "wb") as f:
    f.write(tflite_model)

Then you can use a similar technique to zip the tflite file and reduce size x5 times smaller.

Post-training quantization converts weights to 8-bit precision as part of the model conversion from keras model to TFLite's flat buffer, resulting in another 4x reduction in the model size. Just add the following line to the previous snippet before calling the convert().

converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]

The compressed 8-bit tensorflow lite model only takes 0.60 Mb compared to the original Keras model's 12.52 Mb while maintaining comparable test accuracy. That's totally x16 times size reduction.

You can evaluate the accuracy of the converted TensorFlow Lite model like this where you feed the eval_model with the test dataset.

import numpy as np

interpreter = tf.lite.Interpreter(model_path=str(tflite_model_file))
interpreter.allocate_tensors()
input_index = interpreter.get_input_details()[0]["index"]
output_index = interpreter.get_output_details()[0]["index"]

def eval_model(interpreter, x_test, y_test):
  total_seen = 0
  num_correct = 0

  for img, label in zip(x_test, y_test):
    inp = img.reshape((1, 28, 28, 1))
    total_seen += 1
    interpreter.set_tensor(input_index, inp)
    interpreter.invoke()
    predictions = interpreter.get_tensor(output_index)
    if np.argmax(predictions) == np.argmax(label):
      num_correct += 1

    if total_seen % 1000 == 0:
        print("Accuracy after %i images: %f" %
              (total_seen, float(num_correct) / float(total_seen)))

  return float(num_correct) / float(total_seen)

print(eval_model(interpreter, x_test, y_test))

Conclusion and Further reading

In this tutorial, we showed you how to create sparse models with the TensorFlow model optimization toolkit weight pruning API. Right now, this allows you to create models that take significantly less space on the disk. The resulting model can also be more efficiently implemented to avoid computation; in the future, TensorFlow Lite will provide such capabilities.

Check out the official TensorFlow model optimization page and their GitHub page for more information.

The source code for this post is available on my Github and runnable on Google Colab Notebook.

How to run Tensorboard for PyTorch 1.1.0 inside Jupyter notebook

2019-05-09T10:57:05+00:00

Facebook introduced PyTorch 1.1 with TensorBoard support. Let's try it out really quickly on Colab's Jupyter Notebook.

Not need to install anything locally on your development machine. Google's Colab cames in handy free of charge even with its upgraded Tesla T4 GPU.

Firstly, let's create a Colab notebook or open this one I made.

Type in the first cell to check the version of PyTorch is at minimal 1.1.0

import torch
torch.__version__

Then you are going to install the cutting edge TensorBoard build like this.

!pip install -q tb-nightly

The output might remind you to restart the runtime to make the new TensorBoard take effect. You can click through Runtime -> Restart runtime....

Next, load the TensorBoard notebook extension with this magic line.

%load_ext tensorboard

After which you can start by exploring the TORCH.UTILS.TENSORBOARD API, these utilities let you log PyTorch models and metrics into a directory for visualization within the TensorBoard UI. Scalars, images, histograms, graphs, and embedding visualizations are all supported for PyTorch models and tensors.

The SummaryWriter class is your main entry to log data for consumption and visualization by TensorBoard. Let's run this official demo for MNIST dataset and ResNet50 model.

import torch
import torchvision
from torch.utils.tensorboard import SummaryWriter
from torchvision import datasets, transforms

# Writer will output to ./runs/ directory by default
writer = SummaryWriter()

transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
trainset = datasets.MNIST('mnist_train', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)
model = torchvision.models.resnet50(False)
# Have ResNet model take in grayscale rather than RGB
model.conv1 = torch.nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
images, labels = next(iter(trainloader))

grid = torchvision.utils.make_grid(images)
writer.add_image('images', grid, 0)
writer.add_graph(model, images)
writer.close()

You just wrote an image and the model graph data to TensorBoard summary. The writer wrote the output file to "./runs" directory by default.

Let's run the TensorBoard to visualize them

%tensorboard --logdir=runs

That's it, you have it!

Summary and Further reading

This really short tutorial gets you to start with running TensorBoard with latest Pytorch 1.1.0 in a Jupyter Notebook. Keep playing around with other features supported with PyTorch TensorBoard.

Read the official API document here - TORCH.UTILS.TENSORBOARD

How to run Keras model on RK3399Pro

2019-05-02T13:09:02+00:00

Previously, we have introduced and benchmarked several embedded edge computing solutions, including OpenVINO for Intel Intel neural compute sticks, CMSIS-NN for ARM microcontrollers and TensorRT model on Jetson Nano.

What they have in common is each hardware provider has their own tools and API to quantize a TensorFlow graph and combine adjacent layers to accelerate inferencing.

This time we will take a look at the RockChip RK3399Pro SoC with builtin NPU(Neural Compute Unit) rated to inference at 2.4TOPs at 8 bits precision, which is capable of running Inception V3 model at a speed over 28 FPS. You will see deploying a Keras model to the board is quite similar to previously mentioned solutions.

Freeze Keras model to TensorFlow graph and creates inference model with RKNN Toolkit.
Load the RKNN model on an RK3399Pro dev board and make predictions.

Let's get started with the first time setup.

Setup RK3399Pro board

Any dev board with an RK3399Pro SoC like the Rockchip Toybrick RK3399PRO Board or the Firefly Core-3399Pro should work. I have a Rockchip Toybrick RK3399PRO Board with 6GB RAM(2GB dedicated for NPU).

The board came with many connectors and interfaces similar to the Jetson Nano. One thing worth mentioning, the HDMI connector doesn't work with my monitor, however, I am able to get a USB Type-C to HDMI adapter working.

It has Fedora Linux release 28 preinstalled with the default username and passwords "toybrick".

The RK3399Pro has 6 core 64-bits CPUs with aarch64 architecture same architecture as Jetson Nano but quite different from Raspberry 3B+ which is ARMv7 32-bit only. This means any precompiled python wheel packages target Raspberry Pi will not likely work with RK3399Pro or Jetson Nano. But don't be despair, you can download the precompiled aarch64 python wheel package files from my aarch64_python_packages repo including scipy, onnx, tensorflow and rknn_toolkit from their official GitHub.

Transfer those wheel files to the RK3399Pro board then run the following command.

sudo dnf update -y
sudo dnf install -y cmake gcc gcc-c++ protobuf-devel protobuf-compiler lapack-devel
sudo dnf install -y python3-devel python3-opencv python3-numpy-f2py python3-h5py python3-lmdb
sudo dnf install -y python3-grpcio

sudo pip3 install scipy-1.2.0-cp36-cp36m-linux_aarch64.whl
sudo pip3 install onnx-1.4.1-cp36-cp36m-linux_aarch64.whl
sudo pip3 install tensorflow-1.10.1-cp36-cp36m-linux_aarch64.whl
sudo pip3 install rknn_toolkit-0.9.9-cp36-cp36m-linux_aarch64.whl

This might take a while depending on your internet connection speed.

Step1: Freeze Keras model and convert to RKNN model

The conversion from TensorFlow graph to RKNN model will take considerable time if you choose to run on the development board. So it is recommended to get a Linux development machine which could be the Windows WSL, an Ubuntu VM or even Google Colab.

Setup your development for the first time, you can find the rknn toolkit wheel package files from their official GitHub.

pip3 install -U tensorflow scipy onnx
pip3 install rknn_toolkit-0.9.9-cp36-cp36m-linux_x86_64.whl
# Or if you have Python 3.5
# pip3 install rknn_toolkit-0.9.9-cp35-cp35m-linux_x86_64.whl

Frozen a Keras model to a single .pb file is similar to previous tutorials. You can find the code in freeze_graph.py on GitHub. Once it is done, you will have an ImageNet InceptionV3 frozen model accepts inputs with shape (N, 299, 299, 3).

Take notes of the input and output node names since we will specify they when loading the frozen model with RKNN toolkit. For the InceptionV3 and many other Keras ImageNet models they will be,

INPUT_NODE: ['input_1']
OUTPUT_NODE: ['predictions/Softmax']

Then you can run the convert_rknn.py script to quantize your model to the uint8 data type or more specifically asymmetric quantized uint8 type.

For asymmetric quantization, the quantized range is fully utilized vs the symmetric mode. That is because we exactly map the min/max values from the float range to the min/max of the quantized range. Below is an illustration of the two range-based linear quantization methods. You can read more about it here.

The rknn.config also allows you to specify the channel_mean_value with a list of 4 values (M0, M1, M2, S0) as a way to automatically normalize the image data with uint8(0~255) data type to different ranges in the inference pipeline. Keras ImageNet models with TensorFlow backend expect the image data values normalized between -1 to 1. To accomplish this, we set the channel_mean_value to "128 128 128 128" where the first three values are mean values for each of the RGB color channels, the last value is a scale parameter. The output data is calculated as follows.

R_out = (R - M0)/S0
G_out = (G - M1)/S0
B_out = (B - M2)/S0

If you use Python OpenCV to read or capture images, the color channel is in BGR order, in that case, you can set the reorder_channel parameter of rknn.config() to "2 1 0" so the color channels will be reordered to RGB in the inference pipeline.

from rknn.api import RKNN


INPUT_NODE = ["input_1"]
OUTPUT_NODE = ["predictions/Softmax"]

img_height = 299

# Create RKNN object
rknn = RKNN()

# pre-process config
# channel_mean_value "0 0 0 255" while normalize the image data to range [0, 1]
# channel_mean_value "128 128 128 128" while normalize the image data to range [-1, 1]
# reorder_channel "0 1 2" will keep the color channel, "2 1 0" will swap the R and B channel,
# i.e. if the input is BGR loaded by cv2.imread, it will convert it to RGB for the model input.
# need_horizontal_merge is suggested for inception models (v1/v3/v4).
rknn.config(
    channel_mean_value="128 128 128 128",
    reorder_channel="0 1 2",
    need_horizontal_merge=True,
    quantized_dtype="asymmetric_quantized-u8",
)

# Load tensorflow model
ret = rknn.load_tensorflow(
    tf_pb="./model/frozen_model.pb",
    inputs=INPUT_NODE,
    outputs=OUTPUT_NODE,
    input_size_list=[[img_height, img_height, 3]],
)
if ret != 0:
    print("Load inception_v3 failed!")
    exit(ret)

# Build model
# dataset: A input data set for rectifying quantization parameters.
ret = rknn.build(do_quantization=True, dataset="./dataset.txt")
if ret != 0:
    print("Build inception_v3 failed!")
    exit(ret)

# Export rknn model
ret = rknn.export_rknn("./inception_v3.rknn")
if ret != 0:
    print("Export inception_v3.rknn failed!")
    exit(ret)

After running the script you will have the inception_v3.rknn in the project directory, transfer the file to the dev board for inference.

Step 2: Loads RKNN model and make predictions

The inference pipeline takes care of stuff including image normalization and color channel reordering as configured in the previous step. What's left for you is loading the model, initializing the runtime environment and running the inference.

import numpy as np
import cv2
from rknn.api import RKNN

# Create RKNN object
rknn = RKNN()
img_height = 299

# Direct Load RKNN Model
ret = rknn.load_rknn("./inception_v3.rknn")
if ret != 0:
    print("Load inception_v3.rknn failed!")
    exit(ret)

# Set inputs
img = cv2.imread("./data/elephant.jpg")
img = cv2.resize(img, dsize=(img_height, img_height), interpolation=cv2.INTER_CUBIC)

# This can opt out if "reorder_channel" is set to "2 1 0"
# rknn.config() in `convert_rknn.py`
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

# init runtime environment
print("--> Init runtime environment")
ret = rknn.init_runtime(target="rk3399pro")
if ret != 0:
    print("Init runtime environment failed")
    exit(ret)

# Inference
outputs = rknn.inference(inputs=[img])
outputs = np.array(outputs)

rknn.release()

The outputs shape is (1, 1, 1000) representing 1000 classes' logits.

Benchmark results

Benchmark setting.

Model: Inception V3
Quantization: uint8
Input size: (1, 499, 499, 3)

Let's run the inferencing several times and see how fast it can go.

import time

times = []

# Run inference 20 times and do the average.
for i in range(20):
    start_time = time.time()
    # Use the API internal call directly.
    results = rknn.rknn_base.inference(
        inputs=[img], data_type="uint8", data_format="nhwc", outputs=None
    )
    # Alternatively, use the external API call.
    # outputs = rknn.inference(inputs=[img])
    delta = time.time() - start_time
    times.append(delta)

# Calculate the average time for inference.
mean_delta = np.array(times).mean()

fps = 1 / mean_delta
print("average(sec):{:.3f},fps:{:.2f}".format(mean_delta, fps))

It achieves an average FPS of 28.94, even faster than Jetson Nano's 27.18 FPS running a much smaller MobileNetV2 model.

Conclusion and Further reading

This post shows you how to get started with an RK3399Pro dev board, convert and run a Keras image classification on its NPU in real-time speed.

For the complete source code, check out my GitHub repository.

How to run TensorFlow Object Detection model on Jetson Nano

2019-04-21T02:39:02+00:00

Previously, you have learned how to run a Keras image classification model on Jetson Nano, this time you will know how to run a Tensorflow object detection model on it. It could be a pre-trained model in Tensorflow detection model zoo which detects everyday object like person/car/dog, or it could be a custom trained object detection model which detects your custom objects.

For this tutorial, we will convert the SSD MobileNet V1 model trained on coco dataset for common object detection.

Here is a break down how to make it happen, slightly different from the previous image classification tutorial.

Download pre-trained model checkpoint, build TensorFlow detection graph then creates inference graph with TensorRT.
Loads the TensorRT inference graph on Jetson Nano and make predictions.

Those two steps will be handled in two separate Jupyter Notebook, with the first one running on a development machine and second one running on the Jetson Nano.

Before going any further make sure you have setup Jetson Nano and installed Tensorflow.

Step 1: Create TensorRT model

Run this step on your development machine with Tensorflow nightly builds which include TF-TRT by default or you can run on this Colab notebook's free GPU.

In the notebook, you will start with installing Tensorflow Object Detection API and setting up relevant paths. Its official installing documentation might look daunting to beginners, but you can also do it by running just one notebook cell.

%cd /content
!git clone --quiet https://github.com/tensorflow/models.git

!apt-get install -qq protobuf-compiler python-pil python-lxml python-tk

!pip install -q Cython contextlib2 pillow lxml matplotlib

!pip install -q pycocotools

%cd /content/models/research
!protoc object_detection/protos/*.proto --python_out=.

import os
import sys
os.environ['PYTHONPATH'] += ':/content/models/research/:/content/models/research/slim/'
sys.path.append("/content/models/research/slim/")

!python object_detection/builders/model_builder_test.py

Next, you will download and build a detection graph from the pre-trained ssd_mobilenet_v1_coco checkpoint or select another one from the list provided in the Notebook.

config_path, checkpoint_path = download_detection_model(MODEL, 'data')

frozen_graph, input_names, output_names = build_detection_graph(
    config=config_path,
    checkpoint=checkpoint_path,
    score_threshold=0.3,
    iou_threshold=0.5,
    batch_size=1
)

Initially, the default Tensorflow object detection model takes variable batch size, it is now fixed to 1 since the Jetson Nano is a resource-constrained device. In the build_detection_graph call, several other changes apply to the Tensorflow graph,

The score threshold is set to 0.3, so the model will remove any prediction results with confidence score lower than the threshold.
IoU(intersection over union) threshold is set to 0.5 so that any detected objects with same classes overlapped will be removed. You can read more about IoU(intersection over union) and non-max suppression here.
Apply modifications over the frozen object detection graph for improved speed and reduced memory consumption.

Next, we create a TensorRT inference graph just like the image classification model.

import tensorflow.contrib.tensorrt as trt

trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=output_names,
    max_batch_size=1,
    max_workspace_size_bytes=1 << 25,
    precision_mode='FP16',
    minimum_segment_size=50
)

Once you have the TensorRT inference graph, you can save it as pb file and download from Colab or your local machine into your Jetson Nano as necessary.

with open('./data/trt_graph.pb', 'wb') as f:
    f.write(trt_graph.SerializeToString())

# Download the tensorRT graph .pb file from colab to your local machine.
from google.colab import files

files.download('./data/trt_graph.pb')

Step 2: Loads TensorRT graph and make predictions

On your Jetson Nano, start a Jupyter Notebook with command jupyter notebook --ip=0.0.0.0 where you have saved the downloaded graph file to ./model/trt_graph.pb. The following code will load the TensorRT graph and make it ready for inferencing.

import tensorflow as tf

def get_frozen_graph(graph_file):
    """Read Frozen Graph file from disk."""
    with tf.gfile.FastGFile(graph_file, "rb") as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
    return graph_def

# The TensorRT inference graph file downloaded from Colab or your local machine.
pb_fname = "./model/trt_graph.pb"
trt_graph = get_frozen_graph(pb_fname)

input_names = ['image_tensor']

# Create session and load graph
tf_config = tf.ConfigProto()
tf_config.gpu_options.allow_growth = True
tf_sess = tf.Session(config=tf_config)
tf.import_graph_def(trt_graph, name='')

tf_input = tf_sess.graph.get_tensor_by_name(input_names[0] + ':0')
tf_scores = tf_sess.graph.get_tensor_by_name('detection_scores:0')
tf_boxes = tf_sess.graph.get_tensor_by_name('detection_boxes:0')
tf_classes = tf_sess.graph.get_tensor_by_name('detection_classes:0')
tf_num_detections = tf_sess.graph.get_tensor_by_name('num_detections:0')

Now, we can make a prediction with an image and see if the model gets it correctly. Notice we resized the image to 300 x 300, however, you can try other sizes or just keep the size unmodified since the graph can handle variable-sized input. But keep in mind, since the memory in Jetson is quite tiny compared to a desktop machine so it can hardly take large images.

import cv2
IMAGE_PATH = "./data/dogs.jpg"
image = cv2.imread(IMAGE_PATH)
image = cv2.resize(image, (300, 300))

scores, boxes, classes, num_detections = tf_sess.run([tf_scores, tf_boxes, tf_classes, tf_num_detections], feed_dict={
    tf_input: image[None, ...]
})
boxes = boxes[0]  # index by 0 to remove batch dimension
scores = scores[0]
classes = classes[0]
num_detections = int(num_detections[0])

If you have played around Tensorflow object detection API before, those outputs should look familiar.

Here the results might still contain overlapped predictions with different class labels. For example, the same object can be labeled with two classes in two overlapping bound boxes.

We will use a custom non-max suppression function to remove the overlapping bounding boxes with lower prediction score.

Let's visualize the result by drawing bounding boxes and labels overlays.

Here is the code to create the overlays and display on the Jetson Nano's Notebook.

from IPython.display import Image as DisplayImage

# Boxes unit in pixels (image coordinates).
boxes_pixels = []
for i in range(num_detections):
    # scale box to image coordinates
    box = boxes[i] * np.array([image.shape[0],
                               image.shape[1], image.shape[0], image.shape[1]])
    box = np.round(box).astype(int)
    boxes_pixels.append(box)
boxes_pixels = np.array(boxes_pixels)

# Remove overlapping boxes with non-max suppression, return picked indexes.
pick = non_max_suppression(boxes_pixels, scores[:num_detections], 0.5)


for i in pick:
    box = boxes_pixels[i]
    box = np.round(box).astype(int)
    # Draw bounding box.
    image = cv2.rectangle(
        image, (box[1], box[0]), (box[3], box[2]), (0, 255, 0), 2)
    label = "{}:{:.2f}".format(int(classes[i]), scores[i])
    # Draw label (class index and probability).
    draw_label(image, (box[1], box[0]), label)

# Save and display the labeled image.
save_image(image[:, :, ::-1])
DisplayImage(filename="./data/img.png")

In coco label map, class 18 means a dog and 23 is a bear. The two dogs sitting there are incorrectly classified as bears. Maybe there are more sitting bears than standing dogs in coco datasets.

A similar speed benchmark is carried out and Jetson Nano has achieved 11.54 FPS with the SSD MobileNet V1 model and 300 x 300 input image.

If you run into out of memory issue, try to boot up the board without any monitor attached and log into the shell with SSH so you can save some memory from the GUI.

Conclusion and further reading

In this tutorial, you learned how to convert a Tensorflow object detection model and run the inference on Jetson Nano.

Check out the updated GitHub repo for the source code.

If you are not satisfied with the results, there are other pre-trained models for you to take a look at, I recommend you start with SSD MobileNet V2(ssd_mobilenet_v2_coco), or if you are adventurous, try ssd_inception_v2_coco which might push the limits of Jetson Nano's memory.

You can find those models in Tensorflow detection model zoo, the "Speed (ms)" metric will give you a guideline on the complexity of the model.

Thinking about training your custom object detection model with a free data center GPU, check out my previous tutorial - How to train an object detection model easy for free.

How to run Keras model on Jetson Nano

2019-04-13T10:03:44+00:00

Jetson Nano Developer Kit announced at the 2019 GTC for $99 brings a new rival to the arena of edge computing hardware alongside its more pricy predecessors, Jetson TX1 and TX2. The coming of Jetson Nano gives the company a competitive advantage over other affordable options, to name a few, Movidius neural compute stick, Intel Graphics running OpenVINO and Google edge TPU.

In this post, I will show you how to run a Keras model on the Jetson Nano.

Here is a break down of how to make it happen.

Freeze Keras model to TensorFlow graph then creates inference graph with TensorRT.
Loads the TensorRT inference graph on Jetson Nano and make predictions.

We will do the first step on a development machine since it is computational and resource intensive way beyond what Jetson Nano can handle.

Let's get started.

Setup Jetson Nano

Follow the official getting started guide to flash the latest SD card image, setup, and boot.

One thing to keep in mind, Jetson Nano doesn't come with WIFI radio as the latest Raspberry Pi does, so it is recommended to have a USB WIFI dongle like this one ready unless you plan to hardwire its ethernet jack instead.

Install TensorFlow on Jetson Nano

There is a thread on the Nvidia developer forum about official support of TensorFlow on Jetson Nano, here is a quick run down how you can install it.

Start a terminal or SSH to your Jetson Nano, then run those commands.

sudo apt update
sudo apt install python3-pip libhdf5-serial-dev hdf5-tools
pip3 install --extra-index-url https://developer.download.nvidia.com/compute/redist/jp/v42 tensorflow-gpu==1.13.1+nv19.3 --user

In case you get into the error below,

Cannot compile 'Python.h'. Perhaps you need to install python-dev|python-devel

Try run

sudo apt install python3.6-dev

The Python3 might gets updated to a later version in the future. You can always check your version first with python3 --version, and change the previous command accordingly.

It is also helpful to install Jupyter Notebook so you can remotely connect to it from a development machine.

pip3 install jupyter

Also, notice that Python OpenCV version 3.3.1 is already installed which ease a lot of pain from cross compiling. You can verify this by importing the cv2 library from the Python3 command line interface.

Step 1: Freeze Keras model and convert into TensorRT model

Run this step on your development machine with Tensorflow nightly builds which include TF-TRT by default or run on this Colab notebook's free GPU.

First lets loads a Keras model. For this tutorial, we use pre-trained MobileNetV2 came with Keras, feel free to replace it with your custom model when necessary.

import os
from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2 as Net

model = Net(weights='imagenet')

os.makedirs('./model', exist_ok=True)

# Save the h5 file to path specified.
model.save("./model/model.h5")

Once you have the Keras model save as a single .h5 file, you can freeze it to a TensorFlow graph for inferencing.

Take notes of the input and output nodes names printed in the output. We will need them when converting TensorRT inference graph and prediction.

For Keras MobileNetV2 model, they are, ['input_1'] ['Logits/Softmax'].

import tensorflow as tf
from tensorflow.python.framework import graph_io
from tensorflow.keras.models import load_model


# Clear any previous session.
tf.keras.backend.clear_session()

save_pb_dir = './model'
model_fname = './model/model.h5'
def freeze_graph(graph, session, output, save_pb_dir='.', save_pb_name='frozen_model.pb', save_pb_as_text=False):
    with graph.as_default():
        graphdef_inf = tf.graph_util.remove_training_nodes(graph.as_graph_def())
        graphdef_frozen = tf.graph_util.convert_variables_to_constants(session, graphdef_inf, output)
        graph_io.write_graph(graphdef_frozen, save_pb_dir, save_pb_name, as_text=save_pb_as_text)
        return graphdef_frozen

# This line must be executed before loading Keras model.
tf.keras.backend.set_learning_phase(0) 

model = load_model(model_fname)

session = tf.keras.backend.get_session()

input_names = [t.op.name for t in model.inputs]
output_names = [t.op.name for t in model.outputs]

# Prints input and output nodes names, take notes of them.
print(input_names, output_names)

frozen_graph = freeze_graph(session.graph, session, [out.op.name for out in model.outputs], save_pb_dir=save_pb_dir)

Normally this frozen graph is what you use for deploying. However, it is not optimized to run on Jetson Nano for both speed and resource efficiency wise. That is what TensorRT comes into play, it quantizes the model from FP32 to FP16, effectively reducing the memory consumption. It also fuses layers and tensor together which further optimizes the use of GPU memory and bandwidth. All this come with little or no noticeable reduced accuracy.

And this can be done in a single call,

import tensorflow.contrib.tensorrt as trt

trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=output_names,
    max_batch_size=1,
    max_workspace_size_bytes=1 << 25,
    precision_mode='FP16',
    minimum_segment_size=50
)

The result is also a TensorFlow graph but optimized to run on your Jetson Nano with TensorRT. Let's save it as a single .pb file.

graph_io.write_graph(trt_graph, "./model/",
                     "trt_graph.pb", as_text=False)

Download the TensorRT graph .pb file either from colab or your local machine into your Jetson Nano. You can use scp/sftp to remotely copy the file. For Windows, you can use WinSCP, for Linux/Mac you can try scp/sftp from the command line.

Step 2: Loads TensorRT graph and make predictions

The output and the input names might be different for your choice of Keras model other than the MobileNetV2.

output_names = ['Logits/Softmax']
input_names = ['input_1']

import tensorflow as tf


def get_frozen_graph(graph_file):
    """Read Frozen Graph file from disk."""
    with tf.gfile.FastGFile(graph_file, "rb") as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
    return graph_def


trt_graph = get_frozen_graph('./model/trt_graph.pb')

# Create session and load graph
tf_config = tf.ConfigProto()
tf_config.gpu_options.allow_growth = True
tf_sess = tf.Session(config=tf_config)
tf.import_graph_def(trt_graph, name='')


# Get graph input size
for node in trt_graph.node:
    if 'input_' in node.name:
        size = node.attr['shape'].shape
        image_size = [size.dim[i].size for i in range(1, 4)]
        break
print("image_size: {}".format(image_size))


# input and output tensor names.
input_tensor_name = input_names[0] + ":0"
output_tensor_name = output_names[0] + ":0"

print("input_tensor_name: {}\noutput_tensor_name: {}".format(
    input_tensor_name, output_tensor_name))

output_tensor = tf_sess.graph.get_tensor_by_name(output_tensor_name)

Now, we can make a prediction with an elephant picture and see if the model gets it correctly.

from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input, decode_predictions

# Optional image to test model prediction.
img_path = './data/elephant.jpg'

img = image.load_img(img_path, target_size=image_size[:2])
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)

feed_dict = {
    input_tensor_name: x
}
preds = tf_sess.run(output_tensor, feed_dict)

# decode the results into a list of tuples (class, description, probability)
# (one such list for each sample in the batch)
print('Predicted:', decode_predictions(preds, top=3)[0])

Benchmark results

Let's run the inferencing several times and see how fast it can go.

import time
times = []
for i in range(20):
    start_time = time.time()
    one_prediction = tf_sess.run(output_tensor, feed_dict)
    delta = (time.time() - start_time)
    times.append(delta)
mean_delta = np.array(times).mean()
fps = 1 / mean_delta
print('average(sec):{:.2f},fps:{:.2f}'.format(mean_delta, fps))

It got a 27.18 FPS which can be considered prediction in real time. In addition, the Keras model can inference at 60 FPS on Colab's Tesla K80 GPU, which is twice as fast as Jetson Nano, but that is a data center card.

Conclusion and Further reading

In this tutorial, we walked through how to convert, optimized your Keras image classification model with TensorRT and run inference on the Jetson Nano dev kit. Now, try another Keras ImageNet model or your custom model, connect a USB webcam/ Raspberry Pi camera to it and do a real-time prediction demo, be sure to share your results with us in the comments below.

In the future, we will look into running models for other applications, such as object detection. If you are interested in other affordable edge computing options, check out my previous post, how to run Keras model inference x3 times faster with CPU and Intel OpenVINO also works for Movidius neural compute stick on Linux/Windows and Raspberry Pi.

The source code for this tutorial is available on my GitHub repo. You can also skip the step 1 model conversion and download the trt_graph.pb file directly from the GitHub repo releases.

How to do Hyper-parameters search with Bayesian optimization for Keras model

2019-04-06T05:23:50+00:00

Compared to more simpler hyperparameter search methods like grid search and random search, Bayesian optimization is built upon Bayesian inference and Gaussian process with an attempts to find the maximum value of an unknown function as few iterations as possible. It is particularly suited for optimization of high-cost functions like hyperparameter search for deep learning model, or other situations where the balance between exploration and exploitation is important.

The Bayesian Optimization package we are going to use is BayesianOptimization, which can be installed with the following command,

pip install bayesian-optimization

Firstly, we will specify the function to be optimized, in our case, hyperparameters search, the function takes a set of hyperparameters values as inputs, and output the evaluation accuracy for the Bayesian optimizer. Inside the function, a new model will be constructed with the specified hyperparameters, train for a number of epochs and evaluated against a set metrics. Every new evaluated accuracy will become a new observation for the Bayesian optimizer, which contributes to the next search hyperparameters' values.

Let's create a helper function first which builds the model with various parameters.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, Dropout, BatchNormalization, MaxPooling2D, Flatten, Activation
from tensorflow.python.keras.optimizer_v2 import rmsprop


def get_model(input_shape, dropout2_rate=0.5):
    """Builds a Sequential CNN model to recognize MNIST.

    Args:
      input_shape: Shape of the input depending on the `image_data_format`.
      dropout2_rate: float between 0 and 1. Fraction of the input units to drop for `dropout_2` layer.

    Returns:
      a Keras model

    """
    # Reset the tensorflow backend session.
    # tf.keras.backend.clear_session()
    # Define a CNN model to recognize MNIST.
    model = Sequential()
    model.add(Conv2D(32, kernel_size=(3, 3),
                     activation='relu',
                     input_shape=input_shape,
                     name="conv2d_1"))
    model.add(Conv2D(64, (3, 3), activation='relu', name="conv2d_2"))
    model.add(MaxPooling2D(pool_size=(2, 2), name="maxpool2d_1"))
    model.add(Dropout(0.25, name="dropout_1"))
    model.add(Flatten(name="flatten"))
    model.add(Dense(128, activation='relu', name="dense_1"))
    model.add(Dropout(dropout2_rate, name="dropout_2"))
    model.add(Dense(NUM_CLASSES, activation='softmax', name="dense_2"))
    return model

Then, here is the function to be optimized with Bayesian optimizer, the partial function takes care of two arguments - input_shape and verbose in fit_with which have fixed values during the runtime.

The function takes two hyperparameters to search, the dropout rate for the "dropout_2" layer and learning rate value, it trains the model for 1 epoch and outputs the evaluation accuracy for the Bayesian optimizer.

def fit_with(input_shape, verbose, dropout2_rate, lr):

    # Create the model using a specified hyperparameters.
    model = get_model(input_shape, dropout2_rate)

    # Train the model for a specified number of epochs.
    optimizer = rmsprop.RMSProp(learning_rate=lr)
    model.compile(loss=tf.keras.losses.categorical_crossentropy,
                  optimizer=optimizer,
                  metrics=['accuracy'])

    # Train the model with the train dataset.
    model.fit(x=train_ds, epochs=1, steps_per_epoch=468,
              batch_size=64, verbose=verbose)

    # Evaluate the model with the eval dataset.
    score = model.evaluate(eval_ds, steps=10, verbose=0)
    print('Test loss:', score[0])
    print('Test accuracy:', score[1])

    # Return the accuracy.

    return score[1]

from functools import partial

verbose = 1
fit_with_partial = partial(fit_with, input_shape, verbose)

The BayesianOptimization object will work out of the box without much tuning needed. The constructor takes the function to be optimized as well as the boundaries of hyperparameters to search. The main method you should be aware of is maximize, which does exactly what you think it does, maximizing the evaluation accuracy given the hyperparameters.

from bayes_opt import BayesianOptimization

# Bounded region of parameter space
pbounds = {'dropout2_rate': (0.1, 0.5), 'lr': (1e-4, 1e-2)}

optimizer = BayesianOptimization(
    f=fit_with_partial,
    pbounds=pbounds,
    verbose=2,  # verbose = 1 prints only when a maximum is observed, verbose = 0 is silent
    random_state=1,
)

optimizer.maximize(init_points=10, n_iter=10,)


for i, res in enumerate(optimizer.res):
    print("Iteration {}: \n\t{}".format(i, res))

print(optimizer.max)

Here are many parameters you can pass to maximize, nonetheless, the most important ones are:

n_iter: How many steps of Bayesian optimization you want to perform. The more steps the more likely to find a good maximum you are.
init_points: How many steps of random exploration you want to perform. Random exploration can help by diversifying the exploration space.

|   iter    |  target   | dropou... |    lr     |
-------------------------------------------------
468/468 [==============================] - 4s 8ms/step - loss: 0.2575 - acc: 0.9246
Test loss: 0.061651699058711526
Test accuracy: 0.9828125
|  1        |  0.9828   |  0.2668   |  0.007231 |
468/468 [==============================] - 4s 8ms/step - loss: 0.2065 - acc: 0.9363
Test loss: 0.04886047407053411
Test accuracy: 0.9828125
|  2        |  0.9828   |  0.1      |  0.003093 |
468/468 [==============================] - 4s 8ms/step - loss: 0.2199 - acc: 0.9336
Test loss: 0.05553104653954506
Test accuracy: 0.98125
|  3        |  0.9812   |  0.1587   |  0.001014 |
468/468 [==============================] - 4s 9ms/step - loss: 0.2075 - acc: 0.9390
Test loss: 0.04128134781494737
Test accuracy: 0.9890625
|  4        |  0.9891   |  0.1745   |  0.003521 |

After searching for 4 times, the model build with the found hyperparameters achieves an evaluation accuracy of 98.9% with just one epoch of training.

Comparing to other search methods

Unlike grid search which does search in a finite number of discrete hyperparameters combinations, the nature of Bayesian optimization with Gaussian processes doesn't allow for an easy/intuitive way of dealing with discrete parameters.

For example, we want to search for the number of the neuron of a dense layer from a list of options. To apply Bayesian optimization, it is necessary to explicitly convert the input parameters to discrete ones before constructing the model.

You can do something like this.

pbounds = {'dropout2_rate': (0.1, 0.5), 'lr': (1e-4, 1e-2), "dense_1_neurons_x128": (0.9, 3.1)}

def fit_with(input_shape, verbose, dropout2_rate, dense_1_neurons_x128, lr):

    # Create the model using a specified hyperparameters.
    dense_1_neurons = max(int(dense_1_neurons_x128 * 128), 128)
    model = get_model(input_shape, dropout2_rate, dense_1_neurons)
    # ...

The dense layers neurons will be mapped to 3 unique discrete values, 128, 256 and 384 before constructing to the model.

In Bayesian optimization, every next search values depend on previous observations(previous evaluation accuracies), the whole optimization process can be hard to be distributed or parallelized like the grid or random search methods.

Conclusion and further reading

This quick tutorial introduces how to do hyperparameter search with Bayesian optimization, it can be more efficient compared to other methods like the grid or random since every search are "guided" from previous search results.

Some material you might find helpful

BayesianOptimization - The Python implementation of global optimization with Gaussian processes used in this tutorial.

How to perform Keras hyperparameter optimization x3 faster on TPU for free - My previous tutorial on performing grid hyperparameter search with Colab's free TPU.

Check out the full source code on my GitHub.

How to run TensorBoard in Jupyter Notebook

2019-03-17T03:01:18+00:00

TensorBoard is a great tool providing visualization of many metrics necessary to evaluate TensorFlow model training. It used to be difficult to bring up this tool especially in a hosted Jupyter Notebook environment such as Google Colab, Kaggle notebook and Coursera's Notebook etc. In this tutorial, I will show you how seamless it is to run and view TensorBoard right inside a hosted or local Jupyter notebook with the latest TensorFlow 2.0.

You can run this Colab Notebook while reading this post.

Start by installing TF 2.0 and loading the TensorBoard notebook extension:

!pip install -q tf-nightly-2.0-preview
# Load the TensorBoard notebook extension
%load_ext tensorboard

Alternatively, to run a local notebook, you can create a conda virtual environment and install TensorFlow 2.0.

conda create -n tf2 python=3.6
activate tf2
pip install tf-nightly-gpu-2.0-preview
conda install jupyter

Then you can start TensorBoard before training to monitor it in progress: within the notebook using magics.

import tensorflow as tf
import datetime, os

logs_base_dir = "./logs"
os.makedirs(logs_base_dir, exist_ok=True)
%tensorboard --logdir {logs_base_dir}

Right now you can see an empty TensorBoard view with the message "No dashboards are active for the current data set", this is because the log directory is currently empty.

Lets' create, train and log some data with a very simple Keras model.

def create_model():
  return tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
  ])

def train_model():
  
  model = create_model()
  model.compile(optimizer='adam',
                loss='sparse_categorical_crossentropy',
                metrics=['accuracy'])

  logdir = os.path.join(logs_base_dir, datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
  tensorboard_callback = tf.keras.callbacks.TensorBoard(logdir, histogram_freq=1)

  model.fit(x=x_train, 
            y=y_train, 
            epochs=5, 
            validation_data=(x_test, y_test), 
            callbacks=[tensorboard_callback])

train_model()

Now go back to previous TensorBoard output, refresh it with the button on the top right and watch the update view.

The same TensorBoard backend is reused by issuing the same command. If a different logs directory was chosen, a new instance of TensorBoard would be opened. Ports are managed automatically.

Any new interesting feature worth mentioning is the "conceptual graph". To see the conceptual graph, select the “keras” tag. For this example, you’ll see a collapsed Sequential node. Double-click the node to see the model’s structure:

Conclusion and further reading

In this quick tutorial, we walked through how to fire up and view a full bloom TensorBoard right inside Jupyter Notebook. For further instructions on how to leverage other new features of TensorBoard in TensorFlow 2.0, be sure to check out those resources.

TensorBoard Scalars: Logging training metrics in Keras

Hyperparameter Tuning with the HParams Dashboard

Model Understanding with the What-If Tool Dashboard

How to run TensorFlow object detection model faster with Intel Graphics

2019-02-16T08:38:36+00:00

In this tutorial, I will show you how run inference of your custom trained TensorFlow object detection model on Intel graphics at least x2 faster with OpenVINO toolkit compared to TensorFlow CPU backend. My benchmark also shows the solution is only 22% slower compared to TensorFlow GPU backend with GTX1070 card.

If you are new to OpenVINO toolkit, it is suggested to take a look at the previous tutorial on how to convert a Keras image classification model and accelerate inference speed with OpenVINO. This time, we will take a step further with object detection model.

Prerequisites

To convert a TensorFlow frozen object detection graph to OpenVINO Intermediate Representation(IR) files, you will have those two files ready,

Frozen TensorFlow object detection model. i.e. `frozen_inference_graph.pb` downloaded from Colab after training.
The modified pipeline config file used for training. Also downloaded from Colab after training, in our case, it is the `ssd_mobilenet_v2_coco.config` file.

You can also download my copy of those files from the GitHub release.

OpenVINO model optimization

Similar to the previous image classification model, you will specify the data type to quantize the model weights.
The data type can be "FP16" or "FP32" depends on what device you want to run the converted model.

FP16: GPU and MYRIAD (Movidius neural compute stick)
FP32: CPU and GPU

Generally speaking, FP16 quantized model cuts down the size of the weights by half, run much faster but may come with minor degraded accuracy.

Another important file is the OpenVINO subgraph replacement configuration file that describes rules to convert specific TensorFlow topologies. For the models downloaded from the TensorFlow Object Detection API zoo, you can find the configuration files in the <INSTALL_DIR>/deployment_tools/model_optimizer/extensions/front/tf directory.

Use:

ssd_v2_support.json - for frozen SSD topologies from the models zoo.
faster_rcnn_support.json - for frozen Faster R-CNN topologies from the models zoo.
faster_rcnn_support_api_v1.7.json - for Faster R-CNN topologies trained manually using the TensorFlow* Object Detection API version 1.7.0 or higher.
...

We will pick ssd_v2_support.json for this tutorial since it is an SSD model.

With all the setting ready, we can start the model optimization script.

!python {mo_tf_path} \
    --input_model {pb_file} \
    --output_dir {output_dir} \
    --tensorflow_use_custom_operations_config {configuration_file} \
    --tensorflow_object_detection_api_pipeline_config {pipeline} \
    --input_shape {input_shape_str} \
    --data_type {data_type} \

You can find the .xml and .bin files located in the specified {output_dir} after the conversion.

Make predictions

Loading the model with OpenVINO toolkit is similar to the previous image classification example. While how we preprocess inputs and interpret the outputs are different.

For the image preprocessing, it is a good practice to resize the image width and height to match with what is defined in the `ssd_mobilenet_v2_coco.config` file, which is 300 x 300. Besides, there is no need to normalize the pixel value to 0~1, just keep them as UNIT8 ranging between 0 to 255.

Here is the preprocessing function.

def pre_process_image(imagePath, img_shape):
    """pre process an image from image path.
    
    Arguments:
        imagePath {str} -- input image file path.
        img_shape {tuple} -- Target height and width as a tuple.
    
    Returns:
        np.array -- Preprocessed image.
    """

    # Model input format
    assert isinstance(img_shape, tuple) and len(img_shape) == 2

    n, c, h, w = [1, 3, img_shape[0], img_shape[1]]
    image = Image.open(imagePath)
    processed_img = image.resize((h, w), resample=Image.BILINEAR)

    processed_img = np.array(processed_img).astype(np.uint8)

    # Change data layout from HWC to CHW
    processed_img = processed_img.transpose((2, 0, 1))
    processed_img = processed_img.reshape((n, c, h, w))

    return processed_img, np.array(image)

Now you can feed the preprocessed data to the network and get its prediction outputs as a dictionary which contains a key, "DetectionOutput".

# Run inference
img_shape = (img_height, img_height)
processed_img, image = pre_process_image(fname, img_shape)
res = exec_net.infer(inputs={input_blob: processed_img})
print(res['DetectionOutput'].shape)
# Expect: (1, 1, 100, 7)

The Inference Engine "DetectionOutput" layer produces one tensor with seven numbers for each actual detection, each of the 7 numbers stands for,

0: batch index
1: class label, defined in the label map .pbtxt file.
2: class probability
3: x_1 box coordinate (0~1 as a fraction of the image width reference to the upper left corner)
4: y_1 box coordinate (0~1 as a fraction of the image height reference to the upper left corner)
5: x_2 box coordinate (0~1 as a fraction of the image width reference to the upper left corner)
6: y_2 box coordinate (0~1 as a fraction of the image height reference to the upper left corner)

After known this, we can easily filter the results with a prediction probability threshold and visualize them as bounding boxes drawing around the detected objects.

import matplotlib.pyplot as plt
import matplotlib.patches as patches

probability_threshold = 0.5

preds = [pred for pred in res['DetectionOutput'][0][0] if pred[2] > probability_threshold]


ax = plt.subplot(1, 1, 1)
plt.imshow(image)  # slice by z axis of the box - box[0].

for pred in preds:
    class_label = pred[1]
    probability = pred[2]
    print('Predict class label:{}, with probability: {}'.format(
        class_label, probability))
    box = pred[3:]
    box = (box * np.array(image.shape[:2][::-1] * 2)).astype(int)
    x_1, y_1, x_2, y_2 = box
    rect = patches.Rectangle((x_1, y_1), x_2-x_1, y_2 -
                             y_1, linewidth=2, edgecolor='red', facecolor='none')
    ax.add_patch(rect)
    ax.text(x_1, y_1, '{:.0f} - {:.2f}'.format(class_label,
                                               probability), fontsize=12, color='yellow')

Here is an example to show the results of object detection.

Benchmark the inference speed

Let's try the ssd_mobilenet_v2 object detection model on various hardware and configs, and here is what you get.

The benchmark setup,

Inference 20 times and do the average.
Input image shape: (300,300,3)

As you can see the OpenVINO model running on the Intel GPU with quantized weights achieves 50 FPS(Frames/Seconds) while TensorFlow CPU backend only gets around 18.9 FPS.

Running the model with neural compute stick 2 either on Windows or Raspberry Pi also shows promising results.

You can run the openvino_inference_benchmark.py and local_inference_test.py scripts if you want to reproduce the benchmark yourself.

Conclusion and further reading

This post walks you through how to convert a custom trained TensorFlow object detection model to OpenVINO format and inference on various hardware and configurations. Their benchmark results can help you to decide what is the best fit for your edge inferencing scenario.

Related materials you might find helpful

How to run Keras model inference x3 times faster with CPU and Intel OpenVINO - blog

How to train an object detection model easy for free - blog

The GitHub repository for this post.