How to run SSD Mobilenet V2 object detection on Jetson Nano at 20+ FPS


TL;DR

First, make sure you have flashed the latest JetPack 4.3 on your Jetson Nano development SD card.

# Run the docker
docker run --runtime nvidia --network host --privileged -it docker.io/zcw607/trt_ssd_r32.3.1:0.1.0
# Then run this command to benchmark the inference speed.
python3 trt_ssd_benchmark.py

Then you will see results similar to this.

[Image: ssd_v2_benchmark - benchmark output]

Now for a slightly longer description.

I posted How to run TensorFlow Object Detection model on Jetson Nano about 8 months ago, and realized that merely running the SSD MobileNet V1 on Jetson Nano at around 10 FPS might not be enough for some applications. Besides, that approach consumed too much memory, leaving no room for other memory-intensive applications to run alongside.

This time, the bigger SSD MobileNet V2 object detection model runs at 20+ FPS: twice as fast, while cutting memory consumption down to only 32.5% of the Jetson Nano's total 4GB (i.e. around 1.3GB). That leaves plenty of memory for running other fancy stuff. You may also notice that CPU usage is quite low, only around 10% across the quad-core.

[Image: ssd_v2_benchmark_top - CPU and memory usage during the benchmark]

To my knowledge, a bag of tricks contributes to the performance boost.

  • JetPack 4.3 ships with TensorRT 6.0.1, up from TensorRT 5 in previous releases.
  • The TensorFlow object detection graph is optimized and converted right on the target hardware, namely the Jetson Nano development kit I am using right now. This matters because TensorRT optimizes the graph using the available GPU, so the optimized graph may not perform well on a different GPU.
  • The model is converted to a more hardware-specific format, the TensorRT engine file. The downside is that it is less flexible, constrained by the hardware and software stack it runs on. More on that later.
  • Some tricks to save memory and boost speed.

How does it work?

The command lines you just ran started a Docker container. If you are new to Docker, think of it as a supercharged Anaconda or Python virtual environment that containerizes everything necessary to reproduce my results. If you take a closer look at the Dockerfile on my GitHub repo, which describes how the container image was built, you can see how all the dependencies are set up, including all the apt and Python packages.
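To give you a feel for what such a Dockerfile looks like, here is a simplified, hypothetical sketch. The base image tag is the official L4T R32.3.1 one; the wheel and tarball file names are assumptions on my part, so treat the actual Dockerfile in the repo as authoritative.

```dockerfile
# Sketch only - see the Dockerfile in the GitHub repo for the real build.
FROM nvcr.io/nvidia/l4t-base:r32.3.1

RUN apt-get update && apt-get install -y python3-pip

# Install the prebuilt Pycuda wheel instead of compiling it from source.
COPY pycuda-prebuilt.whl /tmp/
RUN pip3 install /tmp/pycuda-prebuilt.whl

# Unpack the TensorRT Python package copied off the Jetson Nano.
COPY tensorrt.tar.gz /tmp/
RUN tar -xzf /tmp/tensorrt.tar.gz -C /usr/lib/python3.6/dist-packages/
```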

The Docker image is built upon the latest JetPack 4.3 - L4T R32.3.1 base image. To run inference with a TensorRT engine file, two important Python packages are required: TensorRT and Pycuda. Building the Pycuda Python package from source on the Jetson Nano can take some time, so I decided to pack the prebuilt package into a wheel file, which makes the Docker build process much smoother. Notice that Pycuda prebuilt with JetPack 4.3 is not compatible with older versions of JetPack, and vice versa. As for the TensorRT Python package, it came from the Jetson Nano itself, in the directory /usr/lib/python3.6/dist-packages/tensorrt/. All I did was zip that directory into a tensorrt.tar.gz file. Guess what, no TensorFlow GPU Python package is required at inference time. Consider how much memory we can save just by skipping the import of the TensorFlow GPU Python package.
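Packing a site-packages directory into a tarball like that takes only a few lines of Python. Here is a minimal sketch using the standard library; the demo below uses a throwaway directory standing in for /usr/lib/python3.6/dist-packages/tensorrt/ on the Nano.

```python
import pathlib
import tarfile
import tempfile

def pack_directory(src_dir: str, out_path: str) -> None:
    """Archive a directory into a .tar.gz, keeping the top-level folder name."""
    src = pathlib.Path(src_dir)
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(src, arcname=src.name)

# Demo with a stand-in directory instead of the real tensorrt package.
tmp = pathlib.Path(tempfile.mkdtemp())
pkg = tmp / "tensorrt"
pkg.mkdir()
(pkg / "__init__.py").write_text("# placeholder\n")
pack_directory(str(pkg), str(tmp / "tensorrt.tar.gz"))
print((tmp / "tensorrt.tar.gz").exists())  # True
```

Extracting the archive into dist-packages on the target is then a one-line `tar -xzf` in the Dockerfile.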

You can find the TensorRT engine file built with JetPack 4.3, named TRT_ssd_mobilenet_v2_coco.bin, in my GitHub repository. Sometimes you might also see the TensorRT engine file named with the *.engine extension, as in the JetBot system image. If you want to convert the file yourself, take a look at JK Jung's build_engine.py script.
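Loading such an engine file with the TensorRT Python API looks roughly like the sketch below. It is guarded with a try/except so it degrades gracefully on a machine without TensorRT; on the Nano inside the container, the import succeeds and the engine can be deserialized.

```python
# Sketch: deserializing a TensorRT engine file such as TRT_ssd_mobilenet_v2_coco.bin.
# Only runs for real on a device with TensorRT installed (e.g. JetPack 4.3);
# elsewhere it just reports that TensorRT is unavailable.
try:
    import tensorrt as trt

    def load_engine(path):
        logger = trt.Logger(trt.Logger.WARNING)
        with open(path, "rb") as f, trt.Runtime(logger) as runtime:
            # Rebuild the engine object from the serialized plan file.
            return runtime.deserialize_cuda_engine(f.read())

    status = "tensorrt available"
except ImportError:
    status = "tensorrt unavailable"

print(status)
```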

Now, for the limitation of the TensorRT engine file approach: it simply won't work across different JetPack versions. The reason comes from how the engine file is built, by searching through CUDA kernels for the fastest implementation available, so it is necessary to build on the same GPU and software stack (CUDA, cuDNN, TensorRT, etc.) as the one on which the optimized engine will run. A TensorRT engine file is like a dress tailored exclusively for one setup, but its performance is amazing when fitted to the right person/dev board.

Another limitation that comes with the speed boost and lower memory footprint is a loss of precision. Take the following prediction result as an example: a dog is mistakenly predicted as a bear. This might be a result of quantizing the model weights from FP32 to FP16, or of other optimization trade-offs.
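You can see the FP32-to-FP16 precision loss directly with the standard library: the struct format "e" round-trips a Python float through IEEE 754 half precision.

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision (struct 'e')."""
    return struct.unpack("e", struct.pack("e", x))[0]

x = 0.1
print(to_fp16(x))            # approximately 0.09998, not exactly 0.1
print(to_fp16(x) == x)       # False: the value changed during quantization
print(to_fp16(0.5) == 0.5)   # True: 0.5 is exactly representable in FP16
```

Every weight in the network suffers a small rounding error like this, and those errors can occasionally add up to a flipped class prediction.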

[Image: result - a dog mistakenly predicted as a bear]

Some tricks to save memory and boost speed

Shut down the GUI and run in command-line mode. If you are already inside the GUI desktop environment, simply press "Ctrl+Alt+F2" to enter the non-GUI mode, log in to your account from there, and type "service gdm stop". That will stop the Ubuntu GUI environment and save you around 8% of the 4GB memory.

Force the CPU and GPU to run at maximum clock speed by typing "jetson_clocks" in the command line. If you have a PWM fan attached to the board and are bothered by the fan's noise, you can tune it down by creating a new settings file like this.

cd ~
# Max out the CPU and GPU clocks.
jetson_clocks
# Save the current settings to ~/l4t_dfs.conf.
jetson_clocks --store
# Lower the fan PWM duty cycle from 255 (full speed) to 30.
sed -i 's/target_pwm:255/target_pwm:30/g' l4t_dfs.conf
# Apply the modified settings.
jetson_clocks --restore l4t_dfs.conf

Conclusion and further reading

This guide has shown you the easiest way to reproduce my results running SSD Mobilenet V2 object detection on Jetson Nano at 20+ FPS, explained how it works, and pointed out the limitations to be aware of before applying this to a real application.

Don't forget to grab the source code for this post on my GitHub.
