How to run a deep learning model on a microcontroller with CMSIS-NN (Part 1)

nn-mcu

TL;DR You will learn how to run the CIFAR10 image classification model on an ARM microcontroller such as the one on the STM32F4 Discovery board or similar.

Why run a deep learning model on a microcontroller?

If you have played with Arduino before, it's easy to get the impression that microcontrollers are small chips with limited computing and memory resources that nevertheless excel at collecting data from various sensors or controlling a servo on a robot hand.

Many microcontrollers either run a real-time operating system such as FreeRTOS or run bare metal without an OS. Either way, they are quite stable and responsive, which matters especially in mission-critical situations.

However, as their sensors collect more and more data, the two most common types of data, sound and images, require a significant amount of computing resources to process into useful results. This task is normally accomplished by having the microcontroller upload the data to a network-connected server, which processes it and sends the results back to the edge. The microcontroller then performs a specific action, such as responding with a greeting or switching on a light.

You may have already noticed some downsides to this plan.

  • Sensitive data such as photos and audio recordings ends up in the cloud.
  • The company that sells the device may charge a fee for its service or, even worse, sell your private data.
  • It won't work without a network connection to the server.
  • Data traveling back and forth between the device and the server introduces lag.
  • It requires network and wireless hardware components in the circuit design, which increases the cost.
  • It might waste bandwidth sending useless data.

Wouldn't it be nice if everything were self-contained inside the microcontroller, saving bandwidth, power, and cost while also offering low latency, incredible reliability, and privacy?

use_cases1

Overview of CMSIS-NN

CMSIS-NN is a collection of optimized neural network functions for ARM Cortex-M core microcontrollers that enables neural networks and machine learning to be pushed into the end nodes of IoT applications.

It implements popular neural network layer types, such as convolution, depthwise separable convolution, fully-connected, pooling, and activation. With its utility functions, it is also possible to construct more complex NN modules, such as LSTM and GRU.

For a model trained with a popular framework such as TensorFlow or Caffe, the weights and biases are first quantized to 8-bit or 16-bit integers and then deployed to the microcontroller for inference.

Neural network inference based on CMSIS-NN kernels is claimed to achieve a 4.6X improvement in runtime/throughput and a 4.9X improvement in energy efficiency compared to the baseline implementation. The best performance is achieved by using the CPU's SIMD instructions to improve parallelism, which are available on Cortex-M4 and Cortex-M7 core microcontrollers. A reference implementation without DSP instructions is also available for Cortex-M0 and Cortex-M3.
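To make the quantization step more concrete, here is a minimal sketch of converting floating-point weights to 8-bit fixed-point values. It assumes a power-of-two scale chosen per weight array. The function name `quantize_q7` is hypothetical, and this is not necessarily the exact procedure your conversion tooling uses.

```python
import math


def quantize_q7(weights):
    """Quantize floats to signed 8-bit fixed point (illustrative sketch).

    Returns (q_values, frac_bits) so that real_value ~= q / 2**frac_bits.
    Assumption: one power-of-two scale is shared by the whole array.
    """
    max_abs = max(abs(w) for w in weights)
    # Integer bits needed to cover the largest magnitude (sign bit excluded).
    int_bits = max(0, math.ceil(math.log2(max_abs))) if max_abs > 0 else 0
    frac_bits = 7 - int_bits
    scale = 2 ** frac_bits
    # Round to the nearest step and saturate to the int8 range.
    q_values = [max(-128, min(127, round(w * scale))) for w in weights]
    return q_values, frac_bits


q, n = quantize_q7([0.5, -0.25, 0.9])
print(q, n)  # [64, -32, 115] 7
```

Dividing each quantized value by `2**n` recovers an approximation of the original weight, and the difference between the two is the quantization error.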

Run the model on the microcontroller

In this section, we will run the CIFAR10 image classification model on an STM32F4 Discovery board or similar with Keil MDK-ARM.

Before continuing, you will need:

  • A Windows PC with Keil MDK-ARM installed. You can find the instructions for acquiring and installing the software in my GitHub repo.
  • A Cortex-M4 or Cortex-M7 core microcontroller board, preferably the STM32F4 Discovery board chosen for this tutorial.

To power up the board for the first time:

  • Do not connect the board to the PC! Go to C:\Keil_v5\ARM\STLink\USBDriver and double-click stlink_winusb_install.bat to install the drivers for the onboard USB ST-Link/V2.
  • Connect the board's USB ST-Link/V2 port to the computer using a Mini USB cable. Windows recognizes the ST-Link/V2 device and installs the drivers automatically.

The onboard STM32F407VGT6 microcontroller features a 32-bit ARM® Cortex®-M4 core with FPU, 1 Mbyte of Flash memory, and 192 Kbytes of RAM, with power consumption capped at 465mW.

Even though the project in my GitHub is configured and ready to run on the board, it is helpful to know how it is configured in case you want to run it on a different target.

The project is based on the official CMSIS-NN CIFAR10 example, so go ahead and download the whole CMSIS_5 repo from GitHub.

You can access the example project at 

.\CMSIS\NN\Examples\ARM\arm_nn_examples\cifar10

Add a new target

Open the arm_nnexamples_cifar10.uvprojx project with Keil MDK-ARM. The project was initially configured to run on the simulator only, so we will start by adding a new target. A project can be configured to run on different microcontrollers/boards, and Keil MDK-ARM organizes these configurations as "targets".

Right-click on the current target, then click "Manage Project Items" in the menu.

1_new_target

Create a new target and give it a name such as "STM32F407DISCO" to help you remember its purpose. Highlight your new target and click "Set as Current Target", then "OK".

2_new_target

Configure target options

Open the target options and go to the "Device" tab to choose the target microcontroller. If searching for "STM32F407" returns nothing, install the device pack manually from the pack installer, or open an existing project configured for the STM32F4 DISCO board and the IDE will prompt you to install it.

3_target_options

4_target_options

5_pack_installer

6_pack_installer

Go to the "Target" tab and change the external crystal frequency to 8MHz, then set the on-chip memory areas to match those of the microcontroller on the board.

7_target

In the "C/C++" tab, add "HSE_VALUE=8000000" as a predefined symbol to tell the compiler that the external crystal frequency is 8MHz. This is the same as adding the line below to the C/C++ source code, except that predefined symbols allow you to compile the project for different target configurations without modifying the source code.

#define HSE_VALUE    ((uint32_t)8000000)

Optionally, turn down the compiler optimization level for a better debugging experience. A higher optimization level makes the software consume fewer resources, but it also cuts down debug information and alters the structure of the code, which makes it harder to debug.

In the "Debug" tab, select "ST-Link Debugger" since it is available on the STM32F4 DISCO board, then click the "Settings" button to configure it.

9_debug

If your board is plugged in, the ST-LINK/V2 debugger adapter will appear in the new window. Check that port "SW" is selected.

10_debug

In the "Trace" tab, enter the correct CPU core clock speed as specified in your project and check the "Trace Enable" box. Trace lets you view printf messages via SWO (Serial Wire Output), a single-pin, asynchronous serial communication channel available on Cortex-M3/M4/M7 and supported by the main debugger probes. This feature is similar to Arduino's "Serial.print" function for printing debug information, except that it does not cost you a UART port.

11_trace

In the "Flash Download" tab, add the "STM32F4xx Flash" programming algorithm so you can download the binary to the microcontroller's flash memory.

12_flash_download

Now confirm the changes and close the "Options for Target" window.

Configure to run at 168MHz

There is one more step: configuring the microcontroller to run at 168MHz in its startup file.

13_startup

The easy way is to replace your startup file with mine from GitHub, but a few points are worth mentioning to learn how a microcontroller's system clock works in general.

/******************** PLL Parameters ******************/
/* PLL_VCO = (HSE_VALUE or HSI_VALUE / PLL_M) * PLL_N 
	= 8000000 / 8 * 336 = 336MHz*/
//  PLL_M = HSE_VALUE (in Hz) / 1MHz = 8000000 / 1000000 = 8
#define PLL_M      8
#define PLL_N      336

/* SYSCLK = PLL_VCO / PLL_P  = 336MHz / 2 = 168MHz*/
#define PLL_P      2

/* USB OTG FS, SDIO and RNG Clock =  PLL_VCO / PLLQ
	 = 336MHz / 7 = 48MHz*/
#define PLL_Q      7

We need to set the 'PLL_M' parameter to 8 in order to end up with a 168MHz system clock.
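The arithmetic in the comments above can be checked with a few lines of Python. This is only a sketch for verifying the numbers. The function name `pll_clocks` is hypothetical, and the defaults are the HSE and PLL values quoted in this post.

```python
def pll_clocks(hse_hz=8_000_000, pll_m=8, pll_n=336, pll_p=2, pll_q=7):
    """Derive the main clocks from the PLL parameters (all values in Hz).

    Integer division is used because the values in this post divide exactly.
    """
    vco_in = hse_hz // pll_m   # PLL input after the M divider: 1 MHz
    vco = vco_in * pll_n       # PLL_VCO = (HSE_VALUE / PLL_M) * PLL_N
    return {
        "vco_in": vco_in,
        "vco": vco,
        "sysclk": vco // pll_p,  # SYSCLK = PLL_VCO / PLL_P
        "usb": vco // pll_q,     # USB OTG FS, SDIO and RNG clock = PLL_VCO / PLL_Q
    }


clocks = pll_clocks()
print(clocks["sysclk"], clocks["usb"])  # 168000000 48000000
```

If your board has a different crystal, pass its frequency as `hse_hz` and adjust `pll_m` so that the PLL input stays at 1MHz.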

Pretty Little Liars, abbreviated as PLL... wait, I mean the phase-locked loop, is a clock generation engine in the microcontroller used to generate a clock speed much higher than the internal or external crystal frequency. If you have played with an Arduino board such as the Leonardo, you have already met a PLL: even though your Arduino code runs at a 16MHz system clock, its USB 2.0 bus gets a boosted 48MHz clock from the on-chip PLL. With our PLL parameters set, here is a diagram of the overall clock configuration for the STM32F4. It shows how the 168MHz system clock is derived from the initial 8MHz high-speed external (HSE) clock.

14_pll

Build and debug

Now we are ready. With the board connected to your PC, just build and debug the application.

15_build

16_debug

Enable the trace view of the printf messages and run the code. You will see the output of the CIFAR10 model.

The input is a 32x32 pixel color image, which the model classifies into one of the 10 output classes.

Since the values are the output of the softmax layer, each number denotes the probability of one of the 10 image classes. In the following case, label 5, which corresponds to the "dog" class, has the highest number, which means the model found a dog in the input image.

17_debug_viewer
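Reading the printed scores boils down to finding the index of the largest value. The sketch below uses the standard CIFAR-10 class order. The scores in `example_scores` are made-up placeholder values, not the actual output shown above.

```python
# Standard CIFAR-10 class order; index 5 is "dog".
CIFAR10_LABELS = ["airplane", "automobile", "bird", "cat", "deer",
                  "dog", "frog", "horse", "ship", "truck"]


def top_class(scores):
    """Return (index, label) of the highest score."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, CIFAR10_LABELS[best]


# Hypothetical scores for illustration only.
example_scores = [0, 0, 2, 20, 3, 100, 1, 1, 0, 0]
print(top_class(example_scores))  # (5, 'dog')
```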

In the next section, I will show you how to feed the model with your custom image.

Create new input images

IMG_DATA in arm_nnexamples_cifar10_inputs.h defines the input image data. The image array is stored in HWC (Height-Width-Channel) format, so a 32x32 RGB color image has 32 x 32 x 3 = 3072 values.

Can we use a higher resolution image as input? Yes, but the image must be resized and cropped first, which can be achieved with the following Python snippet.
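To illustrate what HWC ordering means for the flat array, here is a small sketch that maps a pixel position and channel to its offset. The helper name `hwc_index` is hypothetical.

```python
def hwc_index(y, x, c, width=32, channels=3):
    """Offset of channel c of pixel (row y, column x) in a flat HWC array."""
    return (y * width + x) * channels + c


print(hwc_index(0, 0, 0))    # 0: red value of the top-left pixel
print(hwc_index(0, 1, 0))    # 3: red value of the next pixel in the row
print(hwc_index(31, 31, 2))  # 3071: blue value of the bottom-right pixel
```

The three channel values of a pixel sit next to each other, and a full row of pixels is stored before the next row begins.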

from PIL import Image, ImageOps
import numpy as np

def resizeImage(srcfile, new_width=32, new_height=32):
    '''
    Resize and crop an image to the desired resolution and return the
    data as a flattened HWC format numpy array.
    srcfile: the source image file path.
    new_width: new desired width.
    new_height: new desired height.
    '''
    pil_image = Image.open(srcfile)
    # Image.LANCZOS is the current name of the filter formerly called Image.ANTIALIAS.
    pil_image = ImageOps.fit(pil_image, (new_width, new_height), Image.LANCZOS)
    # Force 3 channels so the result always has new_height * new_width * 3 values.
    pil_image_rgb = pil_image.convert('RGB')
    return np.asarray(pil_image_rgb).flatten()

18_img_data

The function returns a numpy array containing 3072 numbers, which are then clipped to int8 numbers and written to a header file in the correct format.
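Writing the header can be sketched as follows. The function name `format_img_data` is hypothetical, and so is the default range of 0 to 255. That default is an assumption based on raw 8-bit RGB pixel values, so pass `lo=-128, hi=127` if the array in your C source is declared as signed int8.

```python
def format_img_data(values, name="IMG_DATA", lo=0, hi=255):
    """Format a flat sequence of pixel values as a one-line C macro.

    Values are clamped to [lo, hi]; the default range is an assumption.
    """
    clipped = [max(lo, min(hi, int(v))) for v in values]
    return "#define {} {{{}}}\n".format(name, ",".join(str(v) for v in clipped))


print(format_img_data([0, 300, -5, 12]), end="")
# #define IMG_DATA {0,255,0,12}
```

In practice you would pass in the array returned by `resizeImage` and write the resulting string to the inputs header in place of the existing IMG_DATA definition.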

Summary and further reading

Want to step through the source code and learn how everything works? That will be my next blog post.

In the meantime, here are some resources I find useful for learning about ARM Cortex-M microcontrollers, STM32, CMSIS-NN, Keil MDK, and so on.

ARM Cortex-M WiKi

STM32F4-Discovery Quick Start Guide

STM32F4DISCOVERY board example code

CMSIS NN Software Library doc

Arm's Project Trillium - Processors Machine Learning

Don't forget to check out the source code from my GitHub page.
