TL;DR You will learn how to run the CIFAR10 image classification model on an ARM microcontroller, such as the one on the STM32F4 Discovery board.
If you have played with Arduino before, it is easy to get the impression that microcontrollers are tiny chips with limited computing and memory resources, yet exceptional when it comes to collecting data from various sensors or controlling a servo on a robot hand.
Many microcontrollers either run a real-time operating system such as FreeRTOS or run bare metal without an OS. Either way, they are quite stable and responsive, especially in mission-critical situations.
However, as sensors collect more and more data, the two most common types, sound and images, require a significant amount of computing resources to process into useful results. This is normally accomplished by having the microcontroller upload the data to a network-connected server. The server sends the processed results back to the edge, and the microcontroller then carries out a specific action, such as responding with a greeting or switching on a light.
You may have already noticed some downsides to this plan.
Wouldn't it be nice if everything were self-contained right inside the microcontroller, saving bandwidth, power, and cost while also offering low latency, incredible reliability, and privacy?
CMSIS-NN is a collection of optimized neural network functions for ARM Cortex-M microcontrollers that enables neural networks and machine learning to be pushed into the end nodes of IoT applications.
It implements popular neural network layer types, such as convolution, depthwise separable convolution, fully-connected, pooling, and activation. With its utility functions, it is also possible to construct more complex NN modules, such as LSTM and GRU.
A model trained with a popular framework such as TensorFlow or Caffe first has its weights and biases quantized to 8-bit or 16-bit integers, and is then deployed to the microcontroller for inference.
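To make the quantization step more concrete, here is a minimal sketch of turning floating-point weights into 8-bit fixed-point values with a power-of-two scale. The function name `quantize_q7` and the sample weights are made up for illustration, and the exact scheme used by a given conversion script may differ.

```python
import numpy as np

def quantize_q7(weights):
    """Quantize float weights to int8 using a power-of-two scale.

    Returns (q, frac_bits); the real value is approximately q / 2**frac_bits.
    Illustrative helper, not part of CMSIS-NN itself.
    """
    weights = np.asarray(weights, dtype=np.float64)
    max_abs = float(np.max(np.abs(weights)))
    # Integer bits needed for the largest magnitude (the sign bit is separate).
    int_bits = int(np.ceil(np.log2(max_abs))) if max_abs > 0 else 0
    frac_bits = 7 - int_bits
    # Scale, round to the nearest integer, and saturate to the int8 range.
    q = np.round(weights * 2.0 ** frac_bits)
    q = np.clip(q, -128, 127).astype(np.int8)
    return q, frac_bits

weights = np.array([0.5, -0.25, 0.9, -1.3])  # hypothetical weights
q, frac_bits = quantize_q7(weights)
print(q.tolist(), frac_bits)  # [32, -16, 58, -83] 6
```

Dividing the stored integers by 2**6 gives back roughly 0.5, -0.25, 0.906 and -1.297, which shows the small rounding error that quantization introduces.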
Neural network inference based on CMSIS-NN kernels is claimed to achieve a 4.6X improvement in runtime/throughput and a 4.9X improvement in energy efficiency compared to a baseline implementation. The best performance is achieved by leveraging the SIMD instructions available on Cortex-M4 and Cortex-M7 cores to improve parallelism, although a reference implementation that does not need the DSP instructions is also available for Cortex-M0 and Cortex-M3.
In this section, we will run the CIFAR10 image classification model on an STM32F4 Discovery board or similar with Keil MDK-ARM.
Before continuing, you will need,
Power up the board for the first time,
The onboard STM32F407VGT6 microcontroller features a 32-bit ARM® Cortex®-M4 core with FPU, 1-Mbyte Flash memory, and 192-Kbyte RAM, with power consumption capped at 465mW.
Even though the project on my GitHub is configured and ready to run on the board, the following steps are helpful if you want to know how it is configured or want to run it on a different target.
The project is based on the official CMSIS-NN CIFAR10 example, so go ahead and download the whole CMSIS_5 repo from GitHub.
You can access the example project at
.\CMSIS\NN\Examples\ARM\arm_nn_examples\cifar10
Open arm_nnexamples_cifar10.uvprojx
Right-click on the current target, then click "Manage Project Items" in the menu.
Create a new target and give it a name such as "STM32F407DISCO" to help you remember its purpose. Highlight your new target and click "Set as Current Target", then "OK".
Open the target options and go to the "Device" tab to choose the target microcontroller. If searching for "STM32F407" returns nothing, you need to install the device pack, either manually from the pack installer or by opening an existing project configured for the STM32F4 DISCO board, in which case the IDE will prompt you to install it.
Go to the "Target" tab and change the external crystal frequency to 8MHz, and set the on-chip memory areas to match those of the microcontroller on the board.
In the "C/C++" tab, add "HSE_VALUE=8000000" as a predefined symbol to tell the compiler that the external crystal frequency is 8MHz. This is the same as defining the line below in C/C++ source code, except that predefined symbols allow you to compile the project for different target configurations without modifying the source code.
#define HSE_VALUE ((uint32_t)8000000)
Optionally, turn down the compiler optimization level for a better debugging experience. A higher optimization level makes the software consume fewer resources, but it also cuts down debug information and alters the structure of the code, which makes it harder to debug.
In the "Debug" tab, select "ST-Link Debugger" since it is available on the STM32F4 DISCO board, then click the "Settings" button to configure it.
If your board is plugged in, the ST-LINK/V2 debugger adapter will appear in the new window. Check that port "SW" is selected.
In the "Trace" tab, enter the correct CPU core clock speed as specified in your project and check the Trace Enable box. Trace allows you to view printf messages via SWO (Serial Wire Output), a single-pin asynchronous serial communication channel available on Cortex-M3/M4/M7 and supported by the main debugger probes. This feature is similar to Arduino's "Serial.print" function for printing debug information, except that it does not cost you a UART port.
In the "Flash Download" tab add the "STM32F4xx Flash" programming algorithm so you can download the binary to its flash memory.
Now confirm the changes and close the "Options for Target" window.
There is one more step: configuring the microcontroller to run at 168MHz in its startup file.
The easy way is to replace your startup file with mine from GitHub, but a few points are worth mentioning to learn how a microcontroller system clock works in general.
/******************** PLL Parameters ******************/
/* PLL_VCO = (HSE_VALUE or HSI_VALUE / PLL_M) * PLL_N
= 8000000 / 8 * 336 = 336MHz*/
// PLL_M = HSE_VALUE (in Hz) / 1MHz = 8000000 / 1000000 = 8
#define PLL_M 8
#define PLL_N 336
/* SYSCLK = PLL_VCO / PLL_P = 336MHz / 2 = 168MHz*/
#define PLL_P 2
/* USB OTG FS, SDIO and RNG Clock = PLL_VCO / PLLQ
= 336MHz / 7 = 48MHz*/
#define PLL_Q 7
We need to set the 'PLL_M' parameter to 8 in order to end up with a 168MHz system clock.
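The arithmetic in the comments above can be checked with a few lines of Python. This is only a sketch of the formulas shown in the PLL parameter block; the function name `pll_clocks` is made up for illustration, and it assumes every division comes out exact, as it does for these values.

```python
def pll_clocks(source_hz, pll_m, pll_n, pll_p, pll_q):
    """Derive the main clocks from the PLL parameters (illustrative helper)."""
    vco_in = source_hz // pll_m   # PLL input after the M divider, 1MHz here
    vco_out = vco_in * pll_n      # PLL_VCO
    return {
        "vco_in": vco_in,
        "vco_out": vco_out,
        "sysclk": vco_out // pll_p,  # system clock
        "usb": vco_out // pll_q,     # USB OTG FS, SDIO and RNG clock
    }

# 8MHz external crystal (HSE) with the parameters from the listing above.
clocks = pll_clocks(8_000_000, pll_m=8, pll_n=336, pll_p=2, pll_q=7)
print(clocks["sysclk"], clocks["usb"])  # 168000000 48000000
```

The same function shows why PLL_M has to follow the crystal: with a hypothetical 25MHz crystal, PLL_M would need to be 25 to keep the PLL input at 1MHz and the system clock at 168MHz.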
Pretty Little Liars, abbreviated as PLL, wait, I mean phase-locked loop, is a clock generation engine in the microcontroller used to generate a clock speed much higher than the internal or external crystal frequency. If you have played with an Arduino board such as the Leonardo, you have already met a PLL: even though your Arduino code runs at a 16MHz system clock, its USB 2.0 bus gets a boosted 48MHz clock from the on-chip PLL. With our PLL parameters set, here is a diagram of the overall clock configuration for the STM32F4. It shows how the 168MHz system clock is derived from the initial 8MHz high-speed external (HSE) clock.
Now we are ready. With the board connected to your PC, just build and debug the application.
Enable the trace view of the printf messages and run the code, and you will see the output of the CIFAR10 model.
The input is a 32x32 pixel color image, which the model classifies into one of 10 output classes.
Since the values are the output of the softmax layer, each number denotes the probability of one of the 10 image classes. In the following case, label 5, which corresponds to "dog", has the highest number, which means the model found a dog in the input image.
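Reading the result boils down to finding the index of the largest output value and looking up its label. Here is a small sketch of that step, using the standard CIFAR-10 class order; the scores are made-up numbers for illustration, not real output from the board.

```python
# Standard CIFAR-10 class order, so index 5 is "dog".
CIFAR10_LABELS = ["airplane", "automobile", "bird", "cat", "deer",
                  "dog", "frog", "horse", "ship", "truck"]

def top_class(scores):
    """Return (index, label) of the highest score (illustrative helper)."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, CIFAR10_LABELS[best]

# Hypothetical softmax output, one value per class.
scores = [0, 0, 1, 20, 0, 106, 0, 0, 0, 0]
print(top_class(scores))  # (5, 'dog')
```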
In the next section, I will show you how to feed the model with your custom image.
IMG_DATA in arm_nnexamples_cifar10_inputs.h defines the input image data. The image array is stored in HWC (Height-Width-Channel) format: a 32x32 RGB color image with 32 x 32 x 3 = 3072 values.
Can we have a higher resolution image as input? Yes, but that image must be resized and cropped first, which can be achieved with the following Python snippet.
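To see what HWC ordering means for the flat array, the sketch below computes where a given pixel channel lands among the 3072 values. The helper name `hwc_index` and the placeholder image are made up for illustration.

```python
import numpy as np

HEIGHT, WIDTH, CHANNELS = 32, 32, 3

def hwc_index(y, x, c, width=WIDTH, channels=CHANNELS):
    """Flat index of channel c of the pixel at row y, column x in HWC order."""
    return (y * width + x) * channels + c

# Placeholder image whose values equal their own flat index.
image = np.arange(HEIGHT * WIDTH * CHANNELS).reshape(HEIGHT, WIDTH, CHANNELS)
flat = image.flatten()

print(flat.size)           # 3072
print(hwc_index(4, 7, 2))  # 407
print(flat[hwc_index(4, 7, 2)] == image[4, 7, 2])  # True
```

In other words, the three color values of a pixel sit next to each other, pixels follow each other along a row, and rows follow each other from top to bottom.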
from PIL import Image, ImageOps
import numpy as np

def resizeImage(srcfile, new_width=32, new_height=32):
    '''
    Resize and crop an image to the desired resolution and return the
    data as a flattened HWC format numpy array.
    srcfile: the source image file path.
    new_width: new desired width.
    new_height: new desired height.
    '''
    pil_image = Image.open(srcfile)
    # Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter.
    pil_image = ImageOps.fit(pil_image, (new_width, new_height), Image.LANCZOS)
    # Force 3 channels so the result always has width * height * 3 values.
    pil_image_rgb = pil_image.convert('RGB')
    return np.asarray(pil_image_rgb).flatten()
The function returns a numpy array containing 3072 numbers, which are then clipped to int8 values and written to a header file in the correct format.
Want to step through the source code and learn how everything works? That will be my next blog post.
In the meantime, here are some resources I find useful for learning about ARM Cortex-M microcontrollers, STM32, CMSIS-NN, Keil MDK, etc.
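The clip-and-write step could look roughly like the sketch below. It takes the description literally and clips to the int8 range, and it assumes the header holds a single IMG_DATA macro with comma-separated values in braces. The function name `to_header_text` and the output file name in the usage note are hypothetical, so check the layout against the header in the project before relying on it.

```python
import numpy as np

def to_header_text(pixels, macro_name="IMG_DATA"):
    """Build the text of a C header defining the image as one macro.

    Illustrative helper; the macro layout is an assumption.
    """
    # Clip to the int8 range, so 0..255 pixel values end up in 0..127.
    values = np.clip(np.asarray(pixels).flatten(), -128, 127).astype(np.int8)
    body = ",".join(str(int(v)) for v in values)
    return "#define {} {{{}}}\n".format(macro_name, body)

print(to_header_text([0, 100, 200, 255]), end="")
# #define IMG_DATA {0,100,127,127}
```

To produce a file, write the returned string out with something like `open("my_inputs.h", "w").write(to_header_text(resizeImage("dog.jpg")))`, where both file names are placeholders.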
STM32F4-Discovery Quick Start Guide
STM32F4DISCOVERY board example code
Arm's Project Trillium - Processors Machine Learning
Don't forget to check out the source code from my GitHub page.