How to run deep learning model on microcontroller with CMSIS-NN (Part 2)



Previously, I showed you how to run an image classification model on an ARM microcontroller. This time, let's dive deeper to see how the software works and get familiar with the CMSIS-NN framework.

Dealing with Memory constraints

In the source file arm_nnexamples_cifar10.cpp, two buffer variables are created and reused across layers so that all intermediate data fits in the 192 KB of RAM during computation.

  • col_buffer stores the im2col (image-to-column) output for convolutional layers,
  • scratch_buffer stores the activation data (intermediate layer outputs).

The code snippet below shows how those two buffers are used across three consecutive layers.

// typedef int8_t q7_t; 
// typedef int16_t q15_t;
q7_t      col_buffer[2 * 5 * 5 * 32 * 2];
q7_t      scratch_buffer[32 * 32 * 10 * 4];
// Cut the scratch buffer to two buffers.
q7_t     *img_buffer1 = scratch_buffer;
q7_t     *img_buffer2 = img_buffer1 + 32 * 32 * 32;

// conv2 img_buffer2 -> img_buffer1
arm_convolve_HWC_q7_fast(img_buffer2, CONV2_IM_DIM, CONV2_IM_CH, conv2_wt, CONV2_OUT_CH, CONV2_KER_DIM,
                       CONV2_PADDING, CONV2_STRIDE, conv2_bias, CONV2_BIAS_LSHIFT, CONV2_OUT_RSHIFT, img_buffer1,
                       CONV2_OUT_DIM, (q15_t *) col_buffer, NULL);

arm_relu_q7(img_buffer1, CONV2_OUT_DIM * CONV2_OUT_DIM * CONV2_OUT_CH);

// pool2 img_buffer1 -> img_buffer2
arm_maxpool_q7_HWC(img_buffer1, CONV2_OUT_DIM, CONV2_OUT_CH, POOL2_KER_DIM,
                 POOL2_PADDING, POOL2_STRIDE, POOL2_OUT_DIM, col_buffer, img_buffer2);

The arm_convolve_HWC_q7_fast function creates a convolutional layer that takes the contents of img_buffer2 as input data and writes its output to img_buffer1. It also uses col_buffer as internal memory to run the im2col algorithm. The following ReLU activation layer operates on img_buffer1 in place, and then the same buffer becomes the input of the max pooling layer, which in turn writes its output to img_buffer2, and so on.

Because of the constrained RAM space, we cannot generously assign a large chunk of memory to those two buffers.

Instead, we allocate just enough memory space for the model to use.

To find the required col_buffer size across all convolutional layers, the formula below is applied to each layer and the maximum is taken.

2 * 2 * (input channels) * (kernel width) * (kernel height)  bytes

The leading 2 * 2 accounts for two internal buffers of 2-byte q15_t entries. In our case, we have three convolutional layers; conv2 and conv3 each require the largest amount, 2 * 2 * 32 * 5 * 5 = 3200 bytes, to do their im2col computation (conv1, with only 3 input channels, needs far less), so we assign that amount of space to col_buffer, which is shared among all convolutional layers.

The scratch buffer is split into two parts; for a given layer, one part serves as the input while the other holds the output.

Similarly, its maximum size can be determined by iterating over all layers.


The graph above shows an equivalent model structure: we search for the maximum necessary size of img_buffer1 and img_buffer2 across all layers, then join the two buffers together to form the total scratch_buffer size.

Choosing an NN layer function

In CMSIS-NN, there are several options for 2D convolutional layer functions:

  1. arm_convolve_HWC_q7_basic
  2. arm_convolve_HWC_q7_fast
  3. arm_convolve_HWC_q7_RGB
  4. arm_convolve_HWC_q7_fast_nonsquare

Each of them is optimized for speed and size to a different degree, but each also comes with different constraints.

The arm_convolve_HWC_q7_basic function is the most basic version, designed to work for any square-shaped input tensor and weight dimensions.

The arm_convolve_HWC_q7_fast function, as its name suggests, runs faster than the basic one but requires the number of input tensor channels to be a multiple of 4 and the number of output tensor channels (number of filters) to be a multiple of 2.

arm_convolve_HWC_q7_RGB is built exclusively for convolutions whose input tensor has exactly 3 channels; it is typically applied to the very first convolutional layer, which takes RGB image data as input.

arm_convolve_HWC_q7_fast_nonsquare is similar to arm_convolve_HWC_q7_fast, but it can take a non-square-shaped input tensor.

When it comes to fully connected layers, the two most distinct options are:

  1. arm_fully_connected_q7
  2. arm_fully_connected_q7_opt

The first works with a regular weight matrix, while the one with the "_opt" suffix is optimized for speed but requires the layer's weight matrix to be reordered in an interleaved manner beforehand. The reordering can be achieved seamlessly with the help of a code generator script, which I will talk about in the next post.

Where does the 4.6X speed boost come from?

Short answer: ARM Cortex-M4 and M7 microcontrollers support special SIMD instructions, in particular 16-bit Multiply-and-Accumulate (MAC) instructions (e.g., SMLAD), which accelerate matrix multiplication. This reveals itself when you take a look at the basic fully connected layer implementation in the source file arm_fully_connected_q7.c: on microcontrollers where DSP instructions are available, the faster DSP path is compiled in.

arm_status
arm_fully_connected_q7(const q7_t * pV,
                       const q7_t * pM,
                       const uint16_t dim_vec,
                       const uint16_t num_of_rows,
                       const uint16_t bias_shift,
                       const uint16_t out_shift,
                       const q7_t * bias, q7_t * pOut, q15_t * vec_buffer)
{
#if defined (ARM_MATH_DSP)
    /* Run the following code for Cortex-M4 and Cortex-M7 */
    // source code omitted...
#else
    /* Run the following code as reference implementation for Cortex-M0 and Cortex-M3 */
    // source code omitted...
#endif                          /* ARM_MATH_DSP */
    /* Return to ARM_MATH_SUCCESS */
    return (ARM_MATH_SUCCESS);
}


To understand how the convolutional layer is accelerated, one must first understand the basics of im2col, which converts a convolution into a matrix multiplication: the im2col() step rearranges the input data so that the convolution output can be computed as a matrix product.


Image credit: CMU 15-418/618 slides.

im2col improves parallelism in convolution by exploiting the SIMD features of the microcontroller, but it introduces memory overhead, since each input pixel gets duplicated roughly (kernel width) * (kernel height) times in the column matrix.

Conclusion and further thoughts

This post introduced basic concepts such as buffer reuse and the different implementations of NN layer functions, which you will likely find useful when building an application with the CMSIS-NN framework. In the next post, I will show you how easy it is to go from scratch, from training a model to deploying it on your microcontroller.
