Previously, I showed you how to run an image classification model on an ARM microcontroller. This time, let's dive deeper, see how the software works, and get familiar with the CMSIS-NN framework.

In the source `arm_nnexamples_cifar10.cpp`, `col_buffer` stores the im2col (image to column) output for convolutional layers, and `scratch_buffer` stores the activation data (intermediate layer outputs).

You will find how those two buffers are applied to 3 consecutive layers in the code snippet below.

```
// typedef int8_t q7_t;
// typedef int16_t q15_t;
q7_t col_buffer[2 * 5 * 5 * 32 * 2];
q7_t scratch_buffer[32 * 32 * 10 * 4];
// Cut the scratch buffer to two buffers.
q7_t *img_buffer1 = scratch_buffer;
q7_t *img_buffer2 = img_buffer1 + 32 * 32 * 32;
// conv2 img_buffer2 -> img_buffer1
arm_convolve_HWC_q7_fast(img_buffer2, CONV2_IM_DIM, CONV2_IM_CH, conv2_wt, CONV2_OUT_CH, CONV2_KER_DIM,
                         CONV2_PADDING, CONV2_STRIDE, conv2_bias, CONV2_BIAS_LSHIFT, CONV2_OUT_RSHIFT, img_buffer1,
                         CONV2_OUT_DIM, (q15_t *) col_buffer, NULL);
arm_relu_q7(img_buffer1, CONV2_OUT_DIM * CONV2_OUT_DIM * CONV2_OUT_CH);
// pool2 img_buffer1 -> img_buffer2
arm_maxpool_q7_HWC(img_buffer1, CONV2_OUT_DIM, CONV2_OUT_CH, POOL2_KER_DIM,
                   POOL2_PADDING, POOL2_STRIDE, POOL2_OUT_DIM, col_buffer, img_buffer2);
```

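`arm_convolve_HWC_q7_fast` takes `img_buffer2` as its input and writes its output to `img_buffer1`, using `col_buffer` as working memory for im2col. The ReLU activation then runs in place on `img_buffer1`, and the max pooling layer reads that same buffer and writes its output to `img_buffer2`.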
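The ping-pong pattern can be sketched in plain C without CMSIS-NN. The three `toy_*` functions below are hypothetical stand-ins for the convolution, ReLU, and pooling calls, not library functions; only the way the two halves of one scratch array trade roles is the point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int8_t q7_t;

/* Toy "conv": negates each value, saturating so -128 maps to 127. */
static void toy_negate(const q7_t *in, size_t n, q7_t *out)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = (in[i] == INT8_MIN) ? INT8_MAX : (q7_t)(-in[i]);
    }
}

/* ReLU works in place, so it needs no second buffer. */
static void toy_relu(q7_t *data, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (data[i] < 0) {
            data[i] = 0;
        }
    }
}

/* Toy 1-D max pool over adjacent pairs: n inputs, n / 2 outputs. */
static void toy_maxpool_pairs(const q7_t *in, size_t n, q7_t *out)
{
    for (size_t i = 0; i + 1 < n; i += 2) {
        out[i / 2] = (in[i] > in[i + 1]) ? in[i] : in[i + 1];
    }
}

/* scratch holds 2 * n values and the input is expected in the second half.
 * Returns the number of results, stored at the start of the second half. */
size_t run_toy_layers(q7_t *scratch, size_t n)
{
    assert(scratch != NULL && n % 2 == 0);
    q7_t *buf1 = scratch;     /* plays the role of img_buffer1 */
    q7_t *buf2 = scratch + n; /* plays the role of img_buffer2 */

    toy_negate(buf2, n, buf1);        /* "conv": buf2 -> buf1 */
    toy_relu(buf1, n);                /* relu:   buf1 in place */
    toy_maxpool_pairs(buf1, n, buf2); /* pool:   buf1 -> buf2 */
    return n / 2;
}
```

Calling `run_toy_layers` on an 8-element array whose second half holds the input leaves the pooled result back in that second half, ready to be the next layer's input.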

Because RAM is constrained, we cannot generously assign a large chunk of memory to those two buffers.

Instead, we allocate just enough memory space for the model to use.

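To find the required `col_buffer` size, we apply this formula:

2*2*(conv # of filters)*(kernel width)*(kernel height)

In our case, we have three convolutional layers, and `conv1` requires the most, `2*2*32*5*5 = 3200` bytes, so that value becomes the size of `col_buffer`, which all convolutional layers share.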

The scratch buffer is split into two parts: for a given layer, one part serves as the input while the other serves as the output.

Similarly, its maximum size can be determined by iterating over all layers.

The graph above shows an equivalent model structure, in which we search for the maximum size needed by `img_buffer1` and by `img_buffer2` across all layers; the two together make up `scratch_buffer`.
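Here is a minimal sketch of that search in C. The layer shapes in `stages` are an assumption (a small CIFAR-10 network with 32x32x3 input); they are not listed in this post, and `stage_t` and `scratch_requirements` are illustrative names, so substitute the shapes of your own model.

```c
#include <assert.h>
#include <stddef.h>

/* Output size of one stage in bytes (q7 data: one byte per value). */
typedef struct {
    const char *name;
    size_t out_bytes;
} stage_t;

/* Assumed shapes (height * width * channels), not taken from this post. */
static const stage_t stages[] = {
    {"input", 32 * 32 * 3},
    {"conv1", 32 * 32 * 32},
    {"pool1", 16 * 16 * 32},
    {"conv2", 16 * 16 * 16},
    {"pool2", 8 * 8 * 16},
    {"conv3", 8 * 8 * 32},
    {"pool3", 4 * 4 * 32},
};

/* Stages alternate between the two halves: the input sits in img_buffer2,
 * conv1 writes img_buffer1, pool1 writes img_buffer2, and so on.
 * ReLU runs in place, so it does not add a stage. */
void scratch_requirements(const stage_t *s, size_t count,
                          size_t *buf1_bytes, size_t *buf2_bytes)
{
    assert(s != NULL && buf1_bytes != NULL && buf2_bytes != NULL);
    *buf1_bytes = 0;
    *buf2_bytes = 0;
    for (size_t i = 0; i < count; i++) {
        size_t *slot = (i % 2 == 0) ? buf2_bytes : buf1_bytes;
        if (s[i].out_bytes > *slot) {
            *slot = s[i].out_bytes;
        }
    }
}
```

With these assumed shapes, the first half needs `32 * 32 * 32` bytes and the second needs 8192 bytes, which adds up to the `32 * 32 * 10 * 4` bytes declared for `scratch_buffer` in the snippet earlier.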

In CMSIS-NN, there are several options for 2D convolutional layers:

- arm_convolve_HWC_q7_basic
- arm_convolve_HWC_q7_fast
- arm_convolve_HWC_q7_RGB
- arm_convolve_HWC_q7_fast_nonsquare

Each of them is optimized for speed and size to a different degree, and each comes with different constraints.

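`arm_convolve_HWC_q7_basic` is the general-purpose version and places no constraint on the number of input or output channels.

`arm_convolve_HWC_q7_fast` runs faster, but requires the number of input channels to be a multiple of 4 and the number of output channels (filters) to be a multiple of 2.

`arm_convolve_HWC_q7_RGB` is built exclusively for convolutions whose input tensor has 3 channels. It is typically applied to the very first convolutional layer, which takes RGB image data as input.

`arm_convolve_HWC_q7_fast_nonsquare` is similar to `arm_convolve_HWC_q7_fast`, but accepts non-square input tensors.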
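As a rough illustration of how those constraints drive the choice, the helper below picks one of the four variants from the layer shape. `conv_variant_t` and `pick_conv_q7` are hypothetical names for this sketch and not part of CMSIS-NN. The multiple-of-4 and multiple-of-2 channel rule for the fast variants follows the CMSIS-NN documentation, and "square" here only looks at the input tensor.

```c
#include <assert.h>

typedef enum {
    CONV_NONE_LISTED,    /* none of the four variants above fits */
    CONV_BASIC,          /* arm_convolve_HWC_q7_basic */
    CONV_FAST,           /* arm_convolve_HWC_q7_fast */
    CONV_RGB,            /* arm_convolve_HWC_q7_RGB */
    CONV_FAST_NONSQUARE  /* arm_convolve_HWC_q7_fast_nonsquare */
} conv_variant_t;

conv_variant_t pick_conv_q7(int in_w, int in_h, int ch_in, int ch_out)
{
    assert(in_w > 0 && in_h > 0 && ch_in > 0 && ch_out > 0);
    int square = (in_w == in_h);
    int fast_ok = (ch_in % 4 == 0) && (ch_out % 2 == 0);

    if (fast_ok) {
        return square ? CONV_FAST : CONV_FAST_NONSQUARE;
    }
    if (!square) {
        return CONV_NONE_LISTED;
    }
    if (ch_in == 3) {
        return CONV_RGB; /* typically the first layer on RGB images */
    }
    return CONV_BASIC;
}
```

For a 32x32 RGB input with 32 filters this returns `CONV_RGB`, while a 16x16 input with 32 channels and 16 filters qualifies for `CONV_FAST`.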

When it comes to fully connected layers, the two most distinct options are:

- arm_fully_connected_q7
- arm_fully_connected_q7_opt

The first one works with a regular weight matrix. The other, with the suffix "_opt", is optimized for speed, but the layer's weight matrix must be reordered in an interleaved manner beforehand. The reordering can be done seamlessly with the help of a code generator script, which I will talk about in the next post.

The short answer on how these functions are accelerated: ARM Cortex-M4 and Cortex-M7 microcontrollers support special SIMD instructions, and in particular the 16-bit multiply-and-accumulate (MAC) instructions (e.g., SMLAD) accelerate matrix multiplication. This reveals itself when you look at the source code of the basic fully connected layer, **arm_fully_connected_q7.c**. Microcontrollers with DSP instructions available run faster because they take the path that uses these special instructions.

```
arm_status
arm_fully_connected_q7(const q7_t * pV,
                       const q7_t * pM,
                       const uint16_t dim_vec,
                       const uint16_t num_of_rows,
                       const uint16_t bias_shift,
                       const uint16_t out_shift,
                       const q7_t * bias,
                       q7_t * pOut,
                       q15_t * vec_buffer)
{
#if defined (ARM_MATH_DSP)
    /* Run the following code for Cortex-M4 and Cortex-M7 */
    // source code omitted...
#else
    /* Run the following code as reference implementation for Cortex-M0 and Cortex-M3 */
    // source code omitted...
#endif /* ARM_MATH_DSP */

    /* Return to ARM_MATH_SUCCESS */
    return (ARM_MATH_SUCCESS);
}
```
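To make the dual MAC idea concrete, here is a portable C sketch that emulates what an instruction like SMLAD computes: two signed 16x16-bit products added to an accumulator in one step. `smlad_emulated`, `pack_q15x2`, and the two dot-product functions are illustrative names, not CMSIS-NN code, and the library's real DSP path is more elaborate than this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack two 16-bit values into one 32-bit word (low half first). */
static uint32_t pack_q15x2(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* Portable stand-in for SMLAD: acc + x.lo * y.lo + x.hi * y.hi. */
static int32_t smlad_emulated(uint32_t x, uint32_t y, int32_t acc)
{
    int16_t x_lo = (int16_t)(uint16_t)(x & 0xFFFFu);
    int16_t x_hi = (int16_t)(uint16_t)(x >> 16);
    int16_t y_lo = (int16_t)(uint16_t)(y & 0xFFFFu);
    int16_t y_hi = (int16_t)(uint16_t)(y >> 16);
    return acc + (int32_t)x_lo * y_lo + (int32_t)x_hi * y_hi;
}

/* Reference path: one multiply-accumulate per element. */
int32_t dot_q7_reference(const int8_t *a, const int8_t *b, size_t n)
{
    assert(a != NULL && b != NULL);
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)a[i] * b[i];
    }
    return acc;
}

/* "DSP" path: q7 values widened to 16 bits, two elements per step. */
int32_t dot_q7_dual_mac(const int8_t *a, const int8_t *b, size_t n)
{
    assert(a != NULL && b != NULL);
    int32_t acc = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        uint32_t pa = pack_q15x2(a[i], a[i + 1]);
        uint32_t pb = pack_q15x2(b[i], b[i + 1]);
        acc = smlad_emulated(pa, pb, acc);
    }
    if (i < n) {
        acc += (int32_t)a[i] * b[i]; /* leftover element when n is odd */
    }
    return acc;
}
```

Both functions return the same dot product. On a real Cortex-M4 or M7, the body of `smlad_emulated` is a single instruction, so the second loop does half as many multiply-accumulate steps.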

To understand how the convolutional layer is accelerated, one must understand the basics of im2col. The im2col() function rearranges the input data so that the convolution output can be computed with a matrix multiplication.

*Credit: CMU 15-418/618 slides.*

**im2col** improves parallelism in convolution by using the SIMD features of the microcontroller, but it introduces memory overhead: every output position gets its own column of `(numInputChannels * kernel width * kernel height)` values, so with a stride of 1 each input pixel can be copied up to `kernel width * kernel height` times.
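The following is a minimal im2col sketch in C for a square image in HWC layout. It is a simplified illustration, not the CMSIS-NN implementation (which builds only a couple of columns at a time in `col_buffer`), and the function names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Output width/height of a square convolution. */
static int conv_out_dim(int dim, int k, int pad, int stride)
{
    return (dim + 2 * pad - k) / stride + 1;
}

/* Copy every k x k x ch patch of a dim x dim x ch HWC image into cols,
 * one patch after another. Out-of-image (padding) positions become 0.
 * Returns the number of values written. */
size_t im2col_hwc(const int8_t *img, int dim, int ch, int k,
                  int pad, int stride, int8_t *cols)
{
    assert(img != NULL && cols != NULL && stride > 0);
    int out_dim = conv_out_dim(dim, k, pad, stride);
    size_t n = 0;
    for (int oy = 0; oy < out_dim; oy++) {
        for (int ox = 0; ox < out_dim; ox++) {
            for (int ky = 0; ky < k; ky++) {
                for (int kx = 0; kx < k; kx++) {
                    int y = oy * stride + ky - pad;
                    int x = ox * stride + kx - pad;
                    int inside = (y >= 0 && y < dim && x >= 0 && x < dim);
                    for (int c = 0; c < ch; c++) {
                        cols[n++] = inside ? img[(y * dim + x) * ch + c] : 0;
                    }
                }
            }
        }
    }
    return n;
}

/* One filter applied to all columns: a plain matrix-vector product. */
void conv_from_cols(const int8_t *cols, size_t num_cols, size_t col_len,
                    const int8_t *filter, int32_t *out)
{
    for (size_t j = 0; j < num_cols; j++) {
        int32_t acc = 0;
        for (size_t i = 0; i < col_len; i++) {
            acc += (int32_t)cols[j * col_len + i] * filter[i];
        }
        out[j] = acc;
    }
}
```

For a 3x3 single-channel image and a 2x2 kernel with stride 1 and no padding, this produces 4 columns of 4 values each, so 9 input values grow to 16. That growth is the memory overhead described above.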

This post introduced basic concepts, such as buffer reuse and the different implementations of NN layer functions, which you will likely find useful when building an application with the CMSIS-NN framework.
