Previously, I showed you how to run an image classification model on an ARM microcontroller. This time, let's dive deeper to see how the software works and get familiar with the CMSIS-NN framework.
In the source file arm_nnexamples_cifar10.cpp, two buffers are declared: col_buffer, which stores the im2col (image-to-column) output for convolutional layers, and scratch_buffer, which stores the activation data (intermediate layer outputs). You can see how those two buffers are applied to three consecutive layers in the code snippet below.
```cpp
// typedef int8_t q7_t;
// typedef int16_t q15_t;
q7_t col_buffer[2 * 5 * 5 * 32 * 2];
q7_t scratch_buffer[32 * 32 * 10 * 4];

// Split the scratch buffer into two buffers.
q7_t *img_buffer1 = scratch_buffer;
q7_t *img_buffer2 = img_buffer1 + 32 * 32 * 32;

// conv2: img_buffer2 -> img_buffer1
arm_convolve_HWC_q7_fast(img_buffer2, CONV2_IM_DIM, CONV2_IM_CH, conv2_wt, CONV2_OUT_CH, CONV2_KER_DIM,
                         CONV2_PADDING, CONV2_STRIDE, conv2_bias, CONV2_BIAS_LSHIFT, CONV2_OUT_RSHIFT, img_buffer1,
                         CONV2_OUT_DIM, (q15_t *) col_buffer, NULL);
arm_relu_q7(img_buffer1, CONV2_OUT_DIM * CONV2_OUT_DIM * CONV2_OUT_CH);

// pool2: img_buffer1 -> img_buffer2
arm_maxpool_q7_HWC(img_buffer1, CONV2_OUT_DIM, CONV2_OUT_CH, POOL2_KER_DIM,
                   POOL2_PADDING, POOL2_STRIDE, POOL2_OUT_DIM, col_buffer, img_buffer2);
```
Take the first call as an example: arm_convolve_HWC_q7_fast reads its input from img_buffer2, uses col_buffer as working space for im2col, and writes its output to img_buffer1, where arm_relu_q7 then runs in place. The max pooling layer that follows reads from img_buffer1 and writes back to img_buffer2, and the two halves of the scratch buffer keep swapping roles like this from layer to layer.
Given the constrained RAM space, we cannot generously assign a large chunk of memory to those two buffers. Instead, we allocate just enough memory for the model to use.
To find out how large col_buffer must be, compute 2 * 2 * (input channels) * (kernel width) * (kernel height) bytes for each convolutional layer and take the maximum. In our case, we have three convolutional layers; conv2 needs the most space, with 32 input channels and a 5x5 kernel: 2 * 2 * 32 * 5 * 5 = 3200 bytes, which is exactly the declared size of col_buffer.
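As a quick check, here is a small sketch that applies the formula to each layer and keeps the maximum. The per-layer input channel counts (3, 32, 16) and the 5x5 kernels are assumed from the CIFAR-10 example model rather than taken from the snippet above.

```c
#include <stdio.h>

/* Sketch: apply bytes = 2 (q15_t) * 2 (im2col columns) * in_ch * k * k
   to each conv layer and keep the maximum. Layer dimensions are
   assumed from the CIFAR-10 example (conv1: 3 channels, conv2: 32,
   conv3: 16, all with 5x5 kernels). */
int main(void)
{
    const int in_ch[] = {3, 32, 16};
    const int kernel = 5;
    int max_bytes = 0;

    for (int i = 0; i < 3; i++) {
        int bytes = 2 * 2 * in_ch[i] * kernel * kernel;
        printf("conv%d im2col buffer: %d bytes\n", i + 1, bytes);
        if (bytes > max_bytes)
            max_bytes = bytes;
    }
    printf("col_buffer needs %d bytes\n", max_bytes); /* 3200, from conv2 */
    return 0;
}
```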
The scratch buffer splits into two parts: for a given layer, one part serves as input while the other serves as output, and the roles swap at the next layer. Similarly, the necessary size of each part can be determined by iterating over all layers. The graph above shows an equivalent model structure where we search for the maximum necessary sizes of img_buffer1 and img_buffer2; together they determine the size of scratch_buffer.
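Here is a minimal sketch of that search. Even-numbered layers write into one half and odd-numbered layers into the other, so each half must hold the largest output assigned to it. The layer output sizes below are illustrative values, not the exact CIFAR-10 activation sizes.

```c
#include <stddef.h>
#include <stdio.h>

/* Sketch: size the two ping-pong halves of scratch_buffer.
   act[i] is the byte size of layer i's output; outputs alternate
   between the two halves, so each half needs the maximum over the
   outputs that land in it. Sizes below are illustrative only. */
int main(void)
{
    const size_t act[] = {32 * 32 * 32, 16 * 16 * 32, 16 * 16 * 16,
                          8 * 8 * 16, 8 * 8 * 32};
    const size_t n = sizeof act / sizeof act[0];
    size_t buf1 = 0, buf2 = 0;

    for (size_t i = 0; i < n; i++) {
        size_t *dst = (i % 2 == 0) ? &buf1 : &buf2; /* alternate halves */
        if (act[i] > *dst)
            *dst = act[i];
    }
    printf("img_buffer1: %zu bytes, img_buffer2: %zu bytes, total: %zu\n",
           buf1, buf2, buf1 + buf2);
    return 0;
}
```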
In CMSIS-NN there are several options for the 2D convolutional layer. Each of them is optimized for speed and size to a different degree, but also comes with different constraints:

- arm_convolve_HWC_q7_basic is the baseline version that works for any square input and kernel dimensions, with no constraints on the channel counts.
- arm_convolve_HWC_q7_fast runs faster, but requires the number of input channels to be a multiple of 4 and the number of output channels to be a multiple of 2.
- arm_convolve_HWC_q7_RGB is built exclusively for convolutions with exactly 3 input channels; it is typically applied to the very first convolutional layer, which takes RGB image data as input.
- arm_convolve_HWC_q7_fast_nonsquare is similar to arm_convolve_HWC_q7_fast, but supports non-square input and kernel dimensions (see the sketch after this list).
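The following is not a CMSIS-NN API, just a hypothetical helper showing how the constraints above map to a choice of variant:

```c
/* Hypothetical helper: map the documented constraints to a variant
   name. Assumes square inputs/kernels unless told otherwise. */
const char *pick_conv_variant(int ch_in, int ch_out, int is_square)
{
    if (!is_square)
        return "arm_convolve_HWC_q7_fast_nonsquare"; /* channel rules still apply */
    if (ch_in == 3)
        return "arm_convolve_HWC_q7_RGB";
    if (ch_in % 4 == 0 && ch_out % 2 == 0)
        return "arm_convolve_HWC_q7_fast";
    return "arm_convolve_HWC_q7_basic";
}
```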
When it comes to fully connected layers, the two most distinct options are arm_fully_connected_q7 and arm_fully_connected_q7_opt. The first one works with a regular weight matrix, while the one with the "_opt" suffix is optimized for speed but requires the layer's weight matrix to be reordered in an interleaved manner beforehand. The reordering can be achieved seamlessly with the help of a code generator script, which I will talk about in the next post.
Why do some implementations run faster than others? The short answer: ARM Cortex-M4 and Cortex-M7 microcontrollers support special SIMD instructions, and in particular the 16-bit Multiply-and-Accumulate (MAC) instructions (e.g., SMLAD) accelerate matrix multiplication. This reveals itself when you take a look at the basic fully connected layer implementation in the source file arm_fully_connected_q7.c: microcontrollers with DSP instructions available take the faster code path.
```c
arm_status
arm_fully_connected_q7(const q7_t * pV,
                       const q7_t * pM,
                       const uint16_t dim_vec,
                       const uint16_t num_of_rows,
                       const uint16_t bias_shift,
                       const uint16_t out_shift,
                       const q7_t * bias,
                       q7_t * pOut,
                       q15_t * vec_buffer)
{
#if defined (ARM_MATH_DSP)
    /* Run the following code for Cortex-M4 and Cortex-M7 */
    // source code omitted...
#else
    /* Run the following code as reference implementation for Cortex-M0 and Cortex-M3 */
    // source code omitted...
#endif /* ARM_MATH_DSP */
    /* Return ARM_MATH_SUCCESS */
    return (ARM_MATH_SUCCESS);
}
```
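To make the speedup concrete, here is a plain-C sketch of what a single SMLAD instruction computes: two 16-bit multiplies plus an accumulation, in one instruction. The function name is mine, for illustration only.

```c
#include <stdint.h>

/* Plain-C equivalent of SMLAD: multiply the low halfwords and the
   high halfwords of two 32-bit registers, then add both products
   to the accumulator. On Cortex-M4/M7 this is a single instruction,
   so a q15 dot product advances two elements per MAC. */
int32_t smlad_equivalent(int32_t x, int32_t y, int32_t acc)
{
    int16_t x_lo = (int16_t)(x & 0xFFFF);
    int16_t x_hi = (int16_t)(x >> 16);
    int16_t y_lo = (int16_t)(y & 0xFFFF);
    int16_t y_hi = (int16_t)(y >> 16);
    return acc + (int32_t)x_lo * y_lo + (int32_t)x_hi * y_hi;
}
```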
To understand how the convolutional layer is accelerated, one must first understand im2col, which converts a convolution into a matrix multiplication: the im2col() routine rearranges the input data so that the convolution output can be computed as a single matrix product.
Image credit: CMU 15-418/618 slides.
im2col improves parallelism in convolution by exposing the work to the microcontroller's SIMD features, but it introduces memory overhead, since each im2col column holds (input channels) * (kernel width) * (kernel height) values. CMSIS-NN keeps this overhead small by performing a partial im2col that buffers only two columns at a time, which is also where the extra factor of 2 in the col_buffer size formula comes from (the other factor of 2 is the size of a q15_t element).
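For intuition, here is a minimal single-channel im2col sketch with no padding and stride 1; the names and layout are mine for illustration, not CMSIS-NN's internal implementation.

```c
#include <stdint.h>

/* Minimal single-channel im2col sketch: no padding, stride 1.
   Each k*k input patch becomes one column of `col`, so convolving
   with a single k*k filter reduces to one dot product per column,
   i.e. a (1 x k*k) by (k*k x out_h*out_w) matrix multiplication.
   `col` must hold (h - k + 1) * (w - k + 1) * k * k elements. */
void im2col_sketch(const int8_t *in, int h, int w, int k, int8_t *col)
{
    const int out_h = h - k + 1;
    const int out_w = w - k + 1;
    int c = 0;

    for (int y = 0; y < out_h; y++)
        for (int x = 0; x < out_w; x++)      /* one column per output pixel */
            for (int ky = 0; ky < k; ky++)
                for (int kx = 0; kx < k; kx++)
                    col[c++] = in[(y + ky) * w + (x + kx)];
}
```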
This post introduced basic concepts such as reusing buffers and the different implementations of the NN layer functions, which you will likely find useful when building an application with CMSIS-NN.