How to run a deep learning model on a microcontroller with CMSIS-NN (Part 3)

You have learned how to run an image classification model on an ARM microcontroller and the basics of the CMSIS-NN framework. This post shows how to train and deploy a new model from scratch.

Build and train a Caffe model

Keras has been my favorite when it comes to picking a deep learning framework, thanks to its simplicity and elegance. This time, however, we are going with Caffe, since ARM's team has released two useful scripts, built for Caffe models, that generate the code for us. No worries if you are new to Caffe like me - the model structure and training parameters are all defined in an easy-to-understand text file format.

Installing Caffe can be challenging, especially for beginners, which is why I built this runnable Google Colab notebook with the Caffe installation and the code of the tutorial included.

The Caffe image classification model is defined in the file cifar10_m4_train_test_small.prototxt, with its model structure graph shown below. It contains three convolutional layers interspersed with ReLU activation and max-pooling layers, followed by a fully-connected layer at the end that maps the features to one of the ten output classes.

[Figure: CIFAR10_CNN - model structure graph]
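
If you are coming from Keras, a roughly equivalent model is sketched below for orientation. The filter counts and kernel sizes here are illustrative assumptions, not taken from the prototxt file, which remains the authoritative definition.

from tensorflow import keras
from tensorflow.keras import layers

# Rough Keras equivalent of the structure described above; filter counts
# and kernel sizes are illustrative, the prototxt file is the reference
model = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),                    # CIFAR-10 images
    layers.Conv2D(32, 5, padding='same'),
    layers.MaxPooling2D(3, strides=2, padding='same'),
    layers.Activation('relu'),
    layers.Conv2D(32, 5, padding='same', activation='relu'),
    layers.MaxPooling2D(3, strides=2, padding='same'),
    layers.Conv2D(64, 5, padding='same', activation='relu'),
    layers.MaxPooling2D(3, strides=2, padding='same'),
    layers.Flatten(),
    layers.Dense(10, activation='softmax'),            # ten output classes
])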

In the cifar10_m4_train_test_small.prototxt model definition file:

  • The layer with type "Data" must be named "data", since the code generation script we use later locates the layer by name. This layer produces two "blobs": the data blob containing the image data, and the label blob holding the output class labels.
  • lr_mults are the learning rate adjustments for the layer's learnable parameters. In our case, it sets the weight learning rate to match the learning rate given by the solver during runtime, and the bias learning rate to be twice as large - this usually leads to better convergence rates.
  • The fully connected layer is known in Caffe as an InnerProduct layer.
  • Layer definitions can include rules for whether and when they are included in the network definition, like the one below:
layer {
  # ...layer definition...
  include: { phase: TRAIN }
}

In the above example, the layer will be included only in the TRAIN phase.

Check out the solver file cifar10_small_solver.prototxt, which defines how many iterations the model will be trained for, how frequently the model will be evaluated on the test dataset, and so on.
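
The solver's settings live in that prototxt file rather than in code, but as a rough mental model the important ones boil down to a handful of training-loop knobs. The values below are illustrative only, not copied from the actual file:

# Illustrative solver settings expressed as a Python dict; the real
# values live in cifar10_small_solver.prototxt
solver = {
    'base_lr': 0.001,      # starting learning rate
    'max_iter': 4000,      # how many iterations this solver trains for
    'test_interval': 500,  # evaluate on the test dataset every N iterations
    'snapshot_format': 'HDF5',  # why the saved weights end in .caffemodel.h5
}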

Finally, running the script train_small_colab.sh starts the training; the weights are saved once it finishes. In our case, the script runs two solver files: the learning rate is reduced by a factor of 10 for the last 1000 training iterations, as defined in the second solver file. The final trained weights are saved to the file cifar10_small_iter_5000.caffemodel.h5, where 5000 means the model has been trained for 5000 iterations. If you come from Keras or another deep learning framework, note that one iteration here doesn't mean the model has been trained on the entire training dataset once, but on one batch of training data with size 100, as defined in cifar10_m4_train_test_small.prototxt.
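
For a quick sanity check, here is how those iterations translate into Keras-style epochs, assuming the standard CIFAR-10 training set of 50,000 images:

# one iteration = one batch, so 5000 iterations with batch size 100
# correspond to 10 passes over the 50,000-image CIFAR-10 training set
train_images = 50_000
batch_size = 100    # from cifar10_m4_train_test_small.prototxt
iterations = 5_000  # 4000 + 1000 across the two solver files

epochs = iterations * batch_size / train_images
print(f'{iterations} iterations ~ {epochs:.0f} epochs')  # 10 epochs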

Quite simple, right? No coding is needed to build and train a Caffe model.

Quantize the Model

Quick facts about quantization:

  • Quantizing 32-bit floating-point weights to 8-bit fixed-point weights for deployment reduces the model size by 4X,
  • Fixed-point integer operations run much faster than floating-point operations on typical microcontrollers,
  • During inference, a model with quantized integer weights and biases typically shows minimal loss of performance (i.e., accuracy).

Since the weights are fixed after training, we know their min/max range, and they can be quantized or discretized to 256 levels using that range. Here is a quick demo that quantizes the weights to fixed-point numbers. Assume a layer's weights initially contain just 5 floating-point numbers.

import numpy as np
weight = np.array([-31.63, -6.54, 0.45, 0.90, 31])
min_wt = weight.min()
max_wt = weight.max()
# find the number of integer bits needed to represent this range
int_bits = int(np.ceil(np.log2(max(abs(min_wt), abs(max_wt)))))  # 31.63 --> 5 bits
frac_bits = 7 - int_bits  # remaining bits are fractional (1 bit for sign), 7 - 5 = 2 bits
# floating point weights are scaled, rounded and clipped to [-128, 127], which are
# used in the fixed-point operations on the actual hardware (i.e., microcontroller)
quant_weight = np.clip(np.round(weight * (2 ** frac_bits)), -128, 127)  # 31 * 2^2 = 124
# to quantify the impact of quantization, scale the quantized weights back to the
# original range and run inference with them
recovered_weight = quant_weight / (2 ** frac_bits)
print('quantization format: \t Q' + str(int_bits) + '.' + str(frac_bits))
print('Original weights: ', weight)
print('Quantized weights:', quant_weight)
print('Recovered weights:', recovered_weight)

It outputs:

quantization format: 	 Q5.2
Original weights:  [-31.63  -6.54   0.45   0.9   31.  ]
Quantized weights: [-127.   -26.    2.     4.    124. ]
Recovered weights: [-31.75  -6.5    0.5    1.    31.  ]

In this demo, the weights are quantized to the Q5.2 fixed-point number format, meaning a signed floating-point number is represented in 8 bits with:

  • one bit for the sign (positive/negative),
  • 5 bits for the integer part,
  • 2 bits for the fractional part.

The m and n in the Qm.n format can generally be calculated from the min/max range as shown in the previous demo, but what happens when the weights matrix contains an outlier?

weight = np.array([-31.63, -6.54, 0.45, 0.90, 31, 200])

If you rerun the previous script with these new weight values, the format becomes Q8.-1 and the recovered weights look like the output below. Not so good - the small weight values are lost!

array([-32.,  -6.,   0.,   0.,  32., 200.])

That is why the ARM team developed a helper script that quantizes the weights with minimal loss of accuracy on the test dataset, meaning it also runs the model to search for the best Qm.n values around the initially calculated ones.
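
To make that concrete, here is a minimal sketch of such a search. It is not ARM's actual implementation: the real script scores accuracy on the test dataset, while this toy version scores the relative reconstruction error so that the small weights are not drowned out by the outlier.

import numpy as np

def dequantized(weight, frac_bits):
    # 8-bit quantization with the given fractional bits, mapped back to floats
    q = np.clip(np.round(weight * 2.0 ** frac_bits), -128, 127)
    return q / 2.0 ** frac_bits

weight = np.array([-31.63, -6.54, 0.45, 0.90, 31, 200])
int_bits = int(np.ceil(np.log2(np.abs(weight).max())))
base_frac = 7 - int_bits  # the min/max-derived starting point, here -1

def error(frac_bits):
    # relative error keeps the small weights relevant despite the outlier
    return (np.abs(weight - dequantized(weight, frac_bits)) / np.abs(weight)).sum()

best = min(range(base_frac, base_frac + 4), key=error)
print('best frac_bits:', best)                  # saturates the outlier 200...
print('recovered:', dequantized(weight, best))  # ...but keeps the small weights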

The nn_quantizer.py script takes the model definition file (cifar10_m4_train_test_small.prototxt) and the trained model file (cifar10_small_iter_5000.caffemodel.h5), then does three things iteratively, layer by layer:

  • Quantize the weights matrix values
  • Quantize the layers' activation values (including the input image data, whose values range between 0 and 255)
  • Quantize the bias matrix values

The script finally dumps the network graph connectivity and the quantization parameters into a pickle file for the next step.
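
As a rough illustration of what quantizing activations involves, the sketch below derives a Q format from the values observed in a calibration batch. The real script does this for every layer's output while running the network on the test dataset; only the input data is shown here:

import numpy as np

def activation_q_format(activations, total_bits=8):
    # derive Qm.n from the observed activation range
    max_abs = np.abs(activations).max()
    int_bits = int(np.ceil(np.log2(max_abs)))
    frac_bits = total_bits - 1 - int_bits  # one bit reserved for the sign
    return int_bits, frac_bits

# calibration batch: CIFAR-10 sized images with pixel values in 0~255
images = np.random.randint(0, 256, size=(100, 3, 32, 32))
m, n = activation_q_format(images)
print(f'input data format: Q{m}.{n}')  # Q8.-1 for the 0~255 range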

Generate the Code

Who needs to write code when there is a "code generator"? code_gen.py takes the quantization parameters and network graph connectivity from the previous step and generates code consisting of CMSIS-NN function calls.

It currently supports the following layers: Convolution, InnerProduct (fully connected), Pooling (max/average) and ReLU. It generates three files:

  1. weights.h: the model weights and biases.
  2. parameter.h: the quantization ranges, such as the bias and output shift values computed from the Qm.n formats of the weights, biases, and activations (see the sketch after this list).
  3. main.cpp: the network code.
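
The shift values follow from fixed-point arithmetic: multiplying an input with x fractional bits by a weight with y fractional bits yields products with x+y fractional bits, so the bias and the output must be shifted to line up. A minimal sketch with made-up Q formats (the real values come from the quantizer's pickle file; the macro names in the comments mimic those in the CMSIS-NN cifar10 example):

# fractional bits (the n in Qm.n) for one layer; example values only
input_frac, weight_frac = 4, 6
bias_frac, output_frac = 8, 5

# products accumulate with input_frac + weight_frac fractional bits, so the
# bias is shifted left and the accumulator shifted right to match formats
bias_shift = input_frac + weight_frac - bias_frac   # e.g. CONV1_BIAS_LSHIFT = 2
out_shift = input_frac + weight_frac - output_frac  # e.g. CONV1_OUT_RSHIFT = 5
print('bias left shift:', bias_shift, ' output right shift:', out_shift)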

The generator is quite sophisticated, and it picks the best layer implementation based on various constraints as discussed in the previous post.

Deploy to microcontroller

If the model structure is unchanged, we only need to update the data from weights.h and parameter.h, that is, the weights plus the bias and output shift values, replacing their counterparts in your project source file. If your project is based on the official CMSIS-NN cifar10 example like mine, those values are defined inside the file arm_nnexamples_cifar10_weights.h.

The naming of some definitions differs slightly, but that's easy to sort out.

Now, build and run it on a microcontroller!

Conclusion and further thoughts

So far you have been running the neural network with purely pre-defined input data, which is no fun considering the wide variety of sensors available: camera, microphone, and accelerometer, to name a few, can all be easily integrated with the microcontroller to acquire real-time data from the environment. There are endless possibilities when this neural network framework is leveraged to process that data and extract useful information. Let's discuss what application you want to build with this technology.

Don't forget to check out this runnable Google Colab notebook for this tutorial.
