You have learned how to run an image classification model on an ARM microcontroller and the basics of the CMSIS-NN framework. This post shows how you can train and deploy a new model from scratch.
Keras has been my favorite deep learning framework for its simplicity and elegance; however, this time we are going with Caffe, because the two useful code-generation scripts the ARM team has released were built for Caffe models. No worries if you are new to Caffe like me: the model structure and training parameters are all defined in an easy-to-understand text file format.
Caffe installation can be challenging, especially for beginners, which is why I built this runnable Google Colab notebook with the Caffe installation and the tutorial's code included.
The Caffe image classification model is defined in the file cifar10_m4_train_test_small.prototxt, with its model structure graph shown below. It contains three convolutional layers interspersed with ReLU activation and max-pooling layers, followed by a fully-connected layer at the end that classifies the input into one of the ten output classes.
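If you'd like to inspect that structure programmatically, here is a minimal pycaffe sketch (an aside of mine, not part of ARM's scripts); it assumes Caffe's Python bindings are installed and that the CIFAR-10 LMDB referenced by the prototxt's data layer has already been prepared, as the Colab notebook does.

import caffe

# Load the network definition in the TEST phase.
net = caffe.Net('cifar10_m4_train_test_small.prototxt', caffe.TEST)

# Print each blob's shape to see the conv/pool/fc pipeline end to end.
for name, blob in net.blobs.items():
    print(name, blob.data.shape)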
In the cifar10_m4_train_test_small.prototxt model definition file, the data layer feeds the network two blobs: data, the image batch, and label, the ground-truth classes. Each layer's lr_mult parameter scales the global learning rate for that layer's weights, and a fully-connected layer is called an InnerProduct layer in Caffe. A layer can also be restricted to a specific phase:
layer {
// ...layer definition...
include: { phase: TRAIN }
}
In the above example, the layer will be included only in the TRAIN phase.
Check out the solver file cifar10_small_solver to see how the training hyper-parameters are defined: the base learning rate, the maximum number of iterations, the snapshot interval, and so on.
Finally, running the script train_small_colab.sh starts the training; when it finishes, the weights are saved. In our case, the script runs two solver files: as defined in the second solver file, the learning rate is reduced by a factor of 10 for the last 1000 training iterations. The final trained weights are saved to the file cifar10_small_iter_5000.caffemodel.h5, where 5000 means the model has been trained for 5000 iterations. If you come from Keras or another deep learning framework, note that one iteration here doesn't mean the model has seen the entire training dataset once, but rather one batch of training data of size 100, as defined in cifar10_m4_train_test_small.prototxt.
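For the curious, the shell script is roughly equivalent to the following pycaffe sketch. The solver and snapshot file names here are assumptions for illustration; the second solver carries the 10x-lower learning rate and starts from the first stage's weights.

import caffe

# Stage 1: the first solver runs the bulk of the iterations.
solver = caffe.get_solver('cifar10_small_solver.prototxt')  # assumed name
solver.solve()

# Stage 2: the second solver (base_lr reduced 10x) fine-tunes for the
# remaining 1000 iterations, starting from the stage-1 weights.
solver2 = caffe.get_solver('cifar10_small_solver_lr1.prototxt')  # assumed name
solver2.net.copy_from('cifar10_small_iter_4000.caffemodel.h5')   # assumed name
solver2.solve()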
Quite simple, right? No coding is needed to build and train a Caffe model.
A quick fact about quantization: since the weights are fixed after training, we know their min/max range, so they can be quantized or discretized to 256 levels using that range. Here is a quick demo that quantizes weights to fixed-point numbers. Assume a layer's weights initially contain only 5 floating point numbers.
import numpy as np
weight = np.array([-31.63, -6.54, 0.45, 0.90, 31])
min_wt = weight.min()
max_wt = weight.max()
#find number of integer bits to represent this range
int_bits = int(np.ceil(np.log2(max(abs(min_wt),abs(max_wt))))) # 31.63 --> 5 bits
frac_bits = 7-int_bits #remaining bits are fractional bits (1-bit for sign), 7-5 = 2 bits
#floating point weights are scaled and rounded to [-128,127], which are used in
#the fixed-point operations on the actual hardware (i.e., microcontroller)
quant_weight = np.round(weight*(2**frac_bits)) # 31 * 2^(2 bits frac) = 124
#To quantify the impact of quantized weights, scale them back to
# original range to run inference using quantized weights
recovered_weight = quant_weight/(2**frac_bits)
print('quantization format: \t Q'+str(int_bits)+'.'+str(frac_bits))
print('Original weights: ', weight)
print('Quantized weights:', quant_weight)
print('Recovered weights:', recovered_weight)
It outputs,
quantization format: 	 Q5.2
Original weights:  [-31.63  -6.54   0.45   0.9   31.  ]
Quantized weights: [-127.  -26.    2.    4.  124.]
Recovered weights: [-31.75  -6.5    0.5    1.   31.  ]
In this demo, the weights are quantized to the Q5.2 fixed-point number format, which represents a signed floating point number in 8 bits: 1 sign bit, 5 integer bits, and 2 fractional bits. In general, a Qm.n format covers the range [-2^m, 2^m - 2^-n] with a resolution of 2^-n.
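As a sanity check, a tiny helper of mine (not part of the tutorial's scripts) computes that range and resolution:

# Range and resolution of a signed Qm.n format (1 + m + n bits total).
def q_range(m, n):
    resolution = 2.0 ** -n
    return -(2.0 ** m), 2.0 ** m - resolution, resolution

print(q_range(5, 2))  # Q5.2 -> (-32.0, 31.75, 0.25)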
Notice the trade-off in the Qm.n format: the more integer bits the range requires, the fewer fractional bits are left for precision. Watch what happens when a single outlier stretches the range; add a large weight value of 200:
weight = np.array([-31.63, -6.54, 0.45, 0.90, 31, 200])
If you rerun the previous script with these new weight values, int_bits becomes 8 and frac_bits becomes -1, i.e. quantization format Q8.-1, and the recovered weights will look like below. Not so good: the small weight values are lost!
array([-32., -6., 0., 0., 32., 200.])
That is why the ARM team developed a helper script that quantizes the weights with minimal loss of accuracy on the test dataset, which means it also runs the model to search for the best Qm.n format for each layer.
The nn_quantizer.py script takes the model definition file (cifar10_m4_train_test_small.prototxt) and the trained model file (cifar10_small_iter_5000.caffemodel.h5), then does three things iteratively, layer by layer.
The script finally dumps the network graph connectivity and the quantization parameters into a pickle file for the next step.
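To build some intuition for that Qm.n search, here is a toy sketch; it is my simplification for illustration, not ARM's actual implementation, and it simply picks the fractional bit count that minimizes the mean squared quantization error of a weight array.

import numpy as np

def best_frac_bits(values, total_bits=8):
    # Try each split of the bits and keep the one with the
    # smallest mean squared quantization error.
    best_n, best_err = None, np.inf
    for n in range(-total_bits, total_bits + 1):
        scale = 2.0 ** n
        q = np.clip(np.round(values * scale),
                    -2 ** (total_bits - 1), 2 ** (total_bits - 1) - 1)
        err = np.mean((values - q / scale) ** 2)
        if err < best_err:
            best_n, best_err = n, err
    return best_n

weights = np.array([-31.63, -6.54, 0.45, 0.90, 31.0])
n = best_frac_bits(weights)
print('best format: Q%d.%d' % (7 - n, n))  # expect Q5.2 for this array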
Who needs to write code when there is a "code generator"? The code_gen.py script takes the quantization parameters and network graph connectivity from the previous step and generates the code consisting of NN function calls.
It currently supports the following layers: Convolution, InnerProduct (fully connected), Pooling (max/average), and ReLU. It generates three files:
weights.h: the quantized weights
parameter.h: the quantization ranges, plus the bias and output shift values computed from the Qm.n formats of the weights, bias, and activations (see the sketch after this list)
main.cpp: the network code
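As a rough illustration of where those shift values come from (a sketch of mine; the variable names are assumptions, not the generator's actual output): in CMSIS-NN the q7 multiply-accumulate result carries the input's plus the weights' fractional bits, so the bias must be shifted left and the output shifted right to line the formats up.

# Example fractional bit counts (the n of Qm.n) for one layer; made-up values.
in_frac, wt_frac, bias_frac, out_frac = 7, 2, 5, 4

# The accumulator has in_frac + wt_frac fractional bits, so:
bias_shift = (in_frac + wt_frac) - bias_frac  # left-shift applied to the bias
out_shift = (in_frac + wt_frac) - out_frac    # right-shift applied to the output
print(bias_shift, out_shift)  # -> 4 5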
The generator is quite sophisticated, and it picks the best layer implementation based on various constraints as discussed in the previous post.
If the model structure is unchanged, we only need to update the data in weights.h and parameter.h; they take the place of arm_nnexamples_cifar10_weights.h used in the previous post's example project. The naming of some definitions is slightly different, but it's easy to sort out.
Now, build and run it on a microcontroller!
So far you have been running the neural network on pre-defined input data, which is no fun when you consider the variety of sensors that can easily be integrated with a microcontroller to acquire real-time data from the environment: camera, microphone, and accelerometer, to name a few. There are endless possibilities when this neural network framework is leveraged to process that data and extract useful information. Let's discuss what application you want to build with this technology.
Don't forget to check out this runnable Google Colab notebook for this tutorial.