(Comments)

This tutorial will demonstrate how you can reduce the size of your Keras model by 5 times with TensorFlow model optimization, which can be particularly important for deployment in resource-constraint environments.

From the official TensorFlow model optimization documentation. Weight pruning means eliminating unnecessary values in weight tensors. We set the neural network parameters' values to zero to remove what we estimate are unnecessary connections between the layers of a neural network. This is done during the training process to allow the neural network to adapt to the changes.

Here is a breakdown of how you can adopt this technique.

- Train Keras model to reach an acceptable accuracy as always.
- Make Keras layers or model ready to be pruned.
- Create a pruning schedule and train the model for more epochs.
- Export the pruned model by striping pruning wrappers from the model.
- Convert Keras model to TensorFlow Lite with optional quantization.

Your pre-trained model has already achieved desirable accuracy, you want to cut down its size while maintaining the performance. The pruning API can help you make it happen.

To use the pruning API, install the `tensorflow-model-optimization`

and `tf-nightly`

packages.

```
pip uninstall -yq tensorflow
pip uninstall -yq tf-nightly
pip install -Uq tf-nightly-gpu
pip install -q tensorflow-model-optimization
```

Then you can load your previous trained model and make it "prunable". The Keras-based API can be applied at the level of individual layers, or the entire model. Since you have the entire model pre-trained, it is easier to apply the pruning to the entire model. The algorithm will be applied to all layers capable of weight pruning.

For the pruning schedule, we start at the sparsity level 50% and gradually train the model to reach 90% sparsity. X% sparsity means that X% of the weight tensor is going to be pruned away.

Furthermore, we give the model some time to recover after each pruning step, so pruning does not happen on every step. We set the pruning `frequency`

to 100. Similar to pruning a bonsai, we are trimming it gradually so that the tree can adequately heal the wound created during pruning instead of cutting 90% of its branches in one day.

Given the model already reached a satisfactory accuracy, we can start pruning immediately. As a result, we set the `begin_step`

to 0 here, and only train for another four epochs.

The end step is calculated given the number of train example, batch size, and the total epochs to train.

```
import numpy as np
import tensorflow as tf
from tensorflow_model_optimization.sparsity import keras as sparsity
# Backend agnostic way to save/restore models
# _, keras_file = tempfile.mkstemp('.h5')
# print('Saving model to: ', keras_file)
# tf.keras.models.save_model(model, keras_file, include_optimizer=False)
# Load the serialized model
loaded_model = tf.keras.models.load_model(keras_file)
epochs = 4
end_step = np.ceil(1.0 * num_train_samples / batch_size).astype(np.int32) * epochs
print(end_step)
new_pruning_params = {
'pruning_schedule': sparsity.PolynomialDecay(initial_sparsity=0.50,
final_sparsity=0.90,
begin_step=0,
end_step=end_step,
frequency=100)
}
new_pruned_model = sparsity.prune_low_magnitude(loaded_model, **new_pruning_params)
new_pruned_model.summary()
new_pruned_model.compile(
loss=tf.keras.losses.categorical_crossentropy,
optimizer='adam',
metrics=['accuracy'])
```

Don't panic if you find more trainable parameters in the `new_pruned_model`

summary, those came from the pruning wrappers which we will remove later.

Now let's start the training and pruning model.

```
# Add a pruning step callback to peg the pruning step to the optimizer's
# step. Also add a callback to add pruning summaries to tensorboard
callbacks = [
sparsity.UpdatePruningStep(),
sparsity.PruningSummaries(log_dir=logdir, profile_batch=0)
]
new_pruned_model.fit(x_train, y_train,
batch_size=batch_size,
epochs=epochs,
verbose=1,
callbacks=callbacks,
validation_data=(x_test, y_test))
score = new_pruned_model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
```

The test loss and accuracy of the pruned model should look similar to your original Keras model.

Those pruning wrappers can be removed easily like this, after which the total number of parameters should be the same as your original model.

```
final_model = sparsity.strip_pruning(pruned_model)
final_model.summary()
```

Now you can check the percentage of weights were pruned by comparing them to zero.

```
from tensorflow.keras.models import load_model
model = load_model(final_model)
import numpy as np
for i, w in enumerate(model.get_weights()):
print(
"{} -- Total:{}, Zeros: {:.2f}%".format(
model.weights[i].name, w.size, np.sum(w == 0) / w.size * 100
)
)
```

Here is the results, as you can see, 90% of convolution, dense and batch norm layers' weights are pruned.

name | Total para | Pruned% |

conv2d_2/kernel:0 | 800 | 89.12% |

conv2d_2/bias:0 | 32 | 0.00% |

batch_normalization_1/gamma:0 | 32 | 0.00% |

batch_normalization_1/beta:0 | 32 | 0.00% |

conv2d_3/kernel:0 | 32 | 0.00% |

conv2d_3/bias:0 | 32 | 0.00% |

dense_2/kernel:0 | 51200 | 89.09% |

dense_2/bias:0 | 64 | 0.00% |

dense_3/kernel:0 | 3211264 | 89.09% |

dense_3/bias:0 | 1024 | 0.00% |

batch_normalization_1/moving_mean:0 | 10240 | 89.09% |

batch_normalization_1/moving_variance:0 | 10 | 0.00% |

Now, simply using a generic file compression algorithm (e.g. zip), the Keras model will be reduced by x5 times.

```
import tempfile
import zipfile
_, new_pruned_keras_file = tempfile.mkstemp(".h5")
print("Saving pruned model to: ", new_pruned_keras_file)
tf.keras.models.save_model(final_model, new_pruned_keras_file, include_optimizer=False)
# Zip the .h5 model file
_, zip3 = tempfile.mkstemp(".zip")
with zipfile.ZipFile(zip3, "w", compression=zipfile.ZIP_DEFLATED) as f:
f.write(new_pruned_keras_file)
print(
"Size of the pruned model before compression: %.2f Mb"
% (os.path.getsize(new_pruned_keras_file) / float(2 ** 20))
)
print(
"Size of the pruned model after compression: %.2f Mb"
% (os.path.getsize(zip3) / float(2 ** 20))
)
```

Here is what you get, x5 times smaller model.

Size of the pruned model before compression: **12.52 Mb**

Size of the pruned model after compression: **2.51 Mb**

Tensorflow Lite is an example format you can use to deploy to mobile devices. To convert to a Tensorflow Lite graph, it is necessary to use the `TFLiteConverter`

as below:

```
# Create the .tflite file
tflite_model_file = "/tmp/sparse_mnist.tflite"
converter = tf.lite.TFLiteConverter.from_keras_model_file(pruned_keras_file)
tflite_model = converter.convert()
with open(tflite_model_file, "wb") as f:
f.write(tflite_model)
```

Then you can use a similar technique to zip the `tflite`

file and reduce size x5 times smaller.

Post-training quantization converts weights to 8-bit precision as part of the model conversion from keras model to TFLite's flat buffer, resulting in another 4x reduction in the model size. Just add the following line to the previous snippet before calling the `convert()`

.

```
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
```

The compressed 8-bit tensorflow lite model only takes 0.60 Mb compared to the original Keras model's 12.52 Mb while maintaining comparable test accuracy. That's totally x16 times size reduction.

You can evaluate the accuracy of the converted TensorFlow Lite model like this where you feed the `eval_model`

with the test dataset.

```
import numpy as np
interpreter = tf.lite.Interpreter(model_path=str(tflite_model_file))
interpreter.allocate_tensors()
input_index = interpreter.get_input_details()[0]["index"]
output_index = interpreter.get_output_details()[0]["index"]
def eval_model(interpreter, x_test, y_test):
total_seen = 0
num_correct = 0
for img, label in zip(x_test, y_test):
inp = img.reshape((1, 28, 28, 1))
total_seen += 1
interpreter.set_tensor(input_index, inp)
interpreter.invoke()
predictions = interpreter.get_tensor(output_index)
if np.argmax(predictions) == np.argmax(label):
num_correct += 1
if total_seen % 1000 == 0:
print("Accuracy after %i images: %f" %
(total_seen, float(num_correct) / float(total_seen)))
return float(num_correct) / float(total_seen)
print(eval_model(interpreter, x_test, y_test))
```

In this tutorial, we showed you how to create *sparse models* with the TensorFlow model optimization toolkit weight pruning API. Right now, this allows you to create models that take significantly less space on the disk. The resulting model can also be more efficiently implemented to avoid computation; in the future, TensorFlow Lite will provide such capabilities.

Check out the official TensorFlow model optimization page and their GitHub page for more information.

- Accelerated Deep Learning inference from your browser
- How to run SSD Mobilenet V2 object detection on Jetson Nano at 20+ FPS
- Automatic Defect Inspection with End-to-End Deep Learning
- How to train Detectron2 with Custom COCO Datasets
- Getting started with VS CODE remote development

- June (1)

- December (1)
- November (1)
- October (1)
- September (3)
- August (1)
- July (2)
- June (2)
- May (3)
- April (3)
- March (1)
- February (2)
- January (2)

- December (3)
- November (3)
- October (3)
- September (5)
- August (5)
- July (4)
- June (4)
- May (4)
- April (6)
- March (5)
- February (3)
- January (4)

- deep learning (79)
- edge computing (17)
- Keras (48)
- NLP (8)
- python (69)
- PyTorch (7)
- tensorflow (35)

- tutorial (56)
- Sentiment analysis (3)
- keras (35)
- deep learning (57)
- pytorch (2)

- Chengwei (85)

## Comments