How to run Keras model inference 3x faster with CPU and Intel OpenVINO



In this quick tutorial, you will learn how to set up OpenVINO and make your Keras model inference at least 3x faster without any added hardware.

There are multiple options to speed up deep learning inference on edge devices. To name a few:

  • Adding a low-end Nvidia GPU like the GT1030.
    • pros: easy to integrate. Because it uses the same Nvidia CUDA and cuDNN toolkits as your development environment, no significant model conversion is needed.
    • cons: a PCI-E slot must exist on the target device's motherboard to interface with the graphics card, which adds extra cost and space to the edge device.
  • Using ASIC chips geared towards accelerating neural network inference, such as the Movidius neural compute sticks or the Lightspeeur 2801 neural accelerator.
    • pros:
      • Just like a USB drive, they work on different host machines, whether a desktop computer with an Intel/AMD CPU or a Raspberry Pi single-board computer with an ARM Cortex-A.
      • Offloading neural network computation to those USB sticks lets the host machine's CPU handle only general-purpose computation such as image preprocessing.
      • Scaling can be as easy as plugging in more of those USB sticks as the throughput requirement on the edge device increases.
      • They generally have higher performance per watt than CPUs / Nvidia GPUs.
    • cons:
      • Since they are ASICs (application-specific ICs), expect limited support for some TensorFlow layers/operations.
      • They also require a special model conversion to create instructions the specific ASIC understands.
  • Using an embedded SoC that comes with an NPU (neural processing unit), like the Rockchip RK3399Pro.
    • An NPU is similar to the ASIC chips in that it requires special instructions and model conversion. The difference is that it sits on the same silicon die as the CPU, which makes the form factor smaller.

All of the acceleration options above come with an additional cost. However, if an edge device already has an Intel CPU, you might as well make its deep learning inference about 3x faster for free with Intel's OpenVINO toolkit.

Intro to OpenVINO and setup

You might wonder where the extra speedup comes from without additional hardware.

First and foremost, since OpenVINO is an Intel product, it is optimized for Intel processors.

The OpenVINO inference engine can run models on either the CPU or Intel's integrated GPU, which support different input precisions.

The CPU supports only FP32, while the integrated GPU supports both FP16 and FP32.
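To get a feel for what FP16 costs in precision compared with FP32, here is a minimal sketch that uses only Python's standard library. It rounds a sample probability to IEEE 754 half and single precision with struct and compares the errors. The helper names to_fp16 and to_fp32 are made up for this illustration and are not part of OpenVINO.

```python
import struct

def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack('<e', struct.pack('<e', value))[0]

def to_fp32(value):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack('<f', struct.pack('<f', value))[0]

# A sample class probability, similar to what a classifier outputs.
prob = 0.82658225

fp32 = to_fp32(prob)
fp16 = to_fp16(prob)
print('FP32 value:', fp32, 'abs error:', abs(fp32 - prob))
print('FP16 value:', fp16, 'abs error:', abs(fp16 - prob))
```

For values between 0.5 and 1, half precision has a step of about 0.0005, so small differences between close class scores can be lost while the top prediction usually stays the same.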

The CPU plugin leverages the Intel Math Kernel Library for Deep Neural Networks (MKL-DNN) as well as OpenMP to parallelize calculations.

Second, there is a model optimization step, as you will see later in this tutorial, during which extra steps are taken to make the model more compact for inference:

  • Merging of group convolutions.
  • Fusing Convolution with ReLU or ELU.
  • Fusing Convolution + Sum or Convolution + Sum + ReLU.
  • Removing the power layer.
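To illustrate why fusing helps, here is a toy sketch in plain Python. It is not how the Model Optimizer is implemented; it only shows the idea that Convolution + Sum + ReLU can be computed in one pass over the data instead of three, with identical results. The functions conv1d, unfused and fused are hypothetical names, and a 1-D signal stands in for a real feature map.

```python
def conv1d(signal, kernel):
    """Valid 1-D cross-correlation, a stand-in for a convolution layer."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def unfused(signal, kernel, residual):
    """Three separate passes, each producing an intermediate list."""
    conv_out = conv1d(signal, kernel)                      # pass 1: convolution
    summed = [c + r for c, r in zip(conv_out, residual)]   # pass 2: sum
    return [max(0.0, s) for s in summed]                   # pass 3: ReLU

def fused(signal, kernel, residual):
    """One pass: accumulate, add the residual and clamp in the same loop."""
    k = len(kernel)
    out = []
    for i in range(len(signal) - k + 1):
        acc = residual[i]
        for j in range(k):
            acc += signal[i + j] * kernel[j]
        out.append(max(0.0, acc))
    return out

signal = [1.0, -2.0, 3.0, 0.5, -1.0]
kernel = [0.5, -1.0, 0.25]
residual = [0.25, -4.0, 1.0]

print(unfused(signal, kernel, residual))
print(fused(signal, kernel, residual))
```

Both calls print the same values; the fused version simply avoids writing and re-reading the two intermediate buffers.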

Now, let's set up OpenVINO on your machine. Choose your OS on this page and follow the instructions to download and install it.

System requirements

  • 6th-8th Generation Intel® Core™
  • Intel® Xeon® v5 family
  • Intel® Xeon® v6 family

Operating Systems

  • Ubuntu* 16.04.3 long-term support (LTS), 64-bit
  • CentOS* 7.4, 64-bit
  • Windows* 10, 64-bit

If you already installed Python 3.5+, it is safe to ignore the notice to install Python 3.6+.

Once the installation is done, run either C:/Intel/computer_vision_sdk/deployment_tools/model_optimizer/install_prerequisites/install_prerequisites_tf.bat or the matching script under ~/intel/computer_vision_sdk/deployment_tools/model_optimizer/install_prerequisites/, depending on your OS, to install the Python packages OpenVINO needs to work with TensorFlow.

InceptionV3 model inference in OpenVINO

You can download the full source code for this tutorial from my GitHub. It includes an all-in-one Jupyter notebook that walks you through converting a Keras model for OpenVINO, making predictions, and benchmarking inference speed in all three environments: Keras, TensorFlow, and OpenVINO.

Run setupvars.bat before launching Jupyter Notebook to set up the environment.


Or, on Linux, add the following line to ~/.bashrc:

source ~/intel/computer_vision_sdk/bin/

Here is an overview of the workflow to convert a Keras model to an OpenVINO model and make a prediction.

  1. Save the Keras model as a single .h5 file.
  2. Load the .h5 file and freeze the graph to a single TensorFlow .pb file.
  3. Run the OpenVINO script to convert the .pb file to a model XML and bin file.
  4. Load the model XML and bin file with OpenVINO inference engine and make a prediction.
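As a rough map of the four steps above, the sketch below lists the file each step produces, using the file names that appear in this tutorial. The helper artifact_paths is a hypothetical convenience function, not part of Keras or OpenVINO.

```python
def artifact_paths(model_dir='./model', h5_name='model.h5', pb_name='frozen_model.pb'):
    """Return the file produced or consumed at each step of the workflow."""
    # The IR files share the frozen graph's base name.
    stem = pb_name.rsplit('.', 1)[0]
    return {
        'keras_h5': '{}/{}'.format(model_dir, h5_name),     # step 1
        'frozen_pb': '{}/{}'.format(model_dir, pb_name),    # step 2
        'ir_xml': '{}/{}.xml'.format(model_dir, stem),      # step 3
        'ir_bin': '{}/{}.bin'.format(model_dir, stem),      # step 3, loaded in step 4
    }

for step, path in artifact_paths().items():
    print('{:<10} {}'.format(step, path))
```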

Save the Keras model as a single .h5 file

For the tutorial, we will load a pre-trained ImageNet classification InceptionV3 model from Keras.

# Force use CPU only.
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

import tensorflow as tf
from tensorflow.keras.applications.inception_v3 import InceptionV3 as Net
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.inception_v3 import preprocess_input, decode_predictions
import numpy as np

img_height = 224

model = Net(weights='imagenet')

# Optional image to test model prediction.
img_path = './data/elephant.jpg'

# Path to save the model h5 file.
model_fname = './model/model.h5'

# Load the image for prediction.
img = image.load_img(img_path, target_size=(img_height, img_height))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)

preds = model.predict(x)
# decode the results into a list of tuples (class, description, probability)
# (one such list for each sample in the batch)
print('Predicted:', decode_predictions(preds, top=3)[0])
# Predicted: [(u'n02504013', u'Indian_elephant', 0.82658225), (u'n01871265', u'tusker', 0.1122357), (u'n02504458', u'African_elephant', 0.061040461)]

# Save the h5 file to path specified.
model.save(model_fname)

Freeze the graph to a single TensorFlow .pb file

This step removes any layers and operations not necessary for inference.

import tensorflow as tf
from tensorflow.python.framework import graph_io
from tensorflow.keras.models import load_model

# Clear any previous session.
tf.keras.backend.clear_session()

save_pb_dir = './model'
model_fname = './model/model.h5'
def freeze_graph(graph, session, output, save_pb_dir='.', save_pb_name='frozen_model.pb', save_pb_as_text=False):
    with graph.as_default():
        graphdef_inf = tf.graph_util.remove_training_nodes(graph.as_graph_def())
        graphdef_frozen = tf.graph_util.convert_variables_to_constants(session, graphdef_inf, output)
        graph_io.write_graph(graphdef_frozen, save_pb_dir, save_pb_name, as_text=save_pb_as_text)
        return graphdef_frozen

# This line must be executed before loading Keras model.
tf.keras.backend.set_learning_phase(0)

model = load_model(model_fname)

session = tf.keras.backend.get_session()

INPUT_NODE = [t.op.name for t in model.inputs]
OUTPUT_NODE = [t.op.name for t in model.outputs]
frozen_graph = freeze_graph(session.graph, session, [out.op.name for out in model.outputs], save_pb_dir=save_pb_dir)

OpenVINO model optimization

The following snippet runs in the Jupyter notebook. It locates the Model Optimizer script based on your OS (Windows or Linux); you can change img_height accordingly. The data_type can also be set to FP16 to gain an extra speedup when running inference on an Intel integrated GPU, at the cost of slightly reduced precision.

import platform
is_win = 'windows' in platform.platform().lower()

if is_win:
    mo_tf_path = 'C:/Intel/computer_vision_sdk/deployment_tools/model_optimizer/'
else:
    # path in Linux
    mo_tf_path = '~/intel/computer_vision_sdk/deployment_tools/model_optimizer/'

pb_file = './model/frozen_model.pb'
output_dir = './model'
img_height = 224
input_shape = [1,img_height,img_height,3]
input_shape_str = str(input_shape).replace(' ','')

!python {mo_tf_path} --input_model {pb_file} --output_dir {output_dir} --input_shape {input_shape_str} --data_type FP32

After running the script, you will find two new files generated under the ./model directory: frozen_model.xml and frozen_model.bin. They are the optimized Intermediate Representation (IR) of the model, based on the trained network topology, weights, and bias values.
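If you prefer to run the conversion outside Jupyter, the same command line can be assembled as an argument list. This is only a sketch: build_mo_command is a hypothetical helper, and the mo_script value is a placeholder you must replace with the actual Model Optimizer script path of your installation. The four flags are the ones used in the notebook command above.

```python
import sys

def build_mo_command(mo_script, pb_file, output_dir, input_shape, data_type='FP32'):
    """Assemble the Model Optimizer call as a list suitable for subprocess.run."""
    if data_type not in ('FP16', 'FP32'):
        raise ValueError('data_type must be FP16 or FP32, got {!r}'.format(data_type))
    # The shape is passed without spaces, e.g. [1,224,224,3].
    shape_str = '[' + ','.join(str(dim) for dim in input_shape) + ']'
    return [
        sys.executable, mo_script,
        '--input_model', pb_file,
        '--output_dir', output_dir,
        '--input_shape', shape_str,
        '--data_type', data_type,
    ]

cmd = build_mo_command(
    mo_script='path/to/model_optimizer_script.py',  # placeholder, not a real path
    pb_file='./model/frozen_model.pb',
    output_dir='./model',
    input_shape=[1, 224, 224, 3],
)
print(' '.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```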

Inference with OpenVINO Inference Engine (IE)

If you have set up the environment correctly, a path like C:\Intel\computer_vision_sdk\python\python3.5 or ~/intel/computer_vision_sdk/python/python3.5 will exist in PYTHONPATH. This is necessary to load the Python openvino package at runtime.
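A quick way to check this before importing anything is sketched below with the standard library only. The helper name sdk_entries is hypothetical; the computer_vision_sdk marker is taken from the paths above. Unlike a bare lookup of os.environ['PYTHONPATH'], it does not raise when the variable is unset.

```python
import os

def sdk_entries(pythonpath, marker='computer_vision_sdk'):
    """Return the PYTHONPATH entries that point into the OpenVINO SDK."""
    return [entry for entry in pythonpath.split(os.pathsep) if marker in entry]

entries = sdk_entries(os.environ.get('PYTHONPATH', ''))
if entries:
    print('OpenVINO SDK entries found:', entries)
else:
    print('No OpenVINO SDK entry in PYTHONPATH; run the setupvars script first.')
```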

The following snippet runs the inference engine on the CPU. It can also run on an Intel GPU if you opted for the FP16 data_type earlier.

import os
assert 'computer_vision_sdk' in os.environ['PYTHONPATH']

from PIL import Image
import numpy as np
try:
    from openvino import inference_engine as ie
    from openvino.inference_engine import IENetwork, IEPlugin
except Exception as e:
    exception_type = type(e).__name__
    print("The following error happened while importing Python API module:\n[ {} ] {}".format(exception_type, e))

def pre_process_image(imagePath, img_height=224):
    # Model input format
    n, c, h, w = [1, 3, img_height, img_height]
    image = Image.open(imagePath)
    processedImg = image.resize((h, w), resample=Image.BILINEAR)

    # Normalize to keep data between 0 - 1
    processedImg = (np.array(processedImg) - 0) / 255.0

    # Change data layout from HWC to CHW
    processedImg = processedImg.transpose((2, 0, 1))
    processedImg = processedImg.reshape((n, c, h, w))

    return image, processedImg, imagePath

# Plugin initialization for specified device and load extensions library if specified.
plugin_dir = None
model_xml = './model/frozen_model.xml'
model_bin = './model/frozen_model.bin'
# Devices: GPU (intel), CPU, MYRIAD
plugin = IEPlugin("CPU", plugin_dirs=plugin_dir)
# Read IR
net = IENetwork.from_ir(model=model_xml, weights=model_bin)
assert len(net.inputs.keys()) == 1
assert len(net.outputs) == 1
input_blob = next(iter(net.inputs))
out_blob = next(iter(net.outputs))
# Load network to the plugin
exec_net = plugin.load(network=net)
del net

# Run inference
fileName = './data/elephant.jpg'
image, processedImg, imagePath = pre_process_image(fileName)
res = exec_net.infer(inputs={input_blob: processedImg})
# Access the results and get the index of the highest confidence score
output_node_name = list(res.keys())[0]
res = res[output_node_name]

# Predicted class index.
idx = np.argsort(res[0])[-1]

# decode the predictions
from tensorflow.keras.applications.inception_v3 import decode_predictions
print('Predicted:', decode_predictions(res, top=3)[0])
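The HWC to CHW step in pre_process_image above is easier to follow with explicit index arithmetic. The sketch below does the same reordering on a flat list in plain Python; hwc_to_chw is a hypothetical helper written for this explanation, and in practice the NumPy transpose((2, 0, 1)) call does the work.

```python
def hwc_to_chw(pixels, height, width, channels):
    """Reorder a flat HWC buffer (pixel by pixel) into a flat CHW buffer (plane by plane)."""
    out = [0] * (height * width * channels)
    for h in range(height):
        for w in range(width):
            for c in range(channels):
                src = (h * width + w) * channels + c   # offset in HWC layout
                dst = (c * height + h) * width + w     # offset in CHW layout
                out[dst] = pixels[src]
    return out

# A 2x2 RGB image; each pixel is written as (R, G, B).
hwc = [10, 20, 30,  11, 21, 31,
       12, 22, 32,  13, 23, 33]

chw = hwc_to_chw(hwc, height=2, width=2, channels=3)
print(chw)  # all R values first, then all G values, then all B values
```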

Speed Benchmark

Benchmark setup:

  • TensorFlow version: 1.12.0
  • OS: Windows 10, 64-bit
  • CPU: Intel Core i7-7700HQ
  • The number of inferences to calculate the average result: 20.
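A timing loop along these lines can produce the average latency and fps figures. This is a minimal sketch, not the exact code from the notebook: benchmark and dummy_predict are hypothetical names, and the single warm-up call is an assumption added so that one-time initialization does not skew the average. Replace dummy_predict with a function that runs one inference in the environment you want to measure.

```python
import time

def benchmark(predict_fn, n_runs=20, warmup=1):
    """Return (average seconds per call, calls per second) for predict_fn."""
    for _ in range(warmup):
        predict_fn()                      # not timed
    start = time.perf_counter()
    for _ in range(n_runs):
        predict_fn()
    elapsed = time.perf_counter() - start
    average = elapsed / n_runs
    return average, 1.0 / average

def dummy_predict():
    """Stand-in workload; swap in a real inference call."""
    return sum(i * i for i in range(10000))

avg, fps = benchmark(dummy_predict, n_runs=20)
print('average(sec):{:.4f}, fps:{:.1f}'.format(avg, fps))
```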

Benchmark results for all three environments (Keras, TensorFlow, and OpenVINO) are shown below.

Keras          average(sec):0.079, fps:12.5
TensorFlow     average(sec):0.069, fps:14.3
OpenVINO(CPU)  average(sec):0.024, fps:40.6


Results will vary with the Intel processor you are using, but expect a significant speedup compared to running inference with TensorFlow / Keras on the CPU backend.
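The speedup factors follow directly from the average latencies reported above. The short sketch below does the arithmetic; the speedup helper is a hypothetical name, and the numbers are the ones from the benchmark table.

```python
# Average seconds per inference, copied from the benchmark results.
results = {
    'Keras': 0.079,
    'TensorFlow': 0.069,
    'OpenVINO(CPU)': 0.024,
}

def speedup(baseline_sec, optimized_sec):
    """How many times faster the optimized run is than the baseline."""
    return baseline_sec / optimized_sec

openvino = results['OpenVINO(CPU)']
for name in ('Keras', 'TensorFlow'):
    print('OpenVINO vs {}: {:.1f}x faster'.format(name, speedup(results[name], openvino)))
```

On this machine that works out to roughly 3.3x over Keras and 2.9x over TensorFlow.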

Conclusion and further reading

In this tutorial, you have learned how to run model inference several times faster than stock TensorFlow with your Intel processor and the OpenVINO toolkit. OpenVINO does not only accelerate inference on the CPU: the same workflow introduced in this tutorial can be adapted to a Movidius neural compute stick with a few changes.

OpenVINO documentation you might find helpful:

Install Intel® Distribution of OpenVINO™ toolkit for Windows* 10

Install the Intel® Distribution of OpenVINO™ toolkit for Linux*

OpenVINO - Advanced Topics - CPU Plugin, where you can learn more about various model optimization techniques.

Download the full source code for this tutorial from my GitHub.
