Simple Speech Keyword Detecting with Depthwise Separable Convolutions



Keyword detection, or speech command recognition, can be viewed as a minimal version of a speech recognition system. What if we could build a model that is accurate yet has a memory and compute footprint small enough to run in real time on a bare-metal microcontroller (one without an operating system)? If that becomes a reality, imagine how much smarter traditional consumer electronic devices could become with always-on speech commands enabled.

In this post, we will take the first step: building and training a deep learning model for keyword detection with limited memory and compute resources in mind.

Keyword detection system

Compared to a full speech recognition system, which is typically cloud-based and can recognize almost any spoken word, a keyword detection system is "always on" and detects only predefined keywords such as "Alexa", "Ok Google", or "Hey Siri". Detecting a keyword triggers a specific action, such as activating the full-scale speech recognition system. In other use cases, such keywords can be used to activate a voice-enabled lightbulb.

A keyword detection system consists of two essential parts.

  1. A feature extractor to convert an audio clip from time domain waveform to frequency domain speech features.
  2. A neural network based classifier to process the frequency domain features and predict the likelihood for all predefined keywords plus the "unknown" word and "silence".


Our system uses Mel-Frequency Cepstral Coefficients (MFCCs) as the feature extractor to get a 2D 'fingerprint' of the audio. Since the input to the neural network is an image-like 2D audio fingerprint, with the horizontal axis denoting time and the vertical axis representing the frequency coefficients, a convolution-based model seems like a natural choice.

Depthwise Separable Convolutions

The issue is that standard convolution operations might still require too much memory and compute from a microcontroller, considering that even some of the best-performing microcontrollers have only ~320KB of SRAM and ~1MB of flash. One way to meet these constraints while keeping accuracy high is to use depthwise separable convolutions instead of standard convolutions.

The depthwise separable convolution was popularized by the Xception ImageNet model and then adopted by other models such as MobileNet and ShuffleNet, all geared towards reducing model complexity for deployment on resource-constrained targets like smartphones, drones, and robots.

A depthwise separable convolution first performs a depthwise spatial convolution, which acts on each input channel separately, followed by a pointwise convolution (i.e., a 1x1 convolution), which mixes the resulting output channels. Intuitively, separable convolutions can be understood as a way to factorize a convolution kernel into two smaller kernels.
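The two stages described above can be sketched without any framework. The following is a minimal pure-Python illustration, not the model's actual implementation: the function names, the nested-list tensor layout ([height][width][channels]), the "valid" padding, and the stride of 1 are all assumptions made for the example.

```python
def depthwise_conv(x, dw_kernels):
    """Filter each input channel with its own k x k kernel (valid padding, stride 1).

    x: nested lists shaped [H][W][C].
    dw_kernels: nested lists shaped [k][k][C], one spatial kernel per channel.
    """
    height, width, channels = len(x), len(x[0]), len(x[0][0])
    k = len(dw_kernels)
    out_h, out_w = height - k + 1, width - k + 1
    out = [[[0.0] * channels for _ in range(out_w)] for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            for c in range(channels):  # channels are never mixed in this stage
                out[i][j][c] = sum(
                    x[i + di][j + dj][c] * dw_kernels[di][dj][c]
                    for di in range(k) for dj in range(k))
    return out


def pointwise_conv(x, pw_weights):
    """1x1 convolution: mix the C channels at every position into N outputs.

    pw_weights: nested lists shaped [C][N].
    """
    channels, num_outputs = len(pw_weights), len(pw_weights[0])
    return [[[sum(pixel[c] * pw_weights[c][n] for c in range(channels))
              for n in range(num_outputs)]
             for pixel in row]
            for row in x]


def depthwise_separable_conv(x, dw_kernels, pw_weights):
    """Depthwise filtering followed by pointwise channel mixing."""
    return pointwise_conv(depthwise_conv(x, dw_kernels), pw_weights)


# Toy input: 4 x 4 positions, 2 channels (channel 0 is all 1.0, channel 1 all 2.0).
x = [[[1.0, 2.0] for _ in range(4)] for _ in range(4)]
dw = [[[1.0, 0.5] for _ in range(3)] for _ in range(3)]  # one 3x3 kernel per channel
pw = [[1.0, 0.0, 1.0],                                    # 2 channels -> 3 outputs
      [0.0, 1.0, 1.0]]
y = depthwise_separable_conv(x, dw, pw)
print(len(y), len(y[0]), len(y[0][0]))  # output height, width, channels
```

The depthwise stage only filters (each channel stays separate), and the pointwise stage only combines (no spatial extent), which is the factorization the paragraph above describes.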

A standard convolution filters and combines inputs into a new set of outputs in one step. The depthwise separable convolution splits this into two layers: one for filtering and one for combining. This factorization drastically reduces computation and model size. Because depthwise separable convolutions are more efficient in both the number of parameters and the number of operations, deeper and wider architectures become possible even on resource-constrained devices.
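To make the savings concrete, here is a small sketch that counts weights and multiply-accumulate operations (MACs) for one layer of each kind. The layer dimensions are hypothetical values chosen for illustration, and biases and batch-norm parameters are ignored.

```python
def standard_conv_cost(k, c_in, c_out, out_h, out_w):
    """Weights and MACs of a standard k x k convolution."""
    params = k * k * c_in * c_out
    macs = params * out_h * out_w  # every weight is used once per output position
    return params, macs


def separable_conv_cost(k, c_in, c_out, out_h, out_w):
    """Weights and MACs of a depthwise (k x k) plus pointwise (1 x 1) convolution."""
    depthwise_params = k * k * c_in   # one spatial kernel per input channel
    pointwise_params = c_in * c_out   # 1x1 kernel that mixes channels
    params = depthwise_params + pointwise_params
    macs = params * out_h * out_w
    return params, macs


# Hypothetical layer: 3x3 kernel, 64 -> 64 channels, 25 x 5 output feature map.
k, c_in, c_out, out_h, out_w = 3, 64, 64, 25, 5
std_params, std_macs = standard_conv_cost(k, c_in, c_out, out_h, out_w)
sep_params, sep_macs = separable_conv_cost(k, c_in, c_out, out_h, out_w)

print("standard :", std_params, "weights,", std_macs, "MACs")
print("separable:", sep_params, "weights,", sep_macs, "MACs")
print("reduction: %.2fx" % (std_params / sep_params))
```

For this configuration the reduction factor is 1 / (1/c_out + 1/k^2), so a larger kernel or more output channels makes the separable form relatively cheaper.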


We are going to implement the model with the depthwise separable CNN architecture in TensorFlow in the next section.

Building the model

The first step is to turn the raw audio waveform into MFCC features, and it can be done in TensorFlow like this.

from tensorflow.contrib.framework.python.ops import audio_ops as contrib_audio

# Run the spectrogram and MFCC ops to get a 2D 'fingerprint' of the audio.
# (Call arguments are omitted in this excerpt.)
spectrogram = contrib_audio.audio_spectrogram(...)
self.mfcc_ = contrib_audio.mfcc(...)

If we have the following parameters for input audio and feature extractor,

  • Input audio sampling rate: 16000Hz
  • Input audio clip length: 1000ms (L)
  • Spectrogram window size: 40ms (l)
  • Spectrogram window stride: 20ms (s)
  • MFCC coefficient count: 10 (F)

Then the shape of the tensor self.mfcc_ will be (None, T, F), where the number of frames is T = (L - l) / s + 1 = (1000 - 40) / 20 + 1 = 49. self.mfcc_ then becomes the fingerprint_input for the deep learning model.
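The frame-count arithmetic can be checked with a few lines of Python. This is a minimal sketch working in samples rather than milliseconds; the function name and signature are made up for illustration and are not part of TensorFlow.

```python
def mfcc_fingerprint_shape(sample_rate_hz, clip_ms, window_ms, stride_ms,
                           num_coefficients):
    """Return (T, F): the number of frames and MFCC coefficients per frame."""
    clip_samples = sample_rate_hz * clip_ms // 1000
    window_samples = sample_rate_hz * window_ms // 1000
    stride_samples = sample_rate_hz * stride_ms // 1000
    if clip_samples < window_samples:
        num_frames = 0  # the clip is too short for even one window
    else:
        # One frame for the first window, plus one for each further full stride.
        num_frames = 1 + (clip_samples - window_samples) // stride_samples
    return num_frames, num_coefficients


# Parameters from the list above: 16 kHz, 1000 ms clip, 40 ms window, 20 ms stride.
num_frames, num_coefficients = mfcc_fingerprint_shape(16000, 1000, 40, 20, 10)
print(num_frames, num_coefficients)                 # T and F
print("fingerprint size:", num_frames * num_coefficients)
```

With these settings the window is 640 samples and the stride is 320 samples, giving 49 frames of 10 coefficients, or 490 values per one-second clip.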

We adopt a depthwise separable CNN based on the implementation of MobileNet; the full implementation is available on my GitHub.

Note that the first layer of the model is always a regular convolution, while the remaining layers are all depthwise separable convolutions. The implementation of the depthwise separable convolution layer looks like this.

def _depthwise_separable_conv(inputs, ...):
    """Helper function to build the depth-wise separable convolution layer."""

    # Skip the pointwise stage here by setting num_outputs=None
    # (the remaining arguments are omitted in this excerpt).
    depthwise_conv = slim.separable_convolution2d(inputs, ...)

    bn = slim.batch_norm(depthwise_conv, scope=sc+'/dw_batch_norm')
    # The 1x1 pointwise convolution mixes the channels.
    pointwise_conv = slim.convolution2d(bn,
                                        kernel_size=[1, 1],
                                        ...)
    bn = slim.batch_norm(pointwise_conv, scope=sc+'/pw_batch_norm')
    return bn

An average pooling layer followed by a fully-connected layer is used at the end to provide global interaction and reduce the total number of parameters in the final layer.
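A rough parameter count shows why pooling before the fully-connected layer helps. The sketch below assumes a hypothetical final feature map of 25 x 5 x 64 and 12 output labels (for example, 10 keywords plus "unknown" and "silence"); these numbers are illustrative, not taken from the trained models.

```python
def dense_params(num_inputs, num_outputs):
    """Weights plus biases of a fully-connected layer."""
    return num_inputs * num_outputs + num_outputs


# Hypothetical shapes, chosen only for illustration.
height, width, channels, num_labels = 25, 5, 64, 12

# Option 1: flatten the whole feature map and feed it to the classifier.
flattened_params = dense_params(height * width * channels, num_labels)

# Option 2: average each channel over all positions first, leaving one value
# per channel, then feed those `channels` values to the classifier.
pooled_params = dense_params(channels, num_labels)

print("flatten + dense:", flattened_params)
print("avg pool + dense:", pooled_params)
```

Averaging over all time and frequency positions also means every output score depends on the whole clip, which is the "global interaction" mentioned above.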

How well does the model perform?

Pre-trained models are ready for you to play with, including the standard CNN, DS_CNN (depthwise separable CNN), and various other model architectures. For each architecture, hyperparameters such as kernel size and stride were searched, and models at different scales were trained separately, so you can trade slightly lower accuracy for a smaller and faster model that runs on resource-constrained devices.

The models built with depthwise separable convolutions achieve better accuracy than DNN models with a similar number of operations, with a more than 10x reduction in memory requirement.

Note that the memory requirement shown in the table is measured after quantizing the floating-point weights to 8-bit fixed point, which I will explain in a future post.
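As a rough idea of what 8-bit fixed-point quantization involves, here is a generic sketch that uses a power-of-two scale per weight list. It is an assumption for illustration only: the function names are hypothetical, and the scheme used for the pre-trained models may differ in its details.

```python
import math


def quantize_q7(weights):
    """Quantize floats to signed 8-bit integers with a power-of-two scale.

    Returns the integer values and the number of fractional bits used.
    """
    max_abs = max(abs(w) for w in weights)
    # Integer bits needed to cover the largest magnitude (sign bit is separate).
    int_bits = max(0, math.ceil(math.log2(max_abs))) if max_abs > 0 else 0
    frac_bits = 7 - int_bits
    scale = 2 ** frac_bits
    # Round to the nearest step and saturate to the int8 range.
    quantized = [max(-128, min(127, round(w * scale))) for w in weights]
    return quantized, frac_bits


def dequantize(quantized, frac_bits):
    """Map the 8-bit integers back to approximate float values."""
    return [q / 2 ** frac_bits for q in quantized]


weights = [0.5, -0.25, 0.1, 0.9]  # made-up example weights
q, frac_bits = quantize_q7(weights)
restored = dequantize(q, frac_bits)
print(q, frac_bits)
print(restored)
```

Each weight then occupies 1 byte instead of the 4 bytes of a 32-bit float, so weight storage shrinks by roughly 4x at the cost of a small rounding error.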

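To run an audio file through a trained DS_CNN model and get the top prediction, use the command below. The script name is missing from it; in the TensorFlow Speech Commands example, the script that accepts these flags is label_wav.py.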

python --wav yes.wav --graph Pretrained_models/DS_CNN/DS_CNN_S.pb --labels Pretrained_models/labels.txt --how_many_labels 1
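The --how_many_labels 1 flag asks for only the top prediction. Conceptually, that final step ranks the model's output scores and keeps the best few labels. The sketch below illustrates this with made-up labels and scores; the function name is hypothetical and this is not the script's actual code.

```python
def top_labels(scores, labels, how_many=1):
    """Return the `how_many` highest-scoring (label, score) pairs, best first."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return [(label, score) for score, label in ranked[:how_many]]


# Made-up label set and classifier output for a clip that says "yes".
labels = ["_silence_", "_unknown_", "yes", "no"]
scores = [0.01, 0.04, 0.90, 0.05]

best = top_labels(scores, labels, how_many=1)
print(best)
```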

Conclusion and Further reading

In this post, we explored implementing a simple yet powerful keyword detection model with the potential to run on resource-constrained devices like microcontrollers.

Some related resources you might find useful.

  1. TensorFlow tutorial - Simple Audio Recognition
  2. TensorFlow Speech Commands Example code in GitHub
  3. Blog - Keyword spotting for Microcontrollers
  4. Depthwise Separable Convolutional Neural Network in Keras SeparableConv2D
  5. Paper - Xception: Deep Learning with Depthwise Separable Convolutions
  6. Paper - MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

In a future post, I will explain how to apply weight quantization to reduce the model size and show you how to run the model on a microcontroller.

Check out my GitHub repo for more information, including how to train and test the model.
