How to Do Real-Time Trigger Word Detection with Keras



I just finished the Coursera deep learning online program this week. The last programming assignment is about trigger word detection, a.k.a. wake/hot word detection, like when you yell at Amazon Alexa or Google Home to wake them up.


Wouldn't it be cool to build one yourself and run it in real time?

In this post, I am going to show you exactly how to build a Keras model to do the same thing from scratch. No third-party voice API or network connection is required to make it work.

A lot of the background is covered in the Coursera course, but don't worry if you are new to this; I will give you enough of an overview to understand what happens next.

Prepare the training datasets

For the sake of simplicity, let's take the word "Activate" as our trigger word. 

The training dataset needs to be as similar to the real test environment as possible. For example, the model needs to be exposed to non-trigger words and background noise in the speech during training so it will not generate the trigger signal when we say other words or there is only background noise.

As you may expect, training a good speech model requires a lot of labeled training samples. Do we have to record each audio clip and manually label where the trigger word was spoken? Here is a simple trick to solve this problem.

We generate them!

First, we need 3 types of audio recordings:

1. Recordings of different background audio. They might be as simple as two clips of background noise, 10 seconds each: one from a coffee shop and one from a living room.

2. Recordings of the trigger word "activate". They might just be you speaking the word 10 times in different tones, 1 second each.

3. Recordings of negative words. They might be you speaking other words like "baby" or "coffee", 1 second for each recording.

Here are the steps to generate a training input audio clip:

  • Pick a random 10-second background audio clip
  • Randomly overlay 0-4 audio clips of "activate" onto this 10-second clip
  • Randomly overlay 0-2 audio clips of negative words onto this 10-second clip

We choose to overlay rather than simply concatenate because we want to mix the spoken words with the background noise so they sound more realistic.
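
A minimal sketch of the overlay step, assuming the clips are loaded with pydub (the insert_audio_clip helper, file names, and millisecond bookkeeping are illustrative, not the course's exact code):

import random
from pydub import AudioSegment

def insert_audio_clip(background, clip):
    """Overlay a short clip onto the 10-second background at a random position (pydub lengths are in ms)."""
    start_ms = random.randint(0, len(background) - len(clip))
    return background.overlay(clip, position=start_ms), start_ms + len(clip)

background = AudioSegment.from_wav("backgrounds/coffee_shop.wav")[:10000]  # first 10 seconds
activate = AudioSegment.from_wav("activates/activate_01.wav")
mixed, segment_end_ms = insert_audio_clip(background, activate)            # keep the end position for labeling
mixed.export("train_example.wav", format="wav")

A real implementation would also make sure the randomly placed clips do not overlap each other; that bookkeeping is omitted to keep the sketch short.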

For the output labels, we want them to represent whether or not someone has just finished saying "activate".

We first initialize all timesteps of the output labels to "0"s. Then, for each "activate" we overlaid, we update the target labels by setting the 50 timesteps that follow its end to "1"s.
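
Here is a minimal sketch of that labeling step. Ty is the number of output timesteps (1375 for the model below), and segment_end_ms is where the inserted "activate" ends, as returned by the overlay sketch above:

import numpy as np

Ty = 1375  # number of output timesteps for a 10-second clip

def insert_ones(y, segment_end_ms):
    """Set the 50 output timesteps right after the end of an "activate" segment to 1."""
    segment_end_y = int(segment_end_ms * Ty / 10000.0)  # convert an ms position into an output timestep index
    y[0, segment_end_y + 1 : segment_end_y + 51] = 1    # numpy slicing quietly clips if this runs past the end
    return y

y = np.zeros((1, Ty))                     # start with all "0"s
y = insert_ones(y, segment_end_ms=4250)   # an "activate" that ends at 4.25 seconds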

Why set 50 timesteps to "1"?

Because if we only set 1 timestep after each "activate" to "1", there would be far too many 0s in the target labels, creating a very imbalanced training set.

It is a bit of a hack to use 50 "1"s, but it makes the model a little easier to train. Here is an illustration to show you the idea.

target label diagram

Credit: Coursera - deeplearning.ai

The illustration shows a clip into which we have inserted "activate", "innocent", "activate", and "baby". Note that the positive labels "1" are associated only with the trigger word "activate".

The green/blueish plot is the spectrogram, which is the frequency representation of the audio wave over time. The x-axis is time and the y-axis is frequency. The brighter (more yellow) a point is, the more active (louder) that frequency is at that time.

Our input data will be the spectrogram of each generated audio clip, and the target will be the labels we created earlier.
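
Here is a minimal sketch of computing such a spectrogram with matplotlib. The window parameters are assumptions chosen so that a 10-second, 44.1 kHz mono clip yields the (5511 timesteps, 101 frequencies) input shape used by the model below:

import numpy as np
from matplotlib import mlab
from scipy.io import wavfile

def graph_spectrogram(wav_file):
    """Compute the spectrogram of a mono wav file; returns an array of shape (n_freq, Tx)."""
    rate, data = wavfile.read(wav_file)
    nfft = 200      # length of each window segment (gives 101 frequency bins)
    noverlap = 120  # overlap between windows (a hop of 80 samples, i.e. 5511 timesteps for 10 s at 44.1 kHz)
    pxx, freqs, bins = mlab.specgram(data, NFFT=nfft, Fs=rate, noverlap=noverlap)
    return pxx

x = graph_spectrogram("train_example.wav")  # shape (101, 5511)
x = x.swapaxes(0, 1)                        # the model expects (Tx, n_freq) = (5511, 101)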

Build the Model

Without further ado, let's take a look at the model structure.

trigger word model

The 1D convolutional layer takes the 5511 timesteps of the spectrogram (10 seconds) as input and produces a 1375-step output. It extracts low-level audio features, similar to how 2D convolutions extract image features, and it also speeds up the model by reducing the number of timesteps.
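
As a quick sanity check: with kernel_size=15, strides=4, and the default "valid" padding used in the code below, the output length is (5511 - 15) / 4 + 1 = 1374 + 1 = 1375.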

The two GRU layers read the sequence of inputs from left to right, and a final dense + sigmoid layer makes a prediction at each timestep. The sigmoid squashes each output to a value between 0 and 1, with 1 corresponding to the user having just said "activate".

Here is the code written in Keras' functional API.

from keras.models import Model, load_model
from keras.layers import Dense, Activation, Dropout, Input, TimeDistributed, Conv1D
from keras.layers import GRU, BatchNormalization
from keras.optimizers import Adam

def model(input_shape):
    """
    Function creating the model's graph in Keras.
    
    Argument:
    input_shape -- shape of the model's input data (using Keras conventions)

    Returns:
    model -- Keras model instance
    """
    
    X_input = Input(shape = input_shape)
    
    # Step 1: CONV layer
    X = Conv1D(196, kernel_size=15, strides=4)(X_input)   # CONV1D
    X = BatchNormalization()(X)                           # Batch normalization
    X = Activation('relu')(X)                             # ReLu activation
    X = Dropout(0.8)(X)                                   # dropout (use 0.8)

    # Step 2: First GRU Layer
    X = GRU(units = 128, return_sequences = True)(X)      # GRU (use 128 units and return the sequences)
    X = Dropout(0.8)(X)                                   # dropout (use 0.8)
    X = BatchNormalization()(X)                           # Batch normalization
    
    # Step 3: Second GRU Layer
    X = GRU(units = 128, return_sequences = True)(X)      # GRU (use 128 units and return the sequences)
    X = Dropout(0.8)(X)                                   # dropout (use 0.8)
    X = BatchNormalization()(X)                           # Batch normalization
    X = Dropout(0.8)(X)                                   # dropout (use 0.8)
    
    # Step 4: Time-distributed dense layer
    X = TimeDistributed(Dense(1, activation = "sigmoid"))(X) # time distributed  (sigmoid)
    model = Model(inputs = X_input, outputs = X)    
    return model  

Tx = 5511 # The number of time steps input to the model from the spectrogram
n_freq = 101 # Number of frequencies input to the model at each time step of the spectrogram

model = model(input_shape = (Tx, n_freq))
opt = Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, decay=0.01)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=["accuracy"])

Trigger word detection takes a long time to train. To save time, Coursera has already trained a model for about 3 hours on a GPU using the architecture shown above and a large training set of about 4000 examples. Let's load that model.

model = load_model('./models/tr_model.h5')

Real-time Demo

So far, our model can only take a static 10-second audio clip and predict where the trigger word appears in it.

Here is the fun part: let's feed it a live audio stream instead!

The model we have built expects 10-second audio clips as input. Training another model that takes shorter audio clips is possible, but it would require retraining on a GPU for several hours.

We also don't want to wait 10 seconds for the model to tell us that the trigger word was detected. One solution is a moving 10-second audio window with a step size of 0.5 seconds, which means we ask the model to predict every 0.5 seconds. That reduces the delay and makes the detector feel responsive.

We also add a silence detection mechanism that skips the prediction when the loudness is below a threshold, which saves some computing power.

Let's see how to build it.

The 10-second input audio is updated every 0.5 seconds: the oldest 0.5-second chunk is discarded and the freshest 0.5 seconds of audio is shifted in. The job of the model is then to tell whether a new trigger word was detected in that fresh 0.5-second chunk.
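
The code below refers to a few constants for this sliding window. Here is how they could be defined; the 44.1 kHz sample rate is an assumption and should match your microphone input:

chunk_duration = 0.5  # each new chunk of audio is 0.5 seconds
feed_duration = 10    # the model consumes a 10-second window
fs = 44100            # sample rate of the audio stream (assumed)

chunk_samples = int(fs * chunk_duration)  # 22050 samples per chunk
feed_samples = int(fs * feed_duration)    # 441000 samples per model input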

And here is the code to make it happen.

def has_new_triggerword(predictions, chunk_duration, feed_duration, threshold=0.5):
    """
    Function to detect a new trigger word in the latest chunk of input audio.
    It looks for a rising edge of the predictions within the last/latest chunk.
    
    Argument:
    predictions -- predicted labels from the model
    chunk_duration -- duration in seconds of a chunk
    feed_duration -- duration in seconds of the audio fed to the model
    threshold -- probability above which a prediction is considered positive

    Returns:
    True if new trigger word detected in the latest chunk
    """
    predictions = predictions > threshold
    chunk_predictions_samples = int(len(predictions) * chunk_duration / feed_duration)
    chunk_predictions = predictions[-chunk_predictions_samples:]
    level = chunk_predictions[0]
    for pred in chunk_predictions:
        if pred > level:
            return True
        else:
            level = pred
    return False
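
To illustrate the rising-edge logic, here is a toy example with made-up prediction values (the real predictions have 1375 values per 10-second window):

import numpy as np

# 40 fake predictions over a 10 s feed; with 0.5 s chunks, the last 2 values belong to the latest chunk
preds = np.array([0.1] * 38 + [0.2, 0.9])
print(has_new_triggerword(preds, chunk_duration=0.5, feed_duration=10))  # True: a 0 -> 1 edge in the last chunk

# If the prediction was already high before the latest chunk began, no *new* trigger word is reported
preds = np.array([0.1] * 37 + [0.9, 0.9, 0.9])
print(has_new_triggerword(preds, chunk_duration=0.5, feed_duration=10))  # False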

To get the audio stream, we use the pyaudio library, which has an option to read the audio stream asynchronously. That means the recording happens in another thread, and whenever a new fixed-length piece of audio data is available, it notifies our code so the model can process it in the main thread.

You may ask: why not just read a fixed length of audio and process it in the same function?

Because generating a prediction takes the model quite some time, often tens of milliseconds, doing it inline would risk creating gaps in the audio stream while the computation runs.
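
pyaudio exposes this asynchronous mode through a stream callback. Here is a minimal sketch of the get_audio_input_stream helper used further below, assuming 16-bit mono input at the fs and chunk_samples values defined earlier:

import pyaudio

def get_audio_input_stream(callback):
    """Open an asynchronous 16-bit mono input stream that invokes `callback` every chunk_samples frames."""
    return pyaudio.PyAudio().open(
        format=pyaudio.paInt16,
        channels=1,
        rate=fs,
        input=True,
        frames_per_buffer=chunk_samples,
        stream_callback=callback)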

Here is the code that sets up the callback; inside it, we push the audio buffer onto a queue to notify the main thread that the model should process it.

import numpy as np
import pyaudio
from queue import Queue
from threading import Thread
import sys
import time


# Queue to communicate between the audio callback and main thread
q = Queue()

run = True

silence_threshold = 100

# Run the demo for a timeout seconds
timeout = time.time() + 0.5*60  # 0.5 minutes from now

# Data buffer for the input waveform
data = np.zeros(feed_samples, dtype='int16')

def callback(in_data, frame_count, time_info, status):
    global run, timeout, data, silence_threshold    
    if time.time() > timeout:
        run = False        
    data0 = np.frombuffer(in_data, dtype='int16')
    if np.abs(data0).mean() < silence_threshold:
        sys.stdout.write('-')
        return (in_data, pyaudio.paContinue)
    else:
        sys.stdout.write('.')
    data = np.append(data,data0)    
    if len(data) > feed_samples:
        data = data[-feed_samples:]
        # Process data async by sending a queue.
        q.put(data)
    return (in_data, pyaudio.paContinue)

stream = get_audio_input_stream(callback)
stream.start_stream()


try:
    while run:
        data = q.get()
        spectrum = get_spectrogram(data)
        preds = detect_triggerword_spectrum(spectrum)
        new_trigger = has_new_triggerword(preds, chunk_duration, feed_duration)
        if new_trigger:
            sys.stdout.write('1')
except (KeyboardInterrupt, SystemExit):
    stream.stop_stream()
    stream.close()
    timeout = time.time()
    run = False
        
stream.stop_stream()
stream.close()
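
The loop above also uses two helpers that are not shown here, get_spectrogram and detect_triggerword_spectrum. A minimal sketch, assuming the same spectrogram parameters as before and the trained model loaded earlier:

from matplotlib import mlab

def get_spectrogram(data):
    """Compute the spectrogram of a raw int16 audio buffer; returns shape (n_freq, Tx)."""
    nfft = 200      # same window length assumed earlier
    noverlap = 120  # same overlap assumed earlier
    pxx, freqs, bins = mlab.specgram(data, NFFT=nfft, Fs=fs, noverlap=noverlap)
    return pxx

def detect_triggerword_spectrum(x):
    """Run the trained model on one spectrogram and return per-timestep probabilities."""
    x = x.swapaxes(0, 1)           # (n_freq, Tx) -> (Tx, n_freq)
    x = np.expand_dims(x, axis=0)  # add the batch dimension: (1, Tx, n_freq)
    predictions = model.predict(x)
    return predictions.reshape(-1)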

When you run it, it outputs one of 3 characters every 0.5 seconds:

  • "-" means silence,
  • "." means not silence and no trigger word,
  • "1" means a new trigger word was detected.

--.--......-1----.-..--...-1---..-------..1---------..-1---------.----.--------.----.---.--.-.------------.

Feel free to replace printing the "1" character with anything you want to happen when a trigger word is detected: launch an app, play a sound, etc.

Summary and Further Reading

This article demonstrates how to build a real-time trigger word detector from scratch with the Keras deep learning framework.

Here's what you should remember:

  • Data synthesis is an effective way to create a large training set for speech problems, specifically trigger word detection.
  • Using a spectrogram and optionally a 1D conv layer is a common pre-processing step prior to passing audio data to an RNN, GRU or LSTM.
  • An end-to-end deep learning approach can be used to build a very effective trigger word detection system.
  • A deep learning model's prediction takes time, so we process the audio data asynchronously from the input audio stream to avoid interrupting the recording.
  • A sliding/moving input window is an effective way to reduce delay.

Further reading

Hey Siri: An On-device DNN-powered Voice Trigger for Apple’s Personal Assistant

Trigger Word Detection lecture - Coursera

Now, grab the full source code from my GitHub repo and build an awesome trigger word application.

