I just finished the Coursera deep learning online program this week. The last programming assignment is about trigger word detection, a.k.a. wake/hot word detection, like when you yell at Amazon Alexa or Google Home to wake them up.
Wouldn't it be cool to build one yourself and run it in real time?
In this post, I am going to show you exactly how to build a Keras model to do the same thing from scratch. No third-party voice API or network connection is required to make it functional.
A lot of the background information is covered in the Coursera course, but don't worry if you are new to this; I will give you enough of an overview to understand what happens next.
For the sake of simplicity, let's take the word "Activate" as our trigger word.
The training dataset needs to be as similar to the real test environment as possible. For example, the model needs to be exposed to non-trigger words and background noise in the speech during training so it will not generate the trigger signal when we say other words or there is only background noise.
As you may expect, training a good speech model requires a lot of labeled training samples. Do we have to record each audio clip and manually label where the trigger word was spoken? Here is a simple trick to solve this problem.
We generate them!
First, we have 3 types of audio recordings,
1. Recordings of different background noise. They might be as simple as two clips of background noise, 10 seconds each, one from a coffee shop and one from a living room.
2. Recordings of the trigger word "activate". They might be just you speaking the word 10 times in different tones, 1 second each.
3. Recordings of the negative words. They might be you speaking other words like "baby", "coffee", 1 second for each recording.
Here are the steps to generate a training input audio clip: pick a 10-second background clip at random, then overlay a random number of "activate" clips and negative-word clips at random positions.
We choose overlay since we want to mix the spoken words with the background noise so the result sounds more realistic.
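The overlay step can be sketched with plain numpy: widen the samples to 32-bit so summing does not overflow, add the clip at a random position, then clip back to the int16 range. The function name, sample rate, and random-position logic below are illustrative assumptions, not the course's exact code.

```python
import numpy as np

def overlay_clip(background, clip):
    """Overlay a short spoken clip onto a background noise array.

    Both arguments are 1-D int16 numpy arrays; the clip is inserted at a
    random position that keeps it fully inside the background.
    """
    assert len(clip) <= len(background)
    start = np.random.randint(0, len(background) - len(clip) + 1)
    mixed = background.astype(np.int32).copy()           # widen to avoid overflow
    mixed[start:start + len(clip)] += clip.astype(np.int32)
    mixed = np.clip(mixed, -32768, 32767).astype(np.int16)  # back to int16 range
    return mixed, start

# 10 seconds of quiet background noise and a 1-second "activate" stand-in
rate = 44100
background = (np.random.randn(10 * rate) * 100).astype(np.int16)
clip = (np.random.randn(1 * rate) * 1000).astype(np.int16)
mixed, start = overlay_clip(background, clip)
```

In a real data-generation script you would also record `start + len(clip)` for each "activate" overlay, since that end position is what drives the label construction described next.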
For the output labels, we want it to represent whether or not someone has just finished saying "activate".
We first initialize all timesteps of the output labels to "0"s. Then for each "activate" we overlaid, we update the target labels by setting the subsequent 50 timesteps to "1"s.
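A minimal sketch of that label construction, assuming 1375 output timesteps over a 10-second clip (the `insert_ones` name follows the Coursera assignment's helper; treat the details as illustrative):

```python
import numpy as np

Ty = 1375  # number of output timesteps, matching the model below

def insert_ones(y, segment_end_ms, duration_ms=10000):
    """Set the 50 output timesteps following the end of an "activate"
    clip to 1. segment_end_ms is where the clip ends within the
    10-second (10,000 ms) audio."""
    segment_end_y = int(segment_end_ms * Ty / duration_ms)
    y[segment_end_y + 1 : segment_end_y + 51] = 1
    return y

y = np.zeros(Ty)             # start with all zeros
y = insert_ones(y, 4500)     # an "activate" that ends at 4.5 seconds
```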
Why do we set 50 timesteps to "1"?
Because if we only set 1 timestep after the "activate" to "1", there would be far too many 0s in the target labels, creating a very imbalanced training set.
Using 50 "1"s is a bit of a hack, but it makes the model a little easier to train. Here is an illustration to show you the idea.
Credit: Coursera - deeplearning.ai
For a clip in which we have inserted "activate", "innocent", "activate", "baby", note that the positive labels "1" are associated only with the positive words.
The green/blueish plot is the spectrogram, which is the frequency representation of the audio wave over time. The x-axis is time and the y-axis is frequency. The more yellow/bright the color, the more active (louder) that frequency is.
Our input data will be the spectrogram data for each generated audio. And the target will be the labels we created earlier.
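A minimal numpy sketch of how such a spectrogram can be computed: slice the waveform into overlapping frames and take the magnitude of each frame's FFT. The `nfft`/`step` values below are assumptions chosen so a 10-second, 44.1 kHz clip yields the 5511 x 101 input shape the model expects; real pipelines typically use a library routine such as `scipy.signal.spectrogram` instead.

```python
import numpy as np

def spectrogram(x, nfft=200, step=80):
    """Minimal STFT-magnitude spectrogram (no windowing or scaling)."""
    n_frames = (len(x) - nfft) // step + 1
    frames = np.stack([x[i * step : i * step + nfft] for i in range(n_frames)])
    # rfft of a 200-sample frame gives 101 frequency bins
    return np.abs(np.fft.rfft(frames, n=nfft, axis=1))

x = np.random.randn(10 * 44100)    # stand-in for a 10-second clip
spec = spectrogram(x)
print(spec.shape)                  # (5511, 101)
```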
Without further ado, let's take a look at the model structure.
The 1D convolutional layer takes 5511 timesteps of the spectrogram (10 seconds) as input and outputs 1375 timesteps. It extracts low-level audio features similar to how 2D convolutions extract image features, and it also helps speed up the model by reducing the number of timesteps.
The two GRU layers read the sequence of inputs from left to right, and a final dense + sigmoid layer makes a prediction for each timestep. The sigmoid keeps each label in the 0~1 range; a value of 1 corresponds to the user having just finished saying the trigger word.
Here is the code
```python
from keras.models import Model
from keras.layers import (Input, Conv1D, BatchNormalization, Activation,
                          Dropout, GRU, TimeDistributed, Dense)
from keras.optimizers import Adam

def model(input_shape):
    """
    Function creating the model's graph in Keras.

    Argument:
    input_shape -- shape of the model's input data (using Keras conventions)

    Returns:
    model -- Keras model instance
    """
    X_input = Input(shape=input_shape)

    # Step 1: CONV layer
    X = Conv1D(196, kernel_size=15, strides=4)(X_input)  # CONV1D
    X = BatchNormalization()(X)                          # Batch normalization
    X = Activation('relu')(X)                            # ReLu activation
    X = Dropout(0.8)(X)                                  # dropout (use 0.8)

    # Step 2: First GRU Layer
    X = GRU(units=128, return_sequences=True)(X)  # GRU (use 128 units and return the sequences)
    X = Dropout(0.8)(X)                           # dropout (use 0.8)
    X = BatchNormalization()(X)                   # Batch normalization

    # Step 3: Second GRU Layer
    X = GRU(units=128, return_sequences=True)(X)  # GRU (use 128 units and return the sequences)
    X = Dropout(0.8)(X)                           # dropout (use 0.8)
    X = BatchNormalization()(X)                   # Batch normalization
    X = Dropout(0.8)(X)                           # dropout (use 0.8)

    # Step 4: Time-distributed dense layer
    X = TimeDistributed(Dense(1, activation="sigmoid"))(X)  # time distributed (sigmoid)

    model = Model(inputs=X_input, outputs=X)
    return model

Tx = 5511     # The number of time steps input to the model from the spectrogram
n_freq = 101  # Number of frequencies input to the model at each time step of the spectrogram

model = model(input_shape=(Tx, n_freq))
opt = Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, decay=0.01)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=["accuracy"])
```
Trigger word detection takes a long time to train. To save time, Coursera has already trained a model for about 3 hours on a GPU using the architecture shown above and a large training set of about 4000 examples. Let's load that model.
model = load_model('./models/tr_model.h5')
So far our model can only take a static 10-second audio clip and predict the trigger word locations within it.
Here is the fun part: let's feed it a live audio stream instead!
The model we have built expects 10-second audio clips as input. Training another model that takes shorter audio clips is possible, but it would require retraining the model on a GPU for several hours.
We also don't want to wait 10 seconds for the model to tell us a trigger word was detected. So one solution is a moving 10-second audio stream window with a step size of 0.5 seconds.
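The moving window can be sketched like this: the buffer always keeps the most recent 10 seconds, and each new 0.5-second chunk pushes the oldest samples out. The sample rate and variable names here are assumptions that mirror the constants used later in the streaming code.

```python
import numpy as np

fs = 44100             # assumed sampling rate
feed_duration = 10     # seconds of audio the model sees at once
chunk_duration = 0.5   # seconds of new audio per step
feed_samples = int(fs * feed_duration)
chunk_samples = int(fs * chunk_duration)

def slide_window(buffer, new_chunk):
    """Append the newest 0.5-second chunk and drop the oldest samples so
    the buffer always holds exactly the last 10 seconds of audio."""
    buffer = np.append(buffer, new_chunk)
    return buffer[-feed_samples:]

buffer = np.zeros(feed_samples, dtype='int16')
buffer = slide_window(buffer, np.ones(chunk_samples, dtype='int16'))
```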
We also add a silence detection mechanism that skips the prediction when the loudness is below a threshold; this saves some computing power.
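The silence check itself is tiny. A sketch, assuming int16 samples and a hand-tuned threshold (the value 100 is just a starting point for a typical microphone):

```python
import numpy as np

silence_threshold = 100  # mean absolute amplitude; tune for your microphone

def is_silence(chunk, threshold=silence_threshold):
    """Treat a chunk of int16 samples as silence when its mean absolute
    amplitude falls below the threshold."""
    return np.abs(chunk).mean() < threshold

quiet = np.zeros(8000, dtype='int16')
loud = (np.random.randn(8000) * 1000).astype('int16')
```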
Let's see how to build it,
The input 10 seconds of audio are updated every 0.5 seconds.
And here is the code to make it happen.
```python
def has_new_triggerword(predictions, chunk_duration, feed_duration, threshold=0.5):
    """
    Function to detect a new trigger word in the latest chunk of input audio.
    It looks for a rising edge in the part of the predictions that belongs
    to the last/latest chunk.

    Argument:
    predictions -- predicted labels from the model
    chunk_duration -- time in seconds of a chunk
    feed_duration -- time in seconds of the input to the model
    threshold -- probability above which a prediction is considered positive

    Returns:
    True if a new trigger word is detected in the latest chunk
    """
    predictions = predictions > threshold
    chunk_predictions_samples = int(len(predictions) * chunk_duration / feed_duration)
    chunk_predictions = predictions[-chunk_predictions_samples:]
    level = chunk_predictions[0]
    for pred in chunk_predictions:
        if pred > level:
            return True
        else:
            level = pred
    return False
```
To get the audio stream, we use the pyaudio library with a callback function.
You may ask, why not just read a fixed length of audio and process it in one function?
Because generating a prediction with the model takes quite some time, sometimes tens of milliseconds. If we did everything synchronously, we would risk creating gaps in the audio stream while doing the computation.
Here is the code for the audio stream callback and the main processing loop.
```python
import sys
import time
from queue import Queue

import numpy as np
import pyaudio

chunk_duration = 0.5  # each callback delivers 0.5 seconds of new audio
feed_duration = 10    # the model takes 10 seconds of audio as input
fs = 44100            # sampling rate
feed_samples = int(fs * feed_duration)

# Queue to communicate between the audio callback and main thread
q = Queue()
run = True
silence_threshold = 100

# Run the demo for a timeout seconds
timeout = time.time() + 0.5 * 60  # 0.5 minutes from now

# Data buffer for the input waveform
data = np.zeros(feed_samples, dtype='int16')

def callback(in_data, frame_count, time_info, status):
    global run, timeout, data, silence_threshold
    if time.time() > timeout:
        run = False
    data0 = np.frombuffer(in_data, dtype='int16')
    if np.abs(data0).mean() < silence_threshold:
        sys.stdout.write('-')
        return (in_data, pyaudio.paContinue)
    else:
        sys.stdout.write('.')
    data = np.append(data, data0)
    if len(data) > feed_samples:
        data = data[-feed_samples:]
        # Process data asynchronously by sending it to the queue.
        q.put(data)
    return (in_data, pyaudio.paContinue)

# get_audio_input_stream, get_spectrogram and detect_triggerword_spectrum
# are helper functions defined elsewhere in the project.
stream = get_audio_input_stream(callback)
stream.start_stream()

try:
    while run:
        data = q.get()
        spectrum = get_spectrogram(data)
        preds = detect_triggerword_spectrum(spectrum)
        new_trigger = has_new_triggerword(preds, chunk_duration, feed_duration)
        if new_trigger:
            sys.stdout.write('1')
except (KeyboardInterrupt, SystemExit):
    stream.stop_stream()
    stream.close()
    timeout = time.time()
    run = False

stream.stop_stream()
stream.close()
```
When you run it, it outputs one of these 3 characters every 0.5 seconds:
"-" means silence,
"." means not silence and no trigger word,
"1" means a new trigger word is detected.
Feel free to replace printing the "1" character with anything you want to happen when a trigger word is detected. Launch an app, play a sound etc.
This article demonstrated how to build a real-time trigger word detector from scratch with the Keras deep learning framework.
Here's what you should remember:
1. Data synthesis is an effective way to create a large training set for speech problems: overlay trigger words and negative words onto background noise.
2. A spectrogram plus a 1D convolutional layer is a common pre-processing step before passing audio data to a GRU/RNN.
3. A moving window over the audio stream lets a model trained on fixed 10-second clips run in real time.
Now, grab the full source code from my GitHub repo and build an awesome trigger word application.