Somewhere deep inside the TensorFlow framework exists a rarely noticed module, tf.contrib.signal, which can help you build a GPU-accelerated audio/signal processing pipeline for your TensorFlow/Keras model. In this post, we will take a practical approach to examine some of the most popular signal processing operations and visualize the results.
You can find the source code on my GitHub for this post as well as a runnable Google Colab notebook.
We are going to build a complete computation graph in TensorFlow that takes a wav file name and outputs the MFCC features. There are some intermediate outputs/audio features that could be fun to visualize, so we will enable TensorFlow eager execution, which allows us to evaluate operations immediately without building the complete graph. If you are new to TensorFlow eager execution, you are going to find it much more intuitive than the graph API.
The following snippet will get you started with eager execution in TensorFlow.
import tensorflow as tf
tf.enable_eager_execution()
# Check eager execution is enabled
print(tf.executing_eagerly()) # => True
x = [[2.]]
m = tf.matmul(x, x)
print("hello, {}".format(m)) # => "hello, [[4.]]"
The tf.contrib.ffmpeg.decode_audio op depends on a locally installed FFmpeg library to decode an audio file.
To install FFmpeg on a Linux-based system, run:
apt update -qq
apt install -y -qq ffmpeg
ffmpeg -version
After that, we can download a small siren sound wav sample and use TensorFlow to decode it.
!wget -q https://github.com/Tony607/blog_statics/releases/download/v1.0/siren_mfcc_demo.wav
audio_file = './siren_mfcc_demo.wav'
sampling_rate = 44100
audio_binary = tf.read_file(audio_file)
# tf.contrib.ffmpeg not supported on Windows, refer to issue
# https://github.com/tensorflow/tensorflow/issues/8271
waveform = tf.contrib.ffmpeg.decode_audio(audio_binary,
file_format='wav', samples_per_second=sampling_rate, channel_count=1)
print(waveform.numpy().shape)
The waveform is a Tensor; with the help of eager execution, we can immediately evaluate its value and visualize it.
import matplotlib.pyplot as plt
# Plot a section of the waveform.
plt.plot(waveform.numpy().flatten()[2048:5120])
plt.show()
From the raw waveform, we can barely see any signature of the siren sound, and the raw samples might be too much data to feed to the neural network directly. The next several steps will extract the frequency-domain signatures of the audio signal.
Disclaimer: decoding an audio file with tf.contrib.ffmpeg is not supported on Windows. Refer to issue https://github.com/tensorflow/tensorflow/issues/8271.
An alternative on Windows is to decode the wav file with scipy.
from scipy.io import wavfile
sr, samples = wavfile.read(audio_file)
print(sr)
print(samples.shape)
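If you take the scipy route, note that wavfile.read returns integer PCM samples, while the rest of the pipeline below expects float32 values in the [-1, 1] range, as tf.contrib.ffmpeg.decode_audio produces. Here is a minimal sketch of the conversion, assuming the wav file holds 16-bit PCM data.
import numpy as np
# Assuming 16-bit PCM: scale int16 samples to float32 in [-1, 1] and
# reshape to (batch_size, samples) for the STFT step below.
signals = tf.reshape(tf.constant(samples.astype(np.float32) / 32768.0), [1, -1])
print(signals.numpy().shape)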
New to spectrograms? Check out the cool Chrome music lab experiment to visualize your voice as spectrograms in real time.
The most common approach to computing spectrograms is to take the magnitude of the STFT (Short-time Fourier Transform).
Any audio waveform can be represented by a combination of sinusoidal waves with different frequencies, phases, and magnitudes. The STFT determines the sinusoidal frequency and phase content of local sections of a signal as it changes over time.
tf.contrib.signal.stft computes the STFT of signals. This operation accepts a Tensor "signals" of shape (batch_size, samples).
# Reshape the signals to shape of (batch_size, samples).
signals = tf.reshape(waveform, [1, -1])
# Step 1 : signals->stfts
# `stfts` is a complex64 Tensor representing the Short-time Fourier Transform of
# each signal in `signals`. Its shape is [batch_size, ?, fft_unique_bins]
# where fft_unique_bins = fft_length // 2 + 1 = 513.
stfts = tf.contrib.signal.stft(signals,
frame_length=1024,
frame_step=512,
fft_length=1024)
The stfts Tensor has the shape (batch_size, frames, fft_unique_bins); each value is a complex number of the form a + bi, with a real and an imaginary part.
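As a quick sanity check, we can print the shape right away; with the default pad_end=False, the number of frames works out to 1 + (samples - frame_length) // frame_step.
# Expected shape: (1, frames, 513), where 513 = fft_length // 2 + 1.
print(stfts.numpy().shape)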
An energy spectrogram is the magnitude of the complex-valued STFT, i.e. sqrt(a^2 + b^2).
In TensorFlow it can be computed as simply as:
# An energy spectrogram is the magnitude of the complex-valued STFT.
# A float32 Tensor of shape [batch_size, ?, 513].
magnitude_spectrograms = tf.abs(stfts)
We can plot the energy spectrogram and spot a pattern at the top of the image where the frequency goes up and down, resembling the pitch variation of the siren sound.
import numpy as np
array = magnitude_spectrograms.numpy().astype(np.float)[0]
plt.imshow(np.swapaxes(array, 0, 1))
As you can see, there are 513 frequency bins in the computed energy spectrogram, and many of them are "blank". When working with spectral representations of audio, Mel Frequency Cepstral Coefficients (MFCCs) are widely used in automatic speech and speaker recognition; they give a lower-dimensional and more perceptually relevant representation of the audio.
We can turn the energy/magnitude spectrograms into Mel-spectrograms in TensorFlow and plot the result like this.
# Warp the linear-scale, magnitude spectrograms into the mel-scale.
num_spectrogram_bins = magnitude_spectrograms.shape[-1].value
lower_edge_hertz, upper_edge_hertz, num_mel_bins = 80.0, 8000.0, 64
linear_to_mel_weight_matrix = tf.contrib.signal.linear_to_mel_weight_matrix(
    num_mel_bins, num_spectrogram_bins, sampling_rate, lower_edge_hertz,
    upper_edge_hertz)
mel_spectrograms = tf.tensordot(
magnitude_spectrograms, linear_to_mel_weight_matrix, 1)
print(mel_spectrograms.numpy().shape)
array = mel_spectrograms.numpy().astype(np.float)[0]
plt.imshow(np.swapaxes(array,0,1))
If desired, we can specify the lower and upper bounds on the frequencies to be included in the Mel-spectrum, and num_mel_bins specifies how many bands the resulting Mel-spectrum has.
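For example, to compute a narrower mel filter bank, say 40 bands between 300 Hz and 3400 Hz (values chosen purely for illustration), we can recompute the weight matrix like this.
# Illustrative values only: 40 mel bands covering 300 Hz to 3400 Hz.
narrow_matrix = tf.contrib.signal.linear_to_mel_weight_matrix(
    40, num_spectrogram_bins, sampling_rate, 300.0, 3400.0)
narrow_mel_spectrograms = tf.tensordot(magnitude_spectrograms, narrow_matrix, 1)
print(narrow_mel_spectrograms.numpy().shape)  # => (1, frames, 40)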
To further compress the Mel-spectrogram magnitudes, you may apply a compressive nonlinearity such as logarithmic compression. This helps to balance the importance of detail in low- and high-energy regions of the spectrum, which more closely matches human auditory sensitivity.
The log_offset is a small number added to avoid applying log() to zero in rare cases.
log_offset = 1e-6
log_mel_spectrograms = tf.log(mel_spectrograms + log_offset)
print(log_mel_spectrograms.numpy().shape)
array = log_mel_spectrograms.numpy()[0]
plt.imshow(np.swapaxes(array,0,1))
In the last step, tf.contrib.signal.mfccs_from_log_mel_spectrograms computes MFCCs from log_mel_spectrograms.
num_mfccs = 30
# Keep the first `num_mfccs` MFCCs.
mfccs = tf.contrib.signal.mfccs_from_log_mel_spectrograms(
log_mel_spectrograms)[..., :num_mfccs]
print(mfccs.numpy().shape)
array = mfccs.numpy()[0]
plt.imshow(np.swapaxes(array,0,1)[::-1,:])
Put everything together into a TensorFlow pipeline.
def get_mfccs(audio_file=None, signals=None, sample_rate=44100, num_mfccs=13,
              frame_length=1024, frame_step=512, fft_length=1024,
              fmax=8000, fmin=80):
    """Compute the MFCCs for an audio file or input signals.
    Keyword Arguments:
        audio_file {str} -- audio wav file path (default: {None})
        signals {tensor} -- input signals as tensor or np.array in float32 type (default: {None})
        sample_rate {int} -- sampling rate (default: {44100})
        num_mfccs {int} -- number of mfccs to keep (default: {13})
        frame_length {int} -- frame length to compute STFT (default: {1024})
        frame_step {int} -- frame step to compute STFT (default: {512})
        fft_length {int} -- FFT length to compute STFT (default: {1024})
        fmax {int} -- Top edge of the highest frequency band (default: {8000})
        fmin {int} -- Lower bound on the frequencies to be included in the mel spectrum (default: {80})
    Returns:
        Tensor -- mfccs as tf.Tensor
    """
    if signals is None and audio_file is not None:
        audio_binary = tf.read_file(audio_file)
        # tf.contrib.ffmpeg is not supported on Windows, refer to issue
        # https://github.com/tensorflow/tensorflow/issues/8271
        waveform = tf.contrib.ffmpeg.decode_audio(audio_binary,
            file_format='wav', samples_per_second=sample_rate, channel_count=1)
        signals = tf.reshape(waveform, [1, -1])
    # Step 1: signals -> stfts
    # `stfts` is a complex64 Tensor representing the Short-time Fourier Transform of
    # each signal in `signals`. Its shape is [batch_size, ?, fft_unique_bins]
    # where fft_unique_bins = fft_length // 2 + 1 (513 for the default fft_length of 1024).
    stfts = tf.contrib.signal.stft(signals, frame_length=frame_length,
                                   frame_step=frame_step, fft_length=fft_length)
    # Step 2: stfts -> magnitude_spectrograms
    # An energy spectrogram is the magnitude of the complex-valued STFT.
    # A float32 Tensor of shape [batch_size, ?, fft_unique_bins].
    magnitude_spectrograms = tf.abs(stfts)
    # Step 3: magnitude_spectrograms -> mel_spectrograms
    # Warp the linear-scale magnitude spectrograms into the mel-scale.
    num_spectrogram_bins = magnitude_spectrograms.shape[-1].value
    num_mel_bins = 64
    linear_to_mel_weight_matrix = tf.contrib.signal.linear_to_mel_weight_matrix(
        num_mel_bins, num_spectrogram_bins, sample_rate, fmin, fmax)
    mel_spectrograms = tf.tensordot(
        magnitude_spectrograms, linear_to_mel_weight_matrix, 1)
    # Step 4: mel_spectrograms -> log_mel_spectrograms
    log_offset = 1e-6
    log_mel_spectrograms = tf.log(mel_spectrograms + log_offset)
    # Step 5: log_mel_spectrograms -> mfccs
    # Keep the first `num_mfccs` MFCCs.
    mfccs = tf.contrib.signal.mfccs_from_log_mel_spectrograms(
        log_mel_spectrograms)[..., :num_mfccs]
    return mfccs
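Here is a quick usage sketch on the demo file; with the defaults above, 13 MFCCs are kept per frame.
# Run the whole pipeline end to end on the downloaded wav file.
mfccs = get_mfccs(audio_file='./siren_mfcc_demo.wav')
print(mfccs.numpy().shape)  # => (1, frames, 13)
plt.imshow(np.swapaxes(mfccs.numpy()[0], 0, 1)[::-1, :])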
In this post, we introduced how to do GPU-enabled signal processing in TensorFlow. We walked through each step, from decoding a WAV file to computing the MFCC features of the waveform. The final pipeline can be applied to your existing TensorFlow/Keras model to make an end-to-end audio processing computation graph.
Mel Frequency Cepstral Coefficient (MFCC) tutorial
Chrome music lab Spectrogram experiment
Source code for this post on GitHub.