How to run GPU accelerated Signal Processing in TensorFlow


Somewhere deep inside the TensorFlow framework exists a rarely noticed module: tf.contrib.signal, which can help build a GPU accelerated audio/signal processing pipeline for your TensorFlow/Keras model. In this post, we will take a practical approach to examine some of the most popular signal processing operations and visualize the results.

You can find the source code for this post on my GitHub, as well as a runnable Google Colab notebook.

Get started

We are going to build a complete computation graph in TensorFlow that takes a wav file name and outputs the MFCC feature. There are some intermediate outputs/audio features that could be fun to visualize, so we will enable TensorFlow eager execution, which allows us to evaluate operations immediately without building the complete graph. If you are new to TensorFlow eager execution, you are going to find it much more intuitive than the graph API.

The following snippet will get you started with eager execution in TensorFlow.

import tensorflow as tf
tf.enable_eager_execution()
# Check eager execution is enabled
print(tf.executing_eagerly())        # => True

x = [[2.]]
m = tf.matmul(x, x)
print("hello, {}".format(m))  # => "hello, [[4.]]"

Decode WAV file

The tf.contrib.ffmpeg.decode_audio depends on the locally installed FFmpeg library to decode an audio file.

To install FFmpeg on a Linux-based system, run these commands.

apt update -qq
apt install -y -qq ffmpeg
ffmpeg -version

After that, we can download a small sample of the siren sound wav file and use TensorFlow to decode it.

!wget -q https://github.com/Tony607/blog_statics/releases/download/v1.0/siren_mfcc_demo.wav
audio_file = './siren_mfcc_demo.wav'
sampling_rate = 44100
audio_binary = tf.read_file(audio_file)
# tf.contrib.ffmpeg not supported on Windows, refer to issue
# https://github.com/tensorflow/tensorflow/issues/8271
waveform = tf.contrib.ffmpeg.decode_audio(audio_binary, 
	file_format='wav', samples_per_second=sampling_rate, channel_count=1)
# waveform has shape (samples, channels), e.g. (N, 1) for mono audio.
print(waveform.numpy().shape)

The waveform is a Tensor; with the help of eager execution, we can immediately evaluate its value and visualize it.

import matplotlib.pyplot as plt
# Plot a section of the waveform.
plt.plot(waveform.numpy().flatten()[2048:5120])
plt.show()

waveform

From the raw waveform, we can barely see any signature of the siren sound, plus it might be too much data to feed to a neural network directly. The next several steps will extract the frequency domain signatures of the audio signal.

Disclaimer: decoding an audio file with tf.contrib.ffmpeg is not supported on Windows. Refer to issue https://github.com/tensorflow/tensorflow/issues/8271

An alternative on Windows is to decode the wav file with scipy.

from scipy.io import wavfile
sr, samples = wavfile.read(audio_file)
print(sr)
print(samples.shape)
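
To feed the scipy-decoded samples into the same TensorFlow pipeline below, you would convert them to float32 first. Here is a minimal sketch, assuming a mono, 16-bit PCM wav file.

import numpy as np
# Sketch (assumption: mono, 16-bit PCM wav). Scale the int16 samples to
# [-1.0, 1.0] floats and reshape to (batch_size, samples) for the steps below.
float_samples = samples.astype(np.float32) / np.iinfo(np.int16).max
signals = tf.reshape(tf.constant(float_samples), [1, -1])
print(signals.shape)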

Computing spectrograms

New to spectrograms? Check out the cool Chrome music lab experiment to visualize your voice as spectrograms in real time.

The most common approach to compute spectrograms is to take the magnitude of the STFT (Short-time Fourier Transform).

Any audio waveform can be represented by a combination of sinusoidal waves with different frequencies, phases, and magnitudes. The STFT determines the sinusoidal frequency and phase content of local sections of a signal as it changes over time.

tf.contrib.signal.stft computes the STFT of signals. This operation accepts a Tensor "signals" of shape (batch_size, samples).

# Reshape the signals to shape of (batch_size, samples).
signals = tf.reshape(waveform, [1, -1])

# Step 1 : signals->stfts
# `stfts` is a complex64 Tensor representing the Short-time Fourier Transform of
# each signal in `signals`. Its shape is [batch_size, ?, fft_unique_bins]
# where fft_unique_bins = fft_length // 2 + 1 = 513.
stfts = tf.contrib.signal.stft(signals, 
	frame_length=1024,  # ~23 ms per frame at 44.1 kHz
	frame_step=512,     # 50% overlap between adjacent frames
	fft_length=1024)
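
As a quick sanity check (a sketch based on the framing arithmetic, assuming the default pad_end=False), the number of frames should be 1 + (samples - frame_length) // frame_step.

# Sanity check (sketch): with pad_end=False (the default), the frame count
# follows directly from the framing arithmetic.
num_samples = signals.shape[-1].value
expected_frames = 1 + (num_samples - 1024) // 512
print(stfts.shape)       # => (1, expected_frames, 513)
print(expected_frames)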

The stfts Tensor has the shape (batch_size, frames, fft_unique_bins); each value is a complex number of the form a + bi, with real part a and imaginary part b.

An energy spectrogram is the magnitude of the complex-valued STFT, i.e. sqrt(a^2 + b^2).

In TensorFlow, it can be computed as simply as:

# An energy spectrogram is the magnitude of the complex-valued STFT.
# A float32 Tensor of shape [batch_size, ?, 513].
magnitude_spectrograms = tf.abs(stfts)
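
If you want to convince yourself that tf.abs really is the magnitude sqrt(a^2 + b^2), a quick check (not part of the pipeline) is to recompute it from the real and imaginary parts.

# Quick check (sketch): recompute the magnitude from the real and imaginary
# parts and compare with tf.abs; the maximum difference should be ~0.
manual_magnitude = tf.sqrt(tf.square(tf.real(stfts)) + tf.square(tf.imag(stfts)))
print(tf.reduce_max(tf.abs(magnitude_spectrograms - manual_magnitude)).numpy())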

We can plot the energy spectrogram and spot a pattern near the top of the image where the frequency goes up and down, resembling the pitch variation of the siren sound.

import numpy as np
array = magnitude_spectrograms.numpy().astype(np.float32)[0]
plt.imshow(np.swapaxes(array, 0, 1))

magnitude_spectrograms

Computing Mel-Frequency Cepstral Coefficients (MFCCs)

As you can see, there are 513 frequency bins in the computed energy spectrogram, and many of them are "blank". When working with spectral representations of audio, Mel Frequency Cepstral Coefficients (MFCCs) are widely used in automatic speech and speaker recognition; they give a lower-dimensional and more perceptually relevant representation of the audio.

We can turn the energy/magnitude spectrograms into Mel-spectrograms in TensorFlow and plot the result like this.

# Warp the linear-scale, magnitude spectrograms into the mel-scale.
num_spectrogram_bins = magnitude_spectrograms.shape[-1].value

lower_edge_hertz, upper_edge_hertz, num_mel_bins = 80.0, 8000, 64

linear_to_mel_weight_matrix = tf.contrib.signal.linear_to_mel_weight_matrix(
    num_mel_bins, num_spectrogram_bins, sampling_rate, lower_edge_hertz,
    upper_edge_hertz)

mel_spectrograms = tf.tensordot(
    magnitude_spectrograms, linear_to_mel_weight_matrix, 1)

print(mel_spectrograms.numpy().shape)

array = mel_spectrograms.numpy().astype(np.float32)[0]
plt.imshow(np.swapaxes(array, 0, 1))

If desired, we can specify the lower and upper bounds on the frequencies to be included in the Mel-spectrum with lower_edge_hertz and upper_edge_hertz.

num_mel_bins specifies how many bands the resulting Mel-spectrum has.

mel_spectrograms

To further compress the Mel-spectrogram magnitudes, you may apply a compressive nonlinearity such as logarithmic compression. This helps to balance the importance of detail in the low and high energy regions of the spectrum, which more closely matches human auditory sensitivity.

log_offset is a small number added to avoid applying log() to zero in the rare case where a Mel-spectrogram value is exactly zero.

log_offset = 1e-6
log_mel_spectrograms = tf.log(mel_spectrograms + log_offset)

print(log_mel_spectrograms.numpy().shape)
array = log_mel_spectrograms.numpy()[0]
plt.imshow(np.swapaxes(array,0,1))

log_mel_spectrograms

In the last step, tf.contrib.signal.mfccs_from_log_mel_spectrograms computes the MFCCs from log_mel_spectrograms.

num_mfccs = 30
# Keep the first `num_mfccs` MFCCs.
mfccs = tf.contrib.signal.mfccs_from_log_mel_spectrograms(
    log_mel_spectrograms)[..., :num_mfccs]
print(mfccs.numpy().shape)
array = mfccs.numpy()[0]
plt.imshow(np.swapaxes(array,0,1)[::-1,:])

mfccs

Putting everything together

Here is everything put together into a single TensorFlow pipeline that goes from a wav file (or a signals tensor) to MFCCs.

def get_mfccs(audio_file=None, signals=None, sample_rate=44100, num_mfccs=13, frame_length=1024, frame_step=512, fft_length=1024, fmax=8000, fmin=80):
    """Compute the MFCCs for an audio file or a signals tensor.
    
    Keyword Arguments:
        audio_file {str} -- audio wav file path (default: {None})
        signals {tensor} -- input signals as tensor or np.array in float32 type (default: {None})
        sample_rate {int} -- sampling rate (default: {44100})
        num_mfccs {int} -- number of mfccs to keep (default: {13})
        frame_length {int} -- frame length to compute STFT (default: {1024})
        frame_step {int} -- frame step to compute STFT (default: {512})
        fft_length {int} -- FFT length to compute STFT (default: {1024})
        fmax {int} -- Top edge of the highest frequency band (default: {8000})
        fmin {int} -- Lower bound on the frequencies to be included in the mel spectrum (default: {80})
    
    Returns:
        Tensor -- mfccs as tf.Tensor
    """

    
    if signals is None and audio_file is not None:
        audio_binary = tf.read_file(audio_file)
        # tf.contrib.ffmpeg not supported on Windows, refer to issue
        # https://github.com/tensorflow/tensorflow/issues/8271
        waveform = tf.contrib.ffmpeg.decode_audio(audio_binary,
            file_format='wav', samples_per_second=sample_rate, channel_count=1)
        signals = tf.reshape(waveform, [1, -1])
    
    # Step 1 : signals->stfts
    # `stfts` is a complex64 Tensor representing the Short-time Fourier Transform of
    # each signal in `signals`. Its shape is [batch_size, ?, fft_unique_bins]
    # where fft_unique_bins = fft_length // 2 + 1 = 513.
    stfts = tf.contrib.signal.stft(signals, frame_length=frame_length, frame_step=frame_step,
                                   fft_length=fft_length)
    # Step 2 : stfts->magnitude_spectrograms
    # An energy spectrogram is the magnitude of the complex-valued STFT.
    # A float32 Tensor of shape [batch_size, ?, 513].
    magnitude_spectrograms = tf.abs(stfts)

    # Step 3 : magnitude_spectrograms->mel_spectrograms
    # Warp the linear-scale, magnitude spectrograms into the mel-scale.
    num_spectrogram_bins = magnitude_spectrograms.shape[-1].value

    # Number of Mel bands to warp the spectrogram into.
    num_mel_bins = 64

    linear_to_mel_weight_matrix = tf.contrib.signal.linear_to_mel_weight_matrix(
        num_mel_bins, num_spectrogram_bins, sample_rate, fmin,
        fmax)

    mel_spectrograms = tf.tensordot(
        magnitude_spectrograms, linear_to_mel_weight_matrix, 1)

    # Step 4 : mel_spectrograms->log_mel_spectrograms
    log_offset = 1e-6
    log_mel_spectrograms = tf.log(mel_spectrograms + log_offset)

    # Step 5 : log_mel_spectrograms->mfccs
    # Keep the first `num_mfccs` MFCCs.
    mfccs = tf.contrib.signal.mfccs_from_log_mel_spectrograms(
        log_mel_spectrograms)[..., :num_mfccs]
    
    return mfccs
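
For example, calling the function on the siren sample downloaded earlier (with eager execution still enabled) might look like this.

# Example usage (sketch): compute MFCCs for the siren sample from earlier.
mfccs = get_mfccs(audio_file='./siren_mfcc_demo.wav', sample_rate=44100, num_mfccs=13)
print(mfccs.numpy().shape)  # => (1, number_of_frames, 13)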

Conclusion and further reading

In this post, we introduced how to do GPU-enabled signal processing in TensorFlow. We walked through each step from decoding a WAV file to computing the MFCC features of the waveform. The final pipeline can be plugged into your existing TensorFlow/Keras model to form an end-to-end audio processing computation graph.
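
As one possible way to wire this into a training pipeline (a minimal sketch with a hypothetical list of wav paths, all assumed mono and sampled at 44.1 kHz), you could map get_mfccs over a tf.data dataset so feature extraction runs inside the same graph as your model.

import tensorflow.contrib.eager as tfe

# Sketch (assumptions: hypothetical wav_files list, mono 44.1 kHz clips).
wav_files = ['./siren_mfcc_demo.wav']
dataset = tf.data.Dataset.from_tensor_slices(wav_files)
dataset = dataset.map(lambda path: get_mfccs(audio_file=path, sample_rate=44100))
for mfcc_features in tfe.Iterator(dataset):
    print(mfcc_features.shape)  # => (1, number_of_frames, 13)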

Here are some related resources you might find helpful.

Mel Frequency Cepstral Coefficient (MFCC) tutorial

Chrome music lab Spectrogram experiment

Source code for this post on GitHub.
