Abstract

In this paper, we present nnAudio, a new neural network-based audio processing framework with graphics processing unit (GPU) support that leverages 1D convolutional neural networks to perform time-domain to frequency-domain conversion. Its speed allows spectrograms to be extracted on the fly, without the need to store any spectrograms on disk. Moreover, this approach allows back-propagation through the waveform-to-spectrogram transformation layer; the transformation can therefore be made trainable, further optimizing it for the specific task the neural network is trained on. All spectrogram implementations scale linearly, O(n), with respect to the input length. By leveraging the compute unified device architecture (CUDA) through PyTorch's 1D convolutional neural networks, however, nnAudio's short-time Fourier transform (STFT), Mel spectrogram, and constant-Q transform (CQT) implementations are an order of magnitude faster than implementations that use only the central processing unit (CPU). We tested our framework on three different machines with NVIDIA GPUs: it reduces the spectrogram extraction time from the order of seconds (using the popular Python library librosa) to the order of milliseconds, provided the audio recordings are all of the same length. For variable-length input audio, extracting 34 spectrogram types with different parameters from the MusicNet dataset takes an average of 11.5 hours with librosa, versus an average of 2.8 hours with nnAudio, which is still roughly four times faster. Our proposed framework also outperforms existing GPU processing libraries such as Kapre and torchaudio in terms of processing speed.
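
As a rough illustration of the on-the-fly extraction described above, the sketch below builds an nnAudio STFT layer on the GPU and converts a batch of waveforms directly to spectrograms. It follows the usage documented for nnAudio's Spectrogram module, but the parameter values are illustrative and exact class or argument names may differ between library versions.

    import torch
    from nnAudio import Spectrogram

    # Build the STFT as a neural network layer and move it to the GPU, so that
    # waveform-to-spectrogram conversion happens on the fly, with nothing
    # written to disk.
    spec_layer = Spectrogram.STFT(n_fft=2048, hop_length=512, sr=22050).to('cuda')

    # Batch of four three-second clips (random stand-ins for real audio).
    waveform = torch.randn(4, 22050 * 3, device='cuda')
    spec = spec_layer(waveform)  # spectrograms computed on the GPU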

Highlights

  • Spectrograms, as time-frequency representations of audio signals, have been used as input for neural network models since the 1980s [1,2,3]

  • Our short-time Fourier transform (STFT) implementation is compared to librosa.stft, our Mel spectrogram to librosa.feature.melspectrogram, and our constant-Q transform (CQT) to librosa.cqt (a timing sketch in this spirit follows the list)

  • Figure 11(a) shows the time taken to convert an array of 1,770 waveforms into an array of 1,770 spectrograms using the Mel frequency scale, STFT, and CQT on three different machines
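
The librosa comparison highlighted above can be reproduced in spirit with a minimal timing sketch like the one below. This is not the paper's benchmark code: the waveforms are random stand-ins, the parameters are illustrative, and output_format='Magnitude' is assumed from the nnAudio documentation.

    import time
    import numpy as np
    import librosa
    import torch
    from nnAudio import Spectrogram

    sr = 22050
    # Random stand-ins for a dataset of fixed-length recordings.
    waveforms = np.random.randn(100, sr * 4).astype(np.float32)

    # CPU baseline: librosa processes one waveform at a time.
    start = time.time()
    cpu_specs = [np.abs(librosa.stft(w, n_fft=2048, hop_length=512)) for w in waveforms]
    print(f"librosa (CPU): {time.time() - start:.2f} s")

    # GPU: nnAudio pushes the whole batch through one convolutional layer.
    stft = Spectrogram.STFT(n_fft=2048, hop_length=512, sr=sr,
                            output_format='Magnitude').to('cuda')
    batch = torch.from_numpy(waveforms).to('cuda')
    torch.cuda.synchronize()  # make the timing fair on an asynchronous device
    start = time.time()
    gpu_specs = stft(batch)
    torch.cuda.synchronize()
    print(f"nnAudio (GPU): {time.time() - start:.2f} s")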

Summary

INTRODUCTION

Spectrograms, as time-frequency representations of audio signals, have been used as input for neural network models since the 1980s [1,2,3]. Torch-stft is a native PyTorch function with no additional dependencies, but it offers only the STFT. To bridge this gap in the field, we introduce a fast, differentiable, and trainable neural network-based audio processing framework called nnAudio [29]. To ensure seamless integration with one of the most popular machine learning libraries, we built our spectrogram extraction methods in PyTorch. This way, our library can be used as a PyTorch neural network layer, and all the functionality available in PyTorch, such as data augmentation, can be used together with nnAudio.

SUMMARY OF KEY ADVANTAGES

The main contribution of this paper is the development of a GPU-based audio processing framework that is directly integrated into, and leverages the power of, neural networks. This provides the following benefits: 1) end-to-end neural network training with an on-the-fly time-frequency conversion layer, i.e., one can directly use raw waveforms as the input to the neural network (a minimal sketch of this usage follows). We end with potential applications of our proposed library.
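
To make the end-to-end idea concrete, here is a minimal sketch of a classifier that consumes raw waveforms, with an nnAudio Mel spectrogram front end as its first layer. The trainable_mel and trainable_STFT flags, which expose the transformation kernels to back-propagation, are taken from the nnAudio documentation; the rest of the network is an arbitrary toy example, not the paper's architecture.

    import torch
    import torch.nn as nn
    from nnAudio import Spectrogram

    class WaveformClassifier(nn.Module):
        """Consumes raw waveforms; the first layer converts them to Mel spectrograms."""
        def __init__(self, sr=22050, n_classes=10):
            super().__init__()
            # trainable_mel / trainable_STFT make the transformation kernels
            # learnable, so the front end is optimized jointly with the rest
            # of the network (flags assumed from the nnAudio documentation).
            self.frontend = Spectrogram.MelSpectrogram(
                sr=sr, n_fft=2048, n_mels=128, hop_length=512,
                trainable_mel=True, trainable_STFT=True)
            # Arbitrary toy back end for illustration only.
            self.net = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, n_classes))

        def forward(self, waveform):            # waveform: (batch, samples)
            spec = self.frontend(waveform)      # (batch, n_mels, time_steps)
            return self.net(spec.unsqueeze(1))  # add a channel dimension

    model = WaveformClassifier().to('cuda')
    logits = model(torch.randn(8, 22050, device='cuda'))  # raw audio straight in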

SIGNAL PROCESSING: A QUICK OVERVIEW
DFT FOR ARBITRARY FREQUENCY RANGES
EXPERIMENTAL RESULTS
2) Results
EXAMPLE APPLICATIONS
TRAINABLE TRANSFORMATION KERNELS
CONCLUSION