Abstract
We propose a method for the blind separation of sounds of musical instruments in audio signals. We describe the individual tones via a parametric model, training a dictionary to capture the relative amplitudes of the harmonics. The model parameters are predicted via a U-Net, which is a type of deep neural network. The network is trained without ground truth information, based on the difference between the model prediction and the individual time frames of the short-time Fourier transform. Since some of the model parameters do not yield a useful backpropagation gradient, we model them stochastically and employ the policy gradient instead. To provide phase information and account for inaccuracies in the dictionary-based representation, we also let the network output a direct prediction, which we then use to resynthesize the audio signals for the individual instruments. Due to the flexibility of the neural network, inharmonicity can be incorporated seamlessly and no preprocessing of the input spectra is required. Our algorithm yields high-quality separation results with particularly low interference on a variety of different audio samples, both acoustic and synthetic, provided that the sample contains enough data for the training and that the spectral characteristics of the musical instruments are sufficiently stable to be approximated by the dictionary.
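As a purely illustrative reading of this tone model (our own sketch in Python, not the authors' code: the Gaussian partial shape, the parameter names, and the stiff-string inharmonicity formula are assumptions), the snippet below evaluates the magnitude spectrum of one tone on an STFT frequency grid from a dictionary column of relative harmonic amplitudes:

import numpy as np

def tone_spectrum(f0, dict_col, freqs, sigma=5.0, B=0.0):
    # Magnitude spectrum of a single tone: one narrow peak per harmonic,
    # with peak heights given by the dictionary column of relative amplitudes.
    spectrum = np.zeros_like(freqs)
    for k, a_k in enumerate(dict_col, start=1):
        # Stiff-string inharmonicity stretches the partials upwards:
        # f_k = k * f0 * sqrt(1 + B * k^2).
        f_k = k * f0 * np.sqrt(1.0 + B * k**2)
        spectrum += a_k * np.exp(-0.5 * ((freqs - f_k) / sigma) ** 2)
    return spectrum

# Example: a piano-like tone at 220 Hz with slight inharmonicity.
freqs = np.linspace(0.0, 4000.0, 2048)    # STFT frequency grid in Hz
dict_col = 1.0 / np.arange(1, 11)         # relative harmonic amplitudes
spec = tone_spectrum(220.0, dict_col, freqs, B=4e-4)

This mirrors the abstract's point that inharmonicity can be incorporated seamlessly, here via the single extra parameter B, without any preprocessing of the input spectra.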
Highlights
We address the problem of unmixing the contributions of multiple musical instruments from a single-channel audio recording
Compared with the use of policy gradients in deep reinforcement learning, for instance in combination with Monte Carlo tree search (MCTS), we stay relatively close to the original approach, but we extend the formulation by adding deterministic values, combining policy gradients with backpropagation gradients (see the sketch after this list)
We have developed a blind source separation method that unmixes the contributions of different instruments in a polyphonic music recording via a parametric model and a dictionary
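To make the combination of gradient types concrete, here is a self-contained toy sketch (PyTorch; entirely our illustration, with an assumed Gaussian-bump spectral model standing in for the actual dictionary): the amplitudes are trained by plain backpropagation, while the pitch, which yields no useful backpropagation gradient, is modelled stochastically and trained with a REINFORCE-style policy gradient.

import torch

F, K = 512, 8                 # frequency bins, number of modelled harmonics

def model_spectrum(amps, pitch):
    # Toy stand-in for the dictionary-based model: one Gaussian bump per
    # harmonic, centred at integer multiples of the pitch (in bins).
    freqs = torch.arange(F, dtype=torch.float32)
    k = torch.arange(1, K + 1, dtype=torch.float32)
    bumps = torch.exp(-0.5 * ((freqs[None, :] - (pitch * k)[:, None]) / 2.0) ** 2)
    return (amps[:, None] * bumps).sum(dim=0)

# Synthetic "observed" magnitude frame with its fundamental at bin 45.
true_amps = 1.0 / torch.arange(1, K + 1, dtype=torch.float32)
frame = model_spectrum(true_amps, torch.tensor(45.0))

amps = torch.rand(K, requires_grad=True)             # deterministic parameters
pitch_mean = torch.tensor(40.0, requires_grad=True)  # stochastic parameter
pitch_logstd = torch.tensor(0.0, requires_grad=True)
opt = torch.optim.Adam([amps, pitch_mean, pitch_logstd], lr=0.05)

for step in range(500):
    opt.zero_grad()
    dist = torch.distributions.Normal(pitch_mean, pitch_logstd.exp())
    pitch = dist.sample()          # sampling blocks the backprop gradient
    loss = torch.mean((frame - model_spectrum(amps, pitch)) ** 2)
    # Backpropagation gradient for the amplitudes; REINFORCE-style policy
    # gradient for the pitch distribution, with the detached loss as the cost.
    surrogate = loss + dist.log_prob(pitch) * loss.detach()
    surrogate.backward()
    opt.step()

print(float(pitch_mean))           # drifts towards 45 (REINFORCE is noisy)

Detaching the loss in the surrogate term treats the score-function factor as a constant weight, so both gradient types coexist in a single backward pass; a variance-reducing baseline, omitted here, is common in practice.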
Summary
For the time-frequency representation of the audio signals, we use the sampled complex-valued output of the short-time Fourier transform, which can be interpreted as the analysis coefficients of a Gabor frame. This representation has the advantage of being perfectly linear and easy to project back to a time-domain signal, but it is not pitch-invariant; that is, the distance along the frequency axis corresponding to a certain musical interval varies with the pitch of the tones. For the problematic parameters like pitch, we use policy gradients for training, a technique originating from deep reinforcement learning, cf. [2].
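As a quick illustration of both properties (a hedged sketch using numpy and scipy; the signal and window length are arbitrary choices), the snippet below runs one analysis/synthesis round trip and then shows that the fixed bin spacing makes the representation linear and invertible but not pitch-invariant, since an octave above 440 Hz spans twice as many bins as an octave above 220 Hz:

import numpy as np
from scipy.signal import stft, istft

fs = 44100                                        # sampling rate in Hz
t = np.arange(fs) / fs                            # one second of audio
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 660 * t)

f, frames, Z = stft(x, fs=fs, nperseg=4096)       # complex Gabor coefficients
_, x_rec = istft(Z, fs=fs, nperseg=4096)          # project back to time domain
print(np.max(np.abs(x - x_rec[: x.size])))        # tiny, up to rounding error

# Not pitch-invariant: the same interval (one octave) covers a different
# number of frequency bins depending on where it starts.
bin_width = f[1] - f[0]
print((880 - 440) / bin_width, (440 - 220) / bin_width)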