Abstract

This paper describes a multi-task learning approach to the joint extraction (i.e., fundamental frequency (F0) estimation) and separation of singing voices from music signals. While deep neural networks have been applied successfully to each task, the two tasks have not been addressed simultaneously in the context of deep learning. Since vocal extraction and separation are considered to have a mutually beneficial relationship, we propose a unified network that consists of a deep convolutional neural network for vocal F0 saliency estimation and a U-Net whose encoder is shared by two decoders specialized for separating the vocal and accompaniment parts, respectively. Between these two networks, we introduce a differentiable layer that converts an F0 saliency spectrogram into harmonic masks indicating the locations of the harmonic partials of a singing voice, so that the physical meaning of harmonic structure is reflected in the network architecture. The harmonic masks are then used effectively as scaffolds for estimating fine-structured masks, thanks to the excellent capability of the U-Net for domain-preserving conversion (e.g., image-to-image conversion). The whole network can be trained jointly by backpropagation. Experimental results showed that the proposed unified network outperformed conventional independent networks for vocal extraction and separation.
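To make the described architecture concrete, the following is a minimal PyTorch sketch (not the authors' code) of how a differentiable saliency-to-harmonic-mask layer of this kind could be implemented. The F0 grid, number of harmonics, partial width, and STFT parameters are illustrative assumptions, not values taken from the paper; the point is that the mapping is a fixed linear projection per time frame, so gradients from the separation loss can flow back into the F0 saliency network.

```python
# Minimal sketch of a differentiable "F0 saliency -> harmonic mask" layer.
# All hyperparameters below are illustrative placeholders, not paper values.
import torch
import torch.nn as nn


class HarmonicMaskLayer(nn.Module):
    """Maps an F0 saliency spectrogram to a soft harmonic mask.

    The layer is a fixed (non-trainable) linear projection applied per
    time frame, so it is differentiable with respect to the saliency input.
    """

    def __init__(self, sr=16000, n_fft=1024, fmin=80.0, bins_per_octave=60,
                 n_f0_bins=240, n_harmonics=10, partial_width_hz=30.0):
        super().__init__()
        n_freq_bins = n_fft // 2 + 1
        freqs = torch.linspace(0.0, sr / 2, n_freq_bins)           # STFT bin centers (Hz)
        f0s = fmin * 2.0 ** (torch.arange(n_f0_bins).float()
                             / bins_per_octave)                     # log-spaced F0 grid (Hz)

        # Projection matrix W[p, f]: Gaussian bumps centered at each harmonic h * f0_p.
        W = torch.zeros(n_f0_bins, n_freq_bins)
        for h in range(1, n_harmonics + 1):
            centers = h * f0s                                       # (n_f0_bins,)
            dist = freqs[None, :] - centers[:, None]                # (n_f0_bins, n_freq_bins)
            W += torch.exp(-0.5 * (dist / partial_width_hz) ** 2)
        self.register_buffer("W", W.clamp(max=1.0))

    def forward(self, saliency):
        # saliency: (batch, n_f0_bins, n_frames), e.g. a softmax over F0 bins.
        # Returns a soft mask of shape (batch, n_freq_bins, n_frames) in [0, 1].
        mask = torch.einsum("bpt,pf->bft", saliency, self.W)
        return mask.clamp(max=1.0)


if __name__ == "__main__":
    layer = HarmonicMaskLayer()
    saliency = torch.softmax(torch.randn(2, 240, 100), dim=1)      # dummy saliency input
    mask = layer(saliency)
    print(mask.shape)  # torch.Size([2, 513, 100])
```

In a joint setup of the kind the abstract describes, such a mask would be concatenated with (or multiplied into) the mixture spectrogram before the U-Net, serving as the coarse "scaffold" that the separation decoders refine into fine-structured masks.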
