Abstract

This paper proposes a separation model based on a gated nested U-Net (GNU-Net) architecture, essentially a deeply supervised, symmetric encoder–decoder network that generates full-resolution feature maps. Through a series of nested skip pathways, it reduces the semantic gap between the feature maps of the encoder and decoder subnetworks. In the GNU-Net architecture, gated linear units (GLUs) replace conventional convolutional layers, but only in the backbone, not in the nested part. The outputs of the GNU-Net are fed into a time-frequency (T-F) mask layer, which generates two masks, one for the singing voice and one for the accompaniment. These two estimated masks, together with the magnitude and phase spectra of the mixture, are then transformed into time-domain signals. We explored two types of T-F mask layer: a discriminative training network and a difference mask layer. Experimental results show the latter to be better. We evaluated the proposed model against three other models, as well as against ideal T-F masks. The results demonstrate that the proposed model outperforms the compared models and that its performance approaches that of the ideal ratio mask (IRM). More importantly, the proposed model outputs the separated singing voice and accompaniment simultaneously, whereas each of the three compared models can separate only one source per trained model.
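The final reconstruction step described above can be illustrated with a short sketch: given two non-negative T-F masks, each is multiplied with the mixture's magnitude spectrogram, recombined with the mixture's phase, and inverted back to the time domain. This is a generic mask-application sketch using SciPy's STFT, not the paper's implementation; the function name, sample rate, and window length are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def apply_tf_masks(mixture, masks, fs=16000, nperseg=1024):
    """Apply T-F masks to a mixture and reconstruct time-domain sources.

    `masks` is a list of non-negative arrays shaped like the mixture
    spectrogram (e.g. the estimated vocal and accompaniment masks).
    The mixture phase is reused for every source.
    """
    # Complex spectrogram of the mixture.
    _, _, Z = stft(mixture, fs=fs, nperseg=nperseg)
    phase = np.exp(1j * np.angle(Z))
    sources = []
    for m in masks:
        # Masked magnitude + mixture phase -> complex spectrogram -> waveform.
        _, x = istft(m * np.abs(Z) * phase, fs=fs, nperseg=nperseg)
        sources.append(x[: len(mixture)])
    return sources
```

Note that when the two masks sum to one at every T-F bin (as with a ratio-style mask), the reconstructed vocal and accompaniment sum back to the original mixture, since the ISTFT is linear.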

Highlights

  • Singing voice separation attempts to isolate singing voice from a song

  • The proposed gated nested U-Net (GNU-Net) model and two kinds of T-F mask layer were verified by their separation performance, and the effect of the nesting was assessed by comparison with U-Net [26]

  • The performance of the GNU-Net separation model was compared with three models and with ideal T-F masks on the iKala dataset


Introduction

Singing voice separation attempts to isolate the singing voice (also called the vocal line) from a song. Isolating the pure accompaniment from a song likewise has valuable applications, such as leading instrument detection [9] and drum source separation [10]. While these tasks seem effortless to humans, they turn out to be very difficult for machines, especially when the singing voice is accompanied by musical instruments. Such requirements can be satisfied if successful separation of singing voice and accompaniment is used as preprocessing. Because of the harmony of a popular song, the singing voice and accompaniment are strongly correlated in both time and frequency [11], which makes separating the singing voice from a song

