Abstract

Singing voice separation, which aims to separate vocals and accompaniment from a mixed musical signal, has long been a popular research topic. In this work, we propose a novel deep neural network called FC-U<sup>2</sup>-Net for singing voice separation. The network is a two-level nested U-structure in which time-invariant fully-connected layers are added along the frequency axis. This structure enables it to capture not only local and global contextual information, but also long-range correlations of the voice signal along the frequency axis. In addition, we propose a novel loss function that combines a ratio mask and a binary mask; this strategy makes the estimated vocal signal cleaner, so it carries less residual accompaniment. Experimental results show that our method surpasses four state-of-the-art methods on the MUSDB18 singing voice separation task, reaching an optimal source-to-distortion ratio (SDR) of 7.53 dB.
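The abstract does not give the exact form of the combined loss, so the following is only a minimal NumPy sketch of the general idea: an estimated spectral mask is penalized against both an ideal ratio mask (soft) and an ideal binary mask (hard). The function name `combined_mask_loss`, the L1 distance, and the weighting `alpha` are all assumptions for illustration, not the paper's definition.

```python
import numpy as np

def combined_mask_loss(est_mag, vocal_mag, accomp_mag, alpha=0.5):
    """Hypothetical loss mixing ratio-mask and binary-mask targets.

    est_mag    -- estimated vocal magnitude spectrogram (freq x time)
    vocal_mag  -- ground-truth vocal magnitude spectrogram
    accomp_mag -- ground-truth accompaniment magnitude spectrogram
    alpha      -- assumed weight between the two mask terms
    """
    eps = 1e-8
    mix_mag = vocal_mag + accomp_mag
    # Ideal ratio mask: soft proportion of vocal energy in the mixture.
    irm = vocal_mag / (mix_mag + eps)
    # Ideal binary mask: 1 where vocals dominate the accompaniment, else 0.
    ibm = (vocal_mag > accomp_mag).astype(float)
    # Mask implied by the network's magnitude estimate.
    est_mask = est_mag / (mix_mag + eps)
    loss_ratio = np.mean(np.abs(est_mask - irm))
    loss_binary = np.mean(np.abs(est_mask - ibm))
    return alpha * loss_ratio + (1.0 - alpha) * loss_binary
```

Intuitively, the ratio-mask term keeps the estimate faithful to the true vocal/mixture energy ratio, while the binary-mask term pushes time-frequency bins toward all-vocal or all-accompaniment, which is one plausible reading of how such a combination yields cleaner vocals.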
