Abstract

Separating the vocal from the accompaniment in single-channel music is a foundational problem in music information retrieval (MIR). Mainstream music-separation methods usually operate on the frequency-domain characteristics of the signal, so the phase information of the music is lost during time–frequency decomposition. In recent years, deep learning models that work directly on time-domain speech signals, such as Conv-TasNet, have shown great potential. However, no suitable time-domain model exists for the vocal and accompaniment separation problem. Because the vocal and the accompaniment in music exhibit greater synergy and similarity than the voices of two speakers in speech, applying a speech-separation model to this task yields unsatisfactory results. Motivated by this, we propose VAT-SNet, which optimizes the network structure of Conv-TasNet: it uses sample-level convolution in the encoder and decoder to preserve deep acoustic features, and it takes the vocal embedding and accompaniment embedding generated by an auxiliary network as references to improve the purity of the separated vocal and accompaniment. Results on public music datasets show that the vocal and accompaniment separated by VAT-SNet improve in GSNR, GSIR, and GSAR compared with Conv-TasNet and mainstream separation methods such as U-Net and SH-4stack.
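
The sketch below is a rough illustration of the kind of architecture the abstract describes: a Conv-TasNet-style time-domain encoder/separator/decoder in which an auxiliary reference embedding conditions the mask estimation. The layer sizes, the FiLM-style conditioning, and all module names are assumptions for illustration only, not the exact VAT-SNet configuration.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: a Conv-TasNet-style time-domain separator where an
# auxiliary embedding (e.g. a vocal or accompaniment reference) conditions the
# mask estimation. Hyperparameters and conditioning scheme are assumptions.

class TimeDomainSeparator(nn.Module):
    def __init__(self, enc_channels=256, kernel_size=16, stride=8, emb_dim=128):
        super().__init__()
        # 1-D convolutional encoder mapping the raw waveform to a latent representation.
        self.encoder = nn.Conv1d(1, enc_channels, kernel_size, stride=stride, bias=False)
        # Small temporal convolutional stack standing in for the separator network.
        self.separator = nn.Sequential(
            nn.Conv1d(enc_channels, enc_channels, 3, padding=1), nn.PReLU(),
            nn.Conv1d(enc_channels, enc_channels, 3, padding=2, dilation=2), nn.PReLU(),
        )
        # Project the reference embedding to per-channel scale/shift (FiLM-style conditioning).
        self.film = nn.Linear(emb_dim, 2 * enc_channels)
        self.mask = nn.Sequential(nn.Conv1d(enc_channels, enc_channels, 1), nn.Sigmoid())
        # Transposed-convolution decoder mapping the masked latent back to a waveform.
        self.decoder = nn.ConvTranspose1d(enc_channels, 1, kernel_size, stride=stride, bias=False)

    def forward(self, mixture, ref_embedding):
        # mixture: (batch, 1, samples); ref_embedding: (batch, emb_dim)
        latent = self.encoder(mixture)
        feats = self.separator(latent)
        scale, shift = self.film(ref_embedding).chunk(2, dim=-1)
        feats = feats * scale.unsqueeze(-1) + shift.unsqueeze(-1)
        masked = latent * self.mask(feats)
        return self.decoder(masked)

if __name__ == "__main__":
    model = TimeDomainSeparator()
    mix = torch.randn(2, 1, 16000)   # two 1-second mixtures at 16 kHz
    emb = torch.randn(2, 128)        # reference embeddings from an auxiliary network
    est = model(mix, emb)
    print(est.shape)                 # (2, 1, 16000)
```

In this hypothetical setup, the same network could be queried twice, once with the vocal embedding and once with the accompaniment embedding, to recover the two sources from the mixture.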
