Abstract
Music source separation is one of the long-standing and challenging problems in the music information retrieval community. Advances in deep learning have led to substantial progress in decomposing music into its constituent components across a variety of musical styles. This research uses three datasets for source separation: the Korean traditional music Pansori dataset, the MIR-1K dataset, and the DSD100 dataset. The DSD100 dataset includes multiple sound sources, whereas the other two datasets have a relatively small number of sound sources. We synthetically constructed a novel dataset for Pansori music and trained a novel parallel stacked hourglass network (PSHN) with multiple band spectrograms. Compared with previous studies, the proposed architecture achieves the best results on real-world test samples of Pansori music of any length. The network was also tested on the public DSD100 and MIR-1K datasets to assess its strength on multi-source data, yielding comparable quantitative and qualitative outcomes. System performance is evaluated using the median values of the signal-to-distortion ratio (SDR), source-to-interference ratio (SIR), and source-to-artifacts ratio (SAR), measured in decibels (dB), along with visual comparison of predictions against the ground truth. We report better performance on the Pansori and MIR-1K datasets and perform detailed ablation studies based on architecture variations. The proposed system is best suited to separating music sources consisting of voices and a single or a few musical instruments.
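As a point of reference for the evaluation protocol, the following is a minimal sketch of computing the reported metrics (SDR, SIR, SAR in dB) with the open-source mir_eval toolkit, assuming the reference and estimated stems are time-aligned mono waveforms of equal length. The variable names and the median aggregation step are illustrative assumptions, not code from the paper.

```python
# A minimal sketch of the BSS Eval metrics named above (SDR, SIR, SAR, in dB),
# computed with the open-source mir_eval toolkit. Variable names and the
# median aggregation are illustrative, not code from the paper.
import numpy as np
import mir_eval

def evaluate_track(reference_sources, estimated_sources):
    """Return per-source SDR, SIR, SAR in dB.

    Both arguments are arrays of shape (n_sources, n_samples) holding
    time-aligned waveforms (e.g., vocals and accompaniment).
    """
    sdr, sir, sar, _perm = mir_eval.separation.bss_eval_sources(
        np.asarray(reference_sources), np.asarray(estimated_sources)
    )
    return sdr, sir, sar

# Median over the test set, matching the reporting protocol above
# (test_pairs is a hypothetical list of (reference, estimate) arrays):
# per_track = [evaluate_track(ref, est) for ref, est in test_pairs]
# median_sdr = np.median([m[0] for m in per_track], axis=0)
```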
Highlights
Music source separation has several useful applications, including automatic speech recognition for bilateral cochlear implant patients [1], fundamental frequency estimation for music transcription [2], beat tracking despite the presence of highly predominant vocals [3], the generation of karaoke music, instrument detection, lyrics recognition, and chord estimation.
Music is a mixture of several signals combined into one signal.
Our parallel stacked hourglass network (PSHN) architecture significantly outperforms the existing methods MLRR [39], U-Net [40], and stacked hourglass networks [6] on all evaluation criteria except the global SIR (GSIR) of accompaniments on the MIR-1K dataset.
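For readers unfamiliar with the input representation, below is a minimal sketch of one plausible multi-band spectrogram front end, assuming the "multiple band spectrograms" are contiguous frequency slices of an STFT magnitude. The STFT parameters, the number of bands, and the per-band routing into parallel branches are assumptions for illustration, not values from the paper.

```python
# A minimal sketch of a multi-band spectrogram front end. The STFT
# parameters and the number of bands are illustrative assumptions.
import numpy as np
import librosa

def multiband_spectrogram(y, n_fft=1024, hop_length=256, n_bands=4):
    """Split an STFT magnitude into contiguous frequency bands.

    Each returned band could feed one parallel branch of a stacked
    hourglass network (an assumption about the PSHN wiring).
    """
    mag = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))
    # np.array_split yields n_bands slices along the frequency axis.
    return np.array_split(mag, n_bands, axis=0)

# Usage (hypothetical file name):
# y, sr = librosa.load("mixture.wav", sr=22050, mono=True)
# bands = multiband_spectrogram(y)  # list of (bins_i, frames) arrays
```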
Summary
Music source separation has several useful applications, including automatic speech recognition for bilateral cochlear implant patients [1], fundamental frequency estimation for music transcription [2], beat tracking despite the presence of highly predominant vocals [3], the generation of karaoke music, instrument detection, lyrics recognition, and chord estimation. Another application is singer identification in music management systems, achieved by separating the singing voice from the music accompaniment. This study can be beneficial for most transcription systems and for traditional music learners.