Abstract

In this paper, we propose a score-informed source separation framework based on non-negative matrix factorization (NMF) and dynamic time warping (DTW) that is suited to both offline and online systems. The proposed framework is composed of three stages: training, alignment, and separation. In the training stage, the score is encoded as a sequence of individual occurrences and unique combinations of notes, denoted score units. We then propose an NMF-based signal model in which the basis functions for each score unit are represented as a weighted combination of spectral patterns for each note and instrument in the score, obtained from a pre-trained overcomplete dictionary. In the alignment stage, the time-varying gains are estimated at the frame level by computing the projection of each score unit basis function onto the captured audio signal. Then, under the assumption that only one score unit is active at a time, we propose an online DTW scheme to synchronize the score information with the performance. Finally, in the separation stage, the obtained gains are refined using local low-rank NMF, and the separated sources are obtained using a soft-filter strategy. The framework has been evaluated and compared with other state-of-the-art methods for single-channel source separation of small ensembles and large orchestral ensembles, obtaining reliable results in terms of SDR and SIR. Finally, our method has been evaluated on the specific task of acoustic minus one, and some demos are presented.
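The alignment and separation stages described above can be sketched in a few lines of numpy. The snippet below is a minimal illustration, not the paper's implementation: `estimate_gains` projects each score-unit basis onto the magnitude spectrogram and refines the frame-level gains with KL-divergence multiplicative updates (a standard NMF update, assumed here as a stand-in for the paper's estimator), and `soft_filter` applies the soft-filter (Wiener-style mask) strategy. The mapping `unit_to_source` from score units to instruments is a hypothetical input.

```python
import numpy as np

def estimate_gains(V, B, n_iter=30, eps=1e-12):
    """Estimate frame-level time-varying gains G (units x frames) for a
    magnitude spectrogram V (freq x frames) given fixed score-unit basis
    functions B (freq x units). Initialization is the projection of each
    basis onto V; refinement uses KL-NMF multiplicative updates."""
    col_sum = np.sum(B, axis=0)[:, None] + eps        # (units, 1)
    G = (B.T @ V) / col_sum                           # projection init
    for _ in range(n_iter):
        G *= (B.T @ (V / (B @ G + eps))) / col_sum    # KL update, B fixed
    return G

def soft_filter(X, B, G, unit_to_source, n_sources, eps=1e-12):
    """Separate the complex spectrogram X into n_sources via soft masks.
    unit_to_source[u] gives the source index of score unit u."""
    mags = np.zeros((n_sources,) + X.shape)
    for u, s in enumerate(unit_to_source):
        mags[s] += np.outer(B[:, u], G[u])            # per-source model
    total = mags.sum(axis=0) + eps
    return [(m / total) * X for m in mags]            # masks sum to ~1
```

Because the masks sum to (approximately) one in every time-frequency bin, the separated sources add back up to the input mixture, which is the usual appeal of the soft-filter strategy.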

Highlights

  • Sound source separation (SS) seeks to segregate constituent sound sources from an audio signal mixture

  • The proposed method (Section 3) is evaluated for the task of single-channel instrumental music SS using a well-known dataset of small ensembles and a more challenging large-ensemble orchestral dataset

  • We compared the best-performing separation methods for the offline and online approaches, applying the psychoacoustic model presented in Section 4, with the baseline methods that set the “unrealistic” extreme results


Summary

Introduction

Sound source separation (SS) seeks to segregate constituent sound sources from an audio signal mixture. The score of the piece must first be aligned to the recording; this synchronization is usually performed beforehand and is typically obtained using a twofold procedure: (1) feature extraction from audio and score and (2) temporal alignment [17]. In the former, the features extracted from the audio signal characterize some specific information about the musical content. The cost function for the alignment procedure can be obtained by computing the projection of each score unit over the frame-level spectrum, and the minimum-cost path is estimated using the online DTW framework proposed in [29], based on the original work presented by Dixon et al. [28]. The novelty of this work lies in developing a method for single-channel and multi-timbral (i.e., with multiple instruments) signal SS, which uses the score information encoded within the signal model.
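To make the second step of that procedure concrete, the sketch below grows a minimum-cost alignment path incrementally over a precomputed score-unit/frame mismatch matrix. This is a deliberately simplified greedy variant of online DTW, shown only to illustrate the idea of advancing score position and audio frame one step at a time; the referenced framework [28, 29] uses a bounded search window and incremental cost evaluation rather than this purely greedy rule.

```python
import numpy as np

def greedy_online_dtw(cost):
    """Greedy online alignment over cost (score_units x frames), where
    cost[i, j] is the mismatch between score unit i and audio frame j.
    From the current cell, take the cheapest of the three DTW moves:
    diagonal (advance both), right (advance audio), down (advance score)."""
    n_units, n_frames = cost.shape
    i, j = 0, 0
    path = [(0, 0)]
    while i < n_units - 1 or j < n_frames - 1:
        moves = []
        if i + 1 < n_units and j + 1 < n_frames:
            moves.append((cost[i + 1, j + 1], i + 1, j + 1))
        if j + 1 < n_frames:
            moves.append((cost[i, j + 1], i, j + 1))
        if i + 1 < n_units:
            moves.append((cost[i + 1, j], i + 1, j))
        _, i, j = min(moves)          # cheapest local continuation
        path.append((i, j))
    return path
```

In the score-informed setting, `cost[i, j]` would come from the projection of score unit *i*'s basis function onto frame *j* of the spectrogram, so a well-matched performance yields a low-cost near-diagonal path.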

Background
Deep learning approaches for source separation
Training stage
Signal model parameter estimation
Method
Conclusions
