MCSSME: Multi-Task Contrastive Learning for Semi-supervised Singing Melody Extraction from Polyphonic Music

Shuai Yu

doi:10.1609/aaai.v38i1.27790

Abstract

Singing melody extraction is an important task in the field of music information retrieval (MIR). Data-driven models for this task have achieved great success. However, existing models have two major limitations. First, most existing singing melody extraction models formulate the task as a pixel-level prediction task, and the lack of labeled data has limited further improvement. Second, the generalization of existing models is easily disturbed by differences in music genre. To address these issues, we propose a multi-task contrastive learning framework for semi-supervised singing melody extraction, termed MCSSME. Specifically, to deal with the data scarcity limitation, we propose a self-consistency regularization (SCR) method to train the model on unlabeled data. Transformations are applied to the raw signal of polyphonic music, and the network improves its representation capability by recognizing these transformations. We further propose a novel multi-task learning (MTL) approach to jointly learn singing melody extraction and classification of the transformed data. To deal with the generalization limitation, we also propose a contrastive embedding learning method, which strengthens intra-class compactness and inter-class separability. To improve generalization across music genres, we also propose a domain classification method that learns task-dependent features by mapping data from different music genres to a shared subspace. MCSSME is evaluated on a set of well-known public melody extraction datasets and shows promising performance. The experimental results demonstrate the effectiveness of the MCSSME framework for singing melody extraction from polyphonic music in scenarios with very limited labeled data.
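To make the idea of intra-class compactness and inter-class separability concrete, the following is a minimal NumPy sketch of a generic supervised contrastive loss. The abstract does not specify the loss used in MCSSME, so the function name `supervised_contrastive_loss`, the `temperature` parameter, and the cosine-similarity formulation are all assumptions chosen for illustration, not the authors' method.

```python
import numpy as np


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss (illustrative, not from the paper).

    Embeddings sharing a label are pulled together; all others are pushed
    apart. Assumes at least two samples and non-zero embedding vectors.
    """
    z = np.asarray(embeddings, dtype=np.float64)
    labels = np.asarray(labels)

    # L2-normalize so the dot product is a cosine similarity.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temperature

    n = z.shape[0]
    not_self = ~np.eye(n, dtype=bool)

    # Numerically stable log-sum-exp over every sample except the anchor itself.
    sim_masked = np.where(not_self, sim, -np.inf)
    row_max = sim_masked.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(
        np.exp(sim_masked - row_max).sum(axis=1, keepdims=True)
    )
    log_prob = sim - log_denom

    # Positives: other samples that share the anchor's label.
    positives = (labels[:, None] == labels[None, :]) & not_self
    pos_count = positives.sum(axis=1)
    has_pos = pos_count > 0
    if not has_pos.any():
        return 0.0

    # Average negative log-probability of the positives, per anchor.
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_log_prob[has_pos] / pos_count[has_pos]
    return float(per_anchor.mean())


if __name__ == "__main__":
    # Toy 2-D embeddings forming two tight clusters.
    emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(supervised_contrastive_loss(emb, [0, 0, 1, 1]))  # labels match clusters
    print(supervised_contrastive_loss(emb, [0, 1, 0, 1]))  # labels cut across clusters
```

In this toy example the loss is lower when the labels agree with the cluster structure of the embeddings, which is the behavior a contrastive embedding objective is meant to reward.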

Full Text