Singer Diarization for Polyphonic Music With Unison Singing

Hitoshi Suda, Tomoyasu Nakano, Satoru Fukayama, Masataka Goto, Daisuke Saito

doi:10.1109/taslp.2022.3166262

Abstract

This paper introduces a new framework for singer diarization, the task of determining who sings when in songs with multiple singers. Although various techniques have been developed to analyze and extract features of singing voices in musical audio signals, most of them assume that a song is sung by a single singer, and singer diarization for multiple singers has not been well studied in the field of singing information processing. In speech analysis, speaker diarization has been explored to deal with multiple speakers, including overlapping speech, but it cannot handle singing voices well because of the acoustic differences between singing and speech. This paper therefore proposes a new diarization framework specialized for singing voices. To achieve high accuracy in overlap detection, this paper proposes a novel acoustic feature named the Cosacorr score, which helps estimate whether a song is sung by more than one singer. After extracting singing voices from polyphonic music with a singing voice separation technique, the framework adopts the existing ArcFace technique to extract discriminative singer representations from short segments of the separated singing voices. The framework is evaluated on a new private dataset of unison singing voices constructed from commercially available compact discs (CDs). The experimental results show that the proposed framework outperformed a baseline speaker diarization method in terms of diarization error rate (DER).
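
As an illustration of the DER metric mentioned above, the following is a minimal frame-level sketch in Python. It is not the evaluation code used in the paper. It assumes that audio has already been cut into equal-length frames, that each frame is annotated with the set of active singers, and that hypothesis labels have already been mapped to reference labels. The function name `frame_der` and the label strings are hypothetical. Real evaluations typically also apply an optimal label mapping and may use a forgiveness collar, both of which are omitted here.

```python
from typing import List, Set


def frame_der(reference: List[Set[str]], hypothesis: List[Set[str]]) -> float:
    """Frame-level diarization error rate (illustrative sketch).

    Each list element is the set of singers active in one frame.
    Assumes hypothesis labels are already mapped to reference labels.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same number of frames")

    missed = false_alarm = confusion = total = 0
    for ref, hyp in zip(reference, hypothesis):
        n_ref, n_hyp = len(ref), len(hyp)
        n_correct = len(ref & hyp)
        # Reference singers with no corresponding hypothesis slot.
        missed += max(0, n_ref - n_hyp)
        # Hypothesis singers beyond the number of reference singers.
        false_alarm += max(0, n_hyp - n_ref)
        # Slots that are filled, but with the wrong singer.
        confusion += min(n_ref, n_hyp) - n_correct
        total += n_ref

    if total == 0:
        raise ValueError("reference contains no singing frames")
    return (missed + false_alarm + confusion) / total


if __name__ == "__main__":
    # Frame 2 is unison singing by A and B; frame 4 is silence in the reference.
    ref = [{"A"}, {"A", "B"}, {"B"}, set()]
    hyp = [{"A"}, {"A"}, {"A"}, {"B"}]
    print(frame_der(ref, hyp))  # 0.75: one miss, one confusion, one false alarm over 4 reference units
```

Counting errors per singer slot rather than per frame means that a unison frame in which only one of two singers is detected contributes a miss, which is why overlap detection matters for this metric.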

Full Text