Abstract

Automatic lyric transcription (ALT) refers to transcribing singing voices into lyrics, while automatic music transcription (AMT) refers to transcribing singing voices into note events, i.e., musical MIDI notes. Despite the significant practical potential of these two tasks, research on them is still nascent. This is because transcribing lyrics and note events solely from singing audio is notoriously difficult due to noise contamination, e.g., musical accompaniment, which degrades both the intelligibility of sung lyrics and the recognizability of sung notes. To address this challenge, we propose a general framework for building multimodal ALT and AMT systems. Additionally, we curate the first multimodal singing dataset, comprising N20EMv1 and N20EMv2, which contains audio recordings and videos of lip movements together with ground-truth lyrics and note events. For model construction, we propose adapting self-supervised learning models from the speech domain as acoustic and visual encoders to alleviate the scarcity of labeled data. We also introduce a residual cross-attention mechanism to effectively integrate features from the audio and video modalities. Through extensive experiments, we demonstrate that our single-modal systems achieve state-of-the-art performance on both ALT and AMT. These single-modal experiments also reveal the individual contribution of each modality to the multimodal system. Finally, we combine the two modalities and demonstrate the effectiveness of our proposed multimodal systems, particularly in terms of their noise robustness.
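To make the fusion idea concrete, below is a minimal sketch of a residual cross-attention block in PyTorch, where audio features attend to video (lip-movement) features and the attended context is added back to the audio stream. The class name, feature dimensions, and layer choices are illustrative assumptions for exposition, not the paper's exact implementation.

```python
# Minimal sketch of residual cross-attention fusion (illustrative assumptions,
# not the authors' exact architecture).
import torch
import torch.nn as nn


class ResidualCrossAttention(nn.Module):
    """Fuse audio features with video features via cross-attention,
    adding the attended visual context back to the audio stream."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # audio: (batch, T_audio, dim) used as queries;
        # video: (batch, T_video, dim) used as keys and values.
        context, _ = self.attn(query=audio, key=video, value=video)
        # The residual connection keeps the acoustic stream dominant, letting
        # the visual context act as a correction, e.g., under noisy accompaniment.
        return self.norm(audio + context)


if __name__ == "__main__":
    fuse = ResidualCrossAttention()
    a = torch.randn(2, 100, 256)  # hypothetical audio frame features
    v = torch.randn(2, 25, 256)   # hypothetical lip-movement frame features
    print(fuse(a, v).shape)       # torch.Size([2, 100, 256])
```

In this sketch, the residual path means the model can fall back on audio-only behavior when the visual stream is uninformative, which is one plausible reading of why such fusion helps with noise robustness.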
