Abstract
The task of real-time alignment between a music performance and the corresponding score (sheet music), also known as score following, poses a challenging multi-modal machine learning problem. Training a system that can solve this task robustly with live audio and real sheet music (i.e., scans or score images) requires precise ground truth alignments between audio and note-coordinate positions in the score sheet images. However, these kinds of annotations are difficult and costly to obtain, which is why research in this area mainly utilizes synthetic audio and sheet images to train and evaluate score following systems. In this work, we propose a method that does not solely rely on note alignments but is additionally capable of leveraging data with annotations of lower granularity, such as bar or score system alignments. This allows us to use a large collection of real-world piano performance recordings coarsely aligned to scanned score sheet images and, as a consequence, improve over current state-of-the-art approaches.
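To make the core idea concrete: when no note-level alignment exists for a training sample, the supervision can fall back to the coarser bar- or system-level target. The following is a minimal illustrative sketch, not the authors' actual objective; the regression setup, the function name `alignment_loss`, and the slack radii are all assumptions. It supposes a model that regresses an (x, y) position on the sheet image and simply stops penalizing predictions that land within a tolerance radius matching the annotation's granularity.

```python
import torch
import torch.nn.functional as F

def alignment_loss(pred_xy: torch.Tensor,
                   target_xy: torch.Tensor,
                   granularity: torch.Tensor) -> torch.Tensor:
    """Granularity-aware hinge loss (illustrative sketch, not the paper's loss).

    pred_xy:     (B, 2) predicted (x, y) sheet-image positions in [0, 1]
    target_xy:   (B, 2) annotated positions; for coarse samples, e.g. the
                 centre of the annotated bar or score system
    granularity: (B,) integer tensor: 0 = note, 1 = bar, 2 = system
    """
    # Hypothetical tolerance radii: coarser annotations forgive larger errors,
    # so a prediction anywhere inside the annotated bar/system costs nothing.
    slack = torch.tensor([0.0, 0.05, 0.15], device=pred_xy.device)[granularity]
    err = torch.linalg.norm(pred_xy - target_xy, dim=1)
    return F.relu(err - slack).mean()

# Example batch: one note-level and one bar-level annotated sample.
pred = torch.tensor([[0.30, 0.42], [0.71, 0.18]])
target = torch.tensor([[0.31, 0.40], [0.65, 0.20]])
loss = alignment_loss(pred, target, torch.tensor([0, 1]))
```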
Highlights
Score following, or real-time audio-to-score alignment, aims at synchronizing musical performances to the corresponding scores in an on-line fashion
Approaches to score following fall mainly into two categories: methods that require symbolic, computer-readable score representations (e.g., Dynamic Time Warping (DTW) or Hidden Markov Models; a minimal DTW sketch follows these highlights) and methods that work directly with images of scores by applying deep learning techniques
In Section III we investigate generalization in the image domain by considering scanned sheet images and synthetic audio rendered from the score MIDI
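Since the highlights contrast DTW-based and image-based approaches, here is a compact textbook DTW in NumPy for intuition. It is the offline variant (real-time systems use on-line extensions such as OLTW), and the feature extraction producing the two input sequences is assumed to exist.

```python
import numpy as np

def dtw(score_feats: np.ndarray, perf_feats: np.ndarray):
    """Offline DTW: accumulated-cost matrix plus optimal warping path.

    score_feats: (N, d) feature frames of the score representation
    perf_feats:  (M, d) feature frames of the performance audio
    """
    # Pairwise cosine distances between all score/performance frames.
    a = score_feats / (np.linalg.norm(score_feats, axis=1, keepdims=True) + 1e-8)
    b = perf_feats / (np.linalg.norm(perf_feats, axis=1, keepdims=True) + 1e-8)
    dist = 1.0 - a @ b.T  # shape (N, M)

    # Dynamic programming: each cell extends the cheapest admissible predecessor.
    acc = np.full_like(dist, np.inf)
    acc[0, 0] = dist[0, 0]
    for i in range(dist.shape[0]):
        for j in range(dist.shape[1]):
            if i == 0 and j == 0:
                continue
            prev = min(
                acc[i - 1, j] if i > 0 else np.inf,
                acc[i, j - 1] if j > 0 else np.inf,
                acc[i - 1, j - 1] if min(i, j) > 0 else np.inf,
            )
            acc[i, j] = dist[i, j] + prev

    # Backtrack from the terminal cell to recover the alignment path.
    i, j = dist.shape[0] - 1, dist.shape[1] - 1
    path = [(i, j)]
    while i > 0 or j > 0:
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1, j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((acc[i - 1, j], i - 1, j))
        if j > 0:
            candidates.append((acc[i, j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    return acc, path[::-1]
```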
Summary
Score following, or real-time audio-to-score alignment, aims at synchronizing musical performances (audio) to the corresponding scores (the printed sheet music from which the musicians are presumably playing) in an on-line fashion. In addition to the intrinsic difficulty of this task (the same musical passage can be typeset and played in many different ways), we face a severe data problem: training such a system requires large amounts of fine-grained annotations linking note positions on the sheet image to their counterparts in the audio. Obtaining information at this level of precision via manual annotation is practically infeasible, at least in the acoustic domain. We conduct large-scale experiments to investigate the generalization capabilities of our proposed system in the audio as well as in the sheet-image domain.
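To illustrate what "on-line" means operationally: the follower must commit to a score position for every incoming audio frame instead of aligning the full recording at once. The toy sketch below shows this loop; the model interface and the exponential smoothing are assumptions for illustration, not the system described in the paper.

```python
import numpy as np

def follow(model, audio_frames, alpha: float = 0.8):
    """Toy on-line following loop: one position estimate per incoming frame.

    model:        any callable mapping a spectrogram excerpt to an (x, y)
                  position on the current sheet image (hypothetical interface)
    audio_frames: iterable of spectrogram frames arriving in real time
    alpha:        exponential-smoothing factor to stabilise the trajectory
    """
    pos = None
    for frame in audio_frames:
        raw = np.asarray(model(frame), dtype=float)  # raw per-frame estimate
        # Smooth the estimate so the on-screen pointer does not jitter.
        pos = raw if pos is None else alpha * pos + (1.0 - alpha) * raw
        yield pos

# Hypothetical usage with a dummy model that always points to the same spot:
dummy = lambda frame: (0.5, 0.5)
for xy in follow(dummy, [np.zeros(128)] * 4):
    print(xy)
```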