Abstract

The task of real-time alignment between a music performance and the corresponding score (sheet music), also known as score following, poses a challenging multi-modal machine learning problem. Training a system that can solve this task robustly with live audio and real sheet music (i.e., scans or score images) requires precise ground truth alignments between audio and note-coordinate positions in the score sheet images. However, these kinds of annotations are difficult and costly to obtain, which is why research in this area mainly utilizes synthetic audio and sheet images to train and evaluate score following systems. In this work, we propose a method that does not solely rely on note alignments but is additionally capable of leveraging data with annotations of lower granularity, such as bar or score system alignments. This allows us to use a large collection of real-world piano performance recordings coarsely aligned to scanned score sheet images and, as a consequence, improve over current state-of-the-art approaches.

Highlights

  • Score following or real-time audio-to-score alignment aims at synchronizing musical performances to the corresponding scores in an on-line fashion

  • Approaches to score following fall mainly into two categories: methods that require symbolic, computer-readable score representations (e.g., Dynamic Time Warping (DTW) or Hidden Markov Models) and methods that work directly with images of scores using deep learning techniques

  • In Section III we investigate generalization in the image domain by considering scanned sheet images and synthetic audio rendered from the score MIDI
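The classical symbolic-score approach named above (DTW) can be illustrated with a minimal sketch. The feature vectors and toy sequences below are hypothetical placeholders; a real system would extract features such as chroma vectors from audio frames and score events, and an on-line score follower would use an incremental DTW variant rather than this offline formulation.

```python
# Minimal offline DTW sketch for audio-to-score alignment.
# The "features" here are hypothetical toy vectors, not real audio features.

def cost(a, b):
    # Euclidean distance between two feature vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def dtw_align(performance, score):
    """Align two feature sequences; return the warping path and total cost."""
    n, m = len(performance), len(score)
    INF = float("inf")
    # D[i][j] = accumulated cost of aligning the first i performance frames
    # with the first j score events.
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(performance[i - 1], score[j - 1])
            D[i][j] = c + min(D[i - 1][j],      # performance frame repeated
                              D[i][j - 1],      # score event skipped
                              D[i - 1][j - 1])  # one-to-one match
    # Backtrack from (n, m) to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((D[i - 1][j - 1], i - 1, j - 1),
                      (D[i - 1][j], i - 1, j),
                      (D[i][j - 1], i, j - 1))
    return list(reversed(path)), D[n][m]

# Toy sequences: the "performance" dwells on the second score event,
# as a player might hold a note longer than notated.
score = [[1, 0], [0, 1], [1, 1]]
performance = [[1, 0], [0, 1], [0, 1], [1, 1]]
path, total = dtw_align(performance, score)
print(path)   # each pair maps a performance frame to a score event
print(total)  # 0.0: every frame matches some score event exactly
```

The path maps the repeated performance frame back to the same score event, which is exactly the kind of tempo deviation an alignment method must absorb; the image-based deep learning approaches discussed in the paper avoid the need for the symbolic score representation this sketch assumes.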


Summary

INTRODUCTION

Score following, or real-time audio-to-score alignment, aims at synchronizing musical performances (audio) to the corresponding scores (the printed sheet music from which the musicians are presumably playing) in an on-line fashion. In addition to the intrinsic difficulty of this task, with different ways in which the same musical passage can be typeset and played, we face a severe data problem: training such a network requires large amounts of fine-grained annotations between note positions on the sheet image and in the audio. Obtaining such information at this level of precision via manual annotation is practically impossible, at least in the acoustic domain. We conduct large-scale experiments to investigate the generalization capabilities of our proposed system in the audio as well as in the sheet-image domain.

RELATED WORK
MULTI-MODAL BOUNDING BOX PREDICTION FOR SCORE FOLLOWING
Conditional YOLO
Query Encoding
A NEW APPROACH
EXPERIMENT SETUP
Models
Dedicated Datasets
Training Details
Evaluation Metrics
Model Comparison
Error Analysis
CONCLUSION
DATA AVAILABILITY STATEMENT

