Abstract

Automatic Music Transcription (AMT) is usually evaluated using low-level criteria, typically by counting the number of errors, with equal weighting. Yet, some errors (e.g. out-of-key notes) are more salient than others. In this study, we design an online listening test to gather judgements about AMT quality. These judgements take the form of pairwise comparisons of transcriptions of the same music by pairs of different AMT systems. We investigate how these judgements correlate with benchmark metrics, and find that although they match in many cases, agreement drops when comparing pairs with similar scores, or pairs of poor transcriptions. We show that onset-only notewise F-measure is the benchmark metric that correlates best with human judgement, all the more so with higher onset tolerance thresholds. We define a set of features related to various musical attributes, and use them to design a new metric that correlates significantly better with listeners’ quality judgements. We examine which musical aspects were important to raters by conducting an ablation study on the defined metric, highlighting the importance of the rhythmic dimension (tempo, meter). We make the collected data entirely available for further study, in particular to evaluate the perceptual relevance of new AMT metrics.
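
For readers who want to reproduce the baseline metric discussed above, the snippet below is a minimal sketch of how the onset-only notewise F-measure can be computed with the open-source mir_eval library, which implements the standard AMT benchmark metrics. The note data is invented for illustration; this is not the authors' exact evaluation code.

```python
import numpy as np
import mir_eval

# Reference and estimated notes as (onset, offset) intervals in seconds
# and pitches in Hz, the input format expected by mir_eval.transcription.
ref_intervals = np.array([[0.0, 0.5], [0.5, 1.0], [1.0, 2.0]])
ref_pitches = np.array([440.0, 493.88, 523.25])   # A4, B4, C5
est_intervals = np.array([[0.02, 0.48], [0.55, 1.1], [1.2, 2.0]])
est_pitches = np.array([440.0, 493.88, 523.25])

# offset_ratio=None ignores note offsets, giving the onset-only notewise
# F-measure; the conventional onset tolerance is 50 ms.
precision, recall, f_measure, _ = mir_eval.transcription.precision_recall_f1_overlap(
    ref_intervals, ref_pitches, est_intervals, est_pitches,
    onset_tolerance=0.05, offset_ratio=None)
print(f"Onset-only notewise F-measure: {f_measure:.3f}")
```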

Highlights

  • Automatic Music Transcription (AMT) is a widely discussed problem in Music Information Retrieval (MIR) (Benetos et al., 2019)

  • We present the benchmark evaluation metrics used for AMT and other works on transcription; in all cases, metrics are computed for each test piece and averaged over the whole dataset

  • The best correlation with ratings is achieved for much higher onset tolerance thresholds than those usually used for transcription system evaluation, both for the onset-only (Fn,On) and onset-offset (Fn,OnOff) notewise F-measures; the sketch after this list illustrates the effect of this threshold
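
The following sketch illustrates the onset tolerance threshold mentioned in the last highlight by sweeping it over a toy example (again using mir_eval; the notes are invented): timing deviations that count as errors at the standard 50 ms tolerance become matches at looser thresholds, so the onset-only F-measure rises. This shows only the mechanical effect of the threshold, not the paper's correlation result.

```python
import numpy as np
import mir_eval

# Toy reference and estimate: same pitches, estimated onsets deviate
# from the reference by 20 ms, 80 ms, and 150 ms respectively.
ref_intervals = np.array([[0.0, 0.5], [0.5, 1.0], [1.0, 2.0]])
ref_pitches = np.array([440.0, 493.88, 523.25])
est_intervals = np.array([[0.02, 0.48], [0.58, 1.1], [1.15, 2.0]])
est_pitches = np.array([440.0, 493.88, 523.25])

# Sweep the onset tolerance from the standard 50 ms up to 200 ms.
for tol in [0.05, 0.1, 0.15, 0.2]:
    _, _, f, _ = mir_eval.transcription.precision_recall_f1_overlap(
        ref_intervals, ref_pitches, est_intervals, est_pitches,
        onset_tolerance=tol, offset_ratio=None)
    print(f"tolerance {int(tol * 1000):3d} ms -> onset-only F = {f:.3f}")
```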

Introduction

Automatic Music Transcription (AMT) is a widely discussed problem in Music Information Retrieval (MIR) (Benetos et al., 2019). A common intermediate step is to obtain a MIDI-like representation, describing notes by their pitch and their onset and offset times in seconds, leaving aside problems such as stream separation, rhythm transcription, or pitch spelling. AMT has applications in various fields, in particular music education, music production and creation, and musicology, and it also serves as a pre-processing step for other MIR tasks such as cover song detection or structural segmentation. However, standard evaluation counts all errors equally, whereas not all mistakes are equally salient to human listeners: for instance, an out-of-key false positive will be much more noticeable than an extra note in a big chord, all the more so if that extra note fits the harmony.
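
As a concrete illustration of this MIDI-like representation, the sketch below uses the pretty_midi library to read a transcription into (pitch, onset, offset) triples. The file name is hypothetical, and this is not necessarily the authors' tooling; it only shows the kind of note-level data the task operates on.

```python
import pretty_midi

# Load a transcription stored as a MIDI file (path is illustrative).
midi = pretty_midi.PrettyMIDI('transcription.mid')

# Flatten all instruments into (pitch, onset, offset) triples: pitch as a
# MIDI note number, onset/offset in seconds, with no stream separation,
# rhythm transcription, or pitch spelling.
notes = [(note.pitch, note.start, note.end)
         for inst in midi.instruments
         for note in inst.notes]

# Print the first few notes in onset order.
for pitch, onset, offset in sorted(notes, key=lambda n: n[1])[:5]:
    print(f"pitch={pitch:3d}  onset={onset:6.3f}s  offset={offset:6.3f}s")
```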
