Abstract

Detecting the simultaneous activity of pitches in music audio recordings is a central task in music processing, commonly known as multi-pitch estimation or frame-wise polyphonic music transcription. Deep-learning approaches have recently achieved major improvements on this task, but the lack of large, annotated datasets beyond the piano solo scenario still limits their full potential. In this paper, we propose a strategy for training a CNN-based multi-pitch estimator on weakly aligned score-audio pairs of pieces in different instrumentations. To this end, we make use of a multi-label variant of the connectionist temporal classification loss (MCTC), recently proposed for image recognition tasks. We re-formalize the MCTC loss to make it applicable to multi-pitch estimation and perform several systematic experiments to analyze its behavior and robustness to training conditions. Finally, we report multi-pitch estimation results on common datasets using weakly aligned training with MCTC, which performs on par with systems trained on strongly aligned scores.
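
To make the training setup concrete, the sketch below shows one way a multi-label CTC loss could look in PyTorch. This is a minimal illustration, not the paper's exact formulation: it assumes the multi-pitch target can be decomposed into independent binary (on/off) CTC problems, one per pitch, each over a three-class alphabet {blank, off, on}. The function name mctc_loss and this decomposition are illustrative assumptions; only torch.nn.functional.ctc_loss is a real library primitive.

```python
import torch
import torch.nn.functional as F

def mctc_loss(logits, targets, input_lengths, target_lengths):
    """Hypothetical multi-label CTC surrogate: average of per-pitch binary CTC losses.

    logits:         (T, N, P, 3) raw scores per frame, batch item, pitch, and
                    class in {0: blank, 1: off, 2: on}
    targets:        list of P tensors, each (N, S) with label tokens in {1, 2}
    input_lengths:  (N,) number of valid frames per batch item
    target_lengths: list of P tensors, each (N,) with target sequence lengths
    """
    num_pitches = logits.shape[2]
    total = 0.0
    for p in range(num_pitches):
        # Per-pitch log-probabilities over the three-class alphabet.
        log_probs = F.log_softmax(logits[:, :, p, :], dim=-1)
        total = total + F.ctc_loss(
            log_probs, targets[p], input_lengths, target_lengths[p], blank=0
        )
    return total / num_pitches

# Toy usage: T=50 frames, batch N=2, P=4 pitches.
T, N, P = 50, 2, 4
logits = torch.randn(T, N, P, 3, requires_grad=True)
targets = [torch.randint(1, 3, (N, 5)) for _ in range(P)]
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = [torch.full((N,), 5, dtype=torch.long) for _ in range(P)]
loss = mctc_loss(logits, targets, input_lengths, target_lengths)
loss.backward()
```

A per-pitch decomposition like this is only one possible reading of a multi-label CTC; the paper's re-formalization of MCTC may couple the pitch categories differently.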
