Abstract

This paper describes an automatic singing transcription (AST) method that estimates a human-readable musical score of a sung melody from an input music signal. Because of the considerable pitch and temporal variation of a singing voice, a naive cascading approach that estimates an F0 contour and quantizes it with estimated tatum times cannot avoid many pitch and rhythm errors. To solve this problem, we formulate a unified generative model of a music signal that consists of a semi-Markov language model representing the generative process of latent musical notes conditioned on musical keys and an acoustic model based on a convolutional recurrent neural network (CRNN) representing the generative process of an observed music signal from the notes. The resulting CRNN-HSMM hybrid model enables us to estimate the most-likely musical notes from a music signal with the Viterbi algorithm, while leveraging both the grammatical knowledge about musical notes and the expressive power of the CRNN. The experimental results showed that the proposed method outperformed the conventional state-of-the-art method and that the integration of the musical language model with the acoustic model had a positive effect on the AST performance.
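The decoding step mentioned in the abstract follows the standard Viterbi recursion for a hidden semi-Markov model. The sketch below is a minimal NumPy illustration, not the authors' implementation: it assumes the CRNN acoustic model has already been reduced to per-tatum pitch log-likelihoods, and the transition matrix, duration distribution, and all variable names are illustrative.

```python
import numpy as np

def hsmm_viterbi(acoustic_ll, log_trans, log_dur, log_init):
    """Most-likely note sequence under a hidden semi-Markov model.

    acoustic_ll : (T, S) per-tatum log-likelihoods of each pitch state
                  (a stand-in for the CRNN acoustic model's output).
    log_trans   : (S, S) pitch-transition log-probabilities (language model).
    log_dur     : (D,)   log-probabilities of note durations 1..D tatums.
    log_init    : (S,)   initial pitch log-probabilities.
    Returns a list of (pitch_index, onset_tatum, duration_in_tatums).
    """
    T, S = acoustic_ll.shape
    D = log_dur.shape[0]
    # Prefix sums make the acoustic score of any tatum segment a single subtraction.
    cum = np.vstack([np.zeros(S), np.cumsum(acoustic_ll, axis=0)])   # (T+1, S)

    delta = np.full((T + 1, S), -np.inf)   # best score of a note ending at tatum t with pitch s
    back_state = np.zeros((T + 1, S), dtype=int)
    back_dur = np.zeros((T + 1, S), dtype=int)

    for t in range(1, T + 1):
        for d in range(1, min(D, t) + 1):
            seg_ll = cum[t] - cum[t - d]          # (S,) acoustic score of tatums (t-d, t]
            if t == d:                            # the very first note of the piece
                cand = log_init + log_dur[d - 1] + seg_ll
                prev = np.zeros(S, dtype=int)
            else:
                scores = delta[t - d][:, None] + log_trans   # (S_prev, S_next)
                prev = scores.argmax(axis=0)
                cand = scores.max(axis=0) + log_dur[d - 1] + seg_ll
            better = cand > delta[t]
            delta[t, better] = cand[better]
            back_state[t, better] = prev[better]
            back_dur[t, better] = d

    # Backtrack from the best final pitch to recover the note sequence.
    notes, t, s = [], T, int(delta[T].argmax())
    while t > 0:
        d = back_dur[t, s]
        notes.append((s, t - d, d))
        t, s = t - d, int(back_state[t, s])
    return notes[::-1]
```

In the proposed model the language model is additionally conditioned on the musical key and the acoustic scores come from the CRNN, but the decoding recursion has this same shape.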

Highlights

  • The aim of automatic singing transcription (AST) is to estimate a human-readable musical score of a singing voice from a given music signal

  • Since the melody line is usually the most salient part of music that influences the impression of a song, transcribed scores are useful for music information retrieval (MIR) tasks such as query-by-humming, musical grammar analysis [1], and singing voice generation [2]

  • The current state-of-the-art method of audio-to-score AST [11] is based on a hidden semi-Markov model (HSMM) consisting of a semi-Markov language model describing the generative process of a note sequence and a Cauchy acoustic model describing the generative process of an F0 contour from the musical notes (see the sketch after this list)
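As a concrete illustration of the Cauchy acoustic model named in the last highlight, the snippet below computes the log-likelihood of an observed F0 contour given a note's pitch; the cent units and the scale value are assumed for illustration and are not taken from the cited method.

```python
import numpy as np

def cauchy_f0_loglik(f0_cents, note_pitch_cents, scale_cents=30.0):
    """Log-likelihood of an F0 contour (in cents) under a Cauchy observation model
    centered on the note's pitch; the scale parameter is an illustrative value."""
    z = (np.asarray(f0_cents) - note_pitch_cents) / scale_cents
    return -np.log(np.pi * scale_cents * (1.0 + z ** 2))
```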


Introduction

The aim of automatic singing transcription (AST) is to estimate a human-readable musical score of a singing voice from a given music signal. To estimate the semitone-level pitches and tatum-level onset and offset times of musical notes from music signals, one may estimate a singing F0 trajectory [3,4,5,6] and quantize it on the semitone and tatum grids obtained by a beat-tracking method [7], where the tatum (e.g. the 16th-note level) refers to the smallest meaningful subdivision of the main beat (e.g. the quarter-note level). This approach, however, has no mechanism that avoids out-of-scale pitches, covers only constrained conditions (e.g. the use of synthetic sound signals), and has had only limited success [12,13,14,15].
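For concreteness, the following is a minimal sketch of that cascading baseline, assuming librosa for F0 estimation (pYIN) and beat tracking; the tatum subdivision, the pitch rounding rule, and the function name are illustrative choices rather than the pipeline of any cited work, and the hard quantization of a noisy F0 contour is exactly where the out-of-scale pitch and rhythm errors discussed above arise.

```python
import numpy as np
import librosa

def cascade_transcribe(path, tatums_per_beat=4):
    """Naive cascading baseline: F0 estimation, beat tracking, then quantization.
    Returns per-tatum (midi_pitch, start_time, end_time) segments; merging repeated
    pitches into longer notes is omitted for brevity. Assumes >= 2 detected beats."""
    y, sr = librosa.load(path, mono=True)

    # Step 1: frame-level F0 contour of the singing voice (pYIN).
    f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz('C2'),
                                 fmax=librosa.note_to_hz('C6'), sr=sr)
    times = librosa.times_like(f0, sr=sr)

    # Step 2: beat tracking, then subdivide each beat into tatums (e.g. 16th notes).
    _, beats = librosa.beat.beat_track(y=y, sr=sr, units='time')
    tatum_times = np.concatenate(
        [np.linspace(b0, b1, tatums_per_beat, endpoint=False)
         for b0, b1 in zip(beats[:-1], beats[1:])] + [beats[-1:]])

    # Step 3: quantize by majority-voting a semitone pitch in each tatum interval.
    notes = []
    for t0, t1 in zip(tatum_times[:-1], tatum_times[1:]):
        in_seg = (times >= t0) & (times < t1) & voiced & np.isfinite(f0)
        if not in_seg.any():
            continue
        midi = np.round(librosa.hz_to_midi(f0[in_seg])).astype(int)
        pitch = int(np.bincount(midi).argmax())
        notes.append((pitch, float(t0), float(t1)))
    return notes
```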
