This article describes automatic singing transcription (AST), the task of estimating a human-readable musical score of a sung melody, represented with quantized pitches and durations, from a given music audio signal. To this end, we propose a statistical method that estimates the musical score by quantizing a trajectory of vocal fundamental frequencies (F0s) in the time and frequency directions. Since vocal F0 trajectories deviate considerably from the pitches and onset times of the musical notes specified in a score, the local keys and rhythms of musical notes should be taken into account. We therefore propose a Bayesian hierarchical hidden semi-Markov model (HHSMM) that integrates a musical score model describing the local keys and rhythms of musical notes with an F0 trajectory model describing the temporal and frequency deviations of an F0 trajectory. Given an F0 trajectory, a sequence of musical notes, a sequence of local keys, and the temporal and frequency deviations can be estimated jointly with a Markov chain Monte Carlo (MCMC) method. We investigated the effect of each component of the proposed model and showed that the musical score model improves the performance of AST.
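To make the quantization idea concrete, below is a minimal sketch, not the paper's actual HHSMM: it decodes semitone pitches from an F0 trajectory with a plain HMM and Viterbi point estimation instead of Bayesian MCMC inference. A Gaussian emission stands in for the frequency-deviation model, and an interval-penalized transition stands in for the key-aware musical score model; the function name `transcribe_notes` and the parameters `sigma` and `stay_prob` are illustrative assumptions, not from the paper.

```python
import numpy as np

def f0_to_midi(f0_hz):
    """Convert F0 values in Hz to continuous MIDI pitch numbers."""
    return 69.0 + 12.0 * np.log2(np.asarray(f0_hz) / 440.0)

def transcribe_notes(f0_hz, midi_lo=48, midi_hi=84, sigma=0.5, stay_prob=0.98):
    """Viterbi decoding of quantized pitches from an F0 trajectory.

    Emission: Gaussian deviation (in semitones) of the observed F0 around
    each candidate pitch -- a crude stand-in for the frequency-deviation model.
    Transition: strong self-transition plus an interval prior favoring small
    pitch steps -- a crude stand-in for the key-aware score model.
    """
    obs = f0_to_midi(f0_hz)                      # (T,) observed pitches
    pitches = np.arange(midi_lo, midi_hi + 1)    # (S,) candidate note pitches
    S, T = len(pitches), len(obs)

    # Log-emission probabilities: Gaussian in the semitone domain.
    log_emit = -0.5 * ((obs[None, :] - pitches[:, None]) / sigma) ** 2

    # Log-transition matrix: self-transition vs. interval-penalized jump.
    interval = np.abs(pitches[:, None] - pitches[None, :])
    jump = np.where(interval > 0, np.exp(-0.5 * interval), 0.0)
    jump = jump / jump.sum(axis=1, keepdims=True) * (1.0 - stay_prob)
    log_trans = np.log(jump + np.eye(S) * stay_prob)

    # Standard Viterbi recursion with backpointers.
    delta = log_emit[:, 0].copy()
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans      # (prev_state, next_state)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[:, t]

    # Backtrace the best pitch path.
    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return pitches[path]

# Toy usage: a noisy step from C4 (261.6 Hz) to E4 (329.6 Hz).
rng = np.random.default_rng(0)
f0 = np.concatenate([np.full(50, 261.6), np.full(50, 329.6)])
f0 *= 2 ** (rng.normal(0.0, 0.3, size=100) / 12)  # +-0.3 semitone jitter
print(transcribe_notes(f0)[::10])                 # expect mostly 60s, then 64s
```

The self-transition probability plays the role of a duration model here; the paper's semi-Markov formulation instead models note durations explicitly, which is what lets it recover quantized rhythms as well as pitches.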