Abstract
This thesis deals with the problem of Automatic Music Transcription (AMT), which aims to extract pitch and timing information from recorded music signals. AMT is a challenging problem that is closely related to source separation and sparse modeling. Many approaches use latent variable models, where the goal is to extract the underlying explanatory factors (musical pitches) that best explain the signal in question. A fundamental technique is the Nonnegative Matrix Factorization (NMF) algorithm, which seeks to decompose the signal into a linear combination of nonnegative templates. However, NMF fails to account for the structure of music signals, such as smoothness over time. We introduce extensions of NMF that model music signals more accurately. The motivating assumption is that good transcriptions tend to have a low-rank structure, which, when taken into account, can improve transcription performance. First, we extend classical NMF to a Low-Rank NMF model, based on work in low-rank matrix completion. We explore the connection between optimization of the matrix nuclear norm and proximal algorithms to derive a model that results in low-rank transcriptions. The nuclear norm approach is then extended to non-convex penalties, which more accurately reflect the desired low-rank assumption. Next, we extend these ideas to models in which the resulting transcription is locally low-rank, which we argue is a better model of music signals. An algorithm based on NMF and submodular function optimization is introduced, which learns a collection of local models. It is shown that this leads to further improvement on the AMT task. Finally, we develop a probabilistic framework that represents the signal using a hierarchy of local models, and discuss the interpretation of the proposed approaches as hard and soft clustering methods. We find that the proposed probabilistic “soft clustering” algorithm leads to further performance gains on the AMT task, outperforming comparable state-of-the-art AMT systems that are based on NMF.
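As an illustration of two building blocks named above, the following is a minimal sketch, not the thesis's own implementation. It assumes the standard multiplicative-update rule for NMF under a Euclidean cost and singular-value soft-thresholding as the proximal operator of the nuclear norm. The function names, the parameters `rank`, `n_iter`, `eps`, and `tau`, and the choice of cost function are all assumptions made for the example.

```python
import numpy as np


def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Factor a nonnegative matrix V (freq x frames) as W @ H.

    W holds nonnegative spectral templates and H their activations over time.
    Uses multiplicative updates for the Euclidean (Frobenius) cost; this is an
    assumed, textbook variant, not necessarily the one used in the thesis.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, rank)) + eps
    H = rng.random((rank, n_frames)) + eps
    for _ in range(n_iter):
        # Elementwise multiplicative updates keep W and H nonnegative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def singular_value_threshold(X, tau):
    """Proximal operator of tau * (nuclear norm) evaluated at X.

    Shrinks every singular value by tau and clips at zero, which tends to
    lower the rank of the result.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return (U * s_shrunk) @ Vt


if __name__ == "__main__":
    # Synthetic "spectrogram" built from 3 nonnegative templates (hypothetical data).
    rng = np.random.default_rng(1)
    V = rng.random((12, 3)) @ rng.random((3, 20))
    W, H = nmf(V, rank=3)
    print("relative error:", np.linalg.norm(V - W @ H) / np.linalg.norm(V))
    print("rank after thresholding:",
          np.linalg.matrix_rank(singular_value_threshold(H, tau=1.0)))
```

In a low-rank variant, a thresholding step of this kind could be interleaved with the activation update so that `H` is pushed toward low rank; how the thesis actually combines the two is not specified in the abstract.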