Abstract

Transcribing audio input to obtain a music score has been a fundamental yet difficult problem in music signal and information processing. It can be paralleled with continuous speech recognition, which comprises acoustic analysis, an acoustic model, a language model, and a decoder. This presentation discusses the corresponding functional modules for the music case, i.e., multiple-fundamental-frequency (multi-F0) estimation, rhythm modeling, and chord modeling. For multi-F0 estimation, an approach motivated by computational auditory scene analysis is taken to model human hearing, in which each acoustic object is modeled with a Gaussian mixture along both the frequency and time axes [Kameoka (2005)]. An extended EM algorithm is applied to iteratively estimate the fundamental frequencies and onset/offset timings of the musical notes contained in the spectrogram of the audio input. For rhythm estimation, an HMM is used to model the sequence of note lengths, with tempo treated as a time-varying hidden variable [Takeda (2006)]. Chord progression is also modeled with an HMM [Kawakami (2000)], in which the transition probabilities between chords and the emission probabilities of notes given the hypothesized chord are trained on a music database. Related issues are also discussed, including alternative approaches to multiple-fundamental-frequency estimation based on specmurt analysis [Sagayama (2005)] and non-negative matrix factorization [Raczynski (2007)], as well as timbre modeling [Miyamoto (2007)].
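To make the chord-progression component concrete, the sketch below shows a toy HMM decoded with the standard Viterbi algorithm: hidden states are chords, observations are melody notes, and the most likely chord sequence is recovered from the note sequence. The chord and note sets, the transition and emission probabilities, and the helper name `viterbi` are illustrative assumptions, not values or code from the cited work, which trains these parameters on a music database.

```python
import numpy as np

# Toy chord-progression HMM in the spirit of [Kawakami (2000)]:
# hidden states are chords, observations are melody notes.
# All probability values below are illustrative placeholders.

chords = ["C", "F", "G"]                      # hidden states
notes = ["C", "D", "E", "F", "G", "A", "B"]   # observation symbols

# P(next chord | current chord)
trans = np.array([
    [0.6, 0.2, 0.2],   # from C
    [0.3, 0.5, 0.2],   # from F
    [0.5, 0.1, 0.4],   # from G
])

# P(note | chord): chord tones are given higher probability (assumed values)
emit = np.array([
    # C     D     E     F     G     A     B
    [0.30, 0.05, 0.25, 0.05, 0.25, 0.05, 0.05],  # C major
    [0.25, 0.05, 0.05, 0.30, 0.05, 0.25, 0.05],  # F major
    [0.05, 0.25, 0.05, 0.05, 0.30, 0.05, 0.25],  # G major
])

init = np.array([0.6, 0.2, 0.2])  # P(first chord)


def viterbi(obs_idx):
    """Return the most likely chord sequence for an observed note sequence."""
    n_states, T = len(chords), len(obs_idx)
    logp = np.full((n_states, T), -np.inf)      # best log-probability per state/time
    back = np.zeros((n_states, T), dtype=int)   # backpointers

    logp[:, 0] = np.log(init) + np.log(emit[:, obs_idx[0]])
    for t in range(1, T):
        for j in range(n_states):
            scores = logp[:, t - 1] + np.log(trans[:, j])
            back[j, t] = np.argmax(scores)
            logp[j, t] = scores[back[j, t]] + np.log(emit[j, obs_idx[t]])

    # Backtrack from the best final state.
    path = [int(np.argmax(logp[:, -1]))]
    for t in range(T - 1, 0, -1):
        path.append(back[path[-1], t])
    return [chords[s] for s in reversed(path)]


melody = ["C", "E", "G", "F", "A", "G", "B", "C"]
obs = [notes.index(n) for n in melody]
print(viterbi(obs))  # most likely chord label for each melody note
```

The same decoding machinery carries over to the rhythm model described above, with note lengths as observations and tempo folded into the hidden state.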
