Abstract

We present a descriptive approach for analyzing audio scenes that may contain a mixture of audio sources, and apply it to segmenting popular music songs into vocal and non-vocal sections. Unlike existing methods that rely directly on within-class feature similarities of acoustic sources, the proposed data-driven system uses a training set in which the acoustic sources are grouped by their perceptual or semantic attributes. The analysis is built around a quantitative, time-varying metric, developed with pattern recognition methods, that measures the interaction between the acoustic sources present in a scene. With the proposed system trained on a general sound-effects library, we achieve a vocal-section segmentation error below ten percent and a false alarm rate below five percent on a database of popular music recordings spanning four genres (rock, hip-hop, pop, and easy listening).

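To make the two reported figures concrete, the sketch below shows one common way of scoring frame-level vocal/non-vocal decisions: segmentation error as the fraction of misclassified frames, and false alarm rate as the fraction of truly non-vocal frames labeled vocal. This is an illustrative assumption about the scoring, not the paper's implementation, and the function and variable names are hypothetical.

```python
# Minimal sketch (hypothetical, not the authors' code) of scoring
# frame-level vocal/non-vocal labels with the two metrics quoted above.
import numpy as np


def segmentation_scores(reference: np.ndarray, predicted: np.ndarray):
    """Return (segmentation_error, false_alarm_rate) for binary frame labels.

    reference, predicted: 1-D arrays of 0 (non-vocal) / 1 (vocal) per frame.
    """
    reference = np.asarray(reference, dtype=bool)
    predicted = np.asarray(predicted, dtype=bool)

    # Segmentation error: fraction of frames whose label disagrees with the reference.
    seg_error = float(np.mean(reference != predicted))

    # False alarm rate: fraction of truly non-vocal frames labeled as vocal.
    non_vocal = ~reference
    false_alarm = float(np.mean(predicted[non_vocal])) if non_vocal.any() else 0.0

    return seg_error, false_alarm


if __name__ == "__main__":
    ref = np.array([0, 0, 1, 1, 1, 0, 0, 1])  # ground-truth vocal activity per frame
    hyp = np.array([0, 1, 1, 1, 0, 0, 0, 1])  # system output per frame
    err, fa = segmentation_scores(ref, hyp)
    print(f"segmentation error = {err:.2%}, false alarm rate = {fa:.2%}")
```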