Abstract

Environmental audio recognition on mobile devices is difficult because of background noise, unseen audio events, and changes in audio channel characteristics caused by the phone's context, e.g., whether the phone is in the user's pocket or hand. We propose a crowdsourcing framework that jointly models scene, event, and phone context to overcome these issues. The framework gathers audio data from many users and shares user-generated models through a cloud server to classify unseen audio data accurately. A Gaussian histogram represents an audio clip with a small number of parameters, and a k-nearest-neighbor classifier allows new training data to be incorporated into the system easily. Using the Kullback-Leibler divergence between two Gaussian histograms as the distance measure, we find that audio scenes, events, and phone contexts are classified with 85.2%, 77.6%, and 88.9% accuracy, respectively.
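To illustrate the classification pipeline described above, the following is a minimal sketch of KL-divergence-based k-nearest-neighbor classification. It assumes each clip's Gaussian histogram is summarized by a single diagonal-covariance Gaussian over frame-level features and that the KL divergence is symmetrized for use as a distance; the paper's actual representation and distance computation may differ, and all function names here are illustrative.

```python
import numpy as np

def kl_diag_gaussian(mu_p, var_p, mu_q, var_q):
    """Closed-form KL(p || q) between two diagonal-covariance Gaussians."""
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )

def symmetric_kl(clip_a, clip_b):
    """Symmetrized KL divergence, a common choice when KL is used as a k-NN distance."""
    return kl_diag_gaussian(*clip_a, *clip_b) + kl_diag_gaussian(*clip_b, *clip_a)

def knn_classify(query, training_set, k=3):
    """Label a query clip by majority vote among its k nearest training clips.

    query: (mu, var) summarizing the query clip's frame-level features.
    training_set: list of ((mu, var), label) pairs contributed by users.
    """
    dists = sorted(
        (symmetric_kl(query, model), label) for model, label in training_set
    )
    votes = [label for _, label in dists[:k]]
    return max(set(votes), key=votes.count)

# Hypothetical usage: mu/var would come from frame-level features (e.g., MFCCs) of each clip.
train = [((np.array([0.1, 0.2]), np.array([1.0, 0.5])), "street"),
         ((np.array([2.0, 1.5]), np.array([0.8, 0.9])), "office")]
query = (np.array([0.2, 0.1]), np.array([1.1, 0.6]))
print(knn_classify(query, train, k=1))
```

Because new user-contributed clips only need to be appended to the training set, this kind of instance-based classifier makes it straightforward to fold crowdsourced models into the system without retraining.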
