Sound events can be used to establish context and assist a user in performing context-dependent tasks. State-of-the-art methods can identify isolated sound events, even in background noise when it can be modeled; recognizing mixed sounds, however, remains a challenge. The problem consists of identifying all the sounds occurring in a stream. In this paper we propose an audio representation suitable for identifying mixed sounds without background/foreground modeling. Our approach is lightweight in both computational and space complexity, and the final representation does not depend on the length of the input sound. We extract spectral, band-split, frame-level features and their first and second derivatives in each band. The final representation is a set of histograms, one per band. We show experimentally that this representation is robust and allows the identification of overlapping sound events. We compared our approach against a representation based on Mel Frequency Cepstral Coefficients and Non-negative Matrix Factorization for single-microphone blind source separation, the only approach comparable to ours. For testing we conducted two different sets of experiments. In the first, we collected poor-quality audio recordings using a low-end smartphone for training. Without further enhancement or processing, we were able to identify the components of classes of sound mixtures, even with sounds downloaded from the Internet, where we had no control over the recording conditions or the foreground noise. In the second set of experiments, we recorded 15 challenging sound classes with similar spectra, drawn from an application scenario, and identified them in a continuous recording with three types of background noise. Our results outperform the state of the art in speed, precision, and recall.
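The sketch below illustrates the kind of representation the abstract describes: band-split, frame-level features with first and second derivatives, summarized as one histogram per band so the output length is independent of the recording's duration. The specific choices here (STFT parameters, number of bands, histogram bin count, and per-band log-energy as the frame-level feature) are illustrative assumptions of ours, not the paper's exact configuration.

```python
import numpy as np
from scipy.signal import stft

def band_histogram_features(signal, sr, n_bands=8, n_bins=32):
    """Length-independent representation: one histogram per frequency band,
    over the band's frame-level feature and its 1st/2nd derivatives.
    n_bands, n_bins, and the log-energy feature are illustrative choices."""
    # Frame-level magnitude spectrogram (STFT parameters are assumptions).
    _, _, Z = stft(signal, fs=sr, nperseg=1024, noverlap=512)
    mag = np.abs(Z)                               # (freq_bins, frames)

    # Split the frequency axis into contiguous bands.
    bands = np.array_split(mag, n_bands, axis=0)
    histograms = []
    for band in bands:
        # Frame-level feature for this band: log-energy (assumption).
        feat = np.log1p(band.sum(axis=0))         # (frames,)
        d1 = np.gradient(feat)                    # first derivative
        d2 = np.gradient(d1)                      # second derivative
        # Histogram each trajectory; the concatenation has fixed length
        # regardless of how many frames the input contains.
        for x in (feat, d1, d2):
            h, _ = np.histogram(x, bins=n_bins, density=True)
            histograms.append(h)
    return np.concatenate(histograms)             # (n_bands * 3 * n_bins,)
```

Because each band's feature trajectories are reduced to fixed-bin histograms, a few-second clip and a long continuous recording yield vectors of the same size, which is what makes the representation suitable for comparing and classifying sounds of arbitrary duration.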