Abstract

Acoustic scene classification (ASC) aims to infer information about the environment from audio segments. Inter-class similarity is a significant issue in ASC, as acoustic scenes with different labels may sound quite similar. In this paper, the similarity relations among scenes are related to the classification error. A method that constructs a class hierarchy from the classification error is then proposed and integrated into a multitask learning framework. Experiments show that the proposed multitask learning method improves ASC performance. On the TUT Acoustic Scene 2017 dataset, we obtain an ensemble fine-grained accuracy of 81.4%, which is better than the state of the art. With multitask learning, the basic Convolutional Neural Network (CNN) model improves by about 2.0% to 3.5%, depending on the spectrogram used. The coarse category accuracies (for two to six super-classes) range from 77.0% to 96.2% with single models. On the revised version of the LITIS Rouen dataset, we achieve an ensemble fine-grained accuracy of 83.9%. The multitask learning models improve on their basic models by 1.6% to 1.8%. The coarse category accuracies range from 94.9% to 97.9% for two to six super-classes with single models.

Highlights

  • Acoustic scene classification (ASC) refers to the task of associating a semantic label to an audio stream that identifies the environment in which it has been produced [1]

  • We focus on the inter-class similarity problem in ASC and adopt a multitask learning solution similar to that used in fine-grained visual recognition

  • The class hierarchy is further incorporated into a self-organized multitask learning framework
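The idea of building super-classes from classification error can be sketched as follows. This is a minimal illustration, not the paper's procedure, which this summary does not spell out: it assumes a confusion matrix from a baseline classifier is available, treats the symmetrized confusion rate as a similarity between scenes, and merges classes with average-linkage agglomerative clustering. The function names `confusion_similarity`, `build_super_classes`, and `coarse_label_map` are hypothetical.

```python
def confusion_similarity(conf):
    """Symmetrized confusion rate between every pair of classes.

    conf[i][j] is the number of clips of true class i predicted as class j.
    Assumption: scenes that are often confused with each other are "similar".
    """
    n = len(conf)
    rates = [[conf[i][j] / max(sum(conf[i]), 1) for j in range(n)]
             for i in range(n)]
    return [[0.0 if i == j else (rates[i][j] + rates[j][i]) / 2
             for j in range(n)] for i in range(n)]


def build_super_classes(conf, k):
    """Greedily merge the most-confused groups until k super-classes remain.

    Average linkage is an illustrative choice, not taken from the paper.
    """
    sim = confusion_similarity(conf)
    clusters = [[i] for i in range(len(conf))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = len(clusters[a]) * len(clusters[b])
                score = sum(sim[i][j] for i in clusters[a]
                            for j in clusters[b]) / pairs
                if best is None or score > best[0]:
                    best = (score, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return sorted(clusters)


def coarse_label_map(clusters):
    """Map each fine-grained class index to its super-class index."""
    return {fine: coarse
            for coarse, members in enumerate(clusters)
            for fine in members}
```

In a multitask setup, the mapping returned by `coarse_label_map` could supply the coarse label for an auxiliary classification head trained alongside the fine-grained one; how the two losses are weighted is left open here.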


Summary

Introduction

Acoustic scene classification (ASC) refers to the task of associating a semantic label to an audio stream that identifies the environment in which it has been produced [1]. The task takes a relatively long sound clip as input and outputs a predicted acoustic scene class, e.g., home, park, or bus. Classifying scenes from audio data has unique advantages. Audio recording is not restricted by factors such as camera angle or illumination. Sound collection equipment can be installed across a wider range of locations, where object occlusion is no longer a problem. Collection also works equally well in a dark environment.

