Abstract

Acoustic scene analysis has attracted considerable attention recently. Existing methods are mostly supervised and therefore require well-defined acoustic scene categories and accurate labels. In practice, a large amount of unlabeled audio data is available, but labeling large-scale data is both costly and time-consuming. Unsupervised acoustic scene analysis, on the other hand, does not require manual labeling, but it is known to perform significantly worse and has therefore not been well explored. In this paper, a new unsupervised method based on deep auto-encoder networks and spectral clustering is proposed. It first extracts a bottleneck feature from the original acoustic feature of each audio clip with an auto-encoder network, and then employs spectral clustering to further reduce the noise and unrelated information in the bottleneck feature. Finally, it conducts hierarchical clustering on the low-dimensional output of the spectral clustering. To fully utilize the spatial information of stereo audio, we further apply the binaural representation and conduct joint clustering on it. To the best of our knowledge, this is the first time that a binaural representation has been used in unsupervised learning. Experimental results show that the proposed method outperforms the state-of-the-art competing methods.
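The three stages named in the abstract (auto-encoder bottleneck, spectral embedding, hierarchical clustering) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single tanh hidden layer, the Gaussian affinity with a median bandwidth, the average-linkage merging, all function names, and the synthetic stand-in for acoustic features are assumptions made for illustration only. The binaural joint clustering step is omitted.

```python
import numpy as np


def bottleneck_features(X, n_hidden=8, n_epochs=500, lr=0.01, seed=0):
    """Train a one-hidden-layer tanh auto-encoder (assumed architecture)
    by full-batch gradient descent and return the hidden-layer codes."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (d, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, (n_hidden, d))
    b2 = np.zeros(d)
    for _ in range(n_epochs):
        H = np.tanh(X @ W1 + b1)                 # encoder output
        err = (H @ W2 + b2 - X) * (2.0 / n)      # d(loss)/d(reconstruction)
        grad_H = (err @ W2.T) * (1.0 - H ** 2)   # backprop through tanh
        W2 -= lr * (H.T @ err)
        b2 -= lr * err.sum(axis=0)
        W1 -= lr * (X.T @ grad_H)
        b1 -= lr * grad_H.sum(axis=0)
    return np.tanh(X @ W1 + b1)


def spectral_embedding(Z, n_components):
    """Embed rows of Z with the top eigenvectors of the normalized
    Gaussian affinity matrix (bandwidth choice is an assumption)."""
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    sigma2 = np.median(sq[sq > 0])               # median heuristic
    A = np.exp(-sq / sigma2)
    np.fill_diagonal(A, 0.0)
    deg = A.sum(axis=1)
    L = A / np.sqrt(np.outer(deg, deg))          # D^{-1/2} A D^{-1/2}
    _, vecs = np.linalg.eigh(L)                  # eigenvalues in ascending order
    U = vecs[:, -n_components:]
    return U / np.linalg.norm(U, axis=1, keepdims=True)


def agglomerative(U, n_clusters):
    """Naive average-linkage hierarchical clustering on the rows of U."""
    D = np.linalg.norm(U[:, None, :] - U[None, :, :], axis=-1)
    clusters = [[i] for i in range(len(U))]
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = D[np.ix_(clusters[a], clusters[b])].mean()
                if dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)           # merge the closest pair
    labels = np.empty(len(U), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels


def cluster_scenes(X, n_clusters, n_hidden=8):
    """Chain the three stages: bottleneck -> spectral embedding -> hierarchy."""
    codes = bottleneck_features(X, n_hidden=n_hidden)
    embedding = spectral_embedding(codes, n_components=n_clusters)
    return agglomerative(embedding, n_clusters)


# Synthetic stand-in for per-clip acoustic features: 3 groups of 20 clips.
rng = np.random.default_rng(0)
centers = rng.normal(0.0, 1.0, (3, 20))
X_demo = np.vstack([c + 0.1 * rng.normal(size=(20, 20)) for c in centers])
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
labels_demo = cluster_scenes(X_demo, n_clusters=3)
print(labels_demo)
```

In this sketch the spectral step is used only to produce a low-dimensional embedding, and the final partition comes from the hierarchical step, mirroring the order described above. For real audio, `X_demo` would be replaced by features extracted from each clip.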

Highlights

  • Acoustic scene analysis, which aims to recognize acoustic environments [3,4], has received considerable research attention recently [1,2]

  • We propose an unsupervised acoustic scene analysis algorithm based on auto-encoder networks, named joint auto-encoder network with spectral clustering (JAESC), for stereo audio clips

  • The binaural representation is extracted from each audio clip first


Summary

Introduction

Acoustic scene analysis, which aims to recognize acoustic environments [3,4], has received considerable research attention recently [1,2]. It finds applications in many audio-enabled systems, such as cars, robots, context-aware mobile devices, and intelligent monitoring systems. Conventional acoustic scene analysis has been mainly supervised and is known as acoustic scene classification. The effectiveness of supervised acoustic scene classification relies heavily on the quality of manually labeled data, and manually labeling large-scale acoustic data for classifier training is known to be time-consuming and expensive.

