Abstract

Acoustic scene classification involves pairs of classes that are frequently misclassified because they share many acoustic properties. Specific details can provide vital clues for distinguishing such pairs; however, these details are often subtle and hard to generalize across different data distributions. In this study, we investigate methods that capture discriminative information while simultaneously improving generalization. We adopt a max feature map method that replaces conventional non-linear activation functions in deep neural networks, applying an element-wise comparison between the different filters of a convolution layer’s output. Two data augmentation methods and two deep architecture modules are further explored to reduce overfitting while sustaining the system’s discriminative power. Experiments are conducted on the “detection and classification of acoustic scenes and events 2020 task 1-a” dataset to validate the proposed methods. The proposed system consistently outperforms the baseline, achieving an accuracy of 70.4% compared to the baseline’s 65.1%.
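The following is a minimal PyTorch sketch of the general idea behind a max feature map (MFM) activation replacing a ReLU-style activation after a convolution. The block structure, layer sizes, and module names (MaxFeatureMap, MFMConvBlock) are illustrative assumptions rather than the authors’ exact architecture: the convolution produces twice the desired number of channels, and MFM keeps the element-wise maximum of the two channel halves.

```python
# Minimal sketch of a Max Feature Map (MFM) activation; module names and
# layer sizes are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn


class MaxFeatureMap(nn.Module):
    """Element-wise max over the two halves of the channel dimension."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 2 * C, freq, time) -> (batch, C, freq, time)
        a, b = torch.chunk(x, chunks=2, dim=1)
        return torch.max(a, b)


class MFMConvBlock(nn.Module):
    """Convolution followed by MFM instead of a ReLU-style activation."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # The conv outputs twice the target channels; MFM halves them again.
        self.conv = nn.Conv2d(in_channels, 2 * out_channels,
                              kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(2 * out_channels)
        self.mfm = MaxFeatureMap()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mfm(self.bn(self.conv(x)))


if __name__ == "__main__":
    # Example input: a batch of 4 single-channel spectrograms
    # (the 128 x 423 time-frequency shape is purely illustrative).
    spec = torch.randn(4, 1, 128, 423)
    block = MFMConvBlock(in_channels=1, out_channels=32)
    print(block(spec).shape)  # torch.Size([4, 32, 128, 423])
```

Because only the larger of each pair of filter responses is propagated, the activation acts as a learned, competitive selection between filters rather than a fixed threshold such as ReLU, which is the property the abstract refers to as element-wise comparison between filters.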

Highlights

  • The detection and classification of acoustic scenes and events (DCASE) community has been hosting multiple challenges that utilize sound event information generated in everyday environments and by physical events [1,2,3]

  • Among the many types of tasks covered in DCASE challenges, acoustic scene classification (ASC) is a multi-class classification task that classifies an input recording into a predefined scene

  • For the ASC task, we assumed that the information enabling discrimination between different scenes with similar characteristics may be specific and reside in small, particular regions throughout the recording


Introduction

The detection and classification of acoustic scenes and events (DCASE) community has been hosting multiple challenges that utilize sound event information generated in everyday environments and by physical events [1,2,3]. Among the tasks covered in DCASE challenges, acoustic scene classification (ASC) is a multi-class classification task that assigns an input recording to a predefined scene. In the process of developing an ASC system, the recent research literature has widely explored two major issues: generalization toward unknown devices and frequently misclassified scene pairs. Several ASC studies report that model performance degrades significantly when testing with audio recordings captured by unknown devices [8,9,10]. Another critical issue is the occurrence of frequently misclassified class pairs (e.g., shopping mall-airport, tram-metro) [11,12]. Deep neural networks (DNNs) that use ReLU activation variants might perform worse on different data distributions, as reported in [13].
