CNN-Based Acoustic Scene Classification System

Yerin Lee, Il-Youp Kwak, Soyoung Lim

doi:10.3390/electronics10040371

Open Access

Journal: Electronics	Publication Date: Feb 3, 2021
Citations: 17	License type: CC BY 4.0

Affiliation: Chung-Ang University

Abstract

Acoustic scene classification (ASC) categorizes an audio file according to the environment in which it was recorded. The problem has long been studied in the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge. This paper presents the solution to Task 1 of the DCASE 2020 challenge submitted by the Chung-Ang University team. Task 1 addressed two challenges that ASC faces in real-world applications: the classifier should generalize across audio recorded with different devices, and the model should have low complexity. We proposed two models to address these problems. First, a more general classification model was built by combining harmonic-percussive source separation (HPSS) and deltas-deltadeltas features with four different models. Second, using the same features, depthwise separable convolution was applied to the convolutional layers to develop a low-complexity model. Moreover, using gradient-weighted class activation mapping (Grad-CAM), we investigated which parts of the input features the model attends to when making its predictions. Our proposed systems ranked 9th and 7th in the competition for the two subtasks, respectively.
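To illustrate why depthwise separable convolution lowers model complexity, the following minimal sketch compares the weight count of a standard convolution with that of its depthwise separable counterpart. The layer sizes (3x3 kernel, 64 input channels, 128 output channels) and the function names are assumptions chosen for illustration; they are not taken from the paper's architecture, and bias terms are omitted.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard 2-D convolution (bias omitted)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depthwise convolution followed by a 1x1 pointwise convolution."""
    depthwise = kernel * kernel * c_in   # one kernel x kernel filter per input channel
    pointwise = c_in * c_out             # 1x1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, not the paper's configuration.
kernel, c_in, c_out = 3, 64, 128
std = standard_conv_params(kernel, c_in, c_out)
sep = depthwise_separable_params(kernel, c_in, c_out)
print(f"standard: {std}, separable: {sep}, reduction: {std / sep:.2f}x")
```

For a k x k kernel the reduction factor approaches k² as the number of output channels grows, which is why the saving is largest in wide layers.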

Highlights

In recent years, acoustic scene classification (ASC) has attracted widespread attention in the Audio and Acoustic Signal Processing (AASP) community [1,2,3,4,5,6].
In subtask A, each model trained on deltas-deltadeltas features achieved an accuracy above 60%, while the same model trained on harmonic-percussive source separation (HPSS) features was less accurate.
With HPSS as the feature, among the four models compared (convolutional neural network (CNN), ResNet, LCNN, and InceptionLike), the CNN proposed in Han and Park [8] and Sakashita and Aono [9] achieved the highest accuracy, 59.36%.
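As background for the deltas-deltadeltas features mentioned above, here is a minimal sketch of how delta (first-order) and delta-delta (second-order) features can be derived from one time series of spectral values. It uses the common regression formula with edge padding and a window of N = 2 frames. Both choices, and the function name `delta`, are assumptions made for illustration; the source does not state how the authors computed these features.

```python
def delta(frames, n_ctx=2):
    """Regression-based delta of a 1-D sequence, with edge padding.

    d[t] = sum_{n=1..N} n * (x[t+n] - x[t-n]) / (2 * sum_{n=1..N} n^2)
    """
    # Repeat the first and last values so every frame has full context.
    padded = [frames[0]] * n_ctx + list(frames) + [frames[-1]] * n_ctx
    denom = 2 * sum(n * n for n in range(1, n_ctx + 1))
    out = []
    for t in range(len(frames)):
        centre = t + n_ctx  # index of frame t inside the padded sequence
        num = sum(n * (padded[centre + n] - padded[centre - n])
                  for n in range(1, n_ctx + 1))
        out.append(num / denom)
    return out


# Hypothetical log-mel energies of a single frequency bin over six frames.
energies = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
deltas = delta(energies)        # local slope over time
delta_deltas = delta(deltas)    # slope of the slope
print(deltas)
print(delta_deltas)
```

In a typical pipeline the static, delta, and delta-delta arrays are stacked as separate input channels, so the network sees both the spectrum and how it changes over time.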

Summary

Introduction

Acoustic scene classification (ASC) has attracted widespread attention in the Audio and Acoustic Signal Processing (AASP) community [1,2,3,4,5,6]. ASC aims to classify a test recording into one of a set of predefined classes that characterize the environment in which it was recorded [7]. The IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) was first held in 2013 and has taken place every year since 2016. The DCASE 2020 ASC task is divided into two subtasks, A and B. Subtask A aims to classify audio into ten classes.

Methods

Results

Discussion

Conclusion