Sound Event Localization and Detection Using Imbalanced Real and Synthetic Data via Multi-Generator

Yeongseo Shin, Chanjun Chun

doi:10.3390/s23073398

Journal: Sensors	Publication Date: Mar 23, 2023
License type: CC BY 4.0

Affiliation: Chosun University

Abstract

This study proposes a sound event localization and detection (SELD) method using imbalanced real and synthetic data via a multi-generator. The proposed method is based on a residual convolutional neural network (RCNN) and a transformer encoder for real spatial sound scenes. SELD aims to classify each sound event, detect the onset and offset of the classified event, and estimate the direction of the sound event. In Detection and Classification of Acoustic Scenes and Events (DCASE) 2022 Task 3, SELD is performed with a small amount of real spatial sound scene data and a relatively large amount of synthetic data. When a model is trained on such imbalanced data, training can be dominated by the more numerous data. To prevent this, we propose a multi-generator that samples real and synthetic data at a specific ratio within each batch. We applied the SpecAugment data augmentation technique, which uses time-frequency masking, to the dataset. Furthermore, we propose a neural network architecture that combines the RCNN and the transformer encoder. Several models were trained with various structures and hyperparameters, and several ensemble models were obtained by "cherry-picking" specific models. In the experiments, both the single model of the proposed method and the ensemble model outperformed the baseline model.
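
The batch-mixing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `mixed_batches`, the `real_fraction` parameter, and sampling with replacement are all assumptions made for illustration. The abstract states only that real and synthetic data are sampled at a specific rate within one batch.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def mixed_batches(
    real: Sequence,
    synthetic: Sequence,
    batch_size: int,
    real_fraction: float,
    num_batches: int,
    seed: int = 0,
) -> Iterator[List[Tuple[str, object]]]:
    """Yield batches holding a fixed number of real and synthetic items.

    Hypothetical helper: the name, signature, and sampling strategy are
    illustrative assumptions, not taken from the paper.
    """
    if not 0.0 <= real_fraction <= 1.0:
        raise ValueError("real_fraction must lie in [0, 1]")
    rng = random.Random(seed)
    n_real = round(batch_size * real_fraction)  # real items per batch
    n_syn = batch_size - n_real                 # synthetic items per batch
    for _ in range(num_batches):
        # Sample with replacement (an assumption) so the small real set
        # can appear in every batch despite the size imbalance.
        batch = [("real", x) for x in rng.choices(real, k=n_real)]
        batch += [("synthetic", x) for x in rng.choices(synthetic, k=n_syn)]
        rng.shuffle(batch)  # avoid a fixed real/synthetic ordering
        yield batch


# Example: 10 real clips versus 1000 synthetic clips, 25% real per batch.
real_clips = list(range(10))
synthetic_clips = list(range(100, 1100))
batches = list(
    mixed_batches(real_clips, synthetic_clips, batch_size=8,
                  real_fraction=0.25, num_batches=5)
)
```

With these settings every batch contains two real and six synthetic items, regardless of how unequal the two source sets are.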

Full Text