Sound event localization and detection (SELD) refers to classifying sound categories and localizing their sources with acoustic models applied to the same multichannel audio. Recently, SELD has evolved rapidly by leveraging advanced approaches from other research areas, and the benchmark SELD datasets have become increasingly realistic, now providing simultaneously captured videos. Vibration produces sound, and we usually associate visual objects with the sounds they make, i.e., we hear footsteps from a walking person and a jangle from a ringing bell. It is therefore natural to use multimodal information (image, audio, and text rather than audio alone) to improve sound event detection (SED) accuracy and reduce sound source localization (SSL) errors. In this paper, we propose a contrastive representation-based multimodal acoustic model (CRATI) for SELD, which is designed to learn contrastive audio representations from audio, text, and image in an end-to-end manner. Experiments on the real STARSS23 dataset and the synthesized TAU-NIGENS Spatial Sound Events 2021 dataset both show that our CRATI model can learn more effective audio features when additional constraints minimize the difference between audio and text (SED and SSL annotations in this work). Image input, however, contributes little to SELD performance, as only minor visual changes can be observed between consecutive frames. Compared to the baseline system, our model increases the SED F-score by 11% and decreases the SSL error by 31.02° on the STARSS23 dataset.
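To illustrate the audio-text alignment the abstract describes, the following is a minimal sketch of a CLIP-style symmetric contrastive loss between audio embeddings and text embeddings derived from SED/SSL annotations. The function name, embedding shapes, and temperature value are illustrative assumptions, not the CRATI implementation.

```python
# Illustrative sketch (not the authors' code): a symmetric contrastive loss
# that pulls each audio embedding toward the text embedding of its own
# annotations and pushes it away from the other pairs in the batch.
import torch
import torch.nn.functional as F

def contrastive_audio_text_loss(audio_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """audio_emb, text_emb: (batch, dim) outputs of hypothetical encoders."""
    # L2-normalize so dot products become cosine similarities.
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity logits, scaled by the temperature.
    logits = audio_emb @ text_emb.t() / temperature

    # Matching audio/text pairs lie on the diagonal of the logit matrix.
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)

    # Symmetric cross-entropy over audio-to-text and text-to-audio directions.
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2t + loss_t2a)

if __name__ == "__main__":
    # Random tensors stand in for encoder outputs in this usage example.
    audio = torch.randn(8, 512)  # batch of audio clip embeddings
    text = torch.randn(8, 512)   # embeddings of SED/SSL annotation text
    print(contrastive_audio_text_loss(audio, text).item())
```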