Abstract

Sound event localization and detection (SELD) remains a challenging problem, especially as sound scene complexity increases and overlapping acoustic sources appear. To improve SELD performance, we propose an ensemble system consisting of a network with a ResNet and Conformer backbone (SELD-RCnet) and its two variants, SED-RCnet and SSL-RCnet. For SELD-RCnet and SSL-RCnet, we use the short-time Fourier transform (STFT) magnitude spectrogram, the phase spectrogram, and active acoustic intensity vectors (IVs) as input features. For SSL-RCnet, we also develop a new prediction target, which further improves performance. For SED-RCnet, we use the log-Mel spectrogram as the input feature. To overcome the lack of training data, we adopt two novel approaches to augmenting first-order Ambisonics (FOA) format datasets, namely audio channel swapping (ACS) and time-frequency masking (TFM). In the L3DAS22 Challenge, our submitted system achieves significant improvements over the baseline and ranks second in Task 2. In accordance with the competition rules, we submit this work to describe our system in detail.
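
To make the time-frequency masking idea concrete, the following is a minimal SpecAugment-style sketch, not the exact TFM procedure used in the submitted system. It assumes a feature array of shape (channels, frequency bins, frames) and applies one frequency mask and one time mask identically to all channels, so that inter-channel relationships in the FOA features are not disturbed. The function name `time_frequency_mask`, its parameters, and the default mask widths are hypothetical choices made for illustration.

```python
import numpy as np


def time_frequency_mask(spec, max_freq_width=8, max_time_width=16, rng=None):
    """Return a copy of `spec` with one frequency band and one time span zeroed.

    spec: array of shape (channels, freq_bins, frames).
    The same mask is applied to every channel (an assumption of this sketch).
    """
    rng = np.random.default_rng() if rng is None else rng
    _, n_freq, n_time = spec.shape
    out = spec.copy()

    # Frequency mask: random width in [0, max_freq_width], random start bin.
    f_width = int(rng.integers(0, min(max_freq_width, n_freq) + 1))
    f_start = int(rng.integers(0, n_freq - f_width + 1))
    out[:, f_start:f_start + f_width, :] = 0.0

    # Time mask: random width in [0, max_time_width], random start frame.
    t_width = int(rng.integers(0, min(max_time_width, n_time) + 1))
    t_start = int(rng.integers(0, n_time - t_width + 1))
    out[:, :, t_start:t_start + t_width] = 0.0

    return out


# Example: mask a dummy 4-channel feature map with a fixed seed.
features = np.ones((4, 64, 100))
masked = time_frequency_mask(features, rng=np.random.default_rng(0))
```

The input array is left unchanged, so the masked copy can be generated on the fly for each training batch.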
