Abstract

Sound event detection (SED) is the task of recognizing target sound events, along with their respective onset and offset times, in a recorded audio clip. SED is closely related to auditory scene analysis, the process by which the human auditory system organizes sound into perceptually meaningful components. While the human auditory system performs SED remarkably well, how sound is transformed into neural responses within the auditory system remains a subject of ongoing research. One effort to describe the relationship between sound and neural response is the spectro-temporal receptive field (STRF). The STRF acts as a linear function between a sound's time-frequency representation and the response of a primary auditory cortex (A1) cell, so that the neural response can be predicted by convolving the time-frequency representation of the sound with the STRF. In addition, the STRF is designed to capture spectro-temporal modulation, since A1 cells respond strongly to spectro-temporally modulated ripples. In this work, we used STRFs as kernels of the first layer of a convolutional neural network (CNN) to extract neural-response-like features from the input audio clip, making the SED model more similar to the human auditory system. We then constructed a two-branch SED model, named Two-Branch STRF-Net (TB-STRFNet), composed of an STRF branch and a basic branch. The STRF branch extracts spectro-temporal modulation information, while the basic branch extracts detailed and complex time-frequency information. TB-STRFNet outperformed the baseline by 4.3% in terms of the main metric of DCASE 2023 Task 4 Subtask B.
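To make the prediction step concrete, it can be written as a discrete convolution over time and frequency. The notation below is a standard formulation from the STRF literature, not taken verbatim from this paper: with $S(t, f)$ the time-frequency representation of the sound and $\mathrm{STRF}(\tau, f)$ the receptive field of an A1 cell, the predicted response is

$$\hat{r}(t) \approx \sum_{f} \sum_{\tau} \mathrm{STRF}(\tau, f)\, S(t - \tau, f).$$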
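A minimal sketch of the idea of using STRF kernels as the first CNN layer is given below. The Gabor-style ripple parameterization, kernel shapes, and layer names are illustrative assumptions; the abstract does not specify TB-STRFNet's actual configuration.

```python
import torch
import torch.nn as nn

def make_strf_kernels(n_kernels=8, n_tau=16, n_freq=16):
    """Build illustrative Gabor-like STRF kernels tuned to different
    spectro-temporal modulation rates (an assumption, not the paper's design)."""
    tau = torch.linspace(-1, 1, n_tau)
    freq = torch.linspace(-1, 1, n_freq)
    T, F = torch.meshgrid(tau, freq, indexing="ij")
    kernels = []
    for k in range(n_kernels):
        rate = 0.5 + k          # temporal modulation rate (cycles per window)
        scale = 0.5 + 0.25 * k  # spectral modulation scale
        ripple = torch.cos(2 * torch.pi * (rate * T + scale * F))
        envelope = torch.exp(-(T**2 + F**2) / 0.5)  # localize the ripple
        kernels.append(ripple * envelope)
    return torch.stack(kernels).unsqueeze(1)  # (n_kernels, 1, n_tau, n_freq)

class STRFFrontEnd(nn.Module):
    """First conv layer whose weights are fixed STRF kernels, so its output
    approximates predicted A1 responses to the input spectrogram."""
    def __init__(self):
        super().__init__()
        w = make_strf_kernels()
        self.conv = nn.Conv2d(1, w.shape[0], kernel_size=w.shape[2:],
                              padding="same", bias=False)
        self.conv.weight = nn.Parameter(w, requires_grad=False)  # frozen STRFs

    def forward(self, spec):  # spec: (batch, 1, time frames, mel bins)
        return self.conv(spec)

# Usage: feed a (log-mel) spectrogram through the STRF branch front end.
spec = torch.randn(2, 1, 128, 64)
features = STRFFrontEnd()(spec)
print(features.shape)  # torch.Size([2, 8, 128, 64])
```

In a two-branch layout such as the one the abstract describes, these fixed-STRF features would be processed alongside a conventional learned-CNN branch and merged before the detection head.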
