Abstract

This letter presents a learnable auditory filterbank for environmental sound classification (ESC). The filterbank is based on a one-dimensional (1D) convolutional neural network with a strong psychophysiological inductive bias, in the form of a gammatone filterbank and an equal-loudness prompting normalization. A number of earlier ESC methods learn auditory features by applying plain 1D convolutions to raw input waveforms, with the aim of outperforming traditional handcrafted features such as a mel-frequency filterbank. However, the large number of parameters in those convolutions suggests that such methods will not generalize better than a model defined by fewer parameters, which is the kind of model considered in this letter. Here, a learnable gammatone filterbank layer is proposed for obtaining a time-frequency representation of the raw waveform; its 1D kernels are represented by a parametric form of the bandpass gammatone filter. A normalization is also proposed, with learnable parameters that control the trade-off between energy equalization and structure preservation in the spectro-temporal domain. To verify the effectiveness of the network and the normalization, ESC experiments were conducted on the ESC-50 and UrbanSound8K datasets. The considered network performed better than other state-of-the-art networks on both datasets, and an ensemble architecture improved performance further.
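As a rough illustration of what "1D kernels represented by a parametric form of the bandpass gammatone filters" can mean, the sketch below builds kernels from the textbook gammatone impulse response, g(t) = a t^(n-1) exp(-2 pi b t) cos(2 pi f_c t + phi). The abstract does not give the letter's exact parameterization, so everything here is an assumption: the function names (`erb`, `gammatone_kernel`, `gammatone_filterbank`, `apply_filterbank`), the default order, kernel length, and sample rate are hypothetical, and the default bandwidth uses the common Glasberg-Moore ERB approximation. In a learnable layer the center frequency, bandwidth, amplitude, and phase would be trainable; here they are fixed NumPy values.

```python
import numpy as np


def erb(fc):
    """Equivalent rectangular bandwidth in Hz (Glasberg-Moore approximation)."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)


def gammatone_kernel(fc, sample_rate=16000, length=512, order=4,
                     bandwidth=None, amplitude=1.0, phase=0.0):
    """Sample one gammatone impulse response as a 1D convolution kernel.

    Only a handful of scalars (fc, bandwidth, amplitude, phase) define the
    kernel, instead of `length` free weights as in a plain 1D convolution.
    """
    t = np.arange(length) / sample_rate
    # Assumed default: bandwidth tied to the ERB of the center frequency.
    b = 1.019 * erb(fc) if bandwidth is None else bandwidth
    g = (amplitude * t ** (order - 1)
         * np.exp(-2.0 * np.pi * b * t)
         * np.cos(2.0 * np.pi * fc * t + phase))
    peak = np.max(np.abs(g))
    # Peak-normalize so kernels at different frequencies have comparable scale.
    return g / peak if peak > 0 else g


def gammatone_filterbank(center_freqs, **kwargs):
    """Stack one kernel per center frequency: shape (n_filters, length)."""
    return np.stack([gammatone_kernel(fc, **kwargs) for fc in center_freqs])


def apply_filterbank(waveform, kernels):
    """Convolve a raw waveform with every kernel: shape (n_filters, n_samples)."""
    return np.stack([np.convolve(waveform, k, mode="same") for k in kernels])
```

With these assumptions, a 64-filter bank is described by a few scalars per filter (a few hundred values in total), whereas 64 unconstrained kernels of length 512 would carry 32,768 weights. That difference in parameter count is what the abstract's generalization argument rests on.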
