Environmental sound classification (ESC) aims to automatically recognize the acoustic environment underlying an audio recording, such as "urban park" or "city centre". Most existing ESC methods represent audio recordings with hand-crafted time-frequency features such as the log-mel spectrogram. However, such hand-crafted features rely on transformations defined beforehand and do not account for variability arising from differences in recording conditions or recording devices. To overcome this, we present an alternative representation framework that leverages SoundNet, a convolutional neural network pre-trained on a large-scale audio dataset, to represent raw audio recordings. We observe that the representations obtained from the intermediate layers of SoundNet lie in a low-dimensional subspace whose dimensionality is not known a priori. To address this, we employ an automatic compact dictionary learning framework that estimates the dimensionality of the underlying subspace. The resulting low-dimensional embeddings are then aggregated by late fusion in an ensemble framework, incorporating the hierarchical information learned at the various intermediate layers of SoundNet. We evaluate the proposed framework on the publicly available DCASE 2017 and DCASE 2018 acoustic scene classification (ASC) datasets, where it improves accuracy by 1 to 4 percentage points over existing time-frequency representations.
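To make the pipeline concrete, the sketch below illustrates the overall flow on synthetic data: intermediate-layer embeddings are extracted from a pre-trained CNN over raw waveforms, projected to a low-dimensional subspace, and classified per layer, with the per-layer class probabilities averaged in a late-fusion ensemble. Several stand-ins are assumed: a tiny 1-D CNN plays the role of pre-trained SoundNet, PCA approximates the automatic compact dictionary learning step, logistic regression serves as the per-layer classifier, and all layer choices, dimensionalities, and hyperparameters are illustrative, not the paper's actual settings.

```python
# Minimal sketch of the abstract's pipeline. Assumptions: a small 1-D CNN
# stands in for pre-trained SoundNet, PCA stands in for the automatic
# compact dictionary learning step, and the fixed dimensionality of 8 is
# an arbitrary illustrative choice.
import numpy as np
import torch
import torch.nn as nn
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

torch.manual_seed(0)
rng = np.random.default_rng(0)

# Stand-in for SoundNet: a tiny 1-D CNN applied to raw waveforms.
backbone = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=64, stride=8), nn.ReLU(),
    nn.Conv1d(16, 32, kernel_size=32, stride=8), nn.ReLU(),
    nn.Conv1d(32, 64, kernel_size=16, stride=8), nn.ReLU(),
).eval()

def layer_embeddings(waveforms):
    """Global-average-pool the activations of each intermediate conv block."""
    feats, x = [], waveforms
    for layer in backbone:
        x = layer(x)
        if isinstance(layer, nn.ReLU):       # one embedding per conv block
            feats.append(x.mean(dim=2))      # (batch, channels)
    return [f.detach().numpy() for f in feats]

# Synthetic "recordings": 40 one-second clips at 16 kHz, two scene classes.
n_clips, sr = 40, 16000
X = torch.randn(n_clips, 1, sr)
y = rng.integers(0, 2, size=n_clips)

embeddings = layer_embeddings(X)

# Per-layer pipeline: low-dimensional projection (PCA here, in place of the
# paper's dictionary learning) followed by a simple classifier.
per_layer_probs = []
for emb in embeddings:
    z = PCA(n_components=8).fit_transform(emb)
    clf = LogisticRegression(max_iter=1000).fit(z, y)
    per_layer_probs.append(clf.predict_proba(z))

# Late fusion: average the class probabilities across layers.
fused = np.mean(per_layer_probs, axis=0)
print("fused prediction for first clip:", fused[0].argmax())
```

For brevity the sketch fits and evaluates on the same synthetic clips; a faithful evaluation would use the DCASE train/test splits and the actual SoundNet weights.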