Towards cross-modal pre-training and learning tempo-spatial characteristics for audio recognition with convolutional and recurrent neural networks

Shahin Amiriparian,Alice Baird,Lukas Koebe,Maurice Gerczuk,Björn Schuller,Sandra Ottl,Lukas Stappen

doi:10.1186/s13636-020-00186-0

Open Access

https://doi.org/10.1186/s13636-020-00186-0

Abstract

In this paper, we investigate the performance of two deep learning paradigms for the audio-based tasks of acoustic scene, environmental sound and domestic activity classification. In particular, a convolutional recurrent neural network (CRNN) and pre-trained convolutional neural networks (CNNs) are utilised. The CRNN is trained directly on Mel-spectrograms of the audio samples. For the pre-trained CNNs, the activations of one of the top layers of various architectures are extracted as feature vectors and used for training a linear support vector machine (SVM). Moreover, the predictions of the two models (the class probabilities predicted by the CRNN and the decision function of the SVM) are combined in a decision-level fusion to achieve the final prediction. For the pre-trained CNNs we use as feature extractors, we further evaluate the effects of a range of configuration options, including the choice of the pre-training corpus. The system is evaluated on the acoustic scene classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2017) workshop, on ESC-50, and on the multi-channel acoustic recordings from DCASE 2018, task 5. We have refrained from additional data augmentation as our primary goal is to analyse the general performance of the proposed system on different datasets. We show that using our system, it is possible to achieve competitive performance on all datasets, and we demonstrate the complementarity of CRNNs and ImageNet pre-trained CNNs for acoustic classification tasks. We further find that in some cases, CNNs pre-trained on ImageNet can serve as more powerful feature extractors than AudioSet models. Finally, ImageNet pre-training is complementary to more domain-specific knowledge, either in the form of the CRNN trained directly on the target data or the AudioSet pre-trained models. In this regard, our findings indicate possible benefits of applying cross-modal pre-training of large CNNs to acoustic analysis tasks.
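
The decision-level fusion described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the CRNN class probabilities and the SVM decision function are combined, so everything below is an assumption made for illustration: the SVM scores are mapped to a probability-like scale with a softmax, the two vectors are averaged with a hypothetical `weight` parameter, and the function names `softmax`, `fuse` and `predict` are invented here.

```python
import math


def softmax(scores):
    """Map raw scores to values in (0, 1) that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(crnn_probs, svm_scores, weight=0.5):
    """Weighted average of CRNN probabilities and softmaxed SVM scores.

    `weight` is a hypothetical mixing factor: 1.0 uses only the CRNN,
    0.0 uses only the SVM. It is not taken from the paper.
    """
    if len(crnn_probs) != len(svm_scores):
        raise ValueError("both models must score the same number of classes")
    svm_probs = softmax(svm_scores)
    return [weight * p + (1.0 - weight) * q
            for p, q in zip(crnn_probs, svm_probs)]


def predict(crnn_probs, svm_scores, weight=0.5):
    """Return the index of the class with the highest fused score."""
    fused = fuse(crnn_probs, svm_scores, weight)
    return max(range(len(fused)), key=fused.__getitem__)
```

In this sketch, a sample on which the CRNN slightly prefers class 0 but the SVM strongly prefers class 1 would be assigned to class 1 at `weight=0.5`, which is the kind of complementarity a fusion step is meant to exploit.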

Highlights

We are regularly surrounded by dynamic audio events, some of which are quite pleasant, such as singing birds or nice music tracks, and others less so, like the sound of a chainsaw or a siren.
Our convolutional recurrent neural network (CRNN) is trained on Mel-spectrograms of the audio samples, and deep feature representations are extracted by a range of convolutional neural networks (CNNs), which serve as input for support vector machine (SVM) classification.
We have proposed a deep learning framework composed of an image-to-audio transfer learning system, audio pre-trained CNNs and a CRNN.
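
Both models in the highlights above operate on Mel-spectrograms, whose frequency axis is spaced evenly on the Mel scale rather than in hertz. The sketch below shows only that frequency mapping. It assumes the widely used formula mel = 2595 * log10(1 + f / 700); this page does not state which Mel variant or how many bands the authors used, so the band count and frequency range in the usage note are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in hertz to Mel (assumed 2595 * log10 variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centres(n_bands, f_min, f_max):
    """Centre frequencies (Hz) of n_bands bands spaced evenly in Mel."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_bands)]
```

Calling, for example, `mel_band_centres(8, 0.0, 8000.0)` yields centres that are close together at low frequencies and increasingly far apart at high frequencies, which is why a Mel-spectrogram resolves low-frequency content more finely than a linear spectrogram with the same number of bins.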

Summary

Introduction

We are regularly surrounded by dynamic audio events, some of which are quite pleasant, such as singing birds or nice music tracks, and others less so, like the sound of a chainsaw or a siren. In the era of machine learning, computer audition systems for intelligent housing systems [2, 3], recognition of acoustic scenes [4, 5] and sound event detection [4, 6, 7] are being developed. Despite recent developments in the field of audio analysis, contemporary machine learning systems still struggle to perform these tasks with human-like precision. Deep learning-based technologies lack a mechanism to generalise well when faced with data scarcity. In this regard, we follow a threefold strategy: (i) proposing a cross-modal transfer learning strategy in the form of ImageNet pre-trained convolutional neural networks (CNNs) to cope with limited data, (ii) utilising a CRNN for learning tempo-spatial characteristics of audio signals, and (iii) fusing various neural network strategies to check for further performance improvements.

Methods

Findings

Conclusion