Abstract

Recent methodologies for audio classification frequently involve cepstral and spectral features applied to single-channel recordings of acoustic scenes and events. Furthermore, the concept of transfer learning has been widely used over the years and has proven to be an efficient alternative to training neural networks from scratch. The lower time and resource requirements of using pre-trained models allow for more versatility in developing classification systems. However, information on classification performance when using different features for multi-channel recordings is often limited. Moreover, pre-trained networks are initially trained on large databases and are often unnecessarily large. This poses a challenge when developing systems for devices with limited computational resources, such as mobile or embedded devices. This paper presents a detailed study of the most prominent and widely used cepstral and spectral features for multi-channel audio applications. Accordingly, we propose the use of spectro-temporal features. Additionally, the paper details the development of a compact version of the AlexNet model for computationally limited platforms through studies of performance under various architectural and parameter modifications of the original network. The aim is to minimize the network size while maintaining the series network architecture and preserving the classification accuracy. Considering that other state-of-the-art compact networks use complex directed acyclic graphs, a series architecture offers an advantage in customizability. Experimentation was carried out in MATLAB, using a database that we generated for this task, which consists of four-channel synthetic recordings of both sound events and scenes. The top-performing methodology achieved a weighted F1-score of 87.92% for scalogram features classified via the modified AlexNet-33 network, which has a size of 14.33 MB; the original AlexNet network returned 86.24% at a size of 222.71 MB.
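
The abstract does not list the specific layer changes that yield AlexNet-33, so the MATLAB sketch below is only illustrative of the kind of series-preserving modification described: loading the pre-trained AlexNet, shrinking its fully connected layers, retraining on scalogram images, and scoring with a weighted F1. The layer sizes, class count, folder layout, file names, and training options are assumptions, not the paper's configuration.

    % Illustrative sketch only: the exact AlexNet-33 layer changes are not given
    % in this abstract. Requires Deep Learning Toolbox and the "Deep Learning
    % Toolbox Model for AlexNet Network" support package.
    net    = alexnet;                 % pre-trained 25-layer series network
    layers = net.Layers;

    % Assumed folder layout: 227x227x3 scalogram images sorted into class folders.
    imdsTrain = imageDatastore('scalograms/train', ...
        'IncludeSubfolders', true, 'LabelSource', 'foldernames');
    imdsVal   = imageDatastore('scalograms/val', ...
        'IncludeSubfolders', true, 'LabelSource', 'foldernames');
    numClasses = numel(categories(imdsTrain.Labels));

    % Hypothetical size reduction: shrink the two 4096-unit fully connected
    % layers (standard AlexNet indices 17 and 20) and replace the output layers,
    % keeping a plain series topology rather than a directed acyclic graph.
    layers(17) = fullyConnectedLayer(256, 'Name', 'fc6_small');
    layers(20) = fullyConnectedLayer(128, 'Name', 'fc7_small');
    layers(23) = fullyConnectedLayer(numClasses, 'Name', 'fc8_out');
    layers(25) = classificationLayer('Name', 'output');

    opts = trainingOptions('sgdm', ...
        'InitialLearnRate', 1e-4, ...
        'MaxEpochs', 20, ...
        'ValidationData', imdsVal, ...
        'Verbose', false);
    compactNet = trainNetwork(imdsTrain, layers, opts);

    % Weighted F1-score (the metric reported in the abstract), computed from a
    % confusion matrix over the validation set.
    predLabels = classify(compactNet, imdsVal);
    C       = confusionmat(imdsVal.Labels, predLabels);
    tp      = diag(C);
    prec    = tp ./ max(sum(C, 1)', 1);   % per-class precision
    rec     = tp ./ max(sum(C, 2), 1);    % per-class recall
    f1      = 2 * prec .* rec ./ max(prec + rec, eps);
    support = sum(C, 2);                  % true instances per class
    weightedF1 = sum(f1 .* support) / sum(support);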

Highlights

  • The continuous research advances in the field of single- and multi-channel audio classification suggest its importance and relevance in a broad range of real-world applications. In this work, we focus on domestic multi-channel audio classification, which can be applied to monitoring systems and assistive technology [1,2]. The majority of existing works in this area are based on the classification of sound events found in single-channel audio [3,4] rather than the classification of multi-channel audio signals containing acoustic scenes, which is required to understand the continuous nature of daily domestic activities

  • We propose the use of spectro-temporal features in the form of scalograms, which are computed through a fast Fourier transform (FFT)-based continuous wavelet transform (CWT) [10]; an illustrative extraction sketch follows this list

  • Per-level and average comparisons of mel-frequency cepstral coefficients (MFCC) and Log-Mel spectrogram features against the proposed CWTFT scalogram method are shown in Table 3; each result is an average of three training trials

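Since the highlights do not specify the CWT configuration, the following MATLAB sketch (Wavelet Toolbox) only illustrates how an FFT-based CWT scalogram might be computed per channel and exported as a network-ready image; the file names, the default analytic Morse wavelet, and the 227x227 output size are assumptions.

    % Illustrative per-channel scalogram extraction (assumed settings).
    % Requires the Wavelet and Image Processing Toolboxes.
    [x, fs] = audioread('scene_0001.wav');     % assumed four-channel recording

    for ch = 1:size(x, 2)
        % cwt uses an FFT-based implementation of the continuous wavelet
        % transform; the default analytic Morse wavelet is assumed here.
        wt   = cwt(x(:, ch), fs);
        scal = abs(wt);                        % scalogram = CWT magnitude

        % Rescale and resize to the 227x227x3 input of AlexNet-style networks.
        img = imresize(rescale(scal), [227 227]);
        img = repmat(im2uint8(img), 1, 1, 3);
        imwrite(img, sprintf('scalograms/scene_0001_ch%d.png', ch));  % assumes folder exists
    end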

Introduction

The continuous research advances in the field of single- and multi-channel audio classification suggest its importance and relevance in a broad range of real-world applications. In this work, we focus on domestic multi-channel audio classification, which can be applied to monitoring systems and assistive technology [1,2]. The majority of existing works in this area are based on the classification of sound events found in single-channel audio [3,4] rather than the classification of multi-channel audio signals containing acoustic scenes, which is required to understand the continuous nature of daily domestic activities. Detection using multi-channel audio was found to be 10% more accurate than using single-channel audio, considering the case of overlapping sounds that commonly occur in real life [6]. A similar concept to this work is the Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 Task 5 challenge, which focuses on domestic multi-channel acoustic scene classification [7]. In this challenge, top-performing methods often involve the use of Log-Mel energies and Mel-frequency cepstral coefficients (MFCC).

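For reference, the sketch below shows how the Log-Mel and MFCC baseline features mentioned above might be extracted with the MATLAB Audio Toolbox from one channel of a recording; the 25 ms/10 ms framing, 40 mel bands, and file name are assumptions rather than the configuration used in the paper or in DCASE 2018 Task 5.

    % Illustrative baseline feature extraction (assumed parameters).
    % Requires the Audio and Signal Processing Toolboxes.
    [x, fs] = audioread('scene_0001.wav');
    xCh     = x(:, 1);                         % one channel of the recording

    % Log-Mel spectrogram: 40 assumed mel bands, 25 ms windows with 10 ms hop.
    win = round(0.025 * fs);
    hop = round(0.010 * fs);
    S = melSpectrogram(xCh, fs, ...
        'Window', hann(win, 'periodic'), ...
        'OverlapLength', win - hop, ...
        'NumBands', 40);
    logMel = 10 * log10(S + eps);              % log-mel energies in dB

    % Mel-frequency cepstral coefficients with the Audio Toolbox defaults.
    coeffs = mfcc(xCh, fs);                    % one row of coefficients per frame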