Abstract
Audio-based event detection poses a number of challenges that are not encountered in other fields, such as image detection. Challenges such as ambient noise, low Signal-to-Noise Ratio (SNR), and microphone distance are not yet fully understood. If multimodal approaches are to improve across a range of fields of interest, audio analysis will have to play an integral part. Event recognition in autonomous vehicles (AVs) is one such field, still at a nascent stage, that can rely solely on audio or use audio as part of a multimodal approach. In this manuscript, an extensive analysis comparing different magnitude representations of the raw audio is presented. The data used in the analysis are part of the publicly available MIVIA Audio Events dataset. Single-channel Short-Time Fourier Transform (STFT), mel-scale, and Mel-Frequency Cepstral Coefficient (MFCC) spectrogram representations are used. Furthermore, aggregation methods for these spectrogram representations are examined: feature concatenation is compared with the stacking of features as separate channels. The effect of the SNR on recognition accuracy and the generalization of the proposed methods to datasets that were both seen and not seen during training are studied and reported.
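The two aggregation methods named in the abstract, feature concatenation and stacking as separate channels, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the random arrays stand in for real STFT, mel, and MFCC maps, and the shape of 64 bins by 100 frames, the channel-first layout, and the function names `concatenate_features` and `stack_features` are all assumptions made for illustration.

```python
import numpy as np


def concatenate_features(specs):
    """Join 2D (bins, frames) maps along the frequency axis into one channel."""
    if len({s.shape[1] for s in specs}) != 1:
        raise ValueError("all maps must have the same number of frames")
    joined = np.concatenate(specs, axis=0)  # (sum of bins, frames)
    return joined[np.newaxis, ...]          # (1, sum of bins, frames)


def stack_features(specs):
    """Stack equally shaped 2D maps as separate input channels."""
    if len({s.shape for s in specs}) != 1:
        raise ValueError("all maps must have the same shape")
    return np.stack(specs, axis=0)          # (channels, bins, frames)


# Placeholder data (assumption): random values instead of real spectrograms.
rng = np.random.default_rng(0)
stft_map = rng.random((64, 100))
mel_map = rng.random((64, 100))
mfcc_map = rng.random((64, 100))

concatenated = concatenate_features([stft_map, mel_map, mfcc_map])
stacked = stack_features([stft_map, mel_map, mfcc_map])

print(concatenated.shape)  # (1, 192, 100): one tall single-channel input
print(stacked.shape)       # (3, 64, 100): three aligned channels
```

In this sketch concatenation only requires a common number of frames, while stacking requires identical shapes, so real representations with different bin counts would first need to be brought to a common size.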
Highlights
Entering the era of third-generation surveillance systems [1] means that the world is transitioning to an event-based analysis of data, from what used to be a time-based one
The main focus of the experiment is to evaluate the ability of a 2D CNN to learn from various spectrogram representations at various Signal-to-Noise Ratio (SNR) settings and to check the ability of the CNN to generalize when the SNR settings differ between training and testing
The Short-Time Fourier Transform (STFT) spectrograms provided the best results when training and testing on the same SNR values, compared to mel-spectrograms, which use the mel scale to better represent the human auditory system
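As a rough illustration of what an SNR setting and an STFT magnitude spectrogram mean in practice, the sketch below mixes a synthetic event with noise at a chosen SNR and computes a magnitude spectrogram using NumPy only. It is not the dataset's or the authors' procedure: the 16 kHz sample rate, the 1 kHz test tone, the 5 dB target, the 512-sample Hann window with a hop of 256, and the helper names `mix_at_snr` and `stft_magnitude` are all assumptions.

```python
import numpy as np


def mix_at_snr(event, noise, snr_db):
    """Scale the noise so the event-to-noise power ratio equals snr_db, then add."""
    event_power = np.mean(event ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = event_power / (10.0 ** (snr_db / 10.0))
    scaled_noise = noise * np.sqrt(target_noise_power / noise_power)
    return event + scaled_noise, scaled_noise


def stft_magnitude(signal, n_fft=512, hop=256):
    """Magnitude STFT with a Hann window; returns (n_fft // 2 + 1, frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop:i * hop + n_fft] * window for i in range(n_frames)],
        axis=1,
    )
    return np.abs(np.fft.rfft(frames, axis=0))


# Synthetic stand-ins (assumption): a 1 kHz tone as the event, white noise as background.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
event = np.sin(2.0 * np.pi * 1000.0 * t)
noise = np.random.default_rng(0).standard_normal(sample_rate)

mixture, scaled_noise = mix_at_snr(event, noise, snr_db=5.0)
measured_snr_db = 10.0 * np.log10(np.mean(event ** 2) / np.mean(scaled_noise ** 2))
spectrogram = stft_magnitude(mixture)

print(round(measured_snr_db, 3))
print(spectrogram.shape)
```

Lowering `snr_db` increases the share of background noise in the mixture, which is the kind of variation the train/test SNR comparison in the highlights refers to.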
Summary
Entering the era of third-generation surveillance systems [1] means that the world is transitioning to an event-based analysis of data, from what used to be a time-based one. In recent years, increasing concerns about public safety and security have led to a growing adoption of Internet Protocol cameras and a rising demand for wireless and spy cameras [2,3]. These factors are driving the growth of the video surveillance industry, whose global market is projected to reach 74.6 billion US dollars by 2025, up from 45.5 billion in 2020, at a compound annual growth rate of 10.4%, according to studies conducted by BIS Research [4]. A range of issues could arise when there is no driver on the bus, for instance, when an autonomous bus operates in certain neighborhoods at night and no authority figure is present to keep the passengers calm or provide first aid in the case of an abnormal event, such as
Electronics 2020, 9, 1593; doi:10.3390/electronics9101593; www.mdpi.com/journal/electronics