Environmental sound classification (ESC) is attracting growing interest in information processing because of its importance for categorizing non-speech audio. Ambient sounds are difficult to categorize because, unlike speech or music, they lack a clear structure. Deep learning (DL) techniques are widely used in ESC to extract relevant information from ambient sounds, and feature extraction is a crucial step that directly affects classification performance. However, feature extraction can be computationally expensive, especially with complex non-linear techniques, because training DL models requires substantial computing resources. Moreover, environmental sounds vary widely in their temporal and spectral characteristics, which makes DL models harder to train effectively. To overcome these limitations, this research proposes an efficient method for extracting meaningful features from audio files and improving the performance of DL techniques, using spectrogram images generated from environmental sound datasets. The proposed approach uses convolutional neural networks (CNNs) with attention mechanisms and appropriate data augmentation methods. Unlike pre-trained models that use a single vector for feature extraction, it uses a new concatenation-based CNN model with attention mechanisms, which can capture intricate relationships between input features more effectively. This allows features to be extracted from different regions of the feature space, resulting in more precise classification of complex and diverse datasets. Additionally, a parallel feature extraction technique draws features from multiple CNN models, which improves classification performance, and attention modules focus the model on the most relevant features of the input data.
Comparative experiments against existing state-of-the-art methods show that the proposed approach achieves better classification performance.
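
The idea of parallel feature extraction, attention, and concatenation can be illustrated with a minimal pure-Python sketch. It is not the model proposed in this work. The two pooling functions are hypothetical stand-ins for separate CNN branches, the softmax reweighting is an assumed simple form of attention, and the tiny spectrogram is made-up illustrative data.

```python
import math

# Hypothetical stand-ins for two parallel CNN branches: each maps a
# spectrogram (rows = frequency bands, columns = time frames) to one
# feature per frequency band. A real system would use trained CNNs here.
def mean_over_time(spec):
    return [sum(row) / len(row) for row in spec]

def max_over_time(spec):
    return [max(row) for row in spec]

def softmax(scores):
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(features):
    # Assumed attention form: weights are a softmax over the features
    # themselves, so stronger features receive larger weights.
    weights = softmax(features)
    return [w * f for w, f in zip(weights, features)]

def fused_features(spec):
    # Run the branches side by side, apply attention to each branch's
    # output, then concatenate the results into a single feature vector.
    fused = []
    for branch in (mean_over_time, max_over_time):
        fused.extend(attend(branch(spec)))
    return fused

# Made-up 3-band, 4-frame spectrogram; the middle band carries most energy.
spec = [
    [0.1, 0.2, 0.1, 0.2],
    [0.9, 0.8, 1.0, 0.7],
    [0.3, 0.3, 0.2, 0.4],
]
features = fused_features(spec)
```

In this sketch the fused vector has one attended feature per band from each branch, so a downstream classifier would see both average and peak energy, with the dominant band emphasized in each half.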