Recognizing audio signals in complex environments is crucial for extracting useful information from them. This paper combines a sparse representation algorithm with Mel-frequency cepstral coefficient (MFCC) features and feeds the combined features to a convolutional neural network (CNN) classifier to recognize acoustic scenes in audio signals from complex environments. The algorithm was simulated and tested on the TUT Sound Events 2016 and TUT Acoustic Scenes 2016 datasets. The experiments first validated the efficacy of the proposed sparse feature extraction method and determined an appropriate sparse dictionary size; the algorithm was then compared with two recognition algorithms based on sparse features alone and MFCC features alone, respectively. The proposed extraction approach achieved a higher signal-to-noise ratio. The appropriate sparse dictionary size varied across datasets: 75 for TUT Sound Events 2016 and 150 for TUT Acoustic Scenes 2016. Among the three audio scene recognition algorithms, the MFCC-combined algorithm converged fastest during training and achieved the highest classification accuracy on both datasets, with higher accuracy on TUT Sound Events 2016.
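To make the pipeline concrete, the sketch below shows one plausible reading of the two feature stages the abstract names: standard MFCC extraction (framing, windowing, power spectrum, mel filterbank, log, DCT-II) followed by sparse coding of each MFCC frame over a dictionary of 75 atoms, the size the paper reports for TUT Sound Events 2016. This is a minimal numpy-only illustration, not the paper's actual method: the random dictionary, the choice of Orthogonal Matching Pursuit as the sparse solver, the sparsity level `k=5`, and all frame-size parameters are assumptions for demonstration.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb


def mfcc(signal, sr, n_fft=512, hop=256, n_filters=26, n_ceps=13):
    """Frame, window, power spectrum, mel filterbank, log, DCT-II."""
    starts = range(0, len(signal) - n_fft + 1, hop)
    frames = np.stack([signal[s:s + n_fft] * np.hamming(n_fft) for s in starts])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    log_mel = np.log(np.maximum(power @ mel_filterbank(n_filters, n_fft, sr).T, 1e-10))
    # DCT-II along the filter axis, keeping the first n_ceps coefficients
    n = np.arange(n_filters)
    basis = np.cos(np.pi / n_filters * (n[None, :] + 0.5) * np.arange(n_ceps)[:, None])
    return log_mel @ basis.T


def omp(D, x, k):
    """Orthogonal Matching Pursuit: approximate x with k atoms of dictionary D."""
    residual, idx = x.copy(), []
    coef = np.zeros(0)
    for _ in range(k):
        # Pick the atom most correlated with the current residual
        idx.append(int(np.argmax(np.abs(D.T @ residual))))
        coef, *_ = np.linalg.lstsq(D[:, idx], x, rcond=None)
        residual = x - D[:, idx] @ coef
    code = np.zeros(D.shape[1])
    code[idx] = coef
    return code


# Illustrative run: 1 s of a 440 Hz tone standing in for an audio scene clip
rng = np.random.default_rng(0)
sr = 16000
sig = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
feats = mfcc(sig, sr)                                # shape: (n_frames, 13)
D = rng.standard_normal((feats.shape[1], 75))        # 75 atoms, per the paper
D /= np.linalg.norm(D, axis=0)                       # unit-norm dictionary atoms
codes = np.stack([omp(D, f, k=5) for f in feats])    # sparse codes, <= 5 atoms/frame
```

In the paper's setup, features like `codes` (or the MFCC and sparse features combined) would then be passed to the CNN classifier; the dictionary itself would be learned from training data rather than drawn at random as here.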