Bird recognition is important for monitoring bird populations and protecting ecosystems. Identifying birds from images can be difficult because of the complexity of natural environments. Song-based recognition makes it possible to identify birds with only a small amount of background noise introduced; however, efficiently recognizing bird songs remains a challenging task. To address this problem, this paper proposes a self-learning dual-feature fusion information capture expression method (SDFIE-NET) for birdsong recognition. First, a Mel filter bank is used to extract the low-frequency characteristics of the birdsong. Because fixed-parameter filters cannot adapt their feature extraction to different birdsongs, we also incorporate LEAF, a fully learnable audio classification front end, which learns song-specific extraction parameters. The proposed dual-feature fusion module (SCDFF) effectively combines the high-frequency information and low-frequency differences captured by the two approaches, reducing information redundancy and improving representational capability. Second, the SDFIE-NET backbone is built from Fused-MBConv modules and modified CA-MBConv modules, with a Criss-Cross Attention module added after each layer of Fused-MBConv modules; this improves the speed and accuracy of information transfer between internal modules and strengthens the model's expressive power at the pixel level. To evaluate the model's robustness to interference and its generalization ability, we constructed a dataset (Bird_alldata) of 30 birdsong classes; on this dataset, recognition accuracy reached 95.77% and the F1-score reached 95.52%. Generalization experiments on the environmental sound dataset UrbanSound8K and the birdsong dataset Birdsdata yielded recognition accuracies of 94.05% and 94.10%, with F1-scores of 94.21% and 94.05%, respectively.
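As a rough illustration of the dual-branch front end described above, the sketch below pairs a fixed Mel filter bank with a learnable 1-D filter bank and fuses the two time-frequency maps. The abstract does not disclose SCDFF's internals or the LEAF configuration, so `LearnableFrontend` here is a simplified stand-in for LEAF, and the concatenate-then-1x1-convolution fusion is a hypothetical placeholder rather than the authors' module.

```python
import torch
import torch.nn as nn
import torchaudio


class LearnableFrontend(nn.Module):
    """Simplified stand-in for LEAF: a learnable 1-D filter bank whose
    parameters are trained end to end, unlike fixed Mel filters."""

    def __init__(self, n_filters=64, kernel_size=401, hop=160):
        super().__init__()
        self.filters = nn.Conv1d(1, n_filters, kernel_size,
                                 stride=hop, padding=kernel_size // 2)

    def forward(self, wav):                    # wav: (batch, samples)
        x = self.filters(wav.unsqueeze(1))     # (batch, n_filters, frames)
        return torch.log1p(x.abs())            # compress dynamic range


class DualFeatureFrontend(nn.Module):
    def __init__(self, sample_rate=16000, n_mels=64):
        super().__init__()
        self.mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=400, hop_length=160,
            n_mels=n_mels)
        self.learned = LearnableFrontend(n_filters=n_mels, hop=160)
        # Hypothetical fusion: stack the two maps as channels and mix
        # with a 1x1 convolution (not the paper's SCDFF).
        self.fuse = nn.Conv2d(2, 1, kernel_size=1)

    def forward(self, wav):                        # wav: (batch, samples)
        mel = torch.log1p(self.mel(wav))           # (batch, n_mels, T)
        leaf = self.learned(wav)                   # (batch, n_mels, T')
        t = min(mel.shape[-1], leaf.shape[-1])     # align frame counts
        x = torch.stack([mel[..., :t], leaf[..., :t]], dim=1)
        return self.fuse(x)                        # (batch, 1, n_mels, t)


wav = torch.randn(4, 16000)                        # four one-second clips
print(DualFeatureFrontend()(wav).shape)            # torch.Size([4, 1, 64, 100])
```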
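The backbone places a Criss-Cross Attention module after each Fused-MBConv layer. The abstract does not give that module's exact form; the block below is a compact single-pass criss-cross attention in the spirit of CCNet, in which each position attends only to the positions sharing its row and column, and is offered as a sketch under that assumption rather than the paper's implementation (CCNet's recurrent application and self-position masking are omitted for brevity).

```python
import torch
import torch.nn as nn


class CrissCrossAttention(nn.Module):
    """Single-pass criss-cross attention: each position attends to the
    positions sharing its row or column, not the full feature map."""

    def __init__(self, channels):
        super().__init__()
        self.q = nn.Conv2d(channels, channels // 8, 1)
        self.k = nn.Conv2d(channels, channels // 8, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable residual scale

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)

        # Row energies: pixel (i, j) against every pixel in row i.
        q_r = q.permute(0, 2, 3, 1).reshape(b * h, w, -1)
        k_r = k.permute(0, 2, 3, 1).reshape(b * h, w, -1)
        e_r = torch.bmm(q_r, k_r.transpose(1, 2)).view(b, h, w, w)

        # Column energies: pixel (i, j) against every pixel in column j.
        q_c = q.permute(0, 3, 2, 1).reshape(b * w, h, -1)
        k_c = k.permute(0, 3, 2, 1).reshape(b * w, h, -1)
        e_c = (torch.bmm(q_c, k_c.transpose(1, 2))
               .view(b, w, h, h).permute(0, 2, 1, 3))  # -> (b, h, w, h)

        # Joint softmax over the w + h criss-cross positions of each pixel.
        att = torch.softmax(torch.cat([e_r, e_c], dim=-1), dim=-1)
        att_r, att_c = att[..., :w], att[..., w:]

        # Aggregate values along rows and columns.
        v_r = v.permute(0, 2, 3, 1).reshape(b * h, w, -1)
        out_r = torch.bmm(att_r.reshape(b * h, w, w), v_r).view(b, h, w, c)
        v_c = v.permute(0, 3, 2, 1).reshape(b * w, h, -1)
        out_c = (torch.bmm(att_c.permute(0, 2, 1, 3).reshape(b * w, h, h), v_c)
                 .view(b, w, h, c).permute(0, 2, 1, 3))

        out = (out_r + out_c).permute(0, 3, 1, 2)  # back to (b, c, h, w)
        return self.gamma * out + x                # residual connection


feats = torch.randn(2, 64, 32, 32)                 # e.g. a Fused-MBConv output
print(CrissCrossAttention(64)(feats).shape)        # torch.Size([2, 64, 32, 32])
```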