Passive acoustic monitoring combined with deep learning-based bird sound classifiers is an effective monitoring tool, particularly in remote areas. While self-supervised learning has recently excelled in natural language processing and image recognition, its application to bird sound recognition remains limited. This study proposes a self-supervised learning approach that leverages vast amounts of passive acoustic recordings for pre-training, followed by fine-tuning on target species. Compared with three state-of-the-art models based on transfer learning from ImageNet, the proposed method improved performance across all species, with larger gains for tail-end species. These results confirm that domain-specific self-supervised pre-training enhances downstream recognition tasks and provides greater robustness, benefiting tail-end species in imbalanced ecological datasets. Our experiments further demonstrate that integrating open-source datasets and data augmentation techniques is the most effective strategy for mitigating data imbalance and cross-domain issues. In addition, introducing a ‘catch-all’ category into the training datasets improved model robustness in open set recognition scenarios. We also identified the minimum viable sample size for the proposed model and explored how overlapping bird vocalizations during dawn choruses affect model performance. Targeting 31 bird species in the montane regions of subtropical Taiwan, the model achieved a class-wise mean average precision of 0.782 and an overall precision of 85.6% at the F0.5 threshold in dawn chorus soundscape recordings. This study confirms the effectiveness and advantages of self-supervised learning in bird sound recognition, supporting long-term monitoring of bird distribution and vocal activity in remote montane areas.
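The two reported metrics can be illustrated with a minimal Python sketch. It rests on stated assumptions rather than on the study's own evaluation code: average precision is computed per species from ranked detection scores and then averaged across species, and the "F0.5 threshold" is taken to mean the score cutoff that maximizes the F0.5 score, which weights precision more heavily than recall. All function names and the toy data are hypothetical.

```python
from typing import List, Sequence, Tuple


def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def precision_recall(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> Tuple[float, float]:
    """Precision and recall when scores >= threshold count as detections."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def best_f05_threshold(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Assumed reading of 'F0.5 threshold': the cutoff that maximizes F0.5."""
    candidates = sorted(set(scores))
    return max(
        candidates,
        key=lambda t: f_beta(*precision_recall(scores, labels, t), beta=0.5),
    )


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Mean of the precision values at the rank of each true positive."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def class_wise_map(
    per_class: List[Tuple[Sequence[float], Sequence[int]]]
) -> float:
    """Unweighted mean of per-species average precision."""
    aps = [average_precision(s, y) for s, y in per_class]
    return sum(aps) / len(aps)


# Toy data for two hypothetical species (not from the study).
species_a = ([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
species_b = ([0.9, 0.1], [1, 0])
cmap = class_wise_map([species_a, species_b])
threshold_a = best_f05_threshold(*species_a)
```

Because every species contributes equally to the class-wise mean, rare (tail-end) species influence the score as much as common ones, which is one reason this metric suits imbalanced datasets.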