During the harvest season, orchards are frequently plagued by birds, and the resulting fruit pecking can adversely affect both fruit quality and yield. Recognizing bird songs is crucial for preventing damage caused by orchard birds because it provides the basis for subsequent bird-repellent efforts. However, the extensive effort required to annotate sound samples poses a significant challenge for supervised deep learning. In this paper, we propose a self-supervised multi-view learning framework based on multi-level contrasting (MV-MLC) for bird song recognition, which takes both time-domain and spectrogram views as inputs. The framework leverages MLC to learn representations automatically from unlabeled data, while a multi-scale feature extraction (MSFE) backbone network captures the temporal features of bird songs at different scales. The time-spectrogram consistency task in MLC learning facilitates semantic-level information exchange across the views, while the hierarchical contrastive learning task captures granularity-level information, yielding more robust contextual representations. In addition, embedding a shuffle attention module in the MSFE network helps mine the spatial and channel dependencies of bird song features, further enhancing the representations produced by the multi-scale network. We conducted extensive experiments using our self-built 10-class bird song data set (Orchard-birds) and the publicly available Birdsdata and Powdermill data sets. The experimental results demonstrated that MV-MLC outperformed state-of-the-art self-supervised models. In particular, MV-MLC achieved outstanding performance even with a small proportion of labeled data. The recognition accuracies on the Orchard-birds and Birdsdata data sets were 99.40% and 92.67%, respectively, with macro F1-scores of 99.40% and 92.61%.
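To make the time-spectrogram consistency idea concrete, the following is a minimal illustrative sketch (not the authors' implementation) of a contrastive objective that aligns embeddings of the two views of the same recording; the function name, embedding dimensions, and the NT-Xent-style loss are assumptions for demonstration only.

```python
# Hypothetical sketch of a two-view (time-domain vs. spectrogram) contrastive
# loss; names, shapes, and the specific loss form are illustrative assumptions,
# not the paper's actual MV-MLC objective.
import torch
import torch.nn.functional as F

def multiview_contrastive_loss(z_time: torch.Tensor,
                               z_spec: torch.Tensor,
                               temperature: float = 0.1) -> torch.Tensor:
    """z_time, z_spec: (batch, dim) embeddings of the same clips produced by
    the time-view and spectrogram-view encoders, respectively."""
    z_time = F.normalize(z_time, dim=1)
    z_spec = F.normalize(z_spec, dim=1)
    # Pairwise cosine similarities between the two views, scaled by temperature.
    logits = z_time @ z_spec.t() / temperature
    targets = torch.arange(z_time.size(0), device=z_time.device)
    # Matching time/spectrogram pairs are positives; all other pairs in the
    # batch act as negatives (symmetrized over both directions).
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    # Random embeddings stand in for encoder outputs in this usage example.
    z_t = torch.randn(8, 128)
    z_s = torch.randn(8, 128)
    print(multiview_contrastive_loss(z_t, z_s).item())
```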