Abstract

Bioacoustics, the study of animal vocalizations and natural soundscapes, has emerged as a valuable tool for studying species within their habitats, particularly those that are challenging to observe. This approach has broadened the horizons of biodiversity assessment and ecological research. However, monitoring wildlife with acoustic recorders produces large volumes of data that are labor-intensive to analyze. Deep learning has recently transformed many computational disciplines by enabling the automated processing of large and complex datasets, and has gained attention within the bioacoustics community. Despite its impact on acoustic detection and classification, attaining both high detection accuracy and low false positive rates in bioacoustics remains a significant challenge. An intriguing yet largely unexplored avenue for enhancing deep learning in bioacoustics is the use of contextual information, such as time and location, to discern animal vocalizations within acoustic recordings. As a first case study, a multi-branch Convolutional Neural Network (CNN) was developed to classify the songs of 22 bird species, taking spectrograms as its primary input and spatial metadata as a secondary input. It was compared to a baseline model with spectrogram input only. Separately, a geographical prior neural network was trained to estimate the probability of a species occurring at a given location, and its output was combined with that of the baseline CNN. As a second case study, temporal metadata and spectrograms were used as inputs to a multi-branch CNN to detect calls of the Hainan gibbon (Nomascus hainanus), the world’s rarest primate. Our findings demonstrate that adding metadata to the bird song classifier significantly improves classification performance, with the largest improvement achieved by the geographical prior model (F1-score of 87.78% versus 61.02% for the baseline). The multi-branch CNNs also proved effective (F1-scores of 76.87% and 78.77%) and simpler to use than the geographical prior. In the second case study, using the metadata in the multi-branch CNN reduced false positives by 63% (with 94% of calls detected) and increased gibbon detections by 19%. This study uncovers a promising new avenue for improving classifier performance in bioacoustics. The methodology described here can help ecologists, wildlife management teams, and researchers reduce the time spent analyzing large acoustic datasets obtained from passive acoustic monitoring studies. Our approach can be adapted and applied to other calling species, and thus tailored to other use cases.
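
To make the fusion architecture concrete, below is a minimal sketch in PyTorch of a two-branch network of the kind the abstract describes: a convolutional branch for the spectrogram and a small dense branch for contextual metadata, fused by concatenation before the classification head. This is not the authors' released code; all layer sizes, feature dimensions, and the multiplicative combination with the geographical prior are illustrative assumptions.

```python
# Hedged sketch of a multi-branch CNN (illustrative, not the paper's code).
import torch
import torch.nn as nn

class MultiBranchCNN(nn.Module):
    def __init__(self, n_classes: int, n_meta_features: int):
        super().__init__()
        # Branch 1: convolutional feature extractor for the spectrogram.
        self.spec_branch = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
            nn.Flatten(),  # -> 32 * 4 * 4 = 512 features
        )
        # Branch 2: dense layers for the contextual metadata vector
        # (e.g. latitude/longitude or encoded time-of-day features).
        self.meta_branch = nn.Sequential(
            nn.Linear(n_meta_features, 32), nn.ReLU(),
            nn.Linear(32, 32), nn.ReLU(),
        )
        # Fusion head: concatenate both branches, then classify.
        self.head = nn.Sequential(
            nn.Linear(512 + 32, 128), nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, spectrogram: torch.Tensor, metadata: torch.Tensor):
        spec_feats = self.spec_branch(spectrogram)  # (batch, 512)
        meta_feats = self.meta_branch(metadata)     # (batch, 32)
        fused = torch.cat([spec_feats, meta_feats], dim=1)
        return self.head(fused)                     # class logits

def combine_with_prior(logits: torch.Tensor, prior_probs: torch.Tensor):
    # Separately trained geographical prior: weight the acoustic model's
    # class probabilities by per-location occurrence probabilities.
    # Element-wise multiplication is a common late-fusion scheme and an
    # assumption here, not necessarily the paper's exact combination rule.
    return torch.softmax(logits, dim=1) * prior_probs

# Example forward pass with dummy data: four 1-channel spectrograms
# and matching 2-D (lat, lon) metadata vectors.
model = MultiBranchCNN(n_classes=22, n_meta_features=2)
logits = model(torch.randn(4, 1, 128, 256), torch.randn(4, 2))
print(logits.shape)  # torch.Size([4, 22])
```

Fusing by concatenation keeps the metadata branch cheap and lets the spectrogram branch remain a standard CNN, which is consistent with the abstract's observation that the multi-branch models are simpler to use than a separately trained geographical prior.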
