Abstract

Bioacoustic sensors, sometimes known as autonomous recording units (ARUs), can record sounds of wildlife over long periods of time in scalable and minimally invasive ways. Deriving per-species abundance estimates from these sensors requires detection, classification, and quantification of animal vocalizations as individual acoustic events. Yet, variability in ambient noise, both over time and across sensors, hinders the reliability of current automated systems for sound event detection (SED), such as convolutional neural networks (CNNs) in the time-frequency domain. In this article, we develop, benchmark, and combine several machine listening techniques to improve the generalizability of SED models across heterogeneous acoustic environments. As a case study, we consider the problem of detecting avian flight calls from a ten-hour recording of nocturnal bird migration, recorded by a network of six ARUs in the presence of heterogeneous background noise. Starting from a CNN yielding state-of-the-art accuracy on this task, we introduce two noise adaptation techniques, respectively integrating short-term (60 ms) and long-term (30 min) context. First, we apply per-channel energy normalization (PCEN) in the time-frequency domain, which applies short-term automatic gain control to every subband of the mel-frequency spectrogram. Second, we replace the last dense layer in the network by a context-adaptive neural network (CA-NN) layer, i.e., an affine layer whose weights are dynamically adapted at prediction time by an auxiliary network taking long-term summary statistics of spectrotemporal features as input. We show that PCEN reduces temporal overfitting across dawn vs. dusk audio clips, whereas context adaptation on PCEN-based summary statistics reduces spatial overfitting across sensor locations. Moreover, combining them yields state-of-the-art results that are unmatched by artificial data augmentation alone.
We release a pre-trained version of our best-performing system under the name of BirdVoxDetect, a ready-to-use detector of avian flight calls in field recordings.

Highlights

  • Machine listening for large-scale bioacoustic monitoring

  • We confirm that mixture of experts (MoE) is successful; moreover, we find that implementing context adaptation with an adaptive threshold (AT) leads to sound event detection results that are within a statistical tie with MoE

  • We have developed, benchmarked, and combined several machine listening techniques to improve the generalizability of sound event detection (SED) models across heterogeneous acoustic environments
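
The context adaptation scheme mentioned above can be illustrated with a minimal NumPy sketch: an auxiliary network maps long-term summary statistics of the recording to the weights and bias of the detector's final affine layer. All dimensions, initializations, and names here are hypothetical, not taken from the published system:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: short-term features, context statistics, hidden units.
d_feat, d_ctx, d_hid = 128, 32, 64

# Auxiliary network parameters: maps long-term summary statistics
# (e.g., per-band PCEN means over 30 min) to the weights and bias
# of the final affine layer of the detector.
W1 = rng.standard_normal((d_hid, d_ctx)) * 0.1
b1 = np.zeros(d_hid)
W2 = rng.standard_normal((d_feat + 1, d_hid)) * 0.1  # emits weights + bias

def context_adaptive_affine(x, context):
    """x: short-term feature vector; context: long-term summary statistics."""
    h = np.tanh(W1 @ context + b1)              # auxiliary hidden layer
    params = W2 @ h                             # predicted affine parameters
    w, b = params[:-1], params[-1]              # split into weights and bias
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))   # detection probability

x = rng.standard_normal(d_feat)       # e.g., penultimate CNN activations
context = rng.standard_normal(d_ctx)  # e.g., PCEN-based summary statistics
p = context_adaptive_affine(x, context)
print(p)  # scalar in (0, 1)
```

Because the affine weights are recomputed from the context at prediction time, the same trained detector can adjust its decision boundary to the noise profile of each sensor location.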

Introduction

Machine listening for large-scale bioacoustic monitoring

The past decades have witnessed a steady decrease in the hardware costs of sound acquisition [1], processing [2], transmission [3], and storage [4]. In comparison with optical sensors, acoustic sensors are minimally invasive [9], have a longer detection range—from decameters for a flock of migratory birds to thousands of kilometers for an oil exploration airgun [10]—and their reliability is independent of the amount of daylight [11]. In this context, one emerging application is the species-specific inventory of vocalizing animals [12], such as birds [13], primates [14], and marine mammals [15], whose occurrence in time and space reflects the magnitude of population movements [16] and can be correlated with other environmental variables, such as local weather [17].
