Abstract

Evaluating sound similarity is a fundamental building block in acoustic perception and computational analysis. Traditional data-driven analyses of perceptual similarity are based on heuristics or simplified linear models, and are therefore limited. Deep learning embeddings, often trained with triplet networks, have proven useful in many fields; however, such networks are usually trained on large class-labelled datasets, and such labels are not always feasible to acquire. We explore data-driven neural embeddings for sound event representation when class labels are absent, instead utilising proxies of perceptual similarity judgements. Ultimately, our target is to create a perceptual embedding space that reflects animals' perception of sound. We create deep perceptual embeddings for bird sounds using triplet models. To address the difficulty of triplet loss training when class-labelled data are lacking, we utilise multidimensional scaling (MDS) pretraining, attention pooling, and a triplet mining scheme. We also evaluate the advantage of triplet learning compared to learning a neural embedding from a model trained on MDS alone. Using computational proxies of similarity judgements, we demonstrate that the method can be used to develop perceptual models for a wide range of data based on behavioural judgements, helping us understand how animals perceive sounds.
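
To make the training setup concrete, the following is a minimal sketch (in PyTorch) of an embedding network with attention pooling over time frames, trained with a standard triplet margin loss. The layer sizes, input shape, and names are illustrative assumptions, not the architecture used in the paper.

```python
# Minimal sketch (PyTorch), assuming log-mel spectrogram inputs of shape
# (batch, frames, mels). All sizes and names are illustrative assumptions.
import torch
import torch.nn as nn

class AttentionPoolEmbedder(nn.Module):
    def __init__(self, n_mels=40, hidden=128, emb_dim=16):
        super().__init__()
        self.frame_net = nn.Sequential(
            nn.Linear(n_mels, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.attn = nn.Linear(hidden, 1)      # per-frame attention score
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, x):                       # x: (batch, frames, n_mels)
        h = self.frame_net(x)                   # (batch, frames, hidden)
        w = torch.softmax(self.attn(h), dim=1)  # attention weights over frames
        pooled = (w * h).sum(dim=1)             # weighted temporal pooling
        return self.proj(pooled)                # (batch, emb_dim)

model = AttentionPoolEmbedder()
triplet_loss = nn.TripletMarginLoss(margin=1.0)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# anchor/positive/negative: spectrogram batches chosen by some mining scheme
anchor, positive, negative = (torch.randn(8, 100, 40) for _ in range(3))
loss = triplet_loss(model(anchor), model(positive), model(negative))
opt.zero_grad(); loss.backward(); opt.step()
```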

Highlights

  • Animal perception and animal vocal communication are important research areas, and statistical machine learning has brought new forms of evidence to bear on them (Elie and Theunissen, 2016; Kohlsdorf et al., 2016; Stowell et al., 2016)

  • We focus on model development, using an algorithmic stand-in for perceptual judgements: in this case, sound similarity as determined acoustically by the software Luscinia

  • We utilise two approaches to this goal: first, we use Luscinia distances and multidimensional scaling (MDS) to create an embedding space that we model with a deep learning approach (see the sketch after this list); second, we use unsupervised triplet loss training
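
As a minimal illustration of the first approach, the sketch below applies scikit-learn's MDS to a precomputed dissimilarity matrix standing in for Luscinia distances. The random matrix and the 16-dimensional target space are assumptions for illustration, not the paper's configuration.

```python
# Minimal sketch (scikit-learn): MDS over a precomputed distance matrix.
# A random symmetric matrix stands in here for Luscinia-derived distances.
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
n = 50
d = rng.random((n, n))
dist = (d + d.T) / 2          # symmetrise
np.fill_diagonal(dist, 0.0)   # zero self-distance

mds = MDS(n_components=16, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dist)   # (n, 16) embedding coordinates

# These coordinates could then serve as regression targets when pretraining
# a network that maps each sound's spectrogram to its MDS position.
print(coords.shape)
```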


Introduction

Animal perception and animal vocal communication are important research areas, and statistical machine learning has brought new forms of evidence to bear on them (Elie and Theunissen, 2016; Kohlsdorf et al., 2016; Stowell et al., 2016). In particular, in many cases we must not presume that the similarity judgements of humans are valid proxies for the fine-detail perceptual judgements of other species (Dooling and Prior, 2017). Deep learning methods, however, primarily rely on massive annotated datasets for successful training, so in this domain it becomes important to develop methods that can work well even with small, weakly-labelled, or unlabelled datasets. It is important to recognise that, even when triplet loss training schemes are used, most of the successes in representation learning are driven by datasets with explicit class labels, which are used to provide a strong signal of semantic distance (Thakur et al., 2019).
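
As a concrete, purely illustrative example of how triplets can be formed without class labels, the sketch below mines (anchor, positive, negative) index triplets directly from a proxy pairwise distance matrix: for each anchor, a markedly closer item is treated as the positive and a markedly farther one as the negative. The relative-margin rule and its threshold are our assumptions, not the paper's exact mining scheme.

```python
# Minimal sketch of label-free triplet mining from a proxy distance matrix.
# The sampling rule and margin are illustrative assumptions.
import numpy as np

def mine_triplets(dist, n_triplets=1000, margin_frac=0.2, seed=0):
    """Sample (anchor, positive, negative) index triplets such that the
    positive is closer to the anchor than the negative by a relative margin."""
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    triplets = []
    while len(triplets) < n_triplets:
        a, p, q = rng.choice(n, size=3, replace=False)
        d_ap, d_aq = dist[a, p], dist[a, q]
        if d_ap < d_aq * (1 - margin_frac):
            triplets.append((a, p, q))   # p is the positive, q the negative
        elif d_aq < d_ap * (1 - margin_frac):
            triplets.append((a, q, p))   # roles reversed
    return np.array(triplets)

# Example with a random symmetric distance matrix as a stand-in.
rng = np.random.default_rng(0)
d = rng.random((50, 50))
dist = (d + d.T) / 2
np.fill_diagonal(dist, 0.0)
print(mine_triplets(dist, n_triplets=10))
```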
