Abstract

Evaluating sound similarity is a fundamental building block in acoustic perception and computational analysis. Traditional data-driven analyses of perceptual similarity are based on heuristics or simplified linear models, and are therefore limited. Deep learning embeddings, often trained with triplet networks, have proven useful in many fields; however, such networks are usually trained on large class-labelled datasets, and such labels are not always feasible to acquire. We explore data-driven neural embeddings for sound event representation when class labels are absent, instead utilising proxies of perceptual similarity judgements. Ultimately, our target is to create a perceptual embedding space that reflects animals' perception of sound. We create deep perceptual embeddings for bird sounds using triplet models. To address the difficulty of triplet loss training when class-labelled data are lacking, we utilise multidimensional scaling (MDS) pretraining, attention pooling, and a triplet mining scheme. We also evaluate the advantage of triplet learning compared to learning a neural embedding from a model trained on MDS alone. Using computational proxies of similarity judgements, we demonstrate that the method can be used to develop perceptual models for a wide range of data based on behavioural judgements, helping us understand how animals perceive sounds.
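
To make the training setup concrete, the following is a minimal sketch (in PyTorch) of an embedding network with attention pooling over time frames, trained with a standard triplet margin loss. The layer sizes, input shape, and names are illustrative assumptions, not the architecture used in the paper.

```python
# Minimal sketch (PyTorch), assuming log-mel spectrogram inputs of shape
# (batch, frames, mels). All sizes and names are illustrative assumptions.
import torch
import torch.nn as nn

class AttentionPoolEmbedder(nn.Module):
    def __init__(self, n_mels=40, hidden=128, emb_dim=16):
        super().__init__()
        self.frame_net = nn.Sequential(
            nn.Linear(n_mels, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.attn = nn.Linear(hidden, 1)      # per-frame attention score
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, x):                       # x: (batch, frames, n_mels)
        h = self.frame_net(x)                   # (batch, frames, hidden)
        w = torch.softmax(self.attn(h), dim=1)  # attention weights over frames
        pooled = (w * h).sum(dim=1)             # weighted temporal pooling
        return self.proj(pooled)                # (batch, emb_dim)

model = AttentionPoolEmbedder()
triplet_loss = nn.TripletMarginLoss(margin=1.0)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# anchor/positive/negative: spectrogram batches chosen by some mining scheme
anchor, positive, negative = (torch.randn(8, 100, 40) for _ in range(3))
loss = triplet_loss(model(anchor), model(positive), model(negative))
opt.zero_grad(); loss.backward(); opt.step()
```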

Highlights

  • Animal perception and animal vocal communication are important research areas, and statistical machine learning has brought new forms of evidence to bear on them (Elie and Theunissen, 2016; Kohlsdorf et al., 2016; Stowell et al., 2016)

  • We focus on model development, using an algorithmic stand-in for perceptual judgements: in this case, sound similarity as determined acoustically by the software Luscinia

  • We utilise two approaches to this goal: first, we use Luscinia distances and multidimensional scaling (MDS) to create an embedding space that we model with a deep learning approach (see the sketch after this list); second, we use unsupervised triplet loss training
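
As a minimal illustration of the first approach, the sketch below applies scikit-learn's MDS to a precomputed dissimilarity matrix standing in for Luscinia distances. The random matrix and the 16-dimensional target space are assumptions for illustration, not the paper's configuration.

```python
# Minimal sketch (scikit-learn): MDS over a precomputed distance matrix.
# A random symmetric matrix stands in here for Luscinia-derived distances.
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
n = 50
d = rng.random((n, n))
dist = (d + d.T) / 2          # symmetrise
np.fill_diagonal(dist, 0.0)   # zero self-distance

mds = MDS(n_components=16, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dist)   # (n, 16) embedding coordinates

# These coordinates could then serve as regression targets when pretraining
# a network that maps each sound's spectrogram to its MDS position.
print(coords.shape)
```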


Introduction

Animal perception and animal vocal communication are important research areas, and statistical machine learning has brought new forms of evidence to bear on them (Elie and Theunissen, 2016; Kohlsdorf et al., 2016; Stowell et al., 2016). In particular, in many cases we must not presume that the similarity judgements of humans are valid proxies for the fine-detail perceptual judgements of other species (Dooling and Prior, 2017). Deep learning methods, however, primarily rely on massive annotated datasets for successful training, so in this domain it becomes important to develop methods that can work well even with small, weakly-labelled, or unlabelled datasets. It is important to recognise that, even when triplet loss training schemes are used, most of the successes in representation learning are driven by datasets with explicit class labels, which are used to provide a strong signal of semantic distance (Thakur et al., 2019).
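
As a concrete, purely illustrative example of how triplets can be formed without class labels, the sketch below mines (anchor, positive, negative) index triplets directly from a proxy pairwise distance matrix: for each anchor, a markedly closer item is treated as the positive and a markedly farther one as the negative. The relative-margin rule and its threshold are our assumptions, not the paper's exact mining scheme.

```python
# Minimal sketch of label-free triplet mining from a proxy distance matrix.
# The sampling rule and margin are illustrative assumptions.
import numpy as np

def mine_triplets(dist, n_triplets=1000, margin_frac=0.2, seed=0):
    """Sample (anchor, positive, negative) index triplets such that the
    positive is closer to the anchor than the negative by a relative margin."""
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    triplets = []
    while len(triplets) < n_triplets:
        a, p, q = rng.choice(n, size=3, replace=False)
        d_ap, d_aq = dist[a, p], dist[a, q]
        if d_ap < d_aq * (1 - margin_frac):
            triplets.append((a, p, q))   # p is the positive, q the negative
        elif d_aq < d_ap * (1 - margin_frac):
            triplets.append((a, q, p))   # roles reversed
    return np.array(triplets)

# Example with a random symmetric distance matrix as a stand-in.
rng = np.random.default_rng(0)
d = rng.random((50, 50))
dist = (d + d.T) / 2
np.fill_diagonal(dist, 0.0)
print(mine_triplets(dist, n_triplets=10))
```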
