Abstract

The automatic localization of audio sources distributed symmetrically with respect to the coronal or transverse planes using binaural signals remains a challenging task, due to the front–back and up–down confusion effects. This paper demonstrates that a convolutional neural network (CNN) can be used to automatically localize music ensembles panned to the front, back, up, or down positions. The network was developed using a repository of binaural excerpts obtained by convolving multi-track music recordings with selected sets of head-related transfer functions (HRTFs). The excerpts were generated in such a way that a music ensemble (of circular shape in terms of its boundaries) was positioned in one of four locations with respect to the listener: front, back, up, or down. According to the obtained results, the CNN identified the location of the ensembles with average accuracy levels of 90.7% and 71.4% when tested under HRTF-dependent and HRTF-independent conditions, respectively. For HRTF-dependent tests, the accuracy decreased monotonically as the ensemble size increased. A modified image occlusion sensitivity technique revealed that selected frequency bands are particularly important to the localization process. These frequency bands are largely in accordance with the psychoacoustical literature.
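The binaural excerpts described above are produced by convolving each mono stem with a pair of head-related impulse responses (HRIRs, the time-domain counterparts of HRTFs) and summing the per-ear results across stems. The sketch below illustrates this rendering step under the assumption that per-track HRIR pairs have already been selected to place the ensemble in the desired region; the function name and structure are illustrative, not the authors' actual pipeline.

```python
import numpy as np

def binauralize(tracks, hrirs):
    """Render mono tracks to a two-channel binaural mix.

    tracks: list of 1-D arrays (mono stems)
    hrirs:  list of (hrir_left, hrir_right) pairs, one per track,
            chosen so the ensemble occupies the intended region
            (front, back, up, or down) relative to the listener.
    """
    # Output length: longest full convolution among all stems.
    n = max(len(t) + max(len(hl), len(hr)) - 1
            for t, (hl, hr) in zip(tracks, hrirs))
    out = np.zeros((2, n))
    for track, (h_l, h_r) in zip(tracks, hrirs):
        left = np.convolve(track, h_l)    # ear-specific filtering
        right = np.convolve(track, h_r)
        out[0, :len(left)] += left        # sum stems per ear
        out[1, :len(right)] += right
    return out
```

In practice, the HRIR pair for each stem would be picked from a measured HRTF set at directions sampled within the circular boundary of the ensemble.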

Highlights

  • While binaural audio technology has been known for decades [1], advancements in consumer electronics facilitated its widespread adoption predominantly during the post-millennial era

  • According to the results obtained under the head-related transfer function (HRTF)-dependent tests, the average accuracy of the classification of front-, back-, up-, and down-located music ensembles was 80.26% (standard deviation (SD) 0.68)

  • This work demonstrates that a convolutional neural network (CNN) is capable of undertaking the challenging task of identifying front-, back-, up-, and down-located music ensembles in synthetically generated binaural signals, which constitutes the main contribution of this study to the field of spatial audio scene characterization (SASC)
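The band-importance analysis reported in the abstract rests on occlusion sensitivity: mask part of the classifier's input and measure how much the score for the true class drops. A minimal, generic sketch of this idea applied to spectrogram frequency bands is given below; the `model` callable, the fill value, and the band size are assumptions for illustration, not the paper's modified variant.

```python
import numpy as np

def band_occlusion_sensitivity(model, spec, band_height=8, fill=0.0):
    """Occlude horizontal (frequency) bands of a spectrogram and record
    the resulting drop in the model's score for the true class.

    model: callable mapping a 2-D spectrogram to a scalar class score
    spec:  2-D array of shape (n_freq_bins, n_frames)
    """
    baseline = model(spec)
    drops = []
    for lo in range(0, spec.shape[0], band_height):
        occluded = spec.copy()
        occluded[lo:lo + band_height, :] = fill  # mask one frequency band
        drops.append(baseline - model(occluded))
    # A large drop indicates the band is important for the decision.
    return np.array(drops)
```

Bands whose occlusion causes the largest score drops are interpreted as the most informative for localization.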


Introduction

While binaural audio technology has been known for decades [1], advancements in consumer electronics facilitated its widespread adoption predominantly during the post-millennial era. The state-of-the-art computational models for binaural localization developed so far were intended to localize individual audio sources [4,5,6,7,8,9] rather than to characterize complex spatial audio scenes at various descriptive levels (see [10] for a review of binaural localization models). These models were designed predominantly using speech signals and were intended to localize speakers [7]. Preliminary models allowing for full-sphere binaural sound source localization have been proposed only recently [14,15].
