Abstract

Linear comparisons can fail to describe perceptual differences between head-related transfer functions (HRTFs), reducing their utility for perceptual tests, HRTF selection methods, and prediction algorithms. This work introduces a machine learning framework for constructing a perceptual error metric aligned with human sound-localization performance. A neural network is first trained to predict measurement locations from a large database of HRTFs and is then fine-tuned with perceptual data. The resulting metric performs robustly compared with a standard spectral-difference error metric. A statistical test is employed to quantify the information gain from the perceptual observations as a function of space.
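To make the first stage concrete, below is a minimal sketch of a network that maps an HRTF magnitude spectrum to a direction on the sphere, pretrained on measured data. The architecture, the frequency-bin count, and the names (`HrtfLocalizer`, `angular_loss`, `N_FREQ`) are illustrative assumptions on our part, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_FREQ = 256  # assumed number of magnitude bins per ear

class HrtfLocalizer(nn.Module):
    """Maps a two-ear HRTF magnitude spectrum to a unit direction vector."""
    def __init__(self, n_freq: int = N_FREQ):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * n_freq, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 3),  # Cartesian direction (x, y, z)
        )

    def forward(self, spectra: torch.Tensor) -> torch.Tensor:
        # Project the output onto the unit sphere so it is a valid direction.
        return F.normalize(self.net(spectra), dim=-1)

def angular_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Mean great-circle error (radians) between predicted and true directions."""
    cos = (pred * target).sum(dim=-1).clamp(-1.0, 1.0)
    return torch.acos(cos).mean()

# Pretraining on a large measured-HRTF database (data loading omitted).
model = HrtfLocalizer()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
for spectra, directions in []:  # placeholder for a real DataLoader
    opt.zero_grad()
    angular_loss(model(spectra), directions).backward()
    opt.step()
```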

Highlights

  • Individual head-related transfer functions (HRTFs) are complex and vary substantially from person to person because of anthropometric differences such as ear shape, head circumference, and torso size (Ref. 1). Their high dimensionality and idiosyncratic nature pose problems for binaural rendering because the apparent realism of any virtual acoustics depends heavily on how closely the HRTF matches that of the individual listener.

  • The proposed metric already outperforms classical measures such as spectral distance (a representative baseline is sketched after this list). We attribute this to the choice of model: we conjecture that a neural network is needed to robustly map HRTF signal space to spherical-domain locations, given the complex variance in the spectral cues that determine localization across a broad population of individuals.

  • In its current form, we posit that our proposed metric could be constructed and treated as a black-box loss function, inserted at the tail end of HRTF generation or selection systems to compute error in the spatial domain, rather than the spectral domain, and propagate it back to the system (a usage sketch follows this list).
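The "spectral distance" baseline referred to above is not specified in this excerpt; log-spectral distortion (LSD) is a common choice and is sketched here as a representative example, not necessarily the paper's exact baseline.

```python
import numpy as np

def log_spectral_distortion(h_ref: np.ndarray, h_test: np.ndarray,
                            eps: float = 1e-12) -> float:
    """RMS difference, in dB, between two HRTF magnitude responses."""
    diff_db = 20.0 * np.log10((np.abs(h_ref) + eps) / (np.abs(h_test) + eps))
    return float(np.sqrt(np.mean(diff_db ** 2)))
```

A metric like this weights every frequency bin equally, so it cannot privilege the specific cues (e.g., pinna notches) that drive localization, which is consistent with the claim that linear spectral comparisons can miss perceptual differences.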
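The black-box usage described in the last point could look like the following: a frozen localizer scores candidate HRTFs by predicted spatial error rather than by spectral difference. `HrtfLocalizer` and all names here carry over from the earlier sketch and are our assumptions, not a released interface.

```python
import torch

@torch.no_grad()
def spatial_error(localizer, candidate_spectra, true_directions):
    """Mean great-circle error (radians) of the localizer's predictions."""
    pred = localizer(candidate_spectra)
    cos = (pred * true_directions).sum(dim=-1).clamp(-1.0, 1.0)
    return torch.acos(cos).mean()

def select_hrtf(localizer, candidates, true_directions):
    """Return the index of the candidate HRTF set with the lowest spatial error."""
    errors = torch.stack([spatial_error(localizer, c, true_directions)
                          for c in candidates])
    return int(errors.argmin())
```

For a generative system, dropping `torch.no_grad()` would let the angular error backpropagate through the frozen localizer into the generator's parameters, which is one way to realize error propagation in the spatial domain.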



Introduction

Individual head-related transfer functions (HRTFs) are complex and vary substantially from person to person because of anthropometric differences such as ear shape, head circumference, and torso size. Their high dimensionality and idiosyncratic nature pose problems for binaural rendering because the apparent realism of any virtual acoustics depends heavily on how closely the HRTF matches that of the individual listener. We suggest (1) constructing a model, first built on large amounts of informative, non-perceptual data, that constitutes a "prior" on the relationship between an HRTF spectrum and its corresponding spatial location; (2) fine-tuning this model on the sparse, noisy perceptual observations available in existing small-scale datasets, collected through procedures like the hypothetical setting described above, which we consider the "posterior" model; and (3) computing measures of statistical significance, as a function of spatial location, between the prior and posterior models. Further details regarding the setup, experiment protocol, and spatial processing necessary for synthesizing the virtual sound sources can be found in Ref. 11.
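A minimal sketch of steps (2) and (3) follows, under stated assumptions: fine-tuning reuses the `HrtfLocalizer` from the earlier sketch, perceptual observations are (spectrum, perceived-direction) pairs, and "information gain" is read here as a KL divergence between Gaussian fits to the prior and posterior error distributions at each location. The paper's actual statistical test may differ.

```python
import copy
import torch

def fine_tune(prior_model, perceptual_data, lr=1e-5, epochs=10):
    """Fine-tune a copy of the prior on sparse (spectra, perceived_dir) pairs."""
    posterior = copy.deepcopy(prior_model)
    opt = torch.optim.Adam(posterior.parameters(), lr=lr)
    for _ in range(epochs):
        for spectra, perceived in perceptual_data:
            opt.zero_grad()
            cos = (posterior(spectra) * perceived).sum(dim=-1).clamp(-1.0, 1.0)
            torch.acos(cos).mean().backward()
            opt.step()
    return posterior

def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL(p || q) between two univariate Gaussians, e.g., fit to the angular
    errors that the prior (p) and posterior (q) models produce at one spatial
    location; large values flag locations where perceptual data was informative."""
    return 0.5 * (torch.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)
```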

