Abstract

Spherical harmonic (SH) interpolation is a commonly used method to spatially up-sample sparse head-related transfer function (HRTF) datasets into denser ones. However, depending on the number of sparse HRTF measurements and the SH order, this process can introduce distortions into the high-frequency representation of the HRTFs. This paper investigates whether it is possible to restore some of the distorted high-frequency HRTF components using machine learning algorithms. A combination of convolutional auto-encoder (CAE) and denoising auto-encoder (DAE) models is proposed to restore the high-frequency distortion in SH-interpolated HRTFs. Results were evaluated using both perceptual spectral difference (PSD) and localisation prediction models, both of which demonstrated significant improvement after the restoration process.
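As context for the abstract, the SH interpolation step it refers to can be sketched as a least-squares fit of SH coefficients to the sparse measurement directions, followed by evaluation of the fitted expansion on a denser grid. The sketch below is illustrative only (it uses complex spherical harmonics from SciPy and invented function names such as `sh_interpolate`); the paper's actual processing chain may differ.

```python
import numpy as np
from scipy.special import sph_harm  # complex spherical harmonics Y_n^m

def sh_basis(order, azi, col):
    # One column per (n, m) pair, n = 0..order, m = -n..n.
    # azi: azimuth in [0, 2*pi); col: colatitude in [0, pi].
    cols = [sph_harm(m, n, azi, col)
            for n in range(order + 1) for m in range(-n, n + 1)]
    return np.stack(cols, axis=-1)  # shape (num_dirs, (order+1)**2)

def sh_interpolate(h_sparse, azi_sparse, col_sparse,
                   azi_dense, col_dense, order):
    """Fit SH coefficients to sparse data (least squares), then
    evaluate the expansion at the dense target directions."""
    Y_sparse = sh_basis(order, azi_sparse, col_sparse)
    coeffs, *_ = np.linalg.lstsq(Y_sparse, h_sparse, rcond=None)
    Y_dense = sh_basis(order, azi_dense, col_dense)
    return Y_dense @ coeffs
```

When the number of sparse measurements is small relative to `(order+1)**2`, the fit is under-determined and high-order (spatially fast-varying, hence high-frequency) components are lost or aliased, which is the distortion the paper sets out to restore.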

Highlights

  • Virtual reality (VR) and augmented reality (AR) technologies are on the rise, through the advent of commercially available and affordable VR/AR headsets, with applications in gaming, education, therapy, social media and digital culture, amongst others.

  • This paper investigates whether similar models can be used to restore the distorted high-frequency data in spherical harmonic (SH)-interpolated head-related transfer functions (HRTFs).

  • The perceptual spectral difference (PSD) model calculates the difference between two binaural signals or HRTFs, providing a perceptually more accurate comparison of their spectral differences.



Introduction

Virtual reality (VR) and augmented reality (AR) technologies are on the rise, through the advent of commercially available and affordable VR/AR headsets, with applications in gaming, education, therapy, social media and digital culture, amongst others. To produce convincing spatial audio, VR/AR technology must deliver to the ears the same binaural cues as would be experienced in real life [1,2]. A virtual loudspeaker framework can be employed, wherein methods such as vector base amplitude panning (VBAP) [5] or Ambisonics [6] are used to render sources between virtual loudspeaker points formed from the HRTFs [7]. Both methods typically require a high number of HRTF measurements to ensure good spatial resolution in the rendered audio [8].
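For readers unfamiliar with the VBAP rendering mentioned above, the core of 3D VBAP (Pulkki's formulation) is a small linear solve: the source direction is expressed as a non-negative combination of three loudspeaker direction vectors, and the resulting weights become the panning gains. This is a generic textbook sketch, not the paper's implementation; the function name `vbap_gains` is invented for illustration.

```python
import numpy as np

def vbap_gains(source_dir, speaker_dirs):
    """3D VBAP gains for one loudspeaker triplet.

    source_dir:   unit vector pointing at the virtual source.
    speaker_dirs: 3x3 array, one unit loudspeaker direction per row.
    Returns unit-norm, non-negative gains for the three loudspeakers.
    """
    L = np.asarray(speaker_dirs, dtype=float)
    p = np.asarray(source_dir, dtype=float)
    # Solve p = g @ L for g, i.e. L.T @ g = p.
    g = np.linalg.solve(L.T, p)
    # A negative gain means the source lies outside this triplet;
    # here we simply clip for the sketch.
    g = np.clip(g, 0.0, None)
    return g / np.linalg.norm(g)  # normalise to constant power
```

A source midway between two of the loudspeakers receives equal gains on those two and zero on the third, which is the amplitude-panning behaviour the introduction alludes to.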
