Abstract

Spatial audio has attracted increasing attention in fields such as virtual reality (VR) and navigation aids for the blind. Individualized head-related transfer functions (HRTFs) play an important role in generating spatial audio with accurate localization perception. Existing methods focus on a single database and do not fully utilize information from multiple databases. In light of this, this paper proposes an individualization model based on pre-training to predict HRTFs for any target user, and implements a real-time spatial audio rendering system on a wearable device to produce an immersive virtual auditory display. The proposed method first builds a pre-trained model from multiple databases using a DNN combined with an autoencoder-based dimensionality reduction method. This model captures the nonlinear relationship between user-independent HRTFs and position-dependent features. Then, fine-tuning is applied to a limited number of layers of the pre-trained model using a transfer learning technique. The key idea behind fine-tuning is to transfer the pre-trained user-independent model to a user-dependent one based on anthropometric features. Finally, real-time issues are addressed to guarantee a fluent auditory experience during dynamic scene updates, including fine-grained head-related impulse response (HRIR) acquisition, efficient spatial audio reproduction, and parallel synthesis and playback. These techniques ensure that the system runs with little computational cost, minimizing processing delay. The experimental results show that the proposed model outperforms other methods in terms of both subjective and objective metrics. Additionally, our rendering system runs on the HTC Vive with almost unnoticeable delay.
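As a rough illustration of the modeling pipeline described above, the following PyTorch sketch pairs an autoencoder for HRTF dimensionality reduction with a DNN that maps position features to the latent code, and then fine-tunes only the last layer for a target user while the remaining layers stay frozen. All layer sizes, feature dimensions, and the way the target user's data and anthropometric features enter the fine-tuning stage are illustrative assumptions, not the paper's actual architecture.

# Minimal sketch (not the authors' code): pre-train a position-to-HRTF model with an
# autoencoder bottleneck, then fine-tune only the last layer for a target user.
# Layer sizes, feature dimensions, and training details are illustrative assumptions.
import torch
import torch.nn as nn

HRTF_BINS = 128   # assumed length of an HRTF magnitude spectrum
LATENT    = 16    # assumed autoencoder bottleneck size
POS_DIM   = 3     # e.g. (azimuth, elevation, distance)

class HRTFAutoencoder(nn.Module):
    """Compresses HRTF spectra to a low-dimensional code and reconstructs them."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(HRTF_BINS, 64), nn.ReLU(),
                                     nn.Linear(64, LATENT))
        self.decoder = nn.Sequential(nn.Linear(LATENT, 64), nn.ReLU(),
                                     nn.Linear(64, HRTF_BINS))
    def forward(self, x):
        return self.decoder(self.encoder(x))

class PositionToLatent(nn.Module):
    """User-independent DNN mapping position features to the latent HRTF code."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(POS_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, 64), nn.ReLU(),
                                 nn.Linear(64, LATENT))
    def forward(self, pos):
        return self.net(pos)

def pretrain(ae, dnn, positions, hrtfs, epochs=100, lr=1e-3):
    """Jointly fit the autoencoder and the position->latent DNN on pooled databases."""
    opt = torch.optim.Adam(list(ae.parameters()) + list(dnn.parameters()), lr=lr)
    mse = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        recon = ae(hrtfs)                      # autoencoder reconstruction
        pred  = ae.decoder(dnn(positions))     # HRTF predicted from position features
        loss  = mse(recon, hrtfs) + mse(pred, hrtfs)
        loss.backward()
        opt.step()

def fine_tune(ae, dnn, user_positions, user_hrtfs, epochs=50, lr=1e-4):
    """Transfer to a target user: freeze early layers, update only the last one."""
    for p in ae.parameters():
        p.requires_grad = False
    for p in dnn.parameters():
        p.requires_grad = False
    for p in dnn.net[-1].parameters():         # unfreeze only the last layer
        p.requires_grad = True
    opt = torch.optim.Adam(dnn.net[-1].parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = mse(ae.decoder(dnn(user_positions)), user_hrtfs)
        loss.backward()
        opt.step()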

Highlights

  • We first build a user-independent pre-trained model mapping position features to head-related transfer functions (HRTFs) using the PKU&IOA and CIPIC databases, and then obtain an individualized model by fine-tuning the pre-trained model based on anthropometric features

  • More properties of HRTFs can be learned, and the model has the potential to improve the quality of spatial audio and achieve more accurate localization perception


Introduction

Currently, augmented reality (AR) and virtual reality (VR) technologies are becoming increasingly popular in our lives. Spatial audio is obtained by passing the sound source signal through two filters that contain all the localization-related information for the two ears, i.e., head-related transfer functions (HRTFs). HRTFs describe the propagation response of sound waves from the sound source to the eardrums, shaped by refraction, reflection and diffraction in free space, and are highly related to the anthropometric features of a human, such as head width, cavum concha height, and pinna height [4]. Since each subject has different anthropometric values, HRTFs are highly individual-dependent, and individual HRTFs are required for more accurate localization perception. Since human hearing is continuous in the physical world, HRTFs should seamlessly adapt to changes in the relative position between the sound sources and the user caused by human and/or sound source movement.
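
To make the filtering step concrete, the sketch below spatializes a mono signal by convolving it with left- and right-ear head-related impulse responses (HRIRs). It assumes NumPy/SciPy and placeholder HRIRs, and omits the head tracking, HRIR interpolation, and block-wise processing that a real-time renderer would additionally need.

# Minimal sketch (illustrative, not the paper's renderer): spatialize a mono source by
# convolving it with the left/right HRIRs for one static source direction.
import numpy as np
from scipy.signal import fftconvolve

def render_binaural(mono, hrir_left, hrir_right):
    """Return a (num_samples, 2) stereo signal for one static source direction."""
    left  = fftconvolve(mono, hrir_left,  mode="full")
    right = fftconvolve(mono, hrir_right, mode="full")
    out = np.stack([left, right], axis=-1)
    peak = np.max(np.abs(out))
    return out / peak if peak > 0 else out   # normalize to avoid clipping

# Example with placeholder data (real HRIRs would come from a measured database):
fs = 44100
mono = np.random.randn(fs)                   # 1 s of noise as a stand-in source
hrir_l = np.zeros(256); hrir_l[0] = 1.0      # dummy left-ear impulse response
hrir_r = np.zeros(256); hrir_r[4] = 0.8      # small delay/attenuation on the right
stereo = render_binaural(mono, hrir_l, hrir_r)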
