Abstract

Binaural sound source localization is an important and widely used perception-based method, and many researchers have applied it in machine learning studies based on the head-related transfer function (HRTF). Because the HRTF is closely related to human physiological structure, HRTFs vary between individuals. Related machine learning studies to date tend to focus on binaural localization in reverberant or noisy environments, or in conditions with multiple simultaneously active sound sources. In contrast, the mismatched HRTF condition, in which the HRTFs used to generate the training and test sets differ, is rarely studied. This mismatch degrades localization performance. A basic solution is to introduce more data to improve generalization, but simply increasing the data volume is data-inefficient. In this paper, we propose a data-efficient method based on a deep neural network (DNN) and clustering to improve binaural localization performance in the mismatched HRTF condition. First, we analyze the relationship between binaural cues and sound source localization with a classification DNN, using different HRTFs to generate the training and test sets. On this basis, we study the localization performance of the DNN model trained on each training set when evaluated on different test sets. The results show that the performance of the same model varies across test sets, while the performance of different models on the same test set may be similar; the results also show a clustering trend. Second, the different HRTFs are divided into several clusters. Finally, the HRTFs corresponding to each cluster center are selected to generate a new training set and to train a more generalized DNN model. The experimental results show that the proposed method achieves better generalization than the baseline methods in the mismatched HRTF condition and performs almost as well as a DNN trained with a large number of HRTFs, which means the proposed method is data-efficient.
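
To make the clustering-and-selection step concrete, the following minimal sketch groups HRTF subjects with k-means and keeps, for each cluster, the subject closest to the cluster center as the representative whose HRTF would enter the reduced training set. The choice of k-means, the use of cross-test localization-accuracy vectors as per-subject features, and all variable names are illustrative assumptions, not the paper's exact implementation.

    # Hedged sketch of the HRTF clustering and cluster-center selection step.
    # Assumption: each subject is described by the vector of accuracies obtained
    # when the DNN trained on that subject's HRTF is tested on every other subject.
    import numpy as np
    from sklearn.cluster import KMeans

    def select_representative_hrtfs(accuracy_matrix, n_clusters=5, seed=0):
        """accuracy_matrix: (n_subjects, n_subjects) array where entry (i, j) is the
        accuracy of the model trained on subject i's HRTF and tested on subject j's.
        Returns the indices of the subjects closest to each cluster center."""
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
        labels = km.fit_predict(accuracy_matrix)
        representatives = []
        for c in range(n_clusters):
            members = np.where(labels == c)[0]
            # keep the member whose feature vector lies closest to the cluster center
            dists = np.linalg.norm(accuracy_matrix[members] - km.cluster_centers_[c], axis=1)
            representatives.append(int(members[np.argmin(dists)]))
        return representatives

    # Toy usage with random numbers standing in for the real accuracy matrix
    # (45 subjects, as in the CIPIC database).
    rng = np.random.default_rng(0)
    fake_accuracies = rng.uniform(0.5, 1.0, size=(45, 45))
    print(select_representative_hrtfs(fake_accuracies, n_clusters=5))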

Highlights

  • Sound source localization is the task of estimating the direction of a sound source; it is an important and widely used technique in many fields such as speech enhancement, video conferencing, and human-robot interaction [1]

  • The basic procedure of Raspaud’s method is as follows: interaural time difference (ITD) and interaural level difference (ILD) cues are modeled as the product of a function of azimuth and a function of frequency; in the offline stage, the ITD and ILD cues corresponding to each azimuth are extracted from the head-related transfer function (HRTF) of each subject in the CIPIC database and fed into the model to compute each subject's parameters; in the test stage, the ITD and ILD cues are extracted from the sounds to be localized, the ILD cue is fed into the average parameter model to estimate the correct ITD, and the correct ITD is fed into the parameter model to estimate the sound source location (see the sketch after this list)

  • In this paper, we study binaural localization in the mismatched HRTF condition and propose a binaural sound localization method based on a deep neural network (DNN) and clustering
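
As a rough illustration of the separable cue model described in the second highlight, the following sketch fits a per-subject scale parameter for an ITD model of the form alpha * sin(azimuth), averages the parameters across subjects, and inverts the model at test time. The sinusoidal azimuth function, the single per-subject scale, and the averaging step are simplifying assumptions for illustration only, not Raspaud's exact parameterization.

    # Hedged illustration of a separable ITD model: ITD ~ alpha * sin(azimuth).
    import numpy as np

    def fit_subject_scale(azimuths_rad, itds):
        """Least-squares fit of alpha in ITD ~ alpha * sin(azimuth)."""
        basis = np.sin(azimuths_rad)
        return float(np.dot(basis, itds) / np.dot(basis, basis))

    def azimuth_from_itd(itd, alpha):
        """Invert the model to recover an azimuth estimate from an observed ITD."""
        return float(np.arcsin(np.clip(itd / alpha, -1.0, 1.0)))

    # Toy offline stage: fit alpha for a few simulated subjects, then average the
    # parameters, mimicking the average-model idea described above.
    rng = np.random.default_rng(0)
    azimuths = np.deg2rad(np.arange(-80, 81, 10))
    alphas = []
    for true_alpha in (6.0e-4, 6.5e-4, 7.0e-4):          # seconds, illustrative values
        itds = true_alpha * np.sin(azimuths) + rng.normal(0, 1e-5, azimuths.size)
        alphas.append(fit_subject_scale(azimuths, itds))
    alpha_avg = float(np.mean(alphas))

    # Toy test stage: map an observed ITD back to an azimuth estimate (in degrees).
    print(np.rad2deg(azimuth_from_itd(3.2e-4, alpha_avg)))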

Introduction

Sound source localization is the task of estimating the direction of a sound source and is an important and widely used technique in many fields such as speech enhancement, video conferencing, and human-robot interaction [1]. Sound source localization algorithms have been widely researched and can be categorized into two classes. The first is based on microphone array signal processing, which contains three kinds of algorithms. The second is binaural localization: humans are able to localize a sound source with just two ears, and this remarkable capability is largely attributed to the different filtering effects of the listener's head, pinnae, and torso on sounds arriving from different directions, effects that are described in the frequency domain by the head-related transfer function (HRTF). Because individual physiological structures differ, the HRTF datasets of different subjects vary
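
As a concrete illustration of the binaural cues mentioned above, the following minimal sketch extracts a broadband ITD and ILD from a left/right signal pair. The cross-correlation ITD estimate and the energy-ratio ILD are standard textbook definitions used here for illustration; they are not necessarily the exact front end used in the paper.

    # Minimal sketch of extracting the two classical binaural cues (ITD, ILD).
    import numpy as np

    def binaural_cues(left, right, fs):
        """Return (ITD in seconds, ILD in dB) for one frame of binaural audio.
        Positive ITD here means the left-ear signal leads (source toward the left)."""
        # ITD: lag of the peak of the cross-correlation between the two ear signals.
        corr = np.correlate(left, right, mode="full")
        lag = np.argmax(corr) - (len(right) - 1)
        itd = -lag / fs
        # ILD: broadband level difference between the two ear signals.
        eps = 1e-12
        ild = 10.0 * np.log10((np.sum(left ** 2) + eps) / (np.sum(right ** 2) + eps))
        return float(itd), float(ild)

    # Toy usage: the right channel is a delayed, attenuated copy of the left one.
    rng = np.random.default_rng(0)
    fs = 16000
    left = rng.standard_normal(400)
    right = 0.7 * np.concatenate([np.zeros(8), left[:-8]])  # 8-sample (0.5 ms) delay
    print(binaural_cues(left, right, fs))  # roughly (5e-4 s, 3 dB)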
