Abstract
A dedicated multi-task network for a direct-path dominance test was proposed recently, and it tends to significantly improve the robustness of direction-of-arrival estimation for a target speaker. In this Letter, the network is further refined to avoid the original two-stage processing. Moreover, the benefit and generalization of the multi-task network are confirmed by comparison with a single-task network on a significantly larger database. The efficacy of the proposed method in highly noisy environments is also validated in real application environments.
Highlights
Robust direction-of-arrival (DOA) estimation of a specific target speaker plays a vital role in many acoustic signal processing applications, such as speech enhancement, robot audition, and video conferencing.
Commonly used algorithms, including the time difference of arrival (TDOA),1 the steered response power (SRP),2 and the subspace methods,3 suffer from considerable performance degradation in adverse environments with high reverberation and intense noise.
The convolutional neural network (CNN)6 and the residual network (ResNet)7 have been utilized in DOA estimation, usually in an end-to-end form with the desired DOA acting directly as the training target.
Summary
Robust direction-of-arrival (DOA) estimation of a specific target speaker plays a vital role in many acoustic signal processing applications, such as speech enhancement, robot audition, and video conferencing. It has been noted that retrieving the direct-path information from noisy speech signals can significantly improve the robustness of DOA estimation. Such methods are usually designed to alleviate the influence of reverberation, mild diffuse noise, and sensor noise, and the effective retrieval of the direct-path information in more adverse environments is still a challenging task. Both the IRMs and the IRMd are used to extract the direct-path time-frequency (TF) bins with an extra refinement process. Conventional algorithms such as the steered response power with the phase transform (SRP-PHAT) and multiple signal classification (MUSIC) can then be applied to these extracted bins to obtain a robust DOA estimate. Note that the direct-path component $g(f)s(t,f)$ carries the most precise information about the target speaker DOA, and extracting the direct-path TF bins can significantly improve the robustness of DOA estimation. To guarantee a reliable estimate and alleviate the influence of reverberation and noise, it is better to extract a TF bin as direct-path only when $\|g(f)s(t,f)\|_2^2$ is much larger than $\|r(t,f)\|_2^2 + \|n(t,f)\|_2^2$.
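The dominance condition above can be illustrated with a minimal NumPy sketch. It is not the network described in the Letter: it assumes oracle access to the three signal components (as one might have when building training labels from simulated data), it takes r(t,f) and n(t,f) to be the reverberation and noise components, and the function name `direct_path_mask` and the 10 dB threshold are hypothetical choices made only for this example.

```python
import numpy as np


def direct_path_mask(direct, reverb, noise, threshold_db=10.0):
    """Flag TF bins where the direct path dominates reverberation plus noise.

    direct, reverb, noise: complex arrays of shape (mics, frames, freqs)
    holding the multichannel STFTs of g(f)s(t,f), r(t,f) and n(t,f).
    threshold_db: assumed margin for "much larger than" (hypothetical value).
    Returns a boolean array of shape (frames, freqs).
    """
    # Squared l2 norm over the microphone axis for every TF bin.
    p_direct = np.sum(np.abs(direct) ** 2, axis=0)
    p_interf = (np.sum(np.abs(reverb) ** 2, axis=0)
                + np.sum(np.abs(noise) ** 2, axis=0))

    # Small constant keeps the ratio finite for silent bins.
    eps = np.finfo(float).tiny
    ratio_db = 10.0 * np.log10((p_direct + eps) / (p_interf + eps))
    return ratio_db > threshold_db


# Toy example: 2 microphones, 1 frame, 2 frequency bins.
direct = np.array([[[10.0, 1.0]], [[10.0, 1.0]]], dtype=complex)
reverb = np.ones((2, 1, 2), dtype=complex)
noise = np.ones((2, 1, 2), dtype=complex)

mask = direct_path_mask(direct, reverb, noise)
print(mask)
```

In the toy example the first bin has a direct-path power about 17 dB above the combined reverberation and noise power and is kept, while the second bin sits about 3 dB below it and is rejected. A conventional estimator such as SRP-PHAT or MUSIC would then be run only on the bins where the mask is true.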