Abstract

In the field of neuro-steered speaker extraction, a recently proposed end-to-end method called the U-shaped Brain Enhanced Speech Denoiser (U-BESD) has shown advantages in extraction performance over two-stage methods. However, U-BESD is designed in the time domain, which leads to high computational complexity and suboptimal extraction performance. To address these issues, this paper proposes a novel time–frequency neuro-steered speaker extractor (TF-NSSE). It leverages a time–frequency transformation to match the temporal resolution of speech signals to that of neural signals, which greatly reduces the computational cost caused by processing upsampled signals at high sampling rates. Additionally, an interaction module is introduced to effectively fuse the attention information encoded in neural signals with speech signals. Moreover, to address the shortage of data in existing datasets, this study proposes a data augmentation method that improves algorithm performance. Experimental results demonstrate that the proposed TF-NSSE significantly outperforms existing time-domain methods in both extraction performance and resource consumption.

