Temporal-Frequency-Spatial Features Fusion for Multi-channel Informed Target Speech Separation

Wen Zhang, Aolong Zhou, Bin Lin, Guoli Wu, Li Ma

doi:10.1109/icicsp55539.2022.10050617

Abstract

We aim to make full use of both the time-frequency features and the spatial features of multichannel speech signals. To this end, we propose an end-to-end multichannel target speech separation method based on temporal-frequency-spatial feature fusion, called the cTFS model. For the target speech separation task, the cTFS model takes the angle feature of the target speech signal as prior knowledge and predicts the complex ideal ratio mask with a complex U-shaped network. The target speech signal is then reconstructed by signal approximation. Furthermore, a multichannel target speaker separation dataset is constructed from the WSJ0-2mix dataset using a signal reverberation model. Each target speaker separation model is evaluated on this dataset using the metrics SDR, SI-SNR, PESQ, and STOI. Experimental results show the effectiveness of the proposed method as well as the benefit of incorporating angle feature information in multichannel speech separation.
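
To make two of the quantities named in the abstract concrete, the following is a minimal NumPy sketch of a complex ratio mask and of the SI-SNR metric, using their commonly cited textbook definitions. It is not the cTFS implementation: the function names, the `eps` stabilizer, and the zero-mean normalization in `si_snr` are assumptions made for illustration, and the paper's network, angle feature, and training loss are not reproduced here.

```python
import numpy as np


def ideal_complex_ratio_mask(target_spec, mixture_spec, eps=1e-8):
    """Complex mask M such that M * Y approximates S (assumed textbook form).

    target_spec (S) and mixture_spec (Y) are complex STFT arrays of equal shape.
    The division S / Y is written as S * conj(Y) / |Y|^2, with eps for stability.
    """
    return target_spec * np.conj(mixture_spec) / (np.abs(mixture_spec) ** 2 + eps)


def apply_complex_mask(mask, mixture_spec):
    """Estimate the target spectrogram by element-wise complex multiplication."""
    return mask * mixture_spec


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB for 1-D time-domain signals."""
    # Remove the mean so that a DC offset does not affect the score (assumption).
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to obtain the target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    s_target = scale * reference
    e_noise = estimate - s_target
    ratio = (np.dot(s_target, s_target) + eps) / (np.dot(e_noise, e_noise) + eps)
    return 10.0 * np.log10(ratio)
```

In a signal-approximation setup as it is usually described, the loss compares the masked output (`apply_complex_mask` applied to a predicted mask) with the clean target, rather than comparing the predicted mask with the ideal mask directly.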

Full Text