Abstract

In recent years, time-domain speech separation methods have made great progress, and existing models achieve strong separation performance on the WSJ0-2mix dataset. However, their performance on Chinese speech datasets has not been studied in detail. To address this gap, this paper constructs a speech separation dataset from the AISHELL-3 open-source high-fidelity Mandarin speech corpus, which we call as3-2mix. As3-2mix not only preserves the original characteristics of the mixed speech but also adopts two mixing strategies: same-gender mixing and different-gender mixing. Using the as3-2mix dataset and different training strategies, we evaluate the generalization ability of a convolutional time-domain separation network and analyze the separated speech with PESQ, STOI, SDRi, and SI-SNRi. The experimental results show that PESQ reaches 2.48 and 2.26 on the as3mm1-2mix and as3ff1-2mix datasets respectively, while STOI reaches 2.46, 0.89, and 0.83 on the as3mm1-2mix, as3ff1-2mix, and as3fm1-2mix datasets respectively, higher than other methods on datasets of the same type. Although the SDRi and SI-SNRi scores on the Chinese dataset are not as high as those on the English dataset, they still reach good values of 13.56 dB and 13.21 dB, which suggests that language may affect certain characteristics of speech and thus influence the separation quality to some extent. Finally, when analyzing speech amplitude, we find that speech with large amplitude favors higher PESQ and STOI, while speech with small amplitude favors higher SDRi and SI-SNRi.
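To make the scale-invariant metric concrete, the following is a minimal NumPy sketch of SI-SNR and the improvement score SI-SNRi (the estimate's SI-SNR minus the unprocessed mixture's SI-SNR). The function names and the `eps` stabilizer are illustrative, not the paper's implementation; the formula itself is the standard one, projecting the estimate onto the reference to obtain the target component.

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB.

    Both signals are zero-meaned, then the estimate is decomposed into a
    component along the reference (s_target) and a residual (e_noise).
    """
    est = est - est.mean()
    ref = ref - ref.mean()
    s_target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return 10.0 * np.log10((np.dot(s_target, s_target) + eps)
                           / (np.dot(e_noise, e_noise) + eps))

def si_snr_i(est, mix, ref):
    """SI-SNR improvement: gain of the separated estimate over the mixture."""
    return si_snr(est, ref) - si_snr(mix, ref)
```

For example, mixing two sinusoids and then mostly removing the interferer yields a positive SI-SNRi, since the estimate is closer (up to scale) to the reference than the raw mixture is:

```python
t = np.linspace(0.0, 1.0, 8000)
s1 = np.sin(2 * np.pi * 220 * t)       # target speaker (toy signal)
s2 = np.sin(2 * np.pi * 330 * t)       # interfering speaker (toy signal)
mix = s1 + s2
est = s1 + 0.1 * s2                    # imperfect separation of s1
print(si_snr_i(est, mix, s1))          # positive: the estimate improved on the mixture
```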
