Abstract

Personalized speech enhancement (PSE) systems with excellent performance have been proposed in growing numbers in recent years. Compared with traditional speech enhancement systems, PSE systems cover a wider range of application scenarios, because they can remove background noise and the speech of interfering speakers at the same time. However, two issues still limit the performance and generalization ability of these models: 1) acoustic environment mismatch between the noisy test speech and the clean enrollment speech of the target speaker; and 2) hard sample mining and learning, since performance on hard samples determines how practical a PSE system is in complex real-world scenarios. In this paper, we propose dynamic acoustic compensation (DAC), which alleviates the environment mismatch by intercepting acoustic segments from the noisy speech and mixing them with the enrollment speech. To better exploit the hard samples, we propose an adaptive focal training (AFT) strategy that assigns adaptive loss weights to hard and non-hard samples during training. Both DAC and AFT are designed to improve and generalize our previous work, a densely-connected pyramid complex convolutional network with speaker encoder (sDPCCN) for personalized speech enhancement. In addition, time-frequency multi-loss training is introduced to further enhance the improved sDPCCN. To examine the effectiveness of the proposed methods, we generate noisy-reverberant training and test data from non-overlapping segments of the 4th Deep Noise Suppression (DNS4) Challenge dataset. Results show that DAC effectively alleviates the acoustic environment mismatch and brings large improvements on multiple evaluation metrics, and that AFT significantly reduces the hard sample rate. When all proposed methods are applied, the perceptual evaluation of speech quality (PESQ) score improves from 3.21 to 3.36 and the scale-invariant signal-to-noise ratio (SISNR) improves from 15.11 to 15.89 on the test set, which verifies their effectiveness and practicality.
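
For readers unfamiliar with the SISNR metric reported above, the following is a minimal sketch of the commonly used definition, written with NumPy. It is not taken from the paper: the function name `si_snr`, the `eps` stabilizer, and the zero-mean normalization step are assumptions made for illustration, and the authors' evaluation code may differ in such details.

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between an estimated and a reference signal.

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)

    # Remove the mean so a constant offset does not affect the score.
    estimate = estimate - estimate.mean()
    target = target - target.mean()

    # Project the estimate onto the target to get the "signal" component.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    s_target = scale * target

    # Whatever is left after the projection is treated as noise.
    e_noise = estimate - s_target

    ratio = (np.dot(s_target, s_target) + eps) / (np.dot(e_noise, e_noise) + eps)
    return 10.0 * np.log10(ratio)


# Usage: a synthetic sine "speech" signal with additive Gaussian noise.
rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0.0, 100.0, 16000))
noise = rng.standard_normal(16000)
noisy = clean + 0.1 * noise
score = si_snr(noisy, clean)
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is the property that gives the metric its name.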
