Speech enhancement (SE) is an important method for improving speech quality and intelligibility in noisy environments. An effective speech enhancement model depends on precise modelling of the long-range dependencies of noisy speech. Several recent studies have examined ways to enhance speech by capturing long-term contextual information. The time-frequency (T-F) distribution of speech spectral components is also important for speech enhancement, but is usually ignored in these studies. Multi-stage learning is an effective way to integrate multiple deep-learning modules, with the benefit that the optimization target can be iteratively updated stage by stage. In this paper, speech enhancement is investigated using a multi-stage structure in which time-frequency attention (TFA) blocks are followed by stacks of squeezed temporal convolutional networks (S-TCN) with exponentially increasing dilation rates. To reinject original information into later stages, a feature fusion (FF) block is inserted at the input of each later stage, reducing the possibility of speech information being lost in the early stages. The S-TCN blocks are responsible for the temporal sequence modelling task. The TFA block is a simple but effective network module that explicitly exploits position information to generate a 2D attention map characterising the salient T-F distribution of speech, using two parallel branches: time-frame attention and frequency attention. A set of utterances from the LibriSpeech and Voicebank databases is used to evaluate the performance of the proposed SE model. Extensive experiments demonstrate that the proposed model consistently outperforms existing baselines on two widely used objective metrics, PESQ and STOI. Relative to noisy speech, the average PESQ and STOI of the proposed model are improved by 41.7% and 5.4% on the LibriSpeech dataset, and by 36.1% and 3.1% on the Voicebank dataset. Additionally, we explore the generalization of the proposed TFA-S-TCN model across different speech datasets through cross-database analysis. Our evaluation results also show that the TFA module yields a significant improvement in system robustness to noise.
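To make the described architecture concrete, the following is a minimal PyTorch sketch of the two components named above. The pooling choices, layer sizes, and the outer-product fusion of the two attention branches are assumptions for illustration, not the paper's exact design.

import torch
import torch.nn as nn

class TFA(nn.Module):
    """Time-frequency attention sketch: two parallel 1D branches whose
    outputs combine into a 2D attention map over the (time, freq) plane."""
    def __init__(self, channels):
        super().__init__()
        # Time-frame branch: average over frequency, attend over time.
        self.time_att = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=1), nn.Sigmoid())
        # Frequency branch: average over time, attend over frequency.
        self.freq_att = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=1), nn.Sigmoid())

    def forward(self, x):                      # x: (B, C, T, F)
        t = self.time_att(x.mean(dim=3))       # (B, C, T)
        f = self.freq_att(x.mean(dim=2))       # (B, C, F)
        # Assumed fusion: outer product of the branches gives the 2D map.
        att = t.unsqueeze(3) * f.unsqueeze(2)  # (B, C, T, F)
        return x * att

class STCNBlock(nn.Module):
    """Squeezed TCN block sketch: squeeze channels with a 1x1 conv,
    apply a dilated depthwise conv, then restore channels (residual)."""
    def __init__(self, channels, hidden, dilation):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, hidden, 1), nn.PReLU(),
            nn.Conv1d(hidden, hidden, 3, dilation=dilation,
                      padding=dilation, groups=hidden), nn.PReLU(),
            nn.Conv1d(hidden, channels, 1))

    def forward(self, x):                      # x: (B, C, T)
        return x + self.net(x)

# A stack with exponentially increasing dilation rates (1, 2, 4, ...),
# as described in the abstract; depth and widths here are placeholders.
stcn = nn.Sequential(*[STCNBlock(64, 32, 2 ** i) for i in range(6)])
x = torch.randn(2, 64, 100)
print(stcn(x).shape)  # torch.Size([2, 64, 100])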