Abstract

Speech enhancement in the time domain improves the quality and intelligibility of noisy speech by processing the waveform directly, without explicit feature extraction or domain transformation. Deep learning is a powerful approach for time-domain speech enhancement, offering significant improvements over traditional techniques. However, formulating a resource-efficient deep neural model in the time domain that does not ignore the contextual information and detailed features of the input speech remains a key challenge. To address this challenge, this study proposes a speech enhancement model built from 1D time-domain dilated residual blocks in a convolutional encoder-decoder framework. Further, this study integrates a time-attention transformer (TAT) bottleneck between the encoder and decoder. The TAT model extends the transformer architecture with a time-attention mechanism, which enables the model to selectively attend to different segments of the speech signal over time. This allows the model to capture long-term dependencies in the speech effectively and to learn to recognize important features. The experimental results indicate that the proposed speech enhancement model outperforms recent deep neural networks (DNNs) and substantially improves the intelligibility and quality of noisy speech. On the WSJ0 SI-84 database, the proposed model improves STOI and PESQ by 21.51% and 1.14, respectively, over noisy speech.
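The abstract names two core building blocks: dilated 1D convolutions (which enlarge the receptive field over the waveform without extra parameters) and time attention (which weights frames of the signal against each other). The paper's exact layer definitions are not given here, so the following is only a minimal NumPy sketch of both ideas under assumed shapes, not the authors' implementation:

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Same-length 1D dilated convolution (zero-padded).
    x: (T,) waveform segment; w: (K,) kernel; dilation: gap between taps.
    Illustrative only -- real models use learned multi-channel kernels."""
    K = len(w)
    pad = (K - 1) * dilation // 2
    xp = np.pad(x, (pad, pad))
    out = np.zeros(len(x))
    for t in range(len(x)):
        for k in range(K):
            out[t] += w[k] * xp[t + k * dilation]
    return out

def time_attention(X):
    """Scaled dot-product self-attention over time frames.
    X: (T, d) frame features; returns (T, d), each output frame a
    softmax-weighted mix of all frames -- the long-range dependency step."""
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)              # (T, T) frame similarities
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)          # softmax over time axis
    return A @ X
```

In the sketch, stacking `dilated_conv1d` layers with dilations 1, 2, 4, ... is what gives a residual block a wide temporal context cheaply, while `time_attention` in the bottleneck lets every frame attend to every other frame regardless of distance.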
