Single channel speech enhancement using time-frequency attention mechanism based nested U-net model

Anil Kumar Prathipati, A S N Chakravarthy

doi:10.1088/2631-8695/ad5e36

Abstract

Deep-learning models have used attention mechanisms to improve the quality and intelligibility of noisy speech, demonstrating the effectiveness of such mechanisms. However, existing models rely on either spatial or temporal attention mechanisms, resulting in severe information loss. In this paper, a time-frequency attention (TFA) mechanism with a nested U-network (TFANUNet) is proposed for single-channel speech enhancement. Using TFA, the model learns which channel, frequency, and time information is most significant for speech enhancement. The proposed model is an encoder-decoder model in which each layer of the encoder and decoder is followed by a nested dense residual dilated DenseNet (NDRD) based multi-scale context aggregation block. NDRD uses multiple dilated convolutions with different dilation factors to explore large receptive fields at different scales simultaneously. NDRD avoids the aliasing problem in DenseNet. We integrate the TFA and NDRD blocks into the proposed model to enable refined feature extraction without information loss and utterance-level context aggregation, respectively. Under seen and unseen noise conditions, the proposed TFANUNet model achieves average STOI values of 87.02% and 85.04% and average PESQ values of 3.19 and 3.01, respectively. The proposed model has 2.09 million trainable parameters, far fewer than the baselines. TFANUNet outperforms the baselines in terms of STOI and PESQ.
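
To illustrate why dilated convolutions with different dilation factors cover context at several scales, the following minimal sketch computes the receptive field of stride-1 dilated convolutions and applies a plain 1-D dilated convolution. It is not the NDRD block itself: the kernel size of 3, the dilation factors 1, 2, 4, 8, and the function names `receptive_field` and `dilated_conv1d` are assumptions chosen for illustration and are not taken from the paper.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field, in input samples, of stacked stride-1 dilated convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    """
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf


def dilated_conv1d(signal, kernel, dilation):
    """Unpadded ('valid') 1-D dilated convolution in cross-correlation form."""
    span = (len(kernel) - 1) * dilation  # distance between first and last tap
    return [
        sum(kernel[j] * signal[i + j * dilation] for j in range(len(kernel)))
        for i in range(len(signal) - span)
    ]


if __name__ == "__main__":
    # Hypothetical settings, not values reported by the authors.
    kernel_size = 3
    dilations = [1, 2, 4, 8]

    # Parallel branches: each branch sees a different context width.
    for d in dilations:
        print("dilation", d, "-> receptive field", receptive_field(kernel_size, [d]))

    # Stacked layers: the context widths accumulate.
    print("stacked ->", receptive_field(kernel_size, dilations))

    # The same 3-tap kernel spans a wider input window as the dilation grows.
    print(dilated_conv1d([1, 2, 3, 4, 5], [1, 0, -1], dilation=2))
```

Under these assumed settings, a single 3-tap layer covers 3, 5, 9, or 17 input samples depending on its dilation, while stacking all four layers covers 31 samples without adding parameters per layer.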

Full Text