Abstract

Deep-learning models have used attention mechanisms to improve the quality and intelligibility of noisy speech, demonstrating the effectiveness of attention. However, existing models rely on either spatial or temporal attention alone, resulting in severe information loss. In this paper, a time-frequency attention mechanism with a nested U-network (TFANUNet) is proposed for single-channel speech enhancement. The time-frequency attention (TFA) module learns the channel, frequency, and time information that is most significant for speech enhancement. The proposed model is essentially an encoder-decoder model, in which each layer of the encoder and decoder is followed by a nested dense residual dilated DenseNet (NDRD) based multi-scale context aggregation block. NDRD combines multiple dilated convolutions with different dilation factors to exploit large receptive fields at several scales simultaneously, while avoiding the aliasing problem of DenseNet. Integrating the TFA and NDRD blocks into the proposed model enables refined feature extraction without information loss and utterance-level context aggregation, respectively. The proposed TFANUNet model outperforms baseline systems in terms of STOI and PESQ.
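
The abstract only outlines the two building blocks, so the sketch below is a rough PyTorch illustration of the general ideas, not the paper's actual architecture: the module names (TimeFrequencyAttention, DilatedContextBlock), layer sizes, and wiring are all assumptions for illustration.

# Minimal PyTorch sketch of the two ideas described in the abstract.
# All names, layer sizes, and connections here are assumptions, not the
# authors' exact TFANUNet design.
import torch
import torch.nn as nn

class TimeFrequencyAttention(nn.Module):
    """Rescales a (batch, channel, freq, time) feature map along the
    frequency, time, and channel axes (assumed design)."""
    def __init__(self, channels):
        super().__init__()
        self.freq_fc = nn.Conv2d(channels, channels, kernel_size=1)
        self.time_fc = nn.Conv2d(channels, channels, kernel_size=1)
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, channels // 2), nn.ReLU(),
            nn.Linear(channels // 2, channels), nn.Sigmoid())

    def forward(self, x):  # x: (B, C, F, T)
        # Attention over frequency bins (pooled across time) and time frames
        # (pooled across frequency), combined into a 2-D time-frequency map.
        freq_att = torch.sigmoid(self.freq_fc(x.mean(dim=3, keepdim=True)))  # (B, C, F, 1)
        time_att = torch.sigmoid(self.time_fc(x.mean(dim=2, keepdim=True)))  # (B, C, 1, T)
        tf_att = freq_att * time_att
        # Channel attention from globally pooled features.
        chan_att = self.channel_fc(x.mean(dim=(2, 3))).unsqueeze(-1).unsqueeze(-1)
        return x * tf_att * chan_att

class DilatedContextBlock(nn.Module):
    """Parallel dilated convolutions with different dilation factors,
    a rough stand-in for the NDRD multi-scale context aggregation block."""
    def __init__(self, channels, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in dilations)
        self.fuse = nn.Conv2d(channels * len(dilations), channels, kernel_size=1)

    def forward(self, x):
        # Aggregate the multi-scale branches and add a residual connection.
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1)) + x

# Example: a 16-channel feature map with 257 frequency bins and 100 frames.
feats = torch.randn(1, 16, 257, 100)
out = DilatedContextBlock(16)(TimeFrequencyAttention(16)(feats))
print(out.shape)  # torch.Size([1, 16, 257, 100])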
