Abstract

Benefiting from the global modeling capability of the self-attention mechanism, Transformer-based models have seen increasing use in natural language processing and automatic speech recognition. The Transformer's long-range receptive field overcomes the catastrophic forgetting problem of Recurrent Neural Networks (RNNs). However, unlike natural language processing and speech recognition tasks, which focus on global information, speech enhancement relies more heavily on local information, so the original Transformer is not optimally suited to speech enhancement. In this paper, we propose an improved Transformer model called the RNN-Attention Transformer (RAT), which applies multi-head self-attention (MHSA) along the temporal dimension. The input sequence is split into chunks, and different models are applied intra-chunk and inter-chunk: since RNNs model local information better than self-attention, RNNs are used to model intra-chunk information and self-attention is used to model inter-chunk information. Experiments show that RAT significantly reduces the parameter count and improves performance compared to the baseline.
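
As a rough illustration of the chunked intra-/inter-chunk design described in the abstract, the following PyTorch sketch pairs a bidirectional LSTM within each chunk with multi-head self-attention across chunks. This is a minimal sketch under assumed settings: the module name `RATBlock`, the feature dimension, the chunk size, and the number of heads are hypothetical and are not taken from the paper.

```python
import torch
import torch.nn as nn


class RATBlock(nn.Module):
    """Illustrative chunked RNN-Attention block (hypothetical sketch, not the paper's code).

    The time axis is split into chunks; a bidirectional LSTM models local
    (intra-chunk) structure, and multi-head self-attention models global
    (inter-chunk) structure.
    """

    def __init__(self, dim=64, chunk_size=50, num_heads=4):
        super().__init__()
        self.chunk_size = chunk_size
        self.intra_rnn = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)
        self.intra_norm = nn.LayerNorm(dim)
        self.inter_mhsa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.inter_norm = nn.LayerNorm(dim)

    def forward(self, x):                          # x: (batch, time, dim)
        b, t, d = x.shape
        c = self.chunk_size
        pad = (c - t % c) % c                      # pad so time is divisible by chunk size
        x = nn.functional.pad(x, (0, 0, 0, pad))
        n = x.shape[1] // c                        # number of chunks

        # Intra-chunk: RNN over the frames inside each chunk (local modeling).
        intra = x.reshape(b * n, c, d)
        intra_out, _ = self.intra_rnn(intra)
        x = self.intra_norm(intra + intra_out).reshape(b, n, c, d)

        # Inter-chunk: self-attention across chunks at each position (global modeling).
        inter = x.permute(0, 2, 1, 3).reshape(b * c, n, d)
        inter_out, _ = self.inter_mhsa(inter, inter, inter, need_weights=False)
        x = self.inter_norm(inter + inter_out).reshape(b, c, n, d).permute(0, 2, 1, 3)

        return x.reshape(b, n * c, d)[:, :t]       # drop the padding


# Usage: a batch of two 100-frame feature sequences with 64 channels.
block = RATBlock(dim=64, chunk_size=50, num_heads=4)
out = block(torch.randn(2, 100, 64))
print(out.shape)                                   # torch.Size([2, 100, 64])
```

The key design choice the sketch reflects is the division of labor: the recurrent layer only ever sees a short chunk, so it models fine-grained local context, while the attention layer only attends across chunk positions, giving global coverage at a much lower cost than full-sequence self-attention.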
