Abstract

To achieve real-time single-channel speech enhancement, i.e., enhancement with little or no latency, this paper proposes a causal speech enhancement model with a Transformer-based attention mechanism. The model uses a causal codec with a U-Net-like structure as its backbone network, which is augmented with an upper-triangular mask matrix and a single-sided relative position representation, both of which preserve causality. The mask matrix keeps attention focused on global information from past frames, while the single-sided relative position representation emphasizes the local information that most needs attention. In addition, a weighted loss function combining time-domain and frequency-domain terms is used to guide the optimization during training. Extensive comparison experiments are conducted on the Voice-Bank Demand dataset, and the results show that, compared with existing real-time single-channel speech enhancement models, the proposed causal model achieves better enhancement results while training faster and using fewer trainable parameters.
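
The role of the mask matrix can be illustrated with a minimal sketch of masked self-attention. This is not the model proposed in the paper: it assumes plain scaled dot-product attention over a sequence of frames, omits the relative position representation and the U-Net-like backbone, and uses hypothetical function names (`causal_mask`, `causal_attention`). It only shows how hiding the strict upper triangle of the score matrix prevents each frame from attending to future frames.

```python
import numpy as np

def causal_mask(num_frames):
    # True marks future positions (strict upper triangle) that must be hidden.
    return np.triu(np.ones((num_frames, num_frames), dtype=bool), k=1)

def causal_attention(q, k, v):
    # q, k, v: arrays of shape (num_frames, dim); an assumed layout for this sketch.
    dim = q.shape[-1]
    scores = q @ k.T / np.sqrt(dim)
    # Masked scores become -inf so they receive zero weight after the softmax.
    scores = np.where(causal_mask(q.shape[0]), -np.inf, scores)
    # Numerically stable softmax; the diagonal is never masked, so each row max is finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

With this mask, the output at frame t depends only on frames 0 through t, so changing later frames leaves earlier outputs unchanged, which is the property a real-time system requires.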
