Abstract

Most previous speech enhancement methods predict only amplitude features, yet a growing number of studies have shown that phase information is crucial for speech quality. Some recent methods operate on complex features instead, but complex masks are difficult to estimate. Removing noise while maintaining good speech quality at low signal-to-noise ratios remains an open problem. This study proposes a dual-path network structure for speech enhancement that models complex spectra and amplitudes simultaneously, and introduces an attention-aware feature fusion module that fuses the two feature types to facilitate recovery of the overall spectrum. In addition, we improve a transformer-based feature extraction module so that it efficiently extracts both local and global features. In experiments on the Voice Bank + DEMAND dataset, the proposed network outperforms the baseline models. We also conduct ablation experiments to verify the effectiveness of the dual-path structure, the improved transformer, and the fusion module, and investigate how the input-mask multiplication strategy affects the results.
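
To make the distinction between the amplitude path and the complex path concrete, the following is a minimal NumPy sketch of input-mask multiplication in general. It is not the network described above: the masks here are fixed arrays rather than network outputs, and the function names (`apply_magnitude_mask`, `apply_complex_mask`, `fuse`) and the per-bin `gate` are hypothetical stand-ins, with `fuse` being a plain weighted average used only to suggest where a learned attention-aware fusion module would sit.

```python
import numpy as np


def apply_magnitude_mask(noisy_spec, mag_mask):
    """Scale the noisy magnitude by a real-valued mask and reuse the noisy phase."""
    return mag_mask * np.abs(noisy_spec) * np.exp(1j * np.angle(noisy_spec))


def apply_complex_mask(noisy_spec, complex_mask):
    """Complex multiplication modifies both magnitude and phase of each bin."""
    return complex_mask * noisy_spec


def fuse(mag_estimate, complex_estimate, gate):
    """Hypothetical fusion: per-bin convex combination with gate values in [0, 1]."""
    return gate * mag_estimate + (1.0 - gate) * complex_estimate


# Toy "noisy spectrogram": 4 frequency bins x 3 frames of complex values.
rng = np.random.default_rng(0)
noisy = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))

# Assumed masks; in a real system these would be predicted by the two paths.
mag_mask = rng.uniform(0.1, 1.0, size=noisy.shape)
complex_mask = np.full(noisy.shape, 0.5 * np.exp(1j * 0.3))
gate = np.full(noisy.shape, 0.5)

mag_est = apply_magnitude_mask(noisy, mag_mask)
cplx_est = apply_complex_mask(noisy, complex_mask)
enhanced = fuse(mag_est, cplx_est, gate)
```

In this sketch the magnitude-masked estimate keeps the noisy phase unchanged, while the complex-masked estimate halves each bin's magnitude and rotates its phase by 0.3 rad, which is the kind of phase correction an amplitude-only method cannot express.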
