Dual-path Transformer-style models have proven highly effective for speech enhancement, but their large parameter counts and computational complexity make practical deployment difficult. This study presents an encoder-decoder-based dual-path high-order transformer-style fully-attentional network (DPHT-ANet) that addresses speech enhancement with fewer parameters and lower computational complexity. DPHT-ANet incorporates a high-order information interaction module and replaces the multi-head attention module with a recursive gated convolution (GnConv). This allows the network to model deep-level information across the time and frequency dimensions, improving its ability to capture complex temporal and spectral patterns. Furthermore, DPHT-ANet uses a unified activation and attention mechanism in the convolutional encoder-decoder layers, resulting in a fully attentional network that prioritizes relevant high-level features at earlier stages. To further improve robustness, DPHT-ANet interactively learns and fuses features of varying lengths and dimensions with pre-trained features from a large-scale dataset. Experimental results on the VCTK+DEMAND and WSJ0-SI84 datasets demonstrate the effectiveness of the proposed approach. On WSJ0-SI84, DPHT-ANet yields improvements over the noisy mixture of 38.28% in ESTOI, 1.21 in PESQ, and 10.83 dB in SDR. Similarly, on VCTK+DEMAND, it yields improvements over the noisy mixture of 3.50% in STOI, 1.22 in PESQ, and 9.93 dB in SegSNR.