Abstract

Monaural speech enhancement aims to remove background noise from noisy speech signals captured by a single microphone. In recent years, several cross-domain monaural speech enhancement methods have been developed to leverage both waveform and harmonic information. However, these methods fall short of fully capturing the dependencies between the time domain and the time-frequency (T-F) domain, and of harnessing the benefits of the target decoupling strategy. This paper proposes a causal encoder-decoder-based Triple-branch Cross-domain Fusion Network (TCF-Net), which processes speech effectively by leveraging both time-domain and T-F-domain features. The proposed approach recovers magnitude and phase information in parallel to alleviate the compensation problem between them. TCF-Net forms a triple-branch network by collaboratively reconstructing the enhanced spectrum with a complex-spectrum branch and a magnitude-spectrum branch, while incorporating time-domain information through a waveform compensation branch. To fully leverage the information from the three domains, Triple-domain Fusion Modules (TFMs) are inserted into each intermediate layer of the model to extract and merge information from the two T-F-domain branches and the time-domain branch. The TFMs generate masks that progressively compensate the magnitude of the two T-F-domain branches and promote information interaction, further restoring the magnitude of the clean speech. Experimental results demonstrate that TCF-Net outperforms state-of-the-art (SOTA) cross-domain methods and target decoupling methods under a causal configuration on all evaluation metrics, validating the importance of the proposed cross-domain information fusion and target decoupling strategies.
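The abstract describes TFMs as modules that merge features from the three branches and emit masks that compensate the magnitudes of the two T-F-domain branches. The paper's actual modules are learned (convolutional) layers whose details are not given here; the following is only a minimal NumPy sketch, under the assumption that fusion can be stood in for by simple averaging and that the compensation mask is a sigmoid applied multiplicatively. All function and variable names are hypothetical.

```python
import numpy as np

def triple_domain_fusion(mag_feat, cplx_feat, wave_feat):
    """Hypothetical sketch of a Triple-domain Fusion Module (TFM).

    Features from the magnitude-spectrum, complex-spectrum, and waveform
    branches (each shaped (frames, freq_bins)) are merged into a shared
    representation, from which a bounded mask is generated to progressively
    compensate the magnitude of the two T-F-domain branches.
    NOTE: averaging + sigmoid stand in for the paper's learned layers.
    """
    fused = (mag_feat + cplx_feat + wave_feat) / 3.0  # naive cross-domain fusion
    mask = 1.0 / (1.0 + np.exp(-fused))               # sigmoid mask in (0, 1)
    mag_out = mag_feat * mask                          # compensate magnitude branch
    cplx_out = cplx_feat * mask                        # compensate complex branch
    return mag_out, cplx_out
```

In the real network such a module would sit in each intermediate encoder-decoder layer, so the masking is applied repeatedly (hence "progressively") rather than once.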
