Abstract

Multi-channel speech enhancement aims to extract the desired speech using a microphone array and has many potential applications, such as video conferencing, automatic speech recognition, and hearing aids. Recently, deep learning-based spatial filters have achieved remarkable improvements over traditional beamformers, but the desired speech is often inferred directly from the noisy features without modeling the interference. In this work, a novel two-stage framework is proposed to extract the desired speech under the guidance of both the estimated interference and the estimated desired signal. The resulting framework, called the Separation and Interaction Network (SI-Net), consists of two components: the first module coarsely separates speech and interference, and the second sub-network serves as a post-processing module that simultaneously suppresses the residual noise and regenerates missing speech components under the guidance of the previously estimated speech and interference characteristics. Because both modules are differentiable, the proposed framework can be trained in an end-to-end manner. In addition, a causal spatial-temporal attention module is designed to model the inter-channel and inter-frame correlations simultaneously. Moreover, under this framework, we adopt channel shuffle and gated fusion strategies for the interaction between speech and interference components to deliver knowledge about both “where to suppress and where to enhance”. Experiments conducted on a simulated multi-channel speech dataset show that the proposed framework outperforms state-of-the-art baselines while still supporting real-time processing.
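
To make the two interaction strategies named above more concrete, the following is a minimal, illustrative sketch in plain Python. It is not the SI-Net implementation: the abstract does not specify how channel shuffle or gated fusion are realized, so the function names (`channel_shuffle`, `gated_fusion`), the group-interleaving form of the shuffle, the sigmoid gate, and the fixed scalar weights `w_s`, `w_i`, and `bias` are all assumptions chosen for illustration. In a trained network the gate parameters would be learned and the features would be tensors rather than short lists.

```python
import math


def channel_shuffle(channels, groups):
    """Interleave channels across groups (assumed group-interleaving shuffle).

    With two groups, e.g. speech channels followed by interference channels,
    the output alternates between the two sources.
    """
    n = len(channels)
    if n % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = n // groups
    # Equivalent to viewing the list as (groups, per_group),
    # transposing to (per_group, groups), and flattening.
    return [channels[g * per_group + k]
            for k in range(per_group)
            for g in range(groups)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(speech, interference, w_s=1.0, w_i=-1.0, bias=0.0):
    """Blend two feature vectors with an element-wise gate in (0, 1).

    w_s, w_i, and bias are hypothetical fixed scalars standing in for
    parameters that a real model would learn.
    """
    fused = []
    for s, i in zip(speech, interference):
        gate = sigmoid(w_s * s + w_i * i + bias)
        fused.append(gate * s + (1.0 - gate) * i)
    return fused


# Speech channels s0..s2 followed by interference channels i0..i2.
mixed = channel_shuffle(["s0", "s1", "s2", "i0", "i1", "i2"], groups=2)
fused = gated_fusion([2.0, 0.5], [2.0, 0.5])
```

Under these assumptions, the shuffle places each speech channel next to its interference counterpart so that subsequent layers see both sources, and the gate decides per element how much of each estimate to pass on.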
