Abstract

Continuous speech separation (CSS) aims to separate overlap-free target signals from a long, partially overlapped recording. Although it has shown promising results, the original CSS framework considers neither cross-window information nor long-span dependencies. To alleviate these limitations, this work introduces two novel methods that capture long-span knowledge implicitly and explicitly, respectively. First, we apply a dual-path (DP) modeling architecture to the CSS framework, in which within-window and across-window information are jointly modeled by alternating stacks of local and global processing modules. Second, to further capture long-span dependencies, we introduce a memory-based model for CSS. An additional memory pool extracts an embedding from each small window, and inter-window communication is established over this memory embedding pool through an attention mechanism. The memory-based model can precisely control what information is transferred across windows, leading to both improved modeling capacity and better interpretability. Experimental results on the LibriCSS dataset show that both strategies capture the long-span information of continuous speech well and significantly improve system performance. Moreover, further improvements are observed when the two methods are integrated.
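To make the memory-based idea concrete, the following is a minimal NumPy sketch of the general pattern the abstract describes: each window is pooled into a memory embedding, attention runs over the pool of memory embeddings, and the resulting cross-window context is fed back to every window. The pooling choice (mean), the residual connection, and all shapes are illustrative assumptions, not the paper's actual design.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_window_attention(windows, d_model):
    """Sketch of memory-pool attention across windows.

    windows: (n_windows, window_len, d_model) per-window features.
    Returns features of the same shape, augmented with cross-window context.
    """
    # 1. Each window contributes one embedding to the memory pool
    #    (mean pooling here; the actual extractor is a design choice).
    memory = windows.mean(axis=1)                   # (n_windows, d_model)
    # 2. Scaled dot-product attention over the memory pool: each
    #    window's embedding queries all windows' embeddings.
    scores = memory @ memory.T / np.sqrt(d_model)   # (n_windows, n_windows)
    context = softmax(scores, axis=-1) @ memory     # (n_windows, d_model)
    # 3. Broadcast each window's cross-window context back to all
    #    of its frames via a residual connection.
    return windows + context[:, None, :]
```

Because the inter-window exchange is confined to the attention weights over the memory pool, one can inspect (or constrain) exactly which windows exchange information, which is the interpretability benefit the abstract points to.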
