Abstract
As the most widely used spatial filtering approach for multi-channel speech separation, beamforming extracts the target speech signal arriving from a specific direction. An emerging alternative is multi-channel complex spectral mapping, which trains a deep neural network (DNN) to directly estimate the real and imaginary spectrograms of the target speech signal from those of the multi-channel noisy mixture. In this all-neural approach, the trained DNN itself becomes a nonlinear, time-varying spectrospatial filter. However, it remains unclear how this approach performs relative to commonly used beamforming techniques across different array configurations and acoustic environments. This paper examines this issue in a systematic way. Comprehensive evaluations show that multi-channel complex spectral mapping achieves separation performance comparable to or better than beamforming across different array geometries and speech separation tasks, and that it reduces to monaural complex spectral mapping in single-channel conditions, demonstrating the general utility of this approach for both multi-channel and single-channel speech separation. In addition, this approach is computationally more efficient than widely used mask-based beamforming. We conclude that the neural spectrospatial filter provides a strong alternative to traditional and mask-based beamforming.
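To make the input/output layout of complex spectral mapping concrete, the sketch below shows how the real and imaginary spectrograms of a multi-channel mixture can be stacked into a DNN input tensor, with the target being the real and imaginary spectrograms at a reference microphone. This is a minimal illustration, not the paper's implementation; the frame length, hop size, window, and microphone count are assumed for the example.

```python
import numpy as np

def stft(x, frame_len=512, hop=256):
    """Single-channel STFT: returns a complex spectrogram of shape
    (frames, bins). Window/frame parameters are illustrative assumptions."""
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)

def stack_ri_features(mixture):
    """Stack the real and imaginary spectrograms of every microphone along
    a feature axis, giving a DNN input of shape (2 * mics, frames, bins)."""
    specs = [stft(ch) for ch in mixture]  # one complex spectrogram per mic
    feats = [np.stack([S.real, S.imag]) for S in specs]
    return np.concatenate(feats, axis=0)

# Example: a hypothetical 4-microphone mixture, 1 second at 16 kHz.
rng = np.random.default_rng(0)
mixture = rng.standard_normal((4, 16000))

feats = stack_ri_features(mixture)          # DNN input:  (8, 61, 257)
target = stack_ri_features(mixture[:1])     # DNN target: (2, 61, 257),
                                            # RI spectrograms at reference mic
```

A network trained on such (input, target) pairs maps all microphones' RI spectrograms directly to the target RI spectrograms, which is why the trained DNN acts as a nonlinear, time-varying spectrospatial filter. With a single microphone the input simply shrinks to two feature maps, matching the reduction to monaural complex spectral mapping noted above.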
IEEE/ACM Transactions on Audio, Speech, and Language Processing