Abstract

Deep neural networks (DNNs) have been used for dereverberation and separation in the monaural source separation problem. However, the performance of current state-of-the-art methods is limited, particularly in highly reverberant room environments. In this paper, we propose a two-stage approach with two DNN-based methods to address this problem. In the first stage, the speech mixture is dereverberated with the proposed dereverberation mask (DM). In the second stage, the dereverberated speech mixture is separated with the ideal ratio mask (IRM). To realize this two-stage approach, in the first DNN-based method the DM is integrated with the IRM to generate an enhanced time-frequency (T-F) mask, namely the ideal enhanced mask (IEM), as the training target for a single DNN. In the second DNN-based method, the DM and the IRM are predicted with two individual DNNs. Speech mixtures for the evaluations are generated from the IEEE and TIMIT corpora with real room impulse responses and noise from the NOISEX dataset. The proposed methods outperform the state-of-the-art, particularly in highly reverberant room environments.
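
To make the masks concrete, here is a minimal Python sketch. The IRM follows its standard definition from clean-speech and noise STFT magnitudes; the `dereverberation_mask` and `ideal_enhanced_mask` functions are assumptions about how the DM and the IEM might be formed, not the paper's exact definitions.

```python
import numpy as np

def ideal_ratio_mask(S, N, beta=0.5):
    """Standard IRM from clean-speech (S) and noise (N) STFTs."""
    S2, N2 = np.abs(S) ** 2, np.abs(N) ** 2
    return (S2 / (S2 + N2 + 1e-12)) ** beta

def dereverberation_mask(S_direct, S_reverb):
    """Hypothetical DM: ratio of direct-path to reverberant speech
    magnitudes in each T-F unit, clipped to [0, 1] (an assumption;
    see the paper for the exact definition)."""
    dm = np.abs(S_direct) / (np.abs(S_reverb) + 1e-12)
    return np.clip(dm, 0.0, 1.0)

def ideal_enhanced_mask(dm, irm):
    """Hypothetical IEM: the paper integrates the DM with the IRM as a
    single training target; an element-wise product is one plausible
    combination (assumption)."""
    return dm * irm
```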

Highlights

  • Compared with the ideal ratio mask (IRM)- and the complex IRM (cIRM)-based deep neural network (DNN) methods, both of our proposed methods consistently provide improved performance in terms of the frequency-weighted segmental SNR (SNRfw) and the source-to-distortion ratio (SDR) (see the metric sketch after this list)

  • The generalization ability of the proposed method is evaluated with unseen room impulse responses (RIRs); the results shown in Figures 7, 8, 11 & 12 and Tables VI & VIII confirm that the proposed method separates the desired speech signal from the mixture better than the IRM- and cIRM-based methods
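
As a minimal sketch of how the SDR reported above is typically computed, the BSS Eval metrics from the `mir_eval` package can be used; the placeholder signals below are illustrative only, and SNRfw is not implemented here.

```python
import numpy as np
from mir_eval.separation import bss_eval_sources

def sdr_for_separated_speech(reference, estimate):
    """Source-to-distortion ratio (SDR) of an estimate against the
    clean reference, via the standard BSS Eval implementation."""
    ref = np.atleast_2d(reference)   # shape: (n_sources, n_samples)
    est = np.atleast_2d(estimate)
    sdr, sir, sar, perm = bss_eval_sources(ref, est)
    return sdr

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.standard_normal(16000)                    # stand-in for 1 s of clean speech
    estimate = clean + 0.1 * rng.standard_normal(16000)   # stand-in for a separated estimate
    print(sdr_for_separated_speech(clean, estimate))
```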


Summary

Introduction

Source separation aims to separate the desired speech signals from the mixture, which consists of the speech sources, the background interference and their reflections. Due to applications such as automatic speech recognition (ASR), assisted living systems and hearing aids [1]–[6], source separation in real-world scenarios has attracted considerable research attention. The source separation problem is categorized into multichannel, stereo-channel (binaural) and single-channel (monaural) cases. In the monaural case, only one recording is available, and spatial information generally cannot be extracted. In real-world room environments, reverberation is particularly challenging, as it distorts the received mixture and degrades the separation performance [7].
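
The reverberant, noisy mixture described above is commonly modelled as clean speech convolved with a room impulse response plus additive noise. A minimal sketch under that assumption follows (function names and the SNR parameter are illustrative, not from the paper):

```python
import numpy as np
from scipy.signal import fftconvolve

def reverberant_mixture(speech, rir, noise, snr_db=5.0):
    """Model x[n] = (s * h)[n] + v[n]: convolve clean speech with a room
    impulse response (RIR) and add background noise at a chosen SNR."""
    reverberant = fftconvolve(speech, rir)[: len(speech)]
    noise = noise[: len(speech)]
    # Scale the noise to the requested SNR relative to the reverberant speech.
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return reverberant + gain * noise
```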


