Abstract
Deep learning-based time-frequency T-F masking has dramatically advanced monaural single-channel speech separation and enhancement. This study investigates its potential for direction of arrival DOA estimation in noisy and reverberant environments. We explore ways of combining T-F masking and conventional localization algorithms, such as generalized cross correlation with phase transform, as well as newly proposed algorithms based on steered-response SNR and steering vectors. The key idea is to utilize deep neural networks DNNs to identify speech dominant T-F units containing relatively clean phase for DOA estimation. Our DNN is trained using only monaural spectral information, and this makes the trained model directly applicable to arrays with various numbers of microphones arranged in diverse geometries. Although only monaural information is used for training, experimental results show strong robustness of the proposed approach in new environments with intense noise and room reverberation, outperforming traditional DOA estimation methods by large margins. Our study also suggests that the ideal ratio mask and its variants remain effective training targets for robust speaker localization.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
More From: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.