Abstract
Deep neural network (DNN) embeddings for speaker recognition have recently attracted much attention. Compared to i-vectors, they are more robust to noise and room reverberation because DNNs leverage large-scale training. This article addresses the question of whether speech enhancement approaches are still useful when DNN embeddings are used for speaker recognition. We investigate single- and multi-channel speech enhancement for text-independent speaker verification based on x-vectors in conditions where strong diffuse noise and reverberation are both present. Single-channel (monaural) speech enhancement is based on complex spectral mapping and is applied to individual microphones. For multi-channel speech enhancement, we use a masking-based minimum variance distortionless response (MVDR) beamformer and its rank-1 approximation. We propose a novel method of deriving time-frequency masks from the estimated complex spectrogram. In addition, we investigate gammatone frequency cepstral coefficients (GFCCs) as robust speaker features. Systematic evaluations and comparisons on the NIST SRE 2010 retransmitted corpus show that both monaural and multi-channel speech enhancement significantly outperform the unenhanced x-vector baseline, and that our covariance matrix estimate is effective for the MVDR beamformer.
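To make the idea of a masking-based MVDR beamformer concrete, the following is a minimal NumPy sketch of one common reference-microphone formulation: time-frequency masks weight the multi-channel STFT to estimate speech and noise spatial covariance matrices, from which the MVDR weights are computed per frequency. This is an illustration under stated assumptions, not the covariance estimate or mask derivation proposed in the article. The function names, the array layout `(frequency, time, microphone)`, the diagonal-loading constant, and the choice of reference microphone are all hypothetical.

```python
import numpy as np


def masked_covariance(Y, mask, eps=1e-8):
    """Mask-weighted spatial covariance per frequency.

    Y:    complex STFT, shape (F, T, M) = (frequency, time, microphone).
    mask: real-valued time-frequency mask in [0, 1], shape (F, T).
    Returns an array of shape (F, M, M).
    """
    # Sum over time of mask(f, t) * y(f, t) y(f, t)^H.
    num = np.einsum("ft,ftm,ftn->fmn", mask, Y, Y.conj())
    # Normalize by the total mask weight in each frequency bin.
    den = mask.sum(axis=1)[:, None, None] + eps
    return num / den


def mvdr_weights(phi_s, phi_n, ref=0, diag_load=1e-6):
    """MVDR weights w = (Phi_n^{-1} Phi_s) u / trace(Phi_n^{-1} Phi_s).

    u is a one-hot vector selecting the reference microphone `ref`
    (an assumption of this sketch). Returns an array of shape (F, M).
    """
    M = phi_n.shape[-1]
    # Diagonal loading (assumed here) keeps Phi_n well conditioned.
    scale = np.trace(phi_n, axis1=1, axis2=2).real[:, None, None] / M
    phi_n = phi_n + diag_load * scale * np.eye(M)
    # Batched solve gives Phi_n^{-1} Phi_s without an explicit inverse.
    ratio = np.linalg.solve(phi_n, phi_s)
    tr = np.trace(ratio, axis1=1, axis2=2)[:, None]
    # Selecting column `ref` is the same as multiplying by u.
    return ratio[:, :, ref] / (tr + 1e-12)


def apply_beamformer(w, Y):
    """Beamformer output w^H y for every time-frequency bin, shape (F, T)."""
    return np.einsum("fm,ftm->ft", w.conj(), Y)
```

In use, a speech mask `m` (for example, one produced by an enhancement network) would give `phi_s = masked_covariance(Y, m)` and `phi_n = masked_covariance(Y, 1.0 - m)`, followed by `apply_beamformer(mvdr_weights(phi_s, phi_n), Y)`. When the speech covariance is exactly rank-1, these weights pass the target at the reference microphone without distortion.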
More From: IEEE/ACM Transactions on Audio, Speech, and Language Processing