Abstract
Convolutional neural networks (CNNs) have significantly advanced speaker verification (SV) systems because of their powerful deep feature learning capability. In CNN-based SV systems, utterance-level aggregation is a key component: it compresses the frame-level features generated by the CNN frontend into a single utterance-level representation. However, most existing aggregation methods pool the extracted features only across time and cannot capture the speaker-dependent information contained in the frequency domain. To address this problem, this paper proposes a novel attention-based frequency aggregation method, which focuses on the key frequency bands that contribute most to the utterance-level representation. In addition, two more effective temporal-frequency aggregation methods are proposed by combining the frequency aggregation method with existing temporal aggregation methods. The two proposed methods capture the speaker-dependent information contained in both the time and frequency domains of the frame-level features, thus improving the discriminability of the speaker embedding. Furthermore, a powerful CNN-based SV system is developed and evaluated on the TIMIT and Voxceleb datasets. The experimental results indicate that the CNN-based SV system using the temporal-frequency aggregation method achieves a superior equal error rate of 5.96% on Voxceleb compared with state-of-the-art baseline models.
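To make the aggregation idea concrete, the sketch below shows one plausible reading of temporal-frequency aggregation in PyTorch: a self-attentive pooling module is applied once over the time axis and once over the frequency axis of the CNN feature map, and the two pooled vectors are concatenated into the utterance-level representation. This is an illustrative sketch, not the paper's exact architecture; the averaging over the complementary axis, the hidden size, and all tensor shapes are assumptions.

```python
import torch
import torch.nn as nn


class SelfAttentivePool(nn.Module):
    """Self-attentive pooling over the steps axis of a (batch, steps, dim) tensor."""

    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, x):                          # x: (B, steps, dim)
        w = torch.softmax(self.attn(x), dim=1)     # one attention weight per step
        return (w * x).sum(dim=1)                  # (B, dim)


class TemporalFrequencyAggregation(nn.Module):
    """Pool a CNN feature map (B, C, F, T) along time and along frequency
    separately, then concatenate the two pooled vectors."""

    def __init__(self, channels: int, hidden: int = 64):
        super().__init__()
        self.time_pool = SelfAttentivePool(channels, hidden)
        self.freq_pool = SelfAttentivePool(channels, hidden)

    def forward(self, x):                          # x: (B, C, F, T)
        t = x.mean(dim=2).transpose(1, 2)          # (B, T, C): average over frequency
        f = x.mean(dim=3).transpose(1, 2)          # (B, F, C): average over time
        return torch.cat([self.time_pool(t), self.freq_pool(f)], dim=1)  # (B, 2C)


feats = torch.randn(8, 256, 10, 200)               # toy frame-level features from a CNN frontend
emb = TemporalFrequencyAggregation(256)(feats)     # utterance-level embedding, shape (8, 512)
```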
Highlights
Speaker verification (SV) is a voice biometric authentication technology developed to verify the claimed identity of a test speaker
A novel shared-parameter grouped frequency self-attentive pooling (SGFSAP) layer is proposed to effectively capture the speaker-dependent information contained in the frequency domain, based on the following facts: (1) the speaker-dependent information is distributed in both the time domain and the frequency domain of the 2D frame-level features generated by convolutional neural networks (CNNs); (2) individual information is encoded non-uniformly across the frequency bands of an utterance [28]; (3) some speaker-dependent frequency information varies with the phonetic content of the utterance [26,28,29] (a minimal sketch of such a layer follows these highlights)
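Below is a minimal sketch of such a grouped frequency pooling layer, under the assumption that "shared-parameter" means a single attention network reused across all frequency groups, and that per-bin features are obtained by averaging over time; the group count, hidden size, and time-averaging step are illustrative choices rather than the paper's specification.

```python
import torch
import torch.nn as nn


class SGFSAP(nn.Module):
    """Hypothetical sketch: the frequency axis is split into groups, one shared
    attention network scores the bins inside each group, and each group is
    collapsed into its attention-weighted sum."""

    def __init__(self, channels: int, num_groups: int, hidden: int = 64):
        super().__init__()
        self.num_groups = num_groups
        # One attention network, reused for every frequency group.
        self.attn = nn.Sequential(nn.Linear(channels, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, x):                            # x: (B, C, F, T)
        B, C, F, T = x.shape
        assert F % self.num_groups == 0, "F must be divisible by num_groups"
        v = x.mean(dim=3).transpose(1, 2)            # (B, F, C): time-averaged per-bin features
        v = v.reshape(B, self.num_groups, F // self.num_groups, C)
        w = torch.softmax(self.attn(v), dim=2)       # softmax within each group
        pooled = (w * v).sum(dim=2)                  # (B, G, C): one vector per group
        return pooled.flatten(1)                     # (B, G*C) utterance-level representation


x = torch.randn(4, 128, 8, 300)
print(SGFSAP(128, num_groups=4)(x).shape)            # torch.Size([4, 512])
```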
Summary
Speaker verification (SV) is a voice biometric authentication technology developed to verify the claimed identity of a test speaker. The framework composed of the i-vector [6] and probabilistic linear discriminant analysis (PLDA) [7] has dominated text-independent SV because of its superior performance, simplicity, and efficiency. In this framework, a Gaussian mixture model-universal background model (GMM-UBM) [8] is first used to collect sufficient statistics. Although the i-vector/PLDA system achieves great success in some scenarios, its performance degrades when enrollment/test utterance durations are short [9,10].
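As a concrete illustration of the statistics-collection step, the sketch below computes the zeroth- and first-order Baum-Welch statistics of one utterance against a GMM-UBM, which are the standard inputs an i-vector extractor consumes. It is a toy example: scikit-learn's GaussianMixture stands in for a real UBM trained on speech features, and the random arrays stand in for MFCC frames.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def baum_welch_stats(ubm: GaussianMixture, frames: np.ndarray):
    """Zeroth- and first-order Baum-Welch statistics of an utterance.

    frames: (num_frames, feat_dim) acoustic features, e.g. MFCCs.
    Returns N: (num_components,) soft frame counts per Gaussian,
            F: (num_components, feat_dim) posterior-weighted feature sums.
    """
    gamma = ubm.predict_proba(frames)   # (num_frames, C) frame posteriors
    N = gamma.sum(axis=0)               # zeroth-order: soft occupancy counts
    F = gamma.T @ frames                # first-order: weighted feature sums
    return N, F


# Toy example: a 16-component diagonal-covariance "UBM" on random 20-dim frames.
rng = np.random.default_rng(0)
ubm = GaussianMixture(n_components=16, covariance_type="diag").fit(rng.normal(size=(2000, 20)))
N, F = baum_welch_stats(ubm, rng.normal(size=(300, 20)))
print(N.shape, F.shape)                 # (16,) (16, 20)
```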