Abstract
While traditional i-vector based methods remain popular in speaker recognition, deep learning is increasingly applied to end-to-end models owing to its attractive performance. One effective practice is to integrate attention mechanisms into Convolutional Neural Networks (CNNs). In this work, a lightweight dual-path attention block is proposed that combines self-attention with the Convolutional Block Attention Module (CBAM), helping the network capture richer multi-source features at negligible extra time cost. In addition, a Weighted Cluster-Range Loss (WCRL) is proposed to improve the identification performance of the Cluster-Range Loss (CRL) on indecisive samples, and a novel Criticality-Enhancement Loss (CEL) is presented to address the low efficiency of CRL in the initial training stage. Both proposed loss functions markedly improve training efficiency and overall recognition performance. Experimental results demonstrate the effectiveness of the proposed scheme, which achieves a competitive top-1 accuracy of 92.0%, top-5 accuracy of 97.6%, and an Equal Error Rate (EER) of 3.5% on the VoxCeleb1 dataset.
Highlights
Squeeze-and-Excitation (SE) blocks [31] are a channel attention mechanism in which global spatial information is squeezed by global average pooling and channel-wise dependencies are obtained with fully-connected layers and a sigmoid function (see the code sketch after this list)
Most of the experiments in this work are conducted on the VoxCeleb1 dataset; VoxCeleb2 is used only to further evaluate the model for speaker verification, and likewise CN-Celeb is employed only in the speaker identification task
A lightweight dual-path attention module and two novel loss functions are proposed for text-independent speaker recognition
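The SE mechanism described in the first highlight can be summarized in a short sketch. The following PyTorch code is a minimal illustration, assuming a 4-D (batch, channels, frequency, time) feature map; the class name SEBlock and the reduction ratio r are illustrative choices rather than details taken from the paper.

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        # Squeeze: global average pooling collapses the frequency-time plane
        # into one descriptor per channel.
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Excitation: fully-connected layers plus a sigmoid produce one
        # gating weight per channel.
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.pool(x).view(b, c)        # squeezed channel descriptor
        w = self.fc(w).view(b, c, 1, 1)    # channel-wise gating weights
        return x * w                       # rescale each channel

A block of this kind is typically inserted after a convolutional stage, e.g. SEBlock(64) for a feature map with 64 channels.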
Summary
Variants of self-attention are emerging [28,29,30]. Attention mechanisms that consider the spatial and channel dimensions are widely employed, especially in the field of CV. Squeeze-and-Excitation (SE) blocks [31] are a channel attention mechanism in which global spatial information is squeezed by global average pooling and channel-wise dependencies are obtained with fully-connected layers and a sigmoid function. Drawing on the recent success of self-attention and CBAM, we combine these two attention mechanisms in our work to form a Dual-path Attention (DA) block.
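As a rough illustration of how such a combination could look, the sketch below runs a non-local style self-attention path and a CBAM path (channel attention followed by spatial attention) in parallel and sums their outputs. The fusion rule, the reduction ratios, and the class names are assumptions made for illustration only; the exact wiring of the paper's DA block is not reproduced here.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    # Non-local style self-attention over all time-frequency positions.
    def __init__(self, channels: int, r: int = 8):
        super().__init__()
        self.q = nn.Conv2d(channels, channels // r, 1)
        self.k = nn.Conv2d(channels, channels // r, 1)
        self.v = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)   # (b, hw, c/r)
        k = self.k(x).flatten(2)                   # (b, c/r, hw)
        v = self.v(x).flatten(2).transpose(1, 2)   # (b, hw, c)
        attn = F.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return x + out                             # residual connection

class CBAM(nn.Module):
    # Convolutional Block Attention Module: channel then spatial attention.
    def __init__(self, channels: int, r: int = 16, kernel: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels))
        self.spatial = nn.Conv2d(2, 1, kernel, padding=kernel // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel attention from average- and max-pooled descriptors.
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1).view(b, c))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1).view(b, c))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention from channel-wise average and max maps.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.max(dim=1, keepdim=True).values], dim=1)
        return x * torch.sigmoid(self.spatial(s))

class DualPathAttention(nn.Module):
    # Runs both attention paths in parallel and sums the refined features.
    def __init__(self, channels: int):
        super().__init__()
        self.sa = SelfAttention2d(channels)
        self.cbam = CBAM(channels)

    def forward(self, x):
        return self.sa(x) + self.cbam(x)

Under these assumptions, DualPathAttention(64) could be placed after a CNN stage whose output has 64 channels, letting the two paths contribute complementary spatial and channel information before the features are fused.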