Abstract

Speaker embeddings have become the most popular feature representation in speaker verification, and improving the robustness of speaker embedding extraction systems is a crucial problem. This paper proposes a multi-scale residual aggregation network (MSRANet), a simple yet efficient network trained with triplet inputs and a triplet loss. Two different aggregation strategies are applied in the frame-level feature extractor to capture long-term variations in speaker characteristics. An attention mechanism filters a large number of parameters along the temporal and frequency dimensions, allowing the network to focus on salient information while suppressing redundant features. Extensive experiments on the in-the-wild VoxCeleb1 and VoxCeleb2 datasets evaluate the performance of the proposed method. Compared with four baseline systems, the proposed MSRANet achieves state-of-the-art performance, with an equal error rate of 3.84% and an accuracy of 98.76%. Furthermore, the method's cross-scenario adaptability is demonstrated by training and evaluating on the LibriSpeech dataset, where MSRANet achieves an equal error rate of 2.64% and an accuracy of 99.20%.
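
The abstract does not specify MSRANet's internals, so the following PyTorch sketch only illustrates the two ingredients it names: a triplet loss over anchor/positive/negative embeddings, and an attention gate applied along the temporal and frequency axes of a frame-level feature map. The module name TemporalFrequencyAttention, the axis-pooling design, the margin of 0.2, and the (batch, channel, freq, time) layout are illustrative assumptions, not the paper's definitions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalFrequencyAttention(nn.Module):
    """Assumed design: one sigmoid gate scoring each time frame and one
    scoring each frequency bin, broadcast back over the feature map."""
    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convolutions that score each frame / frequency bin.
        self.time_gate = nn.Conv1d(channels, 1, kernel_size=1)
        self.freq_gate = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time)
        t_ctx = x.mean(dim=2)                         # pool freq -> (B, C, T)
        t_att = torch.sigmoid(self.time_gate(t_ctx))  # (B, 1, T)
        f_ctx = x.mean(dim=3)                         # pool time -> (B, C, F)
        f_att = torch.sigmoid(self.freq_gate(f_ctx))  # (B, 1, F)
        # Apply both gates to emphasise salient frames and bins.
        return x * t_att.unsqueeze(2) * f_att.unsqueeze(3)

def triplet_loss(z_a, z_p, z_n, margin: float = 0.2):
    """Standard triplet margin loss on L2-normalised embeddings;
    the margin value is an assumption, not taken from the paper."""
    z_a, z_p, z_n = (F.normalize(z, dim=1) for z in (z_a, z_p, z_n))
    d_ap = (z_a - z_p).pow(2).sum(dim=1)  # anchor-positive distance
    d_an = (z_a - z_n).pow(2).sum(dim=1)  # anchor-negative distance
    return F.relu(d_ap - d_an + margin).mean()

if __name__ == "__main__":
    att = TemporalFrequencyAttention(channels=32)
    feats = torch.randn(4, 32, 40, 200)   # e.g. 40 mel bins, 200 frames
    print(att(feats).shape)               # torch.Size([4, 32, 40, 200])
    z = torch.randn(4, 256)
    print(triplet_loss(z, z + 0.1, torch.randn(4, 256)))
```

Pooling each axis before gating keeps the attention lightweight and independent of utterance length, which is one plausible way to realise the temporal/frequency filtering the abstract describes.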
