Abstract

Speaker embeddings have become the most popular feature representation in speaker verification, and improving the robustness of speaker embedding extraction systems is a crucial problem. This paper proposes a multi-scale residual aggregation network (MSRANet), a simple yet efficient network trained with triplet inputs and a triplet loss. Two different aggregation strategies are applied in the frame-level feature extractor to capture long-term variations in speaker characteristics. An attention mechanism filters a large number of parameters along the temporal and frequency dimensions, allowing the network to focus on salient information while suppressing redundant features. Extensive experiments on the in-the-wild VoxCeleb1 and VoxCeleb2 datasets evaluate the performance of the proposed method. Compared with four baseline systems, the proposed MSRANet achieves state-of-the-art performance, with an equal error rate of 3.84% and an accuracy of 98.76%. Furthermore, the method's cross-scenario adaptability is demonstrated by training and evaluating on the LibriSpeech dataset, where MSRANet achieves an equal error rate of 2.64% and an accuracy of 99.20%.
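
The abstract does not specify MSRANet's internals, so the following PyTorch sketch only illustrates the two ingredients it names: a triplet loss over anchor/positive/negative embeddings, and an attention gate applied along the temporal and frequency axes of a frame-level feature map. The module name TemporalFrequencyAttention, the axis-pooling design, the margin of 0.2, and the (batch, channel, freq, time) layout are illustrative assumptions, not the paper's definitions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalFrequencyAttention(nn.Module):
    """Assumed design: one sigmoid gate scoring each time frame and one
    scoring each frequency bin, broadcast back over the feature map."""
    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convolutions that score each frame / frequency bin.
        self.time_gate = nn.Conv1d(channels, 1, kernel_size=1)
        self.freq_gate = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time)
        t_ctx = x.mean(dim=2)                         # pool freq -> (B, C, T)
        t_att = torch.sigmoid(self.time_gate(t_ctx))  # (B, 1, T)
        f_ctx = x.mean(dim=3)                         # pool time -> (B, C, F)
        f_att = torch.sigmoid(self.freq_gate(f_ctx))  # (B, 1, F)
        # Apply both gates to emphasise salient frames and bins.
        return x * t_att.unsqueeze(2) * f_att.unsqueeze(3)

def triplet_loss(z_a, z_p, z_n, margin: float = 0.2):
    """Standard triplet margin loss on L2-normalised embeddings;
    the margin value is an assumption, not taken from the paper."""
    z_a, z_p, z_n = (F.normalize(z, dim=1) for z in (z_a, z_p, z_n))
    d_ap = (z_a - z_p).pow(2).sum(dim=1)  # anchor-positive distance
    d_an = (z_a - z_n).pow(2).sum(dim=1)  # anchor-negative distance
    return F.relu(d_ap - d_an + margin).mean()

if __name__ == "__main__":
    att = TemporalFrequencyAttention(channels=32)
    feats = torch.randn(4, 32, 40, 200)   # e.g. 40 mel bins, 200 frames
    print(att(feats).shape)               # torch.Size([4, 32, 40, 200])
    z = torch.randn(4, 256)
    print(triplet_loss(z, z + 0.1, torch.randn(4, 256)))
```

Pooling each axis before gating keeps the attention lightweight and independent of utterance length, which is one plausible way to realise the temporal/frequency filtering the abstract describes.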
