Abstract

Text-independent speaker verification identifies people from their voice characteristics. In this paper, we propose a new method, the Dual-Sequences Gate Attention Unit, to improve the accuracy of a large-scale speaker verification system. The Dual-Sequences Gate Attention Unit is based on the Gated Dual Attention Unit and the Gated Recurrent Unit. Its two inputs come from the same source: the statistics pooling layer of the x-vector and the frame-level information of the x-vector. It is developed by applying an attention mechanism to the traditional Gated Recurrent Unit to enhance the learning ability of the x-vector system. The whole system takes the statistics pooling from each time-delay neural network layer of the x-vector baseline and passes it through the Dual-Sequences Gate Attention Unit layer to aggregate more information from the varying temporal context of the input features while training at the frame level. We train our model on VoxCeleb2 and then evaluate its accuracy on VoxCeleb1 and the Speakers in the Wild dataset. Finally, the system is compared with the x-vector, L-vector, and ETDNN-OPGRUs x-vector systems, and our proposed method shows a clear improvement. Compared with the x-vector system, the fusion system achieves at least a 17.5% equal error rate improvement on VoxCeleb1 and 0.5% on Speakers in the Wild.
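
The abstract only describes the unit at a high level, so the following is a minimal, hypothetical sketch (in PyTorch-style Python) of the general idea: a GRU-style recurrence over frame-level features whose update is gated by attention conditioned on a statistics-pooling summary of the same utterance. All class, parameter, and dimension names are assumptions for illustration, not the paper's actual equations or code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualSequenceGateAttentionUnit(nn.Module):
    """Hypothetical sketch: a GRU-based cell that consumes two streams from
    the same utterance (frame-level TDNN outputs and a statistics-pooling
    vector) and uses attention to gate how each frame updates the state."""

    def __init__(self, frame_dim, stats_dim, hidden_dim):
        super().__init__()
        self.gru = nn.GRUCell(frame_dim, hidden_dim)
        # Scores each frame against the utterance-level statistics vector.
        self.attn = nn.Linear(frame_dim + stats_dim, 1)
        # Gate that decides how much of the new state to keep.
        self.gate = nn.Linear(hidden_dim + stats_dim, hidden_dim)

    def forward(self, frames, stats):
        # frames: (batch, time, frame_dim) frame-level features
        # stats:  (batch, stats_dim) statistics-pooling summary
        batch, time, _ = frames.shape
        stats_rep = stats.unsqueeze(1).expand(-1, time, -1)
        scores = self.attn(torch.cat([frames, stats_rep], dim=-1)).squeeze(-1)
        weights = F.softmax(scores, dim=1)  # attention over time steps

        h = frames.new_zeros(batch, self.gru.hidden_size)
        for t in range(time):
            h_new = self.gru(frames[:, t], h)
            # Blend old and new state, gated by attention and the statistics.
            g = torch.sigmoid(self.gate(torch.cat([h_new, stats], dim=-1)))
            alpha = weights[:, t].unsqueeze(-1)
            h = alpha * g * h_new + (1.0 - alpha * g) * h
        return h  # utterance-level embedding for the speaker classifier

# Example usage with assumed dimensions:
# unit = DualSequenceGateAttentionUnit(frame_dim=512, stats_dim=3000, hidden_dim=512)
# emb = unit(torch.randn(8, 200, 512), torch.randn(8, 3000))
```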
