Abstract

Voiceprint recognition systems play an increasingly important role in everyday life. The most popular voiceprint recognition technologies currently rely on neural networks to extract speaker features. The successful ECAPA-TDNN architecture is an improved time-delay neural network based on the x-vector architecture: it explicitly models channel interdependencies by introducing Squeeze-and-Excitation (SE) blocks that rescale channels according to global properties of the recording, while its SE-Res2Net blocks expand the frame-layer context. In this paper, several improvements to this architecture are proposed, drawing on recent trends in voiceprint recognition. First, the initial frame layer is restructured as a one-dimensional Res2Net module with skip connections, and, in view of the influence of the input representation on the network, the input audio is pre-processed into Mel-spectrogram features. Second, a gated recurrent unit (GRU) network is embedded in the residual structure of the multi-layer feature aggregation to further mine the temporal contextual information in the audio signal; the output of the GRU is aggregated with the output of the last SE-Res2Block for subsequent feature extraction. Finally, the ArcFace loss function is used to penalise the angle between the deep features and their corresponding class weights in an additive manner, enhancing intra-class compactness and inter-class separation and improving the verification performance of the proposed architecture for voiceprint recognition.
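To make the described modifications concrete, the sketch below illustrates in PyTorch (i) an assumed log-Mel front-end, (ii) a GRU branch whose output is aggregated with the output of the last SE-Res2Block, and (iii) an additive angular margin (ArcFace) training head. All module names, dimensions, and hyperparameters here (80 Mel bands, 512 channels, a 192-dimensional embedding, margin m = 0.2, scale s = 30) are illustrative assumptions, not necessarily the configuration used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio

# Assumed front-end: 80-dimensional log-Mel spectrogram features at 16 kHz
mel_extractor = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=80)
waveform = torch.randn(1, 16000)                       # one second of dummy audio
features = torch.log(mel_extractor(waveform) + 1e-6)   # (1, 80, frames)


class GRUAggregation(nn.Module):
    """GRU branch whose output is summed with the output of the last
    SE-Res2Block, a residual-style aggregation (channel size is assumed)."""
    def __init__(self, channels=512):
        super().__init__()
        self.gru = nn.GRU(channels, channels, batch_first=True)

    def forward(self, se_res2_out):
        # se_res2_out: (batch, channels, frames)
        x = se_res2_out.transpose(1, 2)        # (batch, frames, channels)
        gru_out, _ = self.gru(x)               # mines temporal context
        gru_out = gru_out.transpose(1, 2)      # back to (batch, channels, frames)
        return gru_out + se_res2_out           # aggregate for later pooling


class ArcFaceHead(nn.Module):
    """Additive angular margin (ArcFace) loss; margin m and scale s are assumed."""
    def __init__(self, emb_dim=192, n_speakers=6000, m=0.2, s=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_speakers, emb_dim))
        self.m, self.s = m, s

    def forward(self, embeddings, labels):
        # Cosine of the angle between L2-normalised embeddings and class weights
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # Penalise only the target class by adding the angular margin m
        target = F.one_hot(labels, cosine.size(1)).float()
        logits = self.s * torch.cos(theta + self.m * target)
        return F.cross_entropy(logits, labels)
```

In such a setup, utterance-level embeddings would be pooled from the GRU-aggregated frame features and passed, together with speaker labels, to the ArcFace head during training, so that the margin-penalised cross-entropy enforces the intra-class compactness and inter-class separation described above.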
