Abstract

Speaker identification with diversified voice clip across the globe is a crucial and challenging task, specially extracting vigorous and discriminative features. In this paper, we demonstrated an end-to-end speaker identification pipeline introducing a hybrid Gated Recurrent Unit (GRU) and Convolutional Neural Network (CNN) feature extraction technique. At first, the voice clip is converted to a spectrogram, then processed with the GRU and CNN model, a part of it is again transformed with residual CNN model optimizing the subspace loss to extract best and substantial feature vector. Later, a statistical based feature selection method is applied to combine and select most significant features. To validate the proposed GRU-CNN feature extractor, we have examined it in a large-scale voxcelb dataset from comprising of 6000 real world speakers with multiple voices. Finally, a comparative analysis with state-of-art feature extraction techniques is applied with a promising outcome of 91.08% accuracy along with 93.51% and 94.74% precision and recall values respectively.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call