A Hybrid GRU-CNN Feature Extraction Technique for Speaker Identification

Md Shazzad Hossain Shihab, Md Iftekharul Alam Efat, Shuvra Aditya, K M Imtiaz-Ud-Din, Jahangir Hossain Setu

doi:10.1109/iccit51783.2020.9392734

Abstract

Speaker identification across diverse voice clips from around the globe is a crucial and challenging task, especially the extraction of robust and discriminative features. In this paper, we demonstrate an end-to-end speaker identification pipeline that introduces a hybrid Gated Recurrent Unit (GRU) and Convolutional Neural Network (CNN) feature extraction technique. First, the voice clip is converted to a spectrogram and processed with the GRU and CNN model; part of the output is then transformed again by a residual CNN model that optimizes the subspace loss to extract the most informative feature vector. Next, a statistics-based feature selection method is applied to combine and select the most significant features. To validate the proposed GRU-CNN feature extractor, we evaluated it on the large-scale VoxCeleb dataset, which comprises 6000 real-world speakers with multiple voice clips each. Finally, a comparative analysis against state-of-the-art feature extraction techniques shows promising results: 91.08% accuracy, with precision and recall of 93.51% and 94.74%, respectively.
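The first stage of the pipeline, converting a voice clip to a spectrogram, can be illustrated with a minimal sketch. The abstract does not state the frame length, hop size, or window type, so the Hann window and the values `frame_len=64` and `hop=32` below are assumptions chosen for illustration, as are the function names `hann` and `magnitude_spectrogram`. The naive DFT uses only the Python standard library for readability; a practical system would likely use an FFT routine from a numerical library.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n (window choice is an assumption)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def magnitude_spectrogram(signal, frame_len=64, hop=32):
    """Split the signal into overlapping windowed frames and return,
    for each frame, the DFT magnitudes of bins 0 .. frame_len // 2."""
    window = hann(frame_len)
    n_bins = frame_len // 2 + 1
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + i] * window[i] for i in range(frame_len)]
        spectrum = []
        for k in range(n_bins):
            # Naive O(N^2) DFT, kept explicit for clarity
            acc = sum(
                chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


# Toy input (not real speech): a pure tone placed exactly on DFT bin 8
sample_rate = 8000
frame_len = 64
tone_hz = 8 * sample_rate / frame_len  # 1000 Hz
signal = [math.sin(2.0 * math.pi * tone_hz * t / sample_rate) for t in range(256)]

spec = magnitude_spectrogram(signal, frame_len=frame_len, hop=32)
# Index of the strongest frequency bin in each frame
peak_bins = [max(range(len(frame)), key=frame.__getitem__) for frame in spec]
```

Each row of `spec` is one time frame and each column one frequency bin. A two-dimensional array of this kind is the sort of input that recurrent and convolutional layers, such as the GRU and CNN models described in the abstract, would consume.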
