Abstract

Advancements in machine learning and deep learning benefit access control, forensics, and biometrics, particularly speaker identification systems. The SincNet architecture is a convolutional neural network (CNN) designed for speaker identification: it takes one-dimensional raw speech and passes it through an initial convolutional layer composed of Sinc filters. In this work, we present SincsquareNet, a CNN that efficiently learns customized triangular band-pass filters using trainable sinc-squared functions. We also propose fusing SincsquareNet with SincNet for robust speaker identification, and we employ a self-attention mechanism to obtain discriminative features. The proposed framework is validated on the LibriSpeech dataset. The fused approach exploits the strengths of both filter types, helps the network learn more robust features, and converges faster. Experimental results show that, compared to SincsquareNet alone, speaker identification accuracy improves by a relative 8%, while validation loss is reduced by 7%.
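To make the sinc-squared filtering idea concrete, the sketch below shows one way a trainable sinc-squared band-pass layer could be written in PyTorch: squaring a sinc kernel yields a triangular frequency response, and a cosine term shifts it to a learnable centre frequency. The layer name `SincSquaredConv1d`, the centre-frequency/bandwidth parameterization, and all hyperparameter values are illustrative assumptions, not the paper's implementation.

```python
import math
import torch
import torch.nn as nn


class SincSquaredConv1d(nn.Module):
    """Minimal sketch of a 1-D convolution whose kernels are trainable
    sinc-squared functions (triangular band-pass filters in frequency).
    Parameterization and initial values are assumptions for illustration."""

    def __init__(self, out_channels=80, kernel_size=251, sample_rate=16000):
        super().__init__()
        self.kernel_size = kernel_size
        # Learnable centre frequencies and bandwidths in Hz (assumed parameterization).
        f_center = torch.linspace(30.0, sample_rate / 2 - 100.0, out_channels)
        bandwidth = torch.full((out_channels,), 100.0)
        self.f_center = nn.Parameter(f_center)
        self.bandwidth = nn.Parameter(bandwidth)
        # Symmetric time axis (seconds) and a Hamming window to smooth truncation.
        n = torch.arange(kernel_size) - (kernel_size - 1) / 2
        self.register_buffer("t", n / sample_rate)
        self.register_buffer("window", torch.hamming_window(kernel_size))

    def forward(self, x):
        # sinc(b*t)^2 has a triangular spectrum of width ~b; the cosine factor
        # shifts it to f_center, giving one triangular band-pass per channel.
        b = torch.abs(self.bandwidth).unsqueeze(1)   # (C, 1)
        fc = torch.abs(self.f_center).unsqueeze(1)   # (C, 1)
        t = self.t.unsqueeze(0)                      # (1, K)
        kernels = torch.sinc(b * t) ** 2 * torch.cos(2 * math.pi * fc * t)
        kernels = kernels * self.window
        kernels = kernels / (kernels.norm(dim=1, keepdim=True) + 1e-8)
        return nn.functional.conv1d(x, kernels.unsqueeze(1),
                                    padding=self.kernel_size // 2)


# Usage sketch: raw waveform batch of shape (batch, 1, samples).
layer = SincSquaredConv1d()
waveform = torch.randn(4, 1, 16000)
features = layer(waveform)  # -> (4, 80, 16000)
```

In a fusion setup such as the one described in the abstract, the outputs of a layer like this and of a standard SincNet front end would be combined (for example by concatenation) before the self-attention and classification stages; the exact fusion strategy is not specified here.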
