Abstract

Most state-of-the-art speech-processing methods employ convolutional neural networks (CNNs) that operate on a continuous, one-dimensional (1-D) time stream. For an audio signal, the mel-spectrogram represents attributes of the utterance in the frequency domain, which corresponds to the speech spectrum. Moreover, for time-series speaker signals, CNNs capture characteristics from long-form speech better than classical machine learning or transfer learning models. This paper introduces a jump-connected 1-D CNN with a combined loss function for speaker recognition. The proposed model uses 1-D convolutional layers combined with jump (skip) connections to extract speaker-specific characteristics, reducing temporal and spectral variability and speeding up computation. A combined loss function, comprising a softmax loss, a stable L2-norm loss, and a smooth L1-norm loss, guides the proposed compact convolutional neural network (CCNN) to identify the correct speaker more effectively. We evaluated the proposed framework on various standard and real-time audio datasets. The experimental findings demonstrate that the proposed CCNN outperforms existing approaches, reducing the equal error rate by 9.02%. Our voiceprint identification model also achieves an average speaker recognition rate of 98.76%. The reliability of the 1-D CCNN is additionally tested under various conditions. Other fields of study, such as language modelling, could employ this approach after some fine-tuning.

Relevance of the work: Speaker recognition is an area in which combined machine learning (ML) and deep learning (DL) schemes have the potential to make significant advances in forensic science, automation, and authentication. A compact CNN can improve the identification and verification process by mitigating issues such as false positives and background noise. Extending this approach could also facilitate raga identification and disease treatment therapies.
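The abstract does not include an implementation, but a few hedged sketches may clarify the pipeline it describes. First, the mel-spectrogram front end: a log-mel representation can be computed with librosa, where the sample rate, FFT size, hop length, and 40 mel bands below are common defaults, not values taken from the paper.

```python
import librosa

# Load an utterance and compute a log mel-spectrogram.
# All parameter values here are illustrative assumptions.
y, sr = librosa.load("utterance.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                     hop_length=160, n_mels=40)
log_mel = librosa.power_to_db(mel)   # shape: (40, frames)
```

Next, a minimal PyTorch sketch of a compact 1-D CNN whose blocks carry jump (skip) connections over those features. Layer sizes, depth, and the pooling choice are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JumpBlock(nn.Module):
    """1-D convolutional block that adds its input back to its output
    (a jump/skip connection), easing gradient flow through the stack."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2)
        self.bn = nn.BatchNorm1d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.relu(self.bn(self.conv(x)) + x)   # jump connection

class CCNN(nn.Module):
    """Compact 1-D CNN over mel-spectrogram frames (illustrative sizes)."""
    def __init__(self, n_mels: int = 40, n_speakers: int = 100,
                 channels: int = 64):
        super().__init__()
        self.stem = nn.Conv1d(n_mels, channels, kernel_size=3, padding=1)
        self.blocks = nn.Sequential(JumpBlock(channels), JumpBlock(channels))
        self.head = nn.Linear(channels, n_speakers)

    def forward(self, x: torch.Tensor):
        # x: (batch, n_mels, frames) -- log-mel spectrogram
        h = F.relu(self.stem(x))
        h = self.blocks(h)
        emb = h.mean(dim=-1)           # temporal average pooling -> embedding
        return self.head(emb), emb     # speaker logits and embedding
```

Finally, one plausible reading of the combined loss, assuming the stable L2-norm and smooth L1-norm terms pull embeddings toward learned per-speaker centers in the spirit of center loss. The `centers` parameter, the weights `alpha` and `beta`, and the epsilon stabilisation are all assumptions; the paper's exact formulation may differ.

```python
def combined_loss(logits, emb, targets, centers,
                  alpha=0.1, beta=0.1, eps=1e-8):
    ce = F.cross_entropy(logits, targets)                  # softmax loss
    diff = emb - centers[targets]                          # distance to speaker center
    l2 = torch.sqrt(diff.pow(2).sum(dim=1) + eps).mean()   # eps-stabilised L2 norm
    sl1 = F.smooth_l1_loss(emb, centers[targets])          # smooth L1 (Huber) term
    return ce + alpha * l2 + beta * sl1

# Illustrative usage: centers would be trained alongside the network.
model = CCNN(n_mels=40, n_speakers=100)
centers = nn.Parameter(torch.zeros(100, 64))
x = torch.randn(8, 40, 200)                 # batch of log-mel features
logits, emb = model(x)
loss = combined_loss(logits, emb, torch.randint(0, 100, (8,)), centers)
```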
