Abstract

Speaker identification is the task of identifying an individual from a set of speakers, and text-independent speaker identification systems allow speakers to utter any phrase without constraints. This study focuses on raw audio analysis, since phase, fine-grained frequency patterns, timing cues, and other minute characteristics are preserved when raw waveforms are processed, in contrast to handcrafted features such as Mel-Frequency Cepstral Coefficients (MFCC) or visual representations of audio such as spectrograms. This depth of information, which includes variations in speech rhythm, pitch, and vocal tract shape, is beneficial for identifying speakers. The deep learning architecture known as SincNet has gained popularity in speaker identification because its parametric sinc functions allow it to operate directly on the raw audio input. In this paper, we consider SincNet as the baseline model for speaker identification and analyse the effects of proper speech boundary detection, higher-level features, and effective optimizer selection. Precise identification of the start and end points of the signal is important for eliminating redundant non-speech regions, so we include an endpoint detection module as a pre-processing step in the system. Proper feature extraction and selection are crucial to the model's success; to extract more abstract features from the data, we add more convolution layers to the original SincNet model. Further, we investigate the hyperparameter tuning protocol's sensitivity to the optimizer and select a suitable optimizer for raw audio analysis. With all these modifications to the system architecture, we achieve accuracy improvements of 12.76%, 13.33%, and 13.39% for training, validation, and testing, respectively, over the original SincNet model. In terms of validation loss, our proposed approach attains 0.35, compared to the original SincNet loss of 1.02.
Despite this significant improvement, the total training time of our proposed model increases only marginally, by 20 minutes. We performed our investigation on the LibriSpeech dataset to check the effectiveness of the proposed system in comparison to the other models.
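The parametric sinc filters at the heart of SincNet can be illustrated with a short NumPy sketch: each convolution kernel is the difference of two windowed ideal low-pass sinc responses, so only the two cutoff frequencies per filter are learnable. This is a minimal illustration of the filter construction only, not the paper's implementation; the function names and the peak-gain normalization step are our own assumptions.

```python
import numpy as np

def sinc_bandpass_kernel(f_low, f_high, kernel_len, fs=16000):
    """Build one SincNet-style band-pass kernel from two cutoffs (Hz).

    In SincNet only f_low and f_high are learned per filter; the sinc
    shape and the window are fixed. kernel_len should be odd so the
    kernel is symmetric (linear phase).
    """
    # Time axis centered on zero, in seconds.
    n = np.arange(kernel_len) - (kernel_len - 1) / 2
    t = n / fs

    # Ideal low-pass impulse response; np.sinc(x) = sin(pi*x)/(pi*x).
    def lowpass(fc):
        return 2 * fc * np.sinc(2 * fc * t)

    # Difference of two low-pass responses yields a band-pass filter.
    kernel = lowpass(f_high) - lowpass(f_low)
    # Hamming window smooths the truncation, as in the SincNet paper.
    kernel *= np.hamming(kernel_len)
    # Normalize so the peak pass-band gain is ~1 (our own convention).
    return kernel / np.max(np.abs(np.fft.rfft(kernel)))

def sinc_conv(signal, kernels):
    """Convolve a raw waveform with a bank of sinc kernels."""
    return np.stack([np.convolve(signal, k, mode="valid") for k in kernels])
```

For example, filtering a mixture of a 100 Hz and a 3000 Hz tone through a 2000–4000 Hz kernel passes the 3000 Hz component while strongly attenuating the 100 Hz one, which is the band-selective behavior the learned cutoffs exploit.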
