Abstract

Speech is the most natural mode of human communication. Beyond exchanging thoughts and ideas, a speech signal conveys rich information such as language, gender, age, emotion, cognitive state, and speaker identity. Speaker recognition identifies an individual based on the paralinguistic information in their speech signal, and it finds many applications in biometrics, forensics, and access control systems. Convolutional Neural Networks (CNNs), which learn low-level speech representations directly from raw waveforms, have been frequently employed for speaker identification. SincNet is a neural architecture that has proven effective for the speaker recognition task: it analyzes raw audio samples and discovers robust features by replacing the first layer of a DNN with a convolution based on parametrized sinc functions. As an initial processing step, the speech signal is divided into short, overlapping temporal chunks called frames. Each frame is then windowed to reduce the artifacts caused by sudden signal truncation at the frame boundaries before the Fast Fourier Transform is applied. The proposed research examines the performance of several windowing functions applied over the sinc layer in order to preprocess the speech signal effectively for the speaker recognition task. In addition, a variable-length frame-overlapping approach for speaker recognition is presented. The results obtained by the various windowing approaches over the sinc layer are plotted. Experimental results show that the Blackman window outperformed all other windows, and that a frame size with 50% overlap between speech frames enhances speaker recognition accuracy by 5%.
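The framing-and-windowing step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the 400-sample frame length (25 ms at 16 kHz) and the helper name `frame_and_window` are assumptions chosen for the example, while the 50% overlap and the Blackman window come from the abstract.

```python
import numpy as np

def frame_and_window(signal, frame_len=400, overlap=0.5):
    """Split a 1-D signal into overlapping frames, apply a Blackman
    window to each frame, and return the FFT magnitude spectra.

    frame_len=400 (25 ms at 16 kHz) is an illustrative assumption,
    not a value taken from the paper.
    """
    hop = int(frame_len * (1 - overlap))           # 50% overlap -> hop of half a frame
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.blackman(frame_len)                # tapers frame edges to reduce truncation artifacts
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    windowed = frames * window                     # broadcast the window over every frame
    return np.abs(np.fft.rfft(windowed, axis=1))   # magnitude spectrum per frame

# Example: 1 s of synthetic audio at 16 kHz
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
spectra = frame_and_window(sig)
print(spectra.shape)  # (n_frames, frame_len // 2 + 1)
```

Swapping `np.blackman` for `np.hamming` or `np.hanning` reproduces the kind of window comparison the study performs; in SincNet itself the windowing is applied over the learned sinc filters rather than as a separate FFT preprocessing stage.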
