Abstract

The SincNet architecture has shown significant benefits over traditional Convolutional Neural Networks (CNNs), especially for speaker recognition. SincNet uses parameterized sinc functions as band-pass filters in its first layer, followed by standard convolutional layers. Although SincNet is compact and offers an interpretable view of the features it extracts, the effect of the window function applied to its filters has not yet been thoroughly addressed. Hamming and Hann windows are commonly used as the default time-localized windows to reduce spectral leakage. This work therefore presents a comprehensive investigation of 28 different window functions in the SincNet architecture for the speaker recognition task on the TIMIT dataset. Additionally, “trainable” window functions with tunable parameters were configured to characterize their performance. The paper benchmarks the effect of the time-localized window function in terms of bandwidth, side-lobe suppression, and spectral leakage for the filter bank employed in the first layer of the SincNet architecture. When employed in SincNet, trainable Gaussian and Cosine-Sum window functions exhibited relative improvements of 41.46% and 82.11%, respectively, in sentence-level classification error rate over the Hamming window.
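
To make the idea concrete, the sketch below shows one band-pass sinc filter whose Gaussian window width is learned jointly with the band edges, mirroring the "trainable window" concept described above. This is a minimal illustration, not the authors' implementation: the class name, kernel length, initial cutoffs, and the exact Gaussian parameterization are all assumptions made for this example.

```python
import torch
import torch.nn as nn

class GaussianWindowedSinc(nn.Module):
    """Hypothetical single band-pass sinc filter with a trainable Gaussian window.

    The band edges (f1, f1 + band) and the window width sigma are all
    nn.Parameters, so backpropagation tunes the window alongside the filter.
    """

    def __init__(self, kernel_size=251, sample_rate=16000,
                 f1_hz=100.0, f2_hz=1000.0, sigma=0.4):
        super().__init__()
        assert kernel_size % 2 == 1, "odd length keeps the filter symmetric"
        # Learnable cutoff frequencies (normalized to the sample rate).
        self.f1 = nn.Parameter(torch.tensor(f1_hz / sample_rate))
        self.band = nn.Parameter(torch.tensor((f2_hz - f1_hz) / sample_rate))
        # Trainable window parameter (assumed parameterization for this sketch).
        self.sigma = nn.Parameter(torch.tensor(sigma))
        # Symmetric sample indices centered on zero.
        n = torch.arange(kernel_size) - (kernel_size - 1) / 2
        self.register_buffer("n", n)

    def forward(self, x):
        f1 = torch.abs(self.f1)
        f2 = f1 + torch.abs(self.band)
        # Ideal band-pass filter = difference of two low-pass sinc filters.
        h = 2 * f2 * torch.sinc(2 * f2 * self.n) - 2 * f1 * torch.sinc(2 * f1 * self.n)
        # Gaussian window: sigma trades main-lobe width against side-lobe level.
        half = (len(self.n) - 1) / 2
        window = torch.exp(-0.5 * (self.n / (torch.abs(self.sigma) * half)) ** 2)
        kernel = (h * window).view(1, 1, -1)
        return nn.functional.conv1d(x, kernel, padding=len(self.n) // 2)

# Usage: filter a batch of four one-second waveforms sampled at 16 kHz.
layer = GaussianWindowedSinc()
out = layer(torch.randn(4, 1, 16000))
```

In a full SincNet-style first layer one would stack many such filters into a bank; here, gradients flowing into sigma let the network adjust the bandwidth/side-lobe trade-off that the paper benchmarks, rather than fixing it with a static Hamming or Hann window.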
