Learning interpretable filters in Convolutional Neural Networks (CNNs) is an approach that helps build models with better generalization ability. Interpretable filters can reveal hidden aspects of the task and help improve the model. One of the most successful approaches in speech processing is SincNet, where the model learns band-pass filters in the first layer of a CNN that takes the raw waveform as input. In this paper, similar to SincNet, meaningful filters inspired by Infinite Impulse Response (IIR) filters are proposed. The proposed model uses a phase correction process to ensure that phase linearity is satisfied. The effective length of the truncated IIR filter is calculated based on the accumulated energy, and the effect of the filter size on the final results is investigated. The proposed model is evaluated on the speaker identification task on the TIMIT and LibriSpeech datasets and compared with traditional CNNs and four interpretable kernel-based models. The experimental results show the superiority of the proposed model in both performance and convergence speed. Moreover, patterns of the speech signal that uniquely identify a speaker are analyzed by examining the spectra of the learned filters.
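To illustrate the idea of choosing an effective length for a truncated IIR impulse response from its accumulated energy, the following is a minimal sketch. It assumes a standard Butterworth band-pass design from SciPy and a 99% energy threshold; the cutoff frequencies, filter order, and threshold are hypothetical and are not taken from the paper, whose filter parameterization may differ.

```python
import numpy as np
from scipy import signal

def effective_length(h, energy_ratio=0.99):
    """Smallest truncation length whose accumulated energy reaches the
    given fraction of the impulse response's total energy."""
    energy = np.cumsum(h ** 2)
    energy /= energy[-1]
    return int(np.searchsorted(energy, energy_ratio)) + 1

# Hypothetical example: impulse response of a 2nd-order Butterworth
# band-pass IIR filter (300-3400 Hz at a 16 kHz sampling rate).
b, a = signal.butter(2, [300, 3400], btype="bandpass", fs=16000)
impulse = np.zeros(4096)
impulse[0] = 1.0
h = signal.lfilter(b, a, impulse)

L = effective_length(h, energy_ratio=0.99)
print(f"Effective (truncated) filter length: {L} samples")
```

Under these assumptions, the truncation length is simply the first index at which the cumulative squared impulse response crosses the chosen energy fraction; raising the threshold yields a longer, more faithful FIR approximation of the IIR filter.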