Abstract

Identifying a speaker's gender is an important problem that often serves as a preliminary task in speech processing applications such as speaker identification and emotion recognition. Although highly accurate models have been developed through the progress of deep neural networks (DNNs), their computational complexity makes them unsuitable for settings with limited hardware resources. Pruning a DNN architecture, reducing the number of parameters while preserving the original accuracy, is one way to obtain more efficient models. In this paper, Auditory Filter Models (AFMs) are employed as front-end convolutional filters. Besides being representable by only a few parameters, these perceptually meaningful filters also outperform conventional convolutional filters. The proposed method increases efficiency, improves model generalizability, and reduces the number of parameters required for gender recognition. Moreover, by applying a clustering approach to reduce the number of learnable AFM filters in the first convolutional layer, the parameter count can be reduced even further. Experiments on four well-known datasets, namely TIMIT, LibriSpeech, VoxCeleb1, and RAVDESS, support these claims. The results show that the technique not only enhances model performance but also yields a rich and informative representation of the speech data, underscoring the effectiveness and robustness of the proposed approach across diverse datasets and real-world scenarios.
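The abstract does not spell out the exact AFM parameterization or clustering procedure, so the following is only a minimal sketch of the general idea, assuming a gammatone-style auditory filter in which each first-layer kernel is generated from two learnable parameters (centre frequency and bandwidth) rather than learned tap by tap. The names GammatoneFrontend and reduce_filters, and all hyperparameter values, are illustrative assumptions, not the paper's implementation.

    import math

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from sklearn.cluster import KMeans


    class GammatoneFrontend(nn.Module):
        # Illustrative assumption: each kernel is a gammatone impulse response
        #   g(t) = t^(order-1) * exp(-2*pi*bw*t) * cos(2*pi*fc*t),
        # so only the centre frequency fc and bandwidth bw of each filter are
        # learned: 2 parameters per filter instead of kernel_size parameters.
        def __init__(self, n_filters=40, kernel_size=401, sample_rate=16000, order=4):
            super().__init__()
            self.kernel_size = kernel_size
            self.sample_rate = sample_rate
            self.order = order
            # Centre frequencies initialised on a log scale; bandwidths with an
            # ERB-like heuristic. Both are then refined by backpropagation.
            fc = torch.logspace(math.log10(50.0),
                                math.log10(sample_rate / 2 - 100.0), n_filters)
            self.fc = nn.Parameter(fc)
            self.bw = nn.Parameter(0.1 * fc + 25.0)

        def forward(self, x):                        # x: (batch, 1, time)
            t = torch.arange(self.kernel_size, device=x.device) / self.sample_rate
            fc = self.fc.unsqueeze(1)                # (n_filters, 1)
            bw = self.bw.abs().unsqueeze(1)          # keep bandwidths positive
            g = (t ** (self.order - 1)
                 * torch.exp(-2 * math.pi * bw * t)
                 * torch.cos(2 * math.pi * fc * t))  # (n_filters, kernel_size)
            g = g / (g.norm(dim=1, keepdim=True) + 1e-8)
            return F.conv1d(x, g.unsqueeze(1), padding=self.kernel_size // 2)


    def reduce_filters(frontend, n_clusters=20):
        # Post-hoc reduction in the spirit of the abstract's clustering step
        # (again an assumption about the details): cluster the learned
        # (fc, bw) pairs with k-means and keep one filter per centroid.
        params = torch.stack([frontend.fc, frontend.bw.abs()], dim=1)
        centroids = KMeans(n_clusters=n_clusters, n_init=10).fit(
            params.detach().cpu().numpy()).cluster_centers_
        reduced = GammatoneFrontend(n_filters=n_clusters,
                                    kernel_size=frontend.kernel_size,
                                    sample_rate=frontend.sample_rate,
                                    order=frontend.order)
        with torch.no_grad():
            reduced.fc.copy_(torch.as_tensor(centroids[:, 0], dtype=torch.float32))
            reduced.bw.copy_(torch.as_tensor(centroids[:, 1], dtype=torch.float32))
        return reduced

Under these assumptions, replacing a standard first Conv1d layer (n_filters x kernel_size free weights) with such a front end shrinks that layer to 2 x n_filters parameters, which is the kind of saving the abstract attributes to the AFM front end; the clustering step then removes redundant filters on top of that.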
