Abstract

The Time Delay Neural Network (TDNN) is a well-performing structure for deep neural network-based speaker recognition systems. In this paper we introduce a novel structure, the Crossed-Time Delay Neural Network (CTDNN), to improve the performance of the current TDNN for speaker recognition. Inspired by the multi-filter setting of convolutional layers in convolutional neural networks, we place multiple time delay units with different context sizes at the bottom layer and construct a multilayer parallel network. The proposed CTDNN gives significant improvements over the original TDNN on both speaker verification and identification tasks. On the VoxCeleb1 dataset it achieves a 2.6% absolute Equal Error Rate improvement in the verification experiment. Under the few-shot condition, CTDNN reaches 90.4% identification accuracy, double that of the original TDNN. We also compare the proposed CTDNN with another recent variant of TDNN, the Factorized-TDNN, and show that our model achieves a 36% absolute improvement in identification accuracy under the few-shot condition. Moreover, the proposed CTDNN can handle training with larger batches more efficiently and hence utilize computational resources more economically.
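As a rough illustration of the multi-context bottom layer described above, the following PyTorch sketch places several time delay units (realized as dilation-free 1-D convolutions) with different context sizes in parallel and concatenates their outputs, mimicking the multi-filter setting of convolutional layers. The feature dimension, branch width, and context sizes here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class ParallelTDNNFrontEnd(nn.Module):
    """Bottom layer with several time-delay units of different context sizes.

    Each branch is a 1-D convolution over the frame axis (a common realization
    of a time-delay layer); branch outputs are concatenated along the channel
    dimension. Hyperparameters are assumptions for illustration only.
    """

    def __init__(self, feat_dim=40, branch_dim=128, context_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(feat_dim, branch_dim, kernel_size=c, padding=c // 2),
                nn.ReLU(),
                nn.BatchNorm1d(branch_dim),
            )
            for c in context_sizes
        )

    def forward(self, x):
        # x: (batch, feat_dim, num_frames), e.g. MFCC or filter-bank features
        return torch.cat([branch(x) for branch in self.branches], dim=1)


if __name__ == "__main__":
    frames = torch.randn(8, 40, 200)   # batch of 8 utterances, 200 frames each
    front_end = ParallelTDNNFrontEnd()
    print(front_end(frames).shape)     # -> torch.Size([8, 384, 200])
```

Higher layers of the network would then operate on the concatenated representation; refer to the full text for the exact CTDNN topology and training setup.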
