Abstract

This paper proposes an architecture called SECNN, which combines squeeze-and-excitation (SE) components with a simplified residual convolutional neural network (ResNet). The model takes time-frequency spectrograms as input and measures speaker similarity between an utterance embedding and a speaker model by cosine similarity. Speaker models are obtained by averaging the utterance-level embeddings of each enrollment speaker. On the one hand, SECNN mitigates speaker overfitting in speaker verification through techniques such as regularization and the SE operation. On the other hand, SECNN is a lightweight model with merely 1.5M parameters. Experimental results indicate that SECNN outperforms other end-to-end models such as Deep Speaker, achieving an equal error rate (EER) of 5.55% in speaker verification and an accuracy of 93.92% in speaker identification on the LibriSpeech dataset. It also achieves an EER of 2.58% in speaker verification and an accuracy of 95.83% in speaker identification on the TIMIT dataset.
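The scoring procedure described above (averaging utterance-level embeddings into a speaker model, then comparing a test embedding by cosine similarity) can be sketched as follows. This is a minimal illustration, not the paper's actual implementation; the function names and the threshold value are hypothetical, and the embeddings would in practice come from the SECNN network.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def speaker_model(utterance_embeddings):
    # Speaker model: the mean of a speaker's utterance-level embeddings
    # collected during enrollment.
    return np.mean(np.stack(utterance_embeddings), axis=0)

def verify(test_embedding, model, threshold=0.5):
    # Accept the claimed identity if the similarity exceeds a threshold;
    # the threshold here is arbitrary and would be tuned on held-out data
    # (e.g., at the equal-error-rate operating point).
    return cosine_similarity(test_embedding, model) >= threshold
```

In identification (as opposed to verification), the test embedding would instead be scored against every enrolled speaker's model and assigned to the highest-scoring one.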
