Abstract

Speaker recognition has long faced challenges in identifying speakers accurately. The advent of deep learning algorithms brought remarkable changes to speaker recognition approaches. This paper introduces a simple, novel architecture based on a dilated convolutional network. The novel idea is to feed a well-structured log-Mel spectrum into the proposed dilated convolutional neural network and to reduce the number of layers to 11. The network uses global average pooling to accumulate the outputs from all layers into a feature vector representation for classification. Only 13 coefficients are extracted per frame of each speech sample. The proposed network achieves an accuracy of 90.97%, an Equal Error Rate (EER) of 3.75%, and a training time of 207 seconds, outperforming existing systems on the LibriSpeech corpus.
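
The two building blocks named above, dilated convolution and global average pooling, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not state kernel sizes, dilation rates, or channel counts, so the kernel, the dilation values, and the function names `dilated_conv1d`, `global_average_pool`, and `receptive_field` below are all assumptions chosen for illustration.

```python
from typing import List


def dilated_conv1d(x: List[float], kernel: List[float], dilation: int) -> List[float]:
    """'Valid' 1-D dilated convolution (cross-correlation form, no padding).

    Kernel taps are spaced `dilation` samples apart, so the filter covers a
    wider stretch of the input without adding weights.
    """
    span = (len(kernel) - 1) * dilation  # distance between first and last tap
    if len(x) <= span:
        raise ValueError("input is shorter than the dilated kernel")
    return [
        sum(w * x[t + k * dilation] for k, w in enumerate(kernel))
        for t in range(len(x) - span)
    ]


def global_average_pool(channels: List[List[float]]) -> List[float]:
    """Collapse each channel's time axis to its mean, giving one value per channel."""
    return [sum(c) / len(c) for c in channels]


def receptive_field(kernel_size: int, dilations: List[int]) -> int:
    """Input samples seen by one output of a stack of stride-1 dilated layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


if __name__ == "__main__":
    # Hypothetical input: one feature coefficient tracked over 10 frames.
    track = [float(i) for i in range(10)]
    out = dilated_conv1d(track, kernel=[1.0, 1.0, 1.0], dilation=2)
    print(out)                              # [6.0, 9.0, 12.0, 15.0, 18.0, 21.0]
    print(global_average_pool([out]))       # [13.5]
    print(receptive_field(3, [1, 2, 4]))    # 15
```

With a kernel of size 3 and assumed dilations of 1, 2, and 4, three layers already cover 15 input frames, which shows how dilation can enlarge temporal context while the layer count stays small. Global average pooling then yields a fixed-length vector regardless of utterance length.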
