Abstract

Deep learning has achieved wide adoption in speaker identification, outperforming Gaussian Mixture Model (GMM) and i-vector baselines. Neural network approaches have obtained promising results when fed raw speech samples directly. A modified Convolutional Neural Network (CNN) architecture called SincNet, based on parameterized sinc functions, offers a very compact way to derive a customized filter bank from short utterances. This paper proposes an attention-based Long Short-Term Memory (LSTM) architecture that encourages the discovery of more meaningful speaker-related features with minimal training data. The attention layer, built using neural networks, offers a unique and efficient representation of speaker characteristics, exploring the connection between an aspect and the content of short utterances. The proposed approach converges faster and performs better than SincNet in the speaker identification experiments carried out.
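To make the architecture described above concrete, the sketch below shows one plausible form of an attention-pooled LSTM speaker classifier in PyTorch: an LSTM encodes the frame sequence of a short utterance, a small neural network scores each frame, and a softmax over time produces attention weights used to pool a single utterance-level embedding. The layer sizes, the softmax-based pooling, and the speaker count are assumptions for illustration, not the paper's exact configuration.

import torch
import torch.nn as nn

class AttentiveLSTMSpeakerID(nn.Module):
    # Hypothetical configuration: 40-dim frame features, 256 hidden units,
    # 462 speakers (e.g., a TIMIT-sized training set); none of these values
    # are taken from the paper.
    def __init__(self, n_features=40, hidden=256, n_speakers=462):
        super().__init__()
        # LSTM encodes the frame sequence of a short utterance.
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        # A small network scores each frame; a softmax over time turns the
        # scores into attention weights that emphasize speaker-salient frames.
        self.score = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(hidden, n_speakers)

    def forward(self, x):                        # x: (batch, frames, n_features)
        h, _ = self.lstm(x)                      # (batch, frames, hidden)
        w = torch.softmax(self.score(h), dim=1)  # (batch, frames, 1)
        utterance = (w * h).sum(dim=1)           # attention-weighted pooling
        return self.classifier(utterance)        # speaker logits

# Example: a batch of 8 short utterances, 100 frames of 40-dim features each.
logits = AttentiveLSTMSpeakerID()(torch.randn(8, 100, 40))

Attention pooling of this kind is one common way to aggregate frame-level LSTM outputs into a fixed-length speaker embedding; it lets the model weight frames unevenly rather than averaging them, which is consistent with the abstract's claim of discovering more meaningful speaker-related features from limited data.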
