Abstract

The recent speaker embeddings framework has been shown to provide excellent performance on the task of text-independent speaker recognition. The framework is based on a deep neural network (DNN) trained to directly discriminate between speakers from traditional acoustic features such as Mel frequency cepstral coefficients. Prior studies on speaker recognition have found that phonetic information is valuable in the task of speaker identification, with systems being based on either bottleneck features (BFs) or tied-triphone state posteriors from a DNN trained for the task of speech recognition. In this paper, we analyze the role of phonetic BFs for DNN embeddings and explore methods to enhance the BFs further. Experimental results show that exploiting phonetic information encoded in BFs is very valuable for DNN speaker embeddings. Enriching the BFs using a cascaded DNN multi-task architecture is also shown to provide further improvements to the speaker embedding system.
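
To illustrate the general idea of feeding phonetic bottleneck features to a speaker-embedding DNN, the following is a minimal PyTorch sketch. It is not the paper's exact configuration: the layer sizes, feature dimensions, and the x-vector-style statistics-pooling design are illustrative assumptions, and the BFs are assumed to come from a separately trained ASR network.

```python
# Minimal sketch (illustrative only): an x-vector-style speaker-embedding DNN
# whose per-frame input is MFCCs concatenated with phonetic bottleneck
# features (BFs) produced by a separately trained ASR DNN.
import torch
import torch.nn as nn


class BFEmbeddingNet(nn.Module):
    def __init__(self, n_mfcc=23, n_bf=40, embed_dim=512, n_speakers=5000):
        super().__init__()
        in_dim = n_mfcc + n_bf  # frame-level input: [MFCC ; bottleneck features]
        # Frame-level layers (TDNN layers approximated with dilated 1-D convolutions).
        self.frame_layers = nn.Sequential(
            nn.Conv1d(in_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        # Segment-level layers after statistics pooling (mean + std over time).
        self.segment1 = nn.Linear(2 * 1500, embed_dim)  # embedding is taken here
        self.segment2 = nn.Linear(embed_dim, embed_dim)
        self.classifier = nn.Linear(embed_dim, n_speakers)  # speaker-ID training head

    def forward(self, mfcc, bf):
        # mfcc: (batch, n_mfcc, frames); bf: (batch, n_bf, frames)
        x = torch.cat([mfcc, bf], dim=1)  # append BFs to the acoustic features
        x = self.frame_layers(x)
        stats = torch.cat([x.mean(dim=2), x.std(dim=2)], dim=1)  # statistics pooling
        embedding = self.segment1(stats)  # fixed-length speaker embedding
        h = torch.relu(self.segment2(torch.relu(embedding)))
        return self.classifier(h), embedding


# Toy usage: a batch of 2 utterances, 200 frames each.
net = BFEmbeddingNet()
logits, emb = net(torch.randn(2, 23, 200), torch.randn(2, 40, 200))
print(logits.shape, emb.shape)  # torch.Size([2, 5000]) torch.Size([2, 512])
```

The network is trained to classify speakers; at test time the classifier head is discarded and the segment-level embedding is used for verification. The cascaded multi-task enrichment of the BFs described in the paper is not shown here.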
