Supervector Extraction for Encoding Speaker and Phrase Information with Neural Networks for Text-Dependent Speaker Verification

Victoria Mingote, Alfonso Ortega, Eduardo Lleida, Antonio Miguel

doi:10.3390/app9163295

Open Access

Journal: Applied Sciences
Publication Date: Aug 11, 2019
Citations: 8
License type: CC BY 4.0

Affiliation: Universidad de Zaragoza

Abstract

In this paper, we propose a new differentiable neural network with an alignment mechanism for text-dependent speaker verification. Unlike previous works, we do not extract the embedding of an utterance by global average pooling over the temporal dimension. Our system replaces this reduction mechanism with a phonetic phrase alignment model that keeps the temporal structure of each phrase, since the phonetic information is relevant to the verification task. Moreover, we can apply a convolutional neural network as the front-end, and, because the alignment process is differentiable, we can train the network to produce a supervector for each utterance that is discriminative with respect to the speaker and the phrase simultaneously. This choice has the advantage that the supervector encodes both phrase and speaker information, providing good performance in text-dependent speaker verification tasks. The verification process is performed using a basic similarity metric. The new model using alignment to produce supervectors was evaluated on the RSR2015-Part I database, providing competitive results compared to networks of similar size that use global average pooling to extract embeddings. Furthermore, we also evaluated this proposal on RSR2015-Part II. To our knowledge, this system achieves the best published results on this second part.
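The idea of building a supervector from aligned frames and scoring it with a simple similarity metric can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the function names `supervector` and `cosine_score` are hypothetical, the alignment is hard-coded as one state index per frame (in the paper it comes from the phonetic phrase alignment model), each state is summarised by the mean of its frames, and cosine similarity stands in for the "basic similarity metric", which the abstract does not name.

```python
import math


def supervector(frames, alignment, num_states):
    """Concatenate one mean vector per state (assumed pooling rule).

    frames:    list of per-frame feature vectors (lists of floats)
    alignment: alignment[t] is the state index assigned to frame t
    States that receive no frames contribute a zero vector.
    """
    dim = len(frames[0])
    sums = [[0.0] * dim for _ in range(num_states)]
    counts = [0] * num_states
    for frame, state in zip(frames, alignment):
        counts[state] += 1
        for d in range(dim):
            sums[state][d] += frame[d]
    vec = []
    for state_sum, count in zip(sums, counts):
        vec.extend(v / max(count, 1) for v in state_sum)
    return vec


def cosine_score(a, b):
    """Cosine similarity, used here as a stand-in verification score."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Toy 2-dimensional frame features for two utterances of the same phrase.
enroll_frames = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
test_frames = [[0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]

# Hypothetical alignments: each frame is assigned to one of two states.
enroll_sv = supervector(enroll_frames, [0, 0, 1], num_states=2)
test_sv = supervector(test_frames, [0, 1, 1], num_states=2)

score = cosine_score(enroll_sv, test_sv)
```

The supervector length is the number of states times the feature dimension, so the position of each sub-vector records which part of the phrase it came from. A real system would compare `score` against a threshold tuned on development data.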

Highlights

Techniques based on discriminative deep neural networks (DNN) have achieved substantial success in many speaker verification tasks.
These techniques follow the philosophy of the state-of-the-art face verification systems [1,2], where embeddings are usually extracted by reduction mechanisms and the decision process is based on a similarity metric [3].
We study the behaviour of our system when we vary the number of front-end layers, the training data and the states of the Hidden Markov Model (HMM).

Summary

Introduction

Techniques based on discriminative deep neural networks (DNN) have achieved substantial success in many speaker verification tasks. In text-dependent tasks, a possible cause of inaccuracy is the use of the temporal average as the representation of the whole utterance, as we show in the experimental section. To solve this problem, this paper presents a new architecture that combines a deep neural network with a phonetic phrase alignment method, used as a new internal layer, to maintain the temporal structure of the utterance. This is a more natural solution for text-dependent speaker verification, since both the speaker and the phrase information can be encoded in the supervector thanks to the neural network and the specific states of the supervector.
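Why a temporal average can hurt in text-dependent tasks can be seen in a toy example. This is only an illustrative sketch under assumed conditions: the helper `average_pool` is hypothetical, and the two "phrases" are hand-made frame sequences that contain the same frames in opposite order.

```python
def average_pool(frames):
    """Temporal average: a single mean vector over all frames."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


# Two toy "phrases" built from the same frames in opposite temporal order.
phrase_a = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
phrase_b = phrase_a[::-1]

emb_a = average_pool(phrase_a)
emb_b = average_pool(phrase_b)

# The average ignores frame order, so both phrases collapse to one embedding.
same_embedding = emb_a == emb_b
```

Because the average discards the order of the frames, the two sequences become indistinguishable, whereas a representation that keeps one sub-vector per aligned state would still separate them.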
