Abstract

Recurrent neural networks (RNNs) can model the time dependency of time-series data. They have also been widely used in text-dependent speaker verification to extract speaker- and phrase-discriminant embeddings. As with other neural networks, RNNs are trained in mini-batches. To feed input sequences into an RNN in mini-batches, all the sequences in each mini-batch must have the same length. However, the sequences have variable lengths, and these lengths are not known in advance. Truncation and padding are the most common ways to make all sequences the same length, but they distort the information: truncation discards part of a sequence and padding adds unnecessary frames, which can degrade the performance of text-dependent speaker verification. In this paper, we propose a method to handle variable-length sequences in RNNs without this distortion, by truncating each output sequence so that it has the same length as the corresponding original input sequence. Experimental results on the text-dependent speaker verification task of part 2 of RSR2015 show that, depending on the task, our method reduces the relative equal error rate by approximately 1.3% to 27.1% compared to the baselines, with only a small overhead in execution time.

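A minimal sketch of the idea described in the abstract, assuming a PyTorch GRU with mean pooling (the framework, RNN type, feature dimension, and pooling are illustrative assumptions, not the authors' exact configuration): sequences in a mini-batch are zero-padded to a common length, passed through the RNN, and each output sequence is then truncated back to its original length before pooling, so that frames produced from padding do not contribute to the embedding.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

# Hypothetical mini-batch of variable-length feature sequences
# (frames x feature_dim); the true length of each sequence is known.
sequences = [torch.randn(T, 40) for T in (120, 95, 143)]
lengths = [s.size(0) for s in sequences]

# Zero-pad all sequences to the length of the longest one so they
# can be stacked into a single (batch, max_len, feat) tensor.
padded = pad_sequence(sequences, batch_first=True)

rnn = nn.GRU(input_size=40, hidden_size=256, batch_first=True)
outputs, _ = rnn(padded)  # (batch, max_len, hidden)

# Truncate each output sequence back to its original length before
# pooling, so the padded frames are ignored.
embeddings = torch.stack(
    [outputs[i, :lengths[i]].mean(dim=0) for i in range(len(sequences))]
)
print(embeddings.shape)  # torch.Size([3, 256])
```
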
Highlights

  • In addition to the original connections (i.e., from the input to the hidden states), an RNN has recurrent connections from the hidden states of the previous time step to those of the current time step

  • Recurrent neural networks (RNNs) are neural networks used to model time-series data

  • Max/bmax showed a relatively lower equal error rate (EER), by about 9.2%, than mean/bmean. This suggests that the loss of information caused by truncation is more harmful than the unnecessary information added by padding


Summary

Introduction

In addition to the original connections (i.e., from the input to the hidden states), an RNN has recurrent connections from the hidden states of the previous time step to those of the current time step. Both sets of connections are shared across all time steps. The RNN memorizes the hidden states of the current time step (computed from both the input at the current time step and the hidden states at the previous time step) and feeds this information to the next time step. RNNs can therefore model the global context information of an input sequence with a small number of parameters.

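A minimal sketch of the recurrence described above, again assuming PyTorch with a tanh activation and small, arbitrary dimensions chosen purely for illustration: the hidden state at each time step is computed from the current input and the previous hidden state, and the same weights are reused at every step.

```python
import torch

def simple_rnn_step(x_t, h_prev, W_x, W_h, b):
    # Current hidden state depends on the current input and the
    # previous hidden state; the weights are shared over time steps.
    return torch.tanh(x_t @ W_x + h_prev @ W_h + b)

# Hypothetical dimensions for illustration.
feat_dim, hidden_dim, T = 40, 8, 5
W_x = torch.randn(feat_dim, hidden_dim) * 0.1
W_h = torch.randn(hidden_dim, hidden_dim) * 0.1
b = torch.zeros(hidden_dim)

h = torch.zeros(hidden_dim)
for x_t in torch.randn(T, feat_dim):  # unroll over the time steps
    h = simple_rnn_step(x_t, h, W_x, W_h, b)
print(h.shape)  # torch.Size([8]); h summarizes the whole sequence
```
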