Speaker identification using multi-modal i-vector approach for varying length speech in voice interactive systems

Varun Tiwari,Mohammad Farukh Hashmi,Avinash Keskar,N.C Shivaprakash

doi:10.1016/j.cogsys.2018.09.028

Abstract

The development in the interface of smart devices has lead to voice interactive systems. An additional step in this direction is to enable the devices to recognize the speaker. But this is a challenging task because the interaction involves short duration speech utterances. The traditional Gaussian mixture models (GMM) based systems have achieved satisfactory results for speaker recognition only when the speech lengths are sufficiently long. The current state-of-the-art method utilizes i-vector based approach using a GMM based universal background model (GMM-UBM). It prepares an i-vector speaker model from a speaker’s enrollment data and uses it to recognize any new test speech. In this work, we propose a multi-model i-vector system for short speech lengths. We use an open database THUYG-20 for the analysis and development of short speech speaker verification and identification system. By using an optimum set of mel-frequency cepstrum coefficients (MFCC) based features we are able to achieve an equal error rate (EER) of 3.21% as compared to the previous benchmark score of EER 4.01% on the THUYG-20 database. Experiments are conducted for speech lengths as short as 0.25 s and the results are presented. The proposed method shows improvement as compared to the current i-vector based approach for shorter speech lengths. We are able to achieve improvement of around 28% even for 0.25 s speech samples. We also prepared and tested the proposed approach on our own database with 2500 speech recordings in English language consisting of actual short speech commands used in any voice interactive system.

Full Text