Abstract

In recent years, great progress has been made in the technical aspects of automatic speaker verification (ASV). However, deploying ASV technology in practice remains very challenging, because most systems are still highly sensitive to new, unknown and spoofing conditions. Most previous studies focused on extracting target speaker information from natural speech. This paper aims to design a new ASV corpus with multiple speaking styles and to investigate the robustness of ASV to these different speaking styles. We first release this corpus on the Zenodo website for public research; for each speaker it contains several text-dependent and text-independent singing, humming and normal reading speech utterances. We then investigate the speaker discrimination of each speaking style in the feature space. Furthermore, the intra- and inter-speaker variabilities within each speaking style and across speaking styles are investigated in both text-dependent and text-independent ASV tasks. A conventional Gaussian Mixture Model (GMM) and the state-of-the-art x-vector approach are used to build the ASV systems. Experimental results show that the voiceprint information in humming and singing speech is more distinguishable than that in normal reading speech for conventional ASV systems. Furthermore, we find that combining the three speaking styles can significantly improve the x-vector based ASV system, whereas only limited gains are obtained by the conventional GMM-based systems.

Highlights

  • Automatic speaker verification (ASV) uses a speaker’s speech signal to extract the identity of the speaker [1, 2]

  • Results with a single speaking style: in Table 3, we examine the differences between Mandarin singing, humming and normal reading speech for text-dependent ASV tasks using two different systems, the conventional GMM-based system and the state-of-the-art x-vector based system

  • Because the RSH data size is only around 2 hours, it is very difficult to train a very deep neural network. Instead of using the 7-layer time-delay deep neural network (TDNN) architecture of the above biased extractor, here we reduce the TDNN to 4 layers by removing the 3rd, 4th and 7th layers, and we reduce the hidden-layer dimension from 512 to 64 (a minimal sketch of such a reduced extractor is given after this list)
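
To make the architectural change above concrete, the following is a minimal PyTorch sketch of a reduced x-vector extractor with 4 layers and 64-dimensional hidden units. The layer contexts, the 30-dimensional input features, the statistics pooling and the SmallTDNN class name are illustrative assumptions loosely based on the common Kaldi x-vector recipe, not the authors' released configuration.

```python
import torch
import torch.nn as nn


class SmallTDNN(nn.Module):
    """Reduced x-vector extractor: 4 layers, 64-dim hidden units.

    A sketch only; layer contexts, feature dimension and pooling are
    assumptions, not the configuration reported in the paper.
    """

    def __init__(self, feat_dim=30, hidden=64, embed_dim=64, n_speakers=200):
        super().__init__()
        # Frame-level layers: 1-D convolutions with dilation act as TDNN layers.
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=1), nn.ReLU(),
        )
        # Segment-level layer after statistics pooling (mean + std).
        self.segment = nn.Linear(2 * hidden, embed_dim)
        self.classifier = nn.Linear(embed_dim, n_speakers)

    def forward(self, x):
        # x: (batch, feat_dim, frames)
        h = self.frame_layers(x)
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        embedding = self.segment(stats)   # the "x-vector" used for scoring
        return self.classifier(embedding), embedding
```

With so little training data, shrinking both depth and width in this way mainly reduces the number of trainable parameters rather than changing the overall x-vector recipe.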


Summary

Introduction

Automatic speaker verification (ASV) uses a speaker’s speech signal to extract the identity of the speaker [1, 2]. In our initial experiments, we tried using available data (clean reading speech such as AISHELL-1 [30]) to train a universal background model (UBM) and applied maximum a posteriori (MAP) adaptation to obtain the adapted target speaker models; consistent with the observation for the x-vector systems in the second paragraph of the section “Results with a single speaking style”, all the adapted GMMs were heavily biased toward the reading speaking style.
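
As an illustration of the GMM-UBM pipeline mentioned above, the following Python sketch trains a UBM with scikit-learn and performs classical mean-only MAP adaptation in the Reynolds et al. style. The function names, the 512-component UBM and the relevance factor of 16 are common defaults assumed here for illustration, not settings reported in the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# ubm_features / speaker_features are hypothetical arrays of acoustic
# frames with shape (n_frames, n_dims), e.g. MFCC features.

def train_ubm(ubm_features, n_components=512):
    """Fit a diagonal-covariance UBM on background (e.g. reading) speech."""
    ubm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", max_iter=100)
    ubm.fit(ubm_features)
    return ubm


def map_adapt_means(ubm, speaker_features, relevance_factor=16.0):
    """Classical mean-only MAP adaptation of the UBM to one target speaker."""
    post = ubm.predict_proba(speaker_features)      # (T, C) responsibilities
    n_c = post.sum(axis=0)                          # soft counts per component
    f_c = post.T @ speaker_features                 # first-order statistics (C, D)
    e_c = f_c / np.maximum(n_c[:, None], 1e-8)      # posterior means per component
    alpha = n_c / (n_c + relevance_factor)          # data-dependent adaptation weights
    return alpha[:, None] * e_c + (1.0 - alpha[:, None]) * ubm.means_
```

Because the adaptation coefficients depend on how much target data falls into each component, components rarely seen in the enrollment data stay close to the UBM trained on reading speech, which is one way the resulting speaker models can remain biased toward that style.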
