Joint filter combination-based central difference feature extraction and attention-enhanced Dense-Res2Block network for short-utterance speaker recognition

Yunfei Zi, Shengwu Xiong

doi:10.1016/j.eswa.2023.120995

Abstract

Speech authentication in smart services typically involves short utterances. Because these utterances are short (e.g., less than 3 s) and the enrolment and/or test data are limited, it is difficult to learn enough information to distinguish speakers accurately, which makes short-utterance speaker recognition very challenging. In this paper, at the acoustic end, we propose a novel acoustic feature extraction method based on Bark-scaled Gaussian and linear filter bank cepstral coefficients (BGLCC) and multi-dimensional central difference (MDCD). At the network end, to enhance the discriminative embedding, we propose a novel attention-enhanced Dense-Res2Block network. First, rich low-frequency information is extracted by exploiting the high distribution density of the Bark-scaled Gaussian filter bank in the low-frequency domain, and more high-frequency information is extracted by the linear filter bank, which is uniformly distributed in the high-frequency domain. The combined filters can therefore produce richer and more discriminative acoustic features from short-duration audio signals. In addition, the multi-dimensional central difference method captures speaker dynamic features better in the relative BGLCC domain, which improves short-utterance speaker recognition performance. Finally, the attention-enhanced Dense-Res2Block architecture obtains feature representations at a variety of scale combinations to enhance the features. Extensive analysis on a variety of datasets, which contain speech samples of different types and lengths, demonstrates the superiority of the proposed feature extraction method over existing acoustic feature extraction methods, including those based on MFCCs, LPCCs, and fusion features, and of the proposed model over existing speaker recognition models, including X-vector-PLDA, BLSTM-ResNet, ResNet34-SP, and ECAPA-TDNN. The experimental results show that the proposed method achieves the best performance among the compared approaches.
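
The acoustic front end described above can be illustrated with a minimal sketch. It is not the authors' implementation: the Bark formula (Traunmüller's approximation), the 4 kHz split between the two banks, the filter counts, the Gaussian widths, the triangular shape of the linear filters, and the restriction of the central difference to the time axis are all assumptions chosen only for illustration, and every function name is hypothetical.

```python
import numpy as np

def hz_to_bark(f):
    # Traunmueller's approximation of the Bark scale (an assumed choice).
    return 26.81 * f / (1960.0 + f) - 0.53

def bark_to_hz(z):
    # Algebraic inverse of hz_to_bark.
    return 1960.0 * (z + 0.53) / (26.28 - z)

def combined_filter_bank(n_fft=512, sr=16000, n_low=20, n_high=20, split_hz=4000.0):
    """Gaussian filters spaced uniformly in Bark below split_hz, stacked with
    triangular filters spaced uniformly in Hz above it (all values assumed)."""
    freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)

    # Low band: equal Bark spacing puts the centres densely at low frequencies.
    z_edges = np.linspace(hz_to_bark(0.0), hz_to_bark(split_hz), n_low + 2)
    f_edges = bark_to_hz(z_edges)
    centres = f_edges[1:-1, None]
    sigmas = (f_edges[2:, None] - f_edges[:-2, None]) / 2.0
    low = np.exp(-0.5 * ((freqs[None, :] - centres) / sigmas) ** 2)

    # High band: uniformly spaced triangular filters on the linear Hz axis.
    h_edges = np.linspace(split_hz, sr / 2.0, n_high + 2)
    left, mid, right = h_edges[:-2, None], h_edges[1:-1, None], h_edges[2:, None]
    rising = (freqs[None, :] - left) / (mid - left)
    falling = (right - freqs[None, :]) / (right - mid)
    high = np.clip(np.minimum(rising, falling), 0.0, None)

    return np.vstack([low, high])  # shape: (n_low + n_high, n_fft // 2 + 1)

def cepstra(power_spec, fbank, n_ceps=13):
    """Log filter-bank energies followed by an unnormalised DCT-II."""
    log_energy = np.log(power_spec @ fbank.T + 1e-10)  # (frames, n_filters)
    n = fbank.shape[0]
    k = np.arange(n_ceps)[:, None]
    basis = np.cos(np.pi * k * (np.arange(n)[None, :] + 0.5) / n)
    return log_energy @ basis.T  # (frames, n_ceps)

def central_difference(feats):
    """First-order central difference along the time axis, with edge padding."""
    padded = np.pad(feats, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0

# Usage on a random stand-in for a power spectrogram (10 frames, 257 bins).
rng = np.random.default_rng(0)
power_spec = rng.random((10, 257))
fbank = combined_filter_bank()
static = cepstra(power_spec, fbank)
features = np.hstack([static, central_difference(static)])
```

In this sketch the static coefficients and their central differences are simply concatenated per frame; how the paper combines BGLCC with the multi-dimensional differences is not specified in the abstract.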

Full Text