Abstract

In this study, an improved cepstrum-convolutional neural network is proposed to address the low recognition accuracy of 1-s short utterances in speaker recognition. The Mel-frequency cepstral coefficient (MFCC) audio features are extracted with the improved cepstrum algorithm, and the two-dimensional acoustic feature matrix is preprocessed into a three-dimensional tensor that serves as the input to a two-dimensional convolutional neural network model. Experiments are carried out on a dataset of Arabic digits pronounced in English, with audio durations of less than one second, in a specific experimental environment, and the model is evaluated by accuracy and F1-score. The simulation results show that the proposed model reaches an accuracy of 100% on the training set and 99.60% on the test set, with an F1-score of 0.9985. The method thus mitigates the accuracy degradation in short-utterance speaker recognition caused by the short duration of the corpus and improves the accuracy of short-speech recognition. The model is simple yet effective, generalizes well, and has high practical value.

Article Highlights

- It is interesting to study how to improve the accuracy of 1-s short-utterance speaker recognition.
- The improved cepstrum algorithm addresses the problem of extracting insufficient discernible acoustic features.
- The proposed model obtains 100% accuracy on a spoken Arabic digit dataset with audio durations of about 0.3 s.
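The preprocessing step described in the abstract, converting the two-dimensional MFCC feature matrix into a three-dimensional tensor for a 2-D convolutional network, can be sketched as follows. This is a minimal illustration using NumPy; the feature-matrix shape (20 coefficients by 32 frames) is an assumed example, not a value taken from the paper:

```python
import numpy as np

# Assumed example shape: 20 MFCC coefficients x 32 time frames for a
# short (~0.3 s) utterance. Illustrative values only.
mfcc = np.random.randn(20, 32)

# A 2-D CNN expects an explicit channel axis, so the 2-D feature
# matrix is expanded into a 3-D tensor of shape (height, width, channels).
tensor = mfcc[..., np.newaxis]

print(tensor.shape)  # (20, 32, 1)
```

The added channel axis treats the acoustic feature matrix like a single-channel image, which is what allows standard 2-D convolution layers to be applied to it.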
