Speaker recognition regardless of context and language on a fixed set of competitors

V N Sorokin, V G Trunov, A S Leonov

doi:10.1134/s105466181602022x

Abstract

The problem of recognizing a speaker from a fixed set of speakers, independent of language and context, is considered. A database of Russian numerals containing speech segments from 216 men and 177 women, each of whom spoke from 400 to 800 words, is used for recognition. Speech was recorded with different types of microphones in different rooms at natural noise levels. Recognition is based on solutions of the inverse problem of finding the voice excitation pulse shape for each pitch period from a known speech segment. The pulse shape is defined as the inverse Fourier transform of the regularized ratio of the speech signal spectra on the intervals of the open and closed glottis. Recognition uses ten parameters: the pitch period; the duration of the open glottis interval; the times at which the source amplitude is maximal, minimal, and zero; the ratio of the amplitudes of the minimum and maximum of the source pulse; three coefficients of the decomposition of the source function by the principal component method; and the vowel duration. With this procedure, for an utterance of a word containing one vowel, the false rejection rate (FRR) for men is 1.7–5.4% and the false acceptance rate (FAR) is 5.4–7.1%. For women, FRR = 2–5.2% and FAR = 5.2–6.3%. The recognition error decreases as the number of vowels in the speech signal increases. With 10 vowels, FRR = 0.05–0.2% and FAR = 0.07–0.8% for men, and FRR = 0.09–0.2% and FAR = 0.17–2.1% for women.
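The pulse-shape definition above (inverse Fourier transform of a regularized ratio of two spectra) can be illustrated with a minimal sketch. The abstract does not state which regularizer is used, so the Tikhonov-style damping below, the function name `regularized_spectral_ratio`, and the parameter `alpha` are assumptions made for illustration only, not the authors' method. Segmentation of each pitch period into open and closed glottis intervals is also assumed to be done beforehand.

```python
import numpy as np

def regularized_spectral_ratio(open_seg, closed_seg, alpha=1e-2, n_fft=None):
    """Inverse FFT of a regularized ratio of two spectra (illustrative sketch).

    open_seg   -- samples from the open glottis interval (assumed pre-segmented)
    closed_seg -- samples from the closed glottis interval (assumed pre-segmented)
    alpha      -- hypothetical regularization weight; alpha=0 is plain division
    """
    open_seg = np.asarray(open_seg, dtype=float)
    closed_seg = np.asarray(closed_seg, dtype=float)
    if n_fft is None:
        n_fft = max(len(open_seg), len(closed_seg))

    # Spectra of the two intervals, zero-padded to a common length.
    s_open = np.fft.rfft(open_seg, n=n_fft)
    s_closed = np.fft.rfft(closed_seg, n=n_fft)

    # Tikhonov-style regularized division (an assumed choice of regularizer):
    # S_open * conj(S_closed) / (|S_closed|^2 + alpha * max|S_closed|^2).
    power = np.abs(s_closed) ** 2
    ratio = s_open * np.conj(s_closed) / (power + alpha * power.max())

    # Back to the time domain: the estimated pulse shape.
    return np.fft.irfft(ratio, n=n_fft)

# Synthetic check: a known pulse filtered by a short impulse response.
pulse = np.array([0.0, 0.3, 0.8, 1.0, 0.4, -0.6, -0.2, 0.0])
h = np.array([1.0, 0.5, 0.25])          # stand-in for the closed-interval signal
n = 64
closed = np.zeros(n)
closed[:len(h)] = h
conv = np.convolve(pulse, h)            # stand-in for the open-interval signal
opened = np.zeros(n)
opened[:len(conv)] = conv
estimate = regularized_spectral_ratio(opened, closed, alpha=0.0)
```

With `alpha=0.0` and noise-free synthetic data the division is exact and the known pulse is recovered; on recorded speech a positive `alpha` keeps the ratio bounded at frequencies where the closed-interval spectrum is close to zero, at the cost of some smoothing of the estimate.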
