In conventional speaker recognition methods based on Mel-frequency cepstral coefficients (MFCCs), phase information has hitherto been ignored. In this paper, we propose a phase information extraction method that normalizes the change variation in the phase according to the frame position of the input speech and combines the phase information with MFCCs in text-independent speaker identification and verification methods. There is a problem with the original phase information extraction method when comparing two phase values. For example, the difference in the two values of π-\mathtildeθ <sub xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">1</sub> and \mathtildeθ <sub xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">2</sub> =-π+\mathtildeθ <sub xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">1</sub> is 2π-2\mathtildeθ <sub xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">1</sub> . If \mathtildeθ <sub xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">1</sub> ≈ 0, then the difference ≈ 2π, despite the two phases being very similar to one another. To address this problem, we map the phase into coordinates on a unit circle. Speaker identification and verification experiments are performed using the NTT database which consists of sentences uttered by 35 (22 male and 13 female) Japanese speakers with normal, fast and slow speaking modes during five sessions. Although the phase information-based method performs worse than the MFCC-based method, it augments the MFCC and the combination is useful for speaker recognition. The proposed modified phase information is more robust than the original phase information for all speaking modes. By integrating the modified phase information with the MFCCs, the speaker identification rate was improved to 98.8% from 97.4% (MFCC), and equal error rate for speaker verification was reduced to 0.45% from 0.72% (MFCC), respectively. We also conducted the speaker identification and verification experiments on a large-scale Japanese Newspaper Article Sentences (JNAS) database, a similar trend as NTT database was obtained.
Read full abstract