Abstract
Recently, the accuracy of voice authentication systems has increased significantly due to the successful application of the identity vector (i-vector) model. This paper proposes a new method for i-vector extraction. In the method, a perceptual wavelet packet transform (PWPT) is designed to convert speech utterances into wavelet entropy feature vectors, and a Convolutional Neural Network (CNN) is designed to estimate the frame posteriors of the wavelet entropy feature vectors. Finally, the i-vector is extracted based on those frame posteriors. The TIMIT and VoxCeleb speech corpora are used for experiments, and the results show that the proposed method extracts appropriate i-vectors, reducing the equal error rate (EER) and improving the accuracy of voice authentication in both clean and noisy environments.
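For illustration, the following is a minimal sketch of the frame-posterior stage, assuming a small convolutional classifier that maps a window of wavelet entropy feature frames to posteriors over a set of mixture components; the architecture, layer sizes, and the name PosteriorCNN are assumptions made for this sketch, not the design reported in the paper.

import torch
import torch.nn as nn

class PosteriorCNN(nn.Module):
    # Maps a window of wavelet entropy feature frames to posteriors over
    # n_components mixture components (illustrative architecture only).
    def __init__(self, feat_dim: int = 16, n_components: int = 512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, 64, kernel_size=3, padding=1),   # convolve over time
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                              # collapse the context window
        )
        self.fc = nn.Linear(128, n_components)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, feat_dim, context) -> (batch, n_components), rows sum to 1
        h = self.conv(x).squeeze(-1)
        return torch.softmax(self.fc(h), dim=-1)

model = PosteriorCNN()
windows = torch.randn(4, 16, 11)     # 4 context windows of 16-dim entropy features
gamma = model(windows)               # per-frame component posteriors
print(gamma.shape, gamma.sum(dim=-1))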
Highlights
Speaker modeling technology has been widely used in modern voice authentication to improve accuracy
This paper proposes a new method for i-vector extraction
A human auditory model is simulated to perceptually decompose the speech signal into 16 sub-signals, and wavelet entropy feature vectors are calculated on those sub-signals (see the sketch below)
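As a minimal sketch of that last point, the snippet below uses PyWavelets to perform a 4-level wavelet packet decomposition (2^4 = 16 terminal sub-bands) and computes a Shannon entropy value per sub-band; the specific perceptual band layout, wavelet, and frame length used in the paper's PWPT are not stated here and are assumptions.

import numpy as np
import pywt

def wavelet_entropy_features(frame, wavelet="db4"):
    # 4-level wavelet packet transform -> 16 terminal sub-signals (frequency order)
    wp = pywt.WaveletPacket(data=frame, wavelet=wavelet, maxlevel=4)
    features = []
    for node in wp.get_level(4, order="freq"):
        energy = node.data ** 2
        p = energy / (energy.sum() + 1e-12)                # normalized energy distribution
        features.append(-np.sum(p * np.log2(p + 1e-12)))   # Shannon entropy of the sub-band
    return np.array(features)                              # 16-dimensional feature vector

frame = np.random.randn(400)                  # e.g. one 25 ms frame at 16 kHz
print(wavelet_entropy_features(frame).shape)  # (16,)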
Summary
Speaker modeling technology has been widely used in modern voice authentication to improve accuracy. The Mel-frequency cepstral coefficient (MFCC) is a commonly used spectral feature for speech representation. The background utterances contain thousands of speech samples spoken by many persons, while the target utterance comes from a given speaker; the purpose of i-vector extraction is to convert the target utterance into an i-vector. All speech utterances are converted into spectral feature vectors. The UBM is trained on the feature vectors from the background utterances, and L frame posteriors of a feature vector from the target utterance are estimated based on the trained UBM.
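In this standard pipeline, the frame posteriors are turned into zeroth- and first-order Baum-Welch statistics from which the i-vector is solved; in the proposed method, the CNN-estimated posteriors would play the role of gamma below. The NumPy sketch uses random placeholders for the UBM parameters and the total-variability matrix T, which in practice are learned from the background utterances.

import numpy as np

rng = np.random.default_rng(0)
C, D, R, L = 8, 16, 10, 200                        # components, feature dim, i-vector dim, frames

X = rng.standard_normal((L, D))                    # feature vectors of the target utterance
gamma = rng.dirichlet(np.ones(C), size=L)          # frame posteriors gamma[t, c]
m = rng.standard_normal((C, D))                    # UBM component means (placeholder)
sigma = np.abs(rng.standard_normal((C, D))) + 0.5  # UBM diagonal covariances (placeholder)
T = 0.1 * rng.standard_normal((C * D, R))          # total-variability matrix (placeholder)

# Zeroth- and first-order Baum-Welch statistics
N = gamma.sum(axis=0)                                                              # (C,)
F = np.concatenate([(gamma[:, [c]] * (X - m[c])).sum(axis=0) for c in range(C)])   # (C*D,)

# i-vector: w = (I + T' Sigma^-1 N T)^-1 T' Sigma^-1 F
inv_sigma = 1.0 / sigma.reshape(-1)
N_expanded = np.repeat(N, D)                       # N_c repeated for each feature dimension
precision = np.eye(R) + T.T @ (T * (N_expanded * inv_sigma)[:, None])
w = np.linalg.solve(precision, T.T @ (inv_sigma * F))
print(w.shape)                                     # (R,) i-vector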