Abstract

Speaker identification plays a crucial role in biometric person identification, as systems based on human speech are increasingly used to recognize people. Mel frequency cepstral coefficients (MFCCs) have been widely adopted for decades in speech processing to capture speech-specific characteristics with reduced dimensionality. However, although their ability to decorrelate the vocal source and the vocal tract filter makes them suitable for speech recognition, they greatly reduce speaker variability, the very characteristic that distinguishes different speakers. This paper presents a theoretical framework and an experimental evaluation showing that reducing the dimension of the features by applying the discrete Karhunen-Loève transform (DKLT) to the log-spectrum of the speech signal guarantees better performance than conventional MFCC features. In particular, with short sequences of speech frames, typically lasting less than 2 s, the performance of the truncated DKLT representation in identifying five speakers was always better than that of the MFCCs in the experiments we performed. Additionally, the framework was tested on up to 100 TIMIT speakers with sequences of less than 3.5 s, showing very good recognition capabilities.
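As an illustration of the kind of feature reduction described above, the following is a minimal sketch of a truncated DKLT applied to the log-spectrum of framed signals. It is not the authors' implementation: the function names (`log_spectrum`, `fit_dklt`, `apply_dklt`), the frame length of 256 samples, the Hamming window, and the choice of 12 retained components are all assumptions made for illustration, and synthetic noise stands in for real speech frames.

```python
import numpy as np

def log_spectrum(frames, n_fft=256):
    """Log-magnitude spectrum of each windowed frame (one frame per row)."""
    window = np.hamming(frames.shape[1])
    spectrum = np.abs(np.fft.rfft(frames * window, n=n_fft, axis=1))
    return np.log(spectrum + 1e-10)  # small floor avoids log(0)

def fit_dklt(features, n_components):
    """Estimate a DKLT basis from training features.

    The basis consists of the eigenvectors of the sample covariance
    matrix; truncation keeps those with the largest eigenvalues.
    """
    mean = features.mean(axis=0)
    cov = np.cov(features - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean, eigvecs[:, order]

def apply_dklt(features, mean, basis):
    """Project centered features onto the truncated DKLT basis."""
    return (features - mean) @ basis

# Synthetic stand-in for speech frames (assumption: 200 frames of 256 samples).
rng = np.random.default_rng(0)
frames = rng.standard_normal((200, 256))

logspec = log_spectrum(frames)                    # 200 x 129 log-spectrum
mean, basis = fit_dklt(logspec, n_components=12)  # hypothetical truncation order
reduced = apply_dklt(logspec, mean, basis)        # 200 x 12 feature vectors
```

In this sketch the basis is estimated from the data itself, whereas MFCCs rely on a fixed transform; in a speaker identification setting the basis would be fitted on training speech and then applied to test frames.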
