Effective preservation of higher-frequency contents in the context of short utterance based children’s speaker verification system

Shahid Aziz, S Shahnawazuddin

doi:10.1016/j.apacoust.2023.109420

Abstract

Developing an automatic speaker verification (ASV) system for children is extremely challenging owing to the lack of children’s speech corpora. The challenge is exacerbated further in the case of short utterances. Voice-based biometric systems require an adequate amount of speech data for enrollment and verification; otherwise, performance degrades considerably. In this paper, we focus on two issues, data paucity and the preservation of higher-frequency contents, in order to enhance the performance of a short-utterance-based children’s speaker verification system. To deal with data scarcity, several out-of-domain data augmentation techniques are utilized. Since the out-of-domain data comes from adult speakers, whose speech is acoustically very different from children’s speech, we resort to techniques such as prosody modification, formant modification and voice conversion to render it acoustically similar to children’s speech prior to augmentation. This not only increases the amount of training data but also helps in effectively capturing the missing target attributes. Applying the proposed data augmentation technique yields a relative improvement of 33.6% in equal error rate (EER) with respect to the baseline system trained solely on the children’s dataset. Further, to preserve the higher-frequency contents, we concatenate the classical Mel-frequency cepstral coefficient (MFCC) features with either the linear-frequency cepstral coefficient (LFCC) features or the inverse-Mel-frequency cepstral coefficient (IMFCC) features. The Mel filter bank resolves higher-frequency components poorly, whereas linear and inverse-Mel filter banks resolve them better. Moreover, MFCC and IMFCC features exhibit low canonical correlation. Consequently, the frame-level concatenation of MFCC features with LFCC or IMFCC features leads to better resolution of both lower- and higher-frequency components, and the EER reduces considerably when either LFCC or IMFCC features are concatenated with MFCC features. For the full test set, concatenating IMFCC features with MFCC features gives a relative EER reduction of 10.56% with respect to the EER for MFCC features alone. This novel approach of data augmentation followed by frame-level feature concatenation achieves an overall reduction of 40.6% in EER.
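The frame-level concatenation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the filter-bank configuration, so the 20 filters, 13 cepstral coefficients, 8 kHz upper band edge and the helper names (`band_edges`, `cepstra`, `concatenate`, `filters_above`) are all assumptions. The inverse-Mel bank is assumed to be a frequency-mirrored Mel bank, and the input is a toy power spectrum rather than real speech.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def band_edges(n_filters, f_max, scale):
    """Return n_filters + 2 edge/centre frequencies (Hz) for triangular filters."""
    n_pts = n_filters + 2
    if scale == "linear":
        return [f_max * i / (n_pts - 1) for i in range(n_pts)]
    mel_max = hz_to_mel(f_max)
    pts = [mel_to_hz(mel_max * i / (n_pts - 1)) for i in range(n_pts)]
    if scale == "inverse-mel":
        # Assumption: mirror the Mel spacing so filters crowd the high end.
        pts = [f_max - p for p in reversed(pts)]
    return pts

def cepstra(power_spectrum, f_max, scale, n_filters=20, n_ceps=13):
    """Log triangular filter-bank energies followed by a DCT-II, for one frame."""
    pts = band_edges(n_filters, f_max, scale)
    n_bins = len(power_spectrum)
    freqs = [f_max * k / (n_bins - 1) for k in range(n_bins)]
    log_e = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = pts[j - 1], pts[j], pts[j + 1]
        energy = 0.0
        for f, p in zip(freqs, power_spectrum):
            if lo < f <= mid:
                energy += p * (f - lo) / (mid - lo)
            elif mid < f < hi:
                energy += p * (hi - f) / (hi - mid)
        log_e.append(math.log(energy + 1e-10))  # small floor avoids log(0)
    return [
        sum(log_e[j] * math.cos(math.pi * c * (j + 0.5) / n_filters)
            for j in range(n_filters))
        for c in range(n_ceps)
    ]

def concatenate(frames_a, frames_b):
    """Frame-level concatenation: join the two feature vectors of each frame."""
    return [a + b for a, b in zip(frames_a, frames_b)]

def filters_above(scale, cutoff, n_filters=20, f_max=8000.0):
    """Count filters whose centre frequency lies above cutoff (Hz)."""
    centres = band_edges(n_filters, f_max, scale)[1:-1]
    return sum(1 for c in centres if c > cutoff)

# Toy power spectra (2 frames, 129 bins up to 8 kHz); not real speech.
F_MAX = 8000.0
frames = [[1.0 + ((k * (t + 3)) % 11) for k in range(129)] for t in range(2)]
mfcc = [cepstra(fr, F_MAX, "mel") for fr in frames]
imfcc = [cepstra(fr, F_MAX, "inverse-mel") for fr in frames]
combined = concatenate(mfcc, imfcc)
print(len(combined), len(combined[0]))
print({s: filters_above(s, F_MAX / 2) for s in ("mel", "linear", "inverse-mel")})
```

Under these assumptions, each concatenated frame is twice as long as either input vector. The filter count above the band midpoint is lowest for the Mel bank and highest for the inverse-Mel bank, which is the resolution argument made in the abstract.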

Full Text