Mel-frequency cepstral coefficients (MFCCs) are the most widely used spectral features in speech-based applications. They were originally introduced for speech recognition and were later adopted for other tasks such as speaker recognition and emotion recognition. Several recent findings suggest that the Mel-scale filterbank, which is inspired primarily by human auditory perception, may not be optimal for speaker recognition. Working in the same direction, this study attempts to optimize the filterbank design for text-dependent speaker verification. Motivated by the success of evolutionary computation in related fields, an evolutionary algorithm is used to carry out this optimization. This enables data-driven learning of the design parameters and is hypothesized to yield filterbanks suited to the specific task of speaker-phrase discrimination. The filterbanks have been optimized for text-dependent speaker verification in general, and also for specific speakers and phrases. The proposed filterbank yields a relative equal error rate reduction of up to 39.41% with respect to the baseline MFCCs.