Abstract

Human speech contains trivial events that are not subjectively controlled and are constrained by the speaker's physical characteristics. These events take the form of extremely short utterances, such as the ‘hmm’ sound used to express doubt or confirmation. In speaker verification (SV), such events are a robust basis for identifying the real speaker behind disguised speech, since they are less affected by the randomness of pronunciation. However, trivial events like ‘hmm’ are extremely short and carry little linguistic information, so the performance of SV systems degrades drastically on them. In this letter, a Sinc‐Attention feature extraction method is proposed that extracts more discriminative features from speech signals to achieve a robust SV system for trivial events. Learnable filters are used to obtain low‐level representations of the speech for SV. In addition, a novel adaptive feature weighting inspired by attention mechanisms is proposed, which effectively improves the representation capability of the speaker features. Experiments on different models show that the method helps SV systems achieve a lower equal error rate (EER) than systems based on hand‐crafted features.
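For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate (impostor trials accepted) equals the false rejection rate (genuine trials rejected). The following is a minimal sketch of how it can be computed from verification scores; it is not the evaluation code used in this letter. The function name `equal_error_rate`, the convention that higher scores mean "same speaker", and the toy scores are all assumptions made for illustration.

```python
import numpy as np


def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Assumption: a trial is accepted when its score is >= the threshold.
    """
    target = np.asarray(target_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)

    # Candidate thresholds: every score that actually occurs.
    thresholds = np.sort(np.concatenate([target, impostor]))

    # False rejection rate: genuine trials scoring below the threshold.
    frr = np.array([(target < t).mean() for t in thresholds])
    # False acceptance rate: impostor trials scoring at or above it.
    far = np.array([(impostor >= t).mean() for t in thresholds])

    # Take the threshold where the two error rates are closest.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0


# Hypothetical toy scores, not data from the letter.
genuine = [0.9, 0.6, 0.4, 0.8]
impostor = [0.1, 0.5, 0.3, 0.7]
eer = equal_error_rate(genuine, impostor)
print(f"EER = {eer:.2%}")
```

With a finite set of trials the two error rates may never be exactly equal, so this sketch averages them at the closest threshold; evaluation toolkits often interpolate along the error curve instead.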