Multi-level self-attentive TDNN: A general and efficient approach to summarize speech into discriminative utterance-level representations

João Monteiro, Jahangir Alam, Tiago H Falk

doi:10.1016/j.specom.2022.03.008

Abstract

Time delay neural networks (TDNNs) have become ubiquitous in voice biometrics and language recognition tasks that rely on utterance-level speaker- or language-dependent representations. In this paper, we discuss directions for improving the conventional TDNN architecture to make it more generally applicable. More specifically, we explore the utility of performing pooling operations at different levels of the convolutional stack and further propose an approach to efficiently combine the resulting set of representations. We show that the resulting models are more versatile, in the sense that a fixed architecture can be reused across different tasks, and that the learned representations are more discriminative. Evaluations are performed in two settings: (1) end-to-end, where spoofing attack detection and spoken language identification are explored, and (2) embedding encoding, where speaker-dependent embeddings are tested on a speaker verification task. Experimental results show that the proposed design yields improvements over the original TDNN architecture, as well as over other state-of-the-art spoofing, language, and speaker recognition methods.

Full Text