Multi-task deep cross-attention networks for far-field speaker verification and keyword spotting

Xingwei Liang, Zehua Zhang, Ruifeng Xu

doi:10.1186/s13636-023-00293-8

Open Access

https://doi.org/10.1186/s13636-023-00293-8

Abstract

Personalized voice triggering is a key technology in voice assistants and is the first step a user takes to activate the assistant. It involves two tasks: keyword spotting (KWS) and speaker verification (SV). Conventional approaches develop the KWS and SV systems separately. This paper proposes a single system, the multi-task deep cross-attention network (MTCANet), that performs KWS and SV simultaneously while effectively using information relevant to both tasks. The proposed framework integrates a KWS sub-network and an SV sub-network to improve performance under challenging conditions such as noisy environments and short-duration speech, and to improve model generalization. At the core of MTCANet are three modules: a novel deep cross-attention (DCA) module that integrates the KWS and SV tasks, a multi-layer stacked shared encoder (SE) that reduces the impact of noise on the recognition rate, and soft attention (SA) modules that let the model focus on pertinent information in the middle layer while preventing vanishing gradients. The proposed model demonstrates outstanding performance on the well-off test set, improving equal error rate (EER) by 0.2%, minimum detection cost function (minDCF) by 0.023, and accuracy (Acc) by 2.28% over the well-known SV model ECAPA-TDNN (emphasized channel attention, propagation, and aggregation in time delay neural network) and the advanced KWS model Convmixer.
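
For readers unfamiliar with the EER metric cited in the abstract, the following is a minimal sketch of how an equal error rate can be approximated from verification scores. It is not the authors' evaluation code: the function name `equal_error_rate` and the toy scores are hypothetical, and the threshold sweep over observed scores is only one common approximation.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping the threshold over all observed scores.

    target_scores: scores for trials where the claimed speaker is genuine.
    nontarget_scores: scores for impostor trials.
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection rate: genuine trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance rate: impostor trials scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (hypothetical): one genuine trial scores below one impostor trial.
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(targets, nontargets))
```

A lower EER means that false acceptances and false rejections balance at a lower error level. Production toolkits typically interpolate between thresholds rather than picking the closest observed one.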

Journal: EURASIP Journal on Audio, Speech, and Music Processing
Publication Date: Jul 1, 2023
Citations: 1
License type: CC BY 4.0

Similar Papers

A Speaker Verification Method Based on TDNN–LSTMP
Hui Liu ... Longlian Zhao
Circuits, Systems, and Signal Processing | VOL. 38
20 Mar 2019

Multi-Task Network for Noise-Robust Keyword Spotting and Speaker Verification Using CTC-Based Soft VAD and Global Query Attention
Myunghun Jung ... Youngmoon Jung
25 Oct 2020

Maximum margin linear kernel optimization for speaker verification
Mohamed Kamal Omar ... Jason Pelecanos
01 Apr 2009

End-to-End Voice Spoofing Detection Employing Time Delay Neural Networks and Higher Order Statistics
Jahangir Alam ... Abderrahim Fathan
01 Jan 2020
