Abstract
As one of the most effective methods for improving the accuracy and robustness of speech tasks, audio–visual fusion has recently been introduced into the field of keyword spotting (KWS). However, existing audio–visual keyword spotting models are limited to detecting isolated words, and keyword spotting in unconstrained speech remains a challenging problem. To this end, an Audio–Visual Keyword Transformer (AVKT) network is proposed to spot keywords in unconstrained video clips. The authors present a transformer classifier with learnable CLS tokens to extract distinctive keyword features from variable-length audio and visual inputs. The outputs of the audio and visual branches are combined in a decision fusion module. Just as humans can easily notice whether a keyword appears in a sentence, the AVKT network can detect whether a video clip containing a spoken sentence includes a pre-specified keyword. Moreover, the position of the keyword is localised in the attention map without additional position labels. Experimental results on the LRS2-KWS dataset and the newly collected PKU-KWS dataset show that the accuracy of AVKT exceeds 99% in clean scenes and 85% in extremely noisy conditions. The code is available at https://github.com/jialeren/AVKT.
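To make the pipeline described above more concrete, the following is a minimal PyTorch sketch, not the authors' released implementation: each modality branch prepends a learnable CLS token to its variable-length sequence, a transformer encoder produces a keyword-presence score from that token, and the two branch scores are combined by a simple decision-level fusion. All layer sizes, module names, and the averaging fusion rule are illustrative assumptions.

```python
# Illustrative sketch of a CLS-token transformer branch with decision-level fusion.
# Hyperparameters and the averaging fusion are assumptions, not the paper's exact design.
import torch
import torch.nn as nn


class CLSTransformerBranch(nn.Module):
    """One modality branch: prepend a learnable CLS token, encode the
    variable-length sequence, and classify keyword presence from the CLS output."""

    def __init__(self, feat_dim: int, d_model: int = 256, n_heads: int = 4,
                 n_layers: int = 4, n_keywords: int = 1):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)                    # map input features to model width
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))   # learnable CLS token
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_keywords)                  # keyword-presence logits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat_dim); the time dimension may vary between clips
        h = self.proj(x)
        cls = self.cls_token.expand(h.size(0), -1, -1)
        h = torch.cat([cls, h], dim=1)                              # CLS token attends to all frames
        h = self.encoder(h)
        return self.head(h[:, 0])                                   # logits read from the CLS position


class AVKTSketch(nn.Module):
    """Audio and visual branches with decision fusion (here: averaged probabilities)."""

    def __init__(self, audio_dim: int = 80, visual_dim: int = 512, n_keywords: int = 1):
        super().__init__()
        self.audio_branch = CLSTransformerBranch(audio_dim, n_keywords=n_keywords)
        self.visual_branch = CLSTransformerBranch(visual_dim, n_keywords=n_keywords)

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        p_a = torch.sigmoid(self.audio_branch(audio))
        p_v = torch.sigmoid(self.visual_branch(video))
        return 0.5 * (p_a + p_v)                                    # fused keyword-presence score


if __name__ == "__main__":
    model = AVKTSketch()
    audio = torch.randn(2, 120, 80)    # e.g. 120 frames of 80-dim log-mel features (assumed)
    video = torch.randn(2, 75, 512)    # e.g. 75 frames of lip-region embeddings (assumed)
    print(model(audio, video).shape)   # torch.Size([2, 1])
```

In this sketch the CLS token plays the role described in the abstract: it attends over all audio or visual frames, so a single fixed-size vector summarises a variable-length input, and its attention weights indicate where in the sequence the keyword evidence lies.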