Audio-Visual Multi-person Keyword Spotting via Hybrid Fusion

Yuxin Su, Hong Liu, Ziling Miao

doi:10.1007/978-3-031-20500-2_27

Abstract

Audio-visual fusion is an important approach to speech recognition and has improved the robustness of keyword spotting (KWS) models, especially in noisy environments. However, most related studies consider only single-person scenarios and ignore multi-person scenarios. This work proposes an audio-visual model with hybrid fusion for multi-person KWS. First, an attention-based speaker detection model in the visual frontend selects the visual signals that correspond to the speaker. Then, two pre-trained feature extraction networks extract semantic features from the audio and visual signals. Finally, the features are fed into the proposed hybrid fusion module, which exploits the complementarity and independence of the two modalities at both the feature level and the decision level. In addition, the first Chinese keyword spotting dataset, named PKU-KWS, is recorded. Experiments on this dataset demonstrate the reliability of the proposed method for practical applications, and the model also shows stable performance under different noise intensities.
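
The idea of combining feature-level and decision-level fusion can be illustrated with a minimal sketch. This is not the authors' architecture: the linear classifiers, the random weights, the feature sizes, the equal averaging of the two unimodal posteriors, and the mixing weight `alpha` are all assumptions made here purely for illustration, and the feature vectors stand in for the outputs of the pre-trained extractors.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Apply a bias-free linear layer; `weights` holds one row per class."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def random_weights(n_out, n_in, rng):
    """Hypothetical stand-in for trained classifier weights."""
    return [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]


def hybrid_fusion(audio_feat, visual_feat, w_audio, w_visual, w_joint, alpha=0.5):
    """Blend a feature-level and a decision-level prediction (illustrative only)."""
    # Feature-level fusion: concatenate both feature vectors, classify jointly.
    joint = audio_feat + visual_feat
    p_feature = softmax(linear(w_joint, joint))

    # Decision-level fusion: classify each modality separately, then average.
    p_audio = softmax(linear(w_audio, audio_feat))
    p_visual = softmax(linear(w_visual, visual_feat))
    p_decision = [(a + v) / 2 for a, v in zip(p_audio, p_visual)]

    # Hybrid: convex combination of the two fused posteriors (alpha is assumed).
    return [alpha * f + (1 - alpha) * d for f, d in zip(p_feature, p_decision)]


rng = random.Random(0)
n_classes, d_audio, d_visual = 3, 4, 5  # assumed sizes, not from the paper

audio_feat = [rng.uniform(-1, 1) for _ in range(d_audio)]
visual_feat = [rng.uniform(-1, 1) for _ in range(d_visual)]

probs = hybrid_fusion(
    audio_feat,
    visual_feat,
    random_weights(n_classes, d_audio, rng),
    random_weights(n_classes, d_visual, rng),
    random_weights(n_classes, d_audio + d_visual, rng),
)
predicted_keyword = probs.index(max(probs))
```

Because each branch outputs a probability distribution and the combination is convex, `probs` is itself a valid distribution over the keyword classes. In a real system the weights would be learned, and the combination rule could be considerably more elaborate than this fixed average.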

