Abstract

The fusion of audio and visual information is one of the most promising approaches to reliable keyword spotting (KWS), particularly when the audio is corrupted by noise. KWS aims to detect a specific word in an audio stream, which remains a challenging problem in noisy environments. In this paper, an audio-visual neural network based on multidimensional convolutional neural networks (MCNNs) is proposed for audio-visual KWS. First, the log mel-spectrogram and the lip area sequence are extracted from the audio and visual streams, respectively, and taken as the inputs of the audio-visual network. Then, the MCNN, consisting of a 2D CNN and a 3D CNN, models the time-frequency features of the log mel-spectrogram and the spatiotemporal features of the lip area sequence, respectively. Finally, the outputs of the audio and visual networks are combined for KWS through decision fusion. Experimental results on the PKU-AV database under complex acoustic conditions demonstrate that the proposed method outperforms other state-of-the-art methods.
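
To make the two-branch design concrete, the following is a minimal PyTorch sketch of the architecture the abstract describes: a 2D CNN over the log mel-spectrogram, a 3D CNN over the lip area sequence, and a decision-level fusion of the two branch posteriors. The layer sizes, the class names (AudioBranch, VisualBranch, decision_fusion), the keyword count, and the fusion weight alpha are all illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class AudioBranch(nn.Module):
    """2D CNN over the log mel-spectrogram (time x mel bins)."""
    def __init__(self, num_keywords: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling over time and frequency
        )
        self.fc = nn.Linear(64, num_keywords)

    def forward(self, x):  # x: (batch, 1, time, mel_bins)
        return self.fc(self.conv(x).flatten(1))

class VisualBranch(nn.Module):
    """3D CNN over the lip area sequence (frames x height x width)."""
    def __init__(self, num_keywords: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),  # global pooling over time and space
        )
        self.fc = nn.Linear(64, num_keywords)

    def forward(self, x):  # x: (batch, 1, frames, height, width)
        return self.fc(self.conv(x).flatten(1))

def decision_fusion(audio_logits, visual_logits, alpha=0.5):
    """Weighted average of per-branch posteriors; alpha is a hypothetical weight."""
    pa = torch.softmax(audio_logits, dim=-1)
    pv = torch.softmax(visual_logits, dim=-1)
    return alpha * pa + (1 - alpha) * pv

# Example: score a batch of 2 utterances against 10 keywords.
audio = AudioBranch(num_keywords=10)
visual = VisualBranch(num_keywords=10)
spec = torch.randn(2, 1, 100, 40)     # log mel-spectrograms: 100 frames, 40 bins
lips = torch.randn(2, 1, 30, 64, 64)  # lip sequences: 30 frames of 64x64 crops
scores = decision_fusion(audio(spec), visual(lips))  # (2, 10) fused posteriors

Fusing at the decision level, rather than concatenating intermediate features, lets each branch be trained and weighted independently, which is convenient when the audio and visual streams degrade differently (e.g., acoustic noise versus poor lighting).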
