Abstract

Visual Speech Recognition (VSR) is the task of transcribing speech into text from the external appearance of the face (i.e., the lips). Since visual lip movements alone do not carry enough information to fully represent speech, VSR is regarded as a challenging problem. One way to mitigate this is to additionally exploit audio, which contains rich information for speech recognition. However, audio is not always available, for example in crowded environments. It is therefore necessary to find a way to provide sufficient information for speech recognition from visual inputs alone. In this paper, we alleviate the information insufficiency of visual lip movements by proposing a cross-modal memory-augmented VSR framework with a Visual-Audio Memory (VAM). The proposed framework exploits the complementary information of audio even when no audio input is available at inference time. Concretely, the VAM learns to imprint short clip-level audio features into a memory network using the corresponding visual features. To this end, the VAM contains two memories, a lip-video key memory and an audio value memory. We guide the audio value memory to imprint the audio features and the lip-video key memory to memorize the locations of the imprinted audio. In this way, the VAM can exploit rich audio information by accessing the memory with visual inputs only. Experimental results show that the proposed method achieves state-of-the-art performance on both word- and sentence-level VSR. In addition, we verify that the representations learned inside the VAM contain meaningful information for VSR.

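To make the key-value memory idea concrete, below is a minimal PyTorch-style sketch of a cross-modal memory with a visually addressed key memory and an audio value memory. It assumes a slot-based memory addressed by cosine similarity and trained with a simple reconstruction ("imprinting") loss; the slot count, addressing scheme, and loss are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualAudioMemory(nn.Module):
    """Illustrative sketch of a key-value cross-modal memory.

    The key memory is addressed with visual (lip) features; the value
    memory stores the associated audio representations. At inference,
    only visual features are needed to recall audio-like values.
    """

    def __init__(self, num_slots: int, dim: int):
        super().__init__()
        # Hypothetical slot-based memories: keys for lip-video features,
        # values for the audio features to be imprinted.
        self.key_memory = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.value_memory = nn.Parameter(torch.randn(num_slots, dim) * 0.02)

    def forward(self, visual_feat: torch.Tensor) -> torch.Tensor:
        # Cosine-similarity addressing: how strongly each slot matches the
        # visual query, normalized into addressing weights.
        addr = F.softmax(
            F.normalize(visual_feat, dim=-1)
            @ F.normalize(self.key_memory, dim=-1).T,
            dim=-1,
        )  # (batch, num_slots)
        # Read out a weighted sum of stored audio values.
        return addr @ self.value_memory  # (batch, dim)

    def imprint_loss(
        self, visual_feat: torch.Tensor, audio_feat: torch.Tensor
    ) -> torch.Tensor:
        # Training-time objective (assumed here): the value recalled with
        # visual features should reconstruct the paired audio feature.
        recalled_audio = self.forward(visual_feat)
        return F.mse_loss(recalled_audio, audio_feat.detach())
```

In this sketch, the reconstruction term would be added to the usual VSR objective during training, when paired audio is available; at inference, only `forward` is called with visual features, so no audio stream is required.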