Abstract

Video captioning is a challenging computer vision task that automatically describes video clips in natural language sentences, requiring a clear understanding of the embedded semantics. In this work, a video caption generation framework is proposed that combines a discrete wavelet convolutional neural architecture with multimodal feature attention. Global, contextual, and temporal features of the video frames are taken into account, and separate attention networks are integrated into the visual attention predictor network to capture multiple attentions over these features. The attended features, together with textual attention, are fed to the visual-to-text translator for caption generation. Experiments are conducted on two benchmark video captioning datasets, MSVD and MSR-VTT. The results demonstrate improved performance, with CIDEr scores of 91.7 and 52.2 on the respective datasets.
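To illustrate the multimodal attention scheme the abstract describes, the following is a minimal sketch (not the authors' implementation; all module names, feature dimensions, and the additive-attention form are assumptions) of separate attention networks over global, contextual, and temporal frame features whose attended outputs are fused before a visual-to-text decoder.

```python
# Hypothetical sketch of per-modality visual attention, assuming additive
# attention over frame-level features; not taken from the paper's code.
import torch
import torch.nn as nn


class FeatureAttention(nn.Module):
    """Additive attention over one feature stream of shape (batch, frames, dim)."""
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, feats):
        weights = torch.softmax(self.score(feats), dim=1)   # attention over frames
        return (weights * feats).sum(dim=1)                 # attended vector (batch, dim)


class VisualAttentionPredictor(nn.Module):
    """One attention network per modality; attended vectors are concatenated."""
    def __init__(self, dims, hidden_dim=512):
        super().__init__()
        self.attn = nn.ModuleList(FeatureAttention(d, hidden_dim) for d in dims)

    def forward(self, global_f, contextual_f, temporal_f):
        streams = (global_f, contextual_f, temporal_f)
        return torch.cat([a(f) for a, f in zip(self.attn, streams)], dim=-1)


# Usage with dummy features: a batch of 2 clips, 16 frames each.
predictor = VisualAttentionPredictor(dims=(2048, 1024, 512))
fused = predictor(torch.randn(2, 16, 2048),
                  torch.randn(2, 16, 1024),
                  torch.randn(2, 16, 512))
print(fused.shape)  # torch.Size([2, 3584]) -> input to the visual-to-text decoder
```

In this reading, keeping one attention module per feature stream lets each modality learn its own frame weighting before fusion; the exact fusion and decoder details would follow the paper's full text.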
