Abstract

In video captioning, it is challenging to comprehensively describe the multi-modal content of a video, such as appearance, motion, and objects. Prior works often neglect the interactions among multiple modalities, and thus their video representations may not fully depict scene contents. In this paper, we propose a Collaborative Multi-modal Graph Network (CMGNet) to explore the interactions among multi-modal features for video captioning. Our CMGNet adopts an encoder-decoder structure: a Compression-driven Intra-inter Attentive Graph (CIAG) encoder and an Adaptive Multi-modal Selection (AMS) decoder. Specifically, in the CIAG encoder, we first design a Basis Vector Compression (BVC) module that reduces redundant graph nodes, thereby improving efficiency when handling a large number of nodes. We then propose an Intra-inter Attentive Graph (IAG) that enhances the graph representation by sharing information across intra- and inter-modal nodes. Afterwards, the AMS decoder generates video captions from the encoded video representations. In particular, the AMS decoder learns to produce words by adaptively focusing on information from different modalities, leading to comprehensive and accurate captions. Extensive experiments on the large-scale benchmarks MSR-VTT and TGIF demonstrate that the proposed CMGNet achieves state-of-the-art performance.
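To make the described pipeline concrete, below is a minimal PyTorch-style sketch of the three components named in the abstract (BVC, IAG, AMS). All module internals, tensor shapes, and parameter names such as `num_bases` are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class BasisVectorCompression(nn.Module):
    """Compress N graph nodes into K learned basis nodes (K << N) to cut redundancy."""
    def __init__(self, dim, num_bases):
        super().__init__()
        self.bases = nn.Parameter(torch.randn(num_bases, dim))  # learnable basis vectors

    def forward(self, nodes):                                    # nodes: (B, N, D)
        attn = torch.softmax(self.bases @ nodes.transpose(1, 2), dim=-1)  # (B, K, N)
        return attn @ nodes                                      # compressed nodes: (B, K, D)


class IntraInterAttentiveGraph(nn.Module):
    """Share information within each modality (intra) and across modalities (inter)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.intra = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, modality_nodes):                           # list of (B, K, D) tensors
        intra = [self.intra(x, x, x)[0] + x for x in modality_nodes]
        joint = torch.cat(intra, dim=1)                          # all modalities: (B, M*K, D)
        return [self.inter(x, joint, joint)[0] + x for x in intra]


class AdaptiveMultimodalSelection(nn.Module):
    """At each decoding step, weight the per-modality contexts by a learned gate."""
    def __init__(self, dim, num_modalities):
        super().__init__()
        self.gate = nn.Linear(dim, num_modalities)

    def forward(self, hidden, modality_contexts):                # hidden: (B, D)
        weights = torch.softmax(self.gate(hidden), dim=-1)       # (B, M)
        stacked = torch.stack(modality_contexts, dim=1)          # (B, M, D)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)      # fused context: (B, D)


# Toy usage: three modalities (appearance, motion, object), 2 clips, 50 nodes each.
feats = [torch.randn(2, 50, 256) for _ in range(3)]
bvc = BasisVectorCompression(256, num_bases=8)
iag = IntraInterAttentiveGraph(256)
ams = AdaptiveMultimodalSelection(256, num_modalities=3)
encoded = iag([bvc(f) for f in feats])
fused = ams(torch.randn(2, 256), [e.mean(dim=1) for e in encoded])  # (2, 256)
```

The sketch only shows how compressed per-modality graphs could exchange information and how a decoder state might gate across modalities; the caption-generation loop itself is omitted.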
