Abstract

Video captioning aims to automatically generate a natural language caption that describes the content of a video. However, most existing video captioning methods ignore the relationships between objects in the video and the correlations between multimodal features, and they also overlook the effect of caption length on the task. This study proposes a novel video captioning framework (ORMF) based on an object relation graph and multimodal feature fusion. ORMF uses the similarity and spatio-temporal relationships of objects in the video to construct an object relation graph and introduces a graph convolutional network (GCN) to encode the object relations. ORMF also constructs a multimodal feature fusion network that learns the relationships between different modalities and fuses their features. Furthermore, the proposed model computes a length loss on the caption, encouraging the generated caption to convey richer information. Experimental results on two public datasets (Microsoft video captioning corpus [MSVD] and Microsoft research video to text [MSR-VTT]) demonstrate the effectiveness of our method.
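The abstract names three components: a GCN over an object relation graph, a multimodal fusion network, and a caption-length loss. The following is a minimal PyTorch-style sketch of those three ideas, not the authors' implementation; all module names, dimensions, the similarity-based adjacency, the gated fusion, and the MSE form of the length loss are assumptions for illustration only, since the abstract does not give the exact formulations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectRelationGCN(nn.Module):
    """One GCN layer over per-video object features.

    The adjacency here is built from pairwise feature similarity as a
    stand-in for the paper's similarity + spatio-temporal relations.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, obj_feats: torch.Tensor) -> torch.Tensor:
        # obj_feats: (batch, num_objects, dim)
        # Soft adjacency from dot-product similarity between objects.
        adj = torch.softmax(obj_feats @ obj_feats.transpose(1, 2), dim=-1)
        # Message passing (adj @ features) followed by a linear transform.
        return F.relu(self.proj(adj @ obj_feats))

class GatedFusion(nn.Module):
    """Fuse appearance, motion, and object features with a learned gate
    (one plausible form of a multimodal feature fusion network)."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(3 * dim, 3)

    def forward(self, app, mot, obj):
        # Each input: (batch, dim); gate weights sum to 1 over modalities.
        w = torch.softmax(self.gate(torch.cat([app, mot, obj], dim=-1)), dim=-1)
        return w[:, 0:1] * app + w[:, 1:2] * mot + w[:, 2:3] * obj

def length_loss(pred_len: torch.Tensor, target_len: torch.Tensor) -> torch.Tensor:
    # Penalize deviation of the predicted caption length from the reference
    # length; the abstract only says a length loss is used, so MSE is assumed.
    return F.mse_loss(pred_len, target_len.float())

# Usage on dummy data.
B, N, D = 2, 5, 512
gcn, fuse = ObjectRelationGCN(D), GatedFusion(D)
obj = gcn(torch.randn(B, N, D)).mean(dim=1)      # pooled object-relation feature
fused = fuse(torch.randn(B, D), torch.randn(B, D), obj)
loss = length_loss(torch.tensor([12.0, 9.0]), torch.tensor([11, 10]))
```

The fused vector would feed a caption decoder, and the length term would be added to the usual cross-entropy objective; both of those pieces are outside the scope of this sketch.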
