Abstract

Video captioning aims to automatically generate a natural language caption that describes the content of a video. However, most existing video captioning methods ignore the relationships between objects in the video and the correlations between multimodal features, and they also overlook the effect of caption length on the task. This study proposes a novel video captioning framework (ORMF) based on an object relation graph and multimodal feature fusion. ORMF uses the similarity and spatio-temporal relationships of objects in the video to construct an object relation graph and introduces a graph convolutional network (GCN) to encode the object relations. ORMF also constructs a multimodal feature fusion network that learns the relationships between different modalities and fuses their features. Furthermore, the proposed model computes a length loss on the caption, encouraging the generated caption to convey richer information. Experimental results on two public datasets (Microsoft video captioning corpus [MSVD] and Microsoft research video to text [MSR-VTT]) demonstrate the effectiveness of our method.
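The abstract names three components: a GCN over an object relation graph, a multimodal fusion network, and a caption-length loss. The following is a minimal PyTorch-style sketch of those three ideas, not the authors' implementation; all module names, dimensions, the similarity-based adjacency, the gated fusion, and the MSE form of the length loss are assumptions for illustration only, since the abstract does not give the exact formulations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectRelationGCN(nn.Module):
    """One GCN layer over per-video object features.

    The adjacency here is built from pairwise feature similarity as a
    stand-in for the paper's similarity + spatio-temporal relations.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, obj_feats: torch.Tensor) -> torch.Tensor:
        # obj_feats: (batch, num_objects, dim)
        # Soft adjacency from dot-product similarity between objects.
        adj = torch.softmax(obj_feats @ obj_feats.transpose(1, 2), dim=-1)
        # Message passing (adj @ features) followed by a linear transform.
        return F.relu(self.proj(adj @ obj_feats))

class GatedFusion(nn.Module):
    """Fuse appearance, motion, and object features with a learned gate
    (one plausible form of a multimodal feature fusion network)."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(3 * dim, 3)

    def forward(self, app, mot, obj):
        # Each input: (batch, dim); gate weights sum to 1 over modalities.
        w = torch.softmax(self.gate(torch.cat([app, mot, obj], dim=-1)), dim=-1)
        return w[:, 0:1] * app + w[:, 1:2] * mot + w[:, 2:3] * obj

def length_loss(pred_len: torch.Tensor, target_len: torch.Tensor) -> torch.Tensor:
    # Penalize deviation of the predicted caption length from the reference
    # length; the abstract only says a length loss is used, so MSE is assumed.
    return F.mse_loss(pred_len, target_len.float())

# Usage on dummy data.
B, N, D = 2, 5, 512
gcn, fuse = ObjectRelationGCN(D), GatedFusion(D)
obj = gcn(torch.randn(B, N, D)).mean(dim=1)      # pooled object-relation feature
fused = fuse(torch.randn(B, D), torch.randn(B, D), obj)
loss = length_loss(torch.tensor([12.0, 9.0]), torch.tensor([11, 10]))
```

The fused vector would feed a caption decoder, and the length term would be added to the usual cross-entropy objective; both of those pieces are outside the scope of this sketch.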
