Multi-Modal Graph Aggregation Transformer for image captioning

Lizhi Chen, Kesen Li

doi:10.1016/j.neunet.2024.106813

Abstract

Current image captioning methods directly encode the detected target regions and recognize the objects in an image in order to describe it correctly. However, relying on region features alone is unreliable, because they cannot convey contextual information such as the relationships between objects, and they lack object- and predicate-level semantics. An effective model should incorporate multiple modalities and explore their interactions to help understand the image. We therefore introduce the Multi-Modal Graph Aggregation Transformer (MMGAT), which uses information from several image modalities to fill this gap. It first represents an image as a graph consisting of three sub-graphs, which depict the context grid, region, and semantic text modalities, respectively. We then introduce three aggregators that guide message passing from one graph to another to exploit context across modalities and thereby refine the node features. The updated nodes provide better features for image captioning. We report strong CIDEr scores of 144.6% on MS-COCO and 80.3% on Flickr30k relative to state-of-the-art methods, and conduct a rigorous analysis to demonstrate the importance of each part of our design.
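To make the idea of message passing between modality sub-graphs more concrete, the following is a minimal sketch in Python with NumPy. It is not the authors' implementation: the abstract does not specify how the aggregators are built, so the attention-based `aggregate` function, the residual update, the feature size `d`, the node counts, the random projection matrices, and the choice of which graph sends messages to which are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def aggregate(target, source, w_q, w_k, w_v):
    """Hypothetical aggregator: each target node attends over all source nodes.

    Returns the refined target features (residual update) and the
    attention weights of shape (n_target, n_source).
    """
    q = target @ w_q  # queries come from the graph being refined
    k = source @ w_k  # keys come from the graph sending messages
    v = source @ w_v  # values carry the messages themselves
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return target + attn @ v, attn


rng = np.random.default_rng(0)
d = 8  # assumed feature size, chosen only for the demo

# Stand-in node features for the three sub-graphs named in the abstract.
grid = rng.normal(size=(6, d))    # context grid nodes
region = rng.normal(size=(4, d))  # detected region nodes
text = rng.normal(size=(3, d))    # semantic text nodes


def random_proj():
    """Three random projection matrices (untrained, illustration only)."""
    return [rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3)]


# Three aggregators. The routing below (grid -> region, text -> region,
# region -> grid) is an assumption, not taken from the paper.
region_ref, a1 = aggregate(region, grid, *random_proj())
region_ref, a2 = aggregate(region_ref, text, *random_proj())
grid_ref, a3 = aggregate(grid, region_ref, *random_proj())
```

In this sketch every refined node keeps its original feature and adds a weighted sum of the messages from the other graph, so the node count of each sub-graph is unchanged while its features absorb context from another modality.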

Published in: Neural Networks
