Abstract

Change captioning aims to describe the difference between an image pair in natural language, combining visual comprehension with language generation. Although significant progress has been made, perceiving object changes from different perspectives remains a key challenge, especially under drastic viewpoint change. In this paper, we propose a novel fully attentive network, the Multi-grained Representation Aggregating Transformer (MURAT), to distinguish actual changes from viewpoint changes. Specifically, the Pair Encoder first captures shared semantics between pairwise objects in a multi-level manner; these serve as semantic cues for identifying irrelevant changes. Next, a novel Multi-grained Representation Aggregator (MRA) constructs a reliable difference representation by exploiting both coarse- and fine-grained semantic cues. Finally, the language decoder generates a description of the change based on the output of the MRA. In addition, a Gating Cycle Mechanism is introduced to promote semantic consistency between difference representation learning and language generation through a reverse manipulation, bridging the semantic gap between change features and text features. Extensive experiments demonstrate that the proposed MURAT greatly improves the ability to describe actual changes under the distraction of irrelevant changes, and that it achieves state-of-the-art performance on three benchmarks: CLEVR-Change, CLEVR-DC, and Spot-the-Diff.
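
To make the three-stage pipeline described above concrete, the following is a minimal PyTorch sketch of how a pair encoder, a coarse/fine aggregator, and a caption decoder could be wired together. It is an illustration only: the layer counts, the cross-attention pair encoder, the gated coarse/fine fusion, and the standard Transformer decoder are assumptions for readability, and the Gating Cycle Mechanism is omitted; none of this reflects the paper's actual implementation details.

```python
# Hypothetical sketch of a MURAT-style pipeline; module internals are assumptions.
import torch
import torch.nn as nn

class MURATSketch(nn.Module):
    def __init__(self, feat_dim=512, vocab_size=10000):
        super().__init__()
        # Pair Encoder (assumed): stacked cross-attention layers that match
        # "before" objects against "after" objects at multiple levels.
        self.pair_encoder = nn.ModuleList(
            [nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
             for _ in range(2)]
        )
        # Aggregator (assumed): gated fusion of coarse- and fine-grained cues
        # into a single difference representation.
        self.mra_gate = nn.Linear(2 * feat_dim, feat_dim)
        # Language decoder (assumed): a standard Transformer decoder.
        layer = nn.TransformerDecoderLayer(feat_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.word_emb = nn.Embedding(vocab_size, feat_dim)
        self.out_proj = nn.Linear(feat_dim, vocab_size)

    def forward(self, feat_before, feat_after, caption_tokens):
        # Multi-level matching: each level attends "before" features over
        # "after" features; intermediate outputs serve as semantic cues.
        cues, x = [], feat_before
        for attn in self.pair_encoder:
            matched, _ = attn(x, feat_after, feat_after)
            x = x + matched
            cues.append(x)
        coarse, fine = cues[0], cues[-1]
        # Gated aggregation of coarse- and fine-grained cues.
        gate = torch.sigmoid(self.mra_gate(torch.cat([coarse, fine], dim=-1)))
        diff_repr = gate * fine + (1.0 - gate) * coarse
        # Decode the change caption conditioned on the difference representation.
        tgt = self.word_emb(caption_tokens)
        dec = self.decoder(tgt, diff_repr)
        return self.out_proj(dec)

# Usage with random tensors standing in for object-level image features.
model = MURATSketch()
before = torch.randn(2, 36, 512)            # (batch, regions, dim)
after = torch.randn(2, 36, 512)
tokens = torch.randint(0, 10000, (2, 12))   # (batch, caption length)
logits = model(before, after, tokens)       # (2, 12, 10000)
```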
