Abstract
Dense captioning is a critical yet under-explored task that aims to densely detect localized regions of interest (RoIs) in an image and describe them with natural language. Although recent studies have tried to fuse multi-scale features from different visual instances to generate more accurate descriptions, their methods still fail to exploit the relational semantic information in images, leading to less informative descriptions. Furthermore, indiscriminately fusing all visual instance features introduces redundant information, resulting in poor matching between descriptions and their corresponding regions. In this work, we propose a Region-Focused Network (RFN) to address these issues. Specifically, to fully comprehend the image, we first extract object-level features and encode the interaction and position relations between objects to enhance the object representations. Then, to reduce interference from redundant information, we extract only the information most relevant to the target region. Finally, a region-based Transformer is employed to compose and align the previously mined information and generate the corresponding descriptions. Extensive experiments on the Visual Genome V1.0 and V1.2 datasets show that our RFN model outperforms state-of-the-art methods, verifying its effectiveness. Our code is available at https://github.com/VILAN-Lab/DesCap .
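A minimal sketch of the pipeline outlined above, assuming a PyTorch implementation: pre-extracted object features and boxes are relation-encoded with self-attention, filtered by cross-attention against the target region feature, and fed to a Transformer decoder that generates the description. All module names, dimensions, and the concatenated decoder memory layout are illustrative assumptions, not the released RFN code.

```python
# Hypothetical sketch of a region-focused captioning pipeline (not the authors' code).
import torch
import torch.nn as nn


class RegionFocusedCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, model_dim=512, vocab_size=10000,
                 num_heads=8, num_layers=3):
        super().__init__()
        self.obj_proj = nn.Linear(feat_dim, model_dim)   # object-level features
        self.box_proj = nn.Linear(4, model_dim)          # position relations from box geometry
        # Interaction/position relation encoding via self-attention over objects.
        enc_layer = nn.TransformerEncoderLayer(model_dim, num_heads, batch_first=True)
        self.relation_encoder = nn.TransformerEncoder(enc_layer, num_layers)
        # Cross-attention keeps only the object information relevant to the target region.
        self.region_attn = nn.MultiheadAttention(model_dim, num_heads, batch_first=True)
        # Region-based Transformer decoder that generates the description.
        dec_layer = nn.TransformerDecoderLayer(model_dim, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.word_emb = nn.Embedding(vocab_size, model_dim)
        self.out = nn.Linear(model_dim, vocab_size)

    def forward(self, obj_feats, obj_boxes, region_feat, captions):
        # obj_feats: (B, N, feat_dim), obj_boxes: (B, N, 4),
        # region_feat: (B, feat_dim), captions: (B, T) token ids
        objs = self.obj_proj(obj_feats) + self.box_proj(obj_boxes)
        objs = self.relation_encoder(objs)                 # relation-enhanced objects
        region = self.obj_proj(region_feat).unsqueeze(1)   # (B, 1, model_dim)
        focused, _ = self.region_attn(region, objs, objs)  # region-relevant information
        memory = torch.cat([region, focused, objs], dim=1)
        tgt = self.word_emb(captions)
        T = captions.size(1)
        causal_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.out(hidden)                            # (B, T, vocab_size)


# Shape check with random tensors.
model = RegionFocusedCaptioner()
logits = model(torch.randn(2, 36, 2048), torch.rand(2, 36, 4),
               torch.randn(2, 2048), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```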