Abstract

Dense captioning is a critical yet under-explored task that aims to densely detect localized regions of interest (RoIs) in an image and describe them in natural language. Although recent studies have tried to fuse multi-scale features from different visual instances to generate more accurate descriptions, these methods still fail to exploit the relational semantics among objects, leading to less informative descriptions. Furthermore, indiscriminately fusing all visual instance features introduces redundant information, resulting in poor matching between descriptions and their corresponding regions. In this work, we propose a Region-Focused Network (RFN) to address these issues. Specifically, to fully comprehend the image, we first extract object-level features and encode the interaction and position relations between objects to enhance the object representations. Then, to reduce interference from information that is redundant with respect to the target region, we extract only the information most relevant to that region. Finally, a region-based Transformer composes and aligns the previously mined information and generates the corresponding descriptions. Extensive experiments on the Visual Genome V1.0 and V1.2 datasets show that our RFN model outperforms state-of-the-art methods, verifying its effectiveness. Our code is available at https://github.com/VILAN-Lab/DesCap.
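The abstract describes a three-stage pipeline: relation-enhanced object encoding, region-focused selection of relevant context, and a region-based Transformer decoder. The sketch below is only an illustrative reading of that pipeline, not the authors' implementation (see the repository above for the actual code); the module names, the use of multi-head attention for relation encoding and region-focused selection, and all dimensions and hyperparameters are assumptions made for clarity.

```python
# Illustrative sketch of the pipeline outlined in the abstract (PyTorch assumed).
# All module names, attention-based design choices, and sizes are hypothetical.
import torch
import torch.nn as nn


class RelationEnhancedEncoder(nn.Module):
    """Enhances object features with interaction/position relations
    (assumed here: self-attention over features plus projected box geometry)."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.box_proj = nn.Linear(4, d_model)   # encode normalized box positions
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, obj_feats, boxes):
        # obj_feats: (B, N, d_model) object-level features; boxes: (B, N, 4)
        x = obj_feats + self.box_proj(boxes)     # inject position relations
        rel, _ = self.attn(x, x, x)              # model pairwise interactions
        return obj_feats + rel                   # relation-enhanced objects


class RegionFocusedSelector(nn.Module):
    """Extracts the information most relevant to the target region
    (assumed here: cross-attention with the region feature as the query)."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, region_feat, enhanced_objs):
        # region_feat: (B, 1, d_model) RoI feature of the region to describe
        focused, _ = self.attn(region_feat, enhanced_objs, enhanced_objs)
        return focused                           # region-relevant context


class RegionCaptionDecoder(nn.Module):
    """Transformer decoder generating the description conditioned on the
    region-focused context (vocabulary size and depth are placeholders)."""
    def __init__(self, vocab_size=10000, d_model=256, n_heads=8, n_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, context):
        # tokens: (B, T) caption token ids; context: (B, M, d_model)
        tgt = self.embed(tokens)
        T = tokens.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, context, tgt_mask=causal)
        return self.out(hidden)                  # (B, T, vocab_size)


if __name__ == "__main__":
    B, N, d = 2, 36, 256
    encoder, selector = RelationEnhancedEncoder(d), RegionFocusedSelector(d)
    decoder = RegionCaptionDecoder(d_model=d)

    obj_feats = torch.randn(B, N, d)             # e.g. detector RoI features
    boxes = torch.rand(B, N, 4)                  # normalized box coordinates
    region = torch.randn(B, 1, d)                # target region to describe
    tokens = torch.randint(0, 10000, (B, 12))    # ground-truth caption tokens

    enhanced = encoder(obj_feats, boxes)
    context = selector(region, enhanced)
    logits = decoder(tokens, context)
    print(logits.shape)                          # torch.Size([2, 12, 10000])
```

The sketch only shows how the three stages could connect; in practice the published model defines its own relation encoding, selection mechanism, and training objective.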
