Abstract

In this paper, we propose a new task, Human-Centric Captioning, which focuses on describing human behavior in an image. Human activities and relationships are primary objectives of visual understanding in everyday applications. However, existing image captioning systems treat humans and other objects identically, which limits their ability to understand and describe diverse human activities. As the first exploration of this task, we build a novel Human-Centric COCO dataset concentrating on humans. Accordingly, we propose a novel Human-Centric Captioning Model (HCCM) that focuses on human-centric feature hierarchization and sentence generation. Specifically, our model first utilizes body part-level knowledge to hierarchize the image features, and then applies a novel three-branch captioning model that processes these hierarchical features independently to calibrate the descriptions of human actions. Comprehensive experiments demonstrate that HCCM achieves state-of-the-art performance, with BLEU-4, CIDEr, and SPICE scores of 41.5, 127.3, and 23.5, respectively. The dataset and code are publicly available at https://github.com/JohnDreamer/HCCM/.
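The hierarchization idea described above can be illustrated with a minimal sketch: detected region features are partitioned into part-level, human-level, and context groups, and each group is processed by its own branch before fusion. All names, the grouping rule, and the mean-pool branches here are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of human-centric feature hierarchization.
# The label sets, branch design, and fusion scheme are assumptions
# for illustration only, not HCCM's actual architecture.

def hierarchize(regions):
    """Split detected regions into three hierarchical groups."""
    groups = {"part": [], "human": [], "context": []}
    for label, feat in regions:
        if label in {"head", "arm", "leg", "torso"}:  # body-part regions
            groups["part"].append(feat)
        elif label == "person":                       # whole-human regions
            groups["human"].append(feat)
        else:                                         # other objects / scene
            groups["context"].append(feat)
    return groups

def branch(feats):
    """Placeholder branch: mean-pool one group's feature vectors."""
    if not feats:
        return [0.0]
    dim = len(feats[0])
    return [sum(f[i] for f in feats) / len(feats) for i in range(dim)]

def fuse(groups):
    """Concatenate the three branch outputs for a downstream decoder."""
    return (branch(groups["part"])
            + branch(groups["human"])
            + branch(groups["context"]))
```

In the real model each branch would be a learned sub-network and the decoder would attend over the hierarchical features; the sketch only shows the routing of features into independent branches.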
