Abstract

Visual attribute detection provides rich semantic concepts for image captioning. Some previous methods directly encode the detected attributes into vectors and generate the corresponding captions, which ignores the correlations between image regions and attributes. In this paper, we aim to bridge the gap between visual features and detected attributes: first, look at a specific region of the image; second, decide which attribute to attend to. We propose an attribute-driven image captioning approach consisting of two parts: a visual positioning part and an attribute selection part. Specifically, we introduce the pointer-generator network into the second part of our model as a soft switch, which at each decoding step determines whether to generate a word from the hidden state or to point to a detected attribute. Qualitative and quantitative experiments show that our model improves the coverage of key visual attributes and significantly boosts overall performance.
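The following is a minimal sketch of the soft-switch idea described above: a pointer-generator decoding step that mixes a generation distribution over the vocabulary with a pointer distribution over detected attributes. It is written in PyTorch under illustrative assumptions; the module name, tensor shapes, and variable names (hidden_dim, vocab_size, num_attrs, attr_vocab_ids) are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftSwitchDecoderStep(nn.Module):
    """One decoding step of a pointer-generator soft switch (illustrative sketch)."""

    def __init__(self, hidden_dim, vocab_size):
        super().__init__()
        self.generate = nn.Linear(hidden_dim, vocab_size)  # generation branch over the vocabulary
        self.switch = nn.Linear(hidden_dim, 1)              # scalar soft-switch p_gen

    def forward(self, hidden, attr_scores, attr_vocab_ids):
        # hidden:         (batch, hidden_dim)  decoder hidden state at step t
        # attr_scores:    (batch, num_attrs)   attention scores over detected attributes
        # attr_vocab_ids: (batch, num_attrs)   vocabulary index of each detected attribute
        p_gen = torch.sigmoid(self.switch(hidden))           # probability of generating from the hidden state
        gen_dist = F.softmax(self.generate(hidden), dim=-1)  # distribution over the vocabulary
        ptr_dist = F.softmax(attr_scores, dim=-1)            # distribution over detected attributes

        # Mix the two distributions: add pointer mass onto the attributes'
        # vocabulary positions, weighted by (1 - p_gen).
        final_dist = p_gen * gen_dist
        final_dist = final_dist.scatter_add(1, attr_vocab_ids, (1 - p_gen) * ptr_dist)
        return final_dist


# Usage with random tensors (shapes are illustrative only)
step = SoftSwitchDecoderStep(hidden_dim=512, vocab_size=10000)
hidden = torch.randn(2, 512)
attr_scores = torch.randn(2, 5)
attr_vocab_ids = torch.randint(0, 10000, (2, 5))
probs = step(hidden, attr_scores, attr_vocab_ids)  # (2, 10000), sums to 1 per row
```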
