Abstract

Image description has become a popular topic in multimedia computing and computer vision. Recent work has demonstrated that learning local semantic concepts, in addition to global image features, as contextual information helps a model understand the image scene better. However, current image description methods treat local features as a bag of visual words, which fails to capture the interactions and structure of the objects embedded in the image. In this paper, we propose a novel captioning framework that learns to integrate local concepts with their geometric structure as side information. We design an Object Structure Graph to encode the positions and spatial distribution of the objects in the image. To embed the graph into an efficient representation, we introduce a semantic matching scheme that matches the embedded graph with its corresponding sentence. Our experiments on the public image captioning datasets MS-COCO and Flickr30k show that our solution significantly outperforms current state-of-the-art techniques that leverage local semantic concepts, and that our best model achieves competitive results on the same data split compared with other recent approaches.
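The abstract does not specify how the Object Structure Graph is constructed. As a minimal illustrative sketch, one could assume nodes are detected objects (bounding boxes) and edge weights reflect the spatial proximity of object centers; the names `DetectedObject` and `build_object_structure_graph`, the proximity heuristic, and the threshold below are all hypothetical assumptions, not the paper's actual formulation.

```python
import math
from dataclasses import dataclass

@dataclass
class DetectedObject:
    """A detected object with a bounding box normalized to [0, 1].
    (Hypothetical structure; the paper's detector output may differ.)"""
    label: str
    x: float  # top-left corner, horizontal
    y: float  # top-left corner, vertical
    w: float  # box width
    h: float  # box height

def center(obj: DetectedObject) -> tuple[float, float]:
    """Center point of the bounding box."""
    return (obj.x + obj.w / 2, obj.y + obj.h / 2)

def build_object_structure_graph(objects, edge_threshold=0.6):
    """Hypothetical graph construction: nodes are object indices and an
    edge carries a proximity weight in [0, 1] (1 = coincident centers).
    Pairs whose proximity falls below the threshold get no edge."""
    graph = {i: {} for i in range(len(objects))}
    for i, a in enumerate(objects):
        for j in range(i + 1, len(objects)):
            ax, ay = center(a)
            bx, by = center(objects[j])
            dist = math.hypot(ax - bx, ay - by)  # at most sqrt(2) for normalized boxes
            weight = 1.0 - dist / math.sqrt(2)   # map distance to a [0, 1] proximity score
            if weight >= edge_threshold:
                graph[i][j] = weight
                graph[j][i] = weight
    return graph

# Usage: three detections; the person and dog are close enough to share an
# edge (weight ~0.82), while the distant kite stays disconnected.
detections = [
    DetectedObject("person", 0.10, 0.30, 0.20, 0.50),
    DetectedObject("dog",    0.35, 0.55, 0.15, 0.25),
    DetectedObject("kite",   0.70, 0.05, 0.10, 0.10),
]
print(build_object_structure_graph(detections))
```

In a full pipeline, such a graph would then be embedded and matched against the corresponding sentence, as the abstract's semantic matching scheme describes; real edge features would likely also encode relative direction or overlap rather than distance alone.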
