Abstract

The main challenges in vision-to-language (V2L) systems are generating meaningful captions, producing proper answers to questions, and capturing even the minute details of an image. The main contribution of this paper is an approach that combines high-level semantic attributes with local image features to address the challenges of V2L tasks. In particular, high-level semantic attribute information is used to reduce the semantic gap between images and text. A novel semantic attention network is designed to explore the mapping relationships between semantic attributes and image regions: it highlights concept-related regions and selects region-related concepts. The proposed approach is applied to two representative V2L tasks, image captioning and visual question answering (VQA). Improved BLEU scores show that the proposed image captioning model performs well, and the experimental results demonstrate that the proposed model is effective for V2L tasks.
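The abstract does not give the network's equations, but the bidirectional interaction it describes (attributes highlighting regions, regions selecting attributes) can be illustrated with a minimal sketch. The sketch below is an assumption, not the authors' implementation: the module name `SemanticAttention`, the bilinear affinity between projected features, and the dimensions `region_dim`, `attr_dim`, and `hidden_dim` are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticAttention(nn.Module):
    """Sketch of attribute-region attention (assumed design, not the paper's exact model)."""

    def __init__(self, region_dim, attr_dim, hidden_dim):
        super().__init__()
        # Project both modalities into a shared space before computing affinities.
        self.region_proj = nn.Linear(region_dim, hidden_dim)
        self.attr_proj = nn.Linear(attr_dim, hidden_dim)

    def forward(self, regions, attrs):
        # regions: (B, R, region_dim) local image features
        # attrs:   (B, A, attr_dim)   high-level semantic attribute embeddings
        r = self.region_proj(regions)                 # (B, R, H)
        a = self.attr_proj(attrs)                     # (B, A, H)
        scores = torch.bmm(a, r.transpose(1, 2))      # (B, A, R) attribute-region affinities

        # Highlight concept-related regions: each attribute attends over regions.
        region_att = F.softmax(scores, dim=2)
        attended_regions = torch.bmm(region_att, regions)        # (B, A, region_dim)

        # Select region-related concepts: each region attends over attributes.
        attr_att = F.softmax(scores.transpose(1, 2), dim=2)      # (B, R, A)
        attended_attrs = torch.bmm(attr_att, attrs)              # (B, R, attr_dim)
        return attended_regions, attended_attrs
```

In such a design, the attended outputs would be fed to the downstream caption decoder or VQA answer classifier; the exact fusion used by the paper is not specified in the abstract.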
