Abstract

This paper focuses on developing semantic image caption generation techniques that leverage image and scene understanding. More specifically, we address image captioning by combining object detection with attention-enriched deep learning models. A Convolutional Neural Network (CNN) extracts the image features, and a Long Short-Term Memory (LSTM) network, an extended variant of recurrent neural networks augmented with an attention mechanism, generates the caption. We implement image captioning by first considering the objects detected in the image scene and then integrating an attention mechanism into caption generation, which offers advantages in both accuracy and semantics. The objective of this paper is to introduce a combined pipeline that employs several variant models for semantic caption generation; a sketch of the shared encoder-decoder design follows below. Four variant models are proposed; all are implemented and trained on the COCO and Flickr30k datasets and then tested on a subset of the COCO dataset. The outputs of the different models are evaluated using semantic similarity analysis between the generated captions and the ground-truth captions. Our framework supports deeper image understanding and decision making in diverse use cases, such as producing distinctive responses from multimodal data and analyzing and monitoring crowdsourced images from social media and other sources.
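The abstract does not tie the pipeline to a particular framework; the sketch below is a minimal illustration, assuming a PyTorch implementation, of the described design: a CNN producing spatial features and an LSTM decoder that attends over those features at each step. All class names, dimensions, and the ResNet-50 backbone are illustrative assumptions rather than the authors' code, and the object-detection conditioning specific to the four proposed variants is omitted.

    # Minimal sketch (not the authors' code): CNN encoder plus an
    # attention-weighted LSTM decoder. Names and dimensions are assumptions.
    import torch
    import torch.nn as nn
    import torchvision.models as models

    class Encoder(nn.Module):
        """CNN feature extractor returning a grid of spatial features."""
        def __init__(self):
            super().__init__()
            resnet = models.resnet50(weights=None)
            # Drop the average-pool and classification head, keep the conv grid.
            self.backbone = nn.Sequential(*list(resnet.children())[:-2])

        def forward(self, images):                       # (B, 3, H, W)
            feats = self.backbone(images)                # (B, 2048, h, w)
            return feats.flatten(2).permute(0, 2, 1)     # (B, h*w, 2048)

    class AttentionDecoder(nn.Module):
        """LSTM decoder that attends over encoder features at each step."""
        def __init__(self, vocab_size, feat_dim=2048, embed_dim=256, hidden_dim=512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.attn = nn.Linear(feat_dim + hidden_dim, 1)   # additive score
            self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
            self.fc = nn.Linear(hidden_dim, vocab_size)

        def forward(self, feats, captions):              # feats: (B, N, feat_dim)
            B, N, _ = feats.shape
            h = feats.new_zeros(B, self.lstm.hidden_size)
            c = feats.new_zeros(B, self.lstm.hidden_size)
            logits = []
            for t in range(captions.size(1)):
                # Attention weights over the N spatial locations.
                score = self.attn(torch.cat(
                    [feats, h.unsqueeze(1).expand(-1, N, -1)], dim=2))
                alpha = torch.softmax(score, dim=1)      # (B, N, 1)
                context = (alpha * feats).sum(dim=1)     # (B, feat_dim)
                x = torch.cat([self.embed(captions[:, t]), context], dim=1)
                h, c = self.lstm(x, (h, c))
                logits.append(self.fc(h))
            return torch.stack(logits, dim=1)            # (B, T, vocab_size)

A fuller variant in the spirit of the paper would additionally feed detected-object labels or region features into the decoder input or the attention pool; that integration differs across the four proposed models and is not reproduced here.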
