A Novel Convolutional Neural Network-Gated Recurrent Unit approach for Image Captioning

Sarthak Singh Rawat, Rahul Nijhawan, Kartikeyan Singh Rawat

doi:10.1109/icssit48917.2020.9214109

Abstract

Image captioning is the task of generating a textual description for an image. It combines machine learning techniques from Natural Language Processing and Computer Vision to produce appropriate descriptions for images. Image captioning has several applications in today's world of ever-expanding data, such as application recommendation, virtual assistance, image indexing, and social media. It can also help automate the interpretation of images and describe a visual scene to the visually impaired. Image captioning has been indispensable in driving the field of Human-Computer Interaction. This paper proposes a CNN-GRU based framework that is trained on large datasets of images and captions and generates accurate captions for new images. A dictionary mapping photo identifiers to their descriptions is built, and these descriptions are then converted into a vocabulary list of words. A VGG-16 Convolutional Neural Network is used as the feature extractor and a Gated Recurrent Unit (GRU) Recurrent Neural Network as the sequence processor. Our model achieves an accuracy of 82.39%.
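The vocabulary step described in the abstract can be illustrated with a minimal sketch. It assumes the descriptions are held in a dictionary from photo identifier to a list of caption strings. The function name `build_vocabulary`, the `min_count` parameter, the tokenization rule, and the toy data are all hypothetical choices made for illustration and are not taken from the paper.

```python
from collections import Counter


def build_vocabulary(descriptions, min_count=1):
    """Return a sorted list of words found in the captions.

    descriptions: dict mapping a photo identifier to a list of caption strings.
    min_count: assumed frequency threshold; words seen fewer times are dropped.
    """
    counts = Counter()
    for captions in descriptions.values():
        for caption in captions:
            # Assumed cleaning rule: lowercase, split on whitespace,
            # keep purely alphabetic tokens.
            tokens = [w for w in caption.lower().split() if w.isalpha()]
            counts.update(tokens)
    return sorted(w for w, c in counts.items() if c >= min_count)


# Hypothetical toy data, not drawn from the paper's dataset.
descriptions = {
    "photo_001": ["A dog runs on the grass", "Brown dog running outside"],
    "photo_002": ["A child plays with a dog"],
}

vocabulary = build_vocabulary(descriptions)
# Index 0 is left free so it can serve as a padding value for sequences.
word_to_index = {word: i + 1 for i, word in enumerate(vocabulary)}
```

A word-to-index mapping of this kind is what a sequence model such as a GRU would typically consume after the captions are encoded as integer sequences; the exact preprocessing used by the authors may differ.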

Full Text