Abstract

Image captioning is a comprehensive task spanning computer vision (CV) and natural language processing (NLP). It converts images to text: the algorithm automatically generates descriptive text for an input image. In this paper, we present an end-to-end model that uses a deep convolutional neural network (CNN) as the encoder and a recurrent neural network (RNN) as the decoder. To obtain better feature extraction for image captioning, we propose a highly modularized multi-branch CNN, which increases accuracy while keeping the number of hyper-parameters unchanged. This strategy yields a simply designed network consisting of parallel sub-modules of the same structure. While traditional CNNs go deeper and wider to increase accuracy, our proposed method is more effective with a simple design that is easier to optimize for practical applications. Experiments are conducted on the Flickr8k, Flickr30k and MSCOCO datasets. Results demonstrate that our method achieves state-of-the-art performance in terms of caption quality.
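The encoder–decoder pipeline described above can be illustrated with a toy sketch: a stand-in "encoder" maps the image to an initial hidden state, and a simple recurrent step greedily emits one word at a time. All names, sizes, and the tiny vocabulary below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["<start>", "a", "dog", "runs", "<end>"]
V, H = len(vocab), 6

# Toy random parameters standing in for a trained model (assumed, not from the paper).
W_h = rng.standard_normal((H, H)) * 0.3   # recurrent weights
W_e = rng.standard_normal((V, H)) * 0.3   # word-embedding weights
W_o = rng.standard_normal((H, V)) * 0.3   # hidden-to-vocabulary weights
W_enc = rng.standard_normal((10, H)) * 0.3

def encode_image(image):
    # Stand-in for the CNN encoder: project image features to the initial hidden state.
    return np.tanh(image @ W_enc)

def greedy_caption(image, max_len=5):
    h = encode_image(image)
    token = vocab.index("<start>")
    words = []
    for _ in range(max_len):
        # One RNN step: combine the previous word embedding with the hidden state.
        h = np.tanh(h @ W_h + W_e[token])
        token = int(np.argmax(h @ W_o))   # greedy choice of the next word
        if vocab[token] == "<end>":
            break
        words.append(vocab[token])
    return " ".join(words)

caption = greedy_caption(rng.standard_normal(10))
print(type(caption))
```

In a real system the greedy loop is usually replaced with beam search, and the encoder is a pretrained CNN rather than a random projection; the sketch only shows the image-to-text control flow.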

Highlights

  • As an important source of information, numerous images are digitally stored and transmitted on the Internet

  • Image captioning is a comprehensive task in computer vision (CV) [1] and natural language processing (NLP) [2], which can complete multi-modal conversion from image to text

  • The recurrent neural network is a mature technology in NLP and plays an important role in time-series data, using previous information to assist the current task


Summary

Introduction

As an important source of information, numerous images are digitally stored and transmitted on the Internet. Building on significant advances of the encoder–decoder structure in machine translation [18,19], a generative model called Neural Image Caption (NIC) [20] was proposed. Yan et al. [26] proposed a hierarchical attention mechanism that uses both global CNN features and local object features for improved results. Despite these advancements, such refined methods come with complicated network structures and a growing number of parameters, which limits the ability to adapt the network architectures to other datasets and tasks. To address these issues, we propose a simple end-to-end image captioning model with an extended CNN architecture. We first propose a new multi-branch CNN model based on residual learning for image captioning. The improved network has a large receptive field, which is important for learning the complex relationships among object categories.
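The core idea of a multi-branch block built on residual learning can be sketched as follows: several parallel sub-modules of identical topology transform the input, their outputs are aggregated, and an identity shortcut is added. This is a minimal NumPy sketch of that aggregation pattern (in the spirit of ResNeXt-style blocks); the branch structure and dimensions are assumed for illustration, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_branch(dim_in, dim_mid):
    # Every branch shares the same topology: reduce -> nonlinearity -> expand.
    w_reduce = rng.standard_normal((dim_in, dim_mid)) * 0.1
    w_expand = rng.standard_normal((dim_mid, dim_in)) * 0.1
    def branch(x):
        h = np.maximum(x @ w_reduce, 0.0)  # reduce width, apply ReLU
        return h @ w_expand                # expand back to the input width
    return branch

def multi_branch_block(x, branches):
    # Aggregate the parallel branches, then add the identity shortcut
    # (residual learning): y = x + sum_i branch_i(x).
    return x + sum(b(x) for b in branches)

# Four parallel sub-modules of the same structure (the "cardinality").
branches = [make_branch(8, 2) for _ in range(4)]
x = rng.standard_normal((1, 8))
y = multi_branch_block(x, branches)
print(y.shape)  # output keeps the input width, so blocks can be stacked
```

Because every branch has the same shape, increasing the number of branches adds capacity without introducing new kinds of hyper-parameters, which matches the modular design goal stated above.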

