Abstract

In recent years, with the rapid development of artificial intelligence, image captioning has gradually attracted the attention of many researchers in the field and has become an interesting and challenging task. Image captioning, automatically generating natural language descriptions according to the content observed in an image, is an important part of scene understanding and combines knowledge from computer vision and natural language processing. Its applications are extensive and significant, for example in human-computer interaction. This paper summarizes the related methods and focuses on the attention mechanism, which plays an important role in computer vision and has recently been widely used in image caption generation tasks. Furthermore, the advantages and shortcomings of these methods are discussed, and the commonly used datasets and evaluation criteria in this field are provided. Finally, this paper highlights some open challenges in the image captioning task.

Highlights

  • This paper is organized as follows. The second part details the basic models and methods. The third part focuses on the introduction of the attention mechanism to optimize the model and compensate for its shortcomings. The fourth part introduces the datasets commonly used in image captioning and compares the results of different models

  • Image captioning models can be divided into two main categories: methods based on a statistical probability language model that rely on handcrafted features, and neural network models based on an encoder-decoder architecture that extract deep features (see the sketch after this list). The specific details of the two approaches will be discussed separately

  • Words are detected by applying a convolutional neural network (CNN) to image regions [19] and integrating the information with multi-instance learning (MIL) [20]. The structure of the sentence is trained directly from the captions to minimize a priori assumptions about sentence structure
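
As a concrete illustration of the encoder-decoder category mentioned above, the sketch below pairs a pretrained CNN encoder with an LSTM decoder. It is a minimal sketch, assuming a PyTorch implementation with a ResNet-50 backbone and teacher forcing during training; the class names, dimensions, and backbone choice are illustrative assumptions, not the specific architectures covered in the survey.

```python
# Minimal encoder-decoder captioner (PyTorch); a sketch, not the surveyed models' exact architecture.
import torch
import torch.nn as nn
import torchvision.models as models


class CNNEncoder(nn.Module):
    """Encode an image into a single feature vector with a pretrained CNN."""

    def __init__(self, embed_dim=256):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier head
        self.project = nn.Linear(resnet.fc.in_features, embed_dim)

    def forward(self, images):                      # images: (B, 3, H, W)
        with torch.no_grad():                       # deep features are extracted, not handcrafted
            feats = self.backbone(images).flatten(1)
        return self.project(feats)                  # (B, embed_dim)


class LSTMDecoder(nn.Module):
    """Generate a caption word by word, conditioned on the image feature."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feat, captions):        # captions: (B, T) token ids
        # Prepend the image feature as the first "word" (teacher forcing during training).
        inputs = torch.cat([image_feat.unsqueeze(1), self.embed(captions[:, :-1])], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)                      # (B, T, vocab_size) word scores per step
```

In this family, the encoder's deep features replace the handcrafted features of the statistical-language-model methods, and the decoder is trained with maximum likelihood over the reference captions.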


Summary

Introduction

This paper is organized as follows. The second part details the basic models and methods. The third part focuses on the introduction of the attention mechanism to optimize the model and compensate for its shortcomings. The fourth part introduces the datasets commonly used in image captioning and compares the results of different models. Words are detected by applying a convolutional neural network (CNN) to image regions [19] and integrating the information with multi-instance learning (MIL) [20]. The structure of the sentence is trained directly from the captions to minimize a priori assumptions about sentence structure. This turns the image caption generation problem into an optimization problem and searches for the most likely sentence. We detect words from the given vocabulary according to the content of the corresponding image, using the weakly supervised multi-instance learning (MIL) method to train the detectors iteratively, and we obtain a response map on the final fully connected layer.
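
The weakly supervised word-detection step described above can be pictured with the noisy-OR form of multi-instance learning: every image region scores every vocabulary word, and the image-level probability that a word is present combines the regional responses, so the detectors can be trained from caption-level word labels alone. The sketch below is a minimal illustration assuming PyTorch; the region-feature dimension, vocabulary size, and names are illustrative assumptions, not the exact model of [19, 20].

```python
import torch
import torch.nn as nn


class MILWordDetector(nn.Module):
    """Weakly supervised word detector: each image region scores every vocabulary word,
    and noisy-OR pooling turns the regional responses into image-level word probabilities."""

    def __init__(self, region_dim=2048, vocab_size=1000):
        super().__init__()
        # A fully connected layer over region features; its per-region word scores
        # play the role of the response map discussed above.
        self.word_scores = nn.Linear(region_dim, vocab_size)

    def forward(self, region_feats):                            # (B, R, region_dim), R regions
        p_region = torch.sigmoid(self.word_scores(region_feats))  # (B, R, V) per-region word probs
        # Noisy-OR MIL pooling: a word is present if at least one region fires for it.
        p_image = 1.0 - torch.prod(1.0 - p_region, dim=1)          # (B, V)
        return p_image


# Training uses only image-level labels (which words appear in the reference captions),
# so the region-level detectors are refined iteratively without any region annotations.
def mil_loss(p_image, word_labels):                 # word_labels: (B, V) multi-hot
    return nn.functional.binary_cross_entropy(p_image, word_labels)
```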
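
Treating caption generation as a search for the most likely sentence is commonly realized with beam search over the decoder's step-wise word probabilities. The sketch below assumes a `step_logprobs(prefix)` callable supplied by some captioning model; the function name, beam width, and token ids are assumptions for illustration, not a specific system from the survey.

```python
def beam_search(step_logprobs, bos_id, eos_id, beam_width=3, max_len=20):
    """Search for the most likely sentence under a left-to-right caption model.

    `step_logprobs(prefix)` is assumed to return a list of log P(word | image, prefix)
    for every word in the vocabulary; any encoder-decoder captioner can supply it.
    """
    beams = [([bos_id], 0.0)]                       # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos_id:                # finished sentences are kept as-is
                candidates.append((tokens, score))
                continue
            logprobs = step_logprobs(tokens)
            # Expand the beam with the top-scoring next words.
            top = sorted(range(len(logprobs)), key=lambda w: logprobs[w], reverse=True)[:beam_width]
            for w in top:
                candidates.append((tokens + [w], score + logprobs[w]))
        # Keep only the best `beam_width` partial sentences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(t[-1] == eos_id for t, _ in beams):
            break
    return max(beams, key=lambda c: c[1])[0]        # most likely sentence found by the search
```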

