Abstract

Image captioning has gradually gained attention in the field of artificial intelligence and has become an interesting and challenging task in image understanding. It requires identifying the important objects in an image, extracting their attributes, determining the relationships between them, and generating human-like descriptions. Recent work on deep neural networks has greatly improved the performance of image captioning models. However, machines are still unable to imitate the way humans think, talk, and communicate, so image captioning remains an open problem. Keeping up with the latest research and results is therefore important, yet difficult given the large number of publications on the topic. Our work aims to give researchers a macro-level understanding of image captioning from four aspects: spatial-temporal distribution characteristics, collaborative networks, trends in subject research, and historical evolutionary path. We employ scientometric visualization methods to achieve this goal. The results show that China has published the largest number of papers on image captioning, but the United States has the greatest impact on research in this area. In addition, thirteen academic groups are identified in the field of image description, with Microsoft, Google, Australian National University, and Georgia Institute of Technology being the most prominent research institutions. We also find that evaluation methods, datasets, novel image captioning models based on generative adversarial networks, reinforcement learning, and the Transformer, as well as remote sensing image captioning, are the new research trends. Lastly, we conclude that image captioning research went through three major development stages from 2010 to 2020, and on this basis we propose a more comprehensive taxonomy of image captioning.

Highlights

  • As the representative technology of artificial intelligence (AI), deep learning has developed rapidly in recent years, and has been widely used throughout the fields of computer vision (CV) and natural language processing (NLP)

  • In order to grasp the development direction of image captioning technology from a macro perspective, and to help researchers gain a comprehensive understanding of the development status of the field, we propose a review method for image description based on scientometric analysis

  • We use HistCite software to count the papers retrieved from Web of Science (WOS) for each year from 2010 to 2020, and plot how the number of published image captioning papers changes from year to year
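The yearly tally described in the last highlight can be illustrated with a minimal Python sketch. It does not reproduce HistCite or the Web of Science export format: the record layout (a dictionary with a `year` field), the function name `papers_per_year`, and the sample data are all assumptions made for illustration.

```python
from collections import Counter


def papers_per_year(records, start=2010, end=2020):
    """Count records per publication year within [start, end].

    Years with no papers are reported as 0 so the resulting curve
    has one point for every year in the window.
    """
    counts = Counter(r["year"] for r in records if start <= r["year"] <= end)
    return {year: counts.get(year, 0) for year in range(start, end + 1)}


# Hypothetical records standing in for a bibliographic export.
sample = [
    {"title": "A", "year": 2015},
    {"title": "B", "year": 2015},
    {"title": "C", "year": 2019},
    {"title": "D", "year": 2009},  # outside the window, so it is ignored
]

trend = papers_per_year(sample)
for year, n in trend.items():
    print(year, n)
```

The resulting mapping from year to count is what would be plotted as the publication curve; any plotting library could draw it from `trend.keys()` and `trend.values()`.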

Summary

Introduction

As a representative technology of artificial intelligence (AI), deep learning has developed rapidly in recent years and is widely used throughout computer vision (CV) and natural language processing (NLP). Image captioning is an important part of image understanding: it automatically generates human-like sentences for a given image [1]. The task requires the machine to recognize the objects in the image, understand the relationships between them, and express the main information in concise natural-language descriptions. When the computer encounters an image and outputs the corresponding visual context, it may describe features of the image (e.g., shape, color, and texture), present the primary objects in the scene, and even predict dynamic relationships between people and objects.
