Image Caption Generation for News Articles

Zhishen Yang, Naoaki Okazaki

doi:10.18653/v1/2020.coling-main.176

Abstract

In this paper, we address the task of news-image captioning, which generates a description of an image given the image and its article body as input. This task is more challenging than conventional image captioning because it requires a joint understanding of image and text. We present a Transformer model that integrates the text and image modalities and attends to textual features from visual features when generating a caption. Experiments based on automatic evaluation metrics and human evaluation show that the article text provides the primary information needed to reproduce news-image captions written by journalists. The results also demonstrate that the proposed model outperforms the state-of-the-art model. In addition, we confirm that visual features contribute to improving the quality of news-image captions.
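To make the phrase "attends to textual features from visual features" concrete, the following is a minimal sketch of scaled dot-product cross-attention in which visual features act as queries over textual features. It is an illustration under stated assumptions, not the authors' implementation: the function names `softmax` and `cross_attention`, the feature dimensions, and the random inputs are all hypothetical, and the learned projection matrices and multiple heads of a full Transformer are omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(visual, textual):
    """Visual features (queries) attend to textual features (keys and values).

    visual:  array of shape (n_regions, d), hypothetical image-region features
    textual: array of shape (n_tokens, d), hypothetical article-token features
    Returns the attended features (n_regions, d) and the attention
    weights (n_regions, n_tokens).
    """
    d = visual.shape[-1]
    scores = visual @ textual.T / np.sqrt(d)   # similarity of each region to each token
    weights = softmax(scores, axis=-1)         # each row sums to 1 over the tokens
    return weights @ textual, weights

# Assumed toy sizes: 4 image regions, 6 article tokens, 8-dimensional features.
rng = np.random.default_rng(0)
visual = rng.normal(size=(4, 8))
textual = rng.normal(size=(6, 8))
attended, weights = cross_attention(visual, textual)
print(attended.shape, weights.shape)
```

Each row of `weights` indicates how strongly one image region draws on each article token, which is one plausible way for text to supply the primary information while the image steers what is selected.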

Highlights

Image captioning, i.e., automatic generation of a natural language description from an image, has received much attention from the fields of both Computer Vision (CV) and Natural Language Processing (NLP) (Vinyals et al., 2015; Xu et al., 2015; Karpathy and Fei-Fei, 2015).
We address a more advanced task, news-image captioning: this task generates a description of an image, given the image and its article body as input.
We present a method for news-image captioning based on the Transformer (Vaswani et al., 2017), a successful architecture for various NLP tasks, including machine translation, abstractive summarization, and contextualized word embeddings.

Summary

Introduction

Image captioning, i.e., automatic generation of a natural language description from an image, has received much attention from the fields of both Computer Vision (CV) and Natural Language Processing (NLP) (Vinyals et al., 2015; Xu et al., 2015; Karpathy and Fei-Fei, 2015). We address a more advanced task, news-image captioning: this task generates a description of an image, given the image and its article body as input. News-image captioning differs from conventional image captioning, which receives only an image as input, in that it requires a mutual understanding of image and text. Earlier work proposed a two-stage approach for news-image captioning (Feng and Lapata, 2013; Tariq and Foroosh, 2017). Previous studies did not focus on the usefulness of text in the news-image captioning task; instead, they extended conventional image captioning models to incorporate text features.

Methods

Results

Conclusion