Abstract

Caption generation remains a hard problem in artificial intelligence: a textual description must be generated for a given image, which combines computer vision and natural language processing. The CNN-RNN encoder-decoder is a popular architecture for image captioning, and among its many variants the attention mechanism is an important development; deep learning methods of this kind have achieved state-of-the-art results on the task. In this paper, we present a model that generates natural language descriptions of given images. Our approach uses pre-trained deep neural network models to extract visual features and then applies an LSTM to generate captions. We use BLEU scores to evaluate model performance on the Flickr8k and Flickr30k datasets. In addition, we compare approaches with and without an attention mechanism.
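To make the described pipeline concrete, the following is a minimal sketch of a CNN-LSTM encoder-decoder captioner in PyTorch. It is an illustration under our own assumptions (a frozen pre-trained ResNet-50 backbone, hypothetical embedding/hidden/vocabulary sizes, and class names EncoderCNN/DecoderLSTM chosen for this example), not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """Extract a fixed-length visual feature vector with a pre-trained CNN."""
    def __init__(self, embed_size):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Drop the classification head; keep the convolutional features.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        for p in self.backbone.parameters():
            p.requires_grad = False  # freeze the pre-trained extractor
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):
        feats = self.backbone(images).flatten(1)  # (batch, 2048)
        return self.fc(feats)                     # (batch, embed_size)

class DecoderLSTM(nn.Module):
    """Generate a caption word by word, conditioned on the image feature."""
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Prepend the image feature as the first "token" of the sequence.
        inputs = torch.cat([features.unsqueeze(1), self.embed(captions)], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)  # logits over the vocabulary at each step

# Toy forward pass with hypothetical sizes.
encoder = EncoderCNN(embed_size=256)
decoder = DecoderLSTM(embed_size=256, hidden_size=512, vocab_size=5000)
images = torch.randn(2, 3, 224, 224)        # batch of 2 RGB images
captions = torch.randint(0, 5000, (2, 12))  # tokenised ground-truth captions
logits = decoder(encoder(images), captions)
print(logits.shape)  # torch.Size([2, 13, 5000])
```

At inference time the decoder would be unrolled step by step (greedily or with beam search) from a start token, and BLEU scores can then be computed between generated and reference captions, for example with nltk.translate.bleu_score.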
