Abstract

Faster R-CNN has been widely used to extract image features in image-to-text generation models since the rise of deep learning, although its extraction procedure is time-consuming. Existing approaches instead extract fixed-size grid features and use a language model to generate image captions, but they focus only on the spatial locations of the grid features, without considering interactions among grid features or the global features of the image. To generate higher-quality captions, an image captioning method based on a graph attention network with global context is proposed. A multi-layer convolutional neural network performs visual encoding, retrieving both the grid features and the whole-image features of a given image, from which a grid feature interaction graph is built. A graph attention network containing one global node and many local nodes then recasts feature extraction as a node classification problem, so that global and local features can be fully exploited after node updating and optimization. Finally, a Transformer-based decoding module uses the enhanced visual features to generate image captions. Experiments are conducted on the Microsoft COCO dataset. The results demonstrate that the proposed method successfully captures both the global and local features of an image and achieves a CIDEr score of 133.1%, significantly improving caption quality.
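To make the core encoding idea concrete, the following is a minimal sketch (not the authors' code) of a single graph attention layer over N local grid-feature nodes plus one global whole-image node, on a fully connected interaction graph. All names, dimensions, and the specific attention form (a standard GAT-style scorer) are illustrative assumptions; the updated node features would then be passed to a Transformer decoder.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalContextGAT(nn.Module):
    """One graph-attention layer over [global; local] nodes (illustrative)."""
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)    # shared node projection
        self.a = nn.Linear(2 * dim, 1, bias=False)  # pairwise attention scorer

    def forward(self, grid_feats, global_feat):
        # grid_feats: (N, dim) local grid features from the CNN encoder
        # global_feat: (1, dim) pooled whole-image feature (the global node)
        h = self.W(torch.cat([global_feat, grid_feats], dim=0))  # (N+1, dim)
        n = h.size(0)
        # Attention logits over a fully connected graph: every grid node
        # attends to every other node and to the global context node.
        pairs = torch.cat(
            [h.unsqueeze(1).expand(n, n, -1),
             h.unsqueeze(0).expand(n, n, -1)], dim=-1)  # (N+1, N+1, 2*dim)
        e = F.leaky_relu(self.a(pairs).squeeze(-1))     # (N+1, N+1)
        alpha = F.softmax(e, dim=-1)                    # attention weights
        return alpha @ h  # updated node features; row 0 is the global node

# Usage: a 7x7 grid gives 49 local nodes, plus one global node.
gat = GlobalContextGAT(dim=512)
grid = torch.randn(49, 512)
glob = torch.randn(1, 512)
updated = gat(grid, glob)  # (50, 512), fed to the Transformer-based decoder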
