Novel model to integrate word embeddings and syntactic trees for automatic caption generation from images

Hongbin Zhang, Tao Li, Donghong Ji, Zhenyu Niu, Renzhong Wu, Guangli Li, Diedie Qiu

doi:10.1007/s00500-019-03973-w

Abstract

Automatic caption generation from images is an active, mainstream research direction in machine learning. It requires building a computer model that can interpret the implicit semantic information of images. However, current research faces significant challenges: extracting robust image features, suppressing noisy words, and improving caption coherence. For the first problem, a novel computer vision system is presented that creates a new image feature called MK–KDES-1 (MK–KDES stands for Multiple Kernel–Kernel Descriptors) by extracting three KDES features and fusing them with an MKL (Multiple Kernel Learning) model. The MK–KDES-1 feature captures both the textural and the shape characteristics of images, which helps improve the BLEU_1 (BLEU stands for Bilingual Evaluation Understudy) scores of captions. For the second problem, a newly designed two-layer TR (Tag Refinement) strategy is integrated into our NLG (Natural Language Generation) algorithm. The words most semantically relevant to an image are summarized to generate N-gram phrases, and noisy words are suppressed by the TR strategy. For the last problem, a popular WE (Word Embeddings) model and a novel metric called PDI (Positive Distance Information) are used together to generate N-gram phrases, which are then evaluated by the AWSC (Accumulated Word Semantic Correlation) metric. The phrases are finally fused into captions using STs (Syntactic Trees). Experimental results demonstrate that informative captions with high BLEU_3 scores can be obtained to describe images.
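
For readers unfamiliar with the BLEU_1 and BLEU_3 scores mentioned above, the following minimal Python sketch illustrates the clipped (modified) n-gram precision on which BLEU is built. It is an illustration only: the abstract does not state how BLEU_n is computed in this work, so the function names (`ngrams`, `modified_precision`), the whitespace tokenization, and the omission of the brevity penalty and of any averaging over n-gram orders are all assumptions made here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate caption against references.

    Hypothetical helper for illustration: each candidate n-gram is credited
    at most as many times as it occurs in any single reference caption.
    The brevity penalty of full BLEU is deliberately left out.
    """
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0  # candidate is shorter than n tokens

    # Largest count of each n-gram observed in any one reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


if __name__ == "__main__":
    candidate = "a dog runs on the grass".split()
    references = ["a dog is running on the grass".split()]
    print(modified_precision(candidate, references, 1))  # unigram precision
    print(modified_precision(candidate, references, 3))  # trigram precision
```

In this toy example, five of the six candidate unigrams appear in the reference, whereas only one of the four candidate trigrams does, which shows why a high score at n = 3 is a stricter indicator of fluent, coherent phrasing than a high score at n = 1.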

Full Text