Abstract

Current image semantic understanding models often misidentify or ignore scene information, which lowers the accuracy of the generated description sentences. To address this problem, this paper proposes an image semantic understanding model that incorporates scene factors. The model first applies LDA topic analysis to the text of the corpus to identify the corresponding topics (scene information) and obtain the vocabulary associated with each scene. It then uses a ResNet to extract the global features of the image and Places365-CNNs to extract deep scene features. Finally, the model combines the image scene information with the corpus scene information so that, when generating the description sentence, words related to the image's scene are used with higher probability; during generation, a double LSTM is used to adjust the parameters and improve the accuracy of sentence generation. The model is trained and tested on the Flickr8K, Flickr30K and MSCOCO image sets and verified with several evaluation metrics. The experimental results show that, compared with other models, the proposed model effectively improves the accuracy of image semantic understanding.
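To make the described pipeline concrete, the following is a minimal sketch of its three stages, assuming PyTorch, torchvision, and gensim. The class and parameter names (`SceneAwareDecoder`, `scene_mask`, `bias`) and the commented Places365 checkpoint loading are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torchvision.models as models
from gensim import corpora
from gensim.models import LdaModel

# --- Stage 1: LDA over the caption corpus recovers scene topics and their vocabulary ---
def scene_vocabulary(tokenized_captions, num_scenes=20, topn=50):
    dictionary = corpora.Dictionary(tokenized_captions)
    bow = [dictionary.doc2bow(doc) for doc in tokenized_captions]
    lda = LdaModel(bow, num_topics=num_scenes, id2word=dictionary)
    # One word list per latent scene (topic): words the decoder should favor
    return [[w for w, _ in lda.show_topic(k, topn=topn)] for k in range(num_scenes)]

# --- Stage 2: global image features (ResNet) and scene features (Places365 CNN) ---
resnet = models.resnet152(weights="IMAGENET1K_V1")
global_extractor = nn.Sequential(*list(resnet.children())[:-1])  # drop classifier head

scene_cnn = models.resnet18(num_classes=365)  # Places365 defines 365 scene classes
# Assumed checkpoint path; pretrained Places365 weights come from the CSAILVision release:
# state = torch.load("resnet18_places365.pth.tar", map_location="cpu")["state_dict"]
# scene_cnn.load_state_dict({k.replace("module.", ""): v for k, v in state.items()})

# --- Stage 3: double-LSTM decoder that boosts scene-related words during generation ---
class SceneAwareDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.init_proj = nn.Linear(feat_dim + 365, hidden_dim)  # fuse global + scene feats
        # "Double LSTM" sketched here as a two-layer LSTM: the second layer refines the first
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, features, tokens, scene_mask, bias=2.0):
        # features: (B, feat_dim + 365) concatenated image features
        # tokens: (B, T) caption word ids; scene_mask: (B, vocab_size) binary mask
        h0 = torch.tanh(self.init_proj(features)).unsqueeze(0).repeat(2, 1, 1)
        c0 = torch.zeros_like(h0)
        hidden, _ = self.lstm(self.embed(tokens), (h0, c0))
        logits = self.out(hidden)
        # Raise the probability of words tied to the detected scene
        return logits + bias * scene_mask.unsqueeze(1)
```

The additive `bias` term is one simple way to realize "using scene-related words with higher probability"; the paper's actual mechanism for combining image and corpus scene information may differ.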
