Abstract

<p indent=0mm>To address the lack of fine-grained semantic information in current image captioning models that rely on global features alone, a Chinese image captioning model combining global and local features is proposed. The model adopts the encoder-decoder framework. In the encoding stage, a residual network (ResNet) and Faster R-CNN are used to extract the global and local features of an image, respectively, improving the model's use of image features at different scales. A bi-directional gated recurrent unit (BiGRU) with an embedded visual attention structure and a residual connection structure serves as the decoder (BiGRU with residual connection and attention, BiGRU-RA). The decoder adaptively allocates weights between image features and text, and strengthens the mapping between image feature regions and context information. Additionally, a reinforcement learning-based policy gradient is incorporated into the loss function so that the CIDEr evaluation metric is optimized directly. Training and experiments are conducted on the Chinese captioning dataset of AI Challenger. The comparative results show that the proposed model achieves higher scores than the compared models and that the generated captions are more accurate and detailed.
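The policy-gradient step mentioned above can be illustrated with a minimal sketch. It assumes a REINFORCE-style loss with a reward baseline, which is one common way to optimize a sentence-level metric such as CIDEr directly; the abstract does not state the exact formulation used, so the function name `policy_gradient_loss`, the `baseline` argument, and the toy numbers are all hypothetical.

```python
import math


def policy_gradient_loss(token_log_probs, reward, baseline):
    """REINFORCE-style loss for one sampled caption (illustrative only).

    token_log_probs: log-probabilities the decoder assigned to each sampled token.
    reward: sentence-level score of the sampled caption, e.g. its CIDEr value.
    baseline: reference reward used to reduce variance (an assumed design choice).
    """
    advantage = reward - baseline
    # Minimizing this loss raises the likelihood of captions that beat the
    # baseline and lowers the likelihood of captions that fall below it.
    return -advantage * sum(token_log_probs)


# Toy example with made-up numbers: a two-token caption whose reward
# exceeds the baseline, so the loss is positive.
sampled_log_probs = [math.log(0.5), math.log(0.25)]
loss = policy_gradient_loss(sampled_log_probs, reward=1.2, baseline=0.8)
```

When the sampled caption scores exactly at the baseline, the loss is zero and the caption contributes no gradient; only captions that differ from the baseline move the decoder's parameters.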
