Abstract

Image captioning is a multimodal task: given an input image, the goal is to automatically generate coherent sentences that describe the image's content from its visual information. However, in most models that fuse image information with semantic information, the extracted visual information is insufficient. In this paper, a parallel double-layer LSTM (d-LSTM) is proposed as a decoder for processing semantic information. The semantic information obtained from the hidden state of the first layer serves as the primary input to the semantic information generated by the second layer. Finally, the semantic information from the two decoder layers is fused to generate a finer-grained image caption. The superiority of the proposed model is verified by large-scale experiments on the MSCOCO dataset.
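The abstract does not specify the exact wiring of the d-LSTM decoder, but the described flow (layer 1 consumes the word and image features; layer 2 consumes layer 1's hidden state; the two layers' outputs are fused before word prediction) can be sketched roughly as follows. This is a minimal, untrained numpy sketch under stated assumptions: concatenation is used as the fusion operation, the image feature is fed to layer 1 at every step, and all class, function, and parameter names are hypothetical.

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM cell step; gates are stacked as [input, forget, output, candidate]."""
    z = W @ x + U @ h + b
    H = h.shape[0]
    i = 1.0 / (1.0 + np.exp(-z[:H]))        # input gate
    f = 1.0 / (1.0 + np.exp(-z[H:2*H]))     # forget gate
    o = 1.0 / (1.0 + np.exp(-z[2*H:3*H]))   # output gate
    g = np.tanh(z[3*H:])                    # candidate cell state
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

class DoubleLSTMDecoder:
    """Sketch of a parallel double-layer LSTM (d-LSTM) decoder (hypothetical names)."""

    def __init__(self, embed_dim, hidden_dim, vocab_size, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda *shape: 0.1 * rng.standard_normal(shape)
        # Layer 1 consumes [word embedding ; image feature] (assumed same dim).
        self.W1 = init(4 * hidden_dim, 2 * embed_dim)
        self.U1 = init(4 * hidden_dim, hidden_dim)
        self.b1 = np.zeros(4 * hidden_dim)
        # Layer 2 consumes layer 1's hidden state as its primary information.
        self.W2 = init(4 * hidden_dim, hidden_dim)
        self.U2 = init(4 * hidden_dim, hidden_dim)
        self.b2 = np.zeros(4 * hidden_dim)
        # Fusion of both layers' hidden states (concatenation) -> vocabulary logits.
        self.Wout = init(vocab_size, 2 * hidden_dim)
        self.hidden_dim = hidden_dim

    def step(self, word_embed, img_feat, state):
        h1, c1, h2, c2 = state
        x1 = np.concatenate([word_embed, img_feat])
        h1, c1 = lstm_step(x1, h1, c1, self.W1, self.U1, self.b1)
        h2, c2 = lstm_step(h1, h2, c2, self.W2, self.U2, self.b2)
        fused = np.concatenate([h1, h2])    # fuse the two layers' semantics
        logits = self.Wout @ fused
        return logits, (h1, c1, h2, c2)

# Demo: one decoding step with random (untrained) weights, shapes only.
dec = DoubleLSTMDecoder(embed_dim=8, hidden_dim=16, vocab_size=50)
state = tuple(np.zeros(16) for _ in range(4))
logits, state = dec.step(np.zeros(8), np.ones(8), state)
```

In an actual captioning model the image feature would come from a CNN encoder and the logits would pass through a softmax during training; those parts are omitted here.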
