Abstract

A new cascade recurrent neural network (CRNN) for image caption generation is proposed. Unlike the classical multimodal recurrent neural network, which uses only a single network to extract unidirectional syntactic features, CRNN adopts a cascade network that learns visual-language interactions in both forward and backward directions, allowing it to exploit the deep semantic context contained in the image. In the proposed framework, two embedding layers are constructed for dense word representation, and a new stacked Gated Recurrent Unit (GRU) is designed to learn image-word mappings. The effectiveness of the CRNN model is verified on the widely used MSCOCO dataset, where the results indicate that CRNN outperforms state-of-the-art image captioning methods such as Google NIC and the multimodal recurrent neural network.
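The abstract does not give implementation details, but as a rough illustration of the cascade idea, the PyTorch sketch below shows one plausible reading: two separate embedding layers, a stacked (two-layer) GRU per direction, and CNN image features used to initialise the recurrent state. All module names, dimensions, and the fusion scheme are assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class CascadeCaptioner(nn.Module):
    """Rough sketch of the cascade idea described in the abstract.

    Two embedding layers, one stacked GRU per direction, and image
    features used to initialise the recurrent state. All names and
    dimensions are illustrative assumptions, not the paper's spec.
    """

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512,
                 image_dim=2048, num_layers=2):
        super().__init__()
        # Two embedding layers for dense word expression (per the abstract).
        self.embed_fwd = nn.Embedding(vocab_size, embed_dim)
        self.embed_bwd = nn.Embedding(vocab_size, embed_dim)
        # Project CNN image features into the GRU hidden space.
        self.img_proj = nn.Linear(image_dim, hidden_dim)
        self.num_layers = num_layers
        # Stacked GRUs: one cascade branch per direction.
        self.gru_fwd = nn.GRU(embed_dim, hidden_dim,
                              num_layers=num_layers, batch_first=True)
        self.gru_bwd = nn.GRU(embed_dim, hidden_dim,
                              num_layers=num_layers, batch_first=True)
        # Fuse both directions into per-step vocabulary logits.
        self.out = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, image_feats, captions):
        # image_feats: (B, image_dim); captions: (B, T) token ids.
        h0 = self.img_proj(image_feats)                     # (B, H)
        h0 = h0.unsqueeze(0).repeat(self.num_layers, 1, 1)  # (L, B, H)
        fwd, _ = self.gru_fwd(self.embed_fwd(captions), h0)
        # The backward branch reads the caption in reverse order.
        rev = torch.flip(captions, dims=[1])
        bwd, _ = self.gru_bwd(self.embed_bwd(rev), h0)
        bwd = torch.flip(bwd, dims=[1])  # realign with forward time order
        return self.out(torch.cat([fwd, bwd], dim=-1))      # (B, T, V)
```

For instance, `CascadeCaptioner(vocab_size=10000)(torch.randn(4, 2048), torch.randint(0, 10000, (4, 12)))` yields per-step logits of shape `(4, 12, 10000)`.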
