Abstract

Automatically generating captions of an image is a fundamental problem in computer vision and natural language processing, which translates the content of the image into natural language with correct grammar and structure. Attention-based model has been widely adopted for captioning tasks. Most attention models generate only single certain attention heat map for indicating eyes where to see. However, these models ignore the endogenous orienting which depends on the interests, goals or desires of the observers, and constrain the diversity of captions. To improve both the accuracy and diversity of the generated sentences, we present a novel endogenous–exogenous attention architecture to capture both the endogenous attention, which indicates stochastic visual orienting, and the exogenous attention, which indicates deterministic visual orienting. At each time step, our model generates two attention maps, endogenous heat map and exogenous heat map, and then fuses them into hidden state of LSTM for sequential word generation. We evaluate our model on the Flickr30k and MSCOCO datasets, and experiments show the accuracy of the model and the diversity of captions it learns. Our model achieves better performance over state-of-the-art methods.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call