Hybrid attention network for image captioning

Wenhui Jiang, Qin Li, Kun Zhan, Yuming Fang, Fei Shen

doi:10.1016/j.displa.2022.102238

Abstract

Machine attention mechanisms are widely used in image captioning. Such mechanisms dynamically focus on different image regions to guide word generation. However, without explicit supervision, existing attention models may fail to concentrate on the correct regions and thereby mislead word prediction. In this study, we exploit human captioning attention, which encodes the rich information that human beings perceive during captioning, and propose a novel Hybrid Attention Network (HAN) that combines the prevailing machine attention mechanisms with human captioning attention. The proposed HAN addresses the problem of “object hallucination” by re-weighting bottom-up attention, and improves the diversity of the generated captions by complementing top-down attention with human captioning attention. Extensive experiments on the Flickr30K and MS COCO datasets demonstrate that the proposed method effectively improves the performance of current image captioning methods.
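
The abstract names two operations, re-weighting bottom-up attention and complementing top-down attention with human captioning attention, but gives no formulas. The sketch below is only an illustration of what such operations could look like. Every name in it (`reweight_regions`, `hybrid_attention`, `attend`, the `mix` parameter) is hypothetical, and the convex mixture is an assumed stand-in, not the formulation used in the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reweight_regions(region_features, human_attention):
    """Scale each bottom-up region feature by a human attention weight.

    Assumption: one non-negative human attention score per region.
    """
    return [[w * x for x in feat]
            for feat, w in zip(region_features, human_attention)]


def hybrid_attention(top_down_logits, human_attention, mix=0.5):
    """Blend machine (top-down) attention with human attention.

    Assumption: a simple convex combination controlled by `mix`;
    mix=0 is pure machine attention, mix=1 is pure human attention.
    """
    machine = softmax(top_down_logits)
    total = sum(human_attention)
    human = [h / total for h in human_attention]
    return [(1 - mix) * m + mix * h for m, h in zip(machine, human)]


def attend(region_features, weights):
    """Weighted sum of region features, giving one context vector."""
    dim = len(region_features[0])
    return [sum(w * feat[d] for feat, w in zip(region_features, weights))
            for d in range(dim)]
```

In this sketch, a decoder step would call `hybrid_attention` to obtain region weights and then `attend` over the output of `reweight_regions` to build the context vector for predicting the next word. How the human captioning attention is obtained is outside the scope of the sketch.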
