Abstract
Machine attention mechanisms are widely used in image captioning. Such mechanisms dynamically focus on different image regions to guide the word generation process. However, without explicit supervision, existing attention models may fail to concentrate on the correct regions and thereby mislead word prediction. In this study, we exploit human captioning attention, which encodes rich information that people perceive while captioning, and propose a novel Hybrid Attention Network (HAN) that combines the prevailing machine attention mechanisms with human captioning attention. The proposed HAN addresses the problem of “object hallucination” by re-weighting bottom-up attention, and improves the diversity of the generated captions by complementing top-down attention with human captioning attention. Extensive experiments on the Flickr30K and MS COCO datasets demonstrate that the proposed method effectively improves the performance of current image captioning methods.