Abstract

Bottom-up and top-down attention mechanisms have revolutionized image captioning by enabling object-level attention for multi-step reasoning over all detected objects. However, when humans describe an image, they often draw on their own subjective experience to focus on only a few salient objects worth mentioning, rather than on every object in the image. These selected objects are then arranged in linguistic order, yielding an "object sequence of interest" from which an enriched description is composed. In this work, we present the Bottom-up and Top-down Object inference Network (BTO-Net), which exploits the object sequence of interest as a top-down signal to guide image captioning. Technically, conditioned on the bottom-up signals (all detected objects), an LSTM-based object inference module is first learned to produce the object sequence of interest, which acts as a top-down prior that mimics human subjective experience. Next, the bottom-up and top-down signals are dynamically integrated via an attention mechanism for sentence generation. Furthermore, to prevent a cacophony of intermixed cross-modal signals, a contrastive learning-based objective is introduced to restrict the interaction between bottom-up and top-down signals, leading to reliable and explainable cross-modal reasoning. Our BTO-Net achieves competitive performance on the COCO benchmark, in particular 134.1% CIDEr on the COCO Karpathy test split. Source code is available at https://github.com/YehLi/BTO-Net.
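To make the described pipeline concrete, the sketch below illustrates the general idea of an LSTM-based object inference module that turns bottom-up region features into a short "object sequence of interest". It is a minimal illustration only: all module names, dimensions, and wiring are assumptions not taken from the abstract, and the authors' actual implementation should be consulted at the linked repository.

```python
# Illustrative sketch (not the authors' code) of an LSTM-based object inference
# step: soft-selecting a short sequence of objects from all detected regions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ObjectInferenceModule(nn.Module):
    """Infers an 'object sequence of interest' from bottom-up object features
    (hypothetical layer sizes; chosen only for illustration)."""

    def __init__(self, obj_dim=2048, hidden_dim=512, seq_len=5):
        super().__init__()
        self.seq_len = seq_len
        self.lstm = nn.LSTMCell(obj_dim, hidden_dim)
        # Scores each detected object against the current LSTM state.
        self.attn = nn.Linear(hidden_dim + obj_dim, 1)

    def forward(self, objects):                        # objects: (B, N, obj_dim)
        B, N, _ = objects.shape
        h = objects.new_zeros(B, self.lstm.hidden_size)
        c = objects.new_zeros(B, self.lstm.hidden_size)
        x = objects.mean(dim=1)                        # start from mean-pooled image feature
        selected = []
        for _ in range(self.seq_len):
            h, c = self.lstm(x, (h, c))
            scores = self.attn(
                torch.cat([h.unsqueeze(1).expand(-1, N, -1), objects], dim=-1)
            )
            weights = F.softmax(scores, dim=1)         # soft selection over objects
            x = (weights * objects).sum(dim=1)         # attended object = next sequence element
            selected.append(x)
        return torch.stack(selected, dim=1)            # (B, seq_len, obj_dim) top-down signal


# Example: 36 detected regions per image feed the module; the resulting sequence
# would then be fused with the bottom-up features in the captioning decoder's
# attention (fusion and contrastive objective not shown here).
objects = torch.randn(2, 36, 2048)
top_down = ObjectInferenceModule()(objects)
print(top_down.shape)                                  # torch.Size([2, 5, 2048])
```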
