Abstract

Image captioning is a challenging task that generates a natural language description based on visual understanding of a given image. Salient region representation has been a milestone in image captioning. Despite the great success of existing region-based works, they focus only on salient objects and encode those objects independently, so they still suffer from a lack of global contextual information and visual relationships. In fact, global contextual information and structured visual relationships are precisely the strengths of traditional grid features and emerging scene graph features. In this paper, we present a Triple-Stream Feature Fusion Network (TSFNet) that leverages the complementary advantages of grid, region, and scene graph visual representations for image captioning. Concretely, in our TSFNet, a novel Dual-level Attention (DA) mechanism is proposed to explore the visual intrinsic properties and word-related attributes of the different features in a unified manner. The attention-enhanced features of the different modalities are then mapped into a joint representation that guides caption generation. Moreover, we design a new global-aware decoder, which leverages the concatenated representation of the triple-stream features together with the joint attention representation to obtain global visual guidance and further refine the complex multimodal reasoning. To verify the effectiveness of our feature fusion model, we perform extensive experiments on the highly competitive MSCOCO dataset, evaluating the model both quantitatively and qualitatively. The results show that the proposed framework outperforms many state-of-the-art image captioning approaches on various evaluation metrics and generates more accurate and informative captions.
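To make the fusion idea concrete, the following is a minimal PyTorch sketch (not the authors' released code) of how three feature streams might be attended with a word-conditioned query and concatenated into a joint representation; all module names, dimensions, and the specific fusion scheme are illustrative assumptions rather than the paper's exact architecture.

```python
# Illustrative sketch only: word-conditioned attention over grid, region, and
# scene-graph streams, followed by concatenation into a joint vector.
import torch
import torch.nn as nn


class StreamAttention(nn.Module):
    """Attends over one feature stream conditioned on the decoder hidden state."""

    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int = 512):
        super().__init__()
        self.proj_feat = nn.Linear(feat_dim, attn_dim)
        self.proj_query = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, feat_dim); query: (B, hidden_dim)
        e = self.score(torch.tanh(self.proj_feat(feats)
                                  + self.proj_query(query).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)        # (B, N, 1) attention weights
        return (alpha * feats).sum(dim=1)      # (B, feat_dim) attended vector


class TripleStreamFusion(nn.Module):
    """Fuses attended grid, region, and scene-graph features into one joint vector."""

    def __init__(self, grid_dim, region_dim, graph_dim, hidden_dim, joint_dim=1024):
        super().__init__()
        self.att_grid = StreamAttention(grid_dim, hidden_dim)
        self.att_region = StreamAttention(region_dim, hidden_dim)
        self.att_graph = StreamAttention(graph_dim, hidden_dim)
        self.fuse = nn.Linear(grid_dim + region_dim + graph_dim, joint_dim)

    def forward(self, grid, region, graph, h_dec):
        v_grid = self.att_grid(grid, h_dec)
        v_region = self.att_region(region, h_dec)
        v_graph = self.att_graph(graph, h_dec)
        # Concatenate the three attended streams and project to a joint representation.
        return self.fuse(torch.cat([v_grid, v_region, v_graph], dim=-1))


# Usage with toy tensors (batch of 2 images):
fusion = TripleStreamFusion(grid_dim=2048, region_dim=2048,
                            graph_dim=1024, hidden_dim=512)
joint = fusion(
    torch.randn(2, 49, 2048),   # grid features, e.g. a 7x7 CNN feature map
    torch.randn(2, 36, 2048),   # region features from an object detector
    torch.randn(2, 20, 1024),   # scene-graph node embeddings
    torch.randn(2, 512),        # decoder hidden state at the current word step
)
print(joint.shape)  # torch.Size([2, 1024])
```

In this sketch the joint vector would feed the decoder at each word step; the paper's global-aware decoder additionally uses the concatenated triple-stream representation as global guidance, which is not modeled here.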
