Abstract

Recent advances in deep visual attention methods have greatly accelerated research on image captioning. However, how to leverage hand-crafted or deep features in the image captioning encoder remains underexplored, largely because no single type of all-purpose feature captures the full range of visual semantics. In this paper, we introduce a cascade semantic fusion architecture (CSF) that mines representative features to encode image content through an attention mechanism, without bells and whistles. Specifically, the CSF combines three types of visual attention semantics (object-level, image-level, and spatial attention features) in a novel three-stage cascade. In the first stage, object-level attention features are extracted from a pretrained detector to capture the detailed content of objects. The middle stage then devises a fusion module that merges the object-level attention features with spatial features, inducing image-level attention features that enrich the contextual information around the objects. In the last stage, spatial attention features are learned to reveal salient region representations as a complement to the two previously learned attention features. In a nutshell, we integrate the attention mechanism with three types of features to organize contextual knowledge about images from different aspects. Empirical analysis shows that the CSF helps the image captioning model select object regions of interest. Image captioning experiments on the MSCOCO dataset demonstrate the efficacy of our semantic fusion architecture in describing image content.
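To make the three-stage cascade concrete, the following is a minimal sketch of such an encoder. It is not the authors' implementation: the layer sizes, the use of multi-head attention, the mean-pooling of object context, and the final concatenation are all assumptions made purely for illustration of the object-level, image-level, and spatial attention stages described above.

```python
import torch
import torch.nn as nn


class CascadeSemanticFusion(nn.Module):
    """Illustrative three-stage cascade fusion encoder (assumed design).

    Stage 1: attend over object-level (detector) features.
    Stage 2: fuse object-level context with spatial features and attend,
             yielding image-level attention features.
    Stage 3: attend over spatial (grid) features as a complement.
    """

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.obj_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)  # hypothetical fusion module
        self.img_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spa_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, obj_feats: torch.Tensor, spatial_feats: torch.Tensor):
        # obj_feats:     (B, N_obj, dim)  region features from a pretrained detector
        # spatial_feats: (B, N_grid, dim) grid features of the whole image
        # Stage 1: object-level attention features
        obj_ctx, _ = self.obj_attn(obj_feats, obj_feats, obj_feats)

        # Stage 2: merge pooled object context with spatial features,
        # then attend to obtain image-level attention features
        pooled_obj = obj_ctx.mean(dim=1, keepdim=True).expand_as(spatial_feats)
        fused = self.fuse(torch.cat([spatial_feats, pooled_obj], dim=-1))
        img_ctx, _ = self.img_attn(fused, fused, fused)

        # Stage 3: spatial attention features as a complement
        spa_ctx, _ = self.spa_attn(spatial_feats, spatial_feats, spatial_feats)

        # Concatenate the three attention semantics for a caption decoder
        return torch.cat([obj_ctx, img_ctx, spa_ctx], dim=1)
```

A caption decoder would then attend over the concatenated output, so that object details, surrounding context, and salient regions are all available when generating each word.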

