Abstract

Part-of-speech (PoS) information has been widely leveraged in previous image captioning methods to guide the decoder in deciding whether visual information is required to generate the target word. However, existing methods primarily focus on improving the generation of visual words (VWs) while neglecting non-visual words (NVWs). In response, we introduce a novel PoS clues-aware adaptive attention mechanism (NPoSC-A3) that uses PoS clues to adaptively incorporate visual and semantic attention contexts into the language model, so that both visual and semantic information contribute to generating VWs and NVWs. NPoSC-A3 comprises four key modules: a global semantic context generator (GSCG), a PoS context generator (PoSCG), a PoS predictor (PoSP), and a PoS clues-aware adaptive attention module (PoSC-A3). GSCG produces a global semantic context that the model leverages for generating NVWs. PoSP predicts the PoS of the word to be generated at the current time step. PoSC-A3 adaptively incorporates visual and global semantic features into the decoder based on this PoS guidance. PoSCG constrains the effect of the visual and global semantic contexts on the captioning process to yield more syntactically correct captions. Extensive experiments on the MSCOCO benchmark demonstrate that the proposed method improves image captioning performance, outperforming most recent state-of-the-art approaches on standard evaluation metrics and achieving a CIDEr score of 127.2.
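To make the high-level idea concrete, the following is a minimal sketch (not the authors' released code) of one PoS clues-aware adaptive attention step, assuming a PyTorch decoder and hypothetical feature sizes: a PoS distribution predicted from the decoder state drives a gate that mixes the attended visual context with the global semantic context.

```python
# Minimal sketch of a PoS clues-aware adaptive attention step.
# Assumptions (not from the paper): PyTorch, 512-d features, 12 PoS tags.
import torch
import torch.nn as nn


class PoSCAdaptiveAttention(nn.Module):
    """Mix visual and global semantic contexts with a gate driven by
    predicted part-of-speech (PoS) clues for the next word."""

    def __init__(self, feat_dim=512, hidden_dim=512, num_pos_tags=12):
        super().__init__()
        # PoS predictor: infers a PoS distribution from the decoder state.
        self.pos_predictor = nn.Linear(hidden_dim, num_pos_tags)
        # Gate: maps PoS clues + decoder state to a scalar in (0, 1).
        self.gate = nn.Linear(num_pos_tags + hidden_dim, 1)
        # Additive attention over image region features.
        self.att_v = nn.Linear(feat_dim, hidden_dim)
        self.att_h = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)

    def forward(self, region_feats, global_semantic, hidden):
        # region_feats:    (batch, num_regions, feat_dim) visual features
        # global_semantic: (batch, feat_dim) global semantic context (GSCG)
        # hidden:          (batch, hidden_dim) current decoder state
        pos_probs = torch.softmax(self.pos_predictor(hidden), dim=-1)

        # Attention weights over image regions and the attended visual context.
        scores = self.att_score(torch.tanh(
            self.att_v(region_feats) + self.att_h(hidden).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)
        visual_ctx = (alpha * region_feats).sum(dim=1)

        # PoS-driven gate: large for visual words, small for non-visual words.
        beta = torch.sigmoid(self.gate(torch.cat([pos_probs, hidden], dim=-1)))
        context = beta * visual_ctx + (1.0 - beta) * global_semantic
        return context, pos_probs
```

In this sketch the gate plays the role the abstract assigns to PoS guidance: when the predicted PoS suggests a visual word the mixed context leans on the attended region features, and for non-visual words it leans on the global semantic context.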
