Abstract
In image captioning, rich semantic information serves as guidance for generating the critical words of a caption. However, semantic information from offline object detectors includes many semantic objects that do not appear in the caption, which introduces noise into the decoding process. To produce more accurate semantic guidance and further optimize decoding, we propose an end-to-end adaptive semantic-enhanced transformer (AS-Transformer) model for image captioning. For semantic-enhanced information extraction, we propose a constrained weakly supervised learning (CWSL) module, which uses a joint loss function to reconstruct the probability distribution of semantic objects detected by multiple instance learning (MIL). The strengthened semantic objects drawn from the reconstructed distribution better depict the semantic meaning of images. For semantic-enhanced decoding, we propose an adaptive gated mechanism (AGM) module that adaptively adjusts the attention between visual and semantic information so that caption words are generated more accurately. Through the joint control of the CWSL and AGM modules, the proposed model forms a complete adaptive enhancement mechanism from encoding to decoding and obtains visual context that is better suited to captions. Experiments on the public Microsoft Common Objects in COntext (MSCOCO) and Flickr30K datasets show that the proposed AS-Transformer adaptively obtains effective semantic information and automatically adjusts the attention weights between semantic and visual information. It generates more accurate captions than existing semantic enhancement methods and outperforms state-of-the-art methods.
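The abstract does not give the AGM formulation, so the following is only a minimal sketch of the general idea of gating between a visual and a semantic context vector. The function name `adaptive_gate`, the scalar sigmoid gate, and the parameters `weights` and `bias` are all assumptions made for illustration and are not taken from the paper.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a real score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def adaptive_gate(visual, semantic, weights, bias=0.0):
    """Blend two equal-length context vectors with a scalar gate.

    Hypothetical formulation (an assumption, not the paper's AGM):
        g     = sigmoid(weights . [visual; semantic] + bias)
        fused = g * visual + (1 - g) * semantic
    """
    if len(visual) != len(semantic):
        raise ValueError("visual and semantic vectors must have equal length")
    features = list(visual) + list(semantic)
    if len(weights) != len(features):
        raise ValueError("weights must match the concatenated feature length")

    # Scalar gate score computed from both information sources.
    score = sum(w * x for w, x in zip(weights, features)) + bias
    g = sigmoid(score)

    # Convex combination: g near 1 favors visual, g near 0 favors semantic.
    fused = [g * v + (1.0 - g) * s for v, s in zip(visual, semantic)]
    return g, fused


if __name__ == "__main__":
    gate, fused = adaptive_gate([1.0, 2.0], [3.0, 4.0], [0.0, 0.0, 0.0, 0.0])
    print(gate, fused)
```

With all-zero weights and zero bias the gate is exactly 0.5, so the fused vector is the plain average of the two inputs. In a trained model the gate parameters would be learned, letting the decoder lean on visual or semantic context per generated word.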
More From: IEEE Transactions on Neural Networks and Learning Systems