Abstract

Image captioning aims to describe the content of an image and plays a critical role in image understanding. Existing methods mostly generate text for natural images, whose content is comparatively distinct. These models do not perform well on paintings, which carry more abstract meaning, because objective parsing alone is limited without related knowledge. To alleviate this problem, we propose a novel cross-modality decoupled model that generates the objective and subjective parsing separately. Concretely, we propose to encode both the subjective semantics and the implied knowledge contained in paintings. The key component of our framework is a pair of decoupled CLIP-based branches (DecoupleCLIP). For the objective caption branch, we use the CLIP model as the global feature extractor and construct a feature fusion module for global clues. Building on the structure of the objective caption branch, we add a multimodal fusion module, which we call the artistic conception branch. In this way, the objective captions can constrain the artistic conception content. We conduct extensive experiments on our new dataset to demonstrate the superior ability of DecoupleCLIP. Our model achieves nearly 2% improvement in CIDEr over the comparison models.
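To make the decoupled data flow concrete, the following is a minimal sketch under stated assumptions, not the implementation used in this work. A plain list of floats stands in for the global image feature that CLIP would provide, each branch is reduced to a single dense layer with a tanh activation, and fusion is modelled as simple concatenation. The names `DecoupledCaptioner`, `objective_branch`, `artistic_branch`, `feat_dim`, and `hidden_dim` are hypothetical and chosen only for illustration.

```python
import math
import random


def linear(x, weights):
    """Dense layer without bias; `weights` is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def random_matrix(rows, cols, rng):
    """Small random weights, scaled by the input width."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class DecoupledCaptioner:
    """Toy two-branch model (hypothetical; all names are illustrative)."""

    def __init__(self, feat_dim=8, hidden_dim=4, seed=0):
        rng = random.Random(seed)
        # Objective branch: sees only the global image feature.
        self.w_obj = random_matrix(hidden_dim, feat_dim, rng)
        # Artistic conception branch: sees the image feature concatenated
        # with the objective branch state (assumed fusion by concatenation).
        self.w_art = random_matrix(hidden_dim, feat_dim + hidden_dim, rng)

    def objective_branch(self, image_feat):
        return [math.tanh(v) for v in linear(image_feat, self.w_obj)]

    def artistic_branch(self, image_feat, objective_state):
        fused = image_feat + objective_state  # list concatenation
        return [math.tanh(v) for v in linear(fused, self.w_art)]

    def forward(self, image_feat):
        objective_state = self.objective_branch(image_feat)
        artistic_state = self.artistic_branch(image_feat, objective_state)
        return objective_state, artistic_state


# Stand-in for a CLIP global image feature (assumption: 8 dimensions).
image_feat = [0.1 * i for i in range(8)]
model = DecoupledCaptioner()
objective_state, artistic_state = model.forward(image_feat)
```

Because the artistic conception branch takes the objective state as part of its input, changing the objective output changes the artistic output. This is one simple way to read the statement that the objective captions constrain the artistic conception content. A real system would replace the toy layers with caption decoders.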
