Abstract

Recently, transformer-based image captioning models have achieved significant performance improvements. However, owing to the limitations of region visual features and the deterministic projection between image space and caption space, existing methods still suffer from entangled visual features and rigid sentences. To address these issues, we first introduce panoptic segmentation to extract segmentation region features, which effectively alleviates the visual confusion caused by the widely adopted region visual features. We then propose a panoptic-segmentation-based sequential conditional variational transformer (PS-SCVT) framework for diverse image captioning, which not only extracts accurate image visual representations by fusing segmentation region features with object detection features, but also learns one-to-many mappings from image space to caption space. Experimental results demonstrate that our approach achieves better interpretability and generalization than state-of-the-art diverse image captioning models.
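The two ideas in the abstract can be illustrated with a minimal NumPy sketch: fusing segmentation-region and detection features into one visual representation, and sampling a latent code via the reparameterization trick so the same image can condition many different captions. All names, dimensions, and the fusion-by-projection step are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_features(seg_feats, det_feats, w):
    # Hypothetical fusion: concatenate per-region segmentation and detection
    # features, then project with a matrix w (random here, learned in practice).
    fused = np.concatenate([seg_feats, det_feats], axis=-1)
    return fused @ w

def sample_latent(mu, log_var, rng):
    # CVAE reparameterization: z = mu + sigma * eps. Each draw of eps gives a
    # different latent code from the same image, enabling one-to-many mappings.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# Toy setup: 5 regions, 8-dim segmentation and detection features each.
seg = rng.standard_normal((5, 8))
det = rng.standard_normal((5, 8))
w = rng.standard_normal((16, 8))

fused = fuse_features(seg, det, w)          # (5, 8) fused visual representation
mu, log_var = fused.mean(axis=0), np.zeros(8)
z1 = sample_latent(mu, log_var, rng)
z2 = sample_latent(mu, log_var, rng)
# Two draws differ, so a decoder conditioned on them can emit diverse captions.
print(fused.shape, np.allclose(z1, z2))
```

In a full model the latent code would condition a transformer decoder at each step; the sketch only shows why stochastic sampling, unlike a deterministic projection, yields diverse outputs for one image.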
