Transformer-based image captioning models have recently achieved significant performance improvements. However, due to the limitations of region visual features and the deterministic projection from image space to caption space, existing methods still suffer from entangled visual features and rigid sentences. To address these issues, we first introduce panoptic segmentation to extract segmentation region features, which effectively alleviates the visual confusion caused by the widely adopted region visual features. We then propose a panoptic segmentation based sequential conditional variational transformer (PS-SCVT) framework for diverse image captioning, which not only extracts accurate visual representations by fusing the segmentation region features with object detection features, but also learns one-to-many mappings from image space to caption space. Experimental results demonstrate that our approach achieves better interpretability and generalization than state-of-the-art diverse image captioning models.
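To make the two ideas named above concrete, the following is a minimal sketch in PyTorch of (1) fusing panoptic segmentation region features with object detection features and (2) a per-step conditional variational layer that samples a latent code during decoding, which is what enables one-to-many image-to-caption mappings. All module, parameter, and dimension names here are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of feature fusion and a sequential conditional
# variational step; names and shapes are assumptions for illustration.
import torch
import torch.nn as nn


class FeatureFusion(nn.Module):
    """Project segmentation and detection region features into a shared
    space and concatenate them along the region axis."""

    def __init__(self, seg_dim: int, det_dim: int, d_model: int):
        super().__init__()
        self.seg_proj = nn.Linear(seg_dim, d_model)
        self.det_proj = nn.Linear(det_dim, d_model)

    def forward(self, seg_feats, det_feats):
        # seg_feats: (B, N_seg, seg_dim); det_feats: (B, N_det, det_dim)
        fused = torch.cat([self.seg_proj(seg_feats),
                           self.det_proj(det_feats)], dim=1)
        return fused  # (B, N_seg + N_det, d_model)


class SequentialCVAEStep(nn.Module):
    """Per-step conditional variational layer: sample z_t from a posterior
    (training, conditioned on the target caption) or a prior (inference),
    then inject z_t back into the decoder state."""

    def __init__(self, d_model: int, z_dim: int):
        super().__init__()
        self.prior = nn.Linear(d_model, 2 * z_dim)         # -> (mu, logvar)
        self.posterior = nn.Linear(2 * d_model, 2 * z_dim)
        self.inject = nn.Linear(d_model + z_dim, d_model)

    def forward(self, h_t, target_ctx=None):
        # h_t: (B, d_model) decoder state; target_ctx: (B, d_model) or None
        if target_ctx is not None:   # training: posterior q(z_t | h_t, y)
            mu, logvar = self.posterior(
                torch.cat([h_t, target_ctx], dim=-1)).chunk(2, dim=-1)
        else:                        # inference: prior p(z_t | h_t)
            mu, logvar = self.prior(h_t).chunk(2, dim=-1)
        # Reparameterization trick: z = mu + sigma * eps
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.inject(torch.cat([h_t, z], dim=-1)), mu, logvar
```

At inference time, drawing different samples of z_t from the prior at each decoding step yields distinct captions for the same image, which is the variational mechanism behind the diversity claim; mu and logvar are returned so a KL term between posterior and prior can be added to the training loss.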