Abstract

Image paragraph captioning involves generating a semantically coherent paragraph that describes an image’s visual content. The selection and shifting of sentence topics are critical when a human describes an image. However, previous hierarchical image paragraph captioning methods have not fully explored or utilized sentence topics. In particular, the continuous and implicit modeling of topics in these methods makes it difficult to supervise the topic prediction process explicitly. We propose a new method called topic clustering and topic shift prediction (TCTSP) to solve this problem. Topic clustering (TC) in the sentence embedding space generates semantically explicit, discrete topic labels that can directly supervise topic prediction. By introducing a topic shift probability matrix that characterizes human topic shift patterns, topic shift prediction (TSP) predicts subsequent topics that are both logical and consistent with human habits, based on visual features and language context. TCTSP can be combined with various image paragraph captioning model structures to improve performance. Extensive experiments on the Stanford image paragraph dataset show that TCTSP outperforms previous state-of-the-art approaches, improving the consensus-based image description evaluation (CIDEr) score for image paragraph captioning to 41.67%. The code is available at https://github.com/tt0059/TCTSP.
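
For concreteness, the sketch below illustrates the two ideas named in the abstract: clustering sentence embeddings into discrete topic labels, and estimating a topic shift probability matrix from the topic sequences of training paragraphs. It is a minimal illustration under stated assumptions, not the paper's implementation: the choice of k-means, the number of topics, the embedding dimensionality, and all function names are hypothetical.

```python
# Minimal sketch of topic clustering (TC) and topic-shift-matrix estimation.
# Assumptions: sentence embeddings come from some off-the-shelf encoder;
# k-means and the topic count are illustrative, not the paper's exact choices.
import numpy as np
from sklearn.cluster import KMeans


def cluster_topics(sentence_embeddings: np.ndarray, num_topics: int):
    """Assign each sentence embedding a discrete topic label via k-means."""
    kmeans = KMeans(n_clusters=num_topics, n_init=10, random_state=0)
    labels = kmeans.fit_predict(sentence_embeddings)
    return labels, kmeans.cluster_centers_


def topic_shift_matrix(paragraph_topic_sequences, num_topics: int):
    """Estimate P(next topic | current topic) by counting consecutive-sentence
    topic transitions within each training paragraph."""
    counts = np.zeros((num_topics, num_topics))
    for topics in paragraph_topic_sequences:
        for cur, nxt in zip(topics[:-1], topics[1:]):
            counts[cur, nxt] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Normalize each row to a probability distribution; rows for unseen
    # topics stay all-zero instead of dividing by zero.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)


# Usage with random vectors standing in for real sentence embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 768))   # 200 sentences, 768-d embeddings (hypothetical)
labels, _ = cluster_topics(embeddings, num_topics=10)
# Mock paragraphs of 5 sentences each, just to show the expected input shape.
sequences = [labels[i:i + 5].tolist() for i in range(0, 200, 5)]
P = topic_shift_matrix(sequences, num_topics=10)
print(P.shape)                              # (10, 10) topic shift probability matrix
```

In the paper's setting, the resulting discrete labels give an explicit supervision signal for topic prediction, and the transition matrix summarizes how human-written paragraphs shift between topics; how these quantities are wired into the captioning model is described in the full text.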
