Abstract

Image captioning achieves strong performance when generating captions for general purposes, but it remains difficult to adjust the generated captions to different applications. In this paper, we propose an image captioning method that can generate both imageability- and length-controllable captions. The imageability parameter adjusts the level of visual descriptiveness of the caption, making it either more abstract or more concrete. In contrast, the length parameter adjusts only the length of the caption while keeping the visual descriptiveness at a similar degree. Based on a transformer architecture, our model is trained on an augmented dataset with captions diversified across different degrees of descriptiveness. The resulting model can control both imageability and length, making it possible to tailor its output towards various applications. Experiments show that we maintain a captioning performance similar to that of comparison methods while being able to control the visual descriptiveness and the length of the generated captions. A subjective evaluation with human participants also shows a significant correlation between the target imageability and human ratings. We thus confirm that the proposed method provides a promising step towards tailoring image captions to specific applications.

Highlights

  • Image captioning shows great performance in generating captions for general purposes and receives great attention in the research community [15], [22], [43]

  • Note that imageability and length encode different properties: changing imageability adjusts the visual descriptiveness of the caption at a given length, while changing length adjusts the wordiness while keeping the content similar

  • For the second and third experiments, we focus on a deeper evaluation of the imageability-controllable part of the transformer-based model and its differences from the previous Long Short-Term Memory (LSTM)-based work [36] for generating captions with different degrees of visual descriptiveness

Introduction

Image captioning shows great performance in generating captions for general purposes and receives great attention in the research community [15], [22], [43]. However, it remains difficult to tailor the generated captions to a variety of applications. The reasons are manifold: First, image captioning approaches usually aim to generate captions close to those in existing training data, and they are evaluated based on their similarity to the testing data. Both the datasets and the evaluation metrics are built under the assumption of general-purpose image captioning. This generally results in a very low diversity of generated captions, a problem that some research has tried to tackle [9], [39], [41]. Recent research towards caption diversification proposes introducing control parameters, for example in length-controllable models [7].
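One common way to realize such control parameters is to discretize each target attribute into a small vocabulary of control tokens that condition the decoder. The sketch below illustrates this idea for imageability and length; the bucket boundaries, token names, and the choice of prepending control tokens to the decoder input are illustrative assumptions for this sketch, not the exact implementation of the method described here.

```python
# Illustrative sketch: mapping continuous control targets to discrete
# control tokens that a transformer decoder could be conditioned on.
# Bucket thresholds and token names are hypothetical assumptions.

def imageability_token(score: float) -> str:
    """Map a target imageability score in [0, 1] to a control token."""
    if score < 1 / 3:
        return "<IMG_LOW>"   # more abstract, less visually descriptive
    if score < 2 / 3:
        return "<IMG_MID>"
    return "<IMG_HIGH>"      # more concrete, highly visually descriptive

def length_token(n_words: int) -> str:
    """Map a target caption length (in words) to a control token."""
    if n_words <= 8:
        return "<LEN_SHORT>"
    if n_words <= 14:
        return "<LEN_MED>"
    return "<LEN_LONG>"

def build_decoder_prefix(imageability: float, target_len: int) -> list:
    """Control tokens prepended to the caption tokens during both
    training and decoding, steering the generated caption."""
    return [imageability_token(imageability), length_token(target_len)]

# Example: request a highly descriptive yet short caption.
prefix = build_decoder_prefix(imageability=0.9, target_len=7)
print(prefix)  # ['<IMG_HIGH>', '<LEN_SHORT>']
```

At training time, each (image, caption) pair would be tagged with the tokens computed from that caption's measured imageability and length, so the model learns to associate the tokens with the corresponding caption style; at inference time, the user sets the tokens to steer generation.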

