Abstract

Speech interfaces, such as personal assistants and screen readers, read image captions to users. Typically, however, only one caption is available per image, which may not be adequate for all situations (e.g., browsing large quantities of images). Long captions provide a deeper understanding of an image but require more time to listen to, whereas shorter captions may not allow for such thorough comprehension yet have the advantage of being faster to consume. We explore how to effectively collect both thumbnail captions (succinct image descriptions meant to be consumed quickly) and comprehensive captions (which allow individuals to understand visual content in greater detail). We consider text-based instructions and time-constrained methods to collect descriptions at these two levels of detail and find that a time-constrained method is the most effective for collecting thumbnail captions while preserving caption accuracy. Additionally, by tracking caption authors' eye gaze, we verify that authors using this time-constrained method are still able to focus on the most important regions of an image. We evaluate our collected captions along human-rated axes (correctness, fluency, amount of detail, and mentions of important concepts) and discuss the potential for model-based metrics to perform large-scale automatic evaluations in the future.

