Abstract

Text present in an image carries rich semantic information that is crucial for understanding the image. For example, a signboard with the text “deep water” conveys the danger present in the scene. Current image captioning models do not exploit this semantic information effectively because of their limited ability to represent scene-text tokens. Our work presents a novel image captioning model called RelNet-MAM, which combines a multilevel attention mechanism with a relation network. To improve the appearance feature representation, RelNet-MAM uses multilevel attention consisting of spatial attention, channel-wise attention, and semantic attention. To represent each scene-text token effectively, RelNet-MAM combines appearance, FastText, location, and PHOC features. Further, the proposed RelNet-MAM uses the relation network to establish relationships between objects and scene-text tokens. Finally, a transformer model together with a dynamic pointer network is used as the decoder for caption generation. The proposed RelNet-MAM model outperforms state-of-the-art models on the TextCaps, Flickr30k, and MS COCO datasets. TextCaps requires models to read and reason about the text in an image to generate a caption, while MS COCO and Flickr30k contain diverse images of persons, animals, automobiles, and indoor and outdoor scenes. Notably, the proposed RelNet-MAM model outperforms the current best model on the TextCaps dataset by 2.3% on B-4, 1.8% on METEOR, 2.2% on ROUGE-L, 2.0% on CIDEr-D, and 3.0% on SPICE.
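To make the scene-text token representation concrete, the following is a minimal sketch of how the appearance, FastText, location, and PHOC features of a token could be fused into a single embedding. The class name, feature dimensions (2048-d appearance, 300-d FastText, 604-d PHOC, 4-d bounding-box location), and the sum-then-normalize fusion are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class SceneTextTokenEmbedding(nn.Module):
    """Illustrative fusion of per-token scene-text features.

    Each modality is projected to a common dimension, summed, and
    layer-normalized. Dimensions are typical values, not taken from
    the paper.
    """
    def __init__(self, d_model=768, d_appear=2048,
                 d_fasttext=300, d_phoc=604, d_loc=4):
        super().__init__()
        self.proj_appear = nn.Linear(d_appear, d_model)
        self.proj_fasttext = nn.Linear(d_fasttext, d_model)
        self.proj_phoc = nn.Linear(d_phoc, d_model)
        self.proj_loc = nn.Linear(d_loc, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, appear, fasttext, phoc, loc):
        # Project each feature to d_model, sum, and normalize.
        fused = (self.proj_appear(appear)
                 + self.proj_fasttext(fasttext)
                 + self.proj_phoc(phoc)
                 + self.proj_loc(loc))
        return self.norm(fused)

# Usage: embed 12 detected scene-text tokens from one image.
embedder = SceneTextTokenEmbedding()
emb = embedder(torch.randn(12, 2048), torch.randn(12, 300),
               torch.randn(12, 604), torch.randn(12, 4))
print(emb.shape)  # torch.Size([12, 768])
```

The resulting token embeddings would then be fed, together with object features, to the relation network and the transformer decoder with the dynamic pointer network described above.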
