Text present in an image carries rich semantic information that is crucial for understanding the image. For example, a signboard reading "deep water" conveys the danger present in the scene. Current image captioning models do not exploit this semantic information effectively because of their limited ability to represent scene-text tokens. Our work presents a novel image captioning model, RelNet-MAM, which combines a multilevel attention mechanism with a relation network. To improve the appearance feature representation, RelNet-MAM employs multilevel attention consisting of spatial attention, channel-wise attention, and semantic attention. To represent each scene-text token effectively, RelNet-MAM combines its appearance, FastText, location, and PHOC features. Further, the proposed RelNet-MAM uses the relation network to model the relationships between objects and scene-text tokens. Finally, a transformer model together with a dynamic pointer network serves as the decoder during caption generation. The proposed RelNet-MAM model outperforms state-of-the-art models on the TextCaps, Flickr30k, and MS COCO datasets. TextCaps requires models to read and reason about the text in an image to generate captions, while MS COCO and Flickr30k contain diverse images: persons, animals, automobiles, and indoor and outdoor scenes. Remarkably, the proposed RelNet-MAM model surpasses the current best model on the TextCaps dataset by 2.3% on B-4, 1.8% on METEOR, 2.2% on ROUGE-L, 2.0% on CIDEr-D, and 3.0% on SPICE.
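The spatial and channel-wise branches of the multilevel attention described above can be sketched roughly as follows. The shapes, the dot-product scoring, and the additive fusion of the two branches are illustrative assumptions, not the paper's exact formulation; `query` stands in for a hypothetical decoder state.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(feats, query):
    # feats: (N, D) appearance features of N image regions.
    # query: (D,) hypothetical decoder state.
    scores = feats @ query / np.sqrt(feats.shape[1])  # (N,) region scores
    alpha = softmax(scores)                           # weights over regions
    return alpha @ feats                              # (D,) attended feature

def channel_attention(feats, query):
    # Attend over the D feature channels instead of the N regions.
    channel_summary = feats.mean(axis=0)              # (D,) per-channel mean
    beta = softmax(channel_summary * query)           # (D,) channel weights
    return beta * channel_summary                     # re-weighted channels

N, D = 36, 64                     # e.g. 36 region proposals, 64-dim features
feats = rng.normal(size=(N, D))
query = rng.normal(size=(D,))

spat = spatial_attention(feats, query)
chan = channel_attention(feats, query)
fused = spat + chan               # one simple way to fuse the two levels
print(fused.shape)                # (64,)
```

In this sketch each attention level produces a D-dimensional vector, so the two levels can be fused by simple addition; a semantic-attention branch over attribute embeddings could be combined the same way.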