Abstract

Transformer-based approaches have shown good results on image captioning tasks. However, current approaches are limited in generating text from the global features of an entire image. We therefore propose novel methods for generating better image captions: (1) the Global-Local Visual Extractor (GLVE), which captures both global and local features, and (2) the Cross Encoder-Decoder Transformer (CEDT), which injects multi-level encoder features into the decoding process. GLVE extracts not only global visual features obtainable from the entire image, such as the size of an organ or the bone structure, but also local visual features from local regions, such as a lesion area. Given an image, CEDT can create a detailed description of its overall features by injecting both low-level and high-level encoder outputs into the decoder. Each method contributes to the performance improvement and helps generate descriptions of features such as organ size and bone structure. The proposed model was evaluated on the IU X-ray dataset and outperformed the transformer-based baseline by 5.6% in BLEU, 0.56% in METEOR, and 1.98% in ROUGE-L.
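The core idea of CEDT is that the decoder's cross-attention reads from both low-level and high-level encoder outputs rather than only the final encoder layer. The following is a minimal pure-Python sketch of that idea, not the authors' implementation; all names, shapes, and values are hypothetical.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)                       # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]    # softmax over all memory tokens
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d)]

# Hypothetical encoder outputs: 2 tokens each from a low-level and a
# high-level encoder layer (feature dimension 3).
low_level = [[0.1, 0.2, 0.3], [0.0, 0.1, 0.0]]
high_level = [[0.9, 0.8, 0.7], [0.5, 0.4, 0.3]]

# The cross-attention memory is the concatenation of both levels along
# the token axis, so the decoder attends to both at once.
memory = low_level + high_level

decoder_query = [0.5, 0.5, 0.5]
context = attend(decoder_query, memory, memory)
print(len(context))  # → 3 (one context vector of the feature dimension)
```

In a full transformer this concatenated memory would feed every decoder layer's cross-attention, letting low-level detail and high-level abstraction both influence each generated word.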

Highlights

  • Image captioning is a task that automatically generates a description of a given image. In the medical field, the technique can be used to generate medical reports from X-ray or CT images

  • We set the R2Gen model as our baseline because it is highly scalable: it consists of variations of the standard transformer without specialized methods that would limit modifications to the model, and its memory structure suits the patterned descriptions found in X-ray medical reports

  • We proposed methods based on R2Gen to improve text generation for global features, which is a weakness of transformer-based image captioning models



Introduction

Image captioning is a task that automatically generates a description of a given image. In the medical field, the technique can be used to generate medical reports from X-ray or CT images. A model that automatically generates reports on medical images can help doctors focus on notable image areas or explain their findings, reducing medical errors and the cost per test. Previous studies have not provided sufficient descriptions of global features, such as bone structure, or of more detailed size information that spreads across the whole image. This is because the inputs of a transformer-based model are generated by splitting the image into patches. This method describes local features well but omits information, such as the size of an organ or the bone structure, that must be judged from an overall understanding of the image.
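The limitation above can be illustrated with a small sketch: patch features alone carry only local information, so a whole-image (global) feature is kept alongside them. This is a hypothetical simplification of the GLVE intuition, here approximating the global feature by average-pooling the patch features; the actual extractor is a separate branch over the full image.

```python
def global_average_pool(patches):
    """Average per-patch features into one global feature vector."""
    d = len(patches[0])
    n = len(patches)
    return [sum(p[i] for p in patches) / n for i in range(d)]

# Hypothetical per-patch features from splitting an image into 4 patches
# (feature dimension 2).
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

global_feature = global_average_pool(patches)

# Visual extractor output: the global feature is prepended to the local
# patch features, so downstream attention can use both kinds of cue.
visual_tokens = [global_feature] + patches
print(len(visual_tokens))  # → 5
```

Downstream, a caption decoder attending over `visual_tokens` can ground size or structure statements in the global token while grounding lesion descriptions in the patch tokens.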

