Abstract

Computerized medical image report generation is of great significance for automating the workflow of medical diagnosis and treatment and for reducing health disparities. The task is challenging, however: the generated report must be precise and coherent while conveying heterogeneous information. Current deep-learning-based medical image captioning models rely on recurrent neural networks and extract only top-down visual features, which makes them slow and prone to generating incoherent, hard-to-comprehend reports. To tackle this problem, this paper proposes a hierarchical Transformer-based medical imaging report generation model consisting of two parts: (1) an Image Encoder, which identifies regions of interest via a bottom-up attention module and extracts their top-down visual features; and (2) a non-recurrent, Transformer-based Captioning Decoder, which generates a coherent paragraph of the medical imaging report and improves computational efficiency through parallel computation. The proposed model is trained with a self-critical reinforcement learning method. We evaluate it on the publicly available IU X-ray dataset. The experimental results show that our model improves BLEU-1 by more than 50% compared with other state-of-the-art image captioning methods.
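The abstract includes no code; the following is a minimal PyTorch sketch of the two mechanisms named above: a non-recurrent Transformer decoder attending over bottom-up region features, and a self-critical reinforcement learning loss in the style of Rennie et al. (2017). All names, tensor shapes, and hyperparameters (d_model=512, 36 regions, the scst_loss helper) are illustrative assumptions rather than the authors' implementation.

    import torch
    import torch.nn as nn

    # Non-recurrent Captioning Decoder (sketch). 'memory' stands in for the
    # bottom-up region features produced by the Image Encoder: one d_model-
    # dimensional vector per detected region of interest (shapes assumed).
    d_model, num_regions, seq_len, batch = 512, 36, 40, 8
    memory = torch.randn(num_regions, batch, d_model)  # region features
    tgt = torch.randn(seq_len, batch, d_model)         # shifted report-token embeddings

    decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8)
    decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
    hidden = decoder(tgt, memory)  # (seq_len, batch, d_model), computed in parallel

    # Self-critical loss: the greedy-decoded report serves as the reward
    # baseline, so sampled reports that score higher (e.g. in BLEU) than the
    # greedy one are reinforced and lower-scoring ones are penalized.
    def scst_loss(sample_logprobs, sample_reward, greedy_reward, mask):
        # sample_logprobs: (batch, seq_len) log-probs of the sampled tokens
        # sample_reward, greedy_reward: (batch,) sentence-level scores, e.g. BLEU
        # mask: (batch, seq_len), 1 for real tokens, 0 for padding
        advantage = (sample_reward - greedy_reward).unsqueeze(1)
        loss = -advantage * sample_logprobs * mask
        return loss.sum() / mask.sum()

Because the decoder is non-recurrent, all report tokens in a training batch are processed in one parallel forward pass, which is where the claimed efficiency gain over RNN-based captioners comes from.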
