Abstract

The automatic generation of medical reports can reduce radiologists' workload and improve the intelligence of computer-aided diagnosis, but it still faces the following challenges: (1) small lesions are easily overlooked, leading to the loss of crucial information and low report accuracy; (2) the generated long-text reports often suffer from jumbled word and sentence order, resulting in poor fluency. By simulating the cognitive principles that professional physicians follow during their training and practice, this paper puts forward a medical report generation method that integrates a teacher–student model with an encoder–decoder network. The core idea is a cross-modal teacher (text)–student (image) model that adopts different supervision strategies at different stages of report generation to improve the model's learning performance. A semantic space alignment mechanism is designed to enhance cross-modal feature matching by aligning the encodings of the two modalities through adversarial learning, gradually optimizing the features and capturing the critical information. A layer-supervised decoder based on the Transformer's hierarchical structure is proposed, in which the teacher model guides the student model to decode layer by layer, improving the fluency of the generated reports. Comparative experiments against various other methods are conducted on the IU-X-ray and MIMIC-CXR datasets, and the results show that the proposed method can effectively improve the quality of the generated reports.
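To make the two training signals described above concrete, the following is a minimal sketch of how the adversarial semantic space alignment and the layer-by-layer teacher supervision could be expressed as loss terms. It assumes a PyTorch setting; the module names, feature dimensions, and loss forms here are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityDiscriminator(nn.Module):
    """Discriminator for adversarial alignment: it tries to tell image-branch
    features from text-branch features, while the encoders are trained to fool it,
    pulling both modalities into a shared semantic space."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Returns one logit per sample: high = "looks like text", low = "looks like image".
        return self.net(feats).squeeze(-1)


def semantic_alignment_loss(img_feats, disc):
    """Generator-side adversarial loss (assumed form): push image features toward
    the region the discriminator labels as 'text', encouraging cross-modal alignment."""
    logits = disc(img_feats)
    return F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))


def layer_supervision_loss(student_hidden, teacher_hidden):
    """Layer-supervised decoding (assumed form): match the student (image) decoder's
    hidden states to the frozen teacher (text) decoder's hidden states at each
    Transformer layer, so the student is guided layer by layer."""
    losses = [F.mse_loss(s, t.detach()) for s, t in zip(student_hidden, teacher_hidden)]
    return sum(losses) / len(losses)


# Toy usage with random tensors (batch=4, hidden dim=512, 3 decoder layers):
disc = ModalityDiscriminator(512)
img_feats = torch.randn(4, 512, requires_grad=True)
student = [torch.randn(4, 10, 512, requires_grad=True) for _ in range(3)]
teacher = [torch.randn(4, 10, 512) for _ in range(3)]
total = semantic_alignment_loss(img_feats, disc) + layer_supervision_loss(student, teacher)
total.backward()
```

In practice the discriminator would be updated with the opposite objective (real text vs. image features), and the two losses would be weighted against the standard report-generation cross-entropy; those weights are not specified in the abstract.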
