Chest radiology report generation plays a vital role in supporting diagnosis, alleviating physician workload, and reducing the risk of misdiagnosis. However, significant challenges persist: (1) data bias and background noise in chest images often obscure subtle lesion details, leading models to generate overly similar reports; (2) the distinct modal spaces of radiology images and reports weaken the semantic correlation between detailed visual lesion features and report sentences; and (3) generated reports often omit crucial patient background and disease-extent details, degrading report quality and accuracy. To address these challenges, this paper proposes a novel approach for generating chest radiology reports that combines denoising multi-level cross-attention with multi-level contrastive learning. The method first encodes frontal and lateral radiology images sequentially with a visual extractor to strengthen semantic coherence across image patches and improve visual feature representation; the enhanced visual features are then passed through denoising multi-level cross-attention, which suppresses noise and highlights subtle lesion details. Second, a multi-level contrastive learning module applies contrastive learning among images, report text, and disease labels to distinguish positive samples from negative ones, thereby strengthening the semantic correlation between detailed visual lesion features and report sentences. Finally, relevant knowledge is incorporated into the report generator to enrich the description of patient lesion details. Comparative experiments against other state-of-the-art methods on the IU-Xray and MIMIC-CXR datasets demonstrate that the proposed method significantly improves performance, and ablation studies confirm that each module contributes to the quality of the generated reports.
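The abstract does not specify how the multi-level contrastive objective is formulated. As a rough illustration only, the sketch below shows a generic InfoNCE-style contrastive loss applied pairwise among image, report, and disease-label embeddings; all function names, the pairing scheme, and the temperature value are assumptions for exposition, not the paper's released implementation.

```python
# Hypothetical sketch: pairwise InfoNCE contrastive losses among image,
# report, and disease-label embeddings of the same study. Within a batch,
# matched rows are positives and all other rows act as negatives.
import torch
import torch.nn.functional as F


def info_nce(anchors: torch.Tensor, positives: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of (anchor, positive) pairs."""
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.t() / temperature          # (B, B) cosine-similarity logits
    targets = torch.arange(a.size(0), device=a.device)  # diagonal = positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def multi_level_contrastive_loss(img_emb: torch.Tensor,
                                 txt_emb: torch.Tensor,
                                 lbl_emb: torch.Tensor) -> torch.Tensor:
    """Aligns the three modalities at three levels: image<->report,
    image<->disease label, and report<->disease label."""
    return (info_nce(img_emb, txt_emb) +
            info_nce(img_emb, lbl_emb) +
            info_nce(txt_emb, lbl_emb))


# Toy usage with random features: a batch of 8 studies, 256-dim embeddings.
B, D = 8, 256
loss = multi_level_contrastive_loss(torch.randn(B, D),
                                    torch.randn(B, D),
                                    torch.randn(B, D))
print(loss.item())
```

In this reading, pulling matched image, report, and label embeddings together while pushing apart mismatched studies is what would strengthen the image-to-sentence semantic correlation the abstract describes.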