Abstract
Automatic radiology report generation is a key goal of intelligent medicine, as it can ease the tedious reporting workload of radiologists. Previous studies mainly focused on text generation with encoder-decoder structures, while the CNNs used for visual feature extraction ignored long-range dependencies that correlate with the textual information. In addition, few studies exploit cross-modal mappings to promote radiology report generation. To alleviate the above problems, we propose a novel end-to-end radiology report generation model dubbed the Self-Supervised dual-Stream Network (S3-Net). Specifically, a Dual-Stream Visual Feature Extractor (DSVFE) composed of a ResNet and a Swin Transformer is proposed to capture richer and more effective visual features, where the former focuses on local responses and the latter explores long-range dependencies. We then introduce a Fusion Alignment Module (FAM) to fuse the dual-stream visual features and facilitate alignment between the visual and textual features. Furthermore, Self-Supervised Learning with Mask (SSLM) is introduced to further enhance the visual feature representation ability. Experimental results on two mainstream radiology reporting datasets (IU X-ray and MIMIC-CXR) show that our proposed approach outperforms previous models in terms of language generation metrics.
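The abstract describes a dual-stream visual backbone in which a CNN stream captures local responses and a Swin Transformer stream models long-range dependencies before the two are fused. The sketch below illustrates one plausible way such a composition could be wired up, assuming PyTorch and torchvision (>= 0.13); the class name, the concatenation-plus-projection fusion, and all dimensions are illustrative assumptions and do not reproduce the paper's actual DSVFE or FAM.

```python
# Minimal sketch of a dual-stream visual extractor, assuming PyTorch/torchvision.
# All names here are hypothetical; the fusion step is a simple stand-in for FAM.
import torch
import torch.nn as nn
from torchvision.models import resnet50, swin_t


class DualStreamExtractorSketch(nn.Module):
    """CNN stream (local responses) + Swin stream (long-range dependencies)."""

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        # CNN stream: ResNet-50 without its average pooling and classifier head.
        cnn = resnet50(weights=None)
        self.cnn_stream = nn.Sequential(*list(cnn.children())[:-2])  # (B, 2048, H/32, W/32)
        # Transformer stream: Swin-T feature stages plus final LayerNorm.
        swin = swin_t(weights=None)
        self.swin_features = swin.features                            # (B, H/32, W/32, 768)
        self.swin_norm = swin.norm
        # Placeholder fusion: project each stream into a shared embedding space.
        self.cnn_proj = nn.Linear(2048, embed_dim)
        self.swin_proj = nn.Linear(768, embed_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        cnn_feat = self.cnn_stream(images)                # (B, 2048, h, w)
        cnn_feat = cnn_feat.flatten(2).transpose(1, 2)    # (B, h*w, 2048)
        swin_feat = self.swin_norm(self.swin_features(images))  # (B, h, w, 768)
        swin_feat = swin_feat.flatten(1, 2)               # (B, h*w, 768)
        # Concatenate the two token sequences for a downstream report decoder.
        return torch.cat([self.cnn_proj(cnn_feat), self.swin_proj(swin_feat)], dim=1)


if __name__ == "__main__":
    model = DualStreamExtractorSketch()
    tokens = model(torch.randn(1, 3, 224, 224))
    print(tokens.shape)  # (1, 98, 512) for 224x224 inputs: 49 tokens per stream
```

In practice the paper's FAM would replace the naive concatenation above with a learned fusion and cross-modal alignment against the text features, and the SSLM masking objective would be applied on top of these visual tokens during training.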