Utility of Multimodal Large Language Models in Analyzing Chest X-Rays with Incomplete Contextual Information
Objectives Large language models (LLMs) are increasingly used in clinical practice, but their performance can deteriorate when radiology reports are incomplete. We evaluated whether multimodal LLMs (integrating text and images) could enhance accuracy and interpretability in chest radiography reports, thereby improving their utility for clinical decision support. Specifically, we aimed to assess the robustness of LLMs in generating accurate impressions from chest radiography reports when provided with incomplete data, and whether multimodal input could mitigate performance loss. Methods We analyzed 300 radiology image–report pairs from the MIMIC-CXR database. Three LLMs—OpenFlamingo, MedFlamingo, and IDEFICS—were tested in text-only and multimodal formats. Chest X-ray impressions were generated from complete text reports and then regenerated after systematically removing 20%, 50%, and 80% of the text. The effect of adding images was evaluated using chest X-rays, and model performance was compared using three statistical methods. Hallucination rates were quantified. Results In the text-only setting, OpenFlamingo, MedFlamingo, and IDEFICS demonstrated comparable performance (ROUGE-L: 0.23 vs. 0.21 vs. 0.21; F1RadGraph: 0.20 vs. 0.16 vs. 0.16; F1CheXbert: 0.49 vs. 0.41 vs. 0.41), with OpenFlamingo performing best on complete text (p < 0.001). All models exhibited performance decline with incomplete data. However, multimodal input significantly improved the performance of MedFlamingo and IDEFICS (p < 0.001), equaling or surpassing OpenFlamingo even under incomplete text conditions. Regarding hallucination, MedFlamingo showed a lower false-negative rate in multimodal compared with unimodal use, while false-positive rates were similar. Conclusions LLMs may produce suboptimal outputs when radiology data are incomplete, but multimodal LLMs enhance reliability and may strengthen clinical decision-making support.
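To make the evaluation pipeline concrete, the following is a minimal sketch (not the authors' code) of how report text can be systematically truncated and a generated impression scored against the reference impression with ROUGE-L. It assumes the open-source `rouge-score` package; the toy findings text and the stand-in model output are illustrative placeholders, not MIMIC-CXR data.

```python
# Minimal sketch (not the authors' code): truncate report text and score a generated
# impression against the reference impression with ROUGE-L.
# Assumes the open-source `rouge-score` package; the model output is a stand-in string.
from rouge_score import rouge_scorer

def truncate_report(findings: str, keep_fraction: float) -> str:
    """Keep only the leading fraction of the report's words (e.g., 0.8, 0.5, 0.2)."""
    words = findings.split()
    n_keep = max(1, int(len(words) * keep_fraction))
    return " ".join(words[:n_keep])

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

findings = "Heart size is normal. There is a small right pleural effusion. No pneumothorax."
reference_impression = "Small right pleural effusion."
partial_findings = truncate_report(findings, keep_fraction=0.5)  # 50% of the text removed
# partial_findings would be what the LLM sees in the incomplete-data condition.
generated_impression = "Right pleural effusion without pneumothorax."  # stand-in LLM output

score = scorer.score(reference_impression, generated_impression)["rougeL"].fmeasure
print(f"ROUGE-L F-measure: {score:.2f}")
```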
- Research Article
26
- 10.1148/radiol.2021210578
- Aug 31, 2021
- Radiology
Background A computer-aided detection (CAD) system may help surveillance for pulmonary metastasis at chest radiography in situations where there is limited access to CT. Purpose To evaluate whether a deep learning (DL)-based CAD system can improve diagnostic yield for newly visible lung metastasis on chest radiographs in patients with cancer. Materials and Methods A regulatory-approved CAD system for lung nodules was implemented to interpret chest radiographs from patients referred by the medical oncology department in clinical practice. In this retrospective diagnostic cohort study, chest radiographs interpreted with assistance from a CAD system after the implementation (January to April 2019, CAD-assisted interpretation group) and those interpreted before the implementation (September to December 2018, conventional interpretation group) of the CAD system were consecutively included. The diagnostic yield (frequency of true-positive detections) and false-referral rate (frequency of false-positive detections) of formal reports of chest radiographs for newly visible lung metastasis were compared between the two groups using generalized estimating equations. Propensity score matching was performed between the two groups for age, sex, and primary cancer. Results A total of 2916 chest radiographs from 1521 patients (1546 men, 1370 women; mean age, 62 years) and 5681 chest radiographs from 3456 patients (2941 men, 2740 women; mean age, 62 years) were analyzed in the CAD-assisted interpretation and conventional interpretation groups, respectively. The diagnostic yield for newly visible metastasis was higher in the CAD-assisted interpretation group (0.86%, 25 of 2916 [95% CI: 0.58, 1.3] vs 0.32%, 18 of 5681 [95% CI: 0.20, 0.50%]; P = .004). The false-referral rate in the CAD-assisted interpretation group (0.34%, 10 of 2916 [95% CI: 0.19, 0.64]) was not inferior to that in the conventional interpretation group (0.25%, 14 of 5681 [95% CI: 0.15, 0.42]) at the noninferiority margin of 0.5% (95% CI of difference: -0.15, 0.35). Conclusion A deep learning-based computer-aided detection system improved the diagnostic yield for newly visible metastasis on chest radiographs in patients with cancer with a similar false-referral rate. © RSNA, 2021 Online supplemental material is available for this article.
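For context on the noninferiority claim, here is an illustrative back-of-the-envelope check of the false-referral comparison using the counts reported above; the study itself used generalized estimating equations to account for repeated radiographs per patient, which this simple Wald interval does not.

```python
# Illustrative arithmetic only: the study compared groups with generalized estimating
# equations (accounting for repeated radiographs per patient); this simple Wald
# interval for the difference in false-referral rates does not.
import math

n_cad, fp_cad = 2916, 10      # CAD-assisted interpretation group
n_conv, fp_conv = 5681, 14    # conventional interpretation group

p_cad, p_conv = fp_cad / n_cad, fp_conv / n_conv
diff = p_cad - p_conv
se = math.sqrt(p_cad * (1 - p_cad) / n_cad + p_conv * (1 - p_conv) / n_conv)
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

margin = 0.005  # 0.5% noninferiority margin
print(f"difference = {diff:.2%}, 95% CI ({ci_low:.2%}, {ci_high:.2%})")
print("noninferior at 0.5% margin" if ci_high < margin else "not shown noninferior")
```

With these counts the interval works out to roughly (-0.15%, 0.35%), consistent with the CI reported in the abstract.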
- Research Article
13
- 10.21037/atm.2018.08.11
- Jun 1, 2019
- Annals of Translational Medicine
Errors in grammar, spelling, and usage in radiology reports are common. To automatically detect inappropriate insertions, deletions, and substitutions of words in radiology reports, we proposed using a neural sequence-to-sequence (seq2seq) model. Head CT and chest radiograph reports from Mount Sinai Hospital (MSH) (n=61,722 and 818,978, respectively), Mount Sinai Queens (MSQ) (n=30,145 and 194,309, respectively) and MIMIC-III (n=32,259 and 54,685) were converted into sentences. Insertions, substitutions, and deletions of words were randomly introduced. Seq2seq models were trained using corrupted sentences as input to predict the original uncorrupted sentences. Three models were trained using head CTs from MSH, chest radiographs from MSH, and head CTs from all three collections. Model performance was assessed across different sites and modalities. A sample of original, uncorrupted sentences was manually reviewed for any error in syntax, usage, or spelling to estimate the real-world proofreading performance of the algorithm. Seq2seq detected 90.3% and 88.2% of corrupted sentences with 97.7% and 98.8% specificity in same-site, same-modality test sets for head CTs and chest radiographs, respectively. Manual review of original, uncorrupted same-site, same-modality head CT sentences demonstrated seq2seq positive predictive value (PPV) of 0.393 (157/400; 95% CI, 0.346-0.441) and negative predictive value (NPV) of 0.986 (789/800; 95% CI, 0.976-0.992) for detecting sentences containing real-world errors, with estimated sensitivity of 0.389 (95% CI, 0.267-0.542) and specificity of 0.986 (95% CI, 0.985-0.987) over n=86,211 uncorrupted training examples. Seq2seq models can be highly effective at detecting erroneous insertions, deletions, and substitutions of words in radiology reports. To achieve high performance, these models require site- and modality-specific training examples. Incorporating additional targeted training data could further improve performance in detecting real-world errors in reports.
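A minimal sketch of the kind of word-level corruption described above, which yields (corrupted, original) sentence pairs for seq2seq training; the corruption probabilities and vocabulary here are assumptions for illustration, not the authors' settings.

```python
# Minimal sketch (assumed probabilities, not the authors' settings): corrupt a report
# sentence with random word deletions, substitutions, and insertions so that
# (corrupted, original) pairs can be used to train a seq2seq proofreading model.
import random

def corrupt_sentence(sentence: str, vocab: list[str], p: float = 0.1) -> str:
    corrupted = []
    for word in sentence.split():
        r = random.random()
        if r < p:                       # deletion: drop the word
            continue
        if r < 2 * p:                   # substitution: swap in a random vocabulary word
            corrupted.append(random.choice(vocab))
        else:
            corrupted.append(word)
        if random.random() < p:         # insertion: add a spurious word after this one
            corrupted.append(random.choice(vocab))
    return " ".join(corrupted)

vocab = ["effusion", "normal", "right", "left", "no", "acute", "opacity"]
original = "No acute cardiopulmonary abnormality is identified."
print(corrupt_sentence(original, vocab))
```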
- Research Article
45
- 10.1016/j.ajic.2006.08.003
- Apr 1, 2007
- American Journal of Infection Control
Hospital electronic medical record–based public health surveillance system deployed during the 2002 Winter Olympic Games
- Research Article
5
- 10.1148/radiol.232746
- Oct 1, 2024
- Radiology
Background Natural language processing (NLP) is commonly used to annotate radiology datasets for training deep learning (DL) models. However, the accuracy and potential biases of these NLP methods have not been thoroughly investigated, particularly across different demographic groups. Purpose To evaluate the accuracy and demographic bias of four NLP radiology report labeling tools on two chest radiograph datasets. Materials and Methods This retrospective study, performed between April 2022 and April 2024, evaluated chest radiograph report labeling using four NLP tools (CheXpert [rule-based], RadReportAnnotator [RRA; DL-based], OpenAI's GPT-4 [DL-based], cTAKES [hybrid]) on a subset of the Medical Information Mart for Intensive Care (MIMIC) chest radiograph dataset balanced for representation of age, sex, and race and ethnicity (n = 692) and the entire Indiana University (IU) chest radiograph dataset (n = 3665). Three board-certified radiologists annotated the chest radiograph reports for 14 thoracic disease labels. NLP tool performance was evaluated using several metrics, including accuracy and error rate. Bias was evaluated by comparing performance between demographic subgroups using the Pearson χ2 test. Results The IU dataset included 3665 patients (mean age, 49.7 years ± 17 [SD]; 1963 female), while the MIMIC dataset included 692 patients (mean age, 54.1 years ± 23.1; 357 female). All four NLP tools demonstrated high accuracy across findings in the IU and MIMIC datasets, as follows: CheXpert (92.6% [47 516 of 51 310], 90.2% [8742 of 9688]), RRA (82.9% [19 746 of 23 829], 92.2% [2870 of 3114]), GPT-4 (94.3% [45 586 of 48 342], 91.6% [6721 of 7336]), and cTAKES (84.7% [43 436 of 51 310], 88.7% [8597 of 9688]). RRA and cTAKES had higher accuracy (P < .001) on the MIMIC dataset, while CheXpert and GPT-4 had higher accuracy on the IU dataset. Differences (P < .001) in error rates were observed across age groups for all NLP tools except RRA on the MIMIC dataset, with the highest error rates for CheXpert, RRA, and cTAKES in patients older than 80 years (mean, 15.8% ± 5.0) and the highest error rate for GPT-4 in patients 60-80 years of age (8.3%). Conclusion Although commonly used NLP tools for chest radiograph report annotation are accurate when evaluating reports in aggregate, demographic subanalyses showed significant bias, with poorer performance in older patients. © RSNA, 2024 Supplemental material is available for this article. See also the editorial by Cai in this issue.
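The subgroup bias analysis amounts to comparing labeling error rates across demographic strata with a Pearson chi-square test; a hedged sketch using SciPy follows, with made-up counts standing in for the study's data.

```python
# Sketch of the demographic bias check: compare NLP labeling error rates across age
# subgroups with a Pearson chi-square test; the counts below are made up.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: age groups (<40, 40-60, 60-80, >80 years); columns: (labeling errors, correct labels)
table = np.array([
    [ 60, 1940],
    [ 85, 1915],
    [110, 1890],
    [158,  842],
])
chi2, p_value, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p_value:.3g}")
```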
- Research Article
9
- 10.1007/s00330-024-11339-6
- Jan 15, 2025
- European Radiology
Objective This study aimed to develop an open-source multimodal large language model (CXR-LLaVA) for interpreting chest X-ray images (CXRs), leveraging recent advances in large language models (LLMs) to potentially replicate the image interpretation skills of human radiologists. Materials and methods For training, we collected 592,580 publicly available CXRs, of which 374,881 had labels for certain radiographic abnormalities (Dataset 1) and 217,699 provided free-text radiology reports (Dataset 2). After pre-training a vision transformer with Dataset 1, we integrated it with an LLM influenced by the LLaVA network. Then, the model was fine-tuned, primarily using Dataset 2. The model’s diagnostic performance for major pathological findings was evaluated, along with the acceptability of radiologic reports by human radiologists, to gauge its potential for autonomous reporting. Results The model demonstrated impressive performance in test sets, achieving an average F1 score of 0.81 for six major pathological findings in the MIMIC internal test set and 0.56 for six major pathological findings in the external test set. The model’s F1 scores surpassed those of GPT-4-vision and Gemini-Pro-Vision in both test sets. In human radiologist evaluations of the external test set, the model achieved a 72.7% success rate in autonomous reporting, slightly below the 84.0% rate of ground truth reports. Conclusion This study highlights the significant potential of multimodal LLMs for CXR interpretation, while also acknowledging the performance limitations. Despite these challenges, we believe that making our model open-source will catalyze further research, expanding its effectiveness and applicability in various clinical contexts. Key Points Question How can a multimodal large language model be adapted to interpret chest X-rays and generate radiologic reports? Findings The developed CXR-LLaVA model effectively detects major pathological findings in chest X-rays and generates radiologic reports with a higher accuracy compared to general-purpose models. Clinical relevance This study demonstrates the potential of multimodal large language models to support radiologists by autonomously generating chest X-ray reports, potentially reducing diagnostic workloads and improving radiologist efficiency.
- Research Article
30
- 10.1016/j.radi.2018.01.009
- Feb 18, 2018
- Radiography
Agreement between expert thoracic radiologists and the chest radiograph reports provided by consultant radiologists and reporting radiographers in clinical practice: Review of a single clinical site
- Research Article
32
- 10.1186/1472-6947-13-90
- Aug 15, 2013
- BMC Medical Informatics and Decision Making
Background Prior studies demonstrate the suitability of natural language processing (NLP) for identifying pneumonia in chest radiograph (CXR) reports; however, few evaluate this approach in intensive care unit (ICU) patients. Methods From a total of 194,615 ICU reports, we empirically developed a lexicon to categorize pneumonia-relevant terms and uncertainty profiles. We encoded lexicon items into unique queries within an NLP software application and designed an algorithm to assign automated interpretations (‘positive’, ‘possible’, or ‘negative’) based on each report’s query profile. We evaluated algorithm performance in a sample of 2,466 CXR reports interpreted by physician consensus and in two ICU patient subgroups including those admitted for pneumonia and for rheumatologic/endocrine diagnoses. Results Most reports were deemed ‘negative’ (51.8%) by physician consensus. Many were ‘possible’ (41.7%); only 6.5% were ‘positive’ for pneumonia. The lexicon included 105 terms and uncertainty profiles that were encoded into 31 NLP queries. Queries identified 534,322 ‘hits’ in the full sample, with 2.7 ± 2.6 ‘hits’ per report. An algorithm, comprised of twenty rules and probability steps, assigned interpretations to reports based on query profiles. In the validation set, the algorithm had 92.7% sensitivity, 91.1% specificity, 93.3% positive predictive value, and 90.3% negative predictive value for differentiating ‘negative’ from ‘positive’/‘possible’ reports. In the ICU subgroups, the algorithm also demonstrated good performance, misclassifying few reports (5.8%). Conclusions Many CXR reports in ICU patients demonstrate frank uncertainty regarding a pneumonia diagnosis. This electronic tool demonstrates promise for assigning automated interpretations to CXR reports by leveraging both terms and uncertainty profiles.
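A simplified stand-in for the rule-based interpretation step described above: given a report's query-hit profile, assign 'positive', 'possible', or 'negative'. The query names and thresholds below are hypothetical illustrations, not the study's 31 queries and twenty rules.

```python
# Simplified stand-in for the rule-based interpretation step (the real system used
# 105 lexicon terms, 31 NLP queries, and ~20 rules); names and thresholds are hypothetical.
def interpret_report(hits: dict[str, int]) -> str:
    """Assign 'positive', 'possible', or 'negative' from counts of NLP query hits."""
    affirmed = hits.get("pneumonia_affirmed", 0)
    uncertain = hits.get("pneumonia_uncertain", 0)  # e.g. "cannot exclude", "may represent"
    negated = hits.get("pneumonia_negated", 0)
    if affirmed > 0 and negated == 0:
        return "positive"
    if uncertain > 0:
        return "possible"
    return "negative"

print(interpret_report({"pneumonia_uncertain": 2, "pneumonia_negated": 1}))  # -> possible
```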
- Research Article
- 10.1148/radiol.250568
- Sep 1, 2025
- Radiology
Background Artificial intelligence (AI)-generated radiology reports have become available and require rigorous evaluation. Purpose To evaluate the clinical acceptability of chest radiograph reports generated by an AI algorithm and their accuracy in identifying referable abnormalities. Materials and Methods Chest radiographs from an intensive care unit (ICU), an emergency department, and health checkups were retrospectively collected between January 2020 and December 2022, and outpatient chest radiographs were sourced from a public dataset. An automated report-generating AI algorithm was then applied. A panel of seven thoracic radiologists evaluated the acceptability of generated reports, and acceptability was analyzed using a standard criterion (acceptable without revision or with minor revision) and a stringent criterion (acceptable without revision). Using chest radiographs from three of the contexts (excluding the ICU), AI-generated and radiologist-written reports were compared regarding the acceptability of the reports (generalized linear mixed model) and their sensitivity and specificity for identifying referable abnormalities (McNemar test). The radiologist panel was surveyed to evaluate their perspectives on the potential of AI-generated reports to replace radiologist-written reports. Results The chest radiographs of 1539 individuals (median age, 55 years; 656 male patients, 483 female patients, 400 patients of unknown sex) were included. There was no evidence of a difference in acceptability between AI-generated and radiologist-written reports under the standard criterion (88.4% vs 89.2%; P = .36), but AI-generated reports were less acceptable than radiologist-written reports under the stringent criterion (66.8% vs 75.7%; P < .001). Compared with radiologist-written reports, AI-generated reports identified radiographs with referable abnormalities with greater sensitivity (81.2% vs 59.4%; P < .001) and lower specificity (81.0% vs 93.6%; P < .001). In the survey, most radiologists indicated that AI-generated reports were not yet reliable enough to replace radiologist-written reports. Conclusion AI-generated chest radiograph reports had similar acceptability to radiologist-written reports, although a substantial proportion of AI-generated reports required minor revision. © RSNA, 2025 Supplemental material is available for this article. See also the editorial by Wu and Seo in this issue.
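The paired comparison of AI-generated and radiologist-written reports on the same radiographs relies on the McNemar test; a short sketch with statsmodels is given below, using illustrative discordant-pair counts rather than the study's data.

```python
# Sketch of the paired comparison (McNemar test) between AI-generated and
# radiologist-written reports evaluated on the same radiographs; counts are illustrative.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Radiographs with referable abnormalities:
# rows = AI report detected / missed, columns = radiologist report detected / missed
table = np.array([
    [250,  90],   # both detected | only the AI report detected
    [ 20,  60],   # only the radiologist detected | both missed
])
result = mcnemar(table, exact=False, correction=True)
print(f"statistic = {result.statistic:.1f}, p = {result.pvalue:.3g}")
```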
- Research Article
- 10.1038/s41551-025-01544-z
- Nov 6, 2025
- Nature biomedical engineering
General artificial intelligence models have unique challenges in clinical practice when applied to diverse modalities and complex clinical tasks. Here we present MedMPT, a versatile, clinically oriented pretrained model tailored for respiratory healthcare, trained on 154,274 pairs of chest computed-tomography scans and radiograph reports. MedMPT adopts self-supervised learning to acquire medical insights and is capable of handling multimodal clinical data and supporting various clinical tasks aligned with clinical workflows. We evaluate the performance of MedMPT on a broad spectrum of chest-related pathological conditions, involving common medical modalities such as computed tomography images, radiology reports, laboratory tests and drug relationship graphs. MedMPT consistently outperforms the state-of-the-art multimodal pretrained models in the medical domain, achieving significant improvements in diverse clinical tasks. Extensive analysis indicates that MedMPT effectively harnesses the potential of medical data, showing both data and parameter efficiency and offering explainable insights for decision-making. MedMPT highlights the potential of multimodal pretrained models in the realm of general-purpose artificial intelligence for clinical practice.
- Research Article
- 10.1136/postgradmedj-2018-135984
- Sep 1, 2018
- Postgraduate Medical Journal
Purpose As tuberculosis becomes less common in higher income countries, clinician familiarity with the disease is declining. Little is known about how chest radiograph interpretations affect tuberculosis care. We sought to...
- Research Article
118
- 10.2214/ajr.14.12636
- Dec 1, 2014
- American Journal of Roentgenology
The radiology report serves as the primary method of communication about imaging findings. Traditional free-text (i.e., unstructured) radiology reporting entails dictating in a stream-of-consciousness manner. Structured reporting aims to standardize the format and lexicon used in reports. This standardization may improve the communication of findings, allowing ease of reading and comprehension. A structured reporting template may also be used as a checklist while reviewing a case, which may facilitate focused attention and analysis. The goal of this study was to compare unstructured and structured reports in terms of their completeness and effectiveness. Radiology trainees were given an educational lecture on the background of reporting and were provided with a structured reporting template for dictating chest radiographs. Twelve trainees completed the study. Sixty reports from before and 60 reports from after the intervention were each independently scored by four blinded physician raters for completeness and effectiveness. Structured reports were found to be statistically significantly more complete and more effective than unstructured reports (mean completeness score, 4.42 vs 3.99, p<0.001; mean effectiveness score, 4.11 vs 3.85, p<0.001). A combined score was calculated for each report and was higher for the structured reports (mean combined score, 8.54 vs 7.83, p<0.001). Structured chest radiograph reports were more complete and more effective than unstructured chest radiograph reports. Although additional studies are needed for validation, this study suggests that structured reporting may represent an improved reporting method for radiologists.
- Research Article
8
- 10.1148/radiol.241139
- Oct 1, 2024
- Radiology
Background Rapid advances in large language models (LLMs) have led to the development of numerous commercial and open-source models. While recent publications have explored OpenAI's GPT-4 to extract information of interest from radiology reports, there has not been a real-world comparison of GPT-4 to leading open-source models. Purpose To compare different leading open-source LLMs to GPT-4 on the task of extracting relevant findings from chest radiograph reports. Materials and Methods Two independent datasets of free-text radiology reports from chest radiograph examinations were used in this retrospective study performed between February 2, 2024, and February 14, 2024. The first dataset consisted of reports from the ImaGenome dataset, providing reference standard annotations from the MIMIC-CXR database acquired between 2011 and 2016. The second dataset consisted of randomly selected reports created at the Massachusetts General Hospital between July 2019 and July 2021. In both datasets, the commercial models GPT-3.5 Turbo and GPT-4 were compared with open-source models that included Mistral-7B and Mixtral-8 × 7B (Mistral AI), Llama 2-13B and Llama 2-70B (Meta), and Qwen1.5-72B (Alibaba Group), as well as CheXbert and CheXpert-labeler (Stanford ML Group), in their ability to accurately label the presence of multiple findings in radiograph text reports using zero-shot and few-shot prompting. The McNemar test was used to compare F1 scores between models. Results On the ImaGenome dataset (n = 450), the open-source model with the highest score, Llama 2-70B, achieved micro F1 scores of 0.97 and 0.97 for zero-shot and few-shot prompting, respectively, compared with the GPT-4 F1 scores of 0.98 and 0.98 (P > .99 and < .001 for superiority of GPT-4). On the institutional dataset (n = 500), the open-source model with the highest score, an ensemble model, achieved micro F1 scores of 0.96 and 0.97 for zero-shot and few-shot prompting, respectively, compared with the GPT-4 F1 scores of 0.98 and 0.97 (P < .001 and > .99 for superiority of GPT-4). Conclusion Although GPT-4 was superior to open-source models in zero-shot report labeling, few-shot prompting with a small number of example reports closely matched the performance of GPT-4. The benefit of few-shot prompting varied across datasets and models. © RSNA, 2024 Supplemental material is available for this article.
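The labeling comparison reduces to scoring each model's multilabel finding predictions against reference annotations with a micro-averaged F1 score; a toy sketch with scikit-learn follows (example labels only, not study data).

```python
# Toy sketch of the scoring step: micro-averaged F1 over multilabel finding
# annotations, as used to compare GPT-4 with open-source labelers; example data only.
import numpy as np
from sklearn.metrics import f1_score

# Rows = reports, columns = findings (e.g. atelectasis, cardiomegaly, edema,
# pleural effusion, pneumonia); 1 = finding labeled present, 0 = absent.
y_true = np.array([[1, 0, 0, 1, 0],
                   [0, 1, 1, 0, 0],
                   [0, 0, 0, 0, 1]])
y_pred = np.array([[1, 0, 0, 1, 0],
                   [0, 1, 0, 0, 0],
                   [0, 0, 0, 1, 1]])
print(f"micro F1 = {f1_score(y_true, y_pred, average='micro'):.2f}")
```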
- Research Article
25
- 10.1007/s10916-021-01761-4
- Jan 1, 2021
- Journal of Medical Systems
In radiology, natural language processing (NLP) allows the extraction of valuable information from radiology reports. It can be used for various downstream tasks such as quality improvement, epidemiological research, and monitoring guideline adherence. Class imbalance, variation in dataset size, variation in report complexity, and algorithm type all influence NLP performance but have not yet been systematically and interrelatedly evaluated. In this study, we investigate the influence of these factors on the performance of four types [a fully connected neural network (Dense), a long short-term memory recurrent neural network (LSTM), a convolutional neural network (CNN), and a Bidirectional Encoder Representations from Transformers (BERT)] of deep learning-based NLP. Two datasets consisting of radiologist-annotated reports of both trauma radiographs (n = 2469) and chest radiographs and computed tomography (CT) studies (n = 2255) were split into training sets (80%) and testing sets (20%). The training data was used as a source to train all four model types in 84 experiments (Fracture-data) and 45 experiments (Chest-data) with variation in size and prevalence. The performance was evaluated on sensitivity, specificity, positive predictive value, negative predictive value, area under the curve, and F score. After the NLP of radiology reports, all four model architectures demonstrated high performance with metrics up to > 0.90. CNN, LSTM, and Dense were outperformed by the BERT algorithm because of its stable results despite variation in training size and prevalence. Awareness of variation in prevalence is warranted because it impacts sensitivity and specificity in opposite directions.
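For reference, the evaluation metrics listed above can all be derived from a binary confusion matrix; the sketch below computes them with illustrative counts, and the second call uses a lower-prevalence test set to show how predictive values shift even when sensitivity and specificity are unchanged.

```python
# Illustrative only: the listed evaluation metrics derived from a binary confusion
# matrix, with a second call at lower prevalence to show how predictive values shift
# even when sensitivity and specificity are held fixed.
def report_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {"sensitivity": round(sensitivity, 2), "specificity": round(specificity, 2),
            "PPV": round(ppv, 2), "NPV": round(npv, 2), "F1": round(f1, 2)}

print(report_metrics(tp=180, fp=20, fn=20, tn=180))   # prevalence 50%
print(report_metrics(tp=18, fp=38, fn=2, tn=342))     # prevalence 5%, same sens/spec
```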
- Front Matter
59
- 10.1148/radiol.2241020415
- Jul 1, 2002
- Radiology
Automatic structuring of radiology reports: harbinger of a second information revolution in radiology.
- Research Article
3
- 10.2214/ajr.140.6.1115
- Jun 1, 1983
- AJR. American journal of roentgenology
Twenty-five patients had localization radiography before chest computed tomography (CT) for evaluation of pulmonary nodules, staging of lung carcinoma, or suspected metastatic disease. The localization radiograph and the 140-kVp frontal chest radiograph were evaluated independently and without history by a CT radiologist and a chest radiologist, respectively. These interpretations were compared to a reference standard compiled from the full CT and chest radiographic reports. Significant abnormalities in the soft tissues, bones, mediastinum, hila, and pleura were detected with about equal frequency by chest and localization radiography. Chest radiography detected all 16 of the lung nodules greater than 1 cm in diameter, while localization radiography detected 12; however, this difference was not statistically significant. Both the sensitivity and specificity of nodule detection by chest radiography exceeded those of localization radiography. The performance of localization radiography in the detection of chest abnormalities in this and other studies encourages further development of computed chest radiography.