Evaluating text and visual diagnostic capabilities of large language models on questions related to the Breast Imaging Reporting and Data System Atlas 5th edition.

Yasin Celal Güneş, Turay Cesur, Eren Çamur, Leman Günbey Karabekmez

doi:10.4274/dir.2024.242876

Open Access

https://doi.org/10.4274/dir.2024.242876

Abstract

This study aimed to evaluate the performance of large language models (LLMs) and multimodal LLMs in interpreting the Breast Imaging Reporting and Data System (BI-RADS) categories and providing clinical management recommendations for breast radiology in text-based and visual questions.

This cross-sectional observational study involved two steps. In the first step, we compared ten LLMs (namely ChatGPT 4o, ChatGPT 4, ChatGPT 3.5, Google Gemini 1.5 Pro, Google Gemini 1.0, Microsoft Copilot, Perplexity, Claude 3.5 Sonnet, Claude 3 Opus, and Claude 3 Opus 200K), general radiologists, and a breast radiologist using 100 text-based multiple-choice questions (MCQs) related to the BI-RADS Atlas 5th edition. In the second step, we assessed the performance of five multimodal LLMs (ChatGPT 4o, ChatGPT 4V, Claude 3.5 Sonnet, Claude 3 Opus, and Google Gemini 1.5 Pro) in assigning BI-RADS categories and providing clinical management recommendations on 100 breast ultrasound images. Correct answers and accuracy by question type were compared using McNemar's and chi-squared tests. Management scores were analyzed using the Kruskal-Wallis and Wilcoxon tests.

Claude 3.5 Sonnet achieved the highest accuracy in text-based MCQs (90%), followed by ChatGPT 4o (89%), outperforming all other LLMs and general radiologists (78% and 76%) (P < 0.05), except for the Claude 3 Opus models and the breast radiologist (82%) (P > 0.05). Lower-performing LLMs included Google Gemini 1.0 (61%) and ChatGPT 3.5 (60%). Performance across different question categories showed no significant variation among LLMs or radiologists (P > 0.05). For breast ultrasound images, Claude 3.5 Sonnet achieved 59% accuracy, significantly higher than other multimodal LLMs (P < 0.05). Management recommendations were evaluated using a 3-point Likert scale, with Claude 3.5 Sonnet scoring the highest (mean: 2.12 ± 0.97) (P < 0.05). Accuracy varied significantly across BI-RADS categories for all multimodal LLMs except Claude 3 Opus (P < 0.05). Google Gemini 1.5 Pro failed to answer any BI-RADS 5 questions correctly. Similarly, ChatGPT 4V failed to answer any BI-RADS 1 questions correctly, making them the least accurate in these categories (P < 0.05).

Although LLMs such as Claude 3.5 Sonnet and ChatGPT 4o show promise in text-based BI-RADS assessments, their limitations in visual diagnostics suggest they should be used cautiously and under radiologists' supervision to avoid misdiagnoses. This study demonstrates that while LLMs exhibit strong capabilities in text-based BI-RADS assessments, their visual diagnostic abilities are currently limited, necessitating further development and cautious application in clinical practice.
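
The abstract reports that correct answers were compared with McNemar's test, which suits this design because every model and reader answered the same questions, so the answers are paired. The sketch below shows one way such a comparison could be computed. It is not the authors' code. It assumes the exact (binomial) two-sided form of the test, which the abstract does not specify, and the function name mcnemar_exact and the ten-question answer vectors are hypothetical illustrations, not study data.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired correctness flags.

    correct_a and correct_b hold one truthy/falsy flag per question for
    two raters who answered the same questions in the same order.
    Returns (b, c, p_value): b counts questions only rater A got right,
    c counts questions only rater B got right.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both raters must answer the same questions")
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        # No discordant pairs: the data give no evidence of a difference.
        return b, c, 1.0
    k = min(b, c)
    # Under the null hypothesis each discordant pair favors either rater
    # with probability 0.5, so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical toy data (NOT from the study): 1 = correct, 0 = incorrect.
model_a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]
model_b = [1, 0, 0, 1, 0, 1, 0, 1, 1, 0]

b, c, p = mcnemar_exact(model_a, model_b)
print(b, c, p)  # 5 1 0.21875
```

Only the discordant questions (those answered correctly by exactly one of the two raters) enter the test, which is why two raters with different overall accuracies can still yield a non-significant result when few questions separate them.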

Journal: Diagnostic and interventional radiology (Ankara, Turkey)
Publication Date: Sep 9, 2024
Citations: 1
License type: cc-by-nc

Similar Papers

BI-RADS Category Assignments by GPT-3.5, GPT-4, and Google Bard: A Multilanguage Study.
Andrea Cozzi ... Simone Schiaffino
Radiology | VOL. 311
01 Apr 2024

The Breast Imaging Reporting and Data System (BI-RADS) in the Dutch breast cancer screening programme: its role as an assessment and stratification tool
J M H Timmers ... G J Den Heeten
European Radiology | VOL. 22
14 Mar 2012

Scoring System to Stratify Malignancy Risks for Mammographic Microcalcifications Based on Breast Imaging Reporting and Data System 5th Edition Descriptors.
Ji Hyun Youk ... Jeong-Ah Kim
Korean Journal of Radiology | VOL. 20
01 Jan 2019

Differential diagnosis of B-mode ultrasound Breast Imaging Reporting and Data System category 3-4a lesions in conjunction with shear-wave elastography using conservative and aggressive approaches.
Wenxiang Zhi ... Haixian Zhang
Quantitative Imaging in Medicine and Surgery | VOL. 12
01 Jul 2022
