Preliminary experiments on interpretable ChatGPT-assisted diagnosis for breast ultrasound radiologists.

Pengfei Sun, Linxue Qian, Zhixiang Wang

doi:10.21037/qims-24-141

Abstract

Ultrasound is essential for detecting breast lesions. The American College of Radiology's Breast Imaging Reporting and Data System (BI-RADS) is widely used for classification, but its subjectivity can lead to inconsistent diagnostic outcomes. Artificial intelligence (AI) models, such as ChatGPT-3.5, may enhance diagnostic accuracy and efficiency in medical settings. This study aimed to assess the utility of the ChatGPT-3.5 model in generating BI-RADS classifications for breast ultrasound reports and its ability to replicate the "chain of thought" (CoT) in clinical decision-making to improve model interpretability.
Breast ultrasound reports were collected, and ChatGPT-3.5 was used to generate diagnoses and treatment plans. We evaluated ChatGPT-3.5's performance by comparing its generated reports to those from doctors with various levels of experience. We also conducted a Turing test and a consistency analysis. To enhance the interpretability of the model, we applied the CoT method to deconstruct the decision-making chain of the GPT model.
A total of 131 patients were evaluated, with 57 doctors participating in the experiment. ChatGPT-3.5 showed promising performance in structure and organization (S&O), professional terminology and expression (PTE), treatment recommendations (TR), and clarity and comprehensibility (C&C). However, improvements are needed in BI-RADS classification, malignancy diagnosis (MD), likelihood of being written by a physician (LWBP), and ultrasound doctor artificial intelligence acceptance (UDAIA). Turing test results indicated that AI-generated reports convincingly resembled human-authored reports. Reproducibility experiments displayed consistent performance. Erroneous report analysis revealed issues related to incorrect diagnosis, inconsistencies, and overdiagnosis. The CoT investigation supports the potential of ChatGPT to replicate the clinical decision-making process and offers insights into AI interpretability.
The ChatGPT-3.5 model holds potential as a valuable tool for assisting in the efficient determination of BI-RADS classifications and enhancing diagnostic performance.

Full Text