Abstract

Introduction: Large Language Models (LLMs) are a form of Artificial Intelligence (AI) that, by identifying patterns and connections within data, predict the most likely words or phrases in a given context. Previous studies have indicated that GPT (Generative Pre-trained Transformer; OpenAI) performs well in answering single-choice clinical questions, but its performance appears less satisfactory on multiple-choice questions and more intricate clinical cases (Cosima et al. 2023 EAO; Cascella et al. 2023 J Med Syst). Notably, no study has evaluated LLM responses in the context of transplantation decision making, a complex process heavily reliant on physician expertise. Additionally, most studies have focused solely on GPT's performance, without considering competing LLMs such as Llama-2 or VertexAI. Our study aims to assess the performance of LLMs in the domain of hematopoietic stem cell transplantation.

Methods: We modified and anonymized the clinical histories of six hematological patients. An experienced hematologist reviewed and validated these modified clinical histories, which included demographic data, past medical history, hematological disease features (genetic data and MRD when available), treatment responses, adverse events from previous therapies, and potential donor information (related/unrelated, HLA, CMV status). We presented these clinical cases to six experienced bone marrow transplant physicians from two major JACIE-accredited hospitals and to 11 hematology residents from the University of Milano-Bicocca. The LLMs employed for the analysis were GPT-4, VertexAI PaLM 2, Llama-2 13b, and Llama-2 70b. The LLMs were configured with different temperature settings to control token-selection randomness, always kept low to favor more deterministic responses. A triple-blinded survey was conducted using Typeform, where both senior hematologists and residents provided anonymized responses with personal tokens; the senior hematologists, the residents, and the LLM testers were unaware of the responses provided by the other groups. We calculated Fleiss' kappa (K) and the overall percentage of agreement (OA) between residents and LLMs, taking the consensus answer (CoA) among experts, defined as the most frequent response, as the reference. OA and K values for residents and LLMs were then compared using t-tests or Mann-Whitney tests in GraphPad v10.0.1.

Results: The experts showed perfect agreement in assessing patient transplant eligibility (K=1.0) and substantial agreement in the choice of donor and conditioning regimen (K=0.62 for both questions). Fair agreement was observed in the estimation of Transplant-Related Mortality (TRM) (K=0.22). The median OA and K value between residents and the experts' CoA were 76.5% (range 52.9-88.2%) and 0.61 (range 0.4-0.8), respectively. The median OA and K value between LLM answers and the experts were 58.8% (range 47-71%) and 0.45 (range 0.3-0.61), respectively. The mean OA and K values of residents were significantly higher than those of the LLMs (p=0.02). Specifically, residents showed higher median OA and K values in patient eligibility assessment (median OA 100% vs. 83% and K 1 vs. 0.78; p=0.01). However, there was no significant difference in median K for donor choice (0.56 vs. 0.56), conditioning regimen (0.67 vs. 0.33), or TRM evaluation (0.33 vs. 0) (Table 1). The median K values of GPT-4, PaLM 2, Llama-2 13b, and Llama-2 70b were 0.49, 0.53, 0.33, and 0.53, respectively (Figure 1).
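For illustration only, a minimal sketch of how one of the models (GPT-4, queried here through the OpenAI Python client) might be prompted with a low temperature to keep token selection near-deterministic; the prompt wording, the temperature value, and the question phrasing are assumptions, not the study's actual settings.

```python
# Minimal sketch (assumed settings): querying GPT-4 with a low temperature
# so that token selection is close to deterministic. The prompt text and the
# temperature value (0.1) are illustrative, not the study's exact parameters.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

case_text = "..."  # anonymized clinical history (demographics, disease features, donor data)

response = client.chat.completions.create(
    model="gpt-4",
    temperature=0.1,  # low temperature -> less random token selection
    messages=[
        {"role": "system",
         "content": "You are a bone marrow transplant physician answering a clinical survey."},
        {"role": "user",
         "content": case_text + "\n\nIs this patient eligible for allogeneic transplantation? "
                                "Which donor and conditioning regimen would you choose? "
                                "Estimate the transplant-related mortality."},
    ],
)

print(response.choices[0].message.content)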
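As a supplementary sketch of the agreement analysis described in the Methods, the snippet below computes Fleiss' kappa among experts, the overall percentage of agreement with the experts' consensus answer, and a Mann-Whitney comparison of residents versus LLMs. All response data shown are invented placeholders, not the study data.

```python
# Minimal sketch (illustrative data): Fleiss' kappa, overall percent agreement
# with the experts' consensus answer (CoA), and a Mann-Whitney U comparison.
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# expert_ratings[i, j] = categorical answer of expert j to question i (placeholder values)
expert_ratings = np.array([[1, 1, 1, 1, 1, 1],
                           [2, 2, 2, 1, 2, 2],
                           [0, 0, 1, 0, 0, 0]])

# Fleiss' kappa among experts: aggregate_raters converts rater answers into
# question-by-category counts, the format fleiss_kappa expects.
counts, _ = aggregate_raters(expert_ratings)
k_experts = fleiss_kappa(counts)

# Consensus answer (CoA) = most frequent expert response per question
coa = np.array([np.bincount(row).argmax() for row in expert_ratings])

def overall_agreement(answers, consensus):
    """Percentage of questions on which a rater matches the consensus answer."""
    return 100.0 * np.mean(answers == consensus)

# Placeholder per-rater OA values for residents and LLMs
resident_oa = [overall_agreement(np.array([1, 2, 0]), coa),
               overall_agreement(np.array([1, 1, 0]), coa)]
llm_oa = [overall_agreement(np.array([1, 2, 1]), coa),
          overall_agreement(np.array([0, 2, 0]), coa)]

# Non-parametric comparison of the OA distributions (residents vs. LLMs)
stat, p_value = mannwhitneyu(resident_oa, llm_oa, alternative="two-sided")
print(k_experts, coa, p_value)
```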
Conclusion: Our study sheds light on the potential and limitations of LLMs in complex hematopoietic stem cell transplantation decision making. While the LLMs showed promising results, with a median OA of 59%, the residents demonstrated superior performance. The LLMs performed well in assessing patient eligibility and donor choice but showed shortcomings in the choice of conditioning regimen and in TRM evaluation. We chose not to have experts rate the LLM responses on a scale in order to avoid potential bias. However, it is important to note that the consensus answer, although it was the most frequent, does not necessarily imply that the other responses provided by the experts were incorrect. Therefore, the lower consensus among the experts in TRM evaluation, possibly due to the difficulty of precisely estimating TRM in a survey-based evaluation, should also prompt a cautious approach when evaluating resident and LLM answers in this setting.
