BACKGROUND Concise, accurate, and real-time synopses of the evidence base are critical to support treatment decision-making in hematology, especially in a rapidly evolving space such as the treatment of plasma cell disorders (PCDs). Synopses are used in clinical practice guidelines, educational settings, and general clinical practice. This process of knowledge curation is time consuming and cumbersome, even when performed by a clinical expert. Artificial intelligence (AI), specifically Large Language Models (LLMs), holds promise in this context; however, LLMs are prone to hallucination and may provide inaccurate or out-of-date information. Moreover, how well individual LLMs perform in summarizing the clinical evidence base may vary widely. We objectively assessed the abilities of four LLMs to generate accurate, coherent, and relevant synopses for six widely used PCD regimens.

METHODS We compared the performance of four popular LLMs: 1) Claude 3.5 (“Claude”); 2) ChatGPT 4.0 (“ChatGPT”); 3) Gemini; and 4) Llama-3.1 (“Llama”). Each LLM was prompted exactly as follows: “write a synopsis for the development and evolution of therapy with [regimen name] for PCD, using citations from the literature”, where [regimen name] was replaced with: 1) “Dara-RVd”; 2) “KRd”; 3) “VDT-PACE”; 4) “Dara-CyBorD”; 5) “Elranatamab”; and 6) “Talquetamab.” The generated synopses were assessed by two PCD physician specialists using Likert scales on six criteria: accuracy, completeness, relevance, clarity, hallucinations, and coherence; lower scores correspond to poorer performance and higher scores to excellent performance. The reviewers were blinded to the identity of the LLMs to minimize bias and conducted their evaluations independently. The evaluation process was recorded using REDCap. Mean scores with 95% confidence intervals (CI) were calculated across all domains.

RESULTS There were marked differences in LLM performance across the six criteria. Claude demonstrated the highest performance in all domains, notably outperforming the other LLMs in accuracy (mean Likert score 3.92, 95% CI 3.54-4.29, vs ChatGPT 3.25 [2.76-3.74], Gemini 3.17 [2.54-3.80], and Llama 1.92 [1.41-2.43]); completeness (4.00 [3.66-4.34] vs ChatGPT 2.58 [2.02-3.15], Gemini 2.58 [2.02-3.15], and Llama 1.67 [1.39-1.95]); and lack of hallucinations (4.00 [4.00-4.00] vs ChatGPT 2.75 [2.06-3.44], Gemini 3.25 [2.65-3.85], and Llama 1.92 [1.26-2.57]). Llama performed considerably worse across all studied domains, frequently providing inaccurate information and misinterpreting results. ChatGPT and Gemini had intermediate performance. Notably, none of the LLMs, including Claude, achieved perfect scores for accuracy, completeness, or relevance.

CONCLUSION To our knowledge, this is the first rigorous evaluation of widely available LLMs for the task of evidence summarization in the PCD domain. In our evaluation of four LLMs across six PCD regimens, meaningful differences were evident among the LLMs. While Claude performed at a notably higher level, evidence synopses must be as close to perfectly accurate and comprehensive as possible to be usable for real-world clinical decision support. By this standard, even the best-performing LLMs still require careful editing by a domain expert to produce usable output. Inaccurate or incoherent synopses could lead to suboptimal patient care if taken at face value. We plan to repeat this experiment with newer generations of LLMs to evaluate their potential improvement over time.
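As an illustrative sketch only (the abstract does not include analysis code), the summary statistic reported above, a mean Likert score with a two-sided 95% confidence interval, could be computed as shown below. The ratings array is hypothetical placeholder data, and pooling two reviewers' scores across six regimens (12 ratings per LLM per criterion) is an assumption about the analysis, not a detail stated in the abstract.

```python
# Minimal sketch: mean Likert score with a two-sided 95% confidence interval
# for one LLM/criterion pair. Assumes scores from two blinded reviewers are
# pooled across six regimens; the values below are hypothetical, not study data.
import math
from statistics import mean, stdev

from scipy.stats import t

ratings = [4, 3, 4, 4, 3, 4, 4, 4, 3, 4, 4, 4]  # hypothetical 1-5 Likert scores

n = len(ratings)
m = mean(ratings)
se = stdev(ratings) / math.sqrt(n)        # standard error of the mean
margin = t.ppf(0.975, df=n - 1) * se      # t-based 95% margin of error
print(f"mean {m:.2f} (95% CI {m - margin:.2f}-{m + margin:.2f})")
```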