Abstract

Purpose: To determine whether leading, commercially available large language models (LLMs) provide treatment recommendations concordant with evidence-based clinical practice guidelines (CPGs) developed by the American Academy of Orthopaedic Surgeons (AAOS).

Methods: All CPGs concerning the management of rotator cuff tears (n=33) and anterior cruciate ligament (ACL) injuries (n=15) were extracted from the AAOS. Treatment recommendations from Chat Generative Pre-trained Transformer version 4 (ChatGPT-4; OpenAI), Gemini (Google), Mistral-7B (Mistral AI), and Claude-3 (Anthropic) were graded by two blinded physicians as "concordant," "discordant," or "indeterminate" (i.e., a neutral response without a definitive recommendation) with respect to AAOS CPGs. The overall concordance between LLM and AAOS recommendations was quantified, and the comparative overall concordance of recommendations among the four LLMs was evaluated with Fisher's exact test.

Results: Overall, 135 (70.3%) responses were concordant, 43 (22.4%) indeterminate, and 14 (7.3%) discordant. Inter-rater reliability for concordance classification was excellent (kappa = 0.92). Concordance with AAOS CPGs was most frequently observed with ChatGPT-4 (n=38, 79.2%) and least frequently with Mistral-7B (n=28, 58.3%). Indeterminate recommendations were most frequently observed with Mistral-7B (n=17, 35.4%) and least frequently with Claude-3 (n=8, 16.7%). Discordant recommendations were most frequently observed with Gemini (n=6, 12.5%) and least frequently with ChatGPT-4 (n=1, 2.1%). Overall, no statistically significant difference in concordant recommendations was observed across LLMs (p=0.12). Only 20 (10.4%) of all recommendations were transparent, providing references with full bibliographic details or links to specific peer-reviewed content to support the recommendations.
Conclusion: Among leading, commercially available LLMs, more than one in four recommendations concerning the evaluation and management of rotator cuff and ACL injuries do not reflect current evidence-based CPGs. Although ChatGPT-4 demonstrated the highest performance, clinically significant rates of recommendations without concordance or supporting evidence were observed. Only 10% of LLM responses were transparent, precluding users from fully interpreting the sources from which recommendations were derived.

Clinical Relevance: While leading LLMs generally provide recommendations concordant with CPGs, a substantial error rate exists, and the proportion of recommendations that do not align with these CPGs suggests that LLMs are not trustworthy clinical support tools at this time. Each off-the-shelf, closed-source LLM has strengths and weaknesses. Future research should evaluate and compare multiple LLMs to avoid the bias associated with narrow evaluation of few models, as observed in the current literature.


