Abstract

e13630

Background: As many as 60% of prior authorization requests are denied, yet coverage is ultimately approved for more than 60% of appeals for some therapies. Appeal processes encumber providers and increase burnout, but large language models (LLMs) may aid providers by drafting appeal letters. We evaluated LLM performance at this task for radiotherapy denials.

Methods: Three commercially accessible LLMs were evaluated: generative pre-trained transformer 3.5 (GPT3.5), GPT4, and GPT4+web with internet search capacity (OpenAI, Inc., San Francisco, CA). A fourth LLM, GPT3.5-FT, was developed by fine-tuning GPT3.5 in a HIPAA-compliant local environment. The fine-tuning training data comprised 53 insurance denial appeal letters prepared by radiation oncologists, each paired with a prompt describing the clinical history and appeal intent. The training data were enriched with appeal letters for proton radiotherapy, stereotactic body radiotherapy, and image-guided radiotherapy across a wide range of clinical scenarios. Twenty prompts, each requesting a letter for a simulated patient history, were programmatically presented to the LLMs. Three radiation oncologists, blinded to the LLM source, scored letter outputs across four domains: language syntax and semantics, clinical detail inclusion, clinical reasoning validity, and overall readiness for insurer submission. Additionally, one radiation oncologist scored the authenticity and relevance of literature sources cited in output letters, which several test prompts requested. Interobserver agreement between radiation oncologists' scores was assessed with Cohen's kappa coefficient. Scores were compared between LLMs with non-parametric statistical tests.

Results: Agreement between radiation oncologists' scores was moderate to excellent across all domains (median κ = 0.68, minimum κ = 0.41). GPT3.5, GPT4, and GPT4+web drafted letters that, by modal score, were semantically and syntactically clear, included all provided clinical history without confabulation, demonstrated clinical reasoning that required few revisions, and overall were submissible to an insurer with minor revisions. GPT4 and GPT4+web demonstrated better clinical reasoning than GPT3.5 (p values < 0.001). In contrast, GPT3.5-FT performance was inferior to the other LLMs across all domains (p values < 0.001). LLMs were poor at identifying, citing, and summarizing relevant literature unless it was provided in the prompt.

Conclusions: LLMs can draft insurance appeal letters for radiotherapy services that require few revisions, but they are poor at referencing relevant literature. Contrary to our hypothesis, fine-tuning with data from our department compromised LLM performance.
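The abstract does not specify the statistical software used; the following is a minimal sketch of the agreement and between-model analyses described in Methods, assuming scikit-learn's `cohen_kappa_score` and SciPy's `kruskal` and `mannwhitneyu`. All score values below are hypothetical placeholders, not study data.

```python
# Sketch of the interobserver-agreement and between-model comparisons
# described in Methods; data values below are hypothetical placeholders.
from itertools import combinations

from scipy.stats import kruskal, mannwhitneyu
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings from two of the three blinded radiation oncologists
# for one scoring domain (e.g., clinical reasoning validity).
rater_a = [4, 5, 3, 4, 2, 5, 4, 3, 4, 5]
rater_b = [4, 4, 3, 5, 2, 5, 4, 3, 4, 4]

# Cohen's kappa quantifies agreement beyond chance for a pair of raters.
kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")

# Hypothetical per-letter scores for each LLM in the same domain.
scores = {
    "GPT3.5":    [3, 4, 3, 4, 3, 4, 3, 4, 3, 4],
    "GPT4":      [4, 5, 4, 5, 4, 5, 4, 5, 4, 5],
    "GPT4+web":  [5, 4, 5, 4, 5, 4, 5, 4, 5, 4],
    "GPT3.5-FT": [2, 2, 3, 2, 2, 3, 2, 2, 3, 2],
}

# Omnibus non-parametric comparison across all four LLMs.
stat, p = kruskal(*scores.values())
print(f"Kruskal-Wallis H = {stat:.2f}, p = {p:.4f}")

# Pairwise non-parametric follow-up comparisons (e.g., GPT4 vs. GPT3.5).
for (name_a, a), (name_b, b) in combinations(scores.items(), 2):
    u, p_pair = mannwhitneyu(a, b)
    print(f"{name_a} vs {name_b}: U = {u:.1f}, p = {p_pair:.4f}")
```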
