Abstract

e13630

Background: As many as 60% of prior authorization requests are denied, yet coverage is ultimately approved for more than 60% of appeals for some therapies. Appeal processes encumber providers and increase burnout, but large language models (LLMs) may aid providers by drafting appeal letters. We evaluated LLM performance at this task for radiotherapy denials.

Methods: Three commercially accessible LLMs were evaluated: generative pre-trained transformer 3.5 (GPT3.5), GPT4, and GPT4+web with internet search capacity (OpenAI, Inc., San Francisco, CA). A fourth LLM, GPT3.5-FT, was developed by fine-tuning GPT3.5 in a HIPAA-compliant local environment. The fine-tuning training data comprised 53 insurance denial appeal letters prepared by radiation oncologists, each paired with a prompt describing the clinical history and appeal intent. The training data were enriched with appeal letters for proton radiotherapy, stereotactic body radiotherapy, and image-guided radiotherapy across a wide range of clinical scenarios. Twenty prompts, each requesting a letter for a simulated patient history, were programmatically presented to the LLMs. Three radiation oncologists, blinded to the LLM source, scored letter outputs across four domains: language syntax and semantics, clinical detail inclusion, clinical reasoning validity, and overall readiness for insurer submission. Additionally, one radiation oncologist scored the authenticity and relevance of literature sources cited in output letters, which several test prompts requested. Interobserver agreement between radiation oncologists' scores was assessed with Cohen's kappa coefficient. Scores were compared between LLMs with non-parametric statistical tests.

Results: Agreement between radiation oncologists' scores was moderate to excellent across all domains (median κ = 0.68, minimum κ = 0.41). GPT3.5, GPT4, and GPT4+web drafted letters that, by modal score, were semantically and syntactically clear, included all provided clinical history without confabulation, demonstrated clinical reasoning that required few revisions, and overall were submissible to an insurer with minor revisions. GPT4 and GPT4+web demonstrated better clinical reasoning than GPT3.5 (p values < 0.001). In contrast, GPT3.5-FT performance was inferior to the other LLMs across all domains (p values < 0.001). LLMs were poor at identifying, citing, and summarizing relevant literature unless it was provided in the prompt.

Conclusions: LLMs can draft insurance appeal letters for radiotherapy services that require few revisions, but they are poor at referencing relevant literature. Contrary to our hypothesis, fine-tuning with data from our department compromised LLM performance.
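The abstract does not specify the statistical software used; the following is a minimal sketch of the agreement and between-model analyses described in Methods, assuming scikit-learn's `cohen_kappa_score` and SciPy's `kruskal` and `mannwhitneyu`. All score values below are hypothetical placeholders, not study data.

```python
# Sketch of the interobserver-agreement and between-model comparisons
# described in Methods; data values below are hypothetical placeholders.
from itertools import combinations

from scipy.stats import kruskal, mannwhitneyu
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings from two of the three blinded radiation oncologists
# for one scoring domain (e.g., clinical reasoning validity).
rater_a = [4, 5, 3, 4, 2, 5, 4, 3, 4, 5]
rater_b = [4, 4, 3, 5, 2, 5, 4, 3, 4, 4]

# Cohen's kappa quantifies agreement beyond chance for a pair of raters.
kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")

# Hypothetical per-letter scores for each LLM in the same domain.
scores = {
    "GPT3.5":    [3, 4, 3, 4, 3, 4, 3, 4, 3, 4],
    "GPT4":      [4, 5, 4, 5, 4, 5, 4, 5, 4, 5],
    "GPT4+web":  [5, 4, 5, 4, 5, 4, 5, 4, 5, 4],
    "GPT3.5-FT": [2, 2, 3, 2, 2, 3, 2, 2, 3, 2],
}

# Omnibus non-parametric comparison across all four LLMs.
stat, p = kruskal(*scores.values())
print(f"Kruskal-Wallis H = {stat:.2f}, p = {p:.4f}")

# Pairwise non-parametric follow-up comparisons (e.g., GPT4 vs. GPT3.5).
for (name_a, a), (name_b, b) in combinations(scores.items(), 2):
    u, p_pair = mannwhitneyu(a, b)
    print(f"{name_a} vs {name_b}: U = {u:.1f}, p = {p_pair:.4f}")
```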
