Large language models encode medical oncology knowledge: Performance on the ASCO and ESMO examination questions.

Jack Bennett Longwell, Fernando Binder, Raymond Woo-Jun Jang, Rahul G Krishnan, Ian Hirsch, Robert C Grant

doi:10.1200/op.2023.19.11_suppl.511

Abstract

511 Background: Chatbots based on large language models (LLMs) have recently developed an unprecedented ability to answer questions across a broad range of applications. Whether LLMs encode sufficient knowledge to answer questions about medical oncology, a highly specialized domain requiring rapid integration of new evidence, is unknown. Methods: We presented ChatGPT (GPT-3.5 and GPT-4) with the American Society of Clinical Oncology (ASCO) Self Assessment Program and the European Society for Medical Oncology (ESMO) Examination Trial questions, excluding those that included images or required knowledge unavailable before the algorithm’s training cutoff date. The proportion of correct answers was compared against random chance. ChatGPT was prompted again for a different answer if its previous answer was incorrect. The reasoning provided by ChatGPT was qualitatively evaluated by two medical oncologists. Results: ChatGPT (GPT-4) correctly answered 84.4% (38/45, 95% confidence interval [CI] 70.5-93.5%, P<0.0001 versus random answering) of the ASCO and 86.7% (65/75, 95% CI 76.8-93.4%, P<0.0001) of the ESMO examination questions. GPT-4 outperformed GPT-3.5 (57.8% [26/45, 95% CI 42.2-72.3%, P=0.001] for ASCO and 65.3% [49/75, 95% CI 53.5-76.0%, P=0.004] for ESMO). Including second attempts, GPT-4 correctly answered 93.3% (42/45, 95% CI 81.7-98.6%) of the ASCO and 93.3% (70/75, 95% CI 85.1-97.8%) of the ESMO examination questions. Incorrect responses to ASCO questions were more common for questions whose answers referenced papers published after 2018 (22.2% [4/18] versus 11.1% [3/27], P=0.03). Oncologists rated the reasoning behind correct answers by GPT-3.5 as complete for 93.3% of questions (70/75, 95% CI 85.1-97.8%). Conclusions: LLMs can answer examination questions designed for medical oncology fellows with impressive and improving accuracy, alongside correct reasoning. These results imply broad potential applications of LLMs during cancer care to improve the patient and provider experience.
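The abstract reports each accuracy as a proportion with a 95% CI and a comparison against random answering, but it does not name the statistical methods. The sketch below is one plausible way to compute such figures, assuming an exact (Clopper-Pearson) binomial interval and a one-sided exact binomial test. The function names and the chance level of 0.25 are hypothetical illustrations, not taken from the abstract, which does not state how many answer options each question had.

```python
from math import comb


def binom_pmf(i, n, p):
    """P(X == i) for X ~ Binomial(n, p)."""
    return comb(n, i) * p**i * (1 - p) ** (n - i)


def upper_tail(k, n, p):
    """P(X >= k): one-sided exact binomial p-value against chance level p."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


def lower_tail(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(0, k + 1))


def _root(f, iters=100):
    """Bisection on [0, 1] for a function that increases from negative to positive."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for k successes in n trials.

    Assumption: the abstract does not say which interval method was used.
    """
    lower = 0.0 if k == 0 else _root(lambda p: upper_tail(k, n, p) - alpha / 2)
    upper = 1.0 if k == n else _root(lambda p: alpha / 2 - lower_tail(k, n, p))
    return lower, upper


if __name__ == "__main__":
    # GPT-4 on the ASCO questions including second attempts: 42 of 45 correct.
    lo, hi = clopper_pearson(42, 45)
    print(f"42/45 = {42 / 45:.1%}, 95% CI {lo:.1%} to {hi:.1%}")

    # Hypothetical chance level of 0.25 (four options per question); the
    # abstract's own P values may rest on a different chance level or test.
    print(f"one-sided p for 38/45 vs chance 0.25: {upper_tail(38, 45, 0.25):.2e}")
```

Under these assumptions, the interval for 42/45 should land close to the reported 81.7-98.6%. The p-value depends entirely on the assumed chance level, so it should not be expected to reproduce the abstract's reported P values.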
