Investigating large language model (LLM) performance using in-context learning (ICL) for interpretation of ESMO and NCCN guidelines for lung cancer.

Sanna Iivanainen, Henri Viertolahti, Jarkko Lagus, Lauri Sippola, Jussi Koivunen

doi:10.1200/jco.2024.42.16_suppl.e13637

Abstract

e13637

Background: The recent development of advanced LLMs has been suggested to improve patient care in several areas, such as clinical decision support and answering patients’ questions. Hallucinations have been identified as a barrier to the use of LLMs in routine clinical practice. ICL and retrieval-augmented generation (RAG) could improve LLM performance and reduce hallucinations, consequently making the use of LLMs feasible in clinical practice.

Methods: A method using ICL and RAG was developed on top of a health AI platform (Gosta MedKit) to interpret the most recent ESMO (Dec 2022 for NSCLC, Mar 2021 for SCLC) and NCCN (Nov 2023 for NSCLC and SCLC) clinical guidelines for lung cancer. The guidelines (including tables and diagrams) were curated into text format, split, and stored in a vector database. OpenAI’s GPT-4 Turbo model, version gpt-4-1106-preview (GPT4-T), with a knowledge cutoff of April 2023, was used in all implementations. Eleven questions about SCLC and 13 questions about NSCLC treatment recommendations and definitions were developed to evaluate three settings: GPT4-T alone (relying on the model's existing knowledge), ICL with the maximum context length of 128k tokens (ICL-MC), and ICL with RAG (ICL-RAG), which heuristically included only the most relevant parts from the vector database. Question prompts were generated for each setting and guideline source (ESMO, NCCN, and both combined), and two oncologists evaluated the 216 resulting responses for their alignment with the ESMO and NCCN guidelines.

Results: Among responses using ESMO guidelines on which the oncologists reached consensus, ICL-MC and ICL-RAG provided accurate responses for 83.3% and 79.2% of questions, respectively, vs. 62.5% for GPT4-T. Among responses using NCCN guidelines with oncologist consensus, ICL-RAG provided accurate responses for 83.3%, GPT4-T for 75.0%, and ICL-MC for 33.3% of questions. When more flexibility was allowed in interpreting the results (alignment with either ESMO or NCCN), GPT4-T provided accurate responses for 87.5% of questions vs. 70.8% with ICL-RAG and 58.3% with ICL-MC. No hallucinations were reported with consensus for the ICL approaches, whereas GPT4-T hallucinated the response for 4.2% of questions with ESMO guidelines.

Conclusions: ICL seems to improve LLM performance on stricter tasks, such as providing responses according to specific guidelines, and to reduce hallucinations. ICL outperformed GPT4-T for the ESMO guidelines. This highlights the importance of taking local and up-to-date guidelines into account when LLMs are used across different health systems and regulatory environments. In line with earlier studies, a longer context for ICL makes models forget crucial information; this can be mitigated by using RAG, which improves ICL performance and reduces the cost of using the models.
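
The evaluation design and the reported percentages can be cross-checked with a little arithmetic. The sketch below is not part of the study. It assumes that the 216 responses arise as questions × settings × guideline sources, and that each reported percentage is a share of all 24 questions. The abstract does not state the denominators, and a consensus-only subset could be smaller, so the implied counts are an inference rather than reported data. The helper name `implied_count` is hypothetical.

```python
# Question counts, settings, and guideline sources as described in Methods.
SCLC_QUESTIONS = 11
NSCLC_QUESTIONS = 13
SETTINGS = ("GPT4-T", "ICL-MC", "ICL-RAG")
GUIDELINE_SOURCES = ("ESMO", "NCCN", "ESMO+NCCN")

n_questions = SCLC_QUESTIONS + NSCLC_QUESTIONS
n_responses = n_questions * len(SETTINGS) * len(GUIDELINE_SOURCES)

# Percentages quoted in Results (accuracy figures plus the hallucination rate).
REPORTED_PERCENTAGES = [83.3, 79.2, 62.5, 75.0, 33.3, 87.5, 70.8, 58.3, 4.2]


def implied_count(percent, denominator):
    """Return k if k/denominator rounds to `percent` at one decimal, else None.

    ASSUMPTION: the denominator is the full question set; the abstract
    does not confirm this.
    """
    k = round(percent * denominator / 100)
    return k if round(100 * k / denominator, 1) == percent else None


implied = {p: implied_count(p, n_questions) for p in REPORTED_PERCENTAGES}

print(n_questions, n_responses)
for percent, count in implied.items():
    print(f"{percent}% is consistent with {count}/{n_questions}")
```

Under these assumptions the product matches the 216 responses stated in Methods, and every reported percentage corresponds to a whole number of questions out of 24 (for example, 83.3% to 20/24 and 4.2% to 1/24). This is consistent with, but does not prove, a denominator of 24.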


More From: Journal of Clinical Oncology
