Abstract
e13637 Background: The recent development of advanced LLMs has been suggested to improve patient care across several areas, such as clinical decision support or helping to answer patients' questions. Hallucinations have been identified as a blocker to the use of LLMs in routine clinical practice. In-context learning (ICL) and Retrieval Augmented Generation (RAG) could improve LLM performance and reduce hallucinations, consequently making the use of LLMs possible in clinical practice.

Methods: A method using ICL and RAG was developed on top of a health AI platform (Gosta MedKit) to interpret the most recent ESMO (Dec 2022 for NSCLC, Mar 2021 for SCLC) and NCCN (Nov 2023 for NSCLC and SCLC) clinical guidelines for lung cancer. Guidelines (including tables and diagrams) were curated into a text format, then split and stored in a vector database. OpenAI's GPT-4 Turbo model, version gpt-4-1106-preview (GPT4-T), with a knowledge cutoff of April 2023, was used in all implementations. 11 questions about SCLC and 13 questions about NSCLC treatment recommendations and definitions were developed to evaluate the performance of three settings: GPT4-T alone (the model's existing knowledge), ICL with maximum context length (ICL-MC; 128k tokens), and ICL with RAG (ICL-RAG), which heuristically includes only the most relevant parts from the vector database. Question prompts were generated for each setting and guideline source (ESMO, NCCN, and both combined), and two oncologists evaluated 216 responses for alignment with the ESMO and NCCN guidelines.

Results: For responses using ESMO guidelines with oncologists' consensus, ICL-MC and ICL-RAG provided accurate responses for 83.3% and 79.2% of questions, respectively, vs. 62.5% for GPT4-T. For responses using NCCN guidelines with oncologists' consensus, ICL-RAG provided accurate responses for 83.3% of questions, GPT4-T for 75.0%, and ICL-MC for 33.3%.
When more flexibility was allowed in interpreting the results (alignment with either ESMO or NCCN), GPT4-T provided accurate responses for 87.5% of questions vs. 70.8% with ICL-RAG and 58.3% with ICL-MC. No consensus on hallucinations was reported for the ICL approaches, whereas GPT4-T hallucinated the response for 4.2% of questions with ESMO guidelines.

Conclusions: ICL appears to improve LLM performance on stricter tasks, such as providing responses according to specific guidelines, and to reduce hallucinations. ICL outperformed GPT4-T in the case of ESMO guidelines. This highlights the importance of taking local and up-to-date guidelines into account when LLMs are used across different health systems and regulatory environments. In line with earlier studies, a longer ICL context causes models to overlook crucial information; this can be mitigated with RAG, which improves ICL performance and reduces costs when using the models.
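The ICL-RAG workflow described in the Methods (curating guidelines into text, splitting them into chunks held in a vector store, retrieving only the most relevant excerpts, and assembling them into the question prompt) can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the chunking parameters, the bag-of-words similarity used as a stand-in for a real embedding model, and all function names are assumptions for demonstration only.

```python
import math
from collections import Counter

def chunk(text, size=40, overlap=10):
    """Split curated guideline text into overlapping word-window chunks."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def embed(text):
    """Toy bag-of-words vector; a real pipeline would call an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def retrieve(query, chunks, k=2):
    """Heuristic retrieval step: keep only the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, context):
    """Assemble an ICL-RAG prompt: retrieved guideline excerpts plus the question."""
    return "Guideline excerpts:\n" + "\n---\n".join(context) + f"\n\nQuestion: {query}"
```

Retrieving before prompting is what distinguishes ICL-RAG from ICL-MC, which instead packs as much guideline text as fits into the 128k-token context; the retrieval step keeps the prompt short, which is the cost and "lost-in-the-middle" mitigation the Conclusions refer to.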