Assessment of Artificial Intelligence Language Models and Information Retrieval Strategies for QA in Hematology

Maria R Cervera, Marta Hidalgo Soto, Oscar Darias, Ana Mendoza-Martínez, Julia Montoro, David Bermejo-Peláez, Adriana Oñós Clausell, Miguel Gómez-Álvarez, Miguel Luengo-Oroz, Joaquin Martínez-López, Jaime García-Villena, Celina Benavente Cuesta

doi:10.1182/blood-2023-178528

Abstract

Introduction

Large language models (LLMs) have gained popularity due to their natural language generation and interpretation capabilities. Integrating these models into medicine enables tasks such as summarizing medical histories, synthesizing literature, and suggesting diagnoses. Models like ChatGPT, GPT-4, and Med-PaLM2 (Singhal et al., 2023) have demonstrated their proficiency by achieving high scores on medical tests such as the United States Medical Licensing Examination (USMLE) (Kung et al., 2023). However, LLMs can be inaccurate, providing unverified and erroneous information. In this study, we investigate the potential uses of LLMs in hematology, assessing their knowledge through hematology questions from the USMLE. Additionally, we propose augmenting LLMs with retrieval capabilities for medical guidelines in order to eliminate incorrect information. By extracting relevant information from specified medical documents, this approach holds the potential to streamline decision-making processes.

Methods

For comparative purposes, all experiments were conducted using both the GPT-3.5-turbo and GPT-4 models. In a first step, we evaluated the general knowledge and performance of the LLMs in hematology by testing them on a collected dataset of 127 question-answer pairs from the hematology section of the USMLE, covering various aspects of the field. In a second step, we evaluated the proposed information retrieval framework using a set of 120 multiple-choice questions focused specifically on the 4th revision of the World Health Organization classification of myeloid neoplasms and acute leukemia guidelines (hereafter WHO 2017). By testing the framework on this domain-specific dataset, we aimed to assess its ability to extract specific clinical context and relevant information from complex clinical guidelines.

Each question from the WHO 2017 guideline dataset was evaluated using two techniques. First, the questions were assessed using a zero-shot approach, in which the question and its answer options are posed directly to the model, to assess the LLM's capability to respond based on its own knowledge. Second, we employed our proposed information retrieval approach, which lets the system search the external document (the WHO 2017 guideline) for the extracts most relevant and similar to each question. The system then answered based on the retrieved context, facilitating more accurate and contextually informed responses. To achieve this, we created an embedding space containing the document's content and conducted a cosine-similarity search between a given question and all the content extracts from the document. The top three extracts, ranked by similarity to the question, were used as context for the LLM.

Results

In the evaluation of 127 hematology questions from the USMLE, GPT-3.5 in zero-shot mode achieved 63% accuracy, while GPT-4 achieved a higher accuracy of 82%. On the WHO 2017 question dataset, the zero-shot approach achieved accuracy rates of 51% for GPT-3.5 and 71% for GPT-4. Adding information retrieval, with the three most relevant extracts from the guideline supplied as context, substantially improved performance: GPT-3.5 achieved 86% accuracy and GPT-4 achieved 97%.

Conclusions

LLMs have great potential, with current models showcasing substantial knowledge in hematology. However, ensuring the consistency and safety of their responses is critical for their reliable application in medical settings (Thirunavukarasu et al., 2023). To address this, we demonstrated the benefits of information retrieval for question-answering in hematology, significantly improving response reliability and accuracy by enabling LLMs to deliver more informed and contextually appropriate answers. The concept was validated using the WHO 2017 guideline, and it can be readily adapted to answer questions based on any set of hematology-related documents. Leveraging LLMs has the potential to significantly enhance the efficiency and effectiveness of clinical, educational, and research work in hematology.
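
The retrieval step described in the Methods (embed the document extracts, rank them by cosine similarity to the question, and pass the top three to the model as context) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the embedding model or the prompt wording, so `embed` is a toy bag-of-words stand-in for a real embedding model, and the function names, prompt text, and placeholder extracts are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model (assumption): bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_extracts(question, extracts, k=3):
    """Return the k extracts most similar to the question, best first."""
    q = embed(question)
    scored = [(cosine_similarity(q, embed(e)), i) for i, e in enumerate(extracts)]
    # Sort by descending similarity; break ties by original document order.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [extracts[i] for _, i in scored[:k]]


def build_prompt(question, options, context=None):
    """Zero-shot prompt when context is None, retrieval-augmented otherwise."""
    lines = []
    if context:
        lines.append("Answer using only the context below.")
        lines.extend(f"Context {n}: {c}" for n, c in enumerate(context, 1))
    lines.append(f"Question: {question}")
    lines.extend(f"{label}. {text}" for label, text in options)
    lines.append("Answer with a single option letter.")
    return "\n".join(lines)


# Placeholder data only; these are not real guideline extracts or questions.
extracts = [
    "Section A describes diagnostic criteria for condition alpha.",
    "Section B lists treatment options for condition beta.",
    "Section C covers diagnostic criteria for condition gamma.",
    "Section D is a glossary of abbreviations.",
]
question = "Which diagnostic criteria apply to condition alpha?"
options = [("A", "criteria one"), ("B", "criteria two"), ("C", "criteria three")]

zero_shot_prompt = build_prompt(question, options)
retrieval_prompt = build_prompt(question, options, top_k_extracts(question, extracts))
print(retrieval_prompt)
```

The same question is turned into two prompts, mirroring the two evaluation conditions: one with the question and options alone, and one with the three retrieved extracts prepended. In practice each prompt would be sent to the LLM and the returned option letter compared against the answer key to compute accuracy.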

Journal: Blood	Publication Date: Nov 2, 2023
Citations: 1
