Introduction

Large language models (LLMs) have gained popularity due to their natural language generation and interpretation capabilities. Integrating these models into medicine could support tasks such as summarizing medical histories, synthesizing literature, and suggesting diagnoses. Models like ChatGPT, GPT-4, and Med-PaLM2 (Singhal et al., 2023) have demonstrated their proficiency by achieving high scores in medical examinations such as the United States Medical Licensing Examination (USMLE) (Kung et al., 2023). However, LLMs may sometimes be inaccurate, providing unverified and erroneous information. In this study, we investigate the potential uses of LLMs in hematology, assessing their knowledge through hematology questions from the USMLE. Additionally, we propose augmenting LLMs with retrieval capabilities for medical guidelines in order to eliminate incorrect information. By extracting relevant information from specified medical documents, this approach holds the potential to streamline decision-making processes.

Methods

For comparative purposes, all experiments were conducted using both GPT-3.5-turbo and GPT-4. In a first step, we evaluated the general knowledge and performance of the LLMs in the field of hematology by testing them on a dataset of 127 question-answer pairs collected from the hematology section of the USMLE and covering various aspects of the field. In a second step, we evaluated the proposed information retrieval framework using a set of 120 multiple-choice questions. These questions focused specifically on the 4th revision of the World Health Organization classification of myeloid neoplasms and acute leukemia guidelines (subsequently called WHO 2017). By testing the framework on this domain-specific dataset, we aimed to assess its ability to extract specific clinical context and relevant information from complex clinical guidelines.

Each question from the WHO 2017 guideline dataset was evaluated using two techniques. First, the questions were assessed using a zero-shot approach, in which the question and its answer options are posed directly to the model, to assess the LLM's capability to respond based on its own knowledge. Second, we employed our proposed information retrieval approach, which enables the system to search the external document (the WHO 2017 guideline) for extracts that are relevant (and similar) to each question. The system then answered based on the contexts retrieved from the document, facilitating more accurate and contextually informed responses. To achieve this, we created an embedding space containing the document's content and conducted a cosine-similarity search between a given question and all the content extracts from the document. The three extracts most similar to the given question were used as context for the LLM.

Results

In the evaluation of 127 hematology questions from the USMLE, GPT-3.5 in zero-shot mode achieved 63% accuracy, while GPT-4 achieved a higher accuracy of 82%. On the WHO 2017 question dataset, the zero-shot approach achieved accuracies of 51% for GPT-3.5 and 71% for GPT-4. Incorporating information retrieval, with the three most relevant extracts from the guideline supplied as context, substantially improved performance: GPT-3.5 achieved 86% accuracy and GPT-4 achieved 97% accuracy.

Conclusions

LLMs have great potential, with current models showcasing substantial knowledge in hematology. However, ensuring the consistency and safety of their responses is critical for their reliable application in medical settings (Thirunavukarasu et al., 2023). To address this, we demonstrated the benefits of information retrieval for question answering in the field of hematology, significantly improving response reliability and accuracy by enabling LLMs to deliver more informed and contextually appropriate answers. The concept was validated using the WHO 2017 guideline, and it can be readily adapted to answer questions based on any set of hematology-related documents. Leveraging LLMs has the potential to significantly enhance the efficiency and effectiveness of clinical, educational, and research work in hematology.
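
As an illustration of the retrieval step described in the Methods, the following is a minimal sketch of a cosine-similarity search that selects the three extracts most similar to a question and assembles them into a prompt. It is not the implementation used in the study: the abstract does not name the embedding model, so `embed` here is a toy bag-of-words stand-in, and the function names (`embed`, `top_k_extracts`, `build_prompt`), the prompt wording, and the placeholder extracts, question, and options are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model (assumption): term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_extracts(question, extracts, k=3):
    """Return the k extracts most similar to the question, best first."""
    q_vec = embed(question)
    scored = [(cosine_similarity(q_vec, embed(e)), e) for e in extracts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [extract for _, extract in scored[:k]]


def build_prompt(question, options, context):
    """Assemble a prompt from retrieved context (hypothetical wording)."""
    context_block = "\n".join(f"- {c}" for c in context)
    option_block = "\n".join(options)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\n{option_block}"
    )


# Placeholder extracts; these are NOT text from the WHO 2017 guideline.
EXTRACTS = [
    "Placeholder extract on blast percentage thresholds in acute myeloid leukemia.",
    "Placeholder extract on cytogenetic abnormalities in acute myeloid leukemia.",
    "Placeholder extract on diagnostic criteria for chronic myeloid leukemia.",
    "Placeholder extract on formatting rules for the reference list.",
]
QUESTION = "Which blast percentage threshold defines acute myeloid leukemia?"
OPTIONS = ["A) option one", "B) option two", "C) option three"]

context = top_k_extracts(QUESTION, EXTRACTS, k=3)
prompt = build_prompt(QUESTION, OPTIONS, context)
print(prompt)
```

In the zero-shot setting the same question and options would be sent without the context block. In a real system, `embed` would be replaced by a dense embedding model and the guideline would be split into extracts ahead of time; both choices are left open here because the abstract does not specify them.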