Evaluating Large Language Models for Requirements Question Answering in Industrial Aerospace Software

TL;DR

This study assesses large language models' ability to support aerospace software requirements question answering, using a benchmark of nearly 6,700 QA pairs across diverse data formats. Results show limited domain performance, with retrieval-augmented generation and few-shot learning improving capabilities; hallucination types are analyzed, and LLMs are especially helpful for junior engineers, informing future applications in aerospace.

Abstract

Aerospace software presents significant challenges to requirements engineering due to its design complexity and stringent safety standards. When manually drafting requirement documents, engineers need strong domain knowledge while also navigating heterogeneous data, which leads to errors and inefficiencies. This paper evaluates the capabilities of large language models (LLMs) in understanding aerospace software requirements and their potential to assist in requirements question answering (QA). We develop an aerospace requirements QA benchmark based on industrial software assets, books, and research materials, creating a total of 6,696 QA pairs across ten tasks and three heterogeneous data formats: text, tables, and formulas. We then evaluate the domain-specific performance of five mainstream open-source LLMs using zero-shot learning, few-shot learning, and retrieval-augmented generation (RAG) techniques. We further categorize hallucinations from LLMs and quantitatively analyze error distributions. Moreover, we conduct a user study to assess the practical usefulness of LLMs when applied to requirements QA. The evaluation results show that (1) LLMs demonstrate limited performance in the aerospace software domain, (2) RAG techniques significantly enhance the capabilities of LLMs for text-based tasks, while few-shot learning improves the performance of most LLMs, (3) four distinct types of QA hallucinations are identified, and (4) LLM QA is particularly beneficial for junior engineers. This research provides valuable perspectives for the future application of LLMs in aerospace software.
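
The three prompting setups named in the abstract (zero-shot, few-shot, and RAG) differ mainly in what is placed in front of the question. The sketch below illustrates that difference only; it is not the paper's pipeline. The prompt wording, the function names, the token-overlap retriever, and the sample requirement passages are all assumptions made up for illustration.

```python
import re

def tokenize(text):
    """Lowercased set of alphanumeric tokens (toy tokenizer, an assumption)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, passages, k=2):
    """Toy retriever: rank passages by token overlap with the question.
    A real RAG system would typically use embeddings or BM25 instead."""
    q = tokenize(question)
    ranked = sorted(passages, key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:k]

def zero_shot_prompt(question):
    """Question only, no examples and no retrieved context."""
    return f"Answer the requirements question.\nQuestion: {question}\nAnswer:"

def few_shot_prompt(question, examples):
    """Prepend a handful of solved (question, answer) pairs."""
    shots = "\n".join(f"Question: {q}\nAnswer: {a}" for q, a in examples)
    return (
        "Answer the requirements question.\n"
        f"{shots}\nQuestion: {question}\nAnswer:"
    )

def rag_prompt(question, passages, k=2):
    """Prepend the top-k retrieved passages as context."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages, k))
    return (
        "Use the context to answer the requirements question.\n"
        f"Context:\n{context}\nQuestion: {question}\nAnswer:"
    )

# Hypothetical requirement passages, not taken from the benchmark.
passages = [
    "The attitude control task shall execute every 20 ms.",
    "Telemetry frames shall be downlinked at 1 Hz.",
    "The watchdog timer shall reset the processor after 500 ms without a heartbeat.",
]
question = "How often shall the attitude control task execute?"
examples = [("At what rate are telemetry frames downlinked?", "1 Hz")]

print(zero_shot_prompt(question))
print(few_shot_prompt(question, examples))
print(rag_prompt(question, passages, k=1))
```

Each function returns a plain string that could be sent to any LLM; only the context in front of the question changes between the three setups.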

Similar Papers
  • Research Article
  • Cite Count: 12
  • 10.1007/s41666-025-00190-z
Adapting Generative Large Language Models for Information Extraction from Unstructured Electronic Health Records in Residential Aged Care: A Comparative Analysis of Training Approaches
  • Feb 20, 2025
  • Journal of Healthcare Informatics Research
  • Dinithi Vithanage + 7 more

Information extraction (IE) of unstructured electronic health records is challenging due to the semantic complexity of textual data. Generative large language models (LLMs) offer promising solutions to address this challenge. However, identifying the best training methods to adapt LLMs for IE in residential aged care settings remains underexplored. This research addresses this challenge by evaluating the effects of zero-shot and few-shot learning, both with and without parameter-efficient fine-tuning (PEFT) and retrieval-augmented generation (RAG) using Llama 3.1-8B. The study performed named entity recognition (NER) on nursing notes from Australian residential aged care facilities (RACFs), focusing on agitation in dementia and malnutrition risk factors. Performance evaluation includes accuracy, macro-averaged precision, recall, and F1 score. We used non-parametric statistical methods to test whether the differences were statistically significant. Results show that zero-shot and few-shot learning, whether combined with PEFT or RAG, achieve comparable performance across the clinical domains when the same prompting template is used. Few-shot learning significantly outperforms zero-shot learning when neither PEFT nor RAG is applied. Notably, PEFT significantly improves model performance in both zero-shot and few-shot learning; however, RAG significantly improves performance only in few-shot learning. After PEFT, the performance of zero-shot learning reaches a comparable level with few-shot learning. However, few-shot learning with RAG significantly outperforms zero-shot learning with RAG. We also found a similar level of performance between few-shot learning with RAG and zero-shot learning with PEFT. These findings provide valuable insights for researchers, practitioners, and stakeholders to optimize the use of generative LLMs in clinical IE.

  • Conference Article
  • Cite Count: 8
  • 10.2118/217671-ms
Enhancing Information Retrieval in the Drilling Domain: Zero-Shot Learning with Large Language Models for Question-Answering
  • Feb 27, 2024
  • F J Pacis + 3 more

Finding information across multiple databases, formats, and documents remains a manual job in the drilling industry. Large Language Models (LLMs) have proven effective in data-aggregation tasks, including answering questions. However, using LLMs for domain-specific factual responses poses a nontrivial challenge. The expert labor cost for training domain-specific LLMs prohibits niche industries from developing custom question-answering bots. This paper tests several commercial LLMs for information retrieval tasks for drilling data using zero-shot in-context learning. In addition, we studied the model’s calibration using a few-shot multiple-choice drilling questionnaire. To create an LLM benchmark for drilling, we collated the text data from publicly available databases: the Norwegian Petroleum Directorate (NPD), company annual reports, and petroleum glossary. We used a zero-shot learning technique that relies on an LLM’s ability to generate responses for tasks outside its training. We implemented a controlled zero-shot learning "in-context" procedure that sends a user’s query augmented with text data to the LLM as inputs. This implementation encourages the LLM to take the answer from the data while leveraging its pre-trained contextual-learning capability. We evaluated several state-of-the-art generic LLMs available through an API, including G4, G3.5-TI, J2-ultra model, and L2 series. The paper documents the pre-trained LLMs’ ability to provide correct answers and identify petroleum industry jargon from the collated dataset. Our zero-shot in-context learning implementation helps vanilla LLMs provide relevant factual responses for the drilling domain. While each LLM’s performance varies, we have identified models suitable for a drilling chatbot application. In particular, G4 outperformed on all the tasks. This finding suggests that training expensive domain-specific LLMs is not necessary for question-answering tasks in the context of drilling data. 
We demonstrate the utility of zero-shot in-context learning using pre-trained LLMs for question-answering tasks relevant to the drilling industry. Additionally, we prepared and publicly released the collated datasets from the NPD database and companies’ annual reports to enable results reproducibility and to foster acceleration of language model adoption and development for the subsurface and drilling industries. The petroleum industry may find our solution beneficial for enhancing personnel training and career development. It also offers a method for conducting data analytics and overcoming challenges in retrieving historical well data.

  • Research Article
  • Cite Count: 43
  • 10.1055/a-2264-5631
Improving the use of LLMs in radiology through prompt engineering: from precision prompts to zero-shot learning.
  • Feb 26, 2024
  • RoFo : Fortschritte auf dem Gebiete der Rontgenstrahlen und der Nuklearmedizin
  • Maximilian Frederik Russe + 3 more

Large language models (LLMs) such as ChatGPT have shown significant potential in radiology. Their effectiveness often depends on prompt engineering, which optimizes the interaction with the chatbot for accurate results. Here, we highlight the critical role of prompt engineering in tailoring the LLMs' responses to specific medical tasks. Using a clinical case, we elucidate different prompting strategies to adapt the LLM ChatGPT using GPT4 to new tasks without additional training of the base model. These approaches range from precision prompts to advanced in-context methods such as few-shot and zero-shot learning. Additionally, the significance of embeddings, which serve as a data representation technique, is discussed. Prompt engineering substantially improved and focused the chatbot's output. Moreover, embedding of specialized knowledge allows for more transparent insight into the model's decision-making and thus enhances trust. Despite certain challenges, prompt engineering plays a pivotal role in harnessing the potential of LLMs for specialized tasks in the medical domain, particularly radiology. As LLMs continue to evolve, techniques like few-shot learning, zero-shot learning, and embedding-based retrieval mechanisms will become indispensable in delivering tailored outputs. · Large language models might impact radiological practice and decision-making. · However, implementation and performance are dependent on the assigned task. · Optimization of prompting strategies can substantially improve model performance. · Strategies for prompt engineering range from precision prompts to zero-shot learning. · Russe MF, Reisert M, Bamberg F et al. Improving the use of LLMs in radiology through prompt engineering: from precision prompts to zero-shot learning. Fortschr Röntgenstr 2024; 196: 1166 - 1170.

  • Research Article
  • Cite Count: 1
  • 10.2118/0125-0092-jpt
Zero-Shot Learning With Large Language Models Enhances Drilling-Information Retrieval
  • Jan 1, 2025
  • Journal of Petroleum Technology
  • Chris Carpenter

This article, written by JPT Technology Editor Chris Carpenter, contains highlights of paper SPE 217671, “Enhancing Information Retrieval in the Drilling Domain: Zero-Shot Learning With Large Language Models for Question Answering,” by Felix J. Pacis, SPE, University of Stavanger, and Sergey Alyaev and Gilles Pelfrene, SPE, NORCE, et al. The paper has not been peer reviewed. Finding information across multiple databases, formats, and documents remains a manual job in the drilling industry. Large language models (LLMs) have proven effective in data-aggregation tasks, including answering questions. However, using LLMs for domain-specific factual responses poses a nontrivial challenge. The expert-labor cost for training domain-specific LLMs prohibits niche industries from developing custom question-answering bots. The complete paper tests several commercial LLMs for information-retrieval tasks for drilling data using zero-shot in-context learning. In addition, the model’s calibration is tested with a few-shot multiple-choice drilling questionnaire. Introduction. While LLMs have proven effective in various tasks ranging from sentiment analysis to text completion, using LLMs for question-answering tasks presents a challenge in providing factual responses. Pretrained LLMs only serve as a parameterized implicit knowledge base and cannot access recent data; thus, information is bounded by the time of training. Retrieval augmented generation (RAG) can address some of these issues by extending the utility of LLMs to specific data sources. Fig. 1 shows a simplified RAG-based LLM question/answer application. RAG involves two primary components: document retrieval (green boxes), which retrieves the most relevant context based on the query, and LLM response generation (blue boxes). 
During the response generation, LLM operates based on the prompt, query, and retrieved context without any change in the model parameters, a process the authors term as “in-context learning.” Methodology Two experiments have been conducted: The first one is a few-shot multiple-choice experiment evaluated using the SLB drilling glossary; the second is a zero-shot in-context experiment evaluated on drilling reports and company reports. Multiple-Choice Experiment. SLB Drilling Glossary. For the multiple-choice experiment, a publicly available drilling glossary served as a basis for evaluation. A total of 409 term/definition pairs were considered. Five term/definition pairs were chosen, serving as few-shot default values, while the remaining 404 pairs served as the multiple-choice questions. Four choices were given for each term/definition question pair, where one was the correct answer. The three incorrect choices were picked randomly from all possible terms minus the true answer. Zero-Shot In-Context Experiment. Norwegian Petroleum Directorate (NPD) Database. The authors explored the wellbore history of all individual exploration wells drilled in the Norwegian shelf in the NPD database. In this experiment, 12 exploration wells were randomly chosen for evaluation. In addition to these drilling reports, information about the stratigraphy of three additional wells was added. Annual Reports. Annual reports of two major operators in Norway for 2020 and 2021 also were considered. These consisted of short summaries that presented the main operational and economic results achieved by the company throughout the year. These reports were added to the evaluation to balance the higher technical content of the wellbore-history reports.
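
The multiple-choice setup described above (409 term/definition pairs, five held out as few-shot examples, and three random distractors per question) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function name, the fixed seed, the choice of the first five pairs as the few-shot examples, and the synthetic glossary are assumptions.

```python
import random

def build_multiple_choice(pairs, n_shots=5, n_choices=4, seed=0):
    """Split (term, definition) pairs into few-shot examples and
    multiple-choice questions with randomly drawn distractors."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    # Assumption: the first n_shots pairs serve as the few-shot examples.
    shots, rest = pairs[:n_shots], pairs[n_shots:]
    all_terms = [term for term, _ in pairs]
    questions = []
    for term, definition in rest:
        # Distractors come from all terms except the true answer.
        candidates = [t for t in all_terms if t != term]
        choices = rng.sample(candidates, n_choices - 1) + [term]
        rng.shuffle(choices)
        questions.append(
            {"definition": definition, "choices": choices, "answer": term}
        )
    return shots, questions

# Synthetic stand-in for the glossary; the real entries are not reproduced here.
glossary = [(f"term{i}", f"definition of term{i}") for i in range(409)]
shots, questions = build_multiple_choice(glossary)
print(len(shots), len(questions))  # 5 few-shot pairs, 404 questions
```

With four choices per question, a model guessing uniformly at random would be expected to score about 25%, which gives a baseline for reading the reported accuracies.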

  • Research Article
  • Cite Count: 4
  • 10.1007/s00117-025-01416-2
Optimized interaction with Large Language Models : A practical guide to Prompt Engineering and Retrieval-Augmented Generation
  • Feb 21, 2025
  • Radiologie (Heidelberg, Germany)
  • Anna Fink + 4 more

Given the increasing number of radiological examinations, large language models (LLMs) offer promising support in radiology. Optimized interaction is essential to ensure reliable results. This article provides an overview of interaction techniques such as prompt engineering, zero-shot learning, and retrieval-augmented generation (RAG) and gives practical tips for their application in radiology. Demonstration of interaction techniques based on practical examples with concrete recommendations for their application in routine radiological practice. Advanced interaction techniques allow task-specific adaptation of LLMs without the need for retraining. The creation of precise prompts and the use of zero-shot and few-shot learning can significantly improve response quality. RAG enables the integration of current and domain-specific information into LLM tools, increasing the accuracy and relevance of the generated content. The use of prompt engineering, zero-shot and few-shot learning, and RAG can optimize interaction with LLMs in radiology. Through these targeted strategies, radiologists can efficiently integrate general chatbots into routine practice to improve patient care.

  • Research Article
  • Cite Count: 4
  • 10.1038/s41698-025-00916-7
Evaluating the performance of large language & visual-language models in cervical cytology screening
  • May 23, 2025
  • npj Precision Oncology
  • Qi Hong + 15 more

Large language models (LLMs) and large visual-language models (LVLMs) have exhibited near-human levels of knowledge, image comprehension, and reasoning abilities, and their performance has undergone evaluation in some healthcare domains. However, a systematic evaluation of their capabilities in cervical cytology screening has yet to be conducted. Here, we constructed CCBench, a benchmark dataset dedicated to the evaluation of LLMs and LVLMs in cervical cytology screening, and developed a GPT-based semi-automatic evaluation pipeline to assess the performance of six LLMs (GPT-4, Bard, Claude-2.0, LLaMa-2, Qwen-Max, and ERNIE-Bot-4.0) and five LVLMs (GPT-4V, Gemini, LLaVA, Qwen-VL, and ViLT) on this dataset. CCBench comprises 773 question-answer (QA) pairs and 420 visual-question-answer (VQA) triplets, making it the first dataset in cervical cytology to include both QA and VQA data. We found that LLMs and LVLMs demonstrate promising accuracy and specialization in cervical cytology screening. GPT-4 achieved the best performance on the QA dataset, with an accuracy of 70.5% for close-ended questions and average expert evaluation score of 6.9/10 for open-ended questions. On the VQA dataset, Gemini achieved the highest accuracy for close-ended questions at 67.8%, while GPT-4V attained the highest expert evaluation score of 6.1/10 for open-ended questions. Besides, LLMs and LVLMs revealed varying abilities in answering questions across different topics and difficulty levels. However, their performance remains inferior to the expertise exhibited by cytopathology professionals, and the risk of generating misinformation could lead to potential harm. Therefore, substantial improvements are required before these models can be reliably deployed in clinical practice.

  • Research Article
  • 10.1016/j.jbi.2026.105034
A Study of Large Language Models for Patient Information Extraction: Model Architecture, Fine-Tuning Strategy, and Multi-task Instruction Tuning
  • Mar 27, 2026
  • Journal of biomedical informatics
  • Cheng Peng + 5 more

  • Research Article
  • Cite Count: 106
  • 10.1038/s41746-024-01024-9
CancerGPT for few shot drug pair synergy prediction using large pretrained language models
  • Feb 19, 2024
  • NPJ Digital Medicine
  • Tianhao Li + 6 more

Large language models (LLMs) have been shown to have significant potential in few-shot learning across various fields, even with minimal training data. However, their ability to generalize to unseen tasks in more complex fields, such as biology and medicine, has yet to be fully evaluated. LLMs can offer a promising alternative approach for biological inference, particularly in cases where structured data and sample size are limited, by extracting prior knowledge from text corpora. Here we report our proposed few-shot learning approach, which uses LLMs to predict the synergy of drug pairs in rare tissues that lack structured data and features. Our experiments, which involved seven rare tissues from different cancer types, demonstrate that the LLM-based prediction model achieves significant accuracy with very few or zero samples. Our proposed model, the CancerGPT (with ~ 124M parameters), is comparable to the larger fine-tuned GPT-3 model (with ~ 175B parameters). Our research contributes to tackling drug pair synergy prediction in rare tissues with limited data, and to advancing the use of LLMs for biological and medical inference tasks.

  • Research Article
  • Cite Count: 1
  • 10.1101/2025.02.27.640661
SensitiveCancerGPT: Leveraging Generative Large Language Model on Structured Omics Data to Optimize Drug Sensitivity Prediction.
  • Mar 3, 2025
  • bioRxiv : the preprint server for biology
  • Shaika Chowdhury + 6 more

The fast accumulation of vast pharmacogenomics data of cancer cell lines provides unprecedented opportunities for drug sensitivity prediction (DSP), a crucial prerequisite for the advancement of precision oncology. Recently, Generative Large Language Models (LLMs) have demonstrated performance and generalization prowess across diverse tasks in the field of natural language processing (NLP). However, the structured format of the pharmacogenomics data poses a challenge for the utility of LLMs in DSP. Therefore, the objective of this study is multi-fold: to adapt prompt engineering for structured pharmacogenomics data toward optimizing LLM's DSP performance, to evaluate LLM's generalization in real-world DSP scenarios, and to compare LLM's DSP performance against that of state-of-the-science baselines. We systematically investigated the capability of the Generative Pre-trained Transformer (GPT) as a DSP model on four publicly available benchmark pharmacogenomics datasets, which are stratified by five cancer tissue types of cell lines and encompass both oncology and non-oncology drugs. Essentially, the predictive landscape of GPT is assessed for effectiveness on the DSP task via four learning paradigms: zero-shot learning, few-shot learning, fine-tuning and clustering pretrained embeddings. To facilitate GPT in seamlessly processing the structured pharmacogenomics data, domain-specific novel prompt engineering is employed by implementing three prompt templates (i.e., Instruction, Instruction-Prefix, Cloze) and integrating pharmacogenomics-related features into the prompt. We validated GPT's performance in diverse real-world DSP scenarios: cross-tissue generalization, blind tests, and analyses of drug-pathway associations and top sensitive/resistant cell lines. Furthermore, we conducted a comparative evaluation of GPT against multiple Transformer-based pretrained models and existing DSP baselines. 
Extensive experiments on the pharmacogenomics datasets across the five tissue cohorts demonstrate that fine-tuning GPT yields the best DSP performance (28% F1 increase, p-value= 0.0003) followed by clustering pretrained GPT embeddings (26% F1 increase, p-value= 0.0005), outperforming GPT in-context learning (i.e., few-shot). However, GPT in the zero-shot setting had a big F1 gap, resulting in the worst performance. Within the scope of prompt engineering, performance enhancement was achieved by directly instructing GPT about the DSP task and resorting to a concise context format (i.e., instruction-prefix), leading to F1 performance gain of 22% (p-value=0.02); while incorporation of drug-cell line prompt context derived from genomics and/or molecular features further boosted F1 score by 2%. Compared to state-of-the-science DSP baselines, GPT achieved significantly superior mean F1 performance (16% gain, p-value<0.05) on the GDSC dataset. In the cross-tissue analysis, GPT showcased comparable generalizability to the within-tissue performances on the GDSC and PRISM datasets, while showing statistically significant F1 performance improvements on the CCLE (8%, p-value=0.001) and DrugComb (19%, p-value=0.009) datasets. Evaluation on the challenging blind tests suggests GPT's competitiveness on the CCLE and DrugComb datasets compared to random splitting. Furthermore, analyses of the drug-pathway associations and log probabilities provided valuable insights that align with previous DSP findings. The diverse experiment setups and in-depth analysis underscore the importance of generative LLM, such as GPT, as a viable in silico approach to guide precision oncology. https://github.com/bioIKEA/SensitiveCancerGPT.

  • Research Article
  • Cite Count: 1
  • 10.1609/aaai.v39i23.34638
Explore What LLM Does Not Know in Complex Question Answering
  • Apr 11, 2025
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Xin Lin + 4 more

Complex question answering (QA) is a challenging task in artificial intelligence research which requires reasoning based on related knowledge. Retrieval-augmented generation (RAG) based on large language models (LLMs) has become one promising solution in QA. To facilitate RAG more effectively, the LLM needs to precisely evaluate the knowledge required in QA. That is, first, the LLM needs to examine its knowledge boundary (what the LLM does not know) to retrieve external knowledge as a supplement. Second, the LLM needs to evaluate the utility of the retrieved knowledge (whether it helps in reasoning) for robust RAG. To this end, in this paper, we propose a novel Question Answering with Knowledge Evaluation (KEQA) framework to promote the effectiveness and efficiency of RAG in QA. First, inspired by classroom quizzes, we propose a quiz-based method to precisely examine the knowledge state of the uninterpretable LLM for QA. We ask indicative quizzes on each piece of required knowledge, and inspect whether the LLM can consistently answer the quiz to examine its knowledge boundary. Second, we retrieve the unknown knowledge from an external source, and evaluate its utility to pick the helpful pieces for reasoning. We design a reasoning-based metric to evaluate utility, and construct a demonstration set in training data for reference to guide knowledge picking in inference. We conduct extensive experiments on four widely-used QA datasets, and the results demonstrate the effectiveness of the proposed method.

  • Conference Article
  • Cite Count: 6
  • 10.1145/3627673.3679830
Distilling Large Language Models for Text-Attributed Graph Learning
  • Oct 21, 2024
  • Bo Pan + 4 more

Text-Attributed Graphs (TAGs) are graphs of connected textual documents. Graph models can efficiently learn TAGs, but their training heavily relies on human-annotated labels, which are scarce or even unavailable in many applications. Large language models (LLMs) have recently demonstrated remarkable capabilities in few-shot and zero-shot TAG learning, but they suffer from scalability, cost, and privacy issues. Therefore, in this work, we focus on synergizing LLMs and graph models with their complementary strengths by distilling the power of LLMs into a local graph model on TAG learning. To address the inherent gaps between LLMs (generative models for texts) and graph models (discriminative models for graphs), we propose first to let LLMs teach an interpreter with rich rationale and then let a student model mimic the interpreter's reasoning without LLMs' rationale. We convert LLM's textual rationales to multi-level graph rationales to train the interpreter model and align the student model with the interpreter model based on the features of TAGs. Extensive experiments validate the efficacy of our proposed framework.

  • Research Article
  • Cite Count: 18
  • 10.1609/aaai.v38i19.30178
Efficient Toxic Content Detection by Bootstrapping and Distilling Large Language Models
  • Mar 24, 2024
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Jiang Zhang + 5 more

Toxic content detection is crucial for online services to remove inappropriate content that violates community standards. To automate the detection process, prior works have proposed a variety of machine learning (ML) approaches to train Language Models (LMs) for toxic content detection. However, both their accuracy and transferability across datasets are limited. Recently, Large Language Models (LLMs) have shown promise in toxic content detection due to their superior zero-shot and few-shot in-context learning ability as well as broad transferability on ML tasks. However, efficiently designing prompts for LLMs remains challenging. Moreover, the high run-time cost of LLMs may hinder their deployment in production. To address these challenges, in this work, we propose BD-LLM, a novel and efficient approach to bootstrapping and distilling LLMs for toxic content detection. Specifically, we design a novel prompting method named Decision-Tree-of-Thought (DToT) to bootstrap LLMs' detection performance and extract high-quality rationales. DToT can automatically select more fine-grained context to re-prompt LLMs when their responses lack confidence. Additionally, we use the rationales extracted via DToT to fine-tune student LMs. Our experimental results on various datasets demonstrate that DToT can improve the accuracy of LLMs by up to 4.6%. Furthermore, student LMs fine-tuned with rationales extracted via DToT outperform baselines on all datasets with up to 16.9% accuracy improvement, while being more than 60x smaller than conventional LLMs. Finally, we observe that student LMs fine-tuned with rationales exhibit better cross-dataset transferability.

  • Research Article
  • Cite Count: 4
  • 10.1001/jamanetworkopen.2025.12032
Identification of Long-Term Care Facility Residence From Admission Notes Using Large Language Models
  • May 22, 2025
  • JAMA Network Open
  • Katherine E Goodman + 16 more

An estimated half of all long-term care facility (LTCF) residents are colonized with antimicrobial-resistant organisms, and early identification of these patients on admission to acute care hospitals is a core strategy for preventing intrahospital spread. However, because LTCF exposure is not reliably captured in structured electronic health record data, LTCF-exposed patients routinely go undetected. Large language models (LLMs) offer a promising, but untested, opportunity for extracting this information from patient admission histories. To evaluate the performance of an LLM against human review for identifying recent LTCF exposure from identifiable patient admission histories. This cross-sectional, multicenter study used the history and physical (H&P) notes from unique, randomly sampled adult admissions occurring between January 1, 2016, and December 31, 2021, at 13 hospitals in the University of Maryland Medical System (UMMS) and the Johns Hopkins (Hopkins) health care system to compare the performance of an LLM (GPT-4-Turbo) using zero-shot learning and prompting against humans in identifying patients with recent LTCF exposure. LLM analyses were conducted from August to September 2024. Recent (≤12 months) LTCF exposure documented in the H&P note, as adjudicated by (1) humans and (2) an LLM. LLM sensitivity and specificity with Clopper-Pearson 95% CIs. Secondary outcomes were note review time and cost. The LLM was also prompted to provide a rationale and supporting note-text for each classification. The study included 359 601 eligible adult admissions, of which 2087 randomly sampled H&P notes were manually reviewed at UMMS (1020 individuals; median [IQR] age, 58 [41-71] years; 493 [48%] male) and Hopkins (1067 individuals; median [IQR] age, 58 [48-67] years; 561 [53%] male) for LTCF residence. 
Compared with human review, the LLM achieved a sensitivity of 97% (95% CI, 91%-100%) and a specificity of 98% (95% CI, 97%-99%) at UMMS, and 96% (95% CI, 86%-100%) and 93% (95% CI, 92%-95%) sensitivity and specificity, respectively, at Hopkins; specificity at Hopkins improved with prompt revision (96% [95% CI, 95%-97%]). Of 117 manually reviewed LLM rationales, all were factually correct and quoted note-text accurately, and some demonstrated inferential logic and external knowledge. The LLM identified 37 (1.8%) human errors. Human review time had a mean of 2.5 minutes and cost $0.63 to $0.83 per note vs a mean of 4 to 6 seconds and $0.03 per note for LLM review. In this 13-hospital study of 2087 adult admissions, an LLM accurately identified LTCF residence from H&P notes and was more than 25 times faster and 20 times less expensive than human review.
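
The abstract reports sensitivity and specificity with Clopper-Pearson 95% CIs. As a rough illustration of how such an exact binomial interval is obtained, the sketch below inverts the binomial tail probabilities by bisection using only the standard library. The function names are hypothetical, and the example counts (97 true positives out of 100 positives) are invented, since the abstract does not give the underlying counts.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); adequate for modest n."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def _bisect(f, target, increasing):
    """Solve f(p) = target on [0, 1] for a monotone function f."""
    lo, hi = 0.0, 1.0
    for _ in range(200):
        mid = (lo + hi) / 2
        # Move the bracket toward the side where the root must lie.
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for a binomial proportion k/n."""
    # Lower bound: the p at which P(X >= k) equals alpha/2 (increasing in p).
    lower = 0.0 if k == 0 else _bisect(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2, True
    )
    # Upper bound: the p at which P(X <= k) equals alpha/2 (decreasing in p).
    upper = 1.0 if k == n else _bisect(
        lambda p: binom_cdf(k, n, p), alpha / 2, False
    )
    return lower, upper

# Hypothetical counts: 97 LTCF-exposed patients flagged out of 100.
low, high = clopper_pearson(97, 100)
print(f"sensitivity 0.97, 95% CI {low:.3f} to {high:.3f}")
```

Because the interval is exact rather than based on a normal approximation, it stays inside [0, 1] and remains asymmetric near the boundaries, which is why it suits proportions close to 100% such as those reported here.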

  • Research Article
  • 10.3348/kjr.2025.1045
Evaluating the Accuracy and Diagnostic Reasoning of Multimodal Large Language Models in Interpreting Neuroradiology Cases From RadioGraphics.
  • Jan 1, 2026
  • Korean journal of radiology
  • Pae Sun Suh + 6 more

To evaluate the accuracy and reasoning capabilities of large multimodal language models compared with those of neuroradiology subspecialty-trained radiologists in neuroradiology case interpretation. This experimental study used 401 custom-made radiologic quizzes derived from articles published in RadioGraphics covering neuroradiology and head and neck topics (October 2020 to February 2024). We prompted the GPT-4 Turbo with Vision (GPT-4V), GPT-4 Omni, Gemini Flash, and Claude models to provide the top three differential diagnoses with a rationale and describe examination characteristics such as imaging modality, sequence, use of contrast, image plane, and body part. The temperature was adjusted to 0 and 1 (T1). Two neuroradiologists answered the same questions. The accuracies of the large language models (LLMs) and the neuroradiologists were compared using generalized estimating equations. Three neuroradiologists assessed the rationale provided by the LLMs for their differential diagnoses using four-point scales, separately for specific lesion locations and imaging findings, and evaluated the presence of hallucinations and the overall acceptability of the responses. Top-3 accuracy (i.e., correct answers present among top-3 differential diagnoses) of LLMs ranged from 29.9% (120 of 401) to 49.4% (198 of 401, obtained with GPT-4V in the T1 setting), while radiologists achieved 80.3% (322 of 401) and 68.3% (274 of 401), respectively (P < 0.001). Regarding the rationale for differential diagnoses, GPT-4V (T1) accurately identified both the specific lesion location and imaging findings in 30.7% (123 of 401) and 12.9% (16 of 124) of cases without textual clinical history. Hallucinations occurred in 4.5% (18 of 401), and only 29.4% (118 of 401) of the LLM-generated analyses were deemed acceptable. GPT-4V (T1) demonstrated high accuracy in identifying the imaging modality (97.4% [800 of 821]) and scanned body parts (92.2% [756 of 820]). 
LLMs remarkably underperformed compared with neuroradiologists and showed unsatisfactory reasoning for their differential diagnoses, with performance declining further in cases without textual input of clinical history. These findings highlight the limitations of current multimodal LLMs in neuroradiological interpretation and their reliance on text input.
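The top-3 accuracy reported in this abstract counts a case as correct when the true diagnosis appears among the model's three listed differential diagnoses. A minimal Python sketch of that metric follows. The function name `top_k_accuracy` and the case-insensitive string matching are assumptions made for illustration; the abstract does not state how answers were matched against the reference diagnosis, and in practice such matching is usually a judgment made by readers.

```python
from typing import Sequence


def top_k_accuracy(
    predictions: Sequence[Sequence[str]],
    truths: Sequence[str],
    k: int = 3,
) -> float:
    """Fraction of cases whose true label is among the first k predictions.

    Matching is by case-insensitive, whitespace-trimmed string equality,
    which is an illustrative simplification.
    """
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    if not truths:
        raise ValueError("at least one case is required")

    hits = 0
    for ranked, truth in zip(predictions, truths):
        # Only the first k ranked answers are allowed to count as a hit.
        candidates = {answer.strip().lower() for answer in ranked[:k]}
        if truth.strip().lower() in candidates:
            hits += 1
    return hits / len(truths)


def as_percent(correct: int, total: int) -> float:
    """Convert a raw count such as 198 of 401 into a percentage (one decimal)."""
    return round(100 * correct / total, 1)
```

Under these assumptions, `as_percent(198, 401)` reproduces the 49.4% figure quoted above from its raw counts.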

  • Preprint Article
  • 10.2196/preprints.68320
Knowledge Enhancement of Small-Scale Models in Medical Question Answering (Preprint)
  • Nov 3, 2024
  • Xinbai Li + 3 more

BACKGROUND: Medical question answering (QA) is essential for various medical applications. While small-scale pre-trained language models (PLMs) are widely adopted in open-domain QA tasks through fine-tuning with related datasets, applying this approach in the medical domain requires significant and rigorous integration of external knowledge. Knowledge-enhanced small-scale PLMs have been proposed to incorporate knowledge bases (KBs) to improve performance, as KBs contain vast amounts of factual knowledge. Large language models (LLMs) contain a vast amount of knowledge and have attracted significant research interest due to their outstanding natural language processing (NLP) capabilities. KBs and LLMs can provide external knowledge to enhance small-scale models in medical QA.

OBJECTIVE: KBs consist of structured factual knowledge that must be converted into sentences to align with the input format of PLMs. However, these converted sentences often lack semantic coherence, potentially causing them to deviate from the intrinsic knowledge of KBs. LLMs, on the other hand, can generate natural, semantically rich sentences, but they may also produce irrelevant or inaccurate statements. The retrieval-augmented generation (RAG) paradigm enhances LLMs by retrieving relevant information from an external database before responding. By integrating LLMs and KBs using the RAG paradigm, it is possible to generate statements that combine the factual knowledge of KBs with the semantic richness of LLMs, thereby enhancing the performance of small-scale models. In this paper, we explore a RAG fine-tuning method, RAG-mQA, that combines KBs and LLMs to improve small-scale models in medical QA.

METHODS: In the RAG fine-tuning scenario, we adopt medical KBs as an external database to augment the text generation of LLMs, producing statements that integrate medical domain knowledge with semantic knowledge. Specifically, KBs are used to extract medical concepts from the input text, while LLMs are tasked with generating statements based on these extracted concepts. In addition, we introduce two strategies for constructing knowledge: KB-based and LLM-based construction. In the KB-based scenario, we extract medical concepts from the input text using KBs and convert them into sentences by connecting the concepts sequentially. In the LLM-based scenario, we provide the input text to an LLM, which generates relevant statements to answer the question. For downstream QA tasks, the knowledge produced by each of the three strategies (RAG fine-tuning, KB-based, and LLM-based construction) is inserted into the input text to fine-tune a small-scale PLM. F1 and exact match (EM) scores are employed as evaluation metrics for performance comparison. Fine-tuned PLMs without knowledge insertion serve as baselines. Experiments are conducted on two medical QA datasets: emrQA (English) and MedicalQA (Chinese).

RESULTS: RAG-mQA achieved the best results on both datasets. On the MedicalQA dataset, compared to the KB-based and LLM-based enhancement methods, RAG-mQA improved the F1 score by 0.59% and 2.36%, and the EM score by 2.96% and 11.18%, respectively. On the emrQA dataset, the EM score of RAG-mQA exceeded those of the KB-based and LLM-based methods by 4.65% and 7.01%, respectively.

CONCLUSIONS: Experimental results demonstrate that the RAG fine-tuning method can improve model performance in medical QA. RAG-mQA achieves greater improvements than the other knowledge-enhanced methods.

CLINICALTRIAL: This study does not involve trial registration.
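The abstract above names F1 and exact match (EM) as its evaluation metrics without saying how they are computed. The sketch below shows one common formulation for extractive QA, token-overlap F1 and normalized string equality in the style of SQuAD evaluation. That choice, the helper names `normalize`, `exact_match`, and `f1_score`, and the whitespace tokenization are all assumptions; for a Chinese dataset such as MedicalQA, character-level or segmenter-based tokenization would likely be needed instead.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace (an assumed, simplified tokenizer)."""
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    if not pred_tokens or not ref_tokens:
        # Two empty answers agree; an empty answer never matches a non-empty one.
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it occurs in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A prediction that contains the reference plus extra words scores 0 on EM but keeps partial credit on F1, which is one reason the two metrics can move by different amounts, as in the results reported above.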
