Enhancing Large Language Models in Specialized Domains Through Ontology-Based Retrieval-Augmented Generation
Large Language Models (LLMs) show strong performance in natural language tasks but are prone to hallucinations, limiting reliability in knowledge-intensive fields such as cultural heritage. This paper presents an Ontology-Based Retrieval-Augmented Generation (OB-RAG) framework that embeds subject–predicate–object triples from domain ontologies into a vector space, retrieving relevant knowledge via semantic search to ground LLM outputs. Unlike traditional RAG, which retrieves from unstructured text, the framework integrates manually and semi-automatically generated ontologies for explicit contextual grounding. A cultural heritage case study illustrates implementation and evaluation. Performance is assessed with quantitative metrics (Faithfulness and Answer Relevancy) and expert validation. Results show the OB-RAG prototype outperforms baseline LLMs, reducing hallucinations and improving factual accuracy and contextual alignment. The study offers both an architectural framework and empirical evidence that ontology-based RAG strengthens trustworthiness and user acceptance of LLMs in specialized domains.
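A minimal sketch of the retrieval step this abstract describes: verbalized subject-predicate-object triples are embedded, and the closest ones are prepended to the prompt. The embedding model, triples, and prompt wording below are illustrative assumptions, not artifacts from the paper.

```python
# Sketch: embed ontology triples and retrieve the most relevant ones to ground an LLM prompt.
# Assumes the sentence-transformers package; triples and model choice are illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Subject-predicate-object triples verbalized as short sentences.
triples = [
    "The Parthenon is located in Athens.",
    "The Parthenon was dedicated to Athena.",
    "Phidias supervised the Parthenon sculptures.",
]
triple_embeddings = model.encode(triples, convert_to_tensor=True)

def retrieve_context(question: str, k: int = 2) -> str:
    """Return the k triples most similar to the question, to prepend to the LLM prompt."""
    q_emb = model.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, triple_embeddings, top_k=k)[0]
    return "\n".join(triples[h["corpus_id"]] for h in hits)

question = "Who was the Parthenon dedicated to?"
print(f"Context:\n{retrieve_context(question)}\n\nAnswer using only the context:\n{question}")
```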
- Research Article
- 10.1038/s41598-025-05726-2
- Jul 2, 2025
- Scientific Reports
Retrieval-augmented generation (RAG) systems show promise in specialized knowledge domains, but the tobacco research field lacks standardized assessment frameworks for comparing different large language models (LLMs). This gap impacts public health decisions that require accurate, domain-specific information retrieval from complex tobacco industry documentation. To develop and validate a tobacco domain-specific evaluation framework for assessing various LLMs in RAG systems that combines automated metrics with expert validation. Using a Goal-Question-Metric paradigm, we evaluated two distinct LLM architectures in RAG configurations: Mixtral 8×7B and Llama 3.1 70B. The framework incorporated automated assessments via GPT-4o alongside validation by three tobacco research specialists. A domain-specific dataset of 20 curated queries assessed model performance across nine metrics including accuracy, domain specificity, completeness, and clarity. Our framework successfully differentiated performance between models, with Mixtral 8×7B significantly outperforming Llama 3.1 70B in accuracy (8.8/10 vs. 7.55/10, p < 0.05) and domain specificity (8.65/10 vs. 7.6/10, p < 0.05). Case analysis revealed Mixtral’s superior handling of industry-specific terminology and contextual relationships. Hyperparameter optimization further improved Mixtral’s completeness from 7.1/10 to 7.9/10, demonstrating the framework’s utility for model refinement. This study establishes a robust framework specifically for evaluating LLMs in tobacco research RAG systems, with demonstrated potential for extension to other specialized domains. The significant performance differences between models highlight the importance of domain-specific evaluation for public health applications. Future research should extend this framework to broader document corpora and additional LLMs, including commercial models.
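The automated side of such a framework is usually an LLM-as-judge loop. The sketch below shows one plausible shape for the GPT-4o-based rubric scoring the abstract mentions; the metric names come from the abstract, but the prompt wording, model alias, and JSON parsing are assumptions.

```python
# Sketch: automated rubric scoring of a RAG answer with an LLM judge.
# Prompt wording and response parsing are assumptions, not the authors' protocol.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment
METRICS = ["accuracy", "domain_specificity", "completeness", "clarity"]

def judge(question: str, answer: str) -> dict:
    """Ask the judge model to score an answer 1-10 on each metric, returned as JSON."""
    prompt = (
        f"Score the answer to the question from 1 to 10 on each of: {', '.join(METRICS)}. "
        'Reply with JSON only, e.g. {"accuracy": 8, ...}.\n\n'
        f"Question: {question}\nAnswer: {answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)
```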
- Research Article
- 10.1093/jamia/ocae312
- Dec 30, 2024
- Journal of the American Medical Informatics Association : JAMIA
Brief hospital course (BHC) summaries are clinical documents that summarize a patient's hospital stay. While large language models (LLMs) demonstrate remarkable capabilities in automating real-world tasks, their capabilities for healthcare applications such as synthesizing BHCs from clinical notes have not been shown. We introduce a novel preprocessed dataset, the MIMIC-IV-BHC, encapsulating clinical note and BHC pairs to adapt LLMs for BHC synthesis. Furthermore, we introduce a benchmark of the summarization performance of 2 general-purpose LLMs and 3 healthcare-adapted LLMs. Using clinical notes as input, we apply prompting-based (using in-context learning) and fine-tuning-based adaptation strategies to 3 open-source LLMs (Clinical-T5-Large, Llama2-13B, and FLAN-UL2) and 2 proprietary LLMs (Generative Pre-trained Transformer [GPT]-3.5 and GPT-4). We evaluate these LLMs across multiple context-length inputs using natural language similarity metrics. We further conduct a clinical study with 5 clinicians, comparing clinician-written and LLM-generated BHCs across 30 samples, focusing on their potential to enhance clinical decision-making through improved summary quality. We compare reader preferences for the original and LLM-generated summaries using Wilcoxon signed-rank tests. We further request optional qualitative feedback from clinicians to gain deeper insights into their preferences, and we present the frequency of common themes arising from these comments. The fine-tuned Llama2-13B outperforms other domain-adapted models on the quantitative evaluation metrics of Bilingual Evaluation Understudy (BLEU) and Bidirectional Encoder Representations from Transformers (BERT)-Score. GPT-4 with in-context learning shows more robustness to increasing context lengths of clinical note inputs than fine-tuned Llama2-13B. Despite comparable quantitative metrics, the reader study reveals a significant preference for summaries generated by GPT-4 with in-context learning compared to both Llama2-13B fine-tuned summaries and the original summaries (P<.001), highlighting the need for qualitative clinical evaluation. We release a foundational clinically relevant dataset, the MIMIC-IV-BHC, and present an open-source benchmark of LLM performance in BHC synthesis from clinical notes. We observe high-quality summarization performance for both in-context proprietary and fine-tuned open-source LLMs using both quantitative metrics and a qualitative clinical reader study. Our research effectively integrates elements from the data assimilation pipeline: our methods use (1) clinical data sources to integrate, (2) data translation, and (3) knowledge creation, while our evaluation strategy paves the way for (4) deployment.
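The prompting-based (in-context learning) adaptation mentioned above amounts to placing one or more note/summary pairs before the target note. A minimal sketch, with an invented example pair (MIMIC data cannot be reproduced here):

```python
# Sketch: a one-shot in-context-learning prompt for BHC synthesis.
# The example note/summary pair is invented, not MIMIC-IV-BHC data.
EXAMPLE_NOTE = "72M admitted with pneumonia; treated with IV ceftriaxone; weaned to room air."
EXAMPLE_BHC = "Patient admitted for pneumonia, improved on IV antibiotics, discharged stable."

def bhc_prompt(note: str) -> str:
    """Prepend a worked example so the model imitates the target summary style."""
    return (
        "Write a brief hospital course (BHC) from the clinical notes.\n\n"
        f"Notes: {EXAMPLE_NOTE}\nBHC: {EXAMPLE_BHC}\n\n"
        f"Notes: {note}\nBHC:"
    )

print(bhc_prompt("85F admitted with CHF exacerbation; diuresed with IV furosemide; discharged on oral regimen."))
```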
- Research Article
- 10.1111/epi.18475
- Jul 10, 2025
- Epilepsia
The emergence of large language models (LLMs) and the increasing prevalence of electronic health records (EHRs) present significant opportunities for advancing health care research and practice. However, research that compares and applies LLMs to extract key epilepsy-related information from unstructured medical free text is under-explored. This study fills this gap by comparing and applying different open-source LLMs and methods to extract epilepsy information from unstructured clinic letters, thereby optimizing EHRs as a resource for the benefit of epilepsy research. We also highlight some limitations of LLMs. Employing a dataset of 280 annotated clinic letters from King's College Hospital, we explored the efficacy of open-source LLMs (Llama and Mistral series) for extracting key epilepsy-related information, including epilepsy type, seizure type, current anti-seizure medications (ASMs), and associated symptoms. The study used various extraction methods, including direct extraction, summarized extraction, and contextualized extraction, complemented by role-prompting and few-shot prompting techniques. Performance was evaluated against a gold-standard dataset and compared to advanced fine-tuned models and human annotations. Llama 2 13b (a 13-billion-parameter LLM developed by Meta) demonstrated superior extraction capabilities across tasks, consistently outperforming other LLMs (F1 = .80 in epilepsy-type extraction, F1 = .76 in seizure-type extraction, and F1 = .90 in current-ASM extraction). Here, the F1 score is a balanced metric indicating the model's accuracy in correctly identifying relevant information without excessive false positives. The study highlights that direct extraction showed consistently high performance. Comparative analysis showed that LLMs outperformed current approaches such as MedCAT (Medical Concept Annotation Tool) in extracting epilepsy-related information (F1 higher by .2). The results affirm the potential of LLMs in medical information extraction relating to epilepsy, offering insights into leveraging these models for detailed and accurate data extraction from unstructured texts. The study underscores the importance of method selection in optimizing extraction performance and suggests a promising avenue for enhancing medical research and patient care through advanced natural language processing technologies.
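What a role-prompted, few-shot "direct extraction" prompt of the kind compared in this study might look like; the example letter, labels, and JSON schema below are fabricated placeholders, not study materials.

```python
# Sketch: build a role-prompted, few-shot extraction prompt for epilepsy clinic letters.
# The few-shot example and target letter are invented.
FEW_SHOT = [
    (
        "She has juvenile myoclonic epilepsy and takes levetiracetam.",
        '{"epilepsy_type": "juvenile myoclonic epilepsy", "seizure_type": null, '
        '"current_asms": ["levetiracetam"], "symptoms": null}',
    ),
]

def build_prompt(letter: str) -> str:
    shots = "\n\n".join(f"Letter: {t}\nJSON: {j}" for t, j in FEW_SHOT)
    return (
        "You are an epilepsy clinician extracting structured data from clinic letters.\n"
        "Return JSON with keys epilepsy_type, seizure_type, current_asms, symptoms; "
        "use null when the letter does not state a value.\n\n"
        f"{shots}\n\nLetter: {letter}\nJSON:"
    )

print(build_prompt("Focal epilepsy; seizures well controlled on lamotrigine 200 mg twice daily."))
```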
- Research Article
- 10.1145/3732784
- Apr 29, 2025
- ACM Transactions on Intelligent Systems and Technology
Large language models (LLMs) have shown impressive capabilities across various natural language tasks. However, evaluating their alignment with human preferences remains a challenge. To this end, we propose a comprehensive human evaluation framework to assess LLMs’ proficiency in following instructions on diverse real-world tasks. We construct a hierarchical task tree encompassing 7 major areas covering over 200 categories and over 800 tasks, spanning diverse capabilities such as question answering, reasoning, multiturn dialogue, and text generation, to evaluate LLMs in a comprehensive and in-depth manner. We also design detailed evaluation standards and processes to facilitate consistent, unbiased judgments from human evaluators. A test set of over 3,000 instances is released, spanning different difficulty levels and knowledge domains. Our work provides a standardized methodology to evaluate human alignment in LLMs for both English and Chinese. We also analyze the feasibility of automating parts of the evaluation with a strong LLM (GPT-4). Our framework supports a thorough assessment of LLMs as they are integrated into real-world applications. We have made publicly available the task tree, the TencentLLMEval dataset, and the evaluation methodology, which have been demonstrated to be effective in assessing the performance of Tencent Hunyuan LLMs. By doing so, we aim to facilitate the benchmarking of advances in the development of safe and human-aligned LLMs.
- Research Article
- 10.1115/1.4067227
- Dec 10, 2024
- Journal of Mechanical Design
Empathic design research aims to gain deep and accurate user understanding. We can measure a designer's empathic ability as empathic accuracy (EA) in understanding the user's thoughts and feelings during an interview. However, the EA measure currently relies on human rating and is thus time-consuming, making the use of large language models (LLMs) an attractive alternative. Two significant constraints must be considered when implementing LLMs as a solution: the choice of LLM and the impact of domain-specific datasets. Datasets of interactions between designers and users are not generally available. We present such a dataset, built on the EA task employed in user interviews to measure empathic understanding. It consists of over 400 pairs of user thoughts or feelings matched with a designer's guess of the same, along with human ratings of the accuracy. We compared the performance of six state-of-the-art sentence-embedding LLMs with different pooling techniques on the EA task, using the LLMs to extract semantic information before and after fine-tuning. We conclude that directly using LLMs based on their reported performance in general language tasks could result in errors when judging a designer's empathic ability. We also found that fine-tuning LLMs on our dataset improved their performance, but the model's fit to the EA task and the pooling method also determined the LLM's performance. The results will provide insight for other LLM-based similarity analyses in design.
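A minimal sketch of the core measurement: embed the user's statement and the designer's guess with a sentence-embedding model (here with mean pooling, one common pooling strategy) and compare them by cosine similarity. The model choice and example pair are illustrative; the EA dataset itself is not reproduced.

```python
# Sketch: score a designer's guess against a user's statement via mean-pooled embeddings.
# Model choice and the example pair are illustrative stand-ins for the study's setup.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("sentence-transformers/all-mpnet-base-v2")
enc = AutoModel.from_pretrained("sentence-transformers/all-mpnet-base-v2")

def embed(text: str) -> torch.Tensor:
    """Mean-pool token embeddings over non-padding positions."""
    batch = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = enc(**batch).last_hidden_state  # (1, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)  # (1, seq, 1)
    return (hidden * mask).sum(1) / mask.sum(1)

user = "I felt ignored when the device kept beeping at me."
guess = "The user was frustrated that the device would not stop beeping."
similarity = torch.cosine_similarity(embed(user), embed(guess)).item()
print(f"empathic-accuracy proxy: {similarity:.2f}")
```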
- Research Article
- 10.2196/67363
- Mar 27, 2025
- JMIR AI
Large language models (LLMs) have demonstrated powerful capabilities in natural language tasks and are increasingly being integrated into health care for tasks like disease risk assessment. Traditional machine learning methods rely on structured data and coding, limiting their flexibility in dynamic clinical environments. This study presents a novel approach to disease risk assessment using generative LLMs through conversational artificial intelligence (AI), eliminating the need for programming. This study evaluates the use of pretrained generative LLMs, including LLaMA2-7b and Flan-T5-xl, for COVID-19 severity prediction, with the goal of enabling a real-time, no-code risk assessment solution through chatbot-based, question-answering interactions. To contextualize their performance, we compare LLMs with traditional machine learning classifiers, such as logistic regression, extreme gradient boosting (XGBoost), and random forest, which rely on tabular data. We fine-tuned LLMs using few-shot natural language examples from a dataset of 393 pediatric patients, developing a mobile app that integrates these models to provide real-time, no-code COVID-19 severity risk assessment through clinician-patient interaction. The LLMs were compared with traditional classifiers across different experimental settings, using the area under the curve (AUC) as the primary evaluation metric. Feature importance derived from LLM attention layers was also analyzed to enhance interpretability. Generative LLMs demonstrated strong performance in low-data settings. In zero-shot scenarios, the T0-3b-T model achieved an AUC of 0.75, while other LLMs, such as T0pp(8bit)-T and Flan-T5-xl-T, reached 0.67 and 0.69, respectively. At 2-shot settings, logistic regression and random forest achieved an AUC of 0.57, while Flan-T5-xl-T and T0-3b-T obtained 0.69 and 0.65, respectively. By 32-shot settings, Flan-T5-xl-T reached 0.70, similar to logistic regression (0.69) and random forest (0.68), while XGBoost improved to 0.65. These results illustrate the differences in how generative LLMs and traditional models handle increasing data availability. LLMs perform well in low-data scenarios, whereas traditional models rely more on structured tabular data and labeled training examples. Furthermore, the mobile app provides real-time COVID-19 severity assessments and personalized insights through attention-based feature importance, adding value to the clinical interpretation of the results. Generative LLMs provide a robust alternative to traditional classifiers, particularly in scenarios with limited labeled data. Their ability to handle unstructured inputs and deliver personalized, real-time assessments without coding makes them highly adaptable to clinical settings. This study underscores the potential of LLM-powered conversational AI in health care and encourages further exploration of its use for real-time disease risk assessment and decision-making support.
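The no-code angle rests on serializing tabular features into natural language for few-shot prompting. A sketch under invented features and labels (not the study's 393-patient dataset):

```python
# Sketch: turn tabular patient features into a natural-language few-shot prompt.
# Feature names, values, and severity labels are invented for illustration.
def serialize(patient: dict) -> str:
    return ", ".join(f"{k} is {v}" for k, v in patient.items())

shots = [
    ({"age": 4, "oxygen_saturation": 91, "fever_days": 5}, "severe"),
    ({"age": 9, "oxygen_saturation": 98, "fever_days": 1}, "mild"),
]
query = {"age": 6, "oxygen_saturation": 94, "fever_days": 3}

prompt = "Classify COVID-19 severity as mild or severe.\n"
prompt += "\n".join(f"Patient: {serialize(p)}. Severity: {y}" for p, y in shots)
prompt += f"\nPatient: {serialize(query)}. Severity:"
print(prompt)
```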
- Research Article
- 10.1145/3770084
- Oct 7, 2025
- ACM Transactions on Software Engineering and Methodology
Large Language Models (LLMs) have shown remarkable capabilities in code generation for popular programming languages. However, their performance in Low-Resource Programming Languages (LRPLs) and Domain-Specific Languages (DSLs) remains a critical challenge. This gap affects millions of developers (Rust alone has 3.5 million users) who are currently unable to fully leverage LLM capabilities. LRPLs and DSLs face unique challenges, including severe data scarcity and, for DSLs, highly specialized syntax and semantics that are poorly represented in general-purpose datasets. Addressing these challenges is crucial, as LRPLs and DSLs significantly enhance development efficiency in specialized domains and applications, including financial and scientific work. While several surveys on LLMs for software engineering and code exist, none comprehensively address the challenges and opportunities specific to LRPLs and DSLs. Our survey fills this gap by providing a systematic review of the current state, methodologies, and challenges in leveraging LLMs for code generation in LRPLs and DSLs. We filtered 111 papers from over 27,000 studies published from 2020 to 2024 to understand the capabilities and limitations of LLMs in these specialized domains, and expanded our literature search to include 5 recent papers from 2024-2025. We report the LLMs used, benchmarks, and metrics to evaluate code generation in LRPLs and DSLs, the strategies used to enhance LLM performance, and the collected datasets and curation methods in this context. We identified four main evaluation techniques used in the literature, along with several metrics to assess code generation in LRPLs and DSLs. We categorized the methods used for LLM improvement into six main groups and summarized the novel methods and architectures proposed by researchers. We also classified the different approaches used for data collection and preparation. While different techniques, metrics, and datasets are used, there is no standard approach or benchmark dataset for evaluating code generation in several LRPLs and DSLs. We discuss several distinctions between the studied approaches and those used in high-resource programming languages (HRPLs), as well as several challenges unique to these languages, especially DSLs. The challenges stem from the scarcity of data, unique requirements, and specialized domains, which often need expert guidelines or domain-specific tools. Accordingly, we provide insights into different research opportunities for the studied aspects. This survey serves as a comprehensive resource for researchers and practitioners working at the intersection of LLMs, software engineering, and specialized programming languages, providing a foundation for future advancements in LRPL and DSL code generation. A GitHub repository organizing the papers of this survey is available at https://github.com/jie-jw-wu/Survey-CodeLLM4LowResource-DSL.
- Abstract
- 10.1017/cts.2024.1001
- Apr 1, 2025
- Journal of Clinical and Translational Science
Objectives/Goals: This Weill Cornell Clinical and Translational Science Collaborative (CTSC) project evaluates whether large language models (LLMs) can generate accurate summaries of translational science benefits using the Translational Science Benefits Model (TSBM) framework, aiming to identify optimal LLMs and prompting strategies via expert review. Methods/Study Population: We are using prompt engineering to train multiple LLMs to generate one-page impact profiles based on the TSBM framework. LLMs will be selected via benchmarks, focusing on models excelling in information extraction. Leading LLMs (e.g., Llama 3.2, ChatGPT 4.0, Gemini 1.5 Pro, and Claude) and other high-performing models will be considered. Initial work has utilized Gemini 1.5 Pro. Models use data from CTSC-supported projects in WebCAMP, our local instantiation of a translational research activity tracking system used by >20 CTSA hubs, and manuscripts from the Overton database cited in policy documents. Human experts will evaluate the quality and accuracy of LLM-generated profiles. Results/Anticipated Results: Preliminary results using Gemini 1.5 Pro indicate that LLMs can generate coherent and informative impact profiles encompassing diverse areas within the TSBM. Face validity appears satisfactory, suggesting the outputs align with expectations. We anticipate that further exploration with other LLMs and expert validation will reveal strengths and weaknesses of the LLM approach, including the potential for inaccuracies (“hallucinations”), informing further refinement of models and prompting strategies. Analysis of manuscripts cited in policy will provide valuable insights into communicating policy-relevant benefits effectively, and benchmark comparisons will identify optimal LLMs for this use case. Discussion/Significance of Impact: This project demonstrates LLMs’ potential for streamlining and enhancing impact reporting in translational science, enabling broader dissemination of research outcomes and promoting better understanding among stakeholders. Future work will integrate LLM-based reporting into research infrastructure.
- Preprint Article
- 10.2196/preprints.75103
- Mar 28, 2025
BACKGROUND Large language models (LLMs) provide new opportunities to advance the intelligent development of Traditional Chinese Medicine (TCM). Syndrome differentiation thinking is an essential part of TCM, and equipping LLMs with this capability represents a crucial step toward more effective clinical applications of TCM. However, given the complexity of TCM syndrome differentiation thinking, acquiring this ability is a considerable challenge for models. OBJECTIVE This study aims to evaluate LLMs' syndrome differentiation thinking ability and to design a method that effectively enhances their performance in this area. METHODS We decompose the process of TCM syndrome differentiation thinking into three core tasks: pathogenesis inference, syndrome inference, and diagnostic suggestion. To evaluate the performance of LLMs in these tasks, we constructed a high-quality evaluation dataset, providing a reliable foundation for the quantitative assessment of their capabilities. Furthermore, we developed a methodology for generating instruction data based on the idea of an "open-book exam": we customized three data templates and dynamically retrieved task-relevant professional knowledge, inserting it into predefined positions within the templates. This approach effectively generates high-quality instruction data that aligns with the unique characteristics of TCM syndrome differentiation thinking. Leveraging this instruction data, we fine-tuned the base model, enhancing the syndrome differentiation thinking ability of the LLMs. RESULTS We collected 200 medical cases for the evaluation dataset and standardized them into three types of task questions. We tested general and TCM LLMs, comparing their performance with our proposed solution. The results demonstrate that our method significantly enhances LLMs' syndrome differentiation thinking ability. Our model achieved 85.7% and 81.2% accuracy in Tasks 1 and 2, respectively, surpassing the best-performing TCM and general LLMs by 26.3% and 15.8%. In Task 3, our model scored 84.3, indicating that its output is very similar to the advice given by experts. CONCLUSIONS Existing general LLMs and TCM LLMs still have significant limitations in the core task of syndrome differentiation thinking. Our research shows that fine-tuning LLMs by designing professional instruction templates and generating high-quality instruction data can significantly improve their performance in core tasks. The optimized LLMs show a high degree of similarity in reasoning results to the opinions of domain experts, indicating that they can simulate syndrome differentiation thinking to a certain extent. This has important theoretical and practical significance for in-depth interpretation of the complexity of the clinical diagnosis and treatment process of TCM.
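A toy rendering of the "open-book exam" idea: retrieved domain knowledge is inserted at a predefined slot in a task template to form an instruction example. The template wording, knowledge base, and case below are illustrative stand-ins for the authors' three templates.

```python
# Sketch: fill a task template with dynamically retrieved domain knowledge,
# in the spirit of "open-book" instruction generation. All content is illustrative.
TEMPLATE = (
    "Reference knowledge:\n{knowledge}\n\n"
    "Case record:\n{case}\n\n"
    "Task: infer the pathogenesis and explain your reasoning."
)

def retrieve_knowledge(case: str, kb: dict) -> str:
    """Toy retrieval: return entries whose keyword appears in the case text."""
    return "\n".join(v for k, v in kb.items() if k in case)

kb = {"insomnia": "Heart-fire disturbing the spirit commonly presents with insomnia."}
case = "Patient reports insomnia, irritability, and a red tongue tip."
print(TEMPLATE.format(knowledge=retrieve_knowledge(case, kb), case=case))
```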
- Research Article
- 10.1093/jamia/ocaf023
- Mar 10, 2025
- Journal of the American Medical Informatics Association : JAMIA
Large language models (LLMs) are increasingly utilized in healthcare, transforming medical practice through advanced language processing capabilities. However, the evaluation of LLMs predominantly relies on human qualitative assessment, which is time-consuming, resource-intensive, and may be subject to variability and bias. There is a pressing need for quantitative metrics to enable scalable, objective, and efficient evaluation. We propose a unified evaluation framework that bridges qualitative and quantitative methods to assess LLM performance in healthcare settings. This framework maps evaluation aspects, such as linguistic quality, efficiency, content integrity, trustworthiness, and usefulness, to both qualitative assessments and quantitative metrics. We apply our approach to empirically evaluate the Epic In-Basket feature, which uses an LLM to generate patient message replies. The empirical evaluation demonstrates that while Artificial Intelligence (AI)-generated replies exhibit high fluency, clarity, and minimal toxicity, they face challenges with coherence and completeness. Clinicians' manual decisions to use AI-generated drafts correlate strongly with quantitative metrics, suggesting that quantitative metrics have the potential to reduce human effort in the evaluation process and make it more scalable. Our study highlights the potential of a unified evaluation framework that integrates qualitative and quantitative methods, enabling scalable and systematic assessments of LLMs in healthcare. Automated metrics streamline evaluation and monitoring processes, but their effective use depends on alignment with human judgment, particularly for aspects requiring contextual interpretation. As LLM applications expand, refining evaluation strategies and fostering interdisciplinary collaboration will be critical to maintaining high standards of accuracy, ethics, and regulatory compliance. Our unified evaluation framework bridges the gap between qualitative human assessments and automated quantitative metrics, enhancing the reliability and scalability of LLM evaluations in healthcare. While automated quantitative evaluations are not ready to fully replace qualitative human evaluations, they can enhance the process and, with relevant benchmarks derived from the unified framework proposed here, be applied to LLM monitoring and to the evaluation of updated versions of a technology originally evaluated against qualitative human standards.
- Research Article
- 10.1200/jco.2025.43.16_suppl.e23161
- Jun 1, 2025
- Journal of Clinical Oncology
e23161 Background: Data retrieval is challenging in clinical research, and traditional methods for data collection are often time-consuming and may be error-prone. Large Language Models (LLMs) have shown zero-shot capabilities in converting unstructured clinical text into structured data. These technologies could support the retrieval stage of clinical trials by leveraging the information reported in Electronic Health Records (EHRs) without relying on manual curation. The APOLLO 11 Consortium (NCT05550961) is a multicentric Italian trial which leverages a federated infrastructure for the analysis of advanced lung cancer patient data across Italy. Methods: We conducted a pilot study using Llama 3.1 8B on 358 Non-Small Cell Lung Cancer patients from the IRCCS Istituto Nazionale dei Tumori, leader of the APOLLO 11 Consortium. Anonymized EHRs were analyzed with the LLM feature-extraction pipeline of Wiest et al. A combination of zero- and few-shot prompting techniques in both English and Italian was used. We selected smoking, histology, PD-L1, and staging as multiclass variables and bone/brain/liver metastases as binary variables. The ground truth collection involved a first Manual Data Entry (1-MDE) and a final fully revised MDE (2-MDE). The LLM accuracy was calculated only for the comparison LLM vs 2-MDE. In addition, we calculated the percentage of Missing Information (% MI) in 1-MDE, 2-MDE, and LLM extraction. Results: Compared to 2-MDE, the LLM achieved feature-specific accuracies of 0.78 for PD-L1, 0.85 for BONE METASTASIS, 0.83 for BRAIN METASTASIS, 0.89 for LIVER METASTASIS, and 0.96 for TUMOUR STAGING. For smoking and staging, LLM extraction also reduced % MI relative to 1-MDE (Table 1). Only for PD-L1, we further analyzed the 12.8% of MI and found that 91.3% resulted from hallucinations (i.e., PD-L1 was misclassified as missing). Evaluations using English prompts confirmed the pipeline’s adaptability and high task accuracy. Conclusions: This study confirms the feasibility of LLMs for data retrieval in clinical trials, demonstrating strong performance across diverse clinical features with minimal prompt optimization. LLMs could assist clinicians and data entry personnel in the 1-MDE process, streamlining initial data structuring and saving time. The 2-MDE step can remain as a quality check to address any discrepancies. Further improvements could focus on prompt optimization and integrating human feedback to reduce hallucination rates. Clinical trial information: NCT05550961.

Table 1. % MI in 1-MDE, 2-MDE, and LLM extraction. Accuracy refers only to LLM vs 2-MDE. Histology and metastasis sites were collected only in 2-MDE. NA = not available.

| | Smoking | PD-L1 | Histology | Bone Met | Brain Met | Liver Met | T | N | M | Stage |
|---|---|---|---|---|---|---|---|---|---|---|
| % MI 1-MDE | 6.4 | 8.9 | NA | NA | NA | NA | 22.5 | 22.5 | 23.11 | 98.3 |
| % MI 2-MDE | 6.6 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| % MI LLM | 2.7 | 12.8 | 10.3 | 0 | 0 | 0 | 0 | 0 | 0 | 6.9 |
| % accuracy (LLM vs 2-MDE) | 67 | 78 | 91 | 85 | 83 | 89 | 39 | 52 | 70 | 96 |
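A sketch of what a zero-shot extraction prompt with fixed categories and an explicit "missing" option could look like; the note, variable lists, and wording are hypothetical (the study itself used the Wiest et al. pipeline with Llama 3.1 8B).

```python
# Sketch: a zero-shot prompt for extracting trial variables from an EHR note into fixed
# categories. The note and schema are hypothetical, not APOLLO 11 materials.
PROMPT = """Extract from the clinical note:
- smoking: one of [current, former, never, missing]
- pd_l1: one of [<1%, 1-49%, >=50%, missing]
- bone_metastasis: one of [yes, no, missing]
Return JSON only. Use "missing" when the note does not state a value.

Note: {note}
JSON:"""

note = "Former smoker. PD-L1 60%. CT: no skeletal lesions."
print(PROMPT.format(note=note))
# An LLM call (e.g., a local Llama endpoint) would complete this prompt; the explicit
# "missing" option is one guard against the PD-L1-style hallucinations reported above.
```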
- Research Article
- 10.1200/jco.2025.43.16_suppl.12105
- Jun 1, 2025
- Journal of Clinical Oncology
12105 Background: The American Society of Clinical Oncology (ASCO) convened a multidisciplinary panel resulting in patient-oncologist communication guidelines published in 2017. These guidelines contain recommendations across topics including goals of care, treatment selection, end-of-life care, facilitating family involvement, and clinician training in communication. Ideally, these conversations should be documented in the electronic health record (EHR), so that they can be referred to at future visits as a patient’s clinical course evolves. Tracking adherence to these communication guidelines may be beneficial for quality improvement efforts. However, manual chart review of unstructured free-text notes is tedious and burdensome. The recent development of Large Language Models (LLMs) may represent a new computational approach that can capture such documentation more efficiently than chart review. To our knowledge, no prior study has used LLMs to capture such documentation in free-text notes, validated against gold-standard manual chart review. Methods: As part of a larger study on development of LLMs for tracking palliative care quality measures, we randomly selected 30 patients with advanced cancer and clinical notes in the month following navigation to a poor-prognosis treatment node. We used GPT-4o-2024-05-13, our HIPAA-secure tool, to develop an LLM prompt for identifying 14 ASCO communication domains in clinical text. The LLM prompt required the output to include source text supporting the identification of a communication domain. A “hallucination score”, a measure of evidence produced by the LLM that is not found in the source text, was calculated for this output. We then compared against gold-standard manual chart review using standard performance metrics. Results: Across communication domains, note-level LLM analysis achieved sensitivity ranging from 0.43 to 1.0, specificity ranging from 0.32 to 0.99, and accuracy ranging from 0.51 to 0.99. Examples of documentation identified by both the LLM and chart review include goals of care and prognosis (“recently informed that her disease had progressed with treatment. Currently on ‘last line’ of chemotherapy”), treatment options and clinical trials (“her oncologist recommended a potential trial treatment, and she is contemplating involvement in this”), end-of-life care (“if her cancer continues to progress with her current treatment, they will transition her care to home hospice for comfort measures only”), and cost of care (“financial insecurity - referred to resource specialist”). The average hallucination index for documentation identified by the LLM was low. The LLM frequently identified information missed by annotators and extracted information relevant to communication domains in a fraction of the time required by manual chart review. Conclusions: LLMs can identify communication domains in EHRs, potentially contributing to quality improvement efforts.
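The paper does not publish its hallucination-score formula. One simple, assumed definition consistent with the description (evidence produced by the LLM but absent from the source text) is the fraction of quoted spans that do not occur verbatim in the note:

```python
# Sketch of an assumed "hallucination score": the share of LLM-quoted evidence spans
# not found verbatim in the source note (0 = fully grounded, 1 = fully unsupported).
import re

def normalize(text: str) -> str:
    return re.sub(r"\s+", " ", text.lower()).strip()

def hallucination_score(note: str, quoted_spans: list[str]) -> float:
    note_norm = normalize(note)
    missing = [s for s in quoted_spans if normalize(s) not in note_norm]
    return len(missing) / len(quoted_spans) if quoted_spans else 0.0

note = "Discussed goals of care; patient prefers home hospice if disease progresses."
spans = ["patient prefers home hospice", "enrolled in a clinical trial"]
print(hallucination_score(note, spans))  # 0.5: the second span is unsupported
```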
- Research Article
- 10.51519/journalisi.v7i3.1170
- Sep 22, 2025
- Journal of Information Systems and Informatics
Large Language Models (LLMs) have achieved remarkable success across natural language tasks, but their enormous computational requirements pose challenges for practical deployment. This paper proposes a hybrid cloud–edge architecture to deploy LLMs in a cost-effective and efficient manner. The proposed system employs a lightweight on-premise LLM to handle the bulk of user requests, and dynamically offloads complex queries to a powerful cloud-hosted LLM only when necessary. We implement a confidence-based routing mechanism to decide when to invoke the cloud model. Experiments on a question-answering use case demonstrate that our hybrid approach can match the accuracy of a state-of-the-art LLM while reducing cloud API usage by over 60%, resulting in significant cost savings and a ~40% reduction in average latency. We also discuss how the hybrid strategy enhances data privacy by keeping sensitive queries on-premise. These results highlight a promising direction for organizations to leverage advanced LLM capabilities without prohibitive expense or risk, by intelligently combining local and cloud resources.
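A compact sketch of the confidence-based routing mechanism, assuming a local model that exposes token log-probabilities; both model interfaces are hypothetical placeholders, and the threshold would be tuned empirically:

```python
# Sketch: route a query to the cloud only when the local model's confidence is low.
# local_llm and cloud_llm are hypothetical interfaces, not a specific library's API.
import math

CONFIDENCE_THRESHOLD = 0.8  # would be tuned on a validation set

def answer(query: str, local_llm, cloud_llm) -> str:
    """Serve from the on-premise model unless its confidence falls below the threshold."""
    draft, token_logprobs = local_llm.generate_with_logprobs(query)  # hypothetical call
    # Mean token probability as a cheap confidence proxy.
    confidence = math.exp(sum(token_logprobs) / len(token_logprobs))
    if confidence >= CONFIDENCE_THRESHOLD:
        return draft  # stays on-premise: cheaper, lower latency, keeps data private
    return cloud_llm.generate(query)  # hypothetical call; offload the hard case
```

Mean token probability is only one possible confidence signal; self-reported confidence or a trained router would slot into the same structure.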
- Research Article
- 10.1145/3756016
- Jul 29, 2025
- Journal on Computing and Cultural Heritage
Virtual museums are effective means for the dissemination and documentation of Cultural Heritage (CH) content. They are suitable environments for the semantic annotation of artifacts and automatic virtual guides. To this end, we identify and compare Traditional (ontology-based), Large Language Model (LLM)-extended, and LLM-pure methods as semantic information strategies for digital CH. The traditional method is described through an application prototype, while the methods that involve LLMs are tested experimentally. To investigate the integral tasks related to LLMs, our experiments include (i) semantic annotation using the CIDOC Conceptual Reference Model (CRM) and Knowledge Graph (KG) generation with LLMs for a painting sample, and (ii) painting ranking relying solely on LLMs using catalog descriptions as input. The experiments demonstrate the potential of these methods to enhance artwork interpretation, description, and refinement of the results. Based on the relevant literature on traditional semantic annotation and the conducted experiments with LLMs, a combination of ontologies and LLMs may provide an optimal approach, as it offers the accuracy of structured knowledge while providing a tool that interprets these elements into natural language and vice versa. Relying solely on LLMs may be risky due to the lack of domain-specific knowledge in their training data, whereas traditional methods demand expertise in a specific domain and are more time-consuming. Our approach shows potential in use cases such as guiding museum visitors to artifacts that match their interests, assisting museum curators with documentation, or helping CH researchers identify similarities in artifact collections.
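For experiment (i), the LLM is asked to emit CIDOC CRM-style triples from a catalog description. A hedged sketch of such a prompt; the description is invented and the expected output is only a plausible shape, which, as the article stresses, would still need curator review:

```python
# Sketch: prompt an LLM to annotate a painting description with CIDOC CRM triples.
# P14_carried_out_by and P55_has_current_location are real CRM properties; the
# description and the expected output below are illustrative only.
DESCRIPTION = "Oil on canvas, 1889, painted by Vincent van Gogh, held by the Museum of Modern Art."

prompt = (
    "Annotate the painting description using CIDOC CRM. Output subject-predicate-object "
    "triples, one per line, using properties such as P14_carried_out_by (for the "
    "production event) and P55_has_current_location.\n\n"
    f"Description: {DESCRIPTION}\nTriples:"
)
print(prompt)
# Plausible model output, to be verified by a curator:
#   <production of painting> P14_carried_out_by <Vincent van Gogh>
#   <painting> P55_has_current_location <Museum of Modern Art>
```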
- Research Article
- 10.1101/2025.08.22.671610
- Aug 27, 2025
- bioRxiv
Large Language Models (LLMs), AI agents, and co-scientists promise to accelerate scientific discovery across fields ranging from chemistry to biology. Bioinformatics, the analysis of DNA, RNA, and protein sequences, plays a crucial role in biological research and is especially amenable to AI-driven automation given its computational nature. Here, we assess the bioinformatics capabilities of three popular general-purpose LLMs on a set of tasks covering basic analytical questions that include code writing and multi-step reasoning in the domain. Utilizing questions from Rosalind, a bioinformatics educational platform, we compare the performance of the LLMs vs. humans on 104 questions undertaken by 110 to 68,760 individuals globally. GPT-3.5 provided correct answers for 59/104 (58%) questions, while Llama-3-70B and GPT-4o answered 49/104 (47%) correctly. GPT-3.5 was the best performing in most categories, followed by Llama-3-70B and then GPT-4o. 71% of the questions were correctly answered by at least one LLM. The best performing categories included DNA analysis, while the worst performing were sequence alignment/comparative genomics and genome assembly. Overall, LLM performance mirrored that of humans, with lower performance on tasks where humans also performed poorly and vice versa. However, LLMs also failed in some instances where most humans were correct and, in a few cases, excelled where most humans failed. To the best of our knowledge, this presents the first assessment of general-purpose LLMs on basic bioinformatics tasks in distinct areas relative to the performance of hundreds to thousands of humans. LLMs provide correct answers to several questions that require the use of biological knowledge, reasoning, statistical analysis, and computer code.
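For flavor, Rosalind's introductory problem in the best-performing DNA-analysis category is counting nucleotides; the kind of solution the LLMs would be expected to produce (the input string here is an arbitrary example, not the study's data):

```python
# Sketch: Rosalind-style "Counting DNA Nucleotides" task, the flavor of problem assessed above.
from collections import Counter

def count_nucleotides(dna: str) -> str:
    """Return counts of A, C, G, T separated by spaces, as Rosalind expects."""
    counts = Counter(dna.upper())
    return " ".join(str(counts[base]) for base in "ACGT")

print(count_nucleotides("AGCTTTTCATTCTGACTGCA"))  # prints "4 5 3 8" (A C G T counts)
```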