Clinical Text Augmentation and Generation Using RAG for Large Language Models

Abstract

Large Language Models (LLMs) are becoming essential in clinical text generation, where synthetic medical data must be accurate and applicable to real-world healthcare applications. Existing LLMs often lack specialized optimization and clarity, leading to incorrect outputs. These restrictions can make their references unreliable, particularly for sensitive clinical data. To overcome these problems, this work proposes integrating generative adversarial networks with LLMs to improve clinical data accuracy and reduce hallucinations. LLMs such as LLaMA, BERT, and GPT are broadly used in clinical settings for tasks such as summarizing patient notes and answering medical queries. Generative Adversarial Networks (GANs) are used to generate realistic synthetic clinical data, aiding privacy and data augmentation. A Latent Dirichlet Allocation (LDA) model is combined with the GAN to identify the underlying topics in clinical documents, ensuring the synthetic text is coherent and thematically relevant. Retrieval-Augmented Generation (RAG) dynamically retrieves current medical knowledge, grounding responses in real-time evidence and minimizing reliance on outdated information. The first phase focuses on generating and validating synthetic clinical data using GANs and LDA to ensure high quality and domain alignment; the second phase focuses on user interaction, where RAG retrieves relevant information in real time to answer queries, and an interactive interface enables seamless engagement and feedback. Continuous evaluation on NLP metrics demonstrates that the proposed Clinical Augmentation Generation and Retrieval-Augmented Generation (CAG-RAG) framework outperforms the existing DALL-M approach in generating synthetic clinical text. For diagnosis-related data, the proposed CAG-RAG method achieves improvements of 15.7% in BLEU, 17% in ROUGE-1, and 17% in ROUGE-L scores.
For medication-related data, the improvements were 20.8% in BLEU, 17.1% in ROUGE-1, and 17.25% in ROUGE-L. These results highlight the framework's reliability, adaptability, and contextual accuracy for clinical applications.
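The BLEU and ROUGE gains above are n-gram overlap measures. As a rough illustration of what ROUGE-1 quantifies (not the authors' evaluation code, and with made-up example sentences), a minimal pure-Python ROUGE-1 F1 looks like this:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: unigram overlap between candidate and reference text."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

ref = "patient presents with acute chest pain and shortness of breath"
hyp = "patient presents with chest pain and mild shortness of breath"
print(round(rouge1_f1(hyp, ref), 3))
```

In practice, published scores depend on tokenization and stemming choices, so comparisons like the ones in this abstract normally rely on an established ROUGE implementation rather than a sketch like this.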

Similar Papers
  • Research Article
  • 10.3171/2025.4.focus25225
Synthetic neurosurgical data generation with generative adversarial networks and large language models: an investigation on fidelity, utility, and privacy.
  • Jul 1, 2025
  • Neurosurgical focus
  • Austin A Barr + 3 more

Use of neurosurgical data for clinical research and machine learning (ML) model development is often limited by data availability, sample sizes, and regulatory constraints. Synthetic data offer a potential solution to challenges associated with accessing, sharing, and using real-world data (RWD). The aim of this study was to evaluate the capability of generating synthetic neurosurgical data with a generative adversarial network and large language model (LLM) to augment RWD, perform secondary analyses in place of RWD, and train an ML model to predict postoperative outcomes. Synthetic data were generated with a conditional tabular generative adversarial network (CTGAN) and the LLM GPT-4o based on a real-world neurosurgical dataset of 140 older adults who underwent neurosurgical interventions. Each model was used to generate datasets at equivalent (n = 140) and amplified (n = 1000) sample sizes. Data fidelity was evaluated by comparing univariate and bivariate statistics to the RWD. Privacy evaluation involved measuring the uniqueness of generated synthetic records. Utility was assessed by: 1) reproducing and extending clinical analyses on predictors of Karnofsky Performance Status (KPS) deterioration at discharge and a prolonged postoperative intensive care unit (ICU) stay, and 2) training a binary ML classifier on amplified synthetic datasets to predict KPS deterioration on RWD. Both the CTGAN and GPT-4o generated complete, high-fidelity synthetic tabular datasets. GPT-4o matched or exceeded CTGAN across all measured fidelity, utility, and privacy metrics. All significant clinical predictors of KPS deterioration and prolonged ICU stay were retained in the GPT-4o-generated synthetic data, with some differences observed in effect sizes. Preoperative KPS was not preserved as a significant predictor in the CTGAN-generated data. 
The ML classifier trained on GPT-4o data outperformed the model trained on CTGAN data, achieving a higher F1 score (0.725 vs 0.688) for predicting KPS deterioration. This study demonstrated a promising ability to produce high-fidelity synthetic neurosurgical data using generative models. Synthetic neurosurgical data present a potential solution to critical limitations in data availability for neurosurgical research. Further investigation is necessary to enhance synthetic data utility for secondary analyses and ML model training, and to evaluate synthetic data generation methods across other datasets, including clinical trial data.
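The fidelity evaluation described above compares univariate statistics between real and synthetic records. A minimal sketch of that idea, with hypothetical column names and values rather than the study's data:

```python
import statistics

def univariate_fidelity(real: dict, synthetic: dict) -> dict:
    """Compare column-wise mean and stdev of real vs synthetic tabular data.

    `real` and `synthetic` map column name -> list of numeric values.
    Returns per-column absolute differences (smaller = higher fidelity).
    """
    report = {}
    for col in real:
        report[col] = {
            "mean_diff": abs(statistics.mean(real[col]) - statistics.mean(synthetic[col])),
            "stdev_diff": abs(statistics.stdev(real[col]) - statistics.stdev(synthetic[col])),
        }
    return report

real = {"age": [72, 68, 75, 80], "icu_days": [2, 5, 3, 4]}
syn = {"age": [70, 69, 77, 79], "icu_days": [3, 4, 3, 5]}
for col, d in univariate_fidelity(real, syn).items():
    print(col, round(d["mean_diff"], 2), round(d["stdev_diff"], 2))
```

A full fidelity check, as in the paper, would also compare bivariate statistics (e.g., pairwise correlations), which this sketch omits.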

  • Research Article
  • Cited by 8
  • 10.1287/ijds.2023.0007
How Can IJDS Authors, Reviewers, and Editors Use (and Misuse) Generative AI?
  • Apr 1, 2023
  • INFORMS Journal on Data Science
  • Galit Shmueli + 7 more


  • Research Article
  • Cited by 1
  • 10.1016/j.procs.2024.10.181
Mathematical Problem Solving in Arabic: Assessing Large Language Models
  • Jan 1, 2024
  • Procedia Computer Science
  • Abeer Mahgoub + 2 more


  • Research Article
  • Cited by 1
  • 10.2196/65729
Utility-based Analysis of Statistical Approaches and Deep Learning Models for Synthetic Data Generation With Focus on Correlation Structures: Algorithm Development and Validation.
  • Mar 20, 2025
  • JMIR AI
  • Marko Miletic + 1 more

Recent advancements in Generative Adversarial Networks and large language models (LLMs) have significantly advanced the synthesis and augmentation of medical data. These and other deep learning-based methods offer promising potential for generating high-quality, realistic datasets crucial for improving machine learning applications in health care, particularly in contexts where data privacy and availability are limiting factors. However, challenges remain in accurately capturing the complex associations inherent in medical datasets. This study evaluates the effectiveness of various Synthetic Data Generation (SDG) methods in replicating the correlation structures inherent in real medical datasets. In addition, it examines their performance in downstream tasks using Random Forests (RFs) as the benchmark model. To provide a comprehensive analysis, alternative models such as eXtreme Gradient Boosting and Gated Additive Tree Ensembles are also considered. We compare the following SDG approaches: Synthetic Populations in R (synthpop), copula, copulagan, Conditional Tabular Generative Adversarial Network (ctgan), tabular variational autoencoder (tvae), and tabula for LLMs. We evaluated synthetic data generation methods using both real-world and simulated datasets. Simulated data consist of 10 Gaussian variables and one binary target variable with varying correlation structures, generated via Cholesky decomposition. Real-world datasets include the body performance dataset with 13,393 samples for fitness classification, the Wisconsin Breast Cancer dataset with 569 samples for tumor diagnosis, and the diabetes dataset with 768 samples for diabetes prediction. Data quality is evaluated by comparing correlation matrices, the propensity score mean-squared error (pMSE) for general utility, and F1-scores for downstream tasks as a specific utility metric, using training on synthetic data and testing on real data. 
Our simulation study, supplemented with real-world data analyses, shows that the statistical methods copula and synthpop consistently outperform deep learning approaches across various sample sizes and correlation complexities, with synthpop being the most effective. Deep learning methods, including LLMs, show mixed performance, particularly with smaller datasets or limited training epochs. LLMs often struggle to replicate numerical dependencies effectively. In contrast, methods like tvae with 10,000 epochs perform comparably well. On the body performance dataset, copulagan achieves the best performance in terms of pMSE. The results also highlight that model utility depends more on the relative correlations between features and the target variable than on the absolute magnitude of correlation matrix differences. Statistical methods, particularly synthpop, demonstrate superior robustness and utility preservation for synthetic tabular data compared with deep learning approaches. Copula methods show potential but face limitations with integer variables. Deep learning methods underperform in this context. Overall, these findings underscore the dominance of statistical methods for synthetic data generation for tabular data, while highlighting the niche potential of deep learning approaches for highly complex datasets, provided adequate resources and tuning.
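The pMSE utility metric used above scores how well a discriminator can tell real from synthetic records: propensities that all collapse to the synthetic share mean the datasets are indistinguishable. A minimal sketch of the final scoring step, assuming the propensities have already been predicted by some classifier (the propensity values here are made up):

```python
def pmse(propensities, synthetic_share):
    """Propensity-score mean-squared error (general utility metric).

    `propensities` are a classifier's predicted probabilities that each
    record in the combined real+synthetic set is synthetic. For
    indistinguishable data every prediction collapses to the synthetic
    share c, giving pMSE = 0; larger values mean lower utility.
    """
    c = synthetic_share
    return sum((p - c) ** 2 for p in propensities) / len(propensities)

# Perfectly indistinguishable data (all propensities at c = 0.5):
print(pmse([0.5, 0.5, 0.5, 0.5], 0.5))  # 0.0
# Easily separable real vs synthetic records:
print(pmse([0.9, 0.1, 0.95, 0.05], 0.5))
```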

  • Research Article
  • Cited by 1
  • 10.2196/66279
Using Synthetic Health Care Data to Leverage Large Language Models for Named Entity Recognition: Development and Validation Study.
  • Mar 18, 2025
  • Journal of medical Internet research
  • Hendrik Šuvalov + 6 more

Named entity recognition (NER) plays a vital role in extracting critical medical entities from health care records, facilitating applications such as clinical decision support and data mining. Developing robust NER models for low-resource languages, such as Estonian, remains a challenge due to the scarcity of annotated data and domain-specific pretrained models. Large language models (LLMs) have proven to be promising in understanding text from any language or domain. This study addresses the development of medical NER models for low-resource languages, specifically Estonian. We propose a novel approach by generating synthetic health care data and using LLMs to annotate them. These synthetic data are then used to train a high-performing NER model, which is applied to real-world medical texts, preserving patient data privacy. Our approach to overcoming the shortage of annotated Estonian health care texts involves a three-step pipeline: (1) synthetic health care data are generated using a locally trained GPT-2 model on Estonian medical records, (2) the synthetic data are annotated with LLMs, specifically GPT-3.5-Turbo and GPT-4, and (3) the annotated synthetic data are then used to fine-tune an NER model, which is later tested on real-world medical data. This paper compares the performance of different prompts; assesses the impact of GPT-3.5-Turbo, GPT-4, and a local LLM; and explores the relationship between the amount of annotated synthetic data and model performance. The proposed methodology demonstrates significant potential in extracting named entities from real-world medical texts. Our top-performing setup achieved an F1-score of 0.69 for drug extraction and 0.38 for procedure extraction. These results indicate a strong performance in recognizing certain entity types while highlighting the complexity of extracting procedures. 
This paper demonstrates a successful approach to leveraging LLMs for training NER models using synthetic data, effectively preserving patient privacy. By avoiding reliance on human-annotated data, our method shows promise in developing models for low-resource languages, such as Estonian. Future work will focus on refining the synthetic data generation and expanding the method's applicability to other domains and languages.
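The drug and procedure F1-scores above come from entity-level evaluation. A minimal sketch of strict entity-level F1 over (start, end, label) spans, offered as a generic illustration rather than the paper's scorer (the spans below are invented):

```python
def entity_f1(gold: set, predicted: set) -> float:
    """Strict entity-level F1: an entity counts only on an exact match
    of its (start, end, label) tuple between gold and predicted sets."""
    tp = len(gold & predicted)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = {(0, 2, "DRUG"), (5, 7, "PROCEDURE"), (9, 10, "DRUG")}
pred = {(0, 2, "DRUG"), (5, 7, "PROCEDURE"), (12, 13, "DRUG")}
print(round(entity_f1(gold, pred), 2))
```

Partial-match or token-level scoring variants would give higher numbers; the strict exact-span convention shown here is the harsher and more common one for NER comparisons.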

  • Research Article
  • Cited by 3
  • 10.1148/ryai.230514
Addressing the Generalizability of AI in Radiology Using a Novel Data Augmentation Framework with Synthetic Patient Image Data: Proof-of-Concept and External Validation for Classification Tasks in Multiple Sclerosis.
  • Oct 16, 2024
  • Radiology. Artificial intelligence
  • Gianluca Brugnara + 14 more

Artificial intelligence (AI) models often face performance drops after deployment to external datasets. This study evaluated the potential of a novel data augmentation framework based on generative adversarial networks (GANs) that creates synthetic patient image data for model training to improve model generalizability. Model development and external testing were performed for a given classification task, namely the detection of new fluid-attenuated inversion recovery lesions at MRI during longitudinal follow-up of patients with multiple sclerosis (MS). An internal dataset of 669 patients with MS (n = 3083 examinations) was used to develop an attention-based network, trained both with and without the inclusion of the GAN-based synthetic data augmentation framework. External testing was performed on 134 patients with MS from a different institution, with MR images acquired using different scanners and protocols than images used during training. Models trained using synthetic data augmentation showed a significant performance improvement when applied on external data (area under the receiver operating characteristic curve [AUC], 83.6% without synthetic data vs 93.3% with synthetic data augmentation; P = .03), achieving comparable results to the internal test set (AUC, 95.0%; P = .53), whereas models without synthetic data augmentation demonstrated a performance drop upon external testing (AUC, 93.8% on internal dataset vs 83.6% on external data; P = .03). Data augmentation with synthetic patient data substantially improved performance of AI models on unseen MRI data and may be extended to other clinical conditions or tasks to mitigate domain shift, limit class imbalance, and enhance the robustness of AI applications in medical imaging. Keywords: Brain, Brain Stem, Multiple Sclerosis, Synthetic Data Augmentation, Generative Adversarial Network Supplemental material is available for this article. © RSNA, 2024.
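The AUC values above can be computed without plotting any curve, via the Mann-Whitney interpretation: the probability that a randomly chosen positive case outscores a randomly chosen negative one. A small sketch with invented classifier scores:

```python
def auc(pos_scores, neg_scores):
    """AUC as the probability a positive case scores higher than a
    negative case (Mann-Whitney U statistic), counting ties as half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Illustrative scores: new-lesion examinations vs stable examinations.
print(auc([0.9, 0.8, 0.7], [0.6, 0.4, 0.75]))
```

This O(n*m) form is fine for a sketch; real evaluations use a rank-based O(n log n) implementation from a statistics library.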

  • Research Article
  • Cited by 5
  • 10.1093/jamiaopen/ooae114
Large language models and synthetic health data: progress and prospects.
  • Oct 8, 2024
  • JAMIA open
  • Daniel Smolyak + 3 more

Given substantial obstacles surrounding health data acquisition, high-quality synthetic health data are needed to meet a growing demand for the application of advanced analytics for clinical discovery, prediction, and operational excellence. We highlight how recent advances in large language models (LLMs) present new opportunities for progress, as well as new risks, in synthetic health data generation (SHDG). We synthesized systematic scoping reviews in the SHDG domain, recent LLM methods for SHDG, and papers investigating the capabilities and limits of LLMs. We summarize the current landscape of generative machine learning models (eg, Generative Adversarial Networks) for SHDG, describe remaining challenges and limitations, and identify how recent LLM approaches can potentially help mitigate them. Six research directions are outlined for further investigation of LLMs for SHDG: evaluation metrics, LLM adoption, data efficiency, generalization, health equity, and regulatory challenges. LLMs have already demonstrated both high potential and risks in the health domain, and it is important to study their advantages and disadvantages for SHDG.

  • Research Article
  • Cited by 2
  • 10.1145/3704263
Enhancing ID-based Recommendation with Large Language Models
  • Jul 10, 2025
  • ACM Transactions on Information Systems
  • Lei Chen + 6 more

Large language models (LLMs) have recently garnered significant attention in various domains, including recommendation systems. Recent research leverages the capabilities of LLMs to improve the performance and user modeling aspects of recommender systems. These studies primarily focus on utilizing LLMs to interpret textual data in recommendation tasks. However, in ID-based recommendation, textual data is absent and only ID data is available. The untapped potential of LLMs for ID data within the ID-based recommendation paradigm remains relatively unexplored. To this end, we introduce a pioneering approach called “LLM for ID-based recommendation” (LLM4IDRec). This approach integrates the capabilities of LLMs while relying exclusively on ID data, diverging from the previous reliance on textual data. The basic idea of LLM4IDRec is to employ an LLM to augment ID data: if the augmented ID data improve recommendation performance, this demonstrates that the LLM can interpret ID data effectively, opening an innovative way to integrate LLMs into ID-based recommendation. Specifically, we first define a prompt template to enhance the LLM's ability to comprehend ID data and the ID-based recommendation task. Next, during the process of generating training data with this prompt template, we develop two efficient methods to capture both the local and global structure of ID data. We feed this generated training data into the LLM and employ LoRA for fine-tuning. Following the fine-tuning phase, we use the fine-tuned LLM to generate ID data that aligns with users’ preferences, and we design two filtering strategies to eliminate invalid generated data. Third, we merge the original ID data with the generated ID data, creating augmented data. Finally, we input this augmented data into existing ID-based recommendation models without any modifications to the recommendation model itself.
We evaluate the effectiveness of our LLM4IDRec approach using three widely used datasets. Our results demonstrate a notable improvement in recommendation performance, with our approach consistently outperforming existing methods in ID-based recommendation by solely augmenting input data.
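Two of the steps described, prompt construction from pure ID data and filtering of invalid generations, can be sketched roughly as follows; the template wording, item IDs, and helper names are hypothetical, not taken from the paper:

```python
def build_prompt(user_id, item_ids):
    """Hypothetical prompt template wrapping pure ID data so an LLM can
    be fine-tuned to continue a user's interaction sequence."""
    history = ", ".join(str(i) for i in item_ids)
    return (f"User {user_id} has interacted with items [{history}]. "
            f"Predict the next item IDs this user is likely to prefer.")

def filter_generated(generated, catalog, history):
    """Drop invalid generations: IDs outside the item catalog or
    already present in the user's interaction history."""
    return [i for i in generated if i in catalog and i not in history]

catalog = {101, 102, 103, 104, 105}
history = [101, 103]
print(build_prompt(7, history))
print(filter_generated([103, 104, 999, 105], catalog, set(history)))
```

The surviving IDs would then be merged with the original interaction data before training the downstream recommender, per the pipeline the abstract outlines.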

  • Research Article
  • Cited by 5
  • 10.1038/s41598-024-75331-2
Distilling the knowledge from large-language model for health event prediction
  • Dec 28, 2024
  • Scientific Reports
  • Sirui Ding + 3 more

Health event prediction is empowered by the rapid and wide adoption of electronic health records (EHRs). In the Intensive Care Unit (ICU), precisely predicting health-related events in advance is essential for providing treatment and intervention that improve patient outcomes. EHR data are multi-modal, containing clinical text, time series, structured data, etc. Most health event prediction works focus on a single modality, e.g., text or tabular EHR. How to effectively learn from multi-modal EHR data for health event prediction remains a challenge. Inspired by the strong text-processing capability of large language models (LLMs), we propose the framework CKLE for health event prediction, which distills knowledge from an LLM and learns from multi-modal EHR data. There are two challenges in applying LLMs to health event prediction: first, most LLMs can only handle text rather than other modalities, e.g., structured data; second, the privacy requirements of health applications mean the LLM must be deployed locally, which may be limited by computational resources. CKLE addresses LLM scalability and portability in the healthcare domain by distilling cross-modality knowledge from the LLM into the health event predictive model. To take full advantage of the LLM's power, the raw clinical text is refined and augmented with prompt learning, and embeddings of the clinical text are generated by the LLM. To effectively distill the knowledge of the LLM into the predictive model, we design a cross-modality knowledge distillation (KD) method with a specially designed training objective that accounts for multiple modalities and patient similarity. The KD loss function consists of two parts. The first is a cross-modality contrastive loss, which models the correlation of different modalities from the same patient.
The second is a patient similarity learning loss, which models correlations between similar patients. The cross-modality knowledge distillation distills the rich information in clinical text and the knowledge of the LLM into the predictive model on structured EHR data. To demonstrate the effectiveness of CKLE, we evaluate it on two health event prediction tasks in cardiology: heart failure prediction and hypertension prediction. We select 7125 patients from the MIMIC-III dataset and split them into train/validation/test sets. CKLE achieves up to a 4.48% improvement in accuracy compared to state-of-the-art predictive models designed for health event prediction, and surpasses the baseline models significantly in both normal and limited-label settings. We also conduct a case study on cardiac disease analysis in heart failure and hypertension prediction. Through feature importance calculation, we analyse the salient features related to the cardiac conditions, which correspond to medical domain knowledge. The superior performance and interpretability of CKLE pave a promising way to leverage the power and knowledge of LLMs for health event prediction in real-world clinical settings.
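The cross-modality contrastive part of a KD objective like the one described can be sketched generically as an InfoNCE-style loss that pulls each patient's text embedding toward that same patient's structured-EHR embedding and away from other patients'. This is a stand-in under stated assumptions (toy 2-D embeddings, a generic temperature), not CKLE's actual loss:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def cross_modal_contrastive_loss(text_emb, ehr_emb, temperature=0.1):
    """InfoNCE-style loss: patient i's text embedding should be closer
    to patient i's structured-EHR embedding than to any other patient's."""
    loss = 0.0
    n = len(text_emb)
    for i in range(n):
        sims = [cosine(text_emb[i], ehr_emb[j]) / temperature for j in range(n)]
        log_denom = math.log(sum(math.exp(s) for s in sims))
        loss += log_denom - sims[i]  # -log softmax of the matched pair
    return loss / n

text = [[1.0, 0.0], [0.0, 1.0]]   # toy text embeddings for 2 patients
ehr = [[0.9, 0.1], [0.1, 0.9]]    # toy structured-EHR embeddings
print(round(cross_modal_contrastive_loss(text, ehr), 4))
```

Swapping the EHR embeddings between the two patients raises the loss, which is exactly the alignment signal the distillation objective exploits.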

  • Research Article
  • 10.3390/s25134114
Generative Artificial Intelligence for Synthetic Spectral Data Augmentation in Sensor-Based Plastic Recycling.
  • Jul 1, 2025
  • Sensors (Basel, Switzerland)
  • Roman-David Kulko + 2 more

The reliance on deep learning models for sensor-based material classification amplifies the demand for labeled training data. However, acquiring large-scale, annotated spectral data for applications such as near-infrared (NIR) reflectance spectroscopy in plastic sorting remains a significant challenge due to high acquisition costs and environmental variability. This paper investigates the potential of large language models (LLMs) in synthetic spectral data generation. Specifically, it examines whether LLMs have acquired sufficient implicit knowledge to assist in generating spectral data and introduce meaningful variations that enhance model performance when used for data augmentation. Classification accuracy is reported exclusively as a proxy for structural plausibility of the augmented spectra; maximizing augmentation performance itself is not the study's goal. From as little as one empirical mean spectrum per class, LLM-guided simulation produced data that enabled up to 86% accuracy, evidence that the generated variation preserves class-distinguishing information. While the approach performs best for spectrally distinct polymers, overlapping classes remain challenging. Additionally, the transfer of optimized augmentation parameters to unseen classes indicates potential for generalization across material types. While plastic sorting serves as a case study, the methodology may be applicable to other domains such as agriculture or food quality assessment, where spectral data are limited. The study outlines a novel path toward scalable, AI-supported data augmentation in spectroscopy-based classification systems.
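A toy version of augmenting from a single class-mean spectrum, using point noise, a baseline offset, and a global intensity scaling; all parameter values and the example reflectance values are hypothetical, and the paper's LLM-guided simulation is far richer than this:

```python
import random

def augment_spectrum(mean_spectrum, n_variants=5, noise=0.01,
                     baseline=0.02, scale=0.05, seed=42):
    """Generate synthetic NIR-like spectra from one class-mean spectrum
    by combining a global intensity scaling, a baseline offset, and
    per-point Gaussian noise (hypothetical parameter values)."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n_variants):
        s = rng.uniform(1 - scale, 1 + scale)   # intensity scaling
        b = rng.uniform(-baseline, baseline)    # baseline shift
        variants.append([s * y + b + rng.gauss(0.0, noise)
                         for y in mean_spectrum])
    return variants

mean_pet = [0.12, 0.35, 0.80, 0.42, 0.15]  # illustrative reflectance values
aug = augment_spectrum(mean_pet)
print(len(aug), len(aug[0]))
```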

  • Research Article
  • Cited by 50
  • 10.1016/j.ultras.2023.107041
A review of synthetic and augmented training data for machine learning in ultrasonic non-destructive evaluation
  • May 18, 2023
  • Ultrasonics
  • Sebastian Uhlig + 4 more

Ultrasonic Testing (UT) has seen increasing application of machine learning (ML) in recent years, promoting higher-level automation and decision-making in flaw detection and classification. Building a generalized training dataset to apply ML in non-destructive evaluation (NDE), and thus UT, is exceptionally difficult since data on pristine and representative flawed specimens are needed. Yet, in most UT test cases flawed specimen data is inherently rare making data coverage the leading problem when applying ML. Common data augmentation (DA) strategies offer limited solutions as they don’t increase the dataset variance, which can lead to overfitting of the training data. The virtual defect method and the recent application of generative adversarial neural networks (GANs) in UT are sophisticated DA methods targeting to solve this problem. On the other hand, well-established research in modeling ultrasonic wave propagations allows for the generation of synthetic UT training data. In this context, we present a first thematic review to summarize the progress of the last decades on synthetic and augmented UT training data in NDE. Additionally, an overview of methods for synthetic UT data generation and augmentation is presented. Among numerical methods such as finite element, finite difference, and elastodynamic finite integration methods, semi-analytical methods such as general point source synthesis, superposition of Gaussian beams, and the pencil method as well as other UT modeling software are presented and discussed. Likewise, existing DA methods for one- and multidimensional UT data, feature space augmentation, and GANs for augmentation are presented and discussed. The paper closes with an in-detail discussion of the advantages and limitations of existing methods for both synthetic UT training data generation and DA of UT data to aid the decision-making of the reader for the application to specific test cases.

  • Research Article
  • 10.1200/jco.2025.43.16_suppl.12105
Using large language models to assess adherence to ASCO patient-oncologist communication standards.
  • Jun 1, 2025
  • Journal of Clinical Oncology
  • Joshua Paul Davis + 3 more

12105 Background: The American Society of Clinical Oncology (ASCO) convened a multidisciplinary panel resulting in patient-oncologist communication guidelines published in 2017. These guidelines contain recommendations across topics including goals of care, treatment selection, end-of-life care, facilitating family involvement, and clinician training in communication. Ideally, these conversations should be documented in the electronic health record (EHR) so that they can be referred to at future visits as a patient’s clinical course evolves. Tracking adherence to these communication guidelines may be beneficial for quality improvement efforts. However, manual chart review of unstructured free-text notes is tedious and burdensome. The recent development of Large Language Models (LLMs) may represent a new computational approach that can capture such documentation more efficiently than chart review. To our knowledge, no prior study has used LLMs to capture such documentation in free-text notes, validated against gold-standard manual chart review. Methods: As part of a larger study on development of LLMs for tracking palliative care quality measures, we randomly selected 30 patients with advanced cancer and clinical notes in the month following navigation to a poor-prognosis treatment node. We used GPT-4o-2024-05-13, our HIPAA-secure tool, to develop an LLM prompt for identifying 14 ASCO communication domains in clinical text. The prompt required the output to include source text supporting identification of a communication domain. A “hallucination score” was calculated for source text, a measure of evidence produced by the LLM that is not found in the source text. We then compared results to gold-standard manual chart review using standard performance metrics. Results: Across communication domains, note-level LLM analysis achieved sensitivity ranging from 0.43 to 1.0, specificity from 0.32 to 0.99, and accuracy from 0.51 to 0.99.
Examples of documentation identified by both the LLM and chart review include goals of care and prognosis (“recently informed that her disease had progressed with treatment. Currently on ‘last line’ of chemotherapy”), treatment options and clinical trials (“her oncologist recommended a potential trial treatment, and she is contemplating involvement in this”), end-of-life care (“if her cancer continues to progress with her current treatment, they will transition her care to home hospice for comfort measures only”), and cost of care (“financial insecurity - referred to resource specialist”). The average hallucination index for documentation identified by the LLM was low. The LLM frequently identified information missed by annotators and extracted information relevant to communication domains in a fraction of the time required by manual chart review. Conclusions: LLMs can identify communication domains in EHRs, potentially contributing to quality improvement efforts.
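The hallucination-score idea above, evidence quoted by the LLM that cannot actually be located in the source note, can be approximated with simple substring matching. This is a crude sketch with an invented note and quotes, not the study's implementation:

```python
def hallucination_score(quotes, source_note):
    """Fraction of LLM-returned 'source text' quotes that cannot be
    located in the clinical note (whitespace-normalized substring
    matching; a crude stand-in for the paper's hallucination score)."""
    if not quotes:
        return 0.0
    norm = " ".join(source_note.lower().split())
    missing = sum(1 for q in quotes
                  if " ".join(q.lower().split()) not in norm)
    return missing / len(quotes)

note = ("Patient recently informed that her disease had progressed "
        "with treatment. Currently on last line of chemotherapy.")
quotes = ["disease had progressed with treatment",          # present
          "patient requests transfer to hospice today"]     # fabricated
print(hallucination_score(quotes, note))  # 0.5
```

Exact substring matching is brittle against paraphrase; a production version would likely use fuzzy or embedding-based matching, which this sketch deliberately avoids.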

  • Research Article
  • 10.1093/ofid/ofae631.609
P-408. Utility of a Large Language Model for Identifying Central Line-Associated Bloodstream Infections (CLABSI) Using Real Clinical Data at Stanford Health Care
  • Jan 29, 2025
  • Open Forum Infectious Diseases
  • Guillermo Rodriguez-Nava + 4 more

Background: Central line-associated bloodstream infection (CLABSI) surveillance can be subjective and time-consuming. Large language models (LLMs) are advanced artificial intelligence systems with potential to assist healthcare professionals in classification tasks. Stanford Health Care recently implemented one of the first secure LLMs, powered by OpenAI’s GPT 4.0, cleared for sensitive health data. We assessed its performance in classifying CLABSI cases. [Figure 1: Confusion matrix of LLM performance in CLABSI classification.] Methods: We selected 40 patients flagged by our surveillance system for CLABSI review from November 2023–March 2024: 20 CLABSIs, consecutively identified, and 20 not-CLABSIs (randomly sampled). We prompted the LLM to determine whether patients met the NHSN definition for CLABSI and provided the blood culture results that triggered the alert and the last 2 progress notes from the primary care team at the end of the infection window (within 3 days after the first positive test). We compared the secure LLM's determinations with those of infection preventionists. [Table 1: Cases in which the LLM did not agree with IP assessment for CLABSI. *Community-onset: blood cultures obtained within 2 days of admission. +NHSN guidelines list Fusobacterium nucleatum as an MBI organism (https://www.cdc.gov/nhsn/pdfs/pscmanual/17pscnosinfdef_current.pdf). Abbreviations: BSI, bloodstream infection; CLABSI, central line-associated bloodstream infection; CoNS, coagulase-negative staphylococci; ESBL, extended-spectrum beta-lactamase; HIDA scan; IP, infection preventionist; LLM, large language model; MBI, mucosal barrier injury; MSSA, methicillin-susceptible Staphylococcus aureus; NHSN, National Healthcare Safety Network.] Results: Across 20 CLABSI-positive and 20 CLABSI-negative cases reviewed, the LLM accurately identified 16 of 20 CLABSIs and 7 of 20 not-CLABSIs.
The sensitivity was 80% (95% CI 57.6%–92.9%), specificity was 35% (95% CI 33.3%–86.5%), and the agreement rate was 57.5% (95% CI 41.2%–73.3%). Among 17 discordant cases, 11 involved clinical data available in the chart but unavailable to the LLM: admission information (4 false-positives), matching organisms (4 false-positives), and central line or symptom status (2 false-negatives, 1 false-positive). Had this information been available to the LLM, we would expect an adjusted sensitivity of 90% (18/20) and adjusted specificity of 80% (16/20). The remaining discordant cases involved misclassification of organisms and incorrect identification of infection sources by the LLM. The mean review time by infection preventionists was 75 minutes (SD 48.7 minutes) compared to 5 minutes using the LLM. Conclusion: An LLM not specifically trained for CLABSI classification showed high sensitivity using limited patient data. LLM case review required 5 minutes, versus 1 hour for traditional review. These results suggest LLMs could serve as a "first-pass" screening tool for CLABSI detection, helping infection preventionists narrow the records needing human review. Disclosures: All authors: no reported disclosures.
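The reported operating characteristics follow directly from the confusion counts in the abstract (16/20 CLABSIs and 7/20 not-CLABSIs correctly identified); a quick sketch reproducing them:

```python
def screening_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, and raw agreement from confusion counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    agreement = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, agreement

# Counts from the abstract: 16 of 20 CLABSIs and 7 of 20 not-CLABSIs
# were correctly identified by the LLM.
sens, spec, agree = screening_metrics(tp=16, fn=4, tn=7, fp=13)
print(sens, spec, agree)  # 0.8 0.35 0.575
```

These match the reported 80% sensitivity, 35% specificity, and 57.5% agreement (the confidence intervals require a separate interval estimator, which this sketch omits).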

  • Research Article
  • 10.1109/jbhi.2025.3639109
Generating Completions for Broca's Aphasic Sentences Using Large Language Models.
  • Dec 1, 2025
  • IEEE journal of biomedical and health informatics
  • Sijbren Van Vaals + 2 more

Broca's aphasia is a type of aphasia characterized by non-fluent, effortful and agrammatic speech production with relatively good comprehension. Since traditional aphasia treatment methods are often time-consuming, labour-intensive, and do not reflect real-world conversations, applying natural language processing based approaches such as Large Language Models (LLMs) could potentially contribute to improving existing treatment approaches. To address this issue, we explore the use of sequence-to-sequence LLMs for completing Broca's aphasic sentences. We first generate synthetic Broca's aphasic data using a rule-based system designed to mirror the linguistic characteristics of Broca's aphasic speech. Using this synthetic data (without authentic aphasic samples), we then fine-tune four pre-trained LLMs on the task of completing agrammatic sentences. We evaluate our fine-tuned models on both synthetic and authentic Broca's aphasic data. We demonstrate LLMs' capability for reconstructing agrammatic sentences, with the models showing improved performance with longer input utterances. Our result highlights the LLMs' potential in advancing communication aids for individuals with Broca's aphasia and possibly other clinical populations.
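A toy flavor of the rule-based generation step, dropping function words and stripping a common inflection to mimic agrammatic output; the paper's actual rule system is linguistically far richer, and the word lists here are illustrative:

```python
def simulate_broca(sentence):
    """Toy rule-based simulation of agrammatic (Broca's-like) speech:
    drop function words and strip the '-ing' inflection. Illustrative
    only; real rule systems model many more linguistic phenomena."""
    function_words = {"the", "a", "an", "is", "are", "was", "were",
                      "to", "of", "on", "in", "at", "has", "have"}
    out = []
    for word in sentence.lower().split():
        if word in function_words:
            continue                  # omit determiners/auxiliaries
        if word.endswith("ing"):
            word = word[:-3]          # "walking" -> "walk"
        out.append(word)
    return " ".join(out)

print(simulate_broca("The nurse is walking to the garden"))
# -> "nurse walk garden"
```

Pairs of (agrammatic, original) sentences produced this way are what a sequence-to-sequence model would then be fine-tuned on for the completion task.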

  • Research Article
  • 10.1101/2025.08.07.25333172
Large Language Models for Psychiatric Phenotype Extraction from Electronic Health Records
  • Aug 12, 2025
  • medRxiv
  • Clara Frydman-Gani + 9 more

The accurate detection of clinical phenotypes from electronic health records (EHRs) is pivotal for advancing large-scale genetic and longitudinal studies in psychiatry. Free-text clinical notes are an essential source of symptom-level information, particularly in psychiatry. However, the automated extraction of symptoms from clinical text remains challenging. Here, we tested 11 open-source generative large language models (LLMs) for their ability to detect 109 psychiatric phenotypes from clinical text, using annotated EHR notes from a psychiatric clinic in Colombia. The LLMs were evaluated both “out-of-the-box” and after fine-tuning, and compared against a traditional natural language processing (tNLP) method developed from the same data. We show that while base LLM performance was poor to moderate (0.2–0.6 macro-F1 for zero-shot; 0.2–0.74 macro-F1 for few-shot), it improved significantly after fine-tuning (0.75–0.86 macro-F1), with several fine-tuned LLMs outperforming the tNLP method. In total, 100 phenotypes could be reliably detected (F1>0.8) using either a fine-tuned LLM or tNLP. To generate a fine-tuned LLM that can be shared with the scientific and medical community, we created a fully synthetic dataset free of patient information but based on the original annotations. We fine-tuned a top-performing LLM on this data, creating “Mistral-small-psych”, an LLM that can detect psychiatric phenotypes from Spanish text with performance comparable to that of LLMs trained on real EHR data (macro-F1=0.79). Finally, the fine-tuned LLMs underwent external validation using data from a large psychiatric hospital in Colombia, the Hospital Mental de Antioquia, highlighting that most LLMs generalized well (0.02–0.16 point loss in macro-F1). Our study underscores the value of domain-specific adaptation of LLMs and introduces a new model for accurate psychiatric phenotyping in Spanish text, paving the way for global precision psychiatry.
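Macro-F1, the headline metric here, averages per-phenotype F1 without frequency weighting, so a rare phenotype counts as much as a common one. A minimal sketch with illustrative (tp, fp, fn) counts, not the study's data:

```python
def f1(tp, fp, fn):
    """Per-class F1 from true-positive, false-positive, false-negative counts."""
    if tp == 0:
        return 0.0
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)

def macro_f1(per_phenotype_counts):
    """Macro-F1: unweighted mean of per-phenotype F1 scores, so rare
    phenotypes contribute as much as common ones."""
    scores = [f1(tp, fp, fn) for tp, fp, fn in per_phenotype_counts]
    return sum(scores) / len(scores)

# (tp, fp, fn) per phenotype, illustrative counts:
counts = [(8, 2, 2), (1, 0, 9), (5, 5, 5)]
print(round(macro_f1(counts), 3))
```

A micro-averaged variant (pooling counts before computing F1) would instead be dominated by the frequent phenotypes, which is why macro-F1 is the more informative choice for a 109-label task with a long tail.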
