Generative AI enhanced with NCCN clinical practice guidelines for clinical decision support: A case study on bone cancer.

Abstract

e13623 Background: Bone cancer is a complex and challenging disease to diagnose and treat in clinical practice. Recently, generative AI, especially large language models (LLMs), has demonstrated potential as a decision support tool for cancer. However, most implementations have overlooked the integration of available cancer guidelines, such as the NCCN Bone Cancer Guidelines, in shaping the outputs of generative AI models. Incorporating these guidelines into LLMs presents an opportunity to harness the extensive clinical knowledge they contain and improve the decision-support capabilities of the models. Methods: In this study, we aim to enhance LLMs with cancer clinical guidelines to enable accurate medical decisions and personalized treatment recommendations. We therefore introduce a novel method for incorporating the NCCN Bone Cancer Guidelines into LLMs using a Binary Decision Tree (BDT) approach. The approach involves constructing a BDT from the NCCN Bone Cancer Guidelines, where internal nodes represent decision points from the Guidelines and leaf nodes signify final treatment suggestions. The LLM then makes a decision at each internal node, considering a given patient's characteristics, and is guided toward the treatment recommendation at a leaf node. To assess the efficacy of guideline-enhanced LLMs, an oncologist from our team created 11 hypothetical osteosarcoma patients' medical progress notes. Each note contains demographics, medical history, history of present illness, physical exam findings, and diagnostic test results. We tested three LLMs in the implementation (GPT-4, GPT-3.5, and PaLM 2) and compared the LLM-generated treatment recommendations with the gold-standard treatment across four runs with different random seeds (a random seed is a setting that controls the variability of LLM outputs). The results are reported as the average of the four runs. The original LLMs are used as baseline methods for comparison.
Results: The table below compares the performance of the original LLMs and those augmented with cancer guidelines for osteosarcoma treatment recommendations. We observe that the PaLM 2 model demonstrated superior performance compared to its counterparts, underscoring the effectiveness of integrating cancer guidelines into LLMs for decision support. Conclusions: The clinical decision support capabilities of LLMs are promising when enhanced with the NCCN Bone Cancer Guidelines using our approach. To fully realize the potential of our proposed method as a clinical decision support tool, further investigation into other subtypes of bone cancer should be conducted in future studies. [Table: see text]
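The BDT traversal described in the Methods can be sketched as follows. This is a minimal illustration, not the authors' implementation: the node questions, prompt wording, and the `ask_llm` stub are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Node:
    text: str                      # decision question (internal) or treatment (leaf)
    yes: Optional["Node"] = None   # branch taken when the LLM answers "yes"
    no: Optional["Node"] = None    # branch taken when the LLM answers "no"

    @property
    def is_leaf(self) -> bool:
        return self.yes is None and self.no is None

def recommend(root: Node, patient_note: str, ask_llm: Callable[[str], str]) -> str:
    """Walk the guideline tree, letting the LLM answer yes/no at each internal node."""
    node = root
    while not node.is_leaf:
        prompt = (f"Patient note:\n{patient_note}\n\n"
                  f"Question: {node.text}\nAnswer yes or no.")
        answer = ask_llm(prompt).strip().lower()
        node = node.yes if answer.startswith("yes") else node.no
    return node.text  # leaf holds the treatment recommendation

# Toy tree with invented decision points:
tree = Node(
    "Is the tumor resectable?",
    yes=Node("Wide excision followed by adjuvant chemotherapy"),
    no=Node("Neoadjuvant chemotherapy, then restage"),
)
# Stub LLM that always answers "yes":
print(recommend(tree, "58-year-old with localized osteosarcoma of the femur.",
                lambda p: "yes"))  # prints "Wide excision followed by adjuvant chemotherapy"
```

In a real system, `ask_llm` would call the model's API, and the tree would be hand-built from the guideline's decision diagrams.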

Similar Papers
  • Research Article
  • 10.1200/jco.2025.43.16_suppl.e20011
Evaluating artificial intelligence (AI) as a clinical decision support tool for lung cancer treatment recommendations.
  • Jun 1, 2025
  • Journal of Clinical Oncology
  • Roupen Odabashian + 12 more

e20011 Background: The therapeutic landscape of lung cancer is rapidly evolving, presenting oncologists with the challenge of staying updated amidst an overwhelming influx of data. Clinical decision support (CDS) tools, including artificial intelligence (AI) and large language models (LLMs), may help bridge this gap. Evaluating the accuracy of LLMs in complex, real-world oncology scenarios is crucial to understanding their potential. Methods: Twenty-five de-identified lung cancer cases from the fellows’ clinic at Karmanos Cancer Institute, Detroit, MI, were analyzed. Two LLMs, GPT-4 (OpenAI) and Claude Opus (Anthropic), were assessed using advanced prompting techniques like persona-based and chain-of-thought prompting. Five board-certified lung cancer oncologists from NCI-designated centers evaluated LLM-generated responses based on accuracy, treatment recommendation comprehensiveness, and supportive care planning, using a 1–5 scale. Novel insights, the presence of fabricated information, and harmful recommendations were flagged as binary outcomes. Oncologists were blinded to the LLM source and actual treatment decisions. Results: Table 1 presents patient characteristics. GPT-4 achieved an average accuracy score of 4.2 (95% CI, 3.9–4.4), with 3.7 for comprehensiveness of medical/surgical treatment recommendations and 3.7 for supportive care planning. Six responses (32%) were flagged as potentially harmful, and two (8%) contained inaccuracies. Sixteen GPT-4 responses (64%) were rated trustworthy as a CDS tool. Claude Opus had an average accuracy score of 3.6 (95% CI, 3.1–4.1), scoring 3.6 for treatment recommendation comprehensiveness and 3.5 for supportive care planning. Nine responses (36%) were flagged for potential harm, and five (20%) included inaccuracies. Eleven Claude responses (44%) were deemed trustworthy. Significant differences were observed in accuracy (p=0.04) and trustworthiness (p=0.03) between models using McNemar's test. 
Other factors showed no statistical significance. Conclusions: GPT-4 outperformed Claude Opus in accuracy and trustworthiness, but both models demonstrated limitations, including harmful recommendations and inaccuracies. These findings highlight the need for improved LLM refinement before routine use as CDS tools in lung cancer treatment.

Patient demographics and clinical characteristics:
  Median age (range), yr: 65 (26-78)
  Sex: Female 7; Male 18
  Histology: Adenocarcinoma 10; Squamous cell carcinoma (SCC) 7; Small cell carcinoma 6; Poorly differentiated 2; Total 25
  Stage: NSCLC stage 3: 7; NSCLC stage 4: 13; Small cell limited stage: 3; Small cell extensive stage: 2
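The between-model comparison above relies on McNemar's test for paired correct/incorrect judgments. A minimal sketch of the exact form of the test follows; the discordant-pair counts in the example are hypothetical, as the abstract does not report them.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value for paired binary outcomes.

    b = cases model A got right and model B got wrong; c = the reverse.
    Only these discordant pairs carry information under the null (p = 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)  # doubling can exceed 1 when b == c

# Hypothetical counts: 1 case only model B got right, 8 only model A got right.
print(round(mcnemar_exact(1, 8), 4))  # prints 0.0391
```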

  • Research Article
  • 10.1182/blood-2025-6214
Evaluating artificial intelligence (AI) as a clinical decision support tool for AML patients
  • Nov 3, 2025
  • Blood
  • Ankushi Sanghvi + 5 more


  • Research Article
  • Cited by 16
  • 10.1109/ichi61247.2024.00111
Enhancing Large Language Models for Clinical Decision Support by Incorporating Clinical Practice Guidelines.
  • Jun 3, 2024
  • Proceedings. IEEE International Conference on Healthcare Informatics
  • David Oniani + 6 more

Large Language Models (LLMs), enhanced with Clinical Practice Guidelines (CPGs), can significantly improve Clinical Decision Support (CDS). However, approaches for incorporating CPGs into LLMs are not well studied. In this study, we develop three distinct methods for incorporating CPGs into LLMs: Binary Decision Tree (BDT), Program-Aided Graph Construction (PAGC), and Chain-of-Thought-Few-Shot Prompting (CoT-FSP), and focus on CDS for COVID-19 outpatient treatment as the case study. Zero-Shot Prompting (ZSP) is our baseline method. To evaluate the effectiveness of the proposed methods, we create a set of synthetic patient descriptions and conduct both automatic and human evaluation of the responses generated by four LLMs: GPT-4, GPT-3.5 Turbo, LLaMA, and PaLM 2. All four LLMs exhibit improved performance when enhanced with CPGs compared to the baseline ZSP. BDT outperformed both CoT-FSP and PAGC in automatic evaluation. All of the proposed methods demonstrate high performance in human evaluation. LLMs enhanced with CPGs outperform plain LLMs with ZSP in providing accurate recommendations for COVID-19 outpatient treatment, highlighting the potential for broader applications beyond the case study.

  • Research Article
  • Cited by 3
  • 10.1016/j.cgh.2013.04.015
Clinical Decision Support Tools
  • Jun 18, 2013
  • Clinical Gastroenterology and Hepatology
  • Lawrence R Kosinski


  • Research Article
  • Cited by 51
  • 10.3390/healthcare13060603
A Review of Large Language Models in Medical Education, Clinical Decision Support, and Healthcare Administration.
  • Mar 10, 2025
  • Healthcare (Basel, Switzerland)
  • Josip Vrdoljak + 4 more

Background/Objectives: Large language models (LLMs) have shown significant potential to transform various aspects of healthcare. This review aims to explore the current applications, challenges, and future prospects of LLMs in medical education, clinical decision support, and healthcare administration. Methods: A comprehensive literature review was conducted, examining the applications of LLMs across the three key domains. The analysis included their performance, challenges, and advancements, with a focus on techniques like retrieval-augmented generation (RAG). Results: In medical education, LLMs show promise as virtual patients, personalized tutors, and tools for generating study materials. Some models have outperformed junior trainees in specific medical knowledge assessments. Concerning clinical decision support, LLMs exhibit potential in diagnostic assistance, treatment recommendations, and medical knowledge retrieval, though performance varies across specialties and tasks. In healthcare administration, LLMs effectively automate tasks like clinical note summarization, data extraction, and report generation, potentially reducing administrative burdens on healthcare professionals. Despite their promise, challenges persist, including hallucination mitigation, addressing biases, and ensuring patient privacy and data security. Conclusions: LLMs have transformative potential in medicine but require careful integration into healthcare settings. Ethical considerations, regulatory challenges, and interdisciplinary collaboration between AI developers and healthcare professionals are essential. Future advancements in LLM performance and reliability through techniques such as RAG, fine-tuning, and reinforcement learning will be critical to ensuring patient safety and improving healthcare delivery.

  • Research Article
  • Cited by 6
  • 10.1007/s10278-025-01433-6
Efficacy of Fine-Tuned Large Language Model in CT Protocol Assignment as Clinical Decision-Supporting System
  • Feb 5, 2025
  • Journal of Imaging Informatics in Medicine
  • Noriko Kanemaru + 8 more

Accurate CT protocol assignment is crucial for optimizing medical imaging procedures. The integration of large language models (LLMs) may be helpful, but their efficacy as a clinical decision support system for protocoling tasks remains unknown. This study aimed to develop and evaluate a fine-tuned LLM specifically designed for CT protocoling and to assess its performance, both standalone and in concurrent use, in terms of effectiveness and efficiency within radiological workflows. This retrospective study included radiology tests for contrast-enhanced chest and abdominal CT examinations (2829/498/941 for training/validation/testing). Inputs comprise the clinical indication section, age, and anatomic coverage. The LLM was fine-tuned for 15 epochs, selecting the best model by macro sensitivity on validation. Performance was then evaluated on 800 randomly selected cases from the test dataset. Two radiology residents and two radiologists assigned CT protocols with and without referencing the output of the LLM to evaluate its efficacy as a clinical decision support system. The LLM exhibited high accuracy metrics, with top-1 and top-2 accuracies of 0.923 and 0.963, respectively, and a macro sensitivity of 0.907. It processed each case in an average of 0.39 s. The LLM, as a clinical decision support tool, improved accuracy both for residents (0.913 vs. 0.936) and radiologists (0.920 vs. 0.926 without and with LLM, respectively), with the improvement for residents being statistically significant (p = 0.02). Additionally, it reduced reading times by 14% for residents and 12% for radiologists. These results indicate the potential of LLMs to improve CT protocoling efficiency and diagnostic accuracy in radiological practice. Supplementary Information: The online version contains supplementary material available at 10.1007/s10278-025-01433-6.
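The metrics reported above (top-k accuracy and macro sensitivity) can be computed as sketched below. The function names and toy data are illustrative assumptions, not drawn from the study.

```python
from collections import defaultdict

def top_k_accuracy(ranked_preds, labels, k):
    """Fraction of cases whose true label appears in the model's top-k ranked list."""
    hits = sum(1 for preds, y in zip(ranked_preds, labels) if y in preds[:k])
    return hits / len(labels)

def macro_sensitivity(preds, labels):
    """Unweighted mean of per-class recall (sensitivity), so rare protocols
    count as much as common ones."""
    tp, total = defaultdict(int), defaultdict(int)
    for p, y in zip(preds, labels):
        total[y] += 1
        tp[y] += (p == y)
    return sum(tp[c] / total[c] for c in total) / len(total)

# Toy example with invented protocol names:
ranked = [["chest+contrast", "abdomen"], ["abdomen", "chest+contrast"]]
truth = ["abdomen", "chest+contrast"]
print(top_k_accuracy(ranked, truth, 1))  # prints 0.0
print(top_k_accuracy(ranked, truth, 2))  # prints 1.0
```

Macro averaging matters here because protocol classes are typically imbalanced; a model that only predicts the majority protocol scores poorly on macro sensitivity.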

  • Research Article
  • 10.1016/j.identj.2025.109344
Evaluating Retrieval-Augmented Generation-Large Language Models for Infective Endocarditis Prophylaxis: Clinical Accuracy and Efficiency.
  • Feb 1, 2026
  • International dental journal
  • Paak Rewthamrongsris + 5 more


  • Research Article
  • Cited by 4
  • 10.1007/s00405-025-09504-8
Clinical decision support using large language models in otolaryngology: a systematic review.
  • Jun 6, 2025
  • European archives of oto-rhino-laryngology : official journal of the European Federation of Oto-Rhino-Laryngological Societies (EUFOS) : affiliated with the German Society for Oto-Rhino-Laryngology - Head and Neck Surgery
  • Rania Filali Ansary + 1 more

This systematic review evaluated the diagnostic accuracy of large language models (LLMs) in otolaryngology-head and neck surgery clinical decision-making. PubMed/MEDLINE, Cochrane Library, and Embase databases were searched for studies investigating clinical decision support accuracy of LLMs in otolaryngology. Three investigators searched the literature for peer-reviewed studies investigating the application of LLMs as clinical decision support for real clinical cases according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. The following outcomes were considered: diagnostic accuracy, additional examination and treatment recommendations. Study quality was assessed using the modified Methodological Index for Non-Randomized Studies (MINORS). Of the 285 eligible publications, 17 met the inclusion criteria, accounting for 734 patients across various otolaryngology subspecialties. ChatGPT-4 was the most evaluated LLM (n = 14/17), followed by Claude-3/3.5 (n = 2/17), and Gemini (n = 2/17). Primary diagnostic accuracy ranged from 45.7 to 80.2% across different LLMs, with Claude often outperforming ChatGPT. LLMs demonstrated lower accuracy in recommending appropriate additional examinations (10-29%) and treatments (16.7-60%), with substantial subspecialty variability. Treatment recommendation accuracy was highest in head and neck oncology (55-60%) and lowest in rhinology (16.7%). There was substantial heterogeneity across studies for the inclusion criteria, information entered in the application programming interface, and the methods of accuracy assessment. LLMs demonstrate promising moderate diagnostic accuracy in otolaryngology clinical decision support, with higher performance in providing diagnoses than in suggesting appropriate additional examinations and treatments. Emerging findings support that Claude often outperforms ChatGPT. Methodological standardization is needed for future research. NA.

  • Research Article
  • Cited by 4
  • 10.1093/ehjdh/ztaf028
Applications of large language models in cardiovascular disease: a systematic review.
  • Apr 1, 2025
  • European heart journal. Digital health
  • José Ferreira Santos + 3 more

Cardiovascular disease (CVD) remains the leading cause of morbidity and mortality worldwide. Large language models (LLMs) offer potential solutions for enhancing patient education and supporting clinical decision-making. This study aimed to evaluate LLMs' applications in CVD and explore their current implementation, from prevention to treatment. Following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines, this systematic review assessed LLM applications in CVD. A comprehensive PubMed search identified relevant studies. The review prioritized pragmatic and practical applications of LLMs. Key applications, benefits, and limitations of LLMs in CVD prevention were summarized. Thirty-five observational studies met the eligibility criteria. Of these, 54% addressed primary prevention and risk factor management, while 46% focused on established CVD. Commercial LLMs were evaluated in all but one study, with 91% (32 studies) assessing ChatGPT. The LLM applications were categorized as follows: 72% addressed patient education, 17% clinical decision support, and 11% both. In 68% of studies, the primary objective was to evaluate LLMs' performance in answering frequently asked patient questions, with results indicating accurate, comprehensive, and generally safe responses. However, occasional misinformation and hallucinated references were noted. Additional applications included patient guidance on CVD, first aid, and lifestyle recommendations. Large language models were assessed for medical questions, diagnostic support, and treatment recommendations in clinical decision support. Large language models hold significant potential in CVD prevention and treatment. Evidence supports their potential as an alternative source of information for addressing patients' questions about common CVD. However, further validation is needed for their application in individualized care, from diagnosis to treatment.

  • Research Article
  • Cited by 3
  • 10.1007/s00345-024-05423-1
The interaction of structured data using openEHR and large Language models for clinical decision support in prostate cancer.
  • Jan 13, 2025
  • World journal of urology
  • Philippe Kaiser + 8 more

Multidisciplinary teams (MDTs) are essential for cancer care but are resource-intensive. Decision-making processes within MDTs, while critical, contribute to increased healthcare costs due to the need for specialist time and coordination. The recent emergence of large language models (LLMs) offers the potential to improve the efficiency and accuracy of clinical decision-making processes, potentially reducing costs associated with traditional MDT models. We conducted a retrospective study of 171 consecutively treated patients with newly diagnosed prostate cancer. Relevant structured clinical data and the European Association of Urology (EAU) pocket guidelines were provided to two LLMs (chatGPT-4, Claude-3-Opus). LLM treatment recommendations were compared to actual treatment recommendations of the MDT meeting (MDM). Both LLMs demonstrated an overall adherence of 93% with the MDT treatment recommendations. Discrepancies between LLM and MDT recommendations were observed in 15 cases (9%), primarily due to lack of clinical information that could be provided to the LLMs. In 5 cases (3%), the LLM recommendations were not in line with EAU guidelines despite having access to all relevant information. Our findings provide evidence that LLMs can provide accurate treatment recommendations for newly diagnosed prostate cancer patients. LLMs have the potential to streamline MDT workflows, enabling specialists to focus on complex cases and patient-centered discussions. In this study, we explored the potential of artificial intelligence models called large language models (LLMs) to assist in treatment decision-making for prostate cancer patients. We found that LLMs, when provided with patient information and clinical guidelines, can recommend treatments that closely match those made by a team of cancer specialists, suggesting that LLMs could help streamline the decision-making process and potentially reduce healthcare costs.

  • Research Article
  • Cited by 5
  • 10.1007/s00701-024-06372-9
Large language models in neurosurgery: a systematic review and meta-analysis.
  • Nov 23, 2024
  • Acta neurochirurgica
  • Advait Patil + 5 more

Large Language Models (LLMs) have garnered increasing attention in neurosurgery and possess significant potential to improve the field. However, the breadth and performance of LLMs across diverse neurosurgical tasks have not been systematically examined, and LLMs come with their own challenges and unique terminology. We seek to identify key models, establish reporting guidelines for replicability, and highlight progress in key application areas of LLM use in the neurosurgical literature. We searched PubMed and Google Scholar using terms related to LLMs and neurosurgery ("large language model" OR "LLM" OR "ChatGPT" OR "GPT-3" OR "GPT3" OR "GPT-3.5" OR "GPT3.5" OR "GPT-4" OR "GPT4" OR "LLAMA" OR "MISTRAL" OR "BARD") AND "neurosurgery". The final set of articles was reviewed for publication year, application area, specific LLM(s) used, control/comparison groups used to evaluate LLM performance, whether the article reported specific LLM prompts, prompting strategy types used, whether the LLM query could be reproduced in its entirety (including both the prompt used and any adjoining data), measures of hallucination, and reported performance measures. Fifty-one articles met inclusion criteria, and were categorized into six application areas, with the most common being Generation of Text for Direct Clinical Use (n = 14, 27.5%), Answering Standardized Exam Questions (n = 12, 23.5%), and Clinical Judgement and Decision-Making Support (n = 11, 21.6%). The most frequently used LLMs were GPT-3.5 (n = 30, 58.8%), GPT-4 (n = 20, 39.2%), Bard (n = 9, 17.6%), and Bing (n = 6, 11.8%). Most studies (n = 43, 84.3%) used LLMs directly out-of-the-box, while 8 studies (15.7%) conducted advanced pre-training or fine-tuning. Large language models show advanced capabilities in complex tasks and hold potential to transform neurosurgery. However, research typically addresses basic applications and overlooks enhancing LLM performance, facing reproducibility issues. 
Standardizing detailed reporting, considering LLM stochasticity, and using advanced methods beyond basic validation are essential for progress.

  • Research Article
  • 10.1200/jco.2024.42.16_suppl.e13609
Large language models for precision oncology: Clinical decision support through expert-guided learning.
  • Jun 1, 2024
  • Journal of Clinical Oncology
  • Jacqueline Lammert + 14 more

e13609 Background: Precision oncology revolutionized cancer treatment by identifying molecular biomarkers to guide personalized care. The ever-growing body of medical literature presents a challenge for oncologists researching targeted therapies. While recent studies investigated large language models (LLMs) to streamline this process, LLM reliance on general rather than medical knowledge limits clinical relevance and trustworthiness. To address these limitations, we developed a retrieval augmented generation (RAG) system that integrates PubMed clinical studies, trial databases and oncological guidelines with LLMs to support targeted treatment recommendations. The Molecular Tumor Board (MTB) at the Center of Personalized Medicine (ZPMTUM) guided and evaluated treatment options proposed by the LLM to assess their applicability for clinical decision support. Methods: We used 10 publicly accessible fictional patient cases with 7 tumor types and 59 distinct molecular alterations. Our LLM system MEREDITH (Medical Evidence Retrieval and Data Integration for Tailored Healthcare) consists of Google's Gemini Pro, enhanced with RAG and Chain-of-Thought (CoT) prompting. To establish a benchmark, clinical experts at ZPMTUM manually annotated the cases. Informed by MTB expert feedback, we iteratively improved our LLM system from a draft system relying on PubMed-indexed data to an enhanced system, which replicated expert annotation processes by incorporating oncology guidelines, drug availability and trial databases (ClinicalTrials.gov, QuickQueck.de). ZPMTUM assessed credibility and clinical relevance of manually annotated and LLM-generated recommendations. Patient-level data on (likely) pathogenic molecular alterations and recommended treatment options were summarized using median and interquartile range (IQR). Semantic similarity between LLM and clinician responses was assessed using cosine similarity of text vector embeddings; paired t-test evaluated significance. 
Results: The median of (likely) pathogenic molecular alterations per patient was 2.5 (IQR: 2-3). ZPMTUM identified a median of 2 treatment options per patient (IQR: 1-3), while the enhanced LLM identified a median of 4 (IQR: 3-5). MEREDITH proposed multiple relevant treatment suggestions, including therapies based on preclinical studies, and molecular interactions, for further assessment by the MTB. ZPMTUM prioritized the most suitable clinical option. The mean semantic textual similarity of LLM responses increased significantly from 0.69 in the draft system to 0.76 in the enhanced system (p <0.001). Thus, feedback from ZPMTUM enhanced the model's ability to align its responses with clinician thought processes. Conclusions: Leveraging expert thought processes to instruct LLMs holds promise as a novel decision support tool for precision oncology.
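The semantic-similarity measure used above is cosine similarity over text vector embeddings. A minimal sketch follows; in practice the vectors would come from a text-embedding model, which we stub here with plain number lists.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, in [-1, 1].

    1.0 means identical direction (maximally similar embeddings),
    0.0 means orthogonal (unrelated texts).
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Parallel vectors are maximally similar; orthogonal ones score zero.
print(round(cosine_similarity([1, 2], [2, 4]), 4))  # prints 1.0
print(round(cosine_similarity([1, 0], [0, 1]), 4))  # prints 0.0
```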

  • Research Article
  • Cited by 6
  • 10.1200/jco.2024.42.16_suppl.e13637
Investigating large language model (LLM) performance using in-context learning (ICL) for interpretation of ESMO and NCCN guidelines for lung cancer.
  • Jun 1, 2024
  • Journal of Clinical Oncology
  • Sanna Iivanainen + 4 more

e13637 Background: The recent development of advanced LLMs has been suggested to improve patient care across several areas such as clinical-decision support or helping to answer patients’ questions. Hallucinations have been identified as a blocker for the use of LLMs in routine clinical practice. ICL and Retrieval Augmented Generation (RAG) could improve the LLM performance and reduce hallucinations, consecutively making the use of LLMs possible in clinical practice. Methods: A method using ICL and RAG was developed on top of health AI platform (Gosta MedKit) to interpret the most recent ESMO (Dec 2022 for NSCLC, Mar 2021 for SCLC) and NCCN (Nov 2023 for NSCLC and SCLC) clinical guidelines for lung cancer. Guidelines (including tables and diagrams) were curated into a text format, and split and stored into a vector database. OpenAI’s GPT4 Turbo model version gpt-4-1106-preview (GPT4-T), having the knowledge cutoff in April 2023, was used in all implementations. 11 questions about SCLC and 13 questions about NSCLC treatment recommendations and definitions were developed to evaluate the performance of different settings: GPT4-T (existing knowledge of the model), ICL with maximum context (ICL-MC) length (128k tokens) and ICL with RAG (ICL-RAG) heuristically including only the most relevant parts from vector database. Question prompts were generated for different settings and guidelines (ESMO, NCCN and both combined) and two oncologists evaluated 216 different responses and their alignment with ESMO and NCCN guidelines. Results: For responses using ESMO guidelines having oncologists’ consensus, ICL-MC and ICL-RAG respectively provided accurate responses for 83.3% and 79.2% of questions vs. 62.5% for GPT4-T. For responses using NCCN guidelines having oncologist consensus, ICL-RAG provided accurate responses for 83.3%, GPT4-T for 75.0% and ICL-MC for 33.3% of questions. 
When more flexibility was allowed in results interpretation (alignment either with ESMO or NCCN), GPT4-T provided accurate response for 87.5% vs. 70.8% with ICL-RAG and 58.3% with ICL-MC. No consensus around hallucinations was reported for ICL approaches, whereas GPT4-T hallucinated the response for 4.2% of questions with ESMO guidelines. Conclusions: ICL seems to improve the LLM performance for stricter tasks such as providing responses according to specific guidelines and reducing hallucinations. ICL outperformed GPT4-T in case of ESMO guidelines. This highlights the importance of taking local and latest guidelines into account when LLMs are used across different health systems and regulatory environments. In line with earlier studies, longer context for ICL makes models forget crucial information, which can be mitigated with the use of RAG to improve ICL performance and reduce costs when using the models.
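The ICL-RAG setting above retrieves only the most relevant guideline chunks from a vector database before prompting. A toy sketch of that retrieve-then-prompt flow follows; token-overlap scoring stands in for real embedding similarity, and all chunk text is invented.

```python
def score(chunk: str, query: str) -> int:
    """Crude relevance score: number of lowercase tokens shared with the query.
    A real RAG system would use embedding similarity against a vector database."""
    return len(set(chunk.lower().split()) & set(query.lower().split()))

def retrieve(chunks: list, query: str, k: int = 2) -> list:
    """Return the k chunks most relevant to the query."""
    return sorted(chunks, key=lambda c: score(c, query), reverse=True)[:k]

def build_prompt(chunks: list, question: str, k: int = 2) -> str:
    """Prepend only the retrieved excerpts, keeping the context short so the
    model is not distracted by irrelevant guideline text."""
    context = "\n".join(retrieve(chunks, question, k))
    return (f"Use only the guideline excerpts below.\n\n{context}\n\n"
            f"Question: {question}")

# Invented guideline chunks:
chunks = [
    "nsclc stage iv first line therapy recommendations",
    "sclc limited stage treatment recommendations",
    "supportive care and toxicity management",
]
print(retrieve(chunks, "first line therapy for nsclc stage iv", k=1)[0])
# prints "nsclc stage iv first line therapy recommendations"
```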

  • Research Article
  • 10.1016/j.jclinane.2026.112164
Will AI keep you out of trouble? An expert panel review of LLMs for hazardous regional anesthesia consults.
  • Apr 1, 2026
  • Journal of clinical anesthesia
  • David Corpman + 7 more


  • Research Article
  • Cited by 11
  • 10.1038/s41746-025-01565-7
A scoping review on generative AI and large language models in mitigating medication related harm
  • Mar 28, 2025
  • npj Digital Medicine
  • Jasmine Chiat Ling Ong + 10 more

Medication-related harm has a significant impact on global healthcare costs and patient outcomes. Generative artificial intelligence (GenAI) and large language models (LLM) have emerged as a promising tool in mitigating risks of medication-related harm. This review evaluates the scope and effectiveness of GenAI and LLM in reducing medication-related harm. We screened 4 databases for literature published from 1st January 2012 to 15th October 2024. A total of 3988 articles were identified, and 30 met the criteria for inclusion into the final review. Generative AI and LLMs were applied in three key applications: drug-drug interaction identification and prediction, clinical decision support, and pharmacovigilance. While the performance and utility of these models varied, they generally showed promise in early identification, classification of adverse drug events, and supporting decision-making for medication management. However, no studies tested these models prospectively, suggesting a need for further investigation into integration and real-world application.
