MEGA-RAG: a retrieval-augmented generation framework with multi-evidence guided answer refinement for mitigating hallucinations of LLMs in public health

Abstract

Introduction: The increasing adoption of large language models (LLMs) in public health has raised significant concerns about hallucinations: factually inaccurate or misleading outputs that can compromise clinical communication and policy decisions.

Methods: We propose a retrieval-augmented generation framework with multi-evidence guided answer refinement (MEGA-RAG), specifically designed to mitigate hallucinations in public health applications. The framework integrates multi-source evidence retrieval (dense retrieval via FAISS, keyword-based retrieval via BM25, and biomedical knowledge graphs), employs a cross-encoder reranker to ensure semantic relevance, and incorporates a discrepancy-aware refinement module to further enhance factual accuracy.

Results: Experimental evaluation demonstrates that MEGA-RAG outperforms four baseline models [PubMedBERT, PubMedGPT, a standalone LLM, and an LLM with standard retrieval-augmented generation (RAG)], reducing hallucination rates by over 40%. It also achieves the highest accuracy (0.7913), precision (0.7541), recall (0.8304), and F1 score (0.7904).

Discussion: These findings confirm that MEGA-RAG is highly effective in generating factually reliable and medically accurate responses, thereby enhancing the credibility of AI-generated health information for applications in health education, clinical communication, and evidence-based policy development.
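The multi-retriever step described in the Methods can be sketched in miniature. The sketch below combines a BM25 keyword ranking with a dense-retriever ranking via reciprocal rank fusion; the abstract does not specify the fusion method, and the toy corpus, BM25 constants, and the stand-in dense ranking are all illustrative assumptions, not the paper's implementation.

```python
import math
from collections import Counter

# Toy corpus standing in for a public-health evidence store.
DOCS = [
    "measles vaccination prevents outbreaks in children",
    "hand hygiene reduces transmission of influenza",
    "vaccination coverage is key to herd immunity",
]

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Keyword-based relevance via classic BM25 (the sparse retriever)."""
    tokenized = [d.split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        s = 0.0
        for term in query.split():
            df = sum(1 for t in tokenized if term in t)
            if df == 0:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            f = tf[term]
            s += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(toks) / avgdl))
        scores.append(s)
    return scores

def reciprocal_rank_fusion(rankings, k=60):
    """Merge rankings from several retrievers into one candidate list."""
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            fused[doc_id] += 1.0 / (k + rank + 1)
    return [doc_id for doc_id, _ in fused.most_common()]

query = "vaccination and herd immunity"
sparse = bm25_scores(query, DOCS)
bm25_rank = sorted(range(len(DOCS)), key=lambda i: -sparse[i])
# A dense retriever (e.g. FAISS over embeddings) would contribute its own
# ranking; a fixed toy ranking stands in for it here.
dense_rank = [2, 0, 1]
candidates = reciprocal_rank_fusion([bm25_rank, dense_rank])
# In MEGA-RAG, candidates would then pass through a cross-encoder reranker
# and the discrepancy-aware refinement module before answer generation.
```

Fusing ranks rather than raw scores sidesteps the problem that BM25 scores and embedding similarities live on incomparable scales.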

References (showing 10 of 18 papers)

Explainable AI for Medical Data: Current Methods, Limitations, and Future Directions
  • Feb 11, 2025 · ACM Computing Surveys · Md Imran Hossain + 5 more
  • DOI: 10.1145/3637487 · Cited by 16

Biomedical knowledge graph-optimized prompt generation for large language models
  • Sep 2, 2024 · Bioinformatics · Karthik Soman + 13 more
  • DOI: 10.1093/bioinformatics/btae560 · Cited by 23

Truth-O-Meter: Collaborating with LLM in fighting its hallucinations
  • Jan 1, 2025 · Boris Galitsky
  • DOI: 10.1016/b978-0-443-29246-0.00004-3 · Cited by 2

Mindfulness for Global Public Health: Critical Analysis and Agenda
  • Mar 3, 2023 · Mindfulness · Doug Oman
  • DOI: 10.1007/s12671-023-02089-5 · Cited by 49

Med-HALT: Medical Domain Hallucination Test for Large Language Models
  • Jan 1, 2023 · Ankit Pal + 2 more
  • DOI: 10.18653/v1/2023.conll-1.21 · Cited by 32

Medical Hallucination in Foundation Models and Their Impact on Healthcare
  • Mar 3, 2025 · Yubin Kim + 24 more
  • DOI: 10.1101/2025.02.28.25323115 · Cited by 6

KRAGEN: a knowledge graph-enhanced RAG framework for biomedical problem solving using large language models
  • Jun 3, 2024 · Bioinformatics · Nicholas Matsumoto + 6 more
  • DOI: 10.1093/bioinformatics/btae353 · Cited by 25

ChatGPT and the rise of large language models: the new AI-driven infodemic threat in public health
  • Apr 25, 2023 · Frontiers in Public Health · Luigi De Angelis + 6 more
  • DOI: 10.3389/fpubh.2023.1166120 · Cited by 410

MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot
  • Apr 22, 2025 · Xuejiao Zhao + 3 more
  • DOI: 10.1145/3696410.3714782 · Cited by 2

PH-LLM: Public Health Large Language Models for Infoveillance
  • Feb 10, 2025 · medRxiv · Xinyu Zhou + 12 more
  • DOI: 10.1101/2025.02.08.25321587 · Cited by 1

Similar Papers
  • Research Article
FuseLinker: Leveraging LLM’s pre-trained text embeddings and domain knowledge to enhance GNN-based link prediction on biomedical knowledge graphs
  • Sep 24, 2024 · Journal of Biomedical Informatics · Yongkang Xiao + 5 more
  • DOI: 10.1016/j.jbi.2024.104730 · Cited by 1

  • Research Article
From Answers to Insights: Unveiling the Strengths and Limitations of ChatGPT and Biomedical Knowledge Graphs
  • Aug 1, 2023 · Research Square · Yu Hou + 5 more
  • DOI: 10.21203/rs.3.rs-3185632/v1 · Cited by 10

Purpose: Large Language Models (LLMs) have shown exceptional performance in various natural language processing tasks, benefiting from their language generation capabilities and ability to acquire knowledge from unstructured text. However, in the biomedical domain, LLMs face limitations that lead to inaccurate and inconsistent answers. Knowledge Graphs (KGs) have emerged as valuable resources for organizing structured information. Biomedical Knowledge Graphs (BKGs) have gained significant attention for managing diverse and large-scale biomedical knowledge. The objective of this study is to assess and compare the capabilities of ChatGPT and existing BKGs in question-answering, biomedical knowledge discovery, and reasoning tasks within the biomedical domain.

Methods: We conducted a series of experiments to assess the performance of ChatGPT and the BKGs in various aspects of querying existing biomedical knowledge, knowledge discovery, and knowledge reasoning. First, we tasked ChatGPT with answering questions sourced from the “Alternative Medicine” sub-category of Yahoo! Answers and recorded the responses. Additionally, we queried the BKG to retrieve the relevant knowledge records corresponding to the questions and assessed them manually. In another experiment, we formulated a prediction scenario to assess ChatGPT’s ability to suggest potential drug/dietary supplement repurposing candidates, while simultaneously using the BKG to perform link prediction for the same task; the outcomes were compared and analyzed. Furthermore, we evaluated the capabilities of ChatGPT and the BKG in establishing associations between pairs of proposed entities, to assess their reasoning abilities and the extent to which they can infer connections within the knowledge domain.

Results: The results indicate that ChatGPT with GPT-4.0 outperforms both GPT-3.5 and BKGs in providing existing information. However, BKGs demonstrate higher reliability in terms of information accuracy. ChatGPT exhibits limitations in performing novel discoveries and reasoning, particularly in establishing structured links between entities compared to BKGs.

Conclusions: To address the limitations observed, future research should focus on integrating LLMs and BKGs to leverage the strengths of both approaches. Such integration would optimize task performance and mitigate potential risks, leading to advancements in knowledge within the biomedical field and contributing to the overall well-being of individuals.

  • Research Article
Use of Large Language Models to Classify Epidemiological Characteristics in Synthetic and Real-World Social Media Posts About Conjunctivitis Outbreaks: Infodemiology Study
  • Aug 9, 2024 · Journal of Medical Internet Research · Michael S Deiner + 10 more
  • DOI: 10.2196/65226

The use of web-based search and social media can help identify epidemics, potentially earlier than clinical methods or even potentially identifying unreported outbreaks. Monitoring for eye-related epidemics, such as conjunctivitis outbreaks, can facilitate early public health intervention to reduce transmission and ocular comorbidities. However, monitoring social media content for conjunctivitis outbreaks is costly and laborious. Large language models (LLMs) could overcome these barriers by assessing the likelihood that real-world outbreaks are being described. However, public health actions for likely outbreaks could benefit more by knowing additional epidemiological characteristics, such as outbreak type, size, and severity.

We aimed to assess whether and how well LLMs can classify epidemiological features from social media posts beyond conjunctivitis outbreak probability, including outbreak type, size, severity, etiology, and community setting. We used a validation framework comparing LLM classifications to those of other LLMs and human experts.

We wrote code to generate synthetic conjunctivitis outbreak social media posts, embedded with specific preclassified epidemiological features to simulate various infectious eye disease outbreak and control scenarios. We used these posts to develop effective LLM prompts and test the capabilities of multiple LLMs. For top-performing LLMs, we gauged their practical utility in real-world epidemiological surveillance by comparing their assessments of Twitter/X, forum, and YouTube conjunctivitis posts. Finally, human raters also classified the posts, and we compared their classifications to those of a leading LLM for validation. Comparisons entailed correlation or sensitivity and specificity statistics.

We assessed 7 LLMs for effectively classifying epidemiological data from 1152 synthetic posts, 370 Twitter/X posts, 290 forum posts, and 956 YouTube posts. Despite some discrepancies, the LLMs demonstrated a reliable capacity for nuanced epidemiological analysis across various data sources and compared to humans or between LLMs. Notably, GPT-4 and Mixtral 8x22b exhibited high performance, predicting conjunctivitis outbreak characteristics such as probability (GPT-4: correlation=0.73), size (Mixtral 8x22b: correlation=0.82), and type (infectious, allergic, or environmentally caused); however, there were notable exceptions. Assessing synthetic and real-world posts for etiological factors, infectious eye disease specialist validations revealed that GPT-4 had high specificity (0.83-1.00) but variable sensitivity (0.32-0.71). Interrater reliability analyses showed that LLM-expert agreement exceeded expert-expert agreement for severity assessment (intraclass correlation coefficient=0.69 vs 0.38), while agreement varied by condition type (κ=0.37-0.94).

This investigation into the potential of LLMs for public health infoveillance suggests effectiveness in classifying key epidemiological characteristics from social media content about conjunctivitis outbreaks. Future studies should further explore LLMs' potential to support public health monitoring through the automated assessment and classification of potential infectious eye disease or other outbreaks. Their optimal role may be to act as a first line of documentation, alerting public health organizations for the follow-up of LLM-detected and -classified small, early outbreaks, with a focus on the most severe ones.
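The interrater agreement statistic reported above, Cohen's κ, corrects observed agreement for the agreement two raters would reach by chance given their label frequencies. A minimal sketch with made-up labels (not the study's data):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically.
    po = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labeled at random per their marginals.
    ca, cb = Counter(rater_a), Counter(rater_b)
    pe = sum(ca[label] * cb[label] for label in set(ca) | set(cb)) / (n * n)
    return (po - pe) / (1 - pe)

# Identical labelings give kappa = 1; agreement no better than chance gives 0.
perfect = cohens_kappa([0, 1, 0, 1], [0, 1, 0, 1])
chance = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 1])
```

A κ of 0.37-0.94, as reported above, therefore spans "fair" to "almost perfect" agreement depending on the condition type.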

  • Preprint Article
Use of Large Language Models to Classify Epidemiological Characteristics in Synthetic and Real-World Social Media Posts About Conjunctivitis Outbreaks: Infodemiology Study (Preprint)
  • Aug 9, 2024 · Michael S Deiner + 10 more
  • DOI: 10.2196/preprints.65226


  • Research Article
From Answers to Insights: Unveiling the Strengths and Limitations of ChatGPT and Biomedical Knowledge Graphs
  • Jun 12, 2023 · medRxiv · Yu Hou + 5 more
  • DOI: 10.1101/2023.06.09.23291208 · Cited by 4

Large Language Models (LLMs) have demonstrated exceptional performance in various natural language processing tasks, utilizing their language generation capabilities and knowledge acquisition potential from unstructured text. However, when applied to the biomedical domain, LLMs encounter limitations, resulting in erroneous and inconsistent answers. Knowledge Graphs (KGs) have emerged as valuable resources for structured information representation and organization. Specifically, Biomedical Knowledge Graphs (BKGs) have attracted significant interest in managing large-scale and heterogeneous biomedical knowledge. This study evaluates the capabilities of ChatGPT and existing BKGs in question answering, knowledge discovery, and reasoning. Results indicate that while ChatGPT with GPT-4.0 surpasses both GPT-3.5 and BKGs in providing existing information, BKGs demonstrate superior information reliability. Additionally, ChatGPT exhibits limitations in performing novel discoveries and reasoning, particularly in establishing structured links between entities compared to BKGs. To overcome these limitations, future research should focus on integrating LLMs and BKGs to leverage their respective strengths. Such an integrated approach would optimize task performance and mitigate potential risks, thereby advancing knowledge in the biomedical field and contributing to overall well-being.

  • Research Article
Large language models in public health: opportunity or threat? The case of button battery injuries
  • Jul 21, 2025 · Injury Prevention · Giulia Lorenzoni + 1 more
  • DOI: 10.1136/ip-2025-045655

Button battery (BB) injuries in children represent a severe and growing public health burden. The literature on the topic is extensive; however, there is a notable lack of structured public health initiatives addressing the problem. The present study aimed to test the feasibility of using large language models (LLMs) to draft recommendations for preventing and managing BB ingestion in children. A set of questions was generated and submitted to ChatGPT-4o and ChatGPT-o1-preview. Questions were based on statements and websites of scientific societies and not-for-profit organisations and were developed to produce comprehensive recommendations covering BB risks, primary and secondary prevention, clinical management and follow-up, and general public health initiatives. Two independent reviewers rated the accuracy and readability of the answers provided by the LLMs. Accuracy was rated using a four-level scale, while readability was assessed using two established readability tools, the Flesch Reading Ease (FRE) and the Flesch-Kincaid Grade Level (FKGL). None of the answers provided by the LLMs were rated as completely or partially incorrect. ChatGPT-o1-preview outperformed ChatGPT-4o in accuracy, with eight answers rated as accurate and complete. Both models showed similar readability levels, with FKGL and FRE scores indicating college-level comprehension. The LLMs demonstrated strong performance in this study, with no responses rated as incorrect or partially incorrect, suggesting the potential feasibility of LLMs in public health for preventing paediatric injuries.
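The two readability measures used above have standard closed-form definitions over words-per-sentence and syllables-per-word. The sketch below implements those formulas with a crude vowel-group syllable counter; established readability tools use more careful heuristics, so exact scores will differ, but the ordering of easy vs. hard text is preserved.

```python
import re

def count_syllables(word):
    # Crude heuristic: one syllable per group of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_scores(text):
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    wps = len(words) / len(sentences)                           # words/sentence
    spw = sum(count_syllables(w) for w in words) / len(words)   # syllables/word
    fre = 206.835 - 1.015 * wps - 84.6 * spw
    fkgl = 0.39 * wps + 11.8 * spw - 15.59
    return fre, fkgl

# Simple prose scores high on FRE (easier) and low on FKGL (lower grade).
fre_simple, fkgl_simple = flesch_scores("The cat sat. The dog ran.")
fre_dense, fkgl_dense = flesch_scores(
    "Comprehensive anticoagulation recommendations necessitate "
    "multidisciplinary evaluation."
)
```

Note the two scales run in opposite directions: higher FRE means easier reading, while higher FKGL means a higher school grade is needed.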

  • Research Article
Utilizing large language models for detecting hospital-acquired conditions: an empirical study on pulmonary embolism
  • Mar 19, 2025 · Journal of the American Medical Informatics Association · Cheligeer Cheligeer + 10 more
  • DOI: 10.1093/jamia/ocaf048

Objectives: Adverse event detection from Electronic Medical Records (EMRs) is challenging due to the low incidence of the event, variability in clinical documentation, and the complexity of data formats. Pulmonary embolism as an adverse event (PEAE) is particularly difficult to identify using existing approaches. This study aims to develop and evaluate a Large Language Model (LLM)-based framework for detecting PEAE from unstructured narrative data in EMRs.

Materials and Methods: We conducted a chart review of adult patients (aged 18-100) admitted to tertiary-care hospitals in Calgary, Alberta, Canada, between 2017 and 2022. We developed an LLM-based detection framework consisting of three modules: evidence extraction (implementing both keyword-based and semantic similarity-based filtering methods), discharge information extraction (focusing on six key clinical sections), and PEAE detection. Four open-source LLMs (Llama3, Mistral-7B, Gemma, and Phi-3) were evaluated using positive predictive value, sensitivity, specificity, and F1-score. Model performance for population-level surveillance was assessed at yearly, quarterly, and monthly granularities.

Results: The chart review included 10 066 patients, with 40 cases of PEAE identified (0.4% prevalence). All four LLMs demonstrated high sensitivity (87.5-100%) and specificity (94.9-98.9%) across different experimental conditions. Gemma achieved the highest F1-score (28.11%) using keyword-based retrieval with discharge summary inclusion, along with 98.4% specificity, 87.5% sensitivity, and 99.95% negative predictive value. Keyword-based filtering reduced the median chunks per patient from 789 to 310, while semantic filtering further reduced this to 9 chunks. Including discharge summaries improved performance metrics across most models. For population-level surveillance, all models showed strong correlation with actual PEAE trends at yearly granularity (r=0.92-0.99), with Llama3 achieving the highest correlation (0.988).

Discussion: The results of our method for PEAE detection using EMR notes demonstrate high sensitivity and specificity across all four tested LLMs, indicating strong performance in distinguishing PEAE from non-PEAE cases. However, the low incidence rate of PEAE contributed to a lower PPV. The keyword-based chunking approach consistently outperformed semantic similarity-based methods, achieving higher F1 scores and PPV, underscoring the importance of domain knowledge in text segmentation. Including discharge summaries further enhanced performance metrics. Our population-based analysis revealed better performance for yearly trends compared to monthly granularity, suggesting the framework's utility for long-term surveillance despite dataset imbalance. Error analysis identified contextual misinterpretation, terminology confusion, and preprocessing limitations as key challenges for future improvement.

Conclusions: Our proposed method demonstrates that LLMs can effectively detect PEAE from narrative EMRs with high sensitivity and specificity. While these models serve as effective screening tools to exclude non-PEAE cases, their lower PPV indicates they cannot be relied upon solely for definitive PEAE identification. Further chart review remains necessary for confirmation. Future work should focus on improving contextual understanding, medical terminology interpretation, and exploring advanced prompting techniques to enhance precision in adverse event detection from EMRs.
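The keyword-based evidence-extraction step described above, cutting each note into chunks and keeping only those matching condition-related terms before any LLM call, can be sketched as follows. The keyword list, chunk size, and sample note are illustrative assumptions, not the study's configuration.

```python
# Hypothetical keyword list; the study's actual terms are not given here.
PE_KEYWORDS = {"pulmonary embolism", "ctpa", "anticoagulation", "thrombus"}

def chunk(text, size=12):
    """Split a note into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def keyword_filter(chunks, keywords):
    """Keep only chunks that mention at least one keyword."""
    return [c for c in chunks if any(k in c.lower() for k in keywords)]

note = (
    "Patient admitted for hip replacement surgery without complications. "
    "Postoperative day three CTPA confirmed segmental pulmonary embolism. "
    "Anticoagulation was started and patient recovered well before discharge."
)
kept = keyword_filter(chunk(note), PE_KEYWORDS)
# Only the matching chunks would be passed to the LLM for PEAE detection,
# shrinking the context the model must read, as in the 789 -> 310 reduction.
```

Filtering before inference is what makes per-patient LLM review tractable: the model only sees the small fraction of chunks that could plausibly contain evidence.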

  • Research Article
Large language models to reduce antimicrobial resistance: ChatGPT, Claude and Gemini comparison
  • Oct 1, 2025 · European Journal of Public Health · M Di Pumpo + 8 more
  • DOI: 10.1093/eurpub/ckaf161.316

Introduction: Antimicrobial resistance (AMR) is a major public health challenge. Artificial Intelligence (AI), particularly Large Language Models (LLMs), offers a promising opportunity to deliver accurate and appropriate health information and education. However, the public health implications of their widespread use remain largely unassessed by scientific experts. This study evaluates the effectiveness of leading LLMs in providing information on infection control and antibiotic use.

Methods: ChatGPT 3.5, ChatGPT 4.0, Claude 2.0, and Gemini 1.0 were prompted in both Italian and English. Their textual output underwent computational text analysis to assess readability, lexical diversity, and sentiment. In addition, 3 experts rated the output via an adapted DISCERN instrument built to assess AMR impact, persuasiveness, and the overall quality and appropriateness of the content.

Results: A total of 864 scores were obtained from ChatGPT 3.5, ChatGPT 4.0 and Claude, each evaluated both in English and in Italian. In contrast, only 270 scores were obtained from Gemini in English, as it self-interrupted, reporting the questioning as inappropriate for a chatbot. A general performance gradient was observed from Gemini to ChatGPT 3.5. ChatGPT 4.0 demonstrated the highest lexical diversity and sentiment scores, while Gemini presented the best readability and overall rating. English-based prompts consistently outperformed Italian-based ones. The impact on AMR received low scores across all LLMs.

Conclusions: The study identified Gemini as the best-performing model in terms of content quality, accessibility, and contextual awareness. While LLMs are promising tools, they are not intended to replace professional medical assessment. Instead, their responsible integration is necessary to ensure safe and effective public health applications. Further studies are warranted to expand the evidence base regarding the assessment of medical content generated by LLMs.

Key messages:
  • Large Language Models (LLMs) are promising in delivering accurate and appropriate health information. However, the public health implications remain largely unassessed by scientific experts.
  • A general performance gradient was observed from Gemini to ChatGPT 3.5 regarding readability, lexical diversity, sentiment scores and overall rating. The rated impact on AMR was low across all LLMs.

  • Research Article
Assessing the Risk of Bias in Randomized Clinical Trials With Large Language Models
  • May 22, 2024 · JAMA Network Open · Honghao Lai + 58 more
  • DOI: 10.1001/jamanetworkopen.2024.12687 · Cited by 33

Large language models (LLMs) may facilitate the labor-intensive process of systematic reviews. However, the exact methods and reliability remain uncertain. To explore the feasibility and reliability of using LLMs to assess risk of bias (ROB) in randomized clinical trials (RCTs). A survey study was conducted between August 10, 2023, and October 30, 2023. Thirty RCTs were selected from published systematic reviews. A structured prompt was developed to guide ChatGPT (LLM 1) and Claude (LLM 2) in assessing the ROB in these RCTs using a modified version of the Cochrane ROB tool developed by the CLARITY group at McMaster University. Each RCT was assessed twice by both models, and the results were documented. The results were compared with an assessment by 3 experts, which was considered a criterion standard. Correct assessment rates, sensitivity, specificity, and F1 scores were calculated to reflect accuracy, both overall and for each domain of the Cochrane ROB tool; consistent assessment rates and Cohen κ were calculated to gauge consistency; and assessment time was calculated to measure efficiency. Performance between the 2 models was compared using risk differences. Both models demonstrated high correct assessment rates. LLM 1 reached a mean correct assessment rate of 84.5% (95% CI, 81.5%-87.3%), and LLM 2 reached a significantly higher rate of 89.5% (95% CI, 87.0%-91.8%). The risk difference between the 2 models was 0.05 (95% CI, 0.01-0.09). In most domains, domain-specific correct rates were around 80% to 90%; however, sensitivity below 0.80 was observed in domains 1 (random sequence generation), 2 (allocation concealment), and 6 (other concerns). Domains 4 (missing outcome data), 5 (selective outcome reporting), and 6 had F1 scores below 0.50. The consistent rates between the 2 assessments were 84.0% for LLM 1 and 87.3% for LLM 2. LLM 1's κ exceeded 0.80 in 7 and LLM 2's in 8 domains. 
The mean (SD) time needed for assessment was 77 (16) seconds for LLM 1 and 53 (12) seconds for LLM 2. In this survey study of applying LLMs for ROB assessment, LLM 1 and LLM 2 demonstrated substantial accuracy and consistency in evaluating RCTs, suggesting their potential as supportive tools in systematic review processes.
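The accuracy metrics recurring throughout these studies (sensitivity, specificity, precision, F1) all reduce to the four confusion-matrix counts. A minimal sketch with toy binary labels, not any study's data:

```python
def binary_metrics(y_true, y_pred):
    """Sensitivity, specificity, precision and F1 from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # true positives
    tn = sum(1 for t, p in pairs if not t and not p)  # true negatives
    fp = sum(1 for t, p in pairs if not t and p)      # false alarms
    fn = sum(1 for t, p in pairs if t and not p)      # missed cases
    sensitivity = tp / (tp + fn)   # recall: share of true cases caught
    specificity = tn / (tn + fp)   # share of non-cases correctly cleared
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, precision, f1

sens, spec, prec, f1 = binary_metrics(
    [1, 1, 1, 0, 0, 0, 0, 0],   # ground truth: 3 positive cases
    [1, 1, 0, 1, 0, 0, 0, 0],   # predictions: one miss, one false alarm
)
```

This also illustrates why, in rare-event settings like the PEAE study above, sensitivity and specificity can both be high while precision (and hence F1) stays low: even a small false-positive rate over many negatives swamps the few true positives.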

  • Research Article
Knowledge graphs in psychiatric research: potential applications and future perspectives
  • Jun 17, 2024 · Acta Psychiatrica Scandinavica · Sebastian Freidel + 1 more
  • DOI: 10.1111/acps.13717 · Cited by 3

Knowledge graphs (KGs) remain an underutilized tool in the field of psychiatric research. In the broader biomedical field KGs are already a significant tool mainly used as knowledge database or for novel relation detection between biomedical entities. This review aims to outline how KGs would further research in the field of psychiatry in the age of Artificial Intelligence (AI) and Large Language Models (LLMs). We conducted a thorough literature review across a spectrum of scientific fields ranging from computer science and knowledge engineering to bioinformatics. The literature reviewed was taken from PubMed, Semantic Scholar and Google Scholar searches including terms such as "Psychiatric Knowledge Graphs", "Biomedical Knowledge Graphs", "Knowledge Graph Machine Learning Applications", "Knowledge Graph Applications for Biomedical Sciences". The resulting publications were then assessed and accumulated in this review regarding their possible relevance to future psychiatric applications. A multitude of papers and applications of KGs in associated research fields that are yet to be utilized in psychiatric research was found and outlined in this review. We create a thorough recommendation for other computational researchers regarding use-cases of these KG applications in psychiatry. This review illustrates use-cases of KG-based research applications in biomedicine and beyond that may aid in elucidating the complex biology of psychiatric illness and open new routes for developing innovative interventions. We conclude that there is a wealth of opportunities for KG utilization in psychiatric research across a variety of application areas including biomarker discovery, patient stratification and personalized medicine approaches.

  • Preprint Article
A Comparison of Large Language Models in Support for Different Stakeholders against the Fentanyl Crisis: Performance Evaluation of Multiple Models (Preprint)
  • Jul 31, 2025 · Siyu Cao + 6 more
  • DOI: 10.2196/preprints.81629

Background: The fentanyl crisis is an urgent public health challenge, in which lack of knowledge contributes to rising overdose mortality. Large language models (LLMs) have shown great potential in medical fields such as telemedicine and health education, while their benefits for different stakeholders in combating the fentanyl crisis warrant further investigation.

Objective: This study aims to systematically evaluate the quality of real-time fentanyl-related guidance provided by six LLMs to users, first responders, clinicians, and policymakers; to clarify the advantages and disadvantages of different LLMs across four major scenarios (identifying fentanyl, implementing emergency rescue, clinical diagnosis and treatment, and public health decision-making); and to provide an evidence base for building a precise, reliable, and multilingual LLM-based fentanyl crisis intervention tool that reduces the risk of overdose deaths caused by knowledge gaps.

Methods: We compared six LLMs, i.e., ChatGPT 3.5, Gemini 1.5 Flash, YouChat Smart, Copilot, Perplexity and Luzia, regarding their ability to answer fentanyl-related questions. The performance of the models in various scenarios was scored by two experts and analyzed using analysis of variance (ANOVA), linear mixed models (LMM), and Cohen’s kappa consistency tests.

Results: LLM performance differed significantly between question types (p<0.05 in ANOVA), while the LMM confirmed that ChatGPT outperformed all other models across categories, with the largest effect sizes found when comparing ChatGPT to Gemini 1.5 Flash and Copilot. Individually, Gemini performed well on user-related questions but was relatively weak on first-aid-related questions. Luzia on WhatsApp performed moderately on first-aid-related questions but poorly on clinical and policy-making ones. Perplexity scored relatively high on clinical questions, but its overall consistency was poor. YouChat Smart and Copilot generally scored low in all scenarios and had poor stability.

Conclusions: LLMs can provide real-time guidance for users, first aiders, clinicians, and policymakers, with performance differing between LLMs across question types. The choice of LLM for answering fentanyl-related questions should be based on the specific scenario.

  • Research Article
A Systematic Review of ChatGPT and Other Conversational Large Language Models in Healthcare
  • Apr 27, 2024 · medRxiv · Leyao Wang + 7 more
  • DOI: 10.1101/2024.04.26.24306390 · Cited by 15

Background:The launch of the Chat Generative Pre-trained Transformer (ChatGPT) in November 2022 has attracted public attention and academic interest to large language models (LLMs), facilitating the emergence of many other innovative LLMs. These LLMs have been applied in various fields, including healthcare. Numerous studies have since been conducted regarding how to employ state-of-the-art LLMs in health-related scenarios to assist patients, doctors, and public health administrators.Objective:This review aims to summarize the applications and concerns of applying conversational LLMs in healthcare and provide an agenda for future research on LLMs in healthcare.Methods:We utilized PubMed, ACM, and IEEE digital libraries as primary sources for this review. We followed the guidance of Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) to screen and select peer-reviewed research articles that (1) were related to both healthcare applications and conversational LLMs and (2) were published before September 1st, 2023, the date when we started paper collection and screening. We investigated these papers and classified them according to their applications and concerns.Results:Our search initially identified 820 papers according to targeted keywords, out of which 65 papers met our criteria and were included in the review. The most popular conversational LLM was ChatGPT from OpenAI (60), followed by Bard from Google (1), Large Language Model Meta AI (LLaMA) from Meta (1), and other LLMs (5). These papers were classified into four categories in terms of their applications: 1) summarization, 2) medical knowledge inquiry, 3) prediction, and 4) administration, and four categories of concerns: 1) reliability, 2) bias, 3) privacy, and 4) public acceptability. There are 49 (75%) research papers using LLMs for summarization and/or medical knowledge inquiry, and 58 (89%) research papers expressing concerns about reliability and/or bias. 
We found that conversational LLMs exhibit promising results in summarization and in providing medical knowledge to patients with relatively high accuracy. However, conversational LLMs like ChatGPT are not able to provide reliable answers to complex health-related tasks that require specialized domain expertise. Additionally, no experiments in our reviewed papers have been conducted to thoughtfully examine how conversational LLMs lead to bias or privacy issues in healthcare research.Conclusions:Future studies should focus on improving the reliability of LLM applications in complex health-related tasks, as well as investigating the mechanisms by which LLM applications bring about bias and privacy issues. Considering the vast accessibility of LLMs, legal, social, and technical efforts are all needed to address concerns about LLMs to promote, improve, and regulate the application of LLMs in healthcare.

  • Research Article
  • Cite Count Icon 4
  • 10.2196/59641
Large Language Models Can Enable Inductive Thematic Analysis of a Social Media Corpus in a Single Prompt: Human Validation Study.
  • Aug 29, 2024
  • JMIR infodemiology
  • Michael S Deiner + 5 more

Manually analyzing public health-related content from social media provides valuable insights into the beliefs, attitudes, and behaviors of individuals, shedding light on trends and patterns that can inform public understanding, policy decisions, targeted interventions, and communication strategies. Unfortunately, the time and effort needed from well-trained human subject matter experts makes extensive manual social media listening unfeasible. Generative large language models (LLMs) can potentially summarize and interpret large amounts of text, but it is unclear to what extent LLMs can glean subtle health-related meanings in large sets of social media posts and reasonably report health-related themes. We aimed to assess the feasibility of using LLMs for topic model selection or inductive thematic analysis of large collections of social media posts by attempting to answer the following question: Can LLMs conduct topic model selection and inductive thematic analysis as effectively as humans did in a prior manual study, or at least reasonably, as judged by subject matter experts? We asked the same research question and used the same set of social media content for both the LLM selection of relevant topics and the LLM analysis of themes as was conducted manually in a published study about vaccine rhetoric. We used the results from that study as background for this LLM experiment by comparing the results from the prior manual human analyses with the analyses from 3 LLMs: GPT4-32K, Claude-instant-100K, and Claude-2-100K. We also assessed whether multiple LLMs had equivalent ability and assessed the consistency of repeated analysis from each LLM. The LLMs generally gave high rankings to the topics chosen previously by humans as most relevant. We reject a null hypothesis (P<.001, overall comparison) and conclude that these LLMs are more likely to include the human-rated top 5 content areas in their top rankings than would occur by chance. 
Regarding theme identification, LLMs identified several themes similar to those identified by humans, with very low hallucination rates. Variability occurred between LLMs and between test runs of an individual LLM. Despite not consistently matching the human-generated themes, subject matter experts found themes generated by the LLMs were still reasonable and relevant. LLMs can effectively and efficiently process large social media-based health-related data sets. LLMs can extract themes from such data that human subject matter experts deem reasonable. However, we were unable to show that the LLMs we tested can replicate the depth of analysis from human subject matter experts by consistently extracting the same themes from the same data. There is vast potential, once better validated, for automated LLM-based real-time social listening for common and rare health conditions, informing public health understanding of the public's interests and concerns and determining the public's ideas to address them.
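The null hypothesis above — that an LLM's top rankings include the human-rated top 5 content areas no more often than chance — can be framed with a hypergeometric baseline. The abstract does not specify the study's exact test, and the topic count below is purely illustrative, so this is a sketch of the chance baseline rather than the authors' method:

```python
from math import comb

def p_overlap_at_least(k_total, top_h, top_m, min_hit):
    """P(a random size-`top_m` list drawn from `k_total` topics contains
    at least `min_hit` of the `top_h` human-chosen topics) — hypergeometric."""
    total = comb(k_total, top_m)
    return sum(
        comb(top_h, h) * comb(k_total - top_h, top_m - h)
        for h in range(min_hit, min(top_h, top_m) + 1)
    ) / total

# Illustrative: 20 candidate topics, human top 5, model top 5, all 5 recovered.
# Chance of a random top-5 matching all 5 human picks is 1/C(20,5) ≈ 6.4e-05,
# so perfect overlap would be strong evidence against the chance-only null.
print(p_overlap_at_least(20, 5, 5, 5))
```

A tiny tail probability like this is the intuition behind rejecting the null at P<.001; the published analysis may of course have pooled runs and models differently.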

  • Research Article
  • Cite Count Icon 18
  • 10.1001/jamaophthalmol.2024.2513
Development and Evaluation of a Retrieval-Augmented Large Language Model Framework for Ophthalmology
  • Jul 18, 2024
  • JAMA Ophthalmology
  • Ming-Jie Luo + 15 more

Although augmenting large language models (LLMs) with knowledge bases may improve medical domain-specific performance, practical methods are needed for local implementation of LLMs that address privacy concerns and enhance accessibility for health care professionals. To develop an accurate, cost-effective local implementation of an LLM to mitigate privacy concerns and support its practical deployment in health care settings. ChatZOC (Sun Yat-Sen University Zhongshan Ophthalmology Center), a retrieval-augmented LLM framework, was developed by enhancing a baseline LLM with a comprehensive ophthalmic dataset and evaluation framework (CODE), which includes over 30 000 pieces of ophthalmic knowledge. This LLM was benchmarked against 10 representative LLMs, including GPT-4 and GPT-3.5 Turbo (OpenAI), across 300 clinical questions in ophthalmology. The evaluation, involving a panel of medical experts and biomedical researchers, focused on accuracy, utility, and safety. A double-masked approach was used to minimize assessment bias across all models. The study used a comprehensive knowledge base derived from ophthalmic clinical practice, without directly involving clinical patients. LLM response to clinical questions. Accuracy, utility, and safety of LLMs in responding to clinical questions. The baseline model achieved a human ranking score of 0.48. The retrieval-augmented LLM had a score of 0.60, a difference of 0.12 (95% CI, 0.02-0.22; P = .02) from baseline and not different from GPT-4 with a score of 0.61 (difference = 0.01; 95% CI, -0.11 to 0.13; P = .89). For scientific consensus, the retrieval-augmented LLM was 84.0% compared with the baseline model of 46.5% (difference = 37.5%; 95% CI, 29.0%-46.0%; P < .001) and not different from GPT-4 with a value of 79.2% (difference = 4.8%; 95% CI, -0.3% to 10.0%; P = .06). 
Results of this quality improvement study suggest that the integration of high-quality knowledge bases improved the LLM's performance in medical domains. This study highlights the transformative potential of augmented LLMs in clinical practice by providing reliable, safe, and practical clinical information. Further research is needed to explore the broader application of such frameworks in the real world.
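The between-model comparisons above are reported as differences in proportions with 95% CIs. As a rough sketch only — the paper's ratings are paired on the same 300 questions, so its exact intervals will differ — the simple Wald interval for a difference of two independent proportions looks like this, with the consensus rates plugged in as illustrative inputs:

```python
import math

def wald_ci_diff(p1, n1, p2, n2, z=1.96):
    """Wald 95% CI for the difference of two independent proportions."""
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, diff - z * se, diff + z * se

# Illustrative: scientific-consensus rates of 0.840 vs 0.465 on 300 questions each
diff, lo, hi = wald_ci_diff(0.84, 300, 0.465, 300)
print(f"diff={diff:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

This yields an interval of the same order as the reported 29.0%-46.0%; a paired or exact method, as the study presumably used, widens or shifts it somewhat.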

  • Research Article
  • Cite Count Icon 1
  • 10.1111/pcn.13781
Chain of Risks Evaluation (CORE): A framework for safer large language models in public mental health.
  • Jan 24, 2025
  • Psychiatry and clinical neurosciences
  • Lingyu Li + 5 more

Large language models (LLMs) have gained significant attention for their capabilities in natural language understanding and generation. However, their widespread adoption potentially raises public mental health concerns, including issues related to inequity, stigma, dependence, medical risks, and security threats. This review aims to offer a perspective within the actor-network framework, exploring the technical architectures, linguistic dynamics, and psychological effects underlying human-LLM interactions. Based on this theoretical foundation, we propose four categories of risks, presenting increasing challenges in identification and mitigation: universal, context-specific, user-specific, and user-context-specific risks. Correspondingly, we introduce CORE (Chain of Risks Evaluation), a structured conceptual framework for assessing and mitigating the risks associated with LLMs in public mental health contexts. Our approach suggests viewing the development of responsible LLMs as a continuum from technical to public efforts. We summarize technical approaches and potential contributions from mental health practitioners that could help evaluate and regulate risks in human-LLM interactions. We propose that mental health practitioners could play a crucial role in this emerging field by collaborating with LLM developers, conducting empirical studies to better understand the psychological impacts of human-LLM interactions, developing guidelines for LLM use in mental health contexts, and engaging in public education.

More from: Frontiers in Public Health
  • New
  • Research Article
  • 10.3389/fpubh.2025.1612509
Service access for youth with neurodevelopmental disabilities transitioning to adulthood: service providers’ and decision-makers’ perspectives on barriers, facilitators and policy recommendations
  • Nov 6, 2025
  • Frontiers in Public Health
  • Angela M Senevirathna + 4 more

  • New
  • Research Article
  • 10.3389/fpubh.2025.1654488
Co-creating a social science research agenda for Long Covid
  • Nov 6, 2025
  • Frontiers in Public Health
  • Oonagh Cousins + 12 more

  • New
  • Research Article
  • 10.3389/fpubh.2025.1636891
From active play to sedentary lifestyles: understanding the decline in physical activity from childhood through adolescence—a systematic review
  • Nov 6, 2025
  • Frontiers in Public Health
  • Jean De Dieu Habyarimana + 4 more

  • New
  • Research Article
  • 10.3389/fpubh.2025.1674081
Determinants of primary care physicians’ intention to provide breast cancer screening services for rural women: a structural equation model based on the theory of planned behavior
  • Nov 6, 2025
  • Frontiers in Public Health
  • Yinren Zhao + 6 more

  • New
  • Research Article
  • 10.3389/fpubh.2025.1619886
Community health resource project: highlighting One Health resources across rural Georgia to build healthier communities
  • Nov 6, 2025
  • Frontiers in Public Health
  • Tanya E Jules + 6 more

  • New
  • Research Article
  • 10.3389/fpubh.2025.1676139
Stage-specific prevalence and progression of sarcopenia among aging hemodialysis patients: a multicenter cross-sectional study
  • Nov 6, 2025
  • Frontiers in Public Health
  • Jinguo Li + 5 more

  • New
  • Research Article
  • 10.3389/fpubh.2025.1668696
The association between social capital and quality of life in old adults: a systematic review and meta-analysis
  • Nov 6, 2025
  • Frontiers in Public Health
  • Alessandra Buja + 1 more

  • New
  • Research Article
  • 10.3389/fpubh.2025.1667721
Needs for discharge planning among parents of preterm infants in the NICU: a systematic review and meta-synthesis
  • Nov 6, 2025
  • Frontiers in Public Health
  • Jiaming Wu + 3 more

  • New
  • Research Article
  • 10.3389/fpubh.2025.1731810
Correction: Understanding the impact of different hand drying methods on viral aerosols formation and surface contamination in indoor environments
  • Nov 6, 2025
  • Frontiers in Public Health

  • New
  • Research Article
  • 10.3389/fpubh.2025.1698062
Development of machine learning models with explainable AI for frailty risk prediction and their web-based application in community public health
  • Nov 6, 2025
  • Frontiers in Public Health
  • Seungmi Kim + 8 more
