MEGA-RAG: a retrieval-augmented generation framework with multi-evidence guided answer refinement for mitigating hallucinations of LLMs in public health
Introduction: The increasing adoption of large language models (LLMs) in public health has raised significant concerns about hallucinations: factually inaccurate or misleading outputs that can compromise clinical communication and policy decisions.

Methods: We propose a retrieval-augmented generation framework with multi-evidence guided answer refinement (MEGA-RAG), specifically designed to mitigate hallucinations in public health applications. The framework integrates multi-source evidence retrieval (dense retrieval via FAISS, keyword-based retrieval via BM25, and biomedical knowledge graphs), employs a cross-encoder reranker to ensure semantic relevance, and incorporates a discrepancy-aware refinement module to further enhance factual accuracy.

Results: Experimental evaluation demonstrates that MEGA-RAG outperforms four baseline models [PubMedBERT, PubMedGPT, a standalone LLM, and an LLM with standard retrieval-augmented generation (RAG)], reducing hallucination rates by more than 40%. It also achieves the highest accuracy (0.7913), precision (0.7541), recall (0.8304), and F1 score (0.7904).

Discussion: These findings confirm that MEGA-RAG is highly effective at generating factually reliable and medically accurate responses, thereby enhancing the credibility of AI-generated health information for applications in health education, clinical communication, and evidence-based policy development.
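The Methods section describes a hybrid retrieval pipeline: BM25 keyword search and FAISS dense search produce candidate evidence, which a cross-encoder then reranks before generation. The following Python sketch illustrates those three stages only; the sentence-transformer and cross-encoder model names, the toy corpus, and the candidate-set sizes are illustrative assumptions rather than the authors' configuration, and the knowledge-graph retrieval and discrepancy-aware refinement stages are omitted.

```python
# Minimal sketch of hybrid retrieval (BM25 + FAISS) with cross-encoder
# reranking, as outlined in the Methods section. Corpus and model names are
# placeholders, not the authors' setup.
import faiss
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder, SentenceTransformer

corpus = [
    "Measles vaccination provides long-lasting immunity in most recipients.",
    "Hand hygiene reduces transmission of many respiratory and enteric pathogens.",
    "Antibiotics are ineffective against viral infections such as influenza.",
]
query = "Do antibiotics work against the flu?"

# 1) Keyword-based retrieval with BM25.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
bm25_scores = bm25.get_scores(query.lower().split())

# 2) Dense retrieval with FAISS (inner product over normalized embeddings).
encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = encoder.encode(corpus, normalize_embeddings=True)
index = faiss.IndexFlatIP(doc_emb.shape[1])
index.add(np.asarray(doc_emb, dtype="float32"))
query_emb = encoder.encode([query], normalize_embeddings=True)
_, dense_ids = index.search(np.asarray(query_emb, dtype="float32"), len(corpus))

# 3) Union of the top candidates from both retrievers.
candidate_ids = sorted(set(np.argsort(bm25_scores)[::-1][:2]) | set(dense_ids[0][:2]))

# 4) Cross-encoder reranking so that only semantically relevant evidence is
#    passed on to the generator.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
rerank_scores = reranker.predict([(query, corpus[i]) for i in candidate_ids])
for doc_id, score in sorted(zip(candidate_ids, rerank_scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {corpus[doc_id]}")
```

In the full framework, the reranked passages would be fused with knowledge-graph evidence, and the refinement module would compare the draft answer against this evidence to flag and correct discrepancies before the final response is produced.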
- 10.1145/3637487
- Feb 11, 2025
- ACM Computing Surveys
- 10.1093/bioinformatics/btae560
- Sep 2, 2024
- Bioinformatics
- 10.1016/b978-0-443-29246-0.00004-3
- Jan 1, 2025
- 10.1007/s12671-023-02089-5
- Mar 3, 2023
- Mindfulness
- 10.18653/v1/2023.conll-1.21
- Jan 1, 2023
- 10.1101/2025.02.28.25323115
- Mar 3, 2025
- 10.1093/bioinformatics/btae353
- Jun 3, 2024
- Bioinformatics
- 10.3389/fpubh.2023.1166120
- Apr 25, 2023
- Frontiers in Public Health
- 10.1145/3696410.3714782
- Apr 22, 2025
- 10.1101/2025.02.08.25321587
- Feb 10, 2025
- medRxiv
- Research Article
- 10.1016/j.jbi.2024.104730
- Sep 24, 2024
- Journal of Biomedical Informatics
FuseLinker: Leveraging LLM’s pre-trained text embeddings and domain knowledge to enhance GNN-based link prediction on biomedical knowledge graphs
- Research Article
- 10.21203/rs.3.rs-3185632/v1
- Aug 1, 2023
- Research Square
Purpose: Large Language Models (LLMs) have shown exceptional performance in various natural language processing tasks, benefiting from their language generation capabilities and ability to acquire knowledge from unstructured text. However, in the biomedical domain, LLMs face limitations that lead to inaccurate and inconsistent answers. Knowledge Graphs (KGs) have emerged as valuable resources for organizing structured information, and Biomedical Knowledge Graphs (BKGs) have gained significant attention for managing diverse and large-scale biomedical knowledge. The objective of this study is to assess and compare the capabilities of ChatGPT and existing BKGs in question answering, biomedical knowledge discovery, and reasoning tasks within the biomedical domain.

Methods: We conducted a series of experiments to assess the performance of ChatGPT and BKGs in querying existing biomedical knowledge, knowledge discovery, and knowledge reasoning. First, we tasked ChatGPT with answering questions sourced from the “Alternative Medicine” sub-category of Yahoo! Answers and recorded the responses; in parallel, we queried the BKG for the knowledge records corresponding to the same questions and assessed them manually. In another experiment, we formulated a prediction scenario to assess ChatGPT’s ability to suggest potential drug/dietary supplement repurposing candidates, while the BKG performed link prediction for the same task; the outcomes were then compared and analyzed. Finally, we evaluated the ability of ChatGPT and the BKG to establish associations between pairs of proposed entities, in order to assess their reasoning abilities and the extent to which they can infer connections within the knowledge domain.

Results: The results indicate that ChatGPT with GPT-4.0 outperforms both GPT-3.5 and BKGs in providing existing information, whereas BKGs demonstrate higher reliability in terms of information accuracy. ChatGPT exhibits limitations in performing novel discoveries and reasoning, particularly in establishing structured links between entities, compared to BKGs.

Conclusions: To address the observed limitations, future research should focus on integrating LLMs and BKGs to leverage the strengths of both approaches. Such integration would optimize task performance and mitigate potential risks, advancing knowledge in the biomedical field and contributing to the overall well-being of individuals.
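The study above contrasts ChatGPT's free-text suggestions with link prediction over a biomedical knowledge graph for drug/dietary supplement repurposing. As a rough illustration of graph-based link prediction, and not the BKG or method used in that study, the sketch below scores unlinked drug-condition pairs in a small invented graph using the Adamic-Adar heuristic from networkx; all node names and edges are hypothetical.

```python
# Toy knowledge-graph link prediction for repurposing candidates using the
# Adamic-Adar heuristic. The graph, node names, and edges are invented for
# demonstration only.
import networkx as nx

G = nx.Graph()
# Hypothetical drug-gene and gene-condition associations.
G.add_edges_from([
    ("drug_A", "gene_1"), ("drug_A", "gene_2"),
    ("drug_B", "gene_2"), ("drug_B", "gene_3"),
    ("gene_1", "condition_X"), ("gene_2", "condition_X"),
    ("gene_3", "condition_Y"),
])

# Score drug-condition pairs that are not yet linked in the graph.
candidates = [
    (drug, cond)
    for drug in ("drug_A", "drug_B")
    for cond in ("condition_X", "condition_Y")
    if not G.has_edge(drug, cond)
]
ranked = sorted(nx.adamic_adar_index(G, candidates), key=lambda t: t[2], reverse=True)
for drug, cond, score in ranked:
    print(f"{drug} -> {cond}: {score:.3f}")
```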
- Research Article
- 10.2196/65226
- Aug 9, 2024
- Journal of Medical Internet Research
The use of web-based search and social media can help identify epidemics, potentially earlier than clinical methods, and may even reveal unreported outbreaks. Monitoring for eye-related epidemics, such as conjunctivitis outbreaks, can facilitate early public health intervention to reduce transmission and ocular comorbidities. However, monitoring social media content for conjunctivitis outbreaks is costly and laborious. Large language models (LLMs) could overcome these barriers by assessing the likelihood that real-world outbreaks are being described. However, public health actions for likely outbreaks would benefit further from additional epidemiological characteristics, such as outbreak type, size, and severity. We aimed to assess whether and how well LLMs can classify epidemiological features from social media posts beyond conjunctivitis outbreak probability, including outbreak type, size, severity, etiology, and community setting. We used a validation framework comparing LLM classifications to those of other LLMs and human experts. We wrote code to generate synthetic conjunctivitis outbreak social media posts, embedded with specific preclassified epidemiological features to simulate various infectious eye disease outbreak and control scenarios. We used these posts to develop effective LLM prompts and test the capabilities of multiple LLMs. For top-performing LLMs, we gauged their practical utility in real-world epidemiological surveillance by comparing their assessments of Twitter/X, forum, and YouTube conjunctivitis posts. Finally, human raters also classified the posts, and we compared their classifications to those of a leading LLM for validation. Comparisons used correlation statistics or sensitivity and specificity. We assessed 7 LLMs on classifying epidemiological data from 1152 synthetic posts, 370 Twitter/X posts, 290 forum posts, and 956 YouTube posts. Despite some discrepancies, the LLMs demonstrated a reliable capacity for nuanced epidemiological analysis across the various data sources, both relative to human raters and relative to one another. Notably, GPT-4 and Mixtral 8x22b exhibited high performance, predicting conjunctivitis outbreak characteristics such as probability (GPT-4: correlation=0.73), size (Mixtral 8x22b: correlation=0.82), and type (infectious, allergic, or environmentally caused); however, there were notable exceptions. Assessing synthetic and real-world posts for etiological factors, infectious eye disease specialist validations revealed that GPT-4 had high specificity (0.83-1.00) but variable sensitivity (0.32-0.71). Interrater reliability analyses showed that LLM-expert agreement exceeded expert-expert agreement for severity assessment (intraclass correlation coefficient=0.69 vs 0.38), while agreement varied by condition type (κ=0.37-0.94). This investigation into the potential of LLMs for public health infoveillance suggests effectiveness in classifying key epidemiological characteristics from social media content about conjunctivitis outbreaks. Future studies should further explore LLMs' potential to support public health monitoring through the automated assessment and classification of potential infectious eye disease or other outbreaks. Their optimal role may be to act as a first line of documentation, alerting public health organizations to follow up on LLM-detected and -classified small, early outbreaks, with a focus on the most severe ones.
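The study above prompts LLMs to extract structured epidemiological features from individual posts. The sketch below shows one way such a classification prompt might look using the OpenAI Python client with JSON-mode output; the prompt wording, feature schema, example post, and model name are illustrative assumptions, not the study's actual prompts or models.

```python
# Illustrative structured-classification prompt for a social media post.
# The schema, post, and model are placeholders; requires OPENAI_API_KEY.
import json
from openai import OpenAI

client = OpenAI()

post = "Half my kid's class is out with pink eye this week, the school sent a warning letter."

prompt = (
    "Classify the following social media post. Return JSON with keys: "
    "outbreak_probability (0-1), outbreak_type (infectious|allergic|environmental|none), "
    "estimated_size (none|small|medium|large), severity (mild|moderate|severe|unknown), "
    "setting (school|workplace|household|community|unknown).\n\n"
    f"Post: {post}"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},
)

features = json.loads(response.choices[0].message.content)
print(features)
```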
- Preprint Article
- 10.2196/preprints.65226
- Aug 9, 2024
- Research Article
- 10.1101/2023.06.09.23291208
- Jun 12, 2023
- medRxiv
Large Language Models (LLMs) have demonstrated exceptional performance in various natural language processing tasks, utilizing their language generation capabilities and knowledge acquisition potential from unstructured text. However, when applied to the biomedical domain, LLMs encounter limitations, resulting in erroneous and inconsistent answers. Knowledge Graphs (KGs) have emerged as valuable resources for structured information representation and organization. Specifically, Biomedical Knowledge Graphs (BKGs) have attracted significant interest in managing large-scale and heterogeneous biomedical knowledge. This study evaluates the capabilities of ChatGPT and existing BKGs in question answering, knowledge discovery, and reasoning. Results indicate that while ChatGPT with GPT-4.0 surpasses both GPT-3.5 and BKGs in providing existing information, BKGs demonstrate superior information reliability. Additionally, ChatGPT exhibits limitations in performing novel discoveries and reasoning, particularly in establishing structured links between entities compared to BKGs. To overcome these limitations, future research should focus on integrating LLMs and BKGs to leverage their respective strengths. Such an integrated approach would optimize task performance and mitigate potential risks, thereby advancing knowledge in the biomedical field and contributing to overall well-being.
- Research Article
- 10.1136/ip-2025-045655
- Jul 21, 2025
- Injury Prevention
Button battery (BB) injuries in children represent a severe and growing public health burden. The literature on the topic is extensive; however, there is a notable lack of structured public health initiatives addressing the problem. The present study aimed to test the feasibility of using large language models (LLMs) to draft recommendations for preventing and managing BB ingestion in children. A set of questions was generated and submitted to ChatGPT-4o and ChatGPT-o1-preview. Questions were based on statements and websites of scientific societies and not-for-profit organisations and were developed to produce comprehensive recommendations covering BB risks, primary and secondary prevention, clinical management and follow-up, and general public health initiatives. Two independent reviewers rated the accuracy and readability of the answers to the questions submitted to the LLMs. Accuracy was rated using a four-level scale, while readability was assessed using two established readability tools, the Flesch Reading Ease (FRE) and the Flesch-Kincaid Grade Level (FKGL). None of the answers provided by the LLMs were rated as completely incorrect or partially incorrect. ChatGPT-o1-preview outperformed ChatGPT-4o in accuracy, with eight answers rated as accurate and complete. Both models showed similar readability levels, with FKGL and FRE scores indicating college-level comprehension. The LLMs demonstrated strong performance in this study, with no responses rated as incorrect or partially incorrect, indicating their potential and feasibility for use in public health. The present findings suggest that LLMs may be feasible tools in public health for preventing paediatric injuries.
- Research Article
- 10.1093/jamia/ocaf048
- Mar 19, 2025
- Journal of the American Medical Informatics Association (JAMIA)
Objectives: Adverse event detection from Electronic Medical Records (EMRs) is challenging due to the low incidence of the event, variability in clinical documentation, and the complexity of data formats. Pulmonary embolism as an adverse event (PEAE) is particularly difficult to identify using existing approaches. This study aims to develop and evaluate a Large Language Model (LLM)-based framework for detecting PEAE from unstructured narrative data in EMRs.

Materials and Methods: We conducted a chart review of adult patients (aged 18-100) admitted to tertiary-care hospitals in Calgary, Alberta, Canada, between 2017 and 2022. We developed an LLM-based detection framework consisting of three modules: evidence extraction (implementing both keyword-based and semantic similarity-based filtering methods), discharge information extraction (focusing on six key clinical sections), and PEAE detection. Four open-source LLMs (Llama3, Mistral-7B, Gemma, and Phi-3) were evaluated using positive predictive value, sensitivity, specificity, and F1-score. Model performance for population-level surveillance was assessed at yearly, quarterly, and monthly granularities.

Results: The chart review included 10 066 patients, with 40 cases of PEAE identified (0.4% prevalence). All four LLMs demonstrated high sensitivity (87.5-100%) and specificity (94.9-98.9%) across different experimental conditions. Gemma achieved the highest F1-score (28.11%) using keyword-based retrieval with discharge summary inclusion, along with 98.4% specificity, 87.5% sensitivity, and 99.95% negative predictive value. Keyword-based filtering reduced the median chunks per patient from 789 to 310, while semantic filtering further reduced this to 9 chunks. Including discharge summaries improved performance metrics across most models. For population-level surveillance, all models showed strong correlation with actual PEAE trends at yearly granularity (r=0.92-0.99), with Llama3 achieving the highest correlation (0.988).

Discussion: The results of our method for PEAE detection using EMR notes demonstrate high sensitivity and specificity across all four tested LLMs, indicating strong performance in distinguishing PEAE from non-PEAE cases. However, the low incidence rate of PEAE contributed to a lower PPV. The keyword-based chunking approach consistently outperformed semantic similarity-based methods, achieving higher F1 scores and PPV, underscoring the importance of domain knowledge in text segmentation. Including discharge summaries further enhanced performance metrics. Our population-based analysis revealed better performance for yearly trends compared to monthly granularity, suggesting the framework's utility for long-term surveillance despite dataset imbalance. Error analysis identified contextual misinterpretation, terminology confusion, and preprocessing limitations as key challenges for future improvement.

Conclusions: Our proposed method demonstrates that LLMs can effectively detect PEAE from narrative EMRs with high sensitivity and specificity. While these models serve as effective screening tools to exclude non-PEAE cases, their lower PPV indicates they cannot be relied upon solely for definitive PEAE identification. Further chart review remains necessary for confirmation. Future work should focus on improving contextual understanding, medical terminology interpretation, and exploring advanced prompting techniques to enhance precision in adverse event detection from EMRs.
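The evidence-extraction module described above filters note chunks either by keywords or by semantic similarity before they are passed to the detection LLM. The sketch below illustrates both filtering strategies; the keyword list, example chunks, embedding model, and similarity threshold are illustrative assumptions rather than the study's configuration.

```python
# Minimal sketch of keyword-based vs. semantic-similarity filtering of note
# chunks. Chunks, keywords, model, and threshold are placeholders.
from sentence_transformers import SentenceTransformer, util

chunks = [
    "CT pulmonary angiogram demonstrates a filling defect in the right lower lobe artery.",
    "Patient ambulating in hallway without assistance, tolerating diet.",
    "Started on therapeutic anticoagulation for newly diagnosed pulmonary embolism.",
    "Family meeting held to discuss discharge planning.",
]

# 1) Keyword-based filtering: keep chunks containing any target term.
keywords = ("pulmonary embolism", "filling defect", "anticoagulation")
keyword_hits = [c for c in chunks if any(k in c.lower() for k in keywords)]

# 2) Semantic filtering: keep chunks whose embedding is close to a query
#    describing the adverse event.
model = SentenceTransformer("all-MiniLM-L6-v2")
query = "pulmonary embolism developed during the hospital stay"
scores = util.cos_sim(model.encode(query), model.encode(chunks))[0]
semantic_hits = [c for c, s in zip(chunks, scores) if float(s) > 0.4]

print("keyword-filtered:", keyword_hits)
print("semantically filtered:", semantic_hits)
```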
- Research Article
- 10.1093/eurpub/ckaf161.316
- Oct 1, 2025
- European Journal of Public Health
Introduction: Antimicrobial resistance (AMR) is a major public health challenge. Artificial Intelligence (AI), particularly Large Language Models (LLMs), offers a promising opportunity to deliver accurate and appropriate health information and education. However, the public health implications of their widespread use remain largely unassessed by scientific experts. This study evaluates the effectiveness of leading LLMs in providing information on infection control and antibiotic use.

Methods: ChatGPT 3.5, ChatGPT 4.0, Claude 2.0, and Gemini 1.0 were prompted in both Italian and English. Their textual output underwent computational text analysis to assess readability, lexical diversity, and sentiment. In addition, 3 experts rated the output via an adapted DISCERN instrument built to assess AMR impact, persuasiveness, and the overall quality and appropriateness of the content.

Results: A total of 864 scores were obtained from ChatGPT 3.5, ChatGPT 4.0, and Claude, each evaluated both in English and in Italian. In contrast, only 270 scores were obtained from Gemini in English, as it self-interrupted, reporting the questioning as inappropriate for a chatbot. A general performance gradient was observed from Gemini to ChatGPT 3.5. ChatGPT 4.0 demonstrated the highest lexical diversity and sentiment scores, while Gemini presented the best readability and overall rating. English-language prompts consistently outperformed Italian-language ones. The impact on AMR received low scores across all LLMs.

Conclusions: The study identified Gemini as the best-performing model in terms of content quality, accessibility, and contextual awareness. While LLMs are promising tools, they are not intended to replace professional medical assessment. Instead, their responsible integration is necessary to ensure safe and effective public health applications. Further studies are warranted to expand the evidence base regarding the assessment of medical content generated by LLMs.

Key messages:
- Large Language Models (LLMs) are promising in delivering accurate and appropriate health information, but the public health implications remain largely unassessed by scientific experts.
- A general performance gradient was observed from Gemini to ChatGPT 3.5 regarding readability, lexical diversity, sentiment scores, and overall rating; the rated impact on AMR was low across all LLMs.
- Research Article
- 10.1001/jamanetworkopen.2024.12687
- May 22, 2024
- JAMA Network Open
Large language models (LLMs) may facilitate the labor-intensive process of systematic reviews. However, the exact methods and reliability remain uncertain. To explore the feasibility and reliability of using LLMs to assess risk of bias (ROB) in randomized clinical trials (RCTs). A survey study was conducted between August 10, 2023, and October 30, 2023. Thirty RCTs were selected from published systematic reviews. A structured prompt was developed to guide ChatGPT (LLM 1) and Claude (LLM 2) in assessing the ROB in these RCTs using a modified version of the Cochrane ROB tool developed by the CLARITY group at McMaster University. Each RCT was assessed twice by both models, and the results were documented. The results were compared with an assessment by 3 experts, which was considered a criterion standard. Correct assessment rates, sensitivity, specificity, and F1 scores were calculated to reflect accuracy, both overall and for each domain of the Cochrane ROB tool; consistent assessment rates and Cohen κ were calculated to gauge consistency; and assessment time was calculated to measure efficiency. Performance between the 2 models was compared using risk differences. Both models demonstrated high correct assessment rates. LLM 1 reached a mean correct assessment rate of 84.5% (95% CI, 81.5%-87.3%), and LLM 2 reached a significantly higher rate of 89.5% (95% CI, 87.0%-91.8%). The risk difference between the 2 models was 0.05 (95% CI, 0.01-0.09). In most domains, domain-specific correct rates were around 80% to 90%; however, sensitivity below 0.80 was observed in domains 1 (random sequence generation), 2 (allocation concealment), and 6 (other concerns). Domains 4 (missing outcome data), 5 (selective outcome reporting), and 6 had F1 scores below 0.50. The consistent rates between the 2 assessments were 84.0% for LLM 1 and 87.3% for LLM 2. LLM 1's κ exceeded 0.80 in 7 and LLM 2's in 8 domains. The mean (SD) time needed for assessment was 77 (16) seconds for LLM 1 and 53 (12) seconds for LLM 2. In this survey study of applying LLMs for ROB assessment, LLM 1 and LLM 2 demonstrated substantial accuracy and consistency in evaluating RCTs, suggesting their potential as supportive tools in systematic review processes.
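Studies like the one above report correct-assessment rates against an expert criterion standard and Cohen's κ for consistency between repeated assessments. The short sketch below shows how such agreement statistics can be computed with scikit-learn; the ten toy risk-of-bias labels are invented for illustration and are not data from the study.

```python
# Toy agreement statistics: correct-assessment rate against an expert
# standard and Cohen's kappa between two repeated LLM runs. Labels are
# invented for demonstration only.
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Risk-of-bias judgements for ten trials on one domain ("low"/"high"/"unclear").
expert    = ["low", "high", "low", "unclear", "low", "high", "low", "low", "unclear", "high"]
llm_run_1 = ["low", "high", "low", "low",     "low", "high", "low", "low", "unclear", "high"]
llm_run_2 = ["low", "high", "low", "unclear", "low", "high", "high", "low", "unclear", "high"]

print("correct assessment rate vs. experts:", accuracy_score(expert, llm_run_1))
print("consistency between runs (Cohen's kappa):", cohen_kappa_score(llm_run_1, llm_run_2))
```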
- Research Article
- 10.1111/acps.13717
- Jun 17, 2024
- Acta Psychiatrica Scandinavica
Knowledge graphs (KGs) remain an underutilized tool in the field of psychiatric research. In the broader biomedical field, KGs are already a significant tool, used mainly as knowledge databases or for detecting novel relations between biomedical entities. This review aims to outline how KGs could advance research in the field of psychiatry in the age of Artificial Intelligence (AI) and Large Language Models (LLMs). We conducted a thorough literature review across a spectrum of scientific fields ranging from computer science and knowledge engineering to bioinformatics. The literature reviewed was taken from PubMed, Semantic Scholar, and Google Scholar searches including terms such as "Psychiatric Knowledge Graphs", "Biomedical Knowledge Graphs", "Knowledge Graph Machine Learning Applications", and "Knowledge Graph Applications for Biomedical Sciences". The resulting publications were then assessed and compiled in this review according to their possible relevance to future psychiatric applications. We found and outline a multitude of KG applications in associated research fields that have yet to be utilized in psychiatric research, and we provide recommendations for computational researchers regarding use cases of these KG applications in psychiatry. This review illustrates use cases of KG-based research applications in biomedicine and beyond that may aid in elucidating the complex biology of psychiatric illness and open new routes for developing innovative interventions. We conclude that there is a wealth of opportunities for KG utilization in psychiatric research across a variety of application areas, including biomarker discovery, patient stratification, and personalized medicine approaches.
- Preprint Article
- 10.2196/preprints.81629
- Jul 31, 2025
Background: The fentanyl crisis is an urgent public health challenge in which knowledge gaps contribute to rising overdose mortality. Large language models (LLMs) have shown great potential in medical fields such as telemedicine and health education, but their benefits for different stakeholders in combating the fentanyl crisis warrant further investigation.

Objective: This study aims to systematically evaluate the quality differences in real-time fentanyl-related guidance provided by six LLMs to users, first responders, clinicians, and policymakers; to clarify the advantages and disadvantages of different LLMs across four major scenarios (identifying fentanyl, emergency rescue, clinical diagnosis and treatment, and public health decision-making); and to provide evidence for building a precise, reliable, and multilingual LLM-based fentanyl crisis intervention tool that reduces the risk of overdose deaths caused by knowledge gaps.

Methods: We compared six LLMs, i.e., ChatGPT 3.5, Gemini 1.5 Flash, YouChat Smart, Copilot, Perplexity, and Luzia, regarding their ability to answer fentanyl-related questions. The performance of the models in various scenarios was scored by two experts and analyzed using analysis of variance (ANOVA), linear mixed models (LMM), and Cohen's kappa consistency tests.

Results: LLM performance differed significantly between question types (p<0.05, ANOVA), and the LMM confirmed that ChatGPT outperformed all other models across categories, with the largest effect sizes found when comparing ChatGPT to Gemini 1.5 Flash and Copilot (Bing Chat). Individually, Gemini performed well on user-related questions but was relatively weak on first-aid-related questions. Luzia on WhatsApp performed moderately on first-aid-related questions but poorly on clinical and policy-making ones. Perplexity scored relatively high on clinical questions, but its overall consistency was poor. YouChat Smart and Copilot generally scored low in all scenarios and showed poor stability.

Conclusions: LLMs can provide real-time guidance for users, first responders, clinicians, and policymakers, with performance differing between LLMs across question types. The selection of an LLM for answering fentanyl-related questions should be based on the specific scenario.
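The analysis above compares expert scores across question types with a one-way ANOVA (alongside linear mixed models and Cohen's kappa). A minimal sketch of the ANOVA step with SciPy follows; the score values and grouping are hypothetical, and the mixed-model and kappa analyses are omitted.

```python
# Toy one-way ANOVA comparing hypothetical expert scores (1-5) for one model
# across three question types. Values are invented for demonstration only.
from scipy.stats import f_oneway

user_questions      = [4, 5, 4, 4, 5]
first_aid_questions = [3, 3, 4, 2, 3]
clinical_questions  = [4, 3, 3, 4, 3]

f_stat, p_value = f_oneway(user_questions, first_aid_questions, clinical_questions)
print(f"one-way ANOVA across question types: F = {f_stat:.2f}, p = {p_value:.4f}")
```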
- Research Article
- 10.1101/2024.04.26.24306390
- Apr 27, 2024
- medRxiv
Background: The launch of the Chat Generative Pre-trained Transformer (ChatGPT) in November 2022 has attracted public attention and academic interest to large language models (LLMs), facilitating the emergence of many other innovative LLMs. These LLMs have been applied in various fields, including healthcare. Numerous studies have since been conducted regarding how to employ state-of-the-art LLMs in health-related scenarios to assist patients, doctors, and public health administrators.

Objective: This review aims to summarize the applications and concerns of applying conversational LLMs in healthcare and to provide an agenda for future research on LLMs in healthcare.

Methods: We utilized PubMed, ACM, and IEEE digital libraries as primary sources for this review. We followed the guidance of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) to screen and select peer-reviewed research articles that (1) were related to both healthcare applications and conversational LLMs and (2) were published before September 1, 2023, the date when we started paper collection and screening. We investigated these papers and classified them according to their applications and concerns.

Results: Our search initially identified 820 papers according to targeted keywords, of which 65 met our criteria and were included in the review. The most popular conversational LLM was ChatGPT from OpenAI (60), followed by Bard from Google (1), Large Language Model Meta AI (LLaMA) from Meta (1), and other LLMs (5). These papers were classified into four categories of applications: 1) summarization, 2) medical knowledge inquiry, 3) prediction, and 4) administration, and four categories of concerns: 1) reliability, 2) bias, 3) privacy, and 4) public acceptability. Forty-nine (75%) research papers used LLMs for summarization and/or medical knowledge inquiry, and 58 (89%) expressed concerns about reliability and/or bias. We found that conversational LLMs exhibit promising results in summarization and in providing medical knowledge to patients with relatively high accuracy. However, conversational LLMs like ChatGPT are not able to provide reliable answers to complex health-related tasks that require specialized domain expertise. Additionally, no experiments in the reviewed papers were conducted to thoughtfully examine how conversational LLMs lead to bias or privacy issues in healthcare research.

Conclusions: Future studies should focus on improving the reliability of LLM applications in complex health-related tasks and on investigating the mechanisms by which LLM applications introduce bias and privacy issues. Considering the vast accessibility of LLMs, legal, social, and technical efforts are all needed to address concerns about LLMs and to promote, improve, and regularize their application in healthcare.
- Research Article
- 10.2196/59641
- Aug 29, 2024
- JMIR Infodemiology
Manually analyzing public health-related content from social media provides valuable insights into the beliefs, attitudes, and behaviors of individuals, shedding light on trends and patterns that can inform public understanding, policy decisions, targeted interventions, and communication strategies. Unfortunately, the time and effort needed from well-trained human subject matter experts makes extensive manual social media listening unfeasible. Generative large language models (LLMs) can potentially summarize and interpret large amounts of text, but it is unclear to what extent LLMs can glean subtle health-related meanings in large sets of social media posts and reasonably report health-related themes. We aimed to assess the feasibility of using LLMs for topic model selection or inductive thematic analysis of large contents of social media posts by attempting to answer the following question: Can LLMs conduct topic model selection and inductive thematic analysis as effectively as humans did in a prior manual study, or at least reasonably, as judged by subject matter experts? We asked the same research question and used the same set of social media content for both the LLM selection of relevant topics and the LLM analysis of themes as was conducted manually in a published study about vaccine rhetoric. We used the results from that study as background for this LLM experiment by comparing the results from the prior manual human analyses with the analyses from 3 LLMs: GPT4-32K, Claude-instant-100K, and Claude-2-100K. We also assessed if multiple LLMs had equivalent ability and assessed the consistency of repeated analysis from each LLM. The LLMs generally gave high rankings to the topics chosen previously by humans as most relevant. We reject a null hypothesis (P<.001, overall comparison) and conclude that these LLMs are more likely to include the human-rated top 5 content areas in their top rankings than would occur by chance. Regarding theme identification, LLMs identified several themes similar to those identified by humans, with very low hallucination rates. Variability occurred between LLMs and between test runs of an individual LLM. Despite not consistently matching the human-generated themes, subject matter experts found themes generated by the LLMs were still reasonable and relevant. LLMs can effectively and efficiently process large social media-based health-related data sets. LLMs can extract themes from such data that human subject matter experts deem reasonable. However, we were unable to show that the LLMs we tested can replicate the depth of analysis from human subject matter experts by consistently extracting the same themes from the same data. There is vast potential, once better validated, for automated LLM-based real-time social listening for common and rare health conditions, informing public health understanding of the public's interests and concerns and determining the public's ideas to address them.
- Research Article
- 10.1001/jamaophthalmol.2024.2513
- Jul 18, 2024
- JAMA Ophthalmology
Although augmenting large language models (LLMs) with knowledge bases may improve medical domain-specific performance, practical methods are needed for local implementation of LLMs that address privacy concerns and enhance accessibility for health care professionals. The aim was to develop an accurate, cost-effective local implementation of an LLM that mitigates privacy concerns and supports practical deployment in health care settings. ChatZOC (Sun Yat-Sen University Zhongshan Ophthalmology Center), a retrieval-augmented LLM framework, was developed by enhancing a baseline LLM with a comprehensive ophthalmic dataset and evaluation framework (CODE), which includes over 30 000 pieces of ophthalmic knowledge. This LLM was benchmarked against 10 representative LLMs, including GPT-4 and GPT-3.5 Turbo (OpenAI), across 300 clinical questions in ophthalmology. The evaluation, involving a panel of medical experts and biomedical researchers, focused on accuracy, utility, and safety. A double-masked approach was used to minimize bias in the assessment across all models. The study used a comprehensive knowledge base derived from ophthalmic clinical practice, without directly involving clinical patients. The exposure was the LLM response to clinical questions, and the main outcomes were the accuracy, utility, and safety of the LLMs in responding to those questions. The baseline model achieved a human ranking score of 0.48. The retrieval-augmented LLM had a score of 0.60, a difference of 0.12 (95% CI, 0.02-0.22; P = .02) from baseline, and was not different from GPT-4, which scored 0.61 (difference = 0.01; 95% CI, -0.11 to 0.13; P = .89). For scientific consensus, the retrieval-augmented LLM reached 84.0% compared with 46.5% for the baseline model (difference = 37.5%; 95% CI, 29.0%-46.0%; P < .001) and was not different from GPT-4 at 79.2% (difference = 4.8%; 95% CI, -0.3% to 10.0%; P = .06). Results of this quality improvement study suggest that the integration of high-quality knowledge bases improved the LLM's performance in medical domains. This study highlights the transformative potential of augmented LLMs in clinical practice by providing reliable, safe, and practical clinical information. Further research is needed to explore the broader application of such frameworks in the real world.
- Research Article
- 10.1111/pcn.13781
- Jan 24, 2025
- Psychiatry and Clinical Neurosciences
Large language models (LLMs) have gained significant attention for their capabilities in natural language understanding and generation. However, their widespread adoption potentially raises public mental health concerns, including issues related to inequity, stigma, dependence, medical risks, and security threats. This review aims to offer a perspective within the actor-network framework, exploring the technical architectures, linguistic dynamics, and psychological effects underlying human-LLMs interactions. Based on this theoretical foundation, we propose four categories of risks, presenting increasing challenges in identification and mitigation: universal, context-specific, user-specific, and user-context-specific risks. Correspondingly, we introduce CORE: Chain of Risk Evaluation, a structured conceptual framework for assessing and mitigating the risks associated with LLMs in public mental health contexts. Our approach suggests viewing the development of responsible LLMs as a continuum from technical to public efforts. We summarize technical approaches and potential contributions from mental health practitioners that could help evaluate and regulate risks in human-LLMs interactions. We propose that mental health practitioners could play a crucial role in this emerging field by collaborating with LLMs developers, conducting empirical studies to better understand the psychological impacts on human-LLMs interactions, developing guidelines for LLMs use in mental health contexts, and engaging in public education.