Large language models in translation quality assessment: The feasibility of human-AI collaboration

Abstract

This research explores the potential application of Large Language Models (LLMs) in translation quality assessment within the Chinese Academic Translation Project (CATP), from a human-AI collaboration perspective. The study integrates the LISA QA Model and the Chinese standard GB/T 19682-2005 to develop a multidimensional translation quality assessment system, including error typologies and weights specific to Chinese academic works. Using this system, three LLMs (GPT-4, Claude-3.7, and Deepseek-R1) were employed to evaluate the Portuguese version of the work Introduction to Qing Dynasty Academic Thought, analyzing their performance and comparing it with the results of an assessment conducted by human experts, with the aim of exploring the feasibility of a collaborative model between humans and AI. Based on the experimental results, the research proposes a hierarchical assessment process of “AI screening, refined human judgment” and an inter-linguistic assessment mechanism of “Chinese prompt, multilingual verification”, constructing a translation quality assessment framework based on human-AI collaboration for the CATP. This study infuses elements of technological innovation into traditional translation quality assessment, providing a new technical support pathway for the “internationalization” strategy of Chinese academic knowledge.
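The weighted, multidimensional error scoring described above can be sketched roughly as follows. This is a minimal illustration: the error categories, weights, and deduction formula are assumptions for demonstration, not the actual values defined in the LISA QA Model or GB/T 19682-2005.

```python
# Hypothetical sketch of a weighted, multidimensional error-scoring scheme.
# Categories and weights are illustrative assumptions, not the values used
# in the LISA QA Model or GB/T 19682-2005.
ERROR_WEIGHTS = {
    "accuracy": 3.0,      # mistranslations, omissions
    "terminology": 2.0,   # incorrect academic terms
    "style": 1.0,         # register and fluency issues
    "formatting": 0.5,    # punctuation and layout issues
}

def quality_score(error_counts: dict, word_count: int, max_score: float = 100.0) -> float:
    """Deduct weighted error points, normalized per 1,000 words, from a perfect score."""
    penalty = sum(ERROR_WEIGHTS[cat] * n for cat, n in error_counts.items())
    return max(0.0, max_score - penalty * 1000 / word_count)

# Example: 2 accuracy, 3 terminology, and 5 style errors in a 5,000-word chapter.
score = quality_score({"accuracy": 2, "terminology": 3, "style": 5}, word_count=5000)
```

A scheme of this shape also makes the proposed "AI screening" stage mechanical: translations scoring below a threshold could be routed to human experts for refined judgment.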

Similar Papers
  • Research Article
  • 10.21203/rs.3.rs-6608559/v1
Enhanced Language Models for Predicting and Understanding HIV Care Disengagement: A Case Study in Tanzania.
  • May 8, 2025
  • Research square
  • Waverly Wei + 16 more

Summary: Sustained engagement in HIV care and adherence to antiretroviral therapy (ART) are essential for achieving the UNAIDS "95-95-95" targets. Despite increased ART access, disengagement from care remains a significant issue, particularly in sub-Saharan Africa. Traditional machine learning (ML) models have shown moderate success in predicting care disengagement, which would enable early intervention. We develop an enhanced large language model (LLM) fine-tuned with electronic medical records (EMRs) to predict people at risk of disengaging from HIV care in Tanzania and to provide interpretative insights into modifiable risk factors. Methods: We developed a novel AI model by enhancing a pre-trained LLM (LLaMA 3.1, an open-source pre-trained LLM released by Meta) using routinely collected EMRs from Tanzania's National HIV Care and Treatment Program from January 1, 2018, to June 30, 2023 (4,809,765 records for 261,192 people) to identify people at risk of disengaging from HIV care or developing adverse outcomes. Outcomes included risk of ART non-adherence, non-suppressed viral load, and loss to follow-up. Models were evaluated internally (Kagera region) and externally (Geita region), with performance compared against state-of-the-art ML models and zero-shot LLMs. Additionally, a team of HIV physicians in Tanzania assessed the LLM's predictions, along with LLM-provided justifications, for a subset of patient records to evaluate their clinical relevance and reasoning. Findings: The enhanced LLMs consistently outperformed the supervised ML model and zero-shot LLMs across all outcomes in both internal and external validation datasets. When focusing on the 25% of people living with HIV (PLHIV) predicted as most likely to be lost to follow-up (LTFU), the model correctly identified 78% (2,515 of 3,224) of PLHIV genuinely at risk in internal validation and 73% (7,105 of 9,733) in external validation. 
Attention score analysis indicated that the enhanced LLM focused on keywords such as gaps in follow-up care and ART adherence. The human expert evaluation showed alignment between clinician assessments and the LLM's predictions in 65% of cases, with experts finding the model's justifications reasonable and clinically relevant in 92.3% of aligned cases. Interpretation: If implemented in HIV clinics, this LLM-based AI model could help allocate resources efficiently and deliver targeted interventions, improving retention in care and advancing the UNAIDS "95-95-95" targets. By functioning like a clinician (analyzing patient summaries, predicting risks, and offering reasoning), the enhanced LLM could be integrated into clinical workflows to complement human expertise, facilitating timely interventions and informed decision-making. If implemented widely, this human-AI collaboration has the potential to improve health outcomes for people living with HIV and reduce onward transmission. Funding: The study was supported by a grant from the US National Institutes of Health (NIH): NIH NIMH 1R01MH125746.

  • Research Article
  • 10.1097/js9.0000000000003631
How does AI compare to the experts in a Delphi setting: simulating medical consensus with large language models.
  • Oct 15, 2025
  • International journal of surgery (London, England)
  • Young Suk Park + 6 more


  • Conference Article
  • 10.54941/ahfe1006042
Leveraging LLMs to emulate the design processes of different cognitive styles
  • Jan 1, 2025
  • Xiyuan Zhang + 5 more

Cognitive styles, which shape designers’ thinking, problem-solving, and decision-making, influence strategies and preferences in design tasks. In team collaboration, diverse cognitive styles enhance problem-solving efficiency, foster creativity, and improve team performance. The ‘Co-evolution of problem–solution’ model serves as a key theoretical framework for understanding differences in designers’ cognitive styles. Based on this model, designers can be categorized into two cognitive styles: problem-driven and solution-driven. Problem-driven designers prioritize structuring the problem before developing solutions, while solution-driven designers generate solutions while design problems are still ill-defined and then work backward to define the problem. Designers with different expertise and disciplinary backgrounds exhibit distinct cognitive style tendencies. Different cognitive styles also adapt differently to design tasks, excelling in some more than others. As a rapidly advancing technology, large language models (LLMs) have shown considerable potential in the field of design. Their powerful generative capabilities position them as potential collaborators in design teams, emulating different cognitive styles. These emulations aim to bridge cognitive differences among team members, enable designers to leverage their individual strengths, and ultimately produce more feasible and high-quality design solutions. However, previous studies have been limited to leveraging LLMs to directly generate design outcomes based on different cognitive styles, neglecting the emulation of the design process itself. In fact, the evolutionary development between problem and solution spaces better reflects the core differences in cognitive styles. Moreover, communication and collaboration within design teams extend beyond simply exchanging solutions, spanning multiple stages of the design process, from problem analysis and idea generation to evaluation. 
To better integrate LLMs into design teams, it is necessary to consider emulation of the design cognition process. To this end, our study, based on the cognitive style taxonomy proposed by Dorst and Cross (2001), explores how LLMs can be used to emulate the design processes of problem-driven and solution-driven designers. We develop a zero-shot chain-of-thought (CoT)-based prompting strategy that enables LLMs to emulate the step-by-step cognitive flow of both design styles. The prompt design is inspired by Jiang et al. (2014) and Chen et al. (2023), who analyzed cognitive differences in the conceptual design process using the FBS ontology model. Furthermore, to evaluate the effectiveness of LLMs in emulating cognitive styles, this study establishes a three-dimensional evaluation metric: static distribution (the proportion and preference of cognitive issues), dynamic transformation (behavioral transition patterns), and the creativity of the design outcomes. Using human design behaviours identified in previous studies as a benchmark, we compare the cognitive styles emulated by LLMs under different design constraints against human performance to assess their alignment and differences. The results show that LLM-generated design processes align well with human cognitive styles, effectively emulating static cognitive characteristics, enhancing novelty and integrity in solutions, and demonstrating superior creativity compared to baseline methods. However, LLMs lack the fully complex nonlinear transitions between problem and solution spaces observed in human designers. This process-based emulation has the potential to enhance the application of LLMs in design teams, enabling them not only to serve as tools for generating solutions but also to provide support for collaboration during key stages of the design process. 
Future research should enhance LLMs' reasoning flexibility through fine-tuning or the GoT approach and explore their impact on human-AI collaboration across diverse design tasks to refine their role in design teams.
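A zero-shot chain-of-thought prompt of the kind the abstract above describes might be sketched roughly like this. The step sequences and wording are hypothetical illustrations, not the study's actual prompts.

```python
# Hypothetical sketch of a zero-shot chain-of-thought prompt that asks an LLM
# to follow a cognitive-style-specific design process. The step sequences are
# illustrative, not the study's actual prompts.
STYLE_STEPS = {
    "problem-driven": [
        "Analyse the design brief and list its constraints.",
        "Structure the problem into sub-problems.",
        "Generate candidate solutions for each sub-problem.",
        "Evaluate and integrate the solutions.",
    ],
    "solution-driven": [
        "Propose an initial solution from prior experience.",
        "Work backward to identify the problem it addresses.",
        "Refine the problem definition.",
        "Iterate on the solution against the refined problem.",
    ],
}

def build_prompt(task: str, style: str) -> str:
    """Compose a zero-shot CoT prompt for the given cognitive style."""
    steps = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(STYLE_STEPS[style]))
    return (
        f"You are a {style} designer. Task: {task}\n"
        f"Think step by step, following this process:\n{steps}"
    )
```

The point of templating the process rather than the outcome is that the model's intermediate steps, not just its final design, can then be compared against human cognitive behaviour.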

  • Research Article
  • Cited by 8
  • 10.1287/ijds.2023.0007
How Can IJDS Authors, Reviewers, and Editors Use (and Misuse) Generative AI?
  • Apr 1, 2023
  • INFORMS Journal on Data Science
  • Galit Shmueli + 7 more


  • Research Article
  • Cited by 5
  • 10.2196/59641
Large Language Models Can Enable Inductive Thematic Analysis of a Social Media Corpus in a Single Prompt: Human Validation Study.
  • Aug 29, 2024
  • JMIR infodemiology
  • Michael S Deiner + 5 more

Manually analyzing public health-related content from social media provides valuable insights into the beliefs, attitudes, and behaviors of individuals, shedding light on trends and patterns that can inform public understanding, policy decisions, targeted interventions, and communication strategies. Unfortunately, the time and effort needed from well-trained human subject matter experts makes extensive manual social media listening unfeasible. Generative large language models (LLMs) can potentially summarize and interpret large amounts of text, but it is unclear to what extent LLMs can glean subtle health-related meanings in large sets of social media posts and reasonably report health-related themes. We aimed to assess the feasibility of using LLMs for topic model selection or inductive thematic analysis of large contents of social media posts by attempting to answer the following question: Can LLMs conduct topic model selection and inductive thematic analysis as effectively as humans did in a prior manual study, or at least reasonably, as judged by subject matter experts? We asked the same research question and used the same set of social media content for both the LLM selection of relevant topics and the LLM analysis of themes as was conducted manually in a published study about vaccine rhetoric. We used the results from that study as background for this LLM experiment by comparing the results from the prior manual human analyses with the analyses from 3 LLMs: GPT4-32K, Claude-instant-100K, and Claude-2-100K. We also assessed if multiple LLMs had equivalent ability and assessed the consistency of repeated analysis from each LLM. The LLMs generally gave high rankings to the topics chosen previously by humans as most relevant. We reject a null hypothesis (P<.001, overall comparison) and conclude that these LLMs are more likely to include the human-rated top 5 content areas in their top rankings than would occur by chance. 
Regarding theme identification, LLMs identified several themes similar to those identified by humans, with very low hallucination rates. Variability occurred between LLMs and between test runs of an individual LLM. Despite not consistently matching the human-generated themes, subject matter experts found themes generated by the LLMs were still reasonable and relevant. LLMs can effectively and efficiently process large social media-based health-related data sets. LLMs can extract themes from such data that human subject matter experts deem reasonable. However, we were unable to show that the LLMs we tested can replicate the depth of analysis from human subject matter experts by consistently extracting the same themes from the same data. There is vast potential, once better validated, for automated LLM-based real-time social listening for common and rare health conditions, informing public health understanding of the public's interests and concerns and determining the public's ideas to address them.

  • Research Article
  • Cited by 2
  • 10.1609/aaaiss.v3i1.31183
Human-AI Interaction in the Age of Large Language Models
  • May 20, 2024
  • Proceedings of the AAAI Symposium Series
  • Diyi Yang

Large language models (LLMs) have revolutionized the way humans interact with AI systems, transforming a wide range of fields and disciplines. In this talk, I share two distinct approaches to empowering human-AI interaction using LLMs. The first explores how LLMs transform computational social science, and how human-AI collaboration can reduce costs and improve the efficiency of social science research. The second part looks at social skill learning via LLMs by empowering therapists and learners with LLM-based feedback and deliberative practices. These two works demonstrate how human-AI collaboration via LLMs can empower individuals and foster positive change. We conclude by discussing how LLMs enable collaborative intelligence by redefining the interactions between humans and AI systems.

  • Research Article
  • Cited by 9
  • 10.1007/s00330-025-11484-6
Human-AI collaboration in large language model-assisted brain MRI differential diagnosis: a usability study
  • Jan 1, 2025
  • European Radiology
  • Su Hwan Kim + 11 more

Objectives: This study investigated the impact of human-large language model (LLM) collaboration on the accuracy and efficiency of brain MRI differential diagnosis. Materials and methods: In this retrospective study, forty brain MRI cases with a challenging but definitive diagnosis were randomized into two groups of twenty cases each. Six radiology residents with an average experience of 6.3 months in reading brain MRI exams evaluated one set of cases supported by conventional internet search (Conventional) and the other set utilizing an LLM-based search engine and hybrid chatbot. A cross-over design ensured that each case was examined with both workflows in equal frequency. For each case, readers were instructed to determine the three most likely differential diagnoses. LLM responses were analyzed by a panel of radiologists. Benefits and challenges in human-LLM interaction were derived from observations and participant feedback. Results: LLM-assisted brain MRI differential diagnosis yielded superior accuracy (70/114; 61.4% (LLM-assisted) vs 53/114; 46.5% (conventional) correct diagnoses, p = 0.033, chi-square test). No difference in interpretation time or level of confidence was observed. An analysis of LLM responses revealed that correct LLM suggestions translated into correct reader responses in 82.1% of cases (60/73). Inaccurate case descriptions by readers (9.2% of cases), LLM hallucinations (11.5% of cases), and insufficient contextualization of LLM responses were identified as challenges related to human-LLM interaction. Conclusion: Human-LLM collaboration has the potential to improve brain MRI differential diagnosis. 
Yet, several challenges must be addressed to ensure effective adoption and user acceptance. Key points: Question: While large language models (LLMs) have the potential to support radiological differential diagnosis, the role of human-LLM collaboration in this context remains underexplored. Findings: LLM-assisted brain MRI differential diagnosis yielded superior accuracy over conventional internet search. Inaccurate case descriptions, LLM hallucinations, and insufficient contextualization were identified as potential challenges. Clinical relevance: Our results highlight the potential of an LLM-assisted workflow to increase diagnostic accuracy but underline the necessity to study collaborative efforts between humans and LLMs over LLMs in isolation.

  • Research Article
  • 10.29119/1641-3466.2024.210.39
Applying generative artificial intelligence to support invention processes: an analysis of the Systematic Inventive Thinking (SIT) methodology
  • Jan 1, 2024
  • Scientific Papers of Silesian University of Technology. Organization and Management Series
  • Paweł Wawrzała

Purpose: This paper aims to explore the integration of the Systematic Inventive Thinking (SIT) methodology with Large Language Models (LLMs) to enhance innovative processes. It seeks to assess how LLMs can support analytical and creative processes in design teams and how hybrid human-LLM collaboration can contribute to more dynamic and unconventional problem-solving approaches. Design/methodology/approach: The study employs a theoretical analysis of SIT methodology and LLM capabilities, synthesizing existing literature on both topics. It proposes a framework for integrating SIT with LLMs, including structured prompt patterns for each stage of the SIT process. The approach includes a comparative analysis of human and LLM capabilities in inventive processes. Findings: Research reveals that LLMs can significantly enhance the SIT process by providing rapid information synthesis, generating diverse ideas, and systematically applying SIT principles. However, human creativity, intuition, and holistic assessment remain crucial for breakthrough innovations. The study identifies specific prompt patterns and techniques for effective human-LLM collaboration within the SIT framework. Research limitations/implications: As this is an initial theoretical framework, empirical validation through case studies or experimental research is needed to assess its practical effectiveness. Practical implications: The proposed framework offers practitioners in the fields of innovation and design a structured approach to integrating AI into their creative processes. It provides specific guidelines for using LLMs to enhance each stage of the SIT methodology, which could lead to more efficient and innovative outcomes. Social implications: Integration of SIT with LLMs could significantly influence public attitudes toward AI, potentially increasing its acceptance as a collaborative tool in creative and problem-solving processes. 
This approach may lead to more efficient and sustainable innovation practices in various industries, potentially addressing social challenges more effectively. However, it may also raise concerns about job displacement in creative fields, necessitating a focus on reskilling and education to prepare the workforce for collaboration with AI systems. Originality/value: This paper presents a novel approach to integrating the SIT methodology with state-of-the-art AI technology, offering new perspectives on augmenting human creativity with machine capabilities in structured innovation processes. It contributes to the emerging field of AI-assisted design thinking and provides a foundation for further research in this area. Keywords: Systematic Inventive Thinking, Large Language Models, Innovation, Human-AI Collaboration. Category of the paper: Conceptual paper, Research paper.

  • Research Article
  • 10.1182/blood-2025-2574
Unlocking new frontiers in leukemia diagnostics through large language model-driven report generation
  • Nov 3, 2025
  • Blood
  • Vivian Wuerf + 5 more


  • Preprint Article
  • 10.21203/rs.3.rs-5409185/v2
Careful design of Large Language Model pipelines enables expert-level retrieval of evidence-based information from conservation syntheses
  • Jan 23, 2025
  • Radhika Iyer + 5 more

Wise use of evidence to support efficient conservation action is key to tackling biodiversity loss with limited time and resources. Evidence syntheses provide key recommendations for conservation decision-makers by assessing and summarising evidence, but are not always easy to access, digest, and use. Recent advances in Large Language Models (LLMs) present both opportunities and risks in enabling faster and more intuitive systems to access evidence syntheses and databases. Such systems for natural language search and open-ended evidence-based responses are pipelines comprising many components. Most critical of these components are the LLM used and how evidence is retrieved from the database. We evaluate the performance of ten LLMs across six different database retrieval strategies against human experts in answering synthetic multiple-choice question exams on the effects of conservation interventions using the Conservation Evidence database. We found that LLM performance was comparable with human experts over 45 filtered questions, both in correctly answering them and retrieving the document used to generate them. Across 1867 unfiltered questions, LLM performance demonstrated a level of conservation-specific knowledge, but this varied across topic areas. A hybrid retrieval strategy that combines keywords and vector embeddings performed best by a substantial margin. We also tested against a state-of-the-art previous generation LLM which was outperformed by all ten current models - including smaller, cheaper models. Our findings suggest that, with careful domain-specific design, LLMs could potentially be powerful tools for enabling expert-level use of evidence syntheses and databases. However, general LLMs used ‘out-of-the-box’ are likely to perform poorly and misinform decision-makers. 
By establishing that LLMs exhibit comparable performance with human synthesis experts on providing restricted responses to queries of evidence syntheses and databases, future work can build on our approach to quantify LLM performance in providing open-ended responses.

  • Research Article
  • Cited by 3
  • 10.1371/journal.pone.0323563
Careful design of Large Language Model pipelines enables expert-level retrieval of evidence-based information from syntheses and databases.
  • May 15, 2025
  • PloS one
  • Radhika Iyer + 5 more

Wise use of evidence to support efficient conservation action is key to tackling biodiversity loss with limited time and resources. Evidence syntheses provide key recommendations for conservation decision-makers by assessing and summarising evidence, but are not always easy to access, digest, and use. Recent advances in Large Language Models (LLMs) present both opportunities and risks in enabling faster and more intuitive systems to access evidence syntheses and databases. Such systems for natural language search and open-ended evidence-based responses are pipelines comprising many components. Most critical of these components are the LLM used and how evidence is retrieved from the database. We evaluate the performance of ten LLMs across six different database retrieval strategies against human experts in answering synthetic multiple-choice question exams on the effects of conservation interventions using the Conservation Evidence database. We found that LLM performance was comparable with human experts over 45 filtered questions, both in correctly answering them and retrieving the document used to generate them. Across 1867 unfiltered questions, LLM performance demonstrated a level of conservation-specific knowledge, but this varied across topic areas. A hybrid retrieval strategy that combines keywords and vector embeddings performed best by a substantial margin. We also tested against a state-of-the-art previous generation LLM which was outperformed by all ten current models - including smaller, cheaper models. Our findings suggest that, with careful domain-specific design, LLMs could potentially be powerful tools for enabling expert-level use of evidence syntheses and databases in different disciplines. However, general LLMs used 'out-of-the-box' are likely to perform poorly and misinform decision-makers. 
By establishing that LLMs exhibit comparable performance with human synthesis experts on providing restricted responses to queries of evidence syntheses and databases, future work can build on our approach to quantify LLM performance in providing open-ended responses.
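The hybrid retrieval strategy highlighted in the abstract above, blending keyword matching with vector-embedding similarity, can be sketched as follows. The max-normalization and the fusion weight `alpha` are assumptions for illustration, not the authors' actual implementation.

```python
# Minimal sketch of a hybrid retrieval strategy that blends keyword (e.g. BM25)
# scores with vector-embedding similarity scores. The max-normalization and the
# fusion weight alpha are illustrative assumptions.
def hybrid_scores(keyword_scores: dict, vector_scores: dict, alpha: float = 0.5) -> dict:
    """Blend max-normalized keyword and vector scores per document id."""
    def norm(scores: dict) -> dict:
        top = max(scores.values())
        return {doc: s / top for doc, s in scores.items()} if top else scores
    kw, vec = norm(keyword_scores), norm(vector_scores)
    return {
        doc: alpha * kw.get(doc, 0.0) + (1 - alpha) * vec.get(doc, 0.0)
        for doc in set(kw) | set(vec)
    }

# Example: document "b" ranks first because both retrievers score it well.
fused = hybrid_scores({"a": 2.0, "b": 1.0}, {"b": 0.9, "c": 0.3})
ranking = sorted(fused, key=fused.get, reverse=True)
```

The design intuition matches the finding: keyword search anchors retrieval on exact domain terms, while embeddings recover paraphrases, so documents supported by both signals rise to the top.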

  • Research Article
  • Cited by 7
  • 10.3390/fi16070254
Human-AI Collaboration for Remote Sighted Assistance: Perspectives from the LLM Era †
  • Jul 1, 2024
  • Future internet
  • Rui Yu + 4 more

Remote sighted assistance (RSA) has emerged as a conversational technology aiding people with visual impairments (VI) through real-time video chat communication with sighted agents. We conducted a literature review and interviewed 12 RSA users to understand the technical and navigational challenges faced by both agents and users. The technical challenges were categorized into four groups: agents’ difficulties in orienting and localizing users, acquiring and interpreting users’ surroundings and obstacles, delivering information specific to user situations, and coping with poor network connections. We also presented 15 real-world navigational challenges, including 8 outdoor and 7 indoor scenarios. Given the spatial and visual nature of these challenges, we identified relevant computer vision problems that could potentially provide solutions. We then formulated 10 emerging problems that neither human agents nor computer vision can fully address alone. For each emerging problem, we discussed solutions grounded in human–AI collaboration. Additionally, with the advent of large language models (LLMs), we outlined how RSA can integrate with LLMs within a human–AI collaborative framework, envisioning the future of visual prosthetics.

  • Research Article
  • 10.64252/36fq8592
Transforming Healthcare: Opportunities And Challenges In Harnessing Large Language Models
  • Jun 24, 2025
  • International Journal of Environmental Sciences
  • Bhaktavaschal Samal + 2 more

The integration of large language models (LLMs) in healthcare represents a significant advancement in medical technology, offering solutions to increasingly complex challenges in clinical practice. This comprehensive review examines the current state, opportunities, and limitations of LLM deployment in medical domains. We analyze how these models address critical issues such as clinical data overload, administrative inefficiencies, and medical education while accelerating drug development processes. Drawing on recent benchmarks, including MEDEC's evaluation of error detection and correction in clinical notes, we demonstrate that advanced LLMs approach near-expert performance in specific medical tasks. However, significant challenges persist, including the risk of hallucination, lack of transparency, liability concerns, data privacy issues, and potential biases. Our analysis reveals that while LLMs show remarkable promise in transforming healthcare delivery, their implementation requires careful validation and ethical oversight. We propose a balanced approach combining rigorous benchmarking, explainable AI methodologies, and comprehensive ethical frameworks, emphasizing the importance of maintaining human oversight in clinical decision-making. This review concludes that the optimal path forward lies in human-AI collaboration, where LLMs augment rather than replace clinical expertise, ensuring both technological advancement and patient safety. These findings have important implications for healthcare providers, medical educators, and policymakers as they navigate the integration of AI technologies in medical practice.

  • Research Article
  • Cited by 162
  • 10.1001/jamanetworkopen.2023.43689
Leveraging Large Language Models for Decision Support in Personalized Oncology
  • Nov 17, 2023
  • JAMA network open
  • Manuela Benary + 12 more

Clinical interpretation of complex biomarkers for precision oncology currently requires manual investigations of previous studies and databases. Conversational large language models (LLMs) might be beneficial as automated tools for assisting clinical decision-making. To assess performance and define their role using 4 recent LLMs as support tools for precision oncology. This diagnostic study examined 10 fictional cases of patients with advanced cancer with genetic alterations. Each case was submitted to 4 different LLMs (ChatGPT, Galactica, Perplexity, and BioMedLM) and 1 expert physician to identify personalized treatment options in 2023. Treatment options were masked and presented to a molecular tumor board (MTB), whose members rated the likelihood of a treatment option coming from an LLM on a scale from 0 to 10 (0, extremely unlikely; 10, extremely likely) and decided whether the treatment option was clinically useful. Number of treatment options, precision, recall, F1 score of LLMs compared with human experts, recognizability, and usefulness of recommendations. For 10 fictional cancer patients (4 with lung cancer, 6 with other; median [IQR] 3.5 [3.0-4.8] molecular alterations per patient), a median (IQR) number of 4.0 (4.0-4.0) compared with 3.0 (3.0-5.0), 7.5 (4.3-9.8), 11.5 (7.8-13.0), and 13.0 (11.3-21.5) treatment options each was identified by the human expert and 4 LLMs, respectively. When considering the expert as a criterion standard, LLM-proposed treatment options reached F1 scores of 0.04, 0.17, 0.14, and 0.19 across all patients combined. Combining treatment options from different LLMs allowed a precision of 0.29 and a recall of 0.29 for an F1 score of 0.29. LLM-generated treatment options were recognized as AI-generated with a median (IQR) 7.5 (5.3-9.0) points in contrast to 2.0 (1.0-3.0) points for manually annotated cases. A crucial reason for identifying AI-generated treatment options was insufficient accompanying evidence. 
For each patient, at least 1 LLM generated a treatment option that was considered helpful by MTB members. Two unique useful treatment options (including 1 unique treatment strategy) were identified only by LLM. In this diagnostic study, treatment options of LLMs in precision oncology did not reach the quality and credibility of human experts; however, they generated helpful ideas that might have complemented established procedures. Considering technological progress, LLMs could play an increasingly important role in assisting with screening and selecting relevant biomedical literature to support evidence-based, personalized treatment decisions.

  • Research Article
  • Cited by 1
  • 10.2196/65226
Use of Large Language Models to Classify Epidemiological Characteristics in Synthetic and Real-World Social Media Posts About Conjunctivitis Outbreaks: Infodemiology Study.
  • Aug 9, 2024
  • Journal of medical Internet research
  • Michael S Deiner + 10 more

The use of web-based search and social media can help identify epidemics, potentially earlier than clinical methods or even potentially identifying unreported outbreaks. Monitoring for eye-related epidemics, such as conjunctivitis outbreaks, can facilitate early public health intervention to reduce transmission and ocular comorbidities. However, monitoring social media content for conjunctivitis outbreaks is costly and laborious. Large language models (LLMs) could overcome these barriers by assessing the likelihood that real-world outbreaks are being described. However, public health actions for likely outbreaks could benefit more by knowing additional epidemiological characteristics, such as outbreak type, size, and severity. We aimed to assess whether and how well LLMs can classify epidemiological features from social media posts beyond conjunctivitis outbreak probability, including outbreak type, size, severity, etiology, and community setting. We used a validation framework comparing LLM classifications to those of other LLMs and human experts. We wrote code to generate synthetic conjunctivitis outbreak social media posts, embedded with specific preclassified epidemiological features to simulate various infectious eye disease outbreak and control scenarios. We used these posts to develop effective LLM prompts and test the capabilities of multiple LLMs. For top-performing LLMs, we gauged their practical utility in real-world epidemiological surveillance by comparing their assessments of Twitter/X, forum, and YouTube conjunctivitis posts. Finally, human raters also classified the posts, and we compared their classifications to those of a leading LLM for validation. Comparisons entailed correlation or sensitivity and specificity statistics. We assessed 7 LLMs for effectively classifying epidemiological data from 1152 synthetic posts, 370 Twitter/X posts, 290 forum posts, and 956 YouTube posts. 
Despite some discrepancies, the LLMs demonstrated a reliable capacity for nuanced epidemiological analysis across various data sources and compared to humans or between LLMs. Notably, GPT-4 and Mixtral 8x22b exhibited high performance, predicting conjunctivitis outbreak characteristics such as probability (GPT-4: correlation=0.73), size (Mixtral 8x22b: correlation=0.82), and type (infectious, allergic, or environmentally caused); however, there were notable exceptions. Assessing synthetic and real-world posts for etiological factors, infectious eye disease specialist validations revealed that GPT-4 had high specificity (0.83-1.00) but variable sensitivity (0.32-0.71). Interrater reliability analyses showed that LLM-expert agreement exceeded expert-expert agreement for severity assessment (intraclass correlation coefficient=0.69 vs 0.38), while agreement varied by condition type (κ=0.37-0.94). This investigation into the potential of LLMs for public health infoveillance suggests effectiveness in classifying key epidemiological characteristics from social media content about conjunctivitis outbreaks. Future studies should further explore LLMs' potential to support public health monitoring through the automated assessment and classification of potential infectious eye disease or other outbreaks. Their optimal role may be to act as a first line of documentation, alerting public health organizations for the follow-up of LLM-detected and -classified small, early outbreaks, with a focus on the most severe ones.
