Structural insights into clinical large language models and their barriers to translational readiness.

Abstract

Despite rapid integration into clinical decision-making, clinical large language models (LLMs) face substantial translational barriers due to insufficient structural characterization and limited external validation. We systematically map the clinical LLM research landscape to identify key structural patterns influencing their readiness for real-world clinical deployment. We identified 73 clinical LLM studies published between January 2020 and March 2025 using a structured evidence-mapping approach. To ensure transparency and reproducibility in study selection, we followed key principles from the PRISMA 2020 framework. Each study was categorized by clinical task, base architecture, alignment strategy, data type, language, study design, validation methods, and evaluation metrics. Studies often addressed multiple early-stage clinical tasks, namely question answering (56.2%), knowledge structuring (31.5%), and disease prediction (43.8%), primarily using text data (52.1%) and English-language resources (80.8%). GPT models favored retrieval-augmented generation (43.8%), and LLaMA models consistently adopted multistage pretraining and fine-tuning strategies. Only 6.9% of studies included external validation, and prospective designs were observed in just 4.1% of cases, reflecting significant gaps in translational reliability. Evaluations were predominantly quantitative only (79.5%), though qualitative and mixed-method approaches are increasingly recognized for assessing clinical usability and trustworthiness. Clinical LLM research remains exploratory, marked by limited generalizability across languages, data types, and clinical environments. To bridge this gap, future studies must prioritize multilingual and multimodal training, prospective study designs with rigorous external validation, and hybrid evaluation frameworks combining quantitative performance with qualitative clinical usability metrics.

Similar Papers
  • Research Article
  • 10.1016/j.jbi.2026.105034
A study of large language models for patient information extraction: Model architecture, fine-tuning strategy, and multi-task instruction tuning.
  • Mar 1, 2026
  • Journal of biomedical informatics
  • Cheng Peng + 5 more


  • Research Article
  • 10.1182/blood-2025-6214
Evaluating artificial intelligence (AI) as a clinical decision support tool for AML patients
  • Nov 3, 2025
  • Blood
  • Ankushi Sanghvi + 5 more


  • Research Article
  • Citations: 115
  • 10.1038/s41591-024-03416-6
A generalist medical language model for disease diagnosis assistance.
  • Jan 8, 2025
  • Nature medicine
  • Xiaohong Liu + 23 more

The delivery of accurate diagnoses is crucial in healthcare and represents the gateway to appropriate and timely treatment. Although recent large language models (LLMs) have demonstrated impressive capabilities in few-shot or zero-shot learning, their effectiveness in clinical diagnosis remains unproven. Here we present MedFound, a generalist medical language model with 176 billion parameters, pre-trained on a large-scale corpus derived from diverse medical text and real-world clinical records. We further fine-tuned MedFound to learn physicians' inferential diagnosis with a self-bootstrapping strategy-based chain-of-thought approach and introduced a unified preference alignment framework to align it with standard clinical practice. Extensive experiments demonstrate that our medical LLM outperforms other baseline LLMs and specialized models in in-distribution (common diseases), out-of-distribution (external validation) and long-tailed distribution (rare diseases) scenarios across eight specialties. Further ablation studies indicate the effectiveness of key components in our medical LLM training approach. We conducted a comprehensive evaluation of the clinical applicability of LLMs for diagnosis involving artificial intelligence (AI) versus physician comparison, AI-assistance study and human evaluation framework. Our proposed framework incorporates eight clinical evaluation metrics, covering capabilities such as medical record summarization, diagnostic reasoning and risk management. Our findings demonstrate the model's feasibility in assisting physicians with disease diagnosis as part of the clinical workflow.

  • Preprint Article
  • 10.2196/preprints.71916
Implementing Large Language Models in Health Care: Clinician-Focused Review With Interactive Guideline (Preprint)
  • Jan 29, 2025
  • Hongyi Li + 2 more

BACKGROUND Large language models (LLMs) can generate outputs understandable by humans, such as answers to medical questions and radiology reports. With the rapid development of LLMs, clinicians face a growing challenge in determining the most suitable algorithms to support their work. OBJECTIVE We aimed to provide clinicians and other health care practitioners with systematic guidance in selecting an LLM that is relevant and appropriate to their needs and facilitate the integration process of LLMs in health care. METHODS We conducted a literature search of full-text publications in English on clinical applications of LLMs published between January 1, 2022, and March 31, 2025, on PubMed, ScienceDirect, Scopus, and IEEE Xplore. We excluded papers from journals below a set citation threshold, as well as papers that did not focus on LLMs, were not research based, or did not involve clinical applications. We also conducted a literature search on arXiv within the same investigated period and included papers on the clinical applications of innovative multimodal LLMs. This led to a total of 270 studies. RESULTS We collected 330 LLMs and recorded their application frequency in clinical tasks and frequency of best performance in their context. On the basis of a 5-stage clinical workflow, we found that stages 2, 3, and 4 are key stages in the clinical workflow, involving numerous clinical subtasks and LLMs. However, the diversity of LLMs that may perform optimally in each context remains limited. GPT-3.5 and GPT-4 were the most versatile models in the 5-stage clinical workflow, applied to 52% (29/56) and 71% (40/56) of the clinical subtasks, respectively, and they performed best in 29% (16/56) and 54% (30/56) of the clinical subtasks, respectively. General-purpose LLMs may not perform well in specialized areas as they often require lightweight prompt engineering methods or fine-tuning techniques based on specific datasets to improve model performance. 
Most LLMs with multimodal abilities are closed-source models and therefore lack transparency, model customization, and fine-tuning options for specific clinical tasks; they may also pose challenges regarding data protection and privacy, which are common requirements in clinical settings. CONCLUSIONS In this review, we found that LLMs may help clinicians in a variety of clinical tasks. However, we did not find evidence of generalist clinical LLMs successfully applicable to a wide range of clinical tasks. Therefore, their clinical deployment remains challenging. On the basis of this review, we propose an interactive online guideline for clinicians to select suitable LLMs by clinical task. With a clinical perspective and free of unnecessary technical jargon, this guideline may be used as a reference to successfully apply LLMs in clinical settings.

  • Research Article
  • Citations: 2
  • 10.2196/67469
Large Language Models in Randomized Controlled Trials Design: Observational Study
  • Sep 3, 2025
  • Journal of Medical Internet Research
  • Liyuan Jin + 6 more

Background: Randomized controlled trials (RCTs) face challenges such as limited generalizability, insufficient recruitment diversity, and high failure rates, often due to restrictive eligibility criteria and inefficient patient selection. Large language models (LLMs) have shown promise in various clinical tasks, but their potential role in RCT design remains underexplored. Objective: This study investigates the ability of LLMs, specifically GPT-4-Turbo-Preview, to assist in designing RCTs that enhance generalizability, recruitment diversity, and reduce failure rates, while maintaining clinical safety and ethical standards. Methods: We conducted a noninterventional, observational study analyzing 20 parallel-arm RCTs, comprising 10 completed and 10 registered studies published after January 2024 to mitigate pretraining biases. The LLM was tasked with generating RCT designs based on input criteria, including eligibility, recruitment strategies, interventions, and outcomes. The accuracy of LLM-generated designs was quantitatively assessed by 2 independent clinical experts by comparing them to clinically validated ground truth data from ClinicalTrials.gov. We conducted statistical analysis using natural language processing–based methods, including Bilingual Evaluation Understudy (BLEU), Recall-Oriented Understudy for Gisting Evaluation (ROUGE)-L, and Metric for Evaluation of Translation with Explicit ORdering (METEOR), for objective scoring on corresponding LLM outputs. Qualitative assessments were performed using Likert scale ratings (1-3) for domains such as safety, clinical accuracy, objectivity or bias, pragmatism, inclusivity, and diversity. Results: The LLM achieved an overall accuracy of 72% in replicating RCT designs. Recruitment and intervention designs demonstrated high agreement with the ground truth, achieving 88% and 93% accuracy, respectively. However, LLMs showed lower accuracy in designing eligibility criteria (55%) and outcomes measurement (53%). 
Natural language processing statistical analysis reported BLEU=0.04, ROUGE-L=0.20, and METEOR=0.18 on average objective scoring of LLM outputs. Qualitative evaluations showed that LLM-generated designs scored above 2 points and closely matched the original designs in scores across all domains, indicating strong clinical alignment. Specifically, both original and LLM-based designs ranked similarly high in safety, clinical accuracy, and objectivity or bias in published RCTs. Moreover, LLM-based design ranked noninferior to original designs in registered RCTs in multiple domains. In particular, LLMs enhanced diversity and pragmatism, which are key factors in improving RCT generalizability and addressing failure rates. Conclusions: LLMs, such as GPT-4-Turbo-Preview, have demonstrated potential in improving RCT design, particularly in recruitment and intervention planning, while enhancing generalizability and addressing diversity. However, expert oversight and regulatory measures are essential to ensure patient safety and ethical standards. The findings support further integration of LLMs into clinical trial design, although continued refinement is necessary to address limitations in eligibility and outcomes measurement.

  • Research Article
  • 10.3389/frai.2025.1669896
LLMCARE: early detection of cognitive impairment via transformer models enhanced by LLM-generated synthetic data
  • Nov 6, 2025
  • Frontiers in Artificial Intelligence
  • Ali Zolnour + 15 more

Background: Alzheimer’s disease and related dementias (ADRD) affect nearly five million older adults in the United States, yet more than half remain undiagnosed. Speech-based natural language processing (NLP) provides a scalable approach to identify early cognitive decline by detecting subtle linguistic markers that may precede clinical diagnosis. Objective: This study aims to develop and evaluate a speech-based screening pipeline that integrates transformer-based embeddings with handcrafted linguistic features, incorporates synthetic augmentation using large language models (LLMs), and benchmarks unimodal and multimodal LLM classifiers. External validation was performed to assess generalizability to an MCI-only cohort. Methods: Transcripts were obtained from the ADReSSo 2021 benchmark dataset (n = 237; derived from the Pitt Corpus, DementiaBank) and the DementiaBank Delaware corpus (n = 205; clinically diagnosed mild cognitive impairment [MCI] vs. controls). Audio was automatically transcribed using Amazon Web Services Transcribe (general model). Ten transformer models were evaluated under three fine-tuning strategies. A late-fusion model combined embeddings from the best-performing transformer with 110 linguistically derived features. Five LLMs (LLaMA-8B/70B, MedAlpaca-7B, Ministral-8B, GPT-4o) were fine-tuned to generate label-conditioned synthetic speech for data augmentation. Three multimodal LLMs (GPT-4o, Qwen-Omni, Phi-4) were tested in zero-shot and fine-tuned settings. Results: On the ADReSSo dataset, the fusion model achieved an F1-score of 83.32 (AUC = 89.48), outperforming both transformer-only and linguistic-only baselines. Augmentation with MedAlpaca-7B synthetic speech improved performance to F1 = 85.65 at 2 × scale, whereas higher augmentation volumes reduced gains. Fine-tuning improved unimodal LLM classifiers (e.g., MedAlpaca-7B, F1 = 47.73 → 78.69), while multimodal models demonstrated lower performance (Phi-4 = 71.59; GPT-4o omni = 67.57). 
On the Delaware corpus, the pipeline generalized to an MCI-only cohort, with the fusion model plus 1 × MedAlpaca-7B augmentation achieving F1 = 72.82 (AUC = 69.57). Conclusion: Integrating transformer embeddings with handcrafted linguistic features enhances ADRD detection from speech. Distributionally aligned LLM-generated narratives provide effective but bounded augmentation, while current multimodal models remain limited. Crucially, validation on the Delaware corpus demonstrates that the proposed pipeline generalizes to early-stage impairment, supporting its potential as a scalable approach for clinically relevant early screening. All code for LLMCARE is publicly available on GitHub.

  • Research Article
  • Citations: 13
  • 10.1016/j.jbi.2024.104707
On the role of the UMLS in supporting diagnosis generation proposed by Large Language Models
  • Aug 13, 2024
  • Journal of Biomedical Informatics
  • Majid Afshar + 4 more


  • Research Article
  • Citations: 3
  • 10.1111/jep.70129
How do GPs Want Large Language Models to be Applied in Primary Care, and What Are Their Concerns? A Cross-Sectional Survey.
  • May 14, 2025
  • Journal of evaluation in clinical practice
  • Richard C Armitage

Although the potential utility of large language models (LLMs) in medicine and healthcare is substantial, no assessment has been made to date of how GPs want LLMs to be applied in primary care, or of which issues GPs are most concerned about regarding the implementation of LLMs into their clinical practice. This study's objective was to generate preliminary evidence that answers these questions, which are relevant because GPs themselves will ultimately harness the power of LLMs in primary care. Non-probability sampling was utilised: GPs practicing in the UK who were members of one of two Facebook groups (one containing a community of UK primary care staff, the other containing a community of GMC-registered doctors in the UK) were invited to complete an online survey, which ran from 06 to 13 November 2024. The survey received 113 responses, 107 of which were from GPs practicing in the UK. When LLM accuracy and safety were assumed to be guaranteed, broad enthusiasm for LLMs carrying out various nonclinical and clinical tasks in primary care was reported. The single nonclinical task and clinical task that respondents were most supportive of were the LLM listening to the consultation and writing notes in real time for the GP to review, edit, and save (44.0%), and the LLM identifying outstanding clinical tasks and actioning them (51.0%), respectively. Respondents were concerned with a range of issues regarding LLMs being embedded into clinical systems, with patient safety being the most commonly reported single issue of concern (36.2%). This study has generated preliminary evidence that is of potential utility to those developing LLMs for use in primary care. Further research is required to expand this evidence base to further inform the development of these technologies, and to ensure they are acceptable to the GPs who will use them.

  • Research Article
  • 10.1093/ndt/gfae069.792
#2924 Comparison of large language models and traditional natural language processing techniques in predicting arteriovenous fistula failure
  • May 23, 2024
  • Nephrology Dialysis Transplantation
  • Suman Lama + 6 more

Background and Aims Large language models (LLMs) have gained significant attention in the field of natural language processing (NLP), marking a shift from traditional techniques like Term Frequency-Inverse Document Frequency (TF-IDF). We developed a traditional NLP model to predict arteriovenous fistula (AVF) failure within the next 30 days using clinical notes. The goal of this analysis was to investigate whether LLMs would outperform traditional NLP techniques, specifically in the context of predicting AVF failure within the next 30 days using clinical notes. Method We defined AVF failure as the change in status from active to permanently unusable status or temporarily unusable status. We used data from a large kidney care network from January 2021 to December 2021. Two models were created, one using an LLM and one using the traditional TF-IDF technique. We used “distilbert-base-uncased”, a distilled version of the BERT base model [1], and compared its performance with traditional TF-IDF-based NLP techniques. The dataset was randomly divided into 60% training, 20% validation and 20% test datasets. The test data, comprising unseen patients’ data, was used to evaluate the performance of the model. Both models were evaluated using metrics such as area under the receiver operating curve (AUROC), accuracy, sensitivity, and specificity. Results The 30-day incidence of AVF failure was 2.3% in the population. The LLM and traditional models showed similar overall performance, as summarized in Table 1. Notably, LLMs showed marginally better performance in certain evaluation metrics. Both models had the same AUROC of 0.64 on test data. The accuracy and balanced accuracy for LLMs were 72.9% and 59.7%, respectively, compared to 70.9% and 59.6% for the traditional TF-IDF approach. In terms of specificity, LLMs scored 73.2%, slightly higher than the 71.2% observed for traditional NLP methods. However, LLMs had a lower sensitivity of 46.1% compared to 48% for traditional NLP. 
However, it is worth noting that training the LLM took considerably longer than training the TF-IDF model. It also required more computational resources, such as graphics processing unit (GPU) instances in cloud-based services, leading to higher cost. Conclusion In our study, we discovered that advanced LLMs perform comparably to traditional TF-IDF modeling techniques in predicting the failure of AVF. Both models demonstrated identical AUROC. While specificity was higher in LLMs compared to traditional NLP, sensitivity was higher in traditional NLP compared to LLMs. The LLM was fine-tuned with a limited dataset, which could have influenced its performance to be similar to that of traditional NLP methods. This finding suggests that while LLMs may excel in certain scenarios, such as performing in-depth sentiment analysis of patient data for complex tasks, their effectiveness is highly dependent on the specific use case. It is crucial to weigh the benefits against the resources required for LLMs, as they can be significantly more resource-intensive and costly compared to traditional TF-IDF methods. This highlights the importance of a use-case-driven approach in selecting the appropriate NLP technique for healthcare applications.
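
The balanced accuracy figures in the abstract above can be cross-checked against its reported sensitivity and specificity. The sketch below assumes the standard binary-classification definition (the mean of sensitivity and specificity), which the abstract does not state explicitly. The function name `balanced_accuracy` is illustrative and not taken from the study.

```python
def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Mean of sensitivity and specificity, both given in percent.

    Assumption: the study used this standard binary definition.
    """
    return (sensitivity + specificity) / 2


# Sensitivity and specificity as reported in the abstract (percent).
llm = balanced_accuracy(sensitivity=46.1, specificity=73.2)
tfidf = balanced_accuracy(sensitivity=48.0, specificity=71.2)

print(f"LLM balanced accuracy:    {llm:.2f}%")
print(f"TF-IDF balanced accuracy: {tfidf:.2f}%")
```

Under that assumption, both computed values agree with the reported 59.7% and 59.6% to within rounding, which illustrates why the two models look nearly identical once the 2.3% event rate is taken into account.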

  • Research Article
  • Citations: 11
  • 10.1200/cci-24-00230
Large Language Models as Decision-Making Tools in Oncology: Comparing Artificial Intelligence Suggestions and Expert Recommendations.
  • Mar 1, 2025
  • JCO clinical cancer informatics
  • Loic Ah-Thiane + 11 more

To determine the accuracy of large language models (LLMs) in generating appropriate treatment options for patients with early breast cancer (BC) on the basis of their medical records. Retrospective study using anonymized medical records of patients with BC presented during multidisciplinary team meetings (MDTs) between January and April 2024. Three generalist artificial intelligence models (Claude3-Opus, GPT4-Turbo, and LLaMa3-70B) were used to generate treatment suggestions, which were compared with experts' decisions. The primary outcome was the rate of appropriate suggestions from the LLMs, compared with the reference experts' decisions. The secondary outcome was the LLMs' performances (F1 score and specificity) in generating appropriate suggestions for each treatment category. The rates of appropriate suggestions were 86.6% (97/112), 85.7% (96/112), and 75.0% (84/112) for Claude3-Opus, GPT4-Turbo, and LLaMa3-70B, respectively. No significant difference was found between Claude3-Opus and GPT4-Turbo (P = .85), but both tended to perform better than LLaMa3-70B (P = .027 and P = .043, respectively). LLMs showed high accuracy for adjuvant endocrine therapy and targeted therapy indications. However, they tended to overestimate the need for adjuvant radiotherapy and had variable performances in suggesting adjuvant chemotherapy and genomic tests. LLMs, particularly Claude3-Opus and GPT4-Turbo, demonstrated promising accuracy in suggesting appropriate adjuvant treatments for patients with early BC on the basis of their medical records. Although LLMs showed limitations in validating surgery and indicating genomic tests, their performance in other treatment modalities highlights their potential to automate and augment decision making during MDTs. Further studies with fine-tuned LLMs and a prospective design are needed to demonstrate their utility in clinical practice.

  • Research Article
  • Citations: 9
  • 10.1609/aaai.v37i13.26879
Exploring Social Biases of Large Language Models in a College Artificial Intelligence Course
  • Jun 26, 2023
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Skylar Kolisko + 1 more

Large neural network-based language models play an increasingly important role in contemporary AI. Although these models demonstrate sophisticated text generation capabilities, they have also been shown to reproduce harmful social biases contained in their training data. This paper presents a project that guides students through an exploration of social biases in large language models. As a final project for an intermediate college course in Artificial Intelligence, students developed a bias probe task for a previously-unstudied aspect of sociolinguistic or sociocultural bias they were interested in exploring. Through the process of constructing a dataset and evaluation metric to measure bias, students mastered key technical concepts, including how to run contemporary neural networks for natural language processing tasks; construct datasets and evaluation metrics; and analyze experimental results. Students reported their findings in an in-class presentation and a final report, recounting patterns of predictions that surprised, unsettled, and sparked interest in advocating for technology that reflects a more diverse set of backgrounds and experiences. Through this project, students engage with and even contribute to a growing body of scholarly work on social biases in large language models.

  • Research Article
  • Citations: 1
  • 10.1016/j.jbi.2025.104931
From image to report: automating lung cancer screening interpretation and reporting with vision-language models.
  • Nov 1, 2025
  • Journal of biomedical informatics
  • Tien-Yu Chang + 16 more


  • Supplementary Content
  • Citations: 133
  • 10.2196/52597
Large Language Models and Empathy: Systematic Review
  • Dec 11, 2024
  • Journal of Medical Internet Research
  • Vera Sorin + 6 more

Background: Empathy, a fundamental aspect of human interaction, is characterized as the ability to experience another being’s emotions within oneself. In health care, empathy is fundamental to the interaction between health care professionals and patients. It is a quality unique to humans that large language models (LLMs) are believed to lack. Objective: We aimed to review the literature on the capacity of LLMs in demonstrating empathy. Methods: We conducted a literature search on MEDLINE, Google Scholar, PsyArXiv, medRxiv, and arXiv between December 2022 and February 2024. We included English-language full-length publications that evaluated empathy in LLMs’ outputs. We excluded papers evaluating other topics related to emotional intelligence that were not specifically empathy. The included studies’ results, including the LLMs used, performance in empathy tasks, and limitations of the models, along with studies’ metadata were summarized. Results: A total of 12 studies published in 2023 met the inclusion criteria. ChatGPT-3.5 (OpenAI) was evaluated in all studies, with 6 studies comparing it with other LLMs such as GPT-4, LLaMA (Meta), and fine-tuned chatbots. Seven studies focused on empathy within a medical context. The studies reported LLMs to exhibit elements of empathy, including emotion recognition and emotional support in diverse contexts. Evaluation metrics included automatic metrics such as Recall-Oriented Understudy for Gisting Evaluation and Bilingual Evaluation Understudy, and human subjective evaluation. Some studies compared performance on empathy with humans, while others compared between different models. In some cases, LLMs were observed to outperform humans in empathy-related tasks. For example, ChatGPT-3.5 was evaluated for its responses to patients’ questions from social media, where ChatGPT’s responses were preferred over those of humans in 78.6% of cases. Other studies used subjective readers’ assigned scores. 
One study reported a mean empathy score of 1.84-1.9 (scale 0-2) for their fine-tuned LLM, while a different study evaluating ChatGPT-based chatbots reported a mean human rating of 3.43 out of 4 for empathetic responses. Other evaluations were based on the level of the emotional awareness scale, which was reported to be higher for ChatGPT-3.5 than for humans. Another study evaluated ChatGPT and GPT-4 on soft-skills questions in the United States Medical Licensing Examination, where GPT-4 answered 90% of questions correctly. Limitations were noted, including repetitive use of empathic phrases, difficulty following initial instructions, overly lengthy responses, sensitivity to prompts, and overall subjective evaluation metrics influenced by the evaluator’s background. Conclusions: LLMs exhibit elements of cognitive empathy, recognizing emotions and providing emotionally supportive responses in various contexts. Since social skills are an integral part of intelligence, these advancements bring LLMs closer to human-like interactions and expand their potential use in applications requiring emotional intelligence. However, there remains room for improvement in both the performance of these models and the evaluation strategies used for assessing soft skills.

  • Research Article
  • Citations: 1
  • 10.1177/20552076251342078
Reference decisions enhance LLM performance, amplified by source disclosure
  • Apr 1, 2025
  • DIGITAL HEALTH
  • Yongxiang Zhang + 4 more

Objective The rapid integration of large language models (LLMs) has propelled advancements in automated dialog technologies, improving the public's access to healthcare services. Drawing inspiration from the collaborative decision-making practices of medical professionals in complex cases, we investigated whether LLMs could enhance their diagnostic accuracy through interaction. Methods An experimental study was conducted in China (September–December 2024) to investigate the impact of LLM-generated reference decisions and source disclosure on LLMs’ diagnostic performance. We used a Chinese clinical diagnostic task in a controlled comparative design, where three Chinese LLMs interpreted symptoms and conditions based on patient queries. LLMs’ outcomes were evaluated through accuracy and weighted F1 score metrics, with statistical analysis to determine significance. Results Analysis of variance on LLMs’ diagnostic accuracy scores demonstrated that incorporating LLM-generated decisions as a reference significantly improved diagnostic outcomes, with source disclosure amplifying this improvement. Conclusion Our findings underscore the potential of LLM collaboration in healthcare, offering strategies to refine response generation and decision-making across various applications.

  • Research Article
  • 10.2196/85169
Comparing Large Language Models and Traditional Machine Translation Tools for Translating Medical Consultation Summaries: Quantitative Pilot Feasibility Study.
  • Apr 13, 2026
  • JMIR formative research
  • Andy Li + 4 more

Translation of medical consultation summaries is essential for equitable health care communication in culturally and linguistically diverse populations. While machine translation (MT) tools and large language models (LLMs) are widely accessible, their feasibility and safety for health care contexts remain underexplored. This pilot study investigates the feasibility and limitations of using LLMs and traditional MT tools to translate medical consultation summaries from English into the most common languages other than English spoken in Australia: Arabic, Chinese (simplified written form), and Vietnamese. Two simulated summaries (a simple patient-facing summary and a complex clinician-oriented interprofessional letter) were translated using 3 LLMs (GPT-4o, Llama-3.1, and Gemma-2) and 3 MT tools (Google Translate, Microsoft Bing Translator, and DeepL). Translations were benchmarked against professional third-party interpreter translations using Bilingual Evaluation Understudy, Character-level F-score, and Metric for Evaluation of Translation with Explicit Ordering metrics. The translation performance varied across languages, tools, and summary complexity when assessed using automatic evaluation metrics. Traditional MT tools outperformed LLMs on surface-level metrics, while LLMs showed relative strengths in semantic similarity for Vietnamese and Chinese. Arabic translations improved with complex input, suggesting morphological advantages. The metric-based evaluation highlighted feasibility but also risks, particularly in Chinese clinical contexts. This pilot study provides formative evidence of opportunities and limitations in applying artificial intelligence translation for health care communication. Findings underscore the importance of human oversight; domain-specific evaluation metrics; and further formative and clinical research to guide the safe, equitable use of artificial intelligence translation tools.
