Evaluating the Clinical Competence of Large Language Models in Prostate Cancer Management: A Comparative Study of DeepSeek-R1 and ChatGPT.

Abstract

Large language models (LLMs) have gained prominence in medical applications, yet their performance in specialized clinical tasks remains underexplored. Prostate cancer, a complex malignancy requiring guideline-based management, presents a rigorous testbed for evaluating artificial intelligence (AI)-assisted decision-making. This study compared the clinical accuracy, reasoning ability, and language quality of DeepSeek-R1 and ChatGPT variants in addressing prostate cancer diagnosis and treatment. A dataset of 98 prostate cancer multiple-choice questions from MedQA, MedMCQA, and China's National Medical Licensing Examination was constructed, alongside three real-world clinical cases. Responses were generated by five LLMs (DeepSeek-V3, DeepSeek-R1, ChatGPT-4o, ChatGPT-o3, and ChatGPT-o4-mini) and evaluated for accuracy across three repeated runs. For case-based simulations, only R1 and o3 were compared with practicing urologists. A Clinical Decision Quality Assessment Scale (CDQAS) assessed outputs across four domains: readability, medical knowledge accuracy, diagnostic test appropriateness, and logical coherence. Blinded scoring was performed by senior urologic oncologists. Statistical analyses used one-way ANOVA in GraphPad Prism v10.1.2 (Boston, Massachusetts, USA). DeepSeek-R1 achieved the highest accuracy (96.60%) on multiple-choice tasks, significantly outperforming the other models (p < 0.05 to p < 0.0001). In simulated case evaluations, both R1 and o3 performed comparably with physicians in overall readability and diagnostic appropriateness. Whereas R1 demonstrated superior guideline compliance and evidence-based reasoning, o3 showed advantages in workflow clarity, sequencing, and response fluency. In addition, o3 generated fewer explicit errors than R1. Human clinicians maintained strengths in terminology precision and logical reasoning. DeepSeek-R1 and ChatGPT-o3 exhibit complementary strengths in prostate cancer clinical decision-making, with R1 favoring factual accuracy and o3 excelling in expressive clarity. Although both models approach human-level performance in structured evaluations, human oversight and continued domain-specific optimization remain essential for their safe and effective integration into clinical workflows.
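The abstract describes scoring accuracy over three repeated runs per model and comparing models with one-way ANOVA in GraphPad Prism. The sketch below is only a hedged illustration of that kind of comparison in Python: the model names follow the abstract, but the per-run accuracy values are hypothetical placeholders, not the study's data.

```python
# Minimal sketch of a one-way ANOVA across repeated runs of several models.
# Accuracy values are hypothetical; the study itself used GraphPad Prism, and
# scipy.stats.f_oneway is shown only as an equivalent computation.
from scipy import stats

per_run_accuracy = {
    "DeepSeek-R1":     [0.97, 0.96, 0.97],   # hypothetical values
    "DeepSeek-V3":     [0.91, 0.92, 0.90],
    "ChatGPT-4o":      [0.88, 0.87, 0.89],
    "ChatGPT-o3":      [0.90, 0.91, 0.90],
    "ChatGPT-o4-mini": [0.85, 0.86, 0.84],
}

f_stat, p_value = stats.f_oneway(*per_run_accuracy.values())
print(f"one-way ANOVA: F = {f_stat:.2f}, p = {p_value:.4g}")
```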

Similar Papers
  • Research Article
  • 10.2196/77978
Multiple Large Language Models’ Performance on the Chinese Medical Licensing Examination: Quantitative Comparative Study
  • Dec 16, 2025
  • JMIR Human Factors
  • Yanyu Diao + 3 more

Background: ChatGPT excels in natural language tasks, but its performance in the Chinese National Medical Licensing Examination (NMLE) and Chinese medical education remains underexplored. Meanwhile, Chinese corpus–based large language models (LLMs) such as ERNIE Bot, Tongyi Qianwen, Doubao, and DeepSeek have emerged, yet their effectiveness in the NMLE awaits systematic evaluation. Objective: This study aimed to quantitatively compare the performance of 6 LLMs (GPT-3.5, GPT-4, ERNIE Bot, Tongyi Qianwen, Doubao, and DeepSeek) in answering NMLE questions from 2018 to 2024 and analyze their feasibility as supplementary tools in Chinese medical education. Methods: We selected questions from the 4 content units of the NMLE's General Written test (2018-2024), preprocessed image- and table-based content into standardized text, and input the questions into each model. We evaluated the accuracy, comprehensiveness, and logical coherence of the responses, with quantitative comparison centered on scores and accuracy rates against the official answer keys (passing score: 360/600). Results: GPT-4 outperformed GPT-3.5 across all units, achieving average accuracies of 66.57% (SD 3.21%; unit 1), 69.05% (SD 2.87%; unit 2), 71.71% (SD 2.53%; unit 3), and 80.67% (SD 2.19%; unit 4), with consistent scores above the passing threshold. Among the Chinese models, DeepSeek demonstrated the highest overall performance, with an average score of 454.8 (SD 17.3) and average accuracies of 73.2% (unit 1, SD 2.89%), 70.3% (unit 2, SD 3.02%), 71.5% (unit 3, SD 2.64%), and 78.2% (unit 4, SD 2.47%). ERNIE Bot (mean score 442.3, SD 19.6; unit 1 accuracy = 70.8%, SD 3.01%; unit 2 accuracy = 68.7%, SD 3.15%; unit 3 accuracy = 69.1%, SD 2.93%; unit 4 accuracy = 68.3%, SD 2.76%), Tongyi Qianwen (mean score 426.5, SD 21.4; unit 1 accuracy = 67.4%, SD 3.22%; unit 2 accuracy = 65.9%, SD 3.31%; unit 3 accuracy = 66.2%, SD 3.08%; unit 4 accuracy = 67.2%, SD 2.89%), and Doubao (mean score 413.7, SD 23.1; unit 1 accuracy = 65.2%, SD 3.45%; unit 2 accuracy = 63.8%, SD 3.52%; unit 3 accuracy = 64.1%, SD 3.27%; unit 4 accuracy = 62.8%, SD 3.11%) all exceeded the passing score. DeepSeek's overall average accuracy (75.8%, SD 2.73%) was significantly higher than those of the other Chinese models (χ²₁=11.4, P=.001 vs ERNIE Bot; χ²₁=28.7, P<.001 vs Tongyi Qianwen; χ²₁=45.3, P<.001 vs Doubao). GPT-4's overall average accuracy (77.0%, SD 2.58%) was slightly higher than that of DeepSeek, but the difference was not statistically significant (χ²₁=2.2, P=.14); both outperformed GPT-3.5 (overall accuracy = 68.5%, SD 3.67%; χ²₁=89.8, P<.001 for GPT-4 vs GPT-3.5; χ²₁=76.3, P<.001 for DeepSeek vs GPT-3.5). Conclusions: GPT-4 and Chinese-developed LLMs such as DeepSeek show potential as supplementary tools in Chinese medical education given their solid performance on the NMLE. However, further optimization is required for complex reasoning, multimodal processing, and dynamic knowledge updates, with human medical expertise remaining central to clinical practice and education.
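The abstract reports pairwise chi-square comparisons with one degree of freedom (e.g., DeepSeek vs. ERNIE Bot). As a hedged illustration only, the snippet below shows how such a test could be computed from a 2x2 correct/incorrect table; the counts are invented, not the study's item-level data.

```python
# Hedged illustration of a chi-square comparison of two models' accuracy.
# Counts are hypothetical; chi2_contingency on a 2x2 correct/incorrect table
# yields the chi-square statistic with 1 degree of freedom.
from scipy.stats import chi2_contingency

# rows: model A, model B; columns: correct, incorrect (hypothetical counts)
table = [[758, 242],
         [708, 292]]

chi2, p, dof, _ = chi2_contingency(table, correction=False)
print(f"chi-square({dof}) = {chi2:.1f}, p = {p:.3g}")
```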

  • Research Article
  • 10.3390/app151910696
Applied with Caution: Extreme-Scenario Testing Reveals Significant Risks in Using LLMs for Humanities and Social Sciences Paper Evaluation
  • Oct 3, 2025
  • Applied Sciences
  • Hua Liu + 2 more

The deployment of large language models (LLMs) in academic paper evaluation is increasingly widespread, yet their trustworthiness remains debated; to expose fundamental flaws often masked under conventional testing, this study employed extreme-scenario testing to systematically probe the lower performance boundaries of LLMs in assessing the scientific validity and logical coherence of papers from the humanities and social sciences (HSS). Through a highly credible quasi-experiment, 40 high-quality Chinese papers from philosophy, sociology, education, and psychology were selected, for which domain experts created versions with implanted “scientific flaws” and “logical flaws”. Three representative LLMs (GPT-4, DeepSeek, and Doubao) were evaluated against a baseline of 24 doctoral candidates, following a protocol progressing from ‘broad’ to ‘targeted’ prompts. Key findings reveal poor evaluation consistency, with significantly low intra-rater and inter-rater reliability for the LLMs, and limited flaw detection capability, as all models failed to distinguish between original and flawed papers under broad prompts, unlike human evaluators; although targeted prompts improved detection, LLM performance remained substantially inferior, particularly in tasks requiring deep empirical insight and logical reasoning. The study proposes that LLMs operate on a fundamentally different “task decomposition-semantic understanding” mechanism, relying on limited text extraction and shallow semantic comparison rather than the human process of “worldscape reconstruction → meaning construction and critique”, resulting in a critical inability to assess argumentative plausibility and logical coherence. It concludes that current LLMs possess fundamental limitations in evaluations requiring depth and critical thinking, are not reliable independent evaluators, and that over-trusting them carries substantial risks, necessitating rational human-AI collaborative frameworks, enhanced model adaptation through downstream alignment techniques like prompt engineering and fine-tuning, and improvements in general capabilities such as logical reasoning.
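The abstract emphasizes poor intra-rater and inter-rater reliability of the LLM evaluations. The sketch below is a hedged illustration of one common way such agreement is quantified, Cohen's kappa between two scoring passes; the flaw/no-flaw labels are hypothetical and not taken from the study.

```python
# Minimal sketch of an agreement check between two scoring passes (intra-rater)
# or two evaluators (inter-rater) using Cohen's kappa. Labels are hypothetical.
from sklearn.metrics import cohen_kappa_score

rater_a = ["flaw", "no_flaw", "flaw", "flaw", "no_flaw", "no_flaw", "flaw", "no_flaw"]
rater_b = ["flaw", "no_flaw", "no_flaw", "flaw", "no_flaw", "flaw", "flaw", "no_flaw"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa = {kappa:.2f}")  # low values reflect the poor consistency the study reports
```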

  • Research Article
  • Cited by: 19
  • 10.1186/s12909-024-06309-x
Performance of ChatGPT and Bard on the medical licensing examinations varies across different cultures: a comparison study
  • Nov 26, 2024
  • BMC Medical Education
  • Yikai Chen + 8 more

Background: This study aimed to evaluate the performance of GPT-3.5, GPT-4, GPT-4o and Google Bard on the United States Medical Licensing Examination (USMLE), the Professional and Linguistic Assessments Board (PLAB), the Hong Kong Medical Licensing Examination (HKMLE) and the National Medical Licensing Examination (NMLE). Methods: This study was conducted in June 2023. Four large language models (LLMs) (GPT-3.5, GPT-4, GPT-4o and Google Bard) were applied to four medical standardized tests (USMLE, PLAB, HKMLE and NMLE). All questions were multiple-choice questions sourced from the question banks of these examinations. Results: In USMLE Step 1, Step 2CK, and Step 3, accuracy rates were 91.5%, 94.2%, and 92.7% for GPT-4o; 93.2%, 95.0%, and 92.0% for GPT-4; 65.6%, 71.6%, and 68.5% for GPT-3.5; and 64.3%, 55.6%, and 58.1% for Google Bard, respectively. In PLAB, HKMLE and NMLE, GPT-4o scored 93.3%, 91.7% and 84.9%; GPT-4 scored 86.7%, 89.6% and 69.8%; GPT-3.5 scored 80.0%, 68.1% and 60.4%; and Google Bard scored 54.2%, 71.7% and 61.3%. There was a significant difference in the accuracy rates of the four LLMs across the four medical licensing examinations. Conclusion: GPT-4o performed better on the medical licensing examinations than the other three LLMs. The performance of the four models on the NMLE needs further improvement. Clinical trial number: Not applicable.

  • Research Article
  • Cited by: 2
  • 10.2196/73469
Evaluating the Performance of DeepSeek-R1 and DeepSeek-V3 Versus OpenAI Models in the Chinese National Medical Licensing Examination: Cross-Sectional Comparative Study
  • Nov 14, 2025
  • JMIR Medical Education
  • Weiping Wang + 3 more

Background: DeepSeek-R1, an open-source large language model (LLM), has generated significant global interest in recent months. Objective: This study aimed to compare the performance of DeepSeek and OpenAI LLMs on the Chinese National Medical Licensing Examination (NMLE) and evaluate their potential in medical education. Methods: This cross-sectional study assessed 2 DeepSeek models (DeepSeek-R1 and DeepSeek-V3), 3 OpenAI models (ChatGPT-o1 pro, ChatGPT-o3 mini, and GPT-4o), and 2 additional Chinese LLMs (ERNIE 4.5 Turbo and Qwen 3) using the 2021 NMLE. Model performance was evaluated based on overall accuracy, accuracy across question types (A1, A2, A3 and A4, and B1), case analysis and non–case analysis questions, medical specialties, and accuracy consensus between different model combinations. Results: All LLMs successfully passed the NMLE. DeepSeek-R1 achieved the highest accuracy (573/597, 96%), followed by DeepSeek-V3 (558/600, 93%), both of which significantly outperformed ChatGPT-o1 pro (450/600, 75%), ChatGPT-o3 mini (455/600, 75.8%), and GPT-4o (452/600, 75.3%; P<.001 for all comparisons). Performance disparities were consistent across question types (A1, A2, A3 and A4, and B1), case analysis and non–case analysis questions, different types of case analyses, and medical specialties. The accuracy consensus between DeepSeek-R1 and DeepSeek-V3 reached 97.7% (544/557), significantly outperforming DeepSeek-R1 alone (P=.04). Two additional Chinese LLMs, ERNIE 4.5 Turbo (572/600, 95.3%) and Qwen 3 (555/600, 92.5%), also performed significantly better than the 3 OpenAI models (all P<.001). Conclusions: This study demonstrates that DeepSeek-R1 and DeepSeek-V3 significantly outperform OpenAI models on the NMLE. DeepSeek models show promise as tools for medical education and exam preparation in the Chinese language.
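The "accuracy consensus" metric above restricts attention to questions where two models give the same answer and measures accuracy on that subset. The sketch below is a hedged, toy illustration of that idea; the answer lists and answer key are invented, not NMLE items.

```python
# Hedged sketch of consensus accuracy: accuracy computed only on questions
# where two models agree on the chosen option. Toy data, not study data.
def consensus_accuracy(answers_a, answers_b, key):
    agreed = [(a, k) for a, b, k in zip(answers_a, answers_b, key) if a == b]
    correct = sum(1 for a, k in agreed if a == k)
    return correct, len(agreed)

model_a = ["B", "C", "A", "D", "B", "A"]
model_b = ["B", "C", "A", "D", "C", "A"]
key     = ["B", "C", "A", "B", "B", "A"]

correct, n_agreed = consensus_accuracy(model_a, model_b, key)
print(f"consensus accuracy: {correct}/{n_agreed} = {correct / n_agreed:.1%}")
```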

  • Research Article
  • 10.1016/j.acra.2025.12.020
Limitations of Large Language Models in Assisting PI-RADS Scoring on Prostate Biparametric MRI Text Reports.
  • Jan 10, 2026
  • Academic radiology
  • Siying Zhang + 6 more


  • Research Article
  • Cited by: 18
  • 10.1186/s12894-024-01570-0
Performance of large language models (LLMs) in providing prostate cancer information
  • Aug 23, 2024
  • BMC Urology
  • Ahmed Alasker + 7 more

Purpose: The diagnosis and management of prostate cancer (PCa), the second most common cancer in men worldwide, are highly complex. Hence, patients often seek knowledge through additional resources, including AI chatbots such as ChatGPT and Google Bard. This study aimed to evaluate the performance of LLMs in providing education on PCa. Methods: Common patient questions about PCa were collected from reliable educational websites and evaluated for accuracy, comprehensiveness, readability, and stability by two independent board-certified urologists, with a third resolving discrepancies. Accuracy was measured on a 3-point scale, comprehensiveness on a 5-point Likert scale, and readability using the Flesch Reading Ease (FRE) score and the Flesch–Kincaid (FK) Grade Level. Results: A total of 52 questions on general knowledge, diagnosis, treatment, and prevention of PCa were provided to three LLMs. Although there was no significant difference in the overall accuracy of the LLMs, ChatGPT-3.5 demonstrated superiority over the other LLMs in terms of general knowledge of PCa (p = 0.018). ChatGPT-4 achieved greater overall comprehensiveness than ChatGPT-3.5 and Bard (p = 0.028). For readability, Bard generated simpler sentences with the highest FRE score (54.7, p < 0.001) and lowest FK reading level (10.2, p < 0.001). Conclusion: ChatGPT-3.5, ChatGPT-4, and Bard generate accurate, comprehensive, and easily readable PCa material. These AI models might not replace healthcare professionals but can assist in patient education and guidance.
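The readability metrics named above (Flesch Reading Ease and Flesch–Kincaid Grade Level) can be computed from word, sentence, and syllable counts. As a hedged illustration, the snippet below uses the textstat package; the sample answer text is invented, not a chatbot response from the study.

```python
# Small sketch of the readability metrics used above. textstat implements both
# the Flesch Reading Ease score and the Flesch-Kincaid grade level.
import textstat

answer = (
    "Prostate cancer is often slow growing. Your doctor may suggest active "
    "surveillance, surgery, or radiation depending on the stage and your overall health."
)

print("Flesch Reading Ease:", textstat.flesch_reading_ease(answer))    # higher = easier to read
print("Flesch-Kincaid grade:", textstat.flesch_kincaid_grade(answer))  # approximate US grade level
```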

  • Research Article
  • Cited by: 3
  • 10.1007/s00345-024-05423-1
The interaction of structured data using openEHR and large Language models for clinical decision support in prostate cancer.
  • Jan 13, 2025
  • World journal of urology
  • Philippe Kaiser + 8 more

Multidisciplinary teams (MDTs) are essential for cancer care but are resource-intensive. Decision-making processes within MDTs, while critical, contribute to increased healthcare costs due to the need for specialist time and coordination. The recent emergence of large language models (LLMs) offers the potential to improve the efficiency and accuracy of clinical decision-making processes, potentially reducing costs associated with traditional MDT models. We conducted a retrospective study of 171 consecutively treated patients with newly diagnosed prostate cancer. Relevant structured clinical data and the European Association of Urology (EAU) pocket guidelines were provided to two LLMs (ChatGPT-4, Claude-3-Opus). LLM treatment recommendations were compared to actual treatment recommendations of the MDT meeting (MDM). Both LLMs demonstrated an overall adherence of 93% with the MDT treatment recommendations. Discrepancies between LLM and MDT recommendations were observed in 15 cases (9%), primarily due to a lack of clinical information that could be provided to the LLMs. In 5 cases (3%), the LLM recommendations were not in line with EAU guidelines despite having access to all relevant information. Our findings provide evidence that LLMs can provide accurate treatment recommendations for newly diagnosed prostate cancer patients. LLMs have the potential to streamline MDT workflows, enabling specialists to focus on complex cases and patient-centered discussions. In this study, we explored the potential of artificial intelligence models called large language models (LLMs) to assist in treatment decision-making for prostate cancer patients. We found that LLMs, when provided with patient information and clinical guidelines, can recommend treatments that closely match those made by a team of cancer specialists, suggesting that LLMs could help streamline the decision-making process and potentially reduce healthcare costs.

  • Research Article
  • Cited by: 2
  • 10.1016/j.jacr.2025.06.036
Using a Large Language Model for Postdeployment Monitoring of FDA-Approved Artificial Intelligence: Pulmonary Embolism Detection Use Case.
  • Nov 1, 2025
  • Journal of the American College of Radiology : JACR
  • Vera Sorin + 8 more


  • Research Article
  • Cited by: 19
  • 10.1200/edbk_438516
Applications of Artificial Intelligence in Prostate Cancer Care: A Path to Enhanced Efficiency and Outcomes
  • Jun 1, 2024
  • American Society of Clinical Oncology Educational Book
  • Irbaz Bin Riaz + 4 more

The landscape of prostate cancer care has rapidly evolved. We have transitioned from the use of conventional imaging, radical surgeries, and single-agent androgen deprivation therapy to an era of advanced imaging, precision diagnostics, genomics, and targeted treatment options. Concurrently, the emergence of large language models (LLMs) has dramatically transformed the paradigm for artificial intelligence (AI). This convergence of advancements in prostate cancer management and AI provides a compelling rationale to comprehensively review the current state of AI applications in prostate cancer care. Here, we review the advancements in AI-driven applications across the continuum of the journey of a patient with prostate cancer from early interception to survivorship care. We subsequently discuss the role of AI in prostate cancer drug discovery, clinical trials, and clinical practice guidelines. In the localized disease setting, deep learning models demonstrated impressive performance in detecting and grading prostate cancer using imaging and pathology data. For biochemically recurrent disease, machine learning approaches are being tested for improved risk stratification and treatment decisions. In advanced prostate cancer, deep learning can potentially improve prognostication and assist in clinical decision making. Furthermore, LLMs are poised to revolutionize information summarization and extraction, clinical trial design and operations, drug development, evidence synthesis, and clinical practice guidelines. Synergistic multimodal data integration and human-AI collaboration are emerging as key strategies to unlock the full potential of AI in prostate cancer care.

  • Research Article
  • Cited by: 4
  • 10.3352/jeehp.2025.22.16
Performance of large language models on Thailand’s national medical licensing examination: a cross-sectional study.
  • May 12, 2025
  • Journal of educational evaluation for health professions
  • Prut Saowaprut + 3 more

This study aimed to evaluate the feasibility of general-purpose large language models (LLMs) in addressing inequities in medical licensure exam preparation for Thailand’s National Medical Licensing Examination (ThaiNLE), which currently lacks standardized public study materials. We assessed 4 multi-modal LLMs (GPT-4, Claude 3 Opus, Gemini 1.0/1.5 Pro) using a 304-question ThaiNLE Step 1 mock examination (10.2% image-based), applying deterministic API configurations and 5 inference repetitions per model. Performance was measured via micro- and macro-accuracy metrics compared against historical passing thresholds. All models exceeded passing scores, with GPT-4 achieving the highest accuracy (88.9%; 95% confidence interval, 88.7–89.1), surpassing Thailand’s national average by more than 2 standard deviations. Claude 3.5 Sonnet (80.1%) and Gemini 1.5 Pro (72.8%) followed hierarchically. Models demonstrated robustness across 17 of 20 medical domains, but variability was noted in genetics (74.0%) and cardiovascular topics (58.3%). While models demonstrated proficiency with images (Gemini 1.0 Pro: +9.9% vs. text), text-only accuracy remained superior (GPT-4o: 90.0% vs. 82.6%). General-purpose LLMs show promise as equitable preparatory tools for ThaiNLE Step 1. However, domain-specific knowledge gaps and inconsistent multi-modal integration warrant refinement before clinical deployment.
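The abstract distinguishes micro- and macro-accuracy. The sketch below is a hedged illustration of that distinction: micro-accuracy pools all questions, while macro-accuracy averages per-domain accuracy so small domains weigh equally. The domain counts are hypothetical, not the ThaiNLE data.

```python
# Hedged sketch of micro- vs. macro-accuracy over question domains.
domains = {
    # domain: (correct, total) -- hypothetical counts
    "cardiovascular": (7, 12),
    "genetics":       (37, 50),
    "pharmacology":   (88, 100),
}

total_correct = sum(c for c, _ in domains.values())
total_items   = sum(n for _, n in domains.values())
micro = total_correct / total_items                            # pooled over all items
macro = sum(c / n for c, n in domains.values()) / len(domains)  # mean of per-domain accuracy

print(f"micro-accuracy: {micro:.1%}")
print(f"macro-accuracy: {macro:.1%}")
```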

  • Research Article
  • 10.1200/jco.2025.43.5_suppl.428
Domain-specific large language model for predicting prostate cancer treatment plan.
  • Feb 10, 2025
  • Journal of Clinical Oncology
  • Umar Ghaffar + 8 more

428 Background: Prostate cancer management presents a significant healthcare burden, with the need to efficiently triage patients for treatment. Our objective is to leverage large language models to predict physician-recommended treatment plans from unstructured clinical notes. By accurately predicting treatment plans, we aim to risk stratify and triage patients effectively, thereby optimizing the allocation of physician resources. Methods: 448 unstructured initial urology consultation patient notes following first positive prostate cancer biopsy were identified. The recommended and final treatments received were manually annotated to establish ground truth labels (Table 1). The dataset was split 80:20 for training and testing, preprocessed to remove plan sections and formatted into question-answer (QA) format. A domain-specific large language model (LLM) inspired by GPT and a specialized tokenizer (PCa-LLM) for prostate cancer terminology were developed. QA models were built using the PCa-LLM and compared with those using GPT-2 as the backbone to predict recommended and final treatments. Results: For the physician-recommended treatment plans, our LLM (PCa-LLM) showed superior performance with higher AUROC scores for curative vs. non-curative treatments (0.78 vs. 0.65), chemo-hormonal vs. other non-curative treatments (0.89 vs. 0.65), and surveillance vs. all other treatments (0.72 vs. 0.70), while both models achieved the same high AUROC of 0.99 for chemo-hormonal vs. all other treatments. For final treatments, PCa-LLM demonstrated better AUROC for curative vs. non-curative treatments (0.77 vs. 0.74) and chemo-hormonal vs. other non-curative treatments (0.71 vs. 0.66), while GPT-2 outperformed PCa-LLM for surveillance vs. all other treatments (0.78 vs. 0.70). Both models achieved an AUROC of 0.99 for chemo-hormonal vs. all other treatments. Conclusions: PCa-LLM accurately predicted most treatment categories better than GPT-2, with higher AUROC scores, and can be utilized to triage prostate cancer patients using initial consultation notes.

Table 1. Ground-truth treatment labels and model predictions.

Treatment category | Physician-recommended plan (n) | Final treatment received (n)
Curative: prostatectomy/radiation | 228 | 230
Non-curative: focal therapy | 22 | 15
Non-curative: active surveillance | 40 | 45
Non-curative: chemo-hormonal | 30 | 30

Model predictions (AUROC) | Recommended plan: GPT-2 | Recommended plan: PCa-LLM | Final treatment: GPT-2 | Final treatment: PCa-LLM
Curative vs. non-curative | 0.65 | 0.78 | 0.74 | 0.77
Chemo-hormonal vs. other non-curative | 0.65 | 0.89 | 0.66 | 0.71
Chemo-hormonal vs. all other | 0.99 | 0.99 | 0.99 | 0.99
Surveillance vs. other non-curative | 0.64 | 0.60 | 0.67 | 0.59
Surveillance vs. all other | 0.70 | 0.72 | 0.78 | 0.70
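The AUROC figures above come from binary comparisons such as curative vs. non-curative. The sketch below is a hedged illustration of how one such AUROC could be computed; the labels and predicted probabilities are hypothetical, not outputs from PCa-LLM or GPT-2.

```python
# Hedged illustration of AUROC for one binary task (curative vs. non-curative).
from sklearn.metrics import roc_auc_score

# 1 = curative treatment recommended, 0 = non-curative (hypothetical labels)
y_true  = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
# model's predicted probability of the curative class (hypothetical scores)
y_score = [0.91, 0.72, 0.33, 0.65, 0.41, 0.18, 0.84, 0.55, 0.77, 0.60]

print(f"AUROC (curative vs. non-curative): {roc_auc_score(y_true, y_score):.2f}")
```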

  • Research Article
  • Cited by: 2
  • 10.1038/s41598-025-22290-x
Judgments of learning distinguish humans from large language models in predicting memory
  • Oct 7, 2025
  • Scientific Reports
  • Markus Huff + 1 more

Large language models (LLMs) increasingly mimic human cognition in various language-based tasks. However, their capacity for metacognition—particularly in predicting memory performance—remains unexplored. Here, we introduce a cross-agent prediction model to assess whether ChatGPT-based LLMs align with human judgments of learning (JOL), a metacognitive measure where individuals predict their own future memory performance. We tested humans and LLMs on pairs of sentences, one of which was a garden-path sentence—a sentence that initially misleads the reader toward an incorrect interpretation before requiring reanalysis. By manipulating contextual fit (fitting vs. unfitting sentences), we probed how intrinsic cues (i.e., relatedness) affect both LLM and human JOL. Our results revealed that while human JOL reliably predicted actual memory performance, none of the tested LLMs (GPT-3.5-turbo, GPT-4-turbo, and GPT-4o) demonstrated comparable predictive accuracy. This discrepancy emerged regardless of whether sentences appeared in fitting or unfitting contexts. These findings indicate that, despite LLMs’ demonstrated capacity to model human cognition at the object-level, they struggle at the meta-level, failing to capture the variability in individual memory predictions. By identifying this shortcoming, our study underscores the need for further refinements in LLMs’ self-monitoring abilities, which could enhance their utility in educational settings, personalized learning, and human–AI interactions. Strengthening LLMs’ metacognitive performance may reduce the reliance on human oversight, paving the way for more autonomous and seamless integration of AI into tasks requiring deeper cognitive awareness.

  • Research Article
  • Cited by: 2
  • 10.1007/s00066-024-02342-3
Patient- and clinician-based evaluation of large language models for patient education in prostate cancer radiotherapy
  • Jan 10, 2025
  • Strahlentherapie und Onkologie
  • Christian Trapp + 12 more

Background: This study aims to evaluate the capabilities and limitations of large language models (LLMs) for providing patient education for men undergoing radiotherapy for localized prostate cancer, incorporating assessments from both clinicians and patients. Methods: Six questions about definitive radiotherapy for prostate cancer were designed based on common patient inquiries. These questions were presented to different LLMs [ChatGPT‑4, ChatGPT-4o (both OpenAI Inc., San Francisco, CA, USA), Gemini (Google LLC, Mountain View, CA, USA), Copilot (Microsoft Corp., Redmond, WA, USA), and Claude (Anthropic PBC, San Francisco, CA, USA)] via the respective web interfaces. Responses were evaluated for readability using the Flesch Reading Ease Index. Five radiation oncologists assessed the responses for relevance, correctness, and completeness using a five-point Likert scale. Additionally, 35 prostate cancer patients evaluated the responses from ChatGPT‑4 for comprehensibility, accuracy, relevance, trustworthiness, and overall informativeness. Results: The Flesch Reading Ease Index indicated that the responses from all LLMs were relatively difficult to understand. All LLMs provided answers that clinicians found to be generally relevant and correct. The answers from ChatGPT‑4, ChatGPT-4o, and Claude AI were also found to be complete. However, we found significant differences between the performance of different LLMs regarding relevance and completeness. Some answers lacked detail or contained inaccuracies. Patients perceived the information as easy to understand and relevant, with most expressing confidence in the information and a willingness to use ChatGPT‑4 for future medical questions. ChatGPT-4's responses helped patients feel better informed, despite the initially standardized information provided. Conclusion: Overall, LLMs show promise as a tool for patient education in prostate cancer radiotherapy. While improvements are needed in terms of accuracy and readability, positive feedback from clinicians and patients suggests that LLMs can enhance patient understanding and engagement. Further research is essential to fully realize the potential of artificial intelligence in patient education.

  • Preprint Article
  • 10.2196/preprints.71916
Implementing Large Language Models in Health Care: Clinician-Focused Review With Interactive Guideline (Preprint)
  • Jan 29, 2025
  • Hongyi Li + 2 more

BACKGROUND: Large language models (LLMs) can generate outputs understandable by humans, such as answers to medical questions and radiology reports. With the rapid development of LLMs, clinicians face a growing challenge in determining the most suitable algorithms to support their work. OBJECTIVE: We aimed to provide clinicians and other health care practitioners with systematic guidance in selecting an LLM that is relevant and appropriate to their needs and facilitate the integration process of LLMs in health care. METHODS: We conducted a literature search of full-text publications in English on clinical applications of LLMs published between January 1, 2022, and March 31, 2025, on PubMed, ScienceDirect, Scopus, and IEEE Xplore. We excluded papers from journals below a set citation threshold, as well as papers that did not focus on LLMs, were not research based, or did not involve clinical applications. We also conducted a literature search on arXiv within the same investigated period and included papers on the clinical applications of innovative multimodal LLMs. This led to a total of 270 studies. RESULTS: We collected 330 LLMs and recorded their application frequency in clinical tasks and frequency of best performance in their context. On the basis of a 5-stage clinical workflow, we found that stages 2, 3, and 4 are key stages in the clinical workflow, involving numerous clinical subtasks and LLMs. However, the diversity of LLMs that may perform optimally in each context remains limited. GPT-3.5 and GPT-4 were the most versatile models in the 5-stage clinical workflow, applied to 52% (29/56) and 71% (40/56) of the clinical subtasks, respectively, and they performed best in 29% (16/56) and 54% (30/56) of the clinical subtasks, respectively. General-purpose LLMs may not perform well in specialized areas, as they often require lightweight prompt engineering methods or fine-tuning techniques based on specific datasets to improve model performance. Most LLMs with multimodal abilities are closed-source models and therefore lack transparency, model customization, and fine-tuning for specific clinical tasks, and may also pose challenges regarding data protection and privacy, which are common requirements in clinical settings. CONCLUSIONS: In this review, we found that LLMs may help clinicians in a variety of clinical tasks. However, we did not find evidence of generalist clinical LLMs successfully applicable to a wide range of clinical tasks. Therefore, their clinical deployment remains challenging. On the basis of this review, we propose an interactive online guideline for clinicians to select suitable LLMs by clinical task. With a clinical perspective and free of unnecessary technical jargon, this guideline may be used as a reference to successfully apply LLMs in clinical settings.

  • Research Article
  • Cited by: 2
  • 10.3390/ai6010012
Beyond Text Generation: Assessing Large Language Models’ Ability to Reason Logically and Follow Strict Rules
  • Jan 15, 2025
  • AI
  • Zhiyong Han + 4 more

The growing interest in advanced large language models (LLMs) like ChatGPT has sparked debate about how best to use them in various human activities. However, a neglected issue in the debate concerning the applications of LLMs is whether they can reason logically and follow rules in novel contexts, which are critical for our understanding and applications of LLMs. To address this knowledge gap, this study investigates five LLMs (ChatGPT-4o, Claude, Gemini, Meta AI, and Mistral) using word ladder puzzles to assess their logical reasoning and rule-adherence capabilities. Our two-phase methodology involves (1) providing explicit instructions about word ladder puzzles and the rules for solving them and then evaluating rule understanding, followed by (2) assessing the LLMs' ability to create and solve word ladder puzzles while adhering to those rules. Additionally, we test their ability to implicitly recognize and avoid HIPAA privacy rule violations as an example of a real-world scenario. Our findings reveal that LLMs show a persistent lack of logical reasoning and systematically fail to follow puzzle rules. Furthermore, all LLMs except Claude prioritized task completion (text writing) over ethical considerations in the HIPAA test. Our findings expose critical flaws in LLMs' reasoning and rule-following capabilities, raising concerns about their reliability in critical tasks requiring strict rule-following and logical reasoning. Therefore, we urge caution when integrating LLMs into critical fields and highlight the need for further research into their capabilities and limitations to ensure responsible AI development.
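The word ladder rules the abstract refers to are mechanically checkable: each step changes exactly one letter, word length stays constant, and every word must be valid. The sketch below is a hedged, toy validity checker along those lines; the ladder and the tiny word list are illustrative, not the study's puzzles.

```python
# Minimal sketch of a word ladder rule check: each step changes exactly one
# letter, keeps the length constant, and every word must be in the dictionary.
def one_letter_apart(a: str, b: str) -> bool:
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

def valid_ladder(ladder, dictionary) -> bool:
    return all(w in dictionary for w in ladder) and all(
        one_letter_apart(a, b) for a, b in zip(ladder, ladder[1:])
    )

words = {"cold", "cord", "card", "ward", "warm", "worm"}
print(valid_ladder(["cold", "cord", "card", "ward", "warm"], words))  # True: every step is legal
print(valid_ladder(["cold", "warm"], words))                          # False: rule violated
```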
