DeepSeek vs ChatGPT: a comparison study of their performance in answering prostate cancer radiotherapy questions in multiple languages.

Abstract

The medical information generated by large language models (LLMs) is crucial for improving patient education and clinical decision-making. This study evaluated the performance of two LLMs, DeepSeek and ChatGPT, in answering questions related to prostate cancer radiotherapy in both Chinese and English, with the aim of determining which model provides higher-quality answers in each language environment. A structured evaluation framework was developed using a set of clinically relevant questions covering three key domains: foundational knowledge, patient education, and treatment and follow-up care. Responses from DeepSeek and ChatGPT were generated in both English and Chinese and independently assessed by a panel of five oncology specialists using a five-point Likert scale. Statistical analyses, including the Wilcoxon signed-rank test, were performed to compare the models' performance across the two linguistic contexts. The study ultimately included 33 questions for scoring. In Chinese, DeepSeek outperformed ChatGPT, achieving top ratings (score = 5) in 75.76% vs. 36.36% of responses (P < 0.001), and excelled in foundational knowledge (76.92% vs. 38.46%, P = 0.047) and treatment/follow-up (81.82% vs. 36.36%, P = 0.031). In English, ChatGPT showed comparable performance (66.7% vs. 54.55% top-rated responses, P = 0.236), with a marginal advantage in treatment/follow-up (63.64% vs. 54.55%, P = 0.563), while DeepSeek maintained strengths in foundational knowledge (69.23% vs. 30.77%, P = 0.047) and patient education (88.89% vs. 55.56%, P = 0.125). These findings underscore DeepSeek's superior Chinese-language proficiency and the impact of language-specific optimization: DeepSeek excels at providing medical information in Chinese, while the two models perform similarly in English.
These findings underscore the importance of selecting language-specific artificial intelligence (AI) models to enhance the accuracy and reliability of medical AI applications. While both models show promise in supporting patient education and clinical decision-making, human expert review remains necessary to ensure response accuracy and minimize potential misinformation.
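The study's core comparison rests on a Wilcoxon signed-rank test over paired Likert ratings. As a minimal, pure-Python sketch of how such a paired comparison works (the ratings below are hypothetical illustrations, not the study's data; a normal approximation stands in for the exact null distribution):

```python
import math

def wilcoxon_signed_rank(x, y):
    """Paired Wilcoxon signed-rank test with a normal approximation.

    Returns (W, p): W is the smaller of the positive/negative rank
    sums; p is the two-sided p-value. Zero differences are dropped
    and tied absolute differences receive average ranks.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:  # assign average ranks to runs of tied |differences|
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    p_one_sided = 0.5 * (1 + math.erf((w - mean) / (sd * math.sqrt(2))))
    return w, min(1.0, 2 * p_one_sided)

# Hypothetical per-question median ratings (1-5) for ten questions.
deepseek = [5, 5, 5, 4, 5, 5, 4, 5, 5, 4]
chatgpt  = [4, 5, 3, 3, 4, 4, 4, 3, 5, 3]
w, p = wilcoxon_signed_rank(deepseek, chatgpt)
print(f"W = {w}, two-sided p = {p:.3f}")
```

The normal approximation is rough for small samples; standard implementations such as `scipy.stats.wilcoxon` also offer exact p-values, which a comparison over 33 questions would typically use.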

Similar Papers
  • Research Article
  • Cited by: 8
  • 10.1287/ijds.2023.0007
How Can IJDS Authors, Reviewers, and Editors Use (and Misuse) Generative AI?
  • Apr 1, 2023
  • INFORMS Journal on Data Science
  • Galit Shmueli + 7 more

  • Research Article
  • Cited by: 131
  • 10.1002/ctm2.1216
The potential impact of ChatGPT in clinical and translational medicine.
  • Mar 1, 2023
  • Clinical and Translational Medicine
  • Vivian Weiwen Xue + 2 more

  • Research Article
  • 10.1182/blood-2025-6214
Evaluating artificial intelligence (AI) as a clinical decision support tool for AML patients
  • Nov 3, 2025
  • Blood
  • Ankushi Sanghvi + 5 more

  • Research Article
  • 10.1016/j.ajog.2025.06.047
Can artificial intelligence improve the readability of patient education information in gynecology?
  • Jun 1, 2025
  • American journal of obstetrics and gynecology
  • Naveena R Daram + 3 more

  • Research Article
  • Cited by: 1
  • 10.1007/s00066-024-02342-3
Patient- and clinician-based evaluation of large language models for patient education in prostate cancer radiotherapy
  • Jan 10, 2025
  • Strahlentherapie und Onkologie
  • Christian Trapp + 12 more

Background: This study aims to evaluate the capabilities and limitations of large language models (LLMs) for providing patient education for men undergoing radiotherapy for localized prostate cancer, incorporating assessments from both clinicians and patients. Methods: Six questions about definitive radiotherapy for prostate cancer were designed based on common patient inquiries. These questions were presented to different LLMs [ChatGPT‑4, ChatGPT-4o (both OpenAI Inc., San Francisco, CA, USA), Gemini (Google LLC, Mountain View, CA, USA), Copilot (Microsoft Corp., Redmond, WA, USA), and Claude (Anthropic PBC, San Francisco, CA, USA)] via the respective web interfaces. Responses were evaluated for readability using the Flesch Reading Ease Index. Five radiation oncologists assessed the responses for relevance, correctness, and completeness using a five-point Likert scale. Additionally, 35 prostate cancer patients evaluated the responses from ChatGPT‑4 for comprehensibility, accuracy, relevance, trustworthiness, and overall informativeness. Results: The Flesch Reading Ease Index indicated that the responses from all LLMs were relatively difficult to understand. All LLMs provided answers that clinicians found to be generally relevant and correct. The answers from ChatGPT‑4, ChatGPT-4o, and Claude AI were also found to be complete. However, we found significant differences between the performance of different LLMs regarding relevance and completeness. Some answers lacked detail or contained inaccuracies. Patients perceived the information as easy to understand and relevant, with most expressing confidence in the information and a willingness to use ChatGPT‑4 for future medical questions. ChatGPT-4's responses helped patients feel better informed, despite the initially standardized information provided. Conclusion: Overall, LLMs show promise as a tool for patient education in prostate cancer radiotherapy. While improvements are needed in terms of accuracy and readability, positive feedback from clinicians and patients suggests that LLMs can enhance patient understanding and engagement. Further research is essential to fully realize the potential of artificial intelligence in patient education.

  • Research Article
  • 10.1016/j.jtha.2025.09.004
Large language models vs thrombosis experts: a comparative study on patient education and clinical decision-making in venous thromboembolism.
  • Sep 1, 2025
  • Journal of thrombosis and haemostasis : JTH
  • Nikola Vladic + 7 more

  • Research Article
  • 10.1016/j.jposna.2025.100196
Artificial Intelligence-Based Large Language Models Can Facilitate Patient Education.
  • Aug 1, 2025
  • Journal of the Pediatric Orthopaedic Society of North America
  • Xochitl Bryson + 18 more

  • Research Article
  • Cited by: 36
  • 10.3389/fmed.2024.1477898
Large language models in patient education: a scoping review of applications in medicine.
  • Oct 29, 2024
  • Frontiers in medicine
  • Serhat Aydin + 3 more

Large Language Models (LLMs) are sophisticated algorithms that analyze and generate vast amounts of textual data, mimicking human communication. Notable LLMs include GPT-4o by OpenAI, Claude 3.5 Sonnet by Anthropic, and Gemini by Google. This scoping review aims to synthesize the current applications and potential uses of LLMs in patient education and engagement. Following the PRISMA-ScR checklist and methodologies by Arksey, O'Malley, and Levac, we conducted a scoping review. We searched PubMed in June 2024, using keywords and MeSH terms related to LLMs and patient education. Two authors conducted the initial screening, and discrepancies were resolved by consensus. We employed thematic analysis to address our primary research question. The review identified 201 studies, predominantly from the United States (58.2%). Six themes emerged: generating patient education materials, interpreting medical information, providing lifestyle recommendations, supporting customized medication use, offering perioperative care instructions, and optimizing doctor-patient interaction. LLMs were found to provide accurate responses to patient queries, enhance existing educational materials, and translate medical information into patient-friendly language. However, challenges such as readability, accuracy, and potential biases were noted. LLMs demonstrate significant potential in patient education and engagement by creating accessible educational materials, interpreting complex medical information, and enhancing communication between patients and healthcare providers. Nonetheless, issues related to the accuracy and readability of LLM-generated content, as well as ethical concerns, require further research and development. Future studies should focus on improving LLMs and ensuring content reliability while addressing ethical considerations.

  • Research Article
  • 10.1053/j.seminhematol.2025.06.002
AML diagnostics in the 21st century: Use of AI.
  • Jun 1, 2025
  • Seminars in hematology
  • Torsten Haferlach + 6 more

  • Research Article
  • Cited by: 3
  • 10.2196/48328
Large Language Models-Supported Thrombectomy Decision-Making in Acute Ischemic Stroke Based on Radiology Reports: Feasibility Qualitative Study.
  • Feb 13, 2025
  • Journal of medical Internet research
  • Jonathan Kottlors + 14 more

The latest advancement of artificial intelligence (AI) is generative pretrained transformer large language models (LLMs). They have been trained on massive amounts of text, enabling humanlike and semantical responses to text-based inputs and requests. Foreshadowing numerous possible applications in various fields, the potential of such tools for medical data integration and clinical decision-making is not yet clear. In this study, we investigate the potential of LLMs in report-based medical decision-making on the example of acute ischemic stroke (AIS), where clinical and image-based information may indicate an immediate need for mechanical thrombectomy (MT). The purpose was to elucidate the feasibility of integrating radiology report data and other clinical information in the context of therapy decision-making using LLMs. A hundred patients with AIS were retrospectively included, for which 50% (50/100) was indicated for MT, whereas the other 50% (50/100) was not. The LLM was provided with the computed tomography report, information on neurological symptoms and onset, and patients' age. The performance of the AI decision-making model was compared with an expert consensus regarding the binary determination of MT indication, for which sensitivity, specificity, and accuracy were calculated. The AI model had an overall accuracy of 88%, with a specificity of 96% and a sensitivity of 80%. The area under the curve for the report-based MT decision was 0.92. The LLM achieved promising accuracy in determining the eligibility of patients with AIS for MT based on radiology reports and clinical information. Our results underscore the potential of LLMs for radiological and medical data integration. This investigation should serve as a stimulus for further clinical applications of LLMs, in which this AI should be used as an augmented supporting system for human decision-making.
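The reported figures can be reproduced from a 2x2 confusion matrix. The counts below are inferred from the abstract's percentages (100 patients, 50 MT-indicated, sensitivity 80%, specificity 96%) and are illustrative rather than the study's raw data:

```python
# Counts inferred from the abstract: 50 MT-indicated patients with
# sensitivity 0.80 -> TP = 40, FN = 10; 50 not indicated with
# specificity 0.96 -> TN = 48, FP = 2. Illustrative, not raw data.
tp, fn = 40, 10  # MT indicated: correctly / incorrectly classified
tn, fp = 48, 2   # MT not indicated: correctly / incorrectly classified

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + fn + tn + fp)

print(f"sensitivity = {sensitivity:.2f}")  # 0.80
print(f"specificity = {specificity:.2f}")  # 0.96
print(f"accuracy    = {accuracy:.2f}")     # 0.88
```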

  • Research Article
  • 10.1016/j.compbiolchem.2025.108611
Generative artificial intelligence and large language models in smart healthcare applications: Current status and future perspectives.
  • Feb 1, 2026
  • Computational biology and chemistry
  • Md Asraful Haque + 1 more

  • Research Article
  • Cited by: 123
  • 10.1097/corr.0000000000002704
Can Artificial Intelligence Pass the American Board of Orthopaedic Surgery Examination? Orthopaedic Residents Versus ChatGPT.
  • May 23, 2023
  • Clinical orthopaedics and related research
  • Zachary C Lum

Advances in neural networks, deep learning, and artificial intelligence (AI) have progressed recently. Previous deep learning AI has been structured around domain-specific areas that are trained on dataset-specific areas of interest that yield high accuracy and precision. A new AI model using large language models (LLM) and nonspecific domain areas, ChatGPT (OpenAI), has gained attention. Although AI has demonstrated proficiency in managing vast amounts of data, implementation of that knowledge remains a challenge. (1) What percentage of Orthopaedic In-Training Examination questions can a generative, pretrained transformer chatbot (ChatGPT) answer correctly? (2) How does that percentage compare with results achieved by orthopaedic residents of different levels, and if scoring lower than the 10th percentile relative to 5th-year residents is likely to correspond to a failing American Board of Orthopaedic Surgery score, is this LLM likely to pass the orthopaedic surgery written boards? (3) Does increasing question taxonomy affect the LLM's ability to select the correct answer choices? This study randomly selected 400 of 3840 publicly available questions based on the Orthopaedic In-Training Examination and compared the mean score with that of residents who took the test over a 5-year period. Questions with figures, diagrams, or charts were excluded, including five questions the LLM could not provide an answer for, resulting in 207 questions administered with raw score recorded. The LLM's answer results were compared with the Orthopaedic In-Training Examination ranking of orthopaedic surgery residents. Based on the findings of an earlier study, a pass-fail cutoff was set at the 10th percentile. 
Questions answered were then categorized based on the Buckwalter taxonomy of recall, which deals with increasingly complex levels of interpretation and application of knowledge; comparison was made of the LLM's performance across taxonomic levels and was analyzed using a chi-square test. ChatGPT selected the correct answer 47% (97 of 207) of the time, and 53% (110 of 207) of the time it answered incorrectly. Based on prior Orthopaedic In-Training Examination testing, the LLM scored in the 40th percentile for postgraduate year (PGY) 1s, the eighth percentile for PGY2s, and the first percentile for PGY3s, PGY4s, and PGY5s; based on the latter finding (and using a predefined cutoff of the 10th percentile of PGY5s as the threshold for a passing score), it seems unlikely that the LLM would pass the written board examination. The LLM's performance decreased as question taxonomy level increased (it answered 54% [54 of 101] of Tax 1 questions correctly, 51% [18 of 35] of Tax 2 questions correctly, and 34% [24 of 71] of Tax 3 questions correctly; p = 0.034). Although this general-domain LLM has a low likelihood of passing the orthopaedic surgery board examination, testing performance and knowledge are comparable to that of a first-year orthopaedic surgery resident. The LLM's ability to provide accurate answers declines with increasing question taxonomy and complexity, indicating a deficiency in implementing knowledge. Current AI appears to perform better at knowledge and interpretation-based inquiries, and based on this study and other areas of opportunity, it may become an additional tool for orthopaedic learning and education.
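The taxonomy-level comparison reported here (54/101, 18/35, and 24/71 correct; p = 0.034) is a chi-square test of independence on a 2x3 correct/incorrect table. A self-contained sketch using the textbook formula, where 5.991 is the chi-square critical value at alpha = 0.05 with 2 degrees of freedom:

```python
# Correct/incorrect counts per taxonomy level, as quoted in the abstract.
correct = [54, 18, 24]
totals = [101, 35, 71]
incorrect = [t - c for t, c in zip(totals, correct)]

grand = sum(totals)
c_sum, i_sum = sum(correct), sum(incorrect)

# Chi-square statistic: sum of (observed - expected)^2 / expected
# over all six cells, with expected counts from the row/column margins.
chi2 = 0.0
for c, i, t in zip(correct, incorrect, totals):
    exp_c = t * c_sum / grand
    exp_i = t * i_sum / grand
    chi2 += (c - exp_c) ** 2 / exp_c + (i - exp_i) ** 2 / exp_i

print(f"chi2 = {chi2:.2f}, df = 2")   # chi2 = 6.91
print("reject H0 at alpha = 0.05:", chi2 > 5.991)
```

The resulting statistic of about 6.91 with 2 degrees of freedom corresponds to a p-value near the 0.034 reported in the abstract.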

  • Research Article
  • 10.3389/fmed.2025.1670824
Clinical applications of large language models in knee osteoarthritis: a systematic review
  • Nov 19, 2025
  • Frontiers in Medicine
  • Zebing Ma + 6 more

Background and aims: Knee osteoarthritis (KOA) is a common chronic degenerative disease that significantly impacts patients’ quality of life. With the rapid advancement of artificial intelligence, large language models (LLMs) have demonstrated potential in supporting medical information extraction, clinical decision-making, and patient education through their natural language processing capabilities. However, the current landscape of LLM applications in the KOA domain, along with their methodological quality, has yet to be systematically reviewed. Therefore, this systematic review aims to comprehensively summarize existing clinical studies on LLMs in KOA, evaluate their performance and methodological rigor, and identify current challenges and future research directions. Methods: Following the PRISMA guidelines, a systematic search was conducted in the PubMed, Cochrane Library, Embase, and Web of Science databases for literature published up to June 2025. The protocol was preregistered on the OSF platform. Studies were screened using standardized inclusion and exclusion criteria. Key study characteristics and performance evaluation metrics were extracted. Methodological quality was assessed using tools such as Cochrane RoB, STROBE, STARD, and DISCERN. Additionally, the CLEAR-LLM and CliMA-10 frameworks were applied to provide complementary evaluations of quality and performance. Results: A total of 16 studies were included, covering various LLMs such as ChatGPT, Gemini, and Claude. Application scenarios encompassed text generation, imaging diagnostics, and patient education. Most studies were observational in nature, and overall methodological quality ranged from moderate to high. Based on CliMA-10 scores, LLMs exhibited upper-moderate performance in KOA-related tasks. The ChatGPT-4 series consistently outperformed other models, especially in structured output generation, interpretation of clinical terminology, and content accuracy. Key limitations included insufficient sample representativeness, inconsistent control over hallucinated content, and the lack of standardized evaluation tools. Conclusion: Large language models show notable potential in the KOA field, but their clinical application is still exploratory and limited by issues such as sample bias and methodological heterogeneity. Model performance varies across tasks, underscoring the need for improved prompt design and standardized evaluation frameworks. With real-world data and ethical oversight, LLMs may contribute more significantly to personalized KOA management. Systematic review registration: https://osf.io/jy4kz, identifier 10.17605/OSF.IO/479R8.

  • Research Article
Comparative Analysis of LLMs' Performance On a Practice Radiography Certification Exam.
  • Feb 1, 2025
  • Radiologic technology
  • Kevin R Clark

To compare the performance of multiple large language models (LLMs) on a practice radiography certification exam. Using an exploratory, nonexperimental approach, 200 multiple-choice question stems and options (correct answers and distractors) from a practice radiography certification exam were entered into 5 LLMs: ChatGPT (OpenAI), Claude (Anthropic), Copilot (Microsoft), Gemini (Google), and Perplexity (Perplexity AI). Responses were recorded as correct or incorrect, and overall accuracy rates were calculated for each LLM. McNemar tests determined if there were significant differences between accuracy rates. Performance also was evaluated and aggregated by content categories and subcategories. ChatGPT had the highest overall accuracy of 83.5%, followed by Perplexity (78.9%), Copilot (78.0%), Gemini (75.0%), and Claude (71.0%). ChatGPT had a significantly higher accuracy rate than did Claude (P < .001) and Gemini (P = .02). Regarding content categories, ChatGPT was the only LLM to correctly answer all 38 patient care questions. In addition, ChatGPT had the highest number of correct responses in the areas of safety (38/48, 79.2%) and procedures (50/59, 84.7%). Copilot had the highest number of correct responses in the area of image production (43/55, 78.2%). ChatGPT also achieved superior accuracy in 4 of the 8 subcategories. Findings from this study provide valuable insights into the performance of multiple LLMs in answering practice radiography certification exam questions. Although ChatGPT emerged as the most accurate LLM for this practice exam, caution should be exercised when using generative artificial intelligence (AI) models. Because LLMs can generate false and incorrect information, responses must be checked for accuracy, and the models should be corrected when inaccurate responses are given. Among the 5 LLMs compared in this study, ChatGPT was the most accurate model.
As interest in generative AI continues to increase and new language applications become readily available, users should understand the limitations of LLMs and check responses for accuracy. Future research could include additional practice exams in other primary pathways, including magnetic resonance imaging, nuclear medicine technology, radiation therapy, and sonography.

  • Research Article
  • Cited by: 4
  • 10.1055/s-0044-1782663
Exploring the Potentials of Large Language Models in Vascular and Interventional Radiology: Opportunities and Challenges
  • Apr 19, 2024
  • The Arab Journal of Interventional Radiology
  • Richard Olatunji + 3 more

The increasing integration of artificial intelligence (AI) in healthcare, particularly in vascular and interventional radiology (VIR), has opened avenues for enhanced efficiency and precision. This narrative review delves into the potential applications of large language models (LLMs) in VIR, with a focus on Chat Generative Pre-Trained Transformer (ChatGPT) and similar models. LLMs, designed for natural language processing, exhibit promising capabilities in clinical decision-making, workflow optimization, education, and patient-centered care. The discussion highlights LLMs' ability to analyze extensive medical literature, aiding radiologists in making informed decisions. Moreover, their role in improving clinical workflow, automating report generation, and intelligent patient scheduling is explored. This article also examines LLMs' impact on VIR education, presenting them as valuable tools for trainees. Additionally, the integration of LLMs into patient education processes is examined, highlighting their potential to enhance patient-centered care through simplified and accurate medical information dissemination. Despite these potentials, this paper discusses challenges and ethical considerations, including AI over-reliance, potential misinformation, and biases. The scarcity of comprehensive VIR datasets and the need for ongoing monitoring and interdisciplinary collaboration are also emphasized. Advocating for a balanced approach, the combination of LLMs with computer vision AI models addresses the inherently visual nature of VIR. Overall, while the widespread implementation of LLMs in VIR may be premature, their potential to improve various aspects of the discipline is undeniable. Recognizing challenges and ethical considerations, fostering collaboration, and adhering to ethical standards are essential for unlocking the full potential of LLMs in VIR, ushering in a new era of healthcare delivery and innovation.
